ORIGINAL PAPER

Balanced corpus of contemporary written Japanese

Kikuo Maekawa · Makoto Yamazaki · Toshinobu Ogiso · Takehiko Maruyama · Hideki Ogura · Wakako Kashino · Hanae Koiso · Masaya Yamaguchi · Makiro Tanaka · Yasuharu Den

Published online: 29 December 2013
The Author(s) 2013. This article is published with open access at Springerlink.com

Abstract The Balanced Corpus of Contemporary Written Japanese (BCCWJ) is Japan's first 100-million-word balanced corpus. It consists of three subcorpora (publication subcorpus, library subcorpus, and special-purpose subcorpus) and covers a wide range of text registers including books in general, magazines, newspapers, governmental white papers, best-selling books, an internet bulletin board, a blog, school textbooks, minutes of the National Diet, publicity newsletters of local governments, laws, and poetry verses. Random sampling is used wherever possible in order to maximize the representativeness of the corpus. The corpus is annotated in terms of dual POS analysis, document structure, and bibliographical information. The BCCWJ is currently accessible in three different ways, including Chunagon, a web-based interface to the dual POS analysis data. Lastly, results of a pilot evaluation of the corpus with respect to textual diversity are reported. The analyses include POS distribution, word-class distribution, entropy of orthography, sentence length, and variation of the adjective predicate. High textual diversity is observed in all these analyses.

K. Maekawa (✉) · M. Yamazaki · T. Ogiso · T. Maruyama · W. Kashino · M. Yamaguchi · M. Tanaka
Department of Corpus Studies, National Institute for Japanese Language and Linguistics (NINJAL), Tokyo, Japan
e-mail: kikuo@ninjal.ac.jp

H. Ogura
College of Letters, Ritsumeikan University, Kusatsu, Japan

H. Koiso
Department of Linguistic Theory and Structure, NINJAL, Tokyo, Japan

Y. Den
Faculty of Letters, Chiba University, Chiba, Japan

Lang Resources & Evaluation (2014) 48:345–371
DOI 10.1007/s10579-013-9261-0

Keywords BCCWJ · Japanese · Balanced corpus · Design · Annotation · Dual POS analysis · Evaluation · Shonagon · Chunagon

1 Introduction

One serious problem in corpus-based analyses of present-day Japanese is the lack of a balanced corpus.
To solve this problem, the authors have recently released a 100-million-word balanced corpus of contemporary written Japanese. The aim of this paper is twofold: to describe the basic properties of the corpus and to evaluate the corpus from the point of view of the diversity of its texts. This introductory section discusses why a balanced corpus is needed in the study of Japanese.

Most corpus-based analyses of the Japanese language published in the last two decades or so depend heavily upon text archives of newspaper articles released by newspaper companies. For example, the Kyoto University Text Corpus (Kurohashi and Nagao 1998), which played an important role in the development of annotated corpora for Japanese natural language processing (NLP), consists of 40 thousand sentences taken from articles of the Mainichi newspaper published in 1995. The use of newspaper articles in linguistics and natural language processing, however, poses several important problems with respect to corpus representativeness, because newspaper articles are written by skilled journalists and independently checked by professional proofreaders. Accordingly, as we will see later in Sect. 6, newspaper articles belong to the class of Japanese text where linguistic variation is considerably suppressed.

In addition to newspaper articles, literary texts (mostly novels and essays) are also used in linguistic analyses of Japanese. Two main sources in this field are the CD-ROM version of Shincho bunko no hyakusatsu (one hundred titles from the Shincho paperbacks) and Aozora bunko, a collection of copyright-expired literary works.¹ The problem with these collections is that the texts are too old to be called 'contemporary.'

In addition to these, texts gathered by internet crawling have become a major resource in recent years for both linguistics and NLP. In linguistics, texts on the web are expected to be suitable for the study of linguistic variation, because materials in diverse writing styles are likely to exist on the web. There is, however, a problem with this approach: so-called meta-information about the texts (genre, register, etc.) and/or the writers (gender, age, etc.) is frequently missing. Although the meta-information could be estimated by means of up-to-date statistical clustering and classification methods, this approach requires a certain amount of training data for supervised learning, derived from reliable reference corpora that include the types of data mentioned above and cover various text types.

¹ http://www.aozora.gr.jp/.

To solve these problems, the authors launched a corpus compilation project in the spring of 2006, aiming at the public release of Japan's first 100-million-word balanced corpus in 2011. The corpus was named the Balanced Corpus of Contemporary Written Japanese (BCCWJ, hereafter).

2 Corpus design

There are three basic principles in the design of the BCCWJ. The first principle is to design the corpus exclusively as a corpus of written Japanese rather than spoken Japanese. Spoken varieties are excluded mainly because some of them are covered by the recently released Corpus of Spontaneous Japanese, or CSJ (Maekawa et al. 2000; Maekawa 2003). In the actual BCCWJ, however, there is one exception to this principle: the OM register (see below for the abbreviation), which deals with the minutes of Japan's National Diet. The reason for the inclusion of this register is discussed in Sect. 2.2.3 below.

The second principle is the maximum use of random sampling. As is well known, samples extracted by this method maximally represent the characteristics of the population. Unfortunately, however, several factors limit the application of random sampling in the construction of a balanced corpus. For one thing, random sampling is possible only when the sampling population is explicitly definable. As far as written Japanese is concerned, it was in most cases possible to get the information necessary for defining the population. In Sect. 2.2 below, the characteristics of each register are explained with reference to its sampling population.

The third principle is to make the corpus publicly available. This principle, when combined with the second one, inevitably increases the burden of copyright treatment. This problem is discussed in Sect. 3 below.

2.1 Size

The size of the BCCWJ is 100 million words. This is the size of the renowned British National Corpus (BNC), but that is not the only reason why this size was chosen. A balanced pilot corpus of one million words covering books, magazines, and newspapers was constructed in 2005 in order to estimate the construction cost of a large-scale balanced corpus. This study showed that one hundred million words is the maximum size that could be achieved, given the corpus design described in Sects. 2 and 4 below and, more importantly, the requirement that the corpus be completed within 5 years. It was estimated that the construction of a larger corpus would certainly exceed the research budget available at that time.

2.2 Balance and representativeness

The BCCWJ consists of three subcorpora: the publication, library, and special-purpose subcorpora. The subcorpora are further divided into 13 classes of texts that we tentatively call text 'registers.' Table 1 shows the relationship between the three subcorpora and the 13 registers. The following subsections describe these subcorpora and registers.

2.2.1 Publication subcorpus

The publication subcorpus, or PSC, consists of three registers: books in general (abbr. PB), magazines in general (PM), and newspapers from around Japan (PN). Samples of books and magazines are the most important among these registers. Texts in these registers are not readily available to linguists and NLP researchers except for the copyright-expired ones (see Sect. 1 above), because copyright clearance in these registers is extremely complicated, hence difficult (see Sect. 3 below).

From the point of view of corpus balance, the unique characteristic of the PSC is that a single definition of the statistical population is shared by the three component registers. The statistical populations for the sampling consisted of all books, magazines, and newspapers published in the years 2001–2005; the populations were constructed based on the information provided by publicly available databases including J-BISC (Japan Biblio-Disc, a book directory service of the National Diet Library) and Zasshi shinbun sokatarogu (an almanac of all magazines and newspapers). As far as the PSC is concerned, accordingly, the mutual ratio of the sample sizes of the three registers (i.e. 28.55 vs 4.44 vs 1.37) is not arbitrary. Rather, it reflects the activities of the three registers in the designated years, as measured by the amount of published text.

Table 1 Structure of the BCCWJ

Subcorpus        Register                                    Abbr.  #Sample  #Word (unit: million)
Publication      Books                                       PB     10,177   28.55
                 Magazines                                   PM      1,996    4.44
                 Newspapers                                  PN      1,473    1.37
Library          Books                                       LB     10,551   30.38
Special-purpose  White papers                                OW      1,500    4.88
                 Bulletin board (Yahoo! Chiebukuro)          OC     91,455   10.26
                 Blog (Yahoo! Blog)                          OY     52,680   10.19
                 Best-selling books                          OB      1,390    3.74
                 School textbooks                            OT        412    0.93
                 Minutes of the National Diet                OM        159    5.10
                 Publicity newsletters of local governments  OP        354    3.76
                 Laws                                        OL        346    1.08
                 Poetry verses                               OV        252    0.25

In the actual sampling for the PSC, the technique known as stratified (or layered) sampling is used. For example, the population of the book samples is divided into 55 categories covering all 11 categories of the NDC (Nippon Decimal Classification, Japan's standard book classification system) for each of the 5 years, so that samples are extracted from all these categories.
In the same vein, the population of the magazine samples is divided into 30 categories (6 magazine genres and five years), and the population of newspaper articles is divided into 80 categories (16 newspapers and five years). The populations thus defined contain 48.5, 10.5, and 6.4 billion characters for the PB, PM, and PN registers, respectively. The sampling ratio of the PSC is set to one thousandth (1/1,000). Note that the size of a population is estimated in terms of the number of characters rather than words, because Japanese texts need word segmentation before they can be counted as words.

Lastly, it is important to understand that the PSC is constructed exclusively on the basis of the production (publishing) aspect of publication and has nothing to do with the reception (sales) aspect. Two books in the same format and with the same number of pages always have the same probability of being extracted as a sample of the PSC, regardless of any difference in their sales records.
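
The stratified, size-proportional sampling described above can be illustrated with a small sketch. It is not the authors' actual procedure: the function name `draw_sample_points`, the stratum keys, and the document sizes are hypothetical, and the sketch simply assumes that sample points are drawn uniformly over the characters of each stratum, so that a document's chance of selection depends only on its length and not on its sales.

```python
import bisect
import random
from itertools import accumulate


def draw_sample_points(strata, ratio=1 / 1000, sample_chars=1000, seed=0):
    """Pick random character positions inside every stratum.

    strata: {stratum_key: [(doc_id, n_chars), ...]}  (hypothetical layout)
    Returns {stratum_key: [(doc_id, offset_in_doc), ...]}.
    """
    rng = random.Random(seed)
    points = {}
    for key, docs in strata.items():
        ends = list(accumulate(n for _, n in docs))  # cumulative character counts
        total = ends[-1]
        # Number of 1,000-character samples needed to reach ratio * total characters.
        n_samples = round(total * ratio / sample_chars)
        picks = []
        for _ in range(n_samples):
            pos = rng.randrange(total)           # uniform over all characters
            i = bisect.bisect_right(ends, pos)   # document that contains pos
            doc_id, n_chars = docs[i]
            picks.append((doc_id, pos - (ends[i] - n_chars)))
        points[key] = picks
    return points


# Toy population: two strata keyed by (NDC category, year); all values invented.
toy_strata = {
    ("NDC 9", 2001): [("book_a", 2_000_000), ("book_b", 1_000_000)],
    ("NDC 0", 2001): [("book_c", 1_000_000)],
}
toy_points = draw_sample_points(toy_strata)
```

Because every stratum receives its own quota, no NDC category or year can be left without samples, which is the point of layering the population. At the stated ratio of 1/1,000, the 48.5-billion-character PB population would correspond to a target of about 48.5 million sampled characters.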

2.2.2 Library subcorpus

The library subcorpus, or LSC, consists of a single register, LB, and is designed to reflect the reception aspect of publication. Needless to say, the best way to achieve this goal would be to construct a corpus whose materials are sampled on the basis of sales data, but such a corpus is impossible to design because reliable data on the sales of books and/or magazines do not exist. Instead, the corpus is designed so that its population is all the books registered in the public libraries of Tokyo Metropolis, with the expectation that books that are registered in multiple public libraries and have stayed there for a while may represent, in some way, the books read (namely, received) by a certain number of readers.

More precisely, the statistical population of the LSC is the set of books satisfying the following two conditions. First, the book was published between 1986 (the year when the ISBN, which is indispensable for the management of library databases, was adopted by most Japanese publishers) and 2005. Second, the book is registered in the public libraries of more than 13 cities and/or special wards (out of a total of 52) of Tokyo Metropolis. The size of the population thus defined (47.9 billion characters) is nearly the same as that of the population of book samples in the PSC. For stratified sampling, the LSC population is divided into 220 layers consisting of 11 NDC categories and 20 years of publication.
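
The two membership conditions and the 220 layers can be written down as a small predicate. This is only a sketch under assumptions: the `Book` record, its field names, and the integer encoding of the NDC category are hypothetical and do not correspond to any actual library database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    title: str
    year: int            # year of publication
    ndc: int             # index of the top-level NDC category, 0..10 (assumed encoding)
    municipalities: int  # Tokyo cities/special wards (out of 52) whose libraries hold it


def in_lsc_population(book: Book) -> bool:
    """Both conditions from the text: published 1986-2005, held in more than 13 of 52."""
    return 1986 <= book.year <= 2005 and book.municipalities > 13


def lsc_stratum(book: Book) -> tuple:
    """Layer used for stratified sampling: (NDC category, year of publication)."""
    return (book.ndc, book.year)


# 11 NDC categories x 20 publication years give the 220 layers.
N_LAYERS = 11 * len(range(1986, 2006))
```

A book held in exactly 13 municipalities is excluded under this reading of "more than 13".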

2.2.3 Special-purpose subcorpus

The special-purpose subcorpus, or SSC, is designed to cover the kinds of registers that are regarded as indispensable for understanding the totality of contemporary written Japanese but are not covered by either the PSC or the LSC. As shown in Table 1, there are nine registers in the SSC.

White papers (OW) is a collection of samples randomly drawn from the population of 1,006 white papers published by the Japanese government during the years 1976–2005. It is expected to represent the variety of Japanese used in official administrative publications. The population for the OW samples is divided into 54 layers (consisting of nine genres and six time periods of five years).

Bulletin board (OC) and blog (OY) are the two registers representing texts on the internet. They are expected to represent the most up-to-date characteristics of the Japanese language. At the same time, they are expected to represent a writing style that is more casual than that of the other registers. Moreover, the OC samples are expected to show more interpersonal features of the language, such as phrase-final particles and honorifics, than the other registers; this expectation is based on the fact that the exchanges of questions and answers on a bulletin board are all written dialogues. The population for the OC samples consists of more than three million pairs of questions and the corresponding answers (including multiple answers to a single question) posted between October 2004 and October 2005. The population for the OY samples consists of about 3.5 million blog articles posted between April 2008 and April 2009.

Best-selling books (OB) contains samples from a population of 951 best-selling books published during the years 1976–2005. School textbooks (OT) consists of samples from a population of 145 elementary, junior-high, and high school textbooks published during the years 2005–2007. The population was layered with respect to nine subjects and three types of school.

Minutes of the National Diet (OM) consists of samples from a population of 32,986 Diet meetings held between 1976 and 2005. The population was layered with respect to the house (the House of Councilors and the House of Representatives), six time periods of five years each, and the type of meeting. As pointed out earlier, OM is the sole exception to the first principle of the BCCWJ design (exclusion of spoken varieties). The inclusion of OM is thought to be necessary because, for one thing, it is concerned with the language of political discussions and debates, which is not covered by the other registers of the BCCWJ or by the CSJ. For another, corpus users have a need for this register. The studies collected in Matsuda (2008) show convincingly the usefulness of the minutes of the National Diet for the study of language variation and change.

Publicity newsletters of local governments (OP) consists of samples from the population of newsletters issued by 100 local municipalities in 2008. The municipalities cover all 8 areas of Japan (Hokkaido, Tohoku, Kanto, Chubu, Kinki, Chugoku, Shikoku, and Kyushu-Okinawa), and the population was layered along these areas. Laws (OL) consists of samples from a population of 718 laws that were promulgated during the years 1976–2005. The population was layered into six time periods of five years each.

Lastly, poetry verses (OV) consists of samples of short fixed-form verses (tanka and haiku) and free-form poems randomly selected from the following three collections: vols. 14–17 of Chikuma Shobo's Gendai Tanka Zenshu (2002), vols. 8–15 of Kadokawa's Zoho Gendai Haiku Taikei (1980–82), and 118 titles of Shichosha's Gendai Shi Bunko (1986–2005). Note that in register OV, multiple tanka and haiku are included in a single sample (see Sect. 2.3.3 below).

2.3 Notes on corpus design and implementation

In this section, some additional information about the design and implementation of the BCCWJ is presented. This information is indispensable for users of the corpus.

2.3.1 Missing registers

There are two registers that the present authors wanted to include in the BCCWJ but could not: internet mail and manga (Japanese cartoons). Internet mail often contains material too private to be made publicly available. As for manga, the main obstacles are the difficulty of copyright treatment and of estimating the amount of text in a title.

2.3.2 Temporal coverage

As described in the preceding subsections, the temporal coverage of a population differs considerably depending on the register. Registers OW, OB, OM, and OL have the longest coverage of 30 years (1976–2005). LB (the LSC as a whole) has a middle coverage of 20 years (1986–2005). PB, PM, and PN (the PSC as a whole) have a short coverage of five years (2001–2005). The coverage of OT is also short (2005–2007), but not identical to that of the PSC. OC, OY, and OP have the shortest coverage of one year. Users have to be careful about the difference in temporal coverage when they want to compare the results obtained from different registers. On the other hand, it is possible to obtain knowledge about the real-time change of the Japanese language over the past 20–30 years using the registers of relatively long temporal coverage, namely LB, OB, OW, OM, and OL.
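
When comparing registers, it can help to check first whether their populations overlap in time at all. The lookup below is a minimal sketch built from the periods stated in Sect. 2.2; the names `COVERAGE` and `shared_years` are hypothetical, the OC and OY periods (October 2004 to October 2005, April 2008 to April 2009) are rounded to calendar years as a simplifying assumption, and OV is left out because its three source collections have different publication dates.

```python
# (first year, last year) of each register's sampling population, as given in the text.
COVERAGE = {
    "OW": (1976, 2005), "OB": (1976, 2005), "OM": (1976, 2005), "OL": (1976, 2005),
    "LB": (1986, 2005),
    "PB": (2001, 2005), "PM": (2001, 2005), "PN": (2001, 2005),
    "OT": (2005, 2007),
    "OC": (2004, 2005),  # Oct 2004 - Oct 2005, rounded to calendar years
    "OY": (2008, 2009),  # Apr 2008 - Apr 2009, rounded to calendar years
    "OP": (2008, 2008),
}


def shared_years(reg_a, reg_b):
    """Return the (first, last) years covered by both registers, or None if disjoint."""
    start = max(COVERAGE[reg_a][0], COVERAGE[reg_b][0])
    end = min(COVERAGE[reg_a][1], COVERAGE[reg_b][1])
    return (start, end) if start <= end else None
```

For instance, LB and PB share the years 2001–2005, whereas PB and OY share none, so a difference observed between the latter two may reflect time as well as register.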

2.3.3 Sample length

Sample length is an important factor in sampling. From the point of view of corpus representativeness, it is desirable to extract many very short samples of the same length from the population rather than one extremely long sample, given that the total number of words in the samples stays the same. However, extracting very short samples is not necessarily the best solution from the point of view of linguistic research. For example, a sample that is too short makes it impossible to disambiguate homonyms. It also makes it impossible to analyze the structure of discourse. Those who are interested in the structure of discourse may want to have access to as long a context as possible.

As a remedy to this problem, two different sample lengths are adopted in the sampling of the BCCWJ: fixed-length and variable-length samples.
A fixed-length sample consists of 1,000 characters (counting kanji (Chinese logographs), hiragana and katakana (the two types of Japanese syllabic letters), and Roman alphabet letters equally as one character each, and excluding all punctuation marks). Fixed-length samples are suitable for various statistical inferences such as the estimation of word and/or kanji frequencies. A sample of 1,000 characters roughly corresponds to the characters printed on two pages of a Japanese paperback (文庫本). Needless to say, the end of a fixed-length sample does not always match a linguistic boundary of any sort (such as a phrase or sentence boundary). When the thousandth character of a fixed-length sample occurs in the middle of a sentence, the end of the sample is extended to match the nearest sentence boundary.²

A variable-length sample covers an entire text that is logically organized as a chapter or section. The length of a variable-length sample differs considerably from sample to sample. For example, some newspaper articles such as obituaries are shorter than 1,000 characters. On the other hand, there are novel chapters that run to several tens of thousands of characters. For the latter, we introduce an upper limit: a single variable-length sample may not exceed 10,000 characters. Variable-length samples are suitable for various context-sensitive linguistic analyses.

Notice that variable-length samples are extracted from all BCCWJ registers, while fixed-length samples are extracted only from PB, PM, PN, LB, and OW. In these registers, there is usually overlap between the fixed- and variable-length samples, because the two samples are extracted simultaneously from the same (randomly selected) document. When we computed the word frequencies of the BCCWJ shown in Table 1, the overlap was excluded from the data. See also the description of Chunagon in Sect. 5.2 below.
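
The fixed-length rule described above (count 1,000 characters while skipping punctuation, then extend to the sentence boundary) can be sketched as follows. This is an illustration rather than the tool used to build the corpus: the function names, the set of sentence-final marks in `SENTENCE_END`, and the decision to count only Unicode letters (so that digits and symbols are skipped) are all assumptions.

```python
import unicodedata

SENTENCE_END = "。！？"  # assumed set of sentence-final marks


def counts_as_character(ch):
    """Kanji, kana and Roman letters are Unicode letters (category L*); punctuation is not."""
    return unicodedata.category(ch).startswith("L")


def fixed_length_sample(text, start=0, length=1000):
    """Return (sample, index of the length-th counted character or None).

    The sample runs from `start` up to the end of the sentence that contains
    the length-th counted character.
    """
    counted = 0
    mark = None
    for i in range(start, len(text)):
        if counts_as_character(text[i]):
            counted += 1
            if counted == length:
                mark = i
                break
    if mark is None:  # fewer than `length` countable characters: keep what is there
        return text[start:], None
    end = mark + 1
    # Extend until the character just included is a sentence-final mark.
    while end < len(text) and text[end - 1] not in SENTENCE_END:
        end += 1
    return text[start:end], mark


# Toy text; with length=5 the fifth counted character is "お", in mid-sentence.
toy_sample, toy_mark = fixed_length_sample("あいう。えおか、きく。けこ", length=5)
```

Returning the position of the last counted character alongside the sample mirrors the idea of marking the thousandth character, so that exactly 1,000 characters can still be recovered from an extended sample.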

3 Treatment of copyright

In present-day society, one of the greatest difficulties of corpus compilation lies in the treatment of copyright. This is especially true of a corpus like the BCCWJ that deals exclusively with contemporary materials. All BCCWJ samples are copyright-protected, with the following two exceptions. Texts of laws (register OL) are not copyright-protected under the Japanese copyright law. The minutes of the National Diet (OM) are copyright-protected, but the councilors and representatives make it a rule to waive their rights.

There are two special factors that make the copyright treatment of the BCCWJ very difficult. First, the total number of samples in the BCCWJ is 172,675 (counting only variable-length samples), which is much larger than in the BNC (which consists of about 4,000 samples). This is because the sample length of the BCCWJ is generally much shorter than that of the BNC. This is true not only of the fixed-length samples but also of the variable-length samples. Second, the extensive use of random sampling makes the copyright treatment even more difficult. Because the choice of samples is random, and hence has nothing to do with the ease of copyright treatment, it is impossible to give priority to samples whose authors are likely to permit their use in the corpus.

² The location of the thousandth character is marked by a tag so that users can extract exactly 1,000 characters when needed.

A brief overview of the copyright treatment in the construction of the BCCWJ is presented below. Samples in registers OC and OY were the easiest to clear. Yahoo! Japan Corporation, which holds the copyright of all materials posted to its internet site, gave us permission for all samples. Similarly, permission for the samples in the PN and OW registers was given by the newspaper companies and by the relevant ministries or agencies of the Japanese government.

Samples in other registers were handled basically through one-by-one negotiation. In the case of book samples in registers PB, LB, and OB, for example, we obtained permission from the authors for more than 24 thousand samples (including many cases where an author has multiple samples). Needless to say, the negotiations are not always successful. The success rate is about 80% once we succeed in making contact with the authors (or copyright holders). Because the probability of successful contact with the authors is about 90%, the final success rate of the negotiation is about 70%. It was hence necessary to prepare more samples than we really need. In the case of book samples in the PB, LB, and OB registers, for example, the total number of prepared samples is 29,188, while the number of samples currently in the BCCWJ is 22,058.
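
The arithmetic behind these figures can be checked in a few lines. The sketch uses only the rounded percentages and counts quoted above; the variable names are illustrative, and the retained fraction is computed simply as included over prepared samples, which may reflect factors other than copyright negotiation.

```python
contact_rate = 0.90   # "about 90%": probability of reaching the author
consent_rate = 0.80   # "about 80%": success once contact has been made
overall_rate = contact_rate * consent_rate   # roughly 0.72, i.e. "about 70%"

prepared = 29_188     # book samples prepared for PB, LB and OB
included = 22_058     # book samples currently in the BCCWJ
retained = included / prepared               # fraction of prepared samples kept
oversampling = prepared / included           # prepared samples per included sample

print(f"overall success rate: {overall_rate:.2f}")
print(f"retained fraction:    {retained:.3f}")
print(f"oversampling factor:  {oversampling:.2f}")
```

The retained fraction of the book samples (about 76%) is of the same order as the estimated overall success rate, which is consistent with the need to prepare roughly a third more samples than end up in the corpus.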

4 Corpus annotation

At the present time, three basic annotations are applied to the BCCWJ, namely, POS annotation, document structure annotation, and meta-information annotation.

4.1 POS annotation

4.1.1 Dual POS analysis and the UniDic

As suggested earlier in Sect. 2.2.1, POS analysis of Japanese text requires word segmentation. It hence requires a principled treatment of a linguistic unit approximating a 'word'. In this respect, however, the treatment of words, compound words especially, in previous POS analysis systems is very problematic.

Table 2 compares the outputs of three existing machine-readable dictionaries for automatic POS analysis systems, viz., Juman, Yahoo! Japan's API, and ChaSen with the IPA dictionary. The output of MeCab with UniDic, a dictionary newly developed for the POS annotation of the BCCWJ (Den et al. 2008; see also the next subsection), is also shown for further comparison. The table shows the word segmentation done automatically by the systems when the names of four Japanese national institutes are analyzed out of context. The '+' symbols show the locations of word boundaries.

国立国会図書館 ("kokuritsu kokkai toshokan" 'National Diet Library'; "kokuritsu" 国立 'national', "kokkai" 国会 'diet', and "toshokan" 図書館 'library') is treated in three different ways. The dictionary used in Yahoo! Japan's web API treats it as a single entry.³ The IPA (Information-technology Promotion Agency) dictionary delivered with the ChaSen morphological analysis system of NAIST (Nara Institute of Science and Technology) (Asahara and Matsumoto 2000) divides it into two entries, viz., "kokuritsu" + "kokkaitoshokan". And the dictionary of Kyoto University's JUMAN morphological analysis system (Kurohashi et al. 1994) divides it into four entries as "kokuritsu" + "kokkai" + "tosho" + "kan" (where "kan" 館 is analyzed as a suffix standing for 'house' and "tosho" 図書 stands for 'books').

What is worse, from the point of view of linguistics, considerable internal inconsistency is observed in all dictionaries except for the UniDic. In the case of Juman, the string 国立 "kokuritsu" is analyzed as a word in the first three institutions, but 'National Astronomical Observatory' is an exception. The output of IPA shows exactly the same pattern. On the other hand, Yahoo! treats 'National Diet Library' and 'National Museum of Ethnology' as single entries, while it analyzes 'National Archives of Japan' and 'National Astronomical Observatory' into two entries. Similar within-dictionary inconsistency can be observed in the treatment of the suffix 館 "kan" in the IPA. These internal inconsistencies can be a serious obstacle when researchers want to obtain basic statistical information about the structure of the language.

These inconsistencies seem to emerge as a result of the interaction of two factors. In the first place, Japanese is a so-called agglutinative language that allows a high degree of freedom in the definition of 'word'. In addition to this, it seems that users of a morphological analysis system have heterogeneous requirements for the system's output. Users who are interested in the analysis of named entities may prefer single-entry analysis, while users who are interested in the morphological structure of compound nouns may prefer fine segmentation.
Clearly, these demands are mutually incompatible, and the mixture of the two demands results in the lack of morphological consistency observed in Table 2.

Table 2 Comparison of existing dictionaries

³ http://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html.

This problem can be resolved, at least partially, if the text is analyzed consistently at two different levels of morphology. One of the levels is called SUW, or short-unit word. SUW approximates the level at which entries of traditional Japanese dictionaries are specified. The other level is called LUW, or long-unit word. LUW is devoted mostly, but not exclusively, to compound words such as compound nouns, compound verbs, and compound particles.⁴

Table 3 shows an example of the dual POS analysis. The example phrase is 公害紛争処理法における公害紛争処理の手続きは ("kogai funso shori ho ni okeru kogai funso shori no tetsuzuki wa", 'as for the procedure of the settlement of pollution disputes according to the act for the settlement of pollution disputes'). The left and right halves of the table show the results of the SUW and LUW analyses, respectively. The first four SUW nouns correspond to one compound noun in the LUW analysis; the following three SUW entries, consisting of a particle, a verb, and an auxiliary verb, correspond to a single compound particle in the LUW analysis; and the next three SUW nouns correspond to a compound LUW noun.

Table 3 Example of two-way POS analysis

There are two more important facts to be pointed out in the table. First, words in LUW are not always compound words. As in the last three rows of the table, SUW and LUW analyses coincide if a text does not include any compound word. Second, substrings of an LUW can be used as independent LUWs, as is the case with the first and third LUW entries in Table 3, where the third entry is a substring of the first. One more example: 研究所 ("kenkyujo", 'research institution') is an LUW consisting of two SUWs, 研究 ("kenkyu", 'research') and 所 ("jo", 'institution'), but longer LUWs containing this LUW are also possible, such as 国語研究所 ("kokugo kenkyujo", 'research institute of the Japanese language') or 国立国語研究所 ("kokuritsu kokugo kenkyujo", 'national institute of the Japanese language').⁵

In the POS analysis of the BCCWJ, texts are first analyzed into SUWs using the UniDic; then, some of the SUWs are merged into LUWs on the basis of statistical learning (Uchimoto and Isahara 2007; Kozawa et al. 2011). Although far from perfect, the introduction of the dual POS analysis considerably alleviates the problems caused by the contradicting demands on POS analysis. The number of words is counted consistently in SUW in this paper.

⁴ The first corpus to which the dual POS analysis was applied was the Corpus of Spontaneous Japanese (Maekawa et al. 2000).
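
The relation between the two levels can be pictured as a list of LUWs, each spanning one or more SUWs. The sketch below encodes the example phrase in that way. It follows the prose description of Table 3, but the data layout and names (`SUW`, `PHRASE`, `luw_surface`) are hypothetical, the exact SUW boundaries inside における and the POS labels of the last three units are assumptions, and it does not reproduce the actual BCCWJ file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SUW:
    surface: str
    pos: str


# Each entry is one LUW: (LUW part of speech, the SUWs it spans).
PHRASE = [
    ("noun", [SUW("公害", "noun"), SUW("紛争", "noun"), SUW("処理", "noun"), SUW("法", "noun")]),
    # SUW boundaries inside the compound particle are an assumption.
    ("particle", [SUW("に", "particle"), SUW("おけ", "verb"), SUW("る", "auxiliary verb")]),
    ("noun", [SUW("公害", "noun"), SUW("紛争", "noun"), SUW("処理", "noun")]),
    # In the last three rows the SUW and LUW analyses coincide.
    ("particle", [SUW("の", "particle")]),
    ("noun", [SUW("手続き", "noun")]),
    ("particle", [SUW("は", "particle")]),
]


def luw_surface(luw):
    """Surface form of an LUW: the concatenation of its SUW surfaces."""
    _, suws = luw
    return "".join(s.surface for s in suws)


n_luw = len(PHRASE)                             # word count at the LUW level
n_suw = sum(len(suws) for _, suws in PHRASE)    # word count at the SUW level
text = "".join(luw_surface(luw) for luw in PHRASE)
```

Counting at the two levels gives different totals for the same text (6 LUWs versus 13 SUWs here), which is why the unit of counting has to be stated whenever corpus sizes are reported.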

4.1.2 The performance

The SUW analysis of the BCCWJ is conducted using the combination of the MeCab morphological analyzer (Kudo, Yamamoto, and Matsumoto 2004) and the UniDic. Figure 1 shows the performance of the SUW analysis by MeCab and UniDic. In addition to three registers of the BCCWJ (OW, PN, and OC), samples of books in the PB, LB, and OB registers are reclassified here into the two categories of "literature" and "non-literature". Moreover, the performance on the spoken data of the CSJ is also shown. In this figure, the performance is evaluated by means of the F-measure, which is computed using the formula 2PR/(P+R), where P and R stand for precision and recall, respectively. The samples used for the evaluation (about 100,000 SUW in size) are randomly selected from the BCCWJ-Core and the CSJ. They are analyzed manually and cross-checked by a group of expert annotators.
The result of the manual analysis thus obtained is treated as the 'correct answer' in the following evaluations.

Fig. 1 Performance of SUW analysis by MeCab + UniDic (bar chart; y-axis: F value, from 0.960 to 1.000; x-axis: OW, Literature, Non-Literature, PN, OC, CSJ; bars: Boundary, POS, Lemma)

⁵ The last entry is the Japanese name of the institution to which some of the authors belong, i.e., NINJAL.

The F-measure is computed under three different evaluation criteria. Under the first criterion, only the correctness of the SUW boundary locations is evaluated. The F-value is computed from the recall and precision values obtained by comparing the system output with the correct answer. The result is shown by the open bars in the figure. Under the second criterion, the joint correctness of the SUW boundary locations and the POS information is evaluated. In Fig. 1, the shaded bars labeled 'POS' stand for this case. Lastly, under the third criterion, the joint correctness of boundary location, POS information, and lemma specification is evaluated. Here, lemma specification means the correctness of the lemma choice from a set of homonyms. This criterion is important because Japanese has a considerable number of homonyms for historical and phonological reasons. For example, the word バス "basu" has at least five meanings: 'bus', 'bass', 'bath' (loans from English), 'Bass' (a surname or place name in English), and 'lotus' (native Japanese "hasu" turns into "basu" by the rule of sequential voicing, or rendaku, when the word appears in the latter half of a compound noun, as in "onibasu" 'Euryale ferox'). The filled bars in the figure stand for this case. Note also that the BCCWJ samples analyzed here are open data, meaning that they were not included in the data used for training the SUW analysis system.

As can be seen from the figure, the system performance differs depending on the register, and the analysis of the web text (OC) is the most difficult among the BCCWJ registers. But the F-measures exceed 0.98 even in the most difficult register and under the most difficult (i.e., the third) criterion.
The latest version of UniDic is downloadable from the internet.⁶

As for the LUW analysis, Fujiike et al. (2010) report accuracies (but not F-measures) computed under the same three criteria as for SUW, and across nearly the same registers as in Fig. 1. The reported accuracy is higher than 98% even in the most difficult case, i.e. the analysis of texts in the OC register evaluated according to the third criterion.
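
The three evaluation criteria used for the SUW analysis can be made concrete with a short sketch of the F-measure 2PR/(P+R). It assumes that a system token counts as correct when its character span, and optionally its POS and lemma, match a token of the manual analysis exactly; the paper does not spell out its matching procedure, so this is one common way of scoring segmentation, not necessarily the authors'. The function names and the toy tokens (including the English glosses used as lemma labels) are invented for illustration.

```python
def spans(tokens):
    """Turn [(surface, pos, lemma), ...] into a set of (start, end, pos, lemma)."""
    out, pos = set(), 0
    for surface, *labels in tokens:
        out.add((pos, pos + len(surface), *labels))
        pos += len(surface)
    return out


def f_measure(gold, system, n_labels):
    """n_labels = 0: boundaries only; 1: boundaries + POS; 2: boundaries + POS + lemma."""
    g = {t[: 2 + n_labels] for t in spans(gold)}
    s = {t[: 2 + n_labels] for t in spans(system)}
    correct = len(g & s)
    if correct == 0:
        return 0.0
    precision = correct / len(s)
    recall = correct / len(g)
    return 2 * precision * recall / (precision + recall)


# Toy data: the system segments and tags correctly but picks the wrong homonym of バス.
gold = [("バス", "noun", "bus"), ("に", "particle", "に"), ("乗る", "verb", "乗る")]
system = [("バス", "noun", "bath"), ("に", "particle", "に"), ("乗る", "verb", "乗る")]

f_boundary = f_measure(gold, system, 0)
f_pos = f_measure(gold, system, 1)
f_lemma = f_measure(gold, system, 2)
```

In this toy case the first two criteria give a perfect score while the third does not, which shows why the lemma criterion is the strictest of the three.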

4.1.3 The BCCWJ-Core

In the course of the SUW analysis of the BCCWJ, about 1% of the corpus was analyzed with special care. This subset is named the BCCWJ-Core. For the texts in this subset, the results of the automatic SUW analysis were checked and corrected manually so that the average accuracy is higher than 99% even when evaluated under the third criterion mentioned above. The high-accuracy POS data thus obtained were used as the training data for the MeCab + UniDic analysis system. The BCCWJ-Core should also be quite useful for corpus users who give priority to the accuracy of morphological annotation over the amount of data. Table 4 shows the contents of the BCCWJ-Core.

⁶ http://sourceforge.jp/projects/unidic/.

4.2 Document structure annotation

Texts of the BCCWJ are annotated with respect to document structure. Figure 2 shows the opening part of an XML file of a sample in the OW register. Tags in this figure such as <article>, <cluster>, <paragraph>, and <sentence> are all concerned with the hierarchical structure of a document, but there are also other classes of tags. Table 5 shows all tags used in the annotation of variable-length samples (see Sect. 2.3.3 above).

The type of XML document shown in Fig. 2 is called a C-XML (character-based XML) document and does not contain the results of dual POS analysis. There is another type of XML document, called an M-XML (morphology-based XML) document, that contains both the document structure information and the results of dual POS analysis.
AcrucialdifferencebetweentheC-andM-XMLdocumentsisthetreatmentofsentence.
TheC-XMLdocumentsallowsrecursiveuseofthe\sentence[tagwithinanother\sentence[tag.
IntheexampleofC-XMLshowninPanelAofFig.
3, texts in the second and third lines are enclosed by the <sentence> tags, and dominated by the topmost <sentence> tag that opened in the first line and closed in the fourth.[7] Note that the topmost <sentence> has its own texts. They are printed in the first and fourth lines in the panel.

Panel B of Fig. 3 shows the same document in the format of an M-XML document.[8] All texts, including the ones belonging to the topmost <sentence> in Panel A, are treated uniformly as <sentence>, and dominated by a newly introduced <superSentence> node. As shown by this example, the recursive structure of sentences in the C-XML document is transformed into a 'flat' structure consisting of only two levels, i.e., <sentence> and <superSentence>, in the M-XML document.

The 'flat' structure in the M-XML document does not reflect the syntactic structure of the original text (which is reflected in the tagging of the C-XML document), but it is preferred from the point of view of POS analysis for two reasons. First, the text enclosed by a <sentence> tag in the M-XML document is much shorter than the text enclosed by the topmost <sentence> tag in the C-XML document, which often contains several hundred SUWs, and hence is more appropriate as the input for the POS analysis system. Second, it is convenient to manage the corpus data using the sentence of the M-XML documents as the unit of data management. See the user's manual of the BCCWJ for more details about the two XML formats (NINJAL 2011).

Table 4 Contents of the BCCWJ-Core

Register   # Sample      # Word
PB               83     204,050
PM               86     202,268
PN              340     308,504
OW               62     197,011
OC              938      93,932
OY              471      92,746
Total         1,980   1,098,511

Fig. 2 Example of document structure annotation

Table 5 Tags used in document structure annotation

Type of tag                         Tag name           Gloss
Sample                              sample             Marks a whole sample
                                    sampling           Information about sampling
Hierarchical structure of document  article            Semantically coherent text written by a writer
                                    blockEnd           Marker of semantic boundary
                                    cluster            Marks the whole text covered by a <title> tag
                                    titleBlock         Covers the <title> and its corresponding elements
                                    title              Title given to a definite area of a sample
                                    orphanedTitle      Title given to an indefinite area of sample
                                    list               Enlisted elements
                                    paragraph          Paragraph
                                    sentence           Sentence. Permits recursive application
Figures                             figureBlock        Block of figure and its accompanying elements
                                    figure             Figure per se
                                    caption            Caption per se
                                    table              Table per se
Quotations                          quotation          Element cited from outside of the current article
                                    citation           Element cited from other document
                                    source             Information about the cited document
                                    speech             Transcription of speech, inner speech
                                    speaker            Designation of speaker
Notes                               noteBody           Note and its scope
                                    noteBodyInline     Inline note
Abstract                            abstract           Abstract of article or cluster
Miscellaneous                       authorsData        Information about the author
                                    contents           Table of contents
                                    profile            Profile of the authors or personae
                                    rejectedBlock      Shows the existence of deleted block
                                    verse              Marks poem, tanka, haiku, or lyrics
                                    verseLine          Marks one line in a verse
Characters                          ruby               Kana letters given to kanji letters to show the reading
                                    correction         Correction of errata in the original text
                                    missingCharacter   Characters outside the JIS X 0213:2004 character set
                                    enclosedCharacter  Enclosed characters
                                    cursive            Characters in cursive style
                                    image              Symbols outside the JIS X 0213:2004 character set
                                    superscript        Superscripts
                                    subscript          Subscripts
                                    fraction           Proper fraction part of a mixed fraction
                                    delete             Text with strike-through line
                                    br                 Physical line break
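The transformation from the recursive C-XML sentence structure to the flat M-XML structure described in this subsection can be illustrated with a minimal sketch. It is not the conversion tool used for the BCCWJ. It assumes that a topmost <sentence> contains only text, <quote> wrappers, and nested <sentence> elements, and the function name and the English sample text are hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten_sentences(top_sentence):
    """Turn a recursive <sentence> element into a two-level structure:
    one <superSentence> node dominating flat <sentence> children."""
    super_sentence = ET.Element("superSentence")

    def emit(text):
        # Every non-empty text segment becomes its own flat <sentence>.
        if text and text.strip():
            ET.SubElement(super_sentence, "sentence").text = text.strip()

    def walk(node):
        # Text before the first child belongs to the enclosing element.
        emit(node.text)
        for child in node:
            # Nested <sentence> and <quote> elements are descended into.
            walk(child)
            # Text after a child (its tail) belongs to the parent again.
            emit(child.tail)

    walk(top_sentence)
    return super_sentence


# Hypothetical four-segment example mirroring the layout of Panel A.
c_xml = (
    "<sentence>He said, "
    "<quote><sentence>It rained.</sentence>"
    "<sentence>We stayed home.</sentence></quote>"
    " and left.</sentence>"
)
flat = flatten_sentences(ET.fromstring(c_xml))
print([s.text for s in flat])
```

In this sketch the text owned by the topmost sentence and the two quoted sentences all end up as sibling <sentence> elements, which is the property that makes the flat format convenient as input to POS analysis.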
4.3 Meta-information

Meta-information of the BCCWJ consists of the following four categories: bibliographical information, directory information, sampling information, and article information.

Bibliographical information is the information about the original document from which the sample in question is extracted. Table 6 summarizes the contents of bibliographical information. The information recorded in fields named "Genre_1" to "Genre_3" differs according to the media of the text (namely, book, magazine, newspaper, etc.). Table 6 shows the case of books. "C-code" in Genre_3 is a book classification code used in bookstores in Japan. This code classifies the supposed customer of the book (general, educated, practical, women, children, etc.), while NDC classifies the subjects of books.

Directory information deals with the data about the person or institution recorded in the Bib_author field of the bibliographical information. Table 7 shows the four fields of the directory information. The information of gender and birth year is obtained either from open sources or by the questionnaire that was sent to the authors as a part of the copyright processing. The meta-information will be quite valuable for the users who want to analyze the corpus from a sociolinguistic point of view.

Sampling information deals with the data about how each sample of the BCCWJ was extracted from the population. It consists of four fields: Sample_ID, Bib_ID (unique index given to the original text from which the sample was extracted), Sampling_page (page number of the original text from which the sample was extracted), and Sampling_point (information about the beginning of the sample in the page specified by the Sampling_page field).

Lastly, article information deals with the data about the unit called "article" that plays a crucial role in the specification of authorship. It happens sometimes that different parts of a single sample are written independently by different authors. Whenever it is possible to divide a sample into separate parts that are written by different authors, each part is called an 'article.' Article information of the BCCWJ consists of fields like Article_ID (unique ID of articles), Directory_ID (unique ID of the author of article), First_appearance (the year when the article appeared for the first time in whatever media), First_published (the year when the article is published as a book), and so forth. Note that the case of joint writing, i.e., the case when multiple authors jointly write a text which is not separable into articles, is treated as a single article.

Table 5 continued

Type of tag   Tag name       Gloss
              info           Information about character in the original text
              rejectedSpan   Shows the existence of deleted characters
              substitution   A character substituted by different character(s)

[7] The intermediate two sentences were enclosed by the <quote> tag, because they were quoted by the Japanese quotation symbols ("「" and "」") in the original text.
[8] Dual POS information is not shown here for the sake of visibility.
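The meta-information categories of Sect. 4.3 can be pictured as linked records. The sketch below is only an illustration of that linkage: the field names follow the text, but the Python types, the class and function names, and all sample values are assumptions, and the distributed data files may be organized differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingInfo:
    sample_id: str
    bib_id: str            # index of the original text
    sampling_page: int     # page from which the sample was extracted
    sampling_point: str    # where the sample begins on that page


@dataclass(frozen=True)
class DirectoryEntry:
    directory_id: str
    name: str
    gender: Optional[str]
    birth_year: Optional[str]   # recorded per decade, e.g. "1950s"


@dataclass(frozen=True)
class ArticleInfo:
    article_id: str
    directory_id: str                 # author of the article
    first_appearance: Optional[int]   # year of first appearance in any media
    first_published: Optional[int]    # year of publication as a book


def birth_decade_by_article(articles, directory):
    """Join article information with directory information via Directory_ID."""
    by_id = {entry.directory_id: entry for entry in directory}
    return {
        a.article_id: by_id[a.directory_id].birth_year
        for a in articles
        if a.directory_id in by_id
    }


# Hypothetical records, invented for illustration only.
directory = [DirectoryEntry("D001", "Author A", "female", "1950s")]
articles = [ArticleInfo("A001", "D001", 1998, 2001)]
print(birth_decade_by_article(articles, directory))
```

A join of this kind is what a sociolinguistic analysis by author generation or gender would rely on.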
5 Corpus release

5.1 Shonagon

Compilation work of the BCCWJ came to an end in December 2011. Currently, the corpus is publicly accessible in three different ways. The easiest way is to use a web interface program called Shonagon.[9] In Shonagon, users can search for a string of up to 10 characters in the whole body of the BCCWJ by specifying registers and/or temporal coverage. When there are more than 500 hits, 500 randomly selected hits are shown with preceding and following contexts (each consisting of up to 40 characters). Some of the meta-information regarding bibliographical and directory information is also shown. Use of Shonagon is free of charge.

Fig. 3 Example of sentence annotation in C-XML and M-XML documents (Panels A and B)

[9] http://www.kotonoha.gr.jp/shonagon/
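The search behaviour described for Shonagon (a query of at most 10 characters, at most 500 randomly chosen hits, and up to 40 characters of context on each side) can be sketched as follows. This is not Shonagon's implementation. The function name, its parameters, and the sample text are hypothetical.

```python
import random


def kwic_search(text, query, max_query_len=10, context=40, max_hits=500, seed=None):
    """Return (left context, match, right context) tuples for a plain string search."""
    if not 0 < len(query) <= max_query_len:
        raise ValueError("query must be 1 to %d characters long" % max_query_len)
    hits = []
    start = text.find(query)
    while start != -1:
        end = start + len(query)
        left = text[max(0, start - context):start]
        right = text[end:end + context]
        hits.append((left, query, right))
        start = text.find(query, start + 1)
    if len(hits) > max_hits:
        # Too many hits: show a random selection instead of the first ones.
        hits = random.Random(seed).sample(hits, max_hits)
    return hits


print(kwic_search("これは本です。本を読む。", "本"))
```

Sampling at random rather than truncating keeps the displayed hits from being biased toward whatever part of the corpus happens to be searched first.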
5.2 Chunagon

In Shonagon, only string search is possible. When one wants to use the results of dual POS analysis, one has to utilize Chunagon, another web interface program.[10] This interface is also free of charge, but user registration by postal mail is required. This cumbersome process is introduced to protect the rights of copyright holders.

In Chunagon, it is possible to search for an N-gram of SUWs or LUWs (where 1 ≤ N ≤ 10; a mixture of SUW and LUW is not allowed). Each of the SUWs/LUWs in the N-gram can be specified finely with respect to morphological information like POS (53 classes classified in three layers), lemma, reading of lemma, word form, written form, conjugation form (provisional, gerund, conclusive, adnominal, conditional, etc.), conjugation type, and so forth. When multiple conditions of morphological information are specified, the GUI of Chunagon combines the conditions using logical product (i.e., AND). One way of specifying a logical sum (i.e., OR) of conditions is to edit a query stored in Chunagon. See below in this subsection.

Table 6 Contents of the bibliographical information about books

Field           Gloss
Bib_ID          ID of the original text from which the sample is extracted
Title           Title of the original text
Subtitle        Subtitle of the original text
Number          Volume and number of the original text
Bib_author      Author of the original text
Publisher       Publisher of the original text
Year            Published year of the original text
ISBN            ISBN of the original text
Size            Book size of the original text
Pages           Number of pages of the original text
Genre_1         The first digit of NDC (see Sect. 2.2.1 above)
Genre_2         The first 3 digits of NDC
Genre_3         C code
Bib_author_ID   ID of the Bib_author (see above)

Table 7 Contents of the directory information

Field          Gloss
Directory_ID   ID given to the person or institution
Name           Name of the person or institution
Gender         Sex of the person
Birth year     Birth year (per decade as 1950s, 1960s etc.)

[10] https://chunagon.ninjal.ac.jp/
It is possible to limit the range of search in several independent ways. First, it is possible to specify registers from which the samples are to be searched. Second, it is possible to limit the range of search by specifying the type of sample length; the three possible choices are fixed-length samples only, variable-length samples only, and both of them (excluding the overlap). Lastly, it is possible to search only the materials included in the BCCWJ-Core.

As in Shonagon, Chunagon displays only 500 hits on the screen if there are more than 500 hits. Unlike Shonagon, however, users can download all hit items unless the total number of hits is over 100,000. All queries issued by the same user are automatically stored in Chunagon for reuse; users can access the record to reissue the same queries, or they can create new queries by editing the recorded queries.
Figure 4 shows an example of a query stored in Chunagon. As mentioned above, users can use a logical sum (OR) while editing the queries. As can be seen from the figure, the syntax of the query language resembles that of SQL. This query defines a search for the cases where an adjective in its adverbial form is immediately followed by an auxiliary verb た "ta" ('PAST'), which is in turn followed immediately by another auxiliary verb です "desu" (polite form of copula) before a sentence boundary marked by a period. It is also specified that the range of the search is limited to the BCCWJ-Core, and the SUW is specified as the morphological unit. Similar queries were used in the analysis of i-adjectives reported in Sect. 6.5 below.

Lastly, users can issue multiple queries at a time when they work directly with the query language (rather than making a query using the GUI mentioned above). This contributes greatly to the speed and reproducibility of corpus analysis.
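The logic of such an N-gram query can be sketched in a few lines. The code below is not Chunagon's query language. The attribute names (pos, lemma, cform), their English values, and the function names are hypothetical stand-ins for the morphological information of each SUW. Conditions on one position are combined with AND, and a set of allowed values expresses OR.

```python
def matches(token, condition):
    """AND over attributes; a set/list/tuple of values for one attribute means OR."""
    for key, allowed in condition.items():
        if isinstance(allowed, (set, frozenset, list, tuple)):
            values = allowed
        else:
            values = (allowed,)
        if token.get(key) not in values:
            return False
    return True


def search_ngram(tokens, pattern):
    """Return start indices where consecutive tokens satisfy the pattern."""
    n = len(pattern)
    if not 1 <= n <= 10:
        raise ValueError("N must satisfy 1 <= N <= 10")
    return [
        i
        for i in range(len(tokens) - n + 1)
        if all(matches(tokens[i + j], pattern[j]) for j in range(n))
    ]


# Hypothetical SUW analysis of 映画は楽しかったです。
tokens = [
    {"surface": "映画", "lemma": "映画", "pos": "noun"},
    {"surface": "は", "lemma": "は", "pos": "particle"},
    {"surface": "楽しかっ", "lemma": "楽しい", "pos": "adjective", "cform": "adverbial"},
    {"surface": "た", "lemma": "た", "pos": "auxiliary"},
    {"surface": "です", "lemma": "です", "pos": "auxiliary"},
    {"surface": "。", "lemma": "。", "pos": "symbol"},
]

# Adjective (adverbial form) + た + です + sentence-final period.
pattern = [
    {"pos": "adjective", "cform": "adverbial"},
    {"pos": "auxiliary", "lemma": "た"},
    {"pos": "auxiliary", "lemma": "です"},
    {"lemma": "。"},
]
print(search_ngram(tokens, pattern))
```

Replacing a single value with a set, for example {"lemma": {"です", "だ"}}, corresponds to the logical sum that the GUI does not offer directly.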
5.3 DVD-release

It is impossible to access the full contents of the BCCWJ even in the environment of Chunagon. Users who want to utilize the whole information in the BCCWJ described in Sects. 2 and 4 should utilize the DVD-release version of the corpus. In this version, all C-XML/M-XML documents (see subsection 4.2 above), meta-information data, and the data table used in Chunagon (in the TSV format) are packaged in two DVDs. This is literally the whole body of the BCCWJ.

Fig. 4 An example of a query stored in Chunagon

Note that the DVD-release version does not include any environment for corpus search. Users should construct the environment of corpus query by themselves. For most users, use of an RDBMS like MySQL is recommended.
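As an illustration of building such an environment, the sketch below loads a TSV table into a relational database and runs one query. It uses SQLite from the Python standard library as a stand-in for MySQL. The column names and the sample rows are hypothetical and do not reproduce the layout of the distributed TSV files.

```python
import csv
import io
import sqlite3

# Hypothetical miniature TSV; the real data table has its own columns.
SAMPLE_TSV = (
    "sample_id\tregister\tlemma\tpos\n"
    "S1\tPN\t政府\tnoun\n"
    "S1\tPN\t発表\tnoun\n"
    "S1\tPN\tする\tverb\n"
    "S2\tOY\t今日\tnoun\n"
    "S2\tOY\t楽しい\tadjective\n"
)


def load_suw_table(conn, tsv_file):
    """Create the table if needed and insert every TSV row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS suw "
        "(sample_id TEXT, register TEXT, lemma TEXT, pos TEXT)"
    )
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    conn.executemany(
        "INSERT INTO suw VALUES (:sample_id, :register, :lemma, :pos)", reader
    )
    conn.commit()


def noun_counts_by_register(conn):
    """Count noun tokens per register with a plain SQL aggregate."""
    rows = conn.execute(
        "SELECT register, COUNT(*) FROM suw "
        "WHERE pos = 'noun' GROUP BY register ORDER BY register"
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
load_suw_table(conn, io.StringIO(SAMPLE_TSV))
print(noun_counts_by_register(conn))
```

For the full corpus, a file-backed or server-based database with indexes on the columns used for filtering would be the natural next step.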
Note also that with regard to the DVD-release of the BCCWJ, the price of a permanent academic license is 50,000 JPY.[11]

6 Some preliminary analyses

In the rest of this paper, results of some preliminary analyses of the BCCWJ are presented. These analyses are concerned with the BCCWJ-Core, and the focus of the analyses is placed on the diversity of texts in the corpus as observed across registers.
6.1 POS distribution

Many preceding studies report that the POS distribution differs systematically depending on the type of texts (see Kabashima and Satake 1978 among others). Figure 5 compares the distributions of four main content-word classes (noun, verb, adverb, and adjective) in SUW. Note that two types of adjectives in Japanese, namely i-adjectives (i.e., adjectives ending in /i/) and na-adjectives (adjectives ending in /na/), are pooled into one class. Registers like PN (newspaper) and OW (white paper), whose main function is the transmission of facts, are marked by a higher ratio of nouns and a lower ratio of adverbs, while registers like OC (bulletin-board) and OY (blog), whose main function consists in the expression of subjective opinion, are marked by a lower ratio of nouns and a higher ratio of adverbs.
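The kind of tabulation behind such a comparison can be sketched as follows. The POS labels and the function name are hypothetical simplifications, not the UniDic tag set, and the tag sequence is invented. The two adjective types are pooled into one class as described above.

```python
from collections import Counter

CONTENT_CLASSES = ("noun", "verb", "adverb", "adjective")
# i-adjectives and na-adjectives are pooled into a single class.
POOLED = {"i-adjective": "adjective", "na-adjective": "adjective"}


def pos_ratios(pos_tags):
    """Return the share of each content-word class, with the rest as 'misc'."""
    if not pos_tags:
        raise ValueError("pos_tags must not be empty")
    pooled = (POOLED.get(tag, tag) for tag in pos_tags)
    counts = Counter(tag if tag in CONTENT_CLASSES else "misc" for tag in pooled)
    total = sum(counts.values())
    return {cls: counts[cls] / total for cls in CONTENT_CLASSES + ("misc",)}


tags = ["noun", "noun", "verb", "i-adjective", "na-adjective",
        "particle", "adverb", "noun"]
print(pos_ratios(tags))
```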
6.2 Distribution of word-classes

UniDic has a field for word-class information, i.e., Japanese (native Japanese), Sino-Japanese (historical loans from Chinese), loans (modern loans from English and other European languages), and their hybrids. Punctuation marks and various alphabetical abbreviations like "OS" and "HDD" are classified as belonging to the fifth word-class: symbol. Figure 6 shows the distribution of these word-classes in the SUW. The most striking genre-related difference seems to be the ratio of Sino-Japanese vocabulary. Registers OW and PN are the highest, while OC and PB are the lowest. Also, it is interesting to see that the sum of the ratios of Japanese and Sino-Japanese remains nearly constant (88–92 %) across all registers. These tendencies coincide fairly well with the observations reported in Koiso et al. (2009), who examined the distribution of word-classes in the BCCWJ during its construction.

[11] Application information can be found at http://www.ninjal.ac.jp/corpus_center/bccwj/subscription.html.
Fig. 5 Difference of POS distribution due to registers (stacked bars, 0–100 %; legend: Noun, Verb, Adv, Adj, Misc; registers in order: PN, OW, PM, PB, OY, OC)

Fig. 6 Distribution of word-classes (stacked bars, 0–100 %; legend: Japanese, Sino-Japanese, Loan, Hybrid, Symbol; registers in order: OC, PB, OY, PM, PN, OW)

6.3 Entropy of orthographic variations

In Sect. 1, it is suggested that newspaper articles are a class of text where orthographic variations are extensively suppressed; the implication is that there are other registers where the variations are relatively larger. This hypothesis is tested here by quantifying the degree of orthographic variation of the nouns in the six registers of the BCCWJ-Core.

To achieve this objective, two analyses are conducted. First, the ratio of the total number of variable nouns (i.e., the nouns that have more than two ways of being written) to the total number of all nouns (number in terms of types rather than tokens) was computed for each register. Figure 7 shows the results. The register that showed the lowest ratio is OW (white papers), and PN (newspapers) is the second lowest. On the other hand, registers OY (blog) and PM (magazines) showed the highest ratio of variable nouns.

The second analysis is the computation of information entropy. Entropy in information theory is a mathematical measure of the uncertainty associated with a random variable. The entropy of a fair coin toss is equal to 1.0 bit, while the entropy of one cast of a fair die is about 2.58 bits. Larger entropy implies larger uncertainty of the random variable (in this case, the orthography of a noun). Figure 8 is a composite bar graph showing two results of entropy computation simultaneously. The wide filled bars located behind the narrow dotted bars show the mean entropy of all variable nouns in each register, and the axis of entropy is on the left side of the figure. The narrow dotted bars show the mean entropy of all nouns (covering both variable and non-variable ones) in each register, and the axis is on the right side. In both cases, OW is the lowest and PN is the second lowest. On the other hand, the highest register differs depending on the way of computation. OC and OY are the highest with respect to variable nouns, while OY and PM are the highest with respect to all nouns. The fact that OY is always among the highest coincides well with the result shown in Fig. 7. And, as predicted, PN belongs to the class of registers where orthographic variation is very low (if not the lowest).
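The entropy measure can be made concrete with a short sketch. For a noun whose written forms occur with relative frequencies p_i, the entropy is H = Σ p_i log2(1/p_i). The function names and the sample spellings below are hypothetical, and the code is not the script used for Fig. 8.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def orthographic_entropy(written_forms):
    """Entropy of the written forms observed for one noun (one lemma)."""
    return entropy_bits(Counter(written_forms).values())


print(entropy_bits([1, 1]))        # fair coin
print(entropy_bits([1] * 6))       # fair die, log2(6)
# Hypothetical observations of one lemma written in three ways.
print(orthographic_entropy(["りんご", "リンゴ", "林檎", "りんご"]))
```

A noun that is always written the same way has an entropy of 0, which is why the mean over all nouns is much smaller than the mean over variable nouns only.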
Fig. 7 Ratio of variable nouns (vertical axis: %, 0.0–8.0; registers in order: OY, PM, OC, PB, PN, OW)

6.4 Sentence length

So far, text diversity was examined mostly with respect to morphological properties. Two syntactic properties are examined below. Probably, the simplest measure of sentential text diversity is the comparison of sentence length. Figure 9 is the comparison of mean sentence length as measured by the number of SUW in a sentence across registers. Punctuation marks are removed from the data. Sentences in the OW register are by far the longest. On the other hand, sentences are the shortest in the internet-related registers (OC and OY). The result of a one-way ANOVA is significant (DF = 5, F = 290.32, P < 0.001). Pairwise t tests with Bonferroni correction reveal that differences of all combinations of two registers are significant at the 0.001 level, except for the pairs of PB and PN, and PM and PN.
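The two statistical steps named above can be sketched in plain Python. The data are toy values, not BCCWJ measurements, and the function names are hypothetical. With six registers the between-group degrees of freedom are 6 − 1 = 5, and a Bonferroni correction over all 15 register pairs divides the significance level by 15.

```python
import math
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


def bonferroni_alpha(alpha, n_groups):
    """Per-comparison threshold when all pairs of groups are tested."""
    return alpha / math.comb(n_groups, 2)


toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]   # toy sentence lengths
print(one_way_anova_f(toy_groups))
print(bonferroni_alpha(0.001, 6))
```

Converting F into a P value additionally requires the F distribution, which a statistics package would normally supply.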
Fig. 8 Mean entropy of nouns (unit: bit; variable nouns on the left axis, all nouns on the right axis; registers in order: OC, OY, PB, PM, PN, OW)

6.5 Adjective predicate

The second syntactic property to be examined is the structure of i-adjective predicates. In Japanese, four POS categories can constitute a predicate, namely, nouns, na-adjectives, i-adjectives, and verbs. In the standard description of the Japanese grammar (see Teramura 1982 among others), it is acknowledged that nouns and na-adjectives need to be followed by a copula ("da" or "desu") to constitute a predicate, while i-adjectives and verbs constitute predicates by themselves without the presence of a copula. In everyday use of the language, however, i-adjectives in the predicate position are often followed by a copula "desu". Accordingly, "ano hon wa omoshiroi" ('that book is interesting', "ano" 'that', "hon" 'book', "wa" TOPIC, "omoshiroi" 'interesting') and "ano hon wa omoshiroi desu" are both observed in the real usage.

Figure 10 compares the relative frequencies of two types of i-adjective predicates. The legend "Adj" stands for the cases where i-adjectives constitute a predicate by themselves, and "Adj + desu" stands for the predicate with a copula. Here, three cautions are to be presented. First, cases where i-adjective + "desu" is followed by phrase-final particles like "ne", "yo", "ka" are all excluded. Second, cases where i-adjectives are followed by the conjectural form of copula "desho" are all excluded. These two cases have to be removed from the data because these predicates are widely recognized as grammatical in the traditional description of Japanese syntax. Third, the present data is based upon the analysis of the cases where i-adjectives constitute a predicate at sentence-final position. Subordinate sentences and coordinate sentences in non-final positions ending in i-adjective predicates are not included in the data.

The figure shows clearly that use of the "Adj + desu" predicate is heavily concentrated in two internet-related registers, OC and OY. Especially in OC, the predicate with a copula is in the majority. In registers PN, PB, and OW, on the other hand, no instance of "Adj + desu" predicate is found as far as the BCCWJ-Core is concerned. Analysis of the whole BCCWJ shows the same overall pattern, but a handful of "Adj + desu" predicates are found in registers PB, PN, LB, OB, and OT. The ratio of the "Adj + desu" in these registers is generally less than 1 % (Maekawa 2012).
6.6 Automatic classification of registers

Lastly, automatic classification of text registers is conducted using the measures of text diversity presented in the previous subsections. A subset of the BCCWJ-Core is constructed such that six registers have 50 samples each. Then, each sample in the subset is characterized by a vector comprised of the following eight variables: sentence length; the ratios of nouns, verbs, adverbs, and adjectives in the POS distribution; and the ratios of Japanese words, Sino-Japanese words, and loan words in the word-class distribution. All variables are z-normalized before the analysis. POS ratios of adverbs and nouns, and the word-class ratio of loan words, are log-transformed before the normalization.
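The preprocessing described here can be sketched per variable as follows. Two details are assumptions not stated in the text: the logarithm is taken as the natural logarithm, and the z-normalization uses the sample standard deviation. The function names and values are hypothetical.

```python
import math
from statistics import mean, stdev


def z_normalize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def preprocess(column, log_transform=False):
    """Optionally log-transform one variable, then z-normalize it."""
    if log_transform:
        column = [math.log(v) for v in column]
    return z_normalize(column)


# Toy ratios of loan words in three samples (invented values).
loan_ratio = [0.01, 0.02, 0.08]
print(preprocess(loan_ratio, log_transform=True))
```

The log transform compresses the long right tail that small ratios tend to have, so that a few samples with unusually many loan words do not dominate the normalized scale.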
Support vector machines, or SVM, were used as the classification tool. The function svm of the e1071 library of the R language (ver. 2.15.1) was used for analysis with the default setting of kernel (radial) and type (C-classification). SVM parameters were searched within the ranges between 10^-5 and 10^2, and 10^-2 and 10^2, respectively, for gamma and cost.

Table 8 shows the cross-tabulated results of the tenfold cross-validation. The rows and columns correspond respectively to correct and classified registers. The overall average rate of success is 0.83, and is much higher than the baseline (1/6 ≈ 0.17). It suggests strongly the usefulness of some of the measures of text diversity presented in this section as the cue for the characterization of texts in the BCCWJ.

At the same time, it turns out that the register PM is the most difficult to classify. A straightforward interpretation of this fact is that magazines belong to a heterogeneous register covering a wide range of texts overlapping considerably with the typical texts of other registers, especially PB and PN. Another tendency worth mentioning is the proximity between OC and OY, but this is not as salient a tendency as that of PM.

Fig. 9 Comparison of mean sentence length (vertical axis: SUW, 0–40; registers in order: OW, PB, PN, PM, OC, OY)
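The per-register picture can be made explicit from the counts in Table 8. The sketch below computes the rate of correct classification for each register (rows are correct registers, columns are classified registers). The function names are hypothetical. Pooling all counts gives 247/300 ≈ 0.82, close to the overall average of 0.83 reported above; the small gap may reflect how the average was taken.

```python
REGISTERS = ["OC", "OW", "OY", "PB", "PM", "PN"]

# Counts as read from Table 8 (each row sums to 50 samples).
CONFUSION = [
    [42, 0, 5, 2, 1, 0],    # OC
    [0, 49, 0, 0, 0, 1],    # OW
    [3, 0, 40, 3, 0, 4],    # OY
    [2, 0, 0, 47, 1, 0],    # PB
    [0, 1, 3, 12, 27, 7],   # PM
    [0, 0, 0, 3, 5, 42],    # PN
]


def recall_per_register(matrix, labels):
    """Share of each register's samples that were classified correctly."""
    return {label: row[i] / sum(row) for i, (label, row) in enumerate(zip(labels, matrix))}


def pooled_accuracy(matrix):
    """Correct classifications divided by all classifications."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


recalls = recall_per_register(CONFUSION, REGISTERS)
print(recalls)
print(min(recalls, key=recalls.get), pooled_accuracy(CONFUSION))
```

The PM row shows the pattern discussed in the text: 12 of its 50 samples are assigned to PB and 7 to PN.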
Table 8 Result of register classification by SVM

       OC   OW   OY   PB   PM   PN
OC     42    0    5    2    1    0
OW      0   49    0    0    0    1
OY      3    0   40    3    0    4
PB      2    0    0   47    1    0
PM      0    1    3   12   27    7
PN      0    0    0    3    5   42

Fig. 10 Relative frequencies of two types of i-adjective predicates (stacked bars, 0–100 %; legend: Adj + desu, Adj; registers in order: OW, PB, PN, PM, OY, OC)

7 Conclusion

BCCWJ is designed to be a reliable reference corpus of present-day written Japanese with as much representativeness as possible. Compilation of the BCCWJ terminated successfully in December 2011 with almost no delay in the original schedule. Pilot evaluations of the corpus revealed that the texts in the corpus are much more diverse than the texts previously analyzed in most of the corpus-linguistic studies of Japanese, i.e., the newspaper articles.

As stated above, the BCCWJ is publicly accessible in three different ways. As of October 2013, Shonagon has more than 297,000 accumulated visitors. Chunagon has more than 1,500 licensees. And there are more than 250 licensees of the DVD-release version. It is the belief of the present authors that the new era of corpus-based analysis of present-day Japanese is now being opened by the users of the BCCWJ.
Acknowledgments The authors are grateful to the more than 15 thousand copyright holders who gave us permission to use their materials in the BCCWJ. They are also grateful to their former colleagues whose names are not included in the authors' list. They played important roles in the compilation of the BCCWJ. Compilation of the BCCWJ was supported by the grant-in-aid for scientific research for priority-area research (Grant No. 18061009, 2006–2010) to the first author and the research budget of the former National Institute for Japanese Language. At the same time, this paper is the outcome of the collaborative research project "Foundation of Corpus Japanese Linguistics" of the current National Institute for Japanese Language and Linguistics. Lastly, the authors are grateful to John Whitman and Donna Erickson, who kindly checked various versions of this paper and provided us with many valuable comments.

Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
References

Asahara, M., & Matsumoto, Y. (2000). Extended models and tools for high-performance part-of-speech tagger. In Proceedings of the 18th international conference on computational linguistics (COLING 2000), pp. 21–27.

Den, Y., Nakamura, J., Ogiso, T., & Ogura, H. (2008). A proper approach to Japanese morphological analysis: Dictionary, model, and evaluation. In Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), pp. 1019–1024.

Fujiike, Y., Ogura, H., Konishi, H., Ogiso, T., Koiso, H., Uchimoto, K., & Kozawa, S. (2010). Gendai nihongo kakikotoba kinko kopasu ni okeru chotan'i kaiseki no shinchoku jokyo. In Tokutei ryoiki kenkyu nihongo kopasu heisei 21 nendo kokai wakushoppu yokoshu, National Institute for Japanese Language and Linguistics, pp. 93–99.

Kabashima, T., & Satake, H. (1978). Shin bunsho kogaku: Hyougen no kagaku. Tokyo: Sanseido.

Koiso, H., Ogiso, T., Ogura, H., & Miyauchi, S. (2009). Kopasu ni motozuku tayo na janru no buntai hikaku: Tantan'i joho ni chumoku shite. In Proceedings 15th annual meeting of the Association for Natural Language Processing, pp. 593–597.

Kozawa, S., Uchimoto, K., & Den, Y. (2011). BCCWJ ni motozuku chu-chotan'i kaiseki tsuru. In Tokutei ryoiki kenkyu nihongo kopasu heisei 22 nendo kokai wakushoppu yokoshu, National Institute for Japanese Language and Linguistics, pp. 331–338.

Kudo, T., Yamamoto, K., & Matsumoto, Y. (2004). Applying conditional random fields to Japanese morphological analysis. In Proceedings of EMNLP, 2004, pp. 230–237.

Kurohashi, S., & Nagao, M. (1998). Building a Japanese parsed corpus while improving the parsing system. In Proceedings 1st international conference on language resources and evaluation, pp. 719–724.

Kurohashi, S., Nakamura, T., Matsumoto, Y., & Nagao, M. (1994). Improvements of Japanese morphological analyzer JUMAN. In Proceedings international workshop on sharable natural language resources, pp. 22–28.

Maekawa, K. (2003). Corpus of spontaneous Japanese: Its design and evaluation. In Proceedings ISCA and IEEE workshop on spontaneous speech processing and recognition (SSPR 2003), pp. 7–12.

Maekawa, K. (2012). Keiyoushi + desu jutsugo no seiki yoin ni tsuite no junbiteki kosatsu. In Proceedings 1st Japanese corpus linguistics workshop, pp. 211–220.

Maekawa, K., Koiso, H., Furui, S., & Isahara, H. (2000). Spontaneous speech corpus of Japanese. In Proceedings 2nd international conference on language resources and evaluation, pp. 947–952.

Matsuda, K. (Ed.). (2008). Kokkai kaigiroku o tsukatta nihongo kenkyu. Tokyo: Hituji.

NINJAL (Ed.) (2011). BCCWJ riyo no tebiki (Electronic document delivered with the DVD-release version of the BCCWJ).

Teramura, H. (1982). Nihongo no shintakkusu to imi (Vol. 1). Tokyo: Kuroshio.

Uchimoto, K., & Isahara, H. (2007). Morphological annotation of a large spontaneous speech corpus in Japanese. In Proceedings 20th international joint conference of artificial intelligence, pp. 1731–1737.