Extracting Enterprise Vocabularies Using Linked Open Data

Julian Dolby, Achille Fokoue, Aditya Kalyanpur, Edith Schonberg, and Kavitha Srinivas
IBM Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, USA
{dolby,achille,adityakal,ediths,ksrinivs}@us.ibm.com

Abstract. A common vocabulary is vital to smooth business operation, yet codifying and maintaining an enterprise vocabulary is an arduous, manual task. We describe a process to automatically extract a domain-specific vocabulary (terms and types) from unstructured data in the enterprise, guided by term definitions in Linked Open Data (LOD). We validate our techniques by applying them to the IT (Information Technology) domain, taking 58 Gartner analyst reports and using two specific LOD sources: DBpedia and Freebase. We show initial findings that address the generalizability of these techniques for vocabulary extraction in new domains, such as the energy industry.

Keywords: Linked Data, Vocabulary Extraction.

1 Introduction

Most enterprises operate with their own domain-specific vocabularies. A vocabulary can be anything from a set of semantic definitions to a formal ontology. Vocabularies are necessary to work effectively in a global environment and to interact with customers. They facilitate common tasks, such as searching a product catalog or understanding general trends. However, building and maintaining a vocabulary manually is both time-consuming and error-prone.
This paper presents a process to fully automate the construction of a domain-specific vocabulary from unstructured documents in an enterprise. The constructed vocabulary is a set of terms and types, where we label each term by its type(s). We applied this process to the IT (information technology) domain, and automatically built an IT vocabulary taking 58 Gartner analyst reports as our input corpus. For this domain, we capture types such as Distributed Computing Technology, Application Server, and Programming Language, and use them to label terms like "Cloud Computing", "IBM WebSphere", and "SmallTalk".
Our approach is based on a simple observation: people searching for term definitions on the Web usually find answers in either a glossary or Wikipedia. We use LOD (Linked Open Data) as our source for domain-specific types. We decided to focus on two specific subsets of LOD as our reference data: DBpedia and Freebase. Both datasets derive from Wikipedia and thus have broad domain coverage. Also, both have type information, e.g., DBpedia associates entities with types from the YAGO and Wikipedia type hierarchies [1].

A. Bernstein et al. (Eds.): ISWC 2009, LNCS 5823, pp. 779–794, 2009. © Springer-Verlag Berlin Heidelberg 2009

To perform vocabulary extraction, a key step is to determine what the relevant domain-specific types are for a particular corpus. Simply looking up corpus terms in LOD is not adequate, since the terms may have different senses in LOD. Many or all of these senses may be unrelated to the domain. Therefore, we use statistical techniques to automatically isolate domain-specific types found in LOD. These types are then used to label corpus terms. This core vocabulary extraction process based on LOD is discussed in Section 2. For the Gartner case, this technique produced results with high precision (80%) but poor recall (23.8%).
To address this, we present two techniques to boost recall for vocabulary extraction. The first technique, presented in Section 3, directly improves recall by increasing the coverage in LOD. Numerous instances in DBpedia or Freebase have incomplete type information or none at all, which explains some of the poor recall. The second technique, presented in Section 4, improves recall by automated machine learning.
Our technique to improve coverage in LOD is a form of type inference based on the attributes of a particular instance, and based on fuzzy definitions of domains/ranges for these attributes. As an example, we can infer that an instance has a type Company if it has an attribute revenue, because statistically, across the DBpedia dataset, instances with the attribute revenue tend to have Company as a type. We performed this type inference on the entire LOD dataset, and then re-applied our core vocabulary extraction algorithm. Type inference improved recall to 37.6% without altering the precision.
Our technique to boost recall using machine learning relies on building statistical named entity recognition (NER) models automatically. We accomplished this by using seeds generated directly from LOD and exploiting structural information in both Wikipedia and DBpedia to generate high quality contextual patterns (features) for the model. The end result was an effective, general-purpose NER model that worked well across different corpora, i.e., it was trained on Wikipedia but applied to the domain-specific corpus, the IT analyst reports. Adding results from NER to our previous output gave us a net recall of 46% and precision of 78%.
We discuss strengths and limitations of our overall approach in Section 5. In particular, we describe initial findings that show the generalizability of the vocabulary extraction techniques in new domains. Finally, we discuss related work and conclusions in Section 6.
In summary, the main contributions of this paper are:
– We describe a process to automatically extract a domain-specific vocabulary from unstructured data in the enterprise using information in LOD. We validate our techniques by applying them to a specific domain, the IT industry, with a precision of 78% and recall of 46%.
– We describe a set of techniques to boost recall in vocabulary extraction. The first improves the coverage of structured information in DBpedia and Freebase (two core pieces of LOD). By making LOD more robust, it becomes more useful for a range of applications, including our current task of extracting vocabularies. The improved versions of LOD raised the recall of our vocabulary extraction process by about 14 percentage points (from 23.8% to 37.6%) without affecting precision. The second technique relies on automated named entity recognition, and raises recall by an additional 8 percentage points (to 46%).

2 Vocabulary Extraction

Our process extracts domain-specific terms from an unstructured text corpus, and labels these terms with appropriate domain-specific types. This is a different problem from traditional NER. Off-the-shelf NERs are typically trained to recognize a fixed set of high-level types such as Person, Organization, Location, etc. In our case, we need to discover the types for a specific domain, and use these discovered types to label terms in the corpus. We rely on sources outside the corpus, in particular LOD, since appropriate types may not even appear in the corpus.
We also note that, typically, domain-specific terms in a corpus are found using tf-idf scores. To the best of our knowledge, type information has not previously been used as an additional dimension to filter domain-specific terms, and this is one of the differentiators of our approach.
Briefly, our process performs the following steps:
1. Extract a population of terms (noun phrases) from the domain corpus.
2. Extract a seed set of domain-specific terms from the corpus using traditional tf-idf scores.
3. Extract domain-specific types from LOD, using the seed terms from step 2.
4. Filter the set of all terms from step 1, based on the domain-specific types from step 3.
The result is a set of domain-specific terms, labeled by their respective type(s). This allows certain relevant corpus terms that have low tf-idf scores to be selected based on their relevant domain-specific type. The following subsections provide the details of each of these steps, and results for the IT domain.

2.1 Term Population

We extract the noun phrases from a domain corpus, using a standard NLP part-of-speech tagger (from OpenNLP, http://opennlp.sourceforge.net/). These noun phrases comprise the population of all terms. For our case study, the domain corpus was 58 IT analyst reports from Gartner. The resulting term population consisted of approximately 30,000 terms.
To measure the precision and recall of our process, we created a gold standard from the term population. We randomly selected 1000 terms from the population, and four people judged whether each term in this sample was relevant to the domain. A term was considered relevant only if all four judges agreed. In the end, 10% of the sample terms were considered relevant to the IT domain. Therefore, we expect about 3,000 of the 30,000 terms in the population to be relevant.

2.2 Domain-Specific Seed Terms

The next step is to extract an initial set of domain-specific terms from the corpus based on traditional tf-idf metrics. These terms are subsequently used to look up types in LOD, and form the basis for domain-specific type selection. For this task, it is more important to find a precise set of domain-specific terms than to find all of the domain-specific terms. We use an off-the-shelf tool, GlossEx [2], that extracts words and noun phrases along with their frequency and domain-specificity. This associated data allows low-frequency, low-specificity terms to be filtered out. For our analyst report corpus, we selected proper noun phrases and common noun phrases with frequency ≥ 2 and with an appropriate tool-specific domain-specificity threshold. We obtained 1137 domain-specific terms from the Gartner IT reports.
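The filtering idea can be illustrated with a minimal sketch. GlossEx's actual domain-specificity score is not described here, so the smoothed relative-frequency ratio below is only a stand-in, and the function name, the background corpus, and the `min_specificity` value are all assumptions made for illustration.

```python
from collections import Counter

def select_seed_terms(domain_terms, background_terms, min_freq=2, min_specificity=2.0):
    """Keep terms that are frequent in the domain corpus and rare in a background corpus.

    The specificity ratio is a stand-in for GlossEx's score, not its real metric.
    """
    domain_counts = Counter(domain_terms)
    background_counts = Counter(background_terms)
    domain_total = sum(domain_counts.values())
    background_total = sum(background_counts.values())

    seeds = {}
    for term, freq in domain_counts.items():
        if freq < min_freq:
            continue  # frequency filter (the text uses frequency >= 2)
        domain_rate = freq / domain_total
        # Add-one smoothing so terms unseen in the background do not divide by zero.
        background_rate = (background_counts[term] + 1) / (background_total + 1)
        specificity = domain_rate / background_rate
        if specificity >= min_specificity:
            seeds[term] = (freq, specificity)
    return seeds

# Toy data: each list holds one entry per occurrence of a candidate term.
domain = ["WebSphere", "WebSphere", "cloud computing", "cloud computing",
          "report", "report", "year"]
background = ["report"] * 50 + ["year"] * 50 + ["weather"] * 100
seeds = select_seed_terms(domain, background)
print(sorted(seeds))
```

In this toy run "report" passes the frequency filter but is dropped for low specificity, while "year" is dropped for low frequency.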

2.3 Domain-Specific Types

Using the domain-specific seed terms, we discover a set of relevant interesting types from LOD. Our algorithm to discover domain-specific types is outlined in Table 1.
The first step is to find a corresponding LOD entity for a domain-specific term. The most precise and direct way to do this is to encode the term directly as an LOD URI (i.e., by adding the DBpedia URL prefix) and check if it exists. This produced matching entities for 588 of the terms.
Ideally it should be possible to simply look up the type(s) of each of these entities and mark them as interesting. However, this does not work, since a term can map to different types with different senses. For example, "Java" is a programming language and an island, and we may select the incorrect LOD entity and hence sense. Even if there is a single LOD entity for a term, it may not be the sense that is relevant for the domain. For example, the term "FairWarning" is a software product, but there is only one type sense in LOD, which is Category:MusicAlbum (pointing to an album with the same name released by Van Halen).
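The direct lookup described above can be sketched as follows. This is a minimal illustration, assuming the standard DBpedia resource prefix and the Wikipedia convention of replacing spaces with underscores; `known_uris` is a hypothetical in-memory set standing in for the DBpedia dump, and the sketch does not handle capitalization variants or redirects.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE_PREFIX = "http://dbpedia.org/resource/"

def term_to_dbpedia_uri(term):
    """Encode a corpus term as a candidate DBpedia resource URI."""
    # Resource names follow Wikipedia titles: whitespace becomes underscores.
    name = "_".join(term.strip().split())
    return DBPEDIA_RESOURCE_PREFIX + quote(name, safe="_()',:")

def lookup_seed_entities(terms, known_uris):
    """Return {term: uri} for the terms whose encoded URI exists in known_uris."""
    matches = {}
    for term in terms:
        uri = term_to_dbpedia_uri(term)
        if uri in known_uris:
            matches[term] = uri
    return matches

# Hypothetical stand-in for the set of URIs present in the DBpedia dump.
known_uris = {"http://dbpedia.org/resource/IBM_WebSphere",
              "http://dbpedia.org/resource/Java"}
matches = lookup_seed_entities(["IBM WebSphere", "Java", "B2B Gateway"], known_uris)
print(matches)
```

Note that the match for "Java" says nothing about which sense the entity carries, which is exactly the ambiguity discussed above.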

Table 1. Extraction of Domain-Specific Types

Input: Sdt: set of domain-specific seed terms; threshold parameters αY, αF, αC
Output: τ: set of domain-specific types from LOD
(1) initialize potential type list τp ← ∅
(2) for each domain-specific seed T ∈ Sdt
(3)     encode T as a DBpedia URI U
(4)     τp ← τp ∪ types(U)
        where types(U) = values of prop. rdf:type for U
                       ∪ values of prop. skos:subject for U
                       ∪ 'equivalent' types obtained via mappings to Freebase
(5) τ ← T ∈ τp if (freq(T) ≥ αY and T is a YagoType)
               or (freq(T) ≥ αF and T is a FreebaseType)
               or (freq(T) ≥ αC and T is a WikiCategory)
(6) for each type X ∈ (τp − τ)
(7)     if there exists a type Y ∈ τ s.t. LOD |= X ⊑ Y
(8)         τ ← τ ∪ {X}

To address this problem, we filter out uninteresting types using simple statistical information. We score LOD types based on the number of terms they match across all the documents in our corpus and filter out infrequent types (those whose frequency is below a pre-determined threshold). One issue here is the different kinds of type information in LOD: YAGO types from DBpedia, Freebase types, and Categories that come from Wikipedia. We found that having separate frequency thresholds (αY, αF, αC respectively in Table 1) for each of the kinds of types produced better results.
Another issue we noticed with DBpedia is that several entities do not have any values in their rdf:type field, but had interesting type information in the skos:subject field. For example, the term "NetBIOS" has no rdf:type, though its skos:subject is mentioned as Category:Middleware. Hence, we take SKOS subject values into account as well when computing types for a seed term (step 4 of the algorithm). Also, DBpedia and Freebase come with different types for the same instance, and we exploit these in step 4 to obtain additional related type information from Freebase.
Filtering also removes several low frequency types that are interesting, e.g., yago:XMLParsers. To address this issue, we consider low frequency types as interesting if they are subsumed by any of the high frequency types. For example, yago:XMLParsers is subsumed by yago:Software in the YAGO type hierarchy, and is thus considered relevant as well (steps 6-8 of the algorithm).
We ran the type-discovery algorithm with 1137 seed terms, setting the type-frequency thresholds (αY = 4, αF = 4, αC = 4) based on manual inspection of highly frequent types in τp. This produced 170 interesting types. Upon manual inspection, we found the output to have extremely high precision (98%). As expected, there is a tradeoff between precision and recall: reducing the type-frequency thresholds increases the size of the output type set but decreases its precision.
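The selection and rescue steps of Table 1 can be sketched as below. This is an illustrative reading rather than the authors' implementation: the input dictionaries are assumed to have been built beforehand from LOD (steps 2-4), `supertypes` is assumed to hold the transitive closure of the type hierarchy, and the toy type assignments are made up for the example.

```python
from collections import Counter

def discover_domain_types(seed_types, kind_of, thresholds, supertypes):
    """Sketch of Table 1, steps 4-8.

    seed_types: {seed term: set of types found for its LOD entity}
    kind_of:    {type: "yago" | "freebase" | "category"}
    thresholds: {"yago": alpha_Y, "freebase": alpha_F, "category": alpha_C}
    supertypes: {type: set of all (transitive) supertypes}
    """
    # Count how many seed terms carry each candidate type (freq in Table 1).
    freq = Counter()
    for types in seed_types.values():
        freq.update(types)

    # Step 5: keep types that reach the threshold for their kind.
    frequent = {t for t, n in freq.items() if n >= thresholds[kind_of[t]]}

    # Steps 6-8: rescue infrequent types subsumed by a frequent type. With
    # transitive supertypes, checking against `frequent` is equivalent to
    # checking against the growing output set.
    selected = set(frequent)
    for t in freq:
        if t not in frequent and supertypes.get(t, set()) & frequent:
            selected.add(t)
    return selected

# Toy inputs (illustrative only).
seed_types = {
    "WebSphere": {"yago:Software"},
    "Lotus Notes": {"yago:Software"},
    "DB2": {"yago:Software"},
    "Eclipse": {"yago:Software"},
    "Xerces": {"yago:XMLParsers"},
    "Java": {"yago:Island"},
}
kind_of = {"yago:Software": "yago", "yago:XMLParsers": "yago", "yago:Island": "yago"}
thresholds = {"yago": 4, "freebase": 4, "category": 4}
supertypes = {"yago:XMLParsers": {"yago:Software"}}

domain_types = discover_domain_types(seed_types, kind_of, thresholds, supertypes)
print(sorted(domain_types))
```

Here yago:Island is filtered out as infrequent, while yago:XMLParsers is kept because it is subsumed by the frequent type yago:Software.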

2.4 Filtered Terms and Types and Discussion of Initial Results

Once we have the set of domain-specific types, we revisit the entire population of corpus terms from Section 2.1. We find terms in this population that belong to one of the domain-specific types. This set is more comprehensive than the initial seed set in Section 2.2.
Specifically, we select the terms from the population that are 'closely related' to entities in LOD which belong to at least one domain-specific type. In order to find 'closely related' entities, we perform a keyword search for each term over a database, populated with data from DBpedia and Freebase, and indexed using Lucene. We select matches with a relevance score greater than 0.6. For example, searching for the term "WebSphere" over the Lucene index produces entity matches such as "IBM WebSphere", whose corresponding rdf:type/skos:subject value belonged to one of our domain-specific types. Therefore, "WebSphere" is selected as a domain-specific term.
This process resulted in 896 type-labeled terms using the 170 interesting types found in the previous step. We evaluated the precision of both the terms and their types by manually evaluating a 200 term sample. This gave us a precision of 80%. We also computed recall by taking into account our gold standard estimate and the precision, and found that recall was 23.8%.
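One plausible reading of this recall computation, sketched below, divides the estimated number of correct extracted terms by the number of relevant terms expected from the gold standard. The function and parameter names are illustrative, and the formula is an assumption rather than the authors' stated method; with the rounded inputs given in the text it yields about 23.9%, close to the reported 23.8% (the population size is only given as approximately 30,000).

```python
def estimated_recall(num_extracted, precision, population_size, relevant_fraction):
    """Estimate recall as (correct extracted terms) / (expected relevant terms)."""
    expected_relevant = population_size * relevant_fraction
    estimated_correct = num_extracted * precision
    return estimated_correct / expected_relevant

# 896 extracted terms at 80% precision; 10% of roughly 30,000 terms are relevant.
recall = estimated_recall(896, 0.80, 30_000, 0.10)
print(f"{recall:.1%}")
```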
Although the precision of our initial results was reasonable, our recall was poor. Our first approach to improving recall was to directly improve the coverage of types in LOD. The techniques developed are outlined in the next section.

3 Improving Coverage of LOD

In this section, we discuss techniques to add new knowledge to our reference LOD datasets, DBpedia and Freebase. For DBpedia, we used data dumps for DBpedia 3.1. For Freebase, we used the WEX data dumps from July [3]. Some instances have no types in this combined dataset, which is a problem when using the dataset to perform vocabulary extraction for any domain. We therefore enhanced the type information by linking instances and types across datasets, and by inferring new types for instances.

3.1 Adding Type Information from Linking

An obvious step to improve type coverage in Linked Open Data is to leverage the fact that DBpedia and Freebase might have different types for the same instance. We linked DBpedia instances to Freebase instances using their shared Wikipedia name. This technique allowed us to match all 2.2 million Freebase instances except 4,946, mainly because of differences in Wikipedia versions between the two datasets.
Although linking instances achieves the aggregation of types across Freebase and DBpedia, the two type systems are still disconnected. It would be useful to know mappings between the two sets of types for vocabulary extraction. For instance, if we knew both that a type in DBpedia such as yago:Byway102930645 maps to the corresponding Freebase type freebase:/transportation/road, and also knew that yago:Byway102930645 is an interesting domain-specific type, then freebase:/transportation/road is likely an interesting type as well.
DBpedia has 159,379 types and Freebase has a much smaller set of 4,158 types that are specified at a coarser level of granularity. We therefore used the relative frequency with which a given DBpedia type A co-occurs with a Freebase type B to drive the mapping. We considered a mapping valid if the conditional probability p(FreebaseType | DBpediaType) was greater than .80. Manual inspection of a random sample of 110 pairings revealed that 88% of the mappings were correct. With this technique, we were able to map 91,558 out of 152,696 DBpedia types to Freebase types (we excluded mappings to the Freebase type /common/topic because it is a top-level type like owl:Thing or yago:Entity). In all, because a single DBpedia type can map to multiple Freebase types (e.g., yago:InternetCompaniesEstablishedIn1996 is mapped to freebase:/business/company and freebase:/business/employer), we had 140,063 mappings with the 80% threshold.
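The co-occurrence mapping can be sketched as follows. This is a minimal illustration under the assumption that the linked instances are available as pairs of type sets; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def map_types(instance_types, threshold=0.80, excluded=("/common/topic",)):
    """Map DBpedia types to Freebase types by co-occurrence on linked instances.

    instance_types: iterable of (dbpedia_types, freebase_types), one per instance.
    Returns {dbpedia_type: {freebase_type: estimated conditional probability}}.
    """
    dbpedia_count = Counter()
    pair_count = Counter()
    for dbpedia_types, freebase_types in instance_types:
        for d in set(dbpedia_types):
            dbpedia_count[d] += 1
            for f in set(freebase_types):
                pair_count[d, f] += 1

    mappings = defaultdict(dict)
    for (d, f), n in pair_count.items():
        p = n / dbpedia_count[d]  # estimate of p(FreebaseType | DBpediaType)
        if p > threshold and f not in excluded:
            mappings[d][f] = p
    return dict(mappings)

# Toy data: five linked instances of one DBpedia type (illustrative only).
yago_type = "yago:InternetCompaniesEstablishedIn1996"
common = ["/business/company", "/business/employer", "/common/topic"]
linked = [([yago_type], common)] * 4 + [([yago_type], common + ["/organization/organization"])]

mappings = map_types(linked)
print(sorted(mappings[yago_type]))
```

A single DBpedia type can end up with several Freebase types, as in the example from the text; the type seen on only one of the five instances falls below the threshold, and /common/topic is excluded explicitly.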

3.2 Type Inference

We propose a simple statistical technique to extract fuzzy domain and range restrictions for properties of an individual, and use these restrictions to perform type inference, as we describe below.
To illustrate how our technique works in DBpedia, take as an example the instance yago:Ligier, which is a French automobile maker that makes race cars. DBpedia lists yago:FormulaOneEntrants and yago:ReliantVehicles as its types, neither of which is a company. Yet, dbpedia:Ligier has properties specific to companies, such as dbpedia-owl:Company#industry, dbpedia-owl:Company#parentCompany, etc. If we had a predefined ontology, where dbpedia-owl:Company#industry had a domain of the type yago:Company, we could have used RDFS or OWL reasoning to infer that yago:Ligier is really a Company. However, manually defining domain and range restrictions is not an option, because DBpedia has 39,345 properties. We therefore used statistical techniques to define fuzzy notions of domains and ranges to perform type inference.
Our approach to type inference is based on correlating what properties an entity has with its explicit types. The idea is that if many instances of a particular type have a certain set of edges, then other entities with that same set of edges probably are instances of that type too. In performing this type of inference, we relied on the DBpedia to Freebase mappings we established in Section 3.1 to infer Freebase types. As discussed earlier, because Freebase types are specified at a coarser level of granularity, type inference is more robust for these types because of the larger sample size of instances.
More formally, we define the notion of a property implying a type based on the fraction of the edges with that property that pertain to instances of that type being greater than some threshold τ. Note that this notion applies to both subjects and objects of edges.

\[ I_{\mathrm{subj}}(p,t) \equiv \frac{|\{p(x,y) \mid x:t\}|}{|\{p(x,y) \mid \exists t_1\; x:t_1\}|} > \tau \]
\[ I_{\mathrm{obj}}(p,t) \equiv \frac{|\{p(x,y) \mid y:t\}|}{|\{p(x,y) \mid \exists t_1\; y:t_1\}|} > \tau \]

where i:t is an rdf:type assertion between an instance i and a type t, and p(i,x) is a role assertion which links instance i to instance x on a property p. This step can be thought of as inferring domains and ranges for properties, as was done in [4]; however, rather than use these types directly as such constraints, we use them in a voting scheme to infer types for subject and object instances.
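Given the notion of a property implying a type, we define the notion of properties voting for a type, by which we simply mean how many of a given instance's properties imply a given type:

\[ V(i,t) \equiv \left|\{\, p \mid (\exists x\; p(i,x) \wedge I_{\mathrm{subj}}(p,t)) \vee (\exists x\; p(x,i) \wedge I_{\mathrm{obj}}(p,t)) \,\}\right| \]

We additionally define the notion of all edges that take part in voting, i.e., the number of edges that pertain to a given instance that imply any type:

\[ V_{\mathrm{any}}(i) \equiv \left|\{\, p \mid \exists t\, [(\exists x\; p(i,x) \wedge I_{\mathrm{subj}}(p,t)) \vee (\exists x\; p(x,i) \wedge I_{\mathrm{obj}}(p,t))] \,\}\right| \]

Finally, given the notion of voting, we define the implied types of an instance simply as those types that receive the greatest number of votes from properties of that instance compared to the total number of properties of that instance that could vote for any type:

\[ T(i) \equiv \left\{\, t \;\middle|\; (\forall t_1\; V(i,t) \geq V(i,t_1)) \wedge \frac{V(i,t)}{V_{\mathrm{any}}(i)} \geq \lambda \,\right\} \]

We applied this technique to our data, setting τ and λ to .5. We inferred types for 1.1M instances of the 2.1M instances in Linked Open Data. To help evaluate these types, we next introduce a technique to automatically detect when we might have inferred invalid types.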
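The voting scheme can be sketched in a few lines. The code below is an illustrative reading of the definitions of I_subj, I_obj, V, V_any and T, assuming the data fits in memory as (subject, property, object) triples plus a dictionary of explicit types; the function names and the toy triples are made up for the example.

```python
from collections import Counter, defaultdict

def implied_types(triples, types, position, tau):
    """Types implied by each property for subjects (position=0) or objects (position=2)."""
    typed_edges = Counter()    # edges of p whose subject/object has at least one type
    edges_by_type = Counter()  # edges of p whose subject/object has type t
    for triple in triples:
        p, node = triple[1], triple[position]
        node_types = types.get(node, set())
        if node_types:
            typed_edges[p] += 1
            for t in node_types:
                edges_by_type[p, t] += 1
    implied = defaultdict(set)
    for (p, t), n in edges_by_type.items():
        if n / typed_edges[p] > tau:
            implied[p].add(t)
    return implied

def infer_types(instance, triples, types, tau=0.5, lam=0.5):
    """Return T(instance): top-voted types that win at least a fraction lam of the voters."""
    subj = implied_types(triples, types, 0, tau)
    obj = implied_types(triples, types, 2, tau)
    voters = defaultdict(set)  # type -> properties of `instance` that imply it
    for s, p, o in triples:
        if s == instance:
            for t in subj.get(p, ()):
                voters[t].add(p)
        if o == instance:
            for t in obj.get(p, ()):
                voters[t].add(p)
    if not voters:
        return set()
    v_any = len(set().union(*voters.values()))
    best = max(len(ps) for ps in voters.values())
    return {t for t, ps in voters.items() if len(ps) == best and best / v_any >= lam}

# Toy data (illustrative only): Ligier lacks the Company type but has company-like edges.
triples = [
    ("IBM", "revenue", "v1"), ("Oracle", "revenue", "v2"),
    ("SAP", "revenue", "v3"), ("Ligier", "revenue", "v4"),
    ("IBM", "industry", "IT"), ("Oracle", "industry", "IT"),
    ("Ligier", "industry", "Automotive"),
]
types = {"IBM": {"Company"}, "Oracle": {"Company"}, "SAP": {"Company"},
         "Ligier": {"FormulaOneEntrants"}}

inferred = infer_types("Ligier", triples, types)
print(inferred)
```

Both of Ligier's properties imply Company in this toy dataset, so Company receives all of the available votes and clears the λ threshold.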

3.3 Evaluating Type Inferences

At its core, the detection of invalid type inference is based on the observation that if two types are known to be logically disjoint, such as Person and Place, and we infer a type that is disjoint with any of the explicitly asserted types of an instance, this constitutes an error in our type inference. In prior work, ontology reasoning has been used to automatically detect invalid type inference in text extraction [5]. However, extending this approach to Linked Open Data is not easy. Although YAGO types are organized in a hierarchy, there are no obvious levels in the hierarchy to insert disjoints. Freebase has no hierarchy at all: the type structure is completely flat. Our approach therefore was to first statistically define a type hierarchy for Freebase, and then use that hierarchy to define disjoint classes to detect invalid type inferences.
Because Freebase has no type structure, each instance in Freebase is annotated with a flat set of types, in which more-general types occur along with less-general types.
We use this fact to approximate the usual notion of supertype: a supertype Y by definition contains all the instances of its subtype X, and thus, in normal circumstances, we would expect to see the following, where P denotes probability:

\[ P(i \in Y \mid i \in X) = 1 \]

Because the Freebase data is noisy, this probability will most likely be less than 1; to account for this, we can recast the above constraint as follows:

\[ P(i \in Y \mid i \in X) > \tau \]

Thus, X ⊑ Y if instances of X are almost always instances of Y. This allows us to define groups of types that are related as subtypes; in particular, related groups are maximal groups that are closed under subtype, i.e., G is a group if and only if

\[ \forall x, y:\; x \in G \wedge (y \sqsubseteq x \vee x \sqsubseteq y) \Rightarrow y \in G \]

This definition gave us 78 groups, with τ set to .65. Of these, 26 groups were singletons, and the largest group had 409 types. We further manually grouped the 78 groups by those that appeared to belong to the same domain; this gave us 35 larger groups of types, which covered 1,281 types out of 4,158 types. In practice, these groups were overwhelmingly disjoint, and we dropped the few types that occurred in multiple groups. The 35 groups of types were declared as pairwise disjoint (i.e., each type T in group A was declared disjoint from each type Q in group B).
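The statistical subtype relation and the resulting groups can be sketched as below. This is one possible reading, in which the groups are the connected components of the approximate-subtype graph (computed with a small union-find); the function names and the toy instances are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations

def approximate_subtypes(instance_types, tau=0.65):
    """Return pairs (X, Y) with P(i in Y | i in X) > tau."""
    count = Counter()     # number of instances carrying each type
    together = Counter()  # number of instances carrying both X and Y
    for types in instance_types:
        types = set(types)
        count.update(types)
        together.update(permutations(types, 2))
    return {(x, y) for (x, y), n in together.items() if n / count[x] > tau}

def related_groups(all_types, subtype_pairs):
    """Group types connected by the subtype relation in either direction."""
    parent = {t: t for t in all_types}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for x, y in subtype_pairs:
        parent[find(x)] = find(y)

    groups = {}
    for t in all_types:
        groups.setdefault(find(t), set()).add(t)
    return list(groups.values())

# Toy data (illustrative only): every company is an employer, but not vice versa.
instances = (
    [{"/business/company", "/business/employer"}] * 3
    + [{"/business/employer"}] * 3
    + [{"/music/album"}]
)
pairs = approximate_subtypes(instances)
all_types = {t for types in instances for t in types}
groups = related_groups(all_types, pairs)
print(pairs, groups)
```

In the toy data, company is an approximate subtype of employer (3 of 3 instances) but not the reverse (3 of 6), and the album type forms a singleton group.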
To evaluate the 1.1M inferred types better, we divided them into 3 categories: (i) Verified, for the entities for which at least one of the inferred types is the same as an explicitly declared one (this category had 808,849 instances); (ii) Additional inferred types, for the entities for which the inferred types were not disjoint with any existing type assertion, i.e., these denote additional inferred type assertions that help improve coverage in DBpedia (this category had 279,407 instances); and finally (iii) Invalid, for the entities for which at least one of the inferred types conflicted with an explicitly asserted type (this category had 6,874 instances).
To determine the accuracy of our inferred types, we took samples of the invalid and additional inferred types, and evaluated the precision for these categories with two random samples of 200 instances each. An instance was considered to be typed correctly if all the inferred types for an instance were correct. In the additional inferred types category, we typed 177 instances correctly, and 23 incorrectly. In the invalid category, we typed 21 instances correctly, and 179 incorrectly. Taking the overall results for all the categories into account, we achieved a net recall of 49.1% and an estimated precision of 95.8%. Note that we have the usual trade-off between precision and recall based on values for the parameters τ and λ; however, we achieved a reasonably high F-score of 64.9.
We added only the Additional inferred types into our version of Linked Open Data, and re-ran our vocabulary extraction. The results are described in the next section.

3.4 Results with Improved LOD

Note that the previous execution of the vocabulary extraction algorithm (Table 1) using off-the-shelf LOD datasets produced 170 interesting types and 896 type-labeled terms in the output, with a precision of 80% and recall of 23.8%. We re-applied the algorithm with the newly discovered type assertions added to DBpedia and the new type mappings from DBpedia to Freebase (and the same input parameters as earlier) and discovered 188 interesting types and a net output of 1403 type-labeled terms. Manual inspection of a 200 term sample revealed that the precision was unaltered, and recall had increased to 37.6%, which validated our coverage enhancement techniques.

4 Improving Coverage Using Statistical NER

Since LOD coverage is incomplete, the techniques described above do not produce a complete domain-specific vocabulary. From our gold standard, described in Section 2.1, we expect to find around 3K domain-specific terms, and the output of the previous step (1403 terms) still falls well short of that. In order to improve the coverage of our solution, we automatically build an NER model for domain-specific types.
The previous step produced a large number of interesting types (>170). These include YAGO types, Freebase types and Wikipedia Categories, which have related groups, such as yago:ComputerCompanies, freebase:venture_funded_company and Category:CompaniesEstablishedIn1888, all conceptually sub-classes of Company. Given the large number of closely related types, it does not make sense to build an NER model for each of the types. Instead we decided to look only at top-level types in the output (types that were not subsumed by any other). Furthermore, given the noise in the type-instance information in LOD, we decided to restrict ourselves to YAGO types that have WordNet sense IDs attached to them (e.g., yago:Company108058098), since they have a precise unambiguous meaning and their instances are more accurately represented in LOD. In our case, this yields five YAGO/WordNet types: yago:Company108058098, yago:Software106566077, yago:ProgrammingLanguage106898352, yago:Format106636806 and yago:WebSite106359193.
The process of building a statistical model to do NER is inspired by techniques described in systems such as Snowball [6] and PORE [7]. The basic methodology is the following: start with a set of training seed tuples, where each tuple is an <instance, type> pair; generate a set of 'textual patterns' (or features) from the context surrounding the instance in a text corpus; and build a model to learn the correlation between contextual patterns for an instance and its corresponding type. We combine the best ideas from both Snowball and PORE and make significant new additions (see the Related Work (Section 6) for a detailed comparison).
We could not afford to train the model on the IT corpus itself for two reasons: (i) lack of sufficient contextual data (we only had 58 reports), and (ii) lack of adequate training seed tuples (even if we took the most precise term-type pairs generated in the previous section, it was not enough data to build a robust model). However, Wikipedia, combined with LOD, provides an excellent and viable alternative. This is because we can automatically obtain the training seed tuples from DBpedia, without being restricted to our domain-specific terms. We look for instances of the YAGO/WordNet types in DBpedia, and find the context for these instances from the corresponding Wikipedia page. For our learning phase, we took either 1000 training seed instances per type or as many instances as were present in LOD (e.g., yago:ProgrammingLanguage106898352 had only 206 instances). This gave us a total of 4679 seed instances across all five types.
A key differentiator in our solution is the kind of text patterns we generate (by patterns here, we mean a sequence of strings). For example, suppose we want to detect the type Company. The text pattern [X, acquired, Y], where X and Y are proper nouns, serves as a potentially interesting pattern to infer that X is of type Company. However, a more selective pattern is the following: [X, acquired, <Company>]. Knowing that Y is of type Company makes a stronger case for X to be a Company. Adding type information to patterns produces more selective patterns.

Table 2. Pattern Generation Algorithm

Input: Sentence S containing training instance I, entities with Wikipedia URLs WN1..WNk; and complete Type-Outcome set for model OT
Output: Set CP of patterns (string sequences)
(1) Run S through OpenNLP POS tagger to get token sequence TK: [<word, POS>, ...]
(2) Remove tokens in TK where the word is an adverb, modifier or determiner
(3) Replace common nouns/verbs in TK with respective word stems using WordNet
(4) for each occurrence of the <I, POS> pair in TK (where posi is the position index of the pair)
(5)     SPANS ← ExtractSpans(TK, posi)
(6)     for each [start-pos, end-pos] ∈ SPANS
(7)         TKspan ← subsequence TK(start-pos, end-pos)
(8)         CP ← CP ∪ word sequence in TKspan
(9)         CP ← CP ∪ word sequence in TKspan replacing proper nouns/pronouns with resp. POS tag
(10)        CP ← CP ∪ word sequence in TKspan replacing WN1..WNk with corresponding types and their resp. super-types from LOD (provided that type is in OT)
(11) Remove adjectives (JJ) from TK; repeat (8)-(10) once
(Note: When generating patterns in steps (8)-(10), we replace training instance I by tagged variable 'X:POS(I)')

Subroutine: ExtractSpans(TK, posi)
(1) SPANS ← ∅
(2) for each j, 0 ≤ j ≤ (posi − 1)
(3)     if TK(j).POS = verb or noun
(4)         SPANS ← SPANS ∪ [j, posi]
(5) for each j, (posi + 1) ≤ j ≤ length(TK)
(6)     if TK(j).POS = verb or noun
(7)         SPANS ← SPANS ∪ [posi, j]
(8) return SPANS

To add precise type information, we exploit the structure of Wikipedia and DBpedia.
In Wikipedia, each entity sense has a specific page, and other Wikipedia entities mentioned on a page are typically hyperlinked to pages with the correct sense. E.g., the "Oracle Corporation" page on Wikipedia has the sentence "Oracle announces bid to buy BEA". In this sentence, the word BEA is hyperlinked to the 'BEA Systems' page on Wikipedia (as opposed to Bea, a village in Spain). Thus, using the hyperlinked Wikipedia URL as the key identifier for a particular entity sense, and obtaining type information for the corresponding DBpedia URL, enables us to add precise type information to patterns. For the example sentence above, and the seed <Oracle, Company>, we generate the pattern [X:NNP, announces, bid, to, buy, <Company>] because "BEA Systems" has the type <Company> (among others) in LOD, while X here is a variable representing the seed instance Oracle, and is tagged as a proper noun.
Moreover, not only do we substitute a named entity in a pattern by all its corresponding types in LOD, we add in super-type information, based on the type hierarchy in LOD. We only focus on the YAGO/WordNet types in the hierarchy, which are the most precise. This generalization of patterns further helps improve recall of the model.

Table 3. Sample High Scoring Patterns

Type       Pattern
Company    [, be, acquire, by, X:NNP]
Software   [, release, version, of, X:NNP]
ProgLang   [, write, in, X:NNP]
Format     [encode, X:NNP]
Website    [X:NNP, forum]

Table 4. Evaluating our NER model

             No Feedback            With Feedback
Domain       Prec.   Rec.   F       Prec.   Rec.   F
Wikipedia    71.1    41.5   52.4    69.5    42.5   52.8
IT Corpus    76.4    38.6   51.3    76.5    52.3   62.1

Besides using type information, we also use a stemmer/lemmatizer (using a Java WordNet API, http://sourceforge.net/projects/jwordnet) to generalize patterns, and a part-of-speech tagger to eliminate redundant words (e.g., determiners) and add POS information for proper nouns and pronouns in patterns. Details of our pattern generation are described in the algorithm in Table 2. We use the text patterns as features to train a Naive Bayesian classifier that recognizes the concerned types.
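The ExtractSpans subroutine of Table 2, together with step (8) of the main algorithm, can be sketched as follows. This is a minimal illustration that assumes zero-based token positions and Penn Treebank tags (so "verb or noun" is read as tags starting with VB or NN); the stemming, POS substitution, and type substitution of steps (3), (9) and (10) are left out, and the helper names are hypothetical.

```python
def extract_spans(tokens, pos_i):
    """Spans between the training instance at pos_i and each verb or noun around it.

    tokens: list of (word, pos_tag) pairs; positions are zero-based, so the
    right-hand scan stops at len(tokens) - 1.
    """
    def is_verb_or_noun(tag):
        return tag.startswith("VB") or tag.startswith("NN")

    spans = []
    for j in range(0, pos_i):                  # anchors to the left of the instance
        if is_verb_or_noun(tokens[j][1]):
            spans.append((j, pos_i))
    for j in range(pos_i + 1, len(tokens)):    # anchors to the right of the instance
        if is_verb_or_noun(tokens[j][1]):
            spans.append((pos_i, j))
    return spans

def word_patterns(tokens, pos_i):
    """Step (8) only: the word sequence of each span, with the instance shown as X:POS."""
    patterns = []
    for start, end in extract_spans(tokens, pos_i):
        words = [f"X:{tag}" if k == pos_i else word
                 for k, (word, tag) in enumerate(tokens[start:end + 1], start)]
        patterns.append(words)
    return patterns

# The example sentence from the text, tagged by hand for illustration.
tokens = [("Oracle", "NNP"), ("announces", "VBZ"), ("bid", "NN"),
          ("to", "TO"), ("buy", "VB"), ("BEA", "NNP")]
spans = extract_spans(tokens, 0)
patterns = word_patterns(tokens, 0)
print(spans)
print(patterns[-1])
```

The longest span covers the whole sentence; replacing "BEA" in it with its LOD type would give the typed pattern discussed in the text.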
Finally, we repeat the recognition phase. Newly recognized term-type tuples are fed back into the system and used to rescore patterns taking in the new contexts, and also to generate new contexts for the remaining unrecognized terms by adding in type information. This process repeats until nothing changes. This feedback loop is effective, since we produce several patterns with type information in them, and these patterns are not applicable unless at least some terms in the context already have types assigned. For example, the sentence fragment "IBM acquired Telelogic" appears in our text corpus; initially, we detect that the term "IBM" has type Company, and feeding this information back to the system helps the machine recognize that "Telelogic" is a Company as well (based on the pattern [<Company>, acquired, X] for Type(X): Company). We have to be careful during the feedback process, since terms that have incorrectly recognized types, when fed back to the system, may propagate additional errors. To prevent this, we only feed back terms whose types have been recognized with a high degree of confidence (Pr(Ti) > 0.81). Some sample high scoring patterns captured by our model are shown in Table 3.
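The control flow of the feedback loop can be sketched as a fixed-point iteration. The sketch below covers only the loop and the confidence gate; `classify` is a hypothetical stand-in for the trained classifier, the toy rule replaces real pattern matching, and the rescoring of patterns is omitted.

```python
def run_feedback(terms, classify, threshold=0.81):
    """Repeat recognition, feeding back confident results, until nothing changes.

    classify(term, known) -> (type or None, confidence) is a hypothetical
    stand-in for the classifier; `known` holds the term-type pairs accepted so far.
    """
    known = {}
    changed = True
    while changed:
        changed = False
        for term in terms:
            if term in known:
                continue
            predicted, confidence = classify(term, known)
            # Only confident predictions are fed back, to limit error propagation.
            if predicted is not None and confidence > threshold:
                known[term] = predicted
                changed = True
    return known

def toy_classify(term, known):
    """Toy rule for the fragment "IBM acquired Telelogic" (illustrative only)."""
    if term == "IBM":
        return "Company", 0.95
    if term == "Telelogic" and known.get("IBM") == "Company":
        # The typed pattern [<Company>, acquired, X] applies once IBM is typed.
        return "Company", 0.90
    return None, 0.0

result = run_feedback(["Telelogic", "IBM"], toy_classify)
print(result)
```

The loop terminates because the set of accepted terms only grows and is bounded by the number of terms; in the toy run, "Telelogic" is recognized on the second pass, after "IBM" has been typed.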

4.1 Evaluation of Our NER Model

We evaluated our NER separately on Wikipedia data and the IT corpus. For the Wikipedia evaluation, we took our initial set of 4679 seed instances from LOD, and randomly selected 4179 instances for training and set aside the remaining 500 instances for evaluation. For evaluation on the IT corpus, we manually generated a gold standard of 159 term-type pairs, by randomly selecting a sample of 200 pairs from the output of Section 2.4, and then manually fixing erroneous pairs.

Table 5. Sample IT Vocabulary Extracted

Software Developer: BMC Software, Fujitsu, IBM Software Group, Automattic, Apache Software Foundation, ...
Telecommunications equipment vendors: Alcatel-Lucent, Avaya, SonicWALL, Lucent Technologies ...
Service-oriented business computing: Multitenancy, B2B Gateway, SaaS, SOA Governace, Cloud Computing
Content Management Systems: PHP-Nuke, OsCommerce, Enterprise Content Management, WordPress, Drupal
Software: Corep, Lotus Sametime Advanced, rhype, AIM Pro Business Edition, BPMT, Agilense ...
Programming language: Joomia, ABAP, COBOL, BASIC, ruby, Java
Java Plaform: NetBeans, JDevelopper, Java Software, Java ME, Oracle JDevelopper, ZAAP
Website: Twitter, Microsoft Live, Office Online, GMail, Second Life

The results are shown in Table 4. The table shows precision, recall and F-scores for our model over each of the domains, with and without the feedback loop implemented.
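The F-scores in Table 4 are consistent with the balanced F-measure, i.e., the harmonic mean of precision and recall. The text does not state which F-measure was used, so that is an assumption; the short check below recomputes each value from the reported precision and recall and compares it with the reported F-score, allowing for rounding.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F) as listed in Table 4.
table4 = {
    ("Wikipedia", "no feedback"): (71.1, 41.5, 52.4),
    ("Wikipedia", "with feedback"): (69.5, 42.5, 52.8),
    ("IT Corpus", "no feedback"): (76.4, 38.6, 51.3),
    ("IT Corpus", "with feedback"): (76.5, 52.3, 62.1),
}

differences = {key: abs(f_score(p, r) - reported)
               for key, (p, r, reported) in table4.items()}
for key, diff in differences.items():
    print(key, f"difference from reported F: {diff:.2f}")
```

All four recomputed values agree with the table to within 0.1, which is what rounding of the inputs would allow.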
The results are encouraging. While not near the performance of state-of-the-art NERs (which achieve F-scores in the 90% range, e.g., [8]), there are several key points to keep in mind.
First, typical NERs detect a pre-defined set of types and are specially optimized for those types using a combination of hand-crafted patterns/rules and/or a large amount of manually annotated training data. We have taken a completely automated approach for both recognizing domain-specific types and generating training data and patterns, and our scores are comparable to, and in some cases even better than, similar approaches such as [6], [7]. The quality issues in LOD adversely affect our results, and thus the more we can improve the quality of LOD the better our results should be. Second, there is scope for improvement using a more robust classifier based on Support Vector Machines (SVMs).
Finally, the performance of our model across domain corpora is significant. The model, which is trained on Wikipedia and LOD and applied to the IT corpus, performs comparably well without feedback, and substantially better with feedback (especially in recall). This indicates that the kind of patterns we learn on Wikipedia, using information from LOD, can be interesting and generic enough to be applicable across different domains. The performance improvement with feedback on the IT corpus was due to better uniformity in the writing style, and thus incorporating text patterns for recognized terms in the feedback loop helped generate additional interesting domain-specific patterns.

4.2 Results with NER Model

Applying our NER model to the IT reports generated 381 new term-type pairs, which were added to the output of Section 3.4 to give us 1784 terms in all in our domain-specific vocabulary. Using the same evaluation process as earlier, we found that precision dropped a bit to 78%, but recall increased to 46%. Table 5 shows a sample of the extracted domain-specific vocabulary.

5 Discussion

Benefits of Using LOD: There are some direct benefits of using LOD that we should mention. First, having labeled domain-specific terms with appropriate types, a next logical step is to arrange the type labels in a hierarchy for classification purposes. Here, we can leverage the YAGO WordNet hierarchy in DBpedia in addition to using our inferred type hierarchy from Freebase. Second, a way to enrich the vocabulary is to capture relations between terms, e.g., the developerOf relation between the company IBM and the software WebSphere. Such relation information exists in LOD, making it possible to extract relevant relations between vocabulary terms. Third, because Wikipedia does cover a broad set of topics, our techniques can generalize to a new domain. For example, we conducted a preliminary experiment which suggests that our vocabulary extraction techniques can be generalized to the energy domain.
In our experiment, the input was a corpus of 102 news articles for the energy sector, drawn from various websites that specialize in news for the energy industry. The 102 news articles matched the 58 IT reports in terms of length (i.e., number of words). We started with a set of 1469 domain-specific terms drawn from GlossEx [2], and drew a sample of 500 terms. Of this sample, 204 terms were evaluated by four people to be relevant to the energy domain, leaving us with an estimate of 599 relevant terms in the overall set of 1469 (204/500 × 1469 ≈ 599). Our vocabulary extraction technique extracted 260 terms and types, of which 190 were correct terms and types (73% precision and recall of 32%). Sample terms and types we found in the energy sector included companies such as Gazprom and Petrobras, important people in energy such as Kevin Walsh and Chris Skrebowski, countries or regions relevant to energy such as Sakhalin and South Ossetia, and terms such as Tar sands, Bitumen, LNG, and Methane Hydrates.
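The arithmetic behind the energy-domain figures can be retraced with a short sketch. It assumes that the estimate of relevant terms is a simple proportional scaling of the 500-term sample to the full set of 1469 terms, and that recall is measured against that estimate; the function names are illustrative only.

```python
def estimate_relevant(total_terms, sample_size, relevant_in_sample):
    """Scale the relevant fraction observed in the sample to the full term set."""
    return total_terms * relevant_in_sample / sample_size


def precision_recall(extracted, correct, relevant):
    """Precision = correct / extracted; recall = correct / relevant."""
    return correct / extracted, correct / relevant


# Figures reported for the energy-domain experiment.
estimated = estimate_relevant(total_terms=1469, sample_size=500, relevant_in_sample=204)
precision, recall = precision_recall(extracted=260, correct=190, relevant=round(estimated))

print(f"estimated relevant terms: {round(estimated)}")
print(f"precision: {precision:.0%}, recall: {recall:.0%}")
```

Under these assumptions the sketch reproduces the reported 599 relevant terms, 73% precision, and 32% recall.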
Limitations: Our current recall score is still less than 50% in spite of improving the coverage of information in our two LOD sources and using automated NER, both of which independently had reasonably good performance. Obviously, we could improve coverage by considering additional LOD sources or building more NER models. Yet another way to improve recall is to look at the general Web for additional type information. The idea, explored in previous work such as [9], is to capture is-a relations in text using Hearst Patterns encoded as queries to a Web search engine. For example, given the entity IBM Websphere, we can issue the phrase query "IBM Websphere is a" to Google, parse the outputs looking for a noun phrase (NP) following the input query, and use the NP as a basis for the type. There are two challenges here: recognizing type-phrases that are conceptually similar (e.g., "Application Server", "Web Server", "Software Platform"), which is typically done using clustering algorithms based on WordNet etc., and dealing with multiple type senses and figuring out the relevant type for a particular domain (e.g., Java could be a Programming Language or the Island). For the latter, we use our automatically discovered domain-specific types as filters. Initial results in using our type-discovery algorithm as an output filter to approaches such as [9] have been very promising, boosting recall to 75% for the Gartner case without altering precision, and we plan to further investigate this approach in the future.
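The Hearst-pattern idea and the domain-type filter described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the snippets stand in for search-engine results (no real search API is called), a regular expression is a crude substitute for a noun-phrase chunker, and the names `hearst_candidates`, `filter_by_domain`, and `domain_types` are hypothetical.

```python
import re


def hearst_candidates(entity, snippets):
    """Return the text that follows '<entity> is a/an' in each snippet.

    A real system would use an NP chunker; here we simply capture the run of
    letters, spaces, and hyphens after the pattern.
    """
    pattern = re.compile(
        re.escape(entity) + r"\s+is\s+an?\s+([A-Za-z][A-Za-z \-]*)",
        re.IGNORECASE,
    )
    candidates = []
    for text in snippets:
        match = pattern.search(text)
        if match:
            candidates.append(match.group(1).strip().lower())
    return candidates


def filter_by_domain(candidates, domain_types):
    """Keep only domain types that occur as whole words in some candidate phrase."""
    kept = []
    for type_name in sorted(domain_types):
        type_pattern = re.compile(r"\b" + re.escape(type_name) + r"\b")
        if any(type_pattern.search(phrase) for phrase in candidates):
            kept.append(type_name)
    return kept


# Hypothetical snippets standing in for results of the query "Java is a".
snippets = [
    "Java is a programming language originally developed at Sun.",
    "Java is an island of Indonesia.",
]
# Hypothetical set of automatically discovered IT domain types (lowercased).
domain_types = {"programming language", "application server"}

candidates = hearst_candidates("Java", snippets)
types_for_java = filter_by_domain(candidates, domain_types)
print(types_for_java)
```

In this toy run the island sense is discarded because no IT domain type matches it, which is the role the discovered types play as an output filter.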
6 Related Work and Conclusions
There are a number of attempts to define taxonomies from categories in Wikipedia, and map the classes to WordNet (see [1], [10]), which address a different problem from inferring hierarchies from a relatively flat type structure like Freebase. Wu et al. [4] address the problem of creating a class hierarchy from Wikipedia infoboxes. Although their major focus is on detecting subsumption amongst these infobox classes, one aspect of their work is to infer ranges for infobox properties. For this, they examine what types of instances are referenced by these properties. This is related to what we do for type inference; however, they focus on inferring ranges for individual properties, whereas we use the domain and range information of all incident edges to infer types for instances themselves.
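To make the contrast concrete, the following sketch shows one simple way that domain and range information on incident edges could yield types for an instance. The vote-counting scheme, the triple representation, and the `runsOn` property are assumptions made for illustration and should not be read as the type-inference algorithm used in this work; only the developerOf relation appears in the text.

```python
from collections import Counter


def infer_types(instance, triples, schema):
    """Count type votes for an instance from the edges it participates in.

    triples: iterable of (subject, predicate, object)
    schema:  dict mapping predicate -> (domain_type, range_type)
    An edge leaving the instance votes for the predicate's domain;
    an edge entering the instance votes for the predicate's range.
    """
    votes = Counter()
    for subject, predicate, obj in triples:
        if predicate not in schema:
            continue  # no domain/range information for this edge
        domain_type, range_type = schema[predicate]
        if subject == instance:
            votes[domain_type] += 1
        if obj == instance:
            votes[range_type] += 1
    return votes


# Hypothetical schema and data; 'runsOn' is an invented property.
schema = {
    "developerOf": ("Company", "Software"),
    "runsOn": ("Software", "OperatingSystem"),
}
triples = [
    ("IBM", "developerOf", "WebSphere"),
    ("WebSphere", "runsOn", "Linux"),
]

websphere_votes = infer_types("WebSphere", triples, schema)
print(websphere_votes.most_common(1))
```

Here both the incoming and the outgoing edge vote for the same type, so the instance is typed from all of its edges rather than from the range of a single property.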
Research in NER has mainly focused on recognizing a fixed set of generic types, and little or no work has been done on recognizing a larger set of domain-specific types (as is our scenario). Alternately, there has been a lot of recent interest in relationship detection (e.g., Snowball [6], PORE [7]), and type detection can be seen as a special case of it (the is-a relation). However, we differentiate ourselves in several ways. Like Snowball, we build text-patterns to represent the context. However, we obtain the appropriate training seeds automatically from LOD. Our patterns capture long-distance dependencies by not being limited to a fixed-size context as in Snowball, and we add part-of-speech information to improve pattern quality. Also, both Snowball and PORE add type information to patterns; however, Snowball uses an off-the-shelf NER, which suffers from coarse granularity, and PORE adds type information by looking at textual information on the Wikipedia page (e.g., Categories), which can be noisy and non-normative. As described in Section 4, our patterns contain precise type information (i.e., YAGO WordNet senses) from LOD for the precise entity sense obtained by looking at Wikipedia URIs, and we generalize types by looking at the YAGO WordNet type hierarchy. Finally, we have demonstrated that our patterns are generalizable across domains, a point not addressed in previous solutions.
On the broader problem of extracting semantic relationships from Wikipedia, Kylin's self-supervised learning techniques for extracting a large number of attribute/value pairs from Wikipedia [11] have recently demonstrated very good results. Similar to our approach, the learning is performed on the structured information available in Wikipedia in the form of infoboxes. Our system differs from Kylin in two important ways. First, our goal is to identify and extract only domain-specific types, not all the values of the type attribute. Second, Kylin's technical approach assumes that the value of the attribute appears in the text used in the evaluation phase. In our two use cases, this assumption clearly does not hold for the type attribute. In fact, many of the discovered domain types were not mentioned in either the analyst reports or the energy articles.
In conclusion, we have shown that general-purpose structured information in Linked Open Data (LOD) combined with statistical analysis can be used for automated extraction of enterprise-specific vocabularies from text. We applied this idea to the IT domain looking at Gartner analyst reports, and automatically generated a vocabulary with 78% precision and 46% recall. As part of our solution, we have developed an algorithm to automatically detect domain-specific types for a corpus, and a set of techniques to improve coverage of information in LOD. We have shown initial promising results for the energy industry and are working on improving the solution coverage by leveraging the general Web.
References
1. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A large ontology from Wikipedia and WordNet. Web Semant. 6(3), 203–217 (2008)
2. Park, Y., Byrd, R.J., Boguraev, B.K.: Automatic glossary extraction: beyond terminology identification. In: Proceedings of the 19th International Conference on Computational Linguistics, pp. 1–7. Association for Computational Linguistics, Morristown (2002)
3. Metaweb Technologies: Freebase data dumps (2008), http://download.freebase.com/datadumps/
4. Wu, F., Weld, D.S.: Automatically refining the Wikipedia infobox ontology. In: Proc. of 17th International Conference on World Wide Web (WWW), pp. 635–644. ACM, New York (2008)
5. Welty, C., Murdock, J.W.: Towards knowledge acquisition from information extraction. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 709–722. Springer, Heidelberg (2006)
6. Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: Proceedings of the Fifth ACM International Conference on Digital Libraries (2000)
7. Wang, G., Yu, Y., Zhu, H.: Pore: Positive-only relation extraction from Wikipedia text. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudre-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 580–594. Springer, Heidelberg (2007)
8. Chieu, H.L., Ng, H.T.: Named entity recognition with a maximum entropy approach. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 160–163. Association for Computational Linguistics, Morristown (2003)
9. Cimiano, P., Staab, S.: Learning by googling. SIGKDD Explorations 6(2), 24–34 (2004)
10. Ponzetto, S., Strube, M.: Deriving a large scale taxonomy from Wikipedia. In: Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI 2007), Vancouver, B.C, July, pp. 1440–1447 (2007)
11. Wu, F., Hoffmann, R., Weld, D.S.: Information extraction from Wikipedia: moving down the long tail. In: KDD, pp. 731–739 (2008)
