SCMS – Semantifying Content Management Systems

Axel-Cyrille Ngonga Ngomo¹, Norman Heino¹, Klaus Lyko¹, Rene Speck¹, and Martin Kaltenböck²

¹ University of Leipzig, AKSW Group, Johannisgasse 26, 04103 Leipzig
² Semantic Web Company, Lerchenfelder Gürtel 43, A-1160 Vienna

Abstract. The migration to the Semantic Web requires CMS to integrate human- and machine-readable data so that their content can be integrated seamlessly into the Semantic Web. Yet, there is still a blatant need for frameworks that can be easily integrated into CMS and that transform their content into machine-readable knowledge with high accuracy. In this paper, we describe the SCMS (Semantic Content Management Systems) framework, whose main goals are the extraction of knowledge from unstructured data in any CMS and the integration of the extracted knowledge into the same CMS. Our framework integrates a highly accurate knowledge extraction pipeline. In addition, it relies on the RDF and HTTP standards for communication and can thus be integrated in virtually any CMS. We present how our framework is being used in the energy sector. We also evaluate our approach and show that our framework outperforms even commercial software by reaching up to 96% F-score.

1 Introduction

Content Management Systems (CMS) encompass most of the information available on the document-oriented Web (also referred to as Human Web). Therewith, they constitute the interface between humans and the data on the Web. Consequently, one of the main tasks of CMS has always been to make their content as easily processable for humans as possible. Still, with the migration from the document-oriented to the Semantic Web, there is an increasing need to insert machine-readable data into the content of CMS so as to enable the seamless integration of their content into the Semantic Web. Given the sheer volume of data available on the document-oriented Web, the insertion of machine-readable data must be carried out (semi-)automatically. The frameworks developed for the purpose of automatic knowledge extraction must therefore be accurate (i.e., display high F-scores) so as to ensure that humans need to curate a minimal amount of the knowledge extracted automatically.
This criterion is central for the use of automatic knowledge extraction, as approaches with a low recall lead to humans having to find the false negatives¹ by hand, while a low precision forces the same humans to continually check the output of the knowledge extraction framework. A further criterion that determines the usability of a knowledge extraction framework is its flexibility, i.e., how easy it is to integrate this framework in CMS. This criterion is of high importance as the current CMS landscape consists of hundreds of very heterogeneous frameworks implemented in dozens of different languages².

L. Aroyo et al. (Eds.): ISWC 2011, Part II, LNCS 7032, pp. 189–204, 2011. © Springer-Verlag Berlin Heidelberg 2011

In this paper, we describe the SCMS framework³. The main goal of our framework is to allow the extraction of structured data (i.e., RDF) out of the unstructured content of CMS, the linking of this content with the Web of Data and the integration of this wealth of knowledge back into the CMS. SCMS relies exclusively on RDF messages and simple Web protocols for its integration into existing CMS and the processing of their content. Thus, it is highly flexible and can be used with virtually any CMS. In addition, the underlying approach implements a highly accurate knowledge extraction pipeline that can be configured easily for the user's purposes. This pipeline makes it possible to merge and improve the results of state-of-the-art tools for information extraction, to manually post-process the results at will and to integrate the extracted knowledge into CMS, for example as RDFa.

The main contributions of this paper are the following:
1. We present the architecture of our approach and show that it can be integrated easily in virtually any CMS, provided it offers sufficient hooks into the life-cycle of its managed content items.
2. We give an overview of the vocabularies we use to represent the knowledge extracted from CMS.
3. We present how our approach is being used in a use case centered around renewable energy.
4. We evaluate our approach against a state-of-the-art commercial system for knowledge extraction in two practical use cases and show that we outperform the commercial system with respect to F-score while reaching up to 96% F-score on the extraction of locations.

The rest of this paper is structured as follows: We start by giving an overview of related work from the NLP and the Semantic Web community in Section 2. Thereafter, we present the SCMS framework (Section 3) and its main components (Section 4) as well as the vocabularies they use. Subsequently, we outline the renewable energy use case within which our framework is being deployed in Section 5. Section 6 then presents the results of an evaluation of our framework in two use cases against an enterprise commercial system (CS) whose name cannot be revealed for legal reasons. Finally, we give an overview of our future work and conclude.

¹ i.e., the entities and relations that were not found by the software
² A list of CMS on the market can be found at http://en.wikipedia.org/wiki/List_of_content_management_systems
³ http://www.scms.eu

2 Related Work

Information Extraction is the backbone of knowledge extraction and is one of the core tasks of NLP.
Three main categories of NLP tools play a central role during the extraction of knowledge from text: Keyphrase Extraction (KE), Named Entity Recognition (NER) and Relation Extraction (RE). The automatic detection of keyphrases (i.e., multi-word units or text fragments that capture the essence of a document) has been an important task of NLP for decades. Still, due to the very ambiguous definition of what an appropriate keyphrase is, current approaches to the extraction of keyphrases still display low F-scores [16]. According to [15], the majority of the approaches to KE implement combinations of statistical, rule-based or heuristic methods [11,21] on mostly document [17], keyphrase [28] or term cohesion features [23].

NER aims to discover instances of predefined classes of entities (e.g., persons, locations, organizations or products) in text. Most NER tools implement one of three main categories of approaches: dictionary-based [29,3], rule-based [6,26] and machine-learning approaches [18]. Nowadays, the methods of choice are borrowed from supervised machine learning when training examples are available [32,7,10]. Yet, due to scarcity of large domain-specific training corpora, semi-supervised [24,18] and unsupervised machine learning approaches [19,9] have also been used for extracting named entities from text.

The extraction of relations from unstructured data builds upon work for NER and KE to determine the entities between which relations might exist. Some early work on pattern extraction relied on supervised machine learning [12]. Yet, such approaches demanded large amounts of training data. The subsequent generation of approaches to RE aimed at bootstrapping patterns based on a small number of input patterns and instances [5,2]. Newer approaches aim to either collect redundancy information from the whole Web [22] or Wikipedia [30,31] in an unsupervised manner or to use linguistic analysis [13,20] to harvest generic patterns for relations.

In addition to the work done by the NLP community, several tools and frameworks have been developed explicitly for extracting RDF and RDFa out of NL [1]. For example, the Firefox extension Piggy Bank [14] extracts RDF from web pages by using screen scrapers. The RDF extracted from these web pages is then stored locally in a Sesame store. Because the data is stored locally, the user can merge the data extracted from different websites to perform semantic operations. More recently, the Drupal extension OpenPublish⁴ was released. The aim of this extension is to support content publishers with the automatic annotation of their data. For this purpose, OpenPublish utilizes the services provided by OpenCalais⁵ to annotate the content of news entries. Epiphany [1] implements a service that annotates web pages automatically with entities found in the Linked Data Cloud. Apache Stanbol⁶ implements similar functionality on a larger scale by providing synchronous RESTful interfaces that allow Content Management Systems to extract annotations from text.

⁴ http://www.openpublish.com
⁵ http://www.opencalais.org
⁶ http://incubator.apache.org/stanbol

The main drawback of current frameworks is that they either focus on one particular task (e.g., finding named entities in text) or make use of NLP algorithms without improving upon them. Consequently, they have the same limitations as the NLP approaches discussed above. To the best of our knowledge, our framework is the first framework designed explicitly for the purposes of the Semantic Web that combines flexibility with accuracy. The flexibility of SCMS has been shown by its deployment on Drupal⁷, Typo3⁸ and conX⁹. In addition, our framework is able to extract RDF from NL with an accuracy superior to that of commercial systems, as shown by our evaluation. Our framework also provides a machine-learning module with which it can be tailored to new domains and classes of named entities. Moreover, SCMS provides dedicated interfaces for interacting with the extracted triples (e.g., editing, querying, merging), making it usable in a large number of domains and use cases.

3 The SCMS Framework

An overview of the architecture behind SCMS is given in Figure 1. The framework consists of two layers: an orchestration and curation layer and an extraction and storage layer. The CMS that is to be extended with semantic capabilities resides upon our framework and must be extended minimally via a CMS wrapper. This extension implements the in- and output behavior of the CMS and communicates exclusively with the first layer of our framework, thus making the components of the extraction and storage layer of our framework swappable without any drawback for the users.

The overall goal of the first layer of the SCMS framework is to coordinate the access to the data. It consists of two tools: the orchestration service and the data wiki OntoWiki. The orchestration service is the input gate of SCMS. It receives the data that is to be annotated as an RDF message that abides by the vocabulary presented in Section 4.2 and returns the results of the framework to the endpoint specified in the RDF message it receives. OntoWiki provides functionality for the manual curation of the results of the knowledge extraction process and manages the data flow to the triple store Virtuoso¹⁰, the first component of the extraction and storage layer. In addition to a triple store, the second layer contains the Federated knOwledge eXtraction Framework FOX¹¹, which uses machine learning to combine and improve upon the results of NLP tools and converts these results into RDF by using the vocabularies displayed in Section 4.3. Virtuoso also contains a crawler that retrieves supplementary knowledge from the Web and links it to the information already available in the CMS by integrating it into the CMS. In the following, we present the central components of the SCMS stack in more detail.

[Figure 1: the CMS and its wrapper (Wrapper Layer), the orchestration service and OntoWiki (Orchestration and Curation Layer), and Virtuoso and FOX (Extraction and Storage Layer). Labeled paths: push (content), annotations (RDF, async), text, annotations, injection, crawled news (optional), push (curation changes).]
Fig. 1. Architecture and paths of communication of components in the SCMS content semantification system

⁷ http://drupal.org
⁸ http://typo3.org
⁹ http://conx.at
¹⁰ http://virtuoso.openlinksw.com
¹¹ http://fox.aksw.org

4 Tools and Vocabularies

In this section we describe the main components of the SCMS stack and how they fit together. As running example, we use a hypothetical content item contained in a Drupal CMS. This node (in Drupal terminology) consists of two parts:
– the title "Prometeus" and
– a body that contains the sentence "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest."
Only the body of the content item is to be annotated by the SCMS stack. Note that for reasons of brevity, we will only show the results of the extraction of named entities. Yet, SCMS can also extract keywords, keyphrases and relations.

4.1 Wrapper

A CMS wrapper (short: wrapper) is a component that is tightly integrated into a CMS (see Figure 2) and whose role is to ensure the communication between the CMS and the orchestration module of our framework.

[Figure 2: CMS, wrapper and orchestration service. Labeled paths: ann. request, ann. response (async), inject RDFa.]
Fig. 2. Architecture of communication between wrapper, CMS and orchestration service

In this respect, a wrapper has to fulfill three main tasks:
1. Request generation: Wrappers usually register for change events to the CMS editing system. Whenever a document has been edited, they generate an annotation request that abides by the vocabulary depicted in Figure 3. This request is then sent to the orchestration service.
2. Response receipt: Once the annotation has been carried out, the annotation results are sent back to the wrapper. The second of the wrapper's main tasks is consequently to react to those annotation responses and to store the annotations to the document appropriately (e.g., in a triple store). Since the annotation results are sent back asynchronously (i.e., in a separate request), the wrapper must provide a callback URL for this purpose.
3. Data processing: Once the data have been received and stored, wrappers usually integrate the annotations into the content items that were processed by the CMS. The integration of annotations is most commonly carried out by "injecting" the annotations as RDFa into the document's HTML rendering. The data injection is mostly realized by registering to document viewing events in the respective CMS and writing the RDFa from the wrapper's local triple store into the content items that are being viewed.
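
The request generation step can be sketched as follows. This is a minimal illustration in Python rather than the Drupal wrapper itself: the `scms` namespace IRI, the blank-node request subject and the callback URL are placeholders chosen for the example, and only the `content`, `dc` and `sioc` namespaces are the commonly published ones.

```python
from dataclasses import dataclass

# The scms IRI is a placeholder for illustration, not the real namespace.
PREFIXES = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "sioc": "http://rdfs.org/sioc/ns#",
    "scms": "http://example.org/scms#",  # hypothetical
}


@dataclass
class ContentItem:
    uri: str
    title: str
    body: str


def turtle_literal(text: str) -> str:
    """Escape a Python string as a double-quoted Turtle literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def build_annotation_request(item: ContentItem, callback_url: str) -> str:
    """Serialize an annotation request for one content item as Turtle."""
    lines = [f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items()]
    lines += [
        "",
        "[] a scms:Request ;",  # assumption: request modeled as a blank node
        f"    scms:document <{item.uri}> ;",
        f"    scms:callbackEndpoint <{callback_url}> ;",
        "    scms:annotate content:encoded .",
        "",
        f"<{item.uri}> a sioc:Item ;",
        f"    dc:title {turtle_literal(item.title)} ;",
        f"    content:encoded {turtle_literal(item.body)} .",
    ]
    return "\n".join(lines) + "\n"
```

A wrapper would call `build_annotation_request` from its change-event handler and send the result to the orchestration service over HTTP. Properties that should stay private are simply left out of the serialized item.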

An example of a wrapper request for our example is shown in Listing 1. The content:encoded of the Drupal node http://example.com/drupal/node/10 is to be annotated by FOX. In addition, the whole node is to be stored in the triple store for the purpose of manual processing. Note that the wrapper can choose not to send portions of the content item that are not to be stored in the triple store, e.g., private data. In addition, note that the description of a document is not limited to certain properties or to a certain number thereof, which ensures the high level of flexibility of the SCMS stack. Moreover, the RDF data extracted by SCMS can be easily merged with any structured information provided natively by the CMS (i.e., metadata such as author information). Consequently, SCMS enables CMS that already provide metadata as RDF to answer complex questions that combine data and metadata, e.g., "Which authors wrote documents that are related to Budapest?"

[Figure 3: a scms:Request with scms:document (a sioc:Item carrying dc:title, dc:description and content:encoded, each an xsd:string), scms:annotate and scms:callbackEndpoint (a rdf:Resource).]
Fig. 3. Vocabulary used by the wrapper requests

    @prefix content: .
    @prefix dc: .
    @prefix sioc: .
    @base .

    a ;
        ;
        ;
        content:encoded .

    a sioc:Item ;
        dc:title "Prometeus" ;
        content:encoded "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest." .

Listing 1. Example annotation request as sent by the Drupal wrapper

4.2 Orchestration Service

The main tasks of the orchestration service are to capture state information and to distribute the data across SCMS' layers.
The first of these tasks is due to the FOX framework having been designed to be stateless. The orchestration service captures state information by splitting up each document-based annotation request by a wrapper into several property-based annotation requests that are sent to FOX. In our example, the orchestration service detects that solely the content:encoded property is to be annotated. Then, it reads the content of that property from the wrapper request and generates the annotation request "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest." for FOX. Note that while this property-based annotation request consists exclusively of text or HTML and does not contain any RDF, the response returned by FOX is an RDF document serialized in Turtle or RDF/XML. The annotation results returned by FOX are combined by the orchestration service into the annotation response. Therewith, the relation between the input document and the annotations extracted by FOX is re-established. When all annotations for a particular request have been received and combined, the annotation response is sent back to the wrapper via the provided callback URL. In addition, the results sent back to the wrapper are stored in OntoWiki to facilitate the curation of annotations extracted automatically.
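
The split-and-combine behavior described above can be sketched as follows. This is a simplified, synchronous illustration in Python: the class names, the `orchestrate` function and the `stub_annotator` stand-in for FOX are hypothetical, and the real service works with RDF messages and answers asynchronously via the callback URL.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    label: str   # e.g. "LOCATION"
    begin: int   # zero-based start offset in the property value
    end: int     # exclusive end offset
    body: str    # annotated surface form


@dataclass
class DocumentRequest:
    document: str               # URI of the content item
    callback: str               # callback URL provided by the wrapper
    properties: Dict[str, str]  # property name -> literal value
    annotate: List[str]         # properties that are to be annotated


@dataclass
class AnnotationResponse:
    document: str
    callback: str
    # property name -> annotations found in that property's value
    annotations: Dict[str, List[Annotation]] = field(default_factory=dict)


def orchestrate(
    request: DocumentRequest,
    annotate_text: Callable[[str], List[Annotation]],
) -> AnnotationResponse:
    """Split a document request into per-property requests and recombine.

    The stateless annotator only ever sees plain text; the link between
    document, property and annotations is kept here.
    """
    response = AnnotationResponse(request.document, request.callback)
    for prop in request.annotate:
        response.annotations[prop] = annotate_text(request.properties[prop])
    return response


def stub_annotator(text: str) -> List[Annotation]:
    """Stand-in for FOX: looks up a few hard-coded names in the text."""
    known = {
        "Prometeus": "ORGANIZATION",
        "Hungary": "LOCATION",
        "Budapest": "LOCATION",
    }
    found = []
    for name, label in known.items():
        begin = text.find(name)
        if begin >= 0:
            found.append(Annotation(label, begin, begin + len(name), name))
    return found
```

Keeping the document and property bookkeeping outside the annotator is what allows the annotator to remain stateless and therefore swappable.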

The annotation response generated by the orchestration service for our example is shown in Listing 2. It relies upon the output sent by FOX. The exact meaning of the predicates used by FOX and forwarded by the orchestration service is explained in Section 4.3.

    @prefix scmsann: .
    @prefix ctag: .
    @prefix xsd: .
    @prefix rdf: .
    @prefix ann: .
    @prefix scms: .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:annotates ;
        scms:property ;
        scms:beginIndex "70"^^xsd:int ;
        scms:endIndex "77"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Hungary"^^xsd:string .

    [] a ann:Annotation, scmsann:ORGANIZATION ;
        scms:annotates ;
        scms:property ;
        scms:beginIndex "12"^^xsd:int ;
        scms:endIndex "21"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Prometeus"^^xsd:string .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:annotates ;
        scms:property ;
        scms:beginIndex "85"^^xsd:int ;
        scms:endIndex "93"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Budapest"^^xsd:string .

Listing 2. Example annotation response as sent by the orchestration service

4.3 FOX

The FOX framework is a stateless and extensible framework that encompasses all the NLP functionality necessary to extract knowledge from the content of CMS.
Its architecture consists of three layers as shown in Figure 4.

[Figure 4: Tool Layer (Named Entity Recognition, Keyword Extraction, Relation Extraction, Lookup Module), ML Layer (Training, Prediction) and Controller Layer (Controller).]
Fig. 4. Architecture of the FOX framework

FOX takes text or HTML as input. This data is sent to the controller layer, which implements the functionality necessary to clean the data, i.e., remove HTML and XML tags as well as further noise. Once the data has been cleaned, the controller layer begins with the orchestration of the tools in the tool layer. Each of the tools is assigned a thread from a thread pool, so as to maximize usage of multi-core CPUs. Every thread runs its tool and generates an event once it has completed its computation. In the event that a tool does not complete after a set time, the corresponding thread is terminated.
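
A thread-pool orchestration of this kind can be sketched as follows, here in Python with `concurrent.futures`. The function and tool names are hypothetical. One difference from the description above is deliberate: Python threads cannot be terminated forcibly, so this sketch stops waiting for a tool that exceeds the time limit and discards its result instead of killing its thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_tools(tools, text, timeout_s):
    """Run every tool on `text` in parallel and keep results that arrive in time.

    tools: dict mapping a tool name to a callable taking the text.
    Returns a dict of tool name -> result for tools that finished without
    error before the timeout.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(tools)))
    futures = {pool.submit(tool, text): name for name, tool in tools.items()}
    done, _not_done = wait(futures, timeout=timeout_s)
    # Do not block on stragglers; cancel tasks that have not started yet.
    # (cancel_futures requires Python 3.9 or later.)
    pool.shutdown(wait=False, cancel_futures=True)
    results = {}
    for future in done:
        if future.exception() is None:
            results[futures[future]] = future.result()
    return results
```

Results of tools that fail or time out are simply missing from the returned dictionary, so a later combination step has to cope with a variable set of tool outputs.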
So far, FOX integrates tools for KE, NER and RE. The KE is realized by PoolParty¹² for extracting keywords from a controlled vocabulary, by KEA¹³ and the Yahoo Term Extraction service¹⁴ for statistical extraction, and by several other tools. In addition, FOX integrates the Stanford Named Entity Recognizer¹⁵ [10], the Illinois Named Entity Tagger¹⁶ [25] and commercial software for NER. The RE is carried out by using the CARE platform¹⁷. The results from the tool layer are forwarded to the prediction module of the machine-learning layer. The role of the prediction module is to generate FOX's output based on the output of the tools in FOX's backend. For this purpose, it implements several ensemble learning techniques [8] with which it can combine the output of several tools. Currently, the prediction module carries out this combination by using a feed-forward neural network. The neural network inserted in FOX was trained by using 117 news articles. It reached 89.21% F-score in an evaluation based on a ten-fold cross-validation on NER, therewith outperforming even commercial systems¹⁸.
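
To make the idea of combining tool outputs more concrete, the following Python sketch shows one possible way of turning the labels that several NER tools assign to a single token into a fixed-length input vector for a learned combiner. The encoding, the tool names and the function name are assumptions for illustration; the paper does not specify the features FOX's network actually uses, and the network itself is omitted.

```python
# Entity classes named in the paper, plus NONE for "no entity" (assumption).
LABELS = ["LOCATION", "ORGANIZATION", "PERSON", "MISCELLANEOUS", "NONE"]


def encode_votes(votes, tools):
    """One-hot encode the label each tool assigned to one token.

    votes: dict mapping a tool name to the label it predicted for the token.
    tools: ordered list of tool names; a tool without a vote counts as NONE.
    Returns a list of len(tools) * len(LABELS) floats.
    """
    vector = []
    for tool in tools:
        label = votes.get(tool, "NONE")
        vector.extend(1.0 if label == candidate else 0.0 for candidate in LABELS)
    return vector
```

A vector of this shape could be fed to any classifier that predicts the final label per token. A missing vote (for instance from a tool that timed out) is treated like an explicit NONE.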

¹² http://poolparty.biz
¹³ http://www.nzdl.org/Kea/
¹⁴ http://developer.yahoo.com/search/content/V1/termExtraction.html
¹⁵ http://nlp.stanford.edu/software/CRF-NER.shtml
¹⁶ http://cogcomp.cs.illinois.edu/page/software_view/4
¹⁷ http://www.digitaltrowel.com/Technology/
¹⁸ More details on the evaluation are provided at http://fox.aksw.org

Once the neural network has combined the output of the tools and generated a better prediction of the named entities, the output of FOX is generated by using the vocabularies shown in Figure 5. These vocabularies extend the two broadly used vocabularies Annotea¹⁹ and Autotag²⁰. In particular, we added the constructs explicated in the following:
– scms:beginIndex denotes the index in a literal value string at which a particular annotation or keyphrase begins;
– scms:endIndex stands for the index in a literal value string at which a particular annotation or keyphrase ends;
– scms:means marks the URI assigned to a named entity identified for an annotation;
– scms:source denotes the provenance of the annotation, i.e., the URI of the tool which computed the annotation or even the system ID of the person who curated or created the annotation and
– scmsann is the namespace for the annotation classes, i.e., location, person, organization and miscellaneous.
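
The index values in the running example are consistent with zero-based character offsets in which the end index is exclusive. The following Python sketch, with hypothetical helper names, shows how such offsets relate to the annotated string and how a consumer can recover the annotated text from them.

```python
SENTENCE = (
    "The company Prometeus is an energy provider located in "
    "the capital of Hungary, i.e., Budapest."
)


def span_of(text, mention):
    """Return (begin, end) of the first occurrence of `mention` in `text`.

    begin is zero-based and end is exclusive, so text[begin:end] == mention.
    """
    begin = text.index(mention)
    return begin, begin + len(mention)


def body_at(text, begin, end):
    """Recover the annotated surface form from its offsets."""
    return text[begin:end]
```

Under this convention, `span_of(SENTENCE, "Hungary")` yields `(70, 77)`, which agrees with the scms:beginIndex and scms:endIndex values given for "Hungary" in the example listings.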

Given that the overhead due to the merging of the results via the neural network is of only a few milliseconds, and thanks to the multi-core architecture of current servers, FOX is almost as time-efficient as state-of-the-art tools. Still, as our evaluation shows, these few milliseconds of overhead can lead to an increase of more than 13% F-score (see Section 6). The output of FOX for our example is shown in Listing 3. This is the output that is forwarded to the orchestration service, which adds provenance information to the RDF before sending an answer to the callback URI provided by the wrapper. By these means, we ensure that the wrapper can write the RDFa in the right segment of the item content.

4.4 OntoWiki

OntoWiki is a semantic data wiki [4] that was designed to facilitate the browsing and editing of RDF knowledge bases. Its browsing features range from arbitrary concept hierarchies to facet-based search and query building interfaces. Semantic content can be created and edited by using the RDFauthor system, which has been integrated in OntoWiki [27].

OntoWiki plays two key roles within the SCMS stack. First, it serves as entry point for the triple store. This allows for the triple store to be exchanged without any drawback for the user, leading to an easy customization of our stack. In addition, OntoWiki plays the role of an annotation consolidation and curation tool and is consequently the center of the curation pipeline. To ensure that OntoWiki is always up-to-date, the orchestration service sends its annotation responses to both OntoWiki and the wrapper's callback URI. Thus, OntoWiki is also aware of the wrapper (i.e., its callback URI) and can send the results of any manual curation process back to the wrapper. Note that manually curated annotations are saved with a different (if manually created) or supplementary (if manually curated) value in their scms:source property. This gives consuming tools (e.g., wrappers) a chance to assign higher trust values to those annotations. In addition, if a new extraction run is performed on the same document, manually created and curated annotations can be kept for further use. Note that the crawler in Virtuoso can be used to fetch even more data pertaining to the annotations computed by FOX. This data can be sent directly to FOX and inserted in Virtuoso so as to extend the knowledge base for the CMS.

[Figure 5: (a) named entity annotation: ann:Annotation with scms:means, ann:body, scms:beginIndex, scms:endIndex and scms:tool; (b) keyword annotation: ctag:AutoTag with ctag:means, ctag:label, scms:tool and anyProp.]
Fig. 5. Vocabularies used by FOX for representing named entities (a) and keywords (b)

¹⁹ http://www.w3.org/2000/10/annotation-ns#
²⁰ http://commontag.org/ns#

5 Use Case

The SCMS framework is being deployed in the renewable energy sector. The renewable energy and energy efficiency sector requires a large amount of up-to-date and high-quality information and data so as to develop and push the area of clean energy systems worldwide. This information, data and knowledge about clean energy technologies, developments, projects and laws per country worldwide helps policy and decision makers, project developers and financing agencies to make better decisions on investments as well as on clean energy projects to set up. REEEP, the Renewable Energy and Energy Efficiency Partnership²¹, is a non-governmental organization that provides the aforementioned information to the respective target groups around the globe. For this purpose, REEEP has developed the reegle.info Information Gateway on Renewable Energy and Energy Efficiency²², which offers country profiles on clean energy and an Actors Catalog that contains the relevant stakeholders in the field per country. Furthermore, it supplies energy statistics and potentials as well as news on clean energy.

²¹ http://www.reeep.org
²² http://www.reegle.info

    @prefix scmsann: .
    @prefix ctag: .
    @prefix xsd: .
    @prefix rdf: .
    @prefix ann: .
    @prefix scms: .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:beginIndex "70"^^xsd:int ;
        scms:endIndex "77"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Hungary"^^xsd:string .

    [] a ann:Annotation, scmsann:ORGANIZATION ;
        scms:beginIndex "12"^^xsd:int ;
        scms:endIndex "21"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Prometeus"^^xsd:string .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:beginIndex "85"^^xsd:int ;
        scms:endIndex "93"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Budapest"^^xsd:string .

Listing 3. Annotations as returned by FOX in Turtle format

The motivation behind applying SCMS to the REEEP data was to facilitate the integration of this data in semantic applications to support efficient decision making.
To achieve this goal, we aimed to expand the reegle.info information gateway by adding RDFa to the unstructured information available on the website and by making the same triples available via a SPARQL endpoint. For our current prototype, we implemented a CMS wrapper for the Drupal CMS and imported the actors catalog of reegle into it (see Figure 6). This data was then processed by the SCMS stack as follows: All actors and country descriptions were sent to the orchestration service, which forwarded them to FOX. The RDF data extracted by FOX were sent back to the Drupal wrapper and written via OntoWiki into Virtuoso. The Drupal wrapper then used the keyphrases to extend the set of tags assigned to the corresponding profile in the CMS. The named entities were integrated in the page by using the positional information returned by FOX. By these means, we made the REEEP data accessible for humans (via the Web page) but also for machines (via OntoWiki's integrated SPARQL endpoint and via the RDFa written in the Web pages).
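
The use of positional information for writing RDFa can be sketched as follows. This Python function is a hypothetical illustration and not the Drupal wrapper's code: it assumes plain-text input, zero-based offsets with an exclusive end index, and full IRIs in the RDFa `typeof` and `resource` attributes. The example IRIs in the usage note are placeholders.

```python
from html import escape


def inject_rdfa(text, annotations):
    """Wrap annotated spans of plain text in RDFa-carrying <span> elements.

    annotations: iterable of (begin, end, type_iri, entity_iri) tuples with
    zero-based begin and exclusive end offsets into `text`.
    Overlapping or out-of-range spans are skipped.
    """
    parts = []
    cursor = 0
    for begin, end, type_iri, entity_iri in sorted(annotations):
        if begin < cursor or end > len(text) or begin >= end:
            continue
        parts.append(escape(text[cursor:begin]))
        parts.append(
            f'<span typeof="{escape(type_iri)}" resource="{escape(entity_iri)}">'
            f"{escape(text[begin:end])}</span>"
        )
        cursor = end
    parts.append(escape(text[cursor:]))
    return "".join(parts)
```

For example, calling `inject_rdfa("Budapest & Hungary", [(11, 18, "http://example.org/scmsann#LOCATION", "http://example.org/entity/Hungary")])` wraps only "Hungary" and escapes the ampersand. Content that already contains HTML markup would need offsets relative to the text that was actually sent for annotation.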

Our approach also makes the automated integration of novel knowledge sources in REEEP possible. To achieve this goal, several selected sources (web sources, blogs and news feeds) are currently being crawled and then analyzed by FOX to extract structured information out of the masses of unstructured text from the Internet.

Fig. 6. Screenshots of SCMS-enhanced Drupal

6 Evaluation

The usability of our approach depends heavily on the quality of the knowledge returned via automated means.
Consequently, we evaluated the quality of the RDFa injected into the REEEP data by measuring the precision and recall of SCMS and compared it with that of a state-of-the-art commercial system (CS) whose name cannot be revealed for legal reasons. We chose CS because it outperformed freely available NER tools such as the Stanford Named Entity Recognizer²³ [10] and the Illinois Named Entity Tagger²⁴ [25] in a prior evaluation on a newspaper corpus. Within that evaluation, FOX reached 89.21% F-score and was 14% better than CS w.r.t. F-score²⁵. As it can happen that only segments of multi-word units are recognized as being named entities, we followed a token-wise evaluation of the SCMS system. Thus, if our system recognized "United Kingdom of Great Britain" as a LOCATION when presented with "United Kingdom of Great Britain and Northern Ireland", it was scored with 5 true positives and 3 false negatives.
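
The token-wise scoring can be sketched in Python as follows. The function names are hypothetical, tokens are identified by their position in the sentence, and the F-score is computed as the usual harmonic mean of precision and recall, which the paper does not spell out.

```python
def token_wise_counts(gold, predicted):
    """Count true positives, false positives and false negatives.

    gold, predicted: sets of token positions labeled with a given class
    in the reference data and by the system, respectively.
    """
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-score (harmonic mean) from counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# "United Kingdom of Great Britain and Northern Ireland" has 8 tokens;
# the system labeled only the first 5 as LOCATION.
gold_tokens = set(range(8))
predicted_tokens = set(range(5))
counts = token_wise_counts(gold_tokens, predicted_tokens)
```

For the example from the text, `counts` holds 5 true positives, 0 false positives and 3 false negatives. As a further check, 6 of 11 person tokens found without false positives gives a recall of 54.55% and an F-score of 70.59%, which matches the values reported for CS on persons in Table 1.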

Our evaluation was carried out with two different data sets. In our first evaluation, we measured the performance of both systems on country profiles crawled from the Web, i.e., on information that is to be added automatically to the REEEP knowledge bases. For this purpose, we selected 9 country descriptions randomly and annotated 34 sentences manually. These sentences contained 119 named entity tokens, of which 104 were locations and 15 organizations. In our second evaluation, we aimed at measuring how well SCMS performs on the data that can be found currently in the REEEP catalogue. For this purpose, we manually annotated 23 actors profiles, which consisted of 68 sentences. The resulting reference data contained 20 location, 78 organization and 11 person tokens. Note that both data sets are of very different nature, as the first consists mainly of locations while the second contains a large number of organizations and a relatively small number of locations.

²³ http://nlp.stanford.edu/software/CRF-NER.shtml
²⁴ http://cogcomp.cs.illinois.edu/page/software_view/4
²⁵ More details at http://fox.aksw.org

The results of our evaluation are shown in Table 1. CS follows a very conservative strategy, which leads to it having very high precision scores of up to 100% in some experiments. Yet, its conservative strategy leads to a recall which is mostly significantly inferior to that of SCMS. The only category within which CS outperforms SCMS is the detection of persons in the actors profile data. This is due to it detecting 6 out of the 11 person tokens in the data set, while SCMS only detects 5. In all other cases, SCMS outperforms CS by up to 13% F-score (detection of organizations in the country profiles data set). Overall, SCMS outperforms CS by 7% F-score on country profiles and almost 9% F-score on actors.

Table 1. Evaluation results on country and actors profiles. The superior F-score for each category is marked with an asterisk.

                              Country Profiles        Actors Profiles
Entity Type    Measure        FOX        CS           FOX        CS
Location       Precision      98%        100%         83.33%     100%
               Recall         94.23%     78.85%       90%        70%
               F-Score        96.08%*    88.17%       86.54%*    82.35%
Organization   Precision      73.33%     100%         57.14%     90.91%
               Recall         68.75%     40%          69.23%     47.44%
               F-Score        70.97%*    57.14%       62.72%*    62.35%
Person         Precision      –          –            100%       100%
               Recall         –          –            45.45%     54.55%
               F-Score        –          –            62.5%      70.59%*
Overall        Precision      93.97%     100%         85.16%     98.2%
               Recall         91.60%     74.79%       70.64%     52.29%
               F-Score        92.77%*    85.58%       77.22%*    68.24%

7 Conclusion

In this paper, we presented the SCMS framework for extracting structured data from CMS content.
We presented the architecture of our approach and explained how each of its components works. In addition, we explained the vocabularies utilized by the components of our framework. We presented one use case for the SCMS system, i.e., how SCMS is used in the renewable energy sector. The SCMS stack abides by the criteria of accuracy and flexibility. The flexibility of our approach is ensured by the combination of RDF messages that can be easily extended and of standard Web communication protocols. The accuracy of SCMS was demonstrated in an evaluation on actor and country profiles, within which SCMS outperformed even commercial software. Our approach can be extended by adding support for negative statements, i.e., statements that are not correct but can be found in different knowledge sources across the data landscape analyzed by our framework. In addition, the feedback generated by users will be integrated in the training of the framework to make it even more accurate over time.

References

1. Adrian, B., Hees, J., Herman, I., Sintek, M., Dengel, A.: Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web of Data. In: Cimiano, P., Pinto, H.S. (eds.) EKAW 2010. LNCS, vol. 6317, pp. 178–192. Springer, Heidelberg (2010)
2. Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: ACM DL, pp. 85–94 (2000)
3. Amsler, R.: Research towards the development of a lexical knowledge base for natural language processing. SIGIR Forum 23, 1–2 (1989)
4. Auer, S., Dietzold, S., Riechert, T.: OntoWiki – A Tool for Social, Semantic Collaboration. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 736–749. Springer, Heidelberg (2006)
5. Brin, S.: Extracting Patterns and Relations from the World Wide Web. In: Atzeni, P., Mendelzon, A.O., Mecca, G. (eds.) WebDB 1998. LNCS, vol. 1590, pp. 172–183. Springer, Heidelberg (1999)
6. Coates-Stephens, S.: The analysis and acquisition of proper names for the understanding of free text. Computers and the Humanities 26, 441–456 (1992), doi:10.1007/BF00136985
7. Curran, J.R., Clark, S.: Language independent NER using a maximum entropy tagger. In: HLT-NAACL, pp. 164–167 (2003)
8. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
9. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134 (2005)
10. Finkel, J., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: ACL, pp. 363–370 (2005)
11. Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific keyphrase extraction. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 1999, pp. 668–673. Morgan Kaufmann Publishers Inc., San Francisco (1999)
12. Grishman, R., Yangarber, R.: NYU: Description of the Proteus/PET system as used for MUC-7 ST. In: MUC-7. Morgan Kaufmann (1998)
13. Harabagiu, S., Bejan, C.A., Morarescu, P.: Shallow semantics for relation extraction. In: IJCAI, pp. 1061–1066 (2005)
14. Huynh, D., Mazzocchi, S., Karger, D.R.: Piggy Bank: Experience the Semantic Web Inside Your Web Browser. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 413–430. Springer, Heidelberg (2005)
15. Kim, S.N., Kan, M.-Y.: Re-examining automatic keyphrase extraction approaches in scientific articles. In: MWE 2009, pp. 9–16 (2009)
16. Kim, S.N., Medelyan, O., Kan, M.-Y., Baldwin, T.: SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In: SemEval 2010, pp. 21–26. Association for Computational Linguistics, Stroudsburg (2010)
17. Matsuo, Y., Ishizuka, M.: Keyword Extraction From A Single Document Using Word Co-Occurrence Statistical Information. International Journal on Artificial Intelligence Tools 13(1), 157–169 (2004)
18. Nadeau, D.: Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. PhD thesis, University of Ottawa (2007)
19. Nadeau, D., Turney, P., Matwin, S.: Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity. In: Lamontagne, L., Marchand, M. (eds.) Canadian AI 2006. LNCS (LNAI), vol. 4013, pp. 266–277. Springer, Heidelberg (2006)
20. Nguyen, D.P.T., Matsuo, Y., Ishizuka, M.: Relation extraction from Wikipedia using subtree mining. In: AAAI, pp. 1414–1420 (2007)
21. Nguyen, T.D., Kan, M.-Y.: Keyphrase Extraction in Scientific Publications. In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 317–326. Springer, Heidelberg (2007)
22. Pantel, P., Pennacchiotti, M.: Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In: ACL, pp. 113–120 (2006)
23. Park, Y., Byrd, R.J., Boguraev, B.K.: Automatic glossary extraction: beyond terminology identification. In: COLING, pp. 1–7 (2002)
24. Pasca, M., Lin, D., Bigham, J., Lifchits, A., Jain, A.: Organizing and searching the world wide web of facts - step one: the one-million fact extraction challenge. In: Proceedings of the 21st National Conference on Artificial Intelligence, vol. 2, pp. 1400–1405. AAAI Press (2006)
25. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CONLL, pp. 147–155 (2009)
26. Thielen, C.: An approach to proper name tagging for German. In: Proceedings of the EACL 1995 SIGDAT Workshop (1995)
27. Tramp, S., Heino, N., Auer, S., Frischmuth, P.: RDFauthor: Employing RDFa for Collaborative Knowledge Engineering. In: Cimiano, P., Pinto, H.S. (eds.) EKAW 2010. LNCS, vol. 6317, pp. 90–104. Springer, Heidelberg (2010)
28. Turney, P.D.: Coherent keyphrase extraction via web mining. In: IJCAI, San Francisco, CA, USA, pp. 434–439 (2003)
29. Walker, D., Amsler, R.: The use of machine-readable dictionaries in sublanguage analysis. Analysing Language in Restricted Domains (1986)
30. Wang, G., Yu, Y., Zhu, H.: PORE: Positive-Only Relation Extraction from Wikipedia Text. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudre-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 580–594. Springer, Heidelberg (2007)
31. Yan, Y., Okazaki, N., Matsuo, Y., Yang, Z., Ishizuka, M.: Unsupervised relation extraction by mining Wikipedia texts using information from the web. In: ACL 2009, pp. 1021–1029 (2009)
32. Zhou, G., Su, J.: Named entity recognition using an HMM-based chunk tagger. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL 2002, pp. 473–480. Association for Computational Linguistics, Morristown (2002)