SCMS – Semantifying Content Management Systems

Axel-Cyrille Ngonga Ngomo^1, Norman Heino^1, Klaus Lyko^1, Rene Speck^1, and Martin Kaltenböck^2

^1 University of Leipzig, AKSW Group, Johannisgasse 26, 04103 Leipzig
^2 Semantic Web Company, Lerchenfeldergürtel 43, A-1160 Vienna

Abstract. The migration to the Semantic Web requires from CMS that they integrate human- and machine-readable data to support their seamless integration into the Semantic Web. Yet, there is still a blatant need for frameworks that can be easily integrated into CMS and allow to transform their content into machine-readable knowledge with high accuracy. In this paper, we describe the SCMS (Semantic Content Management Systems) framework, whose main goals are the extraction of knowledge from unstructured data in any CMS and the integration of the extracted knowledge into the same CMS. Our framework integrates a highly accurate knowledge extraction pipeline. In addition, it relies on the RDF and HTTP standards for communication and can thus be integrated in virtually any CMS. We present how our framework is being used in the energy sector. We also evaluate our approach and show that our framework outperforms even commercial software by reaching up to 96% F-score.

1 Introduction

Content Management Systems (CMS) encompass most of the information available on the document-oriented Web (also referred to as Human Web). Therewith, they constitute the interface between humans and the data on the Web. Consequently, one of the main tasks of CMS has always been to make their content as easily processable for humans as possible. Still, with the migration from the document-oriented to the Semantic Web, there is an increasing need to insert machine-readable data into the content of CMS so as to enable the seamless integration of their content into the Semantic Web. Given the sheer volume of data available on the document-oriented Web, the insertion of machine-readable data must be carried out (semi-)automatically. The frameworks developed for the purpose of automatic knowledge extraction must therefore be accurate (i.e., display high F-scores) so as to ensure that humans need to curate a minimal amount of the knowledge extracted automatically. This criterion is central for the use of automatic knowledge extraction, as approaches with a low recall lead to humans having to find the false negatives^1 by hand, while a low precision forces the same humans to have to continually check the output of the knowledge extraction framework.

L. Aroyo et al. (Eds.): ISWC 2011, Part II, LNCS 7032, pp. 189–204, 2011. © Springer-Verlag Berlin Heidelberg 2011

A further criterion that determines the usability of a knowledge extraction framework is its flexibility, i.e., how easy it is to integrate this framework in CMS. This criterion is of high importance as the current CMS landscape consists of hundreds of very heterogeneous frameworks implemented in dozens of different languages^2.

In this paper, we describe the SCMS framework^3. The main goal of our framework is to allow the extraction of structured data (i.e., RDF) out of the unstructured content of CMS, the linking of this content with the Web of Data and the integration of this wealth of knowledge back into the CMS. SCMS relies exclusively on RDF messages and simple Web protocols for its integration into existing CMS and the processing of their content. Thus, it is highly flexible and can be used with virtually any CMS. In addition, the underlying approach implements a highly accurate knowledge extraction pipeline that can be configured easily for the user's purposes. This pipeline allows to merge and improve the results of state-of-the-art tools for information extraction, to manually post-process the results at will and to integrate the extracted knowledge into CMS, for example as RDFa.

The main contributions of this paper are the following:

1. We present the architecture of our approach and show that it can be integrated easily in virtually any CMS, provided it offers sufficient hooks into the life-cycle of its managed content items.
2. We give an overview of the vocabularies we use to represent the knowledge extracted from CMS.
3. We present how our approach is being used in a use case centered around renewable energy.
4. We evaluate our approach against a state-of-the-art commercial system for knowledge extraction in two practical use cases and show that we outperform the commercial system with respect to F-score while reaching up to 96% F-score on the extraction of locations.

The rest of this paper is structured as follows: We start by giving an overview of related work from the NLP and the Semantic Web community in Section 2. Thereafter, we present the SCMS framework (Section 3) and its main components (Section 4) as well as the vocabularies they use. Subsequently, we outline the renewable energy use case within which our framework is being deployed in Section 5. Section 6 then presents the results of an evaluation of our framework in two use cases against an enterprise commercial system (CS) whose name cannot be revealed for legal reasons. Finally, we give an overview of our future work and conclude.

^1 I.e., the entities and relations that were not found by the software.
^2 A list of CMS on the market can be found at http://en.wikipedia.org/wiki/List_of_content_management_systems
^3 http://www.scms.eu

2 Related Work

Information Extraction is the backbone of knowledge extraction and is one of the core tasks of NLP. Three main categories of NLP tools play a central role during the extraction of knowledge from text: Keyphrase Extraction (KE), Named Entity Recognition (NER) and Relation Extraction (RE). The automatic detection of keyphrases (i.e., multi-word units or text fragments that capture the essence of a document) has been an important task of NLP for decades. Still, due to the very ambiguous definition of what an appropriate keyphrase is, current approaches to the extraction of keyphrases still display low F-scores [16]. According to [15], the majority of the approaches to KE implement combinations of statistical, rule-based or heuristic methods [11,21] on mostly document [17], keyphrase [28] or term cohesion features [23].

NER aims to discover instances of predefined classes of entities (e.g., persons, locations, organizations or products) in text. Most NER tools implement one of three main categories of approaches: dictionary-based [29,3], rule-based [6,26] and machine-learning approaches [18]. Nowadays, the methods of choice are borrowed from supervised machine learning when training examples are available [32,7,10]. Yet, due to scarcity of large domain-specific training corpora, semi-supervised [24,18] and unsupervised machine learning approaches [19,9] have also been used for extracting named entities from text.

The extraction of relations from unstructured data builds upon work for NER and KE to determine the entities between which relations might exist. Some early work on pattern extraction relied on supervised machine learning [12]. Yet, such approaches demanded large amounts of training data. The subsequent generation of approaches to RE aimed at bootstrapping patterns based on a small number of input patterns and instances [5,2]. Newer approaches aim to either collect redundancy information from the whole Web [22] or Wikipedia [30,31] in an unsupervised manner or to use linguistic analysis [13,20] to harvest generic patterns for relations.

In addition to the work done by the NLP community, several tools and frameworks have been developed explicitly for extracting RDF and RDFa out of NL [1]. For example, the Firefox extension Piggy Bank [14] allows to extract RDF from web pages by using screen scrapers. The RDF extracted from these web pages is then stored locally in a Sesame store. The data being stored locally allows the user to merge the data extracted from different websites to perform semantic operations. More recently, the Drupal extension OpenPublish^4 was released. The aim of this extension is to support content publishers with the automatic annotation of their data. For this purpose, OpenPublish utilizes the services provided by OpenCalais^5 to annotate the content of news entries. Epiphany [1] implements a service that annotates web pages automatically with entities found in the Linked Data Cloud. Apache Stanbol^6 implements similar functionality on a larger scale by providing synchronous RESTful interfaces that allow Content Management Systems to extract annotations from text.

The main drawback of current frameworks is that they either focus on one particular task (e.g., finding named entities in text) or make use of NLP algorithms without improving upon them. Consequently, they have the same limitations as the NLP approaches discussed above. To the best of our knowledge, our framework is the first framework designed explicitly for the purposes of the Semantic Web that combines flexibility with accuracy. The flexibility of the SCMS has been shown by its deployment on Drupal^7, Typo3^8 and conX^9. In addition, our framework is able to extract RDF from NL with an accuracy superior to that of commercial systems as shown by our evaluation. Our framework also provides a machine-learning module that allows to tailor it to new domains and classes of named entities. Moreover, SCMS provides dedicated interfaces for interacting (e.g., editing, querying, merging) with the triples extracted, making it usable in a large number of domains and use cases.

^4 http://www.openpublish.com
^5 http://www.opencalais.org
^6 http://incubator.apache.org/stanbol

3 The SCMS Framework

An overview of the architecture behind SCMS is given in Figure 1. The framework consists of two layers: an orchestration and curation layer and an extraction and storage layer. The CMS that is to be extended with semantic capabilities resides upon our framework and must be extended minimally via a CMS wrapper. This extension implements the in- and output behavior of the CMS and communicates exclusively with the first layer of our framework, thus making the components of the extraction and storage layer of our framework swappable without any drawback for the users.

The overall goal of the first layer of the SCMS framework is to coordinate the access to the data. It consists of two tools: the orchestration service and the data wiki OntoWiki. The orchestration service is the input gate of SCMS. It receives the data that is to be annotated as a RDF message that abides by the vocabulary presented in Section 4.2 and returns the results of the framework to the endpoint specified in the RDF message it receives. OntoWiki provides functionality for the manual curation of the results of the knowledge extraction process and manages the data flow to the triple store Virtuoso^10, the first component of the extraction and storage layer. In addition to a triple store, the second layer contains the Federated knOwledge eXtraction Framework FOX^11, that uses machine learning to combine and improve upon the results of NLP tools as well as converts these results into RDF by using the vocabularies displayed in Section 4.3. Virtuoso also contains a crawler that allows to retrieve supplementary knowledge from the Web and link it to the information already available in the CMS by integrating it into the CMS. In the following, we present the central components of the SCMS stack in more detail.

Fig. 1. Architecture and paths of communication of components in the SCMS content semantification system (diagram labels: Wrapper Layer with CMS Wrapper; Orchestration and Curation Layer with Orchestration Service and OntoWiki; Extraction and Storage Layer with FOX and Virtuoso; arrows: push (content), annotations (RDF) – async, text, annotations, injection, crawled news (optional), push (curation changes))

^7 http://drupal.org
^8 http://typo3.org
^9 http://conx.at
^10 http://virtuoso.openlinksw.com
^11 http://fox.aksw.org

4 Tools and Vocabularies

In this section we describe the main components of the SCMS stack and how they fit together. As running example, we use a hypothetical content item contained in a Drupal CMS. This node (in Drupal terminology) consists of two parts:

– the title "Prometeus" and
– a body that contains the sentence "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest.".

Only the body of the content item is to be annotated by the SCMS stack. Note that for reasons of brevity, we will only show the results of the extraction of named entities. Yet, SCMS can also extract keywords, keyphrases and relations.

4.1 Wrapper

A CMS wrapper (short: wrapper) is a component that is tightly integrated into a CMS (see Figure 2) and whose role is to ensure the communication between the CMS and the orchestration module of our framework.

Fig. 2. Architecture of communication between wrapper, CMS and orchestration service (diagram labels: CMS, Wrapper, Orchestration Service; arrows: ann. request, ann. response (async), inject RDFa)

In this respect, a wrapper has to fulfill three main tasks:

1. Request generation: Wrappers usually register for change events to the CMS editing system. Whenever a document has been edited, they generate an annotation request that abides by the vocabulary depicted in Figure 3. This request is then sent to the orchestration service.
2. Response receipt: Once the annotation has been carried out, the annotation results are sent back to the wrapper. The second of the wrapper's main tasks is consequently to react to those annotation responses and to store the annotations to the document appropriately (e.g., in a triple store). Since the annotation results are sent back asynchronously (i.e., in a separate request), the wrapper must provide a callback URL for this purpose.
3. Data processing: Once the data have been received and stored, wrappers usually integrate the annotations into the content items that were processed by the CMS. The integration of annotations is most commonly carried out by "injecting" the annotations as RDFa into the document's HTML rendering. The data injection is mostly realized by registering to document viewing events in the respective CMS and writing the RDFa from the wrapper's local triple store into the content items that are being viewed.
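
The injection step of the third task can be illustrated with a minimal Python sketch. It is not the code of any SCMS wrapper: it assumes that annotations are available as plain dictionaries carrying character offsets, an entity class and an entity URI. The function name inject_rdfa, the dictionary keys and the example.org URIs are hypothetical placeholders.

```python
from html import escape


def inject_rdfa(text, annotations):
    """Wrap each annotated character span of `text` in an RDFa <span>.

    `annotations` is a list of dicts with the hypothetical keys
    "begin" and "end" (character offsets), "type" (entity class)
    and "means" (entity URI).
    """
    parts = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a["begin"]):
        if ann["begin"] < cursor:
            continue  # skip annotations that overlap an earlier span
        # Plain text between the previous span and this one.
        parts.append(escape(text[cursor:ann["begin"]]))
        parts.append(
            '<span resource="%s" typeof="%s">%s</span>'
            % (
                escape(ann["means"]),
                escape(ann["type"]),
                escape(text[ann["begin"]:ann["end"]]),
            )
        )
        cursor = ann["end"]
    parts.append(escape(text[cursor:]))
    return "".join(parts)


body = ("The company Prometeus is an energy provider located in "
        "the capital of Hungary, i.e., Budapest.")
rendered = inject_rdfa(body, [
    {"begin": 85, "end": 93, "type": "scmsann:LOCATION",
     "means": "http://example.org/entity/Budapest"},   # placeholder URI
    {"begin": 12, "end": 21, "type": "scmsann:ORGANIZATION",
     "means": "http://example.org/entity/Prometeus"},  # placeholder URI
])
print(rendered)
```

The offsets 12 to 21 and 85 to 93 are the ones FOX reports for "Prometeus" and "Budapest" in the running example. Sorting by the begin offset and escaping the text between spans keeps the output valid HTML even if the annotations arrive unordered.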

An example of a wrapper request for our example is shown in Listing 1. The content:encoded of the Drupal node http://example.com/drupal/node/10 is to be annotated by FOX. In addition, the whole node is to be stored in the triple store for the purpose of manual processing. Note that the wrapper can choose not to send portions of the content item that are not to be stored in the triple store, e.g., private data. In addition, note that the description of a document is not limited to certain properties or to a certain number thereof, which ensures the high level of flexibility of the SCMS stack. Moreover, the RDF data extracted by SCMS can be easily merged with any structured information provided natively by the CMS (i.e., metadata such as author information). Consequently, SCMS enables CMS that already provide metadata as RDF to answer complex questions that combine data and metadata, e.g., "Which authors wrote documents that are related to Budapest?"

Fig. 3. Vocabulary used by the wrapper requests (diagram labels: a scms:Request, a sioc:Item, a rdf:Resource, xsd:string; properties scms:document, scms:annotate, scms:callbackEndpoint, dc:title, dc:description, content:encoded)

    @prefix content: .
    @prefix dc: .
    @prefix sioc: .
    @base .

    a ;
      ;
      ;
      content:encoded .

    a sioc:Item ;
      dc:title "Prometeus" ;
      content:encoded "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest." .

Listing 1. Example annotation request as sent by the Drupal wrapper

4.2 Orchestration Service

The main tasks of the orchestration service are to capture state information and to distribute the data across SCMS' layers. The first of the tasks is due to the FOX framework having been designed to be stateless. The orchestration service captures state information by splitting up each document-based annotation request by a wrapper into several property-based annotation requests that are sent to FOX. In our example, the orchestration service detects that solely the content:encoded property is to be annotated. Then, it reads the content of that property from the wrapper request and generates the annotation request "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest." for FOX. Note that while this property-based annotation request consists exclusively of text or HTML and does not contain any RDF, the response returned by FOX is a RDF document serialized in Turtle or RDF/XML. The annotation results returned by FOX are combined by the orchestration service into the annotation response. Therewith, the relation between the input document and the annotations extracted by FOX is re-established. When all annotations for a particular request have been received and combined, the annotation response is sent back to the wrapper via the provided callback URL. In addition, the results sent back to the wrapper are stored in OntoWiki to facilitate the curation of annotations extracted automatically.
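
The split-and-recombine behavior described above can be sketched in a few lines of Python. The sketch works on plain dictionaries instead of RDF and replaces FOX with a toy extractor, so it only illustrates how state is kept outside a stateless extractor. All function names, dictionary keys and the callback URL are assumptions made for this illustration.

```python
def split_request(request):
    """Split a document-based request into property-based requests."""
    document = request["document"]
    return [
        {"document": document["uri"], "property": prop,
         "text": document["properties"][prop]}
        for prop in request["annotate"]
        if prop in document["properties"]
    ]


def combine_results(sub_requests, extractor):
    """Call the stateless extractor per property and re-attach context."""
    response = []
    for sub in sub_requests:
        for annotation in extractor(sub["text"]):
            enriched = dict(annotation)
            # Re-establish the link to the input document and property.
            enriched["annotates"] = sub["document"]
            enriched["property"] = sub["property"]
            response.append(enriched)
    return response


def toy_extractor(text):
    """Stand-in for FOX: finds one hard-coded entity by string search."""
    begin = text.find("Budapest")
    if begin < 0:
        return []
    return [{"body": "Budapest", "type": "LOCATION",
             "beginIndex": begin, "endIndex": begin + len("Budapest")}]


request = {
    "callback": "http://example.com/drupal/scms/callback",  # hypothetical
    "annotate": ["content:encoded"],
    "document": {
        "uri": "http://example.com/drupal/node/10",
        "properties": {
            "dc:title": "Prometeus",
            "content:encoded": "The company Prometeus is an energy provider "
                               "located in the capital of Hungary, i.e., Budapest.",
        },
    },
}
sub_requests = split_request(request)
response = combine_results(sub_requests, toy_extractor)
print(response)
```

Because the extractor only ever sees a string, the document URI and the property name have to be carried by the caller. This is the information that appears as scms:annotates and scms:property in the annotation response.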

The annotation response generated by the orchestration service for our example is shown in Listing 2. It relies upon the output sent by FOX. The exact meaning of the predicates used by FOX and forwarded by the orchestration service is explained in Section 4.3.

    @prefix scmsann: .
    @prefix ctag: .
    @prefix xsd: .
    @prefix rdf: .
    @prefix ann: .
    @prefix scms: .

    [] a ann:Annotation, scmsann:LOCATION ;
      scms:annotates ;
      scms:property ;
      scms:beginIndex "70"^^xsd:int ;
      scms:endIndex "77"^^xsd:int ;
      scms:means ;
      scms:source ;
      ann:body "Hungary"^^xsd:string .

    [] a ann:Annotation, scmsann:ORGANIZATION ;
      scms:annotates ;
      scms:property ;
      scms:beginIndex "12"^^xsd:int ;
      scms:endIndex "21"^^xsd:int ;
      scms:means ;
      scms:source ;
      ann:body "Prometeus"^^xsd:string .

    [] a ann:Annotation, scmsann:LOCATION ;
      scms:annotates ;
      scms:property ;
      scms:beginIndex "85"^^xsd:int ;
      scms:endIndex "93"^^xsd:int ;
      scms:means ;
      scms:source ;
      ann:body "Budapest"^^xsd:string .

Listing 2. Example annotation response as sent by the orchestration service

4.3 FOX

The FOX framework is a stateless and extensible framework that encompasses all the NLP functionality necessary to extract knowledge from the content of CMS. Its architecture consists of three layers as shown in Figure 4. FOX takes text or HTML as input. This data is sent to the controller layer, which implements the functionality necessary to clean the data, i.e., remove HTML and XML tags as well as further noise. Once the data has been cleaned, the controller layer begins with the orchestration of the tools in the tool layer. Each of the tools is assigned a thread from a thread pool, so as to maximize usage of multi-core CPUs. Every thread runs its tool and generates an event once it has completed its computation. In the event that a tool does not complete after a set time, the corresponding thread is terminated.

Fig. 4. Architecture of the FOX framework (diagram labels: Controller Layer with Controller; ML Layer with Training and Prediction; Tool Layer with Named Entity Recognition, Keyword Extraction, Relation Extraction and Lookup Module)
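
The thread-pool orchestration with a time limit can be sketched with Python's standard concurrent.futures module. This is an illustration under assumptions, not FOX's implementation: the tool functions are toy stand-ins, run_tools is a hypothetical name, and Python cannot forcibly terminate a running thread, so a tool that misses the deadline is simply ignored rather than killed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def run_tools(tools, text, timeout_s):
    """Run every tool on `text` in its own pool thread.

    Returns a dict mapping tool name to its result, or to None if the
    tool did not finish within `timeout_s` seconds overall.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(tools)))
    futures = {name: pool.submit(tool, text) for name, tool in tools.items()}
    deadline = time.monotonic() + timeout_s
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
        except TimeoutError:
            future.cancel()  # only effective if the tool has not started yet
            results[name] = None
    pool.shutdown(wait=False)  # do not block on tools that are still running
    return results


def fast_tool(text):
    """Toy tool: whitespace tokenizer."""
    return text.split()


def slow_tool(text):
    """Toy tool that exceeds the time limit."""
    time.sleep(0.5)
    return []


outcome = run_tools({"fast": fast_tool, "slow": slow_tool},
                    "Budapest is in Hungary", timeout_s=0.1)
print(outcome)
```

A shared deadline is used instead of a per-tool timeout so that the total waiting time stays bounded by timeout_s regardless of how many tools are registered.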

So far, FOX integrates tools for KE, NER and RE. The KE is realized by PoolParty^12 for extracting keywords from a controlled vocabulary, KEA^13 and the Yahoo Term Extraction service^14 for statistical extraction and several other tools. In addition, FOX integrates the Stanford Named Entity Recognizer^15 [10], the Illinois Named Entity Tagger^16 [25] and commercial software for NER. The RE is carried out by using the CARE platform^17.

The results from the tool layer are forwarded to the prediction module of the machine-learning layer. The role of the prediction module is to generate FOX's output based on the output of the tools in FOX's backend. For this purpose, it implements several ensemble learning techniques [8] with which it can combine the output of several tools. Currently, the prediction module carries out this combination by using a feed-forward neural network. The neural network inserted in FOX was trained by using 117 news articles. It reached 89.21% F-Score in an evaluation based on a ten-fold cross-validation on NER, therewith outperforming even commercial systems^18.
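
To make the idea of combining tool outputs concrete, the following Python sketch merges per-token labels from several tools. Note that it deliberately swaps in weighted voting, one of the simplest ensemble techniques, for the feed-forward neural network that FOX actually uses. The tool names, labels and weights are invented for the illustration.

```python
from collections import defaultdict


def combine_votes(tool_labels, tool_weights):
    """Pick, for each token, the label with the largest summed tool weight.

    `tool_labels` maps a tool name to its list of per-token labels; all
    lists must have the same length. Ties are broken alphabetically.
    """
    n_tokens = len(next(iter(tool_labels.values())))
    combined = []
    for i in range(n_tokens):
        scores = defaultdict(float)
        for tool, labels in tool_labels.items():
            scores[labels[i]] += tool_weights.get(tool, 1.0)
        combined.append(max(sorted(scores), key=scores.get))
    return combined


tokens = ["Prometeus", "is", "in", "Budapest"]
tool_labels = {  # hypothetical outputs of three NER tools ("O" = no entity)
    "tool_a": ["ORGANIZATION", "O", "O", "LOCATION"],
    "tool_b": ["O", "O", "O", "LOCATION"],
    "tool_c": ["ORGANIZATION", "O", "O", "O"],
}
tool_weights = {"tool_a": 1.0, "tool_b": 0.8, "tool_c": 0.9}  # assumed
merged = combine_votes(tool_labels, tool_weights)
print(list(zip(tokens, merged)))
```

In this toy run each entity is missed by one tool but recovered by the other two, which is the effect an ensemble aims for. A learned combiner such as a neural network replaces the fixed weights with parameters fitted on training data.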

Once the neural network has combined the output of the tools and generated a better prediction of the named entities, the output of FOX is generated by using the vocabularies shown in Figure 5. These vocabularies extend the two broadly used vocabularies Annotea^19 and Autotag^20. In particular, we added the constructs explicated in the following:

– scms:beginIndex denotes the index in a literal value string at which a particular annotation or keyphrase begins;
– scms:endIndex stands for the index in a literal value string at which a particular annotation or keyphrase ends;
– scms:means marks the URI assigned to a named entity identified for an annotation;
– scms:source denotes the provenance of the annotation, i.e., the URI of the tool which computed the annotation or even the system ID of the person who curated or created the annotation and
– scmsann is the namespace for the annotation classes, i.e., location, person, organization and miscellaneous.

Given that the overhead due to the merging of the results via the neural network is of only a few milliseconds and thanks to the multi-core architecture of current servers, FOX is almost as time-efficient as state-of-the-art tools. Still, as our evaluation shows, these few milliseconds of overhead can lead to an increase of more than 13% F-Score (see Section 6).

The output of FOX for our example is shown in Listing 3. This is the output that is forwarded to the orchestration service, which adds provenance information to the RDF before sending an answer to the callback URI provided by the wrapper. By these means, we ensure that the wrapper can write the RDFa in the right segment of the item content.

^12 http://poolparty.biz
^13 http://www.nzdl.org/Kea/
^14 http://developer.yahoo.com/search/content/V1/termExtraction.html
^15 http://nlp.stanford.edu/software/CRF-NER.shtml
^16 http://cogcomp.cs.illinois.edu/page/software_view/4
^17 http://www.digitaltrowel.com/Technology/
^18 More details on the evaluation are provided at http://fox.aksw.org

4.4 OntoWiki

OntoWiki is a semantic data wiki [4] that was designed to facilitate the browsing and editing of RDF knowledge bases. Its browsing features range from arbitrary concept hierarchies to facet-based search and query building interfaces. Semantic content can be created and edited by using the RDFauthor system which has been integrated in OntoWiki [27].

Fig. 5. Vocabularies used by FOX for representing named entities (a) and keywords (b) (diagram labels: (a) named entity annotation: a ann:Annotation, scms:means (a rdf:Resource), ann:body (xsd:string), scms:beginIndex and scms:endIndex (xsd:integer), scms:tool (a rdf:Resource); (b) keyword annotation: a ctag:AutoTag, ctag:means (a rdf:Resource), ctag:label (xsd:string), scms:tool (a rdf:Resource), anyProp)

^19 http://www.w3.org/2000/10/annotation-ns#
^20 http://commontag.org/ns#

OntoWiki plays two key roles within the SCMS stack. First, it serves as entry point for the triple store. This allows for the triple store to be exchanged without any drawback for the user, leading to an easy customization of our stack. In addition, OntoWiki plays the role of an annotation consolidation and curation tool and is consequently the center of the curation pipeline. To ensure that OntoWiki is always up-to-date, the orchestration service sends its annotation responses to both OntoWiki and the wrapper's callback URI. Thus, OntoWiki is also aware of the wrapper (i.e., its callback URI) and can send the results of any manual curation process back to the wrapper. Note that manually curated annotations are saved with a different (if manually created) or supplementary (if manually curated) value in their scms:source property. This gives consuming tools (e.g., wrappers) a chance to assign higher trust values to those annotations. In addition, if a new extraction run is performed on the same document, manually created and curated annotations can be kept for further use. Note that the crawler in Virtuoso can be used to fetch even more data pertaining to the annotations computed by FOX. This data can be sent directly to FOX and inserted in Virtuoso so as to extend the knowledge base for the CMS.
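
How a consuming tool might use the source information to keep curated annotations across extraction runs can be sketched as follows. This is one possible policy and not the behavior of OntoWiki or of any SCMS wrapper. The function name merge_runs, the dictionary keys and the source identifiers "tool:fox" and "user:42" are hypothetical.

```python
def merge_runs(stored, fresh, automatic_sources):
    """Merge a new extraction run into previously stored annotations.

    An annotation whose sources include anything other than the automatic
    tools is treated as manually created or curated and is kept; all other
    stored annotations are replaced by the fresh run.
    """
    curated = [a for a in stored if set(a["sources"]) - automatic_sources]
    taken = {(a["begin"], a["end"]) for a in curated}
    return curated + [a for a in fresh if (a["begin"], a["end"]) not in taken]


stored = [
    # Curated by a person: supplementary source next to the tool.
    {"begin": 12, "end": 21, "type": "ORGANIZATION",
     "sources": ["tool:fox", "user:42"]},
    # Purely automatic annotation from an earlier run.
    {"begin": 70, "end": 77, "type": "ORGANIZATION", "sources": ["tool:fox"]},
]
fresh = [
    {"begin": 12, "end": 21, "type": "LOCATION", "sources": ["tool:fox"]},
    {"begin": 70, "end": 77, "type": "LOCATION", "sources": ["tool:fox"]},
    {"begin": 85, "end": 93, "type": "LOCATION", "sources": ["tool:fox"]},
]
merged_annotations = merge_runs(stored, fresh, {"tool:fox"})
print([(a["begin"], a["type"]) for a in merged_annotations])
```

The curated span at offsets 12 to 21 survives the new run unchanged, while the automatic annotation at 70 to 77 is overwritten by the fresh result.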

5 Use Case

The SCMS framework is being deployed in the renewable energy sector. The renewable energy and energy efficiency sector requires a large amount of up-to-date and high-quality information and data so as to develop and push the area of clean energy systems worldwide. This information, data and knowledge about clean energy technologies, developments, projects and laws per country worldwide helps policy and decision makers, project developers and financing agencies to make better decisions on investments as well as on clean energy projects to set up. REEEP, the Renewable Energy and Energy Efficiency Partnership^21, is a non-governmental organization that provides the aforementioned information to the respective target groups around the globe. For this purpose, REEEP has developed the reegle.info Information Gateway on Renewable Energy and Energy Efficiency^22 that offers country profiles on clean energy and an Actors Catalog that contains the relevant stakeholders in the field per country. Furthermore, it supplies energy statistics and potentials as well as news on clean energy.

^21 http://www.reeep.org
^22 http://www.reegle.info

    @prefix scmsann: .
    @prefix ctag: .
    @prefix xsd: .
    @prefix rdf: .
    @prefix ann: .
    @prefix scms: .

    [] a ann:Annotation, scmsann:LOCATION ;
      scms:beginIndex "70"^^xsd:int ;
      scms:endIndex "77"^^xsd:int ;
      scms:means ;
      scms:source ;
      ann:body "Hungary"^^xsd:string .

    [] a ann:Annotation, scmsann:ORGANIZATION ;
      scms:beginIndex "12"^^xsd:int ;
      scms:endIndex "21"^^xsd:int ;
      scms:means ;
      scms:source ;
      ann:body "Prometeus"^^xsd:string .

    [] a ann:Annotation, scmsann:LOCATION ;
      scms:beginIndex "85"^^xsd:int ;
      scms:endIndex "93"^^xsd:int ;
      scms:means ;
      scms:source ;
      ann:body "Budapest"^^xsd:string .

Listing 3. Annotations as returned by FOX in Turtle format

The motivation behind applying SCMS to the REEEP data was to facilitate the integration of this data in semantic applications to support efficient decision making. To achieve this goal, we aimed to expand the reegle.info information gateway by adding RDFa to the unstructured information available on the website and by making the same triples available via a SPARQL endpoint. For our current prototype, we implemented a CMS wrapper for the Drupal CMS and imported the actors catalog of reegle within it (see Figure 6). This data was then processed by the SCMS stack as follows: All actors and country descriptions were sent to the orchestration service, which forwarded them to FOX. The RDF data extracted by FOX were sent back to the Drupal wrapper and written via OntoWiki into Virtuoso. The Drupal wrapper then used the keyphrases to extend the set of tags assigned to the corresponding profile in the CMS. The named entities were integrated in the page by using the positional information returned by FOX. By these means, we made the REEEP data accessible for humans (via the Web page) but also for machines (via OntoWiki's integrated SPARQL endpoint and via the RDFa written in the Web pages).

Our approach also makes the automated integration of novel knowledge sources in REEEP possible. To achieve this goal, several selected sources (web sources, blogs and news feeds) are currently being crawled and then analyzed by FOX to extract structured information out of the masses of unstructured text from the Internet.

Fig. 6. Screenshots of SCMS-enhanced Drupal

6 Evaluation

The usability of our approach depends heavily on the quality of the knowledge returned via automated means. Consequently, we evaluated the quality of the RDFa injected into the REEEP data by measuring the precision and recall of SCMS and compared it with that of a state-of-the-art commercial system (CS) whose name cannot be revealed for legal reasons. We chose CS because it outperformed freely available NER tools such as the Stanford Named Entity Recognizer^23 [10] and the Illinois Named Entity Tagger^24 [25] in a prior evaluation on a newspaper corpus. Within that evaluation, FOX reached 89.21% F-score and was 14% better than CS w.r.t. F-score^25. As it can happen that only segments of multi-word units are recognized as being named entities, we followed a token-wise evaluation of the SCMS system. Thus, if our system recognized United Kingdom of Great Britain as a LOCATION when presented with United Kingdom of Great Britain and Northern Ireland, it was scored with 5 true positives and 3 false negatives.
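
The token-wise scoring of this example can be reproduced with a short Python sketch. It assumes whitespace tokenization and a single entity type, which is a simplification of the actual evaluation setup. The helper names token_positions and token_scores are hypothetical.

```python
def token_positions(text, entity):
    """Return the set of token indices that `entity` covers in `text`.

    Tokens are obtained by whitespace splitting; the first match wins.
    """
    tokens = text.split()
    ent = entity.split()
    for i in range(len(tokens) - len(ent) + 1):
        if tokens[i:i + len(ent)] == ent:
            return set(range(i, i + len(ent)))
    return set()


def token_scores(gold, predicted):
    """Compute token-level counts, precision, recall and F-score."""
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f": f_score}


sentence = "United Kingdom of Great Britain and Northern Ireland"
gold = token_positions(sentence, sentence)
predicted = token_positions(sentence, "United Kingdom of Great Britain")
scores = token_scores(gold, predicted)
print(scores)
```

For this single entity the counts are 5 true positives, 0 false positives and 3 false negatives, i.e., a precision of 1.0, a recall of 0.625 and an F-score of about 0.77. The same harmonic-mean formula relates the precision, recall and F-score values reported in Table 1.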

Our evaluation was carried out with two different datasets. In our first evaluation, we measured the performance of both systems on country profiles crawled from the Web, i.e., on information that is to be added automatically to the REEEP knowledge bases. For this purpose, we selected 9 country descriptions randomly and annotated 34 sentences manually. These sentences contained 119 named entity tokens, of which 104 were locations and 15 organizations. In our second evaluation, we aimed at measuring how well SCMS performs on the data that can be found currently in the REEEP catalogue. For this purpose, we manually annotated 23 actors profiles, which consisted of 68 sentences. The resulting reference data contained 20 location, 78 organization and 11 person tokens. Note that both datasets are of very different nature as the first consists mainly of locations while the second contains a large number of organizations and a relatively small number of locations.

^23 http://nlp.stanford.edu/software/CRF-NER.shtml
^24 http://cogcomp.cs.illinois.edu/page/software_view/4
^25 More details at http://fox.aksw.org

The results of our evaluation are shown in Table 1. CS follows a very conservative strategy, which leads to it having very high precision scores of up to 100% in some experiments. Yet, its conservative strategy leads to a recall which is mostly significantly inferior to that of SCMS. The only category within which CS outperforms SCMS is the detection of persons in the actors profile data. This is due to it detecting 6 out of the 11 person tokens in the dataset, while SCMS only detects 5. In all other cases, SCMS outperforms CS by up to 13% F-score (detection of organizations in the country profiles dataset). Overall, SCMS outperforms CS by 7% F-score on country profiles and almost 8% F-score on actors.

Table 1. Evaluation results on country and actors profiles. The superior F-score for each category is marked with an asterisk.

Entity Type    Measure     Country Profiles      Actors Profiles
                           FOX       CS          FOX       CS
Location       Precision   98%       100%        83.33%    100%
               Recall      94.23%    78.85%      90%       70%
               F-Score     96.08%*   88.17%      86.54%*   82.35%
Organization   Precision   73.33%    100%        57.14%    90.91%
               Recall      68.75%    40%         69.23%    47.44%
               F-Score     70.97%*   57.14%      62.72%*   62.35%
Person         Precision   –         –           100%      100%
               Recall      –         –           45.45%    54.55%
               F-Score     –         –           62.5%     70.59%*
Overall        Precision   93.97%    100%        85.16%    98.2%
               Recall      91.60%    74.79%      70.64%    52.29%
               F-Score     92.77%*   85.58%      77.22%*   68.24%

7 Conclusion

In this paper, we presented the SCMS framework for extracting structured data from CMS content. We presented the architecture of our approach and explained how each of its components works. In addition, we explained the vocabularies utilized by the components of our framework. We presented one use case for the SCMS system, i.e., how SCMS is used in the renewable energy sector. The SCMS stack abides by the criteria of accuracy and flexibility. The flexibility of our approach is ensured by the combination of RDF messages that can be easily extended and of standard Web communication protocols. The accuracy of SCMS was demonstrated in an evaluation on actor and country profiles, within which SCMS outperformed even commercial software. Our approach can be extended by adding support for negative statements, i.e., statements that are not correct but can be found in different knowledge sources across the data landscape analyzed by our framework. In addition, the feedback generated by users will be integrated in the training of the framework to make it even more accurate over time.

References

1. Adrian, B., Hees, J., Herman, I., Sintek, M., Dengel, A.: Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web of Data. In: Cimiano, P., Pinto, H.S. (eds.) EKAW 2010. LNCS, vol. 6317, pp. 178–192. Springer, Heidelberg (2010)
2. Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: ACM DL, pp. 85–94 (2000)
3. Amsler, R.: Research towards the development of a lexical knowledge base for natural language processing. SIGIR Forum 23, 1–2 (1989)
4. Auer, S., Dietzold, S., Riechert, T.: OntoWiki – A Tool for Social, Semantic Collaboration. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 736–749. Springer, Heidelberg (2006)
5. Brin, S.: Extracting Patterns and Relations from the World Wide Web. In: Atzeni, P., Mendelzon, A.O., Mecca, G. (eds.) WebDB 1998. LNCS, vol. 1590, pp. 172–183. Springer, Heidelberg (1999)
6. Coates-Stephens, S.: The analysis and acquisition of proper names for the understanding of free text. Computers and the Humanities 26, 441–456 (1992), 10.1007/BF00136985
7. Curran, J.R., Clark, S.: Language independent NER using a maximum entropy tagger. In: HLT-NAACL, pp. 164–167 (2003)
8. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
9. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134 (2005)
10. Finkel, J., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: ACL, pp. 363–370 (2005)
11. Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific keyphrase extraction. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 1999, pp. 668–673. Morgan Kaufmann Publishers Inc., San Francisco (1999)
12. Grishman, R., Yangarber, R.: NYU: Description of the Proteus/PET system as used for MUC-7 ST. In: MUC-7. Morgan Kaufmann (1998)
13. Harabagiu, S., Bejan, C.A., Morarescu, P.: Shallow semantics for relation extraction. In: IJCAI, pp. 1061–1066 (2005)
14. Huynh, D., Mazzocchi, S., Karger, D.R.: Piggy Bank: Experience the Semantic Web Inside Your Web Browser. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 413–430. Springer, Heidelberg (2005)
15. Kim, S.N., Kan, M.-Y.: Re-examining automatic keyphrase extraction approaches in scientific articles. In: MWE 2009, pp. 9–16 (2009)
16. Kim, S.N., Medelyan, O., Kan, M.-Y., Baldwin, T.: SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In: SemEval 2010, pp. 21–26. Association for Computational Linguistics, Stroudsburg (2010)
17. Matsuo, Y., Ishizuka, M.: Keyword Extraction From A Single Document Using Word Co-Occurrence Statistical Information. International Journal on Artificial Intelligence Tools 13(1), 157–169 (2004)
18. Nadeau, D.: Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. PhD thesis, University of Ottawa (2007)
19. Nadeau, D., Turney, P., Matwin, S.: Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity. In: Lamontagne, L., Marchand, M. (eds.) Canadian AI 2006. LNCS (LNAI), vol. 4013, pp. 266–277. Springer, Heidelberg (2006)
20. Nguyen, D.P.T., Matsuo, Y., Ishizuka, M.: Relation extraction from Wikipedia using subtree mining. In: AAAI, pp. 1414–1420 (2007)
21. Nguyen, T.D., Kan, M.-Y.: Keyphrase Extraction in Scientific Publications. In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 317–326. Springer, Heidelberg (2007)
22. Pantel, P., Pennacchiotti, M.: Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In: ACL, pp. 113–120 (2006)
23. Park, Y., Byrd, R.J., Boguraev, B.K.: Automatic glossary extraction: beyond terminology identification. In: COLING, pp. 1–7 (2002)
24. Pasca, M., Lin, D., Bigham, J., Lifchits, A., Jain, A.: Organizing and searching the world wide web of facts - step one: the one-million fact extraction challenge. In: Proceedings of the 21st National Conference on Artificial Intelligence, vol. 2, pp. 1400–1405. AAAI Press (2006)
25. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CONLL, pp. 147–155 (2009)
26. Thielen, C.: An approach to proper name tagging for German. In: Proceedings of the EACL 1995 SIGDAT Workshop (1995)
27. Tramp, S., Heino, N., Auer, S., Frischmuth, P.: RDFauthor: Employing RDFa for Collaborative Knowledge Engineering. In: Cimiano, P., Pinto, H.S. (eds.) EKAW 2010. LNCS, vol. 6317, pp. 90–104. Springer, Heidelberg (2010)
28. Turney, P.D.: Coherent keyphrase extraction via web mining. In: IJCAI, San Francisco, CA, USA, pp. 434–439 (2003)
29. Walker, D., Amsler, R.: The use of machine-readable dictionaries in sublanguage analysis. Analysing Language in Restricted Domains (1986)
30. Wang, G., Yu, Y., Zhu, H.: PORE: Positive-Only Relation Extraction from Wikipedia Text. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudre-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 580–594. Springer, Heidelberg (2007)
31. Yan, Y., Okazaki, N., Matsuo, Y., Yang, Z., Ishizuka, M.: Unsupervised relation extraction by mining Wikipedia texts using information from the web. In: ACL 2009, pp. 1021–1029 (2009)
32. Zhou, G., Su, J.: Named entity recognition using an HMM-based chunk tagger. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL 2002, pp. 473–480. Association for Computational Linguistics, Morristown (2002)
