SCMS – Semantifying Content Management Systems

Axel-Cyrille Ngonga Ngomo¹, Norman Heino¹, Klaus Lyko¹, Rene Speck¹, and Martin Kaltenböck²

¹ University of Leipzig, AKSW Group, Johannisgasse 26, 04103 Leipzig
² Semantic Web Company, Lerchenfelder Gürtel 43, A-1160 Vienna

Abstract. The migration to the Semantic Web requires from CMS that they integrate human- and machine-readable data to support their seamless integration into the Semantic Web. Yet, there is still a blatant need for frameworks that can be easily integrated into CMS and allow their content to be transformed into machine-readable knowledge with high accuracy. In this paper, we describe the SCMS (Semantic Content Management Systems) framework, whose main goals are the extraction of knowledge from unstructured data in any CMS and the integration of the extracted knowledge into the same CMS. Our framework integrates a highly accurate knowledge extraction pipeline. In addition, it relies on the RDF and HTTP standards for communication and can thus be integrated in virtually any CMS. We present how our framework is being used in the energy sector. We also evaluate our approach and show that our framework outperforms even commercial software by reaching up to 96% F-score.

1 Introduction

Content Management Systems (CMS) encompass most of the information available on the document-oriented Web (also referred to as Human Web). Therewith, they constitute the interface between humans and the data on the Web. Consequently, one of the main tasks of CMS has always been to make their content as easily processable for humans as possible. Still, with the migration from the document-oriented to the Semantic Web, there is an increasing need to insert machine-readable data into the content of CMS so as to enable the seamless integration of their content into the Semantic Web. Given the sheer volume of data available on the document-oriented Web, the insertion of machine-readable data must be carried out (semi-)automatically. The frameworks developed for the purpose of automatic knowledge extraction must therefore be accurate (i.e., display high F-scores) so as to ensure that humans need to curate a minimal amount of the knowledge extracted automatically. This criterion is central for the use of automatic knowledge extraction, as approaches with a low recall lead to humans having to find the false negatives¹ by hand, while a low precision forces the same humans to have to continually check the output of the knowledge extraction framework.

A further criterion that determines the usability of a knowledge extraction framework is its flexibility, i.e., how easy it is to integrate this framework in CMS. This criterion is of high importance as the current CMS landscape consists of hundreds of very heterogeneous frameworks implemented in dozens of different languages².

L. Aroyo et al. (Eds.): ISWC 2011, Part II, LNCS 7032, pp. 189–204, 2011. © Springer-Verlag Berlin Heidelberg 2011

In this paper, we describe the SCMS framework³. The main goal of our framework is to allow the extraction of structured data (i.e., RDF) out of the unstructured content of CMS, the linking of this content with the Web of Data and the integration of this wealth of knowledge back into the CMS. SCMS relies exclusively on RDF messages and simple Web protocols for its integration into existing CMS and the processing of their content. Thus, it is highly flexible and can be used with virtually any CMS. In addition, the underlying approach implements a highly accurate knowledge extraction pipeline that can be configured easily for the user's purposes. This pipeline makes it possible to merge and improve the results of state-of-the-art tools for information extraction, to manually post-process the results at will and to integrate the extracted knowledge into CMS, for example as RDFa.

The main contributions of this paper are the following:

1. We present the architecture of our approach and show that it can be integrated easily in virtually any CMS, provided it offers sufficient hooks into the life-cycle of its managed content items.
2. We give an overview of the vocabularies we use to represent the knowledge extracted from CMS.
3. We present how our approach is being used in a use case centered around renewable energy.
4. We evaluate our approach against a state-of-the-art commercial system for knowledge extraction in two practical use cases and show that we outperform the commercial system with respect to F-score while reaching up to 96% F-score on the extraction of locations.

The rest of this paper is structured as follows: We start by giving an overview of related work from the NLP and the Semantic Web community in Section 2. Thereafter, we present the SCMS framework (Section 3) and its main components (Section 4) as well as the vocabularies they use. Subsequently, we outline the renewable energy use case within which our framework is being deployed in Section 5. Section 6 then presents the results of an evaluation of our framework in two use cases against an enterprise commercial system (CS) whose name cannot be revealed for legal reasons. Finally, we give an overview of our future work and conclude.

¹ I.e., the entities and relations that were not found by the software.
² A list of CMS on the market can be found at http://en.wikipedia.org/wiki/List_of_content_management_systems
³ http://www.scms.eu

2 Related Work

Information Extraction is the backbone of knowledge extraction and is one of the core tasks of NLP.
Three main categories of NLP tools play a central role during the extraction of knowledge from text: Keyphrase Extraction (KE), Named Entity Recognition (NER) and Relation Extraction (RE). The automatic detection of keyphrases (i.e., multi-word units or text fragments that capture the essence of a document) has been an important task of NLP for decades. Still, due to the very ambiguous definition of what an appropriate keyphrase is, current approaches to the extraction of keyphrases still display low F-scores [16]. According to [15], the majority of the approaches to KE implement combinations of statistical, rule-based or heuristic methods [11,21] on mostly document [17], keyphrase [28] or term cohesion features [23].

NER aims to discover instances of predefined classes of entities (e.g., persons, locations, organizations or products) in text. Most NER tools implement one of three main categories of approaches: dictionary-based [29,3], rule-based [6,26] and machine-learning approaches [18]. Nowadays, the methods of choice are borrowed from supervised machine learning when training examples are available [32,7,10]. Yet, due to scarcity of large domain-specific training corpora, semi-supervised [24,18] and unsupervised machine learning approaches [19,9] have also been used for extracting named entities from text.

The extraction of relations from unstructured data builds upon work for NER and KE to determine the entities between which relations might exist. Some early work on pattern extraction relied on supervised machine learning [12]. Yet, such approaches demanded large amounts of training data. The subsequent generation of approaches to RE aimed at bootstrapping patterns based on a small number of input patterns and instances [5,2]. Newer approaches aim to either collect redundancy information from the whole Web [22] or Wikipedia [30,31] in an unsupervised manner or to use linguistic analysis [13,20] to harvest generic patterns for relations.

In addition to the work done by the NLP community, several tools and frameworks have been developed explicitly for extracting RDF and RDFa out of NL [1]. For example, the Firefox extension Piggy Bank [14] extracts RDF from web pages by using screen scrapers. The RDF extracted from these web pages is then stored locally in a Sesame store. Because the data is stored locally, the user can merge the data extracted from different websites to perform semantic operations. More recently, the Drupal extension OpenPublish⁴ was released. The aim of this extension is to support content publishers with the automatic annotation of their data. For this purpose, OpenPublish utilizes the services provided by OpenCalais⁵ to annotate the content of news entries. Epiphany [1] implements a service that annotates web pages automatically with entities found in the Linked Data Cloud. Apache Stanbol⁶ implements similar functionality on a larger scale by providing synchronous RESTful interfaces that allow Content Management Systems to extract annotations from text.

The main drawback of current frameworks is that they either focus on one particular task (e.g., finding named entities in text) or make use of NLP algorithms without improving upon them. Consequently, they have the same limitations as the NLP approaches discussed above. To the best of our knowledge, our framework is the first framework designed explicitly for the purposes of the Semantic Web that combines flexibility with accuracy. The flexibility of SCMS has been shown by its deployment on Drupal⁷, Typo3⁸ and conX⁹. In addition, our framework is able to extract RDF from NL with an accuracy superior to that of commercial systems as shown by our evaluation. Our framework also provides a machine-learning module with which it can be tailored to new domains and classes of named entities. Moreover, SCMS provides dedicated interfaces for interacting (e.g., editing, querying, merging) with the triples extracted, making it usable in a large number of domains and use cases.

⁴ http://www.openpublish.com
⁵ http://www.opencalais.org
⁶ http://incubator.apache.org/stanbol

3 The SCMS Framework

An overview of the architecture behind SCMS is given in Figure 1. The framework consists of two layers: an orchestration and curation layer and an extraction and storage layer. The CMS that is to be extended with semantic capabilities resides upon our framework and must be extended minimally via a CMS wrapper. This extension implements the in- and output behavior of the CMS and communicates exclusively with the first layer of our framework, thus making the components of the extraction and storage layer of our framework swappable without any drawback for the users.

The overall goal of the first layer of the SCMS framework is to coordinate the access to the data. It consists of two tools: the orchestration service and the data wiki OntoWiki. The orchestration service is the input gate of SCMS. It receives the data that is to be annotated as a RDF message that abides by the vocabulary presented in Section 4.2 and returns the results of the framework to the endpoint specified in the RDF message it receives. OntoWiki provides functionality for the manual curation of the results of the knowledge extraction process and manages the data flow to the triple store Virtuoso¹⁰, the first component of the extraction and storage layer. In addition to a triple store, the second layer contains the Federated knOwledge eXtraction Framework FOX¹¹, which uses machine learning to combine and improve upon the results of NLP tools and converts these results into RDF by using the vocabularies displayed in Section 4.3. Virtuoso also contains a crawler that retrieves supplementary knowledge from the Web and links it to the information already available in the CMS by integrating it into the CMS. In the following, we present the central components of the SCMS stack in more detail.

[Figure 1: the components CMS, Wrapper, Orchestration Service, OntoWiki, FOX and Virtuoso, arranged in the Wrapper Layer, the Orchestration and Curation Layer and the Extraction and Storage Layer; edge labels: push (content), annotations (RDF) – async, text, annotations, injection, crawled news (optional), push (curation changes)]
Fig. 1. Architecture and paths of communication of components in the SCMS content semantification system

⁷ http://drupal.org
⁸ http://typo3.org
⁹ http://conx.at
¹⁰ http://virtuoso.openlinksw.com
¹¹ http://fox.aksw.org

4 Tools and Vocabularies

In this section we describe the main components of the SCMS stack and how they fit together. As running example, we use a hypothetical content item contained in a Drupal CMS. This node (in Drupal terminology) consists of two parts:

– the title "Prometeus" and
– a body that contains the sentence "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest.".

Only the body of the content item is to be annotated by the SCMS stack. Note that for reasons of brevity, we will only show the results of the extraction of named entities. Yet, SCMS can also extract keywords, keyphrases and relations.

4.1 Wrapper

A CMS wrapper (short: wrapper) is a component that is tightly integrated into a CMS (see Figure 2) and whose role is to ensure the communication between the CMS and the orchestration module of our framework.

[Figure 2: the components CMS, Wrapper and Orchestration Service; edge labels: ann. request, ann. response (async), inject RDFa]
Fig. 2. Architecture of communication between wrapper, CMS and orchestration service

In this respect, a wrapper has to fulfill three main tasks:

1. Request generation: Wrappers usually register for change events to the CMS editing system. Whenever a document has been edited, they generate an annotation request that abides by the vocabulary depicted in Figure 3. This request is then sent to the orchestration service.
2. Response receipt: Once the annotation has been carried out, the annotation results are sent back to the wrapper. The second of the wrapper's main tasks is consequently to react to those annotation responses and to store the annotations to the document appropriately (e.g., in a triple store). Since the annotation results are sent back asynchronously (i.e., in a separate request), the wrapper must provide a callback URL for this purpose.
3. Data processing: Once the data have been received and stored, wrappers usually integrate the annotations into the content items that were processed by the CMS. The integration of annotations is most commonly carried out by "injecting" the annotations as RDFa into the document's HTML rendering. The data injection is mostly realized by registering to document viewing events in the respective CMS and writing the RDFa from the wrapper's local triple store into the content items that are being viewed.

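The request generation task can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Drupal wrapper itself: the namespace IRI bound to scms: is a placeholder (the paper does not spell it out), the IRIs chosen for content:, dc: and sioc: are the commonly used ones and may differ from those of the actual wrapper, and `ContentItem`, `turtle_literal` and `build_annotation_request` are hypothetical names. The property and class names (scms:Request, scms:document, scms:annotate, scms:callbackEndpoint) follow the vocabulary of Figure 3.

```python
from dataclasses import dataclass

# Hypothetical namespace IRI for the scms: vocabulary; the paper does not give it.
SCMS_NS = "http://example.org/scms#"


@dataclass
class ContentItem:
    """Minimal stand-in for a CMS content item (a Drupal node in the running example)."""
    uri: str
    title: str
    body: str


def turtle_literal(text: str) -> str:
    """Quote a Python string as a Turtle string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def build_annotation_request(item: ContentItem, callback_url: str) -> str:
    """Serialize an annotation request for one content item as Turtle."""
    lines = [
        # Assumed prefix IRIs (widely used ones, not confirmed by the paper).
        "@prefix content: <http://purl.org/rss/1.0/modules/content/> .",
        "@prefix dc: <http://purl.org/dc/elements/1.1/> .",
        "@prefix sioc: <http://rdfs.org/sioc/ns#> .",
        f"@prefix scms: <{SCMS_NS}> .",
        "",
        # The request: where to send results, which document, which property to annotate.
        "[] a scms:Request ;",
        f"    scms:callbackEndpoint <{callback_url}> ;",
        f"    scms:document <{item.uri}> ;",
        "    scms:annotate content:encoded .",
        "",
        # The document description; a wrapper may add or omit properties freely.
        f"<{item.uri}> a sioc:Item ;",
        f"    dc:title {turtle_literal(item.title)} ;",
        f"    content:encoded {turtle_literal(item.body)} .",
    ]
    return "\n".join(lines)
```

A wrapper would call `build_annotation_request` from its change-event hook and send the result to the orchestration service over HTTP; the callback URL is the endpoint on which it later receives the asynchronous response.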
An example of a wrapper request for our example is shown in Listing 1. The content:encoded of the Drupal node http://example.com/drupal/node/10 is to be annotated by FOX. In addition, the whole node is to be stored in the triple store for the purpose of manual processing. Note that the wrapper can choose not to send portions of the content item that are not to be stored in the triple store, e.g., private data. In addition, note that the description of a document is not limited to certain properties or to a certain number thereof, which ensures the high level of flexibility of the SCMS stack. Moreover, the RDF data extracted by SCMS can be easily merged with any structured information provided natively by the CMS (i.e., metadata such as author information). Consequently, SCMS enables CMS that already provide metadata as RDF to answer complex questions that combine data and metadata, e.g., Which authors wrote documents that are related to Budapest?

[Figure 3: nodes "a scms:Request", "a sioc:Item", "a rdf:Resource" and three xsd:string literals; edge labels: scms:document, scms:annotate, scms:callbackEndpoint, dc:title, dc:description, content:encoded]
Fig. 3. Vocabulary used by the wrapper requests

    @prefix content: <…> .
    @prefix dc: <…> .
    @prefix sioc: <…> .
    @base <…> .

    <…> a <…> ;
        <…> <…> ;
        <…> <…> ;
        <…> content:encoded .

    <…> a sioc:Item ;
        dc:title "Prometeus" ;
        content:encoded "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest." .

Listing 1. Example annotation request as sent by the Drupal wrapper

4.2 Orchestration Service

The main tasks of the orchestration service are to capture state information and to distribute the data across SCMS' layers. The first of the tasks is due to the FOX framework having been designed to be stateless. The orchestration service captures state information by splitting up each document-based annotation request by a wrapper into several property-based annotation requests that are sent to FOX. In our example, the orchestration service detects that solely the content:encoded property is to be annotated. Then, it reads the content of that property from the wrapper request and generates the annotation request "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest." for FOX. Note that while this property-based annotation request consists exclusively of text or HTML and does not contain any RDF, the response returned by FOX is a RDF document serialized in Turtle or RDF/XML. The annotation results returned by FOX are combined by the orchestration service into the annotation response. Therewith, the relation between the input document and the annotations extracted by FOX is re-established. When all annotations for a particular request have been received and combined, the annotation response is sent back to the wrapper via the provided callback URL. In addition, the results sent back to the wrapper are stored in OntoWiki to facilitate the curation of annotations extracted automatically.

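The splitting and recombination step described above can be sketched as follows. The sketch assumes a much simplified in-memory data model; `Annotation`, `LinkedAnnotation`, `orchestrate` and `fake_extract` are hypothetical names, and `fake_extract` merely stands in for a call to FOX. The point it illustrates is that the extractor only ever sees plain text, while the orchestration step remembers which document and property each text came from.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """One annotation as returned by the stateless extraction service."""
    begin: int
    end: int
    body: str
    entity_class: str


@dataclass
class LinkedAnnotation:
    """An annotation re-attached to the document and property it came from."""
    document: str
    prop: str
    annotation: Annotation


def orchestrate(
    document: str,
    properties: Dict[str, str],
    annotate: List[str],
    extract: Callable[[str], List[Annotation]],
) -> List[LinkedAnnotation]:
    """Split a document-based request into property-based ones and merge the results."""
    response: List[LinkedAnnotation] = []
    for prop in annotate:
        text = properties[prop]      # only text/HTML is handed to the extractor
        for ann in extract(text):    # the extractor knows nothing about the document
            response.append(LinkedAnnotation(document, prop, ann))
    return response


def fake_extract(text: str) -> List[Annotation]:
    """Stand-in for FOX (an assumption): finds the literal string 'Budapest' only."""
    start = text.find("Budapest")
    if start < 0:
        return []
    return [Annotation(start, start + len("Budapest"), "Budapest", "LOCATION")]
```

In the running example, only content:encoded is listed under `annotate`, so the title is never sent to the extractor, and every returned annotation carries the node URI and the property name that correspond to scms:annotates and scms:property in the annotation response.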
The annotation response generated by the orchestration service for our example is shown in Listing 2. It relies upon the output sent by FOX. The exact meaning of the predicates used by FOX and forwarded by the orchestration service is explained in Section 4.3.

    @prefix scmsann: <…> .
    @prefix ctag: <…> .
    @prefix xsd: <…> .
    @prefix rdf: <…> .
    @prefix ann: <…> .
    @prefix scms: <…> .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:annotates <…> ;
        scms:property <…> ;
        scms:beginIndex "70"^^xsd:int ;
        scms:endIndex "77"^^xsd:int ;
        scms:means <…> ;
        scms:source <…> ;
        ann:body "Hungary"^^xsd:string .

    [] a ann:Annotation, scmsann:ORGANIZATION ;
        scms:annotates <…> ;
        scms:property <…> ;
        scms:beginIndex "12"^^xsd:int ;
        scms:endIndex "21"^^xsd:int ;
        scms:means <…> ;
        scms:source <…> ;
        ann:body "Prometeus"^^xsd:string .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:annotates <…> ;
        scms:property <…> ;
        scms:beginIndex "85"^^xsd:int ;
        scms:endIndex "93"^^xsd:int ;
        scms:means <…> ;
        scms:source <…> ;
        ann:body "Budapest"^^xsd:string .

Listing 2. Example annotation response as sent by the orchestration service

4.3 FOX

The FOX framework is a stateless and extensible framework that encompasses all the NLP functionality necessary to extract knowledge from the content of CMS. Its architecture consists of three layers as shown in Figure 4. FOX takes text or HTML as input. This data is sent to the controller layer, which implements the functionality necessary to clean the data, i.e., remove HTML and XML tags as well as further noise.

[Figure 4: three layers (Controller Layer, ML Layer, Tool Layer) with the boxes Controller, Training, Prediction, Named Entity Recognition, Keyword Extraction, Relation Extraction and Lookup Module]
Fig. 4. Architecture of the FOX framework

Once the data has been cleaned, the controller layer begins with the orchestration of the tools in the tool layer. Each of the tools is assigned a thread from a thread pool, so as to maximize usage of multi-core CPUs. Every thread runs its tool and generates an event once it has completed its computation. In the event that a tool does not complete after a set time, the corresponding thread is terminated.

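The thread-pool orchestration with a time limit can be sketched with Python's standard library. This is an illustration only and differs from the behaviour described above in one respect: Python cannot forcibly terminate a running thread, so the sketch abandons tools that exceed the limit instead of killing their threads. `run_tools`, `fast_tool` and `slow_tool` are hypothetical names, and the two tools are stand-ins rather than FOX components.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Dict, List

Tool = Callable[[str], List[str]]


def run_tools(tools: Dict[str, Tool], text: str, timeout_s: float) -> Dict[str, List[str]]:
    """Run every tool on `text` in its own pool thread and collect what finishes in time."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(tools)))
    futures = {pool.submit(tool, text): name for name, tool in tools.items()}
    done, _late = wait(futures, timeout=timeout_s)
    # Python threads cannot be killed from outside, so late tools are only abandoned:
    # we stop waiting for them and never read their results.
    pool.shutdown(wait=False, cancel_futures=True)  # cancel_futures needs Python 3.9+
    return {futures[f]: f.result() for f in done if f.exception() is None}


# Two stand-in tools (hypothetical, not FOX components).
def fast_tool(text: str) -> List[str]:
    """Returns immediately with whitespace-separated tokens."""
    return text.split()


def slow_tool(text: str) -> List[str]:
    """Simulates a tool that exceeds the time limit."""
    time.sleep(0.5)
    return ["late"]
```

Calling `run_tools({"fast": fast_tool, "slow": slow_tool}, "Budapest Hungary", timeout_s=0.1)` yields only the result of the fast tool; the results that arrive in time would then be handed to the prediction module.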
So far, FOX integrates tools for KE, NER and RE. The KE is realized by PoolParty¹² for extracting keywords from a controlled vocabulary, KEA¹³ and the Yahoo Term Extraction service¹⁴ for statistical extraction and several other tools. In addition, FOX integrates the Stanford Named Entity Recognizer¹⁵ [10], the Illinois Named Entity Tagger¹⁶ [25] and commercial software for NER. The RE is carried out by using the CARE platform¹⁷. The results from the tool layer are forwarded to the prediction module of the machine-learning layer. The role of the prediction module is to generate FOX's output based on the output of the tools in FOX's backend. For this purpose, it implements several ensemble learning techniques [8] with which it can combine the output of several tools. Currently, the prediction module carries out this combination by using a feed-forward neural network. The neural network inserted in FOX was trained by using 117 news articles. It reached 89.21% F-score in an evaluation based on a ten-fold cross-validation on NER, therewith outperforming even commercial systems¹⁸.

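To make the idea of combining tool outputs with a feed-forward network concrete, the following sketch computes the forward pass of a tiny network over the votes of three tools for a single token. Everything specific in it is an assumption: the paper does not describe the input features, the network size or the weights, and the weights below are hand-picked for illustration (they happen to implement a soft majority vote), not trained on the 117 news articles.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs: Sequence[float], weights: Sequence[Sequence[float]],
          biases: Sequence[float]) -> List[float]:
    """One fully connected layer with sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hand-picked illustrative weights (an assumption, not FOX's trained network).
HIDDEN_W = [[2.0, 2.0, 2.0], [-2.0, -2.0, -2.0]]
HIDDEN_B = [-3.0, 3.0]
OUT_W = [4.0, -4.0]
OUT_B = 0.0


def combine_votes(votes: Sequence[float]) -> float:
    """Score in (0, 1) that a token is a named entity, given one vote (0 or 1) per tool."""
    hidden = dense(votes, HIDDEN_W, HIDDEN_B)
    return dense(hidden, [OUT_W], [OUT_B])[0]
```

With these weights, `combine_votes([1, 1, 0])` lies above 0.5 and `combine_votes([1, 0, 0])` below it. A trained network could instead learn to trust individual tools more for particular entity classes, which is what allows an ensemble to improve on its members.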
¹² http://poolparty.biz
¹³ http://www.nzdl.org/Kea/
¹⁴ http://developer.yahoo.com/search/content/V1/termExtraction.html
¹⁵ http://nlp.stanford.edu/software/CRF-NER.shtml
¹⁶ http://cogcomp.cs.illinois.edu/page/software_view/4
¹⁷ http://www.digitaltrowel.com/Technology/
¹⁸ More details on the evaluation are provided at http://fox.aksw.org

Once the neural network has combined the output of the tools and generated a better prediction of the named entities, the output of FOX is generated by using the vocabularies shown in Figure 5.
These vocabularies extend the two broadly used vocabularies Annotea¹⁹ and Autotag²⁰. In particular, we added the constructs explicated in the following:

– scms:beginIndex denotes the index in a literal value string at which a particular annotation or keyphrase begins;
– scms:endIndex stands for the index in a literal value string at which a particular annotation or keyphrase ends;
– scms:means marks the URI assigned to a named entity identified for an annotation;
– scms:source denotes the provenance of the annotation, i.e., the URI of the tool which computed the annotation or even the system ID of the person who curated or created the annotation and
– scmsann is the namespace for the annotation classes, i.e., location, person, organization and miscellaneous.

Given that the overhead due to the merging of the results via the neural network is of only a few milliseconds and thanks to the multi-core architecture of current servers, FOX is almost as time-efficient as state-of-the-art tools. Still, as our evaluation shows, these few milliseconds of overhead can lead to an increase of more than 13% F-score (see Section 6). The output of FOX for our example is shown in Listing 3. This is the output that is forwarded to the orchestration service, which adds provenance information to the RDF before sending an answer to the callback URI provided by the wrapper. By these means, we ensure that the wrapper can write the RDFa in the right segment of the item content.

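How a wrapper might use scms:beginIndex and scms:endIndex to write RDFa into the right segment of the text can be sketched as follows. The sketch assumes plain-text input, zero-based offsets with an exclusive end index (which is consistent with the offsets of the running example: 12–21 for Prometeus, 70–77 for Hungary, 85–93 for Budapest) and one possible RDFa rendering based on the `typeof` and `resource` attributes. The entity IRIs used in the test are placeholders, and `Annotation` and `inject_rdfa` are hypothetical names; the markup produced by the actual Drupal wrapper is not given in the paper.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Annotation:
    begin: int          # scms:beginIndex (zero-based, assumed)
    end: int            # scms:endIndex (exclusive, assumed)
    entity_class: str   # e.g. "scmsann:LOCATION"
    means: str          # scms:means, the IRI assigned to the entity


def inject_rdfa(text: str, annotations: List[Annotation]) -> str:
    """Wrap every annotated span of `text` in an RDFa-carrying <span> element."""
    parts: List[str] = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.begin):
        if ann.begin < cursor:
            continue  # this simple sketch skips overlapping annotations
        parts.append(escape(text[cursor:ann.begin]))
        parts.append(
            f'<span typeof="{escape(ann.entity_class)}" '
            f'resource="{escape(ann.means)}">'
            f"{escape(text[ann.begin:ann.end])}</span>"
        )
        cursor = ann.end
    parts.append(escape(text[cursor:]))
    return "".join(parts)
```

Because the offsets refer to the literal value that was sent for annotation, the wrapper has to apply them to exactly that string; if the body contains HTML markup, the offsets need to be mapped back onto the rendered document first.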
¹⁹ http://www.w3.org/2000/10/annotation-ns#
²⁰ http://commontag.org/ns#

[Figure 5: (a) named entity annotation with the nodes "a ann:Annotation", "a rdf:Resource", xsd:string and xsd:integer and the edge labels scms:means, ann:body, scms:beginIndex, scms:endIndex, scms:tool; (b) keyword annotation with the nodes "a ctag:AutoTag", "a rdf:Resource" and xsd:string and the edge labels ctag:means, ctag:label, scms:tool, anyProp]
Fig. 5. Vocabularies used by FOX for representing named entities (a) and keywords (b)

4.4 OntoWiki

OntoWiki is a semantic data wiki [4] that was designed to facilitate the browsing and editing of RDF knowledge bases. Its browsing features range from arbitrary concept hierarchies to facet-based search and query building interfaces. Semantic content can be created and edited by using the RDFauthor system which has been integrated in OntoWiki [27].

OntoWiki plays two key roles within the SCMS stack. First, it serves as entry point for the triple store. This allows for the triple store to be exchanged without any drawback for the user, leading to an easy customization of our stack. In addition, OntoWiki plays the role of an annotation consolidation and curation tool and is consequently the center of the curation pipeline. To ensure that OntoWiki is always up-to-date, the orchestration service sends its annotation responses to both OntoWiki and the wrapper's callback URI. Thus, OntoWiki is also aware of the wrapper (i.e., its callback URI) and can send the results of any manual curation process back to the wrapper. Note that manually curated annotations are saved with a different (if manually created) or supplementary (if manually curated) value in their scms:source property. This gives consuming tools (e.g., wrappers) a chance to assign higher trust values to those annotations. In addition, if a new extraction run is performed on the same document, manually created and curated annotations can be kept for further use. Note that the crawler in Virtuoso can be used to fetch even more data pertaining to the annotations computed by FOX. This data can be sent directly to FOX and inserted in Virtuoso so as to extend the knowledge base for the CMS.

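The role of scms:source during a new extraction run can be illustrated with a short sketch. It assumes one possible merge policy, namely that annotations with at least one non-tool source are kept and take precedence over fresh automatic annotations on the same span; the paper only states that manually created and curated annotations can be kept. The source IRIs in the usage example are placeholders, and `StoredAnnotation`, `is_manual` and `merge_after_rerun` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class StoredAnnotation:
    begin: int
    end: int
    body: str
    sources: FrozenSet[str]  # all values of scms:source for this annotation


def is_manual(ann: StoredAnnotation, tool_sources: FrozenSet[str]) -> bool:
    """True if a person created the annotation (different source) or curated it (extra source)."""
    return any(source not in tool_sources for source in ann.sources)


def merge_after_rerun(
    stored: List[StoredAnnotation],
    fresh: List[StoredAnnotation],
    tool_sources: FrozenSet[str],
) -> List[StoredAnnotation]:
    """Keep manual annotations, drop stale automatic ones, add fresh ones on free spans."""
    kept = [ann for ann in stored if is_manual(ann, tool_sources)]
    taken = {(ann.begin, ann.end) for ann in kept}
    return kept + [ann for ann in fresh if (ann.begin, ann.end) not in taken]
```

A consuming tool could use the same `is_manual` check to assign a higher trust value to annotations that a person has created or confirmed.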
5 Use Case

The SCMS framework is being deployed in the renewable energy sector. The renewable energy and energy efficiency sector requires a large amount of up-to-date and high-quality information and data so as to develop and push the area of clean energy systems worldwide. This information, data and knowledge about clean energy technologies, developments, projects and laws per country worldwide helps policy and decision makers, project developers and financing agencies to make better decisions on investments as well as on clean energy projects to set up.

REEEP, the Renewable Energy and Energy Efficiency Partnership²¹, is a non-governmental organization that provides the aforementioned information to the respective target groups around the globe. For this purpose, REEEP has developed the reegle.info Information Gateway on Renewable Energy and Energy Efficiency²² that offers country profiles on clean energy and an Actors Catalog that contains the relevant stakeholders in the field per country. Furthermore, it supplies energy statistics and potentials as well as news on clean energy.

²¹ http://www.reeep.org
²² http://www.reegle.info

    @prefix scmsann: <…> .
    @prefix ctag: <…> .
    @prefix xsd: <…> .
    @prefix rdf: <…> .
    @prefix ann: <…> .
    @prefix scms: <…> .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:beginIndex "70"^^xsd:int ;
        scms:endIndex "77"^^xsd:int ;
        scms:means <…> ;
        scms:source <…> ;
        ann:body "Hungary"^^xsd:string .

    [] a ann:Annotation, scmsann:ORGANIZATION ;
        scms:beginIndex "12"^^xsd:int ;
        scms:endIndex "21"^^xsd:int ;
        scms:means <…> ;
        scms:source <…> ;
        ann:body "Prometeus"^^xsd:string .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:beginIndex "85"^^xsd:int ;
        scms:endIndex "93"^^xsd:int ;
        scms:means <…> ;
        scms:source <…> ;
        ann:body "Budapest"^^xsd:string .

Listing 3. Annotations as returned by FOX in Turtle format

The motivation behind applying SCMS to the REEEP data was to facilitate the integration of this data in semantic applications to support efficient decision making. To achieve this goal, we aimed to expand the reegle.info information gateway by adding RDFa to the unstructured information available on the website and by making the same triples available via a SPARQL endpoint. For our current prototype, we implemented a CMS wrapper for the Drupal CMS and imported the actors catalog of reegle into it (see Figure 6). This data was then processed by the SCMS stack as follows: All actors and country descriptions were sent to the orchestration service, which forwarded them to FOX. The RDF data extracted by FOX were sent back to the Drupal wrapper and written via OntoWiki into Virtuoso. The Drupal wrapper then used the keyphrases to extend the set of tags assigned to the corresponding profile in the CMS. The named entities were integrated in the page by using the positional information returned by FOX. By these means, we made the REEEP data accessible for humans (via the Web page) but also for machines (via OntoWiki's integrated SPARQL endpoint and via the RDFa written in the Web pages).

Our approach also makes the automated integration of novel knowledge sources in REEEP possible. To achieve this goal, several selected sources (web sources, blogs and news feeds) are currently being crawled and then analyzed by FOX to extract structured information out of the masses of unstructured text from the Internet.

Fig. 6. Screenshots of SCMS-enhanced Drupal

6 Evaluation

The usability of our approach depends heavily on the quality of the knowledge returned via automated means. Consequently, we evaluated the quality of the RDFa injected into the REEEP data by measuring the precision and recall of SCMS and compared it with that of a state-of-the-art commercial system (CS) whose name cannot be revealed for legal reasons. We chose CS because it outperformed freely available NER tools such as the Stanford Named Entity Recognizer²³ [10] and the Illinois Named Entity Tagger²⁴ [25] in a prior evaluation on a newspaper corpus. Within that evaluation, FOX reached 89.21% F-score and was 14% better than CS w.r.t. F-score²⁵. As it can happen that only segments of multi-word units are recognized as being named entities, we followed a token-wise evaluation of the SCMS system. Thus, if our system recognized United Kingdom of Great Britain as a LOCATION when presented with United Kingdom of Great Britain and Northern Ireland, it was scored with 5 true positives and 3 false negatives.

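The token-wise scoring can be reproduced with a few lines of Python. The sketch assumes that tokens are obtained by whitespace splitting and identified by their position in the sentence, which the paper does not specify; `token_counts` and `precision_recall_f1` are hypothetical helper names. Precision, recall and F-score follow their standard definitions, with the F-score being the harmonic mean of precision and recall.

```python
from typing import Set, Tuple


def token_counts(gold: Set[int], predicted: Set[int]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) over token positions."""
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return tp, fp, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall and F-score (harmonic mean of the two)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The example from the text: 8 gold tokens, of which the first 5 are recognized.
tokens = "United Kingdom of Great Britain and Northern Ireland".split()
gold_positions = set(range(len(tokens)))
predicted_positions = set(range(5))  # "United Kingdom of Great Britain"
counts = token_counts(gold_positions, predicted_positions)
scores = precision_recall_f1(*counts)
```

For this example, `counts` is `(5, 0, 3)`, matching the 5 true positives and 3 false negatives stated above, and `scores` gives a precision of 1.0 and a recall of 0.625.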
²³ http://nlp.stanford.edu/software/CRF-NER.shtml
²⁴ http://cogcomp.cs.illinois.edu/page/software_view/4
²⁵ More details at http://fox.aksw.org

Our evaluation was carried out with two different data sets. In our first evaluation, we measured the performance of both systems on country profiles crawled from the Web, i.e., on information that is to be added automatically to the REEEP knowledge bases. For this purpose, we selected 9 country descriptions randomly and annotated 34 sentences manually. These sentences contained 119 named entity tokens, of which 104 were locations and 15 organizations. In our second evaluation, we aimed at measuring how well SCMS performs on the data that can be found currently in the REEEP catalogue. For this purpose, we manually annotated 23 actors profiles, which consisted of 68 sentences. The resulting reference data contained 20 location, 78 organization and 11 person tokens. Note that both data sets are of very different nature, as the first consists mainly of locations while the second contains a large number of organizations and a relatively small number of locations.

The results of our evaluation are shown in Table 1. CS follows a very conservative strategy, which leads to it having very high precision scores of up to 100% in some experiments. Yet, its conservative strategy leads to a recall which is mostly significantly inferior to that of SCMS. The only category within which CS outperforms SCMS is the detection of persons in the actors profile data. This is due to it detecting 6 out of the 11 person tokens in the data set, while SCMS only detects 5. In all other cases, SCMS outperforms CS by up to 13% F-score (detection of organizations in the country profiles data set). Overall, SCMS outperforms CS by 7% F-score on country profiles and almost 9% F-score on actors.

Table 1. Evaluation results on country and actors profiles. The superior F-score for each category is marked with an asterisk.

                              Country Profiles        Actors Profiles
Entity Type    Measure        FOX        CS           FOX        CS
Location       Precision      98%        100%         83.33%     100%
               Recall         94.23%     78.85%       90%        70%
               F-Score        96.08%*    88.17%       86.54%*    82.35%
Organization   Precision      73.33%     100%         57.14%     90.91%
               Recall         68.75%     40%          69.23%     47.44%
               F-Score        70.97%*    57.14%       62.72%*    62.35%
Person         Precision      –          –            100%       100%
               Recall         –          –            45.45%     54.55%
               F-Score        –          –            62.5%      70.59%*
Overall        Precision      93.97%     100%         85.16%     98.2%
               Recall         91.60%     74.79%       70.64%     52.29%
               F-Score        92.77%*    85.58%       77.22%*    68.24%

7 Conclusion

In this paper, we presented the SCMS framework for extracting structured data from CMS content. We presented the architecture of our approach and explained how each of its components works. In addition, we explained the vocabularies utilized by the components of our framework. We presented one use case for the SCMS system, i.e., how SCMS is used in the renewable energy sector. The SCMS stack abides by the criteria of accuracy and flexibility. The flexibility of our approach is ensured by the combination of RDF messages that can be easily extended and of standard Web communication protocols. The accuracy of SCMS was demonstrated in an evaluation on actor and country profiles, within which SCMS outperformed even commercial software. Our approach can be extended by adding support for negative statements, i.e., statements that are not correct but can be found in different knowledge sources across the data landscape analyzed by our framework. In addition, the feedback generated by users will be integrated in the training of the framework to make it even more accurate over time.

References

1. Adrian, B., Hees, J., Herman, I., Sintek, M., Dengel, A.: Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web of Data. In: Cimiano, P., Pinto, H.S. (eds.) EKAW 2010. LNCS, vol. 6317, pp. 178–192. Springer, Heidelberg (2010)
2. Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: ACM DL, pp. 85–94 (2000)
3. Amsler, R.: Research towards the development of a lexical knowledge base for natural language processing. SIGIR Forum 23, 1–2 (1989)
4. Auer, S., Dietzold, S., Riechert, T.: OntoWiki – A Tool for Social, Semantic Collaboration. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 736–749. Springer, Heidelberg (2006)
5. Brin, S.: Extracting Patterns and Relations from the World Wide Web. In: Atzeni, P., Mendelzon, A.O., Mecca, G. (eds.) WebDB 1998. LNCS, vol. 1590, pp. 172–183. Springer, Heidelberg (1999)
6. Coates-Stephens, S.: The analysis and acquisition of proper names for the understanding of free text. Computers and the Humanities 26, 441–456 (1992), doi:10.1007/BF00136985
7. Curran, J.R., Clark, S.: Language independent ner using a maximum entropy tagger. In: HLT-NAACL, pp. 164–167 (2003)
8. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
9. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134 (2005)
10. Finkel, J., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by gibbs sampling. In: ACL, pp. 363–370 (2005)
11. Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific keyphrase extraction. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 1999, pp. 668–673. Morgan Kaufmann Publishers Inc., San Francisco (1999)
12. Grishman, R., Yangarber, R.: Nyu: Description of the Proteus/Pet system as used for MUC-7 ST. In: MUC-7. Morgan Kaufmann (1998)
13. Harabagiu, S., Bejan, C.A., Morarescu, P.: Shallow semantics for relation extraction. In: IJCAI, pp. 1061–1066 (2005)
14. Huynh, D., Mazzocchi, S., Karger, D.R.: Piggy Bank: Experience the Semantic Web Inside Your Web Browser. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 413–430. Springer, Heidelberg (2005)
15. Kim, S.N., Kan, M.-Y.: Re-examining automatic keyphrase extraction approaches in scientific articles. In: MWE 2009, pp. 9–16 (2009)
16. Kim, S.N., Medelyan, O., Kan, M.-Y., Baldwin, T.: Semeval-2010 task 5: Automatic keyphrase extraction from scientific articles. In: SemEval 2010, pp. 21–26. Association for Computational Linguistics, Stroudsburg (2010)
17. Matsuo, Y., Ishizuka, M.: Keyword Extraction From A Single Document Using Word Co-Occurrence Statistical Information. International Journal on Artificial Intelligence Tools 13(1), 157–169 (2004)
18. Nadeau, D.: Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. PhD thesis, University of Ottawa (2007)
19. Nadeau, D., Turney, P., Matwin, S.: Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity. In: Lamontagne, L., Marchand, M. (eds.) Canadian AI 2006. LNCS (LNAI), vol. 4013, pp. 266–277. Springer, Heidelberg (2006)
20. Nguyen, D.P.T., Matsuo, Y., Ishizuka, M.: Relation extraction from wikipedia using subtree mining. In: AAAI, pp. 1414–1420 (2007)
21. Nguyen, T.D., Kan, M.-Y.: Keyphrase Extraction in Scientific Publications. In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 317–326. Springer, Heidelberg (2007)
22. Pantel, P., Pennacchiotti, M.: Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In: ACL, pp. 113–120 (2006)
23. Park, Y., Byrd, R.J., Boguraev, B.K.: Automatic glossary extraction: beyond terminology identification. In: COLING, pp. 1–7 (2002)
24. Pasca, M., Lin, D., Bigham, J., Lifchits, A., Jain, A.: Organizing and searching the world wide web of facts - step one: the one-million fact extraction challenge. In: Proceedings of the 21st National Conference on Artificial Intelligence, vol. 2, pp. 1400–1405. AAAI Press (2006)
25. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CONLL, pp. 147–155 (2009)
26. Thielen, C.: An approach to proper name tagging for german. In: Proceedings of the EACL 1995 SIGDAT Workshop (1995)
27. Tramp, S., Heino, N., Auer, S., Frischmuth, P.: RDFauthor: Employing RDFa for Collaborative Knowledge Engineering. In: Cimiano, P., Pinto, H.S. (eds.) EKAW 2010. LNCS, vol. 6317, pp. 90–104. Springer, Heidelberg (2010)
28. Turney, P.D.: Coherent keyphrase extraction via web mining. In: IJCAI, San Francisco, CA, USA, pp. 434–439 (2003)
29. Walker, D., Amsler, R.: The use of machine-readable dictionaries in sublanguage analysis. Analysing Language in Restricted Domains (1986)
30. Wang, G., Yu, Y., Zhu, H.: PORE: Positive-Only Relation Extraction from Wikipedia Text. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudre-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 580–594. Springer, Heidelberg (2007)
31. Yan, Y., Okazaki, N., Matsuo, Y., Yang, Z., Ishizuka, M.: Unsupervised relation extraction by mining wikipedia texts using information from the web. In: ACL 2009, pp. 1021–1029 (2009)
32. Zhou, G., Su, J.: Named entity recognition using an hmm-based chunk tagger. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL 2002, pp. 473–480. Association for Computational Linguistics, Morristown (2002)
