SCMS – Semantifying Content Management Systems

Axel-Cyrille Ngonga Ngomo (1), Norman Heino (1), Klaus Lyko (1), Rene Speck (1), and Martin Kaltenböck (2)

(1) University of Leipzig, AKSW Group, Johannisgasse 26, 04103 Leipzig
(2) Semantic Web Company, Lerchenfeldergürtel 43, A-1160 Vienna

Abstract. The migration to the Semantic Web requires CMS to integrate human- and machine-readable data so as to support their seamless integration into the Semantic Web. Yet, there is still a blatant need for frameworks that can be easily integrated into CMS and that make it possible to transform their content into machine-readable knowledge with high accuracy. In this paper, we describe the SCMS (Semantic Content Management Systems) framework, whose main goals are the extraction of knowledge from unstructured data in any CMS and the integration of the extracted knowledge into the same CMS. Our framework integrates a highly accurate knowledge extraction pipeline. In addition, it relies on the RDF and HTTP standards for communication and can thus be integrated in virtually any CMS. We present how our framework is being used in the energy sector. We also evaluate our approach and show that our framework outperforms even commercial software by reaching up to 96% F-score.
L. Aroyo et al. (Eds.): ISWC 2011, Part II, LNCS 7032, pp. 189–204, 2011. © Springer-Verlag Berlin Heidelberg 2011

1 Introduction

Content Management Systems (CMS) encompass most of the information available on the document-oriented Web (also referred to as Human Web). Therewith, they constitute the interface between humans and the data on the Web. Consequently, one of the main tasks of CMS has always been to make their content as easily processable for humans as possible. Still, with the migration from the document-oriented to the Semantic Web, there is an increasing need to insert machine-readable data into the content of CMS so as to enable the seamless integration of their content into the Semantic Web. Given the sheer volume of data available on the document-oriented Web, the insertion of machine-readable data must be carried out (semi-)automatically. The frameworks developed for the purpose of automatic knowledge extraction must therefore be accurate (i.e., display high F-scores) so as to ensure that humans need to curate a minimal amount of the knowledge extracted automatically. This criterion is central for the use of automatic knowledge extraction, as approaches with a low recall lead to humans having to find the false negatives¹ by hand, while a low precision forces the same humans to have to continually check the output of the knowledge extraction framework. A further criterion that determines the usability of a knowledge extraction framework is its flexibility, i.e., how easy it is to integrate this framework in CMS. This criterion is of high importance as the current CMS landscape consists of hundreds of very heterogeneous frameworks implemented in dozens of different languages².

In this paper, we describe the SCMS framework³. The main goal of our framework is to allow the extraction of structured data (i.e., RDF) out of the unstructured content of CMS, the linking of this content with the Web of Data and the integration of this wealth of knowledge back into the CMS. SCMS relies exclusively on RDF messages and simple Web protocols for its integration into existing CMS and the processing of their content. Thus, it is highly flexible and can be used with virtually any CMS. In addition, the underlying approach implements a highly accurate knowledge extraction pipeline that can be configured easily for the user's purposes. This pipeline makes it possible to merge and improve the results of state-of-the-art tools for information extraction, to manually post-process the results at will and to integrate the extracted knowledge into CMS, for example as RDFa.

The main contributions of this paper are the following:

1. We present the architecture of our approach and show that it can be integrated easily in virtually any CMS, provided it offers sufficient hooks into the life-cycle of its managed content items.
2. We give an overview of the vocabularies we use to represent the knowledge extracted from CMS.
3. We present how our approach is being used in a use case centered around renewable energy.
4. We evaluate our approach against a state-of-the-art commercial system for knowledge extraction in two practical use cases and show that we outperform the commercial system with respect to F-score while reaching up to 96% F-score on the extraction of locations.

The rest of this paper is structured as follows: We start by giving an overview of related work from the NLP and the Semantic Web community in Section 2. Thereafter, we present the SCMS framework (Section 3) and its main components (Section 4) as well as the vocabularies they use. Subsequently, we outline the renewable energy use case within which our framework is being deployed in Section 5. Section 6 then presents the results of an evaluation of our framework in two use cases against an enterprise commercial system (CS) whose name cannot be revealed for legal reasons. Finally, we give an overview of our future work and conclude.

¹ I.e., the entities and relations that were not found by the software.
² A list of CMS on the market can be found at http://en.wikipedia.org/wiki/List_of_content_management_systems
³ http://www.scms.eu

2 Related Work

Information Extraction is the backbone of knowledge extraction and is one of the core tasks of NLP.
Three main categories of NLP tools play a central role during the extraction of knowledge from text: Keyphrase Extraction (KE), Named Entity Recognition (NER) and Relation Extraction (RE). The automatic detection of keyphrases (i.e., multi-word units or text fragments that capture the essence of a document) has been an important task of NLP for decades. Still, due to the very ambiguous definition of what an appropriate keyphrase is, current approaches to the extraction of keyphrases still display low F-scores [16]. According to [15], the majority of the approaches to KE implement combinations of statistical, rule-based or heuristic methods [11,21] on mostly document [17], keyphrase [28] or term cohesion features [23]. NER aims to discover instances of predefined classes of entities (e.g., persons, locations, organizations or products) in text. Most NER tools implement one of three main categories of approaches: dictionary-based [29,3], rule-based [6,26] and machine-learning approaches [18]. Nowadays, the methods of choice are borrowed from supervised machine learning when training examples are available [32,7,10]. Yet, due to the scarcity of large domain-specific training corpora, semi-supervised [24,18] and unsupervised machine learning approaches [19,9] have also been used for extracting named entities from text.

The extraction of relations from unstructured data builds upon work for NER and KE to determine the entities between which relations might exist. Some early work on pattern extraction relied on supervised machine learning [12]. Yet, such approaches demanded large amounts of training data. The subsequent generation of approaches to RE aimed at bootstrapping patterns based on a small number of input patterns and instances [5,2]. Newer approaches aim to either collect redundancy information from the whole Web [22] or Wikipedia [30,31] in an unsupervised manner or to use linguistic analysis [13,20] to harvest generic patterns for relations.

In addition to the work done by the NLP community, several tools and frameworks have been developed explicitly for extracting RDF and RDFa out of NL [1]. For example, the Firefox extension Piggy Bank [14] extracts RDF from web pages by using screen scrapers. The RDF extracted from these web pages is then stored locally in a Sesame store. Storing the data locally allows the user to merge the data extracted from different websites to perform semantic operations. More recently, the Drupal extension OpenPublish⁴ was released. The aim of this extension is to support content publishers with the automatic annotation of their data. For this purpose, OpenPublish utilizes the services provided by OpenCalais⁵ to annotate the content of news entries. Epiphany [1] implements a service that annotates web pages automatically with entities found in the Linked Data Cloud. Apache Stanbol⁶ implements similar functionality on a larger scale by providing synchronous RESTful interfaces that allow Content Management Systems to extract annotations from text.

⁴ http://www.openpublish.com
⁵ http://www.opencalais.org
⁶ http://incubator.apache.org/stanbol

The main drawback of current frameworks is that they either focus on one particular task (e.g., finding named entities in text) or make use of NLP algorithms without improving upon them. Consequently, they have the same limitations as the NLP approaches discussed above. To the best of our knowledge, our framework is the first framework designed explicitly for the purposes of the Semantic Web that combines flexibility with accuracy. The flexibility of SCMS has been shown by its deployment on Drupal⁷, Typo3⁸ and conX⁹. In addition, our framework is able to extract RDF from NL with an accuracy superior to that of commercial systems, as shown by our evaluation. Our framework also provides a machine-learning module that makes it possible to tailor it to new domains and classes of named entities. Moreover, SCMS provides dedicated interfaces for interacting (e.g., editing, querying, merging) with the triples extracted, making it usable in a large number of domains and use cases.

3 The SCMS Framework

An overview of the architecture behind SCMS is given in Figure 1. The framework consists of two layers: an orchestration and curation layer and an extraction and storage layer. The CMS that is to be extended with semantic capabilities resides upon our framework and must be extended minimally via a CMS wrapper. This extension implements the in- and output behavior of the CMS and communicates exclusively with the first layer of our framework, thus making the components of the extraction and storage layer of our framework swappable without any drawback for the users.

Fig. 1. Architecture and paths of communication of components in the SCMS content semantification system (layers shown: Wrapper Layer, Orchestration and Curation Layer, Extraction and Storage Layer; components shown: CMS, Wrapper, Orchestration Service, OntoWiki, FOX, Virtuoso)

The overall goal of the first layer of the SCMS framework is to coordinate the access to the data. It consists of two tools: the orchestration service and the data wiki OntoWiki. The orchestration service is the input gate of SCMS. It receives the data that is to be annotated as an RDF message that abides by the vocabulary presented in Section 4.2 and returns the results of the framework to the endpoint specified in the RDF message it receives. OntoWiki provides functionality for the manual curation of the results of the knowledge extraction process and manages the data flow to the triple store Virtuoso¹⁰, the first component of the extraction and storage layer. In addition to a triple store, the second layer contains the Federated knOwledge eXtraction Framework FOX¹¹, which uses machine learning to combine and improve upon the results of NLP tools and converts these results into RDF by using the vocabularies displayed in Section 4.3. Virtuoso also contains a crawler that makes it possible to retrieve supplementary knowledge from the Web and link it to the information already available in the CMS by integrating it into the CMS. In the following, we present the central components of the SCMS stack in more detail.

⁷ http://drupal.org
⁸ http://typo3.org
⁹ http://conx.at
¹⁰ http://virtuoso.openlinksw.com
¹¹ http://fox.aksw.org

4 Tools and Vocabularies

In this section we describe the main components of the SCMS stack and how they fit together. As a running example, we use a hypothetical content item contained in a Drupal CMS. This node (in Drupal terminology) consists of two parts:

– the title "Prometeus" and
– a body that contains the sentence "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest.".

Only the body of the content item is to be annotated by the SCMS stack. Note that for reasons of brevity, we will only show the results of the extraction of named entities. Yet, SCMS can also extract keywords, keyphrases and relations.

4.1 Wrapper

A CMS wrapper (wrapper for short) is a component that is tightly integrated into a CMS (see Figure 2) and whose role is to ensure the communication between the CMS and the orchestration module of our framework. In this respect, a wrapper has to fulfill three main tasks:

1. Request generation: Wrappers usually register for change events to the CMS editing system. Whenever a document has been edited, they generate an annotation request that abides by the vocabulary depicted in Figure 3. This request is then sent to the orchestration service.
2. Response receipt: Once the annotation has been carried out, the annotation results are sent back to the wrapper. The second of the wrapper's main tasks is consequently to react to those annotation responses and to store the annotations to the document appropriately (e.g., in a triple store). Since the annotation results are sent back asynchronously (i.e., in a separate request), the wrapper must provide a callback URL for this purpose.
3. Data processing: Once the data have been received and stored, wrappers usually integrate the annotations into the content items that were processed by the CMS. The integration of annotations is most commonly carried out by "injecting" the annotations as RDFa into the document's HTML rendering. The data injection is mostly realized by registering to document viewing events in the respective CMS and writing the RDFa from the wrapper's local triple store into the content items that are being viewed.

Fig. 2. Architecture of communication between wrapper, CMS and orchestration service
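
The request generation task can be illustrated with a small sketch. It is not the Drupal wrapper's actual code: it only assembles a Turtle document that uses the class and property names visible in Figure 3 (scms:Request, scms:document, scms:annotate, scms:callbackEndpoint, sioc:Item). The scms namespace IRI, the request IRI and the callback IRI below are placeholders chosen for illustration, since the paper's text does not preserve the real ones.

```python
def turtle_literal(text):
    """Quote a Python string as a Turtle string literal."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return f'"{escaped}"'


def build_annotation_request(request_iri, document_iri, callback_iri,
                             literals, annotate, prefixes):
    """Serialize a document-based annotation request as Turtle text.

    literals: {CURIE: text} for the document properties the wrapper sends.
    annotate: CURIEs of the properties that are to be annotated.
    prefixes: {prefix: namespace IRI}; must declare every prefix in use.
    """
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in prefixes.items()]

    # The request node: which document, which properties, where to call back.
    request = ["a scms:Request", f"scms:document <{document_iri}>"]
    request += [f"scms:annotate {curie}" for curie in annotate]
    request.append(f"scms:callbackEndpoint <{callback_iri}>")

    # The document node: only the properties the wrapper chooses to send.
    document = ["a sioc:Item"]
    document += [f"{curie} {turtle_literal(text)}"
                 for curie, text in literals.items()]

    for subject, parts in ((request_iri, request), (document_iri, document)):
        lines.append("")
        lines.append(f"<{subject}> " + " ;\n    ".join(parts) + " .")
    return "\n".join(lines)


PREFIXES = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "sioc": "http://rdfs.org/sioc/ns#",
    "scms": "http://example.org/scms#",  # placeholder, not the real namespace
}

print(build_annotation_request(
    request_iri="http://example.com/drupal/request/10",      # hypothetical
    document_iri="http://example.com/drupal/node/10",
    callback_iri="http://example.com/drupal/scms/callback",  # hypothetical
    literals={
        "dc:title": "Prometeus",
        "content:encoded": "The company Prometeus is an energy provider "
                           "located in the capital of Hungary, i.e., Budapest.",
    },
    annotate=["content:encoded"],
    prefixes=PREFIXES,
))
```

Leaving a property out of `literals` is how a wrapper in this sketch would withhold private data: only what is serialized here ever reaches the orchestration service.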

An example of a wrapper request for our example is shown in Listing 1. The content:encoded of the Drupal node http://example.com/drupal/node/10 is to be annotated by FOX. In addition, the whole node is to be stored in the triple store for the purpose of manual processing. Note that the wrapper can choose not to send portions of the content item that are not to be stored in the triple store, e.g., private data. In addition, note that the description of a document is not limited to certain properties or to a certain number thereof, which ensures the high level of flexibility of the SCMS stack. Moreover, the RDF data extracted by SCMS can be easily merged with any structured information provided natively by the CMS (i.e., metadata such as author information). Consequently, SCMS enables CMS that already provide metadata as RDF to answer complex questions that combine data and metadata, e.g., Which authors wrote documents that are related to Budapest?

Fig. 3. Vocabulary used by the wrapper requests (classes shown: scms:Request, sioc:Item, rdf:Resource; properties shown: scms:document, scms:annotate, scms:callbackEndpoint, dc:title, dc:description, content:encoded)

    @prefix content: .
    @prefix dc: .
    @prefix sioc: .
    @base .

    a ;
        ;
        ;
        content:encoded .

    a sioc:Item ;
        dc:title "Prometeus" ;
        content:encoded "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest." .

Listing 1. Example annotation request as sent by the Drupal wrapper

4.2 Orchestration Service

The main tasks of the orchestration service are to capture state information and to distribute the data across SCMS' layers.
The first of the tasks is due to the FOX framework having been designed to be stateless. The orchestration service captures state information by splitting up each document-based annotation request by a wrapper into several property-based annotation requests that are sent to FOX. In our example, the orchestration service detects that solely the content:encoded property is to be annotated. Then, it reads the content of that property from the wrapper request and generates the annotation request "The company Prometeus is an energy provider located in the capital of Hungary, i.e., Budapest." for FOX. Note that while this property-based annotation request consists exclusively of text or HTML and does not contain any RDF, the response returned by FOX is an RDF document serialized in Turtle or RDF/XML. The annotation results returned by FOX are combined by the orchestration service into the annotation response. Therewith, the relation between the input document and the annotations extracted by FOX is re-established. When all annotations for a particular request have been received and combined, the annotation response is sent back to the wrapper via the provided callback URL. In addition, the results sent back to the wrapper are stored in OntoWiki to facilitate the curation of annotations extracted automatically.
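
The split-and-recombine behavior described above can be sketched as follows. This is a simplified illustration under several assumptions: plain dictionaries stand in for RDF, the call to FOX is modeled as a synchronous function call, and `stub_fox` is a hypothetical stand-in for the real extractor. Delivery to the callback URL and storage in OntoWiki are left out.

```python
def orchestrate(request, fox):
    """Fan a document-based request out to FOX and merge the answers.

    request: {"document": IRI, "callback": IRI,
              "properties": {CURIE: text}, "annotate": [CURIE, ...]}
    fox:     callable taking plain text and returning a list of
             annotation dicts; it knows nothing about documents.
    """
    response = []
    for prop in request["annotate"]:
        text = request["properties"][prop]  # property-based request: text only
        for annotation in fox(text):
            merged = dict(annotation)
            # Re-attach the context that the stateless extractor never saw.
            merged["scms:annotates"] = request["document"]
            merged["scms:property"] = prop
            response.append(merged)
    return response


def stub_fox(text):
    """Hypothetical stand-in for FOX that 'finds' one fixed entity."""
    begin = text.find("Budapest")
    if begin == -1:
        return []
    return [{"ann:body": "Budapest",
             "rdf:type": "scmsann:LOCATION",
             "scms:beginIndex": begin,
             "scms:endIndex": begin + len("Budapest")}]


request = {
    "document": "http://example.com/drupal/node/10",
    "callback": "http://example.com/drupal/scms/callback",  # hypothetical
    "properties": {
        "dc:title": "Prometeus",
        "content:encoded": "The company Prometeus is an energy provider "
                           "located in the capital of Hungary, i.e., Budapest.",
    },
    "annotate": ["content:encoded"],
}

for annotation in orchestrate(request, stub_fox):
    print(annotation)
```

The two keys added in the loop correspond to the scms:annotates and scms:property statements that distinguish the orchestration service's response (Listing 2) from FOX's raw output (Listing 3).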

The annotation response generated by the orchestration service for our example is shown in Listing 2. It relies upon the output sent by FOX. The exact meaning of the predicates used by FOX and forwarded by the orchestration service is explained in Section 4.3.

    @prefix scmsann: .
    @prefix ctag: .
    @prefix xsd: .
    @prefix rdf: .
    @prefix ann: .
    @prefix scms: .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:annotates ;
        scms:property ;
        scms:beginIndex "70"^^xsd:int ;
        scms:endIndex "77"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Hungary"^^xsd:string .

    [] a ann:Annotation, scmsann:ORGANIZATION ;
        scms:annotates ;
        scms:property ;
        scms:beginIndex "12"^^xsd:int ;
        scms:endIndex "21"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Prometeus"^^xsd:string .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:annotates ;
        scms:property ;
        scms:beginIndex "85"^^xsd:int ;
        scms:endIndex "93"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Budapest"^^xsd:string .

Listing 2. Example annotation response as sent by the orchestration service

4.3 FOX

The FOX framework is a stateless and extensible framework that encompasses all the NLP functionality necessary to extract knowledge from the content of CMS.
Its architecture consists of three layers as shown in Figure 4. FOX takes text or HTML as input. This data is sent to the controller layer, which implements the functionality necessary to clean the data, i.e., remove HTML and XML tags as well as further noise. Once the data has been cleaned, the controller layer begins with the orchestration of the tools in the tool layer. Each of the tools is assigned a thread from a thread pool, so as to maximize usage of multi-core CPUs. Every thread runs its tool and generates an event once it has completed its computation. In the event that a tool does not complete after a set time, the corresponding thread is terminated.

Fig. 4. Architecture of the FOX framework (labels shown: Tool Layer, ML Layer, Controller Layer, Named Entity Recognition, Keyword Extraction, Relation Extraction, Lookup Module, Training, Prediction, Controller)
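
The thread-per-tool scheduling with a time limit can be sketched with Python's standard `concurrent.futures` module. This is an illustration, not FOX's implementation: the tool functions are hypothetical, and because Python threads cannot be forcibly terminated, the sketch abandons a late tool's result instead of killing its thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_tools(tools, text, timeout):
    """Run every tool on `text` in its own worker thread.

    Returns {tool_name: result} for the tools that finished within
    `timeout` seconds; slower or failing tools are left out.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(tools)))
    futures = {pool.submit(tool, text): name for name, tool in tools.items()}
    done, not_done = wait(futures, timeout=timeout)

    results = {}
    for future in done:
        if future.exception() is None:
            results[futures[future]] = future.result()
    for future in not_done:
        future.cancel()  # only prevents tasks that have not started yet
    pool.shutdown(wait=False)  # do not block on tools that are still running
    return results


def fast_tool(text):
    """Hypothetical NER tool that answers immediately."""
    return [("Budapest", "LOCATION")] if "Budapest" in text else []


def slow_tool(text):
    """Hypothetical tool that exceeds the time budget."""
    time.sleep(0.5)
    return []


outputs = run_tools({"fast": fast_tool, "slow": slow_tool},
                    "located in Budapest", timeout=0.1)
print(outputs)  # only the tool that met the deadline contributes
```

Dropping late results rather than waiting keeps the overall latency bounded by the timeout, at the price of combining fewer tool outputs for that request.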

So far, FOX integrates tools for KE, NER and RE. The KE is realized by PoolParty¹² for extracting keywords from a controlled vocabulary, by KEA¹³ and the Yahoo Term Extraction service¹⁴ for statistical extraction, and by several other tools. In addition, FOX integrates the Stanford Named Entity Recognizer¹⁵ [10], the Illinois Named Entity Tagger¹⁶ [25] and commercial software for NER. The RE is carried out by using the CARE platform¹⁷.

The results from the tool layer are forwarded to the prediction module of the machine-learning layer. The role of the prediction module is to generate FOX's output based on the output of the tools in FOX's backend. For this purpose, it implements several ensemble learning techniques [8] with which it can combine the output of several tools. Currently, the prediction module carries out this combination by using a feed-forward neural network. The neural network inserted in FOX was trained by using 117 news articles. It reached 89.21% F-Score in an evaluation based on a ten-fold cross-validation on NER, therewith outperforming even commercial systems¹⁸.
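
To make the idea of combining tool outputs with a feed-forward network more concrete, the following sketch shows only a forward pass over the one-hot encoded votes of several tools for a single token. Everything specific in it is an assumption made for illustration: the three-class label set, the single hidden layer, and in particular the hand-set weights, which simply count votes. FOX's trained network, its features and its weights are not described in enough detail here to reproduce, and a trained network would learn weights that trust some tools more than others.

```python
import math

CLASSES = ["LOCATION", "ORGANIZATION", "NONE"]


def encode_votes(votes):
    """One-hot encode each tool's label and concatenate the vectors."""
    features = []
    for label in votes:
        features.extend(1.0 if label == cls else 0.0 for cls in CLASSES)
    return features


def dense(inputs, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict(votes, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: input -> tanh hidden layer -> softmax over classes."""
    hidden = [math.tanh(h)
              for h in dense(encode_votes(votes), w_hidden, b_hidden)]
    scores = softmax(dense(hidden, w_out, b_out))
    return CLASSES[scores.index(max(scores))], scores


def vote_counting_weights(n_tools):
    """Hand-set weights (not trained): hidden unit k sums the votes for class k."""
    n = len(CLASSES)
    w_hidden = [[1.0 if i % n == k else 0.0 for i in range(n_tools * n)]
                for k in range(n)]
    w_out = [[1.0 if j == k else 0.0 for j in range(n)] for k in range(n)]
    return w_hidden, [0.0] * n, w_out, [0.0] * n


# Three hypothetical tools label the token "Budapest".
label, scores = predict(["LOCATION", "LOCATION", "NONE"],
                        *vote_counting_weights(3))
print(label, [round(s, 3) for s in scores])
```

With these weights the network reduces to majority voting, which makes the data flow easy to follow; replacing them with weights learned from annotated text is what would let the ensemble outperform its individual tools.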

¹² http://poolparty.biz
¹³ http://www.nzdl.org/Kea/
¹⁴ http://developer.yahoo.com/search/content/V1/termExtraction.html
¹⁵ http://nlp.stanford.edu/software/CRF-NER.shtml
¹⁶ http://cogcomp.cs.illinois.edu/page/software_view/4
¹⁷ http://www.digitaltrowel.com/Technology/
¹⁸ More details on the evaluation are provided at http://fox.aksw.org

Once the neural network has combined the output of the tools and generated a better prediction of the named entities, the output of FOX is generated by using the vocabularies shown in Figure 5. These vocabularies extend the two broadly used vocabularies Annotea¹⁹ and Autotag²⁰. In particular, we added the constructs explicated in the following:

– scms:beginIndex denotes the index in a literal value string at which a particular annotation or keyphrase begins;
– scms:endIndex stands for the index in a literal value string at which a particular annotation or keyphrase ends;
– scms:means marks the URI assigned to a named entity identified for an annotation;
– scms:source denotes the provenance of the annotation, i.e., the URI of the tool which computed the annotation or even the system ID of the person who curated or created the annotation and
– scmsann is the namespace for the annotation classes, i.e., location, person, organization and miscellaneous.
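
The index constructs can be checked against the running example. The sketch below is not how FOX finds entities: a fixed dictionary lookup stands in for the extraction step, purely to show how scms:beginIndex and scms:endIndex relate to the annotated string. The offsets it produces agree with the values in the paper's listings if the indices are read as zero-based character offsets with an exclusive end, so that `text[begin:end]` equals ann:body.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    body: str          # ann:body
    entity_class: str  # class in the scmsann namespace
    begin_index: int   # scms:beginIndex, zero-based
    end_index: int     # scms:endIndex, exclusive


def annotate(text, gazetteer):
    """Return one Annotation per occurrence of a known surface form."""
    annotations = []
    for surface, entity_class in gazetteer.items():
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            annotations.append(Annotation(surface, entity_class, start, end))
            start = text.find(surface, end)
    return sorted(annotations, key=lambda a: a.begin_index)


SENTENCE = ("The company Prometeus is an energy provider located in "
            "the capital of Hungary, i.e., Budapest.")

# Hypothetical lookup table standing in for the real extraction tools.
GAZETTEER = {
    "Hungary": "LOCATION",
    "Prometeus": "ORGANIZATION",
    "Budapest": "LOCATION",
}

for a in annotate(SENTENCE, GAZETTEER):
    assert SENTENCE[a.begin_index:a.end_index] == a.body
    print(a.begin_index, a.end_index, a.entity_class, a.body)
# 12 21 ORGANIZATION Prometeus
# 70 77 LOCATION Hungary
# 85 93 LOCATION Budapest
```

Because the offsets refer to the literal value that was sent for annotation, a consumer has to apply them to exactly that string; any later change to the text invalidates them.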

Given that the overhead due to the merging of the results via the neural network is of only a few milliseconds and thanks to the multi-core architecture of current servers, FOX is almost as time-efficient as state-of-the-art tools. Still, as our evaluation shows, these few milliseconds of overhead can lead to an increase of more than 13% F-Score (see Section 6). The output of FOX for our example is shown in Listing 3. This is the output that is forwarded to the orchestration service, which adds provenance information to the RDF before sending an answer to the callback URI provided by the wrapper. By these means, we ensure that the wrapper can write the RDFa in the right segment of the item content.

¹⁹ http://www.w3.org/2000/10/annotation-ns#
²⁰ http://commontag.org/ns#

Fig. 5. Vocabularies used by FOX for representing named entities (a) and keywords (b) (panel (a), named entity annotation: ann:Annotation with ann:body, scms:means, scms:beginIndex, scms:endIndex, scms:tool; panel (b), keyword annotation: ctag:AutoTag with ctag:label, ctag:means, scms:tool, anyProp)

4.4 OntoWiki

OntoWiki is a semantic data wiki [4] that was designed to facilitate the browsing and editing of RDF knowledge bases. Its browsing features range from arbitrary concept hierarchies to facet-based search and query building interfaces. Semantic content can be created and edited by using the RDFauthor system, which has been integrated in OntoWiki [27].

OntoWiki plays two key roles within the SCMS stack. First, it serves as entry point for the triple store. This allows for the triple store to be exchanged without any drawback for the user, leading to an easy customization of our stack. In addition, OntoWiki plays the role of an annotation consolidation and curation tool and is consequently the center of the curation pipeline. To ensure that OntoWiki is always up-to-date, the orchestration service sends its annotation responses to both OntoWiki and the wrapper's callback URI. Thus, OntoWiki is also aware of the wrapper (i.e., its callback URI) and can send the results of any manual curation process back to the wrapper. Note that manually created or curated annotations are saved with a different (if manually created) or supplementary (if manually curated) value in their scms:source property. This gives consuming tools (e.g., wrappers) a chance to assign higher trust values to those annotations. In addition, if a new extraction run is performed on the same document, manually created and curated annotations can be kept for further use. Note that the crawler in Virtuoso can be used to fetch even more data pertaining to the annotations computed by FOX. This data can be sent directly to FOX and inserted in Virtuoso so as to extend the knowledge base for the CMS.

    @prefix scmsann: .
    @prefix ctag: .
    @prefix xsd: .
    @prefix rdf: .
    @prefix ann: .
    @prefix scms: .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:beginIndex "70"^^xsd:int ;
        scms:endIndex "77"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Hungary"^^xsd:string .

    [] a ann:Annotation, scmsann:ORGANIZATION ;
        scms:beginIndex "12"^^xsd:int ;
        scms:endIndex "21"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Prometeus"^^xsd:string .

    [] a ann:Annotation, scmsann:LOCATION ;
        scms:beginIndex "85"^^xsd:int ;
        scms:endIndex "93"^^xsd:int ;
        scms:means ;
        scms:source ;
        ann:body "Budapest"^^xsd:string .

Listing 3. Annotations as returned by FOX in Turtle format

5 Use Case

The SCMS framework is being deployed in the renewable energy sector. The renewable energy and energy efficiency sector requires a large amount of up-to-date and high-quality information and data so as to develop and push the area of clean energy systems worldwide. This information, data and knowledge about clean energy technologies, developments, projects and laws per country worldwide helps policy and decision makers, project developers and financing agencies to make better decisions on investments as well as on which clean energy projects to set up. REEEP, the Renewable Energy and Energy Efficiency Partnership²¹, is a non-governmental organization that provides the aforementioned information to the respective target groups around the globe. For this purpose, REEEP has developed the reegle.info Information Gateway on Renewable Energy and Energy Efficiency²², which offers country profiles on clean energy and an Actors Catalog that contains the relevant stakeholders in the field per country. Furthermore, it supplies energy statistics and potentials as well as news on clean energy.

²¹ http://www.reeep.org
²² http://www.reegle.info

The motivation behind applying SCMS to the REEEP data was to facilitate the integration of this data in semantic applications to support efficient decision making. To achieve this goal, we aimed to expand the reegle.info information gateway by adding RDFa to the unstructured information available on the website and by making the same triples available via a SPARQL endpoint. For our current prototype, we implemented a CMS wrapper for the Drupal CMS and imported the actors catalog of reegle within it (see Figure 6). This data was then processed by the SCMS stack as follows: All actors and country descriptions were sent to the orchestration service, which forwarded them to FOX. The RDF data extracted by FOX were sent back to the Drupal wrapper and written via OntoWiki into Virtuoso. The Drupal wrapper then used the keyphrases to extend the set of tags assigned to the corresponding profile in the CMS. The named entities were integrated in the page by using the positional information returned by FOX. By these means, we made the REEEP data accessible for humans (via the Web page) but also for machines (via OntoWiki's integrated SPARQL endpoint and via the RDFa written in the Web pages).

Our approach also makes the automated integration of novel knowledge sources in REEEP possible. To achieve this goal, several selected sources (web sources, blogs and news feeds) are currently being crawled and then analyzed by FOX to extract structured information out of the masses of unstructured text from the Internet.
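
How positional information can drive the RDFa injection is sketched below. This is an assumption-laden illustration rather than the Drupal wrapper's code: it treats the annotated value as plain text, reads the offsets as zero-based with an exclusive end, and wraps each span in a `<span>` carrying the standard RDFa attributes `typeof` and `resource`. The resource IRI in the example is a placeholder, and a real page would additionally have to declare the `scmsann` prefix (for instance with the RDFa `prefix` attribute) for the CURIE to resolve.

```python
import html


def inject_rdfa(text, annotations):
    """Wrap annotated character spans of plain `text` in RDFa <span> tags.

    `annotations` holds (begin, end, type_curie, resource_iri) tuples with
    zero-based, end-exclusive offsets. Overlapping spans are rejected.
    """
    pieces = []
    cursor = 0
    for begin, end, type_curie, resource in sorted(annotations):
        if begin < cursor or end > len(text) or begin >= end:
            raise ValueError(f"invalid or overlapping span: {begin}-{end}")
        pieces.append(html.escape(text[cursor:begin]))
        pieces.append(
            f'<span typeof="{html.escape(type_curie)}" '
            f'resource="{html.escape(resource)}">'
            f"{html.escape(text[begin:end])}</span>"
        )
        cursor = end
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)


sentence = ("The company Prometeus is an energy provider located in "
            "the capital of Hungary, i.e., Budapest.")
markup = inject_rdfa(sentence, [
    # The resource IRI is hypothetical; the offsets are those of the example.
    (85, 93, "scmsann:LOCATION", "http://example.org/resource/Budapest"),
])
print(markup)
```

Processing the spans in ascending order while copying the untouched text in between keeps the original offsets valid throughout, which would not be the case if tags were inserted into the string one annotation at a time from the left.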

Fig. 6. Screenshots of SCMS-enhanced Drupal

6 Evaluation

The usability of our approach depends heavily on the quality of the knowledge returned via automated means. Consequently, we evaluated the quality of the RDFa injected into the REEEP data by measuring the precision and recall of SCMS and compared it with that of a state-of-the-art commercial system (CS) whose name cannot be revealed for legal reasons. We chose CS because it outperformed freely available NER tools such as the Stanford Named Entity Recognizer²³ [10] and the Illinois Named Entity Tagger²⁴ [25] in a prior evaluation on a newspaper corpus. Within that evaluation, FOX reached 89.21% F-score and was 14% better than CS w.r.t. F-score²⁵. As it can happen that only segments of multi-word units are recognized as being named entities, we followed a token-wise evaluation of the SCMS system. Thus, if our system recognized United Kingdom of Great Britain as a LOCATION when presented with United Kingdom of Great Britain and Northern Ireland, it was scored with 5 true positives and 3 false negatives.
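
The token-wise scoring can be reproduced with a few lines of code. The sketch below assumes whitespace tokenization and represents each annotation set as a mapping from token position to entity class. One detail is our own assumption, since the text does not spell it out: a token that is found but given the wrong class is counted both as a false positive and as a false negative.

```python
def token_counts(gold, predicted):
    """Count token-level true positives, false positives and false negatives.

    Both arguments map a token position to its entity class; positions
    that carry no entity are simply absent.
    """
    tp = sum(1 for pos, cls in predicted.items() if gold.get(pos) == cls)
    fp = len(predicted) - tp
    fn = sum(1 for pos, cls in gold.items() if predicted.get(pos) != cls)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and balanced F-score from the counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


tokens = "United Kingdom of Great Britain and Northern Ireland".split()
gold = {i: "LOCATION" for i in range(len(tokens))}  # all 8 tokens
predicted = {i: "LOCATION" for i in range(5)}       # "United ... Britain"

tp, fp, fn = token_counts(gold, predicted)
print(tp, fp, fn)  # 5 0 3
print(precision_recall_f1(tp, fp, fn))
```

For this example the partial match yields a precision of 1.0 and a recall of 0.625, whereas an entity-level evaluation would have counted the whole mention as a miss.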

²³ http://nlp.stanford.edu/software/CRF-NER.shtml
²⁴ http://cogcomp.cs.illinois.edu/page/software_view/4
²⁵ More details at http://fox.aksw.org

Our evaluation was carried out with two different data sets. In our first evaluation, we measured the performance of both systems on country profiles crawled from the Web, i.e., on information that is to be added automatically to the REEEP knowledge bases. For this purpose, we selected 9 country descriptions randomly and annotated 34 sentences manually. These sentences contained 119 named entity tokens, of which 104 were locations and 15 organizations. In our second evaluation, we aimed at measuring how well SCMS performs on the data that can be found currently in the REEEP catalogue. For this purpose, we manually annotated 23 actors profiles, which consisted of 68 sentences. The resulting reference data contained 20 location, 78 organization and 11 person tokens. Note that both data sets are of very different nature, as the first consists mainly of locations while the second contains a large number of organizations and a relatively small number of locations.

The results of our evaluation are shown in Table 1. CS follows a very conservative strategy, which leads to it having very high precision scores of up to 100% in some experiments. Yet, its conservative strategy leads to a recall which is mostly significantly inferior to that of SCMS. The only category within which CS outperforms SCMS is the detection of persons in the actors profile data. This is due to it detecting 6 out of the 11 person tokens in the data set, while SCMS only detects 5. In all other cases, SCMS outperforms CS by up to 13% F-score (detection of organizations in the country profiles data set). Overall, SCMS outperforms CS by 7% F-score on country profiles and almost 9% F-score on actors.

Table 1. Evaluation results on country and actors profiles. The superior F-score for each category is marked with an asterisk.

                             Country Profiles      Actors Profiles
Entity Type    Measure       FOX        CS         FOX        CS
Location       Precision     98%        100%       83.33%     100%
               Recall        94.23%     78.85%     90%        70%
               F-Score       96.08%*    88.17%     86.54%*    82.35%
Organization   Precision     73.33%     100%       57.14%     90.91%
               Recall        68.75%     40%        69.23%     47.44%
               F-Score       70.97%*    57.14%     62.72%*    62.35%
Person         Precision     –          –          100%       100%
               Recall        –          –          45.45%     54.55%
               F-Score       –          –          62.5%      70.59%*
Overall        Precision     93.97%     100%       85.16%     98.2%
               Recall        91.60%     74.79%     70.64%     52.29%
               F-Score       92.77%*    85.58%     77.22%*    68.24%

7 Conclusion

In this paper, we presented the SCMS framework for extracting structured data from CMS content.
We presented the architecture of our approach and explained how each of its components works. In addition, we explained the vocabularies utilized by the components of our framework. We presented one use case for the SCMS system, i.e., how SCMS is used in the renewable energy sector. The SCMS stack abides by the criteria of accuracy and flexibility. The flexibility of our approach is ensured by the combination of RDF messages that can be easily extended and of standard Web communication protocols. The accuracy of SCMS was demonstrated in an evaluation on actor and country profiles, within which SCMS outperformed even commercial software. Our approach can be extended by adding support for negative statements, i.e., statements that are not correct but can be found in different knowledge sources across the data landscape analyzed by our framework. In addition, the feedback generated by users will be integrated in the training of the framework to make it even more accurate over time.

References

1. Adrian, B., Hees, J., Herman, I., Sintek, M., Dengel, A.: Epiphany: Adaptable RDFa Generation Linking the Web of Documents to the Web of Data. In: Cimiano, P., Pinto, H.S. (eds.) EKAW 2010. LNCS, vol. 6317, pp. 178–192. Springer, Heidelberg (2010)
2. Agichtein, E., Gravano, L.: Snowball: Extracting relations from large plain-text collections. In: ACM DL, pp. 85–94 (2000)
3. Amsler, R.: Research towards the development of a lexical knowledge base for natural language processing. SIGIR Forum 23, 1–2 (1989)
4. Auer, S., Dietzold, S., Riechert, T.: OntoWiki – A Tool for Social, Semantic Collaboration. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 736–749. Springer, Heidelberg (2006)
5. Brin, S.: Extracting Patterns and Relations from the World Wide Web. In: Atzeni, P., Mendelzon, A.O., Mecca, G. (eds.) WebDB 1998. LNCS, vol. 1590, pp. 172–183. Springer, Heidelberg (1999)
6. Coates-Stephens, S.: The analysis and acquisition of proper names for the understanding of free text. Computers and the Humanities 26, 441–456 (1992), doi:10.1007/BF00136985
7. Curran, J.R., Clark, S.: Language independent NER using a maximum entropy tagger. In: HLT-NAACL, pp. 164–167 (2003)
8. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
9. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134 (2005)
10. Finkel, J., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: ACL, pp. 363–370 (2005)
11. Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C., Nevill-Manning, C.G.: Domain-specific keyphrase extraction. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 1999, pp. 668–673. Morgan Kaufmann Publishers Inc., San Francisco (1999)
12. Grishman, R., Yangarber, R.: NYU: Description of the Proteus/PET system as used for MUC-7 ST. In: MUC-7. Morgan Kaufmann (1998)
13. Harabagiu, S., Bejan, C.A., Morarescu, P.: Shallow semantics for relation extraction. In: IJCAI, pp. 1061–1066 (2005)
14. Huynh, D., Mazzocchi, S., Karger, D.R.: Piggy Bank: Experience the Semantic Web Inside Your Web Browser. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 413–430. Springer, Heidelberg (2005)
15. Kim, S.N., Kan, M.-Y.: Re-examining automatic keyphrase extraction approaches in scientific articles. In: MWE 2009, pp. 9–16 (2009)
16. Kim, S.N., Medelyan, O., Kan, M.-Y., Baldwin, T.: SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In: SemEval 2010, pp. 21–26. Association for Computational Linguistics, Stroudsburg (2010)
17. Matsuo, Y., Ishizuka, M.: Keyword Extraction From A Single Document Using Word Co-Occurrence Statistical Information. International Journal on Artificial Intelligence Tools 13(1), 157–169 (2004)
18. Nadeau, D.: Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. PhD thesis, University of Ottawa (2007)
19. Nadeau, D., Turney, P., Matwin, S.: Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity. In: Lamontagne, L., Marchand, M. (eds.) Canadian AI 2006. LNCS (LNAI), vol. 4013, pp. 266–277. Springer, Heidelberg (2006)
20. Nguyen, D.P.T., Matsuo, Y., Ishizuka, M.: Relation extraction from Wikipedia using subtree mining. In: AAAI, pp. 1414–1420 (2007)
21. Nguyen, T.D., Kan, M.-Y.: Keyphrase Extraction in Scientific Publications. In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 317–326. Springer, Heidelberg (2007)
22. Pantel, P., Pennacchiotti, M.: Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In: ACL, pp. 113–120 (2006)
23. Park, Y., Byrd, R.J., Boguraev, B.K.: Automatic glossary extraction: beyond terminology identification. In: COLING, pp. 1–7 (2002)
24. Pasca, M., Lin, D., Bigham, J., Lifchits, A., Jain, A.: Organizing and searching the world wide web of facts - step one: the one-million fact extraction challenge. In: Proceedings of the 21st National Conference on Artificial Intelligence, vol. 2, pp. 1400–1405. AAAI Press (2006)
25. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: CONLL, pp. 147–155 (2009)
26. Thielen, C.: An approach to proper name tagging for German. In: Proceedings of the EACL 1995 SIGDAT Workshop (1995)
27. Tramp, S., Heino, N., Auer, S., Frischmuth, P.: RDFauthor: Employing RDFa for Collaborative Knowledge Engineering. In: Cimiano, P., Pinto, H.S. (eds.) EKAW 2010. LNCS, vol. 6317, pp. 90–104. Springer, Heidelberg (2010)
28. Turney, P.D.: Coherent keyphrase extraction via web mining. In: IJCAI, San Francisco, CA, USA, pp. 434–439 (2003)
29. Walker, D., Amsler, R.: The use of machine-readable dictionaries in sublanguage analysis. Analysing Language in Restricted Domains (1986)
30. Wang, G., Yu, Y., Zhu, H.: PORE: Positive-Only Relation Extraction from Wikipedia Text. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudre-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 580–594. Springer, Heidelberg (2007)
31. Yan, Y., Okazaki, N., Matsuo, Y., Yang, Z., Ishizuka, M.: Unsupervised relation extraction by mining Wikipedia texts using information from the web. In: ACL 2009, pp. 1021–1029 (2009)
32. Zhou, G., Su, J.: Named entity recognition using an HMM-based chunk tagger. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL 2002, pp. 473–480. Association for Computational Linguistics, Morristown (2002)
