WEST: Modern Technologies for Web People Search

Dmitri V. Kalashnikov, Zhaoqi Chen, Rabia Nuray-Turan, Sharad Mehrotra, Zheng Zhang
Computer Science Department, University of California, Irvine

I. INTRODUCTION

In this paper we describe the WEST (Web Entity Search Technologies) system, which we have developed to improve people search over the Internet. Recently the problem of Web People Search (WePS) has attracted significant attention from both industry and academia. In the classic formulation of the WePS problem, the user issues a query to a web search engine that consists of the name of a person of interest. For such a query, a traditional search engine such as Yahoo or Google would return webpages that are related to any people who happen to have the queried name. The goal of WePS, instead, is to output a set of clusters of webpages, one cluster per distinct person, containing all of the webpages related to that person. The user can then locate the desired cluster and explore the webpages it contains.

The WePS approach offers significant advantages. For example, consider searching for a person who is a namesake of the former President Bill Clinton. The webpages of the less famous person will be overshadowed in today's search engines and will appear far down in the search results. WePS systems address this problem by first presenting the user with the set of clusters, among which the user can then select the cluster containing the webpages of the namesake of interest.

The key technology of any WePS system, including WEST, is that of Entity Resolution. In the setting of the Entity Resolution problem, a dataset contains information about objects and their interactions. The objects are referred to via (textual) descriptions/references, which might not be unique identifiers of the objects, leading to ambiguity. The task of Entity Resolution algorithms is to identify all of the references that co-refer, i.e., refer to the same real-world entity. In WePS the webpages returned by a search engine can be viewed as references. The overall task can be viewed as that of finding the webpages that refer to the same namesake.

This research was supported by NSF Awards 0331707 and 0331690, and DHS Award EMW-2007-FP-02535.

We have developed three different Entity Resolution algorithms that can be employed by WEST:

1) The GraphER approach extracts the Social Network (people, organizations, locations) off the webpages along with hyperlink and email information. It represents the resulting Entity-Relationship network as a graph. The approach then analyzes this graph and the webpage textual similarity to determine which webpages co-refer [4], [5]. GraphER will be covered in Section III-A.
2) The EnsembleER approach combines the results of multiple "base" ER systems to produce the overall clustering. During the training phase, EnsembleER employs supervised learning to study how well the base ER systems perform, in terms of their quality, under a variety of conditions/contexts by training a meta-level classifier. It then uses this classifier during the actual query processing to compute its final clustering [3]. EnsembleER will be covered in Section III-B.
3) The WebER approach, unlike the above two (and many other) approaches, does not limit its processing to analyzing the relevant webpages only. Instead, it leverages a powerful external data source to gain its advantage. Specifically, like GraphER it first extracts the social network off the webpages. It then queries the Web to collect additional information on the various components of this network [6]. WebER will be covered in Section III-C.

Each of these three algorithms has been demonstrated to outperform the current state-of-the-art techniques on a variety of datasets [3]–[6]. The comparison includes 18 approaches that have been part of the WePS Task competition on a large dataset which is now considered to be a de facto standard for testing WePS solutions [1].

WEST provides multiple interfaces to search. The input and output interfaces of WEST are illustrated in Figures 1 and 2, respectively. Naturally, WEST supports the standard WePS interface where the user provides a person name as the query. It also supports additional functionality, where the user can specify context queries to help locate the namesake of interest more quickly. The context can be specified in the form of locations, people, and/or organizations associated with the namesake of interest. Notice that the context here is not used as additional keywords to query the Web, but is used to identify the right namesake the user is looking for. This means that the webpages in the cluster do not each have to contain the context keywords, and some of them might contain none of these additional context keywords.

Besides the UI for searching for a single individual, WEST offers a Group Search interface to support the Group Identification query capabilities. In a Group Identification task, the input is multiple names of people that are known to be related in some way. For instance, a query might be "Michael Jordan" and "Magic Johnson", implying that the meant namesakes are basketball players. The objective is to retrieve the webpages of the meant namesakes only.

Fig. 1. Input Interface of WEST.
Fig. 2. Output Interface of WEST.

While the demonstration will illustrate both the single person search and group search capabilities, the subsequent discussion will focus on a single person search. The algorithmic details of the Group Search can be found in [4].

The rest of this paper is organized as follows. Section II presents the steps of the overall WEST approach. Then Section III covers the three Entity Resolution algorithms. Finally, Section IV describes the functionality of WEST that will be displayed during the demo.

II. OVERALL ALGORITHM

The steps of the overall WEST approach, in the context of a middleware architecture, are illustrated in Figure 3.

Fig. 3. Overview of the WEST Processing Steps. (Diagram components: Search Engine, Top-K Webpages, Preprocessing, Preprocessed pages, Auxiliary Information, Clustering, Postprocessing, and Results grouped as Person 1, Person 2, Person 3, ..., Person X.)

They include:

1) User Input. The user issues a query via the WEST input interface.
2) Top-K Retrieval. The system (middleware) sends a query consisting of a person name to a search engine, such as Google, and retrieves the top-K returned webpages. This is a standard step performed by most of the current WePS systems.
3) Pre-processing. These top-K webpages are then preprocessed. The two main pre-processing steps are:
   a) TF/IDF. Pre-processing steps for computing TF/IDF are carried out. They include stemming, stop word removal, noun phrase identification, inverted index computations, etc.
   b) Extraction. Named Entities, including people, locations, and organizations, are extracted using third-party named entity extraction software. Hyperlinks and email addresses are extracted as well. Some auxiliary data structures are built on this data.
4) Clustering. One of the three Entity Resolution algorithms is applied to the data to cluster the webpages. The algorithms will be explained in Section III.
5) Post-processing. The post-processing steps include:
   a) Cluster Sketches are computed.
   b) Cluster Rank is computed based on (a) the context keywords, if present, and (b) the original search engine's ordering of the webpages.
   c) Webpage Rank is computed to determine the relative ordering of webpages inside each cluster.
6) Visualization. The resulting clusters are presented to the user, who can explore them interactively.
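
The Cluster Rank step in 5(b) can be illustrated with a minimal sketch. The paper does not give the ranking formula, so the scoring below is an assumption made purely for illustration: clusters are ordered first by how many context keywords appear in their sketch, and then by the best original search-engine rank among their webpages. The function `rank_clusters`, the dictionary fields, and the sample data are all hypothetical.

```python
def rank_clusters(clusters, context_keywords):
    """Order clusters by context-keyword matches, then by best original rank.

    Hypothetical scoring; the actual WEST formula is not given in the text.
    Each cluster is a dict with a 'sketch' (set of keywords) and 'ranks'
    (positions of its webpages in the original search-engine ordering).
    """
    def sort_key(cluster):
        matches = len(cluster["sketch"] & context_keywords)
        best_rank = min(cluster["ranks"])
        return (-matches, best_rank)  # more matches first, then earlier rank

    return sorted(clusters, key=sort_key)


# Made-up example data.
clusters = [
    {"name": "A", "sketch": {"music", "guitar"}, "ranks": [1, 4]},
    {"name": "B", "sketch": {"professor", "university"}, "ranks": [2, 3]},
    {"name": "C", "sketch": {"council", "services"}, "ranks": [5]},
]

with_context = [c["name"] for c in rank_clusters(clusters, {"university"})]
without_context = [c["name"] for c in rank_clusters(clusters, set())]
```

With the context keyword "university", cluster B moves ahead of A even though A holds the top-ranked webpage. With no context, the ordering falls back to the original search-engine ranks.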

We next discuss the key component of any WePS system: the Entity Resolution algorithms.

III. ENTITY RESOLUTION ALGORITHMS

This section presents an overview of the three entity resolution algorithms used by the WEST system for clustering the webpages.

A. GraphER

To determine whether two references u and v co-refer, traditional approaches at their core analyze the similarity of features of u and v according to some feature-based similarity function f(u,v). The GraphER approach has been developed based on the observation that many datasets are relational in nature. They contain not only objects and their features but also information about relationships in which they participate.

Fig. 4. A General Framework for Combining Multiple Systems. (Diagram: an instance is passed to several base models, whose outputs feed a combining model that produces the prediction.)

GraphER utilizes the information stored in these relationships to improve the disambiguation quality. The approach views the dataset being analyzed as an Entity-Relationship Graph of nodes (entities) interconnected via relationships (edges). For the WePS domain, the nodes are the named entities, hyperlinks, and emails extracted off the webpages during the pre-processing, as well as the webpages themselves. The relationships are co-occurrence relationships, and those that are derived from hyperlink and email decompositions. The graph creation procedure is discussed in detail in [4]. The entity relationship graph in this case is a combination of the Social Network extracted from the webpages and the hyperlink graph.

To decide whether two references u and v co-refer, GraphER analyzes how strongly u and v are connected in this graph according to a connection strength measure c(u,v). To compute c(u,v), the algorithm discovers the set P^L_{uv} of all L-short simple u-v paths.¹ The value of c(u,v) is computed as the sum of the connection strengths contributed by each path p in P^L_{uv}:

c(u,v) = \sum_{p \in P^L_{uv}} c(p).

A supervised learning procedure, formulated as a linear programming optimization task, is used to learn the c(p) function from data [4], [5]. The similarity function s(u,v) is then defined as a combination of c(u,v) and f(u,v). The output of this function is used by a correlation clustering algorithm to generate the final clusters of webpages.
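
The computation of c(u,v) can be sketched as follows. This is a minimal illustration, not the GraphER implementation: it enumerates all simple u-v paths with at most L edges using depth-first search and sums a per-path strength. In GraphER the function c(p) is learned from data; here `path_strength` is a hypothetical stand-in that simply decays with path length, and the toy graph and node names are made up.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) edges."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def l_short_simple_paths(graph, u, v, max_len):
    """Yield every simple u-v path that has at most max_len edges."""
    stack = [(u, [u])]
    while stack:
        node, path = stack.pop()
        if node == v:
            yield path
            continue
        if len(path) - 1 >= max_len:
            continue  # extending would exceed the length bound L
        for nxt in graph[node]:
            if nxt not in path:  # keep the path simple (no repeated nodes)
                stack.append((nxt, path + [nxt]))


def path_strength(path, decay=0.5):
    """Hypothetical c(p): an assumption, since the real c(p) is learned."""
    return decay ** (len(path) - 1)


def connection_strength(graph, u, v, max_len):
    """c(u,v) = sum of c(p) over all L-short simple u-v paths."""
    return sum(path_strength(p) for p in l_short_simple_paths(graph, u, v, max_len))


edges = [
    ("page1", "org"), ("org", "page2"),
    ("page1", "person_a"), ("person_a", "person_b"), ("person_b", "page2"),
]
graph = build_graph(edges)
strength_l3 = connection_strength(graph, "page1", "page2", max_len=3)
strength_l2 = connection_strength(graph, "page1", "page2", max_len=2)
```

In this toy graph the two pages are linked by one path of length 2 (through a shared organization) and one of length 3 (through two people). Raising L from 2 to 3 adds the longer path's contribution to the sum.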

B. EnsembleER

The EnsembleER approach is motivated by the observation that often no single entity resolution (ER) technique always performs the best. Rather, different ER solutions perform better in different contexts. EnsembleER is a stacking-like framework that combines the clustering results of multiple base-level ER systems so that the final clustering quality is superior to that of any single base ER system. The key idea is to transform the output of base-level ER systems, together with context, into a meta-level feature set. A supervised learning approach is utilized to train a classifier on the meta-level data. The algorithm then applies the meta-level classifier to the dataset being processed to create the final clustering results. Figure 4 shows a general framework for combining multiple systems.

Similar to the GraphER approach, EnsembleER also utilizes a graph representation of the dataset. The graph, however, is different. The nodes are the top-K webpages. An edge (u,v) between two webpages u and v is created only if a certain number of the base-level ER systems decide that u and v should be in the same cluster. Edge (u,v) represents a possibility that u and v might co-refer. With respect to the graph, the task of EnsembleER can be viewed as deciding for each edge whether u and v should be put in one cluster.

¹ A path is L-short if its length does not exceed L. A path is simple if it does not contain duplicate nodes.

Let S_1, S_2, ..., S_n be the n base-level ER systems. For each edge e_i = (u,v), each S_j outputs its decision d_{ij} ∈ {0,1}. Here, if u and v are placed in the same cluster by S_j then d_{ij} = 1, otherwise d_{ij} = 0. Then, for each edge e_i we can define a decision feature vector as d_i = {d_{i1}, d_{i2}, ..., d_{in}}. For edge e_i its local context is also encoded as a multi-dimensional context feature vector f_i = {f_{i1}, f_{i2}, ..., f_{im}}.

One of the interesting aspects of the EnsembleER solution is that it creates context features in a predictive way, based on first estimating some unknown parameters of the data being processed. For instance, let K_1, K_2, ..., K_n be the numbers of clusters that systems S_1, S_2, ..., S_n output. One of the features used by EnsembleER is computed by applying a regression to this data to estimate the number of namesakes K, where the true number of namesakes K^+ is unknown beforehand to the algorithm. EnsembleER then converts the difference between K and K_j into a feature, based on the intuition that the closer K_j is to K, the more confidence can be placed in the answer of system S_j.

The goal of EnsembleER reduces to finding a mapping d_i × f_i → a_i. Here, a_i ∈ {0,1} is the prediction of the combined algorithm for edge e_i = (u,v), where a_i = 1 if the overall algorithm believes u and v belong to the same cluster, and a_i = 0 otherwise. The details of the Ensemble algorithm can be found in [3].
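
The construction of the decision feature vectors d_i can be sketched as follows. This is an illustrative sketch under stated assumptions, not the algorithm from [3]: the edge-creation threshold `min_votes` is a hypothetical parameter standing in for "a certain number" of agreeing systems, the estimated number of namesakes is passed in directly rather than obtained by regression, and the absolute difference |K - K_j| is only one possible way to turn the difference into a feature.

```python
from itertools import combinations


def decision_vectors(pages, base_clusterings, min_votes=1):
    """Map each candidate edge (u, v) to its decision vector d_i.

    base_clusterings[j] maps a page to the cluster label given by system S_j.
    An edge is kept only if at least min_votes systems place u and v together
    (min_votes is a hypothetical threshold).
    """
    vectors = {}
    for u, v in combinations(pages, 2):
        d = [1 if labels[u] == labels[v] else 0 for labels in base_clusterings]
        if sum(d) >= min_votes:
            vectors[(u, v)] = d
    return vectors


def cluster_count_gaps(base_clusterings, estimated_k):
    """Return |K - K_j| for each system, where K_j is its number of clusters."""
    return [abs(estimated_k - len(set(labels.values()))) for labels in base_clusterings]


pages = ["p1", "p2", "p3"]
s1 = {"p1": 0, "p2": 0, "p3": 1}  # output of base system S_1
s2 = {"p1": 0, "p2": 1, "p3": 1}  # output of base system S_2
s3 = {"p1": 0, "p2": 0, "p3": 0}  # output of base system S_3

vectors = decision_vectors(pages, [s1, s2, s3], min_votes=2)
gaps = cluster_count_gaps([s1, s2, s3], estimated_k=2)
```

Here the pair (p1, p3) gets no edge because only one system groups the two pages. A meta-level classifier would then be trained on the surviving vectors, concatenated with context features such as the gaps, to predict a_i for each edge.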

C. WebER

The WebER approach is considerably different from most of the other WePS solutions. Unlike many other WePS systems, WebER does not limit its processing to analyzing only the information stored in the top-K returned webpages. Rather, it employs the Web as an external data source to get additional information, which ultimately leads to higher quality results.

WebER is primarily intended to be a server-side solution. That is, its code is executed at the search engine (server) side. Because of that, most of the pre-processing can be accomplished in bulk before query processing starts, including extraction and TF/IDF computations. The queries to the search engine are carried out internally without going via the Internet, thus making their processing much faster.

Let D = {d_1, d_2, ..., d_K} be the set of the top-K returned webpages. WebER first merges some of the webpages into initial clusters using Named Entity (NE) clustering with a conservative threshold. The document-document similarity is computed using the TF/IDF approach with cosine similarity. Only a few webpages that have overwhelming evidence that they represent the same person are merged during this process.
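
A document-document similarity of this kind can be sketched as follows. This is a minimal, self-contained illustration of TF/IDF weighting with cosine similarity. The exact weighting scheme, tokenization, and threshold used by WebER are not given in the text, so the formula (raw term frequency times log(N/df)) and the sample tokens are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn token lists into {term: weight} dicts using tf * log(N / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up token lists standing in for entities extracted from three webpages.
docs = [
    ["machine", "learning", "professor", "university"],
    ["machine", "learning", "research", "university"],
    ["guitar", "album", "tour", "band"],
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

A conservative threshold then means that only pairs with a very high similarity are merged at this stage. All remaining pairs are left for the later web-querying step.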

Let P_i and O_i be the sets of people and organizations extracted from webpage d_i. For each pair of webpages d_i and d_j that are not yet put in the same cluster, the approach forms and issues queries to the Web to collect the co-occurrence statistics, which in this case are the numbers of pages returned for the given queries.

Fig. 5. The Experiment results on WePS dataset. (Bar chart; x-axis: Systems, listed as ALL-IN-ONE, UBC-AS, UC3M, WIT, DFKI2, JHU1-13, TITPI, UA-ZSA, SWAT-IV, AUG, ONE-IN-ONE, UNN, FICO, SHEF, UVA, PSNUS, IRST-BP, CU-COMSEM, WEST; y-axis: Fp, from 0 to 0.9.)

WebER uses two main types of queries:

N AND C_i AND C_j
C_i AND C_j

Here N is the name of the person being queried by the user, and C_i and C_j are the contexts of pages d_i and d_j. Context C_i can be either (a) an OR combination of people from P_i, or (b) an OR combination of organizations from O_i. The same holds for C_j, resulting in eight queries for each d_i and d_j pair. These co-occurrence counts are indicative of how often the elements of the two social networks co-occur on the web and thus how strongly they are related. These counts are then transformed into features, which are then used to compute the similarity between webpages d_i and d_j.
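
The count of eight follows from two query types times two context choices for C_i times two for C_j. The sketch below enumerates them for one pair of pages. The query string syntax (quoting, parentheses, the `AND`/`OR` keywords) is an assumption for illustration, and the function names and sample entities are hypothetical.

```python
from itertools import product


def or_group(terms):
    """Build an OR combination of quoted terms, e.g. ("a" OR "b")."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def pair_queries(name, people_i, orgs_i, people_j, orgs_j):
    """Return the eight queries for one (d_i, d_j) pair of webpages."""
    contexts_i = [or_group(people_i), or_group(orgs_i)]  # two choices of C_i
    contexts_j = [or_group(people_j), or_group(orgs_j)]  # two choices of C_j
    queries = []
    for c_i, c_j in product(contexts_i, contexts_j):
        queries.append(f'"{name}" AND {c_i} AND {c_j}')  # N AND C_i AND C_j
        queries.append(f"{c_i} AND {c_j}")                # C_i AND C_j
    return queries


# Made-up entities for two webpages returned for the name "John Smith".
queries = pair_queries(
    "John Smith",
    people_i=["Alice Lee", "Bob Roy"], orgs_i=["Acme Corp"],
    people_j=["Carol Diaz"], orgs_j=["Initech"],
)
```

Each query would be issued to the search engine, and the number of returned pages recorded as one co-occurrence count for the pair.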

One of the key contributions of this work is a new Skyline-based classifier for deciding which d_i and d_j webpages should be merged based on the corresponding feature vector. It is a specialized classifier that we have designed specifically for the clustering problem at hand. The Skyline-based classifier gains its advantage due to a variety of functionalities built into it, including:

• It takes into account dominance that is present in the feature space.
• It also fine-tunes itself to the quality measure being used.
• It takes into account transitivity of merges: that is, it accounts for the fact that two large clusters can be merged by a single merge decision, and, thus, one direct merge decision can lead to multiple indirect ones.

These properties allow it to easily outperform other classification methods (which are generic), such as DTC or SVM. The approach is discussed in detail in [6].
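
Two of the notions above, dominance in the feature space and transitivity of merges, can be illustrated with a small sketch. This is not the Skyline-based classifier of [6], whose details are not given here. It only shows the standard definition of dominance (one feature vector is at least as large as another in every dimension and strictly larger in at least one) and how a single merge decision joins whole clusters in a union-find structure. All names and data are hypothetical.

```python
def dominates(a, b):
    """True if a >= b in every feature and a > b in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def find(parent, x):
    """Return the representative of x's cluster (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def merge(parent, x, y):
    """Merge the clusters containing x and y."""
    parent[find(parent, x)] = find(parent, y)


# Dominance: made-up two-dimensional feature vectors for page pairs.
features = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8)]
front = skyline(features)

# Transitivity: one direct merge decision joins two existing clusters.
parent = {p: p for p in ["d1", "d2", "d3", "d4"]}
merge(parent, "d1", "d2")  # cluster {d1, d2}
merge(parent, "d3", "d4")  # cluster {d3, d4}
merge(parent, "d2", "d3")  # direct decision on (d2, d3) only
d1_with_d4 = find(parent, "d1") == find(parent, "d4")
```

The last merge is decided on the pair (d2, d3) alone, yet it also places d1 and d4 in the same cluster. This is the kind of indirect effect a merge classifier has to account for.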

IV. DEMONSTRATION

The ER algorithms used by WEST are known to produce highly competitive results. Figure 5 presents the comparison of WEST with 18 other WePS solutions that have been part of the WePS Task challenge [1]. The quality of clustering is evaluated in terms of the Fp measure (the harmonic mean of Purity and Inverse Purity [1]). For group identification we have compared WEST with the state-of-the-art approach published in [2]. The average F-measure on this dataset achieved by WEST is 92%, which is nearly a 12% improvement over the result reported in [2].
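
The Fp measure mentioned above can be sketched as follows. This is a minimal illustration using the standard definitions of purity and inverse purity, and it assumes non-empty, non-overlapping clusters. The function names and the toy data are hypothetical, and details of the official WePS scorer may differ.

```python
def purity(clusters, classes):
    """Average, weighted by cluster size, of each cluster's best precision."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & g) for g in classes) for c in clusters) / n


def f_purity(clusters, classes):
    """Harmonic mean of purity and inverse purity."""
    p = purity(clusters, classes)
    inv_p = purity(classes, clusters)  # inverse purity swaps the two roles
    return 2 * p * inv_p / (p + inv_p)


# Made-up example: five webpages, system output versus gold-standard namesakes.
system = [{"a", "b", "c"}, {"d", "e"}]
gold = [{"a", "b"}, {"c", "d", "e"}]
score = f_purity(system, gold)
perfect = f_purity(gold, gold)
```

In this example page c is placed with the wrong namesake, so both purity and inverse purity are 4/5 and Fp is 0.8. A clustering identical to the gold standard scores 1.0.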

The WEST system will be demonstrated through two applications built over the base system.

• Single Person Search (illustrated in Figure 1): a user can enter a person name and context in the form of people, locations, and/or organizations associated with the person being queried. The results will be a set of clusters. Each cluster will have a set of keywords attached to indicate the main aspect of the corresponding namesake. The clusters will be presented in a ranked order based on the original ranks of the webpages in the clusters and the context keywords. Figure 2 shows sample resulting clusters for the query "Andrew McCallum". The first returned group corresponds to Andrew McCallum the UMass CS professor, the second to the president of the Australian Council of Social Services, the third to a Canadian musician, etc. The user will be able to click on the clusters and explore them interactively. The webpages in a cluster will be presented in a ranked order as well.
• Group Search: another interface will be used to demonstrate the Group Identification search capabilities of WEST. In the group query interface, the user can input several person names. The result will be the webpages that are related to the meant namesakes.

These applications will be demonstrated in both the online and offline modes. In the online mode, the query input by the user will be translated into a corresponding (set of) queries over Internet search engines (specifically over Google). WEST allows the user to specify the number of webpages to retrieve from the search engine, which will be disambiguated into corresponding clusters. In the online mode, WEST uses only the GraphER and EnsembleER approaches, since WebER is a server-side approach and is not amenable to realization as a middleware. The demonstration will allow observers to do diverse searches (perhaps of their own names) and perceive both the quality and the efficiency of WEST. In the offline mode, WEST will use preconstructed "canned" examples for which we have already crawled the web to retrieve the search results and constructed the corresponding clusters. In the offline mode, in addition to illustrating the GraphER and EnsembleER approaches, we will also demonstrate the disambiguation power of the WebER approach.

REFERENCES

[1] J. Artiles, J. Gonzalo, and S. Sekine. The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. In SemEval, 2007.
[2] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW, 2005.
[3] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Combining entity resolution techniques with application to web people search. Under submission.
[4] D. V. Kalashnikov, Z. Chen, S. Mehrotra, and R. Nuray. Web people search via connection analysis. IEEE TKDE, 2008. To appear.
[5] D. V. Kalashnikov, S. Mehrotra, S. Chen, R. Nuray, and N. Ashish. Disambiguation algorithm for people search on the web. In ICDE, 2007.
[6] D. V. Kalashnikov, R. Nuray-Turan, and S. Mehrotra. Towards breaking the quality curse. A web-querying approach to Web People Search. In Proc. of Annual International ACM SIGIR Conference, Singapore, July 20–24 2008.
