WEST: Modern Technologies for Web People Search

Dmitri V. Kalashnikov, Zhaoqi Chen, Rabia Nuray-Turan, Sharad Mehrotra, Zheng Zhang
Computer Science Department, University of California, Irvine

I. INTRODUCTION

In this paper we describe WEST (Web Entity Search Technologies), a system that we have developed to improve people search over the Internet. Recently the problem of Web People Search (WePS) has attracted significant attention from both industry and academia. In the classic formulation of the WePS problem, the user issues a query to a web search engine that consists of the name of a person of interest. For such a query, a traditional search engine such as Yahoo or Google would return web pages that are related to any people who happen to have the queried name. The goal of WePS, instead, is to output a set of clusters of web pages, one cluster per distinct person, containing all of the web pages related to that person. The user can then locate the desired cluster and explore the web pages it contains.

The WePS approach offers significant advantages. For example, consider searching for a person who is a namesake of the former President Bill Clinton. In today's search engines, the web pages of the less famous person are overshadowed and appear far down in the search results. WePS systems address this problem by first presenting the user with the set of clusters, from which the user can then select the cluster containing the web pages of the namesake of interest.

The key technology of any WePS system, including WEST, is that of Entity Resolution. In the setting of the Entity Resolution problem, a dataset contains information about objects and their interactions. The objects are referred to via (textual) descriptions/references, which might not be unique identifiers of the objects, leading to ambiguity. The task of Entity Resolution algorithms is to identify all of the references that co-refer, i.e., refer to the same real-world entity. In WePS the web pages returned by a search engine can be viewed as references. The overall task can be viewed as that of finding the web pages that refer to the same namesake.

This research was supported by NSF Awards 0331707 and 0331690, and DHS Award EMW-2007-FP-02535.

We have developed three different Entity Resolution algorithms that can be employed by WEST:

1) The GraphER approach extracts the Social Network (people, organizations, locations) off the web pages along with hyperlink and email information. It represents the resulting Entity-Relationship network as a graph. The approach then analyzes this graph and the textual similarity of the web pages to determine which web pages co-refer [4], [5]. GraphER will be covered in Section III-A.
2) The EnsembleER approach combines the results of multiple "base" ER systems to produce the overall clustering. During the training phase, EnsembleER employs supervised learning to study how well the base ER systems perform under a variety of conditions/contexts, by training a meta-level classifier. It then uses this classifier during the actual query processing to compute its final clustering [3]. EnsembleER will be covered in Section III-B.
3) The WebER approach, unlike the above two (and many other) approaches, does not limit its processing to analyzing only the relevant web pages. Instead, it leverages a powerful external data source to gain its advantage. Specifically, like GraphER, it first extracts the social network off the web pages. It then queries the Web to collect additional information on the various components of this network [6]. WebER will be covered in Section III-C.

Each of these three algorithms has been demonstrated to outperform the current state-of-the-art techniques on a variety of datasets [3]–[6]. The comparison includes 18 approaches that have been part of the WePS Task competition on a large dataset which is now considered to be a de facto standard for testing WePS solutions [1].

WEST provides multiple interfaces to search. The input and output interfaces of WEST are illustrated in Figures 1 and 2, respectively. Naturally, WEST supports the standard WePS interface where the user provides a person name as the query. It also supports additional functionality, where the user can specify context queries to help locate the namesake of interest more quickly. The context can be specified in the form of locations, people, and/or organizations associated with the namesake of interest. Notice that the context here is not used as additional keywords to query the Web, but is used to identify the right namesake the user is looking for. This means that the web pages in the cluster do not each have to contain the context keywords, and some of them might contain none of these additional context keywords.

Besides the UI for searching for a single individual, WEST offers a Group Search interface to support the Group Identification query capabilities. In a Group Identification task, the input is multiple names of people that are known to be related in some way. For instance, a query might be "Michael Jordan" and "Magic Johnson", implying that the meant namesakes are basketball players. The objective is to retrieve the web pages of the meant namesakes only. While the demonstration will illustrate both the single person search and group search capabilities, the subsequent discussion will focus on single person search. The algorithmic details of the Group Search can be found in [4].

Fig. 1. Input Interface of WEST.

Fig. 2. Output Interface of WEST.

The rest of this paper is organized as follows. Section II presents the steps of the overall WEST approach. Then Section III covers the three Entity Resolution algorithms. Finally, Section IV describes the functionality of WEST that will be displayed during the demo.

II. OVERALL ALGORITHM

The steps of the overall WEST approach, in the context of a middleware architecture, are illustrated in Figure 3. They include:

1) User Input. The user issues a query via the WEST input interface.
2) Top-K Retrieval. The system (middleware) sends a query consisting of a person name to a search engine, such as Google, and retrieves the top-K returned web pages. This is a standard step performed by most of the current WePS systems.
Fig. 3. Overview of the WEST Processing Steps. [Diagram: a query for Person X goes to the search engine; the top-K web pages pass through preprocessing (preprocessed pages, auxiliary information), clustering, and postprocessing, yielding results grouped as Person 1, Person 2, Person 3.]
3) Pre-processing. These top-K web pages are then preprocessed. The two main pre-processing steps are:
   a) TF/IDF. The pre-processing steps for computing TF/IDF are carried out. They include stemming, stop word removal, noun phrase identification, inverted index computations, etc.
   b) Extraction. Named Entities, including people, locations, and organizations, are extracted using third-party named entity extraction software. Hyperlinks and email addresses are extracted as well. Some auxiliary data structures are built on this data.
4) Clustering. One of the three Entity Resolution algorithms is applied to the data to cluster the web pages. The algorithms will be explained in Section III.
5) Post-processing. The post-processing steps include:
   a) Cluster Sketches are computed.
   b) Cluster Rank is computed based on (a) the context keywords, if present, and (b) the original search engine's ordering of the web pages.
   c) Webpage Rank is computed to determine the relative ordering of web pages inside each cluster.
6) Visualization. The resulting clusters are presented to the user, who can explore them interactively.

We next discuss the key component of any WePS system: the Entity Resolution algorithms.

III. ENTITY RESOLUTION ALGORITHMS

This section presents an overview of the three entity resolution algorithms used by the WEST system for clustering the web pages.

A. GraphER

To determine whether two references u and v co-refer, traditional approaches at their core analyze the similarity of the features of u and v according to some feature-based similarity function f(u, v). The GraphER approach has been developed based on the observation that many datasets are relational in nature. They contain not only objects and their features but also information about the relationships in which they participate.

Fig. 4. A General Framework for Combining Multiple Systems. [Diagram: an instance is passed to several base models, whose outputs feed a combining model that produces the prediction.]

GraphER utilizes the information stored in these relationships to improve the disambiguation quality. The approach views the dataset being analyzed as an Entity-Relationship Graph of nodes (entities) interconnected via edges (relationships). For the WePS domain, the nodes are the named entities, hyperlinks, and emails extracted off the web pages during pre-processing, as well as the web pages themselves. The relationships are co-occurrence relationships, and those that are derived from hyperlink and email decompositions. The graph creation procedure is discussed in detail in [4]. The entity-relationship graph in this case is a combination of the Social Network extracted from the web pages and the hyperlink graph.

To decide whether two references u and v co-refer, GraphER analyzes how strongly u and v are connected in this graph according to a connection strength measure c(u, v). To compute c(u, v), the algorithm discovers the set $P^L_{uv}$ of all L-short simple u-v paths (see footnote 1). The value of c(u, v) is computed as the sum of the connection strengths contributed by each path p in $P^L_{uv}$:

$$c(u, v) = \sum_{p \in P^L_{uv}} c(p).$$

A supervised learning procedure, formulated as a linear programming optimization task, is used to learn the c(p) function from data [4], [5]. The similarity function s(u, v) is then defined as a combination of c(u, v) and f(u, v). The output of this function is used by a correlation clustering algorithm to generate the final clusters of web pages.
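
The connection strength computation can be illustrated with a minimal Python sketch. It enumerates the simple u-v paths of at most L edges and sums a per-path weight. The function names, the toy graph, and the `inverse_length` weight are assumptions made for illustration only; in GraphER the function c(p) is learned from data [4], [5], not fixed by hand.

```python
def simple_paths(graph, u, v, max_len):
    """Yield every simple u-v path (no repeated nodes) with at most max_len edges."""
    stack = [(u, [u])]
    while stack:
        node, path = stack.pop()
        if node == v:
            yield path
            continue
        if len(path) - 1 >= max_len:  # path already uses max_len edges
            continue
        for nxt in graph[node]:
            if nxt not in path:  # keep the path simple
                stack.append((nxt, path + [nxt]))


def connection_strength(graph, u, v, max_len, path_weight):
    """c(u, v): sum of path_weight(p) over all L-short simple u-v paths."""
    return sum(path_weight(p) for p in simple_paths(graph, u, v, max_len))


def inverse_length(path):
    """Hypothetical stand-in for the learned c(p): shorter paths count more."""
    return 1.0 / (len(path) - 1)


# Toy entity-relationship graph: pages p1, p2 and extracted entities e1..e3.
graph = {
    "p1": ["e1", "e2"],
    "p2": ["e1", "e3"],
    "e1": ["p1", "p2"],
    "e2": ["p1", "e3"],
    "e3": ["e2", "p2"],
}
strength = connection_strength(graph, "p1", "p2", 3, inverse_length)
```

Exhaustive path enumeration is exponential in the worst case, which is one reason the path length is bounded by L.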

B. EnsembleER

The EnsembleER approach is motivated by the observation that often no single entity resolution (ER) technique always performs best. Rather, different ER solutions perform better in different contexts. EnsembleER is a stacking-like framework that combines the clustering results of multiple base-level ER systems so that the final clustering quality is superior to that of any single base ER system. The key idea is to transform the output of the base-level ER systems, together with context, into a meta-level feature set. A supervised learning approach is utilized to train a classifier on the meta-level data. The algorithm then applies the meta-level classifier to the dataset being processed to create the final clustering results. Figure 4 shows a general framework for combining multiple systems.

Similar to the GraphER approach, EnsembleER also utilizes a graph representation of the dataset. The graph, however, is different. The nodes are the top-K web pages. An edge (u, v) between two web pages u and v is created only if a certain number of the base-level ER systems decide that u and v should be in the same cluster. Edge (u, v) represents a possibility that u and v might co-refer. With respect to the graph, the task of EnsembleER can be viewed as deciding for each edge whether u and v should be put in one cluster.

Footnote 1: A path is L-short if its length does not exceed L. A path is simple if it does not contain duplicate nodes.
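
Let $S_1, S_2, \ldots, S_n$ be the n base-level ER systems. For each edge $e_i = (u, v)$, each $S_j$ outputs its decision $d_{ij} \in \{0, 1\}$. Here, if u and v are placed in the same cluster by $S_j$ then $d_{ij} = 1$; otherwise $d_{ij} = 0$. Then, for each edge $e_i$ we can define a decision feature vector $d_i = \{d_{i1}, d_{i2}, \ldots, d_{in}\}$. For edge $e_i$ its local context is also encoded as a multi-dimensional context feature vector $f_i = \{f_{i1}, f_{i2}, \ldots, f_{im}\}$.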

One of the interesting aspects of the EnsembleER solution is that it creates context features in a predictive way, based on first estimating some unknown parameters of the data being processed. For instance, let $K_1, K_2, \ldots, K_n$ be the numbers of clusters that systems $S_1, S_2, \ldots, S_n$ output. One of the features used by EnsembleER is computed by applying a regression to this data to estimate the number of namesakes K, where the true number of namesakes $K^+$ is unknown beforehand to the algorithm. EnsembleER then converts the difference between K and $K_j$ into a feature, based on the intuition that the closer $K_j$ is to K, the more confidence can be placed in the answer of system $S_j$.

The goal of EnsembleER reduces to finding a mapping $d_i \times f_i \rightarrow a_i$. Here, $a_i \in \{0, 1\}$ is the prediction of the combined algorithm for edge $e_i = (u, v)$, where $a_i = 1$ if the overall algorithm believes u and v belong to the same cluster, and $a_i = 0$ otherwise. The details of the EnsembleER algorithm can be found in [3].
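
As a rough illustration of the meta-level features described above, the following Python sketch builds the decision vectors $d_i$ from the outputs of several base clusterings and derives one context feature from the cluster counts $K_j$. All names (`decision_vectors`, `cluster_count_features`, `min_votes`) are hypothetical, and the estimate of K is a plain average used as a placeholder: the paper applies a regression, whose form is given in [3] and not reproduced here.

```python
from itertools import combinations


def decision_vectors(clusterings, min_votes=1):
    """Map each candidate edge (u, v) to its decision vector d_i.

    clusterings: one dict per base ER system S_j, mapping page -> cluster label.
    An edge is kept only if at least min_votes systems put u and v together.
    """
    pages = sorted(clusterings[0])
    edges = {}
    for u, v in combinations(pages, 2):
        d = [int(c[u] == c[v]) for c in clusterings]  # d_ij for j = 1..n
        if sum(d) >= min_votes:
            edges[(u, v)] = d
    return edges


def cluster_count_features(clusterings):
    """Return |K - K_j| per system, where K_j is the cluster count of S_j.

    Assumption: K is estimated by a plain average here, as a placeholder
    for the regression-based estimate used by EnsembleER.
    """
    counts = [len(set(c.values())) for c in clusterings]
    k_est = sum(counts) / len(counts)
    return [abs(k_est - k) for k in counts]


# Three toy base systems clustering four pages.
s1 = {"a": 0, "b": 0, "c": 1, "d": 1}
s2 = {"a": 0, "b": 0, "c": 0, "d": 1}
s3 = {"a": 0, "b": 1, "c": 2, "d": 3}
edges = decision_vectors([s1, s2, s3], min_votes=2)
features = cluster_count_features([s1, s2, s3])
```

A meta-level classifier would then be trained on the concatenation of each edge's decision vector and its context features, with the ground-truth co-reference label as the target.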

C. WebER

The WebER approach is considerably different from most of the other WePS solutions. Unlike many other WePS systems, WebER does not limit its processing to analyzing only the information stored in the top-K returned web pages. Rather, it employs the Web as an external data source to get additional information, which ultimately leads to higher quality results. WebER is primarily intended to be a server-side solution. That is, its code is executed at the search engine (server) side. Because of that, most of the pre-processing can be accomplished in bulk before query processing starts, including extraction and TF/IDF computations. The queries to the search engine are carried out internally without going via the Internet, thus making their processing much faster.

Let $D = \{d_1, d_2, \ldots, d_K\}$ be the set of the top-K returned web pages. WebER first merges some of the web pages into initial clusters using Named Entity (NE) clustering with a conservative threshold. The document-document similarity is computed using the TF/IDF approach with cosine similarity. Only a few web pages, for which there is overwhelming evidence that they represent the same person, are merged during this process. Let $P_i$ and $O_i$ be the sets of people and organizations extracted from web page $d_i$. For each pair of web pages $d_i$ and $d_j$ that are not yet put in the same cluster, the approach forms and issues queries to the Web to collect co-occurrence statistics, which in this case are the numbers of pages returned for a given query.

Fig. 5. The Experiment Results on the WePS Dataset. [Bar chart: Fp (0 to 0.9) per system, comparing WEST with the WePS Task participants.]

WebER uses two main types of queries:

N AND $C_i$ AND $C_j$
$C_i$ AND $C_j$

Here N is the name of the person being queried by the user, and $C_i$ and $C_j$ are the contexts of pages $d_i$ and $d_j$. Context $C_i$ can be either (a) an OR combination of people from $P_i$, or (b) an OR combination of organizations from $O_i$. The same holds for $C_j$, resulting in eight queries for each pair $d_i$ and $d_j$ (two query types, two choices for $C_i$, and two choices for $C_j$). These co-occurrence counts are indicative of how often the elements of the two social networks co-occur on the web and thus how strongly they are related. These counts are then transformed into features, which are then used to compute the similarity between web pages $d_i$ and $d_j$.
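
The eight queries per page pair can be enumerated mechanically. The Python sketch below is only an illustration under assumptions: the function names, the quoting, and the parenthesized OR syntax are hypothetical and not taken from the paper, and each entity list is assumed to be non-empty.

```python
from itertools import product


def or_group(terms):
    """Build an OR combination of quoted terms, e.g. ("A" OR "B")."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def weber_queries(name, people_i, orgs_i, people_j, orgs_j):
    """Return the eight co-occurrence queries for one pair of pages (d_i, d_j).

    C_i is either the people or the organizations of d_i; likewise for C_j.
    Each (C_i, C_j) choice yields 'N AND C_i AND C_j' and 'C_i AND C_j'.
    """
    contexts_i = [or_group(people_i), or_group(orgs_i)]
    contexts_j = [or_group(people_j), or_group(orgs_j)]
    queries = []
    for ci, cj in product(contexts_i, contexts_j):
        queries.append(f'"{name}" AND {ci} AND {cj}')
        queries.append(f"{ci} AND {cj}")
    return queries


# Placeholder entity names; real values would come from the extraction step.
queries = weber_queries("N", ["A1", "A2"], ["O1"], ["B1"], ["O2"])
```

The hit count returned for each query would then be turned into one component of the feature vector for the pair.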

One of the key contributions of this work is a new Skyline-based classifier for deciding which web pages $d_i$ and $d_j$ should be merged based on the corresponding feature vector. It is a specialized classifier that we have designed specifically for the clustering problem at hand. The Skyline-based classifier gains its advantage from a variety of functionalities built into it:

It takes into account dominance that is present in the feature space.
It fine-tunes itself to the quality measure being used.
It takes into account transitivity of merges: that is, it accounts for the fact that two large clusters can be merged by a single merge decision, and, thus, one direct merge decision can lead to multiple indirect ones.

These properties allow it to outperform other classification methods (which are generic), such as DTC or SVM. The approach is discussed in detail in [6].
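
The transitivity effect mentioned above can be made concrete with a small Python sketch based on a union-find structure. This is not the Skyline-based classifier itself, which is described in [6]. It only shows, with hypothetical names, how a single direct merge decision between two pages implies a co-reference decision for every cross pair of their clusters.

```python
class ClusterSet:
    """Union-find over web pages that tracks cluster sizes."""

    def __init__(self, pages):
        self.parent = {p: p for p in pages}
        self.size = {p: 1 for p in pages}

    def find(self, page):
        """Return the representative of the cluster containing page."""
        while self.parent[page] != page:
            self.parent[page] = self.parent[self.parent[page]]  # path halving
            page = self.parent[page]
        return page

    def merge(self, u, v):
        """Merge the clusters of u and v.

        Returns the number of page pairs that become co-referent as a result:
        one direct decision plus all the indirect ones it implies.
        """
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return 0
        new_pairs = self.size[ru] * self.size[rv]
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru  # attach the smaller cluster to the larger one
        self.size[ru] += self.size[rv]
        return new_pairs


clusters = ClusterSet(["d1", "d2", "d3", "d4", "d5"])
first = clusters.merge("d1", "d2")   # two singletons
second = clusters.merge("d3", "d4")  # two singletons
third = clusters.merge("d2", "d3")   # two clusters of size two
```

Because a wrong merge between two large clusters creates many wrong pairs at once, a classifier tuned to a clustering quality measure has reason to weigh such decisions more carefully than merges of singletons.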

IV. DEMONSTRATION

The ER algorithms used by WEST are known to produce highly competitive results. Figure 5 presents the results of comparing WEST with 18 other WePS solutions that have been part of the WePS Task challenge [1]. The quality of clustering is evaluated in terms of the Fp measure (the harmonic mean of Purity and Inverse Purity [1]). For group identification we have compared WEST with the state-of-the-art approach published in [2]. The average F-measure achieved by WEST on this dataset is 92%, which is nearly a 12% improvement over the result reported in [2].
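
For readers unfamiliar with the Fp measure, the following Python sketch computes Purity, Inverse Purity, and their harmonic mean for a toy example. It assumes non-overlapping clusters represented as sets of page identifiers and uses equal weighting of the two components; the function names and data are illustrative, and the official WePS scorer described in [1] should be consulted for the exact evaluation protocol.

```python
def purity(clusters, categories):
    """Weighted average, over clusters, of the best precision against any category.

    Sum_i (|C_i| / n) * max_j (|C_i & L_j| / |C_i|) simplifies to the form below.
    """
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & cat) for cat in categories) for c in clusters) / n


def fp_measure(clusters, truth):
    """Harmonic mean of Purity and Inverse Purity."""
    p = purity(clusters, truth)
    inverse_p = purity(truth, clusters)  # roles of clusters and truth swapped
    return 2 * p * inverse_p / (p + inverse_p)


# Toy example: five pages, two true namesakes, one page misplaced.
truth = [{1, 2, 3}, {4, 5}]
system = [{1, 2}, {3, 4, 5}]
score = fp_measure(system, truth)
```

Purity penalizes clusters that mix several namesakes, while Inverse Purity penalizes splitting one namesake across clusters, so the harmonic mean rewards systems that avoid both errors.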

The WEST system will be demonstrated through two applications built over the base system.

Single Person Search (illustrated in Figure 1): a user can enter a person name and context in the form of people, locations, and/or organizations associated with the person being queried. The results will be a set of clusters. Each cluster will have a set of keywords attached to indicate the main aspect of the corresponding namesake. The clusters will be presented in a ranked order based on the original ranks of the web pages in the clusters and the context keywords. Figure 2 shows sample resulting clusters for the query "Andrew McCallum". The first returned group corresponds to Andrew McCallum the UMass CS professor, the second to the president of the Australian Council of Social Services, the third to a Canadian musician, etc. The user will be able to click on the clusters and explore their contents interactively. The web pages in a cluster will be presented in a ranked order as well.

Group Search: Another interface will be used to demonstrate the Group Identification search capabilities of WEST. In the group query interface, the user can input several person names. The result will be the web pages that are related to the meant namesakes.

These applications will be demonstrated in both the online and offline modes.
In the online mode, the query input by the user will be translated into a corresponding query (or set of queries) over Internet search engines (specifically over Google). WEST allows the user to specify the number of web pages to retrieve from the search engine, which will be disambiguated into corresponding clusters. In the online mode, WEST uses only the GraphER and EnsembleER approaches, since WebER is a server-side approach and is not amenable to realization as middleware. The demonstration will allow observers to do diverse searches (perhaps of their own names) and perceive both the quality and the efficiency of WEST. In the offline mode, WEST will use preconstructed "canned" examples where we have already crawled the web to retrieve the search results and constructed the corresponding clusters. In the offline mode, in addition to illustrating the GraphER and EnsembleER approaches, we will also demonstrate the disambiguation power of the WebER approach.

REFERENCES

[1] J. Artiles, J. Gonzalo, and S. Sekine. The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. In SemEval, 2007.
[2] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW, 2005.
[3] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Combining entity resolution techniques with application to web people search. Under submission.
[4] D. V. Kalashnikov, Z. Chen, S. Mehrotra, and R. Nuray. Web people search via connection analysis. IEEE TKDE, 2008. To appear.
[5] D. V. Kalashnikov, S. Mehrotra, S. Chen, R. Nuray, and N. Ashish. Disambiguation algorithm for people search on the web. In ICDE, 2007.
[6] D. V. Kalashnikov, R. Nuray-Turan, and S. Mehrotra. Towards breaking the quality curse. A web-querying approach to Web People Search. In Proc. of Annual International ACM SIGIR Conference, Singapore, July 20–24, 2008.
