WEST:ModernTechnologiesforWebPeopleSearchDmitriV.
KalashnikovZhaoqiChenRabiaNuray-TuranSharadMehrotraZhengZhangComputerScienceDepartmentUniversityofCalifornia,IrvineI.
INTRODUCTIONInthispaperwedescribeWEST(WebEntitySearchTech-nologies)systemthatwehavedevelopedtoimprovepeoplesearchovertheInternet.
RecentlytheproblemofWebPeopleSearch(WePS)hasattractedsignicantattentionfromboththeindustryandacademia.
IntheclassicformulationofWePSproblemtheuserissuesaquerytoawebsearchenginethatconsistsofanameofapersonofinterest.
Forsuchaquery,atraditionalsearchenginesuchasYahooorGooglewouldreturnwebpagesthatarerelatedtoanypeoplewhohappenedtohavethequeriedname.
ThegoalofWePS,instead,istooutputasetofclustersofwebpages,oneclusterpereachdistinctperson,containingallofthewebpagesrelatedtothatperson.
Theuserthencanlocatethedesiredclusterandexplorethewebpagesitcontains.
TheWePSapproachofferssignicantadvantages.
Forex-ample,considersearchingforapersonwhoisanamesakeoftheformerPresidentBillClinton.
Thewebpagesofthelessfamouspersonwillbeovershadowedintoday'ssearchenginesandwillappearfarinthesearch.
WePSsystemsaddressthisproblembyrstpresentingtotheuserthesetofclusters,amongwhichtheuserthencanselecttheclustercontainingthewebpagesofthenamesakeofinterest.
ThekeytechnologyofanyWePSsystem,includingWEST,isthatofEntityResolution.
InasettingofEntityResolutionproblem,adatasetcontainsinformationaboutobjectsandtheirinteractions.
Theobjectsarereferredtovia(textual)descrip-tions/references,whichmightnotbeuniqueidentiersoftheobjects,leadingtoambiguity.
ThetaskofEntityResolutionalgorithmsistoidentifyallofthereferencesthatco-refer,i.
e.
,refertothesamereal-worldentity.
InWePSthewebpagesreturnedbyasearchenginecanbeviewedasreferences.
Theoveralltaskcanbeviewedasthatofndingthewebpagesthatrefertothesamenamesake.
WehavedevelopedthreedifferentEntityResolutionalgo-rithmsthatcanbeemployedbyWEST:1)GraphERapproachextractstheSocialNetwork(peo-ple,organizations,locations)offthewebpagesalongwithhyperlinkandemailinformation.
ItrepresentstheresultingEntity-Relationshipnetworkasagraph.
TheapproachthenanalyzesthisgraphandthewebpageThisresearchwassupportedbyNSFAwards0331707and0331690,andDHSAwardEMW-2007-FP-02535.
textualsimilaritytodeterminewhichwebpagesco-refer[4],[5].
GraphERwillbecoveredinSectionIII-A.
2)EnsembleERapproachcombinesresultsofmultiple"base"ERsystemstoproducetheoverallclustering.
Duringthetrainingphase,EnsembleERapproachem-ployssupervisedlearningtostudyhowwellthebaseERsystemsperformintermsoftheirqualityundervarietyofconditions/contextsbytrainingameta-levelclassier.
Itthenusesthisclassierduringtheactualqueryprocessingtocomputeitsnalclustering[3].
EnsembleERwillbecoveredinSectionIII-B.
3)WebERapproach,unliketheabovetwo(andmanyother)approaches,doesnotlimititsprocessingtoanalyzingtherelevantwebpagesonly.
Instead,itleveragesapowerfulexternaldatasourcetogainitsadvantage.
Specically,likeGraphERitrstextractssocialnetworkofftheweb-pages.
ButthenitqueriestheWebtocollectadditionalinformationonthevariouscomponentsofthisnetwork[6].
WebERwillbecoveredinSectionIII-C.
Eachofthesethreealgorithmshasbeendemonstratedtooutperformthecurrentstateofthearttechniquesonavarietyofdatasets[3]–[6].
Thecomparisonincludes18approachesthathavebeenpartofWePSTaskcompetitiononalargedatasetwhichisnowconsideredtobeadefactostandardfortestingWePSsolutions[1].
WESTprovidesmultipleinterfacestosearch.
TheinputandoutputinterfacesofWESTareillustratedinFigures1and2respectively.
Naturally,WESTsupportsthestandardWePSinterfacewheretheuserprovidesapersonnameasthequery.
Italsosupportsadditionalfunctionality,wheretheusercanspecifycontextqueriestohelplocatethenamesakeofinterestquicker.
Thecontextcanbespeciedintheformoflocation,people,and/ororganizationsassociatedwiththenamesakeofinterest.
NoticethatthecontexthereisnotusedasadditionalkeywordstoquerytheWeb,butisusedtoidentifytherightnamesaketheuserislookingfor.
Thismeansthatthewebpagesintheclusterdoesnothavetoeachcontainthecontextkeywords,andsomeofthemmightevencontainnoneoftheseadditionalcontextkeywords.
BesidestheUIforsearchingforasingleindividual,WESToffersaGroupSearchinterfacetosupporttheGroupIdenti-cationquerycapabilities.
InaGroupIdenticationtask,theinputismultiplenamesofpeoplethatareknowntoberelatedinsomeway.
Forinstance,aquerymightbe"MichaelJordan"Fig.
1.
InputInterfaceofWEST.
Fig.
2.
OutputInterfaceofWEST.
and"MagicJohnson",implyingthatthemeantnamesakesarebasketballplayers.
Theobjectiveistoretrievethewebpagesofthemeantnamesakesonly.
Whilethedemonstrationwillillustrateboththesinglepersonsearchandgroupsearchcapabilities,thesubsequentdiscussionwillfocusonasinglepersonsearch.
Thealgorith-micdetailsoftheGroupSearchcanbefoundin[4].
Therestofthispaperisorganizedasfollows.
SectionIIpresentsthestepsoftheoverallWESTapproach.
ThenSectionIIIcoversthethreeEntityResolutionalgorithms.
Finally,SectionIVdescribesthefunctionalityofWESTthatwillbedisplayedduringthedemo.
II.
OVERALLALGORITHMThestepsoftheoverallWESTapproach,inthecontextofamiddlewarearchitecture,areillustratedinFigure3.
Theyinclude:1)UserInput.
TheuserissuesaqueryviatheWESTinputinterface.
2)Top-KRetrieval.
Thesystem(middleware)sendsaqueryconsistingofapersonnametoasearchengine,suchasGoogle,andretrievesthetop-Kreturnedwebpages.
ThisisastandardstepperformedbymostofthecurrentWePSsystems.
Top-KWebpagesPerson1Person2Person3ResultsClusteringPersonXSearchEnginePreprocessingPreprocessedpagesAuxiliaryInformationPostprocessingTop-KWebpagesPerson1Person2Person3ResultsClusteringPersonXSearchEnginePreprocessingPreprocessedpagesAuxiliaryInformationPostprocessingFig.
3.
OverviewoftheWESTProcessingSteps.
3)Pre-processing.
Thesetop-Kwebpagesarethenprepro-cessed.
Themaintwopre-processingstepsare:a)TF/IDF.
Pre-processingstepsforcomputingTF/IDFarecarriedout.
Theyinclude:stemming,stopwordremoval,nounphraseidentication,in-vertedindexcomputations,etc.
b)Extraction.
NamedEntities,includingpeople,lo-cations,organizationsareextractedusingathirdpartynamedentityextractionsoftware.
Hyperlinksandemailsaddressedareextractedaswell.
Someauxiliarydatastructuresarebuiltonthisdata.
4)Clustering.
OneofthethreeEntityResolutionalgo-rithmsisappliedtothedatatoclusterthewebpages.
ThealgorithmswillbeexplainedinSectionIII.
5)Post-processing.
Thepost-processingstepsinclude:a)ClusterSketchesarecomputed.
b)ClusterRankiscomputedbasedon(a)thecontextkeywords,ifpresentand(b)theoriginalsearchengine'sorderingofthewebpages.
c)WebpageRankiscomputedtodeterminetherela-tiveorderingofwebpagesinsideeachcluster.
6)Visualization.
Theresultingclustersarepresentedtotheuser,whichcanbeinteractivelyexplored.
WenextdiscussthekeycomponentofanyWePSsystem:theEntityResolutionalgorithms.
III.
ENTITYRESOLUTIONALGORITHMSThissectionpresentsanoverviewofthethreeentityreso-lutionalgorithmsusedbytheWESTsystemforclusteringthewebpages.
A.
GraphERTodeterminewhethertworeferencesuandvco-refertraditionalapproachesatthecoreanalyzesimilarityoffeaturesofuandvaccordingtosomefeature-basedsimilarityfunctionf(u,v).
TheGraphERapproachhasbeendevelopedbasedontheobservationthatmanydatasetsarerelationalinnature.
Theycontainnotonlyobjectsandtheirfeaturesbutalsoinformationaboutrelationshipsinwhichtheyparticipate.
InstanceBaseModel1BaseModel1BaseModel1…CombiningModelPredictionInstanceBaseModel1BaseModel1BaseModel1…CombiningModelPredictionFig.
4.
AGeneralFrameworkforCombiningMultipleSystems.
GraphERutilizestheinformationstoredintheserelationshipstoimprovethedisambiguationquality.
TheapproachviewsthedatasetbeinganalyzedasanEntity-RelationshipGraphofnodes(entities)interconnectedviarelationships(edges).
FortheWePSdomain,thenodesarethenamedentities,hyperlinks,andemailsextractedoffthewebpagesduringthepre-processingaswellasthewebpagesthemselves.
Therelationshipsareco-occurrencerelationships,andthosethatarederivedfromhyperlinkanddecompositions.
Thegraphcreationprocedureisdiscussedindetailin[4].
TheentityrelationshipsgraphinthiscaseisacombinationoftheSocialNetworkextractedfromthewebpagesaswellasthehyperlinkgraph.
Todecidewhethertworeferencesuandvco-refer,GraphERanalyzeshowstronglyuandvareconnectedinthisgraphaccordingtoaconnectionstrengthmeasurec(u,v).
Tocomputec(u,v),thealgorithmdiscoversthesetPLuvofallL-shortsimpleu-vpaths.
1Thevalueofc(u,v)iscomputedasthesumoftheconnectionsstrengthcontributedfromeachpathpinPLuv:c(u,v)=p∈PLuvc(p).
Asupervisedlearningprocedure,formulatedasalinearpro-grammingoptimizationtask,isusedtolearnc(p)functionfromdata[4],[5].
Thesimilarityfunctions(u,v)isthendenedasacombinationofc(u,v)andf(u,v).
Theoutputofthisfunctionisusedbyacorrelationclusteringalgorithmtogeneratethenalclustersofwebpages.
B.
EnsembleEREnsembleERapproachismotivatedbytheobservationthatoftenthereisnosingleentityresolution(ER)techniquealwaysperformthebest.
Rather,differentERsolutionsperformbetterindifferentcontexts.
EnsembleERisastacking-likeframeworkthatcombinestheclusteringresultsofmultiplebase-levelERsystemssothatthenalclusteringqualityissuperiortothatofanysinglebaseERsystem.
Thekeyideaistotransformtheoutputofbase-levelERsystems,togetherwithcontext,intoameta-levelfeatureset.
Asupervisedlearningapproachisutilizedtotrainaclassieronthemeta-leveldata.
Thealgorithmthenappliesthemeta-levelclassiertothedatasetbeingprocessedtocreatethenalclusteringresults.
Figure4showsageneralframeworkofcombiningmultiplesystems.
SimilartoGraphERapproach,EnsembleERalsoutilizesagraphrepresentationofthedataset.
Thegraphhoweveris1ApathisL-shortifitslengthdoesnotexceedL.
Apathissimpleifitdoesnotcontainduplicatenodes.
different.
Thenodesarethetop-Kwebpages.
Edge(u,v)betweentwowebpagesuandviscreatedonlyifacertainnumberofthebase-levelERsystemsdecidethatuandvshouldbeinthesamecluster.
Edge(u,v)representsapossibilitythatuandvmightco-refer.
WithrespecttothegraphthattaskofEnsembleERcanbeviewedasdecidingforeachedgewhetheruandvshouldbeputinonecluster.
LetS1,S2,Snbethenbase-levelERsystems.
Foreachedgeei=(u,v),eachSjoutputitsdecisiondij∈{0,1}.
Here,ifuandvareplacedinthesameclusterbySjthendij=1otherwisedij=0.
Then,foreachedgeeiwecandeneadecisionfeaturevectorasdi={di1,di2,din}.
Foredgeeiitslocalcontextisalsoencodedasamulti-dimensionalcontextfeaturevectorfi={fi1,fi2,fim}.
OneoftheinterestingaspectsofEnsembleERsolutionisthatitcreatescontextfeaturesinapredictiveway,basedonrstestimatingsomeunknownparametersofthedatabeingprocessed.
Forinstance,letK1,K2,KnbethenumberofclustersthatsystemsS1,S2,Snoutput.
OneofthefeaturesusedbyEnsembleERiscomputedbyapplyingaregressiontothisdatatoestimatethenumberofnamesakesK,wherethetruenumberofnamesakesK+isunknownbeforehandtothealgorithm.
EnsembleERthenconvertsthedifferencebetweenKandKjintoafeature,basedontheintuitionthattheclosertheKjtoK,themorecondencecanbeplacedintheanswerofsystemSj.
ThegoalofEnsembleERreducestondingamappingdi*fi→ai.
Here,ai={0,1}isthepredictionofthecombinedalgorithmforedgeei=(u,v),whereai=1iftheoverallalgorithmbelievesuandvbelongtothesamecluster,andai=0otherwise.
ThedetailsoftheEnsemblealgorithmcanbefoundin[3].
C.
WebERWebERapproachisconsiderablydifferentfrommostoftheotherWePSsolutions.
UnlikemanyotherWePSsystems,WebERdoesnotlimititsprocessingtoanalyzingonlytheinformationstoredinthetop-Kreturnedwebpages.
RatheritemploystheWebasanexternaldatasourcetogetadditionalinformation,whichultimatelyleadstohigherqualityresults.
WebERisprimarilyintendedtobeaserver-sidesolution.
Thatis,itscodeisexecutedatasearchengine(server)side.
Becauseofthat,mostofthepre-processingcanbeaccomplishedinbulkbeforequeryprocessingstarts,includingextractionandTF/IDFcomputations.
ThequeriestothesearchenginearecarriedoutinternallywithoutgoingviatheInternetthusmakingtheirprocessingmuchfaster.
LetD={d1,d2,dK}bethesetofthetop-Kreturnedwebpages.
WebERrstmergessomeofthewebpagesintoinitialclustersusingNamedEntity(NE)clusteringwithaconservativethresholds.
Thedocument-documentsimilarityiscomputedusingTF/IDFapproachwithcosinesimilarity.
Onlyafewwebpagesthathaveoverwhelmingevidencethattheyrepresentthesamepeoplearemergedduringthisprocess.
LetPiandOibethesetofpeopleandorganizationsextractedfromwebpagedi.
ForeachpairwebpagesdianddjthatALL-IN-ONEUBC-ASUC3MWITDFKI2JHU1-13TITPIUA-ZSASWAT-IVAUGONE-IN-ONEUNNFICOSHEFUVAPSNUSIRST-BPCU-COMSEMWEST00.
10.
20.
30.
40.
50.
60.
70.
80.
9SystemsFpFig.
5.
TheExperimentresultsonWePSdataset.
arenotyetputinthesameclustertheapproachformsandissuesqueriestotheWebtocollecttheco-occurrencestatistics,whichinthiscaseisthenumberofthepagesreturnedforagivenquery.
WebERusestwomaintypesofqueries:NANDCiANDCjCiANDCjHereNisthenameofthepersonbeingqueriedbytheuser,andCiandCjarethecontextofpagesdianddj.
ContextCicanbeeither(a)anORcombinationofpeoplefromPi,or(b)anORcombinationoforganizationsfromOi.
ThesameholdsforCiresultingineightqueriesfordianddjpair.
Theseco-occurrencecountsareindicativeofhowoftentheelementsofthetwosocialnetworksco-occuronthewebandthushowstronglytheyarerelated.
Thesecountsarethentransformedintofeatures,whicharethenusedtocomputethesimilaritybetweenwebpagesdianddj.
OneofthekeycontributionsofthisworkisanewSkyline-basedclassierfordecidingwhichdianddjwebpagesshouldbemergedbasedonthecorrespondingfeaturevector.
Itisaspecializedclassierthatwehavedesignedspecicallyfortheclusteringproblemathand.
Skyline-basedclassiergainsitsadvantageduetoavarietyoffunctionalitiesbuiltintoit,including:Ittakesintoaccountdominancethatispresentinthefeaturesspace.
Italsonetunesitselftothequalitymeasurebeingused.
Ittakesintoaccounttransitivityofmerges:thatis,ac-countsforthefactthattwolargeclusterscanbemergedbyasinglemergedecision,and,thus,onedirectmergedecisioncanleadtomultipleindirectones.
Thesepropertiesallowittoeasilyoutperformotherclassi-cationmethods(whicharegeneric),suchasDTCorSVM.
Theapproachisdiscussedindetailin[6].
IV.
DEMONSTRATIONTheERalgorithmsusedbyWESTareknowntoproducehighlycompetitiveresults.
Figure5presentsthecomparisonresultsoftheWESTwith18otherWePSsolutionsthathavebeenpartoftheWePSTaskchallenge[1].
ThequalityofclusteringisevaluatedintermsofFpmeasure(harmonicmeanofPurityandInversePurity[1]).
ForthegroupidenticationwehavecomparedWESTwiththestateoftheartapproachpublishedin[2].
TheaverageF-measureonthisdatasetachievedbyWESTis92%whichisnearly12%improvementovertheresultreportedin[2].
TheWESTsystemwillbedemonstratedthroughtwoap-plicationsbuiltoverthebasesystem.
SinglePersonSearch(illustratedinFigure1):whereinausercanenterapersonnameandcontextintheformofpeople,locations,and/ororganizationsassociatedwiththepersonbeingqueried.
Theresultswillbeasetofclusters.
Eachclusterwillhaveasetofkeywordsattachedtoindicatethemainaspectofthecorrespondingnamesake.
Theclusterswillbepresentedinarankedorderbasedontheoriginalranksofthewebpagesintheclustersandthecontextkeywords.
Figure2showssampleresultingclustersforthequery"AndrewMcCallum".
TherstreturnedgroupcorrespondstoAndrewMcCallumtheUMassCSprofessor,thesecondtothepresidentoftheAustralianCouncilofSocialServices,thethirdtoaCanadianmusician,etc.
Theuserwillbeabletoclickontheclustersandexploretheirclustersinteractively.
Thewebpagesinaclusterwillbepresentedinarankedorderaswell.
GroupSearch:Anotherinterfacewillbeusedtodemon-stratetheGroupIdenticationsearchcapabilitiesofWEST.
Ingroupqueryinterface,theusercaninputseveralpersonnames.
Theresultwillbethewebpagesthatarerelatedtothemeantnamesakes.
Theseapplicationswillbedemonstratedbothintheonlineandofinemodes.
Intheonlinemode,thequeryinputbytheuserwillbetranslatedintoacorresponding(setof)queriesoverInternetsearchengines(specicallyoverGoogle).
WESTallowstheusertospecifythenumberofwebpagestoretrievefromthesearchengine,whichwillbedisambiguatedintocorrespondingclusters.
Intheonlinemode,WESTusesonlyGraphERandEnsembleERapproachessinceWebERisaserver-sideapproachandisnotamenableforrealizationasamiddleware.
Thedemonstrationwillallowobserverstododiversesearches(perhaps,oftheirownnames)andperceiveboththequalityaswellasefciencyofWEST.
Intheofinemode,WESTwillusepreconstructed"canned"exampleswherewehavealreadycrawledthewebtoretrievethesearchresultsandconstructedthecorrespondingclusters.
Intheofinemode,inadditiontoillustratingtheGraphERandEnsembleERapproaches,wewillalsodemonstratethedisambiguationpoweroftheWebERapproach.
REFERENCES[1]J.
Artiles,J.
Gonzalo,andS.
Sekine.
Thesemeval-2007wepsevaluation:Establishingabenchmarkforthewebpeoplesearchtask.
InSemEval,2007.
[2]R.
BekkermanandA.
McCallum.
Disambiguatingwebappearancesofpeopleinasocialnetwork.
InWWW,2005.
[3]Z.
Chen,D.
V.
Kalashnikov,andS.
Mehrotra.
Combiningentityresolutiontechniqueswithapplicationtowebpeoplesearch.
InUndersubmission.
[4]D.
V.
Kalashnikov,Z.
Chen,S.
Mehrotra,andR.
Nuray.
Webpeoplesearchviaconnectionanalysis.
IEEETKDE,2008.
toappear.
[5]D.
V.
Kalashnikov,S.
Mehrotra,S.
Chen,R.
Nuray,andN.
Ashish.
Disambiguationalgorithmforpeoplesearchontheweb.
InICDE,2007.
[6]D.
V.
Kalashnikov,R.
Nuray-Turan,andS.
Mehrotra.
Towardsbreakingthequalitycurse.
Aweb-queryingapproachtoWebPeopleSearch.
InProc.
ofAnnualInternationalACMSIGIRConference,Singapore,July20–242008.
轻云互联成立于2018年的国人商家,广州轻云互联网络科技有限公司旗下品牌,主要从事VPS、虚拟主机等云计算产品业务,适合建站、新手上车的值得选择,香港三网直连(电信CN2GIA联通移动CN2直连);美国圣何塞(回程三网CN2GIA)线路,所有产品均采用KVM虚拟技术架构,高效售后保障,稳定多年,高性能可用,网络优质,为您的业务保驾护航。活动规则:用户购买任意全区域云服务器月付以上享受免费更换IP服...
快快CDN主营业务为海外服务器无须备案,高防CDN,防劫持CDN,香港服务器,美国服务器,加速CDN,是一家综合性的主机服务商。美国高防服务器,1800DDOS防御,单机1800G DDOS防御,大陆直链 cn2线路,线路友好。快快CDN全球安全防护平台是一款集 DDOS 清洗、CC 指纹识别、WAF 防护为一体的外加全球加速的超强安全加速网络,为您的各类型业务保驾护航加速前进!价格都非常给力,需...
青果网络QG.NET定位为高效多云管理服务商,已拥有工信部颁发的全网云计算/CDN/IDC/ISP/IP-VPN等多项资质,是CNNIC/APNIC联盟的成员之一,2019年荣获国家高薪技术企业、福建省省级高新技术企业双项荣誉。那么青果网络作为国内主流的IDC厂商之一,那么其旗下美国洛杉矶CN2 GIA线路云服务器到底怎么样?官方网站:https://www.qg.net/CPU内存系统盘流量宽带...
west为你推荐
马云卸任软银董事马云已经卸任了阿里巴巴,那么他接下来的身份是什么?桌面背景图片大全谁能给我个 游戏桌面图标大全天气预报哪个好用哪个最准确最准天气预报软件排行是怎样的?视频剪辑软件哪个好视频剪辑哪个软件好用传奇类手游哪个好传奇手游哪个好玩免费帕萨特和迈腾哪个好2019帕萨特和迈腾哪个好?隔音怎么样?华为p40和mate30哪个好Huawei Mate30 和 P40 哪个好?录音软件哪个好什么软件用来录音更好?加速器哪个好主流加速器哪个好海克斯皮肤哪个好诺手二周年皮肤好不好,和海克斯那个比哪个好,二周年属于稀有吗
国内vps 最便宜虚拟主机 新网域名解析 腾讯云盘 免费主机 美国主机代购 suspended php探针 lighttpd 百度云1t 免费phpmysql空间 银盘服务是什么 闪讯官网 免费外链相册 独享主机 丽萨 photobucket lamp兄弟连 杭州电信宽带优惠 如何登陆阿里云邮箱 更多