WEST: Modern Technologies for Web People Search

Dmitri V. Kalashnikov, Zhaoqi Chen, Rabia Nuray-Turan, Sharad Mehrotra, Zheng Zhang
Computer Science Department, University of California, Irvine

I. INTRODUCTION

In this paper we describe the WEST (Web Entity Search Technologies) system that we have developed to improve people search over the Internet.
Recently the problem of Web People Search (WePS) has attracted significant attention from both industry and academia. In the classic formulation of the WePS problem, the user issues a query to a web search engine that consists of the name of a person of interest. For such a query, a traditional search engine such as Yahoo or Google would return webpages that are related to any people who happen to have the queried name. The goal of WePS, instead, is to output a set of clusters of webpages, one cluster per distinct person, containing all of the webpages related to that person. The user can then locate the desired cluster and explore the webpages it contains.

The WePS approach offers significant advantages. For example, consider searching for a person who is a namesake of the former President Bill Clinton. The webpages of the less famous person will be overshadowed in today's search engines and will appear far down in the search results. WePS systems address this problem by first presenting to the user the set of clusters, among which the user can then select the cluster containing the webpages of the namesake of interest.

The key technology of any WePS system, including WEST, is that of Entity Resolution. In the setting of the Entity Resolution problem, a dataset contains information about objects and their interactions. The objects are referred to via (textual) descriptions/references, which might not be unique identifiers of the objects, leading to ambiguity. The task of Entity Resolution algorithms is to identify all of the references that co-refer, i.e., refer to the same real-world entity. In WePS the webpages returned by a search engine can be viewed as references. The overall task can be viewed as that of finding the webpages that refer to the same namesake.

We have developed three different Entity Resolution algorithms that can be employed by WEST:

1) The GraphER approach extracts the Social Network (people, organizations, locations) off the webpages along with hyperlink and email information. It represents the resulting Entity-Relationship network as a graph. The approach then analyzes this graph and the webpage textual similarity to determine which webpages co-refer [4], [5]. GraphER will be covered in Section III-A.
2) The EnsembleER approach combines the results of multiple "base" ER systems to produce the overall clustering. During the training phase, EnsembleER employs supervised learning to study how well the base ER systems perform in terms of their quality under a variety of conditions/contexts by training a meta-level classifier. It then uses this classifier during the actual query processing to compute its final clustering [3]. EnsembleER will be covered in Section III-B.
3) The WebER approach, unlike the above two (and many other) approaches, does not limit its processing to analyzing the relevant webpages only. Instead, it leverages a powerful external data source to gain its advantage. Specifically, like GraphER it first extracts the social network off the webpages. But then it queries the Web to collect additional information on the various components of this network [6]. WebER will be covered in Section III-C.

Each of these three algorithms has been demonstrated to outperform the current state-of-the-art techniques on a variety of datasets [3]–[6]. The comparison includes 18 approaches that have been part of the WePS Task competition on a large dataset which is now considered to be a de facto standard for testing WePS solutions [1].

This research was supported by NSF Awards 0331707 and 0331690, and DHS Award EMW-2007-FP-02535.

WEST provides multiple interfaces to search. The input and output interfaces of WEST are illustrated in Figures 1 and 2, respectively. Naturally, WEST supports the standard WePS interface where the user provides a person name as the query. It also supports additional functionality, where the user can specify context queries to help locate the namesake of interest more quickly. The context can be specified in the form of locations, people, and/or organizations associated with the namesake of interest. Notice that the context here is not used as additional keywords to query the Web, but is used to identify the right namesake the user is looking for. This means that the webpages in the cluster do not each have to contain the context keywords, and some of them might even contain none of these additional context keywords.

Besides the UI for searching for a single individual, WEST offers a Group Search interface to support the Group Identification query capabilities. In a Group Identification task, the input is multiple names of people that are known to be related in some way. For instance, a query might be "Michael Jordan" and "Magic Johnson", implying that the intended namesakes are basketball players. The objective is to retrieve the webpages of the intended namesakes only. While the demonstration will illustrate both the single person search and group search capabilities, the subsequent discussion will focus on single person search. The algorithmic details of the Group Search can be found in [4].

Fig. 1. Input Interface of WEST.
Fig. 2. Output Interface of WEST.

The rest of this paper is organized as follows. Section II presents the steps of the overall WEST approach. Then Section III covers the three Entity Resolution algorithms. Finally, Section IV describes the functionality of WEST that will be displayed during the demo.

II. OVERALL ALGORITHM

The steps of the overall WEST approach, in the context of a middleware architecture, are illustrated in Figure 3. They include:

1) User Input. The user issues a query via the WEST input interface.
2) Top-K Retrieval. The system (middleware) sends a query consisting of a person name to a search engine, such as Google, and retrieves the top-K returned webpages. This is a standard step performed by most of the current WePS systems.
3) Pre-processing. These top-K webpages are then preprocessed. The two main pre-processing steps are:
   a) TF/IDF. Pre-processing steps for computing TF/IDF are carried out. They include stemming, stop word removal, noun phrase identification, inverted index computations, etc.
   b) Extraction. Named Entities, including people, locations, and organizations, are extracted using third-party named entity extraction software. Hyperlinks and email addresses are extracted as well. Some auxiliary data structures are built on this data.
4) Clustering. One of the three Entity Resolution algorithms is applied to the data to cluster the webpages. The algorithms will be explained in Section III.
5) Post-processing. The post-processing steps include:
   a) Cluster Sketches are computed.
   b) Cluster Rank is computed based on (a) the context keywords, if present, and (b) the original search engine's ordering of the webpages.
   c) Webpage Rank is computed to determine the relative ordering of webpages inside each cluster.
6) Visualization. The resulting clusters are presented to the user and can be explored interactively.

Fig. 3. Overview of the WEST Processing Steps. (Diagram components: Search Engine, Top-K Webpages, Preprocessing, Preprocessed pages, Auxiliary Information, Clustering, Postprocessing, Results: Person 1, Person 2, Person 3, Person X.)

We next discuss the key component of any WePS system: the Entity Resolution algorithms.

III. ENTITY RESOLUTION ALGORITHMS

This section presents an overview of the three entity resolution algorithms used by the WEST system for clustering the webpages.

A. GraphER

To determine whether two references u and v co-refer, traditional approaches at their core analyze the similarity of the features of u and v according to some feature-based similarity function f(u, v). The GraphER approach has been developed based on the observation that many datasets are relational in nature. They contain not only objects and their features but also information about relationships in which they participate. GraphER utilizes the information stored in these relationships to improve the disambiguation quality.

Fig. 4. A General Framework for Combining Multiple Systems. (Diagram components: Instance, Base Models, Combining Model, Prediction.)

The approach views the dataset being analyzed as an Entity-Relationship Graph of nodes (entities) interconnected via edges (relationships). For the WePS domain, the nodes are the named entities, hyperlinks, and emails extracted off the webpages during pre-processing, as well as the webpages themselves. The relationships are co-occurrence relationships and those that are derived from hyperlink and email decompositions. The graph creation procedure is discussed in detail in [4]. The entity-relationship graph in this case is a combination of the Social Network extracted from the webpages and the hyperlink graph.

To decide whether two references u and v co-refer, GraphER analyzes how strongly u and v are connected in this graph according to a connection strength measure c(u, v). To compute c(u, v), the algorithm discovers the set $P^L_{uv}$ of all L-short simple u-v paths.¹ The value of c(u, v) is computed as the sum of the connection strengths contributed by each path p in $P^L_{uv}$:

$$c(u, v) = \sum_{p \in P^L_{uv}} c(p).$$

A supervised learning procedure, formulated as a linear programming optimization task, is used to learn the c(p) function from data [4], [5]. The similarity function s(u, v) is then defined as a combination of c(u, v) and f(u, v). The output of this function is used by a correlation clustering algorithm to generate the final clusters of webpages.
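
As a rough illustration of the connection strength computation, the sketch below enumerates all L-short simple u-v paths in a small undirected graph by depth-first search and sums a score over them. The per-path score used here (a geometric decay in the path length) is only a placeholder assumption; in GraphER the function c(p) is learned from data [4], [5]. The names `add_edge`, `l_short_simple_paths`, `connection_strength`, and `decay_score`, as well as the toy graph, are hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Set

Graph = Dict[str, Set[str]]  # undirected graph stored as adjacency sets


def add_edge(g: Graph, a: str, b: str) -> None:
    """Add an undirected edge a-b to the graph."""
    g.setdefault(a, set()).add(b)
    g.setdefault(b, set()).add(a)


def l_short_simple_paths(g: Graph, u: str, v: str, max_len: int) -> Iterator[List[str]]:
    """Yield every simple u-v path that has at most max_len edges."""
    stack = [(u, [u])]
    while stack:
        node, path = stack.pop()
        if node == v:
            yield path
            continue
        if len(path) - 1 >= max_len:  # the path already uses max_len edges
            continue
        for nxt in g.get(node, ()):
            if nxt not in path:  # simple path: never revisit a node
                stack.append((nxt, path + [nxt]))


def connection_strength(g: Graph, u: str, v: str, max_len: int,
                        path_score: Callable[[List[str]], float]) -> float:
    """c(u, v): sum of path_score(p) over all L-short simple u-v paths."""
    return sum(path_score(p) for p in l_short_simple_paths(g, u, v, max_len))


def decay_score(path: List[str]) -> float:
    """Placeholder for the learned c(p): halve the score for every edge."""
    return 0.5 ** (len(path) - 1)


# Toy graph: two webpages linked through shared (hypothetical) entities.
g: Graph = {}
add_edge(g, "page1", "org:UCI")
add_edge(g, "page2", "org:UCI")
add_edge(g, "page1", "person:A")
add_edge(g, "person:A", "org:B")
add_edge(g, "org:B", "page2")

print(connection_strength(g, "page1", "page2", 3, decay_score))
```

With L = 3 both the two-edge path through the shared organization and the three-edge path through the person node contribute; with L = 2 only the shorter path is counted.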

B. EnsembleER

The EnsembleER approach is motivated by the observation that often no single entity resolution (ER) technique always performs best. Rather, different ER solutions perform better in different contexts. EnsembleER is a stacking-like framework that combines the clustering results of multiple base-level ER systems so that the final clustering quality is superior to that of any single base ER system. The key idea is to transform the output of the base-level ER systems, together with context, into a meta-level feature set. A supervised learning approach is utilized to train a classifier on the meta-level data. The algorithm then applies the meta-level classifier to the dataset being processed to create the final clustering results. Figure 4 shows a general framework for combining multiple systems.

Similar to the GraphER approach, EnsembleER also utilizes a graph representation of the dataset. The graph, however, is different. The nodes are the top-K webpages. Edge (u, v) between two webpages u and v is created only if a certain number of the base-level ER systems decide that u and v should be in the same cluster. Edge (u, v) represents a possibility that u and v might co-refer. With respect to the graph, the task of EnsembleER can be viewed as deciding for each edge whether u and v should be put in one cluster.

Let $S_1, S_2, \ldots, S_n$ be the n base-level ER systems. For each edge $e_i = (u, v)$, each $S_j$ outputs its decision $d_{ij} \in \{0, 1\}$. Here, $d_{ij} = 1$ if u and v are placed in the same cluster by $S_j$, and $d_{ij} = 0$ otherwise. Then, for each edge $e_i$ we can define a decision feature vector $d_i = \{d_{i1}, d_{i2}, \ldots, d_{in}\}$. For edge $e_i$ its local context is also encoded as a multi-dimensional context feature vector $f_i = \{f_{i1}, f_{i2}, \ldots, f_{im}\}$.

One of the interesting aspects of the EnsembleER solution is that it creates context features in a predictive way, based on first estimating some unknown parameters of the data being processed. For instance, let $K_1, K_2, \ldots, K_n$ be the numbers of clusters that systems $S_1, S_2, \ldots, S_n$ output. One of the features used by EnsembleER is computed by applying a regression to this data to estimate the number of namesakes $K$, where the true number of namesakes $K^+$ is unknown beforehand to the algorithm. EnsembleER then converts the difference between $K$ and $K_j$ into a feature, based on the intuition that the closer $K_j$ is to $K$, the more confidence can be placed in the answer of system $S_j$.

The goal of EnsembleER reduces to finding a mapping $d_i \times f_i \to a_i$. Here, $a_i \in \{0, 1\}$ is the prediction of the combined algorithm for edge $e_i = (u, v)$, where $a_i = 1$ if the overall algorithm believes u and v belong to the same cluster, and $a_i = 0$ otherwise. The details of the EnsembleER algorithm can be found in [3].

¹ A path is L-short if its length does not exceed L. A path is simple if it does not contain duplicate nodes.
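
To make the meta-level features more concrete, the following sketch builds the candidate edges and their decision vectors from the outputs of a few base systems, and derives one cluster-count feature per system. It is a simplified illustration under several assumptions that are not taken from [3]: each base clustering is given as a mapping from page id to cluster label, the edge threshold `min_votes` is a free parameter, the estimate of K is the plain mean of the K_j values (standing in for the regression), and the feature is the absolute difference |K - K_j|. Training and applying the meta-level classifier is omitted.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Clustering = Dict[str, int]  # page id -> cluster label given by one base system


def decision_vectors(pages: Sequence[str], systems: Sequence[Clustering],
                     min_votes: int) -> Dict[Tuple[str, str], List[int]]:
    """Map each candidate edge (u, v) to its decision vector d_i.

    An edge is kept only if at least min_votes base systems put u and v
    in the same cluster.
    """
    edges = {}
    for u, v in combinations(sorted(pages), 2):
        d = [int(s[u] == s[v]) for s in systems]  # d_ij for j = 1..n
        if sum(d) >= min_votes:
            edges[(u, v)] = d
    return edges


def cluster_count_features(systems: Sequence[Clustering]) -> List[float]:
    """Return |K - K_j| per system; K is estimated here by a simple mean."""
    ks = [len(set(s.values())) for s in systems]  # K_j: clusters output by S_j
    k_est = mean(ks)  # assumption: stand-in for the regression estimate of K
    return [abs(k_est - k) for k in ks]


# Three hypothetical base systems clustering four pages.
pages = ["a", "b", "c", "d"]
s1 = {"a": 0, "b": 0, "c": 1, "d": 1}
s2 = {"a": 0, "b": 0, "c": 0, "d": 1}
s3 = {"a": 0, "b": 1, "c": 2, "d": 2}

print(decision_vectors(pages, [s1, s2, s3], min_votes=2))
print(cluster_count_features([s1, s2, s3]))
```

In a fuller pipeline, each decision vector would be concatenated with the context features of its edge and passed to the trained meta-level classifier, which outputs the merge decision for that edge.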

C. WebER

The WebER approach is considerably different from most other WePS solutions. Unlike many other WePS systems, WebER does not limit its processing to analyzing only the information stored in the top-K returned webpages. Rather, it employs the Web as an external data source to get additional information, which ultimately leads to higher quality results.

WebER is primarily intended to be a server-side solution. That is, its code is executed at the search engine (server) side. Because of that, most of the pre-processing can be accomplished in bulk before query processing starts, including extraction and TF/IDF computations. The queries to the search engine are carried out internally without going via the Internet, thus making their processing much faster.

Let $D = \{d_1, d_2, \ldots, d_K\}$ be the set of the top-K returned webpages. WebER first merges some of the webpages into initial clusters using Named Entity (NE) clustering with a conservative threshold. The document-document similarity is computed using the TF/IDF approach with cosine similarity. Only a few webpages that have overwhelming evidence that they represent the same person are merged during this process.
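
The document-document similarity step can be sketched as follows. This is a minimal pure-Python illustration, assuming raw term frequency multiplied by log(N/df) as the TF/IDF weighting and documents already tokenized into terms (for example, the extracted named entities); the exact weighting scheme and the threshold used by WebER are not specified here, so both are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build one sparse TF/IDF vector per document (tf * log(N / df))."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical named-entity tokens extracted from three webpages.
docs = [
    ["uci", "mehrotra", "icde"],
    ["uci", "mehrotra", "sigmod"],
    ["nba", "bulls"],
]
vecs = tfidf_vectors(docs)
THRESHOLD = 0.9  # illustrative "conservative" value, not taken from the paper

for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        sim = cosine(vecs[i], vecs[j])
        print(i, j, round(sim, 3), "merge" if sim >= THRESHOLD else "keep apart")
```

A high threshold keeps the initial clusters small and precise, which matches the intent that only pages with overwhelming evidence are merged at this stage.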

Fig. 5. The experimental results on the WePS dataset. (Plot of Fp, with axis ticks from 0 to 0.9, per system. Systems: ALL-IN-ONE, UBC-AS, UC3M, WIT, DFKI2, JHU1-13, TITPI, UA-ZSA, SWAT-IV, AUG, ONE-IN-ONE, UNN, FICO, SHEF, UVA, PSNUS, IRST-BP, CU-COMSEM, WEST.)

Let $P_i$ and $O_i$ be the sets of people and organizations extracted from webpage $d_i$. For each pair of webpages $d_i$ and $d_j$ that are not yet put in the same cluster, the approach forms and issues queries to the Web to collect co-occurrence statistics, which in this case are the numbers of pages returned for a given query. WebER uses two main types of queries:

   $N$ AND $C_i$ AND $C_j$
   $C_i$ AND $C_j$

Here $N$ is the name of the person being queried by the user, and $C_i$ and $C_j$ are the contexts of pages $d_i$ and $d_j$. Context $C_i$ can be either (a) an OR combination of people from $P_i$, or (b) an OR combination of organizations from $O_i$. The same holds for $C_j$, resulting in eight queries for each $d_i$ and $d_j$ pair (two query types, two choices of $C_i$, and two choices of $C_j$). These co-occurrence counts are indicative of how often the elements of the two social networks co-occur on the web and thus how strongly they are related. These counts are then transformed into features, which are then used to compute the similarity between webpages $d_i$ and $d_j$.
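
The eight queries for a pair of pages can be generated mechanically, as the sketch below shows. The concrete query syntax (double quotes around phrases, parentheses around OR groups, uppercase AND/OR operators) is an assumption about the search engine and is not prescribed by the text; the function names and the example names are hypothetical, and issuing the queries to obtain hit counts is left out. The sketch also assumes that the people and organization sets of both pages are non-empty.

```python
from itertools import product
from typing import List, Sequence


def or_group(terms: Sequence[str]) -> str:
    """Build an OR combination of quoted terms, e.g. ("A" OR "B")."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def pair_queries(name: str,
                 people_i: Sequence[str], orgs_i: Sequence[str],
                 people_j: Sequence[str], orgs_j: Sequence[str]) -> List[str]:
    """Return the eight co-occurrence queries for webpages d_i and d_j."""
    contexts_i = [or_group(people_i), or_group(orgs_i)]  # two choices of C_i
    contexts_j = [or_group(people_j), or_group(orgs_j)]  # two choices of C_j
    queries = []
    for c_i, c_j in product(contexts_i, contexts_j):
        queries.append(f'"{name}" AND {c_i} AND {c_j}')  # N AND C_i AND C_j
        queries.append(f"{c_i} AND {c_j}")               # C_i AND C_j
    return queries


# Hypothetical entities extracted from two pages returned for "John Smith".
for q in pair_queries("John Smith",
                      people_i=["Alice Lee", "Carol Wu"], orgs_i=["Acme Corp"],
                      people_j=["Bob Ray"], orgs_j=["Example University"]):
    print(q)
```

The hit counts returned for these queries would then be turned into the feature vector of the pair.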

One of the key contributions of this work is a new Skyline-based classifier for deciding which webpages $d_i$ and $d_j$ should be merged based on the corresponding feature vector. It is a specialized classifier that we have designed specifically for the clustering problem at hand. The Skyline-based classifier gains its advantage from a variety of functionalities built into it, including:

• It takes into account dominance that is present in the feature space.
• It also fine-tunes itself to the quality measure being used.
• It takes into account transitivity of merges: that is, it accounts for the fact that two large clusters can be merged by a single merge decision, and, thus, one direct merge decision can lead to multiple indirect ones.

These properties allow it to easily outperform other classification methods (which are generic), such as DTC or SVM. The approach is discussed in detail in [6].
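
The notion of dominance mentioned above can be illustrated with the generic skyline definition: a feature vector dominates another if it is at least as large in every dimension and strictly larger in at least one, and the skyline is the set of vectors that no other vector dominates. The sketch below implements only this textbook notion, under the assumption that larger feature values indicate stronger evidence for a merge; it is not the Skyline-based classifier of [6], and the function names and sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(x: Sequence[float], y: Sequence[float]) -> bool:
    """True if x is >= y in every dimension and > y in at least one."""
    return (all(a >= b for a, b in zip(x, y))
            and any(a > b for a, b in zip(x, y)))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-dimensional feature vectors of webpage pairs.
points = [(3, 1), (1, 3), (2, 2), (1, 1), (2, 1)]
print(skyline(points))
```

Intuitively, if a pair with some feature vector deserves a merge, then a pair whose vector dominates it has at least as much evidence in every feature, which is the kind of structure a dominance-aware classifier can exploit.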

IV. DEMONSTRATION

The ER algorithms used by WEST are known to produce highly competitive results. Figure 5 presents the results of comparing WEST with 18 other WePS solutions that have been part of the WePS Task challenge [1]. The quality of clustering is evaluated in terms of the Fp measure (the harmonic mean of Purity and Inverse Purity [1]).
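
For reference, the Fp measure can be computed as in the sketch below. It assumes hard (non-overlapping) clusterings over the same set of pages, represented as lists of sets, and an equally weighted harmonic mean; the function names and the toy clustering are hypothetical.

```python
from typing import Hashable, List, Set

Clusters = List[Set[Hashable]]


def purity(clusters: Clusters, truth: Clusters) -> float:
    """Purity: size-weighted best precision of each cluster against the truth.

    Equals (1/n) * sum over clusters of the largest overlap with any
    ground-truth group, where n is the total number of clustered pages.
    """
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & t) for t in truth) for c in clusters) / n


def inverse_purity(clusters: Clusters, truth: Clusters) -> float:
    """Inverse purity: purity with the roles of clusters and truth swapped."""
    return purity(truth, clusters)


def f_p(clusters: Clusters, truth: Clusters) -> float:
    """Fp: harmonic mean of purity and inverse purity."""
    p = purity(clusters, truth)
    ip = inverse_purity(clusters, truth)
    return 2 * p * ip / (p + ip)


# Five pages, two predicted clusters, two true namesakes.
predicted = [{1, 2, 3}, {4, 5}]
gold = [{1, 2}, {3, 4, 5}]
print(purity(predicted, gold), inverse_purity(predicted, gold), f_p(predicted, gold))
```

Purity penalizes clusters that mix several namesakes, while inverse purity penalizes splitting one namesake across clusters, so the harmonic mean rewards clusterings that avoid both errors.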
For group identification we have compared WEST with the state-of-the-art approach published in [2]. The average F-measure achieved by WEST on this dataset is 92%, which is nearly a 12% improvement over the result reported in [2].

The WEST system will be demonstrated through two applications built over the base system.

Single Person Search (illustrated in Figure 1): a user can enter a person name and context in the form of people, locations, and/or organizations associated with the person being queried. The result will be a set of clusters. Each cluster will have a set of keywords attached to indicate the main aspect of the corresponding namesake. The clusters will be presented in a ranked order based on the original ranks of the webpages in the clusters and the context keywords. Figure 2 shows sample resulting clusters for the query "Andrew McCallum". The first returned group corresponds to Andrew McCallum the UMass CS professor, the second to the president of the Australian Council of Social Services, the third to a Canadian musician, etc. The user will be able to click on the clusters and explore their contents interactively. The webpages in a cluster will be presented in a ranked order as well.

Group Search: Another interface will be used to demonstrate the Group Identification search capabilities of WEST. In the group query interface, the user can input several person names. The result will be the webpages that are related to the intended namesakes.

These applications will be demonstrated in both the online and offline modes. In the online mode, the query input by the user will be translated into a corresponding (set of) queries over Internet search engines (specifically over Google). WEST allows the user to specify the number of webpages to retrieve from the search engine, which will be disambiguated into corresponding clusters. In the online mode, WEST uses only the GraphER and EnsembleER approaches, since WebER is a server-side approach and is not amenable to realization as middleware. The demonstration will allow observers to do diverse searches (perhaps, of their own names) and perceive both the quality and the efficiency of WEST. In the offline mode, WEST will use preconstructed "canned" examples where we have already crawled the web to retrieve the search results and constructed the corresponding clusters. In the offline mode, in addition to illustrating the GraphER and EnsembleER approaches, we will also demonstrate the disambiguation power of the WebER approach.

REFERENCES

[1] J. Artiles, J. Gonzalo, and S. Sekine. The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. In SemEval, 2007.
[2] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW, 2005.
[3] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Combining entity resolution techniques with application to web people search. Under submission.
[4] D. V. Kalashnikov, Z. Chen, S. Mehrotra, and R. Nuray. Web people search via connection analysis. IEEE TKDE, 2008. To appear.
[5] D. V. Kalashnikov, S. Mehrotra, S. Chen, R. Nuray, and N. Ashish. Disambiguation algorithm for people search on the web. In ICDE, 2007.
[6] D. V. Kalashnikov, R. Nuray-Turan, and S. Mehrotra. Towards breaking the quality curse. A web-querying approach to Web People Search. In Proc. of Annual International ACM SIGIR Conference, Singapore, July 20–24, 2008.
