WEST: Modern Technologies for Web People Search

Dmitri V. Kalashnikov, Zhaoqi Chen, Rabia Nuray-Turan, Sharad Mehrotra, Zheng Zhang
Computer Science Department, University of California, Irvine

I. INTRODUCTION

In this paper we describe the WEST (Web Entity Search Technologies) system, which we have developed to improve people search over the Internet. Recently, the problem of Web People Search (WePS) has attracted significant attention from both industry and academia. In the classic formulation of the WePS problem, the user issues to a web search engine a query that consists of the name of a person of interest. For such a query, a traditional search engine such as Yahoo or Google would return webpages related to any person who happens to have the queried name. The goal of WePS, instead, is to output a set of clusters of webpages, one cluster per distinct person, each containing all of the webpages related to that person. The user can then locate the desired cluster and explore the webpages it contains.

The WePS approach offers significant advantages. For example, consider searching for a person who is a namesake of the former President Bill Clinton. The webpages of the less famous person will be overshadowed in today's search engines and will appear far down in the search results. WePS systems address this problem by first presenting to the user the set of clusters, among which the user can then select the cluster containing the webpages of the namesake of interest.

The key technology of any WePS system, including WEST, is that of Entity Resolution. In the setting of an Entity Resolution problem, a dataset contains information about objects and their interactions. The objects are referred to via (textual) descriptions/references, which might not be unique identifiers of the objects, leading to ambiguity. The task of Entity Resolution algorithms is to identify all of the references that co-refer, i.e., refer to the same real-world entity. In WePS, the webpages returned by a search engine can be viewed as references. The overall task can be viewed as that of finding the webpages that refer to the same namesake.

We have developed three different Entity Resolution algorithms that can be employed by WEST:

1) The GraphER approach extracts the Social Network (people, organizations, locations) off the webpages along with hyperlink and email information. It represents the resulting Entity-Relationship network as a graph. The approach then analyzes this graph and the webpage textual similarity to determine which webpages co-refer [4], [5]. GraphER will be covered in Section III-A.
2) The EnsembleER approach combines the results of multiple "base" ER systems to produce the overall clustering. During the training phase, EnsembleER employs supervised learning to study how well the base ER systems perform, in terms of their quality, under a variety of conditions/contexts by training a meta-level classifier. It then uses this classifier during the actual query processing to compute its final clustering [3]. EnsembleER will be covered in Section III-B.
3) The WebER approach, unlike the above two (and many other) approaches, does not limit its processing to analyzing the relevant webpages only. Instead, it leverages a powerful external data source to gain its advantage. Specifically, like GraphER, it first extracts the social network off the webpages. But then it queries the Web to collect additional information on the various components of this network [6]. WebER will be covered in Section III-C.

Each of these three algorithms has been demonstrated to outperform the current state of the art techniques on a variety of datasets [3]–[6]. The comparison includes 18 approaches that have been part of the WePS Task competition on a large dataset which is now considered to be a de facto standard for testing WePS solutions [1].

This research was supported by NSF Awards 0331707 and 0331690, and DHS Award EMW-2007-FP-02535.

WEST provides multiple interfaces to search. The input and output interfaces of WEST are illustrated in Figures 1 and 2, respectively. Naturally, WEST supports the standard WePS interface, where the user provides a person name as the query. It also supports additional functionality, where the user can specify context queries to help locate the namesake of interest quicker. The context can be specified in the form of locations, people, and/or organizations associated with the namesake of interest. Notice that the context here is not used as additional keywords to query the Web, but is used to identify the right namesake the user is looking for. This means that the webpages in the cluster do not each have to contain the context keywords, and some of them might even contain none of these additional context keywords.

Fig. 1. Input Interface of WEST.

Fig. 2. Output Interface of WEST.

Besides the UI for searching for a single individual, WEST offers a Group Search interface to support the Group Identification query capabilities. In a Group Identification task, the input is multiple names of people that are known to be related in some way. For instance, a query might be "Michael Jordan" and "Magic Johnson", implying that the meant namesakes are basketball players. The objective is to retrieve the webpages of the meant namesakes only. While the demonstration will illustrate both the single person search and group search capabilities, the subsequent discussion will focus on single person search. The algorithmic details of the Group Search can be found in [4].

The rest of this paper is organized as follows. Section II presents the steps of the overall WEST approach. Then Section III covers the three Entity Resolution algorithms. Finally, Section IV describes the functionality of WEST that will be displayed during the demo.

II. OVERALL ALGORITHM

The steps of the overall WEST approach, in the context of a middleware architecture, are illustrated in Figure 3. They include:

1) User Input. The user issues a query via the WEST input interface.
2) Top-K Retrieval. The system (middleware) sends a query consisting of a person name to a search engine, such as Google, and retrieves the top-K returned webpages. This is a standard step performed by most of the current WePS systems.
3) Pre-processing. These top-K webpages are then preprocessed. The two main pre-processing steps are:
   a) TF/IDF. Pre-processing steps for computing TF/IDF are carried out. They include stemming, stop word removal, noun phrase identification, inverted index computations, etc.
   b) Extraction. Named Entities, including people, locations, and organizations, are extracted using third-party named entity extraction software. Hyperlinks and email addresses are extracted as well. Some auxiliary data structures are built on this data.
4) Clustering. One of the three Entity Resolution algorithms is applied to the data to cluster the webpages. The algorithms will be explained in Section III.
5) Post-processing. The post-processing steps include:
   a) Cluster Sketches are computed.
   b) Cluster Rank is computed based on (a) the context keywords, if present, and (b) the original search engine's ordering of the webpages.
   c) Webpage Rank is computed to determine the relative ordering of webpages inside each cluster.
6) Visualization. The resulting clusters are presented to the user, who can explore them interactively.

Fig. 3. Overview of the WEST Processing Steps.
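
The paper does not spell out the ranking formulas of post-processing steps 5(b) and 5(c). The following is a minimal Python sketch under stated assumptions: clusters with more context-keyword matches come first, ties are broken by the best original search engine rank, and pages inside a cluster keep their original order. The function names `rank_clusters` and `rank_pages`, the data layout, and the scoring rule are all hypothetical and are not taken from WEST.

```python
def rank_clusters(clusters, context_keywords):
    """Order clusters: more context-keyword hits first, then best original rank.

    clusters: list of clusters; each cluster is a list of (rank, text) pairs,
              where rank is the page's position in the search engine results.
    The scoring rule is an assumption made for illustration only.
    """
    keywords = [k.lower() for k in context_keywords]

    def sort_key(cluster):
        text = " ".join(page_text.lower() for _, page_text in cluster)
        hits = sum(1 for k in keywords if k in text)
        best_rank = min(rank for rank, _ in cluster)
        return (-hits, best_rank)

    return sorted(clusters, key=sort_key)


def rank_pages(cluster):
    """Order pages inside a cluster by their original search engine rank (assumed rule)."""
    return sorted(cluster, key=lambda page: page[0])
```

With no context keywords, this rule falls back to ordering clusters purely by the search engine's original ranking.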

We next discuss the key component of any WePS system: the Entity Resolution algorithms.

III. ENTITY RESOLUTION ALGORITHMS

This section presents an overview of the three entity resolution algorithms used by the WEST system for clustering the webpages.

A. GraphER

To determine whether two references u and v co-refer, traditional approaches at their core analyze the similarity of the features of u and v according to some feature-based similarity function f(u, v). The GraphER approach has been developed based on the observation that many datasets are relational in nature. They contain not only objects and their features but also information about the relationships in which they participate.

Fig. 4. A General Framework for Combining Multiple Systems (an instance is passed to several base models, whose outputs feed a combining model that produces the prediction).

GraphER utilizes the information stored in these relationships to improve the disambiguation quality. The approach views the dataset being analyzed as an Entity-Relationship Graph of nodes (entities) interconnected via relationships (edges). For the WePS domain, the nodes are the named entities, hyperlinks, and emails extracted off the webpages during pre-processing, as well as the webpages themselves. The relationships are co-occurrence relationships and those derived from hyperlink and email decompositions. The graph creation procedure is discussed in detail in [4]. The entity-relationship graph in this case is a combination of the Social Network extracted from the webpages and the hyperlink graph.

To decide whether two references u and v co-refer, GraphER analyzes how strongly u and v are connected in this graph according to a connection strength measure c(u, v). To compute c(u, v), the algorithm discovers the set $P^L_{uv}$ of all L-short simple u-v paths.¹ The value of c(u, v) is computed as the sum of the connection strengths contributed by each path p in $P^L_{uv}$:

$$c(u, v) = \sum_{p \in P^L_{uv}} c(p).$$

A supervised learning procedure, formulated as a linear programming optimization task, is used to learn the c(p) function from data [4], [5]. The similarity function s(u, v) is then defined as a combination of c(u, v) and f(u, v). The output of this function is used by a correlation clustering algorithm to generate the final clusters of webpages.
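
The connection strength computation can be illustrated with a small Python sketch: enumerate all simple u-v paths of at most L edges and sum a per-path weight. The adjacency-dictionary graph layout and the function names are assumptions, and the per-path weight passed in as `path_weight` is only a placeholder, since in GraphER the c(p) function is learned from data via linear programming [4], [5].

```python
def l_short_simple_paths(graph, u, v, max_len):
    """Yield every simple u-v path that has at most max_len edges.

    graph: dict mapping a node to an iterable of neighbor nodes (assumed layout).
    """
    stack = [(u, [u])]
    while stack:
        node, path = stack.pop()
        if node == v and len(path) > 1:
            yield path          # reached the target; do not extend further
            continue
        if len(path) - 1 >= max_len:
            continue            # path already uses max_len edges
        for nxt in graph.get(node, ()):
            if nxt not in path:  # keep the path simple (no repeated nodes)
                stack.append((nxt, path + [nxt]))


def connection_strength(graph, u, v, max_len, path_weight):
    """c(u, v): sum of path_weight(p) over all L-short simple u-v paths."""
    return sum(path_weight(p) for p in l_short_simple_paths(graph, u, v, max_len))


def decay_weight(path):
    """Placeholder c(p): halves with every edge. Not the learned function of [4], [5]."""
    return 0.5 ** (len(path) - 1)
```

The exhaustive enumeration is exponential in L in the worst case, which is one reason the path length is bounded.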

¹ A path is L-short if its length does not exceed L. A path is simple if it does not contain duplicate nodes.

B. EnsembleER

The EnsembleER approach is motivated by the observation that often no single entity resolution (ER) technique always performs the best. Rather, different ER solutions perform better in different contexts. EnsembleER is a stacking-like framework that combines the clustering results of multiple base-level ER systems so that the final clustering quality is superior to that of any single base ER system. The key idea is to transform the output of the base-level ER systems, together with context, into a meta-level feature set. A supervised learning approach is utilized to train a classifier on the meta-level data. The algorithm then applies the meta-level classifier to the dataset being processed to create the final clustering results. Figure 4 shows a general framework for combining multiple systems.

Similar to the GraphER approach, EnsembleER also utilizes a graph representation of the dataset. The graph, however, is different. The nodes are the top-K webpages. Edge (u, v) between two webpages u and v is created only if a certain number of the base-level ER systems decide that u and v should be in the same cluster. Edge (u, v) represents a possibility that u and v might co-refer. With respect to this graph, the task of EnsembleER can be viewed as deciding for each edge whether u and v should be put in one cluster.
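
The edge construction can be sketched in Python as follows. The representation of a base clustering as a dictionary from page id to cluster label, the function name `candidate_edges`, and the `min_votes` parameter (standing in for "a certain number" of agreeing base systems) are assumptions made for illustration.

```python
from itertools import combinations


def candidate_edges(base_clusterings, min_votes):
    """Return {(u, v): votes} for page pairs co-clustered by at least min_votes systems.

    base_clusterings: list of dicts, one per base ER system, each mapping
                      a page id to the cluster label that system assigned.
    """
    pages = sorted(base_clusterings[0])
    edges = {}
    for u, v in combinations(pages, 2):
        votes = sum(1 for c in base_clusterings if c[u] == c[v])
        if votes >= min_votes:
            edges[(u, v)] = votes
    return edges
```

Pairs that too few base systems place together never become edges, so the later per-edge decision only has to consider plausible merges.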

Let $S_1, S_2, \ldots, S_n$ be the n base-level ER systems. For each edge $e_i = (u, v)$, each $S_j$ outputs its decision $d_{ij} \in \{0, 1\}$. Here, if u and v are placed in the same cluster by $S_j$ then $d_{ij} = 1$, otherwise $d_{ij} = 0$. Then, for each edge $e_i$ we can define a decision feature vector as $d_i = \{d_{i1}, d_{i2}, \ldots, d_{in}\}$. For edge $e_i$, its local context is also encoded as a multi-dimensional context feature vector $f_i = \{f_{i1}, f_{i2}, \ldots, f_{im}\}$.

One of the interesting aspects of the EnsembleER solution is that it creates context features in a predictive way, based on first estimating some unknown parameters of the data being processed. For instance, let $K_1, K_2, \ldots, K_n$ be the numbers of clusters that systems $S_1, S_2, \ldots, S_n$ output. One of the features used by EnsembleER is computed by applying a regression to this data to estimate the number of namesakes $K$, where the true number of namesakes $K^+$ is unknown beforehand to the algorithm. EnsembleER then converts the difference between $K$ and $K_j$ into a feature, based on the intuition that the closer $K_j$ is to $K$, the more confidence can be placed in the answer of system $S_j$.

The goal of EnsembleER reduces to finding a mapping $d_i \times f_i \rightarrow a_i$. Here, $a_i \in \{0, 1\}$ is the prediction of the combined algorithm for edge $e_i = (u, v)$, where $a_i = 1$ if the overall algorithm believes u and v belong to the same cluster, and $a_i = 0$ otherwise. The details of the Ensemble algorithm can be found in [3].
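
A toy Python sketch of the meta-level features and the mapping from features to a merge decision is given below. It rests on two loudly stated substitutions: the regression that estimates K is replaced by a plain average of the $K_j$ values, and the trained meta-level classifier is replaced by a weighted vote in which a system's weight shrinks as its $K_j$ moves away from the estimate. Neither substitution is the method of [3], and all function names are hypothetical.

```python
def decision_vector(base_clusterings, u, v):
    """d_i for edge (u, v): 1 where a base system co-clusters u and v, else 0."""
    return [1 if c[u] == c[v] else 0 for c in base_clusterings]


def cluster_count_gaps(base_clusterings):
    """Context features |K - K_j|, with K estimated by a simple mean.

    The mean is only a stand-in for the regression mentioned in the text.
    """
    counts = [len(set(c.values())) for c in base_clusterings]
    k_est = sum(counts) / len(counts)
    return [abs(k_est - k_j) for k_j in counts]


def combine(decision, gaps):
    """a_i in {0, 1}: weighted vote standing in for the trained meta-classifier."""
    weights = [1.0 / (1.0 + g) for g in gaps]
    score = sum(w * d for w, d in zip(weights, decision))
    return 1 if score >= 0.5 * sum(weights) else 0
```

In the actual system, the decision and context features of labeled training edges would be fed to a supervised learner instead of the fixed voting rule used here.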

C. WebER

The WebER approach is considerably different from most of the other WePS solutions. Unlike many other WePS systems, WebER does not limit its processing to analyzing only the information stored in the top-K returned webpages. Rather, it employs the Web as an external data source to get additional information, which ultimately leads to higher quality results.

WebER is primarily intended to be a server-side solution. That is, its code is executed at the search engine (server) side. Because of that, most of the pre-processing can be accomplished in bulk before query processing starts, including extraction and TF/IDF computations. The queries to the search engine are carried out internally without going via the Internet, thus making their processing much faster.

Let $D = \{d_1, d_2, \ldots, d_K\}$ be the set of the top-K returned webpages. WebER first merges some of the webpages into initial clusters using Named Entity (NE) clustering with a conservative threshold. The document-document similarity is computed using the TF/IDF approach with cosine similarity. Only a few webpages that have overwhelming evidence that they represent the same person are merged during this process.
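
The initial merging step can be approximated by the following Python sketch: each page is represented by the list of named entities extracted from it, TF/IDF vectors are compared with cosine similarity, and only pairs above a high threshold are merged. The exact TF/IDF weighting, the union-find merging, and the threshold value passed by the caller are assumptions; the paper only states that the threshold is conservative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def initial_clusters(docs, threshold):
    """Merge documents whose pairwise cosine similarity reaches the threshold."""
    vecs = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

A threshold close to 1 keeps this step conservative: pages that remain unmerged are left for the later, more expensive web-querying stage.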

Let $P_i$ and $O_i$ be the sets of people and organizations extracted from webpage $d_i$. For each pair of webpages $d_i$ and $d_j$ that are not yet put in the same cluster, the approach forms and issues queries to the Web to collect the co-occurrence statistics, which in this case is the number of the pages returned for a given query.

Fig. 5. The experiment results on the WePS dataset (x-axis: Systems; y-axis: Fp).

WebER uses two main types of queries:

N AND $C_i$ AND $C_j$
$C_i$ AND $C_j$

Here N is the name of the person being queried by the user, and $C_i$ and $C_j$ are the contexts of pages $d_i$ and $d_j$. Context $C_i$ can be either (a) an OR combination of people from $P_i$, or (b) an OR combination of organizations from $O_i$. The same holds for $C_j$, resulting in eight queries for each $d_i$ and $d_j$ pair. These co-occurrence counts are indicative of how often the elements of the two social networks co-occur on the web and thus how strongly they are related. These counts are then transformed into features, which are then used to compute the similarity between webpages $d_i$ and $d_j$.
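
The count of eight follows from two query types times two context choices for each of the two pages. A short Python sketch that enumerates the queries makes this explicit. The concrete query syntax (double quotes, parentheses, uppercase AND/OR) and the function names are assumptions, since the paper gives only the abstract query forms.

```python
from itertools import product


def or_group(terms):
    """Build an OR combination such as ("A" OR "B") from a list of terms."""
    return "(" + " OR ".join('"%s"' % t for t in terms) + ")"


def pair_queries(name, people_i, orgs_i, people_j, orgs_j):
    """Return the eight co-occurrence queries for one (d_i, d_j) page pair."""
    queries = []
    # C_i is either the people or the organizations of page i; likewise for C_j.
    for c_i, c_j in product([people_i, orgs_i], [people_j, orgs_j]):
        ci, cj = or_group(c_i), or_group(c_j)
        queries.append('"%s" AND %s AND %s' % (name, ci, cj))  # N AND C_i AND C_j
        queries.append("%s AND %s" % (ci, cj))                 # C_i AND C_j
    return queries
```

Each query would then be sent to the search engine, and the number of returned pages recorded as the co-occurrence count for that query.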

One of the key contributions of this work is a new Skyline-based classifier for deciding which webpages $d_i$ and $d_j$ should be merged based on the corresponding feature vector. It is a specialized classifier that we have designed specifically for the clustering problem at hand. The Skyline-based classifier gains its advantage from a variety of functionalities built into it, including:

- It takes into account the dominance that is present in the feature space.
- It also fine-tunes itself to the quality measure being used.
- It takes into account the transitivity of merges: that is, it accounts for the fact that two large clusters can be merged by a single merge decision, and, thus, one direct merge decision can lead to multiple indirect ones.

These properties allow it to easily outperform other classification methods (which are generic), such as DTC or SVM. The approach is discussed in detail in [6].
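
The notion of dominance in a feature space can be made concrete with the standard skyline definition: a vector dominates another if it is at least as large in every dimension and strictly larger in at least one. The Python sketch below shows only this building block, under the assumption that larger feature values mean stronger evidence for a merge. It is not the classifier of [6], which additionally tunes itself to the quality measure and accounts for merge transitivity.

```python
def dominates(a, b):
    """True if feature vector a dominates b (>= everywhere, > somewhere)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Intuitively, if a pair of webpages with a given feature vector should be merged, then a pair whose vector dominates it carries at least as much evidence for a merge.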

IV. DEMONSTRATION

The ER algorithms used by WEST are known to produce highly competitive results. Figure 5 presents the results of comparing WEST with 18 other WePS solutions that have been part of the WePS Task challenge [1]. The quality of clustering is evaluated in terms of the Fp measure (the harmonic mean of Purity and Inverse Purity [1]).
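
For reference, the Fp measure can be computed as in the following Python sketch. It assumes that both the system output and the gold standard are hard partitions of the same set of page ids, given as lists of sets. The function names are illustrative, and the official WePS scorer [1] may handle details such as overlapping clusters differently.

```python
def purity(clusters, classes):
    """Weighted average, over clusters, of the best overlap with any class."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & g) for g in classes) for c in clusters) / n


def fp_measure(system, gold):
    """Harmonic mean of Purity and Inverse Purity."""
    p = purity(system, gold)
    ip = purity(gold, system)  # Inverse Purity swaps the two roles
    return 2 * p * ip / (p + ip)
```

Purity alone is maximized by putting every page in its own cluster, and Inverse Purity alone by putting all pages in one cluster, which is why the two are combined.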

For group identification, we have compared WEST with the state of the art approach published in [2]. The average F-measure achieved by WEST on this dataset is 92%, which is nearly a 12% improvement over the result reported in [2].

The WEST system will be demonstrated through two applications built over the base system.

Single Person Search (illustrated in Figure 1): a user can enter a person name and context in the form of people, locations, and/or organizations associated with the person being queried. The results will be a set of clusters. Each cluster will have a set of keywords attached to indicate the main aspect of the corresponding namesake. The clusters will be presented in a ranked order based on the original ranks of the webpages in the clusters and the context keywords. Figure 2 shows sample resulting clusters for the query "Andrew McCallum". The first returned group corresponds to Andrew McCallum the UMass CS professor, the second to the president of the Australian Council of Social Services, the third to a Canadian musician, etc. The user will be able to click on the clusters and explore their webpages interactively. The webpages in a cluster will be presented in a ranked order as well.

Group Search: another interface will be used to demonstrate the Group Identification search capabilities of WEST. In the group query interface, the user can input several person names. The result will be the webpages that are related to the meant namesakes.

These applications will be demonstrated in both the online and offline modes. In the online mode, the query input by the user will be translated into a corresponding (set of) queries over Internet search engines (specifically over Google). WEST allows the user to specify the number of webpages to retrieve from the search engine, which will be disambiguated into corresponding clusters. In the online mode, WEST uses only the GraphER and EnsembleER approaches, since WebER is a server-side approach and is not amenable to realization as middleware. The demonstration will allow observers to do diverse searches (perhaps of their own names) and perceive both the quality and the efficiency of WEST.

In the offline mode, WEST will use preconstructed "canned" examples for which we have already crawled the web to retrieve the search results and constructed the corresponding clusters. In the offline mode, in addition to illustrating the GraphER and EnsembleER approaches, we will also demonstrate the disambiguation power of the WebER approach.

REFERENCES

[1] J. Artiles, J. Gonzalo, and S. Sekine. The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. In SemEval, 2007.
[2] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In WWW, 2005.
[3] Z. Chen, D. V. Kalashnikov, and S. Mehrotra. Combining entity resolution techniques with application to web people search. Under submission.
[4] D. V. Kalashnikov, Z. Chen, S. Mehrotra, and R. Nuray. Web people search via connection analysis. IEEE TKDE, 2008. To appear.
[5] D. V. Kalashnikov, S. Mehrotra, S. Chen, R. Nuray, and N. Ashish. Disambiguation algorithm for people search on the web. In ICDE, 2007.
[6] D. V. Kalashnikov, R. Nuray-Turan, and S. Mehrotra. Towards breaking the quality curse. A web-querying approach to Web People Search. In Proc. of Annual International ACM SIGIR Conference, Singapore, July 20–24, 2008.
