A Cautious Surfer for PageRank

Lan Nie, Baoning Wu, Brian D. Davison
Department of Computer Science & Engineering
Lehigh University
Bethlehem, PA 18015 USA
{lan2,baw4,davison}@cse.lehigh.edu

ABSTRACT
This work proposes a novel cautious surfer to incorporate trust into the process of calculating authority for web pages. We evaluate a total of sixty queries over two large, real-world datasets to demonstrate that incorporating trust can improve PageRank's performance.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Performance

Keywords
Web search engine, authority, trust, spam, ranking performance

1. INTRODUCTION
Traditional link analysis approaches like PageRank [5] generally assess the importance of a page based on the number and quality of pages linking to it. However, they assume that the content and links of a page can be trusted. Not only are the pages trusted, but they are trusted equally. Unfortunately, this assumption does not always hold given the adversarial nature of today's web.

To compensate, TrustRank [3] was introduced to propagate trust in the Web from a pre-labeled set of trusted pages, building on the assumption that good sites seldom point to bad sites. TrustRank's PageRank-based propagation flows trust to pages connected to the seed set, while spam sites are likely to get little trust, and are thus demoted in rank.

Unlike existing work that uses trust to identify or demote spam pages, we describe a novel approach to utilize trust estimates as hints to guide a web surfer's behavior, and demonstrate improvements in ranked retrieval. The trust estimates could come from any source, but for this work we focus on the use of TrustRank to generate trust scores.
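As a rough illustration of the seed-based propagation described above, the following minimal sketch computes TrustRank-style scores as a PageRank whose random jumps land only on the trusted seeds. The function name `trustrank`, the damping value 0.85, the iteration count, the handling of pages without out-links, and the toy graph are all assumptions made for the example, not details taken from this paper.

```python
def trustrank(out_links, seeds, alpha=0.85, iterations=50):
    """Biased PageRank: random jumps land only on trusted seed pages."""
    nodes = list(out_links)
    # Jump distribution: uniform over the seed set, zero elsewhere.
    d = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(d)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * d[n] for n in nodes}
        for k in nodes:
            targets = out_links[k]
            if targets:
                # Split the trust of k evenly among its out-links.
                share = alpha * trust[k] / len(targets)
                for j in targets:
                    nxt[j] += share
            else:
                # Assumption: a page without out-links returns its trust to the seeds.
                for s in seeds:
                    nxt[s] += alpha * trust[k] / len(seeds)
        trust = nxt
    return trust


# Hypothetical toy graph: "good" is the only seed; nothing links to "spam".
graph = {"good": ["a"], "a": ["good", "b"], "b": [], "spam": ["a"]}
scores = trustrank(graph, seeds={"good"})
```

In this toy graph the page named `spam` receives no trust at all, because trust only flows along paths that start at a seed.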
Copyright is held by the author/owner(s).
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

2. DIRECT TRUST-BASED RANKINGS
One might wonder, "Why not use TrustRank scores directly to represent authority?" As shown by Gyöngyi et al. [3] and other work of ours [6], trust-based algorithms can demote spam. Utilizing such approaches for retrieval ranking may sometimes improve search performance, especially for those "spam-specific" queries whose results would otherwise be overwhelmed by spam.

However, the goal of a search engine is to find good quality results; "spam-free" is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though those seed pages may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may receive a low score if they are not well-connected to those seeds.

In conclusion, trust cannot be equated to authority; however, trust information can assist us in calculating authority in a safer way by reducing contamination from spam. Instead of using TrustRank (or any other trust estimate) alone to calculate authority, we incorporate it into PageRank so that spam pages are penalized while highly authoritative pages (that are not otherwise known to be trustworthy) remain unharmed.
3. THE CAUTIOUS SURFER
In this section, we describe how to direct the web surfer's behavior by utilizing trust information. Unlike the random surfer described in the PageRank model, this cautious surfer carefully attempts to not let untrustworthy pages influence its behavior.

Imagine a wandering web surfer, considering what next page to visit. If the current page is trustworthy, the surfer is more likely to follow an outgoing link. In contrast, if the current page is untrustworthy, its recommendation will also be valueless or suspicious; as a result, the surfer is more likely to leave the current page and jump to a random page on the web. In addition, links may lead to targets with different trustworthiness. We bias our Cautious Surfer to favor more trustworthy pages when randomly jumping to a page.

The Cautious Surfer needs a trust estimate for each page. We assume that an estimate of a page's trustworthiness has been provided, e.g., from TrustRank.
To smooth the trust distribution, we use the rank order instead of the trust value:

$$t(j) = 1 - \frac{\mathrm{rank}(\mathrm{Trust}(j))}{N}$$

where Trust(j) represents the provided trustworthiness estimate of page j, N is the total number of pages, and rank(Trust(j)) is the rank of page j among all N pages when ordered by decreasing trust score.
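The rank-based smoothing can be sketched in a few lines. The helper name `rank_trust` and the raw scores are hypothetical. The sketch assumes that ranks start at 1 and that ties are broken by input order, neither of which the text specifies.

```python
def rank_trust(trust):
    """Map raw trust scores to t(j) = 1 - rank(Trust(j)) / N."""
    n = len(trust)
    # Order by decreasing trust score; ranks start at 1 (an assumption),
    # and pages with equal scores keep their input order.
    ordered = sorted(trust, key=trust.get, reverse=True)
    return {page: 1.0 - rank / n for rank, page in enumerate(ordered, start=1)}


raw = {"a": 0.70, "b": 0.25, "c": 0.04, "d": 0.01}  # hypothetical TrustRank output
t = rank_trust(raw)
# t == {"a": 0.75, "b": 0.5, "c": 0.25, "d": 0.0}
```

With this convention the values are spread evenly over the pages regardless of how skewed the raw scores are, and the least trusted page receives t = 0.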
In this way, a given page j's authority in our Cautious Surfer model (CS(j)) can be calculated as

$$CS(j) = t(j) \left( \sum_{k:\,k \to j} \frac{CS(k)\, t(k)}{\sum_{i:\,k \to i} t(i)} + \frac{\sum_{m \in N} \bigl(1 - t(m)\bigr)\, CS(m)}{\sum_{m \in N} t(m)} \right)$$

Label       BM2500   PageRank   TrustRank   Cautious Surfer
spam        16.67%   13.83%     12.13%      12.42%
normal      36.74%   44.37%     50.25%      49.30%
undecided    3.15%    2.96%      2.61%       2.67%
unknown     43.44%   38.84%     35.01%      35.61%

Table 1: Distribution of labels in top 10 results across 157 queries in the UK-2006 dataset.
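The Cautious Surfer model of Section 3 can be iterated on a toy graph as in the minimal sketch below. The function name `cautious_surfer`, the graph, and the trust values are hypothetical. Two assumptions go beyond the text: the random-jump term is normalized by the total trust over all pages, so that the scores keep summing to one, and a page whose out-links carry no trust (or that has no out-links) always jumps.

```python
def cautious_surfer(out_links, t, iterations=100):
    """Iterate CS(j) = t(j) * (link term + jump term) on a small graph."""
    nodes = list(out_links)
    total_t = sum(t[n] for n in nodes)  # assumed normalizer for the biased jump
    cs = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        jump = 0.0                      # authority that leaves pages by random jump
        link = {n: 0.0 for n in nodes}  # link term per target, before the t(j) factor
        for k in nodes:
            out_t = sum(t[i] for i in out_links[k])
            if out_t > 0:
                # With probability t(k) follow a link, otherwise jump.
                jump += (1.0 - t[k]) * cs[k]
                for j in out_links[k]:
                    link[j] += cs[k] * t[k] / out_t
            else:
                # Assumption: no trusted out-link means the surfer always jumps.
                jump += cs[k]
        cs = {j: t[j] * (link[j] + jump / total_t) for j in nodes}
    return cs


# Hypothetical graph and rank-based trust values.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "spam"], "spam": ["a"]}
t = {"a": 0.75, "b": 0.5, "c": 0.25, "spam": 0.0}
cs = cautious_surfer(graph, t)
```

Because every term is multiplied by t(j), a page with zero trust ends up with zero authority in this sketch, while the remaining authority is shared among the other pages.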
4. EXPERIMENTAL RESULTS
Here we report the performance of our Cautious Surfer (CS), PageRank (PR), and TrustRank (TR) on two large-scale datasets.

Experiments on UK-2006. This dataset is a crawl of the .uk domain [7] downloaded in May 2006 by Università degli Studi di Milano. There are 77M pages in this crawl from 11,392 different hosts. A labeled host list is also provided [1]. Within the list, 767 hosts are marked as spam by human judges, 7,472 hosts as normal, and 176 hosts as undecided (not clearly spam or normal). The remaining 2,977 hosts are marked as unknown (not judged). The TR and CS approaches require preselected seed sets; we report the average of five trials in which we randomly sample 10% of the labeled normal sites to form the trusted seed set. Since the labels are provided at the host level, we compute authority in the host graph.

To evaluate query-specific retrieval performance, we use a sample of 3.4M web pages (the first 400 crawled pages for each site in crawl order) from the full dataset. These pages inherit their authority score from their hosts, which is then combined with the BM2500 IR score for the final ranking. The combination is order-based, in which ranking positions based on authority score (weighted by 0.2) and IR score (weighted by 0.8) are summed together.
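The order-based combination can be sketched as follows. The function names, the URLs, and the scores are hypothetical. The sketch assumes that position 1 is the best rank and that a smaller weighted sum is better, and it makes no claim about how ties were handled in the experiments.

```python
def positions(scores):
    """Ranking position of each document, 1 = highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: pos for pos, doc in enumerate(ordered, start=1)}


def combine(authority, ir, w_auth=0.2, w_ir=0.8):
    """Order documents by the weighted sum of their two ranking positions."""
    p_auth, p_ir = positions(authority), positions(ir)
    combined = {doc: w_auth * p_auth[doc] + w_ir * p_ir[doc] for doc in ir}
    # A smaller combined position is better; ties keep dict order (an assumption).
    return sorted(combined, key=combined.get)


authority = {"u1": 0.9, "u2": 0.1, "u3": 0.5}  # hypothetical host authority scores
ir = {"u1": 2.0, "u2": 7.5, "u3": 7.0}         # hypothetical BM2500 scores
final = combine(authority, ir)
# final == ["u2", "u3", "u1"]
```

Working on positions rather than raw scores avoids having to bring authority and IR scores onto a common scale.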
We choose to focus on "hot" queries, those more likely to be of interest to search engine spammers. We selected popular queries from a 1999 Excite query log that contain at least one popular term (top 200) within the meta-keyword field from all pages within spam sites. This resulted in 157 hot queries.

Since the UK-2006 dataset is labeled, we can use the distribution of labeled sites as a measurement of ranking algorithm performance, as shown in Table 1. Since this is an automatic process without the constraints of human evaluation, we check the top 10 results for all 157 hot queries. Both TrustRank and the Cautious Surfer noticeably improve upon the BM2500 and PageRank distributions: the share of spam in the top 10 falls from 16.67% (BM2500) and 13.83% (PageRank) to 12.13% and 12.42%, respectively, while the share of normal hosts rises to about 50%. The similar distributions found between TrustRank and the Cautious Surfer (based on TrustRank calculations of trust) suggest that the Cautious Surfer is able to incorporate the spam removal value provided by the trust ranking.

We consider whether the rankings are useful for retrieval next.
We randomly selected 30 of the 157 queries for our relevance evaluation. Four members of our lab were each given queries and URLs (blind to the source ranking algorithm). For each query and URL pair, the evaluator decided the relevance using a five-level scale, which was translated into integer values from 2 to -2. We use the mean of all values of pairs generated by a ranking algorithm as score@10. If the average score for a pair is more than 0.5, it is marked as relevant. The average number of relevant URLs within the top ten results for the 30 queries is defined as precision@10.

The overall retrieval performance comparisons are shown in the left columns of Table 2. Cautious Surfer outperforms the other approaches on both precision and quality for top-10 results. Thus, we see that by incorporating estimates of trust, the Cautious Surfer is able to generate useful rankings for retrieval, and not just rankings with less spam.

                    UK-2006              WebBase
Method              Score@10   P@10      Score@10   P@10
PageRank            0.148      30.7%     0.668      55.7%
TrustRank           0.171      31.4%     0.747      59.3%
Cautious Surfer     0.180      32.4%     0.798      61.3%

Table 2: Ranking performance comparison.
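The two measures used in the relevance evaluation, score@10 and precision@10, can be sketched as below. The function name `evaluate` and the ratings are hypothetical, and the example uses four URLs per query instead of ten for brevity. The sketch assumes that a pair may carry several ratings that are averaged first, and it reports precision as the fraction of relevant URLs per query, averaged over queries.

```python
from statistics import mean


def evaluate(judgments, threshold=0.5):
    """judgments maps each query to a list of per-URL rating lists (values 2..-2)."""
    # Average rating of every (query, URL) pair.
    pair_means = [mean(r) for urls in judgments.values() for r in urls]
    score = mean(pair_means)
    # A pair is relevant when its average rating exceeds the threshold.
    per_query = [
        sum(mean(r) > threshold for r in urls) / len(urls)
        for urls in judgments.values()
    ]
    return score, mean(per_query)


# Hypothetical ratings for the top results of two queries.
judgments = {
    "q1": [[2, 1], [0, 0], [-1, -2], [1, 1]],
    "q2": [[2, 2], [1, 0], [-2, -2], [0, 0]],
}
score, precision = evaluate(judgments)
# score == 0.1875, precision == 0.375
```

Note that a pair averaging exactly 0.5, such as the second URL of `q2`, is not counted as relevant, since the text requires the average to be more than 0.5.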
Experiments on WebBase. The second dataset is a 2005 crawl from the Stanford WebBase [2]. It contains 58M pages and approximately 900M links, but no labels. To compensate, we label as good all pages in this dataset that also appear within the list of URLs referenced by the dmoz Open Directory Project. Note that these labels are page-based, so we can compute authority in the page-level graph directly. We chose 30 queries from the popular query list for evaluation of web pages in the WebBase dataset.

By testing on a second dataset, we get a better understanding of expected performance on future datasets. The WebBase dataset is of particular interest as it is a more typical graph of web pages (as compared to web hosts), and uses a much smaller seed set of good pages (just 0.17% of all pages in the dataset). The performance is shown in the right columns of Table 2. Again, the Cautious Surfer outperforms both PageRank and TrustRank (score@10 of 0.798 versus 0.668 and 0.747; precision@10 of 61.3% versus 55.7% and 59.3%), demonstrating that the approach retains its level of performance in both page-level and site-level web graphs.
5. CONCLUSION
In this paper we have described a methodology for incorporating trust into the calculation of PageRank-based authority. Additional details are available elsewhere [4]. The results on two large real-world datasets show that our Cautious Surfer model can improve search engines' ranking quality and demote web spam as well.

Acknowledgments. This work was supported in part by a grant from Microsoft Live Labs ("Accelerating Search") and the National Science Foundation under CAREER award IIS-0545875. We thank the Laboratory of Web Algorithmics, Università degli Studi di Milano and Yahoo! Research Barcelona for making the UK-2006 dataset and labels available and Stanford University for access to their WebBase collections.
6. REFERENCES
[1] C. Castillo, D. Donato, L. Becchetti, P. Boldi, M. Santini, and S. Vigna. A reference collection for web spam. ACM SIGIR Forum, 40(2), Dec. 2006.
[2] J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Transactions on Internet Technology, 6(2):153–186, 2006.
[3] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc. of the 30th Int'l Conf. on Very Large Data Bases (VLDB), pages 271–279, Toronto, Canada, Sept. 2004.
[4] L. Nie, B. Wu, and B. D. Davison. Incorporating trust into web search. Available as Technical Report LU-CSE-07-002, Dept. of Computer Science and Engineering, Lehigh University, 2007.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished draft, 1998.
[6] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proc. of Models of Trust for the Web workshop at the 15th Int'l World Wide Web Conf., Edinburgh, Scotland, May 2006.
[7] Yahoo! Research. Web collection UK-2006. http://research.yahoo.com/. Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/. URL retrieved Oct. 2006.
