A Cautious Surfer for PageRank

Lan Nie, Baoning Wu, Brian D. Davison
Department of Computer Science & Engineering
Lehigh University
Bethlehem, PA 18015 USA
{lan2,baw4,davison}@cse.lehigh.edu

ABSTRACT
This work proposes a novel cautious surfer to incorporate trust into the process of calculating authority for web pages. We evaluate a total of sixty queries over two large, real-world datasets to demonstrate that incorporating trust can improve PageRank's performance.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Performance

Keywords
Web search engine, authority, trust, spam, ranking performance

1. INTRODUCTION
Traditional link analysis approaches like PageRank [5] generally assess the importance of a page based on the number and quality of pages linking to it. However, they assume that the content and links of a page can be trusted. Not only are the pages trusted, but they are trusted equally. Unfortunately, this assumption does not always hold given the adversarial nature of today's web.

To compensate, TrustRank [3] was introduced to propagate trust in the Web from a pre-labeled set of trusted pages, building on the assumption that good sites seldom point to bad sites. TrustRank's PageRank-based propagation flows trust to pages connected to the seed set, while spam sites are likely to get little trust, and are thus demoted in rank.

Unlike existing work that uses trust to identify or demote spam pages, we describe a novel approach to utilize trust estimates as hints to guide a web surfer's behavior, and demonstrate improvements in ranked retrieval. The trust estimates could come from any source, but for this work we focus on the use of TrustRank to generate trust scores.
2. DIRECT TRUST-BASED RANKINGS
One might wonder, "why not use TrustRank scores directly to represent authority?" As shown by Gyöngyi et al. [3] and other work of ours [6], trust-based algorithms can demote spam. Utilizing such approaches for retrieval ranking may sometimes improve search performance, especially for those "spam-specific" queries whose results would otherwise be overwhelmed by spam.

However, the goal of a search engine is to find good quality results; "spam-free" is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though those seed pages may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well-connected to those seeds.

In conclusion, trust cannot be equated to authority; however, trust information can assist us in calculating authority in a safer way by reducing contamination from spam. Instead of using TrustRank (or any other trust estimate) alone to calculate authority, we incorporate it into PageRank so that spam pages are penalized while highly authoritative pages (that are not otherwise known to be trustworthy) remain unharmed.

Copyright is held by the author/owner(s). WWW 2007, May 8–12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.
3. THE CAUTIOUS SURFER
In this section, we describe how to direct the web surfer's behavior by utilizing trust information. Unlike the random surfer described in the PageRank model, this cautious surfer carefully attempts to not let untrustworthy pages influence its behavior.

Imagine a wandering web surfer, considering which page to visit next. If the current page is trustworthy, the surfer is more likely to follow an outgoing link. In contrast, if the current page is untrustworthy, its recommendation will also be valueless or suspicious; as a result, the surfer is more likely to leave the current page and jump to a random page on the web. In addition, links may lead to targets with different trustworthiness. We bias our Cautious Surfer to favor more trustworthy pages when randomly jumping to a page.
The Cautious Surfer needs a trust estimate for each page. We assume that an estimate of a page's trustworthiness has been provided, e.g., from TrustRank. To smooth the trust distribution, we use the rank order instead of the trust value:

$$t(j) = 1 - \frac{\mathrm{rank}(\mathrm{Trust}(j))}{N}$$

where Trust(j) represents the provided trustworthiness estimate of page j, N is the total number of pages, and rank(Trust(j)) is the rank of page j among all N pages when ordered by decreasing trust score.
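As an illustration of the rank-order smoothing, the following Python sketch maps raw trust scores to t(j). The function name `smoothed_trust`, the 1-based ranking, and the breaking of ties by input order are assumptions of this sketch and are not specified in the text.

```python
def smoothed_trust(trust):
    """Map raw trust scores to t(j) = 1 - rank(Trust(j)) / N.

    `trust` maps a page id to its raw trust estimate (for example a
    TrustRank score). Ranks are assumed to run from 1 (most trusted)
    to N (least trusted); ties keep their input order.
    """
    n = len(trust)
    # Order page ids by decreasing trust score.
    ordered = sorted(trust, key=lambda page: trust[page], reverse=True)
    return {page: 1.0 - rank / n for rank, page in enumerate(ordered, start=1)}


raw = {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05}
t = smoothed_trust(raw)
print(t)
```

Under the 1-based ranking assumed here, the values are evenly spaced regardless of how skewed the raw scores are, and the least trusted page receives t(j) = 0.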
In this way, a given page j's authority in our Cautious Surfer model, CS(j), can be calculated as

$$CS(j) = t(j) \left( \sum_{k:k \to j} \frac{CS(k)\,t(k)}{\sum_{i:k \to i} t(i)} + \sum_{m \in N} \frac{(1 - t(m))\,CS(m)}{t(m)} \right)$$

Label        BM2500    PageRank    TrustRank    Cautious Surfer
spam         16.67%    13.83%      12.13%       12.42%
normal       36.74%    44.37%      50.25%       49.30%
undecided     3.15%     2.96%       2.61%        2.67%
unknown      43.44%    38.84%      35.01%       35.61%
Table 1: Distribution of labels in top 10 results across 157 queries in the UK-2006 dataset.
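The surfer behavior described in this section can be sketched as a power iteration. This is a minimal illustration that follows the prose rather than a definitive implementation: from page k the surfer follows an out-link with probability t(k), picking targets in proportion to their trust, and otherwise jumps to a page chosen in proportion to trust. Normalizing the jump by the total trust, treating pages with no trusted out-link as always jumping, and the names `cautious_surfer`, `out_links`, and `iterations` are all assumptions of this sketch; it does not attempt to reproduce the normalization of the CS(j) equation exactly.

```python
def cautious_surfer(out_links, t, iterations=50):
    """Power iteration for a trust-guided surfer (illustrative sketch).

    out_links: dict mapping each page to the list of pages it links to.
    t:         dict mapping each page to a trust estimate in [0, 1].
    """
    pages = list(t)
    total_trust = sum(t.values())
    if total_trust <= 0:
        raise ValueError("at least one page needs positive trust")
    # Start from a uniform distribution over pages.
    cs = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        jump_mass = 0.0
        for k in pages:
            targets = out_links.get(k, [])
            out_trust = sum(t[i] for i in targets)
            if out_trust > 0:
                # Follow a link with probability t(k), favoring trusted targets.
                follow = cs[k] * t[k]
                for i in targets:
                    nxt[i] += follow * t[i] / out_trust
                jump_mass += cs[k] * (1.0 - t[k])
            else:
                # Assumption: with no trusted out-link, the surfer always jumps.
                jump_mass += cs[k]
        # Random jumps are biased toward trustworthy pages.
        for j in pages:
            nxt[j] += jump_mass * t[j] / total_trust
        cs = nxt
    return cs


links = {"seed": ["good"], "good": ["seed"], "spam1": ["spam2"], "spam2": ["spam1"]}
trust = {"seed": 0.75, "good": 0.5, "spam1": 0.25, "spam2": 0.0}
scores = cautious_surfer(links, trust)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy graph the two spam pages link only to each other, yet neither accumulates authority from that loop, because a surfer on an untrusted page is unlikely to follow its links.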
4. EXPERIMENTAL RESULTS
Here we report the performance of our Cautious Surfer (CS), PageRank (PR), and TrustRank (TR) on two large-scale datasets.

Experiments on UK-2006. This dataset is a crawl of the .uk domain [7] downloaded in May 2006 by Università degli Studi di Milano. There are 77M pages in this crawl from 11,392 different hosts. A labeled host list is also provided [1]. Within the list, 767 hosts are marked as spam by human judges, 7,472 hosts as normal, and 176 hosts as undecided (not clearly spam or normal). The remaining 2,977 hosts are marked as unknown (not judged).

The TR and CS approaches require preselected seed sets; we report the average of five trials in which we randomly sample 10% of the labeled normal sites to form the trusted seed set. Since the labels are provided at the host level, we compute authority in the host graph.
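The seed selection and trust propagation can be sketched as follows. This is a hedged illustration of a seed-biased PageRank in the spirit of TrustRank [3], not the exact procedure used in the experiments: the damping factor of 0.85, the iteration count, the handling of hosts without out-links, and the names `sample_seeds` and `trust_propagation` are assumptions of this sketch.

```python
import random


def sample_seeds(normal_hosts, fraction=0.10, rng=None):
    """Randomly pick a fraction of the labeled-normal hosts as trusted seeds."""
    rng = rng or random.Random()
    hosts = sorted(normal_hosts)
    size = max(1, round(fraction * len(hosts)))
    return set(rng.sample(hosts, size))


def trust_propagation(out_links, seeds, damping=0.85, iterations=50):
    """Seed-biased PageRank: random jumps land only on seed hosts."""
    if not seeds:
        raise ValueError("the seed set must not be empty")
    hosts = set(out_links) | set(seeds)
    for targets in out_links.values():
        hosts.update(targets)
    seed_share = {h: (1.0 / len(seeds) if h in seeds else 0.0) for h in hosts}
    trust = dict(seed_share)
    for _ in range(iterations):
        nxt = {h: (1.0 - damping) * seed_share[h] for h in hosts}
        for h in hosts:
            targets = out_links.get(h, [])
            if targets:
                share = damping * trust[h] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Assumption: a host without out-links returns its trust to the seeds.
                for s in seeds:
                    nxt[s] += damping * trust[h] * seed_share[s]
        trust = nxt
    return trust


# Five independent seed samples, one per trial (toy host names).
normal_hosts = [f"host{i}.example" for i in range(20)]
trials = [sample_seeds(normal_hosts, rng=random.Random(trial)) for trial in range(5)]

host_graph = {
    "seed.example": ["good.example"],
    "good.example": ["seed.example"],
    "spam.example": ["good.example"],
}
host_trust = trust_propagation(host_graph, {"seed.example"})
```

In the toy graph, the spam host links to a good host but receives no links from trusted hosts, so it receives no trust.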
To evaluate query-specific retrieval performance, we use a sample of 3.4M web pages (the first 400 crawled pages for each site in crawl order) from the full dataset. These pages inherit their authority score from their hosts, which is then combined with the BM2500 IR score for the final ranking. The combination is order-based: ranking positions based on authority score (weighted by 0.2) and IR score (weighted by 0.8) are summed together.
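A minimal sketch of the order-based combination follows. It assumes that both score dictionaries cover the same documents, that positions are 1-based, and that a lower weighted sum ranks higher; the function names `positions` and `combine` are illustrative only.

```python
def positions(scores):
    """Return 1-based ranking positions, highest score first."""
    ordered = sorted(scores, key=lambda doc: scores[doc], reverse=True)
    return {doc: pos for pos, doc in enumerate(ordered, start=1)}


def combine(authority, ir, w_authority=0.2, w_ir=0.8):
    """Rank documents by the weighted sum of their two ranking positions."""
    auth_pos = positions(authority)
    ir_pos = positions(ir)
    combined = {doc: w_authority * auth_pos[doc] + w_ir * ir_pos[doc] for doc in ir}
    # A smaller weighted position sum means a better final rank.
    return sorted(combined, key=lambda doc: combined[doc])


authority = {"a": 0.9, "b": 0.5, "c": 0.1}  # host-level authority scores
ir = {"a": 1.0, "b": 3.0, "c": 2.0}         # content-based IR scores
final = combine(authority, ir)
print(final)
```

Because only positions are combined, the result does not depend on the scale of either score, which is one reason to prefer an order-based combination when authority and IR scores are not directly comparable.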
We choose to focus on "hot" queries, those more likely to be of interest to search engine spammers. We selected popular queries from a 1999 Excite query log that contain at least one popular term (top 200) within the meta-keyword field from all pages within spam sites. This resulted in 157 hot queries.
Since the UK-2006 dataset is labeled, we can use the distribution of labeled sites as a measurement of ranking algorithm performance, as shown in Table 1. Since this is an automatic process without the constraints of human evaluation, we check the top 10 results for all 157 hot queries. Both TrustRank and the Cautious Surfer noticeably improve upon the BM2500 and PageRank distributions, reducing the share of spam results from 13.83% under PageRank to 12.13% and 12.42%, respectively. The similar distributions found between TrustRank and the Cautious Surfer (based on TrustRank calculations of trust) suggest that the Cautious Surfer is able to incorporate the spam removal value provided by the trust ranking.
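The label distribution over top-10 results can be computed automatically along the following lines. This is a sketch under assumptions: the input layout (a ranked host list per query), the treatment of hosts missing from the label list as "unknown", and the name `label_distribution` are not taken from the text.

```python
from collections import Counter


def label_distribution(results_by_query, host_label, k=10):
    """Percentage of each host label among the top-k results of all queries."""
    counts = Counter()
    for hosts in results_by_query.values():
        for host in hosts[:k]:
            # Assumption: hosts absent from the label list count as "unknown".
            counts[host_label.get(host, "unknown")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


results = {
    "cheap flights": ["h1", "h2", "h3", "h4"],
    "free music": ["h2", "h5", "h1", "h6"],
}
labels = {"h1": "normal", "h2": "spam", "h3": "normal", "h4": "undecided", "h5": "normal"}
dist = label_distribution(results, labels)
print(dist)
```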
Next, we consider whether the rankings are useful for retrieval. We randomly selected 30 of the 157 queries for our relevance evaluation. Four members of our lab were each given queries and URLs (blind to the source ranking algorithm). For each query and URL pair, the evaluator decided the relevance using a five-level scale, which was translated into integer values from 2 to -2. We use the mean of all values of pairs generated by a ranking algorithm as score@10. If the average score for a pair is more than 0.5, it is marked as relevant. The average number of relevant URLs within the top ten results for the 30 queries is defined as precision@10.

                   UK-2006                WebBase
Method             Score@10    P@10       Score@10    P@10
PageRank           0.148       30.7%      0.668       55.7%
TrustRank          0.171       31.4%      0.747       59.3%
Cautious Surfer    0.180       32.4%      0.798       61.3%
Table 2: Ranking performance comparison.
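The score@10 and precision@10 definitions can be sketched as follows. The input layout (for each query, a ranked list of URLs, each carrying the ratings it received), the possibility of several ratings per pair, and the reporting of precision as a fraction of k are assumptions of this sketch; the toy example uses k=2 to stay short.

```python
from statistics import mean


def score_and_precision(judgments, threshold=0.5, k=10):
    """Compute (score@k, precision@k) from relevance ratings in -2..2.

    judgments: dict mapping a query to a ranked list of URLs, where each
    URL is represented by the list of ratings it received.
    """
    pair_scores = []
    precisions = []
    for ratings_per_url in judgments.values():
        # Average rating of each (query, URL) pair in the top k.
        averages = [mean(ratings) for ratings in ratings_per_url[:k]]
        pair_scores.extend(averages)
        # A pair is relevant when its average rating exceeds the threshold.
        relevant = sum(1 for avg in averages if avg > threshold)
        precisions.append(relevant / k)
    return mean(pair_scores), mean(precisions)


toy = {
    "q1": [[2, 1], [0, -1]],
    "q2": [[1, 2], [-1, -1]],
}
score, precision = score_and_precision(toy, k=2)
print(score, precision)
```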
The overall retrieval performance comparisons are shown in the left columns of Table 2. Cautious Surfer outperforms the other approaches on both precision and quality for top-10 results. Thus, we see that by incorporating estimates of trust, the Cautious Surfer is able to generate useful rankings for retrieval, and not just rankings with less spam.
Experiments on WebBase. The second dataset is a 2005 crawl from the Stanford WebBase [2]. It contains 58M pages and approximately 900M links, but no labels. To compensate, we label as good all pages in this dataset that also appear within the list of URLs referenced by the dmoz Open Directory Project. Note that these labels are page-based, so we can compute authority in the page-level graph directly.

We chose 30 queries from the popular query list for evaluation of web pages in the WebBase dataset. By testing on a second dataset, we get a better understanding of expected performance on future datasets. The WebBase dataset is of particular interest as it is a more typical graph of web pages (as compared to web hosts), and uses a much smaller seed set of good pages (just 0.17% of all pages in the dataset).

The performance is shown in the right columns of Table 2. Again, the Cautious Surfer noticeably outperforms both PageRank and TrustRank, demonstrating that the approach retains its level of performance in both page-level and site-level web graphs.
5. CONCLUSION
In this paper we have described a methodology for incorporating trust into the calculation of PageRank-based authority. Additional details are available elsewhere [4]. The results on two large real-world datasets show that our Cautious Surfer model can improve search engines' ranking quality and demote web spam as well.

Acknowledgments. This work was supported in part by a grant from Microsoft Live Labs ("Accelerating Search") and the National Science Foundation under CAREER award IIS-0545875. We thank the Laboratory of Web Algorithmics, Università degli Studi di Milano and Yahoo! Research Barcelona for making the UK-2006 dataset and labels available and Stanford University for access to their WebBase collections.
6. REFERENCES
[1] C. Castillo, D. Donato, L. Becchetti, P. Boldi, M. Santini, and S. Vigna. A reference collection for web spam. ACM SIGIR Forum, 40(2), Dec. 2006.
[2] J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Transactions on Internet Technology, 6(2):153–186, 2006.
[3] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc. of the 30th Int'l Conf. on Very Large Data Bases (VLDB), pages 271–279, Toronto, Canada, Sept. 2004.
[4] L. Nie, B. Wu, and B. D. Davison. Incorporating trust into web search. Available as Technical Report LU-CSE-07-002, Dept. of Computer Science and Engineering, Lehigh University, 2007.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished draft, 1998.
[6] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proc. of Models of Trust for the Web workshop at the 15th Int'l World Wide Web Conf., Edinburgh, Scotland, May 2006.
[7] Yahoo! Research. Web collection UK-2006. http://research.yahoo.com/. Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/. URL retrieved Oct. 2006.