New versions of PageRank employing alternative Web document models

Mike Thelwall
School of Computing and Information Technology, University of Wolverhampton, 35/49 Lichfield Street, Wolverhampton WV1 1EQ, UK
m.thelwall@wlv.ac.uk

Liwen Vaughan
Faculty of Information and Media Studies, University of Western Ontario, London, Ontario, N6A 5B7, Canada
lvaughan@uwo.ca

Keywords: Web IR, PageRank, hyperlink analysis, search engines

Abstract

We introduce several new versions of PageRank (the link based Web page ranking algorithm), based upon an information science perspective on the concept of the Web document. Although the Web page is the typical indivisible unit of information in search engine results and most Web information retrieval algorithms, other research has suggested that aggregating pages based upon directories and domains gives promising alternatives, particularly when Web links are the object of study. The new algorithms introduced based upon these alternatives were used to rank four sets of Web pages. The ranking results were compared with human subjects' rankings. The results of the tests were somewhat inconclusive: the new approach worked well for the set that includes pages from different Web sites; however, it does not work well in ranking pages that are from the same site. It seems that the new algorithms may be effective for some tasks but not for others, especially when only low numbers of links are involved or the pages to be ranked are from the same site or directory.

Introduction

Commercial search engines are a key access point to the Web and have the difficult task of trying to find the most useful of the billions of Web pages for each – typically short (Spink et al., 2001) – user query entered. Probably the task is most difficult when millions of pages contain the query term(s) and these must be ordered so that the user is presented with the most likely ones. Google's PageRank (Brin and Page, 1998) was an attempt to resolve this dilemma based upon the assumptions that: (1) more useful pages will have more links to them and (2) links from well linked to pages are better indicators of quality. The continued rise of Google to its current dominant position (Sullivan, 2002) and the proliferation of other link based algorithms (e.g. Kleinberg, 1999; Crestani and Lee, 2000; Ng et al., 2001; AltaVista, 2002) seems to make an unassailable argument for the PageRank algorithm, despite the paucity of clear cut results (e.g. Hawking et al., 2000; Savoy and Picard, 2001).

Thelwall, M. & Vaughan, L. (2004). New versions of PageRank employing alternative Web document models. ASLIB Proceedings, 56(1), 24-33.

Modern Web IR algorithms are probably a highly complex mixture of different approaches, perhaps optimised using probabilistic techniques to identify the best combination (e.g. Gao et al., 2001; Xi and Fox, 2001; Tsikrika and Lalmas, 2002; Savoy and Picard, 2001). It is not possible to be definitive about commercial search engine algorithms, however, since they are kept secret apart from the broadest details. In fact academic research into Web IR is in a strange situation since research budgets and data sets could be expected to be dwarfed by those of the commercial giants, whose existence depends upon high quality results in an incredibly competitive market place. One paper that compared the two found that the academic systems were slightly better but the authors admitted that the tasks were untypical for Web users (Hawking et al., 2001a). Nevertheless, Google is one case amongst many of search algorithms gaining from approaches and developments in information science in general and bibliometrics in particular.

The alternative document models (Thelwall, 2002a) are an example of a theoretical approach from information science that may bring benefits to Web IR. The principle behind these models is that Web pages often naturally cluster into recognisable documents based upon the directory or domain that they are in. When working with links it can often make sense to utilise a directory or domain level of aggregation, especially if each individual page contains a set of identical links, perhaps in a standard navigation bar. The result of aggregation in such a case would be the removal of all duplicate links, giving a more appropriate link count. This approach has been shown to give improved academic link metrics (Thelwall, 2002a; Thelwall and Tang, 2003; Thelwall and Wilkinson, 2003; Thelwall and Harries, 2003). Further support for these models is given by their ability to cluster (sets of) Web pages in different and non-trivial ways (Thelwall, 2003). A natural question, therefore, is whether Web IR algorithms can benefit from the alternative document models.

In this paper, new versions of PageRank will be introduced using alternative document models. The effectiveness of these new ranking algorithms will be compared against that of the standard PageRank. Human ranking judgement will be used as the benchmark against which to compare different algorithms.

Versions of PageRank based on the alternative document model

PageRank was developed by the founders of Google, Sergey Brin and Lawrence Page (1998). The genius of the approach is that the algorithm is simple and intuitive, yet admits a mathematical implementation that scales to the billions of pages currently on the Web. For our purposes, since we are not modifying the mathematical algorithm of PageRank but only the document space upon which it is applied, we will describe the principle of PageRank but not the details of its implementation. The precise details of the maths and further descriptions can be found in the original PageRank paper (Brin and Page, 1998) as well as several other related papers (Haveliwala, 1999; Lifantsev, 2000; Ng et al., 2001; Thelwall, 2002b).

Essentially the approach used by PageRank can be described with a voting metaphor. At the start of the process, each Web page is allocated a vote p. For example, each page may be allocated the same value 0.1. Each page then shares a fraction of α […] PageRank were used. In contrast, a purely text-matching algorithm would have great difficulty in deciding which page containing the matching text was the most relevant.

A criticism of the original PageRank is that many pages receive a high number of links for reasons other than their quality. For example, some sites have a standard navigation bar on each page, all containing a link to the home page and a few other pages. For the site itself, this probably does serve to indicate the most useful pages, but relative to other sites the total number of pages containing the link bar will be critical to determine the final PageRank of the targeted pages, meaning that larger sites will automatically rank higher. It has also been noted that links between pages within a site are typically for navigation purposes, and therefore are less reliable as indicators of target page quality than links between sites. Moreover, navigation bars sometimes contain links to other sites and one site often contains multiple links to another for reasons that are not related to target site quality. All of these factors undermine the effectiveness of PageRank as an indicator of the quality of the page.

An additional problem is the organisation of information by site, domain or directory. For example, a site containing much high quality information may receive many links to its home page, whereas its actual content is on tens of thousands of other pages under the home page, most of which do not receive many links. A case in point for this is the Microsoft site that includes an enormous body of authoritative information spread over many pages. In theory, links to the home page will redistribute through the layers of a site to these content carrying pages, but in practice this does not work (Thelwall, 2002b) and so the content pages will not reflect the prestige of the hosting site. This is an argument for including in ranking measures an assessment of the site as a whole in addition to the individual pages. A similar argument can be made for any coherent cluster of Web pages with a recognisable home page.

Based upon the arguments made above, the claim is that PageRank can be improved by incorporating rankings of a page based upon its hosting site, domain and directory. A precise definition of document models based upon these levels of aggregation is given below (taken from Thelwall, 2002a).

Individual Web page. Each separate HTML file is treated as a document for the purposes of extracting links. Each unique URL in a link is treated as pointing to a separate document for the purposes of finding link targets. URLs are truncated before any internal target marker '#' character is found, however, to avoid multiple references to different parts of the same page.

Directory. All HTML files in the same directory are treated as a single document. All target URLs are automatically shortened to the position of the last slash, and links from different pages in the same directory are combined and duplicates eliminated.

Domain name. As above except all HTML files with the same domain name are treated as a single document for both link sources and link targets. In particular, this clusters together all pages hosted by a single subdomain of a university site.

University. As above except that all pages belonging to a university are treated as a single document for both link sources and link targets.
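
The four document models can be illustrated with a short sketch. This is not the crawler or software used in the study: the function names document_id, university_of and aggregate_links are hypothetical, URLs are assumed to be absolute (with a scheme), and a university is assumed, for simplicity, to be identified by the last two labels of the host name.

```python
from urllib.parse import urlsplit


def university_of(host):
    # Simplifying assumption: the last two labels of the host name
    # identify the university (e.g. "www.uwo.ca" -> "uwo.ca").
    return ".".join(host.split(".")[-2:])


def document_id(url, model):
    """Map an absolute URL to the document that contains it under a model."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path or "/"
    if model == "page":
        # urlsplit separates the '#fragment', so it is dropped here.
        query = "?" + parts.query if parts.query else ""
        return host + path + query
    if model == "directory":
        # Shorten the path to the position of the last slash.
        return host + path[: path.rfind("/") + 1]
    if model == "domain":
        return host
    if model == "university":
        return university_of(host)
    raise ValueError("unknown document model: " + model)


def aggregate_links(links, model, intersite_only=True):
    """Combine page-level links into document-level links without duplicates."""
    edges = set()
    for source_url, target_url in links:
        if intersite_only:
            # Skip links whose source and target belong to the same university.
            same_site = (document_id(source_url, "university")
                         == document_id(target_url, "university"))
            if same_site:
                continue
        edges.add((document_id(source_url, model),
                   document_id(target_url, model)))
    return edges
```

With the directory model, two links from pages in www.uwo.ca/sogs/ to pages in www.utoronto.ca/mba/ collapse into a single directory-to-directory link, which is the duplicate elimination described above.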

Applying PageRank to these models means allocating votes at the appropriate document level and distributing them according to links identified as above. For example, in the case of the directory-based PageRank, it would start with a vote p being allocated to each directory and then a fraction α of it being redistributed equally to all directories that are linked to by this directory. The extra bonus vote (1-α)p would also be allocated to each directory. Subsequent voting rounds would then follow the same principle.

Standard PageRank is based on the page level model described above. We introduce three new algorithms: PageRank using the directory, domain and university document models with the additional modification that only links between different sites (in our case universities) will be used. This is based upon the hypothesis that links inside a site are primarily for navigation purposes, whereas links to external sites are more reliable as indicators of target quality. The variants will be called intersite directory PageRank, intersite domain PageRank and intersite university PageRank. It would also be possible to apply PageRank to the page model after excluding internal site links, but this would not be effective since relatively few pages are targeted by other sites and so almost all pages would be ranked last.
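
The voting rounds described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name voting_pagerank is hypothetical, the defaults (α = 0.85, p = 1.0, 100 rounds) are assumptions, and a document with no outgoing links simply does not pass on its vote, which is a simplification.

```python
def voting_pagerank(edges, alpha=0.85, p=1.0, rounds=100):
    """Iterate the voting scheme over (source, target) document pairs."""
    edges = set(edges)  # duplicate links between two documents count once
    nodes = sorted({doc for edge in edges for doc in edge})
    targets = {doc: [] for doc in nodes}
    for source, target in edges:
        targets[source].append(target)

    votes = {doc: p for doc in nodes}  # every document starts with vote p
    for _ in range(rounds):
        # Each document receives the bonus vote (1 - alpha) * p ...
        new_votes = {doc: (1 - alpha) * p for doc in nodes}
        # ... plus an equal share of alpha times the vote of each document
        # that links to it.
        for source in nodes:
            if targets[source]:
                share = alpha * votes[source] / len(targets[source])
                for target in targets[source]:
                    new_votes[target] += share
        votes = new_votes
    return votes
```

For instance, with the hypothetical links {("a", "c"), ("b", "c"), ("c", "a")}, document b receives no links and so ends with only the bonus vote, while c, which is linked to by both a and b, ends with the largest vote.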

Literature Review

Web IR algorithms

Although the main task of the early search engines such as the World Wide Web Worm (Chun, 1999) was to find Web pages, the rapid growth of the Web meant that technical development quickly switched to finding the most relevant pages for user queries. This led to increasingly refined text matching techniques, such as latent semantic indexing (Deerwester et al., 1990) where the query terms do not have to be in the page for it to be retrieved, but with link based algorithms, such as Google's and Kleinberg's, the relationship between pages and those surrounding has become important. The success of link approaches has not been replicated in the computer science TREC tasks, however, perhaps due to an untypical test corpus used, or untypical tasks (Hawking et al., 2000).

Another trend is for the application of multiple techniques in a blend to obtain optimal results. For example, text matching can be combined with link algorithms and URL structure heuristics in order to identify home pages, an important task, as reflected in its inclusion in the TREC Web track. Various methods are available to identify the best weightings to use to combine these alternative techniques (e.g. Gao et al., 2001). One side-effect of this, however, is that the construction of an efficient piece of software will not lead to clear results about the usefulness of any one of the components of its algorithm. Conversely, evaluating one approach on its own, whilst yielding such results, will not yield an optimal system. One implication of this is that research into individual components can increasingly be seen as information science rather than computer science.

Other variations of PageRank

Several variations or generalisations of PageRank have been suggested. In fact its originators suggested a few modifications at the outset, including using a non-uniform pattern of initial votes so that PageRank could be personalised to the user, by giving their valued pages higher initial p values (Brin and Page, 1998). This approach can also be used to alter the PageRank results through the inclusion of another source of information about page quality. Bharat and Mihaila (2001) developed a new version of PageRank and demonstrated through user evaluations that its performance is comparable with the standard PageRank. Lifantsev (2000) developed a general theoretical model for applying variants of the PageRank technique. Haveliwala (1999) developed computing techniques to apply standard PageRank to smaller platforms. Meghabghab (2002) proposed a version based upon in and out degrees of nodes, but this did not produce improved results. Richardson and Domingos (2001) developed a combination of PageRank with content information, and probably this is what Google does already.

Search engine quality evaluation techniques

Although many measures have been used to assess the retrieval results of a search engine (e.g. Hawking et al., 2001a) the concern in this study is only with evaluating a search engine's ability to rank the pages retrieved on a particular topic. As a result, the normal questions of precision (the percentage of pages returned that are relevant to the topic) and recall (the percentage of relevant pages found on the Web) do not apply, since these are typically based upon binary decisions of relevance and not on relative merits of the pages themselves. For example, TREC type evaluations focus on whether each page does match the criteria of the search rather than on the quality of the page content.

Evaluation of ranking performance has actually been a particularly troublesome and controversial aspect of search engine research. Many papers describing advances have given anecdotal rather than formal evaluations (Brin and Page, 1998). The relevance of the documents in TREC topics is formally evaluated in batches by a group of humans (Hawking et al., 1999) but this approach has been criticised on the grounds that only a real end user of information can successfully evaluate retrieval results (Gordon and Pathak, 1999). Another approach, unavailable to most researchers, is to analyse search engine log files to mine search patterns (e.g. Spink et al., 2001). Commercial search engines probably employ a combination of evaluation methods but none are ideal because of (a) the diversity of information on the Web and (b) the difficulty of getting a group of users to evaluate a similar set of results in a way that is not artificial. As a result, any evaluation process will necessarily be a compromise but the task of the researcher is to overcome these obstacles as effectively as possible.

Research questions

The questions addressed are whether any of the following alternative versions of PageRank produces improved rankings over standard PageRank. PageRank with internal site links excluded and based upon: the domain, the directory, or the university document model.

Four sets of Web pages on four different topics were selected for the study (details of the choice of pages are below). Each set of pages was ranked by human subjects (details below). Different versions of the PageRank algorithm were used to rank each set of pages and the ranking results compared with that of human subjects. The algorithm that generates a ranking closer to the human ranking is considered to be better.

Data Collection

Subjects of the study

Subjects of the study were students enrolled on the Information Retrieval course, part of the Master of Library and Information Science degree, in the summer term of 2002 at the Faculty of Information and Media Studies, University of Western Ontario, Canada. One of the assignments of the course was to rank a set of Web pages and then compare the ranking against those generated by different search algorithms to gain an understanding of search algorithms and search engines. Twenty-four students on the course were divided randomly into four groups of six people each. Each group was given a set of Web pages on a particular topic (details below) and each student independently ranked the pages in the way that he/she thought they should be ranked in a search output. The group then met and exchanged their rankings as well as the criteria used in the ranking. Each student then did another round of the ranking based on the discussion with other group members (they could choose not to change their ranking from the first round of exercise). Students then proceeded with the other parts of the assignment that were not directly related to the study.

For the purpose of this study, student ranking results were aggregated (details in data analysis below) and used as the benchmark against which to compare ranking results from different PageRank algorithms under investigation. Based on the ethical principle of voluntary participation, students were given the choice of allowing their ranking data to be used for the study or not. All students on the course gave permission to use their data for the study.

Choice of page sets

Because all subjects in the study were Canadian graduate students, the topics of the pages to be ranked were all chosen to be related to Canadian university life so that students were knowledgeable about the subject and were competent to rank the pages. The following four topics were selected:

1. Ontario Graduate Scholarship in Science and Technology (referred to as OGS below).
2. Society of Graduate Studies at the University of Western Ontario (referred to as SOGS later).
3. Ombudsperson office at the University of Western Ontario (ombudsperson for short).
4. Admission requirements for the MBA program at the University of Toronto (MBA for short).

A set of Web pages on each topic was retrieved using three search engines (Google, AltaVista, and Teoma) and the top 10 pages retrieved by each engine were merged to form the set of pages for that particular topic. As a result, there were about 20 pages in each set to be ranked. When performing the search on the search engines, restrictions by domains were imposed to avoid the inclusion of totally irrelevant pages. For example, the search of pages on SOGS was restricted to the domain of www.uwo.ca (the university's URL) so that irrelevant pages that happened to have the word SOGS were not likely to be retrieved. The rankings of these pages by the search engines were not revealed to the subjects before they did the ranking to avoid possible bias.

Data for calculating PageRank scores

As explained above, the calculation of PageRank scores is based on the linking information among pages. Search engines such as Google use link structures among all pages in their database to calculate the PageRank scores. For the purpose of this study, a universe of pages must be defined on which to base the calculation of PageRank scores. It was decided to use all Canadian university Web pages to be such a universe because: (1) it is impossible to cover all pages on the Web for a project; (2) all pages to be ranked are about Canadian universities so the links to these pages are most likely to come from other Canadian universities; (3) it is feasible to crawl this number of pages (3,930,113 in total) and record their linking information. The underlying assumption of this data collection method is that similar results would be obtained if a full search engine database were to be used. Although this assumption is impossible to verify, it is supported by the robustness of the PageRank algorithm (Ng et al., 2001). In any case, the performance of PageRank on any conceptually coherent set of pages is of interest and appropriate.

The URLs of all Canadian universities were obtained from an online list (Association of Universities and Colleges of Canada, 2002) and the exhaustivity of the set verified and supplemented using an unrelated print media source (Johnston, 2002). The list included all full universities as well as affiliated colleges. Each university Web site was then crawled by a specialist information science Web crawler (Thelwall, 2001a) to record link information. The crawler was designed to cover sites accurately, checking for duplicate pages exhaustively. The crawler can normally only find pages by following links iteratively from the home page and so pages that were not linked to would not have been covered. Two exceptions were made, however. Firstly, some universities' home pages did not contain any HTML links and so a standard crawl would return only one page. In these cases a page of links to all departmental home pages was sought and used as an alternative starting point. Secondly, the URLs of the four sets of pages used in the study were preloaded into the crawler to ensure that they would be covered, even if no links to them had been found. Some areas were excluded on the basis of being mirror sites or huge online databases with only internal links. The crawling was conducted in the summer of 2002, shortly before the pages for the experiment were ranked by the students.

Data Analysis

As discussed in 'Data collection', each subject ranked the set of pages twice. The second round of ranking, after the group discussion, represents the final ranking decision and was thus used for data analysis. Only 9 out of 24 subjects changed their ranking from the first round and most changes were minor, involving only a few pages. The average of the six group members' rankings was taken to represent human ranking for that set of pages. Although individual students' rankings differed, they were mostly correlated with each other, which provides some assurance of the reliability of the human ranking data. The ranking generated by each PageRank algorithm was correlated with the human ranking to see which algorithm was better (i.e. closer to human ranking). The Spearman correlation coefficient test was used because the human ranking scores are obviously ordinal data.
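
The comparison step can be sketched in a few lines. This is an illustration under stated assumptions, not the statistical software used in the study: the helper names average_ranks and spearman are hypothetical, the student ranks and PageRank scores in the usage note are invented, and tied values are given the average of the ranks they span. The function returns None when either ranking is constant, since the coefficient is then undefined.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from smallest (rank 1) upwards, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return None  # all values tied: no correlation can be calculated
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return sxy / (sxx * syy) ** 0.5
```

As a usage example with invented data, take three students ranking four pages as [1, 2, 3, 4], [2, 1, 3, 4] and [1, 3, 2, 4]. The human ranking is the per-page mean, human = [mean(col) for col in zip(*student_ranks)]. Because rank 1 is best whereas a higher PageRank score is better, the scores are negated before the comparison: for scores [0.9, 0.5, 0.6, 0.1], spearman(human, [-s for s in scores]) gives 0.8.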

Results

The results of correlation tests are summarized in Table I. The four sets of pages are labelled with their acronyms (see 'Choice of page sets' above for a detailed description of the content of each set). The first column of data in Table I gives the correlation coefficients between human ranking and the ranking by the standard PageRank. The other columns show the correlation between human ranking and the ranking generated by various versions of PageRank employing alternative document models. The column labelled 'directory' represents the PageRank using the directory level document model. The columns labelled 'domain' and 'university' are for PageRanks using domain level and university level document models respectively.

Table I. Correlations between human ranking and ranking by algorithms

Page set       Standard    Intersite directory   Intersite domain   Intersite university
               PageRank    PageRank              PageRank           PageRank
OGS            -0.08       -0.06                 0.32               0.05
Ombudsperson   0.60        0.63                  N/A                N/A
MBA            0.2         -0.14                 -0.29              N/A
SOGS           0.27        N/A                   N/A                N/A

The N/A sign in Table I means that PageRank scores are the same or almost the same for all pages in the set and thus the correlation coefficient cannot be calculated.
It should be noted that the presence of so many N/A signs in Table I should not be interpreted to mean that the alternative document models would frequently not provide useful PageRank data. It is the result of the way that the pages were selected. Recall that restriction to a specific domain was necessary when forming the page set. For example, the SOGS page set was retrieved exclusively from the domain of www.uwo.ca. In fact the unique word SOGS caused the retrieved pages to all come from the same directory www.uwo.ca/sogs/. This explains why PageRank based on the directory, domain, and university level cannot provide data that distinguishes pages within this set. For this reason, this set had to be omitted from the tests of alternative document models.

Correlation coefficients that are statistically significant are shown in boldface in Table I. The standard PageRank had a significant correlation for only one out of the four sets of pages used in the study, the ombudsperson set. PageRank based on the directory level document model showed a slight improvement over the standard model.

The only page set that is appropriate to test the alternative document model is the OGS set because no restriction to a particular university's domain was imposed when forming this set (Ontario Graduate Scholarship is not restricted to a particular university). As a result, pages within this set come from different universities and the alternative document models were able to distinguish these pages well. For this set, the standard PageRank almost ranked the pages in the direction opposite to that by human subjects (the meaning of the negative correlation). PageRank based on the domain level document model shows an advantage over the standard model while the university level model showed only a very slight improvement.

Results from the MBA set came as a surprise in that the alternative document models showed a disadvantage over the standard PageRank model. It is not clear whether it is an anomalous case or whether the alternative document models are not appropriate in some cases. One possible explanation for the failure in this page set is that the PageRank scores calculated for this set are not reliable. Recall that the PageRank scores are calculated from the database that includes all Canadian university Web pages. The MBA page set is centred around the Web site of the Business School of the University of Toronto. Due to the nature of the School, there are many links to the Web site that are not from other Canadian universities. For example, a search of links to this site using the AltaVista search engine found over one hundred links from the .com domain. The PageRank calculation missed all these links and is therefore biased. This problem does not apply, or not to this extent, to other sets of test pages in the study. For example, the Web site that the ombudsperson set is centred around only has one link from the .com domain. Future studies can avoid this problem by a more careful examination of pages prior to the ranking experiment.

Discussion

The standard PageRank does not seem to be very effective in ranking Web pages in the study as shown by the fact that its rankings correlate significantly with human rankings for only one out of four sets of pages tested. Alternative approaches are needed to improve the effectiveness of PageRank. The study proposed and tested new versions of PageRank based on alternative document models. Although the results from the study do not provide clear evidence that the alternative models are better, it showed that these models have some promise. In fact, the results from the OGS page set, the only set that is appropriate to test all the alternative document models, showed a substantial advantage of the intersite domain PageRank over the standard PageRank.

One fact has emerged clearly from this research: that it is difficult to assess the quality of Web ranking algorithms, especially those involving links, and especially for researchers that do not have access to a crawl of a sizeable percentage of the Web. A full scientific evaluation would involve huge human and computing resources: ideally a random selection of queries with results ranked by a representative set of users for whom the queries represented real information requests. In order to be able to choose queries at random, access to a major search engine server log and its database for calculating the ranking scores would be needed. The TREC approach (trec.nist.gov, Hawking et al., 2001b) to resolving a similar problem is a sensible one: to have a centrally organised and rated collection of pages that are shared for algorithm testing purposes by participating researchers. However, this does not yet satisfy our need because those pages are assigned a binary relevance score but not ranked by degree of relevance. For the reasons discussed above, the ranking task would be likely to be more complex and involve more and more difficult assessments than the currently employed binary relevance judgements.

Our compromise was to choose a small set of four queries that were relevant to a fixed group of end users and belonged to a coherent subset of the Web that could be crawled and assumed to be sufficiently large (3,930,113 pages) for ranking the page sets chosen. This would not be a problem if information needs, link creation and information distribution were known to be highly uniform and predictable on the Web, i.e. if the choice of topic for each set were known not to influence the effectiveness of a ranking algorithm, but we believe that this is not the case. On a large scale, link patterns appear to be reasonably predictable in some contexts (Thelwall, 2001b, 2002a) and over a large number of pages it seems intuitively clear that those with, say, three links to them would be, on average, slightly better quality than those with only two. Nevertheless, links are still typically created by individuals in an unsystematic fashion and not subject to any kind of quality control. As a result it is difficult to claim that three links to a page are likely to consistently indicate better target page quality content than two. This is more evident if it is acknowledged that factors other than quality can influence link counts, including target page age. As a result, any given link-based ranking algorithm is likely to be effective for some topics but ineffective for others. Moreover, with the low numbers of links likely to be involved in pages for some topics, it seems likely that even the most effective algorithm would regularly fail for a significant proportion of search topics. Therefore, it is probably not surprising that the proposed new algorithm in this study does not work well for all the search topics in the experiment. Future research in this area should design a wider range of search queries and avoid problems encountered in this study.

In summary, it seems that only researchers working for, or in conjunction with, a major search engine would be capable of fully assessing new Web ranking algorithms, and others will remain forced to extrapolate from the tests that they are able to run. The most promise for academic researchers probably lies with centralised initiatives such as TREC, although, as can be seen above, the choice of topics can impact on algorithms in different ways, depending on the details of their workings.

Conclusions

Although the study did not succeed in providing a definite answer to the research questions examined, it provided some evidence that the alternative PageRank algorithms proposed could have the potential to improve the standard PageRank model. The study succeeded in testing Web IR algorithms using an empirical study involving human subjects, a direction that was not followed by many previous studies. The ultimate value of any Web IR algorithm lies in its ability to serve human needs and thus the best way to test them is to see if they match those needs. Future research with alternative document model based ranking algorithms should keep the human ranking approach of the study but design a range of test queries that all involve pages from different Web sites.

Acknowledgement

We gratefully thank all students who participated in the study by giving permission for us to use their ranking data. The study would have been impossible without their support.

References

AltaVista (2002), AltaVista advanced search tutorial – link popularity, available at: help.altavista.com/adv_search/ast_haw_popularity (accessed 6 September 2002).

Association of Universities and Colleges of Canada (2002), The Directory of Canadian Universities – University Web sites, available at: www.aucc.ca/english/dcu/universities/universitysites.html (accessed 24 April 2002).

Bharat, K. and Mihaila, G.A. (2001), "When experts agree: using non-affiliated experts to rank popular topics", in Tenth International World Wide Web Conference, available at: www.www10.org/cdrom/papers/474/index.html

Brin, S. and Page, L. (1998), "The anatomy of a large scale hypertextual web search engine", Computer Networks and ISDN Systems, Vol. 30 No. 1-7, pp. 107-117, available at: citeseer.nj.nec.com/brin98anatomy.html

Chun, T.Y. (1999), "World Wide Web robots: an overview", Online & CD-ROM Review, Vol. 23 No. 3, pp. 135-142.

Crestani, F. and Lee, P.L. (2000), "Searching the Web by constrained spreading activation", Information Processing and Management, Vol. 36 No. 4, pp. 585-605.

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. (1990), "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Vol. 41 No. 6, pp. 391-407.

Gao, J., Walker, S., Robertson, S., Cao, G., He, H., Zhang, M. and Nie, J-Y (2001), "TREC-10 Web Track Experiments at MSRA", TREC 2001, pp. 384-392, available at: trec.nist.gov/pubs/trec10/t10_proceedings.html

Gordon, M. and Pathak, P. (1999), "Finding information on the World Wide Web: the retrieval effectiveness of search engines", Information Processing & Management, Vol. 35, pp. 141-180.

Haveliwala, T. (1999), "Efficient computation of PageRank", Stanford University Technical Report, available at: dbpubs.stanford.edu:8090/pub/1999-31

Hawking, D., Bailey, P. and Craswell, N. (2000), "ACSys TREC-8 experiments", in Voorhees, E. and Harman, D. (Eds), Information Technology: Eighth Text Retrieval Conference (TREC-8), NIST, Gaithersburg, MD, USA, pp. 307-315.

Hawking, D., Craswell, N., Bailey, P. and Griffiths, K. (2001a), "Measuring search engine quality", Information Retrieval, Vol. 4 No. 1, pp. 33-59.

Hawking, D., Craswell, N., Thistlewaite, P. and Harman, D. (1999), "Results and challenges in Web search evaluation", 8th International World Wide Web Conference, available at: www8.org/w8-papers/2c-search-discover/results/results.html

Hawking, D., Craswell, N., Thistlewaite, P. and Harman, D. (2001b), "Results and challenges in Web search evaluation", Computer Networks, Vol. 31 No. 11-16, pp. 1321-1330, available at: www8.org/w8-papers/2c-search-discover/results/results.html

Johnston, A.D. (Ed.) (2002), The Maclean's Guide to Canadian Universities 2002, Rogers Publishing, Toronto, Canada.

Kleinberg, J. (1999), "Authoritative sources in a hyperlinked environment", Journal of the ACM, Vol. 46 No. 5, pp. 604-632.

Lifantsev, M. (2000), "Voting model for ranking Web pages", in Graham, P. and Maheswaran, M. (Eds), Proceedings of the International Conference on Internet Computing, CSREA Press, Las Vegas, Nevada, USA, pp. 143-148.

Meghabghab, G. (2002), "Google's Web page ranking applied to different topological Web graph structures", Journal of the American Society for Information Science and Technology, Vol. 52 No. 9, pp. 736-747.

Ng, A.Y., Zheng, A.X. and Jordan, M.I. (2001), "Stable algorithms for link analysis", in Croft, W., Harper, D., Kraft, D. and Zobel, J. (Eds), Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), ACM Press, New York, pp. 258-266.

Richardson, M. and Domingos, P. (2001), "The intelligent surfer: probabilistic combination of link and content information in PageRank", poster at Neural Information Processing Systems: Natural and Synthetic 2001, available at: www.cs.washington.edu/homes/mattr/doc/NIPS2001/qd-pagerank.pdf

Savoy, J. and Picard, J. (2001), "Retrieval effectiveness on the Web", Information Processing and Management, Vol. 37 No. 4, pp. 543-569.

Spink, A., Wolfram, D., Jansen, B.J. and Saracevic, T. (2001), "Searching the Web: the public and their queries", Journal of the American Society for Information Science and Technology, Vol. 52 No. 3, pp. 226-234.

Sullivan, D. (2002), "Google tops in 'search hours' ratings", Search Engine Watch, available at: searchenginewatch.com/sereport/02/05-ratings.html (accessed 6 September 2002).

Thelwall, M. (2001a), "A web crawler design for data mining", Journal of Information Science, Vol. 27 No. 5, pp. 319-325.

Thelwall, M. (2001b), "Extracting macroscopic information from Web links", Journal of the American Society for Information Science and Technology, Vol. 52 No. 13, pp. 1157-1168.

Thelwall, M. (2002a), "Conceptualizing documentation on the Web: an evaluation of different heuristic-based models for counting links between university Web sites", Journal of the American Society for Information Science and Technology, Vol. 53 No. 12, pp. 995-1005.

Thelwall, M. (2002b), "Subject gateway sites and search engine ranking", Online Information Review, Vol. 26 No. 2, pp. 101-107.

Thelwall, M. (2003), "A layered approach for investigating the topological structure of communities in the Web", Journal of Documentation, Vol. 59 No. 4, pp. 410-429.

Thelwall, M. and Harries, G. (2003), "The connection between the research of a university and counts of links to its Web pages: an investigation based upon a classification of the relationships of pages to the research of the host university", Journal of the American Society for Information Science and Technology, Vol. 54 No. 7, pp. 594-602.

Thelwall, M. and Tang, R. (2003), "Disciplinary and linguistic considerations for academic Web linking: an exploratory hyperlink mediated study with Mainland China and Taiwan", Scientometrics, Vol. 58 No. 1, pp. 153-179.

Thelwall, M. and Wilkinson, D. (2003), "Three target document range metrics for university Web sites", Journal of the American Society for Information Science and Technology, Vol. 54 No. 6, pp. 489-496.

Tsikrika, T. and Lalmas, M. (2002), "Combining Web document representations in a Bayesian Inference Network model using link and content-based evidence", in Proceedings of 24th European Colloquium on Information Retrieval Research (ECIR 2002), Glasgow, Scotland, pp. 53-72.

Xi, W. and Fox, E.A. (2001), "Machine Learning Approach for Homepage Finding Task", TREC 2001, pp. 686-697, available at: trec.nist.gov/pubs/trec10/t10_proceedings.html