New versions of PageRank employing alternative Web document models¹

Mike Thelwall
School of Computing and Information Technology, University of Wolverhampton, 35/49 Lichfield Street, Wolverhampton WV1 1EQ, UK
m.thelwall@wlv.ac.uk

Liwen Vaughan
Faculty of Information and Media Studies, University of Western Ontario, London, Ontario, N6A 5B7, Canada
lvaughan@uwo.ca

Keywords: Web IR, PageRank, hyperlink analysis, search engines

Abstract
We introduce several new versions of PageRank (the link based Web page ranking algorithm), based upon an information science perspective on the concept of the Web document. Although the Web page is the typical indivisible unit of information in search engine results and most Web information retrieval algorithms, other research has suggested that aggregating pages based upon directories and domains gives promising alternatives, particularly when Web links are the object of study. The new algorithms introduced based upon these alternatives were used to rank four sets of Web pages. The ranking results were compared with human subjects' rankings. The results of the tests were somewhat inconclusive: the new approach worked well for the set that included pages from different Web sites; however, it did not work well in ranking pages from the same site. It seems that the new algorithms may be effective for some tasks but not for others, especially when only low numbers of links are involved or the pages to be ranked are from the same site or directory.

Introduction
Commercial search engines are a key access point to the Web and have the difficult task of trying to find the most useful of the billions of Web pages for each user query entered, which is typically short (Spink et al., 2001). The task is probably most difficult when millions of pages contain the query term(s) and these must be ordered so that the user is presented with the most likely ones. Google's PageRank (Brin and Page, 1998) was an attempt to resolve this dilemma based upon the assumptions that: (1) more useful pages will have more links to them and (2) links from well linked to pages are better indicators of quality. The continued rise of Google to its current dominant position (Sullivan, 2002) and the proliferation of other link based algorithms (e.g. Kleinberg, 1999; Crestani and Lee, 2000; Ng et al., 2001; AltaVista, 2002) seem to make an unassailable argument for the PageRank algorithm, despite the paucity of clear cut results (e.g. Hawking et al., 2000; Savoy and Picard, 2001).

¹ Thelwall, M. & Vaughan, L. (2004). New versions of PageRank employing alternative Web document models. ASLIB Proceedings, 56(1), 24-33.

Modern Web IR algorithms are probably a highly complex mixture of different approaches, perhaps optimised using probabilistic techniques to identify the best combination (e.g. Gao et al., 2001; Xi and Fox, 2001; Tsikrika and Lalmas, 2002; Savoy and Picard, 2001). It is not possible to be definitive about commercial search engine algorithms, however, since they are kept secret apart from the broadest details. In fact academic research into Web IR is in a strange situation since research budgets and data sets could be expected to be dwarfed by those of the commercial giants, whose existence depends upon high quality results in an incredibly competitive market place. One paper that compared the two found that the academic systems were slightly better but the authors admitted that the tasks were untypical for Web users (Hawking et al., 2001a). Nevertheless, Google is one case amongst many of search algorithms gaining from approaches and developments in information science in general and bibliometrics in particular.

The alternative document models (Thelwall, 2002a) are an example of a theoretical approach from information science that may bring benefits to Web IR. The principle behind these models is that Web pages often naturally cluster into recognisable documents based upon the directory or domain that they are in. When working with links it can often make sense to utilise a directory or domain level of aggregation, especially if each individual page contains a set of identical links, perhaps in a standard navigation bar. The result of aggregation in such a case would be the removal of all duplicate links, giving a more appropriate link count. This approach has been shown to give improved academic link metrics (Thelwall, 2002a; Thelwall and Tang, 2003; Thelwall and Wilkinson, 2003; Thelwall and Harries, 2003). Further support for these models is given by their ability to cluster (sets of) Web pages in different and non-trivial ways (Thelwall, 2003).

A natural question, therefore, is whether Web IR algorithms can benefit from the alternative document models. In this paper, new versions of PageRank will be introduced using alternative document models. The effectiveness of these new ranking algorithms will be compared against that of the standard PageRank. Human ranking judgement will be used as the benchmark against which to compare different algorithms.

Versions of PageRank based on the alternative document model
PageRank was developed by the founders of Google, Sergey Brin and Lawrence Page (1998). The genius of the approach is that the algorithm is simple and intuitive, yet admits a mathematical implementation that scales to the billions of pages currently on the Web. For our purposes, since we are not modifying the mathematical algorithm of PageRank but only the document space upon which it is applied, we will describe the principle of PageRank but not the details of its implementation. The precise details of the maths and further descriptions can be found in the original PageRank paper (Brin and Page, 1998) as well as several other related papers (Haveliwala, 1999; Lifantsev, 2000; Ng et al., 2001; Thelwall, 2002b).

Essentially the approach used by PageRank can be described with a voting metaphor. At the start of the process, each Web page is allocated a vote p. For example, each page may be allocated the same value 0.1. Each page then shares a fraction of α ... PageRank were used. In contrast, a purely text-matching algorithm would have great difficulty in deciding which page containing the matching text was the most relevant.

A criticism of the original PageRank is that many pages receive a high number of links for reasons other than their quality. For example, some sites have a standard navigation bar on each page, all containing a link to the home page and a few other pages. For the site itself, this probably does serve to indicate the most useful pages, but relative to other sites the total number of pages containing the link bar will be critical in determining the final PageRank of the targeted pages, meaning that larger sites will automatically rank higher. It has also been noted that links between pages within a site are typically for navigation purposes, and therefore are less reliable as indicators of target page quality than links between sites. Moreover, navigation bars sometimes contain links to other sites and one site often contains multiple links to another for reasons that are not related to target site quality. All of these factors undermine the effectiveness of PageRank as an indicator of the quality of the page.

An additional problem is the organisation of information by site, domain or directory. For example, a site containing much high quality information may receive many links to its home page, whereas its actual content is on tens of thousands of other pages under the home page, most of which do not receive many links. A case in point for this is the Microsoft site that includes an enormous body of authoritative information spread over many pages. In theory, links to the home page will redistribute through the layers of a site to these content carrying pages, but in practice this does not work (Thelwall, 2002b) and so the content pages will not reflect the prestige of the hosting site. This is an argument for including in ranking measures an assessment of the site as a whole in addition to the individual pages. A similar argument can be made for any coherent cluster of Web pages with a recognisable home page.

Based upon the arguments made above, the claim is that PageRank can be improved by incorporating rankings of a page based upon its hosting site, domain and directory. A precise definition of document models based upon these levels of aggregation is given below (taken from Thelwall, 2002a).

Individual Web page. Each separate HTML file is treated as a document for the purposes of extracting links. Each unique URL in a link is treated as pointing to a separate document for the purposes of finding link targets. URLs are truncated before any internal target marker '#' character is found, however, to avoid multiple references to different parts of the same page.

Directory. All HTML files in the same directory are treated as a single document. All target URLs are automatically shortened to the position of the last slash, and links from different pages in the same directory are combined and duplicates eliminated.

Domain name. As above except all HTML files with the same domain name are treated as a single document for both link sources and link targets. In particular, this clusters together all pages hosted by a single subdomain of a university site.

University. As above except that all pages belonging to a university are treated as a single document for both link sources and link targets.

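As an illustration of the four document models above, the following Python sketch maps a URL to the document it belongs to under each model and then aggregates links, eliminating duplicates. It is a minimal sketch and not the crawler software used in the study: all function names are hypothetical, and the university model is approximated by the assumption that a university owns every host sharing its last two domain labels (e.g. uwo.ca), which will not hold for every institution.

```python
from urllib.parse import urldefrag, urlsplit


def page_doc(url):
    """Individual page model: the URL with any '#fragment' removed."""
    return urldefrag(url).url


def directory_doc(url):
    """Directory model: host plus the path shortened to its last slash."""
    parts = urlsplit(page_doc(url))
    path = parts.path[: parts.path.rfind("/") + 1] or "/"
    return f"{parts.netloc.lower()}{path}"


def domain_doc(url):
    """Domain model: the full domain name of the URL."""
    return urlsplit(url).netloc.lower()


def university_doc(url, suffix_labels=2):
    """University model.

    ASSUMPTION (not from the paper): a university is identified by the
    last `suffix_labels` labels of the domain name, e.g. 'uwo.ca'.
    """
    labels = domain_doc(url).split(".")
    return ".".join(labels[-suffix_labels:])


def aggregate_links(links, model):
    """Map (source URL, target URL) pairs to document pairs.

    Returning a set combines links from pages in the same document
    and eliminates the duplicates.
    """
    return {(model(source), model(target)) for source, target in links}
```

Two pages in the same directory that both link to one target page count as two links under the page model but as a single link under the directory model, which is the duplicate removal the models are designed to achieve.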
Applying PageRank to these models means allocating votes at the appropriate document level and distributing them according to links identified as above. For example, in the case of the directory-based PageRank, it would start with a vote p being allocated to each directory and then a fraction α of it being redistributed equally to all directories that are linked to by this directory. The extra bonus vote (1-α)p would also be allocated to each directory. Subsequent voting rounds would then follow the same principle.

Standard PageRank is based on the page level model described above. We introduce three new algorithms: PageRank using the directory, domain and university document models with the additional modification that only links between different sites (in our case universities) will be used. This is based upon the hypothesis that links inside a site are primarily for navigation purposes, whereas links to external sites are more reliable as indicators of target quality. The variants will be called intersite directory PageRank, intersite domain PageRank and intersite university PageRank. It would also be possible to apply PageRank to the page model after excluding internal site links, but this would not be effective since relatively few pages are targeted by other sites and so almost all pages would be ranked last.

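The voting rounds and the intersite restriction described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function names are hypothetical, the defaults alpha=0.85, p=1.0 and rounds=50 are assumed values that the text does not specify, and a document with no outgoing links simply loses the share of its vote that it would otherwise have passed on.

```python
def pagerank_votes(links, alpha=0.85, p=1.0, rounds=50):
    """Run voting rounds over a document graph.

    links: iterable of (source_doc, target_doc) pairs at any document
    level (page, directory, domain or university).
    alpha, p, rounds: ASSUMED defaults, not values from the paper.
    """
    out_links = {}
    docs = set()
    for source, target in set(links):  # duplicate links count once
        docs.update((source, target))
        out_links.setdefault(source, set()).add(target)

    votes = {doc: p for doc in docs}  # initial vote p for every document
    for _ in range(rounds):
        # Every document receives the bonus vote (1 - alpha) * p ...
        new_votes = {doc: (1 - alpha) * p for doc in docs}
        # ... plus an equal share of a fraction alpha of each linker's vote.
        for source, targets in out_links.items():
            share = alpha * votes[source] / len(targets)
            for target in targets:
                new_votes[target] += share
        votes = new_votes
    return votes


def intersite_links(doc_links, site_of):
    """Keep only links whose source and target belong to different sites.

    site_of: hypothetical function mapping a document to its site
    (a university, in the setting of the study).
    """
    return {(s, t) for s, t in doc_links if site_of(s) != site_of(t)}
```

For an intersite variant, the document-level links would first be passed through intersite_links and the result given to pagerank_votes; documents that appear in no remaining link receive no score at all, which corresponds to the situation in which a correlation cannot be calculated.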
Literature Review

Web IR algorithms
Although the main task of the early search engines such as the World Wide Web Worm (Chun, 1999) was to find Web pages, the rapid growth of the Web meant that technical development quickly switched to finding the most relevant pages for user queries. This led to increasingly refined text matching techniques, such as latent semantic indexing (Deerwester et al., 1990) where the query terms do not have to be in the page for it to be retrieved, but with link based algorithms, such as Google's and Kleinberg's, the relationship between pages and those surrounding them has become important. The success of link approaches has not been replicated in the computer science TREC tasks, however, perhaps due to an untypical test corpus used, or untypical tasks (Hawking et al., 2000).

Another trend is for the application of multiple techniques in a blend to obtain optimal results. For example, text matching can be combined with link algorithms and URL structure heuristics in order to identify home pages, an important task, as reflected in its inclusion in the TREC Web track. Various methods are available to identify the best weightings to use to combine these alternative techniques (e.g. Gao et al., 2001). One side-effect of this, however, is that the construction of an efficient piece of software will not lead to clear results about the usefulness of any one of the components of its algorithm. Conversely, evaluating one approach on its own, whilst yielding such results, will not yield an optimal system. One implication of this is that research into individual components can increasingly be seen as information science rather than computer science.

Other variations of PageRank
Several variations or generalisations of PageRank have been suggested. In fact its originators suggested a few modifications at the outset, including using a non-uniform pattern of initial votes so that PageRank could be personalised to the user, by giving their valued pages higher initial p values (Brin and Page, 1998). This approach can also be used to alter the PageRank results through the inclusion of another source of information about page quality. Bharat and Mihaila (2001) developed a new version of PageRank and demonstrated through user evaluations that its performance is comparable with the standard PageRank. Lifantsev (2000) developed a general theoretical model for applying variants of the PageRank technique. Haveliwala (1999) developed computing techniques to apply standard PageRank to smaller platforms. Meghabghab (2002) proposed a version based upon in and out degrees of nodes, but this did not produce improved results. Richardson and Domingos (2001) developed a combination of PageRank with content information, and probably this is what Google does already.

Search engine quality evaluation techniques
Although many measures have been used to assess the retrieval results of a search engine (e.g. Hawking et al., 2001a) the concern in this study is only with evaluating a search engine's ability to rank the pages retrieved on a particular topic. As a result, the normal questions of precision (the percentage of pages returned that are relevant to the topic) and recall (the percentage of relevant pages found on the Web) do not apply, since these are typically based upon binary decisions of relevance and not on relative merits of the pages themselves. For example, TREC type evaluations focus on whether each page does match the criteria of the search rather than on the quality of the page content.

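To illustrate why precision and recall say nothing about ranking, the short Python sketch below computes both from binary relevance judgements. The function name and the example page identifiers are hypothetical; the point is only that the measures depend on which pages were retrieved and not on the order in which they are presented.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall from binary relevance judgements.

    retrieved: pages returned by the search, in any order.
    relevant: pages judged relevant to the topic.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Two orderings of the same results are indistinguishable to these measures.
relevant_pages = {"a", "c", "e"}
best_first = precision_recall(["a", "c", "b", "d"], relevant_pages)
worst_first = precision_recall(["d", "b", "c", "a"], relevant_pages)
```

Both calls return the same pair of values, so a measure that compares whole rankings, such as a rank correlation with human judgements, is needed for the question studied here.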
Evaluation of ranking performance has actually been a particularly troublesome and controversial aspect of search engine research. Many papers describing advances have given anecdotal rather than formal evaluations (Brin and Page, 1998). The relevance of the documents in TREC topics is formally evaluated in batches by a group of humans (Hawking et al., 1999) but this approach has been criticised on the grounds that only a real end user of information can successfully evaluate retrieval results (Gordon and Pathak, 1999). Another approach, unavailable to most researchers, is to analyse search engine log files to mine search patterns (e.g. Spink et al., 2001). Commercial search engines probably employ a combination of evaluation methods but none are ideal because of (a) the diversity of information on the Web and (b) the difficulty of getting a group of users to evaluate a similar set of results in a way that is not artificial. As a result, any evaluation process will necessarily be a compromise but the task of the researcher is to overcome these obstacles as effectively as possible.

Research questions
The questions addressed are whether any of the following alternative versions of PageRank produces improved rankings over standard PageRank: PageRank with internal site links excluded and based upon the domain, the directory, or the university document model.

Four sets of Web pages on four different topics were selected for the study (details of the choice of pages are below). Each set of pages was ranked by human subjects (details below). Different versions of the PageRank algorithm were used to rank each set of pages and the ranking results were compared with those of the human subjects. The algorithm that generates a ranking closer to the human ranking is considered to be better.

Data Collection

Subjects of the study
Subjects of the study were students enrolled on the Information Retrieval course, part of the Master of Library and Information Science degree, in the summer term of 2002 at the Faculty of Information and Media Studies, University of Western Ontario, Canada. One of the assignments of the course was to rank a set of Web pages and then compare the ranking against those generated by different search algorithms to gain an understanding of search algorithms and search engines. Twenty-four students on the course were divided randomly into four groups of six people each. Each group was given a set of Web pages on a particular topic (details below) and each student independently ranked the pages in the way that he/she thought they should be ranked in a search output. The group then met and exchanged their rankings as well as the criteria used in the ranking. Each student then did another round of ranking based on the discussion with other group members (they could choose not to change their ranking from the first round). Students then proceeded with the other parts of the assignment that were not directly related to the study.

For the purpose of this study, student ranking results were aggregated (details in data analysis below) and used as the benchmark against which to compare ranking results from the different PageRank algorithms under investigation. Based on the ethical principle of voluntary participation, students were given the choice of allowing their ranking data to be used for the study or not. All students on the course gave permission to use their data for the study.

Choice of page sets
Because all subjects in the study were Canadian graduate students, the topics of the pages to be ranked were all chosen to be related to Canadian university life so that students were knowledgeable about the subject and were competent to rank the pages. The following four topics were selected:
1. Ontario Graduate Scholarship in Science and Technology (referred to as OGS below).
2. Society of Graduate Studies at the University of Western Ontario (referred to as SOGS below).
3. Ombudsperson office at the University of Western Ontario (ombudsperson for short).
4. Admission requirements for the MBA program at the University of Toronto (MBA for short).

A set of Web pages on each topic was retrieved using three search engines (Google, AltaVista, and Teoma) and the top 10 pages retrieved by each engine were merged to form the set of pages for that particular topic. As a result, there were about 20 pages in each set to be ranked. When performing the search on the search engines, restrictions by domains were imposed to avoid the inclusion of totally irrelevant pages. For example, the search of pages on SOGS was restricted to the domain of www.uwo.ca (the university's URL) so that irrelevant pages that happened to have the word SOGS were not likely to be retrieved. The rankings of these pages by the search engines were not revealed to the subjects before they did the ranking to avoid possible bias.

Data for calculating PageRank scores
As explained above, the calculation of PageRank scores is based on the linking information among pages. Search engines such as Google use link structures among all pages in their database to calculate the PageRank scores. For the purpose of this study, a universe of pages must be defined on which to base the calculation of PageRank scores. It was decided to use all Canadian university Web pages as this universe because: (1) it is impossible to cover all pages on the Web for a project; (2) all pages to be ranked are about Canadian universities so the links to these pages are most likely to come from other Canadian universities; (3) it is feasible to crawl this number of pages (3,930,113 in total) and record their linking information. The underlying assumption of this data collection method is that similar results would be obtained if a full search engine database were to be used. Although this assumption is impossible to verify, it is supported by the robustness of the PageRank algorithm (Ng et al., 2001). In any case, the performance of PageRank on any conceptually coherent set of pages is of interest and appropriate.

The URLs of all Canadian universities were obtained from an online list (Association of Universities and Colleges of Canada, 2002) and the exhaustivity of the set was verified and supplemented using an unrelated print media source (Johnston, 2002). The list included all full universities as well as affiliated colleges. Each university Web site was then crawled by a specialist information science Web crawler (Thelwall, 2001a) to record link information. The crawler was designed to cover sites accurately, checking for duplicate pages exhaustively. The crawler can normally only find pages by following links iteratively from the home page and so pages that were not linked to would not have been covered. Two exceptions were made, however. Firstly, some universities' home pages did not contain any HTML links and so a standard crawl would return only one page. In these cases a page of links to all departmental home pages was sought and used as an alternative starting point. Secondly, the URLs of the four sets of pages used in the study were preloaded into the crawler to ensure that they would be covered, even if no links to them had been found. Some areas were excluded on the basis of being mirror sites or huge online databases with only internal links. The crawling was conducted in the summer of 2002, shortly before the pages for the experiment were ranked by the students.

Data Analysis
As discussed in 'Data collection', each subject ranked the set of pages twice. The second round of ranking, after the group discussion, represents the final ranking decision and was thus used for data analysis. Only 9 out of 24 subjects changed their ranking from the first round and most changes were minor, involving only a few pages. The average of the six group members' rankings was taken to represent the human ranking for that set of pages. Although individual students' rankings differed, they were mostly correlated with each other, which provides some assurance of the reliability of the human ranking data. The ranking generated by each PageRank algorithm was correlated with the human ranking to see which algorithm was better (i.e. closer to human ranking). The Spearman correlation coefficient test was used because the human ranking scores are ordinal data.

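The analysis steps above can be sketched in Python as follows. This is a minimal illustration with hypothetical function names, not the statistical software used in the study; it assumes that tied values receive the mean of the ranks they span and that Spearman's coefficient is computed as the Pearson correlation of the two rank vectors. When every page in a set has the same score the coefficient is undefined, and the sketch returns None in that case.

```python
from statistics import mean


def mean_human_rank(rankings):
    """Average the subjects' rankings page by page.

    rankings: one list of ranks per subject, pages in the same order.
    """
    return [mean(column) for column in zip(*rankings)]


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation, or None if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # all pages tied: the coefficient cannot be calculated
    return cov / (var_x * var_y) ** 0.5
```

Because a human rank of 1 denotes the best page whereas a higher PageRank score denotes a better page, the scores would be negated (or otherwise converted to ranks with 1 as the highest score) before calling spearman, so that a positive coefficient indicates agreement with the human ranking.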
Results
The results of correlation tests are summarized in Table I. The four sets of pages are labelled with their acronyms (see 'Choice of page sets' above for a detailed description of the content of each set). The first column of data in Table I gives the correlation coefficients between human ranking and the ranking by the standard PageRank. The other columns show the correlation between human ranking and the ranking generated by various versions of PageRank employing alternative document models. The column labelled 'directory' represents the PageRank using the directory level document model. The columns labelled 'domain' and 'university' are for PageRanks using domain level and university level document models respectively.

Table I. Correlations between human ranking and ranking by algorithms

Page set       Standard    Intersite directory   Intersite domain   Intersite university
               PageRank    PageRank              PageRank           PageRank
OGS            -0.08       -0.06                 0.32               0.05
Ombudsperson   0.60        0.63                  N/A                N/A
MBA            0.2         -0.14                 -0.29              N/A
SOGS           0.27        N/A                   N/A                N/A

The N/A sign in Table I means that PageRank scores are the same or almost the same for all pages in the set and thus the correlation coefficient cannot be calculated.
It should be noted that the presence of so many N/A signs in Table I should not be interpreted to mean that the alternative document models would frequently not provide useful PageRank data. It is the result of the way that the pages were selected. Recall that restriction to a specific domain was necessary when forming the page sets. For example, the SOGS page set was retrieved exclusively from the domain of www.uwo.ca. In fact the unique word SOGS caused the retrieved pages to all come from the same directory www.uwo.ca/sogs/. This explains why PageRank based on the directory, domain, and university level cannot provide data that distinguishes pages within this set. For this reason, this set had to be omitted from the tests of alternative document models.

Correlation coefficients that are statistically significant are shown in bold face in Table I.
The standard PageRank had a significant correlation for only one out of the four sets of pages used in the study, the ombudsperson set. PageRank based on the directory level document model showed a slight improvement over the standard model.

The only page set that is appropriate to test the alternative document models is the OGS set because no restriction to a particular university's domain was imposed when forming this set (Ontario Graduate Scholarship is not restricted to a particular university). As a result, pages within this set come from different universities and the alternative document models were able to distinguish these pages well. For this set, the standard PageRank ranking was weakly negatively correlated with the human ranking, i.e. it tended slightly towards the order opposite to that chosen by the human subjects. PageRank based on the domain level document model showed an advantage over the standard model while the university level model showed only a very slight improvement.

Results from the MBA set came as a surprise in that the alternative document models showed a disadvantage over the standard PageRank model. It is not clear whether it is an anomalous case or whether the alternative document models are not appropriate in some cases. One possible explanation for the failure in this page set is that the PageRank scores calculated for this set are not reliable. Recall that the PageRank scores are calculated from the database that includes all Canadian university Web pages. The MBA page set is centred around the Web site of the Business School of the University of Toronto. Due to the nature of the School, there are many links to the Web site that are not from other Canadian universities. For example, a search of links to this site using the AltaVista search engine found over one hundred links from the .com domain. The PageRank calculation missed all these links and is therefore biased. This problem does not apply, or not to this extent, to the other sets of test pages in the study. For example, the Web site that the ombudsperson set is centred around only has one link from the .com domain. Future studies can avoid this problem by a more careful examination of pages prior to the ranking experiment.

Discussion
The standard PageRank does not seem to be very effective in ranking Web pages in the study, as shown by the fact that its rankings correlate significantly with human rankings for only one out of four sets of pages tested. Alternative approaches are needed to improve the effectiveness of PageRank. The study proposed and tested new versions of PageRank based on alternative document models. Although the results from the study do not provide clear evidence that the alternative models are better, they showed that these models have some promise. In fact, the results from the OGS page set, the only set that is appropriate to test all the alternative document models, showed a substantial advantage of the intersite domain PageRank over the standard PageRank.

One fact has emerged clearly from this research: that it is difficult to assess the quality of Web ranking algorithms, especially those involving links, and especially for researchers that do not have access to a crawl of a sizeable percentage of the Web. A full scientific evaluation would involve huge human and computing resources: ideally a random selection of queries with results ranked by a representative set of users for whom the queries represented real information requests. In order to be able to choose queries at random, access to a major search engine server log and its database for calculating the ranking scores would be needed. The TREC approach (trec.nist.gov, Hawking et al., 2001b) to resolving a similar problem is a sensible one: to have a centrally organised and rated collection of pages that are shared for algorithm testing purposes by participating researchers. However, this does not yet satisfy our need because those pages are assigned a binary relevance score but not ranked by degree of relevance. For the reasons discussed above, the ranking task would be likely to be more complex and involve more, and more difficult, assessments than the currently employed binary relevance judgements.

Our compromise was to choose a small set of four queries that were relevant to a fixed group of end users and belonged to a coherent subset of the Web that could be crawled and assumed to be sufficiently large (3,930,113 pages) for ranking the page sets chosen. This would not be a problem if information needs, link creation and information distribution were known to be highly uniform and predictable on the Web, i.e. if the choice of topic for each set were known not to influence the effectiveness of a ranking algorithm, but we believe that this is not the case. On a large scale, link patterns appear to be reasonably predictable in some contexts (Thelwall, 2001b, 2002a) and over a large number of pages it seems intuitively clear that those with, say, three links to them would be, on average, of slightly better quality than those with only two. Nevertheless, links are still typically created by individuals in an unsystematic fashion and not subject to any kind of quality control. As a result it is difficult to claim that three links to a page are likely to consistently indicate better target page content quality than two. This is more evident if it is acknowledged that factors other than quality can influence link counts, including target page age. As a result, any given link-based ranking algorithm is likely to be effective for some topics but ineffective for others. Moreover, with the low numbers of links likely to be involved in pages for some topics, it seems likely that even the most effective algorithm would regularly fail for a significant proportion of search topics. Therefore, it is probably not surprising that the new algorithms proposed in this study do not work well for all the search topics in the experiment. Future research in this area should design a wider range of search queries and avoid problems encountered in this study.

In summary, it seems that only researchers working for, or in conjunction with, a major search engine would be capable of fully assessing new Web ranking algorithms, and others will remain forced to extrapolate from the tests that they are able to run. The most promise for academic researchers probably lies with centralised initiatives such as TREC, although, as can be seen above, the choice of topics can impact on algorithms in different ways, depending on the details of their workings.

Conclusions
Although the study did not succeed in providing a definite answer to the research questions examined, it provided some evidence that the alternative PageRank algorithms proposed could have the potential to improve the standard PageRank model. The study succeeded in testing Web IR algorithms using an empirical study involving human subjects, a direction that was not followed by many previous studies. The ultimate value of any Web IR algorithm lies in its ability to serve human needs and thus the best way to test such algorithms is to see if they match those needs. Future research with alternative document model based ranking algorithms should keep the human ranking approach of the study but design a range of test queries that all involve pages from different Web sites.

Acknowledgement
We gratefully thank all students who participated in the study by giving permission for us to use their ranking data. The study would have been impossible without their support.

References
AltaVista (2002), AltaVista advanced search tutorial - link popularity, available at: help.altavista.com/adv_search/ast_haw_popularity (accessed 6 September 2002).
Association of Universities and Colleges of Canada (2002), The Directory of Canadian Universities - University Web sites, available at: www.aucc.ca/english/dcu/universities/universitysites.html (accessed 24 April 2002).
Bharat, K. and Mihaila, G.A. (2001), "When experts agree: using non-affiliated experts to rank popular topics", in Tenth International World Wide Web Conference, available at: www.www10.org/cdrom/papers/474/index.html
Brin, S. and Page, L. (1998), "The anatomy of a large scale hypertextual web search engine", Computer Networks and ISDN Systems, Vol. 30 No. 1-7, pp. 107-117, available at: citeseer.nj.nec.com/brin98anatomy.html
Chun, T.Y. (1999), "World Wide Web robots: an overview", Online & CD-ROM Review, Vol. 23 No. 3, pp. 135-142.
Crestani, F. and Lee, P.L. (2000), "Searching the Web by constrained spreading activation", Information Processing and Management, Vol. 36 No. 4, pp. 585-605.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. (1990), "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Vol. 41 No. 6, pp. 391-407.
Gao, J., Walker, S., Robertson, S., Cao, G., He, H., Zhang, M. and Nie, J-Y. (2001), "TREC-10 Web Track Experiments at MSRA", TREC 2001, pp. 384-392, available at: trec.nist.gov/pubs/trec10/t10_proceedings.html
Gordon, M. and Pathak, P. (1999), "Finding information on the World Wide Web: the retrieval effectiveness of search engines", Information Processing & Management, Vol. 35, pp. 141-180.
Haveliwala, T. (1999), "Efficient computation of PageRank", Stanford University Technical Report, available at: dbpubs.stanford.edu:8090/pub/1999-31
Hawking, D., Bailey, P. and Craswell, N. (2000), "ACSys TREC-8 experiments", in Voorhees, E. and Harman, D. (Eds), Information Technology: Eighth Text Retrieval Conference (TREC-8), NIST, Gaithersburg, MD, USA, pp. 307-315.
Hawking, D., Craswell, N., Bailey, P. and Griffiths, K. (2001a), "Measuring search engine quality", Information Retrieval, Vol. 4 No. 1, pp. 33-59.
Hawking, D., Craswell, N., Thistlewaite, P. and Harman, D. (1999), "Results and challenges in Web search evaluation", 8th International World Wide Web Conference, available at: www8.org/w8-papers/2c-search-discover/results/results.html
Hawking, D., Craswell, N., Thistlewaite, P. and Harman, D. (2001b), "Results and challenges in Web search evaluation", Computer Networks, Vol. 31 No. 11-16, pp. 1321-1330, available at: www8.org/w8-papers/2c-search-discover/results/results.html
Johnston, A.D. (Ed.) (2002), The Maclean's Guide to Canadian Universities 2002, Rogers Publishing, Toronto, Canada.
Kleinberg, J. (1999), "Authoritative sources in a hyperlinked environment", Journal of the ACM, Vol. 46 No. 5, pp. 604-632.
Lifantsev, M. (2000), "Voting model for ranking Web pages", in Graham, P. and Maheswaran, M. (Eds), Proceedings of the International Conference on Internet Computing, CSREA Press, Las Vegas, Nevada, USA, pp. 143-148.
Meghabghab, G. (2002), "Google's Web page ranking applied to different topological Web graph structures", Journal of the American Society for Information Science and Technology, Vol. 52 No. 9, pp. 736-747.
Ng, A.Y., Zheng, A.X. and Jordan, M.I. (2001), "Stable algorithms for link analysis", in Croft, W., Harper, D., Kraft, D. and Zobel, J. (Eds), Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), ACM Press, New York, pp. 258-266.
Richardson, M. and Domingos, P. (2001), "The intelligent surfer: probabilistic combination of link and content information in PageRank", poster at Neural Information Processing Systems: Natural and Synthetic 2001, available at: www.cs.washington.edu/homes/mattr/doc/NIPS2001/qd-pagerank.pdf
Savoy, J. and Picard, J. (2001), "Retrieval effectiveness on the Web", Information Processing and Management, Vol. 37 No. 4, pp. 543-569.
Spink, A., Wolfram, D., Jansen, B.J. and Saracevic, T. (2001), "Searching the Web: the public and their queries", Journal of the American Society for Information Science and Technology, Vol. 52 No. 3, pp. 226-234.
Sullivan, D. (2002), "Google tops in 'search hours' ratings", Search Engine Watch, available at: searchenginewatch.com/sereport/02/05-ratings.html (accessed 6 September 2002).
Thelwall, M. (2001a), "A web crawler design for data mining", Journal of Information Science, Vol. 27 No. 5, pp. 319-325.
Thelwall, M. (2001b), "Extracting macroscopic information from Web links", Journal of the American Society for Information Science and Technology, Vol. 52 No. 13, pp. 1157-1168.
Thelwall, M. (2002a), "Conceptualizing documentation on the Web: an evaluation of different heuristic-based models for counting links between university Web sites", Journal of the American Society for Information Science and Technology, Vol. 53 No. 12, pp. 995-1005.
Thelwall, M. (2002b), "Subject gateway sites and search engine ranking", Online Information Review, Vol. 26 No. 2, pp. 101-107.
Thelwall, M. (2003), "A layered approach for investigating the topological structure of communities in the Web", Journal of Documentation, Vol. 59 No. 4, pp. 410-429.
Thelwall, M. and Harries, G. (2003), "The connection between the research of a university and counts of links to its Web pages: an investigation based upon a classification of the relationships of pages to the research of the host university", Journal of the American Society for Information Science and Technology, Vol. 54 No. 7, pp. 594-602.
Thelwall, M. and Tang, R. (2003), "Disciplinary and linguistic considerations for academic Web linking: an exploratory hyperlink mediated study with Mainland China and Taiwan", Scientometrics, Vol. 58 No. 1, pp. 153-179.
Thelwall, M. and Wilkinson, D. (2003), "Three target document range metrics for university Web sites", Journal of the American Society for Information Science and Technology, Vol. 54 No. 6, pp. 489-496.
Tsikrika, T. and Lalmas, M. (2002), "Combining Web document representations in a Bayesian Inference Network model using link and content-based evidence", in Proceedings of 24th European Colloquium on Information Retrieval Research (ECIR 2002), Glasgow, Scotland, pp. 53-72.
Xi, W. and Fox, E.A. (2001), "Machine Learning Approach for Homepage Finding Task", TREC 2001, pp. 686-697, available at: trec.nist.gov/pubs/trec10/t10_proceedings.html