CAN GOOGLE'S PAGERANK BE USED TO FIND THE MOST IMPORTANT ACADEMIC WEB PAGES?

Mike Thelwall [1]
m.thelwall@wlv.ac.uk
School of Computing and Information Technology, University of Wolverhampton, 35/49 Lichfield Street, Wolverhampton WV1 1EQ, UK
Phone: +44 1902 321470; Fax: +44 1902 321478

Google's PageRank is an influential algorithm that uses a model of Web use that is dominated by its link structure in order to rank pages by their estimated value to the Web community. This paper reports on the outcome of applying the algorithm to the Web sites of three national university systems in order to test whether it is capable of identifying the most important Web pages. The results are also compared to simple inlink counts. It was discovered that the highest inlinked pages do not always have the highest PageRank, indicating that the two metrics are genuinely different, even for the top pages. More significantly, however, internal links dominated external links for the high ranks in either method and superficial reasons accounted for high scores in both cases. It is concluded that PageRank is not useful for identifying the top pages in a site and that it must be combined with a powerful text matching technique in order to get the quality of information retrieval results provided by Google.

INTRODUCTION

Google's PageRank algorithm (Brin & Page, 1998) for ranking Web pages is an Information Retrieval (IR) algorithm that is relatively well-known to the general public because of its use in the Google Toolbar and in the company's marketing approach, "The heart of our software is PageRank" (Google, 2002). It is also arguably the most influential and successful of the past five years, on the back of the search engine's number one status for online searching according to some measurements (Sullivan, 2002). Despite this, there do not appear to have been many studies focussing on the questions of how effective it is or under which conditions it is effective.

PageRank is based upon the assumption that good quality pages are more likely to be linked to than poor quality ones and therefore that mining information about the link structure of the Web could be more effective at identifying the best pages matching search engine queries than a simple text-matching algorithm. In fact it goes one step further and incorporates the quality of the linking page in its iterative algorithm, described in detail below. In this context two natural questions to ask from a bibliometric perspective are whether the pages that are most highly linked to are significantly different from those that have the highest PageRank, and whether either method is capable of identifying the highest quality or most useful pages in a site. The questions have additional pertinence because of the growing body of informetric research that is based upon link counts (e.g. Larson, 1996; Rousseau, 1997; Ingwersen, 1998; Smith, 1999; Leydesdorff & Curran, 2000; Thelwall, 2001a,b,f, 2002a; Smith & Thelwall, 2002). Potentially, such investigations may benefit from switching to PageRank, or another iterative rating system, in order to take into account in some way the quality of the inlinks rather than just their numbers.

[1] Journal of Documentation 58(6), 2002, to appear.

PAGERANK AND OTHER WEB INFORMATION RETRIEVAL ALGORITHMS

On a mathematical level the PageRank algorithm finds the principal eigenvector of a matrix created from the link structure of the system. More descriptively, the matrix encodes the model of a surfer visiting Web pages in succession. At each page the surfer jumps to a completely random page with probability 0.85 and follows a random link chosen from the current page with probability 0.15. If the surfer is allowed to proceed in this fashion from any starting point for a very long time then the PageRank of any page is defined to be the probability that they are viewing it after any given jump. The ranking system generated favours pages that are the target of many links, since they are more likely to be jumped to. It also weights more highly links from more important source pages since these sources are more likely to be jumped to and, therefore, more likely to originate a new jump. The rationale for the use of links is that they provide additional information about pages that can be used to help decide how important the page is, rather than what its content is about (Brin & Page, 1998). A more mathematical equivalent definition of the PageRank algorithm can be found in Ng et al. (2001) in addition to the original Brin and Page article.
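
The random surfer model can be illustrated with a minimal sketch. It is not the program used in this study: the function name, the dictionary representation of the link structure and the treatment of pages without links (the surfer jumps to a random page) are assumptions made for illustration. The jump probability is left as a parameter, `jump_prob`, because the text above gives 0.85 for the random jump, whereas the formula in Brin & Page (1998) with its usual setting d = 0.85 gives the weight 0.85 to link following.

```python
def pagerank(links, jump_prob, iterations=100):
    """Approximate the random surfer probabilities by repeated updating.

    links: dict mapping each page to the list of pages it links to.
    jump_prob: probability of jumping to a completely random page.
    Returns a dict mapping each page to its estimated probability.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives an equal share of the random jumps.
        new_rank = {page: jump_prob / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A followed link is chosen at random from the current page.
                share = (1.0 - jump_prob) * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: with no link to follow, the surfer jumps randomly.
                share = (1.0 - jump_prob) * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page site used only to exercise the function.
example_links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
example_ranks = pagerank(example_links, jump_prob=0.15)
```

Dividing every value by the largest one gives scaled scores in which the top page has the value 1.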

One other key link based IR procedure is Kleinberg's (1999) topic distillation algorithm, which is primarily for topic-specific searching. This uses links to decide how important pages are for a specific topic, rather than in general. It works by starting with a query and identifying relevant pages through text semantics, then using the link structure within this collection to allocate pages iteratively (a) an authority score by summing the weights of the incoming link pages and (b) a hub score by summing the weights of the outgoing link target pages. PageRank has been shown to be intrinsically the more stable of the two, however, with the Kleinberg algorithm being sensitive to small changes in link structures (Ng et al., 2001).
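
The iterative allocation of authority and hub scores might be sketched as follows. This is an illustrative reading of the description above, not Kleinberg's published code: the query and text-matching step is omitted, the collection is assumed to be already selected, and the function name and the normalisation by Euclidean length are choices made here.

```python
import math


def hub_authority_scores(links, iterations=50):
    """Iteratively compute authority and hub scores for a page collection.

    links: dict mapping each page to the list of pages it links to.
    Returns (authority, hub), two dicts keyed by page.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # (a) Authority: sum the hub weights of the pages linking in.
        authority = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                authority[target] += hub[source]
        # (b) Hub: sum the authority weights of the link targets.
        hub = {page: sum(authority[t] for t in links.get(page, [])) for page in pages}
        # Rescale so the scores do not grow without bound.
        for scores in (authority, hub):
            length = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= length
    return authority, hub
```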

There is no known study that scientifically demonstrates that either is effective in a clearly defined sense of identifying the best information on the Web, but the success of Google is still a powerful argument for the importance of PageRank. It may well be the case that the various search engine companies perform extensive testing and know the answers to these questions but do not make them available for commercial reasons. The scientific TREC competition results were not promising, however (Hawking et al., 2000), although this could have been due to an untypical test corpus or the variant of PageRank used (Thelwall, 2002b). According to Gao et al. (2001), "[The recent] research of web retrieval has focused on link-based ranking methods. However, none had achieved better results than content-based methods in TREC experiments".

Other research into link-related algorithms serves to confirm the importance of this area (Haveliwala, 1999; Broder et al., 2000; Lifantsev, 2000; Rafiei & Mendelzon, 2000; Richardson & Domingos, 2001). Bharat and Mihaila (2001) for example develop a new algorithm and demonstrate through user evaluations that its performance is comparable with PageRank. Unlike PageRank, however, the other Web IR algorithms integrate the text analysis with link analysis, making them unsuitable for tasks such as finding the 'best' overall pages.

THE RESEARCH QUESTIONS

This paper reports on a study to apply PageRank to databases of the link structures of the Web sites of UK, Australian and New Zealand universities. This algorithm is chosen for its arguable pre-eminence in addition to its suitability for the task of finding the overall best pages on a site. The three nations selected are chosen for the free availability of link data for them and because they represent in international terms relatively early Internet adopters and extensive Web users. The 10 highest ranked pages for each university will be analysed as well as the 100 highest for each national system. Reasons for differences between PageRank and inlink counts will be uncovered from an investigation into the inlink structure of the pages in question. This is essentially an investigative and qualitative bibliometric approach (e.g. Gläser & Laudel, 2001; Goodrum et al., 2001) rather than one of formal scientific hypothesis testing. The theoretical context is the hypothesis that the top ranked pages will either contain high quality content or will be gateways to other useful pages. The two specific questions addressed are as follows.

Are the pages given the highest rank by PageRank clearly the most useful or highest quality in the system analysed, or can their high positions be the result of unrelated factors?

Is PageRank more successful than simple inlink counts at identifying the top pages?

METHODOLOGY

The link structure of the national university systems was obtained from a publicly available database (cybermetrics.wlv.ac.uk/database) described in detail in Thelwall (2001d) and obtained by an information science Web crawler (Thelwall, 2001c). This covers the proportion of each Web site that could be found by iteratively following links from the home page, excluding copies of pages from other sources (mirror sites) when identified. Mirror sites are a particular problem because it is not known to what extent Google's spider crawls them. For example there are numerous copies of Sun's Java documentation on UK university Web sites and ideally Google would ignore these and only crawl the original on the Sun Microsystems web site. Any additional copies in Google's database would clearly be wasting space. Nevertheless, identifying and eliminating duplicate pages is a technically challenging job, despite published research on speeding up the process (Heydon & Najork, 1999), because of the sheer size of the Web. The databases used will include some mirror sites that have been missed due to human error, which is possibly similar to the situation for Google. The names of the 156 universities crawled can be obtained from the originating database site, via the domain names files.

The link database consists of a separate file for each institution containing the link structure of its web site. The most challenging part of the research was in writing a program to encode the URLs into numbers for the PageRank algorithm. This was difficult because of the memory taken by the URLs and the number of string comparison operations that were required to ensure that each URL was given a unique number.
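
One way the encoding step could be approached is sketched below. It is not the program written for this study, whose design is not described: it assumes a hash table (a Python dictionary) so that each URL is looked up once rather than compared with every earlier URL, and it does nothing to reduce the memory taken by the URL strings themselves. The function and variable names are hypothetical.

```python
def encode_links(url_pairs):
    """Give each distinct URL a unique number and rewrite links as number pairs.

    url_pairs: iterable of (source_url, target_url) strings.
    Returns (ids, edges): ids maps URL -> number, edges is a list of
    (source_number, target_number) tuples in the original order.
    """
    ids = {}
    edges = []
    for source, target in url_pairs:
        # setdefault assigns the next free number only if the URL is new.
        source_id = ids.setdefault(source, len(ids))
        target_id = ids.setdefault(target, len(ids))
        edges.append((source_id, target_id))
    return ids, edges
```

The `ids` dictionary then plays the role of a URL key file, allowing ranks computed on the numbers to be mapped back to URLs.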
One combined link structure file was constructed for each national system and used to build a matrix of its link structure, and one separate link structure file was also created for each institution. This was then loaded into a new program coded to execute the PageRank algorithm, and ranks obtained from it. The procedure followed was essentially the same as the non-blocked version described by Haveliwala (1999) for small computers, except that no pages were eliminated from the system due to a lack of links. Instead, a correcting factor was incorporated to adjust for the effect of pages in the system without links. Although in the largest case the full link structure matrix would have been too big to store as an array, with 4 × 10^14 entries, it could in fact be stored with only 2 × 10^7 entries as a sparse matrix, recording the location of the non-zero entries, with unrecorded locations being implicitly zero.
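
The sparse storage idea, together with one possible correcting factor for pages without links, can be sketched as follows. The exact correction used in the study is not specified, so the even redistribution of the rank held by link-less pages is an assumption, as are the function names; pages are assumed to be numbered from 0.

```python
def sparse_link_matrix(num_pages, edges):
    """Record only the non-zero cells of the link matrix.

    edges: list of (source, target) page numbers.
    Returns (cells, dangling): cells maps (target, source) -> weight,
    dangling lists the pages that have no outgoing links.
    """
    out_degree = [0] * num_pages
    for source, _ in edges:
        out_degree[source] += 1
    cells = {}
    for source, target in edges:
        key = (target, source)
        cells[key] = cells.get(key, 0.0) + 1.0 / out_degree[source]
    dangling = [page for page in range(num_pages) if out_degree[page] == 0]
    return cells, dangling


def multiply(cells, dangling, rank):
    """One matrix-vector product using only the stored cells."""
    n = len(rank)
    result = [0.0] * n
    for (row, column), weight in cells.items():
        result[row] += weight * rank[column]
    # Assumed correction: rank held by link-less pages is shared evenly.
    lost = sum(rank[page] for page in dangling)
    return [value + lost / n for value in result]
```

For three pages and three links only three cells are stored instead of nine; the saving grows with the square of the number of pages.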
The PageRank list was combined with the URL key file and sorted to produce two top 10 lists for each institution, one for PageRanks and one for inlinks. Similar top 100 lists were produced for each whole system. Table 1 summarises basic information about the databases. The UK database is just over 10% of the size of the original Brin & Page (1998) corpus.

Table 1. Information about the databases used

Country                         Australia        New Zealand   UK
University Web sites included   38               8             110
Crawl dates                     10/2001-1/2002   1-2/2002      6-7/2001
Total pages                     3,804,612        341,667       6,920,448
Total links                     20,054,017       2,119,677     32,516,604

The first analysis was a simple calculation to see whether PageRank was more effective at identifying useful pages than inlink counts. The two assumptions made are that (a) the most useful pages are institutional home pages and that (b) these are normally the root pages of their own domain names. Based upon these assumptions, the test applied was to see which of the two mechanisms ranked this type of URL most highly. As an aside, home page finding is a recognised IR task, for which links have been found useful (Xi & Fox, 2001).

The second analysis is a large combined experiment to determine whether PageRank or inlink counts reveal the most important pages on a site, and whether one appears to be better than the other. The investigation is conducted by using tables of the top pages from both methods and evaluating these qualitatively. Separate results are reported for individual universities and for national university systems. These are potentially significantly different entities under the hypothesis that links between institutions carry a higher information value than those within a single site (Kleinberg, 1999; Thelwall, 2001a).

RESULTS AND DISCUSSION

Home page ranking

As can be seen in Table 2, there is no significant difference between the success of raw inlink counts and PageRank in the rank of the university home page, based only upon the link structure of the university Web site on its own. In almost all of the tied cases the home pages were number one in both lists. In only five cases the home pages were not in the top ten of either list. No statistical test is needed to see that the differences are negligible, but standard hypothesis tests for proportions would show this.
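
As an illustration of such a test, the sketch below applies an exact two-sided sign test to the untied cases in Table 2: across the three systems the home page ranked higher with PageRank in 1 + 2 + 12 = 15 sites and higher with inlink counts in 3 + 1 + 10 = 14 sites. The choice of the sign test and the function name are assumptions; the paper does not name a specific test.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test, ignoring ties.

    Under the null hypothesis each untied case is equally likely to
    favour either method, so the smaller count is Binomial(n, 0.5).
    """
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Untied home page comparisons summed over Australia, New Zealand and the UK.
p_value = sign_test_p_value(15, 14)
```

With 15 against 14 the p-value is 1.0, the largest possible, which agrees with the statement that the differences are negligible.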

Table 2. A comparison of the ranking of university home pages produced by PageRank and by simple inlink counts operating on each university Web site on its own

System        Home pages      Home pages      Home pages      Home pages
              ranking higher  ranking higher  ranking the     not in the
              with PageRank   with inlink     same with       top 10 in
                              counts          both methods    either list
Australia     1               3               33              1
New Zealand   2               1               5               0
UK            12              10              84              4

Individual universities

The top ten lists of individual universities were examined for patterns. In many lists there was a group of closely related URLs that came from a large subsite with a navigation bar linking to the main pages. Often these were the main official pages of the site, as is the case for La Trobe University, shown in Table 3. The dominance of the main pages in this case is caused by the existence of a standard links bar at the top of all official pages. A big difference can be seen between the results from this site and those of Wolverhampton (Table 4), where the main pages were not indexed due to their use of Active Server Pages queries. The official links bar for other pages uses a server side map that is also not indexable, although the URL of the map can be seen ranked third in the table. This is a clear case of design decisions dominating the top results of the PageRank calculation for individual universities.

Table 3. The ten highest ranked pages for La Trobe, using PageRank on internal links only, with PageRanks linearly scaled to make the largest equal to 1

Count   PageRank   Page
8952    1          www.latrobe.edu.au
10058   0.513953   www.latrobe.edu.au/international
9910    0.506899   www.latrobe.edu.au/search
9862    0.505285   www.latrobe.edu.au/contact
9966    0.505202   www.latrobe.edu.au/about
9858    0.504549   www.latrobe.edu.au/sitemap
9909    0.50315    www.latrobe.edu.au/teaching
9903    0.502879   www.latrobe.edu.au/research
9903    0.502845   www.latrobe.edu.au/faculties
9902    0.502827   www.latrobe.edu.au/campuses

Table 4. The ten highest ranked pages in Wolverhampton, using PageRank on internal links only, PageRanks scaled

Count   PageRank    Page
3171    1           www.wlv.ac.uk/disclaimer/official.html
3037    0.7812036   www.wlv.ac.uk
2802    0.6703956   www.wlv.ac.uk/resources/uni.nav.bar.map
1898    0.5537226   www.scit.wlv.ac.uk/appdocs/php
4474    0.4495484   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/overview-summary.html
4475    0.4286861   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api
4469    0.4277823   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/deprecated-list.html
4468    0.427765    www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/index-files/index-1.html
4468    0.4277458   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/help-doc.html
811     0.374287    www.wlv.ac.uk/disclaimer/personal.html

In addition to instances of domination by official pages, other cases were also found where sets of computer documentation or other types of large subsite had high interlinking patterns. This can be seen in Table 4 and is also illustrated for the case of the Royal Melbourne Institute of Technology (RMIT), as shown in Table 5. It can be seen that the main pages on a large PHP: Hypertext Preprocessor (PHP, a recursive acronym) Web page server-side scripting language site have attracted a large number of links, in actual fact from the standard navigation links found on all other pages of this large set of documentation. This is a case where a combination of the sheer size of the resource and its inclusion of a standard set of navigational links have combined to give its key pages a huge inlink count. It appears to be for use only in one student course and is a copy of documentation produced elsewhere, so from an external Web user's point of view it would not be considered as important content on the RMIT site. The fifth page in Table 6 is the home page of the RMIT Research and Development Section, which hosts a large subsite with a link to their home page on each page. This is an example of a similar phenomenon: the internal size of a subsite determining the rank of its home page.

Table 5. The ten most linked to pages in RMIT, counting only internal links, PageRanks scaled

Count   PageRank    Page
12485   1           www.rmit.edu.au
6829    0.5016145   www.rmit.edu.au/webmaster/disclaimer.html
3286    0.4388598   www.viscom.rmit.edu.au/robin/talks.htm
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/downloads.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/docs.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/faq.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/support.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/bugs.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/links.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/copyright.php

In terms of the difference between high inlink count pages and those with a high PageRank, a comparison of Table 5 and Table 6 shows that there can be real differences. Many of the RMIT pages in Table 5 have a relatively low PageRank as a result of each linking page containing a large number of other links, which dissipates the effect of each individual link through sharing 'rank' between all targets of a page. Some pages in Table 6 have about half as many inlinks but more than double the PageRank because the pages that link to them have fewer overall links.
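
The sharing effect can be made concrete with a small calculation. The sketch below looks at a single step of rank transfer only, ignoring the random jump and the iteration, and the numbers are invented for illustration rather than taken from the RMIT data.

```python
def received_rank(inlink_sources):
    """Rank passed to a page in one step by the pages linking to it.

    inlink_sources: list of (source_rank, source_out_degree) pairs.
    Each source splits its rank equally between all of its link targets.
    """
    return sum(rank / out_degree for rank, out_degree in inlink_sources)


# Hypothetical page with four inlinks from pages that each carry 20 links.
many_shared_inlinks = received_rank([(1.0, 20)] * 4)
# Hypothetical page with two inlinks from pages that each carry 2 links.
few_focused_inlinks = received_rank([(1.0, 2)] * 2)
```

Here the page with half as many inlinks receives five times as much rank, because each of its source pages divides its rank between far fewer targets.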

Table 6. The ten highest ranked pages in RMIT, using PageRank on internal links only, PageRanks scaled

Count   PageRank    Page
12485   1           www.rmit.edu.au
6829    0.5016145   www.rmit.edu.au/webmaster/disclaimer.html
3286    0.4388598   www.viscom.rmit.edu.au/robin/talks.htm
1248    0.1176734   www.homepages.eu.rmit.edu.au/bondy/saskiabondyfamtreesite/persons.html
666     0.0819186   www.rmit.edu.au/departments/rd
1097    0.0796801   bonza.rmit.edu.au
1084    0.0786278   bonza.rmit.edu.au/search.html
1084    0.0786278   bonza.rmit.edu.au/essays
1083    0.0785995   bonza.rmit.edu.au/links
1083    0.0785995   bonza.rmit.edu.au/contact.html

The number one page in Table 4 and the number two page in Table 5 demonstrate another feature of both PageRank and inlink counts: the high score that pages can have which possess a legal function in regard to Web content. At the University of Wolverhampton, the page with the highest PageRank is the legal disclaimer that all official pages are supposed to contain a link to. The RMIT disclaimer page clearly also enjoys a similar status. This is a problem from an IR or bibliometric perspective, as the page does not contain information of unusually high value. There are also several highly ranked copyright pages in other university lists (see Tables 8 and 9).

National systems

Tables 7 to 9 give the top 10 pages from each national system, after applying PageRank to their combined link structure files. The top 100 pages were produced in each case but the rest are not shown for reasons of space. Although the university home pages in each list are natural inclusions, none of the other pages could be regarded as containing unusually useful information; rather, they owe their position to a relatively ephemeral cause such as the ones discussed above. The UK's top page is a case in point. It attracts only internal links from its own site and is linked to from a huge collection of pages, each containing a description of one of the modules taught at the University of Staffordshire, all of which contain only one link. Ironically, the link appears to be an automatically inserted typo (the home page has an additional underscore: www.staffs.ac.uk/schools/art_and_design) and the link in question is non-functioning because there is no content between the start and end of the anchor tag. The lack of sharing with other links, however, is the main factor that has led to a high PageRank.

Table 7. Australian top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

Count   PageRank    Toolbar           Page
23304   1           6 (moved)         www.unimelb.edu.au/pwebstats/pwebstats.html
32940   0.3007683   8                 www.unimelb.edu.au
44129   0.2149047   7                 www.monash.edu.au
22502   0.212978    8                 www.unimelb.edu.au/disclaimer
10827   0.1948821   7                 www.csse.monash.edu.au/disclaimers/user.html
18157   0.1839759   (not available)   www.gu.edu.au/cgi-bin/textflip.cgi
28977   0.1825789   8                 www.uq.edu.au
7525    0.1820394   5                 www.educ.utas.edu.au
18989   0.1717772   7                 www.unisa.edu.au
34341   0.1686863   8                 www.unsw.edu.au

Table 8. New Zealand top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

Count   PageRank    Toolbar   Page
16990   1           3         www.otago.ac.nz/sas/common/images/copyrite.htm
7561    0.2361207   7         www.otago.ac.nz
6611    0.2183746   7         www.vuw.ac.nz
7953    0.2012788   7         www.massey.ac.nz/disclaim.htm
9723    0.189337    7         www.massey.ac.nz
2808    0.185448    5         webview.massey.ac.nz
2807    0.1854373   5         webview.massey.ac.nz/help/help.htm
5254    0.1850909   7         www.canterbury.ac.nz
2185    0.1621233   6         nix.tmk.auckland.ac.nz/SAL
3079    0.1488084   7         www.auckland.ac.nz

Table 9. UK top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

Count   PageRank    Toolbar             Page
3843    1           4 (not available)   www.staffs.ac.uk/schools/art_anddesign
22691   0.9747133   8                   www.st-and.ac.uk
17160   0.9735287   0                   www.cc.ic.ac.uk/college/onlinedocs/sasonlinedocv8/sasdoc/sashtml/common/images/copyrite.htm
12841   0.9626068   4                   bicss.mdx.ac.uk/css/public
26276   0.9160221   7                   www.napier.ac.uk
3477    0.9152157   unranked            www.aom.bham.ac.uk/handbook/courses/glossary.htm
3464    0.9122335   3                   www.ao.bham.ac.uk/handbook/courses/glossary.htm
27982   0.8607687   7                   www.ulst.ac.uk
16851   0.8305826   8                   www.leeds.ac.uk
18761   0.7950215   7                   www-maths.mcs.st-and.ac.uk

The top Australian page is of a type not mentioned before, a Web statistics software home page, and this particular example is from the former site of Martin Gleeson's pwebstats program that attracts large numbers of links from server statistics pages generated by the software. There are similarly highly inlinked pages in the UK (Thelwall, 2002b). As can also be seen, there are help and glossary pages in the New Zealand and UK lists respectively. These may be useful in the context of the pages that link to them but probably much less so for the wider Web user. Also present are two departmental home pages, both as a result of credit links on large collections of pages. In the case of St Andrews, for example, the links come predominantly from the pages of an online history of maths archive. The ninth ranked New Zealand page is from a mirror copy of the Scientific Applications on Linux site, getting its links from within its own site. The URL is case-sensitive, hence the mixed case version shown in Table 8.

The pages referenced here were all loaded into a Web browser on the 19th of February, 2002 with Google's toolbar (toolbar.google.com) installed so that the PageRank feature could be used. This gives a number between 0 and 10 for each loaded page. Clearly these are not PageRanks as could be obtained directly from Brin & Page's algorithm, since the unmodified values would all be less than one, but the use of this word by Google to describe the displayed figures gives some cause to believe that they are related in a monotonic way (i.e. larger toolbar values come from larger PageRank values). The results of this exercise led to the discovery that in some cases the toolbar PageRank figure given was applied to the domain and automatically reduced by one for each directory in the path of the URL, so that longer URLs tended to have lower 'PageRanks', irrespective of high inlinking, as seen in Tables 7 to 9. This was the case for the zero ranked page in the UK list, for example. It is surmised, then, that either the PageRank algorithm has been modified for the current version of Google, or the toolbar uses an approximate non-monotonic version of it in certain situations (i.e. it reverses the relative ranks of some pages).

CONCLUSIONS

It is very clear from the data that the top ranked pages, either with PageRank or with raw inlink counts, are there primarily as a result of navigational architecture policy decisions rather than on their own individual merit. Perhaps the clearest example of this is a comparison of the Ulster University home page with that of Wolverhampton. The former attracts the highest inlink count of all UK pages whereas the latter attracts relatively few as a result of global site design decisions. It must be concluded from this that PageRank and inlink counts are not reliable methods for ascertaining the most valuable resources on a large university Web site or national system of university sites.

Contrasting these findings with those of Thelwall (2002b) it can be seen that the root cause of the problems is the inclusion of internal links. Inlink counts based upon external links only yield much "better" results, although still not perfect. This is a real problem for the PageRank algorithm because it depends on internal links to function, for example redistributing link votes from the home page of a site to links on its other pages, as would be needed for PageRank to propagate from an important multi-page site, such as the Humbul Humanities Hub (www.humbul.ac.uk). Comparing the UK top 100 results with those for external links only (Thelwall, 2002b) it can be seen that the two are fundamentally different. Both contain many university home pages but the latter does not contain any of the other pages typically found on the standard internal navigation bar. Perhaps the most damning evidence is that the single page that is probably the most widely used resource on a UK Web site, the UK clickable map, does not appear at all in either UK top 100 reported here, despite having 891 external links from other UK universities.

In the context of the results presented here it is hard to believe that plain PageRank is effective as an IR algorithm, even when combined with simple text matching. The fundamental problem is the allocation of equal weight to internal links as external ones and the loophole that this gives to allow navigational policy decisions to swamp the relatively small number of links created for non-navigational reasons. The algorithm may be more effective on a global scale, where there are relatively more external links, but the difference will not be of the order of magnitude needed to make a real difference for university Web sites. It may be, however, that huge sites such as Yahoo! do improve the results by bestowing higher PageRanks on the better Web pages, but this would affect only a few pages on each university site. The original Google patent (Page, 1998) does mention myriad potential customisations of the algorithm, including treating inter-site links differently, and so it is likely that the version in use at the time of testing was different from the original.

Another point that should be mentioned is that this analysis has been confined to the top ten or 100 pages of each set and therefore ignores the overwhelming majority of pages. Nevertheless, it can be seen that the same kinds of arguments can also apply for these: pages that are part of a highly interlinking navigational structure will rank much better than others that are the target of only one or two external links, even though such links are probably a much better indicator of high quality. It could be argued that a high degree of interlinking is a good indicator of quality at least in site design, but in this case the PageRank is still totally dependent on the absolute number of pages involved and the extent to which links are also present to other resources.

Despite the number-intensive mathematical algorithms used to produce the data presented here, this has been essentially a qualitative study. This is not seen as a weakness in the context of the very clear-cut nature of the results obtained. Indeed the qualitative approach, focussing on investigating the cause of the problems, has enabled the gaining of insights into the reasons why the ranking methods have been unable to produce convincing lists of the top pages in the sites covered.

In conclusion, PageRank is not an effective method for identifying the "best" Web pages in a university system because of its domination by internal links, an argument that would still apply even if all mirror sites had been removed from the data. The astonishing accuracy of Google (Thelwall, 2002c) must be due to its complementary use of a very effective text-based matching algorithm, which must itself be incorporated into the final ranks. A promising future direction for bibliometric research is to develop a variant of PageRank that can harness the potential of its system for transferring rank iteratively through links in a way that would not be dominated by internal site links.
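
The distinction between internal and external links that runs through these conclusions can be sketched in a few lines. The example below counts only inlinks that cross from one institution to another. It is an illustration rather than the procedure of Thelwall (2002b): URLs are assumed to be written without a scheme, as in the tables above, and the list of institutional domains (`site_domains`) is a hypothetical input used to treat subdomains as part of the same site.

```python
from collections import Counter


def site_of(url, site_domains):
    """Return the institutional domain a scheme-less URL belongs to."""
    host = url.split("/", 1)[0].lower()
    for domain in site_domains:
        if host == domain or host.endswith("." + domain):
            return domain
    # Unknown hosts are treated as sites in their own right.
    return host


def external_inlink_counts(links, site_domains):
    """Count inlinks per target page, ignoring links within one site.

    links: iterable of (source_url, target_url) pairs.
    """
    counts = Counter()
    for source, target in links:
        if site_of(source, site_domains) != site_of(target, site_domains):
            counts[target] += 1
    return counts
```

Under this counting a navigation bar repeated on thousands of pages of the same site contributes nothing, whereas a single link from another university contributes one.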

ACKNOWLEDGEMENTS

I would like to thank the referees for their very helpful comments.

REFERENCES

Bharat, K. & Mihaila, G. A. (2001). When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. In: Tenth International World Wide Web Conference. Available: http://www.www10.org/cdrom/papers/474/index.html
Brin, S. & Page, L. (1998). The anatomy of a large scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117. Available: http://citeseer.nj.nec.com/brin98anatomy.html
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. & Wiener, J. (2000). Graph structure in the web. Computer Networks, 33(1-6), 309-320.
Gao, J., Walker, S., Robertson, S., Cao, G., He, H., Zhang, M. & Nie, J-Y. (2001). TREC-10 Web Track Experiments at MSRA. TREC 2001, pp. 384-392. Available: http://trec.nist.gov/pubs/trec10/t10_proceedings.html
Gläser, J. & Laudel, G. (2001). Integrating scientometric indicators into sociological studies: methodical and methodological problems. Scientometrics, 52(3), 411-434.
Goodrum, A. A., McCain, K. W., Lawrence, S. & Giles, C. L. (2001). Scholarly publishing in the Internet age: a citation analysis of computer science literature. Information Processing and Management, 37(5), 661-676.
Google (2002). Google Technology. Available: http://www.google.com/technology/. Accessed 3 July, 2002.
Haveliwala, T. (1999). Efficient Computation of PageRank. Stanford University Technical Report. Available: http://dbpubs.stanford.edu:8090/pub/1999-31
Hawking, D., Bailey, P. & Craswell, N. (2000). ACSys TREC-8 experiments. In: Information Technology: Eighth Text REtrieval Conference (TREC-8), NIST, Gaithersburg, MD, USA, pp. 307-315.
Heydon, A. & Najork, M. (1999). Mercator: A scalable, extensible Web crawler. World Wide Web, 2, 219-229.
Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.
Larson, R. R. (1996). Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace. ASIS 96. Available: http://sherlock.berkeley.edu/asis96/asis96.html
Leydesdorff, L. & Curran, M. (2000). Mapping university-industry-government relations on the Internet: the construction of indicators for a knowledge-based economy. Cybermetrics, 4. Available: http://www.cindoc.csic.es/cybermetrics/articles/v4i1p2.html
Lifantsev, M. (2000). Voting model for ranking Web pages. In Graham, P. & Maheswaran, M. (eds), Proceedings of the International Conference on Internet Computing, Las Vegas, Nevada, USA, CSREA Press, pp. 143-148.
Ng, A. Y., Zheng, A. X. & Jordan, M. I. (2001). Stable algorithms for link analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), New York: ACM Press, pp. 258-266.
Page, L. (1998). United States Patent 6,285,999. Available: http://patft.uspto.gov/.
Rafiei, D. & Mendelzon, A. O. (2000). What is this page known for? Computing Web page reputations. Computer Networks, 33(1-6), 823-835.
Richardson, M. & Domingos, P. (2001). The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. Poster at Neural Information Processing Systems: Natural and Synthetic 2001. Available: http://www.cs.washington.edu/homes/mattr/doc/NIPS2001/qd-pagerank.pdf
Rousseau, R. (1997). Sitations, an exploratory study. Cybermetrics, 1. Available: http://www.cindoc.csic.es/cybermetrics/articles/v1i1p1.html
Smith, A. G. (1999). A tale of two web spaces: comparing sites using Web Impact Factors. Journal of Documentation, 55(5), 577-592.
Smith, A. & Thelwall, M. (2002, to appear). Web Impact Factors for Australasian Universities. Scientometrics, 54(1-2).
Sullivan, D. (2002). Google Tops In "Search Hours" Ratings. Available: http://searchenginewatch.com/sereport/02/05-ratings.html. Accessed 3 July, 2002.
Thelwall, M. (2001a). Extracting macroscopic information from web links. Journal of the American Society for Information Science and Technology, 52(13), 1157-1168.
Thelwall, M. (2001b, to appear). Evidence for the existence of geographic trends in university web site interlinking. Journal of Documentation.
Thelwall, M. (2001c). A web crawler design for data mining. Journal of Information Science, 27(5), 319-325.
Thelwall, M. (2001d). A publicly accessible database of UK university website links and a discussion of the need for human intervention in web crawling. University of Wolverhampton. Available: http://www.scit.wlv.ac.uk/~cm1993/papers/a_publicly_accessible_database.pdf
Thelwall, M. (2001e). The top 100 linked pages on UK university Web sites: high inlink counts are not associated with quality scholarly content. University of Wolverhampton.
Thelwall, M. (2001f). Results from a Web Impact Factor crawler. Journal of Documentation, 57(2), 177-191.
Thelwall, M. (2002a). A comparison of sources of links for academic Web Impact Factor calculations. Journal of Documentation, 58, 60-72.
Thelwall, M. (2002b). Subject gateway sites and search engine ranking. Online Information Review, 26(2), 124-138.
Thelwall, M. (2002c, to appear). In praise of Google: finding law journal Web sites. Online Information Review, 26(4).
Xi, W. & Fox, E. A. (2001). Machine Learning Approach for Homepage Finding Task. TREC 2001, pp. 686-697. Available: http://trec.nist.gov/pubs/trec10/t10_proceedings.html
