CAN GOOGLE'S PAGERANK BE USED TO FIND THE MOST IMPORTANT ACADEMIC WEB PAGES?

Mike Thelwall [1]
m.thelwall@wlv.ac.uk
School of Computing and Information Technology, University of Wolverhampton, 35/49 Lichfield Street, Wolverhampton WV1 1EQ, UK
Phone: +44 1902 321470; Fax: +44 1902 321478

Google's PageRank is an influential algorithm that uses a model of Web use that is dominated by its link structure in order to rank pages by their estimated value to the Web community. This paper reports on the outcome of applying the algorithm to the Web sites of three national university systems in order to test whether it is capable of identifying the most important Web pages. The results are also compared to simple inlink counts. It was discovered that the highest inlinked pages do not always have the highest PageRank, indicating that the two metrics are genuinely different, even for the top pages. More significantly, however, internal links dominated external links for the high ranks in either method and superficial reasons accounted for high scores in both cases. It is concluded that PageRank is not useful for identifying the top pages in a site and that it must be combined with a powerful text matching technique in order to get the quality of information retrieval results provided by Google.

INTRODUCTION

Google's PageRank algorithm (Brin & Page, 1998) for ranking Web pages is an Information Retrieval (IR) algorithm that is relatively well-known to the general public because of its use in the Google Toolbar and in the company's marketing approach, "The heart of our software is PageRank" (Google, 2002). It is also arguably the most influential and successful IR algorithm of the past five years, on the back of the search engine's number one status for online searching according to some measurements (Sullivan, 2002). Despite this, there do not appear to have been many studies focussing on the questions of how effective it is or under which conditions it is effective.

PageRank is based upon the assumption that good quality pages are more likely to be linked to than poor quality ones and therefore that mining information about the link structure of the Web could be more effective at identifying the best pages matching search engine queries than a simple text-matching algorithm. In fact it goes one step further and incorporates the quality of the linking page in its iterative algorithm, described in detail below. In this context two natural questions to ask from a bibliometric perspective are whether the pages that are most highly linked to are significantly different from those that have the highest PageRank, and whether either method is capable of identifying the highest quality or most useful pages in a site. The questions have additional pertinence because of the growing body of informetric research that is based upon link counts (e.g. Larson, 1996; Rousseau, 1997; Ingwersen, 1998; Smith, 1999; Leydesdorff & Curran, 2000; Thelwall, 2001a, b, f, 2002a; Smith & Thelwall, 2002). Potentially, such investigations may benefit from switching to PageRank, or another iterative rating system, in order to take into account in some way the quality of the inlinks rather than just their numbers.

[1] Journal of Documentation 58(6), 2002, to appear.

PAGERANK AND OTHER WEB INFORMATION RETRIEVAL ALGORITHMS

On a mathematical level the PageRank algorithm finds the principal eigenvector of a matrix created from the link structure of the system. More descriptively, the matrix encodes the model of a surfer visiting Web pages in succession. At each page the surfer follows a random link chosen from the current page with probability 0.85 and jumps to a completely random page with probability 0.15. If the surfer is allowed to proceed in this fashion from any starting point for a very long time then the PageRank of any page is defined to be the probability that they are viewing it after any given jump. The ranking system generated favours pages that are the target of many links, since they are more likely to be jumped to. It also weights more highly links from more important source pages since these sources are more likely to be jumped to and, therefore, more likely to originate a new jump. The rationale for the use of links is that they provide additional information about pages that can be used to help decide how important the page is, rather than what its content is about (Brin & Page, 1998). A more mathematical equivalent definition of the PageRank algorithm can be found in Ng et al. (2001) in addition to the original Brin and Page article.
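
The random surfer model can be illustrated with a short power iteration. This is a minimal sketch rather than the program used in the study: it assumes the Brin & Page (1998) convention that the damping value 0.85 is the probability of following a link, it spreads the rank of pages without outlinks evenly over all pages (an assumption), and the function name `pagerank` and the toy link structure are hypothetical.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Estimate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Assumption: rank held by pages with no outlinks is shared by all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for source, targets in links.items():
            if targets:
                # Each outlink receives an equal share of the source's rank.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical four-page site in which every page links to "home".
toy_links = {"home": ["a", "b"], "a": ["home"], "b": ["home"], "c": ["home"]}
ranks = pagerank(toy_links)
```

The returned values sum to one, so each can be read as the long-run probability that the surfer is on that page.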

One other key link-based IR procedure is Kleinberg's (1999) topic distillation algorithm, which is primarily for topic-specific searching. This uses links to decide how important pages are for a specific topic, rather than in general. It works by starting with a query and identifying relevant pages through text semantics, then using the link structure within this collection to allocate pages iteratively (a) an authority score by summing the hub weights of the pages that link to them and (b) a hub score by summing the authority weights of the pages that they link to. PageRank has been shown to be intrinsically the more stable of the two, however, with the Kleinberg algorithm being sensitive to small changes in link structures (Ng et al., 2001).
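
The hub and authority updates can be sketched as follows. This is a simplified illustration of the iteration only, assuming the topic-specific page collection has already been selected by text matching; the function name `hits`, the fixed iteration count and the toy data are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Iterate hub and authority scores over a small link graph."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical topic collection: three link pages pointing at two resources.
topic_links = {"hub1": ["x", "y"], "hub2": ["x", "y"], "hub3": ["x"]}
hub_scores, authority_scores = hits(topic_links)
```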

There is no known study that scientifically demonstrates that either is effective in a clearly defined sense of identifying the best information on the Web, but the success of Google is still a powerful argument for the importance of PageRank. It may well be the case that the various search engine companies perform extensive testing and know the answers to these questions but do not make them available for commercial reasons. The scientific TREC competition results were not promising, however (Hawking et al., 2000), although this could have been due to an untypical test corpus or the variant of PageRank used (Thelwall, 2002b). According to Gao et al. (2001), "[The recent] research of web retrieval has focused on link-based ranking methods. However, none had achieved better results than content-based methods in TREC experiments".

Other research into link-related algorithms serves to confirm the importance of this area (Haveliwala, 1999; Broder et al., 2000; Lifantsev, 2000; Rafiei & Mendelzon, 2000; Richardson & Domingos, 2001). Bharat and Mihaila (2001) for example develop a new algorithm and demonstrate through user evaluations that its performance is comparable with PageRank. Unlike PageRank, however, the other Web IR algorithms integrate the text analysis with link analysis, making them unsuitable for tasks such as finding the 'best' overall pages.

THE RESEARCH QUESTIONS

This paper reports on a study to apply PageRank to databases of the link structures of the Web sites of UK, Australian and New Zealand universities. This algorithm is chosen for its arguable pre-eminence in addition to its suitability for the task of finding the overall best pages on a site. The three nations selected are chosen for the free availability of link data for them and because they represent in international terms relatively early Internet adopters and extensive Web users. The 10 highest ranked pages for each university will be analysed as well as the 100 highest for each national system. Reasons for differences between PageRank and inlink counts will be uncovered from an investigation into the inlink structure of the pages in question. This is essentially an investigative and qualitative bibliometric approach (e.g. Gläser & Laudel, 2001; Goodrum et al., 2001) rather than one of formal scientific hypothesis testing. The theoretical context is the hypothesis that the top ranked pages will either contain high quality content or will be gateways to other useful pages. The two specific questions addressed are as follows.

1. Are the pages given the highest rank by PageRank clearly the most useful or highest quality in the system analysed, or can their high positions be the result of unrelated factors?
2. Is PageRank more successful than simple inlink counts at identifying the top pages?

METHODOLOGY

The link structure of the national university systems was obtained from a publicly available database (cybermetrics.wlv.ac.uk/database) described in detail in Thelwall (2001d) and obtained by an information science Web crawler (Thelwall, 2001c).
This covers the proportion of each Web site that could be found by iteratively following links from the home page, excluding copies of pages from other sources (mirror sites) when identified. Mirror sites are a particular problem because it is not known to what extent Google's spider crawls them. For example there are numerous copies of Sun's Java documentation on UK university Web sites and ideally Google would ignore these and only crawl the original on the Sun Microsystems web site. Any additional copies in Google's database would clearly be wasting space. Nevertheless, identifying and eliminating duplicate pages is a technically challenging job, despite published research on speeding up the process (Heydon & Najork, 1999), because of the sheer size of the Web. The databases used will include some mirror sites that have been missed due to human error, which is possibly similar to the situation for Google. The names of the 156 universities crawled can be obtained from the originating database site, via the domain names files. The link database consists of a separate file for each institution containing the link structure of its web site.

The most challenging part of the research was in writing a program to encode the URLs into numbers for the PageRank algorithm. This was difficult because of the memory taken by the URLs and the number of string comparison operations that were required to ensure that each URL was given a unique number.
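
One way to give each URL a unique number is sketched below. It is not the program written for the study: it swaps the repeated string comparisons for a hash table lookup (a Python dictionary), and the function name `encode_links` and the input format of (source, target) pairs are assumptions made for illustration.

```python
def encode_links(url_pairs):
    """Map each distinct URL to an integer and rewrite links as ID pairs."""
    url_to_id = {}

    def get_id(url):
        # The first time a URL is seen it receives the next unused integer.
        return url_to_id.setdefault(url, len(url_to_id))

    edges = [(get_id(source), get_id(target)) for source, target in url_pairs]
    return url_to_id, edges


# Hypothetical links between pages named in the Wolverhampton results.
pairs = [
    ("www.wlv.ac.uk", "www.wlv.ac.uk/disclaimer/official.html"),
    ("www.scit.wlv.ac.uk", "www.wlv.ac.uk"),
    ("www.scit.wlv.ac.uk", "www.wlv.ac.uk/disclaimer/official.html"),
]
url_ids, id_edges = encode_links(pairs)
```

The dictionary still holds every URL string in memory, so this sketch addresses the comparison cost described above but not the memory cost.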

One combined link structure file was constructed for each national system and used to build a matrix of its link structure, and one separate link structure file was also created for each institution. Each file was then loaded into a new program coded to execute the PageRank algorithm, and ranks obtained from it. The procedure followed was essentially the same as the non-blocked version described by Haveliwala (1999) for small computers, except that no pages were eliminated from the system due to a lack of links. Instead, a correcting factor was incorporated to adjust for the effect of pages in the system without links. Although in the largest case the full link structure matrix would have been too big to store as an array, with 4 × 10^14 entries, it could in fact be stored with only 2 × 10^7 entries as a sparse matrix, recording the location of the non-zero entries, with unrecorded locations being implicitly zero.
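
The sparse storage idea can be sketched as below. The paper does not specify its correcting factor for pages without links, so the even redistribution used here is an assumption, as are the names `build_sparse` and `multiply` and the four-page example.

```python
from collections import defaultdict


def build_sparse(edges, n_pages):
    """Store only the non-zero entries of the link matrix, keyed by (row, col)."""
    out_links = defaultdict(set)
    for source, target in edges:
        out_links[source].add(target)
    entries = {}
    for source, targets in out_links.items():
        weight = 1.0 / len(targets)
        for target in targets:
            entries[(target, source)] = weight
    # Pages with no outlinks have an all-zero column in the matrix.
    dangling = [p for p in range(n_pages) if p not in out_links]
    return entries, dangling


def multiply(entries, dangling, vector):
    """One matrix-vector product that touches only the stored entries."""
    n = len(vector)
    result = [0.0] * n
    for (row, col), weight in entries.items():
        result[row] += weight * vector[col]
    # Assumed correction: rank on pages without links is shared by all pages.
    lost = sum(vector[p] for p in dangling)
    return [value + lost / n for value in result]


# Hypothetical four-page system; page 3 has no links at all.
edges = [(0, 1), (0, 2), (1, 0), (2, 0)]
entries, dangling = build_sparse(edges, n_pages=4)
step = multiply(entries, dangling, [0.25] * 4)
```

Here 4 entries are stored instead of the 16 a full array would need, and the correction keeps the total rank equal to one after the multiplication.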

The PageRank list was combined with the URL key file and sorted to produce two top 10 lists for each institution, one for PageRanks and one for inlinks. Similar top 100 lists were produced for each whole system. Table 1 summarises basic information about the databases. The UK database is just over 10% of the size of the original Brin & Page (1998) corpus.

Table 1. Information about the databases used

Country                          Australia           New Zealand   UK
University Web sites included    38                  8             110
Crawl dates                      10/2001 - 1/2002    1-2/2002      6-7/2001
Total pages                      3,804,612           341,667       6,920,448
Total links                      20,054,017          2,119,677     32,516,604

The first analysis was a simple calculation to see whether PageRank was more effective at identifying useful pages than inlink counts. The two assumptions made are that (a) the most useful pages are institutional home pages and that (b) these are normally the root pages of their own domain names. Based upon these assumptions, the test applied was to see which of the two mechanisms ranked this type of URL most highly. As an aside, home page finding is a recognised IR task, for which links have been found useful (Xi & Fox, 2001).

The second analysis is a large combined experiment to determine whether PageRank or inlink counts reveal the most important pages on a site, and whether one appears to be better than the other. The investigation is conducted by using tables of the top pages from both methods and evaluating these qualitatively. Separate results are reported for individual universities and for national university systems. These are potentially significantly different entities under the hypothesis that links between institutions carry a higher information value than those within a single site (Kleinberg, 1999; Thelwall, 2001a).

RESULTS AND DISCUSSION

Home page ranking

As can be seen in Table 2, there is no significant difference between the success of raw inlink counts and PageRank in the rank of the university home page, based only upon the link structure of the university Web site on its own. In almost all of the tied cases the home pages were number one in both lists. In only five cases were the home pages not in the top ten of either list. No statistical test is needed to see that the differences are negligible, but standard hypothesis tests for proportions would show this.
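
As one illustration of such a check, the sketch below applies an exact two-sided sign test, which is a specific choice made here and not a test named in the paper. It uses the Table 2 totals across the three systems: 15 home pages ranked higher by PageRank and 14 ranked higher by inlink counts, with ties and unranked cases left out. The function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test for whether two methods win equally often."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Table 2 totals: PageRank higher in 15 cases, inlink counts higher in 14.
p_value = sign_test_p_value(15, 14)
```

A p-value far above 0.05, as obtained here, is consistent with the statement that the difference between the two methods is negligible.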

Table 2. A comparison of the ranking of university home pages produced by PageRank and by simple inlink counts operating on each university Web site on its own

System        Higher with PageRank   Higher with inlink counts   Same with both methods   Not in the top 10 in either list
Australia     1                      3                           33                       1
New Zealand   2                      1                           5                        0
UK            12                     10                          84                       4

Individual universities

The top ten lists of individual universities were examined for patterns.
In many lists there was a group of closely related URLs that came from a large subsite with a navigation bar linking to the main pages. Often these were the main official pages of the site, as is the case for La Trobe University, shown in Table 3. The dominance of the main pages in this case is caused by the existence of a standard links bar at the top of all official pages. A big difference can be seen between the results from this site and those of Wolverhampton (Table 4), where the main pages were not indexed due to their use of Active Server Pages queries. The official links bar for other pages uses a server side map that is also not indexable, although the URL of the map can be seen ranked third in the table. This is a clear case of design decisions dominating the top results of the PageRank calculation for individual universities.

Table 3. The ten highest ranked pages for La Trobe, using PageRank on internal links only, with PageRanks linearly scaled to make the largest equal to 1

Count   PageRank   Page
8952    1          www.latrobe.edu.au
10058   0.513953   www.latrobe.edu.au/international
9910    0.506899   www.latrobe.edu.au/search
9862    0.505285   www.latrobe.edu.au/contact
9966    0.505202   www.latrobe.edu.au/about
9858    0.504549   www.latrobe.edu.au/sitemap
9909    0.50315    www.latrobe.edu.au/teaching
9903    0.502879   www.latrobe.edu.au/research
9903    0.502845   www.latrobe.edu.au/faculties
9902    0.502827   www.latrobe.edu.au/campuses

Table 4. The ten highest ranked pages in Wolverhampton, using PageRank on internal links only, PageRanks scaled

Count   PageRank    Page
3171    1           www.wlv.ac.uk/disclaimer/official.html
3037    0.7812036   www.wlv.ac.uk
2802    0.6703956   www.wlv.ac.uk/resources/uni.nav.bar.map
1898    0.5537226   www.scit.wlv.ac.uk/appdocs/php
4474    0.4495484   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/overview-summary.html
4475    0.4286861   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api
4469    0.4277823   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/deprecated-list.html
4468    0.427765    www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/index-files/index-1.html
4468    0.4277458   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/help-doc.html
811     0.374287    www.wlv.ac.uk/disclaimer/personal.html

In addition to instances of domination by official pages, other cases were also found where sets of computer documentation or other types of large subsite had high interlinking patterns. This can be seen in Table 4 and is also illustrated for the case of the Royal Melbourne Institute of Technology (RMIT), as shown in Table 5. It can be seen that the main pages on a large site for PHP: Hypertext Preprocessor (PHP, a recursive acronym), a server-side scripting language for Web pages, have attracted a large number of links, in actual fact from the standard navigation links found on all other pages of this large set of documentation. This is a case where the sheer size of the resource and its inclusion of a standard set of navigational links have combined to give its key pages a huge inlink count. It appears to be for use only in one student course and is a copy of documentation produced elsewhere, so from an external Web user's point of view it would not be considered as important content on the RMIT site. The fifth page in Table 6 is the home page of the RMIT Research and Development Section, which hosts a large subsite with a link to their home page on each page. This is an example of a similar phenomenon: the internal size of a subsite determining the rank of its home page.

Table 5. The ten most linked to pages in RMIT, counting only internal links, PageRanks scaled

Count   PageRank    Page
12485   1           www.rmit.edu.au
6829    0.5016145   www.rmit.edu.au/webmaster/disclaimer.html
3286    0.4388598   www.viscom.rmit.edu.au/robin/talks.htm
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/downloads.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/docs.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/faq.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/support.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/bugs.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/links.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/copyright.php

In terms of the difference between high inlink count pages and those with a high PageRank, a comparison of Table 5 and Table 6 shows that there can be real differences. Many of the RMIT pages in Table 5 have a relatively low PageRank as a result of each linking page containing a large number of other links, which dissipates the effect of each individual link through sharing 'rank' between all targets of a page. Some pages in Table 6 have about half as many inlinks but more than double the PageRank because the pages that link to them have fewer overall links.
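
The dissipation effect can be made concrete with a small calculation. The figures below are hypothetical, not taken from the RMIT data, and the sketch assumes every linking page has the same rank and that a link passes on 0.85 of the source's rank divided by its number of outlinks; the function name `rank_received` is also an assumption.

```python
def rank_received(n_inlinks, source_rank, source_outlinks, damping=0.85):
    """Rank passed to a page by n_inlinks identical linking pages."""
    # Each linking page splits its rank equally between all of its outlinks.
    return n_inlinks * damping * source_rank / source_outlinks


# Hypothetical page A: 2000 inlinks from pages carrying 50 links each.
page_a = rank_received(2000, source_rank=1e-5, source_outlinks=50)
# Hypothetical page B: 1000 inlinks from pages carrying 10 links each.
page_b = rank_received(1000, source_rank=1e-5, source_outlinks=10)
```

Page B has half the inlinks of page A but receives more than twice the rank, because each of its linking pages shares its rank among fewer targets.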

Table 6. The ten highest ranked pages in RMIT, using PageRank on internal links only, PageRanks scaled

Count   PageRank    Page
12485   1           www.rmit.edu.au
6829    0.5016145   www.rmit.edu.au/webmaster/disclaimer.html
3286    0.4388598   www.viscom.rmit.edu.au/robin/talks.htm
1248    0.1176734   www.homepages.eu.rmit.edu.au/bondy/saskiabondyfamtreesite/persons.html
666     0.0819186   www.rmit.edu.au/departments/rd
1097    0.0796801   bonza.rmit.edu.au
1084    0.0786278   bonza.rmit.edu.au/search.html
1084    0.0786278   bonza.rmit.edu.au/essays
1083    0.0785995   bonza.rmit.edu.au/links
1083    0.0785995   bonza.rmit.edu.au/contact.html

The number one page in Table 4 and the number two page in Table 5 demonstrate another feature of both PageRank and inlink counts: the high scores attained by pages that have a legal function with regard to Web content. At the University of Wolverhampton, the page with the highest PageRank is the legal disclaimer that all official pages are supposed to contain a link to. The RMIT disclaimer page clearly also enjoys a similar status. This is a problem from an IR or bibliometric perspective, as the page does not contain information of unusually high value. There are also several highly ranked copyright pages in other university lists (see Tables 8 and 9).

National systems

Tables 7 to 9 give the top 10 pages from each national system, after applying PageRank to their combined link structure files. The top 100 pages were produced in each case but the rest are not shown for reasons of space. Although the university home pages in each list are natural inclusions, none of the other pages could be regarded as containing unusually useful information; rather, they owe their position to a relatively ephemeral cause such as the ones discussed above. The UK's top page is a case in point. It attracts only internal links from its own site and is linked to from a huge collection of pages, each containing a description of one of the modules taught at the University of Staffordshire, all of which contain only one link. Ironically, the link appears to be an automatically inserted typo (the home page has an additional underscore: www.staffs.ac.uk/schools/art_and_design) and the link in question is non-functioning because there is no content between the start and end of the anchor tag. The lack of sharing with other links, however, is the main factor that has led to a high PageRank.

Table 7. Australian top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

Count   PageRank    Toolbar           Page
23304   1           6 (moved)         www.unimelb.edu.au/pwebstats/pwebstats.html
32940   0.3007683   8                 www.unimelb.edu.au
44129   0.2149047   7                 www.monash.edu.au
22502   0.212978    8                 www.unimelb.edu.au/disclaimer
10827   0.1948821   7                 www.csse.monash.edu.au/disclaimers/user.html
18157   0.1839759   (not available)   www.gu.edu.au/cgi-bin/textflip.cgi
28977   0.1825789   8                 www.uq.edu.au
7525    0.1820394   5                 www.educ.utas.edu.au
18989   0.1717772   7                 www.unisa.edu.au
34341   0.1686863   8                 www.unsw.edu.au

Table 8. New Zealand top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

Count   PageRank    Toolbar   Page
16990   1           3         www.otago.ac.nz/sas/common/images/copyrite.htm
7561    0.2361207   7         www.otago.ac.nz
6611    0.2183746   7         www.vuw.ac.nz
7953    0.2012788   7         www.massey.ac.nz/disclaim.htm
9723    0.189337    7         www.massey.ac.nz
2808    0.185448    5         webview.massey.ac.nz
2807    0.1854373   5         webview.massey.ac.nz/help/help.htm
5254    0.1850909   7         www.canterbury.ac.nz
2185    0.1621233   6         nix.tmk.auckland.ac.nz/SAL
3079    0.1488084   7         www.auckland.ac.nz

Table 9. UK top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

Count   PageRank    Toolbar             Page
3843    1           4 (not available)   www.staffs.ac.uk/schools/art_anddesign
22691   0.9747133   8                   www.st-and.ac.uk
17160   0.9735287   0                   www.cc.ic.ac.uk/college/onlinedocs/sasonlinedocv8/sasdoc/sashtml/common/images/copyrite.htm
12841   0.9626068   4                   bicss.mdx.ac.uk/css/public
26276   0.9160221   7                   www.napier.ac.uk
3477    0.9152157   unranked            www.aom.bham.ac.uk/handbook/courses/glossary.htm
3464    0.9122335   3                   www.ao.bham.ac.uk/handbook/courses/glossary.htm
27982   0.8607687   7                   www.ulst.ac.uk
16851   0.8305826   8                   www.leeds.ac.uk
18761   0.7950215   7                   www-maths.mcs.st-and.ac.uk

The top Australian page is of a type not mentioned before, a Web statistics software home page, and this particular example is from the former site of Martin Gleeson's pwebstats program that attracts large numbers of links from server statistics pages generated by the software. There are similarly highly inlinked pages in the UK (Thelwall, 2002b). As can also be seen, there are help and glossary pages in the New Zealand and UK lists respectively. These may be useful in the context of the pages that link to them but probably much less so for the wider Web user. Also present are two departmental home pages, both as a result of credit links on large collections of pages. In the case of St Andrews, for example, the links come predominantly from the pages of an online history of maths archive. The ninth ranked New Zealand page is from a mirror copy of the Scientific Applications on Linux site, getting its links from within its own site. The URL is case-sensitive, hence the mixed case version shown in Table 8.

The pages referenced here were all loaded into a Web browser on the 19th of February, 2002 with Google's toolbar (toolbar.google.com) installed so that the PageRank feature could be used. This gives a number between 0 and 10 for each loaded page. Clearly these are not PageRanks as could be obtained directly from Brin & Page's algorithm, since the unmodified values would all be less than one, but the use of this word by Google to describe the displayed figures gives some cause to believe that they are related in a monotonic way (i.e. larger toolbar values come from larger PageRank values). The results of this exercise led to the discovery that in some cases the toolbar PageRank figure given was applied to the domain and automatically reduced by one for each directory in the path of the URL, so that longer URLs tended to have lower 'PageRanks', irrespective of high inlinking, as seen in Tables 7 to 9. This was the case for the zero ranked page in the UK list, for example. It is surmised, then, that either the PageRank algorithm has been modified for the current version of Google, or the toolbar uses an approximate non-monotonic version of it in certain situations (i.e. it reverses the relative ranks of some pages).

CONCLUSIONS

It is very clear from the data that the top ranked pages, either with PageRank or with raw inlink counts, are there primarily as a result of navigational architecture policy decisions rather than on their own individual merit. Perhaps the clearest example of this is a comparison of the Ulster University home page with that of Wolverhampton. The former attracts the highest inlink count of all UK pages whereas the latter attracts relatively few as a result of global site design decisions. It must be concluded from this that PageRank and inlink counts are not reliable methods for ascertaining the most valuable resources on a large university Web site or national system of university sites.

Contrasting these findings with those of Thelwall (2002b) it can be seen that the root cause of the problems is the inclusion of internal links. Inlink counts based upon external links only yield much "better" results, although still not perfect. This is a real problem for the PageRank algorithm because it depends on internal links to function, for example redistributing link votes from the home page of a site to links on its other pages, as would be needed for PageRank to propagate from an important multipage site, such as the Humbul Humanities Hub (www.humbul.ac.uk). Comparing the UK top 100 results with those for external links only (Thelwall, 2002b) it can be seen that the two are fundamentally different. Both contain many university home pages but the latter does not contain any of the other pages typically found on the standard internal navigation bar. Perhaps the most damning evidence is that the single page that is probably the most widely used resource on a UK Web site, the UK clickable map, does not appear at all in either UK top 100 reported here, despite having 891 external links from other UK universities.
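
The distinction between all inlinks and external inlinks can be sketched as follows. The rule used to decide which site a URL belongs to (the last three labels of the host name, such as wlv.ac.uk) is an assumption that happens to suit ac.uk, edu.au and ac.nz domains, and the function names and sample links are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_of(url):
    """Assumed site rule: the last three labels of the host name."""
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    return ".".join(host.split(".")[-3:])


def count_inlinks(link_pairs):
    """Count inlinks per target page, overall and from other sites only."""
    all_counts, external_counts = Counter(), Counter()
    for source, target in link_pairs:
        all_counts[target] += 1
        if site_of(source) != site_of(target):
            external_counts[target] += 1
    return all_counts, external_counts


# Hypothetical links: a disclaimer linked only from within its own site,
# and a gateway page linked only from other universities.
sample = [
    ("www.wlv.ac.uk/a.html", "www.wlv.ac.uk/disclaimer/official.html"),
    ("www.wlv.ac.uk/b.html", "www.wlv.ac.uk/disclaimer/official.html"),
    ("www.scit.wlv.ac.uk/c.html", "www.wlv.ac.uk/disclaimer/official.html"),
    ("www.leeds.ac.uk/links.html", "www.humbul.ac.uk"),
    ("www.napier.ac.uk/links.html", "www.humbul.ac.uk"),
]
all_inlinks, external_inlinks = count_inlinks(sample)
```

In this toy data the disclaimer page leads on total inlinks but has no external inlinks at all, while the gateway page keeps its full count when internal links are excluded.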

In the context of the results presented here it is hard to believe that plain PageRank is effective as an IR algorithm, even when combined with simple text matching. The fundamental problem is the allocation of equal weight to internal links as external ones and the loophole that this gives to allow navigational policy decisions to swamp the relatively small number of links created for non-navigational reasons. The algorithm may be more effective on a global scale, where there are relatively more external links, but the difference will not be of the order of magnitude needed to make a real difference for university Web sites. It may be, however, that huge sites such as Yahoo! do improve the results by bestowing higher PageRanks on the better Web pages, but this would affect only a few pages on each university site. The original Google patent (Page, 1998) does mention myriad potential customisations of the algorithm, including treating inter-site links differently, and so it is likely that the version in use at the time of testing was different from the original.

Another point that should be mentioned is that this analysis has been confined to the top ten or 100 pages of each set and therefore ignores the overwhelming majority of pages. Nevertheless, it can be seen that the same kinds of arguments can also apply for these: pages that are part of a highly interlinking navigational structure will rank much better than others that are the target of only one or two external links, even though such links are probably a much better indicator of high quality. It could be argued that a high degree of interlinking is a good indicator of quality at least in site design, but in this case the PageRank is still totally dependent on the absolute number of pages involved and the extent to which links are also present to other resources.

Despite the number-intensive mathematical algorithms used to produce the data presented here, this has been essentially a qualitative study. This is not seen as a weakness in the context of the very clear-cut nature of the results obtained. Indeed the qualitative approach, focussing on investigating the cause of the problems, has enabled the gaining of insights into the reasons why the ranking methods have been unable to produce convincing lists of the top pages in the sites covered.

In conclusion, PageRank is not an effective method for identifying the "best" Web pages in a university system because of its domination by internal links, an argument that would still apply even if all mirror sites had been removed from the data. The astonishing accuracy of Google (Thelwall, 2002c) must be due to its complementary use of a very effective text-based matching algorithm, which must itself be incorporated into the final ranks. A promising future direction for bibliometric research is to develop a variant of PageRank that can harness the potential of its system for transferring rank iteratively through links in a way that would not be dominated by internal site links.

ACKNOWLEDGEMENTS

I would like to thank the referees for their very helpful comments.

REFERENCES

Bharat, K. & Mihaila, G. A. (2001). When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. In: Tenth International World Wide Web Conference. Available at: http://www.www10.org/cdrom/papers/474/index.html
Brin, S. & Page, L. (1998). The Anatomy of a large scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117. Available at: http://citeseer.nj.nec.com/brin98anatomy.html
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. & Wiener, J. (2000). Graph structure in the web. Computer Networks, 33(1-6), 309-320.
Gao, J., Walker, S., Robertson, S., Cao, G., He, H., Zhang, M. & Nie, J-Y. (2001). TREC-10 Web Track Experiments at MSRA. TREC 2001, pp. 384-392. Available: http://trec.nist.gov/pubs/trec10/t10_proceedings.html
Gläser, J. & Laudel, G. (2001). Integrating scientometric indicators into sociological studies: methodical and methodological problems. Scientometrics, 52(3), 411-434.
Goodrum, A. A., McCain, K. W., Lawrence, S. & Giles, C. L. (2001). Scholarly publishing in the Internet age: a citation analysis of computer science literature. Information Processing and Management, 37(5), 661-676.
Google (2002). Google Technology. Available: http://www.google.com/technology/. Accessed 3 July, 2002.
Haveliwala, T. (1999). Efficient Computation of PageRank. Stanford University Technical Report. Available: http://dbpubs.stanford.edu:8090/pub/1999-31
Hawking, D., Bailey, P. & Craswell, N. (2000). ACSys TREC-8 experiments. In: Information Technology: Eighth Text REtrieval Conference (TREC-8), NIST, Gaithersburg, MD, USA, pp. 307-315.
Heydon, A. & Najork, M. (1999). Mercator: A scalable, extensible Web crawler. World Wide Web, 2, 219-229.
Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.
Larson, R. R. (1996). Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace. ASIS 96. Available at: http://sherlock.berkeley.edu/asis96/asis96.html
Leydesdorff, L. & Curran, M. (2000). Mapping university-industry-government relations on the Internet: the construction of indicators for a knowledge-based economy. Cybermetrics, 4. Available at: http://www.cindoc.csic.es/cybermetrics/articles/v4i1p2.html
Lifantsev, M. (2000). Voting model for ranking Web pages. In Graham, P. & Maheswaran, M. (eds), Proceedings of the International Conference on Internet Computing, Las Vegas, Nevada, USA, CSREA Press, pp. 143-148.
Ng, A. Y., Zheng, A. X. & Jordan, M. I. (2001). Stable algorithms for link analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), New York: ACM Press, pp. 258-266.
Page, L. (1998). United States Patent 6,285,999. Available: http://patft.uspto.gov/
Rafiei, D. & Mendelzon, A. O. (2000). What is this page known for? Computing Web page reputations. Computer Networks, 33(1-6), 823-835.
Richardson, M. & Domingos, P. (2001). The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. Poster at Neural Information Processing Systems: Natural and Synthetic 2001. Available at: http://www.cs.washington.edu/homes/mattr/doc/NIPS2001/qd-pagerank.pdf
Rousseau, R. (1997). Sitations, an exploratory study. Cybermetrics, 1. Available: http://www.cindoc.csic.es/cybermetrics/articles/v1i1p1.html
Smith, A. G. (1999). A tale of two web spaces: comparing sites using Web Impact Factors. Journal of Documentation, 55(5), 577-592.
Smith, A. & Thelwall, M. (2002, to appear). Web Impact Factors for Australasian Universities. Scientometrics, 54(1-2).
Sullivan, D. (2002). Google Tops In "Search Hours" Ratings. Available: http://searchenginewatch.com/sereport/02/05-ratings.html. Accessed 3 July, 2002.
Thelwall, M. (2001a). Extracting macroscopic information from web links. Journal of the American Society for Information Science and Technology, 52(13), 1157-1168.
Thelwall, M. (2001b, to appear). Evidence for the existence of geographic trends in university web site interlinking. Journal of Documentation.
Thelwall, M. (2001c). A web crawler design for data mining. Journal of Information Science, 27(5), 319-325.
Thelwall, M. (2001d). A publicly accessible database of UK university website links and a discussion of the need for human intervention in web crawling. University of Wolverhampton. Available: http://www.scit.wlv.ac.uk/~cm1993/papers/a_publicly_accessible_database.pdf
Thelwall, M. (2001e). The top 100 linked pages on UK university Web sites: high inlink counts are not associated with quality scholarly content. University of Wolverhampton.
Thelwall, M. (2001f). Results from a Web Impact Factor crawler. Journal of Documentation, 57(2), 177-191.
Thelwall, M. (2002a). A comparison of sources of Links for academic Web Impact Factor calculations. Journal of Documentation, 58, 60-72.
Thelwall, M. (2002b). Subject gateway sites and search engine ranking. Online Information Review, 26(2), 124-138.
Thelwall, M. (2002c, to appear). In praise of Google: finding law journal Web sites. Online Information Review, 26(4).
Xi, W. & Fox, E. A. (2001). Machine Learning Approach for Homepage Finding Task. TREC 2001, pp. 686-697. Available: http://trec.nist.gov/pubs/trec10/t10_proceedings.html