CAN GOOGLE'S PAGERANK BE USED TO FIND THE MOST IMPORTANT ACADEMIC WEB PAGES?

Mike Thelwall [1]
m.thelwall@wlv.ac.uk
School of Computing and Information Technology, University of Wolverhampton, 35/49 Lichfield Street, Wolverhampton WV1 1EQ, UK
Phone: +44 1902 321470; Fax: +44 1902 321478

Google's PageRank is an influential algorithm that uses a model of Web use that is dominated by its link structure in order to rank pages by their estimated value to the Web community. This paper reports on the outcome of applying the algorithm to the Web sites of three national university systems in order to test whether it is capable of identifying the most important Web pages. The results are also compared to simple inlink counts. It was discovered that the highest inlinked pages do not always have the highest PageRank, indicating that the two metrics are genuinely different, even for the top pages. More significantly, however, internal links dominated external links for the high ranks in either method and superficial reasons accounted for high scores in both cases. It is concluded that PageRank is not useful for identifying the top pages in a site and that it must be combined with a powerful text matching technique in order to get the quality of information retrieval results provided by Google.

INTRODUCTION

Google's PageRank algorithm (Brin & Page, 1998) for ranking Web pages is an Information Retrieval (IR) algorithm that is relatively well-known to the general public because of its use in the Google Toolbar and in the company's marketing approach, "The heart of our software is PageRank" (Google, 2002). It is also arguably the most influential and successful IR algorithm of the past five years, on the back of the search engine's number one status for online searching according to some measurements (Sullivan, 2002). Despite this, there do not appear to have been many studies focussing on the questions of how effective it is or under which conditions it is effective.

PageRank is based upon the assumption that good quality pages are more likely to be linked to than poor quality ones and therefore that mining information about the link structure of the Web could be more effective at identifying the best pages matching search engine queries than a simple text-matching algorithm. In fact it goes one step further and incorporates the quality of the linking page in its iterative algorithm, described in detail below. In this context two natural questions to ask from a bibliometric perspective are whether the pages that are most highly linked to are significantly different from those that have the highest PageRank, and whether either method is capable of identifying the highest quality or most useful pages in a site.
The questions have additional pertinence because of the growing body of informetric research that is based upon link counts (e.g. Larson, 1996; Rousseau, 1997; Ingwersen, 1998; Smith, 1999; Leydesdorff & Curran, 2000; Thelwall, 2001a, b, f, 2002a; Smith & Thelwall, 2002). Potentially, such investigations may benefit from switching to PageRank, or another iterative rating system, in order to take into account in some way the quality of the inlinks rather than just their numbers.

[1] Journal of Documentation 58(6), 2002, to appear.

PAGERANK AND OTHER WEB INFORMATION RETRIEVAL ALGORITHMS

On a mathematical level the PageRank algorithm finds the principal eigenvector of a matrix created from the link structure of the system. More descriptively, the matrix encodes the model of a surfer visiting Web pages in succession. At each page the surfer follows a random link chosen from the current page with probability 0.85 and jumps to a completely random page with probability 0.15. If the surfer is allowed to proceed in this fashion from any starting point for a very long time then the PageRank of any page is defined to be the probability that they are viewing it after any given jump. The ranking system generated favours pages that are the target of many links, since they are more likely to be jumped to. It also weights more highly links from more important source pages since these sources are more likely to be jumped to and, therefore, more likely to originate a new jump. The rationale for the use of links is that they provide additional information about pages that can be used to help decide how important the page is, rather than what its content is about (Brin & Page, 1998). A more mathematical equivalent definition of the PageRank algorithm can be found in Ng et al. (2001) in addition to the original Brin and Page article.
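
The random surfer model can be illustrated with a minimal sketch. It assumes the Brin & Page (1998) convention in which the damping factor 0.85 is the probability of following a link, and it assumes every page has at least one outlink. The function name `pagerank` and the three-page `toy_web` are hypothetical and are not taken from the study.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for the random surfer model.

    links maps each page to the list of pages it links to. Every page is
    assumed to appear as a key and to have at least one outlink.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Each page first receives its share of the random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for source, targets in links.items():
            # A page splits its rank equally between all of its link targets.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page site: both subpages link back to the home page.
toy_web = {
    "home": ["about", "research"],
    "about": ["home"],
    "research": ["home"],
}
ranks = pagerank(toy_web)
```

In this toy site the home page ends up with the largest share of the probability, because it is the only target of both other pages.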

One other key link based IR procedure is Kleinberg's (1999) topic distillation algorithm, which is primarily for topic-specific searching. This uses links to decide how important pages are for a specific topic, rather than in general. It works by starting with a query and identifying relevant pages through text semantics, then using the link structure within this collection to allocate pages iteratively (a) an authority score by summing the weights of the incoming link pages and (b) a hub score by summing the weights of the outgoing link target pages. PageRank has been shown to be intrinsically the more stable of the two, however, with the Kleinberg algorithm being sensitive to small changes in link structures (Ng et al., 2001).
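
The hub and authority iteration can be sketched as follows. This is a simplified illustration that assumes the topic-relevant page collection has already been selected; the query and text matching stage is omitted. The function name `hits` and the four-page `topic_pages` graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Iteratively assign hub and authority scores to a page collection."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # (a) Authority score: sum of the hub scores of the linking pages.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # (b) Hub score: sum of the authority scores of the link targets.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Rescale both score sets so the values do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical topic collection: pages a and b both point to page c.
topic_pages = {"a": ["c"], "b": ["c"], "c": ["d"]}
hub_scores, authority_scores = hits(topic_pages)
```

Here page c receives the highest authority score and pages a and b share the highest hub score.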

There is no known study that scientifically demonstrates that either is effective in a clearly defined sense of identifying the best information on the Web, but the success of Google is still a powerful argument for the importance of PageRank. It may well be the case that the various search engine companies perform extensive testing and know the answers to these questions but do not make them available for commercial reasons. The scientific TREC competition results were not promising, however (Hawking et al., 2000), although this could have been due to an untypical test corpus or the variant of PageRank used (Thelwall, 2002b). According to Gao et al. (2001), "[The recent] research of web retrieval has focused on link-based ranking methods. However, none had achieved better results than content-based methods in TREC experiments".

Other research into link-related algorithms serves to confirm the importance of this area (Haveliwala, 1999; Broder et al., 2000; Lifantsev, 2000; Rafiei & Mendelzon, 2000; Richardson & Domingos, 2001). Bharat and Mihaila (2001) for example develop a new algorithm and demonstrate through user evaluations that its performance is comparable with PageRank. Unlike PageRank, however, the other Web IR algorithms integrate the text analysis with link analysis, making them unsuitable for tasks such as finding the 'best' overall pages.

THE RESEARCH QUESTIONS

This paper reports on a study to apply PageRank to databases of the link structures of the Web sites of UK, Australian and New Zealand universities. This algorithm is chosen for its arguable pre-eminence in addition to its suitability for the task of finding the overall best pages on a site. The three nations selected are chosen for the free availability of link data for them and because they represent in international terms relatively early Internet adopters and extensive Web users. The 10 highest ranked pages for each university will be analysed as well as the 100 highest for each national system. Reasons for differences between PageRank and inlink counts will be uncovered from an investigation into the inlink structure of the pages in question. This is essentially an investigative and qualitative bibliometric approach (e.g. Gläser & Laudel, 2001; Goodrum et al., 2001) rather than one of formal scientific hypothesis testing. The theoretical context is the hypothesis that the top ranked pages will either contain high quality content or will be gateways to other useful pages. The two specific questions addressed are as follows.

1. Are the pages given the highest rank by PageRank clearly the most useful or highest quality in the system analysed, or can their high positions be the result of unrelated factors?
2. Is PageRank more successful than simple inlink counts at identifying the top pages?

METHODOLOGY

The link structure of the national university systems was obtained from a publicly available database (cybermetrics.wlv.ac.uk/database) described in detail in Thelwall (2001d) and obtained by an information science Web crawler (Thelwall, 2001c).
This covers the proportion of each Web site that could be found by iteratively following links from the home page, excluding copies of pages from other sources (mirror sites) when identified. Mirror sites are a particular problem because it is not known to what extent Google's spider crawls them. For example there are numerous copies of Sun's Java documentation on UK university Web sites and ideally Google would ignore these and only crawl the original on the Sun Microsystems web site. Any additional copies in Google's database would clearly be wasting space. Nevertheless, identifying and eliminating duplicate pages is a technically challenging job, despite published research on speeding up the process (Heydon & Najork, 1999), because of the sheer size of the Web. The databases used will include some mirror sites that have been missed due to human error, which is possibly similar to the situation for Google. The names of the 156 universities crawled can be obtained from the originating database site, via the domain names files. The link database consists of a separate file for each institution containing the link structure of its web site.

The most challenging part of the research was in writing a program to encode the URLs into numbers for the PageRank algorithm. This was difficult because of the memory taken by the URLs and the number of string comparison operations that were required to ensure that each URL was given a unique number.
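
One common way to give each URL a unique number is a hash table lookup, which avoids comparing every new URL against all earlier ones. The sketch below is an illustration only and is not the program used in the study; the function name `encode_links` and the sample link pairs are assumptions.

```python
def encode_links(url_pairs):
    """Map each distinct URL to an integer and rewrite the links as ID pairs."""
    url_ids = {}
    edges = []
    for source, target in url_pairs:
        # setdefault stores the next free number the first time a URL is seen
        # and returns the existing number on every later occurrence.
        source_id = url_ids.setdefault(source, len(url_ids))
        target_id = url_ids.setdefault(target, len(url_ids))
        edges.append((source_id, target_id))
    return url_ids, edges


# Illustrative (source, target) link pairs, not taken from the database.
sample_links = [
    ("www.wlv.ac.uk", "www.wlv.ac.uk/disclaimer/official.html"),
    ("www.scit.wlv.ac.uk", "www.wlv.ac.uk"),
]
url_ids, edges = encode_links(sample_links)
```

The dictionary `url_ids` plays the role of a URL key file: it can be inverted later to turn page numbers back into URLs.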

One combined link structure file was constructed for each national system and used to build a matrix of its link structure, and one separate link structure file was also created for each institution. Each file was then loaded into a new program coded to execute the PageRank algorithm, and ranks obtained from it. The procedure followed was essentially the same as the non-blocked version described by Haveliwala (1999) for small computers, except that no pages were eliminated from the system due to a lack of links. Instead, a correcting factor was incorporated to adjust for the effect of pages in the system without links. Although in the largest case the full link structure matrix would have been too big to store as an array, with [...] entries, it could in fact be stored with only [...] entries as a sparse matrix, recording the location of the non-zero entries, with unrecorded locations being implicitly zero.
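
A sparse representation can be sketched by storing only the list of links, so that absent entries are implicitly zero. The correcting factor for pages without links is not specified in the text; the sketch below assumes one common choice, spreading the rank held by such pages evenly over all pages. The function name `sparse_pagerank` and the sample edge list are hypothetical.

```python
def sparse_pagerank(n, edges, damping=0.85, iterations=100):
    """PageRank over an edge list of (source_id, target_id) pairs.

    Only the non-zero entries of the link matrix (the links) are stored.
    Pages without outlinks have their rank shared equally among all pages,
    which is an assumed correction rather than the one used in the study.
    """
    out_degree = [0] * n
    for source, _ in edges:
        out_degree[source] += 1
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank currently held by pages that have no outlinks.
        dangling = sum(rank[i] for i in range(n) if out_degree[i] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = [base] * n
        for source, target in edges:
            new_rank[target] += damping * rank[source] / out_degree[source]
        rank = new_rank
    return rank


# Hypothetical four-page site; page 3 has no outlinks.
sample_edges = [(0, 1), (0, 2), (1, 0), (2, 0), (0, 3)]
sparse_ranks = sparse_pagerank(4, sample_edges)
```

Memory use grows with the number of links rather than with the square of the number of pages, which is the point of the sparse form.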
The PageRank list was combined with the URL key file and sorted to produce two top 10 lists for each institution, one for PageRanks and one for inlinks. Similar top 100 lists were produced for each whole system. Table 1 summarises basic information about the databases. The UK database is just over 10% of the size of the original Brin & Page (1998) corpus.

Table 1. Information about the databases used

Country                          Australia        New Zealand   UK
University Web sites included    38               8             110
Crawl dates                      10/2001-1/2002   1-2/2002      6-7/2001
Total pages                      3,804,612        341,667       6,920,448
Total links                      20,054,017       2,119,677     32,516,604

The first analysis was a simple calculation to see whether PageRank was more effective at identifying useful pages than inlink counts.
The two assumptions made are that (a) the most useful pages are institutional home pages and that (b) these are normally the root pages of their own domain names. Based upon these assumptions, the test applied was to see which of the two mechanisms ranked this type of URL most highly. As an aside, home page finding is a recognised IR task, for which links have been found useful (Xi & Fox, 2001).
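
The home page test can be sketched as a check for URLs with no path component, followed by a search for the first such URL in a ranked list. This is a simplification: it treats the root page of any domain as a home page, and the helper names `is_domain_root` and `home_page_position` are hypothetical. The sample list uses the three highest ranked Wolverhampton URLs reported in Table 4.

```python
def is_domain_root(url):
    """True for URLs such as 'www.wlv.ac.uk' or 'http://www.wlv.ac.uk/'."""
    without_scheme = url.split("://", 1)[-1]
    host, _, path = without_scheme.partition("/")
    return bool(host) and path == ""


def home_page_position(ranked_urls):
    """Return the 1-based position of the first domain root URL, or None."""
    for position, url in enumerate(ranked_urls, start=1):
        if is_domain_root(url):
            return position
    return None


wolverhampton_top3 = [
    "www.wlv.ac.uk/disclaimer/official.html",
    "www.wlv.ac.uk",
    "www.wlv.ac.uk/resources/uni.nav.bar.map",
]
position = home_page_position(wolverhampton_top3)
```

Running the same function over the PageRank list and the inlink list for one university gives the two positions that the comparison needs.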

The second analysis is a large combined experiment to determine whether PageRank or inlink counts reveal the most important pages on a site, and whether one appears to be better than the other. The investigation is conducted by using tables of the top pages from both methods and evaluating these qualitatively. Separate results are reported for individual universities and for national university systems. These are potentially significantly different entities under the hypothesis that links between institutions carry a higher information value than those within a single site (Kleinberg, 1999; Thelwall, 2001a).

RESULTS AND DISCUSSION

Home page ranking

As can be seen in Table 2, there is no significant difference between the success of raw inlink counts and PageRank in the rank of the university home page, based only upon the link structure of the university Web site on its own. In almost all of the tied cases the home pages were number one in both lists. In only five cases were the home pages not in the top ten of either list. No statistical test is needed to see that the differences are negligible, but standard hypothesis tests for proportions would show this.
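
As one illustration of such a check, an exact two-sided sign test can be applied to the untied cases. The sign test is a substitute chosen here for simplicity, not a test named in the paper, and the function name `sign_test_p_value` is hypothetical. The counts used are the UK figures from Table 2: 12 home pages ranked higher with PageRank and 10 ranked higher with inlink counts.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test on untied cases (ties are ignored)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least this uneven if both methods were
    # equally likely to rank the home page higher.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_uk = sign_test_p_value(12, 10)
```

For the UK counts the p-value is far above any conventional significance threshold, which agrees with the statement that the differences are negligible.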

Table 2. A comparison of the ranking of university home pages produced by PageRank and by simple inlink counts operating on each university Web site on its own. Entries are numbers of home pages.

System        Higher with PageRank   Higher with inlink counts   Same with both methods   Not in the top 10 in either list
Australia     1                      3                           33                       1
New Zealand   2                      1                           5                        0
UK            12                     10                          84                       4

Individual universities

The top ten lists of individual universities were examined for patterns.
In many lists there was a group of closely related URLs that came from a large subsite with a navigation bar linking to the main pages. Often these were the main official pages of the site, as is the case for La Trobe University, shown in Table 3. The dominance of the main pages in this case is caused by the existence of a standard links bar at the top of all official pages. A big difference can be seen between the results from this site and those of Wolverhampton (Table 4), where the main pages were not indexed due to their use of Active Server Pages queries. The official links bar for other pages uses a server side map that is also not indexable, although the URL of the map can be seen ranked third in the table. This is a clear case of design decisions dominating the top results of the PageRank calculation for individual universities.

Table 3. The ten highest ranked pages for La Trobe, using PageRank on internal links only, with PageRanks linearly scaled to make the largest equal to 1

Count   PageRank   Page
8952    1          www.latrobe.edu.au
10058   0.513953   www.latrobe.edu.au/international
9910    0.506899   www.latrobe.edu.au/search
9862    0.505285   www.latrobe.edu.au/contact
9966    0.505202   www.latrobe.edu.au/about
9858    0.504549   www.latrobe.edu.au/sitemap
9909    0.50315    www.latrobe.edu.au/teaching
9903    0.502879   www.latrobe.edu.au/research
9903    0.502845   www.latrobe.edu.au/faculties
9902    0.502827   www.latrobe.edu.au/campuses

Table 4. The ten highest ranked pages in Wolverhampton, using PageRank on internal links only, PageRanks scaled

Count   PageRank    Page
3171    1           www.wlv.ac.uk/disclaimer/official.html
3037    0.7812036   www.wlv.ac.uk
2802    0.6703956   www.wlv.ac.uk/resources/uni.nav.bar.map
1898    0.5537226   www.scit.wlv.ac.uk/appdocs/php
4474    0.4495484   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/overview-summary.html
4475    0.4286861   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api
4469    0.4277823   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/deprecated-list.html
4468    0.427765    www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/index-files/index-1.html
4468    0.4277458   www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/help-doc.html
811     0.374287    www.wlv.ac.uk/disclaimer/personal.html

In addition to instances of domination by official pages, other cases were also found where sets of computer documentation or other types of large subsite had high interlinking patterns. This can be seen in Table 4 and is also illustrated for the case of the Royal Melbourne Institute of Technology (RMIT), as shown in Table 5. It can be seen that the main pages on a large PHP: Hypertext Preprocessor (PHP, a recursive acronym) Web page server-side scripting language site have attracted a large number of links, in actual fact from the standard navigation links found on all other pages of this large set of documentation. This is a case where the sheer size of the resource and its inclusion of a standard set of navigational links have combined to give its key pages a huge inlink count. It appears to be for use only in one student course and is a copy of documentation produced elsewhere, so from an external Web user's point of view it would not be considered as important content on the RMIT site. The fifth page in Table 6 is the home page of the RMIT Research and Development Section, which hosts a large subsite with a link to their home page on each page. This is an example of a similar phenomenon: the internal size of a subsite determining the rank of its home page.

Table 5. The ten most linked to pages in RMIT, counting only internal links, PageRanks scaled

Count   PageRank    Page
12485   1           www.rmit.edu.au
6829    0.5016145   www.rmit.edu.au/webmaster/disclaimer.html
3286    0.4388598   www.viscom.rmit.edu.au/robin/talks.htm
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/downloads.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/docs.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/faq.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/support.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/bugs.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/links.php
2075    0.0274677   kroid.mds.rmit.edu.au/cs843/ref/php/copyright.php

In terms of the difference between high inlink count pages and those with a high PageRank, a comparison of Table 5 and Table 6 shows that there can be real differences. Many of the RMIT pages in Table 5 have a relatively low PageRank as a result of each linking page containing a large number of other links, which dissipates the effect of each individual link through sharing 'rank' between all targets of a page. Some pages in Table 6 have about half as many inlinks but more than double the PageRank because the pages that link to them have fewer overall links.
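
The dissipation effect can be shown with a small piece of arithmetic. The figures below are invented for illustration and are not taken from the RMIT data; they assume all linking pages have the same rank and use the conventional damping factor of 0.85. The function name `rank_received` is hypothetical.

```python
def rank_received(inlinks, source_rank, links_per_source, damping=0.85):
    """Rank passed to a page whose inlinks all come from similar sources."""
    # Each source divides its rank equally among all of its outgoing links.
    return inlinks * damping * source_rank / links_per_source


source_rank = 1.0  # arbitrary unit, the same for every linking page

# Page A: 2000 inlinks from pages that each carry 50 links.
page_a = rank_received(2000, source_rank, 50)
# Page B: 1000 inlinks from pages that each carry only 10 links.
page_b = rank_received(1000, source_rank, 10)
```

Under these assumptions page B has half as many inlinks as page A but receives more than double the rank, because each of its inlinks is shared with far fewer other targets.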

Table 6. The ten highest ranked pages in RMIT, using PageRank on internal links only, PageRanks scaled

Count   PageRank    Page
12485   1           www.rmit.edu.au
6829    0.5016145   www.rmit.edu.au/webmaster/disclaimer.html
3286    0.4388598   www.viscom.rmit.edu.au/robin/talks.htm
1248    0.1176734   www.homepages.eu.rmit.edu.au/bondy/saskiabondyfamtreesite/persons.html
666     0.0819186   www.rmit.edu.au/departments/rd
1097    0.0796801   bonza.rmit.edu.au
1084    0.0786278   bonza.rmit.edu.au/search.html
1084    0.0786278   bonza.rmit.edu.au/essays
1083    0.0785995   bonza.rmit.edu.au/links
1083    0.0785995   bonza.rmit.edu.au/contact.html

The number one page in Table 4 and the number two page in Table 5 demonstrate another feature of both PageRank and inlink counts: the high score that pages can have which possess a legal function in regard to Web content. At the University of Wolverhampton, the page with the highest PageRank is the legal disclaimer that all official pages are supposed to contain a link to. The RMIT disclaimer page clearly also enjoys a similar status. This is a problem from an IR or bibliometric perspective, as the page does not contain information of unusually high value. There are also several highly ranked copyright pages in other university lists (see Tables 8 and 9).

National systems

Tables 7 to 9 give the top 10 pages from each national system, after applying PageRank to their combined link structure files. The top 100 pages were produced in each case but the rest are not shown for reasons of space. Although the university home pages in each list are natural inclusions, none of the other pages could be regarded as containing unusually useful information; rather, they owe their position to a relatively ephemeral cause such as the ones discussed above. The UK's top page is a case in point. It attracts only internal links from its own site and is linked to from a huge collection of pages, each containing a description of one of the modules taught at the University of Staffordshire, all of which contain only one link. Ironically, the link appears to be an automatically inserted typo (the home page has an additional underscore: www.staffs.ac.uk/schools/art_and_design) and the link in question is non-functioning because there is no content between the start and end of the anchor tag. The lack of sharing with other links, however, is the main factor that has led to a high PageRank.

Table 7. Australian top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

Count   PageRank    Toolbar           Page
23304   1           6 (moved)         www.unimelb.edu.au/pwebstats/pwebstats.html
32940   0.3007683   8                 www.unimelb.edu.au
44129   0.2149047   7                 www.monash.edu.au
22502   0.212978    8                 www.unimelb.edu.au/disclaimer
10827   0.1948821   7                 www.csse.monash.edu.au/disclaimers/user.html
18157   0.1839759   (not available)   www.gu.edu.au/cgi-bin/textflip.cgi
28977   0.1825789   8                 www.uq.edu.au
7525    0.1820394   5                 www.educ.utas.edu.au
18989   0.1717772   7                 www.unisa.edu.au
34341   0.1686863   8                 www.unsw.edu.au

Table 8. New Zealand top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

Count   PageRank    Toolbar   Page
16990   1           3         www.otago.ac.nz/sas/common/images/copyrite.htm
7561    0.2361207   7         www.otago.ac.nz
6611    0.2183746   7         www.vuw.ac.nz
7953    0.2012788   7         www.massey.ac.nz/disclaim.htm
9723    0.189337    7         www.massey.ac.nz
2808    0.185448    5         webview.massey.ac.nz
2807    0.1854373   5         webview.massey.ac.nz/help/help.htm
5254    0.1850909   7         www.canterbury.ac.nz
2185    0.1621233   6         nix.tmk.auckland.ac.nz/SAL
3079    0.1488084   7         www.auckland.ac.nz

Table 9. UK top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

Count   PageRank    Toolbar               Page
3843    1           4 (not available)     www.staffs.ac.uk/schools/art_anddesign
22691   0.9747133   8                     www.st-and.ac.uk
17160   0.9735287   0                     www.cc.ic.ac.uk/college/onlinedocs/sasonlinedocv8/sasdoc/sashtml/common/images/copyrite.htm
12841   0.9626068   4                     bicss.mdx.ac.uk/css/public
26276   0.9160221   7                     www.napier.ac.uk
3477    0.9152157   unranked              www.aom.bham.ac.uk/handbook/courses/glossary.htm
3464    0.9122335   3                     www.ao.bham.ac.uk/handbook/courses/glossary.htm
27982   0.8607687   7                     www.ulst.ac.uk
16851   0.8305826   8                     www.leeds.ac.uk
18761   0.7950215   7                     www-maths.mcs.st-and.ac.uk

The top Australian page is of a type not mentioned before, a Web statistics software home page, and this particular example is from the former site of Martin Gleeson's pwebstats program that attracts large numbers of links from server statistics pages generated by the software. There are similarly highly inlinked pages in the UK (Thelwall, 2002b). As can also be seen, there are help and glossary pages in the New Zealand and UK lists respectively. These may be useful in the context of the pages that link to them but probably much less so for the wider Web user. Also present are two departmental home pages, both as a result of credit links on large collections of pages. In the case of St Andrews, for example, the links come predominantly from the pages of an online history of maths archive. The ninth ranked New Zealand page is from a mirror copy of the Scientific Applications on Linux site, getting its links from within its own site. The URL is case-sensitive, hence the mixed case version shown in Table 8.

The pages referenced here were all loaded into a Web browser on the 19th of February, 2002 with Google's toolbar (toolbar.google.com) installed so that the PageRank feature could be used. This gives a number between 0 and 10 for each loaded page. Clearly these are not PageRanks as could be obtained directly from Brin & Page's algorithm, since the unmodified values would all be less than one, but the use of this word by Google to describe the displayed figures gives some cause to believe that they are related in a monotonic way (i.e. larger toolbar values come from larger PageRank values). The results of this exercise led to the discovery that in some cases the toolbar PageRank figure given was applied to the domain and automatically reduced by one for each directory in the path of the URL, so that longer URLs tended to have lower 'PageRanks', irrespective of high inlinking, as seen in Tables 7 to 9. This was the case for the zero ranked page in the UK list, for example. It is surmised, then, that either the PageRank algorithm has been modified for the current version of Google, or the toolbar uses an approximate non-monotonic version of it in certain situations (i.e. it reverses the relative ranks of some pages).

CONCLUSIONS

It is very clear from the data that the top ranked pages, either with PageRank or with raw inlink counts, are there primarily as a result of navigational architecture policy decisions rather than on their own individual merit. Perhaps the clearest example of this is a comparison of the Ulster University home page with that of Wolverhampton. The former attracts the highest inlink count of all UK pages whereas the latter attracts relatively few as a result of global site design decisions. It must be concluded from this that PageRank and inlink counts are not reliable methods for ascertaining the most valuable resources on a large university Web site or national system of university sites.

Contrasting these findings with those of Thelwall (2002b) it can be seen that the root cause of the problems is the inclusion of internal links. Inlink counts based upon external links only yield much "better" results, although still not perfect. This is a real problem for the PageRank algorithm because it depends on internal links to function, for example redistributing link votes from the home page of a site to links on its other pages, as would be needed for PageRank to propagate from an important multipage site, such as the Humbul Humanities Hub (www.humbul.ac.uk).
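
Counting external inlinks only can be sketched by discarding links whose source and target lie on the same site. The sketch below uses the host name as a stand-in for the site, which is a simplification: subdomains of one university would need to be grouped together in a real analysis. The function names `site_of` and `external_inlink_counts` and the sample links are hypothetical.

```python
from collections import Counter


def site_of(url):
    """Return the host part of a URL, used here as a rough site identifier."""
    return url.split("://", 1)[-1].split("/", 1)[0].lower()


def external_inlink_counts(url_pairs):
    """Count inlinks per target page, ignoring links within the same site."""
    counts = Counter()
    for source, target in url_pairs:
        if site_of(source) != site_of(target):
            counts[target] += 1
    return counts


# Invented (source, target) pairs for illustration.
sample_pairs = [
    ("www.wlv.ac.uk/a.html", "www.wlv.ac.uk"),           # internal, ignored
    ("www.leeds.ac.uk/links.html", "www.wlv.ac.uk"),     # external
    ("www.leeds.ac.uk/links.html", "www.humbul.ac.uk"),  # external
    ("www.humbul.ac.uk/about", "www.humbul.ac.uk"),      # internal, ignored
]
external_counts = external_inlink_counts(sample_pairs)
```

Navigation bar links never cross a site boundary, so they drop out of these counts entirely.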

Comparing the UK top 100 results with those for external links only (Thelwall, 2002b) it can be seen that the two are fundamentally different. Both contain many university home pages but the latter does not contain any of the other pages typically found on the standard internal navigation bar. Perhaps the most damning evidence is that the single page that is probably the most widely used resource on a UK Web site, the UK clickable map, does not appear at all in either UK top 100 reported here, despite having 891 external links from other UK universities.

In the context of the results presented here it is hard to believe that plain PageRank is effective as an IR algorithm, even when combined with simple text matching. The fundamental problem is the allocation of equal weight to internal links as external ones and the loophole that this gives to allow navigational policy decisions to swamp the relatively small number of links created for non-navigational reasons. The algorithm may be more effective on a global scale, where there are relatively more external links, but the difference will not be of the order of magnitude needed to make a real difference for university Web sites. It may be, however, that huge sites such as Yahoo! do improve the results by bestowing higher PageRanks on the better Web pages, but this would affect only a few pages on each university site. The original Google patent (Page, 1998) does mention myriad potential customisations of the algorithm, including treating inter-site links differently, and so it is likely that the version in use at the time of testing was different from the original.

Another point that should be mentioned is that this analysis has been confined to the top ten or 100 pages of each set and therefore ignores the overwhelming majority of pages. Nevertheless, it can be seen that the same kinds of arguments can also apply for these: pages that are part of a highly interlinking navigational structure will rank much better than others that are the target of only one or two external links, even though such links are probably a much better indicator of high quality. It could be argued that a high degree of interlinking is a good indicator of quality at least in site design, but in this case the PageRank is still totally dependent on the absolute number of pages involved and the extent to which links are also present to other resources.

Despite the number-intensive mathematical algorithms used to produce the data presented here, this has been essentially a qualitative study. This is not seen as a weakness in the context of the very clear-cut nature of the results obtained. Indeed the qualitative approach, focussing on investigating the cause of the problems, has enabled the gaining of insights into the reasons why the ranking methods have been unable to produce convincing lists of the top pages in the sites covered.

In conclusion, PageRank is not an effective method for identifying the "best" Web pages in a university system because of its domination by internal links, an argument that would still apply even if all mirror sites had been removed from the data. The astonishing accuracy of Google (Thelwall, 2002c) must be due to its complementary use of a very effective text-based matching algorithm, which must itself be incorporated into the final ranks. A promising future direction for bibliometric research is to develop a variant of PageRank that can harness the potential of its system for transferring rank iteratively through links in a way that would not be dominated by internal site links.

ACKNOWLEDGEMENTS

I would like to thank the referees for their very helpful comments.

REFERENCES

Bharat, K. & Mihaila, G. A. (2001). When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. In: Tenth International World Wide Web Conference. Available at: http://www.www10.org/cdrom/papers/474/index.html
Brin, S. & Page, L. (1998). The Anatomy of a large scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117. Available at: http://citeseer.nj.nec.com/brin98anatomy.html
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. & Wiener, J. (2000). Graph structure in the web. Computer Networks, 33(1-6), 309-320.
Gao, J., Walker, S., Robertson, S., Cao, G., He, H., Zhang, M. & Nie, J-Y. (2001). TREC-10 Web Track Experiments at MSRA. TREC 2001, pp. 384-392. Available: http://trec.nist.gov/pubs/trec10/t10_proceedings.html
Gläser, J. & Laudel, G. (2001). Integrating scientometric indicators into sociological studies: methodical and methodological problems. Scientometrics, 52(3), 411-434.
Goodrum, A. A., McCain, K. W., Lawrence, S. & Giles, C. L. (2001). Scholarly publishing in the Internet age: a citation analysis of computer science literature. Information Processing and Management, 37(5), 661-676.
Google (2002). Google Technology. Available: http://www.google.com/technology/. Accessed 3 July, 2002.
Haveliwala, T. (1999). Efficient Computation of PageRank. Stanford University Technical Report. Available: http://dbpubs.stanford.edu:8090/pub/1999-31
Hawking, D., Bailey, P. & Craswell, N. (2000). ACSys TREC-8 experiments. In: Information Technology: Eighth Text REtrieval Conference (TREC-8), NIST, Gaithersburg, MD, USA, pp. 307-315.
Heydon, A. & Najork, M. (1999). Mercator: A scalable, extensible Web crawler. World Wide Web, 2, 219-229.
Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.
Larson, R. R. (1996). Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace. ASIS 96. Available at: http://sherlock.berkeley.edu/asis96/asis96.html
Leydesdorff, L. & Curran, M. (2000). Mapping university-industry-government relations on the Internet: the construction of indicators for a knowledge-based economy. Cybermetrics, 4. Available at: http://www.cindoc.csic.es/cybermetrics/articles/v4i1p2.html
Lifantsev, M. (2000). Voting model for ranking Web pages. In Graham, P. & Maheswaran, M. (eds), Proceedings of the International Conference on Internet Computing, Las Vegas, Nevada, USA, CSREA Press, pp. 143-148.
Ng, A. Y., Zheng, A. X. & Jordan, M. I. (2001). Stable algorithms for link analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), New York: ACM Press, pp. 258-266.
Page, L. (1998). United States Patent 6,285,999. Available: http://patft.uspto.gov/
Rafiei, D. & Mendelzon, A. O. (2000). What is this page known for? Computing Web page reputations. Computer Networks, 33(1-6), 823-835.
Richardson, M. & Domingos, P. (2001). The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. Poster at Neural Information Processing Systems: Natural and Synthetic 2001. Available at: http://www.cs.washington.edu/homes/mattr/doc/NIPS2001/qd-pagerank.pdf
Rousseau, R. (1997). Sitations, an exploratory study. Cybermetrics, 1. Available: http://www.cindoc.csic.es/cybermetrics/articles/v1i1p1.html
Smith, A. G. (1999). A tale of two web spaces: comparing sites using Web Impact Factors. Journal of Documentation, 55(5), 577-592.
Smith, A. & Thelwall, M. (2002, to appear). Web Impact Factors for Australasian Universities. Scientometrics, 54(1-2).
Sullivan, D. (2002). Google Tops In "Search Hours" Ratings. Available: http://searchenginewatch.com/sereport/02/05-ratings.html. Accessed 3 July, 2002.
Thelwall, M. (2001a). Extracting macroscopic information from web links. Journal of the American Society for Information Science and Technology, 52(13), 1157-1168.
Thelwall, M. (2001b, to appear). Evidence for the existence of geographic trends in university web site interlinking. Journal of Documentation.
Thelwall, M. (2001c). A web crawler design for data mining. Journal of Information Science, 27(5), 319-325.
Thelwall, M. (2001d). A publicly accessible database of UK university web site links and a discussion of the need for human intervention in web crawling. University of Wolverhampton. Available: http://www.scit.wlv.ac.uk/~cm1993/papers/a_publicly_accessible_database.pdf
Thelwall, M. (2001e). The top 100 linked pages on UK university Web sites: high inlink counts are not associated with quality scholarly content. University of Wolverhampton.
Thelwall, M. (2001f). Results from a Web Impact Factor crawler. Journal of Documentation, 57(2), 177-191.
Thelwall, M. (2002a). A comparison of sources of Links for academic Web Impact Factor calculations. Journal of Documentation, 58, 60-72.
Thelwall, M. (2002b). Subject gateway sites and search engine ranking. Online Information Review, 26(2), 124-138.
Thelwall, M. (2002c, to appear). In praise of Google: finding law journal Web sites. Online Information Review, 26(4).
Xi, W. & Fox, E. A. (2001). Machine Learning Approach for Homepage Finding Task. TREC 2001, pp. 686-697. Available: http://trec.nist.gov/pubs/trec10/t10_proceedings.html