CAN GOOGLE'S PAGERANK BE USED TO FIND THE MOST IMPORTANT ACADEMIC WEB PAGES?

Mike Thelwall¹
m.thelwall@wlv.ac.uk
School of Computing and Information Technology, University of Wolverhampton, 35/49 Lichfield Street, Wolverhampton WV1 1EQ, UK
Phone: +44 1902 321470; Fax: +44 1902 321478

Google's PageRank is an influential algorithm that uses a model of Web use that is dominated by its link structure in order to rank pages by their estimated value to the Web community.
This paper reports on the outcome of applying the algorithm to the Web sites of three national university systems in order to test whether it is capable of identifying the most important Web pages.
The results are also compared to simple inlink counts.
It was discovered that the highest inlinked pages do not always have the highest PageRank, indicating that the two metrics are genuinely different, even for the top pages.
More significantly, however, internal links dominated external links for the high ranks in either method and superficial reasons accounted for high scores in both cases.
It is concluded that PageRank is not useful for identifying the top pages in a site and that it must be combined with a powerful text matching technique in order to get the quality of information retrieval results provided by Google.

INTRODUCTION

Google's PageRank algorithm (Brin & Page, 1998) for ranking Web pages is an Information Retrieval (IR) algorithm that is relatively well-known to the general public because of its use in the Google Toolbar and in the company's marketing approach, "The heart of our software is PageRank" (Google, 2002).
It is also arguably the most influential and successful of the past five years, on the back of the search engine's number one status for online searching according to some measurements (Sullivan, 2002).
Despite this, there do not appear to have been many studies focussing on the questions of how effective it is or under which conditions it is effective.
PageRank is based upon the assumption that good quality pages are more likely to be linked to than poor quality ones and therefore that mining information about the link structure of the Web could be more effective at identifying the best pages matching search engine queries than a simple text-matching algorithm.
In fact it goes one step further and incorporates the quality of the linking page in its iterative algorithm, described in detail below.
In this context two natural questions to ask from a bibliometric perspective are whether the pages that are most highly linked to are significantly different from those that have the highest PageRank, and whether either method is capable of identifying the highest quality or most useful pages in a site.
The questions have additional pertinence because of the growing body of informetric research that is based upon link counts (e.g. Larson, 1996; Rousseau, 1997; Ingwersen, 1998; Smith, 1999; Leydesdorff & Curran, 2000; Thelwall, 2001a,b,f, 2002a; Smith & Thelwall, 2002).
Potentially, such investigations may benefit from switching to PageRank, or another iterative rating system, in order to take into account in some way the quality of the inlinks rather than just their numbers.

¹ Journal of Documentation 58(6), 2002, to appear.

PAGERANK AND OTHER WEB INFORMATION RETRIEVAL ALGORITHMS

On a mathematical level the PageRank algorithm finds the principal eigenvector of a matrix created from the link structure of the system.
More descriptively, the matrix encodes the model of a surfer visiting Web pages in succession.
At each page the surfer jumps to a completely random page with probability 0.85 and follows a random link chosen from the current page with probability 0.15.
If the surfer is allowed to proceed in this fashion from any starting point for a very long time then the PageRank of any page is defined to be the probability that they are viewing it after any given jump.
The ranking system generated favours pages that are the target of many links, since they are more likely to be jumped to.
It also weights more highly links from more important source pages since these sources are more likely to be jumped to and, therefore, more likely to originate a new jump.
The rationale for the use of links is that they provide additional information about pages that can be used to help decide how important the page is, rather than what its content is about (Brin & Page, 1998).
A more mathematical equivalent definition of the PageRank algorithm can be found in Ng et al. (2001) in addition to the original Brin and Page article.
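
The random surfer model described above can be sketched as a short power iteration. This is a minimal illustration under stated assumptions, not the program used in the study: the function name `pagerank`, the parameter `jump` (the probability of a random jump) and the three-page graph are all hypothetical, and every page is assumed to have at least one outlink. The jump probability is left as a parameter; the default of 0.15 follows the formula in Brin & Page (1998), where the damping factor of 0.85 weights the link-following term.

```python
def pagerank(links, jump=0.15, iterations=100):
    """Approximate the random-surfer probabilities by repeated updating.

    links: dict mapping each page to the list of pages it links to.
    Assumption: every page appears as a key and has at least one outlink.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform guess
    for _ in range(iterations):
        # Every page receives an equal share of the random jumps.
        new_rank = {p: jump / n for p in pages}
        for source, targets in links.items():
            # The remaining probability is shared equally between outlinks.
            share = (1.0 - jump) * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "home" is linked to by both other pages.
toy = {"home": ["a", "b"], "a": ["home"], "b": ["home", "a"]}
ranks = pagerank(toy)
```

Calling `pagerank(toy, jump=0.85)` instead gives the setting in which random jumps dominate link following.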

One other key link-based IR procedure is Kleinberg's (1999) topic distillation algorithm, which is primarily for topic-specific searching.
This uses links to decide how important pages are for a specific topic, rather than in general.
It works by starting with a query and identifying relevant pages through text semantics, then using the link structure within this collection to allocate pages iteratively (a) an authority score by summing the weights of the incoming link pages and (b) a hub score by summing the weights of the outgoing link target pages.
PageRank has been shown to be intrinsically the more stable of the two, however, with the Kleinberg algorithm being sensitive to small changes in link structures (Ng et al., 2001).
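
The hub and authority updates just described can be sketched as follows. This is an illustrative sketch only: it assumes the topic-specific collection has already been selected by the text-matching step (which is omitted), it normalises the scores after each round so that they stay bounded, and the function name `hits` and the four-page graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Iteratively assign hub and authority scores within a page collection.

    links: dict mapping a page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # (a) authority score: sum of the hub scores of the pages linking in
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # (b) hub score: sum of the authority scores of the pages linked to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Rescale both score sets to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical collection: two hub pages pointing at two content pages.
subgraph = {"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []}
hub, auth = hits(subgraph)
```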

There is no known study that scientifically demonstrates that either is effective in a clearly defined sense of identifying the best information on the Web, but the success of Google is still a powerful argument for the importance of PageRank.
It may well be the case that the various search engine companies perform extensive testing and know the answers to these questions but do not make them available for commercial reasons.
The scientific TREC competition results were not promising, however (Hawking et al., 2000), although this could have been due to an untypical test corpus or the variant of PageRank used (Thelwall, 2002b).
According to Gao et al. (2001), "[The recent] research of web retrieval has focused on link-based ranking methods. However, none had achieved better results than content-based methods in TREC experiments".
Other research into link-related algorithms serves to confirm the importance of this area (Haveliwala, 1999; Broder et al., 2000; Lifantsev, 2000; Rafiei & Mendelzon, 2000; Richardson & Domingos, 2001).
Bharat and Mihaila (2001), for example, develop a new algorithm and demonstrate through user evaluations that its performance is comparable with PageRank.
Unlike PageRank, however, the other Web IR algorithms integrate the text analysis with link analysis, making them unsuitable for tasks such as finding the 'best' overall pages.

THE RESEARCH QUESTIONS

This paper reports on a study to apply PageRank to databases of the link structures of the Web sites of UK, Australian and New Zealand universities.
This algorithm is chosen for its arguable pre-eminence in addition to its suitability for the task of finding the overall best pages on a site.
The three nations selected are chosen for the free availability of link data for them and because they represent in international terms relatively early Internet adopters and extensive Web users.
The 10 highest ranked pages for each university will be analysed as well as the 100 highest for each national system.
Reasons for differences between PageRank and inlink counts will be uncovered from an investigation into the inlink structure of the pages in question.
This is essentially an investigative and qualitative bibliometric approach (e.g. Gläser & Laudel, 2001; Goodrum et al., 2001) rather than one of formal scientific hypothesis testing.
The theoretical context is the hypothesis that the top ranked pages will either contain high quality content or will be gateways to other useful pages.
The two specific questions addressed are as follows.
Are the pages given the highest rank by PageRank clearly the most useful or highest quality in the system analysed, or can their high positions be the result of unrelated factors?
Is PageRank more successful than simple inlink counts at identifying the top pages?

METHODOLOGY

The link structure of the national university systems was obtained from a publicly available database (cybermetrics.wlv.ac.uk/database) described in detail in Thelwall (2001d) and obtained by an information science Web crawler (Thelwall, 2001c).
This covers the proportion of each Web site that could be found by iteratively following links from the home page, excluding copies of pages from other sources (mirror sites) when identified.
Mirror sites are a particular problem because it is not known to what extent Google's spider crawls them.
For example there are numerous copies of Sun's Java documentation on UK university Web sites and ideally Google would ignore these and only crawl the original on the Sun Microsystems web site.
Any additional copies in Google's database would clearly be wasting space.
Nevertheless, identifying and eliminating duplicate pages is a technically challenging job, despite published research on speeding up the process (Heydon & Najork, 1999), because of the sheer size of the Web.
The databases used will include some mirror sites that have been missed due to human error, which is possibly similar to the situation for Google.
The names of the 156 universities crawled can be obtained from the originating database site, via the domain names files.
The link database consists of a separate file for each institution containing the link structure of its web site.
The most challenging part of the research was in writing a program to encode the URLs into numbers for the PageRank algorithm.
This was difficult because of the memory taken by the URLs and the number of string comparison operations that were required to ensure that each URL was given a unique number.
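
For illustration, the encoding step can be sketched with a hash table, which is one common way of avoiding repeated string comparisons. This is an assumption about how such a program might be written today, not a description of the program used in the study; the function name `encode_links` and the sample URLs are hypothetical.

```python
def encode_links(url_pairs):
    """Give every distinct URL a unique consecutive number.

    url_pairs: iterable of (source_url, target_url) links.
    Returns (url_to_id, id_to_url, numeric_links).
    """
    url_to_id = {}      # hash table: URL -> number
    id_to_url = []      # key file: number -> URL
    numeric_links = []

    def get_id(url):
        # A dictionary lookup replaces comparison against every known URL.
        if url not in url_to_id:
            url_to_id[url] = len(id_to_url)
            id_to_url.append(url)
        return url_to_id[url]

    for source, target in url_pairs:
        numeric_links.append((get_id(source), get_id(target)))
    return url_to_id, id_to_url, numeric_links


# Hypothetical link file with three distinct URLs.
pairs = [("a.ac.uk", "a.ac.uk/x"), ("a.ac.uk/x", "b.ac.uk"), ("a.ac.uk", "b.ac.uk")]
url_to_id, id_to_url, numeric_links = encode_links(pairs)
```

The list `id_to_url` plays the role of a URL key file: it allows numeric ranks to be mapped back to URLs after the calculation.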
One combined link structure file was constructed for each national system and used to build a matrix of its link structure, and one separate link structure file was also created for each institution.
This was then loaded into a new program coded to execute the PageRank algorithm, and ranks obtained from it.
The procedure followed was essentially the same as the non-blocked version described by Haveliwala (1999) for small computers, except that no pages were eliminated from the system due to a lack of links.
Instead, a correcting factor was incorporated to adjust for the effect of pages in the system without links.
Although in the largest case the full link structure matrix would have been too big to store as an array, with $4 \times 10^{14}$ entries, it could in fact be stored with only $2 \times 10^{7}$ entries as a sparse matrix, recording the location of the non-zero entries, with unrecorded locations being implicitly zero.
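
A sparse version of the calculation can be sketched by storing only the (source, target) pairs of the non-zero entries. The form of the correcting factor is not specified here, so this sketch makes an assumption: the rank held by pages without outlinks is spread evenly over all pages on each iteration. The function name `sparse_pagerank` and the three-page example are hypothetical.

```python
def sparse_pagerank(n, links, jump=0.15, iterations=100):
    """PageRank over a link list instead of a full n-by-n array.

    n: number of pages, numbered 0..n-1.
    links: list of (source, target) pairs, i.e. only the non-zero entries.
    """
    out_degree = [0] * n
    for source, _ in links:
        out_degree[source] += 1
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank currently held by pages that have no outlinks at all.
        dangling = sum(rank[p] for p in range(n) if out_degree[p] == 0)
        # Assumed correction: share that rank equally, like a random jump.
        base = (jump + (1.0 - jump) * dangling) / n
        new_rank = [base] * n
        for source, target in links:
            new_rank[target] += (1.0 - jump) * rank[source] / out_degree[source]
        rank = new_rank
    return rank


# Three pages need 9 array cells but only 3 stored links; page 2 has no outlinks.
links = [(0, 1), (0, 2), (1, 2)]
ranks = sparse_pagerank(3, links)
```

Memory use grows with the number of links rather than with the square of the number of pages, which is the saving described above.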
The PageRank list was combined with the URL key file and sorted to produce two top 10 lists for each institution, one for PageRanks and one for inlinks.
Similar top 100 lists were produced for each whole system.
Table 1 summarises basic information about the databases.
The UK database is just over 10% of the size of the original Brin & Page (1998) corpus.

Table 1. Information about the databases used

| Country | Australia | New Zealand | UK |
|---|---|---|---|
| University Web sites included | 38 | 8 | 110 |
| Crawl dates | 10/2001-1/2002 | 1-2/2002 | 6-7/2001 |
| Total pages | 3,804,612 | 341,667 | 6,920,448 |
| Total links | 20,054,017 | 2,119,677 | 32,516,604 |

The first analysis was a simple calculation to see whether PageRank was more effective at identifying useful pages than inlink counts.
The two assumptions made are that (a) the most useful pages are institutional home pages and that (b) these are normally the root pages of their own domain names.
Based upon these assumptions, the test applied was to see which of the two mechanisms ranked this type of URL most highly.
As an aside, home page finding is a recognised IR task, for which links have been found useful (Xi & Fox, 2001).
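
The test can be sketched as a check for URLs with no path after the domain name, followed by a search for the first such URL in each ranked list. The helper names `is_domain_root` and `first_root_position` are hypothetical, and treating a bare trailing slash as a root page is an assumption; the example list is the first three URLs of Table 4.

```python
def is_domain_root(url):
    """True when the URL has no path beyond the domain name."""
    url = url.split("://", 1)[-1]          # drop any scheme such as http://
    _host, _, path = url.partition("/")
    return path == ""


def first_root_position(ranked_urls):
    """1-based position of the first domain-root URL in a ranked list."""
    for position, url in enumerate(ranked_urls, start=1):
        if is_domain_root(url):
            return position
    return None                            # no root page in the list


# First three Wolverhampton URLs in PageRank order (Table 4).
by_pagerank = [
    "www.wlv.ac.uk/disclaimer/official.html",
    "www.wlv.ac.uk",
    "www.wlv.ac.uk/resources/uni.nav.bar.map",
]
position = first_root_position(by_pagerank)
```

Applying the same function to the inlink-ordered list for the same site gives the second position needed for the comparison.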

The second analysis is a large combined experiment to determine whether PageRank or inlink counts reveal the most important pages on a site, and whether one appears to be better than the other.
The investigation is conducted by using tables of the top pages from both methods and evaluating these qualitatively.
Separate results are reported for individual universities and for national university systems.
These are potentially significantly different entities under the hypothesis that links between institutions carry a higher information value than those within a single site (Kleinberg, 1999; Thelwall, 2001a).

RESULTS AND DISCUSSION

Home page ranking

As can be seen in Table 2, there is no significant difference between the success of raw inlink counts and PageRank in the rank of the university home page, based only upon the link structure of the university Web site on its own.
In almost all of the tied cases the home pages were number one in both lists.
In only five cases were the home pages not in the top ten of either list.
No statistical test is needed to see that the differences are negligible, but standard hypothesis tests for proportions would show this.
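
As one possible check, the tallies reported in Table 2 can be put through an exact two-sided sign test. The choice of the sign test is an assumption made for illustration: ties are discarded and each remaining university is treated as equally likely to favour either method. The function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided sign test on paired comparisons, ties excluded."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins out of n when both outcomes are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Table 2 tallies for Australia, New Zealand and the UK respectively.
pagerank_higher = 1 + 2 + 12
inlinks_higher = 3 + 1 + 10
p_value = sign_test_p_value(pagerank_higher, inlinks_higher)
```

With 15 cases favouring PageRank and 14 favouring inlink counts, the p-value is 1, so the tallies give no evidence of a difference between the two methods.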

Table 2. A comparison of the ranking of university home pages produced by PageRank and by simple inlink counts operating on each university Web site on its own

| System | Home pages ranking higher with PageRank | Home pages ranking higher with inlink counts | Home pages ranking the same with both methods | Home pages not in the top 10 in either list |
|---|---|---|---|---|
| Australia | 1 | 3 | 33 | 1 |
| New Zealand | 2 | 1 | 5 | 0 |
| UK | 12 | 10 | 84 | 4 |

Individual universities

The top ten lists of individual universities were examined for patterns.
In many lists there was a group of closely related URLs that came from a large subsite with a navigation bar linking to the main pages.
Often these were the main official pages of the site, as is the case for La Trobe University, shown in Table 3.
The dominance of the main pages in this case is caused by the existence of a standard links bar at the top of all official pages.
A big difference can be seen between the results from this site and those of Wolverhampton (Table 4), where the main pages were not indexed due to their use of Active Server Pages queries.
The official links bar for other pages uses a server side map that is also not indexable, although the URL of the map can be seen ranked third in the table.
This is a clear case of design decisions dominating the top results of the PageRank calculation for individual universities.

Table 3. The ten highest ranked pages for La Trobe, using PageRank on internal links only, with PageRanks linearly scaled to make the largest equal to 1

| Inlink count | PageRank | Page |
|---|---|---|
| 8952 | 1 | www.latrobe.edu.au |
| 10058 | 0.513953 | www.latrobe.edu.au/international |
| 9910 | 0.506899 | www.latrobe.edu.au/search |
| 9862 | 0.505285 | www.latrobe.edu.au/contact |
| 9966 | 0.505202 | www.latrobe.edu.au/about |
| 9858 | 0.504549 | www.latrobe.edu.au/sitemap |
| 9909 | 0.50315 | www.latrobe.edu.au/teaching |
| 9903 | 0.502879 | www.latrobe.edu.au/research |
| 9903 | 0.502845 | www.latrobe.edu.au/faculties |
| 9902 | 0.502827 | www.latrobe.edu.au/campuses |

Table 4. The ten highest ranked pages in Wolverhampton, using PageRank on internal links only, PageRanks scaled

| Inlink count | PageRank | Page |
|---|---|---|
| 3171 | 1 | www.wlv.ac.uk/disclaimer/official.html |
| 3037 | 0.7812036 | www.wlv.ac.uk |
| 2802 | 0.6703956 | www.wlv.ac.uk/resources/uni.nav.bar.map |
| 1898 | 0.5537226 | www.scit.wlv.ac.uk/appdocs/php |
| 4474 | 0.4495484 | www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/overview-summary.html |
| 4475 | 0.4286861 | www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api |
| 4469 | 0.4277823 | www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/deprecated-list.html |
| 4468 | 0.427765 | www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/index-files/index-1.html |
| 4468 | 0.4277458 | www.scit.wlv.ac.uk/~cm1914/cp2027/docs/api/help-doc.html |
| 811 | 0.374287 | www.wlv.ac.uk/disclaimer/personal.html |

In addition to instances of domination by official pages, other cases were also found where sets of computer documentation or other types of large subsite had high interlinking patterns.
This can be seen in Table 4 and is also illustrated for the case of the Royal Melbourne Institute of Technology (RMIT), as shown in Table 5.
It can be seen that the main pages of a large documentation site for PHP (PHP: Hypertext Preprocessor, a recursive acronym), a server-side Web scripting language, have attracted a large number of links, which in fact come from the standard navigation links found on all other pages of this large set of documentation.
This is a case where the sheer size of the resource and its inclusion of a standard set of navigational links have combined to give its key pages a huge inlink count.
It appears to be for use only in one student course and is a copy of documentation produced elsewhere, so from an external Web user's point of view it would not be considered as important content on the RMIT site.
The fifth page in Table 6 is the home page of the RMIT Research and Development Section, which hosts a large subsite with a link to their home page on each page.
This is an example of a similar phenomenon: the internal size of a subsite determining the rank of its home page.

Table 5. The ten most linked to pages in RMIT, counting only internal links, PageRanks scaled

| Inlink count | PageRank | Page |
|---|---|---|
| 12485 | 1 | www.rmit.edu.au |
| 6829 | 0.5016145 | www.rmit.edu.au/webmaster/disclaimer.html |
| 3286 | 0.4388598 | www.viscom.rmit.edu.au/robin/talks.htm |
| 2075 | 0.0274677 | kroid.mds.rmit.edu.au/cs843/ref/php/downloads.php |
| 2075 | 0.0274677 | kroid.mds.rmit.edu.au/cs843/ref/php/docs.php |
| 2075 | 0.0274677 | kroid.mds.rmit.edu.au/cs843/ref/php/faq.php |
| 2075 | 0.0274677 | kroid.mds.rmit.edu.au/cs843/ref/php/support.php |
| 2075 | 0.0274677 | kroid.mds.rmit.edu.au/cs843/ref/php/bugs.php |
| 2075 | 0.0274677 | kroid.mds.rmit.edu.au/cs843/ref/php/links.php |
| 2075 | 0.0274677 | kroid.mds.rmit.edu.au/cs843/ref/php/copyright.php |

In terms of the difference between high inlink count pages and those with a high PageRank, a comparison of Table 5 and Table 6 shows that there can be real differences.
Many of the RMIT pages in Table 5 have a relatively low PageRank as a result of each linking page containing a large number of other links, which dissipates the effect of each individual link through sharing 'rank' between all targets of a page.
Some pages in Table 6 have about half as many inlinks but more than double the PageRank because the pages that link to them have fewer overall links.
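
The dissipation effect can be illustrated with a little arithmetic. The figures below are hypothetical, chosen only to show the mechanism, and the sketch ignores the random jump term and the iteration: each source page simply shares its rank equally between all of its link targets.

```python
def rank_received(inlinks):
    """Total rank passed to a page by its inlinks.

    inlinks: list of (source_rank, source_outlink_count) pairs.
    """
    return sum(rank / outlinks for rank, outlinks in inlinks)


# Hypothetical figures: all source pages hold the same rank of 1.0.
many_diluted = rank_received([(1.0, 50)] * 2000)    # 2000 inlinks, 50 links per source
fewer_focused = rank_received([(1.0, 10)] * 1000)   # 1000 inlinks, 10 links per source
```

In this example the page with half as many inlinks receives more than twice the rank, because each of its inlinks carries one tenth of its source's rank rather than one fiftieth.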

Table 6. The ten highest ranked pages in RMIT, using PageRank on internal links only, PageRanks scaled

| Inlink count | PageRank | Page |
|---|---|---|
| 12485 | 1 | www.rmit.edu.au |
| 6829 | 0.5016145 | www.rmit.edu.au/webmaster/disclaimer.html |
| 3286 | 0.4388598 | www.viscom.rmit.edu.au/robin/talks.htm |
| 1248 | 0.1176734 | www.homepages.eu.rmit.edu.au/bondy/saskiabondyfamtreesite/persons.html |
| 666 | 0.0819186 | www.rmit.edu.au/departments/rd |
| 1097 | 0.0796801 | bonza.rmit.edu.au |
| 1084 | 0.0786278 | bonza.rmit.edu.au/search.html |
| 1084 | 0.0786278 | bonza.rmit.edu.au/essays |
| 1083 | 0.0785995 | bonza.rmit.edu.au/links |
| 1083 | 0.0785995 | bonza.rmit.edu.au/contact.html |

The number one page in Table 4 and the number two page in Table 5 demonstrate another feature of both PageRank and inlink counts: the high scores that can be achieved by pages with a legal function in regard to Web content.
At the University of Wolverhampton, the page with the highest PageRank is the legal disclaimer that all official pages are supposed to contain a link to.
The RMIT disclaimer page clearly also enjoys a similar status.
This is a problem from an IR or bibliometric perspective, as the page does not contain information of unusually high value.
There are also several highly ranked copyright pages in other university lists (see Tables 8 and 9).

National systems

Tables 7 to 9 give the top 10 pages from each national system, after applying PageRank to their combined link structure files.
The top 100 pages were produced in each case but the rest are not shown for reasons of space.
Although the university home pages in each list are natural inclusions, none of the other pages could be regarded as containing unusually useful information; rather, they owe their position to a relatively ephemeral cause such as the ones discussed above.
The UK's top page is a case in point.
It attracts only internal links from its own site and is linked to from a huge collection of pages, each containing a description of one of the modules taught at the University of Staffordshire, all of which contain only one link.
Ironically, the link appears to be an automatically inserted typo (the home page has an additional underscore: www.staffs.ac.uk/schools/art_and_design) and the link in question is non-functioning because there is no content between the start and end of the anchor tag.
The lack of sharing with other links, however, is the main factor that has led to a high PageRank.

Table 7. Australian top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

| Inlink count | PageRank | Toolbar | Page |
|---|---|---|---|
| 23304 | 1 | 6 (moved) | www.unimelb.edu.au/pwebstats/pwebstats.html |
| 32940 | 0.3007683 | 8 | www.unimelb.edu.au |
| 44129 | 0.2149047 | 7 | www.monash.edu.au |
| 22502 | 0.212978 | 8 | www.unimelb.edu.au/disclaimer |
| 10827 | 0.1948821 | 7 | www.csse.monash.edu.au/disclaimers/user.html |
| 18157 | 0.1839759 | (not available) | www.gu.edu.au/cgi-bin/textflip.cgi |
| 28977 | 0.1825789 | 8 | www.uq.edu.au |
| 7525 | 0.1820394 | 5 | www.educ.utas.edu.au |
| 18989 | 0.1717772 | 7 | www.unisa.edu.au |
| 34341 | 0.1686863 | 8 | www.unsw.edu.au |

Table 8. New Zealand top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

| Inlink count | PageRank | Toolbar | Page |
|---|---|---|---|
| 16990 | 1 | 3 | www.otago.ac.nz/sas/common/images/copyrite.htm |
| 7561 | 0.2361207 | 7 | www.otago.ac.nz |
| 6611 | 0.2183746 | 7 | www.vuw.ac.nz |
| 7953 | 0.2012788 | 7 | www.massey.ac.nz/disclaim.htm |
| 9723 | 0.189337 | 7 | www.massey.ac.nz |
| 2808 | 0.185448 | 5 | webview.massey.ac.nz |
| 2807 | 0.1854373 | 5 | webview.massey.ac.nz/help/help.htm |
| 5254 | 0.1850909 | 7 | www.canterbury.ac.nz |
| 2185 | 0.1621233 | 6 | nix.tmk.auckland.ac.nz/SAL |
| 3079 | 0.1488084 | 7 | www.auckland.ac.nz |

Table 9. UK top 10 pages, using PageRank on internal links only, PageRanks scaled, and Google's toolbar value also shown

| Inlink count | PageRank | Toolbar | Page |
|---|---|---|---|
| 3843 | 1 | 4 (not available) | www.staffs.ac.uk/schools/art_anddesign |
| 22691 | 0.9747133 | 8 | www.st-and.ac.uk |
| 17160 | 0.9735287 | 0 | www.cc.ic.ac.uk/college/onlinedocs/sasonlinedocv8/sasdoc/sashtml/common/images/copyrite.htm |
| 12841 | 0.9626068 | 4 | bicss.mdx.ac.uk/css/public |
| 26276 | 0.9160221 | 7 | www.napier.ac.uk |
| 3477 | 0.9152157 | unranked | www.aom.bham.ac.uk/handbook/courses/glossary.htm |
| 3464 | 0.9122335 | 3 | www.ao.bham.ac.uk/handbook/courses/glossary.htm |
| 27982 | 0.8607687 | 7 | www.ulst.ac.uk |
| 16851 | 0.8305826 | 8 | www.leeds.ac.uk |
| 18761 | 0.7950215 | 7 | www-maths.mcs.st-and.ac.uk |

The top Australian page is of a type not mentioned before, a Web statistics software home page, and this particular example is from the former site of Martin Gleeson's pwebstats program that attracts large numbers of links from server statistics pages generated by the software.
There are similarly highly inlinked pages in the UK (Thelwall, 2002b).
As can also be seen, there are help and glossary pages in the New Zealand and UK lists respectively.
These may be useful in the context of the pages that link to them but probably much less so for the wider Web user.
Also present are two departmental home pages, both as a result of credit links on large collections of pages.
In the case of St Andrews, for example, the links come predominantly from the pages of an online history of maths archive.
The ninth ranked New Zealand page is from a mirror copy of the Scientific Applications on Linux site, getting its links from within its own site.
The URL is case-sensitive, hence the mixed case version shown in Table 8.

The pages referenced here were all loaded into a Web browser on the 19th of February, 2002 with Google's toolbar (toolbar.google.com) installed so that the PageRank feature could be used.
This gives a number between 0 and 10 for each loaded page.
Clearly these are not PageRanks as could be obtained directly from Brin & Page's algorithm, since the unmodified values would all be less than one, but the use of this word by Google to describe the displayed figures gives some cause to believe that they are related in a monotonic way (i.e. larger toolbar values come from larger PageRank values).
The results of this exercise led to the discovery that in some cases the toolbar PageRank figure given was applied to the domain and automatically reduced by one for each directory in the path of the URL, so that longer URLs tended to have lower 'PageRanks', irrespective of high inlinking, as seen in Tables 7 to 9.
This was the case for the zero ranked page in the UK list, for example.
It is surmised, then, that either the PageRank algorithm has been modified for the current version of Google, or the toolbar uses an approximate non-monotonic version of it in certain situations (i.e. it reverses the relative ranks of some pages).
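
The surmised toolbar behaviour can be written down as a simple rule. This sketch encodes only the observation reported above and says nothing about how Google's toolbar is actually implemented. Two details are assumptions: every non-empty path component, including a final file name, is counted as one level, and the result is not allowed to fall below zero. The function name `estimated_toolbar_value` is hypothetical.

```python
def estimated_toolbar_value(domain_value, url):
    """Guess a toolbar figure from the value shown for the domain root.

    domain_value: toolbar figure (0-10) observed for the domain's root page.
    url: URL of the deeper page, with or without a scheme.
    """
    url = url.split("://", 1)[-1]
    path = url.partition("/")[2]
    # Assumption: each non-empty path component costs one point.
    depth = len([part for part in path.split("/") if part])
    return max(0, domain_value - depth)


deep_page = estimated_toolbar_value(8, "www.unimelb.edu.au/pwebstats/pwebstats.html")
root_page = estimated_toolbar_value(8, "www.unimelb.edu.au")
```

Under this rule a page buried several levels down is shown a low figure however many inlinks it has, which is the non-monotonic behaviour noted above.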

CONCLUSIONS

It is very clear from the data that the top ranked pages, either with PageRank or with raw inlink counts, are there primarily as a result of navigational architecture policy decisions rather than on their own individual merit.
Perhaps the clearest example of this is a comparison of the Ulster University home page with that of Wolverhampton.
The former attracts the highest inlink count of all UK pages whereas the latter attracts relatively few as a result of global site design decisions.
It must be concluded from this that PageRank and inlink counts are not reliable methods for ascertaining the most valuable resources on a large university Web site or national system of university sites.
Contrasting these findings with those of Thelwall (2002b) it can be seen that the root cause of the problems is the inclusion of internal links.
Inlink counts based upon external links only yield much "better" results, although still not perfect.
This is a real problem for the PageRank algorithm because it depends on internal links to function, for example redistributing link votes from the home page of a site to links on its other pages, as would be needed for PageRank to propagate from an important multipage site, such as the Humbul Humanities Hub (www.humbul.ac.uk).
Comparing the UK top 100 results with those for external links only (Thelwall, 2002b) it can be seen that the two are fundamentally different.
Both contain many university home pages but the latter does not contain any of the other pages typically found on the standard internal navigation bar.
Perhaps the most damning evidence is that the single page that is probably the most widely used resource on a UK Web site, the UK clickable map, does not appear at all in either UK top 100 reported here, despite having 891 external links from other UK universities.
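
Counting external inlinks only can be sketched as follows. Deciding which pages belong to the same site is the delicate step; this sketch uses a crude assumption, taking the last three labels of the host name as the site key, which suits domains of the form name.ac.uk, name.edu.au and name.ac.nz but not domains in general. The function names `site_of` and `external_inlink_counts` and the sample links are hypothetical.

```python
from collections import Counter


def site_of(url, labels=3):
    """Crude site key: the last `labels` parts of the host name."""
    host = url.split("://", 1)[-1].split("/", 1)[0].lower()
    return ".".join(host.split(".")[-labels:])


def external_inlink_counts(links):
    """Count inlinks per target URL, ignoring links within the same site."""
    counts = Counter()
    for source, target in links:
        if site_of(source) != site_of(target):
            counts[target] += 1
    return counts


# Hypothetical links: two internal to wlv.ac.uk, two between universities.
links = [
    ("www.wlv.ac.uk/a", "www.wlv.ac.uk"),
    ("www.scit.wlv.ac.uk/x", "www.wlv.ac.uk"),
    ("www.leeds.ac.uk/y", "www.wlv.ac.uk"),
    ("www.wlv.ac.uk/b", "www.leeds.ac.uk"),
]
counts = external_inlink_counts(links)
```

With the internal navigation links filtered out, each home page in the example is left with a single inlink, the one that comes from another institution.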
In the context of the results presented here it is hard to believe that plain PageRank is effective as an IR algorithm, even when combined with simple text matching.
The fundamental problem is the allocation of equal weight to internal links as external ones and the loophole that this gives to allow navigational policy decisions to swamp the relatively small number of links created for non-navigational reasons.
The algorithm may be more effective on a global scale, where there are relatively more external links, but the difference will not be of the order of magnitude needed to make a real difference for university Web sites.
It may be, however, that huge sites such as Yahoo! do improve the results by bestowing higher PageRanks on the better Web pages, but this would affect only a few pages on each university site.
The original Google patent (Page, 1998) does mention myriad potential customisations of the algorithm, including treating inter-site links differently, and so it is likely that the version in use at the time of testing was different from the original.

Another point that should be mentioned is that this analysis has been confined to the top ten or 100 pages of each set and therefore ignores the overwhelming majority of pages.
Nevertheless, it can be seen that the same kinds of arguments can also apply for these: pages that are part of a highly interlinking navigational structure will rank much better than others that are the target of only one or two external links, even though such links are probably a much better indicator of high quality.
It could be argued that a high degree of interlinking is a good indicator of quality at least in site design, but in this case the PageRank is still totally dependent on the absolute number of pages involved and the extent to which links are also present to other resources.

Despite the number-intensive mathematical algorithms used to produce the data presented here, this has been essentially a qualitative study.
This is not seen as a weakness in the context of the very clear-cut nature of the results obtained.
Indeed the qualitative approach, focussing on investigating the cause of the problems, has enabled the gaining of insights into the reasons why the ranking methods have been unable to produce convincing lists of the top pages in the sites covered.

In conclusion, PageRank is not an effective method for identifying the "best" Web pages in a university system because of its domination by internal links, an argument that would still apply even if all mirror sites had been removed from the data.
The astonishing accuracy of Google (Thelwall, 2002c) must be due to its complementary use of a very effective text-based matching algorithm, which must itself be incorporated into the final ranks.
A promising future direction for bibliometric research is to develop a variant of PageRank that can harness the potential of its system for transferring rank iteratively through links in a way that would not be dominated by internal site links.

ACKNOWLEDGEMENTS

I would like to thank the referees for their very helpful comments.

REFERENCES

Bharat, K. & Mihaila, G. A. (2001). When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. In: Tenth International World Wide Web Conference. Available at: http://www.www10.org/cdrom/papers/474/index.html
Brin, S. & Page, L. (1998). The anatomy of a large scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117. Available at: http://citeseer.nj.nec.com/brin98anatomy.html
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. & Wiener, J. (2000). Graph structure in the web. Computer Networks, 33(1-6), 309-320.
Gao, J., Walker, S., Robertson, S., Cao, G., He, H., Zhang, M. & Nie, J-Y. (2001). TREC-10 Web Track experiments at MSRA. TREC 2001, pp. 384-392. Available: http://trec.nist.gov/pubs/trec10/t10_proceedings.html
Gläser, J. & Laudel, G. (2001). Integrating scientometric indicators into sociological studies: methodical and methodological problems. Scientometrics, 52(3), 411-434.
Goodrum, A. A., McCain, K. W., Lawrence, S. & Giles, C. L. (2001). Scholarly publishing in the Internet age: a citation analysis of computer science literature. Information Processing and Management, 37(5), 661-676.
Google (2002). Google Technology. Available: http://www.google.com/technology/. Accessed 3 July, 2002.
Haveliwala, T. (1999). Efficient Computation of PageRank. Stanford University Technical Report. Available: http://dbpubs.stanford.edu:8090/pub/1999-31
Hawking, D., Bailey, P. & Craswell, N. (2000). ACSys TREC-8 experiments. In: Information Technology: Eighth Text REtrieval Conference (TREC-8), NIST, Gaithersburg, MD, USA, pp. 307-315.
Heydon, A. & Najork, M. (1999). Mercator: A scalable, extensible Web crawler. World Wide Web, 2, 219-229.
Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604-632.
Larson, R. R. (1996). Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace. ASIS 96. Available at: http://sherlock.berkeley.edu/asis96/asis96.html
Leydesdorff, L. & Curran, M. (2000). Mapping university-industry-government relations on the Internet: the construction of indicators for a knowledge-based economy. Cybermetrics, 4. Available at: http://www.cindoc.csic.es/cybermetrics/articles/v4i1p2.html
Lifantsev, M. (2000). Voting model for ranking Web pages. In Graham, P. & Maheswaran, M. (eds), Proceedings of the International Conference on Internet Computing, Las Vegas, Nevada, USA, CSREA Press, pp. 143-148.
Ng, A. Y., Zheng, A. X. & Jordan, M. I. (2001). Stable algorithms for link analysis. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), New York: ACM Press, pp. 258-266.
Page, L. (1998). United States Patent 6,285,999. Available: http://patft.uspto.gov/.
Rafiei, D. & Mendelzon, A. O. (2000). What is this page known for? Computing Web page reputations. Computer Networks, 33(1-6), 823-835.
Richardson, M. & Domingos, P. (2001). The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. Poster at Neural Information Processing Systems: Natural and Synthetic 2001. Available at: http://www.cs.washington.edu/homes/mattr/doc/NIPS2001/qd-pagerank.pdf
Rousseau, R. (1997). Sitations: an exploratory study. Cybermetrics, 1. Available: http://www.cindoc.csic.es/cybermetrics/articles/v1i1p1.html
Smith, A. G. (1999). A tale of two web spaces: comparing sites using Web Impact Factors. Journal of Documentation, 55(5), 577-592.
Smith, A. & Thelwall, M. (2002, to appear). Web Impact Factors for Australasian Universities. Scientometrics, 54(1-2).
Sullivan, D. (2002). Google Tops In "Search Hours" Ratings. Available: http://searchenginewatch.com/sereport/02/05-ratings.html. Accessed 3 July, 2002.
Thelwall, M. (2001a). Extracting macroscopic information from web links. Journal of the American Society for Information Science and Technology, 52(13), 1157-1168.
Thelwall, M. (2001b, to appear). Evidence for the existence of geographic trends in university web site interlinking. Journal of Documentation.
Thelwall, M. (2001c). A web crawler design for data mining. Journal of Information Science, 27(5), 319-325.
Thelwall, M. (2001d). A publicly accessible database of UK university website links and a discussion of the need for human intervention in web crawling. University of Wolverhampton. Available: http://www.scit.wlv.ac.uk/~cm1993/papers/a_publicly_accessible_database.pdf.
Thelwall, M. (2001e). The top 100 linked pages on UK university Web sites: high inlink counts are not associated with quality scholarly content. University of Wolverhampton.
Thelwall, M. (2001f). Results from a Web Impact Factor crawler. Journal of Documentation, 57(2), 177-191.
Thelwall, M. (2002a). A comparison of sources of links for academic Web Impact Factor calculations. Journal of Documentation, 58, 60-72.
Thelwall, M. (2002b). Subject gateway sites and search engine ranking. Online Information Review, 26(2), 124-138.
Thelwall, M. (2002c, to appear). In praise of Google: finding law journal Web sites. Online Information Review, 26(4).
Xi, W. & Fox, E. A. (2001). Machine Learning Approach for Homepage Finding Task. TREC 2001, pp. 686-697. Available: http://trec.nist.gov/pubs/trec10/t10_proceedings.html
