New versions of PageRank employing alternative Web document models [1]

Mike Thelwall
School of Computing and Information Technology, University of Wolverhampton, 35/49 Lichfield Street, Wolverhampton WV1 1EQ, UK
m.thelwall@wlv.ac.uk

Liwen Vaughan
Faculty of Information and Media Studies, University of Western Ontario, London, Ontario, N6A 5B7, Canada
lvaughan@uwo.ca

Keywords: Web IR, PageRank, hyperlink analysis, search engines

Abstract
We introduce several new versions of PageRank (the link based Web page ranking algorithm), based upon an information science perspective on the concept of the Web document. Although the Web page is the typical indivisible unit of information in search engine results and most Web information retrieval algorithms, other research has suggested that aggregating pages based upon directories and domains gives promising alternatives, particularly when Web links are the object of study. The new algorithms introduced based upon these alternatives were used to rank four sets of Web pages. The ranking results were compared with human subjects' rankings. The results of the tests were somewhat inconclusive: the new approach worked well for the set that included pages from different Web sites; however, it did not work well in ranking pages that were from the same site. It seems that the new algorithms may be effective for some tasks but not for others, especially when only low numbers of links are involved or the pages to be ranked are from the same site or directory.

Introduction
Commercial search engines are a key access point to the Web and have the difficult task of trying to find the most useful of the billions of Web pages for each, typically short (Spink et al., 2001), user query entered. Probably the task is most difficult when millions of pages contain the query term(s) and these must be ordered so that the user is presented first with the ones most likely to be useful. Google's PageRank (Brin and Page, 1998) was an attempt to resolve this dilemma based upon the assumptions that: (1) more useful pages will have more links to them and (2) links from well linked to pages are better indicators of quality. The continued rise of Google to its current dominant position (Sullivan, 2002) and the proliferation of other link based algorithms (e.g. Kleinberg, 1999; Crestani and Lee, 2000; Ng et al., 2001; AltaVista, 2002) seem to make an unassailable argument for the PageRank algorithm, despite the paucity of clear cut results (e.g. Hawking et al., 2000; Savoy and Picard, 2001).

[1] Thelwall, M. & Vaughan, L. (2004). New versions of PageRank employing alternative Web document models. ASLIB Proceedings, 56(1), 24-33.

Modern Web IR algorithms are probably a highly complex mixture of different approaches, perhaps optimised using probabilistic techniques to identify the best combination (e.g. Gao et al., 2001; Xi and Fox, 2001; Tsikrika and Lalmas, 2002; Savoy and Picard, 2001). It is not possible to be definitive about commercial search engine algorithms, however, since they are kept secret apart from the broadest details. In fact academic research into Web IR is in a strange situation since research budgets and data sets could be expected to be dwarfed by those of the commercial giants, whose existence depends upon high quality results in an incredibly competitive market place. One paper that compared the two found that the academic systems were slightly better but the authors admitted that the tasks were untypical for Web users (Hawking et al., 2001a). Nevertheless, Google is one case amongst many of search algorithms gaining from approaches and developments in information science in general and bibliometrics in particular.

The alternative document models (Thelwall, 2002a) are an example of a theoretical approach from information science that may bring benefits to Web IR. The principle behind these models is that Web pages often naturally cluster into recognisable documents based upon the directory or domain that they are in. When working with links it can often make sense to utilise a directory or domain level of aggregation, especially if each individual page contains a set of identical links, perhaps in a standard navigation bar. The result of aggregation in such a case would be the removal of all duplicate links, giving a more appropriate link count. This approach has been shown to give improved academic link metrics (Thelwall, 2002a; Thelwall and Tang, 2003; Thelwall and Wilkinson, 2003; Thelwall and Harries, 2003). Further support for these models is given by their ability to cluster (sets of) Web pages in different and non-trivial ways (Thelwall, 2003). A natural question, therefore, is whether Web IR algorithms can benefit from the alternative document models.

In this paper, new versions of PageRank will be introduced using alternative document models. The effectiveness of these new ranking algorithms will be compared against that of the standard PageRank. Human ranking judgement will be used as the benchmark against which to compare different algorithms.

Versions of PageRank based on the alternative document model
PageRank was developed by the founders of Google, Sergey Brin and Lawrence Page (1998). The genius of the approach is that the algorithm is simple and intuitive, yet admits a mathematical implementation that scales to the billions of pages currently on the Web. For our purposes, since we are not modifying the mathematical algorithm of PageRank but only the document space upon which it is applied, we will describe the principle of PageRank but not the details of its implementation. The precise details of the maths and further descriptions can be found in the original PageRank paper (Brin and Page, 1998) as well as several other related papers (Haveliwala, 1999; Lifantsev, 2000; Ng et al., 2001; Thelwall, 2002b).

Essentially the approach used by PageRank can be described with a voting metaphor. At the start of the process, each Web page is allocated a vote p. For example, each page may be allocated the same value 0.1. Each page then shares a fraction of α [...] PageRank were used. In contrast, a purely text-matching algorithm would have great difficulty in deciding which page containing the matching text was the most relevant.

A criticism of the original PageRank is that many pages receive a high number of links for reasons other than their quality. For example, some sites have a standard navigation bar on each page, all containing a link to the home page and a few other pages. For the site itself, this probably does serve to indicate the most useful pages, but relative to other sites the total number of pages containing the link bar will be critical in determining the final PageRank of the targeted pages, meaning that larger sites will automatically rank higher. It has also been noted that links between pages within a site are typically for navigation purposes, and therefore are less reliable as indicators of target page quality than links between sites. Moreover, navigation bars sometimes contain links to other sites and one site often contains multiple links to another for reasons that are not related to target site quality. All of these factors undermine the effectiveness of PageRank as an indicator of the quality of a page.

An additional problem is the organisation of information by site, domain or directory. For example, a site containing much high quality information may receive many links to its home page, whereas its actual content is on tens of thousands of other pages under the home page, most of which do not receive many links. A case in point for this is the Microsoft site that includes an enormous body of authoritative information spread over many pages. In theory, links to the home page will redistribute through the layers of a site to these content carrying pages, but in practice this does not work (Thelwall, 2002b) and so the content pages will not reflect the prestige of the hosting site. This is an argument for including in ranking measures an assessment of the site as a whole in addition to the individual pages. A similar argument can be made for any coherent cluster of Web pages with a recognisable home page.

Based upon the arguments made above, the claim is that PageRank can be improved by incorporating rankings of a page based upon its hosting site, domain and directory. A precise definition of document models based upon these levels of aggregation is given below (taken from Thelwall, 2002a).

Individual Web page. Each separate HTML file is treated as a document for the purposes of extracting links. Each unique URL in a link is treated as pointing to a separate document for the purposes of finding link targets. URLs are truncated before any internal target marker '#' character is found, however, to avoid multiple references to different parts of the same page.

Directory. All HTML files in the same directory are treated as a single document. All target URLs are automatically shortened to the position of the last slash, and links from different pages in the same directory are combined and duplicates eliminated.

Domain name. As above except all HTML files with the same domain name are treated as a single document for both link sources and link targets. In particular, this clusters together all pages hosted by a single subdomain of a university site.

University. As above except that all pages belonging to a university are treated as a single document for both link sources and link targets.

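The four levels of aggregation defined above can be illustrated with a short Python sketch that maps a URL to the identifier of the document containing it. This is only an illustration, not the crawler's actual code: the function name `document_id` is hypothetical, absolute URLs with a scheme are assumed, and the university level is approximated by the last two labels of the host name (e.g. `uwo.ca`), which is an assumption that the definitions do not spell out.

```python
from urllib.parse import urlsplit


def document_id(url: str, level: str) -> str:
    """Map an absolute URL to its document identifier at the given level.

    level is one of "page", "directory", "domain" or "university".
    """
    parts = urlsplit(url)            # separates host, path, query and '#' fragment
    host = parts.hostname or ""      # lower-cased host name without any port
    path = parts.path or "/"

    if level == "page":
        # The '#' fragment is dropped, so links to different parts
        # of the same page count as one target.
        query = "?" + parts.query if parts.query else ""
        return host + path + query
    if level == "directory":
        # Shorten the URL to the position of the last slash.
        return host + path[: path.rfind("/") + 1]
    if level == "domain":
        return host
    if level == "university":
        # Assumption: the last two host labels identify the university.
        return ".".join(host.split(".")[-2:])
    raise ValueError(f"unknown document model: {level}")


url = "http://www.uwo.ca/sogs/index.html#top"
for level in ("page", "directory", "domain", "university"):
    print(level, "->", document_id(url, level))
```

Links are then recorded between document identifiers rather than between pages, so that identical links from several pages of one document collapse into a single link.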
Applying PageRank to these models means allocating votes at the appropriate document level and distributing them according to links identified as above. For example, in the case of the directory-based PageRank, it would start with a vote p being allocated to each directory and then a fraction α of it being redistributed equally to all directories that are linked to by this directory. The extra bonus vote (1-α)p would also be allocated to each directory. Subsequent voting rounds would then follow the same principle.

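The voting rounds just described can be sketched in a few lines of Python. This is a minimal illustration of the voting metaphor rather than the implementation used in the study: the function name `pagerank_votes` is hypothetical, the value α = 0.85 and the fixed number of rounds are assumptions, p = 0.1 follows the example value mentioned earlier, and documents with no outgoing links simply do not pass their votes on.

```python
def pagerank_votes(links, p=0.1, alpha=0.85, rounds=50):
    """Run repeated voting rounds over a document-level link graph.

    links maps each document to the set of documents it links to.
    alpha (assumed value) is the fraction of a vote that is passed on.
    """
    docs = set(links) | {t for targets in links.values() for t in targets}
    votes = {d: p for d in docs}                 # initial vote p for every document

    for _ in range(rounds):
        # Every document receives the bonus vote (1 - alpha) * p.
        new_votes = {d: (1 - alpha) * p for d in docs}
        for source in docs:
            targets = links.get(source, ())
            if not targets:
                continue                         # simplification: nothing is passed on
            share = alpha * votes[source] / len(targets)
            for target in targets:
                new_votes[target] += share       # equal share to each linked document
        votes = new_votes
    return votes


example = {"dir_a/": {"dir_c/"}, "dir_b/": {"dir_c/"}, "dir_c/": set()}
for doc, vote in sorted(pagerank_votes(example).items()):
    print(doc, round(vote, 4))
```

The same function applies unchanged to pages, directories, domains or universities, since only the document space differs between the models.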
Standard PageRank is based on the page level model described above. We introduce three new algorithms: PageRank using the directory, domain and university document models with the additional modification that only links between different sites (in our case universities) will be used. This is based upon the hypothesis that links inside a site are primarily for navigation purposes, whereas links to external sites are more reliable as indicators of target quality. The variants will be called intersite directory PageRank, intersite domain PageRank and intersite university PageRank. It would also be possible to apply PageRank to the page model after excluding internal site links, but this would not be effective since relatively few pages are targeted by other sites and so almost all pages would be ranked last.

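As a rough illustration of the intersite restriction, the sketch below builds a directory-level link graph from page-level links, discarding links whose source and target belong to the same site and merging duplicates. The helper names `site_of`, `directory_of` and `intersite_links` are hypothetical, and treating the last two labels of the host name as the site is an assumption made for the example.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Assumed site identifier: last two labels of the host name."""
    return ".".join(urlsplit(url).hostname.split(".")[-2:])


def directory_of(url):
    """Directory-level document: URL shortened to the last slash."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return parts.hostname + path[: path.rfind("/") + 1]


def intersite_links(page_links, to_document):
    """Aggregate (source_url, target_url) pairs into a document graph,
    keeping only links between different sites."""
    graph = {}
    for source, target in page_links:
        if site_of(source) == site_of(target):
            continue                              # internal (navigation) link
        src_doc, tgt_doc = to_document(source), to_document(target)
        graph.setdefault(src_doc, set()).add(tgt_doc)   # sets remove duplicates
        graph.setdefault(tgt_doc, set())
    return graph


page_links = [
    ("http://www.a.ca/x/1.html", "http://www.b.ca/y/p.html"),
    ("http://www.a.ca/x/2.html", "http://www.b.ca/y/q.html"),
    ("http://www.a.ca/x/1.html", "http://lib.a.ca/index.html"),
]
print(intersite_links(page_links, directory_of))
```

In this example the two links from `www.a.ca/x/` to `www.b.ca/y/` count once, and the link to `lib.a.ca` is dropped because it stays within the same site.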
Literature Review

Web IR algorithms
Although the main task of the early search engines such as the World Wide Web Worm (Chun, 1999) was to find Web pages, the rapid growth of the Web meant that technical development quickly switched to finding the most relevant pages for user queries. This led to increasingly refined text matching techniques, such as latent semantic indexing (Deerwester et al., 1990) where the query terms do not have to be in the page for it to be retrieved, but with link based algorithms, such as Google's and Kleinberg's, the relationship between pages and those surrounding them has become important. The success of link approaches has not been replicated in the computer science TREC tasks, however, perhaps due to an untypical test corpus used, or untypical tasks (Hawking et al., 2000).

Another trend is for the application of multiple techniques in a blend to obtain optimal results. For example, text matching can be combined with link algorithms and URL structure heuristics in order to identify home pages, an important task, as reflected in its inclusion in the TREC Web track. Various methods are available to identify the best weightings to use to combine these alternative techniques (e.g. Gao et al., 2001). One side-effect of this, however, is that the construction of an efficient piece of software will not lead to clear results about the usefulness of any one of the components of its algorithm. Conversely, evaluating one approach on its own, whilst yielding such results, will not yield an optimal system. One implication of this is that research into individual components can increasingly be seen as information science rather than computer science.

Other variations of PageRank
Several variations or generalisations of PageRank have been suggested. In fact its originators suggested a few modifications at the outset, including using a non-uniform pattern of initial votes so that PageRank could be personalised to the user, by giving their valued pages higher initial p values (Brin and Page, 1998). This approach can also be used to alter the PageRank results through the inclusion of another source of information about page quality. Bharat and Mihaila (2001) developed a new version of PageRank and demonstrated through user evaluations that its performance is comparable with the standard PageRank. Lifantsev (2000) developed a general theoretical model for applying variants of the PageRank technique. Haveliwala (1999) developed computing techniques to apply standard PageRank to smaller platforms. Meghabghab (2002) proposed a version based upon in and out degrees of nodes, but this did not produce improved results. Richardson and Domingos (2001) developed a combination of PageRank with content information, and probably this is what Google does already.

Search engine quality evaluation techniques
Although many measures have been used to assess the retrieval results of a search engine (e.g. Hawking et al., 2001a) the concern in this study is only with evaluating a search engine's ability to rank the pages retrieved on a particular topic. As a result, the normal questions of precision (the percentage of pages returned that are relevant to the topic) and recall (the percentage of relevant pages found on the Web) do not apply, since these are typically based upon binary decisions of relevance and not on relative merits of the pages themselves. For example, TREC type evaluations focus on whether each page does match the criteria of the search rather than on the quality of the page content.

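A small Python sketch may help to show why these set-based measures cannot separate two rankings of the same pages. The function name `precision_recall` and the page labels are hypothetical, and relevance is assumed to be a binary judgement as described above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for binary relevance judgements."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0   # share of returned pages that are relevant
    recall = hits / len(relevant) if relevant else 0.0        # share of relevant pages that were returned
    return precision, recall


relevant = {"p1", "p2", "p5"}
ranking_a = ["p1", "p2", "p3", "p4"]      # relevant pages listed first
ranking_b = ["p4", "p3", "p2", "p1"]      # same pages, reversed order

print(precision_recall(ranking_a, relevant))
print(precision_recall(ranking_b, relevant))
```

Both orderings receive identical precision and recall because only membership of the result set is measured, which is why a rank-based comparison is needed here.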
Evaluation of ranking performance has actually been a particularly troublesome and controversial aspect of search engine research. Many papers describing advances have given anecdotal rather than formal evaluations (Brin and Page, 1998). The relevance of the documents in TREC topics is formally evaluated in batches by a group of humans (Hawking et al., 1999) but this approach has been criticised on the grounds that only a real end user of information can successfully evaluate retrieval results (Gordon and Pathak, 1999). Another approach, unavailable to most researchers, is to analyse search engine log files to mine search patterns (e.g. Spink et al., 2001). Commercial search engines probably employ a combination of evaluation methods but none is ideal because of (a) the diversity of information on the Web and (b) the difficulty of getting a group of users to evaluate a similar set of results in a way that is not artificial. As a result, any evaluation process will necessarily be a compromise but the task of the researcher is to overcome these obstacles as effectively as possible.

Research questions
The questions addressed are whether any of the following alternative versions of PageRank produces improved rankings over standard PageRank. PageRank with internal site links excluded and based upon: the domain, the directory, or the university document model.

Four sets of Web pages on four different topics were selected for the study (details of the choice of pages are below). Each set of pages was ranked by human subjects (details below). Different versions of the PageRank algorithm were used to rank each set of pages and the ranking results compared with that of human subjects. The algorithm that generates a ranking closer to the human ranking is considered to be better.

Data Collection

Subjects of the study
Subjects of the study were students enrolled on the Information Retrieval course, part of the Master of Library and Information Science degree, in the summer term of 2002 at the Faculty of Information and Media Studies, University of Western Ontario, Canada. One of the assignments of the course was to rank a set of Web pages and then compare the ranking against those generated by different search algorithms to gain an understanding of search algorithms and search engines. Twenty-four students on the course were divided randomly into four groups of six people each. Each group was given a set of Web pages on a particular topic (details below) and each student independently ranked the pages in the way that he/she thought they should be ranked in a search output. The group then met and exchanged their rankings as well as the criteria used in the ranking. Each student then did another round of the ranking based on the discussion with other group members (they could choose not to change their ranking from the first round of the exercise). Students then proceeded with the other parts of the assignment that were not directly related to the study. For the purpose of this study, student ranking results were aggregated (details in data analysis below) and used as the benchmark against which to compare ranking results from different PageRank algorithms under investigation.

Based on the ethical principle of voluntary participation, students were given the choice of allowing their ranking data to be used for the study or not. All students on the course gave permission to use their data for the study.

Choice of page sets
Because all subjects in the study were Canadian graduate students, the topics of the pages to be ranked were all chosen to be related to Canadian university life so that students were knowledgeable about the subject and were competent to rank the pages. The following four topics were selected:
1. Ontario Graduate Scholarship in Science and Technology (referred to as OGS below).
2. Society of Graduate Studies at the University of Western Ontario (referred to as SOGS later).
3. Ombudsperson office at the University of Western Ontario (ombudsperson for short).
4. Admission requirements for the MBA program at the University of Toronto (MBA for short).

A set of Web pages on each topic was retrieved using three search engines (Google, AltaVista, and Teoma) and the top 10 pages retrieved by each engine were merged to form the set of pages for that particular topic. As a result, there were about 20 pages in each set to be ranked. When performing the search on the search engines, restrictions by domains were imposed to avoid the inclusion of totally irrelevant pages. For example, the search of pages on SOGS was restricted to the domain of www.uwo.ca (the university's URL) so that irrelevant pages that happened to have the word SOGS were not likely to be retrieved. The ranking of these pages by the search engines was not revealed to the subjects before they did the ranking to avoid possible bias.

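The merging of the three top-10 lists described above can be sketched as follows. The function name `merge_top_results` and the placeholder URLs are hypothetical; the point is only that pages returned by more than one engine are kept once, which is why thirty results reduce to about twenty distinct pages.

```python
def merge_top_results(result_lists, top_n=10):
    """Merge the top_n results of several engines, dropping duplicates
    while preserving the order of first appearance."""
    merged, seen = [], set()
    for results in result_lists:
        for url in results[:top_n]:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Placeholder result lists for three engines (illustrative only).
google = ["u1", "u2", "u3"]
altavista = ["u2", "u4"]
teoma = ["u3", "u5", "u1"]

print(merge_top_results([google, altavista, teoma], top_n=2))
```

The order of the merged list carries no meaning in the experiment, since the engines' own rankings were withheld from the subjects.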
Data for calculating PageRank scores
As explained above, the calculation of PageRank scores is based on the linking information among pages. Search engines such as Google use link structures among all pages in their database to calculate the PageRank scores. For the purpose of this study, a universe of pages must be defined on which to base the calculation of PageRank scores. It was decided to use all Canadian university Web pages as such a universe because: (1) it is impossible to cover all pages on the Web for a project; (2) all pages to be ranked are about Canadian universities so the links to these pages are most likely to come from other Canadian universities; (3) it is feasible to crawl this number of pages (3,930,113 in total) and record their linking information. The underlying assumption of this data collection method is that similar results would be obtained if a full search engine database were to be used. Although this assumption is impossible to verify, it is supported by the robustness of the PageRank algorithm (Ng et al., 2001). In any case, the performance of PageRank on any conceptually coherent set of pages is of interest and appropriate.

The URLs of all Canadian universities were obtained from an online list (Association of Universities and Colleges of Canada, 2002) and the exhaustivity of the set verified and supplemented using an unrelated print media source (Johnston, 2002). The list included all full universities as well as affiliated colleges. Each university Web site was then crawled by a specialist information science Web crawler (Thelwall, 2001a) to record link information. The crawler was designed to cover sites accurately, checking for duplicate pages exhaustively. The crawler can normally only find pages by following links iteratively from the home page and so pages that were not linked to would not have been covered. Two exceptions were made, however. Firstly, some universities' home pages did not contain any HTML links and so a standard crawl would return only one page. In these cases a page of links to all departmental home pages was sought and used as an alternative starting point. Secondly, the URLs of the four sets of pages used in the study were preloaded into the crawler to ensure that they would be covered, even if no links to them had been found. Some areas were excluded on the basis of being mirror sites or huge online databases with only internal links. The crawling was conducted in the summer of 2002, shortly before the pages for the experiment were ranked by the students.

Data Analysis
As discussed in 'Data collection', each subject ranked the set of pages twice. The second round of ranking, after the group discussion, represents the final ranking decision and was thus used for data analysis. Only 9 out of 24 subjects changed their ranking from the first round and most changes were minor, involving only a few pages. The average of the six group members' rankings was taken to represent human ranking for that set of pages. Although individual students' rankings differed, they were mostly correlated with each other, which provides some assurance of the reliability of the human ranking data. The ranking generated by each PageRank algorithm was correlated with the human ranking to see which algorithm was better (i.e. closer to human ranking). The Spearman correlation coefficient test was used because the human ranking scores are obviously ordinal data.

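The analysis steps above (averaging the subjects' rankings, then correlating the result with an algorithm's ranking) can be sketched in plain Python. The function names and the sample numbers are hypothetical, not data from the study. The sketch assumes that rank 1 is the best human rank and that a higher PageRank score is better, so the scores are negated before ranking; it returns `None` when one of the two inputs has no variation, in which case no coefficient can be computed.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks in ascending order; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                                # extend the run of tied values
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None                               # all values identical: undefined
    return cov / (var_x * var_y) ** 0.5


# Hypothetical rankings of four pages by three subjects (1 = best).
student_rankings = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
human = [mean(page) for page in zip(*student_rankings)]   # mean rank per page

scores = [0.30, 0.10, 0.20, 0.05]                 # hypothetical PageRank scores
print(spearman(human, [-s for s in scores]))      # negate: high score = rank 1
```

A library routine such as `scipy.stats.spearmanr` could be used instead; the hand-written version is shown only to keep the example self-contained.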
Results
The results of correlation tests are summarized in Table I. The four sets of pages are labelled with their acronyms (see 'Choice of page sets' above for a detailed description of the content of each set). The first column of data in Table I gives the correlation coefficients between human ranking and the ranking by the standard PageRank. The other columns show the correlation between human ranking and the ranking generated by various versions of PageRank employing alternative document models. The column labelled 'directory' represents the PageRank using the directory level document model. The columns labelled 'domain' and 'university' are for PageRanks using domain level and university level document models respectively.

Table I. Correlations between human ranking and ranking by algorithms

Page set       Standard    Intersite directory   Intersite domain   Intersite university
               PageRank    PageRank              PageRank           PageRank
OGS            -0.08       -0.06                 0.32               0.05
Ombudsperson   0.60        0.63                  N/A                N/A
MBA            0.2         -0.14                 -0.29              N/A
SOGS           0.27        N/A                   N/A                N/A

The N/A sign in Table I means that PageRank scores are the same or almost the same for all pages in the set and thus the correlation coefficient cannot be calculated.

It should be noted that the presence of so many N/A signs in Table I should not be interpreted to mean that the alternative document models would frequently not provide useful PageRank data. It is the result of the way that the pages were selected. Recall that restriction to a specific domain was necessary when forming the page set. For example, the SOGS page set was retrieved exclusively from the domain of www.uwo.ca. In fact the unique word SOGS caused the retrieved pages to all come from the same directory www.uwo.ca/sogs/. This explains why PageRank based on the directory, domain, and university level cannot provide data that distinguishes pages within this set. For this reason, this set had to be omitted from the tests of alternative document models.

Correlation coefficients that are statistically significant are shown in boldface in Table I. The standard PageRank had a significant correlation for only one out of the four sets of pages used in the study, the ombudsperson set. PageRank based on the directory level document model showed a slight improvement over the standard model.

The only page set that is appropriate to test the alternative document model is the OGS set because no restriction to a particular university's domain was imposed when forming this set (Ontario Graduate Scholarship is not restricted to a particular university). As a result, pages within this set come from different universities and the alternative document models were able to distinguish these pages well. For this set, the standard PageRank almost ranked the pages in the direction opposite to that by human subjects (the meaning of the negative correlation). PageRank based on the domain level document model shows an advantage over the standard model while the university level model showed only a very slight improvement.

Results from the MBA set came as a surprise in that the alternative document models showed a disadvantage over the standard PageRank model. It is not clear whether it is an anomalous case or whether the alternative document models are not appropriate in some cases. One possible explanation for the failure in this page set is that the PageRank scores calculated for this set are not reliable. Recall that the PageRank scores are calculated from the database that includes all Canadian university Web pages. The MBA page set is centred around the Web site of the Business School of the University of Toronto. Due to the nature of the School, there are many links to the Web site that are not from other Canadian universities. For example, a search of links to this site using the AltaVista search engine found over one hundred links from the .com domain. The PageRank calculation missed all these links and is therefore biased. This problem does not apply, or not to this extent, to other sets of test pages in the study. For example, the Web site that the ombudsperson set is centred around only has one link from the .com domain. Future studies can avoid this problem by a more careful examination of pages prior to the ranking experiment.

Discussion
The standard PageRank does not seem to be very effective in ranking Web pages in the study as shown by the fact that its rankings correlate significantly with human rankings for only one out of four sets of pages tested. Alternative approaches are needed to improve the effectiveness of PageRank. The study proposed and tested new versions of PageRank based on alternative document models. Although the results from the study do not provide clear evidence that the alternative models are better, they show that these models have some promise. In fact, the results from the OGS page set, the only set that is appropriate to test all the alternative document models, showed a substantial advantage of the intersite domain PageRank over the standard PageRank.

One fact has emerged clearly from this research: that it is difficult to assess the quality of Web ranking algorithms, especially those involving links, and especially for researchers that do not have access to a crawl of a sizeable percentage of the Web. A full scientific evaluation would involve huge human and computing resources: ideally a random selection of queries with results ranked by a representative set of users for whom the queries represented real information requests. In order to be able to choose queries at random, access to a major search engine server log and its database for calculating the ranking scores would be needed. The TREC approach (trec.nist.gov, Hawking et al., 2001b) to resolving a similar problem is a sensible one: to have a centrally organised and rated collection of pages that are shared for algorithm testing purposes by participating researchers. However, this does not yet satisfy our need because those pages are assigned a binary relevance score but not ranked by degree of relevance. For the reasons discussed above, the ranking task would be likely to be more complex and involve more, and more difficult, assessments than the currently employed binary relevance judgements.

Our compromise was to choose a small set of four queries that were relevant to a fixed group of end users and belonged to a coherent subset of the Web that could be crawled and assumed to be sufficiently large (3,930,113 pages) for ranking the page sets chosen. This would not be a problem if information needs, link creation and information distribution were known to be highly uniform and predictable on the Web, i.e. if the choice of topic for each set were known not to influence the effectiveness of a ranking algorithm, but we believe that this is not the case. On a large scale, link patterns appear to be reasonably predictable in some contexts (Thelwall, 2001b, 2002a) and over a large number of pages it seems intuitively clear that those with, say, three links to them would be, on average, slightly better quality than those with only two. Nevertheless, links are still typically created by individuals in an unsystematic fashion and not subject to any kind of quality control. As a result it is difficult to claim that three links to a page are likely to consistently indicate better target page content quality than two. This is more evident if it is acknowledged that factors other than quality can influence link counts, including target page age. As a result, any given link-based ranking algorithm is likely to be effective for some topics but ineffective for others. Moreover, with the low numbers of links likely to be involved in pages for some topics, it seems likely that even the most effective algorithm would regularly fail for a significant proportion of search topics. Therefore, it is probably not surprising that the proposed new algorithms in this study do not work well for all the search topics in the experiment. Future research in this area should design a wider range of search queries and avoid problems encountered in this study.

In summary, it seems that only researchers working for, or in conjunction with, a major search engine would be capable of fully assessing new Web ranking algorithms, and others will remain forced to extrapolate from the tests that they are able to run. The most promise for academic researchers probably lies with centralised initiatives such as TREC, although, as can be seen above, the choice of topics can impact on algorithms in different ways, depending on the details of their workings.

Conclusions
Although the study did not succeed in providing a definite answer to the research questions examined, it provided some evidence that the alternative PageRank algorithms proposed could have the potential to improve the standard PageRank model. The study succeeded in testing Web IR algorithms using an empirical study involving human subjects, a direction that was not followed by many previous studies. The ultimate value of any Web IR algorithm lies in its ability to serve human needs and thus the best way to test such algorithms is to see if they match those needs. Future research with alternative document model based ranking algorithms should keep the human ranking approach of the study but design a range of test queries that all involve pages from different Web sites.

Acknowledgement
We gratefully thank all students who participated in the study by giving permission for us to use their ranking data. The study would have been impossible without their support.

References

AltaVista (2002), AltaVista advanced search tutorial – link popularity, available at: help.altavista.com/adv_search/ast_haw_popularity (accessed 6 September 2002).

Association of Universities and Colleges of Canada (2002), The Directory of Canadian Universities – University Web sites, available at: www.aucc.ca/english/dcu/universities/universitysites.html (accessed 24 April 2002).

Bharat, K. and Mihaila, G.A. (2001), "When experts agree: using non-affiliated experts to rank popular topics", in Tenth International World Wide Web Conference, available at: www.www10.org/cdrom/papers/474/index.html

Brin, S. and Page, L. (1998), "The anatomy of a large scale hypertextual web search engine", Computer Networks and ISDN Systems, Vol. 30 No. 1-7, pp. 107-117, available at: citeseer.nj.nec.com/brin98anatomy.html

Chun, T.Y. (1999), "World Wide Web robots: an overview", Online & CD-ROM Review, Vol. 23 No. 3, pp. 135-142.

Crestani, F. and Lee, P.L. (2000), "Searching the Web by constrained spreading activation", Information Processing and Management, Vol. 36 No. 4, pp. 585-605.

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. (1990), "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Vol. 41 No. 6, pp. 391-407.

Gao, J., Walker, S., Robertson, S., Cao, G., He, H., Zhang, M. and Nie, J-Y. (2001), "TREC-10 Web Track Experiments at MSRA", TREC 2001, pp. 384-392, available at: trec.nist.gov/pubs/trec10/t10_proceedings.html

Gordon, M. and Pathak, P. (1999), "Finding information on the World Wide Web: the retrieval effectiveness of search engines", Information Processing & Management, Vol. 35, pp. 141-180.

Haveliwala, T. (1999), "Efficient computation of PageRank", Stanford University Technical Report, available at: dbpubs.stanford.edu:8090/pub/1999-31

Hawking, D., Bailey, P. and Craswell, N. (2000), "ACSys TREC-8 experiments", in Voorhees, E. and Harman, D. (Eds), Information Technology: Eighth Text Retrieval Conference (TREC-8), NIST, Gaithersburg, MD, USA, pp. 307-315.

Hawking, D., Craswell, N., Bailey, P. and Griffiths, K. (2001a), "Measuring search engine quality", Information Retrieval, Vol. 4 No. 1, pp. 33-59.

Hawking, D., Craswell, N., Thistlewaite, P. and Harman, D. (1999), "Results and challenges in Web search evaluation", 8th International World Wide Web Conference, available at: www8.org/w8-papers/2c-search-discover/results/results.html

Hawking, D., Craswell, N., Thistlewaite, P. and Harman, D. (2001b), "Results and challenges in Web search evaluation", Computer Networks, Vol. 31 No. 11-16, pp. 1321-1330, available at: www8.org/w8-papers/2c-search-discover/results/results.html

Johnston, A.D. (Ed.) (2002), The Maclean's Guide to Canadian Universities 2002, Rogers Publishing, Toronto, Canada.

Kleinberg, J. (1999), "Authoritative sources in a hyperlinked environment", Journal of the ACM, Vol. 46 No. 5, pp. 604-632.

Lifantsev, M. (2000), "Voting model for ranking Web pages", in Graham, P. and Maheswaran, M. (Eds), Proceedings of the International Conference on Internet Computing, CSREA Press, Las Vegas, Nevada, USA, pp. 143-148.

Meghabghab, G. (2002), "Google's Web page ranking applied to different topological Web graph structures", Journal of the American Society for Information Science and Technology, Vol. 52 No. 9, pp. 736-747.

Ng, A.Y., Zheng, A.X. and Jordan, M.I. (2001), "Stable algorithms for link analysis", in Croft, W., Harper, D., Kraft, D. and Zobel, J. (Eds), Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), ACM Press, New York, pp. 258-266.

Richardson, M. and Domingos, P. (2001), "The intelligent surfer: probabilistic combination of link and content information in PageRank", poster at Neural Information Processing Systems: Natural and Synthetic 2001, available at: www.cs.washington.edu/homes/mattr/doc/NIPS2001/qd-pagerank.pdf

Savoy, J. and Picard, J. (2001), "Retrieval effectiveness on the Web", Information Processing and Management, Vol. 37 No. 4, pp. 543-569.

Spink, A., Wolfram, D., Jansen, B.J. and Saracevic, T. (2001), "Searching the Web: the public and their queries", Journal of the American Society for Information Science and Technology, Vol. 52 No. 3, pp. 226-234.

Sullivan, D. (2002), "Google tops in 'search hours' ratings", Search Engine Watch, available at: searchenginewatch.com/sereport/02/05-ratings.html (accessed 6 September 2002).

Thelwall, M. (2001a), "A web crawler design for data mining", Journal of Information Science, Vol. 27 No. 5, pp. 319-325.

Thelwall, M. (2001b), "Extracting macroscopic information from Web links", Journal of the American Society for Information Science and Technology, Vol. 52 No. 13, pp. 1157-1168.

Thelwall, M. (2002a), "Conceptualizing documentation on the Web: an evaluation of different heuristic-based models for counting links between university Web sites", Journal of the American Society for Information Science and Technology, Vol. 53 No. 12, pp. 995-1005.

Thelwall, M. (2002b), "Subject gateway sites and search engine ranking", Online Information Review, Vol. 26 No. 2, pp. 101-107.

Thelwall, M. (2003), "A layered approach for investigating the topological structure of communities in the Web", Journal of Documentation, Vol. 59 No. 4, pp. 410-429.

Thelwall, M. and Harries, G. (2003), "The connection between the research of a university and counts of links to its Web pages: an investigation based upon a classification of the relationships of pages to the research of the host university", Journal of the American Society for Information Science and Technology, Vol. 54 No. 7, pp. 594-602.

Thelwall, M. and Tang, R. (2003), "Disciplinary and linguistic considerations for academic Web linking: an exploratory hyperlink mediated study with Mainland China and Taiwan", Scientometrics, Vol. 58 No. 1, pp. 153-179.

Thelwall, M. and Wilkinson, D. (2003), "Three target document range metrics for university Web sites", Journal of the American Society for Information Science and Technology, Vol. 54 No. 6, pp. 489-496.

Tsikrika, T. and Lalmas, M. (2002), "Combining Web document representations in a Bayesian Inference Network model using link and content-based evidence", in Proceedings of 24th European Colloquium on Information Retrieval Research (ECIR 2002), Glasgow, Scotland, pp. 53-72.

Xi, W. and Fox, E.A. (2001), "Machine Learning Approach for Homepage Finding Task", TREC 2001, pp. 686-697, available at: trec.nist.gov/pubs/trec10/t10_proceedings.html
