A Cautious Surfer for PageRank

Lan Nie, Baoning Wu, Brian D. Davison
Department of Computer Science & Engineering
Lehigh University
Bethlehem, PA 18015 USA
{lan2,baw4,davison}@cse.lehigh.edu

ABSTRACT
This work proposes a novel cautious surfer to incorporate trust into the process of calculating authority for web pages. We evaluate a total of sixty queries over two large, real-world datasets to demonstrate that incorporating trust can improve PageRank's performance.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Performance

Keywords
Web search engine, authority, trust, spam, ranking performance

1. INTRODUCTION
Traditional link analysis approaches like PageRank [5] generally assess the importance of a page based on the number and quality of pages linking to it. However, they assume that the content and links of a page can be trusted. Not only are the pages trusted, but they are trusted equally. Unfortunately, this assumption does not always hold given the adversarial nature of today's web.

To compensate, TrustRank [3] was introduced to propagate trust in the Web from a pre-labeled set of trusted pages, building on the assumption that good sites seldom point to bad sites. TrustRank's PageRank-based propagation flows trust to pages connected to the seed set, while spam sites are likely to get little trust, and are thus demoted in rank.

Unlike existing work that uses trust to identify or demote spam pages, we describe a novel approach to utilize trust estimates as hints to guide a web surfer's behavior, and demonstrate improvements in ranked retrieval. The trust estimates could come from any source, but for this work we focus on the use of TrustRank to generate trust scores.
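
As a rough illustration of the TrustRank propagation summarized above, the following Python sketch runs a PageRank-style iteration in which the random jump always returns to the trusted seed set. The damping factor of 0.85, the iteration count, the function name, and the toy graph are assumptions made for illustration and are not taken from this paper or from [3].

```python
def trustrank(out_links, seeds, damping=0.85, iterations=50):
    """Seed-biased PageRank sketch: random jumps land only on seed pages.

    out_links maps every page to the list of pages it links to; every
    link target must itself be a key of out_links.
    """
    nodes = list(out_links)
    # Jump distribution: uniform over the trusted seeds, zero elsewhere.
    jump = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * jump[n] for n in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * trust[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Assumption: a page without out-links hands its trust
                # back to the seed set.
                for n in nodes:
                    nxt[n] += damping * trust[src] * jump[n]
        trust = nxt
    return trust


# Hypothetical four-page graph: no trusted page links to the spam pages.
graph = {
    "seed": ["good"],
    "good": ["seed"],
    "spam": ["good", "spam2"],
    "spam2": ["spam"],
}
scores = trustrank(graph, seeds={"seed"})
```

In this toy graph the two spam pages receive no trust because no path from the seed reaches them, which is the behavior that trust-based demotion relies on.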

2. DIRECT TRUST-BASED RANKINGS
One might wonder "why not use TrustRank scores directly to represent authority?" As shown by Gyöngyi et al. [3] and other work of ours [6], trust-based algorithms can demote spam. Utilizing such approaches for retrieval ranking may sometimes improve search performance, especially for those "spam-specific" queries whose results would otherwise be overwhelmed by spam.

However, the goal of a search engine is to find good quality results; "spam-free" is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well-connected to those seeds.

In conclusion, trust cannot be equated to authority; however, trust information can assist us in calculating authority in a safer way by reducing contamination from spam. Instead of using TrustRank (or any other trust estimate) alone to calculate authority, we incorporate it into PageRank so that spam pages are penalized while highly authoritative pages (that are not otherwise known to be trustworthy) remain unharmed.

Copyright is held by the author/owner(s).
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

3. THE CAUTIOUS SURFER
In this section, we describe how to direct the web surfer's behavior by utilizing trust information. Unlike the random surfer described in the PageRank model, this cautious surfer carefully attempts to not let untrustworthy pages influence its behavior.

Imagine a wandering web surfer, considering what next page to visit. If the current page is trustworthy, the surfer is more likely to follow an outgoing link. In contrast, if the current page is untrustworthy, its recommendation will also be valueless or suspicious; as a result, the surfer is more likely to leave the current page and jump to a random page on the web. In addition, links may lead to targets with different trustworthiness. We bias our Cautious Surfer to favor more trustworthy pages when randomly jumping to a page.

The Cautious Surfer needs a trust estimate for each page. We assume that an estimate of a page's trustworthiness has been provided, e.g., from TrustRank. To smooth the trust distribution, we use the rank order instead of the trust value:

\[ t(j) = 1 - \frac{\mathrm{rank}(\mathrm{Trust}(j))}{N} \]

where Trust(j) represents the provided trustworthiness estimate of page j, N is the total number of pages, and rank(Trust(j)) is the rank of page j among all N pages when ordered by decreasing trust score.
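
The rank-based smoothing can be sketched in a few lines of Python. This is a minimal illustration, assuming that ranks start at 1 for the most trusted page and that ties are broken by input order; neither detail is specified in the text, and the function name and toy scores are hypothetical.

```python
def rank_smoothed_trust(trust):
    """Map raw trust scores to t(j) = 1 - rank(j) / N.

    Assumptions: rank 1 is the most trusted page, and tied scores keep
    their input order (Python's sort is stable).
    """
    n = len(trust)
    ordered = sorted(trust, key=trust.get, reverse=True)
    return {page: 1.0 - rank / n for rank, page in enumerate(ordered, start=1)}


# Hypothetical raw trust scores for four pages.
raw = {"a": 0.70, "b": 0.25, "c": 0.05, "d": 0.0}
t = rank_smoothed_trust(raw)
# t == {"a": 0.75, "b": 0.5, "c": 0.25, "d": 0.0}
```

Note how the heavily skewed raw scores become evenly spaced values, which is the smoothing effect the rank order is meant to provide.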

In this way, a given page j's authority in our Cautious Surfer model, CS(j), can be calculated as

\[ CS(j) = t(j) \left( \sum_{k:\,k \to j} \frac{CS(k)\, t(k)}{\sum_{i:\,k \to i} t(i)} + \sum_{m \in N} \frac{(1 - t(m))\, CS(m)}{t(m)} \right) \]

Label        BM2500    PageRank   TrustRank   Cautious Surfer
spam         16.67%    13.83%     12.13%      12.42%
normal       36.74%    44.37%     50.25%      49.30%
undecided     3.15%     2.96%      2.61%       2.67%
unknown      43.44%    38.84%     35.01%      35.61%
Table 1: Distribution of labels in top 10 results across 157 queries in the UK-2006 dataset.
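
The surfer described in this section can be simulated with a simple power iteration. The sketch below follows the prose model: from page k the surfer follows a link with probability t(k), choosing among out-links in proportion to their trust, and otherwise jumps to a page chosen in proportion to trust. Normalizing the jump by the total trust of all pages, and treating a page whose out-links carry no trust as always jumping, are assumptions of this sketch made so that the scores remain a probability distribution; they may differ from the authors' exact formulation. All names and the toy graph are hypothetical.

```python
def cautious_surfer(out_links, t, iterations=100):
    """Power-iteration sketch of a trust-biased surfer.

    out_links maps every page to its list of link targets (each target
    must be a key of out_links); t maps every page to a trust value in
    [0, 1], with at least one positive value.
    """
    nodes = list(out_links)
    total_trust = sum(t[n] for n in nodes)  # assumed jump normalizer
    cs = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = dict.fromkeys(nodes, 0.0)
        jump_mass = 0.0
        for k in nodes:
            link_trust = sum(t[i] for i in out_links[k])
            if link_trust > 0:
                # Follow a link with probability t(k), favoring trusted targets.
                for i in out_links[k]:
                    nxt[i] += cs[k] * t[k] * t[i] / link_trust
                jump_mass += cs[k] * (1.0 - t[k])
            else:
                # Assumption: no usable out-links, so the surfer always jumps.
                jump_mass += cs[k]
        # Random jumps are biased toward trustworthy pages.
        for j in nodes:
            nxt[j] += jump_mass * t[j] / total_trust
        cs = nxt
    return cs


# Hypothetical graph with rank-smoothed trust values.
graph = {"a": ["b"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
trust = {"a": 0.75, "b": 0.5, "c": 0.25, "d": 0.0}
cs = cautious_surfer(graph, trust)
```

In this toy run the least-trusted page "d" ends with zero authority, since it receives neither link mass nor jump mass, while total authority still sums to one.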

4. EXPERIMENTAL RESULTS
Here we report the performance of our Cautious Surfer (CS), PageRank (PR), and TrustRank (TR) on two large-scale datasets.

Experiments on UK-2006. This dataset is a crawl of the .uk domain [7] downloaded in May 2006 by Università degli Studi di Milano. There are 77M pages in this crawl from 11,392 different hosts. A labeled host list is also provided [1]. Within the list, 767 hosts are marked as spam by human judges, 7,472 hosts as normal, and 176 hosts marked as undecided (not clearly spam or normal). The remaining 2,977 hosts are marked as unknown (not judged). The TR and CS approaches require preselected seed sets; we report the average of five trials in which we randomly sample 10% of the labeled normal sites to form the trusted seed set. Since the labels are provided at the host level, we compute authority in the host graph.

To evaluate query-specific retrieval performance, we use a sample of 3.4M web pages (the first 400 crawled pages for each site in crawl order) from the full dataset. These pages inherit their authority score from their hosts, which is then combined with the BM2500 IR score for the final ranking. The combination is order-based, in which ranking positions based on authority score (weighted by 0.2) and IR score (weighted by 0.8) are summed together.
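
One possible reading of the order-based combination is sketched below in Python: each page gets a position in the authority ordering and a position in the IR ordering, and the weighted sum of the two positions determines the final order. That position 1 is best, that a lower combined value ranks higher, and how ties are broken are assumptions; the function names and toy scores are hypothetical.

```python
def rank_positions(scores):
    """Return 1-based positions, with position 1 for the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: pos for pos, doc in enumerate(ordered, start=1)}


def combine(authority, ir, w_auth=0.2, w_ir=0.8):
    """Order documents by the weighted sum of their two rank positions."""
    auth_pos = rank_positions(authority)
    ir_pos = rank_positions(ir)
    combined = {d: w_auth * auth_pos[d] + w_ir * ir_pos[d] for d in ir}
    # Assumption: a smaller combined position means a better final rank.
    return sorted(combined, key=combined.get)


# Hypothetical scores for three candidate pages of one query.
authority = {"p1": 0.5, "p2": 0.3, "p3": 0.2}
ir = {"p1": 1.0, "p2": 3.0, "p3": 2.0}
final = combine(authority, ir)
# Combined positions: p1 = 2.6, p2 = 1.2, p3 = 2.2, so final == ["p2", "p3", "p1"]
```

Working on positions rather than raw scores avoids having to reconcile the very different scales of authority and IR scores.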

We choose to focus on "hot" queries, those more likely to be of interest to search engine spammers. We selected popular queries from a 1999 Excite query log that contain at least one popular term (top 200) within the meta-keyword field from all pages within spam sites. This resulted in 157 hot queries.

Since the UK-2006 dataset is labeled, we can use the distribution of labeled sites as a measurement of ranking algorithm performance, as shown in Table 1. Since this is an automatic process without the constraints of human evaluation, we check the top 10 results for all 157 hot queries. Both TrustRank and the Cautious Surfer are able to noticeably improve upon the BM2500 and PageRank distributions. The similar distributions found between TrustRank and the Cautious Surfer (based on TrustRank calculations of trust) suggest that the Cautious Surfer is able to incorporate the spam removal value provided by the trust ranking.
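
The label-distribution measurement can be sketched as a simple tally over the top results of every query. The data layout (a host list per query and a host-to-label map) and all names below are assumptions made for illustration; hosts missing from the label map are counted as unknown, in line with the dataset description.

```python
from collections import Counter


def label_distribution(top_results, host_label, k=10):
    """Percentage of each label among the top-k result hosts of all queries."""
    counts = Counter(
        host_label.get(host, "unknown")
        for results in top_results.values()
        for host in results[:k]
    )
    total = sum(counts.values())
    return {label: 100.0 * c / total for label, c in counts.items()}


# Hypothetical toy data: two queries with two result hosts each.
top_results = {"q1": ["h1", "h2"], "q2": ["h3", "h4"]}
host_label = {"h1": "normal", "h2": "spam", "h3": "normal"}
dist = label_distribution(top_results, host_label)
# dist == {"normal": 50.0, "spam": 25.0, "unknown": 25.0}
```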

We consider whether the rankings are useful for retrieval next. We randomly selected 30 of the 157 queries for our relevance evaluation. Four members of our lab were each given queries and URLs (blind to the source ranking algorithm). For each query and URL pair, the evaluator decided the relevance using a five-level scale whose levels were translated into integer values from 2 to -2. We use the mean of all values of pairs generated by a ranking algorithm as score@10. If the average score for a pair is more than 0.5, it is marked as relevant. The average number of relevant URLs within the top ten results for the 30 queries is defined as precision@10.

                      UK-2006                WebBase
Method             Score@10   P@10       Score@10   P@10
PageRank           0.148      30.7%      0.668      55.7%
TrustRank          0.171      31.4%      0.747      59.3%
Cautious Surfer    0.180      32.4%      0.798      61.3%
Table 2: Ranking performance comparison.
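
The score@10 and precision@10 definitions can be made concrete with a short Python sketch. It assumes that each query and URL pair may carry several evaluator ratings that are averaged per pair, and that precision@10 is reported as the fraction of relevant results per query averaged over queries; the data layout, the names, and the toy ratings are hypothetical, and the toy queries have only two results each for brevity.

```python
def avg(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def evaluate(judgments, threshold=0.5):
    """Return (score, precision) for one ranking algorithm.

    judgments maps each query to a list with one entry per returned URL;
    each entry is the list of integer ratings (2 to -2) for that pair.
    """
    pair_scores = {q: [avg(r) for r in urls] for q, urls in judgments.items()}
    # Score: mean rating over all query/URL pairs.
    all_scores = [s for scores in pair_scores.values() for s in scores]
    score = avg(all_scores)
    # Precision: a pair is relevant if its mean rating exceeds the threshold.
    precision = avg(
        [sum(s > threshold for s in scores) / len(scores)
         for scores in pair_scores.values()]
    )
    return score, precision


# Hypothetical ratings: two queries, two URLs each, two evaluators per pair.
judgments = {"q1": [[2, 1], [0, -1]], "q2": [[1, 1], [-2, -2]]}
score, precision = evaluate(judgments)
# Pair means are 1.5, -0.5, 1.0, -2.0, so score == 0.0 and precision == 0.5
```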

The overall retrieval performance comparisons are shown in the left columns of Table 2. Cautious Surfer outperforms the other approaches on both precision and quality for top-10 results. Thus, we see that by incorporating estimates of trust, the Cautious Surfer is able to generate useful rankings for retrieval, and not just rankings with less spam.

Experiments on WebBase. The second dataset is a 2005 crawl from the Stanford WebBase [2]. It contains 58M pages and approximately 900M links, but no labels. To compensate, we label as good all pages in this dataset that also appear within the list of URLs referenced by the dmoz Open Directory Project. Note that these labels are page-based, so we can compute authority in the page-level graph directly. We chose 30 queries from the popular query list for evaluation of web pages in the WebBase dataset.

By testing on a second dataset, we get a better understanding of expected performance on future datasets. The WebBase dataset is of particular interest as it is a more typical graph of web pages (as compared to web hosts), and uses a much smaller seed set of good pages (just 0.17% of all pages in the dataset). The performance is shown in the right columns of Table 2. Again, the Cautious Surfer noticeably outperforms both PageRank and TrustRank, demonstrating that the approach retains its level of performance in both page-level and site-level web graphs.

5. CONCLUSION
In this paper we have described a methodology for incorporating trust into the calculation of PageRank-based authority. Additional details are available elsewhere [4]. The results on two large real-world datasets show that our Cautious Surfer model can improve search engines' ranking quality and demote web spam as well.

Acknowledgments. This work was supported in part by a grant from Microsoft Live Labs ("Accelerating Search") and the National Science Foundation under CAREER award IIS-0545875. We thank the Laboratory of Web Algorithmics, Università degli Studi di Milano and Yahoo! Research Barcelona for making the UK-2006 dataset and labels available and Stanford University for access to their WebBase collections.

6. REFERENCES
[1] C. Castillo, D. Donato, L. Becchetti, P. Boldi, M. Santini, and S. Vigna. A reference collection for web spam. ACM SIGIR Forum, 40(2), Dec. 2006.
[2] J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Transactions on Internet Technology, 6(2):153–186, 2006.
[3] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc. of the 30th Int'l Conf. on Very Large Data Bases (VLDB), pages 271–279, Toronto, Canada, Sept. 2004.
[4] L. Nie, B. Wu, and B. D. Davison. Incorporating trust into web search. Available as Technical Report LU-CSE-07-002, Dept. of Computer Science and Engineering, Lehigh University, 2007.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished draft, 1998.
[6] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proc. of Models of Trust for the Web workshop at the 15th Int'l World Wide Web Conf., Edinburgh, Scotland, May 2006.
[7] Yahoo! Research. Web collection UK-2006. http://research.yahoo.com/. Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/. URL retrieved Oct. 2006.