A Cautious Surfer for PageRank

Lan Nie, Baoning Wu, Brian D. Davison
Department of Computer Science & Engineering
Lehigh University
Bethlehem, PA 18015 USA
{lan2,baw4,davison}@cse.lehigh.edu

ABSTRACT
This work proposes a novel cautious surfer to incorporate trust into the process of calculating authority for web pages. We evaluate a total of sixty queries over two large, real-world datasets to demonstrate that incorporating trust can improve PageRank's performance.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Performance

Keywords
Web search engine, authority, trust, spam, ranking performance

1. INTRODUCTION
Traditional link analysis approaches like PageRank [5] generally assess the importance of a page based on the number and quality of pages linking to it. However, they assume that the content and links of a page can be trusted. Not only are the pages trusted, but they are trusted equally. Unfortunately, this assumption does not always hold given the adversarial nature of today's web.

To compensate, TrustRank [3] was introduced to propagate trust in the Web from a pre-labeled set of trusted pages, building on the assumption that good sites seldom point to bad sites. TrustRank's PageRank-based propagation flows trust to pages connected to the seed set, while spam sites are likely to get little trust, and are thus demoted in rank.

Unlike existing work that uses trust to identify or demote spam pages, we describe a novel approach to utilize trust estimates as hints to guide a web surfer's behavior, and demonstrate improvements in ranked retrieval. The trust estimates could come from any source, but for this work we focus on the use of TrustRank to generate trust scores.

2. DIRECT TRUST-BASED RANKINGS
One might wonder, "why not use TrustRank scores directly to represent authority?" As shown by Gyöngyi et al. [3] and other work of ours [6], trust-based algorithms can demote spam. Utilizing such approaches for retrieval ranking may sometimes improve search performance, especially for those "spam-specific" queries whose results would otherwise be overwhelmed by spam.

However, the goal of a search engine is to find good quality results; "spam-free" is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well-connected to those seeds.

In conclusion, trust cannot be equated to authority; however, trust information can assist us in calculating authority in a safer way by reducing contamination from spam. Instead of using TrustRank (or any other trust estimate) alone to calculate authority, we incorporate it into PageRank so that spam pages are penalized while highly authoritative pages (that are not otherwise known to be trustworthy) remain unharmed.

Copyright is held by the author/owner(s).
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

3. THE CAUTIOUS SURFER
In this section, we describe how to direct the web surfer's behavior by utilizing trust information. Unlike the random surfer described in the PageRank model, this cautious surfer carefully attempts to not let untrustworthy pages influence its behavior.

Imagine a wandering web surfer, considering what next page to visit. If the current page is trustworthy, the surfer is more likely to follow an outgoing link. In contrast, if the current page is untrustworthy, its recommendation will also be valueless or suspicious; as a result, the surfer is more likely to leave the current page and jump to a random page on the web. In addition, links may lead to targets with different trustworthiness. We bias our Cautious Surfer to favor more trustworthy pages when randomly jumping to a page.

The Cautious Surfer needs a trust estimate for each page. We assume that an estimate of a page's trustworthiness has been provided, e.g., from TrustRank. To smooth the trust distribution, we use the rank order instead of the trust value:

$$t(j) = 1 - \frac{rank(Trust(j))}{N}$$

where Trust(j) represents the provided trustworthiness estimate of page j, N is the total number of pages, and rank(Trust(j)) is the rank of page j among all N pages when ordered by decreasing trust score.
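
As a small illustration of this rank-order smoothing, the sketch below maps raw trust scores to t(j). The function name `rank_based_trust`, the sample scores, and the tie-breaking rule (input order) are assumptions made for the example; the text does not say how ties are handled.

```python
def rank_based_trust(trust_scores):
    """Map raw trust scores to t(j) = 1 - rank(Trust(j)) / N.

    Ranks are 1-based, with rank 1 given to the highest trust score.
    Ties are broken by input order, which is an assumption of this sketch.
    """
    n = len(trust_scores)
    # Page indices sorted by decreasing trust score (sorted() is stable).
    order = sorted(range(n), key=lambda j: -trust_scores[j])
    t = [0.0] * n
    for position, j in enumerate(order, start=1):
        t[j] = 1.0 - position / n
    return t


# Hypothetical trust scores for four pages.
scores = [0.50, 0.05, 0.30, 0.15]
print(rank_based_trust(scores))  # [0.75, 0.0, 0.5, 0.25]
```

Under this definition the least trusted page always receives t = 0 and the most trusted page receives 1 - 1/N, so the values are spread evenly regardless of how skewed the raw trust scores are.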

In this way, a given page j's authority in our Cautious Surfer model (CS(j)) can be calculated as

$$CS(j) = t(j) \left( \sum_{k:k \to j} \frac{CS(k)\,t(k)}{\sum_{i:k \to i} t(i)} + \sum_{m \in N} \frac{(1 - t(m))\,CS(m)}{t(m)} \right)$$

Label       BM2500   PageRank   TrustRank   Cautious Surfer
spam        16.67%   13.83%     12.13%      12.42%
normal      36.74%   44.37%     50.25%      49.30%
undecided    3.15%    2.96%      2.61%       2.67%
unknown     43.44%   38.84%     35.01%      35.61%

Table 1: Distribution of labels in top 10 results across 157 queries in the UK-2006 dataset.
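
The surfer behavior described in this section can be sketched as a power iteration. This is a minimal sketch under stated assumptions, not the authors' implementation: at page k the surfer follows an out-link with probability t(k), picking targets in proportion to their trust, and otherwise jumps. The normalization of the jump step by the total trust, the handling of pages without usable out-links, the fixed iteration count, and the names `cautious_surfer` and `out_links` are all assumptions introduced here.

```python
def cautious_surfer(out_links, t, iterations=50):
    """Power iteration for a trust-biased random walk.

    out_links[k] lists the pages that page k links to;
    t[k] is the trust estimate of page k, in [0, 1].
    """
    n = len(t)
    total_trust = sum(t)          # must be positive
    cs = [1.0 / n] * n            # uniform starting distribution
    for _ in range(iterations):
        nxt = [0.0] * n
        jump_mass = 0.0
        for k in range(n):
            link_trust = sum(t[i] for i in out_links[k])
            if link_trust > 0:
                # Follow a link with probability t(k), favoring trusted targets.
                for j in out_links[k]:
                    nxt[j] += cs[k] * t[k] * t[j] / link_trust
                jump_mass += cs[k] * (1.0 - t[k])
            else:
                # Assumption: with no usable out-link the surfer always jumps.
                jump_mass += cs[k]
        # Assumption: a jump lands on page j with probability t(j) / total trust.
        for j in range(n):
            nxt[j] += jump_mass * t[j] / total_trust
        cs = nxt
    return cs


# Hypothetical four-page graph; page 1 is the least trusted (t = 0).
out_links = [[2], [0, 2], [0, 3], [0]]
t = [0.75, 0.0, 0.5, 0.25]
print(cautious_surfer(out_links, t))
```

With these choices the scores stay a probability distribution, and a page with t = 0 ends up with zero authority because neither links nor jumps ever lead to it.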

4. EXPERIMENTAL RESULTS
Here we report the performance of our Cautious Surfer (CS), PageRank (PR), and TrustRank (TR) on two large-scale datasets.

Experiments on UK-2006. This dataset is a crawl of the .uk domain [7] downloaded in May 2006 by Università degli Studi di Milano. There are 77M pages in this crawl from 11,392 different hosts. A labeled host list is also provided [1]. Within the list, 767 hosts are marked as spam by human judges, 7,472 hosts as normal, and 176 hosts marked as undecided (not clearly spam or normal). The remaining 2,977 hosts are marked as unknown (not judged). The TR and CS approaches require preselected seed sets; we report the average of five trials in which we randomly sample 10% of the labeled normal sites to form the trusted seed set. Since the labels are provided at the host level, we compute authority in the host graph.

To evaluate query-specific retrieval performance, we use a sample of 3.4M web pages (the first 400 crawled pages for each site in crawl order) from the full dataset. These pages inherit their authority score from their hosts, which is then combined with the BM2500 IR score for the final ranking. The combination is order-based, in which ranking positions based on authority score (weighted by 0.2) and IR score (weighted by 0.8) are summed together.
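
The order-based combination can be illustrated with a short sketch. It assumes that rank positions are 1-based, that a lower weighted sum means a better final position, and that ties keep their input order; the function names and the sample scores are hypothetical.

```python
def rank_positions(scores):
    """Return the 1-based rank position of each item, best score first."""
    order = sorted(scores, key=lambda doc: -scores[doc])
    return {doc: pos for pos, doc in enumerate(order, start=1)}


def combine(authority, ir, w_auth=0.2, w_ir=0.8):
    """Order documents by the weighted sum of their two rank positions."""
    auth_pos = rank_positions(authority)
    ir_pos = rank_positions(ir)
    combined = {d: w_auth * auth_pos[d] + w_ir * ir_pos[d] for d in ir}
    # A smaller weighted position sum means a better final rank.
    return sorted(combined, key=lambda d: combined[d])


# Hypothetical authority and IR scores for three pages.
authority = {"a": 0.9, "b": 0.5, "c": 0.1}
ir = {"a": 1.0, "b": 3.0, "c": 2.0}
print(combine(authority, ir))  # ['b', 'c', 'a']
```

Because only positions are combined, the result does not depend on the scale of either score, which is useful when authority values and IR scores are not directly comparable.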

We choose to focus on "hot" queries, those more likely to be of interest to search engine spammers. We selected popular queries from a 1999 Excite query log that contain at least one popular term (top 200) within the meta-keyword field from all pages within spam sites. This resulted in 157 hot queries.

Since the UK-2006 dataset is labeled, we can use the distribution of labeled sites as a measurement of ranking algorithm performance, as shown in Table 1. Since this is an automatic process without the constraints of human evaluation, we check the top 10 results for all 157 hot queries. Both TrustRank and the Cautious Surfer are able to noticeably improve upon the BM2500 and PageRank distributions. The similar distributions found between TrustRank and the Cautious Surfer (based on TrustRank calculations of trust) suggest that the Cautious Surfer is able to incorporate the spam removal value provided by the trust ranking.
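
A label distribution of this kind could be computed along the following lines. This is only a sketch: it assumes each result is represented by its host name, that hosts missing from the label list count as "unknown", and the names `label_distribution`, `results_per_query`, and `host_label` are hypothetical.

```python
from collections import Counter


def label_distribution(results_per_query, host_label, k=10):
    """Percentage of each host label among the top-k results of all queries."""
    counts = Counter()
    for ranked_hosts in results_per_query.values():
        for host in ranked_hosts[:k]:
            # Assumption: hosts without a label are counted as "unknown".
            counts[host_label.get(host, "unknown")] += 1
    total = sum(counts.values())
    return {label: 100.0 * c / total for label, c in counts.items()}


# Hypothetical results for two queries and a partial label list.
results = {"q1": ["h1", "h2"], "q2": ["h2", "h3"]}
labels = {"h1": "spam", "h2": "normal"}
print(label_distribution(results, labels))
```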

We consider whether the rankings are useful for retrieval next. We randomly selected 30 of the 157 queries for our relevance evaluation. Four members of our lab were each given queries and URLs (blind to the source ranking algorithm). For each query and URL pair, the evaluator decided the relevance using a five-level scale, which was translated into integer values from 2 to -2. We use the mean of all values of pairs generated by a ranking algorithm as score@10. If the average score for a pair is more than 0.5, it is marked as relevant. The average number of relevant URLs within the top ten results for the 30 queries is defined as precision@10.

                   UK-2006              WebBase
Method             Score@10   P@10      Score@10   P@10
PageRank           0.148      30.7%     0.668      55.7%
TrustRank          0.171      31.4%     0.747      59.3%
Cautious Surfer    0.180      32.4%     0.798      61.3%

Table 2: Ranking performance comparison.
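
The two measures can be sketched as follows. This is one reading of the definitions above, with assumptions: each pair's ratings are averaged first, score@10 is the mean of those pair averages, and precision@10 is reported as a fraction of the judged results per query. The function name and the sample judgments are hypothetical.

```python
from statistics import mean


def evaluate(judgments, threshold=0.5):
    """Return (score@10, precision@10) for one ranking algorithm.

    judgments[query][url] is the list of integer ratings in [-2, 2]
    given to that query-URL pair among the algorithm's top results.
    """
    pair_scores = []
    precisions = []
    for urls in judgments.values():
        per_pair = [mean(ratings) for ratings in urls.values()]
        pair_scores.extend(per_pair)
        # A pair counts as relevant when its average rating exceeds 0.5.
        relevant = sum(1 for s in per_pair if s > threshold)
        precisions.append(relevant / len(per_pair))
    return mean(pair_scores), mean(precisions)


# Hypothetical ratings: two queries, two judged URLs each.
judgments = {
    "q1": {"u1": [2, 1], "u2": [-1, 0]},
    "q2": {"u3": [1, 0], "u4": [2, 2]},
}
print(evaluate(judgments))  # (0.875, 0.5)
```

Note that a pair averaging exactly 0.5 (such as u3 here) is not counted as relevant, since the threshold is strict.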

The overall retrieval performance comparisons are shown in the left columns of Table 2. Cautious Surfer outperforms the other approaches on both precision and quality for top-10 results. Thus, we see that by incorporating estimates of trust, the Cautious Surfer is able to generate useful rankings for retrieval, and not just rankings with less spam.

Experiments on WebBase. The second dataset is a 2005 crawl from the Stanford WebBase [2]. It contains 58M pages and approximately 900M links, but no labels. To compensate, we label as good all pages in this dataset that also appear within the list of URLs referenced by the dmoz Open Directory Project. Note that these labels are page-based, so we can compute authority in the page-level graph directly. We chose 30 queries from the popular query list for evaluation of web pages in the WebBase dataset.

By testing on a second dataset, we get a better understanding of expected performance on future datasets. The WebBase dataset is of particular interest as it is a more typical graph of web pages (as compared to web hosts), and uses a much smaller seed set of good pages (just 0.17% of all pages in the dataset). The performance is shown in the right columns of Table 2. Again, the Cautious Surfer noticeably outperforms both PageRank and TrustRank, demonstrating that the approach retains its level of performance in both page-level and site-level web graphs.

5. CONCLUSION
In this paper we have described a methodology for incorporating trust into the calculation of PageRank-based authority. Additional details are available elsewhere [4]. The results on two large real-world datasets show that our Cautious Surfer model can improve search engines' ranking quality and demote web spam as well.

Acknowledgments. This work was supported in part by a grant from Microsoft Live Labs ("Accelerating Search") and the National Science Foundation under CAREER award IIS-0545875. We thank the Laboratory of Web Algorithmics, Università degli Studi di Milano and Yahoo! Research Barcelona for making the UK-2006 dataset and labels available and Stanford University for access to their WebBase collections.

6. REFERENCES
[1] C. Castillo, D. Donato, L. Becchetti, P. Boldi, M. Santini, and S. Vigna. A reference collection for web spam. ACM SIGIR Forum, 40(2), Dec. 2006.
[2] J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Transactions on Internet Technology, 6(2):153–186, 2006.
[3] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc. of the 30th Int'l Conf. on Very Large Data Bases (VLDB), pages 271–279, Toronto, Canada, Sept. 2004.
[4] L. Nie, B. Wu, and B. D. Davison. Incorporating trust into web search. Available as Technical Report LU-CSE-07-002, Dept. of Computer Science and Engineering, Lehigh University, 2007.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished draft, 1998.
[6] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proc. of Models of Trust for the Web workshop at the 15th Int'l World Wide Web Conf., Edinburgh, Scotland, May 2006.
[7] Yahoo! Research. Web collection UK-2006. http://research.yahoo.com/. Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/. URL retrieved Oct. 2006.