A Cautious Surfer for PageRank

Lan Nie, Baoning Wu, Brian D. Davison
Department of Computer Science & Engineering
Lehigh University
Bethlehem, PA 18015 USA
{lan2,baw4,davison}@cse.lehigh.edu

ABSTRACT
This work proposes a novel cautious surfer to incorporate trust into the process of calculating authority for web pages. We evaluate a total of sixty queries over two large, real-world datasets to demonstrate that incorporating trust can improve PageRank's performance.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Performance

Keywords
Web search engine, authority, trust, spam, ranking performance

1. INTRODUCTION
Traditional link analysis approaches like PageRank [5] generally assess the importance of a page based on the number and quality of pages linking to it. However, they assume that the content and links of a page can be trusted. Not only are the pages trusted, but they are trusted equally. Unfortunately, this assumption does not always hold given the adversarial nature of today's web.

To compensate, TrustRank [3] was introduced to propagate trust in the Web from a pre-labeled set of trusted pages, building on the assumption that good sites seldom point to bad sites. TrustRank's PageRank-based propagation flows trust to pages connected to the seed set, while spam sites are likely to get little trust, and are thus demoted in rank.

Unlike existing work that uses trust to identify or demote spam pages, we describe a novel approach to utilize trust estimates as hints to guide a web surfer's behavior, and demonstrate improvements in ranked retrieval. The trust estimates could come from any source, but for this work we focus on the use of TrustRank to generate trust scores.

2. DIRECT TRUST-BASED RANKINGS
One might wonder, "why not use TrustRank scores directly to represent authority?" As shown by Gyöngyi et al. [3] and other work of ours [6], trust-based algorithms can demote spam. Utilizing such approaches for retrieval ranking may sometimes improve search performance, especially for those "spam-specific" queries whose results would otherwise be overwhelmed by spam.

However, the goal of a search engine is to find good quality results; "spam-free" is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well-connected to those seeds.

In conclusion, trust cannot be equated to authority; however, trust information can assist us in calculating authority in a safer way by reducing contamination from spam. Instead of using TrustRank (or any other trust estimate) alone to calculate authority, we incorporate it into PageRank so that spam pages are penalized while highly authoritative pages (that are not otherwise known to be trustworthy) remain unharmed.

Copyright is held by the author/owner(s).
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

3. THE CAUTIOUS SURFER
In this section, we describe how to direct the web surfer's behavior by utilizing trust information. Unlike the random surfer described in the PageRank model, this cautious surfer carefully attempts to not let untrustworthy pages influence its behavior.

Imagine a wandering web surfer, considering what next page to visit. If the current page is trustworthy, the surfer is more likely to follow an outgoing link. In contrast, if the current page is untrustworthy, its recommendation will also be valueless or suspicious; as a result, the surfer is more likely to leave the current page and jump to a random page on the web. In addition, links may lead to targets with different trustworthiness. We bias our Cautious Surfer to favor more trustworthy pages when randomly jumping to a page.

The Cautious Surfer needs a trust estimate for each page. We assume that an estimate of a page's trustworthiness has been provided, e.g., from TrustRank. To smooth the trust distribution, we use the rank order instead of the trust value:

$$t(j) = 1 - \frac{\mathrm{rank}(\mathrm{Trust}(j))}{N}$$

where Trust(j) represents the provided trustworthiness estimate of page j, N is the total number of pages, and rank(Trust(j)) is the rank of page j among all N pages when ordered by decreasing trust score.
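
As a minimal sketch of this rank-based smoothing, the Python function below maps raw trust scores to t(j). The function name `rank_based_trust` and the tie-breaking rule (ties keep input order) are assumptions made for illustration; the text does not specify how ties are handled.

```python
def rank_based_trust(trust):
    """Map raw trust scores to t(j) = 1 - rank(Trust(j)) / N.

    Rank 1 is the most trusted page. Ties keep their input order,
    which is an assumption of this sketch, not something the paper states.
    """
    n = len(trust)
    # Page indices ordered by decreasing trust score (sorted() is stable).
    order = sorted(range(n), key=lambda j: -trust[j])
    t = [0.0] * n
    for rank, j in enumerate(order, start=1):
        t[j] = 1.0 - rank / n
    return t
```

Under this definition the least trusted page always receives t(j) = 0 and the most trusted page receives 1 - 1/N, regardless of how skewed the raw trust scores are.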

In this way, a given page j's authority in our Cautious Surfer model, CS(j), can be calculated as

$$CS(j) = t(j) \left( \sum_{k:k \to j} \frac{CS(k)\,t(k)}{\sum_{i:k \to i} t(i)} + \frac{\sum_{m \in N} (1 - t(m))\,CS(m)}{\sum_{m \in N} t(m)} \right)$$

Label       BM2500    PageRank   TrustRank   Cautious Surfer
spam        16.67%    13.83%     12.13%      12.42%
normal      36.74%    44.37%     50.25%      49.30%
undecided    3.15%     2.96%      2.61%       2.67%
unknown     43.44%    38.84%     35.01%      35.61%

Table 1: Distribution of labels in top 10 results across 157 queries in the UK-2006 dataset.
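
The Cautious Surfer recurrence given before Table 1 can be approximated by power iteration. The sketch below is illustrative only and rests on several assumptions that are not stated in the text: the graph is passed as a list of out-link lists (`out_links`), the iteration count is fixed, pages without usable out-links always jump, and the jump lands on page j with probability proportional to t(j), normalized by the total trust so that the scores keep summing to 1.

```python
def cautious_surfer(out_links, t, iterations=50):
    """Iterate a trust-biased random walk.

    out_links[k] lists the pages that page k links to; t[k] is the
    smoothed trust of page k, in [0, 1]. Assumes sum(t) > 0.
    """
    n = len(t)
    total_t = sum(t)
    cs = [1.0 / n] * n  # uniform start (an assumption)
    for _ in range(iterations):
        nxt = [0.0] * n
        jump = 0.0  # probability mass that leaves via a random jump
        for k in range(n):
            targets = out_links[k]
            denom = sum(t[i] for i in targets)
            if not targets or denom == 0.0:
                # No usable out-link: all of k's mass jumps (assumption).
                jump += cs[k]
                continue
            # With probability 1 - t(k) the surfer distrusts k and jumps.
            jump += (1.0 - t[k]) * cs[k]
            # Otherwise it follows a link, favoring trusted targets.
            share = cs[k] * t[k] / denom
            for j in targets:
                nxt[j] += t[j] * share
        # Jumps are biased toward trusted pages.
        for j in range(n):
            nxt[j] += t[j] * jump / total_t
        cs = nxt
    return cs
```

With this normalization each iteration preserves total probability mass, and a page whose smoothed trust is 0 ends up with a score of 0.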

4. EXPERIMENTAL RESULTS
Here we report the performance of our Cautious Surfer (CS), PageRank (PR), and TrustRank (TR) on two large scale datasets.

Experiments on UK-2006. This dataset is a crawl of the .uk domain [7] downloaded in May 2006 by Università degli Studi di Milano. There are 77M pages in this crawl from 11,392 different hosts. A labeled host list is also provided [1]. Within the list, 767 hosts are marked as spam by human judges, 7,472 hosts as normal, and 176 hosts as undecided (not clearly spam or normal). The remaining 2,977 hosts are marked as unknown (not judged). The TR and CS approaches require preselected seed sets; we report the average of five trials in which we randomly sample 10% of the labeled normal sites to form the trusted seed set. Since the labels are provided at the host level, we compute authority in the host graph.
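
The seed-selection protocol described above (five trials, each sampling 10% of the labeled normal hosts) might be sketched as follows. The function names, the fixed random seed, and the rounding of the sample size are assumptions added for reproducibility of the illustration; the paper does not describe these details.

```python
import random


def sample_seed_sets(normal_hosts, fraction=0.10, trials=5, seed=0):
    """Draw one random trusted seed set per trial.

    normal_hosts must be a list. Each trial samples without replacement;
    different trials are drawn independently and may overlap.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    k = max(1, round(fraction * len(normal_hosts)))
    return [rng.sample(normal_hosts, k) for _ in range(trials)]


def average_over_trials(values):
    """Mean of a per-trial metric, e.g. one score@10 value per trial."""
    return sum(values) / len(values)
```

Each seed set would then be fed to the trust computation, and the reported figure is the mean of the per-trial results.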

To evaluate query-specific retrieval performance, we use a sample of 3.4M web pages (the first 400 crawled pages for each site in crawl order) from the full dataset. These pages inherit their authority score from their hosts, which is then combined with the BM2500 IR score for the final ranking. The combination is order-based, in which ranking positions based on authority score (weighted by 0.2) and IR score (weighted by 0.8) are summed together.
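
A minimal sketch of this order-based combination is shown below. It assumes that ranking positions are computed over the candidate pages for a single query, that position 1 is the best, and that ties keep insertion order; the function names and dictionary inputs are hypothetical.

```python
def positions(scores):
    """Return the 1-based ranking position of each page, highest score first."""
    order = sorted(scores, key=lambda page: -scores[page])
    return {page: pos for pos, page in enumerate(order, start=1)}


def combine_by_order(authority, ir, w_auth=0.2, w_ir=0.8):
    """Rank pages by the weighted sum of their two ranking positions.

    authority and ir map the same page ids to scores. A smaller
    combined value means a better final rank.
    """
    pos_auth = positions(authority)
    pos_ir = positions(ir)
    combined = {p: w_auth * pos_auth[p] + w_ir * pos_ir[p] for p in ir}
    return sorted(combined, key=lambda page: combined[page])
```

Because only positions are combined, the result is insensitive to the very different scales of authority scores and BM2500 scores.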

We choose to focus on "hot" queries, those more likely to be of interest to search engine spammers. We selected popular queries from a 1999 Excite query log that contain at least one popular term (top 200) within the meta-keyword field from all pages within spam sites. This resulted in 157 hot queries.

Since the UK-2006 dataset is labeled, we can use the distribution of labeled sites as a measurement of ranking algorithm performance, as shown in Table 1. Since this is an automatic process without the constraints of human evaluation, we check the top 10 results for all 157 hot queries. Both TrustRank and the Cautious Surfer are able to noticeably improve upon the BM2500 and PageRank distributions: the share of spam in the top 10 falls from 16.67% (BM2500) and 13.83% (PageRank) to 12.13% and 12.42%, respectively. The similar distributions found between TrustRank and the Cautious Surfer (based on TrustRank calculations of trust) suggest that the Cautious Surfer is able to incorporate the spam removal value provided by the trust ranking. We consider whether the rankings are useful for retrieval next.

We randomly selected 30 of the 157 queries for our relevance evaluation. Four members of our lab were each given queries and URLs (blind to the source ranking algorithm). For each query and URL pair, the evaluator decided the relevance using a five level scale, which was translated into integer values from 2 to -2. We use the mean of all values of pairs generated by a ranking algorithm as score@10. If the average score for a pair is more than 0.5, it is marked as relevant. The average number of relevant URLs within the top ten results for the 30 queries is defined as precision@10.

                   UK-2006               WebBase
Method             Score@10   P@10       Score@10   P@10
PageRank           0.148      30.7%      0.668      55.7%
TrustRank          0.171      31.4%      0.747      59.3%
Cautious Surfer    0.180      32.4%      0.798      61.3%

Table 2: Ranking performance comparison.

The overall retrieval performance comparisons are shown in the left columns of Table 2. Cautious Surfer outperforms the other approaches on both precision and quality for top-10 results. Thus, we see that by incorporating estimates of trust, the Cautious Surfer is able to generate useful rankings for retrieval, and not just rankings with less spam.
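
The score@10 and precision@10 measures defined above could be computed roughly as follows. This is a sketch under assumptions: the input layout (`judgments` maps each query to a list of per-URL rating lists) is hypothetical, a pair rated by several evaluators is averaged first, and precision@10 is expressed as a fraction of the returned results so that it can be read as a percentage.

```python
def evaluate_at_10(judgments):
    """Compute (score@10, precision@10) for one ranking algorithm.

    judgments: {query: [ratings_for_url_1, ratings_for_url_2, ...]},
    where each ratings list holds integers from -2 to 2.
    """
    pair_means = []
    per_query_precision = []
    for url_ratings in judgments.values():
        # Average the evaluators' ratings for each of the top 10 URLs.
        means = [sum(r) / len(r) for r in url_ratings[:10]]
        pair_means.extend(means)
        # A pair counts as relevant when its average score exceeds 0.5.
        relevant = sum(1 for m in means if m > 0.5)
        per_query_precision.append(relevant / len(means))
    score = sum(pair_means) / len(pair_means)
    precision = sum(per_query_precision) / len(per_query_precision)
    return score, precision
```

Note that the relevance threshold is strict, so a pair averaging exactly 0.5 is not counted as relevant.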

Experiments on WebBase. The second dataset is a 2005 crawl from the Stanford WebBase [2]. It contains 58M pages and approximately 900M links, but no labels. To compensate, we label as good all pages in this dataset that also appear within the list of URLs referenced by the dmoz Open Directory Project. Note that these labels are page-based, so we can compute authority in the page level graph directly. We chose 30 queries from the popular query list for evaluation of web pages in the WebBase dataset.

By testing on a second dataset, we get a better understanding of expected performance on future datasets. The WebBase dataset is of particular interest as it is a more typical graph of web pages (as compared to web hosts), and uses a much smaller seed set of good pages (just 0.17% of all pages in the dataset). The performance is shown in the right columns of Table 2. Again, the Cautious Surfer noticeably outperforms both PageRank and TrustRank, demonstrating that the approach retains its level of performance in both page-level and site-level web graphs.

5. CONCLUSION
In this paper we have described a methodology for incorporating trust into the calculation of PageRank-based authority. Additional details are available elsewhere [4]. The results on two large real-world datasets show that our Cautious Surfer model can improve search engines' ranking quality and demote web spam as well.

Acknowledgments. This work was supported in part by a grant from Microsoft Live Labs ("Accelerating Search") and the National Science Foundation under CAREER award IIS-0545875. We thank the Laboratory of Web Algorithmics, Università degli Studi di Milano and Yahoo! Research Barcelona for making the UK-2006 dataset and labels available and Stanford University for access to their WebBase collections.

6. REFERENCES
[1] C. Castillo, D. Donato, L. Becchetti, P. Boldi, M. Santini, and S. Vigna. A reference collection for web spam. ACM SIGIR Forum, 40(2), Dec. 2006.
[2] J. Cho, H. Garcia-Molina, T. Haveliwala, W. Lam, A. Paepcke, S. Raghavan, and G. Wesley. Stanford WebBase components and applications. ACM Transactions on Internet Technology, 6(2):153–186, 2006.
[3] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc. of the 30th Int'l Conf. on Very Large Data Bases (VLDB), pages 271–279, Toronto, Canada, Sept. 2004.
[4] L. Nie, B. Wu, and B. D. Davison. Incorporating trust into web search. Available as Technical Report LU-CSE-07-002, Dept. of Computer Science and Engineering, Lehigh University, 2007.
[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished draft, 1998.
[6] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proc. of Models of Trust for the Web workshop at the 15th Int'l World Wide Web Conf., Edinburgh, Scotland, May 2006.
[7] Yahoo! Research. Web collection UK-2006. http://research.yahoo.com/. Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/. URL retrieved Oct. 2006.
