Analysing Features of Japanese Splogs and Characteristics of Keywords

Yuuki Sato, Takehito Utsuro
University of Tsukuba, Tsukuba, 305-8573, JAPAN

Tomohiro Fukuhara
University of Tokyo, Kashiwa 277-8568, JAPAN

Yasuhide Kawada
Navix Co., Ltd., Tokyo, 141-0031, JAPAN

Yoshiaki Murakami
Navix Co., Ltd., Tokyo, 141-0031, JAPAN

Hiroshi Nakagawa
University of Tokyo, Tokyo, 113-0033, JAPAN

Noriko Kando
National Institute of Informatics, Tokyo, 101-8430, JAPAN

ABSTRACT
This paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. We estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. Since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently (manually) collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence. We manually examine various features of collected blog homepages regarding whether their text content is excerpted from other sources or not, as well as whether they display affiliate advertisements or out-going links to affiliated sites. Among various informative results, it is important to note that more than half of the collected splogs are created by a very small number of spammers.

Categories and Subject Descriptors
H.3.0 [INFORMATION STORAGE AND RETRIEVAL]: General

General Terms
Reliability

Keywords
Blog analysis, splog, time series characteristics of keywords, keyword bursts

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
AIRWeb '08, April 22, 2008 Beijing, China.
Copyright 2008 ACM 978-1-60558-159-0 ...$5.00.

1. INTRODUCTION
Weblogs, or blogs, are regarded as a form of personal journal or of market and product commentary. While traditional search engines continue to discover and index blogs, the blogosphere has produced custom blog search and analysis engines, systems that employ specialized information retrieval techniques.
There are several previous works and services on blog analysis systems. [13] proposed a system called blogWatcher that collects and analyzes Japanese blog articles. [6] proposed a system called BlogPulse that analyzes trends of blog articles. With respect to blog analysis services on the Internet, there are several commercial and non-commercial services such as Technorati^1, BlogPulse^2, kizasi.jp^3, and blogWatcher^4. With respect to multilingual blog services, Globe of Blogs^5 provides a retrieval function of blog articles across languages. Best Blogs in Asia Directory^6 also provides a retrieval function for Asian language blogs. Blogwise^7 also analyzes multilingual blog articles.

As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone [7, 1, 10, 12, 9]. Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the PageRank of target sites. [10] reported that for English blogs, around 88% of all pinging URLs (i.e., blog homepages) are splogs, which account for about 75% of all pings. Based on this estimation, as stated in [1, 11], splogs can cause problems including the degradation of information retrieval quality and the significant waste of network and storage resources. Several previous works [10, 12, 9] reported important characteristics of splogs. [12] reported characteristics of ping time series, in-degree/out-degree distributions, and typical words in splogs found in the TREC^8 Blog06 data collection. [10, 9] also reported the results of analyzing splogs in the BlogPulse data set. In the context of semi-automatically collecting web spams including splogs, [16] discuss how to collect spammer-targeted keywords to be used when collecting a large number of web spams efficiently.

Unlike those previous works, this paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them [14].
As has been often noted in the previous works, text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. Considering this fact, in this work, we estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. The characteristics of a keyword to which we pay attention in this paper are whether the keyword is of public/private concern as well as the duration of people's concern to the keyword. Furthermore, since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence. We then manually examine various features of collected blog homepages regarding whether their text contents are excerpts from other sources or not, as well as whether they display affiliate advertisements or out-going links to affiliated sites. Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of spammers, and hence, the analysis reported in this paper is strongly affected by the choices of those spammers when they create those splogs.

^1 http://technorati.com/
^2 http://www.blogpulse.com/
^3 http://kizasi.jp/ (in Japanese)
^4 http://blogwatcher.pi.titech.ac.jp/ (in Japanese)
^5 http://www.globeofblogs.com/
^6 http://www.misohoni.com/bba/
^7 http://www.blogwise.com/
^8 http://trec.nist.gov/

Table 1: Features for Characterizing Splogs and their Rates in Splog Data Set

Feature Type | Feature | Description | Rate in Splogs (%)
Affiliate Features | links to affiliated sites | Blog articles (posts) contain sufficiently many out-going links to affiliated sites, except for the out-going links that the blog hosts automatically add to individual blog homepages and blog posts. | 80.5
Affiliate Features | advertisement articles (posts) | Blog articles (posts) themselves contain sufficiently many advertisements, except for the advertisements that the blog hosts automatically add to individual blog homepages and blog posts. | 31.0
Affiliate Features | articles (posts) with adult content | Blog articles (posts) contain adult content. | 8.1
Affiliate Features | keywords with popup advertisement | Certain blog hosts have facilities of automatically adding popup advertisements to keywords. | 42.1
Content Source Features | excerpt from news articles | Text content is automatically or manually excerpted from news articles. | 14.3
Content Source Features | excerpt from blog articles (posts) or other web texts | Text content is automatically or manually excerpted from other blog articles (posts), or web texts other than news articles and advertisement pages. | 70.8
Content Source Features | excerpt from advertisement pages | Text content is automatically or manually excerpted from certain advertisement pages. | 27.1
Content Source Features | originally written texts | Spammers write original splog texts. | 2.9
Content Source Features | meaningless sequence of words | Most of them are so called word salad spam text [2] and are automatically generated. | 3.6
Creation Procedure Features | excerpt from other sources, selected without keyword retrieval | Text content is automatically or manually excerpted from other sources without keyword retrieval. Typical cases are excerpts from news articles or blog posts on the same date or close dates. | 12.7
Creation Procedure Features | excerpt from other sources, retrieved with a keyword varying day by day | Text content is automatically or manually retrieved from other sources with a keyword varying day by day, and then excerpted. | 49.5
Creation Procedure Features | excerpt from other sources, retrieved with a single keyword throughout a blog homepage | For a blog homepage, all of its text content is excerpted text, which is automatically or manually retrieved from other sources with a single keyword throughout all of its posts. | 36.9
Creation Procedure Features | keyword stuffed blog [9] | Blog articles (posts) contain lists of keywords for SEO purposes. | 11.5
Creation Procedure Features | automatically generated text | Most of them are so called word salad spam text [2], which is a mixture of seemingly meaningful words that together signify nothing. Sometimes such text connects several sentences, each of which is excerpted from another source. | 4.5

2. PROCEDURE OF CREATING SPLOGS
Text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. In any case, splogs have commercial intention: they display affiliate advertisements or out-going links to affiliated sites. For this purpose, splogs are usually created by searching for up-to-date content from other sources and by excerpting them. This procedure of creating splogs can be roughly divided into the following two cases:

i) excerpting text content from news articles or blog posts on the same date or close dates without keyword retrieval,
ii) excerpting text content by retrieving them from other sources with certain keywords.

[Figure 1: Time Series Characteristics of Keyword Occurrence Statistics in Splogs / Authentic Blogs. Panels: (a) keyword with burst, (b) keyword without burst. Each panel plots the time series for authentic blogs and for splogs.]

Splog posts created by the first procedure just a few days before the current date tend to contain up-to-date text content which is originally from quite recent news articles or blog posts. On the other hand, for splogs created by the second procedure, spammers usually carefully choose keywords for retrieving text content from other sources such as news articles and blog posts. They tend to choose high paying adsense^9 keywords.

3. FEATURES FOR CHARACTERIZING SPLOGS
This section describes the features for characterizing Japanese splog homepages manually collected by the procedure of section 5.3. As we summarize in Table 1, this paper considers the following three types of features for splogs, namely, 1) affiliate features, 2) content source features, and 3) creation procedure features. For each of these three feature types, Table 1 lists several binary features, each of which denotes whether the given splog homepage has the designated characteristics or not. Here, note that features of the same type are independent of each other and hence are not necessarily disjoint. Also note that most of those features are for use in manual examination of splogs, and hence, they are not necessarily meant for detecting splogs automatically.

3.1 Affiliate Features
Among the three feature types, first we describe affiliate features. As introduced in [10, 9], splogs are generated with two often overlapping motives, namely, creation of fake blogs for the purpose of hosting profitable advertisement, and unjustifiably increasing the ranking of affiliated sites. Since both motives are deeply related to affiliate advertising, in this paper, we consider features of splogs regarding issues of affiliates. As the affiliate features, we manually examine the following four points:

i) whether the blog articles (posts) contain out-going links to affiliated sites,
ii) whether the blog articles (posts) themselves contain advertisements,
iii) whether blog articles (posts) contain adult content^10,
iv) whether blog articles (posts) contain popup advertisements automatically added to certain keywords.

^9 http://google.com/adsense

3.2 Content Source Features
Second, one of the important characteristics of splogs is that their text content is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. In order to estimate the mechanism of creating splogs, we manually examine the content source of splogs and classify them according to the following five features, namely, content source features:

i) excerpt from news articles,
ii) excerpt from blog articles (posts) or other web texts,
iii) excerpt from advertisement pages,
iv) originally written texts,
v) meaningless sequence of words such as word salad spam texts [2].

3.3 Creation Procedure Features
Furthermore, we estimate the procedures of searching the web for those excerpts and manually classify them according to the following five features, namely, creation procedure features:

i) excerpt from other sources, selected without keyword retrieval, where typical cases are excerpts from news articles or blog posts on the same date or close dates,
ii) excerpt from other sources, retrieved with a keyword varying day by day,
iii) excerpt from other sources, retrieved with a single keyword throughout a blog homepage,
iv) keyword stuffed blog [9],
v) automatically generated text including word salad spam texts [2].

As the creation procedure features, we distinguish two major procedures of creating splogs, i.e., a) excerpt from news articles or blog posts on the same date or close dates without keyword retrieval, and b) excerpt by retrieving texts from other sources with certain keywords. The former type corresponds to the feature i) above, while the latter corresponds to the features ii) and iii) above.

^10 Adult content is among the major target genres for affiliate advertising, while other major target genres include health food and slimming products, cosmetics, and finance. We regard blogs which contain adult content as more harmful than others, and record them with an independent feature.

[Figure 2: A Keyword Map for Characterizing Keywords]

4. CHARACTERISTICS OF SPLOGS AND KEYWORDS
4.1 Time Series Characteristics of Keywords
Among the problems caused by splogs, this section discusses issues on noise in word occurrence statistics in the blogosphere. Figure 1 illustrates two typical cases of noise in time series keyword occurrence statistics, where (a) is the case of a keyword with burst, and (b) is the case of a keyword without burst. For both cases, keyword occurrences are a mixture of those from authentic blogs and splogs. Without detecting and removing splogs, it is difficult to estimate real keyword occurrence statistics only in authentic blogs. For the case of the keywords with burst, especially, it is estimated that the burst in splogs may be delayed from that in authentic blogs, because text content of splogs is mostly excerpted from other sources such as news articles and blog posts.
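
The mixture effect described in section 4.1 can be illustrated with a small sketch. The daily counts below are hypothetical values chosen for illustration, not data from this paper, and the helper names `observed_counts` and `peak_date` are assumptions of the sketch. It shows how a delayed burst in splogs can move the date of the most frequent occurrence away from the burst date in authentic blogs.

```python
from datetime import date

# Hypothetical daily occurrence counts of one keyword (illustrative, not data from the paper).
AUTHENTIC = {date(2007, 7, 1): 40, date(2007, 7, 2): 120, date(2007, 7, 3): 60, date(2007, 7, 4): 30}
SPLOG = {date(2007, 7, 1): 10, date(2007, 7, 2): 20, date(2007, 7, 3): 90, date(2007, 7, 4): 50}


def observed_counts(authentic, splog):
    """Per-day counts as seen in the blogosphere: authentic blogs plus splogs."""
    days = sorted(set(authentic) | set(splog))
    return {d: authentic.get(d, 0) + splog.get(d, 0) for d in days}


def peak_date(counts):
    """Date with the most frequent occurrence (the earliest date wins ties)."""
    return max(sorted(counts), key=lambda d: counts[d])


observed = observed_counts(AUTHENTIC, SPLOG)
print(peak_date(AUTHENTIC), peak_date(observed))
```

With these assumed counts, the peak of the observed series falls one day after the peak in authentic blogs, which is the kind of distortion that makes statistics for authentic blogs hard to estimate without removing splogs.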

4.2 Keyword Map for Characterizing Keywords
This section introduces the keyword map of Figure 2 for characterizing keywords. The vertical axis of the map denotes whether each keyword is of public/private concern, while its horizontal axis denotes the duration of people's concern to each keyword. Keywords with public concern are typically reported in news as social/political/economical issues, while those with private concern are typically issues regarding entertainment or celebrity, or high paying adsense keywords. On the other hand, keywords with short term duration include seasonal ones and those related to temporary events, while those with long term duration include organization names with a long history such as political parties and country names, or those related to permanent issues such as health and beauty.

On the map of Figure 2, 50 keywords that are balanced in their distribution on the map are placed, where the position of each keyword is determined totally by intuition. Those keywords vary in their time series characteristics of occurrence statistics, where some of them are with burst while others are not. Each of those keywords is intended to be used for retrieving blog (authentic blog and splog) homepages in the procedure of section 5.3. The major purpose of placing such various keywords onto a map like this is to simply examine the correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword.

Table 2: Summary of Japanese Blog Data (at December 3rd, 2007, 0:00)

# of blog homepages | # of articles | # of days | current # of articles per day
3,591,306 | 192,699,276 | 1,355 | 196,975

5. ANALYZING SPLOGS BASED ON CHARACTERISTICS OF KEYWORDS
5.1 Motivations
This paper reports the results of analyzing the following three points after collecting blogs and then manually detecting splogs among them.

1. Features of splogs are manually examined according to those introduced in section 3.
2. According to the keyword map for characterizing keywords, various characteristics of keywords are manually examined, which include time series characteristics such as whether the keyword is with or without burst.
3. Based on the results of examining the above two points, we further analyze various correlations between characteristics of splogs and keywords. This analysis mainly includes the following:
   (a) correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword. This will reveal the preference of spammers when choosing keywords.
   (b) correlation between the characteristics of keywords and the splog creation procedures.

5.2 Japanese Blog Data
For collecting the Japanese blog data, we use the system called KANSHIN [3, 4, 5] which collects blog articles (posts) written in Chinese, Japanese, Korean, and English. The system has lists of blog homepages for each language. By using these lists, the system collects RSS^11 and Atom feed files provided by blog homepages, extracts keywords from the feed files by using morphological analysis tools, and stores keywords and articles in each database. The system uses several linguistic tools for extracting and indexing keywords from blog articles for each language. For Japanese, it uses a morphological analysis tool called Juman^12. The system provides users with functions for retrieving and analyzing articles. Table 2 shows the summary of Japanese blog data stored in the system (checked at December 3rd, 2007). 3.6 million blog homepages and 193 million articles are registered for Japanese since March 18th, 2004.
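
As a rough illustration of the feed-processing step, the following sketch parses an RSS 2.0 feed and counts, per day, the posts that contain a given keyword. It is not the KANSHIN implementation: the sample feed is invented, the function names are hypothetical, and plain substring matching stands in for the morphological analysis that the system performs with tools such as Juman.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import date
from email.utils import parsedate_to_datetime

# Invented RSS 2.0 feed used only for illustration.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>example blog</title>
  <item><title>Morning notes</title>
    <description>A review of health food products</description>
    <pubDate>Mon, 03 Dec 2007 09:00:00 +0900</pubDate></item>
  <item><title>health food again</title>
    <description>more excerpts</description>
    <pubDate>Mon, 03 Dec 2007 21:30:00 +0900</pubDate></item>
  <item><title>Weather</title>
    <description>heat wave report</description>
    <pubDate>Tue, 04 Dec 2007 08:00:00 +0900</pubDate></item>
</channel></rss>"""


def posts_from_rss(xml_text):
    """Return (publication date, text) pairs for every dated item in the feed."""
    posts = []
    for item in ET.fromstring(xml_text).iter("item"):
        pub = item.findtext("pubDate")
        if pub is None:
            continue  # skip items without a publication date
        text = item.findtext("title", default="") + " " + item.findtext("description", default="")
        posts.append((parsedate_to_datetime(pub).date(), text))
    return posts


def daily_keyword_counts(posts, keyword):
    """Count posts per day that contain the keyword (substring match as a stand-in)."""
    counts = Counter()
    for published, text in posts:
        if keyword in text:
            counts[published] += 1
    return counts


counts = daily_keyword_counts(posts_from_rss(SAMPLE_RSS), "health food")
print(counts[date(2007, 12, 3)], counts[date(2007, 12, 4)])
```

For Japanese text, substring matching would be replaced by keyword extraction with a morphological analyzer, since word boundaries are not marked by spaces.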

^11 Several expansions of the name RSS exist, such as RDF Site Summary, Really Simple Syndication, and Rich Site Summary.
^12 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html

[Figure 3: Blog Host Distribution in the Splog Homepage Data Set. seesaa 44%, cocolog 32%, jugem.jp 12%, ameblo 5%, livedoor 1%, goo.ne 0%, yahoo 0%, Rest 6%.]

5.3 Procedure of the Analysis
This section gives the specific procedure of collecting and analyzing splogs based on characteristics of keywords. The rough strategy of collecting splogs here is to simply collect blog homepages (i.e., not blog posts) which contain a given keyword and then, considering the features of splogs defined in section 3, to manually judge whether each of the collected blog homepages is a splog or an authentic blog.

Considering the result of a preliminary examination, we assume that, for keywords with burst, the rate of splogs among the blog homepages that contain those keywords may be higher on the burst date than on other dates. We further assume that, even for keywords without burst, the rate of splogs may be higher on the date with the most frequent occurrence in the blogosphere than on other dates. Based on this observation, in order to collect a sufficient number of splogs, for each keyword, we collect blog homepages containing the keyword on the date with its most frequent occurrence. Furthermore, also considering the result of a preliminary examination, we prefer blog homepages with more posts per day to those with fewer posts per day. The following list summarizes the above procedure.

1. For each of the 50 keywords in Figure 2, we collect blog homepage URLs which contain the keyword on the date with its most frequent occurrence during the year 2007.
2. Among the collected URLs, we select the topmost 50 with respect to the number of posts per day. We further randomly select 60 URLs from the rest. This amounts to 110 URLs in total, where the topmost 50 URLs are usually with more than three posts per day, while the remaining 60 URLs are with one or two posts per day.
3. For each of the collected URLs, an annotator judges whether each binary feature defined in section 3 holds or not.
4. Based on the above judgement, each URL is judged to be a splog or an authentic blog according to the following rule.
   (a) If one of the following holds for the given URL, then it is mostly^13 a splog.
       i. The feature "originally written text" does not hold.
       ii. The feature "originally written text" holds and at least one of the features "links to affiliated sites", "advertisement articles (posts)", or "articles (posts) with adult content" holds.
   (b) Otherwise, the given URL is an authentic blog.
5. Finally, we analyze the correlation between characteristics of keywords and the distribution of features manually annotated to splogs.

^13 By "mostly", we mean that it is usually necessary to judge by considering the contents of each blog.

Table 3: Splog Rate per Blog Host

Blog Host | seesaa | cocolog | jugem.jp | ameblo | livedoor | goo.ne | yahoo | Rest | Total
# of Splog Homepages | 192 | 142 | 54 | 24 | 3 | 1 | 0 | 26 | 442
# of Authentic Blog Homepages | 203 | 115 | 169 | 355 | 128 | 130 | 207 | 396 | 1703
# of Blog Homepages in Total | 395 | 257 | 223 | 379 | 131 | 131 | 207 | 422 | 2145
Splog Rate (%) | 48.6 | 55.3 | 24.2 | 6.3 | 2.3 | 0.8 | 0.0 | 6.2 | 20.6

Table 4: Splog Rate, Professional Spammer Rate (from professional spammer / splog), # of Professional Spammers, and Amateur Only Splog Rate (from amateur spammer / (from amateur spammer + non-splog)) (in descending order of splog rates, boldfaced: "splog rate > 10%, professional spammer rate > 50%", underlined: "amateur only splog rate 20% or more, mostly with private concern")

Keyword | Splog Rate (%) | Professional Spammer Rate (%) | # of Professional Spammers | Amateur Only Splog Rate (%)
erog, adult content blog | 89.2 | 92.4 | 3 | 38.5
rumor | 88.1 | 94.8 | 1 | 27.8
national pension | 58.1 | 90.2 | 2 | 12.0
no revision | 40.9 | 18.5 | 1 | 36.1
health food | 37.4 | 58.7 | 2 | 19.8
cosmetic surgery | 24.4 | 14.3 | 2 | 21.7
Viagra | 22.5 | 11.1 | 1 | 20.5
Darvish, a Japanese baseball player | 22.1 | 0.0 | 0 | 22.1
video | 19.1 | 0.0 | 0 | 19.1
Asasho-ryu, a sumo wrestler | 15.2 | 80.0 | 2 | 3.4
Billy's Boot Camp | 15.1 | 0.0 | 0 | 15.1
Saeko, a Japanese actress and Darvish's wife | 14.3 | 14.3 | 1 | 12.2
COMSN, Inc., elderly care business company with a scandal | 6.9 | 71.4 | 2 | 2.1
ZARD (a Japanese female singer, accidentally died) | 4.7 | 20.0 | 1 | 3.8
China Airlines | 4.7 | 20.0 | 1 | 3.8
North Korea | 2.9 | 100.0 | 1 | 0.0
Wii (a video game console of Nintendo) | 2.8 | 66.7 | 1 | 1.0
heat wave | 2.8 | 33.3 | 1 | 1.9
"The dignity of the woman", the title of a book | 2.0 | 0.0 | 0 | 2.0
a Japanese slang word for "lazy woman" | 1.8 | 50.0 | 1 | 0.9
Upper House election | 0.0 | 0.0 | 0 | 0.0
Democratic Party of Japan | 0.0 | 0.0 | 0 | 0.0
Total | 20.5 | 61.5 | 10 | 9.0
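
The URL selection in step 2 and the judgement rule in step 4 of the procedure in section 5.3 can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the annotation covers only the four features that the rule mentions, and the rule gives only a first-pass label, since the final judgement is made manually by considering the contents of each blog.

```python
import random
from dataclasses import dataclass


@dataclass
class Annotation:
    """Manually annotated binary features of one blog homepage (subset of Table 1)."""
    originally_written_text: bool
    links_to_affiliated_sites: bool = False
    advertisement_articles: bool = False
    adult_content: bool = False


def is_mostly_splog(a: Annotation) -> bool:
    """Rule 4(a): no original text, or original text plus at least one affiliate feature."""
    if not a.originally_written_text:
        return True
    return a.links_to_affiliated_sites or a.advertisement_articles or a.adult_content


def sample_urls(posts_per_day: dict, top_n: int = 50, random_n: int = 60, seed: int = 0) -> list:
    """Step 2: the top_n URLs by posts per day, plus random_n URLs drawn from the rest."""
    ranked = sorted(posts_per_day, key=posts_per_day.get, reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return top + rng.sample(rest, min(random_n, len(rest)))


# Hypothetical input: 200 URLs with made-up posting frequencies.
urls = {f"http://example.com/blog{i}": i % 7 for i in range(200)}
selected = sample_urls(urls)
print(len(selected), is_mostly_splog(Annotation(originally_written_text=True)))
```

Sampling from the remainder in addition to the most active homepages keeps less frequently updated blogs in the data set, which matters because the rule in step 4 is applied to every selected URL.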

6. PRELIMINARY RESULTS OF ANALYZING SPLOGS
This section discusses preliminary results of analyzing Japanese splogs based on characteristics of keywords, features of splogs, as well as other features which can be automatically analyzed such as blog hosts distribution. We further analyze the correlation between characteristics of keywords and the feature distribution of splogs. Here, note that the results shown below are preliminary in that they are for 22 keywords out of the 50 on the map of Figure 2.

6.1 Blog Hosts Statistics
As can be clearly seen from Figure 3, in our Japanese blog homepage data set, about 88% of splogs are from the top three hosts. Furthermore, as shown in Table 3, for the top two hosts, about half of the blog homepages are splogs^14. It is estimated that those hosts with high splog rates spend less effort on manually removing splogs than those with low splog rates. As we argue in the next section, it is observed that a very small number of spammers actually create a substantial number of splog homepages on those three hosts, and this increases the splog rates of those hosts.
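
The per-host figures can be recomputed from the counts in Table 3 with a short sketch. The counts are copied from Table 3; the function names are illustrative assumptions.

```python
# (splog, authentic) homepage counts per blog host, copied from Table 3.
HOST_COUNTS = {
    "seesaa": (192, 203), "cocolog": (142, 115), "jugem.jp": (54, 169), "ameblo": (24, 355),
    "livedoor": (3, 128), "goo.ne": (1, 130), "yahoo": (0, 207), "Rest": (26, 396),
}


def splog_rate(splog, authentic):
    """Percentage of a host's homepages that are splogs."""
    total = splog + authentic
    return 100.0 * splog / total if total else 0.0


def host_share(counts):
    """Percentage of all splogs in the data set that each host accounts for."""
    total_splogs = sum(s for s, _ in counts.values())
    return {h: 100.0 * s / total_splogs for h, (s, _) in counts.items()}


rates = {h: round(splog_rate(s, a), 1) for h, (s, a) in HOST_COUNTS.items()}
top3_share = sum(sorted(host_share(HOST_COUNTS).values(), reverse=True)[:3])
print(rates["seesaa"], rates["cocolog"], round(top3_share, 1))
```

The splog rate is computed within each host, while the share in Figure 3 is computed over all 442 splogs; from the Table 3 counts, the three largest shares add up to about 87.8%.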

^14 Due to errors in the procedure of collecting blog URLs for judging the splog/authentic blog distinction, for the moment, we do not have 110 blog URLs in total for several keywords.

6.2 Relations between Characteristics of Keywords and Splogs

Table 5: 10 Professional Spammers identified in our Splog Data Set (the three feature columns list features of splogs from Table 1)

ID | # of Splogs | Affiliate | Content Source | Creation Procedure | Keywords
1 | 115 (42.5%) | links to affiliated sites, popup advertisement | blog or other web texts | retrieved with a single keyword | rumor, no revision, cosmetic surgery, Asasho-ryu, Saeko, China Airlines, COMSN, Inc., ZARD, heat wave, Wii, North Korea, "lazy woman"
2 | 56 (20.6%) | links to affiliated sites | blog or other web texts | retrieved with a keyword varying day by day | erog
3 | 30 (11.0%) | links to affiliated sites | news articles, advertisement pages | selected without keyword retrieval | national pension, COMSN, Inc.
4 | 26 (9.6%) | links to affiliated sites, advertisement articles, popup advertisement | blog or other web texts, advertisement pages | retrieved with a keyword varying day by day | national pension
5 | 20 (7.4%) | links to affiliated sites, advertisement articles | advertisement pages | retrieved with a keyword varying day by day, keyword stuffed blog | health food
6 | 10 (3.7%) | links to affiliated sites, adult content, popup advertisement | news articles, blog or other web texts | selected without keyword retrieval | erog, Asasho-ryu
7-10 | 15 (5.5%) | - | - | - | erog, health food, Viagra, cosmetic surgery
Total | 272 | - | - | - | -

Next, for each of the 22 keywords, Table 4 gives splog rates in the blog homepages collected with the keyword, in descending order of splog rates. In the table, those 22 keywords are divided into three groups, i.e., those with splog rates higher than 30%, those with splog rates between 10% and 30%, and the rest. We further count occurrences of features of splogs in the entire splog data set, and list their rates in the splog data set as in the rightmost column of Table 1. Based on this feature analysis, we examine the correlation between those splog features and the characteristics of keywords with splog rates higher than 10%.
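
Counting feature occurrences over the splog data set, as reported in the rightmost column of Table 1, can be sketched as below. The four annotated splogs are hypothetical examples and `feature_rates` is an illustrative name; only the feature labels are taken from Table 1.

```python
from collections import Counter


def feature_rates(annotated_splogs):
    """Percentage of splogs for which each binary feature holds."""
    counts = Counter()
    for features in annotated_splogs:
        counts.update(set(features))  # each feature counted at most once per splog
    n = len(annotated_splogs)
    return {name: 100.0 * c / n for name, c in counts.items()} if n else {}


# Hypothetical annotations: each splog is the set of features judged to hold for it.
sample = [
    {"links to affiliated sites", "excerpt from blog articles (posts) or other web texts"},
    {"links to affiliated sites"},
    {"links to affiliated sites", "keywords with popup advertisement"},
    {"excerpt from news articles"},
]
rates = feature_rates(sample)
print(rates["links to affiliated sites"])
```

Because features of the same type are not disjoint, the rates within one feature type can add up to more than 100%.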

Furthermore, we judged two splogs to be created by an identical spammer when their HTML layouts are similar^15, and then grouped those splogs from an identical spammer. In this paper, we name those spammers each of whom created more than one splog in our data set as professional spammers, while we name the remaining spammers, each of whom created only one splog in our data set, as amateur spammers. With this judgement, we can identify 10 professional spammers in our splog data set (summarized in Table 5), where out of the total 442 splog homepages, 272 (61.5%) can be regarded as created by those 10 professional spammers. Based on this professional/amateur spammer analysis, for each keyword, Table 4 shows the rate of splog homepages being created by one of the 10 professional spammers. Table 4 also shows the number of professional spammers observed for each keyword, as well as splog rates after removing those created by professional spammers (amateur only splog rate).

^15 Our next plan is to employ the technique presented in [15], so that we can automatically group splog homepages into the 10 groups shown here.
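
The per-keyword quantities of Table 4 follow directly from the definitions above and can be sketched as below. The spammer IDs and counts are hypothetical, and the sketch assumes that the grouping of splogs by spammer has already been done manually; it does not implement the layout comparison itself.

```python
from collections import Counter


def pct(numerator, denominator):
    """Percentage, defined as 0.0 when the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else 0.0


def spammer_statistics(splog_spammer_ids, n_authentic):
    """splog_spammer_ids: one spammer ID per splog homepage; n_authentic: # of authentic blogs."""
    per_spammer = Counter(splog_spammer_ids)
    # A professional spammer is one who created more than one splog in the data set.
    professionals = {s for s, n in per_spammer.items() if n > 1}
    n_splogs = len(splog_spammer_ids)
    n_professional = sum(per_spammer[s] for s in professionals)
    n_amateur = n_splogs - n_professional
    return {
        "splog_rate": pct(n_splogs, n_splogs + n_authentic),
        "professional_spammer_rate": pct(n_professional, n_splogs),
        "n_professional_spammers": len(professionals),
        "amateur_only_splog_rate": pct(n_amateur, n_amateur + n_authentic),
    }


# Hypothetical keyword: 10 splogs (6 by spammer A, 2 by B, 2 by one-off spammers), 40 authentic blogs.
stats = spammer_statistics(["A"] * 6 + ["B"] * 2 + ["c1", "c2"], n_authentic=40)
print(stats)
```

Here the professional/amateur distinction is made once over the given list; in the paper it is made over the entire splog data set and then applied per keyword.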

Major conclusions of this analysis can be summarized as below, some of which are also noted in the map of the 22 keywords in Figure 4.

(1) The most important fact to note here is that, for four out of the five keywords with splog rate over 30%, most splog homepages are created by professional spammers. Splogs containing these four keywords actually amount to more than half of the entire splog data set. This fact is very important because the following analysis is strongly affected by the choices of those professional spammers in creating those splogs.

(2) As can be seen from the map in Figure 4, most of the keywords placed in the upper half of the map have low splog rates. This means that splogs tend to contain keywords with private concern more often than those with public concern. "National pension" and "Asasho-ryu" have exceptionally high splog rates, though this statistic is strongly affected by the choices of professional spammers. Those spammers posted splog posts on certain dates, where the splog articles are created from the excerpts of the news reports and blog posts on those dates. Those excerpts occasionally include scandal reports closely related to the two keywords.

(3) The three keywords "rumor", "erog, adult content blog", and "health food" correspond to another group of splogs created by professional spammers. In the case of these keywords, the spammers posted splog posts, where the splog articles are created from the excerpts of other blogs and advertisements, but not news articles, by retrieving them with certain keywords.

[Figure 4: Keyword Map with Splog Analysis Results]

7. CONCLUSION
This paper focused on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of professional spammers. Future works include further analysis of splogs by integrating with other features studied in the previous works [12, 10, 9], such as characteristic words in splogs, in-degree/out-degree distributions, and ping time series. Next, we plan to apply existing splog detection techniques [11, 8] to our splog data set, and then to develop a splog detector with high accuracy. Splogs/authentic blogs collected in this work are also useful for analyzing characteristics of keywords in a much larger scale, simply by automatically collecting a much larger number of keywords, and then measuring correlation between splogs and each keyword.

8. REFERENCES
[1] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.
[2] Wikipedia, Word salad (computer science). http://en.wikipedia.org/wiki/Wordsalad%28computer_science%29.
[3] T. Fukuhara, T. Murayama, and T. Nishida. Analyzing concerns of people using Weblog articles and real world temporal data. In Proceedings of WWW 2005 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2005.
[4] T. Fukuhara, H. Nakagawa, and T. Nishida. Understanding sentiment of people from news articles: Temporal sentiment analysis of social events. In Proceedings of ICWSM, pages 271–272, 2007.
[5] T. Fukuhara, T. Utsuro, and H. Nakagawa. Cross-lingual concern analysis from multilingual weblog articles. In A. Nijholt, O. Stock, and T. Nishida, editors, Proceedings of the 6th International Workshop on Social Intelligence Design, pages 55–64, 2007.
[6] N. Glance, M. Hurst, and T. Tomokiyo. Blogpulse: Automated trend discovery for Weblogs. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004.
[7] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pages 39–47, 2005.
[8] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pages 92–99, 2006.
[9] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007.
[10] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proceedings of WWW 2006 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.
[11] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pages 1–8, 2007.
[12] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006.
[13] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. '04: Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pages 320–321. ACM Press, 2004.
[14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Collecting and analyzing Japanese splogs based on characteristics of keywords. In Proc. ICWSM, pages 218–219, 2008.
[15] T. Urvoy, T. Lavergne, and P. Filoche. Tracking Web spam with hidden style similarity. In Proc. 2nd AIRWeb, pages 25–30, 2006.
[16] Y. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers. In Proc. 16th WWW Conf., pages 291–300, 2007.
