Analysing Features of Japanese Splogs and Characteristics of Keywords

Yuuki Sato, Takehito Utsuro
University of Tsukuba, Tsukuba, 305-8573, JAPAN

Tomohiro Fukuhara
University of Tokyo, Kashiwa 277-8568, JAPAN

Yasuhide Kawada
Navix Co., Ltd., Tokyo, 141-0031, JAPAN

Yoshiaki Murakami
Navix Co., Ltd., Tokyo, 141-0031, JAPAN

Hiroshi Nakagawa
University of Tokyo, Tokyo, 113-0033, JAPAN

Noriko Kando
National Institute of Informatics, Tokyo, 101-8430, JAPAN

ABSTRACT
This paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. We estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. Since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently (manually) collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence. We manually examine various features of collected blog homepages regarding whether their text content is excerpted from other sources or not, as well as whether they display affiliate advertisement or out-going links to affiliated sites. Among various informative results, it is important to note that more than half of the collected splogs are created by a very small number of spammers.
Categories and Subject Descriptors
H.3.0 [INFORMATION STORAGE AND RETRIEVAL]: General

General Terms
Reliability

Keywords
Blog analysis, splog, time series characteristics of keywords, keyword bursts

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
AIRWeb '08, April 22, 2008 Beijing, China.
Copyright 2008 ACM 978-1-60558-159-0 ...$5.00.

1. INTRODUCTION
Weblogs, or blogs, are regarded as a form of personal journal or of market or product commentary. While traditional search engines continue to discover and index blogs, the blogosphere has produced custom blog search and analysis engines, systems that employ specialized information retrieval techniques.
There are several previous works and services on blog analysis systems. [13] proposed a system called blogWatcher that collects and analyzes Japanese blog articles. [6] proposed a system called BlogPulse that analyzes trends of blog articles. With respect to blog analysis services on the Internet, there are several commercial and non-commercial services such as Technorati^1, BlogPulse^2, kizasi.jp^3, and blogWatcher^4. With respect to multilingual blog services, Globe of Blogs^5 provides a retrieval function of blog articles across languages. Best Blogs in Asia Directory^6 also provides a retrieval function for Asian language blogs. Blogwise^7 also analyzes multilingual blog articles.
As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone [7, 1, 10, 12, 9]. Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the PageRank of target sites. [10] reported that for English blogs, around 88% of all pinging URLs (i.e., blog homepages) are splogs, which account for about 75% of all pings. Based on this estimation, as stated in [1, 11], splogs can cause problems including the degradation of information retrieval quality and the significant waste of network and storage resources. Several previous works [10, 12, 9] reported important characteristics of splogs. [12] reported characteristics of ping time series, in-degree/out-degree distributions, and typical words in splogs found in the TREC^8 Blog06 data collection. [10, 9] also reported the results of analyzing splogs in the BlogPulse data set. In the context of semi-automatically collecting web spam including splogs, [16] discuss how to collect spammer-targeted keywords to be used when collecting a large amount of web spam efficiently.

Unlike those previous works, this paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them [14].
As has often been noted in the previous works, text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. Considering this fact, in this work, we estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. The keyword characteristics we focus on in this paper are whether the keyword is of public or private concern and how long people's concern with the keyword lasts. Furthermore, since splogs often cause noise in word occurrence statistics in the blogosphere, we assume that we can efficiently collect splogs by sampling blog homepages containing keywords of a certain type on the date with its most frequent occurrence. We then manually examine various features of collected blog homepages regarding whether their text contents are excerpts from other sources or not, as well as whether they display affiliate advertisement or out-going links to affiliated sites. Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of spammers, and hence, the analysis reported in this paper is strongly affected by the choices of those spammers when they create those splogs.

^1 http://technorati.com/
^2 http://www.blogpulse.com/
^3 http://kizasi.jp/ (in Japanese)
^4 http://blogwatcher.pi.titech.ac.jp/ (in Japanese)
^5 http://www.globeofblogs.com/
^6 http://www.misohoni.com/bba/
^7 http://www.blogwise.com/
^8 http://trec.nist.gov/

Table 1: Features for Characterizing Splogs and their Rates in Splog Data Set

Feature Type | Feature | Description | Rate in Splogs (%)
Affiliate Features | links to affiliated sites | Blog articles (posts) contain sufficiently many out-going links to affiliated sites, except for the out-going links that the blog hosts automatically add to individual blog homepages and blog posts. | 80.5
Affiliate Features | advertisement articles (posts) | Blog articles (posts) themselves contain sufficiently many advertisements, except for the advertisements that the blog hosts automatically add to individual blog homepages and blog posts. | 31.0
Affiliate Features | articles (posts) with adult content | Blog articles (posts) contain adult content. | 8.1
Affiliate Features | keywords with popup advertisement | Certain blog hosts have facilities of automatically adding popup advertisements to keywords. | 42.1
Content Source Features | excerpt from news articles | Text content is automatically or manually excerpted from news articles. | 14.3
Content Source Features | excerpt from blog articles (posts) or other web texts | Text content is automatically or manually excerpted from other blog articles (posts), or web texts other than news articles and advertisement pages. | 70.8
Content Source Features | excerpt from advertisement pages | Text content is automatically or manually excerpted from certain advertisement pages. | 27.1
Content Source Features | originally written texts | Spammers write original splog texts. | 2.9
Content Source Features | meaningless sequence of words | Most of them are so called word salad spam text [2] and are automatically generated. | 3.6
Creation Procedure Features | excerpt from other sources, selected without keyword retrieval | Text content is automatically or manually excerpted from other sources without keyword retrieval. Typical cases are excerpts from news articles or blog posts on the same date or close dates. | 12.7
Creation Procedure Features | excerpt from other sources, retrieved with a keyword varying day by day | Text content is automatically or manually retrieved from other sources with a keyword varying day by day, and then excerpted. | 49.5
Creation Procedure Features | excerpt from other sources, retrieved with a single keyword throughout a blog homepage | For a blog homepage, all of its text content is excerpted text, automatically or manually retrieved from other sources with a single keyword throughout all of its posts. | 36.9
Creation Procedure Features | keyword stuffed blog [9] | Blog articles (posts) contain lists of keywords for SEO purposes. | 11.5
Creation Procedure Features | automatically generated text | Most of them are so called word salad spam text [2], which is a mixture of seemingly meaningful words that together signify nothing. Sometimes, the text connects several sentences, each of which is excerpted from another source. | 4.5
2. PROCEDURE OF CREATING SPLOGS
Text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. In any case, splogs have commercial intention: they display affiliate advertisement or out-going links to affiliated sites. For this purpose, splogs are usually created by searching for up-to-date content from other sources and by excerpting them. This procedure of creating splogs can be roughly divided into the following two cases:

i) excerpting text content from news articles or blog posts on the same date or close dates without keyword retrieval,
ii) excerpting text content by retrieving them from other sources with certain keywords.

[Figure 1: Time Series Characteristics of Keyword Occurrence Statistics in Splogs / Authentic Blogs. Panels: (a) keyword with burst, (b) keyword without burst. Each panel shows the time series for authentic blogs and for splogs; in panel (a) the burst of a keyword is marked on both series.]

Splog posts created by the first procedure just a few days before the current date tend to contain up-to-date text content which is originally from quite recent news articles or blog posts. On the other hand, for splogs created by the second procedure, spammers usually carefully choose keywords for retrieving text content from other sources such as news articles and blog posts. They tend to choose high paying adsense^9 keywords.
3. FEATURES FOR CHARACTERIZING SPLOGS
This section describes the features for characterizing Japanese splog homepages manually collected by the procedure of section 5.3. As we summarize in Table 1, this paper considers the following three types of features for splogs, namely, 1) affiliate features, 2) content source features, and 3) creation procedure features. For each of these three feature types, Table 1 lists several binary features, each of which denotes whether the given splog homepage has the designated characteristics or not. Here, note that features of the same type are independent of each other and hence are not necessarily disjoint. Also note that most of those features are intended for manual examination of splogs, and are not necessarily meant for detecting splogs automatically.
3.1 Affiliate Features
Among the three feature types, first we describe affiliate features. As introduced in [10, 9], splogs are generated with two often overlapping motives, namely, creation of fake blogs for the purpose of hosting profitable advertisement, and unjustifiably increasing the ranking of affiliated sites. Since both motives are deeply related to affiliate advertising, in this paper, we consider features of splogs regarding issues of affiliates. As the affiliate features, we manually examine the following four points:

i) whether the blog articles (posts) contain out-going links to affiliated sites,
ii) whether the blog articles (posts) themselves contain advertisements,
iii) whether blog articles (posts) contain adult content^10,
iv) whether blog articles (posts) contain popup advertisements automatically added to certain keywords.

^9 http://google.com/adsense
3.2 Content Source Features
Second, one of the important characteristics of splogs is that their text content is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. In order to estimate the mechanism of creating splogs, we manually examine the content source of splogs and classify them according to the following five features, namely, content source features:

i) excerpt from news articles,
ii) excerpt from blog articles (posts) or other web texts,
iii) excerpt from advertisement pages,
iv) originally written texts,
v) meaningless sequence of words such as word salad spam texts [2].
3.3 Creation Procedure Features
Furthermore, we estimate the procedures of searching the web for those excerpts and manually classify them according to the following five features, namely, creation procedure features:

i) excerpt from other sources, selected without keyword retrieval, where typical cases are excerpts from news articles or blog posts on the same date or close dates,
ii) excerpt from other sources, retrieved with a keyword varying day by day,
iii) excerpt from other sources, retrieved with a single keyword throughout a blog homepage,
iv) keyword stuffed blog [9],
v) automatically generated text including word salad spam texts [2].

As the creation procedure features, we distinguish two major procedures of creating splogs, i.e., a) excerpt from news articles or blog posts on the same date or close dates without keyword retrieval, and b) excerpt by retrieving texts from other sources with certain keywords. The former type corresponds to the feature i) above, while the latter corresponds to the features ii) and iii) above.

^10 Adult content is among the major target genres for affiliate advertising, while other major target genres include health food and slimming products, cosmetics, and finance. We regard blogs which contain adult content as more harmful than others, and record them with an independent feature.

[Figure 2: A Keyword Map for Characterizing Keywords]
4. CHARACTERISTICS OF SPLOGS AND KEYWORDS
4.1 Time Series Characteristics of Keywords
Among the problems caused by splogs, this section discusses issues on noise in word occurrence statistics in the blogosphere. Figure 1 illustrates two typical cases of noise in time series keyword occurrence statistics, where (a) is the case of a keyword with burst, and (b) is the case of a keyword without burst. In both cases, keyword occurrences are a mixture of those from authentic blogs and splogs. Without detecting and removing splogs, it is difficult to estimate real keyword occurrence statistics only in authentic blogs. For keywords with burst, especially, it is estimated that the burst in splogs may be delayed from that in authentic blogs, because text content of splogs is mostly excerpted from other sources such as news articles and blog posts.
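The mixing of authentic and splog occurrences described above can be illustrated with a minimal Python sketch. The daily counts are hypothetical and chosen only to mimic panel (a) of Figure 1, and the helper names `combine` and `peak_date` are introduced here for illustration; they are not part of the system used in this paper.

```python
from datetime import date

def combine(authentic, splog):
    """Sum per-day counts from authentic blogs and splogs.

    This is what is observed before splogs are detected and removed.
    """
    days = set(authentic) | set(splog)
    return {d: authentic.get(d, 0) + splog.get(d, 0) for d in days}

def peak_date(daily_counts):
    """Date with the most frequent occurrence; the earliest date wins ties."""
    return max(sorted(daily_counts), key=lambda d: daily_counts[d])

# Hypothetical counts for one keyword with a burst (not data from the paper).
authentic = {date(2007, 1, 9): 5, date(2007, 1, 10): 40,
             date(2007, 1, 11): 20, date(2007, 1, 12): 8}
splog = {date(2007, 1, 9): 1, date(2007, 1, 10): 2,
         date(2007, 1, 11): 10, date(2007, 1, 12): 30}
observed = combine(authentic, splog)

print(peak_date(authentic))  # burst date in authentic blogs
print(peak_date(splog))      # delayed burst date in splogs
print(peak_date(observed))   # peak of the mixed, observed series
```

In this toy series the splog burst trails the authentic burst by two days, so the observed counts stay inflated after the authentic burst has faded.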
4.2 Keyword Map for Characterizing Keywords
This section introduces the keyword map of Figure 2 for characterizing keywords. The vertical axis of the map denotes whether each keyword is of public/private concern, while its horizontal axis denotes the duration of people's concern with each keyword. Keywords with public concern are typically reported in news as social/political/economic issues, while those with private concern are typically issues regarding entertainment or celebrity, or high paying adsense keywords. On the other hand, keywords with short term duration include seasonal ones and those related to temporary events, while those with long term duration include names of organizations with a long history, such as political parties, and country names, or those related to permanent issues such as health and beauty.

On the map of Figure 2, 50 keywords that are balanced in their distribution on the map are placed, where the position of each keyword is determined entirely by intuition. Those keywords vary in their time series characteristics of occurrence statistics, where some of them are with burst while others are not. Each of those keywords is intended to be used for retrieving blog (authentic blog and splog) homepages in the procedure of section 5.3. The major purpose of placing such various keywords onto a map like this is simply to examine the correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword.

Table 2: Summary of Japanese Blog Data (at December 3rd, 2007, 0:00)

# of blog homepages | # of articles | # of days | current # of articles per day
3,591,306 | 192,699,276 | 1,355 | 196,975

5. ANALYZING SPLOGS BASED ON CHARACTERISTICS OF KEYWORDS
5.1 Motivations
This paper reports the results of analyzing the following three points after collecting blogs and then manually detecting splogs among them.

1. Features of splogs are manually examined according to those introduced in section 3.
2. According to the keyword map for characterizing keywords, various characteristics of keywords are manually examined, which include time series characteristics such as whether the keyword is with or without burst.
3. Based on the results of examining the above two points, we further analyze various correlations between characteristics of splogs and keywords. This analysis mainly includes the following:
   (a) correlation between the characteristics of keywords and the rate of splogs among the blogs containing each keyword. This will reveal the preference of spammers when choosing keywords.
   (b) correlation between the characteristics of keywords and the splog creation procedures.
5.2 Japanese Blog Data
For collecting the Japanese blog data, we use the system called KANSHIN [3, 4, 5], which collects blog articles (posts) written in Chinese, Japanese, Korean, and English. The system has lists of blog homepages for each language. By using these lists, the system collects RSS^11 and Atom feed files provided by blog homepages, extracts keywords from the feed files by using morphological analysis tools, and stores keywords and articles in each database. The system uses several linguistic tools for extracting and indexing keywords from blog articles for each language. For Japanese, it uses a morphological analysis tool called Juman^12. The system provides users with functions for retrieving and analyzing articles.

Table 2 shows the summary of Japanese blog data stored in the system (checked at December 3rd, 2007). 3.6 million blog homepages and 193 million articles have been registered for Japanese since March 18th, 2004.
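As a rough illustration of the feed-to-keyword step described in this section, the following Python sketch parses a small RSS 2.0 document and counts keyword occurrences per posting date. It is not the KANSHIN implementation: the sample feed is made up, `index_feed` is a hypothetical helper name, and the regular-expression `tokenize` function is only a stand-in for a morphological analyzer such as Juman, which Japanese text would require.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict
from datetime import date
from email.utils import parsedate_to_datetime

# Made-up RSS 2.0 feed with two posts (not data from the paper).
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>example blog</title>
<item><title>Wii review</title><description>Playing Wii all day</description>
<pubDate>Mon, 03 Dec 2007 09:00:00 +0900</pubDate></item>
<item><title>Heat wave</title><description>No Wii today</description>
<pubDate>Tue, 04 Dec 2007 10:30:00 +0900</pubDate></item>
</channel></rss>"""

def tokenize(text):
    """Placeholder tokenizer; a real system would call a morphological analyzer."""
    return re.findall(r"\w+", text.lower())

def index_feed(rss_text):
    """Map each posting date to a Counter of keyword occurrences."""
    index = defaultdict(Counter)
    root = ET.fromstring(rss_text)
    for item in root.iter("item"):
        parts = (item.findtext("title"), item.findtext("description"))
        text = " ".join(p for p in parts if p)
        day = parsedate_to_datetime(item.findtext("pubDate")).date()
        index[day].update(tokenize(text))
    return index

index = index_feed(SAMPLE_RSS)
print(index[date(2007, 12, 3)]["wii"])
```

A per-date index of this shape is what makes it possible to look up, for each keyword, the date with its most frequent occurrence.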
5.3 Procedure of the Analysis
This section gives the specific procedure of collecting and analyzing splogs based on characteristics of keywords. The rough strategy of collecting splogs here is to simply collect blog homepages (i.e., not blog posts) which contain a given keyword and then, considering the features of splogs defined in section 3, to manually judge whether each of the collected blog homepages is a splog or an authentic blog.

Considering the result of a preliminary examination, we assume that, for keywords with burst, the rate of splogs among the blog homepages that contain those keywords may be higher on the burst date than on other dates. We further assume that, even for keywords without burst, the rate of splogs may be higher on the date with the most frequent occurrence in the blogosphere than on other dates. Based on these assumptions, in order to collect a sufficient number of splogs, for each keyword, we collect blog homepages containing the keyword on the date with its most frequent occurrence. Furthermore, also considering the result of a preliminary examination, we prefer blog homepages with more posts per day to those with fewer posts per day. The following list summarizes the above procedure.

^11 RSS has several expansions, such as RDF Site Summary, Really Simple Syndication, and Rich Site Summary.
^12 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html

[Figure 3: Blog Host Distribution in the Splog Homepage Data Set. Shares: seesaa 44%, cocolog 32%, jugem.jp 12%, ameblo 5%, livedoor 1%, yahoo 0%, goo.ne 0%, Rest 6%.]
1. For each of the 50 keywords in Figure 2, we collect blog homepage URLs which contain the keyword on the date with its most frequent occurrence during the year 2007.
2. Among the collected URLs, we select the topmost 50 with respect to the number of posts per day. We further randomly select 60 URLs from the rest. This amounts to 110 URLs in total, where the topmost 50 URLs are usually with more than three posts per day, while the remaining 60 URLs are with one or two posts per day.
3. For each of the collected URLs, an annotator judges whether each binary feature defined in section 3 holds or not.
4. Based on the above judgement, each URL is judged to be a splog or an authentic blog according to the following rule.
   (a) If one of the following holds for the given URL, then it is mostly^13 a splog.
       i. The feature "originally written text" does not hold.
       ii. The feature "originally written text" holds and at least one of the features "links to affiliated sites", "advertisement articles (posts)", or "articles (posts) with adult content" holds.
   (b) Otherwise, the given URL is an authentic blog.
5. Finally, we analyze the correlation between characteristics of keywords and the distribution of features manually annotated to splogs.

^13 By "mostly", we mean that it is usually necessary to judge by considering the contents of each blog.

Table 3: Splog Rate per Blog Host

Blog Host | seesaa | cocolog | jugem.jp | ameblo | livedoor | goo.ne | yahoo | Rest | Total
# of Blog Homepages: Splog | 192 | 142 | 54 | 24 | 3 | 1 | 0 | 26 | 442
# of Blog Homepages: Authentic Blog | 203 | 115 | 169 | 355 | 128 | 130 | 207 | 396 | 1703
# of Blog Homepages: Total | 395 | 257 | 223 | 379 | 131 | 131 | 207 | 422 | 2145
Splog Rate (%) | 48.6 | 55.3 | 24.2 | 6.3 | 2.3 | 0.8 | 0.0 | 6.2 | 20.6

Table 4: Splog Rate, Professional Spammer Rate (from professional spammer / splog), # of Professional Spammers, and Amateur Only Splog Rate (from amateur spammer / (from amateur spammer + non-splog)) (in descending order of splog rates, boldfaced: "splog rate > 10%, professional spammer rate > 50%", underlined: "amateur only splog rate 20% or more, mostly with private concern")

Keyword | Splog Rate (%) | Professional Spammer Rate (%) | # of Professional Spammers | Amateur Only Splog Rate (%)
erog, adult content blog | 89.2 | 92.4 | 3 | 38.5
rumor | 88.1 | 94.8 | 1 | 27.8
national pension | 58.1 | 90.2 | 2 | 12.0
no revision | 40.9 | 18.5 | 1 | 36.1
health food | 37.4 | 58.7 | 2 | 19.8
cosmetic surgery | 24.4 | 14.3 | 2 | 21.7
Viagra | 22.5 | 11.1 | 1 | 20.5
Darvish, a Japanese baseball player | 22.1 | 0.0 | 0 | 22.1
video | 19.1 | 0.0 | 0 | 19.1
Asasho-ryu, a sumo wrestler | 15.2 | 80.0 | 2 | 3.4
Billy's Boot Camp | 15.1 | 0.0 | 0 | 15.1
Saeko, a Japanese actress and Darvish's wife | 14.3 | 14.3 | 1 | 12.2
COMSN, Inc., elderly care business company with a scandal | 6.9 | 71.4 | 2 | 2.1
ZARD (a Japanese female singer, accidentally died) | 4.7 | 20.0 | 1 | 3.8
China Airlines | 4.7 | 20.0 | 1 | 3.8
North Korea | 2.9 | 100.0 | 1 | 0.0
Wii (a video game console of Nintendo) | 2.8 | 66.7 | 1 | 1.0
heat wave | 2.8 | 33.3 | 1 | 1.9
"The dignity of the woman", the title of a book | 2.0 | 0.0 | 0 | 2.0
a Japanese slang word for "lazy woman" | 1.8 | 50.0 | 1 | 0.9
Upper House election | 0.0 | 0.0 | 0 | 0.0
Democratic Party of Japan | 0.0 | 0.0 | 0 | 0.0
Total | 20.5 | 61.5 | 10 | 9.0
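Steps 2 and 4 of the procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the dictionary-based feature representation, and the fixed random seed are choices made here, and the rule yields only the "mostly splog" default, since the annotator still has to consider the contents of each blog.

```python
import random

def sample_urls(posts_per_day, top_n=50, random_n=60, seed=0):
    """Step 2: take the top_n URLs by posts per day, plus random_n from the rest."""
    ranked = sorted(posts_per_day, key=posts_per_day.get, reverse=True)
    top, rest = ranked[:top_n], ranked[top_n:]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return top + rng.sample(rest, min(random_n, len(rest)))

# Affiliate features that make an originally written blog count as a splog.
COMMERCIAL_FEATURES = (
    "links to affiliated sites",
    "advertisement articles (posts)",
    "articles (posts) with adult content",
)

def is_mostly_splog(features):
    """Step 4: default judgement from the manually annotated binary features."""
    if not features.get("originally written text", False):
        return True  # rule (a) i.
    # rule (a) ii. if any commercial feature holds, otherwise rule (b).
    return any(features.get(name, False) for name in COMMERCIAL_FEATURES)

# Hypothetical input: 200 URLs with made-up posts-per-day values.
posts = {f"url{i}": i for i in range(200)}
chosen = sample_urls(posts)
print(len(chosen))
print(is_mostly_splog({"originally written text": True,
                       "links to affiliated sites": True}))
```

Missing features are treated as not holding, which is an assumption of this sketch and not something the procedure specifies.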
6. PRELIMINARY RESULTS OF ANALYZING SPLOGS
This section discusses preliminary results of analyzing Japanese splogs based on characteristics of keywords, features of splogs, as well as other features which can be automatically analyzed such as the distribution of blog hosts. We further analyze the correlation between characteristics of keywords and the feature distribution of splogs. Here, note that the results shown below are preliminary in that they are for 22 keywords out of the 50 on the map of Figure 2.

6.1 Blog Hosts Statistics
As can be clearly seen from Figure 3, in our Japanese blog homepage data set, about 88% of splogs are from the top three hosts. Furthermore, as shown in Table 3, for the top two hosts, about half of the blog homepages are splogs^14. It is estimated that those hosts with high splog rates spend less effort on manually removing splogs than those with low splog rates. As we argue in the next section, it is observed that a very small number of spammers actually create a substantial number of splog homepages on those three hosts, and this increases the splog rates of those hosts.
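The host statistics can be recomputed from the counts in Table 3 with a short Python sketch. The counts are copied from the table; the variable and function names are introduced here for illustration only.

```python
# (splog homepages, authentic blog homepages) per host, copied from Table 3.
HOSTS = {
    "seesaa": (192, 203),
    "cocolog": (142, 115),
    "jugem.jp": (54, 169),
    "ameblo": (24, 355),
    "livedoor": (3, 128),
    "goo.ne": (1, 130),
    "yahoo": (0, 207),
    "Rest": (26, 396),
}

def splog_rate(splog, authentic):
    """Percentage of blog homepages on a host that were judged to be splogs."""
    return 100.0 * splog / (splog + authentic)

rates = {host: round(splog_rate(s, a), 1) for host, (s, a) in HOSTS.items()}
total_splogs = sum(s for s, _ in HOSTS.values())
total_blogs = sum(s + a for s, a in HOSTS.values())
top_three = sum(HOSTS[h][0] for h in ("seesaa", "cocolog", "jugem.jp"))

print(rates)
print(round(100.0 * total_splogs / total_blogs, 1))  # overall splog rate
print(round(100.0 * top_three / total_splogs, 1))    # share of the top three hosts
```

With these counts the top three hosts account for 388 of the 442 splog homepages, roughly 87.8%, which is in line with the rounded shares shown in Figure 3 (44% + 32% + 12%).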
6.2 Relations between Characteristics of Keywords and Splogs
Next, for each of the 22 keywords, Table 4 gives splog rates in the blog homepages collected with the keyword, in descending order of splog rates. In the table, those 22 keywords are divided into three groups, i.e., those with splog rates higher than 30%, those with splog rates between 10% and 30%, and the rest. We further count occurrences of features of splogs in the entire splog data set, and list their rates in the splog data set in the rightmost column of Table 1. Based on this feature analysis, we examine the correlation between those splog features and the characteristics of keywords with splog rates higher than 10%.

Furthermore, we judged two splogs to be created by an identical spammer when their html layouts are similar^15, and then grouped those splogs from an identical spammer. In this paper, we call spammers who each created more than one splog in our data set professional spammers, and the remaining spammers, who each created only one splog in our data set, amateur spammers. With this judgement, we can identify 10 professional spammers in our splog data set (summarized in Table 5), where out of the total 442 splog homepages, 272 (61.5%) can be regarded as created by those 10 professional spammers. Based on this professional/amateur spammer analysis, for each keyword, Table 4 shows the rate of splog homepages being created by one of the 10 professional spammers. Table 4 also shows the number of professional spammers observed for each keyword, as well as splog rates after removing those created by professional spammers (amateur only splog rate).

Table 5: 10 Professional Spammers identified in our Splog Data Set (features of splogs as in Table 1)

ID | # of Splogs | Affiliate | Content Source | Creation Procedure | Keywords
1 | 115 (42.5%) | links to affiliated sites, popup advertisement | blog or other web texts | retrieved with a single keyword | rumor, no revision, cosmetic surgery, Asasho-ryu, Saeko, China Airlines, COMSN, Inc., ZARD, heat wave, Wii, North Korea, "lazy woman"
2 | 56 (20.6%) | links to affiliated sites | blog or other web texts | retrieved with a keyword varying day by day | erog
3 | 30 (11.0%) | links to affiliated sites | news articles, advertisement pages | selected without keyword retrieval | national pension, COMSN, Inc.
4 | 26 (9.6%) | links to affiliated sites, advertisement articles, popup advertisement | blog or other web texts, advertisement pages | retrieved with a keyword varying day by day | national pension
5 | 20 (7.4%) | links to affiliated sites, advertisement articles | advertisement pages | retrieved with a keyword varying day by day, keyword stuffed blog | health food
6 | 10 (3.7%) | links to affiliated sites, adult content, popup advertisement | news articles, blog or other web texts | selected without keyword retrieval | erog, Asasho-ryu
7-10 | 15 (5.5%) | - | - | - | erog, health food, Viagra, cosmetic surgery
Total | 272 | - | - | - | -

Major conclusions of this analysis can be summarized as below, some of which are also noted in the map of the 22 keywords in Figure 4.

(1) The most important fact to note here is that, for four out of the five keywords with splog rate over 30%, most splog homepages are created by professional spammers. Splogs containing these four keywords actually amount to more than half of the entire splog data set. This fact is very important because the following analysis is strongly affected by the choices of those professional spammers in creating those splogs.

(2) As can be seen from the map in Figure 4, most of the keywords placed in the upper half of the map have low splog rates. This means that splogs tend to contain keywords with private concern more often than those with public concern. "National pension" and "Asasho-ryu" have exceptionally high splog rates, though these statistics are strongly affected by the choices of professional spammers. Those spammers posted splog posts on certain dates, where the splog articles are created from excerpts of the news reports and blog posts on those dates. Those excerpts occasionally include scandal reports closely related to the two keywords.

(3) The three keywords "rumor", "erog, adult content blog", and "health food" correspond to another group of splogs created by professional spammers. In the case of these keywords, the spammers posted splog posts where the splog articles are created from excerpts of other blogs and advertisements, but not news articles, by retrieving them with certain keywords.

^14 Due to errors in the procedure of collecting blog URLs for judging splog/authentic blog distinction, for the moment, we do not have 110 blog URLs in total for several keywords.
^15 Our next plan is to employ the technique presented in [15], so that we can automatically group splog homepages into the 10 groups shown here.
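The professional/amateur distinction and the per-keyword rates defined in the caption of Table 4 can be expressed as a small Python sketch. The grouping of splogs into spammers is taken as given (in this paper it is done manually from html layout similarity); the identifiers and counts below are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here.

```python
from collections import Counter

def professional_spammers(splog_to_spammer):
    """Spammers who created more than one splog homepage in the data set."""
    counts = Counter(splog_to_spammer.values())
    return {spammer for spammer, n in counts.items() if n > 1}

def keyword_rates(n_blogs, splog_spammers, professional):
    """Rates for one keyword, following the definitions in the Table 4 caption.

    n_blogs: number of blog homepages collected for the keyword.
    splog_spammers: one spammer id per splog homepage containing the keyword.
    """
    n_splog = len(splog_spammers)
    n_pro = sum(1 for s in splog_spammers if s in professional)
    n_amateur = n_splog - n_pro
    n_authentic = n_blogs - n_splog
    denominator = n_amateur + n_authentic
    return {
        "splog rate": 100.0 * n_splog / n_blogs,
        "professional spammer rate": 100.0 * n_pro / n_splog if n_splog else 0.0,
        "amateur only splog rate": 100.0 * n_amateur / denominator if denominator else 0.0,
    }

# Hypothetical grouping: six splog homepages attributed to three spammers.
grouping = {"a1": "s1", "a2": "s1", "a3": "s2", "a4": "s3", "a5": "s3", "a6": "s3"}
pros = professional_spammers(grouping)

# Hypothetical keyword: 10 blog homepages collected, 4 of them splogs.
rates = keyword_rates(10, ["s1", "s1", "s2", "s3"], pros)
print(sorted(pros))
print(rates)
```

Applied to the totals reported above, 272 of 442 splog homepages from professional spammers gives the 61.5% professional spammer rate.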
7. CONCLUSION
This paper focused on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. Among various informative results of our analysis, it is important to note that more than half of the collected splogs are created by a very small number of professional spammers. Future work includes further analysis of splogs by integrating other features studied in the previous works [12, 10, 9], such as characteristic words in splogs, in-degree/out-degree distributions, and ping time series. Next, we plan to apply existing splog detection techniques [11, 8] to our splog data set, and then to develop a splog detector with high accuracy. Splogs/authentic blogs collected in this work are also useful for analyzing characteristics of keywords on a much larger scale, simply by automatically collecting a much larger number of keywords, and then measuring correlation between splogs and each keyword.

[Figure 4: Keyword Map with Splog Analysis Results]
8. REFERENCES
[1] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.
[2] Wikipedia, Word salad (computer science). http://en.wikipedia.org/wiki/Wordsalad%28computer_science%29.
[3] T. Fukuhara, T. Murayama, and T. Nishida. Analyzing concerns of people using Weblog articles and real world temporal data. In Proceedings of WWW 2005 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2005.
[4] T. Fukuhara, H. Nakagawa, and T. Nishida. Understanding sentiment of people from news articles: Temporal sentiment analysis of social events. In Proceedings of ICWSM, pages 271–272, 2007.
[5] T. Fukuhara, T. Utsuro, and H. Nakagawa. Cross-lingual concern analysis from multilingual weblog articles. In A. Nijholt, O. Stock, and T. Nishida, editors, Proceedings of the 6th International Workshop on Social Intelligence Design, pages 55–64, 2007.
[6] N. Glance, M. Hurst, and T. Tomokiyo. Blogpulse: Automated trend discovery for Weblogs. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004.
[7] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pages 39–47, 2005.
[8] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pages 92–99, 2006.
[9] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007.
[10] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proceedings of WWW 2006 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.
[11] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pages 1–8, 2007.
[12] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006.
[13] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. '04: Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pages 320–321. ACM Press, 2004.
[14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Collecting and analyzing Japanese splogs based on characteristics of keywords. In Proc. ICWSM, pages 218–219, 2008.
[15] T. Urvoy, T. Lavergne, and P. Filoche. Tracking Web spam with hidden style similarity. In Proc. 2nd AIRWeb, pages 25–30, 2006.
[16] Y. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers. In Proc. 16th WWW Conf., pages 291–300, 2007.
