Rule Based Plagiarism Detection using Information Retrieval

Aniruddha Ghosh, Pinaki Bhaskar, Santanu Pal, Sivaji Bandyopadhyay
Department of Computer Science and Engineering, Jadavpur University, Kolkata - 700032, India
{arghyaonline, pinaki.bhaskar, santanu.pal.ju}@gmail.com, sivaji_cse_ju@yahoo.com

Abstract.
This paper reports on the development of a plagiarism detection system built for the plagiarism detection task at PAN 2011. The external plagiarism detection problem has been addressed with the help of Nutch, an open source Information Retrieval (IR) system. The system contains three phases: knowledge preparation, candidate retrieval and plagiarism detection. A knowledge base is prepared from the source documents and used to build the Nutch index, and queries are formed from the suspicious documents for submission to the Nutch IR system. Nutch assigns a similarity score to each retrieved candidate source sentence. A dissimilarity score is then computed between each candidate sentence and the suspicious sentence. Each candidate source sentence is ranked on the basis of these two scores, and the top ranked candidate sentence is selected for each suspicious sentence.

Keywords: Plagiarism Detection, Information Retrieval System, Similarity Score, Dissimilarity Score.

1 Introduction
Plagiarism may be defined as the wrongful use and close replication of thoughts, ideas, or expressions from someone else's original work, whether in the same language or from another language. Since the 18th century, plagiarism has been considered academic dishonesty [1]. For decades, researchers have explored different techniques to detect plagiarism. Plagiarism can occur in different forms: full plagiarism, substantial plagiarism, minimalistic plagiarism, source citation, etc. Detecting it has become a challenging task in the area of Natural Language Processing. In our approach, we have considered all the forms of plagiarism except minimalistic plagiarism, working at the sentence level. Because there is no controlled evaluation environment in which the results of different algorithms can be compared, plagiarism detection is still a challenging task [2]. Researchers have organized various conferences (similar to PAN) to address the plagiarism problem.

The fingerprint retrieval method [3], candidate retrieval [4] and passage retrieval [5] are the most prominent attempts at plagiarism detection. The system described in [6] uses a natural language parser to find swapped words and phrases in order to detect intentional plagiarism, while n-gram co-occurrence statistics are used to detect verbatim copying. The Longest Common Subsequence technique has been used in [7] to handle text modification. Researchers have used the cosine similarity score and the n-gram vector space model at different levels, i.e., at the word [8] and character [9] levels. In the present work, plagiarism detection has been treated as an IR problem. An open source search engine, Nutch, has been used to retrieve the plagiarized parts from the suspicious documents.

2 System Framework
The framework of the Information Retrieval (Nutch¹) based plagiarism detection system is shown in Figure 1. The system works in three phases: knowledge preparation; candidate retrieval, i.e., identification of each suspicious sentence together with its probable set of source sentences; and finally plagiarism detection for each identified suspicious sentence.

Fig. 1. System Architecture

3 Knowledge Preparation
Each source document is parsed to identify and extract all the sentences in the document. A knowledge file is then generated for each source sentence. The filenames of the knowledge files are constructed in such a way that each source sentence can be traced back to its original source document. The knowledge of each sentence is stored in the knowledge file in the form of the stems, synonyms, hyponyms, hypernyms and synsets of each word (after removal of the stop words), all extracted from WordNet 3.0². Duplicate words are removed to obtain a set of unique words with the same sense. These words are used to identify plagiarized words, i.e., words that are similar in sense to the original words. The original words of the sentence are added to this set. Thus, each knowledge file for a sentence consists of a set of words. After all the knowledge files are built, they are indexed using Lucene³.
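
The construction of one knowledge file can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: `STOP_WORDS` and `TOY_LEXICON` are small hypothetical stand-ins for a real stop list and for the WordNet relations (stems, synonyms, hyponyms, hypernyms), and the filename scheme in `knowledge_filename` is only one possible way to keep a sentence traceable to its source document.

```python
import re

# Hypothetical stop list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "was"}

# Hypothetical stand-in for WordNet: maps a word to its related words
# (stems, synonyms, hyponyms, hypernyms) collapsed into one set.
TOY_LEXICON = {
    "car": {"automobile", "vehicle", "cab"},
    "bought": {"buy", "purchase", "acquire"},
}


def knowledge_words(sentence, lexicon=TOY_LEXICON):
    """Return the set of unique words stored for one source sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    content = [t for t in tokens if t not in STOP_WORDS]
    words = set(content)  # the original words are kept in the set
    for token in content:
        words |= lexicon.get(token, set())  # add sense-related words
    return words  # a set, so duplicates are removed automatically


def knowledge_filename(doc_id, sentence_index):
    """Assumed naming scheme: encode document id and sentence position."""
    return f"{doc_id}_s{sentence_index:05d}.txt"


if __name__ == "__main__":
    print(knowledge_filename("source-document00012", 7))
    print(sorted(knowledge_words("The man bought a car")))
```

Storing the expansion as a set mirrors the duplicate removal described above; the indexer then only needs one bag of words per sentence.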

¹ http://nutch.apache.org/
² http://wordnet.princeton.edu/
³ http://lucene.apache.org/

4 Candidate Retrieval
Each suspicious document is parsed to identify and extract all of its sentences. A query is generated from each suspicious sentence of the parsed document. First, all the stop words are removed from the sentence; the remaining words are then stemmed with the WordNet 3.0 stemmer to obtain the root form of each word. The resulting query is submitted to Nutch to retrieve the probable set of source sentences corresponding to the suspicious sentence. Since the source documents have been split into files that each contain only one sentence, Nutch effectively performs a sentence-to-sentence mapping, looking for a proximal match between the query and the indexed source files. For each suspicious sentence, Nutch returns a set of probable candidate source sentences in ranked order, together with a similarity score between the suspicious sentence and each candidate source sentence.
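
The query generation and retrieval steps can be sketched as below. This is a rough sketch under stated assumptions: `naive_stem` is a crude suffix stripper standing in for the WordNet 3.0 stemmer, and `retrieve` ranks knowledge files by simple term overlap as a stand-in for the Nutch index and its similarity score, which are not reproduced here. All names are hypothetical.

```python
import re

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "was", "were"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for the WordNet stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_query(sentence):
    """Remove stop words, then stem the remaining words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def retrieve(query, knowledge_files, top_k=3):
    """Rank knowledge files by the fraction of query terms they contain.

    knowledge_files maps a filename to its set of words. The overlap
    ratio only imitates a similarity score; it is not Nutch's formula.
    """
    terms = set(query)
    if not terms:
        return []
    scored = []
    for name, words in knowledge_files.items():
        overlap = len(terms & words)
        if overlap:
            scored.append((name, overlap / len(terms)))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


if __name__ == "__main__":
    index = {
        "src1_s00001.txt": {"student", "copy", "report", "pupil"},
        "src2_s00004.txt": {"report", "weather"},
        "src3_s00002.txt": {"banana"},
    }
    query = build_query("The students were copying the reports")
    print(query)
    print(retrieve(query, index))
```

Because each indexed file holds exactly one sentence, every hit returned by `retrieve` is already a sentence-level candidate.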

5 Plagiarism Detection
The dissimilarity measure proposed in [10] has been used to calculate the dissimilarity score between each suspicious sentence and its retrieved candidate source sentences. For identical sentences, which share all of their n-grams, the dissimilarity score is 0. Using this measure, we have calculated the dissimilarity score of each candidate source sentence with respect to its suspicious sentence. The dissimilarity score is subtracted from the similarity score of each candidate source sentence to produce a final fine-grained score. All the candidate source sentences retrieved for a suspicious sentence are ranked according to this fine-grained score. The top ranked candidate source sentence is identified as the source of the plagiarized sentence in the suspicious document.
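
The scoring and ranking step can be sketched as follows. The sketch assumes the relative-difference form of the n-gram profile measure in [10], computed here over character trigrams without truncating profiles to the most frequent n-grams; the n-gram level, the value of n and all function names are assumptions made for illustration. The similarity scores passed to `rank_candidates` would come from the IR system and are supplied by hand in the example.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams (n=3 is an assumption)."""
    text = " ".join(text.lower().split())
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    total = len(grams)
    return {gram: count / total for gram, count in Counter(grams).items()}


def dissimilarity(profile_a, profile_b):
    """Sum of squared relative frequency differences over all n-grams.

    Identical profiles give 0; each n-gram found in only one profile
    contributes 4 to the sum.
    """
    score = 0.0
    for gram in set(profile_a) | set(profile_b):
        fa = profile_a.get(gram, 0.0)
        fb = profile_b.get(gram, 0.0)
        score += ((fa - fb) / ((fa + fb) / 2)) ** 2
    return score


def rank_candidates(suspicious, candidates):
    """Rank (source_sentence, similarity) pairs by similarity - dissimilarity."""
    suspicious_profile = ngram_profile(suspicious)
    scored = [
        (source, similarity - dissimilarity(suspicious_profile, ngram_profile(source)))
        for source, similarity in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    ranked = rank_candidates(
        "the cat sat on the mat",
        [("a dog ran in the park", 0.9), ("the cat sat on the mat", 0.8)],
    )
    print(ranked[0][0])  # top ranked candidate source sentence
```

In this example the verbatim copy wins despite its lower similarity score, because its dissimilarity is 0 while the other candidate is penalized for every n-gram it does not share.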

6 Evaluation
The plagiarism detection system was evaluated using the evaluation framework described in [2]. The evaluation scores are shown in Table 1.
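
The framework in [2] combines precision, recall and granularity into a single plagdet score: the F1 measure divided by log2(1 + granularity). As a quick consistency check, the sketch below recomputes that score from the precision, recall and granularity values reported in Table 1; the function name `plagdet` is ours, and the equal weighting of precision and recall is assumed.

```python
import math


def plagdet(precision, recall, granularity):
    """F1 of precision and recall, discounted by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


if __name__ == "__main__":
    # Precision, recall and granularity as reported in Table 1.
    score = plagdet(0.0011829, 0.0050052, 2.0028818)
    print(f"{score:.7f}")
```

The recomputed value agrees with the reported plagdet score of 0.0012063 after rounding. A granularity near 2 means that each detected case was reported in about two pieces on average, which lowers the score relative to the plain F1 value.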

Table 1. Evaluation Measurement

Precision   Recall      Granularity   Plagdet Score
0.0011829   0.0050052   2.0028818     0.0012063

7 Conclusion and Future Works
The present task is our first attempt at plagiarism detection. We have tested for plagiarism at the sentence level, but phrase-level experimentation is still left to investigate. In future, an algorithm has to be developed to test the relevance of the candidate source sentences retrieved by Nutch and to choose the most relevant plagiarized part. The knowledge files for the source documents will also have to be updated.

Acknowledgment
The work has been carried out with support from the Department of Information Technology (DIT), Govt. of India funded project Development of "Cross Lingual Information Access (CLIA)" System Phase II.

References
1. Wikipedia article on Plagiarism: http://en.wikipedia.org/wiki/Plagiarism
2. Potthast M. et al.: An Evaluation Framework for Plagiarism Detection. In Proceedings of the COLING 2010, Beijing, China, August 2010.
3. Yurii Palkovskii, Alexei Belov and Irina Muzika: Exploring Fingerprinting as External Plagiarism Detection Method: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
4. Viviane P. Moreira, Rafael C. Pereira and Galante Renata: UFRGS@PAN2010: Detecting External Plagiarism: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
5. Clara Vania and Mirna Adriani: External Plagiarism Detection Using Passage Similarities: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
6. M. Mozgovoy, T. Kakkonen and E. Sutinen: Using Natural Language Parsers in Plagiarism Detection. In Proceeding of SLaTE'07 Workshop, Pennsylvania, USA, October 2007.
7. Chen, Chien-Ying, Jen-Yuan Yeh and Hao-Ren Ke: Plagiarism Detection using ROUGE and WordNet. Journal of Computing, 2(3), pages 34-44, March 2010. https://sites.google.com/site/journalofcomputing/. ISSN 2151-9617.
8. Cristian Grozea and Marius Popescu: Encoplot - Performance in the Second International Plagiarism Detection Challenge: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
9. Basile et al.: A Plagiarism Detection Procedure in Three Steps: Selection, Matches and "Squares". In Proceedings of the SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN 2009), Donostia-San Sebastian, Spain.
10. Vlado Keselj, Fuchun Peng, Nick Cercone and Calvin Thomas: "N-gram-based Author Profiles for Authorship Attribution". In Proceedings of the PACLING'03, Dalhousie University, Halifax, Nova Scotia, Canada, pp. 255-264, August 2003.
