Rule Based Plagiarism Detection using Information Retrieval

Aniruddha Ghosh, Pinaki Bhaskar, Santanu Pal, Sivaji Bandyopadhyay

Department of Computer Science and Engineering, Jadavpur University, Kolkata - 700032, India
{arghyaonline, pinaki.bhaskar, santanu.pal.ju}@gmail.com, sivaji_cse_ju@yahoo.com

Abstract.
This paper reports the development of a plagiarism detection system built for the plagiarism detection task at PAN 2011. The external plagiarism detection problem has been addressed with the help of Nutch, an open source Information Retrieval (IR) system. The system contains three phases: knowledge preparation, candidate retrieval and plagiarism detection. A knowledge base is prepared from the source documents to build the Nutch index, and queries are formed from the suspicious documents for submission to the Nutch IR system. Nutch assigns a similarity score to each retrieved candidate source sentence. A dissimilarity score is then computed between each candidate sentence and the suspicious sentence. Each candidate source sentence is ranked based on these two scores, and the top ranked candidate sentence is selected for each suspicious sentence.

Keywords: Plagiarism Detection, Information Retrieval System, Similarity Score, Dissimilarity Score.

1 Introduction

Plagiarism may be defined as the wrongful use and close replication of thoughts, ideas, or expressions from the original work of someone else, in the same language or from another language. Since the 18th century, plagiarism has been considered academic dishonesty [1]. For decades, researchers have explored different techniques to detect plagiarism. Plagiarism can occur in different forms: full plagiarism, substantial plagiarism, minimalistic plagiarism, source citation etc. Detecting it has become a challenging task in the area of Natural Language Processing. In our approach, we have considered all the forms of plagiarism except minimalistic plagiarism, at the sentence level. Owing to the absence of a controlled evaluation environment in which to compare the results of different algorithms, plagiarism detection is still a challenging task [2].

Researchers have organized various conferences (similar to PAN) to address the plagiarism problem. The fingerprint retrieval method [3], candidate retrieval [4] and passage retrieval [5] are the most prominent approaches to plagiarism detection. The system described in [6] works with a natural language parser to find swapped words and phrases in order to detect intentional plagiarism, while n-gram co-occurrence statistics are used to detect verbatim copying. The Longest Common Subsequence technique has been used in [7] to handle text modification. Researchers have used the cosine similarity score and the n-gram vector space model at different levels, i.e., word [8] and character [9] levels.

In the present work, plagiarism has been treated as an IR problem. An open source search engine, Nutch, has been used to retrieve the plagiarized parts from the suspicious documents.

2 System Framework

The Information Retrieval (Nutch¹) based Plagiarism Detection system framework is shown in Figure 1. The system is defined in three phases: Knowledge Preparation; Candidate Retrieval, i.e., identification of each suspicious sentence and its probable set of source sentences; and finally Plagiarism Detection for each identified suspicious sentence.

Fig. 1. System Architecture

3 Knowledge Preparation

Each source document is parsed to identify and extract all the sentences in the document.
A knowledge file is then generated for each source sentence. The knowledge files are named in such a manner that each source sentence can be traced back to its position in the original source document. The knowledge of each sentence is stored in the knowledge file in the form of the stems, synonyms, hyponyms, hypernyms and synsets of each word (after removal of the stop words), extracted from WordNet 3.0². Duplicate words are removed to obtain a set of unique words with the same sense. These words, which are similar in sense to the original words, are used to identify plagiarized words. The original words in the sentence are added to this set of words. Thus, each knowledge file for a sentence consists of a set of words. After all the knowledge files are built, they are indexed using Lucene³.
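
The knowledge preparation step can be illustrated with a minimal sketch. It is not the authors' implementation: the stop word list, the suffix-stripping `toy_stem`, the `TOY_LEXICON` dictionary (a hypothetical stand-in for the WordNet 3.0 lookups of synonyms, hyponyms and hypernyms) and the file naming scheme in `knowledge_filename` are all assumptions made for illustration.

```python
# Assumption: a tiny stop word list; the paper does not specify the one used.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "on", "and", "to"}

# Assumption: hypothetical stand-in for WordNet 3.0 related-word lookups.
TOY_LEXICON = {
    "car": {"automobile", "vehicle"},
    "quick": {"fast", "speedy"},
}


def toy_stem(word):
    """Crude suffix stripping; a placeholder for the WordNet stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_knowledge(sentence):
    """Return the set of words stored in one sentence's knowledge file."""
    words = [w.strip(".,;:!?\"'()").lower() for w in sentence.split()]
    words = [w for w in words if w and w not in STOPWORDS]
    knowledge = set(words)  # the original words are kept in the set
    for word in words:
        stem = toy_stem(word)
        knowledge.add(stem)
        knowledge |= TOY_LEXICON.get(stem, set())  # sense-related words
    return knowledge  # a set, so duplicates are removed automatically


def knowledge_filename(doc_id, sentence_index):
    """Hypothetical naming scheme that keeps the sentence traceable."""
    return f"{doc_id}_s{sentence_index:05d}.txt"


knowledge = build_knowledge("The quick cars")
name = knowledge_filename("source-document00001", 12)
```

In this sketch, one such set would be written to one file per source sentence, and the resulting files would then be indexed.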

¹ http://nutch.apache.org/
² http://wordnet.princeton.edu/
³ http://lucene.apache.org/

4 Candidate Retrieval

Each suspicious document is parsed to identify and extract all the sentences in the suspicious document.
A query is generated from each suspicious sentence of the parsed suspicious document. First, all the stop words are removed from the sentence, and then the remaining words are stemmed using the WordNet 3.0 stemmer to get the root form of each word. The generated query is submitted to Nutch to retrieve the probable set of source sentences corresponding to each suspicious sentence. As the source documents are split into files that each contain only one sentence, Nutch performs a sentence-to-sentence mapping, finding the closest matches between the query and the indexed source files. For each suspicious sentence, Nutch returns a set of probable candidate source sentences in ranked order, together with the similarity score between the suspicious sentence and each candidate source sentence.
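
The query generation and retrieval steps might look roughly like the following sketch. Nutch itself is not called here: `retrieve` is a hypothetical term-overlap scorer standing in for the Nutch/Lucene ranking, and the stop word list and `toy_stem` function are likewise assumptions rather than the WordNet 3.0 stemmer used in the system.

```python
# Assumption: a tiny stop word list; the paper does not specify the one used.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "on", "and", "to"}


def toy_stem(word):
    """Crude suffix stripping; a placeholder for the WordNet stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_query(sentence):
    """Remove stop words, stem the rest, and drop repeated terms."""
    words = [w.strip(".,;:!?\"'()").lower() for w in sentence.split()]
    stems = [toy_stem(w) for w in words if w and w not in STOPWORDS]
    return list(dict.fromkeys(stems))  # keeps first-seen order


def retrieve(query_terms, index, top_k=5):
    """Rank knowledge files by the fraction of query terms they contain.

    `index` maps a knowledge file name to its set of words. This overlap
    score is only a stand-in for the similarity score returned by Nutch.
    """
    if not query_terms:
        return []
    scored = []
    for filename, words in index.items():
        overlap = sum(1 for term in query_terms if term in words)
        if overlap:
            scored.append((filename, overlap / len(query_terms)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


index = {
    "doc1_s00001.txt": {"cat", "sat", "mat"},
    "doc1_s00002.txt": {"dog", "park"},
}
query = build_query("The cats sat on the mat")
candidates = retrieve(query, index)
```

Because each indexed file holds a single source sentence, every hit in `candidates` corresponds directly to one candidate source sentence.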

5 Plagiarism Detection

An algorithm for dissimilarity measurement, proposed in [10], has been used to calculate the dissimilarity score between the suspicious sentence and each of its retrieved candidate sentences. For sentences that are identical, and hence share the greatest number of identical n-grams, the dissimilarity score is 0. Using this measure, we have calculated the dissimilarity score of each candidate source sentence with respect to the corresponding suspicious sentence. The dissimilarity score is subtracted from the similarity score of each candidate source sentence to generate a final fine-grained score. All the retrieved candidate source sentences for each suspicious sentence are ranked according to this fine-grained score. The top ranked candidate source sentence is identified as the source of the plagiarized sentence in the suspicious document.
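
A minimal sketch of this scoring step is given below. It uses the relative-difference dissimilarity over n-gram frequency profiles from [10]; the choice of character trigrams, the use of full profiles rather than only the most frequent n-grams, and the function names are assumptions, and the similarity values passed in are placeholders for the scores returned by Nutch.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams (n=3 is an assumption)."""
    text = " ".join(text.lower().split())
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def dissimilarity(profile_a, profile_b):
    """Sum of squared relative frequency differences, following [10].

    Identical profiles give exactly 0.
    """
    score = 0.0
    for gram in set(profile_a) | set(profile_b):
        fa = profile_a.get(gram, 0.0)
        fb = profile_b.get(gram, 0.0)
        score += ((fa - fb) / ((fa + fb) / 2.0)) ** 2
    return score


def rank_candidates(suspicious, candidates, n=3):
    """Rank candidates by similarity minus dissimilarity, best first.

    `candidates` is a list of (source_id, sentence, similarity) tuples.
    """
    suspicious_profile = ngram_profile(suspicious, n)
    scored = []
    for source_id, sentence, similarity in candidates:
        d = dissimilarity(suspicious_profile, ngram_profile(sentence, n))
        scored.append((similarity - d, source_id))
    return sorted(scored, reverse=True)


ranked = rank_candidates(
    "the cat sat on the mat",
    [
        ("s1", "a dog ran in the park", 1.0),
        ("s2", "the cat sat on the mat", 1.0),
    ],
)
best_source = ranked[0][1]
```

Note that the two scores live on different scales in this sketch, so the subtraction is dominated by the dissimilarity term; how the actual system balances them is not stated in the text.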

6 Evaluation

The plagiarism detection system was evaluated using the evaluation framework described in [2]. The evaluation scores are shown in Table 1.
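
As a consistency check, the Plagdet score of the framework in [2] combines the other three measures as the F1 score of precision and recall divided by log2(1 + granularity). The short sketch below applies that formula to the reported precision, recall and granularity; the function name `plagdet` is only illustrative.

```python
import math


def plagdet(precision, recall, granularity):
    """Plagdet = F1(precision, recall) / log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Values reported for this system in Table 1.
score = plagdet(0.0011829, 0.0050052, 2.0028818)
```

Rounded to seven decimal places, `score` agrees with the reported Plagdet value of 0.0012063. A granularity near 2 means that each detected case is reported, on average, as about two separate detections, which lowers the score relative to the plain F1.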

Table 1. Evaluation

Measurement   Precision   Recall      Granularity   Plagdet
Score         0.0011829   0.0050052   2.0028818     0.0012063

7 Conclusion and Future Works

The present task is our first attempt at plagiarism detection.
We have tested for plagiarism at the sentence level, but phrase level experimentation remains to be investigated. In future, an algorithm has to be developed to test the relevance of the candidate source sentences retrieved by Nutch and to choose the most relevant plagiarized part. The knowledge files for the source documents will also have to be updated.

Acknowledgment

The work has been carried out with support from the Department of Information Technology (DIT), Govt. of India funded project Development of "Cross Lingual Information Access (CLIA)" System Phase II.

References

1. Wikipedia article on Plagiarism: http://en.wikipedia.org/wiki/Plagiarism
2. Potthast M. et al.: An Evaluation Framework for Plagiarism Detection. In Proceedings of the COLING 2010, Beijing, China, August 2010.
3. Yurii Palkovskii, Alexei Belov and Irina Muzika: Exploring Fingerprinting as External Plagiarism Detection Method: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
4. Viviane P. Moreira, Rafael C. Pereira and Galante Renata: UFRGS@PAN2010: Detecting External Plagiarism: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
5. Clara Vania and Mirna Adriani: External Plagiarism Detection Using Passage Similarities: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
6. M. Mozgovoy, T. Kakkonen and E. Sutinen: Using Natural Language Parsers in Plagiarism Detection. In Proceeding of SLaTE'07 Workshop, Pennsylvania, USA, October 2007.
7. Chen, Chien-Ying, Jen-Yuan Yeh and Hao-Ren Ke: Plagiarism Detection using ROUGE and WordNet. Journal of Computing, 2(3), pages 34-44, March 2010. https://sites.google.com/site/journalofcomputing/. ISSN 2151-9617.
8. Cristian Grozea and Marius Popescu: Encoplot - Performance in the Second International Plagiarism Detection Challenge: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
9. Basile et al.: A Plagiarism Detection Procedure in Three Steps: Selection, Matches and "Squares". In Proceedings of the SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN 2009), Donostia-San Sebastian, Spain.
10. Vlado Keselj, Fuchun Peng, Nick Cercone and Calvin Thomas: "N-gram-based Author Profiles for Authorship Attribution". In Proceedings of the PACLING'03, Dalhousie University, Halifax, Nova Scotia, Canada, pp. 255-264, August 2003.