Rule Based Plagiarism Detection using Information Retrieval

Aniruddha Ghosh, Pinaki Bhaskar, Santanu Pal, Sivaji Bandyopadhyay

Department of Computer Science and Engineering, Jadavpur University, Kolkata – 700032, India
{arghyaonline, pinaki.bhaskar, santanu.pal.ju}@gmail.com, sivaji_cse_ju@yahoo.com

Abstract.
This paper reports on the development of a plagiarism detection system built for the plagiarism detection task at PAN 2011. The external plagiarism detection problem is addressed with the help of Nutch, an open source Information Retrieval (IR) system. The system has three phases: knowledge preparation, candidate retrieval and plagiarism detection. A knowledge base is prepared from the source documents to build the Nutch index, and queries are formed from the suspicious documents and submitted to the Nutch IR system. Nutch assigns a similarity score to each retrieved candidate source sentence. A dissimilarity score is then computed between each candidate sentence and the suspicious sentence. Each candidate source sentence is ranked on the basis of these two scores, and the top ranked candidate sentence is selected for each suspicious sentence.
Keywords: Plagiarism Detection, Information Retrieval System, Similarity Score, Dissimilarity Score.

1 Introduction

Plagiarism may be defined as the wrongful misuse and close replication of thoughts, ideas, or expressions from the original work of someone else, either in the same language or from another language. Since the 18th century, plagiarism has been considered academic dishonesty [1]. For decades, researchers have explored different techniques to detect plagiarism. Plagiarism can occur in different forms: full plagiarism, substantial plagiarism, minimalistic plagiarism, source citation, etc. Detecting it has become a challenging task in the area of Natural Language Processing. In our approach, we have considered all the forms of plagiarism except minimalistic plagiarism, working at the sentence level.

Due to the absence of a controlled evaluation environment for comparing the results of different algorithms, plagiarism detection is still a challenging task [2]. Researchers have organized various conferences (similar to PAN) to address the plagiarism problem. The fingerprint retrieval method [3], candidate retrieval [4] and passage retrieval [5] are the most prominent attempts at plagiarism detection. The system described in [6] works with a natural language parser to find swapped words and phrases in order to detect intentional plagiarism, while n-gram co-occurrence statistics are used to detect verbatim copying. The Longest Common Subsequence technique has been used in [7] to handle text modification. Researchers have used the cosine similarity score and the n-gram vector space model at different levels, i.e., at the word [8] and character [9] levels.

In the present work, plagiarism has been treated as an IR problem. An open source search engine, Nutch, has been used to retrieve the plagiarized parts from the suspicious documents.

2 System Framework

The framework of the Information Retrieval (Nutch¹) based plagiarism detection system is shown in Figure 1. The system works in three phases: knowledge preparation; candidate retrieval, i.e., identification of each suspicious sentence and its probable set of source sentences; and finally plagiarism detection for each identified suspicious sentence.

Fig. 1. System Architecture

3 Knowledge Preparation

Each source document is parsed to identify and extract all the sentences in the document.
Knowledge files are then generated, one for each source sentence. The knowledge files are named in such a way that each source sentence can be traced back to its original source document. The knowledge of each sentence is stored in the knowledge file in the form of the stems, synonyms, hyponyms, hypernyms and synsets of each word (after removal of the stop words), extracted from WordNet 3.0². Duplicate words are removed to obtain a set of unique words with identical sense. These words are used to identify plagiarized words, i.e., words that are similar in sense to the original words. The original words in the sentence are added to this set of words. Thus, each knowledge file for a sentence consists of a set of words.
After all the knowledge files are built, they are indexed using Lucene³.
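
The knowledge preparation step can be illustrated with a minimal sketch. It is not our implementation: the stop word list, the toy lexicon that stands in for WordNet 3.0 lookups, and the file naming scheme in `knowledge_filename` are all assumptions made only for illustration.

```python
import re

# Tiny illustrative stop word list (assumption; the actual list is not given).
STOPWORDS = {"a", "an", "the", "is", "of", "in", "on", "and", "to"}

# Toy stand-in for WordNet 3.0 (assumption): maps a word to related words
# such as stems, synonyms, hypernyms and hyponyms.
TOY_LEXICON = {
    "car": {"car", "auto", "automobile", "vehicle"},
    "fast": {"fast", "quick", "rapid"},
}


def knowledge_words(sentence, lexicon=TOY_LEXICON, stopwords=STOPWORDS):
    """Return the set of words stored in the knowledge file of one sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    words = set()  # a set removes duplicate words automatically
    for token in tokens:
        if token in stopwords:
            continue
        words.add(token)                    # keep the original word
        words |= lexicon.get(token, set())  # add the sense-related words
    return words


def knowledge_filename(doc_id, sentence_index):
    """Hypothetical naming scheme that lets a sentence be traced to its document."""
    return f"{doc_id}_s{sentence_index:05d}.txt"


if __name__ == "__main__":
    print(sorted(knowledge_words("The car is fast")))
    print(knowledge_filename("source-document00001", 3))
```

Each resulting word set would be written to its own file, so that the index holds exactly one entry per source sentence.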

¹ http://nutch.apache.org/
² http://wordnet.princeton.edu/
³ http://lucene.apache.org/

4 Candidate Retrieval

Each suspicious document is parsed to identify and extract all of its sentences.
Each suspicious sentence from the parsed suspicious document is used to generate a query. First, all the stop words are removed from the sentence, and then the remaining words are stemmed using the WordNet 3.0 stemmer to get the root form of each word. The query generated from a suspicious sentence is submitted to Nutch to retrieve the probable set of source sentences corresponding to that suspicious sentence. As the source documents are split into sentences, with each file containing only one sentence, Nutch performs a sentence-to-sentence mapping to find a close match between the query and the indexed source files. For each suspicious sentence, Nutch identifies a set of probable candidate source sentences in ranked order. Nutch also provides the similarity score between a suspicious sentence and each corresponding candidate source sentence.
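
The query generation described above can be sketched as follows. This is only an illustration under stated assumptions: the stop word list is a small sample, and `naive_stem` is a crude suffix stripper used in place of the WordNet 3.0 stemmer so that the example runs without external resources. The call that submits the query to Nutch is not shown.

```python
import re

# Small sample stop word list (assumption; the actual list is not given).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def naive_stem(word):
    """Crude suffix stripping, a stand-in for the WordNet 3.0 stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_query(sentence):
    """Remove stop words, stem the remaining words and join them into a query."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    terms = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return " ".join(terms)


if __name__ == "__main__":
    print(build_query("The students copied the reports"))
```

A real stemmer would return dictionary root forms rather than truncated strings; the sketch only shows the order of the two steps.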

5 Plagiarism Detection

An algorithm for dissimilarity measurement, proposed in [10], has been used to calculate the dissimilarity score between the suspicious sentence and each of its retrieved candidate sentences. For identical sentences, which have the largest possible number of n-grams in common, the dissimilarity score is 0. Using this measure, we have calculated the dissimilarity score of each candidate source sentence with respect to its suspicious sentence. The dissimilarity score is subtracted from the similarity score of each candidate source sentence to produce a final fine-grained score. All the retrieved candidate source sentences for each suspicious sentence are ranked according to this fine-grained score. The top ranked candidate source sentence is identified as the source sentence for the plagiarized sentence in the suspicious document.
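
The scoring and ranking step can be sketched as below. The dissimilarity follows the profile distance of [10], i.e., the sum over all n-grams of ((f1 - f2) / ((f1 + f2) / 2))^2, where f1 and f2 are the normalized frequencies of an n-gram in the two texts. The use of character trigrams, the absence of a profile size limit and the function names are assumptions of this sketch, and the similarity scores are supplied by hand instead of being taken from Nutch.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Normalized frequencies of the character n-grams of a text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def dissimilarity(text_a, text_b, n=3):
    """Profile dissimilarity in the style of [10]; 0 for identical texts."""
    profile_a = ngram_profile(text_a, n)
    profile_b = ngram_profile(text_b, n)
    score = 0.0
    for gram in set(profile_a) | set(profile_b):
        fa = profile_a.get(gram, 0.0)
        fb = profile_b.get(gram, 0.0)
        score += ((fa - fb) / ((fa + fb) / 2)) ** 2
    return score


def rank_candidates(suspicious, candidates):
    """Rank (source_sentence, similarity) pairs by similarity minus dissimilarity."""
    scored = [(similarity - dissimilarity(suspicious, source), source)
              for source, similarity in candidates]
    return sorted(scored, reverse=True)  # best fine-grained score first


if __name__ == "__main__":
    # Hand-written similarity scores stand in for the scores returned by Nutch.
    candidates = [("a dog ran in the park", 1.0), ("the cat sat on the mat", 1.0)]
    for score, source in rank_candidates("the cat sat on the mat", candidates):
        print(round(score, 3), source)
```

Since the two scores are not on a common scale, the subtraction serves mainly to reorder the candidates retrieved for one suspicious sentence.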

6 Evaluation

The plagiarism detection system was evaluated using the evaluation framework described in [2]. The evaluation scores are shown in Table 1.
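
As a cross-check, the overall score of the framework in [2] combines the other three measures as F1 / log2(1 + granularity), where F1 is the harmonic mean of precision and recall. The short sketch below applies this formula to the values reported in Table 1; the function name `plagdet` is chosen here only for illustration.

```python
import math


def plagdet(precision, recall, granularity):
    """Overall score of [2]: F1 discounted by the detection granularity."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


if __name__ == "__main__":
    # Precision, recall and granularity as reported in Table 1.
    print(plagdet(0.0011829, 0.0050052, 2.0028818))
```

The computed value is approximately 0.0012063, which agrees with the reported overall score.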

Table 1. Evaluation Measurement

Precision    Recall       Granularity  PlagDet Score
0.0011829    0.0050052    2.0028818    0.0012063

7 Conclusion and Future Works

The present task is our first attempt at plagiarism detection.
We have tested plagiarism detection at the sentence level, but phrase level experimentation is still left to investigate. In future, an algorithm has to be developed to test the relevance of the candidate source sentences retrieved by Nutch and to choose the most relevant plagiarized part. The knowledge files for the source documents will also have to be updated.

Acknowledgment

The work has been carried out with support from the Department of Information Technology (DIT), Govt. of India funded project Development of "Cross Lingual Information Access (CLIA)" System Phase II.

References

1. Wikipedia article on Plagiarism: http://en.wikipedia.org/wiki/Plagiarism
2. Potthast, M. et al.: An Evaluation Framework for Plagiarism Detection. In Proceedings of the COLING 2010, Beijing, China, August 2010.
3. Yurii Palkovskii, Alexei Belov and Irina Muzika: Exploring Fingerprinting as External Plagiarism Detection Method: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
4. Viviane P. Moreira, Rafael C. Pereira and Galante Renata: UFRGS@PAN2010: Detecting External Plagiarism: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
5. Clara Vania and Mirna Adriani: External Plagiarism Detection Using Passage Similarities: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
6. M. Mozgovoy, T. Kakkonen and E. Sutinen: Using Natural Language Parsers in Plagiarism Detection. In Proceeding of SLaTE'07 Workshop, Pennsylvania, USA, October 2007.
7. Chen, Chien-Ying, Jen-Yuan Yeh and Hao-Ren Ke: Plagiarism Detection using ROUGE and WordNet. Journal of Computing, 2(3), pages 34-44, March 2010. https://sites.google.com/site/journalofcomputing/. ISSN 2151-9617.
8. Cristian Grozea and Marius Popescu: Encoplot - Performance in the Second International Plagiarism Detection Challenge: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
9. Basile et al.: A Plagiarism Detection Procedure in Three Steps: Selection, Matches and "Squares". In Proceedings of the SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN 2009), Donostia-San Sebastian, Spain.
10. Vlado Keselj, Fuchun Peng, Nick Cercone and Calvin Thomas: "N-gram-based Author Profiles for Authorship Attribution". In Proceedings of the PACLING'03, Dalhousie University, Halifax, Nova Scotia, Canada, pp. 255-264, August 2003.