Rule Based Plagiarism Detection using Information Retrieval

Aniruddha Ghosh, Pinaki Bhaskar, Santanu Pal, Sivaji Bandyopadhyay

Department of Computer Science and Engineering, Jadavpur University, Kolkata - 700032, India
{arghyaonline, pinaki.bhaskar, santanu.pal.ju}@gmail.com, sivaji_cse_ju@yahoo.com

Abstract.
This paper reports on the development of a plagiarism detection system as part of the plagiarism detection task in PAN 2011. The external plagiarism detection problem has been solved with the help of Nutch, an open source Information Retrieval (IR) system. The system contains three phases: knowledge preparation, candidate retrieval and plagiarism detection. From the source documents, a knowledge base has been prepared for developing the Nutch index, and queries have been formed from the suspicious documents for submission to the Nutch IR system. The retrieved candidate source sentences are assigned similarity scores by Nutch. A dissimilarity score is assigned to each pair of candidate sentence and suspicious sentence. Each candidate source sentence is ranked based on these two scores. The top ranked candidate sentence is selected for each suspicious sentence.

Keywords: Plagiarism Detection, Information Retrieval System, Similarity Score, Dissimilarity Score.

1 Introduction

Plagiarism may be defined as the wrongful misuse and close replication of thoughts, ideas, or expressions from the original work of someone, in the same language or from another language. Since the 18th century, plagiarism has been considered academic dishonesty [1]. For decades, researchers have explored different techniques to detect plagiarism. Plagiarism can occur in different forms: full plagiarism, substantial plagiarism, minimalistic plagiarism, source citation etc. It has become a challenging task in the area of Natural Language Processing. In our approach, we have considered all the forms of plagiarism except minimalistic plagiarism at the sentence level.

Due to the absence of a controlled evaluation environment for comparing the results of the algorithms, plagiarism detection is still a challenging task [2]. Researchers have organized various conferences (similar to PAN) to address the plagiarism problem. The fingerprint retrieval method [3], candidate retrieval [4] and passage retrieval [5] are the most prominent attempts at plagiarism detection. The system described in [6] works with a natural language parser to find swapped words and phrases in order to detect intentional plagiarism, while n-gram co-occurrence statistics are used to detect verbatim copying. The Longest Common Subsequence technique has been used in [7] to handle text modification. Researchers have used the cosine similarity score and the n-gram vector space model at different levels, i.e., word [8] and character [9] levels.

In the present work, plagiarism has been treated as an IR problem. An open source search engine, Nutch, has been used to retrieve the plagiarized parts from the suspicious documents.

2 System Framework

The Information Retrieval (Nutch¹) based Plagiarism Detection system framework is shown in Figure 1. The system is defined in three phases: Knowledge Preparation; Candidate Retrieval, i.e., identification of each suspicious sentence and the probable set of source sentences paired with it; and finally Plagiarism Detection for each identified suspicious sentence.

Fig. 1. System Architecture

3 Knowledge Preparation

Each source document is parsed to identify and extract all the sentences in the document. A knowledge file is then generated for each source sentence. The knowledge files are named in such a manner that each source sentence can be traced back to its original source document. The knowledge of each sentence is stored in the knowledge file in the form of stems, synonyms, hyponyms, hypernyms and synsets of each word (after removal of the stop words), which are extracted from WordNet 3.0². Duplicate words are removed to get the set of unique words with identical sense. These words are used to identify the plagiarized words, i.e., the words that are similar in sense to the original words. The original words in the sentence are added to this set of words. Thus, each knowledge file for a sentence consists of a set of words. After all the knowledge files are built, they are indexed using Lucene³.
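
The knowledge preparation step can be illustrated with a minimal sketch. It is not the system's implementation: the stop word list, the toy lexicon TOY_LEXICON (a stand-in for WordNet 3.0) and the file naming scheme in knowledge_file_name are all hypothetical, chosen only to show how a sentence is expanded into a duplicate-free set of sense-related words.

```python
import re

# Hypothetical stop word list; the paper does not specify the one it used.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}

# Hypothetical toy lexicon standing in for WordNet 3.0 lookups.
TOY_LEXICON = {
    "cars": {
        "stems": ["car"],
        "synonyms": ["automobile", "auto"],
        "hypernyms": ["vehicle"],
        "hyponyms": ["taxi"],
    },
    "fast": {
        "stems": ["fast"],
        "synonyms": ["quick", "rapid"],
    },
}

RELATIONS = ("stems", "synonyms", "hypernyms", "hyponyms")


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def build_knowledge(sentence, lexicon=TOY_LEXICON):
    """Return the set of words stored in the knowledge file of one sentence."""
    words = [w for w in tokenize(sentence) if w not in STOP_WORDS]
    knowledge = set(words)  # the original words are kept in the set
    for word in words:
        entry = lexicon.get(word, {})
        for relation in RELATIONS:
            knowledge.update(entry.get(relation, ()))
    return knowledge  # a set, so duplicate words are removed automatically


def knowledge_file_name(doc_id, sentence_index):
    """Hypothetical naming scheme that keeps the sentence traceable."""
    return f"{doc_id}_s{sentence_index:05d}.txt"


if __name__ == "__main__":
    print(knowledge_file_name("source-document00012", 7))
    print(sorted(build_knowledge("The cars are fast")))
```

Encoding the document identifier and the sentence index in the file name is one simple way to meet the traceability requirement described above; any reversible scheme would serve.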

¹ http://nutch.apache.org/
² http://wordnet.princeton.edu/
³ http://lucene.apache.org/

4 Candidates Retrieval

Each suspicious document is parsed to identify and extract all of its sentences. Each suspicious sentence from the parsed suspicious document is used to generate a query. First, all the stop words are removed from the sentence, and then the remaining words are stemmed using the WordNet 3.0 stemmer to get the root form of each word. After the query is generated from a suspicious sentence, it is submitted to Nutch to retrieve the probable set of source sentences corresponding to that suspicious sentence. As the source documents are split into sentences and each file contains only one sentence, Nutch performs a sentence-to-sentence mapping to find a proximal match between the query and the indexed source files. A set of probable candidate source sentences is identified by Nutch in ranked order for each suspicious sentence. Nutch provides the similarity score between a suspicious sentence and the corresponding candidate source sentence.
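
The query generation and retrieval steps can be sketched as follows. This is an illustration under stated assumptions, not the Nutch pipeline: crude_stem is a hypothetical suffix stripper standing in for the WordNet 3.0 stemmer, the index is a plain dictionary from knowledge file names to word sets, and the score is a simple term-overlap ratio rather than the similarity score that Nutch actually computes.

```python
import re

# Hypothetical stop word list; the paper does not specify the one it used.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def crude_stem(word):
    """Hypothetical suffix stripper standing in for the WordNet 3.0 stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def make_query(sentence):
    """Remove stop words, then stem the remaining words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}


def retrieve(query, index, top_k=3):
    """Rank knowledge files by the fraction of query terms they contain.

    The overlap ratio is an assumed stand-in for the Nutch similarity score.
    """
    if not query:
        return []
    scored = []
    for name, words in index.items():
        overlap = len(query & words)
        if overlap:
            scored.append((name, overlap / len(query)))
    # Highest score first; ties are broken by file name for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


if __name__ == "__main__":
    index = {
        "docA_s00001.txt": {"dog", "jump", "leap", "wall", "fence"},
        "docA_s00002.txt": {"dog", "bark", "night"},
        "docB_s00001.txt": {"stock", "market"},
    }
    query = make_query("The dogs jumped over walls")
    for name, score in retrieve(query, index):
        print(name, score)
```

Because every indexed file holds exactly one expanded sentence, each hit corresponds directly to one candidate source sentence.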

5 Plagiarism Detection

An algorithm for dissimilarity measurement, proposed in [10], has been used to calculate the dissimilarity score between the suspicious sentence and its corresponding retrieved candidate sentences. For identical sentences, which have the largest possible number of identical n-grams, the dissimilarity score is 0. Using this measure, we have calculated the dissimilarity score of each source sentence corresponding to the suspicious sentences.
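
A minimal sketch of an n-gram profile dissimilarity in the style of [10] is given below. It sums, over every n-gram occurring in either profile, the squared difference of the normalized frequencies divided by their mean. The use of character trigrams and of untruncated profiles are assumptions made for illustration; the paper does not state the n-gram type, the value of n, or the profile length used in the system.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Map each character n-gram to its normalized frequency in the text."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def dissimilarity(sentence_a, sentence_b, n=3):
    """Profile dissimilarity: 0 for identical texts, larger when they differ."""
    profile_a = ngram_profile(sentence_a, n)
    profile_b = ngram_profile(sentence_b, n)
    score = 0.0
    for gram in set(profile_a) | set(profile_b):
        fa = profile_a.get(gram, 0.0)
        fb = profile_b.get(gram, 0.0)
        # fa + fb > 0 because the n-gram occurs in at least one profile.
        score += ((fa - fb) / ((fa + fb) / 2)) ** 2
    return score


if __name__ == "__main__":
    suspicious = "the cat sat on the mat"
    print(dissimilarity(suspicious, suspicious))
    print(dissimilarity(suspicious, "the cat sat on a mat"))
    print(dissimilarity(suspicious, "stock prices fell sharply today"))
```

An n-gram that occurs in only one of the two profiles contributes exactly 4 to the sum, so two sentences with no n-gram in common score four times the number of distinct n-grams they contain.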
The dissimilarity score is subtracted from the similarity score for each candidate source sentence, and a final fine-grained score is generated. All the retrieved candidate source sentences for each suspicious sentence are ranked according to this fine-grained score. The top ranked candidate source sentence is identified as the source sentence for the plagiarized sentence in the suspicious document.
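
The ranking step reduces to a subtraction and a sort, as the sketch below shows. The candidate identifiers and score values are hypothetical. The sketch subtracts the raw scores, since the paper does not say whether the two scores are brought to a common scale first.

```python
def rank_candidates(candidates):
    """Rank candidates by fine-grained score (similarity minus dissimilarity).

    `candidates` is a list of (source_id, similarity, dissimilarity) tuples.
    """
    scored = [(source_id, sim - dis) for source_id, sim, dis in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def best_source(candidates):
    """Return the id of the top ranked candidate, or None if there is none."""
    ranked = rank_candidates(candidates)
    return ranked[0][0] if ranked else None


if __name__ == "__main__":
    # Hypothetical scores for three candidate source sentences.
    candidates = [
        ("docA_s00001.txt", 0.9, 0.5),
        ("docA_s00002.txt", 0.8, 0.1),
        ("docB_s00001.txt", 0.3, 0.0),
    ]
    print(best_source(candidates))
```

In this example the candidate with the highest similarity score is not selected, because its larger dissimilarity score lowers its fine-grained score below that of the second candidate.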

6 Evaluation

The plagiarism detection system was evaluated using the evaluation framework described in [2]. The evaluation scores are shown in Table 1.
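
For reference, the overall score in the framework of [2] combines the three other measures: it is the harmonic mean of precision and recall divided by log2(1 + granularity). The sketch below applies that formula to the precision, recall and granularity values reported in Table 1; the function name plagdet is chosen here for illustration.

```python
import math


def plagdet(precision, recall, granularity):
    """Overall score of [2]: F1 of precision and recall over log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


if __name__ == "__main__":
    # Precision, recall and granularity as reported in Table 1.
    print(round(plagdet(0.0011829, 0.0050052, 2.0028818), 7))
```

A granularity of 1 leaves the harmonic mean unchanged, while a granularity near 2, as obtained here, divides it by roughly 1.59.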

Table 1. Evaluation Measurement

Precision    Recall       Granularity    Plagdet Score
0.0011829    0.0050052    2.0028818      0.0012063

7 Conclusion and Future Works

The present task is our first attempt in plagiarism detection.
We have tested plagiarism detection at the sentence level, but phrase level experimentation is still left to investigate. In future, an algorithm has to be developed to test the relevance of the candidate source sentences retrieved by Nutch and choose the most relevant plagiarized part. The knowledge files for the source documents will also have to be updated.

Acknowledgment

The work has been carried out with support from the Department of Information Technology (DIT), Govt. of India funded project Development of "Cross Lingual Information Access (CLIA)" System Phase II.

References

1. Wikipedia article on Plagiarism: http://en.wikipedia.org/wiki/Plagiarism
2. Potthast M. et al.: An Evaluation Framework for Plagiarism Detection. In Proceedings of the COLING 2010, Beijing, China, August 2010.
3. Yurii Palkovskii, Alexei Belov and Irina Muzika: Exploring Fingerprinting as External Plagiarism Detection Method: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
4. Viviane P. Moreira, Rafael C. Pereira and Galante Renata: UFRGS@PAN2010: Detecting External Plagiarism: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
5. Clara Vania and Mirna Adriani: External Plagiarism Detection Using Passage Similarities: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
6. M. Mozgovoy, T. Kakkonen and E. Sutinen: Using Natural Language Parsers in Plagiarism Detection. In Proceeding of SLaTE'07 Workshop, Pennsylvania, USA, October 2007.
7. Chen, Chien-Ying, Jen-Yuan Yeh and Hao-Ren Ke: Plagiarism Detection using ROUGE and WordNet. Journal of Computing, 2(3), pages 34-44, March 2010. https://sites.google.com/site/journalofcomputing/. ISSN 2151-9617.
8. Cristian Grozea and Marius Popescu: Encoplot - Performance in the Second International Plagiarism Detection Challenge: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
9. Basile et al.: A Plagiarism Detection Procedure in Three Steps: Selection, Matches and "Squares". In Proceedings of the SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN 2009), Donostia-San Sebastian, Spain.
10. Vlado Keselj, Fuchun Peng, Nick Cercone and Calvin Thomas: "N-gram-based Author Profiles for Authorship Attribution". In Proceedings of the PACLING'03, Dalhousie University, Halifax, Nova Scotia, Canada, pp. 255-264, August 2003.
