Rule Based Plagiarism Detection using Information Retrieval

Aniruddha Ghosh, Pinaki Bhaskar, Santanu Pal, Sivaji Bandyopadhyay
Department of Computer Science and Engineering, Jadavpur University, Kolkata – 700032, India
{arghyaonline, pinaki.bhaskar, santanu.pal.ju}@gmail.com, sivaji_cse_ju@yahoo.com

Abstract.
This paper reports on the development of a plagiarism detection system as part of the plagiarism detection task in PAN 2011. The external plagiarism detection problem has been solved with the help of Nutch, an open source Information Retrieval (IR) system. The system contains three phases: knowledge preparation, candidate retrieval and plagiarism detection. From the source documents, a knowledge base has been prepared for building the Nutch index, and queries have been formed from the suspicious documents for submission to the Nutch IR system. The retrieved candidate source sentences are assigned similarity scores by Nutch. A dissimilarity score is assigned to each pair of candidate sentence and suspicious sentence. Each candidate source sentence is ranked based on these two scores. The top ranked candidate sentence is selected for each suspicious sentence.

Keywords: Plagiarism Detection, Information Retrieval System, Similarity Score, Dissimilarity Score.

1 Introduction

Plagiarism may be defined as the wrongful use and close replication of thoughts, ideas, or expressions from the original work of someone else, in the same language or from another language. Since the 18th century, plagiarism has been considered academic dishonesty [1]. For decades, researchers have explored different techniques to detect plagiarism. Plagiarism can occur in different forms: full plagiarism, substantial plagiarism, minimalistic plagiarism, source citation, etc. It has become a challenging task in the area of Natural Language Processing. In our approach, we have considered all the forms of plagiarism except minimalistic plagiarism, at the sentence level.

Due to the absence of a controlled evaluation environment in which the results of different algorithms can be compared, plagiarism detection is still a challenging task [2]. Researchers have organized various conferences (similar to PAN) to address the plagiarism problem. The fingerprint retrieval method [3], candidate retrieval [4] and passage retrieval [5] are the most prominent attempts at plagiarism detection. The system described in [6] works with a natural language parser to find swapped words and phrases in order to detect intentional plagiarism, while n-gram co-occurrence statistics are used to detect verbatim copying. The Longest Common Subsequence technique has been used in [7] to handle text modification. Researchers have used the cosine similarity score and the n-gram vector space model at different levels, i.e., the word [8] and character [9] levels.

In the present work, plagiarism has been treated as an IR problem. An open source search engine, Nutch, has been used to retrieve the plagiarized parts from the suspicious documents.

2 System Framework

The Information Retrieval (Nutch¹) based plagiarism detection system framework is shown in Figure 1. The system is defined in three phases: knowledge preparation; candidate retrieval, i.e., identification of each suspicious sentence and its probable set of source sentences; and finally plagiarism detection for each identified suspicious sentence.

Fig. 1. System Architecture

3 Knowledge Preparation

Each source document is parsed to identify and extract all the sentences in the document.
A knowledge file is then generated for each source sentence. The names of the knowledge files are created in such a manner that each source sentence can be traced back to its position in the original source document. The knowledge of each sentence is stored in the knowledge file in the form of the stems, synonyms, hyponyms, hypernyms and synsets of each word (after removal of the stop words), all of which are extracted from WordNet 3.0². Duplicate words are removed to obtain a set of unique words with the same sense. These words are used to identify plagiarized words, i.e., words that are similar in sense to the original words. The original words in the sentence are added to this set of words. Thus, each knowledge file for a sentence consists of a set of words. After all the knowledge files are built, they are indexed using Lucene³.
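
The construction of one knowledge file can be sketched as follows. This is only a minimal illustration under stated assumptions: the stop word list, the tiny `TOY_LEXICON` (which stands in for WordNet 3.0 lookups), the function names and the file naming scheme are all hypothetical and not taken from the system itself, and stemming is left out for brevity.

```python
# Hypothetical stop word list; the list used by the actual system is not given.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "and", "to"}

# Hypothetical stand-in for WordNet 3.0: word -> related words
# (synonyms, hypernyms, hyponyms). Entries are invented for illustration.
TOY_LEXICON = {
    "car": {"automobile", "vehicle", "cab"},
    "fast": {"quick", "rapid"},
}


def build_knowledge_set(sentence):
    """Return the set of words that would be written to one knowledge file."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    content = [w for w in words if w and w not in STOP_WORDS]
    knowledge = set(content)  # the original words are kept
    for word in content:
        # Using a set removes duplicate related words automatically.
        knowledge |= TOY_LEXICON.get(word, set())
    return knowledge


def knowledge_file_name(doc_id, sentence_index):
    """Encode document and sentence position so the source can be traced."""
    return f"{doc_id}_s{sentence_index:05d}.txt"


print(sorted(build_knowledge_set("The car is fast.")))
print(knowledge_file_name("source-document00012", 7))
```

Keeping the document identifier and sentence index in the file name is one simple way to map a retrieved knowledge file back to its source sentence.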

1 http://nutch.apache.org/
2 http://wordnet.princeton.edu/
3 http://lucene.apache.org/

4 Candidate Retrieval

Each suspicious document is parsed to identify and extract all of its sentences.
Each suspicious sentence from the parsed suspicious document is used to generate a query. First, all the stop words are removed from the sentence, and then the remaining words are stemmed using the WordNet 3.0 stemmer to obtain the root form of each word. The query generated from a suspicious sentence is submitted to Nutch to retrieve the probable set of source sentences corresponding to that suspicious sentence. Because the source documents are split into files that each contain exactly one sentence, Nutch performs a sentence-to-sentence mapping, looking for the closest matches between the query and the indexed source files. For each suspicious sentence, Nutch identifies a set of probable candidate source sentences in ranked order. Nutch also provides the similarity score between a suspicious sentence and each corresponding candidate source sentence.
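
The query generation step can be illustrated with a short sketch. It is an approximation under stated assumptions: the stop word list is hypothetical, and `crude_stem` is a simple suffix stripper used here in place of the WordNet 3.0 stemmer, so its output will differ from the real stemmer on many words.

```python
# Hypothetical stop word list; the list used by the actual system is not given.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "is", "are", "and", "to"}

# Suffixes tried in order by the stand-in stemmer (an assumption, not WordNet).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_query(sentence):
    """Turn a suspicious sentence into a list of stemmed query terms."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in sentence.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]


print(build_query("The students borrowed the reports."))
# prints ['student', 'borrow', 'report']
```

The resulting term list would then be joined into a query string and submitted to the search engine; how that submission is done depends on the Nutch setup and is not shown here.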

5 Plagiarism Detection

An algorithm for dissimilarity measurement, proposed in [10], has been used to calculate the dissimilarity score between each suspicious sentence and its retrieved candidate sentences. For identical sentences, which share all of their n-grams, the dissimilarity score is 0. Using this measure, we have calculated the dissimilarity score of each candidate source sentence with respect to its suspicious sentence. The dissimilarity score is subtracted from the similarity score of each candidate source sentence to produce a final fine-grained score. All the retrieved candidate source sentences for each suspicious sentence are ranked according to this fine-grained score. The top ranked candidate source sentence is identified as the source sentence for the plagiarized sentence in the suspicious document.
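
A minimal sketch of this scoring step is given below. It uses the relative-difference n-gram profile distance of [10], in which each n-gram contributes (2(f1 - f2)/(f1 + f2))^2 for its normalized frequencies f1 and f2 in the two profiles. Several details are assumptions made only for illustration: character trigrams are used, profiles are not truncated, the similarity scores are passed in as plain numbers rather than obtained from Nutch, and all function names are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Normalized character n-gram frequencies of a sentence."""
    text = " ".join(text.lower().split())
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def dissimilarity(p1, p2):
    """Profile distance of [10]; 0.0 for identical profiles."""
    score = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)
        # f1 + f2 > 0 because the n-gram occurs in at least one profile.
        score += (2.0 * (f1 - f2) / (f1 + f2)) ** 2
    return score


def rank_candidates(suspicious, candidates, n=3):
    """candidates: list of (source_sentence, similarity_score) pairs.

    Returns (fine_grained_score, source_sentence) pairs, best first.
    """
    susp_profile = ngram_profile(suspicious, n)
    scored = [
        (sim - dissimilarity(susp_profile, ngram_profile(src, n)), src)
        for src, sim in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rank_candidates(
    "the cat sat",
    [("a dog ran", 1.0), ("the cat sat", 1.0)],
)
print(ranked[0][1])  # prints the cat sat
```

Because the dissimilarity is unbounded above while a retrieval similarity score is typically small, the fine-grained score can be negative; only the relative order of the candidates matters for selecting the top one.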

6 Evaluation

The plagiarism detection system was evaluated using the evaluation framework described in [2]. The evaluation scores are shown in Table 1.
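
In the framework of [2], the overall plagdet score combines the other three measures: it is the harmonic mean (F1) of precision and recall, divided by log2(1 + granularity). The short check below, with a hypothetical function name, recomputes the overall score from the precision, recall and granularity reported in Table 1.

```python
import math


def plagdet(precision, recall, granularity):
    """Overall score of [2]: F1 of precision and recall, penalized by granularity."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Precision, recall and granularity as reported in Table 1.
score = plagdet(0.0011829, 0.0050052, 2.0028818)
print(f"{score:.7f}")  # prints 0.0012063
```

The recomputed value agrees with the reported plagdet score to seven decimal places. A granularity close to 2 means that each detected case was reported in about two pieces on average, which lowers the overall score by a factor of roughly log2(3).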

Table 1. Evaluation

Measurement   Precision   Recall      Granularity   Plagdet
Score         0.0011829   0.0050052   2.0028818     0.0012063

7 Conclusion and Future Works

The present task is our first attempt at plagiarism detection.
We have tested plagiarism detection at the sentence level, but phrase-level experimentation remains to be investigated. In future, an algorithm has to be developed to test the relevance of the candidate source sentences retrieved by Nutch and to choose the most relevant plagiarized part. The knowledge files for the source documents will also have to be updated.

Acknowledgment

The work has been carried out with support from the Department of Information Technology (DIT), Govt. of India funded project Development of "Cross Lingual Information Access (CLIA)" System Phase II.

References

1. Wikipedia article on Plagiarism: http://en.wikipedia.org/wiki/Plagiarism
2. Potthast M. et al.: An Evaluation Framework for Plagiarism Detection. In Proceedings of the COLING 2010, Beijing, China, August 2010.
3. Yurii Palkovskii, Alexei Belov and Irina Muzika: Exploring Fingerprinting as External Plagiarism Detection Method: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
4. Viviane P. Moreira, Rafael C. Pereira and Galante Renata: UFRGS@PAN2010: Detecting External Plagiarism: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
5. Clara Vania and Mirna Adriani: External Plagiarism Detection Using Passage Similarities: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
6. M. Mozgovoy, T. Kakkonen and E. Sutinen: Using Natural Language Parsers in Plagiarism Detection. In Proceeding of SLaTE'07 Workshop, Pennsylvania, USA, October 2007.
7. Chen, Chien-Ying, Jen-Yuan Yeh and Hao-Ren Ke: Plagiarism Detection using ROUGE and WordNet. Journal of Computing, 2(3), pages 34-44, March 2010. https://sites.google.com/site/journalofcomputing/. ISSN 2151-9617.
8. Cristian Grozea and Marius Popescu: Encoplot - Performance in the Second International Plagiarism Detection Challenge: Lab Report for PAN at CLEF 2010. In Braschler et al. [2]. ISBN 978-88-904810-0-0.
9. Basile et al.: A Plagiarism Detection Procedure in Three Steps: Selection, Matches and "Squares". In Proceedings of the SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN 2009), Donostia-San Sebastian, Spain.
10. Vlado Keselj, Fuchun Peng, Nick Cercone and Calvin Thomas: "N-gram-based Author Profiles for Authorship Attribution". In Proceedings of the PACLING'03, Dalhousie University, Halifax, Nova Scotia, Canada, pp. 255-264, August 2003.
