Error Analysis of Named Entity Recognition in BCCWJ

Masaaki Ichihara¹, Kanako Komiya¹, Tomoya Iwakura², Maiko Yamazaki³
¹ Ibaraki University, ² Fujitsu Laboratories Ltd., ³ Tokyo Institute of Technology
{11t4004s@hcs, kkomiya@mx}.ibaraki.ac.jp, iwakura.tomoya@jp.fujitsu.com, yamazaki@lr.pi.titech.ac.jp

1 Introduction

Named entity recognition is a process by which named entities (NEs) such as the names of persons, locations, and artifacts are extracted.
Most named entity recognition techniques have been studied on news articles; however, their performance on texts from different domains such as blogs, books, and magazines has not yet been evaluated well.
This paper reports an error analysis of KNP on six domains in order to reveal the causes of errors for further improvement of NE recognition.¹

2 Error Analysis of KNP on BCCWJ

The Japanese dependency and case structure analyzer KNP² ([2] and [3]) was used as the named entity recognizer.
The versions we used were KNP Ver. 4.11 and JUMAN Ver. 7.0.
The six genres, "Q&A sites", "white papers", "blogs", "books", "magazines", and "newspaper articles", in the Balanced Corpus of Contemporary Written Japanese (BCCWJ) were used as the target corpora.
One hundred thirty-six texts extracted from BCCWJ, which are available as Class A³, were used for the experiments.
They were manually annotated with the nine kinds of NE that were defined by the Information Retrieval and Extraction Exercise (IREX)⁴.
These NE types are the names of organizations, persons, locations, and artifacts, dates, times, moneys, percents, and optional⁵.
The annotation was done by five members of the NE team of the Project Next NLP and checked by four members of it.

¹ This paper is an English version of (Ichihara et al., 2015) [1] with additional information and some corrections.
² http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KNP
³ http://plata.ar.media.kyoto-u.ac.jp/mori/research/NLR/JDC/ClassA-1.list
⁴ http://nlp.cs.nyu.edu/irex/index-e.html
⁵ KNP does not extract optional tags.

We compared the KNP outputs with the manually annotated texts and analyzed the errors.
Table 1 shows the performance of KNP.
Recall, precision, accuracy, and F-measure are defined in Equations (1) to (4).
"Correct", the numerator of recall, precision, and accuracy, is the number of correct answers of KNP.
"Annotated", the denominator of recall, denotes the number of NEs that were manually annotated.
"KNP outputs", the denominator of precision, denotes the number of NEs that KNP output.
The denominator of accuracy is the logical sum (OR) of "Annotated" and "KNP outputs".
The denominators of recall, precision, and accuracy differ because KNP sometimes fails to extract NEs and sometimes extracts wrong information.
Also, an NE that the system outputs sometimes consists of multiple annotated NEs, as illustrated by the example in Figure 1, and vice versa.
Table 1 shows that the recall is lower than the precision.
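
The four measures defined in this section can be recomputed from the counts reported in Table 1 (1632 correct answers, 2641 annotated NEs, 2182 KNP outputs, and 2816 NEs in their logical sum). The following is a minimal sketch; the function name `ne_metrics` and its parameter names are chosen here for illustration and do not come from KNP or from the evaluation scripts of the study.

```python
def ne_metrics(correct, annotated, knp_outputs, union):
    """Return recall, precision, accuracy and F-measure as fractions.

    `union` is the size of the logical sum (OR) of the annotated NEs
    and the KNP outputs, i.e. the denominator of accuracy.
    """
    recall = correct / annotated
    precision = correct / knp_outputs
    accuracy = correct / union
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, accuracy, f_measure


# Counts reported in Table 1.
recall, precision, accuracy, f_measure = ne_metrics(
    correct=1632, annotated=2641, knp_outputs=2182, union=2816)

for name, value in [("Recall", recall), ("Precision", precision),
                    ("Accuracy", accuracy), ("F-measure", f_measure)]:
    print(f"{name}: {value:.2%}")
```

Because the three ratios share the numerator and differ only in the denominator, the gap between recall and precision directly reflects that there are more annotated NEs than KNP outputs.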
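
Figure 1: An example in which an NE output by KNP includes multiple annotated NEs
    KNP:        PERSON/PERSON
    Annotation: LOCATION/LOCATION LOCATION/LOCATION

\[ \mathrm{Recall} = \frac{\mathrm{Correct}}{\mathrm{Annotated}} \tag{1} \]
\[ \mathrm{Precision} = \frac{\mathrm{Correct}}{\mathrm{KNP\ outputs}} \tag{2} \]
\[ \mathrm{Accuracy} = \frac{\mathrm{Correct}}{\mathrm{Annotated} \cup \mathrm{KNP\ outputs}} \tag{3} \]
\[ \mathrm{F\text{-}measure} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{4} \]

Table 1: Performances of KNP
Performance   Rate     Correct   Denominator
Recall        61.79%   1632      2641
Precision     74.79%   1632      2182
Accuracy      57.95%   1632      2816
F-measure     67.68

The errors were classified into the following five types.
Examples are shown with each description.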

No extraction: The error where KNP did not extract tokens as an NE though they were annotated.
    KNP:
    Annotation: ARTIFACT/ARTIFACT

No annotation: The error where KNP extracted tokens as an NE though they were not annotated.
    KNP:        PERSON/PERSON
    Annotation:

Wrong range: The error where KNP extracted tokens as an NE and only the range was wrong. (The extracted tokens were partially annotated, or they were a part of the annotated tokens.)
    KNP 1:        PERSON/PERSON
    Annotation 1: PERSON/PERSON
    KNP 2:        ORGANIZATION/ORGANIZATION
    Annotation 2: ORGANIZATION/ORGANIZATION

Wrong tag: The error where KNP extracted tokens as an NE and only the tag type was wrong.
    KNP:        PERSON/PERSON
    Annotation: LOCATION/LOCATION

Wrong range and tag: The error where KNP extracted tokens as an NE and both the range and the tag type were wrong.
    KNP:        PERSON/PERSON
    Annotation: LOCATION/LOCATION

Table 2: Summary of errors
Error type            Num    Rate
No extraction         619    52.28%
No annotation         159    13.43%
Wrong range           162    13.68%
Wrong tag             127    10.73%
Wrong range and tag   117     9.88%
All errors            1184   100.00%

Table 2 shows a summary of errors.
These errors were counted over the logical sum (OR) of the annotated NEs and the KNP outputs.
The most frequent error was "No extraction", which accounted for more than half of the total errors.
The second most frequent error was "Wrong range"; in most of these cases the extracted tokens were a part of the annotated tokens.
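
The five error types can be read as a decision procedure over pairs of spans. The sketch below is one possible formalization, not the counting procedure actually used in the study: it assumes that each NE is represented as a character-offset span with a tag, that annotated spans do not overlap each other, and that "the same tokens" means any character overlap. The names `Span`, `overlaps`, `classify_output`, and `no_extraction` are hypothetical.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset (assumed representation)
    end: int    # exclusive character offset
    tag: str    # e.g. "PERSON", "LOCATION"


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def classify_output(knp: Span, annotated: List[Span]) -> str:
    """Label a single KNP output against the annotated spans."""
    hits = [g for g in annotated if overlaps(knp, g)]
    if not hits:
        return "No annotation"
    exact = [g for g in hits if (g.start, g.end) == (knp.start, knp.end)]
    if exact:
        return "Correct" if exact[0].tag == knp.tag else "Wrong tag"
    if any(g.tag == knp.tag for g in hits):
        return "Wrong range"
    return "Wrong range and tag"


def no_extraction(annotated: List[Span], outputs: List[Span]) -> List[Span]:
    """Annotated NEs that no KNP output overlaps."""
    return [g for g in annotated if not any(overlaps(g, k) for k in outputs)]


# The situation of Figure 1: one PERSON output covers two LOCATION annotations.
annotated = [Span(0, 2, "LOCATION"), Span(2, 4, "LOCATION"),
             Span(10, 13, "ARTIFACT")]
outputs = [Span(0, 4, "PERSON"), Span(20, 22, "PERSON")]

labels = [classify_output(k, annotated) for k in outputs]
missed = no_extraction(annotated, outputs)
print(labels)
print(missed)
```

In this toy input the first output is labeled "Wrong range and tag", the second "No annotation", and the ARTIFACT annotation is a "No extraction" error. Note that one output can overlap several annotations, which is why counts taken per output and per annotation need not agree.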

Table 3 shows a summary of errors by types of NEs.
These errors were also counted over the logical sum (OR) of the annotated NEs and the KNP outputs.
"Correct" and "Error" are the numbers of correct answers and errors of KNP.
"Total" is the sum of "Correct" and "Error".
"No extraction" and "Errors with extraction" in the table are the numbers of "No extraction" errors and of errors other than "No extraction", respectively.
"No extraction rate" is the ratio of "No extraction" to "Error".

Table 3 shows that the no extraction rates of "ARTIFACT", "PERCENT", "TIME", and "OPTIONAL" are especially high.
At the same time, there are only a small number of "PERCENT" and "TIME" NEs in the corpora.
Therefore, "ARTIFACT" is the main reason why the no extraction rate over all tags is high.
The no extraction rate of "OPTIONAL" is 100% because KNP does not extract OPTIONALs, which is another reason.
Table 3 also shows that most "TIME", "MONEY", and "PERCENT" NEs were tagged correctly by KNP when they were extracted.
Most of the errors with extraction are those of "ORGANIZATION", "PERSON", and "LOCATION".
The errors of "ARTIFACT" and "DATE" together account for less than 30% of all errors with extraction.

Table 4 shows the accuracies and the rates of no extraction in "Total" according to the tag type.
"Accuracy" is the ratio of correct answers to "Total", the sum of correct answers and errors of KNP, and "No extraction/Total" is the ratio of no extraction to it.
These errors were also counted over the logical sum (OR) of the annotated NEs and the KNP outputs.
Table 4 shows that the accuracy of "ARTIFACT" is particularly low compared with the other tags.
The same table shows that its ratio of no extraction in "Total" is also high.
Therefore, we can see that "No extraction" of "ARTIFACT" is the biggest cause of the errors of KNP and the main reason for the low recall.

Table 3: Summary of errors by types of NEs
Tag            Correct   Error   Total   No extraction   Errors with extraction   No extraction rate
ARTIFACT       90        259     349     192             67                       74.13%
DATE           343       145     488     62              83                       42.76%
LOCATION       409       226     635     72              154                      31.86%
MONEY          88        4       92      2               2                        50.00%
ORGANIZATION   236       200     436     77              123                      38.50%
PERCENT        79        12      91      10              2                        83.33%
PERSON         364       222     586     88              134                      39.64%
TIME           23        9       32      9               0                        100.00%
OPTIONAL       0         107     107     107             0                        100.00%
All Tags       1632      1184    2816    619             565                      52.28%

Table 4: Accuracies and rates of no extraction in "Total" according to the tag type
Tag            Accuracy   No extraction/Total
ARTIFACT       25.79%     55.01%
DATE           70.29%     12.70%
LOCATION       64.41%     11.34%
MONEY          95.65%     2.17%
ORGANIZATION   54.13%     17.66%
PERCENT        86.81%     10.99%
PERSON         62.12%     15.02%
TIME           71.88%     28.13%
OPTIONAL       0.00%      100.00%
All Tags       57.95%     21.98%
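
The ratios in Tables 3 and 4 all derive from three counts per tag: "Correct", "Error", and "No extraction". As a rough illustration of how they relate, the sketch below recomputes them from the counts in Table 3 and additionally computes each tag's share of all "No extraction" errors, a quantity that is not reported in the tables. The names `TABLE3`, `tag_rates`, and `share` are introduced here for illustration only.

```python
# (correct, error, no_extraction) per tag, copied from Table 3.
TABLE3 = {
    "ARTIFACT": (90, 259, 192),
    "DATE": (343, 145, 62),
    "LOCATION": (409, 226, 72),
    "MONEY": (88, 4, 2),
    "ORGANIZATION": (236, 200, 77),
    "PERCENT": (79, 12, 10),
    "PERSON": (364, 222, 88),
    "TIME": (23, 9, 9),
    "OPTIONAL": (0, 107, 107),
}


def tag_rates(correct, error, no_extraction):
    """Derive the ratio columns of Tables 3 and 4 from raw counts."""
    total = correct + error
    return {
        "accuracy": correct / total,                     # Table 4
        "no_extraction_rate": no_extraction / error,     # Table 3
        "no_extraction_per_total": no_extraction / total,  # Table 4
    }


rates = {tag: tag_rates(*counts) for tag, counts in TABLE3.items()}

# Share of all "No extraction" errors contributed by each tag
# (an illustrative quantity, not a column of the tables).
all_missed = sum(counts[2] for counts in TABLE3.values())
share = {tag: counts[2] / all_missed for tag, counts in TABLE3.items()}

for tag in sorted(share, key=share.get, reverse=True):
    print(f"{tag:13s} accuracy={rates[tag]['accuracy']:.1%} "
          f"share of no extraction={share[tag]:.1%}")
```

Sorting by this share puts "ARTIFACT" first, which is consistent with the observation that it is the largest single source of "No extraction" errors even though "TIME" and "OPTIONAL" have higher no extraction rates.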

3 Error Analysis of "No Extraction"

The target corpora we used consisted of six genres in BCCWJ: "Q&A sites", "white papers", "blogs", "books", "magazines", and "newspaper articles".
Table 5 shows a summary of errors by genres of texts.
These errors, except "No extraction", are those that KNP output.
"Correct" and "Error" are the numbers of correct answers and errors of KNP.
"Total" is the sum of "Correct" and "Error".
"No extraction" and "Errors with extraction" in the table are the numbers of "No extraction" errors and of errors other than "No extraction", respectively.
"No extraction rate" is the ratio of "No extraction" to "Error".
"Docs" is the number of documents of the genre.
The total number of errors (1169) and the total number of errors with extraction (550) are different from those in Tables 2 and 3 (1184 and 565).
This is because some NEs that KNP output include multiple annotated NEs.

Table 6: Accuracies and rates of no extraction in "Total" according to the genre
Genre         Accuracy   No extraction/Total
Q&A           40.00%     44.21%
White paper   58.73%     20.63%
Blog          50.74%     27.89%
Book          50.35%     28.07%
Magazine      53.45%     14.66%
Newspaper     72.27%     15.49%
All           58.26%     22.10%

In addition, the number of words varies according to the genre.
We think this is a reason why the total number of NEs was not proportional to the number of documents.
Table 5 shows that the genre with the highest no extraction rate was "Q&A sites" and the genre with the lowest rate was "magazines".

Table 6 shows the accuracies and the rates of no extraction in "Total" according to the genre.
"Accuracy" is the ratio of correct answers to "Total", the sum of correct answers and errors of KNP, and "No extraction/Total" is the ratio of no extraction to it.
These errors, except "No extraction", are those that KNP output.
"Accuracy" of "All" (58.26%) is different from "Recall" in Table 1 (61.79%) because the number of NEs that KNP output was different from the number of NEs that were annotated by humans.
Table 6 shows that "newspaper articles" is the genre with the highest accuracy.
We think this is because KNP was trained with newspaper articles of MAINICHI SHIMBUN.
Table 6 also shows that the genre with the lowest accuracy was "Q&A sites".
We think this is because the writing style of Q&A sites was the most different from that of newspaper articles.
The same table shows that the genre whose rate of no extraction in "Total" was the highest was "Q&A sites" and the genre with the lowest rate was "magazines".

Table 5: Summary of errors by genres of texts
Genre         Correct   Error   Total   No extraction   Errors with extraction   No extraction rate   Docs
Q&A           76        114     190     84              30                       73.68%               74
White paper   427       300     727     150             150                      50.00%               8
Blog          171       166     337     94              72                       56.63%               34
Book          217       214     431     121             93                       56.54%               5
Magazine      186       162     348     51              111                      31.48%               2
Newspaper     555       213     768     119             94                       55.87%               13
All Genres    1632      1169    2801    619             550                      52.95%               136
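
The per-genre counts in Table 5 can be cross-checked against the per-tag totals in Tables 2 and 3. The sketch below sums the genre rows, compares the sums with the per-tag totals (1184 errors, 565 errors with extraction), and ranks the genres by accuracy. It only re-derives numbers already given in the tables; the variable names are illustrative.

```python
# (correct, error, no_extraction, errors_with_extraction, docs) per genre,
# copied from Table 5.
TABLE5 = {
    "Q&A": (76, 114, 84, 30, 74),
    "White paper": (427, 300, 150, 150, 8),
    "Blog": (171, 166, 94, 72, 34),
    "Book": (217, 214, 121, 93, 5),
    "Magazine": (186, 162, 51, 111, 2),
    "Newspaper": (555, 213, 119, 94, 13),
}

# Column-wise sums should reproduce the "All Genres" row.
correct, error, no_extraction, with_extraction, docs = (
    sum(column) for column in zip(*TABLE5.values()))

# Totals of the per-tag counting in Tables 2 and 3.
TAG_ERRORS, TAG_ERRORS_WITH_EXTRACTION = 1184, 565
gap_all = TAG_ERRORS - error
gap_extracted = TAG_ERRORS_WITH_EXTRACTION - with_extraction

# Accuracy per genre, as in Table 6: Correct / (Correct + Error).
accuracy = {genre: c / (c + e) for genre, (c, e, *_rest) in TABLE5.items()}
best = max(accuracy, key=accuracy.get)
worst = min(accuracy, key=accuracy.get)

print(gap_all, gap_extracted, best, worst)
```

Both gaps are 15 and the "No extraction" count is 619 under either counting, so the whole difference between the two totals lies in errors with extraction, which is consistent with the explanation that some KNP outputs span multiple annotated NEs.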

3.1 No Extraction of Q&A Sites

"Q&A sites" was the genre whose accuracy was the lowest.
Examples of no extraction errors in "Q&A sites" are as follows.

i. Many names of products, characters, and medicines were not extracted.
   (Sakura Wars) (Super Nintendo Entertainment System) (ActRaiser) 4 (Resident Evil 4) (Kamen Rider) (Ultraman) (Gundam) (Minostacin) (Aspirin)
ii. Abbreviations were not extracted. Formal names are noted in brackets.
   (Mario World) (Super Mario World), GC ((Nintendo GameCube)), JNB ((Japan Net Bank)), LA ((Los Angeles))
iii. Unusual date expressions were not extracted.
   (90/11/21)
iv. Hiragana expressions were sometimes wrongly parsed.
   "(Satoshi)" in "(CHIEBUKURER Satoshi)" should be the name of a person, but it is wrongly parsed as "(Satoru)": a verb.
v. NEs written in alphabets and numbers were not extracted.
   "(JR East)" were extracted.

3.2 No Extraction of Newspaper Articles

"Newspaper articles" was the genre whose accuracy was the highest.
Examples of no extraction errors in "newspaper articles" are as follows.

i. Some NEs with specific prefixes and suffixes were not extracted.
   (half **, ex. half time) (** region, ex. (capital region) (three major metropolitan areas)) (** area) (** point) (same **, ex. (same ** year) (same day) (same year autumn))
ii. OPTIONALs were not extracted because KNP does not extract optional tags.
iii. Unusual English expressions in Japanese sentences were not extracted.
   KOERAJAPAN
iv. Brackets sometimes cause errors.
   (Phoenix (Arizona, US))
v. NEs that consist of general nouns were not extracted. This could be the reason why the names of products and characters were not extracted.
   (Hirune, a nap) (Zaurus) (FamilyMart) (Sharp) (The Renaissance)

"Softbank" could sometimes be extracted and sometimes could not.
The occurrences were parsed as nominative case when they were extracted and as "in clause" when they were not.

4 Discussion

According to the examples described above, we think that the lack of knowledge in the dictionary and the errors of the parser are the main reasons for the errors of named entity recognition.
In particular, the names of artifacts, including the names of products or characters, are often newly coined words.
These NEs are not in the dictionary that KNP uses; therefore, whether they are NEs has to be judged from the features of the surrounding patterns and the syntactic features.
As a result, correct parsing is important for NEs that cannot use dictionary information.
However, a casual writing style like that of Q&A sites causes errors in morphological analysis and parsing.
We think that if sentences in these informal writing styles could be correctly analyzed and parsed, the errors would decrease.
Training on texts with informal writing styles could be a solution to this problem.
In addition, most of the NEs that were not extracted by KNP were found in Wikipedia or on other Web sites.
This information could also help improve the recall.
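
One simple way such external information might be used is a gazetteer lookup that proposes candidate NEs which the recognizer missed. The sketch below is purely illustrative and is not a method evaluated in this paper: the gazetteer contents, the function name `gazetteer_matches`, and the greedy longest-match strategy are all assumptions, and English glosses from Section 3.1 stand in for the original Japanese strings. Matching is character-based, so it does not depend on whitespace tokenization.

```python
def gazetteer_matches(text, gazetteer):
    """Greedy longest-match lookup of gazetteer entries in `text`.

    Returns (start, end, tag) triples with an exclusive end offset.
    """
    max_len = max(map(len, gazetteer), default=0)
    matches, i = [], 0
    while i < len(text):
        # Try the longest candidate first so that "Super Mario World"
        # wins over its substring "Mario World".
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in gazetteer:
                matches.append((i, i + length, gazetteer[candidate]))
                i += length
                break
        else:
            i += 1
    return matches


# Hypothetical gazetteer, e.g. built from Wikipedia article titles.
GAZETTEER = {
    "Sakura Wars": "ARTIFACT",
    "Super Mario World": "ARTIFACT",
    "Mario World": "ARTIFACT",
    "Aspirin": "ARTIFACT",
}

text = "I played Super Mario World and Sakura Wars."
matches = gazetteer_matches(text, GAZETTEER)
print([(text[s:e], tag) for s, e, tag in matches])
```

A lookup of this kind would only supply candidates. As the "Softbank" and general-noun examples in Section 3.2 suggest, deciding whether a matched string is actually used as an NE still depends on the context.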

5 Conclusion

This paper reports an error analysis of the named entity recognizer KNP on six domains in order to reveal the causes of errors.
The texts of BCCWJ were manually annotated and compared with the automatically tagged texts.
The analysis revealed that the most frequent error was "No extraction": the case where tokens were not extracted by KNP though they were annotated.
It also revealed that "No extraction" of "ARTIFACT" is the biggest cause of the low recall and that "Q&A sites" is the genre with the lowest accuracy.
We focused on the no extraction errors and found that the lack of dictionary information and the variety of writing styles cause these errors.

Acknowledgements

This work was partially supported by JSPS KAKENHI Grant Number 24700138.
We would like to thank Dr. Ryohei Sasano, who provided us with helpful information about KNP, and the team members of the NE team of Project Next NLP.

References

[1] Masaaki Ichihara, Maiko Yamazaki, and Kanako Komiya. Error analysis of named entity extraction in BCCWJ (BCCWJ). 7, p. to appear, 2015.
[2] Ryohei Sasano and Sadao Kurohashi. Japanese named entity recognition using non-local information (in Japanese). IPSJ Journal, Vol. 49, No. 11, pp. 3765–3776, 2008.
[3] KNP. , 19, pp. 110–113, 2013.