Zero-shot keyword spotting for visual speech recognition in-the-wild

Themos Stafylakis [0000-0002-9227-3588] and Georgios Tzimiropoulos [0000-0002-1803-5338]

Computer Vision Laboratory, University of Nottingham, U.K.
{themos.stafylakis, yorgos.tzimiropoulos}@nottingham.ac.uk

Abstract.
Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a real-world, practical setting which so far has received no attention by the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Different to prior works on KWS, which try to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database, for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), while it drastically improves over other recently proposed ASR-free KWS methods.
Keywords: Visual keyword spotting · visual speech recognition · zero-shot learning

1 Introduction

This paper addresses the problem of visual-only Automatic Speech Recognition (ASR), i.e. the problem of recognizing speech from video information only, in particular, from analyzing the spatiotemporal visual patterns induced by the mouth and lips movement.
Visual ASR is a challenging research problem, with decent results being reported only recently thanks to the advent of Deep Learning and the collection of large and challenging datasets [1–3]. In particular, we focus on the problem of Keyword Spotting (KWS), i.e. the problem of finding occurrences of a text query among a set of recordings. In this work we consider only words; however, the same architecture can be used for short phrases. Although the problem can be approached with standard ASR methods, recent works aim to address it with more direct and "ASR-free" methods [4]. Moreover, such KWS approaches are in line with a recently emerged research direction in ASR (typically termed Acoustics-to-Word) where words are replacing phonemes, triphones or letters as basic recognition units [5,6].
Motivation. One of the main problems regarding the use of words as basic recognition units is the existence of Out-Of-Vocabulary (OOV) words, i.e. words for which the exact phonetic transcription is unknown, as well as words with very few or zero occurrences in the training set. This problem is far more exacerbated in the visual domain, where collecting, annotating and distributing large datasets for fully supervised visual speech recognition is a very tedious process. To the best of our knowledge, this paper is the first attempt towards visual KWS under the zero-shot setting.
Relation to zero-shot learning. Our approach shares certain similarities with zero-shot learning methods, e.g. for recognizing objects in images without training examples of the particular objects [7]. Different to [7], where representations of the objects encode semantic relationships, we wish to learn word representations that encode merely their phonetic content. To this end, we propose to use a grapheme-to-phoneme (G2P) encoder-decoder model which learns how to map words (i.e. sequences of graphemes, or simply letters) to their pronunciation (i.e. to sequences of phonemes)^1. By training the G2P model using a training set of such pairs (i.e. words and their pronunciation), we obtain a fixed-length representation (embedding) for any word, including words not appearing in the phonetic dictionary or in the visual speech training set.
The proposed system receives as input a video and a keyword and estimates whether the keyword is contained in the video. We use the LRS2 database to train a Recurrent Neural Network (Bidirectional Long Short-Term Memory, BiLSTM) that learns non-linear correlations between visual features and their corresponding keyword representation [8]. The backend of the network models the probability that the video contains the keyword and provides an estimate of its position in the video sequence. The proposed system is trained end-to-end, without information about the keyword boundaries, and once trained it can spot any keyword, even those not included in the LRS2 training set.
In summary, our contributions are:
– We are the first to study Query-by-Text visual KWS for words unseen during training.
– We devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a G2P model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks that learn how to correlate visual features with the keyword representation.
– We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database.

^1 For example, the phonetic transcription of the word "finish" is "F IH1 N IH0 SH", where the numerical values after the vowel "IH" indicate different levels of stretching.
2 Related Work

Visual ASR. During the past few years, the interest in visual and audiovisual ASR has been revived. Research in the field is largely influenced by recent advances in audio-only ASR, as well as by the state-of-the-art in computer vision, mostly for extracting visual features. In [9], CNN features are combined with Gated Recurrent Units (GRUs) in an end-to-end visual ASR architecture, capable of performing sentence-level visual ASR on a relatively easy dataset (GRID [10]). Similarly to several recent end-to-end audio-based ASR approaches, CTC is deployed in order to circumvent the lack of temporal alignment between frames and annotation files [11,12]. In [1,13], the "Listen, attend and spell" ([14]) audio-only ASR architecture is adapted to the audiovisual domain, and tested on recently released in-the-wild audiovisual datasets. The architecture is an attentive encoder-decoder model with the decoder operating directly on letters (i.e. graphemes) rather than on phonemes or visemes (i.e. the visual analogues of phonemes [15]). It deploys a VGG for extracting visual features, and the audio and visual modalities are fused in the decoder. The model yields state-of-the-art results in audiovisual ASR. Other recent advances in visual and audiovisual ASR involve residual LSTMs, adversarial domain-adaptation methods, use of self-attention layers (i.e. Transformer [16]), combinations of CTC and attention, gating neural networks, as well as novel fusion approaches [17–24].
Words as recognition units. The general tendency in deep learning towards end-to-end architectures, together with the challenge of simplifying the fairly complex traditional ASR paradigm, has resulted in a new research direction of using words directly as recognition units. In [25], an acoustic deep architecture is introduced, which models words by projecting them onto a continuous embedding space. In this embedding space, words that sound alike are nearby in the Euclidean sense, differentiating it from other word embedding spaces where distances correspond to syntactic and semantic relations [26,27]. In [5,6], two CTC-based ASR architectures are introduced, where CTC maps acoustic features directly to words. The experiments show that CTC word models can outperform state-of-the-art baselines that make use of context-dependent triphones as recognition units, phonetic dictionaries and language models. In the problem of audio-based KWS, end-to-end word-based approaches have also emerged. In [28], the authors introduce a KWS system based on sequence training, composed of a CNN for acoustic modeling and an aggregation stage, which aggregates the frame-level scores into a sequence-level score for words. However, the system is limited to words seen during training, since it merely associates each word with a label (i.e. one-hot vector) without considering them as sequences of characters. Other recent works aim to spot specific keywords used to activate voice assistant systems [29–31]. The application of BiLSTMs to KWS was first proposed in [32]. The architecture is capable of spotting at least a limited set of keywords, having a softmax output layer with as many output units as keywords, and a CTC loss for training.
More recently, the authors in [4] propose an audio-only KWS system capable of generalizing to unseen words, using a CNN/RNN to autoencode sequences of graphemes (corresponding to words or short phrases) into fixed-length representation vectors. The extracted representations, together with audio-feature representations extracted with an acoustic autoencoder, are passed to a feed-forward neural network which is trained to predict whether the keyword occurs in the utterance or not. Although this audio-only approach shares certain conceptual similarities with ours, the implementations are different in several ways. Our approach deploys a grapheme-to-phoneme model to learn keyword representations, it does not make use of autoencoders for extracting representations of visual sequences, and more importantly it learns how to correlate visual information with keywords from low-level visual features rather than from video-level representations. The authors in [33] recently proposed a visual KWS approach using words as recognition units. They deploy the same ResNet feature extractor as us (proposed by our team in [34,35] and trained on LRW [2]) and they demonstrate the capacity of their network in spotting occurrences of the Nw = 500 words in LRW [36]. The bottleneck of their method is the word representation (each word corresponds to a label, without considering words as sequences of graphemes). Such an unstructured word representation may perform well on closed-set word identification/detection tasks, but prevents the method from generalizing to words unseen during training.
Zero-shot learning. Analogies can be drawn between KWS with unseen words and zero-shot learning for detecting new classes, such as objects or animals. KWS with unseen words is essentially a zero-shot learning problem, where attributes (letters) are shared between classes (words) so that the knowledge learned from seen classes is transferred to unseen ones [37]. Moreover, similarly to a typical zero-shot learning training set-up where bounding boxes of the objects of interest are not given, a KWS training algorithm knows only whether or not a particular word is uttered in a given training video, without having information about the exact time interval. For these reasons, zero-shot learning methods which e.g. learn mappings from an image feature space to a semantic space ([38,39]) are pertinent to our method. Finally, recent methods in action recognition using a representation vector to encode e.g. 3D human-skeleton sequences also exhibit certain similarities with our method [40].
3 Proposed Method

3.1 System overview

Our system is composed of four different modules. The first module is a visual feature extractor, which receives as input the image frame sequence (assuming a face detector has already been applied, as in LRS2) and outputs features. A spatiotemporal Residual Network is used for this purpose, which has shown remarkable performance in word-level visual ASR [34,35]. The second module of the architecture receives as input a user-defined keyword (or more generally a text query) and outputs a fixed-length representation of the keyword in $\mathbb{R}^{d_e}$. This mapping is learned by a grapheme-to-phoneme (G2P [41]) model, which is a sequence-to-sequence neural network with two RNNs playing the roles of encoder and decoder (similarly to [42]). The two RNNs interact with each other via the last hidden state of the encoder, which is used by the decoder in order to initialize its own hidden state. We claim that this representation is a good choice for extracting word representations, since (a) it contains information about the word's pronunciation without requiring the phonetic transcription during evaluation, and (b) it generalizes to words unseen during training, provided that the G2P is trained with a sufficiently large vocabulary. The third module is where the visual features are combined with the keyword representation and non-linear correlations between them are learned. It is implemented by a stack of bidirectional LSTMs, which receives as input the sequence of feature vectors and concatenates each such vector with the word representation vector. Finally, the fourth module is the backend classifier and localizer, whose aims are (a) to estimate whether or not the query occurs in the video, and (b) to provide us with an estimate of its position in the video. Note that we do not train the network with information about the time intervals in which keywords occur. The only supervision used during training is a binary label indicating whether or not the keyword occurs in the video, together with the grapheme and phoneme sequences of the keyword. The basic building blocks of the model are depicted in Fig. 1.
Fig. 1: The block-diagram of the proposed KWS system.
3.2 Modeling visual patterns using spatiotemporal ResNet

The front-end of the network is an 18-layer Residual Network (ResNet), which has shown very good performance on LRW [34][43] as well as on LRS2 [20]. It has been verified that CNN features encoding spatiotemporal information in their first layers yield much better performance in lipreading, even when combined with deep LSTMs or GRUs in the backend [34,9,13]. For this reason, we replace the first 2D convolutional, batch-normalization and max-pooling layers of the ResNet with their 3D counterparts. The temporal size of the kernel is set equal to $T_r = 5$, and therefore each ResNet feature is extracted over a window of 0.2 s (assuming 25 fps). The temporal stride is equal to 1, since any reduction of time resolution is undesired at this stage. Finally, the average pooling layer of the ResNet output (found e.g. in ImageNet versions of ResNet [43]) is replaced with a fully connected layer. Overall, the spatiotemporal ResNet implements a function $x_t = f_r([I_{t-2}, I_{t-1}, I_t, I_{t+1}, I_{t+2}], W_r)$, where $W_r$ denotes the parameters of the ResNet and $I_t$ the (grayscale and cropped) frames at time $t$.
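For illustration, the following PyTorch sketch shows one way such a front-end could be assembled: a 3D convolutional stem (temporal kernel size 5, temporal stride 1) followed by a 2D ResNet-18 trunk applied per frame, with the global average pooling replaced by a fully connected layer. This is not the authors' released code; the output feature dimension, the exact layer sizes and the single-channel grayscale input are assumptions based on the description above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SpatiotemporalFrontend(nn.Module):
    """Sketch of the 3D front-end: a 3D conv stem (Tr=5, temporal stride 1)
    followed by the residual stages of a 2D ResNet-18 applied per frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # 3D stem replacing the 2D conv / batch-norm / max-pool of ResNet-18
        self.stem = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        trunk = resnet18()
        # keep the residual stages, drop the original stem and classifier head
        self.trunk = nn.Sequential(trunk.layer1, trunk.layer2, trunk.layer3, trunk.layer4)
        # replace global average pooling with a fully connected layer
        self.fc = nn.Linear(512 * 4 * 4, feat_dim)

    def forward(self, frames):               # frames: (B, 1, T, 112, 112), grayscale
        z = self.stem(frames)                # (B, 64, T, 28, 28)
        B, C, T, H, W = z.shape
        z = z.transpose(1, 2).reshape(B * T, C, H, W)
        z = self.trunk(z)                    # (B*T, 512, 4, 4)
        x = self.fc(z.flatten(1))            # one feature vector per frame
        return x.view(B, T, -1)              # (B, T, feat_dim)
```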
We use a model pretrained on LRW, which we fine-tune on the pretrain set of LRS2 using closed-set word identification. The pretrain set of LRS2 is useful for this purpose, not merely due to the large number of utterances it contains, but also due to its more detailed annotation files, which contain information about the (estimated) time each word begins and ends. Word boundaries permit us to excerpt fixed-duration video segments containing specific words and essentially mimic the LRW set-up. To this end, we select the 2000 most frequently appearing words containing at least 4 phonemes, and we extract frame sequences of 1.5 s duration, having the target word in the center. The backend is a 2-layer LSTM (jointly pretrained on LRW) which we remove once the training is completed.
Preprocessing. The frames in LRS2 are already cropped according to the bounding box extracted by the face detector and tracker [1,2]. We crop the frames further with a fixed set of coefficients $C_{crop} = [15, 46, 145, 125]$, we resize them to 122×122, and we finally feed the ResNet with frames of size 112×112, after applying random cropping in training (for data augmentation) and fixed central cropping in testing, as in [34].
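A minimal torchvision sketch of this preprocessing is given below. The interpretation of the crop coefficients as a (left, upper, right, lower) box is an assumption; the paper only lists the values.

```python
from torchvision import transforms

# Fixed crop coefficients from the paper; their ordering as a PIL crop box
# (left, upper, right, lower) is an assumption.
C_CROP = (15, 46, 145, 125)

train_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Lambda(lambda im: im.crop(C_CROP)),   # further fixed crop of the detector box
    transforms.Resize((122, 122)),
    transforms.RandomCrop(112),                       # random cropping for data augmentation
    transforms.ToTensor(),
])

test_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Lambda(lambda im: im.crop(C_CROP)),
    transforms.Resize((122, 122)),
    transforms.CenterCrop(112),                       # fixed central cropping in testing
    transforms.ToTensor(),
])
```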
3.3 Grapheme-to-phoneme models for encoding keywords

Grapheme-to-phoneme (G2P) models are extensively used in speech technologies in order to learn a mapping $G \rightarrow P$ from sequences of graphemes $G \in \mathcal{G}$ to sequences of phonemes $P \in \mathcal{P}$. Such models are typically trained in a supervised fashion, using a phonetic dictionary, such as the CMU dictionary (for English). The number of different phonemes in the CMU dictionary is equal to $N_{phn} = 69$, with each vowel contributing more than one phoneme, due to the variable level of stretching. The effectiveness of a G2P model is measured by its generalizability, i.e. by its capacity in estimating the correct pronunciation(s) of words unseen during training. Sequence-to-sequence neural networks have recently shown their strength in addressing this problem [41]. In a sequence-to-sequence G2P model, both sequences are typically modeled by an RNN, such as an LSTM or a GRU. The first RNN is a function $r = f_e(G, W_e)$, parametrized by $W_e$, which encodes the grapheme sequence into a fixed-size representation $r|G$, where $r \in \mathbb{R}^{d_r}$, while the second RNN estimates the phoneme sequence $\hat{P} = f_d(r, W_d)$. The representation vector is typically defined as the output of the last step, i.e. once the RNN has seen the whole grapheme sequence.
Our implementation of the G2P involves two unidirectional LSTMs with hidden size equal to $d_l = 64$. Similarly to sequence-to-sequence models for machine translation (e.g. [42]), the encoder receives as input the (reversed) sequence of graphemes, and the decoder receives the cell state $c_{e,T}$ and the output $h_{e,T}$ from the encoder (corresponding to the last time step $t = T$) to initialize its own state, denoted by $c_{d,0}$ and $h_{d,0}$. To extract the word representation $r$, we first concatenate the two vectors, we then project them to $\mathbb{R}^{d_r}$ to obtain $r$, and finally we re-project them back to $\mathbb{R}^{2 d_l}$, i.e. $[c_{e,T}^{T}, h_{e,T}^{T}]^{T} \rightarrow r \rightarrow [c_{d,0}^{T}, h_{d,0}^{T}]^{T}$, where $x^{T}$ denotes the transpose of $x$. For the projections we use two linear layers with square matrices (since $d_r = 2 d_l$), while biases are omitted for a more compact notation.
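A minimal PyTorch sketch of such a G2P encoder-decoder is shown below, assuming teacher forcing on the phoneme side. The vocabulary sizes and the embedding dimension are assumptions; only the hidden size ($d_l = 64$), the reversed grapheme input and the projection scheme follow the description above.

```python
import torch
import torch.nn as nn

class G2P(nn.Module):
    """Sketch of the grapheme-to-phoneme encoder-decoder (hidden size d_l = 64)."""
    def __init__(self, n_graphemes, n_phonemes, d_l=64, d_emb=64):
        super().__init__()
        self.d_r = 2 * d_l                                  # word representation size, d_r = 2*d_l
        self.g_emb = nn.Embedding(n_graphemes, d_emb)
        self.p_emb = nn.Embedding(n_phonemes, d_emb)
        self.encoder = nn.LSTM(d_emb, d_l, batch_first=True)
        self.decoder = nn.LSTM(d_emb, d_l, batch_first=True)
        self.to_r = nn.Linear(self.d_r, self.d_r, bias=False)    # [c; h] -> r (square, no bias)
        self.from_r = nn.Linear(self.d_r, self.d_r, bias=False)  # r -> decoder initial state
        self.out = nn.Linear(d_l, n_phonemes)

    def encode(self, graphemes):
        # the encoder reads the grapheme sequence in reverse order
        rev = torch.flip(graphemes, dims=[1])
        _, (h, c) = self.encoder(self.g_emb(rev))            # h, c: (1, B, d_l)
        return self.to_r(torch.cat([c[-1], h[-1]], dim=-1))  # r: (B, d_r)

    def forward(self, graphemes, phonemes_in):
        r = self.encode(graphemes)
        c0, h0 = self.from_r(r).chunk(2, dim=-1)             # re-project r to (c_{d,0}, h_{d,0})
        out, _ = self.decoder(self.p_emb(phonemes_in),
                              (h0.unsqueeze(0).contiguous(), c0.unsqueeze(0).contiguous()))
        return self.out(out), r                              # phoneme logits and word embedding r
```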
The G2P model is trained by minimizing the cross-entropy (CE) between the true phoneme $P_t$ and the posterior probability $P(P_t \mid G)$, averaged across time steps, i.e.

$$L_w(P, G) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{CE}\big(P_t, P(P_t \mid G)\big). \qquad (1)$$

Since the G2P model is trained with back-propagation, its loss function can be added as an auxiliary loss to the primary KWS loss function, and the overall architecture can be trained jointly. Joint training is highly desired, as it enforces the encoder to learn representations that are optimal not merely for decoding, but for our primary task, too. During evaluation, the mapping $G \rightarrow r$ learned by the encoder is all that is required, and therefore the decoder $f_d(\cdot, W_d)$ and the true pronunciation $P$ are not required for KWS.
3.4 Stack of BiLSTMs, binary classifier and loss function

The backend of the model receives the sequence of visual features $X = \{x_t\}_{t=1}^{T}$ of a video and the word representation vector $r$, and estimates whether the keyword is uttered by the speaker.

Capturing correlations with BiLSTMs. LSTMs have exceptional capacity in modeling long-term correlations between input vectors, as well as correlations between different entries of the input vectors, due to the expressive power of their gating mechanism which controls the memory cell and the output [44]. We use two bidirectional LSTMs (BiLSTMs), with the first BiLSTM merely applying a transformation of the feature sequence $X \rightarrow Y$, i.e.

$$[\overrightarrow{h}_t, \overrightarrow{c}_t], [\overleftarrow{h}_t, \overleftarrow{c}_t] = \left[\overrightarrow{f}_{l0}(x_t, \overrightarrow{h}_{t-1}, \overrightarrow{c}_{t-1}),\ \overleftarrow{f}_{l0}(x_t, \overleftarrow{h}_{t+1}, \overleftarrow{c}_{t+1})\right] \qquad (2)$$

and

$$y_t = W_{l0}^{T}\left[\overrightarrow{h}_t^{T}, \overleftarrow{h}_t^{T}\right]^{T} \qquad (3)$$

where $W_{l0}$ is a linear layer of size $(2 d_v, d_v)$, $\overrightarrow{f}_{l0}$ and $\overleftarrow{f}_{l0}$ are functions corresponding to the forward and backward LSTM models (the dependence on their parameters is kept implicit), while $d_v = 256$. The input vectors $X$ are batch-normalized, and dropout with $p = 0.2$ is applied by repeating the same dropout mask for all feature vectors of the same sequence [45]. The output vectors of the first BiLSTM are concatenated with the word representation vector to obtain $y^{+}_t = [y_t^{T}, r^{T}]^{T}$. After applying batch-normalization to $y^{+}_t$, we pass them as input to the second BiLSTM, with equations defined as above, resulting in a sequence of output vectors denoted by $Z = \{z_t\}_{t=1}^{T}$, where $z_t \in \mathbb{R}^{d_v}$. Note the equivalence between the proposed frame-level concatenation and keyword-based model adaptation: we may consider $r$ as a means to adapt the biases of the linear layers in the three gates and the input to the cell, in such a way that the activations of its neurons fire only over subsequences in $Z$ that correspond to the keyword encoded in $r$.
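The sketch below illustrates this BiLSTM stack with frame-level concatenation in PyTorch. The hidden size of the second BiLSTM is halved so that $z_t$ stays in $\mathbb{R}^{d_v}$, and plain (non-variational) dropout is used; both choices are assumptions where the text does not pin the detail down.

```python
import torch
import torch.nn as nn

class BiLSTMStack(nn.Module):
    """Sketch of the two-BiLSTM backbone with frame-level concatenation of r
    (d_v = 256 as in the paper, d_r = 2*d_l = 128)."""
    def __init__(self, d_in=256, d_v=256, d_r=128, p_drop=0.2):
        super().__init__()
        self.bn_in = nn.BatchNorm1d(d_in)
        self.drop = nn.Dropout(p_drop)        # the paper repeats one mask across time
        self.lstm1 = nn.LSTM(d_in, d_v, bidirectional=True, batch_first=True)
        self.proj1 = nn.Linear(2 * d_v, d_v)  # W_l0 of size (2*d_v, d_v), eq. (3)
        self.bn_mid = nn.BatchNorm1d(d_v + d_r)
        self.lstm2 = nn.LSTM(d_v + d_r, d_v // 2, bidirectional=True, batch_first=True)

    def forward(self, x, r):                  # x: (B, T, d_in), r: (B, d_r)
        x = self.bn_in(x.transpose(1, 2)).transpose(1, 2)
        y, _ = self.lstm1(self.drop(x))       # (B, T, 2*d_v)
        y = self.proj1(y)                     # (B, T, d_v)
        # concatenate the keyword embedding with every frame-level feature: y_t^+ = [y_t; r]
        r_tiled = r.unsqueeze(1).expand(-1, y.size(1), -1)
        y_plus = torch.cat([y, r_tiled], dim=-1)
        y_plus = self.bn_mid(y_plus.transpose(1, 2)).transpose(1, 2)
        z, _ = self.lstm2(y_plus)             # (B, T, d_v)
        return z
```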
Feed-Forward Classifier for network initialization. For the first few epochs, we use a simple feed-forward classifier, which we subsequently replace with the BiLSTM backend discussed below. The outputs of the BiLSTM stack are projected to a linear layer $(d_v, d_v/2)$ and are passed to a non-linearity (Leaky Rectified Linear Units, denoted by LReLU) to filter out those entries with negative values, followed by a summation operator to aggregate over the temporal dimension, i.e. $v = \sum_{t=1}^{T} \mathrm{LReLU}(W^{T} z_t)$. After applying dropout to $v$ we project it to a linear layer $(d_v/2, d_v/4)$ and we again apply an LReLU layer. Finally, we apply a linear layer to drop the size from $d_v/4$ to 1 and a sigmoid layer, with which we model the posterior probability that the video contains the keyword or not, i.e. $P(l \mid \{I_t\}_{t=1}^{T}, G)$, where $l \in \{0, 1\}$ is the binary indicator variable and its true value serves as the training label.
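A sketch of this initialization backend is given below; it returns the pre-sigmoid logit, so a sigmoid (or a logit-based binary cross-entropy) is applied on top.

```python
import torch
import torch.nn as nn

class FeedForwardClassifier(nn.Module):
    """Sketch of the feed-forward backend used for the first epochs (d_v = 256 assumed)."""
    def __init__(self, d_v=256, p_drop=0.2):
        super().__init__()
        self.proj = nn.Linear(d_v, d_v // 2)
        self.lrelu = nn.LeakyReLU(0.01)
        self.drop = nn.Dropout(p_drop)
        self.fc1 = nn.Linear(d_v // 2, d_v // 4)
        self.fc2 = nn.Linear(d_v // 4, 1)

    def forward(self, z):                        # z: (B, T, d_v) from the BiLSTM stack
        v = self.lrelu(self.proj(z)).sum(dim=1)  # v = sum_t LReLU(W^T z_t), temporal aggregation
        v = self.lrelu(self.fc1(self.drop(v)))
        return self.fc2(v).squeeze(-1)           # logit; sigmoid gives P(l=1 | video, G)
```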
BiLSTM Classifier and keyword localization. Once the network with the Feed-Forward Classifier is trained, we replace it with a BiLSTM classifier. The latter does not aggregate over the temporal dimension, because it aims to jointly (a) estimate the posterior probability that the video contains the keyword, and (b) locate the time step at which the keyword occurs. Recall that the network is trained without information about the actual time intervals in which the keyword occurs. Nevertheless, an approximate position of the keyword can still be estimated, even from the output of the BiLSTM stack. As Fig. 2 shows, the average activation of the input of the BiLSTM classifier (after applying the linear layer and ReLU) exhibits a peak, typically within the keyword boundaries.

Fig. 2: Localization of the keyword "about" in the phrase "Everyone has gone home happy and that's what it's all about". The keyword boundaries are depicted with two vertical lines over the log-spectrogram.

The BiLSTM Classifier aims to model this property, by applying $\max(\cdot)$ and $\operatorname{argmax}(\cdot)$ in order to estimate the posterior that the keyword occurs and to localize the keyword, respectively. More analytically, the BiLSTM Classifier receives the output features of the BiLSTM stack and passes them to a linear layer $W$ of size $(d_v, d_s)$, where $d_s = 16$, and to an LReLU, i.e. $s_t = \mathrm{LReLU}(W^{T} z_t)$. The BiLSTM is then applied on the sequence, followed by a linear layer (which drops the dimension from $2 d_s$ to 1, i.e. a vector $w$ and a bias $b$), the $\max(\cdot)$ and finally the sigmoid $\sigma(\cdot)$ from which we estimate the posterior. More formally, $H = \mathrm{BiLSTM}(S)$, $y_t = w^{T} h_t + b$, $p = \sigma(\max(y))$ and $t^{*} = \operatorname{argmax}(y)$, where $S = \{s_t\}_{t=1}^{T}$, $H = \{h_t\}_{t=1}^{T}$, $y = \{y_t\}_{t=1}^{T}$, $p = P(l = 1 \mid \{I_t\}_{t=1}^{T}, G)$ (i.e. the posterior that the keyword defined by $G$ occurs in the frame sequence $\{I_t\}_{t=1}^{T}$), and $t^{*}$ is the time step where the maximum occurs, which should lie somewhere within the actual keyword boundaries. Note that we did not succeed in training the network with the BiLSTM Classifier from scratch, probably due to the $\max(\cdot)$ operator.
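A compact sketch of this final backend follows; the LReLU slope is an assumption.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Sketch of the BiLSTM backend: per-frame scores y_t, posterior p = sigmoid(max_t y_t),
    keyword location t* = argmax_t y_t (d_v = 256, d_s = 16 as in the paper)."""
    def __init__(self, d_v=256, d_s=16):
        super().__init__()
        self.proj = nn.Linear(d_v, d_s)
        self.lrelu = nn.LeakyReLU(0.01)
        self.bilstm = nn.LSTM(d_s, d_s, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * d_s, 1)   # the vector w and bias b

    def forward(self, z):                    # z: (B, T, d_v) from the BiLSTM stack
        s = self.lrelu(self.proj(z))         # s_t = LReLU(W^T z_t)
        h, _ = self.bilstm(s)
        y = self.score(h).squeeze(-1)        # (B, T) frame-level scores
        logit, t_star = y.max(dim=1)         # max(.) and argmax(.) over time
        return torch.sigmoid(logit), t_star  # posterior p and estimated location t*
```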
Loss for joint training. The primary loss is defined as

$$L_v\big(l, \{I_t\}_{t=1}^{T}, G\big) = \mathrm{CE}\big(l, P(l \mid \{I_t\}_{t=1}^{T}, G)\big), \qquad (4)$$

while the whole model is trained jointly by minimizing a weighted summation of the primary and auxiliary losses, i.e.

$$L\big([l, P], \{I_t\}_{t=1}^{T}, G\big) = L_v\big(l, \{I_t\}_{t=1}^{T}, G\big) + \alpha_w L_w(P, G), \qquad (5)$$

where $\alpha_w$ is a scalar for balancing the two losses. It is worth noting that the representation vectors $r$ and the encoder's parameters receive gradients from both loss functions, via the decoder of the G2P model and the LSTM backend. Contrarily, the decoder and the binary classifier receive gradients only from $L_w(\cdot,\cdot)$ and $L_v(\cdot,\cdot)$, respectively.
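The combined objective of eqs. (1), (4) and (5) can be written compactly as below; tensor shapes are assumptions, and the detection score is taken pre-sigmoid so that a logit-based binary cross-entropy can be used.

```python
import torch
import torch.nn.functional as F

def kws_loss(video_logit, label, phoneme_logits, phoneme_targets, alpha_w=0.1):
    """Weighted sum of the primary KWS loss and the auxiliary G2P loss, eq. (5).
    video_logit: (B,) pre-sigmoid detection scores, label: (B,) binary indicators,
    phoneme_logits: (B, T, n_phonemes), phoneme_targets: (B, T) integer phoneme ids."""
    # primary loss: cross-entropy between the binary label and P(l | frames, G), eq. (4)
    loss_v = F.binary_cross_entropy_with_logits(video_logit, label.float())
    # auxiliary loss: per-step cross-entropy of the G2P decoder, averaged over time, eq. (1)
    loss_w = F.cross_entropy(phoneme_logits.transpose(1, 2), phoneme_targets)
    return loss_v + alpha_w * loss_w
```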
4 Training the model

In this section we describe our recipe for training the model. We explain how we partition the data, how we create minibatches, and we give details about the optimization parameters.
1LRS2andCMUDictionarypartitionsWeusetheocialpartitionoftheLRS2intopretrain,train,validationandtestset.
TheKWSnetworkistrainedonpretrainandtrainsets.
Thepretrainsetisalsousedtone-tunetheResNet,aswediscussinSection3.
2.
TheG2PmodelistrainedfromscratchandjointlywiththewholeKWSnetwork.
LRS2containsabout145KvideosofspokensentencesfromBBCTV(96Kinpretrain,46Kintrain,1082invalidation,and1243intestset).
Thenumberofframespervideointhetestsetvariesbetween15and145.
In terms of keywords, we randomly partition the CMU phonetic dictionary into train, validation and test words (corresponding to 0.75, 0.05 and 0.20 of the words, respectively), while words with fewer than $n_p = 4$ phonemes are removed. Finally, we add to the test set of the dictionary those words we initially assigned to the training and validation sets that do not occur in the LRS2 pretrain or train sets, since they are not used in any way during training.
4.2 Minibatches, training sets and backend

Minibatches for training the KWS model should contain both positive and negative examples, i.e. pairs of videos and keywords, where each pair is considered positive when the video contains the corresponding keyword and negative otherwise. Epochs and minibatches are defined based on the videos, i.e. each epoch contains all videos of the train and pretrain sets of LRS2, partitioned into minibatches. The list of keywords in each minibatch is created from all words occurring in the minibatch that belong to the training set of the CMU dictionary and have at least $n_p$ phonemes. In each minibatch, each video is paired with (a) all its keywords (positive pairs) and (b) an equal number of other randomly chosen keywords from the list (negative pairs). This way we ensure that each video has an equal number of positive and negative examples. At each epoch we shuffle the videos in order to create new negative pairs. By feeding the algorithm with the same set of videos and keywords under different binary labels in each minibatch, we enforce it to capture the correlations between videos and words, instead of attempting to correlate the binary label with certain keywords or with irrelevant aspects of specific videos.
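The pairing procedure could be sketched as follows; the data structures (a list of (video_id, set_of_words) tuples and a word-to-phonemes dictionary) are assumptions made purely for illustration.

```python
import random

def build_pairs(batch_videos, train_keywords, min_phonemes=6):
    """Sketch of positive/negative (video, keyword) pair creation within one minibatch.
    batch_videos: list of (video_id, set_of_words); train_keywords: dict word -> phonemes."""
    # keyword list of the minibatch: training-dictionary words with enough phonemes
    pool = {w for _, words in batch_videos for w in words
            if w in train_keywords and len(train_keywords[w]) >= min_phonemes}
    pairs = []
    for vid, words in batch_videos:
        positives = [w for w in words if w in pool]
        candidates = sorted(pool - set(words))
        negatives = random.sample(candidates, min(len(positives), len(candidates)))
        pairs += [(vid, w, 1) for w in positives] + [(vid, w, 0) for w in negatives]
    return pairs
```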
For the first 20 epochs we use (a) only the train set of LRS2 (because it contains shorter utterances and much fewer labeling errors compared to the pretrain set), (b) $n_p = 4$ and $\alpha_w = 1.0$ (i.e. minimum number of phonemes and weight of the auxiliary loss, respectively), and (c) the simple feed-forward backend. After the 20th epoch, (a) we add the pretrain set, (b) we set $n_p = 6$ and $\alpha_w = 0.1$, and (c) we replace the backend with the BiLSTM-based one (all network parameters but those of the backend are kept frozen during the 21st epoch).
4.3 Optimization

The loss function in eq. (5) is optimized with backpropagation using the Adam optimizer [46]. The number of epochs is 100, the initial learning rate is $2 \cdot 10^{-3}$, and we drop it by a factor of 2 every 20 epochs. The best model is chosen based on the performance on the validation set. The implementation is based on PyTorch, and the code together with pretrained models and ResNet features will be released soon. The number of videos in each minibatch is 40; however, as explained in Section 4.2, we create multiple training examples per video (equal to twice the number of training keywords it contains). Finally, the ResNet is optimized with the configuration suggested in [34].
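In PyTorch, this optimization schedule could be set up roughly as follows (a sketch only; `model` stands for the full KWS network):

```python
import torch

def configure_optimization(model):
    """Adam with initial learning rate 2e-3, halved every 20 epochs, for 100 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
    return optimizer, scheduler

# for epoch in range(100):
#     ...train one epoch, evaluate on validation, keep the best checkpoint...
#     scheduler.step()
```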
5 Experiments

We present here the experimental set-up, the metrics we use, and the results we obtain using the proposed KWS model. Moreover, we report baseline results using (a) a visual ASR model with a hybrid CTC/attention architecture, and (b) an implementation of the ASR-free KWS method recently proposed in [4].
5.1 Evaluation metrics and keyword selection

KWS is essentially a detection problem, and in such problems the optimal threshold is application-dependent, typically determined by the desired balance between the false alarm rate (FAR) and the missed detection rate (MDR). Our primary error metric is the Equal Error Rate (EER), defined as the FAR (or MDR) when the threshold is set so that the two rates are equal. We also report MDR for certain low values of FAR (and vice versa) as well as FAR vs. MDR curves. Apart from EER, FAR and MDR, we evaluate the performance based on ranking measures. More specifically, for each text query (i.e. keyword) we report the percentage of times the score of a video containing the query is within the Top-N scores, where $N \in \{1, 2, 4, 8\}$. Since a query $q$ may occur in more than one video, a positive pair with score $s_{q,v'}$ is considered as Top-N if the number of negative pairs associated with the given query $q$ with score higher than $s_{q,v'}$ is less than $N$, i.e. if $|\{(q, v) \mid l_{q,v} = 0,\ s_{q,v} > s_{q,v'}\}| < N$. The number of positive pairs exceeds the number of queries $N_q$, because some keywords appear in more than one video.
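The ranking measure could be computed as in the sketch below, which assumes flat arrays of pair scores, binary labels and query identifiers.

```python
import numpy as np

def top_n_rate(scores, labels, queries, n=1):
    """Sketch of the Top-N measure: a positive (query, video) pair counts as Top-N if
    fewer than n negative pairs of the same query score higher."""
    scores, labels, queries = map(np.asarray, (scores, labels, queries))
    hits, total = 0, 0
    for q in np.unique(queries):
        mask = queries == q
        neg_scores = scores[mask & (labels == 0)]
        for s in scores[mask & (labels == 1)]:
            total += 1
            hits += int((neg_scores > s).sum() < n)
    return hits / max(total, 1)
```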
5.2 Baseline and proposed networks

CTC/Attention Hybrid ASR model. We present here our baseline obtained with an ASR-based model. We use the same ResNet features but a deeper (4-layer) and wider (320-unit) BiLSTM. The implementation is based on the open-source ESPnet Python toolkit presented in [47], using the hybrid CTC/attention character-level network introduced in [48]. The system is trained on the pretrain and train sets of LRS2, while for training the language model we also use the Librispeech corpus [49]. The network attains WER = 71.4% on the LRS2 test set. In decoding, we use the single-step decoder beam search (proposed in [48]) with $|\mathcal{H}| = 40$ decoding hypotheses $h \in \mathcal{H}$. Similarly to [50], instead of searching for the keyword only in the best decoding hypothesis, we approximate the posterior probability that a keyword $q$ occurs in the video $v$ with feature sequence $X$ as follows:

$$P(l = 1 \mid q, X) = \sum_{h \in \mathcal{H}} \mathbb{1}[q \in h]\, P(h \mid X), \qquad (6)$$

$$P(h \mid X) \approx \frac{\exp(s_h / c)}{\sum_{h' \in \mathcal{H}} \exp(s_{h'} / c)}, \qquad (7)$$

where $\mathbb{1}[q \in h]$ is the indicator function that the decoding hypothesis $h$ contains $q$, $s_h$ is the score (log-likelihood) of hypothesis $h$ (combining CTC and attention [48]), and $c = 5.0$ is a fudge factor optimized on the validation set.
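The scoring rule of eqs. (6)-(7) amounts to a temperature-scaled softmax over the N-best list; a minimal sketch is given below, where treating "contains q" as word-level membership in the hypothesis text is an assumption.

```python
import numpy as np

def keyword_posterior(keyword, hypotheses, scores, c=5.0):
    """Sketch of eqs. (6)-(7): approximate P(keyword occurs | video) from the N-best
    decoding hypotheses and their log-likelihood scores (fudge factor c)."""
    scores = np.asarray(scores, dtype=float) / c
    probs = np.exp(scores - scores.max())            # softmax over hypotheses, eq. (7)
    probs /= probs.sum()
    contains = np.array([keyword in hyp.split() for hyp in hypotheses], dtype=float)
    return float((contains * probs).sum())           # eq. (6)
```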
Baseline with video embeddings. We implement an ASR-free method that is very close to [4], proposed for audio-based KWS. Different to [4], we use our LSTM-based encoder-decoder instead of the proposed CNN-based one. A video embedding is extracted from the whole utterance, concatenated with the word representation, and fed to a feed-forward binary classifier as in [4]. This network is useful in order to emphasize the effectiveness of our frame-level concatenation.
Proposed Network and alternative Encoder-Decoder losses. To assess the effectiveness of the proposed G2P training method, we examine 3 alternative strategies: (a) The encoder receives gradients merely from the decoder, which is equivalent to training a G2P network separately, using only the words appearing in the training set. (b) The network has no decoder, auxiliary loss or phoneme-based supervision, i.e. the encoder is trained by minimizing the primary loss only. (c) A Grapheme-to-Grapheme (G2G) network is used instead of a G2P. The advantage of this approach over G2P is that it does not require a pronunciation dictionary, i.e. it requires less supervision. The advantage over the second approach is the use of the auxiliary loss (over graphemes instead of phonemes), which acts as a regularizer.
5.3 Experimental Results on LRS2

Our first set of results, based on the detection metrics, is given in Table 1. We observe that all variants of the proposed network attain much better performance compared to video embeddings. Clearly, video-level representations cannot retain the fine-grained information required to spot individual words. Our best network is the proposed Joint-G2P network (i.e. the KWS network jointly trained with the G2P), while the degradation of the network when graphemes are used as targets in the auxiliary loss (Joint-G2G) underlines the benefits from using phonetic supervision. Nevertheless, the degradation is relatively small, showing that the proposed architecture is capable of learning basic pronunciation rules even without phonetic supervision. Finally, the variant without a decoder during training is inferior to all other variants (including Joint-G2G), showing the regularization capacity of the decoder. The FAR-MDR tradeoff curves are depicted in Fig. 3(a), obtained by shifting the decision threshold which we apply to the output of the network. The curves show that the proposed architecture with G2P and joint training is superior to all others examined, in all operating points. Finally, we omit results obtained with the ASR-based model, as the scoring rule described in eq. (6)-(7) is inadequate for measuring EER. The model yields very low FAR (≈0.2%) at the cost of very high MDR (≈63%) at all reasonable operating points.
Table 1: Equal Error, False Alarm and Missed Detection Rates

Network         | EER    | MDR @ FAR=5% | MDR @ FAR=1% | FAR @ MDR=5% | FAR @ MDR=1%
Video Embed.    | 32.09% | 77.32%       | 92.67%       | 66.76%       | 83.57%
Prop. w/o Dec.  |  8.46% | 14.09%       | 40.32%       | 14.25%       | 36.43%
Prop. G2P-only  |  7.22% | 10.88%       | 29.21%       | 10.85%       | 30.99%
Prop. Joint-G2G |  7.26% | 10.08%       | 27.38%       | 10.51%       | 40.26%
Prop. Joint-G2P |  6.46% |  8.93%       | 26.00%       |  8.48%       | 20.11%

Fig. 3: FAR-MDR tradeoff. (a) Comparison between configurations of the proposed network. (b) The effect of the minimum number of phonemes per keyword and of the camera view on the performance attained by Joint-G2P. As expected, longer keywords and near frontal (NF) view yield better results.

Length of keywords and camera view. We are also interested in examining the extent to which the length of the keyword affects the performance. To this end, we increase the minimum number of phonemes from $n_p = 6$ to 7 and 8. Moreover, we evaluate the network only on those videos labeled as Near-Frontal (NF) view, by removing those labeled as Multi-View (the labeling is given in the annotation files of LRS2). The results are plotted in Fig. 3(b). As expected, the longer the keywords, the lower the error rates. Moreover, the performance is better when only NF views are considered.
Ranking measures and localization accuracy. We measure here the percentage of times videos containing the query are in the top-N scores. The results are given in Table 2. As we observe, our best system scores a Top-1 rate equal to 34.14%, meaning that in about 1 out of 3 queries, the video containing the query is ranked first amongst the $N_{test} = 1243$ videos. Moreover, in 2 out of 3 queries the video containing the query is amongst the Top-8. The other training strategies perform well, too, especially the one where the encoder is trained merely with the auxiliary loss (G2P-only). The ranking measures attained by the Video-Embedding method are very poor, so we omit them. The ASR-based system attains a relatively high Top-1 score, however the rest of its scores are rather poor. We should emphasize, though, that other ASR-based KWS methods exist for approximating the posterior of a keyword occurrence, e.g. using explicit keyword lattices [51], instead of using the set of decoding hypotheses $\mathcal{H}$ created by the beam search in eq. (6)-(7). Finally, we report the localization accuracy for all versions of the proposed network, defined as the percentage of times the estimated location $t^{*}$ lies within the keyword boundaries (±2 frames). The reference word boundaries are estimated by applying forced alignment between the audio and the actual text. We observe that although the algorithm is trained without any information about the location of the keywords, it can still provide a very precise estimate of the location of the keyword in the vast majority of cases.
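For completeness, the localization accuracy as defined above could be computed as in the short sketch below; the input formats are assumptions.

```python
def localization_accuracy(t_hats, boundaries, tolerance=2):
    """Sketch: fraction of positive pairs whose estimated location t* lies within the
    forced-alignment keyword boundaries, extended by +/- tolerance frames.
    t_hats: list of estimated frame indices; boundaries: list of (start, end) frames."""
    correct = sum(start - tolerance <= t <= end + tolerance
                  for t, (start, end) in zip(t_hats, boundaries))
    return correct / max(len(t_hats), 1)
```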
Table 2: Ranking results showing the rate by which the video sequence containing the keyword is amongst the top-N scores. Localization accuracy is also provided.

Network         | Top-1  | Top-2  | Top-4  | Top-8  | Local. Acc.
ASR-based       | 24.51% | 31.39% | 33.51% | 37.57% | -
Prop. w/o Dec.  | 23.71% | 33.68% | 43.99% | 55.90% | 96.20%
Prop. G2P-only  | 34.14% | 46.28% | 57.16% | 65.75% | 97.39%
Prop. Joint-G2G | 31.16% | 43.07% | 54.98% | 65.75% | 97.86%
Prop. Joint-G2P | 34.14% | 46.96% | 57.04% | 67.70% | 96.67%

6 Conclusions

We proposed an architecture for visual-only KWS with text queries.
Rather than using subword units (e.g. phonemes, visemes) as main recognition units, we followed the direction of modeling words directly. Contrary to other word-based approaches, which treat words merely as classes defined by a label (e.g. [35]), we inject into the model a word representation extracted by a grapheme-to-phoneme model. This zero-shot learning approach enables the model to learn nonlinear correlations between visual frames and word representations, and to transfer its knowledge to words unseen during training. The experiments showed that the proposed method is capable of attaining very promising results on the most challenging publicly available dataset (LRS2), outperforming the two baselines by a large margin. Finally, we demonstrated its capacity in localizing the keyword in the frame sequence, even though we do not use any information about the location of the keyword during training.
7 Acknowledgements

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 706668 (Talking Heads). We are grateful to Dr. Stavros Petridis and Mr. Pingchuan Ma (i-bug, Imperial College London) for their contribution to the ASR-based experiments.
References

1. Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild. In: Computer Vision and Pattern Recognition (CVPR) (2017)
2. Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Asian Conference on Computer Vision (ACCV), Springer (2016) 87–103
3. Anina, I., Zhou, Z., Zhao, G., Pietikäinen, M.: OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis. In: Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, Volume 1, IEEE (2015) 1–5
4. Audhkhasi, K., Rosenberg, A., Sethy, A., Ramabhadran, B., Kingsbury, B.: End-to-end ASR-free keyword search from speech. IEEE Journal of Selected Topics in Signal Processing 11(8) (2017) 1351–1359
5. Audhkhasi, K., Ramabhadran, B., Saon, G., Picheny, M., Nahamoo, D.: Direct acoustics-to-word models for English conversational speech recognition. In: Interspeech (2017)
6. Soltau, H., Liao, H., Sak, H.: Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition. In: Interspeech (2017)
7. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-modal transfer. In: Advances in Neural Information Processing Systems (NIPS) (2013)
8. Chung, J.S., Zisserman, A.: Lip reading sentences in the wild (link to LRS2). http://www.robots.ox.ac.uk/~vgg/data/lip_reading_sentences/
9. Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N.: LipNet: Sentence-level lipreading. arXiv preprint arXiv:1611.01599 (2016)
10. Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120(5) (2006) 2421–2424
11. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: International Conference on Machine Learning (2014) 1764–1772
12. Zweig, G., Yu, C., Droppo, J., Stolcke, A.: Advances in all-neural speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2017) 4805–4809
13. Chung, J.S., Zisserman, A.: Lip reading in profile. In: British Machine Vision Conference (BMVC) (2017)
14. Chan, W., Jaitly, N., Le, Q., Vinyals, O.: Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016) 4960–4964
15. Bear, H.L., Harvey, R.: Decoding visemes: improving machine lip-reading. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016) 2009–2013
16. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS) (2017) 5998–6008
17. Koumparoulis, A., Potamianos, G., Mroueh, Y., Rennie, S.J.: Exploring ROI size in deep learning based lipreading. In: AVSP (2017)
18. Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., Pantic, M.: End-to-end audiovisual speech recognition. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018)
19. Wand, M., Schmidhuber, J.: Improving speaker-independent lipreading with domain-adversarial training. In: Interspeech (2017)
20. Afouras, T., Chung, J.S., Zisserman, A.: Deep lip reading: a comparison of models and an online application. arXiv preprint arXiv:1806.06053 (2018)
21. Xu, K., Li, D., Cassimatis, N., Wang, X.: LCANet: End-to-end lipreading with cascaded attention-CTC. In: 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), IEEE (2018) 548–555
22. Sterpu, G., Saam, C., Harte, N.: Can DNNs learn to lipread full sentences? arXiv preprint arXiv:1805.11685 (2018)
23. Tao, F., Busso, C.: Gating neural network for large vocabulary audiovisual speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 26(7) (2018) 1286–1298
24. Mroueh, Y., Marcheret, E., Goel, V.: Deep multimodal learning for audio-visual speech recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2015) 2130–2134
25. Bengio, S., Heigold, G.: Word embeddings for speech recognition. In: Interspeech (2014)
26. Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014) 1532–1543
27. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (NIPS) (2013) 3111–3119
28. Palaz, D., Synnaeve, G., Collobert, R.: Jointly learning to locate and classify words using convolutional networks. In: Interspeech (2016) 2741–2745
29. Sun, M., Snyder, D., Gao, Y., Nagaraja, V., Rodehorst, M., Panchapagesan, S., Ström, N., Matsoukas, S., Vitaladevuni, S.: Compressed time delay neural network for small-footprint keyword spotting. In: Interspeech (2017) 3607–3611
30. Sun, M., Nagaraja, V., Hoffmeister, B., Vitaladevuni, S.: Model shrinking for embedded keyword spotting. In: IEEE 14th International Conference on Machine Learning and Applications (ICMLA), IEEE (2015) 369–374
31. Chen, G., Parada, C., Heigold, G.: Small-footprint keyword spotting using deep neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014) 4087–4091
32. Fernandez, S., Graves, A., Schmidhuber, J.: An application of recurrent neural networks to discriminative keyword spotting. In: International Conference on Artificial Neural Networks, Springer (2007) 220–229
33. Jha, A., Namboodiri, V.P., Jawahar, C.: Word spotting in silent lip videos. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE (2018) 150–159
34. Stafylakis, T., Tzimiropoulos, G.: Combining residual networks with LSTMs for lipreading. In: Interspeech (2017)
35. Stafylakis, T., Tzimiropoulos, G.: Deep word embeddings for visual speech recognition. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018)
36. Chung, J.S., Zisserman, A.: Lip reading in the wild (link to LRW). http://www.robots.ox.ac.uk/~vgg/data/lip_reading/
37. Xian, Y., Lampert, C.H., Schiele, B., Akata, Z.: Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600 (2017)
38. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: DeViSE: A deep visual-semantic embedding model. In: Advances in Neural Information Processing Systems (NIPS) (2013) 2121–2129
39. Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(7) (2016) 1425–1438
40. Mahasseni, B., Todorovic, S.: Regularizing long short term memory with 3D human-skeleton sequences for action recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
41. Yao, K., Zweig, G.: Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. arXiv preprint arXiv:1506.00196 (2015)
42. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems (NIPS) (2014) 3104–3112
43. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision, Springer (2016) 630–645
44. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (1997) 1735–1780
45. Gal, Y., Ghahramani, Z.: A theoretically grounded application of dropout in recurrent neural networks. In: Advances in Neural Information Processing Systems (NIPS) (2016) 1019–1027
46. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2014)
47. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N.E.Y., Heymann, J., Wiesner, M., Chen, N., et al.: ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015 (2018)
48. Watanabe, S., Hori, T., Kim, S., Hershey, J.R., Hayashi, T.: Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11(8) (2017) 1240–1253
49. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus based on public domain audio books. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2015) 5206–5210
50. Miller, D.R., Kleber, M., Kao, C.L., Kimball, O., Colthurst, T., Lowe, S.A., Schwartz, R.M., Gish, H.: Rapid and accurate spoken term detection. In: Eighth Annual Conference of the International Speech Communication Association (2007)
51. Zhuang, Y., Chang, X., Qian, Y., Yu, K.: Unrestricted vocabulary keyword spotting using LSTM-CTC. In: Interspeech (2016) 938–942