Bayesian Ying-Yang System and Theory as A Unified Statistical Learning Approach: (III) Models and Algorithms for Dependence Reduction, Data Dimension Reduction, ICA and Supervised Learning

Lei Xu
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, NT, Hong Kong, China
Fax 852 2603 5024, Email lxu@cs.cuhk.hk, http://www.cse.cuhk.edu.hk/lxu/

Invited paper, in K.M. Wong, I. King and D.Y. Yeung, eds., Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective (Hong Kong International Workshop TANC'97), Springer, pp. 43-60.
Abstract. This paper is a sister paper of [1] published in this same proceeding, for further interpreting the Bayesian Ying-Yang (BYY) learning system and theory through its uses in developing models and algorithms for dependence reduction, independent component analysis, data dimension reduction, supervised classification and regression with three-layer nets, mixtures-of-experts, and radial basis function nets. Readers are referred to [14, 1] for the details of the BYY learning system and theory. In addition, the relation of the BYY learning system and theory to a number of existing learning models and theories has been discussed in [14].
1 BKYY Dependence Reduction System and Theory

In many applications, we want to implement an unsupervised mapping from an observation x into y = [y^(1), ..., y^(k)]^T such that the dependence among the components of y is reduced as much as possible. This aim is also regarded as a basic principle of a brain perception system formed via unsupervised learning [3]. This process is studied in the literature under several names, such as Dependence Reduction, Factorial Learning, Independent Component Analysis (ICA), and Factorial Encoding. These names are very closely related, although their detailed meanings are slightly different. We consider that Dependence Reduction may be a more general name, because it covers the meanings of the other three names and also may be more appropriate for those efforts that attempt to reduce dependence but finally may or may not really reach independence. Due to limited space, we omit mentioning the quite large volume of publications related to these topics.

(Footnote: Supported by HK RGC Earmarked Grants CUHK 250/94E and CUHK 339/96E and by the Ho Sin-Hang Education Endowment Fund for Project HSH 95/02. The basic ideas of the BYY learning system and theory in my previous papers started three years ago, the first year of my return to HK. As HK was in transition to China, this work was in transition to its current shape. The preliminary versions of this paper and its sister papers [14, 1] were all completed in the first month after HK returned to China, and thus I formally returned to my motherland as well. I would like to use this work, an effort on the harmony of an ancient Chinese philosophy and modern western science, as a memory of this historic event.)

In [14], it has been shown that we can apply the BYY learning system and theory to the special case of y = [y^(1), ..., y^(k)]^T with y^(j) being binary to obtain a so-called BYY Factorial Encoding (FE) system and theory. Here, we consider the general case that y = [y^(1), ..., y^(k)]^T is either binary or real, such that not only is the previously proposed BYY FE system and theory further refined and improved, but also a general Bayesian Ying-Yang Dependence Reduction (BYY-DR) system and theory is suggested.

Generally speaking, we can design p_{M_y}(y) to be an independent parametric density or a free density:

p_{M_y}(y) = { p(y|θ_k) = ∏_{j=1}^k p(y^(j)|θ_j), with p(y^(j)|θ_j) = Σ_{r=1}^{k_{y_j}} α_{rj} p(y^(j)|θ_{rj});  or  p(y) = ∏_{j=1}^k p(y^(j)), with p(y^(j)) free,   (1)

where α_{rj} > 0, Σ_{r=1}^{k_{y_j}} α_{rj} = 1, and θ_k = {θ_j}_{j=1}^k.
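To make the parametric branch of eq. (1) concrete, the following minimal sketch (an illustration, not from the paper; all function and variable names are assumptions of this sketch) evaluates p_{M_y}(y) when each component density is itself a finite mixture of univariate Gaussians.

```python
import numpy as np

def gauss(y, mu, var):
    """Univariate Gaussian density N(y; mu, var)."""
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def p_my(y, alphas, mus, vars_):
    """Factorized density of eq. (1): prod_j sum_r alpha_rj N(y_j; mu_rj, var_rj).

    y      : (k,) vector
    alphas : list of (k_yj,) mixing weights per component j (each sums to 1)
    mus, vars_ : lists of (k_yj,) means and variances per component j
    """
    p = 1.0
    for j, y_j in enumerate(y):
        p *= np.sum(alphas[j] * gauss(y_j, mus[j], vars_[j]))
    return p

# toy usage: k = 2 components, each modeled by a 2-component Gaussian mixture
alphas = [np.array([0.5, 0.5]), np.array([0.3, 0.7])]
mus    = [np.array([-1.0, 1.0]), np.array([0.0, 2.0])]
vars_  = [np.array([1.0, 1.0]), np.array([0.5, 0.5])]
print(p_my(np.array([0.2, 1.5]), alphas, mus, vars_))
```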
We put this p_{M_y}(y) into eq. [1.4], eq. [1.5] and eq. [1.6] to get the BYY-DR, BCYY-DR and BKYY-DR systems and theories as special cases of the BYY, BCYY and BKYY learning systems and theories, respectively. (In this paper, we frequently refer to the equations in paper [1], which is published in this volume too; for convenience, we simply use eq. [1.n] to denote eq. (n) in [1].) In this paper, we concentrate on the BKYY-DR system and theory only. Generally speaking, it is just a direct application of the above eq. (1) in eq. [1.6] with the general implementation technique given in Sec. 3 in [1]. From Sec. 3 in [1], we further get the detailed algorithms and criteria in Tab. 1 with the following three architectures:

Forward:      p_{M_{x|y}}(x|y) = p(x|y) free,        p_{M_{y|x}}(y|x) = p(y|x, θ_{y|x});
Backward:     p_{M_{y|x}}(y|x) = p(y|x) free,        p_{M_{x|y}}(x|y) = p(x|y, θ_{x|y});
Bi-direction: p_{M_{y|x}}(y|x) = p(y|x, θ_{y|x}),    p_{M_{x|y}}(x|y) = p(x|y, θ_{x|y});   (2)

where a density is implemented directly by a physical computing device if it is a parametric model, or indirectly by other components if it is free.
In Tab. 1, Part A is obtained from eq. [1.6] and eq. [1.17] for the general implementation. Part B provides the relation to some existing methods and will be further discussed in the next section. The adaptive algorithm given in Part C is obtained according to the general implementation technique given in Sec. 3 in [1], with p_r(y) = p_{M_y}(y). The model selection criteria are given in Part D.
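The criteria in Part D are used in the same way throughout this paper: learn at each candidate k and keep the k that minimizes the criterion. A minimal sketch of this outer loop follows (illustrative only; `train_bkyy_dr` and `criterion_J1` are hypothetical placeholders for the Part A/Part C learning and the Part D formulas):

```python
def select_k(x_samples, k_candidates, train_bkyy_dr, criterion_J1):
    """Scan candidate hidden dimensions k and pick the minimizer of J1(k) (Part D, Tab. 1)."""
    best_k, best_J = None, float("inf")
    for k in k_candidates:
        model = train_bkyy_dr(x_samples, k)   # Part A (batch) or Part C (adaptive) learning
        J = criterion_J1(model, x_samples)    # J1(k) or J2(k) from eqs. [1.13 & 14]
        if J < best_J:
            best_k, best_J = k, J
    return best_k
```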
2 BKYY DR for ICA and Blind Source Separation

We consider the following noisy post-nonlinear instantaneous mixture, in which the d-dimensional observation x comes from k independent sources s^(1), ..., s^(k) via

x = g(Ay) + e,  E y = 0,  E e = 0,  E[y e^T] = 0,  g(Ay) = [g_1(x̂_1), ..., g_d(x̂_d)],  x̂ = Ay,   (3)

with x being wide-sense stationary and ergodic and e being noise.
Table 1. BKYY-DR System and Theory (Forward / Bi-direction / Backward)

For KL(M_1, M_2) given in eq. [1.6], p_{M_y}(y) is given by eq. (1), and p_{M_x}(x) is fixed at a nonparametric estimate based on D_x = {x_i}_{i=1}^N.

Part A: The Alternative Minimization Procedure
Step 1: Fix p_{M_{x|y}}(x|y) and p_{M_y}(y); update the parameter θ_{y|x} of p_{M_{y|x}}(y|x) to reduce KL(M_1, M_2) in eq. [1.11], e.g., by moving one step along the gradient descent direction. When p_{M_{y|x}}(y|x) is free (backward case), it is obtained as
  p_{M_{y|x}}(y|x) = p_{M_{x|y}}(x|y) p_{M_y}(y) / p_{M_{x|y},M_y}(x),  p_{M_{x|y},M_y}(x) = ∫ p_{M_{x|y}}(x|y) p_{M_y}(y) dy.
Step 2: Fix p_{M_{y|x}}(y|x); update the parameter θ_{x|y} of p_{M_{x|y}}(x|y) to increase ∫_{x,y} p_{M_{y|x}}(y|x) p_{M_x}(x) ln p_{M_{x|y}}(x|y) dx dy, e.g., by moving one step along the gradient ascent direction. When p_{M_{x|y}}(x|y) is free (forward case), it is obtained as
  p_{M_{x|y}}(x|y) = p_{M_{y|x}}(y|x) p_{M_x}(x) / p_{M_{y|x},M_x}(y),  p_{M_{y|x},M_x}(y) = ∫ p_{M_{y|x}}(y|x) p_{M_x}(x) dx.
For p_{M_y}(y) free, let p_{M_y}(y^(j)) = ∫ p_{M_{y|x}}(y^(j)|x) p_{M_x}(x) dx, j = 1, ..., k. For p_{M_y}(y) parametric, update its parameter to increase ∫_{x,y} p_{M_{y|x}}(y|x) p_{M_x}(x) ln p_{M_y}(y) dx dy, e.g., by moving one step along the gradient ascent direction.

Part B: min_{M_1, M_2} KL(M_1, M_2) is equivalent to minimizing
  ∫_y p_{M_{y|x},M_x}(y) ln [ p_{M_{y|x},M_x}(y) / p_{M_y}(y) ] dy, with p_{M_{y|x},M_x}(y) and p_{M_{x|y}}(x|y) given above (forward case), or
  ∫_x p_{M_x}(x) ln [ p_{M_x}(x) / p_{M_{x|y},M_y}(x) ] dx, with p_{M_{x|y},M_y}(x) and p_{M_{y|x}}(y|x) given above (backward case).

Part C: The Stochastic Approximation Adaptive Algorithm
Refer to Sec. 3 in [1]; use p_{M_y}(y) by eq. (1) as the sampling reference density; get a sample x_t and randomly take a sample y_t from p_{M_y}(y).
Step 1: Fix p_{M_{x|y}}(x|y) and p_{M_y}(y); update (with learning stepsize η)
  θ_{y|x}^new = θ_{y|x}^old − η p_{M_y}(y_t) ∂{ p_{M_{y|x}}(y_t|x_t) ln [ p_{M_{y|x}}(y_t|x_t) / ( p_{M_{x|y}}(x_t|y_t) p_{M_y}(y_t) ) ] } / ∂θ_{y|x} |_{θ_{y|x} = θ_{y|x}^old};
  a free p_{M_{y|x}}(y|x) is obtained the same as in Part A.
Step 2: Fix p_{M_{y|x}}(y|x); update
  θ_{x|y}^new = θ_{x|y}^old + η p_{M_y}(y_t) ∂{ p_{M_{y|x}}(y_t|x_t) ln p_{M_{x|y}}(x_t|y_t) } / ∂θ_{x|y} |_{θ_{x|y} = θ_{x|y}^old};
  a free p_{M_{x|y}}(x|y) is obtained the same as in Part A (in the batch way only). When p_{M_y}(y) is free, the same as in Part A (in the batch way only). For p_{M_y}(y) parametric,
  θ_y^new = θ_y^old + η p_{M_y}(y_t) ∂{ p_{M_{y|x}}(y_t|x_t) ln p_{M_y}(y_t) } / ∂θ_y |_{θ_y = θ_y^old}.

Part D: The Model Selection Criteria for k
After learning, denote the results by p_{M_{y|x}}(y|x), p_{M_{x|y}}(x|y), p_{M_y}(y); with D_x = {x_t}_{t=1}^N, randomly take samples D_y = {y_τ}_{τ=1}^{N'} from p_{M_y}(y). From eqs. [1.13 & 14], get k* = arg min_k J_1(k) or k* = arg min_k J_2(k), where
  J_1(k) = (1/(N N')) Σ_{t=1}^N Σ_{τ=1}^{N'} p_{M_{y|x}}(y_τ|x_t) p_{M_y}(y_τ) ln [ p_{M_{y|x}}(y_τ|x_t) / ( p_{M_{x|y}}(x_t|y_τ) p_{M_y}(y_τ) ) ],
  J_2(k) = −(1/(N N')) Σ_{t=1}^N Σ_{τ=1}^{N'} p_{M_{y|x}}(y_τ|x_t) p_{M_y}(y_τ) ln [ p_{M_{x|y}}(x_t|y_τ) p_{M_y}(y_τ) ].
Table 2. BKYY DR for ICA and Blind Source Separation (BSS) (Forward / Bi-direction / Backward)

KL(M_1, M_2) and p_{M_x}(x) are the same as in Tab. 1, with
KL(M_1, M_2) = ∫ p_{M_x}(x) [ H(y|x) − Q(x|y) − C(y) ] dx dy,  H(y|x) = p_{M_{y|x}}(y|x) ln p_{M_{y|x}}(y|x),  Q(x|y) = p_{M_{y|x}}(y|x) ln p_{M_{x|y}}(x|y),  C(y) = p_{M_{y|x}}(y|x) ln p_{M_y}(y).

Part A: Architecture Design
For binary y, p_{M_y}(y) = ∏_{j=1}^k q_j^{y^(j)} (1 − q_j)^{1 − y^(j)}, θ_k = {q_j : 0 ≤ q_j ≤ 1}_{j=1}^k; for real y, p_{M_y}(y) = p(y|θ_k) given by eq. (1).
Forward case: p_{M_{x|y}}(x|y) is free. Bi-direction/backward cases: p_{M_{x|y}}(x|y) = G(x; g(Ay), Σ).
For binary y, p_{M_{y|x}}(y|x) = ∏_{j=1}^k π_j^{y_j} (1 − π_j)^{1 − y_j}, π_j = s(e_j^T [W f(x, θ_f) + b]), and H(y|x) = p_{M_{y|x}}(y|x) Σ_{j=1}^k [ y_j ln π_j + (1 − y_j) ln(1 − π_j) ]. For real y, p_{M_{y|x}}(y|x) = Σ_{j=1}^{n_{y|x}} β_j G(y; W f(x, θ_f), Σ_{0j}). In the backward case, p_{M_{y|x}}(y|x) is free.
Notes: 1. s(r) is a sigmoid with its range on [0,1], e.g., s(r) = 1/(1 + e^{−r}). 2. e_j^T x gives the j-th element of the vector x. 3. The nonlinear function f(x, θ_f) can be implemented by a forward network.

Part B: Adaptive Algorithm
Step 1: the same as Part C in Tab. 1; get a sample x_t and randomly take a sample y_t from p_{M_y}(y). Fix p_{M_{x|y}}(x|y) and p_{M_y}(y); update θ_{y|x} = {W, b, θ_f, {β_j}} (or θ_{y|x} = {W, b, θ_f}) by
  θ_{y|x}^new = θ_{y|x}^old − η p_{M_y}(y_t) ∂[ H(y_t|x_t) − Q(x_t|y_t) − C(y_t) ] / ∂θ_{y|x} |_{θ_{y|x} = θ_{y|x}^old};
in the backward case, p_{M_{y|x}}(y|x) = p_{M_{x|y}}(x|y) p_{M_y}(y) / ∫ p_{M_{x|y}}(x|y) p_{M_y}(y) dy.
Step 2: Fix p_{M_{y|x}}(y_t|x_t); with g(x̂) = [g_1(x̂_1), ..., g_d(x̂_d)], x̂ = Ay, update
  G_t = ∂g^T(x̂)/∂x̂ |_{x̂ = A^old y_t, φ = φ^old} (Σ^old)^{-1},
  A^new = A^old + η p_{M_{y|x}}(y_t|x_t) p_{M_y}(y_t) G_t [ x_t − g(A^old y_t, φ^old) ] y_t^T,
  H_t = ∂g^T(x̂)/∂φ |_{x̂ = A^old y_t, φ = φ^old} (Σ^old)^{-1},
  φ^new = φ^old + η p_{M_{y|x}}(y_t|x_t) p_{M_y}(y_t) H_t [ x_t − g(A^old y_t, φ^old) ],
  a(x_t, y_t) = x_t − g(A^new y_t, φ^new),
  Σ^new = (1 − η) Σ^old + η p_{M_{y|x}}(y_t|x_t) p_{M_y}(y_t) a(x_t, y_t) a^T(x_t, y_t).
In the forward case, p_{M_{x|y}}(x|y) = p_{M_{y|x}}(y|x) p_{M_x}(x) / ∫ p_{M_{y|x}}(y|x) p_{M_x}(x) dx (in the batch way only). Particularly, for linear g(u) = u, we have
  A^new = A^old + η p_{M_{y|x}}(y_t|x_t) p_{M_y}(y_t) (x_t − A^old y_t) y_t^T,
  Σ^new = (1 − η) Σ^old + η p_{M_{y|x}}(y_t|x_t) p_{M_y}(y_t) (x_t − A^new y_t)(x_t − A^new y_t)^T,
  W^new = W^old + η p_{M_{y|x}}(y_t|x_t) p_{M_y}(y_t) (y_t − W^old x_t) x_t^T (backward case).
For real y and parametric p_{M_y}(y): θ_j^new = θ_j^old + η p_{M_{y|x}}(y_t|x_t) p_{M_y}(y_t) ∂ln p(y_t^(j)|θ_j)/∂θ_j |_{θ_j = θ_j^old}, j = 1, ..., k. For binary y: q_j^new = 1/(1 + exp(−r_j^new)), r_j^new = r_j^old + η p_{M_{y|x}}(y_t|x_t) p_{M_y}(y_t) (y^(j) − q_j^old).

Part C: Three Ways of Recovery x → y
(1) Random sampling of ŷ according to the resulting p_{M_{y|x}}(y|x). (2) Maximum posterior decision, ŷ = arg max_y p_{M_{y|x}}(y|x). (3) For linear g(u) = u, a direct inverse mapping ŷ = Wx.

Part D: The Criteria for Selecting k (the same as in Tab. 1)
Table 3. Adaptive Algorithm with Finite Mixture for Noiseless ICA (Gaussian mixture / derivative-of-sigmoid mixture)

With θ_k^old fixed, update W^new = W^old + η (I + φ(y) y^T) W^old, φ(y) = [φ_1(y^(1)), ..., φ_k(y^(k))]^T, y = W^old x, y^(j) = x_t^T w_j^old,
  φ_j(y^(j)) = ∂ln p(y^(j)|θ_j)/∂y^(j) = [ Σ_{r=1}^{k_{y_j}} α_{rj}^old ∂p(y^(j)|θ_{rj}^old)/∂y^(j) ] / [ Σ_{r=1}^{k_{y_j}} α_{rj}^old p(y^(j)|θ_{rj}^old) ].
Gaussian mixture: p(y^(j)|θ_{rj}) = G(y^(j); μ_{rj}, σ_{rj}^2), θ_{rj} = {μ_{rj}, σ_{rj}^2}, ∂p(y^(j)|θ_{rj})/∂y^(j) = −[(y^(j) − μ_{rj})/σ_{rj}^2] p(y^(j)|θ_{rj}).
Derivative-of-sigmoid mixture: p(y^(j)|θ_{rj}) = ds(y^(j)|θ_{rj})/dy^(j) = b_{rj} e^{−b_{rj}(y^(j) − a_{rj})} / (1 + e^{−b_{rj}(y^(j) − a_{rj})})^2, s(y^(j)|θ_{rj}) = 1/(1 + e^{−b_{rj}(y^(j) − a_{rj})}), θ_{rj} = {a_{rj}, b_{rj}},
  ∂p(y^(j)|θ_{rj})/∂y^(j) = b_{rj}^2 e^{−b_{rj}(y^(j) − a_{rj})} (e^{−b_{rj}(y^(j) − a_{rj})} − 1) / (e^{−b_{rj}(y^(j) − a_{rj})} + 1)^3.

Step 1: y^(j) = x_t^T w_j^new, h_{rj}(x) = α_{rj}^old p(y^(j)|θ_{rj}^old) / Σ_{r=1}^{k_{y_j}} α_{rj}^old p(y^(j)|θ_{rj}^old), α_{rj}^new = (1 − η) α_{rj}^old + η h_{rj}(x), θ_{rj}^new = θ_{rj}^old + η h_{rj}(x) ∂ln p(y^(j)|θ_{rj}^old)/∂θ_{rj}, which takes the following specific forms, respectively.
Step 2 (Gaussian case): μ_{rj}^new = (1 − η h_{rj}(x)) μ_{rj}^old + η h_{rj}(x) y^(j), (σ_{rj}^2)^new = (1 − η h_{rj}(x)) (σ_{rj}^2)^old + η h_{rj}(x) [y^(j) − μ_{rj}^new]^2.
Step 2 (sigmoid case): b_{rj}^new = b_{rj}^old + η h_{rj}(x) [ 1/b_{rj}^old − (y^(j) − a_{rj}^new) (1 − e^{−b_{rj}^old (y^(j) − a_{rj}^new)}) / (1 + e^{−b_{rj}^old (y^(j) − a_{rj}^new)}) ], a_{rj}^new = a_{rj}^old + η h_{rj}(x) b_{rj}^old (1 − e^{−b_{rj}^old (y^(j) − a_{rj}^old)}) / (1 + e^{−b_{rj}^old (y^(j) − a_{rj}^old)}).
Then let W^old = W^new, θ_k^old = θ_k^new. We can even simply let σ_{rj}^2 = 1, b_{rj} = 1, k_{y_j} = 2 and α_{rj} = 0.5, such that the algorithms can be simplified considerably by removing the updating on σ_{rj}^2, b_{rj}, α_{rj}.
The purpose of Blind Source Separation (BSS) is to get a ŷ which recovers y up to only constant unknown scales and a permutation of indices. In the case that e = 0, d = k and g(u) = u is linear, eq. (3) reduces to the well-known linear instantaneous mixture, which is solved when ŷ = Wx makes the components of ŷ become independent. Thus, it is also called Independent Component Analysis (ICA), studied widely in the literature. Due to limited space, we omit mentioning one by one all the existing references; they can be found in a very recent overview paper [8]. In the case that e = 0, d = k and g(u) is nonlinear, eq. (3) reduces to the post-nonlinear instantaneous mixture studied recently by [13].

In this paper, we consider the general case of eq. (3) with e ≠ 0, d ≠ k, nonlinear g(u) and unknown k, by directly using the BYY DR introduced in Sec. 1. When e is Gaussian G(e; 0, Σ), eq. (3) can be described by p_{M_{x|y}}(x|y) = G(x; g(Ay), Σ). Putting this specific setting in Tab. 1, together with the appropriate designs given in Part A of Tab. 2, we can get an adaptive algorithm for the BSS problem eq. (3) directly, as given by Part B in Tab. 2, where the batch-way ICA algorithm is also given for the forward case. Moreover, Part C in Tab. 2 suggests three ways of recovering y from x. The first two are directly understandable. In the backward case, there is originally no W for the third way. However, we can indirectly get one via min_W J_2(W), J_2(W) = E‖y − ŷ‖² = Tr E[(y − Wx)(y − Wx)^T]. This W can also be adaptively learned together with the adaptation on A, as shown by the last line in the middle block of Part B. Finally, the criteria for selecting the unknown number of sources remain the same as Part D in Tab. 1.
It is interesting to consider the noiseless special case e = 0. When k is smaller than its correct value, we effectively have |Σ| ≠ 0 during the learning, and thus the situation is similar to the case with noise. When k becomes equal to or larger than its correct value, Σ will become singular as the other parameters converge to the correct values. In other words, we can still use the adaptive learning algorithm in Tab. 2 for the backward and bi-direction cases directly. Σ becoming singular is just a signal that indicates the correct convergence. That is, we can start at a small value for k and then gradually increase it, and stop the learning once Σ becomes singular.
For the forward case with linear g(u) = u and e = 0, we have p_{M_{y|x}}(y|x) = δ(x − W^{-1}y)/|W|, p_{M_{y|x},M_x}(y) = ∫_x p_{M_{y|x}}(y|x) p_{M_x}(x) dx = p_{M_x}(W^{-1}y)/|W| = p(y); thus from Part B in Tab. 1, min_{M_1, M_2} KL(M_1, M_2) is equivalent to minimizing ∫_y p(y) ln [ p(y) / ∏_{j=1}^k p_{M_y}(y^(j)) ] dy, y = Wx. If we further let p_{M_y}(y^(j)) = p(y^(j)), it reduces exactly to the minimum mutual information (MMI) criterion used by [2] and is also equivalent to maximum likelihood ICA [6, 12] and INFORMAX [4]. The MMI criterion is usually implemented by a gradient algorithm. An improved adaptive natural gradient algorithm is used in [2]:

W^new = W^old + η ΔW,  ΔW = (I + φ(y) y^T) W,  φ(y) = [φ_1(y^(1)), ..., φ_k(y^(k))]^T,  φ_j(y^(j)) = ∂ln p(y^(j)|θ_j)/∂y^(j).   (4)

These mentioned efforts have reached certain successes for sources of either only super-gaussians or only sub-gaussians [16].
The key point is the difference in the parametric form used for p(y^(j)|θ_j). From Tab. 2, we can get a new adaptive noiseless ICA algorithm, based on eq. (4) and eq. (1). Its specific versions for p(y^(j)|θ_{rj}) being either Gaussian or defined by a sigmoid function are given in Tab. 3. This use of a finite mixture density for p(y^(j)|θ_{rj}) makes the algorithm more flexible in adapting to different sources. Experiments have shown that it works well for various kinds of sources, including those on which the previously mentioned efforts succeeded and failed [16].
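As an illustration of the natural-gradient update eq. (4) combined with the learned finite-mixture density of eq. (1) (the Gaussian-mixture case of Tab. 3), here is a minimal sketch. The per-component mixture parameters are held fixed for simplicity, which is the simplification mentioned at the end of Tab. 3; all names are assumptions of this sketch.

```python
import numpy as np

def score_gm(y_j, alpha, mu, var):
    """phi_j(y_j) = d ln p(y_j)/d y_j for p(y_j) = sum_r alpha_r N(y_j; mu_r, var_r)."""
    dens = alpha * np.exp(-0.5 * (y_j - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    post = dens / dens.sum()                  # posterior weights h_rj as in Tab. 3
    return -np.sum(post * (y_j - mu) / var)

def ica_natural_gradient(X, k, alpha, mu, var, eta=0.01, epochs=50, seed=0):
    """Adaptive noiseless ICA: W <- W + eta * (I + phi(y) y^T) W, y = W x (eq. (4))."""
    rng = np.random.default_rng(seed)
    W = np.eye(k) + 0.01 * rng.standard_normal((k, k))
    for _ in range(epochs):
        for x in X:                           # X: (N, k) whitened observations
            y = W @ x
            phi = np.array([score_gm(y[j], alpha[j], mu[j], var[j]) for j in range(k)])
            W += eta * (np.eye(k) + np.outer(phi, y)) @ W
    return W
```

With two mixture components per source and fixed variances, this learned score function can accommodate both super- and sub-Gaussian sources, which is the flexibility claimed for Tab. 3.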
3 BKYY Data Dimension Reduction (DDR)

In Sec. 9 of [14], it has been shown that we can also use the basic BYY learning theory eqs. [1.4], [1.5], [1.6] as a general data dimension reduction (DDR) system and theory with the following design:

p_{M_y}(y) = Σ_{j=1}^{n_y} α_j p(y|θ_j),  p_{M_{x|y}}(x|y) = Σ_{j=1}^{n_{x|y}} β_j p(x − f(y, A_j)|θ_j),  p_{M_{y|x}}(y|x) = Σ_{j=1}^{n_{y|x}} γ_j p(y|x, g(x, W_j), θ_j),   (5)

for solving the problem of mapping the observed high-dimensional data x, generated from an unknown y = [y_1, ..., y_k] ∈ R^k under noise, back to its original R^k.
Table 4. BKYY Learning for Three-Layer Forward Nets (General Case)

1. Variable Types
y = [y_1, ..., y_k] with either y_j ∈ R or y_j ∈ {0,1}. Binary-F: z = [z_1, ..., z_m] with either z_j ∈ R or z_j ∈ {0,1}, without constraint. Binary-E: z of Binary-F plus the exclusive constraint Σ_{j=1}^m z_j = 1.

2. Architecture Design
p_{M_{z|y}}(z|y) = p(z|y, θ_{z|y}) = ∏_{j=1}^m π_j^{z_j}(1 − π_j)^{1 − z_j} for Binary-F z; ∏_{j=1}^m [π_j/Σ_{j=1}^m π_j]^{z_j} for Binary-E z; G(z; g(y, W_{z|y}), Σ_{z|y}) for real z.
p_{M_{y|x}}(y|x) = p(y|x, θ_{y|x}) = ∏_{j=1}^k μ_j^{y_j}(1 − μ_j)^{1 − y_j} for binary y; G(y; f(x, W_{y|x}), I) for real y.
Here π_j = g_j(y, W_{z|y}), μ_j = f_j(x, W_{y|x}), g(y, W_{z|y}) = [g_1(y, W_{z|y}), ..., g_m(y, W_{z|y})], f(x, W_{y|x}) = [f_1(x, W_{y|x}), ..., f_k(x, W_{y|x})], nonlinear functions in general, e.g., g_j(y, W_{z|y}) = s(e_j^T(W_{z|y} y)), f_j(x, W_{y|x}) = s(e_j^T(W_{y|x} x)).
Free p_{M_{y|x,z}}(y|x,z) = p(y|x,z): p(y|x,z) = p(z|y, θ_{z|y}) p(y|x, θ_{y|x}) / p_{M_2}(z|x), p_{M_2}(z|x) = ∫_y p(z|y, θ_{z|y}) p(y|x, θ_{y|x}) dy.
Parametric p_{M_{y|x,z}}(y|x,z) = p(y|x,z, θ_{y|x,z}): for binary y, ∏_{j=1}^k ν_j^{y_j}(1 − ν_j)^{1 − y_j}, ν_j = s(h_j(x, y, W_{y|x,z})), h(x, y, W_{y|x,z}) = [h_1(x, y, W_{y|x,z}), ..., h_k(x, y, W_{y|x,z})], e.g., h_j(x, y, W_{y|x,z}) = e_j^T(W_{y|x,z}[x^T, z^T]^T); for real y, p(y|x,z, θ_{y|x,z}) = G(y; h(x, y, W_{y|x,z}), I).
Notes: 1. s(r) is a sigmoid with its range on [0,1], e.g., s(r) = 1/(1 + e^{−r}); for Binary-E z, s(r) may be any monotonically increasing function, e.g., s(r) = e^r. 2. e_j^T x gives the j-th element of the vector x. 3. The equation specifying the free p(y|x,z) is given by Theorem 1 in [1].

3. Learning, i.e., min_{θ_{y|x,z}, θ_{z|y}, θ_{y|x}, k} J(θ_{y|x,z}, θ_{z|y}, θ_{y|x}, k)
J(θ_{y|x,z}, θ_{z|y}, θ_{y|x}, k) = −H_{y|x,z} − L_{z|y} − L_{y|x},
H_{y|x,z} = ∫_{x,z} H_{y|x,z}(x,z) p_{M_{L_1}}(z,x) dx dz,  H_{y|x,z}(x,z) = −∫_y p_{M_{y|x,z}}(y|x,z) ln p_{M_{y|x,z}}(y|x,z) dy,
L_{z|y} = ∫_{x,z} L_{z|y}(x,z) p_{M_{L_1}}(z,x) dx dz,  L_{z|y}(x,z) = ∫_y p_{M_{y|x,z}}(y|x,z) ln p(z|y, θ_{z|y}) dy,
L_{y|x} = ∫_{x,z} L_{y|x}(x,z) p_{M_{L_1}}(z,x) dx dz,  L_{y|x}(x,z) = ∫_y p_{M_{y|x,z}}(y|x,z) ln p(y|x, θ_{y|x}) dy.
For the free p_{M_{y|x,z}}(y|x,z) = p(y|x,z): J(θ_{z|y}, θ_{y|x}, k) = −L_{z|x}, L_{z|x} = ∫_{x,z} p_{M_{L_1}}(z,x) ln p_{M_2}(z|x) dx dz.
When p_{M_{L_1}}(z,x) = p_{M_x}(x) p_{M_{z|x}}(z|x) is given by eq. [1.2] and eq. [1.8]: H_{y|x,z} = Σ_{(x,z)∈D_{x,z}} H_{y|x,z}(x,z), L_{z|y} = Σ_{(x,z)∈D_{x,z}} L_{z|y}(x,z), L_{z|x} = Σ_{(x,z)∈D_{x,z}} ln p_{M_2}(z|x), L_{y|x} = Σ_{(x,z)∈D_{x,z}} L_{y|x}(x,z).

4. Parameter Learning Algorithm at a fixed k
Step 1: Fix θ_{z|y}, θ_{y|x}; either let the free p(y|x,z) be given by Part 2 above, or update θ_{y|x,z} by min_{θ_{y|x,z}} J(θ_{y|x,z}, θ_{z|y}, θ_{y|x}, k), e.g., by moving one step along the gradient descent direction.
Step 2: Fix p_{M_{y|x,z}}(y|x,z); update θ_{z|y} by max_{θ_{z|y}} L_{z|y} and θ_{y|x} by max_{θ_{y|x}} L_{y|x}, e.g., by moving one step along the gradient ascent direction.

5. Model Selection (selecting the number k of hidden units)
With θ_{y|x,z}, θ_{z|y}, θ_{y|x} obtained by the above algorithm, select k from eqs. [1.13 & 14] by J_1(k) = J(θ_{y|x,z}, θ_{z|y}, θ_{y|x}, k) or J_2(k) = −(L_{z|y} + L_{y|x})|_{θ_{y|x,z}, θ_{z|y}, θ_{y|x}}.
Table 5. BKYY Learning for Three-Layer Forward Nets (Special Cases)

1. δ-Yang based system. p_{M_{y|x,z}}(y|x,z) = p(y|x,z) free, p(y|x, θ_{y|x}) = δ(y − f(x, W_{y|x})), i.e., x → y by a deterministic mapping y = f(x, W_{y|x}). From Part 3 in Tab. 4, min_{θ_{z|y}, θ_{y|x}} J(θ_{z|y}, θ_{y|x}, k) becomes ML learning: max_{θ_{z|y}, W_{y|x}} L_{z|x}, L_{z|x} = Σ_{(x,z)∈D_{x,z}} ln p(z|f(x, W_{y|x}), θ_{z|y}). For p(z|y, θ_{z|y}) = G(z; g(y, W_{z|y}), σ²I), it becomes min_{W_{z|y}, W_{y|x}} Σ_{(x_i,z_i)∈D_{x,z}} ‖z_i − g(f(x_i, W_{y|x}), W_{z|y})‖².

2. Smoothed Yang based system. p_{M_{y|x,z}}(y|x,z) = p(y|x,z) free, p(z|y, θ_{z|y}) = p(z|E(y|x, θ_{y|x}), θ_{z|y}), E(y|x, θ_{y|x}) = ∫_y y p(y|x, θ_{y|x}) dy = f(x, W_{y|x}), p_{M_2}(z|x) = p(z|f(x, W_{y|x}), θ_{z|y}), H_{y|x,z}(x,z) = −L_{y|x}(x,z), J(θ_{z|y}, θ_{y|x}, k) = −Σ_{(x,z)∈D_{x,z}} ln p(z|f(x, W_{y|x}), θ_{z|y}); thus the parameter learning is the same as in the above δ-Yang based system. However, for selecting k, by eqs. [1.13 & 14],
J(k) = J_2(θ_{z|y}, θ_{y|x}, k) = −Σ_{(x,z)∈D_{x,z}} [ ln p(z|f(x, W_{y|x}), θ_{z|y}) + ∫_y p(y|x, θ_{y|x}) ln p(y|x, θ_{y|x}) dy ],
with ∫_y p(y|x, θ_{y|x}) ln p(y|x, θ_{y|x}) dy = −Σ_{j=1}^k [ μ_j ln μ_j + (1 − μ_j) ln(1 − μ_j) ] for binary y.

3. Mean-Field Yang based system. p_{M_{y|x,z}}(y|x,z) = p(y|x,z) free. In all the integrals over y, we let y be approximately replaced by E(y|x, θ_{y|x}) = f(x, W_{y|x}), so p_{M_2}(z|x) ≈ p(z|f(x, W_{y|x}), θ_{z|y}) p(f(x, W_{y|x})|x, θ_{y|x}), p(y|x,z) = p(z|y, θ_{z|y}) p(y|x, θ_{y|x}) ≈ p(z|f(x, W_{y|x}), θ_{z|y}) p(f(x, W_{y|x})|x, θ_{y|x}), H_{y|x,z}(x,z) = 0, L_{z|y}(x,z) = ln p(z|f(x, W_{y|x}), θ_{z|y}), L_{y|x}(x,z) = ln p(f(x, W_{y|x})|x, θ_{y|x}), with ln p(f(x, W_{y|x})|x, θ_{y|x}) = 0 for real y and Σ_{j=1}^k [ μ_j ln μ_j + (1 − μ_j) ln(1 − μ_j) ] for binary y. Then J_1(θ_{z|y}, θ_{y|x}, k) = J_2(θ_{z|y}, θ_{y|x}, k) = J(θ_{z|y}, θ_{y|x}, k), J(θ_{z|y}, θ_{y|x}, k) = −Σ_{(x,z)∈D_{x,z}} [ ln p(z|f(x, W_{y|x}), θ_{z|y}) + ln p(f(x, W_{y|x})|x, θ_{y|x}) ].

4. The Stochastic Approximation Adaptive Algorithm. p_{M_{y|x,z}}(y|x,z) = p(y|x,z, θ_{y|x,z}), and p(z|y, θ_{z|y}), p(y|x, θ_{y|x}) are parametric as given in Tab. 4. Refer to Sec. 3 in [1]; use p(y|x, θ_{y|x}) as the sampling reference density; get a sample (x_t, z_t) and randomly take a sample y_t from p(y|x_t, θ_{y|x}).
Step 1: Fix θ_{z|y}, θ_{y|x};
  θ_{y|x,z}^new = θ_{y|x,z}^old − η p(y_t|x_t, θ_{y|x}) ∂{ p(y_t|x_t,z_t, θ_{y|x,z}) ln [ p(y_t|x_t,z_t, θ_{y|x,z}) / ( p(z_t|y_t, θ_{z|y}) p(y_t|x_t, θ_{y|x}) ) ] } / ∂θ_{y|x,z} |_{θ_{y|x,z} = θ_{y|x,z}^old}.
Step 2: Fix θ_{y|x,z}; update
  θ_{z|y}^new = θ_{z|y}^old + η p(y_t|x_t,z_t, θ_{y|x,z}) p(y_t|x_t, θ_{y|x}) ∂ln p(z_t|y_t, θ_{z|y})/∂θ_{z|y} |_{θ_{z|y} = θ_{z|y}^old},
  θ_{y|x}^new = θ_{y|x}^old + η p(y_t|x_t,z_t, θ_{y|x,z}) p(y_t|x_t, θ_{y|x}) ∂ln p(y_t|x_t, θ_{y|x})/∂θ_{y|x} |_{θ_{y|x} = θ_{y|x}^old}.

5. Three Ways of Mapping x_t → z_t. (1) ẑ_t = g(f(x_t, W_{y|x}), W_{z|y}). (2) For binary y, ŷ_t = arg max_y p(y|x_t, θ_{y|x}) and then ẑ_t = g(ŷ_t, W_{z|y}). (3) ẑ_t = ∫ E(z|y, θ_{z|y}) p(y|x_t, θ_{y|x}) dy ≈ g(f(x_t, W_{y|x}), W_{z|y}) p(f(x_t, W_{y|x})|x_t, θ_{y|x}).

6. Model Selection (selecting the number k of hidden units). With D_{x,z} = {x_t, z_t}_{t=1}^N, randomly take samples D_y = {y_τ}_{τ=1}^{N'} from p(y|x_t, θ_{y|x}); from eqs. [1.13 & 14], get k* = arg min_k J_1(k) or k* = arg min_k J_2(k),
J_1(k) = (1/(N N')) Σ_{t=1}^N Σ_{τ=1}^{N'} [1/p(y_τ|x_t, θ_{y|x})] p(y_τ|x_t,z_t, θ_{y|x,z}) ln [ p(y_τ|x_t,z_t, θ_{y|x,z}) / ( p(z_t|y_τ, θ_{z|y}) p(y_τ|x_t, θ_{y|x}) ) ],
J_2(k) = −(1/(N N')) Σ_{t=1}^N Σ_{τ=1}^{N'} [1/p(y_τ|x_t, θ_{y|x})] p(y_τ|x_t,z_t, θ_{y|x,z}) ln [ p(z_t|y_τ, θ_{z|y}) p(y_τ|x_t, θ_{y|x}) ].

This DDR theory aims not only at modeling the generating process of the data x by p_{M_{x|y}}(x|y) and the backward mapping by p_{M_{y|x}}(y|x), but also at discovering the original dimension k as well as the structural scales n_y, n_{x|y}, n_{y|x}.
This BYY DDR is closely related to the BYY DR discussed in Sec. 1. The key point of DDR is to reduce the dimension of the data (i.e., k < d) by eq. (1) and eq. (5), while the key point of DR in Sec. 1 is to let the components of y become independent, and there it can be either k < d or k = d.

To get some deeper insights into DDR, we consider the linear dimension reduction with gaussian densities, i.e., the special case of eq. (5) with p(y|θ_j) = G(y; m_y^(j), Σ_y^(j)), p(x − f(y, A_j)|θ_j) = G(x; A_j y, Σ_{x|y}^(j)), p(y|x, g(x, W_j)) = G(y; W_j^T x, Σ_{y|x}^(j)), where x is represented by a gaussian mixture in y via different linear mappings W_j, which can be regarded as a combined job of data dimension reduction and clustering in R^k. We can also regard x as generated from a gaussian mixture in y via different linear mappings A_j.

We further consider the simplest case that n_y = 1, n_{x|y} = 1, n_{y|x} = 1 with

x = Ay + e_x,  e_x gaussian,  E e_x = 0,  E[e_x e_x^T] = σ²_{x|y} I_d,  p_{M_y}(y) = G(y; 0, Λ_y),  p_{M_{x|y}}(x|y) = G(x; Ay, σ²_{x|y} I_d),  p_{M_x}(x) = p_h(x) given by eq. [1.2].   (6)
A particular example of this design has been discussed in Sec. 9 of [14] and is shown to be related to the conventional PCA learning. In the following, we consider an even more interesting example in which p_{M_{y|x}}(y|x) = p(y|x) is free. Since p(y|x) is free, the minimization of eq. [1.6] will result in

p(x) = ∫ G(x; Ay, σ²_{x|y} I_d) G(y; 0, Λ_y) dy = G(x; 0, σ²_{x|y} I_d + A Λ_y A^T),
p(y|x) = G(x; Ay, σ²_{x|y} I_d) G(y; 0, Λ_y) / p(x),
KL(Θ, k) = ∫_x p_h(x) ln [ p_h(x) / p(x) ] dx.   (7)

Moreover, it is not difficult to see that this p(y|x) is actually of the form

p(y|x) = G(y; W^T x, Σ_{y|x}),  W^T = Λ_y A^T (σ²_{x|y} I_d + A Λ_y A^T)^{-1},
Σ_{y|x} = Λ_y − W^T (σ²_{x|y} I_d + A Λ_y A^T) W = Λ_y − Λ_y A^T (σ²_{x|y} I_d + A Λ_y A^T)^{-1} A Λ_y.   (8)

That is, the situations are actually equivalent when p_{M_{y|x}}(y|x) is either free or gaussian with linear regression.
When h → 0, min KL(Θ, k) with Θ = {σ²_{x|y}, A, Λ_y} is equivalent to min J(Θ, k),

J(Θ, k) = ln|Σ_x| + Tr[Σ_x^{-1} S_x],  Σ_x = σ²_{x|y} I_d + A Λ_y A^T,   (9)

with S_x = (1/N) Σ_{i=1}^N x_i x_i^T. We first explore the property of its solution. From ∇_A J(Θ, k) = Σ_x^{-1} A Λ_y − Σ_x^{-1} S_x Σ_x^{-1} A Λ_y and ∇_A J(Θ, k) = 0, we get A^T = A^T Σ_x^{-1} S_x. With S_x = Φ Λ_x Φ^T, Φ^T Φ = I, Λ_x = diag[λ_{x,1}, ..., λ_{x,d}], we have A^T Σ_x^{-1} (σ²_{x|y} I_d + Φ^T A Λ_y A^T Φ) = A^T, and with A^T Φ = [D | 0], D = diag[d_1, ..., d_k],

[D | 0] Λ_x^{-1} (σ²_{x|y} I_d + [D | 0]^T Λ_y [D | 0]) = [D | 0],
J(Θ, k) = 0.5 [ Σ_{j=1}^k ln(d_j² λ_{y,j} + σ²_{x|y}) + (d − k) ln σ²_{x|y} + k + Σ_{j=k+1}^d λ_{x,j}/σ²_{x|y} ].   (10)

In other words, the solution is not unique, and a typical example is that A consists of k eigenvectors of S_x, corresponding to the k eigenvalues that minimize

J_1(k) = 0.5 [ Σ_{j=1}^k ln λ_{x,j} + (d − k) ln σ²_{x|y} ],  σ²_{x|y} = (1/(d − k)) Σ_{j=k+1}^d λ_{x,j},  λ_{y,j} = λ_{x,j} − σ²_{x|y}.   (11)

Unfortunately, directly picking k among the d eigenvectors of S_x for eq. (11) is a difficult combinatorial problem. Instead, we propose an iterative algorithm to search for a solution on A^T A = I.
With this constraint, we have Σ_x^{-1} = σ^{-2}_{x|y} [ I_d − A (σ²_{x|y} Λ_y^{-1} + I_k)^{-1} A^T ] and

J(Θ, k) = ln|Λ_y + σ²_{x|y} I_k| + ln|σ²_{x|y} I_{d−k}| + σ^{-2}_{x|y} { Tr S_x − Tr[(σ²_{x|y} Λ_y^{-1} + I_k)^{-1} A^T S_x A] },
∇_{Λ_y} J(Θ, k) = (Λ_y + σ²_{x|y} I_k)^{-1} − (σ²_{x|y} I_k + Λ_y)^{-1} A^T S_x A (σ²_{x|y} I_k + Λ_y)^{-1},
∇_{Λ_y} J(Θ, k) = 0 resulting in A^T S_x A = Λ_y + σ²_{x|y} I_k, or Λ_y = A^T S_x A − σ²_{x|y} I_k,
∇_A J(Θ, k) = −2 σ^{-2}_{x|y} S_x A (σ²_{x|y} Λ_y^{-1} + I_k)^{-1} = −2 σ^{-2}_{x|y} S_x A Λ_y (A^T S_x A)^{-1}.   (12)

Thus, by noticing Λ_y^{-1} A^T S_x A = A^T S_x A Λ_y^{-1}, the gradient of J(Θ, k) with respect to A on the manifold A^T A = I is obtained as

∇^c_A = (I − A A^T) ∇_A J(Θ, k),  vec[∇^c_A] = −C vec[S_x A Λ_y^{-1} − A Λ_y^{-1} A^T S_x A],  C = σ^{-2}_{x|y} ((A^T S_x A) ⊗ I_d).   (13)

Since C is positive definite due to the positive definite A^T S_x A, the direction S_x A Λ_y^{-1} − A Λ_y^{-1} A^T S_x A reduces J(Θ, k) on the manifold A^T A = I, and we have the following iterative algorithm for solving min J(Θ, k) on this manifold:

A^new = A^old + η ( S_x A^old (Λ_y^old)^{-1} − A^old (Λ_y^old)^{-1} A^{old,T} S_x A^old ),
Λ_y^new = A^{old,T} S_x A^old − σ²_{x|y,old} I_k,   (14)
Σ_x = σ²_{x|y,old} I_d + A^old Λ_y^old A^{old,T},
σ²_{x|y,new} = σ²_{x|y,old} + η Tr[ Σ_x^{-1} S_x Σ_x^{-1} − Σ_x^{-1} ],

where the updating for σ²_{x|y,new} follows from the original J(Θ, k) given by eq. (9), for which dJ(Θ, k)/dσ²_{x|y} = −Tr[Σ_x^{-1} S_x Σ_x^{-1} − Σ_x^{-1}]. The updating formulae for Λ_y, Σ_x come directly from eq. (12) and eq. (9). As long as the learning stepsize η is controlled to be small enough, the algorithm keeps decreasing J(Θ, k) until it finally converges.
We can also modify the algorithm eq. (14) into an adaptive one for each x_i:

z = A^{old,T} x_i,
A^new = A^old + η ( x_i z^T (Λ_y^old)^{-1} − A^old (Λ_y^old)^{-1} z z^T ),
Λ_y^new = (1 − η) Λ_y^old + η [ z z^T − σ²_{x|y,old} I_k ],   (15)
Σ_x = σ²_{x|y,old} I_d + A^old Λ_y^old A^{old,T},
σ²_{x|y,new} = (1 − η) σ²_{x|y,old} + η ( ‖Σ_x^{-1} x_i‖² − Tr[Σ_x^{-1}] ).

For a solution A with diag[λ_{x,1}, ..., λ_{x,k}] = Λ_y + σ²_{x|y} I_k obtained by the above algorithms, we can get the best dimension by k* = arg min_k J_1(k) from eq. (11). Also, from eq. (8), −∫ p(y|x) p_h(x) ln p(y|x) dx dy = 0.5 [ k + ln|Λ_y (I − A^T (σ²_{x|y} I_d + A Λ_y A^T)^{-1} A Λ_y)| ]. By eq. [1.13], after ignoring some terms that are irrelevant to k, we can get

J_2(k) = 0.5 [ d ln( (1/(d − k)) Σ_{j=k+1}^d λ_{x,j} ) + k + Σ_{j=1}^k ln( λ_{x,j} − (1/(d − k)) Σ_{j=k+1}^d λ_{x,j} ) ].   (16)

Finally, from the last equation in eq. (10), we get the PCA if Λ_y is fixed at any positive diagonal matrix with different elements, and the minor component analysis (MCA) if σ²_{x|y} is fixed at any large constant. Thus, eq. (14) or eq. (15) acts as a unified algorithm for linear dimension reduction that includes the PCA and MCA as special cases.
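The batch updates of eq. (14) are straightforward to prototype. The sketch below is an illustration under the stated constraint A^T A = I; the re-orthonormalization step and the initialization are practical assumptions of this sketch that the paper does not spell out, and all names are assumptions.

```python
import numpy as np

def bkyy_linear_ddr(X, k, eta=0.05, iters=200, seed=0):
    """Iterative linear dimension reduction of eq. (14): decrease J(Theta, k) on A^T A = I."""
    N, d = X.shape
    Sx = X.T @ X / N                       # sample covariance S_x
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start (assumption)
    sig2 = 0.1 * np.trace(Sx) / d          # small sigma^2_{x|y} start keeps Lambda_y positive
    Lam = A.T @ Sx @ A - sig2 * np.eye(k)  # Lambda_y from eq. (12)
    for _ in range(iters):
        Lam_inv = np.linalg.inv(Lam)
        # gradient direction on the manifold A^T A = I (eqs. (13)-(14))
        A = A + eta * (Sx @ A @ Lam_inv - A @ Lam_inv @ (A.T @ Sx @ A))
        A, _ = np.linalg.qr(A)             # keep A^T A = I (practical safeguard, not in eq. (14))
        Lam = A.T @ Sx @ A - sig2 * np.eye(k)
        Sigx = sig2 * np.eye(d) + A @ Lam @ A.T
        Sigx_inv = np.linalg.inv(Sigx)
        sig2 = sig2 + eta * np.trace(Sigx_inv @ Sx @ Sigx_inv - Sigx_inv)
    return A, Lam, sig2
```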
4 BKYY Learning for Three-Layer Forward Nets

The three-layer perceptron, or three-layer forward net, is usually trained by maximum likelihood (ML), or particularly its special case called least-squares learning, via the back-propagation (BP) technique, which is simple to understand and thus very popular in the literature. However, its generalization ability depends on an appropriate selection of the number of hidden units or on a good regularization technique for its parameter estimation, based on some heuristics or a complicated upper-bound approximation of the generalization error.
Given in Tab. 4 is the detailed form of the Fully Matched Yang-based BKYY learning given by eqs. [1.10], [1.27] & [1.28] on the cascade architecture, a general form of three-layer forward net that includes the three-layer perceptron as a special case. The implementation of its parameter learning algorithm involves integral or summation operations over all the values of y, which can be very expensive for a large k, except in the analytically integrable case that both p(z|y, θ_{z|y}) and p(y|x, θ_{y|x}) are Gaussian and f(x, W_{y|x}), g(y, W_{z|y}) are linear. Thus, we need either to develop some efficient algorithm or to make some simplification.

Given in Tab. 5 are four such examples. The first is exactly the conventional ML learning, or particularly the least-squares learning, for a three-layer perceptron, which can be trained by the back-propagation technique. The second is obtained by letting p(z|y, θ_{z|y}) be conditioned on the smoothed regression E(y|x, θ_{y|x}) instead of on y directly. These first two examples are equivalent in parameter learning. However, the second also provides an important new result: a new criterion for selecting the best number k of hidden units by J_2(k), which contains an extra term to penalize a large k during its ML fitting on p_{M_2}(z|x). Interestingly, this criterion J_2(k) is much simpler than many of the existing hidden-unit selection criteria based on estimates of the upper bound of the generalization error. The third example is to use the mean-field approximation on y for all the integral or summation operations over y. For real y, it turns out to be exactly the same as the first example. However, for binary y, the penalizing term that appeared in the second example in J_2(k) now appears in J(θ_{z|y}, θ_{y|x}, k) too. In other words, this term not only penalizes increasing k when selecting the best k, but also regularizes the parameter learning at a given fixed k.
The first three examples are made by either simplification or approximation. In the general cases, by using the general random-sampling implementation technique eq. [1.22] and eq. [1.23] given in Sec. 3 of [1], we can get a stochastic approximation adaptive algorithm, given in Part 4 of Tab. 5, for avoiding the integral or summation operations over y during learning. In the testing phase, three ways of implementing the mapping x → z are given in Part 5 of Tab. 5. Moreover, via random sampling, we can also get new criteria for selecting the best number k of hidden units.
5 BKYY Learning for Mixture of Experts

As discussed previously, the cascade architecture of the three-layer perceptron encounters the integral or summation over y, which causes an impractical cost unless y takes only a few values. However, if y takes just a few values, the representation ability of the network is limited, since y functions as a bottleneck through which the entire information flow must pass. In the case of the Fully Matched BKYY learning, when p_{M_{z|x,y}}(z|x,y) ≠ p_{M_{z|y}}(z|y), eq. [1.27] will become

p_{M_2}(x,z) = ∫ p_{M_{z|x,y}}(z|x,y) p_{M_{L_2}}(x,y) dy = ∫ p_{M_{z|x,y}}(z|x,y) p_{M_{x|y}}(x|y) p_{M_y}(y) dy (Ying based), or ∫ p_{M_{z|x,y}}(z|x,y) p_{M_{y|x}}(y|x) p_{M_x}(x) dy (Yang based),
p_{M_2}(z|x) = p_{M_2}(x,z) / ∫_z p_{M_2}(x,z) dz = ∫_y p_{M_{z|x,y}}(z|x,y) p_{M_{x|y},M_y}(y|x) dy (Ying based), or ∫_y p_{M_{z|x,y}}(z|x,y) p_{M_{y|x}}(y|x) dy (Yang based),
p_{M_{x|y},M_y}(y|x) = p_{M_{x|y}}(x|y) p_{M_y}(y) / ∫_y p_{M_{x|y}}(x|y) p_{M_y}(y) dy.   (17)

In this case, each p_{M_{z|x,y}}(z|x,y) itself builds a direct link x → z, gated via the internal variable y, such that a weighted mixture p_{M_2}(z|x) is formed in a parallel architecture, with a gate given by p_{M_{y|x}}(y|x) for a Yang-based system and by p_{M_{x|y},M_y}(y|x) for a Ying-based system. Although we still encounter the summation over all the values of y, here y only takes a small number of values. Because each p_{M_{z|x,y}}(z|x,y) has a direct link x → z and y only functions as a gate that weights information flows from different experts, we can trade off the complexity of y against the structural complexity of each p_{M_{z|x,y}}(z|x,y), such that the number of values that y should take is significantly smaller than in the cascade architecture. This is an advantage the cascade architecture does not have.
In Tab. 6, we let y take only k discrete values. From Part 6, we see that the Yang-based system is actually the model of Mixture of Experts [9, 10, 5, 17, 18], with each p_{M_{z|x,y}}(z|x,y) being an expert net and p_{M_{y|x}}(y|x) being the so-called softmax gating net, and that the Ying-based system is actually the alternative model of Mixture of Experts proposed in [17, 18], with p(y_j = 1|x) = p(x|φ_j) p(y_j = 1) / Σ_{j=1}^k p(x|φ_j) p(y_j = 1) as an alternative model for the gating net. Moreover, the algorithm in Part 4 for the Yang- and Ying-based systems is actually the EM algorithm for the original model of mixture of experts [7, 9] and the EM algorithm for the alternative model of mixture of experts proposed in [17, 18], respectively. This fact can also be directly observed from J(Θ, Φ, k) = −L_{z|x} in Part 3. Actually, Part 3 indicates that min_{Θ, Φ, k} J(Θ, Φ, k) is equivalent to the maximum likelihood learning on p_{M_2}(z|x) or p_{M_2}(z,x).
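To make the parallel architecture concrete, the following sketch (an illustration with Gaussian experts and a softmax gate, i.e., the Yang-based case; all names are assumptions of this sketch) computes the mixture output p_{M_2}(z|x) and the posterior p(y_j = 1|x, z) that drives the EM updates in Part 4 of Tab. 6.

```python
import numpy as np

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def gauss_pdf(z, mean, var):
    """Isotropic Gaussian expert density G(z; mean, var*I)."""
    m = len(z)
    return np.exp(-0.5 * np.sum((z - mean) ** 2) / var) / (2 * np.pi * var) ** (m / 2)

def moe_forward(x, z, V, W_list, c_list, var_list):
    """Yang-based mixture of experts:
       gate g_j(x) = softmax(V x)_j, expert p(z|x, theta_j) = G(z; W_j x + c_j, var_j I).
       Returns p_M2(z|x) and the posteriors p(y_j=1|x,z) used in the E-step."""
    gate = softmax(V @ x)
    expert = np.array([gauss_pdf(z, W @ x + c, v)
                       for W, c, v in zip(W_list, c_list, var_list)])
    joint = gate * expert                  # g_j(x) p(z|x, theta_j)
    p_z_given_x = joint.sum()              # p_M2(z|x) = sum_j g_j(x) p(z|x, theta_j)
    posterior = joint / p_z_given_x        # p(y_j = 1 | x, z)
    return p_z_given_x, posterior
```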
One inconvenience of using EM on the original model is the nonlinearity of the softmax gating net, which makes the maximization with respect to the parameters in the gating net nonlinear and unsolvable analytically even for the piecewise linear case.

Table 6. BKYY Learning for Mixture of Experts

1. Variable Types. y = [y_1, ..., y_k] with y_r ∈ {0,1}, Σ_{r=1}^k y_r = 1; z is the same as in Tab. 4.

2. Architecture Design. p_{M_{y|x,z}}(y|x,z) = Σ_{j=1}^k y_j p(y_j=1|x,z) and p_{M_y}(y) = Σ_{j=1}^k y_j p(y_j=1) are free; p_{M_{z|x,y}}(z|x,y) = Σ_{j=1}^k y_j p(z|x, θ_j); p_{M_{y|x}}(y|x) = Σ_{j=1}^k y_j p(y_j=1|x), p(y_j=1|x) = β_j / Σ_{j=1}^k β_j, β_j = f_j(x, V);
p(z|x, θ_j) = ∏_{r=1}^m π_{jr}^{z_r}(1 − π_{jr})^{1 − z_r} for Binary-F z; ∏_{r=1}^m [π_{jr}/Σ_{r=1}^m π_{jr}]^{z_r} for Binary-E z; G(z; g(x, W_j), Σ_j) for real z;
p(x|φ_j) = G(x; m_j, Γ_j) for real x; ∏_{r=1}^d q_{x_r|j}^{x_r}(1 − q_{x_r|j})^{1 − x_r} for binary x;
π_{jr}(x, W_j) = g_r(x, W_j), p_{M_{x|y}}(x|y) = Σ_{j=1}^k y_j p(x|φ_j), g(x, W_j) = [g_1(x, W_j), ..., g_m(x, W_j)], f(x, V) = [f_1(x, V), ..., f_k(x, V)], e.g., g_j(x, W_j) = s(e_j^T(W_j x)), f_j(x, V) = s(e_j^T(V x)).

3. Learning, i.e., min_{Θ, Φ, k} J(Θ, Φ, k), Θ = {θ_j}_{j=1}^k, Φ = {φ_j}_{j=1}^k
Yang-based system: J(Θ, Φ, k) = −H_{y|x,z} − L_{z|x,y} − L_{y|x} = −L_{z|x};
  p(y_j=1|x,z) = p(z|x, θ_j) p(y_j=1|x) / p_{M_2}(z|x), p_{M_2}(z|x) = Σ_{j=1}^k p(z|x, θ_j) p(y_j=1|x);
  L_{y|x} = ∫ L_{y|x}(x,z) p_{M_{L_1}}(z,x) dx dz, L_{y|x}(x,z) = Σ_{j=1}^k p(y_j=1|x,z) ln p(y_j=1|x);
  L_{z|x} = ∫ p_{M_{L_1}}(z,x) ln p_{M_2}(z|x) dx dz.
Ying-based system: J(Θ, Φ, k) = −H_{y|x,z} − L_{z|x,y} − L_{x|y} + H_y = −L_{x,z};
  p(y_j=1|x,z) = p(z|x, θ_j) p(x|φ_j) p(y_j=1) / p_{M_2}(x,z), p_{M_2}(x,z) = Σ_{j=1}^k p(z|x, θ_j) p(x|φ_j) p(y_j=1);
  L_{x|y} = ∫ L_{x|y}(x,z) p_{M_{L_1}}(z,x) dx dz, L_{x|y}(x,z) = Σ_{j=1}^k p(y_j=1|x,z) ln p(x|φ_j);
  L_{x,z} = ∫ p_{M_{L_1}}(z,x) ln p_{M_2}(x,z) dx dz.
Common terms: H_{y|x,z} = ∫ H_{y|x,z}(x,z) p_{M_{L_1}}(z,x) dx dz, H_{y|x,z}(x,z) = −Σ_{j=1}^k p(y_j=1|x,z) ln p(y_j=1|x,z);
L_{z|x,y} = ∫ L_{z|x,y}(x,z) p_{M_{L_1}}(z,x) dx dz, L_{z|x,y}(x,z) = Σ_{j=1}^k p(y_j=1|x,z) ln p(z|x, θ_j);
H_y = −Σ_{j=1}^k p(y_j=1) ln p(y_j=1), p(y_j=1) = ∫ p(y_j=1|x,z) p_{M_{L_1}}(z,x) dx dz.
When p_{M_{L_1}}(z,x) = p_{M_x}(x) p_{M_{z|x}}(z|x) is given by eq. [1.2] and eq. [1.8], the same as in Tab. 5 except that L_{z|y} is replaced by L_{z|x,y} = Σ_{(x,z)∈D_{x,z}} L_{z|x,y}(x,z).

4. Parameter Learning Algorithm at a fixed k
Step 1: Fix Θ, Φ; get p(y_j=1|x,z) as above.
Step 2 (Yang-based): for j = 1, ..., k, update θ_j^new = arg max_{θ_j} L_{z|x,y} and V^new = arg max_V L_{y|x}, e.g., by moving one step along the gradient ascent direction.
Step 2 (Ying-based): p(y_j=1) = (1/#D_{x,z}) Σ_{(x,z)∈D_{x,z}} p(y_j=1|x,z); θ_j^new = arg max_{θ_j} L_{z|x,y};
  m_j^new = [1/(p(y_j=1) #D_{x,z})] Σ_{(x,z)∈D_{x,z}} p(y_j=1|x,z) x,
  Γ_j^new = [1/(p(y_j=1) #D_{x,z})] Σ_{(x,z)∈D_{x,z}} p(y_j=1|x,z) [x − m_j^new][x − m_j^new]^T.

5. Model Selection (selecting the number k of experts)
With Θ, Φ obtained by the above algorithm, select k from eqs. [1.13 & 14] by either J_1(k) = J(Θ, Φ, k) or J_2(k) = J_1(k) + H_{y|x,z}|_{Θ, Φ}. When p(z|x, θ_j) is gaussian, they take the specific forms
J_1(k) = Σ_{(x,z)∈D_{x,z}} Σ_{j=1}^k p(y_j=1|x,z) ln p(y_j=1|x,z) + J_2(k),
J_2(k) = −Σ_{(x,z)∈D_{x,z}} Σ_{j=1}^k [ p(y_j=1|x,z) ln p(y_j=1|x) − 0.5 p(y_j=1|x,z) ln|Σ_j| ]   (Yang-based),
J_2(k) = Σ_{j=1}^k [ −p(y_j=1) ln p(y_j=1) + 0.5 p(y_j=1) ln|Γ_j| + 0.5 p(y_j=1) ln|Σ_j| ]   (Ying-based).

6. The Output of the Nets. p_{M_2}(z|x) = Σ_{j=1}^k p(z|x, θ_j) p(y_j=1|x); for the Ying-based case, p(y_j=1|x) = p(x|φ_j) p(y_j=1) / Σ_{j=1}^k p(x|φ_j) p(y_j=1).
Table 7. BKYY Learning for Normalized RBF Nets

1. Variable Types and Architecture Design. All are the same as the Ying-based system in Tab. 6, with specifically p(z|x, θ_j) = G(z; c_j, Σ_j) or p(z|x, θ_j) = G(z; W_j^T x + c_j, Σ_j), and p(y_j=1) = √|Γ_j| / Σ_{j=1}^k √|Γ_j|.

2. The Batch-Way EM Algorithm at a fixed k
Step 1: r_j^old = c_j^old for an NRBF net, or (W_j^old)^T x + c_j^old for an ENRBF net;
  h(j|x) = p(y_j=1|x,z) = exp{−0.5 [x − m_j^old]^T (Γ_j^old)^{-1} [x − m_j^old]} G(z; r_j^old, Σ_j^old) / Σ_{j=1}^k exp{−0.5 [x − m_j^old]^T (Γ_j^old)^{-1} [x − m_j^old]} G(z; r_j^old, Σ_j^old).
Step 2: First, for both NRBF and ENRBF nets, get N_e = Σ_{(x,z)∈D_{x,z}} h(j|x); update m_j^new = (1/N_e) Σ_{(x,z)∈D_{x,z}} h(j|x) x, Γ_j^new = (1/N_e) Σ_{(x,z)∈D_{x,z}} h(j|x) [x − m_j^new][x − m_j^new]^T.
NRBF net: c_j^new = (1/N_e) Σ_{(x,z)∈D_{x,z}} h(j|x) z, Σ_j^new = (1/N_e) Σ_{(x,z)∈D_{x,z}} h(j|x) (z − c_j^new)(z − c_j^new)^T.
ENRBF net: Ez_j = (1/N_e) Σ_{(x,z)∈D_{x,z}} h(j|x) z, R_{xz} = (1/N_e) Σ_{(x,z)∈D_{x,z}} h(j|x) [z − Ez_j][x − m_j^new]^T, W_j^new = [(Γ_j^new)]^{-1} R_{xz}, c_j^new = Ez_j − (W_j^new)^T m_j^new, Σ_j^new = (1/N_e) Σ_{(x,z)∈D_{x,z}} h(j|x) [z − (W_j^new)^T x − c_j^new][z − (W_j^new)^T x − c_j^new]^T. When Σ_j = σ_j² I, σ_j² = (1/N_e) Σ_{(x,z)∈D_{x,z}} h(j|x) ‖z − r_j^new‖².

3. The Adaptive EM Algorithm at a fixed k
Step 1: get the same h(j|x) as in the batch-way case; update α_j^new = (1 − η_0) α_j^old + η_0 h(j|x); then get I(j|x_i) = 1 if j = arg min_r {−log h(r|x_i)}, 0 otherwise; η_{ji} = η_0 h(j|x_i)/α_j.
Step 2: Update m_j^new = m_j^old + η_{ji}(x_i − m_j^old), Γ_j^new = (1 − η_{ji}) Γ_j^old + η_{ji}(x_i − m_j^old)(x_i − m_j^old)^T.
NRBF net: c_j^new = c_j^old + η_{ji}(z_i − c_j^old), Σ_j^new = (1 − η_{ji}) Σ_j^old + η_{ji}(z_i − c_j^new)(z_i − c_j^new)^T.
ENRBF net: Ez_j^new = Ez_j^old + η_{ji}(z_i − Ez_j^old), c_j^new = Ez_j^new − (W_j^old)^T m_j^old, Σ_j^new = (1 − η_{ji}) Σ_j^old + η_{ji}(z_i − (W_j^old)^T x_i − c_j^new)(z_i − (W_j^old)^T x_i − c_j^new)^T, W_j^new = W_j^old + η_{ji}(z_i − (W_j^new)^T x_i − c_j^new) x_i^T.

4. Model Selection (selecting the number k of radial basis functions)
J_1(k) = Σ_{(x,z)∈D_{x,z}} Σ_{j=1}^k p(y_j=1|x,z) ln p(y_j=1|x,z) + J_2(k),
J_2(k) = ln Σ_{j=1}^k √|Γ_j| + Σ_{j=1}^k ( √|Γ_j| / Σ_{j=1}^k √|Γ_j| ) ln √|Γ_j|.

An algorithm called Iteratively Reweighted Least Squares (IRLS) has been proposed for this nonlinear optimization [9, 10]. However, IRLS is a Newton-type algorithm and thus needs some safeguard measures for convergence, which is a big difference from the guaranteed convergence of the pure EM algorithm. This inconvenience has been overcome by the alternative model. As shown in Part 3, the learning on the gate can be made analytically. Another important open problem in using both the original and the alternative mixture of experts is how to select the number of experts. Based on the BKYY learning theory, we propose a criterion for this task by Part 5 in Tab. 6. We can intuitively interpret J_2(k): as k increases, its first term will increase, which trades off against the monotonic decrease of the second term to yield a best k.
6 BKYY Learning for Normalized RBF Nets

We consider an interesting special case of the Ying-based system in Tab. 6, with the specific architecture design given by Part 1 in Tab. 7. In this case, from p_{M_2}(z|x) by Part 6 in Tab. 6, we have E(z|x) = ∫ z p_{M_2}(z|x) dz given by

E(z|x) = Σ_{j=1}^k c_j φ_j(x)  or  Σ_{j=1}^k (W_j^T x + c_j) φ_j(x),  φ_j(x) = exp{−0.5 (x − m_j)^T Γ_j^{-1} (x − m_j)} / Σ_{r=1}^k exp{−0.5 (x − m_r)^T Γ_r^{-1} (x − m_r)}.   (18)

We can easily find that these are just the standard normalized RBF (NRBF) net and the extended normalized RBF (ENRBF) net with k hidden units, which have been widely studied in the literature; see the reference list of [19]. In other words, the maximum likelihood (ML) learning on the normalized RBF nets is a special case of the ML learning on the alternative model of mixture of experts by the EM algorithm.
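A minimal sketch of eq. (18) follows (illustrative only; names are assumptions, and a diagonal Γ_j is used for brevity).

```python
import numpy as np

def nrbf_forward(x, m, gamma_diag, c, W=None):
    """Normalized RBF output of eq. (18).
       m: (k, d) centers, gamma_diag: (k, d) diagonal of Gamma_j, c: (k, m) output weights.
       If W (k, d, m) is given, the extended (ENRBF) form sum_j (W_j^T x + c_j) phi_j(x) is used."""
    u = x - m                                           # (k, d) differences x - m_j
    act = np.exp(-0.5 * np.sum(u * u / gamma_diag, axis=1))
    phi = act / act.sum()                               # normalized basis values phi_j(x)
    if W is None:                                       # NRBF: sum_j c_j phi_j(x)
        return phi @ c
    lin = np.einsum("jdm,d->jm", W, x) + c              # ENRBF: W_j^T x + c_j
    return phi @ lin
```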
In the existing literature, the RBF net is expected to be trained by least-squares learning, a special case of ML learning. However, due to the difficulty of determining the parameters Γ_j, m_j of the basis functions, in practice the learning is usually made approximately in two separate steps. First, Γ_j = (σ_j^g)² I with (σ_j^g)² being estimated roughly and heuristically, and some cluster analysis (e.g., the k-means algorithm) is used to group the data set D_x = {x_i}_{i=1}^N into k clusters, whose centers are then used as m_j, j = 1, ..., k. Second, the output-layer parameters c_j, j = 1, ..., k are determined by least-squares learning. With this two-stage training approach, the centers m_j, j = 1, ..., k are obtained directly from the input data without considering how to get the best regression E(z|x).
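As a point of comparison with the EM-based training of Tab. 7, here is a minimal sketch of the conventional two-stage procedure just described (k-means for the centers, a heuristic common width, then least squares for the output weights); all names are assumptions of this sketch.

```python
import numpy as np

def two_stage_nrbf(X, Z, k, iters=50, seed=0):
    """Conventional two-stage NRBF training: k-means centers, heuristic width, least-squares c_j."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)].copy()   # step 1a: k-means centers
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - m[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                m[j] = X[labels == j].mean(axis=0)
    sigma2 = np.mean(((X[:, None, :] - m[None]) ** 2).sum(-1))  # step 1b: rough common width
    act = np.exp(-0.5 * ((X[:, None, :] - m[None]) ** 2).sum(-1) / sigma2)
    Phi = act / act.sum(axis=1, keepdims=True)            # normalized basis values phi_j(x_i)
    C, *_ = np.linalg.lstsq(Phi, Z, rcond=None)           # step 2: least-squares output weights
    return m, sigma2, C
```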
Moreover, there is another major problem: how to select the number of basis functions, which affects the performance considerably. In [19], via setting up connections between RBF nets and kernel regression estimators, a number of theoretical results have been obtained on the upper bounds of the convergence rate of the approximation error with respect to the number of basis functions, and on the upper bounds for the pointwise convergence rate and L2 convergence rate of the best consistent estimator with respect to both the samples and the number of basis functions. However, these theoretical results are not directly usable in practice. Rival Penalized Competitive Learning (RPCL) is able to automatically select the number of clusters and has thus been suggested for RBF nets [20]. However, although it works well experimentally, RPCL is a heuristic approach and still lacks theoretical justification.

From the connection eq. (18) between the BKYY Ying-based learning for mixture of experts and the normalized RBF nets, we can solve all the above problems. First, we can directly use the EM algorithm in Tab. 6 to train the NRBF and ENRBF nets to get all the parameters Γ_j, m_j, W_j, c_j, Σ_j, such that a globally more optimal solution is obtained. Specifically, the detailed form of the EM algorithm is given in Tab. 7, together with its adaptive variants. Second, we can select the number k of radial basis functions by the criterion in Part 4 of Tab. 7.
7 BCYY Learning and Partially Matched Learning

(1) BYY Learning with Convex Divergence. Instead of eq. [1.6], we use the convex divergence eq. [1.7] for the Bayesian Convex Ying-Yang (BCYY) learning. In this case, most of the results given in Theorems 3.1-3.4 in [1] do not hold anymore. However, we can still use eq. [1.17] and the general implementation technique of Sec. 3 in [1] for the minimization of F_two(M_1, M_2). Here, we only discuss such a learning for the mixture of experts given in Tab. 6. The key point is to still use the same Step 1 in Tab. 6, and then in Step 2 to directly maximize

L_{x,z} = ∫_{x,z} p_{M_{L_1}}(z,x) f(p_{M_2}(x,z)) dx dz   for a Ying-based system,
L_{z|x} = ∫_{x,z} p_{M_{L_1}}(z,x) f(p_{M_2}(z|x)) dx dz   for a Yang-based system,   (19)

with p_{M_2}(x,z), p_{M_2}(z|x) still given by Part 3 in Tab. 6, resulting in:

Step 2 (Yang-based): get w(j, x) = f'(p_{M_2}(z|x)) p_{M_2}(z|x) p(y_j=1|x,z), f'(u) = df(u)/du; update θ_j^new by solving Σ_{(x,z)∈D_{x,z}} w(j, x) d ln p(z|x, θ_j)/dθ_j = 0 and Σ_{(x,z)∈D_{x,z}} w(j, x) d ln p(y_j=1|x)/dV = 0.

Step 2 (Ying-based): get w(j, x) = f'(p_{M_2}(x,z)) p_{M_2}(x,z) p(y_j=1|x,z); update θ_j^new by solving Σ_{(x,z)∈D_{x,z}} w(j, x) d ln p(z|x, θ_j)/dθ_j = 0, and then get
  m_j^new = [1/(p(y_j=1) #D_{x,z})] Σ_{(x,z)∈D_{x,z}} w(j, x) x,
  Γ_j^new = [1/(p(y_j=1) #D_{x,z})] Σ_{(x,z)∈D_{x,z}} w(j, x) [x − m_j^new][x − m_j^new]^T.

For the same reason as in [15], when f(u) is monotonically increasing for positive u, e.g., f(u) = u^γ, 0 < γ < 1, the learning will give a more robust estimation for data of multiple modes with outliers or high overlap between clusters.
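A small illustration of the re-weighting that distinguishes this convex-divergence learning from the plain EM updates (Yang-based case, with f(u) = u^γ). It builds on the hypothetical `moe_forward` sketch given earlier, and all names are assumptions of this sketch.

```python
def convex_weights(x, z, V, W_list, c_list, var_list, gamma=0.5):
    """w(j, x) = f'(p_M2(z|x)) p_M2(z|x) p(y_j=1|x,z) with f(u) = u**gamma, 0 < gamma < 1."""
    p_zx, posterior = moe_forward(x, z, V, W_list, c_list, var_list)
    f_prime = gamma * p_zx ** (gamma - 1.0)   # f'(u) = gamma * u^(gamma - 1)
    # total weight gamma * p_zx**gamma * posterior: samples in low-density regions
    # (outliers) contribute little, which is the robustness noted in the text
    return f_prime * p_zx * posterior
```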
(2) Partially Matched BKYY Learnings. We relax the satisfaction of eq. [1.24] and consider the learning eq. [1.9] on the architecture of mixture of experts.

(i) Yang-based system. In this case, p_{M_{x|y}}(x|y) and p_{M_y}(y) are both free. From Theorem 4 in [1], L_{y|x}(x,z) in L_{y|x} of Tab. 6 should be replaced by

L_{y|x}(x,z) = Σ_{j=1}^k p(y_j=1|x,z) [ ln p(y_j=1|x) + ln(1 + p(y_j=1|x,z)/p(y_j=1|x)) ]   (20)

for parameter learning. Moreover, for model selection we need to add to J_1(k) an extra term given by

J_a(k) = Σ_{(x,z)∈D_{x,z}} Σ_{j=1}^k p(y_j=1|x,z) ln(1 + p(y_j=1|x,z)/p(y_j=1|x)).   (21)

(ii) Ying-based system. In this case, p_{M_{y|x}}(y|x) = p(y|x) and p_{M_y}(y) are free. From Theorems 2 & 3 in [1], we have

p(y_j=1|x) = p(x|φ_j) p(y_j=1) / Σ_{j=1}^k p(x|φ_j) p(y_j=1),
p(y_j=1) = 0.5 [ (1/#D_{x,z}) Σ_{(x,z)∈D_{x,z}} p(y_j=1|x,z) + (1/#D_x) Σ_{x∈D_x} p(y_j=1|x) ].   (22)

Therefore, in Tab. 6 we should replace p(y_j=1|x,z) in L_{x|y}(x,z) by p(y_j=1|x,z) + p(y_j=1|x), let p(y_j=1) in H_y be given by eq. (22), and replace H_{y|x,z}(x,z) by H_{y|x,z}(x,z) = −Σ_{j=1}^k p(y_j=1|x,z) ln p(y_j=1|x,z) − Σ_{j=1}^k p(y_j=1|x,z) ln p(y_j=1|x). Then we modify the algorithm in Tab. 6 into:
Step 1: Run the original Step 1, plus getting p(y_j=1|x) by eq. (22).
Step 2: Get p(y_j=1) by eq. (22), update θ_j^new as in the original Step 2, and update
  m_j^new = [1/(2 p(y_j=1) #D_{x,z})] Σ_{(x,z)∈D_{x,z}} [p(y_j=1|x,z) + p(y_j=1|x)] x,
  Γ_j^new = [1/(2 p(y_j=1) #D_{x,z})] Σ_{(x,z)∈D_{x,z}} [p(y_j=1|x,z) + p(y_j=1|x)] [x − m_j^new][x − m_j^new]^T.

Also, the selection of k should be made by (with h_j = p(y_j=1|x,z))

J_1(k) = Σ_{(x,z)∈D_{x,z}} Σ_{j=1}^k [ h_j ln h_j + p(y_j=1|x) ln p(y_j=1|x) ] + J_2(k),   (23)
J_2(k) = Σ_{j=1}^k [ −2 p(y_j=1) ln p(y_j=1) + p(y_j=1) ln|Γ_j| + 0.5 Σ_{(x,z)∈D_{x,z}} h_j ln|Σ_j| ].

8 Conclusions

This paper has further interpreted the theory through its uses in developing models and algorithms for dependence reduction, ICA and blind source separation, linear dimension reduction, and supervised classification and regression by feedforward nets, mixture of experts and RBF nets.
References
1. Xu, L., "Bayesian Ying-Yang System and Theory as A Unified Statistical Learning Approach (II): From Unsupervised Learning to Supervised Learning and Temporal Modeling", in the same proceedings, pp. 25-42.
2. Amari, S.-I., Cichocki, A. & Yang, H., "A new learning algorithm for blind separation of sources", in Advances in NIPS 8, MIT Press, 1996, pp. 757-763.
3. Barlow, H.B., "Unsupervised learning", Neural Computation, 1, 295-311, 1989.
4. Bell, A.J. & Sejnowski, T.J., "An information-maximization approach to blind separation and blind deconvolution", Neural Computation, 7 (1995), 1129-1159.
5. Dempster, A., Laird, N.M. & Rubin, D.B., "Maximum-likelihood from incomplete data via the EM algorithm", J. Royal Statistical Society, B, 39(1), 1-38, 1977.
6. Gaeta, M. & Lacoume, J.-L., "Source separation without a priori knowledge: the maximum likelihood solution", in Proc. EUSIPCO 90, pp. 621-624, 1990.
7. Jacobs, R.A., Jordan, M.I., Nowlan, S.J. & Hinton, G.E., "Adaptive mixtures of local experts", Neural Computation, 3 (1991), 79-87.
8. Jutten, C., "From source separation to independent component analysis: An introduction to the special session", Proc. of 1997 European Symp. on Artificial Neural Networks, Bruges, April 16-18, pp. 243-248.
9. Jordan, M.I. & Jacobs, R.A., "Hierarchical mixtures of experts and the EM algorithm", Neural Computation, 6 (1994), 181-214.
10. Jordan, M.I. & Xu, L., "Convergence results for the EM approach to mixtures of experts", Neural Networks, 8(9) (1995), 1409-1431.
11. Moody, J. & Darken, J., "Fast learning in networks of locally-tuned processing units", Neural Computation, 1 (1989), 281-294.
12. Pham, D.T., Garat, P. & Jutten, C., "Separation of a mixture of independent sources through a maximum likelihood approach", in Signal Processing VI: Theories and Applications, J. Vandewalle et al. (eds), Elsevier Science Pub., 1992, pp. 771-774.
13. Taleb, A. & Jutten, C., "Nonlinear source separation: the post-nonlinear mixtures", the same proceedings as [8], pp. 279-284, 1997.
14. Xu, L., "Bayesian Ying-Yang System and Theory as A Unified Statistical Learning Approach: (I) Unsupervised and Semi-Unsupervised Learning", invited paper, in S. Amari and N. Kassabov, eds., Brain-like Computing and Intelligent Information Systems, Springer-Verlag, 1997, pp. 241-274.
15. Xu, L., "Bayesian Ying-Yang Machine, Clustering and Number of Clusters", Pattern Recognition Letters, 1997, in press.
16. Xu, L., Cheung, C.C., Ruan, J. & Amari, S.-I., "Nonlinearity and Separation Capability: Further Justification for the ICA Algorithm with A Learned Mixture of Parametric Densities", the same proceedings as [8], pp. 291-296, 1997.
17. Xu, L., Jordan, M.I. & Hinton, G.E., "An alternative model for mixtures of experts", in Advances in Neural Information Processing Systems, J.D. Cowan et al., eds., MIT Press, Cambridge, MA, 1995, pp. 633-640.
18. Xu, L. & Jordan, M.I., "A modified gating network for the mixtures of experts architecture", in Proc. WCNN94, San Diego, pp. II-405 - II-410.
19. Xu, L., Krzyzak, A. & Yuille, A.L., "On Radial Basis Function Nets and Kernel Regression: Statistical Consistency, Convergence Rates and Receptive Field Size", Neural Networks, 5(4) (1994), 609-628.
20. Xu, L., Krzyzak, A. & Oja, E., "Rival Penalized Competitive Learning for Clustering Analysis, RBF Net and Curve Detection", IEEE Trans. on Neural Networks, 4(4) (1993), 636-649.