PROCEEDINGS  Open Access

ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval

Jingyan Wang1,2, Xin Gao1*, Quanquan Wang2, Yongping Li2,3

From The 2011 International Conference on Intelligent Computing (ICIC 2011), Zhengzhou, China. 11-14 August 2011

Abstract

Background: The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity/similarity measure for comparing a pair of proteins. Such pairwise measures neglect the distribution of the other proteins and thus cannot satisfy the need for high accuracy in retrieval systems. Recent work in the machine learning community has shown that exploiting the global structure of the database and learning contextual dissimilarity/similarity measures can improve retrieval performance significantly. However, most existing contextual dissimilarity/similarity learning algorithms work in an unsupervised manner and do not use the known class labels of the proteins in the database.
Results: In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC regularizes an existing dissimilarity measure $d_{ij}$ by considering the contextual information of the proteins. The context of a protein is defined by its neighboring proteins. The basic idea is that, for a pair of proteins $(i, j)$, if their contexts $N(i)$ and $N(j)$ are similar to each other, the two proteins should also have a high similarity. We implement this idea by regularizing $d_{ij}$ by a factor learned from the contexts $N(i)$ and $N(j)$. Moreover, we divide the context into hierarchical sub-contexts and obtain a contextual dissimilarity vector for each protein pair. Using the class label information of the proteins, we select relevant protein pairs (pairs of proteins that have the same class label) and irrelevant protein pairs (pairs with different labels), and train an SVM model to distinguish between their contextual dissimilarity vectors. The SVM model is further used to learn a supervised regularizing factor. Finally, with the new Supervised learned Dissimilarity measure, we update the Protein Hierarchical Context Coherently in an iterative algorithm, ProDis-ContSHC. We test the performance of ProDis-ContSHC on two benchmark sets, i.e., the ASTRAL 1.73 database and the FSSP/DALI database. Experimental results demonstrate that retrieval systems using our supervised contextual dissimilarity measures significantly outperform those using context-free dissimilarity/similarity measures and those using unsupervised contextual dissimilarity measures that ignore the class label information.

Conclusions: Using the contextual proteins with their class labels in the database, we can dramatically improve the accuracy of pairwise dissimilarity/similarity measures for protein retrieval tasks.
In this work, for the first time, we propose the idea of supervised contextual dissimilarity learning, resulting in the ProDis-ContSHC algorithm. Among the different contextual dissimilarity learning approaches that can be used to compare a pair of proteins, ProDis-ContSHC provides the highest accuracy. Finally, ProDis-ContSHC compares favorably with other methods reported in the recent literature.

*Correspondence: xin.gao@kaust.edu.sa
1 King Abdullah University of Science and Technology (KAUST), Mathematical and Computer Sciences and Engineering Division, Thuwal, 23955-6900, Saudi Arabia
Full list of author information is available at the end of the article

Wang et al. BMC Bioinformatics 2012, 13(Suppl 7):S2
http://www.biomedcentral.com/1471-2105/13/S7/S2

© 2012 Wang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Proteins are linear chains of amino acids. The polypeptide chains are folded into complicated three-dimensional (3D) structures. With different structures, proteins are able to perform specific functions in biological processes [1-14]. To study the structure-function relationship, biologists have a high demand for protein structure retrieval systems that search for similar sequences or 3D structures [15]. Protein pairwise comparison is one of the main functions of such retrieval systems [16]. The need to retrieve or classify proteins using 3D structure or sequence-based similarity underlies many biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as sources for new treatments. In folding simulations, similar intermediate structures might be indicative of a common folding pathway [17].

Related work

The structural comparison problem in protein structure retrieval systems has been extensively studied. Aung and Tan [18] proposed a rapid protein structure retrieval system named ProtDex2, which adopted information retrieval techniques to perform rapid database searches without accessing each 3D structure in the database. The retrieval process was based on an inverted-file index constructed on the feature vectors of the relationships between the secondary structure elements (SSEs) of all the protein structures in the database. To evaluate the similarity score between a query protein structure and a protein structure in the database, they adopted and modified the well-known ∑(tf × idf) scoring scheme commonly used in document retrieval systems [19].

In [20,21], a 3D shape-based approach was presented by Daras et al. The method relied primarily on the geometric 3D structure of the proteins, which was produced from the corresponding PDB files, and secondarily on their primary and secondary structures. Additionally, characteristic attributes of the primary and secondary structures of the protein molecules were extracted, forming attribute-based descriptor vectors. The descriptor vectors were then weighted and an integrated descriptor vector was produced. To compare a pair of protein descriptor vectors, Daras et al. [20,21] used two similarity metrics. The first was based on the Euclidean distance [22] between the descriptor vectors, and the second was based on the Mean Euclidean Distance Measure [20,21].

Later, Marsolo and Parthasarathy presented two normalized, stand-alone representations of proteins that enabled fast and efficient object retrieval based on sequence or structure information [17,23]. For range queries, they specified a range value r and retrieved all the proteins from the database that lay within a distance r of the query. In their work, distance referred to the standard Euclidean distance [22].

In [24], Sael et al. introduced a global surface shape representation by 3D Zernike descriptors for protein structure similarity search. In their study, three distance measures were used for comparing the 3D Zernike descriptors of protein surface shapes, i.e., the Euclidean distance, the Manhattan distance [25], and a correlation coefficient-based distance.

A fast protein comparison algorithm, IR Tableau, was developed by Zhang et al. for protein retrieval purposes in [26]. It leveraged the tableau representation to compare protein tertiary structures, and compared tableaux using feature indexing techniques. In IR Tableau [26], a number of similarity functions were applied for comparing a pair of protein vectors, i.e., cosine similarity [27], the Jaccard index [28], the Tanimoto coefficient [29], and the Euclidean distance.

The basic components of a protein retrieval system include a way to represent proteins and a dissimilarity measure that compares a pair of proteins. Most of the aforementioned studies focus on the feature representation of the proteins while neglecting the comparison of the feature vectors. Such studies usually apply a simple similarity or dissimilarity measure for the comparison of the feature vectors, such as the Euclidean distance measure used in [17,20,21,23,24,26].

Most of the existing protein comparison techniques suffer from the following two bottlenecks:

The dissimilarity measure is a pairwise distance measure, computed by considering only the query protein $x_0$ and a database protein $x_i$, as $d(x_0, x_i)$. It does not consider the other proteins in the database, neglecting the effects of the contextual proteins. If we consider the distribution of the entire protein database $X = \{x_j\}$, $j = 1, \ldots, N$, when computing the dissimilarity as $d(x_0, x_i \mid X)$, the retrieval performance may benefit from the contextual proteins $\{x_j\}$, $j \neq i$.

The dissimilarity measure is computed in an unsupervised way, which does not use the known class labels $L = \{l_j\}$, $j = 1, \ldots, N$, in the database. Although we may not know whether $x_0$ and $x_i$ belong to the same class (having the same folding type etc., $l_0 = l_i$) or not ($l_0 \neq l_i$), we do have prior information about the labels $L$ of the other proteins. In all of the previous studies, the prior class labels $L$ were not used to calculate the dissimilarity $d(x_0, x_i)$.

Due to these two bottlenecks, traditional protein retrieval systems using pairwise and unsupervised dissimilarity measures usually do not achieve satisfactory performance, even though many effective protein feature descriptors have been developed and used. In this paper, we investigate the dissimilarity measure and propose a novel learning algorithm to improve the performance of a given dissimilarity measure.

Recent research in machine learning points out that contextual information can be used to improve dissimilarity or similarity measures. Such algorithms are called contextual or context-sensitive dissimilarity learning [30-34]. Unlike the traditional pairwise distance $d(x_0, x_i)$, which considers only the two proteins being compared, $x_0$ and $x_i$, contextual dissimilarity also considers the contextual proteins $X$ when computing the dissimilarity $d(x_0, x_i \mid X)$. The existing contextual similarity learning algorithms can mainly be classified into the following two categories:

Dissimilarity regulation
The first contextual dissimilarity measure (CDM) was proposed by Jegou et al. in [30,31]. The CDM significantly improved the accuracy of image search. It took the local distribution of the vectors into account and iteratively estimated distance update terms in the spirit of Sinkhorn's scaling algorithm [35], thereby modifying the neighborhood structure. This regularization was motivated by the observation that a good ranking was usually not symmetric in an image search system. In this paper, we focus on this type of contextual dissimilarity learning.

Similarity transduction on graph
In [32,33], Bai et al. provided a novel perspective on shape retrieval tasks by considering the existing shapes as a group and studying their similarity measures to the query shape in a graph structure. For a given similarity measure, a new similarity was learned through graph transduction. The learning was done in an iterative manner so that the neighbors of a given shape influenced its final similarity to the query. The basic idea is related to the PageRank algorithm, which forms a foundation of Google Web search. This method was further improved by Wang et al. in [36]. Similar learning algorithms were also used to rank proteins in a protein database, as in [37,38]. Kuang et al. proposed a general graph-based propagation algorithm called MotifProp to detect more subtle similarity relationships than the pairwise comparison methods. In [38], Weston et al. reviewed RankProp, a ranking algorithm that exploited the global network structure of the similarity relationships among proteins in a database by performing a diffusion operation on a protein similarity network with weighted edges.

The drawbacks of the above algorithms are twofold. On the one hand, such algorithms do not utilize the class label information $L$ of the database entries, and thus work in an unsupervised way. The only one that uses $L$ is [38]. However, the algorithm proposed in [38] has basically the same framework as [32,33,37], i.e., the protein label information $L$ is only used to estimate the parameters. On the other hand, the "context" is fixed in the iterative algorithms of most of the transduction methods [32,33,37,38]. A better way is to update the context using the learned similarity measures, as in [30,31].

To overcome these drawbacks, we develop a novel contextual dissimilarity learning algorithm to improve the performance of a protein retrieval system. The novel dissimilarity measure is regularized by the dissimilarity of the contextual proteins (neighboring proteins), while the contextual proteins are updated coherently using the learned dissimilarities. The basic idea comes from [39,40], which assume that if two local features in two images are similar, their contexts are likely to be similar. In comparison to [30,31], which use the neighborhood as a single context, we partition the neighborhood into several hierarchical sub-contexts corresponding to the learned dissimilarities. With the sub-contexts, we compute the dissimilarity between the sub-contexts of a pair of proteins and construct the hierarchical sub-contextual dissimilarity vector. Moreover, using the label information $L$, we select pairs of proteins belonging to the same class, $\{(x_i, x_j) \mid l_i = l_j\}$, as the relevant protein pairs. We also select the irrelevant protein pairs $\{(x_k, x_l) \mid l_k \neq l_l\}$. Finally, we train a support vector machine (SVM) [41] to distinguish between the relevant and the irrelevant protein pairs. The output of the SVM is further used to regularize the dissimilarity in an iterative manner.

Methods

This section describes our contextual protein-protein dissimilarity learning algorithm, which utilizes the contextual proteins and the class label information of the database proteins to index and search protein structures efficiently. We will demonstrate that our idea is general in the sense that it can be used to improve existing similarity/dissimilarity measures.

Protein structure retrieval framework

In a protein retrieval system, the query and the database proteins are first represented as feature vectors. Here, we denote the query protein feature vector as $x_0$ and the database protein feature vectors as $X = \{x_1, x_2, \ldots, x_N\}$, where $N$ is the number of proteins in the database. Then, based on a distance measure $d_{0i} = d(x_0, x_i)$, we compute the distances between $x_0$ and all the proteins in the database, i.e., $\{d_{01}, d_{02}, \ldots, d_{0N}\}$. The database proteins are then ranked according to these distances. The $k$ most similar ones are returned as the retrieval results. We illustrate the outline of the protein retrieval system in Figure 1.
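
The ranking step of this framework can be sketched in a few lines of Python. This is a minimal illustration rather than the system used in the paper: the function names `euclidean` and `retrieve`, the toy feature vectors, and the choice of the Euclidean distance as $d(x_0, x_i)$ are assumptions made for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors (assumed baseline measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve(query, database, k, distance=euclidean):
    """Return the indices of the k database vectors closest to the query."""
    distances = [distance(query, x) for x in database]
    # Rank database entries by ascending distance to the query.
    ranking = sorted(range(len(database)), key=lambda i: distances[i])
    return ranking[:k]

# Toy feature vectors (hypothetical data, not protein descriptors).
database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1]]
top = retrieve([0.0, 0.1], database, k=2)
```

Any other pairwise measure can be passed through the `distance` argument, which is the point at which a learned dissimilarity would be plugged in.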

ProDis-ContSHC: the contextual dissimilarity learning algorithm

In this section, we introduce the novel contextual protein-protein dissimilarity learning algorithm. We first define the hierarchical context of a protein, which is used to compute the contextual dissimilarity and regularize the dissimilarity measure. Then a more discriminative regularization factor is learned using the class labels of the database proteins. Finally, we propose the Supervised regulating of Protein-protein Dissimilarity and updating of the Hierarchical Context Coherently in an iterative manner, resulting in the ProDis-ContSHC algorithm.

Using hierarchical context to regularize the dissimilarity measure

Here, we define a protein $x_i$'s context as its $K$ nearest neighbors $N(i)$. The dissimilarity between two contexts is measured by the contextual dissimilarity

$$r_{ij} = \frac{1}{K^2} \sum_{m \in N(i),\, n \in N(j)} d_{mn} \tag{1}$$

The contextual dissimilarity is illustrated in Figure 2(a). Furthermore, instead of averaging all the pairwise dissimilarities between the two contexts $N(i)$ and $N(j)$, we propose the hierarchical context by splitting the context $N(i)$ into $P$ "sub-contexts" $N_p(i)$, $p = 1, \ldots, P$, according to their distances to $x_i$. To be more specific, the sub-context $N_p(i)$ is defined as

$$N_p(i) = \{x_j \mid x_j \text{ is among the } k'\text{-th to } k''\text{-th nearest neighbors of } x_i, \text{ according to } \{d_{ij}\},\ j \in \{1, \ldots, i-1, i+1, \ldots, N\}\} \tag{2}$$

where $k' = (p-1) \times \kappa + 1$, $k'' = p \times \kappa$, $\kappa$ is the size of a sub-context, and $P$ is the number of sub-contexts. In this way, we can compute the contextual dissimilarity by averaging the dissimilarities of the sub-contexts as

$$r_{ij} = \frac{1}{P} \sum_{p} \frac{1}{\kappa^2} \sum_{m \in N_p(i),\, n \in N_p(j)} d_{mn} = \frac{1}{P} \sum_{p} d_{ij}(p) \tag{3}$$

where $d_{ij}(p) = \frac{1}{\kappa^2} \sum_{m \in N_p(i),\, n \in N_p(j)} d_{mn}$, $p = 1, \ldots, P$, is the hierarchical sub-contextual dissimilarity. Figure 2(b) illustrates the idea of sub-contextual dissimilarity.
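
The sub-context construction and the sub-contextual dissimilarities $d_{ij}(p)$ can be sketched as follows. This is an illustrative reading of Eqs. (2) and (3), not the authors' code: the function names are hypothetical, the dissimilarity matrix `D` is assumed to be a full square list of lists, and ties between equally distant neighbors are broken by index order, which the paper does not specify.

```python
def sub_contexts(D, i, kappa, P):
    """Split the P*kappa nearest neighbours of protein i into P sub-contexts."""
    others = [j for j in range(len(D)) if j != i]
    ranked = sorted(others, key=lambda j: D[i][j])  # ties keep index order
    return [ranked[p * kappa:(p + 1) * kappa] for p in range(P)]

def sub_context_dissimilarities(D, i, j, kappa, P):
    """Vector [d_ij(1), ..., d_ij(P)]: mean distance between matching sub-contexts."""
    vec = []
    for ctx_i, ctx_j in zip(sub_contexts(D, i, kappa, P),
                            sub_contexts(D, j, kappa, P)):
        total = sum(D[m][n] for m in ctx_i for n in ctx_j)
        vec.append(total / (kappa ** 2))
    return vec

def contextual_dissimilarity(D, i, j, kappa, P):
    """r_ij: average of the P sub-contextual dissimilarities."""
    return sum(sub_context_dissimilarities(D, i, j, kappa, P)) / P

# Toy data: six points on a line forming two clusters (hypothetical).
positions = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in positions] for a in positions]
vec = sub_context_dissimilarities(D, 0, 1, kappa=1, P=2)
```

On this toy data, pairs within a cluster yield small entries in the vector and pairs across clusters yield large ones, which is the signal the later regularization steps rely on.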

Intuitively, if the contexts of two proteins are dissimilar to each other ($r_{ij}$ is higher than the average), the two proteins should have a higher dissimilarity value, and vice versa. We implement this by multiplying $d_{ij}$ by a coefficient, which is the ratio of $r_{ij}$ to the average of all the contextual dissimilarities, $\bar{r} = \frac{1}{N^2} \sum_{i,j} r_{ij}$:

$$\tilde{d}_{ij} = d_{ij} \times \frac{r_{ij}}{\bar{r}} = d_{ij} \times \delta_{ij} \tag{4}$$

Here, $\delta_{ij} = \frac{r_{ij}}{\bar{r}}$ is a regularization factor for $d_{ij}$, with which we can improve $d_{ij}$ using its contextual information.

Figure 1 Flowchart of protein retrieval systems.

Moreover, this procedure can be done in an iterative manner. We can use the regularized dissimilarity measure $\tilde{d}_{ij}$ to re-define the new hierarchical context $N_p(i)$. In this way, we can learn the protein-protein dissimilarity $d_{ij}$ and the hierarchical context $N_p(i)$ coherently.
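
A minimal sketch of the unsupervised regularization in Eq. (4) is given below. It assumes that the contextual dissimilarities $r_{ij}$ have already been computed and stored in a square matrix `R`; the function name `regularize` and the toy matrices are hypothetical.

```python
def regularize(D, R):
    """Scale each d_ij by delta_ij = r_ij / r_bar, where r_bar is the mean of R."""
    n = len(D)
    r_bar = sum(sum(row) for row in R) / (n * n)
    return [[D[i][j] * R[i][j] / r_bar for j in range(n)] for i in range(n)]

# Toy matrices (hypothetical values).
D = [[0, 2], [2, 0]]
R = [[1, 3], [3, 1]]
D_new = regularize(D, R)
```

Pairs whose contexts are more dissimilar than average ($r_{ij} > \bar{r}$) have their dissimilarity increased, and pairs with more similar contexts have it decreased.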

Supervised regularization factor learning

We utilize the label information $L = \{l_1, \ldots, l_N\}$ of the database proteins to learn a better regularization factor $\delta_{ij}$. The class information is adopted in both the intra-class and the inter-class dissimilarity computation to maximize the Fisher criterion [42] for protein class separability. First, we select a number of protein pairs $\{\gamma = (i, j) \mid i, j = 1, \ldots, N\}$. For each pair, we compute the hierarchical contextual dissimilarities and organize them as a $P$-dimensional dissimilarity vector $\mathbf{d}_\gamma = [d_{ij}(1)\ d_{ij}(2)\ \ldots\ d_{ij}(P)]$, as shown in Figure 3. Then, inspired by the score fusion rule [43,44], we use $L$ to label each pair $\gamma = (i, j)$ as a relevant pair ($y_\gamma = +1$) if $l_i = l_j$, or as an irrelevant pair ($y_\gamma = -1$) otherwise. With the training samples $\Gamma = \{(\mathbf{d}_\gamma, y_\gamma)\}$, $\gamma = 1, \ldots, \binom{N}{2}$, we train a binary SVM [41] classifier to distinguish between the relevant pairs and the irrelevant pairs. The publicly available package SVMlight [45] is applied to implement the SVM on our training set $\Gamma$. This package allows us to optimize a number of parameters and offers the option of using different kernel functions to obtain the best classification performance [46]. The separating hyperplane generated by the SVM model is given by

$$f(\mathbf{d}) = \mathbf{d} \cdot \mathbf{w} + b \tag{5}$$

where $\mathbf{w}$ is a vector orthogonal to the hyperplane and $b$ is an offset parameter. They are chosen to minimize $\|\mathbf{w}\|^2$ subject to the conditions

$$y_\gamma (\mathbf{d}_\gamma \cdot \mathbf{w} + b) \geq 1 \tag{6}$$

for all $1 \leq \gamma \leq \binom{N}{2}$, where $\binom{N}{2}$ is the total number of examples (protein pairs). An SVM model with a linear decision boundary is shown in Figure 3 to distinguish the relevant protein pairs from the irrelevant ones. Note that not all the $\binom{N}{2}$ possible protein pairs need to be included to train the SVM model (5). For any pair of proteins $(x_i, x_j)$, after we compute its contextual dissimilarity vector $\mathbf{d}_{ij}$, the trained SVM classifier is applied to get the distance of this point to the margin boundary of the SVM as $y_{ij} = f(\mathbf{d}_{ij})$. Apparently, $y_{ij}$ is a signed score that reflects the dissimilarity of the contexts of this pair of proteins: it is positive for pairs classified as relevant and negative for pairs classified as irrelevant.
Thus, it can be used to form a regularization factor as

$$\delta_{ij} = \exp\left(-\frac{y_{ij}}{\sigma}\right) = \exp\left(-\frac{\mathbf{d}_{ij} \cdot \mathbf{w} + b}{\sigma}\right) \tag{7}$$

where $\sigma$ is a parameter of the factor. With this regularization factor learned from the contextual proteins, we regularize the dissimilarity $d_{ij}$ of the protein pair $(x_i, x_j)$ as

$$\tilde{d}_{ij} = d_{ij} \times \delta_{ij} \tag{8}$$

Figure 2 Illustration of context-based dissimilarity and hierarchical context-based dissimilarity. The two proteins $x_i$ and $x_j$, on which the dissimilarity is to be measured, are in the first row. The nearest neighbors of these two proteins are listed below them as the context, respectively. (a) The traditional context $N(i)$; (b) The proposed hierarchical context $N_p(i)$, $p = \{1, 2, 3\}$.

Updating the context and dissimilarity coherently

With the learned dissimilarity measure $\tilde{d}_{ij}$, we can re-define the "context" of a protein $x_i$ according to its dissimilarity to all the other proteins, $\tilde{d}_{ij}$, $j \in \{0, \ldots, i-1, i+1, \ldots, N\}$. The new "hierarchical context" relying on $\tilde{d}_{ij}$ is denoted as $\tilde{N}_p(i)$, $p = 1, \ldots, P$. In this way, we can develop an iterative algorithm that learns $d_{ij}$ and $N_p(i)$, $p = 1, \ldots, P$, coherently. Since $N_p(i)$ implicitly depends on $d_{ij}$ through the nearest neighbors of $x_i$, we use a fixed-point recursion method [47] to solve for $d_{ij}$. In each iteration, $N_p(i)$ is first computed using the previous estimate of $d_{ij}$, which is then updated by multiplying it by the regularization factor $\delta_{ij}$ as in (8). The iterations are carried out $T$ times, as given in Algorithm 1. With the learned dissimilarity matrix $D^{(T+1)}$, we use $D^{(T+1)}[0; 1, \ldots, N]$ as the dissimilarities between the query protein $x_0$ and the database proteins $\{x_1, \ldots, x_N\}$. Thus we can rank the database proteins in ascending order.
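
The supervised regularization factor of Eq. (7) can be illustrated with a short sketch. It assumes a linear model whose weights `w` and offset `b` have already been trained elsewhere (the paper uses SVMlight for this); the numeric values below are hypothetical and chosen only so that small contextual dissimilarities give a positive score. The `similarity` flag is an added convenience that flips the sign of the exponent, as the paper describes for similarity measures.

```python
import math

def decision_value(d_vec, w, b):
    """Linear decision function f(d) = d . w + b, as in Eq. (5)."""
    return sum(di * wi for di, wi in zip(d_vec, w)) + b

def supervised_factor(d_vec, w, b, sigma, similarity=False):
    """delta = exp(-y / sigma) for dissimilarities, exp(+y / sigma) for similarities."""
    y = decision_value(d_vec, w, b)
    sign = 1.0 if similarity else -1.0
    return math.exp(sign * y / sigma)

# Hypothetical trained parameters, not values from the paper.
w, b, sigma = [-1.0, -1.0], 1.0, 1.0
delta_close = supervised_factor([0.2, 0.3], w, b, sigma)  # pair scored as relevant
delta_far = supervised_factor([1.0, 1.0], w, b, sigma)    # pair scored as irrelevant
```

A pair scored as relevant receives a factor below 1, so its dissimilarity shrinks after the update in Eq. (8), while a pair scored as irrelevant receives a factor above 1.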

Efficient implementation of ProDis-ContSHC

The proposed learning algorithm is time-consuming and is therefore not directly suitable for real-time protein retrieval systems. Here we propose several techniques to significantly improve the efficiency of the algorithm.

Similar to [33], to increase computational efficiency, it is possible to run ProDis-ContSHC on only part of the database of known proteins. Hence, for each query protein $x_0$, we first retrieve the $N' \ll N$ most similar proteins and perform ProDis-ContSHC to learn the dissimilarity matrix of size $(N'+1) \times (N'+1)$ for only those proteins. Then we calculate the new dissimilarity measure $D'_{(N'+1) \times (N'+1)}$ for only those $(N'+1)$ proteins. Here, we assume that all the relevant proteins are among the top $N'$ most similar proteins. This strategy is illustrated in Figure 4(a) and 4(b).

Figure 3 Differentiate relevant and irrelevant proteins by classification. $(x_i, x_j)$ is assumed to be a relevant pair and $(x_i, x_k)$ is assumed to be an irrelevant pair. The contextual dissimilarity vectors of both pairs are distinguished by a binary SVM model.

Most dissimilarity and similarity measures are symmetric, i.e., $d_{ij} = d_{ji}$. As can be observed in (13), the regularization of $d_{ij}$ is also symmetric. Therefore, it is possible to develop an efficient learning algorithm by using this property. In the algorithm, all the computation results for $(i, j)$ (such as $\mathbf{d}_{ij}$ and $\delta_{ij}$) can be used directly for $(j, i)$. In this way, we can save almost half of the computational time, as shown in Figure 4(c).

A bottleneck of ProDis-ContSHC may be the training procedure for the SVM model in each iteration. For a database of $N$ proteins belonging to $C$ classes, there are $\binom{N}{2}$ protein pairs, of which $\sum_{c=1}^{C} \binom{N_c}{2}$ are relevant pairs and $\sum_{c=1}^{C} \sum_{c'=c+1}^{C} N_c \times N_{c'}$ are irrelevant pairs, where $C$ is the number of protein classes and $N_c$ is the number of proteins in the $c$-th class ($\sum_{c=1}^{C} N_c = N$). There might be a huge number of protein pairs available for the SVM training. However, it is not necessary to include all of them in the training process. One can select a small but equal number of relevant and irrelevant pairs to train the SVM classifier. This is an effective way to reduce the training time of the SVM.
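
The balanced selection of training pairs can be sketched as follows. This is one possible way to do it, not the authors' procedure: uniform random sampling, the function name `sample_training_pairs`, and the fixed seed are assumptions made for the example.

```python
import random
from itertools import combinations

def sample_training_pairs(labels, n_each, seed=0):
    """Draw an equal number of relevant (+1) and irrelevant (-1) protein pairs."""
    relevant, irrelevant = [], []
    for i, j in combinations(range(len(labels)), 2):
        (relevant if labels[i] == labels[j] else irrelevant).append((i, j))
    # Never ask for more pairs than either group can provide.
    n = min(n_each, len(relevant), len(irrelevant))
    rng = random.Random(seed)
    positives = rng.sample(relevant, n)
    negatives = rng.sample(irrelevant, n)
    return [(pair, +1) for pair in positives] + [(pair, -1) for pair in negatives]

# Toy labels: 3 proteins in class "a" and 2 in class "b" (hypothetical).
labels = ["a", "a", "a", "b", "b"]
training = sample_training_pairs(labels, n_each=3)
```

With these toy labels there are 4 relevant and 6 irrelevant pairs out of 10 in total, matching the pair counts given above, and the sampler returns 3 of each.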

Figure 4 Efficient implementation of ProDis-ContSHC. (a) Performing ProDis-ContSHC on the original matrix of size $(N+1) \times (N+1)$ from the entire dataset; (b) Performing ProDis-ContSHC on a subset of the database proteins, i.e., a dissimilarity matrix of size $(N'+1) \times (N'+1)$; (c) Using the symmetry property of the dissimilarity matrix to reduce the training time.

Algorithm 1 ProDis-ContSHC: Supervised Learning of Protein Dissimilarity and Updating Hierarchical Context Coherently.

Require: Input $D = [d_{ij}]_{(N+1) \times (N+1)}$: matrix of size $(N+1) \times (N+1)$ of pairwise protein feature distances, where $x_0$ is the query protein and $\{x_1, \ldots, x_N\}$ are the database proteins;
Require: Input $\kappa$: size of the hierarchical sub-context;
Require: Input $P$: number of hierarchical sub-contexts;
Initialize the dissimilarity matrix: $D^{(1)} = D$;
for $t = 1, \ldots, T$ do
    Update the hierarchical context $N^{(t)}_p(i)$, $p = 1, \ldots, P$, for each protein $x_i$:

    $$N^{(t)}_p(i) = \{x_j \mid x_j \text{ is among the } k'\text{-th to } k''\text{-th nearest neighbors of } x_i, \text{ according to } D^{(t)}(i; 0, \ldots, N)\} \tag{9}$$

    where $k' = (p-1) \times \kappa + 1$, $k'' = p \times \kappa$, and $D^{(t)}(i; 0, \ldots, N) = [d^{(t)}_{i0} \cdots d^{(t)}_{iN}]$.
    Compute the contextual dissimilarity vector $\mathbf{d}^{(t)}_{ij}$ for each pair of proteins $(i, j)$, $i, j \in \{0, \ldots, N\}$:

    $$\mathbf{d}^{(t)}_{ij} = [d^{(t)}_{ij}(1)\ d^{(t)}_{ij}(2)\ \cdots\ d^{(t)}_{ij}(P)] \tag{10}$$

    where $d^{(t)}_{ij}(p) = \frac{1}{\kappa^2} \sum_{m \in N^{(t)}_p(i),\, n \in N^{(t)}_p(j)} d^{(t)}_{mn}$.
    Select relevant and irrelevant protein pairs and label them as $y_\gamma = +1$ and $y_\gamma = -1$ respectively; train an SVM model on their contextual dissimilarity vectors $\mathbf{d}^{(t)}_\gamma$ as

    $$f^{(t)}(\mathbf{d}) = \mathbf{w}^{(t)} \cdot \mathbf{d} + b^{(t)} \tag{11}$$

    Compute the distance to the SVM margin boundary for the contextual dissimilarity vector $\mathbf{d}^{(t)}_{ij}$ of each pair of proteins as $y^{(t)}_{ij} = f^{(t)}(\mathbf{d}^{(t)}_{ij})$, and set a regularization factor for this pair of proteins:

    $$\delta^{(t)}_{ij} = \exp\left(-\frac{y^{(t)}_{ij}}{\sigma}\right) \tag{12}$$

    Update the pairwise protein dissimilarity measures:
    for $i = 0, 1, \ldots, N$ do
        for $j = 0, 1, \ldots, N$ do

            $$d^{(t+1)}_{ij} = d^{(t)}_{ij} \times \delta^{(t)}_{ij} \tag{13}$$

        end for
    end for
    $D^{(t+1)} = [d^{(t+1)}_{ij}]_{(N+1) \times (N+1)}$.
end for
Output the dissimilarity matrix: $D^{(T+1)}$.
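
The control flow of Algorithm 1 can be sketched as below. This is a skeleton under stated assumptions rather than a reimplementation: the classifier is passed in as a callable `fit_scorer`, and the `toy_scorer` used here is a simple threshold rule standing in for the SVM trained with SVMlight, so it does not use class labels. The input matrix is assumed to be symmetric, which lets each pair be computed once and mirrored.

```python
import math

def sub_contexts(D, i, kappa, P):
    """Split the P*kappa nearest neighbours of protein i into P sub-contexts."""
    ranked = sorted((j for j in range(len(D)) if j != i), key=lambda j: D[i][j])
    return [ranked[p * kappa:(p + 1) * kappa] for p in range(P)]

def prodis_iterations(D, kappa, P, T, sigma, fit_scorer):
    """Iteratively update a symmetric dissimilarity matrix.

    fit_scorer(vectors) must return a function f(vector) -> signed score,
    positive for pairs judged relevant (hypothetical interface).
    """
    n = len(D)
    D = [row[:] for row in D]  # work on a copy
    for _ in range(T):
        ctx = [sub_contexts(D, i, kappa, P) for i in range(n)]
        vectors = {}
        for i in range(n):
            for j in range(i, n):  # symmetry: compute each pair once
                vectors[(i, j)] = [
                    sum(D[m][q] for m in a for q in b) / kappa ** 2
                    for a, b in zip(ctx[i], ctx[j])
                ]
        f = fit_scorer(vectors)
        new_D = [[0.0] * n for _ in range(n)]
        for (i, j), vec in vectors.items():
            delta = math.exp(-f(vec) / sigma)
            new_D[i][j] = new_D[j][i] = D[i][j] * delta
        D = new_D
    return D

def toy_scorer(vectors):
    """Stand-in for SVM training: score = global mean minus the vector's mean."""
    means = [sum(v) / len(v) for v in vectors.values()]
    threshold = sum(means) / len(means)
    return lambda vec: threshold - sum(vec) / len(vec)

positions = [0, 1, 2, 10, 11, 12]  # two toy clusters (hypothetical data)
D0 = [[abs(a - b) for b in positions] for a in positions]
D1 = prodis_iterations(D0, kappa=1, P=2, T=1, sigma=5.0, fit_scorer=toy_scorer)
```

Replacing `toy_scorer` with a function that trains a real classifier on labeled pairs and returns its decision function would give the supervised variant described in the algorithm.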

Benchmark sets

To evaluate the proposed ProDis-ContSHC algorithm, we conduct experiments on two different benchmark sets, i.e., the ones used in [21] and [26] respectively.

ASTRAL 1.73 protein domain dataset

Following [26], we use the following database and queries as our first benchmark set:

Database: The ASTRAL 1.73 [48] 95% sequence-identity non-redundant dataset is used as the protein database. We generate our index database from the tableau dataset published by Stivala et al. [49], which contains 15,169 entries.

Queries: A query dataset containing 200 randomly selected protein domains is used in our experiment. For each query, a list that contains all the proteins in the respective index database is returned with the ranking scores.

We generate a vector of features $x$ for a given protein based on its tableau representation [49].

FSSP/DALI protein dataset

To evaluate the performance of the proposed methods, a portion of the FSSP database [50] is selected as in [21]. This dataset has 3,736 proteins classified into 30 classes. It is constructed according to the DALI algorithm [51,52]. The number of proteins per class varies from 2 to 561. For protein feature representation, the following two features are extracted from the 3D structure and the sequence of a protein as in [20,21]: the Polar-Fourier transform, resulting in the FT02 features, and Krawtchouk moments, resulting in the Kraw00 features. The descriptor vectors are weighted and an integrated descriptor vector $x$ is produced, which is used for the protein retrieval tasks.

Results and discussion

Results on ASTRAL 1.73 dataset

To compare a query protein $x_0$ to a protein $x_i$ in the ASTRAL 1.73 dataset, we compute the cosine similarity [27] as the baseline similarity measure, as in [26]. Cosine similarity [27] simply calculates the cosine of the angle between the two vectors $x_i$ and $x_j$:

$$s_{ij} = C(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|} \tag{14}$$

A higher cosine similarity score implies a smaller angle between the two vectors. Although ProDis-ContSHC is proposed to learn the protein-protein dissimilarity $d_{ij}$, it can be extended easily to learn the similarity $s_{ij}$ as well. The only difference is to set the regularization factor as $\delta_{ij} = \exp\left(\frac{y_{ij}}{\sigma}\right)$ instead of $\delta_{ij} = \exp\left(-\frac{y_{ij}}{\sigma}\right)$ in (7).
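
Eq. (14) translates directly into code. The sketch below is a plain-Python illustration with a hypothetical function name; it assumes neither vector is all zeros.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```

Orthogonal vectors score 0 and vectors pointing in the same direction score 1, regardless of their lengths.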

ROC curve and precision-recall curve performance

SCOP [53] fold classification is used as the ground truth to evaluate the performance of the different methods. To fairly compare the accuracy, we use the receiver operating characteristic (ROC) curve [54], the area under this ROC curve (AUC) [54], and the precision-recall curve [55]. Given a query protein $x_0$ which belongs to the SCOP fold $l_0$, the top $k$ proteins returned by the search algorithms are considered as the hits. The remaining proteins are considered as the misses. For the $i$-th ranked protein $x_i$ belonging to the SCOP fold $l_i$, if $l_i = l_0$ and $i \leq k$, the protein $x_i$ is defined as a true positive (TP). On the other hand, if $l_i \neq l_0$ and $i \leq k$, $x_i$ is defined as a false positive (FP). If $l_i \neq l_0$ and $i > k$, $x_i$ is defined as a true negative (TN). Otherwise, $x_i$ is a false negative (FN). Using these definitions, we can then compute the true positive rate (TPR or recall), the false positive rate (FPR), recall and precision as follows:

$$TPR = \frac{TP}{P} = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{N} = \frac{FP}{FP + TN} \tag{15}$$

$$Recall = \frac{TP}{TP + FN}, \qquad Precision = \frac{TP}{TP + FP} \tag{16}$$

In (15), $P = TP + FN$ and $N = FP + TN$ count the relevant and irrelevant proteins for the query. $TPR_k$, $FPR_k$, $Recall_k$, and $Precision_k$ are calculated for all $1 \leq k \leq N$, where $N$ is now the size of the database. The ROC defines a curve of points with $FPR_k$ as the abscissa and $TPR_k$ as the ordinate. Precision-recall defines a curve with $Recall_k$ and $Precision_k$ as abscissa and ordinate respectively. We use the area under the ROC curve (AUC) as a single-figure measurement for the quality of a ROC curve [54], and use the AUC averaged over all the queries to evaluate the performance of a method.
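
These evaluation measures can be computed from a ranked list of class labels, as in the sketch below. The function names are hypothetical, the AUC is obtained with the trapezoidal rule (one common choice; the paper does not state how the area is computed), and the query is assumed to have at least one relevant and one irrelevant protein in the database.

```python
def roc_points(ranked_labels, query_label):
    """(FPR_k, TPR_k) for k = 0..N, walking down the ranked list."""
    relevant = [lab == query_label for lab in ranked_labels]
    n_pos = sum(relevant)
    n_neg = len(relevant) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for is_relevant in relevant:
        if is_relevant:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def precision_recall(ranked_labels, query_label, k):
    """Precision and recall when the top k proteins are returned."""
    tp = sum(lab == query_label for lab in ranked_labels[:k])
    n_pos = sum(lab == query_label for lab in ranked_labels)
    return tp / k, tp / n_pos

ranked = ["a", "b", "a", "b"]  # toy ranking for a query of class "a"
area = auc(roc_points(ranked, "a"))
```

For this toy ranking, three of the four (relevant, irrelevant) pairs are ordered correctly, so the area is 0.75.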

To demonstrate the contribution of the supervised learning idea, we also compare ProDis-ContSHC with its unsupervised counterpart, ProDis-ContHC, i.e., the contextual dissimilarity algorithm based on unsupervised learning. ProDis-ContHC is also applied to improve the cosine similarity. We also compare with the widely used contextual dissimilarity measure (CDM) [30,31], which takes into account the local distribution of the vectors and iteratively estimates distance update terms in the spirit of Sinkhorn's scaling algorithm, thereby modifying the neighborhood structures. The performance of the different methods is compared in Figure 5.

Figure 5(a) shows the ROC curves of the original cosine similarity and of its versions improved by the three contextual similarity learning algorithms on the ASTRAL 1.73 [48] 95% dataset, with different numbers of proteins returned for each query. It can be seen from Figure 5(a) that the TPR of all the methods increases as the FPR grows. The reason is that, with the number of queries fixed, when the number $k$ of proteins returned for each query is very small, the returned proteins are not enough to "represent" the class features of the query, which causes a low TPR. Meanwhile, in this situation, most of the returned proteins belong to the same class as the query with high confidence, resulting in a low FPR. Moreover, the TPR is almost 100% when the FPR > 50%. The ROC curve of ProDis-ContSHC completely encloses the ROC curves of the other three methods, which implies that ProDis-ContSHC is the best method among the four. This also indicates that supervised learning is better than unsupervised learning for this purpose. ProDis-ContHC, on the other hand, is the second best method among the four, which demonstrates the contribution of the hierarchical sub-context idea to the traditional contextual dissimilarity measures. The overall AUC results are listed in Table 1, from which similar conclusions can be drawn. The AUC for ProDis-ContSHC (0.973) is close to 1, which means that ProDis-ContSHC works almost perfectly on this dataset. We further compare these four methods by their precision-recall curves, which are shown in Figure 5(b). It can be seen that the proposed contextual similarity learning algorithms significantly outperform the traditional methods. ProDis-ContSHC, again, is consistently the best method among the four.

Regarding the efficiency of the method, in this experiment, the learning time of ProDis-ContSHC is longer than that of ProDis-ContHC and CDM.
This is because, in each iteration of the learning algorithm, a quadratic programming problem with many training protein pairs has to be solved to train the SVM. In addition, the computation of the regularization factor of the supervised similarity learning algorithm needs more function evaluations.

We also compare the proposed algorithms with six other protein retrieval methods, i.e., Tableau Search [56], QP Tableau [49], Yakusa [57], SHEBA [58], VAST [59,60], and TOPS [61,62]. The overall AUC values are shown in Table 1. The tableau-feature-based methods do not always achieve better performance than the other methods: Tableau Search, for example, has a lower AUC than Yakusa, SHEBA, and VAST. Among the existing tableau-feature-based methods, IR Tableau outperforms the others. Yakusa and SHEBA also have comparable performance. As seen in Table 1, the AUC of the proposed algorithms is clearly better than that of all the other methods.

Figure 5 Performance of similarity measures on the ASTRAL 1.73 95% dataset. (a) The ROC curves of the original similarity measure, and the improved measures by ProDis-ContSHC, ProDis-ContHC, and CDM, respectively. (b) The precision-recall curves of the original similarity measure, and the improved measures by ProDis-ContSHC, ProDis-ContHC, and CDM, respectively.

Table 1 Performance of different retrieval methods on the ASTRAL 1.73 dataset

Method                                              AUC
IR Tableau: Cosine Similarity + ProDis-ContSHC      0.973
IR Tableau: Cosine Similarity + ProDis-ContHC       0.961
IR Tableau: Cosine Similarity + CDM [30,31]         0.954
IR Tableau: Cosine Similarity [26]                  0.948
Tableau Search [56]                                 0.871
QP Tableau [49]                                     0.925
Yakusa [57]                                         0.950
SHEBA [58]                                          0.941
VAST [59,60]                                        0.890
TOPS [61,62]                                        0.871

AUC results for QP Tableau [49], SHEBA [58] and VAST [59,60] are taken from [49], which used exactly the same query set and the same dataset as our experiments.

Improving different similarity measures via contextual dissimilarity learning algorithms

To further evaluate the robustness of our method, we test the behavior of ProDis-ContSHC and the other contextual similarity learning algorithms on different similarity measures. A group of experiments is conducted on the ASTRAL 1.73 95% dataset with the following similarity measures:

The cosine similarity [27], as introduced in the previous section.

The Jaccard index [28]: it is defined as the size of the intersection divided by the size of the union of two sets, i.e.,

$$J(x_i, x_j) = \frac{|x_i \cap x_j|}{|x_i \cup x_j|} \tag{17}$$

The Tanimoto coefficient [29]: it is a generalization of the Jaccard index, defined as

$$J(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|^2 + \|x_j\|^2 - x_i \cdot x_j} \tag{18}$$

The squared Euclidean distance [22]: it is another means of comparing proteins, here as a dissimilarity,

$$d_{ij} = (x_i - x_j)^\top (x_i - x_j) = \sum_{m} (x_i(m) - x_j(m))^2 \tag{19}$$

where $x_i(m)$ is the $m$-th element of the vector $x_i$.
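
The three additional measures in Eqs. (17) to (19) can be written compactly as below. The function names are hypothetical; the Jaccard index is shown on Python sets, and how a tableau feature vector is turned into a set is left open because the text does not specify it.

```python
def jaccard(a, b):
    """Size of the intersection over the size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def tanimoto(x, y):
    """Tanimoto coefficient of two real-valued vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    return dot / (sum(p * p for p in x) + sum(q * q for q in y) - dot)

def squared_euclidean(x, y):
    """Sum of squared element-wise differences."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

j = jaccard({1, 2, 3}, {2, 3, 4})
t = tanimoto([1, 0, 1], [1, 1, 0])
d = squared_euclidean([1, 2], [4, 6])
```

For binary vectors the Tanimoto coefficient equals the Jaccard index of the sets of non-zero positions, which is the sense in which it generalizes the Jaccard index.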

ProDis-ContSHC, ProDis-ContHC, and the CDM algorithm are applied to improve each of these similarity measures. The AUC values of the corresponding retrieval systems are plotted in Figure 6. In general, improving the original similarity measure by ProDis-ContSHC leads to the largest improvement. The only exception is the Tanimoto coefficient, on which ProDis-ContSHC has a slightly lower AUC than ProDis-ContHC, but an AUC comparable to that of the CDM. One possible reason is that the supervised classifier fails to capture the real distribution of the contextual similarity. ProDis-ContHC, on the other hand, also performs better than the CDM algorithm and the original similarity measures. This supports our previous conclusions: hierarchical sub-contextual information can remarkably improve the traditional context-based similarity measures, whereas supervised learning can further improve the accuracy for most of the input similarity measures.
ResultsonFSSP/DALIdatasetUnlikethesimilaritymeasureusedinthelastexperiment,hereweusetheEuclideandistance[22]tocompareapairofproteinsasthebaselinedissimilaritymeasureasin[20,21].
Inthisway,wehaveanideaabouthowouralgo-rithmsworkwithbothsimilarityanddissimilaritymea-sures.
Foraqueryproteinx0,thepairwiseEuclideandistances,d0i,i=1,2,.
.
.
,N,areranked.
Thetopkpro-teinsarereturnedastheretrievalresults.
Toevaluatetheperformanceoftheproposedalgorithms,wetestthemonboththeproteinretrievalandtheproteinclassifica-tiontasks,following[20,21].
Performance on protein retrieval

The effectiveness of the proposed dissimilarity learning algorithm is first evaluated in terms of the performance on the protein retrieval task. In this case, each protein $x_i \in X$ of the dataset is used as a query $x_0$ and the retrieved proteins are ranked according to the shape dissimilarity $d_{0j}$ to the query, where $j = 1, 2, \ldots, i-1, i+1, \ldots, N$. We also use the precision-recall curve to demonstrate the performance of the proposed methods, where precision is the proportion of the retrieved proteins that are relevant to the query and recall is the proportion of the relevant proteins in the entire dataset that are retrieved as the results.
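The precision and recall definitions above translate directly into code. The sketch below computes one (precision, recall) point per cutoff for a single query, assuming the database proteins have already been ranked by increasing dissimilarity and that a protein is relevant when it shares the query's class label. The function and parameter names are hypothetical.

```python
def precision_recall_points(ranked_labels, query_label, n_relevant):
    """Precision and recall after retrieving the top 1, 2, ..., N proteins.

    ranked_labels: class labels of the database proteins, most similar first
    query_label:   class label of the query protein
    n_relevant:    number of database proteins sharing the query's label
    """
    points = []
    hits = 0
    for k, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
        points.append((hits / k, hits / n_relevant))
    return points
```

Averaging these points over all queries, each protein being used once as the query, is one way to obtain a curve for the whole dataset.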
To test the robustness and consistency of our methods, we apply them to three different protein descriptor vectors, i.e., Daras et al.'s FT02, Kraw00, and FT02&Kraw00 geometric descriptor vectors [20,21]. We also apply the unsupervised version of our algorithm, ProDis-ContHC, and the CDM algorithm to the same dissimilarity measure and the same descriptor vectors to compare with ProDis-ContSHC. Figure 7 shows the precision-recall curves for different algorithms on different protein descriptor vectors. As mentioned in [20,21], there is always a tradeoff between the precision and recall values. This is clearly shown in Figure 7(a), 7(b), and 7(c), in which the algorithms reach their peak precision values at the smallest recall values.
It can be seen that ProDis-ContSHC has a clearly better performance than any other method, whereas ProDis-ContHC is the second best one.
This is consistent with what is observed in the last experiment, in which a similarity measure is used. Therefore, our algorithms can consistently improve both similarity and dissimilarity measures. Among the three protein descriptor vectors, ProDis-ContSHC performs the best on the combined vector, i.e., Kraw00&FT02.
This is because this vector not only employs the context, but also the irrelevant information to predict the relationship between the query and the database proteins.
Performance on protein classification

The performance of the method is also evaluated in terms of the overall classification accuracy [20,21]. To be more specific, for each protein $x_i$ in the database, a dissimilarity measure is applied after removing that protein from the database ("leave-one-out" experiment [63]). A class label $l_0$ is then assigned to the query $x_0$ according to the label of the nearest database protein. The overall classification accuracy is given by:

$$\text{Overall Classification Accuracy} = \frac{\text{Number of correctly predicted proteins}}{\text{Total number of proteins in the database}} \qquad (20)$$

We again conduct this experiment with the three descriptor vectors, i.e., FT02, Kraw00, and FT02&Kraw00. The overall classification accuracy is shown in Table 2. It can be seen that ProDis-ContSHC consistently achieves an accuracy higher than 99% on all three descriptor vectors. Each dissimilarity measure achieves its highest accuracy on Kraw00&FT02. Among the four dissimilarity measures, ProDis-ContSHC has the highest accuracy, whereas ProDis-ContHC is the second best one. Therefore, this conclusion has been demonstrated on both similarity and dissimilarity measures on different datasets with different descriptor vectors.
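The leave-one-out protocol and Equation (20) can be sketched as a nearest-neighbor loop over a precomputed dissimilarity matrix. This is a minimal illustration under the assumption that the full N x N matrix (original or regularized) is available. The function name `leave_one_out_accuracy` is hypothetical.

```python
def leave_one_out_accuracy(dist, labels):
    """Overall classification accuracy of Eq. (20) with a 1-nearest-neighbor rule.

    dist:   N x N dissimilarity matrix as a list of lists (needs N >= 2)
    labels: class label of each of the N proteins
    """
    n = len(labels)
    correct = 0
    for i in range(n):
        # Leave protein i out and find its nearest neighbor among the rest.
        nearest = min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / n
```

Because the matrix is an input, the same function can score the baseline Euclidean distance and any regularized dissimilarity derived from it.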
Figure 6 Performance of similarity measures on different base measures on the ASTRAL 1.73 90% dataset. The four base measures being tested are cosine similarity [27], the Jaccard index [28], the Tanimoto coefficient [29], and the Euclidean distance [22].

Conclusions

We have introduced in this paper a novel contextual dissimilarity learning algorithm for protein-protein comparison in protein database retrieval tasks. Its strength resides in the use of the hierarchical context between a pair of proteins and their class label information. In extensive experiments, this novel algorithm has been demonstrated to outperform the traditional context-based methods and its own unsupervised version. We formulate the protein dissimilarity learning problem as a context-based classification problem. Under such a formulation, we regularize the protein pairwise dissimilarity in a supervised way rather than the traditional unsupervised way. To the best of our knowledge, this is the first study on supervised contextual dissimilarity learning. We propose a novel algorithm, ProDis-ContSHC, which updates a protein's hierarchical sub-context and the dissimilarity measure coherently. The regularization factors are learned based on the classification of the relevant and the irrelevant protein pairs. The algorithm works in an iterative manner.

Experimental results demonstrate that supervised methods are almost always better than their unsupervised counterparts on all the databases with all the feature vectors. The proposed method, even though mainly presented for protein database retrieval tasks, can be easily extended to other tasks, such as RNA sequence-structure pattern indexing [64], retrieval of high throughput phenotype data [65], and retrieval of genomic annotation from large genomic position datasets [66]. The approach may also be extended to the database retrieval and pattern classification problems in other domains, such as medical image retrieval [67-69], speech recognition, and texture classification [70].
Acknowledgements

The study was supported by grants from Shanghai Key Laboratory of Intelligent Information Processing, China (Grant No. IIPL-2011-003), Key Laboratory of High Performance Computing and Stochastic Information Processing, Ministry of Education of China (Grant No. HS201107), National Grand Fundamental Research (973) Program of China (Grant No. 2010CB834303 and 2011CB911102), National Natural Science Foundation of China (Grant No. 60973154), Hubei Provincial Science Foundation, China (Grant No. 2010CDA006 and 2010CD06601), and a start-up grant from King Abdullah University of Science and Technology.

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 7, 2012: Advanced intelligent computing theories and their applications in bioinformatics. Proceedings of the 2011 International Conference on Intelligent Computing (ICIC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S7.
Author details

1 King Abdullah University of Science and Technology (KAUST), Mathematical and Computer Sciences and Engineering Division, Thuwal, 23955-6900, Saudi Arabia.
2 Shanghai Institute of Applied Physics, Chinese Academy of Sciences, 2019 Jialuo Road, Jiading District, Shanghai 201800, China.
3 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China.

Authors' contributions

JW: designed the algorithm, carried out the experiments, analyzed the results, and wrote the manuscript. XG: designed the algorithm and the experiments, improved the manuscript. QW: carried out the experiments, analyzed the results, improved the manuscript. YL: improved the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.
Published: 8 May 2012

Figure 7 Performance of dissimilarity measures on the FSSP/DALI dataset. (a) The precision-recall curves of the original dissimilarity measure, and the improved measures by ProDis-ContSHC, ProDis-ContHC, and CDM, respectively, with the descriptor vector FT02&Kraw00. (b) The precision-recall curves with the descriptor vector FT02. (c) The precision-recall curves with the descriptor vector Kraw00.

Table 2 Overall classification accuracy using different protein descriptors and the Euclidean distance measure

                                          Descriptors
Dissimilarity measure                     FT02      Kraw00    Kraw00&FT02
Euclidean Distance + ProDis-ContSHC       0.9925    0.9954    0.9971
Euclidean Distance + ProDis-ContHC        0.9890    0.9917    0.9928
Euclidean Distance + CDM [30,31]          0.9869    0.9895    0.9909
Euclidean Distance [20,21]                0.9850    0.9879    0.9890

References

1. Chen SA, Lee TY, Ou YY: Incorporating significant amino acid pairs to identify O-linked glycosylation sites on transmembrane proteins and non-transmembrane proteins. BMC Bioinformatics 2010, 11:536.
2. Sobolev B, Filimonov D, Lagunin A, Zakharov A, Koborova O, Kel A, Poroikov V: Functional classification of proteins based on projection of amino acid sequences: application for prediction of protein kinase substrates. BMC Bioinformatics 2010, 11:313.
3. Albayrak A, Otu HH, Sezerman UO: Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets. BMC Bioinformatics 2010, 11:428.
4. Ezkurdia L, Bartoli L, Fariselli P, Casadio R, Valencia A, Tress ML: Progress and challenges in predicting protein-protein interaction sites. Brief Bioinform 2009, 10(3):233-246.
5. Cook T, Sutton R, Buckley K: Automated flexion crease identification using internal image seams. Pattern Recognition 2010, 43(3):630-635.
6. Ofran Y, Rost B: Protein-protein interaction hotspots carved into sequences. PLoS Comput Biol 2007, 3(7):e119.
7. Yhou ZH, Lei YK, Gui J, Huang DS, Zhou X: Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data. Bioinformatics 2010, 26(21):2744-2751.
8. Xia JF, Zhao XM, Song J, Huang DS: APIS: accurate prediction of hotspots in protein interfaces by combining protrusion index with solvent accessibility. BMC Bioinformatics 2010, 11:174.
9. Yhou ZH, Yin Z, Han K, Huang DS, Zhou X: A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network. BMC Bioinformatics 2010, 11:343.
10. Xia JF, Zhao XM, Huang DS: Predicting protein-protein interactions from protein sequences using meta predictor. Amino Acids 2010, 39(5):1595-1599.
11. Shi MG, Xia JF, Li XL, Huang DS: Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset. Amino Acids 2010, 38(3):891-899.
12. Huang DS, Zhao XM, Huang GB, Cheung YM: Classifying protein sequences using hydropathy blocks. Pattern Recognition 2006, 39(12):2293-2300.
13. Li JJ, Huang DS, Wang B, Chen P: Identifying protein-protein interfacial residues in heterocomplexes using residue conservation scores. Int J Biol Macromol 2006, 38:241-247.
14. Wang B, Chen P, Huang DS, Li JJ, Lok TM, Lyu MR: Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS Lett 2006, 580(2):380-384.
15. Wang J, Li Y, Zhang Y, Tang N, Wang C: Class conditional distance metric for 3D protein structure classification. 2011 5th International Conference on Bioinformatics and Biomedical Engineering, (iCBBE). 2011, 1-4.
16. Chi PH, Scott G, Shyu CR: A fast protein structure retrieval system using image-based distance matrices and multidimensional index. International Journal of Software Engineering and Knowledge Engineering 2005, 15(3):527-545.
17. Marsolo K, Parthasarathy S: On the use of structure and sequence-based features for protein classification and retrieval. Knowledge and Information Systems 2008, 14:59-80.
18. Aung Z, Tan K: Rapid 3D protein structure database searching using information retrieval techniques. Bioinformatics 2004, 20(7):1045-1052.
19. Zhang W, Yoshida T, Tang X: A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst Appl 2011, 38(3):2758-2765.
20. Daras P, Zarpalas D, Tzovaras D, Strintzis M: 3D shape-based techniques for protein classification. IEEE International Conference on Image Processing, 2005. ICIP 2005. 2005, 1130-1133.
21. Daras P, Zarpalas D, Axenopoulos A, Tzovaras D, Strintzis MG: Three-dimensional shape-structure comparison method for protein classification. IEEE/ACM Trans Comput Biol Bioinform 2006, 3(3):193-207.
22. Oscamou M, McDonald D, Yap VB, Huttley GA, Lladser ME, Knight R: Comparison of methods for estimating the nucleotide substitution matrix. BMC Bioinformatics 2008, 9:511.
23. Marsolo K, Parthasarathy S: On the use of structure and sequence-based features for protein classification and retrieval. Proceedings of the Sixth International Conference on Data Mining, 2006. ICDM'06. 2006, 394-403.
24. Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D: Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins 2008, 72:1259-1273.
25. Mittelmann H, Peng J: Estimating bounds for quadratic assignment problems associated with Hamming and Manhattan distance matrices based on semidefinite programming. SIAM J Optim 2010, 20(6):3408-3426.
26. Zhang L, Bailey J, Konagurthu AS, Ramamohanarao K: A fast indexing approach for protein structure comparison. BMC Bioinformatics 2010, 11(Suppl 1):S46.
27. Lee B, Lee D: Protein comparison at the domain architecture level. BMC Bioinformatics 2009, 10(Suppl 15):S5.
28. Rahman M, Hassan MR, Buyya R: Jaccard index based availability prediction in enterprise grids. International Conference on Computer Science, ICCS 2010. 2010, 2701-2710.
29. Garavaglia S: Statistical analysis of the Tanimoto coefficient self-organizing map (TCSOM) applied to health behavioral survey data. International Joint Conference on Neural Networks, 2001. IJCNN'01. 2001, 2483-2488.
30. Jegou H, Harzallah H, Schmid C: A contextual dissimilarity measure for accurate and efficient image search. IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR'07. 2007, 1-8.
31. Jegou H, Schmid C, Harzallah H, Verbeek J: Accurate image search using the contextual dissimilarity measure. IEEE Trans Pattern Anal Mach Intell 2010, 32(1):2-11.
32. Yang X, Bai X, Latecki LJ, Tu Z: Improving shape retrieval by learning graph transduction. 10th European Conference on Computer Vision. ECCV 2008. 2008, 788-801.
33. Bai X, Yang X, Latecki LJ, Liu W, Tu Z: Learning context-sensitive shape similarity by graph transduction. IEEE Trans Pattern Anal Mach Intell 2010, 32(5):861-874.
34. Bai X, Wang B, Wang X, Liu W, Tu Z: Co-transduction for shape retrieval. 11th European Conference on Computer Vision. ECCV 2010. 2010, 328-341.
35. Sinkhorn R: A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann Math Statist 1964, 35(2):876-879.
36. Wang J, Li Y, Bai X, Zhang Y, Wang C, Tang N: Learning context-sensitive similarity by shortest path propagation. Pattern Recognition 2011, 44(10-11):2367-2374.
37. Kuang R, Weston J, Noble W, Leslie C: Motif-based protein ranking by network propagation. Bioinformatics 2005, 21(19):3711-3718.
38. Weston J, Kuang R, Leslie C, Noble WS: Protein ranking by semi-supervised network propagation. BMC Bioinformatics 2006, 7(Suppl 1):S10.
39. Sahbi H, Audibert JY, Rabarisoa J, Keriven R: Object recognition and retrieval by context dependent similarity kernels. International Workshop on Content-Based Multimedia Indexing, 2008. CBMI 2008. 2008, 216-223.
40. Sahbi H, Audibert J, Keriven R: Context-dependent kernels for object classification. IEEE Trans Pattern Anal Mach Intell 2011, 33(4):699-708.
41. Ding J, Zhou S, Guan J: MiRenSVM: towards better prediction of microRNA precursors using an ensemble SVM classifier with multi-loop features. BMC Bioinformatics 2010, 11(Suppl 11):S11.
42. González AJ, Liao L: Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines. BMC Bioinformatics 2010, 11:537.
43. Wang J, Li Y, Liang P, Zhang G, Ao X: An effective multi-biometrics solution for embedded device. IEEE International Conference on Systems, Man and Cybernetics, 2009. SMC 2009. 2009, 917-922.
44. Wang J, Li Y, Ao X, Wang C, Zhou J: Multi-modal biometric authentication fusing iris and palmprint based on GMM. IEEE/SP 15th Workshop on Statistical Signal Processing, 2009. SSP'09. 2009, 349-352.
45. Shih-Wen Ke G, Oakes MP, Palomino MA, Xu Y: Comparison between SVM-Light, a search engine-based approach and the mediamill baselines for assigning concepts to video shot annotations. International Workshop on Content-Based Multimedia Indexing, 2008. CBMI 2008. 2008, 381-387.
46. Ramana J, Gupta D: LipocalinPred: a SVM-based method for prediction of lipocalins. BMC Bioinformatics 2009, 10:445.
47. Ey K, Poetzsche C: Asymptotic behavior of recursions via fixed point theory. Journal of Mathematical Analysis and Applications 2008, 337(2):1125-1141.
48. Brenner S, Koehl P, Levitt R: The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000, 28(1):254-256.
49. Stivala A, Wirth A, Stuckey PJ: Tableau-based protein substructure search using quadratic programming. BMC Bioinformatics 2009, 10:153.
50. FSSP/DALI Database. [http://ekhidna.biocenter.helsinki.fi/dali/start].
51. Holm L, Sander C: The FSSP database: fold classification based on structure-structure alignment of proteins. Nucleic Acids Res 1996, 24(1):206-209.
52. Holm L, Sander C: The FSSP database of structurally aligned protein fold families. Nucleic Acids Res 1994, 22:3600-3609.
53. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536-540.
54. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Müller M: pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 2011, 12:77.
55. Tsai RT, Lai PT: Dynamic programming re-ranking for PPI interactor and pair extraction in full-text articles. BMC Bioinformatics 2011, 12:60.
56. Konagurthu AS, Stuckey PJ, Lesk AM: Structural search and retrieval using a tableau representation of protein folding patterns. Bioinformatics 2008, 24(5):645-651.
57. Carpentier M, Brouillet S, Pothier J: YAKUSA: a fast structural database scanning method. Proteins 2005, 61(1):137-151.
58. Jung J, Lee B: Protein structure alignment using environmental profiles. Protein Eng 2000, 13(8):535-543.
59. Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins 1995, 23(3):356-369.
60. Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377-385.
61. Gilbert D, Westhead D, Nagano N, Thornton J: Motif-based searching in TOPS protein topology databases. Bioinformatics 1999, 15(4):317-326.
62. Torrance G, Gilbert D, Michalopoulos I, Westhead D: Protein structure topological comparison, discovery and matching service. Bioinformatics 2005, 21(10):2537-2538.
63. Zhang W, Sun F, Jiang R: Integrating multiple protein-protein interaction networks to prioritize disease genes: a Bayesian regression approach. BMC Bioinformatics 2011, 12(Suppl 1):S11.
64. Meyer F, Kurtz S, Backofen R, Will S, Beckstette M: Structator: fast index-based search for RNA sequence-structure patterns. BMC Bioinformatics 2011, 12:214.
65. Chang WE, Sarver K, Higgs BW, Read TD, Nolan NM, Chapman CE, Bishop-Lilly KA, Sozhamannan S: PheMaDB: a solution for storage, retrieval, and analysis of high throughput phenotype data. BMC Bioinformatics 2011, 12:109.
66. Krebs A, Frontini M, Tora L: GPAT: retrieval of genomic annotation from large genomic position datasets. BMC Bioinformatics 2008, 9:533.
67. Wang J, Li Y, Zhang Y, Wang C, Xie H, Chen G, Gao X: Bag-of-features based medical image retrieval via multiple assignment and visual words weighting. IEEE Trans Med Imaging 2011, 30(11):1996-2011.
68. Wang J, Li Y, Zhang Y, Xie H, Wang C: Boosted learning of visual word weighting factors for bag-of-features based medical image retrieval. 2011 Sixth International Conference on Image and Graphics (ICIG). 2011, 1035-1040.
69. Wang J, Li Y, Zhang Y, Xie H, Wang C: Bag-of-features based classification of breast parenchymal tissue in the mammogram via jointly selecting and weighting visual words. 2011 Sixth International Conference on Image and Graphics (ICIG). 2011, 622-627.
70. Liu Z, Wang J, Li Y, Zhang Y, Wang C: Quantized image patches co-occurrence matrix: a new statistical approach for texture classification using image patch exemplars. Proceedings of SPIE 8009. 2011, 80092P.
doi:10.1186/1471-2105-13-S7-S2
Cite this article as: Wang et al.: ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval. BMC Bioinformatics 2012 13(Suppl 7):S2.
