hardfedora16

fedora16 时间:2021-05-01 阅读:()

WandeltandLeserAlgorithmsforMolecularBiology2012,7:30http://www.
almob.
org/content/7/1/30RESEARCHOpenAccessAdaptiveecientcompressionofgenomesSebastianWandelt*andUlfLeserAbstractModernhigh-throughputsequencingtechnologiesareabletogenerateDNAsequencesataneverincreasingrate.
InparalleltothedecreasingexperimentaltimeandcostnecessarytoproduceDNAsequences,computationalrequirementsforanalysisandstorageofthesequencesaresteeplyincreasing.
Compressionisakeytechnologytodealwiththischallenge.
Recently,referentialcompressionschemes,storingonlythedierencesbetweenato-be-compressedinputandaknownreferencesequence,gainedalotofinterestinthiseld.
However,memoryrequirementsofthecurrentalgorithmsarehighandruntimesoftenareslow.
Inthispaper,weproposeanadaptive,parallelandhighlyecientreferentialsequencecompressionmethodwhichallowsne-tuningofthetrade-obetweenrequiredmemoryandcompressionspeed.
Whenusing12MBofmemory,ourmethodisforhumangenomeson-parwiththebestpreviousalgorithmsintermsofcompressionratio(400:1)andcompressionspeed.
Incontrast,itcompressesacompletehumangenomeinjust11secondswhenprovidedwith9GBofmainmemory,whichisalmostthreetimesfasterthanthebestcompetitorwhileusinglessmainmemory.
Keywords:Sequencecompression,Referentialcompression,Heuristics,ScalabilityBackgroundThedevelopmentofnovelhigh-throughputDNAse-quencingtechniqueshasledtoaneverincreasingoodofdata.
Whileittookroughly12yearsandanesti-matedamountof3billionUSDtodecipherthersthumangenome[1],currenttechniques,usuallysumma-rizedunderthetermsecondgenerationsequencing(SGS),areabletoproduceroughlythesameamountofdatainaboutaweekatacurrentcostofroughly2000USD.
Ontop,thirdgenerationsequencingpromisestodeliverafur-therspeed-up,reducingthetimeandpriceforsequencingahumangenomefromweekstodaysandfromthousandstounderahundredUSD,respectively[2].
Large-scaleprojectsaregeneratingcomprehensivesur-veysofthegenomiclandscapeofvariousdiseasesbysequencingthousandsofgenomes[3].
Managing,stor-ingandanalyzingthisquicklygrowingamountofdataischallenging[4].
Itrequireslargediskarraysforstorage,andlargecomputeclustersforanalysis.
Arecentsug-gestionistousecloudinfrastructuresforthispurpose[5-7].
However,beforebeinganalyzedinacloud,datarsthastobeshippedtothecloud,makingbandwidthinletransferoneofthemajorbottlenecksincloud-based*Correspondence:wandelt@informatik.
hu-berlin.
deInstituteforComputerScience,Humboldt-Universit¨atzuBerlin,Berlin,GermanyDNAanalysis[8].
Accordingly,sequencecompressionisakeytechnologytocopewiththeincreasingoodofDNAsequences[9-11].
Tostoreacompletegenomeofahumanbeing,oneneedsroughly3GB(uncompressed).
Substitutionalorstatisticcompressionschemescanreducethespacerequirementsbyupto6:1(onebaseisencodedwithupto1.
3Bit)[12,13].
However,inmanyprojectsonlygenomesfromonespeciesareconsidered.
Thismeansthatprojectsoftendealwithhundredsofhighlysimilargenomes;forinstance,tworandomlyselectedhumangenomesareidenticaltoanestimated99.
9%.
Thisobservationisexploitedbyso-calledreferentialcompressionschemes,whichonlyencodethedierencesofaninputsequencewithrespecttoapre-selectedreferencesequence.
Usingspace-ecientencodingofdierencesandcleveralgo-rithmsforndinglongstretchesofDNAwithoutdier-ences,thebestcurrentreferentialcompressionalgorithmweareawareofreportsacompressionratesofupto500:1forhumangenomes[14].
However,allexistingcompressionschemeshaveincommonthattheyhaveveryhighdemandsontheunder-lyinghardware(upto25GB,forinstance[15,16]).
Fur-thermore,thetimeneededforcompressingtheamountofsequencescorrespondingtoahumangenomemaybeup2012WandeltandLeser;licenseeBioMedCentralLtd.
ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionLicense(http://creativecommons.
org/licenses/by/2.
0),whichpermitsunrestricteduse,distribution,andreproductioninanymedium,providedtheoriginalworkisproperlycited.
WandeltandLeserAlgorithmsforMolecularBiology2012,7:30Page2of9http://www.
almob.
org/content/7/1/30toseveralhours,whicheasilyout-weightsthetimesav-ingsduringletransfer.
Furthermore,existingapproachescannoteasilybeparallelizedtoreducecompressiontime,becausesuchaparallelizationrequirestoshareinforma-tionaboutthegeneratedcompressionmodel.
Thiscom-munication/synchronizationoverheadmayeasilymitigatethepositiveeectsgainedbyparallelization.
Inthispaper,wepresentanadaptive,scalableandhighlyecientalgorithmforreferentialcompressionofDNAsequencesgenomes.
Itisabletogracefullytrade-ocompressiontimeandspacerequirementswhilecon-sistentlyachievingverygoodcompressionrates.
Forinstance,ouralgorithmsrequiresonly12MBmemorytocompress/decompressahumangenomeatthesamespeedasthefastestexistingapproach.
Using9GBofmemoryusage,ourmethodperformsuptothreetimesfasterthanthebestcompetitorwhilestillneedinglessmainmemory.
Bothvariantsachievesimilarcompressionratesofapproximately400:1forhumanDNA.
Foryeastsequences,thecompressionratiosarelower,sincetwoyeastgenomescanhavesubstantialdierences.
Theremainingpartofthispaperisstructuredasfol-lows.
Werstexplainthemainideasbehindouralgo-rithminSection"GeneralIdea".
TheconcretedesignchoicesandalgorithmsaredescribedindetailinSection"GenomeCompression".
WediscussrelatedworkinSection"RelatedWork",beforeweprovideanevalua-tionofourmethodinSection"Evaluation".
ThepaperisconcludedwithSection"Conclusions".
GeneralIdeaWedenotestringswiths,t.
Thelengthofastringsisdenotedwith|s|andthesubstringstartingatpositioniwithlengthnisdenoteds(i,n).
s(i)isanabbreviationfors(i,1).
Allpositionsinastringarezero-based,i.
e.
therstcharacterisaccessedbys(0).
Theconcatenationoftwostringssandtisdenotedwithst.
Althoughagenomecanbeencodedwithfourcharacters,i.
e.
A,C,G,andT,weallowarbitrarysymbols.
Forinstance,symbolNisoftenusedtoindicateanunknownbase.
Giventwostringssandt,thelongestprex-suxmatchofsint,isthelongeststringtm,suchthatt=t1tmt2ands(0,|tm|)=tm.
Wewanttocompressagiveninputgenomewithrespecttoareferencegenome,byonlyencodingdierencesbetweentheinputandthereference.
Thisyieldsloss-lesscompression,i.
e.
basedonthereferencegenomeandthedierencedescriptionitispossibletorecovertheinputgenome.
Theideaistosplitupagivenreferencegenomeintoblocks.
Ingeneral,twogenomesofthesamespeciesareverysimilartoeachother,andblocksarechoseninawaythatlongmatchesofto-be-compressedblockscanoftenbefoundinreferenceblocksbylocalsearch.
Pleasenotethatsplittingthereferenceintoblocksofxedlengthisnotanoptimalway,sincedierentsizesoftheblockscanprovidebetterperformanceintermsofcompressionratio.
However,forthesakeofcompressionspeedandeasierdatahandling,weonlyconsideraxedblocklength.
Foreachreferencegenomeblock,asuxtreeiscom-puted.
Thesuxtreesallowtondlongestprex-suxmatchesofpartsoftheinputgenomeandthereferencegenomeeciently.
ThecompressionprocessisinformallysummarizedinAlgorithm1.
Algorithm1:SketchofourCompressionAlgorithm1:whileInputcontainscharacters2:FindmatchingreferenceblockBforcurrentinputposition3:PerformreferentialcompressionwithrespecttoBuntilwecannotnd"long"matchesanymore4:endwhileThecompressionalgorithmgeneratesasetofreferentialmatcheswithrespecttoreferenceblocks.
Theoutputofourcompressionalgorithmisacompressedleofentries,suchthateachentryisoneofthefollowing:Block-changeentryBC(i):nextentriesareencodedwithrespecttoreferenceblocki.
RelativematchentryRM(i,j):Theinputmatchesthereferenceblockatpositioniforjcharacters.
RawentryR(s):Astringsisencodedraw(forinstanceifthereisnogoodmatchingblock).
AnexamplecompressionisgiveninFigure1.
Aninputsequence"TACGTAAT.
.
.
"iscompressedwithrespecttoareferencesequence"ACGACGTA.
.
.
".
Therstcharacteroftheinputisencodedraw,withentryR(T),thentherstreferenceblockischosen,withentryBC(0),inordertoreferentiallyencodethesubsequence"ACGTA",withentryRM(3,5).
Pleasenotethatblockchangeentriescanbeactuallyavoided,sincelocalpositionsinablockcanbeeasilytransformedintoglobalpositionsinthereference.
Intheremainingpart,wewillstillassumetheusageofblockchangeentries.
GenomeCompressionIndexConstructionInordertondmatchesoftheinputinreferenceblocks,weneedanindexstructureoverthereferenceblocks.
Suxtrees[17]aretreedatastructureswhichallowforfastaccesstoallsuxesofagivenstring.
Eachsuxisusuallyrepresentedbyapathfromtherootofthetreetoaleaf.
Pleasenotethatthelongestprex-suxmatchproblemofsintcanbebeeasilysolvedbydepth-rstsearchgivenasuxtreeforthestringt.
Thefocusrecentlyshiftedtowardsso-calledcompressedsuxtrees.
OneexampleofacompressedsuxtreeimplementationisCST++[18],whichusesanewapproachtocompressWandeltandLeserAlgorithmsforMolecularBiology2012,7:30Page3of9http://www.
almob.
org/content/7/1/30Figure1Exampleforrelativecompression.
AnexamplecompressionisgiveninFigure1.
Aninputsequence"TACGTAAT.
.
.
"iscompressedwithrespecttoareferencesequence"ACGACGTA.
.
.
".
Therstcharacteroftheinputisencodedraw,withentryR(T),thentherstreferenceblockischosen,withentryBC(0),inordertoreferentiallyencodethesubsequence"ACGTA",withentryRM(3,5).
thelongestcommonprex-arrayandachieves,depend-ingonthesamplingratesofthe(inverse)suxarray,anusualspaceoverheadof(46)nbits.
Inthefollowing,thecompressedsuxtreeofastringsisdenotedwithCST(s).
Inaddition,weneedareferencegenomeforourcom-pressionalgorithm.
ThereferencesequenceisgivenasasetofcompressedFASTA-les,oneforeachchromosome.
Theindexgenerationalgorithmiteratesoverallchro-mosomesofthereferencegenome.
EachchromosomeissplitupintoblocksofamaximumsizeBS.
Forinstance,giventhetextualrepresentations1ofChromosome1,wecomputemsubstringsb1,1,b1,m,suchthats1=b1,1b1,m,i≤(m1):|b1,i|=BS,and[b1,m]≤BS.
Wedothesameforallotherchromosomesoftheref-erencegenome.
Foreachreferenceblockacompressedsuxtreeiscomputedandstoredonharddisktogetherwiththerawreferenceblock.
Inaddition,wekeeptrackoftheorderoftheblocks,whichisinducedbytheirpositiononthechromosomes.
Thismetainfor-mationisusedbelowforoptimizingourcompressionalgorithm.
Thememoryconsumptionduringindexcreationislim-itedasfollows:ateachstepoftheindexgenerationwehaveonerawreferenceblockofsizeatmostBSbytesinmainmemoryplus(roughly)4BSbytesforitscom-pressedsuxtree.
Forexample,givenaBSof4MB,themainmemoryusagecanberestrictedtoapproxi-mately20MB.
ThemaximumvalueforBSis300MB,sincethelargesthumanchromosome(Chromosome1)hasabout247millionnucleotidebasepairs,andeachbaseisencodedasonebyte.
Pleasenotethatwecouldhaveencodedbasesofthereferencegenomewith3bits(veentries:A,C,G,N,T),inordertofurtherreducememoryusage.
However,ourtestsindicatethatthespacesavingyieldsasignicanttimeoverheadforaccessingbases.
Furthermore,the3-Bit-encodingwouldnotallowustocompresssequencesagainstreferenceswithothersymbolsthantheseve.
CompressionandDecompressionInthefollowing,wepresentourcompressionalgorithmforgenomesequencesindetail.
Algorithmsdonotshowrangechecksforthesakeofreadability.
Algorithm2assumestheinputgenomeinInput(asastringofbases).
Theinputstringistraversedfromlefttoright,anddependingonthecurrentcharactersintheinputandinthereferenceblock,dierentsubroutinesareexecuted.
Algorithm2:CompressionAlgorithm1:Pin←02:Praw←03:FIND-MATCH4:whilePin=|Input|do5:ifInput(Pin)=B(Praw)then6:ENCODE-REF7:elseifInput(Pin)/∈{A,C,G,T}then8:ENCODE-RAW9:else10:FIND-MATCH11:endif12:endwhileInthebeginningofthecompressionalgorithm,amatchforthecurrentinputpositioninthereferenceissearched,usingfunctionFIND-MATCH,Algorithm3,explainedindetailbelow.
IfthecurrentreferenceblockBmatchestheinputatthecurrentposition,thenwetrytogen-eratea(aslongaspossible)referencematchinfunc-tionENCODE-REF,Algorithm4.
Ifthecurrentinputbasedoesnotequalthecurrentreferenceblockbase,andtheinputisnotanormalbase,i.
e.
neitherA,C,G,orT,thenthebaseandallthefollowingnon-normalbasesareaddedasrawentriestothecompressedgenomeleinfunctionENCODE-RAW,Algorithm5.
Finally,ifneitheroftheconditionsissatised,thenthealgorithmtriestondanewmatch(eitherinthecur-rentblockorinanotherreferenceblock)usingfunctionFIND-MATCH.
Algorithm3:FIND-MATCHFunction1:Max←25WandeltandLeserAlgorithmsforMolecularBiology2012,7:30Page4of9http://www.
almob.
org/content/7/1/302:ifnotHAS-LOCAL-ALTERNATIVE()then3:forallBj∈Blocksdo4:ifm(Input,Pin,Bj)≥Maxthen5:Max←m(Input,Pin,Bj)6:B←Bj7:endif8:endfor9:ifMax≤25then10:Raws←Input(Pin,Max)11:AddR(Raws)tooutput12:Pin←Pin+Max13:else14:Praw←beginningofthelongestmatchinB15:AddBC(B)tooutput16:endif17:endifAlgorithm4:ENCODE-REFFunction1:M←02:whileInput(Pin)=B(Praw)do3:M←M+14:Pin←Pin+15:Praw←Praw+16:endwhile7:AddRM(Praw-M,M)tooutputAlgorithm5:ENCODE-RAWFunction1:Raws←""2:whileInput(Pin)/∈{A,C,G,T}do3:Raws←RawsInput[Pin];4:Pin←Pin+1;5:endwhile6:AddR(Raws)tooutputInAlgorithm4thenumberofmatchingcharactersbetweencurrentinputpositionandcurrentreferencepositionaredeterminedandstoredinvariableM.
Aref-erenceentryRM(Praw-M,M)isaddedtothecompressedoutput.
TheencodingofarawsequenceinAlgorithm5isstraight-forward:ThestringRawsislledwithbasesfromtheinputuntilanormalbaseisfoundandR(Raws)isaddedtotheoutput.
Algorithm3requiresmorethoughts.
ThefunctionFIND-MATCHiscalled,wheneverthereisamismatchbetweentheinputandthereference.
Inthiscase,weneedtondanalternativematchforthecurrentinputposition.
First,wecheckwhetherthereexistsamatchintheneighbourhood(mutationscausedbyfewsinglenucleotidepolymorphisms)ofthecurrentpositionofthereferenceblock.
Theprocessisexplainedindetaillater.
Ifwecannotndalocalmatch,thenallthereferenceblocksarecheckedforabettermatch.
Theexpressionm(Input,Pin,Bj)returnsthepositionofthelongestprex-suxmatchofthecurrentinputinBj.
InourimplementationofAlgorithm3,wecareabouttheorderinwhichallreferenceblocksaretra-versed.
Thisisnecessary,sinceloadingblocksfromthediskisatimeconsumingeortandshouldbeavoided.
Referenceblocksaretraversedwiththefollowingheuristic:1.
Theblockleftandrightofthecurrentreferenceblock2.
Allotherblocksonthesamechromosomeasthecurrentreferenceblock3.
AllotherblocksofthereferencegenomeIfthereisnolongenoughmatchinanyreferenceblock,thenwejustencodefewrawbases.
Matchesinotherblocksarerequiredtobelongerthan25characters,inordertoavoidrandommatches.
OurexperimentshaveshownthatthemerelengthofthehumanDNAcausesalotofunrelatedmatcheswithlessthan20-25characters.
Algorithm3usesthefunctionHasLocalAlternative.
Theintuitionisthatwewanttoavoidsearchingallpossiblereferenceblocksallthetime.
Ourexperimentsshowedthatonelongestprex-sux-matchlookup-operationinasinglecompressedsuxtreecantakeuptofewmil-lisecondsdependingonthesize.
Thisisbasicallycausedbythehighbit-wisecompressionofthereferencegenome.
Furthermore,potentiallywehavetoloadalltheotherref-erenceblock'sCTSsfromtheharddisk.
Therefore,thenaivestrategytosearchforanewreferenceblockoneachmismatchdoesnotscale.
Insteadweusealocalsearchintheneighbourhoodofthecurrentinputpositionandofthecurrentreferenceposition.
Thisstrategyhasabiologicalfoundation:Oftentwopartsofagenomemightonlybedierentbyfewbases(baseinsertion,baseremoval,ofbasemutation).
When-everwendanappropriatematchintheneighbourhood,weavoidcheckingotherreferenceblocks,althoughtheymightcontainbettermatches.
Infact,ifwesearchedforthebestreferenceblockseachtime,wecouldincreasethecompressionrateslightlyforthepriceofbeingordersofmagnitudeslower.
ThealgorithmforlocalneighbourhoodsearchisshowninAlgorithm6.
Ifamatchofatleast20characterscanbefoundnearthecurrentinputandreferencepositions,thenthedierenceuntilthematchisencodedraw,andthematchisencodedreferentially.
Decompressionoftherelativegenomedataisstraight-forward.
Basically,allentriesareprocessedaccordingtotheirtype.
Fordecompressionofgenomedatawedonotneedthecompressedsuxtreesanymore,butonlytherawblocksofthereferencegenome.
ThedecompressionofgenomedataistheleastCPU-intensivetask.
Oureval-uationinthenextsectionshowsthatindexgenerationandcompressionareCPU-bound,whilethedecompressionphaseisI/O-bound.
WandeltandLeserAlgorithmsforMolecularBiology2012,7:30Page5of9http://www.
almob.
org/content/7/1/30Algorithm6:HAS-LOCAL-ALTERNATIVEFunction1:fori∈{0,1,6,7}do2:forj∈{7,6,6,7}do3:ifInput(Pin+i,20)=B(Praw+j,20)then4:ifi≥1then5:Raws←Input(Pin,i)6:AddR(Raws)tooutput7:Pin←Pin+i8:endif9:Praw←Praw+j10:ENCODE-REF11:RETURN12:endif13:endfor14:endforWeemphasizethatourcompressionschemeisaheuris-tic,whichmainlyworkswhencompressingasequencewithrespecttoareferencefromthesamespecies.
Thee-ciencyofthisheuristicisevaluatedinSection"Evaluation".
Theworst-casetimecomplexityofourcompressionalgo-rithmisO(|Input|),sinceweneedtotraversethewholeinputsequenceonetimefromlefttorightandforeachcharacterweperform(intheworstcase)onelookupinthereference.
Thecomplexityofthelookupdependsonthelengthofthesubstringbeinglookedup(weassumeaxedmaximalmatchlength)andisthereforeconstant.
Theworst-casespacecomplexityisO(|Input|).
Eachchar-acteroftheinputisstoredatmostonetime:eitherinsidearawblockoraspartofareferentialblock.
RelatedWorkInthefollowing,wereviewexistingworkonbiologi-caldatacompression.
Ingeneral,compressionalgorithmsareeithersubstitutional,statistical,orreferential.
Sub-stitutionalalgorithms[19]replacelongrepeatedsub-stringsbyreferences,e.
g.
Lempel-Ziv-basedcompression.
Statisticalalgorithms[13,20]deriveapredictivemodelfrom(asubsetof)theinput,basedonpartialmatches.
Ifthemodelalwaysindicateshighprobabilitiesforthenextsymbol,thenhighcompressionratesarepossible.
Whilereferentialalgorithmsreplacelongsubstringsaswell,thesourceforthesesubsetsisusuallynotpartoftheinput(thesequencetobecompressed).
Referen-tialcompressionalgorithmshavedrawnalotofatten-tionrecently,sincetheyallowforveryhighcompressionrates.
In[16],RLZ,aself-indexingbasedapproachisproposedasfollows:Givenaself-indexforabasesequence,com-pressothersequenceswithLZ77encodingrelativetothebasesequence.
Infact,asuxarrayforthereferencesequenceisbuiltandthereferenceentriesarepositionref-erenceswithlengthintothebasesequence.
Pleasenotethattheauthorsdonotstorerawsequencesatanytime,butonlyencodebasedonalistofreferences,nomatterhowlong.
Theauthorsof[21]proposedtostorehumangenomesrelativelytoareferencechromosome.
Theyusedvari-ableintegersforstoringabsoluteandrelativepositionsofmatchesandrepresentoftenusedk-mers(sequencesoflengthk)withHumanencoding.
In[22],aLZ77-stylecompressionschemeisproposed.
Themaindierenceisthatseveralreferencesequencesaretakenintoaccount.
Thematch-ndingprocessisbasedonhashing.
CompressionisperformedoninputblockswithsharedHumanmodels,inordertosupportrandomaccess.
In[23],anotherLZ77-stylecompressionschemeisproposed.
Thereexistsfurtherearlyworkonnon-referentialcompressionalgorithms[24-27].
Furthermore,thereiscompressionbasedoncontext-freegrammars[28],determinationofMarkov-modelbasedprobabilitiesofsequencepositions[29],andhybridmethods[30].
In[12],asplaytree-basedcompressionalgorithmisproposed,whichfavorsencodingofrecentlyseensymbols.
Finally,thereexistspreviousworkonrelativecompressionofreadsandalignmentdata,e.
g.
[31,32].
EvaluationInthefollowingsection,weevaluateourproposedcom-pressionscheme.
AllexperimentshavebeenrunonaAcerAspire5950Gwith16GBRAMandIntelCorei7-2670QM,onFedora16(64-Bit,Linuxkernel3.
1).
ThecodewasimplementedinC++,usingtheBOOSTlibrary[33],CST[34],andlibz.
Allsizemeasuresareinbyte,e.
g.
1MBmeans1,000,000bytes.
Thesourcecodeisavailablefordownload1.
First,weperformedcompressiontestsonhumangenomes.
Oursetofdatagenomesconsistsof1000genomesfromthe1000Genomeproject[35].
The1000GenomeprojectgroupprovidesallsequencedgenomesinVariantCallFormat(VCF)[36]fordownload2.
TheVariantCallFormatdescribesdierencesofsetsofgenomeswithrespecttoareferencesequence,basedonSNPsandindels.
Wehaveextractedoneconsen-sussequenceeachforintotal1000genomes.
Sincetheprojectcontainsslightlymorethan1000genomes,wehaveonlyextractedtherst1000genomes(columnsfromlefttoright)namedintheseVCFles.
Thecon-sensussequencesforeachchromosomeofeachgenomewerestoredGZip-compressedonaharddisk.
These1000GZip-compressedgenomesneedintotal700GBofstorage.
CompressionRatioFirst,wehavecompressedeachchromosomereferentiallyagainstthereferencechromosometakenfromHG19[37].
TheresultsareshowninFigure2.
TherstchromosomeWandeltandLeserAlgorithmsforMolecularBiology2012,7:30Page6of9http://www.
almob.
org/content/7/1/30Figure2Compressedlesizefordierentinputchromosomes(inMB).
WehavecompressedeachchromosomereferentiallyagainstthereferencechromosometakenfromHG19[37].
TheresultsareshowninFigure2.
Theoverallcompressionratioobtainedforthe1000humangenomesis397:1.
of1000humansneeds236.
6GBofuncompressedstor-age,whilethereferentiallycompressedleneedsonly0.
57GB,yieldingacompressionratioof415:1.
Thesmallesthumanchromosomeofour1000genomes,Chromosome22,iscompressedfrom36.
4GBdownto0.
1GB,yieldingacompressionratioof364:1.
Theoverallcompressionratioobtainedforthe1000humangenomesis397:1.
Forverysimilarsequencesweachievehighercompres-sionrates,thanforlessrelatedsequences.
Thisisinherenttoallreferentialcompressionschemes:themoresimi-larinputandreferencesequenceare,longerreferentialmatchescanbefound.
Wehaveevaluatedtheimpactoftheblocksizeonthecompressionratio.
Theresults(averageoverallhumangenomesandchromosomes)areshowninFigure3.
Ingeneral,alargerblocksizewillyieldbettercompressionratios.
Theintuitionisthatalargerblockinmainmem-oryallowsforndinglongermatches.
However,forablocksizeof1MB,ourcompressionschemecanstillachieveacompressionratioof361:1,whichisroughlycompetitivetoexistingrelativecompressionschemesforhumangenomesequences.
RLZobtainsacompressionrationof80:1andRLZoptacompressionratioof133:1forhumangenomes.
GDCachievescompressionratiosof200:1-500:1,dependingonspeed-tradeos.
ThereexistsonevariantGDC-ultra,whichachievesacom-pressionratioof1000:1,whichswitchesthereferencesequenceduringcompression.
Switchingthereferencenaturallyallowsforhighercompressionratios.
How-ever,sinceweonlyuseonereference,itseemstobefairtoonlycompareourresultstothenon-ultravari-antofGDC,obtainingacompressionratioofatmost500:1.
Pleasenotethatthecompressionratiofor300MBisactuallysmallerthanfor50MB.
Oneexplanationcouldbethatallthesematchesfoundwithsmallerblocksallowforashorterencodingthanthematchesfoundinalongerblock.
Atrstsightthismightsoundcounterintuitive.
Tothebestofourknowledge,noresearchhasbeencon-ductedinthisarea,sincethecompressiongainmainlydependsonthechoiceofreferentialentries,i.
e.
howtoencodepositions,lengthentriesandrawbaseentries.
Ref-erentialcompressionisanoptimizationproblem,wherethelongestmatchesoften,butnotnecessarily,yieldtheFigure3Compressionratiobyblocksize.
Wehaveevaluatedtheimpactoftheblocksizeonthecompressionratio.
Theresults(averageoverallhumangenomesandchromosomes)areshowninFigure3.
Ingeneral,alargerblocksizewillyieldbettercompressionratios.
WandeltandLeserAlgorithmsforMolecularBiology2012,7:30Page7of9http://www.
almob.
org/content/7/1/30shortestcompressedrepresentation.
Forinstance,some-timesmore(shorter)matchescanbeencodedmoree-cientlythanless(longer)matches.
WethinkthattheseeectsareimportanttobestudiedinFutureWork.
Theoverallindexsizeforthereferencesequence(perchromosome)isinaverage202MB,whileanaverageuncompressedinputsequenceisroughly130MBlong.
Thesizeoftheindexstructureisdecreasingwithdecreas-ingblocksize,sincethemaximumlengthofpathsinthesuxtreeislimited.
Additionalexperimentshavebeenconductedonyeastgenomes.
Wehavedownloaded339yeastgenomes,havechosenanarbitraryreferencesequence(273614N)andreferentiallycompressedtheother38genomes(allchro-mosomesconcatenatedtoeachother)withrespecttothereference.
Theaveragecompressionratiois61:1.
Thelowercompressionratiocomparedtohumansequencesisnotsurprising,sinceitiswellknownthattwoyeastgenomescanbelesssimilarthanevenahumangenomeandachimpanzeegenome.
Inthesecases(unoptimized)referentialcompressionschemesdonotobtainsuchnicecompressionratiosaswithhumangenomes.
RLZoptobtainscompressionratiosof50:1,andGDCobtainscompressionratiosof70:1-100:1,dependingonspeed-tradeos.
Inaverage,theindexsizeofanaverageyeastgnome(around12MB)wasfoundtobearound17.
5MB.
CompressionTimesForserialcompression,thecompressionalgorithmper-formscompressionononeblockatatime(inasinglethread).
Thememoryrequirementsduringcompressionareroughly6*BS(onereferencerawblock,oneinputgenomeblock,andthecompressedsuxtreeoftheinputblock).
Inadditionweuseawritebuerforoutputoper-ations(size:1MB),whichcanbeneglected.
Theresultsforserialcompressionofthe1000humangenomesareshowninFigure4.
Alltheresultswereaveragedoverallchromosomesasinput.
Clearlycompressiontimedecreaseswithincreasingblockindexsize,sincelesstimeisspentontraversingotherblockstondbestmatches.
Iftheblockindexsizeis1MB(equalstoroughly6MBmainmemoryusage),thecompressiontimesareroughly9secondsperchromo-some,i.
e.
around200secondstocompressawholehumangenome.
Thisyieldsacompressionspeedofaround15MB/s.
Withablockindexsizeof300MB(roughly1.
8GBmainmemoryusage),thecompressiontimeisdownto1.
5secondsperchromosome,i.
e.
around35secondsforcom-pressionawholehumangenome.
Thisyieldsacompres-sionspeedof85MB/s.
Thetimenecessarytocreatethecompressedsuxtreeforthereferenceisroughly20min-utes.
Ifonetakesthepositionthatindexingtimeshouldbelongtotheonlinesearchtime,thenthecompressionspeedisreducedto2.
42MB/sforasinglegenomesandto82.
87MB/sfor1000genomes.
However,weareconvincedthatindexconstructionshouldbetakenasanoinetask.
RLZoptobtainsacompressionspeedof1.
34MB/s,whileGDCachievescompressionspeedsof4-35MB/s,dependingonspeed-tradeos.
Therefore,ouradaptivealgorithmseemstobecompetitivewithexistingrela-tivecompressionschemes,whileusinglessmemoryinacontrollable(bychangingtheblocksize)way.
Inaddition,wehaveperformedexperimentswith39yeastgenomesasintheprevioussubsection.
Thecom-pressiontimeisaround70secondsforoneyeastgenome,yieldingacompressionspeedof0.
17MB/s.
Asexpected,ourlocalsearchheuristicdoesnotworkasgoodaswithhumangenomes.
Mostofthetimeisspentonconsultingthecompressedsuxtreewhichoftenonlyyieldsveryshortmatchesoflength10-20.
RLZopthasacompressionspeedof1.
6MB/sandGDCachievescompressionspeedsof2-33MB/s,dependingonspeed-tradeos.
Inthiscase,itseemstobeadvantageoustouseahashtableasanindexstructureinsteadofasuxtree,asproposedbytheauthorsofGDC.
Theparallelizationonmulticoresofourapproachisstraightforward.
Block-processingcanbeeasilydis-tributedonseveralCPUs(orevenmachines).
However,ithastobekeptinmindthateachcompressormightworkondierentpartsofthereferencegenome,whichmeansthatforncompressorthreadsthememoryusageisincreasedbythefactorn.
Theupperlimitmainmem-oryusageforseventhreadsis7GB(maximumsizeofthecompleteindex)+7BS,ifallindexstructuresareloadedintomainmemory.
Furtherinvestigationshowedthatthecompressiontimeinthelattercaseisdominatedbyloadingtheindexlesintomainmemory.
Basically,inthebeginningeachthreadrequestsasetofindexblocks,which(duetopar-allelaccess)mightbeloadedinapartialrandomaccessmanner.
Tocounteractthiseect,wehavepre-bulkloadedallindexstructuresatonceafterareboot(takingroughly80seconds)andthenperformedcompression.
Inthisset-ting,awholehumangenomecanbecompressedwithin11secondswith7threads.
Wethinkthatthisbulk-loadscenarioisactuallyquiterealistic,asscientistsoftenwanttocompress/decompresssetsofgenomesprior/afteranalysis.
DecompressionStatisticsForalltests-independentoftheindexblocksize-weobtainedanaveragedecompressiontimeof22seconds.
Furtherparallelizationdidnotimprovethesevalues,sincethedecompressionalgorithmisI/O-bound.
Duringthecompressionaround3.
1GBforthedecompressedgenomearewrittenontheharddisk.
GiventhatwemeasuredtheWandeltandLeserAlgorithmsforMolecularBiology2012,7:30Page8of9http://www.
almob.
org/content/7/1/30Figure4Compressionratiobyblocksize.
Theresultsforserialcompressionofthe1000humangenomesareshowninFigure4.
Alltheresultswereaveragedoverallchromosomesasinput.
Clearlycompressiontimedecreaseswithincreasingblockindexsize,sincelesstimeisspentontraversingotherblockstondbestmatches.
writespeedoftheharddiskwith150MB/s,itseemshardtofurtheroptimizethedecompressiontime.
WehavealsotriedtowriteonlycompressedFASTAlesonthedisk.
Then,however,thedecompressiontakesseveralminutes,becausethe(self-referential)compressionisCPU-bound.
Thedecompressiontimesforcompetitorsaresimilar,i.
e.
RLZoptachieves130MB/sandGDCupto150MB/s.
Weareconvincedthatformostdecompressorimple-mentationstheactuallimitationistheharddiskwritespeed.
ConclusionsWehaveproposedanadaptivereferentialcompressionschemesforgenomes.
Thecompressionspeedcanbecontrolledbyvaryingtheamountofmainmemoryforthecompressor.
Ourvariantwithlowestmainmem-oryfootprint(12MB)achievessimilarcompressionratesandcompressionspeeds,whileusingalmosttwoordersofmagnitudelessmainmemory.
Ourgreedyvariantusing9GBofmainmemoryis2-3timesasfastasthebestknownvariant.
Compressionspeedcanbefur-therimprovedbymassiveparallelizationondierentmachines.
Furtherworkshouldbedoneonimprovingthecom-pressionratio.
Forinstance,itwouldbepossibletondanecientencodingforinversecomplementmatchesorapproximatematches.
Investigationsoninter-speciesreferentialcompressionischallenging.
Sofar,referentialcompressiononlyworkswell,ifinputandreferencebelongtothesamespecies.
Thedevelopmentofamulti-speciesreferencesequencewouldallowformultipurposegenomecompressionalgorithms.
Ourinitialexperimentswithhumangenomecompressionwithrespecttoamousegenomeindicatethatthematchesareusuallyveryshortandtheadvantagesofreferentialcompressionaremitigated.
Endnotes1http://www2.
informatik.
hu-berlin.
de/wandelt/blockcompression2ftp://ftp.
1000genomes.
ebi.
ac.
uk/vol1/ftp/release/20110521/3ftp://ftp.
sanger.
ac.
uk/pub/dmc/yeast/latestCompetinginterestsTheauthorsdeclarethattheyhavenocompetinginterests.
Authors'contributionsAllauthorshavecontributedequallytothemaintext.
SebastianWandelthasimplementedthepresentedalgorithmsonC++.
Allauthorsreadandapprovedthenalmanuscript.
Received:25July2012Accepted:26October2012Published:12November2012References1.
ConsortiumIHGS:Initialsequencingandanalysisofthehumangenome.
Nature2001,409(6822):860–921.
2.
SchadtEE,TurnerS,KasarskisA:Awindowintothird-generationsequencing.
HumanMolGenet2010,19(R2):R227–R240.
[http://dx.
doi.
org/10.
1093/hmg/ddq416].
3.
InternationalCancerGenomeConsortiumDataPortal–aone-stopshopforcancergenomicsdata.
Database:thejournalofbiologicaldatabasesandcuration2011,2011(0):bar026.
[http://dx.
doi.
org/10.
1093/database/bar026].
4.
KahnSD:Onthefutureofgenomicdata.
Science2011,331(6018):728–729.
[http://www.
sciencemag.
org/content/331/6018/728.
abstract].
5.
FusaroVA,PatilP,GafniE,WallDP,TonellatoPJ:Biomedicalcloudcomputingwithamazonwebservices.
PLoSComputBiol2011,7(8):1–6.
[https://sremote.
pitt.
edu:11018/login.
aspxdirect=true&db=aph&AN=67016557&site=ehost-live].
6.
CloudcomputingandtheDNAdatarace.
NatBiotechnol2010,28(7):691–693.
[http://dx.
doi.
org/10.
1038/nbt0710-691].
7.
Thecaseforcloudcomputingingenomeinformatics.
GenomeBiol2010,11(5):207+.
[http://dx.
doi.
org/10.
1186/gb-2010-11-5-207].
8.
TrellesO,PrinsP,SnirM,JansenRC:Bigdata,butarewereadyNatRevGenet2011,12(3):224.
[http://dx.
doi.
org/10.
1038/nrg2857-c1].
9.
PennisimE:WillcomputerscrashgenomicsScience2011,331(6018):666–668.
[http://dx.
doi.
org/10.
1126/science.
331.
6018.
666].
10.
GiancarloR,ScaturroD,UtroF:Textualdatacompressionincomputationalbiology:algorithmictechniques.
ComputSciRevJanuary2012,6(1):1–25.
WandeltandLeserAlgorithmsforMolecularBiology2012,7:30Page9of9http://www.
almob.
org/content/7/1/3011.
Nalbantoglu¨OU,RussellDJ,SayoodK:Datacompressionconceptsandalgorithmsandtheirapplicationstobioinformatics.
Entropy2010,12:34–52.
[http://www.
mdpi.
com/1099-4300/12/1/34/].
12.
AntoniouD,TheodoridisE,TsakalidisA:Compressingbiologicalsequencesusingselfadjustingdatastructures.
In10thIEEEInternationalConferenceonInformationTechnologyandApplicationsinBiomedicine;2010:1–5.
13.
PratasD,PinhoAJ:CompressingtheHumanGenomeUsingExclusivelyMarkovModels.
InPACBB,Volume93ofAdvancesinIntelligentandSoftComputing.
EditedbyRochaMP,JMCRodrguez,Fdez-RiverolaF,ValenciaA:Springer;2011:213–220.
[http://dblp.
uni-trier.
de/db/conf/pacbb/pacbb2011.
html#PratasP11].
14.
DeorowiczS,GrabowskiS:Robustrelativecompressionofgenomeswithrandomaccess.
Bioinformatics2011,27(21):2979–2986.
[http://dx.
doi.
org/10.
1093/bioinformatics/btr505].
15.
KuruppuS,PuglisiS,ZobelJ:RelativeLempel-Zivcompressionofgenomesforlarge-scalestorageandretrieval.
InStringProcessingandInformationRetrieval,Volume6393ofLectureNotesinComputerScience.
EditedbyChavezE,LonardiS.
Berlin/Heidelberg:Springer;2010:201–206.
16.
KuruppuS,PuglisiSJ,ZobelJ:RelativeLempel-Zivcompressionofgenomesforlarge-scalestorageandretrieval.
InProceedingsofthe17thinternationalconferenceonStringprocessingandinformationretrieval,SPIRE'10.
Berlin,Heidelberg:Springer-Verlag;2010:201–206.
[http://dl.
acm.
org/citation.
cfmid=1928328.
1928353].
17.
UkkonenE:On-lineconstructionofsuxtrees.
Algorithmica1995,14:249–260doi:10.
1007/BF01206331.
[http://dx.
doi.
org/10.
1007/BF01206331].
18.
OhlebuschE,FischerJ,GogS:CST++.
InStringProcessingandInformationRetrieval-17thInternationalSymposium,SPIRE2010;2010:322–333.
19.
KuruppuS,Beresford-SmithB,ConwayT,ZobelJ:IterativedictionaryconstructionforcompressionoflargeDNAdatasets.
IEEE/ACMTransComputBiolBioinformatics2012,9:137–149.
[http://dx.
doi.
org/10.
1109/TCBB.
2011.
82].
20.
DucCaoM,DixTI,AllisonL,MearsC:Asimplestatisticalalgorithmforbiologicalsequencecompression.
InProceedingsofthe2007DataCompressionConference.
Washington,DC,USA:IEEEComputerSociety;2007:43–52.
[http://dl.
acm.
org/citation.
cfmid=1251981.
1252877].
21.
ChristleyS,LuY,LiC,XieX:Humangenomesasemailattachments.
Bioinformatics2009,25(2):274–275.
[http://dx.
doi.
org/10.
1093/bioinformatics/btn582].
22.
GrabowskiS,DeorowiczS:Engineeringrelativecompressionofgenomes.
ArXiv2011,[http://arxiv.
org/abs/1103.
2351].
23.
KreftS,NavarroG:LZ77-likecompressionwithfastrandomaccess.
InProceedingsofthe2010DataCompressionConference,DCC'10.
Washington,DC,USA:IEEEComputerSociety;2010:239–248.
[http://dx.
doi.
org/10.
1109/DCC.
2010.
29].
24.
GrumbachS,TahiF:CompressionofDNAsequences.
InDataCompressionConference;1993:340–350.
25.
ChenX,KwongS,LiM:AcompressionalgorithmforDNAsequencesanditsapplicationsingenomecomparison.
InProceedingsofthefourthannualinternationalconferenceonComputationalmolecularbiology,RECOMB'00.
NewYork,NY,USA:ACM;2000:107.
[http://doi.
acm.
org/10.
1145/332306.
332352].
26.
ManziniG,RasteroM:AsimpleandfastDNAcompressor.
Software-PracticeandExperience2004,34:1397–1411.
27.
BehzadiB,LeFessantF:DNAcompressionchallengerevisited:adynamicprogrammingapproach.
InCombinatorialPatternMatching,Volume3537ofLectureNotesinComputerScience.
EditedbyApostolicoA,CrochemoreM,ParkK.
Berlin/Heidelberg:Springer;2005:85–96.
28.
CherniavskyN,LadnerR:Grammar-basedcompressionofDNAsequences.
2004.
[Unpublishedwork].
29.
CaoMD,DixTI,AllisonL,MearsC:Asimplestatisticalalgorithmforbiologicalsequencecompression.
DataCompressionConference2007,0:43–52.
30.
MatsumotoT,SadakaneK,ImaiH:Biologicalsequencecompressionalgorithms.
GenomeInformatics2000,11:43–52.
31.
SakibMN,TangJ,ZhengWJ,HuangCT:Improvingtransmissioneciencyoflargesequencealignment/map(SAM)les.
PLoSONE2011,6(12):e28251.
[http://dx.
doi.
org/10.
1371].
32.
Hsi-YangFritzM,LeinonenR,CochraneG,BirneyE:Ecientstorageofhighthroughputsequencingdatausingreference-basedcompression.
GenomeRes2011,21(5):734–740.
[http://genome.
cshlp.
org/content/early/2011/01/18/gr.
114819.
110.
abstract].
33.
BOOSTC++Libraries.
[http://www.
boost.
org].
34.
OhlebuschE,FischerJ,GogS:CST++.
InSPIRE'10;2010:322–333.
35.
ConsortiumGP:Amapofhumangenomevariationfrompopulation-scalesequencing.
Nature2010,467(7319):1061–1073.
[http://dx.
doi.
org/10.
1038/nature09534].
36.
DanecekP,AutonA,AbecasisG,AlbersCA,BanksE,DePristoMA,HandsakerRE,LunterG,MarthGT,SherryST,McVeanG,DurbinR,1000GenomesProjectAnalysisGroup:ThevariantcallformatandVCFtools.
Bioinformatics(Oxford,England)2011,27(15):2156–2158.
[http://dx.
doi.
org/10.
1093/bioinformatics/btr330].
37.
KentWJ,SugnetCW,FureyTS,RoskinKM,PringleTH,ZahlerAM,HausslerD:ThehumangenomebrowseratUCSC.
GenomeRes2002,12(6):996–1006.
[http://www.
ncbi.
nlm.
nih.
gov/entrez/query.
fcgicmd=Retrieve&db=PubMed&dopt=Citation&listuids=12045153].
doi:10.
1186/1748-7188-7-30Citethisarticleas:WandeltandLeser:Adaptiveecientcompressionofgenomes.
AlgorithmsforMolecularBiology20127:30.
SubmityournextmanuscripttoBioMedCentralandtakefulladvantageof:ConvenientonlinesubmissionThoroughpeerreviewNospaceconstraintsorcolorgurechargesImmediatepublicationonacceptanceInclusioninPubMed,CAS,ScopusandGoogleScholarResearchwhichisfreelyavailableforredistributionSubmityourmanuscriptatwww.
biomedcentral.
com/submit

展开全文

hardfedora16相关文档

支持ipad 支持ipad apple.com.cn苹果官方网址到底是http://store.apple.com/cn/？还是 http://www.apple.com.cn？？？？magentomagento是免费的吗?新iphone也将禁售苹果ID换了个新的怎么还是停用 360邮箱请问360邮箱怎么申请申请支付宝账户怎样申请支付宝账户？要填写什么信息？客服电话各银行的客服电话是多少?刚刚网刚刚在网上认识了一个女孩子，不是很了解她，就跟她表白了。瑞东集团福能集团是一个什么企业? 电信测速器创宇云云图标地址大全最好看的qq空间警告本网站美国保护七夕快乐英文免费个人空间双线主机 linux服务器维护美国网站服务器优酷黄金会员账号共享云营销系统秒杀品谷歌台湾 netvigator 网站防护谷歌搜索打不开 register.com 网络安装更多

hardfedora16

湖北22元/月（昔日数据）云服务器，国内湖北十堰云服务器,首月6折

wordpress公司网站模板 wordpress简洁高级通用公司主题

pacificrack：2021年七夕VPS特别促销，$13.14/年，2G内存/2核/60gSSD/1T流量，支持Windows