Putnam et al. BMC Bioinformatics 2013, 14:369
http://www.biomedcentral.com/1471-2105/14/369

SOFTWARE Open Access

A comparison study of succinct data structures for use in GWAS
Patrick P Putnam1,2*, Ge Zhang2* and Philip A Wilsey1

Abstract
Background: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed.
Schadt et al (NRG 11:647–657, 2010) offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed.
The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques.
In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data.
We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes.
Results: We perform a comparison of 2- and 3-bit genotype encoding schemes for use in genotype counting algorithms.
We find that there is a 20% overhead when building simple frequency tables from 2-bit encoded genotypes.
However, building pairwise count tables for genome-wide epistasis is 1.0% more efficient.
Conclusions: In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed genotype data representations in Genome-Wide Association Studies (GWAS).
We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme.
We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms.
In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.

*Correspondence: putnampp@gmail.com; zhangge.uc@gmail.com
1 Experimental Computing Lab, School of Electronic and Computing Systems, PO Box 210030, Cincinnati, OH 45221–0030, USA
2 Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
© 2013 Putnam et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed.
Schadt et al [1] offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed.
The majority of tools used in GWA data analysis typically assume that a data set will easily fit into the main memory of a desktop computer.
Most desktop computers have around 4–16 GB of main memory, which is more than enough to fit a data set of 1 million variants by tens of thousands of individuals.
However, data set sizes continue to grow with advancements in analysis techniques and technologies.
For example, techniques like genotype imputation [2] attempt to expand data sets by deriving missing genotypes from reference panels.
Genotyping technologies such as Illumina's Omni SNP HumanOmni5-Quad chips allow for genotyping of upwards of 5 million markers [3].
Furthermore, genome sequencing technologies are advancing to the point where determining genotypes via whole genome sequencing may be a viable option.
Having an individual's entire DNA sequence opens the door for even more genetic markers to be analyzed.
The 1000 Genomes project [4] now includes roughly 36.7 million variants in the human genome.
The size of a data file used to represent the genotypes of 1000 individuals would be roughly 37 GB (assuming 1 byte is used to store each genotype).
There are several options for handling data sets of this size.
First, the cost of upgrading a standard PC's memory to handle this amount of data is not unreasonable.
Second, the algorithm can be extended to utilize memory mapping techniques [5], which effectively page chunks of the data file into main memory as they are needed.
A third option is to modify the format for representing genotypes such that the genotypes are expressed in their most succinct form [6,7].
This manuscript explores the last option more deeply.
The interest is motivated in part by the desire to work in the General-Purpose Graphic Processing Units (GPGPU) space, which has somewhat limited memory, especially when considered on a processor-by-processor basis.
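
The 37 GB estimate above is plain arithmetic, and the same calculation indicates what a denser encoding would save. The following is a minimal sketch only: `dataset_bytes` is a hypothetical helper (it is not part of libgwaspp), and it ignores file headers and any per-row padding.

```cpp
#include <cstdint>

// Bytes needed to store variants x individuals genotypes at a fixed
// number of bits per genotype, rounded up to a whole byte.
// Hypothetical helper for illustration; not part of libgwaspp.
constexpr std::uint64_t dataset_bytes(std::uint64_t variants,
                                      std::uint64_t individuals,
                                      std::uint64_t bits_per_genotype) {
    return (variants * individuals * bits_per_genotype + 7) / 8;
}
```

Under these assumptions, `dataset_bytes(36700000, 1000, 8)` gives 36.7 × 10^9 bytes (the roughly 37 GB quoted above), while 3 and 2 bits per genotype give about 13.8 × 10^9 and 9.2 × 10^9 bytes respectively.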
The compression of genotype encoding data is most effectively performed using succinct data structures [8].
Succinct data structures allow compression rates close to the information-theoretic limits and yet preserve the ability to access individual data elements.
In the genotype analysis tools that use succinct data types (e.g., BOOST [6] and BiForce [9]), a 3-bit genotype representation for biallelic markers has been adopted.
While a 3-bit representation does provide a succinct data structure, it is not the most succinct.
More precisely, from an information theoretic perspective, 3 bits are able to represent up to 8 unique values.
However, there are only 4 commonly used unphased genotypes, namely {NN, AA, Aa, aa} where NN is used to represent missing data.
This means that a 2-bit representation is the information theoretic lower bound and its use would provide an even more compact representation.
An important consideration when designing succinct data structures is data element orientation in memory.
BOOST [6] and BiForce [9] adopted a vectored orientation for representing data elements.
The vectored orientation spreads each data element over multiple bit vectors.
In other words, they utilize 3 bit vectors per marker to represent the set of genotypes.
The advantages of this orientation are discussed later.
This manuscript makes two important contributions in the use of succinct data structures for genomic encoding.
In particular, (i) we implement a technique to reduce genotype encoding to a 2 bit vector form, and (ii) we compare the performance of the new 2-bit encoding to the conventional 3 bit vector encoding.
From these studies, we have observed that the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.

Implementation
We analyzed a commonly used 3-bit binary representation of genotypes from performance and scalability perspectives.
With this information we developed a C++ object library that we have named libgwaspp.
The library provides data structures for managing genotype data tables in a 2- or 3-bit representation.
Finally, we benchmarked the two representations on randomly generated data sets of various scales.

Genome-wide association studies
DNA from individuals is collected, sequenced or genotyped, and the genotypes for genetic variants are used in Genome-Wide Association Studies (GWAS).
These studies aim to determine whether genetic variants are associated with certain traits, or phenotypes.
The most common studies are case-control studies which group individuals together into two sets based on the presence (case) or absence (control) of a specific trait.
These studies typically rely upon various statistical tests based upon the genotypic or allelic distribution of the variants in each set.
An average data set aims to compare thousands of individuals by hundreds of thousands to millions of variants.
GWA studies can be computationally intensive to perform.
Common algorithms consider either each variant individually, or variants in combination with one another.
For example, measuring the odds ratio for each variant in a case-control study is one way of identifying variants which may be associated with the trait in question.
An epistasis analysis algorithm, such as BOOST [6], compares the genotype distribution of two variants in each step.
In both of these algorithms, the basic task is counting the occurrences of each genotype in each of the case-control sets.
In other words, the first step in determining the odds ratio is to build a frequency table (Table 1) for both the case and control sets at a specific variant.
Similarly, the BOOST [6] algorithm first builds a contingency table (Table 2), or pairwise genotype count table, for a pair of variants.

Binary genotype encoding schemes
A common way to minimize the impact of the table building bottleneck is to fully utilize processor throughput by counting genotypes from multiple individuals in one step.
The binary encoding of genotypes adopted by BOOST [6] improves the computational efficiency of the epistasis algorithm.
The algorithm uses 3 bit-vectors to encode genotype data.
In this scheme each genotype has its own bit-vector, or stream, of data.
Each bit corresponds to an indexed individual, and the indexing is assumed to be constant across all markers.
A set bit indicates that the individual has the corresponding genotype for the specified marker.
Therefore, every variant requires 3 vectors to fully represent the genotypes.
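
As an illustration of this layout, the following sketch packs one marker's genotypes into three bit-vectors made of 64-bit blocks. The names `Genotype`, `ThreeBitMarker` and `encode3` are hypothetical and are not taken from BOOST or libgwaspp. The sketch also assumes that individual i maps to bit (i mod 64) of block (i / 64), counting from the least significant bit.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Genotype { AA, Aa, aa, NN };  // NN marks a missing genotype

// One bit-vector per genotype; bit i refers to individual i in every vector.
struct ThreeBitMarker {
    std::vector<std::uint64_t> AA, Aa, aa;
};

// Encode the genotypes of one marker (assumed layout, see lead-in).
ThreeBitMarker encode3(const std::vector<Genotype>& g) {
    const std::size_t blocks = (g.size() + 63) / 64;
    ThreeBitMarker m;
    m.AA.assign(blocks, 0);
    m.Aa.assign(blocks, 0);
    m.aa.assign(blocks, 0);
    for (std::size_t i = 0; i < g.size(); ++i) {
        const std::uint64_t bit = std::uint64_t{1} << (i % 64);
        switch (g[i]) {
            case Genotype::AA: m.AA[i / 64] |= bit; break;
            case Genotype::Aa: m.Aa[i / 64] |= bit; break;
            case Genotype::aa: m.aa[i / 64] |= bit; break;
            case Genotype::NN: break;  // missing: no bit set in any vector
        }
    }
    return m;
}
```

For a marker with genotypes AA, Aa, AA, aa, NN across five individuals, the single resulting block holds the values 5, 2 and 8 in the AA, Aa and aa vectors respectively.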

Table 1 Frequency table for raw input from Tables 3, 4 and 5
        AA   Aa   aa   NN
CA       2    1    1    1
CB       2    1    2    0

Table 2 Pairwise genotype count table for two markers
                      MB
             AA   Aa   aa   NN   CA
MA    AA      1    0    1    0    2
      Aa      1    0    0    0    1
      aa      0    0    1    0    1
      NN      0    1    0    0    1
      CB      2    1    2    0
Note that the marginal sums of this table are the individual markers' frequencies from Table 1.

There are two key benefits of using this binary encoding scheme.
The first is that the task of building a frequency table for a given marker is reduced to calculating the Hamming distance between each of its bit-vectors and a bit-vector of all zeros.
This distance is also referred to as the Hamming weight.
The technique used for calculating the Hamming weight of a bit-vector is to divide the bit-vector into manageable blocks, and sum the Hamming weight of each block.
The block size is typically linked to the processor word size, typically 32 or 64 bits (4 or 8 bytes).
The algorithm for computing the Hamming weight of an individual block is commonly referred to as Population Counting (popcount).
We chose to follow the BOOST implementation of popcount, which looks up the Hamming weight of 16-bit blocks in a pre-populated weight table.
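
A table-driven popcount of this kind can be sketched as follows. This is an illustration of the general technique rather than the BOOST or libgwaspp source: the names `weight_table`, `popcount64` and `hamming_weight` are hypothetical, and a 64-bit block size is assumed.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Weight table: entry v holds the number of set bits in the 16-bit value v.
// Filled once with the recurrence weight(v) = weight(v >> 1) + (v & 1).
inline const std::array<std::uint8_t, 65536>& weight_table() {
    static const std::array<std::uint8_t, 65536> table =
        []() -> std::array<std::uint8_t, 65536> {
            std::array<std::uint8_t, 65536> t{};
            for (std::uint32_t v = 1; v < 65536; ++v)
                t[v] = static_cast<std::uint8_t>(t[v >> 1] + (v & 1u));
            return t;
        }();
    return table;
}

// Hamming weight of one 64-bit block: four 16-bit table look-ups.
inline unsigned popcount64(std::uint64_t x) {
    const std::array<std::uint8_t, 65536>& t = weight_table();
    return t[x & 0xFFFF] + t[(x >> 16) & 0xFFFF] +
           t[(x >> 32) & 0xFFFF] + t[x >> 48];
}

// Hamming weight of a whole bit-vector: the sum of the per-block weights.
inline std::size_t hamming_weight(const std::vector<std::uint64_t>& v) {
    std::size_t w = 0;
    for (std::uint64_t block : v) w += popcount64(block);
    return w;
}
```

The 65,536-entry table occupies 64 KiB, so whether it outperforms a bit-twiddling or hardware popcount depends on cache behaviour and is best measured rather than assumed.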
The second benefit is that it reduces genotype comparison logic to simple Boolean logic operations.
More specifically, the task of counting individuals which have a specific combination of genotypes for two markers is simplified to finding the Hamming weight of the logical AND of the genotype bit vectors.
This is useful when building contingency tables.
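
The AND-then-count step can be sketched as below for a 3 × 3 table over the non-missing genotypes. The names `BitVector`, `count_both` and `contingency3` are hypothetical; `std::bitset<64>::count()` stands in for whichever popcount routine is actually used.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BitVector = std::vector<std::uint64_t>;
using Table3x3  = std::array<std::array<std::size_t, 3>, 3>;

// Number of individuals whose bit is set in both vectors:
// the Hamming weight of the block-wise logical AND.
std::size_t count_both(const BitVector& a, const BitVector& b) {
    assert(a.size() == b.size());
    std::size_t n = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        n += std::bitset<64>(a[i] & b[i]).count();
    return n;
}

// Rows index the first marker's genotype vectors (AA, Aa, aa),
// columns index the second marker's.
Table3x3 contingency3(const std::array<BitVector, 3>& first,
                      const std::array<BitVector, 3>& second) {
    Table3x3 t{};
    for (std::size_t r = 0; r < 3; ++r)
        for (std::size_t c = 0; c < 3; ++c)
            t[r][c] = count_both(first[r], second[c]);
    return t;
}
```

With individual 1 stored in the least significant bit (an assumption of this sketch), two markers with genotypes (AA, Aa, AA, aa, NN) and (AA, AA, aa, aa, Aa) are encoded as {5, 2, 8} and {3, 16, 12}, and the call returns the rows 1 0 1, 1 0 0 and 0 0 1.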

Table 3 Example genotype input
        I1   I2   I3   I4   I5
MA      AA   Aa   AA   aa   NN
MB      AA   AA   aa   aa   Aa
I1-5 represent individuals, and MA and MB are markers.

Table 4 3-bit encoding scheme
             I1   I2   I3   I4   I5
MA    AA      1    0    1    0    0
      Aa      0    1    0    0    0
      aa      0    0    0    1    0
MB    AA      1    1    0    0    0
      Aa      0    0    0    0    1
      aa      0    0    1    1    0

Of interest to this paper is the fact that when using the 3-bit encoding scheme at least two thirds of the bits used will be unset.
An information theoretic analysis of the genotype alphabet indicates that 2 bits are sufficient to uniquely represent each of the four unphased genotypes.
The immediate benefit is a one third reduction in memory consumption (Tables 3, 4 and 5).
The caveat to this encoding scheme is that determining a genotype requires both bits.
The algorithm in Figure 1 is a pseudo-code representation of how to build a genotype count table from 2-bit encoded data.
The Hamming weight of each vector is the number of individuals with (AA or aa), and (Aa or aa) genotypes, respectively.
To disambiguate the values it is necessary to compute the Hamming weight of the logical AND of the bit-vectors.
This value represents the number of (aa) genotypes, and subtracting it from the previous two weights will result in the appropriate counts.
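
The counting procedure just described can be written out as a short C++ sketch. The names `GenotypeCounts` and `count2` are hypothetical, and `std::bitset<64>::count()` is used in place of the table-driven popcount purely to keep the example self-contained.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct GenotypeCounts { std::size_t AA, Aa, aa; };

// x: bit i is set when individual i is AA or aa.
// y: bit i is set when individual i is Aa or aa.
GenotypeCounts count2(const std::vector<std::uint64_t>& x,
                      const std::vector<std::uint64_t>& y) {
    assert(x.size() == y.size());
    std::size_t wx = 0, wy = 0, wxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        wx  += std::bitset<64>(x[i]).count();         // AA or aa
        wy  += std::bitset<64>(y[i]).count();         // Aa or aa
        wxy += std::bitset<64>(x[i] & y[i]).count();  // aa only
    }
    // Remove the doubly counted aa individuals from both weights.
    return {wx - wxy, wy - wxy, wxy};
}
```

With individual 1 in the least significant bit (an assumption of this sketch), a marker with genotypes AA, Aa, AA, aa, NN is encoded as x = 13 and y = 10, and `count2` returns AA = 2, Aa = 1, aa = 1. The number of missing genotypes is then the number of individuals minus the sum of the three counts.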
The algorithm in Figure 2 illustrates the construction of a pairwise genotype count table, or contingency table.
A contingency table represents the number of individuals who possess a genotype combination for a pair of markers.
When using the 3-bit encoding scheme, each cell of the table is simply the Hamming weight of the logical AND of the genotype bit-vectors for the two markers.
The 2-bit encoding requires an inline transformation step to convert the 2-bit encoded data into 3-bit data.
This step is necessary to be able to take advantage of the popcount bit counting method.
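
One plausible form of this inline transformation is sketched below. It is an assumption based on the definition of the two bit-vectors, not a transcription of Figure 2: with x marking (AA or aa) and y marking (Aa or aa), the per-genotype blocks are x AND NOT y, y AND NOT x, and x AND y. The names `expand2to3` and `contingency2` are hypothetical.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BitVector = std::vector<std::uint64_t>;
using Table3x3  = std::array<std::array<std::size_t, 3>, 3>;

// Expand one block of 2-bit data into its {AA, Aa, aa} blocks.
inline std::array<std::uint64_t, 3> expand2to3(std::uint64_t x, std::uint64_t y) {
    return {{x & ~y, ~x & y, x & y}};
}

// Pairwise count table for two 2-bit encoded markers (xa, ya) and (xb, yb).
Table3x3 contingency2(const BitVector& xa, const BitVector& ya,
                      const BitVector& xb, const BitVector& yb) {
    assert(xa.size() == ya.size() && xa.size() == xb.size() && xa.size() == yb.size());
    Table3x3 t{};
    for (std::size_t i = 0; i < xa.size(); ++i) {
        const std::array<std::uint64_t, 3> a = expand2to3(xa[i], ya[i]);
        const std::array<std::uint64_t, 3> b = expand2to3(xb[i], yb[i]);
        for (std::size_t r = 0; r < 3; ++r)
            for (std::size_t c = 0; c < 3; ++c)
                t[r][c] += std::bitset<64>(a[r] & b[c]).count();
    }
    return t;
}
```

Padding bits beyond the last individual are zero in both x and y, so none of the three expanded blocks picks them up. Each iteration reads four blocks from memory and derives the per-genotype blocks with Boolean operations.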
Both of the above algorithms can be further improved by incorporating additional information.
For example, the algorithm for building a contingency table can be simplified if marginal information for both variants is available.
The contingency table algorithm can make use of the variants' frequency tables and reduce the number of Hamming weight values it has to compute from 9 to only 4.
The remaining values can be easily computed by subtracting the row and column sums from their respective marginal information values.
This reduction offers significant computational savings, especially when performing exhaustive epistasis analysis.
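
A minimal sketch of this shortcut follows, under one loudly stated assumption: neither marker has missing (NN) genotypes, so every row and column of the 3 × 3 table sums exactly to the corresponding marginal count. The function and type names are hypothetical.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BitVector = std::vector<std::uint64_t>;
using Counts    = std::array<std::size_t, 3>;  // AA, Aa, aa
using Table3x3  = std::array<Counts, 3>;

static std::size_t count_both(const BitVector& a, const BitVector& b) {
    assert(a.size() == b.size());
    std::size_t n = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        n += std::bitset<64>(a[i] & b[i]).count();
    return n;
}

// Only the AA and Aa vectors of each marker are read; the aa row and
// column follow from the marginals (assumes no missing genotypes).
Table3x3 contingency_from_marginals(const BitVector& a_AA, const BitVector& a_Aa,
                                    const BitVector& b_AA, const BitVector& b_Aa,
                                    const Counts& marg_a, const Counts& marg_b) {
    Table3x3 t{};
    // The four cells that still need a popcount.
    t[0][0] = count_both(a_AA, b_AA);
    t[0][1] = count_both(a_AA, b_Aa);
    t[1][0] = count_both(a_Aa, b_AA);
    t[1][1] = count_both(a_Aa, b_Aa);
    // Last column: row marginal minus the two counted cells in that row.
    for (std::size_t r = 0; r < 2; ++r) t[r][2] = marg_a[r] - t[r][0] - t[r][1];
    // Last row: column marginal minus the two cells above it.
    for (std::size_t c = 0; c < 3; ++c) t[2][c] = marg_b[c] - t[0][c] - t[1][c];
    return t;
}
```

When missing genotypes are present, the NN row and column of the full table would have to be accounted for before the subtraction is valid; that case is outside this sketch.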

Table 5 2-bit encoding scheme
                   I1   I2   I3   I4   I5
MA    AA OR aa      1    0    1    1    0
      Aa OR aa      0    1    0    1    0
MB    AA OR aa      1    1    1    1    0
      Aa OR aa      0    0    1    1    1

Constructing a frequency table from 2-bit encoded genotypes
    AA ← 0
    Aa ← 0
    aa ← 0
    for i = 0 → N − 1 do               ▷ N is the number of blocks per bit vector
        x ← A[i]                       ▷ A is the (AA or aa) genotype bit vector
        y ← B[i]                       ▷ B is the (Aa or aa) genotype bit vector
        aa ← aa + popcount(x ∧ y)
        Aa ← Aa + popcount(y)
        AA ← AA + popcount(x)
    end for
    AA ← AA − aa
    Aa ← Aa − aa
Figure 1 Constructing a frequency table from 2-bit encoded genotypes.

Benchmarking
We compared the performance of the 2-bit encoded data to the 3-bit encoded data.
In particular, we measured the runtime for building frequency tables and contingency tables using both encoding schemes.
The runtime of these algorithms is dependent upon the number of columns, or individuals, in each row.
Therefore, we decided to hold the number of rows constant at 10,000 variants.
We varied the number of columns between 1,000 and 50,000 individuals.
We also tested a set with 150,000 individuals as an extreme scale experiment.
The genotypes were simulated following the empirical allele frequency spectrum of Affymetrix array 6.0 SNPs of the CEU HapMap samples.
Similarly, individuals were randomly classified as either a case or control.
Three experiments were conducted.
First, for each data set the runtime for building frequency tables for each of the variants was measured.
Second, for each data set the runtime for building all contingency tables for an exhaustive pairwise epistasis test was measured.
Third, each data set was run through our implementation of the BOOST [6] algorithm and the total runtime was recorded.
The runtime of the BOOST [6] algorithm does not include the time to load the compressed data set into main memory.
In each of these tests, the average runtime is calculated and presented.
All tests were conducted upon a desktop computer with a 3.2 GHz Intel Core i7-3930K, 32 GB of 1600 MHz DDR3 memory, with 64-bit Fedora 17.
Time was measured down to the nanosecond using the clock_gettime() glibc function.
We used GNU G++ compiler 4.7, and compiled using the standard "-O3" compiler optimization flag.
The tests were performed using a 64-bit block size.
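
A timing harness along these lines might look like the sketch below. The clock identifier is an assumption: the text names only `clock_gettime()`, and `CLOCK_MONOTONIC` is chosen here because it is not affected by wall-clock adjustments. The helper names `now_ns` and `average_ns` are hypothetical.

```cpp
#include <cstdint>
#include <time.h>

// Current reading of the monotonic clock, in nanoseconds.
inline std::int64_t now_ns() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);  // assumed clock id, see lead-in
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Average run time of fn over `repeats` calls, in nanoseconds.
template <typename Fn>
double average_ns(Fn&& fn, int repeats) {
    const std::int64_t start = now_ns();
    for (int i = 0; i < repeats; ++i) fn();
    return static_cast<double>(now_ns() - start) / repeats;
}
```

Averaging over many repetitions matters here because a single frequency-table build takes on the order of microseconds, which is close to the overhead of the clock call itself.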

Results
The first experiment measured the runtime for building frequency tables.
Initially, the 3-bit encoding scheme appeared to offer a consistent performance advantage over the 2-bit encoding.
As the number of individuals increased, the 3-bit scheme took less time to construct the count table (Figure 3).
The average time to build a genotype count table for fewer than 10,000 individuals is less than 1 μs.
For data sets greater than 10,000 individuals, there is some performance overhead that results from decoding the 2-bit vectors.
Building frequency tables from the 3-bit encoded data proved to be 12–25% faster than when built from 2-bit encoded data.
In the extreme scale data set there was a 5.00 μs difference in favor of the 3-bit scheme.
However, the second experiment offered different results.
The second experiment measured the runtime for building contingency tables for all pairs of variants in the data sets.
In this experiment, the 2-bit encoding scheme offered better performance.
Similar to the first experiment, 10,000 individuals seemed to be the diverging point (Figure 4).
At sizes greater than 10,000 individuals, the 2-bit encoding scheme offered a 1% performance improvement over the 3-bit scheme.
With 150,000 individuals, this equates to about a 0.32 μs difference in average performance.
The third experiment further confirms this performance gain (Table 6).

Figure 2 Constructing a contingency table from 2-bit encoded genotypes.

Figure 3 Average Case/Control frequency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals. [Plot: "Case-Control Frequency Table", average build time for 10000 variants following the Affy6 genotype distribution; x-axis Individuals, y-axis Time (s); series: 2-bit encoding scheme, 3-bit encoding scheme.]

Discussion
This work focuses on ways to address frequency table building processes found in GWAS for two primary reasons.
First, upstream steps, like the loading of data, in a general GWAS pipeline are performed relatively infrequently, and can be performed offline.
For example, a data set can be transformed into an optimized format once, and in every repeat analysis of the data set the loading becomes a constant time step within the pipeline.
Conversely, the building of these tables amounts to a frequently recurring step which is typically performed inline under varying conditions.
Secondly, we viewed the table building process as a bottleneck for downstream analytical steps.
Offering an approach which positively impacts the cost associated with this bottleneck is beneficial.
The results suggest that the use of a 2-bit encoding scheme for genotype data does offer several benefits over a 3-bit encoding scheme.
The compact encoding scheme requires 33% less memory for representing the same data.
Aside from freeing up system memory for other tasks, the memory savings can be beneficial for other reasons.
For example, epistasis algorithms like BOOST [6] can be run on Graphic Processing Units.
GPUs are separate devices on a computer which have their own physical memory, typically less than 6 GB, and require data to be copied to and from the device.
The limited memory and data transfer issues both benefit from using a more compact data format.

Figure 4 Average Case/Control contingency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals. [Plot: "Case-Control Contingency Table", average build time for 10000 variants following the Affy6 genotype distribution; x-axis Individuals, y-axis Time (s); series: 2-bit encoding scheme, 3-bit encoding scheme.]

Table 6 Epistasis runtime comparison
Individuals      2-bit        3-bit        Speedup (%)
1000             28.56 s      28.45 s       0.37
5000             92.07 s      93.32 s      -1.33
10000           173.12 s     177.46 s      -2.45
25000           418.31 s     420.71 s      -0.57
50000           810.71 s     820.26 s      -1.16
150000         2408.05 s    2427.84 s      -0.81
Speedup is measured relative to the 3-bit runtime; negative values indicate that the 2-bit runtime was shorter.

The 2-bit encoded genotypes have also been used by other software packages.
PLINK [7], for example, uses a 2-bit encoding in the BED file format.
BED files use a contiguous pairing of bits to express the genotype of an individual.
Using bit pairs allows for more efficient individual genotype decoding as a result of the bits existing in the same bit-block.
However, additional bit masking steps need to be applied to each block to effectively utilize popcount based methods for counting genotype occurrences within a block.
As mentioned earlier, our implementation adopts a bit-vectored approach, whereby an individual's genotype is divided over two separate vectors.
This is primarily done to reduce the number of masking steps.
In either case, some form of genotype disambiguation is necessary.
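
To make the extra masking steps concrete, the sketch below counts occurrences of one 2-bit value in a block that stores genotypes as contiguous bit pairs. It illustrates the general paired-bit layout only: the pair positions (individual k in bits 2k and 2k+1) and the numeric codes 0 to 3 are assumptions of this sketch and are not claimed to match PLINK's BED encoding. `count_code` is a hypothetical name.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Mask selecting the low bit (bit 2k) of every pair in a 64-bit block.
constexpr std::uint64_t LOW_BITS = 0x5555555555555555ULL;

// Count the pairs in w whose 2-bit value equals code (0..3).
inline std::size_t count_code(std::uint64_t w, unsigned code) {
    std::uint64_t lo = w & LOW_BITS;         // low bit of each pair
    std::uint64_t hi = (w >> 1) & LOW_BITS;  // high bit of each pair, shifted down
    if (!(code & 1u)) lo = ~lo & LOW_BITS;   // low bit must be clear
    if (!(code & 2u)) hi = ~hi & LOW_BITS;   // high bit must be clear
    return std::bitset<64>(lo & hi).count();
}
```

Each count needs a shift and up to four mask operations before the popcount, whereas the bit-vectored layout needs at most one AND. Note also that unused pairs at the end of a block read as code 0 and must be discounted by the caller.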
There is an overhead associated with this decoding step, and it can be felt in certain algorithms.
We measured approximately a 20% overhead when building frequency tables.
While this is a significant overhead, the number of frequency tables is linear in the number of markers.
Therefore, it is conceivable to build these tables once, and reuse them in downstream analytical steps as needed.
As a result, this overhead is generally acceptable.
Furthermore, the overhead is effectively hidden when building pairwise frequency tables.
The improvement in performance present when constructing pairwise frequency tables from 2-bit encoded genotypes stems from the reduced number of memory access steps.
As shown in Algorithm 3, six genotype blocks are used in each step of the iteration.
When 3-bit encoding is used, each of these blocks must be read from memory.
Conversely, the 2-bit encoding only needs to read four blocks and computes the remaining two blocks.
A further general performance increase may be possible through the use of hardware implementations of popcount algorithms.
As part of the Streaming SIMD Extensions (SSE) of the x86 microarchitecture there is a popcnt [10] instruction.
Recent processor lines from both Intel and AMD offer this instruction in some form or another.
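
One way to reach the hardware instruction from C++ is through a compiler builtin, sketched below. `__builtin_popcountll` is a GCC and Clang builtin rather than standard C++, and `popcount64_hw` is a hypothetical wrapper name; whether the compiler emits popcnt depends on the target options (for example `-mpopcnt`), otherwise a software fallback is generated.

```cpp
#include <cstdint>

// Hamming weight of one 64-bit block via the GCC/Clang builtin.
// Compiles to a single popcnt instruction when the target supports it.
inline unsigned popcount64_hw(std::uint64_t x) {
    return static_cast<unsigned>(__builtin_popcountll(x));
}
```

Because the wrapper has the same signature as a table-driven popcount, it could be swapped in without touching the table-building loops, which makes a direct timing comparison straightforward.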
As we mentioned earlier, these succinct data structures are intended to impact the increasing scale of sample sets.
The algorithms for building the frequency tables are linear in the size of the sample set.
By fixing the number of variants and varying the number of samples in a data set we show the linear increase of the epistasis algorithm runtime, as is indicated by Figure 5.

Figure 5 Average epistasis runtime using BOOST [6] algorithm. [Plot: "Epistasis (BOOST) algorithm", average runtime for 10000 variants; x-axis Individuals, y-axis Time (s); series: 2-bit encoding scheme, 3-bit encoding scheme.]

Unfortunately, the runtime of brute force algorithms like BOOST [6] is dominated more by the number of variants being analyzed than the number of individuals being studied.
A data set of 10,000 variants means that 5 × 10^7 unique contingency tables need to be built for a typical case-control study.
Expanding that size to a million variants increases the contingency table count to 5 × 10^11.
Other works have demonstrated parallel implementations that effectively address the variant scaling [9,11,12].
This work demonstrates a general way to further improve the performance of these algorithms.
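
The table counts quoted above follow from the number of unordered variant pairs, n(n − 1)/2. A small sketch, with the hypothetical helper name `pair_count`:

```cpp
#include <cstdint>

// Number of unordered variant pairs, i.e. the number of contingency
// tables built by one exhaustive pairwise scan over `variants` variants.
constexpr std::uint64_t pair_count(std::uint64_t variants) {
    return variants * (variants - 1) / 2;
}
```

`pair_count(10000)` is 49,995,000 (about 5 × 10^7) and `pair_count(1000000)` is 499,999,500,000 (about 5 × 10^11), so a 100-fold increase in variants costs roughly 10,000 times as many tables.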

Conclusions
In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed data representations in Genome-Wide Association Studies.
We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme.
We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms.
In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.

Availability and requirements
Project name: libgwaspp
Project home page: https://github.com/putnampp/libgwaspp
Operating system(s): Linux
Programming language: C++
Other requirements: CMake 2.8.9, GCC 4.7 or higher, Boost 1.51.0, ZLIB, GSL
License: FreeBSD

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
PPP designed and implemented the software, conducted the experiments, and wrote the main manuscript.
GZ provided domain specific expertise in GWA studies, and the empirical data from which the simulated data was generated.
PW contributed extensive knowledge of computational architectures and data structures.
Both also contributed greatly to the result analysis and editing of the manuscript.
All authors read and approved the final manuscript.

Acknowledgements
This work was partially supported by the Pilot and Feasibility Program of the Perinatal Institute, Cincinnati Children's Hospital Medical Center.

Received: 25 June 2013 Accepted: 11 December 2013 Published: 21 December 2013

References
1. Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP: Computational solutions to large-scale data management and analysis. Nat Rev Genet 2010, 11(9):647–657. http://dx.doi.org/10.1038/nrg2857.
2. Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Ann Rev Genom Human Genet 2009, 10:387–406. http://www.annualreviews.org/doi/abs/10.1146/annurev.genom.9.081307.164242. [PMID: 19715440].
3. Whole-genome genotyping and copy number variation analysis. 2013. http://www.illumina.com/applications/detail/snp_genotyping_and_cnv_analysis/whole_genome_genotyping_and_copy_number_variation_analysis.ilmn. [Online; accessed 9-January-2013]
4. A map of human genome variation from population-scale sequencing. Nature 2010, 467(7319):1061–1073. http://dx.doi.org/10.1038/nature09534.
5. Nielsen J, Mailund T: SNPFile - A software library and file format for large scale association mapping and population genetics studies. BMC Bioinformatics 2008, 9:526. http://www.biomedcentral.com/1471-2105/9/526.
6. Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W: BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Human Genet 2010, 87(3):325–340. http://linkinghub.elsevier.com/retrieve/pii/S0002929710003782.
7. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC: PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Human Genet 2007, 81(3):559–575. http://pngu.mgh.harvard.edu/purcell/plink/.
8. Jacobson G: Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, SFCS '89. Washington: IEEE Comput Soc; 1989:549–554. http://dx.doi.org/10.1109/SFCS.1989.63533.
9. Gyenesei A, Moody J, Laiho A, Semple CA, Haley CS, Wei WH: BiForce Toolbox: powerful high-throughput computational analysis of gene-gene interactions in genome-wide association studies. Nucleic Acids Res 2012, 40(W1):W628–W632. http://nar.oxfordjournals.org/content/40/W1/W628.abstract.
10. Intel: Intel SSE4 Programming Reference; 2007. http://home.ustc.edu.cn/~shengjie/REFERENCE/sse4_instruction_set.pdf.
11. Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics 2011, 27(9):1309–1310. http://bioinformatics.oxfordjournals.org/content/27/9/1309.abstract.
12. Schüpbach T, Xenarios I, Bergmann S, Kapur K: FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics 2010, 26(11):1468–1469. http://bioinformatics.oxfordjournals.org/content/26/11/1468.abstract.

doi:10.1186/1471-2105-14-369
Cite this article as: Putnam et al.: A comparison study of succinct data structures for use in GWAS. BMC Bioinformatics 2013 14:369.