Putnam et al. BMC Bioinformatics 2013, 14:369
http://www.biomedcentral.com/1471-2105/14/369

SOFTWARE Open Access

A comparison study of succinct data structures for use in GWAS

Patrick P Putnam1,2*, Ge Zhang2* and Philip A Wilsey1

Abstract

Background: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al (NRG 11:647–657, 2010) offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed. The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques. In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data. We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes.

Results: We perform a comparison of 2- and 3-bit genotype encoding schemes for use in genotype counting algorithms. We find that there is a 20% overhead when building simple frequency tables from 2-bit encoded genotypes. However, building pairwise count tables for genome-wide epistasis is 1.0% more efficient.

Conclusions: In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed genotype data representations in Genome-Wide Association Studies (GWAS). We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.
*Correspondence: putnampp@gmail.com; zhangge.uc@gmail.com
1 Experimental Computing Lab, School of Electronic and Computing Systems, PO Box 210030, Cincinnati, OH 45221–0030, USA
2 Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA

© 2013 Putnam et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al [1] offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed.

The majority of tools used in GWA data analysis typically assume that a data set will easily fit into the main memory of a desktop computer. Most desktop computers have around 4–16 GB of main memory, which is more than enough to fit a data set of 1 million variants by tens of thousands of individuals. However, data set sizes continue to grow with advancements in analysis techniques and technologies. For example, techniques like genotype imputation [2] attempt to expand data sets by deriving missing genotypes from reference panels. Genotyping technologies such as Illumina's Omni SNP HumanOmni5-Quad chips allow for genotyping of upwards of 5 million markers [3]. Furthermore, genome sequencing technologies are advancing to the point where determining genotypes via whole genome sequencing may be a viable option. Having an individual's entire DNA sequence opens the door for even more genetic markers to be analyzed. The 1000 Genomes project [4] now includes roughly 36.7 million variants in the human genome. The size of a data file used to represent the genotypes of 1000 individuals would be roughly 37 GB (assuming 1 byte is used to store each genotype).

There are several options for handling data sets of this size. First, the cost of upgrading a standard PC's memory to handle this amount of data is not unreasonable. Second, the algorithm can be extended to utilize memory mapping techniques [5], which effectively page chunks of the data file into main memory as they are needed. A third option is to modify the format for representing genotypes such that the genotypes are expressed in their most succinct form [6,7].

This manuscript explores the last option more deeply. The interest is motivated in part by the desire to work in the General-Purpose Graphics Processing Unit (GPGPU) space, where memory is somewhat limited, especially when considered on a processor-by-processor basis.
The compression of genotype encoding data is most effectively performed using succinct data structures [8]. Succinct data structures allow compression rates close to the information-theoretic limits and yet preserve the ability to access individual data elements. In the genotype analysis tools that use succinct data types (e.g., BOOST [6] and BiForce [9]), a 3-bit genotype representation for biallelic markers has been adopted. While a 3-bit representation does provide a succinct data structure, it is not the most succinct. More precisely, from an information theoretic perspective, 3 bits can represent up to 8 unique values. However, there are only 4 commonly used unphased genotypes, namely {NN, AA, Aa, aa}, where NN is used to represent missing data. This means that a 2-bit representation is the information theoretic lower bound and its use would provide an even more compact representation.
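The memory implication of the two representations can be sketched with a small calculation. The function below is a hypothetical helper, not part of libgwaspp, and it assumes that each marker is stored as a fixed number of bit-vectors padded to whole 64-bit blocks (one bit per individual per vector).

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store `variants` markers for `individuals` samples when each
// marker is stored as `bitVectorsPerMarker` bit-vectors (3 for the 3-bit
// scheme, 2 for the 2-bit scheme). Assumption: every bit-vector is padded to a
// whole number of 64-bit blocks.
std::uint64_t encodedBytes(std::uint64_t variants, std::uint64_t individuals,
                           unsigned bitVectorsPerMarker) {
    assert(bitVectorsPerMarker > 0);
    const std::uint64_t blocksPerVector = (individuals + 63) / 64;
    return variants * bitVectorsPerMarker * blocksPerVector * 8;
}
```

Under these assumptions, 36.7 million variants for 1000 individuals would occupy roughly 14.1 GB with three bit-vectors per marker and roughly 9.4 GB with two, compared with about 37 GB at one byte per genotype.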
An important consideration when designing succinct data structures is data element orientation in memory. BOOST [6] and BiForce [9] adopted a vectored orientation for representing data elements. The vectored orientation spreads each data element over multiple bit vectors. In other words, they utilize 3 bit vectors per marker to represent the set of genotypes. The advantages of this orientation are discussed later.

This manuscript makes two important contributions in the use of succinct data structures for genomic encoding. In particular, (i) we implement a technique to reduce genotype encoding to a 2 bit-vector form, and (ii) we compare the performance of the new 2-bit encoding to the conventional 3 bit-vector encoding. From these studies, we have observed that the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.
Implementation

We analyzed a commonly used 3-bit binary representation of genotypes from performance and scalability perspectives. With this information we developed a C++ object library that we have named libgwaspp. The library provides data structures for managing genotype data tables in a 2- or 3-bit representation. Finally, we benchmarked the two representations on randomly generated data sets of various scales.

Genome-wide association studies

DNA from individuals is collected, sequenced or genotyped, and the genotypes for genetic variants are used in Genome-Wide Association Studies (GWAS). These studies aim to determine whether genetic variants are associated with certain traits, or phenotypes. The most common studies are case-control studies, which group individuals into two sets based on the presence (case) or absence (control) of a specific trait. These studies typically rely upon various statistical tests based upon the genotypic or allelic distribution of the variants in each set. An average data set aims to compare thousands of individuals by hundreds of thousands to millions of variants.

GWA studies can be computationally intensive to perform. Common algorithms consider either each variant individually, or variants in combination with one another. For example, measuring the odds ratio for each variant in a case-control study is one way of identifying variants which may be associated with the trait in question. An epistasis analysis algorithm, such as BOOST [6], compares the genotype distribution of two variants in each step. In both of these algorithms, the basic task is counting the occurrences of each genotype in each of the case-control sets. In other words, the first step in determining the odds ratio is to build a frequency table (Table 1) for both the case and control sets at a specific variant. Similarly, the BOOST [6] algorithm first builds a contingency table (Table 2), or pairwise genotype count table, for a pair of variants.
Binary genotype encoding schemes

A common way to minimize the impact of the table building bottleneck is to fully utilize processor throughput by counting genotypes from multiple individuals in one step. The binary encoding of genotypes adopted by BOOST [6] improves the computational efficiency of the epistasis algorithm. The algorithm uses 3 bit-vectors to encode genotype data. In this scheme each genotype is its own bit-vector, or stream, of data. Each bit corresponds to an indexed individual, and the indexing is assumed to be constant across all markers. A set bit indicates that the individual has the corresponding genotype for the specified marker. Therefore, every variant requires 3 vectors to fully represent the genotypes.
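The 3 bit-vector scheme described above can be sketched as follows. This is an illustrative example only, not the BOOST or libgwaspp implementation: the type `ThreeBitMarker`, the function `encode3bit`, the string representation of the input genotypes, and the choice to place individual i at bit i of a 64-bit block are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical container: one bit-vector (stream) per genotype of a marker.
struct ThreeBitMarker {
    std::vector<std::uint64_t> AA, Aa, aa;
};

// Encode one marker. genotypes[i] is "AA", "Aa", "aa", or "NN" for individual i.
// Individual i maps to bit (i % 64) of block (i / 64) in every vector.
// Missing genotypes (NN) leave all three bits unset.
ThreeBitMarker encode3bit(const std::vector<std::string>& genotypes) {
    const std::size_t blocks = (genotypes.size() + 63) / 64;
    ThreeBitMarker m{std::vector<std::uint64_t>(blocks, 0),
                     std::vector<std::uint64_t>(blocks, 0),
                     std::vector<std::uint64_t>(blocks, 0)};
    for (std::size_t i = 0; i < genotypes.size(); ++i) {
        const std::uint64_t bit = std::uint64_t{1} << (i % 64);
        const std::string& g = genotypes[i];
        if (g == "AA") m.AA[i / 64] |= bit;
        else if (g == "Aa") m.Aa[i / 64] |= bit;
        else if (g == "aa") m.aa[i / 64] |= bit;
        else assert(g == "NN");  // anything else is treated as invalid input
    }
    return m;
}
```

Because exactly one of the three vectors can have a set bit for any individual, at least two of every three stored bits are zero, which is the redundancy the 2-bit scheme removes.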
Table 1 Frequency table for raw input from Tables 3, 4 and 5

        AA  Aa  aa  NN
  CA     2   1   1   1
  CB     2   1   2   0

Table 2 Pairwise genotype count table for two markers

                    MB
             AA  Aa  aa  NN  CA
  MA   AA     1   0   1   0   2
       Aa     1   0   0   0   1
       aa     0   0   1   0   1
       NN     0   1   0   0   1
       CB     2   1   2   0

Note that the marginal sums of this table are the individual markers' frequencies from Table 1.

There are two key benefits of using this binary encoding scheme. The first is that the task of building a frequency table for a given marker is reduced to calculating the Hamming distance between each of the bit-vectors and a bit-vector of all zeros. This distance is also referred to as a Hamming weight. The technique used for calculating the Hamming weight of a bit-vector is to divide the bit-vector into manageable blocks, and sum the Hamming weight of each block. The block size is typically linked to the processor word size, typically 32 or 64 bits (4 or 8 bytes). The algorithm for computing the Hamming weight of an individual block is commonly referred to as Population Counting (popcount). We chose to follow the BOOST implementation of popcount, which looks up the Hamming weight of 16-bit blocks in a pre-populated weight table. The second benefit is that it reduces genotype comparison logic to simple Boolean logic operations. More specifically, the task of counting individuals which have a specific combination of genotypes for two markers is simplified to finding the Hamming weight of the logical AND of the genotype bit-vectors. This is useful when building contingency tables.
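A table-driven popcount of the kind described above, together with the AND-based pair count, might look like the following sketch. It is written for this text and is not the BOOST source; the function names and the use of 64-bit blocks split into four 16-bit look-ups are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a table holding the Hamming weight of every 16-bit value.
// weight(v) = lowest bit of v + weight(v >> 1), filled in increasing order.
std::array<std::uint8_t, 65536> makeWeightTable() {
    std::array<std::uint8_t, 65536> t{};
    for (std::size_t v = 1; v < t.size(); ++v)
        t[v] = static_cast<std::uint8_t>((v & 1) + t[v >> 1]);
    return t;
}

// Hamming weight of one 64-bit block: four 16-bit table look-ups.
unsigned popcount64(std::uint64_t block) {
    static const std::array<std::uint8_t, 65536> table = makeWeightTable();
    return table[block & 0xFFFF] + table[(block >> 16) & 0xFFFF] +
           table[(block >> 32) & 0xFFFF] + table[(block >> 48) & 0xFFFF];
}

// Number of individuals carrying genotype g1 at one marker and g2 at another:
// the Hamming weight of the logical AND of the two genotype bit-vectors.
unsigned countPair(const std::vector<std::uint64_t>& g1,
                   const std::vector<std::uint64_t>& g2) {
    assert(g1.size() == g2.size());
    unsigned n = 0;
    for (std::size_t i = 0; i < g1.size(); ++i) n += popcount64(g1[i] & g2[i]);
    return n;
}
```

With individual i stored at bit i, the AA vector of marker MA in Table 4 is the value 5 and the aa vector of marker MB is 12, so `countPair` on those two vectors gives the single (AA, aa) individual recorded in Table 2.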
Of interest to this paper is the fact that when using the 3-bit encoding scheme at least two thirds of the bits used will be unset. An information theoretic analysis of the genotype alphabet indicates that 2 bits are sufficient to uniquely represent each of the four unphased genotypes. The immediate benefit is a one third reduction in memory consumption (Tables 3, 4 and 5). The caveat to this encoding scheme is that determining a genotype requires both bits. The algorithm in Figure 1 is a pseudo-code representation of how to build a genotype count table from 2-bit encoded data. The Hamming weights of the two vectors are the numbers of individuals with (AA or aa) and (Aa or aa) genotypes, respectively. To disambiguate the values it is necessary to compute the Hamming weight of the logical AND of the bit-vectors. This value represents the number of (aa) genotypes, and subtracting it from the previous two weights will result in the appropriate counts.

Table 3 Example genotype input

        I1  I2  I3  I4  I5
  MA    AA  Aa  AA  aa  NN
  MB    AA  AA  aa  aa  Aa

I1-5 represent individuals, and MA and MB are markers.

Table 4 3-bit encoding scheme

             I1  I2  I3  I4  I5
  MA   AA     1   0   1   0   0
       Aa     0   1   0   0   0
       aa     0   0   0   1   0
  MB   AA     1   1   0   0   0
       Aa     0   0   0   0   1
       aa     0   0   1   1   0
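The counting procedure just described (three Hamming weights per block, followed by two subtractions) can be written as a short C++ sketch. The struct and function names are hypothetical, and `std::bitset::count` stands in for whichever popcount routine is actually used.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct GenotypeCounts {
    unsigned AA = 0, Aa = 0, aa = 0;
};

// Stand-in popcount for one 64-bit block.
inline unsigned weight(std::uint64_t block) {
    return static_cast<unsigned>(std::bitset<64>(block).count());
}

// x holds the (AA or aa) bit-vector and y the (Aa or aa) bit-vector of one
// marker, both split into 64-bit blocks.
GenotypeCounts frequencyTable2bit(const std::vector<std::uint64_t>& x,
                                  const std::vector<std::uint64_t>& y) {
    assert(x.size() == y.size());
    GenotypeCounts c;
    for (std::size_t i = 0; i < x.size(); ++i) {
        c.aa += weight(x[i] & y[i]);  // both bits set: aa
        c.Aa += weight(y[i]);         // Aa or aa
        c.AA += weight(x[i]);         // AA or aa
    }
    c.AA -= c.aa;  // remove the aa individuals counted in x
    c.Aa -= c.aa;  // remove the aa individuals counted in y
    return c;
}
```

Storing individual i at bit i, marker MA of Table 5 becomes x = 13 and y = 10, and the function returns the counts AA = 2, Aa = 1, aa = 1 listed for MA in Table 1.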
The algorithm in Figure 2 illustrates the construction of a pairwise genotype count table, or contingency table. A contingency table represents the number of individuals who possess a genotype combination for a pair of markers. When using the 3-bit encoding scheme, each cell of the table is simply the Hamming weight of the logical AND of the genotype bit-vectors for the two markers. The 2-bit encoding requires an inline transformation step to convert the 2-bit encoded data into 3-bit data. This step is necessary to be able to take advantage of the popcount bit counting method.
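One possible form of the inline transformation is sketched below. The exact operations used in Figure 2 are not reproduced here, so the decoding rule (AA = x AND NOT y, Aa = y AND NOT x, aa = x AND y) is an assumption that follows from the 2-bit layout of Table 5, and all names are hypothetical.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 2-bit marker: x = (AA or aa) vector, y = (Aa or aa) vector.
struct TwoBitMarker {
    std::vector<std::uint64_t> x, y;
};

// Rows: genotype of marker a (AA, Aa, aa). Columns: genotype of marker b.
using Table3x3 = std::array<std::array<unsigned, 3>, 3>;

inline unsigned weight(std::uint64_t block) {
    return static_cast<unsigned>(std::bitset<64>(block).count());
}

Table3x3 contingency2bit(const TwoBitMarker& a, const TwoBitMarker& b) {
    assert(a.x.size() == a.y.size() && b.x.size() == b.y.size());
    assert(a.x.size() == b.x.size());
    Table3x3 t{};
    for (std::size_t i = 0; i < a.x.size(); ++i) {
        // Inline transformation: recover the three per-genotype blocks
        // from the two stored blocks of each marker.
        const std::uint64_t ga[3] = {a.x[i] & ~a.y[i], a.y[i] & ~a.x[i], a.x[i] & a.y[i]};
        const std::uint64_t gb[3] = {b.x[i] & ~b.y[i], b.y[i] & ~b.x[i], b.x[i] & b.y[i]};
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                t[r][c] += weight(ga[r] & gb[c]);
    }
    return t;
}
```

Only four blocks per iteration are read from memory in this sketch; the per-genotype blocks are derived in registers, which is the property the 2-bit scheme relies on.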
Both of the above algorithms can be further improved by incorporating additional information. For example, the algorithm for building a contingency table can be simplified if marginal information for both variants is available. The contingency table algorithm can make use of the variants' frequency tables and reduce having to compute 9 Hamming weight values to only 4. The remaining values can be easily computed by subtracting the row and column sums from their respective marginal information values. This reduction offers significant computational savings, especially when performing exhaustive epistasis analysis.
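The marginal-based reduction can be illustrated with the sketch below. It is a simplified example with hypothetical names, and it assumes that neither marker has missing (NN) genotypes among the individuals being counted, so that each row and column of the 3 x 3 table sums exactly to the corresponding single-marker genotype count.

```cpp
#include <array>
#include <cassert>

using Table3x3 = std::array<std::array<unsigned, 3>, 3>;

// n00, n01, n10, n11: the four cells obtained from Hamming weights, indexed by
// genotype (0 = AA, 1 = Aa) of the first and second marker.
// rowTotals / colTotals: (AA, Aa, aa) counts of the first / second marker.
// Assumption: no missing genotypes, so marginals equal row and column sums.
Table3x3 completeFromMarginals(unsigned n00, unsigned n01, unsigned n10, unsigned n11,
                               const std::array<unsigned, 3>& rowTotals,
                               const std::array<unsigned, 3>& colTotals) {
    assert(rowTotals[0] >= n00 + n01 && rowTotals[1] >= n10 + n11);
    assert(colTotals[0] >= n00 + n10 && colTotals[1] >= n01 + n11);
    Table3x3 t{};
    t[0][0] = n00; t[0][1] = n01;
    t[1][0] = n10; t[1][1] = n11;
    t[0][2] = rowTotals[0] - n00 - n01;  // rest of row AA
    t[1][2] = rowTotals[1] - n10 - n11;  // rest of row Aa
    t[2][0] = colTotals[0] - n00 - n10;  // rest of column AA
    t[2][1] = colTotals[1] - n01 - n11;  // rest of column Aa
    assert(rowTotals[2] >= t[2][0] + t[2][1]);
    t[2][2] = rowTotals[2] - t[2][0] - t[2][1];  // last cell from row aa
    return t;
}
```

For individuals I1 to I4 of Table 3 (I5 is excluded because it is missing at MA), the four computed cells are 1, 0, 1, 0 and the marginals are (2, 1, 1) and (2, 0, 2); the five derived cells then match the corresponding entries of Table 2.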
Table 5 2-bit encoding scheme

                    I1  I2  I3  I4  I5
  MA   AA OR aa      1   0   1   1   0
       Aa OR aa      0   1   0   1   0
  MB   AA OR aa      1   1   1   1   0
       Aa OR aa      0   0   1   1   1

Constructing a frequency table from 2-bit encoded genotypes

  AA ← 0; Aa ← 0; aa ← 0
  for i = 0 to N − 1 do        (N is the number of blocks per bit vector)
      x ← A[i]                 (A[i] is the (AA or aa) genotype bit vector)
      y ← B[i]                 (B[i] is the (Aa or aa) genotype bit vector)
      aa ← aa + popcount(x AND y)
      Aa ← Aa + popcount(y)
      AA ← AA + popcount(x)
  end for
  AA ← AA − aa
  Aa ← Aa − aa

Figure 1 Constructing a frequency table from 2-bit encoded genotypes.

Benchmarking

We compared the performance of the 2-bit encoded data to the 3-bit encoded data. In particular, we measured the runtime for building frequency tables and contingency tables using both encoding schemes. The runtime of these algorithms is dependent upon the number of columns, or individuals, in each row. Therefore, we decided to hold the number of rows constant at 10,000 variants. We varied the number of columns between 1 and 50 thousand individuals. We also tested a set with 150,000 individuals as an extreme scale experiment. The genotypes were simulated following the empirical allele frequency spectrum of Affymetrix Array 6.0 SNPs of the CEU HapMap samples. Similarly, individuals were randomly classified as either a case or control.
Three experiments were conducted. First, for each data set the runtime for building frequency tables for each of the variants was measured. Second, for each data set the runtime for building all contingency tables for an exhaustive pairwise epistasis test was measured. Third, each data set was run through our implementation of the BOOST [6] algorithm and the total runtime was recorded. The runtime of the BOOST [6] algorithm does not include the time to load the compressed data set into main memory. In each of these tests, the average runtime is calculated and presented.

All tests were conducted upon a desktop computer with a 3.2 GHz Intel Core i7-3930K, 32 GB of 1600 MHz DDR3 memory, with 64-bit Fedora 17. Time was measured down to the nanosecond using the clock_gettime() glibc function. We used GNU G++ compiler 4.7, and compiled using the standard "-O3" compiler optimization flag. The tests were performed using a 64-bit block size.
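A nanosecond-resolution timer built on clock_gettime() could be sketched as follows. The text does not state which clock was used, so CLOCK_MONOTONIC is an assumption here, and the helper names are hypothetical. The code requires a POSIX system.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Current reading of the monotonic clock in nanoseconds.
// Assumption: CLOCK_MONOTONIC; the original benchmark may have used another clock.
std::int64_t nowNanoseconds() {
    struct timespec ts;
    const int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;  // silence unused-variable warnings when asserts are disabled
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Run `work` once and return the elapsed wall time in nanoseconds.
template <typename F>
std::int64_t elapsedNanoseconds(F&& work) {
    const std::int64_t start = nowNanoseconds();
    work();
    return nowNanoseconds() - start;
}
```

An average build time per table can then be obtained by timing a loop over all variants (or variant pairs) and dividing by the number of tables built.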
Results

The first experiment measured the runtime for building frequency tables. Initially, the 3-bit encoding scheme appeared to offer a consistent performance advantage over the 2-bit encoding. As the number of individuals increased, the 3-bit scheme took less time to construct the count table (Figure 3). The average time to build a genotype count table for less than 10,000 individuals is less than 1 μs. For data sets greater than 10,000 individuals, there is some performance overhead that results from decoding the 2-bit vectors. Building frequency tables from the 3-bit encoded data proved to be 12–25% faster than when built from 2-bit encoded data. In the extreme scale data set there was a 5.00 μs difference in favor of the 3-bit scheme.

However, the second experiment offered different results. The second experiment measured the runtime for building contingency tables for all pairs of variants in the data sets. In this experiment, the 2-bit encoding scheme offered better performance. Similar to the first experiment, 10,000 individuals seemed to be the diverging point (Figure 4). At sizes greater than 10,000 individuals, the 2-bit encoding scheme offered a 1% performance improvement over the 3-bit scheme. With 150,000 individuals, this equates to about a 0.32 μs difference in average performance. The third experiment further confirms this performance gain (Table 6).

Figure 2 Constructing a contingency table from 2-bit encoded genotypes.

Figure 3 Average Case/Control frequency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals. (Plot: Case-Control Frequency Table, average build time for 10000 variants following the Affy6 genotype distribution; x-axis: Individuals; y-axis: Time (s); series: 2-bit encoding scheme, 3-bit encoding scheme.)
Discussion

This work focuses on ways to address frequency table building processes found in GWAS for two primary reasons. First, upstream steps in a general GWAS pipeline, like the loading of data, are performed relatively infrequently, and can be performed offline. For example, a data set can be transformed into an optimized format once, and in every repeat analysis of the data set the loading becomes a constant time step within the pipeline. Conversely, the building of these tables amounts to a frequently recurring step which is typically performed inline under varying conditions. Second, we viewed the table building process as a bottleneck for downstream analytical steps. Offering an approach which positively impacts the cost associated with this bottleneck is beneficial.
The results suggest that the use of a 2-bit encoding scheme for genotype data does offer several benefits over a 3-bit encoding scheme. The compact encoding scheme requires 33% less memory for representing the same data. Aside from freeing up system memory for other tasks, the memory savings can be beneficial for other reasons. For example, epistasis algorithms like BOOST [6] can be run on Graphics Processing Units (GPUs). GPUs are separate devices on a computer which have their own physical memory, typically less than 6 GB, and require data to be copied to and from the device. The limited memory and data transfer issues both benefit from using a more compact data format.

Figure 4 Average Case/Control contingency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals. (Plot: Case-Control Contingency Table, average build time for 10000 variants following the Affy6 genotype distribution; x-axis: Individuals; y-axis: Time (s); series: 2-bit encoding scheme, 3-bit encoding scheme.)

Table 6 Epistasis runtime comparison

  Individuals      2-bit        3-bit    Speedup (%)
         1000     28.56 s      28.45 s       0.37
         5000     92.07 s      93.32 s      -1.33
        10000    173.12 s     177.46 s      -2.45
        25000    418.31 s     420.71 s      -0.57
        50000    810.71 s     820.26 s      -1.16
       150000   2408.05 s    2427.84 s      -0.81

Speedup is measured relative to the 3-bit runtime.
2-bit encoded genotypes have also been used by other software packages. PLINK [7], for example, uses a 2-bit encoding in the BED file format. BED files use a contiguous pairing of bits to express the genotype of an individual. Using bit pairs allows for more efficient individual genotype decoding as a result of the bits existing in the same bit-block. However, additional bit masking steps need to be applied to each block to effectively utilize popcount based methods for counting genotype occurrences within a block. As mentioned earlier, our implementation adopts a bit-vectored approach, whereby an individual's genotype is divided over two separate vectors. This is primarily done to reduce the number of masking steps.

In either case, some form of genotype disambiguation is necessary. There is an overhead associated with this decoding step, and it can be felt in certain algorithms. We measured approximately a 20% overhead when building frequency tables. While this is a significant overhead, the number of frequency tables is linear in the number of markers. Therefore, it is conceivable to build these tables once, and reuse them in downstream analytical steps as needed. As a result, this overhead is generally acceptable. Furthermore, the overhead is effectively hidden when building pairwise frequency tables.
The improvement in performance present when constructing pairwise frequency tables from 2-bit encoded genotypes stems from the reduced number of memory access steps. As shown in the contingency table algorithm (Figure 2), six genotype blocks are used in each step of the iteration. When 3-bit encoding is used, each of these blocks must be read from memory. Conversely, the 2-bit encoding only needs to read four blocks and computes the remaining two blocks.

A further general performance increase may be possible through the use of hardware implementations of popcount algorithms. As part of the Streaming SIMD Extensions (SSE) of the x86 microarchitecture there is a popcnt [10] instruction. Recent processor lines from both Intel and AMD offer this instruction in some form or another.
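As a rough illustration of how a hardware popcount could replace the table look-up, the sketch below uses the `__builtin_popcountll` built-in provided by GCC and Clang. Whether it compiles to a single popcnt instruction depends on the target flags (for example `-mpopcnt` or `-msse4.2` on x86), and the function name `hammingWeight` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hamming weight of a whole bit-vector stored as 64-bit blocks.
// __builtin_popcountll is a GCC/Clang built-in; with popcnt enabled for the
// target it is typically lowered to the hardware instruction.
unsigned hammingWeight(const std::vector<std::uint64_t>& v) {
    unsigned total = 0;
    for (std::uint64_t block : v)
        total += static_cast<unsigned>(__builtin_popcountll(block));
    return total;
}
```

This would be a drop-in replacement for a table-based block popcount; this text makes no claim about how much faster it would be on the benchmark above.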
As we mentioned earlier, these succinct data structures are intended to address the increasing scale of sample sets. The algorithms for building the frequency tables are linear in the size of the sample set. By fixing the number of variants and varying the number of samples in a data set we show the linear increase of the epistasis algorithm runtime, as is indicated by Figure 5.

Figure 5 Average epistasis runtime using BOOST [6] algorithm. (Plot: Epistasis (BOOST) algorithm, average runtime for 10000 variants; x-axis: Individuals; y-axis: Time (s); series: 2-bit encoding scheme, 3-bit encoding scheme.)

Unfortunately, the runtime of brute force algorithms like BOOST [6] is dominated more by the number of variants being analyzed than by the number of individuals being studied. A data set of 10,000 variants means that 5 × 10^7 unique contingency tables need to be built for a typical case-control study. Expanding that size to a million variants increases the contingency table count to 5 × 10^11. Other works have demonstrated parallel implementations that effectively address the variant scaling [9,11,12]. This work demonstrates a general way to further improve the performance of these algorithms.
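The table counts quoted above follow from the number of unordered variant pairs, n(n − 1)/2. A minimal check, with a hypothetical function name:

```cpp
#include <cassert>
#include <cstdint>

// Number of unordered pairs among `variants` markers: n * (n - 1) / 2.
// One contingency table is needed per pair in an exhaustive pairwise scan.
std::uint64_t pairCount(std::uint64_t variants) {
    if (variants < 2) return 0;
    return variants * (variants - 1) / 2;
}
```

For 10,000 variants this gives 49,995,000 pairs (about 5 × 10^7), and for one million variants 499,999,500,000 pairs (about 5 × 10^11).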
Conclusions

In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed data representations in Genome-Wide Association Studies. We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.
Availability and requirements

Project name: libgwaspp
Project home page: https://github.com/putnampp/libgwaspp
Operating system(s): Linux
Programming language: C++
Other requirements: CMake 2.8.9, GCC 4.7 or higher, Boost 1.51.0, ZLIB, GSL
License: FreeBSD

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PPP designed and implemented the software, conducted the experiments, and wrote the main manuscript. GZ provided domain specific expertise in GWA studies, and the empirical data from which the simulated data was generated. PW contributed extensive knowledge of computational architectures and data structures. Both also contributed greatly to the result analysis and editing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was partially supported by the Pilot and Feasibility Program of the Perinatal Institute, Cincinnati Children's Hospital Medical Center.
Received: 25 June 2013 Accepted: 11 December 2013 Published: 21 December 2013

References

1. Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP: Computational solutions to large-scale data management and analysis. Nat Rev Genet 2010, 11(9):647–657. http://dx.doi.org/10.1038/nrg2857.
2. Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Ann Rev Genom Human Genet 2009, 10:387–406. http://www.annualreviews.org/doi/abs/10.1146/annurev.genom.9.081307.164242. [PMID: 19715440].
3. Whole-genome genotyping and copy number variation analysis. 2013. http://www.illumina.com/applications/detail/snp_genotyping_and_cnv_analysis/whole_genome_genotyping_and_copy_number_variation_analysis.ilmn. [Online; accessed 9-January-2013]
4. A map of human genome variation from population-scale sequencing. Nature 2010, 467(7319):1061–1073. http://dx.doi.org/10.1038/nature09534.
5. Nielsen J, Mailund T: SNPFile - A software library and file format for large scale association mapping and population genetics studies. BMC Bioinformatics 2008, 9:526. http://www.biomedcentral.com/1471-2105/9/526.
6. Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W: BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Human Genet 2010, 87(3):325–340. http://linkinghub.elsevier.com/retrieve/pii/S0002929710003782.
7. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC: PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Human Genet 2007, 81(3):559–575. http://pngu.mgh.harvard.edu/purcell/plink/.
8. Jacobson G: Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, SFCS '89. Washington: IEEE Comput Soc; 1989:549–554. http://dx.doi.org/10.1109/SFCS.1989.63533.
9. Gyenesei A, Moody J, Laiho A, Semple CA, Haley CS, Wei WH: BiForce Toolbox: powerful high-throughput computational analysis of gene-gene interactions in genome-wide association studies. Nucleic Acids Res 2012, 40(W1):W628–W632. http://nar.oxfordjournals.org/content/40/W1/W628.abstract.
10. Intel: Intel SSE4 Programming Reference; 2007. http://home.ustc.edu.cn/~shengjie/REFERENCE/sse4_instruction_set.pdf.
11. Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics 2011, 27(9):1309–1310. http://bioinformatics.oxfordjournals.org/content/27/9/1309.abstract.
12. Schüpbach T, Xenarios I, Bergmann S, Kapur K: FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics 2010, 26(11):1468–1469. http://bioinformatics.oxfordjournals.org/content/26/11/1468.abstract.

doi:10.1186/1471-2105-14-369
Cite this article as: Putnam et al.: A comparison study of succinct data structures for use in GWAS. BMC Bioinformatics 2013 14:369.