Putnam et al. BMC Bioinformatics 2013, 14:369
http://www.biomedcentral.com/1471-2105/14/369

SOFTWARE Open Access

A comparison study of succinct data structures for use in GWAS

Patrick P Putnam1,2*, Ge Zhang2* and Philip A Wilsey1

Abstract

Background: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al (NRG 11:647–657, 2010) offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed. The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques. In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data. We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes.

Results: We perform a comparison of 2- and 3-bit genotype encoding schemes for use in genotype counting algorithms. We find that there is a 20% overhead when building simple frequency tables from 2-bit encoded genotypes. However, building pairwise count tables for genome-wide epistasis is 1.0% more efficient.

Conclusions: In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed genotype data representations in Genome-Wide Association Studies (GWAS). We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.
*Correspondence: putnampp@gmail.com; zhangge.uc@gmail.com
1 Experimental Computing Lab, School of Electronic and Computing Systems, PO Box 210030, Cincinnati, OH 45221–0030, USA
2 Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA

© 2013 Putnam et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al [1] offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed.

The majority of tools used in GWA data analysis typically assume that a data set will easily fit into the main memory of a desktop computer. Most desktop computers have around 4–16 GB of main memory, which is more than enough to fit a data set of 1 million variants by tens of thousands of individuals. However, data set sizes continue to grow with advancements in analysis techniques and technologies. For example, techniques like genotype imputation [2] attempt to expand data sets by deriving missing genotypes from reference panels. Genotyping technologies such as Illumina's OmniSNP HumanOmni5-Quad chips allow for genotyping of upwards of 5 million markers [3]. Furthermore, genome sequencing technologies are advancing to the point where determining genotypes via whole genome sequencing may be a viable option. Having an individual's entire DNA sequence opens the door for even more genetic markers to be analyzed. The 1000 Genomes project [4] now includes roughly 36.7 million variants in the human genome. The size of a data file used to represent the genotypes of 1000 individuals would be roughly 37 GB (assuming 1 byte is used to store each genotype).

There are several options for handling data sets of this size. First, the cost of upgrading a standard PC's memory to handle this amount of data is not unreasonable. Second, the algorithm can be extended to utilize memory mapping techniques [5], which effectively page chunks of the data file into main memory as they are needed. A third option is to modify the format for representing genotypes such that the genotypes are expressed in their most succinct form [6,7]. This manuscript explores the third option more deeply.
The interest is motivated in part by the desire to work in the General-Purpose Graphics Processing Unit (GPGPU) space, which has somewhat limited memory, especially when considered on a processor-by-processor basis.

The compression of genotype encoding data is most effectively performed using succinct data structures [8]. Succinct data structures allow compression rates close to the information-theoretic limits and yet preserve the ability to access individual data elements. In the genotype analysis tools that use succinct data types (e.g., BOOST [6] and BiForce [9]), a 3-bit genotype representation for biallelic markers has been adopted. While a 3-bit representation does provide a succinct data structure, it is not the most succinct. More precisely, from an information theoretic perspective, 3 bits are able to represent up to 8 unique values. However, there are only 4 commonly used unphased genotypes, namely {NN, AA, Aa, aa}, where NN is used to represent missing data. This means that a 2-bit representation is the information theoretic lower bound and its use would provide an even more compact representation.
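To make the size argument concrete, the following minimal C++ sketch computes the raw storage needed for the 1000 Genomes example quoted in the Background (36.7 million variants by 1000 individuals) at 8, 3 and 2 bits per genotype. The function name genotype_bytes is hypothetical and not part of libgwaspp, and the sketch ignores any padding of bit-vectors to whole blocks.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store variants x individuals genotypes at the given number
// of bits per genotype, rounded up to whole bytes. Block padding is ignored.
std::uint64_t genotype_bytes(std::uint64_t variants, std::uint64_t individuals,
                             unsigned bits_per_genotype) {
    assert(bits_per_genotype > 0 && bits_per_genotype <= 8);
    const std::uint64_t total_bits = variants * individuals * bits_per_genotype;
    return (total_bits + 7) / 8;
}
```

Under these assumptions the example data set needs 36.7 GB at one byte per genotype, about 13.8 GB with the 3-bit scheme and about 9.2 GB with the 2-bit scheme, which is the one third reduction relative to 3 bits.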
An important consideration when designing succinct data structures is data element orientation in memory. BOOST [6] and BiForce [9] adopted a vectored orientation for representing data elements. The vectored orientation spreads each data element over multiple bit-vectors. In other words, they utilize 3 bit-vectors per marker to represent the set of genotypes. The advantages of this orientation are discussed later.

This manuscript makes two important contributions in the use of succinct data structures for genomic encoding. In particular, (i) we implement a technique to reduce genotype encoding to a 2 bit-vector form, and (ii) we compare the performance of the new 2-bit encoding to the conventional 3 bit-vector encoding. From these studies, we have observed that the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.
Implementation

We analyzed a commonly used 3-bit binary representation of genotypes from performance and scalability perspectives. With this information we developed a C++ object library that we have named libgwaspp. The library provides data structures for managing genotype data tables in a 2- or 3-bit representation. Finally, we benchmarked the two representations on randomly generated data sets of various scales.
Genome-wide association studies

DNA from individuals is collected, sequenced or genotyped, and the genotypes for genetic variants are used in Genome-Wide Association Studies (GWAS). These studies aim to determine whether genetic variants are associated with certain traits, or phenotypes. The most common studies are case-control studies, which group individuals together into two sets based on the presence (case) or absence (control) of a specific trait. These studies typically rely upon various statistical tests based upon the genotypic or allelic distribution of the variants in each set. An average data set aims to compare thousands of individuals by hundreds of thousands to millions of variants.

GWA studies can be computationally intensive to perform. Common algorithms consider either each variant individually, or variants in combination with one another. For example, measuring the odds ratio for each variant in a case-control study is one way of identifying variants which may be associated with the trait in question. An epistasis analysis algorithm, such as BOOST [6], compares the genotype distribution of two variants in each step. In both of these algorithms, the basic task is counting the occurrences of each genotype in each of the case-control sets. In other words, the first step in determining the odds ratio is to build a frequency table (Table 1) for both the case and control sets at a specific variant. Similarly, the BOOST [6] algorithm first builds a contingency table (Table 2), or pairwise genotype count table, for a pair of variants.
Binary genotype encoding schemes

A common way to minimize the impact of the table building bottleneck is to fully utilize processor throughput by counting genotypes from multiple individuals in one step. The binary encoding of genotypes adopted by BOOST [6] improves the computational efficiency of the epistasis algorithm. The algorithm uses 3 bit-vectors to encode genotype data. In this scheme each genotype has its own bit-vector, or stream, of data. Each bit corresponds to an indexed individual, and the indexing is assumed to be constant across all markers. A set bit indicates that the individual has the corresponding genotype for the specified marker. Therefore, every variant requires 3 vectors to fully represent the genotypes.
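The 3 bit-vector layout can be illustrated with a short C++ sketch. The names Genotype, Marker3 and encode3 are hypothetical and do not come from BOOST or libgwaspp; the sketch also assumes that individual i is stored in bit i of the vector, counting from the least significant bit of each 64-bit block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Genotype { NN, AA, Aa, aa };   // NN marks missing data

// One marker in the 3 bit-vector layout: bit i of each vector is individual i.
struct Marker3 {
    std::vector<std::uint64_t> AA, Aa, aa;
};

Marker3 encode3(const std::vector<Genotype>& g) {
    const std::size_t blocks = (g.size() + 63) / 64;
    Marker3 m{std::vector<std::uint64_t>(blocks, 0),
              std::vector<std::uint64_t>(blocks, 0),
              std::vector<std::uint64_t>(blocks, 0)};
    for (std::size_t i = 0; i < g.size(); ++i) {
        const std::uint64_t bit = std::uint64_t{1} << (i % 64);
        switch (g[i]) {
            case Genotype::AA: m.AA[i / 64] |= bit; break;
            case Genotype::Aa: m.Aa[i / 64] |= bit; break;
            case Genotype::aa: m.aa[i / 64] |= bit; break;
            case Genotype::NN: break;   // missing: no bit set in any vector
        }
    }
    return m;
}
```

With this bit order, a marker with genotypes AA, Aa, AA, aa, NN for five individuals yields the block values 5, 2 and 8 for the AA, Aa and aa vectors.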
Table 1 Frequency table for raw input from Tables 3, 4 and 5

      AA  Aa  aa  NN
CA     2   1   1   1
CB     2   1   2   0

Table 2 Pairwise genotype count table for two markers

                 MB
          AA  Aa  aa  NN  CA
MA  AA     1   0   1   0   2
    Aa     1   0   0   0   1
    aa     0   0   1   0   1
    NN     0   1   0   0   1
CB         2   1   2   0

Note that the marginal sums of this table are the individual markers' frequencies from Table 1.

There are two key benefits of using this binary encoding scheme. The first is that the task of building a frequency table for a given marker is reduced to calculating the Hamming distance between each of its bit-vectors and a bit-vector of all zeros. This distance is also referred to as a Hamming weight. The technique used for calculating the Hamming weight of a bit-vector is to divide the bit-vector into manageable blocks, and sum the Hamming weight of each block. The block size is typically linked to the processor word size, typically 32 or 64 bits (4 or 8 bytes). The algorithm for computing the Hamming weight of an individual block is commonly referred to as Population Counting (popcount). We chose to follow the BOOST implementation of popcount, which looks up the Hamming weight of 16-bit blocks in a pre-populated weight table. The second benefit is that it reduces genotype comparison logic to simple Boolean logic operations. More specifically, the task of counting individuals which have a specific combination of genotypes for two markers is simplified to finding the Hamming weight of the logical AND of the genotype bit-vectors. This is useful when building contingency tables.
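Both benefits can be sketched in a few lines of C++. The code below is an illustration written for this description, not the BOOST or libgwaspp source: weight_table, popcount64 and pair_count are hypothetical names, and the table is filled with a simple recurrence chosen for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Weight table for every 16-bit value, built once on first use:
// weight(v) = weight(v >> 1) + (v & 1).
static const std::vector<std::uint8_t>& weight_table() {
    static const std::vector<std::uint8_t> table = [] {
        std::vector<std::uint8_t> t(1u << 16, 0);
        for (std::uint32_t v = 1; v < (1u << 16); ++v)
            t[v] = static_cast<std::uint8_t>(t[v >> 1] + (v & 1u));
        return t;
    }();
    return table;
}

// Hamming weight of one 64-bit block as the sum of four 16-bit lookups.
unsigned popcount64(std::uint64_t block) {
    const std::vector<std::uint8_t>& t = weight_table();
    return t[block & 0xFFFF] + t[(block >> 16) & 0xFFFF] +
           t[(block >> 32) & 0xFFFF] + t[block >> 48];
}

// Number of individuals with one genotype at the first marker AND another
// genotype at the second marker, given the two genotype bit-vectors.
std::uint64_t pair_count(const std::vector<std::uint64_t>& g1,
                         const std::vector<std::uint64_t>& g2) {
    assert(g1.size() == g2.size());
    std::uint64_t n = 0;
    for (std::size_t i = 0; i < g1.size(); ++i) n += popcount64(g1[i] & g2[i]);
    return n;
}
```

Assuming individual 1 sits in the least significant bit, the AA vector of MA in Table 4 is the block value 5 and the aa vector of MB is 12; their AND has a single set bit, matching the (AA, aa) cell of Table 2.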
Of interest to this paper is the fact that when using the 3-bit encoding scheme at least two thirds of the bits used will be unset. An information theoretic analysis of the genotype alphabet indicates that 2 bits are sufficient to uniquely represent each of the four unphased genotypes. The immediate benefit is a one third reduction in memory consumption (Tables 3, 4 and 5).

Table 3 Example genotype input

     I1  I2  I3  I4  I5
MA   AA  Aa  AA  aa  NN
MB   AA  AA  aa  aa  Aa

I1-5 represent individuals, and MA and MB are markers.

Table 4 3-bit encoding scheme

          I1  I2  I3  I4  I5
MA  AA     1   0   1   0   0
    Aa     0   1   0   0   0
    aa     0   0   0   1   0
MB  AA     1   1   0   0   0
    Aa     0   0   0   0   1
    aa     0   0   1   1   0

The caveat to this encoding scheme is that determining a genotype requires both bits. The algorithm in Figure 1 is a pseudo-code representation of how to build a genotype count table from 2-bit encoded data. The Hamming weight of each vector is the number of individuals with (AA or aa), and (Aa or aa) genotypes, respectively. To disambiguate the values it is necessary to compute the Hamming weight of the logical AND of the bit-vectors. This value represents the number of (aa) genotypes, and subtracting it from the previous two weights will result in the appropriate counts.
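The counting procedure just described can be written as a short C++ sketch. The names Counts and frequency_table are hypothetical; the NN count is derived here from the number of individuals, which is an addition for illustration and assumes that padding bits beyond the last individual are zero. A simple software bit count is used so that the sketch has no dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Counts { std::uint64_t AA, Aa, aa, NN; };

// Software bit count (clears the lowest set bit on each pass).
static unsigned popcount64(std::uint64_t v) {
    unsigned n = 0;
    for (; v != 0; v &= v - 1) ++n;
    return n;
}

// x holds the (AA or aa) bits and y the (Aa or aa) bits of one marker.
Counts frequency_table(const std::vector<std::uint64_t>& x,
                       const std::vector<std::uint64_t>& y,
                       std::uint64_t individuals) {
    assert(x.size() == y.size());
    Counts c{0, 0, 0, 0};
    for (std::size_t i = 0; i < x.size(); ++i) {
        c.aa += popcount64(x[i] & y[i]);   // both bits set
        c.Aa += popcount64(y[i]);          // Aa plus aa
        c.AA += popcount64(x[i]);          // AA plus aa
    }
    c.AA -= c.aa;                          // remove the aa individuals
    c.Aa -= c.aa;
    c.NN = individuals - c.AA - c.Aa - c.aa;   // neither bit set
    return c;
}
```

Assuming individual 1 sits in the least significant bit, the two vectors of marker MA in the 2-bit scheme are the block values 13 and 10, and the sketch returns the counts 2, 1, 1 and 1 for AA, Aa, aa and NN, in agreement with Table 1.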
The algorithm in Figure 2 illustrates the construction of a pairwise genotype count table, or contingency table. A contingency table represents the number of individuals who possess a genotype combination for a pair of markers. When using the 3-bit encoding scheme, each cell of the table is simply the Hamming weight of the logical AND of the genotype bit-vectors for the two markers. The 2-bit encoding requires an inline transformation step to convert the 2-bit encoded data into 3-bit data. This step is necessary to be able to take advantage of the popcount bit counting method.
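One possible form of the inline transformation is sketched below in C++. This is an assumption made for illustration and is not taken from the authors' figure or from libgwaspp: with x as the (AA or aa) block and y as the (Aa or aa) block, the three per-genotype blocks are recovered as x AND NOT y, NOT x AND y, and x AND y. The names Marker2, Table and contingency_table are hypothetical, and padding bits are assumed to be zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

static unsigned popcount64(std::uint64_t v) {
    unsigned n = 0;
    for (; v != 0; v &= v - 1) ++n;
    return n;
}

struct Marker2 { std::vector<std::uint64_t> x, y; };   // x: AA or aa, y: Aa or aa

// Rows index the first marker's genotype and columns the second marker's,
// both in the order AA, Aa, aa.
using Table = std::array<std::array<std::uint64_t, 3>, 3>;

Table contingency_table(const Marker2& a, const Marker2& b) {
    assert(a.x.size() == a.y.size() && b.x.size() == b.y.size());
    assert(a.x.size() == b.x.size());
    Table t{};
    for (std::size_t i = 0; i < a.x.size(); ++i) {
        // Inline transformation: rebuild the three genotype blocks per marker.
        const std::uint64_t ga[3] = {a.x[i] & ~a.y[i], ~a.x[i] & a.y[i], a.x[i] & a.y[i]};
        const std::uint64_t gb[3] = {b.x[i] & ~b.y[i], ~b.x[i] & b.y[i], b.x[i] & b.y[i]};
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                t[r][c] += popcount64(ga[r] & gb[c]);
    }
    return t;
}
```

Applied to the two example markers (block values 13 and 10 for MA, 15 and 28 for MB, with individual 1 in the least significant bit), the sketch reproduces the AA, Aa and aa cells of Table 2.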
Both of the above algorithms can be further improved by incorporating additional information. For example, the algorithm for building a contingency table can be simplified if marginal information for both variants is available. The contingency table algorithm can make use of the variants' frequency tables and reduce having to compute 9 Hamming weight values to only 4. The remaining values can be easily computed by subtracting the row and column sums from their respective marginal information values. This reduction offers significant computational savings, especially when performing exhaustive epistasis analysis.
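The subtraction step can be sketched in C++ as follows. The sketch assumes that neither marker has missing (NN) genotypes among the individuals counted, so that every row and column of the 3 by 3 table sums to the corresponding marginal count; how missing data is handled in practice is not covered here. The names Table, Marginal and complete_table are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Table = std::array<std::array<std::uint64_t, 3>, 3>;   // order AA, Aa, aa
using Marginal = std::array<std::uint64_t, 3>;

// inner holds the four cells obtained by popcount:
// (AA,AA), (AA,Aa), (Aa,AA), (Aa,Aa). row and col are the frequency tables
// of the first and second marker.
Table complete_table(const std::array<std::array<std::uint64_t, 2>, 2>& inner,
                     const Marginal& row, const Marginal& col) {
    Table t{};
    for (int r = 0; r < 2; ++r) {
        for (int c = 0; c < 2; ++c) t[r][c] = inner[r][c];
        assert(row[r] >= t[r][0] + t[r][1]);
        t[r][2] = row[r] - t[r][0] - t[r][1];   // last column from row marginal
    }
    for (int c = 0; c < 3; ++c) {
        assert(col[c] >= t[0][c] + t[1][c]);
        t[2][c] = col[c] - t[0][c] - t[1][c];   // last row from column marginal
    }
    return t;
}
```

For the first four individuals of the example input (which have no missing genotypes), the marginals are (2, 1, 1) and (2, 0, 2), and the four popcount cells are 1, 0, 1, 0; the remaining five cells then follow by subtraction.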
Table 5 2-bit encoding scheme

                I1  I2  I3  I4  I5
MA  AA OR aa     1   0   1   1   0
    Aa OR aa     0   1   0   1   0
MB  AA OR aa     1   1   1   1   0
    Aa OR aa     0   0   1   1   1

Figure 1 Constructing a frequency table from 2-bit encoded genotypes.

    AA ← 0; Aa ← 0; aa ← 0
    for i = 0 to N − 1 do             (N is the number of blocks per bit-vector)
        x ← A[i]                      (A is the (AA or aa) genotype bit-vector)
        y ← B[i]                      (B is the (Aa or aa) genotype bit-vector)
        aa ← aa + popcount(x AND y)
        Aa ← Aa + popcount(y)
        AA ← AA + popcount(x)
    end for
    AA ← AA − aa
    Aa ← Aa − aa

Benchmarking

We compared the performance of the 2-bit encoded data to the 3-bit encoded data. In particular, we measured the runtime for building frequency tables and contingency tables using both encoding schemes. The runtime of these algorithms is dependent upon the number of columns, or individuals, in each row. Therefore, we decided to hold the number of rows constant at 10,000 variants. We varied the number of columns between 1 and 50 thousand individuals. We also tested a set with 150,000 individuals as an extreme scale experiment. The genotypes were simulated following the empirical allele frequency spectrum of Affymetrix array 6.0 SNPs of the CEU HapMap samples. Similarly, individuals were randomly classified as either a case or control.

Three experiments were conducted. First, for each data set the runtime for building frequency tables for each of the variants was measured. Second, for each data set the runtime for building all contingency tables for an exhaustive pairwise epistasis test was measured. Third, each data set was run through our implementation of the BOOST [6] algorithm and the total runtime was recorded. The runtime of the BOOST [6] algorithm does not include the time to load the compressed data set into main memory. In each of these tests, the average runtime is calculated and presented.

All tests were conducted upon a desktop computer with a 3.2 GHz Intel Core i7-3930K, 32 GB of 1600 MHz DDR3 memory, with 64-bit Fedora 17. Time was measured down to the nanosecond using the clock_gettime() glibc function. We used GNU G++ compiler 4.7, and compiled using the standard "-O3" compiler optimization flag. The tests were performed using 64-bit block size.
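A timing harness of the kind described might look like the following C++ sketch. The choice of CLOCK_MONOTONIC is an assumption, since the clock identifier used in the experiments is not stated, and the helper names now_ns and average_ns are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Current reading of the monotonic clock in nanoseconds.
std::int64_t now_ns() {
    timespec ts;
    const int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;   // silence the unused warning when NDEBUG is defined
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Average runtime in nanoseconds of fn over the given number of calls.
template <typename Fn>
double average_ns(Fn fn, int repeats) {
    assert(repeats > 0);
    const std::int64_t start = now_ns();
    for (int i = 0; i < repeats; ++i) fn();
    return static_cast<double>(now_ns() - start) / repeats;
}
```

Averaging over many calls, as in average_ns, matters here because a single table build on a small data set can take less time than the resolution one can reliably observe with one pair of clock readings.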
Results

The first experiment measured the runtime for building frequency tables. Initially, the 3-bit encoding scheme appeared to offer a consistent performance advantage over the 2-bit encoding. As the number of individuals increased, it took less time to construct the count table from 3-bit encoded data (Figure 3). The average time to build a genotype count table for fewer than 10,000 individuals is less than 1 μs. For data sets with more than 10,000 individuals, there is some performance overhead that results from decoding the 2-bit vectors. Building frequency tables from the 3-bit encoded data proved to be 12–25% faster than when built from 2-bit encoded data. In the extreme scale data set there was a 5.00 μs difference in favor of the 3-bit scheme. However, the second experiment offered different results.

The second experiment measured the runtime for building contingency tables for all pairs of variants in the data sets. In this experiment, the 2-bit encoding scheme offered better performance. Similar to the first experiment, 10,000 individuals seemed to be the diverging point (Figure 4). At sizes greater than 10,000 individuals, the 2-bit encoding scheme offered a 1% performance improvement over the 3-bit scheme. With 150,000 individuals, this equates to about a 0.32 μs difference in average performance. The third experiment further confirms this performance gain (Table 6).

Figure 2 Constructing a contingency table from 2-bit encoded genotypes.
Figure 3 Average Case/Control frequency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals. (Plot title: Case-Control Frequency Table, average build time for 10000 variants following Affy6 genotype distribution; x-axis: Individuals; y-axis labeled "Time(s)"; series: 2-bit encoding scheme, 3-bit encoding scheme.)

Discussion

This work focuses on ways to address frequency table building processes found in GWAS for two primary reasons. First, upstream steps, like the loading of data, in a general GWAS pipeline are performed relatively infrequently, and can be performed offline. For example, a data set can be transformed into an optimized format once, and in every repeat analysis of the data set the loading becomes a constant time step within the pipeline. Conversely, the building of these tables amounts to a frequently reoccurring step which is typically performed inline under varying conditions. Secondly, we viewed the table building process as a bottleneck for downstream analytical steps. Offering an approach which positively impacts the cost associated with this bottleneck is beneficial.
The results suggest that the use of a 2-bit encoding scheme for genotype data does offer several benefits over a 3-bit encoding scheme. The compact encoding scheme requires 33% less memory for representing the same data. Aside from freeing up system memory for other tasks, the memory savings can be beneficial for other reasons. For example, epistasis algorithms like BOOST [6] can be run on Graphics Processing Units. GPUs are separate devices on a computer which have their own physical memory, typically less than 6 GB, and require data to be copied to and from the device. The limited memory and data transfer issues both benefit from using a more compact data format.

Figure 4 Average Case/Control contingency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals. (Plot title: Case-Control Contingency Table, average build time for 10000 variants following Affy6 genotype distribution; x-axis: Individuals; y-axis labeled "Time(s)"; series: 2-bit encoding scheme, 3-bit encoding scheme.)

Table 6 Epistasis runtime comparison

Individuals   2-bit       3-bit       Speedup (%)
1000          28.56 s     28.45 s      0.37
5000          92.07 s     93.32 s     -1.33
10000         173.12 s    177.46 s    -2.45
25000         418.31 s    420.71 s    -0.57
50000         810.71 s    820.26 s    -1.16
150000        2408.05 s   2427.84 s   -0.81

Speedup is measured relative to the 3-bit runtime.

The 2-bit encoded genotypes have also been used by other software packages. PLINK [7], for example, uses a 2-bit encoding in the BED file format. BED files use a contiguous pairing of bits to express the genotype of an individual. Using bit pairs allows for more efficient individual genotype decoding as a result of the bits existing in the same bit-block. However, additional bit masking steps need to be applied to each block to effectively utilize popcount based methods for counting genotype occurrences within a block. As mentioned earlier, our implementation adopts a bit-vectored approach, whereby an individual's genotype is divided over two separate vectors. This is primarily done to reduce the number of masking steps. In either case, some form of genotype disambiguation is necessary.
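The extra masking needed for a contiguous bit-pair layout can be seen in the following C++ sketch. The layout is hypothetical and chosen only for illustration: within each pair the low bit means (AA or aa) and the high bit means (Aa or aa). It is not the genotype code assignment of PLINK's BED format, and count_packed is not a PLINK or libgwaspp function.

```cpp
#include <cassert>
#include <cstdint>

static unsigned popcount64(std::uint64_t v) {
    unsigned n = 0;
    for (; v != 0; v &= v - 1) ++n;
    return n;
}

struct Counts { unsigned AA, Aa, aa; };

// Counts genotypes in one 64-bit block holding 32 contiguous 2-bit genotypes
// in the hypothetical layout described above.
Counts count_packed(std::uint64_t block) {
    const std::uint64_t low_mask = 0x5555555555555555ULL;   // low bit of every pair
    const std::uint64_t x = block & low_mask;                // mask out the high bits
    const std::uint64_t y = (block >> 1) & low_mask;         // align and mask the high bits
    const unsigned aa = popcount64(x & y);
    return Counts{popcount64(x) - aa, popcount64(y) - aa, aa};
}
```

The shift and the two masks are the additional per-block steps; with two separate bit-vectors the x and y blocks are read directly and only the AND remains.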
There is an overhead associated with this decoding step, and it can be felt in certain algorithms. We measured approximately a 20% overhead when building frequency tables. While this is a significant overhead, the number of frequency tables is linear in the number of markers. Therefore, it is conceivable to build these tables once, and reuse them in downstream analytical steps as needed. As a result, this overhead is generally acceptable. Furthermore, the overhead is effectively hidden when building pairwise frequency tables.

The improvement in performance present when constructing pairwise frequency tables from 2-bit encoded genotypes stems from the reduced number of memory access steps. As shown in Algorithm 3, six genotype blocks are used in each step of the iteration. When 3-bit encoding is used, each of these blocks must be read from memory. Conversely, the 2-bit encoding only needs to read four blocks and computes the remaining two blocks.

A further general performance increase may be possible through the use of hardware implementations of popcount algorithms. As part of the Streaming SIMD Extensions (SSE) of the x86 microarchitecture there is a popcnt [10] instruction. Recent processor lines from both Intel and AMD offer this instruction in some form or another.
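As a hedged illustration of how the hardware instruction could be reached from C++, the sketch below uses the GCC and Clang builtin __builtin_popcountll. This builtin is compiler specific; when the target enables the instruction (for example with the -mpopcnt option) the compiler can emit popcnt, and otherwise it falls back to a software routine. The function name hamming_weight is hypothetical and the code was not part of the benchmarks reported here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hamming weight of a whole bit-vector, one 64-bit block at a time, using the
// compiler builtin in place of a lookup table.
std::uint64_t hamming_weight(const std::vector<std::uint64_t>& bits) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < bits.size(); ++i)
        total += static_cast<std::uint64_t>(__builtin_popcountll(bits[i]));
    return total;
}
```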
Figure 5 Average epistasis runtime using BOOST [6] algorithm. (Plot title: Epistasis (BOOST) algorithm, average runtime for 10000 variants; x-axis: Individuals; y-axis labeled "Time(s)"; series: 2-bit encoding scheme, 3-bit encoding scheme.)

As we mentioned earlier, these succinct data structures are intended to impact the increasing scale of sample sets. The algorithms for building the frequency tables are linear in the size of the sample sets. By fixing the number of variants and varying the number of samples in a data set we show the linear increase of the epistasis algorithm runtime, as is indicated by Figure 5. Unfortunately, the runtime of brute force algorithms like BOOST [6] is dominated more by the number of variants being analyzed than the number of individuals being studied. A data set of 10,000 variants means that 5 × 10^7 unique contingency tables need to be built for a typical case-control study. Expanding that size to a million variants increases the contingency table count to 5 × 10^11. Other works have demonstrated parallel implementations that effectively address the variant scaling [9,11,12]. This work demonstrates a general way to further improve the performance of these algorithms.
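The table counts quoted above are consistent with the number of unordered variant pairs, n(n − 1)/2. A minimal C++ check, with the hypothetical function name pair_tables:

```cpp
#include <cassert>
#include <cstdint>

// Number of unordered pairs among the given number of variants (n choose 2).
std::uint64_t pair_tables(std::uint64_t variants) {
    return variants * (variants - 1) / 2;
}
```

For 10,000 variants this gives 49,995,000 pairs (about 5 × 10^7), and for one million variants 499,999,500,000 pairs (about 5 × 10^11), which is why the variant count dominates the runtime of an exhaustive pairwise analysis.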
Conclusions

In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed data representations in Genome-Wide Association Studies. We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.
Availability and requirements

Project name: libgwaspp
Project home page: https://github.com/putnampp/libgwaspp
Operating system(s): Linux
Programming language: C++
Other requirements: CMake 2.8.9, GCC 4.7 or higher, Boost 1.51.0, ZLIB, GSL
License: FreeBSD

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PPP designed and implemented the software, conducted the experiments, and wrote the main manuscript. GZ provided domain specific expertise in GWA studies, and the empirical data from which the simulated data was generated. PW contributed extensive knowledge of computational architectures and data structures. Both also contributed greatly to the result analysis and editing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was partially supported by the Pilot and Feasibility Program of the Perinatal Institute, Cincinnati Children's Hospital Medical Center.
Received: 25 June 2013 Accepted: 11 December 2013 Published: 21 December 2013

References

1. Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP: Computational solutions to large-scale data management and analysis. Nat Rev Genet 2010, 11(9):647–657. http://dx.doi.org/10.1038/nrg2857.
2. Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Ann Rev Genom Human Genet 2009, 10:387–406. http://www.annualreviews.org/doi/abs/10.1146/annurev.genom.9.081307.164242. [PMID: 19715440].
3. Whole-genome genotyping and copy number variation analysis. 2013. http://www.illumina.com/applications/detail/snp_genotyping_and_cnv_analysis/whole_genome_genotyping_and_copy_number_variation_analysis.ilmn. [Online; accessed 9-January-2013]
4. A map of human genome variation from population-scale sequencing. Nature 2010, 467(7319):1061–1073. http://dx.doi.org/10.1038/nature09534.
5. Nielsen J, Mailund T: SNPFile - A software library and file format for large scale association mapping and population genetics studies. BMC Bioinformatics 2008, 9:526. http://www.biomedcentral.com/1471-2105/9/526.
6. Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W: BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Human Genet 2010, 87(3):325–340. http://linkinghub.elsevier.com/retrieve/pii/S0002929710003782.
7. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC: PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Human Genet 2007, 81(3):559–575. http://pngu.mgh.harvard.edu/purcell/plink/.
8. Jacobson G: Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, SFCS '89. Washington: IEEE Comput Soc; 1989:549–554. http://dx.doi.org/10.1109/SFCS.1989.63533.
9. Gyenesei A, Moody J, Laiho A, Semple CA, Haley CS, Wei WH: BiForce Toolbox: powerful high-throughput computational analysis of gene-gene interactions in genome-wide association studies. Nucleic Acids Res 2012, 40(W1):W628–W632. http://nar.oxfordjournals.org/content/40/W1/W628.abstract.
10. Intel: Intel SSE4 Programming Reference; 2007. http://home.ustc.edu.cn/~shengjie/REFERENCE/sse4_instruction_set.pdf.
11. Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics 2011, 27(9):1309–1310. http://bioinformatics.oxfordjournals.org/content/27/9/1309.abstract.
12. Schüpbach T, Xenarios I, Bergmann S, Kapur K: FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics 2010, 26(11):1468–1469. http://bioinformatics.oxfordjournals.org/content/26/11/1468.abstract.

doi:10.1186/1471-2105-14-369
Cite this article as: Putnam et al.: A comparison study of succinct data structures for use in GWAS. BMC Bioinformatics 2013 14:369.