Putnam et al. BMC Bioinformatics 2013, 14:369
http://www.biomedcentral.com/1471-2105/14/369

SOFTWARE    Open Access

A comparison study of succinct data structures for use in GWAS
Patrick P Putnam1,2*, Ge Zhang2* and Philip A Wilsey1

Abstract
Background: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed.
Schadt et al (NRG 11:647–657, 2010) offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed.
The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques.
In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data.
We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes.

Results: We perform a comparison of 2- and 3-bit genotype encoding schemes for use in genotype counting algorithms.
We find that there is a 20% overhead when building simple frequency tables from 2-bit encoded genotypes.
However, building pairwise count tables for genome-wide epistasis is 1.0% more efficient.

Conclusions: In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed genotype data representations in Genome-Wide Association Studies (GWAS).
We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme.
We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms.
In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.

Background
In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed.
Schadt et al [1] offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed.
The majority of tools used in GWA data analysis assume that a data set will easily fit into the main memory of a desktop computer.
Most desktop computers have around 4–16 GB of main memory, which is more than enough to fit a data set of 1 million variants by tens of thousands of individuals.
*Correspondence: putnampp@gmail.com; zhangge.uc@gmail.com
1 Experimental Computing Lab, School of Electronic and Computing Systems, PO Box 210030, Cincinnati, OH 45221–0030, USA
2 Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
However, data set sizes continue to grow with advancements in analysis techniques and technologies.
For example, techniques like genotype imputation [2] attempt to expand data sets by deriving missing genotypes from reference panels.
Genotyping technologies such as Illumina's Omni SNP HumanOmni5-Quad chips allow for genotyping of upwards of 5 million markers [3].
Furthermore, genome sequencing technologies are advancing to the point where determining genotypes via whole genome sequencing may be a viable option.
Having an individual's entire DNA sequence opens the door for even more genetic markers to be analyzed.
The 1000 Genomes project [4] now includes roughly 36.7 million variants in the human genome.
The size of a data file used to represent the genotypes of 1000 individuals would be roughly 37 GB (assuming 1 byte is used to store each genotype).
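The 37 GB estimate can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name genotype_bytes is hypothetical, and it assumes densely packed genotypes with no per-vector block padding, which a real bit-vector layout would add.

```cpp
#include <cstdint>

// Bytes needed to store variants x individuals genotypes at
// bits_per_genotype bits each. Assumption: genotypes are packed with no
// padding to block boundaries, so this is a lower bound on real usage.
std::uint64_t genotype_bytes(std::uint64_t variants,
                             std::uint64_t individuals,
                             unsigned bits_per_genotype) {
    const std::uint64_t total_bits = variants * individuals * bits_per_genotype;
    return (total_bits + 7) / 8;  // round up to whole bytes
}
```

For 36.7 million variants and 1000 individuals this gives 36.7 GB at 8 bits per genotype, about 13.8 GB at 3 bits, and about 9.2 GB at 2 bits.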
There are several options for handling data sets of this size.
First, the cost of upgrading a standard PC's memory to handle this amount of data is not unreasonable.
Second, the algorithm can be extended to utilize memory mapping techniques [5], which effectively page chunks of the data file into main memory as they are needed.
A third option is to modify the format for representing genotypes such that the genotypes are expressed in their most succinct form [6,7].
This manuscript explores the last option more deeply.

© 2013 Putnam et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The interest is motivated in part by the desire to work in the General-Purpose Graphics Processing Unit (GPGPU) space, which has somewhat limited memory, especially when considered on a processor-by-processor basis.
The compression of genotype data is most effectively performed using succinct data structures [8].
Succinct data structures allow compression rates close to the information-theoretic limits and yet preserve the ability to access individual data elements.
In the genotype analysis tools that use succinct data types (e.g., BOOST [6] and BiForce [9]), a 3-bit genotype representation for biallelic markers has been adopted.
While a 3-bit representation does provide a succinct data structure, it is not the most succinct.
More precisely, from an information theoretic perspective, 3 bits are able to represent up to 8 unique values.
However, there are only 4 commonly used unphased genotypes, namely {NN, AA, Aa, aa}, where NN is used to represent missing data.
This means that a 2-bit representation is the information theoretic lower bound and its use would provide an even more compact representation.
An important consideration when designing succinct data structures is data element orientation in memory.
BOOST [6] and BiForce [9] adopted a vectored orientation for representing data elements.
The vectored orientation spreads each data element over multiple bit vectors.
In other words, they utilize 3 bit vectors per marker to represent the set of genotypes.
The advantages of this orientation are discussed later.
This manuscript makes two important contributions in the use of succinct data structures for genomic encoding.
In particular, (i) we implement a technique to reduce genotype encoding to a 2 bit vector form, and (ii) we compare the performance of the new 2-bit encoding to the conventional 3 bit vector encoding.
From these studies, we have observed that the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.

Implementation
We analyzed a commonly used 3-bit binary representation of genotypes from performance and scalability perspectives.
With this information we developed a C++ object library that we have named libgwaspp.
The library provides data structures for managing genotype data tables in a 2- or 3-bit representation.
Finally, we benchmarked the two representations on randomly generated data sets of various scales.

Genome-wide association studies
DNA from individuals is collected, sequenced or genotyped, and the genotypes for genetic variants are used in Genome-Wide Association Studies (GWAS).
These studies aim to determine whether genetic variants are associated with certain traits, or phenotypes.
The most common studies are case-control studies, which group individuals together into two sets based on the presence (case) or absence (control) of a specific trait.
These studies typically rely upon various statistical tests based upon the genotypic or allelic distribution of the variants in each set.
An average data set aims to compare thousands of individuals by hundreds of thousands to millions of variants.
GWA studies can be computationally intensive to perform.
Common algorithms consider either each variant individually, or variants in combination with one another.
For example, measuring the odds ratio for each variant in a case-control study is one way of identifying variants which may be associated with the trait in question.
An epistasis analysis algorithm, such as BOOST [6], compares the genotype distribution of two variants in each step.
In both of these algorithms, the basic task is counting the occurrences of each genotype in each of the case-control sets.
In other words, the first step in determining the odds ratio is to build a frequency table (Table 1) for both the case and control sets at a specific variant.
Similarly, the BOOST [6] algorithm first builds a contingency table (Table 2), or pairwise genotype count table, for a pair of variants.

Binary genotype encoding schemes
A common way to minimize the impact of the table building bottleneck is to fully utilize processor throughput by counting genotypes from multiple individuals in one step.
The binary encoding of genotypes adopted by BOOST [6] improves the computational efficiency of the epistasis algorithm.
The algorithm uses 3 bit vectors to encode genotype data.
In this scheme each genotype is its own bit-vector, or stream, of data.
Each bit corresponds to an indexed individual, and the indexing is assumed to be constant across all markers.
A set bit indicates that the individual has the corresponding genotype for the specified marker.
Therefore, every variant requires 3 vectors to fully represent the genotypes.
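The 3-bit vectored layout can be sketched as follows. This is a minimal illustration, not the libgwaspp or BOOST code: the names Genotype, Marker3Bit and encode_3bit are hypothetical, and the choice to store individual i in bit (i mod 64) of block (i / 64), least significant bit first, is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical genotype codes used only for this illustration.
enum class Genotype { NN, AA, Aa, aa };

// One marker in the 3-bit scheme: one bit-vector per genotype,
// stored as 64-bit blocks.
struct Marker3Bit {
    std::vector<std::uint64_t> AA, Aa, aa;
};

// Encode the genotypes of all individuals at one marker.
// Assumption: individual i maps to bit (i % 64) of block (i / 64).
Marker3Bit encode_3bit(const std::vector<Genotype>& genotypes) {
    const std::size_t blocks = (genotypes.size() + 63) / 64;
    Marker3Bit m;
    m.AA.assign(blocks, 0);
    m.Aa.assign(blocks, 0);
    m.aa.assign(blocks, 0);
    for (std::size_t i = 0; i < genotypes.size(); ++i) {
        const std::uint64_t bit = std::uint64_t{1} << (i % 64);
        switch (genotypes[i]) {
            case Genotype::AA: m.AA[i / 64] |= bit; break;
            case Genotype::Aa: m.Aa[i / 64] |= bit; break;
            case Genotype::aa: m.aa[i / 64] |= bit; break;
            case Genotype::NN: break;  // missing: no bit set in any vector
        }
    }
    return m;
}
```

Applied to marker MA of Table 3 (AA, Aa, AA, aa, NN), this yields the three rows shown for MA in Table 4, read from the least significant bit.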

Table 1 Frequency table for raw input from Tables 3, 4 and 5
        AA   Aa   aa   NN
CA       2    1    1    1
CB       2    1    2    0

Table 2 Pairwise genotype count table for two markers
                  MB
             AA   Aa   aa   NN   CA
MA    AA      1    0    1    0    2
      Aa      1    0    0    0    1
      aa      0    0    1    0    1
      NN      0    1    0    0    1
CB            2    1    2    0
Note that the marginal sums of this table are the individual markers' frequencies from Table 1.

There are two key benefits of using this binary encoding scheme.
The first is that the task of building a frequency table for a given marker is reduced to calculating the Hamming distance between each of its bit-vectors and a bit-vector of all zeros.
This distance is also referred to as a Hamming weight.
The technique used for calculating the Hamming weight of a bit vector is to divide the bit-vector into manageable blocks, and sum the Hamming weight of each block.
The block size is usually linked to the processor word size, typically 32 or 64 bits (4 or 8 bytes).
The algorithm for computing the Hamming weight of an individual block is commonly referred to as Population Counting (popcount).
We chose to follow the BOOST implementation of popcount, which looks up the Hamming weight of 16-bit blocks in a pre-populated weight table.
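A table-driven popcount of this kind can be sketched as below. This is an illustration of the general technique under stated assumptions, not the BOOST or libgwaspp source: the names weight_table and popcount64 are hypothetical, and the table is built lazily on first use.

```cpp
#include <array>
#include <cstdint>

// Weight table holding the Hamming weight of every 16-bit value.
// It is filled once, on first use, from the recurrence
// weight(v) = (lowest bit of v) + weight(v >> 1).
inline const std::array<std::uint8_t, 65536>& weight_table() {
    static const std::array<std::uint8_t, 65536> table = [] {
        std::array<std::uint8_t, 65536> t{};  // t[0] == 0
        for (std::uint32_t v = 1; v < 65536; ++v)
            t[v] = static_cast<std::uint8_t>((v & 1u) + t[v >> 1]);
        return t;
    }();
    return table;
}

// Hamming weight of a 64-bit block: four 16-bit table look-ups.
inline unsigned popcount64(std::uint64_t x) {
    const auto& t = weight_table();
    return static_cast<unsigned>(t[x & 0xFFFF]) + t[(x >> 16) & 0xFFFF] +
           t[(x >> 32) & 0xFFFF] + t[x >> 48];
}
```

The table occupies 64 KB, so whether it stays resident in cache depends on the processor and on what else the inner loop touches.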
The second benefit is that it reduces genotype comparison logic to simple Boolean logic operations.
More specifically, the task of counting individuals which have a specific combination of genotypes for two markers is simplified to finding the Hamming weight of the logical AND of the genotype bit vectors.
This is useful when building contingency tables.
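As a sketch of this second benefit, one contingency table cell can be counted from two genotype bit-vectors as follows. The function name count_pair is hypothetical, and the portable bit-clearing popcount64 helper here is a stand-in for the table look-up described above; both vectors are assumed to use the same individual indexing and block count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable stand-in popcount: clears the lowest set bit until none remain.
inline unsigned popcount64(std::uint64_t x) {
    unsigned n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

// Number of individuals whose bit is set in both vectors, e.g. the AA
// vector of marker A and the aa vector of marker B.
unsigned count_pair(const std::vector<std::uint64_t>& a,
                    const std::vector<std::uint64_t>& b) {
    assert(a.size() == b.size());
    unsigned count = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        count += popcount64(a[i] & b[i]);
    return count;
}
```

With the Table 4 vectors, taking individual I1 as the least significant bit, the AA vector of MA and the aa vector of MB share only I3, so the corresponding cell of Table 2 is 1.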
Of interest to this paper is the fact that when using the 3-bit encoding scheme at least two thirds of the bits used will be unset.
An information theoretic analysis of the genotype alphabet indicates that 2 bits are sufficient to uniquely represent each of the four unphased genotypes.
The immediate benefit is a one third reduction in memory consumption (Tables 3, 4 and 5).
The caveat to this encoding scheme is that determining a genotype requires both bits.
The algorithm in Figure 1 is a pseudo-code representation of how to build a genotype count table from 2-bit encoded data.
The Hamming weight of each vector is the number of individuals with (AA or aa), and (Aa or aa) genotypes, respectively.

Table 3 Example genotype input
        I1   I2   I3   I4   I5
MA      AA   Aa   AA   aa   NN
MB      AA   AA   aa   aa   Aa
I1-5 represent individuals, and MA and MB are markers.

Table 4 3-bit encoding scheme
             I1   I2   I3   I4   I5
MA    AA      1    0    1    0    0
      Aa      0    1    0    0    0
      aa      0    0    0    1    0
MB    AA      1    1    0    0    0
      Aa      0    0    0    0    1
      aa      0    0    1    1    0

To disambiguate the values it is necessary to compute the Hamming weight of the logical AND of the bit-vectors.
This value represents the number of (aa) genotypes, and subtracting it from the previous two weights will result in the appropriate counts.
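The counting procedure of Figure 1 might be written in C++ roughly as follows. This is a sketch under assumptions, not the libgwaspp implementation: the names GenotypeCounts and frequency_2bit are hypothetical, the popcount helper is a portable stand-in, unused padding bits are assumed to be zero, and the NN count (which Figure 1 does not compute) is derived here from the number of individuals.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline unsigned popcount64(std::uint64_t x) {
    unsigned n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

struct GenotypeCounts { unsigned AA, Aa, aa, NN; };

// x: the (AA or aa) bit-vector, y: the (Aa or aa) bit-vector of one marker.
// Assumption: padding bits beyond the last individual are zero.
GenotypeCounts frequency_2bit(const std::vector<std::uint64_t>& x,
                              const std::vector<std::uint64_t>& y,
                              std::size_t individuals) {
    assert(x.size() == y.size());
    unsigned wx = 0, wy = 0, wxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        wx  += popcount64(x[i]);         // AA or aa
        wy  += popcount64(y[i]);         // Aa or aa
        wxy += popcount64(x[i] & y[i]);  // aa only
    }
    GenotypeCounts c;
    c.aa = wxy;
    c.AA = wx - wxy;  // remove the aa individuals counted in x
    c.Aa = wy - wxy;  // remove the aa individuals counted in y
    c.NN = static_cast<unsigned>(individuals) - (c.AA + c.Aa + c.aa);
    return c;
}
```

For marker MA of Table 5 (I1 as the least significant bit, x = 13 and y = 10) this reproduces the CA row of Table 1: two AA, one Aa, one aa and one NN.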
The algorithm in Figure 2 illustrates the construction of a pairwise genotype count table, or contingency table.
A contingency table represents the number of individuals who possess a genotype combination for a pair of markers.
When using the 3-bit encoding scheme, each cell of the table is simply the Hamming weight of the logical AND of the genotype bit-vectors for the two markers.
The 2-bit encoding requires an inline transformation step to convert the 2-bit encoded data into 3-bit data.
This step is necessary to be able to take advantage of the popcount bit counting method.
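One way such an inline transformation could look is sketched below. It follows directly from the definition of the two vectors, (AA or aa) and (Aa or aa), but it is an assumption about the form of the step rather than a transcription of Figure 2; the names Block3 and expand_2bit are hypothetical.

```cpp
#include <cstdint>

// The three per-genotype blocks of the 3-bit scheme.
struct Block3 { std::uint64_t AA, Aa, aa; };

// x: block of the (AA or aa) vector, y: block of the (Aa or aa) vector.
// A bit set only in x is AA, only in y is Aa, and in both is aa.
inline Block3 expand_2bit(std::uint64_t x, std::uint64_t y) {
    return Block3{x & ~y, y & ~x, x & y};
}
```

The expanded blocks can then be combined with logical AND and popcount exactly as in the 3-bit scheme, at the cost of a few extra register operations per block.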
Both of the above algorithms can be further improved by incorporating additional information.
For example, the algorithm for building a contingency table can be simplified if marginal information for both variants is available.
The contingency table algorithm can make use of the variants' frequency tables and reduce the number of Hamming weight values to be computed from 9 to only 4.
The remaining values can be easily computed by subtracting the row and column sums from their respective marginal information values.
This reduction offers significant computational savings, especially when performing exhaustive epistasis analysis.
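The reduction from 9 to 4 counted cells can be illustrated as follows. The sketch assumes that neither marker has missing (NN) genotypes among the individuals counted, so that every row and column of the 3x3 table sums to its marginal; handling missing data would need additional bookkeeping. The names Table3x3 and complete_table are hypothetical.

```cpp
#include <array>

using Table3x3 = std::array<std::array<unsigned, 3>, 3>;

// Rows index marker A and columns index marker B, both in the order
// AA, Aa, aa. n00, n01, n10, n11 are the four cells obtained with popcount.
// rowA and colB are the per-marker frequency tables (marginals).
// Assumption: no missing genotypes, so rows and columns sum to the marginals.
Table3x3 complete_table(unsigned n00, unsigned n01,
                        unsigned n10, unsigned n11,
                        const std::array<unsigned, 3>& rowA,
                        const std::array<unsigned, 3>& colB) {
    Table3x3 t{};
    t[0][0] = n00; t[0][1] = n01;
    t[1][0] = n10; t[1][1] = n11;
    t[0][2] = rowA[0] - n00 - n01;          // rest of row AA
    t[1][2] = rowA[1] - n10 - n11;          // rest of row Aa
    t[2][0] = colB[0] - n00 - n10;          // rest of column AA
    t[2][1] = colB[1] - n01 - n11;          // rest of column Aa
    t[2][2] = rowA[2] - t[2][0] - t[2][1];  // rest of row aa
    return t;
}
```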

Benchmarking
We compared the performance of the 2-bit encoded data to the 3-bit encoded data.
In particular, we measured the runtime for building frequency tables and contingency tables using both encoding schemes.
The runtime of these algorithms is dependent upon the number of columns, or individuals, in each row.
Therefore, we decided to hold the number of rows constant at 10,000 variants.

Table 5 2-bit encoding scheme
                   I1   I2   I3   I4   I5
MA    AA OR aa      1    0    1    1    0
      Aa OR aa      0    1    0    1    0
MB    AA OR aa      1    1    1    1    0
      Aa OR aa      0    0    1    1    1

Constructing a frequency table from 2-bit encoded genotypes
    AA ← 0
    Aa ← 0
    aa ← 0
    for i = 0 → N do                 ▷ N is the number of blocks per bit vector
        x ← A[i]                     ▷ A[i] is the (AA or aa) genotype bit vector
        y ← B[i]                     ▷ B[i] is the (Aa or aa) genotype bit vector
        aa ← aa + popcount(x ∧ y)
        Aa ← Aa + popcount(y)
        AA ← AA + popcount(x)
    end for
    AA ← AA − aa
    Aa ← Aa − aa
Figure 1 Constructing a frequency table from 2-bit encoded genotypes.

We varied the number of columns between 1,000 and 50,000 individuals.
We also tested a set with 150,000 individuals as an extreme scale experiment.
The genotypes were simulated following the empirical allele frequency spectrum of Affymetrix array 6.0 SNPs of the CEU HapMap samples.
Similarly, individuals were randomly classified as either a case or control.
Three experiments were conducted.
First, for each data set the runtime for building frequency tables for each of the variants was measured.
Second, for each data set the runtime for building all contingency tables for an exhaustive pairwise epistasis test was measured.
Third, each data set was run through our implementation of the BOOST [6] algorithm and the total runtime was recorded.
The runtime of the BOOST [6] algorithm does not include the time to load the compressed data set into main memory.
In each of these tests, the average runtime is calculated and presented.
All tests were conducted upon a desktop computer with a 3.2 GHz Intel Core i7-3930K, 32 GB of 1600 MHz DDR3 memory, with 64-bit Fedora 17.
Time was measured down to the nanosecond using the clock_gettime() glibc function.
We used GNU G++ compiler 4.7, and compiled using the standard "-O3" compiler optimization flag.
The tests were performed using a 64-bit block size.
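A timing harness built on clock_gettime() could look like the following sketch. The text does not say which clock was used, so CLOCK_MONOTONIC is an assumption, and the helper names now_ns and time_ns are hypothetical. On older glibc versions the program may also need to be linked with -lrt.

```cpp
#include <cstdint>
#include <time.h>

// Current time in nanoseconds.
// Assumption: CLOCK_MONOTONIC; the clock id used in the paper is not stated.
inline std::int64_t now_ns() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Runs work() once and returns the elapsed time in nanoseconds.
template <typename F>
std::int64_t time_ns(F&& work) {
    const std::int64_t start = now_ns();
    work();
    return now_ns() - start;
}
```

Averaging the result of time_ns over many table constructions gives a per-table figure comparable in kind to the averages reported in the Results.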

Results
The first experiment measured the runtime for building frequency tables.
Initially, the 3-bit encoding scheme appeared to offer a consistent performance advantage over the 2-bit encoding.
As the number of individuals increased, it took less time to construct the count table with the 3-bit scheme than with the 2-bit scheme (Figure 3).
The average time to build a genotype count table for fewer than 10,000 individuals is less than 1 μs.
For data sets greater than 10,000 individuals, there is some performance overhead that results from decoding the 2-bit vectors.
Building frequency tables from the 3-bit encoded data proved to be 12–25% faster than when built from 2-bit encoded data.
In the extreme scale data set there was a 5.00 μs difference in favor of the 3-bit scheme.
However, the second experiment offered different results.
The second experiment measured the runtime for building contingency tables for all pairs of variants in the data sets.
In this experiment, the 2-bit encoding scheme offered better performance.
Similar to the first experiment, 10,000 individuals seemed to be the diverging point (Figure 4).
At sizes greater than 10,000 individuals, the 2-bit encoding scheme offered a 1% performance improvement over the 3-bit scheme.
With 150,000 individuals, this equates to about a 0.32 μs difference in average performance.
The third experiment further confirms this performance gain (Table 6).

Figure 2 Constructing a contingency table from 2-bit encoded genotypes.

Figure 3 Average Case/Control frequency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals. (Plot: average build time for 10,000 variants following the Affy6 genotype distribution; Time (s) versus Individuals; series: 2-bit encoding scheme, 3-bit encoding scheme.)

Discussion
This work focuses on ways to address frequency table building processes found in GWAS for two primary reasons.
First, upstream steps, like the loading of data, in a general GWAS pipeline are performed relatively infrequently, and can be performed offline.
For example, a data set can be transformed into an optimized format once, and in every repeat analysis of the data set the loading becomes a constant time step within the pipeline.
Conversely, the building of these tables amounts to a frequently recurring step which is typically performed inline under varying conditions.
Secondly, we viewed the table building process as a bottleneck for downstream analytical steps.
Offering an approach which positively impacts the cost associated with this bottleneck is beneficial.
The results suggest that the use of a 2-bit encoding scheme for genotype data does offer several benefits over a 3-bit encoding scheme.
The compact encoding scheme requires 33% less memory for representing the same data.
Aside from freeing up system memory for other tasks, the memory savings can be beneficial for other reasons.
For example, epistasis algorithms like BOOST [6] can be run on Graphics Processing Units.
GPUs are separate devices on a computer which have their own physical memory, typically less than 6 GB, and require data to be copied to and from the device.
The limited memory and data transfer issues both benefit from using a more compact data format.

Figure 4 Average Case/Control contingency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals. (Plot: average build time for 10,000 variants following the Affy6 genotype distribution; Time (s) versus Individuals; series: 2-bit encoding scheme, 3-bit encoding scheme.)

Table 6 Epistasis runtime comparison
Individuals    2-bit        3-bit        Speedup (%)
1000           28.56 s      28.45 s       0.37
5000           92.07 s      93.32 s      -1.33
10000          173.12 s     177.46 s     -2.45
25000          418.31 s     420.71 s     -0.57
50000          810.71 s     820.26 s     -1.16
150000         2408.05 s    2427.84 s    -0.81
Speedup is measured relative to the 3-bit runtime.

2-bit encoded genotypes have also been used by other software packages.
PLINK [7], for example, uses a 2-bit encoding in the BED file format.
BED files use a contiguous pairing of bits to express the genotype of an individual.
Using bit pairs allows for more efficient individual genotype decoding as a result of the bits existing in the same bit-block.
However, additional bit masking steps need to be applied to each block to effectively utilize popcount based methods for counting genotype occurrences within a block.
As mentioned earlier, our implementation adopts a bit-vectored approach, whereby an individual's genotype is divided over two separate vectors.
This is primarily done to reduce the number of masking steps.
In either case, some form of genotype disambiguation is necessary.
There is an overhead associated with this decoding step, and it can be felt in certain algorithms.
We measured approximately a 20% overhead when building frequency tables.
While this is a significant overhead, the number of frequency tables is linear in the number of markers.
Therefore, it is conceivable to build these tables once, and reuse them in downstream analytical steps as needed.
As a result, this overhead is generally acceptable.
Furthermore, the overhead is effectively hidden when building pairwise frequency tables.
The improvement in performance present when constructing pairwise frequency tables from 2-bit encoded genotypes stems from the reduced number of memory access steps.
As shown in Algorithm 3, six genotype blocks are used in each step of the iteration.
When 3-bit encoding is used, each of these blocks must be read from memory.
Conversely, the 2-bit encoding only needs to read four blocks and computes the remaining two blocks.
A further general performance increase may be possible through the use of hardware implementations of popcount algorithms.
As part of the Streaming SIMD Extensions 4 (SSE4) for the x86 architecture there is a popcnt [10] instruction.
Recent processor lines from both Intel and AMD offer this instruction in some form or another.
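With GCC, which was the compiler used for the benchmarks, the hardware instruction can be reached through a compiler builtin, as in the sketch below. This was not part of the benchmarked implementation, and the wrapper name popcount_hw is hypothetical. GCC emits the popcnt instruction for this builtin only when the target supports it (for example with -mpopcnt or -msse4.2); otherwise it falls back to a software routine.

```cpp
#include <cstdint>

// Hamming weight of a 64-bit block via the GCC/Clang builtin.
// Compiles to a single popcnt instruction when the target allows it.
inline unsigned popcount_hw(std::uint64_t x) {
    return static_cast<unsigned>(__builtin_popcountll(x));
}
```

Because the result is identical to the table look-up, such a function could replace the software popcount without changing any of the counting algorithms.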
As we mentioned earlier, these succinct data structures are intended to address the increasing scale of sample sets.
The algorithms for building the frequency tables are linear in the size of the sample set.
By fixing the number of variants and varying the number of samples in a data set we show the linear increase of the epistasis algorithm runtime, as is indicated by Figure 5.
Unfortunately, the runtime of brute force algorithms like BOOST [6] is dominated more by the number of variants being analyzed than the number of individuals being studied.

Figure 5 Average epistasis runtime using BOOST [6] algorithm. (Plot: average runtime for 10,000 variants; Time (s) versus Individuals; series: 2-bit encoding scheme, 3-bit encoding scheme.)

A data set of 10,000 variants means that 5 × 10^7 unique contingency tables need to be built for a typical case-control study.
Expanding that size to a million variants increases the contingency table count to 5 × 10^11.
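These counts are the number of unordered variant pairs, m(m − 1)/2 for m variants. A one-line check, with the hypothetical function name variant_pairs:

```cpp
#include <cstdint>

// Number of unordered pairs among m variants: m choose 2.
constexpr std::uint64_t variant_pairs(std::uint64_t m) {
    return m * (m - 1) / 2;
}
```

For m = 10,000 this gives 49,995,000 pairs, and for m = 1,000,000 it gives 499,999,500,000 pairs, which round to the figures quoted above.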
Other works have demonstrated parallel implementations that effectively address the variant scaling [9,11,12].
This work demonstrates a general way to further improve the performance of these algorithms.

Conclusions
In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed data representations in Genome-Wide Association Studies.
We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme.
We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms.
In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.

Availability and requirements
Project name: libgwaspp
Project home page: https://github.com/putnampp/libgwaspp
Operating system(s): Linux
Programming language: C++
Other requirements: CMake 2.8.9, GCC 4.7 or higher, Boost 1.51.0, ZLIB, GSL
License: FreeBSD

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
PPP designed and implemented the software, conducted the experiments, and wrote the main manuscript.
GZ provided domain specific expertise in GWA studies, and the empirical data from which the simulated data was generated.
PW contributed extensive knowledge of computational architectures and data structures.
Both also contributed greatly to the result analysis and editing of the manuscript.
All authors read and approved the final manuscript.

Acknowledgements
This work was partially supported by the Pilot and Feasibility Program of the Perinatal Institute, Cincinnati Children's Hospital Medical Center.

Received: 25 June 2013 Accepted: 11 December 2013 Published: 21 December 2013

References
1. Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP: Computational solutions to large-scale data management and analysis. Nat Rev Genet 2010, 11(9):647–657. http://dx.doi.org/10.1038/nrg2857.
2. Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Ann Rev Genom Human Genet 2009, 10:387–406. http://www.annualreviews.org/doi/abs/10.1146/annurev.genom.9.081307.164242. [PMID: 19715440].
3. Whole-genome genotyping and copy number variation analysis. 2013. http://www.illumina.com/applications/detail/snp_genotyping_and_cnv_analysis/whole_genome_genotyping_and_copy_number_variation_analysis.ilmn. [Online; accessed 9-January-2013]
4. A map of human genome variation from population-scale sequencing. Nature 2010, 467(7319):1061–1073. http://dx.doi.org/10.1038/nature09534.
5. Nielsen J, Mailund T: SNPFile - A software library and file format for large scale association mapping and population genetics studies. BMC Bioinformatics 2008, 9:526. http://www.biomedcentral.com/1471-2105/9/526.
6. Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W: BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Human Genet 2010, 87(3):325–340. http://linkinghub.elsevier.com/retrieve/pii/S0002929710003782.
7. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC: PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Human Genet 2007, 81(3):559–575. http://pngu.mgh.harvard.edu/purcell/plink/.
8. Jacobson G: Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, SFCS '89. Washington: IEEE Comput Soc; 1989:549–554. http://dx.doi.org/10.1109/SFCS.1989.63533.
9. Gyenesei A, Moody J, Laiho A, Semple CA, Haley CS, Wei WH: BiForce Toolbox: powerful high-throughput computational analysis of gene-gene interactions in genome-wide association studies. Nucleic Acids Res 2012, 40(W1):W628–W632. http://nar.oxfordjournals.org/content/40/W1/W628.abstract.
10. Intel: Intel SSE4 Programming Reference; 2007. http://home.ustc.edu.cn/~shengjie/REFERENCE/sse4_instruction_set.pdf.
11. Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for detecting gene–gene interactions in genome-wide case control studies. Bioinformatics 2011, 27(9):1309–1310. http://bioinformatics.oxfordjournals.org/content/27/9/1309.abstract.
12. Schüpbach T, Xenarios I, Bergmann S, Kapur K: FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics 2010, 26(11):1468–1469. http://bioinformatics.oxfordjournals.org/content/26/11/1468.abstract.

doi:10.1186/1471-2105-14-369
Cite this article as: Putnam et al.: A comparison study of succinct data structures for use in GWAS. BMC Bioinformatics 2013 14:369.