Putnam et al. BMC Bioinformatics 2013, 14:369
http://www.biomedcentral.com/1471-2105/14/369

SOFTWARE    Open Access

A comparison study of succinct data structures for use in GWAS

Patrick P Putnam1,2*, Ge Zhang2* and Philip A Wilsey1

Abstract

Background: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al (NRG 11:647–657, 2010) offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed. The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques. In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data. We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes.

Results: We perform a comparison of 2- and 3-bit genotype encoding schemes for use in genotype counting algorithms. We find that there is a 20% overhead when building simple frequency tables from 2-bit encoded genotypes. However, building pairwise count tables for genome-wide epistasis is 1.0% more efficient.

Conclusions: In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed genotype data representations in Genome Wide Association Studies (GWAS). We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.

*Correspondence: putnampp@gmail.com; zhangge.uc@gmail.com
1 Experimental Computing Lab, School of Electronic and Computing Systems, PO Box 210030, Cincinnati, OH 45221–0030, USA
2 Human Genetics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA

Background

In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al [1] offered that with data sets approaching the petabyte scale, data related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed.

The majority of tools used in GWA data analysis typically assume that a data set will easily fit into the main memory of a desktop computer. Most desktop computers have around 4–16 GB of main memory, which is more than enough to fit a data set of 1 million variants by tens of thousands of individuals. However, data set sizes continue to grow with advancements in analysis techniques and technologies. For example, techniques like genotype imputation [2] attempt to expand data sets by deriving missing genotypes from reference panels. Genotyping technologies such as Illumina's Omni SNP HumanOmni5-Quad chips allow for genotyping of upwards of 5 million markers [3]. Furthermore, genome sequencing technologies are advancing to the point where determining genotypes via whole genome sequencing may be a viable option. Having an individual's entire DNA sequence opens the door for even more genetic markers to be analyzed. The 1000 Genomes project [4] now includes roughly 36.7 million variants in the human genome. The size of a data file used to represent the genotypes of 1000 individuals would be roughly 37 GB (assuming 1 byte is used to store each genotype).
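
The 37 GB estimate above is a simple product, and the same arithmetic shows what denser encodings would save. The following C++ sketch is only illustrative: the function name table_bytes is hypothetical, and it ignores any per-vector padding to whole blocks that a real implementation would add.

```cpp
#include <cstdint>

// Bytes needed for a variants-by-individuals genotype table when every
// genotype occupies bits_per_genotype bits. The total is rounded up to a
// whole byte; block padding of individual bit-vectors is ignored
// (a simplifying assumption of this sketch).
std::uint64_t table_bytes(std::uint64_t variants,
                          std::uint64_t individuals,
                          unsigned bits_per_genotype) {
    const std::uint64_t total_bits =
        variants * individuals * bits_per_genotype;
    return (total_bits + 7) / 8;
}
```

For 36.7 million variants and 1000 individuals, 8 bits per genotype gives 36.7 × 10^9 bytes, matching the rough 37 GB figure; 3 bits per genotype gives about 13.8 × 10^9 bytes and 2 bits about 9.2 × 10^9 bytes.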

There are several options for handling data sets of this size. First, the cost of upgrading a standard PC's memory to handle this amount of data is not unreasonable. Second, the algorithm can be extended to utilize memory mapping techniques [5], which effectively page chunks of the data file into main memory as they are needed. A third option is to modify the format for representing genotypes such that the genotypes are expressed in their most succinct form [6,7]. This manuscript explores the third option more deeply. The interest is motivated in part by the desire to work in the General-Purpose Graphic Processing Units (GPGPU) space, which has somewhat limited memory, especially when considered on a processor-by-processor basis.

© 2013 Putnam et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The compression of genotype encoding data is most effectively performed using succinct data structures [8]. Succinct data structures allow compression rates close to the information-theoretic limits and yet preserve the ability to access individual data elements. In the genotype analysis tools that use succinct data types (e.g., BOOST [6] and BiForce [9]), a 3-bit genotype representation for biallelic markers has been adopted. While a 3-bit representation does provide a succinct data structure, it is not the most succinct. More precisely, from an information theoretic perspective, 3 bits are able to represent up to 8 unique values. However, there are only 4 commonly used unphased genotypes, namely {NN, AA, Aa, aa} where NN is used to represent missing data. This means that a 2-bit representation is the information theoretic lower bound and its use would provide an even more compact representation.

An important consideration when designing succinct data structures is data element orientation in memory. BOOST [6] and BiForce [9] adopted a vectored orientation for representing data elements. The vectored orientation spreads each data element over multiple bit-vectors. In other words, they utilize 3 bit-vectors per marker to represent the set of genotypes. The advantages of this orientation are discussed later.

This manuscript makes two important contributions in the use of succinct data structures for genomic encoding. In particular, (i) we implement a technique to reduce genotype encoding to a 2 bit-vector form, and (ii) we compare the performance of the new 2-bit encoding to the conventional 3 bit-vector encoding. From these studies, we have observed that the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.

Implementation

We analyzed a commonly used 3-bit binary representation of genotypes from performance and scalability perspectives. With this information we developed a C++ object library that we have named libgwaspp. The library provides data structures for managing genotype data tables in a 2- or 3-bit representation. Finally, we benchmarked the two representations on randomly generated data sets of various scales.

Genome-wide association studies

DNA from individuals is collected, sequenced or genotyped, and the genotypes for genetic variants are used in Genome-Wide Association Studies (GWAS). These studies aim to determine whether genetic variants are associated with certain traits, or phenotypes. The most common studies are case-control studies which group individuals together into two sets based on the presence (case) or absence (control) of a specific trait. These studies typically rely upon various statistical tests based upon the genotypic or allelic distribution of the variants in each set. An average data set aims to compare thousands of individuals by hundreds of thousands to millions of variants.

GWA studies can be computationally intensive to perform. Common algorithms consider either each variant individually, or variants in combination with one another. For example, measuring the odds ratio for each variant in a case-control study is one way of identifying variants which may be associated with the trait in question. An epistasis analysis algorithm, such as BOOST [6], compares the genotype distribution of two variants in each step. In both of these algorithms, the basic task is counting the occurrences of each genotype in each of the case-control sets. In other words, the first step in determining the odds ratio is to build a frequency table (Table 1) for both the case and control sets at a specific variant. Similarly, the BOOST [6] algorithm first builds a contingency table (Table 2), or pairwise genotype count table, for a pair of variants.

Binary genotype encoding schemes

A common way to minimize the impact of the table building bottleneck is to fully utilize processor throughput by counting genotypes from multiple individuals in one step. The binary encoding of genotypes adopted by BOOST [6] improves the computational efficiency of the epistasis algorithm. The algorithm uses 3 bit-vectors to encode genotype data. In this scheme each genotype is its own bit-vector, or stream, of data. Each bit corresponds to an indexed individual, and the indexing is assumed to be constant across all markers. A set bit indicates that the individual has the corresponding genotype for the specified marker. Therefore, every variant requires 3 vectors to fully represent the genotypes.
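
The per-genotype bit-vector layout described above can be illustrated with a short C++ sketch. The names Genotype, Marker3Bit, and encode_3bit are hypothetical and are not taken from BOOST or libgwaspp. The sketch also assumes 64-bit blocks with individual i stored at bit (i mod 64) of block (i / 64), which is only one possible indexing convention.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The four unphased genotypes; NN marks missing data.
enum class Genotype { NN, AA, Aa, aa };

// One bit-vector per genotype, stored as 64-bit blocks (hypothetical type).
struct Marker3Bit {
    std::vector<std::uint64_t> AA, Aa, aa;
};

// Encode one marker: bit i of a vector is set when individual i carries the
// corresponding genotype. A missing genotype sets no bit in any vector.
Marker3Bit encode_3bit(const std::vector<Genotype>& genotypes) {
    const std::size_t blocks = (genotypes.size() + 63) / 64;
    Marker3Bit m;
    m.AA.assign(blocks, 0);
    m.Aa.assign(blocks, 0);
    m.aa.assign(blocks, 0);
    for (std::size_t i = 0; i < genotypes.size(); ++i) {
        const std::uint64_t bit = std::uint64_t{1} << (i % 64);
        switch (genotypes[i]) {
            case Genotype::AA: m.AA[i / 64] |= bit; break;
            case Genotype::Aa: m.Aa[i / 64] |= bit; break;
            case Genotype::aa: m.aa[i / 64] |= bit; break;
            case Genotype::NN: break;
        }
    }
    return m;
}
```

Because exactly one of the three vectors can hold a set bit for any individual, at least two of every three bits are zero, which is the redundancy the 2-bit scheme removes.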

Table 1. Frequency table for raw input from Tables 3, 4 and 5

        AA   Aa   aa   NN
  CA     2    1    1    1
  CB     2    1    2    0

Table 2. Pairwise genotype count table for two markers

                      MB
              AA   Aa   aa   NN   CA
  MA    AA     1    0    1    0    2
        Aa     1    0    0    0    1
        aa     0    0    1    0    1
        NN     0    1    0    0    1
        CB     2    1    2    0

Note that the marginal sums of this table are the individual markers' frequencies from Table 1.

There are two key benefits of using this binary encoding scheme. The first is that the task of building a frequency table for a given marker is reduced to calculating the Hamming distance between each of the bit-vectors and a bit-vector of all zeros. This distance is also referred to as a Hamming weight. The technique used for calculating the Hamming weight of a bit-vector is to divide the bit-vector into manageable blocks, and sum the Hamming weight of each block. The block size is typically linked to the processor word size, typically 32 or 64 bits (4 or 8 bytes). The algorithm for computing the Hamming weight of an individual block is commonly referred to as Population Counting (popcount). We chose to follow the BOOST implementation of popcount, which looks up the Hamming weight of 16-bit blocks in a pre-populated weight table.
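
A table-driven popcount of the kind described above might look like the following C++ sketch. It is not the BOOST or libgwaspp source; the names weight_table and popcount64 are hypothetical, and the 64-bit block width is an assumption taken from the benchmarking setup.

```cpp
#include <array>
#include <cstdint>

// Weight table holding the Hamming weight of every 16-bit value. It is
// filled once, on first use, via the recurrence
// weight(v) = weight(v >> 1) + (lowest bit of v).
inline const std::array<std::uint8_t, 65536>& weight_table() {
    static const std::array<std::uint8_t, 65536> table = [] {
        std::array<std::uint8_t, 65536> t{};
        for (unsigned v = 1; v < 65536; ++v) {
            t[v] = static_cast<std::uint8_t>(t[v >> 1] + (v & 1u));
        }
        return t;
    }();
    return table;
}

// Hamming weight of a 64-bit block: four 16-bit table look-ups, summed.
inline unsigned popcount64(std::uint64_t block) {
    const std::array<std::uint8_t, 65536>& t = weight_table();
    return static_cast<unsigned>(t[block & 0xFFFF]) +
           static_cast<unsigned>(t[(block >> 16) & 0xFFFF]) +
           static_cast<unsigned>(t[(block >> 32) & 0xFFFF]) +
           static_cast<unsigned>(t[block >> 48]);
}
```

The table occupies 64 KiB, so it competes with genotype data for cache space; that trade-off is one reason hardware popcount instructions can be attractive.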

The second benefit is that it reduces genotype comparison logic to simple Boolean logic operations. More specifically, the task of counting individuals which have a specific combination of genotypes for two markers is simplified to finding the Hamming weight of the logical AND of the genotype bit-vectors. This is useful when building contingency tables.
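
As a minimal sketch of this AND-then-count step, the following C++ function computes one contingency table cell from two genotype bit-vectors. The alias BitVector and the function joint_count are hypothetical names, and std::bitset is used here as a portable stand-in for whichever popcount routine an implementation actually chooses.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BitVector = std::vector<std::uint64_t>;

// Number of individuals whose bit is set in both vectors, i.e. who carry
// genotype g1 at the first marker and genotype g2 at the second marker.
std::size_t joint_count(const BitVector& g1, const BitVector& g2) {
    assert(g1.size() == g2.size());  // same individuals, same indexing
    std::size_t count = 0;
    for (std::size_t i = 0; i < g1.size(); ++i) {
        count += std::bitset<64>(g1[i] & g2[i]).count();
    }
    return count;
}
```

With individual I1 stored at bit 0, the (AA) vector of marker MA from Table 4 is the value 5 and the (aa) vector of marker MB is 12; their AND has one set bit, which is the (AA, aa) cell of Table 2.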

Of interest to this paper is the fact that when using the 3-bit encoding scheme at least two thirds of the bits used will be unset. An information theoretic analysis of the genotype alphabet indicates that 2 bits are sufficient to uniquely represent each of the four unphased genotypes. The immediate benefit is a one third reduction in memory consumption (Tables 3, 4 and 5). The caveat to this encoding scheme is that determining a genotype requires both bits.

Table 3. Example genotype input

        I1   I2   I3   I4   I5
  MA    AA   Aa   AA   aa   NN
  MB    AA   AA   aa   aa   Aa

I1-5 represent individuals, and MA and MB are markers.

Table 4. 3-bit encoding scheme

              I1   I2   I3   I4   I5
  MA    AA     1    0    1    0    0
        Aa     0    1    0    0    0
        aa     0    0    0    1    0
  MB    AA     1    1    0    0    0
        Aa     0    0    0    0    1
        aa     0    0    1    1    0

The algorithm in Figure 1 is a pseudo-code representation of how to build a genotype count table from 2-bit encoded data. The Hamming weight of each vector is the number of individuals with (AA or aa), and (Aa or aa) genotypes, respectively. To disambiguate the values it is necessary to compute the Hamming weight of the logical AND of the bit-vectors. This value represents the number of (aa) genotypes, and subtracting it from the previous two weights will result in the appropriate counts.
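
The counting and disambiguation steps just described can be sketched in C++ as follows. The names GenotypeCounts and frequency_table_2bit are hypothetical. The NN count is an addition of this sketch: it assumes the missing genotype is encoded as both bits unset, as in Table 5, and that any padding bits beyond the last individual are zero.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BitVector = std::vector<std::uint64_t>;

struct GenotypeCounts {
    std::size_t AA, Aa, aa, NN;
};

// x is the (AA or aa) bit-vector and y is the (Aa or aa) bit-vector of one
// marker; individuals is the number of genotyped individuals.
GenotypeCounts frequency_table_2bit(const BitVector& x, const BitVector& y,
                                    std::size_t individuals) {
    assert(x.size() == y.size());
    std::size_t wx = 0, wy = 0, wxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        wx  += std::bitset<64>(x[i]).count();         // AA + aa
        wy  += std::bitset<64>(y[i]).count();         // Aa + aa
        wxy += std::bitset<64>(x[i] & y[i]).count();  // aa only
    }
    GenotypeCounts c;
    c.aa = wxy;
    c.AA = wx - wxy;  // remove the aa individuals counted in x
    c.Aa = wy - wxy;  // remove the aa individuals counted in y
    c.NN = individuals - c.AA - c.Aa - c.aa;  // assumes NN is encoded as 00
    return c;
}
```

With I1 stored at bit 0, marker MA from Table 5 becomes x = 13 and y = 10, and the function returns the counts 2, 1, 1, 1 listed for CA in Table 1. Three popcounts per block are needed here, compared with one per genotype vector in the 3-bit scheme, but only two vectors are read.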

The algorithm in Figure 2 illustrates the construction of a pairwise genotype count table, or contingency table. A contingency table represents the number of individuals who possess a genotype combination for a pair of markers. When using the 3-bit encoding scheme, each cell of the table is simply the Hamming weight of the logical AND of the genotype bit-vectors for the two markers. The 2-bit encoding requires an inline transformation step to convert the 2-bit encoded data into 3-bit data. This step is necessary to be able to take advantage of the popcount bit counting method.
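
One way to realize the inline transformation is sketched below in C++. This is not a transcription of Figure 2; the names Block3, expand, and contingency_2bit are hypothetical, and the Boolean identities used (AA = x AND NOT y, Aa = y AND NOT x, aa = x AND y) follow from the 2-bit layout shown in Table 5.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using BitVector = std::vector<std::uint64_t>;
using Table3x3 = std::array<std::array<std::size_t, 3>, 3>;

// One block of each 3-bit genotype vector, derived from a 2-bit block pair.
struct Block3 {
    std::uint64_t g[3];  // g[0] = AA, g[1] = Aa, g[2] = aa
};

// x holds (AA or aa) and y holds (Aa or aa) for the same 64 individuals.
inline Block3 expand(std::uint64_t x, std::uint64_t y) {
    return Block3{{x & ~y, y & ~x, x & y}};
}

// 3x3 genotype contingency table for markers A and B. Rows index marker A
// and columns index marker B, both in the order AA, Aa, aa.
Table3x3 contingency_2bit(const BitVector& xa, const BitVector& ya,
                          const BitVector& xb, const BitVector& yb) {
    assert(xa.size() == ya.size() && xa.size() == xb.size() &&
           xa.size() == yb.size());
    Table3x3 table{};
    for (std::size_t i = 0; i < xa.size(); ++i) {
        const Block3 a = expand(xa[i], ya[i]);  // two reads, three blocks
        const Block3 b = expand(xb[i], yb[i]);
        for (int r = 0; r < 3; ++r) {
            for (int c = 0; c < 3; ++c) {
                table[r][c] += std::bitset<64>(a.g[r] & b.g[c]).count();
            }
        }
    }
    return table;
}
```

Applied to the Table 5 vectors (MA: x = 13, y = 10; MB: x = 15, y = 28, with I1 at bit 0), the result reproduces the non-missing cells of Table 2.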

Both of the above algorithms can be further improved by incorporating additional information. For example, the algorithm for building a contingency table can be simplified if marginal information for both variants is available. The contingency table algorithm can make use of the variants' frequency tables and reduce having to compute 9 Hamming weight values to only 4. The remaining values can be easily computed by subtracting the row and column sums from their respective marginal information values. This reduction offers significant computational savings, especially when performing exhaustive epistasis analysis.
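
The marginal shortcut can be sketched as follows. The function complete_table and its parameters are hypothetical, and the sketch makes one loud assumption: neither marker has missing (NN) genotypes among the counted individuals, so every row and column of the 3x3 table sums exactly to the corresponding frequency table entry. Handling missing data would need extra bookkeeping that is not shown here.

```cpp
#include <array>
#include <cstddef>

using Table3x3 = std::array<std::array<std::size_t, 3>, 3>;
using Marginals = std::array<std::size_t, 3>;  // counts of AA, Aa, aa

// Fill a 3x3 contingency table from four directly counted cells
// (n00 = AA/AA, n01 = AA/Aa, n10 = Aa/AA, n11 = Aa/Aa) and the per-marker
// frequency tables. Assumes no missing genotypes (see lead-in).
Table3x3 complete_table(std::size_t n00, std::size_t n01,
                        std::size_t n10, std::size_t n11,
                        const Marginals& row_totals,   // marker A
                        const Marginals& col_totals) { // marker B
    Table3x3 t{};
    t[0][0] = n00;
    t[0][1] = n01;
    t[1][0] = n10;
    t[1][1] = n11;
    t[0][2] = row_totals[0] - n00 - n01;              // AA/aa
    t[1][2] = row_totals[1] - n10 - n11;              // Aa/aa
    t[2][0] = col_totals[0] - n00 - n10;              // aa/AA
    t[2][1] = col_totals[1] - n01 - n11;              // aa/Aa
    t[2][2] = row_totals[2] - t[2][0] - t[2][1];      // aa/aa
    return t;
}
```

Only four AND-plus-popcount passes over the data are needed per pair; the other five cells cost a handful of integer subtractions.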

Benchmarking

We compared the performance of the 2-bit encoded data to the 3-bit encoded data. In particular, we measured the runtime for building frequency tables and contingency tables using both encoding schemes. The runtime of these algorithms is dependent upon the number of columns, or individuals, in each row. Therefore, we decided to hold the number of rows constant at 10,000 variants. We varied the number of columns between 1,000 and 50,000 individuals. We also tested a set with 150,000 individuals as an extreme scale experiment. The genotypes were simulated following the empirical allele frequency spectrum of Affymetrix array 6.0 SNPs of the CEU HapMap samples. Similarly, individuals were randomly classified as either a case or control.

Table 5. 2-bit encoding scheme

                    I1   I2   I3   I4   I5
  MA    AA OR aa     1    0    1    1    0
        Aa OR aa     0    1    0    1    0
  MB    AA OR aa     1    1    1    1    0
        Aa OR aa     0    0    1    1    1

Figure 1. Constructing a frequency table from 2-bit encoded genotypes.

  AA ← 0; Aa ← 0; aa ← 0
  for i = 0 to N − 1 do                // N is the number of blocks per bit vector
      x ← A[i]                         // A[i] is the (AA or aa) genotype bit vector
      y ← B[i]                         // B[i] is the (Aa or aa) genotype bit vector
      aa ← aa + popcount(x & y)
      Aa ← Aa + popcount(y)
      AA ← AA + popcount(x)
  end for
  AA ← AA − aa
  Aa ← Aa − aa

Three experiments were conducted. First, for each data set the runtime for building frequency tables for each of the variants was measured. Second, for each data set the runtime for building all contingency tables for an exhaustive pairwise epistasis test was measured. Third, each data set was run through our implementation of the BOOST [6] algorithm and the total runtime was recorded. The runtime of the BOOST [6] algorithm does not include the time to load the compressed data set into main memory. In each of these tests, the average runtime is calculated and presented.

All tests were conducted upon a desktop computer with a 3.2 GHz Intel Core i7-3930K, 32 GB of 1600 MHz DDR3 memory, with 64-bit Fedora 17. Time was measured down to the nanosecond using the clock_gettime() glibc function. We used GNU G++ compiler 4.7, and compiled using the standard "-O3" compiler optimization flag. The tests were performed using a 64-bit block size.
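
A timing harness built on clock_gettime() could be as small as the following sketch. The helper names now_ns and average_ns are hypothetical, and the choice of CLOCK_MONOTONIC is an assumption; the text above does not state which clock identifier was used.

```cpp
#include <time.h>   // clock_gettime, timespec (POSIX)

#include <cassert>
#include <cstdint>

// Current time in nanoseconds from a monotonic clock.
// CLOCK_MONOTONIC is an assumption of this sketch.
inline std::int64_t now_ns() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Average wall-clock time, in nanoseconds, of one call to work(),
// measured over the given number of repetitions.
template <typename Work>
double average_ns(Work work, int repetitions) {
    assert(repetitions > 0);
    const std::int64_t start = now_ns();
    for (int i = 0; i < repetitions; ++i) {
        work();
    }
    return static_cast<double>(now_ns() - start) / repetitions;
}
```

A caller would pass the table-building step as the callable, for example a lambda that builds one frequency table, and report the returned average.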

Results

The first experiment measured the runtime for building frequency tables. Initially, the 3-bit encoding scheme appeared to offer a consistent performance advantage over the 2-bit encoding. As the number of individuals increased, it took less time to construct the count table from 3-bit encoded data than from 2-bit encoded data (Figure 3). The average time to build a genotype count table for fewer than 10,000 individuals is less than 1 μs. For data sets greater than 10,000 individuals, there is some performance overhead that results from decoding the 2-bit vectors. Building frequency tables from the 3-bit encoded data proved to be 12–25% faster than when built from 2-bit encoded data. In the extreme scale data set there was a 5.00 μs difference in favor of the 3-bit scheme.

However, the second experiment offered different results. The second experiment measured the runtime for building contingency tables for all pairs of variants in the data sets. In this experiment, the 2-bit encoding scheme offered better performance. Similar to the first experiment, 10,000 individuals seemed to be the diverging point (Figure 4). At sizes greater than 10,000 individuals, the 2-bit encoding scheme offered a 1% performance improvement over the 3-bit scheme. With 150,000 individuals, this equates to about a 0.32 μs difference in average performance. The third experiment further confirms this performance gain (Table 6).

Figure 2. Constructing a contingency table from 2-bit encoded genotypes.

Figure 3. Average Case/Control frequency table construction using simulated data following Affy 6 SNPs of HapMap CEU individuals. [Plot: average build time for 10,000 variants following the Affy 6 genotype distribution; x-axis: Individuals; y-axis: Time (s); series: 2-bit encoding scheme, 3-bit encoding scheme.]

Discussion

This work focuses on ways to address frequency table building processes found in GWAS for two primary reasons. First, upstream steps, like the loading of data, in a general GWAS pipeline are performed relatively infrequently, and can be performed offline. For example, a data set can be transformed into an optimized format once, and in every repeat analysis of the data set the loading becomes a constant time step within the pipeline. Conversely, the building of these tables amounts to a frequently reoccurring step which is typically performed inline under varying conditions. Secondly, we viewed the table building process as a bottleneck for downstream analytical steps. Offering an approach which positively impacts the cost associated with this bottleneck is beneficial.

The results suggest that the use of a 2-bit encoding scheme for genotype data does offer several benefits over a 3-bit encoding scheme. The compact encoding scheme requires 33% less memory for representing the same data. Aside from freeing up system memory for other tasks, the memory savings can be beneficial for other reasons. For example, epistasis algorithms like BOOST [6] can be run on Graphic Processing Units. GPUs are separate devices on a computer which have their own physical memory, typically less than 6 GB, and require data to be copied to and from the device. The limited memory and data transfer issues both benefit from using a more compact data format.

Figure 4. Average Case/Control contingency table construction using simulated data following Affy 6 SNPs of HapMap CEU individuals. [Plot: average build time for 10,000 variants following the Affy 6 genotype distribution; x-axis: Individuals; y-axis: Time (s); series: 2-bit encoding scheme, 3-bit encoding scheme.]

Table 6. Epistasis runtime comparison

  Individuals      2-bit        3-bit     Speedup (%)
         1000     28.56 s      28.45 s        0.37
         5000     92.07 s      93.32 s       -1.33
        10000    173.12 s     177.46 s       -2.45
        25000    418.31 s     420.71 s       -0.57
        50000    810.71 s     820.26 s       -1.16
       150000   2408.05 s    2427.84 s       -0.81

Speedup is measured relative to the 3-bit runtime.

The 2-bit encoded genotypes have also been used by other software packages. PLINK [7], for example, uses a 2-bit encoding in the BED file format. BED files use a contiguous pairing of bits to express the genotype of an individual. Using bit pairs allows for more efficient individual genotype decoding as a result of the bits existing in the same bit-block. However, additional bit masking steps need to be applied to each block to effectively utilize popcount based methods for counting genotype occurrences within a block. As mentioned earlier, our implementation adopts a bit-vectored approach, whereby an individual's genotype is divided over two separate vectors. This is primarily done to reduce the number of masking steps. In either case, some form of genotype disambiguation is necessary.
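
To make the masking cost of a contiguous bit-pair layout concrete, the following C++ sketch counts how many 2-bit fields in a 64-bit block hold a given code. It is a generic illustration, not PLINK's implementation; the function count_pairs is hypothetical, and the mapping from codes 0 to 3 onto genotypes is deliberately left unspecified.

```cpp
#include <bitset>
#include <cstdint>

// Count the 2-bit fields of word (32 fields per 64-bit block) whose value
// equals code (0..3). Each field keeps its low bit at an even position.
inline unsigned count_pairs(std::uint64_t word, unsigned code) {
    const std::uint64_t kLowBits = 0x5555555555555555ULL;
    std::uint64_t lo = word & kLowBits;         // low bit of every field
    std::uint64_t hi = (word >> 1) & kLowBits;  // high bit, shifted down
    if ((code & 1u) == 0) lo = ~lo & kLowBits;  // want low bit clear
    if ((code & 2u) == 0) hi = ~hi & kLowBits;  // want high bit clear
    return static_cast<unsigned>(std::bitset<64>(lo & hi).count());
}
```

Every count needs a shift, two masks, and up to two inversions before the popcount, whereas the bit-vectored layout needs at most one AND. Unused padding fields are counted as code 0, so a caller would have to correct for them.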

There is an overhead associated with this decoding step, and it can be felt in certain algorithms. We measured approximately a 20% overhead when building frequency tables. While this is a significant overhead, the number of frequency tables is linear in the number of markers. Therefore, it is conceivable to build these tables once, and reuse them in downstream analytical steps as needed. As a result, this overhead is generally acceptable. Furthermore, the overhead is effectively hidden when building pairwise frequency tables.

The improvement in performance present when constructing pairwise frequency tables from 2-bit encoded genotypes stems from the reduced number of memory access steps. As shown in Algorithm 3, six genotype blocks are used in each step of the iteration. When 3-bit encoding is used, each of these blocks must be read from memory. Conversely, the 2-bit encoding only needs to read four blocks and computes the remaining two blocks.

A further general performance increase may be possible through the use of hardware implementations of popcount algorithms. As part of the Streaming SIMD Extensions (SSE) of the x86 microarchitecture there is a popcnt [10] instruction. Recent processor lines from both Intel and AMD offer this instruction in some form or another.
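
One hedged way to reach the hardware instruction from C++ is the GCC and Clang builtin __builtin_popcountll, which the compiler may lower to popcnt when the target supports it (for example when compiling with -mpopcnt or -march=native) and otherwise implements in software. The wrapper name hw_popcount below is hypothetical, and the std::bitset fallback is included only so the sketch stays portable to other compilers.

```cpp
#include <bitset>
#include <cstdint>

// Hamming weight of a 64-bit block using the compiler builtin when it is
// available, with a portable fallback otherwise.
inline unsigned hw_popcount(std::uint64_t block) {
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_popcountll(block));
#else
    return static_cast<unsigned>(std::bitset<64>(block).count());
#endif
}
```

Swapping a routine like this in for a table look-up removes the weight table from the cache footprint; whether it is faster in practice would need to be measured on the target machine.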

Figure 5. Average epistasis runtime using BOOST [6] algorithm. [Plot: average runtime for 10,000 variants; x-axis: Individuals; y-axis: Time (s); series: 2-bit encoding scheme, 3-bit encoding scheme.]

As we mentioned earlier, these succinct data structures are intended to impact the increasing scale of sample sets. The building of the frequency tables is a linear algorithm which is dependent upon the size of the sample set. By fixing the number of variants and varying the number of samples in a data set we show the linear increase of the epistasis algorithm runtime, as is indicated by Figure 5. Unfortunately, the runtime of brute force algorithms like BOOST [6] is dominated more by the number of variants being analyzed than the number of individuals being studied. A data set of 10,000 variants means that 5 × 10^7 unique contingency tables need to be built for a typical case-control study. Expanding that size to a million variants increases the contingency table count to 5 × 10^11. Other works have demonstrated parallel implementations that effectively address the variant scaling [9,11,12]. This work demonstrates a general way to further improve the performance of these algorithms.
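
The contingency table counts quoted above are the number of unordered variant pairs, n(n − 1)/2. A one-line C++ helper (the name pair_count is hypothetical) makes the quadratic growth easy to check:

```cpp
#include <cstdint>

// Number of unordered pairs among n variants: n * (n - 1) / 2.
// One contingency table is built per pair in an exhaustive scan.
constexpr std::uint64_t pair_count(std::uint64_t variants) {
    return variants * (variants - 1) / 2;
}
```

For 10,000 variants this gives 49,995,000 pairs (about 5 × 10^7), and for one million variants 499,999,500,000 pairs (about 5 × 10^11), consistent with the figures in the text.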

Conclusions

In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed data representations in Genome Wide Association Studies. We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures, and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory, and is slightly more efficient in some algorithms than the 3-bit encoding.

Availability and requirements

Project name: libgwaspp
Project home page: https://github.com/putnampp/libgwaspp
Operating system(s): Linux
Programming language: C++
Other requirements: CMake 2.8.9, GCC 4.7 or higher, Boost 1.51.0, ZLIB, GSL
License: FreeBSD

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PPP designed and implemented the software, conducted the experiments, and wrote the main manuscript. GZ provided domain specific expertise in GWA studies, and the empirical data from which the simulated data was generated. PW contributed extensive knowledge of computational architectures and data structures. Both also contributed greatly to the result analysis and editing of the manuscript. All authors read and approved the final manuscript.

Acknowledgements

This work was partially supported by the Pilot and Feasibility Program of the Perinatal Institute, Cincinnati Children's Hospital Medical Center.

Received: 25 June 2013  Accepted: 11 December 2013  Published: 21 December 2013

References

1. Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP: Computational solutions to large-scale data management and analysis. Nat Rev Genet 2010, 11(9):647–657. http://dx.doi.org/10.1038/nrg2857.
2. Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Ann Rev Genom Human Genet 2009, 10:387–406. http://www.annualreviews.org/doi/abs/10.1146/annurev.genom.9.081307.164242. [PMID: 19715440].
3. Whole-genome genotyping and copy number variation analysis. 2013. http://www.illumina.com/applications/detail/snp_genotyping_and_cnv_analysis/whole_genome_genotyping_and_copy_number_variation_analysis.ilmn. [Online; accessed 9-January-2013]
4. A map of human genome variation from population-scale sequencing. Nature 2010, 467(7319):1061–1073. http://dx.doi.org/10.1038/nature09534.
5. Nielsen J, Mailund T: SNPFile - A software library and file format for large scale association mapping and population genetics studies. BMC Bioinformatics 2008, 9:526. http://www.biomedcentral.com/1471-2105/9/526.
6. Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W: BOOST: a fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am J Human Genet 2010, 87(3):325–340. http://linkinghub.elsevier.com/retrieve/pii/S0002929710003782.
7. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC: PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Human Genet 2007, 81(3):559–575. http://pngu.mgh.harvard.edu/purcell/plink/.
8. Jacobson G: Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, SFCS '89. Washington: IEEE Comput Soc; 1989:549–554. http://dx.doi.org/10.1109/SFCS.1989.63533.
9. Gyenesei A, Moody J, Laiho A, Semple CA, Haley CS, Wei WH: BiForce Toolbox: powerful high-throughput computational analysis of gene-gene interactions in genome-wide association studies. Nucleic Acids Res 2012, 40(W1):W628–W632. http://nar.oxfordjournals.org/content/40/W1/W628.abstract.
10. Intel: Intel SSE4 Programming Reference; 2007. http://home.ustc.edu.cn/~shengjie/REFERENCE/sse4_instruction_set.pdf.
11. Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for detecting gene-gene interactions in genome-wide case control studies. Bioinformatics 2011, 27(9):1309–1310. http://bioinformatics.oxfordjournals.org/content/27/9/1309.abstract.
12. Schüpbach T, Xenarios I, Bergmann S, Kapur K: FastEpistasis: a high performance computing solution for quantitative trait epistasis. Bioinformatics 2010, 26(11):1468–1469. http://bioinformatics.oxfordjournals.org/content/26/11/1468.abstract.

doi:10.1186/1471-2105-14-369
Cite this article as: Putnam et al.: A comparison study of succinct data structures for use in GWAS. BMC Bioinformatics 2013 14:369.