Achieving Numerical Accuracy and High Performance using Recursive Tile LU Factorization

Jack Dongarra (1), Mathieu Faverge (1), Hatem Ltaief (2), and Piotr Luszczek (1)

(1) Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville
(2) KAUST Supercomputing Laboratory, Thuwal, Saudi Arabia

{dongarra, faverge, luszczek}@eecs.utk.edu
Hatem.Ltaief@kaust.edu.sa

Abstract. The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering, and is characteristic of many dense linear algebra computations. It has even become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. In this context, the challenge in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance and maintaining the accuracy of the numerical algorithm. This paper proposes a novel approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance, but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. In particular, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to 3-fold faster than LAPACK with multithreaded Intel MKL BLAS.
Keywords: recursion; LU factorization; parallel linear algebra; shared-memory synchronization; threaded parallelism

1 Introduction

The multicore era has forced the scientific software community to reshape their state-of-the-art numerical libraries to be able to address the massive parallelism as well as the memory hierarchy design brought by this architecture. Indeed, LAPACK [1] has shown some significant limitations on such platforms and can only achieve a small portion of the theoretical peak performance [2]. The reasons for this are mainly threefold: (1) the overhead of its fork-join model of parallelism, (2) the coarse-grained task granularity and (3) the memory-bound nature of the panel factorization. The PLASMA library [3,4] initiated, with other software packages like FLAME [5], this effort of redesigning standard numerical algorithms to match the hardware requirements of multicore architectures. Successful high performance results have already been reported for one-sided factorizations (e.g., QR/LQ, LU and Cholesky factorizations) and more recently, for the tridiagonal reduction needed to solve the symmetric eigenvalue problems [6]. Based on tile data layout, which consists of splitting the matrix into small square regions of data contiguous in memory, PLASMA has alleviated the bottlenecks (1) and (2) from LAPACK by rather bringing the parallelism to the fore, minimizing the synchronization overhead, and relying on dynamic scheduling of fine-grained tasks. However, the panel factorization phase has not really been improved for the one-sided factorizations. It is still rich in memory-bound operations and runs sequentially. The performance impact of the sequential panel for one-sided factorizations is somewhat minimal though and mostly hidden by the large amount of fine-grained parallelism introduced in the update of the trailing submatrix. However, the performance gain comes at the price of numerical issues, particularly for the LU factorization. Indeed, the numerical accuracy of the solution has been deteriorated due to the necessary replacement of the standard partial pivoting scheme in the panel factorization by an incremental pivoting strategy [7]. The number of pivoted elements dramatically increases, which may eventually trigger a considerable growth factor and can potentially make the whole numerical scheme unstable [8-10].
This paper presents a novel approach for computing the LU factorization on multicore architectures, which not only improves the overall performance compared to LAPACK and PLASMA, but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. The originality of this work resides in the improvement of the panel factorization with partial pivoting. Involving mostly Level 2 BLAS operations, the parallelization of the panel is very challenging because of the low ratio between the amount of transferred data from memory and the actual computation. The atomicity of selecting the appropriate pivots is yet another issue, which has prevented efficient parallel implementation of the panel. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step while the update of the trailing submatrix follows the tile algorithm principles. The fine-grained computation occurs at the level of the small caches associated with the cores, which may potentially engender super-linear speedups. The recursive formulation of the panel allows one to take advantage of the different memory hierarchy levels and to cast memory-bound kernels into Level 3 BLAS operations to increase the computational rate even further. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the parallel computation flow naturally without contention and reduces synchronization. The dynamic runtime system called QUARK [11,12] (also used in PLASMA) is then able to schedule sequential tasks (from the update of the trailing submatrix) and parallel tasks (from the panel factorization) with heterogeneous granularities. The execution flow can then be depicted by a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the dependencies between them. The DAG is actually never built entirely since it would obviously not fit in the main memory for large matrix sizes. As the computation progresses, the DAG is unrolled just enough to initiate lookahead between subsequent steps of the factorization. Only the tasks located within a particular window are therefore instantiated. This window size may be tuned for maximum performance. Moreover, QUARK can transparently integrate algorithmic lookahead in order to overlap successive computational steps, and to keep all processing units busy during the execution time as much as possible.
The remainder of this paper is organized as follows. Section 2 gives an overview of similar projects in this area. Section 3 recalls the algorithmic aspects and the pivoting schemes of the existing block LU (LAPACK) and tile LU (PLASMA) factorizations. Section 4 describes our new approach to compute the tile LU factorization with partial pivoting using a parallel recursive panel. Section 5 presents some implementation details. Section 6 presents performance results of the overall algorithm. Also, comparison tests are run on shared-memory architectures against the corresponding routines from LAPACK [1] and the vendor library Intel MKL version 10.3 [13]. Finally, Section 7 summarizes the results of this paper and describes the ongoing work.
2 Related Work

This Section presents previous similar works in implementing a recursive and/or parallel panel of the LU factorization.

Recursive formulation of a one-sided factorization (QR) and its parallel implementation has been done on a shared memory machine [14]. Three differences stand out. First, the authors used a sequential panel factorization. This led to the second difference: lack of nested parallelism that we employ. And thirdly, master-worker parallelization was used instead of dynamic DAG scheduling.

Georgiev and Wasniewski [15] presented a recursive version of the LU decomposition. They implemented recursive versions of the main LAPACK and BLAS kernels involved in the factorization, i.e., xGETRF and xGEMM, xTRSM, respectively. Their original code is in Fortran 90 and they relied on the compiler technology to achieve the desired recursion.

Recursion was also successfully used in the context of sparse matrix LU factorization [16]. It lacked pivoting code, which is essential to ensure numerical stability of our implementation. In addition, here, we focus on dense matrices only - not the sparse ones.

A distributed memory version of the LU factorization has been attempted and compared against ScaLAPACK's implementation [17]. One problem cited by the authors was excessive, albeit provably optimal, communication requirements inherent in the algorithm. This is not an issue in our implementation because we focus exclusively on the shared memory environment. Similarly, our open source implementation of the High Performance LINPACK benchmark [18] uses recursive panel factorization on local data, thus avoiding the excessive communication cost.

More recently, panel factorization has been successfully parallelized and incorporated into a general LU factorization code [19] using Level 1 BLAS; this is a flat parallelism model with fork-join execution (closely related to Bulk Synchronous Processing). The authors refer to their approach as Parallel Cache Assignment (PCA). Our work differs in a few key aspects. We employ recursive formulation [20] and therefore are able to use Level 3 BLAS as opposed to just Level 1 BLAS. Another important difference is the nested parallelism with which we have the flexibility to allocate only a small set of cores for the panel work while other cores carry on with the remaining tasks such as the Schur complement updates. Finally, we use dynamic scheduling that executes fine-grained tasks asynchronously, which is drastically different from a fork-join parallelism. A more detailed account of the differences is given in Section 4.

Last but not least, Chan et al. [21] implemented the classical LU factorization with partial pivoting (within the FLAME framework), in which the authors basically separate the runtime environment from the programmability issues (i.e., the generation of the corresponding DAG). There are mainly two differences with the work presented in this paper: (1) their lookahead opportunities are determined by sorting the enqueued tasks in a separate stage called an analyzer phase, while in our case, the lookahead occurs naturally at runtime during the process of pursuing the critical path of the DAG (and can also be strictly enforced by using priority levels), and (2) we do not require a copy of the panel, called a macroblock, in standard column-major layout in order to determine the pivoting sequence, but we have rather implemented an optimized parallel memory-aware kernel, which performs an in-place LU panel factorization with partial pivoting. Both of these lead to high performance.
3 The Block and Tile LU Factorizations

This Section describes the block and tile LU factorizations as implemented in the LAPACK and PLASMA libraries, respectively.

3.1 The Block LU from LAPACK

Block algorithms in LAPACK [1] surfaced with the emergence of cache-based architectures. They are characterized by a sequence of panel-update computational phases. The panel phase calculates all transformations using mostly memory-bound operations and applies them as a block to the trailing submatrix during the update phase. This panel-update sequence introduces unnecessary synchronization points and lookahead is prevented, while it can be conceptually achieved. Moreover, the parallelism in the block algorithms implemented in LAPACK resides in the BLAS library, which follows the fork-join paradigm. In particular, the block LU factorization is no exception and the atomicity of the pivot selection has further exacerbated the problem of the lack of parallelism and the synchronization overhead. At the same time, the LU factorization is numerically stable in practice, and produces a reasonable growth factor. Last but not least, the LAPACK library also uses the standard column-major layout from Fortran, which may not be appropriate in the current and next generation of multicore architectures.
3.2 The Tile LU from PLASMA

Tile algorithms implemented in PLASMA [3] propose to take advantage of the small caches associated with the multicore architecture. The general idea is to arrange the original dense matrix into small square regions of data which are contiguous in memory. This is done to allow efficiency by allowing the tiles to fit into the core's caches. Figure 1 shows how the translation proceeds from column-major to tile data layout. Breaking the matrix into tiles may require a redesign of the standard numerical linear algebra algorithms. Furthermore, tile algorithms allow parallelism to be brought to the fore and expose sequential computational fine-grained tasks to benefit from any dynamic runtime system environments, which will eventually schedule the different tasks across the processing units. The actual framework boils down to scheduling a directed acyclic graph (DAG), where tasks represent nodes and edges define the data dependencies between them. This may produce an out-of-order execution and therefore, permits the removal of the unnecessary synchronization points between the panel and update phases noticed in the LAPACK algorithms. Lookahead opportunities also become practical and engender a tremendous amount of concurrent tasks.

Unlike the Cholesky factorization, the original QR and LU factorizations had to be redesigned to work on top of a tile data layout. The tile QR factorization is based on orthogonal transformations and therefore, it did not numerically suffer from the necessary redesign. However, the tile LU factorization has seen its pivoting scheme completely revised. The partial pivoting strategy has been replaced by the incremental pivoting. It consists of performing pivoting in the panel computation between two tiles on top of each other, and this mechanism is reproduced further down the panel in a pairwise fashion. And obviously, this pivoting scheme may considerably deteriorate the overall stability of the LU factorization [8-10]. As a matter of fact, the goal of our new LU implementation is to achieve high performance, comparable to PLASMA, while sustaining numerical stability of the standard LU implementation in LAPACK.

4 Parallel Recursive LU Factorization of a Panel

This Section describes one of our most unique contributions, which is the parallelization of the LU factorization of a matrix panel using the recursive algorithm [20].
Fig. 1. Translation from LAPACK layout (column-major) to tile data layout.

function xGETRFR(M, N, column)
{
  if N == 1 {                          single column, recursion stops
    idx  = split_IxAMAX(...)           compute local maximum of modulus
    gidx = combine_IxAMAX(idx)         combine local results
    split_xSCAL(...)                   scale local data
  } else {
    xGETRFR(M, N/2, column)            recursive call to factor left half
    xLASWP(...)                        pivoting forward
    split_xTRSM(...)                   triangular solve
    split_xGEMM(...)                   Schur's complement
    xGETRFR(M, N-N/2, column+N/2)      recursive call to factor right half
    xLASWP(...)                        pivoting backward
  }
}

Fig. 2. Pseudo-code for the recursive panel factorization on column-major layout.
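The recursion of Fig. 2 can be sketched sequentially in NumPy. The split_* kernels of the paper are parallel over row blocks; the sketch below replaces them with whole-column operations, and the function and variable names are ours, not the PLASMA API. A numerically nonsingular panel is assumed.

```python
import numpy as np

def xgetrfr(A, piv, column=0, n=None):
    """Sequential sketch of Fig. 2: recursive LU panel factorization with
    partial pivoting. A is an M-by-NB panel factored in place; piv[j] is
    the row swapped with row j when column j was eliminated."""
    if n is None:
        n = A.shape[1] - column
    if n == 1:  # single column, recursion stops
        p = column + int(np.argmax(np.abs(A[column:, column])))  # IxAMAX
        piv[column] = p
        A[[column, p], column] = A[[p, column], column]          # pivot swap
        A[column + 1:, column] /= A[column, column]              # xSCAL
        return
    n1 = n // 2
    xgetrfr(A, piv, column, n1)                # factor left half
    for j in range(column, column + n1):       # xLASWP: pivoting forward
        A[[j, piv[j]], column + n1:column + n] = \
            A[[piv[j], j], column + n1:column + n]
    L11 = np.tril(A[column:column + n1, column:column + n1], -1) + np.eye(n1)
    A[column:column + n1, column + n1:column + n] = np.linalg.solve(
        L11, A[column:column + n1, column + n1:column + n])      # xTRSM
    A[column + n1:, column + n1:column + n] -= (
        A[column + n1:, column:column + n1]
        @ A[column:column + n1, column + n1:column + n])         # xGEMM
    xgetrfr(A, piv, column + n1, n - n1)       # factor right half
    for j in range(column + n1, column + n):   # xLASWP: pivoting backward
        A[[j, piv[j]], column:column + n1] = A[[piv[j], j], column:column + n1]
```

After the call, the strict lower part of A holds L (unit diagonal implied) and the upper part holds U, with P A = L U for the permutation recorded in piv.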
4.1 Recursive Algorithm and Its Advantages

Figure 2 shows a pseudo-code of our recursive implementation. Even though the panel factorization is a lower order term - O(N^2) - from the computational complexity perspective [22], it still poses a problem in the parallel setting from the theoretical [23] and practical standpoints [19]. To be more precise, the combined panel factorizations' complexity for the entire matrix is O(N^2 NB), where N is panel height (and matrix dimension) and NB is panel width. For good performance of BLAS calls, panel width is commonly increased. This creates tension if the panel is a sequential operation because a larger panel width results in a larger Amdahl's fraction [24]. Our own experiments revealed this to be a major obstacle to proper scalability of our implementation of tile LU factorization with partial pivoting - a result consistent with related efforts [19].
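The tension between panel width and Amdahl's fraction can be made concrete with Amdahl's law; the numbers below are illustrative, not measurements from this paper.

```python
def amdahl_speedup(serial_fraction, p):
    """Amdahl's law: attainable speedup on p cores when serial_fraction
    of the work (e.g., a sequential panel factorization) cannot be
    parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# A panel that accounts for only 5% of the runtime already caps a
# 48-core machine at roughly a 14x speedup; widening the panel makes
# the sequential fraction, and hence the cap, worse.
```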
Aside from gaining a high level formulation free of low level tuning parameters, the recursive formulation permits us to dispense with a higher level tuning parameter commonly called algorithmic blocking. There is already panel width - a tunable value used for merging multiple panel columns together. Non-recursive panel factorizations could potentially establish another level of tuning called inner-blocking [2,4]. This is avoided in our implementation.
4.2 Data Partitioning

The challenging part of the parallelization is the fact that the recursive formulation suffers from inherent sequential control flow that is characteristic of the column-oriented implementation employed by LAPACK and ScaLAPACK. As a first step then, we apply a 1D partitioning technique that has proven successful before [19]. We employed this technique for the recursion-stopping case: single column factorization. The recursive formulation of the LU algorithm poses another problem, namely the use of the Level 3 BLAS call for triangular solve - xTRSM() - and LAPACK's auxiliary routine for swapping named xLASWP(). Both of these calls do not readily lend themselves to the 1D partitioning scheme due to two main reasons:

1. each call to these functions occurs with a variable matrix size, and
2. 1D partitioning makes the calls dependent upon each other thus creating synchronization overhead.

Fig. 3. Fixed partitioning scheme used in the parallel recursive panel factorization.

The latter problem is fairly easy to see as the pivoting requires data accesses across the entire column and memory locations may be considered random. Each pivot element swap would then require coordination between the threads that the column is partitioned amongst. The former issue is more subtle in that the overlapping regions of the matrix create a memory hazard that may be at times masked by the synchronization effects occurring in other portions of the factorization. To deal with both issues at once, we chose to use 1D partitioning across the rows and not across the columns as before. This removes the need for extra synchronization and affords us parallel execution, albeit a limited one due to the narrow size of the panel.
The Schur complement update is commonly implemented by a call to the Level 3 BLAS kernel xGEMM() and this is also a new function that is not present within the panel factorizations from LAPACK and ScaLAPACK. Parallelizing this call is much easier than all the other new components of our panel factorization. We chose to reuse the across-columns 1D partitioning to simplify the management of overlapping memory references and to again reduce resulting synchronization points.

To summarize the observations that we made throughout the preceding text, we consider data partitioning among the threads to be of paramount importance. Unlike the PCA method [19], we do not perform extra data copy to eliminate memory effects that are detrimental to performance such as TLB misses, false sharing, etc. By choosing the recursive formulation, we rely instead on Level 3 BLAS to perform these optimizations for us. Not surprisingly, this was also the goal of the original recursive algorithm and its sequential implementation [20]. What is left to do for our code is the introduction of parallelism that is commonly missing from Level 3 BLAS when narrow rectangular matrices are involved. Instead of low level memory optimizations, we turned our focus towards avoiding synchronization points and letting the computation proceed asynchronously and independently as long as possible until it is absolutely necessary to perform communication between threads. One design decision that stands out in this respect is the fixed partitioning scheme. Regardless of the current column height (within the panel being factored), we always assign the same amount of rows to each thread except for the first thread. Figure 3 shows that this causes a load imbalance as thread number 0 has progressively smaller amounts of work to perform as the panel factorization progresses from the first to the last column. This is counter-balanced by the fact that the panels are relatively tall compared to the number of threads and the first thread usually has greater responsibility in handling pivot bookkeeping and synchronization tasks.
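The fixed scheme can be sketched as a one-off computation of row ranges. The function name and the choice of giving thread 0 the leftover rows are our illustration of Figure 3, not code from the library.

```python
def fixed_partition(m, nthreads):
    """Assign each thread the same fixed number of rows for the whole
    panel, with thread 0 also absorbing the remainder at the top. Since
    factored rows accumulate at the top, only thread 0's active share
    shrinks as the factorization proceeds (the imbalance of Figure 3)."""
    chunk = m // nthreads
    first = m - (nthreads - 1) * chunk  # thread 0: chunk + remainder rows
    bounds = [(0, first)]
    for t in range(1, nthreads):
        start = first + (t - 1) * chunk
        bounds.append((start, start + chunk))
    return bounds
```

Because the ranges never change between columns, no repartitioning or synchronization is needed as the active panel height shrinks.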
The tile data layout specific to PLASMA algorithms does not allow such partitioning. In this case, the partition then follows the tile structure and each thread handles a fixed number of tiles. The second problem due to this data storage is the number of cache misses generated by the algorithm described previously. If each stage is performed one after another (scale the column, compute the Schur complement and search the pivot), they will each require a loop over the tiles owned by the current thread, making it three loops over all the tiles of the panel. All the benefit from this data storage is then lost in memory load. The solution is to reorder the three stages to apply them in one shot to each tile as described in Figure 4.

function xGETRFR(M, N, column)
{
  xGETRFR(M, N/2, column)              recursive call to factor left half
  xLASWP(...)                          pivoting forward
  xTRSM(...)                           triangular solve on the first tile
  foreach local tile {
    xSCAL(N/2-1)                       scale the last column of the left part
    xGEMM(...)                         Schur's complement
    idx = IxAMAX(N/2)                  update local maximum of the first
                                       column in the right part
  }
  gidx = combine_IxAMAX(idx)           combine local results
  xGETRFR(M, N-N/2, column+N/2)        recursive call to factor right half
  xLASWP(...)                          pivoting backward
}

Fig. 4. Pseudo-code for the recursive panel factorization on tile layout.
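The loop fusion of Figure 4 can be illustrated on a toy column step. The tiles, names, and the two-column view (current L column plus the next pivot column) are our simplification, not the PLASMA kernels.

```python
import numpy as np

def fused_column_step(tiles, pivot, u):
    """One pass over a thread's tiles fusing the three stages: scale the
    current column (xSCAL), apply the rank-1 Schur update to the next
    column (a degenerate xGEMM), and search it for the local pivot
    (IxAMAX). Each tile is an r-by-2 block: [current col, next col]."""
    best_val, best_loc = -1.0, None
    for t, T in enumerate(tiles):
        T[:, 0] /= pivot                        # xSCAL
        T[:, 1] -= T[:, 0] * u                  # xGEMM (rank-1 update)
        i = int(np.argmax(np.abs(T[:, 1])))     # IxAMAX
        if abs(T[i, 1]) > best_val:
            best_val, best_loc = abs(T[i, 1]), (t, i)
    return best_loc                             # local result to combine
```

Visiting each tile once keeps it in cache across all three stages, whereas three separate loops stream the whole panel through memory three times.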
4.3 Scalability Results of the Parallel Recursive Panel Kernel

Figures 5 and 6 show a scalability study using column-major layout (used in LAPACK) and tile layout (used in PLASMA), respectively, on the four-socket, twelve-core NUMA Opteron-48 machine (see Section 6.1 for detailed hardware specifications) of our parallel recursive panel LU factorization with four different panel widths: 32, 64, 128, and 256, against equivalent routines from LAPACK. The idea is to highlight and to understand the impact of the data layout on our recursive parallel panel algorithm. We limit our parallelism level to 12 cores (one socket) because our main factorization needs the remaining cores for updating the trailing submatrix.

First of all, the parallel panel implementation based on column-major layout achieves the best performance compared to the tile layout. Indeed, as shown in Figure 3 with column-major layout, thread 0 loads into the local cache memory not only its data partition but also the remaining data from the other thread partitions, which obviously contributes in preventing invalid cache lines. In contrast, the recursive parallel panel LU algorithm with tile layout may engender high bus traffic since each thread needs to acquire its own data partition independently from each other (latency overhead). Secondly, when compared with the panel factorization routine xGETF2() (mostly Level 2 BLAS), we achieve super-linear speedup for a wide range of panel heights with the maximum achieved efficiency exceeding 550%. In an arguably more relevant comparison against the xGETRF() routine, which could be implemented with mostly Level 3 BLAS, we achieve perfect scaling for 2 and 4 threads and easily exceed 50% efficiency for 8 and 16 threads. This is consistent with the results presented in the related work section [19].
4.4 Further Implementation Details and Optimization Techniques

We exclusively use lockless data structures [25] throughout our code. This choice was dictated by fine granularity synchronization, which occurs during the pivot selection for every column of the panel and at the branching points of the recursion tree. Synchronization using mutex locks was deemed inappropriate at such frequency as it has a potential of incurring system call overhead.

Together with lockless synchronization, we use busy waiting on shared-memory locations to exchange information between threads using a coherency protocol of the memory subsystem. While fast in practice [19], this causes extraneous traffic on the shared-memory interconnect, which we aim to avoid. We do so by changing busy waiting for computations on independent data items. Invariably, this leads to reaching the parallel granularity levels that are most likely hampered by spurious memory coherency traffic due to false sharing. Regardless of the drawback, we feel this is a satisfactory solution as we are motivated by avoiding busy waiting, which creates even greater demand for inter-core bandwidth because it has no useful work to interleave with the shared-memory polling. We refer to this optimization technique as delayed waiting.
Fig. 5. Scalability study of the recursive parallel panel factorization in double precision on LAPACK layout with various panel widths: 32 (top-left), 64 (top-right), 128 (bottom-left), and 256 (bottom-right).
Another technique we use to optimize the inter-core communication is what we call synchronization coalescing. The essence of this method is to conceptually group unrelated pieces of code that require a synchronization into a single aggregate that synchronizes once. The prime candidate for this optimization is the search for, and the write of, the pivot index. Both of these operations require a synchronization point. The former needs a parallel reduction operation while the latter requires a global barrier. Neither of these are ever considered to be related to each other in the context of a sequential implementation. But with our synchronization coalescing technique, they are deemed related in the communication realm and, consequently, we implemented them in our code as a single operation.

Finally, we introduced a synchronization avoidance paradigm whereby we opt for multiple writes to shared memory locations instead of introducing a memory fence (and potentially a global thread barrier) to ensure global data consistency. Multiple writes are usually considered a hazard and are not guaranteed to occur in a specific order in most of the consistency models for shared memory systems. We completely sidestep this issue, however, as we guarantee algorithmically that each thread writes exactly the same value to memory. Clearly, this may seem an unnecessary overhead in general, but in our tightly coupled parallel implementation this is a worthy alternative to either explicit (via inter-core messaging) or implicit (via memory coherency protocol) synchronization. In short, this technique is another addition to our contention-free design.

Portability, and more precisely, performance portability, was also an important goal in our overall design. In our lock-free synchronization, we heavily rely on shared-memory consistency - a problematic feature from the portability standpoint. To address this issue reliably, we make two basic assumptions about the shared-memory hardware and the software tools. Both of these, to our best knowledge, are satisfied on the majority of modern computing platforms. From the hardware perspective, we assume that memory coherency occurs at the cache line granularity. This allows us to rely on global visibility of loads and stores to nearby memory locations. What we need from the compiler tool-chain is an appropriate handling of C's volatile keyword. This, combined with the use of primitive data types that are guaranteed to be contained within a single cache line, is sufficient in preventing unintended shared-memory side effects.
Fig. 6. Scalability study of the recursive parallel panel factorization in double precision on tile layout with various panel widths: 32 (top-left), 64 (top-right), 128 (bottom-left), and 256 (bottom-right).
5 Dynamic Scheduling and Lookahead

This Section provides further details about the dynamic scheduling of the recursive tile LU factorization along with the runtime environment system employed to schedule the various heterogeneous computational tasks.

5.1 Discussion of Implementation Variants

Our implementation is qualified as a tile algorithm because of the way it accesses the matrix data and not due to the way the matrix is stored in memory. In fact, our algorithm works equally well on matrices stored using either column-major or tile data layout.

Additionally, our code has been originally formulated while having in mind the right-looking variant of LU factorization [26] as it makes it easier to take advantage of the available parallelism. This variant was also chosen for LAPACK [1] and ScaLAPACK [27]. However, the execution flow of our code is driven by the data dependencies that are communicated to the QUARK runtime system. This may result in an asynchronous out-of-order scheduling. The dynamic runtime environment ensures that enough parallelism is available through the entire execution (right looking), while advancing the critical path for lookahead purposes (left looking). Therefore, the strict right-looking variant available in LAPACK [1] and ScaLAPACK [27] cannot be guaranteed anymore. The asynchronous nature of the DAG execution provides sufficient lookahead opportunities for many algorithmic variants to coexist with each other regardless of the visitation order of the DAG [28].
5.2 Snapshot of the Parallel Recursive Tile LU Factorization

Figure 7 shows the initial factorization steps of a matrix subdivided into 9 tiles (a 3-by-3 grid of tiles). The first step is a recursive parallel factorization of the first panel consisting of the three leftmost tiles. Only when this finishes may the other tasks start executing, which creates an implicit synchronization point. To avoid the negative impact on parallelism, we execute this step on multiple cores (see Section 4 for further details) to minimize the running time. However, we use a nested parallelism model as most of the tasks are handled by a single core and only the panel tasks are assigned to more than one core. Unlike similar implementations [19], we do not use all cores to handle the panel. There are two main reasons for this decision. First, we use dynamic scheduling that enables us to hide the negative influence of the panel factorization behind more efficient work performed by concurrent tasks. And second, we have clearly observed the effect of diminishing returns when using too many cores for the panel. Consequently, we do not use them all and instead we keep the remaining cores busy with other critical tasks.

Fig. 7. Execution breakdown for recursive tile LU factorization: factorization of the first panel using the parallel kernel is followed by the corresponding updates to the trailing submatrix.

The next step is pivoting to the right of the panel that has just been factorized. We combine in this step the triangular update (xTRSM in the BLAS parlance) because there is no advantage of scheduling them separately due to cache locality considerations. Just as the panel factorization locks the panel and has a potential to temporarily stall the computation, the pivot interchange has a similar effect. This is indicated by a rectangular outline encompassing the tile updated by xTRSM and the tiles below it. Even though so many tiles are locked by the triangular update, there is still a potential for parallelism because pivot swaps and the triangular update itself for a single column are independent of other columns. We can then easily split the operations along the tile boundaries and schedule them as independent tasks. This observation is depicted in Figure 7 by showing two xTRSM updates for two adjacent tiles in the topmost row of tiles instead of one update for both tiles at once.

The last step shown in Figure 7 is an update based on the Schur complement. It is the most computationally intensive operation in the LU factorization and is commonly implemented with a call to a Level 3 BLAS kernel called xGEMM. Instead of a single call that performs the whole update of the trailing submatrix, we use multiple invocations of the routine because we use a tile-based algorithm. In addition to exposing more parallelism and the ability to alleviate the influence of the algorithm's synchronization points (such as the panel factorization), by splitting the Schur update operation we are able to obtain better performance than a single call to a parallelized vendor library [2].

One thing not shown in Figure 7 is pivoting to-the-left because it does not occur in the beginning of the factorization. It is necessary for the second and subsequent panels. The swaps originating from different panels have to be ordered correctly but are independent for each column, which is the basis for running them in parallel. The only inhibitor of parallelism then is the fact that the swapping operations are inherently memory-bound because they do not involve any computation. On the other hand, the memory accesses are done with a single level of indirection, which makes them very irregular in practice. Producing such memory traffic from a single core might not take advantage of the main memory's ability to handle multiple outstanding data requests and the parallelism afforded by NUMA hardware. It is also noteworthy to mention that the tasks performing the pivoting behind the panels are not located on the critical path, and therefore, are not essential for the remaining computational steps in the sense that they could potentially be delayed toward the end of the factorization (see the tracing figures in Section 6.4). This is also highlighted in Figure 8, which draws the DAG of the parallel recursive tile LU factorization of a 3-by-3 tile matrix. The nodes marked as xLASWP are end nodes and do not directly participate in the completion of the factorization.

Fig. 8. Annotated DAG for parallel recursive tile LU factorization of a 3-by-3 tile matrix. The annotations are indicated with triangles.
5.3 QUARK: Parallel Runtime System for Dynamic Scheduling

Our approach to extracting parallelism is based on the principle of separation of concerns [29,30]. We define high performance computational kernels and the effect on their parameters, and submit them to the QUARK [11,12] scheduler for parallel execution as dependent tasks. Due to the data dependences between the tasks, the amount of available parallelism is limited and may be increased by decreasing the computational load of each task, which results in an increase in the total number of tasks. The optimal schedule for the tasks is the one with the shortest height of the spanning tree of the DAG. But QUARK does not seek to attain the optimal schedule; it rather uses a localized heuristic that works very well in practice [2,6,31,32]. The heuristic is based on generative exploration of the DAG that caps the number of outstanding tasks with a tunable parameter called task window size.
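A toy model can make the bounded task window concrete; the function and its failure guard are our sketch, not QUARK's interface. As in the factorization, tasks are inserted in program order, so dependencies point only to earlier tasks.

```python
from collections import deque

def run_with_window(tasks, deps, window):
    """Execute callables whose dependencies are given as index lists,
    instantiating at most `window` tasks at a time (the DAG is unrolled
    lazily). A task runs once all of its dependencies have completed;
    returns the execution order."""
    done, order, live = set(), [], []
    pending = deque(range(len(tasks)))
    while pending or live:
        while pending and len(live) < window:   # unroll the DAG lazily
            live.append(pending.popleft())
        ready = [t for t in live if all(d in done for d in deps[t])]
        if not ready:
            raise RuntimeError("window too small for this dependency graph")
        for t in ready:
            tasks[t]()
            done.add(t)
            order.append(t)
            live.remove(t)
    return order
```

Keeping the window small bounds the memory held by task descriptors while still exposing enough lookahead to overlap successive factorization steps.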
11ToexploremoreadvancedfeaturesofQUARKweturntoFigure8whichshowsanannotatedDAGoftasksthatresultsfromexecutingourLUfactorizationonamatrixwith3-by-3tiles.
OnefeaturethatwebelievemakesQUARKstandoutisavailabilityofnestedparallelismwithoutanyconstraintsonthenumberofthreadsexecutingwithinaparalleltask.
Thetasksthatusethesefeatures(andthusareparalleltasks)arethenodesmarkedasxGETRF-REC().
Eachofthesetasksmayuseavariablenumberofthreadstoexecute,andthisisdeterminedatruntimeasthepanelheightdecreaseswiththeprogressofthefactorization.
AnotherfeatureofQUARKthatweuseinourcodeistheabilitytoassignprioritiestotasks.
Forourparticularsituationweonlyusetwopriorities:criticalandnon-critical.
TheformerisreservedforthepanelfactorizationandismarkedwithatriangleinFigure8.
Theformerisusedfortheremainingtasks.
ThischoicewasmadebecausethexGETRF-REC()tasksareonthecriticalpathandcannotbeoverlappedwithothertasksinanoptimallyscheduledDAG.
Eventhoughinpracticethescheduleisnotoptimalduetoaxednumberofcoresandtheschedulingheuristic,highlyprioritizedpanelfactorizationisstillbenecial.
AfeaturethatisalsousefulforourcodeismarkedwithatriangularnodethatislabelledasGATHERVinFigure8.
Thisfeatureallowsforsubmissionoftasksthatwritetodifferentportionsofthesamesubmatrix.
TheSchur'scomplementupdateisperformedwithxGEMMsandcaneitherbeseenasfourindependenttasksthatupdatedisjointportionsofthetrailingmatrix,orasasingletaskthatupdatesthetrailingmatrix,asawhole.
Inthelattercase,theparallelismsoabundantintheupdatewouldhavebeenlost.
GATHERVallowsfortherecoveryofthisparallelismbysubmitting,notone,butmultipletasksthatupdatethesameportionofmemory.
TheGATHERVannotationsinformQUARKthatthesemultipletasksareindependentofeachothereventhoughtheirdatadependenciesindicateotherwise.
Finally, QUARK can optionally generate DAGs such as the one featured in Figure 8. This is controlled with an environment variable and can be turned on as necessary as a debugging or profiling feature.
6 Experimental Results

In this section, we show results on the largest shared-memory machines we could access at the time of writing this paper. They are representative of a vast class of servers and workstations commonly used for computationally intensive workloads. They clearly show the industry's transition from chips with a few cores to a few tens of cores; from compute nodes with order O(10) cores to O(100) designs; and from Front Side Bus memory interconnects (Intel's NetBurst and Core architectures) to NUMA and ccNUMA hardware (AMD's HyperTransport and Intel's QuickPath Interconnect).
6.1 Environment Settings

All the experiments are run on a single system that we will call MagnyCour-48. MagnyCour-48 is composed of four AMD Opteron MagnyCour 6172 processors of twelve cores each, running at 2.1 GHz, with 128 GB of memory. The theoretical peak for this architecture in single and double precision arithmetic is 806.4 Gflop/s (16.8 Gflop/s per core) and 403.2 Gflop/s (8.4 Gflop/s per core), respectively. We compare the results against the latest parallel version of the Intel MKL 10.3.2 library released in January 2011, and against the reference LAPACK 3.2 from Netlib linked against the Intel MKL BLAS multithreaded library. We also link against the sequential version of Intel MKL for our code and PLASMA.
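The quoted peaks follow from per-core arithmetic throughput; as a sanity check (assuming, as for this AMD core generation, 4 double-precision and 8 single-precision flops per cycle per core):

```python
# Reconstructing the quoted theoretical peaks for MagnyCour-48.
# Assumption: each core retires 4 DP / 8 SP flops per cycle.
sockets, cores_per_socket, ghz = 4, 12, 2.1
cores = sockets * cores_per_socket       # 48 cores in total
sp_peak = cores * ghz * 8                # ~806.4 Gflop/s single precision
dp_peak = cores * ghz * 4                # ~403.2 Gflop/s double precision
print(cores, sp_peak, dp_peak)
```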
6.2 Performance Results

Figures 9 and 10 present the performance comparisons of the parallel recursive tile LU algorithm against the Intel MKL and PLASMA libraries on MagnyCour-48, in single and double precision arithmetic, respectively. Each curve is obtained by using the maximum number of cores available on the architecture, and by tuning the parameters to achieve the best asymptotic performance. Five configurations are shown: the LAPACK implementation from Netlib linked against the parallel BLAS from Intel MKL, to study the fork-join model; the Intel MKL version of DGETRF; the previous algorithm of PLASMA based on incremental pivoting on the two different layouts studied, column-major (or LAPACK) and the tile layout proper to PLASMA tile algorithms; and finally, our new algorithm, LU rec-par panel, equally on both layouts. Both PLASMA versions correspond to the tile LU algorithm with incremental pivoting and use QUARK as a dynamic scheduling framework. The first version handles the LAPACK interface (native interface), which requires an input matrix in column-major data layout, similar to Intel MKL. It thus implies that PLASMA has to convert the matrix into tile data layout before the factorization can proceed, and convert it back to column-major data layout at the end of the computation, as originally given by the user. The second configuration is the tile interface (expert interface), which accepts matrices already in tile data layout and therefore avoids both layout conversions. For our algorithm, the kernel used for the panel is chosen according to the data layout, so no conversions are required.

We used a tile size NB = 240 and an inner blocking IB = 20 for the PLASMA algorithms, and a panel width NB = 280 for our parallel recursive tile LU. Those parameters have been tuned for asymptotic performance.
The recursive parallel panel LU algorithm based on column-major and tile data layouts obtains roughly similar performance which, at first glance, may look surprising after the conclusions drawn previously in Section 4.3. It is important to understand that the step of the trailing submatrix update becomes the leading phase in such factorizations because it is rich in Level 3 BLAS operations; thus, the overhead of memory accesses in the panel is completely hidden by the efficiency of the compute-intensive kernels on tile layout, which makes the recursive parallel panel LU algorithm based on tile data layout competitive compared to the column-major data layout version. Moreover, Figures 9 and 10 show that, asymptotically, we take advantage of the monolithic xGEMM kernel by increasing the performance up to 20% compared to both PLASMA versions. Our implementation is also significantly better than Intel MKL asymptotically. The curves also show that the recursive LU algorithm is more sensitive to tuning on small sizes than the PLASMA algorithms, which produce good performance numbers on small matrix sizes even with the NB/IB pair selected for large matrices. For example, the configuration N = 5000 with 16 threads gives the same performance (≈75 Gflop/s) on PLASMA's incremental pivoting algorithm as on our new algorithm.
Fig. 9. Performance of SGETRF on MagnyCour-48.
The parallel recursive tile LU algorithm thus provides good performance on many-core architectures. It also retains the standard numerical accuracy, as opposed to the incremental pivoting strategy from PLASMA. This obviously comes at the price of a synchronization point added right after each panel computation. This synchronization point has, however, been considerably weakened by efficiently parallelizing the panel factorization, and by the increase of the level of parallelism during the phase of the trailing submatrix updates, compared to PLASMA's algorithms. Taking into account the fact that PLASMA's algorithm loses digits of precision, especially when the number of tiles increases [10], our new recursive tile LU factorization clearly appears to be a good alternative, and will eventually replace the current LU algorithm in PLASMA.
6.3 Scalability

Figures 11 and 12 show the scalability of the parallel recursive tile LU algorithm on MagnyCour-48 in single and double precision arithmetic, respectively. The scaling experiments have been done according to the number of cores per socket (i.e., twelve cores), using 6, 12, 24 and 48 threads. For each curve, we used a panel width NB = 280, which gives the best performance on 48 threads. Since the data is usually allocated by the user and we do not move it around, we used the Linux command numactl --interleave=0-X, where X is one less than the number of threads. This command allows us to control the NUMA policy by allocating the memory close to the cores where the threads are bound.

Footnote 1: PLASMA incpiv on column-major data layout includes the translation of the matrix to tile layout for the factorization step, and the translation back to LAPACK layout to return the result to the user.

Fig. 10. Performance of DGETRF on MagnyCour-48.
We observe that our algorithm scales almost linearly up to 48 threads.

Fig. 11. Scalability of PLASMA recursive LU on MagnyCour-48 in single precision.

Fig. 12. Scalability of PLASMA recursive LU on MagnyCour-48 in double precision.
6.4 Execution Trace

This section shows the execution traces of three different versions of the LU algorithm on MagnyCour-48 with 16 threads on matrices of size 5000 × 5000. These traces have been generated thanks to the EZTrace library [33] and the ViTE software [34]. On each trace, the green color is dedicated to the factorization of the panel (light for dgetrf and dark for dtstrf), the blue color illustrates the row update (dtrsm + dlaswp or dgessm), the yellow color represents the update kernel (dgemm or dssssm), the orange color shows the backward swaps, and finally, the gray color highlights the idle time.
The first trace, Figure 13(a), is the result of the execution of the PLASMA algorithm. It shows that the tile algorithm quickly reaches a high degree of parallelism, triggered by the first task. The second trace, Figure 13(b), depicts the recursive tile LU, where the panel is instead performed by a call to the sequential dgetrf routine from Intel MKL. This algorithm releases as much parallelism as the previous one after the first task, but we clearly observe on the execution trace that the time spent factorizing the first panel is longer than the time needed to factorize the block in the tile algorithm. Another concern in this version is that the time spent on the critical path is significant, which leads to substantial idle time intervals, especially after the first half of the factorization. Finally, Figure 13(c) shows the execution trace of the same algorithm but with a parallel panel computation instead. This results in a reduced factorization step, which drastically reduces the overall idle time. It is also noteworthy to mention how the lookahead transparently comes into effect at runtime, thanks to the dynamic scheduler QUARK. Moreover, the non-critical tasks, which perform pivot interchanges behind the panel (xLASWP), are postponed until the end of the factorization in order to stress the pursuit of the critical path.
7 Summary and Future Work

This paper introduced a novel parallel recursive LU factorization with partial pivoting on multicore architectures. Our implementation is characterized by a parallel recursive panel factorization, while the computation on the trailing submatrix is carried out using a tile algorithm. Not only does this implementation achieve higher performance than the corresponding routines in LAPACK (up to 3-fold speed-up) and MKL (up to 40% speed-up), but it also maintains the numerical quality of the standard LU factorization algorithm with partial pivoting, which is not the case for PLASMA. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step.
(a) Incremental pivoting DGETRF with N = 5000, NB = 220 and IB = 20. (b) Recursive LU DGETRF with sequential panel factorization, N = 5000 and NB = 220. (c) Recursive LU DGETRF with parallel panel factorization, N = 5000 and NB = 220.
Fig. 13. Execution traces of the different variants of the LU factorization using QUARK. Light green: dgetrf, dark green: dtstrf, light blue: dtrsm or dgessm, yellow: dgemm or dssssm, and orange: dlaswp.

Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation naturally flow without contention.
The dynamic runtime system QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead.
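For intuition only, here is a compact serial sketch of the recursive panel formulation with partial pivoting (plain Python; the authors' implementation is parallel, works on tiles, and uses optimized BLAS, none of which is shown here): factor the left half of the panel recursively, solve the triangular system for the right half, apply the Schur update, and recurse on the bottom-right part.

```python
def rec_panel(A, piv, c0, m, w):
    """Recursive LU with partial pivoting on rows c0..c0+m-1 and
    columns c0..c0+w-1 of A, overwritten with the packed L\\U factors."""
    if w == 1:
        # Base case: pick the pivot, swap rows, scale the column.
        p = max(range(c0, c0 + m), key=lambda i: abs(A[i][c0]))
        piv[c0] = p
        A[c0], A[p] = A[p], A[c0]
        for i in range(c0 + 1, c0 + m):
            A[i][c0] /= A[c0][c0]
        return
    h = w // 2
    rec_panel(A, piv, c0, m, h)              # factor the left half
    for j in range(c0 + h, c0 + w):          # TRSM: U12 = inv(L11) * A12
        for i in range(c0 + 1, c0 + h):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(c0, i))
    for i in range(c0 + h, c0 + m):          # GEMM: A22 -= L21 * U12
        for j in range(c0 + h, c0 + w):
            A[i][j] -= sum(A[i][k] * A[k][j] for k in range(c0, c0 + h))
    rec_panel(A, piv, c0 + h, m - h, w - h)  # recurse on the bottom-right

# Factor a small matrix and verify P*A0 == L*U.
A0 = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
A = [row[:] for row in A0]
piv = [0] * 3
rec_panel(A, piv, 0, 3, 3)

PA = [row[:] for row in A0]
for k in range(3):                           # apply the recorded pivots
    PA[k], PA[piv[k]] = PA[piv[k]], PA[k]
L = [[A[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(3)]
     for i in range(3)]
U = [[A[i][j] if j >= i else 0.0 for j in range(3)] for i in range(3)]
LU = [[sum(L[i][k] * U[k][j] for k in range(3)) for j in range(3)]
      for i in range(3)]
```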
The natural extension for this work would be the application of our methodology and implementation techniques to the tile QR factorization. Even though the tile QR factorization does not suffer from loss of numerical accuracy when compared to the standard QR factorization, thanks to the use of orthogonal transformations, a performance hit has been noticed for asymptotic sizes (also seen for the tile LU from PLASMA). This is mainly due to the most compute-intensive kernel, which is composed of successive calls to Level 3 BLAS kernels. If the QR panel were parallelized (similarly to the LU panel), the update would be much simpler (especially when targeting distributed systems) and based on single calls to xGEMM. The overall performance would then be guided solely by the performance of the matrix-matrix multiplication kernel, which is crucial when targeting asymptotic performance.
References
1. Anderson E, Bai Z, Bischof C, Blackford SL, Demmel JW, Dongarra JJ, Croz JD, Greenbaum A, Hammarling S, McKenney A, et al. LAPACK Users' Guide. 3rd edn., Society for Industrial and Applied Mathematics: Philadelphia, 1999.
2. Agullo E, Hadri B, Ltaief H, Dongarra J. Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ACM: New York, NY, USA, 2009; 1-12, doi:10.1145/1654059.1654080.
3. University of Tennessee. PLASMA Users' Guide, Parallel Linear Algebra Software for Multicore Architectures, Version 2.3, November 2010.
4. Agullo E, Demmel J, Dongarra J, Hadri B, Kurzak J, Langou J, Ltaief H, Luszczek P, Tomov S. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series 2009; 180.
5. The FLAME project, April 2010. http://z.cs.utexas.edu/wiki/flame.wiki/FrontPage.
6. Luszczek P, Ltaief H, Dongarra J. Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. IEEE International Parallel and Distributed Processing Symposium, May 2011.
7. Sorensen DC. Analysis of pairwise pivoting in Gaussian elimination. IEEE Transactions on Computers, March 1985; C-34(3).
8. Buttari A, Langou J, Kurzak J, Dongarra JJ. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures. Parallel Comput. Syst. Appl. 2009; 35:38-53.
9. Quintana-Ortí G, Quintana-Ortí ES, Geijn RAVD, Zee FGV, Chan E. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Trans. Math. Softw., July 2009; 36:14:1-14:26.
10. Agullo E, Augonnet C, Dongarra J, Faverge M, Langou J, Ltaief H, Tomov S. LU Factorization for Accelerator-based Systems. ICL Technical Report ICL-UT-10-05, Submitted to AICCSA 2011, December 2010.
11. Kurzak J, Ltaief H, Dongarra J, Badia R. Scheduling dense linear algebra operations on multicore processors. Concurrency and Computation: Practice and Experience, January 2010; 22(1):15-44.
12. Ltaief H, Kurzak J, Dongarra J, Badia R. Scheduling two-sided transformations using tile algorithms on multicore architectures. Journal of Scientific Computing 2010; 18:33-50.
13. Intel, Math Kernel Library (MKL). http://www.intel.com/software/products/mkl/.
14. Elmroth E, Gustavson FG. New serial and parallel recursive QR factorization algorithms for SMP systems. Proceedings of PARA 1998, 1998.
15. Georgiev K, Wasniewski J. Recursive Version of LU Decomposition. Revised Papers from the Second International Conference on Numerical Analysis and Its Applications 2001; 325-332.
16. Dongarra J, Eijkhout V, Luszczek P. Recursive approach in sparse matrix LU factorization. Sci. Program., January 2001; 9:51-60.
17. Irony D, Toledo S. Communication-efficient parallel dense LU using a 3-dimensional approach. Proceedings of the 10th SIAM Conference on Parallel Processing for Scientific Computing, Norfolk, Virginia, USA, 2001.
18. Dongarra JJ, Luszczek P, Petitet A. The LINPACK benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience 2003; 15:1-18.
19. Castaldo AM, Whaley RC. Scaling LAPACK panel operations using Parallel Cache Assignment. Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2010; 223-232.
20. Gustavson FG. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development, November 1997; 41(6):737-755.
21. Chan E, van de Geijn R, Chapman A. Managing the complexity of lookahead for LU factorization with pivoting. Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures 2010; 200-208.
22. Anderson E, Dongarra J. Implementation guide for LAPACK. Technical Report UT-CS-90-101, University of Tennessee, Computer Science Department, April 1990. LAPACK Working Note 18.
23. Amdahl GM. Validity of the single-processor approach to achieving large scale computing capabilities. AFIPS Conference Proceedings, vol. 30, AFIPS Press, Reston, VA: Atlantic City, N.J., 1967; 483-485.
24. Gustafson JL. Reevaluating Amdahl's Law. Communications of the ACM 1988; 31(5):532-533.
25. Sundell H. Efficient and practical non-blocking data structures. PhD dissertation, Department of Computer Science, Chalmers University of Technology, Göteborg, Sweden, November 5, 2004.
26. Yi Q, Kennedy K, You H, Seymour K, Dongarra J. Automatic blocking of QR and LU factorizations for locality. 2nd ACM SIGPLAN Workshop on Memory System Performance (MSP 2004), 2004.
27. Blackford LS, Choi J, Cleary A, D'Azevedo EF, Demmel JW, Dhillon IS, Dongarra JJ, Hammarling S, Henry G, Petitet A, et al. ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics: Philadelphia, 1997.
28. Haidar A, Ltaief H, YarKhan A, Dongarra JJ. Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures. ICL Technical Report UT-CS-11-666, LAPACK Working Note #243, Submitted to Concurrency and Computation 2010.
29. Dijkstra EW. On the role of scientific thought. Selected Writings on Computing: A Personal Perspective, Dijkstra EW (ed.). Springer-Verlag New York, Inc.: New York, NY, USA, 1982; 60-66. ISBN 0-387-90652-5.
30. Reade C. Elements of Functional Programming. Addison-Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1989. ISBN 0201129159.
31. Buttari A, Langou J, Kurzak J, Dongarra JJ. Parallel Tiled QR Factorization for Multicore Architectures. Concurrency Computat.: Pract. Exper. 2008; 20(13):1573-1590. DOI: 10.1002/cpe.1301.
32. Perez J, Badia R, Labarta J. A dependency-aware task-based programming environment for multi-core architectures. Cluster Computing, 2008 IEEE International Conference on, 2008; 142-151, doi:10.1109/CLUSTR.2008.4663765.
33. Dongarra J, Faverge M, Ishikawa Y, Namyst R, Rue F, Trahay F. EZTrace: a generic framework for performance analysis. Technical Report, Innovative Computing Laboratory, University of Tennessee, December 2010.
34. Visual Trace Explorer. http://vite.gforge.inria.fr/.
