Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Giorgis Georgakoudis1, Luanzheng Guo2, and Ignacio Laguna1

1 Center for Advanced Scientific Computing, Lawrence Livermore National Laboratory, Livermore, USA
{georgakoudis1, lagunaperalt1}@llnl.gov
2 EECS, UC Merced, Merced, USA
lguo4@ucmerced.edu

Abstract. Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpoint retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application, to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows.
L. Guo: Work performed during an internship at Lawrence Livermore National Laboratory.

© Springer Nature Switzerland AG 2020. P. Sadayappan et al. (Eds.): ISC High Performance 2020, LNCS 12151, pp. 536–554, 2020. https://doi.org/10.1007/978-3-030-50743-5_27

1 Introduction

HPC system performance scales by increasing the number of computing nodes and by increasing the processing and memory elements of each node. Furthermore, electronics continue to shrink, thus they are more susceptible to interference, such as radiation upsets or voltage fluctuations. Those trends increase the probability of a failure happening, either due to component failure or due to transient soft errors affecting electronics. Large HPC applications run for hours or days and use most, if not all, the nodes of a supercomputer, thus they are vulnerable to failures, often leading to process or node crashes. Reportedly, the mean time between node failures on petascale systems has been measured to be 6.7 h [24], while worst-case projections [12] foresee that exascale systems may experience a failure even more frequently.
HPC applications often implement fault tolerance using checkpoints to restart execution, a method referred to as Checkpoint-Restart (CR). Applications periodically store checkpoints, e.g., every few iterations of an iterative computation, and when a failure occurs, execution aborts and restarts again to resume from the latest checkpoint. Most scalable HPC applications follow the Bulk Synchronous Parallel (BSP) paradigm, hence CR with global, backward, non-shrinking recovery [21], also known as global-restart, naturally fits their execution. CR is straightforward to implement but requires re-deploying the whole application on a failure, re-spawning all processes on every node and re-initializing any application data structures. This method has significant overhead since a failure of a few processes, even a single process failure, requires complete re-deployment, although most of the processes survived the failure.

By contrast, User-Level Fault Mitigation (ULFM) [4] extends MPI with interfaces for handling failures at the application level without restarting execution. The programmer is required to use the ULFM extensions to detect a failure and repair communicators, and either spawn new processes, for non-shrinking recovery, or continue execution with any survivor processes, for shrinking recovery. Although ULFM grants the programmer great flexibility to handle failures, it requires considerable effort to refactor the application for correctly and efficiently implementing recovery.

Alternatively, Reinit [11,22] has been proposed as an easier-to-program approach, but equally capable of supporting global-restart recovery. Reinit extends MPI with a function call that sets a rollback point in the application. It transparently implements MPI recovery, by spawning new processes and mending the world communicator at the MPI runtime level. Thus, Reinit transparently ensures a consistent, initial MPI state akin to the state after MPI initialization. However, the existing implementation of Reinit [11] is hard to deploy, since it requires modifications to the job scheduler, and difficult to compare with ULFM, which only requires extensions to the MPI library. Notably, both the Reinit and ULFM approaches assume the application has checkpointing in place to resume execution at the application level.

Although there has been a large bibliography [4,5,9,11,16–18,21–23,26] discussing the programming model and prototypes of those approaches, no study has presented an in-depth performance evaluation of them – most previous works either focus on individual aspects of each approach or perform limited-scale experiments. In this paper, we present an extensive evaluation using HPC proxy applications to contrast these two leading global-restart recovery approaches. Specifically, our contributions are:

– A new design and implementation of the Reinit approach, named Reinit++, using the latest OpenMPI runtime. Our design and implementation supports recovery from either process or node failures, is high performance, and deploys easily by extending the OpenMPI library. Notably, we present a precise definition of the failures it handles and the scope of this design and implementation.
– An extensive evaluation of the performance of the possible recovery approaches (CR, Reinit++, ULFM) using three HPC proxy applications (CoMD, LULESH, HPCCG), and including file and in-memory checkpointing schemes.
– New insight from the results of our evaluation, which shows that recovery under Reinit++ is up to 6× faster than CR and up to 3× faster than ULFM. Compared to CR, Reinit++ avoids the re-deployment overhead, while compared to ULFM, Reinit++ avoids interference during fault-free application execution and has less recovery overhead.
2 Overview

This section presents an overview of the state-of-the-art approaches for MPI fault tolerance. Specifically, it provides an overview of the recovery models for applications and briefly discusses ULFM and Reinit, which represent the state-of-the-art in MPI fault tolerance.

2.1 Recovery Models for MPI Applications

There are several models for fault tolerance depending on the requirements of the application. Specifically, if all MPI processes must recover after a failure, recovery is global; otherwise, if some, but not all, of the MPI processes need to recover, then recovery is deemed local. Furthermore, applications can either recover by rolling back computation to an earlier point in time, defined as backward recovery, or, if they can continue computation without backtracking, recovery is deemed forward. Moreover, if recovery restores the number of MPI processes to resume execution, it is defined as non-shrinking, whereas if execution continues with whatever number of processes survive the failure, then recovery is characterized as shrinking. Global-restart implements global, backward, non-shrinking recovery, which fits most HPC applications that follow a bulk-synchronous paradigm where MPI processes have interlocked dependencies, thus it is the focus of this work.

2.2 Existing Approaches for MPI Fault Tolerance

ULFM. One of the state-of-the-art approaches for fault tolerance in MPI is User-Level Fault Mitigation (ULFM) [4]. ULFM extends MPI to enable failure detection at the application level and provides a set of primitives for handling recovery. Specifically, ULFM taps into the existing error handling interface of MPI to implement user-level fault notification. Regarding its extensions to the MPI interface, we elaborate on communicators since their extensions are a superset of those for other communication objects (windows, I/O). ULFM extends MPI with a revoke operation (MPI_Comm_revoke(comm)) to invalidate a communicator such that any subsequent operation on it raises an error. Also, it defines a shrink operation (MPI_Comm_shrink(comm, newcomm)) that creates a new communicator from an existing one after excluding any failed processes. Additionally, ULFM defines a collective agreement operation (MPI_Comm_agree(comm, flag)) which achieves consensus on the group of failed processes in a communicator and on the value of the integer variable flag.

Based on those extensions, MPI programmers are expected to implement their own recovery strategy tailored to their applications. ULFM operations are general enough to implement any type of recovery discussed earlier. However, this generality comes at the cost of complexity. Programmers need to understand the intricate semantics of those operations to correctly and efficiently implement recovery and restructure, possibly significantly, the application for explicitly handling failures. Although ULFM provides examples that prescribe the implementation of global-restart, the programmer must embed this in the code and refactor the application to function with the expectation that communicators may change during execution due to shrinking and merging, which is not ideal.
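To make the flavor of that effort concrete, the following is a minimal sketch of a recovery step built on ULFM-style primitives. It is illustrative only, not code from this paper or from ULFM itself: the function names follow the text above (released ULFM implementations prefix them with MPIX_), the checkpoint and iteration routines are hypothetical placeholders, and the re-spawning/merging needed for true non-shrinking recovery is only indicated by a comment.

#include <mpi.h>

int  do_iteration(MPI_Comm comm);   /* hypothetical: returns an MPI error code */
void load_checkpoint(void);         /* hypothetical application checkpoint I/O */
void store_checkpoint(void);        /* hypothetical application checkpoint I/O */

void solve(MPI_Comm world)
{
    MPI_Comm comm;
    MPI_Comm_dup(world, &comm);
    /* Return error codes instead of aborting, so failures can be handled. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int done = 0;
    while (!done) {
        int rc = do_iteration(comm);      /* MPI calls inside may report failures */
        int ok = (rc == MPI_SUCCESS);
        /* Agree globally whether this iteration succeeded on every rank. */
        MPI_Comm_agree(comm, &ok);        /* MPIX_Comm_agree in ULFM releases */
        if (!ok) {
            MPI_Comm_revoke(comm);        /* make the failure visible to all ranks */
            MPI_Comm newcomm;
            MPI_Comm_shrink(comm, &newcomm);  /* drop failed processes */
            /* Non-shrinking global-restart would additionally spawn
               replacement processes here and merge them into newcomm. */
            comm = newcomm;
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
            load_checkpoint();            /* roll back application state */
            continue;
        }
        store_checkpoint();
        /* update done from the computation */
    }
}

Even in this shrinking-only sketch, the failure handling is interwoven with the application's main loop, which illustrates the refactoring burden the paper attributes to ULFM.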
Reinit. Reinit [11,22] has been proposed as an alternative approach for implementing global-restart recovery, through a simpler interface compared to ULFM. The most recent implementation [11] of Reinit is limited in several aspects: (1) it requires modifying the job scheduler (SLURM), besides the MPI runtime, thus it is impractical to deploy and skews performance measurements due to crossing the interface between the job scheduler and the MPI runtime; (2) its implementation is not publicly available; (3) it is based on the MVAPICH2 MPI runtime, which makes comparisons with ULFM hard, since ULFM is implemented on the OpenMPI runtime. Thus, we opt for a new design and implementation, named Reinit++ (available open-source at https://github.com/ggeorgakoudis/ompi/tree/reinit), which we present in detail in the next section.
3 Reinit++

This section describes the programming interface of Reinit++, the assumptions for application deployment, process and node failure detection, and the recovery algorithm for global-restart. We also define the semantics of MPI recovery for the implementation of Reinit++ as well as discuss its specifics.

3.1 Design

Programming Interface of Reinit++. Figure 1 presents the programming interface of Reinit++ in the C language, while Fig. 2 shows sample usage of it. There is a single function call, MPI_Reinit, for the programmer to call to define the point in code to roll back to and resume execution after a failure. This function must be called after MPI_Init, to ensure the MPI runtime has been initialized. Its arguments imitate the parameters of MPI_Init, adding a parameter for a pointer to a user-defined function.

typedef enum {
  MPI_REINIT_NEW,
  MPI_REINIT_REINITED,
  MPI_REINIT_RESTARTED
} MPI_Reinit_state_t;

typedef int (*MPI_Restart_point)(int argc, char **argv,
                                 MPI_Reinit_state_t state);

int MPI_Reinit(int argc, char **argv, const MPI_Restart_point point);

Fig. 1. The programming interface of Reinit++

int foo(int argc, char **argv, MPI_Reinit_state_t state)
{
  /* Load checkpoint if it exists */
  while (!done) {
    /* Do computation */
    /* Store checkpoint */
  }
}

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  /* Application-specific initialization */
  // Entry point of the resilient function
  MPI_Reinit(&argc, &argv, foo);
  MPI_Finalize();
}

Fig. 2. Sample usage of the interface of Reinit++

Reinit++ expects the programmer to encapsulate in this function the main computational loop of the application, which is restartable through checkpointing. Internally, MPI_Reinit passes the parameters argc and argv to this user-defined function, plus the parameter state, which indicates the MPI state of the process as a value from the enumeration type MPI_Reinit_state_t. Specifically, the value MPI_REINIT_NEW designates a new process executing for the first time, the value MPI_REINIT_REINITED designates a survivor process that has entered the user-defined function after rolling back due to a failure, and the value MPI_REINIT_RESTARTED designates that the process has failed and has been re-spawned to resume execution. Note that this state variable describes only the MPI state of Reinit++, thus it has no semantics on the application state, such as whether to load a checkpoint or not.
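As an illustration of that separation of concerns, a resilient function might combine the Reinit++ state with the application's own checkpoint metadata roughly as follows. This is a hypothetical usage sketch, assuming the Fig. 1 declarations are in scope; have_checkpoint(), load_checkpoint(), store_checkpoint(), and compute_step() stand in for an application's own checkpointing and compute layer.

#define MAX_STEPS 1000

int  have_checkpoint(void);      /* hypothetical: does a valid checkpoint exist? */
int  load_checkpoint(void);      /* hypothetical: returns the step to resume from */
void store_checkpoint(int step); /* hypothetical */
void compute_step(int step);     /* hypothetical application computation */

/* The application, not Reinit++, decides whether to load a checkpoint. */
int resilient_main(int argc, char **argv, MPI_Reinit_state_t state)
{
    int step = 0;
    if (state != MPI_REINIT_NEW || have_checkpoint()) {
        /* Survivor or re-spawned process (or a fresh run that finds an
           older checkpoint): resume from the latest checkpoint. */
        step = load_checkpoint();
    }
    for (; step < MAX_STEPS; ++step) {
        compute_step(step);
        store_checkpoint(step);  /* e.g., every iteration, as in Sect. 4 */
    }
    return 0;
}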
Application Deployment Model. Reinit++ assumes a logical, hierarchical topology of application deployment. Figure 3 shows a graphical representation of this deployment model. At the top level, there is a single root process that spawns and monitors daemon processes, one on each of the computing nodes reserved for the application. Daemons spawn and monitor MPI processes local to their nodes. The root communicates with daemons and keeps track of their liveness, while daemons track the liveness of their children MPI processes. Based on this execution and deployment model, Reinit++ performs fault detection, which we discuss next.

Fig. 3. Application deployment model (a root process connected to daemons D1 ... Dn, each managing its node-local MPI processes P1 ... Pm)
Fault Detection. Reinit++ targets fail-stop failures of either MPI processes or daemons. A daemon failure is deemed equivalent to a node failure. The causes of those failures may be transient faults or hard faults of hardware components. In the design of Reinit++, the root manages the execution of the whole application, so any recovery decisions are taken by it, hence it is the focal point for fault detection. Specifically, if an MPI process fails, its managing daemon is notified of the failure and forwards this notification to the root, without taking any action itself. If a daemon process fails, which means either the node failed or the daemon process itself, the root directly detects the failure and also assumes that the children MPI processes of that daemon are lost too. After detecting a fault, the root process proceeds with recovery, which we introduce in the following section.
MPI Recovery. Reinit++ recovery for both MPI process and daemon failures is similar, except that on a daemon failure the root chooses a new host node to re-instate failed MPI processes, since a daemon failure proxies a node failure. For recovery, the root process broadcasts a reinit message to all daemons. Daemons receiving that message roll back survivor processes and re-spawn failed ones. After rolling back survivor MPI processes and spawning new ones, the semantics of MPI recovery are that only the world communicator is valid and any previous MPI state (other communicators, windows, etc.) has been discarded. This is similar to the MPI state available immediately after an application calls MPI_Init. Next, the application restores its state, discussed in the following section.
Application Recovery. Reinit++ assumes that applications are responsible for saving and restoring their state to resume execution. Hence, both survivor and re-spawned MPI processes should load a valid checkpoint after MPI recovery to restore application state and resume computation.
3.2 Implementation

We implement Reinit++ in the latest OpenMPI runtime, version 4.0.0. The implementation supports recovery from both process and daemon (node) failures. This implementation does not presuppose any particular job scheduler, so it is compatible with any job scheduler the OpenMPI runtime works with. Briefly introducing the OpenMPI software architecture, it comprises three frameworks of distinct functionality: (i) the OpenMPI MPI layer (OMPI), which implements the interface of the MPI specification used by application developers; (ii) the OpenMPI Runtime Environment (ORTE), which implements runtime functions for application deployment, execution monitoring, and fault detection; and (iii) the Open Portability Access Layer (OPAL), which implements abstractions of OS interfaces, such as signal handling, process creation, etc.

Reinit++ extends OMPI to provide the function MPI_Reinit. It extends ORTE to propagate fault notifications from daemons to the root and to implement the mechanism of MPI recovery on detecting a fault. Also, Reinit++ extends OPAL to implement low-level process signaling for notifying survivor processes to roll back. The following sections provide more details.

Application Deployment. Reinit++ requires the application to deploy using the default launcher of OpenMPI, mpirun. Note that using the launcher mpirun is compatible with any job scheduler and even uses optimized deployment interfaces, if the scheduler provides any. Physical application deployment in OpenMPI closely follows the logical model of the design of Reinit++. Specifically, OpenMPI sets the root of the deployment at the process launching mpirun, typically on a login node of HPC installations, which is deemed the Head Node Process (HNP) in OpenMPI terminology. Following, the root launches an ORTE daemon on each node allocated for the application. Daemons spawn the set of MPI processes on each node and monitor their execution. The root process communicates with each daemon over a channel of a reliable network transport and monitors the liveness of daemons through the existence of this channel.

Launching an application, the user specifies the number of MPI processes and optionally the number of nodes (or number of processes per node). To withstand process failures, this specification of deployment is sufficient, since Reinit++ re-spawns failed processes on their original node of deployment. However, for node failures, the user must over-provision the allocated process slots for re-spawning the set of MPI processes lost due to a failed node. To do so, the most straightforward way is to allocate more nodes than required for fault-free operation, up to the maximum number of node failures to withstand.
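As a concrete illustration (our example, not something prescribed by Reinit++, and exact flags vary by OpenMPI version): launching with a command along the lines of mpirun -np 64 --map-by ppr:16:node ./proxy_app on a 5-node allocation places the 64 ranks at 16 ranks per node on the first 4 nodes, leaving one spare node's worth of process slots available for re-spawning the ranks of a failed node.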
Fault Detection. In OpenMPI, a daemon is the parent of the MPI processes on its node. If an MPI process crashes, its parent daemon is notified by trapping the signal SIGCHLD, in POSIX semantics. Implementing the fault detection requirements of Reinit++, a daemon relays the fault notification to the root process for taking action. Regarding node failures, the root directly detects them proxied through daemon failures. Specifically, the root has an open communication channel with each daemon over some reliable transport, e.g., TCP. If the connection over that communication channel breaks, the root process is notified of the failure and regards the daemon at fault, thus assuming all its children MPI processes are lost and its host node is unavailable. For both types of failures (process and node), the root process initiates MPI recovery.
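The daemon-side detection described above boils down to standard POSIX child monitoring. A minimal, stand-alone sketch of the idea (not Reinit++ code; report_failure_to_root() is a hypothetical placeholder for the ORTE notification path) could look like this:

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

void report_failure_to_root(pid_t pid);   /* hypothetical notification hook */

static void on_sigchld(int signo)
{
    int status;
    pid_t pid;
    (void)signo;
    /* Reap every child that changed state; a child killed by a signal or
       exiting with a non-zero status is treated as a crashed MPI process. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFSIGNALED(status) ||
            (WIFEXITED(status) && WEXITSTATUS(status) != 0))
            report_failure_to_root(pid);
    }
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    sigaction(SIGCHLD, &sa, NULL);
    /* ... spawn and monitor node-local MPI processes ... */
    return 0;
}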
MPI Recovery. Algorithm 1 shows in pseudocode the operation of the root process when handling a failure. On detecting a failure, the root process distinguishes whether it is a faulty daemon or MPI process. For a node failure, the root selects the least loaded node in the resource allocation, that is the node with the fewest occupied process slots, and sets this node's daemon as the parent daemon for failed processes. For a process failure, the root selects the original parent daemon of the failed process to re-spawn that process. Next, the root process initiates recovery by broadcasting to all daemons a message with the REINIT command and the list of processes to spawn, along with their selected parent daemons.

Algorithm 1: Root: Handle Failure
  Data: D: the set of daemons; Children(x): returns the set of children MPI processes of daemon x; Parent(x): returns the parent daemon of MPI process x
  Input: the failed process f (MPI process or daemon)
  if f ∈ D then                                   // failed process is a daemon
      D ← D \ {f}
      d ← argmin_{d' ∈ D} |Children(d')|          // least loaded surviving daemon
      Broadcast to D the message REINIT, {(d, c) | c ∈ Children(f)}   // broadcast REINIT to all daemons
  else                                            // failed process is an MPI process
      Broadcast to D the message REINIT, {(Parent(f), f)}
  end

Following, when a daemon receives that message it signals its surviving children MPI processes to roll back, and re-spawns any processes in the list that have this daemon as their parent. Algorithm 2 presents this procedure in pseudocode.

Algorithm 2: Daemon d: Handle Reinit
  Data: Children(x): returns the set of children MPI processes of daemon x; Parent(x): returns the parent daemon of MPI process x
  Input: list {(di, ci)}
  // Signal survivor MPI processes
  for c ∈ Children(d) do
      c.state ← MPI_REINIT_REINITED
      Signal SIGREINIT to c
  end
  // Spawn new process if d is parent
  foreach (di, ci) do
      if d == di then
          Children(d) ← Children(d) ∪ {ci}
          ci.state ← MPI_REINIT_RESTARTED
          Spawn ci
      end
  end

Regarding the asynchronous, signaling interface of Reinit++, Algorithm 3 illustrates the internals of Reinit++ in pseudocode. When an MPI process executes MPI_Reinit, it installs a signal handler for the signal SIGREINIT, which aliases SIGUSR1 in our implementation. Also, MPI_Reinit sets a non-local goto point using the POSIX function setjmp(). The signal handler of SIGREINIT simply calls longjmp() to return execution of survivor processes to this goto point.

Algorithm 3: Reinit++ internals
  Function OnSignalReinit():
      goto Rollback
  end
  Function MPIReinit(argc, argv, foo):
      Install signal handler OnSignalReinit on SIGREINIT
      Rollback:
      if this.state == MPI_REINIT_REINITED then
          Discard MPI state
          Wait on barrier
          Re-initialize world communicator
      end
      return foo(argc, argv, this.state)
  end

Rolled-back survivor processes discard any previous MPI state and block on an ORTE-level barrier. This barrier replicates the implicit barrier present in MPI_Init to synchronize with re-spawned processes joining the computation. After the barrier, survivor processes re-initialize the world communicator and call the function foo to resume computation. Re-spawned processes initialize the world communicator as part of the MPI initialization procedure of MPI_Init and go through MPI_Reinit to install the signal handler, set the goto point, and lastly call the user-defined function to resume computation.
Applicationrecoveryincludestheactionsneededattheapplication-leveltoresumecomputation.
AnyadditionalMPIstatebesidestherepairedworldcommunicator,suchassub-communicators,mustbere-createdbytheapplication'sMPIprocesses.
Also,itisexpectedthateachprocessloadsthelatestconsistentcheckpointtocontinuecomputing.
Checkpointinglayswithintheresponsibilityoftheapplicationdeveloper.
Inthenextsection,wediscussthescopeandimplicationsofourimplementation.
Discussion. In this implementation, the scope of fault tolerance is to support recovery from failures happening after MPI_Reinit has been called by all MPI processes. This is because MPI_Reinit must install signal handlers and set the rollback point on all MPI processes. This is sufficient for a large coverage of failures since execution time is dominated by the main computational loop. In case a failure happens before the call to MPI_Reinit, the application falls back to the default action of aborting execution. Nevertheless, the design of Reinit++ is not limited by this implementation choice. A possible approach instead of aborting, which we leave as future work, is to treat any MPI processes that have not called MPI_Reinit as if failed and re-execute them.

Furthermore, signaling SIGREINIT to roll back survivor MPI processes asynchronously interrupts execution. In our implementation, we render the MPI runtime library signal and rollback safe by using masking to defer signal handling until a safe point, i.e., avoiding interruption when locks are held or data structures are being updated. Since application code is out of our control, Reinit++ requires the application developer to program the application to be signal and rollback safe. A possible enhancement is to provide an interface for installing cleanup handlers, proposed in earlier designs of Reinit [21], so that application and library developers can install routines to reset application-level state on recovery. Another approach is to make recovery synchronous, by extending the Reinit++ interface to include a function that tests whether a fault has been detected and triggers rollback. The developer may call this function at safe points during execution for recovery. We leave both those enhancements as future work, noting that the existing interface is sufficient for performing our evaluation.
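Deferring the rollback signal around critical regions, as described above, can be expressed with standard POSIX signal masking; the following is a minimal sketch of the idea, independent of the actual Reinit++ internals, again using SIGUSR1 for SIGREINIT.

#include <signal.h>
#include <pthread.h>

/* Sketch: block SIGUSR1 (SIGREINIT) while runtime-internal locks are held
 * or shared data structures are being updated, so rollback can only be
 * delivered at a safe point. */
static void critical_update(void)
{
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);

    pthread_sigmask(SIG_BLOCK, &block, &old);   /* enter unsafe region */
    /* ... acquire locks, mutate runtime data structures ... */
    pthread_sigmask(SIG_SETMASK, &old, NULL);   /* safe point: a pending
                                                   SIGUSR1 is delivered here */
}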
4 Experimentation Setup

This section provides detailed information on the experimentation setup, the recovery approaches used for comparisons, the proxy applications and their configurations, and the measurement methodology.

Table 1. Proxy applications and their configuration

  Application   Input                             No. ranks
  CoMD          -i4 -j2 -k2 -x80 -y40 -z40 -N20   16, 32, 64, 128, 256, 512, 1024
  HPCCG         64 64 64                          16, 32, 64, 128, 256, 512, 1024
  LULESH        -i 20 -s 48                       8, 64, 512
Recovery Approaches. Experimentation includes the following recovery approaches:

– CR, which implements the typical approach of immediately restarting an application after execution aborts due to a failure.
– ULFM, using its latest revision based on the OpenMPI runtime v4.0.1 (4.0.1ulfm2.1rc1).
– Reinit++, which is our own implementation of Reinit, based on the OpenMPI runtime v4.0.0.

Emulating Failures. Failures are emulated through fault injection. We opt for random fault injection to emulate the occurrence of random faults, e.g., soft errors or failures of hardware components, that lead to a crash failure. Specifically, for process failures, we instrument applications so that at a random iteration of the main computational loop, a random MPI process kills itself by raising the signal SIGKILL. The random selection of iteration and MPI process is the same for every recovery approach. For node failures, the method is similar, but instead of itself, the MPI process sends the signal SIGKILL to its parent daemon, thus killing the daemon and, by extension, all its children processes. In experimentation, we inject a single MPI process failure or a single node failure.
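A sketch of this instrumentation, under the assumption that the failing rank and iteration are chosen once and broadcast from rank 0 so they are identical across recovery approaches, might look as follows; getppid() targets the node-local parent daemon for the node-failure case, and the fixed seed is illustrative.

#include <mpi.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: inject a crash at a pre-agreed (iteration, rank) pair.
 * inject_node_failure selects between killing the process itself and
 * killing its parent daemon (emulating a node failure). */
static int fail_iter, fail_rank, inject_node_failure;

static void setup_fault_injection(int max_iter, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == 0) {
        srand(12345);                      /* fixed seed: same choice per run */
        fail_iter = rand() % max_iter;
        fail_rank = rand() % size;
    }
    MPI_Bcast(&fail_iter, 1, MPI_INT, 0, comm);
    MPI_Bcast(&fail_rank, 1, MPI_INT, 0, comm);
}

static void maybe_inject(int iter, int rank)
{
    if (iter == fail_iter && rank == fail_rank) {
        if (inject_node_failure)
            kill(getppid(), SIGKILL);      /* kill the parent daemon (node failure) */
        else
            raise(SIGKILL);                /* kill this MPI process */
    }
}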
Applications. We experiment with three benchmark applications that represent different HPC domains: CoMD for molecular dynamics, HPCCG for iterative solvers, and LULESH for multi-physics computation. The motivation is to investigate global-restart recovery on a wide range of applications and evaluate any performance differences. Table 1 shows information on the proxy applications and the scaling of their deployed number of ranks. Note LULESH requires a cube number of ranks, thus the trimmed-down experimentation space. The deployment configuration has 16 ranks per node, so the smallest deployment comprises one node while the largest one spans 64 nodes (1024 ranks). Applications execute in weak scaling mode – for CoMD we show its input only for 16 ranks and change it accordingly. We extend applications to implement global-restart with Reinit++ or ULFM, to store a checkpoint after every iteration of their main computational loop, and to load the latest checkpoint upon recovery.

Checkpointing. For evaluation purposes, we implement our own simple checkpointing library that supports saving and loading application data using in-memory and file checkpoints. Table 2 summarizes checkpointing per recovery approach and failure type. In detail, we implement two types of checkpointing: file and memory. For file checkpointing, each MPI process stores a checkpoint to globally accessible permanent storage, which is the networked, parallel filesystem Lustre available in our cluster. For memory checkpointing, an MPI process stores a checkpoint both locally in its own memory and remotely in the memory of a buddy [33,34] MPI process, which in our implementation is the (cyclically) next MPI process by rank. This memory checkpointing implementation is applicable only to single process failures since multiple process failures or a node failure can wipe out both the local and buddy checkpoints of the failed MPI processes. CR necessarily uses file checkpointing since re-deploying the application requires permanent storage to retrieve checkpoints.
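The buddy scheme described above amounts to one exchange per checkpoint: each rank keeps its own checkpoint locally and ships a copy to the cyclically next rank, so a single failed process can recover its data from its buddy. The following is an illustrative sketch, not the authors' checkpointing library; buffer contents and sizes are placeholders.

#include <mpi.h>
#include <string.h>

/* Sketch of buddy (in-memory) checkpointing: rank r stores its checkpoint
 * locally and sends a copy to rank (r+1) % size, while receiving and holding
 * the checkpoint of rank (r-1+size) % size on that rank's behalf. */
#define CKPT_WORDS 1024

static double local_ckpt[CKPT_WORDS];   /* this rank's latest checkpoint  */
static double buddy_ckpt[CKPT_WORDS];   /* copy held for the previous rank */

void buddy_checkpoint(const double *state, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    memcpy(local_ckpt, state, sizeof(local_ckpt));   /* local copy */
    /* Remote copy: send own checkpoint to the buddy, receive the previous
       rank's checkpoint to store for it. */
    MPI_Sendrecv(local_ckpt, CKPT_WORDS, MPI_DOUBLE, next, 0,
                 buddy_ckpt,  CKPT_WORDS, MPI_DOUBLE, prev, 0,
                 comm, MPI_STATUS_IGNORE);
}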
Table 2. Checkpointing per recovery approach and failure type

  Failure    CR     ULFM     Reinit
  process    File   Memory   Memory
  node       File   File     File

Statistical Evaluation. For each proxy application and configuration we perform 10 independent measurements. Each measurement counts the total execution time of the application, breaking it down into the time needed for writing checkpoints, the time spent during MPI recovery, the time reading a checkpoint after a failure, and the pure application time executing the computation. Any confidence intervals shown correspond to a 95% confidence level and are calculated based on the t-distribution to avoid assumptions on the sampled population's distribution.
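Concretely, with n = 10 runs per configuration, an interval of the usual Student-t form would be (our formula, stated here for completeness rather than quoted from the paper):

  \bar{x} \pm t_{0.975,\,n-1} \cdot \frac{s}{\sqrt{n}}, \qquad n = 10,\ t_{0.975,\,9} \approx 2.262

where \bar{x} is the sample mean and s the sample standard deviation of the 10 measurements.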
5 Evaluation

For the evaluation we compare CR, Reinit++ and ULFM for both process and node failures. Results provide insight into the performance of each of those recovery approaches implementing global-restart and reveal the reasons for their performance differences.

5.1 Comparing Total Execution Time on a Process Failure

Figure 4 shows average total execution time for process failures using file checkpointing for CR and memory checkpointing for Reinit++ and ULFM. The plot breaks down time into the components of writing checkpoints, MPI recovery, and pure application time. Reading checkpoints occurs one-off after a failure and has negligible impact, on the order of tens of milliseconds, thus it is omitted. The first observation is that Reinit++ scales excellently compared to both CR and ULFM, across all programs. CR has the worst performance, increasingly so with more ranks. The reason is the limited scaling of writing checkpoints to the networked filesystem. By contrast, ULFM and Reinit++ use memory checkpointing, spending minimal time writing checkpoints. Interestingly, ULFM scales worse than Reinit++; we believe the reason is that it inflates pure application execution time, which we illustrate in the next section. Further, in the following sections, we remove checkpointing overhead from the analysis to highlight the performance differences of the different recovery approaches.
Fig. 4. Total execution time breakdown recovering from a process failure ((a) CoMD, (b) HPCCG, (c) LULESH)

5.2 Comparing Pure Application Time Under Different Recovery Approaches

Figure 5 shows the pure application time, without including reading/writing checkpoints or MPI recovery. We observe that application time is on par for CR and Reinit++, and that all applications weak-scale well on up to 1024 ranks. CR and Reinit++ do not interfere with execution, thus they have no impact on application time, which is on par with the fault-free execution time of the proxy applications. However, with ULFM, application time grows significantly as the number of ranks increases. ULFM extends MPI with an always-on, periodic heartbeat mechanism [8] to detect failures and also modifies communication primitives for fault tolerant operation. Following from our measurements, those extensions noticeably increase the original application execution time. However, it is inconclusive whether this is a result of the tested prototype implementation or a systemic trade-off. Next, we compare the MPI recovery times among all the approaches.

Fig. 5. Scaling of pure application time ((a) CoMD, (b) HPCCG, (c) LULESH)
Fig. 6. Scaling of MPI recovery time recovering from a process failure ((a) CoMD, (b) HPCCG, (c) LULESH)

5.3 Comparing MPI Recovery Time Recovering from a Process Failure

Though checkpointing saves the application's computation time, reducing MPI recovery time saves overhead from restarting. This overhead is increasingly important the larger the deployment and the higher the fault rate. In particular, Fig. 6 shows the scaling of the time required for MPI recovery across all programs and recovery approaches, again removing any overhead for checkpointing to focus on the MPI recovery time. As expected, MPI recovery time depends only on the number of ranks, thus times are similar among different programs for the same recovery approach. Commenting on scaling, CR and Reinit++ scale excellently, requiring almost constant time for MPI recovery regardless of the number of ranks. However, CR is about 6× slower, requiring around 3 s to tear down execution and re-deploy the application, whereas Reinit++ requires about 0.5 s to propagate the fault, re-initialize survivor processes and re-spawn the failed process. ULFM has on-par recovery time with Reinit++ up to 64 ranks, but then its time increases, becoming up to 3× slower than Reinit++ at 1024 ranks. ULFM requires multiple collective operations among all MPI processes to implement global-restart (shrink the faulty communicator, spawn a new process, merge it into a new communicator). By contrast, Reinit++ implements recovery at the MPI runtime layer, requiring fewer operations and confining collective communication only between root and daemon processes.

Fig. 7. Scaling of MPI recovery time recovering from a node failure ((a) CoMD, (b) HPCCG, (c) LULESH)
4ComparingMPIRecoveryTimeRecoveringfromaNodeFailureThiscomparisonforanodefailureincludesonlyCRandReinit++,sincethepro-totypeimplementationofULFMfacedrobustnessissues(hangingorcrashing)anddidnotproducemeasurements.
Also,sincebothCRandReinit++uselecheckpointinganddonotinterferewithpureapplicationtime,wepresentonlyresultsforMPIrecoverytimes,showninFig.
7.
BothCRandReinit++scaleverywellwithalmostconstanttimes,astheydoforaprocessfailure.
However,inabsolutevalues,Reinit++hasahigherrecoverytimeofabout1.
5sforanodefailurecomparedto0.
5sforaprocessfailure.
ThisisbecauserecoveringfromanodefailurerequiresextraworktoselecttheleastloadednodeandspawnalltheMPIprocessesofthefailednode.
Nevertheless,recoverywithReinit++isstillabout2*fasterthanwithCR.
6 Related Work

Checkpoint-Restart [1,2,10,15,20,27,29,32] is the most common approach to recover an MPI application after a failure. CR requires substantial development effort to identify which data to checkpoint and may have significant overhead. Thus, many efforts attempt to make checkpointing easier to adopt and to render it fast and storage efficient. We briefly discuss them here. Hargrove and Duell [15] implement the system-level CR library Berkeley Lab Checkpoint/Restart (BLCR) to automatically checkpoint applications by extending the Linux kernel. Bosilca et al. [6] integrate an uncoordinated, distributed checkpoint/rollback system in the MPICH runtime to automatically support fault tolerance for node failures. Furthermore, Sankaran et al. [27] integrate the Berkeley Lab BLCR kernel-level C/R into the LAM implementation of MPI. Adam et al. [2], SCR [25], and FTI [3] propose asynchronous, multi-level checkpointing techniques that significantly improve checkpointing performance. Shahzad et al. [28] provide an extensive interface that simplifies the implementation of application-level checkpointing and recovery. Advances in checkpointing are beneficial not only for CR but for other MPI fault tolerance approaches, such as ULFM and Reinit. Though making checkpointing faster resolves this bottleneck, the overhead of re-deploying the full application remains.

ULFM [4,5] is the state-of-the-art MPI fault tolerance approach, pursued by the MPI Fault Tolerance Working Group. ULFM extends MPI with interfaces to shrink or revoke communicators, and fault-tolerant collective consensus. The application developer is responsible for implementing recovery using those operations, choosing the type of recovery best suited for their application. A collection of works on ULFM [9,16–18,21,23,26] has investigated the applicability of ULFM and benchmarked individual operations of it. Bosilca et al. [7,8] and Katti et al. [19] propose efficient fault detection algorithms to integrate with ULFM. Teranishi et al. [31] use spare processes to replace failed processes for local recovery so as to accelerate recovery of ULFM. Even though ULFM gives developers the flexibility to implement any type of recovery, it requires significant developer effort to refactor the application. Also, implementing ULFM has been identified by previous work [14,31] to suffer from scalability issues, as our experimentation shows too. Fenix [13] provides a simplified abstraction layer atop ULFM to implement global-restart recovery. However, we choose to use ULFM directly since it already provides a straightforward, prescribed solution for implementing global-restart.

Reinit [11,22] is an alternative solution that supports only global-restart recovery and provides an easy-to-use interface to developers. Previous designs and implementations of Reinit have limited applicability because they require modifying the job scheduler and its interface with the MPI runtime. We present Reinit++, a new design and implementation of Reinit using the latest OpenMPI runtime, and thoroughly evaluate it. Lastly, Sultana et al. [30] propose MPI Stages to reduce the overhead of global-restart recovery by checkpointing MPI state, so that rolling back does not have to re-create it. While this approach is interesting, it is still in proof-of-concept status. How to maintain consistent checkpoints of MPI state across all MPI processes, and how to do so fast and efficiently, is still an open problem.
7 Conclusion

We have presented Reinit++, a new design and implementation of the global-restart approach of Reinit. Reinit++ recovers from both process and node crash failures, by spawning new processes and mending the world communicator, requiring from the programmer only to provide a rollback point in execution and to have checkpointing in place. Our extensive evaluation comparing with the state-of-the-art approaches Checkpoint-Restart (CR) and ULFM shows that Reinit++ scales excellently as the number of ranks grows, achieving almost constant recovery time, being up to 6× faster than CR and up to 3× faster than ULFM. For future work, we plan to expand Reinit to support more recovery strategies besides global-restart, including shrinking recovery and forward recovery strategies, to maintain its implementation, and to expand the experimentation with more applications and larger deployments.
Acknowledgments. The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 (LLNL-CONF-800061).
References

1. Adam, J., et al.: Transparent high-speed network checkpoint/restart in MPI. In: Proceedings of the 25th European MPI Users' Group Meeting, p. 12 (2018)
2. Adam, J., et al.: Checkpoint/restart approaches for a thread-based MPI runtime. Parallel Comput. 85, 204–219 (2019)
3. Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., Matsuoka, S.: FTI: high performance fault tolerance interface for hybrid systems. In: SC 2011: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12, November 2011. https://doi.org/10.1145/2063384.2063427
4. Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.: Post-failure recovery of MPI communication capability: design and rationale. Int. J. High Performance Comput. Appl. 27(3), 244–254 (2013)
5. Bland, W., Lu, H., Seo, S., Balaji, P.: Lessons learned implementing user-level failure mitigation in MPICH. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (2015)
6. Bosilca, G., et al.: MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In: SC 2002: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pp. 29–29. IEEE (2002)
7. Bosilca, G., et al.: Failure detection and propagation in HPC systems. In: SC 2016: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 312–322 (2016)
8. Bosilca, G., et al.: A failure detector for HPC platforms. Int. J. High Performance Comput. Appl. 32(1), 139–158 (2018). https://doi.org/10.1177/1094342017711505
9. Bouteiller, A., Bosilca, G., Dongarra, J.J.: Plan B: interruption of ongoing MPI operations to support failure recovery. In: Proceedings of the 22nd European MPI Users' Group Meeting, p. 11 (2015)
10. Cao, J., et al.: System-level scalable checkpoint-restart for petascale computing. In: 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS) (2016)
11. Chakraborty, S., et al.: EReinit: scalable and efficient fault-tolerance for bulk-synchronous MPI applications. Concurrency and Computation: Practice and Experience, e4863. https://doi.org/10.1002/cpe.4863
12. Dongarra, J., et al.: The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25(1), 3–60 (2011). https://doi.org/10.1177/1094342010391989
13. Gamell, M., Katz, D.S., Kolla, H., Chen, J., Klasky, S., Parashar, M.: Exploring automatic, online failure recovery for scientific applications at extreme scales. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014, pp. 895–906. IEEE Press, Piscataway, NJ, USA (2014). https://doi.org/10.1109/SC.2014.78
14. Gamell, M., et al.: Local recovery and failure masking for stencil-based applications at extreme scales. In: SC 2015: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)
15. Hargrove, P.H., Duell, J.C.: Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In: Journal of Physics: Conference Series, vol. 46, p. 494 (2006)
16. Herault, T., et al.: Practical scalable consensus for pseudo-synchronous distributed systems. In: SC 2015: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)
17. Hori, A., Yoshinaga, K., Herault, T., Bouteiller, A., Bosilca, G., Ishikawa, Y.: Sliding substitution of failed nodes. In: Proceedings of the 22nd European MPI Users' Group Meeting, p. 14. ACM (2015)
18. Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Scalable and fault tolerant failure detection and consensus. In: Proceedings of the 22nd European MPI Users' Group Meeting, p. 13 (2015)
19. Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Epidemic failure detection and consensus for extreme parallelism. Int. J. High Performance Comput. Appl. 32(5), 729–743 (2018)
20. Kohl, N., et al.: A scalable and extensible checkpointing scheme for massively parallel simulations. Int. J. High Performance Comput. Appl. 33(4), 571–589 (2019)
21. Laguna, I., Richards, D.F., Gamblin, T., Schulz, M., de Supinski, B.R.: Evaluating user-level fault tolerance for MPI applications. In: Proceedings of the 21st European MPI Users' Group Meeting, EuroMPI/ASIA 2014, pp. 57:57–57:62. ACM, New York, NY, USA (2014). https://doi.org/10.1145/2642769.2642775
22. Laguna, I., et al.: Evaluating and extending user-level fault tolerance in MPI applications. Int. J. High Performance Comput. Appl. 30(3), 305–319 (2016). https://doi.org/10.1177/1094342015623623
23. Losada, N., Cores, I., Martín, M.J., González, P.: Resilient MPI applications using an application-level checkpointing framework and ULFM. The Journal of Supercomputing 73(1) (2017)
24. Martino, C.D., Kalbarczyk, Z., Iyer, R.K., Baccanico, F., Fullop, J., Kramer, W.: Lessons learned from the analysis of system failures at petascale: the case of Blue Waters. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 610–621, June 2014. https://doi.org/10.1109/DSN.2014.62
25. Mohror, K., Moody, A., Bronevetsky, G., de Supinski, B.R.: Detailed modeling and evaluation of a scalable multilevel checkpointing system. IEEE Trans. Parallel Distrib. Syst. 25(9), 2255–2263 (2014). https://doi.org/10.1109/TPDS.2013.100
26. Pauli, S., Kohler, M., Arbenz, P.: A fault tolerant implementation of multi-level Monte Carlo methods. Parallel Comput.: Acceler. Comput. Sci. Eng. (CSE) 25, 471–480 (2014)
27. Sankaran, S., et al.: The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. JHPCA 19(4), 479–493 (2005)
28. Shahzad, F., Thies, J., Kreutzer, M., Zeiser, T., Hager, G., Wellein, G.: CRAFT: a library for easier application-level checkpoint/restart and automatic fault tolerance. IEEE Trans. Parallel Distrib. Syst. 30(3), 501–514 (2018)
29. Subasi, O., Martsinkevich, T., Zyulkyarov, F., Unsal, O., Labarta, J., Cappello, F.: Unified fault-tolerance framework for hybrid task-parallel message-passing applications. Int. J. High Performance Comput. Appl. 32(5), 641–657 (2018)
30. Sultana, N., Rüfenacht, M., Skjellum, A., Laguna, I., Mohror, K.: Failure recovery for bulk synchronous applications with MPI stages. Parallel Comput. 84, 1–14 (2019). https://doi.org/10.1016/j.parco.2019.02.007
31. Teranishi, K., Heroux, M.A.: Toward local failure local recovery resilience model using MPI-ULFM. In: Proceedings of the 21st European MPI Users' Group Meeting, p. 51 (2014)
32. Wang, Z., Gao, L., Gu, Y., Bao, Y., Yu, G.: A fault-tolerant framework for asynchronous iterative computations in cloud environments. IEEE Trans. Parallel Distrib. Syst. 29(8), 1678–1692 (2018)
33. Zheng, G., Ni, X., Kale, L.V.: A scalable double in-memory checkpoint and restart scheme towards exascale. In: IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp. 1–6, June 2012. https://doi.org/10.1109/DSNW.2012.6264677
34. Zheng, G., Huang, C., Kale, L.V.: Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++. SIGOPS Oper. Syst. Rev. 40(2), 90–99 (2006). https://doi.org/10.1145/1131322.1131340
