Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance

Giorgis Georgakoudis1(B), Luanzheng Guo2, and Ignacio Laguna1
1 Center for Advanced Scientific Computing, Lawrence Livermore National Laboratory, Livermore, USA
{georgakoudis1,lagunaperalt1}@llnl.gov
2 EECS, UC Merced, Merced, USA
lguo4@ucmerced.edu

Abstract.
Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limits checkpoint retrieval to slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application, to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6×, or ULFM, up to 3×, and that it scales excellently as the number of MPI processes grows.
1 Introduction

HPC system performance scales by increasing the number of computing nodes and by increasing the processing and memory elements of each node. Furthermore, electronics continue to shrink, thus are more susceptible to interference, such as radiation upsets or voltage fluctuations. Those trends increase the probability of a failure happening, either due to component failure or due to transient soft errors affecting electronics. Large HPC applications run for hours or days and use most, if not all, the nodes of a supercomputer, thus are vulnerable to failures, often leading to process or node crashes.
Reportedly, the mean time between a node failure on petascale systems has been measured to be 6.7 h [24], while worst-case projections [12] foresee that exascale systems may experience a failure even more frequently.

(L. Guo—Work performed during internship at Lawrence Livermore National Laboratory.)
© Springer Nature Switzerland AG 2020
P. Sadayappan et al. (Eds.): ISC High Performance 2020, LNCS 12151, pp. 536–554, 2020. https://doi.org/10.1007/978-3-030-50743-5_27
HPC applications often implement fault tolerance using checkpoints to restart execution, a method referred to as Checkpoint-Restart (CR). Applications periodically store checkpoints, e.g., every few iterations of an iterative computation, and when a failure occurs, execution aborts and restarts again to resume from the latest checkpoint. Most scalable HPC applications follow the Bulk Synchronous Parallel (BSP) paradigm, hence CR with global, backward, non-shrinking recovery [21], also known as global-restart, naturally fits their execution. CR is straightforward to implement but requires re-deploying the whole application on a failure, re-spawning all processes on every node and re-initializing any application data structures. This method has significant overhead since a failure of a few processes, even a single process failure, requires complete re-deployment, although most of the processes survived the failure.
By contrast, User-level Fault Mitigation (ULFM) [4] extends MPI with interfaces for handling failures at the application level without restarting execution. The programmer is required to use the ULFM extensions to detect a failure and repair communicators, and either spawn new processes, for non-shrinking recovery, or continue execution with any survivor processes, for shrinking recovery. Although ULFM grants the programmer great flexibility to handle failures, it requires considerable effort to refactor the application to correctly and efficiently implement recovery.
Alternatively, Reinit [11, 22] has been proposed as an easier-to-program approach, but equally capable of supporting global-restart recovery. Reinit extends MPI with a function call that sets a rollback point in the application. It transparently implements MPI recovery, by spawning new processes and mending the world communicator at the MPI runtime level. Thus, Reinit transparently ensures a consistent, initial MPI state akin to the state after MPI initialization. However, the existing implementation of Reinit [11] is hard to deploy, since it requires modifications to the job scheduler, and difficult to compare with ULFM, which only requires extensions to the MPI library. Notably, both Reinit and ULFM approaches assume the application has checkpointing in place to resume execution at the application level.
Although there has been a large bibliography [4, 5, 9, 11, 16–18, 21–23, 26] discussing the programming model and prototypes of those approaches, no study has presented an in-depth performance evaluation of them – most previous works either focus on individual aspects of each approach or perform limited-scale experiments. In this paper, we present an extensive evaluation using HPC proxy applications to contrast these two leading global-restart recovery approaches. Specifically, our contributions are:

– A new design and implementation of the Reinit approach, named Reinit++, using the latest OpenMPI runtime. Our design and implementation supports recovery from either process or node failures, is high performance, and deploys easily by extending the OpenMPI library. Notably, we present a precise definition of the failures it handles and the scope of this design and implementation.
– An extensive evaluation of the performance of the possible recovery approaches (CR, Reinit++, ULFM) using three HPC proxy applications (CoMD, LULESH, HPCCG), and including file and in-memory checkpointing schemes.
– New insight from the results of our evaluation, which show that recovery under Reinit++ is up to 6× faster than CR and up to 3× faster than ULFM. Compared to CR, Reinit++ avoids the re-deployment overhead, while compared to ULFM, Reinit++ avoids interference during fault-free application execution and has less recovery overhead.
2 Overview

This section presents an overview of the state-of-the-art approaches for MPI fault tolerance. Specifically, it provides an overview of the recovery models for applications and briefly discusses ULFM and Reinit, which represent the state of the art in MPI fault tolerance.
2.1 Recovery Models for MPI Applications

There are several models for fault tolerance depending on the requirements of the application. Specifically, if all MPI processes must recover after a failure, recovery is global; otherwise, if some, but not all, of the MPI processes need to recover, then recovery is deemed local. Furthermore, applications can either recover by rolling back computation to an earlier point in time, defined as backward recovery, or, if they can continue computation without backtracking, recovery is deemed forward. Moreover, if recovery restores the number of MPI processes to resume execution, it is defined as non-shrinking, whereas if execution continues with whatever number of processes survive the failure, then recovery is characterized as shrinking. Global-restart implements global, backward, non-shrinking recovery, which fits most HPC applications that follow a bulk-synchronous paradigm where MPI processes have interlocked dependencies; thus it is the focus of this work.
2.2 Existing Approaches for MPI Fault Tolerance

ULFM. One of the state-of-the-art approaches for fault tolerance in MPI is User-level Fault Mitigation (ULFM) [4]. ULFM extends MPI to enable failure detection at the application level and provides a set of primitives for handling recovery. Specifically, ULFM taps into the existing error handling interface of MPI to implement user-level fault notification. Regarding its extensions to the MPI interface, we elaborate on communicators since their extensions are a superset of those for other communication objects (windows, I/O). ULFM extends MPI with a revoke operation (MPI_Comm_revoke(comm)) to invalidate a communicator such that any subsequent operation on it raises an error. Also, it defines a shrink operation (MPI_Comm_shrink(comm, newcomm)) that creates a new communicator from an existing one after excluding any failed processes. Additionally, ULFM defines a collective agreement operation (MPI_Comm_agree(comm, flag)) which achieves consensus on the group of failed processes in a communicator and on the value of the integer variable flag.

Based on those extensions, MPI programmers are expected to implement their own recovery strategy tailored to their applications. ULFM operations are general enough to implement any type of recovery discussed earlier. However, this generality comes at the cost of complexity. Programmers need to understand the intricate semantics of those operations to correctly and efficiently implement recovery and to restructure, possibly significantly, the application for explicitly handling failures. Although ULFM provides examples that prescribe the implementation of global-restart, the programmer must embed this in the code and refactor the application to function with the expectation that communicators may change during execution due to shrinking and merging, which is not ideal.
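To make the degree of refactoring concrete, the sketch below outlines a shrink-and-respawn repair routine in the spirit of ULFM's prescribed global-restart examples. It is our illustration rather than code from the paper or the ULFM distribution: the revoke/shrink calls follow the paper's naming, whereas actual ULFM releases expose them with an MPIX_ prefix, and all error checking is omitted.

/* Illustrative ULFM-style global-restart repair (hedged sketch).
 * On an error, survivors revoke and shrink the broken world communicator,
 * re-spawn the missing ranks, and merge them back into a full-size world. */
static void repair_world(MPI_Comm *world, int nprocs_expected, const char *binary)
{
    MPI_Comm shrunk, intercomm, merged;
    int nsurvivors, nfailed;

    MPI_Comm_revoke(*world);              /* invalidate the communicator for everyone */
    MPI_Comm_shrink(*world, &shrunk);     /* keep only survivor processes */
    MPI_Comm_size(shrunk, &nsurvivors);
    nfailed = nprocs_expected - nsurvivors;

    /* Survivors collectively re-spawn the failed ranks... */
    MPI_Comm_spawn(binary, MPI_ARGV_NULL, nfailed, MPI_INFO_NULL, 0,
                   shrunk, &intercomm, MPI_ERRCODES_IGNORE);
    /* ...and merge them into a new intra-communicator used as the world. */
    MPI_Intercomm_merge(intercomm, 0, &merged);
    *world = merged;
    /* Every rank must then roll back to its latest checkpoint at the application level. */
}

Even this minimal routine has to be threaded through the whole application, re-spawned ranks must rendezvous with the survivors, and any derived communicators must be rebuilt, which is exactly the refactoring burden noted above.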
Reinit. Reinit [11, 22] has been proposed as an alternative approach for implementing global-restart recovery, through a simpler interface compared to ULFM. The most recent implementation [11] of Reinit is limited in several aspects: (1) it requires modifying the job scheduler (SLURM), besides the MPI runtime, thus it is impractical to deploy and skews performance measurements due to crossing the interface between the job scheduler and the MPI runtime; (2) its implementation is not publicly available; (3) it is based on the MVAPICH2 MPI runtime, which makes comparisons with ULFM hard, since ULFM is implemented on the OpenMPI runtime. Thus, we opt for a new design and implementation¹, named Reinit++, which we present in detail in the next section.
3 Reinit++

This section describes the programming interface of Reinit++, the assumptions for application deployment, process and node failure detection, and the recovery algorithm for global-restart. We also define the semantics of MPI recovery for the implementation of Reinit++, as well as discuss its specifics.
3.1 Design

Programming Interface of Reinit++. Figure 1 presents the programming interface of Reinit++ in the C language, while Fig. 2 shows sample usage of it. There is a single function call, MPI_Reinit, for the programmer to call to define the point in code to roll back to and resume execution from after a failure. This function must be called after MPI_Init to ensure the MPI runtime has been initialized. Its arguments imitate the parameters of MPI_Init, adding a parameter for a pointer to a user-defined function. Reinit++ expects the programmer to encapsulate in this function the main computational loop of the application, which is restartable through checkpointing. Internally, MPI_Reinit passes the parameters argc and argv to this user-defined function, plus the parameter state, which indicates the MPI state of the process as values from the enumeration type MPI_Reinit_state_t. Specifically, the value MPI_REINIT_NEW designates a new process executing for the first time, the value MPI_REINIT_REINITED designates a survivor process that has entered the user-defined function after rolling back due to a failure, and the value MPI_REINIT_RESTARTED designates that the process has failed and has been re-spawned to resume execution. Note that this state variable describes only the MPI state of Reinit++, thus it has no semantics on the application state, such as whether to load a checkpoint or not.

¹ Available open source at https://github.com/ggeorgakoudis/ompi/tree/reinit.

typedef enum {
    MPI_REINIT_NEW,
    MPI_REINIT_REINITED,
    MPI_REINIT_RESTARTED
} MPI_Reinit_state_t;

typedef int (*MPI_Restart_point)(int argc, char **argv,
                                 MPI_Reinit_state_t state);

int MPI_Reinit(int argc, char **argv, const MPI_Restart_point point);

Fig. 1. The programming interface of Reinit++

int foo(int argc, char **argv, MPI_Reinit_state_t state)
{
    /* Load checkpoint if it exists */
    while (!done) {
        /* Do computation */
        /* Store checkpoint */
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Application-specific initialization */
    // Entry point of the resilient function
    MPI_Reinit(&argc, &argv, foo);
    MPI_Finalize();
}

Fig. 2. Sample usage of the interface of Reinit++
Application Deployment Model. Reinit++ assumes a logical, hierarchical topology of application deployment. Figure 3 shows a graphical representation of this deployment model. At the top level, there is a single root process that spawns and monitors daemon processes, one on each of the computing nodes reserved for the application. Daemons spawn and monitor MPI processes local to their nodes. The root communicates with daemons and keeps track of their liveness, while daemons track the liveness of their children MPI processes. Based on this execution and deployment model, Reinit++ performs fault detection, which we discuss next.

Fig. 3. Application deployment model (a root process connected to daemons D1…Dn, each managing its node's local MPI processes P1…Pm)
Fault Detection. Reinit++ targets fail-stop failures of either MPI processes or daemons. A daemon failure is deemed equivalent to a node failure. The causes for those failures may be transient faults or hard faults of hardware components. In the design of Reinit++, the root manages the execution of the whole application, so any recovery decisions are taken by it, hence it is the focal point for fault detection. Specifically, if an MPI process fails, its managing daemon is notified of the failure and forwards this notification to the root, without taking any action itself. If a daemon process fails, which means either the node failed or the daemon process itself, the root directly detects the failure and also assumes that the children MPI processes of that daemon are lost too. After detecting a fault, the root process proceeds with recovery, which we introduce in the following section.
MPI Recovery. Reinit++ recovery for both MPI process and daemon failures is similar, except that on a daemon failure the root chooses a new host node to re-instate failed MPI processes, since a daemon failure proxies a node failure. For recovery, the root process broadcasts a reinit message to all daemons. Daemons receiving that message roll back survivor processes and re-spawn failed ones. After rolling back survivor MPI processes and spawning new ones, the semantics of MPI recovery are that only the world communicator is valid and any previous MPI state (other communicators, windows, etc.) has been discarded. This is similar to the MPI state available immediately after an application calls MPI_Init. Next, the application restores its state, as discussed in the following section.
Application Recovery. Reinit++ assumes that applications are responsible for saving and restoring their state to resume execution. Hence, both survivor and re-spawned MPI processes should load a valid checkpoint after MPI recovery to restore application state and resume computation.
3.2 Implementation

We implement Reinit++ in the latest OpenMPI runtime, version 4.0.0. The implementation supports recovery from both process and daemon (node) failures. This implementation does not presuppose any particular job scheduler, so it is compatible with any job scheduler the OpenMPI runtime works with. Introducing briefly the OpenMPI software architecture, it comprises three frameworks of distinct functionality: (i) the OpenMPI MPI layer (OMPI), which implements the interface of the MPI specification used by application developers; (ii) the OpenMPI Runtime Environment (ORTE), which implements runtime functions for application deployment, execution monitoring, and fault detection; and (iii) the Open Portable Access Layer (OPAL), which implements abstractions of OS interfaces, such as signal handling, process creation, etc.

Reinit++ extends OMPI to provide the function MPI_Reinit. It extends ORTE to propagate fault notifications from daemons to the root and to implement the mechanism of MPI recovery on detecting a fault. Also, Reinit++ extends OPAL to implement low-level process signaling for notifying survivor processes to roll back. The following sections provide more details.
Application Deployment. Reinit++ requires the application to deploy using the default launcher of OpenMPI, mpirun. Note that using the launcher mpirun is compatible with any job scheduler and even uses optimized deployment interfaces, if the scheduler provides any. Physical application deployment in OpenMPI closely follows the logical model of the design of Reinit++. Specifically, OpenMPI sets the root of the deployment at the process launching mpirun, typically on a login node of HPC installations, which is deemed the Head Node Process (HNP) in OpenMPI terminology. Then, the root launches an ORTE daemon on each node allocated for the application. Daemons spawn the set of MPI processes on each node and monitor their execution. The root process communicates with each daemon over a channel of a reliable network transport and monitors the liveness of daemons through the existence of this channel.

Launching an application, the user specifies the number of MPI processes and optionally the number of nodes (or number of processes per node). To withstand process failures, this specification of deployment is sufficient, since Reinit++ re-spawns failed processes on their original node of deployment. However, for node failures, the user must over-provision the allocated process slots for re-spawning the set of MPI processes lost due to a failed node. The most straightforward way to do so is to allocate more nodes than required for fault-free operation, up to the maximum number of node failures to withstand.
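For illustration only (the paper does not give a concrete command line), such an over-provisioned launch with OpenMPI's mpirun could look like the following, which runs 1024 ranks at 16 ranks per node while the job allocation holds one spare node beyond the 64 nodes that are filled:

# assumes a job allocation of 65 nodes; 64 are filled, 1 is a spare for a node failure
mpirun -np 1024 --map-by ppr:16:node ./proxy_app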
Fault Detection. In OpenMPI, a daemon is the parent of the MPI processes on its node. If an MPI process crashes, its parent daemon is notified, by trapping the signal SIGCHLD, in POSIX semantics. Implementing the fault detection requirements of Reinit++, a daemon relays the fault notification to the root process for taking action. Regarding node failures, the root directly detects them, proxied through daemon failures. Specifically, the root has an open communication channel with each daemon over some reliable transport, e.g., TCP. If the connection over that communication channel breaks, the root process is notified of the failure and regards the daemon as faulty, thus assuming all its children MPI processes are lost and its host node is unavailable. For both types of failures (process and node), the root process initiates MPI recovery.

Algorithm 1: Root: Handle Failure
Data: D: the set of daemons; Children(x): returns the set of children MPI processes of daemon x; Parent(x): returns the parent daemon of MPI process x
Input: The failed process f (MPI process or daemon)
    // failed process is a daemon
    if f ∈ D then
        D ← D \ {f}
        d ← argmin_{d' ∈ D} |Children(d')|
        // broadcast REINIT to all daemons
        Broadcast to D the message ⟨REINIT, {⟨d, c⟩ | c ∈ Children(f)}⟩
    // failed process is an MPI process
    else
        Broadcast to D the message ⟨REINIT, {⟨Parent(f), f⟩}⟩
    end
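To make the daemon-side detection concrete, the fragment below shows the classic POSIX pattern the text refers to: the daemon traps SIGCHLD and reaps the dead child to learn which MPI process failed. This is an illustrative sketch, not code from the Reinit++ implementation, and notify_root_of_failure is a hypothetical hook.

#include <signal.h>
#include <sys/wait.h>

/* Hypothetical hook: forward the failed child's pid to the root process. */
extern void notify_root_of_failure(pid_t failed_pid);

static void on_sigchld(int sig)
{
    (void)sig;
    int status;
    pid_t pid;
    /* Reap every terminated child; a killed or crashed MPI process shows up here. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFSIGNALED(status) || (WIFEXITED(status) && WEXITSTATUS(status) != 0))
            notify_root_of_failure(pid);   /* relay the fault, take no local action */
    }
}

/* Installed once at daemon startup. */
static void install_child_monitor(void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    sigaction(SIGCHLD, &sa, NULL);
}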
MPI Recovery. Algorithm 1 shows in pseudocode the operation of the root process when handling a failure. On detecting a failure, the root process distinguishes whether it is a faulty daemon or MPI process. For a node failure, the root selects the least loaded node in the resource allocation, that is, the node with the fewest occupied process slots, and sets this node's daemon as the parent daemon for failed processes. For a process failure, the root selects the original parent daemon of the failed process to re-spawn that process. Next, the root process initiates recovery by broadcasting to all daemons a message with the REINIT command and the list of processes to spawn, along with their selected parent daemons. Following, when a daemon receives that message, it signals its survivor children MPI processes to roll back, and re-spawns any processes in the list that have this daemon as their parent. Algorithm 2 presents this procedure in pseudocode.
Regarding the asynchronous, signaling interface of Reinit++, Algorithm 3 illustrates the internals of Reinit++ in pseudocode. When an MPI process executes MPI_Reinit, it installs a signal handler for the signal SIGREINIT, which aliases SIGUSR1 in our implementation. Also, MPI_Reinit sets a non-local goto point using the POSIX function setjmp(). The signal handler of SIGREINIT simply calls longjmp() to return execution of survivor processes to this goto point. Rolled-back survivor processes discard any previous MPI state and block on an ORTE-level barrier. This barrier replicates the implicit barrier present in MPI_Init to synchronize with re-spawned processes joining the computation. After the barrier, survivor processes re-initialize the world communicator and call the function foo to resume computation. Re-spawned processes initialize the world communicator as part of the MPI initialization procedure of MPI_Init and go through MPI_Reinit to install the signal handler, set the goto point, and lastly call the user-defined function to resume computation.

Algorithm 2: Daemon d: Handle Reinit
Data: Children(x): returns the set of children MPI processes of daemon x; Parent(x): returns the parent daemon of MPI process x
Input: List {⟨di, ci⟩}
    // Signal survivor MPI processes
    for c ∈ Children(d) do
        c.state ← MPI_REINIT_REINITED
        Signal SIGREINIT to c
    end
    // Spawn new process if d is parent
    foreach ⟨di, ci⟩ do
        if d == di then
            Children(d) ← Children(d) ∪ {ci}
            ci.state ← MPI_REINIT_RESTARTED
            Spawn ci
        end
    end
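The rollback mechanism described above follows the familiar signal-handler-plus-non-local-goto pattern. The fragment below is a minimal, self-contained sketch of that pattern, not the actual Reinit++ sources; it reuses the declarations of Fig. 1, assumes SIGREINIT aliases SIGUSR1 as stated, and uses the sigsetjmp/siglongjmp variants, the signal-safe counterparts of the setjmp()/longjmp() calls mentioned in the text.

#include <setjmp.h>
#include <signal.h>

#define SIGREINIT SIGUSR1           /* alias used by the implementation */

static sigjmp_buf rollback_point;   /* non-local goto target set in MPI_Reinit */

static void on_sigreinit(int sig)
{
    (void)sig;
    siglongjmp(rollback_point, 1);  /* unwind a survivor back to the rollback point */
}

int MPI_Reinit(int argc, char **argv, const MPI_Restart_point point)
{
    signal(SIGREINIT, on_sigreinit);            /* install the rollback handler */

    if (sigsetjmp(rollback_point, 1) != 0) {
        /* Survivor rolled back by the daemon's signal: discard old MPI state,
         * wait on the runtime-level barrier, re-initialize the world
         * communicator (all elided here), then re-enter the restart function. */
        return point(argc, argv, MPI_REINIT_REINITED);
    }
    /* First pass on this process; a re-spawned process would receive
     * MPI_REINIT_RESTARTED instead (elided). */
    return point(argc, argv, MPI_REINIT_NEW);
}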
Application Recovery. Application recovery includes the actions needed at the application level to resume computation. Any additional MPI state besides the repaired world communicator, such as sub-communicators, must be re-created by the application's MPI processes. Also, it is expected that each process loads the latest consistent checkpoint to continue computing. Checkpointing lies within the responsibility of the application developer. In the next section, we discuss the scope and implications of our implementation.
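As a hypothetical illustration of that responsibility, the body of the restart function might re-derive its sub-communicators from the repaired world communicator and reload application data before resuming; the helper names below (checkpoint_exists, load_latest_checkpoint, run_solver) are placeholders, not part of Reinit++.

/* Hypothetical restart body: rebuild derived MPI state, then restore data. */
int solver_main(int argc, char **argv, MPI_Reinit_state_t state)
{
    int rank;
    MPI_Comm row_comm;   /* example sub-communicator discarded by MPI recovery */

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_split(MPI_COMM_WORLD, rank / 16, rank, &row_comm); /* re-create it */

    if (checkpoint_exists())
        load_latest_checkpoint();   /* application-level checkpoint restore */

    run_solver(row_comm);           /* main computational loop, storing checkpoints */
    MPI_Comm_free(&row_comm);
    return 0;
}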
Discussion. In this implementation, the scope of fault tolerance is to support recovery from failures happening after MPI_Reinit has been called by all MPI processes. This is because MPI_Reinit must install signal handlers and set the rollback point on all MPI processes. This is sufficient for a large coverage of failures since execution time is dominated by the main computational loop. In the case a failure happens before the call to MPI_Reinit, the application falls back to the default action of aborting execution. Nevertheless, the design of Reinit++ is not limited by this implementation choice. A possible approach instead of aborting, which we leave as future work, is to treat any MPI processes that have not called MPI_Reinit as if failed and re-execute them.
Algorithm 3: Reinit++ internals
    Function OnSignalReinit():
        goto Rollback
    end
    Function MPIReinit(argc, argv, foo):
        Install signal handler OnSignalReinit on SIGREINIT
    Rollback:
        if this.state == MPI_REINIT_REINITED then
            Discard MPI state
            Wait on barrier
            Re-initialize world communicator
        end
        return foo(argc, argv, this.state)
    end

Furthermore, signaling SIGREINIT for rolling back survivor MPI processes asynchronously interrupts execution. In our implementation, we render the MPI runtime library signal- and rollback-safe by using masking to defer signal handling until a safe point, i.e., to avoid interruption when locks are held or data structures are being updated. Since application code is out of our control, Reinit++ requires the application developer to program the application to be signal- and rollback-safe. A possible enhancement is to provide an interface for installing cleanup handlers, proposed in earlier designs of Reinit [21], so that application and library developers can install routines to reset application-level state on recovery. Another approach is to make recovery synchronous, by extending the Reinit++ interface to include a function that tests whether a fault has been detected and triggers rollback. The developer may call this function at safe points during execution for recovery. We leave both those enhancements as future work, noting that the existing interface is sufficient for performing our evaluation.
4 Experimentation Setup

This section provides detailed information on the experimentation setup, the recovery approaches used for comparisons, the proxy applications and their configurations, and the measurement methodology.

Table 1. Proxy applications and their configuration

Application | Input                           | No. ranks
CoMD        | -i4 -j2 -k2 -x80 -y40 -z40 -N20 | 16, 32, 64, 128, 256, 512, 1024
HPCCG       | 64 64 64                        | 16, 32, 64, 128, 256, 512, 1024
LULESH      | -i 20 -s 48                     | 8, 64, 512
Recovery Approaches. Experimentation includes the following recovery approaches:

– CR, which implements the typical approach of immediately restarting an application after execution aborts due to a failure.
– ULFM, by using its latest revision based on the OpenMPI runtime v4.0.1 (4.0.1ulfm2.1rc1).
– Reinit++, which is our own implementation of Reinit, based on the OpenMPI runtime v4.0.0.
Emulating Failures. Failures are emulated through fault injection. We opt for random fault injection to emulate the occurrence of random faults, e.g., soft errors or failures of hardware components, that lead to a crash failure. Specifically, for process failures, we instrument applications so that at a random iteration of the main computational loop, a random MPI process kills itself by raising the signal SIGKILL. The random selection of iteration and MPI process is the same for every recovery approach. For node failures, the method is similar, but instead of itself, the MPI process sends the signal SIGKILL to its parent daemon, thus killing the daemon and, by extension, all its children processes. In experimentation, we inject a single MPI process failure or a single node failure.
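A minimal sketch of such instrumentation, with the failing iteration and rank assumed to be chosen up front (the paper does not list its injection code), could be:

#include <signal.h>
#include <unistd.h>
#include <mpi.h>

/* Hypothetical per-iteration hook used to emulate crash failures. */
static void maybe_inject_failure(int iteration, int fail_iteration,
                                 int fail_rank, int emulate_node_failure)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (iteration != fail_iteration || rank != fail_rank)
        return;

    if (emulate_node_failure)
        kill(getppid(), SIGKILL);   /* kill the parent daemon: emulates a node failure */
    else
        raise(SIGKILL);             /* kill this MPI process: emulates a process failure */
}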
Applications. We experiment with three benchmark applications that represent different HPC domains: CoMD for molecular dynamics, HPCCG for iterative solvers, and LULESH for multi-physics computation. The motivation is to investigate global-restart recovery on a wide range of applications and evaluate any performance differences. Table 1 shows information on the proxy applications and the scaling of their deployed number of ranks. Note LULESH requires a cube number of ranks, thus the trimmed-down experimentation space. The deployment configuration has 16 ranks per node, so the smallest deployment comprises one node while the largest one spans 64 nodes (1024 ranks). Applications execute in weak scaling mode – for CoMD we show its input only for 16 ranks and change it accordingly. We extend the applications to implement global-restart with Reinit++ or ULFM, to store a checkpoint after every iteration of their main computational loop, and to load the latest checkpoint upon recovery.
Checkpointing. For evaluation purposes, we implement our own, simple checkpointing library that supports saving and loading application data using in-memory and file checkpoints. Table 2 summarizes checkpointing per recovery approach and failure type. In detail, we implement two types of checkpointing: file and memory. For file checkpointing, each MPI process stores a checkpoint to globally accessible permanent storage, which is the networked, parallel file system Lustre available in our cluster. For memory checkpointing, an MPI process stores a checkpoint both locally in its own memory and remotely in the memory of a buddy [33, 34] MPI process, which in our implementation is the (cyclically) next MPI process by rank. This memory checkpointing implementation is applicable only to single process failures since multiple process failures or a node failure can wipe out both the local and the buddy checkpoints for the failed MPI processes. CR necessarily uses file checkpointing since re-deploying the application requires permanent storage to retrieve checkpoints.
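The buddy scheme can be realized with a simple exchange with the cyclically next rank. The fragment below is a rough sketch of that idea, using a blocking MPI_Sendrecv for brevity; it is our illustration, not the checkpointing library used in the paper.

#include <string.h>
#include <mpi.h>

/* Keep a local copy of the checkpoint and mirror it to the buddy rank
 * ((rank + 1) % size); the buddy stores our data, we store our predecessor's. */
static void buddy_checkpoint(const void *data, int nbytes,
                             void *local_copy, void *buddy_copy, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    memcpy(local_copy, data, nbytes);           /* local in-memory checkpoint */

    int buddy = (rank + 1) % size;              /* rank that stores our remote copy */
    int prev  = (rank - 1 + size) % size;       /* rank whose copy we store */
    MPI_Sendrecv(data, nbytes, MPI_BYTE, buddy, 0,
                 buddy_copy, nbytes, MPI_BYTE, prev, 0,
                 comm, MPI_STATUS_IGNORE);
}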
Table 2. Checkpointing per recovery approach and failure type

Failure | CR   | ULFM   | Reinit
process | File | Memory | Memory
node    | File | File   | File

Statistical Evaluation. For each proxy application and configuration we perform 10 independent measurements. Each measurement counts the total execution time of the application, breaking it down into the time needed for writing checkpoints, the time spent during MPI recovery, the time reading a checkpoint after a failure, and the pure application time executing the computation. Any confidence intervals shown correspond to a 95% confidence level and are calculated based on the t-distribution to avoid assumptions on the sampled population's distribution.
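For reference, the reported interval for the mean of the n = 10 runs presumably takes the usual form (our restatement, not a formula from the paper): $\bar{x} \pm t_{0.975,\,n-1}\, s/\sqrt{n}$, where $\bar{x}$ is the sample mean, $s$ the sample standard deviation, and $t_{0.975,\,n-1}$ the Student t quantile with $n-1$ degrees of freedom.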
5 Evaluation

For the evaluation we compare CR, Reinit++ and ULFM for both process and node failures. Results provide insight on the performance of each of those recovery approaches implementing global-restart and reveal the reasons for their performance differences.

5.1 Comparing Total Execution Time on a Process Failure

Figure 4 shows average total execution time for process failures using file checkpointing for CR and memory checkpointing for Reinit++ and ULFM. The plot breaks down time into the components of writing checkpoints, MPI recovery, and pure application time. Reading checkpoints occurs one-off after a failure and has negligible impact, in the order of tens of milliseconds, thus it is omitted.

The first observation is that Reinit++ scales excellently compared to both CR and ULFM, across all programs. CR has the worst performance, increasingly so with more ranks. The reason is the limited scaling of writing checkpoints to the networked file system. By contrast, ULFM and Reinit++ use memory checkpointing, spending minimal time writing checkpoints. Interestingly, ULFM scales worse than Reinit++; we believe that the reason is that it inflates pure application execution time, which we illustrate in the next section. Further, in the following sections, we remove checkpointing overhead from the analysis to highlight the performance differences of the different recovery approaches.
Fig. 4. Total execution time breakdown recovering from a process failure (panels: (a) CoMD, (b) HPCCG, (c) LULESH)

5.2 Comparing Pure Application Time Under Different Recovery Approaches

Figure 5 shows the pure application time, without including reading/writing checkpoints or MPI recovery.
We observe that application time is on par for CR and Reinit++, and that all applications weak-scale well up to 1024 ranks. CR and Reinit++ do not interfere with execution, thus they have no impact on application time, which is on par with the fault-free execution time of the proxy applications. However, in ULFM, application time grows significantly as the number of ranks increases. ULFM extends MPI with an always-on, periodic heartbeat mechanism [8] to detect failures and also modifies communication primitives for fault-tolerant operation. Following from our measurements, those extensions noticeably increase the original application execution time. However, it is inconclusive whether this is a result of the tested prototype implementation or a systemic trade-off. Next, we compare the MPI recovery times among all the approaches.

Fig. 5. Scaling of pure application time (panels: (a) CoMD, (b) HPCCG, (c) LULESH)
Fig. 6. Scaling of MPI recovery time recovering from a process failure (panels: (a) CoMD, (b) HPCCG, (c) LULESH)
5.3 Comparing MPI Recovery Time Recovering from a Process Failure

Though checkpointing saves the application's computation time, reducing MPI recovery time saves overhead from restarting. This overhead is increasingly important the larger the deployment and the higher the fault rate. In particular, Fig. 6 shows the scaling of the time required for MPI recovery across all programs and recovery approaches, again removing any overhead for checkpointing to focus on the MPI recovery time. As expected, MPI recovery time depends only on the number of ranks, thus times are similar among different programs for the same recovery approach. Commenting on scaling, CR and Reinit++ scale excellently, requiring almost constant time for MPI recovery regardless of the number of ranks. However, CR is about 6× slower, requiring around 3 s to tear down execution and re-deploy the application, whereas Reinit++ requires about 0.5 s to propagate the fault, re-initialize survivor processes and re-spawn the failed process. ULFM has on-par recovery time with Reinit++ up to 64 ranks, but then its time increases, being up to 3× slower than Reinit++ for 1024 ranks.
ULFM requires multiple collective operations among all MPI processes to implement global-restart (shrink the faulty communicator, spawn a new process, merge it into a new communicator). By contrast, Reinit++ implements recovery at the MPI runtime layer, requiring fewer operations and confining collective communication to only the root and daemon processes.

Fig. 7. Scaling of MPI recovery time recovering from a node failure (panels: (a) CoMD, (b) HPCCG, (c) LULESH)
5.4 Comparing MPI Recovery Time Recovering from a Node Failure

This comparison for a node failure includes only CR and Reinit++, since the prototype implementation of ULFM faced robustness issues (hanging or crashing) and did not produce measurements. Also, since both CR and Reinit++ use file checkpointing and do not interfere with pure application time, we present only results for MPI recovery times, shown in Fig. 7. Both CR and Reinit++ scale very well with almost constant times, as they do for a process failure. However, in absolute values, Reinit++ has a higher recovery time of about 1.5 s for a node failure compared to 0.5 s for a process failure. This is because recovering from a node failure requires extra work to select the least loaded node and spawn all the MPI processes of the failed node. Nevertheless, recovery with Reinit++ is still about 2× faster than with CR.
6 Related Work

Checkpoint-Restart [1, 2, 10, 15, 20, 27, 29, 32] is the most common approach to recover an MPI application after a failure. CR requires substantial development effort to identify which data to checkpoint and may have significant overhead. Thus, many efforts attempt to make checkpointing easier to adopt and render it fast and storage efficient. We briefly discuss them here.

Hargrove and Duell [15] implement the system-level Berkeley Lab Checkpoint/Restart (BLCR) library to automatically checkpoint applications by extending the Linux kernel. Bosilca et al. [6] integrate an uncoordinated, distributed checkpoint/roll-back system in the MPICH runtime to automatically support fault tolerance for node failures. Furthermore, Sankaran et al. [27] integrate the Berkeley Lab BLCR kernel-level C/R into the LAM implementation of MPI. Adam et al. [2], SCR [25], and FTI [3] propose asynchronous, multi-level checkpointing techniques that significantly improve checkpointing performance. Shahzad et al. [28] provide an extensive interface that simplifies the implementation of application-level checkpointing and recovery. Advances in checkpointing are beneficial not only for CR but for other MPI fault tolerance approaches, such as ULFM and Reinit. Though making checkpointing faster resolves this bottleneck, the overhead of re-deploying the full application remains.

ULFM [4, 5] is the state-of-the-art MPI fault tolerance approach, pursued by the MPI Fault Tolerance Working Group. ULFM extends MPI with interfaces to shrink or revoke communicators, and fault-tolerant collective consensus. The application developer is responsible for implementing recovery using those operations, choosing the type of recovery best suited for its application. A collection of works on ULFM [9, 16–18, 21, 23, 26] has investigated the applicability of ULFM and benchmarked individual operations of it. Bosilca et al. [7, 8] and Katti et al. [19] propose efficient fault detection algorithms to integrate with ULFM. Teranishi et al. [31] use spare processes to replace failed processes for local recovery so as to accelerate recovery of ULFM. Even though ULFM gives flexibility to developers to implement any type of recovery, it requires significant developer effort to refactor the application. Also, implementing ULFM has been identified by previous work [14, 31] to suffer from scalability issues, as our experimentation shows too. Fenix [13] provides a simplified abstraction layer atop ULFM to implement global-restart recovery. However, we choose to directly use ULFM since it already provides a straightforward, prescribed solution for implementing global-restart.

Reinit [11, 22] is an alternative solution that supports only global-restart recovery and provides an easy-to-use interface to developers. Previous designs and implementations of Reinit have limited applicability because they require modifying the job scheduler and its interface with the MPI runtime. We present Reinit++, a new design and implementation of Reinit using the latest OpenMPI runtime, and thoroughly evaluate it. Lastly, Sultana et al. [30] propose MPI Stages to reduce the overhead of global-restart recovery by checkpointing MPI state, so that rolling back does not have to re-create it. While this approach is interesting, it is still in proof-of-concept status. How to maintain consistent checkpoints of MPI state across all MPI processes, and doing so fast and efficiently, is still an open problem.
7 Conclusion

We have presented Reinit++, a new design and implementation of the global-restart approach of Reinit. Reinit++ recovers from both process and node crash failures, by spawning new processes and mending the world communicator, requiring from the programmer only to provide a rollback point in execution and to have checkpointing in place. Our extensive evaluation comparing with the state-of-the-art approaches Checkpoint-Restart (CR) and ULFM shows that Reinit++ scales excellently as the number of ranks grows, achieving almost constant recovery time, being up to 6× faster than CR and up to 3× faster than ULFM. For future work, we plan to expand Reinit to support more recovery strategies besides global-restart, including shrinking recovery and forward recovery strategies, to maintain its implementation, and to expand the experimentation with more applications and larger deployments.
Acknowledgments. The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 (LLNL-CONF-800061).
References

1. Adam, J., et al.: Transparent high-speed network checkpoint/restart in MPI. In: Proceedings of the 25th European MPI Users' Group Meeting, p. 12 (2018)
2. Adam, J., et al.: Checkpoint/restart approaches for a thread-based MPI runtime. Parallel Comput. 85, 204–219 (2019)
3. Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., Matsuoka, S.: FTI: high performance fault tolerance interface for hybrid systems. In: SC 2011: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12, November 2011. https://doi.org/10.1145/2063384.2063427
4. Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J.: Post-failure recovery of MPI communication capability: design and rationale. Int. J. High Performance Comput. Appl. 27(3), 244–254 (2013)
5. Bland, W., Lu, H., Seo, S., Balaji, P.: Lessons learned implementing user-level failure mitigation in MPICH. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (2015)
6. Bosilca, G., et al.: MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In: SC 2002: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pp. 29–29. IEEE (2002)
7. Bosilca, G., et al.: Failure detection and propagation in HPC systems. In: SC 2016: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 312–322 (2016)
8. Bosilca, G., et al.: A failure detector for HPC platforms. Int. J. High Performance Comput. Appl. 32(1), 139–158 (2018). https://doi.org/10.1177/1094342017711505
9. Bouteiller, A., Bosilca, G., Dongarra, J.J.: Plan B: interruption of ongoing MPI operations to support failure recovery. In: Proceedings of the 22nd European MPI Users' Group Meeting, p. 11 (2015)
10. Cao, J., et al.: System-level scalable checkpoint-restart for petascale computing. In: 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS) (2016)
11. Chakraborty, S., et al.: EReinit: scalable and efficient fault-tolerance for bulk-synchronous MPI applications. Concurrency and Computation: Practice and Experience, e4863. https://doi.org/10.1002/cpe.4863
12. Dongarra, J., et al.: The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25(1), 3–60 (2011). https://doi.org/10.1177/1094342010391989
13. Gamell, M., Katz, D.S., Kolla, H., Chen, J., Klasky, S., Parashar, M.: Exploring automatic, online failure recovery for scientific applications at extreme scales. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 895–906. SC 2014, IEEE Press, Piscataway, NJ, USA (2014). https://doi.org/10.1109/SC.2014.78
14. Gamell, M., et al.: Local recovery and failure masking for stencil-based applications at extreme scales. In: SC 2015: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)
15. Hargrove, P.H., Duell, J.C.: Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In: Journal of Physics: Conference Series, vol. 46, p. 494 (2006)
16. Herault, T., et al.: Practical scalable consensus for pseudo-synchronous distributed systems. In: SC 2015: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)
17. Hori, A., Yoshinaga, K., Herault, T., Bouteiller, A., Bosilca, G., Ishikawa, Y.: Sliding substitution of failed nodes. In: Proceedings of the 22nd European MPI Users' Group Meeting, p. 14. ACM (2015)
18. Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Scalable and fault tolerant failure detection and consensus. In: Proceedings of the 22nd European MPI Users' Group Meeting, p. 13 (2015)
19. Katti, A., Di Fatta, G., Naughton, T., Engelmann, C.: Epidemic failure detection and consensus for extreme parallelism. Int. J. High Performance Comput. Appl. 32(5), 729–743 (2018)
20. Kohl, N., et al.: A scalable and extensible checkpointing scheme for massively parallel simulations. Int. J. High Performance Comput. Appl. 33(4), 571–589 (2019)
21. Laguna, I., Richards, D.F., Gamblin, T., Schulz, M., de Supinski, B.R.: Evaluating user-level fault tolerance for MPI applications. In: Proceedings of the 21st European MPI Users' Group Meeting, pp. 57:57–57:62. EuroMPI/ASIA 2014, ACM, New York, NY, USA (2014). https://doi.org/10.1145/2642769.2642775
22. Laguna, I., et al.: Evaluating and extending user-level fault tolerance in MPI applications. Int. J. High Performance Comput. Appl. 30(3), 305–319 (2016). https://doi.org/10.1177/1094342015623623
23. Losada, N., Cores, I., Martín, M.J., González, P.: Resilient MPI applications using an application-level checkpointing framework and ULFM. The Journal of Supercomputing 73(1) (2017)
24. Martino, C.D., Kalbarczyk, Z., Iyer, R.K., Baccanico, F., Fullop, J., Kramer, W.: Lessons learned from the analysis of system failures at petascale: the case of Blue Waters. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 610–621, June 2014. https://doi.org/10.1109/DSN.2014.62
25. Mohror, K., Moody, A., Bronevetsky, G., de Supinski, B.R.: Detailed modeling and evaluation of a scalable multilevel checkpointing system. IEEE Trans. Parallel Distrib. Syst. 25(9), 2255–2263 (2014). https://doi.org/10.1109/TPDS.2013.100
26. Pauli, S., Kohler, M., Arbenz, P.: A fault tolerant implementation of multi-level Monte Carlo methods. Parallel Comput. Accel. Comput. Sci. Eng. (CSE) 25, 471–480 (2014)
27. Sankaran, S., et al.: The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. JHPCA 19(4), 479–493 (2005)
28. Shahzad, F., Thies, J., Kreutzer, M., Zeiser, T., Hager, G., Wellein, G.: CRAFT: a library for easier application-level checkpoint/restart and automatic fault tolerance. IEEE Trans. Parallel Distrib. Syst. 30(3), 501–514 (2018)
29. Subasi, O., Martsinkevich, T., Zyulkyarov, F., Unsal, O., Labarta, J., Cappello, F.: Unified fault-tolerance framework for hybrid task-parallel message-passing applications. Int. J. High Performance Comput. Appl. 32(5), 641–657 (2018)
30. Sultana, N., Rüfenacht, M., Skjellum, A., Laguna, I., Mohror, K.: Failure recovery for bulk synchronous applications with MPI stages. Parallel Comput. 84, 1–14 (2019). https://doi.org/10.1016/j.parco.2019.02.007
31. Teranishi, K., Heroux, M.A.: Toward local failure local recovery resilience model using MPI-ULFM. In: Proceedings of the 21st European MPI Users' Group Meeting, p. 51 (2014)
32. Wang, Z., Gao, L., Gu, Y., Bao, Y., Yu, G.: A fault-tolerant framework for asynchronous iterative computations in cloud environments. IEEE Trans. Parallel Distrib. Syst. 29(8), 1678–1692 (2018)
33. Zheng, G., Xiang, N., Kale, L.V.: A scalable double in-memory checkpoint and restart scheme towards exascale. In: IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp. 1–6, June 2012. https://doi.org/10.1109/DSNW.2012.6264677
34. Zheng, G., Huang, C., Kale, L.V.: Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++. SIGOPS Oper. Syst. Rev. 40(2), 90–99 (2006). https://doi.org/10.1145/1131322.1131340
