SoftwarePerformanceAnalysis
ts60.com 时间:2021-04-10 阅读:(
)
MichelR.
Dagenais(michel.
dagenais@polymtl.
ca)Dept.
ofComputerEngineering,EcolePolytechnique,Montreal,CanadaKarimYaghmour(karim@opersys.
com)Opersys,Montreal,CanadaCharlesLevert(Charles.
Levert@ericsson.
ca)MakanPourzandi(Makan.
Pourzandi@ericsson.
ca)EricssonResearch,Montreal,Canada1.
AbstractThekeytospeedingupapplicationsisoftenunderstandingwheretheelapsedtimeisspent,andwhy.
ThisdocumentreviewsindepththefullarrayofperformanceanalysistoolsandtechniquesavailableonLinuxforthistask,fromthetraditionaltoolslikegcovandgprof,tothemoreadvancedtoolsstillunderdevelopmentlikeoprofileandtheLinuxTraceToolkit.
Thefocusismoreontheunderlyingdatacollectionandprocessingalgorithms,andtheiroverheadandprecision,thanonthecosmeticdetailsofthegraphicaluserinterfacefrontends.
2.
IntroductionCodecoveragetools(e.
g.
,Gcov[1])computetheexactnumberoftimeseachstatementisexecuted,andthuscaneasilypinpointfrequentlyexecutedsections.
However,codecoveragetoolscarryanonnegligibleexecutionoverhead,andmoreimportantlydonotmeasuretherelativeexecutioncostofdifferentstatements.
Timedomainsamplingtools(e.
g.
,Gprof[2])maythereforemoreaccuratelyreportlargecontributorstotheexecutiontimebysamplingatregulartimeintervalstheexecutionaddress.
Moreover,theoverheadcostmaybetradedoffforaccuracybyvaryingthetimeinterval.
Modernapplicationsandsystemspresentnewchallengesastheirperformancemaybestronglyaffectedbycachehitrates,diskaccesses,inter-processcommunications,synchronizationwaits,andoperatingsystemoverhead.
Thisiswherenewtoolsbasedonhardwareperformancecounterssampling(e.
g.
,Oprofile[3]),ordetailedanalysisoflowoverheadoperatingsystemtraces(e.
g.
,LinuxTraceToolkit[4])canbringmoreinformation.
Theymayforinstanceindicatecodesectionswithexceedinglyhighlevelofcachemissesorbranchesmispredictions,orestimatetheexecutiontimeoftheXwindowserver[5]onbehalfofeachofitsclients.
Section3discussesthealgorithmsandtechniquesusedtocollect,processandstoreperformancedata.
Thefollowingsectiondescribesanumberofspecificperformanceanalysistools,fromcommonlyusedtoolstomorerecentones,withalongersectiononpossibleextensionsandcustomisationstotheLinuxTraceToolkit.
Section5istheconclusion.
3.
PerformanceDataCollection,ProcessingandStorageInasimplemodel,apieceofperformancedataisobtained,andqueuedorstoredforlaterdisplay.
Inpractice,thegeneratedvolumeofdatamaybesuchthatsomeprocessingmayberequiredatdata1collectiontimetosynthesizemorecompactinformation.
Forexample,inalargescientificcomputerclusterof1024computers,eachtimeaprogramfunctioniscalled,theentryandexittimemaybecollected.
Sendingthesevaluesdirectlytoacentralserverwouldrapidlysaturatethenetworkandcarryanunbearableoverhead.
Instead,theentryandexittimemaybesubtractedandaccumulatedlocallyineachfunction,inordertoproduceasummaryoftheprocessingtimespentineachprogramfunction.
Individualprograms,uponexit,wouldsendtheexecutiontimehistogramtoacentralserverforofflinepreparationofperformancereports.
Thissectionexaminesthekeyaspectsofdatacollectionandprocessing:obtainingthedatapoints,accumulatingvaluesintemporarystorage,andsendingthevaluestostorageforofflineprocessinganddisplay.
3.
1.
DataCollectionSeveralactivitiesincludingdebuggingandprogramunderstandingrequiredatacollection,justlikeperformanceanalysis,andusethesametechniques.
Thefamiliardebuggingprocessisusedtoillustratecommondatacollectionmethods.
Thereisacompromisebetweentheexecutionoverheadandtheamountandprecisionofthedatacollected.
Sometimestheoverheadisunacceptablebecausetheresultingprogramexecutiontimebecomestoolarge,ortheprogramperformanceortherealtimeprogrambehaviorisskewedbythedatacollection.
Storingorsummingavalueinavariableisminimaloverhead.
Callinganotherfunctionforeachfunctionentryisalargeoverheadforsmallfunctions.
Writingperformancedatatoafileuponenteringabasicblock(i.
e.
,linearcodesection,withoutanyjump,andthusalwaysexecutedtogetherandthesamenumberoftimes)wouldsignificantlychangetheexecutionperformanceofaprogram.
3.
1.
1.
ManualProgramInstrumentationAddingaprintstatementisoftentheeasiestandmostflexiblewaytoobtainthedesiredinformation.
Allthedatastructuresandfunctionsaccessibletotheprogrammaybeusedtogeneratethedataprinted.
TheLinuxkernelcode,wherenoviceprogrammersareuncommon,containsnolessthan80000printkstatements.
Severalcommontools(netstat,vmstat,mtrace)usedatagatheringstatementsinsertedinthekernel,ortheClibrabry.
3.
1.
2.
AutomaticProgramInstrumentationCompilersandpreprocessingtoolsmaybeusedtosystematicallyinstrumentaprogrambyaddingdatagatheringstatementstothecodetheywouldnormallygenerate.
Typicallocationsforinsertingstatementsincludethefunctionsentryandexit,andtheentryoflinearcodesections(basicblocks).
Forexample,theGNUCompilerCollection,GCC[1],offersseveralcodeinstrumentationoptionstoinsertacalltofunctionmcountuponenteringeachfunction(option-pgusedforgprof),andtoinsertcallstohooksuponenteringlinearcodesectionsandjumpingtoothercodesections(options-ftest-coverageand-fprofile-arcswhichmaybeusedforgcov).
3.
1.
3.
BinaryInstrumentationInsomesituations,thesourcecodeisnotavailableforsourcelevelinstrumentationandrecompilation,oritissimplyinconvenient.
Therearetoolstomodifybinaryexecutablefilesinordertoinsertdatagatheringstatements(e.
g.
,EEL[6]).
2Ideally,theexistingbinaryinstructionsshouldbeshiftedafewbyteshigherintheaddressspaceeachtimeinstrumentationinstructionsneedtobeinserted.
Doingso,however,changesalltheaddressesintheprogramandalljumpsandcallsmustbefoundandadjustedaccordingly.
Unlesstheprogramsectioncontaininginstructionsdoesnotcontainanydata(e.
g.
,jumptables)andallinstructionshavethesamesize,thisrapidlygetsverytricky.
Anothereasiertechniqueiscodepatching.
Theinstruction(s)locatedwhereinstrumentationcodeshouldbeinsertedisreplacedbyajumptoanewcodesectionappendedtotheprogram,andcontainingtheinstrumentationcodeandthereplacedinstruction.
Onlytheoneortwoinstructionsoverwrittenbythejumpneedtoberelocated(addressadjusted)perinstrumentationpoint.
Asimplerformofcodepatchingisusedinmostdebuggerstoinsertbreakpoints.
Theinstructionwhereinstrumentationcodeshouldbeinsertedissavedawayandreplacedbyasoftinterruptinstruction.
Unlikejumpinstructionswhichareamongthelonguest,andmayoverwriteafewshorterinstructions,interruptinstructionsusuallyarethesmallestandoccupyasinglebyteoni386processors.
Oncetheinterruptinstructionisreached,theinterrupthandlerexecutestheinstrumentationcode,putsbackthesavedawayinstructionandexecutesitinsinglesteppingmode.
Thisway,anotherinterruptionisgeneratedafteritsexecution,atwhichpointtheinterruptinstructionisplacedbackforthenexttimetheprogramreachesthispoint.
Withthistechnique,onesoftinterruptandonesinglesteppinginterruptisrequiredforeachinstrumentationpoint.
Theaddressoftheinstructioncausingtheinterruptisusedtodeterminetheinstrumentationpointreachedandtheassociatedcodetoexecute.
Thisleastinvasivetechnique(noinstructionstorelocate)alsocarriesthelargestoverhead.
Interruptsareprocessedinmicrosecondsatbestwhilejumpsareexecutedinnanoseconds,asshownbelowwithGDB[7].
#Runaprogramundergdb[gzip-1.
2.
4a]$catgdb.
runfilegzip.
plainrun/dev/nullquit[gzip-1.
2.
4a]$timegdb-command=gdb.
run>gdb.
out23.
30user1.
53system0:30.
75elapsed80%CPU(0avgtext+0avgdata0maxresident)k0inputs+0outputs(1213major+875minor)pagefaults0swaps#RunaprogramunderGDBwithabreakpoint.
Thebreakpointisexecuted#6933680times,eachtimeprintinga".
"beforecontinuing.
[gzip-1.
2.
4a]$catgdb.
commandsfilegzip.
plainbreakdeflate.
c:679commands1silentprint".
"continueendrun/dev/nullquit#RunGDBwiththecommandfile[gzip-1.
2.
4a]$timegdb-command=gdb.
commands>gdb.
out3905.
78user1535.
75system2:02:48elapsed73%CPU3.
1.
4.
SamplingSamplingisoftenusedtopredictatlowcosttheoutcomeofanelectionbyaskingseveralhundred3peopletheirvotingpreferencesinsteadofcallingeachofthemillionseligiblevoters.
Similarly,insteadofrecordingtheexecutiontimeforeachfunction,theprogrammaybeinterruptedatregularintervals(e.
g.
,each10miliseconds)tosampletheaddressofthecurrentlyexecutedinstruction.
Thechancesoftheaddressbeingwithinaspecificfunctionisproportionaltotheproportionoftimespentinthatfunction(underrandomnessassumptions).
Simplesamplingbasedprogramslikegprofuseimprecisevirtualtimerswhicharecheckedeachtimetheoperatingsystemschedulerisrun.
Inprogramswithstronginteractionswiththescheduler(e.
g.
,frequentlypollingforI/Oorcallingyield),theresultsmaybeseverelyskewed.
Newertoolslikeoprofiletakeadvantageofhardwareperformancecounterswhichmaycountdifferentevents(cycles,cachemisses,branchestaken/nottaken.
.
.
),andgenerateaninterruptafteracertaincount.
Thisallowsvaryingeasilythesamplinginterval,tradingoffoverheadforaccuracy,andmeasuringseveraldifferentparametersapartfromexecutiontime.
Thelowachievableoverheadenablesmeasuringthecachemisseshistogramforaprogram,whichwouldbeimpracticalusingcodeinstrumentation.
3.
2.
PerformanceDataProcessingOnceadataelementisobtained,itmustfinditswaytotheresultsfile.
Tracingsystemsusuallyaddafewfields(recordnumber,time,address.
.
.
)tothedataelementandappendittoabufferwhichiswrittentodiskoncefull.
Thiswaythecompleteinformationisretainedbutthediskmayfillatanalarmingrate.
TracingthesystemcallsforaWebbrowsermaygenerateseveralmegabytesperminute.
Thealternativetosimplebufferingistoprocessandaggregatethedataelementsastheyaregenerated.
Typicallyacountervariablewillbeassociatedwitheachfunctionorbasicblock,andtheeventsoccuringtherewillbecountedastheyhappen.
Attheprogramend,oratregularintevals,thecontentofthesecountersiswrittentodisk.
Insomecases,thenumberofpossiblelocationsismuchlargerthanthenumberactuallyusedandhashtablesareused.
Thisisthecasewhencountingthecallstoafunctionbycaller.
Veryfewfunctionsamongallthefunctionsinaprogramactuallycallaspecificfunction;usinghashtablesisthusmoreefficientthanallocatingacounterforeachpossiblecaller-calleecombination.
3.
3.
PerformanceDataTransferandStorageInlowvolumeapplications,eventsaregeneratedeveryfewsecondsormore,andmaybewrittenastheyhappentoafiledescriptor.
Thisfiledescriptormaybeconnectedtoafile,asocketorevenagraphicalviewerwhichupdatesitsdisplayinrealtime.
Forlargervolumesofevents,eventsshouldbeappendedinmemorytoalargebuffer,writtentodiskinverylargechunks.
Whentheinstrumentedprogramisdifferentfromtheloggingprocesswritingtodisk,maximumperformancemaybeobtainedthroughtheuseofsharedmemorybetweentheinstrumentedprogramandtheloggingdaemon.
TheLinuxTraceToolkitusessharedmemorybetweentheinstrumentedkernelandthetracingdaemontoachievemaximumperformance.
Theperformanceparametersbeingmeasured(diskthroughput,executiontime,cachehits)willstronglyinfluencewhichoverheadshouldbeminimized.
Ifdiskaccesstimesaretobemeasured,theimpactoftheperformancedatagatheringondisksshouldbereduced,forinstancebywritingtoaseparatedisk,onaseparatediskcontroller,andevenacrossthenetworkonaseparatecomputer.
Ifonlyonediskisavailable,itisalsopossibletoreducethediskrequirementsattheexpenseofexecutiontimeusingdatacompression.
Mostsystemsperformdatacompressionofflineforarchivalpurposesbutnotduringthelivetracing.
44.
ReviewofPerformanceAnalysisToolsThetoolsreviewedinthissectionareallfreelyredistributableandcoverthefullspectrumofwidelyusedtechniquesforsoftwareperformanceanalysis.
Thesimplerandmorewidelydeployedtoolsarepresentedfirst.
Throughoutthissection,theGNUcompressionutilitygzip[8]willbeusedasasampletestcase,compressinga64MBlogfileto1.
6MB.
Itisusedtocomparetheoverheadofdifferentperformanceanalysistools.
#Runthenormalprogram[gzip-1.
2.
4a]$time.
/gzip.
plain/dev/null24.
09user0.
72system0:28.
16elapsed88%CPU4.
1.
CodeinstrumentationandresultsprocessingwithGCC/GCOVGcovispartofthestandardGNUCCompiler(GCC)distribution[1].
Whencompilingaprogramfoo.
cwiththe-ftest-coverageand-fprofile-arcsoptions,alistofthebasicblocksandbranchesaregeneratedinfilesfoo.
c.
bbandfoo.
c.
bbg,andthecodeisinstrumentedtocounttheexecutionfrequenciesofeachbasicblockandbranchandtowriteouttheresultsuponexit.
Theprogrammaythenbeexecutedandsumsitsresultsinfilefoo.
c.
da,whichcanlaterbeanalyzedwithgcovtoprintthesourcecodealongwithexecutionfrequencies.
Thistoolhasaverylowoverhead,asingleaddinstructionforeachbasicblockandeachbranchexecution.
Thebasicblockexecutionfrequenciesareusedtodetectfrequentlyusedcodesections,whereoptimisationcodingeffortsmaybeworthwhile.
Executionfrequenciesforbasicblocksandbranchesmaybeusedfortestcoverageanalysis.
Anysectionwithnullexecutioncounthasnotbeencoveredbythetestsusedwhenexecutingtheprogram.
Finally,branchescountsmaybeusedtocomputebranchprobabilities,whichmaybeusedinsubsequentcompilationstogeneratebettercodeforthemorefrequentcase.
AnotherGCCoption,-axproducessimilarinformationwithmorecontroloverthegenerateddatabutwithoutapost-processingtoollikeGcov.
Itcanproducefunctions,basicblocksandbranchesexecutioncounts,orafulltraceoffunctionandbasicblockentries.
Furthermore,itispossibletospecify5Figure1:Totalelapsedtimeforcompressinga64MBfilewithgzip,unmodified,withsamplingevery0.
01s,withinstrumentationatfunctionentries,andwithinstrumentationatbasicblockentries.
NormalSamplingFunctionBasicblock05101520253035functionstoexclude/includefromthisprofiling/tracing.
Asanexample,theGNUcompressionprogramgzipwasrecompiledwiththe-ftest-coverageand-fprofile-arcsoptionsandrunona64MBlogfile.
Theinstrumentedgziprequires5.
4%moreelapsedtimethanthenormalgzip.
Attheendoftheexecution,afilewiththe.
daextensioniscreatedforeachsourcecodefile(e.
g.
,deflate.
dafordeflate.
c).
Thesefilesmaythenbeusedbygcovtoproduceannotatedsourcecodefiles(e.
g.
,deflate.
c.
gcovfordeflate.
c).
#Runtheprogramcompiledandlinkedwithoptions#-ftest-coverageand-fprofile-arcs[gzip-1.
2.
4a]$time.
/gzip.
gcov/dev/null27.
04user0.
63system0:31.
36elapsed88%CPU#Files.
bband.
bbgareproducedatcompilationtime,files.
da#attheprogramexit.
[gzip-1.
2.
4a]$ls-alsdeflate.
*4-rw-rw-r--1dagenaisdagenais1100avr915:34deflate.
bb4-rw-rw-r--1dagenaisdagenais2048avr915:34deflate.
bbg32-rw-r--r--1dagenaisdagenais29179ao131993deflate.
c4-rw-rw-r--1dagenaisdagenais648avr1014:10deflate.
da20-rw-rw-r--1dagenaisdagenais18856avr1012:01deflate.
o#Performtheanalysisononeofthesourcefiles.
#Gcovreadsthe.
bb.
bbg.
daand.
cfilesandproduces.
c.
gcov[gzip-1.
2.
4a]$gcov-b-fdeflate.
c80.
77%of26sourcelinesexecutedinfunctionlm_init60.
00%of15branchesexecutedinfunctionlm_init40.
00%of15branchestakenatleastonceinfunctionlm_initNocallsinfunctionlm_init.
.
.
Creatingdeflate.
c.
gcov.
[gzip-1.
2.
4a]$lessdeflate.
c.
gcov.
.
.
5while(lookahead!
=0){branch0taken=0%6933680INSERT_STRING(strstart,hash_head);6933680prev_length=match_length,prev_match=match_start;6933680match_length=MIN_MATCH-1;6933680if(hash_head!
=NIL&&prev_lengthlookahead)match_length=lookahead;branch0taken=0%branch1taken=100%.
.
.
4.
2.
TimeBasedSamplingwithGPROFGprofispartoftheGNUbinutilspackage[2].
Whenaprogramfoo.
ciscompiledwiththe-pgoption,acalltofunctionmcountisaddedtoeachfunctionentry.
Moreover,thelinkstepusesversionsofthestandardlibrariescompiledwith-pgwhenavailable,andaddsamoduletoinitializethesamplingatthebeginningandtowriteouttheresultsattheend.
Optionally,compilationflag-gmaybespecifiedtoobtainsourcecodelinesaddresses(usuallyusedbydebuggers),and-atosimultaneouslycomputebasicblockexecutionfrequencies.
6Whentheprogramexecutes,aninterrupthandleriscalledeveryclocktick(10miliseconds)whichtakestheaddressoftheinterruptedinstructionandadds1tothecorrespondingbucketinalargearray.
Typicallythisarrayisaslargeastheprocessinstructionsareaandusesa4bytesintegercounterforeach4bytessectionintheprogram.
Eachtimeafunctioniscalled,theinternalmcountfunctiongetscalled,extractstheaddressesofthecallerandthecallee,andsumsthenumberofcallsforthiscaller-calleepairinalargehashtable.
Whentheprogramexits,thesamplescountsinthelargearray,andthecaller-calleecountsfromthehashtable,areprintedinaresultsfilenamedgmon.
out.
Whenthe-acompilationflagisspecified,basicblocksexecutioncountsarealsocomputedandareappendedtotheresultsfile.
TheGprofprogrammaythenbeusedtoreadresultsfiles,andtheprogramexecutableandsourcecodefiles,toproducedifferentreports.
Theflatprofilereportprintsforeachfunction(andeveneachsourcecodeline)thenumberofassociatedsamplesandthecorrespondingexecutiontimeprobablyspentinthatlocation.
Thefunctionsmaybeprintedindecreasingorderofexecutiontime,thuslistingontopthemosttimeconsumingfunctions,andbestcandidatesforoptimisation.
Thetimeassociatedwithinternalfunctionmcountcanbeusedtosubstract,andthuscompensate,theoverheadassociatedwiththeprofiling.
TheannotatedsourcecodereportissimilartotheoutputproducedbyGcov,listingtheexecutionfrequenciesforeachbasicblock.
Finally,Thecallgraphreportisasupersetoftheflatprofileandaddsinformationaboutcallersandcalledfunctions.
Foreachfunction,itliststhetotaltimespentinitself(numberofsamplestimessamplingperiod),andthecallingfunctions.
Foreachcallingfunctionislistedthenumberofcallsandtheproportionofthetotal;thisinformationisobtainedpreciselythroughthecallstomcountwherethecallerisrecordedinthecallcounts.
Theproportionofcallsfromacallerisusedtoestimatethetimespentinthecurrentfunctionbycaller.
Theunderlyingassumptionisthatcallstoafunctiontakethesametimeonaverageirrespectivefromthecaller.
Thisisclearlynotthecaseas,forinstance,afunctiondraw_iconmaycallscale_bitmapsforverysmallimageswhilefunctionscale_backgroundmaycallscale_bitmapsforimages500timeslarger.
Theexecutiontimespentforeachcallerispropagatedtothecallersresursively,startingfromtheleaffunctions.
Thisaddsanotherlistforeachfunctioninthecallgraphreport,thetimespentineachfunction(anditschildren)called.
Thetotaltimespentinthefunctionitselfandallitscalledchildrenisalsopartofthecallgraphreport.
Forrecursivecalls,aseparateentryiscreatedforthecyclewhichlistscallersandcalleesfromousidethecycle;then,entriesformembersofthecyclelistcallersandcalleesfromwithinthecycle.
Gprofandsimilartoolsofferanexcellentcompromiseoflowoverheadanddetailedusuallyreliableinformation.
Itmay,however,beinadequateorevenmisleadinginmoreinvolvedcaseswhere:thetimespentinafunctionisverydifferentdependingonthecaller.
Measuringitadequatelywouldrequiresamplingnotonlythecurrentlyexecutinginstructionaddressbutalsothecallersaddressonthestack,possiblyforseverallevels.
Thecallgraphreportcouldthenshowthetimespentinafunctionbasedonchainsofcallers,butthiswouldbemoredifficulttopresentclearly,andthenumberofsamplesassociatedwitheachchainislikelytobetoosmalltobereliable;thesamplingintervaliscorrelatedwiththeprogramactivity.
Whenaprogramcontinuouslymakesblockingsystemcalls,thereisachancethatthescheduleroftengivescontroltotheprogramandactivatesthevirtualtimerusedforsamplingwhentheprogramisinthesamearea,whichgetsoverrepresentedinthesamples;thetimespentnotexecutingtheprogram,waitingforthediskorforotherprocesses,isofinterest.
Inthatcase,othersourcesofinformationshouldbeusedtoannotatethereportswithinformationabout7timespentwaiting,toobtainthetotalelapsedtime,insteadofjusttheexecutiontime,ineachfunction;moreinformationisrequiredaboutwhytimeisspentineachsection(cachefaults,pagefaults.
.
.
).
#Runtheprogramcompiledwithout-pgandlinkedwith-pg#tomeasurethelowoverheadofsamplingalone.
[gzip-1.
2.
4a]$time.
/gzip.
pg-nomcount/dev/null24.
30user0.
62system0:28.
30elapsed88%CPU#Runtheprogramcompiledandlinkedwithoption-pg#Overheadisslightlylargerbutitprovidescallgraphinfo[gzip-1.
2.
4a]$time.
/gzip.
pg-static/dev/null25.
79user0.
51system0:29.
88elapsed88%CPU#Filegmon.
outisproducedwhentheprogramexits[gzip-1.
2.
4a]$ls-alsgmon.
out188-rw-rw-r--1useruser186366avr1014:07gmon.
out#Usegproftointerpretgmon.
out,usingdebuggingsymbols#informationfromtheexecutablefile[gzip-1.
2.
4a]$gprofgzip.
pg-static>gprof.
out#Lookattheresults,firsttheflatprofilewiththepercentageof#executiontimeforeachfunctionindecreasingorder.
[gzip-1.
2.
4a]$lessgprof.
outEachsamplecountsas0.
01seconds.
%cumulativeselfselftotaltimesecondssecondscallsms/callms/callname27.
137.
537.
5319573.
855.
95fill_window23.
2013.
976.
4416440.
0019205.
66deflate14.
8418.
094.
1219582.
102.
10updcrc12.
1621.
463.
38do_scan8.
3823.
792.
33short_loop3.
7824.
841.
05__mcount_internal3.
1325.
710.
87read.
.
.
#Thenthecallgraph.
Functionupdcrcuses4.
12sandcallsnone.
#Itiscalledfromzipandfile_read.
Functionfile_readis#calledfromlm_initandfill_window.
Itusesaboutnotimebut#callsupdrcforabout4.
12s.
This4.
12sisalmostallattributed#tocallerfill_window(1957/1958calls).
.
.
.
0.
000.
001/1958zip[3]4.
120.
001957/1958file_read[8][7]15.
64.
120.
001958updcrc[7]0.
000.
001/1958lm_init[27]0.
004.
121957/1958fill_window[6][8]15.
60.
004.
121958file_read[8]4.
120.
001957/1958updcrc[7].
.
.
4.
3.
HardwareCountedEventsSamplingwithOprofileOprofile[3]isverysimilartoGprof;itevenproducescompatibleresultsfiles.
Thetwoimportantdifferencesarethatitcollectssamplesbasedonhardwareperformancecountersinterrupts,andthatitoperatesonthecompletesystem(gatheringsamplescontinuouslyforallprogramsexecutingonthe8system).
Hardwareperformancecountersarenowavailableinallpowerfulmodernmicroprocessorsandmaybeusedtocountvariouseventsofinterestincludingclockcycles,L1orL2cachemisses,branches,branchestaken,andpipelinestalls.
Theperiodicsamplingisobtainedbyrequestinganinterruptwhenthecounterreachesacertaincount.
Oprofiledoesnotuseprograminstrumentation(i.
e.
,placingcallstomcountineachfunction).
Thus,norecompilationisnecessaryandtheoverheadislower,butcallgraphreportscannotbegenerated.
Theonlyrequirementisthatdebugginginformationisrequiredtocorrelatetheaddressofsampleswithprogramfunctionsandlinenumbers.
Becauseofitslowoverhead,itisnotuncommontokeepOprofilerunningforextendedperiodsonrunningserversystems.
Thewholesystemprofilingfeatureisparticularlyusefultoseewhichproportionofthetimeisspentineachprogram,thusdiscoveringwhoisusingupalltheresources.
Itisalsousefulwhenasingletaskinvolvingseveralprogramsisexecutedonasystem.
Forexample,whenanalysingthestartuptimeofasophisticatedWebbrowser,itispossibletoseethetimespentinfunctionsinthekernel,thebrowser,thewindowmanager,theXserverandvariousdaemons.
Insomerarecases,theinformationproducedbyhardwareperformancecounterssamplingtoolsmaybebiased.
Indeed,areal-timeprocessusingthesameperiodasthesamplingtoolmayhaveitssamplesgrosslyunderoroverrepresented.
Asimplesolutionistorandomlyvarythesamplingperiodaroundatargetaverage.
#Runtheoprofilegraphicaluserinterface,clicktostartthedaemon[gzip-1.
2.
4a]$oprof_start#Runtheunmodifiedprogram(withdebuggingsymbolspresent)[gzip-1.
2.
4a]$time.
/gzip.
plain/dev/null24.
40user0.
87system0:27.
67elapsed91%CPU#Stoptheprofilingdaemonandextractstatistics[gzip-1.
2.
4a]$oprofpp-l-igzip.
plain>gzip.
oprof#Lookattheresults[gzip-1.
2.
4a]$lessgzip.
oprofCputype:PIICpuspeedwas(MHzestimation):233.
140000Counted"clocksprocessorisnothalted",samplinginterval200000vmasamples%-agesymbolname0804810000_start.
.
.
08050504500.
059034long_loop080504e1680.
0802862limit_ok080504c11110.
131055longest_match080505721600.
188909the_end080505471710.
201896mismatch0804c5509881.
16651ct_tally0804c93011001.
29875send_bits0804c6d012821.
51363compress_block0805aad014651.
7297memcpy0805050d69058.
15259short_loop080505221210114.
2874do_scan0804ea101382116.
3182updcrc0804b4c02140525.
2724deflate0804b1202502029.
5406fill_window94.
4.
SystemCallsTracingusingStraceLinuxandPOSIXsystemsoffertheptracesystemcalltoallowoneprocesstomonitoranotherprocess.
Itisusedbydebuggersandothertoolslikestrace[9].
Whenaprogramisrununderstrace,itisblockedeachtimeitissuesasystemcall.
Thecontrolpassestostracebeforethesystemcall,toextractthevalueofthearguments,thesystemcallisthenperformedasusual,andthecontrolpassestostraceafterthesystemcalltoextractthereturnvalues.
Theoverheadcostisseveralmicrosecondsforeachsystemcall,comparabletohittingabreakpointinadebugger.
However,thenumberofsystemcallsisrelativelylowforCPUboundprocesseslikegzip.
#Tracethesystemcallsissuedduringtheexecution[gzip-1.
2.
4a]$timestrace-s120-f-F-otrace.
/gzip.
plain\/dev/null25.
23user1.
57system0:30.
97elapsed86%CPU#Examinethesystemcalltracefile[gzip-1.
2.
4a]$lesstrace2488execve(".
/gzip.
plain",[".
/gzip.
plain"],[/*44vars*/])=02488fcntl64(0,0x1,0,0xbffff804)=02488fcntl64(0x1,0x1,0x1,0xbffff804)=02488fcntl64(0x2,0x1,0x1,0xbffff804)=02488uname({sys="Linux",node="rocamadour"02488geteuid32()=53632488getuid32()=53632488getegid32()=11052488getgid32()=11052488brk(0)=0x810b3742488brk(0x810b394)=0x810b3942488brk(0x810c000)=0x810c0002488rt_sigaction(SIGINT,{SIG_IGN},{SIG_DFL},8)=02488rt_sigaction(SIGINT,{0x804aaa0,[INT],SA_RESTART|0x4000000},{SIG_IGN},8)=02488rt_sigaction(SIGTERM,{SIG_IGN},{SIG_DFL},8)=02488rt_sigaction(SIGTERM,{0x804aaa0,[TERM],SA_RESTART|0x4000000},{SIG_IGN},8)=02488rt_sigaction(SIGHUP,{SIG_IGN},{SIG_DFL},8)=02488rt_sigaction(SIGHUP,{0x804aaa0,[HUP],SA_RESTART|0x4000000},{SIG_IGN},8)=02488ioctl(1,0x5401,0xbffff6c0)=-1ENOTTY(Inappropriateioctlfordevice)2488fstat64(0,{st_mode=S_IFREG|0644,st_size=64140109,02488read(0,"recid=1,size=8,format=BINARY,event_type=4,facility=LTT,severity=DEBUG,\nuid=dagenais,gid=dagenais,pid=1953,pgrp=".
.
.
,65536)=65536.
.
.
4.
5.
LowOverheadKernelTracingwiththeLinuxTraceToolkitThedataaggregationperformedbythetoolsdiscussedinprevioussectionsisimportanttoobtainalowoverhead.
Thedownsideisthatsomeinformationislostintheaggregates(numberofcallsforeachfunctioninsteadofwhichfunctionwascalledwhenandinwhatorder).
Afulltraceofcollecteddataitemsrapidlyfillsadiskbutdoesnotaprioridiscardinformationthatmaynotseemusefulinitially,butcouldbefoundsolater.
Tracingisusedpervasivelyinsoftwareprojects,fromprintstatementstologgingdaemonslikesyslogdandklogd.
TheLinuxTraceToolkit[4]isextensible.
Itcomeswith60predefinedtracepointsinsertedinthekernel,dividedinto8categorieswhichmaybeenabled/disabledeitheratcompileoratruntime.
MoretracepointsmayeasilybeaddedbyrecompilingthekernelorusingtheDynamicProbescodepatchingsystem[10].
Itskeyfeaturesarelowoverheadandpost-processingtools.
10Lowoverheadisachievedbycarefullyselectingtheinformationtolog,andminimizingcontextswitching.
Sinceallschedulingchangesandsetuidcallsmaybetraced,thereisnoneedtorecordtheprocessoruseridwitheachevent.
Furthermore,mosteventsbeingcloselyspacedintime,32bitstimedifferencesarestoredinsteadoffull64bitsseconds/microsecondsvalues.
Alltheeventsareloggedbyappendingtheirdatatoalargedoublebuffer.
Whenabufferisfull,adaemonaccessesthebufferthroughsharedmemoryandcopiesitscontenttodisk.
Theoverheadoftracingabusysystemamountstolessthan5%fortheelapsedtimeanduptoseveralmegabytesperminuteoftracingdatatostoreondisk[11].
Thetraceanalysisisperformedoffline,tominimizetheoverheadduringdatacollection,andpossiblyonaremotehostwhenstudyingembeddedsystems.
Theinformationcontainedinthetrace(systemcalls,devicesoperations,schedulingchanges)issufficienttoreconstructthesystembehavior(includingwhichfiledescriptorisassociatedtowhichfileormemoryregion)andmaythereforebeusedforseveraldifferentanalyses.
Thetoolsincludedshowtheeventsgraphically,perprocessinatimeline,andcontainperprocesssummaries(executiontime,numberofcallspersystemcall,waittime,bytesread/written).
Newanalysistoolsmaybewrittenusingtheprovidedtracereadinglibrary,orbymodifyingtheprovidedtools.
Indeed,whileitiseasywithLTTtocapturealltheneededrawdata,synthesizingthedesiredhighlevelinformationismorechallenging.
Asatest,amodifiedtraceanalysistoolwasdevelopedtoanswerthefollowingnonobviousquestion:whereisthelargestartingtimeofapplicationssuchasGaleon,MozillaandOpenOfficespent4.
5.
1.
DecomposingtheWaitingTimeTraditionally,systemstatisticsdecomposetheelapsedtimeintousermodeexecutiontime,systemmodeexecutiontime,andwaitingtime.
Profilingtoolsaregoodatdecomposingtheexecutiontimebyfunctionorevensourcecodeline.
Theobjectiveherewasdecomposingthewaitingtimeintowaitingrunnable(waitingfortheCPU)andwaitingnonrunnable(blockedonI/O,orwaitingforaprocessoratimer).
Thisinformationmaybeextractedfromadetailedtrace:SchedulingChangeeventstellthestatusoftheoutgoingprocess(runnableornot).
WakeUpeventschangethestatusofaprocessfromnonrunnabletorunnable.
Iftheeventpriortobeingscheduledoutnonrunnableisareadorapagefaulttrap,theprocessiswaitingfortheassociatedfile.
Atablemustkeeptrackforeachprocessofalltheopenandmmapoperationsassociatingfileswithfiledescriptorsormemoryaddressranges.
Iftheeventpriortobeingscheduledoutnonrunnableisawaitpidcall,theprocessiswaitingfortheassociatedprocess.
Therearecaseswhereaprocessiswaitingforoneofseveralpossiblefilesorprocesses,withcallstowait,select,orpoll.
Inthatcase,thewaitingtimeisattributedtotheprocessorfilewhichcausesthecalltoterminate.
Pollandwaitmayalsoendbecauseofatimeout.
Thewaittimeisthenattributedtothemoregeneralwaitfortimeoutcategory.
Atracewasgeneratedonafreshlyrebootedsystem,withnootheractivity,whilestartinguplargeprogramsoneafteranotherbutwithoutanyoverlap.
ThewaitingtimeforGaleon[12],forinstance,wasdecomposedintoseveralwait-for-fileandwait-for-processcomponents,whichaccountedfor7.
96softhetotal9.
27s.
Theremaining1.
31sismostlywait-for-timeout(0.
89sawaitingamouseclick)and11only0.
42swait-generic.
Thewait-generictimecouldbefurtherreducedbyidentifyingintheirowncategoryothersystemcalleventsinthetrace,inadditiontoread,select,poll,waitpidandwait.
Thewait-generictimeisnonethelessavailablebysystemcallcategory;.
31soutofthe.
42stotalbelongstostatsystemcalls.
#BootwiththeLTTenabledkernelandstarttheLTTdaemon[gzip-1.
2.
4a]$tracedaemon-ts60/dev/tracertrace1.
tracetrace1.
procTraceDaemon:OutputfilereadyTraceDaemon:TracedriveropenTraceDaemon:Tracebuffersare1000000bytesTraceDaemon:Fetchingeipforsyscallondepth:0TraceDaemon:Daemonwilrunfor:(60,0)TraceDaemon:Donemapping/procTraceDaemon:Endoftracing#Analysethetraceoutput[gzip-1.
2.
4a]$tracevisualizer-atrace1.
tracetrace1.
proctrace1.
out#Lookatthetraceanalysis[gzip-1.
2.
4a]$lesstrace1.
out.
.
.
Process(1684,1658):galeon:galeon-bin.
.
.
UserMode:CPU5.
894726Elapsed15.
677299WaitCPU2.
358053WaitFork0.
000002BH811KernelTimer610PacketIn80PacketOut41SchedIn1330TimerExpire136SyscallMode:Elapsed/Calls0.
0003861CPU0.
308954Elapsed5.
119347WaitCPU0.
599823WaitFile-001.
png0.
000453WaitFile-002.
png0.
000346WaitFile-003.
png0.
000213WaitFile-004.
png0.
000216WaitFile-005.
png0.
000430.
.
.
sys_fcntl64Elapsed/Calls0.
0000054CPU0.
000286Elapsed0.
000286Calls53TrapspagefaultElapsed/Calls0.
0003101CPU0.
179146Elapsed2.
444663WaitCPU0.
448110WaitFile-galeon-bin1.
816968BH38Calls7883KernelTimer23PacketIn212PacketOut1PageAlloc78SchedIn459TimerExpire2IRQstimerElapsed/Calls0.
0000118CPU0.
007725Elapsed0.
007725Calls657usb-uhci,eth0Elapsed/Calls0.
0000263CPU0.
003610Elapsed0.
003625Calls138.
.
.
Trapentry1018876582.
807382168413TRAP:pagefault;EIP:0x08068FC7Trapexit1018876582.
80738716847Syscallentry1018876582.
807421168412SYSCALL:execve;EIP:0x080690D7Filesystem1018876582.
807452168427EXEC:galeonMemory1018876582.
807494168412PAGEALLOCORDER:0Memory1018876582.
807569168412STARTPAGEWAITSchedchange1018876582.
807645157919IN:1579;OUT:1684;STATE:2Filesystem1018876582.
807656157920POLL:3;MASK:0Filesystem1018876582.
807662157920POLL:10;MASK:0Filesystem1018876582.
807666157920POLL:4;MASK:0Filesystem1018876582.
807670157920POLL:5;MASK:0Filesystem1018876582.
807675157920POLL:6;MASK:0Filesystem1018876582.
807680157920POLL:13;MASK:0Memory1018876582.
807688157912PAGEFREEORDER:0Memory1018876582.
807699157912PAGEFREEORDER:0Syscallexit1018876582.
80770215797Syscallentry1018876582.
807731157912SYSCALL:gettimeofday;EIP:0x0805816DSyscallexit1018876582.
80773315797Syscallentry1018876582.
807758157912SYSCALL:ioctl;EIP:0x08056F30Thebiggersurprisewasthe3.
5swait-for-CPUsincenothingelsewashappeningonthesystem.
Theanswerliesinasynchronouslycooperatingprocesses.
Indeed,Galeonstartsanumberofhelpersub-processesandusuallyblocksawaitingananswer.
Thisshowsupnicelyinthewait-for-processcategoryalongwiththesub-processid.
However,GaleonisacomplexgraphicalapplicationwhichusestheXwindowserver[5]forgraphicsrendering.
TheXserverisoptimisedtoavoidblockingusingasynchronousinteractionswheneverpossible.
Thisallowsbatchingrenderingoperations,butalsoparallelexecutionwhentheapplicationandtheXserverarerunningonseparateprocessors.
4.
5.
2.
IdentifyingServersWorkingonbehalfofClientsThelastanalysisrefinementwasdecomposingtheexecutiontimeofaserverprocess,labeledCPU,intoCPU-for-Clientcomponents.
Atypicalclientserverrelationshipmaybeidentifiedbythepatternofaserverlisteningonasocket,aclientconnectingtothesocket,andtheserveracceptingtheconnection.
13Theserverandclientthencommunicatebyreadingandwritingtothesocketuntiltherequestisservedandtheservergoesidle.
Theserverthenawaitsnewconnectionsorrequestsfromclientsusingaselectorpollsystemcall.
Assumingthataserver,afteritsinitializationphase,isalwaysworkingonbehalfofaclient,theservergetsintoCPU-for-Clientxwheneveritacceptsaconnectionorreadsfromaconnectiontoclientx.
Thissimpleheuristicproducesinterestingresultsinmanycases.
However,itdoesnotmodelcaseswhererequestsaregrouped.
Forexample,ifnineclientsmakeasynchronousrequestswhicharesimplybufferedbytheserver,andupontherequestfromthetenthclient,allbufferedrequestsareexecuted,thetimespentservingthelastisoverestimated.
Complexinteractionsbetweenclientsandserversareevenmoredifficulttoautomaticallyrecognizeinordertoattributeservingtimeappropriatelytoclients.
Forinstance,GaleonconnectstotheGNOMEconfigurationdaemon,Gconfd,toexpressinterestinanumberofconfigurationparameters,leavingacallbacksocketaddress.
Gconfdthenconnectstothiscallbackaddressandprovidesthecurrentvaluesfortheparameters,andeventuallyupdatestotheseparameters.
GaleonthenappearstobeservingparameterchangerequestsfromclientGconfd.
4.
6.
DiscussionTheLinuxTraceToolkitisoneofthebestsolutionsforindepthlowlevelsystemstudies.
Itiswidelyusedinembeddedandreal-timesystems,andcouldeasilybeusedinoperatingsystemsandreal-timeprogrammingcourses.
Itcouldproveveryusefultodevelopersandsystemadministratorswhenstronginteractionsbetweenapplicationsandtheoperatingsystemareencountered.
ItisstillpendingintegrationinthemainLinuxkernel.
Thus,akernelpatchandrecompilation,oraspecialkernel,arerequiredontopoftheLTTtracingdaemonandtracingtoolspackage.
5.
ConclusionTraditionaltoolslikeGcov,Gprof,andStracehaveexistedforyears,withmanyvariantsintermsofpresentation(graphs,colors,evenanimations)ondifferentplatforms.
Theserepresentmature,simpleandefficientwaystoobtainusefulinformation.
Theappearanceofhardwareperformancecountersinmicroprocessorsinthe1990swastheenablingtechnologyforsamplingtoolslikeOprofileprovidingmoreprecisesamplingandseveraltypesofsamples(cyclecount,cachemisses.
.
.
).
Anotherimportantenablingtechnologyinthefieldofsystemperformanceanalysisisthesourcecodeavailabilityoffreesoftware.
Theprintstatementspreviouslyinsertedbyprogrammersonlyintheirapplicationscannowbeplacedeverywhereinthesystem.
TheLinuxTraceToolkitprovidesthefirst60interestingtracingpoints.
Moreimportantly,itsimplifiestheinsertionofmoretracepointswhichwillenableanumberofsophisticatedanalysistoolsneededtobetterunderstand,studyandimprovecomplexdistributedsystems.
Bibliography[1]FreeSoftwareFoundation,"GNUCompilerCollection",http://gcc.
gnu.
org/.
[2]FreeSoftwareFoundation,"BinaryTools",http://sources.
redhat.
com/binutils/.
[3]JohnLevonetal.
,"Oprofile",http://oprofile.
sourceforge.
net.
[4]KarimYaghmour,"LinuxTraceToolkit",http://www.
opersys.
com/LTT.
14[5]XFree86Project,"TheXFree86Project",http://www.
xfree86.
org.
[6]JamesLarusetal.
,"EEL:AnExecutableEditingLibrary",http://www.
cs.
wisc.
edu/~larus/eel.
html.
[7]FreeSoftwareFoundation,"TheGNUProjectDebugger",http://sources.
redhat.
com/gdb/.
[8]Jean-LoupGailly,"GNUCompressionUtilityGzip",http://www.
gnu.
org/software/gzip/gzip.
html.
[9]WichertAkkermanetal.
,"Strace",http://www.
liacs.
nl/~wichert/strace/.
[10]RichardJ.
Mooreetal,"DynamicProbes",http://www.
ibm.
com/developerworks/oss/linux/projects/dprobes/.
[11]K.
YaghmourandM.
R.
Dagenais,"Measuringandcharacterizingsystembehaviorusingkernel-leveleventlogging",ProceedingsoftheUSENIXAnnual2000TechnicalConference,SanDiego,California,USA,June2000[12]MarcoPesentiGrittietal,"Galeon:theWeb,onlytheWeb",http://galeon.
sourceforge.
net/.
15
diyvm怎么样?diyvm商家VPS主机均2GB内存起步,三个地区机房可选,使用优惠码后每月69元起;DiyVM独立服务器开设在香港沙田电信机房,CN2线路,5M带宽,自动化开通上架,最低499元/月,配置是L5630*2/16G内存/120G SSD硬盘。DiyVM是一家成立于2009年的国人主机商,提供的产品包括VPS主机、独立服务器租用等,产品数据中心包括中国香港、日本大阪和美国洛杉矶等,...
点击进入亚云官方网站(www.asiayun.com)公司名:上海玥悠悠云计算有限公司成都铂金宿主机IO测试图亚洲云Asiayun怎么样?亚洲云Asiayun好不好?亚云由亚云团队运营,拥有ICP/ISP/IDC/CDN等资质,亚云团队成立于2018年,经过多次品牌升级。主要销售主VPS服务器,提供云服务器和物理服务器,机房有成都、美国CERA、中国香港安畅和电信,香港提供CN2 GIA线路,CE...
在刚才更新Vultr 新年福利文章的时候突然想到前几天有网友问到自己有在Vultr 注册账户的时候无法用支付宝付款的问题,当时有帮助他给予解决,这里正好顺带一并介绍整理出来。毕竟对于来说,虽然使用的服务器不多,但是至少是见过世面的,大大小小商家的一些特性特征还是比较清楚的。在这篇文章中,和大家分享如果我们有在Vultr新注册账户或者充值购买云服务器的时候,不支持支付宝付款的原因。毕竟我们是知道的,...
ts60.com为你推荐
嘉兴商标注册我想注册个商标怎么注册啊?比肩工场比肩成局 什么意思rawtools相机中的RAW是什么意思?月神谭给点人妖。变身类得小说。porntimesexy time 本兮 MP3地址www.765.com下载小说地址javbibitreebibi是什么牌子的www.99vv1.comwww.in9.com是什么网站啊?hao.rising.cn我一打开网页就是瑞星安全网站导航,怎么修改?www.884tt.com刚才找了个下电影的网站www.ttgame8.com,不过好多电影怎么都不能用QQ旋风或者是迅雷下在呢?
国外vps租用 新网域名管理 堪萨斯服务器 yardvps 息壤主机 diahosting 512m 回程路由 华为云主机 网站被封 java空间 阿里校园 申请网页 申请网站 路由跟踪 114dns 杭州电信 汤博乐 七十九刀 免费赚q币 更多