Software Performance Analysis

Michel R. Dagenais (michel.dagenais@polymtl.ca)
Dept. of Computer Engineering, Ecole Polytechnique, Montreal, Canada

Karim Yaghmour (karim@opersys.com)
Opersys, Montreal, Canada

Charles Levert (Charles.Levert@ericsson.ca)
Makan Pourzandi (Makan.Pourzandi@ericsson.ca)
Ericsson Research, Montreal, Canada

1. Abstract

The key to speeding up applications is often understanding where the elapsed time is spent, and why. This document reviews in depth the full array of performance analysis tools and techniques available on Linux for this task, from the traditional tools like gcov and gprof, to the more advanced tools still under development like oprofile and the Linux Trace Toolkit. The focus is more on the underlying data collection and processing algorithms, and their overhead and precision, than on the cosmetic details of the graphical user interface frontends.
2. Introduction

Code coverage tools (e.g., Gcov [1]) compute the exact number of times each statement is executed, and thus can easily pinpoint frequently executed sections. However, code coverage tools carry a non-negligible execution overhead, and more importantly do not measure the relative execution cost of different statements. Time domain sampling tools (e.g., Gprof [2]) may therefore more accurately report large contributors to the execution time by sampling the execution address at regular time intervals. Moreover, the overhead cost may be traded off for accuracy by varying the time interval.

Modern applications and systems present new challenges as their performance may be strongly affected by cache hit rates, disk accesses, inter-process communications, synchronization waits, and operating system overhead. This is where new tools based on hardware performance counter sampling (e.g., Oprofile [3]), or detailed analysis of low overhead operating system traces (e.g., Linux Trace Toolkit [4]) can bring more information. They may for instance indicate code sections with an exceedingly high level of cache misses or branch mispredictions, or estimate the execution time of the X window server [5] on behalf of each of its clients.

Section 3 discusses the algorithms and techniques used to collect, process and store performance data. The following section describes a number of specific performance analysis tools, from commonly used tools to more recent ones, with a longer section on possible extensions and customisations to the Linux Trace Toolkit. Section 5 is the conclusion.
3. Performance Data Collection, Processing and Storage

In a simple model, a piece of performance data is obtained, and queued or stored for later display. In practice, the generated volume of data may be such that some processing may be required at data collection time to synthesize more compact information. For example, in a large scientific computer cluster of 1024 computers, each time a program function is called, the entry and exit time may be collected. Sending these values directly to a central server would rapidly saturate the network and carry an unbearable overhead. Instead, the entry and exit time may be subtracted and accumulated locally in each function, in order to produce a summary of the processing time spent in each program function. Individual programs, upon exit, would send the execution time histogram to a central server for offline preparation of performance reports.

This section examines the key aspects of data collection and processing: obtaining the data points, accumulating values in temporary storage, and sending the values to storage for offline processing and display.
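The local accumulation described above can be sketched as follows. This is a minimal illustration in Python, not code taken from any of the tools reviewed here; the names `timed`, `totals`, `calls` and `report` are hypothetical, and the summary is simply returned as a dictionary instead of being sent to a server.

```python
import time
from collections import defaultdict

# Per-function accumulators kept locally in the process (hypothetical names).
totals = defaultdict(float)   # function name -> accumulated seconds
calls = defaultdict(int)      # function name -> number of calls


def timed(func):
    """Wrap func so that (exit time - entry time) is accumulated locally."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            totals[func.__name__] += time.perf_counter() - start
            calls[func.__name__] += 1
    return wrapper


@timed
def work(n):
    return sum(i * i for i in range(n))


def report():
    """Compact summary that a program would send once, upon exit."""
    return {name: (calls[name], totals[name]) for name in totals}


for _ in range(3):
    work(1000)
summary = report()
```

Only one small record per function leaves the process, whatever the number of calls, which is the point of aggregating at collection time.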
3.1. Data Collection

Several activities, including debugging and program understanding, require data collection just like performance analysis, and use the same techniques. The familiar debugging process is used to illustrate common data collection methods.

There is a compromise between the execution overhead and the amount and precision of the data collected. Sometimes the overhead is unacceptable because the resulting program execution time becomes too large, or the program performance or the real-time program behavior is skewed by the data collection. Storing or summing a value in a variable has minimal overhead. Calling another function for each function entry is a large overhead for small functions. Writing performance data to a file upon entering a basic block (i.e., a linear code section, without any jump, whose instructions are thus always executed together and the same number of times) would significantly change the execution performance of a program.
3.1.1. Manual Program Instrumentation

Adding a print statement is often the easiest and most flexible way to obtain the desired information. All the data structures and functions accessible to the program may be used to generate the data printed. The Linux kernel code, where novice programmers are uncommon, contains no less than 80000 printk statements. Several common tools (netstat, vmstat, mtrace) use data gathering statements inserted in the kernel or the C library.
3.1.2. Automatic Program Instrumentation

Compilers and preprocessing tools may be used to systematically instrument a program by adding data gathering statements to the code they would normally generate. Typical locations for inserting statements include function entry and exit, and the entry of linear code sections (basic blocks). For example, the GNU Compiler Collection, GCC [1], offers several code instrumentation options to insert a call to function mcount upon entering each function (option -pg, used for gprof), and to insert calls to hooks upon entering linear code sections and jumping to other code sections (options -ftest-coverage and -fprofile-arcs, which may be used for gcov).
3.1.3. Binary Instrumentation

In some situations, the source code is not available for source level instrumentation and recompilation, or it is simply inconvenient. There are tools to modify binary executable files in order to insert data gathering statements (e.g., EEL [6]).

Ideally, the existing binary instructions should be shifted a few bytes higher in the address space each time instrumentation instructions need to be inserted. Doing so, however, changes all the addresses in the program and all jumps and calls must be found and adjusted accordingly. Unless the program section containing instructions does not contain any data (e.g., jump tables) and all instructions have the same size, this rapidly gets very tricky.

Another, easier technique is code patching. The instruction(s) located where instrumentation code should be inserted are replaced by a jump to a new code section appended to the program, containing the instrumentation code and the replaced instruction(s). Only the one or two instructions overwritten by the jump need to be relocated (address adjusted) per instrumentation point.

A simpler form of code patching is used in most debuggers to insert breakpoints. The instruction where instrumentation code should be inserted is saved away and replaced by a soft interrupt instruction. Unlike jump instructions, which are among the longest and may overwrite a few shorter instructions, interrupt instructions are usually the smallest and occupy a single byte on i386 processors. Once the interrupt instruction is reached, the interrupt handler executes the instrumentation code, puts back the saved instruction and executes it in single stepping mode. This way, another interrupt is generated after its execution, at which point the interrupt instruction is placed back for the next time the program reaches this point. With this technique, one soft interrupt and one single stepping interrupt are required each time an instrumentation point is reached. The address of the instruction causing the interrupt is used to determine the instrumentation point reached and the associated code to execute. This least invasive technique (no instructions to relocate) also carries the largest overhead. Interrupts are processed in microseconds at best while jumps are executed in nanoseconds, as shown below with GDB [7].
    # Run a program under gdb
    [gzip-1.2.4a]$ cat gdb.run
    file gzip.plain
    run /dev/null
    quit
    [gzip-1.2.4a]$ time gdb -command=gdb.run >gdb.out
    23.30user 1.53system 0:30.75elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (1213major+875minor)pagefaults 0swaps

    # Run a program under GDB with a breakpoint. The breakpoint is executed
    # 6933680 times, each time printing a "." before continuing.
    [gzip-1.2.4a]$ cat gdb.commands
    file gzip.plain
    break deflate.c:679
    commands 1
    silent
    print "."
    continue
    end
    run /dev/null
    quit

    # Run GDB with the command file
    [gzip-1.2.4a]$ time gdb -command=gdb.commands >gdb.out
    3905.78user 1535.75system 2:02:48elapsed 73%CPU

3.1.4. Sampling

Sampling is often used to predict at low cost the outcome of an election by asking several hundred people their voting preferences instead of calling each of the millions of eligible voters.
Similarly, instead of recording the execution time for each function, the program may be interrupted at regular intervals (e.g., every 10 milliseconds) to sample the address of the currently executed instruction. The probability of the address being within a specific function is proportional to the fraction of time spent in that function (under randomness assumptions).

Simple sampling based programs like gprof use imprecise virtual timers which are checked each time the operating system scheduler is run. In programs with strong interactions with the scheduler (e.g., frequently polling for I/O or calling yield), the results may be severely skewed. Newer tools like oprofile take advantage of hardware performance counters which may count different events (cycles, cache misses, branches taken/not taken...), and generate an interrupt after a certain count. This allows easily varying the sampling interval, trading off overhead for accuracy, and measuring several different parameters apart from execution time. The low achievable overhead enables measuring the cache misses histogram for a program, which would be impractical using code instrumentation.
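The proportionality argument, and the randomness assumption it relies on, can be illustrated with a small simulation. This is only a sketch: the program is a hypothetical loop with three phases (`parse`, `hash`, `write`), all durations are invented, and the "sampler" simply looks up which phase is active at each multiple of the sampling interval.

```python
from collections import Counter

# Hypothetical program: a loop repeating three phases, durations in microseconds.
LOOP = [("parse", 700), ("hash", 2300), ("write", 1000)]   # 4000 us per iteration
ITERATIONS = 4001


def sample(interval_us):
    """Count which function is running at each multiple of interval_us."""
    segments = []            # (end time, function) for every phase executed
    now = 0
    for _ in range(ITERATIONS):
        for name, duration in LOOP:
            now += duration
            segments.append((now, name))
    counts = Counter()
    index = 0
    for t in range(interval_us, now + 1, interval_us):
        while segments[index][0] < t:      # skip phases that ended before t
            index += 1
        counts[segments[index][1]] += 1
    return counts


unbiased = sample(4001)   # interval drifts by 1 us per loop iteration
biased = sample(4000)     # interval equal to the loop period
```

With the 4001 us interval, the 4000 samples split as 700, 2300 and 1000, exactly the proportions of time spent in each phase. With the 4000 us interval, every sample falls at the same point of the loop and all 4001 samples are attributed to `write`, which shows how a sampling period correlated with the program activity skews the result.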
3.2. Performance Data Processing

Once a data element is obtained, it must find its way to the results file. Tracing systems usually add a few fields (record number, time, address...) to the data element and append it to a buffer which is written to disk once full. This way the complete information is retained but the disk may fill at an alarming rate. Tracing the system calls for a Web browser may generate several megabytes per minute.

The alternative to simple buffering is to process and aggregate the data elements as they are generated. Typically a counter variable will be associated with each function or basic block, and the events occurring there will be counted as they happen. At the program end, or at regular intervals, the content of these counters is written to disk. In some cases, the number of possible locations is much larger than the number actually used, and hash tables are used. This is the case when counting the calls to a function by caller. Very few functions among all the functions in a program actually call a specific function; using hash tables is thus more efficient than allocating a counter for each possible caller-callee combination.
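The space argument for hash tables can be made concrete with a short sketch. The helper `record_call` and the figure of 1000 functions are hypothetical; the call counts reuse the `file_read` callers reported for gzip in section 4.2.

```python
from collections import Counter

# Sparse table: only caller-callee pairs that actually occur get an entry.
call_counts = Counter()


def record_call(caller, callee):
    """What an instrumentation hook would do on each function entry."""
    call_counts[(caller, callee)] += 1


# Hypothetical program size, used only to compare storage requirements.
N_FUNCTIONS = 1000

for _ in range(1957):
    record_call("fill_window", "file_read")
record_call("lm_init", "file_read")

dense_slots = N_FUNCTIONS * N_FUNCTIONS   # one counter per possible pair
sparse_slots = len(call_counts)           # one counter per observed pair
```

Under these assumptions a dense array would reserve a million counters, while the hash table holds two entries.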
3.3. Performance Data Transfer and Storage

In low volume applications, events are generated every few seconds or more, and may be written as they happen to a file descriptor. This file descriptor may be connected to a file, a socket or even a graphical viewer which updates its display in real time.

For larger volumes of events, events should be appended in memory to a large buffer, which is written to disk in very large chunks. When the instrumented program is different from the logging process writing to disk, maximum performance may be obtained through the use of shared memory between the instrumented program and the logging daemon. The Linux Trace Toolkit uses shared memory between the instrumented kernel and the tracing daemon to achieve maximum performance.

The performance parameters being measured (disk throughput, execution time, cache hits) will strongly influence which overhead should be minimized. If disk access times are to be measured, the impact of the performance data gathering on disks should be reduced, for instance by writing to a separate disk, on a separate disk controller, or even across the network on a separate computer. If only one disk is available, it is also possible to reduce the disk requirements at the expense of execution time using data compression. Most systems perform data compression offline for archival purposes but not during the live tracing.
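A minimal sketch of buffered logging with optional live compression is given below. The class `BufferedEventLog`, the 64 KB chunk size and the event format are assumptions made for illustration; `zlib` stands in for whatever compressor a real tracer might use, and in-memory `BytesIO` objects stand in for files.

```python
import io
import zlib


class BufferedEventLog:
    """Append events to an in-memory buffer and write it out in large chunks."""

    def __init__(self, sink, chunk_size=64 * 1024, compress=False):
        self.sink = sink                # any object with a write(bytes) method
        self.chunk_size = chunk_size
        self.buffer = bytearray()
        self.writes = 0                 # number of write calls issued to the sink
        # Optional compression trades CPU time for fewer bytes written.
        self.compressor = zlib.compressobj() if compress else None

    def _write(self, data):
        if data:
            self.sink.write(data)
            self.writes += 1

    def log(self, event):
        self.buffer += event
        if len(self.buffer) >= self.chunk_size:
            self.flush()

    def flush(self):
        data = bytes(self.buffer)
        self.buffer.clear()
        if self.compressor is not None:
            data = self.compressor.compress(data)
        self._write(data)

    def close(self):
        self.flush()
        if self.compressor is not None:
            self._write(self.compressor.flush())


plain, packed = io.BytesIO(), io.BytesIO()
logs = [BufferedEventLog(plain), BufferedEventLog(packed, compress=True)]
for log in logs:
    for i in range(10000):
        log.log(b"event %06d read fd=3\n" % i)
    log.close()
```

In this run the 10000 events reach the uncompressed sink in four large writes instead of 10000 small ones, and the compressed stream is smaller at the cost of extra processing in the traced path.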
4. Review of Performance Analysis Tools

The tools reviewed in this section are all freely redistributable and cover the full spectrum of widely used techniques for software performance analysis. The simpler and more widely deployed tools are presented first. Throughout this section, the GNU compression utility gzip [8] will be used as a sample test case, compressing a 64MB log file to 1.6MB. It is used to compare the overhead of different performance analysis tools.

    # Run the normal program
    [gzip-1.2.4a]$ time ./gzip.plain /dev/null
    24.09user 0.72system 0:28.16elapsed 88%CPU

4.1. Code instrumentation and results processing with GCC/GCOV

Gcov is part of the standard GNU C Compiler (GCC) distribution [1].
When compiling a program foo.c with the -ftest-coverage and -fprofile-arcs options, a list of the basic blocks and branches is generated in files foo.c.bb and foo.c.bbg, and the code is instrumented to count the execution frequencies of each basic block and branch and to write out the results upon exit. The program may then be executed; it sums its results in file foo.c.da, which can later be analyzed with gcov to print the source code along with execution frequencies. This tool has a very low overhead, a single add instruction for each basic block and each branch execution.

The basic block execution frequencies are used to detect frequently used code sections, where optimisation coding efforts may be worthwhile. Execution frequencies for basic blocks and branches may be used for test coverage analysis. Any section with a null execution count has not been covered by the tests used when executing the program. Finally, branch counts may be used to compute branch probabilities, which may be used in subsequent compilations to generate better code for the more frequent case.
Another GCC option, -ax, produces similar information with more control over the generated data but without a post-processing tool like Gcov. It can produce function, basic block and branch execution counts, or a full trace of function and basic block entries. Furthermore, it is possible to specify functions to exclude/include from this profiling/tracing.

Figure 1: Total elapsed time for compressing a 64MB file with gzip, unmodified, with sampling every 0.01s, with instrumentation at function entries, and with instrumentation at basic block entries. (Bar chart; bars: Normal, Sampling, Function, Basic block; axis from 0 to 35.)
As an example, the GNU compression program gzip was recompiled with the -ftest-coverage and -fprofile-arcs options and run on a 64MB log file. The instrumented gzip requires 5.4% more elapsed time than the normal gzip. At the end of the execution, a file with the .da extension is created for each source code file (e.g., deflate.da for deflate.c). These files may then be used by gcov to produce annotated source code files (e.g., deflate.c.gcov for deflate.c).
    # Run the program compiled and linked with options
    # -ftest-coverage and -fprofile-arcs
    [gzip-1.2.4a]$ time ./gzip.gcov /dev/null
    27.04user 0.63system 0:31.36elapsed 88%CPU

    # Files .bb and .bbg are produced at compilation time, files .da
    # at the program exit.
    [gzip-1.2.4a]$ ls -als deflate.*
     4 -rw-rw-r-- 1 dagenais dagenais  1100 avr  9 15:34 deflate.bb
     4 -rw-rw-r-- 1 dagenais dagenais  2048 avr  9 15:34 deflate.bbg
    32 -rw-r--r-- 1 dagenais dagenais 29179 aoû 13  1993 deflate.c
     4 -rw-rw-r-- 1 dagenais dagenais   648 avr 10 14:10 deflate.da
    20 -rw-rw-r-- 1 dagenais dagenais 18856 avr 10 12:01 deflate.o

    # Perform the analysis on one of the source files.
    # Gcov reads the .bb .bbg .da and .c files and produces .c.gcov
    [gzip-1.2.4a]$ gcov -b -f deflate.c
    80.77% of 26 source lines executed in function lm_init
    60.00% of 15 branches executed in function lm_init
    40.00% of 15 branches taken at least once in function lm_init
    No calls in function lm_init
    ...
    Creating deflate.c.gcov.

    [gzip-1.2.4a]$ less deflate.c.gcov
    ...
               5    while (lookahead != 0) {
    branch 0 taken = 0%
         6933680        INSERT_STRING(strstart, hash_head);
         6933680        prev_length = match_length, prev_match = match_start;
         6933680        match_length = MIN_MATCH-1;
         6933680        if (hash_head != NIL && prev_length ... lookahead) match_length = lookahead;
    branch 0 taken = 0%
    branch 1 taken = 100%
    ...
4.2. Time Based Sampling with GPROF

Gprof is part of the GNU binutils package [2]. When a program foo.c is compiled with the -pg option, a call to function mcount is added to each function entry. Moreover, the link step uses versions of the standard libraries compiled with -pg when available, and adds a module to initialize the sampling at the beginning and to write out the results at the end. Optionally, compilation flag -g may be specified to obtain source code line addresses (usually used by debuggers), and -a to simultaneously compute basic block execution frequencies.
When the program executes, an interrupt handler is called every clock tick (10 milliseconds); it takes the address of the interrupted instruction and adds 1 to the corresponding bucket in a large array. Typically this array is as large as the process instructions area and uses a 4-byte integer counter for each 4-byte section in the program. Each time a function is called, the internal mcount function gets called, extracts the addresses of the caller and the callee, and increments the number of calls for this caller-callee pair in a large hash table. When the program exits, the sample counts in the large array, and the caller-callee counts from the hash table, are written to a results file named gmon.out. When the -a compilation flag is specified, basic block execution counts are also computed and appended to the results file.
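The bucket array can be sketched as follows. This is a simplified model, not gprof's actual implementation: the text segment bounds `LOW_PC` and `HIGH_PC`, the three-entry symbol table and the sampled addresses are all assumptions chosen for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical text segment bounds; one counter per 4-byte section.
LOW_PC = 0x08048000
HIGH_PC = 0x0805C000
BUCKET_BYTES = 4
histogram = [0] * ((HIGH_PC - LOW_PC) // BUCKET_BYTES)

# Hypothetical symbol table: function start addresses, sorted, with names.
STARTS = [0x0804B120, 0x0804B4C0, 0x0804EA10]
NAMES = ["fill_window", "deflate", "updcrc"]


def record_sample(pc):
    """Clock tick handler: add 1 to the bucket of the interrupted address."""
    if LOW_PC <= pc < HIGH_PC:
        histogram[(pc - LOW_PC) // BUCKET_BYTES] += 1


def samples_by_function():
    """Post-processing: map each non-empty bucket back to a function."""
    counts = Counter()
    for bucket, hits in enumerate(histogram):
        if hits:
            address = LOW_PC + bucket * BUCKET_BYTES
            index = bisect_right(STARTS, address) - 1
            if index >= 0:
                counts[NAMES[index]] += hits
    return counts


for pc in [0x0804B130, 0x0804B134, 0x0804B130, 0x0804B4C4, 0x0804B4C8, 0x0804EA10]:
    record_sample(pc)
counts = samples_by_function()
TICK_SECONDS = 0.01
seconds = {name: hits * TICK_SECONDS for name, hits in counts.items()}
```

The handler itself only performs a subtraction, a division and an increment; the mapping from addresses to function names is deferred to the offline report.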
The Gprof program may then be used to read results files, together with the program executable and source code files, to produce different reports. The flat profile report prints for each function (and even each source code line) the number of associated samples and the corresponding execution time probably spent in that location. The functions may be printed in decreasing order of execution time, thus listing on top the most time consuming functions, which are the best candidates for optimisation. The time associated with internal function mcount can be subtracted to compensate for the overhead associated with the profiling. The annotated source code report is similar to the output produced by Gcov, listing the execution frequencies for each basic block.

Finally, the call graph report is a superset of the flat profile and adds information about callers and called functions. For each function, it lists the total time spent in the function itself (number of samples times sampling period), and the calling functions. For each calling function, the number of calls and the proportion of the total are listed; this information is obtained precisely through the calls to mcount, where the caller is recorded in the call counts. The proportion of calls from a caller is used to estimate the time spent in the current function by caller. The underlying assumption is that calls to a function take the same time on average irrespective of the caller. This is clearly not the case as, for instance, a function draw_icon may call scale_bitmaps for very small images while function scale_background may call scale_bitmaps for images 500 times larger.

The execution time spent for each caller is propagated to the callers recursively, starting from the leaf functions. This adds another list for each function in the call graph report, the time spent in each function (and its children) called. The total time spent in the function itself and all its called children is also part of the call graph report. For recursive calls, a separate entry is created for the cycle which lists callers and callees from outside the cycle; then, entries for members of the cycle list callers and callees from within the cycle.
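The effect of the equal-cost assumption can be seen with the draw_icon and scale_background example. The sketch below uses invented numbers (99 calls of 1 ms from draw_icon, one call of 500 ms from scale_background), and `attribute_by_call_count` is a hypothetical helper that mimics the proportional estimate described above.

```python
def attribute_by_call_count(total_time, calls_by_caller):
    """Split total_time among callers in proportion to their call counts."""
    total_calls = sum(calls_by_caller.values())
    return {caller: total_time * count / total_calls
            for caller, count in calls_by_caller.items()}


# Hypothetical workload: time really spent in scale_bitmaps for each caller (ms).
true_time_ms = {"draw_icon": 99 * 1.0, "scale_background": 1 * 500.0}
calls = {"draw_icon": 99, "scale_background": 1}

estimate_ms = attribute_by_call_count(sum(true_time_ms.values()), calls)
```

Under these assumptions the estimate charges about 593 ms to draw_icon and about 6 ms to scale_background, while the true figures are 99 ms and 500 ms: the total is right, but its split by caller is not.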
Gprof and similar tools offer an excellent compromise between low overhead and detailed, usually reliable information. They may, however, be inadequate or even misleading in more involved cases where: the time spent in a function is very different depending on the caller. Measuring it adequately would require sampling not only the currently executing instruction address but also the callers' addresses on the stack, possibly for several levels. The call graph report could then show the time spent in a function based on chains of callers, but this would be more difficult to present clearly, and the number of samples associated with each chain is likely to be too small to be reliable; the sampling interval is correlated with the program activity. When a program continuously makes blocking system calls, there is a chance that the scheduler often gives control to the program and activates the virtual timer used for sampling when the program is in the same area, which gets overrepresented in the samples; the time spent not executing the program, waiting for the disk or for other processes, is of interest. In that case, other sources of information should be used to annotate the reports with information about time spent waiting, to obtain the total elapsed time, instead of just the execution time, in each function; more information is required about why time is spent in each section (cache faults, page faults...).
    # Run the program compiled without -pg and linked with -pg
    # to measure the low overhead of sampling alone.
    [gzip-1.2.4a]$ time ./gzip.pg-nomcount /dev/null
    24.30user 0.62system 0:28.30elapsed 88%CPU

    # Run the program compiled and linked with option -pg
    # Overhead is slightly larger but it provides call graph info
    [gzip-1.2.4a]$ time ./gzip.pg-static /dev/null
    25.79user 0.51system 0:29.88elapsed 88%CPU

    # File gmon.out is produced when the program exits
    [gzip-1.2.4a]$ ls -als gmon.out
    188 -rw-rw-r-- 1 user user 186366 avr 10 14:07 gmon.out

    # Use gprof to interpret gmon.out, using debugging symbols
    # information from the executable file
    [gzip-1.2.4a]$ gprof gzip.pg-static >gprof.out

    # Look at the results, first the flat profile with the percentage of
    # execution time for each function in decreasing order.
    [gzip-1.2.4a]$ less gprof.out
    Each sample counts as 0.01 seconds.
      %   cumulative    self               self     total
     time   seconds    seconds    calls  ms/call  ms/call  name
    27.13      7.53       7.53     1957     3.85     5.95  fill_window
    23.20     13.97       6.44        1  6440.00 19205.66  deflate
    14.84     18.09       4.12     1958     2.10     2.10  updcrc
    12.16     21.46       3.38                             do_scan
     8.38     23.79       2.33                             short_loop
     3.78     24.84       1.05                             __mcount_internal
     3.13     25.71       0.87                             read
    ...

    # Then the call graph. Function updcrc uses 4.12s and calls none.
    # It is called from zip and file_read. Function file_read is
    # called from lm_init and fill_window. It uses about no time but
    # calls updcrc for about 4.12s. This 4.12s is almost all attributed
    # to caller fill_window (1957/1958 calls).
    ...
                   0.00    0.00       1/1958        zip [3]
                   4.12    0.00    1957/1958        file_read [8]
    [7]    15.6    4.12    0.00    1958             updcrc [7]

                   0.00    0.00       1/1958        lm_init [27]
                   0.00    4.12    1957/1958        fill_window [6]
    [8]    15.6    0.00    4.12    1958             file_read [8]
                   4.12    0.00    1957/1958        updcrc [7]
    ...
4.3. Hardware Counted Events Sampling with Oprofile

Oprofile [3] is very similar to Gprof; it even produces compatible results files. The two important differences are that it collects samples based on hardware performance counter interrupts, and that it operates on the complete system (gathering samples continuously for all programs executing on the system). Hardware performance counters are now available in all powerful modern microprocessors and may be used to count various events of interest including clock cycles, L1 or L2 cache misses, branches, branches taken, and pipeline stalls. The periodic sampling is obtained by requesting an interrupt when the counter reaches a certain count.

Oprofile does not use program instrumentation (i.e., placing calls to mcount in each function). Thus, no recompilation is necessary and the overhead is lower, but call graph reports cannot be generated. The only requirement is that debugging information is available to correlate the address of samples with program functions and line numbers. Because of its low overhead, it is not uncommon to keep Oprofile running for extended periods on running server systems.

The whole system profiling feature is particularly useful to see which proportion of the time is spent in each program, thus discovering who is using up all the resources. It is also useful when a single task involving several programs is executed on a system. For example, when analysing the startup time of a sophisticated Web browser, it is possible to see the time spent in functions in the kernel, the browser, the window manager, the X server and various daemons.

In some rare cases, the information produced by hardware performance counter sampling tools may be biased. Indeed, a real-time process using the same period as the sampling tool may have its samples grossly under or over represented. A simple solution is to randomly vary the sampling period around a target average.
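The bias and its remedy can be reproduced with a short simulation. Everything here is hypothetical: a real-time task that runs during the first 2 ms of every 10 ms period (20% of the time), a sampler using either a fixed 10 ms period or a period drawn uniformly between 5 and 15 ms, and a fixed random seed so that the run is repeatable.

```python
import random

PERIOD_US = 10_000      # hypothetical real-time task period
BUSY_US = 2_000         # the task runs during the first 2 ms of each period


def task_running(t_us):
    return t_us % PERIOD_US < BUSY_US


def sampled_fraction(next_interval, n_samples, start_us=0):
    """Fraction of samples that find the real-time task running."""
    t, hits = start_us, 0
    for _ in range(n_samples):
        t += next_interval()
        hits += task_running(t)
    return hits / n_samples


# Fixed period equal to the task period: the result depends only on the phase.
fixed_hit = sampled_fraction(lambda: PERIOD_US, 20_000, start_us=500)
fixed_miss = sampled_fraction(lambda: PERIOD_US, 20_000, start_us=5_000)

# Period varied randomly around the same 10 ms average.
rng = random.Random(0)
jittered = sampled_fraction(lambda: rng.randint(5_000, 15_000), 20_000)
```

With the fixed period the task is seen either in every sample or in none, depending on where the first sample happens to fall. With the randomized period the measured fraction comes out close to the true 20%.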
    # Run the oprofile graphical user interface, click to start the daemon
    [gzip-1.2.4a]$ oprof_start

    # Run the unmodified program (with debugging symbols present)
    [gzip-1.2.4a]$ time ./gzip.plain /dev/null
    24.40user 0.87system 0:27.67elapsed 91%CPU

    # Stop the profiling daemon and extract statistics
    [gzip-1.2.4a]$ oprofpp -l -i gzip.plain >gzip.oprof

    # Look at the results
    [gzip-1.2.4a]$ less gzip.oprof
    Cpu type: PII
    Cpu speed was (MHz estimation) : 233.140000
    Counted "clocks processor is not halted", sampling interval 200000
    vma       samples  %-age      symbol name
    08048100  0        0          _start
    ...
    08050504  50       0.059034   long_loop
    080504e1  68       0.0802862  limit_ok
    080504c1  111      0.131055   longest_match
    08050572  160      0.188909   the_end
    08050547  171      0.201896   mismatch
    0804c550  988      1.16651    ct_tally
    0804c930  1100     1.29875    send_bits
    0804c6d0  1282     1.51363    compress_block
    0805aad0  1465     1.7297     memcpy
    0805050d  6905     8.15259    short_loop
    08050522  12101    14.2874    do_scan
    0804ea10  13821    16.3182    updcrc
    0804b4c0  21405    25.2724    deflate
    0804b120  25020    29.5406    fill_window

4.4. System Calls Tracing using Strace

Linux and POSIX systems offer the ptrace system call to allow one process to monitor another process.
It is used by debuggers and other tools like strace [9]. When a program is run under strace, it is blocked each time it issues a system call. Control passes to strace before the system call, to extract the value of the arguments; the system call is then performed as usual, and control passes to strace again after the system call to extract the return values. The overhead cost is several microseconds for each system call, comparable to hitting a breakpoint in a debugger. However, the number of system calls is relatively low for CPU bound processes like gzip.
    # Trace the system calls issued during the execution
    [gzip-1.2.4a]$ time strace -s 120 -f -F -o trace ./gzip.plain \
        /dev/null
    25.23user 1.57system 0:30.97elapsed 86%CPU

    # Examine the system call trace file
    [gzip-1.2.4a]$ less trace
    2488 execve("./gzip.plain", ["./gzip.plain"], [/* 44 vars */]) = 0
    2488 fcntl64(0, 0x1, 0, 0xbffff804) = 0
    2488 fcntl64(0x1, 0x1, 0x1, 0xbffff804) = 0
    2488 fcntl64(0x2, 0x1, 0x1, 0xbffff804) = 0
    2488 uname({sys="Linux", node="rocamadour", ...}) = 0
    2488 geteuid32() = 5363
    2488 getuid32() = 5363
    2488 getegid32() = 1105
    2488 getgid32() = 1105
    2488 brk(0) = 0x810b374
    2488 brk(0x810b394) = 0x810b394
    2488 brk(0x810c000) = 0x810c000
    2488 rt_sigaction(SIGINT, {SIG_IGN}, {SIG_DFL}, 8) = 0
    2488 rt_sigaction(SIGINT, {0x804aaa0, [INT], SA_RESTART|0x4000000}, {SIG_IGN}, 8) = 0
    2488 rt_sigaction(SIGTERM, {SIG_IGN}, {SIG_DFL}, 8) = 0
    2488 rt_sigaction(SIGTERM, {0x804aaa0, [TERM], SA_RESTART|0x4000000}, {SIG_IGN}, 8) = 0
    2488 rt_sigaction(SIGHUP, {SIG_IGN}, {SIG_DFL}, 8) = 0
    2488 rt_sigaction(SIGHUP, {0x804aaa0, [HUP], SA_RESTART|0x4000000}, {SIG_IGN}, 8) = 0
    2488 ioctl(1, 0x5401, 0xbffff6c0) = -1 ENOTTY (Inappropriate ioctl for device)
    2488 fstat64(0, {st_mode=S_IFREG|0644, st_size=64140109, ...}) = 0
    2488 read(0, "recid=1,size=8,format=BINARY,event_type=4,facility=LTT,severity=DEBUG,\nuid=dagenais,gid=dagenais,pid=1953,pgrp="..., 65536) = 65536
    ...
4.5. Low Overhead Kernel Tracing with the Linux Trace Toolkit

The data aggregation performed by the tools discussed in previous sections is important to obtain a low overhead. The downside is that some information is lost in the aggregates (number of calls for each function instead of which function was called when and in what order). A full trace of collected data items rapidly fills a disk but does not a priori discard information that may not seem useful initially, but could be found so later. Tracing is used pervasively in software projects, from print statements to logging daemons like syslogd and klogd.

The Linux Trace Toolkit [4] is extensible. It comes with 60 predefined trace points inserted in the kernel, divided into 8 categories which may be enabled/disabled either at compile time or at run time. More trace points may easily be added by recompiling the kernel or using the Dynamic Probes code patching system [10]. Its key features are low overhead and post-processing tools.

Low overhead is achieved by carefully selecting the information to log, and minimizing context switching. Since all scheduling changes and setuid calls may be traced, there is no need to record the process or user id with each event. Furthermore, most events being closely spaced in time, 32-bit time differences are stored instead of full 64-bit seconds/microseconds values. All the events are logged by appending their data to a large double buffer. When a buffer is full, a daemon accesses the buffer through shared memory and copies its content to disk. The overhead of tracing a busy system amounts to less than 5% of the elapsed time and up to several megabytes per minute of tracing data to store on disk [11].
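The saving from time differences can be sketched with a toy record format. This is not the LTT trace format: the one-byte event id, the choice of storing each difference relative to a full timestamp written once at the start of the buffer, and the function names `encode_events` and `decode_events` are all assumptions made for illustration. The three timestamps are taken from the trace excerpt in section 4.5.1.

```python
import struct


def encode_events(events):
    """events: time-ordered list of (event_id, seconds, microseconds)."""
    _, base_sec, base_usec = events[0]
    base_us = base_sec * 1_000_000 + base_usec
    # One full seconds/microseconds timestamp for the whole buffer (8 bytes).
    out = bytearray(struct.pack("<II", base_sec, base_usec))
    for event_id, sec, usec in events:
        delta_us = sec * 1_000_000 + usec - base_us
        # 1-byte id + 32-bit time difference (assumes delta_us < 2**32).
        out += struct.pack("<BI", event_id, delta_us)
    return bytes(out)


def decode_events(data):
    base_sec, base_usec = struct.unpack_from("<II", data, 0)
    base_us = base_sec * 1_000_000 + base_usec
    events = []
    for offset in range(8, len(data), 5):
        event_id, delta_us = struct.unpack_from("<BI", data, offset)
        now_us = base_us + delta_us
        events.append((event_id, now_us // 1_000_000, now_us % 1_000_000))
    return events


# Hypothetical event ids with timestamps from the trace excerpt.
events = [(1, 1018876582, 807382), (2, 1018876582, 807387), (3, 1018876582, 807421)]
encoded = encode_events(events)
full_size = len(events) * (1 + 8)     # id + 64-bit timestamp per event
```

Here each event costs 5 bytes instead of 9, after a fixed 8-byte header; the benefit grows with the number of events per buffer, as long as a buffer spans less than 2^32 microseconds (about 71 minutes).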
The trace analysis is performed offline, to minimize the overhead during data collection, and possibly on a remote host when studying embedded systems. The information contained in the trace (system calls, device operations, scheduling changes) is sufficient to reconstruct the system behavior (including which file descriptor is associated with which file or memory region) and may therefore be used for several different analyses. The tools included show the events graphically, per process in a timeline, and contain per process summaries (execution time, number of calls per system call, wait time, bytes read/written).

New analysis tools may be written using the provided trace reading library, or by modifying the provided tools. Indeed, while it is easy with LTT to capture all the needed raw data, synthesizing the desired high level information is more challenging. As a test, a modified trace analysis tool was developed to answer the following non-obvious question: where is the large starting time of applications such as Galeon, Mozilla and OpenOffice spent?

4.5.1. Decomposing the Waiting Time

Traditionally, system statistics decompose the elapsed time into user mode execution time, system mode execution time, and waiting time. Profiling tools are good at decomposing the execution time by function or even source code line. The objective here was decomposing the waiting time into waiting runnable (waiting for the CPU) and waiting non-runnable (blocked on I/O, or waiting for a process or a timer).

This information may be extracted from a detailed trace: Scheduling Change events tell the status of the outgoing process (runnable or not). Wake Up events change the status of a process from non-runnable to runnable. If the event prior to being scheduled out non-runnable is a read or a page fault trap, the process is waiting for the associated file. A table must keep track for each process of all the open and mmap operations associating files with file descriptors or memory address ranges. If the event prior to being scheduled out non-runnable is a waitpid call, the process is waiting for the associated process. There are cases where a process is waiting for one of several possible files or processes, with calls to wait, select, or poll. In that case, the waiting time is attributed to the process or file which causes the call to terminate. Poll and wait may also end because of a timeout. The wait time is then attributed to the more general wait for timeout category.
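The rules above amount to a small state machine over the trace events. The sketch below is a simplified version, not the modified LTT tool: the event tuples `(time, kind, pid, detail)`, the event kinds, the labels and the sample trace (times in milliseconds) are hypothetical, and the bookkeeping of open/mmap tables, select, poll and timeouts is left out.

```python
from collections import defaultdict


def decompose_wait(trace, pid):
    """Split the time pid spends off the CPU into labelled categories."""
    waits = defaultdict(float)
    last_request = "generic"   # what the process last asked the kernel for
    out_since = None           # time at which pid left the CPU (or woke up)
    label = None               # category charged while pid is off the CPU
    for time, kind, who, detail in trace:
        if who != pid:
            continue
        if kind == "syscall":
            last_request = detail          # e.g. "file:galeon-bin"
        elif kind == "sched_out":
            out_since = time
            # Runnable: waiting for the CPU. Blocked: waiting for the request.
            label = "cpu" if detail == "runnable" else last_request
        elif kind == "wakeup" and out_since is not None:
            waits[label] += time - out_since
            out_since, label = time, "cpu"  # runnable again, waiting for the CPU
        elif kind == "sched_in" and out_since is not None:
            waits[label] += time - out_since
            out_since, last_request = None, "generic"
    return dict(waits)


trace = [
    (0, "syscall", 1684, "file:galeon-bin"),
    (1, "sched_out", 1684, "blocked"),
    (9, "wakeup", 1684, None),
    (12, "sched_in", 1684, None),
    (20, "sched_out", 1684, "runnable"),
    (25, "sched_in", 1684, None),
    (30, "syscall", 1684, "process:1690"),
    (31, "sched_out", 1684, "blocked"),
    (40, "wakeup", 1684, None),
    (41, "sched_in", 1684, None),
]
waits = decompose_wait(trace, 1684)
```

For this trace the process waits 8 ms for the file, 9 ms for the other process and 9 ms for the CPU, the latter being the sum of the delays between each wake up and the next scheduling in plus the time spent preempted.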
A trace was generated on a freshly rebooted system, with no other activity, while starting up large programs one after another but without any overlap. The waiting time for Galeon [12], for instance, was decomposed into several wait-for-file and wait-for-process components, which accounted for 7.96 s of the total 9.27 s. The remaining 1.31 s is mostly wait-for-timeout (0.89 s awaiting a mouse click) and only 0.42 s wait-generic. The wait-generic time could be further reduced by identifying in their own category other system call events in the trace, in addition to read, select, poll, waitpid and wait. The wait-generic time is nonetheless available by system call category; 0.31 s out of the 0.42 s total belongs to stat system calls.
    # Boot with the LTT enabled kernel and start the LTT daemon
    [gzip-1.2.4a]$ tracedaemon -ts60 /dev/tracer trace1.trace trace1.proc
    TraceDaemon: Output file ready
    TraceDaemon: Trace driver open
    TraceDaemon: Trace buffers are 1000000 bytes
    TraceDaemon: Fetching eip for syscall on depth : 0
    TraceDaemon: Daemon wil run for : (60, 0)
    TraceDaemon: Done mapping /proc
    TraceDaemon: End of tracing

    # Analyse the trace output
    [gzip-1.2.4a]$ tracevisualizer -a trace1.trace trace1.proc trace1.out

    # Look at the trace analysis
    [gzip-1.2.4a]$ less trace1.out
    ...
    Process (1684, 1658): galeon: galeon-bin
    ...
    User Mode:
      CPU 5.894726  Elapsed 15.677299  Wait CPU 2.358053  Wait Fork 0.000002
      BH 811  Kernel Timer 610  Packet In 80  Packet Out 41  Sched In 1330  Timer Expire 136
    Syscall Mode:
      Elapsed/Calls 0.0003861  CPU 0.308954  Elapsed 5.119347  Wait CPU 0.599823
      Wait File -001.png 0.000453
      Wait File -002.png 0.000346
      Wait File -003.png 0.000213
      Wait File -004.png 0.000216
      Wait File -005.png 0.000430
      ...
      sys_fcntl64
        Elapsed/Calls 0.0000054  CPU 0.000286  Elapsed 0.000286  Calls 53
    Traps
      page fault
        Elapsed/Calls 0.0003101  CPU 0.179146  Elapsed 2.444663  Wait CPU 0.448110
        Wait File -galeon-bin 1.816968
        BH 38  Calls 7883  Kernel Timer 23  Packet In 212  Packet Out 1  Page Alloc 78  Sched In 459  Timer Expire 2
    IRQs
      timer
        Elapsed/Calls 0.0000118  CPU 0.007725  Elapsed 0.007725  Calls 657
      usb-uhci, eth0
        Elapsed/Calls 0.0000263  CPU 0.003610  Elapsed 0.003625  Calls 138
    ...
    Trap entry     1018876582.807382  1684  13  TRAP : page fault; EIP : 0x08068FC7
    Trap exit      1018876582.807387  1684   7
    Syscall entry  1018876582.807421  1684  12  SYSCALL : execve; EIP : 0x080690D7
    File system    1018876582.807452  1684  27  EXEC : galeon
    Memory         1018876582.807494  1684  12  PAGE ALLOC ORDER : 0
    Memory         1018876582.807569  1684  12  START PAGE WAIT
    Sched change   1018876582.807645  1579  19  IN : 1579; OUT : 1684; STATE : 2
    File system    1018876582.807656  1579  20  POLL : 3; MASK : 0
    File system    1018876582.807662  1579  20  POLL : 10; MASK : 0
    File system    1018876582.807666  1579  20  POLL : 4; MASK : 0
    File system    1018876582.807670  1579  20  POLL : 5; MASK : 0
    File system    1018876582.807675  1579  20  POLL : 6; MASK : 0
    File system    1018876582.807680  1579  20  POLL : 13; MASK : 0
    Memory         1018876582.807688  1579  12  PAGE FREE ORDER : 0
    Memory         1018876582.807699  1579  12  PAGE FREE ORDER : 0
    Syscall exit   1018876582.807702  1579   7
    Syscall entry  1018876582.807731  1579  12  SYSCALL : gettimeofday; EIP : 0x0805816D
    Syscall exit   1018876582.807733  1579   7
    Syscall entry  1018876582.807758  1579  12  SYSCALL : ioctl; EIP : 0x08056F30

The bigger surprise was the 3.5 s wait-for-CPU since nothing else was happening on the system.
The answer lies in asynchronously cooperating processes. Indeed, Galeon starts a number of helper sub-processes and usually blocks awaiting an answer. This shows up nicely in the wait-for-process category along with the sub-process id. However, Galeon is a complex graphical application which uses the X window server [5] for graphics rendering. The X server is optimised to avoid blocking, using asynchronous interactions whenever possible. This allows batching rendering operations, but also parallel execution when the application and the X server are running on separate processors.
4.5.2. Identifying Servers Working on behalf of Clients
The last analysis refinement was decomposing the execution time of a server process, labeled CPU, into CPU-for-Client components. A typical client-server relationship may be identified by the pattern of a server listening on a socket, a client connecting to the socket, and the server accepting the connection. The server and client then communicate by reading and writing to the socket until the request is served and the server goes idle. The server then awaits new connections or requests from clients using a select or poll system call.
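The listen, connect, accept pattern described above can be sketched as a small scan over socket events. This is only a minimal illustration: the (pid, kind, address) record format, the event names and the function name are assumptions made for this example, not the actual Linux Trace Toolkit event layout.

```python
def find_client_server_pairs(events):
    """Return the set of (server_pid, client_pid) pairs found in a trace.

    events: time-ordered (pid, kind, address) tuples, where kind is one of
    "listen", "connect" or "accept". This record format is hypothetical.
    """
    listeners = {}  # address -> pid of the process listening on it
    pending = {}    # address -> client pids connected but not yet accepted
    pairs = set()
    for pid, kind, address in events:
        if kind == "listen":
            listeners[address] = pid
        elif kind == "connect" and address in listeners:
            pending.setdefault(address, []).append(pid)
        elif kind == "accept" and listeners.get(address) == pid:
            queue = pending.get(address)
            if queue:
                # The oldest pending client is the one being accepted.
                pairs.add((pid, queue.pop(0)))
    return pairs


events = [
    (100, "listen", "sock-a"),
    (200, "connect", "sock-a"),
    (100, "accept", "sock-a"),
    (300, "connect", "sock-b"),  # nobody listens on sock-b: ignored
]
print(find_client_server_pairs(events))  # {(100, 200)}
```

Matching on the socket address alone is a simplification; a real trace would presumably also need to follow file descriptors passed between processes.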
Assuming that a server, after its initialization phase, is always working on behalf of a client, the server gets into CPU-for-Client x whenever it accepts a connection from client x or reads from a connection to client x. This simple heuristic produces interesting results in many cases. However, it does not model cases where requests are grouped. For example, if nine clients make asynchronous requests which are simply buffered by the server, and all buffered requests are executed upon the request from the tenth client, the time spent serving the last client is overestimated.
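The heuristic and its weakness with grouped requests can be illustrated with a short sketch. The event names ("run", "stop", "accept", "read"), the tuple format and the timings are all assumptions chosen for this example; they are not taken from the toolkit.

```python
from collections import defaultdict


def attribute_server_cpu(events):
    """Split one server's CPU time into CPU-for-Client components.

    events: time-ordered (time, kind, client) tuples for a single server
    process, with kind in {"run", "stop", "accept", "read"} (hypothetical
    names). Times are integer microseconds.
    """
    cpu_for = defaultdict(int)
    current = "init"      # label used before the first client interaction
    running_since = None  # None while the server is not on a CPU
    for time, kind, client in events:
        if running_since is not None:
            # Charge the time elapsed since the last event to the
            # client the server is currently considered to work for.
            cpu_for[current] += time - running_since
            running_since = time
        if kind == "run":
            running_since = time
        elif kind == "stop":
            running_since = None
        elif kind in ("accept", "read"):
            current = client
    return dict(cpu_for)


# Ten clients: each request is read in 10 us and merely buffered; all ten
# are then executed after the read from the tenth client (assumed 510 us).
events = [(0, "run", None)]
events += [(10 * k, "read", f"c{k}") for k in range(1, 11)]
events.append((610, "stop", None))

shares = attribute_server_cpu(events)
print(shares["c1"], shares["c10"])  # 10 510
```

In this made-up scenario the tenth client is charged 510 us and every other client 10 us, although the work was done for all ten.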
Complex interactions between clients and servers are even more difficult to recognize automatically in order to attribute serving time appropriately to clients. For instance, Galeon connects to the GNOME configuration daemon, Gconfd, to express interest in a number of configuration parameters, leaving a callback socket address. Gconfd then connects to this callback address and provides the current values for the parameters, and eventually updates to these parameters. Galeon then appears to be serving parameter change requests from client Gconfd.
4.6. Discussion
The Linux Trace Toolkit is one of the best solutions for in-depth, low-level system studies. It is widely used in embedded and real-time systems, and could easily be used in operating systems and real-time programming courses. It could prove very useful to developers and system administrators when strong interactions between applications and the operating system are encountered. It is still pending integration in the main Linux kernel. Thus, a kernel patch and recompilation, or a special kernel, are required on top of the LTT tracing daemon and tracing tools package.
5. Conclusion
Traditional tools like Gcov, Gprof, and Strace have existed for years, with many variants in terms of presentation (graphs, colors, even animations) on different platforms. These represent mature, simple and efficient ways to obtain useful information. The appearance of hardware performance counters in microprocessors in the 1990s was the enabling technology for sampling tools like Oprofile, providing more precise sampling and several types of samples (cycle count, cache misses...).
Another important enabling technology in the field of system performance analysis is the source code availability of free software. The print statements previously inserted by programmers only in their applications can now be placed everywhere in the system. The Linux Trace Toolkit provides the first 60 interesting tracing points. More importantly, it simplifies the insertion of more trace points, which will enable a number of sophisticated analysis tools needed to better understand, study and improve complex distributed systems.
Bibliography
[1] Free Software Foundation, "GNU Compiler Collection", http://gcc.gnu.org/.
[2] Free Software Foundation, "Binary Tools", http://sources.redhat.com/binutils/.
[3] John Levon et al., "Oprofile", http://oprofile.sourceforge.net.
[4] Karim Yaghmour, "Linux Trace Toolkit", http://www.opersys.com/LTT.
[5] XFree86 Project, "The XFree86 Project", http://www.xfree86.org.
[6] James Larus et al., "EEL: An Executable Editing Library", http://www.cs.wisc.edu/~larus/eel.html.
[7] Free Software Foundation, "The GNU Project Debugger", http://sources.redhat.com/gdb/.
[8] Jean-Loup Gailly, "GNU Compression Utility Gzip", http://www.gnu.org/software/gzip/gzip.html.
[9] Wichert Akkerman et al., "Strace", http://www.liacs.nl/~wichert/strace/.
[10] Richard J. Moore et al., "Dynamic Probes", http://www.ibm.com/developerworks/oss/linux/projects/dprobes/.
[11] K. Yaghmour and M. R. Dagenais, "Measuring and characterizing system behavior using kernel-level event logging", Proceedings of the USENIX Annual 2000 Technical Conference, San Diego, California, USA, June 2000.
[12] Marco Pesenti Gritti et al., "Galeon: the Web, only the Web", http://galeon.sourceforge.net/.
