SoftwarePerformanceAnalysis

ts60.com  时间:2021-04-10  阅读:()
MichelR.
Dagenais(michel.
dagenais@polymtl.
ca)Dept.
ofComputerEngineering,EcolePolytechnique,Montreal,CanadaKarimYaghmour(karim@opersys.
com)Opersys,Montreal,CanadaCharlesLevert(Charles.
Levert@ericsson.
ca)MakanPourzandi(Makan.
Pourzandi@ericsson.
ca)EricssonResearch,Montreal,Canada1.
AbstractThekeytospeedingupapplicationsisoftenunderstandingwheretheelapsedtimeisspent,andwhy.
ThisdocumentreviewsindepththefullarrayofperformanceanalysistoolsandtechniquesavailableonLinuxforthistask,fromthetraditionaltoolslikegcovandgprof,tothemoreadvancedtoolsstillunderdevelopmentlikeoprofileandtheLinuxTraceToolkit.
Thefocusismoreontheunderlyingdatacollectionandprocessingalgorithms,andtheiroverheadandprecision,thanonthecosmeticdetailsofthegraphicaluserinterfacefrontends.
2.
IntroductionCodecoveragetools(e.
g.
,Gcov[1])computetheexactnumberoftimeseachstatementisexecuted,andthuscaneasilypinpointfrequentlyexecutedsections.
However,codecoveragetoolscarryanonnegligibleexecutionoverhead,andmoreimportantlydonotmeasuretherelativeexecutioncostofdifferentstatements.
Timedomainsamplingtools(e.
g.
,Gprof[2])maythereforemoreaccuratelyreportlargecontributorstotheexecutiontimebysamplingatregulartimeintervalstheexecutionaddress.
Moreover,theoverheadcostmaybetradedoffforaccuracybyvaryingthetimeinterval.
Modernapplicationsandsystemspresentnewchallengesastheirperformancemaybestronglyaffectedbycachehitrates,diskaccesses,inter-processcommunications,synchronizationwaits,andoperatingsystemoverhead.
Thisiswherenewtoolsbasedonhardwareperformancecounterssampling(e.
g.
,Oprofile[3]),ordetailedanalysisoflowoverheadoperatingsystemtraces(e.
g.
,LinuxTraceToolkit[4])canbringmoreinformation.
Theymayforinstanceindicatecodesectionswithexceedinglyhighlevelofcachemissesorbranchesmispredictions,orestimatetheexecutiontimeoftheXwindowserver[5]onbehalfofeachofitsclients.
Section3discussesthealgorithmsandtechniquesusedtocollect,processandstoreperformancedata.
Thefollowingsectiondescribesanumberofspecificperformanceanalysistools,fromcommonlyusedtoolstomorerecentones,withalongersectiononpossibleextensionsandcustomisationstotheLinuxTraceToolkit.
Section5istheconclusion.
3.
PerformanceDataCollection,ProcessingandStorageInasimplemodel,apieceofperformancedataisobtained,andqueuedorstoredforlaterdisplay.
Inpractice,thegeneratedvolumeofdatamaybesuchthatsomeprocessingmayberequiredatdata1collectiontimetosynthesizemorecompactinformation.
Forexample,inalargescientificcomputerclusterof1024computers,eachtimeaprogramfunctioniscalled,theentryandexittimemaybecollected.
Sendingthesevaluesdirectlytoacentralserverwouldrapidlysaturatethenetworkandcarryanunbearableoverhead.
Instead,theentryandexittimemaybesubtractedandaccumulatedlocallyineachfunction,inordertoproduceasummaryoftheprocessingtimespentineachprogramfunction.
Individualprograms,uponexit,wouldsendtheexecutiontimehistogramtoacentralserverforofflinepreparationofperformancereports.
Thissectionexaminesthekeyaspectsofdatacollectionandprocessing:obtainingthedatapoints,accumulatingvaluesintemporarystorage,andsendingthevaluestostorageforofflineprocessinganddisplay.
3.
1.
DataCollectionSeveralactivitiesincludingdebuggingandprogramunderstandingrequiredatacollection,justlikeperformanceanalysis,andusethesametechniques.
Thefamiliardebuggingprocessisusedtoillustratecommondatacollectionmethods.
Thereisacompromisebetweentheexecutionoverheadandtheamountandprecisionofthedatacollected.
Sometimestheoverheadisunacceptablebecausetheresultingprogramexecutiontimebecomestoolarge,ortheprogramperformanceortherealtimeprogrambehaviorisskewedbythedatacollection.
Storingorsummingavalueinavariableisminimaloverhead.
Callinganotherfunctionforeachfunctionentryisalargeoverheadforsmallfunctions.
Writingperformancedatatoafileuponenteringabasicblock(i.
e.
,linearcodesection,withoutanyjump,andthusalwaysexecutedtogetherandthesamenumberoftimes)wouldsignificantlychangetheexecutionperformanceofaprogram.
3.
1.
1.
ManualProgramInstrumentationAddingaprintstatementisoftentheeasiestandmostflexiblewaytoobtainthedesiredinformation.
Allthedatastructuresandfunctionsaccessibletotheprogrammaybeusedtogeneratethedataprinted.
TheLinuxkernelcode,wherenoviceprogrammersareuncommon,containsnolessthan80000printkstatements.
Severalcommontools(netstat,vmstat,mtrace)usedatagatheringstatementsinsertedinthekernel,ortheClibrabry.
3.
1.
2.
AutomaticProgramInstrumentationCompilersandpreprocessingtoolsmaybeusedtosystematicallyinstrumentaprogrambyaddingdatagatheringstatementstothecodetheywouldnormallygenerate.
Typicallocationsforinsertingstatementsincludethefunctionsentryandexit,andtheentryoflinearcodesections(basicblocks).
Forexample,theGNUCompilerCollection,GCC[1],offersseveralcodeinstrumentationoptionstoinsertacalltofunctionmcountuponenteringeachfunction(option-pgusedforgprof),andtoinsertcallstohooksuponenteringlinearcodesectionsandjumpingtoothercodesections(options-ftest-coverageand-fprofile-arcswhichmaybeusedforgcov).
3.
1.
3.
BinaryInstrumentationInsomesituations,thesourcecodeisnotavailableforsourcelevelinstrumentationandrecompilation,oritissimplyinconvenient.
Therearetoolstomodifybinaryexecutablefilesinordertoinsertdatagatheringstatements(e.
g.
,EEL[6]).
2Ideally,theexistingbinaryinstructionsshouldbeshiftedafewbyteshigherintheaddressspaceeachtimeinstrumentationinstructionsneedtobeinserted.
Doingso,however,changesalltheaddressesintheprogramandalljumpsandcallsmustbefoundandadjustedaccordingly.
Unlesstheprogramsectioncontaininginstructionsdoesnotcontainanydata(e.
g.
,jumptables)andallinstructionshavethesamesize,thisrapidlygetsverytricky.
Anothereasiertechniqueiscodepatching.
Theinstruction(s)locatedwhereinstrumentationcodeshouldbeinsertedisreplacedbyajumptoanewcodesectionappendedtotheprogram,andcontainingtheinstrumentationcodeandthereplacedinstruction.
Onlytheoneortwoinstructionsoverwrittenbythejumpneedtoberelocated(addressadjusted)perinstrumentationpoint.
Asimplerformofcodepatchingisusedinmostdebuggerstoinsertbreakpoints.
Theinstructionwhereinstrumentationcodeshouldbeinsertedissavedawayandreplacedbyasoftinterruptinstruction.
Unlikejumpinstructionswhichareamongthelonguest,andmayoverwriteafewshorterinstructions,interruptinstructionsusuallyarethesmallestandoccupyasinglebyteoni386processors.
Oncetheinterruptinstructionisreached,theinterrupthandlerexecutestheinstrumentationcode,putsbackthesavedawayinstructionandexecutesitinsinglesteppingmode.
Thisway,anotherinterruptionisgeneratedafteritsexecution,atwhichpointtheinterruptinstructionisplacedbackforthenexttimetheprogramreachesthispoint.
Withthistechnique,onesoftinterruptandonesinglesteppinginterruptisrequiredforeachinstrumentationpoint.
Theaddressoftheinstructioncausingtheinterruptisusedtodeterminetheinstrumentationpointreachedandtheassociatedcodetoexecute.
Thisleastinvasivetechnique(noinstructionstorelocate)alsocarriesthelargestoverhead.
Interruptsareprocessedinmicrosecondsatbestwhilejumpsareexecutedinnanoseconds,asshownbelowwithGDB[7].
#Runaprogramundergdb[gzip-1.
2.
4a]$catgdb.
runfilegzip.
plainrun/dev/nullquit[gzip-1.
2.
4a]$timegdb-command=gdb.
run>gdb.
out23.
30user1.
53system0:30.
75elapsed80%CPU(0avgtext+0avgdata0maxresident)k0inputs+0outputs(1213major+875minor)pagefaults0swaps#RunaprogramunderGDBwithabreakpoint.
Thebreakpointisexecuted#6933680times,eachtimeprintinga".
"beforecontinuing.
[gzip-1.
2.
4a]$catgdb.
commandsfilegzip.
plainbreakdeflate.
c:679commands1silentprint".
"continueendrun/dev/nullquit#RunGDBwiththecommandfile[gzip-1.
2.
4a]$timegdb-command=gdb.
commands>gdb.
out3905.
78user1535.
75system2:02:48elapsed73%CPU3.
1.
4.
SamplingSamplingisoftenusedtopredictatlowcosttheoutcomeofanelectionbyaskingseveralhundred3peopletheirvotingpreferencesinsteadofcallingeachofthemillionseligiblevoters.
Similarly,insteadofrecordingtheexecutiontimeforeachfunction,theprogrammaybeinterruptedatregularintervals(e.
g.
,each10miliseconds)tosampletheaddressofthecurrentlyexecutedinstruction.
Thechancesoftheaddressbeingwithinaspecificfunctionisproportionaltotheproportionoftimespentinthatfunction(underrandomnessassumptions).
Simplesamplingbasedprogramslikegprofuseimprecisevirtualtimerswhicharecheckedeachtimetheoperatingsystemschedulerisrun.
Inprogramswithstronginteractionswiththescheduler(e.
g.
,frequentlypollingforI/Oorcallingyield),theresultsmaybeseverelyskewed.
Newertoolslikeoprofiletakeadvantageofhardwareperformancecounterswhichmaycountdifferentevents(cycles,cachemisses,branchestaken/nottaken.
.
.
),andgenerateaninterruptafteracertaincount.
Thisallowsvaryingeasilythesamplinginterval,tradingoffoverheadforaccuracy,andmeasuringseveraldifferentparametersapartfromexecutiontime.
Thelowachievableoverheadenablesmeasuringthecachemisseshistogramforaprogram,whichwouldbeimpracticalusingcodeinstrumentation.
3.
2.
PerformanceDataProcessingOnceadataelementisobtained,itmustfinditswaytotheresultsfile.
Tracingsystemsusuallyaddafewfields(recordnumber,time,address.
.
.
)tothedataelementandappendittoabufferwhichiswrittentodiskoncefull.
Thiswaythecompleteinformationisretainedbutthediskmayfillatanalarmingrate.
TracingthesystemcallsforaWebbrowsermaygenerateseveralmegabytesperminute.
Thealternativetosimplebufferingistoprocessandaggregatethedataelementsastheyaregenerated.
Typicallyacountervariablewillbeassociatedwitheachfunctionorbasicblock,andtheeventsoccuringtherewillbecountedastheyhappen.
Attheprogramend,oratregularintevals,thecontentofthesecountersiswrittentodisk.
Insomecases,thenumberofpossiblelocationsismuchlargerthanthenumberactuallyusedandhashtablesareused.
Thisisthecasewhencountingthecallstoafunctionbycaller.
Veryfewfunctionsamongallthefunctionsinaprogramactuallycallaspecificfunction;usinghashtablesisthusmoreefficientthanallocatingacounterforeachpossiblecaller-calleecombination.
3.
3.
PerformanceDataTransferandStorageInlowvolumeapplications,eventsaregeneratedeveryfewsecondsormore,andmaybewrittenastheyhappentoafiledescriptor.
Thisfiledescriptormaybeconnectedtoafile,asocketorevenagraphicalviewerwhichupdatesitsdisplayinrealtime.
Forlargervolumesofevents,eventsshouldbeappendedinmemorytoalargebuffer,writtentodiskinverylargechunks.
Whentheinstrumentedprogramisdifferentfromtheloggingprocesswritingtodisk,maximumperformancemaybeobtainedthroughtheuseofsharedmemorybetweentheinstrumentedprogramandtheloggingdaemon.
TheLinuxTraceToolkitusessharedmemorybetweentheinstrumentedkernelandthetracingdaemontoachievemaximumperformance.
Theperformanceparametersbeingmeasured(diskthroughput,executiontime,cachehits)willstronglyinfluencewhichoverheadshouldbeminimized.
Ifdiskaccesstimesaretobemeasured,theimpactoftheperformancedatagatheringondisksshouldbereduced,forinstancebywritingtoaseparatedisk,onaseparatediskcontroller,andevenacrossthenetworkonaseparatecomputer.
Ifonlyonediskisavailable,itisalsopossibletoreducethediskrequirementsattheexpenseofexecutiontimeusingdatacompression.
Mostsystemsperformdatacompressionofflineforarchivalpurposesbutnotduringthelivetracing.
44.
ReviewofPerformanceAnalysisToolsThetoolsreviewedinthissectionareallfreelyredistributableandcoverthefullspectrumofwidelyusedtechniquesforsoftwareperformanceanalysis.
Thesimplerandmorewidelydeployedtoolsarepresentedfirst.
Throughoutthissection,theGNUcompressionutilitygzip[8]willbeusedasasampletestcase,compressinga64MBlogfileto1.
6MB.
Itisusedtocomparetheoverheadofdifferentperformanceanalysistools.
#Runthenormalprogram[gzip-1.
2.
4a]$time.
/gzip.
plain/dev/null24.
09user0.
72system0:28.
16elapsed88%CPU4.
1.
CodeinstrumentationandresultsprocessingwithGCC/GCOVGcovispartofthestandardGNUCCompiler(GCC)distribution[1].
Whencompilingaprogramfoo.
cwiththe-ftest-coverageand-fprofile-arcsoptions,alistofthebasicblocksandbranchesaregeneratedinfilesfoo.
c.
bbandfoo.
c.
bbg,andthecodeisinstrumentedtocounttheexecutionfrequenciesofeachbasicblockandbranchandtowriteouttheresultsuponexit.
Theprogrammaythenbeexecutedandsumsitsresultsinfilefoo.
c.
da,whichcanlaterbeanalyzedwithgcovtoprintthesourcecodealongwithexecutionfrequencies.
Thistoolhasaverylowoverhead,asingleaddinstructionforeachbasicblockandeachbranchexecution.
Thebasicblockexecutionfrequenciesareusedtodetectfrequentlyusedcodesections,whereoptimisationcodingeffortsmaybeworthwhile.
Executionfrequenciesforbasicblocksandbranchesmaybeusedfortestcoverageanalysis.
Anysectionwithnullexecutioncounthasnotbeencoveredbythetestsusedwhenexecutingtheprogram.
Finally,branchescountsmaybeusedtocomputebranchprobabilities,whichmaybeusedinsubsequentcompilationstogeneratebettercodeforthemorefrequentcase.
AnotherGCCoption,-axproducessimilarinformationwithmorecontroloverthegenerateddatabutwithoutapost-processingtoollikeGcov.
Itcanproducefunctions,basicblocksandbranchesexecutioncounts,orafulltraceoffunctionandbasicblockentries.
Furthermore,itispossibletospecify5Figure1:Totalelapsedtimeforcompressinga64MBfilewithgzip,unmodified,withsamplingevery0.
01s,withinstrumentationatfunctionentries,andwithinstrumentationatbasicblockentries.
NormalSamplingFunctionBasicblock05101520253035functionstoexclude/includefromthisprofiling/tracing.
Asanexample,theGNUcompressionprogramgzipwasrecompiledwiththe-ftest-coverageand-fprofile-arcsoptionsandrunona64MBlogfile.
Theinstrumentedgziprequires5.
4%moreelapsedtimethanthenormalgzip.
Attheendoftheexecution,afilewiththe.
daextensioniscreatedforeachsourcecodefile(e.
g.
,deflate.
dafordeflate.
c).
Thesefilesmaythenbeusedbygcovtoproduceannotatedsourcecodefiles(e.
g.
,deflate.
c.
gcovfordeflate.
c).
#Runtheprogramcompiledandlinkedwithoptions#-ftest-coverageand-fprofile-arcs[gzip-1.
2.
4a]$time.
/gzip.
gcov/dev/null27.
04user0.
63system0:31.
36elapsed88%CPU#Files.
bband.
bbgareproducedatcompilationtime,files.
da#attheprogramexit.
[gzip-1.
2.
4a]$ls-alsdeflate.
*4-rw-rw-r--1dagenaisdagenais1100avr915:34deflate.
bb4-rw-rw-r--1dagenaisdagenais2048avr915:34deflate.
bbg32-rw-r--r--1dagenaisdagenais29179ao131993deflate.
c4-rw-rw-r--1dagenaisdagenais648avr1014:10deflate.
da20-rw-rw-r--1dagenaisdagenais18856avr1012:01deflate.
o#Performtheanalysisononeofthesourcefiles.
#Gcovreadsthe.
bb.
bbg.
daand.
cfilesandproduces.
c.
gcov[gzip-1.
2.
4a]$gcov-b-fdeflate.
c80.
77%of26sourcelinesexecutedinfunctionlm_init60.
00%of15branchesexecutedinfunctionlm_init40.
00%of15branchestakenatleastonceinfunctionlm_initNocallsinfunctionlm_init.
.
.
Creatingdeflate.
c.
gcov.
[gzip-1.
2.
4a]$lessdeflate.
c.
gcov.
.
.
5while(lookahead!
=0){branch0taken=0%6933680INSERT_STRING(strstart,hash_head);6933680prev_length=match_length,prev_match=match_start;6933680match_length=MIN_MATCH-1;6933680if(hash_head!
=NIL&&prev_lengthlookahead)match_length=lookahead;branch0taken=0%branch1taken=100%.
.
.
4.
2.
TimeBasedSamplingwithGPROFGprofispartoftheGNUbinutilspackage[2].
Whenaprogramfoo.
ciscompiledwiththe-pgoption,acalltofunctionmcountisaddedtoeachfunctionentry.
Moreover,thelinkstepusesversionsofthestandardlibrariescompiledwith-pgwhenavailable,andaddsamoduletoinitializethesamplingatthebeginningandtowriteouttheresultsattheend.
Optionally,compilationflag-gmaybespecifiedtoobtainsourcecodelinesaddresses(usuallyusedbydebuggers),and-atosimultaneouslycomputebasicblockexecutionfrequencies.
6Whentheprogramexecutes,aninterrupthandleriscalledeveryclocktick(10miliseconds)whichtakestheaddressoftheinterruptedinstructionandadds1tothecorrespondingbucketinalargearray.
Typicallythisarrayisaslargeastheprocessinstructionsareaandusesa4bytesintegercounterforeach4bytessectionintheprogram.
Eachtimeafunctioniscalled,theinternalmcountfunctiongetscalled,extractstheaddressesofthecallerandthecallee,andsumsthenumberofcallsforthiscaller-calleepairinalargehashtable.
Whentheprogramexits,thesamplescountsinthelargearray,andthecaller-calleecountsfromthehashtable,areprintedinaresultsfilenamedgmon.
out.
Whenthe-acompilationflagisspecified,basicblocksexecutioncountsarealsocomputedandareappendedtotheresultsfile.
TheGprofprogrammaythenbeusedtoreadresultsfiles,andtheprogramexecutableandsourcecodefiles,toproducedifferentreports.
Theflatprofilereportprintsforeachfunction(andeveneachsourcecodeline)thenumberofassociatedsamplesandthecorrespondingexecutiontimeprobablyspentinthatlocation.
Thefunctionsmaybeprintedindecreasingorderofexecutiontime,thuslistingontopthemosttimeconsumingfunctions,andbestcandidatesforoptimisation.
Thetimeassociatedwithinternalfunctionmcountcanbeusedtosubstract,andthuscompensate,theoverheadassociatedwiththeprofiling.
TheannotatedsourcecodereportissimilartotheoutputproducedbyGcov,listingtheexecutionfrequenciesforeachbasicblock.
Finally,Thecallgraphreportisasupersetoftheflatprofileandaddsinformationaboutcallersandcalledfunctions.
Foreachfunction,itliststhetotaltimespentinitself(numberofsamplestimessamplingperiod),andthecallingfunctions.
Foreachcallingfunctionislistedthenumberofcallsandtheproportionofthetotal;thisinformationisobtainedpreciselythroughthecallstomcountwherethecallerisrecordedinthecallcounts.
Theproportionofcallsfromacallerisusedtoestimatethetimespentinthecurrentfunctionbycaller.
Theunderlyingassumptionisthatcallstoafunctiontakethesametimeonaverageirrespectivefromthecaller.
Thisisclearlynotthecaseas,forinstance,afunctiondraw_iconmaycallscale_bitmapsforverysmallimageswhilefunctionscale_backgroundmaycallscale_bitmapsforimages500timeslarger.
Theexecutiontimespentforeachcallerispropagatedtothecallersresursively,startingfromtheleaffunctions.
Thisaddsanotherlistforeachfunctioninthecallgraphreport,thetimespentineachfunction(anditschildren)called.
Thetotaltimespentinthefunctionitselfandallitscalledchildrenisalsopartofthecallgraphreport.
Forrecursivecalls,aseparateentryiscreatedforthecyclewhichlistscallersandcalleesfromousidethecycle;then,entriesformembersofthecyclelistcallersandcalleesfromwithinthecycle.
Gprofandsimilartoolsofferanexcellentcompromiseoflowoverheadanddetailedusuallyreliableinformation.
Itmay,however,beinadequateorevenmisleadinginmoreinvolvedcaseswhere:thetimespentinafunctionisverydifferentdependingonthecaller.
Measuringitadequatelywouldrequiresamplingnotonlythecurrentlyexecutinginstructionaddressbutalsothecallersaddressonthestack,possiblyforseverallevels.
Thecallgraphreportcouldthenshowthetimespentinafunctionbasedonchainsofcallers,butthiswouldbemoredifficulttopresentclearly,andthenumberofsamplesassociatedwitheachchainislikelytobetoosmalltobereliable;thesamplingintervaliscorrelatedwiththeprogramactivity.
Whenaprogramcontinuouslymakesblockingsystemcalls,thereisachancethatthescheduleroftengivescontroltotheprogramandactivatesthevirtualtimerusedforsamplingwhentheprogramisinthesamearea,whichgetsoverrepresentedinthesamples;thetimespentnotexecutingtheprogram,waitingforthediskorforotherprocesses,isofinterest.
Inthatcase,othersourcesofinformationshouldbeusedtoannotatethereportswithinformationabout7timespentwaiting,toobtainthetotalelapsedtime,insteadofjusttheexecutiontime,ineachfunction;moreinformationisrequiredaboutwhytimeisspentineachsection(cachefaults,pagefaults.
.
.
).
#Runtheprogramcompiledwithout-pgandlinkedwith-pg#tomeasurethelowoverheadofsamplingalone.
[gzip-1.
2.
4a]$time.
/gzip.
pg-nomcount/dev/null24.
30user0.
62system0:28.
30elapsed88%CPU#Runtheprogramcompiledandlinkedwithoption-pg#Overheadisslightlylargerbutitprovidescallgraphinfo[gzip-1.
2.
4a]$time.
/gzip.
pg-static/dev/null25.
79user0.
51system0:29.
88elapsed88%CPU#Filegmon.
outisproducedwhentheprogramexits[gzip-1.
2.
4a]$ls-alsgmon.
out188-rw-rw-r--1useruser186366avr1014:07gmon.
out#Usegproftointerpretgmon.
out,usingdebuggingsymbols#informationfromtheexecutablefile[gzip-1.
2.
4a]$gprofgzip.
pg-static>gprof.
out#Lookattheresults,firsttheflatprofilewiththepercentageof#executiontimeforeachfunctionindecreasingorder.
[gzip-1.
2.
4a]$lessgprof.
outEachsamplecountsas0.
01seconds.
%cumulativeselfselftotaltimesecondssecondscallsms/callms/callname27.
137.
537.
5319573.
855.
95fill_window23.
2013.
976.
4416440.
0019205.
66deflate14.
8418.
094.
1219582.
102.
10updcrc12.
1621.
463.
38do_scan8.
3823.
792.
33short_loop3.
7824.
841.
05__mcount_internal3.
1325.
710.
87read.
.
.
#Thenthecallgraph.
Functionupdcrcuses4.
12sandcallsnone.
#Itiscalledfromzipandfile_read.
Functionfile_readis#calledfromlm_initandfill_window.
Itusesaboutnotimebut#callsupdrcforabout4.
12s.
This4.
12sisalmostallattributed#tocallerfill_window(1957/1958calls).
.
.
.
0.
000.
001/1958zip[3]4.
120.
001957/1958file_read[8][7]15.
64.
120.
001958updcrc[7]0.
000.
001/1958lm_init[27]0.
004.
121957/1958fill_window[6][8]15.
60.
004.
121958file_read[8]4.
120.
001957/1958updcrc[7].
.
.
4.
3.
HardwareCountedEventsSamplingwithOprofileOprofile[3]isverysimilartoGprof;itevenproducescompatibleresultsfiles.
Thetwoimportantdifferencesarethatitcollectssamplesbasedonhardwareperformancecountersinterrupts,andthatitoperatesonthecompletesystem(gatheringsamplescontinuouslyforallprogramsexecutingonthe8system).
Hardwareperformancecountersarenowavailableinallpowerfulmodernmicroprocessorsandmaybeusedtocountvariouseventsofinterestincludingclockcycles,L1orL2cachemisses,branches,branchestaken,andpipelinestalls.
Theperiodicsamplingisobtainedbyrequestinganinterruptwhenthecounterreachesacertaincount.
Oprofiledoesnotuseprograminstrumentation(i.
e.
,placingcallstomcountineachfunction).
Thus,norecompilationisnecessaryandtheoverheadislower,butcallgraphreportscannotbegenerated.
Theonlyrequirementisthatdebugginginformationisrequiredtocorrelatetheaddressofsampleswithprogramfunctionsandlinenumbers.
Becauseofitslowoverhead,itisnotuncommontokeepOprofilerunningforextendedperiodsonrunningserversystems.
Thewholesystemprofilingfeatureisparticularlyusefultoseewhichproportionofthetimeisspentineachprogram,thusdiscoveringwhoisusingupalltheresources.
Itisalsousefulwhenasingletaskinvolvingseveralprogramsisexecutedonasystem.
Forexample,whenanalysingthestartuptimeofasophisticatedWebbrowser,itispossibletoseethetimespentinfunctionsinthekernel,thebrowser,thewindowmanager,theXserverandvariousdaemons.
Insomerarecases,theinformationproducedbyhardwareperformancecounterssamplingtoolsmaybebiased.
Indeed,areal-timeprocessusingthesameperiodasthesamplingtoolmayhaveitssamplesgrosslyunderoroverrepresented.
Asimplesolutionistorandomlyvarythesamplingperiodaroundatargetaverage.
#Runtheoprofilegraphicaluserinterface,clicktostartthedaemon[gzip-1.
2.
4a]$oprof_start#Runtheunmodifiedprogram(withdebuggingsymbolspresent)[gzip-1.
2.
4a]$time.
/gzip.
plain/dev/null24.
40user0.
87system0:27.
67elapsed91%CPU#Stoptheprofilingdaemonandextractstatistics[gzip-1.
2.
4a]$oprofpp-l-igzip.
plain>gzip.
oprof#Lookattheresults[gzip-1.
2.
4a]$lessgzip.
oprofCputype:PIICpuspeedwas(MHzestimation):233.
140000Counted"clocksprocessorisnothalted",samplinginterval200000vmasamples%-agesymbolname0804810000_start.
.
.
08050504500.
059034long_loop080504e1680.
0802862limit_ok080504c11110.
131055longest_match080505721600.
188909the_end080505471710.
201896mismatch0804c5509881.
16651ct_tally0804c93011001.
29875send_bits0804c6d012821.
51363compress_block0805aad014651.
7297memcpy0805050d69058.
15259short_loop080505221210114.
2874do_scan0804ea101382116.
3182updcrc0804b4c02140525.
2724deflate0804b1202502029.
5406fill_window94.
4.
SystemCallsTracingusingStraceLinuxandPOSIXsystemsoffertheptracesystemcalltoallowoneprocesstomonitoranotherprocess.
Itisusedbydebuggersandothertoolslikestrace[9].
Whenaprogramisrununderstrace,itisblockedeachtimeitissuesasystemcall.
Thecontrolpassestostracebeforethesystemcall,toextractthevalueofthearguments,thesystemcallisthenperformedasusual,andthecontrolpassestostraceafterthesystemcalltoextractthereturnvalues.
Theoverheadcostisseveralmicrosecondsforeachsystemcall,comparabletohittingabreakpointinadebugger.
However,thenumberofsystemcallsisrelativelylowforCPUboundprocesseslikegzip.
#Tracethesystemcallsissuedduringtheexecution[gzip-1.
2.
4a]$timestrace-s120-f-F-otrace.
/gzip.
plain\/dev/null25.
23user1.
57system0:30.
97elapsed86%CPU#Examinethesystemcalltracefile[gzip-1.
2.
4a]$lesstrace2488execve(".
/gzip.
plain",[".
/gzip.
plain"],[/*44vars*/])=02488fcntl64(0,0x1,0,0xbffff804)=02488fcntl64(0x1,0x1,0x1,0xbffff804)=02488fcntl64(0x2,0x1,0x1,0xbffff804)=02488uname({sys="Linux",node="rocamadour"02488geteuid32()=53632488getuid32()=53632488getegid32()=11052488getgid32()=11052488brk(0)=0x810b3742488brk(0x810b394)=0x810b3942488brk(0x810c000)=0x810c0002488rt_sigaction(SIGINT,{SIG_IGN},{SIG_DFL},8)=02488rt_sigaction(SIGINT,{0x804aaa0,[INT],SA_RESTART|0x4000000},{SIG_IGN},8)=02488rt_sigaction(SIGTERM,{SIG_IGN},{SIG_DFL},8)=02488rt_sigaction(SIGTERM,{0x804aaa0,[TERM],SA_RESTART|0x4000000},{SIG_IGN},8)=02488rt_sigaction(SIGHUP,{SIG_IGN},{SIG_DFL},8)=02488rt_sigaction(SIGHUP,{0x804aaa0,[HUP],SA_RESTART|0x4000000},{SIG_IGN},8)=02488ioctl(1,0x5401,0xbffff6c0)=-1ENOTTY(Inappropriateioctlfordevice)2488fstat64(0,{st_mode=S_IFREG|0644,st_size=64140109,02488read(0,"recid=1,size=8,format=BINARY,event_type=4,facility=LTT,severity=DEBUG,\nuid=dagenais,gid=dagenais,pid=1953,pgrp=".
.
.
,65536)=65536.
.
.
4.
5.
LowOverheadKernelTracingwiththeLinuxTraceToolkitThedataaggregationperformedbythetoolsdiscussedinprevioussectionsisimportanttoobtainalowoverhead.
Thedownsideisthatsomeinformationislostintheaggregates(numberofcallsforeachfunctioninsteadofwhichfunctionwascalledwhenandinwhatorder).
Afulltraceofcollecteddataitemsrapidlyfillsadiskbutdoesnotaprioridiscardinformationthatmaynotseemusefulinitially,butcouldbefoundsolater.
Tracingisusedpervasivelyinsoftwareprojects,fromprintstatementstologgingdaemonslikesyslogdandklogd.
TheLinuxTraceToolkit[4]isextensible.
Itcomeswith60predefinedtracepointsinsertedinthekernel,dividedinto8categorieswhichmaybeenabled/disabledeitheratcompileoratruntime.
MoretracepointsmayeasilybeaddedbyrecompilingthekernelorusingtheDynamicProbescodepatchingsystem[10].
Itskeyfeaturesarelowoverheadandpost-processingtools.
10Lowoverheadisachievedbycarefullyselectingtheinformationtolog,andminimizingcontextswitching.
Sinceallschedulingchangesandsetuidcallsmaybetraced,thereisnoneedtorecordtheprocessoruseridwitheachevent.
Furthermore,mosteventsbeingcloselyspacedintime,32bitstimedifferencesarestoredinsteadoffull64bitsseconds/microsecondsvalues.
Alltheeventsareloggedbyappendingtheirdatatoalargedoublebuffer.
Whenabufferisfull,adaemonaccessesthebufferthroughsharedmemoryandcopiesitscontenttodisk.
Theoverheadoftracingabusysystemamountstolessthan5%fortheelapsedtimeanduptoseveralmegabytesperminuteoftracingdatatostoreondisk[11].
Thetraceanalysisisperformedoffline,tominimizetheoverheadduringdatacollection,andpossiblyonaremotehostwhenstudyingembeddedsystems.
Theinformationcontainedinthetrace(systemcalls,devicesoperations,schedulingchanges)issufficienttoreconstructthesystembehavior(includingwhichfiledescriptorisassociatedtowhichfileormemoryregion)andmaythereforebeusedforseveraldifferentanalyses.
Thetoolsincludedshowtheeventsgraphically,perprocessinatimeline,andcontainperprocesssummaries(executiontime,numberofcallspersystemcall,waittime,bytesread/written).
Newanalysistoolsmaybewrittenusingtheprovidedtracereadinglibrary,orbymodifyingtheprovidedtools.
Indeed,whileitiseasywithLTTtocapturealltheneededrawdata,synthesizingthedesiredhighlevelinformationismorechallenging.
Asatest,amodifiedtraceanalysistoolwasdevelopedtoanswerthefollowingnonobviousquestion:whereisthelargestartingtimeofapplicationssuchasGaleon,MozillaandOpenOfficespent4.
5.
1.
DecomposingtheWaitingTimeTraditionally,systemstatisticsdecomposetheelapsedtimeintousermodeexecutiontime,systemmodeexecutiontime,andwaitingtime.
Profilingtoolsaregoodatdecomposingtheexecutiontimebyfunctionorevensourcecodeline.
Theobjectiveherewasdecomposingthewaitingtimeintowaitingrunnable(waitingfortheCPU)andwaitingnonrunnable(blockedonI/O,orwaitingforaprocessoratimer).
Thisinformationmaybeextractedfromadetailedtrace:SchedulingChangeeventstellthestatusoftheoutgoingprocess(runnableornot).
WakeUpeventschangethestatusofaprocessfromnonrunnabletorunnable.
Iftheeventpriortobeingscheduledoutnonrunnableisareadorapagefaulttrap,theprocessiswaitingfortheassociatedfile.
Atablemustkeeptrackforeachprocessofalltheopenandmmapoperationsassociatingfileswithfiledescriptorsormemoryaddressranges.
Iftheeventpriortobeingscheduledoutnonrunnableisawaitpidcall,theprocessiswaitingfortheassociatedprocess.
Therearecaseswhereaprocessiswaitingforoneofseveralpossiblefilesorprocesses,withcallstowait,select,orpoll.
Inthatcase,thewaitingtimeisattributedtotheprocessorfilewhichcausesthecalltoterminate.
Pollandwaitmayalsoendbecauseofatimeout.
Thewaittimeisthenattributedtothemoregeneralwaitfortimeoutcategory.
Atracewasgeneratedonafreshlyrebootedsystem,withnootheractivity,whilestartinguplargeprogramsoneafteranotherbutwithoutanyoverlap.
ThewaitingtimeforGaleon[12],forinstance,wasdecomposedintoseveralwait-for-fileandwait-for-processcomponents,whichaccountedfor7.
96softhetotal9.
27s.
Theremaining1.
31sismostlywait-for-timeout(0.
89sawaitingamouseclick)and11only0.
42swait-generic.
Thewait-generictimecouldbefurtherreducedbyidentifyingintheirowncategoryothersystemcalleventsinthetrace,inadditiontoread,select,poll,waitpidandwait.
Thewait-generictimeisnonethelessavailablebysystemcallcategory;.
31soutofthe.
42stotalbelongstostatsystemcalls.
#BootwiththeLTTenabledkernelandstarttheLTTdaemon[gzip-1.
2.
4a]$tracedaemon-ts60/dev/tracertrace1.
tracetrace1.
procTraceDaemon:OutputfilereadyTraceDaemon:TracedriveropenTraceDaemon:Tracebuffersare1000000bytesTraceDaemon:Fetchingeipforsyscallondepth:0TraceDaemon:Daemonwilrunfor:(60,0)TraceDaemon:Donemapping/procTraceDaemon:Endoftracing#Analysethetraceoutput[gzip-1.
2.
4a]$tracevisualizer-atrace1.
tracetrace1.
proctrace1.
out#Lookatthetraceanalysis[gzip-1.
2.
4a]$lesstrace1.
out.
.
.
Process(1684,1658):galeon:galeon-bin.
.
.
UserMode:CPU5.
894726Elapsed15.
677299WaitCPU2.
358053WaitFork0.
000002BH811KernelTimer610PacketIn80PacketOut41SchedIn1330TimerExpire136SyscallMode:Elapsed/Calls0.
0003861CPU0.
308954Elapsed5.
119347WaitCPU0.
599823WaitFile-001.
png0.
000453WaitFile-002.
png0.
000346WaitFile-003.
png0.
000213WaitFile-004.
png0.
000216WaitFile-005.
png0.
000430.
.
.
sys_fcntl64Elapsed/Calls0.
0000054CPU0.
000286Elapsed0.
000286Calls53TrapspagefaultElapsed/Calls0.
0003101CPU0.
179146Elapsed2.
444663WaitCPU0.
448110WaitFile-galeon-bin1.
816968BH38Calls7883KernelTimer23PacketIn212PacketOut1PageAlloc78SchedIn459TimerExpire2IRQstimerElapsed/Calls0.
0000118CPU0.
007725Elapsed0.
007725Calls657usb-uhci,eth0Elapsed/Calls0.
0000263CPU0.
003610Elapsed0.
003625Calls138.
.
.
Trapentry1018876582.
807382168413TRAP:pagefault;EIP:0x08068FC7Trapexit1018876582.
80738716847Syscallentry1018876582.
807421168412SYSCALL:execve;EIP:0x080690D7Filesystem1018876582.
807452168427EXEC:galeonMemory1018876582.
807494168412PAGEALLOCORDER:0Memory1018876582.
807569168412STARTPAGEWAITSchedchange1018876582.
807645157919IN:1579;OUT:1684;STATE:2Filesystem1018876582.
807656157920POLL:3;MASK:0Filesystem1018876582.
807662157920POLL:10;MASK:0Filesystem1018876582.
807666157920POLL:4;MASK:0Filesystem1018876582.
807670157920POLL:5;MASK:0Filesystem1018876582.
807675157920POLL:6;MASK:0Filesystem1018876582.
807680157920POLL:13;MASK:0Memory1018876582.
807688157912PAGEFREEORDER:0Memory1018876582.
807699157912PAGEFREEORDER:0Syscallexit1018876582.
80770215797Syscallentry1018876582.
807731157912SYSCALL:gettimeofday;EIP:0x0805816DSyscallexit1018876582.
80773315797Syscallentry1018876582.
807758157912SYSCALL:ioctl;EIP:0x08056F30Thebiggersurprisewasthe3.
5swait-for-CPUsincenothingelsewashappeningonthesystem.
Theanswerliesinasynchronouslycooperatingprocesses.
Indeed,Galeonstartsanumberofhelpersub-processesandusuallyblocksawaitingananswer.
Thisshowsupnicelyinthewait-for-processcategoryalongwiththesub-processid.
However,GaleonisacomplexgraphicalapplicationwhichusestheXwindowserver[5]forgraphicsrendering.
TheXserverisoptimisedtoavoidblockingusingasynchronousinteractionswheneverpossible.
Thisallowsbatchingrenderingoperations,butalsoparallelexecutionwhentheapplicationandtheXserverarerunningonseparateprocessors.
4.
5.
2.
IdentifyingServersWorkingonbehalfofClientsThelastanalysisrefinementwasdecomposingtheexecutiontimeofaserverprocess,labeledCPU,intoCPU-for-Clientcomponents.
Atypicalclientserverrelationshipmaybeidentifiedbythepatternofaserverlisteningonasocket,aclientconnectingtothesocket,andtheserveracceptingtheconnection.
13Theserverandclientthencommunicatebyreadingandwritingtothesocketuntiltherequestisservedandtheservergoesidle.
Theserverthenawaitsnewconnectionsorrequestsfromclientsusingaselectorpollsystemcall.
Assumingthataserver,afteritsinitializationphase,isalwaysworkingonbehalfofaclient,theservergetsintoCPU-for-Clientxwheneveritacceptsaconnectionorreadsfromaconnectiontoclientx.
Thissimpleheuristicproducesinterestingresultsinmanycases.
However,itdoesnotmodelcaseswhererequestsaregrouped.
Forexample,ifnineclientsmakeasynchronousrequestswhicharesimplybufferedbytheserver,andupontherequestfromthetenthclient,allbufferedrequestsareexecuted,thetimespentservingthelastisoverestimated.
Complexinteractionsbetweenclientsandserversareevenmoredifficulttoautomaticallyrecognizeinordertoattributeservingtimeappropriatelytoclients.
Forinstance,GaleonconnectstotheGNOMEconfigurationdaemon,Gconfd,toexpressinterestinanumberofconfigurationparameters,leavingacallbacksocketaddress.
Gconfdthenconnectstothiscallbackaddressandprovidesthecurrentvaluesfortheparameters,andeventuallyupdatestotheseparameters.
GaleonthenappearstobeservingparameterchangerequestsfromclientGconfd.
4.
6.
DiscussionTheLinuxTraceToolkitisoneofthebestsolutionsforindepthlowlevelsystemstudies.
Itiswidelyusedinembeddedandreal-timesystems,andcouldeasilybeusedinoperatingsystemsandreal-timeprogrammingcourses.
Itcouldproveveryusefultodevelopersandsystemadministratorswhenstronginteractionsbetweenapplicationsandtheoperatingsystemareencountered.
ItisstillpendingintegrationinthemainLinuxkernel.
Thus,akernelpatchandrecompilation,oraspecialkernel,arerequiredontopoftheLTTtracingdaemonandtracingtoolspackage.
5.
ConclusionTraditionaltoolslikeGcov,Gprof,andStracehaveexistedforyears,withmanyvariantsintermsofpresentation(graphs,colors,evenanimations)ondifferentplatforms.
Theserepresentmature,simpleandefficientwaystoobtainusefulinformation.
Theappearanceofhardwareperformancecountersinmicroprocessorsinthe1990swastheenablingtechnologyforsamplingtoolslikeOprofileprovidingmoreprecisesamplingandseveraltypesofsamples(cyclecount,cachemisses.
.
.
).
Anotherimportantenablingtechnologyinthefieldofsystemperformanceanalysisisthesourcecodeavailabilityoffreesoftware.
Theprintstatementspreviouslyinsertedbyprogrammersonlyintheirapplicationscannowbeplacedeverywhereinthesystem.
TheLinuxTraceToolkitprovidesthefirst60interestingtracingpoints.
Moreimportantly,itsimplifiestheinsertionofmoretracepointswhichwillenableanumberofsophisticatedanalysistoolsneededtobetterunderstand,studyandimprovecomplexdistributedsystems.
Bibliography[1]FreeSoftwareFoundation,"GNUCompilerCollection",http://gcc.
gnu.
org/.
[2]FreeSoftwareFoundation,"BinaryTools",http://sources.
redhat.
com/binutils/.
[3]JohnLevonetal.
,"Oprofile",http://oprofile.
sourceforge.
net.
[4]KarimYaghmour,"LinuxTraceToolkit",http://www.
opersys.
com/LTT.
14[5]XFree86Project,"TheXFree86Project",http://www.
xfree86.
org.
[6]JamesLarusetal.
,"EEL:AnExecutableEditingLibrary",http://www.
cs.
wisc.
edu/~larus/eel.
html.
[7]FreeSoftwareFoundation,"TheGNUProjectDebugger",http://sources.
redhat.
com/gdb/.
[8]Jean-LoupGailly,"GNUCompressionUtilityGzip",http://www.
gnu.
org/software/gzip/gzip.
html.
[9]WichertAkkermanetal.
,"Strace",http://www.
liacs.
nl/~wichert/strace/.
[10]RichardJ.
Mooreetal,"DynamicProbes",http://www.
ibm.
com/developerworks/oss/linux/projects/dprobes/.
[11]K.
YaghmourandM.
R.
Dagenais,"Measuringandcharacterizingsystembehaviorusingkernel-leveleventlogging",ProceedingsoftheUSENIXAnnual2000TechnicalConference,SanDiego,California,USA,June2000[12]MarcoPesentiGrittietal,"Galeon:theWeb,onlytheWeb",http://galeon.
sourceforge.
net/.
15

欧路云:美国200G高防云-10元/月,香港云-15元/月,加拿大480G高防云-23元/月

欧路云 主要运行弹性云服务器,可自由定制配置,可选加拿大的480G超高防系列,也可以选择美国(200G高防)系列,也有速度直逼内地的香港CN2系列。所有配置都可以在下单的时候自行根据项目 需求来定制自由升级降级 (降级按天数配置费用 退款回预存款)。由专业人员提供一系列的技术支持!官方网站:https://www.oulucloud.com/云服务器(主机测评专属优惠)全场8折 优惠码:zhuji...

RepriseHosting:$27.97/月-L5640,16G内存,1TB硬盘,10TB月流量,西雅图机房

RepriseHosting是成立于2012年的国外主机商,提供独立服务器租用和VPS主机等产品,数据中心在美国西雅图和拉斯维加斯机房。商家提供的独立服务器以较低的价格为主,目前针对西雅图机房部分独立服务器提供的优惠仍然有效,除了价格折扣外,还免费升级内存和带宽,商家支持使用支付宝或者PayPal、信用卡等付款方式。配置一 $27.97/月CPU:Intel Xeon L5640内存:16GB(原...

湖北50G防御物理服务器( 199元/月 ),国内便宜的高防服务器

4324云是成立于2012年的老牌商家,主要经营国内服务器资源,是目前国内实力很强的商家,从价格上就可以看出来商家实力,这次商家给大家带来了全网最便宜的物理服务器。只能说用叹为观止形容。官网地址 点击进入由于是活动套餐 本款产品需要联系QQ客服 购买 QQ 800083597 QQ 2772347271CPU内存硬盘带宽IP防御价格e5 2630 12核16GBSSD 500GB​30M​1个IP...

ts60.com为你推荐
固态硬盘是什么什么是固态硬盘?johncusack约翰·库萨克好看的的恐怖片全集广东GDP破10万亿在已披露的2017年GDP经济数据中,以下哪个省份GDP总量排名第一?嘀动网在炫动网买鞋怎么样,是真的吗罗伦佐娜罗拉芳娜 (西班牙小姐)谁可以简单的介绍以下同ip域名同IP网站具体是什么意思,能换独立的吗se95se.com现在400se就是进不去呢?进WWW怎么400se总cOM打开一半,?求解sesehu.com68lolita com是真的吗ww.66bobo.com这个www.中国应急救援网.com查询证件是真是假?yinrentangWeichentang正品怎么样,谁知道?
香港vps krypt 大硬盘 网站监控 anylink bgp双线 谁的qq空间最好看 可外链相册 国外的代理服务器 可外链的相册 免费个人网页 广东服务器托管 网站防护 镇江高防服务器 美国主机侦探 时间同步服务器 服务器操作系统 木马检测 主机游戏 56折扣网 更多