M. C. Calzarossa and S. Tucci (Eds.): Performance 2002, LNCS 2459, pp. 290–317, 2002.
Springer-Verlag Berlin Heidelberg 2002

Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data

Ravishankar K. Iyer and Zbigniew Kalbarczyk

Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign
1308 W. Main St., Urbana, IL 61801-2307
{iyer,kalbar}@crhc.uiuc.edu

Abstract.
The discussion in this paper focuses on the issues involved in analyzing the availability of networked systems using fault injection and the failure data collected by the logging mechanisms built into the system. In particular, we address (1) analysis in the prototype phase using physical fault injection into an actual system, where we use the example of a fault injection-based evaluation of a software-implemented fault tolerance (SIFT) environment (built around a set of self-checking processes called ARMORs) that provides error detection and recovery services to spaceborne scientific applications, and (2) measurement-based analysis of systems in the field, where we use the example of a LAN of Windows NT based computers to present methods for collecting and analyzing failure data to characterize network system dependability. Both fault injection and failure data analysis enable us to study naturally occurring errors and to provide feedback to system designers on potential availability bottlenecks. For example, the study of failures in a network of Windows NT machines reveals that most of the problems that lead to reboots are software related and that, although the average availability evaluates to over 99%, a typical machine, on average, provides acceptable service only about 92% of the time.
1 Introduction

The dependability of a system can be experimentally evaluated at different phases of its life cycle. In the design phase, computer-aided design (CAD) environments are used to evaluate the design via simulation, including simulated fault injection. Such fault injection tests the effectiveness of fault-tolerant mechanisms and evaluates system dependability, providing timely feedback to system designers. Simulation, however, requires accurate input parameters and validation of output results. Although the parameter estimates can be obtained from past measurements, this is often complicated by design and technology changes.

In the prototype phase, the system runs under controlled workload conditions. In this stage, controlled physical fault injection is used to evaluate the system behavior under faults, including the detection coverage and the recovery capability of various fault tolerance mechanisms. Fault injection on the real system can provide information about the failure process, from fault occurrence to system recovery, including error latency, propagation, detection, and recovery (which may involve reconfiguration).

In the operational phase, a direct measurement-based approach can be used to measure systems in the field under real workloads. The collected data contain a large amount of information about naturally occurring errors/failures. Analysis of these data can provide understanding of actual error/failure characteristics and insight into analytical models. Although measurement-based analysis is useful for evaluating the real system, it is limited to detected errors. Further, conditions in the field can vary widely, casting doubt on the statistical validity of the results. Thus, all three approaches (simulated fault injection, physical fault injection, and measurement-based analysis) are required for accurate dependability analysis.
In the design phase, simulated fault injection can be conducted at different levels: the electrical level, the logic level, and the function level. The objectives of simulated fault injection are to determine dependability bottlenecks, the coverage of error detection/recovery mechanisms, the effectiveness of reconfiguration schemes, performance loss, and other dependability measures. The feedback from simulation can be extremely useful in cost-effective redesign of the system. A thorough discussion of different techniques for simulated fault injection can be found in [10].

In the prototype phase, while the objectives of physical fault injection are similar to those of simulated fault injection, the methods differ radically because real fault injection and monitoring facilities are involved. Physical faults can be injected at the hardware level (logic or electrical faults) or at the software level (code or data corruption). Heavy-ion radiation techniques can also be used to inject faults and stress the system. A detailed treatment of the instrumentation involved in fault injection experiments using real examples, including several fault injection environments, is given in [10].

In the operational phase, measurement-based analysis must address issues such as how to monitor computer errors and failures and how to analyze measured data to quantify system dependability characteristics. Although methods for the design and evaluation of fault-tolerant systems have been extensively researched, little is known about how well these strategies work in the field. A study of production systems is valuable not only for accurate evaluation but also for identifying reliability bottlenecks in system design. In [10] the measurement-based analysis is based on over 200 machine-years of data gathered from IBM, DEC, and Tandem systems (note that these are not networked systems).

In this paper we discuss current research in the area of experimental analysis of computer system dependability in the context of methodologies suited for measurement-based dependability analysis of networked systems. In particular, we focus on:

(1) Analysis in the prototype phase using physical fault injection into an actual system. We use the example of a fault injection-based evaluation of a software-implemented fault tolerance (SIFT) environment (built around a set of self-checking processes called ARMORs [13]) that provides error detection and recovery services to spaceborne scientific applications.

(2) Measurement-based analysis of systems in the field. We use the example of a LAN of Windows NT based computers to present methods for collecting and analyzing failure data to characterize network system dependability.

2 Fault/Error Injection Characterization of the SIFT Environment for Spaceborne Applications

Fault/error injection is an attractive approach to the experimental validation of dependable systems. The objective of fault injection is to mimic the existence of faults and errors and hence to enable the study of the failure behavior of the system. Fault/error injection can be employed to conduct detailed studies of the complex interactions between faults and fault handling mechanisms, e.g., [1] and [10]. In particular, fault injection aims at (1) exposing deficiencies of fault tolerance mechanisms (i.e., fault removal), e.g., [3], and (2) evaluating the coverage of fault tolerance mechanisms (i.e., fault forecasting), e.g., [2]. A number of tools have been proposed to support fault injection analysis and evaluation of systems, e.g., FERRARI [14], FIAT [5], and NFTAPE [22].
This section presents an example of applying fault/error injection to assess the fault tolerance mechanisms of a software-implemented fault tolerance environment for spaceborne applications. In traditional spaceborne applications, onboard instruments collect and transmit raw data back to Earth for processing. The amount of science that can be done is clearly limited by the telemetry bandwidth to Earth. The Remote Exploration and Experimentation (REE) project at NASA/JPL intends to use a cluster of commercial off-the-shelf (COTS) processors to analyze the data onboard and send only the results back to Earth. This approach not only saves downlink bandwidth, but also provides the possibility of making real-time, application-oriented decisions.

While failures in the scientific applications are not critical to the spacecraft's health in this environment (spacecraft control is performed by a separate, trusted computer), they can be expensive nonetheless. The commercial components used by REE are expected to experience a high rate of radiation-induced transient errors in space (ranging from one per day to several per hour), and downtime directly leads to the loss of scientific data. Hence, a fault-tolerant environment is needed to manage the REE applications.

The missions envisioned to take advantage of the SIFT environment for executing MPI-based [19] scientific applications include the Mars Rover and the Orbiting Thermal Imaging Spectrometer (OTIS). More details on the applications and the full dependability analysis can be found in [31] and [32], respectively.

The remainder of this section presents a methodology for experimentally evaluating a distributed SIFT environment executing an REE texture analysis program from the Mars Rover mission. Errors are injected so that the consequences of faults can be studied. The experiments do not attempt to analyze the cause of the errors or fault coverage.
Rather, the error injections progressively stress the detection and recovery mechanisms of the SIFT environment:

1. SIGINT/SIGSTOP injections. Many faults are known to lead to crash and hang failures. SIGINT/SIGSTOP injections reproduce these first-order effects of faults in a controlled manner that minimizes the possibility of error propagation or checkpoint corruption.

2. Register and text-segment injections. The next set of error injections represents common effects of single-event upsets by corrupting the state in the register set and text-segment memory. This introduces the possibility of error propagation and checkpoint corruption.

3. Heap injections. The third set of experiments further broadens the failure scenarios by injecting errors into the dynamic heap data to maximize the possibility of error propagation. The results from these experiments are especially useful in evaluating how well intraprocess self-checks limit error propagation.
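
To make the single-event-upset error model concrete, the following is a minimal Python sketch of a single-bit flip applied to one machine word. It is only an illustration: the function names `flip_bit` and `inject_single_bit_error` and the 32-bit word width are assumptions introduced here, and the sketch does not reproduce how NFTAPE performs injections on the testbed.

```python
import random


def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with exactly one bit inverted, kept within `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index lies outside the word")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def inject_single_bit_error(word: int, rng: random.Random, width: int = 32):
    """Flip one uniformly chosen bit of `word`; return (bit position, corrupted word)."""
    bit = rng.randrange(width)
    return bit, flip_bit(word, bit, width)
```

Flipping the same bit twice restores the original word, which is a convenient sanity check when replaying an injection campaign in a simulator.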

REE computational model. The REE computational model consists of a trusted, radiation-hardened (rad-hard) Spacecraft Control Computer (SCC) and a cluster of COTS processors that execute the SIFT environment and the scientific applications. The SCC schedules applications for execution on the REE cluster through the SIFT environment.

REE testbed configuration. The experiments were executed on a 4-node testbed consisting of PowerPC 750 processors running the Lynx real-time operating system. Nodes are connected through 100 Mbps Ethernet in the testbed. Between one and two megabytes of RAM on each processor were set aside to emulate local nonvolatile memory available to each node. The nonvolatile RAM is expected to store temporary state information that must survive hardware reboots (e.g., checkpointing information needed during recovery). Nonvolatile memory visible to all nodes is emulated by a remote file system residing on a Sun workstation that stores program executables, application input data, and application output data.

2.1 SIFT Environment for REE

The REE applications are protected by a SIFT environment designed around a set of self-checking processes called ARMORs (Adaptive Reconfigurable Mobile Objects of Reliability) that execute on each node in the testbed. ARMORs control all operations in the SIFT environment and provide error detection and recovery to the application and to the ARMOR processes themselves. We provide a brief summary of the ARMOR-based SIFT environment as implemented for the REE applications; additional details of the general ARMOR architecture appear in [13].

SIFT Architecture

An ARMOR is a multithreaded process internally structured around objects called elements that contain their own private data and provide elementary functions or services (e.g., detection and recovery for remote ARMOR processes, internal self-checking mechanisms, or checkpointing support). Together, the elements constitute the functionality that defines an ARMOR's behavior. All ARMORs contain a basic set of elements that provide a core functionality, including the ability to (1) implement reliable point-to-point message communication between ARMORs, (2) communicate with the local daemon ARMOR process, (3) respond to heartbeats from the local daemon, and (4) capture ARMOR state. Specific ARMORs extend this core functionality by adding extra elements.

Types of ARMORs. The SIFT environment for REE applications consists of four kinds of ARMOR processes: a Fault Tolerance Manager (FTM), a Heartbeat ARMOR, daemons, and Execution ARMORs.

Fault Tolerance Manager (FTM). A single FTM executes on one of the nodes and is responsible for recovering from ARMOR and node failures as well as interfacing with the external Spacecraft Control Computer (SCC).

Heartbeat ARMOR. The Heartbeat ARMOR executes on a node separate from the FTM. Its sole responsibility is to detect and recover from failures in the FTM through periodic polling for liveness.

Daemons. Each node on the network executes a daemon process. Daemons are the gateways for ARMOR-to-ARMOR communication, and they detect failures in the local ARMORs.

Execution ARMORs. Each application process is directly overseen by a local Execution ARMOR.

Executing REE Applications

Fig. 1 illustrates a configuration of the SIFT environment with two MPI applications (from the Mars Rover and OTIS missions) executing on a four-node testbed. Arrows in the figure depict the relationships among the various processes (e.g., the application sends progress indicators to the Execution ARMORs, the FTM is responsible for recovering from failures in the Heartbeat ARMOR, and the FTM heartbeats the daemon processes). Each application process is linked with a SIFT interface that establishes a one-way communication channel with the local Execution ARMOR at application initialization. The application programmer can use this interface to invoke a variety of fault tolerance services provided by the ARMOR.

Error Detection Hierarchy

The top-down error detection hierarchy consists of:

Node and daemon errors. The FTM periodically exchanges heartbeat messages with each daemon (every 10 s in our experiments) to detect node crashes and hangs. If the FTM does not receive a response by the next heartbeat round, it assumes that the node has failed. A daemon failure is treated as a node failure.

ARMOR errors. Each ARMOR contains a set of assertions on its internal state, including range checks, validity checks on data (e.g., a valid ARMOR ID), and data structure integrity checks. Other internal self-checks available to the ARMORs include preemptive control flow checking, I/O signature checking, and deadlock/livelock detection [4]. In order to limit error propagation, the ARMOR kills itself when an internal check detects an error. The daemon detects crash failures in the ARMORs on the node via operating system calls. To detect hang failures, the daemon periodically (every 10 s in the experiments) sends "Are-you-alive" messages to its local ARMORs.

REE applications. All application crash failures are detected by the local Execution ARMOR. Crash failures in the MPI process with rank 0 can be detected by the Execution ARMOR through operating system calls (i.e., waitpid). The other Execution ARMORs periodically check that their MPI processes (ranks 1 through n) are still in the operating system's process table. If a process is no longer there, the Execution ARMOR concludes that the application has crashed. An application process notifies the local Execution ARMOR through its communication channel before exiting normally so that the ARMOR does not misinterpret this exit as an abnormal termination.

[Figure 1: four nodes (Node 1 to Node 4) connected by a network, each running a daemon. The FTM and the Heartbeat ARMOR run on separate nodes; Execution ARMORs oversee the OTIS processes (ranks 0 and 1) and the Rover processes (ranks 0 and 1), each of which is linked with the SIFT interface. Legend: heartbeats, progress indicators, recovery.]

Fig. 1. SIFT Architecture for Executing Two MPI Applications on a Four-Node Network.

A polling technique is used to detect application hangs: the Execution ARMOR periodically checks for progress indicator updates sent by the application. A progress indicator is an "I'm-alive" message containing information that denotes application progress (e.g., a loop iteration counter). If the Execution ARMOR does not receive a progress indicator within an application-specific time period, the ARMOR concludes that the application process has hung.
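
The progress-indicator check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ARMOR implementation: the class name `ProgressMonitor`, its methods, and the use of explicit timestamps in place of a real timer thread are all hypothetical.

```python
class ProgressMonitor:
    """Hypothetical hang detector driven by progress-indicator messages."""

    def __init__(self, timeout_s: float, now: float = 0.0):
        self.timeout_s = timeout_s        # application-specific time period
        self.counter = 0                  # incremented by each progress indicator
        self.last_seen = 0                # counter value observed at the previous poll
        self.last_progress_time = now     # time of the last poll that saw progress

    def on_progress_indicator(self) -> None:
        """Called when an "I'm-alive" message arrives from the application."""
        self.counter += 1

    def poll(self, now: float) -> bool:
        """Return True if the application is judged to have hung."""
        if self.counter != self.last_seen:
            # Progress was made since the previous poll: remember it and move on.
            self.last_seen = self.counter
            self.last_progress_time = now
            return False
        # No progress: declare a hang once the timeout has elapsed.
        return now - self.last_progress_time >= self.timeout_s
```

Note that the monitor only learns about progress when it polls, so the recorded progress time is the time of the observing poll rather than the time the message was sent.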

Error Recovery

Nodes. The FTM migrates the ARMOR and application processes that were executing on the failed node to other working nodes in the SIFT environment.

ARMORs. ARMOR state is recovered from a checkpoint. To protect the ARMOR state against process failures, a checkpointing technique called microcheckpointing is used [30]. Microcheckpointing leverages the modular element composition of the ARMOR process to incrementally checkpoint state on an element-by-element basis.

REE applications. On detecting an application failure, the Execution ARMOR notifies the FTM to initiate recovery. The version of MPI used on the REE testbed precludes individual MPI processes from being restarted within an application; therefore, the FTM instructs all Execution ARMORs to terminate their MPI processes before restarting the application. The application executable binaries must be reloaded from the remote disk during recovery.
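
The idea of checkpointing ARMOR state element by element can be illustrated with a short Python sketch. The details of microcheckpointing are given in [30] and are not reproduced here; the classes `Element` and `MicroCheckpointedProcess`, and the choice to refresh an element's checkpoint immediately after each update, are assumptions made purely for illustration.

```python
import copy


class Element:
    """Hypothetical ARMOR element: a name plus its private state."""

    def __init__(self, name, state):
        self.name = name
        self.state = state


class MicroCheckpointedProcess:
    """Keeps one saved copy of state per element and refreshes only what changed."""

    def __init__(self, elements):
        self.elements = {e.name: e for e in elements}
        self.checkpoint = {e.name: copy.deepcopy(e.state) for e in elements}

    def update(self, name, key, value):
        """Modify one element, then checkpoint that element only."""
        element = self.elements[name]
        element.state[key] = value
        # Incremental step: other elements' saved state is left untouched.
        self.checkpoint[name] = copy.deepcopy(element.state)

    def recover(self):
        """Restore every element from its most recent saved state."""
        for name, saved in self.checkpoint.items():
            self.elements[name].state = copy.deepcopy(saved)
```

The sketch also shows why a corrupted value that reaches `update` ends up in the checkpoint: recovery can only restore what was saved, which is the failure mode the later injection experiments probe.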

2.2 Injection Experiments

Error injection experiments into the application and SIFT processes were conducted to (1) stress the detection and recovery mechanisms of the SIFT environment, (2) determine the failure dependencies among SIFT and application processes, (3) measure the SIFT environment overhead on application performance, (4) measure the overhead of recovering SIFT processes as seen by the application, and (5) study the effects of error propagation and the effectiveness of internal self-checks in limiting error propagation. The experiments used NFTAPE, a software framework for conducting injection campaigns [22].

Error Models

The error models used in the injection experiments represent a combination of those employed in several past experimental studies and those proposed by JPL engineers.

SIGINT/SIGSTOP. These signals were used to mimic "clean" crash and hang failures as described in the introduction.

Register and text-segment errors. Fault analysis has predicted that the most prevalent faults in the targeted spaceborne environment will be single-bit memory and register faults, although shrinking feature sizes have raised the likelihood of clock errors and multiple-bit flips in future technologies. Several error injections were uniformly distributed within each run, since each injection was unlikely to cause an immediate failure, and only the most frequently used registers and functions in the text segment were targeted for injection.

Heap errors. Heap injections were used to study the effects of error propagation. One error was injected per run into non-pointer data values only, and the effects of the error were traced through the system.

Errors were not injected into the operating system, since our experience has shown that kernel injections typically led to a crash, led to a hang, or had no impact. Maderia et al. [18] used the same REE testbed to examine the impact of transient errors on LynxOS.

Definitions and Measurements

System, experiment, and run. We use the term system to refer to the REE cluster and associated software (i.e., the SIFT environment and applications). The system does not include the radiation-hardened SCC or the communication channel to the ground. An error injection experiment targeted a specific process (application process, FTM, Execution ARMOR, or Heartbeat ARMOR) using a particular error model. For each process/error model pair, a series of runs was executed in which one or more errors were injected into the target process.

Activated errors and failures. An injection causes an error to be introduced into the system (e.g., corruption at a selected memory location or corruption of the value in a register). An error is said to be activated if program execution accesses the erroneous value. A failure refers to a process deviating from its expected (correct) behavior as determined by a run without fault injection. The application can also fail by producing output that falls outside acceptable tolerance limits as defined by an external application-provided verification program.

[Figure 2: timeline with the events: user submits app job, set up the environment, app starts, app ends, ARMORs uninstalled, user notified of termination. The actual application execution time spans app start to app end; the perceived application execution time spans job submission to termination notification.]

Fig. 2. Perceived vs. Actual Execution Time

A system failure occurs when either (1) the application cannot complete within a predefined timeout or (2) the SIFT environment cannot recognize that the application has completed successfully. System failures require that the SCC reinitialize the SIFT environment before continuing, but they do not threaten the SCC or spacecraft integrity.¹

Recovery time. Recovery time is the interval between the time at which a failure is detected and the time at which the target process restarts. For ARMOR processes, this includes the time required to restore the ARMOR's state from checkpoint. In the case of an application failure, the time lost to rolling back to the most recent application checkpoint is accounted for in the application's total execution time, not in the recovery time for the application.

Perceived application execution time. The perceived execution time is the interval between the time at which the SCC submits an application for execution and the time at which the SIFT environment reports to the SCC that the application has completed.

Actual application execution time. The actual execution time is the interval between the start and the end of the application. The difference between perceived and actual execution time accounts for the time required to install the Execution ARMORs before running the application and the time required to uninstall the Execution ARMORs after the application completes (see Fig. 2). This is a fixed overhead independent of the actual application execution time.

Baseline application execution time. In the injection experiments, the perceived and actual application execution times are compared to a baseline measurement in order to determine the performance overhead added by the SIFT environment and recovery. Two measures of baseline application performance are used: (1) the application executing without the SIFT environment and without fault injection and (2) the application executing in the SIFT environment but without fault injection. The difference between these two measures provides the overhead that the SIFT processes impose on the application. Table 1 shows that the SIFT environment adds less than two seconds to the perceived application execution time. The mean application execution time and recovery time are calculated for each fault model. Ninety-five percent confidence intervals (t-distribution) are also calculated for all measurements.
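
As an illustration of how a mean and a t-based confidence interval of this kind can be computed, here is a minimal Python sketch. The function name `mean_with_ci` is hypothetical, and the critical value of the t-distribution is passed in by the caller (for example, about 2.776 for a two-sided 95% interval with 4 degrees of freedom) because the Python standard library does not provide t quantiles; the paper does not state which tool was used for its calculations.

```python
import math
import statistics


def mean_with_ci(samples, t_critical):
    """Return (mean, half_width) of a t-based confidence interval.

    `t_critical` is the two-sided critical value for len(samples) - 1
    degrees of freedom, looked up by the caller.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed")
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)  # sample standard error
    return mean, t_critical * std_err
```

A result would then be reported as the mean plus or minus the half-width, which appears to be the form used for the timing entries in the tables that follow.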

Table 1. Baseline Application Execution Time (times in seconds)

                Perceived       Actual
Without SIFT    75.71 ± 0.65    75.71 ± 0.65
With SIFT       77.97 ± 0.48    75.74 ± 0.48

¹ While the vast majority of failures in the SIFT environment will not affect the trusted SCC, in reality there exists a nonzero probability that the SCC can be impacted by SIFT failures. We discount this possibility in the paper because there is not a full-fledged SCC available for conducting such an analysis.

2.3 Crash and Hang Failures

This section presents results from SIGINT and SIGSTOP injections into the application and SIFT processes, which were used to evaluate the SIFT environment's ability to handle crash and hang failures.
We first summarize the major findings from over 700 crash and hang injections:

- All injected errors into both the application and SIFT processes were recovered.
- Recovering from errors in SIFT processes imposed a mean overhead of 5% on the application's actual execution time. This 5% overhead includes 25 cases out of roughly 700 runs in which the application was forced to block or restart because of the unavailability of a SIFT process. Neglecting those cases in which the application must redo lost computation, the overhead imposed by a recovering SIFT process was insignificant.
- Correlated failures involving a SIFT process and the application were observed. In 25 cases, crash and hang failures caused a SIFT process to become unavailable, prompting the application to fail when it did not receive a timely response from the failed SIFT process. All correlated failures were successfully recovered.

Results for 100 runs per target are summarized in Table 2. In some cases, the injection time (used to determine when to inject the error) occurred after the application completed. For these runs, no error was injected. The row "Baseline" reports the application execution time with no fault injection. One hundred runs were chosen in order to ensure that failures occurred throughout the various phases of an application's execution (including an idle SIFT environment before application execution, application submission and initialization, application execution, application termination, and subsequent cleanup of the SIFT environment).

Application Recovery

Hangs are the most expensive application failures in terms of lost processing time. Application hangs are detected using a polling technique in which the Execution ARMOR executes a thread that wakes up every 20 seconds to check the value of a counter incremented by progress indicator messages sent by the application. Because the counter is polled at fixed intervals, the error detection latency for hangs can be up to twice the checking period.
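
The "up to twice the checking period" bound can be checked with a small worked example. The sketch below assumes, for illustration only, that polls occur at multiples of the period, that the application hangs immediately after its last progress update, and that an update arriving exactly at a poll instant is first seen by the following poll; the function name `hang_detection_latency` is hypothetical.

```python
import math


def hang_detection_latency(last_update: float, period: float) -> float:
    """Time from the last progress update until a fixed-interval poll flags a hang."""
    # The first poll strictly after the update still observes a changed counter.
    first_poll_after = (math.floor(last_update / period) + 1) * period
    # Only the poll after that one sees an unchanged counter and reports the hang.
    detecting_poll = first_poll_after + period
    return detecting_poll - last_update
```

With a 20-second period, an update at 0.5 s is flagged at 40 s (latency 39.5 s), whereas an update at 19.5 s is also flagged at 40 s (latency 20.5 s), so under these assumptions the latency always lies between one and two checking periods.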

Table 2. SIGINT/SIGSTOP Injection Results

                                          App. Exec. Time (s)
Target            Failures  Successful    Perceived       Actual          Recovery
                            Recoveries                                    Time (s)
SIGINT
Baseline          -         -             74.78 ± 0.55    72.68 ± 0.49    -
Application       100       100           89.80 ± 1.50    87.88 ± 1.50    0.48 ± 0.05
FTM               81        81            79.60 ± 1.61    73.89 ± 0.25    0.64 ± 0.16
Execution ARMOR   100       100           77.91 ± 1.01    75.98 ± 1.00    0.61 ± 0.07
Heartbeat ARMOR   97        97            75.26 ± 0.92    74.39 ± 0.96    0.47 ± 0.12
SIGSTOP
Baseline          -         -             71.96 ± 0.32    70.03 ± 0.27    -
Application       84        84            112.21 ± 1.87   110.21 ± 1.87   0.47 ± 0.05
FTM               97        97            76.20 ± 1.94    70.09 ± 0.88    0.79 ± 0.15
Execution ARMOR   98        98            85.01 ± 4.41    82.21 ± 4.28    0.63 ± 0.15
Heartbeat ARMOR   77        77            71.88 ± 0.24    70.24 ± 0.24    0.56 ± 0.21

SIFT Environment Recovery

FTM recovery.
21Measurement-BasedAnalysisofSystemDependability299SIFTEnvironmentRecoveryFTMrecovery.
Theperceivedexecutiontimefortheapplicationisextendedif(1)theFTMfailswhilesettinguptheenvironmentbeforetheapplicationexecutionbeginsor(2)theFTMfailswhilecleaninguptheenvironmentandnotifyingtheSpacecraftControlComputerthattheapplicationterminated.
TheapplicationisdecoupledfromtheFTM'sexecutionafterstarting,sofailuresintheFTMdonotaffectit.
TheonlyoverheadinactualexecutiontimeoriginatesfromthenetworkcontentionduringtheFTM'srecovery,whichlastsforonly0.
6-0.
7s.
AnFTM-applicationcorrelatedfailure.
TheerrorinjectionsalsorevealedacorrelatedfailureinwhichtheFTMfailurecausedtheapplicationtorestartin2ofthe178runs(see[32]fordescriptionofcorrelatedfailurescenarios).
TheSIFTenvironmentisabletorecoverfromthiscorrelatedfailurebecausethecomponentsperformingthedetection(HeartbeatARMORdetectingFTMfailuresandExecutionARMORdetectingapplicationfailures)arenotaffectedbythefailures.
ExecutionARMOR.
Ofthe198crash/hangerrorsinjectedintotheExecutionARMORs,175requiredrecoveryonlyintheExecutionARMOR.
Fortheseruns,theapplicationexecutionoverheadwasnegligible.
TheoverheadreportedinTable2(upto10%forhangfailures)resultedfromtheremaining23casesinwhichtheapplicationwasforcedtorestart.
AnExecutionARMOR-applicationcorrelatedfailure.
IftheapplicationprocessattemptedtocontacttheExecutionARMOR(e.
g.
,tosendprogressindicatorupdatesortonotifytheExecutionARMORthatitisterminatingnormally)whiletheARMORwasrecovering,theapplicationprocessblockeduntiltheExecutionARMORcompletelyrecovered.
BecausetheMPIprocessesaretightlycoupled,acorrelatedfailureispossibleiftheExecutionARMORoverseeingtheotherMPIprocessdiagnosedtheblockingasanapplicationhangandinitiatedrecovery.
ThiscorrelatedfailureoccurredmostoftenwhentheExecutionARMORhung(i.
e.
,duetoSIGSTOPinjections):22correlatedfailureswereduetoSIGSTOPinjectionsasopposedto1correlatedfailureresultingfromanARMORcrash(i.
e.
,duetoSIGINTinjections).
ThisisbecauseanExecutionARMORcrashfailureisdetectedimmediatelybythedaemonthroughoperatingsystemcalls,makingtheExecutionARMORunavailableforonlyashorttime.
Hangs,however,aredetectedviaa10-secondheartbeat.
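
The round-based liveness check behind this 10-second detection delay can be sketched as follows. The sketch assumes that the daemon applies the same rule described earlier for the FTM (a target that has not answered by the start of the next round is declared failed); the class `LivenessPoller` and its methods are hypothetical, and message transport and timers are not modelled.

```python
class LivenessPoller:
    """Hypothetical round-based "Are-you-alive" checker for local ARMORs."""

    def __init__(self, targets):
        # True means a query is outstanding and no reply has arrived yet.
        self.awaiting = {target: False for target in targets}

    def on_reply(self, target) -> None:
        """Record that `target` answered the current round's query."""
        self.awaiting[target] = False

    def start_round(self):
        """Begin a new round; return the targets that missed the previous one."""
        failed = [t for t, waiting in self.awaiting.items() if waiting]
        for t in self.awaiting:
            self.awaiting[t] = True  # a new query is considered sent to each target
        return failed
```

Because a hang is only noticed at a round boundary, a hung ARMOR can remain unavailable for roughly a full polling period, which is consistent with hangs producing far more correlated failures than crashes.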

2.4 Register and Text-Segment Injections

This section expands the scope of the injections to further stress the detection and recovery mechanisms by allowing for the possibility of checkpoint corruption and error propagation to another process. Results from approximately 9,000 single-bit errors injected into the register set and text segment of the application and SIFT processes show that:

- Most register and text-segment errors led to crash and hang failures that were recovered by the SIFT environment.
- Eleven of the approximately 700 observed failures led to system failures in which either the application did not complete or the SIFT environment did not detect that the application successfully completed. These 11 system failures resulted from injected errors that corrupted an ARMOR's checkpoint or propagated outside the injected process.
- Text-segment errors were more likely than register errors to lead to system failures. This was because values in registers typically had a shorter lifetime (i.e., they were either never used or quickly overwritten) when compared to information stored in the text segment.

Table 3 summarizes the results of approximately 6,000 register injections and 3,000 text-segment injections into both the application and ARMOR processes. Failures are classified into four categories: segmentation faults, illegal instructions, hangs, and errors detected via assertions. The second column in Table 3 gives the number of successful recoveries vs. the number of failures for each set of experiments. Errors that were not successfully recovered led to system failures (4 due to FTM failures, 5 due to Execution ARMOR failures, and 2 due to Heartbeat ARMOR failures).

FTM recovery. Table 3 shows that the FTM successfully recovered from all register injections. Two text-segment injections were detected through assertions on the FTM's internal data structures, and both of these errors were recovered. Table 3 also shows that the FTM could not recover from four text-segment errors. In each case, the error corrupted the FTM's checkpoint prior to crashing. Because the checkpoint was corrupted, the FTM crashed shortly after being recovered. This cycle of failure and recovery repeated until the run timed out.

There were seven cases of a correlated failure in which the FTM failed during the application's initialization: three from text-segment injections and four from register injections. Both the FTM and the application recovered from all seven correlated failures.

Table 3. Register and Text-Segment Injection Results

                               Failure Classification             App. Exec. Time (s)
Target            Recoveries/  Seg.   Illegal  Hang  Assertion    Perceived      Actual         Recovery
                  Failures     fault  instr.                                                    Time (s)
Baseline          -            -      -        -     -            71.96 ± 0.32   70.03 ± 0.27   -
Register Injections
Application       95/95        71     4        20    0            90.70 ± 2.57   88.81 ± 2.57   0.70 ± 0.21
FTM               84/84        58     6        16    4            75.65 ± 1.54   73.42 ± 1.28   0.71 ± 0.03
Execution ARMOR   77/80        56     6        15    3            76.19 ± 1.82   73.56 ± 1.83   0.45 ± 0.08
Heartbeat ARMOR   77/77        62     6        8     1            73.00 ± 0.22   70.66 ± 0.21   0.31 ± 0.04
Text-segment Injections
Application       82/82        41     23       18    0            89.47 ± 2.87   87.49 ± 2.88   1.05 ± 0.33
FTM               84/88        53     28       5     2            76.47 ± 2.87   71.00 ± 2.31   0.51 ± 0.05
Execution ARMOR   93/95        45     31       11    8            77.48 ± 1.93   74.83 ± 1.86   0.43 ± 0.04
Heartbeat ARMOR   95/97        53     33       11    0            73.23 ± 0.37   71.21 ± 0.36   0.30 ± 0.01

Execution ARMOR recovery.
01Measurement-BasedAnalysisofSystemDependability301ExecutionARMORrecovery.
Threeregisterinjectionsandtwotext-segmentinjectionsintotheExecutionARMORledtosystemfailure.
Ineachofthesecases,theerrorpropagatedtootherARMORprocessesortotheExecutionARMOR'scheckpoint.
Onetext-segmentinjectionandthreeregisterinjectionscausederrorsintheExecutionARMORtopropagatetotheFTM(i.
e.
,theerrorwasnotfail-silent).
AlthoughtheExecutionARMORdidnotcrash,itsentcorrupteddatatotheFTMwhentheapplicationterminated,causingtheFTMtocrash.
TheFTMstateinitscheckpointwasnotaffectedbytheerror,sotheFTMwasabletorecovertoavalidstate.
BecausetheFTMdidnotcompleteprocessingtheExecutionARMOR'snotificationmessage,theFTMdidnotsendanacknowledgmentbacktotheExecutionARMOR.
ThemissingacknowledgmentpromptedtheExecutionARMORtoresendthefaultymessage,whichagaincausedtheFTMtocrash.
Thiscycleofrecoveryfollowedbytheretransmissionoffaultydatacontinueduntiltheruntimedout.
Oneofthetext-segmentinjectionscausedtheExecutionARMORtosaveacorruptedcheckpointbeforecrashing.
WhentheARMORrecovered,itrestoreditsstatefromthefaultycheckpointandcrashedshortlythereafter.
Thiscyclerepeateduntiltheruntimedout.
Inadditiontothesystemfailuresdescribedabove,threetext-segmentinjectionsintotheExecutionARMORresultedintherestartingofthetextureanalysisapplication.
Allthreeofthesecorrelatedfailuresweresuccessfullyrecovered.
HeartbeatARMORrecovery.
TheHeartbeatARMORrecoveredfromallregistererrors,whiletext-segmentinjectionsbroughtabouttwosystemfailures.
AlthoughnocorruptedstateescapedtheHeartbeatARMOR,theerrorpreventedtheHeartbeatARMORfromreceivingincomingmessages.
Thus,theHeartbeatARMORfalselydetectedthattheFTMhadfailed,sinceitdidnotreceiveaheartbeatreplyfromtheFTM.
TheARMORthenbegantoinitiaterecoveryoftheFTMby(1)instructingtheFTM'sdaemontoreinstalltheFTMprocess,and(2)instructingtheFTMtorestoreitsstatefromcheckpointafterreceivingacknowledgmentthattheFTMhasbeensuccessfullyreinstalled.
Amongthesuccessfulrecoveriesfromtext-segmenterrorsshowninTable3,fourinvolvedcorruptedheartbeatmessagesthatcausedtheFTMtofail.
AlthoughfaultydataescapedtheHeartbeatARMOR,thecorruptedmessagedidnotcompromisetheFTM'scheckpoint.
Thus,theFTMwasabletorecoverfromthesefourfailures.

2.5 Heap Injections

Careful examination of the register injection experiments showed that crash failures were most often caused by segmentation faults raised from dereferencing a corrupted pointer. To maximize the chances for error propagation, only data (not pointers) were injected on the heap.

Results from targeted injections into FTM heap memory were grouped by the element into which the error was injected. Table 4 shows the number of system failures observed from 100 error injections per element, classified according to their effect on the system. One hundred targeted injections were sufficient to observe either escaped or detected errors given the amount of state in each element; overall, 500 heap injections were conducted on the FTM.

Table 4. System Failures Observed Through Heap Injections

Legend (Effect on system): (A) unable to register daemons, (B) unable to install Execution ARMORs, (C) unable to start applications, (D) unable to uninstall Execution ARMORs after application completes.
Legend (System failure/assertion check classification): (#2) system failure without assertion firing, (#3) system failure with assertion firing, (#4) successful recoveries after assertion fired.

                   Effect on System    System Failures
Element            A    B    C    D    Total   #2   #3   #4
mgr_armor_info     4    1    5    4    14      6    8    19
exec_armor_info    0    0    5    4    9       4    5    9
app_param          0    0    0    0    0       0    0    2
mgr_app_detect     0    0    0    0    0       0    0    4
node_mgmt          0    14   0    0    14      0    14   3
TOTAL              4    15   10   8    37      10   27   37

Element descriptions:
mgr_armor_info. Stores information about subordinate ARMORs such as location and element composition.
exec_armor_info. Stores information about each Execution ARMOR such as status of subordinate application.
app_param. Stores information about the application such as executable name, command-line arguments, and number of times the application restarted.
mgr_app_detect. Used to detect that all processes for an MPI application have terminated and to initiate recovery if necessary.
node_mgmt. Stores information about the nodes, including the resident daemon and hostname.

Many data errors were detectable through assertions within the FTM, but not all assertions were effective in preventing system failures.
One of four scenarios resulted after a data error was injected (the last three columns in Table 4 are numbered to refer to scenarios 2-4):
1. The data error was not detected by an assertion and had no effect on the system. The application completed successfully as if there were no error.
2. The data error was not detected by an assertion but led to a system failure. None of the system failures impacted the application while it was executing.
3. The data error was detected by an assertion check, but only after the error had propagated to the FTM's checkpoint or to another process. Rolling back the FTM's state in these circumstances was ineffective, and system failures resulted from which the SIFT environment could not recover. These cases show that error latency is a factor when attempting to recover from errors in a distributed environment.
4. The data error was detected by an assertion check before propagating to the FTM's checkpoint or to another process. After an assertion fired, the FTM killed itself and recovered as if it had experienced an ordinary crash failure.
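The four scenarios reduce to two questions about each run: did an assertion fire, and did a system failure result? The following Python sketch encodes that reading. The names `Outcome` and `classify_run` are hypothetical and are not part of the SIFT environment; the sketch assumes that every injection run can be summarized by these two observations.

```python
from enum import Enum


class Outcome(Enum):
    NO_EFFECT = 1           # scenario 1: not detected, no effect on the system
    UNDETECTED_FAILURE = 2  # scenario 2: not detected, system failure
    LATE_DETECTION = 3      # scenario 3: detected after propagation, system failure
    RECOVERED = 4           # scenario 4: detected before propagation, recovered


def classify_run(assertion_fired: bool, system_failed: bool) -> Outcome:
    """Map the two observations of an injection run to a scenario number."""
    if not assertion_fired:
        return Outcome.UNDETECTED_FAILURE if system_failed else Outcome.NO_EFFECT
    return Outcome.LATE_DETECTION if system_failed else Outcome.RECOVERED
```

The enum values match the column numbering (#2, #3, #4) used in Table 4, so counting outcomes per element would reproduce the last three columns.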
The injection results in Table 4 show that the least sensitive elements (app_param and mgr_app_detect) were those modules whose state was substantially read-only after being written early within the run. With assertions in place, none of the data errors in these elements led to system failures. At the other end of the sensitivity spectrum, 28 errors in two elements caused system failures. In contrast with the elements causing no system failures, the data in mgr_armor_info and node_mgmt were repeatedly written during the initialization phases of a run.

Table 4 also shows the efficiency of assertion checks in preventing system failures. The rightmost two columns in the table represent the total number of runs in which assertions detected errors. For example, assertions in the mgr_armor_info element detected 27 errors, and 19 of those errors were successfully recovered. The data also show that assertions coupled with the incremental microcheckpointing were able to prevent system failures in 58% of the cases (37 of 64 runs in which assertions fired). On the other hand, assertions detected the error too late to prevent system failures in 27 cases. For example, 14 of the 17 runs in which assertions detected errors in the node_mgmt element resulted in system failures. This problem was rectified by adding checks to the translation results before sending the message.
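The assertion-efficiency figures can be rechecked directly from the per-element counts in Table 4. The sketch below is only an arithmetic cross-check; the dictionary layout and the function name `assertion_summary` are illustrative assumptions, and the element names follow the spelling used in the text.

```python
# Per-element counts from Table 4: (#2 undetected failures,
# #3 failures despite an assertion firing, #4 recoveries after an assertion fired).
TABLE4 = {
    "mgr_armor_info": (6, 8, 19),
    "exec_armor_info": (4, 5, 9),
    "app_param": (0, 0, 2),
    "mgr_app_detect": (0, 0, 4),
    "node_mgmt": (0, 14, 3),
}


def assertion_summary(rows):
    """Aggregate the three scenario columns over all elements."""
    undetected = sum(r[0] for r in rows.values())
    late = sum(r[1] for r in rows.values())
    recovered = sum(r[2] for r in rows.values())
    fired = late + recovered  # runs in which an assertion fired
    return {
        "undetected": undetected,
        "late": late,
        "recovered": recovered,
        "fired": fired,
        "prevented_pct": 100.0 * recovered / fired,
    }


summary = assertion_summary(TABLE4)
```

With these counts, assertions fired in 64 runs, 37 of which recovered and 27 of which still ended in a system failure, which matches the TOTAL row of the table.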
2.6 Lessons Learned

SIFT overhead should be kept small. System designers must be aware that SIFT solutions have the potential to degrade the performance and even the dependability of the applications they are intended to protect. Our experiments show that the functionality in SIFT can be distributed among several processes throughout the network so that the overhead imposed by the SIFT processes is insignificant while the application is running.

SIFT recovery time should be kept small. Minimizing the SIFT process recovery time is desirable from two standpoints: (1) recovering SIFT processes have the potential to affect application performance by contending for processor and network resources, and (2) applications requiring support from the SIFT environment are affected when SIFT processes become unavailable. Our results indicate that fully recovering a SIFT process takes approximately 0.5 s. The mean overhead as seen by the application from SIFT recovery is less than 5%, which takes into account 10 out of roughly 800 failures from register, text-segment, and heap injections that caused the application to block or restart because of the unavailability of a SIFT process. The overhead from recovery is insignificant when these 10 cases are neglected.

SIFT/application interface should be kept simple. In any multiprocess SIFT design, some SIFT processes must be coupled to the application in order to provide error detection and recovery. The Execution ARMORs play this role in our SIFT environment. Because of this dependency, it is important to make the Execution ARMORs as simple as possible. All recovery actions and those operations that affect the global system (e.g., job submission and detecting remote node failures) are delegated to a remote SIFT process that is decoupled from the application's execution. This strategy appears to work, as only 5 of 373 observed Execution ARMOR failures led to system failures.

SIFT availability impacts the application. Low recovery time and aggressive checkpointing of the SIFT processes help minimize the SIFT environment downtime, making the environment available for processing application requests and for recovering from application failures.

System failures are not necessarily fatal. Only 11 of the 10,000 injections resulted in a system failure in which the SIFT environment could not recover from the error. These system failures did not affect an executing application.
3 Error and Failure Analysis of a LAN of Windows NT-Based Servers

Direct monitoring, recording, and analysis of naturally occurring errors and failures in the system can provide valuable information on actual error/failure behavior, identify system bottlenecks, quantify dependability measures, and verify assumptions made in analytical models. In this section we provide an example of system dependability analysis using failure data collected from a Local Area Network (LAN) of Windows NT servers.

In most commercial systems, information about failures can be obtained from the manual logs maintained by administrators or from the automated event-logging mechanisms in the underlying operating system. Manual logs are very subjective and often unavailable. Hence they are not typically suited for automated analysis of failures. In contrast, the event logs maintained by the system have predefined formats, provide contextual information in case of failures (e.g., a trace of significant events that precede a failure), and are thus conducive to automated analysis. Moreover, as failures are relatively rare events, it is necessary to meticulously collect and analyze error data for many machine-months for the results of the data analysis to be statistically valid. Such regular and prolonged data acquisition is possible only through automated event logging. Hence most studies of failures in single and networked computer systems are based on the error logs maintained by the operating system running on those machines.

This section presents the methodology and results from an analysis of failures found in a network of about 70 Windows NT based mail servers (running Microsoft Exchange software). The data for the study were obtained from event logs (i.e., logs of machine events that are maintained and modified by the Windows NT operating system) collected over a six-month period from the mail routing network of a commercial organization. In this study we analyze only machine reboots because they constitute a significant portion of all logged failure data and are the most severe type of failure. As a starting point, a preliminary data analysis is conducted to classify the nature of observed failure events. This failure categorization is then used to examine the behavior of individual machines in detail and to derive a finite state model. The model depicts the behavior of a typical machine. Finally, a domain-wide analysis is performed to capture the behavior of the domain in a finite state model. A thorough analysis of the failure data can be found in [12].
Related Work. Analysis of failures in computer systems has been the focus of active research for quite some time. Studies of failures occurring in commercial systems (e.g., VAX/VMS, Tandem/GUARDIAN) are based primarily on failure data collected from the field. The focus of such studies is on categorizing the nature of failures in the systems (e.g., software failures, hardware failures), identifying availability bottlenecks, and obtaining models to estimate the availability of the systems being analyzed. Lee [15], [16] analyzed failures in Tandem's GUARDIAN operating system. Tang [25] analyzed error logs pertaining to a multicomputer environment based on a VAX/VMS cluster. Thakur [27] presented an analysis of failures in the Tandem NonStop-UX operating system. Hsueh [9] explored errors and recovery in IBM's MVS operating system. Based on the error logs collected from MVS systems, a semi-Markov model of multiple errors (i.e., errors that manifest themselves in multiple ways) was constructed to analyze system failure behavior. Measurement-based software reliability models were also presented in [15], [16] (for the GUARDIAN system) and [25], [26] (for the VAXcluster).

The impact of workload on system failures was also extensively studied. Castillo [6] developed a software reliability prediction model that took into account the workload imposed on the system. Iyer [11] examined the effect of workload on the reliability of the IBM 3081 operating system. Mourad [21] performed a reliability study on the IBM MVS/XA operating system and found that the error distribution is heavily dependent on the type of system utilization. Meyer [20] presented an analysis of the influence of workload on the dependability of computer systems.

Lin [17] and Tsao [28] focused on trend analysis in error logs. Gray [8] presented results from a census of Tandem systems. Chillarege [7] presented a study of the impact of failures on customers and the fault lifetimes. Sullivan [23], [24] examined software defects occurring in operating systems and databases (based on field data). Velardi [29] examined failures and recovery in the MVS operating system.
3.1 Error Logging in Windows NT

The Windows NT operating system offers capabilities for error logging. This software records information on errors occurring in the various subsystems, such as memory, disk, and network subsystems, as well as other system events, such as reboots and shutdowns. The reports usually include information on the location, time, and type of the error, the system state at the time of the error, and sometimes error recovery (e.g., retry) information. The main advantage of on-line automatic logging is its ability to record a large amount of information about transient errors and to provide details of automatic error recovery processes, which cannot be done manually. Disadvantages are that an on-line log does not usually include information about the cause and propagation of the error or about off-line diagnosis. Also, under some crash scenarios, the system may fail too quickly for any error messages to be recorded.

An important question to be asked here is: How accurate are event logs in characterizing failure behavior of the system? While event logs provide valuable insight into understanding the nature and dynamics of typical problems observed in a network system, in many cases the information in event logs is not sufficient to precisely determine the nature of a problem (e.g., whether it was a software or hardware component failure). The only reliable way to improve the accuracy of logs is (1) to perform more frequent, detailed logging by each component and (2) to instrument the Windows NT code with new (more precise) logging mechanisms. However, there is always a trade-off between accuracy and intrusiveness of measurements. No commercial organization will permit someone to install an untested tool to monitor the network. Consequently, we use existing logs not only to characterize failure behavior of the network (presented in this paper), but also to determine how the logging system could be improved (e.g., by adding to the operating system a query mechanism to remotely probe system components about their status). It should be noted that in many commercial operating systems (e.g., MVS) event logs are accurate enough to document failures.
3.2 Classification of Data Collected from a LAN of Windows NT-Based Servers

The initial breakup of the data on a system reboot is primarily based on the events that preceded the current reboot by no more than an hour (and that occurred after the previous reboot). For each instance of a reboot, the most severe and frequently occurring events (hereafter referred to as prominent events) are identified. The corresponding reboot is then categorized based on the source and the id of these prominent events. In some cases, the prominent events are specific enough to identify the problem that caused the reboot. In other cases, only a high-level description of the problem can be obtained based on the knowledge of the prominent events.
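The selection of prominent events can be sketched in Python as follows. The text does not spell out how "most severe and frequently occurring" is resolved, so this sketch makes two labeled assumptions: a lower severity number means a more severe event (as with severity 1 for the problem events in Table 8), and among the most severe events the most frequent (source, id) pairs are taken as prominent. The `Event` type and the function name are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    time: datetime
    source: str
    event_id: int
    severity: int  # assumption: lower number = more severe


def prominent_events(events, reboot_time, prev_reboot_time=None,
                     window=timedelta(hours=1)):
    """Return the (source, id) pairs of the prominent events before a reboot."""
    candidates = [
        e for e in events
        if reboot_time - window <= e.time < reboot_time
        and (prev_reboot_time is None or e.time > prev_reboot_time)
    ]
    if not candidates:
        return []
    worst = min(e.severity for e in candidates)
    counts = Counter((e.source, e.event_id)
                     for e in candidates if e.severity == worst)
    top = max(counts.values())
    return sorted(key for key, n in counts.items() if n == top)


# Hypothetical log excerpt for one machine.
reboot = datetime(2000, 1, 1, 12, 0)
log = [
    Event(reboot - timedelta(minutes=90), "Srv", 2006, 1),       # outside window
    Event(reboot - timedelta(minutes=30), "NETLOGON", 5719, 1),
    Event(reboot - timedelta(minutes=20), "NETLOGON", 5719, 1),
    Event(reboot - timedelta(minutes=10), "BROWSER", 8021, 2),
]
```

A reboot would then be categorized by looking up the returned (source, id) pairs in a hand-maintained mapping to the categories of Table 5; an empty result corresponds to a reboot with no preceding events in the window.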
Table 5 shows the breakup of the reboots by category.

Table 5. Breakup of Reboots Based on Prominent Events

Category                                                      Frequency   Percentage
Total reboots                                                 1100        100
Hardware or firmware problems                                 105         9
Connectivity problems                                         241         22
Crucial application failures                                  152         14
Problems with a software component                            42          4
Normal shutdowns                                              63          6
Normal reboots/power-off (no indication of any problems)      178         16
Unknown                                                       319         29

Hardware or firmware related problems: This category includes events that indicate a problem with hardware components (network adapter, disk, etc.), their associated drivers (typically drivers failing to load because of a problem with the device), or some firmware (e.g., some events indicated that the Power On Self Test had failed).

Connectivity problems: This category denotes events that indicated that either a system component (e.g., redirector, server) or a critical application (e.g., MS Exchange System Attendant) could not retrieve information from a remote machine. In these scenarios, it is not possible to pinpoint the actual cause of the connectivity problem. Some of the connectivity failures result from network adapter problems and hence are categorized as hardware related.

Crucial application failure: This category encompasses reboots that are preceded by severe problems with, and possibly shutdown of, critical application software (such as Message Transfer Agent). In such cases, it was not clear why the application reported problems. If an application shutdown occurs as a result of a connectivity problem, then the corresponding reboot is categorized as connectivity-related.

Problems with a software component: Typically these reboots are characterized by startup problems (such as a critical system component not loading or a driver entry point not being found). Another significant type of problem in this category is the machine running out of virtual memory, possibly due to a memory leak in a software component. In many of these cases, the component causing the problem is not identifiable.

Normal shutdowns: This category covers reboots that are not preceded by warnings or error messages. Additionally, there are events that indicate shutting down of critical application software and some system components (e.g., the BROWSER). These represent shutdowns for maintenance or for correcting problems not captured in the event logs.

Normal reboots/power-off: This category covers reboots that are typically not preceded by shutdown events, but do not appear to be caused by any problems either. No warnings or error messages appear in the event log before the reboot.
Based on data in Table 5, the following observations can be made about the failures:
1. 29% of the reboots cannot be categorized. Such reboots are indeed preceded by events of severity 2 or lower, but there is not enough information available to decide (a) whether the events were severe enough to force a reboot of the machine or (b) the nature of the problem that the events reflect.
2. A significant percentage (22%) of the reboots have reported connectivity problems. Connectivity problems suggest that there could be propagated failures in the domain. Furthermore, it is possible that the machines functioning as the master browser and the Primary Domain Controller (PDC)², respectively, are potential reliability bottlenecks of the domain.
3. Only a small percentage (10%) of the reboots can be traced to a system hardware component. Most of the identifiable problems are software related.
4. Nearly 50% of the reboots are abnormal reboots (i.e., the reboots were due to a problem with the machine rather than due to a normal shutdown).
5. In nearly 15% of the cases, severe problems with a crucial mail server application force a reboot of the machine.
3.3 Analysis of Failure Behavior of Individual Machines

After the preliminary investigation of the causes of failures, we probe failures from the perspective of an individual machine as well as the whole network. First we focus on the failure behavior of individual machines in the domain to obtain (1) estimates of machine up-times and down-times, (2) an estimate of the availability of each machine, and (3) a finite state model to describe the failure behavior of a typical machine in the domain.

Machine up-times and down-times are estimated as follows: For every reboot event encountered, the timestamp of the reboot is recorded. The timestamp of the event immediately preceding the reboot is also recorded. (This would be the last event logged by the machine before it goes down.) A smoothing factor of one hour is applied to the reboots (i.e., of multiple reboots that occurred within a period of one hour, all except the last one are disregarded). Each up-time estimate is generated by calculating the time difference between a reboot timestamp and the timestamp of the event preceding the next reboot.
² In the analyzed network, the machines belonged to a common Windows NT domain. One of the machines was configured as the Primary Domain Controller (PDC). The rest of the machines functioned as Backup Domain Controllers (BDCs).
Each down-time estimate is obtained by calculating the time difference between a reboot timestamp and the timestamp of the event preceding it.
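The estimation procedure can be sketched in Python as below. Each record pairs the timestamp of the last event logged before a reboot with the reboot timestamp. How the one-hour smoothing chains consecutive reboots is not specified in the text; this sketch assumes that a reboot occurring within one hour of the previously kept reboot replaces it. The function names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def smooth_reboots(records, gap=timedelta(hours=1)):
    """records: time-sorted list of (last_event_time, reboot_time) pairs.

    Of several reboots that fall within `gap` of each other, keep only the last.
    """
    kept = []
    for rec in records:
        if kept and rec[1] - kept[-1][1] <= gap:
            kept[-1] = rec  # the later reboot supersedes the earlier one
        else:
            kept.append(rec)
    return kept


def up_and_down_times(records):
    """Return (uptimes, downtimes) as lists of timedelta for one machine."""
    kept = smooth_reboots(records)
    # Downtime: from the last event before the reboot to the reboot itself.
    downtimes = [reboot - last_event for last_event, reboot in kept]
    # Uptime: from a reboot to the last event preceding the next reboot.
    uptimes = [kept[i + 1][0] - kept[i][1] for i in range(len(kept) - 1)]
    return uptimes, downtimes


# Hypothetical machine with two reboots 20 minutes apart, then one a day later.
day = datetime(2000, 1, 1)
records = [
    (day, day + timedelta(minutes=10)),
    (day + timedelta(minutes=20), day + timedelta(minutes=30)),
    (day + timedelta(days=1), day + timedelta(days=1, minutes=5)),
]
uptimes, downtimes = up_and_down_times(records)
```

A machine with n retained reboots yields n down-time samples but only n - 1 up-time samples, so a data set built this way has one more down-time entry than up-time entries per machine.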
Machine uptimes and machine downtimes are presented in Table 6. As the standard deviation suggests, there is a great degree of variation in the machine uptimes. The longest uptime was nearly three months. The average is skewed because of some of the longer uptimes. The median is more representative of the typical uptime.

Table 6. Machine Uptime & Downtime Statistics

Item                 Machine Uptime Statistics   Machine Downtime Statistics
Number of entries    616                         682
Maximum              85.2 days                   15.76 days
Minimum              1 hour                      1 second
Average              11.82 days                  1.97 hours
Median               5.54 days                   11.43 minutes
Standard Deviation   15.656 days                 15.86 hours

As the table shows, 50% of the downtimes last less than about 12 minutes. This is probably too short a period to replace hardware components and reconfigure the machine. The implication is that the majority of the problems are software related (memory leaks, misloaded drivers, application errors, etc.). The maximum downtime value is unrealistic and might have been due to the machine being temporarily taken off-line and put back in after a fortnight.

Since the machines under consideration are dedicated mail servers, bringing down one or more of them would potentially disrupt storage, forwarding, reception, and delivery of mail. The disruption can be prevented if explicit rerouting is performed to avoid the machines that are down. But it is not clear if such rerouting was done or can be done. In this context the following observations would be causes for concern: (1) the average downtime measured was nearly 2 hours and (2) 50% of the measured uptime samples were about 5 days or less.
Availability

Having estimated machine uptime and downtime, we can estimate the availability of each machine. The availability is evaluated as the ratio:

$$\text{Availability} = \frac{\text{average uptime}}{\text{average uptime} + \text{average downtime}} \times 100$$

Table 7 summarizes the availability measurements. As the table depicts, the majority of the machines have an availability of 99.7% or higher. Also there is not a large variation among the individual values. This is surprising considering the rather large degree of variation in the average uptimes. It follows that machines with smaller average uptimes also had correspondingly smaller average downtimes, so that the ratios are not very different. Hence, the domain has two types of machines: those that reboot often but recover quickly and those that stay up relatively longer but take longer to recover from a failure.
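A minimal Python sketch of the availability ratio illustrates why the two types of machines end up with similar values. The function name and the two sample machines are hypothetical; the only input taken from the study is the pair of domain-wide averages in Table 6, which gives a rough figure that is not the same as the per-machine average reported in Table 7.

```python
def availability(uptimes, downtimes):
    """Availability in percent; both lists use the same time unit."""
    mean_up = sum(uptimes) / len(uptimes)
    mean_down = sum(downtimes) / len(downtimes)
    return 100.0 * mean_up / (mean_up + mean_down)


# Hypothetical machines (hours): one reboots often but recovers quickly,
# the other stays up ten times longer but takes ten times longer to recover.
frequent_rebooter = availability([24.0, 24.0], [0.1, 0.1])
slow_recoverer = availability([240.0], [1.0])

# Domain-wide averages from Table 6: 11.82 days up, 1.97 hours down.
domain_estimate = availability([11.82 * 24], [1.97])
```

Both hypothetical machines have the same uptime-to-downtime ratio and therefore the same availability, even though their average uptimes differ by a factor of ten.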
Table 7. Machine Availability

Item                 Value
Number of machines   66
Maximum              99.99
Minimum              89.39
Median               99.76
Average              99.35
Standard Deviation   1.52

Fig. 3 shows the unavailability distribution across the machines (unavailability was evaluated as 100 - Availability). Less than 20% of the machines had an availability of 99.9% or higher. However, nearly 90% of the machines had an availability of 99% or higher. It should be noted that these numbers indicate the fraction of time the machine is alive. They do not necessarily indicate the ability of the machine to provide useful service because the machine could be alive but still unable to provide the service expected of it. To elaborate, each of the machines in the domain acts as a mail server for a set of user machines. Hence, if any of these mail servers has problems that prevent it from receiving, storing, forwarding, or delivering mail, then that server would effectively be unavailable to the user machines even though it is up and running. Hence, to obtain a better estimate of machine availability, it is necessary to examine how long the machine is actually able to provide service to user machines.
Fig. 3. Unavailability Distribution

Modeling Machine Behavior

To obtain more accurate estimates of machine availability, we modeled the behavior of a typical machine in terms of a finite state model. The model was based on the events that each machine logs. In the model, each state represents a level of functionality of the machine. A machine is either in a fully functional state, in which it logs events that indicate normal activity, or in a partially functional state, in which it logs events that indicate problems of a specific nature.

Selection and assignment of states to a machine was performed as follows. The logs were split into time-windows of one hour each. For each such window, the machine was assigned a state, which it occupied throughout the duration of the window. The assignment was based on the events that the machine logged in the window. Table 8 describes the states identified for the model.
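The window-based state assignment can be sketched in Python as follows. The mapping contains only a few of the (id, source) pairs listed in Table 8. The text does not say how a window containing events of several states is resolved, so the priority used here (Reboot first, then any problem state, then Functional) and the alphabetical tie-break among problem states are assumptions, as are all function and variable names.

```python
from datetime import datetime, timedelta

# A small excerpt of the (event id, source) -> state mapping of Table 8.
STATE_OF_EVENT = {
    (6005, "EventLog"): "Reboot",
    (5715, "NETLOGON"): "Functional",
    (5719, "NETLOGON"): "Connectivity problems",
    (7000, "Service Control Manager"): "Startup problems",
    (2006, "Srv"): "Server problems",
}


def window_state(events):
    """Pick one state for the (event id, source) pairs seen in one window."""
    states = {STATE_OF_EVENT[e] for e in events if e in STATE_OF_EVENT}
    if "Reboot" in states:
        return "Reboot"
    problems = sorted(states - {"Functional"})  # arbitrary but stable tie-break
    if problems:
        return problems[0]
    return "Functional" if states else None


def assign_states(log, start, window=timedelta(hours=1)):
    """log: iterable of (time, event id, source). Returns {window index: state}."""
    buckets = {}
    for time, event_id, source in log:
        index = (time - start) // window
        buckets.setdefault(index, []).append((event_id, source))
    return {i: window_state(evts) for i, evts in sorted(buckets.items())}


# Hypothetical three-hour excerpt of one machine's log.
start = datetime(2000, 1, 1)
sample_log = [
    (start + timedelta(minutes=5), 6005, "EventLog"),
    (start + timedelta(minutes=50), 5715, "NETLOGON"),
    (start + timedelta(minutes=70), 5715, "NETLOGON"),
    (start + timedelta(minutes=130), 5719, "NETLOGON"),
    (start + timedelta(minutes=140), 5715, "NETLOGON"),
]
states_by_window = assign_states(sample_log, start)
```

Windows in which the machine logged nothing recognizable are simply absent from the result; how such windows were treated in the study is not stated.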
Table 8. Machine States

State Name | Main Events (id/source/severity) | Explanation
Reboot | 6005/EventLog/4 | Machine logs reboot and other initialization events
Functional | 5715/NETLOGON/4, 1016/MSExchangeIS Private/8 | Machine logs successful communication with PDC
Connectivity problems | 3096/NETLOGON/1, 5719/NETLOGON/1 | Problems locating the PDC
Startup problems | 7000/Service Control Manager/1, 7001/Service Control Manager/1 | Some system component or application failed to start up
MTA problems | 2206/MSExchangeMTA/2, 2207/MSExchangeMTA/2 | Message Transfer Agent has problems with some internal databases
Adapter problems | 4105/CpqNF3/1, 4106/CpqNF3/1 | The NetFlex Adapter driver reports problems
Temporary MTA problems | 9322/MSExchangeMTA/4, 9277/MSExchangeMTA/2, 3175/MSExchangeMTA/2, 1209/MSExchangeMTA/2 | Message Transfer Agent reports problems of a temporary (or less severe) nature
Server problems | 2006/Srv/1 | Server component reports having received badly formatted requests
BROWSER problems | 8021/BROWSER/2, 8032/BROWSER/1 | Browser reports inability to contact the master browser
Disk problems | 11/Cpq32fs2/1, 5/Cpq32fs2/1, 9/Cpqarray/1, 11/Cpqarray/1 | Disk drivers report problems
Tape problems | 15/dlttape/1 | Tape driver reports problems
Snmpelea problems | 3006/Snmpelea/1 | Snmp event log agent reports error while reading an event log record
Shutdown | 8033/BROWSER/4, 1003/MSExchangeSA/4 | Application/machine shutdown in progress

Each machine in the domain (except the Primary Domain Controller (PDC), whose transitions were different from the rest) was modeled in terms of the states mentioned in the table.
A hypothetical machine was created by combining the transitions of all the individual machines and filtering out transitions that occurred less frequently. Fig. 4 describes this hypothetical machine. In the figure, the weight on each outgoing edge represents the fraction of all transitions from the originating state (i.e., tail of the arrow) that end up in a given terminating state (i.e., head of the arrow). For example, if there is an edge from state A to state B with a weight of 0.5, then it would indicate that 50% of all transitions from state A are to state B.
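The edge weights can be computed from per-machine state sequences as in the following Python sketch. It pools the transitions of all machines and normalizes per originating state; the filtering of infrequent transitions mentioned above is left out because its threshold is not given. The function name and the sample sequences are hypothetical.

```python
from collections import Counter, defaultdict


def transition_weights(sequences):
    """sequences: one list of per-window states per machine.

    Returns {origin: {destination: fraction of transitions leaving origin}}.
    """
    counts = defaultdict(Counter)
    for states in sequences:
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1
    weights = {}
    for src, outgoing in counts.items():
        total = sum(outgoing.values())
        weights[src] = {dst: n / total for dst, n in outgoing.items()}
    return weights


# Two hypothetical machines with identical state sequences.
weights = transition_weights([["A", "B", "A", "C"], ["A", "B", "A", "C"]])
```

In this example half of the transitions leaving state A go to B and half go to C, so the edge from A to B carries a weight of 0.5, as in the example in the text. The weights of all edges leaving a state sum to 1.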
Fig. 4. State Transitions of a Typical Machine

From Fig. 4 the following observations can be made:
- Only about 40% of the transitions out of the Reboot state are to the Functional state. This indicates that in the majority of the cases, either the reboot is not able to solve the original problem, or it creates new ones.
- More than half of the transitions out of the Startup problems state are to the Connectivity problems state. Thus, the majority of the startup problems are related to components that participate in network activity.
- Most of the problems that appear when the machine is functional are related to network activity. Problems with the disk and other components are less frequent.
- More than 50% of the transitions out of the Disk problems state are to the Functional state. Also, we do not observe any significant transitions from the Disk problems state to other states. This could be due to one or more of the following:
  1. The machines are equipped with redundant disks so that even if one of them is down, the functionality is not disrupted in a major way.
  2. The disk problems, though persistent, are not severe enough to disrupt normal activity (maybe retries to access the disk succeed).
  3. The activities that are considered to be representative of the Functional state may not involve much disk activity.
- Over 11% of the transitions out of the Temporary MTA problems state are to the Browser problems state. We suspect that there was a local problem that caused RPCs to time out or fail and caused problems for the MTA and BROWSER. Another possibility is that, in both cases, it was the same remote machine that could not be contacted. Based on the available data, it was not possible to determine the real cause of the problem.

To view the transitions from a different perspective, we computed the weight of each outgoing edge as a fraction of all the transitions in the finite state machine. Such a computation provided some interesting insights, which are enumerated below:
1. Nearly 10% of all the transitions are between the Functional and Temporary MTA problems states. These MTA problems are typically problems with some RPC calls (either failing or being canceled).
2. About 0.5% (1 in 200) of all transitions are to the Reboot state.
3. The majority of the transitions into the MTA problems state are from the Reboot state. Thus, MTA problems are primarily problems that occur at startup. In contrast, the majority of the transitions into the Server problems state and the Browser problems state (excluding the self loops) are from the Functional state. So, these problems (or at least a significant fraction of them) typically appear after the machine is functional.
4. About 92% of all transitions are into the Functional state. This figure is approximately a measure of the average time the hypothetical machine spends in the functional state. Hence it is a measure of the average availability of a typical machine. In this case, availability measures the ability of the machine to provide service, not just to stay alive.
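The second normalization, in which each edge is weighted by its share of all transitions rather than of the transitions leaving one state, can be sketched as below. The function names and the sample sequences are hypothetical and serve only to show the computation; the 92% figure in the text comes from the measured data, not from this example.

```python
from collections import Counter


def global_edge_fractions(sequences):
    """Weight each (origin, destination) edge by its share of ALL transitions."""
    counts = Counter()
    for states in sequences:
        counts.update(zip(states, states[1:]))
    total = sum(counts.values())
    return {edge: n / total for edge, n in counts.items()}


def fraction_into(fractions, state):
    """Share of all transitions that end in `state` (self loops included)."""
    return sum(f for (_, dst), f in fractions.items() if dst == state)


# Two hypothetical machines observed over a few one-hour windows.
fractions = global_edge_fractions([
    ["Reboot", "Functional", "Functional", "Functional"],
    ["Functional", "Server problems", "Functional"],
])
functional_share = fraction_into(fractions, "Functional")
```

Because every window contributes one transition, the share of transitions entering a state approximates the share of time spent in it, which is the reasoning behind reading the Functional share as an availability estimate.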
3.4 Modeling Domain Behavior

Analyzing system behavior from the perspective of the whole domain (1) provides a macroscopic view of the system rather than a machine-specific view, (2) helps to characterize the nature of interactions in the network, and (3) aids in identifying potential reliability bottlenecks and suggests ways to improve resilience to operational faults.

Inter-reboot Times. An important characteristic of the domain is how often reboots occur within it. To examine this, the whole domain is treated as a black box, and every reboot of every machine in the domain is considered to be a reboot of the black box. Table 9 shows the statistics of such inter-reboot times measured across the whole domain.
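Treating the domain as a black box amounts to merging the reboot timestamps of all machines into one sorted sequence and taking successive differences. A minimal Python sketch, with a hypothetical function name and made-up timestamps:

```python
from datetime import datetime, timedelta
from statistics import mean, median


def inter_reboot_times(reboots_by_machine):
    """Merge every machine's reboot timestamps and return gaps in seconds."""
    merged = sorted(t for times in reboots_by_machine.values() for t in times)
    return [(b - a).total_seconds() for a, b in zip(merged, merged[1:])]


# Hypothetical domain with two machines.
base = datetime(2000, 1, 1)
gaps = inter_reboot_times({
    "m1": [base, base + timedelta(hours=3)],
    "m2": [base + timedelta(hours=1)],
})
gap_mean, gap_median = mean(gaps), median(gaps)
```

Statistics such as the median and average of the resulting gaps correspond to the quantities tabulated for the domain.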
Table 9. Inter-reboot Time Statistics for the Domain

Item                 Value
Number of samples    882
Maximum              2.46 days
Minimum              Less than 1 second
Median               2402 seconds
Average              4.09 hours
Standard Deviation   7.52 hours

Finite State Model of the Domain

The proper functioning of the domain relies on the proper functioning of the PDC and its interactions with the Backup Domain Controllers (BDCs). Thus it would seem useful to represent the domain in terms of how many BDCs are alive at any given moment and also in terms of the PDC being functional or not. Accordingly, a finite state model was constructed as follows:
1. The data collection period was broken up into time windows of a fixed length.
2. For each such time window, the state of the domain was computed.
3. A transition diagram was constructed based on the state information.
The state of the domain during a given time window was computed by evaluating the number of machines that rebooted during that time window. More specifically, the states were identified as shown in Table 10. Fig. 5 shows the transitions in the domain. Each time window was one hour long.
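The mapping from the set of machines that rebooted in a window to a domain state can be sketched as follows, using the state names of Table 10. The text does not define "many" BDCs; this sketch assumes it means two or more. The function name and the machine identifiers are hypothetical.

```python
def domain_state(rebooted, pdc="PDC"):
    """Classify one time window from the set of machines that rebooted in it."""
    pdc_rebooted = pdc in rebooted
    bdc_count = len(set(rebooted) - {pdc})
    if bdc_count == 0:
        return "PDC" if pdc_rebooted else "F"
    bdc_part = "BDC" if bdc_count == 1 else "MBDC"  # assumption: many = 2 or more
    return f"PDC+{bdc_part}" if pdc_rebooted else bdc_part


# Hypothetical sequence of hourly windows.
windows = [set(), {"bdc07"}, {"bdc07", "bdc12"}, {"PDC", "bdc03"}, set()]
domain_states = [domain_state(w) for w in windows]
```

Feeding the resulting state sequence to a transition counter yields the domain transition diagram in the same way as for the single-machine model.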
Table 10. Domain States and their Interpretation

State Name   Meaning
PDC          Primary Domain Controller (PDC) rebooted
BDC          One Backup Domain Controller (BDC) rebooted
MBDC         Many BDCs rebooted
PDC+BDC      PDC and one BDC rebooted
PDC+MBDC     PDC and many BDCs rebooted
F            Functional (no reboots observed)

Fig. 5. Domain State Transitions

Fig. 5 reveals some interesting insights.
1. Nearly 77% of all transitions from the F state, excluding the self-loops, are to the BDC state. If these transitions do indeed result in disruption in service, then it is possible to improve the overall availability significantly just by tolerating single machine failures.
2. A non-negligible number of transitions are between the F state and the MBDC state and between states BDC and MBDC. This would indicate potentially correlated failures and recovery (see [12] for more details).
3. The majority of transitions from state PDC are to state F. This could be explained by one of the following:
   - Most of the problems with the PDC are not propagated to the BDCs,
   - The PDC typically recovers before any such propagation takes effect on the BDCs, or
   - The problems on the PDC are not severe enough to bring it down, but they might worsen as they propagate to the BDCs and force a reboot.
   However, 20% of the transitions from the PDC state are to the PDC+BDC state. So there is a possibility of the propagation of failures.
4 Conclusions

The discussion in this paper focused on the issues involved in analyzing the availability of networked systems using fault injection and the failure data collected by the logging mechanisms built into the system. To achieve accurate and comprehensive system dependability evaluation, the analysis must span the three phases of system life: design phase, prototype phase, and operational phase. For example, the presented fault injection study of the ARMOR-based SIFT environment demonstrated that:
1. Structuring the fault injection experiments to progressively stress the error detection and recovery mechanisms is a useful approach to evaluating performance and error propagation.
2. Even though the probability of correlated failures is small, their potential impact on application availability is significant.
3. The SIFT environment successfully recovered from all correlated failures involving the application and a SIFT process because the processes performing error detection and recovery were decoupled from the failed processes.
4. Targeted injections into dynamic data on the heap were useful in further investigating system failures brought about by error propagation. Assertions within the SIFT processes were shown to reduce the number of system failures from data error propagation by up to 42%.
Similarly, analysis of failure data collected in a network of Windows NT machines provides insights into network system failure behavior.
1. Most of the problems that lead to reboots are software related. Only 10% are attributable to specific hardware components.
2. Rebooting the machine does not appear to solve the problem in many cases. In about 60% of the reboots, the rebooted machine reported problems within an hour or two of the reboot.
3. Though the average availability evaluates to over 99%, a typical machine in the domain, on average, provides acceptable service only about 92% of the time.
4. About 1% of the reboots indicate memory leaks in the software.
5. There are indications of propagated or correlated failures. Typically, in such cases, multiple machines exhibit identical or similar problems at almost the same time.
Moreover, the failure data analysis also provides insights into the error logging mechanism. For example, event-logging features that are absent, but desirable, in Windows NT can be suggested:
1. The presence of a Windows NT shutdown event would improve the accuracy in identifying the causes of reboots. It would also lead to better estimates of machine availability.
2. Most of the events observed in the logs were either due to applications or to high-level system components, such as file-system drivers. It is not evident whether this is due to a genuine absence of problems at the lower levels or just because the lower-level system components log events sparingly or resort to other means to report events. If the latter is true, then improved event logging by the lower-level system components (protocol drivers, memory managers) can enhance the value of event logs in diagnosis.
Acknowledgments. This manuscript is based on research supported in part by NASA under grant NAG-1-613, in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software (ICLASS), by Tandem Computers, and in part by a NASA/JPL contract 961345, and by NSF grants CCR 00-86096 ITR and CCR 99-02026.
References

1. J. Arlat, et al., "Fault Injection for Dependability Validation - A Methodology and Some Applications," IEEE Trans. on Software Engineering, Vol. 16, No. 2, pp. 166-182, Feb. 1990.
2. J. Arlat, et al., "Fault Injection and Dependability Evaluation of Fault-Tolerant Systems," IEEE Trans. on Computers, Vol. 42, No. 8, pp. 913-923, Aug. 1993.
3. D. Avresky, et al., "Fault Injection for the Formal Testing of Fault Tolerance," Proc. 22nd Int. Symp. Fault-Tolerant Computing, pp. 345-354, June 1992.
4. S. Bagchi, "Hierarchical error detection in a software-implemented fault tolerance (SIFT) environment," Ph.D. Thesis, University of Illinois, Urbana, IL, 2001.
5. J. H. Barton, E. W. Czeck, Z. Z. Segall, and D. P. Siewiorek, "Fault injection experiments using FIAT," IEEE Trans. Computers, Vol. 39, pp. 575-582, Apr. 1990.
6. X. Castillo and D. P. Siewiorek, "A Workload Dependent Software Reliability Prediction Model," Proc. 12th Int. Symp. Fault-Tolerant Computing, pp. 279-286, 1982.
7. R. Chillarege, S. Biyani, and J. Rosenthal, "Measurement Of Failure Rate in Widely Distributed Software," Proc. 25th Int. Symp. Fault-Tolerant Computing, pp. 424-433, 1995.
8. J. Gray, "A Census of Tandem System Availability between 1985 and 1990," IEEE Trans. Reliability, Vol. 39, No. 4, pp. 409-418, 1990.
316R.
K.
IyerandZ.
Kalbarczyk9.
M.
C.
Hsueh,R.
K.
Iyer,andK.
S.
Trivedi,"PerformabilityModelingBasedonRealData:ACaseStudy,''IEEETrans.
Computers,Vol.
37,No.
4,pp.
478-484,April1988.
10.
R.
Iyer,D.
Tang,"ExperimentalAnalysisofComputerSystemDependability,"Chapter5inFaultTolerantComputerDesign,D.
K.
Pradhan,PrenticeHall,pp.
282-392,1996.
11.
R.
K.
IyerandD.
J.
Rossetti,"EffectofSystemWorkloadonOperatingSystemReliability:AStudyonIBM3081,"IEEETrans.
SoftwareEngineering,Vol.
SE-11,No.
12,pp.
1438-1448,1985.
12.
M.
Kalyanakrishnam,"FailureDataAnalysisofLANofWindowsNTBasedComputers,"Proc.
18thSymp.
onReliableDistributedSystems,pp.
178-187,October1999.
13.
Z.
Kalbarczyk,R.
Iyer,S.
Bagchi,K.
Whisnant,"Chameleon:Asoftwareinfrastructureforadaptivefaulttolerance,"IEEETrans.
onParallelandDistributedSystems,vol.
10,no.
6,pp.
560-579,1999.
14.
G.
A.
Kanawati,N.
A.
Kanawati,andJ.
A.
Abraham,"FERRARI:Aflexiblesoftware-basedfaultanderrorinjectionsystem,"IEEETrans.
Computers,Vol.
44,pp.
248-260,Feb.
1995.
15. I. Lee and R. K. Iyer, "Analysis of Software Halts in Tandem System," Proc. 3rd Int. Symp. Software Reliability Engineering, pp. 227-236, 1992.
16. I. Lee and R. K. Iyer, "Software Dependability in the Tandem GUARDIAN Operating System," IEEE Trans. on Software Engineering, Vol. 21, No. 5, pp. 455-467, 1995.
17. T. T. Lin, D. P. Siewiorek, "Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis," IEEE Trans. Reliability, Vol. 39, No. 4, pp. 419-432, 1990.
18. H. Madeira, R. Some, F. Moreira, D. Costa, D. Rennels, "Experimental evaluation of a COTS system for space applications," Proc. of Int. Conf. on Dependable Systems and Networks (DSN'02), Washington DC, pp. 325-330, June 2002.
19. Message Passing Interface Forum, "MPI-2: Extensions to the Message Passing Interface," http://www.mpi-forum.org/docs/mpi-20.ps.
20. J. F. Meyer and L. Wei, "Analysis of Workload Influence on Dependability," Proc. 18th Int. Symp. Fault-Tolerant Computing, pp. 84-89, 1988.
21. S. Mourad and D. Andrews, "On the Reliability of the IBM MVS/XA Operating System," IEEE Trans. on Software Engineering, October 1987.
22. D. Stott, B. Floering, Z. Kalbarczyk, and R. Iyer, "Dependability assessment in distributed systems with lightweight fault injectors in NFTAPE," Proc. Int. Performance and Dependability Symposium, IPDS-00, pp. 91-100, 2000.
23. M. S. Sullivan, R. Chillarege, "Software Defects and Their Impact on System Availability - A Study of Field Failures in Operating Systems," Proc. 21st Int. Symp. Fault-Tolerant Computing, pp. 2-9, 1991.
24. M. S. Sullivan and R. Chillarege, "A Comparison of Software Defects in Database Management Systems and Operating Systems," Proc. 22nd Int. Symp. Fault-Tolerant Computing, pp. 475-484, 1992.
25. D. Tang and R. K. Iyer, "Analysis of the VAX/VMS Error Logs in Multicomputer Environments - A Case Study of Software Dependability," Proc. 3rd Int. Symp. Software Reliability Engineering, Research Triangle Park, North Carolina, pp. 216-226, October 1992.
26. D. Tang and R. K. Iyer, "Dependability Measurement and Modeling of a Multicomputer System," IEEE Trans. Computers, Vol. 42, No. 1, pp. 62-75, January 1993.
27. A. Thakur, R. K. Iyer, L. Young, I. Lee, "Analysis of Failures in the Tandem NonStop-UX Operating System," Proc. Int'l Symp. Software Reliability Engineering, pp. 40-49, 1995.
28. M. M. Tsao and D. P. Siewiorek, "Trend Analysis on System Error Files," Proc. 13th Int. Symp. Fault-Tolerant Computing, pp. 116-119, June 1983.
29. P. Velardi and R. K. Iyer, "A Study of Software Failures and Recovery in the MVS Operating System," IEEE Trans. on Computers, Vol. C-33, No. 6, pp. 564-568, June 1984.
30. K. Whisnant, Z. Kalbarczyk, and R. Iyer, "Micro-checkpointing: Checkpointing for multithreaded applications," in Proceedings of the 6th International On-Line Testing Workshop, July 2000.
31. K. Whisnant, R. Iyer, Z. Kalbarczyk, P. Jones, "An Experimental Evaluation of the ARMOR-based REE Software-Implemented Fault Tolerance Environment," pending technical report, University of Illinois, Urbana, IL, 2001.
32. K. Whisnant, et al., "An Experimental Evaluation of the REE SIFT Environment for Spaceborne Applications," Proc. of Int. Conf. on Dependable Systems and Networks (DSN'02), Washington DC, pp. 585-594, June 2002.
