
SLS Flight Software Safety Issues at MSFC (Update)

By Keith Cowing
NASA Watch
October 31, 2016

Keith’s 31 October update: NASA MSFC Internal Memo: Key Personnel Announcement - Teresa Washington is retiring, NASA MSFC
“Upon the upcoming retirement of Teresa Washington, I am pleased to announce the appointment of Marcus Lea to the Senior Executive Service (SES) position of Director, Office of Human Capital (OHC). As OHC Director, Mr. Lea will be responsible for the entire scope of the Center’s workforce strategy and planning, organization and leadership development, academic affairs, training and incentives, federal labor relations, and employee services and operations.”
SLS Flight Software Safety Issues at MSFC (Update), earlier post

Keith’s 21 October update: According to sources, an internal assessment of SLS software safety activities at QD34 has been conducted at NASA MSFC. The main outcome is an admission that communication between NASA and its safety contractor – and within NASA itself – was not happening. Upper management had no idea what their employees were actually doing – or not doing. In some cases contractor employees were told by NASA not to work on things that they were supposed to be working on. The assessment also found that NASA’s use of contractor employees for what amounts to personal services activities has gotten out of control. As a result of this assessment, the reassignment of NASA civil servants is being planned. Also, as a direct result of this mess, the current support contractor is likely to be eliminated from future contract consideration.
Keith’s 17 October update: Last Friday a safety advisory team visited NASA MSFC to talk about ongoing SLS software safety issues at QD34. NASA MSFC management continues to try to convince people that things are not as bad as has been reported – but the facts speak for themselves. In addition, the way that NASA MSFC civil servants have been treating contractor employees who raise issues continues to be a concern.
Keith’s 14 October update: Recently all NASA SM&A QD34 employees had to take a breathalyzer test – however, no drug test was administered (at least not yet). As for checking for software problems, the primary way that NASA MSFC is verifying SLS flight software is by CSCI execution time. This approach is something like verifying that a word processing program on your PC works based on how long it takes to open. Many employees feel that this is a useless test that is designed *not* to find any problems. Apparently the “test what we fly and fly what we test” approach is not being followed at MSFC.
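To make the analogy concrete, here is a trivial hypothetical sketch in C (the routine and the numbers are invented for illustration – this is not SLS code): a timing-only check can pass even when the computed answer is wrong, while a behavioral check compares the answer itself.

/* Hypothetical sketch - not SLS code. Contrasts a timing-only check
   with a behavioral check of the same routine. */
#include <assert.h>
#include <time.h>

/* Invented stand-in for a flight software routine under test. */
static double compute_pitch_cmd(double t) { return 0.5 * t; }

int main(void) {
    clock_t start = clock();
    double cmd = compute_pitch_cmd(10.0);
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* Execution-time "verification": passes even if cmd is garbage. */
    assert(elapsed < 0.020);

    /* Behavioral verification: checks the answer itself. */
    assert(cmd > 4.99 && cmd < 5.01);
    return 0;
}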
Reducing Risk is Lifelong Pursuit for New NASA Marshall Center Safety Chief, NASA MSFC
“[Rick] Burt was previously chief safety and mission assurance officer for SLS, the massive rocket that will take humans on exploration missions farther into deep space than ever before, and a journey to Mars. In his new assignment he is responsible for safety, reliability and quality engineering for the programs and activities across the entire Marshall Center.”
Keith’s 13 October update: Senior MSFC management seems to feel that this whole SLS safety/software issue has gone away since no one has been talking about it publicly. That’s not the impression I get from talking to people at NASA Headquarters since they remain very concerned. What will be interesting is seeing how SLS program management explains these problems and the 8% risk of SLS launch failure that the program accepts when it comes time to brief members of the new Administration’s transition team.
Keith’s 29 September update: Sources report that a substantial portion of the contractor staff working for the SLS safety contractor at NASA MSFC QD34 want out and are asking for reassignment to other programs. Many are openly looking for new jobs elsewhere. The prime contractor has been told by NASA MSFC management that if anyone leaves SLS safety support without permission, or by other than NASA-directed termination, the incumbent contractor risks not receiving consideration during the contract re-competition next year. SLS safety risks under development are being deleted. People are scared to come forward with issues. SLS management was at Michoud and Stennis for an AOA yesterday and today. This was reportedly a topic for discussion.

NASA Watch founder, Explorers Club Fellow, ex-NASA, Away Teams, Journalist, Space & Astrobiology, Lapsed climber.

13 responses to “SLS Flight Software Safety Issues at MSFC (Update)”

  1. muomega0 says:

    It would be interesting to compare the careers of managers and engineers in the HLV projects vs those who spoke out against the undersized LAS mass and depots, in a community where technical and economic merit takes a back seat to other interests.

    A 20-day capsule to Mars, forgetting Apollo 13 and why it is difficult to abort from an SRB; an LAS whose mass grew from 4 mT to 10 mT, preventing Ares I from getting off the ground; and how software could impact SLS…
    https://www.youtube.com/wat

    • Daniel Woodard says:

      During the Shuttle era there were many who assumed we weren’t ever going to use solids for a human launch again, partly because of the abort problem, although there were plenty of other reasons. I wonder if Mike Griffin ever saw this video?

      • muomega0 says:

        The weight margins disappeared soon after the 60-day ESAS was released in 2005, once it was vetted by a slightly larger community. However, many thought it was thrust oscillation and SRB under-performance. The safesimplesoon.com website no longer exists.

        Lessons learned are very important, as shown in this 2009 study and video.

        “Re-Confirm Codes. Re-confirm predictive codes & values for solid propellant motor fragmentation, comparing results of the late-1980’s joint NASA/DOE/INSRP Explosion Working Group (and related) analyses of solid propellant rocket debris (particularly applied to the Titan and NASA SRB’s), and verifying that code accuracy continues into the later 1998 Titan A20 destruct at MET=40s.”
        http://www.spaceref.com/new

        http://www.markwaki.com/pag

        • mfwright says:

          “The safesimplesoon.com website no longer exists.”

          See what this site by Alliant Techsystems Inc. looked like at
          https://web.archive.org/web

        • Daniel Woodard says:

          Those analyses involved actual detonation of test sections to compare with predictions. The need is not to reconfirm the codes mathematically, but to actually perform a new series of large-scale tests, including explosive detonation, to verify changes in the physical designs as well as the mathematical models.

          However, to me the real problem is not simply the accuracy of the models, but rather the engineering design as a whole. Solids have failure modes that are very difficult to mitigate, and processing costs that increase with size and are very difficult to reduce. Their primary advantage in missiles, the ability to be stored ready and fired at any moment, is of no value in human launch.

  2. Todd Austin says:

    It would be interesting to compare the SpaceX approach to risk mitigation with that for SLS. What percentage of failed launches is acceptable to them on manned flights?

    • fcrary says:

      As strange as this may sound, the difference may not be the acceptable risk. It may be the effort taken to assure the actual risk is below that limit. Is it absolutely certain the risk is under X%, or just “probably” under X%? The difference represents a huge amount of time and money.
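      A standard statistical rule of thumb (the “rule of three” – added here purely for illustration, not anything from the SLS program) puts a number on that gap. If n independent flights all succeed, the one-sided 95% upper confidence bound on the per-flight failure probability is

      \[ (1-p)^n = 0.05 \;\Rightarrow\; p_{95\%} \approx \frac{3}{n} \]

      so demonstrating “under 1%” from flight statistics alone would take roughly 300 consecutive successes. That is why assurance at that level has to come from analysis and ground test rather than from flight history.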

  3. Daniel says:

    I left the NASA MSFC flight software branch in April of this year. I started in that branch at the beginning of the Constellation program, writing software for what became Ares I. I know the people, and although I largely worked on projects outside of SLS after Constellation, as the technical assistant to the flight software branch chief I heard on a near-daily basis what was happening in SLS FSW. The good and the bad. I also spent those years after Constellation working in a lab nearly every day across the hall from QD34.

    First off, your assertion that “the primary way that NASA MSFC is verifying SLS flight software is by CSCI execution time” is absurdly and patently false. Read the article that was in the original post https://blogs.nasa.gov/Rock….
    The IATF is an incredible facility. It’s not perfect, but they do everything they can to give the flight software and avionics as real a flight experience as they can on the ground. Of course the ability to test is related to the quality of the models. The IATF team goes to great lengths to provide high-fidelity models. My biggest concern has always been that some models have not been validated against real test data. Some can’t be. Even with the brightest minds and best efforts, some assumptions and expectations will be wrong, and that scares the hell out of the whole team. If anything is missed – and there will be things missed – it’s not for lack of trying.

    QD34 has a problem. When I left, I believe there were two civil servants and two contractors on the software side of QD34. So your “substantial portion of contractors” at this point could be one person with a narrow and biased view. For the benefit of those not familiar with MSFC org codes, let’s shed some light. They DO NOT perform verification/validation of the flight software. They watch HOW the processes are performed, to ensure that performance matches the documented process and that the documented process adheres to NPR 7150.2. The problem with QD34 is that its efficacy is in doubt. People don’t want to stay there. They come in, look to make sure you’ve filled out all the forms in triplicate, yell when you haven’t, and go back to their office. It’s not a job that earns you much appreciation. If I’m correct, your primary source was trying to do much more, but that was not in the scope of the job, and as an organization QD34 is not equipped to assess actual software quality.

    There is a separate software V&V team, and another integrated software and avionics V&V team. Then there is IV&V in West Virginia. I had the pleasure of meeting some of the IV&V team just before leaving NASA. I got a first-hand demo of their capabilities. We were impressed and immediately started to bring some of their digital simulation capabilities into the development process to catch potential defects earlier.

    I understand that now, additional independent software experts are being brought in to do more unit level and whitebox testing and code review. I’m not sure if that was planned before or after the concerns discussed here were raised.

    As for the quality of the code, I feel quite confident. Much of the codebase carried over from Constellation. We’ve been testing it and tweaking it for nearly 10 years now. Most of it has many thousands of hours in simulated flight. It’s been run on several different platforms, and some of those ports helped expose undiscovered bugs. I wrote the underlying infrastructure code for scheduling and data routing in the early days of Constellation. To my knowledge, one bug was found in that code when we ported to a new platform about a year after the code was written. One bug was introduced after I left, when someone changed an error logging function. That one was caught in static analysis performed by IV&V. Neither would have led to a vehicle failure. So that’s two bugs in about 10K lines of code in constant use for almost 10 years. I’m embarrassed by the one bug that I wrote, but it’s not indicative of systemic software problems.
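    To illustrate the kind of latent defect a port tends to flush out (a generic hypothetical sketch, not the actual SLS scheduling or routing code): anything that quietly bakes in a word-size or byte-order assumption can run clean for years on its original target and corrupt data the day it is compiled for a different one.

    /* Hypothetical sketch - not SLS code. */
    #include <stdint.h>
    #include <string.h>

    /* Buggy: assumes sizeof(long) == 4 and little-endian layout.
       Fine on the original 32-bit little-endian target; on a 64-bit
       or big-endian port it copies the wrong bytes. */
    static void pack_ticks_buggy(unsigned char *buf, long ticks) {
        memcpy(buf, &ticks, 4);
    }

    /* Portable: explicit width and explicit byte order. */
    static void pack_ticks_fixed(unsigned char *buf, long ticks) {
        uint32_t t = (uint32_t)ticks;
        buf[0] = (unsigned char)(t & 0xFFu);
        buf[1] = (unsigned char)((t >> 8) & 0xFFu);
        buf[2] = (unsigned char)((t >> 16) & 0xFFu);
        buf[3] = (unsigned char)((t >> 24) & 0xFFu);
    }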

    The biggest problems the software group has faced have been ever-changing vehicle requirements, some avionics design decisions that caused increased code complexity, and, more than anything, people identifying obscure multiple-fault scenarios and insisting that software mitigate them without actually quantifying the likelihood/severity of multiple failures and without quantitatively assessing the risk the scenario imposed relative to the risk introduced by adding more complicated code logic to mitigate the problem. The QD34 person who started this whole discussion was one of the largest contributors to that last problem.

    As to your insinuation that Steve Pearson’s retirement has anything to do with this conversation about software: I know him. He’s not one to run away from a small bit of controversy. Take a look at the entire senior executive staff at NASA. There is a buyout for SESs right now, and many of these people are in a position to take it. Steve’s in the company of many other NASA SESs choosing to take the buyout and depart. Best of luck to them…

    And as to your accusations of conspiracy and cover-up at MSFC, all I can say is that in all but a very few isolated instances during my many years there, the people I worked with showed integrity. If you’re going to make accusations, bring evidence.

    Best wishes to the SLS team. Pointy end up!

    • Neil.Verea says:

      You mean there is no nefarious conspiracy at MSFC’s QD34? Nice summary of a viable reality that is consistent with organizational dynamics at NASA, not as dramatic as all the conjecture being floated. Refreshing!

      • fcrary says:

        Conspiracies are usually a better story than reality. But I still find this worrying: He wrote “The biggest problems the software group has faced have been… people identifying obscure multiple-fault scenarios and insisting that software mitigate them without actually quantifying the likelihood/severity of multiple failures…”

        That’s not good. At best, it means time and money going to imaginary problems, which is a good way to blow your budget and schedule. It can also mean shifting attention from real problems that really need attention. At worst, as the post said, there is a “risk introduced by adding more complicated code logic to mitigate the problem.” That is, the solution to the imaginary problem introduces new and very real failure modes.

        I don’t see this as an issue of whether or not the people writing the software are doing their best, nor of whether or not the formal management practices are being followed. It’s whether those formal management practices actually help, are dead weight, or actually make things worse.

        • Neil.Verea says:

          Given time and budget, I suspect this branch, like most branches across the agency, dispositions work through triage, based on practicality, reasonableness, and what they can and cannot do or afford. It’s the 6-sigma problems that are raised and not filtered by the identifier that can overwhelm any process. Those types of people are generally not helpful; they tend to identify numerous haystacks to find the one needle. I don’t know the specific case here, but it sounds all too familiar.

          • fcrary says:

            It sounds familiar to me as well. For me, the really annoying occasions aren’t when someone says you have to consider a 6 sigma possibility. It’s when someone says, “I have no idea how likely this is, and given the available data, that’s impossible to estimate. But since you can’t prove it’s less than one in a million, you have to assume it could be a 1% risk, and treat it as such.” I get particularly annoyed when the people making that statement are, just by coincidence, the people who would have to be paid to study the potential risk.
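            For what it’s worth, the arithmetic behind that annoyance (numbers invented purely for illustration): the mitigation effort a risk can justify scales with the expected loss,

            \[ E[\mathrm{loss}] = p \times C \]

            so being forced to treat an unestimable p as 10^-2 when it may well be 10^-6 inflates the justifiable study-and-mitigation budget by a factor of 10^4, on exactly the same evidence.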

  4. Spectreman75 says:

    I’d like to know who the contractor is so I don’t apply for a job with them.