- NASA Watch
- February 9, 2023
SLS Flight Software Safety Issues at MSFC (Update)
Keith’s 29 September update: Sources report that a substantial portion of the contractor staff working for the SLS safety contractor at NASA MSFC QD34 want out and are asking for reassignment to other programs. Many are openly looking for new jobs elsewhere. The prime contractor has been told by NASA MSFC management that if anyone leaves SLS safety support without permission, or by any means other than NASA-directed termination, the incumbent contractor risks not receiving consideration during the contract re-competition next year. SLS safety risks under development are being deleted. People are scared to come forward with issues. SLS management was at Michoud and Stennis for an AOA yesterday and today. This was reportedly a topic for discussion.
Keith’s 26 September update: Shortly after this original posting on 9 September Andrew (Andy) Gamble was summoned to MSFC Center Director Todd May’s office to talk about QD34 issues. During subsequent closed door meetings MSFC management decided that they did not have any software problems – when in fact they do.
A few days later Steven Pearson, Deputy Director of NASA MSFC’s Safety & Mission Assurance Directorate, suddenly announced his retirement after 37.5 years at NASA. MSFC S&MA hired some outside experts in the area of software safety and V&V and asked them to determine if the MSFC QD34 contractor or NASA QD34 was right with regard to the issues under dispute. The outside experts completed their review and agreed 100% with the positions taken by the contractor.
Meanwhile, contractor employees working for QD34 who have surfaced the issues I reported earlier have suddenly found their funding yanked. Moreover, employees who are leaving or thinking of leaving as a result of raising concerns are being specifically blackballed – by name – by MSFC management. Potential employers are being told by NASA MSFC that they risk not winning new contracts – or losing existing contracts – if they hire these individuals.
According to an internal MSFC memo previously cited “Andy and George stated that Software V&V is really just a function of Software Quality Engineering. (Really, our QD34 customers seemed not to have a clue of what comprises Software V&V, or even Software Assurance or Software Quality for that matter.) False – NASA/MSFC procedural requirements, standards, guidebooks/handbooks, etc. clearly speak to the contrary. Just because Software Assurance isn’t organized on SLSP (or at MSFC) with people explicitly assigned to a Software V&V group doesn’t mean we don’t do Software V&V tasks (separately from Software Quality tasks).”
Most troubling of all, the internal assumption at MSFC is that the first SLS flight will have a built-in risk of failure of around 8%. This risk is being “baked in” to the design of SLS in part due to decisions being made at MSFC about software and avionics – decisions that are being made so as to not surface troublesome issues that no one wants to deal with. One can imagine that safety folks at MSFC are nervous.
This is no way to build a rocket, folks.
Keith’s 9 September note: NASA’s SLS program has been experiencing budgetary and scheduling issues for years as noted in multiple GAO and OIG reports. The program has also had problems with technical issues such as software and avionics. Multiple sources report that one of the places experiencing significant problems is QD34 at NASA MSFC – where significant SLS flight software safety issues are mounting – issues that no one else is hearing about. The Branch Chief for QD34 is Andrew Gamble. NASA MSFC management – and perhaps the NASA OIG – might want to pay that organization a visit.
Keith’s 9 September update: According to an internal memo there is a “lack of understanding by QD34 of the intent of NPR 7150.2B (A or B) [NASA Software Engineering Requirements] and of CMMI [Capability Maturity Model Integration] when it comes to a higher level of process, procedure, etc. rigor expected for Class A software than for Class C” and that actions by NASA MSFC QD34 management represent “a direct attempt by our QD34 customers to intentionally minimalize the differences to avoid making matters complicated and having to make any corresponding changes for SLSP at this time.”
SLS Avionics: The Brain Without a Body, NASA
“Ultimately the avionics boxes and software have to work perfectly. But how can you be sure without putting it on the world’s largest rocket and seeing how it works? That’s the focus of the Integrated Avionics Test Facility – or IATF – at NASA’s Marshall Space Flight Center.”
Keith – I’m all for shining light on areas where NASA is misstepping, but it seems your distrust of SLS is taking over. The referenced article describes a multi-pronged approach for testing and validating the SLS avionics software. Your commentary, while dark, is completely unsupported by the referenced article.
SLS has problems – I’m not convinced it will fly 4 times – but let’s cast shade where it’s deserved, and with appropriate supporting data.
(Sigh) yet another person who won’t use their real name decides to lecture me. What I have posted is true. I may be able to post more later. If you do not believe what I have posted, fine. There are other space-related news websites for you to read.
Can you provide some information on the actual software issues?
I do not wish these types of problems on any organization or any program. These kinds of problems happen in the private sector too.
We can only hope this is not a systemic symptom of the Space Launch System.
Has anyone actually tried to read NPR 7150.2B or the CMMI? Has anyone who authored the report actually maintained operational software, hands-on?
Remainder deleted until I can get more details…
Actually, until I read your post, I’d only seen summary presentations about 7150.2. So I decided to skim over it yesterday. I’m afraid I didn’t read it cover-to-cover; it’s not what I’d call a thrilling page-turner I just couldn’t put down. I’m not going to bother looking at CMMI.
I think your basic impression is correct, and I see little value in it (7150.2, that is.) Actually, I see quite a bit of harm, in the form of wasted time, effort and money. But I think you and the authors of that document are making incorrect assumptions.
Your bad assumption is to think 7150.2 is intended as rules for writing good flight software. It isn’t. It’s intended to be rules on how to manage software development. Remember, a project manager is not expected to know anything about the nuts-and-bolts details of the work he supervises. A big project like SLS includes far too many disciplines for that to be realistic. Even with middle managers between the project manager and the people doing useful work, the project manager is still expected to be in charge and responsible. So 7150.2 is about the process of developing flight software and how things should be managed, not how the work should be done.
The other bad assumption is on the part of the people who wrote that, and the people who consider it useful. They assume that, if the process and management are structured in a particular way, they can assure good results. That’s absurd. The problem isn’t limited to flight software, but I’ve seen some. I can assure you that there is some very bad code out there. There is also some very good code out there. As far as I can tell, there is remarkably little correlation between a development process which follows 7150.2 and the quality of the results.
Good points, I probably over-generalized. However, I have difficulty seeing how a project manager who doesn’t understand the nuts and bolts at least to a significant degree can be effective. What about Kelly Johnson, or Ed Heinemann, or Wernher von Braun?
I wondered the same until I read the comment about middle managers, who presumably are experts. Nonetheless how does the boss prioritize resources if she has no idea how the pieces will fit together?
I did say that the people who think 7150.2 and similar documents are useful are making bad assumptions.
I encourage anyone with a software background to read the NPR and assess for themselves: http://nodis3.gsfc.nasa.gov…. During my tenure at NASA, I felt that NPR7150.2 did more to hinder quality than improve it because it distracted from real issues and shifted the focus to issues of semantics in documentation. I saw far too many dollars and man-hours spent debating the intent and implementation of the NPR; resources that could have been applied to focused analysis, testing, and improvement.
The flight software organization is also required to maintain CMMI level 3 certification. Unlike NPR7150.2, CMMI provides a very good framework for assessing your organization’s efforts to build quality into the system. NPR7150.2 is concerned with inspecting the artifacts of producing the software (but not the software) after the process is done.
We know from the manufacturing world that building quality in from the start is better than inspecting for defects afterward. NPR7150.2 has also perpetuated a big up-front requirements/design mentality. Modern programmers and entrepreneurs have found the value in rapid iteration in maturing their products and services. During my time with NASA we tried to implement some components of agile software development, particularly on SLS flight software. The NPR7150.2 mindset was always a major hindrance.
Keith is on to something. Several people have been forced to resign or have gotten fired from QD34 for bringing up software safety issues. Sounds like yet another cover-up. Maybe it is time for the OIG. The data is there.
I work for QD34 as a contractor. I would like to remain anonymous for fear of retribution from the branch chief; I would like to keep my job, and others have been fired. I have seen QD34 management manipulate data and discourage people from speaking out using intimidation. NASA upper management needs to confront this problem or it will keep getting worse.
You are not alone… which is why websites like this are so critical. It is not that we will change anything; however, we can document, document, document, so when the wheels do come off the wagon we can say, “See, I told you so.”
I do not have any faith and confidence in OIG anymore. They are well aware of the problems here.
Whatever happened to the “If it’s not safe, say so” posters that were common on NASA centers a few years ago, with a phone number to call if safety concerns were being ignored? Has this program fallen by the wayside?
Ship the rocket and fix the software later. I am sure that JPL could write the software if Marshall is too busy with hardware issues. The real problem is that the rocket has no mission and there are plenty of companies that would love to launch big NASA payloads since they get paid even if the rocket blows up.
KC – I’m glad you’re getting some internal scoops. NASA programs need more of this on matters that too easily get buried, but that would be constructive to have in the light of day. Unfortunately, it’s all too easy nowadays to trace a leak when the specific lines of an email or a slide or such show up somewhere. Still, wish we had more of this.
The real issue here is that NASA management has a toxic culture that fosters bullying and intimidation as a way to “keep the project sold” rather than do the right thing and actually fix the problem.
This mentality is why we have lost shuttles and people, and the problems have gotten worse. NASA cannot clean its own house today. Hopefully the next administrator will do something to right this floundering ship.
“NASA cannot clean its own house today. Hopefully the next administrator will do something to right this floundering ship.”
NASA cleaning its own house is not politically useful to the Chairs of the congressional committees that pass, or don’t, on the line items in NASA’s budget.
Under what legal authority is NASA management allowed to maintain a blacklist of prohibited employees, or to withhold or cancel a contract from a company that hires one of the blackballed employees? I would like to see that document…
Perhaps it is the same policy that allows them to order their contractors not to hire anyone older than a certain age?
Official blacklists are illegal. But, unofficially, managers do talk to each other. It’s hard to regulate against someone, over dinner, rolling their eyes and saying, “Oh, him. When he worked for me, I didn’t think he was half as good as his resume made him seem.” It’s not hard to create an unofficial blacklist by poisoning the rumor mill.
There is nothing unofficially going on at MSFC. This is deliberate and systematic blacklisting of specific individuals.
I have confidence unofficial things go on everywhere, including MSFC. I think you mean there are also official things going on regarding blacklists. I haven’t seen any evidence for or against, so I won’t comment.
It takes more than an informal network of managers to threaten a company with the loss of a contract. If there is evidence of deliberate and systematic blacklisting then the OIG needs to be investigating this.
Yes, you would think that the MSFC OIG would be looking at this. You would think ….
Easy to imagine people getting together over lunch and just…talking.
What we need is a REAL investigative organization, one that’s immune to politics. Like the FBI.
Wow! So which of those mismanaged and illegal centers is responsible for the slow motion train wreck that is the James Webb Space Telescope? Which center couldn’t keep imperial/metric units of measure straight and flame-broiled a Mars probe as a result? How many billions was Curiosity over budget again?
Let’s blame JPL for the right things. The Mars Climate Orbiter failure certainly has enough blame to share.
Lockheed Martin Astronautics provided the small forces file with imperial units and no documentation saying what units they used. JPL took that data and assumed, as specified in interface control documents, that the units were metric. I can’t really blame them for that.
JPL was also the organization which had clear evidence of a serious navigation error building up over months of the cruise to Mars. They decided that figuring out what was going on wasn’t worth the effort. That’s where JPL deserves some blame.
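The units mix-up described above is the kind of interface error that explicit unit tagging is meant to catch. What follows is only a minimal sketch, not the actual Mars Climate Orbiter ground software: the record layout, the `Impulse` class, and the `parse_record` function are all hypothetical names invented for illustration. The one hard fact used is the conversion factor (1 lbf = 4.4482216152605 N). The idea is that a reader refuses any value that arrives without a declared unit instead of silently assuming metric.

```python
from dataclasses import dataclass

# Newton-seconds per pound-force second (1 lbf = 4.4482216152605 N).
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical impulse value, always stored in newton-seconds (SI)."""
    newton_seconds: float

    @classmethod
    def from_value(cls, value: float, unit: str) -> "Impulse":
        # Convert from the declared unit; reject anything unrecognized.
        if unit == "N*s":
            return cls(value)
        if unit == "lbf*s":
            return cls(value * LBF_S_TO_N_S)
        raise ValueError(f"unknown impulse unit: {unit!r}")


def parse_record(record: dict) -> Impulse:
    """Parse one hypothetical small-forces record.

    A record with no 'unit' field is rejected rather than assumed metric.
    """
    if "unit" not in record:
        raise ValueError("record carries no unit; refusing to assume one")
    return Impulse.from_value(record["value"], record["unit"])


if __name__ == "__main__":
    print(parse_record({"value": 1.0, "unit": "lbf*s"}))
    try:
        parse_record({"value": 1.0})
    except ValueError as err:
        print("rejected:", err)
```

Under this sketch, a file delivered in pound-force seconds is either converted correctly or rejected outright; it can never be read as newton-seconds by default, which is the failure the comment above describes.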
Yes, they were making more than 10 times the normal number of course corrections. That should have clued someone in that something was wrong.
And crazy Donald wants to put a man on this thing! Let it be him!