
Boeing Really Needs To Get Their Software Fixed

By Keith Cowing
NASA Watch
February 9, 2020

Keith’s 9 Feb update: You should scroll down and click on the comments. At the top you will see that I highlighted comments by NASA HEOMD AA Doug Loverro. He replied to the question that I did not get to ask and put a lot of time – and a lot of words – into a quality response. Well worth reading.
Keith’s 7 Feb note: NASA and Boeing held a telecon today about Starliner problems. They said that they held today’s media telecon as a result of things posted in the media yesterday after the ASAP meeting. (See ASAP: Boeing Starliner Software Issue Potentially “Catastrophic”). Apparently Congress was reading the same articles. When asked about flying people on the next Starliner mission, Jim Bridenstine punted. Doug Loverro went into some detail as to what needs to be done with Boeing next but would not give a yes/no answer either. Alas, NASA is picking favorites again on news telecons. Probably a good idea, since this was my question for Boeing:

“Boeing launched a spacecraft designed to carry humans and discovered two fundamental software issues in flight. Now Boeing wants to launch people in that spacecraft the next time it flies. I have been reporting on software issues for another Boeing product – SLS. Add in 737 Max software problems and it would seem that Boeing has some major software weaknesses. Is there any overlap between software teams or management between Starliner and SLS (or 737 Max)? Since Boeing’s current software process has clearly failed after many years and billions of dollars spent, what do you need to do differently in order to get this whole software thing working properly again?”

I was half tempted to get into the weeds with a question about breadboards, wiring jigs, software verification checks, and things like SAIL that we used to test Shuttle avionics and what passed for “software” – all done by Boeing and its heritage companies like Rockwell and McDonnell Douglas – all designed to beat problems out of a design with brute force before it flew. You’d think they’d have that down pat by now. That is the real story here – NASA and its contractors have forgotten how to do stuff like that.


68 responses to “Boeing Really Needs To Get Their Software Fixed”

  1. Jeff2Space says:

    So they went looking for errors in the code after the test flight experienced a serious problem caused by software. And they found a second software problem which could have resulted in loss of the spacecraft (after it separates, the service module literally crashes back into Starliner). And they then quietly patched the second software problem a couple of hours before reentry. And they didn’t bother to tell the public about this second software problem.

    This is a $#!^ show. Why the hell didn’t they find the second problem via code inspection before the test flight? How many other serious software issues are there left to find?

    Best of all, Bridenstine dodged the question of whether or not NASA would require a reflight of the uncrewed test flight before putting crew on board Starliner. Seriously?!?! You’re the NASA Administrator, so it’s your call.

    • Richard Malcolm says:

      As Michael Baylor noted, we have a massive (and darned lucky) irony here: “Thus, you can put two things together and pretty much say that the Mission Elapsed Timer anomaly may have saved Starliner from disaster.”

      https://twitter.com/nextspa

    • Richard Malcolm says:

      Honestly, I think it’s not realistic for Bridenstine to make that call (for a repeat OFT flight) right now. But the fact that he’s unwilling to rule it out is a real shift in tone from where they were in the post-flight presser.

      All of the language from NASA peeps this afternoon – to me – sure sounded like it’s more likely than not now that we will, in fact, be seeing a repeat of the OFT.

    • Terry Stetler says:

      Now it’s clear why there was so much arm-waving by guys in suits during the flight. Sheesh…what a cluster-frack.

    • fcrary says:

      I can understand Mr. Bridenstine dodging that question. When (and I’m pretty sure it’s when, not if) he does order another test flight, some people will be upset, lots of people will want to know where the money is going to come from, and whatever the answer is about the money, that answer is going to make some people _very_ upset. That means he’ll need a good reason for why he’s requiring the extra test. Like half a page to a page of engineering details provided by an incident investigation board. Just saying, “Are you nuts? It’s obvious we have to do another test,” won’t pass muster with congressmen and lawyers. Even if it is pretty obvious.

      • imhoFRED says:

        That’s a very accurate explanation of the process going on in Jim’s head.

        From the outside, Jim B has still left the door open to flying CFT. This is not a good look. He could have put some emphasis on safety and quality by leading with “we won’t fly a crewed spacecraft until we get to the bottom of these issues, and fix them”. Note that he didn’t really do that. He led with “we aren’t sure yet”.

        I suspect that you are correct and that Jim already knows Boeing needs to fly OFT again … but publicly he isn’t saying that. In effect, he’s casting a shadow on the ASAP and all the workers by leaving the door open to a CFT.

  2. Richard H. Shores says:

    I listened to the teleconference and in my opinion, the Starliner software is significantly flawed and will take significant time and money to fix. Bridenstine and Loverro danced around, with noncommittal statements, whether or not the next flight will be crewed. I personally feel that there will be another uncrewed flight.

    And it was total BS with the statements from NASA and Boeing that they were being totally transparent. If they were, they would have called on Keith.

  3. ThomasLMatula says:

    Great question! Seems everyone in the room is afraid to say the Emperor has no clothes on. If it were a NewSpace firm, they would be gone, but folks in Washington seem afraid of Boeing. Not good.

  4. Richard Malcolm says:

    It’s a great question, Keith, and it’s a shame you did not get a chance to get an answer to it. The SLS problems, after all, are within the same division of Boeing even if 737 MAX is not.

    I was hoping that they’d take questions for at least another half hour or so, given how many media were on the call.

  5. Brian_M2525 says:

    Good thing NASA gave Boeing all those extra dollars to ensure CST-100 would be ready even if SpaceX’s Dragon Rider wasn’t. Looks like all those extra dollars did them no good. So much for the idea that experience is the critical factor.

    • Bill Housley says:

      Boeing is not a software company. They develop code, but that’s not the same thing.
      The fact that NASA writes the software for SLS is disturbing to me. It says that they actually do have significantly less experience there, and a younger infrastructure, than SpaceX.

  6. Jack says:

    I heard, and I would love to have someone verify it for me, that Boeing outsources all their software development to HCL in India and that HCL is currently working on the 737 MAX fix. If this is true, that explains a lot of Boeing’s software issues. Like I said, I would love to have someone verify this before I get all panicky about it.

    I did some research and found this which confirms part of what I heard.

    https://www.bloomberg.com/n

    • ed2291 says:

      Regardless of who writes it, Boeing has at the very least a significant software problem.

      • Jack says:

        That’s true, and having experience with outsourcing software development to HCL, I can assure you it won’t be resolved until they bring the software development back in house.

        • fcrary says:

          Are you sure about that? The link you provided only talks about the 737 MAX software being outsourced. Boeing’s civil aviation and space divisions are separate organizations. I don’t know about Boeing, but I know many aerospace companies have divisions which hardly talk to each other. Typically because they used to be separate companies before some buyout or merger. The way the 737 MAX software was developed may not tell us anything about how SLS or Starliner software was developed.

          • ThomasLMatula says:

            Yes, and I would also imagine there would be ITAR issues involved since it is a spacecraft.

          • rktsci says:

            Yes. Software for spacecraft, particularly the GN&C software, is ITAR controlled.

          • fcrary says:

            Actually, aircraft can be ITAR controlled as well. If you want to sell them abroad, and you or your subcontractors also build things for the military, you have to document the lack of any connections or design heritage. Oh. 737 MAX. Software developed by a company outside the United States. Agg. I hope I’m connecting the wrong dots.

          • ThomasLMatula says:

            Yes, but since the B737 Max is a commercial airliner with parts from multiple countries I assumed that any ITAR issues would have been addressed early in its development, unlike a military aircraft.

          • fcrary says:

            Never underestimate ITAR. Poking around, I discovered that Boeing did have ITAR problems with the 787. Late in development, they had to go to a lot of trouble and show that there were no connections to military aircraft. It’s gotten to the point where advertising a system as “ITAR-free” is a major selling point, and being able to prove it is a big deal. And foreign made parts aren’t a free ride. In one case, a scientific instrument was built in Switzerland and then shipped to a co-investigator in the US who had access to a nice calibration facility. Then they got into trouble because sending it back to the people who _built_ it would be exporting an ITAR controlled item. Some unkind things have been said about the 737 MAX development process. But in the case of ITAR, I’m inclined to think “written by clowns supervised by monkeys” isn’t unfair.

          • Jack says:

            I did state it confirmed part of what I heard. My experience tells me that those types of directives come from the top of the org chart and affect all divisions. But I would still like to find confirmation for the rest.

          • ThomasLMatula says:

            I think this is the article you are looking for.

            https://www.bloomberg.com/n

            Boeing’s 737 Max Software Outsourced to $9-an-Hour Engineers

            By Peter Robison

            June 28, 2019, 3:46 PM CDT

  7. Nick K says:

    I’m afraid it’s a typical Gerst problem. He put his friends in charge rather than looking for qualified, capable, experienced, proven people. Much of NASA and its contractors are in the same situation.

  8. Doug Loverro says:

    Since Keith did not get to ask his question on the air, I thought I’d try to answer it here. To remind, here’s Keith’s question:

    “Boeing launched a spacecraft designed to carry humans and discovered two fundamental software issues in flight. Now Boeing wants to launch people in that spacecraft the next time it flies. I have been reporting on software issues for another Boeing product – SLS. Add in 737 Max software problems and it would seem that Boeing has some major software weaknesses. Is there any overlap between software teams or management between Starliner and SLS (or 737 Max)? Since Boeing’s current software process has clearly failed after many years and billions of dollars spent, what do you need to do differently in order to get this whole software thing working properly again?”

    To break Keith’s question down to its pieces:

    1) Q: “Is there any overlap between software teams or management between Starliner and SLS (or 737 Max)?”

    Ans: For SLS and Starliner the answer is definitely “yes” in terms of management overlap (they both report to Jim Chilton). For software, it’s far more complex–much of SLS software is built in house by NASA, not Boeing. That’s one of the reasons we are having the Starliner Independent Review Team brief the SLS (and Orion, and other HEO projects) team on their results. We want to make sure that they all hear and understand the insidious nature of these kinds of issues. For the record, Boeing is far from the only large space firm that suffers from this kind of software process failure. But the nature of these two issues and the number of times they were missed means a far more thorough review is warranted. As for overlap with the 737 MAX issue — that’s doubtful from a people perspective. But there could be a “process” overlap. We’ll need to investigate, so good question.

    2) Q: Since Boeing’s current software process has clearly failed after many years and billions of dollars spent, what do you need to do differently in order to get this whole software thing working properly again?

    Ans: The exact answer to this question is still being formulated and depends a bit upon the final root cause we determine. As we mentioned during the conference, we have asked the Independent Review Team (IRT) to go back and determine “why” these process failures happened. Is it because Boeing has a flawed process? Or because they have a good process that failed to be followed for some reason? But in general, the way we will be able to “get this whole software thing working properly” is that we will have to review the entire set of documentation for the system software and verify that similar missteps did not occur elsewhere. That means things like going back and reviewing the original requirement statements; the process by which each statement was converted to a logical subroutine; the way that subroutine was then coded; the way that code was then verified and checked off; etc. The software development process used for systems such as this is well understood and generates so-called “artifacts” (reports, other paperwork) which can be examined to determine where the above processes may have failed. For the two issues being discussed here, gathering that paper made it relatively easy to find the cause of the failure. Of course, we knew where to look and what to look for. For the entire software load, looking for problems we do not yet know exist, this will be far more difficult.
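
    To make the artifact-trail idea above concrete, here is a minimal Python sketch (the requirement IDs, file names, and fields are invented for illustration; they are not Boeing’s or NASA’s actual artifacts). It walks a hypothetical traceability table and flags any requirement whose design, code, or verification link is missing or was never closed out.

    # Hypothetical traceability check: every requirement should trace to a design
    # artifact, a code module, and a closed verification record. Illustrative only;
    # real programs use dedicated tools and far richer artifact sets.
    requirements = {
        "FSW-101": {"design": "met_init_spec.md", "code": "met_timer.c",    "verified": True},
        "FSW-102": {"design": "sm_sep_spec.md",   "code": "sm_sep.c",       "verified": False},
        "FSW-103": {"design": None,               "code": "thruster_map.c", "verified": True},
    }

    def audit(reqs):
        """Return (requirement, problem) pairs found in the artifact chain."""
        findings = []
        for req_id, artifacts in reqs.items():
            if artifacts["design"] is None:
                findings.append((req_id, "no design artifact traces to this requirement"))
            if artifacts["code"] is None:
                findings.append((req_id, "no code module traces to this requirement"))
            if not artifacts["verified"]:
                findings.append((req_id, "verification record never closed"))
        return findings

    for req_id, problem in audit(requirements):
        print(f"{req_id}: {problem}")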

    Hope that takes away some of the sting of not being called.

    • mfwright says:

      Thanks for spending your time and addressing these questions, especially on late Friday night.

    • tutiger87 says:

      Loverro is not afraid of you folks…LOL..

      Refreshing!!

      • Doug Loverro says:

        Thanks Tutiger — to be clear, I respect NASAwatch folks and Keith in particular. You all hold us accountable — and that’s the way it should be. If I ever cannot explain why we did something, then that’s my fault for doing it, not NASAWatch’s fault for pointing it out.

        • Johnhouboltsmyspiritanimal says:

          Your openness and engagement on here are refreshing. Thanks for participating in the conversation, Mr. Loverro.

    • fcrary says:

      Thanks for the information. I have three comments, which I suspect you’ve already thought of. (1) The overlap in personnel between SLS and Starliner development is almost certainly greatest in management, and in the higher levels of management. I doubt there are any actual software engineers working on both, although I could be wrong. (2) In addition to overlap in process with the 737 MAX, it’s worth considering training as well. Real time software is a bit of a specialty, and new hires probably get some in-house training. (3) In terms of the development process, one step you didn’t explicitly mention is the interpretation of requirements. I once saw a great cartoon of the process, which started with a picture of a swing (horizontal tree branch, two ropes and a board) labeled “what the scientist thought he said the requirements were”. Then there were a series of increasingly bizarre pictures of things like “what the scientist actually wrote”, “what the engineer thought the requirement was”, etc. I’ve found that there can be a big gap between the intended and communicated requirements. (And finally, thanks, now and in advance, for all the work I expect you’ll be doing to straighten this out. That isn’t a job I envy.)

      • John Thomas says:

        As I understand it, the 737 MAX issue is a control loop problem made complex by the difficulty of testing it and exacerbated by management response, whereas the Starliner issue seems to be more of a coding error that should have been more easily detected.

    • Homer Hickam says:

      The question was good, the answer was, too. This kind of openness is exactly what is needed at NASA right now. I’m encouraged by folks like Doug.

    • Bad Horse says:

      Simple static code analysis would have identified the errors prior to flight. The issue is not in the requirements, but in the code and testing. Someone did not execute simple system-level tests of the code. Why? It’s evident little or no real IV&V was done by NASA, and S&MA did not develop hazard analyses or controls for such problems. Why? The ground did not have procedures in place to check the clock before launch? All of this goes to culture and management at both NASA and Boeing. Bring in an independent organization to do IV&V on the code (in addition to the IRT). Real code analysis – find defects in the code. Get eyes from outside to provide an accurate assessment of the code as it exists NOW (flown), not the artifacts. This would take months and be months well spent.
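
      A minimal sketch of the kind of pre-launch clock check described above, with invented telemetry values and a made-up tolerance (this is illustrative, not an actual Boeing or NASA procedure): compare the timer the spacecraft reports against the value the ground expects and hold if they disagree.

      # Hypothetical pre-launch sanity check on the onboard mission elapsed timer.
      # The field names and the 0.5 s tolerance are invented for illustration.
      TOLERANCE_S = 0.5

      def check_mission_timer(onboard_met_s: float, ground_expected_met_s: float) -> bool:
          """Return True only if the onboard timer agrees with the ground-computed value."""
          error = abs(onboard_met_s - ground_expected_met_s)
          if error > TOLERANCE_S:
              print(f"HOLD: onboard MET off by {error:.1f} s (limit {TOLERANCE_S} s)")
              return False
          print(f"GO: onboard MET within {error:.3f} s of expected")
          return True

      # Example: a timer initialized from the wrong epoch, roughly 11 hours off.
      check_mission_timer(onboard_met_s=39600.0, ground_expected_met_s=0.0)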

      A member of Boeing (very) Sr. leadership should accompany the crew on the 1st flight.

      • fcrary says:

        The problem could easily have been in the requirements. Or _one_ of the problems; the fact that it slipped through multiple layers of verification and testing clearly means there are a few more problems.

        But requirements documents tend to be very long, written by many people, and often distributed in a format which is not easily searched and which is not cross-referenced or cross-linked. It’s easy for inconsistent or even contradictory requirements to slip in. And it’s easy for someone to miss something that applies to his subsystem, say one buried in a section that he skipped because it didn’t seem relevant.

        Since timing and setting a clock was one of the known problems, I’ll toss out the use of mission elapsed time (MET) as an example. I really don’t like MET because it does not have a well-defined, universally accepted zero point.

        When someone gives me J2000 time, I know t=0 was January 1, 2000 at 12:00:00 UTC. And that the number is in real, SI metric seconds since then. For MET, is t=0 when they turned on the launch vehicle’s computer? Since most launch vehicles have several computers, which one? Or is it when the rocket takes off (t minus zero)? If it’s the former, how much time passed between turning on the computer and lift off? Is the time really in seconds, or clock ticks which are _approximately_ one second apart? If the answer to any of those questions is, “it depends,” you’re asking for trouble. And, getting back to requirements, what if one person wrote a requirement defining MET to mean one thing, and someone else wrote a different requirement assuming it meant something else?
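
        To illustrate how much that ambiguity can matter, here is a toy Python sketch (the epochs and times are invented, not the actual Atlas V/Starliner interface): two subsystems that agree on “MET” but not on which event defines MET = 0 end up a full hour apart.

        from datetime import datetime, timedelta, timezone

        # Hypothetical timeline: the ambiguity is *which* event defines MET = 0.
        booster_power_on = datetime(2019, 12, 20, 10, 36, 43, tzinfo=timezone.utc)  # invented
        liftoff          = datetime(2019, 12, 20, 11, 36, 43, tzinfo=timezone.utc)  # invented

        def met_from(epoch: datetime, now: datetime) -> float:
            """Mission elapsed time in seconds relative to a chosen epoch."""
            return (now - epoch).total_seconds()

        now = liftoff + timedelta(minutes=31)
        met_a = met_from(liftoff, now)           # one team's definition of MET = 0
        met_b = met_from(booster_power_on, now)  # another team's definition

        print(f"MET (liftoff epoch):  {met_a:7.0f} s")
        print(f"MET (power-on epoch): {met_b:7.0f} s")
        print(f"Disagreement:         {met_b - met_a:7.0f} s")  # a burn keyed to the wrong timer fires at the wrong time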

        • TheRadicalModerate says:

          This is exactly why spec-based waterfall development methodology has reached the end of the road. Modern systems are simply too complex, with too many interactions, between components made by other groups at other companies, to be able to do the standard spec / review / design / review / implement / test / integrate / test cycle that NASA and DoD love so much.

          Iterative design sounds vaguely crazy to people trained in the waterfall paradigm, because it defers major requirements until later in the program in favor of integrating and testing minimum viable functionality earlier. But the evidence that it works–even in aerospace–is strong. Just ask SpaceX.

          The problem is going to be that incumbent organizations–including the procuring organizations like NASA and the Air Force–are built from the ground up to execute on waterfall. For them to accept the productivity and quality improvements that come from iterative design, they’ll have to tear their organizations down to their foundations and rebuild them. For obvious reasons, they don’t want to do that.

          But if they don’t, the new generation is going to eat their lunch. And that’s not just a corporate problem; it’s a national security problem as well. I doubt that China is building weapons systems using 1980’s vintage software methodology.

          • Bad Horse says:

            It’s not an Agile vs. Waterfall problem.
            The issue is no systems engineering and no (effective) IV&V. No one tested the spacecraft FSW like they should have. It will not help to fix the requirements. Fix the code and go hire serious systems engineers. A deep static code analysis should be done ASAP from outside NASA and Boeing. Both blew it in terms of quality. The crew should demand this.

            The issue is quality and safety. Both missing in how this was executed.

          • TheRadicalModerate says:

            I wasn’t implying that Boeing should scrap the flight software and start over using an iterative methodology. At this point the Pottery Barn principle applies, and they’ll have to figure out where their validation and verification methods didn’t adequately cover what was in the functional spec and fix it.

            But this is an increasingly bad way of developing software. It’s bad for a variety of reasons:

            1) It doesn’t manage increasing complexity well.

            2) It defers integration testing until the end of the project, uncovering systemic architectural and design problems too late to do much about them other than build in kludgy work-arounds to the problems.

            3) It’s much less productive and much slower to develop.

            4) Increasingly, it’s not how your engineers were trained.

            To be sure, doing iterative design in tightly integrated hardware/software systems means developing a lot of prototype iron that you otherwise wouldn’t. But hardware is now cheap compared to software. And it’s very difficult to manage an iterative project that seeks to integrate out-of-house components into your system. But that strikes me as a “Doctor, Doctor, it hurts when I do this” kind of problem.

            We have way too many examples, from way too many companies, of aerospace software showing up years late with problems very similar to what we’re seeing with Starliner, SLS, EGS, and Orion. There has to be a better reason for that than “Company X didn’t hire very good systems engineers.”

  9. gearbox123 says:

    I work in software development, and it’s a field subject to technology “fads.” Waterfall was replaced by Agile, which would be improved by DevOps, which will be fixed by CI/CD….

    People get so infatuated by the “technology of the month” that they forget to make things that actually work.

    I’m reminded of a line from the movie Robocop: “We had a contract! Replacements and supplies for twenty years! Who cares if it worked!?!”

  10. Tom Mazowiesky says:

    Part of the problem is the shift from hardware-centric design to software-centric design. When the hardware was the key, people used to spend a lot of time in design reviews looking at the design, components, etc., to make sure that the hardware worked. Then in test, it was fairly straightforward to check it and, if there was a problem, identify it and fix it. With all the preparation work, people tried to do it right before construction.

    Software is like gold – completely malleable and easily updated. So if there is a problem, it’s supremely easy to fix – a few lines of code, upload it and check it out. So a lot of the pain of making a mistake in the old days – weeks for a board spin, checkout, etc. is no longer there. It’s similar with mechanical design. When a problem is found, it takes a while to make a new part to fix what you missed. It costs you time, money, etc.

    But software doesn’t impose that pain. Designs today use software because it is so easily generated and changed. Until the ship date you can do whatever you want. Unless you have the discipline to monitor what’s going on, you’ve got major problems.

    How do you ‘patch’ software a couple of hours before launch, and still have the confidence to say everything’s ok? At the very least the launch should have been delayed, since an error of that magnitude managed to be missed until very late in the game. Shouldn’t someone have said, hey, wait, we fixed this but what else is in there?

  11. Winner says:

    Thanks for keeping up the pressure Keith.
    I would rather fly on the next Dragon mission than the next Starliner mission.

  12. PsiSquared says:

    Within the last year, I thought there was a discussion here re: software for human space flight (SLS, I think), and I thought someone referenced a NASA spec for software error rate or something similar. Are specs for software reliability and error rate specified differently for SLS than for Commercial Crew, or is it a common standard? Does anyone have a link to any such specification online?

    • fcrary says:

      I’ve never been involved with human spaceflight, but as far as robotic spacecraft go, I’m not even sure what a software error rate means. There are specs for error rates in data transmission (noise) or for radiation-induced glitches, but I’ve never heard “error rate” applied to the code itself. A whole lot of effort goes into making sure the software does the same thing every time. It would depend on the input, and sometimes unexpected events can cause a software error. But then I’d expect people to consider the rate of unexpected events.

      • PsiSquared says:

        I should have been more specific. Does NASA spec a frequency, rate, or probability for those “unexpected” software events?

        • fcrary says:

          Never? Seriously, it would be mission or spacecraft specific. All the effort to make the code deterministic makes it hard to talk about rates or probabilities. Under the same circumstances, you’d expect the software to work 100% of the time or 0% of the time. For example, given the botched flight software on the first flight of the Ariane 5, there is 100% certainty that another launch with the same software would have failed in exactly the same way.

          Are there specifications on the frequency of unexpected situations? Things which make the flight software do something pathological? Sure. But that covers a whole lot of things, you wouldn’t have just one number, and it would be different for different spacecraft. Those probabilities and a bunch of others get folded in to calculate an overall probability of failure (1/270, in the case of commercial crew), which is a high level requirement. But people doing the development work usually don’t want to get pinned down by numbers on particular failure modes. Some flexibility in what problems to fix and which ones to accept is really helpful.

      • George Baggs says:

        Could mean ‘bug rate’. Typically, the rate of finding bugs (bugs per day or per week, etc.) should steadily drop as you progress through the software life-cycle. So bugs found during development (prior to verification) would be found at a higher rate than bugs found during validation, and the bug rate would continue to fall as you progress through initial release and then into production. If the bug rate remains flat, increases, or jumps around unpredictably as the weeks go on, it is a good indicator that the SW engineering process is immature or unstable/broken.
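
        A small Python sketch of that trend check (the phase names and counts are made up): the bugs-per-week rate should fall phase over phase, and a rate that climbs again in production is exactly the warning sign described above.

        # Hypothetical bug-discovery data: bugs found per week in each life-cycle phase.
        phases = [
            ("development",  12.0),
            ("verification",  7.5),
            ("validation",    3.0),
            ("production",    4.5),   # rate went back up: a red flag
        ]

        def rate_trend_ok(phase_rates):
            """Return True if the bugs-per-week rate never increases across phases."""
            rates = [rate for _, rate in phase_rates]
            return all(later <= earlier for earlier, later in zip(rates, rates[1:]))

        for name, rate in phases:
            print(f"{name:>12}: {rate:4.1f} bugs/week")
        print("healthy trend" if rate_trend_ok(phases) else "warning: discovery rate is not falling")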

        • fcrary says:

          It’s a rule people rarely follow, but… Officially if anything unexpected happens in flight, someone’s supposed to file a report (an ISA, and I honestly don’t remember what that stands for), which stays open until someone figures out what went wrong, why and how to keep it from happening again. That includes trivial things like “the telescope’s exposure times always seem to be 2% longer than commanded.” Any subsystem generating one or two of those a month would cause a serious amount of concern.

          • George Baggs says:

            Those would be equivalent to bugs found in production. Typically, each bug should have a risk-assessment performed that would tell you if the bug should be addressed immediately (high risk), or if the bug can wait (low risk) until the next update (minor or major).

          • Dave Gingerich says:

            ISA = Incident, Surprise, or Anomaly

  13. Dewey Vanderhoff says:

    So after all is said and done, the first launch of Starliner was a Boeing beta test of the software? That’s how Apple does it with major releases of new Mac OS operating systems… let the initial release and the first three incremental fixer updates be beta tests using actual customers as testers before the product is stable and worthy. Doesn’t seem too smart for a manned spacecraft, though.

  14. Michael Spencer says:

    Maddeningly off topic, wondering if others have the same issue:

    I am asked to log in on every single page here, and each login requires a damn captcha! WTF! Even just reloading the page triggers a new login request!

    I use Disqus and have had the account for- I don’t know how long. 15-20 years seems right. And I’ve been commenting here about as long. The issue started in earnest a few years ago.

    Normally, I command click the posts I want to read, opening each in a tab, then reading each in turn.

    It’s making me crazy.

    • SouthwestExGOP says:

      Doesn’t happen on my Mac Mini, OS 10.13.6 (High Sierra). Chrome version 79.lots of numbers

      • CPWB says:

        Boeing used to develop their own software with Engineering being the main focus. Every other job was developed to support that main function.

    • fcrary says:

      You might check how your wireless connections work. Every time I go somewhere else or otherwise get a new and different IP address assigned to my laptop, I have to log on again. It’s hard to imagine that happening as often as you describe, but I can’t think of anything else.

      • ThomasLMatula says:

        It has been happening to me too when I use a laptop. I suspect they changed something to better “protect” against robots posting.

  15. Tom Mazowiesky says:

    Keep in mind that software is made up of a collection of modules, each designed to perform a particular task. In a perfect world, the task is well defined, so when the module is tested it can be verified to do what it was supposed to do and not produce something odd.

    But now you take each one of these modules and combine them to do a more complicated task. With each interface, the possibility increases that a subtle error can manifest as a major problem. You can’t test every path in software this complex – it’s a physical impossibility; they’d never finish testing.
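
    As a toy illustration of that interface point (entirely hypothetical, and nothing to do with Starliner’s actual modules), here is a units mismatch in Python where each piece passes its own test but the combination is wrong by a factor of a thousand.

    # Module A reports remaining burn time in *milliseconds*.
    def remaining_burn_time_ms(target_dv: float, achieved_dv: float, accel: float) -> float:
        return 1000.0 * (target_dv - achieved_dv) / accel

    # Module B schedules engine cutoff, assuming its input is in *seconds*.
    def schedule_cutoff(now_s: float, remaining_s: float) -> float:
        return now_s + remaining_s

    # Each module "works" in isolation; the integrated behavior is off by 1000x.
    remaining = remaining_burn_time_ms(target_dv=100.0, achieved_dv=90.0, accel=2.0)  # 5000.0 (ms)
    cutoff = schedule_cutoff(now_s=120.0, remaining_s=remaining)                      # treats ms as s
    print(f"cutoff scheduled at t+{cutoff:.0f} s instead of t+125 s")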

    And a lot of this is what is called ‘real time’ operation. The software is absorbing lots of sensor readings and figuring out what to do. Multiple tasks interact with each other, again in subtle ways. And it’s difficult to test for problems; even if you program a simulator for the sensing system, you’re reliant on how well the simulator describes reality.

    So you really need to stay disciplined in managing this, and that’s where Boeing seems to fall short. As a former pilot, I wouldn’t be comfortable with a software patch made a couple of hours before liftoff, I don’t care how minor it was. And while the bug may have been small, its impact was huge. How could the management team ignore this?

    • fcrary says:

      Your comment about modules made me wonder if these problems really are flight software issues. That’s what’s been reported, and that’s what we’ve been talking about. But there are some fine distinctions I would expect most press releases and media reports to be vague about.

      I’m not sure about Starliner, but the spacecraft I’ve been involved with (planetary and robotic) have about five things which could be called “software” by the media or in a press release. There is real, actual flight software (both as an operating system, a main loop and modules), data tables (numbers which may change from time to time, but which are values used by the software, not software itself), sequences (command scripts which are unique to a particular period, e.g. a few weeks for Cassini, perhaps a single mission for Starliner, if that’s how they do it), mini-sequences (short scripts which rarely change and may be called by the main sequence multiple times), etc. Unless you’re fairly deeply involved, those might all sound like different sorts of software. But the way they are developed and tested is extremely different. In the case of Starliner, I’m not sure which one we’re talking about. So some of my comments may have been off the point.
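
      A minimal sketch of that distinction (hypothetical commands and format): the “code” here is a small, unchanging interpreter, while the “sequence” is timed data that it executes, which is one reason the two are developed and tested so differently.

      # The flight code: a small, fixed interpreter.
      def run_sequence(sequence, dispatch):
          """Execute (time_s, command, argument) entries in time order."""
          for time_s, command, arg in sorted(sequence):
              dispatch[command](time_s, arg)

      # The "sequence": data that can be rebuilt per mission phase without touching the code.
      sequence = [
          (10.0, "set_attitude", "prograde"),
          (42.0, "burn",         5.0),        # hypothetical burn duration in seconds
          (60.0, "set_attitude", "sun_point"),
      ]

      dispatch = {
          "set_attitude": lambda t, a: print(f"t={t:5.1f}s  attitude -> {a}"),
          "burn":         lambda t, a: print(f"t={t:5.1f}s  burn for {a:.1f}s"),
      }

      run_sequence(sequence, dispatch)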

      By the way, they didn’t make a flight software change a couple hours before liftoff. It’s worse. They did it in flight, a couple hours before reentry. Apparently, the initial problem with the clock prompted someone to check and see if anything else was off, and they had to upload a patch in flight.

      • Tom Mazowiesky says:

        EEK! Though I guess if your tail is relying on a patch, better before than after.

        I’ve shipped a lot of code and everybody has issues with problems found after the product leaves the factory. But I’ve never been involved in what anyone would call safety related systems.

        I’ve been interested in space exploration since I watched John Glenn on his Friendship 7 flight, and my perception of Boeing is that their management of the software needs a lot more work.

      • Tom Mazowiesky says:

        Also, I think your comments are knowledgeable and spot on.

  16. tutiger87 says:

    Everybody is trying to do more with less in today’s world. Eventually that will bite you in the rear end if done recklessly and without careful thought.

  17. TheRadicalModerate says:

    What was it that suddenly brought the SM separation bug to the flight team’s attention? It seems wildly unlikely that, while the vehicle was on-orbit, somebody came running into the room and said, “Hey I just happened to discover this problem in the control matrix, but I’m sure glad I got here now instead of 2 hours later!”

    My guess is that the attitude control problems that were attributed to the MET error were not quite what we’ve been led to believe, and a thorough review of attitude control in general surfaced the “oh crap!” moment. But that would mean yet another instance in which somebody has been less than forthcoming about the failure to circularize efficiently.

    • fcrary says:

      Someone else pointed out a potential reason for finding the other bug. When someone discovers he’s made a mistake, it’s common for him to ask himself, “have I made the same mistake somewhere else?” When a programmer discovers a bug, that often triggers a search for similar bugs elsewhere in the code.

      • TheRadicalModerate says:

        My guess is that the excessive prop use in maintaining attitude control and the SM separation problem are actually the same bug, and that, despite the earlier explanation that the attitude control problem was related to the bad MET, it was really that the combination of thrusters needed to achieve a specific rotation or translation was wrong.

        Diagnosing the attitude control problem would have therefore naturally led to uncovering the problem of translating the SM away from the CM.
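
        To make the “wrong thruster combination” idea concrete, here is a toy Python sketch with a made-up four-thruster layout (not Starliner’s actual RCS geometry): a simple consistency check on the selection table shows how one wrong entry turns a commanded translation into a pure rotation.

        # Toy planar RCS model: each thruster has a position and a thrust direction.
        # The geometry is invented; the point is the consistency check, not the vehicle.
        thrusters = {
            "T1": {"pos": (+1.0, +1.0), "force": (0.0, -1.0)},
            "T2": {"pos": (-1.0, +1.0), "force": (0.0, -1.0)},
            "T3": {"pos": (+1.0, -1.0), "force": (0.0, +1.0)},
            "T4": {"pos": (-1.0, -1.0), "force": (0.0, +1.0)},
        }

        def net_force_and_torque(selected):
            fx = fy = torque = 0.0
            for name in selected:
                (px, py), (tx, ty) = thrusters[name]["pos"], thrusters[name]["force"]
                fx, fy = fx + tx, fy + ty
                torque += px * ty - py * tx   # 2-D cross product about the center of mass
            return (fx, fy), torque

        # Correct table entry for "translate -Y": fire T1 and T2 (no net torque).
        # Buggy table entry: fire T1 and T4; each thruster checks out alone,
        # but the combination spins the vehicle instead of translating it.
        for label, selection in [("correct -Y translation", ["T1", "T2"]),
                                 ("buggy   -Y translation", ["T1", "T4"])]:
            force, torque = net_force_and_torque(selection)
            print(f"{label}: net force={force}, net torque={torque:+.1f}")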

  18. Jackalope3000 says:

    I’ve worked for several companies where, after crashing a UAV, we would discover a software bug. And then we’d fix the bug and the problem was considered solved… until of course we found the next one the same way. Management would never agree that this represented a lack of development process, a lack of adversarial review, and a lack of a flight test process that includes vigilance against the aircraft entering off-nominal flight conditions.

    • fcrary says:

      Now that you mention it… There is one thing I’ve been uncomfortable about in the spacecraft or instrument reviews I’ve been involved in. Adversarial. The people whose work is being reviewed went out of their way to avoid it being adversarial. They made a point in introductions and conclusions that they wanted to be totally open (a lie, based on how they prepared and instructed the people on their team) and how the reviewers were “really part of the team.” Sometimes holding parties for the reviewers between the first and second days of the review, etc. I know you don’t want a review to be openly hostile, but these are supposed to be neutral, unbiased and _external_ reviews.

      • Skinny_Lu says:

        Excellent point. Design and certification reviews are attended by lots of disciplines with very narrow focus. It is hard to get into the weeds of a system, even with people who understand how things work. Time constraints do not allow thorough reviews. Meeting distractions from going off topic are common, and a week-long review can still fall short. Only a step-by-step, tabletop review of a system by knowledgeable people can discover such problems. When I worked with cryogenics, the procedures were reviewed step by step by the whole team, with overhead projection or paper drawings, changing the position of each valve and the pressure regulator settings, because these systems can easily over-pressurize and damage hardware or hurt people. It is the only way to properly review & approve a critical procedure. Not anymore, sounds like. Someone else here commented that people are over-tasked and doing more than they could possibly keep track of.

  19. Michael Spencer says:

    While it is too early for a victory lap, Mr. Loverro’s predilection for plain speaking is starting to soften my hardened-heart.

    It’s terrifying 🙂

  20. tutiger87 says:

    Keith…

    It’s not that NASA and its contractors forgot how to do such things. It’s that everybody is trying to do more with less.