Social Institutions and Software

A commenter pointed me to an article by Joel Spolsky from 15 years ago.

“The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it. It doesn’t acquire bugs just by sitting around on your hard drive.”

This might be thought of as a Hayekian view of software. Accept what has evolved, even if you do not understand it. In contrast, I think that if you are maintaining software, and there are parts of it that you do not understand, you are in trouble.

Perhaps I am wrong, and Hayek/Spolsky are right. You should not try to rewrite software that works.

However, I think that it might be possible to distinguish software modules that perform generic functions in a reliable way (do not re-invent these) from application-specific software that lives in a world of changing business rules. My hypothesis is that in the latter case, frequent rewrites are more cost-effective than a process of continual patching.

10 thoughts on “Social Institutions and Software”

  1. Synthesis view: If you are trying to get previously robust and ‘acceptably reliable’ software to do something new or different, and you don’t understand how all the seemingly nonsensical kludges – which have built up over time through a trial-and-error process of evolution – fit together to make a working whole, you are inevitably going to experience lots of unintended consequences, some quite severe.

    You can either accept the ‘muddle through’ gradual introduction of new kludges approach, or you have to go full rewrite. But if you are going to go full rewrite, make sure you understand the real importance of all those old kludges, otherwise you’re going to introduce a fragility that those kludges were meant to remedy.

  2. Let’s say you have some software that has been used for twenty years. However, it has no unit tests. It is a mess, and you have no idea whether changing something will break something else. In this case it could be a bit risky to re-factor or re-write.

    On the other hand, imagine you had set up some thorough unit tests initially. Every time there was a bug fix, you added unit tests that cover the bug. In this case, every time you re-factor or re-write you have a suite to benchmark the new code against. If you don’t pass all the unit tests, you need to figure out what you’ve done wrong.

    • Correct. Testing really is key. Here’s my current strategy:

      (a) Don’t fix or replace what works. But do develop tests for existing behaviour. Unless inherited, these will be black-box tests.

      (b) When changing things, find the smallest boundary in which to contain your changes. Write unit tests within that boundary and while implementing your new feature, clean up everything within that boundary, and leave nothing non-understood.

      That’s my way around a dilemma: Spolsky is right that an awful lot of Hayekian knowledge is present, but not always articulated, in old code. Yet cargo culting around non-understood things just produces more mysterious things to cargo cult.

      The key difference between code and society is that it is immoral and infeasible to bash people around. The first hurdle at which social engineering falls is testing (social experiments). Hairballs of old, wise code without proper tests also fall at that first hurdle, and thus become rather Hayekian.
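The black-box strategy in this thread can be sketched in a few lines. This is only an illustration under stated assumptions: `legacy_fare_cents`, `rewritten_fare_cents`, and the fare rules themselves are hypothetical stand-ins, not code from any system discussed here. The point is that the expected values are recorded from the old code's observed behaviour rather than derived from a spec, and any rewrite is then checked against them.

```python
def legacy_fare_cents(base_cents, miles, is_member):
    """Hypothetical stand-in for old code nobody fully understands."""
    fare = base_cents + miles * 10
    if is_member:
        fare = fare * 9 // 10
    if miles > 1000:  # unexplained kludge that callers may rely on
        fare -= 500
    return fare


def rewritten_fare_cents(base_cents, miles, is_member):
    """Hypothetical rewrite that must reproduce the pinned behaviour."""
    fare = base_cents + miles * 10
    discounted = fare * 9 // 10 if is_member else fare
    credit = 500 if miles > 1000 else 0
    return discounted - credit


# Outputs observed from the legacy code, one case per branch combination.
PINNED = {
    (10000, 500, False): 15000,
    (10000, 500, True): 13500,
    (10000, 1500, False): 24500,
    (10000, 1500, True): 22000,
}


def mismatches(impl):
    """Return every pinned case where impl disagrees with recorded behaviour."""
    return [
        (args, expected, impl(*args))
        for args, expected in PINNED.items()
        if impl(*args) != expected
    ]
```

A rewrite that silently drops the long-haul kludge would show up as two entries in `mismatches(...)`, which is the moment to find out what that kludge was for.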

  3. I’m sorry, but you guys are talking past each other. In your minds you are thinking about very different projects. Arnold, when you started your site, it was the early days, and the tools were primitive and constantly getting better.

    But things aren’t so simplistic. At my company, our software, which you’ve probably used, is constantly refactored to make it more modular. This allows us to refactor pieces of it all the time. Discrete modules can be tested apart from the whole. Modules have contracts with other parts of the system, and a refactor can usually use the same data contract. At some point our specific system of modules may be obsolete and hold us back, but that might not be for a long time.

    In some instances we are doing full rewrites of very big parts of our ecosystem, but they are massive projects that cost 100’s of millions of dollars. These are strategic investments for our company, and are not taken lightly.
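As a rough sketch of what a shared data contract between modules could look like: every name below (`SeatRequest`, `SeatOffer`, `SeatAllocator`, both allocator classes, `book`) is hypothetical and chosen only for illustration. The caller depends on the contract alone, so a module can be refactored behind it and tested in isolation.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple


@dataclass(frozen=True)
class SeatRequest:  # contract input (hypothetical)
    flight: str
    passengers: int


@dataclass(frozen=True)
class SeatOffer:  # contract output (hypothetical)
    flight: str
    seats: Tuple[str, ...]


class SeatAllocator(Protocol):
    """The contract: anything with this method can be plugged in."""

    def allocate(self, request: SeatRequest) -> SeatOffer: ...


class OldAllocator:
    def allocate(self, request: SeatRequest) -> SeatOffer:
        seats = []
        row = 1
        while len(seats) < request.passengers:
            seats.append(str(row) + "A")
            row += 1
        return SeatOffer(request.flight, tuple(seats))


class RefactoredAllocator:
    def allocate(self, request: SeatRequest) -> SeatOffer:
        seats = tuple(f"{row}A" for row in range(1, request.passengers + 1))
        return SeatOffer(request.flight, seats)


def book(allocator: SeatAllocator, flight: str, passengers: int) -> str:
    """Caller code that only knows the contract, not the implementation."""
    offer = allocator.allocate(SeatRequest(flight, passengers))
    return f"{offer.flight}: {', '.join(offer.seats)}"
```

Swapping `OldAllocator()` for `RefactoredAllocator()` in a call to `book` changes nothing the caller can observe, which is what makes the piecewise refactoring possible.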

    • I had the same reaction. Some commenters are talking about rewriting an “application” that is 10k-20k locs total. At a megacorp like an airline or a major bank, such an “application” would be called a “component”, and the entire in-production code base would be more like 10-20M locs, with an M, and would typically involve 500-10,000 software developers.

      Rewrites from scratch are certainly sometimes merited. In particular, if the fundamental technology has changed, as is happening a lot with web front ends, it can be impossible to migrate the old version forward gradually. Even then, you want to do it one component at a time; starting a whole-stack rewrite has all the problems Joel Spolsky describes.

      If you are working on a huge code base, you need to think of it more like an aircraft carrier than a dinghy. You have to be pretty conservative about things that affect the whole boat, because so much is at stake. You start doing everything on a trial basis, and you pull the plug aggressively if a piloted change is not looking good.

      Regarding untested code, that’s pretty common. In the steady state, you often have to write tests for whatever you are about to change, before you change it. A big reason this happens is that testing technology lags behind implementation technology. Nobody abandons a new feature just because the testing infrastructure isn’t ready for it, so you end up with lots of code that doesn’t have good tests.

      The analogies to Hayek, and to economics in general, seem relevant to me. Lots of software development is very exploratory, and new ideas have to propagate through the development staff in one way or another. There are miniature breakthroughs all the time that cause the rest of the software team to have to restructure how it does things. For mature well-understood parts of the software, you’ll see codified practices similar to building codes that emerge. You can’t build everything like that, though, without shackling yourself into mediocrity. And so on.

      • “At a megacorp like an airline or a major bank, such an ‘application’ would be called a ‘component’, and the entire in-production code base would be more like 10-20M locs, with an M, and would typically involve 500-10,000 software developers.”

        I do not think that the best approach for a bank or airline is to maintain a code base of that size. If they do, then it probably means they are maintaining their own database code, rather than using generic database software. It probably means they are maintaining their own networking software; it could mean they are maintaining their own graphics programs. Who knows?

        One of the things that changes over time is that you get an increase in the proportion of a system that can be handled using generic software relative to that which requires proprietary code. One of the problems with not rewriting is that you keep wallowing in unnecessary proprietary code.

        • I disagree. Based on my own experience at a major airline, hitting 10-20k locs per component was trivial. 20M locs is probably high for the whole enterprise, but it was almost certainly in the millions. We used commercial off-the-shelf code whenever possible, but even basic customization, scripts, etc., for such products can run many klocs. (E.g., SQL Server does your backups, but you have to write the scripts to tell it where the backup files go, time-of-day, notifications when it fails, etc. You have to do this for each database too.)

          To give a very basic example, my code sent messages to customers by e-mail, phone, text, etc., for flight notifications (on time, late, cancel, gate change, etc.). Just for frequent flyers, the decision making on whether to send a message that a flight was delayed ran past 1000 lines of code. It depended on whether messages had been sent before already, plus results from multiple databases with customer contact details, timezones, whether the flight was a connection or originating, time-of-day, saved preferences, plus legal issues such as whether the customer had triggered telco opt-out for text messages. That was for ONE type of message, to frequent flyers already in another database. Non-FF, plus 100’s of other message types, plus tracking every flight, gate, airplane, reservation in the world, in real time, and that doesn’t count the multithreading, failover, redundancy, logging, etc. The system I invented and managed used off-the-shelf code, internal and external service providers whenever possible, and dealt only with automated communications with customers, and was easily past the 250klocs mark.

          We rewrote subcomponents of this whenever they got fragile, but typically I had to lie to management and do it without telling them because they’d never approve the time, no matter how badly the component needed it. The organization is what usually needs the rewrite first.
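To make the scale of that decision logic concrete, here is a deliberately tiny sketch of a "should we text this customer about a delay?" check. It is not the commenter's code: the `Customer` fields, the 15-minute threshold, the 30-minute repeat window, and the quiet hours are all assumptions invented for illustration. Even this toy version already needs five separate rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Customer:  # hypothetical record; real data would span several databases
    sms_opted_out: bool
    wants_delay_alerts: bool
    utc_offset_hours: int
    last_delay_alert: Optional[datetime] = None


def should_send_delay_sms(customer, now_utc, delay_minutes):
    """Return True only if every (assumed) business rule allows the text."""
    if customer.sms_opted_out:  # legal opt-out always wins
        return False
    if not customer.wants_delay_alerts:  # saved preference
        return False
    if delay_minutes < 15:  # assumed threshold for a delay worth reporting
        return False
    last = customer.last_delay_alert
    if last is not None and now_utc - last < timedelta(minutes=30):
        return False  # assumed window to avoid repeating a recent alert
    local = now_utc + timedelta(hours=customer.utc_offset_hours)
    if local.hour < 7 or local.hour >= 22:  # assumed quiet hours
        return False
    return True
```

Each real-world wrinkle (connections versus originating flights, other channels, other message types) adds more branches of this kind, which is how application-specific code grows large even on top of off-the-shelf components.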

  4. First, unless your code is running in an environment that never changes (inside a machine tool, say) it will in effect “rot” by failing to change in the required ways to keep pace with the environment around it. Saying it’s old and therefore the bugs have been fixed is utter nonsense – rather, it’s old, and so most of the bugs that matter in the environment it runs in have been found and dealt with. Changing the environment in most ANY way will cause old bugs to become manifest.
    (I worked two decades on operating systems, and have no doubt, there was lots and lots of code in the world that broke when some wildly unrelated thing changed.)
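A small, made-up illustration of this kind of rot: the parser below is hypothetical (the `FLIGHT|GATE|CITY` record format and the function name are assumptions), and it contains a latent bug that never shows as long as the upstream feed only sends ASCII.

```python
def parse_gate_change(raw: bytes) -> dict:
    """Parse a 'FLIGHT|GATE|CITY' record from an upstream feed (hypothetical format)."""
    # Latent assumption: the feed is pure ASCII. True for years, until it is not.
    flight, gate, city = raw.decode("ascii").split("|")
    return {"flight": flight, "gate": gate, "city": city}


def survives(raw: bytes) -> bool:
    """Report whether the parser copes with a given record."""
    try:
        parse_gate_change(raw)
        return True
    except UnicodeDecodeError:
        return False


old_environment = b"XY123|B7|Zurich"
new_environment = "XY123|B7|Zürich".encode("utf-8")  # upstream starts sending UTF-8
```

The code did not change between the two records; only the environment did. `survives(old_environment)` holds, while `survives(new_environment)` does not.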

    Second, because of the long lists of specific cases to deal with, code for airlines, and one presumes banks, will be very large. It will be very large while using commercially available operating systems, databases, network systems, and so on. There is no “generic” or “old and plenty good” body of code to express any airline’s ongoing schedule and pricing changes and reservation rules. There never will be so long as airlines compete with one another. (In other words, what Corey describes is the norm in the real world – where selection functions and evolution in response – or markets and competitive firms – prevail.)

    It is true that writing code from scratch does not magically make it better. It also does not magically assure it will be worse.

  5. It’s not primarily about the code.

    Peter Naur, arguing the Theory Building View of programming: “The death of a program happens when the programmer team possessing its theory is dissolved… The actual state of death becomes visible when demands for modifications of the program cannot be intelligently answered. Revival of a program is the rebuilding of its theory by a new programmer team.”
    http://pages.cs.wisc.edu/~remzi/Naur.pdf

    Explicit code instantiates a programmer’s tacit theory about the world in which the code needs to operate, but doesn’t contain the theory itself. If that theory can be shared with or reconstructed by other programmers, maintenance is an option. Otherwise, rewriting is the only option.
