This happened 20 years ago, but it is an amazing tale of how a series of small but unfortunate events can very quickly add up to a very expensive problem. Like the time one person fat-fingered the wrong year on a JCL job, Ronald Reagan died, and the US Postal Service lost $500 million. As told by Jeff White, one-time senior manager of database operations and now Enterprise Architecture Manager for Cockroach Labs.
The United States Postal Service runs one of the largest and most complex payroll systems in the world. It is so complicated that no outside company will take it on, not for any amount of money. They’ve tried to outsource it many times and everyone takes one look and says hell no, no way we are touching that. At the time, USPS payroll was over $1 billion every two weeks — and half of the payroll was live checks. So the data center actually printed checks and they would go out by the semi-truck load.
So a new union contract came in and the operators needed to adjust payroll. This should be a simple JCL job. Job Control Language (JCL) is the scripting language used on IBM mainframes to instruct the system on how to run a batch job. However, while making the update, the JCL operator accidentally keyed in the wrong year.
Now, normally that would not be a big deal. Each new pay period started Friday at midnight. The day before a pay period would run, the mainframe would do a pre-flight calculation on any issues it saw and it would generate a report that would go to finance. Every Friday morning finance would go over things and approve the batch to run. But then Ronald Reagan died.
Now, when a former president dies, federal employees get a holiday. According to the OPM, beginning with the death of President Kennedy in 1963, the incumbent President has issued an Executive order closing Government offices throughout the world as a “mark of respect” upon the death of each President or former President. Ronald Reagan died on a Tuesday. There was a lot of indecision inside the federal government as to when they would set the federal holiday for his death. And they didn’t decide until Thursday, late in the day, that on Friday, they were gonna give all federal employees the day off. As a result, no one in the finance department was scheduled to come in and look at the reports that were kicked out every Friday morning. So the mainframe runs the report.
Normally the report printout was around 30 to 40 pages of issues. The pre-flight report with these new payroll changes, though, was about 5,000 pages. It printed out on Thursday night, as usual, and the print operators dropped it off, as usual, in the accounting department for someone to review it on Friday morning. ‘Cuz, you know, it’s just a print job, right? But government offices are closed. No one looks at it. Eventually midnight rolls around and the mainframe kicks off and starts executing EFT deposits and printing paychecks. Once the job wraps up, 150,000 employees have been overpaid to the tune of $500 million.
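For the DevOps-minded, the gap here is that the batch ran by default when nobody looked at the report. Below is a minimal sketch of a "fail closed" gate, written in Python purely for illustration. It is not how the USPS mainframe worked; `PreflightReport`, `may_run_payroll`, and the tolerance multiplier are all hypothetical names and numbers, with only the 40-page and 5,000-page figures taken from the story.

```python
from dataclasses import dataclass


@dataclass
class PreflightReport:
    issue_pages: int            # size of the pre-flight issue report
    approved_by_finance: bool   # was an explicit sign-off recorded?


def may_run_payroll(report: PreflightReport,
                    typical_max_pages: int = 40,
                    tolerance: float = 5.0) -> bool:
    """Return True only if the batch is explicitly approved AND looks normal.

    typical_max_pages comes from the story (30 to 40 pages is normal);
    tolerance is an assumed multiplier, not a real USPS setting.
    """
    if not report.approved_by_finance:
        # No sign-off (for example, the office was closed): do not run.
        return False
    # Even with a sign-off, refuse a report wildly larger than usual.
    return report.issue_pages <= typical_max_pages * tolerance


# The situation in the story: 5,000 pages and nobody in the office.
holiday_run = PreflightReport(issue_pages=5000, approved_by_finance=False)
print(may_run_payroll(holiday_run))
```

The point of the design is that silence counts as "no": a missing approval blocks the run instead of letting midnight decide.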
First problem: the USPS didn’t have an extra 500 million dollars hanging out in their bank account. So they have just made the mother of all overdrafts.
The executive in charge of accounting, though, comes into the office on Saturday morning for some reason. He sees this printout — he knows what this report is — starts flipping through it, and about pees his pants. He calls my boss screaming that we need to stop payroll NOW.
The EFTs are long out the door, but it takes a day and a half to print the full run of live paychecks. So when they stop the payroll, only about two thirds of the payroll is gone.
The first problem is to figure out who did get overpaid vs. who didn’t get overpaid, and then get everyone who hasn’t been paid yet, paid. They had to bring in all of the programmers (who are unionized) to work on a Saturday and Sunday to write a program to finish paying all of the people who hadn’t been paid in the run. Then they had to call the Secretary of the Treasury, at home, on the weekend, to ask if the US Treasury could spot the USPS $500 million.
The second problem is, they have to get all that money back. If a federal employee is accidentally overpaid, the government can only recoup 16.67% of the amount of the overpayment per pay period. You can’t take it all back at once. So then they had to write a special payroll job to pull all this money back. It took ‘em like four or five months to get it all, but finally it was all accounted for.
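To put rough numbers on that cap, here is a small Python sketch of the arithmetic. It assumes the $500 million was spread evenly over the 150,000 employees (about $3,333 each), which the story does not say, and `recoup_schedule` is a made-up helper, not the actual payroll job.

```python
from fractions import Fraction
import math

# 16.67% of the original overpayment per pay period, as quoted in the story.
RECOUP_CAP = Fraction(1667, 10_000)


def recoup_schedule(overpayment_cents: int,
                    cap: Fraction = RECOUP_CAP) -> list[int]:
    """Return the per-period deductions (in cents) needed to recover an
    overpayment when each deduction is capped at cap * original amount."""
    if overpayment_cents < 0:
        raise ValueError("overpayment must be non-negative")
    # Largest deduction allowed in any single pay period (at least 1 cent
    # so that tiny amounts still terminate).
    max_deduction = max(1, math.floor(overpayment_cents * cap))
    remaining = overpayment_cents
    schedule = []
    while remaining > 0:
        deduction = min(max_deduction, remaining)
        schedule.append(deduction)
        remaining -= deduction
    return schedule


# Assumed average overpayment: $500M / 150,000 employees, in cents.
average_cents = 50_000_000_000 // 150_000
schedule = recoup_schedule(average_cents)
print(len(schedule), "pay periods")
```

Under those assumptions the cap works out to six deductions per person, so on a two-week pay cycle twelve weeks is the floor once the clawback job exists. The four or five months in the story also covers everything around it.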
There was a congressional investigation into how this could happen, which established the cause as, simply, a JCL operator keying in the year 2003 instead of 2004. Remarkably, no one was fired for this. These things happen. Though ultimately, if Ronald Reagan had picked a different day to die, this would have been caught before a single typo cost the USPS 500 million dollars.