Today marks the 99th anniversary of the birth of Edgar F. Codd, the author of “A Relational Model of Data for Large Shared Data Banks” and godfather of the relational database. Ted Codd did for the database what Xerox PARC did for personal computers: made them accessible to everyday humans.
Long before the invention of computers, there were databases. As early as 2400 BC the ancient Sumerians were carving tablets recording medical prescriptions for different ailments. Lists of Roman citizens on parchment scrolls. Card catalogs. Rolodexes. Even after computers were invented, data was far from automated. Early database models used a “flat file” system – a simple consecutive list of records that required the computer to begin at the start of the list and search sequentially. A very slow way to search, add to, and maintain large volumes of records. Meanwhile, we had a moon to get to! Humanity needed a way to access and interact with data in a fast, efficient, and accurate way.
The first significant database breakthrough happened in the 1960s when IBM came up with their hierarchical Information Management System (IMS). This was an inverted tree structure of parent nodes pointing to child nodes, and NASA used IMS to manage the enormous inventory of materials and drawings required for building the Saturn V moon rocket and Apollo lunar lander. GE quickly one-upped IBM with a more flexible network model for databases, where child nodes could have multiple parents. These were a definite improvement over the clay tablet and card catalog, but still cumbersome and difficult to use — these early commercial database products required highly specialized technical skills and a lot of training.
Then along came Ted.
Edgar F. (“Ted”) Codd was born in Britain on August 19, 1923. After serving as an aviator in the RAF during WWII, he completed a degree in mathematics at Oxford and emigrated to the US. After a brief stint in Men’s Sportswear at Macy’s Fifth Avenue flagship store, he joined IBM as a programming mathematician. He worked to develop programs for the Selective Sequence Electronic Calculator, IBM’s first electromechanical computer — a vast and noisy pile of vacuum tubes and mechanical circuits. Codd also briefly worked on IBM’s Card Programmed Electronic Calculator (remember punch cards?) and many other IBM initiatives. Most notably, he led the team that developed the world’s first multiprogramming system, a nifty little notion about making it so a computer’s CPU could run more than one program at a time. (Pretty great, huh? Imagine needing to shut down Chrome so you can check Jira! Oh, wait….).
By 1968 Codd had turned his attention to databases. (According to Codd’s Turing Award biography, he claimed that what initially motivated him in this research was a presentation by a representative from a database company who seemed, incredibly so far as Codd was concerned, to have no knowledge or understanding of predicate logic.) Codd’s work at the IBM Research Laboratory culminated in his groundbreaking 1970 paper, “A Relational Model of Data for Large Shared Data Banks.”
Codd’s relational data model is widely recognized as one of the great technical achievements of the 20th century. His revolutionary breakthrough was recognizing that it is possible to disconnect the method of accessing data from the question being asked about the data. Prior to the relational model, analysts had to actually understand how data was stored (where it physically resided on the disk) to be able to query and interpret it. In fact, Codd’s paper opened with the words, “Future users of large data banks must be protected from having to know how the data is organized in the machine.” He then laid out a relational model for divorcing the logical from the physical, separating data from compute and from the application itself, and described a framework for storing and retrieving data using simple tables of rows and columns.
Perhaps this doesn’t sound all that earth-shattering, but indeed it was. We take Codd’s relational model for granted now because it makes such complete intuitive sense, using it for everything from making spreadsheets to online transaction processing (OLTP). Remember that the best thing going in 1970 was IBM’s IMS hierarchical approach associating related types under a top-level identifier. If you were a bank, for example, this identifier might be the customer’s name with all related data — phone, address, account number, other accounts — appended underneath. Data was hardwired into a top-down schema chosen by the database builder. Querying data required humans to write extremely explicit routines for very specific functions.
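To make the contrast concrete, here is a minimal Python sketch of the bank example. It is an illustration under stated assumptions, not a reconstruction of IMS: the nested dictionary merely stands in for a hierarchical layout, the customer names, phone numbers, and account numbers are invented, and SQLite (via Python's standard `sqlite3` module) stands in for a relational system.

```python
import sqlite3

# Hierarchical layout: everything hangs under the customer's name.
# All names and numbers below are made up for illustration.
HIERARCHY = {
    "Alice": {"phone": "555-0100",
              "accounts": [{"number": 1001, "balance": 250}]},
    "Bob": {"phone": "555-0101",
            "accounts": [{"number": 1002, "balance": 900},
                         {"number": 1003, "balance": 40}]},
}


def owner_of_account_hierarchical(number):
    """Hand-written routine: walk every customer, then every account."""
    for name, record in HIERARCHY.items():
        for account in record["accounts"]:
            if account["number"] == number:
                return name
    return None


def build_relational():
    """Load the same facts into two flat tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (name TEXT PRIMARY KEY, phone TEXT)")
    conn.execute(
        "CREATE TABLE accounts "
        "(number INTEGER PRIMARY KEY, owner TEXT, balance INTEGER)"
    )
    for name, record in HIERARCHY.items():
        conn.execute("INSERT INTO customers VALUES (?, ?)",
                     (name, record["phone"]))
        for account in record["accounts"]:
            conn.execute("INSERT INTO accounts VALUES (?, ?, ?)",
                         (account["number"], name, account["balance"]))
    return conn


def owner_of_account_relational(conn, number):
    """Declarative query: say what is wanted, not how to walk the data."""
    row = conn.execute(
        "SELECT owner FROM accounts WHERE number = ?", (number,)
    ).fetchone()
    return row[0] if row else None
```

The hierarchical lookup has to know that accounts live underneath customers and must be rewritten if that layout changes. The relational version states only the question and leaves the access path to the database.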
With a single paper, Ted Codd transformed the entire concept of databases and, over time, our everyday lives. Whenever we pay for lunch using a credit card, check a travel app for flight information, or scroll through a streaming service looking for something to watch, we effectively have Codd to thank. It is no exaggeration to say that the vast majority of databases in use or under development today owe a debt to his 1970 breakthrough.
(Incidentally, the relational model was in fact the very first abstract database model to be defined. This means Codd not only invented the relational model; he actually invented the entire concept of data modeling. Codd once quipped, “At the time, Nixon was normalizing relations with China. I figured that if he could normalize relations, then so could I.”)
In his original paper, Codd proposed a set of operations that could be used to extract data from relations, essentially an initial query language for relational databases. Codd used mathematical notation for this language, which he called relational algebra. In 1971 he published another paper, “A Database Sublanguage Founded on the Relational Calculus,” introducing the language Alpha. Both used mathematical notation with quantifiers and various mathematical operators. (Fun fact: most of the operations Codd proposed can be done in today’s SQL, just with different notation!).
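As a rough illustration of that fun fact, the sketch below hand-rolls three relational algebra operations (selection, projection, and natural join) in Python and then asks the same question in SQL through the standard `sqlite3` module. The sample relations, attribute names, and helper functions are assumptions invented for this example; they are not Codd's notation, and Python lists of dictionaries are only an approximation of mathematical relations.

```python
import sqlite3

# Hypothetical sample relations, invented for illustration.
EMPLOYEES = [
    {"emp": "Ada", "dept_id": 1},
    {"emp": "Grace", "dept_id": 2},
    {"emp": "Edgar", "dept_id": 1},
]
DEPARTMENTS = [
    {"dept_id": 1, "dept": "Research"},
    {"dept_id": 2, "dept": "Systems"},
]


def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attrs):
    """Projection: keep only the named attributes, dropping duplicate rows."""
    seen = set()
    result = []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attrs, key)))
    return result


def natural_join(left, right):
    """Natural join: pair rows that agree on every shared attribute."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[a] == r_row[a] for a in shared):
                joined.append({**l_row, **r_row})
    return joined


def algebra_answer():
    """Names of employees in Research: join, then select, then project."""
    joined = natural_join(EMPLOYEES, DEPARTMENTS)
    research = select(joined, lambda row: row["dept"] == "Research")
    return sorted(row["emp"] for row in project(research, ["emp"]))


def sql_answer():
    """The same question, written in SQL notation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (emp TEXT, dept_id INTEGER)")
    conn.execute("CREATE TABLE departments (dept_id INTEGER, dept TEXT)")
    conn.executemany(
        "INSERT INTO employees VALUES (:emp, :dept_id)", EMPLOYEES)
    conn.executemany(
        "INSERT INTO departments VALUES (:dept_id, :dept)", DEPARTMENTS)
    rows = conn.execute(
        "SELECT DISTINCT emp FROM employees NATURAL JOIN departments "
        "WHERE dept = 'Research' ORDER BY emp"
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]
```

In the SQL version, `NATURAL JOIN` plays the role of the join, `WHERE` the selection, and `SELECT DISTINCT emp` the projection, so both functions should return the same list of names.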
However, IBM was making good money selling IMS, its hierarchical database, so at first the company was reluctant to divert resources to support Codd’s ideas. It took until 1973 for IBM to launch System R, an experimental relational database based on those ideas. System R was a seminal project: it used a structured query language to search, retrieve, and modify data. It was also the first system to demonstrate that a relational database management system could provide good transaction processing performance. Design decisions in System R, as well as some fundamental algorithm choices such as the dynamic programming algorithm used in query optimization, are the basis for many later relational database management systems (cough Oracle cough). Two IBM researchers, Don Chamberlin and Ray Boyce, were tasked with creating System R’s query language.
Chamberlin and Boyce admired how Codd’s relational algebra and tuple relational calculus allowed the expression of complex queries in just a few lines, where the same queries in a hierarchical database would require pages. However, they wanted to make relational databases accessible to regular users, people without training in computer programming or mathematics. Their first attempt, intended as the foundation for System R, was a language they called SQUARE (Specifying Queries as Relational Expressions). SQUARE, however, still used some mathematical notation and many superscripts and subscripts, making it difficult to type on a keyboard. They adapted SQUARE to resemble the structure of an English sentence and named the new language SEQUEL, for Structured English Query Language. Unfortunately, SEQUEL was already trademarked by a British aeronautics company, so the vowels were dropped to make it SQL, an acronym (or is it a backronym?) for Structured Query Language.
Throughout this period Codd continued to be employed by IBM. He did not work closely with the System R team, however, and the reason he was kept off a project based on his own work has never been clear. It may have been related to the fact that Codd was not shy about saying so when he judged attempts to apply his relational model to be incomplete or even incorrect.
Instead, throughout the 1970s he worked on a prototype of a natural language question and answer application that would sit on top of a relational database system. It was called Rendezvous and allowed a user with no knowledge of database systems, and even limited knowledge of a given database’s content, to engage in a dialog with the system. Users could make a natural language query along the lines of “Give me the quantity of Ticonderoga pencils we have in on-hand inventory” to extract the desired information. (You didn’t even have to say “Hey Siri” or “Hi Alexa” first.)
In 1984 he left IBM to start his own consulting business concentrating on relational database design and management according to his 13 commandments: the rules that he believed bona fide relational databases should embody. Codd’s rules define the qualities a DBMS must possess to qualify as a Relational Database Management System (RDBMS). His aim was to prevent the vision of the original relational database from being diluted, as database vendors scrambled in the early 1980s to repackage existing products with a relational veneer. It’s an ambitious list (worthy of a post all its own!) and few if any RDBMSs satisfy all 13 of Codd’s rules.
Starting with his seminal 1970 paper, Codd saw the relational database industry grow to be worth many tens of billions of dollars a year, though Codd himself never benefited directly. Throughout his professional life Codd extended his intellectual curiosity and research to complex data analysis. It was Codd who coined the term OLAP (On-Line Analytical Processing) to describe the multidimensional data model for analyzing information from multiple database systems at the same time. He was also keen to help people use databases easily and effectively; up until the time of his death in 2003, Codd was investigating ways to apply his relational ideas to the problem of business intelligence and automation.
Ted Codd spent his life applying the beauty of math and predicate logic to the problem of managing, finding, and sorting data. His indelible impact on everyday human existence was recognized with the 1981 A. M. Turing Award, the highest honor in the computer science field… and with every swipe of a credit card.
Feature photo illustration by Michelle Gienow
(NB: Xerox’s legendary Palo Alto Research Center, or PARC, pioneered the GUI. Legend has it Steve Jobs wrangled an inside tour of PARC for himself and a handful of Apple engineers, resulting in the revolutionary Macintosh System 1’s GUI-based OS. So Ted Codd did for databases what PARC did for PCs: made them accessible to the masses.)