Some observations on databases in general, implications when using COBOL, and a portal that provides data on SQL and Normalization

(c) Peter E. C. Dashwood - 2017

Relational Databases (RDB) and the systems that support them (RDBMS) have become the de facto repositories for most of the world's data. But it wasn't always like that. Prior to 1980, there were really only two choices of Database: Hierarchic and Networked.

Hierarchic DBs (like IBM's mainframe implementation, IMS, which is still used today in some quarters) represented a data structure as a hierarchy of data "segments". This was intuitive and simple, and it worked very well for most sites. But it "suffered" because excessive navigation was required to traverse the hierarchy in order to get to the data you wanted. (Experienced IMS programmers found ways around this, using special codes and concatenated segment keys to maintain position in the hierarchy, but it was still less efficient than being able to access keyed data directly.)

Networked databases didn't necessarily run across networks; rather, they were called "networked" because they represented networks of data structure. Data "records" could have multiple "owners" and "members", providing a very flexible network of data. Networked databases typically used embedded links (pointer chains) to implement one-to-many and many-to-one relationships. Following the chains could get very tricky, but it was possible to represent and store very complex data structures.

In 1970 E. F. Codd published a Relational theory of data (later elaborated and popularised by C. J. Date), based on keyed access and an iterative process of refinement, called "Normalization", which sought to remove duplicated data (both Networked and Hierarchic DBs often duplicated data for performance reasons). The Relational model was based on set theory and recognised "repeating groups" as being subsets of a master set. The physical implementation of the model meant that there were no longer fixed limits on repeating data, and you could keep adding as long as you had storage capacity. (This was in marked contrast to the COBOL idea of a finite, pre-defined OCCURS sequence.) Data was defined by its "relation" to other data, based on key dependency, and these "relations" were represented as rows in a table.


The following relationships hold for all 3 models shown:

1. CUSTOMERS and PRODUCTS can exist on their own.

2. INVOICES must have ONE (and only ONE...) CUSTOMER attached to them.

3. A CUSTOMER can have multiple INVOICES.

4. An INVOICE must contain at least one PRODUCT.
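The four rules above can be sketched as a normalized SQL schema. This is a minimal sketch only: the table and column names (CUSTOMERS, PRODUCTS, INVOICES, INVOICE_LINES and their fields) are illustrative assumptions, not taken from the article. SQLite is used here simply because it is easy to run:

```python
import sqlite3

# In-memory database; schema names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.executescript("""
CREATE TABLE CUSTOMERS (                -- rule 1: can exist on its own
    CUSTOMER_ID INTEGER PRIMARY KEY,
    NAME        TEXT NOT NULL
);
CREATE TABLE PRODUCTS (                 -- rule 1: can exist on its own
    PRODUCT_ID  INTEGER PRIMARY KEY,
    DESCRIPTION TEXT NOT NULL
);
CREATE TABLE INVOICES (                 -- rules 2 and 3: exactly ONE customer,
    INVOICE_ID  INTEGER PRIMARY KEY,    -- and a customer may have many invoices
    CUSTOMER_ID INTEGER NOT NULL REFERENCES CUSTOMERS(CUSTOMER_ID)
);
CREATE TABLE INVOICE_LINES (            -- the repeating group in its own table:
    INVOICE_ID  INTEGER NOT NULL REFERENCES INVOICES(INVOICE_ID),
    PRODUCT_ID  INTEGER NOT NULL REFERENCES PRODUCTS(PRODUCT_ID),
    QUANTITY    INTEGER NOT NULL,       -- no OCCURS limit, add rows at will
    PRIMARY KEY (INVOICE_ID, PRODUCT_ID)
);
""")

# Rule 2 in action: an INVOICE without an existing CUSTOMER is rejected.
try:
    conn.execute("INSERT INTO INVOICES (INVOICE_ID, CUSTOMER_ID) VALUES (1, 999)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Note that rule 4 (an INVOICE must contain at least one PRODUCT) cannot be expressed as a simple foreign-key constraint; it is typically enforced by application logic or a trigger.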



Before DBMS were available, COBOL programmers would implement this with a series of INDEXED FILES.

The record layouts for these files were often subject to serious design flaws based on "bad ideas" like the following:

"We archive sales data at end of year, and the average sale per year per customer is 5. We can put an OCCURS 15 for the invoice numbers into the CUSTOMER record and we should be good..."

"The average number of lines per invoice is 7. We can put the Product IDs into the INVOICE record with an OCCURS 15 and it should cover it."

Exceptions had to be detected, and "overflow" CUSTOMER and INVOICE records, linked back to the original, had to be created.
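A record layout embodying the "bad idea" above might look like the following. This is a sketch only; the field names, PIC sizes, and the overflow-chaining field are assumptions for illustration, not taken from any real system described in the article:

```cobol
      * Illustrative sketch only: names and sizes are assumptions.
       01  CUSTOMER-RECORD.
           05  CUST-ID              PIC 9(6).
           05  CUST-NAME            PIC X(30).
           05  CUST-INVOICE-COUNT   PIC 9(2).
      * The "bad idea": a fixed repeating group inside the record.
           05  CUST-INVOICE-NO      PIC 9(8)  OCCURS 15 TIMES.
      * When all 15 slots fill, chain to an "overflow" record
      * keyed by this field (zero if no overflow).
           05  CUST-OVERFLOW-KEY    PIC 9(6).
```

The OCCURS 15 hard-wires the designer's guess into every record: customers with fewer invoices waste the space, and customers with more force the program to follow the overflow chain.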

By "INDEXED FILES" in this document, you can consider mainframe VSAM/KSDS and mainframe ISAM or, for PCs, the ISAM system for the platform, usually integrated with the COBOL compiler. These are file structures that comprise an INDEX part and a DATA part, but the two are generally treated as one by the programmer. For a couple of decades in the second half of the last century (and, in some places, right up to the present (2017)), this file organization reigned supreme, because it provided both RANDOM and SEQUENTIAL access to data. (Earlier tape-based systems were primarily serial in access and had to be sorted to become sequential. For a humourous, but absolutely true, glimpse of these times, click here.)
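The RANDOM-and-SEQUENTIAL duality is what ACCESS MODE DYNAMIC delivers in COBOL. A sketch of the file declaration follows; the file name, assignment name, and key field are assumptions for illustration:

```cobol
      * Sketch only: file, key, and status names are assumptions.
       ENVIRONMENT DIVISION.
       INPUT-OUTPUT SECTION.
       FILE-CONTROL.
           SELECT CUSTOMER-FILE ASSIGN TO "CUSTMAST"
               ORGANIZATION IS INDEXED
               ACCESS MODE  IS DYNAMIC
               RECORD KEY   IS CUST-ID
               FILE STATUS  IS WS-FILE-STATUS.
      * In the PROCEDURE DIVISION, the same file then supports:
      *   READ CUSTOMER-FILE                    (random, by RECORD KEY)
      *   READ CUSTOMER-FILE NEXT RECORD        (sequential, in key order)
      *   START CUSTOMER-FILE KEY >= CUST-ID    (position, then READ NEXT)
```

With ACCESS MODE RANDOM or SEQUENTIAL alone you get only one style of access; DYNAMIC is what lets a single program mix keyed lookups with ordered browsing of the same file.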

(For information about the "nitty gritty" of SQL, DDL, Normalization and other DB related matters, visit the Portal)