Intro

Electromagnetic properties can be varied to read as either an 0 or 1. Such a reading is single bit. Eight bits make a byte. A single byte can be translated to a character. (Actually the dominant convention, ASCII, codes characters using 1 byte but the newer convention, Unicode, uses 2 bytes.)

Characters make the data for records (aka rows; tuples;) each of which has a number of fields (aka columns; attributes;). Records can be collected to form a particular subject, and arranged into tables. A table is typically shown like a spreadsheet and lists the records would along the left and lists the fields along the top. A collection of tables forms a database. EG:

I just got an offer to order 7 books for a penny. The postcard I need to send to the book club is a record. That record contains a number of fields, such as: my name, address, city, state, zip code, and the titles of the 7 books. When the book club gets all their postcards back, they can enter all the data to in one ore more tables in their database.

Tables

Data in a database is collected by subject (EG: Customer, Product, Sale, or Order.) into tables (aka files, relations, subjects, or entities).

diagram of sample table structure

Basic Table Types

Kernel (aka master or regular). Kernel tables exist on their own. It does not provide descriptive information to other tables. Kernel tables are the heart of the database. EG:

A "Product" table is a typical kernel table.

Bridge (aka connector, junction, middle, or associative). Bridge tables tie other tables together. This is the middle table in a many-to-many relationship. EG:

A "Product" kernel table is connected to a "Order" table via a "ProductOrder" table since a product may be in many orders and an order can have many products.

Extension. Extension tables have a one-to-one relationship with a kernel table and may be a sub-set of its related kernel table. EG:

A "Product" table may have a "ProductNew" table just for new products.

Look Up Tables

Look up (aka validation). Look up tables have a look up key (which is often the primary key of the look up table) that is used to match foreign keys in a foreign table. The foreign table matches its foreign key in the look up table's primary key where it gets more info. EG:

A "Product" (foreign) table may have a "Supplier"(look up) table that describes the supplier of each product.

However, a table may act as both a foreign and a look up table. EG:

"Product" is a foreign table to the "Supplier" table, but "Product" is also a lookup table to the "ProductOrder" table.

Reference tables are look up tables that change very infrequently, eg a table of US states.

Other Tables

Base. Base tables are the tables from which a view of multiple tables are based on. That is, when multiple tables are joined in SQL, those multiple tables are the base tables. EG:

SELECT Product.Name
FROM Product INNER JOIN ProductNew
    ON Product.ID = ProductNew.ProductID
--The base tables are 'Product' and 'ProductNew'.

Table Components

In a table, the horizontal rows (aka records, tuples, or instances) are records for each instance in the subject. Each record in the subject will have the same attributes but possibly different values for each attribute.

In a table, the vertical columns (aka fields or attributes) are the different attributes (fields) that all rows or records in the table possess. There are two kinds of fields:

  • Key. A key field is used to identify rows in tables. There are three kinds of key fields:
    • Primary Key. A primary key is a way to uniquely identify each record in a table. The primary key of a table is either a simple key (one field) or a composite key (multiple fields). A primary key cannot be blank, null, or have duplicates. Clearly a better primary key is composed of fewer fields, seldom changes, and is preferably natural for the table (EG: A book's ISBN).
    • Foreign Key. A foreign key is used to relate one table to another, i.e. a foreign key in a table (aka the foreign or characteristic table) is tied to a look up key (usually a primary key) in another table (aka the look up table). A foreign key value is not necessarily unique in its table. Foreign keys should have the same properties as its primary key.
    • Look up Key. (See previous definition).
  • Non-Key. A field that is neither a primary or foreign key. If a field other than the primary key is frequently used to look up data, then that field may be indexed.

In a table the array of cells (aka domain) formed by the rows and columns hold the data of that describe the value for each attribute of each record in that table. A cell is an intersection of each row/record/instance with a column/field/attribute. EG:

In a database for a bookstore, the different subjects are entitled books, customers, and invoices. For the table called customers, the fields are name, address, and phone number. For the customers table, there are records are for Sue, George, and Amy. For the customers table, the identifier field is called ID. The George's record has a cell value of 773-588-2300 for the phone field.

Bookstore database
Books table Customers table Authors table

Customers table
  ID field Name field Phone field
Sue record 1 Sue 773-123-4567
George record 2 George 773-588-2300
Amy record 3 Amy 773-987-6543

Historically database theories were tied to the physical methods used to store the data—hence the terms files, fields, and records. These physical terms were transformed into conceptual terms—hence the terms tables, columns, and rows respectively. In E.F. Codd's breakthrough works on databases he used terms that freed the data from the physical structure—hence the terms relations, attributes, and tuples respectively. Note that this is the origin of "relational" in relational databases, as opposed to the dependency relations discussed next!

Table Relationships

The way data in one table is related to data in another table is described as a relationship (aka dependency). There are three basic kinds of relationships:

  • One-to-one (1:1). This relates one record in a table to one record in another table. This is not frequently used since the two tables might as well be one table. EG:

Employees can have zero or one spouse, so an employee table can have a one-to-one relationship with a spouse table.

  • One-to-many (1:*). This relates one record in a table to many records in another table. This is the most common relationship. EG:

An author may have many books, so an author table may can have a one-to-many relationship with a books table.

  • Many-to-many (*:* but really 1:*:1). This relates many records in a table to many records in another table. This is usually implemented with three tables connected as two one-to-many relationships. The middle table is aka the translation table or the intermediate table. EG:

An author may have many books, but a book can also have many authors. So an authors table and a books table can each have a one-to-many relationship with a third table, the author_books table. The three tables together form a many-to-many relationship.

Many MDBSs have a relationships view, where relationships can be made by dragging-and-dropping fields between tables.

If relationships are in place, then most MDBSs will automatically implement the same relationships when you make new queries. Also some MDBSs can enforce referential integrity. EG:

If two tables are related, then a change in the value of a look up key will also be changed in the foreign key.

Database Normalization

These are rules for designing database table so they approach a relational nom (standard). These rules make the tables more efficient, less redundant, and less prone to error and confusion. Normalization is usually a process of breaking up large tables into smaller tables with clear relationships.

Over normalization can be counter productive, especially if too joins are required to get useful information. Denormalization, usually by making redundant data, may be required to improve system performance.

The five normal forms were formulated by C. J. Date in his 1974 book An Introduction to Database Systems, although E.F. Codd of IBM was the first to present a relational model in 1969 with his 12 rules.

First Normal Form (1NF)

"A relation R is in the first normal form (1NF) if and only if all underlying domains contain atomic values only."

In other words, each cell in a table can only contain one value. Or some say that a row cannot have duplicate columns and a column cannot have more than 1 value. EG:

If a table has a field "CustomerID", then a record with multiple customers in the field "CustomerID" would violate this rule. The usual solution would be to create another table with records for the different customers.

Second Normal Form (2NF)

"A relation R is in second normal from (2NF) if and only if it is in 1NF and every non-key attribute is fully dependent on the primary key."

In other words, every non-key field in a table is entirely dependent on the entire primary key. EG:

If a table has a composite primary key of "CustomerID"and "ProductID", then a field like "Product Description"violates this rule because it is dependent on only part of the primary key. The usual solution is to move the offending field to another table.

Third Normal Form (3NF)

"A relation R is in third normal form (3NF) if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key."

In other words, every field in a table depends only on the primary key and does not depend on another non-key field. EG:

If an "Orders" table has a primary key of "OrderID"and non-key fields of "CustomerID", "ProductID", and "Address", then "Address" violates this rule because it is dependent on the non-key field of "CustomerID". The usual solution is to move the offending field to another table.

Fourth Normal Form (4NF)

"A relation R is in fourth normal form (4NF) if and only if, whenever there exists an MVD [MultiValued Dependence] in R, say A-->B, then all attributes of R are also functionally dependent on A (i.e., A-->X for all attributes X of R)."

In other words, if a table is the "many" table of a one-to-many relationship, then all of its fields are tied to that one-to-many relationship and it cannot be the "many" table to other unrelated one-to-many relationships and have non-key fields connected to that independent one-to-many relationship. EG:

If a customer has many addresses and many orders, then to put the "Customers" table in a one-to-many relationship with an "AddressesAndOrderses"violates this rule. The usual solution would be to create separate tables for the independent one-to-many relationships, i.e. create an "Addresses" table and an "Orders" table.

Fifth Normal Form (5NF)

"A relation R is in fifth normal form (5NF) -- also called projection-join normal form (PJ/NF) -- if and only if every dependency in R is implied by the candidate keys of R."

In other words, if a table has a field with values that are often repeated, then the values should be replaced with keys that can be related to a look up table. EG:

If a "Popsicle" table has a field "Color", with many instances of "cherry" and "aqua", then the table violates this rule. The solution would be to make the "Color" field a look up key, and replace the values of "cherry"and "aqua" with the key values of "red" and "blue". Those keys could then be related to a look up table called "Seasonal", where the foreign keys of "red" and "blue" can be translated back to "cherry" and "aqua".

This would enable all values of "cherry" to be replaced with "strawberry" by making a single change in the "Seasonal" table, instead of changing every instance of "cherry" in the "Popsicle" table.

Normalizing Rules of Thumb

There are rules of thumb and variations on the rules of normalization that accomplish roughly the same goals:

  • Try to have a table for each major entity/subject.
  • Try to use natural keys for primary keys whenever possible. EG: For a large books table, the ISBN (International Standard Book Number) might be a good primary key. The natural keys should have the following characteristics:
    • Short.
    • Non-case sensitive.
    • No spaces or punctuation.
  • Avoid using natural keys when it exposes sensitive information. EG: Social Security Numbers or Tax ID would in theory make good primary keys but it is preferable to store them as regular fields since the primary key is exposed more often.
  • When combining mixed databases, it might be best to create surrogate keys instead of using the original primary keys since the primary keys might be duplicated in a composite table.
  • Try to limit indexes to a single column if possible.

Common Data Structures

In addition to the basic table structures (one-to-one, one-to-many, and many-to-many), there are other common data structures that are repeated in different databases.

Tree

Tree (aka Parent-Child). A brilliant and simple way to store a tree of data in a single table is to use an ID for each row and also to have each row identify its parent row. EG:

Family Table
FamilyID ParentID Name
1 1 Grandfather
2 1 Son
3 1 Daughter
4 2 Grandson via Son
5 2 Grandson via Son
6 3 Granddaughter va Daughter

Shared

These are variations where tables may have a column that may reference different tables.

Normally there are three kinds of relationships. Each can be rewritten to show that it is a t1 table, i.e. a table where each column references one other table.

  • 1:1
    1. 1(t1):1(t1)
  • 1:*
    1. 1(t1):*(t1)
  • 1:*:1
    1. 1(t1):*(t1,t1):1(t1)

A tn table is a table that has a column that may reference different tables. This is usually done by a pair of columns like TableName and TableID, or a single column like TableName. tn tables enable other kinds of relationships especially since some of the relationships are not commutative. I am calling structures with tn tables "shared" structures.

  • 1:1
    1. 1(t1):1(t1)
    2. 1(t1):1(tn)
    3. 1(tn):1(t1)
    4. 1(tn):1(tn)
  • 1:*
    1. 1(t1):*(t1)
    2. 1(t1):*(tn)
    3. 1(tn):*(t1)
    4. 1(tn):*(tn)
  • *:1
    1. *(t1):1(t1)
    2. *(t1):1(tn)
    3. *(tn):1(t1)
    4. *(tn):1(tn)
  • 1:*:1
    1. 1(t1):*(t1,t1):1(t1)
    2. 1(t1):*(t1,t1):1(tn)
    3. 1(t1):*(t1,tn):1(t1)
    4. 1(t1):*(t1,tn):1(tn)
    5. 1(t1):*(tn,t1):1(t1)
    6. 1(t1):*(tn,t1):1(tn)
    7. 1(t1):*(tn,tn):1(t1)
    8. 1(t1):*(tn,tn):1(tn)
    9. 1(tn):*(tn,tn):1(tn)

I will not do an example of each kind of shared structure, but I will do a few.

  • 1(tn):*(t1). The Category table is the 1(tn) table, it is like a lookup table shared by multiple foreign tables but where each foreign table gets its own set of categories. It's convenient to have a single lookup table but pulling from it with SQL can get onerous. (Bonus trick: The NonNameOrder columns can be used to sort a list of items in the Car table first by Category.NonNameOrder and then by Car.NonNameOrder within the list matching Category.CatValue).
    Category Table
    CategoryID TableName Name CatValue NonNameOrder
    1 Car Type SUV 3
    2 Car Type Minivan 2
    3 Car Type Passenger 1
    4 Car Color Red 2
    5 Car Color Blue 1
    6 Boat Power Sail 1
    7 Boat Color Green 2
    8 Car Type Convertible 4
    Car Table
    CarID TypeCatID ColorCatID Name NonNameOrder
    1 1 4 SUV in Red 1
    2 2 4 Minivan in Red  
    3 3 4 Passenger in Red  
    4 1 5 SUV in Blue 2
    5 2 5 Minivan in Blue  
    6 3 5 Passenger in Blue  
    7 8 4 Convertible in Red  
  • *(tn):1(t1). The Phone table is the *(tn) table. This method allows other tables (EG: Employee and Customer tables) to be have their phone numbers changed (added, deleted, modified) without touching the other tables.
    Phone Table
    Columns EGs
    PhoneID 1
    Table Employee
    TableID 3
    Phone 773-588-2300
    Description Fax
  • 1(t1):*(tn,t1):1(t1). The Addressed table is the *(tn,t1) table. Like the previous example but since it's many-to-many the same address can be used by different entities and it only has to be changed in one place.
    Employee Table   Customer Table   Address Table
    EmployeeID   CustomerID   AddressID
    Name   Name   Address
    Addressed Table
    AddressedID TableName TableID AddressID
    1 Employee 1 1
    2 Employee 2 1
    3 Employee 3 3
    4 Customer 1 1
    5 Customer 2 1
    6 Customer 3 3
    7 Customer 4 7

Date Ranges

If you are given a date, you should be able to pull all rows that cover that date. Simply provide columns for DateEffective and DateRetired. If the DateEffective is NULL, then the row item has been effective since the beginning of time. If the DateRetired is NULL, then the row item is current. An item that has never been active should not be in the table.

The statement below gets all fee schedules that are current for a given date. Usually a variable would be used instead of a literal date value.

SELECT FS.Name
FROM FeeSchedule AS FS
WHERE (
    (FS.DateEffective IS NULL AND FS.DateRetired IS NULL) OR
    (FS.DateEffective IS NULL AND FS.DateRetired >= '10/1/01') OR
    (FS.DateEffective <= '10/1/01' AND FS.DateRetired IS NULL) OR
    (FS.DateEffective <= '10/1/01' AND FS.DateRetired >= '10/1/01')
)

Flexible Columns

This is a black box table for storing nearly any kind of simple information. However it might be simpler to store variable datatypes simply as a type of string since a string can store dates, integers, floats, etc. (I do this in Survey below).

Event Table   EventType Table
EventID   EventTypeID
EventTypeID   EventCatID
PertinentStaffID   EventName
EntryStaffID   NonNameOrder
EntryStamp   IsMenu
StartStamp   RelatedFile
EndStamp    
EventString   EventStringNote
EventFloat   EventFloatNote
EventStamp   EventStampNote
Note   Note

Survey

Respondents take surveys. A survey has sections with questions. A question has one or more choices. Some questions allow 0, 1, n, or * answers. A respondent can take a survey 0, 1, n, or * times but it would be better to just give them a new survey, or if they re-take it, then just take the latest answer.

Respondent  1:*  Take     *:1   Survey   1:*   Section
                  *:1                            1:*
                 Answer   *:1   Choice   *:1   Question
create table Answer(
  AnswerID int identity primary key,
  ChoiceID int,
  TakeID int,
  AnsDatatype varchar(3),
  AnsActualValue Txt varchar(512),
  AnsDtm datetime)
create table Choice(
  ChoiceID int primary key,
  QuestionID int,
  DescriptionHTML varchar(256),
  SortKey varchar(4),
  ChoiceDatatype varchar(3),
  ChoiceFixedValue varchar(512))
create table Question(
  QuestionID int primary key,
  SurveyID int,
  NameHTML varchar(256),
  DescriptionHTML varchar(1024),
  SortKey varchar(4),
  AnswersAllowed varchar(2)
  --0, 1, n, or * answers)
create table Respondent(
  RespondentID varchar(64) primary key,
  Name varchar(256),
  NameFirst varchar(128),
  NameLast varchar(128),
  Note varchar(512))
create table Section(
  SectionID int identity primary key,
  SeurveyID,
  NameHTML varchar(256),
  DescriptionHTML varchar(1024),
  SortKey varchar(4))
create table Survey(
  SurveyID int identity primary key,
  NameHTML varchar(256),
  DescriptionHTML varchar(1024)
  DateActivated datetime,
  TakesAllowed varchar(2),
  Note varchar(512))
create table Take(
  TakeID int identity primary key,
  RespondentID varchar(64),
  SurveyID int,
  DateTaken datetime,
  TakenByMeans varchar(64), --web, postal, etc
  EnteredByWhom(256), --usually the Respondent
  RespondentComment varchar(1024),
  EnteredByComment varchar(1024))

Miscellany

  • A database diagramis a graphical representation of the schema (or part of the schema) and relationships between tables.
  • When setting the fields in a table:
    • Enter key fields first.
    • Keep most used fields closer to the top.
    • Keep related fields together.
    • Order the fields in a consistent way throughout the database.
    • Name the foreign key the same as its look up key.
  • It is vital that a database system empower an application without hindering it.
  • A DBMS is often responsible for handling transactions that must pass the "ACID" test.
    • Atomic It must be a complete entity, all-or-nothing, leaving no partial results.
    • Consistent It leaves the DB in a consistent state, i.e. no business rules are forgotten.
    • Isolated It must run regardless of what else is running. Often this means locking certain resources so there is no outside interference.
    • Durable Once complete, its affect remains in storage memory so it would endure even if something like a system crash occurs.
  • Data may be grouped depending on how quickly and how the data changes. Note that this section assumes that the structure of the data does not change. If it does then it's a whole different ball game and is particularly tricky for multidimensional data.
    • SCD (Slowly Changing Data). SCD has intermittent parts of a data set changing intermittently. This is like a contact list. People change addresses or phone numbers occasionally.
      • Data retention. There are three basic ways to deal with data retention:
        • Add a new row and mark both the new and old rows. EG: A new row might be marked as "active" or "business",  while the old row is marked "inactive" or "personal".
        • If you won't need the old information, then simply write over the old info.
        • Make separate columns for data of different ages. EG: Keep the last four numbers by using four columns like "phone1", "phone2", "phone3", and "phone4". This method is not recommended.
      • Data replication. There are three basic ways to keep the SCD correctly replicated:
        • UNION the source data with the target store and save the resulting set into the target store, since this will eliminate the duplicates.
        • Make sure each row in the source data has a date-time stamp field, so the target store can check for rows that have changed since the last replication (in addition to new and deleted rows).
        • Update the target data store as each change occurs in the SCD.
        • Copy over the whole data set periodically.
    • LCD (Log Changing Data). LCD just adds rows to a data set. This is like a log of events. LCD usually have no problems with data retention or data replication.
  • Storing names. If you have a table for contacts, people, etc., then you must store the person's name. Usually FirstName and LastName suffice. However, it may be helpful to have several fields for storing the name: NameFull (EG: John M. Smith, Jr.), NameFirst (EG: John), NameMiddle (EG: M.), NameLast (EG: Smith), NamePrefix (EG: Mr.), NameSuffix (EG: Jr.). There would of course be other fields like Login, Password, JobTitle, Gender, etc. This way you can programmatically produce things like:
    • John
    • Smith
    • John Smith
    • John M. Smith
    • John M. Smith, Jr.
    • Mr. John Smith, Jr.
    • Smith, John
    • Smith, John M.
    • Smith, John M., Jr.
    • Acme. Sometimes an organization or weird entity is entered instead of a person, so "Acme" could be entered in NameFull or LastName.


GeorgeHernandez.comSome rights reserved