Story -- There is nothing so practical as a good theory
Back in the 1970's I taught at a university which sent all it's students out, every summer,
to do real work. Faculty would visit the students and their enterprises to see if the student
was not to unhappy and the enterprise liked the work the student was doing. One effect
was that the faculty had regular updates on the technology used in industry. I had several
students working for a well known international oil company. I had several surprises
when talking with there bosses. In one case they shredded our degree for putting in too
much theory too early. In consequence we redesign the university's curriculum to
do the theory late in the degree. But the second surprise was about
relational data bases which I had written off as an academic theory of no practical
worth. The company was replacing all its files, "CODASYL", and proprietary data base
management systems by relational ones.
Data Models in Projects
In any real project you should make both Conceptual Models
[ d3.html ]
and
Normalized Models. Both should be expressed using data dictionaries and ERDs.
As a general rule,
the conceptual model gives you a simpler picture of the data needed in a
project. Normalization provides a complete and detailed design for a
data base. The two techniques are complementary.
You can do informal and mental normalization as you do
conceptual modeling. Your conceptual models will provide names
for tables that you later derive during normalization. Occasionally you
will uncover a gigantic mismatch between what people think
(their conceptual model) and what the data actually is. In this case
you have an interesting political problem to solve.
Programming such a mismatch will probably make the situation worse.
In this page I will review the words and processes we use, give some hints, give rules for mapping a relation model to ERDs, and describe a simplified UML profile for modeling data.
Pay attention to identifiers and how they are assigned. A system can work or fail depending on how it handles them. For example, a university library (a long way from here) saved thousands of dollars by assigning identifiers to books that were meaningless serial numbers, and letting the computer store the relation between this identifier and the traditional Library of Congress call number.
In computerized system, identifiers are mirrored by keys. All data is stored as a set of records of the same type in a file or table. Each record is a set of data fields -- attributes of some real entity. In these records, there should be special field called a prime key. A prime key uniquely identifies one record in the table (= file). The value of the prime key is unique in the table. The rest of attributes in the record all depend on that key. Keys act as surrogates for records. We can then store the value of the key in another record were it links back to the record. Here it is a foreign key because it identifies a record in a foreign file/table. So a foreign key is an field in a table/file that is a [ prime key ] in a different table/file.
Candidate Key
A candidate key for a table can be any set of fields in the table that
might be used to uniquely identify the rows in the table.
Prime Key
A prime key is a set of fields in a table that uniquely identifies a row in the table.
Notice that several fields may be needed to uniquely identify a row. They all
form a compound key.
Foreign Key
A foreign key is a field in a table that is a prime key in a different
table.
A foreign key defines a natural many-to-one relationship between one table
and another. There is one row identified by the prime key. But
the same value can appear in many records as a foreign key.
In a diagram you should connect the box for the table with the foreign key
to the box that has the prime key. Each object with the foreign key identifies one
prime key entity.
By the way, sometimes a foreign key actually identifies a different entity/row in the same table. For example, in a medical database, a table of People may have a prime key of an SSN, but a foreign key may identifier the person to be contacted in the case of an emergency. As another example I have a bank account with a numerical prime key, but it also designates another account for when I die. Treat this a just another foreign key.
Object Identifiers
You will also hear about OIDs:
Secondary Keys
You will also find Secondary Keys in many data bases.
A
secondary key
identifies a set of rows in a table. A database system is
told that about these so that it can attempt to optimize performance by
creating indexes. You may find described as views. Secondary keys should, normally
be ignored when normalizing data.
Notation for keys
In word processed texts and handwritten documents underline keys.
In the UML keys are attributes. Prime Keys are marked <<PK>> and foreign keys <<FK>>.
Informally, when using ASCII to normalize some data I mark all keys with an asterisk in front. Foreign keys have one, and prime keys have two.
Here is an example of the data we would associate with a Thesis in a data base about graduate students:
Thesis(**number, author, date, title, *supervisor,...)
. . . . . . . . . ( end of section Keys) <<Contents | End>>
Extracting data structure from given data -- Normalization
( **Identifier, data, (repeating group ) , *foreign_key )
Looking further we notice that the price seems to depend on the product type in some of our samples, but not in others. Sometimes a "WI007B" costs $5.00 and sometimes $4.95. So we say that "Price does not depend on the Product Type".
If we go and interview the clerk who handles Customer Orders. We discover that different customers get different discounts, and that the price varies with time. So you can say "The price depends on the time, the customer, and the product type".
You'll find that experts tend to use the phrase "Functional Dependency".
Here is an explanation why.
Net
then they say that "sin is a function". They mean that each real number x maps to a real number called sin(x). Similarly, in C/Java/C++ when we declare
double distance(Point x, Point y);we are saying that the distances depends on two Points.
The rule is that if x R y is a function then
So, the relation between Product Number and Product Description (above) is a function!
Hence this is a functional dependency.
. . . . . . . . . ( end of section Functional Dependencies) <<Contents | End>>
Normalization Process -- The key, the whole key, and nothing but the key
Memorize the mnemonic, and practice the following steps. Normalization
has numbered levels: Zero normalized form, 1st normal form, 2nd normal form, 3rd normal form,
... We abbreviate these: 0NF, 1NF, 2NF, 3NF, ...
Copy/paste each successive stage into the next (DEMO in class if time).
Show keys by typing an asterisk "*" in front. Use "**" for Prime Keys and "*" for Foreign ones. Show repeated groups in parentheses:
Name( **prime key, *foreign key, attribute, (repeating group ) )
First the sample of data:
W-Z0137 - SECURITY - YOU MAY NOT UPDATE ON THIS SCREEN
105 Course Section Scan
S
Screen: ___ SID: _________ Course: CSCI Term: 071 G S
tS SEATS INSTRUCTORS/ PRIM R T CALL
SECeS GT AVAIL DAYS TIMES SUBTITLE MAX ENR UNITS ACT P A NO.
CSCI-121 - COMPTR TECH&PEOPLE
01 76 T 1200-0150PM WATSON P 80 4 2.00 LEC O 10649
CSCI-122 - BIOINFORMATICS
01 113 M 0200-0350PM DEVLIN K 120 7 2.00 LEC O 10650
02 118 T 1200-0150PM DEVLIN K 120 2 2.00 LEC O 10652
CSCI-123 - USING COMPTR SOFTWARE
01 30 M 0900-0950AM PERLEPES T 30 2.00 LEC A O 10653
02 NG 30 M 1000-1150AM PERLEPES T 30 0.00 ACT A O 10654
Activity
CSCI-124 - EXPLORING INFO SUPERHIGHWAY
01 59 R 0200-0250PM WATSON P 60 1 2.00 LEC A O 10655
02 NG 29 R 0300-0450PM WATSON P 30 1 0.00 ACT A O 10656
Activity
03 NG 30 R 1200-0150PM WATSON P 30 0.00 ACT A O 10657
Activity
....
CSCI-365 - SYSTEMS ADMINISTRATION
01 30 TR 0600-0715PM HAN K 30 4.00 LEC O 10681
02 NG 30 T 0730-0920PM HAN K 30 0.00 ACT O 10682
Activity
CSCI-375 - REQUIREMENTS ANALYSIS AND DESIGN
01 30 TR 0400-0600PM BOTTING R 30 4.00 LEC O 10683
...
In this example there are two (nested) groups -- data determined by the class and those depending on the section.
In this example, we notice that the course determines it's title. And extract Course from Offering (inveneted name).
In this case the "mode" and the "call#" determine other data. This gives us two new tables. In the Section table we have: the dept, term, course, and section depending on the call# and so these prime keys are demoted to foreign keys. Be careful to scan other records for data items that are now foreign keys (mode in Section).
More: when I asked the head of department about how the maximum enrolment was chosen he explained it was a property of the room.
. . . . . . . . . ( end of section Examples of Normalizing Sample data) <<Contents | End>>
Hints
. . . . . . . . . ( end of section Mapping 3NF back to ERD) <<Contents | End>>
Stereotypes restrict the meaning of things in UML diagrams. They look like this
<<table>>These are often associated with a particular graphic notation.
Tags provide extra information. They take the form: {tag_name=tag_value}.
Constraints
describe additional rules that should hold in a model. They
are statements that must be true. There is a specialized language
(the Object Constraint Language or OCL)
for writing them if needed.
UML Data Modeling Profiles
| stereotype | a special kind of | meaning |
|---|---|---|
| conceptual model | class diagram | picture of the world outside the computer in terms of entity types and relationships between entities. |
| logical data base | class diagram | A picture of an ideal data base independent of a particular technology or technological requirements. |
| physical data base | class diagram | A detailed solution, for a particular technology, to a set of requirements. |
| table | class | database table |
| PK | attribute | Prime key |
| FK | attribute | Foreign key |
| view | class | view of database(for a function) |
| index | class | index into a database(for a function) |
| Constraints |
|---|
| In conceptual models do not show operations, keys, tables, views, indexes, composition, aggregation, etc. |
| In logical and physical data bases unmarked classes are tables. |
| Use <<table>> only if you have a diagram that has both persistent (data base) classes and run time objects in the same diagram. |
. . . . . . . . . ( end of section UML Data Modeling Profiles) <<Contents | End>>
. . . . . . . . . ( end of section ISSA Flyer) <<Contents | End>>
Info Service Schedule
. . . . . . . . . ( end of section Info Service Schedule) <<Contents | End>>
. . . . . . . . . ( end of section Online Resources and Examples) <<Contents | End>>
Review Questions and Exercises on Normalized Data
. . . . . . . . . ( end of section Template) <<Contents | End>>
. . . . . . . . . ( end of section Detailed Data Design) <<Contents | End>>
Abbreviations
Also see
[ glossary.html ]
for more special abbreviations and phrases.
End