There is also a study of some well-known codes. Brian Hayes calculates
how many unique codes can exist, and how many are already in use.
This was published in the
"American Scientist" magazine, Vol 93, Jan-Feb 2005. It covers the
following coding schemes:
Story -- Changing Student IDs
Once upon a time, it took this campus a year to change all its records from
using the Social Security Number (SSN) to a campus-assigned number (SID). Every record
in a dozen different databases had to be changed. Why did
CSUSB spend so much time and money changing one
element? -- Because (1) it is illegal
to use the SSN for non-SSN type purposes. (2) We also wanted
to avoid identity theft. (3) We are required to anonymize grades.
Data Options
I noted the subtle but effective security. I also noted that some cards were red or green. I asked about this and it turned out that the red cards indicated threats to social workers... So in this case one data item was the color of the media.
By the way... I recommended not computerizing this process.
Types of code
Clever choices can improve system qualities: security, reliability, time to program, ...
Special Encodings
Decimal
The Western World learned the decimal system from the Arabs. We encode the digits as 0,1,2,3,4,5,6,7,8,and 9. Then we can represent any whole number by
stringing them together:
To be more formal we could define:
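One possible formal reading, offered as a sketch rather than the definition these notes originally gave: a string of digits stands for the sum of each digit times a power of the base, with the rightmost digit multiplied by the base to the power zero. The function name digits_to_int is made up for this illustration.

```python
def digits_to_int(digits: str, base: int = 10) -> int:
    """Return the whole number that a string of digits stands for.

    Works left to right: each new digit multiplies the running value
    by the base and then adds the digit (Horner's rule).
    """
    value = 0
    for ch in digits:
        d = int(ch, 36)  # '0'-'9' -> 0-9, letters -> 10-35
        if d >= base:
            raise ValueError(f"digit {ch!r} is not valid in base {base}")
        value = value * base + d
    return value


print(digits_to_int("987"))        # 9*100 + 8*10 + 7
print(digits_to_int("101010", 2))  # the same rule with only two digits
```

The same rule works for any base, which is why the base is a parameter here.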
Having only two digits fits well with electrical and electronic circuits which tend to be either "on" or "off". In about 1945 Shannon defined
This has resulted in many computer people being able to recite the powers of two up to the highest address on their favorite machine.
Octal and Hex
However, calling out or typing 20 binary digits is inefficient.
Using decimal notation makes it hard to know what pattern of bits
is needed.
So older computer people tend to use base 8 (octal) and
newer ones
base 16 (hexadecimal) notation.
Nibbles are written and spoken using the hexadecimal digits, 0 (=0000), 1 (=0001), 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F (=1111). So, for example, "2A" in hex means "00101010" in binary.
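The nibble-by-nibble translation can be sketched in a few lines. The helper names hex_to_bits and bits_to_hex are invented for this example; they only use Python's built-in int and format functions.

```python
def hex_to_bits(hex_digits: str) -> str:
    """Expand each hexadecimal digit into its 4-bit nibble."""
    return "".join(format(int(h, 16), "04b") for h in hex_digits)


def bits_to_hex(bits: str) -> str:
    """Group bits into nibbles (padding on the left) and name each in hex."""
    width = -(-len(bits) // 4) * 4          # round length up to a multiple of 4
    padded = bits.zfill(width)
    return "".join(format(int(padded[i:i + 4], 2), "X")
                   for i in range(0, width, 4))


print(hex_to_bits("2A"))        # 00101010
print(bits_to_hex("00101010"))  # 2A
```

Because each hex digit maps to exactly four bits, no arithmetic on the whole number is needed, which is what makes hex convenient to read and speak.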
Binary Coded Decimal
In commercial systems it was common to find numbers encoded using
Binary Coded Decimal (BCD). Here a number like 987 was encoded by three decimal digits each represented as a nibble in binary: 1001 1000 0111.
This wastes some bits but is very convenient for important things like dollars and cents. Again it fits well with hexadecimal notation.
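A minimal sketch of the idea, with invented helper names to_bcd and from_bcd, and handling only non-negative whole numbers (real commercial formats also carry a sign and a decimal scale, which are left out here):

```python
def to_bcd(number: int) -> str:
    """Encode a non-negative integer with one 4-bit nibble per decimal digit."""
    if number < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return " ".join(format(int(d), "04b") for d in str(number))


def from_bcd(nibbles: str) -> int:
    """Decode space-separated nibbles back into the integer."""
    return int("".join(str(int(n, 2)) for n in nibbles.split()))


encoded = to_bcd(987)
print(encoded)            # 1001 1000 0111
print(from_bcd(encoded))  # 987

# Read as hexadecimal, the BCD bit pattern spells out the decimal digits.
print(format(int(encoded.replace(" ", ""), 2), "X"))  # 987
```

The six nibble patterns 1010 to 1111 are never used, which is the waste mentioned above.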
Signed Integers
In scientific computations integers are encoded using binary typically
with 8, 16, 32, ... bits. One of these bits (the sign bit) indicates whether the number is
negative or positive.
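These notes do not say which signed encoding is meant. The sketch below assumes two's complement, the scheme most current hardware uses; the function names are invented for the example.

```python
def to_twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's complement bit pattern of value in the given width."""
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    # Masking maps negative values onto the upper half of the unsigned range.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern: str) -> int:
    """Decode a two's complement bit pattern; the leading bit is the sign."""
    raw = int(pattern, 2)
    if pattern[0] == "1":
        raw -= 1 << len(pattern)
    return raw


print(to_twos_complement(5))             # 00000101
print(to_twos_complement(-5))            # 11111011
print(from_twos_complement("11111011"))  # -5
```

In this scheme the leading bit is 1 exactly when the number is negative, so it still plays the role of a sign bit.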
Real Numbers
Real numbers (measurements) are encoded
using "floating point" where a number has two parts called the
mantissa and the exponent, both encoded in binary. The value is then
the mantissa multiplied by two raised to the power of the exponent.
Floating point works well when we need a wide range of values and can put up with larger errors on the larger numbers.
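Python can show both parts of a float. This sketch assumes Python floats are IEEE 754 double precision numbers, which is true on essentially every current platform. Note that math.frexp scales the mantissa into the range 0.5 to 1, which differs from the normalization used inside the hardware format.

```python
import math


def split_float(x: float):
    """Return (mantissa, exponent) such that x == mantissa * 2**exponent."""
    return math.frexp(x)


mantissa, exponent = split_float(6.0)
print(mantissa, exponent)        # 0.75 3
print(mantissa * 2 ** exponent)  # 6.0

# Larger numbers carry larger absolute errors: near 1e16 adjacent
# doubles are 2 apart, so adding 1 changes nothing.
print(1e16 + 1.0 == 1e16)        # True
print(1.0 + 0.001 == 1.0)        # False
```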
Again in commerce and finance we need precision and speed rather than range. So a Fixed Point notation was preferred. Here you use BCD and the machine scales the number by dividing by a fixed power of ten.
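The scaling idea can be sketched with plain integers. This is an illustration only: it uses Python integers rather than BCD, and the scale of 100 (two decimal places, as for dollars and cents) and the helper names are assumptions made for the example.

```python
from decimal import Decimal

SCALE = 100  # assumed: two decimal places, so amounts are held as whole cents


def to_fixed(text: str) -> int:
    """Parse a decimal string such as '19.99' into a scaled integer."""
    scaled = Decimal(text) * SCALE
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{text!r} has more than two decimal places")
    return int(scaled)


def fixed_to_text(amount: int) -> str:
    """Divide by the fixed power of ten only when displaying the value."""
    sign = "-" if amount < 0 else ""
    whole, cents = divmod(abs(amount), SCALE)
    return f"{sign}{whole}.{cents:02d}"


total = to_fixed("0.10") + to_fixed("0.20")
print(fixed_to_text(total))  # 0.30
print(0.1 + 0.2 == 0.3)      # False: binary floating point cannot hold 0.1 exactly
```

All arithmetic happens on exact integers, which is why sums of money come out right.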
In the 1960s the American standards people ( ANSI ) proposed what has become the standard 7-bit coding for characters, usually stored one character per 8-bit byte -- ASCII
ASCII covers the characters needed for American use, but has become the de facto standard on the Internet, and wherever data needs to be shared. The International Organization for Standardization (ISO) treats ASCII as a specialized code for use in America. In the UK variant, the American "#" becomes the symbol for the British pound. Each European country has its own special symbols.
IBM tried to create its own standard -- the Extended Binary Coded Decimal Interchange Code, named EBCDIC. This will disappear with the last mainframe.
Recently, a new standard -- Unicode -- has been created that covers just about every character in every alphabet in the world. It began as a 16-bit code but has since grown beyond 16 bits, and is usually stored using encodings such as UTF-8 or UTF-16. ASCII and the ISO codes appear within it.
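A short sketch of how these codes relate, using only Python built-ins. The function name describe is invented; UTF-8 is chosen here as one common way of storing Unicode, not the only one.

```python
def describe(ch: str):
    """Return (code point, UTF-8 bytes, fits in ASCII?) for one character."""
    code_point = ord(ch)
    return code_point, ch.encode("utf-8"), code_point < 128


for ch in ("A", "#", "£", "Σ"):
    code_point, utf8, is_ascii = describe(ch)
    print(f"{ch}: U+{code_point:04X} ({code_point}), "
          f"UTF-8 bytes {utf8.hex()}, ASCII: {is_ascii}")
```

Characters in the ASCII range keep their ASCII number and take one byte in UTF-8; others, like the pound sign or Σ, need more.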
The Web uses HTML and HTML has introduced a number of special "entities" for showing non-ASCII characters like Σ and α. These are given numbers and names and are encoded in HTML like this:
&#number;
or
&name;
For example the symbols "<" and ">" are encoded as "&lt;" and "&gt;". The double quote symbol is encoded as "&quot;".
There is a link to more on the HTML below.
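Python's standard html module can encode and decode entities, which gives a quick way to experiment. The sample strings below are made up for the illustration.

```python
import html

# Encoding: characters with a special meaning in HTML become entities.
encoded = html.escape('<p class="note">5 > 3</p>')
print(encoded)  # &lt;p class=&quot;note&quot;&gt;5 &gt; 3&lt;/p&gt;

# Decoding: named and numbered entities turn back into characters.
decoded = html.unescape("&Sigma; &#931; &alpha; &#945; &lt;")
print(decoded)  # Σ Σ α α <
```

The named form and the numbered form decode to the same character, as the two Σ and two α in the output show.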
Derived Codes
Mixes different coded data into one element
Example: My UK Driving license number, CSUSB Library call numbers, Subscriber
codes for magazines, Rooms on campus.
Ciphers
Example: Spoof at Imperial Chemical Industries was a number added to the paint sales. A lot of good work has been done since then -- look up DES, PGP, etc. on Wikipedia if you
need more detail.
Numbers are disguised for security or mnemonic purposes
Passwords should be encrypted as soon as they are entered and never stored
without salting and hashing!
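A minimal sketch of salting and hashing with Python's standard library. Strictly speaking this is one-way hashing rather than encryption, since the stored value cannot be turned back into the password. The choice of PBKDF2 with SHA-256, the iteration count, and the function names are assumptions made for the example, not a recommendation from these notes.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple

ITERATIONS = 200_000  # illustrative work factor; tune for your own hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest). A fresh random salt is made if none is given."""
    if salt is None:
        salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, attempt = hash_password(password, salt)
    return hmac.compare_digest(attempt, digest)


salt, digest = hash_password("correct horse")
print(verify_password("correct horse", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))    # False
```

Only the salt and the digest are stored. Because each user gets a different salt, two users with the same password end up with different stored values.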
Actions
Example: A=Add, D=Delete, ..., The 50+ actions that the 'vi'
editor has built in to it, Mnemonic codes in assembler.
Codes that represent actions. Transaction codes -- for example with a
banking application we might find deposits (coded D) and withdrawals (coded W).
Self-checking Elements
Uses an added digit or character calculated from the rest.
Example: 9s remainder and 11s remainder check digits are added to a decimal number.
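Two sketches of check digits follow. The first assumes the simplest reading of a 9s remainder check: append the remainder after dividing by 9. The second uses the weighted remainder-11 scheme found in ISBN-10 book numbers, picked here as one well-known example; these notes do not say which 11s scheme was meant. All function names are invented.

```python
def add_nines_check(number: str) -> str:
    """Append the remainder of the number divided by 9."""
    return number + str(int(number) % 9)


def nines_ok(coded: str) -> bool:
    return int(coded[:-1]) % 9 == int(coded[-1])


def add_elevens_check(number: str) -> str:
    """Append a check character so the weighted sum is a multiple of 11."""
    weights = range(len(number) + 1, 1, -1)  # ..., 4, 3, 2
    total = sum(w * int(d) for w, d in zip(weights, number))
    check = (11 - total % 11) % 11
    return number + ("X" if check == 10 else str(check))


def elevens_ok(coded: str) -> bool:
    values = [10 if c == "X" else int(c) for c in coded]
    weights = range(len(coded), 0, -1)       # ..., 3, 2, 1
    return sum(w * v for w, v in zip(weights, values)) % 11 == 0


print(add_nines_check("987"))           # 9876
print(nines_ok("9776"))                 # False: a mistyped digit is caught
print(nines_ok("8976"))                 # True: swapped digits slip through
print(add_elevens_check("030640615"))   # 0306406152
print(elevens_ok("0304606152"))         # False: the swap is caught
```

The weights are what let the remainder-11 scheme notice two adjacent digits being swapped, which the plain 9s remainder cannot do.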
There are three classic ways of encoding compound data plus the later markup notations:
<name><first>Richard</first><initial>J</initial><family>Botting</family></name>
is a piece of text with added "tags" that indicate the meaning of the parts. In a [ Record Structure ] (above) the "tags" are not needed because their sequence is known and the lengths are fixed (or at least predictable). With tags we get an encoding that is guaranteed not to be ambiguous, is easy to read (kind of), but is somewhat inefficient.
SGML -- The Standard Generalized Markup Language
SGML grew out of IBM's GML and became an ISO standard. It lets you create and define tags for any purpose. It
is an amazingly difficult encoding to use. You can find the details on the web
if you absolutely have to.
HTML -- The HyperText Markup Language
HTML is a specific application of SGML defined to describe and link
pages on the World Wide Web. It has gone through half-a-dozen versions.
A new one stays within the XML (next) conventions.
XML -- the eXtensible Markup Language
HTML describes the appearance of pages. XML tries to describe the
meaning of data not its appearance. It has a very simple basis using
<tags> </end tags>
to delimit data. Tags can also have attributes:
<certificate type="participation">Unix Training</certificate>.
XML also allows some tags to be unpaired and these are shown like this:
<endless tag attributes... />
XML documents can be parsed fairly easily.
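As a small illustration of how little code parsing takes, here is a sketch using Python's standard xml.etree.ElementTree module on two of the examples from these notes. The helper name is_well_formed is invented, and it only checks that tags are properly paired and nested; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

name = ET.fromstring(
    "<name><first>Richard</first><initial>J</initial>"
    "<family>Botting</family></name>"
)
fields = {child.tag: child.text for child in name}
print(fields)  # {'first': 'Richard', 'initial': 'J', 'family': 'Botting'}

cert = ET.fromstring(
    '<certificate type="participation">Unix Training</certificate>'
)
print(cert.get("type"), "/", cert.text)  # participation / Unix Training


def is_well_formed(text: str) -> bool:
    """Return True if every tag in the text is properly closed and nested."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<a><b>text</b></a>"))  # True
print(is_well_formed("<a><b>text</a>"))      # False
```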
Each application that uses XML must have a published DTD (Document Type Definition) that defines the structure of the data -- what tags can appear inside others. Defining a DTD takes a significant amount of work. But once defined you can use tools to check validity, ...
. . . . . . . . . ( end of section Markup Languages)
Complex Syntax
Complex syntax gets us into natural and artificial languages. It is
rare that we need to express natural data groupings using complex
syntax. When we do we can use an extension of the syntactic meta-languages like
Backus-Naur Form (BNF).
In computer science most of our knowledge about linguistic design has been put into designing programming languages. Programming languages are the most complicated schemes for encoding a domain in existence. There are hundreds of them. For more take a CSCI Programming Language class like our CSCI320 [ ../cs320/ ] (Advert).
. . . . . . . . . ( end of section Encoding Compound data)
Experience -- coding data in the ICI Infra-Red Spectrum Analysis Program.
. . . . . . . . . ( end of section Special Encodings)
Guidelines for encoding data
Reference and Online Resources
Universal Product Code
I don't expect you to understand UPC for this course but if you are interested in these ubiquitous bar-codes see [ Universal_Product_Code ] for the details and history.
Samples of Syntax Definitions
My samples [ http://www.csci.csusb.edu/dick/samples/ ] define a large number of sophisticated coding schemes including programming languages and meta-languages.
ASCII
For reference purposes see [ comp.text.ASCII.html ]
Markup Languages
For reference purposes see [ Mark Up Languages in index ] (my notes).
HTML
[ ../samples/comp.html.syntax.html ]
XML
[ ../samples/xml.html ]
XML reference
. . . . . . . . . ( end of section XML)
<section><department>CSCI</department><cnumber>201</cnumber><sectionnumber>02</sectionnumber></section>
Also see [ glossary.html ] for more special abbreviations and phrases.