.Open Syntax of the HyperText Markup Language - HTML
. Introduction
HTML is the Markup Language used to describe pages on the World-Wide-Web. See
.See http://www/dick/www.html
for tutorials and background information.
Also see the
"Bare Bones Guide to HTML"
.See http://werbach.com/barebones/
HTML is designed to
describe the logical structure of a large number of interlinked pages. It
is a special document type described using the rules of $SGML
(Standard Generalized Markup Language). It has been updated several times.
The latest version (January 2000) is called XHTML:
.See http://www/dick/samples/xhtml.html
The W3 Consortium support the Web and provide
.See http://w3schools.com/
as a family of tools for learning the technology.
This page defines the syntax of a useful subset of HTML, see
.See Basic Ideas
For information on SGML see
.Box
SGML::=`Standard Generalized Markup Language`,
.See http://www.sil.org/sgml/sgml.html
.See http://www.csci.csusb.edu/dick/samples/comp.text.SGML.html
.Close.Box
For the official definitions of HTML see
.Box
The official defining documents for HTML2.0:
.See http://hopf.math.nwu.edu/html2.0/html-spec_toc.html
For a more up-to-date and complex definition in $SGML see
.See http://www.ucc.ie/dick/doc/www/html/dtds/htmlpro.html
.Close.Box
. Basic Ideas
There are three ideas that make up HTML
.Box
$SGML tags: ....
.See Documents
$SGML Elements for characters -- $entity
.See Lexicon
(URLs): how://where/what...
.See Universal Resource Locators
(CGIs): Writing programs that produce pages when run
.See Common Gateway Interface
.Close.Box
For help with the Three Letter Acronyms(TLAs) used in talking
about HTML see
glossary::= http://www/dick/samples/comp.html.glossary.html
. Metalanguage
For all X, O(X) ::=`optional X`
For all X, #(X) ::=`zero or more X`
.Open Lexicon
HTML_control_char::=( $lt | $gt | $semicolon | $ampersand | $quote ).
normal_character::= char ~ $HTML_control_char.
char::=http://www/dick/samples/comp.text.ASCII.html#char
ampersand::="&".
semicolon::=";".
lt::="<".
gt::=">".
quote::=http://www/dick/samples/comp.text.ASCII.html#quotes
-- `double quotes character of ASCII"`.
SGML allows special symbols to be indicated in a form known as an entity. For
example in HTML the less_than character has a special use and so real
less than signs are encoded like this:
.As_is &lt;
The $ampersand and the $semicolon bracket SGML & HTML entities:
entity::= $ampersand identifier $semicolon | $ampersand number $semicolon.
An $entity allows a symbol to be described by an identifier
rather than as itself. This has two purposes. First, it
allows symbols used in HTML to appear in the rendered document.
For example "&quot;" is written in HTML where you want a double
quotation mark to appear. The second use of an entity is to express
in ASCII characters that are not ASCII symbols. There are
a small number of predefined HTML entities. See the ISO Latin 1
character set:
.See http://hopf.math.nwu.edu/html2.0/html-spec_9.html#SEC101
and the HTML Coded Character Set:
.See http://hopf.math.nwu.edu/html2.0/html-spec_13.html#SEC106
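The two uses of entities described above can be tried out with Python's standard `html` module. This is only a sketch: the function names `encode_entities` and `decode_entities` are invented for illustration, and the module follows current HTML rules rather than the HTML 2.0 documents cited here.

```python
import html

def encode_entities(text):
    """Replace the HTML control characters (&, <, >, quotes) with entities."""
    return html.escape(text, quote=True)

def decode_entities(text):
    """Turn named and numeric entities back into the characters they stand for."""
    return html.unescape(text)

if __name__ == "__main__":
    raw = 'a < b & "c"'
    encoded = encode_entities(raw)
    print(encoded)
    # Decoding reverses the encoding.
    print(decode_entities(encoded) == raw)
    # An identifier entity, a numeric entity, and a non-ASCII Latin 1 entity.
    print(decode_entities("&lt; &#60; &eacute;"))
```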
The structure of an SGML/HTML document is described by inserting
tags into the raw text - this is known as "Marking Up the text".
The general syntax of a $tag is:
tag::=$lt $tag_identifier #$attribute $gt | $lt "/" $tag_identifier $gt.
So, tags
take two forms - those that indicate the `start` of something, and
those that indicate the `end` of something. Here is a typical pair:
.As_is <strong>
.As_is </strong>
that indicate the start and end (respectively) of a piece of text that
needs strong emphasis.
(start):
For X:tag_identifier, start(X) ::= $lt X #attributes(X) $gt.
Each type of tag has its own attributes.... and the set of attributes
for a given tag has varied with the edition of HTML and
the browser. However they all have the same syntax:
For X:tag_identifier, attributes(X) ::$attribute.
attribute::= $attribute_identifier $O("=" $attribute_value).
tag_identifier::@$identifier.
attribute_identifier::@$identifier.
(end):
For X:tag_identifier, end(X) ::= $lt "/" X $gt.
identifier::= letter #(letter|digit), -- I think?
.Hole
Upper and lower case are ignored in tag and attribute identifiers but not
in attribute values.
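The case rule can be observed with Python's standard `html.parser` module, which reports tag and attribute identifiers in lower case but leaves attribute values untouched. A minimal sketch; the class `TagLogger`, the helper `tag_events`, and the sample markup are made up for this illustration.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record each start and end tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased; values keep their case.
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

def tag_events(markup):
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events

if __name__ == "__main__":
    for event in tag_events('<A HREF="Index.HTML">home</a><DL COMPACT></DL>'):
        print(event)
```

An attribute with no value, such as COMPACT, is reported with the value `None`, which matches the optional `"=" attribute_value` part of the syntax.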
An attribute value can be a string or an identifier:
attribute_value::= identifier | $quote #(char~$quote) $quote
comment::=$lt "!--" `text that will not affect the rendered page` "--" $gt.
html_input::lexical= #($comment | $tag | $entity | $normal_character).
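The lexical rule above can be approximated with one regular expression that tries the alternatives in the order comment, tag, entity, character. This is a rough sketch and not a conforming HTML lexer: the pattern names are invented here, numeric entities are assumed to be written with a "#" as in real HTML, and any character that fits no other alternative (including a stray control character) is simply labelled `char`.

```python
import re

TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"                          # comment
    r"|(?P<tag></?[A-Za-z][A-Za-z0-9]*[^<>]*>)"         # start or end tag
    r"|(?P<entity>&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+);)"  # &name; or &#number;
    r"|(?P<char>.)",                                    # anything else
    re.DOTALL,
)

def tokens(text):
    """Split text into (kind, lexeme) pairs following html_input."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

if __name__ == "__main__":
    for kind, lexeme in tokens("<!-- note --><b>x&lt;y</b>"):
        print(kind, repr(lexeme))
```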
.Close
.Open Grammar
.Open Universal Resource Locators
Universal Resource locators (URLs)
are attribute values that tell a browser where to find things on the Internet.
There was a general introduction at
http://www.ncsa.uiuc.edu/demoweb/url-primer.html
but now this is a forbidden link. I must thank Erika Lynch for
pointing this out and suggesting
.See http://www.investintech.com/content/beginnersurl/
as an alternative pointer to resources.
The following XBNF is an approximation to the standard defined at
.See http://www.w3.org/hypertext/WWW/Addressing/URL/5_BNF.html
Notice that there is a special $URL_encoding used to transmit symbols
that have special meanings in the syntax below.
URL::= $protocol ":" $O($where) $what.
where::=$site $O($port).
what::=$path $O("/" $O($file $O("#" $identifier | "?" $query ))).
path::=#("/"$directory).
query::= $name_value_pair #( $ampersand $name_value_pair).
name_value_pair::= $name "=" $value.
The value in the URL can be any string because it uses $URL_encoding.
protocol::="http" | "ftp" | "mailto" | "telnet" | "file" | "gopher" | "news" |... .
site::= "//" internet_address.
port::=":" decimal_number.
directory::=file_name.
file::=file_name $O("."file_type).
File names can include periods.
file_type::="html" | "gif" | "xbm" | "au" | "jpeg" | "mpeg" | "aiff" | "mov" |...
Browsers often use the $file_type (or extension or suffix) to determine
what they should do with the resource. The protocol is tied in to
the Multipurpose Internet Mail Extensions (MIME)
.See http://www/dick/samples/comp.mail.MIME.html
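The parts named in the URL syntax above can be picked out of a real URL with Python's standard `urllib.parse` module. A sketch only: the helper `url_parts` and the address `www.example.com` are hypothetical, and the library follows the current URL rules, which differ in detail from the approximate XBNF given here.

```python
from urllib.parse import urlsplit, parse_qsl

def url_parts(url):
    """Map a URL onto the names used in the syntax: protocol, site, port, ..."""
    split = urlsplit(url)
    return {
        "protocol": split.scheme,
        "site": split.hostname,
        "port": split.port,                # None when no ":" port is given
        "path": split.path,                # directories plus the file
        "query": parse_qsl(split.query),   # list of (name, value) pairs
        "fragment": split.fragment,        # the identifier after "#"
    }

if __name__ == "__main__":
    parts = url_parts("http://www.example.com:8080/dir/page.html?name=value&x=1")
    for key, value in parts.items():
        print(key, "=", value)
```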
. URL Encoding
To ensure that URLs with strange characters are transmitted across the
Internet correctly the characters in the URL are encoded as strings of one or
three characters.
URL_encoding::char-->#char=`A special encoding of ASCII characters that uses plus in place of spaces and $URL_hex_code in place of characters other than letters & digits`.
URL_encoded::="result of $URL_encoding".
URL_encoding= (letter|digit);Id | " "->"+" |->"%"hex(`lower 8 bits of character code`).
.See http://www/dick/samples/comp.text.ASCII.html
For example "Space Plus+" +> "Space+Plus%2B".
Function_table::=following,
.Table letter digit space other
.Row Id Id "+" "%" hex
.Close.Table
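The function table can be written out directly as a small Python function and compared with the standard library's `urllib.parse.quote_plus`. This is a sketch under the assumption that the input is ASCII; the name `url_encode` is invented here.

```python
from urllib.parse import quote_plus

def url_encode(text):
    """Apply the table: letters and digits unchanged, space to "+", others to %XX."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch == " ":
            out.append("+")
        else:
            # Two hexadecimal digits for the low 8 bits of the character code.
            out.append("%{:02X}".format(ord(ch) & 0xFF))
    return "".join(out)

if __name__ == "__main__":
    print(url_encode("Space Plus+"))
    print(quote_plus("Space Plus+"))
```

The two agree on this example, but `quote_plus` also leaves a few punctuation marks such as "-" and "." unencoded, whereas the table above encodes every character that is not a letter or digit.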
Sadly different browsers do Url-encoding differently!
Some encode using the letters "a".."f" for hexadecimal and some
use "A".."F". Worse the implementation for spaces and plus-signs
is $FUBAR. Some quick tests show the following mappings
when a test string "space plus+" is sent from a form by different
browsers:
.Net
Let
space::=" ",
plus::="+".
()|- $URL_encoding = $space+>$plus | $plus+> "%2B" | ...
MS_IExplorer4::= $space+>$plus | $plus+>$plus | ...
lynx2.3::= $space+>"%20" | $plus+>$plus | ...
lynx2.8::= $URL_encoding.
Java URLencoder::= $URL_encoding.
.See http://www.javasoft.com/products/jdk/1.0.2/api/java.net.URLEncoder.html
Netscape::= $URL_encoding.
$MS_IExplorer4 is a many-to-one mapping and so can not be inverted:
(MS_IExplorer4)|- $MS_IExplorer4 in #char(1..2)--(1)#char.
.Close.Net
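The many-to-one problem can be shown with two toy functions that model only the space and plus rules listed above. Everything else about the browsers is left out, and the function names are made up for this sketch.

```python
def standard_encode(text):
    """Model of $URL_encoding for space and plus only."""
    # Encode "+" first so the "+" produced for spaces is not encoded again.
    return text.replace("+", "%2B").replace(" ", "+")

def ie4_encode(text):
    """Model of the MS_IExplorer4 mapping: space becomes plus, plus stays plus."""
    return text.replace(" ", "+")

if __name__ == "__main__":
    for sample in ("a b", "a+b"):
        print(repr(sample), standard_encode(sample), ie4_encode(sample))
```

Under the second mapping both samples come out as the same string, so the receiving program cannot tell which one was sent.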
.Box
To see what `your` browser does with a simple form and/or CGI (and it is
not a pretty sight) try my
.See http://www/dick/test.form.html
.Close.Box
URL_hex_code::= "%" hexadecimal_digit^2.
There is a MIME format called "application/x-www-form-urlencoded" that is used in forms.
There is a special Java class for handling the URL encoding:
.See http://www.javasoft.com/products/jdk/1.0.2/api/java.net.URLEncoder.html
There is a local UNIX Shell script that will reverse $URL_encoding at
.See http://www.csci.csusb.edu/dick/tools/urlunencode
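For comparison with the shell script, reversing the encoding in Python takes one standard library call, `urllib.parse.unquote_plus`. A small sketch (the wrapper name `url_decode` is invented here); it accepts hexadecimal digits in either case, which covers the browser difference noted earlier.

```python
from urllib.parse import unquote_plus

def url_decode(encoded):
    """Reverse the URL encoding: "+" back to space, %XX back to the character."""
    return unquote_plus(encoded)

if __name__ == "__main__":
    print(url_decode("Space+Plus%2B"))
    print(url_decode("Space+Plus%2b"))  # lower-case hex decodes the same way
```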
.Close Universal Resource Locators
. Documents
document::= $O($start("HTML" )) $O($header) $body.
header::= $start("HEAD" ) #header_elements $end("HEAD" )
body::= $start("BODY")$untagged_body $end("BODY" ) | untagged_body.
untagged_body::= #( $element | named($element) | hypertext_refed($element) ).
attributes("BODY") ::=`often used to specify the background, and the color of text, links and so on`.
.Set
"BGCOLOR="$color_codes
"TEXT="$color_codes
"LINK="$color_codes
"VLINK="$color_codes
"ALINK="$color_codes
"BACKGROUND="$URL
...
.Close.Set
. Backgrounds
You can select a graphic to form a background to your page by
.As_is <BODY BACKGROUND="URL">
Be careful to select something that lets the message on the page be read!
. Colors
.Box
The wise author either lets the browser or user select the default colors
for the body, or specifies a complete set of attributes. Also the wise
author is very careful to make sure that the colors chosen are readable!
Notice that a significant number of people can not tell Red from
Green and so these colors are problematic. Also notice that some browsers
(on hand-held computers in particular) are in black and white... so choose
colors for background that are much darker or lighter than those
for text, links etc.
The values in the body specification use
a form of hexadecimal coding:
color_codes::= "#"red green blue. The $red, $green, and $blue numbers get smaller as the color gets darker. So "#000000" indicates black and "#FFFFFF" indicates white.
Example color codings
red::=hexadecimal_digit^2.
blue::=hexadecimal_digit^2.
green::=hexadecimal_digit^2.
.Box
.As_is #0000FF Blue
.As_is #00FFFF Cyan
.As_is #00FF00 Green
.As_is #FFFF00 Yellow
.As_is #FF0000 Red
.As_is #FF00FF Purple
.Close.Box
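The color coding is easy to take apart in a program. The sketch below parses a code into its three numbers and computes a crude brightness as the plain average of red, green, and blue. That average is an assumption made for illustration, not a standard measure of readability, and the function names are invented.

```python
import re

COLOR_CODE = re.compile(r"#([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})")

def parse_color(code):
    """Return (red, green, blue), each 0..255, for a code such as "#FFFF00"."""
    match = COLOR_CODE.fullmatch(code)
    if match is None:
        raise ValueError("not a color code: " + repr(code))
    return tuple(int(part, 16) for part in match.groups())

def brightness(code):
    """Crude brightness: 0.0 for black up to 255.0 for white."""
    red, green, blue = parse_color(code)
    return (red + green + blue) / 3

if __name__ == "__main__":
    print(parse_color("#FFFF00"))
    # A large difference suggests text that stands out from its background.
    print(abs(brightness("#B8B8B8") - brightness("#000000")))
```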
Here is a set of body attributes that
should be specified and values that are close to the Netscape "classic":
.Box
.As_is Background   BGCOLOR=#B8B8B8  Grey
.As_is Normal Text  TEXT=#000000     Black
.As_is Unused Link  LINK=#0000FF     Blue
.As_is Visited Link VLINK=#8000AF    Purple
.As_is Active Link  ALINK=#FF0000    Red
.Close.Box
.Close.Box
.Open Some Elements
element::=$special_text | $header | $list | $table | $image | $series_of_paragraphs | $break | $horizontal_rule | $link | $form | ...
Here is a short page with a sample of lists and tables on it:
.See http://www/dick/ttt.html
. Breaks
series_of_paragraphs::=$paragraph #($start("P" ) $paragraph)
break::= $start("br" ).
horizontal_rule::=$start("hr" )...
. Hypertext
named::=$start("a name=" name ) (_) $end("a" ).
Note. The above is used like this `named(header)` in this syntax, and the argument `(header)` replaces the `(_)` above,
hypertext_refed::=$start("a href=" $quote $URL $quote ) (_) $end("a" ). Again.... hypertext_refed(`X`) means $start("a href=" $quote $URL $quote ) `X` $end("a").
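The two parameterized definitions behave like functions that wrap an element in an anchor, and can be sketched that way in Python. The function names mirror the syntax; the extra `name` and `url` parameters and the sample values are assumptions of this sketch. The attribute value is passed through `html.escape` so that an ampersand or quote inside a URL is written as an entity.

```python
import html

def named(name, content):
    """Wrap content in an anchor that other links can point at."""
    return '<a name="%s">%s</a>' % (html.escape(name, quote=True), content)

def hypertext_refed(url, content):
    """Wrap content in an anchor that links to url."""
    return '<a href="%s">%s</a>' % (html.escape(url, quote=True), content)

if __name__ == "__main__":
    print(named("top", "<h1>Contents</h1>"))
    print(hypertext_refed("page.html?a=1&b=2", "next page"))
```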
. Text
This fails to express a complicated set of rules about what elements can, can not,
or should not appear nested inside other elements. These are in the
Document Type Definition ($DTD) for HTML documents - written in the
Standard Generalized Markup Language (SGML) and
held at CERN (the European Organization for Nuclear Research).
special_text::= |[ x:$special_text_type] ($start(x ) $simpler_text $end(x )).
special_text_type::= "pre"|"listing"|"blockquote".
Note. The above summarizes 3 different alternatives, with a different `x` in each one.
header::= |[i:"1".."6"] ( $start("H" i ) $text $end("H" i ) ).
.As_is <H1>This is the most prominent header</H1>
Note. the above describes the 6 levels of headers with "H1" being most prominent and "H6" least prominent. The actual and relative styles and sizes can not be specified but are chosen by the user and the browser.
paragraph::= #($piece | $text ),
piece::= |[s:$styles]( $start(s ) $text $end(s ) ).
styles::=$logical_styles | $physical_styles.
logical_styles::= "em" | "strong" | "code" | "samp" | "kbd" | "var" | "dfn" | "cite" | "address",
physical_styles::= "b" | "i" | "u" | "tt".
Note. Physical styles are "$deprecated".
The browser and user determine the precise meaning of these styles
with the following guidelines:
. Table of Styles
.As_is Style    Meaning
.As_is em       Emphasized - "notice me"
.As_is strong   Emphasized even more
.As_is code     This is a piece of computer output
.As_is samp     This is a sample of HTML
.As_is kbd      This is the name of a key on the keyboard
.As_is var      This is a syntactic variable
.As_is dfn      This is a definition
.As_is cite     This is a citation of a source
.As_is address  This is an address (Real or Email)
.As_is b        looks bold (deprecated)
.As_is i        looks italic (deprecated)
.As_is u        looks underlined (deprecated)
.As_is tt       looks like a typewriter (deprecated)
The SGML specifies rules about what is recommended, normal and deprecated.
text::=untagged_body & ( recommended_DTD_rules | DTD_rules | deprecated_DTD_rules),
DTD::=`Document Type Definition`.
. Images
image::=$start("img"), -- note there is no need for an end img tag.
.As_is <img src="URL" alt="description">
attributes("img") ::= | following,
.Set
"src=" $URL
"alt=" string
"align=" $alignment
"ismap"
.Close.Set
alignment::="left" | "right" | "center" .
The 'alt' attribute is what is shown by a browser that does not show
the image - some browsers do not show graphics. Some users turn off
the graphics to get to the information quicker!
The `ismap` attribute indicates that parts of the graphic are hot
buttons that act as links to other pages etc. `Maps` take time
to construct without special purpose tools.
Remember that each graphic takes time to transfer over the network.
Animated graphics, in particular, are resource hogs. If you need to
have a large and complex GIF then use an image processing tool
to create a small thumbnail version. I've used "SnagIt", "the GIMP",
"xv", etc. Then link the thumbnail to the full image:
.As_is <a href="URL_of_full_image"><img src="URL_of_thumbnail" alt="description"></a>
My home and personal pages have examples:
.See http://www/dick/index.html
.See http://www/dick/me.html
. Lists
list::=$ordered_list |$unordered_list | $definition_list | $menu |$directory... .
definition_list::= $start("DL") #$definition $end("DL").
attributes("DL") ::= $O("compact").
definition::=$start( "dt" ) $term #($start("dd") $text ) .
term::=$text.
ordered_list::=$start("OL") $list_body $end("OL" ).
unordered_list::=$start("UL") $list_body $end("UL" ).
menu::=$start("menu") $list_body $end("menu" ).
directory::=$start("dir") $list_body $end("dir" ).
list_body::=#$list_item.
list_item::= $start("li") $text
Lists are a simple and effective way to organize your pages.
Notice that an item in a list can be further split into lines with `<br>`,
paragraphs by `<p>` and pieces with `<hr>`. You can also have
lists inside lists. So bulleted lists and outlines are easy in HTML.
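Lists inside lists map naturally onto nested data. The following sketch turns a nested Python list into an unordered HTML list. The function name and the indentation of the output are choices made for this illustration; following the syntax above, each item is opened with a start tag only.

```python
import html

def unordered_list(items, indent=0):
    """Return the lines of a UL; a nested Python list becomes a nested UL."""
    pad = "  " * indent
    lines = [pad + "<ul>"]
    for item in items:
        if isinstance(item, list):
            # The nested list follows the text of the item before it.
            lines.extend(unordered_list(item, indent + 1))
        else:
            lines.append(pad + "  <li>" + html.escape(item))
    lines.append(pad + "</ul>")
    return lines

if __name__ == "__main__":
    outline = ["Lexicon", ["Entities", "Tags"], "Grammar"]
    print("\n".join(unordered_list(outline)))
```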
. Tables
A table is a two-dimensional grid. Each cell in the grid
can be just about any HTML element. The browser has
an interesting task of making sure that all columns are
wide enough and rows tall enough for all the elements to fit.
Tables are a standard part of HTML but some text based browsers
may not support them.
table::= $start("table") #$row $end("table" ),
Table tags can have a numeric BORDER attribute.
row::=$start("tr") #$table_item $O($end("tr" )),
table_item::=$table_header_item | $table_normal_item.
table_normal_item::=$start("td") #$element $O($end("td" )).
table_header_item::=$start("th") #$element $O($end("th" )).
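As an illustration of the table syntax, here is a sketch that writes a header row and data rows as an HTML table. The function name and its parameters are invented for this example; it always emits the optional end tags and escapes the cell text.

```python
import html

def table(header, rows, border=1):
    """Render a header row and data rows as an HTML table."""
    lines = ['<table border="%d">' % border]
    cells = "".join("<th>%s</th>" % html.escape(str(h)) for h in header)
    lines.append("<tr>" + cells + "</tr>")
    for row in rows:
        cells = "".join("<td>%s</td>" % html.escape(str(c)) for c in row)
        lines.append("<tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)

if __name__ == "__main__":
    print(table(["Code", "Color"], [["#0000FF", "Blue"], ["#FF0000", "Red"]]))
```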
.Close Some Elements
. Common Attributes
For s, numeric_attribute(s) ::= $O( s "=" $number ).
For s1, string_attribute(s1) ::= $O( s1 "=" string ) .
For s1,s2, value_attribute(s1,s2) ::= $O( s1 "=" s2) .
name_attribute::= $value_attribute("NAME", $name).
.Open HTML Forms
HTML forms are a quick and easy way to gather
information from a user and send it through a
.See Common Gateway Interface
into a program on a server.
See
.See http://www/dick/HTML_quick.html
for a quick introduction
form::= $form_tag #( $element | $form_element ) "</FORM>".
form_tag::= "