.Open Syntax of the HyperText Markup Language - HTML
. Introduction
HTML is the markup language used to describe pages on the World-Wide-Web. See
.See http://www/dick/www.html
for tutorials and background information. Also see the "Bare Bones Guide to HTML"
.See http://werbach.com/barebones/

HTML is designed to describe the logical structure of a large number of interlinked pages. It is a special document type described using the rules of $SGML (Standard Generalized Markup Language). It has been updated several times. The latest version (January 2000) is called XHTML:
.See http://www/dick/samples/xhtml.html

The W3 Consortium supports the Web. A separate site, not run by the Consortium,
.See http://w3schools.com/
provides a family of tools for learning the technology.

This page defines the syntax of a useful subset of HTML, see
.See Basic Ideas

For information on SGML see
.Box
SGML::=`Standard Generalized Markup Language`,
.See http://www.sil.org/sgml/sgml.html
.See http://www.csci.csusb.edu/dick/samples/comp.text.SGML.html
.Close.Box

For the official definitions of HTML see
.Box
The official defining documents for HTML2.0:
.See http://hopf.math.nwu.edu/html2.0/html-spec_toc.html
For a more up-to-date and complex definition in $SGML see
.See http://www.ucc.ie/dick/doc/www/html/dtds/htmlpro.html
.Close.Box
. Basic Ideas
There are three ideas that make up HTML
.Box
$SGML tags: ....
.See Documents
$SGML Elements for characters -- $entity
.See Lexicon
(URLs): how://where/what...
.See Universal Resource Locators
(CGIs): Writing programs that produce pages when run
.See Common Gateway Interface
.Close.Box
For help with the Three Letter Acronyms (TLAs) used in talking about HTML see
glossary::= http://www/dick/samples/comp.html.glossary.html
. Metalanguage
For all X, O(X) ::=`optional X`.
For all X, #(X) ::=`zero or more X`.
.Open Lexicon
HTML_control_char::=( $lt | $gt | $semicolon | $ampersand | $quote ).
normal_character::= char ~ $HTML_control_char.
char::=http://www/dick/samples/comp.text.ASCII.html#char
ampersand::="&".
semicolon::=";".
lt::="<".
gt::=">".
quote::=http://www/dick/samples/comp.text.ASCII.html#quotes -- `the double quotes character of ASCII`.

SGML allows special symbols to be indicated in a form known as an entity. For example, in HTML the less_than character has a special use and so real less than signs are encoded like this:
.As_is &lt;
The $ampersand and the $semicolon bracket SGML & HTML entities:
entity::= $ampersand identifier $semicolon | $ampersand "#" number $semicolon.
An $entity allows a symbol to be described by an identifier or a number rather than as itself. This has two purposes. First, it allows symbols used in HTML to appear in the rendered document. For example "&quot;" is written in HTML where you want a double quotation mark to appear. The second use of an entity is to express, using only ASCII, characters that are not in ASCII. There are a small number of predefined HTML entities. See the ISO Latin 1 character set:
.See http://hopf.math.nwu.edu/html2.0/html-spec_9.html#SEC101
and the HTML Coded Character Set:
.See http://hopf.math.nwu.edu/html2.0/html-spec_13.html#SEC106

The structure of an SGML/HTML document is described by inserting tags into the raw text - this is known as "Marking Up the text". The general syntax of a $tag is:
tag::=$lt $tag_identifier #$attribute $gt | $lt "/" $tag_identifier $gt.
So, tags take two forms - those that indicate the `start` of something, and those that indicate the `end` of something. Here is a typical pair:
.As_is <strong>
.As_is </strong>
that indicate the start and end (respectively) of a piece of text that needs strong emphasis.
(start): For X:tag_identifier, start(X) ::= $lt X #attributes(X) $gt.
Each type of tag has its own attributes, and the set of attributes for a given tag has varied with the edition of HTML and the browser. However they all have the same syntax:
For X:tag_identifier, attributes(X) ::$attribute.
attribute::= $attribute_identifier $O("=" $attribute_value).
tag_identifier::@$identifier.
attribute_identifier::@$identifier.
(end): For X:tag_identifier, end(X) ::= $lt "/" X $gt.
identifier::= letter #(letter|digit), -- I think?
.Hole
Upper and lower case are ignored in tag and attribute identifiers but not in attribute values. An attribute value can be a string or an identifier:
attribute_value::= identifier | $quote #(char~$quote) $quote.
comment::=$lt "!--" `text that will not affect the rendered page` "--" $gt.
html_input::lexical= #($comment | $tag | $entity | $normal_character).
.Close
.Open Grammar
.Open Universal Resource Locators
Universal Resource Locators (URLs) are attribute values that tell a browser where to find things on the Internet. There was a general introduction at http://www.ncsa.uiuc.edu/demoweb/url-primer.html but now this is a forbidden link. I must thank Erika Lynch for pointing this out and suggesting
.See http://www.investintech.com/content/beginnersurl/
as an alternative pointer to resources.

The following XBNF is an approximation to the standard defined at
.See http://www.w3.org/hypertext/WWW/Addressing/URL/5_BNF.html
Notice that there is a special $URL_encoding used to transmit symbols that have special meanings in the syntax below.
URL::= $protocol ":" $O($where) $what.
where::=$site $O($port).
what::=$path $O("/" $O($file $O("#" $identifier | "?" $query ))).
path::=#("/"$directory).
query::= $name_value_pair #( $ampersand $name_value_pair).
name_value_pair::= $name "=" $value.
The value in the URL can be any string because it uses $URL_encoding.
protocol::="http" | "ftp" | "mailto" | "telnet" | "file" | "gopher" | "news" |... .
site::= "//" internet_address.
port::=":" decimal_number.
directory::=file_name.
file::=file_name $O("."file_type).
File names can include periods.
file_type::="html" | "gif" | "xbm" | "au" | "jpeg" | "mpeg" | "aiff" | "mov" |... .
Browsers often use the $file_type (or extension or suffix) to determine what they should do with the resource.
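As a rough illustration of the $URL rule above, the following Python sketch splits a URL into pieces named after the grammar. It relies on `urlsplit` from the standard `urllib.parse` module. The mapping of its fields onto $protocol, $site, $port, $path and $query is only an approximation of the XBNF, and the function name `url_parts` and the example URL are invented for this sketch.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into parts named roughly after the XBNF rules above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,     # e.g. "http"
        "site": parts.hostname,       # the internet address after "//"
        "port": parts.port,           # an int, or None when no port is given
        "path": parts.path,           # directories plus the file name
        "query": parts.query,         # text after "?" (still URL-encoded)
        "fragment": parts.fragment,   # identifier after "#"
    }


# A made-up URL that exercises most of the grammar.
example = url_parts("http://www.example.com:8080/dir/page.html?name=value#top")
print(example)
```

Note that the library accepts a fragment and a query together, which the $what rule above treats as alternatives.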
The protocol is tied in to the Multipurpose Internet Mail Extensions (MIME) proposals
.See http://www/dick/samples/comp.mail.MIME.html
. URL Encoding
To ensure that URLs with strange characters are transmitted across the Internet correctly, each character in the URL is encoded as a string of one or three characters.
URL_encoding::char-->#char=`A special encoding of ASCII characters that uses plus in place of spaces and $URL_hex_code in place of characters other than letters & digits`.
URL_encoded::="result of $URL_encoding".
URL_encoding= (letter|digit);Id | " "->"+" |->"%"hex(`lower 8 bits of character code`).
.See http://www/dick/samples/comp.text.ASCII.html
For example "Space Plus+" +> "Space+Plus%2B".
Function_table::=following,
.Table letter	digit	space	other
.Row Id	Id	"+"	"%" hex
.Close.Table
Sadly different browsers do URL-encoding differently! Some encode using the letters "a".."f" for hexadecimal and some use "A".."F". Worse, the implementation for spaces and plus-signs is $FUBAR. Some quick tests show the following mappings when a test string "space plus+" is sent from a form by different browsers:
.Net
Let space::=" ", plus::="+".
()|- $URL_encoding = $space+>$plus | $plus+> "%2B" | ...
MS_IExplorer4::= $space+>$plus | $plus+>$plus | ...
lynx2.3::= $space+>"%20" | $plus+>$plus | ...
lynx2.8::= $URL_encoding.
Java URLencoder::= $URL_encoding.
.See http://www.javasoft.com/products/jdk/1.0.2/api/java.net.URLEncoder.html
Netscape::= $URL_encoding.
$MS_IExplorer4 is a many-to-one mapping and so can not be inverted:
(MS_IExplorer4)|- $MS_IExplorer4 in #char(1..2)--(1)#char.
.Close.Net
.Box
To see what `your` browser does with a simple form and/or CGI (and it is not a pretty sight) try my
.See http://www/dick/test.form.html
.Close.Box
URL_hex_code::= "%" hexadecimal_digit^2.
There is a MIME format called "application/x-www-form-urlencoded" that is used in forms.
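The example mapping "Space Plus+" +> "Space+Plus%2B" can be checked with a short Python sketch. It uses `quote_plus` and `unquote_plus` from the standard `urllib.parse` module, which follow the space-to-plus convention but leave a few punctuation marks such as "-" and "." unencoded, so they are only close to the $URL_encoding defined here. The helper `ie4_style_encode` is a hypothetical toy that mimics only the two $MS_IExplorer4 mappings reported above.

```python
from urllib.parse import quote_plus, unquote_plus


def ie4_style_encode(text):
    """Toy model of the MS_IExplorer4 mapping: space and plus both end up as '+'."""
    return text.replace(" ", "+")


original = "Space Plus+"

# The encoding described in this section: space -> "+", plus -> "%2B".
encoded = quote_plus(original)
round_trip = unquote_plus(encoded)

# The many-to-one mapping loses the difference between space and plus.
lossy = unquote_plus(ie4_style_encode(original))

print(encoded)
print(round_trip == original)
print(lossy == original)
```

The last comparison shows why a many-to-one mapping can not be inverted: once both characters have become "+", the decoder can only guess, and here it turns the real plus sign into a space.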
There is a special Java class for handling the URL encoding:
.See http://www.javasoft.com/products/jdk/1.0.2/api/java.net.URLEncoder.html
There is a local UNIX Shell script that will reverse $URL_encoding at
.See http://www.csci.csusb.edu/dick/tools/urlunencode
.Close Universal Resource Locators
. Documents
document::= $O($start("HTML" )) $O($header) $body.
header::= $start("HEAD" ) #header_elements $end("HEAD" ).
body::= $start("BODY")$untagged_body $end("BODY" ) | untagged_body.
untagged_body::= #( $element | named($element) | hypertext_refed($element) ).
attributes("BODY") ::=`often used to specify the background, and the color of text, links and so on`.
.Set
"BGCOLOR="$color_codes
"TEXT="$color_codes
"LINK="$color_codes
"VLINK="$color_codes
"ALINK="$color_codes
"BACKGROUND="$URL
...
.Close.Set
. Backgrounds
You can select a graphic to form a background to your page by
.As_is <BODY BACKGROUND="URL">
Be careful to select something that lets the message on the page be read!
. Colors
.Box
The wise author either lets the browser or user select the default colors for the body, or specifies a complete set of attributes. Also the wise author is very careful to make sure that the colors chosen are readable! Notice that a significant number of people can not tell Red from Green and so these colors are problematic. Also notice that some browsers (on hand-held computers in particular) are in black and white, so choose colors for the background that are much darker or lighter than those for text, links etc.

The values in the body specification use a form of hexadecimal coding:
color_codes::= "#"red green blue.
red::=hexadecimal_digit^2.
green::=hexadecimal_digit^2.
blue::=hexadecimal_digit^2.
As the $red, $green, and $blue numbers get smaller the color gets darker. So "#000000" indicates black and "#FFFFFF" indicates white.

Example color codings:
.Box
.As_is #0000FF Blue
.As_is #00FFFF Cyan
.As_is #00FF00 Green
.As_is #FFFF00 Yellow
.As_is #FF0000 Red
.As_is #FF00FF Purple
.Close.Box
Here is a set of body attributes that should be specified, with values that are close to the Netscape "classic":
.Box
.As_is Background    BGCOLOR=#B8B8B8  Grey
.As_is Normal Text   TEXT=#000000     Black
.As_is Unused Link   LINK=#0000FF     Blue
.As_is Visited Link  VLINK=#8000AF    Purple
.As_is Active Link   ALINK=#FF0000    Red
.Close.Box
.Close.Box
.Open Some Elements
element::=$special_text | $header | $list | $table | $image | $series_of_paragraphs | $break | $horizontal_rule | $link | $form | ...
Here is a short page with a sample of lists and tables on it:
.See http://www/dick/ttt.html
. Breaks
series_of_paragraphs::=$paragraph #($start("P" ) $paragraph).
break::= $start("br" ).
horizontal_rule::=$start("hr" )... .
. Hypertext
named::=$start("a name=" name ) (_) $end("a" ).
Note. The above is used like this `named(header)` in this syntax, and the argument `(header)` replaces the `(_)` above.
hypertext_refed::=$start("a href=" $quote $URL $quote ) (_) $end("a" ).
Again, hypertext_refed(`X`) means $start("a href=" quote URL quote ) `X` $end("a").
. Text
This fails to express a complicated set of rules about what elements can, can not, or should not appear nested inside other elements. These are in the Document Type Definition ($DTD) for HTML documents - written in the Standard Generalized Markup Language (SGML) and held at CERN (the European Organization for Nuclear Research).
special_text::= |[ x:$special_text_type] ($start(x ) $simpler_text $end(x )).
special_text_type::= "pre"|"listing"|"blockquote".
Note. The above summarizes 3 different alternatives with a different `x` in each one.
header::= |[i:"1".."6"] ( $start("H" i ) $text $end("H" i ) ).
.As_is <H1>This is the most prominent header</H1>
Note. The above describes the 6 levels of headers with "H1" being most prominent and "H6" least prominent. The actual and relative styles and sizes can not be specified but are chosen by the user and the browser.
paragraph::= #($piece | $text ),
piece::= |[s:$styles]( $start(s ) $text $end(s ) ).
styles::=$logical_styles | $physical_styles.
logical_styles::= "em" | "strong" | "code" | "samp" | "kbd" | "var" | "dfn" | "cite" | "address",
physical_styles::= "b" | "i" | "u" | "tt".
Note. Physical styles are "$deprecated".

The browser and user determine the precise meaning of these styles with the following guidelines:
. Table of Styles
.As_is Style     Meaning
.As_is em        Emphasized - "notice me"
.As_is strong    Emphasized even more
.As_is code      This is a piece of computer output
.As_is samp      This is a sample of HTML
.As_is kbd       This is the name of a key on the keyboard
.As_is var       This is a syntactic variable
.As_is dfn       This is a definition
.As_is cite      This is a citation of a source
.As_is address   This is an address (Real or Email)
.As_is b         looks bold (deprecated)
.As_is i         looks italic (deprecated)
.As_is u         looks underlined (deprecated)
.As_is tt        looks like a typewriter (deprecated)
The SGML specifies rules about what is recommended, normal and deprecated.
text::=untagged_body & ( recommended_DTD_rules | DTD_rules | deprecated_DTD_rules),
DTD::=`Document Type Definition`.
. Images
image::=$start("img"), -- note there is no need for an end img tag.
.As_is <img src="URL" alt="[description]">
attributes("img") ::= | following,
.Set
"src=" $URL
"alt=" string
"align=" $alignment
"ismap"
.Close.Set
alignment::="left" | "right" | "center" .
The `alt` attribute is what is shown by a browser that does not show the image - some browsers do not show graphics. Some users turn off the graphics to get to the information quicker!

The `ismap` attribute indicates that parts of the graphic are hot buttons that act as links to other pages etc. `Maps` take time to construct without special purpose tools.
Remember that each graphic takes time to transfer over the network. Animated graphics, in particular, are resource hogs. If you need to have a large and complex GIF then use an image processing tool to create a small thumbnail version. I've used "SnagIt", "the GIMP", "xv", etc. Then link the thumbnail to the full image:
.As_is <a href="full_image_URL"><img src="thumbnail_URL" alt="Download a graphic!"></a>
My home and personal pages have examples:
.See http://www/dick/index.html
.See http://www/dick/me.html
. Lists
list::=$ordered_list |$unordered_list | $definition_list | $menu |$directory... .
definition_list::= $start("DL") #$definition $end("DL").
attributes("DL") ::= $O("compact").
definition::=$start( "dt" ) $term #($start("dd") $text ) .
term::=$text.
ordered_list::=$start("OL") $list_body $end("OL" ).
unordered_list::=$start("UL") $list_body $end("UL" ).
menu::=$start("menu") $list_body $end("menu" ).
directory::=$start("dir") $list_body $end("dir" ).
list_body::=#$list_item.
list_item::= $start("li") $text.
Lists are a simple and effective way to organize your pages. Notice that an item in a list can be further split into lines with `<br>`, paragraphs by `<p>` and pieces with `<hr>`. You can also have lists inside lists. So bulleted lists and outlines are easy in HTML.
. Tables
A table is a two dimensional grid. Each cell in the grid can be just about any HTML element. The browser has the interesting task of making sure that all columns are wide enough and all rows tall enough for all the elements to fit. Tables are a standard part of HTML but some text based browsers may not support them.
table::= $start("table") #$row $end("table" ),
Table tags can have a numeric BORDER attribute.
row::=$start("tr") #$table_item $O($end("tr" )),
table_item::=$table_header_item | $table_normal_item.
table_normal_item::=$start("td") #$element $O($end("td" )).
table_header_item::=$start("th") #$element $O($end("th" )).
.Close Some Elements
. Common Attributes
For s, numeric_attribute(s) ::= $O( s "=" $number ).
For s1, string_attribute(s1) ::= $O( s1 "=" string ) .
For s1,s2, value_attribute(s1,s2) ::= $O( s1 "=" s2) .
name_attribute::= $value_attribute("NAME", $name).
.Open HTML Forms
HTML forms are a quick and easy way to gather information from a user and send it through a
.See Common Gateway Interface
into a program on a server. See
.See http://www/dick/HTML_quick.html
for a quick introduction.
form::= $form_tag #( $element | $form_element ) "</FORM>".
form_tag::= "<FORM" $form_attributes ">".
form_attributes::= $name_attribute $O($action_attribute) $O($method_attribute).
method_attribute::= "METHOD" "=" ("GET" | "POST"), use GET for small forms and POST for large ones.
action_attribute::= "ACTION" "=" $quote $action $quote.
action ::= $URL, -- special semantics:
.Set
URL can use the HTTP protocol to call a $CGI program on a server.
URL can use the MAILTO protocol to send the form as EMAIL.
URL can use any protocol to refer to a page on the WWW.
Can URL use TELNET?
.Close.Set
.Open Form Elements
form_element::= $textarea | $action_element | $input_element | $selection.
textarea::="<TEXTAREA" $textarea_attributes ">" ASCII_text "</TEXTAREA>", multiple line text box.
textarea_attributes::=$name_attribute $numeric_attribute("ROWS") $numeric_attribute("COLS") $numeric_attribute("MAXLENGTH") $O("WRAP").
action_element::= |[t:action_type] ($input(t)).
action_type::= $submit | $reset | $image.
(above, MATHS)|- $action_element = $input($submit) | $input($reset) | $input($image).
.Box
reset::=$ignore_case("RESET"), appears as a button to be selected, resets all input elements to previous values.
submit::=$ignore_case("SUBMIT"), appears as a button to be selected, and transmits the data in the form according to the $method_attribute and $action_attribute.
image::=$ignore_case("IMAGE"), like a $submit but includes the (x,y) coordinates of the click in the image in arguments `name`.x and `name`.y
.Close.Box
input_element::= | [t:input_type] ($input(t)).
input_type::= $button | $checkbox | $hidden | $password | $radio | $reset | $submit | $text.
.Box
button::=$ignore_case("BUTTON").
checkbox::=$ignore_case("CHECKBOX"), multiple boxes can be checked with the same name and different values.
hidden::=$ignore_case("HIDDEN").
password::=$ignore_case("PASSWORD"), user input is accepted and not displayed, but it is not encrypted.
radio::=$ignore_case("RADIO"), only one of each set can be selected.
text::=$ignore_case("TEXT"), one line text box.
.Close.Box
For t:$input_type, input(t) ::= "<INPUT TYPE=" t $input_attributes(t) ">".
input_attributes::input_type -> input_attributes = following
.Set
($text, $numeric_attribute("SIZE") $numeric_attribute("MAXLENGTH") )
($password, $numeric_attribute("SIZE") $numeric_attribute("MAXLENGTH") )
($checkbox, $string_attribute("VALUE") $O("CHECKED") )
($radio, $string_attribute("VALUE") $O("CHECKED") )
($submit, $string_attribute("VALUE") )
($reset, $string_attribute("VALUE") )
($button, ?? )
($image, $string_attribute("ALT") $string_attribute("ALIGN") )
.Close.Set
selection::= "<" $select $select_attributes ">" #$option "</" $select ">".
select_attributes::=$name_attribute $numeric_attribute("SIZE") $O("MULTIPLE").
Normally SELECT generates a "pop-up" menu listing the options and letting the user select a single item from the list. MULTIPLE plus a SIZE signals a browser to offer the user a scrollable list of options and allow them to check off any number of them. This generates a comma separated list of URL-encoded options.
select::=$ignore_case("SELECT"), allows one from a menu of options.
option::= "<OPTION" $string_attribute("VALUE") ">" $string.
The VALUE, if it exists, is returned in place of the following string.
.Close Form Elements
. What is Sent by a Form
If the form's action is "mailto:" or a call to a $CGI then a URL-encoded string called the query is sent:
query::= $pair #( "&" $pair).
pair::= name"="value.
The name and value come from the values selected at the time a Submit button is selected:
.Set
Checkbox and radio: name and value attribute(s) selected.
Select: the name of the selection and a comma separated list of selected OPTIONs.
Text and textarea: the name and the content.
Image: coordinates clicked in image.
.Close.Set
.Close HTML forms
.Open Common Gateway Interface
(glossary)|- CGI::=`Common Gateway Interface`
The $CGI rules define how data is given to a program on the server. The program then runs and generates a page that is returned to the browser, using a standard format.
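To make the $query format concrete, here is a minimal Python sketch of the decoding that a program on the server might do with the string a form sends. It uses `urlencode` and `parse_qsl` from the standard `urllib.parse` module. The field names and values are made up for illustration, and a real CGI program would also have to read the string from its environment or standard input, which is not shown.

```python
from urllib.parse import parse_qsl, urlencode


def decode_query(query):
    """Turn a URL-encoded query string into a list of (name, value) pairs."""
    # keep_blank_values=True keeps fields that the user left empty.
    return parse_qsl(query, keep_blank_values=True)


# Hypothetical form fields: a text box, a value with a plus sign, an empty box.
fields = [("user", "A B"), ("topic", "HTML+SGML"), ("note", "")]

query = urlencode(fields)      # what a browser would send for these fields
pairs = decode_query(query)    # what the program on the server recovers

print(query)
print(pairs)
```

Each pair is joined by "=" and the pairs are separated by "&", as in the $query rule, and decoding recovers the original names and values because spaces and plus signs are encoded differently.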
.See http://www.csci.csusb.edu/dick/www.html#CGI
This program can be written in any language - but many prefer to use a language called PERL. Personally I use UNIX shell scripts. To interface a MIS database to the Web you could use COBOL. Almost certainly Java is going to be another popular way to write $CGI programs.

The following program
.See http://www.csci.csusb.edu/dick/tools/unpost.c
is useful for converting $CGI posted input into normal but URL-encoded standard input ready for a UNIX shell script or program to handle.
.See URL Encoding
.Close Common Gateway Interface
. Glossary
FUBAR::="Fouled Up Beyond All Recognition", an extreme form of $SNAFU.
SNAFU::="Situation Normal -- All Fouled Up", an acronym used in the US Army in the Second World War.
deprecated::=`they don't like it because it is physical not logical`. (HTML is not for word processing!)
.Close