comp.html.syntax.html Sat Dec 23 07:59:37 PST 2006

Contents


    Syntax of the HyperText Markup Language - HTML

      Introduction

      HTML is the language used on the World-Wide-Web. See [ www.html ] for tutorials and background information.

      HTML is designed to describe the logical structure of a large number of interlinked pages. It is a special document type described using the rules of SGML (Standard Generalized Markup Language).

      This page defines the syntax of a useful subset of HTML; see [ Basic Ideas ].

      For information on SGML see


      1. SGML::=Standard Generalized Markup Language, [ sgml.html ] [ comp.text.SGML.html ]


      For the official definitions of HTML see


        The official defining documents for HTML2.0: [ html-spec_toc.html ]

        For a more up-to-date and complex definition in SGML see [ htmlpro.html ]


      Basic Ideas

      There are four ideas that make up HTML:
      1. SGML tags: <tag> .... </tag> [ Documents ]
      2. SGML entities for characters: &identifier; [ Lexicon ]
      3. (URLs): how://where/what... [ Universal Resource Locators ]
      4. (CGIs): Connecting programs to pages [ Common Gateway Interface ]

      For help with the Three Letter Acronyms (TLAs) used in talking about HTML see
    1. glossary::= See http://csci.csusb.edu/dick/samples/comp.html.glossary.html

      Metalanguage

    2. For all X, O(X)::=optional X
    3. For all X, #(X)::=zero or more X
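
      For readers more used to regular expressions, the two operators above behave roughly like "?" and "*". The following Python sketch makes that correspondence concrete. The helper names O and many, and the signed_number rule, are assumptions made up for illustration; they are not part of the notation used on this page.

```python
import re

def O(x):
    """O(X): optional X, i.e. zero or one occurrence."""
    return f"(?:{x})?"

def many(x):
    """#(X): zero or more occurrences of X."""
    return f"(?:{x})*"

# Hypothetical rule, for illustration only:
#   signed_number::= O("-") digit #(digit).
DIGIT = "[0-9]"
SIGNED_NUMBER = re.compile(O("-") + DIGIT + many(DIGIT))

def is_signed_number(text):
    """True when the whole string fits the hypothetical signed_number rule."""
    return SIGNED_NUMBER.fullmatch(text) is not None
```

      With these definitions is_signed_number("-42") and is_signed_number("7") hold, while a lone "-" is rejected because the first digit is not optional.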

      Lexicon

      1. HTML_control_char::=( lt | gt | semicolon | ampersand | quote ).

      2. normal_character::= char ~ HTML_control_char.
      3. char::= See http://csci.csusb.edu/dick/samples/comp.text.ASCII.html#char

      4. ampersand::="&".
      5. semicolon::=";".
      6. lt::="<".
      7. gt::=">".
      8. quote::= See http://csci.csusb.edu/dick/samples/comp.text.ASCII.html#quotes -- the double quote character of ASCII.

        SGML allows special symbols to be indicated in a form known as an entity. For example in HTML the less_than character has a special use and so real less than signs are encoded like this:

         		&lt;
        The ampersand and the semicolon bracket SGML & HTML entities:
      9. entity::= ampersand identifier semicolon | ampersand "#" number semicolon. An entity allows a symbol to be described by an identifier or a character number rather than as itself. This has two purposes. First, it allows symbols used in HTML syntax to appear in the rendered document. For example, "&quot;" is written in HTML where you want a double quotation mark to appear. The second use of an entity is to express, using only ASCII, characters that are not in ASCII. There are a small number of predefined HTML entities. See the ISO Latin 1 character set: [ SEC101 in html-spec_9 ] and the HTML Coded Character Set: [ SEC106 in html-spec_13 ]
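
        The two uses of entities described above can be tried out with Python's standard html module. This is only a sketch: the function names encode_for_html and decode_from_html are invented here, and html.escape also encodes the single quote, which this page does not discuss.

```python
import html

def encode_for_html(text):
    """Replace <, >, & and quotes with entities so they display literally."""
    return html.escape(text, quote=True)

def decode_from_html(text):
    """Turn named and numeric entities back into the characters they stand for."""
    return html.unescape(text)

if __name__ == "__main__":
    # First use: HTML control characters that must appear in the rendered text.
    print(encode_for_html('if a < b: print("yes")'))
    # Second use: non-ASCII characters written using only ASCII.
    print(decode_from_html("caf&eacute; or caf&#233;"))
```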

        The structure of an SGML/HTML document is described by inserting tags into the raw text - this is known as "Marking Up the text". Tags take two forms - those that indicate the start of something, and those that indicate the end of something. Here is a typical pair:

        	<strong>
        	</strong>
        that indicate the start and end (respectively) of a piece of text that needs strong emphasis.


        (start):

      10. For X:tag_identifier, start(X)::= lt X #attribute gt.


        (end):

      11. For X:tag_identifier, end(X)::= lt "/" X gt.

        So the general syntax of a tag is:

      12. tag::=lt tag_identifier #attribute gt | lt "/" tag_identifier gt.

      13. attribute::= attribute_identifier O("=" attribute_value).
      14. tag_identifier::@identifier.
      15. attribute_identifier::@identifier.
      16. identifier::= letter #(letter|digit), -- I think?

        Upper and lower case are ignored in tag and attribute identifiers but not in attribute values.

        An attribute value can be a string or an identifier:

      17. attribute_value::= identifier | quote #(char~quote) quote

      18. comment::=lt "!--" text that will not affect the rendered page "--" gt.

      19. html_input::lexical= #(comment | tag | entity | normal_character).
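
      The lexical rule above can be approximated by a small regular-expression tokenizer. The Python sketch below is an assumption-laden illustration, not a real HTML scanner: the name tokenize is invented, attributes are matched loosely as "anything up to the next >", and, following normal_character strictly, a bare quote, semicolon or ampersand in running text is reported as an error.

```python
import re

# One alternative per lexical class in html_input. Order matters because
# a comment also begins with "<".
TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<tag></?[A-Za-z][A-Za-z0-9]*[^<>]*>)"
    r"|(?P<entity>&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+);)"
    r"|(?P<text>[^<>&;\"]+)",
    re.DOTALL,
)

def tokenize(source):
    """Return (kind, lexeme) pairs; raise ValueError where no class fits."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected {source[pos]!r} at offset {pos}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

      Runs of normal characters are grouped into one "text" token for convenience; the grammar itself treats each character separately.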

      Grammar

        Universal Resource Locators

          Universal Resource Locators (URLs) are attribute values that tell a browser where to find things on the internet. There is a general introduction at [ url-primer.html ]

          The following XBNF is an approximation to the standard defined at [ 5_BNF.html ]

          Notice that there is a special URL_encoding used to transmit symbols that have special meanings in the syntax below.

        1. URL::= protocol ":" O(where) what.
        2. where::=site O(port).
        3. what::=path O("/" O(file O("#" identifier | "?" query ))).
        4. path::=#("/"directory).

        5. query::= name_value_pair #( ampersand name_value_pair).
        6. name_value_pair::= name "=" value. The value in the URL can be any string because it uses URL_encoding.

        7. protocol::="http" | "ftp" | "mailto" | "telnet" | "file" | "gopher" | "news" |... .
        8. site::= "//" internet_address.
        9. port::=":" decimal_number.
        10. directory::=file_name.
        11. file::=file_name O("."file_type). File names can include periods.
        12. file_type::="html" | "gif" | "xbm" | "au" | "jpeg" | "mpeg" | "aiff" | "mov" |... Browsers often use the file_type (or extension or suffix) to determine what they should do with the resource. The protocol is tied in to the Multipurpose Internet Mail Extensions (MIME) [ comp.mail.MIME.html ]
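
          To see how a concrete URL divides into the parts named above, Python's standard urllib.parse module can be used. This is a sketch under assumptions: the function name url_parts and the www.example.com address are made up for illustration, and urlsplit follows the later URL standards rather than the approximate XBNF on this page, so it does not separate the directory path from the file.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into pieces labelled with the grammar's names."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "site": parts.hostname,
        "port": parts.port,            # None when the URL gives no port
        "path and file": parts.path,
        "query": parts.query,
        "identifier": parts.fragment,  # the part after "#"
    }

if __name__ == "__main__":
    print(url_parts("http://www.example.com:8080/docs/page.html#intro"))
```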

          URL Encoding

        13. URL_encoding::#char-->#char=A special encoding of ASCII strings that uses plus in place of spaces and "%" Hex pairs in place of characters other than letters & digits.
        14. URL_hex_code::= "%" hexadecimal_digit^2. URL_encoding= (letter|digit);Id | " "->"+"|->"%"hex(lower 8 bits of character code).

          This is a MIME format called "application/x-www-form-urlencoded" and is used in forms. There is a special Java class for handling the encoding: [ java.net.URLEncoder.html ]

          There is a local UNIX Shell script that will reverse URL-encoding at [ urlunencode ]
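
          The encoding and its reverse can also be tried with Python's standard urllib.parse module. The sketch below is illustrative only: the sample strings and variable names are invented, and quote_plus leaves a few punctuation characters ("_", ".", "-", "~") unencoded in addition to letters and digits.

```python
from urllib.parse import parse_qs, quote_plus, unquote_plus, urlencode

# Encoding one value: spaces become "+", other special characters "%" hex pairs.
encoded = quote_plus("Tom & Jerry")
decoded = unquote_plus(encoded)

# Encoding a whole query: name=value pairs joined by "&".
query = urlencode([("name", "Ann Lee"), ("likes", "C++")])

# Reversing it: each name maps to the list of values sent under that name.
fields = parse_qs(query)

if __name__ == "__main__":
    print(encoded)
    print(decoded)
    print(query)
    print(fields)
```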

        . . . . . . . . . ( end of section Universal Resource Locators)

        Documents

      1. document::= O(start("HTML" )) O(header) body.
      2. header::= start("HEAD" ) #header_elements end("HEAD" )
      3. body::= start("BODY" #body_attributes ) untagged_body end("BODY" ) | untagged_body.
      4. untagged_body::= #( element | named(element) | hypertext_refed(element) ).

      5. body_attributes::=often used to specify the background, and the color of text, links and so on.

        Backgrounds

        You can select a graphic to form a background to your page by
         		<BODY BACKGROUND="Graphic">
        Be careful to select something that lets the message on the page be read!

        Colors


          The wise author either lets the browser or user select the default colors for the body, or specifies a complete set of attributes. The wise author is also very careful to make sure that the colors chosen are readable! Notice that a significant number of people cannot tell red from green, so these colors are problematic. Also notice that some browsers (on handheld computers in particular) are in black and white, so choose a background color that is much darker or lighter than the colors for text, links, etc.

          The values in the body specification use a form of hexadecimal coding:

        1. color_codes::= "#"red green blue. As the red, green, and blue numbers get smaller, the color gets darker. So "#000000" indicates black and "#FFFFFF" indicates white.
        2. red::=hexadecimal_digit^2.
        3. green::=hexadecimal_digit^2.
        4. blue::=hexadecimal_digit^2.

          Example color codings:
             		#0000FF	Blue
             		#00FFFF	Cyan
             		#00FF00	Green
             		#FFFF00	Yellow
             		#FF0000	Red
             		#FF00FF	Purple
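
          The color coding can be checked mechanically. The Python sketch below splits a code into its three components and compares a background with a text color. The function names are invented, and the "brightness" used here (a plain average of the components) is a crude assumption for illustration, not a standard measure of readability.

```python
import string

def parse_color(code):
    """Split a "#RRGGBB" color code into (red, green, blue) integers 0..255."""
    digits = code[1:]
    if (len(code) != 7 or code[0] != "#"
            or not all(c in string.hexdigits for c in digits)):
        raise ValueError(f"not a #RRGGBB color code: {code!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

def brightness(code):
    """Crude brightness: the average of the three components."""
    return sum(parse_color(code)) / 3

def contrast_gap(background, text):
    """How far apart two colors are on the crude brightness scale."""
    return abs(brightness(background) - brightness(text))
```

          A large contrast_gap between the background and the text colors suggests, roughly, that the page will still be readable on a black and white display.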

          Here is a set of body attributes that should be specified and values that are close to the Netscape "classic":
             	Background	BGCOLOR=#B8B8B8	Grey
             	Normal Text	TEXT=#000000 	Black
             	Unused Link	LINK=#0000FF 	Blue
             	Visited Link	VLINK=#8000AF	Purple
             	Active Link	ALINK=#FF0000 	Red


        Elements

      6. element::=special_text | header | list | image | paragraph #(start("P" ) paragraph) | break | horizontal_rule | link | form | ...
      7. break::= start("br" ).
      8. horizontal_rule::=start("hr" )...

      9. named::=start("a name=" name ) (_) end("a" ). Note: this is used in the form named(header) in this syntax; the argument (header) replaces the (_) above.

      10. hypertext_refed::=start("a href=" quote URL quote ) (_) end("a" ). Again, hypertext_refed(X) means start("a href=" quote URL quote ) X end("a" ).
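
      The parameterized definitions start, end, named and hypertext_refed read naturally as functions that wrap an element in tags. The Python sketch below mirrors them as string builders. It is an illustration only: the argument order is an assumption, attribute values are always quoted, and values are assumed to contain no double quote.

```python
def start(tag, **attributes):
    """start(X): lt X #attribute gt."""
    parts = [tag]
    for name, value in attributes.items():
        parts.append(f'{name}="{value}"')
    return "<" + " ".join(parts) + ">"

def end(tag):
    """end(X): lt "/" X gt."""
    return f"</{tag}>"

def named(name, element):
    """named(element): the element wrapped in an anchor that gives it a name."""
    return start("a", name=name) + element + end("a")

def hypertext_refed(url, element):
    """hypertext_refed(element): the element wrapped in a link to url."""
    return start("a", href=url) + element + end("a")

if __name__ == "__main__":
    print(hypertext_refed("http://www.example.com/", "An example link"))
```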

        Text

        The grammar given here fails to express a complicated set of rules about what elements can, cannot, or should not appear nested inside other elements. These are in the Document Type Definition (DTD) for HTML documents, written in the Standard Generalized Markup Language (SGML) and held at CERN (the European Organization for Nuclear Research).

      11. special_text::= |[ x:special_text_type] (start(x ) simpler_text end(x )).
      12. special_text_type::= "pre"|"listing"|"blockquote".

        Note. The above summarizes three different alternatives, with a different x in each one.

      13. header::= |[i:"1".."6"] ( start("H" i ) text end("H" i ) ).
      14. Note: the above describes the 6 levels of headers, with "H1" being most prominent and "H6" least prominent. The actual and relative styles and sizes cannot be specified but are chosen by the user and the browser.

      15. paragraph::= #(piece | text ),
      16. piece::= |[s:styles]( start(s ) text end(s ) ).
      17. styles::=logical_styles | physical_styles.

      18. logical_styles::= "em" | "strong" | "code" | "samp" | "kbd" | "var" | "dfn" | "cite" | "address",
      19. physical_styles::= "b" | "i" | "u" | "tt".
      20. Note. Physical styles are "deprecated".
      21. deprecated::=they don't like it because it is physical, not logical. (HTML is not for word processing!)

        The browser and user determine the precise meaning of these styles with the following guidelines:

        Table of Styles

         Style	Meaning
         em	Emphasized - "notice me"
         strong	Emphasized even more
         code	This is a piece of computer code
         samp	This is a sample of literal characters, such as program output
         kbd	This is text typed by the user on the keyboard
         var	This is a syntactic variable
         dfn	This is a definition
         cite	This is a citation of a source
         address	This is an address (Real or Email)
         b	looks bold (deprecated)
         i	looks italic (deprecated)
         u	looks underlined (deprecated)
         tt	looks like a typewriter (deprecated)

        The SGML DTD specifies rules about what is recommended, normal, and deprecated.

      22. text::=untagged_body & ( recommended_DTD_rules | DTD_rules | deprecated_DTD_rules),
      23. DTD::=Document Type Definition.

        Images

      24. image::=start("img src=" quote URL quote O("alt=" string) O("align=" alignment) O("ismap") ).
      25. alignment::="left" | "right" | "center" .

        The 'alt' attribute gives the text that is shown by a browser that does not show the image - some browsers do not show graphics. Some users turn off the graphics to get to the information quicker!

        The ismap attribute indicates that parts of the graphic are hot buttons that act as links to other pages etc. Maps take time to construct without some special purpose tools.

        Remember that each graphic takes time to transfer over the network. Animated graphics, in particular, are resource hogs. If you need to have a large and complex GIF then use the xv utility on UNIX to reduce it to thumbnail size and create a link to it like this:

         		<a href="bigfig.gif"><img src="thumbnail.gif" alt="Download a graphic!"></A>
        My personal page has an example [ me.html ]

        Lists

      26. list::=ordered_list |unordered_list | definition_list | menu |directory... .
      27. definition_list::= start("DL" O("compact") ) #( start("dt" ) term #(start("dd" ) text ) ) end("DL" ).
      28. term::=text.
      29. ordered_list::=start("OL") list_body end("OL" ).
      30. unordered_list::=start("UL") list_body end("UL" ).
      31. menu::=start("menu") list_body end("menu" ).
      32. directory::=start("dir") list_body end("dir" ).
      33. list_body::=#list_item.
      34. list_item::= start("li") text

        Lists are a simple and effective way to organize your message. Notice that an item in a list can be split into lines with <br>, paragraphs by <p> and pieces with <hr>.
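
        Notice that list_item has a start tag but no end tag, so each item runs until the next item or the end of the list. The Python sketch below shows one way to recover the items from such markup with the standard html.parser module. The class and function names are invented, and the sketch assumes flat (non-nested) lists.

```python
from html.parser import HTMLParser

LIST_TAGS = ("ol", "ul", "menu", "dir")

class ListItems(HTMLParser):
    """Collect the text of each list item in a flat HTML list."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive in lower case. A new start("li") closes the
        # previous item implicitly and opens the next one.
        if tag == "li":
            self.items.append("")
            self.in_item = True

    def handle_endtag(self, tag):
        # The end of the list closes the last item.
        if tag in LIST_TAGS:
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data

def list_items(source):
    """Return the stripped text of every list item in source."""
    parser = ListItems()
    parser.feed(source)
    parser.close()
    return [item.strip() for item in parser.items]
```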

        Tables

        Tables are not yet a standard part of HTML but the popularity of Netscape has made them a part of many pages:
      35. table::= start("table") #row end("table" ),
      36. row::=start("tr") #table_item end("tr" ),
      37. table_item::=start("td") #element end("td" ). -- ??probably??

      Forms

        .RoadWorksAhead See [ HTML_quick.html ]

        HTML forms are a quick and easy way to gather information from a user and send it through a [ Common Gateway Interface ] into a program on a server.

        The following program [ unpost.c ] is useful for converting CGI posted input into normal but URL-encoded standard input ready for a UNIX shell script or program to handle. [ URL Encoding ]

      Common Gateway Interface


      1. (glossary)|-
      2. CGI::=Common Gateway Interface. The CGI rules define how data is given to a program on the server. The program then runs and generates a page that is returned to the browser, using a standard format. [ CGI in www ]

        This program can be written in any language - but many prefer to use a language called Perl. Personally I use UNIX shell scripts. To interface an MIS database to the Web, you could even use COBOL. Almost certainly Java is going to be another popular way to write CGIs.
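
        As an illustration of the rules just described, here is a minimal sketch of a CGI program in Python. It assumes the form was sent with the GET method, so the URL-encoded data arrives in the QUERY_STRING environment variable. The function name respond and the form field called "name" are hypothetical. The "standard format" of the reply is a Content-Type header, a blank line, and then the page.

```python
import html
import os
from urllib.parse import parse_qs

def respond(environ):
    """Build the complete CGI reply (header, blank line, page) as a string."""
    # Undo the URL encoding of the submitted form fields.
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    # "name" is a hypothetical form field; fall back when it is missing.
    name = fields.get("name", ["stranger"])[0]
    # Encode the value with entities so user input cannot inject tags.
    page = f"<html><body><h1>Hello, {html.escape(name)}!</h1></body></html>"
    return "Content-Type: text/html\n\n" + page

if __name__ == "__main__":
    # The server runs the program and returns its standard output to the browser.
    print(respond(os.environ), end="")
```

        Keeping the reply in a function that takes the environment as an argument makes it possible to try the program without a web server, for example respond({"QUERY_STRING": "name=Ann+Lee"}).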

End