Tue Jan 13 17:31:49 PST 2004 xml.mth


    eXtensible Markup Language


      Somewhere [ acknowledgement1 ] between the complexity of SGML and the rigidity of HTML lies the eXtensible Markup Language (XML). XML lets you describe the structure of a document. In return, you must have a Document Type Definition (DTD) before an XML document can be checked for validity and processed properly.

      XML documents are well_formed if they follow some simple rules that allow them to be parsed. These rules are outlined below. A well_formed document is also valid if it matches a Document Type Definition (DTD). The DTD has to be declared at the start of the document along with things like the XML version and the character encoding. There are many different DTDs for different purposes.

      XML documents can be processed if the processing is described in XSL. To display an XML document you need to supply some kind of mapping into a particular "style". Thus we now have style sheet languages : XSL, PSL, P, ... See the W3C information [ Style ] on style sheets and style sheet languages.

      Examples of Document Types

    1. SimpleNovel::= See

      (MathML): Structure of mathematical formulae. [ REC-MathML ] with syntax [ appendixE.html ] + XML dsssl stylesheets rtf tex jade [ mml-files ]

      (OpenMath): [ ]

      (SVG): Scalable Vector Graphics -- [ ]

      (W3D): Replacement for the Virtual Reality Modeling Language. [ Specifications ]

      (HRMML): XML based Human Resource Management Markup Language: [ main.html ]

      (DocBook): Structure of documentation for software documents. DocBook is (in 1999) actually an SGML based way to document software. See [ DocBook in comp.text.SGML ] for more information.

      (XMI): XMI::="Presents meta-data for modeling objects", by CORBA and the Object Management Group. The following needs an Id and password //

      XMI is also integrated with the Unified Modeling Language 1.3 standard [ uml.html ]

      (CBL): XML Common Business Library [ cblfaq.html ]

      There are many more sample DTDs at [ resources.html?keys=*5266 ]

      XML examples

      [ ]

      XML information by Dick Baldwin [ ], hosted by the organization that was previously known as The Mining Company.

      Well-Formed Documents

      XML resembles HTML, but there are vital differences:
      1. None of the tags used in HTML are predefined in XML.
      2. You can add new tags to XML.
      3. XML is case sensitive: <Title> and <title> are different tags.
      4. In XML white space is significant: the parser passes white space in content on to the application.
      5. XML is not about layout and look-and-feel. It is about structure and meaning.
      6. There are five predefined entities: gt(>), lt(<), quot("), amp(&), apos(').
      7. End tags are never omitted. <t....> ... </t>
      8. There is a special kind of tag for an empty element, which encloses no content: <.../>
      9. Comments look like this <!-- ..... -->
      10. Processing instructions can be embedded <?....?>
      11. Attributes always have a name and a value, and the value is always between double quotes or apostrophes: name="value" or name='value'.
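
      Several of the differences listed above can be checked mechanically. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of these notes); the helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Each sample exercises one of the rules in the list above.
samples = {
    "matched tags": "<title>War and Peace</title>",
    "case mismatch": "<Title>War and Peace</title>",
    "omitted end tag": "<p>one<br>two</p>",
    "empty element": "<p>one<br/>two</p>",
    "unquoted attribute": "<a href=page.html>x</a>",
    "apostrophe-quoted attribute": "<a href='page.html'>x</a>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

      The HTML habits (mixed case, omitted end tags, unquoted attribute values) are all rejected, while the empty-element form and apostrophe quoting are accepted.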

      Syntax of Well Formed Documents

      Here is a simple description of all documents that might be in XML -- ignoring the context dependencies:
    2. XMLBNF::=following,
        After a prolog, comes a single entity called the root, and then some miscellaneous stuff that is probably meaningless:
      1. document::= prolog root miscellaneous.

      2. prolog::=xml_type #comment dtd. A well formed document should start with a prolog that identifies the version of XML it uses. For example
         		 <?xml version="1.0"?>
        declares that the document uses version 1.0 of XML. The prolog should also identify the character encoding, especially if you need to use any non-"ASCII" characters. It can also identify some namespaces:

      3. root::= tagged_element | empty_element.

      4. miscellaneous::= #(comment | processing | WS).
      5. WS::=white space.

      6. tagged_element::= "<" tag #attribute ">" content "</" tag ">", -- the tag at the start and end must be the same. To be valid the tag must be defined in a DTD and have attributes and content that match the rules in the DTD. A tagged element contains other data -- between the two tags.
         		<title>War and Peace</title>

      7. empty_element::="<" empty_tag #attribute "/>".
         		<timestamp date="1999/06/22" time="11:00"/>
      8. singleton::= empty_element.

      9. content::= #( parsed_data | element | comment ), the valid sequences of pieces in a content are described by a regular expression form in the DTD. An element is either a tagged_element or an empty_element:
      10. (element) |- element==>tagged_element | empty_element.
      11. parsed_data::= #(char ~ ("<" | "&") | entity ), only "<" and "&" must always be written as entities in character data.
      12. entity::= predefined_entity | defined_entity.
      13. predefined_entity::=gt | lt | quot | amp | apos,
        1. gt::="&gt;", stands for ">".
        2. lt::="&lt;", stands for "<".
        3. quot::="&quot;", stands for "\"".
        4. amp::="&amp;", stands for "&".
        5. apos::="&apos;", stands for "'".

      14. comment::= "<!--" ... "-->".
          	<!-- this is a comment -->
      15. attribute::= name "=" quoted_value.
      16. quoted_value::=quotes value quotes | apostrophe value apostrophe.
      17. quotes::="\"".
      18. apostrophe::="'".

      19. defined_entity::=defined in prolog.
      20. parsed_data::=defined in prolog.
      21. tag::=defined in prolog or namespace,
      22. |-tag ==> O( namespace ":") name.

      23. name::=defined in prolog.
      24. value::=defined in prolog.

      (End of Net XMLBNF)
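
      To see the pieces of this grammar in one place, here is a small sketch, assuming Python's standard xml.etree.ElementTree parser. The sample document is invented for illustration, reusing the title and timestamp examples above; it contains a prolog, a root, tagged and empty elements, attributes quoted both ways, predefined entities, and trailing miscellaneous material.

```python
import xml.etree.ElementTree as ET

# prolog, then a single root element, then miscellaneous trailing material
document = """<?xml version="1.0"?>
<!-- a comment in the prolog -->
<novel>
  <title lang='en'>War &amp; Peace</title>
  <timestamp date="1999/06/22" time="11:00"/>
  <quote>&lt;He said &quot;no&quot;&gt;</quote>
</novel>
<!-- miscellaneous: a trailing comment -->
"""

root = ET.fromstring(document)
title = root.find("title")      # tagged_element with one attribute
stamp = root.find("timestamp")  # empty_element with two attributes
quote = root.find("quote")      # content made of predefined entities

print(root.tag)
print(title.text, title.attrib)
print(stamp.attrib, stamp.text)
print(quote.text)
```

      The parser replaces each predefined entity by the character it stands for, and an empty element has attributes but no text content.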

      The actual rule for quoting is a little more complex in that the quote character can not appear inside the value:

    3. quoted_value::= |[ q:quotes|apostrophe ]( q #(char~q) q ), the union, with q equal to quotes or apostrophe, of q followed by any number of characters other than q followed by q.
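
      The quoting rule above can be sketched as a regular expression with a back-reference. This is only an illustration, assuming Python's re module; the names QUOTED_VALUE and parse_quoted_value are hypothetical, and the sketch ignores the other restrictions XML places on attribute values.

```python
import re

# q is a double quote or an apostrophe; the value is any run of characters
# other than q; the closing quote must be the same character as the opening one.
QUOTED_VALUE = re.compile(
    r"""(?P<q>["'])(?P<value>(?:(?!(?P=q)).)*)(?P=q)""", re.DOTALL
)


def parse_quoted_value(text):
    """Return the value inside a quoted_value, or None if text is not one."""
    match = QUOTED_VALUE.fullmatch(text)
    return match.group("value") if match else None


print(parse_quoted_value('"value"'))
print(parse_quoted_value("'say \"hi\"'"))
print(parse_quoted_value('"a"b"'))
```

      An apostrophe may appear inside a value delimited by double quotes and vice versa, but the delimiting character itself may not.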


      To be valid the entities, tags and their attributes must match a set of rules given in a DTD.

      Suppose that a DTD specifies a set of normal tag names T and a set of content-free (empty element) tag names C; for each tag t:T|C, a set of attribute names N(t); and for each tag t:T|C, quote character q:quotes|apostrophe, and attribute n:N(t), a set of valid values V(t,n,q). Let D be the raw data in our document. Then define

    4. a(t)::= ![n:N(t), q:quotes|apostrophe](n "=" q V(t,n,q) q), a sequence of names with valid quoted values, and
    5. c(t, e)::=an expression describing the valid content of tag t in terms of elements e, and then an element of type t, is defined by
    6. e(t)::= [t:T]("<" t a(t) ">" c(t, e) "</" t ">") | [t:C]("<" t a(t) "/>"), and an element is the union over all tags
    7. element::= D | |[t:T|C](e(t)). Note
      1. There is a trick above: the content expression c(t,e) depends on all the elements, with e a function associating tag names with elements of that type. It is probably best to think of e as an array or vector indexed by tag names. The resulting grammar is context dependent but can be formalized using only a small variation of context free grammars.

        The "data" (D above) can include elements that indicate some processing to be done to the data like this "<?.....?>".

      2. processing::= "<?" tag parameters "?>".

        It is possible to name things (like files of data or strings) and use the names in place of the things -- but the rules are a little convoluted.
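
      Processing instructions of the form described above are reported to an application along with the rest of the document. Here is a minimal sketch, assuming Python's standard xml.dom.minidom module; the instruction names xml-stylesheet and page-break and the file name novel.xsl are used only as illustrations.

```python
from xml.dom import minidom

document = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="novel.xsl"?>
<novel><?page-break after="title"?><title>War and Peace</title></novel>
"""

dom = minidom.parseString(document)


def processing_instructions(node):
    """Collect (target, data) pairs for every processing instruction under node."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.PROCESSING_INSTRUCTION_NODE:
            found.append((child.target, child.data))
        found.extend(processing_instructions(child))
    return found


pis = processing_instructions(dom)
for target, data in pis:
    print(target, "->", data)
```

      Note that the parser treats everything after the target name as an uninterpreted string: it is up to the application to decide what the parameters mean.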

      Document Type Declarations

      The dtd above is a document type declaration and has many forms. Here are some simple ones:
    8. dtd::= "<!DOCTYPE " WS name O(WS externalId) OWS O( localdtd ) ">".
    9. externalId::= ("PUBLIC" | "SYSTEM") WS string_identifying_a_dtd_file.

    10. localdtd::= "[" #(markup_declaration| ... | WS) "]" OWS. Local DTDs are interpreted before external ones so that they can define terms used in the external ones. Unlike most other languages, the first definition of a markup overrides the later ones. Thus local DTDs both override and inform the external ones!

      The DOCTYPE defines the structure that the root element of the document must have for the document to be valid.
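
      The rule that the first definition wins can be observed with a local DTD that declares the same entity twice. This is a sketch, assuming Python's standard xml.etree.ElementTree (built on the expat parser), which reads the local DTD but does not validate; the element name note and the entity names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A local DTD (internal subset) declaring the entity "author" twice.
document = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY author "R J Botting">
  <!ENTITY author "Somebody Else">
  <!ENTITY course "CSci320">
]>
<note>&author; teaches &course;</note>
"""

note = ET.fromstring(document)
print(note.text)
```

      The second declaration of author is silently ignored, so the text of the note uses the first one.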

    11. markup_declaration::=element_declaration|entity_declaration|attribute_list_declaration | notation_declaration | process_indication | WS.

    12. element_declaration::="<!ELEMENT" element_name type_description ">".
    13. element_name::@name, the set of names occurring in element_declarations.

    14. attribute_list_declaration::="<!ATTLIST" element_name #attribute_declaration ">", attaches a set of attributes to the element named.
    15. attribute_declaration::=attribute_name attribute_type attribute_default.
    16. attribute_name::@name, the set of names appearing in attribute declarations.
    17. attribute_type::= "CDATA" | "ENTITY" | "NMTOKEN" | "NMTOKENS" | "ID" | "IDREF" | "IDREFS".
      type     Syntax of attribute values   Semantics
      CDATA    quoted string                block of text
      ENTITY   TBA                          Name of data
      ID       identifier                   Can be the target of an IDREF
      IDREF    identifier                   refers to another ID

    18. attribute_default::= required | implied | fixed | default_value.
    19. default_value::=literal data token.
    20. required::="#REQUIRED", implies that the element must specify a value and so no default is needed.
    21. implied::="#IMPLIED", no default is given and no value has to be given. Note however if the attribute name is mentioned it must be assigned a value.
    22. fixed::="#FIXED" default_value, meaning that the default is also the only value and so cannot be changed in any occurrence.

    23. entity_declaration::="<!ENTITY" O("%") entity_name entity_meaning ">".
    24. entity_name::@name, the set of names occurring in entity_declarations. These add new entities. An entity is an abbreviation. Some (with the '%') are to be used in DTDs and are expanded there. They are written as "%"entity_name";" and are replaced by the associated entity_meaning as the DTD is elaborated. Others (with no "%") are ready to be used in an actual XML document in the form "&"entity_name";".

    25. notation_declaration::="<!NOTATION" TBA ">".

    26. CDATA_section::= "<![CDATA[" TBA "]]>".

    27. pcdata::="#PCDATA", keyword indicating a block of parsed character data -- but no XML style marking up.
    28. identifier

      More TBA.
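
      The attribute defaults defined above (#REQUIRED, #IMPLIED, #FIXED and a plain default_value) can be seen at work in the following sketch. It assumes Python's standard xml.etree.ElementTree, whose underlying expat parser reads a local DTD and supplies declared defaults but does not validate, so a missing #REQUIRED attribute would not be reported. The library and book elements and their attributes are invented for the example.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<!DOCTYPE library [
  <!ELEMENT library (book*)>
  <!ELEMENT book (#PCDATA)>
  <!ATTLIST book
      id      ID     #REQUIRED
      lang    CDATA  "en"
      format  CDATA  #FIXED "text"
      shelf   CDATA  #IMPLIED>
]>
<library><book id="b1">War and Peace</book><book id="b2" lang="ru">Anna Karenina</book></library>
"""

books = ET.fromstring(document).findall("book")
attrs = [dict(book.attrib) for book in books]
for book, attributes in zip(books, attrs):
    print(book.text, attributes)
```

      The first book picks up the default lang, both books receive the fixed format, and the implied shelf attribute simply does not appear when it is not given.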

      Standards on the WWW

      W3C specifications [ REC-xml-19980210 ] and Tim Bray's Annotated Specification [ axml.html ]


    29. FOP::= See, XSL to PDF converter.
    30. XT::= See, processes XSL transformations.


      Parsers

      1. IBM [ ]
      2. Apache XML Project's Xerces Java [ ]
      3. James Clark's XP [ ]
      4. Microstar's Aefred [ ]
      5. Sun's Java API for XML [ ]
      6. Oracle's XML parser [ ]

      . . . . . . . . . ( end of section Parsers)


      Lars Marius Garshol wrote (comp.text.xml, 13 May 1999): "The namespace URI does not point to anything meaningful, it's just a globally unique identifier. So your application will have to understand the DTD to make use of its elements. It would need that even if the URI did refer to a DTD. But at least they are now identified as being fitting elements, and your application can make a decision as to whether it should just ignore them or whether it should try to support them."
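
      The point that a namespace URI is only an identifier can be illustrated with a small sketch, assuming Python's standard xml.etree.ElementTree. The URI http://example.org/hypothetical/math is made up, and nothing is fetched from it: the parser simply attaches it to the element names.

```python
import xml.etree.ElementTree as ET

# Two different prefixes bound to the same (hypothetical) namespace URI,
# plus one element that is in no namespace at all.
document = """<doc xmlns:m="http://example.org/hypothetical/math"
     xmlns:n="http://example.org/hypothetical/math">
  <m:formula>x</m:formula>
  <n:formula>y</n:formula>
  <formula>z</formula>
</doc>"""

root = ET.fromstring(document)
tags = [child.tag for child in root]
for tag in tags:
    print(tag)
```

      The first two elements get the same qualified name because their prefixes map to the same URI, while the unprefixed formula is a different element. An application can use this to decide which elements it supports and which it ignores.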


    31. API::="Application Programmers Interface".

    32. DOM::="Document Object Model".
    33. DTD::="Document Type Definition", [ DTD in comp.text.SGML ]

    34. HTML::markup_language= HTML_glossary & HTML_syntax.
    35. HTML_glossary::= See
    36. HTML_syntax::= See

    37. language::="a set of syntactic and semantic rules defining the correct form, structure, and meaning of strings of characters", the chief product of computer science research.

    38. ML::="in an acronym often indicates a markup_language" | "a programming language".
    39. markup_language::language="a language that describes how to mark up text to give it added meaning, richness, or layout and style".

    40. For x, O(x)::= an optional x.

    41. P::stylesheet_language, [ Thot ] the Thot structured document language and the P stylesheet language.
    42. PSL::stylesheet_language, part of the Proteus library and style sheet library. [ ~multimedia ]

    43. SGML::markup_language="Standard Generalized Markup Language", [ comp.text.SGML.html] .
    44. SAX::="Simple API for XML".
    45. stylesheet::="A description in a special stylesheet_language of the way a user or client wants some data interpreted and/or displayed".
    46. stylesheet_language::="A computer language defining how to specify the style for displaying or processing a document".

    47. TBA::="To Be Announced".

    48. XML::markup_language="eXtensible Markup Language". See the BNF syntax XMLBNF above or the W3C specs [ REC-xml-19980210 ] or Tim Bray's Annotated Specification [ axml.html ] or the Italian translation

    49. XSL::stylesheet_language="eXtensible Stylesheet Language". [ REC-xml ]

    50. element::=an identifiable(and so tagged) piece of data.
    51. entity::=a string that symbolizes a character | something that contains data.

      See Also

      The Annotated XML Spec at [ axml.html ]

      Mapping runtime objects into XML formatted data [ XML_Serialization ] [ xtal.html ]

    . . . . . . . . . ( end of section eXtensible Markup Language)


    Thanks to "Edward Szumski"


  1. Larry Evans for correcting my many other errors.

Formulae and Definitions in Alphabetical Order