[Skip Navigation] [Remove Frame] [CS320] [Text Version] xref.html Sat Dec 23 08:01:08 PST 2006

Contents


    A Cross Reference Algorithm

      Problem.

      Given a set of defined terms, and a text, search for occurences of the terms, and if they match, mark them up as links reffering to the definition.

      Terms consist of a sequence of non-whitespace characters.

      The marking up consists of putting a fixed string in front of the term, then a second fxied string, then teh term again, and then a final string:

    1. <a href="#
    2. term
    3. ">
    4. term
    5. </A> However the precise strings are not important.

      First Dictionary

    6. Program::=(input=>i.text, output=>o.text, process=>p).

    7. i.text::= #(occurrence_of_term | non_occurence_of_term | whitespace).
    8. o.text::= #(marked_up_term | non_occurence_of_term | whitespace).

    9. marked_up_term::=prefix occurence_of_term infix occurence_of_term postfix.
    10. prefix::="<a href=\"",
    11. infix::="\">",
    12. postfix::="</A>".

      Difficulties

      HTML does not permit nested "anchors". So terms must not overlap.... given
    13. Terms:={ "abcd", "ab", "bc" } and text:="hello abcd hi ab ther bc" we need as output
    14. "hello <a href="#abcd">abcd</A> hi <a href="#ab">ab</A> ther <a href="#bc">bc</A>" but not:
    15. "hello <a href="#abcd"><a href="#ab">ab</A>cd</A> hi <a href="#ab">ab</A> ther <a href="#bc">bc</A>"

      Further we want to substitute the longest term in preferrence to the shorter ones. This makes the following reult incorrect:

    16. "hello <a href="#ab">ab</A>cd hi <a href="#ab">ab</A> ther <a href="#bc">bc</A>" This means that the normal search for a the first match and replace will not work.

      Idea

      Scan the text, with each term watching to see if it is the matching term. When there is more than one term macthing, continue the scan. They drop out, until: there is one perfect match or none left. If there is one then fix it up, other wise, reset every thing, output the non-term and continue.

      Design

      Treat each term as a separate entity. Each is responsible for recognizing whether or not it is a match.

    17. in.term::=matching | non_matching.
    18. matching::= #(equal_character) end_of_term_maker.

    19. non_matching::=#(equal_character) un_equal_character #(char) end_of_term_marker.

    20. ...

      Implementation

    21. ...

End