1[comment {-*- tcl -*- doctools manpage}]
2[manpage_begin htmlparse n 1.2]
3[moddesc   {HTML Parser}]
4[titledesc {Procedures to parse HTML strings}]
5[category  {Text processing}]
6[require Tcl 8.2]
7[require struct::stack 1.3]
8[require cmdline 1.1]
9[require htmlparse [opt 1.2]]
10[description]
11[para]
12
13The [package htmlparse] package provides commands that allow libraries
14and applications to parse HTML in a string into a representation of
15their choice.
16
17[para]
18The following commands are available:
19
20[list_begin definitions]
21
22
23[call [cmd ::htmlparse::parse] [opt "-cmd [arg cmd]"] [opt "-vroot [arg tag]"] [opt "-split [arg n]"] [opt "-incvar [arg var]"] [opt "-queue [arg q]"] [arg html]]
24
25This command is the basic parser for HTML. It takes an HTML string,
26parses it and invokes a command prefix for every tag encountered. It
27is not necessary for the HTML to be valid for this parser to
28function. It is the responsibility of the command invoked for every
29tag to check this. Another responsibility of the invoked command is
30the handling of tag attributes and character entities (escaped
31characters). The parser provides the un-interpreted tag attributes to
32the invoked command to aid in the former, and the package at large
33provides a helper command, [cmd ::htmlparse::mapEscapes], to aid in
34the handling of the latter. The parser [emph does] ignore leading
35DOCTYPE declarations and all valid HTML comments it encounters.
36
37[para]
38
All information beyond the HTML string itself is specified via
options, which are explained below.
41
42[para]
43
To help in understanding the options, here is some background
information about the parser.
46
47[para]
48
The parser is capable of detecting incomplete tags in the HTML string
given to it. Under normal circumstances this will cause the parser to
throw an error, but if the option [option -incvar] is used to specify a
global (or namespace) variable, the parser will store the incomplete
part of the input into this variable instead. This aids greatly in the
handling of incrementally arriving HTML, as the parser will handle
whatever it can and defer the handling of the incomplete part until
more data has arrived.
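
[para]
As a minimal sketch of this behaviour, the following feeds the parser
two chunks, the first of which ends in the middle of a tag. The
callback [cmd recordTag] and the variables [var ::tags] and
[var ::pending] are names made up for this illustration. The sketch
assumes that the parser itself prepends the stored remainder to the
input of the next call made with the same [option -incvar] variable.

[example {
package require htmlparse

# Hypothetical callback: remember every tag the parser reports.
proc recordTag {tag slash param textBehindTheTag} {
    lappend ::tags "$slash$tag"
}

set ::tags    {}
set ::pending {}

# The first chunk stops inside the <b> tag. Without -incvar this
# call would raise an error.
::htmlparse::parse -cmd recordTag -incvar ::pending {<p>Hello <b}
puts "deferred: $::pending"

# Assumption: the remainder kept in ::pending is picked up by the
# parser and completed by the second chunk.
::htmlparse::parse -cmd recordTag -incvar ::pending {>world</b></p>}
puts "tags seen: $::tags"
}]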
57
58[para]
59
Another feature of the parser is its two possible modes of
operation. The normal mode is activated if the option [option -queue] is
not present on the command line invoking the parser. If it is present,
the parser will go into the incremental mode instead.
64
65[para]
66
67The main difference is that a parser in normal mode will immediately
68invoke the command prefix for each tag it encounters. In incremental
69mode however the parser will generate a number of scripts which invoke
70the command prefix for groups of tags in the HTML string and then
71store these scripts in the specified queue. It is then the
72responsibility of the caller of the parser to ensure the execution of
73the scripts in the queue.
74
75[para]
76
[emph Note]: The queue object given to the parser has to provide the
same interface as the queue defined by the tcllib package
[package struct::queue]. This means, for example, that all queues
created via that package can be used here immediately. Still, the
queue does not have to come from [package struct::queue] as long as
the same interface is provided.
82
83[para]
84In both modes the parser will return an empty string to the caller.
85
86[para]
The [option -split] option may be given to a parser in incremental mode
to specify the size of the groups it creates. In other words,
[option -split] 5 means that each of the generated scripts will invoke
the command prefix for 5 consecutive tags in the HTML string. A parser
in normal mode will ignore this option and its value.
92
93[para]
The option [option -vroot] specifies a virtual root tag. A parser in
95normal mode will invoke the command prefix for it immediately before
96and after it processes the tags in the HTML, thus simulating that the
97HTML string is enclosed in a <vroot> </vroot> combination. In
98incremental mode however the parser is unable to provide the closing
99virtual root as it never knows when the input is complete. In this
100case the first script generated by each invocation of the parser will
101contain an invocation of the command prefix for the virtual root as
102its first command.
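
[para]
The effect of the virtual root in normal mode can be sketched as
follows. The callback [cmd recordTag], the variable [var ::tags], and
the root name [const document] are made up for this illustration.

[example {
package require htmlparse

# Hypothetical callback: collect tag names, closing tags with a slash.
proc recordTag {tag slash param textBehindTheTag} {
    lappend ::tags "$slash$tag"
}

set ::tags {}
::htmlparse::parse -cmd recordTag -vroot document {<p>Hello</p>}

# In normal mode the first and last entries should be the opening
# and closing virtual root.
puts $::tags
}]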
103
[para]
The following options are available:
105
106[list_begin definitions]
107
108[def "[option -cmd] [arg cmd]"]
109
The command prefix to invoke for every tag in the HTML
string. Defaults to [cmd ::htmlparse::debugCallback].
112
113[def "[option -vroot] [arg tag]"]
114
115The virtual root tag to add around the HTML in normal mode. In
116incremental mode it is the first tag in each chunk processed by the
117parser, but there will be no closing tags. Defaults to
[const hmstart].
119
120[def "[option -split] [arg n]"]
121
122The size of the groups produced by an incremental mode parser. Ignored
123when in normal mode. Defaults to 10. Values <= 0 are not allowed.
124
125[def "[option -incvar] [arg var]"]
126
The name of the variable in which to store any incomplete HTML. This
makes most sense in incremental mode. If the parser sees incomplete
HTML and has no place to store it, it will throw an error, which is
the sensible behaviour in normal mode. Only incomplete tags are
detected, not missing tags. Optional; by default no variable is used.
132
133[list_end]
134
[para]
[list_begin definitions]
[def [emph "Interface to the command prefix"]]
138
139In normal mode the parser will invoke the command prefix with four
140arguments appended. See [cmd ::htmlparse::debugCallback] for a
141description.
142
143[para]
144
In incremental mode, however, the generated scripts will invoke the
command prefix with five arguments appended. The last four of these
are the same as those mentioned above. The first is a placeholder
string ([const "@win@"]) for a clientdata value to be supplied later,
during the actual execution of the generated scripts. This could be a
Tk window path, for example. This allows the user of this package to
preprocess HTML strings without committing them to a specific window
or other object during parsing; the connection can be made later. It
also means that it is possible to cache preprocessed HTML. Of course,
nothing prevents the user of the parser from replacing the placeholder
with an empty string.
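
[para]
A minimal sketch of the incremental mode, under the assumption that
the queue comes from [package struct::queue]. The callback
[cmd render], the variable [var ::log], and the clientdata value
[const .text] are made up for this illustration. The caller drains the
queue, substitutes the placeholder, and evaluates each script.

[example {
package require struct::queue
package require htmlparse

# Hypothetical callback for incremental mode: five arguments, the
# first being the clientdata value substituted for @win@.
proc render {win tag slash param textBehindTheTag} {
    lappend ::log [list $win "$slash$tag"]
}

set ::log {}
set q [::struct::queue]

# Nothing is invoked here; the parser only fills the queue.
::htmlparse::parse -cmd render -split 2 -queue $q {<p>Hello <b>world</b></p>}

# Executing the generated scripts is up to the caller. Note that this
# simple string map would also touch a literal @win@ in the text.
while {[$q size] > 0} {
    set script [string map {@win@ .text} [$q get]]
    uplevel #0 $script
}
puts $::log
$q destroy
}]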
156
157[list_end]
158
159[call [cmd ::htmlparse::debugCallback] [opt [arg clientdata]] [arg "tag slash param textBehindTheTag"]]
160
This command is the standard callback used by the parser in
[cmd ::htmlparse::parse] if none was specified by the user. It simply
dumps its arguments to stdout. The callback can be used for both the
normal and the incremental mode of the calling parser; in other words,
it accepts four or five arguments. The last four arguments are always
present. The optional first argument contains the clientdata value
passed to the callback by a parser in incremental mode. All callbacks
have to follow the signature of this command in the last four
arguments, and callbacks used in incremental parsing have to follow
it in all five arguments.
172
173[para]
174
175The first argument, [arg clientdata], is optional and present only if
176this command is invoked by a parser in incremental mode. It contains
177whatever the user of this package wishes.
178
179[para]
180
The second argument, [arg tag], contains the name of the tag which is
currently being processed by the parser.
183
184[para]
185
186The third argument, [arg slash], is either empty or contains a slash
187character. It allows the callback to distinguish between opening
188(slash is empty) and closing tags (slash contains a slash character).
189
190[para]
191
192The fourth argument, [arg param], contains the un-interpreted list of
193parameters to the tag.
194
195[para]
196
197The fifth and last argument, [arg textBehindTheTag], contains the text
198found by the parser behind the tag named in [arg tag].
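
[para]
As an illustration of this signature, here is a sketch of a
normal-mode callback that tells opening from closing tags via
[arg slash] and decodes the trailing text with
[cmd ::htmlparse::mapEscapes]. The names [cmd report] and
[var ::events] are made up for this example.

[example {
package require htmlparse

# Hypothetical four-argument callback for normal mode.
proc report {tag slash param textBehindTheTag} {
    if {[string equal $slash "/"]} {
        set kind closing
    } else {
        set kind opening
    }
    # The parser hands over the text as found, entities included.
    set text [::htmlparse::mapEscapes $textBehindTheTag]
    lappend ::events [list $kind $tag $param $text]
}

set ::events {}
::htmlparse::parse -cmd report {<p class="x">A &amp; B <b>bold</b> tail</p>}

foreach event $::events {
    puts $event
}
}]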
199
200[call [cmd ::htmlparse::mapEscapes] [arg html]]
201
This command takes an HTML string, substitutes all escape sequences
203with their actual characters and then returns the resulting string.
204HTML strings which do not contain escape sequences are returned
205unchanged.
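
[para]
A short usage sketch; the variable names are made up for this
illustration.

[example {
package require htmlparse

set raw     {Tom &amp; Jerry &lt;cartoon&gt;}
set decoded [::htmlparse::mapEscapes $raw]

# prints: Tom & Jerry <cartoon>
puts $decoded
}]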
206
207[call [cmd ::htmlparse::2tree] [arg {html tree}]]
208
This command is a wrapper around [cmd ::htmlparse::parse] which takes
an HTML string (in [arg html]) and converts it into a tree containing
the logical structure of the parsed document. The name of the tree is
given to the command as its second argument ([arg tree]). The command
does [emph not] generate the tree by itself but expects the caller to
provide it with an existing and empty tree. It also expects that the
specified tree object follows the same interface as the tree object of
the tcllib package [package struct::tree]. The tree does not have to
come from [package struct::tree], but it must provide the same
interface.
218
219[para]
220
The internal callback does some basic checking of HTML validity and
tries to recover from the most basic errors. The command returns the
contents of its second argument, that is, the name of the tree. Side
effects are the creation and manipulation of nodes in the tree object.
225
226[para]
227
Each node in the generated tree represents one tag in the input. The
name of the tag is stored in the attribute [emph type] of the
node. Any HTML attributes coming with the tag are stored unmodified in
the attribute [emph data] of the node. In other words, the command does
[emph not] parse HTML attributes into their names and values.
233
234[para]
235
If a tag contains text, its node will have children of type
237[emph PCDATA] containing this text. The text will be stored in the
238attribute [emph data] of these children.
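
[para]
The following sketch builds a tree and prints its structure using the
[emph type] and [emph data] attributes described above. It assumes
version 2 of [package struct::tree] (methods [method keyexists],
[method get], [method children], and [method rootname] with the
argument order used here). The helper [cmd dumpNode] is made up for
this illustration.

[example {
package require struct::tree
package require htmlparse

# Hypothetical helper: print the subtree below a node, one line per
# node, indented by depth.
proc dumpNode {t node {indent ""}} {
    if {[$t keyexists $node type]} {
        set type [$t get $node type]
        if {[string equal $type PCDATA]} {
            puts "$indent'[$t get $node data]'"
        } else {
            puts "$indent<$type>"
        }
    }
    foreach child [$t children $node] {
        dumpNode $t $child "$indent  "
    }
}

# The caller creates the (empty) tree; 2tree only fills it.
set t [::struct::tree]
::htmlparse::2tree {<html><body><p>Hello <b>world</b></p></body></html>} $t
dumpNode $t [$t rootname]
}]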
239
240[call [cmd ::htmlparse::removeVisualFluff] [arg tree]]
241
242This command walks a tree as generated by [cmd ::htmlparse::2tree] and
243removes all the nodes which represent visual tags and not structural
244ones. The purpose of the command is to make the tree easier to
245navigate without getting bogged down in visual information not
246relevant to the search. Its only argument is the name of the tree to
247cut down.
248
249[call [cmd ::htmlparse::removeFormDefs] [arg tree]]
250
Like [cmd ::htmlparse::removeVisualFluff] this command is here to cut
down on the size of the tree as generated by
[cmd ::htmlparse::2tree]. It removes all nodes representing forms and
form elements. Its only argument is the name of the tree to cut down.
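
[para]
A sketch of both pruning commands applied to the same tree, comparing
the number of nodes before and after. It assumes that
[package struct::tree] provides the tree and that its [method size]
method reports the number of nodes below the root. Which tags count as
visual or as form elements is decided by the package, so no particular
node counts are claimed here.

[example {
package require struct::tree
package require htmlparse

set html {<body><form action="/go"><input name="q"></form><p>Some <b>bold</b> text</p></body>}

set t [::struct::tree]
::htmlparse::2tree $html $t
set before [$t size]

::htmlparse::removeVisualFluff $t
set afterFluff [$t size]

::htmlparse::removeFormDefs $t
set afterForms [$t size]

puts "nodes: $before -> $afterFluff -> $afterForms"
}]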
256
257[list_end]
258
259[section {BUGS, IDEAS, FEEDBACK}]
260
This document, and the package it describes, will undoubtedly contain
bugs and other problems.

Please report such in the category [emph htmlparse] of the
[uri {http://sourceforge.net/tracker/?group_id=12883} {Tcllib SF Trackers}].

Please also report any ideas you may have for enhancing either the
package or its documentation.
269
270
271[see_also struct::tree]
272[keywords html parsing tree queue]
273[manpage_end]
274