1<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
2    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
3<html xmlns="http://www.w3.org/1999/xhtml">
4<head>
5<meta name="generator" content="HTML Tidy, see www.w3.org" />
6<title>HTML TIDY - Notes on pending work</title>
7<meta name="keywords"
8content="HTML, validation, error correction, pretty-printing" />
9<meta name="author" content="Dave Raggett &lt;dsr@w3.org&gt;" />
10<style type="text/css">
11  body { 
12    margin-left: 10%; 
13    margin-right: 10%; 
14    font-family: sans-serif
15  }
16  h1 { margin-left: -8% }
17  h2,h3,h4,h5,h6 { margin-left: -4% }
18  pre { color: green; font-weight: bold;
19   font-size: 80%; font-family: monospace}
20  em { font-style: italic; font-weight: bold }
21  strong { text-transform: uppercase; font-weight: bold }
22  .note {font-style: italic; color: rgb(192, 101, 101) }
  /* hr {text-align: center; width: 60% } */
24  blockquote {
25    color: navy;
26    margin-left: 1%;
27    margin-right: 1%;
28    text-align: center;
29    font-family: "Comic Sans MS", "Times New Roman", serif
30  }
31  table {
32    font-family: sans-serif;
33    font-size: 80%;
34    background: rgb(255,255,153)
35  }
36  td {
37    font-size: 80%
38  }
39  .people {font-family: "Lucida Calligraphy", serif}
40  :link { color: rgb(0, 0, 153) }
41  :visited { color: rgb(153, 0, 153) }
42  :active { color: rgb(255, 0, 102) }
  a:hover { color: rgb(0, 0, 255) }
44</style>
45
46<style type="text/css">
47 p.c1 {font-style: italic}
48</style>
49</head>
50<body bgcolor="#FFFFFF" background="grid.gif" text="black"
51link="navy" vlink="black" alink="red">
52<h1>HTML TIDY - Notes on Pending Work</h1>
53
54<p><a href="http://www.w3.org/People/Raggett">Dave Raggett</a> <a
55href="mailto:dsr@w3.org">dsr@w3.org</a></p>
56
57<p>This is a page where I am keeping the suggestions for
58improvements or bug fixes. My current work load means that I
59don't get much time to work on HTML Tidy, so I am interested in
60offers of help!</p>
61
62<h4>Public Email List for Tidy: &lt;<a
63href="mailto:html-tidy@w3.org">html-tidy@w3.org</a>&gt;</h4>
64
65<p>I have set up an archived mailing list devoted to Tidy. To
66subscribe send an email to html-tidy-request@w3.org with the word
subscribe in the subject line (use the word unsubscribe instead if
68you want to unsubscribe). The <a
69href="http://lists.w3.org/Archives/Public/html-tidy/">archive</a>
70for this list is accessible online. Please use this list to
71report errors or enhancement requests.</p>
72
73<h2>Things awaiting further attention</h2>
74
75<ul>
76<li>Support for BIG5 and ShiftJIS (Rick Jelliffe)</li>
77
78<li>Stronger checking on which attributes appear on what
79elements</li>
80
81<li>Sorting attributes in a canonical order</li>
82
83<li>Version checking for HTML 4.01 vs 4.0 (Tidy currently will
84set the document type to 4.01 in preference to 4.0)</li>
85
<li>Noticing that the document isn't really XHTML if it isn't
well-formed, i.e. it lacks end tags or quotes on attribute
values</li>
89
90<li>Converting &lt;font face="Symbol"&gt;a&lt;/font&gt; etc. to
91the corresponding Unicode characters, when cleaning HTML.</li>
92
<li>Link checking - this would involve some platform dependent
94code as the network interface varies significantly from one
95platform to the next.</li>
96
97<li>When exporting Word2000 to Web page, there is a need for
98smarter rules of thumb for working out whether the paragraph is a
bulleted or numbered list item, and determining the level of
100nesting. Perhaps the style attribute holds the key? This tends to
101include substrings like: "mso-list:l0 level1 lfo2;" and
102"mso-list:l1 level1 lfo1;". Unfortunately, these aren't always
103present, and I have yet to figure out a foolproof heuristic.</li>
104</ul>
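
<p>When the mso-list hint from the last item is present, it can at least be picked apart mechanically. The following is a minimal sketch in Python, not Tidy code: the function name is hypothetical, and reading the three fields as list id, nesting level and format override is an assumption about Word's output rather than something established above.</p>

```python
import re

# Matches hints such as "mso-list:l0 level1 lfo2;" inside a style attribute.
# Assumption: l<N> is a list id, level<N> the nesting level, lfo<N> an override id.
MSO_LIST = re.compile(r"mso-list:\s*l(\d+)\s+level(\d+)\s+lfo(\d+)")


def parse_mso_list(style):
    """Return (list_id, level, lfo) or None when no mso-list hint is present."""
    match = MSO_LIST.search(style)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


print(parse_mso_list("margin-left:.5in;mso-list:l0 level1 lfo2;"))  # (0, 1, 2)
print(parse_mso_list("margin-left:.5in"))  # None
```

<p>A paragraph for which this returns None still needs some other rule of thumb, which is the unsolved part.</p>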
105
106<p>I need to set up an index of precisely what attributes are
107supported on each element. Right now, some elements check their
108own attributes, whilst others are checked via default checks
109defined for each attribute independently of the element. Until
this is done, you will sometimes find validation services
discovering errors unnoticed by Tidy itself.</p>
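
<p>One possible shape for such an index is a table keyed by element name. The sketch below is in Python and is only an illustration: the names are hypothetical and the table is a deliberately tiny subset, so a real index would have to be generated from the DTDs before it could be trusted.</p>

```python
# Illustrative subset only. Attributes missing from these sets (for example
# "border" on img) would be reported, so this is not a usable index as it stands.
CORE = {"id", "class", "style", "title"}
ALLOWED = {
    "br": CORE | {"clear"},
    "img": CORE | {"src", "alt", "width", "height"},
}


def unsupported_attributes(element, attributes):
    """Return the attribute names not listed for the element, in input order."""
    allowed = ALLOWED.get(element.lower())
    if allowed is None:
        return []  # element not in the index yet: nothing to report
    return [name for name in attributes if name.lower() not in allowed]


print(unsupported_attributes("br", ["clear", "href"]))  # ['href']
```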
112
113<p>Jelks Cabaniss asks: <i>Could Tidy be made to automatically
114"clean" (FONTs to CSS) if the Strict DOCTYPE is requested? An
115HTML or XHTML Strict document can't have FONT tags according to
116the DTDs</i>. Jelks has a bunch of other good ideas such as
117converting the bgcolor attribute over to CSS.</p>
118
<p>I would like to add an option to select slide transition effects,
and to provide an optional feature for sorting attribute
values.</p>
122
123<p>I am having problems with form elements as direct children of
124tr or table. It is dangerous to create an implicit table cell,
125and what is needed is a way to move the form element into the
126next cell. If this can't be done an error needs to be raised
127since Tidy will be stuck. On a separate note, Tidy is still
128breaking lines between &lt;img&gt; and &lt;/a&gt; which in
129Netscape shows as an underlined space. It's fine in IE.</p>
130
131<p>Benjamin Holzman &lt;bah@orientation.com&gt; writes: I'm
132wrapping tidy (release-date 2000.01.13) in some perl objects
133(using SWIG), and CharEncoding being a global is a bit of a pain.
134I was wondering what your thoughts would be on how to fix that.
135The character encoding is already a property of struct Out; is
136there any reason why making it part of struct StreamIn as well,
137and perhaps setting that property in OpenInput, based on the
138existing CharEncoding variable, wouldn't allow us to move
139CharEncoding to be local to main?</p>
140
141<p>Oh, in case you're curious about the API, here's a short
142script using my wrappers to be an html to xhtml filter:</p>
143
144<pre>
145      #!/usr/bin/perl
146
147      require tidy;
148
149      my $tidy     = Tidy-&gt;new(*STDIN);
150      my $document = $tidy-&gt;parse;
151      $tidy-&gt;as_xhtml(*STDOUT);
152</pre>
153
154<p>Rick Parsons would like there to be a new wrap-attributes
155option that can be used to suppress line wrapping within
156attributes. There is already a similar option for JavaScript
157literals.</p>
158
159<p>Vijay Patil would like tidy -h to display options sorted
160alphabetically.</p>
161
162<p>Julian Reschke would like there to be an option to add the
163xml:space="preserve" attribute to pre elements when outputting
164xml.</p>
165
166<p>Armando Asantos would like to use Tidy to produce a list of
167URLs for images or hypertext links according to a config option.
168This would be straightforward, but is a lower priority than bug
169fixes etc.</p>
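
<p>To show how little is involved, here is a rough sketch in Python using the standard html.parser module rather than Tidy's own parser. The class, function and option names are hypothetical stand-ins for whatever config option would be chosen.</p>

```python
from html.parser import HTMLParser


class UrlLister(HTMLParser):
    """Collects href values from a elements and src values from img elements."""

    def __init__(self, want_links=True, want_images=True):
        super().__init__()
        self.want_links = want_links
        self.want_images = want_images
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and self.want_links and attributes.get("href"):
            self.urls.append(attributes["href"])
        elif tag == "img" and self.want_images and attributes.get("src"):
            self.urls.append(attributes["src"])


def list_urls(markup, **options):
    """Return the URLs found in markup, in document order."""
    lister = UrlLister(**options)
    lister.feed(markup)
    lister.close()
    return lister.urls


print(list_urls('<p><a href="a.html">x</a><img src="b.gif"></p>'))
```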
170
171<p>Omri Traub would like an option to wrap the contents of style
172and script elements in CDATA marked sections when converting to
173XHTML. He is also interested in direct support for 16 bit
174character file I/O.</p>
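
<p>A minimal sketch of the wrapping step, in Python. It assumes the common convention of hiding the CDATA markers inside script or style comments so that older browsers ignore them; whether that is the form Omri has in mind is a guess, and the function name is hypothetical.</p>

```python
def wrap_in_cdata(tag, content):
    """Wrap script or style content in a CDATA marked section."""
    if "]]>" in content:
        # The end marker cannot appear inside a CDATA section.
        raise ValueError("content contains the CDATA end marker")
    if tag == "script":
        opener, closer = "//<![CDATA[", "//]]>"
    elif tag == "style":
        opener, closer = "/*<![CDATA[*/", "/*]]>*/"
    else:
        raise ValueError("only script and style are handled")
    return opener + "\n" + content.strip("\n") + "\n" + closer


print(wrap_in_cdata("script", "if (a < b) go();"))
```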
175
176<p>Bertilo Wennergren notes:</p>
177
178<blockquote>If I configure Tidy to "upgrade to style sheets", it
179does so for a few things in my main document, but the code thus
created gets error reports if I feed it back to Tidy. It turns out
181that Tidy creates extra "class" attributes on tags that already
182have "class" attributes set. This happens with this page:
&lt;http://www.concinnity.se/bertilow/index.htm&gt;.</blockquote>
184
185<p>Randi Waki notes:</p>
186
187<blockquote>
188<p>If a quoted URL attribute value (e.g., href in &lt;a&gt;
189elements) contains a line break, 13-Jan-2000 Tidy changes the
190line break to a space while IE and Netscape discard the line
191break. This can result in a broken link in the tidied
192document.</p>
193
194<p>I believe the following change fixes the problem. In lexer.c,
195insert the following lines before line 2502:</p>
196
197<pre>
198                            /* discard line breaks in quoted URLs */
199                            if (c == '\n' &amp;&amp; IsUrl(name))
200                                continue;
201
202/* existing line 2502 */    c = ' ';
203</pre>
204</blockquote>
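
<p>The rule behind this patch can be restated outside the lexer. The Python sketch below is only an illustration: is_url is a hypothetical stand-in for the IsUrl check named in the patch, and the set of URL attributes is an assumed subset.</p>

```python
# Assumed subset of attributes whose values are URLs.
URL_ATTRIBUTES = {"href", "src", "background", "action"}


def is_url(name):
    """Hypothetical stand-in for the IsUrl check used in the patch."""
    return name.lower() in URL_ATTRIBUTES


def normalize_attribute_value(name, value):
    """Drop line breaks in URL values; turn them into spaces elsewhere."""
    if is_url(name):
        return value.replace("\r", "").replace("\n", "")
    return value.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")


print(normalize_attribute_value("href", "http://example.org/a\nb.html"))
print(normalize_attribute_value("alt", "two\nwords"))
```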
205
206<p>Stephen Reynolds would like Tidy to keep track of whether a
207comment started on a new line and preserve this in the
208output.</p>
209
210<p>Terry Teague says:</p>
211
212<blockquote>
213<p>Sorry, I should have been more clear. Part of the problem is
214the current HelpText() function in localize.c doesn't actually
215reflect current reality.</p>
216
217<p>You need to at least add the following line to HelpText()
218:</p>
219
220<pre>
221    tidy_out(out, "  -version or -v  show version\n");
222</pre>
223
224<p>And I suppose it should mention the use of the new
225"--&lt;config options&gt;" type syntax.</p>
226
227<p>Regards, Terry</p>
228</blockquote>
229
230<p>John Russel notes:</p>
231
232<pre>
233 what i wonder is
2341] does the specification indicate these are WRONG
2352] if so why do they pass thru tidy ....
236is url syntax such a can of worms that it is left to user
237   to check .......
238
239CASE 1: misuse of slash for folders
240site had  background="pics\fancy.jpg"
241  instead of   "pics/fancy.jpg"
242
243CASE 2: spaces in filename
244site had href="coin album.html"
245instead of "coin%20album.html"
246</pre>
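
<p>Both cases could be detected and repaired with very little code. The following Python sketch shows one way, assuming that a backslash is always meant as a path separator and that a space should simply be escaped; neither assumption is guaranteed to hold for every URL, and the function name is hypothetical.</p>

```python
def repair_url(url):
    """Return (repaired_url, problems) for the two cases described above."""
    problems = []
    if "\\" in url:
        problems.append("backslash used as path separator")
        url = url.replace("\\", "/")
    if " " in url:
        problems.append("unescaped space")
        url = url.replace(" ", "%20")
    return url, problems


print(repair_url("pics\\fancy.jpg"))
print(repair_url("coin album.html"))
```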
247
248<p>Andre Stechert would like a way to prevent Tidy from
249"cleaning" newly declared elements which don't have any content
but do have end tags, see his mail of 17th January 2000.</p>
251
252<p>Todd Clark would like to use Tidy with Microsoft's WebClass
253tags. Unfortunately these include unusual characters in the tag
254names such as @ which Tidy objects to, for instance:</p>
255
256<pre>
257&lt;WC@DOMAINNAME&gt;test.com&lt;/WC@DOMAINNAME&gt;
258</pre>
259
260<p>Perhaps it makes sense to offer an option to make Tidy less
261picky about what characters it accepts in tag names. Or perhaps
262"WebClass: yes".</p>
263
264<p>Jelks Cabaniss suggests an option to control dropping of empty
265elements, e.g. according to what attributes they have.</p>
266
267<p>Paavo Hartikainen writes:</p>
268
269<blockquote>
<p>Tidy always expands '&amp;' to '&amp;amp;' even if I have
271'quote-ampersand: no' defined in configuration file. This is not
272a good thing to do for URLs that have '&amp;' characters in them.
273OS is Debian GNU/Linux 2.1 SPARC. Same thing happens on Alpha.
274Other architectures I have not tried.</p>
275
276<p>My configuration looks like this:</p>
277
278<pre>
279char-encoding: latin1
280error-file: /errors
281indent-spaces: 2
282logical-emphasis: yes
283output-xhtml: yes
284quiet: no
285quote-ampersand: no
286show-warnings: yes
287tidy-mark: yes
288wrap: 78
289wrap-attributes: no
290write-back: yes
291keep-time: yes
292</pre>
293</blockquote>
294
295<p>Paul White reports that Tidy isn't recognizing HTML 3.2 when
296the doctype is "-//W3C//DTD HTML 3.2 Final//EN" (as per the REC),
297and similarly for HTML 4.01. This would appear to call for a
298change to the table of names in lexer.c.</p>
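
<p>As an illustration of the kind of lookup involved, here is a sketch in Python. It is not the table from lexer.c: the dictionary and function names are hypothetical, and only a handful of public identifiers are listed.</p>

```python
# Public identifiers mapped to a version label. Not exhaustive.
KNOWN_DOCTYPES = {
    "-//W3C//DTD HTML 3.2//EN": "HTML 3.2",
    "-//W3C//DTD HTML 3.2 Final//EN": "HTML 3.2",
    "-//W3C//DTD HTML 4.0//EN": "HTML 4.0 Strict",
    "-//W3C//DTD HTML 4.0 Transitional//EN": "HTML 4.0 Transitional",
    "-//W3C//DTD HTML 4.01//EN": "HTML 4.01 Strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "HTML 4.01 Transitional",
    "-//W3C//DTD HTML 4.01 Frameset//EN": "HTML 4.01 Frameset",
}


def version_for(public_id):
    """Look up a public identifier, tolerating extra whitespace."""
    return KNOWN_DOCTYPES.get(" ".join(public_id.split()))


print(version_for("-//W3C//DTD HTML 3.2 Final//EN"))  # HTML 3.2
```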
299
300<p>Stuart Hungerford would like Tidy to detect and fix duplicate
301attributes e.g. multiple class attributes. Celeste Suliin Burris
302would like Tidy to replace spaces in URLs by %20 as some versions
303of Netscape "croak big time" on this. Denis Kokarev also wants
304Tidy to remove duplicate attributes when the values are the same.
305This apparently stops XSLT from working. Brian Schweitzer notes
306that Tidy adds a 2nd class attribute rather than merging the
307classes into a space separated list.</p>
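
<p>A sketch of the merging behaviour these reports ask for, written in Python. The policy shown is an assumption: duplicate class attributes are merged into one space separated list, and for any other duplicate the first value wins whether or not the later values are identical.</p>

```python
def merge_attributes(attrs):
    """Collapse duplicate attributes in a list of (name, value) pairs."""
    merged = {}
    for name, value in attrs:
        key = name.lower()
        if key not in merged:
            merged[key] = value
        elif key == "class":
            # Append only the class names not already present.
            known = merged[key].split()
            extra = [token for token in value.split() if token not in known]
            merged[key] = " ".join(known + extra)
        # Any other duplicate is dropped and the first value is kept.
    return list(merged.items())


print(merge_attributes([("class", "a"), ("id", "x"), ("class", "b a")]))
```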
308
309<p>Bertilo Wennergren writes: Tidy seems not to recognize frame
310elements with a closing "/". It actually removes them. Try his <a
311href="http://www.concinnity.se/bertilow/pmeg/pmeg9/k_bazo.htm">example</a>.
Tidy can produce XHTML Frameset docs, but when fed them back
again it cries foul.</p>
315
316<p>Jose Manuel Cerqueira Esteves notes:</p>
317
318<pre>
319I've used `tidy' to convert a few HTML 4.0 files to XHTML 1.0 and noticed
320a problem when dealing with constructs like
321
322 &lt;small&gt;&lt;small&gt;some text&lt;/small&gt;&lt;/small&gt;
323
324First, `tidy' acts as if the second "&lt;small&gt;" was meant as a closing tag:
325
326 Warning: "&lt;small&gt; is probably intended as &lt;/small&gt;"
327
328Then it trims the resulting empty &lt;small&gt;&lt;/small&gt;:
329
330 Warning: trimming empty &lt;small&gt;
331
332And finally both remaining closing tags ("&lt;/small&gt;"), now spurious,
333are removed:
334
335 Warning: discarding unexpected &lt;/small&gt;
336 Warning: discarding unexpected &lt;/small&gt;
337
338It would be convenient to have at least some `tidy' option to prevent this
339from happening (or perhaps some different heuristics?).
340</pre>
341
342<p>Robbert Hans Baron would like to see Tidy warning about
343duplicate attributes and fixing these when the values are
344identical.</p>
345
346<p>Jutta Wrage notes that: When parsing HTML 3.2 Pages, tidy
347doesn't accept textareas in forms correctly. The HTML Reference
348specification (HTML 3.2 Final) allows: name, rows and cols, but
349upon seeing these Tidy thinks the document is 4.0.</p>
350
351<p>Matthew Brealey notes that a heading start tag is coerced to
352an end heading tag when the end tag is missing. This is
353deliberate, but perhaps not the best heuristic.</p>
354
355<p>HIYAMA Masayuki notes that Tidy should set the encoding
attribute to match the language encoding, e.g. &lt;?xml version="1.0"
encoding="iso-2022-jp"?&gt;.</p>
358
359<p>Mark Modrall has extended Tidy to support selectively
360stripping out listed tags and attributes, see his email of March
36114th.</p>
362
363<p>Yong Taek Bae notes that with the omit end tags option Tidy
364omits the body tag even if it has attributes. This is an
365error.</p>
366
367<p>Tapio Markula reports that Tidy is incorrectly replacing
368accented characters in script elements by entities. The script
369element (in HTML but not XHTML) is CDATA and as such entities
370won't be expanded. This bug needs to be fixed along with the
371support for CDATA sections.</p>
372
373<p>Terrill Bennett reports tidy crashing when producing slides,
374and when the -i option has been set. He later added the crash
375occurs when the page doesn't include an h1 element. See
376Terrill-Bennett-11mar00.txt.</p>
377
378<p>Stephen Lewis notes that if an &lt;hr&gt; element is present
379in the head before the title element, then Tidy gets confused and
380adds in a spurious extra empty title element. This would be
381avoided if Tidy could move the hr into the body before the body
382element is encountered. This raises a number of problems for
383instance working out when to copy in attributes from an explicit
384body element.</p>
385
386<p>Carl Osterly would like Tidy to avoid breaking lines before or
387after the = sign in attribute values when this is practical.
388Perhaps a simple rule of thumb could be used to decide this?</p>
389
390<p>Rick H Wesson notes that Tidy crashes on CDATA marked sections
391when parsing XML.</p>
392
393<p>Luigi Federici would like an option to set the DTD URI for XML
394or XHTML.</p>
395
<p>Mats Sander notes: If I have php code the indentation behaves
strangely. With repeated tidying, the php content and end tag are
indented one extra level each time. The result ends up something like
this:</p>
400
401<pre>
402...
403    &lt;?php
404                        $r=0;
                        ?&gt;
406...
407
I have the following config file for Tidy:
409---
410tidy-mark: no
411markup: yes
412wrap: 0
413indent: auto
414output-xml: no
415output-xhtml: yes
416doctype: loose
417char-encoding: latin1
418quote-marks: yes
419assume-xml-procins: yes
420word-2000: yes
421clean: yes
422logical-emphasis: yes
423drop-empty-paras: yes
424enclose-text: yes
425fix-bad-comments: yes
426alt-text: .
427write-back: bool
428keep-time: yes
429show-warnings: no
430quiet: yes
431split: no
432---
433
434Best Regards,
435Mats-Olof Sander
436
437</pre>
438
439<p>Don Hasson notes that if you make a mistake and leave off the
440ending "/" in the &lt;title&gt; tag, tidy will generate an extra
441set of &lt;title&gt;s.</p>
442
443<p>Example:</p>
444
445<pre>
446&lt;html&gt;
447&lt;head&gt;&lt;title&gt;No end here&lt;title&gt;&lt;/head&gt;
448&lt;body&gt;
449Empty
450&lt;/body&gt;
451&lt;/html&gt;
452
453</pre>
454
455<p>produces this:</p>
456
457<pre>
458&lt;html&gt;
459&lt;head&gt;
460&lt;title&gt;No end here&lt;/title&gt;
461&lt;title&gt;&lt;/title&gt;
462&lt;/head&gt;
463&lt;body&gt;
464Empty
465&lt;/body&gt;
466&lt;/html&gt;
467
468</pre>
469
470<p>Jeff Wilkinson would like the HTML Tidy page to include
471internal anchors so that he can link directly to the appropriate
472sections.</p>
473
474<p>Peter Vince would like to be able to clean presentation
475attributes on the body element, as well as translating b and i to
476span.</p>
477
<p>Dave Bryan and Matthew Brealey would like there to be a way to
479suppress the default handling of inline elements in favor of
480simply inserting the appropriate end tag when encountering an
481element that isn't allowed in an inline context. The default
482behavior replicates the rendering on existing browsers but can
483cause problems for hand editors.</p>
484
485<p>Dave Bryan notes that tidy isn't updating the column position
486when parsing attributes.</p>
487
488<p>Can Tidy track when a line break occurs after a PI or comment
489and reproduce this in the output? This idea occurred to me after
490reading a comment from Brad Stowers.</p>
491
492<p>One interesting suggestion is to make some of Tidy's rules of
493thumb sensitive to the program that generated the markup as
494indicated by the meta element. This would allow for greater
495robustness in how the rules operate.</p>
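
<p>To make the idea concrete, here is a rough sketch in Python using the standard html.parser module. The names are hypothetical, and the single rule shown (switching on the word-2000 option when the generator mentions Microsoft Word) is just one example of what such a mapping might contain.</p>

```python
from html.parser import HTMLParser


class GeneratorSniffer(HTMLParser):
    """Remembers the content of the first meta element named generator."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name == "generator" and self.generator is None:
            self.generator = attributes.get("content")


def rules_for(markup):
    """Choose rule settings from the generator meta element, if any."""
    sniffer = GeneratorSniffer()
    sniffer.feed(markup)
    sniffer.close()
    generator = (sniffer.generator or "").lower()
    return {"word-2000": "microsoft word" in generator}


print(rules_for('<meta name=Generator content="Microsoft Word 9">'))
```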
496
497<p>Dave Bryan would like the quiet mode to be tweaked to suppress
the general info at the end of the report. See
Dave-Bryan-24mar00.txt.</p>
500
501<p>Erik Rossen would like an option to suppress line wrap within
502tags, so that the tag is always on the same line regardless of
503the number and length of the attributes.</p>
504
<p>Dan Satria suggests that the clean mechanism check to see if
506there are any existing matching style rules before adding new
507ones.</p>
508
509<p>Zoltan Hawryluk suggests mapping the Netscape layer tag into
510the equivalent CSS positioning syntax.</p>
511
512<p>Jim Walker says Tidy doesn't correctly report errors such as
513<tt>&lt;/&lt;/head&gt;</tt>.</p>
514
<p>Tidy's slide feature: see Johannes-Poutre-12jul00.txt.</p>
516
517<p>Carole Mah suggests Tidy should recover from multiple class
518attributes on the same element.</p>
519
520<h2>Other ideas</h2>
521
522<ul>
523<li>Recursion through subdirectories, so you can fix up your
524entire web site at one go. This assumes I can find a way that is
525portable across a wide range of platforms!</li>
526
527<li>Support for W3C's <a
528href="http://www.w3.org/TR/REC-DOM-Level-1/">Document Object
529Model</a> (DOM) level one.</li>
530
531<li>Full validation of all attribute values.</li>
532
533<li>Mapping Unicode bidi control characters to HTML tags.</li>
534
535<li>Full support for parsing XML (still somewhat limited).</li>
536
537<li>How to say which XML elements should be printed
538"inline".</li>
539
540<li>Acting on the XML encoding attribute, e.g.
&lt;?xml&#160;encoding="iso-8859-1"?&gt;</li>
542
543<li>Improved mapping from HTML presentation attributes/elements
544to CSS.</li>
545
546<li>Improved support for <a
href="http://java.sun.com/products/jsp/">JSP</a> (JavaServer
Pages)</li>
549
550<li>Ugly print option which removes all optional whitespace</li>
551</ul>
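
<p>For the first idea in this list, the directory walk itself is easy to express in a scripting language, even if doing it portably in C is harder. A minimal sketch in Python, with a hypothetical function name and an assumed pair of file extensions:</p>

```python
import os


def find_html_files(root, extensions=(".html", ".htm")):
    """Return the paths of all HTML files below root, sorted."""
    found = []
    for directory, _subdirectories, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(extensions):
                found.append(os.path.join(directory, filename))
    return sorted(found)
```

<p>Each path returned could then be handed to Tidy in turn.</p>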
552</body>
553</html>
554
555