<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>HTML TIDY - Notes on pending work</title>
<meta name="keywords"
content="HTML, validation, error correction, pretty-printing" />
<meta name="author" content="Dave Raggett &lt;dsr@w3.org&gt;" />
<style type="text/css">
  body {
    margin-left: 10%;
    margin-right: 10%;
    font-family: sans-serif
  }
  h1 { margin-left: -8% }
  h2,h3,h4,h5,h6 { margin-left: -4% }
  pre { color: green; font-weight: bold;
        font-size: 80%; font-family: monospace}
  em { font-style: italic; font-weight: bold }
  strong { text-transform: uppercase; font-weight: bold }
  .note {font-style: italic; color: rgb(192, 101, 101) }
  /* hr {text-align: center; width: 60% } */
  blockquote {
    color: navy;
    margin-left: 1%;
    margin-right: 1%;
    text-align: center;
    font-family: "Comic Sans MS", "Times New Roman", serif
  }
  table {
    font-family: sans-serif;
    font-size: 80%;
    background: rgb(255,255,153)
  }
  td {
    font-size: 80%
  }
  .people {font-family: "Lucida Calligraphy", serif}
  :link { color: rgb(0, 0, 153) }
  :visited { color: rgb(153, 0, 153) }
  :active { color: rgb(255, 0, 102) }
  a :hover { color: rgb(0, 0, 255) }
</style>

<style type="text/css">
  p.c1 {font-style: italic}
</style>
</head>
<body bgcolor="#FFFFFF" background="grid.gif" text="black"
link="navy" vlink="black" alink="red">
<h1>HTML TIDY - Notes on Pending Work</h1>

<p><a href="http://www.w3.org/People/Raggett">Dave Raggett</a> <a
href="mailto:dsr@w3.org">dsr@w3.org</a></p>

<p>This is a page where I am keeping the suggestions for
improvements or bug fixes.
My current work load means that I
don't get much time to work on HTML Tidy, so I am interested in
offers of help!</p>

<h4>Public Email List for Tidy: &lt;<a
href="mailto:html-tidy@w3.org">html-tidy@w3.org</a>&gt;</h4>

<p>I have set up an archived mailing list devoted to Tidy. To
subscribe, send an email to html-tidy-request@w3.org with the word
subscribe in the subject line (use the word unsubscribe instead if
you want to unsubscribe). The <a
href="http://lists.w3.org/Archives/Public/html-tidy/">archive</a>
for this list is accessible online. Please use this list to
report errors or enhancement requests.</p>

<h2>Things awaiting further attention</h2>

<ul>
<li>Support for BIG5 and ShiftJIS (Rick Jelliffe)</li>

<li>Stronger checking on which attributes appear on what
elements</li>

<li>Sorting attributes in a canonical order</li>

<li>Version checking for HTML 4.01 vs 4.0 (Tidy currently will
set the document type to 4.01 in preference to 4.0)</li>

<li>Noticing that the document isn't really XHTML if it isn't
well-formed, i.e. it lacks end tags and quotes on attribute
values</li>

<li>Converting &lt;font face="Symbol"&gt;a&lt;/font&gt; etc. to
the corresponding Unicode characters, when cleaning HTML.</li>

<li>Link checking: this would involve some platform-dependent
code as the network interface varies significantly from one
platform to the next.</li>

<li>When Word2000 documents are exported as Web pages, Tidy needs
smarter rules of thumb for working out whether a paragraph is a
bulleted or numbered list item, and for determining the level of
nesting. Perhaps the style attribute holds the key? This tends to
include substrings like: "mso-list:l0 level1 lfo2;" and
"mso-list:l1 level1 lfo1;". Unfortunately, these aren't always
present, and I have yet to figure out a foolproof heuristic.</li>
</ul>

<p>I need to set up an index of precisely what attributes are
supported on each element.
Right now, some elements check their
own attributes, whilst others are checked via default checks
defined for each attribute independently of the element. Until
this is done, you will sometimes find validation services
discovering errors that Tidy itself did not notice.</p>

<p>Jelks Cabaniss asks: <i>Could Tidy be made to automatically
"clean" (FONTs to CSS) if the Strict DOCTYPE is requested? An
HTML or XHTML Strict document can't have FONT tags according to
the DTDs</i>. Jelks has a bunch of other good ideas such as
converting the bgcolor attribute over to CSS.</p>

<p>I would like to add an option to select slide transition
effects. I would also like to provide an optional feature for
sorting attribute values.</p>

<p>I am having problems with form elements as direct children of
tr or table. It is dangerous to create an implicit table cell,
and what is needed is a way to move the form element into the
next cell. If this can't be done, an error needs to be raised
since Tidy will be stuck. On a separate note, Tidy is still
breaking lines between &lt;img&gt; and &lt;/a&gt;, which in
Netscape shows as an underlined space. It's fine in IE.</p>

<p>Benjamin Holzman &lt;bah@orientation.com&gt; writes: I'm
wrapping tidy (release-date 2000.01.13) in some perl objects
(using SWIG), and CharEncoding being a global is a bit of a pain.
I was wondering what your thoughts would be on how to fix that.
The character encoding is already a property of struct Out; is
there any reason why making it part of struct StreamIn as well,
and perhaps setting that property in OpenInput, based on the
existing CharEncoding variable, wouldn't allow us to move
CharEncoding to be local to main?</p>

<p>Oh, in case you're curious about the API, here's a short
script using my wrappers to be an html to xhtml filter:</p>

<pre>
#!/usr/bin/perl

require tidy;

my $tidy = Tidy-&gt;new(*STDIN);
my $document = $tidy-&gt;parse;
$tidy-&gt;as_xhtml(*STDOUT);
</pre>

<p>Rick Parsons would like there to be a new wrap-attributes
option that can be used to suppress line wrapping within
attributes. There is already a similar option for JavaScript
literals.</p>

<p>Vijay Patil would like tidy -h to display options sorted
alphabetically.</p>

<p>Julian Reschke would like there to be an option to add the
xml:space="preserve" attribute to pre elements when outputting
xml.</p>

<p>Armando Asantos would like to use Tidy to produce a list of
URLs for images or hypertext links according to a config option.
This would be straightforward, but is a lower priority than bug
fixes etc.</p>

<p>Omri Traub would like an option to wrap the contents of style
and script elements in CDATA marked sections when converting to
XHTML. He is also interested in direct support for 16 bit
character file I/O.</p>

<p>Bertilo Wennergren notes:</p>

<blockquote>If I configure Tidy to "upgrade to style sheets", it
does so for a few things in my main document, but the code thus
created get error reports if I feed it back to Tidy. It turns out
that Tidy creates extra "class" attributes on tags that already
have "class" attributes set.
This happens with this page:
&lt;http://www.concinnity.se/bertilow/index.htm&gt;.</blockquote>

<p>Randi Waki notes:</p>

<blockquote>
<p>If a quoted URL attribute value (e.g., href in &lt;a&gt;
elements) contains a line break, 13-Jan-2000 Tidy changes the
line break to a space while IE and Netscape discard the line
break. This can result in a broken link in the tidied
document.</p>

<p>I believe the following change fixes the problem. In lexer.c,
insert the following lines before line 2502:</p>

<pre>
    /* discard line breaks in quoted URLs */
    if (c == '\n' &amp;&amp; IsUrl(name))
        continue;

/* existing line 2502 */ c = ' ';
</pre>
</blockquote>

<p>Stephen Reynolds would like Tidy to keep track of whether a
comment started on a new line and preserve this in the
output.</p>

<p>Terry Teague says:</p>

<blockquote>
<p>Sorry, I should have been more clear. Part of the problem is
the current HelpText() function in localize.c doesn't actually
reflect current reality.</p>

<p>You need to at least add the following line to
HelpText():</p>

<pre>
tidy_out(out, " -version or -v show version\n");
</pre>

<p>And I suppose it should mention the use of the new
"--&lt;config options&gt;" type syntax.</p>

<p>Regards, Terry</p>
</blockquote>

<p>John Russel notes:</p>

<pre>
what i wonder is
1] does the specification indicate these are WRONG
2] if so why do they pass thru tidy ....
is url syntax such a can of worms that it is left to user
 to check .......
CASE 1: misuse of slash for folders
site had background="pics\fancy.jpg"
 instead of "pics/fancy.jpg"

CASE 2: spaces in filename
site had href="coin album.html"
instead of "coin%20album.html"
</pre>

<p>Andre Stechert would like a way to prevent Tidy from
"cleaning" newly declared elements which don't have any content
but do have end tags, see his mail of 17th January 2000.</p>

<p>Todd Clark would like to use Tidy with Microsoft's WebClass
tags. Unfortunately these include unusual characters in the tag
names such as @ which Tidy objects to, for instance:</p>

<pre>
&lt;WC@DOMAINNAME&gt;test.com&lt;/WC@DOMAINNAME&gt;
</pre>

<p>Perhaps it makes sense to offer an option to make Tidy less
picky about what characters it accepts in tag names. Or perhaps
"WebClass: yes".</p>

<p>Jelks Cabaniss suggests an option to control dropping of empty
elements, e.g. according to what attributes they have.</p>

<p>Paavo Hartikainen writes:</p>

<blockquote>
<p>Tidy always expands '&amp;' to '&amp;amp;' even if I have
'quote-ampersand: no' defined in configuration file. This is not
a good thing to do for URLs that have '&amp;' characters in them.
OS is Debian GNU/Linux 2.1 SPARC. Same thing happens on Alpha.
Other architectures I have not tried.</p>

<p>My configuration looks like this:</p>

<pre>
char-encoding: latin1
error-file: /errors
indent-spaces: 2
logical-emphasis: yes
output-xhtml: yes
quiet: no
quote-ampersand: no
show-warnings: yes
tidy-mark: yes
wrap: 78
wrap-attributes: no
write-back: yes
keep-time: yes
</pre>
</blockquote>

<p>Paul White reports that Tidy isn't recognizing HTML 3.2 when
the doctype is "-//W3C//DTD HTML 3.2 Final//EN" (as per the REC),
and similarly for HTML 4.01.
This would appear to call for a
change to the table of names in lexer.c.</p>

<p>Stuart Hungerford would like Tidy to detect and fix duplicate
attributes e.g. multiple class attributes. Celeste Suliin Burris
would like Tidy to replace spaces in URLs by %20 as some versions
of Netscape "croak big time" on this. Denis Kokarev also wants
Tidy to remove duplicate attributes when the values are the same.
This apparently stops XSLT from working. Brian Schweitzer notes
that Tidy adds a 2nd class attribute rather than merging the
classes into a space separated list.</p>

<p>Bertilo Wennergren writes: Tidy seems not to recognize frame
elements with a closing "/". It actually removes them. Try his <a
href="http://www.concinnity.se/bertilow/pmeg/pmeg9/k_bazo.htm">example</a>.
Tidy can produce XHTML Frameset docs, but when fed them back
again it cries foul.</p>

<p>Jose Manuel Cerqueira Esteves notes:</p>

<pre>
I've used `tidy' to convert a few HTML 4.0 files to XHTML 1.0 and noticed
a problem when dealing with constructs like

  &lt;small&gt;&lt;small&gt;some text&lt;/small&gt;&lt;/small&gt;

First, `tidy' acts as if the second "&lt;small&gt;" was meant as a closing tag:

  Warning: "&lt;small&gt; is probably intended as &lt;/small&gt;"

Then it trims the resulting empty &lt;small&gt;&lt;/small&gt;:

  Warning: trimming empty &lt;small&gt;

And finally both remaining closing tags ("&lt;/small&gt;"), now spurious,
are removed:

  Warning: discarding unexpected &lt;/small&gt;
  Warning: discarding unexpected &lt;/small&gt;

It would be convenient to have at least some `tidy' option to prevent this
from happening (or perhaps some different heuristics?).
</pre>

<p>Robbert Hans Baron would like to see Tidy warning about
duplicate attributes and fixing these when the values are
identical.</p>

<p>Jutta Wrage notes that: When parsing HTML 3.2 Pages, tidy
doesn't accept textareas in forms correctly.
The HTML Reference
specification (HTML 3.2 Final) allows: name, rows and cols, but
upon seeing these Tidy thinks the document is 4.0.</p>

<p>Matthew Brealey notes that a heading start tag is coerced to
an end heading tag when the end tag is missing. This is
deliberate, but perhaps not the best heuristic.</p>

<p>HIYAMA Masayuki notes that Tidy should set the encoding
attribute to match the language encoding, e.g.
&lt;?xml version="1.0" encoding="iso-2022-jp"?&gt;.</p>

<p>Mark Modrall has extended Tidy to support selectively
stripping out listed tags and attributes, see his email of March
14th.</p>

<p>Yong Taek Bae notes that with the omit end tags option Tidy
omits the body tag even if it has attributes. This is an
error.</p>

<p>Tapio Markula reports that Tidy is incorrectly replacing
accented characters in script elements by entities. The script
element (in HTML but not XHTML) is CDATA and as such entities
won't be expanded. This bug needs to be fixed along with the
support for CDATA sections.</p>

<p>Terrill Bennett reports tidy crashing when producing slides,
and when the -i option has been set. He later added that the
crash occurs when the page doesn't include an h1 element. See
Terrill-Bennett-11mar00.txt.</p>

<p>Stephen Lewis notes that if an &lt;hr&gt; element is present
in the head before the title element, then Tidy gets confused and
adds in a spurious extra empty title element. This would be
avoided if Tidy could move the hr into the body before the body
element is encountered. This raises a number of problems, for
instance working out when to copy in attributes from an explicit
body element.</p>

<p>Carl Osterly would like Tidy to avoid breaking lines before or
after the = sign in attribute values when this is practical.
Perhaps a simple rule of thumb could be used to decide this?</p>

<p>Rick H Wesson notes that Tidy crashes on CDATA marked sections
when parsing XML.</p>

<p>Luigi Federici would like an option to set the DTD URI for XML
or XHTML.</p>

<p>Mat Sander notes: If I have php code the indentation behaves
strangely. On repeated tidying, the php content and end tag are
indented one extra level each time. The result ends up something
like this:</p>

<pre>
...
 &lt;?php
 $r=0;
 ?&gt;
...

I have the following config file for Tidy:
---
tidy-mark: no
markup: yes
wrap: 0
indent: auto
output-xml: no
output-xhtml: yes
doctype: loose
char-encoding: latin1
quote-marks: yes
assume-xml-procins: yes
word-2000: yes
clean: yes
logical-emphasis: yes
drop-empty-paras: yes
enclose-text: yes
fix-bad-comments: yes
alt-text: .
write-back: bool
keep-time: yes
show-warnings: no
quiet: yes
split: no
---

Best Regards,
Mats-Olof Sander

</pre>

<p>Don Hasson notes that if you make a mistake and leave off the
ending "/" in the &lt;title&gt; tag, tidy will generate an extra
set of &lt;title&gt;s.</p>

<p>Example:</p>

<pre>
&lt;html&gt;
&lt;head&gt;&lt;title&gt;No end here&lt;title&gt;&lt;/head&gt;
&lt;body&gt;
Empty
&lt;/body&gt;
&lt;/html&gt;

</pre>

<p>produces this:</p>

<pre>
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;No end here&lt;/title&gt;
&lt;title&gt;&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
Empty
&lt;/body&gt;
&lt;/html&gt;

</pre>

<p>Jeff Wilkinson would like the HTML Tidy page to include
internal anchors so that he can link directly to the appropriate
sections.</p>

<p>Peter Vince would like to be able to clean presentation
attributes on the body element, as well as translating b and i to
span.</p>

<p>Dave Bryan and Mathew Brealey would like there to be a way to
suppress the default handling of inline elements in favor of
simply inserting the
appropriate end tag when encountering an
element that isn't allowed in an inline context. The default
behavior replicates the rendering on existing browsers but can
cause problems for hand editors.</p>

<p>Dave Bryan notes that tidy isn't updating the column position
when parsing attributes.</p>

<p>Can Tidy track when a line break occurs after a PI or comment
and reproduce this in the output? This idea occurred to me after
reading a comment from Brad Stowers.</p>

<p>One interesting suggestion is to make some of Tidy's rules of
thumb sensitive to the program that generated the markup as
indicated by the meta element. This would allow for greater
robustness in how the rules operate.</p>

<p>Dave Bryan would like the quiet mode to be tweaked to suppress
the general info at the end of the report. See
Dave-Bryan-24mar00.txt.</p>

<p>Erik Rossen would like an option to suppress line wrap within
tags, so that the tag is always on the same line regardless of
the number and length of the attributes.</p>

<p>Dan Satria suggests that the clean mechanism check to see if
there are any existing matching style rules before adding new
ones.</p>

<p>Zoltan Hawryluk suggests mapping the Netscape layer tag into
the equivalent CSS positioning syntax.</p>

<p>Jim Walker says Tidy doesn't correctly report errors such as
<tt>&lt;/&lt;/head&gt;</tt>.</p>

<p>Tidy's slide feature: see Johannes-Poutre-12jul00.txt.</p>

<p>Carole Mah suggests Tidy should recover from multiple class
attributes on the same element.</p>

<h2>Other ideas</h2>

<ul>
<li>Recursion through subdirectories, so you can fix up your
entire web site at one go.
This assumes I can find a way that is
portable across a wide range of platforms!</li>

<li>Support for W3C's <a
href="http://www.w3.org/TR/REC-DOM-Level-1/">Document Object
Model</a> (DOM) level one.</li>

<li>Full validation of all attribute values.</li>

<li>Mapping Unicode bidi control characters to HTML tags.</li>

<li>Full support for parsing XML (still somewhat limited).</li>

<li>How to say which XML elements should be printed
"inline".</li>

<li>Acting on the XML encoding attribute, e.g.
&lt;?xml encoding="iso-8859-1"?&gt;</li>

<li>Improved mapping from HTML presentation attributes/elements
to CSS.</li>

<li>Improved support for <a
href="http://java.sun.com/products/jsp/">JSP</a> (Java Server
Pages)</li>

<li>Ugly print option which removes all optional whitespace</li>
</ul>
</body>
</html>