A Fast Method of Identifying Plain Text Files
=============================================


Introduction
------------

Given a file coming from an unknown source, it is generally impossible
to conclude automatically, and with 100% accuracy, whether that file is
a plain text file, without performing a heavy-duty semantic analysis of
the file contents.  It is, however, possible to obtain a fairly high
degree of accuracy by employing various simple heuristics.

Previous versions of the zip tools used a crude detection scheme,
originally employed by PKWare in its PKZip programs: if more than 80%
(4/5) of the bytes are within the range [7..127], the file is labeled
as plain text; otherwise it is labeled as binary.  A prominent
limitation of this scheme is its restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are most often mis-identified by this scheme; in other words, the rate
of false negatives is sometimes too high, which means that the recall
is low.  Another weakness of this scheme is its reduced precision, due
to the false positives that occur when binary files containing a large
amount of textual characters are mis-identified as plain text.

In this article we propose a new detection scheme, with much increased
accuracy and precision, and a near-100% recall.  This scheme is
designed to work on ASCII and ASCII-derived alphabets: it handles
single-byte encodings (ISO-8859, OEM, KOI-8, etc.) and variable-sized
encodings (DBCS, UTF-8, etc.).  However, it cannot handle fixed-sized,
multi-byte encodings (UCS-2, UCS-4), nor UTF-16.  The principle used by
this scheme can easily be adapted to non-ASCII alphabets like EBCDIC.


The Algorithm
-------------

The algorithm works by dividing the set of bytecodes [0..255] into
three categories:
- The white list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
- The black list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 25, and 28 to 31.

If a file contains at least one byte that belongs to the white list and
no byte that belongs to the black list, then the file is categorized as
plain text; otherwise, it is categorized as binary.  (Gray-listed bytes
affect neither condition.)


Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice.  The only
widely-used, almost universally-portable control codes are 9 (TAB),
10 (LF) and 13 (CR).  A few more control codes are recognized on a
reduced range of platforms and text viewers/editors: 7 (BEL), 8 (BS),
11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these codes are rarely,
if ever, used alone, without being accompanied by some printable text.
Even the newer, portable text formats, such as XML, avoid using control
characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL).  Even though the older text
detection schemes observe the presence of non-ASCII codes from the
range [128..255], the precision rarely suffers if this upper range is
labeled as textual, because files that are genuinely binary tend to
contain both control characters and codes from the upper range.  On
the other hand, the upper range needs to be labeled as textual, because
it is used by virtually all ASCII extensions.
In particular, this range is heavily used for encoding non-Latin
scripts.

According to these two observations, the plain text detection algorithm
becomes straightforward.  There must be at least some printable
material, or some portable whitespace such as TAB, CR or LF; otherwise
the file is not labeled as plain text.  (The boundary case, when the
file is empty, automatically falls into this category.)  On the other
hand, there must be no non-portable control characters; otherwise it is
very likely that the intended reader of that file is a machine, rather
than a human.

Since there is no counting involved, other than simply observing the
presence or absence of some byte values, the algorithm produces uniform
results on any particular text file, no matter what alphabet encoding
is used for that text.  (In contrast, if counting were involved, it
would be possible to obtain different results on the same text encoded,
say, in ISO-8859-2 versus UTF-8.)  There is a category of plain text
files that are "polluted" with one or a few black-listed codes, either
by mistake or by peculiar design considerations.  In such cases, a
scheme that tolerates a small percentage of black-listed codes would
provide an increased recall (i.e. more true positives).  This, however,
incurs a reduced precision, since false positives are also more likely
to appear in binary files that contain large chunks of textual data.
"Polluted" plain text may, in fact, be regarded as binary material, on
which text conversions should not be performed.  Under this premise, it
is safe to say that the detection method provides a near-100% recall.

Experiments have been run on a large set of files of various
categories, including plain old texts, system logs, source code,
formatted office documents, compiled object code, etc.  The results
confirm the optimistic assumptions about the high accuracy, precision
and recall offered by this algorithm.
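
The classification rule can be sketched in C as follows.  This is an
illustrative implementation only, not the actual code used by the zip
tools; the function name and the buffer-based interface are assumptions
made for the sake of the example.

```c
#include <stddef.h>

/* Classify a memory buffer as plain text (1) or binary (0), following
 * the white-list/black-list rule described above.  Illustrative sketch;
 * not the actual zip tools implementation. */
int is_plain_text(const unsigned char *buf, size_t len)
{
    int has_white = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (c == 9 || c == 10 || c == 13 || c >= 32) {
            has_white = 1;      /* white list: TAB, LF, CR, SPACE..255 */
        } else if (c == 7 || c == 8 || c == 11 || c == 12 ||
                   c == 26 || c == 27) {
            /* gray list: tolerated; affects neither condition */
        } else {
            return 0;           /* black list: categorize as binary */
        }
    }
    return has_white;           /* empty input is not plain text */
}
```

Note that the scan may return early on the first black-listed byte, but
a "plain text" verdict requires examining the whole buffer; an empty
buffer contains no white-listed byte and is therefore labeled binary,
matching the boundary case discussed above.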


--
Cosmin Truta
Last updated: 2005-Feb-27