Charset detection in chaos

$Date: 2003/09/19 11:39:26 $

MURATA Makoto (FAMILY Given)

Previous version: 11 April 2002

0. Introduction

Given an XML or HTML document, how do we determine its charset? There has been a lot of discussion about this issue. Rather than repeating it, we have to begin with a better understanding of the current situation. Here is my attempt to clarify the current (messy) status. Although this memo is still incomplete, I hope it makes some useful points.

1. Textual resources

Many types of WWW resources are textual. Since many charsets are in use, we have to determine the charset of each textual resource so as to handle it correctly.

1.1 Documents

XML, HTML, and CSS of W3C have textual representations. Plain text is certainly textual.

1.2 Programs

Source programs are textual and written in some charset. When we transmit source programs or compile/execute them on the fly, charset issues will arise.

1.3 Generation of documents by programs

On the WWW server, programs generate documents on the fly. These programs have to specify charsets for such documents. APIs of most programming languages allow charset specification.

Furthermore, programs have to embed an in-band signature when one is necessary. Most APIs and programming languages do not provide any support for this. Rather, programmers have to use "print" carefully so as to create in-band signatures (e.g., meta tags).
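As a minimal sketch of the point above, the following Python function derives both the out-of-band declaration (the Content-Type header) and the in-band signature (the meta tag) from a single charset value, so that the two cannot drift apart. The function name render_html and its parameters are hypothetical and serve only as illustration; no particular server API is assumed.

```python
from html import escape


def render_html(title: str, body_text: str, charset: str = "utf-8"):
    """Return (headers, body_bytes) for a small HTML page.

    The same `charset` value is used three times: in the HTTP header,
    in the meta tag, and for the actual byte encoding of the page.
    """
    page = (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        # In-band signature, written by hand because most APIs offer no help.
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">\n'
        f"<title>{escape(title)}</title>\n"
        "</head><body>\n"
        f"<p>{escape(body_text)}</p>\n"
        "</body></html>\n"
    )
    # Out-of-band declaration: the charset parameter of the media type.
    headers = {"Content-Type": f"text/html; charset={charset}"}
    return headers, page.encode(charset)
```

A caller would send the returned headers and bytes unchanged; re-encoding the body later without also rewriting the meta tag reintroduces the inconsistency.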

1.4 Form data

Finally, text typed in forms of HTML and sent as multipart/form-data via HTTP also reqiure charset information.

2. Current situation

There are already too many methods for determining the charset. I show a list of such methods and further show which is used for which type of resource.

2.1 XML documents received from the HTTP server

Note: RFC 3023 certainly says A > B.
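The labels A and B are not defined in the text as it stands. Assuming that A denotes the charset parameter of the media type and B the encoding declaration inside the XML document (an assumption on my part), the precedence could be sketched in Python as follows. The function name xml_charset is hypothetical, and the sketch deliberately ignores BOM detection and media-type-specific defaults.

```python
import re
from typing import Optional

# charset parameter of a Content-Type value, e.g. 'application/xml; charset=UTF-8'
_PARAM = re.compile(r'charset\s*=\s*"?([^";\s]+)', re.IGNORECASE)

# encoding declaration at the very start of the entity; this byte-level
# pattern only works for ASCII-compatible encodings (not UTF-16, EBCDIC, ...)
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """Return the declared charset, preferring the out-of-band parameter."""
    if content_type:
        m = _PARAM.search(content_type)
        if m:
            return m.group(1).lower()  # the charset parameter wins
    m = _XML_DECL.match(body)
    if m:
        return m.group(1).decode("ascii").lower()  # in-band fallback
    return None  # nothing declared; the caller must apply a default
```

With this ordering, a document labelled Shift_JIS in-band but served with charset=UTF-8 is treated as UTF-8, which is exactly the case that surprises authors.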

2.2 HTML documents received from the HTTP server

Note: The HTML 4.01 recommendation blesses both A and B, but RFC 2854 (text/html) strongly recommends A only. However, RFC 2854 refers to HTML 4.

2.3 CSS stylesheets received from the HTTP server

Note: The CSS recommendation blesses both A and B, but RFC 2318 (text/css) merely mentions A.

2.4 XSLT stylesheets received from the HTTP server

Note: Use of C is incorrect.

2.5 plain text received from the HTTP server

2.6 XML documents which are stored at the server but have not been transmitted to the client yet

2.7 HTML documents which are stored at the server but have not been transmitted to the client yet

2.8 An HTML document that is generated at the server on the fly but has not been transmitted to the client yet

Note: Generating programs typically specify the charset *TWICE*: once for the charset of the output stream and once for generating meta tags.
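Because the charset is stated twice, the two statements can disagree. A minimal Python sketch of a consistency check follows; the function names are hypothetical, the meta tag is found with a simple regular expression, not a real HTML parser, and charset aliases (e.g. two different names for the same charset) are not reconciled.

```python
import re

# charset parameter in a Content-Type header value
_HEADER = re.compile(r'charset\s*=\s*"?([^";\s]+)', re.IGNORECASE)

# charset mentioned inside a <meta ...> tag; assumes an ASCII-compatible encoding
_META = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._-]+)""", re.IGNORECASE
)


def declared_charsets(content_type: str, body: bytes):
    """Return (header_charset, meta_charset); either may be None."""
    h = _HEADER.search(content_type)
    m = _META.search(body[:1024])  # only look near the start of the document
    return (
        h.group(1).lower() if h else None,
        m.group(1).decode("ascii").lower() if m else None,
    )


def is_consistent(content_type: str, body: bytes) -> bool:
    """True unless both declarations exist and name different charsets."""
    header, meta = declared_charsets(content_type, body)
    return header is None or meta is None or header == meta
```

Such a check is only a safety net; it reports the disagreement but cannot tell which of the two declarations matches the actual bytes.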

2.9 An HTML document temporarily created at the client by XSLT

Note: The encoding parameter of xsl:output can specify the charset. Moreover, when the output method is HTML, this parameter also generates an appropriate META tag.

2.10 CSS stylesheets which are stored at the server but have not been transmitted to the client yet

2.11 XSLT stylesheets which are stored at the server but have not been transmitted to the client yet

Note: Use of C is incorrect.

2.12 plain text which is stored at the server but has not been transmitted to the client yet

2.13 text typed in <textarea> or <input type="text"> of HTML and transmitted via HTTP

Note: Unfortunately, the charset parameter for parts of multipart/form-data is not widely implemented.
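As a sketch of what a receiving program has to do in practice, the following Python function decodes one form field. It honours the charset parameter of the part when it is present and otherwise falls back to a charset supplied by the caller, typically the charset of the page that contained the form. That fallback is an assumption about common practice, and the function name decode_form_field is hypothetical.

```python
from email.message import Message
from typing import Optional


def decode_form_field(
    payload: bytes, part_content_type: Optional[str], page_charset: str
) -> str:
    """Decode the raw bytes of one multipart/form-data part.

    part_content_type is the Content-Type header of the part, or None
    when the part carries no such header (the usual case for text fields).
    """
    charset = None
    if part_content_type:
        # Reuse the MIME header parser from the standard library.
        msg = Message()
        msg["Content-Type"] = part_content_type
        charset = msg.get_content_charset()  # None if no charset parameter
    # Fall back to the charset of the form page when nothing is declared.
    return payload.decode(charset or page_charset)
```

When the fallback is wrong, the decode either raises an error or silently produces garbage, which is why the missing parameter matters.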

2.14 file uploaded by <input type="file"> of HTML

Note: Unfortunately, the charset parameter for parts of multipart/form-data is not widely implemented.

2.15 Javascript, VBScript, etc. received from the HTTP server

Note 1: Since there are no media types for such programming languages, the charset parameter is not available.

Note 2: Since scripts in such programming languages contain many ASCII characters and only a small number of non-ASCII characters, guessing almost always fails.

Note 3: The referring resource may be an HTML document temporarily created by XSLT at the client side. Even when users create everything in Shift_JIS, the client may create a UTF-16 HTML document and assume that the referenced Javascript is also UTF-16.
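To illustrate Note 2, the following Python sketch tests which of several candidate charsets can decode a script without error. This is only a validity test, not a statistical detector, and the candidate list is an arbitrary choice for illustration. A script that is almost entirely ASCII usually passes under more than one candidate, so the test cannot single out the right charset.

```python
def plausible_codecs(
    data: bytes, candidates=("utf-8", "shift_jis", "euc_jp", "latin-1")
):
    """Return the candidate codecs under which `data` decodes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # byte sequence is invalid in this charset
        ok.append(name)
    return ok


# A mostly-ASCII script with one short Japanese string, saved as Shift_JIS.
script = 'var msg = "日本語";'.encode("shift_jis")
survivors = plausible_codecs(script)
```

Here the list of survivors contains the correct charset together with at least one wrong one, and a pure-ASCII script is accepted by every candidate.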

2.16 E-mail sent via SMTP

Note: The charset of E-mail received by and stored at the SMTP client is up to the mail program.

2.17 JSP pages

3. Misc

3.1 Database

Typically, web servers are front ends for database systems. Charset issues will arise especially because legacy data are in legacy charsets.

3.2 Content negotiation

We also have to consider content negotiation issues. If configuration of the charset parameter is difficult, the same thing applies to configuration for negotiation.

4. Concluding remarks

Unfortunately, charset detection is complicated, inconsistent, and incomprehensible. As a result, it is extremely difficult to internationalize Web applications. As far as I know, many WWW developers in Japan suffer.

I agree that we have to change the current situation. However, I also think that we can easily make the situation worse by shortsighted "improvements". I believe that we strongly need a principle.

In my understanding, I18N people at IETF and the I18N WG have always believed in authoritative use of the charset parameter. This is certainly a reasonable principle, since the charset parameter is generic to all text formats. Unfortunately, this approach has not worked very well. It might be possible to establish a different principle that promotes an in-band declaration mechanism generic to all (or most) text formats [2]. However, such a principle has not appeared, and each technology has its own ad-hoc mechanisms for specifying the charset.

[1] http://lists.w3.org/Archives/Public/www-tag/2002Jan/0177.html

[2] http://lists.w3.org/Archives/Public/www-tag/2003Apr/0104.html


(c)MURATA Makoto Last update: 2003-09-19