WWW server configuration for HTML, XML, CSS and other textual data

UCHIDA Akira

Abstract

It is difficult for WWW server administrators to configure WWW servers
correctly, and for WWW page authors to create WWW pages correctly,
when they have to manage Web document files in various character
encodings. One effective solution is to edit the server configuration
file so that a set of extended suffixes for Web document files is
bound to MIME types together with charset parameters. This memo
proposes a guideline for extending the suffixes of Web document files
to indicate their character encodings.

1. Introduction

Web pages on the WWW are written in many languages and many character
encodings, and a lot of WWW servers have not been configured to send
a charset parameter. One reason is that, in this age of a truly
"World Wide" Web, it is very difficult for a Webmaster to choose a
single default character encoding for a site. Even when one default
can be chosen, the server still has to be configured to send the
correct charset parameter for documents in other character encodings.

To provide the correct charset parameter, some Webmasters make a local
rule that binds suffixes to MIME types through the server
configuration mechanism: the suffix ".html" indicates the MIME type
"text/html; charset=iso-8859-1", the suffix ".html2" indicates
"text/html; charset=iso-8859-2", and so on [Bert Bos].

However, it is not easy for most people to choose different suffixes
for different charsets systematically, to provide appropriate
configuration files, and to write comprehensible manuals for WWW page
authors. This difficulty is one of the many reasons that most WWW
servers are not configured correctly.

To overcome this problem, we propose a set of file suffixes and
provide example configuration files for some widely used WWW servers.
These suffixes are systematically chosen, easy to remember, and ready
to use.
We hope that this proposal helps WWW server administrators to
configure WWW servers correctly and also helps WWW page authors to
create WWW pages correctly.

2. Goal

It is necessary and sufficient for our indicator to identify all
character encodings that are registered in the IANA registry (see
[HTTP1.1] section 3.4, [HTML4.0] section 5, [Text/xml] section 3.1,
[Text/css] section 4).

Design goals are:

- It shall be able to identify all character encodings that are
  registered in the IANA registry now.
- It shall be able to identify all character encodings that will be
  registered in the IANA registry in the future.
- Fewer digits or letters are better.

3. Proposal

We propose that a new charset indicator be chosen according to the
following rules, from highest priority to lowest:

   1. If the document's character encoding is US-ASCII, use "ascii"
      as the indicator.
   2. If the MIME name of the document's character encoding has fewer
      than seven letters or digits, or the character encoding is
      experimental, use the MIME name as the indicator.
   3. If the document's character encoding belongs to the ISO-8859
      series, use the ISO-8859 series identifier as the indicator.
   4. If the document's character encoding belongs to the ISO-2022
      series, use the ISO-2022 series identifier as the indicator.
   5. Otherwise, use the MIBenum value as the indicator.

The indicator can be used in either of two styles. In Mixed style, one
suffix such as ".htmlascii" represents both the MIME type and the
character encoding. In Separate style, doubled suffixes such as
".html.ascii" represent the MIME type and the character encoding
separately.

Note. Suffix ".html" and the HTTP default character encoding

For a conforming HTTP/1.1 WWW server whose default character encoding
for HTML files has not been changed, a ".html" file is served as
"charset=ISO-8859-1", because that is the HTTP default for text types
(see [HTTP1.1] section 3.7.1). Thus, either through our proposal or
through the default setting, the WWW server can provide an appropriate
charset parameter for ISO-8859-1 encoded HTML files.

3.1. US-ASCII encoded file

Separate style usage: ".*.ascii"
   (e.g., to indicate a US-ASCII encoded CSS file, the suffix will be
   ".css.ascii")

Mixed style usage: ".*ascii"
   (e.g., to indicate a US-ASCII encoded XML file, the suffix will be
   ".xmlascii")

3.2. MIME name as an indicator

Separate style usage: ".*.xx"
   (e.g., to indicate a UTF-8 encoded HTML file, the suffix will be
   ".html.utf-8")

Mixed style usage: ".*xx"
   (e.g., to indicate an EUC-JP encoded HTML file, the suffix will be
   ".htmleuc-jp")

3.3. ISO-8859 series identifiers as an indicator

Separate style usage: ".*.8859-x"
   (e.g., to indicate an ISO-8859-1 encoded XML file, the suffix will
   be ".xml.8859-1")

Preferred Mixed style usage: ".*8859-x"
   (e.g., to indicate an ISO-8859-2 encoded HTML file, the suffix will
   be ".html8859-2")

Alternative Mixed style usage: ".*-x"
   (e.g., to indicate an ISO-8859-2 encoded HTML file, the suffix will
   be ".html-2")

Advantages:
- Easy to remember.

Disadvantages:
- None.

3.4. ISO-2022 series identifiers as an indicator

Separate style usage: ".*.2022xx"
   (e.g., to indicate an ISO-2022-JP-2 encoded CSS file, the suffix
   will be ".css.2022jp2")

Preferred Mixed style usage: ".*2022xx"
   (e.g., to indicate an ISO-2022-JP-2 encoded HTML file, the suffix
   will be ".html2022jp2")

Alternative Mixed style usage: ".*xx"
   (e.g., to indicate an ISO-2022-JP-2 encoded HTML file, the suffix
   will be ".htmljp2")

Advantages:
- Easy to remember.

Disadvantages:
- None.

3.5. MIBenum values as an indicator

Separate style usage: ".*.mibxx"
   (e.g., to indicate a MIBenum 17 (Shift_JIS) encoded HTML file, the
   suffix will be ".html.mib17")

Mixed style usage: ".*mibxx"
   (e.g., to indicate a MIBenum 17 (Shift_JIS) encoded HTML file, the
   suffix will be ".htmlmib17")

Advantages:
- Absolutely unique.
- Can indicate all IANA registered charsets.

Disadvantages:
- Hard to remember.
- Four digits are needed to represent character encoding schemes that
  have not been standardized by any standards-setting organization.

4. Sample suffixes table for HTML file

4.1. Suffixes for HTML file in Separate style

   encoding        suffix
   US-ASCII        .html.ascii
   ISO-8859-1      .html.8859-1
   ISO-8859-2      .html.8859-2
   ISO-8859-3      .html.8859-3
   ISO-8859-4      .html.8859-4
   ISO-8859-5      .html.8859-5
   ISO-8859-6      .html.8859-6
   ISO-8859-7      .html.8859-7
   ISO-8859-8      .html.8859-8
   ISO-8859-9      .html.8859-9
   Shift_JIS       .html.mib17
   EUC-JP          .html.euc-jp
   ISO-2022-KR     .html.2022kr
   EUC-KR          .html.euc-kr
   ISO-2022-JP     .html.2022jp
   ISO-2022-JP-2   .html.2022jp2
   UTF-7           .html.utf-7
   UTF-8           .html.utf-8
   GB2312          .html.gb2312
   Big5            .html.big5
   KOI8-R          .html.koi8-r

4.2. Preferred Suffixes for HTML file in Mixed style

   encoding        suffix
   US-ASCII        .htmlascii
   ISO-8859-1      .html8859-1
   ISO-8859-2      .html8859-2
   ISO-8859-3      .html8859-3
   ISO-8859-4      .html8859-4
   ISO-8859-5      .html8859-5
   ISO-8859-6      .html8859-6
   ISO-8859-7      .html8859-7
   ISO-8859-8      .html8859-8
   ISO-8859-9      .html8859-9
   Shift_JIS       .htmlmib17
   EUC-JP          .htmleuc-jp
   ISO-2022-KR     .html2022kr
   EUC-KR          .htmleuc-kr
   ISO-2022-JP     .html2022jp
   ISO-2022-JP-2   .html2022jp2
   UTF-7           .htmlutf-7
   UTF-8           .htmlutf-8
   GB2312          .htmlgb2312
   Big5            .htmlbig5
   KOI8-R          .htmlkoi8-r

4.3. Alternative Suffixes for HTML file in Mixed style

   encoding        suffix
   US-ASCII        .htmlascii
   ISO-8859-1      .html-1
   ISO-8859-2      .html-2
   ISO-8859-3      .html-3
   ISO-8859-4      .html-4
   ISO-8859-5      .html-5
   ISO-8859-6      .html-6
   ISO-8859-7      .html-7
   ISO-8859-8      .html-8
   ISO-8859-9      .html-9
   Shift_JIS       .htmlmib17
   EUC-JP          .htmleuc-jp
   ISO-2022-KR     .htmlkr
   EUC-KR          .htmleuc-kr
   ISO-2022-JP     .htmljp
   ISO-2022-JP-2   .htmljp2
   UTF-7           .htmlutf-7
   UTF-8           .htmlutf-8
   GB2312          .htmlgb2312
   Big5            .htmlbig5
   KOI8-R          .htmlkoi8-r

5. Configuration Examples

5.1. Apache httpd in Separate style

The following is a sample configuration for [Apache] httpd using the
[AddCharset] directive.

    AddCharset US-ASCII      ascii
    AddCharset ISO-8859-1    8859-1
    AddCharset ISO-8859-2    8859-2
    AddCharset ISO-8859-3    8859-3
    AddCharset ISO-8859-4    8859-4
    AddCharset ISO-8859-5    8859-5
    AddCharset ISO-8859-6    8859-6
    AddCharset ISO-8859-7    8859-7
    AddCharset ISO-8859-8    8859-8
    AddCharset ISO-8859-9    8859-9
    AddCharset Shift_JIS     mib17
    AddCharset EUC-JP        euc-jp
    AddCharset ISO-2022-KR   2022kr
    AddCharset EUC-KR        euc-kr
    AddCharset ISO-2022-JP   2022jp
    AddCharset ISO-2022-JP-2 2022jp2
    AddCharset UTF-7         utf-7
    AddCharset UTF-8         utf-8
    AddCharset GB2312        gb2312
    AddCharset Big5          big5
    AddCharset KOI8-R        koi8-r

5.2. Apache httpd in Mixed style

The following is a sample configuration for [Apache] httpd.

    AddType "text/html; charset=US-ASCII" htmlascii
    AddType "text/html; charset=ISO-8859-1" html8859-1
    AddType "text/html; charset=ISO-8859-2" html8859-2
    AddType "text/html; charset=ISO-8859-3" html8859-3
    AddType "text/html; charset=ISO-8859-4" html8859-4
    AddType "text/html; charset=ISO-8859-5" html8859-5
    AddType "text/html; charset=ISO-8859-6" html8859-6
    AddType "text/html; charset=ISO-8859-7" html8859-7
    AddType "text/html; charset=ISO-8859-8" html8859-8
    AddType "text/html; charset=ISO-8859-9" html8859-9
    AddType "text/html; charset=Shift_JIS" htmlmib17
    AddType "text/html; charset=EUC-JP" htmleuc-jp
    AddType "text/html; charset=ISO-2022-KR" html2022kr
    AddType "text/html; charset=EUC-KR" htmleuc-kr
    AddType "text/html; charset=ISO-2022-JP" html2022jp
    AddType "text/html; charset=ISO-2022-JP-2" html2022jp2
    AddType "text/html; charset=UTF-7" htmlutf-7
    AddType "text/html; charset=UTF-8" htmlutf-8
    AddType "text/html; charset=GB2312" htmlgb2312
    AddType "text/html; charset=Big5" htmlbig5
    AddType "text/html; charset=KOI8-R" htmlkoi8-r

    AddType "text/xml; charset=US-ASCII" xmlascii
    AddType "text/xml; charset=ISO-8859-1" xml8859-1
    AddType "text/xml; charset=ISO-8859-2" xml8859-2
    AddType "text/xml; charset=ISO-8859-3" xml8859-3
    AddType "text/xml; charset=ISO-8859-4" xml8859-4
    AddType "text/xml; charset=ISO-8859-5" xml8859-5
    AddType "text/xml; charset=ISO-8859-6" xml8859-6
    AddType "text/xml; charset=ISO-8859-7" xml8859-7
    AddType "text/xml; charset=ISO-8859-8" xml8859-8
    AddType "text/xml; charset=ISO-8859-9" xml8859-9
    AddType "text/xml; charset=Shift_JIS" xmlmib17
    AddType "text/xml; charset=EUC-JP" xmleuc-jp
    AddType "text/xml; charset=ISO-2022-KR" xml2022kr
    AddType "text/xml; charset=EUC-KR" xmleuc-kr
    AddType "text/xml; charset=ISO-2022-JP" xml2022jp
    AddType "text/xml; charset=ISO-2022-JP-2" xml2022jp2
    AddType "text/xml; charset=UTF-7" xmlutf-7
    AddType "text/xml; charset=UTF-8" xmlutf-8
    AddType "text/xml; charset=GB2312" xmlgb2312
    AddType "text/xml; charset=Big5" xmlbig5
    AddType "text/xml; charset=KOI8-R" xmlkoi8-r

    AddType "text/css; charset=US-ASCII" cssascii
    AddType "text/css; charset=ISO-8859-1" css8859-1
    AddType "text/css; charset=ISO-8859-2" css8859-2
    AddType "text/css; charset=ISO-8859-3" css8859-3
    AddType "text/css; charset=ISO-8859-4" css8859-4
    AddType "text/css; charset=ISO-8859-5" css8859-5
    AddType "text/css; charset=ISO-8859-6" css8859-6
    AddType "text/css; charset=ISO-8859-7" css8859-7
    AddType "text/css; charset=ISO-8859-8" css8859-8
    AddType "text/css; charset=ISO-8859-9" css8859-9
    AddType "text/css; charset=Shift_JIS" cssmib17
    AddType "text/css; charset=EUC-JP" csseuc-jp
    AddType "text/css; charset=ISO-2022-KR" css2022kr
    AddType "text/css; charset=EUC-KR" csseuc-kr
    AddType "text/css; charset=ISO-2022-JP" css2022jp
    AddType "text/css; charset=ISO-2022-JP-2" css2022jp2
    AddType "text/css; charset=UTF-7" cssutf-7
    AddType "text/css; charset=UTF-8" cssutf-8
    AddType "text/css; charset=GB2312" cssgb2312
    AddType "text/css; charset=Big5" cssbig5
    AddType "text/css; charset=KOI8-R" csskoi8-r

5.3. CERN httpd in Mixed style

The following is a sample configuration for [CERN] httpd.

    AddType .htmlascii text/html;charset=US-ASCII 8bit
    AddType .html8859-1 text/html;charset=ISO-8859-1 8bit
    AddType .html8859-2 text/html;charset=ISO-8859-2 8bit
    AddType .html8859-3 text/html;charset=ISO-8859-3 8bit
    AddType .html8859-4 text/html;charset=ISO-8859-4 8bit
    AddType .html8859-5 text/html;charset=ISO-8859-5 8bit
    AddType .html8859-6 text/html;charset=ISO-8859-6 8bit
    AddType .html8859-7 text/html;charset=ISO-8859-7 8bit
    AddType .html8859-8 text/html;charset=ISO-8859-8 8bit
    AddType .html8859-9 text/html;charset=ISO-8859-9 8bit
    AddType .htmlmib17 text/html;charset=Shift_JIS 8bit
    AddType .htmleuc-jp text/html;charset=EUC-JP 8bit
    AddType .html2022kr text/html;charset=ISO-2022-KR 8bit
    AddType .htmleuc-kr text/html;charset=EUC-KR 8bit
    AddType .html2022jp text/html;charset=ISO-2022-JP 8bit
    AddType .html2022jp2 text/html;charset=ISO-2022-JP-2 8bit
    AddType .htmlutf-7 text/html;charset=UTF-7 8bit
    AddType .htmlutf-8 text/html;charset=UTF-8 8bit
    AddType .htmlgb2312 text/html;charset=GB2312 8bit
    AddType .htmlbig5 text/html;charset=Big5 8bit
    AddType .htmlkoi8-r text/html;charset=KOI8-R 8bit

    AddType .xmlascii text/xml;charset=US-ASCII 8bit
    AddType .xml8859-1 text/xml;charset=ISO-8859-1 8bit
    AddType .xml8859-2 text/xml;charset=ISO-8859-2 8bit
    AddType .xml8859-3 text/xml;charset=ISO-8859-3 8bit
    AddType .xml8859-4 text/xml;charset=ISO-8859-4 8bit
    AddType .xml8859-5 text/xml;charset=ISO-8859-5 8bit
    AddType .xml8859-6 text/xml;charset=ISO-8859-6 8bit
    AddType .xml8859-7 text/xml;charset=ISO-8859-7 8bit
    AddType .xml8859-8 text/xml;charset=ISO-8859-8 8bit
    AddType .xml8859-9 text/xml;charset=ISO-8859-9 8bit
    AddType .xmlmib17 text/xml;charset=Shift_JIS 8bit
    AddType .xmleuc-jp text/xml;charset=EUC-JP 8bit
    AddType .xml2022kr text/xml;charset=ISO-2022-KR 8bit
    AddType .xmleuc-kr text/xml;charset=EUC-KR 8bit
    AddType .xml2022jp text/xml;charset=ISO-2022-JP 8bit
    AddType .xml2022jp2 text/xml;charset=ISO-2022-JP-2 8bit
    AddType .xmlutf-7 text/xml;charset=UTF-7 8bit
    AddType .xmlutf-8 text/xml;charset=UTF-8 8bit
    AddType .xmlgb2312 text/xml;charset=GB2312 8bit
    AddType .xmlbig5 text/xml;charset=Big5 8bit
    AddType .xmlkoi8-r text/xml;charset=KOI8-R 8bit

    AddType .cssascii text/css;charset=US-ASCII 8bit
    AddType .css8859-1 text/css;charset=ISO-8859-1 8bit
    AddType .css8859-2 text/css;charset=ISO-8859-2 8bit
    AddType .css8859-3 text/css;charset=ISO-8859-3 8bit
    AddType .css8859-4 text/css;charset=ISO-8859-4 8bit
    AddType .css8859-5 text/css;charset=ISO-8859-5 8bit
    AddType .css8859-6 text/css;charset=ISO-8859-6 8bit
    AddType .css8859-7 text/css;charset=ISO-8859-7 8bit
    AddType .css8859-8 text/css;charset=ISO-8859-8 8bit
    AddType .css8859-9 text/css;charset=ISO-8859-9 8bit
    AddType .cssmib17 text/css;charset=Shift_JIS 8bit
    AddType .csseuc-jp text/css;charset=EUC-JP 8bit
    AddType .css2022kr text/css;charset=ISO-2022-KR 8bit
    AddType .csseuc-kr text/css;charset=EUC-KR 8bit
    AddType .css2022jp text/css;charset=ISO-2022-JP 8bit
    AddType .css2022jp2 text/css;charset=ISO-2022-JP-2 8bit
    AddType .cssutf-7 text/css;charset=UTF-7 8bit
    AddType .cssutf-8 text/css;charset=UTF-8 8bit
    AddType .cssgb2312 text/css;charset=GB2312 8bit
    AddType .cssbig5 text/css;charset=Big5 8bit
    AddType .csskoi8-r text/css;charset=KOI8-R 8bit

6. Notice

This proposal is no more than one of many possible ways to configure
servers. For example, one WWW server administrator may use the
[AddCharset] patch for Apache 1.3.3, while another may permit authors
of HTML documents to use ".htaccess" files to configure the types of
their own HTML files. We welcome any additional configurations that
maximize usability for a specific purpose.

If your WWW server can be configured to specify the MIME type and the
charset separately, we encourage you to configure it in Separate
style, because Mixed style configuration leads to a combinatorial
explosion: every combination of MIME type and charset needs its own
suffix and its own configuration line.

7. References

[Bert Bos]    Bert Bos, Creating a multilingual site with the
              CERN-httpd server, 1996,
              http://www.w3.org/International/O-help-CERN.html

[HTTP1.1]     W3C, Hypertext Transfer Protocol -- HTTP/1.1, RFC 2068,
              1997, http://www.w3.org/Protocols/rfc2068/rfc2068

[HTML4.0]     W3C, HTML 4.0 Specification Recommendation, 1997-1998,
              http://www.w3.org/TR/REC-html40/

[Text/css]    H. Lie, B. Bos, C. Lilley, The text/css Media Type,
              RFC 2318, 1998, ftp://ftp.isi.edu/in-notes/rfc2318.txt

[Text/xml]    E. Whitehead, M. Murata, XML Media Types, RFC 2376,
              1998, ftp://ftp.isi.edu/in-notes/rfc2376.txt

[IANA]        IANA, Character sets,
              http://www.isi.edu/in-notes/iana/assignments/character-sets

[Apache]      Apache HTTP Server Project, Apache 1.3 User's Guide,
              http://www.apache.org/docs/

[CERN]        W3C, CERN httpd, http://www.w3.org/Daemon/

[AddCharset]  KOGA Youichirou, Koga's Apache page, 1998,
              http://www.isoternet.org/~y-koga/Apache/

Author's address

UCHIDA Akira
Hachiman 2-11-1-101, Aoba-ku, Sendai, Japan
Email: uchida@happy.email.ne.jp
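
Illustration. The priority rules of section 3 can be sketched as a
short Python program. This is only a sketch under stated assumptions,
not part of the proposal: the function names charset_indicator and
suffix are hypothetical, an "experimental" encoding is assumed to be
one whose MIME name starts with "x-", indicators are written in
lowercase as in the sample tables, and the MIB number for rule 5 must
be looked up in the IANA registry and passed in by the caller.

```python
def charset_indicator(mime_name, mib_number=None):
    """Pick a suffix indicator using the section 3 priorities (sketch)."""
    name = mime_name.upper()

    # Rule 1: US-ASCII has a fixed indicator.
    if name == "US-ASCII":
        return "ascii"

    # Rule 2: MIME names with fewer than seven letters or digits, and
    # experimental names, are used as they are.
    # Assumption: "experimental" means the name starts with "x-".
    letters_and_digits = sum(ch.isalnum() for ch in name)
    if letters_and_digits < 7 or name.startswith("X-"):
        return mime_name.lower()

    # Rule 3: ISO-8859 series, e.g. ISO-8859-2 -> 8859-2.
    if name.startswith("ISO-8859-"):
        return "8859-" + name[len("ISO-8859-"):].lower()

    # Rule 4: ISO-2022 series, e.g. ISO-2022-JP-2 -> 2022jp2.
    if name.startswith("ISO-2022-"):
        return "2022" + name[len("ISO-2022-"):].replace("-", "").lower()

    # Rule 5: everything else falls back to the registry's MIB number.
    if mib_number is None:
        raise ValueError("a MIB number is required for " + mime_name)
    return "mib%d" % mib_number


def suffix(base, indicator, style="separate"):
    """Build a file suffix for a base type such as "html" or "css"."""
    if style == "separate":
        return "." + base + "." + indicator
    if style == "mixed":
        return "." + base + indicator
    raise ValueError("style must be 'separate' or 'mixed'")


if __name__ == "__main__":
    # Shift_JIS is the only entry that needs a MIB number (17, as in
    # the sample tables); the others are resolved by rules 1 to 4.
    samples = [("US-ASCII", None), ("ISO-8859-2", None),
               ("ISO-2022-JP-2", None), ("UTF-8", None),
               ("Shift_JIS", 17)]
    for mime_name, mib_number in samples:
        indicator = charset_indicator(mime_name, mib_number)
        print(mime_name, suffix("html", indicator),
              suffix("html", indicator, "mixed"))
```

The rules are checked in priority order, so a name such as EUC-JP is
caught by rule 2 before the MIB number fallback is ever consulted. The
Alternative Mixed style is deliberately left out of this sketch.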