WWW server configuration for HTML, XML, CSS
and other textual data
UCHIDA Akira
Abstract
It is difficult for WWW server administrators to configure WWW servers
correctly, and for WWW page authors to create WWW pages correctly, when
they have to manage Web document files in various character encodings.
One effective solution is to edit the server configuration file so that
a set of extended suffixes for Web document files is bound to MIME types
together with charset parameters. This memo proposes a guideline for
extending the suffixes of Web document files to indicate their
character encodings.
1. Introduction
The WWW contains Web pages written in many languages and many character
encodings, yet a lot of WWW servers have not been configured to send a
charset parameter. One reason is that, in this "World Wide" Web age, it
is very difficult for a Webmaster to choose a single default character
encoding for his/her site.
Even if he/she can choose one default character encoding, he/she still
has to configure the server to send the correct charset parameter for
documents in other character encodings.
To provide the correct charset parameter, some Webmasters make a local
rule that binds suffixes to MIME types using the server configuration
mechanism: suffix ".html" indicates MIME type
"text/html; charset=iso-8859-1", suffix ".html2" indicates MIME type
"text/html; charset=iso-8859-2", etc. [Bert Bos]
However, it is not easy for most people to systematically choose
different suffixes for different charsets, provide appropriate
configuration files, and write comprehensible manuals for WWW page
authors. This difficulty is one of the many reasons that most WWW
servers are not configured correctly.
To overcome this problem, we propose a set of file suffixes and provide
example configuration files for some widely-used WWW servers. These
suffixes are systematically chosen, easy to remember, and ready to use.
We hope that this proposal helps WWW server administrators to configure
WWW servers correctly and also helps WWW page authors to create WWW pages
correctly.
2. Goal
It is necessary and sufficient for our indicator to identify all
character encodings that are registered in the IANA registry [IANA]
(see [HTTP1.1] section 3.4, [HTML4.0] section 5, [Text/xml] section 3.1,
[Text/css] section 4).
Design goals are:
- It shall be able to identify all character encodings that are
registered in the IANA registry now.
- It shall be able to identify all character encodings that will be
registered in the IANA registry in the future.
- The fewer letters or digits an indicator needs, the better.
3. Proposal
We propose that the following priorities be observed when choosing a
new charset indicator (from highest priority to lowest):
1. If the document's character encoding is US-ASCII, use "ASCII"
as an indicator.
2. If the MIME name of the document's character encoding has fewer
than seven letters or digits, or the document's character encoding
is an experimental character encoding, use the MIME name as an
indicator.
3. If the document's character encoding belongs to the ISO-8859
series, use the ISO-8859 series identifier as an indicator.
4. If the document's character encoding belongs to the ISO-2022
series, use the ISO-2022 series identifier as an indicator.
5. Otherwise, use the MIBnum value (listed as "MIBenum" in the IANA
registry) as an indicator.
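The priority rules above can be sketched as a small function. This is
only an illustrative sketch in Python, not part of the proposal: the
function name charset_indicator is hypothetical, a MIME name starting
with "x-" is assumed to mark an experimental encoding, hyphens and
underscores are assumed not to count as letters or digits, and the
MIBnum value has to be supplied by the caller from the IANA registry.

```python
import re


def charset_indicator(mime_name, mibnum=None):
    """Return a charset indicator for a MIME charset name (sketch)."""
    name = mime_name.lower()

    # Rule 1: US-ASCII gets the fixed indicator "ascii".
    if name == "us-ascii":
        return "ascii"

    # Rule 2: short names (fewer than seven letters or digits) and
    # experimental names (assumed here to start with "x-") are used as is.
    letters_and_digits = sum(ch.isalnum() for ch in name)
    if letters_and_digits < 7 or name.startswith("x-"):
        return name

    # Rule 3: ISO-8859 series, e.g. "iso-8859-2" becomes "8859-2".
    match = re.fullmatch(r"iso-8859-(\d+)", name)
    if match:
        return "8859-" + match.group(1)

    # Rule 4: ISO-2022 series, e.g. "iso-2022-jp-2" becomes "2022jp2".
    if name.startswith("iso-2022-"):
        return "2022" + name[len("iso-2022-"):].replace("-", "")

    # Rule 5: everything else falls back to the MIBnum value.
    if mibnum is None:
        raise ValueError("a MIBnum value is required for " + mime_name)
    return "mib%d" % mibnum
```

With these assumptions, charset_indicator("ISO-2022-JP-2") gives
"2022jp2" and charset_indicator("Shift_JIS", 17) gives "mib17", which
agrees with the sample tables in section 4.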
The indicator can be used in either Mixed style (a single suffix such
as ".htmlascii" represents both the MIME type and the character
encoding) or Separate style (a double suffix such as ".html.ascii"
represents the MIME type and the character encoding separately).
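The difference between the two styles is only how the indicator is
attached to the media-type suffix. A minimal Python sketch (the function
name build_suffix and its parameters are hypothetical, introduced only
for illustration):

```python
def build_suffix(type_suffix, indicator, style="separate"):
    """Combine a media-type suffix such as "html" with a charset indicator."""
    if style == "separate":
        # Two suffixes joined by a dot, e.g. ".html.ascii".
        return "." + type_suffix + "." + indicator
    if style == "mixed":
        # One fused suffix, e.g. ".htmlascii".
        return "." + type_suffix + indicator
    raise ValueError("style must be 'separate' or 'mixed'")
```

For example, build_suffix("html", "ascii") yields ".html.ascii" and
build_suffix("html", "ascii", "mixed") yields ".htmlascii".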
Note. Suffix ".html" and the HTTP default character encoding
A conforming HTTP/1.1 WWW server whose default character encoding for
HTML files has not been changed will serve a ".html" file as
"charset=ISO-8859-1", the HTTP/1.1 default for "text" media types (see
[HTTP1.1] section 3.7.1). Either by our proposal or by the default
setting, the WWW server can therefore provide an appropriate charset
parameter for ISO-8859-1 encoded HTML files.
3.1. US-ASCII encoded file
Separate style usage: ".*.ascii"
(e.g., to indicate a US-ASCII encoded CSS file, the suffix will be
".css.ascii")
Mixed style usage: ".*ascii"
(e.g., to indicate a US-ASCII encoded XML file, the suffix will be
".xmlascii")
3.2. MIME name as an indicator
Separate style usage: ".*.xx"
(e.g., to indicate a UTF-8 encoded HTML file, the suffix will be
".html.utf-8")
Mixed style usage: ".*xx"
(e.g., to indicate an EUC-JP encoded HTML file, the suffix will be
".htmleuc-jp")
3.3. ISO-8859 series identifiers as an indicator
Separate style usage: ".*.8859-x"
(e.g., to indicate an ISO-8859-1 encoded XML file, the suffix will be
".xml.8859-1")
Preferred Mixed style usage: ".*8859-x"
(e.g., to indicate an ISO-8859-2 encoded HTML file, the suffix will be
".html8859-2")
Alternative Mixed style usage: ".*-x"
(e.g., to indicate an ISO-8859-2 encoded HTML file, the suffix will be
".html-2")
Advantages:
- Easy to remember.
Disadvantages:
- None.
3.4. ISO-2022 series identifiers as an indicator
Separate style usage: ".*.2022xx"
(e.g., to indicate an ISO-2022-JP-2 encoded CSS file, the suffix will be
".css.2022jp2")
Preferred Mixed style usage: ".*2022xx"
(e.g., to indicate an ISO-2022-JP-2 encoded HTML file, the suffix will
be ".html2022jp2")
Alternative Mixed style usage: ".*xx"
(e.g., to indicate an ISO-2022-JP-2 encoded HTML file, the suffix will
be ".htmljp2")
Advantages:
- Easy to remember.
Disadvantages:
- None.
3.5. MIBnum values as an indicator
Separate style usage: ".*.mibxx"
(e.g., to indicate a MIBnum 17 (Shift_JIS) encoded HTML file, the suffix
will be ".html.mib17")
Mixed style usage: ".*mibxx"
(e.g., to indicate a MIBnum 17 (Shift_JIS) encoded HTML file, the suffix
will be ".htmlmib17")
Advantages:
- Absolutely unique.
- Can indicate all IANA registered charsets.
Disadvantages:
- Hard to remember.
- Four digits are needed to represent character encodings that have
not been standardized by any standards-setting organization (their
MIBnum values are 2000 or greater).
4. Sample suffix tables for HTML files
4.1. Suffixes for HTML files in Separate style
encoding        suffix
US-ASCII        .html.ascii
ISO-8859-1      .html.8859-1
ISO-8859-2      .html.8859-2
ISO-8859-3      .html.8859-3
ISO-8859-4      .html.8859-4
ISO-8859-5      .html.8859-5
ISO-8859-6      .html.8859-6
ISO-8859-7      .html.8859-7
ISO-8859-8      .html.8859-8
ISO-8859-9      .html.8859-9
Shift_JIS       .html.mib17
EUC-JP          .html.euc-jp
ISO-2022-KR     .html.2022kr
EUC-KR          .html.euc-kr
ISO-2022-JP     .html.2022jp
ISO-2022-JP-2   .html.2022jp2
UTF-7           .html.utf-7
UTF-8           .html.utf-8
GB2312          .html.gb2312
Big5            .html.big5
KOI8-R          .html.koi8-r
4.2. Preferred suffixes for HTML files in Mixed style
encoding        suffix
US-ASCII        .htmlascii
ISO-8859-1      .html8859-1
ISO-8859-2      .html8859-2
ISO-8859-3      .html8859-3
ISO-8859-4      .html8859-4
ISO-8859-5      .html8859-5
ISO-8859-6      .html8859-6
ISO-8859-7      .html8859-7
ISO-8859-8      .html8859-8
ISO-8859-9      .html8859-9
Shift_JIS       .htmlmib17
EUC-JP          .htmleuc-jp
ISO-2022-KR     .html2022kr
EUC-KR          .htmleuc-kr
ISO-2022-JP     .html2022jp
ISO-2022-JP-2   .html2022jp2
UTF-7           .htmlutf-7
UTF-8           .htmlutf-8
GB2312          .htmlgb2312
Big5            .htmlbig5
KOI8-R          .htmlkoi8-r
4.3. Alternative suffixes for HTML files in Mixed style
encoding        suffix
US-ASCII        .htmlascii
ISO-8859-1      .html-1
ISO-8859-2      .html-2
ISO-8859-3      .html-3
ISO-8859-4      .html-4
ISO-8859-5      .html-5
ISO-8859-6      .html-6
ISO-8859-7      .html-7
ISO-8859-8      .html-8
ISO-8859-9      .html-9
Shift_JIS       .htmlmib17
EUC-JP          .htmleuc-jp
ISO-2022-KR     .htmlkr
EUC-KR          .htmleuc-kr
ISO-2022-JP     .htmljp
ISO-2022-JP-2   .htmljp2
UTF-7           .htmlutf-7
UTF-8           .htmlutf-8
GB2312          .htmlgb2312
Big5            .htmlbig5
KOI8-R          .htmlkoi8-r
5. Configuration examples
5.1. Apache httpd in Separate style
The following is a sample configuration list for [Apache] httpd using
[AddCharset].
AddCharset US-ASCII ascii
AddCharset ISO-8859-1 8859-1
AddCharset ISO-8859-2 8859-2
AddCharset ISO-8859-3 8859-3
AddCharset ISO-8859-4 8859-4
AddCharset ISO-8859-5 8859-5
AddCharset ISO-8859-6 8859-6
AddCharset ISO-8859-7 8859-7
AddCharset ISO-8859-8 8859-8
AddCharset ISO-8859-9 8859-9
AddCharset Shift_JIS mib17
AddCharset EUC-JP euc-jp
AddCharset ISO-2022-KR 2022kr
AddCharset EUC-KR euc-kr
AddCharset ISO-2022-JP 2022jp
AddCharset ISO-2022-JP-2 2022jp2
AddCharset UTF-7 utf-7
AddCharset UTF-8 utf-8
AddCharset GB2312 gb2312
AddCharset Big5 big5
AddCharset KOI8-R koi8-r
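To illustrate how a Separate style file name is meant to be read, the
following Python sketch treats each dot-separated suffix independently
and combines the results into a Content-Type value. It is not Apache's
actual implementation. The function name content_type and both lookup
tables are hypothetical; TYPES assumes that the server already maps
"html", "xml" and "css" to their media types, and CHARSETS holds only a
few entries from the list above.

```python
# Hypothetical lookup tables. TYPES stands for media-type mappings that
# the server is assumed to have; CHARSETS mirrors a few AddCharset lines.
TYPES = {"html": "text/html", "xml": "text/xml", "css": "text/css"}
CHARSETS = {
    "ascii": "US-ASCII",
    "8859-1": "ISO-8859-1",
    "mib17": "Shift_JIS",
    "euc-jp": "EUC-JP",
    "2022jp": "ISO-2022-JP",
    "utf-8": "UTF-8",
}


def content_type(filename):
    """Return a Content-Type value for a file name, or None if unknown."""
    media_type = None
    charset = None
    # Skip the base name; every remaining piece is one suffix.
    for suffix in filename.lower().split(".")[1:]:
        if suffix in TYPES:
            media_type = TYPES[suffix]
        elif suffix in CHARSETS:
            charset = CHARSETS[suffix]
    if media_type is None:
        return None
    if charset is None:
        return media_type
    return media_type + "; charset=" + charset
```

Under these assumptions, content_type("index.html.utf-8") returns
"text/html; charset=UTF-8", and the same charset table serves HTML, XML
and CSS files alike.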
5.2. Apache httpd in Mixed style
The following is a sample configuration list for [Apache] httpd.
AddType "text/html; charset=US-ASCII" htmlascii
AddType "text/html; charset=ISO-8859-1" html8859-1
AddType "text/html; charset=ISO-8859-2" html8859-2
AddType "text/html; charset=ISO-8859-3" html8859-3
AddType "text/html; charset=ISO-8859-4" html8859-4
AddType "text/html; charset=ISO-8859-5" html8859-5
AddType "text/html; charset=ISO-8859-6" html8859-6
AddType "text/html; charset=ISO-8859-7" html8859-7
AddType "text/html; charset=ISO-8859-8" html8859-8
AddType "text/html; charset=ISO-8859-9" html8859-9
AddType "text/html; charset=Shift_JIS" htmlmib17
AddType "text/html; charset=EUC-JP" htmleuc-jp
AddType "text/html; charset=ISO-2022-KR" html2022kr
AddType "text/html; charset=EUC-KR" htmleuc-kr
AddType "text/html; charset=ISO-2022-JP" html2022jp
AddType "text/html; charset=ISO-2022-JP-2" html2022jp2
AddType "text/html; charset=UTF-7" htmlutf-7
AddType "text/html; charset=UTF-8" htmlutf-8
AddType "text/html; charset=GB2312" htmlgb2312
AddType "text/html; charset=Big5" htmlbig5
AddType "text/html; charset=KOI8-R" htmlkoi8-r
AddType "text/xml; charset=US-ASCII" xmlascii
AddType "text/xml; charset=ISO-8859-1" xml8859-1
AddType "text/xml; charset=ISO-8859-2" xml8859-2
AddType "text/xml; charset=ISO-8859-3" xml8859-3
AddType "text/xml; charset=ISO-8859-4" xml8859-4
AddType "text/xml; charset=ISO-8859-5" xml8859-5
AddType "text/xml; charset=ISO-8859-6" xml8859-6
AddType "text/xml; charset=ISO-8859-7" xml8859-7
AddType "text/xml; charset=ISO-8859-8" xml8859-8
AddType "text/xml; charset=ISO-8859-9" xml8859-9
AddType "text/xml; charset=Shift_JIS" xmlmib17
AddType "text/xml; charset=EUC-JP" xmleuc-jp
AddType "text/xml; charset=ISO-2022-KR" xml2022kr
AddType "text/xml; charset=EUC-KR" xmleuc-kr
AddType "text/xml; charset=ISO-2022-JP" xml2022jp
AddType "text/xml; charset=ISO-2022-JP-2" xml2022jp2
AddType "text/xml; charset=UTF-7" xmlutf-7
AddType "text/xml; charset=UTF-8" xmlutf-8
AddType "text/xml; charset=GB2312" xmlgb2312
AddType "text/xml; charset=Big5" xmlbig5
AddType "text/xml; charset=KOI8-R" xmlkoi8-r
AddType "text/css; charset=US-ASCII" cssascii
AddType "text/css; charset=ISO-8859-1" css8859-1
AddType "text/css; charset=ISO-8859-2" css8859-2
AddType "text/css; charset=ISO-8859-3" css8859-3
AddType "text/css; charset=ISO-8859-4" css8859-4
AddType "text/css; charset=ISO-8859-5" css8859-5
AddType "text/css; charset=ISO-8859-6" css8859-6
AddType "text/css; charset=ISO-8859-7" css8859-7
AddType "text/css; charset=ISO-8859-8" css8859-8
AddType "text/css; charset=ISO-8859-9" css8859-9
AddType "text/css; charset=Shift_JIS" cssmib17
AddType "text/css; charset=EUC-JP" csseuc-jp
AddType "text/css; charset=ISO-2022-KR" css2022kr
AddType "text/css; charset=EUC-KR" csseuc-kr
AddType "text/css; charset=ISO-2022-JP" css2022jp
AddType "text/css; charset=ISO-2022-JP-2" css2022jp2
AddType "text/css; charset=UTF-7" cssutf-7
AddType "text/css; charset=UTF-8" cssutf-8
AddType "text/css; charset=GB2312" cssgb2312
AddType "text/css; charset=Big5" cssbig5
AddType "text/css; charset=KOI8-R" csskoi8-r
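Because every line in this list follows one pattern, the list can be
generated rather than typed by hand. The Python sketch below is one
possible way to do so; the function name mixed_addtype_lines and the two
shortened sample tables are hypothetical and would need to be extended
to cover all the charsets listed above.

```python
# Shortened, hypothetical sample tables: (suffix part, value) pairs.
TYPES = [("html", "text/html"), ("xml", "text/xml"), ("css", "text/css")]
CHARSETS = [
    ("ascii", "US-ASCII"),
    ("8859-1", "ISO-8859-1"),
    ("mib17", "Shift_JIS"),
    ("utf-8", "UTF-8"),
]


def mixed_addtype_lines(types, charsets):
    """Build one Mixed style AddType line per (media type, charset) pair."""
    lines = []
    for type_suffix, media_type in types:
        for indicator, charset in charsets:
            lines.append(
                'AddType "%s; charset=%s" %s%s'
                % (media_type, charset, type_suffix, indicator)
            )
    return lines


if __name__ == "__main__":
    for line in mixed_addtype_lines(TYPES, CHARSETS):
        print(line)
```

The nested loop also shows why the list grows quickly: each additional
charset adds one line for every media type.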
5.3. CERN httpd in Mixed style
The following is a sample configuration list for [CERN] httpd.
AddType .htmlascii text/html;charset=US-ASCII 8bit
AddType .html8859-1 text/html;charset=ISO-8859-1 8bit
AddType .html8859-2 text/html;charset=ISO-8859-2 8bit
AddType .html8859-3 text/html;charset=ISO-8859-3 8bit
AddType .html8859-4 text/html;charset=ISO-8859-4 8bit
AddType .html8859-5 text/html;charset=ISO-8859-5 8bit
AddType .html8859-6 text/html;charset=ISO-8859-6 8bit
AddType .html8859-7 text/html;charset=ISO-8859-7 8bit
AddType .html8859-8 text/html;charset=ISO-8859-8 8bit
AddType .html8859-9 text/html;charset=ISO-8859-9 8bit
AddType .htmlmib17 text/html;charset=Shift_JIS 8bit
AddType .htmleuc-jp text/html;charset=EUC-JP 8bit
AddType .html2022kr text/html;charset=ISO-2022-KR 8bit
AddType .htmleuc-kr text/html;charset=EUC-KR 8bit
AddType .html2022jp text/html;charset=ISO-2022-JP 8bit
AddType .html2022jp2 text/html;charset=ISO-2022-JP-2 8bit
AddType .htmlutf-7 text/html;charset=UTF-7 8bit
AddType .htmlutf-8 text/html;charset=UTF-8 8bit
AddType .htmlgb2312 text/html;charset=GB2312 8bit
AddType .htmlbig5 text/html;charset=Big5 8bit
AddType .htmlkoi8-r text/html;charset=KOI8-R 8bit
AddType .xmlascii text/xml;charset=US-ASCII 8bit
AddType .xml8859-1 text/xml;charset=ISO-8859-1 8bit
AddType .xml8859-2 text/xml;charset=ISO-8859-2 8bit
AddType .xml8859-3 text/xml;charset=ISO-8859-3 8bit
AddType .xml8859-4 text/xml;charset=ISO-8859-4 8bit
AddType .xml8859-5 text/xml;charset=ISO-8859-5 8bit
AddType .xml8859-6 text/xml;charset=ISO-8859-6 8bit
AddType .xml8859-7 text/xml;charset=ISO-8859-7 8bit
AddType .xml8859-8 text/xml;charset=ISO-8859-8 8bit
AddType .xml8859-9 text/xml;charset=ISO-8859-9 8bit
AddType .xmlmib17 text/xml;charset=Shift_JIS 8bit
AddType .xmleuc-jp text/xml;charset=EUC-JP 8bit
AddType .xml2022kr text/xml;charset=ISO-2022-KR 8bit
AddType .xmleuc-kr text/xml;charset=EUC-KR 8bit
AddType .xml2022jp text/xml;charset=ISO-2022-JP 8bit
AddType .xml2022jp2 text/xml;charset=ISO-2022-JP-2 8bit
AddType .xmlutf-7 text/xml;charset=UTF-7 8bit
AddType .xmlutf-8 text/xml;charset=UTF-8 8bit
AddType .xmlgb2312 text/xml;charset=GB2312 8bit
AddType .xmlbig5 text/xml;charset=Big5 8bit
AddType .xmlkoi8-r text/xml;charset=KOI8-R 8bit
AddType .cssascii text/css;charset=US-ASCII 8bit
AddType .css8859-1 text/css;charset=ISO-8859-1 8bit
AddType .css8859-2 text/css;charset=ISO-8859-2 8bit
AddType .css8859-3 text/css;charset=ISO-8859-3 8bit
AddType .css8859-4 text/css;charset=ISO-8859-4 8bit
AddType .css8859-5 text/css;charset=ISO-8859-5 8bit
AddType .css8859-6 text/css;charset=ISO-8859-6 8bit
AddType .css8859-7 text/css;charset=ISO-8859-7 8bit
AddType .css8859-8 text/css;charset=ISO-8859-8 8bit
AddType .css8859-9 text/css;charset=ISO-8859-9 8bit
AddType .cssmib17 text/css;charset=Shift_JIS 8bit
AddType .csseuc-jp text/css;charset=EUC-JP 8bit
AddType .css2022kr text/css;charset=ISO-2022-KR 8bit
AddType .csseuc-kr text/css;charset=EUC-KR 8bit
AddType .css2022jp text/css;charset=ISO-2022-JP 8bit
AddType .css2022jp2 text/css;charset=ISO-2022-JP-2 8bit
AddType .cssutf-7 text/css;charset=UTF-7 8bit
AddType .cssutf-8 text/css;charset=UTF-8 8bit
AddType .cssgb2312 text/css;charset=GB2312 8bit
AddType .cssbig5 text/css;charset=Big5 8bit
AddType .csskoi8-r text/css;charset=KOI8-R 8bit
6. Notice
This proposal is no more than one of the many possible ways to
configure servers. For example, one WWW server administrator may use
the [AddCharset] patch for Apache 1.3.3, another may permit authors of
HTML documents to use a ".htaccess" file for their own HTML file type
configuration, and so on. We welcome any additional configurations
that maximize usability for a specific purpose.
If your WWW server can be configured to specify the MIME type and the
charset separately, we encourage you to configure it in Separate
style, because Mixed style configuration leads to a combinatorial
explosion.
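The growth can be made concrete with a little arithmetic. The Python
sketch below (the function name directive_counts is hypothetical)
assumes that Separate style needs one directive per media type plus one
per charset, while Mixed style needs one directive per pair of media
type and charset.

```python
def directive_counts(num_types, num_charsets):
    """Return (separate, mixed) directive counts for a configuration."""
    separate = num_types + num_charsets  # one line per type, one per charset
    mixed = num_types * num_charsets     # one line per (type, charset) pair
    return separate, mixed
```

For the 3 media types and 21 charsets used in section 5,
directive_counts(3, 21) gives (24, 63); the 63 matches the number of
AddType lines in each Mixed style list above.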
7. References
[Bert Bos]
Bert Bos,
Creating a multilingual site with the CERN-httpd server, 1996,
http://www.w3.org/International/O-help-CERN.html
[HTTP1.1]
W3C,
Hypertext Transfer Protocol -- HTTP/1.1, RFC 2068, 1997,
http://www.w3.org/Protocols/rfc2068/rfc2068
[HTML4.0]
W3C,
HTML 4.0 Specification Recommendation, 1997-1998,
http://www.w3.org/TR/REC-html40/
[Text/css]
H. Lie, B. Bos, C. Lilley,
The text/css Media Type, RFC 2318, 1998,
ftp://ftp.isi.edu/in-notes/rfc2318.txt
[Text/xml]
E. Whitehead, M. Murata,
XML Media Types, RFC 2376, 1998,
ftp://ftp.isi.edu/in-notes/rfc2376.txt
[IANA]
IANA,
Character sets,
http://www.isi.edu/in-notes/iana/assignments/character-sets
[Apache]
Apache HTTP Server Project,
Apache 1.3 User's Guide,
http://www.apache.org/docs/
[CERN]
W3C,
CERN httpd,
http://www.w3.org/Daemon/
[AddCharset]
KOGA Youichirou,
Koga's Apache page, 1998,
http://www.isoternet.org/~y-koga/Apache/
Author's address
UCHIDA Akira
Hachiman 2-11-1-101, Aoba-ku, Sendai, Japan
Email: uchida@happy.email.ne.jp