HTTP/CGI Charset Issues

David Robinson, 5 October 2003

Introduction

This document reviews the published RFC literature and addresses issues with character sets and character encodings in the WWW protocols and the CGI interface. We take a hard look at the precise details of the various specifications. In any set of specifications written by different authors at different times with different purposes and points of view, discrepancies will arise; we discuss these, and the implications for server implementations in general and CGI in particular.

Coded Character Sets

Terminology

In this document we distinguish between a character set, "a collection of characters or glyphs with names and descriptions" and a character encoding, "an allocation of codes to characters and a scheme for representing the codes as a binary data sequence".

Thus the US-ASCII specification, titled `Coded Character Sets', defines both a character set and a character encoding.

US-ASCII

The X3 committee has changed its name to INCITS, so the standard previously known as X3.4 is, in its latest version, ANSI INCITS.4-1986 (R2002). Finalised in 1967, it is a coded character set of 95 printable characters numbered 32 to 126 (including space) and 33 control characters (numbered 0-31 and 127).
SP!"#$%&'()*+,-./
0123456789:;<=>?
@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz{|}~

ASCII was also standardised as ISO-646, with national variations permitted (US-ASCII is the G0 set). ISO has the vertical bar solid, rather than split. In ISO-646, character 35 can be # or £, character 36 can be ¤ or $. Another 10 characters can be changed for a national variant; these are numbers 64, 96, 91-94 and 123-126. In US-ASCII they are @ ` [ \ ] ^ { | } ~

The official IANA character set name is "ANSI_X3.4-1968" (RFC 1345), with various aliases, including "US-ASCII".

ISO 8859

ISO 8859 contains multiple parts, each of which defines a different coded character set. Part 1 defines the Latin alphabet No. 1 (`ISOLatin1'), which is a character set of 191 printable characters that includes US-ASCII as a subset. ISO 8859-1 then defines a code table providing an 8-bit coding for ISOLatin1. (Thus ISOLatin1 is a character set, not a character coding.) The official IANA name is "ISO_8859-1:1987" (RFC 1345).

The additional 96 characters in Latin alphabet No. 1 are:

NBSP¡¢£¤¥¦§ ¨©ª«¬SHY®¯
°±²³´µ¶· ¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇ ÈÉÊËÌÍÎÏ
ÐÑÒÓÔÕÖ× ØÙÚÛÜÝÞß
àáâãäåæç èéêëìíîï
ðñòóôõö÷ øùúûüýþÿ
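
As a cross-check of the counts quoted above, here is a minimal sketch that enumerates the printable code positions of ISO 8859-1. It relies only on Python's standard `latin-1` codec; the variable names are illustrative and are not taken from any specification.

```python
# Printable positions of ISO 8859-1: the 95 US-ASCII graphic characters
# (positions 32-126) plus the 96 additional characters (positions 160-255).
ascii_printable = bytes(range(32, 127)).decode("latin-1")
latin1_additional = bytes(range(160, 256)).decode("latin-1")
latin1_printable = ascii_printable + latin1_additional

# Positions 0-31 and 127-159 are control codes and are not counted here.
print(len(ascii_printable), len(latin1_additional), len(latin1_printable))
```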

EBCDIC

Versions of EBCDIC contain a similar set of characters to ASCII. The characters are coded rather differently, although the same control characters appear in positions 0 - 31 as in ASCII. The original variant, "EBCDIC-US", did not contain the ASCII "^", "[" or "]", but had "¦", "¢" and "¬" as extra characters.
SP¢.<(+|
&!$*);¬
-/¦,%_>?
`:#@'="
abcdefghi
jklmnopqr
~stuvwxyz
{ABCDEFGHI
}JKLMNOPQR
\STUVWXYZ
0123456789

Variant "IBM037" (aliases "cp037", "ebcdic-cp-us") has positions 64-255 containing these 191 characters from the same set as ISO-Latin-1:

SPNBSPâäàáãåçñ¢.<(+|
&éêëèíîïìß!$*);¬
-/ÂÄÀÁÃÅÇѦ,%_>?
øÉÊËÈÍÎÏÌ`:#@'="
Øabcdefghi«»ðýþ±
°jklmnopqrªºæ¸Æ¤
µ~stuvwxyz¡¿ÐÝÞ®
^£¥·©§¶¼½¾[]¯¨´×
{ABCDEFGHISHYôöòóõ
}JKLMNOPQR¹ûüùúÿ
\÷STUVWXYZ²ÔÖÒÓÕ
0123456789³ÛÜÙÚ
Variant "IBM1047" (alias "IBM-1047") has positions 64-255 containing these characters:
SPNBSPâäàáãåçñ¢.<(+|
&éêëèíîïìß!$*);^
-/ÂÄÀÁÃÅÇѦ,%_>?
øÉÊËÈÍÎÏÌ`:#@'="
Øabcdefghi«»ðýþ±
°jklmnopqrªºæ¸Æ¤
µ~stuvwxyz¡¿Ð[Þ®
¬£¥·©§¶¼½¾Ý¨¯]´×
{ABCDEFGHISHYôöòóõ
}JKLMNOPQR¹ûüùúÿ
\÷STUVWXYZ²ÔÖÒÓÕ
0123456789³ÛÜÙÚ
Compared to CP-037, circumflex (^) and not (¬) have been swapped; left and right square brackets ([ ]) have been swapped with capital Y acute and diaeresis (Ý ¨).
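
The relationship between IBM037 and ISO-Latin-1 can be checked mechanically. The sketch below assumes that Python's built-in `cp037` codec is a faithful implementation of IBM037; it is an illustration only, and the variable names are invented for the example.

```python
# Assumption: Python's standard "cp037" codec implements IBM037.
ebcdic_positions = bytes(range(64, 255))   # positions 64-254; 255 is a control
chars = ebcdic_positions.decode("cp037")

# The 191 printable characters of ISO-Latin-1 (32-126 and 160-255).
latin1_printable = set(bytes(range(32, 127)).decode("latin-1")) | set(
    bytes(range(160, 256)).decode("latin-1")
)

# Same repertoire, different code positions.
print(len(set(chars)), set(chars) == latin1_printable)
for ch in "A[^":
    print(repr(ch), "Latin-1:", hex(ord(ch)), "IBM037:", hex(ch.encode("cp037")[0]))
```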

Charset Identification

When communicating via an Internet protocol, two systems may need to agree on a character coding. For this purpose, standardised names have been defined. RFC 2978 defines a registration procedure for new character sets.
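
As an illustration of how registered names and aliases resolve to one coding, the following sketch uses Python's `codecs.lookup`. It assumes that the Python alias table recognises the IANA names quoted in this document; that table is a property of the Python library, not of the IANA registry.

```python
import codecs

# IANA names quoted in this document, each paired with a common alias.
pairs = [
    ("ANSI_X3.4-1968", "US-ASCII"),
    ("ISO_8859-1:1987", "ISO-8859-1"),
    ("IBM037", "cp037"),
]

for official, alias in pairs:
    a = codecs.lookup(official).name
    b = codecs.lookup(alias).name
    print(f"{official!r} and {alias!r} resolve to {a!r}; same codec: {a == b}")
```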

HTTP/1.1

HTTP/1.1 has clear support for character-set handling in bodies; this will be discussed later. The main issues we consider here are the character sets used in HTTP headers.

Character Sets

RFC 2616 (HTTP) defines the following:
OCTET=<any 8-bit byte>
CHAR=<any 7-bit US-ASCII character>
CTL=<any 7-bit US-ASCII control character>
TEXT=<any OCTET except CTLs, but including LWS>
On TEXT, it says:
"The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047."
Presumably this implies that the 191 ISO-8859-1 encoded Latin-1 characters are permitted in TEXT. It is used in comments, for example. However, the assertion in the first sentence is false; witness the definition
quoted-string=( <"> *(qdtext | quoted-pair ) <"> )
qdtext=<any TEXT except <">>
A quoted-string is used in various machine-interpreted headers, inter alia the Content-Type: charset parameter value. A purposeful interpretation of the specification is probably best; thus for TEXT read <CHAR except CTL> in nearly all quoted-strings, except in the Warning and Etag headers and in userids.

HTTP/1.1 has an optional extension, allowing CTLs in quoted-string and comments if prefixed by a "\", whereas HTTP/1.0 forbade single-character quoting.
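
The octet classes above can be written down as simple predicates. This is a simplified sketch: it accepts any single octet that may occur inside LWS (HT, LF, CR, SP) without checking that CR and LF form a valid line fold, and the function names are illustrative.

```python
CTLS = set(range(0, 32)) | {127}      # 7-bit US-ASCII control characters
LWS_OCTETS = {9, 10, 13, 32}          # HT, LF, CR, SP (folding not validated)

def is_char(octet: int) -> bool:
    """CHAR: any 7-bit US-ASCII character."""
    return 0 <= octet <= 127

def is_ctl(octet: int) -> bool:
    """CTL: any 7-bit US-ASCII control character."""
    return octet in CTLS

def is_text(octet: int) -> bool:
    """TEXT: any OCTET except CTLs, but including LWS."""
    return 0 <= octet <= 255 and (octet not in CTLS or octet in LWS_OCTETS)

def is_qdtext(octet: int) -> bool:
    """qdtext: any TEXT except the double-quote character."""
    return is_text(octet) and octet != 0x22

def is_strict_token_text(octet: int) -> bool:
    """The narrower reading suggested above: CHAR except CTL."""
    return is_char(octet) and not is_ctl(octet)
```

Under these definitions the octet 0xE9 (é in ISO-8859-1) is valid TEXT but is rejected by the narrower <CHAR except CTL> reading.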

Headers using ISO-Latin-1

As described above, TEXT, comment and quoted-string can contain ISO-Latin-1 characters; comment and quoted-string MAY also contain backslash-quoted CTLs. These can occur in the following situations:
Status-Line=HTTP-Version SP 3DIGIT SP *<TEXT, excluding CR, LF>
Etag:=[ "W/" ] quoted-string
Server:=1*( product | comment )
User-Agent:=1*( product | comment )
Via:=1#( received-protocol received-by [ comment ] )
Warning:=1#(3DIGIT SP warn-agent SP quoted-string [SP <"> HTTP-date <">])
basic-userid=*<TEXT excluding ":">
basic-password=*TEXT
digest-userid=quoted-string
Other character sets are permitted as an option explicitly for the Warning header. The basic-userid and basic-password values are base64-encoded as part of an Authorization header.
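
A minimal sketch of how a client might build, and a server unpack, Basic credentials under the reading given above. It assumes that TEXT octets are to be interpreted as ISO-8859-1; the function names are hypothetical.

```python
import base64

def basic_credentials(userid: str, password: str) -> str:
    """Build the value of an Authorization header for Basic authentication."""
    if ":" in userid:
        raise ValueError("basic-userid must not contain ':'")
    # Assumption: TEXT octets are ISO-8859-1 coded characters.
    raw = f"{userid}:{password}".encode("iso-8859-1")
    return "Basic " + base64.b64encode(raw).decode("ascii")

def parse_basic_credentials(value: str) -> tuple[str, str]:
    """Recover (userid, password) from a Basic Authorization header value."""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    decoded = base64.b64decode(token).decode("iso-8859-1")
    # Only the first ":" separates userid from password.
    userid, _, password = decoded.partition(":")
    return userid, password
```

The header value itself is pure US-ASCII; the ISO-Latin-1 characters only reappear once the base64 layer has been removed.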

The Etag, Via and Warning headers do not appear in HTTP/1.0, which also uses <CHAR except CTL> for basic-userid and quoted-string. The change for quoted-string has no practical effect.

Headers using US-ASCII

In all other cases, only the basic US-ASCII coded character set is used. This therefore includes these common headers:
Authorization:=auth-scheme #auth-param
auth-scheme=token
Content-Length:=1*DIGIT
Content-Type:=type "/" subtype * (";" attribute "=" value)
type=subtype = attribute = token
value=token | quoted-string
Some of the HTTP headers are described using rules taken from RFC 2396 (the URI specification) such as absoluteURI, relativeURI, host and port where they are defined purely in terms of characters. Presumably these are to be taken as US-ASCII coded characters:
Content-Location:=( absoluteURI | relativeURI )
Location:= absoluteURI
Referer:=( absoluteURI | relativeURI )
Host:=host [ ":" port ]
received-by=( host [ ":" port ] ) | pseudonym
warn-agent=( host [ ":" port ] ) | pseudonym

HTTP Message Body

The character set and coding used for a message body (either in a request or a response) is defined in the Content-Type header using the `charset' parameter. For `text' message types transmitted over HTTP, RFC 2616 defines the default to be "ISO-8859-1", whereas the default in other situations is defined to be "US-ASCII" by the MIME media type specification (RFC 2046). This issue is discussed in the `text/html' specification (RFC 2854).
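
The defaulting rule can be sketched as follows. The example uses Python's `email.message.Message` to parse the header value; the function name `body_charset` and its `over_http` flag are assumptions made for illustration.

```python
from email.message import Message

def body_charset(content_type: str, over_http: bool = True):
    """Return the charset that applies to a body with this Content-Type.

    An explicit charset parameter wins. Otherwise `text' types default to
    ISO-8859-1 over HTTP (RFC 2616) and to US-ASCII elsewhere (RFC 2046).
    Returns None for non-text types with no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_content_charset()   # lower-cased, or None
    if declared is not None:
        return declared
    if msg.get_content_maintype() == "text":
        return "iso-8859-1" if over_http else "us-ascii"
    return None
```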

URI references

URI textual representation

A URI reference is the location of a web-resource fragment, externally represented as a sequence of simple characters. According to the URI specification (RFC 2396) these include the letters A-Z and a-z, the digits 0-9 and the 21 characters
uri-chars=- _ . ! ~ * ' ( ) ; / ? : @ & = + $ , % #
which are a subset of the characters found in both US-ASCII and EBCDIC-US. However, the uri-chars category was expanded by RFC 2732 (for IPv6 addresses) to include "[" and "]"; these characters are not present in the original EBCDIC-US character set.

The specification in RFC 2396 does not mandate any particular coding to be used when exchanging URIs between systems.

URI structure

A URI cannot be an arbitrary string of uri-chars; instead it is a structured string, with various permitted characters and meanings for those characters in different parts of the string. As with any such string, the structure is defined by delimiters which separate the different components. Section 2 of RFC 2396 states:
"...characters are either used as delimiters, or to represent strings of data (octets) within the delimited portions. Octets are either represented directly by a character (using the US-ASCII character for that octet [ASCII]) or by an escape encoding."
The escape encoding (of "%" hex hex) allows an octet value to appear in a delimited portion where use of the appropriate character for that octet value would be treated as a separator. In fact, the quoted definition is not entirely correct, as Section 2.1 describes (my emphasis):
"A URI scheme may define a mapping from URI characters to octets; whether this is done depends on the scheme. Commonly, within a delimited component of a URI, a sequence of characters may be used to represent a sequence of octets. For example, the character "a" represents the octet 97 (decimal), while the character sequence "%", "0", "a" represents the octet 10 (decimal)."
The mapping is presumably defined by the scheme on a component by component basis. Therefore, the components delimited by the separators are either binary octet data (concatenation of US-ASCII codes and escapes) or character strings (concatenation of characters). As escape encodings are not permitted in character string components, the component types can be distinguished based on the syntax definition. For example, the URI ftp://joe@server.net:80/this%2f/file contains the character components "ftp", "server.net" and "80" plus the octet components (in hex) 6a6f65, 746869732f and 66696c65.
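
The decomposition of the example URI can be reproduced with a short sketch using Python's `urllib.parse`. Treating the userinfo and path segments as the octet components follows the reading given above; it is an interpretation, not something the library itself asserts.

```python
from urllib.parse import urlsplit, unquote_to_bytes

uri = "ftp://joe@server.net:80/this%2f/file"
parts = urlsplit(uri)

# Character components: no escapes permitted, taken as characters.
character_components = [parts.scheme, parts.hostname, str(parts.port)]

# Octet components: split on the "/" delimiter *before* removing escapes,
# so that "%2f" remains data rather than becoming a separator.
raw_components = [parts.username] + parts.path.lstrip("/").split("/")
octet_components = [unquote_to_bytes(c) for c in raw_components]

print(character_components)
print([c.hex() for c in octet_components])
```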

Thus the `parsed' representation of a URI is a directed acyclic graph with nodes of octet and character data. This clearly does not accord with the commonly understood meaning of URIs as identifying hosts, filenames, etc. This is resolved by some URI syntaxes defining the octet data elements to contain coded characters for further interpretation.

URI Character Mappings

Section 2.1 of RFC 2396 (`URI and non-ASCII characters') describes this dual mapping from the URI character string to octets and possibly to coded characters. Firstly, as explained, octet components are formed from the US-ASCII and escape codes of the characters.

Secondly, "there is a ... translation for some resources: the sequence of octets defined by a component of the URI is subsequently used to represent a sequence of characters. A 'charset' defines this mapping". As the RFC explains, there is currently no way for the URI to self-describe the coding used in the octet data. It implies that the coding should be ASCII-based, but:

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification.
By 'original character sequences' it means octet components of the decoded URI that are designated to contain coded characters.
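
The consequence of this missing identification is easy to demonstrate: one octet sequence yields different characters, or none at all, depending on the charset the receiver assumes. The sketch below uses an invented component, `caf%E9`, and Python's standard codecs.

```python
from urllib.parse import unquote_to_bytes

# Step 1: URI characters to octets (US-ASCII codes plus escapes).
octets = unquote_to_bytes("caf%E9")

# Step 2: octets to characters, which depends on an assumed charset.
as_latin1 = octets.decode("iso-8859-1")
as_ebcdic = octets.decode("cp037")
try:
    as_utf8 = octets.decode("utf-8")
except UnicodeDecodeError:
    as_utf8 = None   # not a valid UTF-8 sequence

print(octets, repr(as_latin1), repr(as_ebcdic), repr(as_utf8))
```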

URI Generic Syntax

The remainder of RFC 2396 defines various syntactic rules for components of URIs with generic forms. This comprises the separators and the character components. The character components are defined as character string representations of objects in generally well known formats, such as integers as a string of digits.

However, this specification was revised by RFC 2732 to define character components for IPv4address and IPv6reference, based on definitions in the IPv6 specification RFC 2373, which itself relies on basic rules set out in the augmented BNF specification (RFC 2234), such as HEXDIG. These are defined in terms of non-negative integers, i.e. character codes, and not characters. Despite this, the intent of RFC 2732 is obvious.

The handling of domain names, described in RFC 1035, is instructive. The Domain Name Space defines the server support for a hierarchical tree of labels of arbitrary OCTET strings. The (preferred) name syntax is described in terms of characters. The correspondence between the hostname and the label is by case-insensitive comparison of the character codes and the octets. This fits very well with the URI character model. There is an excellent discussion in RFC 3467 of the issues with extending this DNS character set model.

URI Schemes

The procedure for defining URI schemes is specified in RFC 2717, and was recently reviewed by the W3C/IETF in RFC 3305. Guidelines for new schemes are set out in RFC 2718, which strongly recommends (as per IETF character-set policy in RFC 2277) that UTF-8 be used for coded characters.

http: URIs

The http: scheme syntax

The "http:" scheme is defined in RFC 2616 (the HTTP/1.1 specification). The syntax of the character representation is:
http_URL="http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
abs_path="/" path_segments
path_segments=segment *( "/" segment )
segment=*pchar *( ";" param )
param=*pchar
host, port, abs_path and query etc. are taken from their generic definitions in RFC 2396, showing how an HTTP URI is to be split into its various data components. The host and port parts identify the server to connect to, and the abs_path and query components form the request for that server. The "https:" scheme is defined in RFC 2818 to have identical syntax, except for the scheme name.

The HTTP protocol specification describes how this URI is used by clients and servers. The first two (character) components, the host and port are used by the client to determine the location of the server (and values for the Host: header). The remainder of the URI is passed as a URI character string with these components removed. Thus the client does not parse these components. As discussed earlier, it is to be assumed that the US-ASCII coding is used for transmitting the URI string in the request:

Request-URI="*" | absoluteURI | abs_path | authority ; [sic]
Presumably instead of abs_path it should read abs_path ["?" query]:
Request-URI="*" | absoluteURI | abs_path_query | authority ; [corrected]
abs_path_query=abs_path [ "?" query ]

This abs_path partial URI should be parsed by the server as described earlier, to give a list of octet data elements (path segments, path parameters and query components).

The parsing of the query partial URI is not defined by the http: scheme. Therefore, it is up to the local application to define the syntax of this part of the URI, and hence how that piece is broken into components, which components are OCTET data, and what (if any) coded characters they contain.
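
A server-side parse along these lines might look like the following sketch. The function name and the returned structure are invented for illustration; the query is deliberately returned undecoded, since its syntax belongs to the local application.

```python
from urllib.parse import unquote_to_bytes

def split_request_uri(request_uri: str):
    """Split an abs_path [ "?" query ] Request-URI.

    Returns (segments, query). Each segment is a pair
    (name_octets, [param_octets, ...]); the query is the raw, still
    escaped string, or None when no "?" is present.
    """
    path, sep, query = request_uri.partition("?")
    if not path.startswith("/"):
        raise ValueError("abs_path must start with '/'")
    segments = []
    # Delimiters ("/" and ";") are recognised before escapes are removed.
    for raw_segment in path[1:].split("/"):
        name, *params = raw_segment.split(";")
        segments.append(
            (unquote_to_bytes(name), [unquote_to_bytes(p) for p in params])
        )
    return segments, (query if sep else None)
```

For example, `split_request_uri("/a%2Fb;v=1/c?x=%3F")` keeps the escaped slash inside the first segment as data and leaves `x=%3F` untouched for the application.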

Server/CGI Functionality

Having discussed the various parts of a request, we can now consider how they are to be handled by a server.

The Request URI

The request URI is to be treated by the server as a series of path and query components. Although these are octet data strings, often the server manager will wish to process these as coded character data, so the question of the coding arises.

The comments in RFC 2717 and 2718 state that the coding is scheme defined and should be UTF-8; however, this is only really applicable where the scheme definition `owns' the URI content. This is the case, for example, in the common-name component of a go: URI (RFC 3368) for which UTF-8 is specified. By contrast, for a `generic URI' denoting access to a resource on a particular server, the server is the `authority' for the path. It follows that where the URI path is hierarchical (as in any CGI implementation), the character coding may be determined on a component by component basis.

The schemes that support hierarchical paths include http: and https: described above, and also ftp: (see RFC 1738). Of the other schemes currently registered with IANA, only ipp: (Internet Printing Protocol, RFC 3510), rtsp: (Real Time Streaming Protocol, RFC 2326) and nfs: (RFC 2224) have hierarchical paths.

URI Path

Where a server passes off part of the path hierarchy to a CGI script in the PATH_INFO variable, then it may devolve the coding responsibility to the script, or it may require the same coding to be applied as the rest of the script.

A server that uses a non-ASCII based character set internally would be well advised to use ASCII for URI paths that represent internal named objects, as the URIs would be more memorable. It would then have to convert the binary strings of ASCII codes to the local character set when resolving local object names. As the PATH_INFO is `URL-decoded', which of the following two interpretations applies depends on how the server regards the content of the variable. The PATH_INFO is a list of "psegments" separated by the code for "/" in the server's character set. Either

psegment=*<OCTET except code for "/">
in which case the server is passing the raw data to the script for it to apply its own character coding. Alternatively,
psegment=*<any character except "/">
In which case the server has chosen a character coding for the URI path components (probably ASCII-based), and then recoded those characters into the coded character set implied for the PATH_INFO variable.

The former case is used in Unix; without thinking, the server decodes any "%" escapes in the path and presents that in the environment variable. The latter case might be used for a Java application, where the server treats the URI as containing ISO-8859-1 characters and reencodes them in UCS-2 for the Java program.
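
The two treatments can be contrasted in a short sketch. Both function names are hypothetical, and the charsets passed in the usage note are examples only; UTF-16 (big-endian) stands in for UCS-2 here.

```python
from urllib.parse import unquote_to_bytes

def path_info_octets(escaped_path: str) -> bytes:
    """First case: decode the escapes and hand the raw octets to the script."""
    return unquote_to_bytes(escaped_path)

def path_info_recoded(escaped_path: str, uri_charset: str,
                      local_charset: str) -> bytes:
    """Second case: interpret the octets as characters in uri_charset,
    then recode those characters into the server's local_charset."""
    characters = unquote_to_bytes(escaped_path).decode(uri_charset)
    return characters.encode(local_charset)
```

For instance, `path_info_recoded("/caf%E9/menu", "iso-8859-1", "utf-16-be")` corresponds to the Java-style case, while `path_info_recoded("/A", "ascii", "cp037")` shows an EBCDIC server converting ASCII codes to its local character set.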

URI Query String

The query part of a URI could similarly have a server-assigned character coding. However, in some instances the use of the query string is specified by an application protocol which may require a particular character set. HTTP form upload, for example, uses ISO-8859-1 codes in the query string binary components. In CGI, the script is passed the query component as-is in the QUERY_STRING parameter, so it must deal with any conversion itself. Note, though, that the URI text string is passed in the server's native character set, which may not be ASCII.
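
A CGI script performing that conversion itself might proceed as in the sketch below, which assumes form data coded in ISO-8859-1 and an ASCII-based server. The sample query string is invented; `parse_qs` is the standard `urllib.parse` function.

```python
from urllib.parse import parse_qs

# In a real script this would come from os.environ["QUERY_STRING"].
query_string = "name=Jos%E9&lang=fr"

# Assumption: the form octets are ISO-8859-1 coded characters.
fields = parse_qs(query_string, encoding="iso-8859-1", errors="strict")
print(fields)
```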

Local paths

The PATH_INFO variable is also processed by the server to give a local filename, PATH_TRANSLATED. The character set for that value, as well as for the SCRIPT_NAME value, conforms to whatever filename character set is used by the server.

HTTP/CGI Request Parameters

When the server passes HTTP headers or data as request variables, in most instances the value contains only US-ASCII characters. Exceptions are listed below.

REMOTE_USER

As explained above, with Basic or Digest authentication, the userid (passed as REMOTE_USER) can contain any TEXT character, which includes ISO-Latin-1 characters. However, like URI paths, the authority issuing userids is the server itself; so it could limit the character set, possibly differently for each security domain. For example, a server using Unix logon names (!) may be limited to a small set of characters for the userid.

SERVER_SOFTWARE

The name the server gives out to the client in a Server: header can contain ISO-Latin-1 characters. The same string may be used in the SERVER_SOFTWARE variable. Of course, for any particular server, the characters are limited to those that actually appear in the name of the software.

REMOTE_IDENT and RFC 1413

The Ident Protocol (RFC 1413) provides a means to determine the identity of a user of a particular TCP connection. It returns the following information on success:
ident-result=opsys [ "," charset] ":" ident-userid
opsys="OTHER" | "UNIX" | token
charset="US-ASCII" | token
ident-userid=1*512<any OCTET except 0, 10 or 13>
Thus when the web server acts as an RFC1413 client, it can determine the character set of the userid (defaulting to US-ASCII), and convert to its local character set accordingly.
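
A minimal sketch of that conversion, following the grammar above. The function name is hypothetical, the input is assumed to be the ident-result portion of the reply only, and the charset token is assumed to be a name that Python's codec registry recognises.

```python
def parse_ident_result(result: bytes):
    """Parse `opsys [ "," charset ] ":" ident-userid` into
    (opsys, charset, userid), decoding the userid with the stated charset."""
    head, sep, userid = result.partition(b":")
    if not sep or not 1 <= len(userid) <= 512:
        raise ValueError("malformed ident-result")
    if any(octet in (0, 10, 13) for octet in userid):
        raise ValueError("ident-userid contains a forbidden octet")
    opsys, _, charset = head.decode("ascii").partition(",")
    charset = charset.strip() or "US-ASCII"   # default when none is given
    return opsys.strip(), charset, userid.decode(charset)
```

The decoded string can then be re-encoded into the server's local character set before being placed in REMOTE_IDENT.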

HTTP headers

As listed earlier, these headers may contain ISO-Latin-1 characters:

Response Data

In the response the CGI script returns a Status: header which is converted into the HTTP Status line. The reason-phrase in the Status: header can contain ISO-Latin-1 characters.
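
The conversion can be sketched as follows. The function name and the default protocol version are assumptions; the reason-phrase is encoded as ISO-8859-1, in line with the Status-Line rule quoted earlier.

```python
def status_line(status_value: str, http_version: str = "HTTP/1.1") -> bytes:
    """Convert the value of a CGI Status: header into an HTTP Status-Line."""
    code, _, reason = status_value.strip().partition(" ")
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError("Status must begin with a 3-digit code")
    if "\r" in reason or "\n" in reason:
        raise ValueError("reason-phrase must not contain CR or LF")
    # Characters outside ISO-8859-1 raise UnicodeEncodeError here.
    return f"{http_version} {code} {reason}".encode("iso-8859-1") + b"\r\n"
```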

References

[ISO 646]
`Information technology – ISO 7-bit coded character set for information interchange', ISO/IEC 646:1991.
[ISO 8859-1]
`Information technology – 8-bit single-byte coded graphic character sets – Part 1: Latin alphabet No. 1', ISO/IEC 8859-1:1998.
[ISO 10646-1]
`Information technology – Universal Multiple-Octet Coded Character Set (UCS) – Part 1: Architecture and Basic Multilingual Plane' ISO/IEC 10646-1:2000.
[RFC 1035]
Mockapetris, P., `Domain Names - Implementation and Specification', RFC 1035, ISI, November 1987.
[RFC 1345]
Simonsen, K., `Character Mnemonics & Character Sets', RFC 1345, Rationel Almen Planlaegning, June 1992.
[RFC 1413]
St. Johns, M., `Identification Protocol', RFC 1413, US Department of Defense, February 1993.
[RFC 1738]
Berners-Lee, T., Masinter, L., McCahill, M. (eds.) `Uniform Resource Locators (URL)', RFC 1738, CERN, Xerox Corporation, University of Minnesota, December 1994.
[RFC 1945]
Berners-Lee, T., Fielding, R. T. and Frystyk, H., `Hypertext Transfer Protocol – HTTP/1.0', RFC 1945, MIT/LCS, UC Irvine, May 1996.
[RFC 2046]
Freed, N., Borenstein, N., `Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types', RFC 2046, Innosoft, First Virtual, November 1996.
[RFC 2047]
Moore, K., `MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text', RFC 2047, University of Tennessee, November 1996.
[RFC 2224]
Callaghan, B., `NFS URL Scheme', RFC 2224, Sun Microsystems, Inc., October 1997.
[RFC 2234]
Crocker, D. (ed) and Overell, P., `Augmented BNF for Syntax Specifications: ABNF', RFC 2234, Internet Mail Consortium, Demon Internet Ltd., November 1997.
[RFC 2277]
Alvestrand, H., `IETF Policy on Character Sets and Languages', BCP 18, RFC 2277, UNINETT, January 1998.
[RFC 2279]
Yergeau, F., `UTF-8, a transformation format of ISO 10646', RFC 2279, Alis Technologies, January 1998.
[RFC 2326]
Schulzrinne, H., Rao, A. and Lanphier, R., `Real Time Streaming Protocol (RTSP)', RFC 2326, Columbia U., Netscape, RealNetworks, April 1998.
[RFC 2373]
Hinden R. and Deering S., `IP Version 6 Addressing Architecture', RFC 2373, Nokia, Cisco Systems, July 1998.
[RFC 2396]
Berners-Lee, T., Fielding, R. and Masinter, L., `Uniform Resource Identifiers (URI): Generic Syntax', RFC 2396, MIT/LCS, U.C. Irvine, Xerox Corporation, August 1998.
[RFC 2616]
Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T., `Hypertext Transfer Protocol – HTTP/1.1', RFC 2616, UC Irvine, Compaq/W3C, Compaq, W3C/MIT, Xerox, Microsoft, W3C/MIT, June 1999.
[RFC 2617]
Franks, J., Hallam-Baker, P., Hostetler, J., Lawrence, S., Leach, P., Luotonen, A. and Stewart L. `HTTP Authentication: Basic and Digest Access Authentication', RFC 2617, Northwestern University, Verisign Inc., AbiSource, Inc., Agranat Systems, Inc., Microsoft Corporation, Netscape Communications Corporation, Open Market, Inc., June 1999.
[RFC 2717]
Petke, R. and King, I., `Registration Procedures for URL Scheme Names', BCP 35, RFC 2717, UUNET Technologies, Microsoft Corporation, November 1999.
[RFC 2718]
Masinter, L., Alvestrand, H., Zigmond, D. and Petke, R., `Guidelines for new URL Schemes', RFC 2718, Xerox Corporation, Maxware, Pirsenteret, WebTV Networks, Inc., UUNET Technologies, November 1999.
[RFC 2732]
Hinden, R., Carpenter, B. and Masinter, L., `Format for Literal IPv6 Addresses in URL's', RFC 2732, Nokia, IBM, AT&T, December 1999.
[RFC 2818]
Rescorla, E., `HTTP Over TLS', RFC 2818, RTFM, May 2000.
[RFC 2854]
Connolly, D. and Masinter, L., `The 'text/html' Media Type', RFC 2854, W3C, AT&T, June 2000.
[RFC 2978]
Freed, N. and Postel, J., `IANA Charset Registration Procedures', RFC 2978, Innosoft, ISI, October 2000.
[RFC 3305]
Mealling, M. and Denenberg, R. (eds.) `Report from the Joint W3C/IETF URI Planning Interest Group: Uniform Resource Identifiers (URIs), URLs and Uniform Resource Names (URNs): Clarifications and Recommendations', RFC 3305, W3C URI Interest Group, August 2002.
[RFC 3368]
Mealling, M., `The 'go' URI Scheme for the Common Name Resolution Protocol', RFC 3368, Verisign, Inc., August 2002.
[RFC 3467]
Klensin, J., `Role of the Domain Name System (DNS)', RFC 3467, February 2003.
[RFC 3510]
Herriot, R. and McDonald, I., `Internet Printing Protocol/1.1: IPP URL Scheme', RFC 3510, High North Inc., April 2003.
[US-ASCII]
`Information Systems – Coded Character Sets – 7-bit American Standard Code for Information Interchange (7-Bit ASCII)', ANSI INCITS.4-1986 (R2002).
[Winter]
dik t. winter's page of early character sets