| David Robinson | 5 October 2003 |
Thus the US-ASCII specification, titled `Coded Character Sets', defines both a character set and a character encoding.
SP ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
ASCII was also standardised as ISO-646, with national variations permitted (US-ASCII is the G0 set). ISO has the vertical bar solid, rather than split. In ISO-646, character 35 can be # or £, character 36 can be ¤ or $. Another 10 characters can be changed for a national variant; these are numbers 64, 96, 91-94 and 123-126. In US-ASCII they are @ ` [ \ ] ^ { | } ~
The official IANA character set name is "ANSI_X3.4-1968" (RFC 1345), with various aliases, including "US-ASCII".
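The positions quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are invented for the example, and it relies on Python's codec registry accepting both the IANA name and the "US-ASCII" alias.

```python
import codecs

# Printable US-ASCII: codes 32 (SP) to 126 (~), 95 characters in all.
printable = [chr(code) for code in range(32, 127)]

# Positions that ISO-646 allows a national variant to reassign:
# 35 and 36, plus the ten positions 64, 91-94, 96 and 123-126.
national_positions = [35, 36, 64, *range(91, 95), 96, *range(123, 127)]
national_chars = "".join(chr(code) for code in national_positions)

# The IANA name and its alias should resolve to the same Python codec.
codec_name = codecs.lookup("ANSI_X3.4-1968").name
alias_name = codecs.lookup("US-ASCII").name
```

Listing the positions in numeric order gives the US-ASCII characters that a national variant may replace.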
The additional 96 characters in Latin alphabet No. 1 are:
NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
SP ¢ . < ( + | & ! $ * ) ; ¬ - / ¦ , % _ > ? ` : # @ ' = " a b c d e f g h i j k l m n o p q r ~ s t u v w x y z { A B C D E F G H I } J K L M N O P Q R \ S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9
Variant "IBM037" (aliases "cp037", "ebcdic-cp-us") has positions 64-255 containing these 191 characters, which form the same set as ISO-Latin-1:
SP NBSP â ä à á ã å ç ñ ¢ . < ( + | & é ê ë è í î ï ì ß ! $ * ) ; ¬ - / Â Ä À Á Ã Å Ç Ñ ¦ , % _ > ? ø É Ê Ë È Í Î Ï Ì ` : # @ ' = " Ø a b c d e f g h i « » ð ý þ ± ° j k l m n o p q r ª º æ ¸ Æ ¤ µ ~ s t u v w x y z ¡ ¿ Ð Ý Þ ® ^ £ ¥ · © § ¶ ¼ ½ ¾ [ ] ¯ ¨ ´ × { A B C D E F G H I SHY ô ö ò ó õ } J K L M N O P Q R ¹ û ü ù ú ÿ \ ÷ S T U V W X Y Z ² Ô Ö Ò Ó Õ 0 1 2 3 4 5 6 7 8 9 ³ Û Ü Ù Ú
Variant "IBM1047" (alias "IBM-1047") has positions 64-255 containing these characters:
SP NBSP â ä à á ã å ç ñ ¢ . < ( + | & é ê ë è í î ï ì ß ! $ * ) ; ^ - / Â Ä À Á Ã Å Ç Ñ ¦ , % _ > ? ø É Ê Ë È Í Î Ï Ì ` : # @ ' = " Ø a b c d e f g h i « » ð ý þ ± ° j k l m n o p q r ª º æ ¸ Æ ¤ µ ~ s t u v w x y z ¡ ¿ Ð [ Þ ® ¬ £ ¥ · © § ¶ ¼ ½ ¾ Ý ¨ ¯ ] ´ × { A B C D E F G H I SHY ô ö ò ó õ } J K L M N O P Q R ¹ û ü ù ú ÿ \ ÷ S T U V W X Y Z ² Ô Ö Ò Ó Õ 0 1 2 3 4 5 6 7 8 9 ³ Û Ü Ù Ú
Compared to CP-037, circumflex (^) and not (¬) have been swapped; left and right square brackets ([ ]) have been swapped with capital Y acute and diaeresis (Ý ¨).
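The claim that IBM037 carries the same repertoire as ISO-Latin-1 can be tested with Python's bundled "cp037" codec. The sketch below assumes that codec's table, in which position 255 is a control code, so only positions 64-254 are examined. No IBM1047 codec is assumed to be available, so that variant is not checked.

```python
# Graphic characters of IBM037: positions 64-254 (255 is a control in this table).
ebcdic_graphic = set(bytes(range(64, 255)).decode("cp037"))

# Graphic characters of ISO-Latin-1: 0x20-0x7E plus 0xA0-0xFF.
latin1_graphic = set(bytes([*range(0x20, 0x7F), *range(0xA0, 0x100)]).decode("latin-1"))

same_repertoire = ebcdic_graphic == latin1_graphic
graphic_count = len(ebcdic_graphic)

# A short string, brackets included, encoded with the IBM037 coding.
encoded = "[Hello]".encode("cp037")
```

The two sets are compared as repertoires only; the code positions of the individual characters differ completely between the two codings.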
OCTET = <any 8-bit byte>
CHAR = <any 7-bit US-ASCII character>
CTL = <any 7-bit US-ASCII control character>
TEXT = <any OCTET except CTLs, but including LWS>
On TEXT, it says:
"The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047."
Presumably this implies that the 191 ISO-8859-1 encoded Latin-1 characters are permitted in TEXT. It is used in comments, for example. However, the assertion in the first sentence is false; witness the definition
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext = <any TEXT except <">>
A quoted-string is used in various machine-interpreted headers, inter alia the Content-Type: charset parameter value. A purposeful interpretation of the specification is probably best; thus for TEXT read <CHAR except CTL> in nearly all quoted-strings, except in the Warning and Etag headers and in userids.
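A decoder for the quoted-string rule might look roughly like the following. It is a sketch under simplifying assumptions: the function name `unquote_http` is invented for the example, LWS is reduced to SP and HT (header folding is ignored), and octets above 127 are passed through untouched, in line with the TEXT definition.

```python
def unquote_http(value: bytes) -> bytes:
    """Return the octets carried by an HTTP/1.1 quoted-string (simplified)."""
    if len(value) < 2 or value[:1] != b'"' or value[-1:] != b'"':
        raise ValueError("not a quoted-string")
    body = value[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        octet = body[i]
        if octet == 0x5C:                      # "\" starts a quoted-pair
            i += 1
            if i >= len(body) or body[i] > 127:
                raise ValueError("bad quoted-pair")  # quoted-pair = "\" CHAR
            out.append(body[i])
        elif octet == 0x22:                    # qdtext excludes <">
            raise ValueError("unescaped quote")
        elif (octet < 32 and octet != 9) or octet == 127:
            raise ValueError("unescaped CTL")  # only HT is let through here
        else:
            out.append(octet)                  # any other OCTET, incl. 128-255
        i += 1
    return bytes(out)
```

The result is deliberately left as bytes: which coded character set the octets belong to is the question the surrounding text is concerned with.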
HTTP/1.1 has an optional extension, allowing CTLs in quoted-string and comments if prefixed by a "\", whereas HTTP/1.0 forbade single-character quoting.
Status-Line = HTTP-Version SP 3DIGIT SP *<TEXT, excluding CR, LF>
Etag: = [ "W/" ] quoted-string
Server: = 1*( product | comment )
User-Agent: = 1*( product | comment )
Via: = 1#( received-protocol received-by [ comment ] )
Warning: = 1#(3DIGIT SP warn-agent SP quoted-string [SP <"> HTTP-date <">])
basic-userid = *<TEXT excluding ":">
basic-password = *TEXT
digest-userid = quoted-string
Other character sets are explicitly permitted as an option for the Warning header. The basic-userid and basic-password values are base64-encoded as part of an Authorization header.
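The base64 step for basic-userid and basic-password can be illustrated as follows. The function name is hypothetical, and the default charset of ISO-8859-1 is an assumption made for the example; the specification itself only describes the values as TEXT.

```python
import base64

def decode_basic_credentials(header_value: str, charset: str = "iso-8859-1"):
    """Split an Authorization header value into (userid, password)."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    raw = base64.b64decode(token.strip())
    # basic-userid excludes ":", so the first colon is the separator.
    userid, sep, password = raw.partition(b":")
    if not sep:
        raise ValueError("missing ':' separator")
    # The charset is an assumption of this sketch, not stated by the protocol.
    return userid.decode(charset), password.decode(charset)
```

Because the credentials travel as base64, any octet value can appear in them, which is why the choice of character coding for the decoded octets matters.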
The Etag, Via and Warning headers do not appear in HTTP/1.0, which also uses <CHAR except CTL> for basic-userid and quoted-string. The change for quoted-string has no practical effect.
Authorization: = auth-scheme #auth-param
auth-scheme = token
Content-Length: = 1*DIGIT
Content-Type: = type "/" subtype * (";" attribute "=" value)
type = subtype = attribute = token
value = token | quoted-string
Some of the HTTP headers are described using rules taken from RFC 2396 (the URI specification) such as absoluteURI, relativeURI, host and port, where they are defined purely in terms of characters:
Content-Location: = ( absoluteURI | relativeURI )
Location: = absoluteURI
Referer: = ( absoluteURI | relativeURI )
Host: = host [ ":" port ]
received-by = ( host [ ":" port ] ) | pseudonym
warn-agent = ( host [ ":" port ] ) | pseudonym
Presumably these are to be taken as US-ASCII coded characters:
uri-chars = - _ . ! ~ * ' ( ) ; / ? : @ & = + $ , % #
which are a subset of the characters found in both US-ASCII and EBCDIC-US. However, the uri-chars category was expanded by RFC 2732 (for IPv6 addresses) to include "[" and "]"; these characters are not present in the original EBCDIC-US character set.
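The subset relationship can be verified mechanically. In this sketch the EBCDIC-US repertoire is transcribed by hand from the basic EBCDIC character list given earlier, and only the punctuation shown in the uri-chars line is tested; all names are invented for the example.

```python
import string

# Basic EBCDIC-US repertoire, transcribed from the list earlier in the text.
EBCDIC_US = set(" ¢.<(+|&!$*);¬-/¦,%_>?`:#@'=\"~{}\\") | set(
    string.ascii_letters + string.digits
)
US_ASCII = {chr(code) for code in range(32, 127)}

# Punctuation from the uri-chars line.
URI_PUNCT = set("-_.!~*'();/?:@&=+$,%#")

common = US_ASCII & EBCDIC_US
original_ok = URI_PUNCT <= common

# Characters that fall outside the common set once RFC 2732 adds "[" and "]".
missing_after_rfc2732 = sorted((URI_PUNCT | set("[]")) - common)
```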
The specification in RFC 2396 does not mandate any particular coding to be used when exchanging URIs between systems.
"...characters are either used as delimiters, or to represent strings of data (octets) within the delimited portions. Octets are either represented directly by a character (using the US-ASCII character for that octet [ASCII]) or by an escape encoding."
The escape encoding (of "%" hex hex) allows an octet value to appear in a delimited portion where use of the appropriate character for that octet value would be treated as a separator. In fact, the quoted definition is not entirely correct, as Section 2.1 describes (my emphasis):
"A URI scheme may define a mapping from URI characters to octets; whether this is done depends on the scheme. Commonly, within a delimited component of a URI, a sequence of characters may be used to represent a sequence of octets. For example, the character "a" represents the octet 97 (decimal), while the character sequence "%", "0", "a" represents the octet 10 (decimal)."
The mapping is presumably defined by the scheme on a component by component basis. Therefore, the components delimited by the separators are either binary octet data (concatenation of US-ASCII codes and escapes) or character strings (concatenation of characters). As escape encodings are not permitted in character string components, the component types can be distinguished based on the syntax definition. For example, the URI
ftp://joe@server.net:80/this%2f/file
contains the character components "ftp", "server.net" and "80" plus the octet components (in hex) 6a6f65, 746869732f and 66696c65.
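The split of the example URI into character and octet components can be reproduced with Python's urllib. This is a sketch only: `split_uri` is a name made up for the illustration, and it assumes a URI that has a userinfo part and an explicit port, as the example does.

```python
from urllib.parse import urlsplit, unquote_to_bytes

def split_uri(uri: str):
    """Return (character components, octet components) of a URI like the example."""
    parts = urlsplit(uri)
    # Scheme, host and port are plain character strings; escapes are not allowed.
    characters = [parts.scheme, parts.hostname, str(parts.port)]
    # The userinfo and each path segment are octet data.
    octets = [unquote_to_bytes(parts.username)]
    # Split on "/" before undoing escapes, so "%2f" stays inside its segment.
    octets += [unquote_to_bytes(seg) for seg in parts.path.split("/")[1:]]
    return characters, octets

characters, octets = split_uri("ftp://joe@server.net:80/this%2f/file")
hex_octets = [value.hex() for value in octets]
```

The order of operations matters: decoding the escapes before splitting would turn the escaped "%2f" into a separator and yield three path segments instead of two.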
Thus the `parsed' representation of a URI is a directed acyclic graph with nodes of octet and character data. This clearly does not accord with the commonly understood meaning of URIs as identifying hosts, filenames, etc. This is resolved by some URI syntaxes defining the octet data elements to contain coded characters for further interpretation.
Secondly, "there is a ... translation for some resources: the sequence of octets defined by a component of the URI is subsequently used to represent a sequence of characters. A 'charset' defines this mapping." As the RFC explains, there is currently no way for the URI to self-describe the coding used in the octet data. It implies that the coding should be ASCII-based, but:
"For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification."
By 'original character sequences' it means octet components of the decoded URI that are designated to contain coded characters.
However, this specification was revised by RFC 2732 to define character components for IPv4address and IPv6reference, based on definitions in the IPv6 specification RFC 2373, which itself relies on basic rules set out in the augmented BNF RFC (2234), such as HEXDIG. These are defined in terms of non-negative integers, i.e. character codes, and not characters. Despite this, the intent of RFC 2732 is obvious.
The handling of domain names, described in RFC 1035, is instructive. The Domain Name Space defines the server support for a hierarchical tree of labels of arbitrary OCTET strings. The (preferred) name syntax is described in terms of characters. The correspondence between the hostname and the label is by case-insensitive comparison of the character codes and the octets. This fits very well with the URI character model. There is an excellent discussion in RFC 3467 of the issues with extending this DNS character set model.
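The case-insensitive comparison of labels can be pictured as an operation on octets in which only the codes for the ASCII letters are folded. A minimal sketch, with a hypothetical function name:

```python
def same_dns_label(left: bytes, right: bytes) -> bool:
    """Compare two DNS labels as octet strings, ignoring ASCII letter case."""
    # bytes.lower() folds only the ASCII letters A-Z; all other octets,
    # including values above 127, are compared exactly.
    return left.lower() == right.lower()
```

Under this rule the labels `Server` and `SERVER` match, whereas the octets 0xC9 and 0xE9 (É and é in ISO-8859-1) do not, since the comparison knows nothing about any coding beyond ASCII.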
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
abs_path = "/" path_segments
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar
host, port, abs_path and query etc. are taken from their generic definitions in RFC 2396, showing how an HTTP URI is to be split into its various data components. The host and port parts identify the server to connect to, and the abs_path and query components form the request for that server. The "https:" scheme is defined in RFC 2818 to have identical syntax, except for the scheme name.
The HTTP protocol specification describes how this URI is used by clients and servers. The first two (character) components, the host and port, are used by the client to determine the location of the server (and values for the Host: header). The remainder of the URI is passed as a URI character string with these components removed. Thus the client does not parse these components. As discussed earlier, it is to be assumed that the US-ASCII coding is used for transmitting the URI string in the request:
Request-URI = "*" | absoluteURI | abs_path | authority ; [sic]
Presumably instead of abs_path it should read abs_path ["?" query]:
Request-URI = "*" | absoluteURI | abs_path_query | authority ; [corrected]
abs_path_query = abs_path [ "?" query ]
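The client-side handling described above might be sketched like this. The function name, the default port of 80, and the host `www.example.org` in the usage note are assumptions for the illustration; the point is that the path and query are forwarded unparsed, coded as US-ASCII.

```python
from urllib.parse import urlsplit

def request_line(url: str, method: str = "GET"):
    """Return (host, port, request line octets) for an http: URL."""
    parts = urlsplit(url)
    # The client interprets only the host and port components.
    host, port = parts.hostname, parts.port or 80
    # Everything else is passed on as-is: abs_path [ "?" query ].
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # The URI characters are transmitted using the US-ASCII coding.
    line = f"{method} {target} HTTP/1.1\r\n".encode("ascii")
    return host, port, line
```

For instance, `request_line("http://www.example.org:8080/a%20b/c;v=1?q=x%26y")` leaves every escape in place, so the server receives exactly the characters that appeared in the URI.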
This abs_path partial URI should be parsed by the server as described earlier, to give a list of octet data elements (path segments, path parameters and query components).
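On the server side, the parse into octet data elements could look roughly as follows. The function name is invented for the example, and the query is returned undecoded because its internal syntax is left to the application.

```python
from urllib.parse import unquote_to_bytes

def parse_abs_path_query(target: str):
    """Split abs_path [ "?" query ] into (segments, raw query).

    Each segment is returned as (name octets, [parameter octets, ...]).
    """
    path, _, query = target.partition("?")
    segments = []
    for raw in path.split("/")[1:]:
        # Separate the segment from its ";" parameters before unescaping.
        name, *params = raw.split(";")
        segments.append(
            (unquote_to_bytes(name), [unquote_to_bytes(p) for p in params])
        )
    return segments, query
```

As with any URI parse, the delimiters "/", ";" and "?" are located first and the "%" escapes are removed only afterwards.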
The parsing of the query partial URI is not defined by the http: scheme. Therefore, it is up to the local application to define the syntax of this part of the URI, and hence how that piece is broken into components, which components are OCTET data, and what (if any) coded characters they contain.
The comments in RFC 2717 and 2718 state that the coding is scheme defined and should be UTF-8; however, this is only really applicable where the scheme definition `owns' the URI content. This is the case, for example, in the common-name component of a go: URI (RFC 3368) for which UTF-8 is specified. For a `generic URI' denoting access to a resource on a particular server, by contrast, the server is the `authority' for the path. It follows that where the URI path is hierarchical (as in any CGI implementation), the character coding may be determined on a component by component basis.
The schemes that support hierarchical paths include http: and https: described above, and also ftp: (see RFC 1738). Of the other schemes currently registered with IANA, only ipp: (Internet Printing Protocol, RFC 3510), rtsp: (Real Time Streaming Protocol, RFC 2326) and nfs: (RFC 2224) have hierarchical paths.
A server that uses a non-ASCII based character set internally would be well advised to use ASCII for URI paths that represent internal named objects, as the URIs would be more memorable. It would then have to convert the binary strings of ASCII codes to the local character set when resolving local object names.
As the PATH_INFO is `URL-decoded', its interpretation depends on how the server regards the content of the variable. The PATH_INFO is a list of "psegments" separated by the code for "/" in the server's character set. Either
psegment = *<OCTET except code for "/">
in which case the server is passing the raw data to the script for it to apply its own character coding. Alternatively,
psegment = *<any character except "/">
in which case the server has chosen a character coding for the URI path components (probably ASCII-based), and then recoded those characters into the coded character set implied for the PATH_INFO variable.
The former case is used in Unix; without thinking, the server decodes any "%" escapes in the path and presents that in the environment variable. The latter case might be used for a Java application, where the server treats the URI as containing ISO-8859-1 characters and reencodes them in UCS-2 for the Java program.
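The two treatments of PATH_INFO can be contrasted in a short sketch. The function names are hypothetical, the default of ISO-8859-1 follows the Java example in the text, and UTF-16 (big-endian) stands in for UCS-2, which is equivalent for these characters.

```python
from urllib.parse import unquote_to_bytes

def path_info_octets(raw_path: str) -> bytes:
    """Former case: undo the "%" escapes and hand over the raw octets."""
    return unquote_to_bytes(raw_path)

def path_info_characters(raw_path: str, uri_charset: str = "iso-8859-1") -> str:
    """Latter case: the server decides which characters the octets encode."""
    return unquote_to_bytes(raw_path).decode(uri_charset)

raw = path_info_octets("/caf%E9")
text = path_info_characters("/caf%E9")
# Re-encoding for a 16-bit environment; utf-16-be is used here for UCS-2.
recoded = text.encode("utf-16-be")
```

In the first case the script must know what the octet 0xE9 means; in the second the server has already committed to an answer on the script's behalf.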
ident-result = opsys [ "," charset] ":" ident-userid
opsys = "OTHER" | "UNIX" | token
charset = "US-ASCII" | token
ident-userid = 1*512<any OCTET except 0, 10 or 13>
Thus when the web server acts as an RFC1413 client, it can determine the character set of the userid (defaulting to US-ASCII), and convert to its local character set accordingly.
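A parser for the ident-result rule might be written along these lines. It is a sketch covering only this fragment of an RFC 1413 reply; the function name is invented, and it assumes the charset token is one that Python's codec registry recognises.

```python
def parse_ident_result(result: bytes):
    """Parse opsys [ "," charset ] ":" ident-userid into (opsys, charset, userid)."""
    prefix, sep, userid = result.partition(b":")
    if not sep or not userid:
        raise ValueError("malformed ident-result")
    # opsys and charset are tokens, so US-ASCII is sufficient for the prefix.
    opsys, _, charset = prefix.decode("ascii").partition(",")
    charset = charset.strip() or "US-ASCII"   # the default when none is given
    # The userid octets are kept exactly as received and decoded as declared.
    return opsys.strip(), charset, userid.decode(charset)
```

A reply such as `UNIX,ISO-8859-1:` followed by the octets for a userid would thus be decoded with ISO-8859-1, while a reply with no charset falls back to US-ASCII and rejects any octet above 127.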