CVS libidn/doc/specifications
From: libidn-commit
Subject: CVS libidn/doc/specifications
Date: Fri, 23 Dec 2005 23:41:40 +0100
Update of /home/cvs/libidn/doc/specifications
In directory dopio:/tmp/cvs-serv6511
Added Files:
rfc4290.txt
Log Message:
Add.
--- /home/cvs/libidn/doc/specifications/rfc4290.txt 2005/12/23 22:41:40
NONE
+++ /home/cvs/libidn/doc/specifications/rfc4290.txt 2005/12/23 22:41:40
1.1
Network Working Group J. Klensin
Request for Comments: 4290 December 2005
Category: Informational
Suggested Practices for Registration of
Internationalized Domain Names (IDN)
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2005).
IESG Note
This RFC is not a candidate for any level of Internet Standard. The
IETF disclaims any knowledge of the fitness of this RFC for any
purpose and notes that the decision to publish is not based on IETF
review apart from IESG review for conflict with IETF work. The RFC
Editor has chosen to publish this document at its discretion. See
RFC 3932 for more information.
Abstract
This document explores the issues in the registration of
internationalized domain names (IDNs). The basic IDN definition
allows a very large number of possible characters in domain names,
and this richness may lead to serious user confusion about similar-
looking names. To avoid this confusion, the IDN registration process
must impose rules that disallow some otherwise-valid name
combinations. This document suggests a set of mechanisms that
registries might use to define and implement such rules for a broad
range of languages, including adaptation of methods developed for
Chinese, Japanese, and Korean domain names.
Klensin Informational [Page 1]
RFC 4290 IDN Registration Practices December 2005
Table of Contents
1. Introduction
1.1. Background
1.2. The Nature and Status of these Recommendations
1.3. Terminology
1.3.1. Languages and Scripts
1.3.2. Characters, Variants, Registrations, and Other Issues
1.3.3. Confusion, Fraud, and Cybersquatting
1.4. A Review of the JET Guidelines
1.4.1. JET Model
1.4.2. Reserved Names and Label Packages
1.5. Languages, Scripts, and Variants
1.5.1. Languages versus Scripts
1.5.2. Variant Selection
1.6. Variants are not a Universal Remedy
1.7. Reservations and Exclusions
1.7.1. Sequence Exclusions for Valid Characters
1.7.2. Character Pairing Issues
1.8. The Registration Bundle
1.8.1. Definitions and Structure
1.8.2. Application of the Registration Bundle
2. Some Implications of This Approach
3. Possible Modifications of the JET Model
4. Conclusions and Recommendations About the General Approach
5. A Model Table Format
6. A Model Label Registration Procedure: "CreateBundle"
6.1. Description of the CreateBundle Mechanism
6.2. The "no-variants" Case
6.3. CreateBundle and Nameprep Mapping
7. IANA Considerations
8. Internationalization Considerations
9. Security Considerations
10. Acknowledgements
11. Informative References
1. Introduction
1.1. Background
The IDNA (Internationalized Domain Names in Applications)
specification [RFC3490] defines the basic model for encoding non-
ASCII strings in the DNS. Additional specifications [RFC3491]
[RFC3492] define the mechanisms and tables needed to support IDNA.
As work on these specifications neared completion, it became apparent
that it would be desirable for registries to impose additional
restrictions on the names that could actually be registered (e.g.,
see [IESG-IDN] and [ICANN-IDN]) to reduce potential confusion among
characters that were similar in some way. This document explores
these IDN (internationalized domain name) registration issues and
suggests a set of mechanisms that IDN registries might use.
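As an illustration of the encoding model mentioned above, the following
minimal sketch converts a native-script label to its internal ASCII form
and back. It assumes Python's built-in "idna" codec, which implements the
RFC 3490 conversion; the helper names to_ace and from_ace are hypothetical
and are not taken from any of the specifications.

```python
def to_ace(label: str) -> str:
    """Convert a native-script label to its ASCII-compatible (internal) form.

    Relies on Python's built-in "idna" codec (an assumption of this sketch).
    """
    return label.encode("idna").decode("ascii")


def from_ace(ace_label: str) -> str:
    """Convert an internal-form label back to its native-script form."""
    return ace_label.encode("ascii").decode("idna")


if __name__ == "__main__":
    native = "b\u00fccher"           # "bücher", written with an escape for clarity
    print(to_ace(native))            # xn--bcher-kva
    print(from_ace("xn--bcher-kva") == native)  # True
```

A purely ASCII label such as "example" passes through unchanged, which is
why existing LDH names are unaffected by the encoding.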
Registration restrictions are part of a long tradition. For example,
while the original DNS specifications [RFC1035] permitted any string
of octets in a DNS label, they also recommended the use of a much
more restricted subset. This subset was derived from the much older
"hostname" rules [RFC952] and defined by the "LDH" convention (for
the three permitted types of characters: letters, digits, and the
hyphen). Enforcement of this restricted subset in registrations was
the responsibility of the registry or domain administrator. The
definition of the subset was not embedded in the DNS protocol itself,
although some applications protocols, notably those concerned with
electronic mail, did impose and enforce similar rules.
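A registry-side check for the LDH convention can be sketched as follows.
The text above specifies only the permitted character types; the 63-octet
length limit and the ban on leading or trailing hyphens are assumptions
carried over from the usual hostname rules, and the name is_ldh_label is
hypothetical.

```python
import re

# Letters, digits, and hyphen only. The 1-63 length bound and the rule that
# a label neither starts nor ends with a hyphen are assumptions of this
# sketch, not requirements stated in the surrounding text.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if the label satisfies the LDH convention as sketched."""
    return _LDH_LABEL.fullmatch(label) is not None


if __name__ == "__main__":
    for candidate in ["example", "examp1e", "-bad", "b\u00fccher", "xn--bcher-kva"]:
        print(candidate, is_ldh_label(candidate))
```

Note that a check of this kind accepts both "example" and "examp1e"; the
LDH restriction limits the repertoire but does nothing about similar-looking
names within it.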
If there are no constraints on registration in a zone, people can
register names containing characters that increase the risk of misunderstandings,
cybersquatting, and other forms of confusion. A similar situation
existed even before the introduction of IDNA, as exemplified by
domain names such as example.com and examp1e.com (note that the
latter domain contains the digit "1" instead of the letter "l").
For non-ASCII names (so-called "internationalized domain names" or
"IDNs"), the problem is more complicated. In the earlier situation
that led to the LDH (hostname) rules, all protocols, hosts, and DNS
zones used ASCII exclusively in practice, so the LDH restriction
could reasonably be applied uniformly across the Internet. Support
for IDNs introduces a very large character repertoire, different
geographical and political locations, and languages that require
different collections of characters. The optimal registration
restrictions are no longer a global matter; they may be different in
different areas and, hence, in different DNS zones.
For some human writing systems, there are characters and/or strings
that have equivalent or near-equivalent usages. If a name can be
registered with such a character or string, the registry might want
to automatically associate all of the names that have the same
meaning with the registered name. The registry might also decide
whether the names that are associated with, or generated by, one
registration should, as a group or individually, go into the zone or
should be blocked from registration by different parties.
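The association of names with a registration can be sketched as follows.
This is an illustration only: the table contents are invented for an
imaginary zone, the names VARIANT_TABLE and associated_names are
hypothetical, and generating every combination of substitutions is an
assumption of the sketch rather than a rule stated here. A registry could
equally well generate a smaller set.

```python
from itertools import product

# Hypothetical table for an imaginary zone: each permitted character maps to
# the characters the registry treats as equivalent or near-equivalent.
VARIANT_TABLE = {
    "a": ["\u00e4"],             # a with diaeresis
    "o": ["\u00f6", "\u00f8"],   # o with diaeresis, o with stroke
    "b": [],
    "t": [],
}


def associated_names(label, table):
    """Return the names generated from a label by substituting variants.

    The requested label itself is excluded from the result. A character
    missing from the table is treated as not permitted in the zone.
    """
    choices = []
    for ch in label:
        if ch not in table:
            raise ValueError(f"character {ch!r} is not permitted in this zone")
        choices.append([ch] + table[ch])
    names = {"".join(combo) for combo in product(*choices)}
    names.discard(label)
    return sorted(names)


if __name__ == "__main__":
    for name in associated_names("boat", VARIANT_TABLE):
        print(name)
```

Whether the generated names are entered into the zone or merely blocked
from registration by other parties is a separate policy decision that the
sketch does not model.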
To date, the best-developed system for handling registration
restrictions for IDNs is the JET Guidelines for Chinese, Japanese,
and Korean [RFC3743], the so-called "CJK" languages. The JET
Guidelines are limited to the CJK languages and, in particular, to
their common script base. Those languages are also the best-known
and most widely-used examples of writing systems constructed on
"ideographic" or "pictographic" principles. This document explores
the principles behind the JET guidelines. It then examines some of
the issues that might arise in adapting them to alphabetic languages,
i.e., to languages whose characters primarily represent sounds rather
than meanings.
This document describes five things:
1. The general background and considerations for non-ASCII scripts
in names.
2. Suggested practices for describing character variants.
3. A method for using a zone's character variants to determine which
names should be associated with a registration.
4. A format for publishing a zone's table of character variants.
Such tables are referred to below as "language tables" or
simply "tables".
5. A model algorithm for name registration given the presence of
language tables.
1.2. The Nature and Status of these Recommendations
The document makes recommendations for consideration by registries
and, where relevant, by those who coordinate them, and by those who
use their services. None of the recommendations are intended to be
normative. Instead, the intent of the document is to illustrate a
framework for developing variations to meet the needs of particular
registries and their processing of particular languages. Of course,
if registries make similar decisions and utilize similar tools, costs
and confusion may be reduced -- both between registries and for users
and registrars who have relationships with more than one domain.
Just as the JET Guidelines contain some suggestions that may not be
applicable to alphabetic scripts, some of the suggestions here,
especially the more specific ones, may be applicable to some scripts
and not others.
1.3. Terminology
1.3.1. Languages and Scripts
This document uses the term "language" in what may be, to many
readers, an odd way. Neither this specification, nor IDNA, nor the
DNS is directly concerned with natural language, but only with the
characters that make up a given label. In some respects, the term
"script", used in the character coding community for a collection of
characters, might be more appropriate. However, different subsets of
the same script may be used with different languages, and the same
language may be written using different characters (or even
completely different scripts) in different locations, so "script" is
not precisely correct either.
Long-standing confusion has also resulted from the fact that most
scripts are, informally at least, named after one of the languages
written in them. "Chinese" describes both a language and a
collection of characters that are also used in writing Japanese,
Korean, and, at least historically, some other languages. "Latin"
describes a language, the characters used to write that language,
and, often, characters used to write a number of contemporary
languages that are derived from or similar to those used to write the
Latin language. The script used to write the Arabic language is
called "Arabic", but it is also used (typically with some additions
or deletions) to write a number of other languages. Situations in
which a script has a clearly-defined name that is independent of the
name of a language are the exception, rather than the rule; examples
include Hangul, used to write Korean, Katakana and Hiragana, used to
write Japanese, and a few others. Some scholars have historically
used "Roman" or "Roman-derived" for the script in an attempt to
distinguish between a script and the Latin language.
The term "language" is therefore used in this document in the
informal sense of a written language and is defined, for this
purpose, by the characters used to write it, i.e., as a language-
specific subset of a script. In this context, a "language" is
defined by the combination of a code (see Section 1.4.1) and an
authority that has chosen to use that code and establish a
character-listing for it. Authorities are normally TLD (top-level
domain) registries; see Section 7 and [IANA-language-registry].
However, it is expected that TLD registries will find appropriate
experts and that advice from language and script experts selected by
international neutral bodies will also become part of the
registration system. In addition, as discussed below in Section 7,
registries may conclude that the best interests of registrants,
stakeholders, and the Internet community would be served by
constructing "language tables" that mix scripts and characters in
ways that conform to no known language. Conventions should be
developed for such registrations that do not misleadingly reflect
specific language codes.
1.3.2. Characters, Variants, Registrations, and Other Issues
1. Characters in this document are specified by their Unicode
codepoints in U+xxxx format, by their official names, or both.
2. The following terms are used in this document.
* String
A "string" is a sequence of one or more characters.
* Base Character
This document discusses characters that may have equivalent or
near-equivalent characters or strings. A "base character" is
a character that has zero or more equivalents. In the JET
Guidelines, base characters are referred to as "valid
characters". In a table with variants, as described in
Section 5, the base characters occupy the first column.
Normally (and always, if the recommendation of Section 6.3 is
adopted), the base characters will be the characters that
appear in registration requests from registrants; any other
character will invalidate the registration attempt.
* Native Script
Native script is the form in which the relevant string would
normally be represented. For example, it might use Lower
Slobbovian characters and the glyphs normally used to write
them. It would not be punycode as a presentation form.
* Variant Characters/Strings
The "variant(s)" are character(s) and/or string(s) that are
treated as equivalent to the base character. Note that these
might not be exactly equivalent characters; a particular
original character may be a base character with a mapping to a
particular variant character, but that variant character may
not have a mapping to the original base character. Indeed,
the variant character may not appear in the base character
list, and hence may not be valid for use in a registration.
Usually, characters or strings to be designated as variants
are considered either equivalent or sufficiently similar (by
some registry-specific definition) that confusion between them
and the base character might occur.
* Base Registration
The "base registration" is the single name that the registrant
requested from the registry. The JET Guidelines use the term
"label string" for this concept.
* Registered, Activated
A label (or "name") is described as "registered" if it is
actually entered into a domain (i.e., into a zone file) by the
registry, so that it can be accessed and resolved using
standard DNS tools. The JET Guidelines describe a
"registered" label as "activated". However, some domains use
a slightly different registration logic in which a name can be
registered with the registrar (if one is involved) and with
the registry, but not actually entered into the zone file
until an additional activation or delegation step occurs.
This document does not make that distinction, but is
compatible with it.
As specified in the IDNA Standard, the name actually placed in
the zone file is always the internal ("punycode") form. There
is no provision for actually entering any other form of an IDN
into the DNS. It remains controversial, with different
registrars and registries having adopted different policies,
as to whether the registration, as submitted by the
registrant, is in the form of:
o the native-script name, either in UTF-8 or in some coding
specified by the registrar, or
o the internal-form ("punycode") name, or
o both forms of the name together, so that the registrar and
registry can verify the intended translation.
If any of the approaches defined in this document is used, it
is almost certain to be necessary that the native-script form
[1171 lines skipped]