java.net.URI implementation


From: Giannis Georgalis
Subject: java.net.URI implementation
Date: 10 Feb 2003 17:34:28 +0200
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3.50

Hello,

After a discussion with Michael Koch, I decided to implement the
java.net.URI class. In the classpath mail archives I found a patch
submitted by Mr. Topic (I think) that implemented part of the URI
class using:
  /**
   * Regular expression for parsing URIs.
   *
   * Taken from RFC 2396, Appendix B.
   * This expression doesn't parse IPv6 addresses.
   */
  private static final String URI_REGEXP =
    "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?";

Apart from the fact that this expression cannot parse IPv6
addresses, it cannot be considered a substitute for a URI parser,
as it can only break a URI that is already *valid* into its parts;
it does not check the parts themselves. For example, the URI
"http://1333.2123.232323.0.9.9~84.1" is not valid, but this regexp
still accepts it.
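
To make the point concrete, here is a minimal sketch that runs the quoted expression over that example. The class name UriRegexpDemo and the helper split() are hypothetical names chosen for illustration; they are not taken from the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UriRegexpDemo {
    // The RFC 2396 Appendix B expression quoted above.
    static final Pattern URI_REGEXP = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /**
     * Splits a string into {scheme, authority, path, query, fragment}.
     * Components that are absent come back as null. Returns null only
     * if the expression does not match the whole input.
     */
    static String[] split(String uri) {
        Matcher m = URI_REGEXP.matcher(uri);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
        };
    }

    public static void main(String[] args) {
        // The expression splits the string without checking the authority.
        String[] parts = split("http://1333.2123.232323.0.9.9~84.1");
        System.out.println("scheme    = " + parts[0]);
        System.out.println("authority = " + parts[1]);
        System.out.println("path      = '" + parts[2] + "'");
    }
}
```

The expression happily reports a scheme and an authority here, because every component is optional and the character classes only exclude delimiters. Any validation has to happen in a separate step.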

After some digging in various RFCs I have written a (complete)
grammar (in BNF) for parsing URIs (I'll append the grammar at the end
of this message).

The URI parser can be implemented either natively (in C) or in
Java. Implementing it by hand in Java would be quite hard, and the
result would be difficult to maintain and to keep up to date with
potential changes to the URI syntax. In C, on the other hand, it
would be *very* easy to implement and maintain, because I would use
flex, and it would achieve maximum parsing speed. Additionally,
since the URI grammar is very simple, bison (yacc) is not needed.
The parser would also be easy to implement in Java if JLex is used
(that's another option I'm considering).

Please tell me your thoughts and suggestions on this matter.

Thank you,
Giannis

/*
 * <b><i>BNF GRAMMAR for URI parsing</i></b>&nbsp;<br>
 * <i>As described in {@link http://www.ietf.org/rfc/rfc2396.txt RFC-2396} and
 * in {@link http://www.ietf.org/rfc/rfc2732.txt RFC-2732}. IPv6 address scheme
 * is described in {@link http://www.ietf.org/rfc/rfc2373.txt RFC-2373}</i>
 *
 * <b>digit</b>           [0-9]
 * <b>alpha</b>           [a-zA-Z]
 * <b>alphanum</b>        alpha | digit
 * <b>hex</b>             digit | [a-fA-F]
 * <b>escaped</b>         "%"hex{2}
 * <b>mark</b>            "-" | "_" | "." | "!" | "~" |
 *                        "*" | "'" | "(" | ")"
 * <b>unreserved</b>      alphanum | mark
 * <b>reserved</b>        ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
 *                        "$" | "," | "[" | "]"
 * <b>uric</b>            reserved | unreserved | escaped
 * <b>fragment</b>        uric*
 * <b>query</b>           uric*
 * <b>pchar</b>           unreserved | escaped |
 *                        ":" | "@" | "&" | "=" | "+" | "$" | ","
 * <b>param</b>           pchar*
 * <b>segment</b>         pchar*(";"param)*
 * <b>path_segments</b>   segment("/"segment)*
 * <b>abs_path</b>        "/"path_segments
 * <b>uric_no_slash</b>   unreserved | escaped | ";" | "?" | ":" | "@" |
 *                        "&" | "=" | "+" | "$" | ","
 * <b>opaque_part</b>     {uric_no_slash}uric*
 * <b>path</b>            abs_path | opaque_part
 * <b>port</b>            digit*
 * <b>IPv4address</b>     digit{1,3}"."digit{1,3}"."digit{1,3}"."digit{1,3}
 * <b>hexseq</b>          hex{1,4}(":"hex{1,4})*
 * <b>hexpart</b>         hexseq | hexseq"::"(hexseq)? | "::"(hexseq)?
 * <b>IPv6prefix</b>      hexpart"/"digit{1,2} <i>It is not needed. Added for completeness</i>
 * <b>IPv6address</b>     hexpart(":"IPv4address)?
 * <b>ipv6reference</b>   "["IPv6address"]"
 * <b>toplabel</b>        alpha | alpha(alphanum | "-")*alphanum
 * <b>domainlabel</b>     alphanum | alphanum(alphanum | "-" )*alphanum
 * <b>hostname</b>        (domainlabel".")*toplabel(".")?
 * <b>host</b>            hostname | IPv4address | ipv6reference
 * <b>hostport</b>        host(":"port)?
 * <b>userinfo</b>        (unreserved | escaped |
 *                        ";" | ":" | "&" | "=" | "+" | "$" | ",")*
 * <b>server</b>          ((userinfo"@")?hostport)?
 * <b>reg_name</b>        (unreserved | escaped | "$" | "," |
 *                        ";" | ":" | "@" | "&" | "=" | "+")+
 * <b>authority</b>       server | reg_name
 * <b>scheme</b>          alpha( alpha | digit | "+" | "-" | ".")*
 * <b>rel_segment</b>     (unreserved | escaped |
 *                        ";" | "@" | "&" | "=" | "+" | "$" | ",")+
 * <b>rel_path</b>        rel_segment(abs_path)?
 * <b>net_path</b>        "//"authority(abs_path)?
 * <b>hier_part</b>       (net_path | abs_path)("?"query)?
 * <b>relativeURI</b>     (net_path | abs_path | rel_path)("?"query)?
 * <b>absoluteURI</b>     scheme":"(hier_part | opaque_part)
 * <b>URI-reference</b>   (absoluteURI | relativeURI)?("#"fragment)?
 */
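
As a rough illustration of how the productions above map onto code, the following sketch transcribes only the hostname, IPv4address, port and hostport rules into java.util.regex fragments. The class name HostGrammarSketch and the method isValidHostport() are hypothetical. The ipv6reference, userinfo and reg_name productions are deliberately left out, so this is not a full authority check. Note that reg_name, as written in the grammar, is permissive enough to accept the authority of the "invalid" example, so a complete parser would have to decide how to treat registry-based names.

```java
import java.util.regex.Pattern;

public class HostGrammarSketch {
    // IPv4address: digit{1,3}"."digit{1,3}"."digit{1,3}"."digit{1,3}
    static final String IPV4ADDRESS = "[0-9]{1,3}(?:\\.[0-9]{1,3}){3}";

    // toplabel: alpha | alpha(alphanum | "-")*alphanum
    static final String TOPLABEL = "[a-zA-Z](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";

    // domainlabel: alphanum | alphanum(alphanum | "-")*alphanum
    static final String DOMAINLABEL = "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";

    // hostname: (domainlabel".")*toplabel(".")?
    static final String HOSTNAME =
        "(?:" + DOMAINLABEL + "\\.)*" + TOPLABEL + "\\.?";

    // port: digit*
    static final String PORT = "[0-9]*";

    // hostport: host(":"port)?  with host limited to hostname | IPv4address
    static final Pattern HOSTPORT = Pattern.compile(
        "(?:" + HOSTNAME + "|" + IPV4ADDRESS + ")(?::" + PORT + ")?");

    /** Returns true if s matches the (IPv6-less) hostport production. */
    static boolean isValidHostport(String s) {
        return HOSTPORT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "www.gnu.org", "192.168.0.1:8080", "1333.2123.232323.0.9.9~84.1"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValidHostport(s));
        }
    }
}
```

Composing the fragments as strings keeps each regex next to the production it came from, which is roughly what a flex or JLex definitions section would look like as well.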

-- 
 Object-oriented programming is an exceptionally bad 
idea which could only have originated in California.
    - Edsger Dijkstra (attributed)