Re: Filenames and other POSIX byte strings as SCM strings without loss
From: Andy Wingo
Subject: Re: Filenames and other POSIX byte strings as SCM strings without loss
Date: Fri, 01 Jul 2011 12:51:27 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux)
Hi Mark!
On Mon 23 May 2011 21:42, Mark H Weaver <address@hidden> writes:
> The tentative plan is to use normal strings to represent pathnames,
> command-line arguments, environmental variable values, and other such
> POSIX byte strings.
Apologies for not giving you prompt feedback on this idea. Basically I
think it sounds like a great, workable plan.
> For purposes of this email, suppose they are called
> scm_to_permissive_stringn and scm_from_permissive_stringn. On top of
> these we would implement scm_to_permissive_locale_stringn,
> scm_from_permissive_locale_stringn, and some other convenience
> functions.
Sounds good. "Permissive" sounds a bit odd but I can't think of another
name. "Foreign"? "Corrupt"? "Possibly invalid"? "Nonsense"? "Raw"?
"Cooked"? "Bytes"? "scm_from_utf8_byte_string"?
> Since scm_from_permissive_stringn maps invalid bytes to private-use code
> points in the range U+109700..U+1097FF, we must ensure that properly
> encoded code points in that range are mapped to something else.
> Otherwise, two distinct POSIX byte strings might map to the same SCM
> string. The simplest solution is to consider any byte sequence which
> would map to our reserved range to be invalid, and thus mapped one byte
> at a time using this scheme. For example, U+1097FF is represented in
> UTF-8 as 0xF4 0x89 0x9F 0xBF. Although scm_from_stringn would map this
> sequence of bytes to the single code point U+1097FF (when using UTF-8),
> scm_from_permissive_stringn would instead consider this entire byte
> sequence to be invalid, and instead map it to the 4 code points
> U+1097F4, U+109789, U+10979F, U+1097BF.
Works for me.
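To make the quoted scheme concrete, here is a minimal sketch in Python (not Guile code). The names `from_permissive_utf8` and `to_permissive_utf8` are hypothetical and exist only for this illustration; the sketch assumes UTF-8 as the encoding and relies on Python's own notion of invalid UTF-8 to decide which bytes are invalid. Only the reserved range U+109700..U+1097FF and the byte-at-a-time rule come from the proposal.

```python
# Illustrative sketch only: hypothetical helper names, not Guile's API.
BASE = 0x109700  # start of the reserved private-use range from the proposal


def _escape(raw: bytes) -> str:
    """Map each byte b to the reserved code point U+109700 + b."""
    return "".join(chr(BASE + b) for b in raw)


def from_permissive_utf8(data: bytes) -> str:
    """Decode bytes as UTF-8, mapping invalid bytes into the reserved range."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            valid, bad = data[pos:].decode("utf-8"), b""
            pos = len(data)
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into data[pos:].
            valid = data[pos:pos + err.start].decode("utf-8")
            bad = data[pos + err.start:pos + err.end]
            pos += err.end
        for ch in valid:
            if BASE <= ord(ch) <= BASE + 0xFF:
                # A properly encoded code point in the reserved range is
                # treated as invalid, so its bytes are escaped one by one.
                out.append(_escape(ch.encode("utf-8")))
            else:
                out.append(ch)
        out.append(_escape(bad))
    return "".join(out)


def to_permissive_utf8(text: str) -> bytes:
    """Inverse mapping: reserved code points turn back into single bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if BASE <= cp <= BASE + 0xFF:
            out.append(cp - BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With these definitions, the bytes 0xF4 0x89 0x9F 0xBF decode to the four code points U+1097F4, U+109789, U+10979F, U+1097BF, as in the quoted example, and converting the result back yields the original bytes. Python's built-in `surrogateescape` error handler takes a similar approach, using lone surrogates instead of a private-use range.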
> So the tentative plan is to provide this alternative mapping, and use it
> whenever accessing POSIX byte strings, whether they be filenames,
> command-line arguments, environment variable values, fields within a
> passwd, group, wtmp, or utmp file, system information (e.g. the hostname
> or information from uname), etc.
Cool.
> We should allow the user to access this mapping directly, via
>
> scm_{to,from}_permissive_stringn,
> scm_{to,from}_permissive_locale_stringn,
> scm_{to,from}_permissive_utf8_stringn,
>
> and also between strings and bytevectors in both Scheme and C:
>
> permissive-string->utf8,
> permissive-utf8->string,
> scm_permissive_string_to_utf8,
> scm_permissive_utf8_to_string,
>
> and we should probably add procedures to convert between strings and
> bytevectors using other encodings as well, most importantly the locale
> encoding.
>
> We'd also need permissive-string->pointer and
> permissive-pointer->string.
>
> I'm not sure about the names. Suggestions welcome.
I'm liking "bytes", as in scm_from_locale_byte_stringn and
byte-string->utf8. Perhaps not clear enough, though. WDYT?
> Regarding Noah's proposal to allow handling pathnames as sequences of
> path components: both Andy and I like this idea. However, as always,
> the devil's in the details. I'll write more about this in another
> email.
Sure, let's get this lowest level in first. Are you on it? :-) There
is no hurry of course, just so we know...
Cheers,
Andy
--
http://wingolog.org/