Re: C Strings and String Literals.
From: Alejandro Colomar
Subject: Re: C Strings and String Literals.
Date: Tue, 15 Nov 2022 14:48:25 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.4.1
Hi Ralph,
On 11/14/22 14:56, Ralph Corderoy wrote:
Hi Alejandro,
C doesn't _really_ have strings, except at the library level.
It has character arrays and one grain of syntactic sugar for encoding
"string literals", which should not have been called that because
whether they get the null terminator is context-dependent.
char a[5] = "fooba";
char *b = "bazqux";
I see some Internet sources claim that C is absolutely reliable about
null-terminating such literals, but I can't agree. The assignment to
`b` above adds a null terminator, and the one to `a` does not. This
is the opposite of absolute reliability. Since I foresee someone
calling me a liar for saying that, I'll grant that if you carry a long
enough list of exceptional cases for the syntax in your head, both are
predictable. But it's simply a land mine for the everyday programmer.
- C defines both string literals and strings at the language level,
e.g. main()'s argv[] is defined to contain strings.
I must disagree. The string concept is very broad, and you can define
your own string, as for example:
struct str_s {
    size_t len;
    u_char *s;
};
The point under discussion was whether the language specification of C
has strings or just character arrays and whether string literals should
have been called that because whether they have terminating NUL is
‘context-dependent’.
To contradict what I've written, you're widening the discussion to
arbitrary data structures which can be used to implement a string. That
is not relevant.
I just made that point to make sure that when we talk about strings we talk
about a concrete type of strings, as we agree it is any number of non-NUL
characters, followed by NUL.
However, assuming that the concept of string is a NUL-terminated char
array, there's little in the core language about it.
But little is not nothing and so the C language does have both strings,
as the specification states that is what is sitting in main()'s argv[],
and string literals.
I can't argue argv[] is not part of the language, since it's certainly
documented in the standard. However, it's more of a side effect of the
interfaces provided by the kernel (mainly exec(3)), to which it's impossible to
pass a sequence of chars with a NUL that's not terminating the string, since the
NUL will just be interpreted as the end of the string.
If argv[] is the only valid array of strings in the language, I'd say we're in a
bad position to say that the language has strings.
Sure, string literals are the only true strings in the language
Your ‘Sure’ implies you're agreeing with someone. If so, it's not me.
You're wrong on this point.
I think I was kind-of agreeing with Branden, but I don't remember what I was
thinking. Let's say it was a thinko of mine. Something not uncommon.
You can prove that string literals are really strings (i.e.,
NUL-terminated char arrays), by applying sizeof to them, and then
looping over their contents to see that there's exactly one NUL byte
at its last position.
Your definitions are wrong. Proving "foo\0bar" ends with a NUL does not
make it a C string because a NUL-terminated char array is not a C string
if it contains a NUL before that. A C string is zero or more non-NUL
chars followed by a NUL.
Yes, I like your definition better.
- In C, "foo" is a string literal. That is the correct name as it is
not a C string because a string literal may contain explicit NUL bytes
within it which a string may not: "foo\0bar".
I wouldn't discard them as string literals only for that.
Sorry, I meant s/string literals/strings/.
I'm not discarding them as anything. I am pointing out that according
to the language definition, "foo\0bar" is a string literal but not a C
string because of the embedded NUL thus the distinction is necessary and
terms are needed for each.
Writing by accident a NUL byte is not usual, anyway.
I didn't claim it was. I was arguing why ‘they should not have been
called string literal’ is wrong and that whether they get a NUL
terminator is not ‘context dependent’.
So, we could argue that string literals, most of the time, are strings,
conforming to the common idea of any non-NUL followed by a NUL.
- A character array may be initialised by a string literal. Successive
elements of the array are set to the string literal's characters,
including the implicit NUL if there is room.
char two[2] = "foo"; // 'f' 'o'
char three[3] = "foo"; // 'f' 'o' 'o'
char four[4] = "foo"; // 'f' 'o' 'o' '\0'
char five[5] = "foo"; // 'f' 'o' 'o' '\0' '\0'
char implicit[] = "foo"; // 'f' 'o' 'o' '\0'
Ahh my friend, you're too used to some dialect of C that allows this,
I believe. ISO C11 doesn't, and I'm guessing any older ISO C versions
behave in the same way:
$ cat str.c
char two[2] = "foo"; // 'f' 'o'
char three[3] = "foo"; // 'f' 'o' 'o'
char four[4] = "foo"; // 'f' 'o' 'o' '\0'
char five[5] = "foo"; // 'f' 'o' 'o' '\0' '\0'
char implicit[] = "foo"; // 'f' 'o' 'o' '\0'
$ cc str.c -Wpedantic -pedantic-errors
str.c:1:23: error: initializer-string for array of ‘char’ is too long
1 | char two[2] = "foo"; // 'f' 'o'
| ^~~~~
You are showing compiler output and claiming its error proves the
standard.
I actually did ask the compiler to warn about violations of the standard, and
only about them. See:
- The default is '-std=gnu17'. It uses GNU extensions, but I'll show why this
doesn't matter much, with quotes from the gcc(1) manual page:
[
The -ansi option does not cause non‐ISO programs to
be rejected gratuitously. For that, -Wpedantic is
required in addition to -ansi.
]
[
The compiler can accept several base standards, such
as c90 or c++98, and GNU dialects of those standards,
such as gnu90 or gnu++98. When a base standard is
specified, the compiler accepts all programs
following that standard plus those using GNU
extensions that do not contradict it. For example,
-std=c90 turns off certain features of GCC that are
incompatible with ISO C90, such as the "asm" and
"typeof" keywords, but not other GNU extensions that
do not have a meaning in ISO C90, such as omitting
the middle term of a "?:" expression. On the other
hand, when a GNU dialect of a standard is specified,
all features supported by the compiler are enabled,
even when those features change the meaning of the
base standard. As a result, some strict‐conforming
programs may be rejected. The particular standard is
used by -Wpedantic to identify which features are GNU
extensions given that version of the standard. For
example -std=gnu90 -Wpedantic warns about C++ style
// comments, while -std=gnu99 -Wpedantic does not.
]
[
Where the standard specified with -std represents a
GNU extended dialect of C, such as gnu90 or gnu99,
there is a corresponding base standard, the version
of ISO C on which the GNU extended dialect is based.
Warnings from -Wpedantic are given where they are
required by the base standard. (It does not make
sense for such warnings to be given only for features
not in the specified GNU C dialect, since by
definition the GNU dialects of C include all features
the compiler supports with the given option, and
there would be nothing to warn about.)
]
[
-pedantic-errors
Give an error whenever the base standard (see
-Wpedantic) requires a diagnostic, in some cases
where there is undefined behavior at compile‐time and
in some other cases that do not prevent compilation
of programs that are valid according to the standard.
This is not equivalent to -Werror=pedantic, since
there are errors enabled by this option and not
enabled by the latter and vice versa.
]
It would be handier to have a reference to the standard.
The standard is silent about it. Maybe they didn't even consider this to be
important enough to standardize it. The relevant section is C17::6.7.9, but I
didn't find anything there.
However, everything not allowed by the standard is Undefined Behaviour, so it is
UB by ISO C, and therefore GCC is right in warning about it.
Here's a compiler which has been told I want C11.
You told it you want C11.
$ gcc -std=c11 -c str.c
But you didn't tell it to warn about non-conforming code.
Moreover, you asked it to warn about things that may or may not have anything to
do with ISO C11.
[
-Wall
This enables all the warnings about constructions
that some users consider questionable, and that are
easy to avoid (or modify to prevent the warning),
even in conjunction with macros. This also enables
some language‐specific warnings described in C++
Dialect Options and Objective‐C and Objective-C++
Dialect Options.
]
str.c:1:19: warning: initializer-string for array of chars is too long
char two[2] = "foo"; // 'f' 'o'
^~~~~
$ objdump -sj .data str.o
str.o: file format elf64-x86-64
Contents of section .data:
0000 666f666f 6f666f6f 00666f6f 0000666f fofoofoo.foo..fo
0010 6f00 o.
$
Note .data starts with two[]'s ‘fo’.
Undefined Behaviour can result in many different things, including the expected
result. Moreover, since this behaviour is probably an extension by GCC
(although I didn't care enough to check), it's probably implementation-defined
to be that.
Remember that -std=c11 doesn't disable extensions that don't conflict with the
standard (i.e., ones that define what would otherwise be undefined behaviour).
Again, quotation needed:
[
-std=
Determine the language standard. This option is
currently only supported when compiling C or C++.
The compiler can accept several base standards, such
as c90 or c++98, and GNU dialects of those standards,
such as gnu90 or gnu++98. When a base standard is
specified, the compiler accepts all programs
following that standard plus those using GNU
extensions that do not contradict it. For example,
-std=c90 turns off certain features of GCC that are
incompatible with ISO C90, such as the "asm" and
"typeof" keywords, but not other GNU extensions that
do not have a meaning in ISO C90, such as omitting
the middle term of a "?:" expression. On the other
hand, when a GNU dialect of a standard is specified,
all features supported by the compiler are enabled,
even when those features change the meaning of the
base standard. As a result, some strict‐conforming
programs may be rejected. The particular standard is
used by -Wpedantic to identify which features are GNU
extensions given that version of the standard. For
example -std=gnu90 -Wpedantic warns about C++ style
// comments, while -std=gnu99 -Wpedantic does not.
]
- ISO C doesn't allow 'two'.
Reference needed.
The absence of permission makes it UB, IIRC. There's no possible quotation for
the absence of permission. As for something not specified by the standard
being UB, I can't remember or find which paragraph says so. Feel free to
correct me here, since I can't quote it.
- It does however, allow 'five', and forces initialization to the same as
objects that have static storage duration (i.e., 0). See C2x::6.7.10/22
Yes, I know that, showed it above, and this is nothing to do with
initialising a char array but just generally what happens,
e.g. ‘int a[42] = {3, 1, 4}’.
There's actually a paragraph in the standard that specifies it specifically for
char arrays. But, like you, I'd say this was already covered by the normal array
rules, I think.
- It does allow 'three', 'four', and 'implicit', per C2x::6.7.10/15
(I believe it's that paragraph). I admit that the wording is not so
clear as to reject 'two'; however GCC seems to interpret it that way,
in pedantic mode.
We've moved from C11 to a future C, C2x. Paragraph 6.7.10.15 in C2x is
the same as 6.7.9.14 in C11.
Sorry, I had the C2x document more handy, since I had been discussing some
features in it (or to be possibly included for C3x) these days. Now I've quoted
C17.
An array of character type may be initialized by a character string
literal or UTF-8 string literal, optionally enclosed in braces.
Successive bytes of the string literal (including the terminating
null character if there is room or if the array is of unknown size)
initialize the elements of the array.
It describes the behaviour shown by str.c above: successive bytes
initialise the array. It is not rejected by the compiler. More
importantly, I can't see where it is rejected by the standard.
- The string literal is reliably terminating by a NUL.
Terminated, yes. "terminating", hmmm, I'd say no
Sorry, that's a typo, I meant ‘terminated’.
No problem. We all do them :)
- It is not context dependent whether a string literal has a terminating
NUL.
Sure.
Good.
And guns are just machines that make holes, context-independently.
However, they can kill, depending on the context.
Especially if they have no safety, like Glocks, or string literals.
$ cat str.c
#include <stdio.h>
int main(void)
{
    printf("%zu\n", sizeof(1 ? "foo" : "bar"));
    printf("%zu\n", 1 ? sizeof("foo") : sizeof("bar"));
}
$ cc str.c -Wpedantic -pedantic-errors
$ ./a.out
8
4
Yes, I recall this from elsewhere in the thread where I asked you to
explain why switching to nitems() fixed the problem because I couldn't
see it given the code samples shown.
https://lists.gnu.org/archive/html/groff/2022-11/msg00030.html
Sorry, I missed it.
But it is nothing to do with the language C defining what a string is
and having string literals as distinct things worthy of a separate name.
Hmm, makes sense. It's rather an issue of the ternary operator in this case.
My point was that C has dangerous features, that combined can be very dangerous.
Some trivial constructs can help you get the compiler on your side, like the
sizeof division.
See, for example, part of a change that I made to optimize some code,
where I turned pointers to char into char arrays (following Ulrich Drepper's
article about libraries). The global change to arrays instead of pointers
reduced the code size by a couple of KiB, IIRC, which might be important for
cache misses.
-static const char *log_levels[] = {
+static const char log_levels[][8] = {
     "alert",
     "error",
     "warn",
     "notice",
     "info",
     "debug",
 };
As a note, I used 8 for better alignment, but 7 would have been fine.
Now, let's imagine that I append the following element to the array:
"messages"? Values of beta will give rise to dom!
That's because robust code has become fragile. The original was better
because it allowed that addition of a longer string. The couple of KiB
saved is probably irrelevant compared with the human time of dealing with any
error which might arise.
Not really. It's a bug in the compiler. It's only the compiler that decides
which code is fragile or not. Since the new code is undoubtedly better in terms
of performance, and is perfectly supported by the compiler, I find it a bug that
the compiler doesn't make it as safe as the worse version.
So, I'm working on improving the compiler to have it be as safe as the worse
construct.
Wouldn't it be nice to use -Wunterminated-strings and let the
compiler yell at me if I write a string literal with 8 letters?
If the compiler doesn't do that then I expect there is a linter that
will, or a different compiler.
I don't know; maybe. I didn't care enough to try. The project where I use
that has no static analyzers embedded in the build system, and I don't want
to run one manually, so I'll work on improving gcc(1), which is the
simplest thing for me to do.
But it sounds like some of the projects
you work on could do with a project-specific linter which understands
the conventions the code must follow. That might not be too hard given
the LLVM framework and all the tools it provides these days.
Yeah, maybe. Maybe clang-tidy(1) already warns about it. Didn't check, since
it's not useful for me right now.
Having the warning in gcc(1) is valuable, so I'll add it.
Cheers,
Alex
--
<http://www.alejandro-colomar.es/>