[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#24450: [PATCHv2] Re: pypi importer outputs strange character series
From: |
Maxim Cournoyer |
Subject: |
bug#24450: [PATCHv2] Re: pypi importer outputs strange character series in optional dependency case. |
Date: |
Mon, 10 Jun 2019 17:32:01 +0900 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux) |
Hello!
Ricardo Wurmus <address@hidden> writes:
> Patch number 3!
Yay!
>> From 0c62b541a3e8925b5ca31fe55dbe7536cf95151f Mon Sep 17 00:00:00 2001
>> From: Maxim Cournoyer <address@hidden>
>> Date: Thu, 28 Mar 2019 00:26:01 -0400
>> Subject: [PATCH 3/9] import: pypi: Improve parsing of requirement
>> specifications.
>>
>> The previous solution was fragile and could leave unwanted characters in a
>> requirement name, such as '[' or ']'.
>
> Wouldn’t it be sufficient to add [ and ] to the list of forbidden
> characters? The tests pass with this implementation of
> clean-requirements:
>
> (define (clean-requirements s)
> (cond
> ((string-index s (char-set #\space #\> #\= #\< #\[ #\])) => (cut
> string-take s <>))
> (else s)))
Indeed this would be sufficient to make the tests pass, but the tests
don't cover all the cases; as an example, consider:
--8<---------------cut here---------------start------------->8---
argparse;python_version<"2.7"
--8<---------------cut here---------------end--------------->8---
While we could make it work with the current logic by adding more
invalid characters (such as ';' here) to the character set, it seems
less error prone to use the upstream provided regex to match a package
name. [0]
>> +(define %requirement-name-regexp
>> + ;; Regexp to match the requirement name in a requirement specification.
>> +
>> + ;; Some grammar, taken from PEP-0508 (see:
>> + ;; https://www.python.org/dev/peps/pep-0508/).
>> +
>> + ;; The unified rule can be expressed as:
>> + ;; specification = wsp* ( url_req | name_req ) wsp*
>> +
>> + ;; where url_req is:
>> + ;; url_req = name wsp* extras? wsp* urlspec wsp+ quoted_marker?
>> +
>> + ;; and where name_req is:
>> + ;; name_req = name wsp* extras? wsp* versionspec? wsp* quoted_marker?
>> +
>> + ;; Thus, we need only matching NAME, which is expressed as:
>> + ;; identifer_end = letterOrDigit | (('-' | '_' | '.' )* letterOrDigit)
>> + ;; identifier = letterOrDigit identifier_end*
>> + ;; name = identifier
>> + (let* ((letter-or-digit "[A-Za-z0-9]")
>> + (identifier-end (string-append "(" letter-or-digit "|"
>> + "[-_.]*" letter-or-digit ")"))
>> + (identifier (string-append "^" letter-or-digit identifier-end "*"))
>> + (name identifier))
>> + (make-regexp name)))
>
> This seems a little too complicated. Translating a grammar into a
> regexp is probably not a good idea in general. Since we don’t care
> about anything other than the name it seems easier to just chop off
> the string tail as soon as we find an invalid character.
While I agree that a regexp is a bigger hammer than basic string
manipulation, I see some merit to it here:
1) We can be assured of conformance with upstream, again, per PEP-0508.
2) It is easier to extend; we might want to add parsing for the version
spec in order to disregard dependencies specified for Python < 3, for
example.
The use of the PEP-0508 grammar to define the regexp is useful to detail
in a more human-friendly language the components of the regexp. We
could have otherwise used the more cryptic regexp for Python
distribution names:
--8<---------------cut here---------------start------------->8---
^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$
--8<---------------cut here---------------end--------------->8---
So I guess that what I'm saying is that I prefer this approach to using
string-index with invalid characters, for the reasons above.
[0] https://www.python.org/dev/peps/pep-0508/
Thanks!
Maxim