Define EBNF in specification #363

Open
JimFuller-RedHat opened this issue Dec 9, 2024 · 10 comments

Comments

@JimFuller-RedHat

It would be handy to have an EBNF definition of the pURL grammar ... motivated by discussions here #296

@JimFuller-RedHat
Author

JimFuller-RedHat commented Dec 10, 2024

@dawud and I made an initial stab at an EBNF .. it is not 100% correct but raised some interesting observations:

purl
    ::= scheme-component type-component namespace-component? name-component version-component? qualifier-component? subpath-component?

scheme-component
    ::= 'pkg' ':'

type-component
    ::= ( ALPHA | DIGIT )+ '/'

namespace-component
    ::= namespace '/'
namespace
    ::= namespace-segment ( '/' namespace-segment)*
namespace-segment
    ::= ( ALPHA | DIGIT | safe )?

name-component
    ::= name-segment
name-segment
    ::= ( ALPHA | DIGIT | safe )*

version-component
    ::= '@' version
version
    ::= ( ALPHA | DIGIT | safe )*

qualifier-component
    ::= '?' qualifier ( '&' qualifier )?
qualifier
    ::= key '=' value
key      ::= ( ALPHA | DIGIT | '.' | '-' | '_' )*
value    ::= ( ALPHA | DIGIT | safe )+

subpath-component
         ::= '#' subpath
subpath  ::= subpath-segment ( '/' subpath-segment )*
subpath-segment
         ::= ( ALPHA | DIGIT | safe )+

safe  ::= '-'
           | '.'
           | ':'
           | '$'
           | '*'
           | ';'
           | '['
           | ']'
           | '^'
           | '_'
           | '+'
           | '~'

DIGIT ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

ALPHA ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' |
          'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' |
          'u' | 'v' | 'w' | 'x' | 'y' | 'z' |
          'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' |
          'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' |
          'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'

And here is the obligatory Python program generated from the grammar.

It seems that for this EBNF to be useful we would need a detailed grammar for each type, enforcing required qualifiers ... I also do wonder if some of the current rules of pURL (eg. not supporting utf-8 or the vagaries of IRI) could result in a valid pURL but an invalid URL.

Maybe the way to go is to define something basic now and consider higher precision if a vnext is ever considered?

@matt-phylum
Contributor

It looks like this parser fails when it reaches a @ or ? or # or the end of the string.

I don't know if this will work. The PURL parse algorithm parses from both ends of the string, so if you encounter a @ or # or ? going from the left, you can't know if it's the separator without processing the rest of the string to the right looking for other separators. pkg:npm/@angular/cli is invalid (no name) but pkg:npm/@angular/cli@1.0.0 is valid because the @ to the right changes the meaning of the first @.

A really gross example is pkg:generic/ns/n@m#?@version?qualifier=#v@lue#subp@th?. The correct parse according to the spec is type: generic, namespace: ns, name: n@m#?, version: version, qualifier: #v@lue, subpath: subp@th?. It's parsed correctly by althonos/packageurl.rs, package-url/packageurl-dotnet, package-url/packageurl-php, package-url/packageurl-ruby, package-url/packageurl-swift, phylum-dev/purl (6/14 implementations tested).
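The right-to-left splitting described above can be sketched in Python. This is a simplified reading of the spec's parsing steps, not any listed implementation's code: the names `split_last` and `parse_purl` are hypothetical, and type-specific normalization and most validation are left out.

```python
from urllib.parse import unquote


def split_last(text, sep):
    """Split once on the LAST occurrence of sep; the tail is '' if sep is absent."""
    head, found, tail = text.rpartition(sep)
    return (head, tail) if found else (text, "")


def parse_purl(purl):
    """Hypothetical helper: split a PURL from both ends (simplified sketch)."""
    # 1. Subpath: everything after the last '#'.
    remainder, subpath = split_last(purl, "#")
    segments = [unquote(s) for s in subpath.strip("/").split("/")
                if s not in ("", ".", "..")]
    # 2. Qualifiers: everything after the last remaining '?'.
    remainder, query = split_last(remainder, "?")
    qualifiers = {}
    for pair in query.split("&"):
        key, _, value = pair.partition("=")
        if value:  # pairs with an empty value are dropped
            qualifiers[key.lower()] = unquote(value)
    # 3. Scheme and type are taken from the left.
    scheme, _, remainder = remainder.partition(":")
    if scheme.lower() != "pkg":
        raise ValueError("missing 'pkg' scheme")
    type_, _, remainder = remainder.strip("/").partition("/")
    # 4. Version: everything after the last remaining '@'.
    remainder, version = split_last(remainder, "@")
    # 5. Name is the last path segment; whatever is left is the namespace.
    namespace, _, name = remainder.rpartition("/")
    name = unquote(name)
    if not name:
        raise ValueError("name is required")
    return {
        "type": type_.lower(),
        "namespace": "/".join(unquote(s) for s in namespace.split("/") if s),
        "name": name,
        "version": unquote(version),
        "qualifiers": qualifiers,
        "subpath": "/".join(segments),
    }


print(parse_purl("pkg:generic/ns/n@m#?@version?qualifier=#v@lue#subp@th?"))
print(parse_purl("pkg:npm/@angular/cli@1.0.0"))
```

With this sketch, `pkg:npm/@angular/cli` raises `ValueError`: its only `@` is taken as the version separator, which leaves no name. Adding `@1.0.0` moves the version split to the right and the first `@` becomes part of the namespace.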

> I also do wonder if some of the current rules of pURL (eg. not supporting utf-8 or the vagaries of IRI) could result in a valid pURL but an invalid URL.

I think the reason anchore/packageurl-go, giterlizzi/perl-URI-PackageURL, maennchen/purl, package-url/packageurl-go, package-url/packageurl-java, package-url/packageurl-js, package-url/packageurl-python, sonatype/package-url-java (all 8 failing implementations tested) fail is because these PURL parsers are based on existing URI/URL parsers, and PURL uses an incompatible parsing algorithm.
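A small illustration of that incompatibility, using Python's standard `urllib.parse.urlsplit` as a stand-in for a generic URL splitter. It is not claimed that any of the listed implementations call this exact function. A generic splitter cuts at the first `#` and then at the first `?` in what remains, whereas the PURL algorithm cuts at the last of each.

```python
from urllib.parse import urlsplit

purl = "pkg:generic/ns/n@m#?@version?qualifier=#v@lue#subp@th?"
parts = urlsplit(purl)

print(parts.scheme)    # pkg
print(parts.path)      # generic/ns/n@m
print(parts.query)     # (empty: the first '?' already sits inside the fragment)
print(parts.fragment)  # ?@version?qualifier=#v@lue#subp@th?
```

Everything after the first `#` lands in the fragment, so the version and qualifiers never reach the code that would look for them.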

@JimFuller-RedHat
Author

JimFuller-RedHat commented Dec 10, 2024

> It looks like this parser fails when it reaches a @ or ? or # or the end of the string.

ah right, ya I have not completed it fully thx!

> A really gross example is pkg:generic/ns/n@m#?@version?qualifier=#v@lue#subp@th?. The correct parse according to the spec is type: generic, namespace: ns, name: n@m#?, version: version, qualifier: #v@lue, subpath: subp@th?. It's parsed correctly by althonos/packageurl.rs, package-url/packageurl-dotnet, package-url/packageurl-php, package-url/packageurl-ruby, package-url/packageurl-swift, phylum-dev/purl (6/14 implementations tested).

useful (pathological) test cases ;)

> > I also do wonder if some of the current rules of pURL (eg. not supporting utf-8 or the vagaries of IRI) could result in a valid pURL but an invalid URL.

> I think the reason anchore/packageurl-go, giterlizzi/perl-URI-PackageURL, maennchen/purl, package-url/packageurl-go, package-url/packageurl-java, package-url/packageurl-js, package-url/packageurl-python, sonatype/package-url-java (all 8 failing implementations tested) fail is because these PURL parsers are based on existing URI/URL parsers, and PURL uses an incompatible parsing algorithm.

once we fully implement EBNF we will know ;) but as prev mentioned I think we will need to drill down and generate rules for each type.

@JimFuller-RedHat
Author

JimFuller-RedHat commented Dec 10, 2024

updated EBNF and test parse program which can be run as follows (enclose pURL in {})

python3 test.py {pkg:npm/foobar@12.3.1\?Faaaa\=aaaa}

which emits the following parse tree (in xml)

<?xml version="1.0" encoding="UTF-8"?><purl><scheme-component><TOKEN>pkg</TOKEN><TOKEN>:</TOKEN></scheme-component><type-component><ALPHA><TOKEN>n</TOKEN></ALPHA><ALPHA><TOKEN>p</TOKEN></ALPHA><ALPHA><TOKEN>m</TOKEN></ALPHA><TOKEN>/</TOKEN></type-component><name-component><name-segment><ALPHA><TOKEN>f</TOKEN></ALPHA><ALPHA><TOKEN>o</TOKEN></ALPHA><ALPHA><TOKEN>o</TOKEN></ALPHA><ALPHA><TOKEN>b</TOKEN></ALPHA><ALPHA><TOKEN>a</TOKEN></ALPHA><ALPHA><TOKEN>r</TOKEN></ALPHA></name-segment></name-component><version-component><TOKEN>@</TOKEN><version><DIGIT><TOKEN>1</TOKEN></DIGIT><DIGIT><TOKEN>2</TOKEN></DIGIT><safe><TOKEN>.</TOKEN></safe><DIGIT><TOKEN>3</TOKEN></DIGIT><safe><TOKEN>.</TOKEN></safe><DIGIT><TOKEN>1</TOKEN></DIGIT></version></version-component><qualifier-component><TOKEN>?</TOKEN><qualifier><key><ALPHA><TOKEN>F</TOKEN></ALPHA><ALPHA><TOKEN>a</TOKEN></ALPHA><ALPHA><TOKEN>a</TOKEN></ALPHA><ALPHA><TOKEN>a</TOKEN></ALPHA><ALPHA><TOKEN>a</TOKEN></ALPHA></key><TOKEN>=</TOKEN><value><ALPHA><TOKEN>a</TOKEN></ALPHA><ALPHA><TOKEN>a</TOKEN></ALPHA><ALPHA><TOKEN>a</TOKEN></ALPHA><ALPHA><TOKEN>a</TOKEN></ALPHA></value></qualifier></qualifier-component></purl>

still not quite right but a little better ...
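To read a tree like this more easily, the `TOKEN` leaves can be joined back together per top-level component. A minimal sketch with the standard library follows. The `TREE` string is a shortened, hand-written excerpt in the same shape as the output above, and `component_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-written excerpt shaped like the parse tree above.
TREE = (
    "<purl>"
    "<scheme-component><TOKEN>pkg</TOKEN><TOKEN>:</TOKEN></scheme-component>"
    "<type-component><ALPHA><TOKEN>n</TOKEN></ALPHA><ALPHA><TOKEN>p</TOKEN></ALPHA>"
    "<ALPHA><TOKEN>m</TOKEN></ALPHA><TOKEN>/</TOKEN></type-component>"
    "<name-component><name-segment><ALPHA><TOKEN>f</TOKEN></ALPHA>"
    "<ALPHA><TOKEN>o</TOKEN></ALPHA><ALPHA><TOKEN>o</TOKEN></ALPHA>"
    "</name-segment></name-component>"
    "</purl>"
)


def component_text(xml_text):
    """Map each top-level component to the concatenation of its TOKEN leaves."""
    root = ET.fromstring(xml_text)
    return {
        child.tag: "".join(tok.text or "" for tok in child.iter("TOKEN"))
        for child in root
    }


print(component_text(TREE))
# {'scheme-component': 'pkg:', 'type-component': 'npm/', 'name-component': 'foo'}
```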

To generate test parser:

  1. go to https://www.bottlecaps.de/rex/
  2. use these configuration values: -tree -python -main
  3. supply test ebnf

Note: the invocation above escapes the ? and = characters with \

@matt-phylum
Contributor

When asking for the error message, there's a bug in the generated code where getTokenSet has a negative f value and then either loops forever or has an indexing exception. I think it should look like this. At least it doesn't throw or hang this way.

while f > 0:
  if (f & 1) != 0 and 0 <= j and j < len(test.TOKEN):
    tokenSet.append(test.TOKEN[j])
    size += 1
  j += 1
  f >>= 1
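Why a negative f can hang such a loop: Python's `>>` is an arithmetic shift on unbounded integers, so a negative value settles at -1 and never reaches 0. The sketch below assumes the generated loop originally tested `f != 0` (the original condition is not shown in this thread); `shift_trace` is a hypothetical helper that caps the number of steps.

```python
def shift_trace(f: int, max_steps: int = 8) -> list[int]:
    """Record f after each right shift, stopping early once it reaches 0."""
    trace = [f]
    for _ in range(max_steps):
        if f == 0:
            break
        f >>= 1
        trace.append(f)
    return trace


print(shift_trace(5))   # [5, 2, 1, 0]
print(shift_trace(-5))  # [-5, -3, -2, -1, -1, -1, -1, -1, -1]
```

With `while f > 0` the loop simply never runs for a negative f, which matches the "doesn't throw or hang" behaviour described above.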

With this adapter, the code can be loaded into purl-survey and tested using the test suite: https://gist.github.com/matt-phylum/60037cad76af18a7359650b8f319ca36

As expected given the current state, it fails most tests. There might be some bugs in parts of the adapter that are currently unreachable (eg the subpath might have an extra # character on the front, but subpaths don't work at all so I don't know).

@JimFuller-RedHat
Author

thx for trying it out - great idea re integrating with purl-survey ... I am afraid it's going to take me a few iterations - when I think it is ready for proper testing I will raise a PR.

@giterlizzi
Contributor

> A really gross example is pkg:generic/ns/n@m#?@version?qualifier=#v@lue#subp@th?. The correct parse according to the spec is type: generic, namespace: ns, name: n@m#?, version: version, qualifier: #v@lue, subpath: subp@th?. It's parsed correctly by althonos/packageurl.rs, package-url/packageurl-dotnet, package-url/packageurl-php, package-url/packageurl-ruby, package-url/packageurl-swift, phylum-dev/purl (6/14 implementations tested).

> > I also do wonder if some of the current rules of pURL (eg. not supporting utf-8 or the vagaries of IRI) could result in a valid pURL but an invalid URL.

> I think the reason anchore/packageurl-go, giterlizzi/perl-URI-PackageURL, maennchen/purl, package-url/packageurl-go, package-url/packageurl-java, package-url/packageurl-js, package-url/packageurl-python, sonatype/package-url-java (all 8 failing implementations tested) fail is because these PURL parsers are based on existing URI/URL parsers, and PURL uses an incompatible parsing algorithm.

The canonical PURL for pkg:generic/ns/n@m#?@version?qualifier=#v@lue#subp@th? is:

$ purl-tool --type 'generic' \
  --namespace 'ns' \
  --name 'n@m#?' \
  --version 'version' \
  --qualifier '=#v@lue' \
  --subpath 'subp@th?'
pkg:generic/ns/n%40m%23%3F@version?=%23v%40lue#subp%40th%3F

Parse / Decode:

$ purl-tool pkg:generic/ns/n%40m%23%3F@version?=%23v%40lue#subp%40th%3F
{
   "name" : "n@m#?",
   "namespace" : "ns",
   "qualifiers" : {
      "" : "#v@lue"
   },
   "subpath" : "subp@th?",
   "type" : "generic",
   "version" : "version"
}

NOTE: purl-tool is the CLI shipped with the URI-PackageURL distribution.

@matt-phylum
Contributor

The canonical PURL is pkg:generic/ns/n%40m%23%3F@version?qualifier=%23v%40lue#subp%40th%3F. You lost the qualifier key when entering it into purl-tool.

All complete implementations tested parse the canonical PURL except maennchen/purl and package-url/packageurl-ruby which both underdecode. URI-PackageURL parses the non-canonical PURL incorrectly.
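For reference, the canonical string quoted above can be reproduced with a short sketch that percent-encodes each component before joining. This is an approximation: `build_purl` is a hypothetical helper, it encodes everything outside the unreserved set (the spec leaves a few more characters such as `:` unencoded), and it applies no type-specific normalization.

```python
from urllib.parse import quote


def build_purl(type_, name, namespace="", version="", qualifiers=None, subpath=""):
    """Hypothetical helper: assemble a PURL string from decoded components."""
    def enc(text):
        # Encode everything except letters, digits and '-', '.', '_', '~'.
        return quote(text, safe="")

    purl = "pkg:" + type_.lower() + "/"
    if namespace:
        purl += "/".join(enc(seg) for seg in namespace.strip("/").split("/")) + "/"
    purl += enc(name)
    if version:
        purl += "@" + enc(version)
    # Qualifiers are emitted sorted by key; pairs with empty values are skipped.
    pairs = [f"{key.lower()}={enc(value)}"
             for key, value in sorted((qualifiers or {}).items()) if value]
    if pairs:
        purl += "?" + "&".join(pairs)
    if subpath:
        purl += "#" + "/".join(enc(seg) for seg in subpath.strip("/").split("/"))
    return purl


print(build_purl("generic", "n@m#?", namespace="ns", version="version",
                 qualifiers={"qualifier": "#v@lue"}, subpath="subp@th?"))
# pkg:generic/ns/n%40m%23%3F@version?qualifier=%23v%40lue#subp%40th%3F
```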

giterlizzi added a commit to giterlizzi/perl-URI-PackageURL that referenced this issue Dec 16, 2024
- Improved parsing of non-canonical PURL (package-url/purl-spec#363)
- Improved "URI::VersionRange->constraint_contains"
- Updated "maven" repository URL
- FIX typo in documentation
- Synced "test-suite-data.json" from "package-url/purl-spec"
@giterlizzi
Contributor

> The canonical PURL is pkg:generic/ns/n%40m%23%3F@version?qualifier=%23v%40lue#subp%40th%3F. You lost the qualifier key when entering it into purl-tool.

Right!

> All complete implementations tested parse the canonical PURL except maennchen/purl and package-url/packageurl-ruby which both underdecode. URI-PackageURL parses the non-canonical PURL incorrectly.

I improved the URI-PackageURL parser and added this non-canonical PURL in the tests.

@pombredanne
Member

@JimFuller-RedHat thanks to you, TIL about @GuntherRademacher's https://github.com/GuntherRademacher/rex-parser-generator which looks like a freaking awesome tool!


5 participants