directory-discuss

Re: [directory-discuss] Machine readable dump of Free Software Directory


From: Dmitry Marakasov
Subject: Re: [directory-discuss] Machine readable dump of Free Software Directory
Date: Tue, 13 Mar 2018 23:50:02 +0300
User-agent: Mutt/1.9.1 (2017-09-22)

* Dmitry Marakasov (address@hidden) wrote:

> I'd like to support Free Software Directory in https://repology.org, a
> service which tracks packages versions across hundreds of repositories.
> This will both allow to enrich Repology with another source of verified
> information on free software projects, and to keep Directory more up to
> date by detecting outdated information.
> 
> I need a machine readable dump for FSD for this purpose, as scrapping
> and parsing individual wiki pages does not look viable. Something like
> https://directory.fsf.org/wiki/All, but in XML/JSON and with additional
> version column would be sufficient. Is something like that possible?

Thanks, I was able to parse the
http://static.fsf.org/nosvn/directory/directory.xml dump.

There are some problems with the data, though:

- It seems there are a lot of entries imported from Debian, which means
  incorrect versions (with Debian suffixes, like Beanstalkd 1.10-1
  instead of 1.10) and incorrect download locations (ftp.debian.org
  instead of upstream). Is there a reliable way to filter these out?
  I've tried looking for "Debian_import" in "Submitted_by", but it
  doesn't seem to be reliable: after dropping these I'm still seeing
  Debian suffixes in versions.

- There are a lot of Perl modules which are not distinguishable from
  other software. In most repos they carry distinct prefixes/suffixes,
  e.g. p5-FOO or libFOO-perl, so Repology is able to detect them and
  merge them under a single perl:FOO, avoiding clashes. Is it possible
  to reliably pick out Perl modules in FSD?

- Assorted garbage: names like "2532 [[file:pipe.png]]gigs", versions
  like "Version 1.99"
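
For illustration, the three problems above can be sketched as plain
string checks. This is only a rough sketch, not Repology's actual code:
the function names, the regular expressions, and the lowercased
"perl:..." key are all assumptions made for the example, and it works on
bare name/version strings rather than on the real layout of
directory.xml.

```python
import re

# Trailing Debian revision such as "-1" or "-2ubuntu1".
# Assumption: upstream versions do not themselves end in "-<digit>...".
_DEBIAN_REVISION = re.compile(r"-\d[0-9A-Za-z.+~]*$")

# Hypothetical patterns for the two naming schemes mentioned above:
# "p5-FOO" and "libFOO-perl".
_PERL_NAME = re.compile(r"^(?:p5-(?P<p5>.+)|lib(?P<deb>.+)-perl)$", re.IGNORECASE)


def strip_debian_revision(version: str) -> str:
    """Drop a trailing Debian revision, e.g. turn '1.10-1' into '1.10'."""
    return _DEBIAN_REVISION.sub("", version)


def perl_module_key(name: str):
    """Return a merged 'perl:foo' key for Perl-style names, else None."""
    match = _PERL_NAME.match(name)
    if match is None:
        return None
    module = match.group("p5") or match.group("deb")
    # Lowercasing is an assumption, so that differently cased names merge.
    return "perl:" + module.lower()


def looks_like_garbage(name: str, version: str) -> bool:
    """Flag leftover wiki markup in names and versions not starting with a digit."""
    return "[[" in name or not version[:1].isdigit()
```

None of this answers the question of which entries are trustworthy; it
only shows what cleanup would be needed on the consumer side if the dump
itself cannot tell them apart.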

So, the question is whether it's possible to pick only entries
edited and verified by humans, with data conforming to upstream,
and to filter out Perl modules.

I know it's possible to apply some heuristics (e.g. looking for
debian.org and cpan.org in URLs), but I don't really like this.
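
To make the heuristic concrete, a minimal sketch of the URL check could
look like the following. The function name, the domain table, and the
labels are made up for the example; it matches on the host name only, so
a URL that merely mentions debian.org in its path is not flagged.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from a suspicious domain to a label for the entry.
SUSPECT_DOMAINS = {"debian.org": "debian", "cpan.org": "perl"}


def classify_url(url: str):
    """Return 'debian' or 'perl' if the URL host is under a suspect domain, else None."""
    host = (urlsplit(url).hostname or "").lower()
    for domain, label in SUSPECT_DOMAINS.items():
        # Match the domain itself or any subdomain, but not e.g. "notdebian.org".
        if host == domain or host.endswith("." + domain):
            return label
    return None
```

Such a check can only catch entries whose URLs still point at Debian or
CPAN, which is one reason it is a poor substitute for a reliable marker
in the data.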

-- 
Dmitry Marakasov   .   55B5 0596 FF1E 8D84 5F56  9510 D35A 80DD F9D2 F77D
address@hidden  ..:  jabber: address@hidden      http://amdmi3.ru
