Re: Use guix to distribute data & reproducible (data) science


From: Amirouche Boubekki
Subject: Re: Use guix to distribute data & reproducible (data) science
Date: Fri, 16 Feb 2018 12:41:18 +0000



On Thu, Feb 15, 2018 at 6:11 PM zimoun <address@hidden> wrote:
Hi,

Thank you for this food for thought.


I agree that the frontier between code and data is arbitrary.

However, I am not sure I get the full picture about data management in
the context of Reproducible Science. What is the issue?

So, I take up your invitation to explore your idea. :-)

[...]
 
For me, just talking about code, it is not a straightforward task to
define the properties of a reproducible and fully controlled
computational environment. It is --I guess-- what Guix is defining
(transactional, user profiles, hackable, etc.). It appears to me even
more difficult for data.
 
What are such properties for data management?

In other words, on paper, what are the benefits of managing some piece
of data in the store? For example, the weights of a trained neural
network, or the positions of the atoms in a protein structure.

Given versioned datasets, you could want to switch the input dataset
of a given "pipeline" to see how different data produce different
results.

Also, it is desirable to be able to re-run a "pipeline" when a
dataset is updated.
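As a rough sketch of how that could look in Guix (everything below is
hypothetical: the names, URLs and hashes are placeholders, not existing
packages), a versioned dataset is just a package, and a newer snapshot
is a second package inheriting from the first:

(use-modules (guix packages)
             (guix download)
             (guix build-system copy)
             ((guix licenses) #:prefix license:))

;; Hypothetical dataset package; URL, file name and hash are placeholders.
(define-public example-dataset-1.0
  (package
    (name "example-dataset")
    (version "1.0")
    (source (origin
              (method url-fetch)
              (uri "https://example.org/dataset-1.0.tar.gz")
              (sha256
               (base32
                "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)
    ;; Install the data file under share/data/ in the package output.
    (arguments '(#:install-plan '(("data.csv" "share/data/"))))
    (home-page "https://example.org")
    (synopsis "Example dataset, snapshot 1.0")
    (description "Toy dataset used only to illustrate versioning.")
    (license license:cc0)))

;; A newer snapshot of the same dataset: only version, URL and hash change.
(define-public example-dataset-2.0
  (package
    (inherit example-dataset-1.0)
    (version "2.0")
    (source (origin
              (method url-fetch)
              (uri "https://example.org/dataset-2.0.tar.gz")
              (sha256
               (base32
                "0000000000000000000000000000000000000000000000000000"))))))

A "pipeline" packaged the same way would list one of these in its
inputs; pointing it at the other version is enough to re-run it on the
new data, since Guix rebuilds whatever depends on a changed input.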

For me --maybe I am wrong-- the way is to define a package (or
workflow) that fetches the data from some external source, cleans it if
needed, does some checks, and then puts it in /path/to/somewhere/
outside the store. In parallel computing, this /path/to/somewhere/ is
accessible to all the nodes. Moreover, this /path/to/somewhere/ contains
something hash-based in the folder name.
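Such a package could also clean and check the data at build time. A
minimal sketch, again with made-up names, file layout and checks, and
with the result landing in the store rather than in an external
/path/to/somewhere/:

(use-modules (guix packages)
             (guix download)
             (guix build-system copy)
             ((guix licenses) #:prefix license:))

(define-public toy-measurements
  (package
    (name "toy-measurements")
    (version "2018-02-15")
    (source (origin
              (method url-fetch)
              (uri "https://example.org/measurements.tar.gz")
              (sha256
               (base32
                "0000000000000000000000000000000000000000000000000000"))))
    (build-system copy-build-system)
    (arguments
     '(#:install-plan '(("measurements.csv" "share/data/"))
       #:phases
       (modify-phases %standard-phases
         (add-before 'install 'clean-and-check
           (lambda _
             ;; Hypothetical cleaning step: strip carriage returns.
             (substitute* "measurements.csv" (("\r") ""))
             ;; Hypothetical check: refuse to install an empty file.
             (unless (> (stat:size (stat "measurements.csv")) 0)
               (error "empty dataset")))))))
    (home-page "https://example.org")
    (synopsis "Example dataset cleaned and checked at build time")
    (description "Toy dataset fetched, cleaned and sanity-checked
during the build.")
    (license license:cc0)))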

Is it not enough?

It is not enough: you may need to make a diff between two datasets,
which is not easily done if the data is stored in tarballs. But that
is not a case that can be handled by Guix.

Why do you need the history of changes, as git provides?

Because, if the dataset introduces a change that is not handled by
the rest of the code, you can find out by looking at the diff. For instance, a column that is an enumeration of three values may now have a fourth. But again, it is not a case that is meant to be handled by Guix.
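To make that concrete, here is a small Guile sketch of the kind of
assumption the rest of the code may bake in (the file name, column
index and labels are invented); only a diff of the dataset, or a check
like this one, reveals the new fourth value:

(use-modules (ice-9 rdelim)
             (srfi srfi-1))

;; The three labels the rest of the code knows how to handle.
(define known-labels '("low" "medium" "high"))

(define (unexpected-labels file)
  "Return the labels found in the second column of FILE that are not
among KNOWN-LABELS, e.g. a newly introduced fourth value."
  (call-with-input-file file
    (lambda (port)
      (read-line port)                      ;skip the header line
      (let loop ((line (read-line port)) (unknown '()))
        (if (eof-object? line)
            (delete-duplicates unknown)
            (let ((label (list-ref (string-split line #\,) 1)))
              (loop (read-line port)
                    (if (member label known-labels)
                        unknown
                        (cons label unknown)))))))))

(format #t "unexpected labels: ~a~%" (unexpected-labels "data.csv"))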

Like others have said, there are different kinds of data and - even if it were possible to handle large datasets in the Guix store - it would also require a lot of space and computing power. ConceptNet 5.5.5 is 10G, takes more than a dozen hours to build and, AFAIK, is not reproducible since it takes its input directly from a live instance of Wiktionary. WikiData is 100G, but requires no processing power. Those are structured data that you could want to version in something like git. But things like spaCy models, which are around 1G and take, I guess, a few hours to build, are not structured. Those are the data that I know about, and there are very few of them compared to small datasets (see data.gouv.fr).

I think there are various opportunities around reproducible data science. In particular, I see two main ones:

a) Packaging data and working with upstream to keep the data clean, a task that is already handled by private actors and by initiatives like http://datahub.io/

b) Cooperating around the making of datasets, some kind of git for data, for which there are few or no initiatives. FWIW, I started such a project, which I call neon.

Thanks for the feedback.
