Developing a modern data workflow for regularly updated data - Glenda M. Yenni *et al*, PLOS Biology

This resource first appeared in issue #114 on 19 Mar 2022 and has tags Technical Leadership: CI/CD, Technical Leadership: Data Resources

Updating Data Recipe - Ethan White, Albert Kim, and Glenda M. Yenni

This one’s a couple years old, and I’m surprised I hadn’t seen it before.

It’s getting easy to find good examples for scientists of getting started with GitHub, and then CI/CD, for code. But for data it’s much harder. And there’s no reason experimental data shouldn’t benefit from the versioning and analysis-pipeline CI/CD that code does. As data gets cleaned up, the pipeline matures, and data products start being released, these tools are just as useful.

Here the authors publish a recipe, instructions, and template repos for using GitHub Actions and Travis CI with data. The immediate audience is ecologists, but the process is pretty general. The instructions cover configuring the repo and connecting it to Zenodo (for data artifacts) and to the CI/CD tool. Then data checks can be added, with the pipeline failing if the data breaks a validity constraint. And data analysis can be done, with data products published and versioned when a new release is created. It’s really cool!
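To make the data-check step concrete, here is a minimal sketch of the kind of validity check a CI job might run. The file name `surveys.csv` and the `species`/`weight` columns are hypothetical, not from the paper; the point is only that the script exits nonzero on bad data, which is what makes the pipeline fail.

```python
"""Sketch of a CI data-validity check (hypothetical file and columns).

Run as a CI step, e.g.:  python check_data.py surveys.csv
A nonzero exit code fails the pipeline, blocking a bad data release.
"""
import csv
import sys


def check_rows(rows):
    """Return a list of error messages for rows violating constraints."""
    errors = []
    for lineno, row in enumerate(rows, start=2):  # line 1 is the header
        if not row["species"]:
            errors.append(f"line {lineno}: missing species code")
        try:
            weight = float(row["weight"])
            if weight <= 0:
                errors.append(f"line {lineno}: non-positive weight {weight}")
        except ValueError:
            errors.append(f"line {lineno}: non-numeric weight {row['weight']!r}")
    return errors


def main(path):
    with open(path, newline="") as f:
        errors = check_rows(csv.DictReader(f))
    for error in errors:
        print(error, file=sys.stderr)
    return 1 if errors else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```

The same pattern extends to whatever constraints matter for a given dataset (allowed value ranges, foreign-key-style lookups against a species table, no duplicate plot/date rows, and so on).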

By making the things we want to see (data checking, proper data releases) easier, such as by automating them, we get more of what we want to see in the world. This is very useful work.
