This resource first appeared in issue #103 on 04 Dec 2021 and has tags Technical Leadership: Systems: Other
DevOps in academic research - by Matthew Segal
Here Segal, who worked for 18 months as a “Research DevOps Specialist”, talks about his work in moving a 20kloc MCMC python modelling package for infectious disease models, in a development and systems environment that wasn’t prepared for the sudden urgency and rapid release cycles that were needed when COVID broke out. There were no tests, making development slow. A lot of manual toil was involved in calibrating updated models, which was fine when they were for a paper that took months to write but suddenly much less ok when stakeholders wanted updates weekly.
The article describes a very mature approach to introducing devops into an academic group - mapping the workflow, introducing very basic testing (“smoke tests” - does the code blow up?) to get started, adding automated CI testing, improving performance (useful in its own right, but also makes it easier to get people running tests locally), adding task automation, adding diagnostic visualizations, and introducing data management practices. Notably, while there are tool recommendations, the focus is on processes and enablement.
As soon as basic processes were automated to the point where manual toil wasn’t the limiting step any more, the slurm cluster the group used became the bottleneck, and the team had to move to AWS + Azure (GitHub Actions) to get their new fast, dynamic, workflow running. I think it’s likely a problem in the medium and long term that so many institutions don’t provide access to flexible and dynamic infrastructure, relying solely on systems that have been highly tuned and optimized for asynchronous batch runs. Once early adopter groups get used to going elsewhere for services, it’s going to be hard to bring them back.