jonathan@researchcomputingteams.org

Category: Technical Leadership

Subcategories: Automation, CI/CD, Case study, Cloud, Code Reviews, Data Resources, HPC, Migration, Open Source Management, Other, Reproducibility, Runbooks, Security, Software Development, Systems, Systems: Incident Handling, Systems: Other, Testing

Picking problems for programming interviews - Will Larson

Picking problems for programming interviews - Will Larson If you do include coding as part of your interviews, it’s tough to find something that is relevant, hard enough to successfully distinguish between candidates, but easy enough to be doable. Here Larson plays with a few examples (one of which is a particular kind of data munging: something broadly relevant to our needs). His suggestions are to aim for problems that: support simple initial solutions and compounding requirements; are solvable with a few dozen lines of...

Does Stress Impact Technical Interview Performance? - Behroozi, Shirolkar, Barik & Parnin

Does Stress Impact Technical Interview Performance? - Behroozi, Shirolkar, Barik & Parnin Tech Sector Job Interviews Assess Anxiety, Not Software Skills - Chris Parnin & Matt Shipman, NC State Whiteboard coding interviews are not widely loved by candidates. I don’t have interviewees live code, but I do like watching candidates work through similar kinds of problems on a whiteboard. This study may finally make me rethink this. It’s a small study (N=48) where interviewees were assessed on their coding skills and randomized into two arms....

Research Software Engineers - Job Descriptions - Aalto Scientific Computing Group

Research Software Engineers - Job Descriptions - Aalto Scientific Computing Group The Scientific Computing group of Aalto University has text for their job descriptions of a simple three-step (RSE1, RSE2, RSE3) progression for software development in their institution. It’s not a formally recognized ladder by HR yet but it guides their hiring decisions. The whole thing is just a few paragraphs long, but it’s very clear and is a lot more than most institutions have. The other internal documents they have on the page are...

Code Review is Feedback - Linnea Huxford

Code Review is Feedback - Linnea Huxford A reminder that code review isn’t just about the code in question, it’s feedback. So that means it’s an opportunity to give nudges to inform future behaviours (code submissions), it’s an opportunity to give positive as well as negative feedback, and it’s important that all team members are providing consistent feedback.

We Have to Have a Talk: A Step-by-Step Checklist for Difficult Conversations - Judy Ringer

We Have to Have a Talk: A Step-by-Step Checklist for Difficult Conversations - Judy Ringer There’s one thing I’d add as a preamble to this article. If things have advanced to the point with one of our teammates where we’re going to have the sort of conversation we need to brace ourselves for, it is almost always our fault, at least in part. We didn’t have to let things slide this long. Giving consistent feedback about small things, even if uncomfortable, will allow you to...

Create space for others - Will Larson

Create space for others - Will Larson One of the hardest things about a transition to leadership, either on the people-manager or technical-leadership track, is stepping further and further back from directly making contributions and spending more time making room for others, nurturing their contributions, and gathering their input. In this article, Larson describes how that works at the Staff+ Engineer level at large tech companies.

Some RSE Group Communications Examples

Newcastle University Research Software Engineering 2020 (PDF) - Newcastle Research Software Group BEAR - Advanced Research Computing Research Software Group 2020 Report (PDF) - Birmingham Research Software Group These two reports on the 2020 activities of the research software development groups at Newcastle and Birmingham are extremely interesting if you run a research software development core facility-type operation, and very interesting even if you don’t, in terms of the clear product and strategy mindset (and communications efforts) behind the groups. In Newcastle’s, we get some...

Why Senior Engineers Hate Coding Interviews - Adam Storm

Why Senior Engineers Hate Coding Interviews - Adam Storm Storm’s piece is related to the discussion last month on hiring criteria, and matching evaluation to what people would actually be doing on the job. Senior developers spend more time deciding what to code than doing on-the-fly coding, and putting them into a whiteboard coding interview is stressful, unfamiliar, and doesn’t measure what you care about. Storm emphasizes this point, and suggests that if you really want to see if they can code or not...

A thorough team guide to RFCs - Juan Pablo Buriticá

A thorough team guide to RFCs - Juan Pablo Buriticá We’ve written before about design documents and architectural decision logs (e.g. #33) and using collaboration around documents as a form of asynchronous meeting (e.g. #49). Usually the thinking is that someone in charge has initiated the document. Buriticá writes about team member-initiated requests for comments as a proposal for a change or the creation of something new, which can then go through a comments phase like a PR, and an approval phase where whatever decision making...

Building an SRE Team? How to Hire, Assess, & Manage SREs - Emily Arnott

Building an SRE Team? How to Hire, Assess, & Manage SREs - Emily Arnott DevOps and SRE are two sides of a similar coin - bridging the gap between systems and developer teams to do better work faster. DevOps topics usually involve speeding release cycles, and SRE topics usually focus on improving automation, resiliency, and handling incidents, but there’s a significant degree of overlap. Even if you aren’t explicitly building an SRE or DevOps team, you can start hiring for these skills and approaches in...

How to release 2 years of unfinished code, the "Agile way" - Jonathan Hall

How to release 2 years of unfinished code, the “Agile way” - Jonathan Hall Sometimes you get yourself into a hole and need to find a way out. Hall recommends releasing something that works right away - some teeny change to the last related version - just to practice the (now quite out of date!) release process and give your team a quick win. It could even be a version bump and a change to the docs! Release something, then find a way to maybe...

How to Freaking Find Great Developers By Having Them Read Code - Freaking Rectangle

How to Freaking Find Great Developers By Having Them Read Code - Freaking Rectangle We know code is read more than it’s written, and that debugging, code maintenance, and incremental addition are more common and time consuming than “green field” code development. And yet, the entire software development community tends to vastly over-value writing code from scratch over understanding existing code. That’s true of research software development, too, which famously almost never starts completely from scratch. Here the article’s author recommends focusing a “coding” interview...

Don’t fund Software that doesn’t exist

Don’t fund Software that doesn’t exist This blog post by Andreas Müller connects two facts that I think most of us in R&D computing are pretty familiar with - one that we talk a lot about and one that we don’t - and extrapolates to a conclusion that I’m not sure I agree with but is certainly worth discussing. The fact that we talk about regularly is that ongoing maintenance of important research software (and key open-source software in general) is famously underfunded, and this...

Second thoughts on Proper Citation Guidance for Software Developers

Second thoughts on Proper Citation Guidance for Software Developers A good recent blog post on the pros and cons of different approaches to software citation by Daniel S. Katz, who’s thought about this a lot. Some key points: any method is going to take extra work by someone; there may not be a one-size-fits-all approach; and in the end, code just isn’t the same as a paper (amongst other things, there’s no one point at which it’s done). Daniel ends the post leaning tentatively...

The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences, the ELIXIR team

The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences, the ELIXIR team Funding of research computing and data resources is hard. As was pointed out in a recent blogpost at US-RSE, research infrastructure is generally funded as a project, which ends (“and they all lived happily ever after”), rather than as a product which continues to be used; “sustainability” is a word that comes up very quickly in research computing conversations. This is a good paper that makes a familiar case for funding...

On being "technical" as an engineering manager, Sean Voisin

On being “technical” as an engineering manager, Sean Voisin How technical do you have to be to be a technical manager? I think this blog post has the right answer: “enough”, where enough will depend on what your team is doing and what your role is on the team. You have to be able to make sure the team’s doing the right work and progressing satisfactorily, but that takes a different kind, rather than a different amount, of technical knowledge from what’s needed to actually do the work. For us it will require...

On Pair Programming, Birgitta Böckeler & Nina Siessegger

On Pair Programming, Birgitta Böckeler & Nina Siessegger A great description of the hows and whys of pair programming, a technique I don’t see very often in research software development (though given how subtle some pieces of what we work on are, it might be useful). There are two big advantages of pair programming - knowledge transfer/collective code ownership (at least two people know how some piece of code works), and code quality (two people’s input is better than one). (It can have advantages for helping the team learn to work together, but...

Learning from SRE Teams About Identifying and Reducing Repetitive Work

Tracking Toil using SRE principles - Eric Harvieux, Google Cloud Blog Writing Runbook Documentation when you’re an SRE - Taylor Barnett, Transposit “Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” These two articles came out in the same week, and they work very nicely together. One of the under-appreciated advantages commercial cloud organizations (or other large operations) have at scale isn’t about hardware - it’s that they...

Open Source Maintenance Is Hard Work

The Happiness and Stresses of Open Source work - Drew DeVault My FOSS story - Andrew Gallant Research computing teams have a lot in common with open source communities - even if you aren’t developers or developing open source software. One of the joys of open source communities is that you’re part of a small, visible team solving problems for your users - and that’s exactly the situation we’re in. But there are downsides to that, too. Users can be incredibly demanding, and when you’re a...

Jeremy’s Notes on Fast.AI coding style

Jeremy’s Notes on Fast.AI coding style A bracing reminder (ternary operators! 120-character lines!) that there aren’t “correct” coding styles; the purpose is to make sure teams reduce internal barriers to collaboration by picking one style that works for them and sticking with it.

HPC on OpenStack: the good, the bad and the ugly - Ümit Seren

HPC on OpenStack: the good, the bad and the ugly - Ümit Seren The FOSDEM 2020 talks are online now and there’s a lot of really nice work presented. In the HPC, Big Data, and Data Science track is this good- and bad-news overview of setting up multiple HPC infrastructures on an on-prem OpenStack deployment to take advantage of the reconfigurability between environments. It’s a great talk, and highlights the downsides (the huge complexity of OpenStack) and the upsides (the configurability). That talk brings up...

Building cloud-based data services to enable earth-science workflows across HPC centres - John Hanley, ECMWF

Building cloud-based data services to enable earth-science workflows across HPC centres - John Hanley, ECMWF Also from both the UK and FOSDEM, this is a really nice overview of a very sophisticated solution to making archival simulation data from the outputs of the European Centre for Medium-Range Weather Forecasts available in a cloud environment for querying and reanalysis. Groups like ECMWF run with operational requirements that would keep most of us awake at night in panic-induced sweats - it turns out that governments, companies, navies...

Source Code has a Brief and Lonely Existence - Derek Jones

Source Code has a Brief and Lonely Existence - Derek Jones Derek Jones has an interesting blog where he takes data-driven looks at software and software development. Here he points out: The majority of source code has a short lifespan (i.e., a few years), and is only ever modified by one person (i.e., 60%). I think this is worth coming to terms with, particularly in terms of research computing and tool maturity. Most new ideas, as they get put into code, will stall out at...

Avoid Rewriting a Legacy System from Scratch by Strangling It - Nicolas Carlo

Avoid Rewriting a Legacy System from Scratch by Strangling It - Nicolas Carlo Because the value of the code is what was learned from writing it rather than the code itself, when it comes time to grow from earlier maturity stages to the next, the recommendation from other sectors is to migrate away from the earlier code, not to refactor a proof of concept until it somehow becomes production quality. (See also Keavy McMinn’s recommendation to throw away code from a proof-of-concept “spike” in her...

The Four Pillars of Research Software Engineering - Cohen et al.

The Four Pillars of Research Software Engineering - Cohen et al. “In this article, we have presented 4 pillars representing a set of areas and activities that we consider to be vital in offering coordinated and comprehensive support for research software and the people who build it. In turn, we hope this will demonstrate to professional developers and researchers alike that research is a viable, and interesting, environment for a software development career.” In this white paper, the authors present what they see as the...

Mindset for Working With Legacy Code

The key points of “Working Effectively with Legacy Code” - Nicolas Carlo Exit the Haunted Forest - John Millikan An awful lot of code we work with in research computing can be thought of as legacy code - whether it’s functioning but old code that needs to be refactored to meet current needs, or whether it’s new code from a researcher which just isn’t maintainable in its current form. I like to think of reworking such code as helping the code reach its potential....

Quantifying Independently Reproducible Machine Learning - Edward Raff, writing at The Gradient

Quantifying Independently Reproducible Machine Learning - Edward Raff, writing at The Gradient We worry a lot about replication and reproducibility in research computing. In this article, the author describes attempting to independently replicate the results and basic methods in 255 (!!!) ML papers. Crucial here is independent replication; it’s not enough to just run the code, the methods have to be implemented independently. He was successful 162 times. That’s enough papers to do some quantitative analysis, and it’s interesting what aspects of the work were not...

Premortems: The solution to the Preventable Problems Paradox - Shreyas Doshi on Twitter

Premortems: The solution to the Preventable Problems Paradox - Shreyas Doshi on Twitter This is a great Twitter thread, which I assume is a summary of one of the author’s presentations, giving very specific advice on how to run a pre-mortem before starting a project to identify potential issues before they arise. It’s so easy for people to see potential issues and not say anything; it could be because they’re not comfortable speaking up, but it could just as easily be because they assume someone...

Good Code Reviews

Michaela Greiler on Code Reviews - Software Engineering Radio Episode 400 How to do High-Bar Code Review Without Being a Jerk - Andrew King How to do a Code Review - Google Engineering Practices Documentation Reading Chelsea Troy’s blog has kind of convinced me that Code Reviews are a way of doing asynchronous, distributed pair programming. And even if you do them within an in-person team, they require good communication skills to be productive and drama-free, both in the review itself and “out of band”....

Get unstuck on your infrastructure choices - Fred Ross

Get unstuck on your infrastructure choices - Fred Ross A good reminder that there are a lot of perfectly good technical solutions out there and worrying about which one is “the best” probably isn’t worth your time: Decide based on the following criteria: Has your company already standardized on one of these? Use what they do. Do you already have experience on one of them? Use what you know. Do you have a friend or colleague that knows one of them and who will help...

A Checklist For Evaluating New Technology - Gustav Wengel

A Checklist For Evaluating New Technology - Gustav Wengel In a similar vein to the above, a pragmatic way of looking at a possible new tool or technology or even methodology to adopt. The checklist items most relevant to us: Does it solve a problem you’re actually having right now? Is it easily replaced? Can it be evaluated quickly? Is the project/technology popular and well maintained? How mature is it? Can it be maintained? Of all of them, I think “Is it easily replaced” is...

Feedback Ladders: How We Encode Code Reviews at Netlify - Leslie Cohn-Wein, Kristen Lavavej & swyx

Feedback Ladders: How We Encode Code Reviews at Netlify - Leslie Cohn-Wein, Kristen Lavavej & swyx We had several links about code reviews and the importance of clarity around expectations two weeks ago; in this post, authors from Netlify describe a simple, emoji-encoded 5-level scheme for communicating how urgent and important the code review recommendations are. It’s kind of the code review equivalent of the paper referee’s Reject/Resubmit after Major Revisions/Accepted Pending Minor Revisions/Accepted rubric. Read the article for the details, but the levels are:...

How to Grow Neat Software Architecture out of Jupyter Notebooks - Guillaume Chevalier

How to Grow Neat Software Architecture out of Jupyter Notebooks - Guillaume Chevalier This is an older blogpost which just became a recent talk. I’m coming around to the point of view that computational notebooks have real problems - obvious ones like hidden state, and maybe less obvious ones like the way the structure of notebooks actively discourages reasonable software development practices like unit testing or even version control. People even study this. But in research computing lots of things have problems and we are kind of...

Code-wise, cloud-foolish: avoiding bad technology choices - Forrest Brazeal

Code-wise, cloud-foolish: avoiding bad technology choices - Forrest Brazeal This article is from the start of the year, but it’s been circulating around, and it is good advice in a short article. Everywhere in computing, but maybe worse in research computing, there is a tendency towards NIH Syndrome - “Not Invented Here”. There’s a tendency to roll our own rather than lean on existing tooling, which is frankly madness, since our specialty is using computing technology to solve research problems, not to invent computing...

5 Best Practices on Nailing Postmortems - Hannah Culver, Blameless

5 Best Practices on Nailing Postmortems - Hannah Culver, Blameless We’ve started doing incident reports - sort of baby postmortems - in our project, which has been an extremely useful practice in growing more discipline about how issues with availability or security are reported, distributed, and dealt with. It also gives us a library of materials that we can look through to identify any patterns that appear. This article talks about some best practices for running postmortem processes – Use visuals Be a historian -...

The Pyramid of Unit Testing Benefits - Gergely Orosz

The Pyramid of Unit Testing Benefits - Gergely Orosz Unit testing is now accepted widely enough in research computing that it doesn’t really need justification, but when people talk about its benefits, it’s usually about fairly low-level benefits - CI/CD and avoiding regressions. But there’s an entire pyramid of benefits: Validate your work. Separate concerns in your code. An always up-to-date documentation. Fewer regressions. A safety net for refactoring. Advantages like documentation, and the need to separate concerns in the code to the point that unit testing...

Google’s Technical Writing Courses - Google

Google’s Technical Writing Courses - Google Some of us, particularly those of us who were trained in engineering departments, got technical writing training, but most of us didn’t, and the training we did get was focussed more on research papers (which, let’s face it, is a terrible model for almost any other form of writing besides research papers). Google has made available two of their internal courses on technical writing. The first course is sort of “Strunk and White for people who work with...

Design Docs, Markdown, and Git - Caitie McCaffrey

Design Docs, Markdown, and Git - Caitie McCaffrey The Word/SharePoint workflow that Azure Sphere Security Services used for drafting, circulating, refining, and approving design documents wasn’t working, so they trialed a move to using markdown and git for their design documents. It was a success, and here they write up their approach. Not every design document corresponds to just one repository’s worth of code, so they chose to have one single repo for design documents for their organization, to support discoverability and large/unconstrained multi-codebase architectural...

Building a Shared Resource HPC Center Across University Schools and Institutes: A Case Study - MacLachlan et al.

Building a Shared Resource HPC Center Across University Schools and Institutes: A Case Study - MacLachlan et al. Here the authors describe the history of an HPC centre at George Washington University; it’s interesting to read this in the light of the broader study above. We see some of the same themes; “The budget did not include operating budget line items for staff and operating expenses in the initial budget” and yet “New staff resources was one of the most critical success factors as well...

The Communicative Value of Using Git Well - Jeremy Kun

The Communicative Value of Using Git Well - Jeremy Kun I’ve mentioned before several of Chelsea Troy’s articles on code review as a sort of asynchronous pair programming, with the benefits both of better quality code and knowledge transfer. In this article, Kun talks about crafting code changes into meaningful commits and PRs exactly to enhance that communication and knowledge transfer.

A C++ Migration Story adopting Modules

Migrating large codebases to C++ Modules - Takahashi, Shadura, & Vassilev C++ Modules in ROOT and Beyond - Vassilev, Lange, Muzzafar, Rodozov, Shadura, & Penev C++20 is finally coming. There are five major new features - Contracts (preconditions/postconditions/assertions - which I think are potentially extremely interesting for research computing), Co-routines, Concepts, Ranges, and Modules. Modules are probably the biggest change to the language. Ever since C, the approach that’s been taken for modularization of C/C++ code is C-preprocessor style include statements. These are hard to...

Asymptotics of Reproducibility - Roger Peng

Asymptotics of Reproducibility - Roger Peng A reminder that reproducibility/repeatability is not an immutable property of some computational work — it decays over time, requires maintenance, and that maintenance has to be done by someone.

How to create an incident response playbook - Blake Thorne, Atlassian

How to create an incident response playbook - Blake Thorne, Atlassian This is a really good starting point for putting together an incident response playbook, and includes links to Atlassian’s own playbooks and a workshop on incident communication. This is something we’re working on in our own team. We’re not there yet, but we’re getting there. On the other hand, colleagues-of-colleagues of mine were involved in a major incident recently in an organization where there were lots of security policies in place about keys and...

A graduate student perspective on overcoming barriers to interacting with open-source software - Oihane Cereceda, Danielle E.A. Quinn

A graduate student perspective on overcoming barriers to interacting with open-source software - Oihane Cereceda, Danielle E.A. Quinn It’s easy to forget how confusing and intimidating it can be to work with open source projects for the first time - filing an issue, submitting a PR (is this change too trivial? Am I submitting the PR right?). This is a description, from the point of view of a grad student, of the issues with interacting with open source communities for the first time.

The Engineering Manager Event Loop - David Loftesness via Chris Eigner

The Engineering Manager Event Loop - David Loftesness via Chris Eigner This isn’t new, but I really like the idea: what a generic tech software development manager should be thinking of daily, weekly, and monthly on people, projects, processes, and themselves. It’s not quite right for research computing - thinking about recruiting and hiring on a daily basis is, to put it mildly, not the regime we’re normally in - but a lot of the other items hold up. What other changes would we have...

Five Code Review Anti-Patterns - Trisha Gee, Oracle

Five Code Review Anti-Patterns - Trisha Gee, Oracle We’ve talked before about having clear expectations on code review; here are five common traps to avoid, which could be made explicit as part of your team’s CONTRIBUTING.md or similar: Nit-Picking; Inconsistent Feedback; Last-Minute Design Changes; Ping-Pong Reviews; Reviewer Ghosting.

Remote brainstorming for regular humans - Bartek Ciszkowski

Remote brainstorming for regular humans - Bartek Ciszkowski Whiteboarding and brainstorming are harder to do when the team is distributed. Here are some suggestions from Ciszkowski on how to do distributed brainstorming: Do it in ~20 minute chunks with 5 minute breaks. Use a simple whiteboarding tool (Ciszkowski suggests excalidraw, which I hadn’t seen before) or even just a screenshared google doc to record responses. That way people can visualize connections between ideas to trigger new ideas. Periodically restate your objectives to keep...

Product for Internal Platforms - Camille Fournier

Product for Internal Platforms - Camille Fournier This is an article written for tech companies about how easy it is to go off the rails developing the internally-used tech platform for developers. It holds a lot of lessons for research computing (software, systems, or data) though. The traps you can fall into are the same, because you are developing tools for a small, captive audience. It’s too easy to lose track of what a broad range of “customers” need to succeed: When platform teams build...

Technical discussions are hard; a few tips - Gaël Varoquaux

Technical discussions are hard; a few tips (http://gael-varoquaux.info/programming/technical-discussions-are-hard-a-few-tips.html#little-things-that-help) - Gaël Varoquaux The challenges of maintaining community software as seen by a well known neuroscience and machine learning software developer and manager at INRIA. Varoquaux discusses maintainer’s anxiety, contributor’s fatigue, and the difficulty of communication. Varoquaux also describes things he’s found that helped: Hear the other: exchange; Foster multiway discussions; Don’t seek victory; Convey ideas well: pedagogy; Cater for emotions: tone; Give your understanding.

New users generate more exceptions than existing users (in one dataset) - Derek Jones, The Shape Of Code

New users generate more exceptions than existing users (in one dataset) - Derek Jones, The Shape Of Code Not surprising for us in research computing but nice to have it validated with data: new users of software find new ways to trigger software faults. This is one of the reasons why the transitions that research software goes through (from being used by the creator to being used by friendly users, and then again to being used by a wider community) are so challenging...

Code handover techniques - hand over a mental model, not just code

7 practices you should follow for a successful code handover - Nicolas Carlo Programming as Theory Building - Diogo Felix These are interesting articles to read back to back. Nicolas Carlo has his usual pragmatic information about legacy code - in this case, avoiding code becoming legacy code by executing a handoff between an outgoing developer and a new one. The key ones, I think, are: New dev writes the docs, reviewed by old dev; Keep old dev engaged; Jointly write more tests to share...

Collectively architecting systems

Architecture Jams: a Collaborative Way of Designing Software - Gergely Orosz Proposals and Braintrusts - Nathan Broslawsky These two articles both describe approaches to usefully open up architectural or other proposals to input from a group. The first, an “Architecture Jam”, is sort of half-brainstorming, half-architectural white boarding session; it can work remotely, but is definitely synchronous. The second is more asynchronous - writing up a proposal, and sending it off to a group of people whose job is, explicitly, to improve the proposal. Either...

Evidence for the importance of research software - Michelle Barker, Daniel S. Katz, Alejandra Gonzalez-Beltran

Evidence for the importance of research software - Michelle Barker, Daniel S. Katz, Alejandra Gonzalez-Beltran A nice list of papers, talks, and other resources on the topic of the impact of research software. There’s also a continually updated Zotero group library and GitHub repository.

Today was a Good Day: The Daily Life of Software Developers - André N. Meyer, Earl T. Barr, Christian Bird, and Thomas Zimmermann

Today was a Good Day: The Daily Life of Software Developers - André N. Meyer, Earl T. Barr, Christian Bird, and Thomas Zimmermann Interesting study of how 5,971 software developers spend their day in general, and how they spend it on days they feel were good days and typical days; the idea is that this could be used to help managers give their developers more good days. It’s an interesting and short read. I walked away with two big points, but there are others in...

You Might Not Be Hearing Your Team's Best Ideas - Michael Parke and Elad N. Sherf, HBR

You Might Not Be Hearing Your Team’s Best Ideas - Michael Parke and Elad N. Sherf, HBR We’ve talked about the importance of disagreement and input before, and how important it is that people feel ok speaking up.  This is another article on the topic, and it breaks the steps down into managing what people are saying but also managing the silence, what people aren’t saying, which I think is a useful way to think about things.

Continue...

The Runbooks Project - Ian Miell

The Runbooks Project - Ian Miell In an effort to help get people started with runbooks for operations, Ian Miell of Container Solutions has started an open source set of runbooks, the Open Runbooks Project, starting with their own. Worth checking out as a set of templates, and keeping an eye on as more get added.

Continue...

Making Space to Disagree - Meg Douglas Howie

Making Space to Disagree - Meg Douglas Howie I know I keep hammering on this, but it’s such an important topic, and people keep writing good articles about it. In our line of work our team members are generally experts or becoming experts in various areas, and if they’re not comfortable speaking up and disagreeing — with each other, or maybe more importantly, with us — not only are you losing incredibly valuable input, you’re also running the risk of eventually losing them. There’s a...

Continue...

Mentored Sprints Community Handbook - Tania Allard and Cheuk Ting Ho

Mentored Sprints Community Handbook - Tania Allard and Cheuk Ting Ho This is really interesting. Is someone on your team working on a community software project and has been thinking about a (now virtual) hackathon or community sprint with other members of the community? This very detailed handbook discusses how to organize and run such an effort.

Continue...

The reasons for design documents

Design Documents at Google - Malte Ubl Code only says what it does - Marc Brooker Malte Ubl’s article is a nice overview of how design documents are done at Google, and how they are used - to communicate not only an end goal but the why’s - the context, the tradeoffs intentionally made in design, and the alternatives considered. As Marc Brooker’s article points out, code is great and can be “self documenting” at what things actually do, but not why they are done...

Continue...

What Predicts Software Developers’ Productivity?

What Predicts Software Developers’ Productivity? - Murphy-Hill, Jaspan, Sadowski, Shepherd, Phillips, Winter, Dolan, Smith & Jorden, Transactions on Software Engineering (2019) Interesting paper I just came across: we designed a survey that asked 622 developers across 3 companies about these productivity factors and about self-rated productivity. Our results suggest that the factors that most strongly correlate with self-rated productivity were non-technical factors, such as job enthusiasm, peer support for new ideas, and receiving useful feedback about job performance. Compared to other knowledge workers, our results...

Continue...

Testing And Scale - Daniel Bell

Testing And Scale - Daniel Bell This is a short read talking about the difference in the need for testing at the initial, exploratory phase of coding (where detailed testing is brittle and slows you down) as opposed to the stage of development where the code is being used for real things (where lack of detailed testing makes the codebase brittle because it can’t be easily and safely modified). This is particularly relevant to research software development, where I’ve argued a maturity model is a...

Continue...

Drawing good architecture diagrams - National Cyber Security Centre

Drawing good architecture diagrams - Toby W, (UK) National Cyber Security Centre A nice overview of drawing architecture diagrams. The article makes the point that the diagram is about communicating, and if it doesn’t communicate the key points of the system to the readers, then it’s not succeeding. I like this advice: Start with a basic high level concept diagram which provides a summary. Then create separate diagrams that use different lenses to zoom into the various parts of your system. Having multiple diagrams of...

Continue...

7 Ways Leaders Can Ask Better Questions - L. David Marquet

7 Ways Leaders Can Ask Better Questions - L. David Marquet One of the things I continue to have trouble with is remembering that as a manager my off-the-cuff remarks can sometimes be given an importance way out of proportion to what I had intended. In particular, questions from managers are incredibly powerful, and that cuts both ways - they can help show interest and help you learn things about your team members and their work, or they can cause a flurry of...

Continue...

Good, Less-painful, Postmortems

Improving Postmortems from Chores to Masterclass with Paul Osman - Blameless Theory vs. Practice: Learnings from a recent Hadoop incident - Sandhya Ramu and Vasanth Rajamani, LinkedIn Stuff happens, and when it does happen it’s a lot of work and stressful. We should at least take the opportunity to make the most of these “unplanned investments”, learn from them, and make use of those lessons to prevent related stuff from happening in the future. The talk and transcript by Paul Osman is a good one...

Continue...

Why write ADRs [Architecture Decision Records] - Eli Perkins, GitHub blog

Why write ADRs [Architecture Decision Records] - Eli Perkins, GitHub blog We’ve written before on the importance of recording the why’s of architecture decisions. Even the best self-documenting code or infrastructure can only describe how it works, which is different from why it was implemented this way rather than another. Without that context, it’s very difficult to know, when something changes, if the architecture should be reconsidered. Perkins does a good job in a short article describing three good classes of reasons why to write...

Continue...

Never Skip Retros - Tim Casasola, The Overlap

Never Skip Retros - Tim Casasola, The Overlap In his new newsletter, Casasola argues that one of the most fundamental team meetings you can have is the regular retrospective, because: They disrupt the habit of anticipating the future, They are low hanging fruit, and They put teams on the path to continuously improve. He goes on to suggest tools like Parabol and Fun Retrospectives to help with the retrospective process. This isn’t exclusively a software development (or even computing) practice; it’s widespread in project...

Continue...

Findings From the Field - Two Years of Following Incidents Closely - John Allspaw

Findings From the Field - Two Years of Following Incidents Closely - John Allspaw Incident handling is an area where research computing falls well behind best practices in technology or IT, partly because the implicitly lower SLAs haven’t pushed us to have the discipline around incidents that other sectors have had. And that’s a shame. There’s nothing wrong with having lower (say) uptime requirements if that’s the tradeoff appropriate for researcher use cases, but that doesn’t mean having no incident response protocol, no playbooks, no...

Continue...

Use a Pre-Mortem to Identify Project Risks Before They Occur - Mike Cohn

Use a Pre-Mortem to Identify Project Risks Before They Occur - Mike Cohn We’ve talked a lot about the importance of psychological safety in teams - making team members comfortable expressing their opinions, including raising issues. Without that, you’re missing important input and potentially running into foreseeable (and foreseen!) problems. Pre-mortems give explicit encouragement to raise issues. I’ve used these to good effect in some project-kickoff situations - trying to get the team to see obstacles ahead so they can be avoided. With pre-mortems, one...

Continue...

SRE Classroom: exercises for non-abstract large systems design - Google Cloud

SRE Classroom: exercises for non-abstract large systems design - Google Cloud Google, which is notoriously close-lipped about technology development in the company, is getting more and more open with their training materials. This is terrific, because Google takes training materials very seriously, and they’re quite good. In Google’s systems reliability practice, they emphasize large systems design and “back of the envelope” estimation approaches which will seem quite familiar to those of us who were trained in the physical sciences. They teach this approach with quite...

Continue...

A Software Development Life Cycle for Research Software Engineering - Kings Digital Lab

A Software Development Life Cycle for Research Software Engineering - Kings Digital Lab There was a really interesting SORSE talk this past week, Digital Humanities RSE: King’s Digital Lab as experiment and lifecycle by James Smithies and Arianna Ciula. The Digital lab, which hosts and maintains 160+ digital humanities projects, has a very nice lifecycle model for the research software development/hosting/maintenance efforts they get involved in, and they’ve generously made it, and templates for the documents at every step along the cycle, available to the...

Continue...

Alerting on SLOs - Mads Hartmann

Alerting on SLOs - Mads Hartmann Another recurring theme in this newsletter is that while research software development takes a lot of guff, it is often much closer to industry best practices than research computing systems management is. While there’s a lot of research software out there with version control, continuous integration testing, documentation, and disciplined release management, it’s much rarer to find research computing systems with crisply defined service level objectives (SLOs). And without SLOs it’s not possible to answer even...

Continue...

Learning from Postmortems in Hazardous Contexts

Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability - Thai Wood, Resilience Roundup Incident Reviews in High-Hazard Industries: Sense Making and Learning Under Ambiguity and Accountability - John S. Carroll, Industrial & Environmental Crisis Quarterly (1995) This is a recent blog post about a less recent paper, reviewing how incident reviews work in high-hazard industries like nuclear power. Whether the environment is life-critical or just inconvenient like a research cluster going down, a common incident review failure mechanism is...

Continue...

What you can do when code is really hard to review - Nicolas Carlo, Understand Legacy Code

What you can do when code is really hard to review - Nicolas Carlo, Understand Legacy Code One distinguishing feature of research software is that it’s often subtle. Subtlety combined with how often it is legacy code makes it difficult to follow, and makes changes doubly so. In this article Carlo describes some general principles for handling hard-to-review code changes, with the caveat that the hard to review changes are the ones that especially need review, both for QA purposes and for knowledge transfer: Focus...

Continue...

Limiting Work In Progress - Daniel Truemper

Limiting Work In Progress - Daniel Truemper A trap research computing managers fall into fairly frequently (including me) is seeing the big picture, seeing all the things that need to get done, and trying to start them all at once. After all, we know about parallel computing, a wider pipeline can mean higher throughput, right? But human beings don’t work like that. You get more done by diligently limiting the amount of work in progress, which has the advantage that it requires prioritization.
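One way to put rough numbers on this is Little's Law from queueing theory: over the long run, the average time an item spends in progress equals the average number of items in progress divided by the average completion rate. The sketch below is not taken from Truemper's article. The throughput figure and the function name are hypothetical, and it assumes a stable team whose completion rate does not change as more work is started (in practice, context switching usually makes things worse than this).

```python
def average_cycle_time(work_in_progress: float, throughput_per_week: float) -> float:
    """Little's Law: average weeks per item = items in progress / items finished per week."""
    if throughput_per_week <= 0:
        raise ValueError("throughput must be positive")
    return work_in_progress / throughput_per_week


# Hypothetical team that finishes about two tasks per week,
# regardless of how many tasks have been started.
team_throughput = 2.0

for wip in (2, 6, 12):
    weeks = average_cycle_time(wip, team_throughput)
    print(f"{wip:>2} tasks in progress: each takes about {weeks:.1f} weeks on average")
```

Under these assumptions, starting more things at once does not finish more of them per week; it only lengthens how long each one waits to be done.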

Continue...

A List of Post-Mortems - Dan Luu

A List of Post-Mortems - Dan Luu In research computing, when it comes to running systems we could be a lot closer to industry best practices than we are. We’ve talked about post-mortems more than once; here’s a list of postmortems from many companies collected by Luu. It’s nice to see that they don’t necessarily have to be long or complicated or intricate; like risk management, just simple documents for ongoing clarity can be a huge step forward.

Continue...

Getting big things done by being clear about what they are

Getting Big Things Done - Marc Brooker Architecture Decision Records - Upmo Brooker, who leads development on AWS’s Lambda product, writes about his approach to getting big things done and done well; his approach is outlined below: Is it the right solution? Is it the right problem? Engage with the doubters, but don’t let them get you down Meet the stakeholders where they are Build team(s) The builders The stakeholders Be willing to adapt This maps pretty straightforwardly to research computing work too. Key to...

Continue...

A set of Common Software Quality Assurance Baseline Criteria for Research Projects - Orviz, Lopez, Duma, David, Gomez, and Donvito

A set of Common Software Quality Assurance Baseline Criteria for Research Projects - Orviz, Lopez, Duma, David, Gomez, and Donvito Coming out of the EOSC Synergy effort, an extensive checklist of criteria for “production strength” research code, to be e.g. deployed as a service to communities in the INDIGO Data Cloud. The criteria are broken down into categories: Licensing Code Workflow Code Management Code Style Code Metadata Unit Testing Functional Testing Integration Testing Documentation Security Code Review Automated Deployment In most areas the actual recommendations...

Continue...

From Sysadmin to SRE - Josh Duffney, Octopus Deploy

From Sysadmin to SRE - Josh Duffney, Octopus Deploy As research computing becomes more complex, our systems teams are going to have more and more demands on them, moving them from sysadmins to systems reliability responsibilities, and working more closely with software development teams. It’s an easier transition for sysadmins in research computing than in most fields, as our teams generally have pretty deep experience on the software side of research computing too. Duffney’s article lays out how to start thinking about these changes to...

Continue...

Write Five, Then Synthesize: Good Engineering Strategy Is Boring - Will Larson

Write Five, Then Synthesize: Good Engineering Strategy Is Boring - Will Larson Focus enables strategy - not only what you’ll be doing, but how you’ll be doing it. Developing a software development strategy for a team allows you to focus on the important parts of each project rather than bikeshedding the same decisions again and again. You can’t develop such a strategy for executing projects if each project is completely different. Larson’s article is an argument in favour of grounding such a strategy in the...

Continue...

High level overview of how Australian Research Data Commons is viewing Research Software as a First Class Object - Tom Honeyman on Twitter

High level overview of how Australian Research Data Commons is viewing Research Software as a First Class Object - Tom Honeyman on Twitter This is a really interesting diagram of how ARDC is thinking of research software: “Here's a preview of what we're thinking (high level) for a national agenda for #researchsoftware as a first class object @ARDC_AU. Feedback welcome pic.twitter.com/XtfwhK48DN” (Tom Honeyman, @TomHoneyman3, November 30, 2020) The approach is I think the right one, and one I’ve advocated before; taking a path-to-maturity model approach,...

Continue...

How to Make Your Code Reviewer Fall in Love with You - Michael Lynch

How to Make Your Code Reviewer Fall in Love with You - Michael Lynch A nice article outlining how to write PRs to make them as easy to review as possible - making them easier to approve. Good for individuals working on open source projects and for teams working together. There are 13 steps there, but several I think deserve calling out: Review your own code first - go through the code with a reviewer’s eyes Answer questions with the code itself - if questions come...

Continue...

How to apologize for server outages and keep users happy - Adam Fowler, Tech Target

How to apologize for server outages and keep users happy - Adam Fowler, Tech Target When AWS has an outage, it’s in the news and they publish public retrospectives (and here’s a great blog post of the retrospective of the Kinesis incident this week). Our downtimes and failures don’t make the news, but we owe at least that same level of transparency and communication to our researchers. The technical details will differ from case to case. But what’s also needed is an apology, and some...

Continue...

Tech Lead Management roles are a trap. - Will Larson

Tech Lead Management roles are a trap. - Will Larson When I was asked at my SORSE talk if it was possible to be both lead developer and manager, I replied that anything was possible but it is really, really hard. The most stressed I’ve been in the last couple of years was when I’ve had both significant technical and managerial responsibilities - they are completely different skillsets requiring your mind to be in different kinds of places. Bouncing between the two is definitely playing...

Continue...

Be A Good Product Owner, Say No To Things

The 10 Attitudes of Outstanding Product Owners - David Pereira Tactfully rejecting feature requests - Andrew Quan Because of the funding structure of research our training has taught us to think in terms of projects, but in research computing we’re mainly managing products - long lived things that other people use, and don’t typically have clear start or end dates. That means thinking in terms of differentiation, strategy, speeding the learning process, priorities, and alignment, rather than or at least in addition to thinking of...

Continue...

The Case for ‘Center Class’ HPC: Think Tank Calls for $10B Fed Funding over Five Years

The Case for ‘Center Class’ HPC: Think Tank Calls for $10B Fed Funding over Five Years For those who haven’t seen the Centre for Data Innovation’s report advocating tripling NSF’s funding for university HPC centres, the report and the arguments therein may be useful for your own internal advocacy efforts.

Continue...

Maximizing Developer Effectiveness - Tim Cochran

Maximizing Developer Effectiveness - Tim Cochran This is aimed at software developers, but much of it would apply just as easily to those running systems or curating research data. Team members are effective if they’re quickly and frequently getting feedback - did this change work, does this solution meet the requestor’s needs - and not waiting for things or having their day chopped up into little pieces. That means as managers it’s important to make sure we have the tooling and processes in place to...

Continue...

Two Kinds of Code Review - Aleksey Kladov

Two Kinds of Code Review - Aleksey Kladov This is another good article of a number we’ve seen here on the topic of code review as asynchronous pair programming, a way of sharing knowledge both ways - about the code itself but also about expectations and goals of the team. From the article: “One goal of a review process is good code.” “Another goal of a review is good coders.”

Continue...

Managing technical quality in a codebase - Will Larson

Managing technical quality in a codebase - Will Larson This article is about the steps in improving code quality over time from an initial messy code base; the idea is marching up a ladder, solving increasingly high-level issues. This is particularly relevant for research software development. Successful research software marches up a technical readiness/maturity ladder from proof of concept to prototype to community use to production research infrastructure. As code marches up that ladder, the tradeoffs change, and the needs for code quality change with...

Continue...

SLO — From Nothing to… Production - Ioannis Georgoulas

SLO — From Nothing to… Production - Ioannis Georgoulas We’ve talked about Service Level Indicators/Objectives/Agreements (SLI/SLO/SLA) in the past as ways to focus operations effort in ways that are visible to users. Service Level here often means “availability” under some specific measure (the indicator) but it could just as easily be a wait time (jobs in the queue, emails awaiting responses, waiting list for training), disk space, or almost anything else (time until a new user successfully runs a nontrivial job?). The indicators are the...
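To make the indicator/objective distinction concrete, here is a minimal sketch for the queue wait time case mentioned above. It is not from Georgoulas's article: the `SLO` class, the function names, the 24-hour threshold, the 90% target, and the wait-time data are all hypothetical, chosen only for illustration. The indicator is the measured fraction of jobs that started within the threshold; the objective is the target for that fraction; the error budget is how many misses the objective allows.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A hypothetical service level objective: a target fraction of 'good' events."""
    name: str
    target: float  # e.g. 0.90 means 90% of events must be good


def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the fraction of events that met the criterion."""
    if total_events == 0:
        return 1.0  # no events in the window, so nothing was violated
    return good_events / total_events


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Made-up queue wait times, in hours, for the jobs in one reporting window.
wait_hours = [0.5, 2.0, 30.0, 1.0, 12.0, 0.1, 26.0, 3.0, 4.0, 8.0]
queue_slo = SLO(name="jobs start within 24 hours", target=0.90)

good = sum(1 for w in wait_hours if w <= 24.0)
current_sli = sli(good, len(wait_hours))
budget = error_budget_remaining(queue_slo, good, len(wait_hours))

print(f"SLI: {current_sli:.0%} of jobs started within 24 hours (target {queue_slo.target:.0%})")
print(f"Error budget remaining: {budget:.0%}")
```

With this made-up data the indicator falls short of the objective and the budget is overspent, which is the kind of signal that could prompt a conversation about priorities. The same structure would work for any of the other indicators listed, such as email response times or training wait lists.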

Continue...

How To Feel Productive As a New Manager / Tech Lead

Questionable Advice: “How do I feel Worthwhile as a Manager when My People are Doing all the Implementing?” - Charity Majors The Non-psychopath’s Guide to Managing an Open-source Project - Kode Vicious, ACM Queue Majors’ article is a good reminder for new managers that it’s really hard to recalibrate job satisfaction or the feeling of accomplishment when you’ve moved into management. All you can do is focus on the big, long timeline stuff while still taking joy in the little moments, and make sure that...

Continue...

Strengths, weaknesses, opportunities, and threats facing the GNU Autotools - Zachary Weinberg

Strengths, weaknesses, opportunities, and threats facing the GNU Autotools - Zachary Weinberg Another very transparent product-focused assessment; a simple but thorough SWOT analysis of the current GNU Autotools stack, which hasn’t been updated in some time (which itself makes the updates harder since the entire process is “rusty”), and which has enormous legacy baggage, but still has opportunities.

Continue...

Creating a Risk-Adjusted Backlog - Mike Griffiths

Creating a Risk-Adjusted Backlog - Mike Griffiths Here’s an example of a concept that I think research software development teams probably “get”, if implicitly, more than teams in other environments. Research software development spends much more time further down the technology readiness ladder; we spend a lot more time asking the question “can this even work” than we do “when will this feature ship”. The risks are higher, because most promising research ideas simply don’t pan out. So we spend a lot of time prototyping,...

Continue...

Open Source Update: School of Software Reliability Engineering (SRE) - LinkedIn Engineering

Open Source Update: School of Software Reliability Engineering (SRE) - LinkedIn Engineering LinkedIn has updated its School of SRE materials for new hires or those looking to move into SRE. Even if your systems team isn’t thinking about moving to FAANG-style SRE operations, the basics covered in the material cover a nice range of dev-ops style development, deployment, design, monitoring, and securing of web applications.

Continue...

Why it's important to make code understandable

Developers spend most of their time figuring the system out - Tudor Girba, feenk Writing good code by understanding cognitive load - David Whitney ARCHITECTURE.md - Aleksey Kladov Girba points us to a recent article: Xia, Bao, Lo, Xing, Hassan, & Li (2018), Measuring Program Comprehension: A Large-Scale Field Study with Professionals, IEEE Transactions on Software Engineering that looked at 78 professional developers during over 3000 hours of their work and found that 58% of their time was taken up by comprehending a code base;...

Continue...

Estimating your way to success - Rod Begbie, LeadDev

Estimating your way to success - Rod Begbie, LeadDev Estimating gets a bad rep because our estimates… aren’t very good. The future isn’t knowable! But Begbie reminds us that the purpose of estimation isn’t to get perfect duration predictions but to structure initial conversations about what is to be done and what needs to be done to get there; and then to learn from the estimates to do better the next time. Begbie’s estimation rules are to keep tasks’ estimated duration between a half and...

Continue...

The SPACE of Developer Productivity - Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler, ACM Queue

The SPACE of Developer Productivity - Nicole Forsgren, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler, ACM Queue We’ve covered several times the challenges of measuring developer productivity, particularly individual developer productivity. Forsgren et al. walk us through recent literature on the subject, disabusing us of some common myths and encouraging us to instead, as managers of developers, keep an eye on the SPACE dimensions of how well our team is doing: Satisfaction and well-being - employee satisfaction, developers having the tools...

Continue...

The resilience of mixed seniority engineering teams - Tito Sarrionandia

The resilience of mixed seniority engineering teams - Tito Sarrionandia An ongoing if unintended theme of the newsletter is that when managing teams, many useful things - like everything involved in having the team move to distributed work-from-home, giving feedback, having quarterly goal-setting - come down to making things more explicit. That requires a lot of up front work, more documentation, change of processes, and a little more discomfort for the manager initially - but then makes a lot of other things better and easier...

Continue...

The Zero-prep Postmortem: How to run your first incident postmortem with no preparation - Jonathan Hall

The Zero-prep Postmortem: How to run your first incident postmortem with no preparation - Jonathan Hall It’s never too late to start running postmortems on your systems when something goes wrong. It doesn’t have to be an advanced practice, or super complicated. Hall provides a script for your first couple. I’d suggest that once you have the basic approach down, move away from “root causes” and “mitigations” and more towards “lessons learned”. Those lessons learned can be about the postmortem process itself, too; you can...

Continue...

Pull Requests vs (are?) Pair Programming

Those pesky pull request reviews - Jessica Joy Kerr Can pair programming replace code review? - Jonathan Hall Kerr’s blog post in late March kicked off a series of posts in the software-dev blogosphere on whether we should still be doing pull request reviews. There are too many posts to list here, but these two, by RCT roundup regulars, cover much of the range of views. Kerr’s pretty firmly on team get-rid-of-’em. My summary of her argument: There’s a reason why no one likes getting or giving...

Continue...

Having a Healthy Pull Request Process for Teams - Alex Kitchens

Having a Healthy Pull Request Process for Teams - Alex Kitchens This is a longer read on setting up a pull request process, both for the authors of the PR and the reviewers. Other processes could be healthy too, but any healthy process will have clear and explicit expectations. Kitchens spells out the responsibilities for an author - they fall under making PRs easier to review: Make the PR Description Clear and Digestible Explain Unexpected Changes Keep the Size of Pull Requests Small (When Possible)...

Continue...

Manageable On-Call for Companies without Money Printers - Utsav Shah, Software at Scale

Manageable On-Call for Companies without Money Printers - Utsav Shah, Software at Scale A lot of the information out there about running on-call, or more advanced practices like SRE, assumes that you’re a large organization with 24/7 uptime targets. These can apply to research computing, but more often don’t. Teams sometimes respond to the inability to have 5-nines uptime support and 24/7 oncall with a shrug and just keep things vague; “will respond promptly during working hours, with best-effort responses outside of those times”. But that...

Continue...

Pair programming basics

Dos and Don’ts of Pair Programming - Study Suggests Togetherness and Expediency for Good Sessions - Bruno Couriol, InfoQ Two Elements of Pair Programming Skill - Franz Zieris, Lutz Prechelt, arXiv:2102.06460 Couriol has a good summary of the work of Zieris and Prechelt on pair programming. That work, which was accepted to ICSE 2021, looks at two features which they claim determine whether pair programming succeeds as a practice: a combination of “togetherness” (whether the pair can successfully establish and maintain a common mental model...

Continue...

TinyBird Tech Test - Javi Santana, TinyBird

TinyBird Tech Test - Javi Santana, TinyBird With a clear understanding of a role, it’s much easier to understand how to evaluate against that profile when you’re interviewing. Santana provides one real take-home problem they use at TinyBird, a company that builds real-time data processing tools. It involves writing up how you would solve a data ingest-plus-expose-an-API problem, and describes the rubric they use to evaluate the answers (it’s almost all about the communications, not the technical beauty of the proposed solution).

Continue...

Codes of Conducts for Open Source Projects - Not Optional

Open Source Communities Need More Safe Spaces and Codes of Conducts. Now. - Jennifer Riggins, The New Stack Codes of conduct in Open Source Software—for warm and fuzzy feelings or equality in community? - Vandana Singh, Brice Bongiovanni, William Brandon, Software Quality Journal Riggins walks us through the need for codes of conduct for open source projects, pointing out the rather shocking statistic that women make up less than 3% of open source communities, and that this has been stagnant for two decades. Between higher...

Continue...

Counterfactuals are not Causality - Michael Nygard

Counterfactuals are not Causality - Michael Nygard When you’re digging into the (likely multiple) causes of a failure, Nygard reminds us that things that didn’t happen can’t, necessarily, be the cause of something. To steal an example from the post, “The admin did not configure file purging” is not a cause. It can suggest future mitigations or useful lessons learned, as “we should ensure that file purging is configured by default”, but looking for things that didn’t happen is a way for blame to sneak in and takes our eyes off of the system that...

Continue...

Easy Guide to Remote Pair Programming - Adrian Bolboacă, InfoQ

Easy Guide to Remote Pair Programming - Adrian Bolboacă, InfoQ Bolboacă walks us through the how and why of remote pair programming, and InfoQ helpfully provides key takeaways (quoted verbatim below): Remote pair programming can be an extremely powerful tool if implemented well, in the context where it fits. You need to assess your current organization, technical context, and the time needed to absorb change before rushing into using remote pair programming. There are useful sets of questions for that. Social programming means learning easier...

Continue...

Guiding critical projects without micromanaging - Camille Fournier

Guiding critical projects without micromanaging - Camille Fournier However, as a senior manager, at some point you can make it harder for your managers to succeed when you give them very little structure to work with. It’s tempting to say “I don’t care how you do any of it as long as it gets done.” But that doesn’t help people figure out what is important to you, so they have to guess at what they share, when, and how. It’s tough to strike a balance...

Continue...

Focus: assign multiple engineers to the same task - Dawid Ciężarkiewicz

Focus: assign multiple engineers to the same task - Dawid Ciężarkiewicz We’ve talked here quite a bit - starting way back in #13 - about pull requests as asynchronous pair programming, and the benefits of pair programming - not merely for quality control but for knowledge sharing in both directions. In this thought-provoking article, Ciężarkiewicz argues in favour of routinely having two (or more!) team members assigned to a task, so that rather than a code review at the back - or even before pair...

Continue...

It’s Time to Rethink Outage Reports - Geoff Huston, CircleID

It’s Time to Rethink Outage Reports - Geoff Huston, CircleID Increasingly, private-sector systems provide their users detailed explanations of the reasons for unexpected outages, what was done to get things back up, and what they’re changing to prevent it from happening again. As part of incident response, we should be routinely writing up similar reports for internal use, so that we can learn from what happened. With that done, it makes no sense to then keep our users in the dark! Most users won’t care...

Continue...

Minimum Viable Governance: lightweight community structure to grow your FOSS project - Justin Colannino

Minimum Viable Governance: lightweight community structure to grow your FOSS project - Justin Colannino Growing a community around an open source research software effort to the point that there are external maintainers is a sign of huge success - but it makes things way more complicated. It’s a pain to be the sole maintainer, but at least there’s clarity in decision making. Here Colannino describes the “Minimum Viable Governance” (MVG) set of template documents for bootstrapping a real open source governance framework. Some areas -...

Continue...

Software Development Waste - Todd Sedano, Paul Ralph & Cécile Péraire, ICSE 2017

Software Development Waste - Greg Wilson, It Will Never Work in Theory Software Development Waste - Todd Sedano, Paul Ralph & Cécile Péraire, ICSE 2017 Wilson briefly summarizes a paper by Sedano, Ralph, and Péraire, who looked at eight software development projects at Pivotal, a software development company, over two years and five months, interviewed team members, and analyzed retrospectives. They identified nine broad categories of wasted time and/or effort in the projects: Building the wrong feature or product Mismanaging the backlog Having to re-do...

Continue...

The culture of process - Cate Huston

The culture of process - Cate Huston The defining transition between hobbyist and professional, between someone in research who codes a little or does a bit of sysadminning and running a professional team providing research computing and data services, is that you no longer just focus on mean quality but also variance. You’re no longer trying to just get good results, but consistently good results. That means, painful though it might be, introducing some process. Huston has a few ways to think about process as,...

Continue...

Making World-class Docs Takes Effort - Daniel Stenberg

Making World-class Docs Takes Effort - Daniel Stenberg Documentation is incredibly important for a product’s adoption and use - whether the tool is software, data products, systems, or (increasingly) a combination of the three. It takes a lot of work, but that work pays off later with more adoption and less support effort per user. Stenberg highlights what he’s found to be important for documentation: that it be: Stored with the code, for convenience and so updates are kept in sync, but Not generated from...

Continue...

Demo-driven development - Jade Rubick

Demo-driven development - Jade Rubick From earlier in the year - Rubick describes his method for introducing stories and planning into software development, by starting with routine demos, backing from that into introduction of user stories by structuring the plans for the next demo, and then from there moving out into routine planning. What’s nice about this is that it keeps the important thing - always be delivering something useful - in the forefront.

Continue...

What Makes a Good Changelog - Zeno Rocha, Herbert Lui, WorkOS

What Makes a Good Changelog - Zeno Rocha, Herbert Lui, WorkOS A very pragmatic article about documentation for developers. The most important thing about changelogs is that they exist. And the easiest way to ensure that’s done is to have simple, clear, and non-onerous expectations of what they should look like. Rocha and Lui specify: they should be clear; in images, highlight changes; spotlight the people behind the product; use consistent formatting of versions and dates; and teams should dedicate real technical staff time to them.
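One way to keep the "consistent formatting of versions and dates" expectation non-onerous is to check it mechanically. The sketch below is not from the article: it assumes a hypothetical heading convention of `## <major>.<minor>.<patch> - <YYYY-MM-DD>`, and the function name is made up for illustration.

```python
import re
from datetime import date

# Hypothetical convention (an assumption, not from the article):
# every release heading looks like "## 1.4.0 - 2022-05-17".
HEADING = re.compile(r"^## (\d+)\.(\d+)\.(\d+) - (\d{4})-(\d{2})-(\d{2})$")


def inconsistent_headings(changelog_text):
    """Return the release headings that do not match the assumed convention."""
    bad = []
    for line in changelog_text.splitlines():
        if not line.startswith("## "):
            continue  # only release headings are checked
        match = HEADING.match(line)
        if match is None:
            bad.append(line)
            continue
        year, month, day = (int(g) for g in match.groups()[3:])
        try:
            date(year, month, day)  # rejects impossible dates such as 2022-13-40
        except ValueError:
            bad.append(line)
    return bad


example = """# Changelog
## 1.4.0 - 2022-05-17
- Added export to CSV
## v1.3 (May 2nd)
- Fixed login bug
"""
print(inconsistent_headings(example))
```

A check like this could run in CI, so that the expectation is enforced without anyone having to police it by hand.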

Continue...

Dataset data sheets

Datasheets for Datasets Template - Audrey Beard Datasheets for Datasets - Timnit Gebru et al, arXiv:1803.09010 Beard provides a LaTeX template for Gebru et al’s suggested “Datasheets for Datasets”, a human-readable, high-level description of a dataset - not a data dictionary, but describing the reason the dataset exists, how data was collected, what preprocessing/cleaning/labeling was done if any, how or if maintenance will be done, what uses the dataset has been put to, and more.
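To make the shape of a datasheet concrete, here is a minimal sketch in Python rather than LaTeX. The field names follow the section headings in Gebru et al’s paper; the class, the `missing_sections` helper, and the example dataset are hypothetical illustrations, not part of Beard’s template.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Free-text answers, one field per section heading in Gebru et al."""
    motivation: str = ""          # why the dataset exists
    composition: str = ""         # what the instances are
    collection_process: str = ""  # how the data was collected
    preprocessing: str = ""       # cleaning / labeling done, if any
    uses: str = ""                # what the dataset has been used for
    distribution: str = ""        # how it is shared
    maintenance: str = ""         # how, or if, it will be maintained

    def missing_sections(self):
        """Names of sections that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical dataset, for illustration only.
sheet = Datasheet(
    motivation="Benchmark for a campus air-quality model.",
    collection_process="Hourly readings from rooftop sensors.",
)
print(sheet.missing_sections())
```

Treating blank sections as something you can list makes it easy to see at a glance how much of the description is still unwritten.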

Continue...

Ship / Show / Ask - Rouan Wilsenach

Ship / Show / Ask - Rouan Wilsenach We’ve talked about pre-commit vs post-commit reviews in #34 - post-commit being something of an alternative to PR review. Changes that pass CI testing get committed, so that developers aren’t blocked by waiting for review, and commits are reviewed later. (Obviously this incentivizes a large test suite!) Wilsenach suggests that you don’t have to have a culture where it’s either/or. In the “Ship/Show/Ask” model, changes can be simply made without review (Ship) or post-commit review (or at...

Continue...

Low-code contributions through GitHub - Isabela Presedo-Floyd, Mars Lee, Melissa Weber Mendonça, Tony Fast, Quansight Labs

Low-code contributions through GitHub - Isabela Presedo-Floyd, Mars Lee, Melissa Weber Mendonça, Tony Fast, Quansight Labs Interesting experience getting people who wouldn’t normally code to make contributions to a project via GitHub. In this case, the effort was around alt text for images (including scientific diagrams!) for a project, based on pull requests, but I could imagine it working well for documentation, sourcing diagrams, or other contributions. The team’s process was: pre-meeting preparation with a project contributor and meeting facilitator a crash course in the...

Continue...

Guides for Managers - Software Sustainability Institute

Guides for Managers - Software Sustainability Institute This is a resource I hadn’t seen until Better Scientific Software pointed it out - a collection of guides for research software development managers, including starting and improving a community for your product, recruiting a champion or student developers, funding software and developers, and more. The guides are short and come with links to other resources. They take a “focus on the basics” approach that readers of this newsletter would likely appreciate. Overlapping sets of guides for researchers,...

Continue...

Senior level RSE career paths (with an s) - Daniel S. Katz, Kenton McHenry, Jong S. Lee

Senior level RSE career paths (with an s) - Daniel S. Katz, Kenton McHenry, Jong S. Lee In the spirit of Schmitz et al.’s call for a career path for RCD individual contributors, Katz, McHenry, and Lee describe a career progression for research software developers, starting with associate, staff, then senior research software engineer (RSE). Then there’s a bit of a step change to Lead, which I think is pretty well described here: Some of these roles can include some mentoring and leadership, and at...

Continue...

Incident Review and Postmortem Best Practices - Gergely Orosz

Incident Review and Postmortem Best Practices - Gergely Orosz If your team is thinking of starting incident reviews & postmortems - which I recommend if relevant to your work - this is a good place to start. Orosz reports on a survey and discussions with 60+ teams doing incident responses, and finds that most have a pretty common pattern: An outage is detected An outage is declared The incident is being mitigated The incident has been mitigated Decompression period (often comparatively short) Incident analysis /...

Continue...

Well-researched advice on software team productivity - Ari-Pekka Koponen, Swarmia

Well-researched advice on software team productivity - Ari-Pekka Koponen, Swarmia Management is hard, management of something as complex and ambiguous as software development is especially hard, but that doesn’t mean we don’t know anything. There has been a lot of research on what works for making teams work well, and recently particularly in the area of software development. It doesn’t mean there are cookie-cutter solutions for anything, but we do have good guidelines. Koponen walks us through several well-supported (and in some cases ongoing) reports,...

Continue...

Five-P factors for root cause analysis - Lydia Leong

Five-P factors for root cause analysis - Lydia Leong Rather than “root cause analysis” or “five why’s”, both of which have long since fallen out of favour in areas that take incident analysis seriously like aerospace or health care, Leong suggests that we look at Macneil’s Five P factors from medicine: Presenting problem Precipitating factors - what combination of things triggered the incident? Perpetuating factors - what things kept the incident going, made it worse, or harder to handle? Predisposing factors - what long-standing things...

Continue...

Focus on Maintainability, not "Tech Debt"

Reframing tech debt - Leemay Nassery, Increment A Rubric for Evaluating Team Members’ Contributions to a Maintainable Code Base - Chelsea Troy Once a software product is high enough on the technical readiness ladder - once it’s actually being used by communities - technical debt becomes an issue. The problem isn’t awareness - we all know code should be maintainable and well documented, etc. - the issue is the people systems to support individual developers in deciding to put time into activities that support that....

Continue...

A guide to quarterly planning (plus a template) - Nicole Kahansky, Hypercontext

A guide to quarterly planning (plus a template) - Nicole Kahansky, Hypercontext Kahansky gives an outline for a quarterly planning meeting. Quarterly is an excellent cadence for planning (and even performance reviews) for a lot of research computing teams; long enough between meetings that meaningful amounts of work can be done, but short enough to be able to react to our always-changing environment and needs. Kahansky outlines a five-point agenda: Retrospective on last quarter Brainstorm on what could be done to make a significant difference...

Continue...

DevOps in academic research - by Matthew Segal

DevOps in academic research - by Matthew Segal Here Segal, who worked for 18 months as a “Research DevOps Specialist”, talks about his work in moving a 20kloc MCMC Python modelling package for infectious disease models, in a development and systems environment that wasn’t prepared for the sudden urgency and rapid release cycles that were needed when COVID broke out. There were no tests, making development slow. A lot of manual toil was involved in calibrating updated models, which was fine when they were for...

Continue...

OOPS writeups - Lorin Hochstein

OOPS writeups - Lorin Hochstein Hochstein gives the outline and an explanation as to how his team at Netflix writes up “OOPS” reports, essentially incidents that didn’t rise to the level of Incident Response, as a way of learning and sharing knowledge about things that can go wrong in their systems. It’s a nice article and provides a light-weight model to potentially use. His outline, blasted verbatim from the article, is below. I particularly like the sections on contributors/enablers and Mitigators as things that didn’t...

Continue...

Get small things done continually

Great engineering teams focus on milestones instead of projects - Jade Rubick Scatter-Gather - Tim Ottinger One recurring issue with research computing is that we typically get funded for projects, but we’re really building products - tools, outputs, and expertise that will (hopefully) outlast any particular project. For different reasons, Rubick strongly recommends that your team focusses on milestones rather than projects, but this change in focus can serve as an intermediate stepping stone between project-based thinking and product-based thinking. He recommends defining progress in...

Continue...

Publication of the Trusted CI Guide to Securing Scientific Software - Trusted CI

Publication of the Trusted CI Guide to Securing Scientific Software - Trusted CI The Trusted CyberInfrastructure project has released its report, and now a guide, on securing scientific software - and to some extent the systems it runs on. The guide covers the usual topics, but with specific focus on scientific computing: “social engineering”, classic software exploits such as you’d see on OWASP’s top 10 (injection attacks, buffer overflows, improper use of permissions, brute force, software supply chain) and network attacks (replays, passwords, sniffing), and gives...

Continue...

How to learn after an incident

Howie: The Post-Incident Guide - Jeli How to Write Meaningful Retrospectives - Emily Arnott, Blameless The key to getting better, individually or as a team, is to pay attention to how things go, and continue doing the things that lead to good results, while changing things that lead to bad results. Pretty simple, right? And yet we really don’t like to do this. Whether your teams run systems, develop software, curate data resources, or combinations of the three, sometimes things are going to go really...

Continue...

RSE Group Evidence Bank - UK RSE

RSE Group Evidence Bank - UK RSE This is an interesting collection of job titles and descriptions from a number of UK RSE groups for job levelling (junior/RSE/senior/head of RSE), plus some articles on setting up RSE or data science institutes. Very interesting if you’re thinking of starting an RSE group. Hopefully it continues to grow.

Continue...

Understanding wait time versus utilization - from reading Phoenix Project - Zhiqiang Qiao

Understanding wait time versus utilization - from reading Phoenix Project - Zhiqiang Qiao Every so often I see technologists rediscover a very widely known result in operations research - introductory textbook stuff, really. Wait times (or other bad behaviour) start rocketing upwards once we get to high (somewhere between 80% - 90%) utilization. You see this in equipment, and teams, of course, too. Teams, whether they’re cash registers or software developers, start getting into trouble at sustained high “utilization rates”, e.g. overwork. And yet, a...

Continue...
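As a rough illustration of the wait-time entry above: The Phoenix Project uses the rule of thumb that wait time is proportional to the percentage of time a resource is busy divided by the percentage of time it is idle. The sketch below just tabulates that ratio; the function name is made up, and the output is in arbitrary "units of wait" rather than real time.

```python
def relative_wait(utilization):
    """Busy fraction divided by idle fraction (the Phoenix Project rule of thumb)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)


for pct in (50, 80, 90, 95, 99):
    print(f"{pct}% busy -> {relative_wait(pct / 100):.0f} units of wait")
```

Going from 80% to 90% busy more than doubles the relative wait (from about 4 to about 9), which is the "rocketing upwards" behaviour described above.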

The 18F Hiring Process: A Nice Example

I really like this documented hiring process by 18F in the US Government. It’s a well thought out process, and it’s written in a way that you could send to candidates so they know exactly what to expect. It’s even in GitHub. I also really like their technical pre-work - it’s either to provide some code they’ve worked on, or to do one of four exercises. The exercises are simple but non-trivial get-and-process-data exercises that would give a lot more confidence about ability to do...

Continue...

Research Software Capability in Australia - Michelle Barker and Markus Buchhorn

Survey reveals 6000+ people develop and maintain vital research software for Australian research - Jo Savill, Australian Research Data Commons (ARDC) Research Software Capability in Australia - Michelle Barker and Markus Buchhorn Interesting results from a late-2021 ARDC survey on research software capability, of 70 managers of Australian research computing and data groups. Results were scaled to try to give an estimate of all-of-Australia numbers. The article by Savill gives an overview, and the full report by Barker and Buchhorn is interesting reading. Some key...

Continue...

Developing a modern data workflow for regularly updated data - Glenda M. Yenni *et al*, PLOS Biology

Developing a modern data workflow for regularly updated data - Glenda M. Yenni et al, PLOS Biology Updating Data Recipe - Ethan White, Albert Kim, and Glenda M. Yenni This one’s a couple years old, and I’m surprised I hadn’t seen it before. It’s getting easy to find good examples for scientists of getting started with GitHub, and then to CI/CD, for code. But for data it’s much harder. And there’s no reason why experimental data shouldn’t benefit from versioning, and analysis pipeline CI/CD that...

Continue...

The Boring Technology Checklist - Brian Leroux

The Boring Technology Checklist - Brian Leroux Is the technology you use boring enough? I really like Dan McKinley’s 2015 talk, Choose Boring Technology, especially the bit where he recommends frugally and reluctantly allocating “innovation tokens” to use in part of a solution. Using shiny newness is expensive. It means constantly fighting against the unknown and solving problems you didn’t know you were going to have. It’s swimming upstream. This is especially true in research computing! The researchers are solving a new problem, using a...

Continue...

6 ways staff engineers help reach clarity - Alex Ewerlöf

6 ways staff engineers help reach clarity - Alex Ewerlöf Being at the Staff/Principal level doesn’t mean knowing everything. Ewerlöf describes a number of other roles they can play in helping people find answers, with “knowing the answer” being probably the least valuable case: The Go-To: you have the answer The Rubber ducky: you’re the coach/mentor that helps them answer their own question The Catalyst: you know the people who have pieces of the answer The Detective: you know how to find the answer The Communicator:...

Continue...

Building an SRE Career Progression Framework - Ethan Motion

Building an SRE Career Progression Framework - Ethan Motion Whether it’s for research software, systems, data management, or data science, a lot of groups are trying to figure out formal or informal career progression pathways for individual contributors. As a manager, you can work with individuals in their one-on-ones to find out where they are interested in and ready to grow, and give them opportunities at that intersection. But how do you start thinking about career progression at the whole-team or multi-team level? Motion describes...

Continue...

How I think about Code Management - Andreas Klinger

How I think about Code Management - Andreas Klinger A lot of research software we start dealing with… well, let’s say “has many opportunities to be made even better”. Klinger has a nice summary of maintaining and improving a code base over time. He sees it as having two components: Reducing complexity, and Increasing confidence And that both of those can and should be addressed incrementally and continuously. Klinger says that you handle the code complexity over time with refactoring (including my favourite refactoring, deleting...

Continue...

The pushback effects of race, ethnicity, gender, and age in code review - Emerson Murphy-Hill, Ciera Jaspan, Carolyn Egelman, Lan Cheng, *Comm ACM* 2022

The pushback effects of race, ethnicity, gender, and age in code review - Emerson Murphy-Hill, Ciera Jaspan, Carolyn Egelman, Lan Cheng, Comm ACM 2022 When we’re assessing the technical merits of a code contribution, and by extension assessing letters of reference etc about a candidate’s technical merit, we need to be aware of these effects - non-white, non-male, and older colleagues get significantly higher pushback for PRs, controlling for number of lines changed, readability, and other effects.

Continue...

Before and after an incident

Incident management best practices: before the incident - Robert Ross Incident Analysis 101: Techniques for Sharing Incident Findings - Vanessa Huerta Granda You’ll know, gentle reader, that I’m a big proponent of learning from incidents, and sharing them with researchers who after all deserve to know why they couldn’t do their work for some period of time. Here’s a pair of good articles about preparing for an incident, and putting together and sharing the incident report afterwards. In the first article, Ross talks about clarifying...

Continue...

Making operational work more visible - Lorin Hochstein

Making operational work more visible - Lorin Hochstein In the f-string failure article in software development, I pointed out that log and error handling code was under-reviewed and tested. There’s probably a bigger lesson one can take from that on the undervaluing of supporting or glue or infrastructure work compared to “core” work. And sure enough, one of the huge downsides of operations work is that when everything goes well, it’s invisible. Above, Granda walks us through writing up an incident report and sharing it...

Continue...

How to run a Retrospective - Chase Seibert

How to run a Retrospective - Chase Seibert Seibert writes this in the context of sprints, but this short and solid how-to for running retrospectives applies to any project. (A sprint is just a mini-project, after all - it has well-defined objectives, along with a beginning, middle, and end). Seibert probably feels that actually following up on the retrospective is out of scope of an article on how to run the retrospective meeting, which is fair. Don’t take that as a sign that the...

Continue...

Test Suites Are Part of the Product

I just threw away a bunch of tests - Jonathan Hall The evolution of the SciPy developer CLI - Sayantika Banik, Quansight Labs Related (but not limited) to ease of developer onboarding - I was just having this conversation with a friend. Test suites are code, too, and part of your product - they’re not some weird kind of “meta-code” for which the usual rules don’t apply. As Hall points out, that means keeping them documented, making them easy to run, refactoring them, and discarding...

Continue...