#77 - 5 June 2021

Remote hiring and diversity; Google's open source insights; Open source codes of conduct; Demos with JupyterLite; Training with SQLite data starter packs; Dask and UCX-Py; SRE hiring

Hi, everyone:

We have several items this week about post pandemic management and continuing to manage a distributed team.

It’s hard to hire and retain excellent staff in research computing, particularly (but not only!) in academia. We can’t offer FAANG-type salaries or perks, and if we’re even trying to compete on those terms we’re doomed. Our only option is to play to every advantage we do have: challenging, meaningful work; flexibility in work (which has always been a perk of academic environments); strong benefit plans; stability of employment; ability to support wide ranges of interesting projects; and clear and up-front staff salary bands (making wide disparities between people hired into the same jobs less likely). And we can play to those strengths while doing our best to attract the kinds of candidates want to have real impact on research, feel part of a research and education community, value learning new things and working in new areas, and who intend to occupy a job long enough that the benefits we do often offer (like pensions) matter.

Too often we don’t make use of those advantages when trying to hire. I’m routinely mystified by the websites of research projects — often research projects who mention that they have a hard time hiring — that make it hard to find out if there are jobs open. And when one does finally dig up a job ad (typically hidden under two or three pages of transparently meaningless institutional fluff about “how we work at [huge institution with vastly different cultures across department and teams]”), rather than being clear on priorities and goals for the new hire, the opportunities for growth and new challenges, and transparency around salary bands, I often see none of that and instead a laundry list of “3 years experience with technology X”, “2 years experience with technology Y”, etc.

These job ads have been clearly inadequate for years, and fixing job ads and where they’re posted is comparatively easy (even at institutions that have a fixed institutional job ad structure filled with boring boilerplate, no one prohibits these projects from advertising those jobs differently elsewhere and linking to the decrepit HR system for applying). So I kind of despair about how we as a community are going to handle the much greater challenge of post-pandemic hybrid work.

Flexibility in work post-pandemic is going to have to include hybrid remote/on-site approaches to work. Even the hospital I work for, an organization that is very much not widely known for its forward-looking, innovative approach to HR, is rolling out an institution-wide set of policies and perfectly serviceable materials for staff and managers in this new hybrid world, with near-site team members working from elsewhere some or all of the time.

We’re probably going to have to go further than this. We’ll have to allow people to work from away, probably including hiring people who will never ever commute from their distant location to “headquarters,” while also making sure space is available on-campus for those who want that (and employees who want to feel part of a research and education community will value that, at least some of the time). Hybrid is much harder than purely remote or purely onsite; the typical failure mode seems to be for people who are on- or near-site to feel “part of the team” in a way that the distributed team members don’t, and to slowly lose engagement with this distributed team members until they eventually leave. That means we have to keep up the practices we’ve been using during the pandemic like asynchronous written communications and processes even when they’re not necessary for everyone. It’ll be easy to slip.

We’re girding ourselves for that here; our working hours currently temporarily span six timezones and two continents, and we’re aiming to make that permanent (after what will surely be an epic battle, sung about for generations to come, with the previously mentioned HR department). We’re having to really up our game with building consensus by circulating documents around for comments, writing decision documents, and the like. But this is also already really helping us with collaborators or with team members who are only on our project 20% of the time, and with raw material for what will become papers or blog posts.

What is your team thinking for post-pandemic? What are your plans, and what are your concerns? Let me know by hitting reply, or emailing me at jonathan@researchcomputingteams.org.

PS - the newsletter will be taking a break next week; when we come back the following week we’ll get back to the regular Friday evening (eastern time) delivery time.

And now, on to the roundup!

Managing Teams

What It Takes to Run a Great Hybrid Meeting - Bob Frisch and Cary Greene, HBR

It’s pretty clear that the safest way to have meetings with both on- and off-site team members is to erase the distinction by having the meetings the way most of us have been doing them for the past 15 months - everyone connecting in separately from their laptops. That way everyone is on an equal footing.

But if there are a lot of people on site, that might not be realistic. Frisch and Greene offer the following suggestions for those who are on-site (interestingly, they assume that the person running the meeting is necessarily in the office):

  • Better audio - the old speaker/mic combination in the centre of the table isn’t going to cut it any more
  • Better video, especially of the distributed participants for the on-site team, ideally life-sized
  • Design the meetings for all attendees - tools for voting, taking notes/whiteboarding, etc
  • Have real facilitation, by someone not running the meeting, to make sure everyone is participating
  • Have on-site participants advocate for individual off-site participants

Honestly, I find this article a little discouraging. It gives some kind of idea how tough this is going to be - this isn’t even the hard case, where there’s multiple in-office teams at different locations as well as some working individually from elsewhere.


Microaggressions at the office can make remote work even more appealing - Karla Miller, Washington Post

Office spaces aren’t equally welcoming environments to all of us. Here Miller points out that for many potential team members, distributed work can mean less of the constant low-level stream of bullshit they’d normally experience in a predominantly white and male workplace.

This point:

Working at home has largely spared them from having to deal with such incidents as […] being mistaken for another colleague of the same race (a problem solved by having names displayed in video meetings)

apparently really registered with a lot of readers on twitter, and it’s a point that I had literally never thought about.

Reading this, I wonder if hiring in an increasingly distributed manner will also help recruit from groups that experience this sort of tiring nonsense all the time. We typically have small tight-knit teams, and team members from a lot of different demographic groups might well feel concerned about joining the team as “onlies”, the only member of that group. Will offering distributed work de-risk that choice enough that it would help improve diversity of our teams with new recruits (while of course leaving the work of equity and inclusion?)


Research Software Development

Open Source Communities Need More Safe Spaces and Codes of Conducts. Now. - Jennifer Riggins, The New Stack
Codes of conduct in Open Source Software—for warm and fuzzy feelings or equality in community? - Vandana Singh, Brice Bongiovanni, William Brandon, Software Quality Journal

Riggins walks us through the need for codes of conduct for open source projects, pointing out the rather shocking statistic that women make up less than 3% of open source communities, and that this has been stagnant for two decades. Between higher demands on their time and increased likelihood of be taken less seriously if not outright harassed, they are even less represented in open source than they are in tech generally.

Riggins points to empirical research by Singh et al that includes results such as:

  • Some women specifically hide their identities for open-source work
  • Women who had a good first open-source experience are much more likely to contribute
  • Many women explicitly look for codes of conduct before participating

There are existing codes of conduct like the contributor covenant which can be used directly.

Our own project (and team) do not have such CoCs, but as the team grows and our code grows in visibility we’re clearly going to have to adopt them.


Porting ATLAS experiment codes to exascale architectures - Nils Heinonen, ALCF Blog

Heinonen writes about porting two codes - FastCaloSim and MadGraph - necessary for the ATLAS experiment to pre-exascale systems at Argonne, in particular using Kokkos, a DOE lab-supported C++ programming model for applications which can target multiple platforms. Unfortunately, there’s not a lot of details, except that Kokkos allows interoperability with pure CUDA, allowing for an incremental porting process. On NVIDIA GPUs, Kokkos does nearly as well as the native GPU code, coming within 10% (that’s what I infer from the short post).

One of the lessons learned, “CUDA portability layer concepts translate well, even if the explicit syntaxes differ”, continues to strike me - that the basic CUDA programming model seems to be holding out extremely well across a number of different kinds of accelerators. I wonder when or if that will change?


Introducing the Open Source Insights Project - Google Open Source Blog

The blog post lists new Open Source Insights site, deps.dev, which lists networks of dependencies for almost the whole ecosystem of npm, Go, Maven (for Java), and Cargo (for Rust). “Coming Soon” is PyPi, which should be fascinating given the size of that ecosystem.

Dependency Graph of Express 4.17.1


Research Data Management and Analysis

SQLite Data Starter Packs - Public Affairs Data Journalism Class at Stanford

A great resource for teaching data analysis; a set of 15 small but real and interesting datasets shipped as SQLite files.


JupyterLite - Jeremy Tuloup and Nicholas Bollweg

JuypterLite is a remarkable implementation of Jupyter in Web Assembly, running entirely in the browser and served by static files. An example is here; it works flawlessly for me in Chrome. For demos or even teaching this would be enormously easier than setting up something in Binder or a JupyterHub instance.


High-Performance Python Communication with UCX-Py - Peter Andreas Entschev, NVIDIA Developer Blog

As Entschev demonstrates, it’s getting increasingly easy to use Dask and the rest of the RAPIDS stack using high speed interconnects like Infiniband or NVLink using UCX, which started as pieces of MPI runtime and is now a pretty well-established tool. I’m optimistic about UCX, and UCX-Py, enabling a new generation of parallel data analysis and simulation tools and approaches that need MPI-type performance but aren’t limited by MPI semantics.


Research Computing Systems

Don’t Be a Blockhead: Zoned Namespaces Make Work on Conventional SSDs Obsolete [PDF] - Theano Stavrinos, Daniel S. Berger, Ethan Katz-Bassett, Wyatt Lloyd, HotOS 21

(Presentation videos for this and other papers, as well as those papers, are available from the conference website).

Stavrinos et al. walk us through the advantages over block storage models of Zoned Namespaces, an existing and increasingly supported command set for newer SSDs and NVMe devices. Rather than continuing to pretend that SSDs are fast spinning disks, ZNS recognize that NVMe have other concerns, like data needing an entire erase unit to be erased before rewrites. Using block interfaces that don’t recognize this tends to lead to write amplification (one byte needing to be written causing a cascade of other writes), leading to reduced throughput and faster wear.

The sooner upstream software can take advantage of ZNS interfaces, the sooner we’ll get to take full advantage of these new storage devices.


Emerging Technologies and Practices

Building an SRE Team? How to Hire, Assess, & Manage SREs - Emily Arnott
How developers can be their own operations department - Daniel Order, Stack Overflow Blog

DevOps and SRE are two sides of a similar coin - bridging the gap between systems and developer teams to do better work faster. DevOps topics usually involve speeding release cycles, and SRE topics usually focus on improving automation, resiliency, and handling incidents, but there’s a significant degree of overlap.

Even if you aren’t explicitly building an SRE or DevOps team, you can start hiring for these skills and approaches in your regular ops or dev teams and try to take advantage of some of the improved tooling available and emerging best practices about running resilient systems and speeding release of new software.

Arnott’s article talks about the responsibilities of real-world SREs - automating processes, developing tools and infrastructure, creating runbooks, defining incident response steps, and the skills and knowledge you should be looking for in hires who would take on such roles.

Orner’s article writes about how Flipp gave - over a period of five years! - its development team greater tools to push new features or new systems to production by changing its infrastructure team’s focus from gatekeepers to enablers and pushing for continuous deployment.


My Magical Adventure With cloud-init - Christine Dodrill

Cloud-init is a standard cross-platform way to initialize a VM instance, given information from the cloud provider, provided user data, and provided vendor data. But as Dodrill demonstrates, if you’re willing to jump through a few other hoops it can be used programmatically to generate VMs locally for developing and testing on your own box (while still leaving you with configuration files that means you could also spin up the same instances on Azure or AWS or elsewhere).


Calls for Proposals/Applications

Ocean Hackweek - 3-6 Aug, Applications due 13 June. Virtual and in person at Bigelow Laboratory for Ocean Sciences, Maine

From the website:

OceanHackWeek (OHW) is a 4-day collaborative learning experience aimed at exploring, creating and promoting effective computation and analysis workflows for large and complex oceanographic data. It includes tutorials, data exploration, software development, collaborative projects and community networking.


Faculty Training Workshop: Teaching Heterogeneous Computing - 16-23 July, Applications due 15 June

Sadly only available to faculty members, this covers a newly developed set of teaching models by the Touch project at Texas state covering GPU programming, hybrid algorithms, and task mapping and scheduling.


Events: Conferences, Training

Tonne of interesting events coming up this roundup:

2nd ELEMENT workshop on Exascale Meshing, 9 June, Free Registration

This workshop covers the full meshing workflow - generation, adapting, partitioning meshes and visualizing results.


Excalibur-SLE Workshop: Data Visualisation and Data Flows - 16-17 June, free over Zoom

From the posting:

Stakeholders at this workshop will discuss the challenges of visualisation and data flow at exascale and topics such as: (i) limits of current workflows given roadmaps of future storage and I/O bandwidth; (ii) prescribed and automated in-situ data extraction; (iii) in-situ dimension reduction techniques; (iv) intelligent data compression; interactive analysis of large ensembles of simulations; and, (v) immersive visualisation using VR and AR


2021 SPDK, PMDK & Intel Performance Analyzers Virtual Forum - 22-24 June, Free Registration required

Overview of the storage performance development kit, persistent memory development kit, and Intel performance tools; last years talks and materials are available if you’re unsure if the event is for you.


ISC 2021 - 24 June - 2 July, Registration for full conference 249€ plus 199€ for tutorials; deep discounts available for students

The extremely broad program for the ISC conference, including a wide range of relevant tutorials, wil be held online for relatively reasonable prices.


Machine Learning for Planetary Space Physics Monthly Seminar - Next seminar 29 June

This is a monthly seminar (with previous talks on YouTube). From the site:

This seminar series aims to bring together researchers in planetary science, space physics, data science, and other domain applications of data science. We welcome presentations from a broad array of fields including: Earth based and planetary science applications, educational efforts, and basic data science research.


New Directions in Numerical Linear Algebra and HPC: Celebrating the 70th Birthday of Jack Dongarra - 7-8 July; free registration due by 2 July

Talks TBD but an extremely heavy-hitting lineup of HPC numerical linear algebra experts lined up to celebrate the accomplishments of Dongarra.


Random

The NVIDIA blog helpfully breaks out recorded talks and materials from GTC2021 events relevant to those of us in HPC.

A list of sources for project management document templates.

A couple resources for technical diagrams and images - pikchr is a tool and pic-inspired language for embedding simple diagrams in markdown, and inkscape has a long-awaited 1.1 release.

JSON handling in Postgres has gotten a lot better with the Postgres 14 release.

The next 50 years of unix shell programming [PDF].

Windows is becoming an increasingly plausible development environment for research computing?! Yeah, it’s weird to me, too. Here’s what developing python looks like these days, through the eyes of a Python core developer.

Learning Linux system programming by writing a pseudo-device driver (e.g. /dev/xyz).

“Computer, enhance”: using GPUs and deep learning to up-scale a cosmology simulation.

The Ada/SPARK community is starting RFCs for new language features.

An ambitious effort to complement the entire textbook An Introduction to Statistical Learning with labs involving R and tinymodels.

anchore is a command-line tool for scanning running docker containers for security issues.