HPC at FOSDEM 2019
This year’s FOSDEM featured an HPC, Big Data and Data Science devroom on the Sunday. This post is the first part of my notes on the topics presented there. If you are interested, book some time and let’s talk about what it means for you and your high-performance computing team.
Adrian Reber from the OpenHPC project gave a refresher on what OpenHPC is, and a status update. OpenHPC has not been represented at FOSDEM since 2016, when the project was very new.
It’s a community-driven project with representation from many vendors and HPC sites. At first blush their output might appear to be “RPM packages” and “documentation”, but their mission is actually to discover and share best practices in HPC management. Those packages are all well-tested with each other, and the documentation is tested every release, too. The idea is that if you build the core of your cluster with OpenHPC packages on CentOS-like Linux distributions, on either x86-64 or AArch64, you get to rely on tried and tested work from the whole community.
Reber, who works at Red Hat on their OpenHPC efforts, invited everyone to join the weekly project steering calls in a demonstration of the openness of the project. He discussed future directions, including an upcoming release v1.3.7 that will include packages rebuilt with the ARM HPC compiler for AArch64, and the challenge of deciding when it is right to release v1.4, which will drop SLES12 for SLES15 and RHEL7 for RHEL8.
On the subject of HPC libraries, a common frustration is testing codes with various combinations of compilers, MPI libraries, hardware capabilities and so on. Developers want to know both that their code is correct (i.e. the science outcomes are still valid after a change) and that the performance has not been significantly impacted.
Victor Holanda discussed ReFrame, a tool for HPC regression and performance testing developed at CSCS and used regularly on Piz Daint and their other clusters. Written in Python, it gives test authors a way to express what their tests require (e.g. that they must run on machines with CUDA, compile a particular code with one of three different compilers, load environment modules with one of two different MPIs), run the tests, and inspect the output for certain outcomes.
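To make the "inspect the output for certain outcomes" step concrete, here is a minimal sketch in plain Python. It does not use ReFrame's own API: the output format, the function names and the tolerance values are all assumptions invented for illustration. It shows the two kinds of check a regression test typically needs, one for correctness and one for performance against a reference figure.

```python
import re

# Hypothetical benchmark output; the format is invented for this sketch.
SAMPLE_OUTPUT = """\
Solver converged after 42 iterations
Result checksum: 3.14159
Throughput: 118.5 GB/s
"""


def check_sanity(output: str) -> bool:
    """Correctness check: the run must report that the solver converged."""
    pattern = r"^Solver converged after \d+ iterations$"
    return re.search(pattern, output, re.MULTILINE) is not None


def extract_throughput(output: str) -> float:
    """Pull the throughput figure (in GB/s) out of the output."""
    match = re.search(r"^Throughput: ([0-9.]+) GB/s$", output, re.MULTILINE)
    if match is None:
        raise ValueError("no throughput figure found in output")
    return float(match.group(1))


def within_reference(value: float, reference: float,
                     lower_frac: float, upper_frac: float) -> bool:
    """Performance check: value must lie within a band around the reference.

    The band is [reference * (1 - lower_frac), reference * (1 + upper_frac)].
    """
    low = reference * (1 - lower_frac)
    high = reference * (1 + upper_frac)
    return low <= value <= high


if __name__ == "__main__":
    throughput = extract_throughput(SAMPLE_OUTPUT)
    print("sanity ok:", check_sanity(SAMPLE_OUTPUT))
    # Assumed reference of 120 GB/s, allowing 5% slower or 10% faster.
    print("performance ok:", within_reference(throughput, 120.0, 0.05, 0.10))
```

Keeping the correctness check separate from the performance check means a failing test says which of the two questions went wrong.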
Testers get to run a single command, or point their Jenkins or Travis CIs at a single command, to discover and execute the tests. The ReFrame runtime will compare the environments that the test can use with the ones that are available, and will report on the outcomes in each of those environments.
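The matching step described above can be sketched in a few lines of Python. This is not how ReFrame itself is implemented; the `Environment` and `CheckSpec` classes, the compiler and MPI names, and the `runner` callback are all hypothetical, chosen only to show the idea of intersecting what a test accepts with what a system offers and then reporting per environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """One environment a cluster offers (hypothetical model)."""
    compiler: str
    mpi: str
    has_cuda: bool


@dataclass(frozen=True)
class CheckSpec:
    """What a test declares it can run with (hypothetical model)."""
    name: str
    compilers: frozenset
    mpis: frozenset
    needs_cuda: bool


def usable_environments(check, available):
    """Keep only the available environments that satisfy the test's needs."""
    return [
        env for env in available
        if env.compiler in check.compilers
        and env.mpi in check.mpis
        and (env.has_cuda or not check.needs_cuda)
    ]


def run_everywhere(check, available, runner):
    """Run the test once per usable environment and collect the outcomes."""
    return {env: runner(check, env) for env in usable_environments(check, available)}


# Example data: names are illustrative, not taken from any real site.
AVAILABLE = [
    Environment("gnu", "openmpi", has_cuda=True),
    Environment("gnu", "mpich", has_cuda=False),
    Environment("intel", "openmpi", has_cuda=True),
    Environment("cray", "mpich", has_cuda=True),
]
CHECK = CheckSpec(
    name="stencil_gpu",
    compilers=frozenset({"gnu", "intel", "pgi"}),
    mpis=frozenset({"openmpi", "mpich"}),
    needs_cuda=True,
)

if __name__ == "__main__":
    # A stand-in runner that always passes; a real one would build and launch.
    results = run_everywhere(CHECK, AVAILABLE, lambda check, env: "PASS")
    for env, outcome in results.items():
        print(f"{CHECK.name} [{env.compiler}/{env.mpi}]: {outcome}")
```

With the example data, the test is scheduled on the two CUDA-capable environments whose compiler and MPI it accepts, and skipped on the rest.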
Inside CSCS, ReFrame is used for a 90-minute nightly production test run, and 10-minute maintenance runs to check for system regressions after configuration changes. They also have a set of diagnostic tests to help understand what’s happened if a node goes bad. Their approach to correctness is very robust; the team do not declare that they support something until it has enough users to know how well it works. They also say that in three years of development they “have never seen a python stacktrace” from ReFrame, as they test ReFrame with ReFrame while they are developing it.
Singularity from Sylabs is a container runtime tool that specifically addresses the problems of containerising HPC workloads. Eduardo Arango gave a “what’s new in Singularity” update, as FOSDEM 2017 had already featured an introduction-level talk.
What’s new is that they’ve rewritten it in Go. This means they get better integration with libraries used in Docker, Kubernetes etc., and could adopt the de facto standard Container Network Interface (CNI) for software-defined networking when running containers. It also reduces the dependencies needed to get Singularity up and running.
The new version uses a new format for containers, SIF (Singularity Image Format), a read-only SquashFS filesystem along with metadata, all of which can be cryptographically signed using PGP for integrity protection. An upcoming extension will allow a writable overlay to be added to a SIF.
Supporting this, Sylabs have a new container library similar to DockerHub for hosting SIF images for public or private cloud use. They have a key store for those PGP signing keys, and a cloud-based remote image builder for developers who need to build images but can’t do it locally.
This has been part one of my FOSDEM HPC round-up. I’ve focussed on the tools that are out there for automating and simplifying HPC workflows, because it’s an interesting problem and one that presents challenges to many HPC teams. Don’t forget that the Labrary can help!