RPM and Python virtual environment

Python is a great language for data analysis and AI/ML development. However, one usually relies on Python virtual environments to achieve reproducibility (the situation in R is similar).

In this post, I discuss my approach, which combines the flexibility and reproducibility of virtual environments with straightforward long-term maintainability: the ability to easily and reliably install, configure, monitor for security vulnerabilities, upgrade, and remove software.

New functionality and backward compatibility

If you rely on the ls command in Linux, you don't expect an updated implementation to change how it works. In Python, this assumption does not hold. A new version of the same package may have a new interface, the same functions may expect new parameters, defaults may change, and behavior in special cases may change. This is both a disadvantage and an advantage of Python.

Large groups of developers and wealthy companies (Linux Foundation, Red Hat, Mathematica, Matlab, etc.) have specialists capable of introducing new features without significant backward compatibility issues. Such an approach makes products stable but limits the rate of innovation. On the other hand, small groups, especially those whose main expertise lies outside software engineering (scientists, statisticians, etc.), prefer to have the freedom to enhance functionality and distribute new versions without the overhead of taking care of all possible conflicts. Many such small groups create innovative, fast-evolving packages for data science and AI work in Python (the same is true of R).

Virtual environments

In Python, developers handle reproducibility by creating virtual environments, which "freeze" the state of packages and allow for simple re-installation of a set of packages with specific versions.
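To make the "freeze and re-install" idea concrete, here is a minimal sketch that uses only the standard library. The directory name demo-env and the helper names are my own illustrative choices, and the bin/python path assumes a Linux (POSIX) layout of the environment.

```python
import subprocess
import venv
from pathlib import Path


def env_python(env_dir: Path) -> Path:
    """Interpreter inside the virtual environment (POSIX layout assumed)."""
    return env_dir / "bin" / "python"


def create_env(env_dir: Path) -> None:
    """Create the environment and make pip available inside it."""
    venv.create(env_dir, with_pip=True)


def install_command(env_dir: Path, requirements: Path) -> list[str]:
    """pip command that installs the versions listed in a requirements file."""
    return [str(env_python(env_dir)), "-m", "pip", "install", "-r", str(requirements)]


def freeze(env_dir: Path) -> str:
    """Return the installed packages as 'name==version' lines."""
    result = subprocess.run(
        [str(env_python(env_dir)), "-m", "pip", "freeze"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    env = Path("demo-env")  # hypothetical directory name
    create_env(env)
    # Record the exact state so it can be re-created elsewhere.
    Path("requirements.txt").write_text(freeze(env))
```

On another machine, running the command returned by install_command against the saved requirements.txt should reproduce the same set of package versions, as long as those versions are still available for that platform.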

The problem with such environments is that their Python packages may depend on Linux packages that must already be present in the system before the Python packages can be installed. One can deal with such issues on a case-by-case basis. Another approach is to use conda, which can install Python/R packages as well as Linux packages.

Maintenance

Even a simple app has several types of files: at the very least, files with code, configuration files, and log files. While, of course, one can put those files anywhere in the system, it is better to put them in the proper places. Thus, some system for installing files to those locations should be in place. And I want others to be able to install the software I created.
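As an illustration of what "proper places" could mean, here is a small sketch. The /opt/app1 location is the one I use later in this post; /etc and /var/log are my assumption of a conventional Linux layout, and the class name AppLayout is hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AppLayout:
    """Conventional locations for one app (an assumed layout, not a standard API)."""
    name: str

    @property
    def code_dir(self) -> Path:
        return Path("/opt") / self.name      # application code

    @property
    def config_dir(self) -> Path:
        return Path("/etc") / self.name      # configuration files (assumption)

    @property
    def log_dir(self) -> Path:
        return Path("/var/log") / self.name  # log files (assumption)


layout = AppLayout("app1")
print(layout.code_dir, layout.config_dir, layout.log_dir)
```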

One way would be to instruct whoever installs your app to untar a file or clone something from Git and then run a script that copies the files to where they should be. You would then also have to provide instructions for uninstalling, upgrading, etc.

However, there is a better way.

RPM

Red Hat and similar operating systems use the RPM package manager to handle the installation of packages.

In short, RPM enables (among many other things):

Installation that handles dependencies and prevents conflicts
Clean uninstall: I don't want the residual ghost files to remain in the system after I uninstall the app. I also want to remove RPM packages that were installed as a dependency of my app/package. And if another package in the system depends on my app/package, I want the system to warn me about that (so I can either stop uninstalling or remove that other package in the same step)
Easy upgrade (with proper handling of configuration files, in case you change them and don't want them to be overwritten when you upgrade a version)

All this can be done in an automated way on one or multiple servers.
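The operations listed above map to a few standard commands. Below is a minimal sketch of a wrapper one could use for automation; the helper names are hypothetical, and whether dnf remove also removes unused dependencies depends on the dnf configuration.

```python
import subprocess


def install_cmd(rpm_file: str) -> list[str]:
    """Install a local RPM file; dnf resolves its dependencies from the repositories."""
    return ["dnf", "install", "-y", rpm_file]


def who_requires_cmd(package: str) -> list[str]:
    """List installed packages that depend on the given package."""
    return ["rpm", "-q", "--whatrequires", package]


def remove_cmd(package: str) -> list[str]:
    """Uninstall the package."""
    return ["dnf", "remove", "-y", package]


def run(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run a command and capture its output.

    check=False on purpose: 'rpm -q --whatrequires' returns a non-zero
    exit code when no package requires the queried one.
    """
    return subprocess.run(cmd, check=False, capture_output=True, text=True)
```

For example, run(who_requires_cmd("app1")) before run(remove_cmd("app1")) shows whether anything else in the system still depends on the app.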

RPM approach and Python/R

Problem

So, I want to use the power of RPM, but at the same time, I want to use Python/R, which have so many great packages.

In the RPM approach, I would create or use an RPM for each Python/R package (Red Hat actually provides RPMs for many Python packages) and install them. If I only have one app, this may work. But if I have two Python apps that require different versions of, say, pandas, this creates a problem. It may also happen that the set of dependencies for app1 can't coexist with the set of dependencies for app2. And even if those two apps can use the same Python packages now, this may change with a future upgrade of one of them.
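A small sketch of how such a conflict can be detected by comparing the pinned requirements of two apps. It only understands exact pins of the form name==version, and the version numbers below are purely illustrative.

```python
def parse_pins(requirements: str) -> dict[str, str]:
    """Map package name -> pinned version for lines like 'pandas==1.5.3'."""
    pins = {}
    for line in requirements.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, sep, version = line.partition("==")
        if sep:  # lines without an exact pin are ignored in this sketch
            pins[name.strip().lower()] = version.strip()
    return pins


def conflicting_pins(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Packages pinned by both apps, but to different versions."""
    return {
        name: (a[name], b[name])
        for name in a.keys() & b.keys()
        if a[name] != b[name]
    }


app1 = parse_pins("pandas==1.5.3\nnumpy==1.24.4\n")
app2 = parse_pins("pandas==2.2.2\nnumpy==1.24.4\n")
print(conflicting_pins(app1, app2))  # {'pandas': ('1.5.3', '2.2.2')}
```

With a single system-wide set of Python RPMs, only one of the two pandas versions can be installed, so one of the apps would have to give way.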

Popular options

Popular options to address that are:

RPM approach. Go with a classical software engineering approach: an RPM for each Python/R package, dependencies handled through the RPM package manager, and just deal with the fact that my two (or more) apps have to use the same set of Python packages.
Virtual environment approach, no RPMs. Don't build RPMs; just write scripts that copy the app's files to the proper locations, and remember to remove them when uninstalling. Do the same for the virtual environment (Python environment, renv, or conda).
Docker. Create a Docker image that has all the needed packages, both Linux packages and Python/R packages. Use an independent Docker image/container for each app. Mount the required system directories into the Docker container so that config files, log files, etc. are stored in the proper standard locations.
- By the way, one can actually use the RPM approach within Docker
A custom deployment solution, often available in large corporations

I don't like any of the methods above:

RPM approach. This does not allow me to freely run/develop different apps on the same machine.
Virtual environment approach, no RPMs. I have to handle all the operations that RPM would otherwise handle for me. Writing my own implementation of something that RPM already does well is not a good idea.
Docker. Even small changes may require rebuilding the whole container. RPM does not manage configuration, logs, and other files in the system. Other software can't use RPM to check whether my app is installed and thus can't reliably use it as a dependency. Security vulnerabilities have to be monitored and patched for each Docker container independently.
A custom deployment solution. Well... this is not universal: it would not work if your company does not have a managed service of this sort.

Proposed approach

So, here is the proposed approach, which is a compromise of sorts.

Create an RPM containing my app, which depends only on Linux packages.
Use RPM scriptlets
- In the %post section of the RPM's spec file (executed at install or upgrade on the target system), create the Python virtual environment and populate it using pip and the requirements.txt file included in the RPM. If the app is installed in /opt/app1, then create the environment in /opt/app1/penv
- In the %postun section (executed at uninstall on the target system), remove the files of the Python virtual environment. Because %postun of the old version also runs during an upgrade, perform the removal only on a full uninstall (when the scriptlet argument $1 is 0)
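To show what the two scriptlets could look like, here is a small Python helper that renders them as text for a spec file. This is a sketch under assumptions: the helper name render_scriptlets is hypothetical, requirements.txt is assumed to be installed into /opt/app1 by the RPM, python3 is assumed to be on the PATH of the target system, and pip is assumed to be able to reach a package index at install time.

```python
from pathlib import PurePosixPath


def render_scriptlets(app_dir: str = "/opt/app1") -> str:
    """Render %post and %postun scriptlets for an app installed in app_dir."""
    app = PurePosixPath(app_dir)
    env = app / "penv"                   # location of the virtual environment
    req = app / "requirements.txt"       # assumed location inside the RPM payload
    lines = [
        "%post",
        "# Runs after the package files are in place, on install and on upgrade.",
        "# --clear empties an existing environment so an upgrade starts clean.",
        f"python3 -m venv --clear {env}",
        f"{env}/bin/pip install -r {req}",
        "",
        "%postun",
        "# $1 is 0 on a full uninstall and 1 when the old version",
        "# is being removed as part of an upgrade.",
        'if [ "$1" -eq 0 ]; then',
        f"    rm -rf {env}",
        "fi",
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_scriptlets())
```

The guard in %postun matters because, during an upgrade, %post of the new version runs before %postun of the old one. Without the check, the old package would delete the environment that the new package has just created.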

Benefits: maintainability benefits of RPMs + flexibility/reproducibility that comes with virtual environments

Drawbacks: the RPM package manager does not know about the files in the Python virtual environment directory. As a result, it can't verify their integrity or handle conflicts (if other packages want to write to the same location).
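One possible way to partly compensate for the missing integrity check is to record file hashes of the environment right after it is created and compare against them later. This is not part of my setup, just a sketch; the function names are hypothetical, and symlinks and __pycache__ directories are skipped because they point outside the environment or change during normal use.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map relative file path -> SHA-256 for regular files under root."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if path.is_symlink() or not path.is_file():
            continue
        if "__pycache__" in relative.parts:
            continue
        manifest[str(relative)] = file_digest(path)
    return manifest


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """Files that were added, removed, or modified since the manifest was built."""
    current = build_manifest(root)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

For instance, build_manifest(Path("/opt/app1/penv")) could be saved at the end of %post, and changed_files could be run later as a periodic check.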

Possible alternative: package the environment into an RPM

There is also an option to package the virtual environment into a separate RPM using the rpmvenv package, and then make your code depend on it. However, it did not work for me: the RPM failed to build. I am sure it is possible to make it work eventually, but this means that extra steps may be required.