Ivan Trusov

Posted on Apr 3, 2022

My top 5 learnings from driving an OSS project

#python #databricks #opensource #architecture

Approximately one year ago I released the first version of dbx - a CLI tool for simple and efficient development and deployment of Databricks jobs.

Since then, adoption of the project has been growing nicely and is getting dangerously close to a grand total of 300k downloads.

In this post, I would like to share my learnings from driving this project as the core maintainer. My aim is to help other engineers with some of the long-term decisions and ideas they might consider when starting a new OSS project and driving it towards success. I know that some of these bits of knowledge might sound obvious - that's indeed true, but it's better to repeat than to forget.

1. Invest more time in the design phase

I know that a lot of passionate developers prefer to go with the flow. Open your favorite IDE, grab a coffee, turn on some good music - and here you start coding, right?

The hands-on part is undeniably important. However, before taking any implementation steps it's better to invest some time in designing the component or functionality you're going to build.

I must admit that I've fallen into this trap as well. Some of the interfaces and arguments of the project I've been working on were poorly designed.

This led to issues on both sides:

badly designed user interfaces are hard to maintain and extend (on the developer side)
and they are hard to document and use (on the customer side).

By the time you realize that hastily made design decisions are hampering your project, you will probably already have users who built their projects on the provided interfaces (since they were the only option available), so any change becomes a source of issues for them.

Investing time in the user-facing interfaces is crucial, so set aside some of your project time for that in advance. Poor design decisions will inevitably surface, so don't borrow from your future self - invest time now so that you won't have to spend it fixing these issues later.

2. Finding balance between new features and stability

When a project starts getting good adoption, you'll inevitably receive two types of feedback:

bugs
feature requests

In most cases, bugs come from the internal logic of your code. To make your project less error-prone, don't rush into writing a fix as soon as a bug appears.

First, you need to understand why the current code version is unable to handle the issue. Approach bugs systematically - don't just fix the issue, fix the root cause of it.

Typically the root cause for bugs is one of (or a combination of):

lack of code analysis
lack of unit tests
lack of integration tests

For all of these cases, you can set up an automated process that will fortify and stabilize your further development. Use tools to automate these checks - apply static code analysis, run your tests in different environments.

In my project I was initially testing everything only on ubuntu-latest VMs in GitHub Actions. Guess what - a big chunk of the issues started coming in when users on other platforms started using the tool. Cover such cases with an environment matrix to make sure your product works on all expected platforms.
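To illustrate the kind of issue that stays invisible on a single platform, here is a minimal sketch. The helper name `to_remote_path`, the `/artifacts` root and the file name are hypothetical and are not taken from dbx. It assumes a tool that turns a local relative path into a remote, slash-separated path.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def naive_remote_path(local_relative_path: PurePath, remote_root: str = "/artifacts") -> str:
    # str() keeps the platform's own separator, so a Windows path
    # leaks backslashes into the remote path.
    return f"{remote_root}/{local_relative_path}"


def to_remote_path(local_relative_path: PurePath, remote_root: str = "/artifacts") -> str:
    # as_posix() always renders forward slashes, regardless of the platform.
    return f"{remote_root}/{local_relative_path.as_posix()}"


def test_remote_path_is_platform_independent():
    # Pure*Path classes can be instantiated on any OS, so both flavours
    # can be checked in a single test run.
    expected = "/artifacts/conf/deployment.json"
    assert to_remote_path(PurePosixPath("conf/deployment.json")) == expected
    assert to_remote_path(PureWindowsPath("conf\\deployment.json")) == expected
```

A test like this catches one specific class of problem. It does not replace running the whole suite on every operating system you expect your users to have.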

The same approach applies to the integrations - always add at least some integration tests to your project. However, integration tests can add a lot of time overhead.

A good way to find a balance is to run integration tests not on every commit to a feature branch, but only when you want to merge your code into the main branch. Make sure you cover the cases where you get an unexpected response code or an unusual response structure from the HTTP API your application is using.
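As a rough sketch of what covering such cases could look like, the example below parses an HTTP response defensively and checks the failure paths. Everything here is hypothetical: `FakeResponse` stands in for a real HTTP client response, and the `job_id` field is an assumed payload shape, not a documented API.

```python
import json
from dataclasses import dataclass


class ApiError(Exception):
    """Raised when the API answers in a way the client cannot handle."""


@dataclass
class FakeResponse:
    # Hypothetical stand-in for an HTTP client's response object.
    status_code: int
    text: str


def parse_job_id(response: FakeResponse) -> int:
    # Unexpected response code: fail with a clear message.
    if response.status_code != 200:
        raise ApiError(f"unexpected status code: {response.status_code}")
    # Body that is not JSON at all (for example, an HTML error page).
    try:
        payload = json.loads(response.text)
    except json.JSONDecodeError as exc:
        raise ApiError("response body is not valid JSON") from exc
    # Valid JSON, but not the structure the client expects.
    if not isinstance(payload, dict) or not isinstance(payload.get("job_id"), int):
        raise ApiError("response does not contain an integer 'job_id'")
    return payload["job_id"]


def test_parse_job_id_handles_bad_responses():
    assert parse_job_id(FakeResponse(200, '{"job_id": 42}')) == 42
    bad_responses = [
        FakeResponse(500, "Internal Server Error"),
        FakeResponse(200, "<html>gateway error</html>"),
        FakeResponse(200, '{"unexpected": true}'),
        FakeResponse(200, "[1, 2, 3]"),
    ]
    for response in bad_responses:
        try:
            parse_job_id(response)
        except ApiError:
            continue
        raise AssertionError(f"expected ApiError for {response}")
```

In a real integration test the responses would come from the actual service or a recorded session. The fake objects only keep this sketch self-contained.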

As for the above-mentioned feature requests, always invest some time into the following considerations:

is this feature request something that could already be covered by the existing codebase, but is just not documented well?
if this feature request is feasible - invest some time into designing this feature first.

3. What to automate, and why?

A rule of thumb is that your CI and release pipelines should be automated. By this, I mean that all of your code should be properly tested when you commit it to the git repository, and there should be zero manual steps after the commit.

However, some steps could be applied automatically even before committing the code - the so-called pre-commit hooks.

I personally find that for Python development, black and prospector pre-commit hooks really help: you don't lose time fixing code smells, and standard formatting is applied across all developers.

Another important thing is to keep the security issues in your dependencies under control. Dependabot is a great tool with built-in integrations for GitHub Actions.

Finally, some advice for your release pipeline and code distribution. For Python-based projects, I definitely recommend publishing your packages to PyPI. I think that a git-based approach to distributing your code is not an option, since it can easily lead to errors. For example, you might accidentally force-push a bad version of your code to main, and this will cause issues if the end user forgot to pass a proper version tag or branch name during pip install.
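To make the risk more concrete, here is a small sketch that flags requirement lines which do not point at one specific version. The function name `is_pinned` and the package names are made up for illustration. It only handles plain `name==version` lines and pip's `git+...@ref` URL form, so treat it as a rough check rather than a full requirements parser.

```python
from urllib.parse import urlsplit


def is_pinned(requirement: str) -> bool:
    """Return True if a requirements.txt line names one specific version or git ref."""
    line = requirement.strip()
    if line.startswith("git+"):
        # pip's VCS syntax puts the ref after "@" in the URL path:
        # git+https://host/org/repo.git@<tag-or-commit>
        # Without it, pip installs whatever the default branch currently contains.
        return "@" in urlsplit(line).path
    # For packages installed from an index such as PyPI,
    # only an exact "==" pin counts in this sketch.
    return "==" in line


# Hypothetical requirement lines, for illustration only.
examples = [
    "example-package==1.2.3",
    "example-package",
    "git+https://github.com/example-org/example-package.git",
    "git+https://github.com/example-org/example-package.git@v1.2.3",
]
for line in examples:
    print(f"{line!r}: pinned={is_pinned(line)}")
```

Note that a ref after `@` can still be a branch name, which keeps moving. A version published to PyPI does not have this problem, because an uploaded release file cannot be replaced by a different file under the same version.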

4. Documentation coverage

Some engineers don't like to write documentation. Some of them don't even like to read it.

This doesn't mean, though, that you can simply rely on a README.rst and hope that "they'll figure it out somehow". First of all, they won't. Second, good documentation is a welcoming gesture and a way to promote your project to a wider audience.

Documentation is also an incredible tool for designing new features. Try the following - start writing the documentation for the new functionality before the actual implementation.

It's a very underestimated approach that will let you catch some of the logical flaws and put you into a productive design mindset. Finally, publicly available documentation brings more attention to your project, since it will be indexed by search engines.

Your end users probably won't search for bits of your code - they'll more likely google something like "how to make X on platform Y" - and properly written documentation can easily point them towards your project. My personal favourite place to host documentation is the beautiful Read The Docs service.

5. Choosing the right toolset

There are numerous potential combinations of software development tools and techniques you can rely on - and it's better not to trust my opinion but to try them out and figure out your own toolset.

My personal toolset of choice is the following (for Python projects):

purpose	tool
IDE	IntelliJ IDEA or VSCode (still haven't decided which one I like the most)
Git provider	GitHub
CI provider	GitHub Actions
Python code formatting	black
Python code analysis	prospector
Dependency management for Python	pip + requirements.txt files (yes, it's a bit old school, but it works really well without any Docker caching issues etc.)
Environment management for Python	still haven't found anything better than conda
Python testing framework	pytest (it's so powerful and simple, yet it took me some time to get used to it after `unittest`)
Code coverage analysis	pytest-cov + codecov.io for publishing the results
Code security analysis	LGTM
Dependency version analysis	Dependabot
Documentation	Read The Docs
Documentation & README language	reStructuredText (way more flexible and extensible than Markdown)

Summary

Driving an OSS project, even a small one, is a great opportunity to gain new experience, familiarize yourself with different technologies, and finally - get some community feedback. If you have a great OSS project idea, don't hide it in the shadow of self-criticism - go public with it and collect feedback. I hope my learnings will help you with your next steps.

Finally, I would like to mention that none of my contributions to the dbx project would have happened if we at Databricks didn't have an amazing culture that allows Solutions Architects to invest some of their time into developing projects and contributing them to Databricks Labs.

And we're hiring, specifically for our EMEA Specialist Solutions Architect team - check out this link for details.
