About a year ago I released the first version of dbx - a CLI tool for simple and efficient development and deployment of Databricks jobs.
Since then the project's adoption has been growing nicely, and it is getting dangerously close to a grand total of 300k downloads.
In this post, I would like to share my learnings from driving this project as the core maintainer. My aim is to help other engineers with some of the long-term decisions and ideas they might consider when starting a new OSS project and driving it towards success. I know that some of these bits of knowledge might sound obvious - that's indeed true, but it's better to repeat than to forget.
1. Invest more time in the design phase
I know that a lot of passionate developers prefer to go with the flow. Open your favorite IDE, grab a coffee, turn on some good music - and here you start coding, right?
The hands-on part is undeniably important. However, before implementing anything, it's better to invest some time in designing the component or functionality you're going to build.
I must admit that I've fallen into this trap as well. Some of the interfaces and arguments of the project I've been working on were poorly designed.
This led to issues on both sides:
- badly designed user interfaces are hard to maintain and extend (on the developer side)
- and they are hard to document and use (on the user side).
By the time you realize that hasty design decisions are hampering your project, you will probably already have users who have built their own projects on the provided interfaces (since they were the only option), so any change becomes a source of issues for them.
Investing time in the user-facing interfaces is crucial, so set aside some of your project time for that. Poor design decisions will inevitably surface later, so don't borrow from your future self - invest the time now so that you won't spend it fixing these issues later.
2. Finding balance between new features and stability
When a project starts getting good adoption, you'll inevitably receive two types of feedback:
- bugs
- feature requests
In most cases, bugs come from the internal logic of your code. To make your project less error-prone, don't rush into writing a fix as soon as a bug appears.
First, you need to understand why the current code version is unable to handle the issue. Approach bugs systematically - don't just fix the issue, fix the root cause of it.
Typically the root cause for bugs is one of (or a combination of):
- lack of code analysis
- lack of unit tests
- lack of integration tests
For all of these cases, you can set up an automated process that will fortify and stabilize your further development. Use tools to automate these checks - apply static code analysis, run your tests in different environments.
In my project I was initially testing everything only on ubuntu-latest VMs in GitHub Actions. Guess what - a big chunk of the issues started coming in when users on other platforms started using the tool. Cover such cases with an environment matrix to make sure your product works on all expected platforms.
The same approach applies to integrations - always add at least some integration tests to your project. However, integration tests can add a lot of time overhead.
A good way to find a balance is to run integration tests not on every commit to a feature branch, but only when you want to merge your code into the main branch. Make sure you cover the cases where you get an unexpected response code or an unusual response structure from the HTTP API your application is using.
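As a rough illustration of the last point, here is a minimal sketch of tests for unexpected status codes and unusual response structures. The names extract_job_id and UnexpectedApiResponse and the job_id field are hypothetical and not taken from dbx. The functions work on a plain status code and payload, so no real HTTP call is needed.

```python
from typing import Any


class UnexpectedApiResponse(Exception):
    """Raised when the API answers in a way the client did not plan for."""


def extract_job_id(status_code: int, payload: Any) -> int:
    """Return the job id from a (hypothetical) job-creation response."""
    if status_code != 200:
        raise UnexpectedApiResponse(f"unexpected status code: {status_code}")
    # The body may be a list, None, or a dict without the expected key.
    job_id = payload.get("job_id") if isinstance(payload, dict) else None
    if not isinstance(job_id, int):
        raise UnexpectedApiResponse(f"unexpected response structure: {payload!r}")
    return job_id


def raises_unexpected(status_code: int, payload: Any) -> bool:
    """Helper: True if the response is rejected with UnexpectedApiResponse."""
    try:
        extract_job_id(status_code, payload)
    except UnexpectedApiResponse:
        return True
    return False


def test_valid_response():
    assert extract_job_id(200, {"job_id": 42}) == 42


def test_unexpected_status_code():
    assert raises_unexpected(503, {"job_id": 42})


def test_unusual_structure():
    assert raises_unexpected(200, {"error": "quota exceeded"})
    assert raises_unexpected(200, ["job_id", 42])
    assert raises_unexpected(200, None)
```

The test functions follow the pytest naming convention, so pytest would collect them as-is. The point is that failures surface as one explicit exception type instead of a KeyError somewhere deep in the code.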
As for the above-mentioned feature requests, always invest some time into the following considerations:
- is this feature request something that is already covered by the existing codebase, but just not documented well?
- if this feature request is feasible - invest some time into designing this feature first.
3. What to automate, and why?
A rule of thumb is that your CI and release pipelines should be fully automated. By this, I mean that all of your code should be properly tested when you commit it to the git repository, and there should be zero manual steps after the commit.
However, some steps could be applied automatically even before committing the code - the so-called pre-commit hooks.
I personally find that for Python development the black and prospector pre-commit hooks save a lot of time otherwise spent fixing code smells, and they enforce standard formatting across all developers.
Another important thing is to keep track of security issues in the packages you depend on. Dependabot is a great tool for this, with built-in GitHub integration.
Finally, some advice for your release pipeline and code distribution. For Python-based projects, I definitely recommend publishing your packages to PyPI. I think that a git-based approach to distributing your code is not an option, since it can easily lead to errors. For example, you might accidentally force-push a bad version of your code to main, and it will cause issues if the end user forgot to pass a proper version tag or branch name during pip install.
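To make the pinning problem concrete, here is a small sketch of a check that flags requirement lines without an exact version or git reference. The function name find_unpinned and the example package names and URLs are hypothetical, and the parsing is deliberately simplified. It only looks for == on regular requirements and for an @ref suffix after the repository name on git+ URLs.

```python
from typing import Iterable, List


def find_unpinned(requirements: Iterable[str]) -> List[str]:
    """Return requirement lines that do not pin an exact version or git ref."""
    unpinned = []
    for raw in requirements:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith("git+"):
            # A pinned git URL looks like git+https://host/org/repo.git@<ref>
            repo_part = line.rsplit("/", 1)[-1]
            if "@" not in repo_part:
                unpinned.append(line)
        elif "==" not in line:
            unpinned.append(line)
    return unpinned


lines = [
    "example-package==0.2.1",
    "requests",
    "git+https://github.com/example-org/example-package.git",
    "git+https://github.com/example-org/example-package.git@v0.2.1",
]
print(find_unpinned(lines))
# ['requests', 'git+https://github.com/example-org/example-package.git']
```

Note that a git reference pointing to a branch can still move under the user, which is one more reason to prefer versioned releases on PyPI.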
4. Documentation coverage
Some engineers don't like to write documentation. Some of them don't even like to read it.
This doesn't mean, though, that you can simply rely on a README.rst and hope that "they'll figure it out somehow". First of all, they won't; second, good documentation is a welcoming gesture and a way to promote your project to a wider audience.
Documentation is also an incredible tool for designing new features. Try the following - start writing the documentation for the new functionality before the actual implementation.
It's a very underrated approach that lets you catch logical flaws early and puts you in a productive design mindset. Finally, publicly available documentation brings more attention to your projects, since it will be indexed by search engines.
Your end users probably won't search for bits of your code - they'll probably google something like "how to make X on platform Y" - and properly written documentation can easily point them towards your project. My personal favourite place to host documentation is the beautiful Read The Docs service.
5. Choosing the right toolset
There are numerous potential combinations of software development tools and techniques you can rely on - and it's better not to trust my opinion but to try them out on your own and figure out your own toolset.
My personal toolset of choice is the following (for Python projects):
purpose | tool |
---|---|
IDE | IntelliJ IDEA or VSCode (still haven't decided which one I like the most) |
Git provider | GitHub |
CI provider | GitHub Actions |
Python code formatting | black |
Python code analysis | prospector |
Dependency management for Python | pip + requirements.txt files (yes, it's a bit old school, but it works really nicely without any Docker caching issues etc.) |
Environment management for Python | still haven't found anything better than conda |
Python testing framework | pytest (it's so powerful and simple, though it took some time to get used to after unittest) |
Code coverage analysis | pytest-cov + codecov.io for publishing the results |
Code security analysis | LGTM |
Dependency version analysis | Dependabot |
Documentation | Read The Docs |
Documentation & README language | reStructuredText (way more flexible and extensible than Markdown) |
Summary
Driving an OSS project, even a small one, is a great opportunity to gain new experience, get familiar with different technologies, and, finally, get some community feedback. If you have a great OSS project idea, don't hide it under a shadow of self-criticism - go public with it and collect feedback. I hope my learnings help you with your next steps.
Finally, I would like to mention that none of my contributions to the dbx project would have happened if we at Databricks didn't have an amazing culture that allows Solution Architects to invest some of their time in developing projects and contributing them to Databricks Labs.
And we're hiring, specifically for our EMEA Specialist Solutions Architect team - check out this link for details.