DEV Community

Ben Ford

Originally published at binford2k.com

Impact Analysis of Puppet Modules

Have you ever wondered who’s using your Puppet modules? Or have you hesitated before changing a class parameter because you don’t really know how many people will be affected downstream? Maybe you hesitated before deprecating a barely supported and almost certainly unused subclass because… well, you didn’t really know for sure that it was unused.

Rangefinder is the tool for you. Just run it on the source code you’re working on and it will tell you who might be affected.

[~/Projects/puppetlabs-concat]$ rangefinder manifests/fragment.pp
[concat::fragment] is a _type_
==================================
The enclosing module is declared in 173 of 575 indexed public Puppetfiles

Breaking changes to this file WILL impact these modules:
  * nightfly-ssh_keys (https://github.com/nightfly19/puppet-ssh_keys.git)
  * viirya-mit_krb5 (git://github.com/viirya/puppet-mit_krb5.git)
  * rjpearce-opendkim (https://github.com/rjpearce/puppet-opendkim)
  * shadow-tor (git://github.com/LeShadow/puppet-tor.git)
[...]

Breaking changes to this file MAY impact these modules:
  * empi89-quagga (UNKNOWN)
  * unyonsys-keepalived (UNKNOWN)
  * Flameeyes-udevnet (UNKNOWN)
  * ricbra-ratbox (git://github.com/ricbra/puppet-ratbox.git)
[...]

The tool is basically a glorified database client. It works by identifying the component generated by that source file and then querying for the usage of that component. It can recognize Puppet types, functions, classes, and defined types.
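To make the "identify the component" step concrete, here is a minimal sketch in Python of how a file path could be mapped to a component name using Puppet's standard autoload layout. It is an illustration only, not Rangefinder's actual code: the function name `infer_component` and the kind labels are made up for this example, and the path alone cannot tell a class from a defined type.

```python
from pathlib import PurePosixPath


def infer_component(module_name, relative_path):
    """Guess (kind, name) for a file in a Puppet module from its path alone.

    Hypothetical helper for illustration; returns None for unrecognized paths.
    """
    parts = PurePosixPath(relative_path).with_suffix("").parts

    if parts[:1] == ("manifests",) and len(parts) > 1:
        rest = parts[1:]
        # manifests/init.pp holds the item named after the module itself;
        # every other manifest maps to module::subdir::name.
        name = module_name if rest == ("init",) else "::".join((module_name,) + rest)
        # A class and a defined type live in the same place, so telling them
        # apart would require looking inside the file.
        return ("class_or_defined_type", name)

    if parts[:3] == ("lib", "puppet", "type") and len(parts) == 4:
        return ("type", parts[3])

    if parts[:3] == ("lib", "puppet", "functions") and len(parts) > 3:
        # Modern functions are namespaced by directory.
        return ("function", "::".join(parts[3:]))

    if parts[:4] == ("lib", "puppet", "parser", "functions") and len(parts) == 5:
        # Legacy functions are not namespaced.
        return ("function", parts[4])

    return None


print(infer_component("concat", "manifests/fragment.pp"))
# → ('class_or_defined_type', 'concat::fragment')
```

The resulting name is what a client like this would then look up in the usage data.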

The data used to identify downstream dependents come from a public BigQuery database containing indexed and aggregated data from both the Forge and GitHub. Here’s the query behind that command, right in the GCP console:

[Screenshot: identifying usage patterns for puppetlabs/concat]

As you can see, Rangefinder uses both the source and the repo columns to tailor how it displays results. Rows in which the source column matches the metadata from the module you’re running the command from will be displayed as exact (WILL impact) matches, and ones that don’t are possible (MAY impact) matches. We’ll talk more shortly about what that means.
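The exact/possible split described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the `source` and `repo` keys mirror the column names mentioned in the text, while the `module` key, the `classify` function, and the sample rows are invented for the example and are not real query results.

```python
def classify(rows, current_module):
    """Split query rows into exact (WILL impact) and possible (MAY impact)."""
    will, may = [], []
    for row in rows:
        # A missing repo is rendered as UNKNOWN, as in the sample output.
        label = f"{row['module']} ({row['repo'] or 'UNKNOWN'})"
        if row["source"] == current_module:
            will.append(label)  # the dependent names our module as the source
        else:
            may.append(label)   # item name matches, but its origin is unconfirmed
    return will, may


# Made-up rows, for illustration only.
rows = [
    {"module": "alice-webapp", "source": "puppetlabs-concat",
     "repo": "https://example.com/alice/webapp.git"},
    {"module": "bob-proxy", "source": None, "repo": None},
]

will, may = classify(rows, "puppetlabs-concat")
print(will)  # → ['alice-webapp (https://example.com/alice/webapp.git)']
print(may)   # → ['bob-proxy (UNKNOWN)']
```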

Gathering data

But first, let’s talk about how the data is collected. Each week, a cron job runs a simple data aggregation tool that does several things:

  • It mirrors Puppet-related data from the public GitHub datasets, so that our queries run against tables smaller than a terabyte and stay easy on our budget.
  • It gathers and flattens public data from the Puppet Forge into an easily queryable form.
  • It downloads each new release and runs certain kinds of static analysis against it.

This allows you to do things like retrieve forwards and backwards dependencies, or to join data from the Forge and GitHub. For example, have you ever wondered how many Forge modules define new native types (and are hosted on GitHub)?

SELECT DISTINCT g.repo_name, f.slug
FROM `dataops-puppet-public-data.community.github_ruby_files` g
JOIN `dataops-puppet-public-data.community.forge_modules` f
    ON g.repo_name = REGEXP_EXTRACT(f.source, r'^(?:https?:\/\/github\.com\/)?(.*?)(?:\.git)?$')
WHERE STARTS_WITH(g.path, 'lib/puppet/type')
LIMIT 1000
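If the regular expression in that join looks opaque, the following Python sketch applies the same pattern (with the literal dots escaped) to a few of the source URLs shown in the Rangefinder output earlier. The function name `repo_name` is just a label for this example. One thing it makes visible: the optional prefix only covers `http://` and `https://` URLs, so a `git://` URL passes through without its prefix stripped and would not line up with a GitHub `repo_name`.

```python
import re

# Same shape as the REGEXP_EXTRACT pattern in the query: an optional GitHub
# URL prefix, a lazily captured owner/repo, and an optional ".git" suffix.
SOURCE_PATTERN = re.compile(r"^(?:https?://github\.com/)?(.*?)(?:\.git)?$")


def repo_name(source):
    """Return the captured group, like REGEXP_EXTRACT with one capture group."""
    match = SOURCE_PATTERN.match(source)
    return match.group(1) if match else None


print(repo_name("https://github.com/nightfly19/puppet-ssh_keys.git"))
# → nightfly19/puppet-ssh_keys
print(repo_name("https://github.com/rjpearce/puppet-opendkim"))
# → rjpearce/puppet-opendkim
print(repo_name("git://github.com/viirya/puppet-mit_krb5.git"))
# → git://github.com/viirya/puppet-mit_krb5
```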

Itemization

The coolest part, to me at least, is the static analysis it does. This uses my puppet-itemize gem, which deconstructs Puppet manifests into all the types, classes, resources, and functions that they declare or invoke. Because it’s not compiling, it doesn’t care about conditional logic and effectively just returns a list of all items referenced in the source code, regardless of the code path.
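To show why ignoring conditional logic matters, here is a deliberately naive Python approximation of the idea. It is not how puppet-itemize works (that walks the parsed manifest); this sketch just scans the text with regular expressions, handles only simple resource declarations and `include` statements, and uses a made-up `demo` module as input.

```python
import re
from collections import Counter

# Language keywords that can precede a "{" and must not count as types.
KEYWORDS = {"if", "elsif", "else", "unless", "case", "class", "define", "node"}

# A resource declaration: a lowercase (optionally namespaced) name, "{",
# then a title followed by a single ":".
RESOURCE = re.compile(
    r"(?<![\w$:])([a-z]\w*(?:::[a-z]\w*)*)"
    r"\s*\{\s*"
    r"""(?:'[^']*'|"[^"]*"|\$[\w:]+|[\w-]+)"""
    r"\s*:(?!:)"
)
INCLUDE = re.compile(r"^\s*include\s+([a-z][\w:]*)", re.MULTILINE)


def itemize(manifest):
    """Count declared resource types and included classes on every code path."""
    types = Counter(
        m.group(1) for m in RESOURCE.finditer(manifest)
        if m.group(1) not in KEYWORDS
    )
    classes = Counter(INCLUDE.findall(manifest))
    return types, classes


MANIFEST = """
class demo {
  include demo::params
  concat { '/etc/demo.conf': }
  if $enabled {
    concat::fragment { 'header':
      target => '/etc/demo.conf',
    }
  } else {
    concat::fragment { 'disabled':
      target => '/etc/demo.conf',
    }
  }
}
"""

types, classes = itemize(MANIFEST)
print(types["concat::fragment"])  # → 2
```

Only one of the two branches would ever be compiled into a catalog, yet both `concat::fragment` declarations are counted, which is exactly the behavior you want for impact analysis.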

If I run Puppet Itemize against the first module listed in the puppetlabs/concat Rangefinder results, I see this:

[~/Projects]$ git clone https://github.com/nightfly19/puppet-ssh_keys.git
Cloning into 'puppet-ssh_keys'...
remote: Enumerating objects: 23, done.
remote: Total 23 (delta 0), reused 0 (delta 0), pack-reused 23
Unpacking objects: 100% (23/23), done.
[~/Projects]$ cd puppet-ssh_keys
[~/Projects/puppet-ssh_keys]$ puppet itemize
Resource usage analysis:
==========================
>> types:
    concat::fragment | 2
                file | 1
              concat | 1

>> classes:
    ssh_keys::params | 1
            ssh_keys | 1

>> functions:
                 md5 | 1

The result of this analysis is saved into the BigQuery database along with the name of the module, and then when Rangefinder runs, it will match on the two instances of concat::fragment that you see in the output above.

What next?

So where do we go from here? This is actually several steps into a larger metrics project. As you’ve probably noticed, everything so far operates only on data that is already public. You can already query the Forge API, look at a module’s metadata.json or source code, or query GitHub. That means this tool only makes it more convenient to do what you could already do!

What if we had access to actual usage data? What kind of development decisions would you make if you knew how many infrastructures are declaring which classes of your modules? Or which platforms people are running your modules on? Or which versions of your module people are running? Or maybe even just how many people are using your module in their internal profile classes?

You won’t be surprised to know that I’m working on that also. It’s a much larger project because there are a ton of privacy considerations that we had to address before even thinking about asking people to enable telemetry.

Our top two design constraints while building the client were privacy and transparency, and we’re now dogfooding it in our internal infrastructure to watch for sensitive information leaking. Keep an eye out for another post soon showing how that system works and how you can build your own tools to query the data it gathers.

Installing and using

If you’ve made it this far, maybe you’d like to try it out. You can simply gem install it and run it on the command line.

[~]$ gem install puppet-community-rangefinder
[~]$ rangefinder --help
Usage: rangefinder <paths>

Run this command with a space separated list of file paths in a module and it
will infer what each file defines and then tell you what Forge modules use it.

It will separate output by the modules that we KNOW will be impacted and those
which we can only GUESS that will be impacted. We can tell the difference based
on whether the impacted module has properly described dependencies in their
`metadata.json`. These are rendered as *exact match* and *near match*.

Note that non-namespaced items will always be near match only.

    -r, --render-as FORMAT Render the output as human, summarize, json, or yaml
    -v, --verbose Show verbose output
    -d, --debug Show debugging messages
        --shell Open a pry shell for debugging (must have Pry installed)
        --version Show version number
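
If you want to feed the results into your own tooling, the `--render-as json` option from the help text above is the natural hook. Here is a minimal Python sketch of wrapping the command. Both function names are made up for this example, and it assumes the JSON document is the only thing written to standard output; the structure of that JSON is not described here, so the sketch simply returns whatever was parsed.

```python
import json
import shutil
import subprocess


def build_command(paths, render_as="json"):
    """Build the argument list for a rangefinder run over the given files."""
    return ["rangefinder", "--render-as", render_as, *paths]


def run_rangefinder(paths):
    """Run rangefinder and parse its JSON output.

    Assumes stdout contains nothing but the JSON document.
    """
    if shutil.which("rangefinder") is None:
        raise RuntimeError(
            "rangefinder not found on PATH; "
            "try `gem install puppet-community-rangefinder`"
        )
    result = subprocess.run(
        build_command(paths), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


print(build_command(["manifests/fragment.pp"]))
# → ['rangefinder', '--render-as', 'json', 'manifests/fragment.pp']
```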

Do let me know how this works for you and whether there are ways it could work better. My next post will cover the webhook version, which attaches this impact analysis to your GitHub pull requests automatically so that you know how much of an impact an incoming PR can have before you merge it.
