TIL: query the NPM paginated REST API with Elixir

#npm #elixir #aws

These are notes on how to use Elixir to find all the NPM packages published by AWS with their download count.

The total count is 1676. We start by querying the npms-api search endpoint: we give it a string and it returns a list of matching JavaScript packages. The endpoint is paginated. With this list, we then query the NPM registry to get download statistics for each package.

We send approximately 1750 requests in total. The slower calls are to the paginated search API, with a response time of approximately 200-250 ms per query. We then send about 1700 queries to get the statistics for each package, with an average response time of 10-15 ms. The whole run takes around 20 s on average, compared to a sequential estimate of 1700/25 × 0.200 s ≈ 13.6 s plus 1700 × 0.010 s = 17 s, about 30 s in total.
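The sequential estimate can be reproduced with a few lines of Elixir. This is only back-of-the-envelope arithmetic using the approximate figures quoted above (1700 packages, 25 results per page, 200 ms and 10 ms latencies); the variable names are illustrative.

```elixir
# Rough sequential estimate, using the approximate figures from the text.
packages = 1700
page_size = 25
search_latency = 0.200   # seconds per paginated search request
stats_latency = 0.010    # seconds per statistics request

pages = ceil(packages / page_size)
search_time = pages * search_latency
stats_time = packages * stats_latency
total = search_time + stats_time

IO.puts("#{pages} search pages, about #{Float.round(total, 1)} s if run sequentially")
```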

We used Elixir to achieve this, with the HTTP client Finch to send the requests and the Rust-based JSON parser Jsonrs.

We used an easy path: build the full list of found packages, and then query the details for each element of the list.

We used Stream.resource to build the list of packages. We used the total count returned by the endpoint to handle the pagination and increment a counter on each iteration.
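As a minimal sketch of this pagination pattern, the example below replaces the real endpoint with a hypothetical in-memory `fake_search/2` that returns `{items, total}`. Unlike the full code further down, it keeps the reported total in the accumulator so that it can halt before issuing an extra request; that is a variation for illustration, not the original implementation.

```elixir
defmodule PagingSketch do
  @page_size 25

  # Hypothetical stand-in for the search endpoint: returns {items, total}.
  def fake_search(from, total) do
    items =
      if from < total do
        last = min(from + @page_size, total) - 1
        Enum.to_list(from..last)
      else
        []
      end

    {items, total}
  end

  # Lazily emits every item, one page at a time.
  def all(total \\ 60) do
    Stream.resource(
      # start: page counter at 0, total not known yet
      fn -> {0, nil} end,
      fn
        # stop once the offset reaches the total reported by the endpoint
        {page, known} when is_integer(known) and page * @page_size >= known ->
          {:halt, {page, known}}

        # otherwise fetch the next page and increment the counter
        {page, _known} ->
          {items, reported} = fake_search(page * @page_size, total)
          {items, {page + 1, reported}}
      end,
      # nothing to clean up
      fn _acc -> :ok end
    )
  end
end

PagingSketch.all() |> Enum.count() |> IO.inspect(label: "items fetched")
```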

We then use Task.async_stream to query the second endpoint concurrently. It returns statistics on a given package, from which we retrieve the download count for a given period.
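A small sketch of the concurrent step, assuming a hypothetical `fake_stats` function in place of the real HTTP call. Each element emitted by `Task.async_stream` is wrapped as `{:ok, result}` (or `{:exit, reason}` if the task is killed on timeout), which is why the results have to be unwrapped afterwards. The `max_concurrency` and `timeout` values are arbitrary choices for the example.

```elixir
# Hypothetical stand-in for the per-package statistics request.
fake_stats = fn name ->
  Process.sleep(10)
  %{"package" => name, "downloads" => String.length(name)}
end

results =
  ["alpha", "be", "gam"]
  |> Task.async_stream(fake_stats, max_concurrency: 10, timeout: 5_000)
  # results arrive in input order by default, each wrapped in {:ok, _}
  |> Enum.map(fn {:ok, stats} -> stats end)

IO.inspect(results)
```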

We then use Enum.sort_by to order the list of maps by a given key.
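For illustration, here is the sorting step on a few made-up entries (the package names and counts are invented, not real statistics):

```elixir
# Invented sample data in the same shape as the statistics maps.
stats = [
  %{"package" => "a", "downloads" => 10},
  %{"package" => "b", "downloads" => 300},
  %{"package" => "c", "downloads" => 42}
]

# Sort by the "downloads" key, largest first.
sorted = Enum.sort_by(stats, &Map.get(&1, "downloads"), :desc)

IO.inspect(Enum.map(sorted, & &1["package"]))
```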

With the sorted result, we can run a side effect as a Task: it saves the data to a file. Since the final step needs the data unchanged, we wrap the side effect in tap: it calls the function with the data, ignores its return value, and passes the data through unchanged. This keeps the pipeline intact while the side effect runs.
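A minimal sketch of the `tap` idea, with the file write replaced by a message sent back to the calling process so the example has no external effects. `Task.start/1` is used here for brevity; the full code below uses a supervised task instead.

```elixir
parent = self()

result =
  [3, 1, 2]
  |> Enum.sort()
  |> tap(fn data ->
    # Side effect in a separate process; tap/2 ignores its return value.
    Task.start(fn -> send(parent, {:saved, length(data)}) end)
  end)
  # The pipeline continues with the sorted list, untouched by the side effect.
  |> Enum.map(&(&1 * 10))

IO.inspect(result)
```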

Finally, we reshape the data into a list of maps %{"package_name" => downloaded_count}.
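As an illustration of that final shape, with invented sample data (the extra `"start"` key only shows that other fields are dropped):

```elixir
# Invented sample entries; real entries come from the statistics endpoint.
stats = [
  %{"package" => "a", "downloads" => 10, "start" => "2022-01-01"},
  %{"package" => "b", "downloads" => 300, "start" => "2022-01-01"}
]

# Keep only the name and the count, as a one-entry map per package.
pretty =
  Enum.map(stats, fn %{"package" => name, "downloads" => count} ->
    %{name => count}
  end)

IO.inspect(pretty)
```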

Elixir can nicely chain all these streams. In a Livebook, the following code will default to the AWS packages.

Mix.install([
  {:finch, "~> 0.16.0"},
  {:jsonrs, "~> 0.3.1"}
])
Supervisor.start_link(
  [
    {Finch, name: MyFinch},
    {Task.Supervisor, name: MyTaskSup}
  ], 
  strategy: :one_for_one,
  name: MySup
)

defmodule Npm do
  require Logger

  @registry "https://api.npmjs.org"
  @search_point "https://api.npms.io/v2/search"

  @starting "2022-01-01"
  @ending "2023-01-01"
  @search "@aws-sdk/client"
  @aws_npm_packages "aws-npm-packages.json"

  def find(save? \\ false, string \\ @search, starting \\ @starting, ending \\ @ending) do

    check_response = fn response ->
      case response do
        {:ok, result} ->
          result

        {:error, reason} ->
          {:error, reason}
      end
    end

    #  the optional "save to file"
    save_to_file = fn list ->
      Logger.info(%{length: length(list)})

      Task.Supervisor.async_nolink(MyTaskSup, fn ->
        case Jsonrs.encode(list, lean: true, pretty: true) do
          {:ok, result} -> File.write!(@aws_npm_packages, result)
          {:error, reason} -> reason
        end
      end)
    end

    # the iterating function in Stream.resource
    next = fn {data, page} ->
      {response, total} = search(string, 25 * page)

      case page * 25 >= total do
        true ->
          {:halt, data}

        false ->
          {response, {data, page + 1}}
      end
    end

    try do
      Stream.resource(
        fn -> {[], 0} end,
        &next.(&1),
        fn _ -> nil end
      )
      |> Task.async_stream(&downloaded(&1, starting, ending))
      |> Stream.map(&check_response.(&1))
      |> Enum.sort_by(&Map.get(&1, "downloads"), :desc)
      |> tap(fn data -> if save?, do: save_to_file.(data) end)
      |> Enum.map(fn %{"downloads" => d, "package" => name} -> %{name => d} end)

    rescue
      e ->
        # Logger.warn/1 is deprecated in favour of Logger.warning/1
        Logger.warning(Exception.message(e))
    end
  end

  # the first endpoint: returns a tuple {stream_of_package_names, total}
  def search(string, from \\ 0) do
    url =
      @search_point
      |> URI.new!()
      |> URI.append_query(URI.encode_query(%{q: string, size: 25, from: from}))
      |> URI.to_string()

    with {:ok, %{body: body}} <-
           Finch.build(:get, url)
           |> Finch.request(MyFinch),
         {:ok, %{"results" => results, "total" => total}} <- Jsonrs.decode(body) do
      {
        Stream.filter(results, fn package ->
          Map.has_key?(package, "flags") === false &&
            get_in(package, ["package", "name"]) |> String.contains?(string)
        end)
        |> Stream.map(&get_in(&1, ["package", "name"])),
        total
      }
    else
      {:error, reason} ->
        reason
    end
  end

  # the second endpoint
  def downloaded(package_name, start, ending) do

    path = "#{@registry}/downloads/point/#{start}:#{ending}/#{package_name}"

    with {:ok, %{body: result}} <-
           Finch.build(:get, path) |> Finch.request(MyFinch),
         {:ok, response} <- Jsonrs.decode(result) do
      response
    else
      {:error, reason} -> reason
    end
  end
end

The usage is:

iex> Npm.find(false, "@aws-sdk/client", "2022-01-01", "2022-03-01")

The same result with @google-cloud/:

If you have Livebook installed, you can run a session with the button:

If not, you can easily run a Livebook from Docker. Run the image:

docker run -p 8080:8080 -p 8081:8081 --pull always -e LIVEBOOK_PASSWORD="securesecret" livebook/livebook

and then from another terminal, execute:

open "http://localhost:8080/import?url=https://github.com/ndrean/gruland/blob/main/livebook.livemd"
