It may be interesting to know when more people are reading DEV - for example, perfect timing is important for those who are obsessed with the idea of getting more likes and comments, right? :)
An example of the histogram produced as a result of the experiments described below.
Let's do a simple exercise of collecting stats - from DEV, for example, though the same approach works for most similar sites and forums. I will use bash, for amusement and because it is useful to know more about it, though you can easily do all the steps in another language (Python or PHP would suit best, in my opinion). You may want to skim my previous post about solving a Project Euler problem in bash if you are not well acquainted with this popular Linux command-line processor.
The plan is like this:
- a few words about the DEV API in general
- getting the max id (roughly the total number) of articles
- fetching about 100 random articles with ids below this one
- collecting their publication hours
- and making a small chart out of them
Few words about DEV API
This wonderful site has a REST API - simply URLs which return JSON with data about articles, users, comments, etc. A bit of documentation can be found here, but it is quite incomplete. Luckily, we can also refer to the source code, particularly here.
The root of the API is at dev.to/api
. For example, let's look at the endpoint listing articles...
Getting total number of articles
Open the link https://dev.to/api/articles in your browser. You'll see JSON, obviously - an array containing article data.
[
  {
    "type_of":"article",
    "id":261930,
    "title":"An Open..."
    //...
  },
  { /*... another article data*/ },
  //...
]
Note that several of the latest articles are returned, and each has a numeric id
- so seemingly DEV now hosts over 200k
articles.
Now I want to make a request from the command line which returns just this number. I make requests with the curl
tool. Try running the following commands, one by one:
curl https://dev.to/api/articles/
curl https://dev.to/api/articles/ | grep -oP '(?<=\"id\"\:)\d+'
curl https://dev.to/api/articles/ | grep -oP '(?<=\"id\"\:)\d+' | head -n 1
What's happening? The first line is very simple - curl
fetches the JSON data from the API endpoint and dumps it to the console. The second line pipes this output to the grep
tool, which applies a regular expression to find all "id":1234567
fragments and prints only the numeric ids, one per line (a bit more about this below). The third line applies yet another command to the produced list of ids - head
just takes a few top lines (one in this case).
What does this regular expression mean? Look at the end: we search for the sequence \d+
, which means "a digit, repeated 1 or more times". Before this fragment comes a pattern enclosed in (?<=...)
, called a "look-behind". I.e. we will find only those digits which are preceded by "id":
, though this prefix is not included in the matched part.
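To see the look-behind in isolation, you can try it on a literal fragment (a throwaway example string, not part of the script):
echo '"type_of":"article","id":261930,"title":"An Open..."' | grep -oP '(?<="id":)\d+'
# prints 261930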
So, at last, let's put this single top id number into a variable:
total=$(curl https://dev.to/api/articles/ | grep -oP '(?<=\"id\"\:)\d+' | head -n 1)
Commands inside $(...)
are executed and their output is stored in the variable total
.
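A quick check that the variable got filled (the exact number will of course differ when you run it):
echo "$total"
# e.g. 261930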
Getting single article by ID
Now we are going to fetch about 100 random articles. For this we take a random number from the $RANDOM
variable, subtract it from the max id
, and fetch the specific article with that id.
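$RANDOM expands to a new pseudo-random integer between 0 and 32767 on each use, so in effect we sample from roughly the 32 thousand most recent articles:
echo $RANDOM $RANDOM $RANDOM   # three different numbers, each in 0..32767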
To get an article by id we just append the id to the link used above, e.g.
https://dev.to/api/articles/123456 - note that here we get not an array, but a single object describing an article:
{
  "type_of":"article",
  "id":123456,
  //...
  "created_at":"2019-06-13T14:31:50Z"
}
The object includes the creation timestamp; we'll use it a bit later. For now, let's complete the code which fetches 100 articles at random:
i=0
while [[ $i -lt 100 ]] ; do
    # pick a random id below the newest one and try to fetch that article
    random_id=$(($total - $RANDOM))
    atext=$(curl -sf https://dev.to/api/articles/$random_id)
    if [[ $? -eq 0 ]] ; then    # curl -f returns a non-zero status on HTTP errors
        i=$(($i + 1))
        echo "$random_id - ok"
    else
        echo "$random_id - FAIL"
    fi
done
We don't use the received text yet - we just print out whether the article could be loaded or not. Some requests fail with 404 (which can be checked manually) - seemingly those articles were saved as drafts and never published, or were deleted afterwards.
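The -f flag is what makes this check work: with it, curl exits with a non-zero status on HTTP errors such as 404 (and -s just hides the progress output). You can verify the behaviour on any id by hand - the id below is a made-up example:
curl -sf https://dev.to/api/articles/999999999 > /dev/null
echo $?   # non-zero for a missing/unpublished article, 0 for an existing one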
If you want to run this snippet right now, I recommend changing 100
to 10
for test purposes - otherwise it may take significant time.
Collecting timestamp data (hour)
So we have the JSON for every article in the atext
variable. Let's extract created_at
, or more precisely only the hour part of this field. Instead of grep
let's try sed
- another cool default command - just for practice. It works like this; let's pipe curl output to it:
curl https://dev.to/api/articles/246755 | sed -r 's/.*created_at\"\:\".{11}(..).*/\1/'
Here sed
just does a substitution by regexp. The regexp is built so that it matches the whole document (thanks to .*
at the start and end) with the created_at\"\:\".{11}(..)
fragment inside. In this fragment we skip the quote, colon, another quote, then 11 characters (like 2020-01-02T
) and capture two characters with the parentheses. We use the value of the first (and only) captured group, referenced as \1
, to replace the whole string. So the output is something like 20
- i.e. the hour part.
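To try the substitution in isolation, you can feed sed a literal fragment taken from the sample object above:
echo '"created_at":"2019-06-13T14:31:50Z"' | sed -r 's/.*created_at\"\:\".{11}(..).*/\1/'
# prints 14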
We add such a line to our script (inside the if
branch) in the form:
hour=$(echo "$atext" | sed -r 's/.*created_at\"\:\".{11}(..).*/\1/')
A kind of histogram
Now we want to collect the results into an array. Let's initialize an array with 24 zeroes - and we'll increment the element corresponding to the given hour for every article found:
result=($(for i in {0..23}; do echo 0; done)) # creates 24 zeroes
# ... and below in the loop
result[$hour]=$(( ${result[$hour]} + 1 ))
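One caveat worth knowing: in arithmetic contexts bash treats numbers with a leading zero as octal, so hours like 08 or 09 would break the increment. The full script below strips the leading zero with sed first; an alternative (not used here) is to force base 10 with the 10# prefix:
hour=08
echo $(( 10#$hour + 1 ))   # 9; plain $(( hour + 1 )) fails with "value too great for base"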
I will leave it to you to figure out how to pretty-print the histogram, like the one shown at the beginning of the article (a rough sketch is given after the script). The simplified code, as a shell file, could look like the one below. It simply prints the counters for every hour:
#!/usr/bin/env bash

# find the max (newest) article id
arts=$(curl https://dev.to/api/articles)
total=$(echo "$arts" | grep -oP '(?<=\"id\"\:)\d+' | head -n 1)

# 24 counters, one per hour
result=($(for i in {0..23}; do echo 0; done))

i=0
while [[ $i -lt 100 ]] ; do
    random_id=$(($total - $RANDOM))
    atext=$(curl -sf https://dev.to/api/articles/$random_id)
    if [[ $? -eq 0 ]] ; then
        i=$(($i + 1))
        hour=$(echo "$atext" | sed -r 's/.*created_at\"\:\".{11}(..).*/\1/')
        hour=$(echo $hour | sed 's/^0//') # remove leading zero (avoids octal issue)
        (( result[hour]++ ))
        echo "$random_id - ok, $hour ($i)"
    else
        echo "$random_id - FAIL"
    fi
done

for i in {0..23} ; do
    echo "$i: ${result[i]}"
done
Conclusion
This is rather introductory material. You may immediately see several things to improve:
- we should check that the randomly chosen ids have no repeats
- requests take a few seconds each and often fail, so it would be better to learn how to run them in parallel (see the sketch after this list)
- the most active hours are probably not determined just by when articles were created - it may be better to consider only those articles which have enough comments and likes
- we can also use authorization for the API (I don't know yet whether this affects speed).
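A minimal sketch of the parallel idea, assuming total is already set as above; xargs -P runs several curl processes at once, and the art_<id>.json file names are just my own choice for illustration:
# generate 100 random ids, then fetch them eight at a time in parallel
for i in {1..100}; do echo $(( total - RANDOM )); done \
  | xargs -P 8 -I{} curl -sf -o 'art_{}.json' https://dev.to/api/articles/{}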
Thanks for reading this far, and excuse me for the bash - further experiments with the API I'll probably publish using PHP/Python!
Top comments
Wow. I'm motivated to build an app that gets random posts from DEV.
I see in docs.dev.to/api/#operation/getArti... that the JSON only returns 30 posts. Can I get more than 30?
Hi, I think yes, though it is not documented. The source code contains a per_page
parameter which governs the number of articles in the response. For example:
dev.to/api/articles/?per_page=3
However, it is probably not a good idea to load very large responses (I don't know if there is a limit). Rather, use the page
parameter to send several requests for different pages. E.g.:
dev.to/api/articles/?page=333
Then how do I get 10 posts from DEV randomly? 😂
Oh, I've got an idea:
Yep, good idea! Exactly what is described in the article above :)
😂 thanks
As I couldn't use the grep command from the article (the -P
flag doesn't work in macOS), I replaced it with jq
, the utility I use to pretty-print JSON - or even better, with a jq filter, since we don't need the entire page. Another command from the article doesn't work in macOS either, so I replaced that with jq as well.
jq is truly awesome :D stedolan.github.io/jq/manual/