Hey there dev.to community!
I recently needed a simple script to pick out Googlebot's IPs from a list of IP addresses, so that I could extract actual Googlebot visits from an access log.
Thankfully, Google documents a method to verify that a visitor is actually Googlebot: a reverse DNS lookup on the IP, followed by a forward DNS lookup on the resulting hostname.
If you have a similar need, here you go:
#!/bin/bash
#
# Performs reverse and forward DNS lookups to list Googlebot's IPs, given a list
# of IP addresses as a file. Useful for filtering access logs to find out actual
# Googlebot visits.
#
# An implementation of https://support.google.com/webmasters/answer/80553?hl=en
while IFS='' read -r IP_ADDRESS || [[ -n "$IP_ADDRESS" ]];
do
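  IS_GOOGLEBOT=0
  # Reverse lookup: the hostname must end in .google.com. or .googlebot.com.
  # The dots are escaped so that look-alike names are not accepted.
  REVERSE_LOOKUP="$(host "$IP_ADDRESS")"
  echo "$REVERSE_LOOKUP" | grep -qE '\.(google|googlebot)\.com\.$' && IS_GOOGLEBOT=1
  if [[ $IS_GOOGLEBOT -eq 1 ]]; then
    # Forward lookup: the hostname must resolve back to the original IP.
    FORWARD_LOOKUP="$(host "$(echo "$REVERSE_LOOKUP" | cut -d " " -f 5)" | cut -d " " -f 4)"
    if [[ "$FORWARD_LOOKUP" = "$IP_ADDRESS" ]];
    then
      echo "$IP_ADDRESS"
    fi
  fi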
done < "$1"
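The reverse-lookup step boils down to a suffix check on the hostname that the PTR record returns. Here is a minimal sketch of that check as a standalone function, in case you want to try it without doing any DNS lookups. The name `is_google_hostname` is just something I made up for illustration, and it only covers the two domains the script checks. It requires a dot before the domain, so a look-alike such as `evilgoogle.com.` is rejected.

```bash
#!/bin/bash
# Hypothetical helper (not part of the script above): succeeds when the given
# hostname sits under google.com or googlebot.com.
is_google_hostname() {
  # "${1%.}" strips the trailing dot that `host` prints after the hostname.
  case "${1%.}" in
    *.google.com|*.googlebot.com) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_google_hostname "crawl-66-249-66-1.googlebot.com." && echo verified` prints `verified`, while a hostname like `googlebot.com.example.net.` fails the check.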
You may save it as something like filter-googlebot-ips.sh and provide a file with a list of IP addresses to filter (one per line) as an argument, like so:
$ ./filter-googlebot-ips.sh access-log-ips.txt > googlebot-ips.txt
This will perform reverse and forward DNS lookups for each of the IP addresses and print the verified Googlebot IPs to STDOUT, which you can redirect to a file as in the example above.
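If you are starting from a raw access log rather than a ready-made list of IPs, something like the following may help with the steps before and after the script. This is only a sketch: it assumes the client IP is the first space-separated field of each log line (as in the common and combined log formats), and the function names `extract_unique_ips` and `filter_log_by_ips` are hypothetical.

```bash
#!/bin/bash
# Hypothetical helpers; both assume the client IP is the first field of each log line.

# Print each distinct client IP found in the given access log, one per line.
extract_unique_ips() {
  awk '{ print $1 }' "$1" | sort -u
}

# Print only the log lines whose client IP appears in the given IP list.
# Arguments: <ip-list-file> <access-log-file>
filter_log_by_ips() {
  # While reading the first file, remember each IP; for the second file,
  # print a line only when its first field is one of the remembered IPs.
  awk 'NR == FNR { ips[$1]; next } $1 in ips' "$1" "$2"
}
```

With these in place, `extract_unique_ips access.log > access-log-ips.txt` would produce the input file for the script, and `filter_log_by_ips googlebot-ips.txt access.log` would print the log lines that belong to verified Googlebot visits. The file name `access.log` is a placeholder for wherever your log lives.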
Hope it helps someone out there! 🙌
PS: Here is a GitHub Gist if you prefer that.