Problem description
Given a file of URLs, how can we filter out any that give us HTTP 404 Not Found and write the URLs that do exist to a new file?
TL;DR
cat urls.txt \
| xargs -I{} sh -c 'curl -sIL {} -w "%{http_code}" -o /dev/null \
| grep -q -v 404 && echo {}' > ok_urls.txt
Explanation
First, we pipe the list of URLs to xargs using cat:
cat urls.txt | xargs ...
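For reference, urls.txt is just a plain list with one URL per line. The addresses below are only placeholders, not real endpoints:

https://example.com/exists
https://example.com/does-not-exist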
Using xargs we read the input from cat and execute a shell command for each line. The -I{} option tells xargs that we want to replace the string {} with the input (in this case a URL). Since we also need to output the URL we got as input, we will actually use it twice: first when checking the URL, and again when printing it if it turns out to be valid.
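To see the substitution in isolation, here is a minimal sketch with dummy input instead of URLs:

printf 'one\ntwo\n' | xargs -I{} echo "got: {}"

This prints got: one followed by got: two, one echo invocation per input line.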
To run multiple commands for each line, we tell xargs to start a shell (sh) and pass the commands to run as a string via sh's -c flag.
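As a small sketch of the same idea, again with dummy input, this runs two commands per line:

printf 'a\nb\n' | xargs -I{} sh -c 'echo "first: {}" && echo "second: {}"'

Note that {} is substituted into the command string by xargs before the inner shell ever sees it.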
In the next part of the script, we first use curl to access the URL, telling it to be silent and not give us output we don't need with -s, to only fetch the headers with -I, and to follow any redirects with -L. To get the status code only, we use -w "%{http_code}"; this flag can be used to tailor the output from curl. -o /dev/null sends any other output somewhere it can be discarded.
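You can try the curl part on its own; example.com is used here purely as a placeholder:

curl -sIL https://example.com -w "%{http_code}" -o /dev/null

This prints just the final status code, e.g. 200, while the headers themselves go to /dev/null.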
To filter out 404 Not Found we can use grep -v, which matches lines that do not contain 404, and -q, which makes grep quiet. This way we test only the exit status of grep.
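Here is a quick sketch of how that exit status behaves, with hard-coded status codes standing in for curl's output:

echo "200" | grep -q -v 404 && echo "kept"
echo "404" | grep -q -v 404 || echo "filtered out"

grep exits with 0 when at least one line survives the -v filter, and non-zero otherwise.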
If you combine two commands with &&, the second one will only run if the first one was successful, so by putting && echo {} after grep, the URL will only be printed if grep succeeded. Remember that {} is replaced with the URL by xargs!
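The short-circuit behaviour of && can be seen on its own using the true and false builtins:

true && echo "this prints"
false && echo "this never prints"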
Finally, we redirect the list of URLs that passed the check to a new file, ok_urls.txt, and we're done!
Happy hacking,
Vetle