Tib

Check links programmatically (with Perl)

Links rot fast... and your online README.md, link directories, blog posts or whatever quickly end up pointing to dead resources 😒

Like my awesome-like Perl README.md 🚀, which contains hundreds of links (go check it out, it is cool! 😎).

My solution is to check periodically that the links are still up!

Basic version

For this very first version, I will take links from a file or from a pipe (|), and I will use LWP::Simple.

#!/usr/bin/env perl

use strict;
use warnings;
use LWP::Simple;

$| = 1; # Autoflush STDOUT so "Checking..." appears before the request completes

while(<>) {
    chomp; # Remove the trailing newline
    my $link = $_;
    print "Checking [$link]...";
    my $content = get($link);
    if(! defined $content) {
        print " BROKEN !\n";
    } else {
        print " OK\n";
    }
}

I use it with a list of links, for instance in a links.txt file:

http://cpantesters.org
https://img.shields.io/badge/Language-Perl-blue
https://www.perltutorial.org/
http://cpancover.com

And I run it like this:

$ cat links.txt | perl checklinks.pl
# OR
$ perl checklinks.pl links.txt

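This is the magic of the diamond operator <>: it reads from the files named on the command line, or from standard input when there are none.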
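As a small sketch of what this makes possible, the reading part can be moved into a helper that accepts any filehandle. The name read_links and the choice to skip blank lines and # comments are my assumptions, not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: read links from any filehandle, one per line.
# Skipping blank lines and '#' comments is an assumption for convenience.
sub read_links {
    my ($fh) = @_;
    my @links;
    while (my $line = <$fh>) {
        chomp $line;                 # remove the trailing newline
        $line =~ s/^\s+|\s+$//g;     # trim surrounding whitespace
        next if $line eq '' or $line =~ /^#/;
        push @links, $line;
    }
    return @links;
}
```

Calling read_links(\*ARGV) should read from the same source as <>, since <> is a synonym for <ARGV>; any other opened filehandle works too.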

It produces an output like the following:

Checking [http://cpantesters.org]... OK
Checking [https://img.shields.io/badge/Language-Perl-blue]... BROKEN !
Checking [https://www.perltutorial.org/]... OK
Checking [http://cpancover.com]... BROKEN !

What? We have some broken links?

But the shields.io badge and cpancover.com are actually not down...

And the LWP::Simple documentation clearly states that:

"If you need more control or access to the header fields in
the requests sent and responses received, then you should use
the full object-oriented interface provided by the
LWP::UserAgent module."

Then... go for LWP::UserAgent!

LWP::UserAgent

Then, here is my new version based on LWP::UserAgent:

#!/usr/bin/env perl

use LWP::UserAgent ();
my $ua = LWP::UserAgent->new(timeout => 10);
$| = 1;

while(<>) {
    chomp;
    my $link = $_;
    print "Checking [$link]...";
    my $res = $ua->get($link);
    if(! $res->is_success) {
        print " BROKEN !\n";
    } else {
        print " OK\n";
    }
}

I then run it like this:

echo "https://img.shields.io/badge/Language-Perl-blue" | perl checklinks.pl

The shields.io badge is still up for humans but broken for LWP 😒:

Checking [https://img.shields.io/badge/Language-Perl-blue]... BROKEN !

I need to see the status code...

403 Forbidden!

To find out the status code, I can print $res->status_line:

print " BROKEN ! --> " . $res->status_line . "\n";

And the conclusion is terrible 😀:

BROKEN ! --> 403 Forbidden
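To make the reporting step concrete, here is a minimal sketch that builds the output line from a response object. The helper name report_line is hypothetical, and the response is built by hand with HTTP::Response so the example runs without any network access.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: turn a link and its HTTP::Response into one report line.
sub report_line {
    my ($link, $res) = @_;
    return "Checking [$link]... OK" if $res->is_success;
    return "Checking [$link]... BROKEN ! --> " . $res->status_line;
}

# Offline demonstration with a hand-built response (no request is sent).
my $fake = HTTP::Response->new(403, 'Forbidden');
print report_line('https://img.shields.io/badge/Language-Perl-blue', $fake), "\n";
```

In the real script, the second argument would simply be the result of $ua->get($link).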

A 403 Forbidden means that the server is working fine but refuses to serve us, because it detected something it does not like.

Maybe the user agent? 😇 (By default, LWP::UserAgent identifies itself as libwww-perl/<version>, which some servers reject.)

Adding $ua->agent('Mozilla/5.0'); as in the LWP::UserAgent CPAN doc did fix the problem:

Checking [https://img.shields.io/badge/Language-Perl-blue]... OK
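For reference, a short sketch of where the agent string lives. It prints the agent LWP uses out of the box and then overrides it as described above; nothing here touches the network.

```perl
use strict;
use warnings;
use LWP::UserAgent ();

my $ua = LWP::UserAgent->new(timeout => 10);

# Out of the box, LWP announces itself as "libwww-perl/<version>".
my $default_agent = $ua->agent;
print "Default agent: $default_agent\n";

# Override it with the value used in this post.
$ua->agent('Mozilla/5.0');
print "New agent: ", $ua->agent, "\n";
```

The same value can also be passed at construction time with LWP::UserAgent->new(agent => 'Mozilla/5.0').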

We fixed one problem, but we still have some others.

406 Not Acceptable

The Perl Tutorial website was OK with the LWP::Simple version but is now BROKEN with a strange status:

Checking [https://www.perltutorial.org/]... BROKEN ! --> 406 Not Acceptable

"Not Acceptable" is supposed to mean a mismatch between what the client accepts (the "Accept" headers) and what the server can provide.

In my Firefox browser I have this:
[Screenshot: Accept headers in Firefox]

I can try to emulate and change them with push_header like this:

$ua->default_headers->push_header('Accept' => "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
$ua->default_headers->push_header('Accept-Encoding' => "gzip, deflate, br");
$ua->default_headers->push_header('Accept-Language' => "en-US,en;q=0.5");

But that is not the problem here. The problem is that I use a bad agent name ("Mozilla/5.0", taken as is from the LWP::UserAgent doc).

My feeling is that "Mozilla/5.0" is better than a missing or default agent name, but it is too bare and looks too much like "a bot with a name" 😀

This change fixes the problem:

$ua->agent('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/82.0');

500 Can't connect to ... (certificate verify failed)

One more problem is related to certificate verification.

If you visit builtinperl.com you will get the usual certificate warning:

[Screenshot: browser certificate warning]

In Firefox we can force the visit, but with my script:

$ echo "http://builtinperl.com" | perl checklinks.pl

It fails hard:

Checking [http://builtinperl.com]... BROKEN ! --> 500 Can't connect to builtinperl.com:443 (certificate verify failed)

But once again, you can tweak LWP::UserAgent to fix this:

use IO::Socket::SSL qw( SSL_VERIFY_NONE );
$ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0;

# And later
# ...
$ua->ssl_opts(SSL_verify_mode => SSL_VERIFY_NONE);

verify_mode => 0 was supposed to do the same as $ENV{PERL_LWP_SSL_VERIFY_HOSTNAME} = 0; but it was not working. If someone knows why, please comment 😁
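One possible lead, offered as an assumption rather than a confirmed diagnosis: the LWP::UserAgent documentation describes an ssl_opts key named verify_hostname, which it says is initialized from PERL_LWP_SSL_VERIFY_HOSTNAME. The sketch below sets that key together with SSL_verify_mode at construction time, without touching the environment.

```perl
use strict;
use warnings;
use LWP::UserAgent ();
use IO::Socket::SSL qw( SSL_VERIFY_NONE );

# Sketch: disable certificate checks through constructor options only.
# verify_hostname is the LWP-level switch; SSL_verify_mode is handed
# over to IO::Socket::SSL. Acceptable for a link checker, not for code
# that sends sensitive data.
my $ua = LWP::UserAgent->new(
    timeout  => 10,
    ssl_opts => {
        verify_hostname => 0,
        SSL_verify_mode => SSL_VERIFY_NONE,
    },
);

# Read the options back to confirm what the user agent will use.
printf "verify_hostname=%s SSL_verify_mode=%s\n",
    $ua->ssl_opts('verify_hostname'), $ua->ssl_opts('SSL_verify_mode');
```

I have not verified that this behaves identically to the environment variable on every LWP version, so treat it as something to try.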

500 read timeout

Shit happens, even to the best of us (CPANTesters).

You can increase the timeout.
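A minimal sketch of that, assuming 30 seconds is an acceptable wait (the value is arbitrary). It also includes a hypothetical helper, is_internal_response, that relies on the Client-Warning header LWP adds to responses it generates itself (timeouts, connection failures), which helps tell "the server said 500" from "we never got an answer".

```perl
use strict;
use warnings;
use LWP::UserAgent ();
use HTTP::Response;

my $ua = LWP::UserAgent->new(timeout => 10);
$ua->timeout(30);    # seconds; 30 is an arbitrary example value

# Hypothetical helper: true when LWP built the response itself
# (timeout, connection failure, ...) instead of receiving it from the server.
sub is_internal_response {
    my ($res) = @_;
    return ( $res->header('Client-Warning') // '' ) eq 'Internal response';
}
```

In the checking loop, a link whose response is internal could be reported as "unreachable for now" and retried later instead of being flagged as dead.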

405 Method Not Allowed

I usually use the HEAD method, since I don't care about the content but only about the page status. But some links (e.g. CGI) won't answer HEAD requests!

For instance qntm.org answers "405 Method Not Allowed" for HEAD requests, and it's annoying:

$ curl --head https://qntm.org/files/perl/perl.html
HTTP/1.1 405 Method Not Allowed
Date: Mon, 08 Feb 2021 09:27:56 GMT
Server: Apache/2.4.38 (Debian)
Vary: User-Agent
Content-Type: text/html; charset=UTF-8
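One way to cope with such servers, sketched here as an assumption rather than what my script does, is to try HEAD first and fall back to GET when the method is refused. The helper name check_link is hypothetical; treating 501 like 405 is also my choice.

```perl
use strict;
use warnings;
use LWP::UserAgent ();

# Hypothetical helper: try HEAD first, fall back to GET when the server
# rejects the method (405) or does not implement it (501).
sub check_link {
    my ($ua, $link) = @_;
    my $res = $ua->head($link);
    if ($res->code == 405 || $res->code == 501) {
        $res = $ua->get($link);
    }
    return $res;
}
```

It is used like $ua->get in the loop: my $res = check_link($ua, $link); followed by the usual $res->is_success test.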

Pimp my output

I just added some salt to my script to make it nicer.

Unicode characters:

use open ':std', ':encoding(UTF-8)';

See this StackOverflow thread to understand why this line is needed.

And colors in terminal:

use Term::ANSIColor;

And later:

print color('red') . " \x{2717}" . color('reset') . " --> " . $res->status_line . "\n";
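Putting the two together, here is a sketch of a small formatting helper. The name pretty_status and the green check mark (U+2713) for success are my additions; the red cross (U+2717) and the status line follow the print statement above.

```perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use Term::ANSIColor qw(colored);

# Hypothetical helper: green check mark for success, red cross plus the
# status line for failure.
sub pretty_status {
    my ($ok, $status_line) = @_;
    return colored(" \x{2713}", 'green') if $ok;
    return colored(" \x{2717}", 'red') . " --> $status_line";
}

# Offline demonstration with made-up results.
print "Checking [http://cpantesters.org]...", pretty_status(1), "\n";
print "Checking [http://builtinperl.com]...", pretty_status(0, '500 read timeout'), "\n";
```

colored() appends the reset sequence itself, so there is no need to call color('reset') by hand.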

It does not change the output much, but it makes it clearer and nicer 😃

Conclusion

There is more to say here 😃

For example, Mojolicious provides a very good framework for the same kind of task (it could be perceived as a "more modern" approach).

And also, try to be kind to websites when possible (use the HEAD verb, announce yourself as a bot when possible, do not crawl too often...).
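A rough sketch of what "being kind" could look like with LWP::UserAgent. The bot name, URL, contact address and the helper check_politely are all hypothetical placeholders; as seen earlier, some servers will refuse an honest bot name, so this is a trade-off.

```perl
use strict;
use warnings;
use LWP::UserAgent ();

# Hypothetical bot name and contact address: replace with your own.
my $ua = LWP::UserAgent->new(
    agent   => 'checklinks/0.1 (+https://example.com/checklinks)',
    from    => 'me@example.com',
    timeout => 10,
);

# Hypothetical helper: check links with HEAD and pause between requests.
sub check_politely {
    my ($ua, $delay, @links) = @_;
    my %status;
    for my $link (@links) {
        $status{$link} = $ua->head($link)->status_line;
        sleep $delay if $delay;    # do not hammer the servers
    }
    return %status;
}
```

A call such as check_politely($ua, 2, @links) would wait two seconds after each request; LWP::RobotUA is another option if you also want robots.txt to be honored.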

EDIT1: This blog post has a sequel, see "Check markdown links with github action"

EDIT2: This blog post has another sequel, see "Check links with HTTP::Simple"

Top comments (3)

Dan (grinnz):
Some of my modules you may find interesting, written because the existing ones are strange/suboptimal: HTTP::Simple, open::layers

Tib (thibaultduponchelle):
I wrote a sequel, experimenting with a version based on HTTP::Simple, see "Check links with HTTP::Simple"

Tib (thibaultduponchelle):
Wow, thank you a lot! I need to test these modules ASAP :)