What thoughts come to mind when you come across 404/Page Not Found/Dead Hyperlinks on a website? Aargh! You would find it annoying when you come across broken hyperlinks, which is the sole reason why you should continuously focus on removing the existence of broken links in your web product (or website). Instead of a manual inspection, you can leverage automation for broken link testing using Selenium WebDriver.
When a particular link is broken and a visitor lands on the page, it affects that page’s functionality and results in a poor user experience. Dead links could hurt your product’s credibility, as it ‘might’ give an impression to your visitors that there is a minimal focus on the experience.
If your web product has many pages (or links) that result in a 404 error (or page not found), the product rankings on search engines (e.g., Google) will also be badly affected. Removal of dead links is one of the integral parts of SEO (Search Engine Optimization) activity.
In this part of the Selenium WebDriver tutorial series, we deep dive into finding broken links using Selenium WebDriver. We have demonstrated broken link testing using Selenium Python, Selenium Java, Selenium C#, and Selenium PHP.
Introduction to Broken Links in Web Testing
In simple terms, broken links (or dead links) in a website (or web app) are links that are not reachable and do not work as anticipated. The links could be temporarily down due to server issues or wrongly configured at the back end.
Apart from pages that result in 404 error, other prominent examples of broken links are malformed URLs, links to content (e.g., documents, pdf, images, etc.) that have been moved or deleted.
Prominent Reasons for Broken Links
Here are some of the common reasons behind the occurrence of broken links (dead links or link rots):
- Incorrect or misspelled URL entered by the user.
- Structural changes in the website (i.e., permalinks) with URL redirects or internal redirects are not properly configured.
- Links to content like videos, documents, etc. that are either moved or deleted. If the content is moved, the ‘internal links’ should be redirected to the designated links.
- Temporary website downtime due to site maintenance making the website temporarily inaccessible.
- Broken HTML tags, JavaScript errors, incorrect HTML/CSS customizations, broken embedded elements, etc., within the page leading, can lead to broken links.
- Geolocation restrictions prevent access to the website from certain IP addresses (if they are blacklisted) or specific countries in the world. Geolocation testing with Selenium helps ensure that the experience is tailor-made for the location (or country) from where the site is accessed.
Why should you check Broken Links?
Broken links are a big turn-off for the visitors who land on your website. Here are some of the major reasons why you should check for broken links on your website:
- Broken Links can hurt the user experience.
- Removal of broken (or dead) links is essential for SEO (Search Engine Optimization), as it can affect the site’s rankings on search engines (e.g., Google).
Broken links testing can be done using Selenium WebDriver on a web page, which in turn can be used to remove the site’s dead links.
Broken Links and HTTP Status Codes
When a user visits a website, a request is sent by the browser to the site’s server. The server responds to the browser’s request with a three-digit code called the ‘ HTTP Status Code.’
An HTTP Status Code is the server’s response to a request sent from the web browser. These HTTP Status Codes are considered equivalent to the conversation between the browser (from which URL request is sent) and the server.
Though different HTTP Status Codes are used for different purposes, most of the codes are useful for diagnosing issues in the site, minimizing site downtime, the number of dead links, and more. The first digit of every three-digit status code begins with numbers 1~5. The status codes are represented as 1xx, 2xx.., 5xx for indicating the status codes in that particular range. As each of these ranges consists of a different class of server response, we would limit the discussion to HTTP Status Codes presented for broken links.
Here are the common status code classes that are useful in detecting broken links with Selenium:
Classes of HTTP Status Code | Description |
---|---|
1xx | The Server is still thinking through the request. |
2xx | The request sent by the browser was successfully completed and expected response was sent to the browser by the server. |
3xx | This indicates that a redirect is being performed. For example, 301 redirect is popularly used for implementing permanent redirects on a website. |
4xx | This indicates that either a particular page (or complete site) is not reachable. |
5xx | This indicates that the server was unable to complete the request, even though a valid request was sent by the browser. |
HTTP Status Codes presented on detection of Broken Links
Here are some of the common HTTP Status Codes presented by the web server on encountering a broken link:
How to Find Broken Links Using Selenium WebDriver?
Irrespective of the language used with Selenium WebDriver, the guiding principles for broken link testing using Selenium remains the same. Here are the steps for broken links testing using Selenium WebDriver:
- Use the < a > tag to collect details of all the links present on the webpage.
- Send an HTTP request for every link.
- Verify the corresponding response code received in response to the request sent in the previous step.
- Validate whether the link is broken or not based on the response code sent by the server.
- Repeat steps (2-4) for every link present on the page.
In this Selenium WebDriver tutorial, we would demonstrate how to perform broken link testing using Selenium WebDriver in Python, Java, C#, and PHP. The tests are conducted on (Chrome 85.0 + Windows 10) combination, and the execution is carried out on the cloud-based Selenium Grid provided by LambdaTest.
To get started with LambdaTest, create an account on the platform and note the user-name & access-key available from the profile section on LambdaTest. The browser capabilities are generated using LambdaTest Capabilities Generator.
Here is the test scenario used for finding broken links on a website using Selenium:
Test Scenario
- Go to LambdaTest Blog i.e. https://www.lambdatest.com/blog on Chrome 85.0
- Collect all the links present on the page
- Send HTTP request for each link
- Print whether the link is broken or not on the terminal
It is important to note that the time spent in broken links testing using Selenium depends on the number of links present on the ‘web page under test.’ The more the number of links on the page, the more time will be spent finding broken links. For example, LambdaTest has a huge number of links (~150+); hence, the process of finding broken links might take some time (approx a few minutes).
Broken Link Testing Using Selenium Java
Implementation
package org.selenium4;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
/* import org.openqa.selenium.chrome.ChromeDriver; */
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CrossBrowserTest
{
/* protected static ChromeDriver driver; */
WebDriver driver = null;
String URL = "https://www.lambdatest.com/blog";
public static String status = "passed";
String username = "user-name";
String access_key = "access-key";
String url = "";
HttpURLConnection urlconnection = null;
int responseCode = 200;
@BeforeClass
public void testSetUp() throws MalformedURLException {
/*
WebDriverManager.chromedriver().setup();
driver = new ChromeDriver();
*/
DesiredCapabilities capabilities = new DesiredCapabilities();
capabilities.setCapability("build", "[Java] Finding broken links on a webpage using Selenium");
capabilities.setCapability("name", "[Java] Finding broken links on a webpage using Selenium");
capabilities.setCapability("platform", "Windows 10");
capabilities.setCapability("browserName", "Chrome");
capabilities.setCapability("version","85.0");
capabilities.setCapability("tunnel",false);
capabilities.setCapability("network",true);
capabilities.setCapability("console",true);
capabilities.setCapability("visual",true);
driver = new RemoteWebDriver(new URL("http://" + username + ":" + access_key + "@hub.lambdatest.com/wd/hub"), capabilities);
System.out.println("Started session");
}
@Test
public void test_Selenium_Broken_Links() throws InterruptedException {
driver.navigate().to(URL);
driver.manage().window().maximize();
List<WebElement> links = driver.findElements(By.tagName("a"));
Iterator<WebElement> link = links.iterator();
/* For skipping email address */
String mail_to = "mailto";
String tel ="tel";
String LinkedInPage = "https://www.linkedin.com";
int valid_links = 0;
int broken_links = 0;
Boolean bLinkedIn = false;
int LinkedInStatus = 999;
Pattern pattern = Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}");
Matcher mat;
while (link.hasNext())
{
url = link.next().getAttribute("href");
System.out.println(url);
bLinkedIn = false;
if ((url == null) || (url.isEmpty()))
{
System.out.println("URL is either not configured for anchor tag or it is empty");
continue;
}
/* String str="mailto:support@LambdaTest.com"; */
if ((url.startsWith(mail_to)) || (url.startsWith(tel)))
{
System.out.println("Email address or Telephone detected");
continue;
}
if(url.startsWith(LinkedInPage))
{
System.out.println("URL starts with LinkedIn, expected status code is 999");
bLinkedIn = true;
}
try {
urlconnection = (HttpURLConnection) (new URL(url).openConnection());
urlconnection.setRequestMethod("HEAD");
urlconnection.connect();
responseCode = urlconnection.getResponseCode();
if (responseCode >= 400)
{
/* https://stackoverflow.com/questions/27231113/999-error-code-on-head-request-to-linkedin */
if ((bLinkedIn == true) && (responseCode == LinkedInStatus))
{
System.out.println(url + " is a LinkedIn Page and is not a broken link");
valid_links++;
}
else
{
System.out.println(url + " is a broken link");
broken_links++;
}
}
else
{
System.out.println(url + " is a valid link");
valid_links++;
}
} catch (MalformedURLException e) {
System.out.println(e.getMessage());
e.printStackTrace();
} catch (IOException e) {
System.out.println(e.getMessage());
e.printStackTrace();
}
}
System.out.println("Detection of broken links completed with " + broken_links + " broken links and " + valid_links + " valid links\n");
}
@AfterClass
public void tearDown()
{
if (driver != null) {
((JavascriptExecutor) driver).executeScript("lambda-status=" + status);
driver.quit();
}
}
}
Code WalkThrough
1. Import the required packages
The methods in the HttpURLConnection package are used for sending HTTP requests and capturing the HTTP Status Code (or response).
The methods in the regex.Pattern package check if the corresponding link contains an email address or telephone number using a specialized syntax held in a pattern.
import java.net.HttpURLConnection;
import java.util.regex.Pattern;
2. Collect the links present on the page
The links present on the URL under test (i.e., LambdaTest Blog) are located using tagname in Selenium. The tag name used for identification of the element (or link) is ‘a’.
The links are placed in a list to iterate through the list to check broken links on the page.
List<WebElement> links = driver.findElements(By.tagName("a"));
3. Iterate through the URLs
The Iterator object is used for looping through the list created in Step (2)
Iterator<WebElement> link = links.iterator();
4. Identify and Verify the URLs
A while loop is executed till the time Iterator (i.e., link) does not have more elements to iterate. The ‘href’ of the anchor tag is retrieved, and the same is stored in the URL variable.
while (link.hasNext())
{
url = link.next().getAttribute("href");
Skip checking the links if:
a. The link is null or empty
if ((url == null) || (url.isEmpty()))
{
System.out.println("URL is either not configured for anchor tag or it is empty");
continue;
}
b. The link contains mailto or telephone number
if ((url.startsWith(mail_to)) || (url.startsWith(tel)))
{
System.out.println("Email address or Telephone detected");
continue;
}
When checking for the LinkedIn page, the HTTP status code is 999. A Boolean variable (i.e., LinkedIn) is set to true to indicate that it is not a broken link.
if(url.startsWith(LinkedInPage))
{
System.out.println("URL starts with LinkedIn, expected status code is 999");
bLinkedIn = true;
}
5. Validate the links through the Status Code
The methods in HttpURLConnection class provide the provision for sending HTTP requests and capturing the HTTP Status Code.
The openConnection method of the URL class opens the connection to the specified URL. It returns a URLConnection instance representing a connection to the remote object that is referred by the URL. It is type-casted to HttpURLConnection.
HttpURLConnection urlconnection = null;
..............................................
..............................................
..............................................
urlconnection = (HttpURLConnection) (new URL(url).openConnection());
urlconnection.setRequestMethod("HEAD");
The setRequestMethod in HttpURLConnection class sets the method for URL request. The request type is set to HEAD so that only Headers are returned. On the other hand, request type GET would have returned the document body, which is not required in this particular test scenario.
The connect method in HttpURLConnection class establishes the connection to the URL and sends an HTTP request.
urlconnection.connect();
The getResponseCode method returns the HTTP Status Code for the previously sent request.
responseCode = urlconnection.getResponseCode();
For HTTP Status Code is 400 (or more), the variable containing broken links count (i.e., broken_links) is incremented; else, the variable containing valid links (i.e., valid_links) is incremented.
if (responseCode >= 400)
{
if ((bLinkedIn == true) && (responseCode == LinkedInStatus))
{
System.out.println(url + " is a LinkedIn Page and is not a broken link");
valid_links++;
}
else
{
System.out.println(url + " is a broken link");
broken_links++;
}
}
else
{
System.out.println(url + " is a valid link");
valid_links++;
}
Execution
For broken links testing using Selenium Java, we created a project in IntelliJ IDEA. The basic pom.xml file was sufficient for the job!
Here is the execution snapshot, which indicates 169 valid links and 0 broken links on the LambdaTest Blog Page.
The links containing the email addresses and phone numbers were excluded from the search list, as shown below.
You can see the test being run in the below screenshot and getting completed in 2 min 35 seconds, as shown on LambdaTest’s automation logs.
Broken Link Testing Using Selenium Python
Implementation
import requests
import urllib3
import pytest
from requests.exceptions import MissingSchema, InvalidSchema, InvalidURL
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
capabilities = {
"build" : "[Python] Finding broken links on a webpage using Selenium",
"name" : "[Python] Finding broken links on a webpage using Selenium",
"platform" : "Windows 10",
"browserName" : "Chrome",
"version" : "85.0"
}
user_name = "user-name"
app_key = "access-key"
broken_links = 0
valid_links = 0
# options = webdriver.ChromeOptions()
# options.add_argument("start-maximized")
# options.add_argument('disable-infobars')
# driver=webdriver.Chrome(options=options)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
remote_url = "http://" + user_name + ":" + app_key + "@hub.lambdatest.com/wd/hub"
driver = webdriver.Remote(command_executor = remote_url, desired_capabilities = capabilities)
driver.maximize_window()
driver.get('https://www.lambdatest.com/blog/')
# links = driver.find_elements_by_css_selector("a")
links = driver.find_elements(By.CSS_SELECTOR, "a")
for link in links:
try:
request = requests.head(link.get_attribute('href'), data ={'key':'value'})
print("Status of " + link.get_attribute('href') + " is " + str(request.status_code))
if (request.status_code == 404):
broken_links = (broken_links + 1)
else:
valid_links = (valid_links + 1)
except requests.exceptions.MissingSchema:
print("Encountered MissingSchema Exception")
except requests.exceptions.InvalidSchema:
print("Encountered InvalidSchema Exception")
except:
print("Encountered Some other execption")
print("Detection of broken links completed with " + str(broken_links) + " broken links and " + str(valid_links) + " valid links")
Code WalkThrough
1. Import Modules
Apart from importing the Python modules for Selenium WebDriver, we also import the requests module. The requests module lets you send all kinds of HTTP requests. It can also be used for passing parameters in URL, sending custom headers, and more.
import requests
import urllib3
from requests.exceptions import MissingSchema, InvalidSchema, InvalidURL
2. Collect the links present on the page
The links present on the URL under test (i.e., LambdaTest Blog) are found by locating the web elements by the CSS Selector “a” property.
links = driver.find_elements(By.CSS_SELECTOR, "a")
Since we want the element to be iterable, we use the find_elements method (and not the find_element method).
3. Iterate through the URLs for validation
The head method of the requests module is used to send a HEAD request to the specified URL. The get_attribute method is used on every link for getting ‘href’ attribute of the anchor tag.
The head method is primarily used in scenarios where only status_code or HTTP headers are required, and contents of the file (or URL) are not needed. The head method returns requests.Response object which also contains the HTTP Status Code (i.e. request.status_code).
for link in links:
try:
request = requests.head(link.get_attribute('href'), data ={'key':'value'})
print("Status of " + link.get_attribute('href') + " is " + str(request.status_code))
The same set of operations are performed iteratively till all the ‘links’ present on the page have been exhausted.
4. Validate the links through the Status Code
If the HTTP response code for the HTTP request sent in step(3) is 404 (i.e., Page Not Found), it means that the link is a broken link. For links that are not broken, the HTTP Status Code is 200.
if (request.status_code == 404):
broken_links = (broken_links + 1)
else:
valid_links = (valid_links + 1)
5. Skip irrelevant requests
When applied on links that do not contain the ‘href’ attribute (e.g., mailto, telephone, etc.), the head method results in an exception (i.e., MissingSchema, InvalidSchema).
except requests.exceptions.MissingSchema:
print("Encountered MissingSchema Exception")
except requests.exceptions.InvalidSchema:
print("Encountered InvalidSchema Exception")
except:
print("Encountered Some other execption")
These exceptions are caught, and the same is printed on the terminal.
Execution
We have used the PyUnit (or unittest) here, the default test framework in Python for broken links testing using Selenium. Run the following command on the terminal:
python Broken_Links.py
The execution would take around 2-3 minutes since the LambdaTest Blog page consists of approximately 150+ links. The execution screenshot below shows that the page has 169 valid links and zero broken links.
You would witness the InvalidSchema exception or MissingSchema exception at some places, which indicates that those links are skipped from the evaluation.
The HEAD request to LinkedIn (i.e.) results in an HTTP Status Code of 999. As stated in this thread on StackOverflow, LinkedIn filters the requests based on the user-agent, and the request resulted in ‘Access Denied’ (i.e., 999 as HTTP Status Code).
We verified whether the LinkedIn link present on the LambdaTest blog page is broken or not by running the same test on the local Selenium Grid, which resulted in HTTP/1.1 200 OK.
Broken Link Testing Using Selenium C
Implementation
using OpenQA.Selenium;
using OpenQA.Selenium.Remote;
using OpenQA.Selenium.Chrome;
using NUnit.Framework;
using System.Threading;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using OpenQA.Selenium.Remote;
using System;
using System.Threading;
using System.Net.Http;
using System.Threading.Tasks;
namespace ParallelLTSelenium
{
[TestFixture("chrome", "85.0", "Windows 10")]
public class ParallelLTTests
{
private String browser;
private String version;
private String os;
IWebDriver driver;
public ParallelLTTests(String browser, String version, String os)
{
this.browser = browser;
this.version = version;
this.os = os;
}
[SetUp]
public void Init()
{
String username = "user-name";
String accesskey = "access-key";
String gridURL = "@hub.lambdatest.com/wd/hub";
DesiredCapabilities capabilities = new DesiredCapabilities();
capabilities.SetCapability("user", username);
capabilities.SetCapability("accessKey", accesskey);
capabilities.SetCapability("browserName", browser);
capabilities.SetCapability("version", version);
capabilities.SetCapability("platform", os);
capabilities.SetCapability("build", "[C#] Finding broken links on a webpage using Selenium");
capabilities.SetCapability("name", "[C#] Finding broken links on a webpage using Selenium");
driver = new RemoteWebDriver(new Uri("https://" + username + ":" + accesskey + gridURL), capabilities, TimeSpan.FromSeconds(600));
System.Threading.Thread.Sleep(2000);
}
[Test]
public async Task LT_Broken_Links_Test()
{
int valid_links = 0, broken_links = 0;
driver.Url = "https://www.lambdatest.com/blog";
using var client = new HttpClient();
var links = driver.FindElements(By.TagName("a"));
Console.WriteLine("Looking at all the URLs in LambdaTest :");
/* Loop through all the urls */
foreach (var link in links)
{
if (!(link.Text.Contains("Email") || link.Text.Contains("https://www.linkedin.com") || link.Text == "" || link.Equals(null)))
{
try
{
/* Get the URI */
HttpResponseMessage response = await client.GetAsync(link.GetAttribute("href"));
System.Console.WriteLine($"URL: {link.GetAttribute("href")} status is :{response.StatusCode}");
/* Reference - https://docs.microsoft.com/en-us/dotnet/api/system.net.httpwebresponse.statuscode?view=netcore-3.1 */
if (response.StatusCode == HttpStatusCode.OK)
{
valid_links++;
}
else
{
broken_links++;
}
}
catch (Exception ex)
{
if ((ex is ArgumentNullException) ||
(ex is NotSupportedException))
{
System.Console.WriteLine("Exception occured\n");
}
}
}
}
/* Perform wait to check the output */
System.Threading.Thread.Sleep(2000);
Console.WriteLine("Detection of broken links completed with " + broken_links + " broken links and " + valid_links + " valid links");
}
[TearDown]
public void Cleanup()
{
if (driver != null)
driver.Quit();
}
}
}
Code WalkThrough
The NUnit framework is used for automation testing; our earlier blog on NUnit Test automation with Selenium C# can help you get started with the framework.
1. Include HttpClient
The HttpClient namespace is added for usage through the using directive. The HttpClient class in C# provides a base class for sending HTTP requests and receiving the HTTP response from a resource that is identified by URI.
Microsoft recommends using System.Net.Http.HttpClient instead of System.Net.HttpWebRequest; HttpWebRequest could also be used to detect broken links in Selenium C#.
using System.Net.Http;
using System.Threading.Tasks;
2. Define an async method that returns a task
An async test method is defined as using the GetAsync method that sends a GET request to the specified URI as an asynchronous operation.
public async Task LT_Broken_Links_Test()
{
3. Collect the links present on the page
Firstly, we create an instance of HttpClient.
using var client = new HttpClient();
The links present on the URL under test (i.e., LambdaTest Blog) are collected by locating the web elements by the TagName “a” property.
var links = driver.FindElements(By.TagName("a"));
The find_elements method in Selenium is used for locating the links on the page as it returns an array (or list) that can be iterated to verify the workability of the links.
4. Iterate through the URLs for validation
The links located using the find_elements method are verified in a for loop.
foreach (var link in links)
{
We filter the links that contain /email-addresses/telephone numbers/LinkedIn addresses. The links with no Link Text are also filtered out.
if (!(link.Text.Contains("Email") || link.Text.Contains("https://www.linkedin.com") || link.Text == "" || link.Equals(null)))
{
The GetAsync method of HttpClient class sends a GET request to the corresponding URI as an asynchronous operation. The argument to the GetAsync method is the value of the anchor’s ‘href’ attribute collected using the GetAttribute method.
The evaluation of the async method is suspended by the await operator until the completion of the asynchronous operation. On completion of the asynchronous operation, the await operator returns the HttpResponseMessage that includes the data and status code.
/* Get the URI */
HttpResponseMessage response = await client.GetAsync(link.GetAttribute("href"));
System.Console.WriteLine($"URL: {link.GetAttribute("href")} status is :{response.StatusCode}");
5. Validate the links through the Status Code
If the HTTP response code (i.e. response.StatusCode) for the HTTP request sent in step(4) is HttpStatusCode.OK (i.e., 200), it means that the request was completed successfully.
System.Console.WriteLine($"URL: {link.GetAttribute("href")} status is :{response.StatusCode}");
if (response.StatusCode == HttpStatusCode.OK)
{
valid_links++;
}
else
{
broken_links++;
}
NotSupportedException and ArgumentNullException exceptions are handled as a part of exception handling.
catch (Exception ex)
{
if ((ex is ArgumentNullException) ||
(ex is NotSupportedException))
{
System.Console.WriteLine("Exception occured\n");
}
}
Execution
Here is the execution snapshot, which shows that the test was executed successfully.
Exceptions have occurred for links to the ‘share icons,’ i.e., WhatsApp, Facebook, Twitter, etc. Apart from these links, the rest of the links on the LambdaTest blog page return HttpStatusCode.OK (i.e. 200).
Broken Link Testing Using Selenium PHP
Implementation
<?php
require 'vendor/autoload.php';
use PHPUnit\Framework\TestCase;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;
# accessKey: AccessKey can be generated from automation dashboard or profile section
$GLOBALS['LT_USERNAME'] = "user-name";
$GLOBALS['LT_APPKEY'] = "access-key";
function check_nonlinks($test_url, $test_pattern)
{
if (preg_match($test_pattern, $test_url) == false)
{
return false;
}
else
{
return true;
}
}
class BrokenLinksTest extends TestCase
{
protected $webDriver;
public function build_browser_capabilities(){
/* $capabilities = DesiredCapabilities::chrome(); */
$capabilities = array(
"build" => "[PHP] Finding broken links on a webpage using Selenium",
"name" => "[PHP] Finding broken links on a webpage using Selenium",
"platform" => "Windows 10",
"browserName" => "Chrome",
"version" => "85.0"
);
return $capabilities;
}
public function setUp(): void
{
$url = "https://". $GLOBALS['LT_USERNAME'] .":" . $GLOBALS['LT_APPKEY'] ."@hub.lambdatest.com/wd/hub";
$capabilities = $this->build_browser_capabilities();
/* $this->webDriver = RemoteWebDriver::create('http://localhost:4444/wd/hub', $capabilities); */
$this->webDriver = RemoteWebDriver::create($url, $capabilities);
}
public function tearDown(): void
{
$this->webDriver->quit();
}
/*
* @test
*/
public function test_Broken_Links()
{
$test_url = "https://www.lambdatest.com/blog";
$pattern_1 = '/\baddtoany\b/';
$pattern_2 = '/\bmailto\b/';
$pattern_3 = '/\blinkedin\b/';
$title = 'LambdaTest | A Cross Browser Testing Blog';
$driver = $this->webDriver;
$driver->get($test_url);
$driver->manage()->window()->maximize();
$this->assertEquals($title, $driver->getTitle());
$brokenlinks = 0;
$validlinks = 0;
/* file_get_contents is used to get the page's HTML source */
$html = file_get_contents($test_url);
/* Instantiate the DOMDocument class */
$htmlDom = new DOMDocument;
/* The HTML of the page is parsed using DOMDocument::loadHTML */
@$htmlDom->loadHTML($html);
/* Extract the links from the page */
$links = $htmlDom->getElementsByTagName('a');
/* The DOMNodeList object is traversed to check for its validity */
foreach($links as $link)
{
/* Get the Link Text */
$linkText = $link->nodeValue;
/* Get the details of the link in the HREF attribute */
$linkHref = $link->getAttribute('href');
/* print($linkHref); */
/* print("\n"); */
/* Skip the link if it is empty */
if(strlen(trim($linkHref)) == 0)
{
continue;
}
/* Skip the link if it is a hashtag or anchor link */
if($linkHref[0] == '#')
{
continue;
}
/* Skip if the link is an email address or telephone number or does not contain the URL (e.g. we have skipped addtoany) */
if ((check_nonlinks($linkHref, $pattern_1))||(check_nonlinks($linkHref, $pattern_2)))
{
print("\nAdd_To_Any or email encountered");
continue;
}
/* curl_init() for initializing a cURL session */
$curl = curl_init($linkHref);
/* curl_setopt() is used for setting an option for cURL transfer */
curl_setopt($curl, CURLOPT_NOBODY, true);
/* curl_exec() for performing a cURL session */
$result = curl_exec($curl);
if ($result !== false)
{
/* curl_getinfo() to get information about the transfer */
$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
if ($statusCode >= 400)
{
/* Return code for LinkedIn Page is 999 */
/* https://stackoverflow.com/questions/27231113/999-error-code-on-head-request-to-linkedin */
$linkedin_page_status = check_nonlinks($linkHref, $pattern_3);
/* We skip the test for LinkedIn as it can only be verified on local Selenium Grid */
if (($linkedin_page_status) && ($statusCode == 999))
{
print("\nLink " . $linkHref . " is LinkedIn Page and status is " .$statusCode);
$validlinks++;
}
else
{
print("\nLink " . $linkHref . " is broken link and status is " .$statusCode);
$brokenlinks++;
}
}
else
{
print("\nLink " . $linkHref . " is a valid link and status is " . $statusCode);
$validlinks++;
}
}
}
print("\n\nDetection of broken links completed with " . $validlinks . " valid links and " . $brokenlinks . " broken links");
}
}
?>
Code WalkThrough
1. Read the page source
The file_get_contents function in PHP is used for reading the page’s HTML source into a String variable (e.g. $html).
$test_url = "https://www.lambdatest.com/blog";
$html = file_get_contents($test_url);
2. Instantiate the DOMDocument class
The DOMDocument class in PHP represents an entire HTML document and serves as the document tree’s root.
$htmlDom = new DOMDocument;
3. Parse HTML of the page
The DOMDocument::loadHTML() function is used for parsing the HTML source that is contained in $html. On successful execution, the function returns a DOMDocument object.
@$htmlDom->loadHTML($html);
4. Extract the links from the page
The links present on the page are extracted using the getElementsByTagName method of DOMDocument class. The elements (or links) are searched based on the ‘a’ tag from the parsed HTML source.
The getElementsByTagName function returns a new instance of DOMNodeList which contains the elements (or links) of local tag name (i.e. < a > tag)
$links = $htmlDom->getElementsByTagName('a');
5. Iterate through the URLs for validation
The DOMNodeList, which was created in Step (4), is traversed for checking the validity of the links.
foreach($links as $link)
{
$linkText = $link->nodeValue;
The details of the corresponding link are obtained using the ‘href’ attribute. The GetAttribute method is used for the same.
$linkHref = $link->getAttribute('href');
Skip checking the links if:
a. The link is empty
if(strlen(trim($linkHref)) == 0)
{
continue;
}
b. The link is a hashtag or an anchor link
if($linkHref[0] == '#')
{
continue;
}
c. The link contains mailto or addtoany (i.e., social sharing options).
function check_nonlinks($test_url, $test_pattern)
{
if (preg_match($test_pattern, $test_url) == false)
{
return false;
}
else
{
return true;
}
}
public function test_Broken_Links()
{
$pattern_1 = '/\baddtoany\b/';
$pattern_2 = '/\bmailto\b/';
....................................................................
....................................................................
....................................................................
if ((check_nonlinks($linkHref, $pattern_1))||(check_nonlinks($linkHref, $pattern_2)))
{
print("\nAdd_To_Any or email encountered");
continue;
}
....................................................................
....................................................................
....................................................................
}
preg_match function uses a regular expression (regex) for performing a case-insensitive search for mailto and addtoany. The regular expressions for mailto & addtoany are ‘/\bmailto\b/’ & ‘/\baddtoany\b/’ respectively.
6. Validate the HTTP Code using cURL
We use curl to get information regarding the status of the corresponding link. The first step is initializing a cURL session with the ‘link’ on which validation has to be done. The method returns a cURL instance that will be used in the latter part of the implementation.
$curl = curl_init($linkHref);
The curl_setopt method is used for setting options on the given cURL session handle (i.e. $curl).
curl_setopt($curl, CURLOPT_NOBODY, true);
The curl_exec method is called for execution of the given cURL session. It returns True on successful execution.
$result = curl_exec($curl);
This is the most important part of the logic that checks for broken links on the page. The curl_getinfo function that takes the cURL session handle (i.e. $curl) and CURLINFO_RESPONSE_CODE (i.e. CURLINFO_HTTP_CODE) are used for getting information about the last transfer. It returns HTTP Status Code in response.
$statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
On successful completion of the request, HTTP Status Code of 200 is returned, and the variable holding the valid links count (i.e., $valid_links) is incremented. For links that result in the HTTP Status Code of 400 (or more), a check is performed if the ‘link under test’ was LambdaTest’s, LinkedIn Page. As mentioned earlier, the LinkedIn page’s status code will be 999; hence, $valid_links is incremented.
For all the other links that returned HTTP Status Code of 400 (or more), the variable holding the broken links count (i.e., $broken_links) is incremented.
if (($linkedin_page_status) && ($statusCode == 999))
{
print("\nLink " . $linkHref . " is LinkedIn Page and status is " .$statusCode);
$validlinks++;
}
else
{
print("\nLink " . $linkHref . " is broken link and status is " .$statusCode);
$brokenlinks++;
}
Execution
We use the PHPUnit framework for testing for broken links on the page. For downloading the PHPUnit framework, add the file composer.json in the root folder and run composer require on the terminal.
{
"require":{
"php":">=7.1",
"phpunit/phpunit":"^9",
"phpunit/phpunit-selenium": "*",
"php-webdriver/webdriver":"1.8.0",
"symfony/symfony":"4.4",
"brianium/paratest": "dev-master"
}
}
Run the following command on the terminal to check broken links in Selenium PHP.
vendor\bin\phpunit tests\BrokenLinksTest.php
Here is the execution snapshot that shows a total of 116 valid links and 0 broken links on the LambdaTest Blog. As links for social sharing (i.e., addtoany) and email address are ignored, the total count is 116 (169 in the Selenium Python test).
Conclusion
Broken links, also called dead links or rot links, can hinder the user experience if they are present on the website. Broken links can also impact the rankings on search engines. Hence, broken link testing should be carried periodically for activities related to website development and testing.
Rather than relying on third-party tools or manual methods for checking broken links on a website, broken links testing can be done using Selenium WebDriver with Java, Python, C#, or PHP. The HTTP Status Code, returned when accessing any web page, should be used to check broken links using the Selenium framework.
Top comments (0)