Identifying Dead Links on Websites

A common problem with the maintenance of large websites is “dead” links that end up pointing to a nonexistent destination because something has changed somewhere. In particular, these include links to external resources which may easily become stale without anyone noticing until a customer attempts to follow the link.

There are various ways to check whether a given URL works, that is, whether the corresponding server returns a successful result or reports a failure. One challenge when doing this for a complete website, especially one with dynamic content, is obtaining the list of URLs to check. Reading the HTML response that the server sends is often not sufficient for modern websites, because much of their content is built dynamically by JavaScript code.
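As a point of reference, the simplest kind of check can be sketched with a plain HTTP request from Python's standard library. This is only an illustration and not part of the example test suite: the function name checkUrl and its opener parameter are assumptions made for this sketch. It answers whether a single URL responds, but it does nothing to discover links that are generated dynamically.

```python
try:  # Python 3
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError, URLError
except ImportError:  # Python 2
    from urllib2 import Request, urlopen, HTTPError, URLError


def checkUrl(url, timeout=10, opener=urlopen):
    """Return None if the server answers successfully, else a reason string.

    checkUrl and the opener parameter are hypothetical names for this sketch;
    opener exists only so the function can be exercised without a network.
    """
    try:
        response = opener(Request(url), timeout=timeout)
        code = response.getcode()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return "HTTP %d" % e.code
    except (URLError, IOError, ValueError) as e:
        # DNS failure, refused connection, timeout, or a malformed URL.
        return "unreachable: %s" % e
    if code is not None and code >= 400:
        return "HTTP %d" % code
    return None
```

The return convention (None for a working link, a description otherwise) mirrors the one used by the verifyLink function later in this article.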

In the following sections, we present a solution to both issues: Squish for Web accesses the rendered website inside the browser to extract the links, and it also performs the actual link check.

Gathering Linked Resources

There are three basic tasks involved in a link checker:

  • Gathering all links on a given webpage.
  • Checking each link to verify if it is reachable.
  • Loading the linked resources and repeating the process.

The first step can be implemented using the XPath support in Squish for Web. Once a given URL has been loaded into the browser by assigning it to the url property of the active browser tab, the test script can use an XPath expression, like .//A, to gather references to all link objects and extract the resources they point to.

Since Squish loads the page into a browser, all scripts on the page are executed, and the website appears just as a site visitor would see it. Gathering the links as shown in the following snippet therefore also captures dynamically generated links. The example loads the URL into a browser and then uses the XPath expression to generate a list of URLs the page points to.

def checkLinks(starturl, verifiedLinks, doRecursive):
    activeBrowserTab().url = starturl
    body = waitForObject(nameOfBody)
    links = getAllLinkUrls(body)

def getAllLinkUrls(startobject):
    links = startobject.evaluateXPath(".//A")
    urls = []
    for i in range(0, links.snapshotLength):
        urls.append(links.snapshotItem(i).property("href"))
    return urls
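The list gathered this way may contain entries that are not worth loading, such as mailto: or javascript: links, empty values, or several links to the same page that differ only in their fragment. A minimal sketch of a clean-up step follows. The helper name normalizeLinkUrls is an assumption for illustration and is not part of the example suite, and the sketch assumes the href values are absolute URLs that convert cleanly with str().

```python
try:  # Python 3
    from urllib.parse import urldefrag, urlparse
except ImportError:  # Python 2
    from urlparse import urldefrag, urlparse


def normalizeLinkUrls(urls):
    """Keep http(s) URLs only, strip fragments, drop duplicates (order kept).

    Hypothetical helper: it could be applied to the result of getAllLinkUrls
    before the links are verified.
    """
    seen = set()
    result = []
    for url in urls:
        if not url:
            continue  # empty href
        # "page.html#top" and "page.html#end" refer to the same resource.
        url = urldefrag(str(url))[0]
        if urlparse(url).scheme not in ("http", "https"):
            continue  # mailto:, javascript:, tel:, ...
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Whether fragments should be stripped depends on the site: single-page applications that route on the fragment would need a different rule.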

Verifying Resource Availability

Once all links have been gathered into a list, each one can be checked to verify that it loads a proper website. A loop iterating over the list of links and loading each one into the browser in the same way that the checkLinks function does is quick to write. However, with further experimentation, it becomes clear this is not sufficient:

One important feature for a tool that verifies connectivity of links is the ability to report the links that failed to work. In the code snippet below, the verifyLink function does this by looking at the content of the BODY element, which usually contains a textual description of the error. Since a given link may point not to an HTML page but to, e.g., a PDF document, verifyLink catches the LookupError exception raised by waitForObject and considers the link as working in this case, too.

If the URL is not reachable, the simple HTTP server used for the example test suite generates an HTML page that contains details about the error. The verifyLink function uses this page to detect that a link is not working. With other websites, this check may need to be adapted.

In case of a link not working, the verifyLink function records the link and the error text, and the loop collects all problems and reports them to the caller.

Some additional book-keeping is necessary to ensure that each link is visited only once, and to collect a list of working links on which the check also needs to be run later on. The first part is achieved by keeping a set of links in the verifiedLinks variable. The second part is achieved with a small helper function shouldFollowLinkFrom (covered in the next section) and the linksToFollow list that records those links.

def checkLinks(starturl, verifiedLinks, doRecursive):
...
    links = getAllLinkUrls(body)
    linksToFollow = []
    missingLinks = []
    for link in links:
        if link in verifiedLinks:
            continue
        verifiedLinks.add(link)
        linkResult = verifyLink(link)
        if linkResult is None:
            if shouldFollowLinkFrom(starturl, link):
                linksToFollow.append(link)
        else:
            missingLinks.append(linkResult)

def verifyLink(url):
    activeBrowserTab().url = url
    try:
        body = waitForObject(nameOfBody)
        txt = str(body.simplifiedInnerText)
        if "Error response" in txt and "Error code" in txt and "Error code explanation" in txt:
            return {"url": url, "reason": txt}
        return None
    except LookupError:
        return None

Recursing Into the Found Links

Once all links on a given page have been checked, it is usually necessary to follow those links, load the corresponding page and, further, check the links on those pages. This continues recursively until some initially set condition is reached that stops the recursion.

One example condition would be to follow only links to pages provided by the same server and to stop once no new links are found. The basic idea is that a website is usually provided by a single server, and pages on external servers are no longer the responsibility of the website's development team. So, it is sufficient that the links to those external resources work; it is not necessary to check whether those pages themselves contain broken links.

Checking whether a given link leaves the website (i.e., points to an external resource) can be done in many ways, and the best approach often depends on how the website is created. In the shouldFollowLinkFrom function below, we compare the network location part of the page currently being checked with that of the link URL. For this example, this is sufficient. For more complex websites, it is easy to extend this function with additional logic.

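# Python 2 location; on Python 3 use: from urllib.parse import urlparse
from urlparse import urlparse

def shouldFollowLinkFrom(starturl, link):
    return urlparse(starturl).netloc == urlparse(link).netloc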
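As one example of such additional logic, the sketch below also follows links to hosts that belong to a configured list of site domains, so that pages on sub-domains count as part of the same website. The function name shouldFollowLinkAcrossSubdomains and the siteDomains parameter are assumptions for illustration and do not appear in the example suite.

```python
try:  # Python 3
    from urllib.parse import urlparse
except ImportError:  # Python 2
    from urlparse import urlparse


def shouldFollowLinkAcrossSubdomains(starturl, link, siteDomains):
    """Follow links on the same server or on any host within siteDomains.

    siteDomains is a hypothetical list such as ["example.com"], which then
    covers example.com itself as well as www.example.com, docs.example.com...
    """
    start = urlparse(starturl)
    target = urlparse(link)
    if start.netloc == target.netloc:
        return True  # same rule as the original shouldFollowLinkFrom
    # hostname drops the port and is lower-cased by urlparse; it may be None.
    host = (target.hostname or "").lower()
    for domain in siteDomains:
        domain = domain.lower()
        # The leading dot prevents "notexample.com" from matching "example.com".
        if host == domain or host.endswith("." + domain):
            return True
    return False
```

Listing the domains explicitly avoids having to guess the registrable domain from a host name, which is not possible by string manipulation alone for suffixes such as co.uk.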

The actual recursion is done after all links of the current URL have been checked. Each link that should be followed is passed in turn to the same checkLinks function. With each call, the list of missingLinks is extended, so a complete report of all broken links can be generated after the recursion ends.

...
            missingLinks.append(linkResult)
    if doRecursive:
        for link in linksToFollow:
            missingLinks += checkLinks(link, verifiedLinks, doRecursive)
    return missingLinks
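On sites with long chains of pages, the same traversal can also be written with an explicit queue instead of recursion, which avoids Python's recursion limit. The following is a sketch of that variant and not the code from the example suite: the function checkLinksIteratively is hypothetical, and the page loading, link verification and follow decision are passed in as callables so the logic can be tried without a browser.

```python
from collections import deque


def checkLinksIteratively(starturl, getLinks, verifyLink, shouldFollow):
    """Breadth-first variant of checkLinks (hypothetical sketch).

    getLinks(page) returns the link URLs found on a page, verifyLink(url)
    returns None for a working link or a result describing the failure, and
    shouldFollow(page, link) decides whether the linked page is checked too.
    """
    # The start page is loaded anyway, so links back to it are not re-checked.
    verifiedLinks = set([starturl])
    missingLinks = []
    pending = deque([starturl])
    while pending:
        page = pending.popleft()
        for link in getLinks(page):
            if link in verifiedLinks:
                continue  # each link is verified only once
            verifiedLinks.add(link)
            result = verifyLink(link)
            if result is not None:
                missingLinks.append(result)
            elif shouldFollow(page, link):
                pending.append(link)
    return missingLinks
```

In a Squish test, getLinks could load the page into the browser and call getAllLinkUrls, while verifyLink and shouldFollowLinkFrom from this article could be passed in unchanged.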

To make the failed link checks visible, the test uses Squish's reporting functions, for example test.fail. This leads to a concise report that includes all the information the test has about the links, as in the following screenshot:

Sample report for connectivity check
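A minimal sketch of how the collected results could be turned into report entries is shown below. The helper reportMissingLinks is an assumption for illustration. It takes the reporting function as a parameter, so inside a Squish test script test.fail could be passed in, while outside of Squish any callable accepting a message and a detail string works.

```python
def reportMissingLinks(missingLinks, fail):
    """Report every broken link through fail(message, detail).

    Hypothetical helper: missingLinks is the list of dictionaries returned by
    checkLinks, each with the "url" and "reason" keys set by verifyLink.
    Returns the number of reported links.
    """
    for entry in missingLinks:
        fail("Broken link: %s" % entry["url"], entry["reason"])
    return len(missingLinks)
```

In a test case this might be called as reportMissingLinks(checkLinks(starturl, set(), True), test.fail), assuming checkLinks is used as shown above.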

Further Development

The example code shown above implements a simple, yet effective and extensible method for verifying the connections of a website to other parts of itself or the internet. There are, however, many conceivable extensions of the code, including:

  • Improve synchronization so links are read only once the page is fully settled. This is particularly challenging with modern websites that lazily load most of their content. One way to do this would be to implement a synchronization point that waits for the number of links to be stable over a certain time span, indicating that no new links were added by lazily-loaded content.
  • On more complex websites, the shouldFollowLinkFrom logic likely needs to take into account that pages from different sub-domains are considered part of the same website.
  • The current implementation merely finds links that are not working. It is still up to the website developer to determine where those have been used. A potential improvement would be to store the URL of the page(s) that use a particular link and include that in the report. That would make it easier to repair the website links.
  • Based on the previous idea, it might be interesting to visualize the connection between pages. This can be done by generating a text file using the dot language which can be used to generate a graphical visualization.
  • When executing the example test suite, it becomes evident that for a large site, some way of parallelizing is needed. This could be done by having the script work on a queue of links to check in parallel and distribute this work onto multiple systems which have a squishserver running. The queue would then periodically check each system for being done with loading and obtain the result from the page as well as the links. So, pages loading slowly would not hold up test execution.
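Two of the ideas above, remembering which pages use a link and visualizing the connections, can be sketched together. The helpers recordReferrer and toDot below are hypothetical and not part of the example suite. The sketch assumes that checkLinks would call recordReferrer(referrers, starturl, link) for every link it gathers.

```python
def recordReferrer(referrers, page, link):
    """Remember that page contains a link to link (referrers is a dict)."""
    referrers.setdefault(link, set()).add(page)


def _quote(url):
    # Escape backslashes and double quotes for a quoted identifier in dot.
    return '"%s"' % url.replace("\\", "\\\\").replace('"', '\\"')


def toDot(referrers, brokenUrls=()):
    """Return the link graph as text in the dot language.

    Links listed in brokenUrls are drawn in red so they stand out.
    """
    broken = set(brokenUrls)
    lines = ["digraph links {"]
    for link in sorted(referrers):
        if link in broken:
            lines.append("    %s [color=red];" % _quote(link))
        for page in sorted(referrers[link]):
            lines.append("    %s -> %s;" % (_quote(page), _quote(link)))
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be written to a file and rendered with Graphviz. The referrers dictionary on its own already answers the question of which pages need to be repaired for a given broken link.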

Try It Yourself

The code snippets and a simple test website are contained in a small Squish test suite. The website can be made available by opening a command/terminal window, changing to the samplepage subdirectory in the test suite and running Python’s SimpleHTTPServer module (a Python 2 module; in Python 3, the equivalent is http.server). The Squish installation ships with a Python that contains this module, so if Squish is installed, for example, in C:\squish and the example suite has been extracted to C:\suites\suite_website_connectivity, the commands to invoke in a terminal would be:

cd C:\suites\suite_website_connectivity\samplepage
C:\squish\python\python.exe -m SimpleHTTPServer

Andreas Pakulat joined froglogic in 2008 as a software engineer after completing his computer science degree. The first major task he took over was creating a new Squish IDE from the ground up based on the Eclipse framework. Andreas is currently responsible for the Squish for Web, Squish for Tk and Squish for Mac editions, but has also had the opportunity to work on almost all other components of the Squish GUI Tester over the years.
