[OWASP-IG-001] Spidering, Robots, and Crawling

In this tutorial, we are going to take a quick look at how we can use the tools on the OWASP Live CD to perform testing related to the OWASP-IG-001 classification outlined in the OWASP Testing Guide version 3. Based on the OWASP Testing Guide description of this category, we are focusing on active information gathering (hence the IG). Active information gathering, as it pertains to web testing, involves parsing site code and files for links or indicators of other code or files. The resulting map is usually referred to as the site hierarchy or structure. By discovering the site structure, we get a little extra insight into how the site works and where we might focus our attention as testing progresses.

 

For organizational purposes, we are going to divide this tutorial into two groups:

 

  • robots.txt discovery
  • spidering

 

Robots.txt Discovery

First things first, let's get our trusty proxy running. Since I'm writing this, you'll have to deal with the fact that WebScarab is my favorite proxy, so that's what I use. We also need to launch Firefox and configure it to send requests through WebScarab.

 

Luckily, Google makes our job extremely easy.

Simply add robots.txt to the end of the base URL of your target site.
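The same step can be scripted. The following is only a minimal sketch using Python's standard library; the function names robots_url and fetch_robots are made up for this example, and the sketch assumes the target serves robots.txt from the root of its host.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def robots_url(base_url: str) -> str:
    """Return the robots.txt URL for the site that base_url belongs to."""
    parts = urlsplit(base_url)
    # robots.txt lives at the root of the host, whatever path we were given.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def fetch_robots(base_url: str, timeout: float = 10.0) -> str:
    """Download robots.txt and return it as text (performs a network request)."""
    with urlopen(robots_url(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling fetch_robots with the base URL of your target returns the same text you would see in the browser. Only run it against sites you are authorized to test.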

A couple of interesting things here:

  1. Notice the User-Agent is set to “*”, which means the rules apply to every crawler. Keep in mind that anyone with a browser can read your robots.txt file, and the User-Agent header is so easy to fake that restricting user agents is hardly a reliable security measure. Still, it's a best practice to allow only known spider bots, such as Google or MSN.
  2. There are both disallow and allow entries. A robots.txt file that has disallow entries is a goldmine for a hacker, because it points you to the really important stuff! The allow entries can probably be assumed not to cover things like admin interfaces.
  3. There are a ton of entries! This in itself isn't a vulnerability, but if you have 50+ entries in your robots.txt file, it's probably time to find a new way to protect your sensitive content from prying search engine bots.
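To pull the interesting entries out of a long robots.txt file, a small parser helps. This is a rough sketch, not a full implementation of the robots exclusion rules: parse_robots is a name invented for this example, and it only groups Allow and Disallow paths under the User-agent lines they follow.

```python
from collections import defaultdict


def parse_robots(text: str) -> dict:
    """Group Allow/Disallow paths by the User-agent group they appear in."""
    rules = defaultdict(lambda: {"allow": [], "disallow": []})
    agents = []        # user agents of the group currently being read
    in_rules = False   # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # rule lines ended the previous group
                agents, in_rules = [], False
            agents.append(value)
            rules[value]                      # register agents that have no rules
        elif field in ("allow", "disallow"):
            in_rules = True
            if value:                         # an empty Disallow blocks nothing
                for agent in agents:
                    rules[agent][field].append(value)
    return dict(rules)
```

The disallow list for “*” is usually the first place to look, since it names the paths the site owner did not want indexed.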

If you suspect that there is a robots.txt file present and you simply can't get to it using the default configuration of Firefox, we have a nifty tool called the User Agent Switcher.

By default, there aren't many user agents to choose from. We are going to fix that.

Go here and download the XML file from within the Live CD environment.

Next, we need to import it into the User-Agent-Switcher:

From within Firefox: Tools->User Agent Switcher->Options->Options

 

Click on the User Agents section of the Options dialog and click “import”. Select your file, and you now have many more user agents to choose from.

 

Now try one of the Google agents and see if you are able to access the target site's robots.txt file.
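Switching the user agent simply changes the User-Agent header sent with each request, so the same check can be sketched in Python. The helper names build_request and fetch_as are hypothetical, and the Googlebot string below is the commonly published one; the agents in your imported list may differ.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Commonly published Googlebot user agent string (an assumption for this sketch).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url: str, user_agent: str = GOOGLEBOT_UA) -> Request:
    """Create a request that presents the given User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_as(url: str, user_agent: str = GOOGLEBOT_UA, timeout: float = 10.0):
    """Return (status_code, body_text) for url, fetched with a spoofed agent."""
    try:
        with urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # 4xx/5xx responses raise HTTPError; report the code instead of failing.
        return err.code, ""
```

Comparing the status code returned for a browser agent with the one returned for a crawler agent shows whether the server treats the two differently.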

 

Spidering

Generally, spidering doesn't discover vulnerabilities; it is mostly used to get a good overall picture of what the site structure looks like. It will show whether directories are browsable, whether the site issues 302 redirects, and even what language the site is written in (e.g., PHP or ASP). That picture tells us where to focus the rest of our testing.

Fortunately for us, the OWASP Live CD has several spiders to choose from, but we are going to use WebScarab's built-in spider.

By selecting 'fetch recursively', we tell the spider to keep following links until it reaches the end of the site, so to speak. Then we select our target and click 'fetch selection'.

Now we can hop over to the summary tab, expand the tree, and browse the structure of the site. This is extremely useful.

We can also see what HTTP code was returned by the server, giving us an idea of how the host is configured.
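To make the idea of recursive fetching concrete, here is a minimal breadth-first spider sketch in Python. It is not how WebScarab is implemented; the names LinkParser, extract_links, and spider are invented for this example, and the fetch argument is assumed to be any function that takes a URL and returns a (status_code, html_text) pair.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect href and src attribute values from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(page_url, html):
    """Return absolute, same-host links found in html, without fragments."""
    parser = LinkParser()
    parser.feed(html)
    host = urlsplit(page_url).netloc
    found = []
    for link in parser.links:
        absolute = urldefrag(urljoin(page_url, link)).url
        if urlsplit(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


def spider(start_url, fetch, max_pages=100):
    """Crawl breadth-first from start_url and map each URL to its status code."""
    seen = {start_url}
    queue = deque([start_url])
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        status, html = fetch(url)
        results[url] = status
        for link in extract_links(url, html):
            if link not in seen:       # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return results
```

The returned dictionary plays the role of the summary view: every discovered URL alongside the HTTP code the server sent back. The max_pages limit is a safety net so the crawl stops even on very large sites.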

 


 

This information will prove very useful as you move on through the testing process.