Ever wanted to know which search engine crawlers look at your site, at what time, and on which page?
After launching our new site and looking into SEO, I thought it would be interesting to see which search engine crawlers were looking at our site and when. Since I’m such a nice guy, I’ve decided to post the PHP code I wrote to track this for all you nice people to use! I’ve attached a heavily commented version of the script to this post, or you can follow the tutorial below.
First things first: I presume you have some basic PHP knowledge to follow this tutorial. We will start by creating a variable that holds the user agent of whoever is requesting the page.
$useragent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
This first part of the code assigns the HTTP_USER_AGENT request header to the variable $useragent, falling back to an empty string because some clients do not send the header at all. The header normally carries data about the user's browser, like the below example:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
However, we won’t be using it for this purpose. Instead, we will be scanning the variable to check whether it includes the name of one of the major search engine crawlers (Googlebot, Yahoo! Slurp, Bingbot etc).
Next we will also assign a value to the $time variable to be called on later:
$time = date("j F Y H:i:s");
This part uses PHP’s date function. We then choose which parts of the date and time we want to pull:
j = Day of the month without leading zeros.
F = A full textual representation of a month, such as January or March.
Y = A full numeric representation of a year, 4 digits.
H = 24-hour format of an hour with leading zeros.
i = Minutes with leading zeros.
s = Seconds with leading zeros.
This will give us a date display of:
1 March 2012 12:00:00
You can choose a different date layout if you prefer a different style; I find this one simple to read. Note that date uses the server's default time zone.
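If you want to compare a few layouts before settling on one, a small sketch like the one below can help. It formats one fixed moment (the example date above) rather than the current time, so the results are predictable. gmmktime and gmdate are used here only to keep the sketch independent of the server's time zone, and the variable names are made up for illustration.

```php
<?php
// Fixed example moment: 1 March 2012 12:00:00 (UTC), so the output is predictable.
$timestamp = gmmktime(12, 0, 0, 3, 1, 2012);

// The layout used in this tutorial.
$tutorialStyle = gmdate("j F Y H:i:s", $timestamp);

// A numeric layout that sorts correctly as plain text.
$sortableStyle = gmdate("Y-m-d H:i:s", $timestamp);

// A shorter layout with an abbreviated month name and no seconds.
$shortStyle = gmdate("d M Y H:i", $timestamp);

echo $tutorialStyle, "\n"; // 1 March 2012 12:00:00
echo $sortableStyle, "\n"; // 2012-03-01 12:00:00
echo $shortStyle, "\n";    // 01 Mar 2012 12:00
```

In the tracking script itself you would keep using date without a timestamp, so that it records the current time.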
It's now starting to get a little bit more complicated. We are going to get the full page address. You could just use PHP_SELF, however this only gives you the path of the script and not your full page URL. So instead we pull the page address by creating a function:
function getAddress()
{
    // HTTPS is not set on plain HTTP requests, and some servers set it to 'off'
    $protocol = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
    return $protocol.'://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
}
As you can see, we create a function called getAddress. We first check whether the request came in over https; if not, we use http at the start of the URL. We then pull the HTTP_HOST and the REQUEST_URI from the $_SERVER variable and join the three parts together.
For example, for this page:
$protocol would equal http
HTTP_HOST would equal curomarketing.com
REQUEST_URI would equal /blog/how-to-track-search-engine-crawlers
Which would then give us the return URL of http://curomarketing.com/blog/how-to-track-search-engine-crawlers
You then simply need to assign the function to a variable:
$currentpage = getAddress();
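If you want to check how the URL is put together without actually visiting the page, one option is a variant of the function that takes the server values as an argument. This is only a sketch: buildAddress and the sample array are made up for illustration and are not part of the attached script.

```php
<?php
// Hypothetical variant of getAddress() that reads from any array shaped like $_SERVER.
function buildAddress(array $server)
{
    // HTTPS is missing on plain HTTP requests and may be 'off' on some servers.
    $isHttps = !empty($server['HTTPS']) && $server['HTTPS'] !== 'off';
    $protocol = $isHttps ? 'https' : 'http';
    return $protocol.'://'.$server['HTTP_HOST'].$server['REQUEST_URI'];
}

// Sample values taken from the example in the text.
$sample = array(
    'HTTP_HOST'   => 'curomarketing.com',
    'REQUEST_URI' => '/blog/how-to-track-search-engine-crawlers',
);

echo buildAddress($sample), "\n";
// http://curomarketing.com/blog/how-to-track-search-engine-crawlers

echo buildAddress($sample + array('HTTPS' => 'on')), "\n";
// https://curomarketing.com/blog/how-to-track-search-engine-crawlers
```

Passing the array in makes it easy to try out different hosts and paths from the command line.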
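We have now defined all of the variables we are going to use, and your PHP file should look something like this:
<?php
function getAddress()
{
    // HTTPS is not set on plain HTTP requests, and some servers set it to 'off'
    $protocol = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
    return $protocol.'://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
}
$currentpage = getAddress();
// Some clients send no User-Agent header, so fall back to an empty string
$useragent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
$time = date("j F Y H:i:s");
?>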
We now need to do something with the data we have captured, depending on whether the user_agent is a search engine crawler!
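if (stripos($useragent, "Googlebot") !== false)
{
    // "a" opens the file for appending and creates it if it does not exist
    $file = fopen("crawled.txt","a");
    fwrite($file, "You've been crawled by GoogleBot on $time For the page $currentpage\n");
    fclose($file);
}
We do this by starting an if statement. We use stripos, which performs a case-insensitive search, to check whether the variable $useragent has the word Googlebot in it. stripos returns the position of the match, or false if there is none, so we compare the result against false with !== (otherwise a match at position 0 would be treated as no match). If the statement is true, we execute the code inside the braces.
We set the $file variable to fopen a document called crawled.txt in append mode ("a"), which places the cursor at the end of the document to make sure we write the new data at the end of the file. If no file exists with the name crawled.txt, fopen will create it.
We then use fwrite to write the text You’ve been crawled by GoogleBot on $time For the page $currentpage into the $file we defined as crawled.txt, and fclose to close the file again.
Because the text is in double quotes, PHP swaps in the values of the variables we set up earlier, so the correct data is added to the txt document.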
If you open your text document after you have been crawled, you should see some data like below.
You've been crawled by GoogleBot on 1 March 2012 12:00:00 For the page http://curomarketing.com/
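The open, write and close steps can also be wrapped up in one place. The sketch below uses file_put_contents with the FILE_APPEND flag, which adds to the end of the file (creating it if needed), and LOCK_EX so that two visits arriving at the same moment do not interleave their lines. logCrawl and the temporary file are illustrative assumptions, not part of the attached script.

```php
<?php
// Hypothetical helper: append one line to the log file.
function logCrawl($logFile, $botName, $time, $page)
{
    $line = "You've been crawled by $botName on $time For the page $page\n";
    // FILE_APPEND adds to the end of the file; LOCK_EX takes an exclusive lock while writing.
    return file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX) !== false;
}

// Example run against a throwaway file so the real crawled.txt is left alone.
$logFile = tempnam(sys_get_temp_dir(), "crawl");
logCrawl($logFile, "GoogleBot", "1 March 2012 12:00:00", "http://curomarketing.com/");
logCrawl($logFile, "Bingbot", "1 March 2012 12:05:00", "http://curomarketing.com/");

// Read the log back to see what was written, then clean up.
$logContents = file_get_contents($logFile);
echo $logContents;
unlink($logFile);
```

In the tracking script you would call the helper with "crawled.txt", $time and $currentpage instead of the sample values.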
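We can then add other search engine crawlers to the if statement. One thing to watch out for: the name Googlebot-Image also contains the word Googlebot, so the more specific check has to come first, otherwise it would never be reached. The Googlebot check therefore becomes the elseif:
if (stripos($useragent, "Googlebot-Image") !== false)
{
    $file = fopen("crawled.txt","a");
    fwrite($file, "You've been crawled by Google Image Bot on $time For the page $currentpage\n");
    fclose($file);
}
elseif (stripos($useragent, "Googlebot") !== false)
{
    $file = fopen("crawled.txt","a");
    fwrite($file, "You've been crawled by GoogleBot on $time For the page $currentpage\n");
    fclose($file);
}
The statement first checks whether the user_agent is Googlebot-Image. If it is not, the elseif then checks whether it is Googlebot, and we do the exact same thing in both cases apart from the text we write. If you prefer, you could save the data for each search engine crawler into a differently named text file, or keep it all in the same one.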
We can also do the same thing with Bingbot, Yahoo! Slurp and msnbot. There are many more small search engines that you could also add if you wanted.
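As the list of crawlers grows, the chain of elseif blocks gets long. One alternative, sketched here, is to keep the search strings in an array and loop over it. detectCrawler and the $crawlers list are illustrative assumptions and not part of the attached script; Googlebot-Image is listed before Googlebot because the first match wins.

```php
<?php
// Hypothetical lookup table: search string => name to write to the log.
$crawlers = array(
    "Googlebot-Image" => "Google Image Bot",
    "Googlebot"       => "GoogleBot",
    "Yahoo! Slurp"    => "Yahoo! Slurp",
    "Bingbot"         => "Bingbot",
    "msnbot"          => "Msnbot",
);

// Return the log name of the first crawler found in the user agent, or null if none match.
function detectCrawler($useragent, array $crawlers)
{
    foreach ($crawlers as $needle => $name) {
        // stripos is case-insensitive and returns false when there is no match.
        if (stripos($useragent, $needle) !== false) {
            return $name;
        }
    }
    return null;
}

$googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
$browser   = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)";

var_dump(detectCrawler($googlebot, $crawlers));            // string(9) "GoogleBot"
var_dump(detectCrawler("Googlebot-Image/1.0", $crawlers)); // string(16) "Google Image Bot"
var_dump(detectCrawler($browser, $crawlers));              // NULL
```

Adding another crawler is then a one-line change to the array rather than a new elseif block.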
This version simply saves the data into a .txt file. You could go one step further and save it into a MySQL database, then query it from another file to allow you to sort and filter it better.
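For the database route, a minimal sketch using PDO with a prepared statement might look like this. The crawl_log table, its columns, the saveCrawl function and the connection details are all assumptions made for illustration, so adjust them to your own database.

```php
<?php
// Assumed MySQL table (create it once):
//   CREATE TABLE crawl_log (
//       id INT AUTO_INCREMENT PRIMARY KEY,
//       bot VARCHAR(50) NOT NULL,
//       page VARCHAR(2048) NOT NULL,
//       crawled_at DATETIME NOT NULL
//   );

// Hypothetical helper: insert one crawl into the assumed crawl_log table.
function saveCrawl(PDO $pdo, $botName, $page, $crawledAt)
{
    // Placeholders keep the logged URL from being interpreted as SQL.
    $stmt = $pdo->prepare(
        "INSERT INTO crawl_log (bot, page, crawled_at) VALUES (?, ?, ?)"
    );
    return $stmt->execute(array($botName, $page, $crawledAt));
}

// Example usage (placeholder connection details; uncomment and adjust to try it):
// $pdo = new PDO("mysql:host=localhost;dbname=example_db;charset=utf8mb4", "example_user", "example_password");
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// saveCrawl($pdo, "GoogleBot", "http://curomarketing.com/", date("Y-m-d H:i:s"));
```

Once the rows are in a table, you can sort and filter them with ordinary SQL, for example by ordering on crawled_at or selecting a single bot.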
The full code (or download below):
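<?php
// Build the full URL of the page being requested.
function getAddress()
{
    // HTTPS is not set on plain HTTP requests, and some servers set it to 'off'.
    $protocol = (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 'https' : 'http';
    return $protocol.'://'.$_SERVER['HTTP_HOST'].$_SERVER['REQUEST_URI'];
}
$currentpage = getAddress();
// Some clients send no User-Agent header, so fall back to an empty string.
$useragent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
$time = date("j F Y H:i:s");
// stripos returns the match position or false, so compare against false explicitly.
// Googlebot-Image is checked first because its name also contains "Googlebot".
if (stripos($useragent, "Googlebot-Image") !== false)
{
    $file = fopen("crawled.txt","a");
    fwrite($file, "You've been crawled by Google Image Bot on $time For the page $currentpage\n");
    fclose($file);
}
elseif (stripos($useragent, "Googlebot") !== false)
{
    $file = fopen("crawled.txt","a");
    fwrite($file, "You've been crawled by GoogleBot on $time For the page $currentpage\n");
    fclose($file);
}
elseif (stripos($useragent, "Yahoo! Slurp") !== false)
{
    $file = fopen("crawled.txt","a");
    fwrite($file, "You've been crawled by Yahoo! Slurp on $time For the page $currentpage\n");
    fclose($file);
}
elseif (stripos($useragent, "Bingbot") !== false)
{
    $file = fopen("crawled.txt","a");
    fwrite($file, "You've been crawled by Bingbot on $time For the page $currentpage\n");
    fclose($file);
}
elseif (stripos($useragent, "msnbot") !== false)
{
    $file = fopen("crawled.txt","a");
    fwrite($file, "You've been crawled by Msnbot on $time For the page $currentpage\n");
    fclose($file);
}
?>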
crawled.zip - .zip file with crawled.php that we have just created through this tutorial in case you are struggling with anything.
Feel free to leave a comment or get in touch if you are struggling with anything.

