Thursday 30 April 2009

Better Search Results

Having added search to the blog last week, I had a further play around with it and noticed that certain searches were not returning the results I expected. So I dug a little deeper into the documentation for Google Custom Search to try to figure out why.

Whilst the documentation suggests that Google tries to build an index of 10,000 pages for each custom search engine, it starts by using just the pages already in the main Google index. The problem, of course, was that not all the pages from this blog were in the main index, so they were not showing up in the results of searches using the custom search either.

Now, as Rob pointed out in a comment on the previous post, the way to tell Google (and most other search engines) about the pages on your site is via a sitemap. A sitemap is an XML file that lists the URLs within your website and, for each URL, how important it is, when its content last changed and how often that content is likely to change. You can find full details of the file format and protocol at the official sitemaps homepage. Whilst the file format is very straightforward, collecting together all the information is a little more tricky.
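
To make the file format concrete, here is a minimal Python sketch that writes a sitemap containing the four pieces of information described above. It is only an illustration and not the code inside SitemapGenerator; the function name `build_sitemap`, the dictionary keys and the example URL are my own assumptions, while the element names and namespace come from the sitemaps protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for an iterable of page descriptions.

    Each entry is a dict with the (hypothetical) keys
    loc, lastmod, changefreq and priority.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]                # the page URL
        ET.SubElement(url, "lastmod").text = entry["lastmod"]        # date of last change
        ET.SubElement(url, "changefreq").text = entry["changefreq"]  # e.g. daily, monthly
        ET.SubElement(url, "priority").text = f"{entry['priority']:.1f}"  # 0.0 to 1.0
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://example.com/", "lastmod": "2009-04-30",
         "changefreq": "daily", "priority": 1.0},
    ]))
```

A real sitemap file would normally also start with an XML declaration, which this sketch leaves out for brevity.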

It turns out that Google will accept RSS feeds as well as normal sitemaps, so I could just associate the RSS feed that Blogger generates with the custom search engine, but that wouldn't give me full control over the sitemap or let me tell other search engines about this blog. Instead I've written a small tool, SitemapGenerator, which, when given a folder structure containing a copy of a website, will generate a sitemap. I won't go into the details here as I'm sure they would bore most of you, but in essence the tool uses regular expressions to let you assign different importance and update frequencies to different pages within the site. The tool can also automatically notify a number of different search engines, including Google, of any changes to your sitemap.
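
For the curious, the regular expression idea can be sketched in a few lines of Python. This is not how SitemapGenerator is actually configured; the rule list, the patterns, the values and the `classify` function are all hypothetical and only show the general approach of matching each page's path against a list of patterns and taking the first hit.

```python
import re

# Hypothetical rules: (pattern, priority, change frequency). First match wins.
RULES = [
    (re.compile(r"^/index\.html$"), 1.0, "hourly"),            # the front page
    (re.compile(r"^/\d{4}/\d{2}/.+\.html$"), 0.8, "monthly"),  # individual posts
    (re.compile(r"^/search/label/"), 0.3, "weekly"),           # label listings
]

# Fallback for pages that match no rule (0.5 is the protocol's default priority).
DEFAULT = (0.5, "monthly")


def classify(path):
    """Return (priority, changefreq) for a site-relative path."""
    for pattern, priority, changefreq in RULES:
        if pattern.search(path):
            return priority, changefreq
    return DEFAULT


if __name__ == "__main__":
    for page in ["/index.html", "/2009/04/better-search-results.html", "/about.html"]:
        print(page, classify(page))
```

Putting the most specific patterns first keeps the rules easy to reason about, since later and broader patterns only apply to pages nothing else has claimed.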

Now sitemaps make updating the pages used by a Google custom search engine really easy. I have an automatic process in place that checks this blog for changes every hour and, if needed, regenerates the sitemap and notifies Google of the changes. This meant that only 25 minutes after yesterday's post went up, searching for the word daemon was returning the new post.
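
The check-and-notify step could look something like the Python sketch below. It is a guess at the general shape rather than a description of my actual process: the function names are made up, the change detection simply compares a hash of the newly generated sitemap with the previous one, and the notification URL follows the `/ping?sitemap=` form described in the sitemaps protocol, which a given search engine may no longer support. The sketch only builds the URL and does not make any network request.

```python
import hashlib
from urllib.parse import urlencode


def sitemap_changed(new_sitemap, previous_digest):
    """Return (changed, digest) for the freshly generated sitemap text."""
    digest = hashlib.sha256(new_sitemap.encode("utf-8")).hexdigest()
    return digest != previous_digest, digest


def ping_url(search_engine_base, sitemap_url):
    """Build a notification URL of the form <search engine>/ping?sitemap=<url>."""
    return f"{search_engine_base}/ping?{urlencode({'sitemap': sitemap_url})}"


if __name__ == "__main__":
    # Hypothetical hourly run: None means no digest was stored by an earlier run.
    changed, digest = sitemap_changed("<urlset>...</urlset>", None)
    if changed:
        # The search engine base URL here is a placeholder, not a real endpoint.
        print(ping_url("http://searchengine.example", "http://example.com/sitemap.xml"))
```

Storing the digest between runs means the search engines are only notified when the sitemap really has changed.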

Of course, sitemaps aren't just used for building better custom search engines; they are also used in the generation of the main Google index (and the indexes of the other major search engines). So if you have a website and are worried that it isn't being properly indexed, then build a sitemap and register it either via my tool or via Google's Webmaster Tools page (if you want to get stats about the effectiveness of the sitemap, then register it via the web page before allowing the program to notify Google). For example, not only did yesterday's post appear in the results of my custom search engine within 25 minutes, but the page also appeared in the main Google index at about the same time.

So if you have read this far, you might well be interested in having a play with the SitemapGenerator. Download it, have a look at the README file for instructions and then start making sure your website is properly indexed by the main search engines. And of course, if you have any problems with the software or suggestions for improvements, then leave me a comment.
