Thursday 30 April 2009

Better Search Results

Having added search to the blog last week I had a further play around with it and noticed that certain searches were not returning the results I expected. So I dug a little deeper into the documentation for Google Custom Search to try and figure out why.

Whilst the documentation suggests that Google try and build an index of 10,000 pages for each custom search engine, they start by using just the pages already in the main Google index. The problem of course was that not all the pages from this blog were in the main index and so they were not showing up in the results of searches using the custom search either.

Now as Rob pointed out in a comment on the previous post the way to tell Google (and most other search engines) about the pages on your site is via a sitemap. A sitemap is an XML file that basically lists all the URLs within your website, how important each URL is, when the content of each URL last changed and how often the content of each URL is likely to change. You can find full details of the file format and protocol at the official sitemaps homepage. Whilst the file format is very straightforward collecting together all the information is a little more tricky.

It turns out that Google will accept RSS feeds as well as normal sitemaps so I could just associate the RSS feed that Blogger generates with the custom search engine but that wouldn't allow me full control over the sitemap or to tell other search engines about this blog. Instead I've written a small tool, SitemapGenerator, which when given a folder structure containing a copy of a website will generate a sitemap. I won't go into the details here as I'm sure they would bore most of you, but in essence the tool uses regular expressions to allow you to assign different importance and update frequencies to different pages within the site. The tool can also automatically notify a number of different search engines, including Google, of any changes to your sitemap.
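To give a flavour of the idea, here is a minimal Python sketch of regex-driven sitemap generation. It is not the SitemapGenerator code itself: the rules, the priorities, the file names and the base URL are all made up for the example, and the real tool may work quite differently. It walks a folder, matches each page against an ordered list of patterns and writes the `loc`, `lastmod`, `changefreq` and `priority` elements defined by the sitemap protocol.

```python
import re
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical rules: (pattern, priority, changefreq). The first match wins.
RULES = [
    (re.compile(r"^index\.html$"), "1.0", "daily"),
    (re.compile(r"^\d{4}/\d{2}/.+\.html$"), "0.8", "monthly"),
    (re.compile(r".*"), "0.5", "yearly"),  # fallback, always matches
]


def build_sitemap(root: Path, base_url: str) -> str:
    """Return sitemap XML for every .html file found under root."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in sorted(root.rglob("*.html")):
        rel = page.relative_to(root).as_posix()
        # Pick the priority and update frequency from the first matching rule.
        priority, changefreq = next(
            (p, c) for rx, p, c in RULES if rx.match(rel)
        )
        # Use the file's modification time as the "last changed" date.
        modified = datetime.fromtimestamp(page.stat().st_mtime, tz=timezone.utc)
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url.rstrip("/") + "/" + rel
        ET.SubElement(url, "lastmod").text = modified.strftime("%Y-%m-%d")
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Build a tiny throwaway site so the sketch can be run as-is.
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp)
        (site / "2009" / "04").mkdir(parents=True)
        (site / "index.html").write_text("home")
        (site / "2009" / "04" / "example-post.html").write_text("post")
        print(build_sitemap(site, "http://example.org/blog"))
```

Keeping the rules in an ordered list means the most specific patterns can go first, with a catch-all at the end so that every page gets some value.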

Now sitemaps make updating the pages used by a Google custom search engine really easy. I have an automatic process in place that checks this blog for changes every hour and if needed will regenerate the sitemap and notify Google of the changes. This meant that only 25 minutes after yesterday's post searching for the word daemon was returning the new post.

Of course sitemaps aren't just used for building better custom search engines; they are also used in the generation of the main Google index (and the indexes of the other major search engines). So if you have a website and are worried that it isn't being properly indexed, then build a sitemap and register it either via my tool or via Google's Webmaster Tools page (if you want to get stats about the effectiveness of the sitemap then register it via the web page before allowing the program to notify Google). For example, not only did yesterday's post appear in the results of my custom search engine within 25 minutes but the page also appeared in the main Google index at about the same time.

So if you have read this far you might well be interested in having a play with the SitemapGenerator, so download it, have a look at the README file for instructions and then start making sure your website is properly indexed by the main search engines. And of course if you have any problems with the software or suggestions for improvements then leave me a comment.

Wednesday 29 April 2009

The Eagle Eyed Daemon

I recently read a very good book that definitely deserved a blog post all of its own. The day I finished reading it though I watched a film that shared so many ideas and themes with the book that I decided I'd write a joint book/film review. OK I'll stop waffling now and start by telling you about the book.

I often get annoyed with films and books when computer technology is depicted as it never seems to be accurate or even feasible! For instance, don't get me started on using a computer virus to beat the aliens in Independence Day -- probably one of the worst abuses of computers in film history! So when I saw an advert for the novel Daemon by Daniel Suarez I was intrigued enough to order a copy. Daniel Suarez is currently an independent consultant who has previously developed enterprise software for the defence, finance and entertainment markets so he should know a thing or two about software development and computers in general. Fortunately he can also craft an exceptionally good novel. Rather than trying to describe the novel myself here is the blurb from the back of the book:
Computer genius Matthew Sobol is dead, but his final creation lives on. An infernal web of autonomous computer programs, Sobol's Daemon feasts on the lifeblood of our hyper-connected society: information. Gathering secrets and stealing identities, it soon has the power to change lives as well as the power to take them. Those who serve the Daemon are rewarded; those who defy it are eliminated. Recruiting acolytes from the dispossessed and disaffected, the Daemon grows stronger with each passing day. We face a stark choice: confront a faceless, formless monster or learn to live in a world in which we are no longer in control.
It really is a cracking good read which effortlessly weaves good and accurate computer information into the story without it getting in the way. As an example, if you know why putting ' or 1=1-- into the username and password fields of a website might get you logged in then you should enjoy this book.
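For anyone who doesn't know why that string works, here is a small Python sketch using an in-memory SQLite database. The table, the user and the function names are invented for the illustration (and a real site would never store plain text passwords); the point is only to show what happens when user input is pasted straight into the text of a query, and how a parameterised query avoids it.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway database holding one made-up user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")
    return conn


def unsafe_login(conn: sqlite3.Connection, username: str, password: str) -> bool:
    # Vulnerable: the input becomes part of the SQL itself. With the
    # username ' or 1=1-- the WHERE clause turns into
    #   username = '' or 1=1
    # and everything after -- is treated as a comment and ignored.
    query = (
        "SELECT COUNT(*) FROM users "
        f"WHERE username = '{username}' AND password = '{password}'"
    )
    return conn.execute(query).fetchone()[0] > 0


def safe_login(conn: sqlite3.Connection, username: str, password: str) -> bool:
    # Parameterised: the input is only ever treated as data, never as SQL.
    query = "SELECT COUNT(*) FROM users WHERE username = ? AND password = ?"
    return conn.execute(query, (username, password)).fetchone()[0] > 0


if __name__ == "__main__":
    db = make_db()
    attack = "' or 1=1--"
    print("unsafe:", unsafe_login(db, attack, attack))
    print("safe:  ", safe_login(db, attack, attack))
```

The unsafe version lets the attacker in because `1=1` is true for every row, while the safe version simply looks for a user whose name really is `' or 1=1--` and finds nobody.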

So the evening I finished reading Daemon I sat down and watched the film Eagle Eye. I won't try and write a review as I'm sure I'll give away the plot so here is the trailer:


So you can see similar themes: a computer contacts people and forces them to do what it wants. Now there are vast plot differences between the book and the film and other than the general theme they have almost nothing in common. I would say that the book is infinitely superior to the film but if you don't have time to read 450-odd pages and would rather be entertained for the evening then I would highly recommend the film. Of course my star ratings are a little biased as I have to compare the film both to the book and to other films I have reviewed.

My Rating for Daemon: 5 Stars A really good book. A terrific thriller and with the added bonus of accurate and believable computer details.

My Rating for Eagle Eye: 4.5 Stars A really good entertaining film. It doesn't quite make 5 stars simply as it is similar in theme to Daemon but not in the same league.

Friday 24 April 2009

Standard Proprietary

I've no idea how come I ended up looking at an advert for an Eddy Current Can Separator but I thought I'd share another daft use of language with you all. Here is the sentence from the advert that mentions replacement parts:
All bearings and drive gear are standard proprietary items that can be obtained from either Magnapower or most bearing suppliers.
I hope the problem with this sentence is obvious to everyone, but if not I'll give you a clue.

Something that is standard is usually well understood and can be made by different people. For example, screws are standard. They come in set sizes and lots of people make them. Screws from one manufacturer are very much like those from another.

Something that is proprietary is usually shrouded in mystery or secrecy in some way. For example, drugs are often proprietary. They are developed (at huge cost) by a company who then patents the drug allowing only them to manufacture it for a number of years. The patent will be written to make it as difficult as possible for a competitor to copy even if the law allowed them to.

So how can a replacement part be standard proprietary? It is either standard, at which point you could buy a replacement from a number of manufacturers, or it is proprietary and you will have to buy it from the sole company who owns the rights to the design. I can't really see a middle ground.

I'm guessing that what they mean is that the parts are proprietary but that they are so common on their machines that most companies who sell bearings and drive belts will stock them.

Tightfisted Cheapskates

Being a Yorkshireman I'm stereotyped as being careful with my money and I have known a few tightfisted cheapskates. However, I think this headline on the BBC news ticker last night beats them all.


You would think that International Donors would be able to scrape together more than a measly £172.

In fact if you check the full article you will find that they raised £172m. Although, being a pedant, that is actually an even smaller amount. I hope that what the journalist meant was £172M -- as a symbol m is milli or ×10⁻³, whereas M is mega (a million) or ×10⁶.
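For the fellow pedants, the arithmetic can be spelled out in a few lines of Python. This is just a sketch: the `pounds` helper and the little prefix table are made up for the example, and `Decimal` is used so the small amount doesn't pick up floating point noise.

```python
from decimal import Decimal

# The two SI prefixes in question: lower-case m is milli, upper-case M is mega.
SI_PREFIXES = {"m": Decimal("1e-3"), "M": Decimal("1e6")}


def pounds(value: int, prefix: str) -> Decimal:
    """Scale a value by an SI prefix, reading the result as pounds."""
    return value * SI_PREFIXES[prefix]


if __name__ == "__main__":
    print(f"£172m = £{pounds(172, 'm'):f}")
    print(f"£172M = £{pounds(172, 'M'):f}")
```

Read strictly, £172m comes to a little over 17 pence, which really would make for some tightfisted donors.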

Thursday 23 April 2009

Unable to Fly

I was happily cooking dinner last night when I saw a weird looking bird out of the corner of my eye. From a distance all I could see was a mound of fluffy black/brown feathers. Closer inspection revealed.... well I'm assuming it is a blackbird but whatever it is it shouldn't have been let out of the nest on its own! The poor thing couldn't fly. It would get about a centimetre off the floor before crashing back to earth.

It didn't seem too distressed though and after posing for a few photos hopped out of sight under a sharp prickly hedge and so should be all right for a while. We made sure there was plenty of bird food on the ground feeder so hopefully it won't go hungry.

Tuesday 21 April 2009

Search This Blog

The eagle-eyed amongst you may have spotted that I've added a search box to the blog (top right if you haven't spotted it yet).

I've got frustrated a few times recently when trying to find old posts so that I could link to them in new articles and so had started to think about adding a search feature. A request from a reader (thanks GB) provided the extra kick of motivation to finally get me to sort out a proper solution.

Now I could easily have added search to the blog by enabling the Blogger navbar but I don't like the navbar for two reasons: it looks terrible but, more importantly, the search feature is weird.

The search provided by the navbar is limited to just searching the content of each post (and I assume the title) but not the labels or comments which can make finding old posts more difficult than it should be. So I decided to try a different approach.

Google provide a rather cool custom search service. This allows you to build your own search engine by providing filters that select just a subset of Google's main web index. The simplest option for creating a custom search engine for a blog is to provide a single filter that selects everything. So for this blog I could have used www.dcs.shef.ac.uk/~mark/blog/* as the filter. Whilst this is easy the downside is that you get a lot of repetition in the search results. Remember that each post appears on its own page as well as on the monthly archive page and the page for each of the labels it is tagged with. To get around this problem I'm actually using three filters to select just the post pages. I need three filters as I've posted articles in three different years (2007, 2008 and 2009). So the first filter is www.dcs.shef.ac.uk/~mark/blog/2007/* and I'm sure you can guess at the other two. Of course this solution isn't perfect either. Firstly, when we move into 2010 I have to remember to add a new filter, and secondly, the whole page is now being indexed which again can lead to repetitive search results. For example, only a few posts contain the word sugar (4 I think) but it appears on every page as it is in the blog description. Fortunately Google is quite good at filtering these useless results out as you can see here.
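To make the difference between the two approaches concrete, here is a rough Python sketch. It uses the standard library's `fnmatch` as a stand-in for the wildcard matching; Google's own rules may well differ in the details. The three page addresses are hypothetical examples of a post page, an archive page and a label page, not real URLs from this blog.

```python
from fnmatch import fnmatchcase

# A single filter that selects everything under the blog.
ALL_PAGES = ["www.dcs.shef.ac.uk/~mark/blog/*"]

# One filter per year, so that only the post pages are selected.
POSTS_ONLY = [
    f"www.dcs.shef.ac.uk/~mark/blog/{year}/*" for year in (2007, 2008, 2009)
]

# Hypothetical addresses, invented purely for the example.
PAGES = [
    "www.dcs.shef.ac.uk/~mark/blog/2009/04/example-post.html",
    "www.dcs.shef.ac.uk/~mark/blog/archive/example-month.html",
    "www.dcs.shef.ac.uk/~mark/blog/labels/example-label.html",
]


def selected(url: str, patterns: list[str]) -> bool:
    """True if the URL matches at least one of the wildcard patterns."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


if __name__ == "__main__":
    for page in PAGES:
        print(page, selected(page, ALL_PAGES), selected(page, POSTS_ONLY))
```

With the single filter all three pages are selected, so the same post can turn up three times; with the per-year filters only the post page gets through.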

There are quite a few options available for customizing the search engine so I may fine-tune things later but for now at least I have a useful search tool for me and my readers. If you spot anything weird or have any suggestions then please leave me a comment.

If you have been annoyed at the way the standard navbar search works then I'd certainly have a play with the custom search service. Of course there are no limits on what pages you can include in the search. You could create a search engine which indexes multiple blogs or pages you frequently visit. One thing I have noticed though is that if you change any settings (or when you create the search engine) it can take five or ten minutes before the changes take effect so don't be surprised if it doesn't work straight away -- I couldn't understand why I got no results for any search to start with but after about five minutes it started to work just as I had expected.

Thursday 16 April 2009

Hart-to-Heart

It seems that quite a few of you assumed that I couldn't spell the name of the restaurant that featured in my last post. The restaurant is called The White Heart and not the more common The White Hart. I'd never even thought about the name before, probably as the first thing you notice about the restaurant isn't necessarily the name but rather the large white heart logo on the "Welcome to Penistone" sign.

It turns out that prior to being refurbished by its current owners it was called The White Hart. I don't know why they changed the name although the cynic in me would suggest that the reason may be advertising. Apparently The White Hart is the 5th most popular British pub name and so I guess that standing out from all the others could be a problem. If, however, you search for the restaurant's name on Google it pops up at #2 -- interestingly behind The White Hart Inn in Manchester so even Google must think I'm spelling it wrong!

Monday 13 April 2009

Dessert and an Easter Egg!

Instead of me cooking yesterday we went out for lunch with my parents. Ever since we moved to Penistone we have wanted to have a meal at The White Heart and this seemed the perfect opportunity. The weather was nice and so we walked the few hundred yards to the restaurant to help build up an appetite.

First impressions were good -- they had Black Sheep best bitter on tap! The rest of the meal didn't disappoint. I had mussels in a white wine sauce, followed by roast beef and Yorkshire pudding (even though I'm a Yorkshireman I very rarely eat Yorkshire pudding unless it is toad-in-the-hole) followed by brown bread ice cream. I chose my dessert based on the fact that it sounded interesting and I'd never heard of it before. It was great and I'll certainly be hunting out a recipe at some point (this one looks promising).

The staff were really friendly and we had a great time and as a finishing touch we got an Easter Egg. Apparently as it was Easter Sunday each table was given a chocolate egg as they left (we got a Cadbury Flake egg). It's nice, different touches like this (and Rufus) that make places memorable and ensure that we will be going back at some point!

My Rating: 5 Stars If you want to eat out in Penistone then this restaurant would be the first place I'd suggest. Great food and friendly staff will ensure that we go back again.

Friday 10 April 2009

What's In My Head?

Two things have happened at work recently that have caused me to think about what is stored in my head.

I've been to quite a number of meetings, workshops and conferences as part of my job but as yet I've never made such an impression that anyone has asked for my contact details (although to be fair the papers I've published do have my e-mail address on them) and so I've never felt the need to have business cards. Well someone obviously thinks I now know enough that people may want to remember me and so I now have 250 nicely printed business cards that I can hand out. It is also quite strange seeing my title written out on them as I'm only Doctor to the bank; to everyone else I'm still just Mr -- having two Doctors in the house would just make sorting the post very confusing!

The second thing that happened kind of defeats the point of having business cards though -- I had to sign a non-disclosure agreement. Of course I can't tell you anything about the agreement, not even who it is with or what aspects of my work it will cover. So while I may have interesting things stored away in my head I might not be able to talk about them and hence it is less likely that anyone will want one of my business cards.

Monday 6 April 2009

Technical Regulations

I've been enjoying the first two weeks of the new F1 season and have already gotten used to the rather strange looking cars. The reason the cars look different (and possibly the reason the races have been so interesting) is that this year the technical regulations have been dramatically altered. Basically the aerodynamic aspects of the car now produce less down-force, which has meant the re-introduction of slick tyres so the cars have enough grip to stay on the track, but should produce less turbulence allowing for closer and more interesting racing. Following these technical changes is a challenge but fortunately there is help.

The official F1 website has a technical section. It is divided by year and race (as well as a general category for new things introduced during testing etc.) and each interesting change to a car is highlighted with a technical drawing and explanatory text.

The main issue so far this year has been the diffusers and they are covered in great detail. There are pages on the three cars with interesting diffusers (Williams -- here and here, Brawn and Toyota) as well as on some of those who have taken a very literal interpretation (Ferrari and McLaren) and some discussion of the merits of both approaches.