XHTML - Love the Googlebot

The Googlebot is possibly a life-form unique in the world, and it is almost certain to be written into the history of the 21st century. More precisely, it is the software Google uses to grab information about web pages to make its searches more accurate. Right now Google is the most used search engine on the Internet, having knocked Yahoo off the top spot a few years ago. Love the Googlebot: it loves XHTML.

Why is Google so important?

Google began life as a search engine in 1998; by 2004 it was performing 80% of all the searches on the Internet. In a nutshell, Google runs the Internet. The majority of people go to Google if they want to find something. Google lists over 8 billion pages in its indexes, and the unique (and patented) methods it uses for finding and listing pages for its search results work in harmony with your semantically coded web page. There is no point building a web page if nobody is going to see it. If you want your page to be seen, it needs to be listed on Google, and it needs to be listed accurately under the right terms when someone does a web search.

Example 17: Click here to see the example >>> | Click here for the style sheet >>>

Become one with the Googlebot

Google uses something called the Googlebot to look around the Internet at websites, read each page, and automatically list the page in Google's indexes under keywords chosen for their relevance to the page.

The Googlebot is a web browser without a screen. It understands the basic rules of HTML, in that it places more importance, or relevance, on an <h1> heading tag than on an <h2>, and so on.
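To make the idea of heading relevance concrete, here is a minimal sketch in Python using only the standard library. It is not Google's code: the weights in HEADING_WEIGHTS and the names HeadingCollector and rank_headings are my own assumptions, invented purely to show how a screenless reader could rank headings by tag level.

```python
from html.parser import HTMLParser

# Hypothetical weights: the values Google really uses are not public.
HEADING_WEIGHTS = {"h1": 6, "h2": 5, "h3": 4, "h4": 3, "h5": 2, "h6": 1}


class HeadingCollector(HTMLParser):
    """Collects (weight, tag, text) for every heading in a document."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading tag we are currently inside, if any
        self._buffer = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_WEIGHTS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            self.headings.append((HEADING_WEIGHTS[tag], tag, text))
            self._current = None


def rank_headings(markup):
    """Return headings ordered from most to least heavily weighted."""
    parser = HeadingCollector()
    parser.feed(markup)
    parser.close()
    # sorted() is stable, so headings of equal weight keep document order.
    return sorted(parser.headings, key=lambda item: -item[0])


sample = "<h2>Skillsets</h2><h1>Curriculum Vitae</h1><h3>Visual Design</h3>"
for weight, tag, text in rank_headings(sample):
    print(weight, tag, text)
```

Whatever the real numbers are, the point of the sketch is the ordering: an <h1> outranks an <h2>, which outranks an <h3>, regardless of how any of them are styled on screen.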

The example file that has been built up through this paper began life as a very basic set of heading and paragraph tags with a list, and it appeared ordered but very bland on the screen in a browser window. By linking an external style sheet (the style.css file) to the document, a specific set of layout characteristics was applied to the tags to arrange their content in a more graphically enhanced manner on the page, without altering the semantic meaning of the tags or their structure in the document.

How does the Googlebot see a web document?

When the Googlebot visits a web document it views the source code just as a web browser window does. Unlike a web browser window, however, the Googlebot discards any display characteristics and focuses solely on the relevance of the HTML tags and on some, but not all, of the attributes that can be set in each one.
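As a rough illustration of discarding display characteristics, the Python sketch below walks a fragment of the example document and keeps only tag names, text, and a short list of attributes. The choice of which attributes to keep (KEPT_ATTRS) is my assumption, as are the names IndexView and index_view; Google does not publish the real list.

```python
from html.parser import HTMLParser

# Assumption: only these attributes matter to an indexer. Everything
# else (class, id, width, height, src and so on) is treated as display
# detail and thrown away.
KEPT_ATTRS = {"img": {"alt"}, "a": {"href"}}


class IndexView(HTMLParser):
    """Records tags, kept attributes and text in document order."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        wanted = KEPT_ATTRS.get(tag, set())
        kept = {name: value for name, value in attrs if name in wanted}
        self.items.append(("tag", tag, kept))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace between tags
            self.items.append(("text", text))


def index_view(markup):
    """Return the stripped-down view of the markup as a list of tuples."""
    parser = IndexView()
    parser.feed(markup)
    parser.close()
    return parser.items


sample = (
    '<p class="imageholder">'
    '<img src="dave.jpg" width="150" height="200" alt="Picture of Dave" />'
    "<br />Photo: June 2005</p>"
)
for item in index_view(sample):
    print(item)
```

In this sketch the class, width and height values never reach the output, while the alt text and the caption survive, which is roughly the split between ignored and indexed content described above.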

Here is the first part of the source for the example document. The code shown in strong type is approximately what the Googlebot sees; the code in the normal typeface is ignored by the indexing software.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>Dave Howe</title>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  <link href="style.css" rel="stylesheet" type="text/css" />
</head>
<body>
  <h1 id="pagebanner">David Howe</h1>
  <p class="imageholder">
    <img src="dave.jpg" width="150" height="200" alt="Picture of Dave" />
    <br />
    Photo: June 2005
  </p>
  <h1>Curriculum Vitae</h1>
  <p>I am a highly proficient designer, programmer and project co-ordinator with five years experience applying these skills to the medium of the Internet. Self-motivated both in thought and action I am able to work equally well within a team or on my own. I am keen to meet new challenges and have a wide variety of skills accumulated from both higher education and experience working on a range of computer-based projects. My early training in the architectural field has provided me with a high level of professionalism and planning which I apply to every new task I undertake.</p>
  <ul id="contentbar">
    <li class="contents">
      <h1>Contents</h1>
      <ul>
        <li><a href="#skills">Skillsets</a></li>
        <li><a href="#career">Career To Date</a></li>
        <li><a href="#education">Education</a></li>
      </ul>
    </li>
  </ul>
  <h2 id="skills">Skillsets</h2>
  <p>Here is a brief list of my skill sets banded into groups. </p>
  <h3>Visual Design</h3>
  <p>Creation...

It won't place much relevance on anything beyond that. The Googlebot reasons that the more important something is to a page, the nearer the top it will be. Not really a surprise, is it? So why is it that the first word you see on most web documents (and therefore the word Google treats as the most important thing in the document) is the word 'Welcome'? At the time of writing, Google listed 1.4 billion documents matching the word 'welcome'.

Become one with the Googlebot. It reads the XHTML structure of the example and learns that the document is text, XHTML and in English. It picks out the <title> tag content as the most important, then the first <h1>, then a "Picture of Dave" according to the alt attribute description; it's a photo taken in June 2005, and this is a Curriculum Vitae. The Googlebot deduces that all of these things are of top relevance and importance because they are at the top of the code.

The next element encountered is the first introductory paragraph of text. Google ignores short words, common words, and words that provide structure rather than meaning in a sentence, and tends to focus on longer words, especially verbs and nouns. The words highlighted are not exactly what the Googlebot might select; they are merely my rough guess for illustrative purposes, although it is an informed guess.
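The filtering of short and common words can be sketched in a few lines of Python. This is only my guess at the general shape of the idea: the STOP_WORDS list, the MIN_LENGTH threshold and the function name candidate_keywords are all made up for illustration, and the sketch makes no attempt to tell verbs and nouns from other parts of speech.

```python
import re

# Hypothetical stop-word list and length threshold, chosen for
# illustration only; a real indexer would use something far larger.
STOP_WORDS = {"with", "these", "this", "that", "both", "from", "have", "which"}
MIN_LENGTH = 4


def candidate_keywords(text):
    """Return the longer, less common words in order of first appearance."""
    # Lower-case words, keeping hyphenated forms such as "co-ordinator".
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    keywords = []
    for word in words:
        if len(word) < MIN_LENGTH or word in STOP_WORDS:
            continue
        if word not in keywords:  # skip repeats
            keywords.append(word)
    return keywords


sentence = "I am a highly proficient designer, programmer and project co-ordinator"
print(candidate_keywords(sentence))
```

Run against the opening words of the example paragraph, the sketch drops "I", "am", "a" and "and" and keeps the six longer words, which is close to the kind of selection I have highlighted by hand.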

After the first paragraph of text the Googlebot will focus mainly on the heading, image and list tags and their content, and less on the words in paragraph text. Special attention is paid to the <a> links in a document. Google may store these destinations and send the Googlebot to visit each one in turn after visiting the main document destination. In the example, the bookmark links force the hand of the Googlebot down the page to the subheadings of the document; the fact that they are linked means they must be important.
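Here is a minimal Python sketch of how a crawler might sort the <a> links it finds into same-page bookmarks and destinations to visit later. The names LinkCollector and collect_links are my own, and the portfolio.html link in the sample is hypothetical (it is not in the example document); it is there only to show the second category.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Separates same-page bookmark links from other link destinations."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []  # ids targeted by links such as href="#skills"
        self.to_visit = []   # other destinations a crawler might queue

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        if href.startswith("#"):
            self.bookmarks.append(href[1:])
        else:
            self.to_visit.append(href)


def collect_links(markup):
    """Return (bookmarks, to_visit) for the given markup."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.bookmarks, parser.to_visit


sample = (
    '<ul><li><a href="#skills">Skillsets</a></li>'
    '<li><a href="#career">Career To Date</a></li>'
    '<li><a href="#education">Education</a></li></ul>'
    # Hypothetical extra link, not part of the example document.
    '<p><a href="portfolio.html">Portfolio</a></p>'
)
bookmarks, to_visit = collect_links(sample)
print("Bookmarks:", bookmarks)
print("To visit:", to_visit)
```

The bookmark ids match the id attributes on the subheadings (for instance id="skills" on the Skillsets <h2>), which is how the contents list points a reader, human or otherwise, at the sections that matter.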

The document is both browser and search engine friendly. Not only that, most other kinds of reader will treat the page in a similar way because of the hierarchy of standardised tags and the way they reflect the relevance of the content. All things to all things (including people). It is truly semantically harmonious.

Technohippy
