
Wednesday, December 29, 2010

Do Search Engines Index That?

Approximately four months ago, a small group of Web developers representing the International Web Developers Network (IWDN) devised an interesting search engine experiment. The question had been raised whether the text within a link's title attribute gets picked up by search engines or not. None of us knew the answer, and before long, we discovered we didn't know the answer to a lot of questions regarding how search engines index text in HTML.

The purpose of this experiment was to see which methods of presenting information on a Web site would get picked up by search engines and which methods would be ignored. We focused initially on three major search engines, Google, MSN and Yahoo!, in an effort to keep the experiment as tidy as possible. As we were conducting our experiment, a new IWDN member introduced us to Agent 55, a handy tool that, amongst other things, shows search results for up to 10 predetermined search engines at a time. This allowed us to expand our test group with ease, and we took full advantage of this. We would have been satisfied staying with the "Big Three," but having more search engines available to test at the same time made expanding our results convenient.

Designing the experiment

We created an HTML document with a list of test terms served up in a variety of ways. Some were displayed using different JavaScript techniques. Some were incorporated into title, name and rel attributes. Some used character encoding techniques. Identical test pages were deployed on a handful of sites, all linking to each other, and all with inbound links external to the testing network. Measuring the results of this experiment was simple: if a term showed up in a search engine's results, the method used to deliver that term was one the search engine understood.

View a copy of the actual search engine test page here.

The following chart shows the 14 testing terms we tried and results returned to us from 10 different search engines:

Each of the terms within that chart was presented to search engines using a different method or in a different context. Here is the key we followed for the presentation of each term:
  1. Text inserted as innerHTML.
  2. Text inserted using document.write.
  3. Title text for a generic element.
  4. URL text for a link with no linking text.
  5. Title text for a link with no linking text.
  6. Rel text for a link with no linking text.
  7. URL text for a link with linking text present ("and").
  8. Title text for a link with linking text present ("and").
  9. Rel text for a link with linking text present ("and").
  10. A div hidden using visibility:hidden.
  11. A div hidden using display:none.
  12. A word written using HTML decimal entities ("&#123;").
  13. A word written using HTML hexadecimal entities (for example, "&#x61;" for "a").
  14. The text within an anchor tag's name attribute.
  15. Control word. No special formatting or delivery.
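
To make the key above concrete, here is a minimal Python sketch that builds one HTML fragment per delivery method. It is not the original test page: the term names ("term01" and so on), the "page.html" target and the "slot1" id are hypothetical placeholders, and the exact markup we used may have differed.

```python
def build_fragments(terms):
    """Return one HTML fragment per delivery method, keyed by the number in the key.

    `terms` maps method number -> test word. All markup here is an assumed
    reconstruction for illustration, not the original test page.
    """
    t = terms
    return {
        1: f'<p id="slot1"></p><script>document.getElementById("slot1").innerHTML = "{t[1]}";</script>',
        2: f'<script>document.write("{t[2]}");</script>',
        3: f'<span title="{t[3]}">text</span>',
        4: f'<a href="{t[4]}.html"></a>',
        5: f'<a href="page.html" title="{t[5]}"></a>',
        6: f'<a href="page.html" rel="{t[6]}"></a>',
        7: f'<a href="{t[7]}.html">and</a>',
        8: f'<a href="page.html" title="{t[8]}">and</a>',
        9: f'<a href="page.html" rel="{t[9]}">and</a>',
        10: f'<div style="visibility:hidden">{t[10]}</div>',
        11: f'<div style="display:none">{t[11]}</div>',
        # Decimal numeric character references, one per character.
        12: "".join(f"&#{ord(c)};" for c in t[12]),
        # Hexadecimal numeric character references, one per character.
        13: "".join(f"&#x{ord(c):x};" for c in t[13]),
        14: f'<a name="{t[14]}"></a>',
        15: f"<p>{t[15]}</p>",
    }


# Hypothetical test words; a real test would use made-up words with no
# existing search results so that any hit can be traced to one method.
terms = {n: f"term{n:02d}" for n in range(1, 16)}
fragments = build_fragments(terms)

for number, fragment in sorted(fragments.items()):
    print(number, fragment)
```

Giving each method its own unique word is what makes the measurement simple: a search hit for a given word can only have come from the one fragment that carries it.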

What we learned

The first thing we learned is ExaLead wasn't interested in indexing any of our test pages. We may explore why this is the case another time, but for the purposes of this experiment, it is not important to do so. It is duly noted, and it may be assumed future references to the testing group in this article do not include ExaLead.

The second thing we learned is most major search engines seem to follow a similar indexing scheme. There are only two anomalies in our results. The first is that MSN indexed made-up HTML document names, almost as if it expected those documents to exist somewhere. Search results actually pointed to the URLs where our fictitious pages would have existed if they were not fictitious. This is an interesting indexing technique, one we initially suspected was a bug, but it may be intentional. It seems developers can create links to future pages and get them indexed in MSN before those pages are actually created! Perhaps this is one way MSN attempts to index pages faster than its rivals, at a small cost in accuracy in portraying the Internet landscape.

The other anomaly involved serving up hexadecimal entities. Decimal entities were accepted by all search engines, but Yahoo! and WiseNut did not index hexadecimal entities. This leads one to wonder whether there are any exploitable scenarios here. It's possible a developer could "show" Yahoo! and WiseNut less content than seen by other search engines, increasing perceived keyword density without affecting how visitors digest a site's content. That is an interesting experiment for another day.

Otherwise, considering the number of testing scenarios we ran on the number of search engines listed, we've concluded indexing is a pretty standardized practice.

Third, this experiment appears to support the current understanding that search engines read a Web page at the code level, not the screen level. We used very direct JavaScript methods for feeding content onto the screen, and the resulting text was ignored (we predicted this). Likewise, elements permanently obscured using two different CSS methods did not escape indexing by any of the search engines involved, meaning the page style was ignored in favor of the page structure, something search engines could only do if they looked at the page at the code level.
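
As a rough illustration of what reading at the code level means, the following Python sketch parses markup with the standard library's html.parser, applying no CSS and running no JavaScript. It is only a model of the behavior we observed, not how any search engine is actually implemented; the class name and the sample words are assumptions made for this example.

```python
from html.parser import HTMLParser


class CodeLevelReader(HTMLParser):
    """Collect text as a markup-only reader would: no CSS, no script execution."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.words = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script and style bodies are skipped; all other text is kept,
        # whatever the style attribute on its container says.
        if not self._skip_depth:
            self.words.extend(data.split())


# Hypothetical sample page with four made-up words.
page = """
<p>control</p>
<div style="display:none">hiddenone</div>
<div style="visibility:hidden">hiddentwo</div>
<script>document.write("scripted");</script>
"""

reader = CodeLevelReader()
reader.feed(page)
reader.close()
indexed = reader.words
print(indexed)
```

The words inside the two hidden divs are collected because the parser never evaluates the style attribute, while the word that only exists once document.write runs is never seen as page text.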

What our results mean to developers

First, developers who stuff keywords into title attributes solely for the purpose of gaining a search engine boost may stop now. Search engines ignore this text, and the title attribute is meant for human visitor interaction. It should be used for this purpose only. The same goes for the rel attribute. It should be used to denote a relationship between objects (its original purpose), not as a vehicle for keyword dissemination.

Second, if a developer has the need to obscure characters at the code level, it is safer to use decimal rather than hexadecimal entities.
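
For developers who want to follow this advice, here is a small Python sketch of encoding a word as decimal entities, with the hexadecimal form alongside for comparison. The function names and the sample word are hypothetical. The standard library's html.unescape decodes both forms to the same text, which is what a browser does as well; the difference we measured was only in what some search engines indexed.

```python
import html


def to_decimal_entities(text):
    """Encode every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def to_hex_entities(text):
    """Encode every character as a hexadecimal numeric character reference."""
    return "".join(f"&#x{ord(ch):x};" for ch in text)


word = "widget"  # hypothetical sample word
decimal_form = to_decimal_entities(word)
hex_form = to_hex_entities(word)

print(decimal_form)
print(hex_form)

# Both encodings decode back to the same visible text.
print(html.unescape(decimal_form) == html.unescape(hex_form) == word)
```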

Third, there is no need to use keyword-rich names when filling in the name attribute for the sake of search engines. It is best to use names that make sense from a developer's perspective, since there's no indexing or ranking boost to be gained here.

Knowing how markup is indexed by search engines gives developers even more incentive than they already have to keep their markup clean and free of keyword spam. Please note: this experiment does not measure the effectiveness of the words indexed, that is, how much they affect a site's rank when presented in certain ways. It only demonstrates whether they are recognized or not. Also, we do not condone pursuing the exploitable scenarios presented above. There may be unknown repercussions within search engines for doing so, and human visitors may be adversely affected as well.
