Are Search Engines Crawling Links in HTML Comments?
Posted by Marios Alexandrou, SEO Resume.
I was looking through a client's data in Google Webmaster Tools recently and noticed tens of thousands of bad internal links. All of these links could be boiled down to a handful of patterns which struck me as odd because I should've noticed these bad links before. A little digging revealed that Google has been aggressively crawling links that are in JavaScript code.
I could sort of see search engines following fully qualified links (i.e. with http and a full path) in JavaScript code since there's a good chance those links will lead to a real page. However, the links I found were part of a concatenation of strings that required execution of the JavaScript to actually be valid. Here's what I mean:
link = 'http://' + somevariable + '/somepage.php' + someothervariable;
Google decided that it would be a good idea to check out '/somepage.php', which of course is invalid without the variables before and after it.
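To make the failure mode concrete, here is a minimal sketch of how a crawler that does not execute JavaScript could still end up requesting '/somepage.php'. This is only an assumption about the general technique (scanning source text for quoted strings that look like paths); it is not a description of how Google actually works, and the function name extractPathLikeLiterals is made up for illustration.

```javascript
// Hypothetical sketch: pull "path-like" string literals out of raw JavaScript
// source without running it. Not Google's actual method, just an illustration.
function extractPathLikeLiterals(source) {
  // A quoted literal whose content starts with a single "/" and contains
  // no whitespace or quotes. The closing quote must match the opening one.
  const literal = /(['"])(\/[^\s'"\/][^\s'"]*)\1/g;
  const found = [];
  for (const match of source.matchAll(literal)) {
    found.push(match[2]);
  }
  return found;
}

// The concatenation from the post, treated as plain text.
const snippet =
  "link = 'http://' + somevariable + '/somepage.php' + someothervariable;";

console.log(extractPathLikeLiterals(snippet)); // [ '/somepage.php' ]
```

The scanner never sees the values of somevariable or someothervariable, so the only thing it can report is the fragment in the middle, which is exactly the kind of partial URL that turns into a 404.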
So what this long-winded intro leads to is a question: what is the current state of affairs at Google (and the other search engines) when it comes to links in HTML comments? I should have an answer to that in a few days...
Update: January 8, 2009
My experiments suggest Google isn't crawling links that are within HTML comment tags.
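If you want to run a similar check on your own pages, a first step is knowing which URLs appear only inside comments, so you can watch for them in your server logs or crawl error reports. The following is a rough sketch under stated assumptions: it uses regular expressions rather than a real HTML parser, the function name linksInHtmlComments and the sample URLs are hypothetical, and it is not the method used for the experiment above.

```javascript
// Hypothetical helper: list href values that appear inside HTML comments.
// Regex-based for brevity; a real audit might prefer a proper HTML parser.
function linksInHtmlComments(html) {
  const comment = /<!--([\s\S]*?)-->/g;        // body of each <!-- ... -->
  const href = /href\s*=\s*(['"])(.*?)\1/gi;   // quoted href attribute values
  const links = [];
  for (const c of html.matchAll(comment)) {
    for (const h of c[1].matchAll(href)) {
      links.push(h[2]);
    }
  }
  return links;
}

// Made-up page: one live link and one link hidden in a comment.
const page =
  '<p><a href="/live.html">live</a></p>' +
  '<!-- <a href="/hidden-test.html">old</a> -->';

console.log(linksInHtmlComments(page)); // [ '/hidden-test.html' ]
```

Any URL returned here that later shows up in crawler requests would be evidence that commented-out links are being followed.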

Google is content hungry and has been finding links in JavaScript for some time now.
Jaan,
I was under the impression that they were only following "clean" links in JavaScript. To assume a link is legit just because it starts with a forward slash doesn't seem like a good idea. Does Google really have so much extra crawling bandwidth that it's better to follow such links rather than to dig deeper using regular HTML links?
For a large website I monitor, Webmaster Tools showed a jump of 1.5 million 404s in the space of a month. These 404s were partial URLs in JavaScript and, as you said, without the correct full path information Google can only guess at the construct.
Now the question becomes: does Googlebot follow robots.txt to ignore links IT creates?
A definitive answer to this question would help those who practice SEO. I, for one, would like to have an answer.
Tell me about it. I spend hours modifying scripts because of Google trying to decode JavaScript. On one site I have some scripts that magnify images in popups when visitors click on them. Despite the nofollow on the link from the script to the popup, and the noindex,nofollow meta tags on the popup page, Google still indexes these pages by decoding the scripts. So I had to use forms with the POST method for the popup clicks. Forms have always been effective against link indexing because the server gets a chance to verify pretty much everything before sending content to the client. It's just more work to go down that path (because Google doesn't want to follow its own rules, like nofollow, in some cases).