Search engines see only one in 500 of the accessible pages out there – but a new approach could open up vast new data mines

With billions of web pages in their indexes, you might imagine that if something is online, search engines will find it for you. In reality, the vast majority of web pages are effectively invisible to them.

Some of this "deep web" consists of isolated pages with few, if any, hyperlinks, making them difficult to index. Much is stuff you wouldn't want to see anyway: web pages detailing old flight reservations, for example, or out-of-date product reviews on Amazon. However, a large proportion is believed to consist of openly accessible databases of everything from information on used cars to the prices of airline seats.

Even ignoring password-protected and other private sites, the deep web is estimated to be at least 500 times the size of the "surface" web visible to search engines. And by some estimates only 16 per cent of the surface web has been indexed by search engines – that is just 0.03 per cent of the whole (see "Lost in cyberspace").
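The 0.03 per cent figure follows from the two estimates quoted above. As a rough check, the short sketch below treats the surface web as one unit and the deep web as 500 units; the function and variable names are illustrative only.

```python
def indexed_share(deep_to_surface_ratio: float, surface_indexed_fraction: float) -> float:
    """Fraction of the whole web (surface plus deep) that search engines have indexed."""
    surface = 1.0                                  # treat the surface web as one unit
    deep = deep_to_surface_ratio * surface         # deep web, in the same units
    indexed = surface_indexed_fraction * surface   # indexed part of the surface web
    return indexed / (surface + deep)


# Estimates quoted in the article: deep web 500 times the surface web, 16 per cent indexed.
share = indexed_share(500, 0.16)
print(f"{share:.2%}")  # prints 0.03%
```

Because the deep web estimate is a lower bound ("at least 500 times"), the true share would be no larger than this.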

Juliana Freire at the University of Utah in Salt Lake City thinks that even this figure is over-optimistic. She is developing Deep Peep, a specialist search engine that trawls so-called "form-fronted" databases. These are sites whose interfaces require search terms to be typed in before the information stored in the database can be called up. Since it isn't practical to ask each of these sites individually for an index of its contents, the challenge is to get this information automatically.
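To give a feel for what "form-fronted" means in practice, here is a minimal sketch of one step an automated crawler might plausibly take: spotting a search form on a page and listing the fields it expects. It is not a description of how Deep Peep works; the sample page, field names and helper names are all invented for illustration, and only Python's standard library is used.

```python
from html.parser import HTMLParser


class FormFieldFinder(HTMLParser):
    """Collects, for each <form> on a page, its action, method and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            name = attrs.get("name")
            if name:  # unnamed controls (e.g. a bare submit button) carry no query data
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def find_forms(html: str) -> list:
    """Return a description of every form found in the given HTML."""
    finder = FormFieldFinder()
    finder.feed(html)
    return finder.forms


# Hypothetical page for a used-car database; not taken from any real site.
SAMPLE_PAGE = """
<html><body>
<form action="/search" method="GET">
  <input type="text" name="make">
  <input type="text" name="max_price">
  <input type="submit" value="Search">
</form>
</body></html>
"""

forms = find_forms(SAMPLE_PAGE)
print(forms)
```

Knowing the field names is only a starting point: a crawler would still have to work out which values to submit in order to draw out the database's contents, which is the hard part the article describes.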
