New techniques find meaning in words
By Alan Cane
Published: October 8 2008 03:00 | Last updated: October 8 2008 03:00
Markets may rise and fall but the quantity of information collected and stored by businesses only seems to rise.
The need to be able to search quickly and efficiently through all this material for legal, regulatory or plain business reasons has given rise to a subset of search technology - enterprise search.
It may seem little different, in essence, to conventional web searching, but trawling through an organisation's data mountain is a much tougher nut to crack.
Colin Hadden, UK country manager for Sinequa, a French company whose search engine uses a combination of techniques including semantics, outlines the difficulties.
"First, the data are not designed to be found, are of irregular quality and they aren't linked.
"Second, enterprise users search for people as well as content - they want to find the author of a document as well as the document itself. Third, security is essential - it may be a legal requirement and certain individuals may only be allowed to look at certain documents.
"And fourth, the data are stored in a variety of places and formats - documents, e-mails, the internet, databases, spreadsheets and, more recently, audio and video material."
All of which may explain why companies' experience of implementing search systems has not always been satisfactory. Royce Bell, who leads Accenture's Information Management Services, says businesses are beginning to realise that they are not the internet and the tools and techniques needed to search within an organisation are different from Googling at home.
The complexity is multiplied by the fact that many large businesses are distributed, so searches have to run across many platforms.
"Quite a few organisations are grappling with that now," he says, pointing out that some companies are simply giving up on unstructured search, especially where audio and video are concerned. Instead, they insist the material is tagged or identified so it can easily be recovered at a later date.
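As a rough illustration of the tagging approach Mr Bell describes, combined with the security requirement Mr Hadden raises, the sketch below retrieves audio and video by their tags alone and hides anything the user is not cleared to see. The `Asset` fields, the group names and the sample data are all hypothetical, invented for this example, and it does not represent any vendor's product.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    asset_id: str
    kind: str  # e.g. "video", "audio", "document"
    tags: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def find_by_tags(assets, wanted_tags, user_groups):
    """Return ids of assets carrying every wanted tag that the user may see."""
    wanted = {t.lower() for t in wanted_tags}
    results = []
    for asset in assets:
        # Security check: skip assets that none of the user's groups may open.
        if not (asset.allowed_groups & user_groups):
            continue
        # Tag check: only metadata is compared, the recording itself is never read.
        if wanted <= {t.lower() for t in asset.tags}:
            results.append(asset.asset_id)
    return results


# Hypothetical sample data, for illustration only.
ASSETS = [
    Asset("vid-1", "video", {"AGM", "2008"}, {"board", "investor-relations"}),
    Asset("aud-2", "audio", {"AGM", "2007"}, {"board"}),
    Asset("vid-3", "video", {"training", "2008"}, {"all-staff"}),
]
```

A board member searching for "agm" would see both recordings, while someone in investor relations would see only the one that group is cleared for. The trade-off is the one the article describes: retrieval is simple and fast, but only as good as the tags people apply.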
But even if enterprise search is technically harder than consumer search, the user should not be aware of the complexity of the software under the bonnet (or hood - to point out the kind of semantic difficulties search systems face).
Matt Glotzbach, head of products for the enterprise arm of Google, the company whose name is virtually synonymous with consumer search, argues that the objectives of enterprise search and consumer search are similar.
"One fallacy is that enterprise users want to search in some fundamentally different way from consumer users," Mr Glotzbach says. "They don't. We, as Google, have trained the world at large how search is supposed to work: there is a single interface, not 10 interfaces for 10 kinds of content; you enter a couple of words and in less than a second you get the right result."
The knowledge and experience Google has built up over the years developing these rules has been fed into its enterprise offering, the Google Search Appliance, introduced six years ago and now with 20,000 deployments.
A self-contained combination of hardware and software, the appliance is installed "behind the firewall" - that is, connected directly to the company's IT infrastructure.
It is now in its fifth iteration. In August this year, Google released a new version with revamped internal architecture to improve speed and provide the capability to handle up to 10m documents on a single server.
Mr Glotzbach says the revamp was designed to satisfy what he sees as the two big trends in enterprise search: first, the spectacular growth in the volume of information that companies are now storing; and second, the expectation among business users of Google-like quality of search behind the firewall.
He says the system is marketed as an appliance for simplicity and ease of operation.
He explains: "One of the problems we see in most companies with large search engines is the complexity of setting them up and managing them. Often this puts so much work on the systems administrators and IT staff that the project fails."
In this, he is in step with Mr Hadden, who says that the Sinequa system was designed to cut the cost and complexity that prevent companies from making the most of search.
Google's early success was based on matching keywords and a now-legendary algorithm, or mathematical procedure, for listing web pages in the most useful order.
Today, Mr Glotzbach says, its technology goes well beyond simple word matches, using hundreds of algorithms and signals to produce results. He points out that semantic or knowledge-based search, which essentially identifies non-obvious relationships between different pieces of content, has been part of Google's armoury for years.
"Understanding" is an important concept in modern search technology.
Mike Lynch, chief executive of Autonomy, a world leader in unstructured search, says the field is being transformed as technologies are developed that give computers the appearance of being able to understand meaning.
"At the moment, most search engines don't understand what they are looking for. They just match words.
"Type in 'dog' and the engines don't understand dog, they just recognise the characters d, o and g. These newer technologies understand meaning: a dog is an animal, it is man's best friend, a labrador is an example of one."
Dr Lynch argues that the meaning of search itself is changing: "Advanced search means the ability to process things."
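The contrast Dr Lynch draws with his "dog" example, between matching characters and recognising that a labrador is a dog and a dog is an animal, can be sketched in a few lines. The `IS_A` table below is a tiny hand-built assumption made up for this illustration. It is not how Autonomy or any other product works, and real systems derive such relationships on a far larger scale.

```python
# Hypothetical, hand-built "is a" links, invented for this illustration.
IS_A = {
    "labrador": "dog",
    "poodle": "dog",
    "dog": "animal",
    "cat": "animal",
}


def concepts(word):
    """Return the word followed by every broader concept above it in IS_A."""
    chain = [word]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def keyword_match(query, text):
    """Literal matching: the query must appear as a whole word in the text."""
    return query.lower() in text.lower().split()


def concept_match(query, text):
    """Match if any word in the text is the query or a narrower kind of it."""
    q = query.lower()
    return any(q in concepts(word) for word in text.lower().split())
```

Given the text "My labrador needs a walk", `keyword_match("dog", ...)` finds nothing because the letters d, o and g never appear together, whereas `concept_match("dog", ...)` succeeds because the table records that a labrador is a kind of dog.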
The original aim of search was to retrieve material for human beings to think about but, increasingly, content is retrieved for computers to process. "For the first 40 years of the IT industry, computers were not very clever, so you took the real world and laid it out in a simple form," says Dr Lynch.
"Now as we move into advanced search, computers can handle the real world as it is: a computer can read an e-mail and decide what to do by itself," he says. One new technology is capable of "understanding" the emotional content of, say, an e-mail by analysing the text.
There are essentially two ways to create this kind of "meaning-based computing". First, through an understanding of linguistics, including semantics; second, an approach which treats language mathematically, calculating probabilities to infer context.
Experts argue over the merits of these approaches but as Dr Lynch says: "Both these methods are way ahead of the things people normally think of as search based on keyword matching."
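To give a flavour of the second, mathematical approach, the sketch below uses a small naive Bayes classifier, one common probabilistic technique chosen here purely as an example, to infer from the surrounding words whether "jaguar" refers to an animal or a car. The article does not say which methods any vendor uses; the training snippets and function names are assumptions made up for this illustration.

```python
import math
from collections import Counter

# Hypothetical, tiny training set: snippets labelled with the sense of "jaguar".
TRAINING = [
    ("animal", "the jaguar hunts prey in the jungle"),
    ("animal", "a jaguar is a big cat with spots"),
    ("car", "the jaguar dealer sold a new car"),
    ("car", "jaguar engine and car performance review"),
]


def train(examples):
    """Count words per label so probabilities can be estimated later."""
    word_counts = {}
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def infer_context(text, model):
    """Return the most probable label for the text under naive Bayes."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        # Log prior plus the add-one smoothed log likelihood of each word.
        score = math.log(label_counts[label] / total_docs)
        for word in text.split():
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


MODEL = train(TRAINING)
```

No grammar or dictionary is involved: the program simply calculates which sense makes the observed words most probable. With this toy data, "jaguar spotted in the jungle" is assigned to the animal sense and "jaguar car review" to the car sense.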
Charles Armstrong, chief executive of Trampoline Systems, a UK-based company, agrees that searching for individuals is becoming as important as searching for documents.
"A business's main asset is its people, their knowledge and experience. In the 1990s, there was a boom in technologies that tried to make sense of millions of documents but they only gave half the picture.
"The rise of Web 2.0 in the consumer world alerted business to the role that social contacts and networks play. When you are dealing with a project that requires a particular knowledge, you look for the person with the knowledge, not a document."
Mr Armstrong says Trampoline's search engine is the first to analyse not just the content of documents but the professional networks of those connected to the documents.
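The idea of combining what people write with whom they exchange it can be sketched as follows. This is not Trampoline's method, which the article does not detail; the message data, the scoring rule and the `network_weight` parameter are all hypothetical choices made for the example.

```python
from collections import defaultdict

# Hypothetical data: (author, recipients, text) for each message or document.
MESSAGES = [
    ("alice", ["bob"], "pricing model for the energy contract"),
    ("alice", ["carol"], "energy contract risk review"),
    ("bob", ["alice"], "notes on the energy contract"),
    ("dave", ["carol"], "office party planning"),
]


def find_experts(messages, topic, network_weight=0.5):
    """Rank people by what they wrote on a topic and by whom they exchange it with."""
    content_score = defaultdict(float)
    contacts = defaultdict(set)
    for author, recipients, text in messages:
        if topic in text.lower().split():
            content_score[author] += 1.0
            for person in recipients:
                # Record the link in both directions: it is a relationship.
                contacts[author].add(person)
                contacts[person].add(author)
    total = {}
    for person in set(content_score) | set(contacts):
        # Network signal: credit for being connected to people who write on the topic.
        neighbour_score = sum(content_score[c] for c in contacts[person])
        total[person] = content_score[person] + network_weight * neighbour_score
    # Highest score first; ties are broken alphabetically by name.
    return sorted(total.items(), key=lambda item: (-item[1], item[0]))
```

Searching this toy data for "energy" ranks alice first, then bob, then carol. Carol has written nothing on the subject but appears because she corresponds with someone who has, which is the kind of lead a document-only search would miss.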
He foresees that a big part of search in the future will be "doing less of it", as apparently sentient systems anticipate the users' needs and provide relevant materials automatically.
Copyright The Financial Times Limited 2008