Unix Spiders: Enterprise Search

I attended the Enterprise Search meetup on Tuesday 4th October 2011, and here are some brief notes from the event. We had two talks and a 'fishbowl' discussion. I'll concentrate on the talks, and bring in issues from the 'fishbowl' where necessary. The talks were from Iain Fletcher and Charlie Hull.

1. Iain Fletcher - Search Technologies. Data Quality, the Missing Ingredient for Enterprise Search

The essential problem introduced in the talk was that poor data leads to poor search, and to dissatisfaction among users with the service they get from the deployed system. Problems increase with time: search administrators need to constantly update their systems to cope with growth in data. Failure to tackle these problems leads to increasing user dissatisfaction - evidence shows that half the time or less, users do not get what they want. Relevance ranking has come a long way over the years; however, many ranking algorithms need to be tuned to the collection they retrieve on, optimizing constants such as b and k1 in the BM25 matching function. I've done some of this type of work myself, and had some success at the Web Track @ TREC.
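To make the role of those two constants concrete, here is a minimal sketch of the BM25 weight for a single query term in a single document. It is not from the talk: the function name, the argument names and the example numbers are my own, the idf shown is one commonly used non-negative variant, and the defaults k1=1.2 and b=0.75 are just common starting values rather than tuned ones.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document (illustrative sketch)."""
    # Inverse document frequency: rare terms count for more than common ones.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b sets how strongly long documents are penalised (0 = no length normalisation).
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 sets how quickly repeated occurrences of the term stop adding weight.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Hypothetical collection statistics: 1000 documents, the term occurs in 10 of them.
short_doc = bm25_term_weight(tf=3, doc_len=100, avg_doc_len=500, n_docs=1000, doc_freq=10)
long_doc = bm25_term_weight(tf=3, doc_len=1000, avg_doc_len=500, n_docs=1000, doc_freq=10)
print(short_doc, long_doc)
```

With b above zero, the same term frequency is worth less in the long document than in the short one; with b=0 document length is ignored altogether. This is why the best values depend on the collection being searched.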

A possible solution is auto-personalisation; however, Iain suggested that this may not work well, and there is evidence for this in the academic literature.

Iain stated that search engines tend to rely on meta-data, and he gave the example of Google, who rely on well written pages from which meta-data can be extracted. Classification using the extracted meta-data can then narrow down the search, giving the user some ability to improve their search results (as always with search, this must depend on the user's ASK, or anomalous state of knowledge). When writing web pages to be retrieved, it is best to remove non-relevant text from the page to increase the chances of the page being retrieved - this is the process of 'cleaning' the data. An example would be having a bio of an author on every page of their website - this would impact on search negatively. A process of normalization can be used to ensure that relevant text is put together on the same page, increasing the chance of better search results.
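As a rough illustration of the 'cleaning' step, the sketch below drops any paragraph that is repeated across most pages of a site, such as an author bio. This is my own simplification rather than anything shown in the talk: the function name, the 50% threshold and the sample pages are all assumptions, and a real pipeline would need something more robust than exact paragraph matching.

```python
from collections import Counter

def strip_repeated_blocks(pages, max_share=0.5):
    """Remove paragraphs that occur on more than max_share of the pages."""
    # Split each page into paragraphs on blank lines.
    split_pages = [
        [p.strip() for p in page.split("\n\n") if p.strip()] for page in pages
    ]
    # Count how many pages each distinct paragraph appears on.
    page_counts = Counter()
    for paragraphs in split_pages:
        page_counts.update(set(paragraphs))
    threshold = max_share * len(pages)
    # Keep only the paragraphs that are not repeated across too many pages.
    return [
        "\n\n".join(p for p in paragraphs if page_counts[p] <= threshold)
        for paragraphs in split_pages
    ]

# Hypothetical pages, each carrying the same author bio.
pages = [
    "Tuning BM25 for web search.\n\nAbout the author: Jo writes about search.",
    "Notes on Solr indexing.\n\nAbout the author: Jo writes about search.",
    "Why metadata matters.\n\nAbout the author: Jo writes about search.",
]
cleaned = strip_repeated_blocks(pages)
print(cleaned)
```

The bio would still belong on one dedicated page, which is the normalization point above: text about the author lives in one place, and the other pages are indexed for what they are actually about.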

Iain then talked about complexity management, which can be a real problem in search. He advocated the use of TQM (Total Quality Management) for search, using a black box method to find problems. Optimizing on one variable is problematic, as one does not know the effect of doing this on other variables - a holistic approach is needed if this is going to work in any sensible way. I myself used a brute force approach to optimizing the tuning constants of the BM25 matching function, but you could think of using machine learning to do this - Microsoft have used Gradient Descent techniques for this kind of work (I can dig up the reference on request).
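For anyone who has not seen it, a brute force approach amounts to trying every combination of k1 and b on a grid and keeping the pair that scores best on a set of queries with known relevant documents. The sketch below shows the idea on toy data. Everything in it is an illustrative assumption rather than a record of the TREC experiments: the function names, the tiny collection, the grid values and the choice of mean reciprocal rank as the measure.

```python
import math
from itertools import product

def bm25_score(query, doc, docs, k1, b):
    """Sum of BM25 term weights for a tokenised query against one tokenised document."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        tf = doc.count(term)
        if df == 0 or tf == 0:
            continue  # the term contributes nothing to this document
        idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
        norm = 1 - b + b * len(doc) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

def mean_reciprocal_rank(docs, judged_queries, k1, b):
    """Average of 1/rank of the known relevant document over all queries."""
    total = 0.0
    for query, relevant_idx in judged_queries:
        scores = [bm25_score(query, d, docs, k1, b) for d in docs]
        ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        total += 1.0 / (ranking.index(relevant_idx) + 1)
    return total / len(judged_queries)

def grid_search(docs, judged_queries, k1_values, b_values):
    """Try every (k1, b) pair and return the best one together with its MRR."""
    return max(
        ((k1, b, mean_reciprocal_rank(docs, judged_queries, k1, b))
         for k1, b in product(k1_values, b_values)),
        key=lambda result: result[2],
    )

# Toy collection and judgements: (query terms, index of the relevant document).
docs = [
    "enterprise search data quality".split(),
    "solr search for recruitment search and job matching at scale".split(),
    "oracle relational database".split(),
]
judged_queries = [
    (["solr", "search"], 1),
    (["data", "quality"], 0),
    (["oracle"], 2),
]
k1_values = [0.5, 1.0, 1.2, 2.0]
b_values = [0.0, 0.25, 0.5, 0.75, 1.0]
best_k1, best_b, best_mrr = grid_search(docs, judged_queries, k1_values, b_values)
print(best_k1, best_b, best_mrr)
```

Searching k1 and b jointly, rather than one at a time, is a small-scale version of the point about single-variable optimization above. The cost grows with the size of the grid, which is where gradient-based methods become attractive.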

Iain concluded with a number of suggestions as follows:

Data needs to be thought about properly. Focus needs to be on the data, rather than the search engine.
A formal model of the data is required, and a data model design is needed. In the discussion later, it appeared that in my circumstances no formal document is available to provide this information or the requirements for search. Transparency is a very important factor.
A process to keep search working is essential, and it must adapt to changes in the data as it grows with time. Otherwise the search will break!

2. Charlie Hull - Flax Search. Just the Job - Employing Solr for Recruitment Search

Charlie gave an interesting talk on the practical application of search technologies to a real world case study, in this case Reed Recruitment. Reed Recruitment has significant data problems, with 3 million job seekers in their database and around 300 end users dispersed throughout 350 offices across the UK. Before the new system, their search was implemented on a transactional system using Oracle, the relational database system.

To say the Oracle search was clunky would be something of an understatement. The user had 20 to 30 fields to choose from, and had to wait a significant length of time for the results as hundreds of millions of database records were processed. Structured data such as salaries was held alongside unstructured information such as CVs and job specifications. Oracle is fine for structured data, but very poor for unstructured data IMHO.

In order to create the new search, data had to be extracted from Oracle and transformed to a format which could be used by a search engine such as Solr. Two XML-based processes are defined:

Indexer: extract and process the data from Oracle.
Config: builds and verifies the data for the search engine.
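I don't have the Flax code, but the general shape of the extract-and-transform step can be sketched as follows: rows pulled from the database are turned into Solr's XML update format, an add element containing one doc per record. The rows here are hard-coded dictionaries standing in for an Oracle query, and the function name and field names (job_title, salary, cv_text) are hypothetical, not Reed's actual schema.

```python
import xml.etree.ElementTree as ET

def rows_to_solr_xml(rows):
    """Build a Solr <add> update message from an iterable of dict rows."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            if value is None:
                continue  # skip NULL columns rather than index empty fields
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    # ElementTree escapes characters such as & and < in the field text.
    return ET.tostring(add, encoding="unicode")

# Hypothetical rows, as they might come back from a database cursor.
rows = [
    {"id": 1, "job_title": "Search engineer", "salary": 40000,
     "cv_text": "Solr & Lucene experience"},
    {"id": 2, "job_title": "DBA", "salary": None,
     "cv_text": "Oracle administration"},
]
solr_xml = rows_to_solr_xml(rows)
print(solr_xml)
```

The resulting XML would then be posted to the search engine's update handler. Keeping this step separate from the configuration step means that the data can be checked before it ever reaches the index.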

Charlie described the process using a diagram, which I don't have but which was illuminating and helped understanding (I won't try and replicate it here, my drawing skills are rubbish!). Reed did the interface part of the project, as they know their users well.

Overall I found this a very useful case study of applying open source software to real world problems. Later on in the discussion, there was an interesting interaction on using open source vs. proprietary software. According to Iain Fletcher, the choice is largely down to policy, which invariably means Microsoft. I was reminded of the old adage "nobody ever got sacked for buying IBM"; these days it's "nobody ever got sacked for buying Microsoft"!

The search is now live and working well - Reed are satisfied with it. One interesting fact that emerged was that there is considerable resistance from users who have got used to the old system. This is normal, and reminds me that the only reason Dialog is around is because the information scientists using it demand access to a command line interface (power users who want to retain control of their world, and prevent disintermediation). These problems do not appear to occur with new members of staff, not yet initiated into the ways of the old system.

Sunday, 9 October 2011
