To Bing and Beyond

By Simon Kravis

Dave Hawking is one of Australia's leading Information Retrieval researchers, now working to advance the technology behind Bing, Microsoft's search engine. Simon Kravis sat down with Dave to find out more about his work with Microsoft and the challenges for Enterprise Search.

Dave Hawking boasts a long track record in search technology, beginning in the mid-1990s when he co-ordinated the Very Large Collection and Web Tracks of the Text Retrieval Conferences (TREC). He joined CSIRO in 1998, where he worked for 10 years on text retrieval and search, developing the P@NOPTIC search engine which formed the basis of the Funnelback spin-off company.

He was Chief Scientist for Funnelback between 2008 and 2013 before taking up a position as a Partner Architect with the Bing group at Microsoft. As well as holding an Adjunct Professorship at the Australian National University (ANU) in Canberra, he holds an honorary doctorate in Search from the University of Neuchatel in Switzerland and was joint winner of the UK eInformation Group Tony Kent Strix award in 2012 for his outstanding contribution to Information Retrieval.

Q: What exactly is your role in Bing?

A: I'm a Partner Architect, working in Applied Research. It's a complex organisation with a lot of people and a lot of teams working on supporting the infrastructure and all the different processes that go on in delivering search results. There's quite a learning curve in getting to know how the organisation fits together, who's responsible for what, and trying to identify gaps that are amenable to a research solution or improvement.

Q: How do you find working with a large, globally distributed organisation like Bing?

A: Time zone differences are the biggest impediment to communication. Video, instant messaging and audio technology make it easy to hold multi-party meetings across continents, but finding a mutually convenient time for people in Seattle, Canberra and London is difficult - someone is always up in the middle of the night.

Q: How does Bing fit in with the overall Microsoft R&D effort?

A: It's one of many Microsoft divisions working on their products and services, and is separate from Microsoft Research. We're part of the Applications and Services group, which includes Office and Cortana. With Office 365 being cloud-based, there are many opportunities to use the same sorts of platforms for storing documents and adding value to processes.

Q: Poor financial results for Bing in the past have been the subject of much attention in the IT press. Does Bing have the same kind of advertising-based business model as Google?

A: Operating a large-scale web search engine is a very expensive business. Wired magazine estimates that billions have been spent by Google and Bing on servers and data centres, and revenue has to cover this. Advertising is a large part of the revenue but Bing is fairly tightly integrated into a lot of other Microsoft services. Bing has provided back-end search for Yahoo for a while now. 

Q: Microsoft Online Services' Qi Lu says he wants Bing to be a search tool that understands natural language and does away with the need to use prepositionless, article-free noun-based queries that he calls "caveman-speak". Is natural language understanding seen as key to Bing's future?

A: I've seen this as a potential opportunity for a long time. It's so annoying that purely statistical methods based on independent term occurrence have been able to do as good a job as they have in retrieving relevant search results and in other text analysis tasks. It seems self-evident that an understanding of the meaning of sequences of words must be a way forward. Some progress is being made, in that you can ask (or even speak) questions to large-scale search engines like Google, Bing and Baidu and have the answer presented to you rather than a list of documents relevant to your query. It's an important step forward and the whole industry is moving to extend the range of questions that can be answered in this way.

Q: Do the achievements of the newly miniaturised IBM Watson excite you?

A: Initially web search engines were document retrieval systems: you put in a query and get a list of documents back, hopefully ranked by decreasing probability that they will be useful. Jonathon Fletcher's JumpStation at Stirling University in 1993 was the first web search engine and worked in this way. Now people continue to castigate the 'ten blue links' -- search engines which accept a query in much the same way as in 1993 and return results in what is now a standardised format. But engines like Google, Bing and Baidu are now answering questions, evaluating formulae, solving equations and synthesising information, as well as blending results from image search, shopping search, news search etc. to create an experience that goes way beyond just finding documents relevant to your query. Watson is an interesting example of that type of thing.

Q: Do you know how Bing came to be named?

A: I think the name tries to capture the instantaneous nature of search results – “Bing! There's your answer.” I believe there’s a Wikipedia article which describes the naming process.

Q: Did studying Computer Science lead you into the field of information retrieval?

A: Yes, I came to the Australian National University (ANU) to study physics, but in the vacation before I started I worked in a mental hospital and decided that psychology might be interesting and ended up doing a double major in Psychology, along with Computer Science. I was amongst the first group of Honours students to graduate from ANU in Computer Science. The first search engine I worked with was the card catalogue in the library! I think there were some researchers at the John Curtin School of Medical Research who could send queries by post to Medlars [an indexing system run by the US National Library of Medicine] and two or three weeks later get back a list of documents that matched their search. 

I worked in a number of computer infrastructure support roles at ANU and by 1991 I was in charge of a couple of supercomputers - a Connection Machine CM-2 and a Fujitsu AP1000. In order to do a good job of managing a large-scale parallel machine I thought I needed to write a parallel program, so I built a kind of parallel grep [UNIX search command]. I then realised I had to do something about document ranking and relevance and it became a text retrieval system. In 1994 I participated in the third Text Retrieval Conference (TREC), organised by the National Institute of Standards and Technology (NIST) in Washington.

At the first TREC in 1992 a lot of the participants had extreme difficulty in indexing 2 gigabytes of newswire articles and Government publications. In 1994, City University London produced a ranking algorithm called BM25, which is still very competitive.
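[Editor's illustration: to give a feel for what a ranking formula of this kind does, here is a minimal sketch of BM25-style scoring in Python. It is not the City University code or anything used in Bing or Funnelback. The function name `bm25_scores`, the toy documents, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration, and the inverse document frequency term uses a common non-negative variant rather than necessarily the 1994 formulation.]

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query terms (BM25-style).

    k1 controls term-frequency saturation and b controls document-length
    normalisation; both defaults are illustrative assumptions.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Repeated occurrences help less and less; long documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy collection (hypothetical data).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock markets fell sharply".split(),
]
scores = bm25_scores(["cat", "mat"], docs)
```

In this toy collection the first document matches both query terms and scores highest, the second matches only the more common term, and the third matches nothing and scores zero.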

I wrote some papers about parallelising text retrieval on supercomputers but I pretty soon decided that text retrieval was more interesting than parallelisation. It was a revelation that one relevance scoring formula could be twice as good as another in retrieving accurately. Then I was seconded to the Co-operative Research Centre for Advanced Computational Systems (AcSys) and eventually obtained my PhD (Text Retrieval over Distributed Collections) on the basis of my published papers. A job was then advertised at CSIRO which seemed exactly suited to me and they appointed me without me having to move from Canberra to Melbourne, which they'd initially wanted. 

At that time, CSIRO had to obtain 1/3 of its budget from non-appropriation sources and I could see the risks of trying to do that through consultancies or contracted projects where the work wasn't contributing to the advancement of science. It was also going to be very difficult to find large-scale projects to fund research in the IT sector. I decided to try to get revenue by turning my research into a commercial product for intranet search and licensing it. The actual plan had come from Paul Thistlethwaite at AcSys for commercialising the retrieval research and selling tools for managing and visualising information on an intranet. He had a vision of a time machine, where you could search the web of an organisation as it was at any time in the past, not just now. Unfortunately he died suddenly in 1999 and I only got as far as commercialising the search.

Q: The search engine was named P@NOPTIC. How did this come about?

A: The first installation was on the ANU intranet in 1999 and was called S@NITY. The fact that there were shampoos and a record chain store called Sanity was not a problem, because trademarks are in specific classes and a search engine was in a different class. However, the Sanity record store was planning to operate online and this would have made the trademark difficult to obtain, so we changed the name to Panoptic. My wife came up with the name, which means seeing the whole in one view. The philosopher Jeremy Bentham invented a jail based on this principle, where a warder could see all of the prisoners from one point. There's a version of one at Port Arthur in Tasmania.

Q: How were the early days of Panoptic commercialisation?

A: It took quite a while for the revenue to build up to the point where we were considered successful by the CSIRO management, and then it came in a rush. We were considered failures in 2002, but after that we exceeded our target by 50%, and this caught the attention of Stuart Beil, who had been CSIRO's general manager of commercialisation. He saw that we had a product, customers and a business model and wanted to be involved. This came to pass and, because he knew how to close deals and set prices, the revenue tripled. That became an embarrassment, as CSIRO was now competing with the private sector, and momentum built for a spin-off. Stuart was interested in that, and as CSIRO had re-organised its IT-related research into the ICT Centre, whose new director was keen on commercialisation via spin-offs, the Funnelback spin-off was born on Christmas Eve 2005.

Q: How does a search engine go from being named after a prison design to a fusion of venomous spiders?

A: My wife also came up with this name, after she was bitten by a redback spider. It was a very painful experience but it made her think of the combination of the names redback and funnel web as implying the funnelling back of important information to you.

Q: How did the spin-off go?

A: I'd gone to considerable lengths to set up a 'virtuous cycle' in CSIRO where research led to a product, which generated customers, which in turn led to an understanding of customer problems, access to customer data, improved research and a better product, providing yet more customers.

The problem for CSIRO with the spin-off was that the customers, the data and the use of the product were in the spin-off and the research was in CSIRO. I didn't know at that stage where I wanted to be so I kept a foot in both camps by remaining in CSIRO but being seconded half-time to Funnelback. That created problems of trying to maintain barriers for intellectual property between the two organisations, which were manageable for a while but eventually I decided to leave CSIRO and join Funnelback in 2008.

I was at Funnelback for 5 years and during that time it was bought from CSIRO by an Australian company called Squiz. After that time I was ready for a change as I'd been working on more or less the same thing since 1991. I'd always been interested in web search and large-scale search, so when the opportunity came up to work for Bing in Canberra in 2013, I took it. 

Q: What do you see as challenges for Enterprise Search?

A: There are a number of them. In Funnelback we could perform a universal search of all information repositories within the company using our search engine. There were only 35 staff but we had at least 20 different repositories including external CRM data, Confluence [collaboration], JIRA bug tracking software, databases, internal and external web sites, and email collections. 

We found the ability to search all repositories very useful - people found vital resources they didn't know about, which were stored in unexpected repositories and wouldn't have been found if people had searched only where they expected the information to be. I think this capability would be valuable for many organisations. Quite a few people in Funnelback relied heavily on search for everyday work.

Delivering these benefits to other organisations is complicated by the diversity of repositories. A typical organisation might have a TRIM records management system, Lotus Notes for collaboration and information storage, a Documentum repository and maybe other proprietary information stores. They might also be half-way through a migration into SharePoint. It's a massive task for an Enterprise Search company to build and maintain adapters to extract data from multiple versions of these repositories, each with their own proprietary protocols for access.

Q: Access control issues were seen as a major barrier to enterprise search uptake in a recent survey. Would you agree with this?

A: Yes. Access controls for particular repositories are often out of date, inappropriate, and inconsistent, and deployment of enterprise search exposes these problems. They can arise from organisational restructuring, staff changes or knee-jerk responses to unauthorised accesses. As there are usually a large number of repositories, rationalising access controls to ensure that search results respect policies is a lot of work.

Organisations vary widely in their approach to security: some want security enforced with early binding (recording permissions at indexing time), others want late binding, where current permissions are applied when query results are displayed, or a hybrid of the two.

This choice has a major impact on performance. Another option is 'translucency', where users may see the title of a document but not its content, or receive an indication that documents matching the query exist but that they need to request permission to access them. As well as these security model variations, organisations vary in their requirements for customisation, integration and presentation, and in how results from multiple repositories should be prioritised, which tends to make enterprise search projects quite complex.
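[Editor's illustration: the difference between these security models can be sketched in a few lines of Python. This is a simplified sketch, not Funnelback's implementation or that of any real product. The `Doc` class, the group names, and the `live_acl` lookup that stands in for a repository's current permissions are all hypothetical.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    title: str
    text: str
    indexed_acl: frozenset  # groups allowed to read, as recorded at indexing time


def search(docs, term):
    """Naive matching: every document whose text contains the term."""
    return [d for d in docs if term in d.text.lower()]


def early_binding(hits, user_groups):
    """Filter with the permission snapshot stored in the index (fast, may be stale)."""
    return [d.doc_id for d in hits if d.indexed_acl & user_groups]


def late_binding(hits, user_groups, current_acl):
    """Filter by asking the repository for current permissions (one lookup per hit)."""
    return [d.doc_id for d in hits if current_acl(d.doc_id) & user_groups]


def translucent(hits, user_groups):
    """Always show the title; withhold the content when access is not permitted."""
    return [(d.title, d.text if d.indexed_acl & user_groups else None) for d in hits]


# Hypothetical data: the budget document was opened up to 'exec' after indexing.
docs = [
    Doc("a", "Budget 2014", "budget forecast", frozenset({"finance"})),
    Doc("b", "Staff handbook", "travel budget rules", frozenset({"all"})),
]
live_acl = {"a": {"finance", "exec"}, "b": {"all"}}
user_groups = {"exec", "all"}

hits = search(docs, "budget")
early = early_binding(hits, user_groups)
late = late_binding(hits, user_groups, lambda doc_id: live_acl[doc_id])
shown = translucent(hits, user_groups)
```

Here early binding hides the budget document because the indexed permissions are out of date, late binding shows it at the cost of a permission lookup for every hit, and the translucent view lists its title with the content withheld.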

There have been a number of high-profile failures of Enterprise Search projects, which have contributed to the poor reputation of Enterprise Search. Prospective buyers are often fearful of the cost and doubtful of the benefits. It's a lost opportunity as considerable economic benefit can be derived from effective search of most of the documents within most of the repositories within an organisation.