How Australia captured Census 2011

With census forms flooding into the Australian Bureau of Statistics’ Melbourne Data Processing Centre at a rate of up to three container loads a day, IDM spoke with Executive Director Andrew Henderson to learn how the massive job of capturing and analysing Australia’s census data has evolved since the last census in 2006.

Any profile of the job of conducting a census must begin and end with statistics, and the figures for Australia’s Census 2011 are mind-boggling.

The job of undertaking Census 2011 was budgeted to cost $440 million, or around $19 for each and every Australian, with around a third of that figure required to pay an army of 43,000 field workers who distributed and collected census forms.

In the end this adds up to around 8 million census forms, all of which must be shipped to Melbourne for capture and processing. The containers will keep arriving until January 4, 2012, when the last shipment of census forms from WA is due in.

From then on the ABS will be on a tight schedule to collate the first release of demographic data that must be used by the Commonwealth Government in June 2012 to allocate GST revenue to the states. More complex analysis of more nuanced data will be released later in 2012.

The job of capturing the data submitted on paper census forms is being accomplished by over 750 staff utilising a fleet of 11 new Kodak i1860 high volume scanners acquired for the 2011 Census.

Australia’s 2011 Census was a landmark event, marking 100 years of national Census taking in Australia. It also represents the third census undertaken with IBM’s Intelligent Forms Processing (IFP) software, first adopted for Australia’s census in 2001.

The workload for the ABS Data Processing team this time around was eased somewhat by the strong uptake for online submission of census forms by Australians.

“We are very happy with the success of the eCensus,” said Henderson.

“We got a response rate of 28% which is well up from 9% in 2006.”

The ABS outsourced the job of capturing eCensus submissions to IBM Australia, while guaranteeing privacy by ensuring that data was encrypted from the time it was submitted online to the point at which IBM handed it over to the Melbourne Data Processing Centre.

There were a huge number of variables that hinged on a strong eCensus takeup rate, as every percentage point either way had a direct impact on the size of the task of managing the paper-based submissions in Melbourne.

The ABS budgeted for 750 staff to handle the job of processing the forms, which will number around 8 million. This is comprised of around 52 million double sided pages, scanned as 104 million individual images (down from 65.5 million pages/131 million images in 2006).
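As a quick sanity check, the figures quoted above are internally consistent. A back-of-the-envelope calculation, using only the numbers from the article (the per-form breakdown is derived, not an official ABS figure):

```python
# Rough consistency check on the stated 2011 scanning workload.
forms = 8_000_000    # paper census forms expected
pages = 52_000_000   # double-sided pages across all forms
images = pages * 2   # each page scanned on both sides

pages_per_form = pages / forms

print(f"images to capture: {images:,}")        # 104,000,000
print(f"pages per form:    {pages_per_form}")  # 6.5
```

The 104 million images match the figure quoted, and the implied 6.5 double-sided pages per form reflects an average across household forms of different lengths.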

“We banked on at least 20-25% being submitted online, but we are very happy with the 28-30% we ended up with. IBM was tooled up to deal with a fair bit more than that and would have been comfortable with a 40% eCensus takeup rate,” said Henderson.

On Census night, Tuesday, August 9, IBM was processing over 100 online submissions per second at its peak.

To accommodate the huge growth in online submissions, IBM implemented a significant change in the architecture of its eCensus solution in 2011.

This involved a move to a client-side browser application from the server-side application deployed in 2006. This significantly reduced the server capacity required from IBM, which handled the load in 2011 with three dedicated P-Series servers.

Once the data is captured from census forms by the Kodak scanners, IBM’s Intelligent Forms Processing (IFP) software is used for optical character recognition (OCR) and intelligent character recognition (ICR).

One major new introduction for 2011 was the scanning of coloured maps from Census forms filled out by collectors.

These were mainly required for rural and regional areas where the collectors would mark the location of a property on a map of their collection district. Once the maps were scanned, the ABS computers could compute the latitude and longitude of the mark and cross-reference it to the listed address.
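The conversion from a mark on a scanned map to a geographic coordinate can be sketched with a simple affine (world-file style) transform, a standard approach for georeferenced raster maps. This is an illustrative sketch only — the function name, transform parameters, and coordinates below are assumptions, not the ABS’s actual implementation:

```python
# Sketch: turn a mark detected on a scanned, georeferenced map into
# longitude/latitude using a 6-parameter affine transform.
# All names and numbers here are hypothetical, for illustration only.

def pixel_to_lonlat(col, row, transform):
    """Apply a world-file style affine transform (a, b, c, d, e, f):
    lon = a*col + b*row + c
    lat = d*col + e*row + f
    """
    a, b, c, d, e, f = transform
    return (a * col + b * row + c, d * col + e * row + f)

# Hypothetical transform for a north-up map sheet:
# 0.0001 degrees per pixel, top-left corner at (144.90 E, 37.75 S).
t = (0.0001, 0.0, 144.90, 0.0, -0.0001, -37.75)

lon, lat = pixel_to_lonlat(500, 1200, t)
print(round(lon, 6), round(lat, 6))  # 144.95 -37.87
```

The resulting coordinate could then be matched against the nearest listed address in the collection district, which is the cross-referencing step the article describes.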

“Maps became much more important in 2011,” said Henderson.

“We needed to be able to geographically capture colour maps very accurately, and that’s worked very well.”

There are two main elements to the processing of the paper Census forms: capturing handwritten addresses, and capturing answers to questions relating to birthplace, religion, occupation and so on.

The ABS is finding recognition rates are 10% up on where they were in 2006.

“We are getting autocode rates of better than 90% for capturing addresses,” said Henderson, who believes there are a number of factors behind the improved recognition rates.

“It’s a combination of technology and the way we are using it,” he said.

“Our own server capacity has moved on so far in five years, and there have been great improvements in the quality of scanning with the new Kodak scanners, and the quality of IBM’s recognition technology.

“But there are also improvements in the way we are able to modify our own procedures. We can adjust very rapidly as we get a better understanding of how certain tasks are being processed.”

The ABS undertook a dress rehearsal in 2010, but this involved a test run of only 20,000 households, which could not give a true indication of the task of processing 8 million forms.

“All our testing has shown that autocoding, when it’s working well, gives us better quality outcomes than human processing. If you have 600 people looking at 9 million forms a degree of variance and inconsistency comes into it,” said Henderson.

“Using OCR and ICR gives us much better confidence in the quality of the data and buys us time to spend analysing the data, so if we recognise a need to fine tune our indexes and classification schemes we can roll back and reprocess very rapidly.”

Data analysis
As in 2006, data stored in Oracle databases will be analysed using SuperCROSS, a high-end analytical package from Melbourne company Space-Time Research (STR).

The ABS will also be using an additional package from Space-Time Research called TableBuilder. This will allow visitors to the ABS web site to specify their own queries.

“In the past, because we didn’t have a tool like TableBuilder, we had to manage confidentiality at a micro level. So there were tens of thousands of predefined outputs and you had to select the nearest to what you wanted, whereas high-end users will now be able to specify very detailed tables.

“The tool also ensures we can overlay the appropriate level of confidentiality so we can protect individual information.”

In 2006, Web-based visitors could not gain access to the complete set of raw Census data; instead, the ABS loaded the database with the million or so tables (approximately 30 per collection district, each highlighting demographic factors such as age, sex, race, and language spoken) produced by SuperCROSS. Full access and real-time queries would have imposed a significant processing burden on the ABS servers.
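Those two figures hang together: a million or so predefined tables at roughly 30 per collection district implies a number of collection districts in the low tens of thousands, a derived estimate rather than an official count:

```python
# Rough scale check on the 2006 predefined-output approach.
total_tables = 1_000_000   # "the million or so tables"
tables_per_cd = 30         # "approximately 30 per collection district"

implied_cds = total_tables / tables_per_cd
print(f"implied collection districts: {implied_cds:,.0f}")  # 33,333
```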

“The tools we now have for analysis, like the latest version of SuperCROSS, and our increased server capacity, mean we will have a better understanding and be able to provide high-end users with better intelligence on the quality of the census data,” said Henderson.

While the job of tackling the 2011 Census data is accelerating in pace, the ABS is well advanced in planning for the next census in 2016. In fact, decisions on platform and architecture changes will need to be bedded down in late 2012.

Handheld devices for Census data collectors are one thing definitely on the horizon, and the experience of Brazil has been a great influence here. For its 2010 Census, Brazil equipped its data collectors with 225,000 PDAs and notebooks fitted with GPS receivers that could pinpoint the exact location of a household.

The GPS data was cross-referenced with satellite images to ensure that responses were correctly geo-tagged, making the mapping considerably more accurate. Handheld devices are firmly on the agenda for Australia’s next census in 2016.

“When you are deploying 40,000 units in the field you need confidence in their robustness and the ability to support them,” said Henderson.

“One of the things that has become clear in the past few censuses is that we need to gain a lot more real-time intelligence from our 40,000 collectors in the field. In 2011 we used SMS to tell collectors that we had received an eform from a particular house so they didn’t need to go back there.

“For the next census, we need to bring the householder into the loop, so when someone rings our inquiry centre we can tell them why the collector has not visited yet and when they will.

“We need to close the loop and have a better understanding of what is going on in the field, and a move to handheld devices will help here,” said Henderson.