Census 2006: making the Web count

By David Braue

The need for forward planning during the five-year Census cycle meant that the 2001 Census was just beginning to explore the potential of the Web. For Census 2006, however, the ABS is finally modernising its approach. David Braue finds out what tricks the country's premier statistical organisation has up its sleeve for the $250 million exercise.

If you think you've got a lot of paper floating around, spare a thought for the Australian Bureau of Statistics (ABS). Next year, the country's head statistical agency will once again embark on the massive undertaking that is the Census: a carefully orchestrated tour de force of document management involving the delivery and processing of more than 8.6 million census forms that will be filled out across Australia on the night of August 8, 2006.

The day after, those forms will be collected by nearly 35,000 volunteers, each covering a defined collection district comprising around 250 households. The completed forms will then be sent to the Census Data Processing Centre in Sydney, a process that took four whole months during the last census in 2001.

That time around, it took nearly 1000 employees, 920 computers, 54km of networking cable, and 15 high-volume Kodak 9520 and 2500 scanners a full ten months to scan and extract data from the 117 million pages involved. Fully 1 million data tables, containing some 250 million cells of data, were released to the public, and more than 100,000 people use census data annually for local government, market research, franchise planning and a myriad of other uses.

Next year, the bureau is getting ready to do it all again. Each time around, the bureau's census team looks for new ways to improve the process, evaluating business requirements and looking into the future for new technologies that can further improve the efficiency of the $250 million census. Business needs are identified first, then broad technical requirements; specific technologies are locked down two to three years before the big night to provide a solid target for developers.

This cycle puts the ABS in an interesting position compared with most organisations, which are constantly upgrading their IT infrastructure. Each census upgrade, by contrast, is a major exercise in forward planning and provides the opportunity for significant changes in the process. Yet even a five-year upgrade cycle is rapid compared to the situation in other countries, where inter-census gaps of up to 10 years mean that the entire process must be effectively reinvented once a decade.

For the 2011 census, Census Program head Paul Williams says the focus will probably be on improving methods for forms collection and processing. This time around, however, the ABS has focused its technological change processes on improving the way it delivers its massive volumes of statistical data to its eager and waiting audience.

Power to the people

In the past, distribution of Census data has been through two methods: a CD-ROM filled with raw data and a more advanced product, called CDATA, which combined the raw data with a proprietary application that let users analyse the data in a number of ways.

Making good use of that application, however, has been difficult for casual users: reported problems have ranged from practical issues such as installation dramas all the way to confusion over the ABS's census statistical divisions, which do not correspond to suburbs as users might expect. This made it very difficult for a user to work out which statistical division contained the information they required, and collating information from multiple divisions required ponderous cutting and pasting from the supplied Excel spreadsheets.

In the absence of an alternative, CDATA was simply the best available method of data distribution. With the Internet and broadband now well established, however, the 2006 census will see the ABS make a dramatic shift away from CDATA, instead implementing a Web-based system that Williams says will be far more user-friendly than past applications.

"We have always been putting out as much data as possible, but you currently have to really understand the way we do business to understand how to access data," says Williams. "If you are a novice user, there's a huge learning curve. What people need is the ability to search quickly and get to the particular data items they need, and that's where our improvements are focused."

Data will still be analysed using SuperCROSS, a high-end analytical package from Melbourne company Space-Time Research (STR) that has previously taken care of the heavy lifting during census data analysis. This time around, data will be stored in a conventional Oracle relational database, a significant step forward from the flat-file approach used in the past by the ABS and the soon-to-be-defunct CDATA application.

The addition of better structure to Census data will be matched by improvements in the user interface. STR will front-end the database with a geographical overlay and intelligent parsing systems that will let users search for data by familiar identifiers such as postcode and suburb, and aggregate data from multiple tables. Use of Oracle will allow data tables to be retrieved using standard SQL queries, enabling joining and flexible delivery of data to end users via what is expected to be a simple Web interface.
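To make the idea concrete, here is a minimal sketch of the kind of query such a front end might issue. The table and column names (collection_district, age_by_sex and so on) are hypothetical, and Python's built-in sqlite3 module stands in for the Oracle store; the point is simply that a geography lookup table can be joined to pre-built statistical tables with standard SQL.

```python
import sqlite3

# In-memory stand-in for the Oracle store; schema and figures are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collection_district (cd_code TEXT PRIMARY KEY,
                                      suburb   TEXT,
                                      postcode TEXT);
    CREATE TABLE age_by_sex (cd_code TEXT, age_group TEXT,
                             sex TEXT, persons INTEGER);
    INSERT INTO collection_district VALUES ('1234501', 'Carlton', '3053');
    INSERT INTO age_by_sex VALUES ('1234501', '25-34', 'F', 412),
                                  ('1234501', '25-34', 'M', 398);
""")

# A postcode search becomes an ordinary SQL join: find every collection
# district in postcode 3053 and aggregate one of its pre-built tables.
rows = conn.execute("""
    SELECT a.age_group, a.sex, SUM(a.persons)
    FROM age_by_sex AS a
    JOIN collection_district AS d ON d.cd_code = a.cd_code
    WHERE d.postcode = ?
    GROUP BY a.age_group, a.sex
""", ("3053",)).fetchall()

for age_group, sex, persons in rows:
    print(age_group, sex, persons)
```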

There are limits on just how flexible the new interface can be, however: rather than giving Web-based visitors access to the complete set of raw Census data, the ABS will load the database only with the million or so tables produced by SuperCROSS. These amount to approximately 30 per collection district, each highlighting demographic factors such as age, sex, race and language spoken.

Full access and real-time queries would impose a significant processing burden on the ABS, but that's not the only reason users won't be able to search the raw data to their hearts' content. Legislative controls require the ABS to obscure details of individuals in the data it releases, and a level of "perturbation" introduced into the data ensures that individuals cannot be identified.

Such privacy concerns were a key design criterion that forced the ABS to qualify its push to put all census data online, says Williams: "We know that our users want to be able to cross-analyse everything with everything, but because of our need to protect confidentiality we cannot put the data out in that way," he says. "We have to make intelligent decisions as to how we organise and structure the data to respect confidentiality, without putting too much random perturbation into the data."
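The ABS does not describe its perturbation method here, but the general idea can be sketched: small random adjustments are applied to table cells so that very small counts cannot be traced back to individuals, while totals remain approximately correct. The adjustment rule below is purely illustrative, not the bureau's actual algorithm.

```python
import random

def perturb_cell(count: int, rng: random.Random, max_noise: int = 2) -> int:
    """Return a confidentialised version of a single table cell.

    Illustrative only: adds a small random offset and never lets a
    non-zero cell drop below zero, so published tables stay plausible
    while exact small counts are obscured.
    """
    if count == 0:
        return 0  # empty cells stay empty
    noise = rng.randint(-max_noise, max_noise)
    return max(0, count + noise)

rng = random.Random(2006)  # fixed seed so the example is repeatable
raw_cells = {"persons_0_4": 3, "persons_5_14": 17, "persons_15_24": 0}
published = {name: perturb_cell(value, rng) for name, value in raw_cells.items()}
print(published)
```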

The Web-based Census

Even before the data has been collated, the Web will play a major role in the 2006 Census by allowing respondents to lodge their census forms online. Although the ABS recently ran a trial of online census form lodgement that involved fewer than 100 people, next year will be time for the real thing.

This time around, each household will be assigned a pseudo-random number by its area Census collector, who will leave forms along with a sealed envelope containing a 12-digit PIN. Households choosing to lodge their forms online will enter that PIN and the pseudo-random number into the SSL-secured Census Web site, which will record their information and notify the area Census collector by SMS that the form has been lodged.
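A minimal sketch of how such a lodgement service might check those credentials is shown below. All the names here (issue_credentials, lodge_form, notify_collector_by_sms) are hypothetical and the PIN handling is simplified; the article does not describe the ABS's actual implementation.

```python
import hashlib
import secrets

# Hypothetical in-memory register of issued credentials and lodgements.
issued = {}      # household_number -> hashed PIN
lodgements = {}  # household_number -> form data

def issue_credentials(household_number: str) -> str:
    """Generate a 12-digit PIN for a household and store only its hash."""
    pin = "".join(secrets.choice("0123456789") for _ in range(12))
    issued[household_number] = hashlib.sha256(pin.encode()).hexdigest()
    return pin  # printed into the sealed envelope left by the collector

def notify_collector_by_sms(household_number: str) -> None:
    # Placeholder for the SMS notification mentioned in the article.
    print(f"SMS: household {household_number} has lodged its form online")

def lodge_form(household_number: str, pin: str, form_data: dict) -> bool:
    """Accept an online lodgement if the household number and PIN match."""
    expected = issued.get(household_number)
    if expected is None or hashlib.sha256(pin.encode()).hexdigest() != expected:
        return False
    lodgements[household_number] = form_data
    notify_collector_by_sms(household_number)
    return True

pin = issue_credentials("100200300")
print(lodge_form("100200300", pin, {"persons": 4}))  # True
```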

As a major step forward, online lodgement of Census forms will be watched carefully to ensure privacy and data integrity meet or exceed the standards achievable with conventional paper-based processes. Williams is confident the method will eventually play a major role in Census collection.

One other Web-based process improvement is already paying dividends for the ABS, however. Whereas applications to serve as Census collectors were previously processed using paper, this year the ABS is encouraging the more than 100,000 expected volunteers to apply online or over the phone instead.

Most are doing just that: "In our recent dress rehearsal there was very little demand for the paper form," says Williams. Since each application must be vetted and each candidate put through a full interview, taking the paper out of the process will deliver considerable savings in time and effort.

It will also give the ABS unprecedented visibility into its progress: immediate online availability of applicant details will let organisers know exactly which collection areas are still understaffed, rather than having to wait days for paper reports as in the past. "This lets us target our recruitment campaign to get better people," says Williams.
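As a rough illustration of the kind of view this gives organisers, the short sketch below tallies hypothetical online applications against the number of collectors each district needs; the data structures and figures are invented for the example.

```python
from collections import Counter

# Hypothetical figures: collectors required per collection district,
# and the district named on each online application received so far.
required = {"CD-1001": 4, "CD-1002": 3, "CD-1003": 5}
applications = ["CD-1001", "CD-1001", "CD-1003", "CD-1001", "CD-1003"]

received = Counter(applications)
understaffed = {cd: required[cd] - received.get(cd, 0)
                for cd in required
                if received.get(cd, 0) < required[cd]}

# Districts still short of collectors, and by how many.
print(understaffed)  # {'CD-1001': 1, 'CD-1002': 3, 'CD-1003': 3}
```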

In many ways, the 2006 Census will reflect the Internet's coming of age, since planning for the 2001 Census began in the late 1990s. Significantly higher Internet penetration is allowing the ABS to think outside the box when it comes to revisiting every step of the process, potentially opening the door to significant savings and efficiency improvements.

Such improvements are a strategic imperative for the Census team, says Williams: "Because of the way governments work, effectively each Census is designed to cost less than the previous one," he explains. "Because the demands for data are ever increasing, we're doing smarter things. We pick areas where we think there's an opportunity to re-engineer, and do continuous improvement on the other bits. We're not trying to be cutting edge-just to have a vision a bit far out for what we are trying to achieve."
