Realtime or near-zero latency data warehousing might just be entering the realm of reality.
Realtime or near-zero latency has been a hot topic in data warehousing since 2002. In our industry, that is a rather long time to be a hot topic. According to the annual predictions, by now we should all be refreshing some part of our data warehouse as often as we possibly can, but few of us do.
I know many of us have dismissed realtime data warehousing (RTDW) as vendor hype - a technology solution in search of a business problem. Certainly when the term was first coined, extract, transform and load (ETL) tools were busily re-branded as data integration platforms; perhaps vendors were fearful that we were running out of data for them to move. After all, the much-anticipated wave of clickstream data warehouses had failed to arrive after the dotcom bubble burst.
They need not have worried; data volumes have grown unabated and hitherto unimaginable sources of sub-transactional data have sprung up to challenge us. So much so that most organisations are pleased if their static data warehouses give them anything like a complete picture of what happened yesterday, never mind the last hour.
Yet talk of RTDW just will not go away, despite the low number of early adopters. Possibly this is because it is another example of our natural progression from batch to online - inevitable evolution if you like. Once we did all analytical reporting overnight in batch directly from operational systems, the only time when it was safe to do so. We moved on to the daylight querying of haphazard decision support extracts and then eventually to data warehouses to remove that technology-imposed limitation. Initially data warehouses provided online access to historical summary data extracted on a monthly or weekly basis for management reporting only. Increasingly burdensome operational reporting was to be handled by operational data stores. It was not long before this technology-imposed duality was called into question, as daily extracts of atomic-level detail transactions became the data-warehousing norm. Why maintain two sets of reporting structures and make users choose which they query? Why not one database for all reporting, especially as the distinction between analytical and operational reporting has blurred? We have learned that if business users cannot access information when they want, from where they want, eventually IT practices change when it becomes an economic possibility and an operational imperative.
I believe we may be reaching the RTDW tipping point now. I have one broadcasting client, who has had a very specialised RTDW in production for four years. That is an anomaly but I have another about to embark on a far more ambitious operational data warehouse to match the wholesale integration of their production systems using service-oriented architecture (SOA) technology.
Just as I am glad we did not all attempt to build data warehouses in the early 1990s before we knew enough about dimensional modelling, I am relieved my clients have waited until now for RTDW. Today we do have enabling integration technology that has been production tested but, more importantly, we have had repeatable success at 'static' data warehousing. We have learned to walk (almost gracefully) before attempting to run. That is a good thing because RTDW can be something of a marathon. So how are we going to shape up for it and what must we do to succeed at RTDW?
More than faster ETL
Realise that RTDW is not just faster ETL. A simple definition of RTDW is the ability to query data with a refresh cycle more frequently than current ETL processes can cope with. That is why realtime ETL is going to double your development time and budget but it is no use spending millions making data available in minutes if it will not be consumed for hours. RTDW requires end-to-end process re-engineering, not just stretching back to the source systems so they can cope with accelerated extraction (it probably will not be extraction) but forward to the business intelligence infrastructure to support more active analysis and decision-making. Even dashboards are going to be too passive.
Make realtime operational sources more co-operative. If they cannot cope with invasive data extraction they are going to have to push changes to the warehouse or at least notify ETL processes when data are ready so we do not needlessly burn cycles trying to extract what is not there yet.
Do not assume data flow is a one-way street. Even if your current data warehouse information is consumed entirely by business intelligence users, realtime questions and answers may have a completely different audience. You are probably going to have to close the loop and propagate answers back to the operational systems as frequently as you extract from them.
Adapt our existing ETL techniques. Do not assume that change data capture and enterprise data integration (EAI) technology alone will allow us to simply bolt on realtime. EAI may be appropriate for trickling small to medium volumes of data to and from the warehouse, but messaging technology may be overwhelmed by large bursts of data more suited to our current ETL processing run in micro batch. EAI also represents a huge financial and technical overhead if only a small number of systems must be integrated.
Be honest about how data quality is going to affect your RTDW. You cannot be as thorough in running your suite of data quality tests in 'realtime' as you would in overnight batch.
Continue to use dimensional techniques
Normalised models may be tempting as they can closely match the sources of realtime data and cut down the amount of transformation needed to the point where it will fit into our acceptable refresh cycle. However, these structures are notoriously query-unfriendly and typically rely on a succession of dependent data marts downstream for business intelligence performance. Our calculation of latency has to include the time taken to ready the data for query and typical query processing itself, not just the time to store it. For near-zero latency, the tables/cubes we update in realtime are going to have to be the ones we query.
Use enterprise information integration (EII) technology to prototype RTDW or add a lightweight RTDW component to your existing warehouse. EII's virtual data warehousing could never replace real data warehouses but as an adjunct EII's user-triggered ETL could provide realtime answers where data volumes are low and usage is light or as yet unproven.
Develop realtime partitions for our RTDW fact tables. These may not necessarily be native database partitions as our realtime structures will likely need a more lightweight indexing strategy, possibly not based on surrogate keys. That last point may sound like heresy to any dimensional modeller who knows me but we do not have the luxury of our usual batch windows to do that level of housekeeping. But do not worry. We will promise to clean up overnight, sweeping the realtime data into the static partitions where we can fully support slowly changing dimensions and data quality improvements.
You still do not believe? Almost every data-warehousing project I have been associated with that began with strategic/managerial analysis goals has been hijacked by operational requirements. When your operational processes develop a need to compare the most recent transactions to the historic detail in your warehouse, you will receive your RTDW orders. Interesting times lie ahead.
*Based in the UK, Lawrence Corr is a leading data warehouse design specialist and highly experienced trainer with over 20 years of industry experience in decision support and data analysis.