Data Integration Perspective
The NCR business generates information at an increasing rate from internal- and external-facing applications, as well as from a huge fleet of devices we support worldwide. Efficient analysis and reporting depend on doing a great job of collecting, organizing, cleaning, securing, and backing up that data. We gather data from several hundred source systems each day, at varying frequencies and volumes. Integrating that data is complex and requires many different steps and tools, depending on the final disposition of each source. The challenge is that these processes developed over time on a project-by-project basis, without a clear vision of an all-encompassing big picture. Normal business changes, like re-organization and acquisition, complicate things further: they create a continual flow of changes in source systems and in the downstream use of the resulting data.
"We must constantly adjust our horizon to match the changing needs of the business with the technologies available at the time"
As technologies have advanced and the portfolio of applications has grown, so has the expectation that there must be a way to analyze the full spectrum of data across the enterprise, rather than just the narrow paths that were initially created. To solve this problem, we need to set expectations for a solution that is practical and can show incremental achievement. The silos of information we have took years upon years to create, and it is unlikely we could resolve them all at once within a reasonable time period. So we must set about creating a strategy to move incrementally toward an enterprise-level view. In our case, this journey begins at the landing zone for new applications, or applications under major redesign. The image below depicts a goal state in very general terms.
The direction includes what would typically look like an “Enterprise Data Warehouse” stack of Loading/Landing, Transformation, and Presentation. In the past, the data warehouse was a single monolithic machine containing most of these logical layers. The difference now is that we are virtualizing that “data warehouse” concept across multiple platforms and toolsets. This requires a different set of tools and processes, and it brings both benefits and costs.
To move to a centralized “data lake”, we are adopting Hadoop as the initial landing zone for, eventually, all of our data. This will take time: unwinding existing processes is neither cheap nor easy, and the ROI of such a project is often difficult to quantify. So we have started the journey by actively pushing all new development and integration to use this method, and any major redesign is required to move to the lake as its landing zone.
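One practical piece of a landing-zone approach like this is a predictable directory convention, so each day's drop from each source system lands in a known place without a catalog lookup. The sketch below illustrates the idea; the root path and source-system names are assumptions for illustration, not NCR's actual layout.

```python
from datetime import date

# Illustrative root of an HDFS-style landing zone (assumed path).
LANDING_ROOT = "/data/landing"

def landing_path(source_system: str, ingest_date: date) -> str:
    """Build the landing-zone directory for one source system's daily drop.

    Partitioning by source system and ingest date keeps raw data immutable
    and lets downstream transformation jobs locate each day's files directly.
    """
    return (f"{LANDING_ROOT}/{source_system}"
            f"/ingest_date={ingest_date.isoformat()}")

print(landing_path("billing_erp", date(2024, 5, 1)))
# /data/landing/billing_erp/ingest_date=2024-05-01
```

A convention like this is what makes "land first, transform later" workable at the scale of several hundred source systems: new feeds only need to agree on the layout, not on a shared schema.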
This “data lake” concept creates some interesting options for the downstream destinations of data. It allows us to re-use and re-purpose many of our existing technologies, analyzing and presenting data from the lake with whichever method best suits each technology. We can produce OLAP capabilities in memory for speed of delivery. We can leverage SSD appliances for heavy calculation. We can do near-real-time analysis within the lake, or by replicating data to an appliance, depending on the use case. For our predictive analytics, we combine near-real-time data from machines with historical aggregations and trends, as well as master data elements, in a system that reacts quickly and feeds our operational dispatch systems. The use cases vary widely and carry many different SLA requirements.
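The predictive-dispatch pattern described above can be sketched in miniature: a live device reading is compared against a historical baseline computed in the lake, and master data supplies the detail a dispatch system needs. The field names, the 3-sigma threshold, and the device/site identifiers below are all illustrative assumptions.

```python
from typing import Optional

def needs_dispatch(live_reading: float, hist_mean: float, hist_std: float,
                   threshold_sigmas: float = 3.0) -> bool:
    """Flag a device when its live reading deviates from its historical trend."""
    return abs(live_reading - hist_mean) > threshold_sigmas * hist_std

def dispatch_ticket(device_id: str, live_reading: float,
                    history: dict, master: dict) -> Optional[dict]:
    """Join a near-real-time signal with historical aggregates and master data."""
    stats = history[device_id]  # e.g. aggregates precomputed in the lake
    if not needs_dispatch(live_reading, stats["mean"], stats["std"]):
        return None
    return {
        "device_id": device_id,
        "reading": live_reading,
        "site": master[device_id]["site"],  # master data element for routing
    }

history = {"atm-001": {"mean": 40.0, "std": 2.0}}
master = {"atm-001": {"site": "Branch 12"}}
print(dispatch_ticket("atm-001", 55.0, history, master))
```

The point of the sketch is the data flow, not the statistics: the fast path only carries the live reading, while the slow-moving context (trends, master data) is served from stores refreshed out of the lake.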
These many data access methods allow a much more comprehensive view of enterprise-level data. There are side benefits as well, such as savings in backup time and space from reducing the duplication of both base data and processed data. Of course, taking advantage of that requires administration and management at an enterprise level. In our case, centralizing and standardizing the methods and tools has made that process more efficient, and it even provides a much more efficient way to improve data quality.
A centralized store of key master data enables the maintenance and cleansing of that data within the lake, and gives downstream systems access to a continually refreshed source of truth. For us at NCR, this has meant a much better view of key master data elements. Customer information, like shipping addresses and billing addresses, was created and maintained in many systems in the past. As we have moved to this centralized model, we have greatly reduced the sources of change to these key data points. Monitoring and reporting on data quality has proven the value of the model: we can quickly identify and remediate any system that is contributing, or attempting to contribute, poor-quality data at a single point, rather than chasing bad data from system to system in loops.
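The single-point quality check described above can be sketched as a validation gate at ingestion that attributes bad records to the contributing source system. The required fields and source-system names below are illustrative assumptions, not NCR's actual rules.

```python
# Hypothetical required fields for a customer master record (illustrative).
REQUIRED_FIELDS = ("customer_id", "shipping_address", "billing_address")

def validate_record(record: dict) -> list:
    """Return the list of quality problems found in one incoming record."""
    return [f"missing {field}" for field in REQUIRED_FIELDS
            if not record.get(field)]

def quality_report(records_by_source: dict) -> dict:
    """Count bad records per contributing source system.

    Because every source passes through this one gate, poor-quality data
    is caught and attributed here instead of being chased downstream.
    """
    return {
        source: sum(1 for r in records if validate_record(r))
        for source, records in records_by_source.items()
    }

incoming = {
    "crm": [{"customer_id": "C1", "shipping_address": "1 Main St",
             "billing_address": "1 Main St"}],
    "legacy_erp": [{"customer_id": "C2", "shipping_address": ""}],
}
print(quality_report(incoming))  # {'crm': 0, 'legacy_erp': 1}
```

A per-source report like this is what makes the remediation loop short: the offending system is identified at the gate, rather than discovered later in a downstream consumer.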
It isn’t a destination we’ll ever fully reach, because technologies will inevitably change. Many components are moving to the cloud, and that will continue; the virtualization of components will change too. Our goal is to move data as little as possible and store it as cheaply as possible for the long term, while making it accessible as quickly as possible when that matters. These are wildly divergent goals with a spectrum in between, and managing that spectrum to meet those needs makes continual improvement of these processes a necessity: we must constantly adjust our horizon to match the changing needs of the business with the technologies available at the time. It’s constant evolution.