Transforming data into actionable insight is key to its value. Leaders at companies across a variety of industries that are investing materially in technology architectures want a return on that investment, and they want it now. Unfortunately, many of these leaders are missing a crucial first step that precedes generating useful analyses.
A range of challenges is putting the brakes on leaders’ efforts to extract value from their data and system investments. First, data used in analysis have to be clean, and data ingestion and curation efforts are time consuming – the process, still quite labor intensive, consumes 70%-90% of a typical data professional's time. Most companies hire data professionals for their specialised and expensive analytical skills, not for their ability to clean data. Second, technology architectures lack end-to-end data pipelines, data ontologies and taxonomies. Instead of performing data ingestion and curation efforts in a centralised manner, data are processed simultaneously in multiple departments within a firm, resulting in fragmentation that limits the ability to spread and use data insights across the organisation. It's just not efficient.
The current state of data engineering has become a key drag on making Machine Intelligence (MI) profitable in the insurance industry.
We define data engineering as the process of refining data into fuel – clean, easily accessible, and able to yield insights anywhere in an organisation. At-scale success requires an automated end-to-end data pipeline performing ingestion, curation, transformation, visualisation, and finally widescale distribution of data. Think of this pipeline as a production line, similar to what one may find in a factory.
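To make the production-line analogy concrete, the sketch below composes a subset of those stages in Python. It is a minimal illustration, not a prescription for any particular platform; the stage functions, record fields and thresholds are assumptions made for the example.

```python
# Minimal sketch of an end-to-end data pipeline as a "production line" of stages.
# Stage names, record fields and thresholds are illustrative assumptions only.

from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]

def ingest(records: List[Record]) -> List[Record]:
    """Pull raw records from a source system (here, passed in directly)."""
    return records

def curate(records: List[Record]) -> List[Record]:
    """Drop records missing mandatory fields."""
    return [r for r in records if r.get("policy_id") and r.get("premium") is not None]

def transform(records: List[Record]) -> List[Record]:
    """Derive analysis-ready fields from curated data."""
    for r in records:
        r["premium_band"] = "high" if float(r["premium"]) > 10_000 else "standard"
    return records

def distribute(records: List[Record]) -> List[Record]:
    """Hand refined data to downstream consumers (stubbed as a print)."""
    print(f"Publishing {len(records)} curated records to downstream consumers")
    return records

def run_pipeline(records: List[Record], stages: List[Stage]) -> List[Record]:
    """Run each stage in sequence, like stations on a production line."""
    for stage in stages:
        records = stage(records)
    return records

if __name__ == "__main__":
    raw = [
        {"policy_id": "P-001", "premium": 12_500},
        {"policy_id": None, "premium": 900},       # rejected at curation
        {"policy_id": "P-002", "premium": 4_300},
    ]
    run_pipeline(raw, [ingest, curate, transform, distribute])
```

The point of the sketch is the shape, not the content: each stage has a single responsibility, and the same line serves every downstream consumer rather than being rebuilt department by department.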
Usefully transformed data provide competitive advantage in the insurance industry. Firms can differentiate themselves by identifying, ingesting, and curating novel (not generally available) data and, further, by developing more useful derived analytics and insights with these data. These data-driven differentiation opportunities can be found at nearly every step in the risk-transfer value chain – from product development through core underwriting and claims management to portfolio analysis.
Historically, firms mastered traditional data sources at the scale needed for success in one or more of those steps. However, the ingestion, curation, and transformation of data within each step became tailored to the particular needs and data types of the department processing it. That tactically fragmented data engineering landscape carries a cost today and places a significant cap on data transformation efforts for tomorrow.
Most insurers today have multiple bespoke ingestion and curation processes embedded in various analytical processes. None is optimised for cost-efficient engineering; each is optimised for its desired target analysis. The costs add up. As different areas try to leverage each other's predictive analytics, data professionals, underwriters, actuaries and claims managers must reconcile how data have been engineered differently in different areas before moving on to analysis. This not only costs time and effort; it often prevents cross-department leverage from occurring at all.
Looking further, the missing ability to engineer data "at scale" is what prevents firms from extracting value from their investments in data scientists and data analytics tools. A good analogy: those investments are powerful engines purchased by the company, but the company either cannot deliver fuel to them, or the fuel it does deliver is full of sand, which materially degrades engine performance. In the insurance industry, traditional data are already riddled with regulatory and privacy challenges, and the need for decision certainty and traceability already slows the exploration of non-traditional data. Fragmenting internal data engineering capabilities compounds the problem.
Looking specifically through the lens of "making MI profitable", at-scale data engineering gaps prevent promising MI projects from becoming successful operational implementations. Making MI systems operational at scale from the beginning, versus more siloed project-scale efforts, requires access to high-quality, always ready-for-use data. Project solutions relying on brute force, i.e., high-cost manual coordination of fragmented data engineering capabilities, cannot scale into production, regardless of how promising the algorithms and models are. New MI techniques, driven by exponentially growing sources of data, will affect competitive advantage in every step of the risk-transfer value chain. However, right now, useful and relevant information often does not find its way into underwriting decisions, and more granular profitability analysis does not steer the business. This will have to change.
Data engineering can be broken out into a series of steps. New data must first be discovered and sourced, on an ongoing basis. Previously, the acronym "ETL" – extract, transform, load – was used to boil these steps down. But the world has changed.
Today, enterprises should have pipes attached to rivers of data so that sourced data are nearly continuous – not "extracted" ad hoc from a static datastore. No longer should we aim to "transform" data to a single format for a given use case. Now, we should apply a collection of analytical transformations to curated data. Thus, ETL becomes something different – continuous ingestion, curation, transformation, and visualisation (ICTV). These steps should inform and evolve the data-value-chain management process. Indeed, validation, robust privacy preservation (including differential privacy), and metadata tracking become integral to a more robust data-value-chain management process.
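The shift from batch ETL to continuous ICTV can be sketched as a stream in which every record carries its validation status and lineage metadata through each step. The example below is a hedged illustration only; the event fields, thresholds and lineage structure are assumptions made for the sake of the sketch.

```python
# Illustrative sketch of ICTV as a continuous stream rather than ad-hoc ETL batches.
# Each record carries lineage metadata through ingestion, curation and transformation.

import datetime
from typing import Dict, Iterator

def _now() -> str:
    return datetime.datetime.now(datetime.timezone.utc).isoformat()

def ingest(source: Iterator[Dict]) -> Iterator[Dict]:
    """Continuously wrap raw events with provenance metadata as they arrive."""
    for event in source:
        yield {"payload": event, "lineage": [{"step": "ingest", "at": _now()}]}

def curate(stream: Iterator[Dict]) -> Iterator[Dict]:
    """Validate each record; failures are tagged rather than silently dropped."""
    for record in stream:
        record["valid"] = "claim_amount" in record["payload"]
        record["lineage"].append({"step": "curate", "at": _now()})
        yield record

def transform(stream: Iterator[Dict]) -> Iterator[Dict]:
    """Apply analytical transformations only to records that passed curation."""
    for record in stream:
        if record["valid"]:
            record["payload"]["claim_band"] = (
                "large" if record["payload"]["claim_amount"] > 50_000 else "attritional"
            )
        record["lineage"].append({"step": "transform", "at": _now()})
        yield record

if __name__ == "__main__":
    events = iter([{"claim_amount": 72_000}, {"note": "missing amount"}])
    for out in transform(curate(ingest(events))):
        print(out["payload"], "valid:", out["valid"],
              "steps:", [s["step"] for s in out["lineage"]])
```

Because each step is a generator over the previous one, records flow through as they arrive rather than waiting for a periodic extract, and the lineage trail supports the validation and metadata tracking described above.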
A challenge threaded through these steps is excessive dependence on manual and script-driven processes. This can happen when processes are designed around manual steps and project code originally developed for experimental, interactive analysis. Multiple (or all) steps might involve a manual trigger, intervention, or review. Manual auditing, versioning and checking for errors and anomalies reduce not only efficiency, but also confidence in downstream modelled outcomes. The lack of a holistic, standardised data engineering process increases operational risk as data scientists tailor existing processes or create brand new ones for their needs. The potential for errors in both data and derived analytics grows.
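As an illustration of what codifying those manual checks might look like, the sketch below runs a small set of rule-based data checks and produces a machine-readable issue log instead of relying on human review. The rule names, required fields and thresholds are invented for the example.

```python
# Hypothetical sketch of replacing manual review with codified, repeatable data checks.
# Rule names, required fields and thresholds are illustrative assumptions only.

from typing import Callable, Dict, List, Optional

Check = Callable[[Dict], Optional[str]]  # returns an issue description, or None if the record passes

def missing_fields(record: Dict) -> Optional[str]:
    required = ("policy_id", "inception_date", "sum_insured")
    missing = [f for f in required if not record.get(f)]
    return f"missing fields: {missing}" if missing else None

def implausible_values(record: Dict) -> Optional[str]:
    sum_insured = record.get("sum_insured")
    if isinstance(sum_insured, (int, float)) and (sum_insured < 0 or sum_insured > 1e10):
        return f"implausible sum_insured: {sum_insured}"
    return None

def audit(records: List[Dict], checks: List[Check]) -> List[Dict]:
    """Run every check on every record and return a machine-readable issue log."""
    issues = []
    for index, record in enumerate(records):
        for check in checks:
            problem = check(record)
            if problem:
                issues.append({"record": index, "issue": problem})
    return issues

if __name__ == "__main__":
    sample = [
        {"policy_id": "P-10", "inception_date": "2023-01-01", "sum_insured": 250_000},
        {"policy_id": "P-11", "inception_date": None, "sum_insured": -5},
    ]
    print(audit(sample, [missing_fields, implausible_values]))
```

Checks expressed as code can be versioned, scheduled and audited like any other pipeline component, which is precisely what ad-hoc manual review cannot offer.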
The use of relational databases as the storage solution limits speed and scale. Traditional database management, which relies on rigid technology infrastructure, is by itself insufficient to handle the rising number of heterogeneous formats that must be processed to unlock value from data. The lack of enterprise-wide data ontologies (the defining of relationships among data) also limits the ability to scale and operationalise data across the organisation. Infrequent release iterations of data, coupled with limited CI/CD (Continuous Integration / Continuous Deployment) of both data and MI models, fail to adapt to changes in the data environment, or to changes in the metadata that describe it.
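To illustrate what "relationships among data" can mean in practice, the toy sketch below encodes a fragment of an insurance data ontology as a machine-readable graph that downstream tools could query. The entity and relationship names are invented for the example, not drawn from any specific standard.

```python
# A toy illustration of a data ontology: explicit, machine-readable relationships
# among data entities. Entity and relationship names are invented for the example.

ONTOLOGY = {
    "Policy": {"insures": "Risk", "billed_via": "Premium", "held_by": "Policyholder"},
    "Claim":  {"filed_against": "Policy", "paid_via": "ClaimPayment"},
    "Risk":   {"located_at": "Address", "classified_by": "PerilCode"},
}

def related_entities(entity: str, ontology: dict = ONTOLOGY) -> dict:
    """Return the entities directly related to a given entity, with relationship labels."""
    return ontology.get(entity, {})

def path_exists(src: str, dst: str, ontology: dict = ONTOLOGY, seen=None) -> bool:
    """Check whether dst is reachable from src by following relationships."""
    seen = seen or set()
    if src == dst:
        return True
    seen.add(src)
    return any(
        nxt not in seen and path_exists(nxt, dst, ontology, seen)
        for nxt in ontology.get(src, {}).values()
    )

if __name__ == "__main__":
    print(related_entities("Claim"))          # {'filed_against': 'Policy', 'paid_via': 'ClaimPayment'}
    print(path_exists("Claim", "PerilCode"))  # True: Claim -> Policy -> Risk -> PerilCode
```

Once relationships are captured this explicitly, they can be shared across departments and kept in version control, rather than living implicitly in each team's bespoke scripts.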
Looking forward, running tomorrow's at-scale, operational data capability will require a pipeline-driven approach to data engineering – similar to how automotive companies implement their production lines. A single rules-based data pipeline is needed that covers the entire data engineering lifecycle. Most processes in the pipeline are algorithmic, supervised by humans. The automated pipeline connects to sources and streams relevant data as, and when, it becomes available. Depending on the data format, an algorithmic template approach is applied to transform and extract data from unstructured and semi-structured documents (e.g. PDFs) in real time. Data are tagged automatically according to prescribed rules derived from business requirements. The data ontology and data dictionary (data terms and definitions) – key to operationalising the data across the organisation – are built dynamically and kept up to date as new data become available. A single, flexible document-based data lake (well fed by various data rivers) houses all structured, semi-structured, and unstructured data types. An automated, high-integrity data versioning and traceability process runs throughout the pipeline as part of scheduled batch processing.
Figure: An automated, end-to-end data engineering pipeline. Source: Galytix
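Two elements of the pipeline described above – rule-based tagging and a data dictionary that keeps itself current as new data arrive – can be sketched briefly in code. The tagging rules, field names and descriptions below are illustrative assumptions only, not a reference implementation.

```python
# Hedged sketch of rule-based tagging and a self-updating data dictionary.
# Rules, field names and descriptions are illustrative placeholders.

from typing import Dict, List

TAGGING_RULES = [
    # (tag, predicate over the record)
    ("large_loss",  lambda r: r.get("claim_amount", 0) > 100_000),
    ("catastrophe", lambda r: r.get("peril") in {"flood", "windstorm", "earthquake"}),
    ("third_party", lambda r: r.get("line_of_business") == "liability"),
]

def tag_record(record: Dict) -> List[str]:
    """Apply prescribed business rules and return the tags that fire."""
    return [tag for tag, predicate in TAGGING_RULES if predicate(record)]

def update_data_dictionary(dictionary: Dict[str, Dict], record: Dict) -> Dict[str, Dict]:
    """Register any previously unseen field so the dictionary stays current as data arrive."""
    for field, value in record.items():
        if field not in dictionary:
            dictionary[field] = {"type": type(value).__name__, "description": "TODO: define"}
    return dictionary

if __name__ == "__main__":
    data_dictionary: Dict[str, Dict] = {}
    incoming = [
        {"claim_amount": 250_000, "peril": "flood", "line_of_business": "property"},
        {"claim_amount": 8_000, "peril": "fire", "line_of_business": "liability", "broker_id": "B-7"},
    ]
    for record in incoming:
        print(tag_record(record))
        update_data_dictionary(data_dictionary, record)
    print(sorted(data_dictionary))
```

In a production pipeline the rules would be maintained by the business, and newly registered fields would be routed to data stewards for definition – the human supervision of an otherwise algorithmic process described above.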
An array of plug-and-play solutions built on open-source technologies – with humans still intelligently in the loop – can be applied to speed up the evolution of data engineering into a fully integrated, end-to-end data pipeline process.
Building a scalable data engineering capability has become table stakes for insurers, not only to extract value from their data investments but also to expedite data transformation across the organisation. As one data leader put it: “Data initiatives are a team sport – requiring business and data staff to work hand in hand to bring data and technology together – ultimately, delivering a business transformation”. Insurers that make the right design and implementation choices in their data engineering transformation will succeed in the future.