
Extract, transform, load? More like extremely tough to load, amirite?


Thankfully, now there are data integration platforms for that
Data integration has been an IT challenge for decades. Long before cloud, even before client-server. Back then, though, it was relatively simple – you were working with systems of record, desktop databases and business systems.
Today, it’s not so simple. We exchange information about virtually everything from science to retail, from Facebook Likes to regional sales during August.
The volumes of data we’re dealing with are significantly higher than anything we’ve collectively worked with historically.
On top of the ever-increasing volume of data comes the variety of data – we’re not in structured data land any more, Dorothy, and relational monogamy is a thing of the past, with image data and device parameters mingling with information nestled nicely in relational rows and tables.
We will – of course – be familiar with Extract, Transform and Load (ETL) from those bygone, pre-cloud years.
ETL is a series of steps used to take data from one source and make it available to another. ETL is the very core of modern data integration: one extracts data from an application or service, transforms that data so that it can be understood by one or more target applications and services, and loads that data into the target application(s) and/or service(s).
ETL can occur in many forms. For example, one could save data from an application as a Comma Separated Values (CSV) file, open that file in a spreadsheet application to change the name of various columns, and then load that data into another application. This is a fairly common form of ETL operation that many organisations perform manually.
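To put that in concrete terms, the same rename-and-reload job can be scripted in a few lines. Here is a minimal sketch using Python’s standard csv module; the file names and column mappings are hypothetical, not taken from any real system:

```python
# A sketch of the manual CSV workflow described above, automated.
# File names and column mappings are made up for illustration.
import csv

COLUMN_MAP = {"cust_name": "customer_name", "ord_total": "order_total"}

with open("export_from_source.csv", newline="") as src, \
     open("import_to_target.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    # Rename only the columns we know about; anything else passes through unchanged.
    fieldnames = [COLUMN_MAP.get(name, name) for name in reader.fieldnames]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow({COLUMN_MAP.get(k, k): v for k, v in row.items()})
```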
ETL doesn’t always involve a file as an intermediate stage. Frequently, data is extracted from the source application via an API, and loaded into the destination application via an API. In cases where the data integration to be performed is relatively simple (such as having few or no transformation steps), this sort of integration can be automated easily.
API-driven applications offer an easier form of integration.
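A bare-bones version of such an API-to-API hop might look like the sketch below, which uses the requests library; the endpoints, field names and payload shape are entirely made up:

```python
# A sketch of a simple API-to-API integration with few transformation steps.
# The URLs and field names are placeholders, not a real service.
import requests

SOURCE_URL = "https://source.example.com/api/contacts"   # hypothetical source API
TARGET_URL = "https://target.example.com/api/customers"  # hypothetical target API

records = requests.get(SOURCE_URL, timeout=30).json()     # extract
for record in records:
    payload = {                                            # transform: rename fields
        "name": record["full_name"],
        "email": record["email_address"],
    }
    requests.post(TARGET_URL, json=payload, timeout=30)    # load
```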
When not performing direct API-to-API mind melds, data integration requires an intermediate data repository between the extraction and loading steps. This intermediate repository may be a file, a database, a middleware solution, or a data warehouse.
Extracting data from a source and storing it in a data warehouse, while not always required, is quite common in cases where multiple data sources are being combined as part of the data transformation process.
In this case, data from different source systems is converted into one consolidated data warehouse format ready for transformation. With multiple data sources gathered in a single, centralised data warehouse it becomes easier to transform the data before loading it into the destination application or service.
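As a rough illustration of that staging step, the sketch below lands rows from two hypothetical sources in one consolidated table, using SQLite as a stand-in for the warehouse; the table layout and source names are assumptions:

```python
# A sketch of consolidating multiple sources into one staging format
# before transformation. SQLite stands in for the data warehouse here.
import sqlite3

conn = sqlite3.connect("staging.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS staged_orders (
        source    TEXT,                          -- which system the row came from
        customer  TEXT,
        amount    REAL,
        loaded_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

def land(source_name, rows):
    """Convert rows from any source into the one consolidated staging format."""
    conn.executemany(
        "INSERT INTO staged_orders (source, customer, amount) VALUES (?, ?, ?)",
        [(source_name, r["customer"], float(r["amount"])) for r in rows],
    )
    conn.commit()

land("erp",  [{"customer": "Acme", "amount": "120.00"}])   # hypothetical ERP export
land("shop", [{"customer": "Acme", "amount": "35.50"}])    # hypothetical web-shop export
```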
Transforming the data may involve some or all of several tasks, such as filtering out unwanted records, validating values, converting data types and formats, and deduplicating entries.
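A toy transformation step covering those tasks might look something like the following; the field names and rules are assumptions made for the example, not drawn from any particular system:

```python
# A sketch of common transformation tasks: filtering, validation,
# deduplication and type conversion. Field names are invented.
def transform(rows):
    """Filter, validate, deduplicate and reformat incoming rows."""
    seen = set()
    for row in rows:
        if row.get("status") == "test":           # filtering: drop test records
            continue
        if "@" not in row.get("email", ""):       # validation: crude email check
            continue
        key = row["email"].lower()
        if key in seen:                           # deduplication on email
            continue
        seen.add(key)
        yield {                                   # reformatting / type conversion
            "email": key,
            "order_total": float(row["order_total"]),
        }
```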
While data integration is centered on the ETL of data from applications and IT services, these applications and services are not always owned by – and do not always serve – the same organisation. Data integration is often broken into two broad categories: application to application and business to business.
Application-to-application data integration is shorthand for ETL operations on applications and services used by a single organisation. An example would be connecting ERP and CRM systems. Application-to-application integrations are increasingly done at the API level, though some integrations with older solutions may still occur directly at a database level, or even require files to be transferred between applications.
Modern application-to-application integration rarely requires an intermediate database, as “direct” integrations between popular applications and services exist. These direct integrations only cover some use cases, however, and many organisations still pump data into a middleware solution between extraction and loading. This is more common for organisations that also engage in business-to-business data integration, as data from one application or service may be used in multiple ways in these organisations.
Business-to-business integration (also known as external data integration) involves the exchange of data between multiple organisations. One classic example of this data integration pattern is customers loading data into a logistics company’s IT systems to automatically generate waybills.
Business-to-business integration typically differs from application-to-application integration in that integration with external entities is usually done by exchanging external documents. Most organisations are understandably reluctant to open up the APIs of their critical business applications for a variety of reasons.
The exchange of documents that is typical of business-to-business data integration usually occurs in some sort of loosely coupled format, meaning that each customer’s files can be a little different from the next. In the logistics company example, this means the company’s IT practitioners have to cope with hundreds of thousands of ever-so-slightly different documents from its hundreds of thousands of customers.
Those ingesting data from such files typically write scripts to transform the data formats they are receiving. However, this problem gets nasty in a hurry as it scales.
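The sketch below shows the shape of the problem: every customer’s file needs its own field mapping, and the mapping registry grows with the customer base. The customer IDs and field names here are invented for illustration:

```python
# A sketch of why per-customer parsers get nasty at scale: the mapping
# registry only ever grows. Customer IDs and field names are made up.
FIELD_MAPS = {
    "customer_a": {"consignee": "Consignee Name", "weight_kg": "Wt (kg)"},
    "customer_b": {"consignee": "DeliverTo",      "weight_kg": "GrossWeightKg"},
    # ...and so on, one entry per customer
}

def parse_row(customer_id, raw_row):
    """Map one customer's column names onto our own schema."""
    mapping = FIELD_MAPS[customer_id]   # a KeyError means yet another parser is needed
    return {ours: raw_row[theirs] for ours, theirs in mapping.items()}
```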
At the core of the problem with external data integration is that each organisation extracting and transmitting data from its systems not only uses different source IT solutions, but also has different developers available to it. These developers have varying skill sets and varying levels of experience.
Even where two organisations are using the same system from which they plan to extract data, one organisation might apply data transformation to that extracted data while another might not.
In the logistics example, I’ll assume the company transforms its data before sending it, applying filtering and validation. It sends only the data that is absolutely necessary, and ensures that the data is formatted such that it should not cause database errors.
A fellow organisation, however, might send what amounts to a raw data dump from their IT system. This might include superfluous Personally Identifiable Information (PII) that will instantly place a burden of responsibility on the destination organisation. In addition, any malformed or malicious data will go through unchanged.
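For illustration, here is a minimal sketch of the kind of sender-side filtering and validation the first, careful company might apply before transmitting anything; the allowed fields and the length limit are assumptions:

```python
# A sketch of sender-side filtering and validation: send only whitelisted
# fields and reject obviously malformed values. Field list and limit are assumed.
ALLOWED_FIELDS = {"order_id", "destination_postcode", "weight_kg"}
MAX_LEN = 256

def prepare_outbound(record):
    """Keep only whitelisted fields and reject values likely to break a database."""
    clean = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    for key, value in clean.items():
        if not isinstance(value, (str, int, float)):
            raise ValueError(f"unexpected type for {key}")
        if isinstance(value, str) and len(value) > MAX_LEN:
            raise ValueError(f"{key} exceeds {MAX_LEN} characters")
    return clean
```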
If the recipient of these two companies’ data were to load these two datasets without transformation, not only would the two data sets look out of whack but the recipient could also corrupt its database.
Where does this leave us? Put yourself in the position of that recipient – there are plenty of us.
Organisations that receive data from multiple sources need to be able to quickly and efficiently design parsers to perform ETL.
