Generalities of data pipelines
A data pipeline can be seen as a set of layers, where the output of each layer is the input of the next one. We will not go deep into the explanation of each layer; we will only mention the most common layers that a data pipeline can have.
Data source: this is the foundation of your pipeline; all your raw data lives in this layer. The data can be structured or unstructured, and the layer can be composed of multiple elements: relational databases, NoSQL databases, plain files, and so on. For example (a small sketch of reading from one of these sources follows this list):
MySQL
MariaDB
Google Cloud Storage
Firebase
RDS
XML files
Apps
Wearables
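As a concrete illustration of the data source layer, here is a minimal sketch of pulling raw rows from a relational source with Python. It assumes a MySQL database and the mysql-connector-python package; the host, credentials, and the `events` table are hypothetical examples, not part of the original text.

```python
# Minimal sketch: read raw, unprocessed rows from a relational data source.
# Host, credentials, and the `events` table are hypothetical.
import mysql.connector  # pip install mysql-connector-python

def fetch_raw_events(limit: int = 1000) -> list[dict]:
    """Return the most recent raw rows from the source database."""
    connection = mysql.connector.connect(
        host="source-db.example.com",
        user="reader",
        password="secret",
        database="app",
    )
    try:
        cursor = connection.cursor(dictionary=True)
        cursor.execute(
            "SELECT * FROM events ORDER BY created_at DESC LIMIT %s",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        connection.close()
```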
Ingestion and integration layer: this layer reads the data from the data sources and feeds it into data processing. In this layer you load the data into a target storage, giving it a format that the rest of your pipeline is capable of understanding. Common entry points include the following (an ingestion sketch follows this list):
REST/MQTT endpoints
Message queue
Firebase REST API
SFTP
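The sketch below shows one possible ingestion step: pull JSON from a REST endpoint and land it in a storage location as newline-delimited JSON, a format the rest of the pipeline can parse uniformly. The endpoint URL and landing directory are hypothetical assumptions.

```python
# Sketch of an ingestion run: fetch JSON from a (hypothetical) REST endpoint
# and write it to the landing area in a pipeline-friendly format (JSON Lines).
import json
import urllib.request
from datetime import datetime, timezone

SOURCE_URL = "https://api.example.com/v1/events"   # hypothetical endpoint
LANDING_DIR = "/data/landing/events"               # hypothetical target storage

def ingest_once() -> str:
    """Fetch the current batch of events and land it as one JSONL file."""
    with urllib.request.urlopen(SOURCE_URL) as response:
        events = json.load(response)

    # One file per ingestion run, named by UTC timestamp for traceability.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = f"{LANDING_DIR}/events_{stamp}.jsonl"
    with open(path, "w", encoding="utf-8") as out:
        for event in events:
            out.write(json.dumps(event) + "\n")
    return path
```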
Storage layer: this layer is responsible for saving the data; it can be built on SQL or NoSQL databases. We will focus on it in the next subchapter, since it is an important concept for your applications. A short storage sketch follows this list.
SQL databases
NoSQL databases
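As a minimal illustration of the storage layer, the sketch below persists the ingested JSONL file into a SQL table that the processing layer can query. SQLite is used only as a stand-in for any SQL database, and the field names (`id`, `user_id`, `event_type`, `created_at`) are hypothetical.

```python
# Sketch of the storage layer: load an ingested JSONL file into a SQL table.
# SQLite stands in for any SQL database; the schema is hypothetical.
import json
import sqlite3

def store_events(jsonl_path: str, db_path: str = "pipeline.db") -> int:
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id TEXT PRIMARY KEY,
               user_id TEXT,
               event_type TEXT,
               created_at TEXT
           )"""
    )
    processed = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            e = json.loads(line)
            # INSERT OR IGNORE keeps re-runs of the same file idempotent.
            conn.execute(
                "INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",
                (e["id"], e["user_id"], e["event_type"], e["created_at"]),
            )
            processed += 1
    conn.commit()
    conn.close()
    return processed
```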
Processing/computation layer: this layer is used to aggregate, mix data sources, and pre-calculate data so the next layer can use it for visualization. It can run streaming or batch processing. (This is where our analytics engine resides, but it needs a stable storage layer and a good presentation layer.) A batch processing sketch follows this list.
Self-hosted scripts (e.g., Python scripts, SQL scripts, etc.)
Storm
Apache Spark
Flink
Machine learning models
Crashlytics
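A self-hosted script is the simplest of the options above, so here is a sketch of a batch processing step that pre-aggregates event counts per day, ready for the presentation layer. It reuses the hypothetical `events` table and SQLite database from the storage sketch.

```python
# Sketch of a batch processing step: pre-aggregate daily event counts so the
# presentation layer can read them without scanning raw data. Table and
# column names are the hypothetical ones from the storage sketch.
import sqlite3

def aggregate_daily_counts(db_path: str = "pipeline.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS daily_event_counts (
               day TEXT,
               event_type TEXT,
               total INTEGER,
               PRIMARY KEY (day, event_type)
           )"""
    )
    # Recompute the aggregate in one pass; re-running replaces stale rows.
    conn.execute(
        """INSERT OR REPLACE INTO daily_event_counts (day, event_type, total)
           SELECT date(created_at), event_type, COUNT(*)
           FROM events
           GROUP BY date(created_at), event_type"""
    )
    conn.commit()
    conn.close()
```

In a real pipeline this script would typically be scheduled (cron, Airflow, etc.) or replaced by a distributed engine such as Spark or Flink when data volume demands it.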
Presentation layer: this layer presents the insights through dashboards, emails, SMS messages, push notifications, and more. Take into account that, generally, machine learning models are exposed as microservices (see the sketch after this list).
Amazon QuickSight
Metabase
Apache Superset
Tableau
Looker
Realtime dashboard
Zoomdata
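Since the text notes that machine learning models are generally exposed as microservices, here is a minimal sketch of that pattern using Flask. The `/predict` route, the placeholder model, and the port are assumptions for illustration only.

```python
# Sketch: expose a (placeholder) model as a small HTTP microservice that the
# presentation layer (dashboard, app, notification service) can call.
# Requires Flask: pip install flask
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features: list[float]) -> float:
    # Placeholder for a real trained model (e.g., loaded from disk at startup).
    return sum(features) / len(features)

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    payload = request.get_json(force=True)
    score = predict(payload["features"])
    return jsonify({"score": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```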
In the following two chapters, we will go deeper into the storage layer and give a brief explanation of the most popular types of data pipelines.