During the past few years, I have developed an interest in Machine Learning but never wrote much about the topic. In this post, I want to share some insights about the foundational layers of the ML stack. I will start with the basics of the ML stack and then move on to more advanced topics. This post will detail how to build an ETL (Extract, Transform and Load) pipeline using Python, Docker, PostgreSQL and Airflow. You will need to sit down comfortably for this one; it will not be a quick read.

Before we get started, let's take a look at what ETL is and why it is important. One of the foundational layers when it comes to Machine Learning is ETL (Extract, Transform and Load). According to Wikipedia:

> ETL is the general procedure of copying data from one or more sources into a destination system that represents the data differently from the source(s) or in a different context than the source(s).
>
> Data extraction involves extracting data from (one or more) homogeneous or heterogeneous sources; data transformation processes data by cleaning it and transforming it into a proper storage format/structure for the purposes of querying and analysis; finally, data loading describes the insertion of data into the final target database, such as an operational data store, a data mart, a data lake or a data warehouse.

One might begin to wonder: why do we need an ETL pipeline? Assume we had a set of data that we wanted to use. However, as with most data, it is unclean, missing information, and inconsistent. One solution would be to have a program clean and transform this data so that:
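The three stages above can be sketched in plain Python before bringing in any tooling like Airflow or PostgreSQL. This is a minimal illustration, not the pipeline built later in the post: the function names (`extract`, `transform`, `load`, `run_etl`) and the in-memory "source" and "warehouse" are assumptions made purely for demonstration.

```python
# A toy ETL flow: the source is a list of dicts standing in for raw input,
# and the destination is a plain list standing in for a data warehouse.

def extract(source):
    """Extract: read raw records out of the source system."""
    return list(source)

def transform(records):
    """Transform: clean the data — drop records missing a 'name'
    field and normalize casing/whitespace."""
    cleaned = []
    for row in records:
        if row.get("name"):  # skip incomplete records
            cleaned.append({
                "name": row["name"].strip().title(),
                "age": row.get("age"),
            })
    return cleaned

def load(records, destination):
    """Load: insert the cleaned records into the target store."""
    destination.extend(records)
    return destination

def run_etl(source):
    warehouse = []
    return load(transform(extract(source)), warehouse)

raw = [
    {"name": "  ada lovelace "},
    {"name": ""},                       # unclean: missing information
    {"name": "alan turing", "age": 41},
]
print(run_etl(raw))
# → [{'name': 'Ada Lovelace', 'age': None}, {'name': 'Alan Turing', 'age': 41}]
```

In a real pipeline each stage would talk to an external system (an API or file store on the extract side, a PostgreSQL table on the load side), and an orchestrator such as Airflow would schedule and retry the stages; the shape of the flow, however, stays the same.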