Today: ETL

I am going to look at different areas of the tech industry and try to shed some light on the roles you may want to fill. I plan for this to be an ongoing series, and I will see whether people are interested in exploring it further.

ETL Developer

ETL developers do DB development, especially in big data. This is a process that requires a lot of planning before moving on to the implementation, so I wouldn't necessarily call it "day to day operations," at least not as a whole.

Let's clear up the terminology a bit: ETL = Extract, Transform and Load. Basically, you take data from one source, transform it, and load it into another place. That other place is carefully crafted to fit the data and to be expanded on when the time comes.

In some cases it could be simple, say mapping fields between two databases, but that is rarely the case. Usually the work is more complicated: an ETL developer performs mathematical operations on some fields, combines them with other data sources, then loads the results into a final location.
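To make that a bit more concrete, here is a minimal sketch of the three steps in Python. Everything in it is made up for illustration: the `ORDERS` and `CUSTOMERS` data, the `order_totals` table, and the function names are all assumptions, and an in-memory SQLite database stands in for the real destination. An actual pipeline would read from real systems and run on dedicated tooling.

```python
import sqlite3

# Hypothetical source rows, standing in for an extract from an orders system.
ORDERS = [
    {"order_id": 1, "customer_id": 10, "quantity": 3, "unit_price": 2.50},
    {"order_id": 2, "customer_id": 11, "quantity": 1, "unit_price": 40.00},
    {"order_id": 3, "customer_id": 10, "quantity": 2, "unit_price": 9.99},
]

# Hypothetical second data source: a customer lookup keyed by customer_id.
CUSTOMERS = {10: "Acme", 11: "Globex"}


def extract():
    """Extract: pull the raw rows from the source."""
    return list(ORDERS)


def transform(orders, customers):
    """Transform: compute a total per order and combine it with the customer name."""
    rows = []
    for order in orders:
        total = round(order["quantity"] * order["unit_price"], 2)
        rows.append((order["order_id"], customers[order["customer_id"]], total))
    return rows


def load(conn, rows):
    """Load: write the transformed rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS order_totals ("
        "order_id INTEGER PRIMARY KEY, customer_name TEXT, total REAL)"
    )
    with conn:  # commit on success, roll back if an insert fails
        conn.executemany("INSERT INTO order_totals VALUES (?, ?, ?)", rows)


def run_etl(conn):
    """Run the three steps in order and return the number of rows loaded."""
    rows = transform(extract(), CUSTOMERS)
    load(conn, rows)
    return len(rows)
```

Calling `run_etl(sqlite3.connect(":memory:"))` runs the whole thing against a throwaway database. Keeping the three steps as separate functions is just one way to lay it out, but it does make each step easy to check on its own.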

Size matters here. You can't look at it as moving small databases around; even in a small company you will be handling many TB of data.

While the size of the company may dictate how much responsibility your job entails, it does not mean that you will necessarily be working with less data.

In small to medium-sized companies, the database developer and the ETL developer could be the same person, with the positions rolled into one. A larger company will likely fill them with specialized roles: people who only do ETL and people who only do database development.

Often the database developer will work with an application developer to develop a database.

The Load

Since the volume of data is so big you will never find the ETL developers working with the data loads manually. In most cases that would be impossible.

Instead, you will find them heavily invested in the design process, using tools that allow them to manipulate the data on a schedule. That involves a lot of data modeling due to the size of the job.
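One common way to keep a big job manageable is to process the data in fixed-size batches instead of all at once. The sketch below shows the idea; batching is my assumption here rather than something every tool does, and the `batches` and `run_job` names and the list used as a destination are hypothetical. In practice a scheduler would kick off something like `run_job` at set times.

```python
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def run_job(source, sink, batch_size=1000):
    """Move rows from source to sink one batch at a time; return the row count."""
    loaded = 0
    for chunk in batches(source, batch_size):
        sink.extend(chunk)  # stand-in for a bulk insert into the real destination
        loaded += len(chunk)
    return loaded
```

Because `batches` only ever holds one chunk in memory, the same code works whether the source has a few rows or far more than would fit in memory at once.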

Why? Because you can't just copy and paste your data into SQLite and call it a day when you need to scan 100 TB of data and actually return meaningful results as fast as possible.

People want the data returned from their query, and they want it now.

The team that is responsible for infrastructure and ETL of any big data platform is mainly composed of people with software engineering backgrounds. Most of their time is spent designing the ETL and making sure it is running without error.

And running without error is definitely an aspect that has little room for compromise: you don't want to be exposed to data loss.
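One way to guard against partial loads is to make each load all-or-nothing and to check the row count before committing. Here is a rough sketch using Python's built-in `sqlite3` module. The `target` table (assumed to already exist with `id` and `value` columns), the `load_all_or_nothing` name, and the idea of passing in an expected count from the source are all assumptions for illustration.

```python
import sqlite3


def load_all_or_nothing(conn, rows, expected_count):
    """Insert rows into the hypothetical `target` table in one transaction.

    Returns True if every row was loaded and the count matches what the
    source reported; otherwise the transaction is rolled back and False
    is returned, so the table is left as it was.
    """
    before = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.executemany("INSERT INTO target (id, value) VALUES (?, ?)", rows)
            after = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
            if after - before != expected_count:
                raise ValueError("row count does not match the source")
    except (sqlite3.Error, ValueError):
        return False
    return True
```

If the source says three rows were extracted but only two arrive, or if one insert fails halfway through, nothing is written at all. A failed load that can be rerun is usually easier to deal with than a half-finished one.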

As you can see, ETL development is a subset of database development that focuses on the pipeline and tools. The two are often rolled together, with the pipeline being used to extract data from a data source, transform it into something useful, and load it into a database.

It may sound straightforward, but it is more involved than you might imagine.

No matter how you look at it, there are many aspects to overall database development, so ETL is just one part. At a small company, a database developer might do lots of different database development tasks, but at a larger company, they will be more likely to have specific people writing reports, doing ETL, etc.
