I have been a Data Engineer for around one and a half years, with the assigned job description as follows:
Job Description
- Working knowledge of big data processing and big data management.
- Working knowledge of data lake and lakehouse technologies.
- Working knowledge of the Python and Scala programming languages.
- Strong knowledge of writing SQL queries.
- Sound working knowledge of databases/servers such as Amazon RDS (MySQL) and Amazon Redshift.
- Experience with Unix/Linux systems and cloud-based infrastructure and platform services, especially AWS.
- Adequate knowledge of database modeling and optimization.
- Familiarity with data warehousing tools and processes.
- Sound working knowledge of developing data pipelines and ETL tools.
- Experience with data processing tools like Spark and the Hadoop ecosystem.
- Experience with building and deploying RESTful APIs.
- Knowledge of version control systems (Git).
Work Breakdown
I facilitate tasks such as data acquisition, automation of the Extract-Transform-Load (ETL) operations on the acquired data through pipeline construction, and the development of scalable big data solutions, i.e., data warehouses and data lakes, in the AWS cloud.
Data Acquisition
In data acquisition, business data is fetched from sources such as Google AdWords, Google Analytics, Facebook, Pinterest, and so on. The API of each source returns data in a different format, so the goal of acquisition is to fetch the data in whatever format the source provides and normalize it into a common format, either CSV or TSV. Data acquisition uses a custom Python framework built on the factory design pattern, with each factory item implementing the pull for a single source, as sketched below.
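A minimal sketch of what such a factory can look like is shown here; the class names, source names, and sample fields are illustrative assumptions, not the actual internal framework:

```python
# Sketch of a factory-style acquisition framework (names are illustrative).
import csv
from abc import ABC, abstractmethod


class SourcePuller(ABC):
    """One factory item: pulls data from a single source and writes CSV."""

    @abstractmethod
    def fetch(self) -> list:
        """Call the source API and return rows as dictionaries."""

    def to_csv(self, path: str) -> None:
        rows = self.fetch()
        if not rows:
            return
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)


class GoogleAnalyticsPuller(SourcePuller):
    def fetch(self) -> list:
        # Placeholder for the real Google Analytics API call.
        return [{"date": "2021-01-01", "sessions": 120, "medium": "organic"}]


class FacebookAdsPuller(SourcePuller):
    def fetch(self) -> list:
        # Placeholder for the real Facebook Marketing API call.
        return [{"date": "2021-01-01", "spend": 45.0, "clicks": 310}]


PULLERS = {
    "google_analytics": GoogleAnalyticsPuller,
    "facebook_ads": FacebookAdsPuller,
}


def make_puller(source: str) -> SourcePuller:
    """Factory function: return the puller registered for a source name."""
    return PULLERS[source]()


if __name__ == "__main__":
    make_puller("google_analytics").to_csv("google_analytics.csv")
```

Each new source then only needs a new puller class registered in the factory, while the rest of the pipeline keeps consuming the same CSV/TSV output.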
ETL Pipelining
ETL stands for Extract, Transform, and Load: the operations that extract data from the common format, transform it, and load it into a database. The CSV or TSV data from data acquisition is passed to the ETL pipeline, which loads the data, performs a star schema transformation followed by daily and monthly aggregations, and finally loads the results into Redshift. The ETL pipeline is constructed using Spark with Scala.
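The production pipeline is written in Scala, but the flow can be sketched in PySpark as below; the bucket path, column names, table names, and Redshift connection options are illustrative assumptions:

```python
# PySpark sketch of the extract -> star schema -> aggregate -> Redshift flow.
# Paths, schemas, and connection options are placeholders, not production values.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("marketing-etl-sketch").getOrCreate()

# Extract: read the common-format TSV produced by data acquisition.
raw = spark.read.csv("s3://example-bucket/acquisition/facebook_ads.tsv",
                     sep="\t", header=True, inferSchema=True)

# Transform: split into a star schema -- one fact table plus a dimension table.
dim_campaign = raw.select("campaign_id", "campaign_name").dropDuplicates()
fact_spend = raw.select("date", "campaign_id", "spend", "clicks")

# Aggregate: daily and monthly rollups over the fact table.
daily = (fact_spend.groupBy("date", "campaign_id")
         .agg(F.sum("spend").alias("spend"), F.sum("clicks").alias("clicks")))
monthly = (daily.withColumn("month", F.date_format("date", "yyyy-MM"))
           .groupBy("month", "campaign_id")
           .agg(F.sum("spend").alias("spend"), F.sum("clicks").alias("clicks")))

# Load: write the aggregated result to Redshift over JDBC.
(monthly.write.format("jdbc")
 .option("url", "jdbc:redshift://example-cluster:5439/dw")
 .option("dbtable", "marketing.monthly_spend")
 .option("user", "etl_user")
 .option("password", "***")
 .option("driver", "com.amazon.redshift.jdbc42.Driver")
 .mode("append")
 .save())
```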
Scalable Big Data Solutions
Scalable Big Data Solutions, or simply the DataLayer as we call it, is a Flask RESTful API service consisting of Python endpoints which, when called, return SQL-queried data from Redshift, MySQL, or any other data location. The API returns the data in JSON format to the BI application that calls the endpoint.
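A minimal sketch of a DataLayer-style endpoint is shown below; the route, table, and connection details are illustrative assumptions, not the actual service:

```python
# Sketch of a DataLayer-style Flask RESTful endpoint (details are illustrative).
import psycopg2
from flask import Flask
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)


def query_redshift(sql, params=()):
    """Run a SQL query against Redshift and return rows as dictionaries."""
    conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                            port=5439, dbname="dw",
                            user="bi_user", password="***")
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            columns = [col[0] for col in cur.description]
            return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        conn.close()


class MonthlySpend(Resource):
    def get(self, month):
        # flask-restful serializes the returned list of dicts to JSON.
        return query_redshift(
            "SELECT campaign_id, spend, clicks "
            "FROM marketing.monthly_spend WHERE month = %s",
            (month,))


api.add_resource(MonthlySpend, "/monthly-spend/<string:month>")

if __name__ == "__main__":
    app.run(debug=True)
```

The BI tool then calls, for example, GET /monthly-spend/2021-01 and receives the query result as JSON.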