I have been a Data Engineer for around one and a half years, with the skills and responsibilities listed below:
- Working knowledge of big data processing and big data management.
- Working knowledge of data lake and lakehouse technologies.
- Working knowledge of the Python and Scala programming languages.
- Strong knowledge of writing SQL queries.
- Sound working knowledge of databases such as Amazon RDS (MySQL) and Amazon Redshift.
- Experience with Unix/Linux systems and with cloud-based infrastructure and platform services, especially AWS.
- Adequate knowledge of database modeling and optimization.
- Familiarity with data warehousing tools and processes.
- Sound working knowledge of developing data pipelines and ETL tools.
- Experience with data processing tools such as Spark and Hadoop.
- Experience with building and deploying RESTful APIs.
- Knowledge of version control systems (Git).
Work Dissections
My work covers data acquisition, automation of the Extract-Transform-Load (ETL) operations on the acquired data by building pipelines, and development of scalable big data solutions, i.e., data warehouses and data lakes in the AWS cloud.
In data acquisition, business data is fetched from its respective sources, such as Google Adwords, Google Analytics, Facebook, Pinterest and so on. The API of each source returns data in a different format. The goal of acquisition is therefore to fetch data in whatever format a source provides and convert it into one common format, either CSV or TSV.
Data acquisition uses a custom Python framework built on the factory design pattern, in which each factory item handles the pull from one unique source.
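The factory idea can be illustrated with a minimal sketch. This is not the actual framework: the class names, the `make_puller` function, the two sources and their sample records are all hypothetical, and real pullers would call the source APIs instead of returning hard-coded data.

```python
import csv
import io
from abc import ABC, abstractmethod


class SourcePuller(ABC):
    """One factory item: pulls a single source and flattens its records."""

    @abstractmethod
    def pull(self):
        """Return a list of flat dicts, one per record."""


class AdsPuller(SourcePuller):
    # Hypothetical source whose API returns nested records.
    def pull(self):
        raw = [{"campaign": {"name": "spring"}, "metrics": {"clicks": 12}}]
        return [
            {"campaign": r["campaign"]["name"], "clicks": r["metrics"]["clicks"]}
            for r in raw
        ]


class AnalyticsPuller(SourcePuller):
    # Hypothetical source whose API returns positional rows.
    def pull(self):
        raw = [("homepage", 40)]
        return [{"page": page, "sessions": sessions} for page, sessions in raw]


# Registry mapping a source name to the class that pulls it.
PULLERS = {"ads": AdsPuller, "analytics": AnalyticsPuller}


def make_puller(source_name):
    """Factory function: return the puller registered for a source name."""
    try:
        return PULLERS[source_name]()
    except KeyError:
        raise ValueError(f"unknown source: {source_name}") from None


def to_delimited(records, delimiter=","):
    """Serialize non-empty flat records to CSV (comma) or TSV (tab) text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0]),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

With this layout, adding a new source means writing one more `SourcePuller` subclass and registering it in `PULLERS`; the serialization step stays the same, e.g. `to_delimited(make_puller("analytics").pull(), delimiter="\t")` for TSV output.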
ETL stands for Extract, Transform and Load: data is extracted from the common format, transformed, and loaded into a database. The CSV or TSV data produced by data acquisition is passed to the ETL pipeline, which reads the data, performs a star schema transformation followed by daily and monthly aggregations, and finally loads the results into Redshift.
The ETL pipeline is built using Spark with Scala.
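The transformation steps can be sketched in a few lines. The real pipeline runs on Spark with Scala; the plain Python below is only a stand-in to show the logic, and the column names (`date`, `source`, `clicks`) and sample rows are assumptions made for illustration. The final load into Redshift is left out.

```python
from collections import defaultdict

# Hypothetical rows as they might arrive from the CSV/TSV stage.
rows = [
    {"date": "2021-01-01", "source": "ads", "clicks": 10},
    {"date": "2021-01-01", "source": "ads", "clicks": 5},
    {"date": "2021-01-02", "source": "social", "clicks": 7},
    {"date": "2021-02-01", "source": "ads", "clicks": 3},
]


def to_star_schema(rows):
    """Split rows into a source dimension and a fact table keyed by source_id."""
    dim_source = {}  # source name -> surrogate key
    facts = []
    for row in rows:
        # Assign the next surrogate key the first time a source is seen.
        source_id = dim_source.setdefault(row["source"], len(dim_source) + 1)
        facts.append(
            {"date": row["date"], "source_id": source_id, "clicks": row["clicks"]}
        )
    return dim_source, facts


def aggregate(facts, period_length):
    """Sum clicks per (period, source_id); the period is a prefix of the ISO date."""
    totals = defaultdict(int)
    for fact in facts:
        period = fact["date"][:period_length]
        totals[(period, fact["source_id"])] += fact["clicks"]
    return dict(totals)


dim_source, facts = to_star_schema(rows)
daily = aggregate(facts, 10)   # prefix "YYYY-MM-DD"
monthly = aggregate(facts, 7)  # prefix "YYYY-MM"
```

Here `daily[("2021-01-01", 1)]` is 15 and `monthly[("2021-01", 1)]` is also 15, since source `ads` (key 1) has no other January rows. In Spark the same steps would typically be expressed as joins and group-by aggregations on DataFrames.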
Scalable Big Data Solutions
Scalable Big Data Solutions, or simply DataLayer as we call it, is a Flask RESTful API framework. It consists of Python APIs which, when called, return SQL query results from Redshift, MySQL, or any other data location. The API returns the data in JSON format to the BI application that calls the API endpoint.
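The query-to-JSON step behind such an endpoint might look like the sketch below. It is a simplified stand-in, not the DataLayer code: it uses an in-memory SQLite database from the Python standard library instead of Redshift or MySQL, leaves out the Flask routing, and the `query_to_json` helper, the `daily_clicks` table and its columns are hypothetical.

```python
import json
import sqlite3


def query_to_json(connection, sql, params=()):
    """Run a parameterized query and return the rows as a JSON array of objects."""
    cursor = connection.execute(sql, params)
    # cursor.description holds one tuple per column; the name is the first item.
    columns = [description[0] for description in cursor.description]
    return json.dumps([dict(zip(columns, row)) for row in cursor.fetchall()])


# In-memory stand-in for the warehouse; table and column names are hypothetical.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE daily_clicks (day TEXT, source TEXT, clicks INTEGER)"
)
connection.executemany(
    "INSERT INTO daily_clicks VALUES (?, ?, ?)",
    [("2021-01-01", "ads", 15), ("2021-01-02", "social", 7)],
)

payload = query_to_json(
    connection,
    "SELECT day, clicks FROM daily_clicks WHERE source = ? ORDER BY day",
    ("ads",),
)
```

In a Flask view, a string like `payload` would be returned as the response body with a JSON content type. Passing user input through query parameters, as with `source = ?` above, rather than formatting it into the SQL string, keeps the endpoint safe from SQL injection.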