Our client has started a greenfield project to build an in-house data lake solution. This will serve various BI services and projects as well as other Data Science tools and initiatives. It will ingest near real-time market data from multiple sources and store it centrally in the cloud (AWS). The scope and types of data consumed, as well as the target consumers, will continue to expand.
The existing functionality includes the following:
• Ingestion of multiple data sources, in batch or near real-time, into shared storage.
• Storage decoupled from compute and data processing, providing increased scalability and performance while reducing storage costs.
• Data compressed and partitioned using the Parquet format, which improves performance while reducing the cost of data retrieval.
• Centralized metadata catalogue that maintains a single view of the data model for all its consumers.
• Integration with other AWS services (Athena, EMR, QuickSight, …) and third-party solutions (Apache Spark, Presto, Tableau, …).
The project is a new initiative sponsored at senior levels within the business. You will work in a small development team focused on delivering high-quality solutions for electronic trading and pricing workflows. The aim of the project is to build an agile team that works closely with the business sponsors to deliver a high-quality platform that meets the needs of a global trading business.
Data Engineer / Data Scientist
As a senior developer, you will be experienced in big data ingestion and in working with large data sets in general. You will be responsible for creating and maintaining high-quality ETL pipelines.
Job Responsibilities / Role:
• Take responsibility for software delivery, ensuring quality and scope expectations are met.
• Contribute to and take ownership of the technical design, and ensure all aspects of the system architecture are well documented.
• Work closely with partner technology teams and collaborate effectively.
Candidates must have the technical skills listed below and, in addition, have worked within a data team during the last 5–10 years. A history of role stability is preferred.
Technical Skills Required:
• Very deep understanding of Python, PySpark, Pandas, and JupyterLab (working with notebooks).
• Experience in SQL, Hive, Hadoop.
• Experience using the AWS platform.
• Solid experience with continuous integration and continuous delivery tools such as Git and Jenkins.
• Agile development / software development lifecycle.
Nice-to-have Skills:
• Experience with Kafka.
• Experience using EMR (Elastic MapReduce) in AWS to run Spark clusters.
• Knowledge of Terraform.
• Experience with Ansible, Bash scripting, boto3.
• Experience configuring continuous integration and continuous delivery tools.
What do we offer you?