Looking for a Software Engineer with knowledge of Hadoop and Spark and experience with data mining, data governance frameworks, and stream processing technologies (Kafka, Spark Streaming).
- Develop Spark applications in Scala and deploy Spark Streaming applications with an optimized number of executors, write-ahead logs, and checkpoint configurations.
- Develop Spark code using Scala and Spark SQL for faster testing and processing of data, and improve the performance of existing algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Design multiple POCs/prototypes using Spark, deploy them on the YARN cluster, and compare the performance of Spark with SQL. Also create data pipelines for the different ingestion and aggregation events, and load the corresponding data into Glue catalog tables in HDFS locations to serve as a feed for the abstraction layer and downstream applications.
- Coordinate with the production warranty support group to develop new releases, investigate and resolve failed jobs, work with QA to create test cases, and assist in creating implementation plans.
- Create Elastic MapReduce (EMR) clusters and set up environments on AWS EC2 instances; import data from AWS S3 into Spark RDDs and perform transformations and actions on the RDDs.
- Collect data from AWS S3 buckets in near real time using Spark Streaming, perform the necessary transformations and aggregations to build the data model, and persist the data in HDFS.
- Work with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files, and work extensively with Parquet file formats.
- Implement a mechanism for triggering the Spark applications on EMR on file arrival from the client.
- Work on continuous integration of the application using Jenkins, Rundeck, and CI/CD pipelines. Coordinate with the team on design decisions and translate functional and technical requirements into detailed programs running on Spark.
- Create mappings and workflows to extract and load data from relational databases, flat file sources, and legacy systems using Azure. Implement an application to perform address normalization for all client datasets, administer the cluster, and tune memory based on RDD usage.
- The minimum education requirement to perform the above job duties is a Bachelor’s degree in Computer Science, Applications, Engineering, or a related technical field.
- Should have good knowledge of the Hadoop ecosystem, Spark Scala, Python, and Java
- Should have NoSQL, SparkSQL, and ANSI SQL query language skills
- Strong verbal and written communication and English language skills