Senior Data Engineer - Python & PySpark

Posted:
12/16/2025, 8:22:52 PM

Location(s):
Chennai, Tamil Nadu, India

Experience Level(s):
Senior

Field(s):
Data & Analytics

The Senior Data Engineer will be responsible for the architecture, design, development, and maintenance of our data platforms, with a strong focus on leveraging Python and PySpark for data processing and transformation. This role calls for a seasoned technical leader who can work both independently and as part of a team, contributing to the overall data strategy and helping to drive data-driven decision-making across the organization.
Key Responsibilities
  • Data Architecture & Design: Design, develop, and optimize data architectures, pipelines, and data models to support various business needs, including analytics, reporting, and machine learning.
  • ETL/ELT Development (Python/PySpark Focus): Build, test, and deploy highly scalable and efficient ETL/ELT processes using Python and PySpark to ingest, transform, and load data from diverse sources into data warehouses and data lakes. Develop and optimize complex data transformations using PySpark (a minimal sketch follows this list).
  • Data Quality & Governance: Implement best practices for data quality, data governance, and data security to ensure the integrity, reliability, and privacy of our data assets.
  • Performance Optimization: Monitor, troubleshoot, and optimize data pipeline performance, ensuring data availability and timely delivery, particularly for PySpark jobs.
  • Infrastructure Management: Collaborate with DevOps and MLOps teams to manage and optimize data infrastructure, including cloud resources (AWS, Azure, GCP), databases, and data processing frameworks, ensuring efficient operation of PySpark clusters.
  • Mentorship & Leadership: Provide technical guidance, mentorship, and code reviews to junior data engineers, particularly in Python and PySpark best practices, fostering a culture of excellence and continuous improvement.
  • Collaboration: Work closely with data scientists, analysts, product managers, and other stakeholders to understand data requirements and deliver solutions that meet business objectives.
  • Innovation: Research and evaluate new data technologies, tools, and methodologies to enhance our data capabilities and stay ahead of industry trends.
  • Documentation: Create and maintain comprehensive documentation for data pipelines, data models, and data infrastructure.
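The ETL/ELT responsibility above maps onto a typical PySpark batch job. Below is a minimal sketch of such a pipeline; the S3 paths, schema, and column names are assumptions for illustration, not details from this posting:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transactions-etl").getOrCreate()

    # Ingest: read raw transaction records (hypothetical path and schema).
    raw = spark.read.parquet("s3://example-bucket/raw/transactions/")

    # Transform: deduplicate, filter, and enrich with DataFrame operations.
    clean = (
        raw.dropDuplicates(["transaction_id"])
           .filter(F.col("amount").isNotNull())
           .withColumn("txn_date", F.to_date("event_ts"))
           .withColumn("amount_usd", F.round(F.col("amount") * F.col("fx_rate"), 2))
    )

    # Load: write partitioned output for downstream analytics and reporting.
    clean.write.mode("overwrite").partitionBy("txn_date").parquet(
        "s3://example-bucket/curated/transactions/"
    )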
Qualifications
Education
  • Bachelor's or Master's degree in Computer Science, Software Engineering, Data Science, or a related quantitative field.
Experience
  • 5+ years of professional experience in data engineering, with a strong emphasis on building and maintaining large-scale data systems.
  • Extensive hands-on experience with Python for data engineering tasks.
  • Proven experience with PySpark for big data processing and transformation.
  • Demonstrated experience with cloud data platforms (e.g., AWS Redshift, S3, EMR, Glue; Azure Data Lake, Databricks, Synapse; Google BigQuery, Dataflow).
  • Strong experience with SQL and NoSQL databases (e.g., PostgreSQL, MySQL, MongoDB, Cassandra).
  • Extensive experience with distributed data processing frameworks, especially Apache Spark.
Technical Skills
  • Programming Languages: Expert proficiency in Python is mandatory, and mastery of SQL is essential. Familiarity with Scala or Java is a plus.
  • Big Data Technologies: In-depth knowledge and hands-on experience with Apache Spark (PySpark) for data processing, including Spark SQL, Spark Streaming, and the DataFrame API (see the sketch after this list). Experience with Apache Kafka, Apache Airflow, Delta Lake, or similar technologies.
  • Data Warehousing: In-depth knowledge of data warehousing concepts, dimensional modeling, and ETL/ELT processes.
  • Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS, Azure, GCP) and their data services, particularly those supporting Spark/PySpark workloads.
  • Containerization: Familiarity with Docker and Kubernetes is a plus.
  • Version Control: Proficient with Git and CI/CD pipelines.
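As a quick illustration of the Spark SQL and DataFrame API knowledge called out above, the snippet below expresses the same aggregation both ways; the orders data is invented for the example:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    # Hypothetical orders data registered as a temporary view.
    orders = spark.createDataFrame(
        [(1, "books", 12.50), (2, "games", 30.00), (3, "books", 7.25)],
        ["order_id", "category", "amount"],
    )
    orders.createOrReplaceTempView("orders")

    # The same aggregation via Spark SQL and via the DataFrame API.
    by_sql = spark.sql(
        "SELECT category, SUM(amount) AS total FROM orders GROUP BY category"
    )
    by_api = orders.groupBy("category").agg(F.sum("amount").alias("total"))

    by_sql.show()
    by_api.show()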
Soft Skills
  • Excellent problem-solving and analytical abilities.
  • Strong communication and interpersonal skills, with the ability to explain complex technical concepts to non-technical stakeholders.
  • Ability to work effectively in a fast-paced, agile environment.
  • Proactive and self-motivated with a strong sense of ownership.
Preferred Qualifications
  • Experience with real-time data streaming and processing using PySpark Structured Streaming (a minimal sketch follows this list).
  • Knowledge of machine learning concepts and MLOps practices, especially integrating ML workflows with PySpark.
  • Familiarity with data visualization tools (e.g., Tableau, Power BI).
  • Contributions to open-source data projects.
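For the Structured Streaming qualification above, a minimal sketch follows; the Kafka broker address, topic name, and checkpoint path are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    # Read a stream from a hypothetical Kafka topic.
    events = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "events")
             .load()
    )

    # Count events per 5-minute window, keyed on the Kafka message key.
    counts = (
        events.selectExpr("CAST(key AS STRING) AS key", "timestamp")
              .groupBy(F.window("timestamp", "5 minutes"), "key")
              .count()
    )

    # Emit incremental results to the console; a production job would
    # target a durable sink such as Delta Lake.
    query = (
        counts.writeStream.outputMode("update")
              .format("console")
              .option("checkpointLocation", "/tmp/checkpoints/stream-demo")
              .start()
    )
    query.awaitTermination()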

------------------------------------------------------

Job Family Group:

Technology

------------------------------------------------------

Job Family:

Data Analytics

------------------------------------------------------

Time Type:

Full time

------------------------------------------------------

Most Relevant Skills

Please see the requirements listed above.

------------------------------------------------------

Other Relevant Skills

For complementary skills, please see above and/or contact the recruiter.

------------------------------------------------------

Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.

 

If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity review Accessibility at Citi.

View Citi’s EEO Policy Statement and the Know Your Rights poster.

Citi

Website: https://www.citigroup.com/

Headquarter Location: New York, New York, United States

Employee Count: 10001+

Year Founded: 1812

Last Funding Type: Post-IPO Equity

Industries: Banking ⋅ Credit Cards ⋅ Financial Services ⋅ Wealth Management