Senior Data Engineer  ·  Azure Databricks

Sagar Kumar Bala

Building production-grade data pipelines at enterprise scale — Azure Databricks, Delta Lake, and PySpark across lakehouse architectures that process hundreds of millions of records.

4+
Years Experience
600+
ETL Jobs Migrated
526M+
Records Processed
10×
Query Time Improvement
01

About

Background

4 years. 3 enterprises. One data stack.

Pune, Maharashtra, India

Senior Data Engineer with 4 years at Celebal Technologies building production-grade pipelines on Azure Databricks. Delivered end-to-end medallion architecture lakehouse solutions for a large-scale conglomerate, migrated 600+ legacy ETL jobs for a leading private-sector bank's credit risk platform, and re-implemented Oracle PL/SQL sales datamarts at 500M+ record scale for a global industrial technology manufacturer.

Comfortable owning the full pipeline lifecycle — from Autoloader ingestion and PySpark transformation to Delta Lake optimisation, ADF orchestration, and CI/CD deployment via Databricks Asset Bundles. Targeting a Senior Data Engineer role with meaningful scope on cloud-native data platforms.

02

Skills

Cloud & Platform
Azure Databricks · ADLS Gen2 · Azure Data Factory · Azure DevOps · Azure Key Vault · Azure Synapse
Processing
PySpark · Spark SQL · Spark Structured Streaming · Databricks Autoloader · Apache Kafka
Storage & Lakehouse
Delta Lake · Unity Catalog · Medallion Architecture · Bronze / Silver / Gold · HDFS
Orchestration
Databricks Workflows · Databricks Asset Bundles · ADF Pipelines · Control-M
Languages
Python · SQL · PL/SQL · Spark SQL · Bash / Shell · JavaScript
Legacy & Source Systems
Oracle · SAP HANA · MySQL · DB2 · Pentaho ETL · Cloudera Hadoop · Impala · Hive · Denodo
Other Tools
Databricks REST API · JDBC · AWS Redshift · MLflow · Power BI · Tableau · Git
03

Work

Celebal Technologies
June 2021 — Present
Large-Scale Conglomerate  ·  Diversified
Enterprise Data Platform (EDP)
Lakehouse on Azure Databricks
Data Engineer — Ingestion Layer Lead
2024 — Present

Designed and delivered the end-to-end data ingestion layer for a large-scale Enterprise Data Platform, a lakehouse implementation built on Azure Databricks. Owned the full pipeline lifecycle from heterogeneous source systems through the Bronze (Raw) layer to the Silver (Enriched) layer, following a strict medallion architecture pattern.

 Key Contributions
  • Designed a unified, parameterised PySpark ingestion framework for SAP HANA, Oracle, and MySQL JDBC sources under a single reusable script — reducing new-source onboarding effort by an estimated 60% and eliminating per-source code duplication across the medallion stack.
  • Built a Databricks Autoloader notebook for ADLS Gen2-to-Bronze ingestion with cloudFiles streaming, schema inference, checkpoint management, and trigger-once semantics — supporting both incremental and full-refresh load patterns on a single configurable pipeline.
  • Engineered an incremental Kafka Structured Streaming pipeline for Oracle GoldenGate change-data feeds into Bronze Delta tables, implementing configurable watermarks, micro-batch triggers, and fault-tolerant offset management for real-time event ingestion.
  • Implemented Bronze-to-Silver transformation logic covering PySpark data cleansing, deduplication, and MERGE operations into Delta Lake, enforcing schema contracts and idempotent, restartable execution across all ingestion paths.
  • Packaged and deployed all ingestion notebooks and job configurations as Databricks Asset Bundles (DAB), establishing a CI/CD-aligned deployment model with environment-specific overrides — enabling clean dev-to-prod promotion without manual intervention.
  • Established pipeline observability with structured logging, Databricks Workflows orchestration, and job-level retry policies — providing audit-ready failure trails aligned with Unity Catalog governance requirements.
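As a flavour of the unified ingestion framework above, here is a minimal sketch of how per-source config can be mapped onto Spark JDBC reader options under a single code path. The function name, config keys, and defaults are illustrative assumptions, not the production framework; only the JDBC driver class names and Spark option names are standard.

```python
# Illustrative sketch: one config-driven path for SAP HANA, Oracle, and MySQL
# JDBC sources. Helper name and config keys are hypothetical.

JDBC_DEFAULTS = {
    "saphana": {"driver": "com.sap.db.jdbc.Driver",
                "url_fmt": "jdbc:sap://{host}:{port}"},
    "oracle":  {"driver": "oracle.jdbc.OracleDriver",
                "url_fmt": "jdbc:oracle:thin:@{host}:{port}/{service}"},
    "mysql":   {"driver": "com.mysql.cj.jdbc.Driver",
                "url_fmt": "jdbc:mysql://{host}:{port}/{database}"},
}

def build_jdbc_options(source: dict) -> dict:
    """Translate one per-source config entry into spark.read JDBC options."""
    defaults = JDBC_DEFAULTS[source["type"]]
    opts = {
        "driver": defaults["driver"],
        "url": defaults["url_fmt"].format(**source),
        "dbtable": source["table"],
        "user": source["user"],
        "password": source["password"],  # in practice resolved from Key Vault
    }
    # Optional partitioned reads for large tables
    if "partition_column" in source:
        opts.update({
            "partitionColumn": source["partition_column"],
            "lowerBound": str(source["lower_bound"]),
            "upperBound": str(source["upper_bound"]),
            "numPartitions": str(source.get("num_partitions", 8)),
        })
    return opts
```

In a notebook, the resulting dict would feed `spark.read.format("jdbc").options(**opts).load()`, so onboarding a new source reduces to adding one config entry rather than a new script.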
Stack: Azure Databricks · PySpark · Delta Lake · Autoloader · Apache Kafka · JDBC · SAP HANA · DAB · Unity Catalog · MLflow
Leading Private-Sector Bank  ·  BFSI
Credit Risk Analytics Platform
Pentaho to Databricks Migration
Data Engineer — Notebook Automation Framework & ADF Orchestration
2021 — 2023
600+ Jobs Migrated · 25 TB / Month · 80M Customers · 10× Query Speedup · 30% Cost Reduction

Executed the end-to-end data engineering migration of a leading private-sector bank's credit risk analytics department from an on-premises 16-node Cloudera Hadoop cluster to the Azure Databricks cloud platform — migrating 600+ Pentaho ETL jobs that process 25 TB of credit risk and campaign management data per month across 80M unique customers.

 Key Contributions
  • Engineered a Python automation framework using the Databricks REST API that cloned a master template notebook, applied job-specific naming conventions, and placed each notebook in the correct workspace folder structure — eliminating an estimated 60–70% of manual creation effort across 600+ jobs with 30–40 notebooks each.
  • Converted 600+ Pentaho ETL jobs — each containing Cloudera Impala SQL and JavaScript transformation logic — into PySpark/Spark SQL on Azure Databricks, with Apache Spark parameter tuning and data partitioning strategies that contributed to a 10× improvement in query execution time.
  • Built ADF pipelines for all 600+ migrated jobs, configuring parallel and sequential Databricks Notebook execution per job-specific dependency structure; redesigned jobs exceeding ADF's 40-activity limit into sub-pipelines to maintain full functional equivalence with the original Pentaho execution model.
  • Resolved ADLS Gen2 Raw Zone access issues in a secured BFSI Databricks environment by configuring the ABFS driver via spark.conf with folder-level SAS tokens and using Managed Delta Tables as intermediate write buffers — preserving Raw Zone immutability.
  • Diagnosed and fixed Spark execution failures in an air-gapped Azure Databricks workspace — disabling AQE auto-broadcast joins for tables exceeding 8 GB and tuning spark.driver.maxResultSize to prevent driver OOM errors across credit and campaign datasets processing 80M customer records and 25 TB/month.
  • Configured Delta Lake across a 3-stage data pipeline (Staging, Intermediate Output, Global Output) on ADLS Gen2, applying OPTIMIZE/compaction for small-file consolidation and Z-ordering on high-cardinality columns to meet query performance SLAs.
  • Established a multi-environment DevOps setup (DEV/UAT/PROD) on Azure DevOps with ARM templates for ADF and a separate PySpark repo — the first version-controlled, environment-isolated release process in the client's credit risk department.
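The notebook-cloning automation above can be sketched roughly as follows. The naming convention and helper names are hypothetical stand-ins for the client-specific framework; the `/api/2.0/workspace/export` and `/api/2.0/workspace/import` endpoints are the real Databricks Workspace REST API.

```python
import requests

# Illustrative sketch of cloning a master template notebook per job via the
# Databricks Workspace API. Folder layout and naming rules are assumptions.

def target_notebook_path(domain: str, job_name: str, step: str) -> str:
    """Apply a job-specific naming convention inside a fixed folder structure."""
    name = f"{job_name}_{step}".upper().replace(" ", "_")
    return f"/Workspace/{domain}/{job_name}/{name}"

def clone_template(host: str, token: str, template_path: str, target_path: str) -> None:
    """Export the master template notebook, then re-import it at the target path."""
    headers = {"Authorization": f"Bearer {token}"}
    # Export returns the notebook source as base64 in the "content" field
    src = requests.get(
        f"{host}/api/2.0/workspace/export",
        headers=headers,
        params={"path": template_path, "format": "SOURCE"},
    ).json()
    resp = requests.post(
        f"{host}/api/2.0/workspace/import",
        headers=headers,
        json={
            "path": target_path,
            "format": "SOURCE",
            "language": "PYTHON",
            "content": src["content"],
            "overwrite": True,
        },
    )
    resp.raise_for_status()
```

Looping `clone_template` over a job manifest (600+ jobs, 30–40 notebooks each) is what turns notebook creation from manual workspace clicks into a repeatable batch run.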
Stack: Azure Databricks · REST API · PySpark · Delta Lake · ADF · ADLS Gen2 · Azure DevOps · Pentaho · Cloudera · DB2 · Power BI
Global Industrial Technology Manufacturer
FY24 Sales Dataset Build
Legacy Oracle to Databricks Migration
Data Engineer — Legacy Migration & PySpark Conversion
2023 — 2024
526M+ Core Records · 5 Regional Datamarts · 500M+ Total Records

Executed the legacy-to-cloud migration of a global manufacturer's enterprise sales reporting platform from Oracle on-premises to the Databricks platform, re-implementing five regional sales datamarts on a modern Delta Lake foundation. Converted a large body of Oracle PL/SQL procedures, packages, and Unix shell-scripted conditional logic into production-grade PySpark.

 Key Contributions
  • Converted Oracle PL/SQL procedures, packages, and functions across five regional sales datamarts into production-grade PySpark — eliminating procedural SQL constructs incompatible with Spark's distributed execution model while preserving full business logic.
  • Processed large-scale sales transaction data across rolling and historical windows — Booking (14.5M records), Backlog (12.5M records), and Costed Sales Details (526M+ records) — adapting PySpark execution plans for efficient distributed processing on Databricks.
  • Replaced Control-M job scheduling and Linux shell script conditional logic with Python-based orchestration inside Databricks — replicating runtime branching, dependency management, and environment-conditional execution without any legacy scheduler dependency.
  • Implemented a data consistency validation framework reconciling row counts and key metrics between the legacy on-prem Oracle platform outputs and new Databricks Delta Lake results — identifying and resolving discrepancies before production sign-off.
  • Delivered optimised PySpark pipelines through targeted code refactoring: eliminating redundant operations from the legacy Oracle on-prem process, consolidating similar transformation functions, and replacing custom UDFs with built-in Spark functions to meet the client's latency SLAs.
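The reconciliation framework above comes down to comparing per-table row counts (or summed key metrics) between legacy and migrated outputs. A minimal sketch, with hypothetical names and a zero tolerance by default:

```python
# Illustrative sketch of legacy-vs-migrated reconciliation. Table names,
# helper name, and the tolerance parameter are assumptions for this example.

def reconcile(legacy: dict, migrated: dict, tolerance: float = 0.0) -> list:
    """Compare per-table counts and report every discrepancy found."""
    issues = []
    for table, legacy_count in legacy.items():
        new_count = migrated.get(table)
        if new_count is None:
            issues.append((table, "missing in migrated output"))
        elif abs(new_count - legacy_count) > tolerance * max(legacy_count, 1):
            issues.append((table, f"legacy={legacy_count} migrated={new_count}"))
    return issues
```

Running a check like this per datamart before sign-off is what surfaces dropped rows or duplicate loads while they are still cheap to fix.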
Stack: Databricks · PySpark · Delta Lake · Oracle PL/SQL · Shell Scripting · SAP HANA · Denodo · AWS Redshift
04

Education

Grade B
2021
PG Diploma in Big Data Analytics
Centre for Development of Advanced Computing (CDAC)
Pune, Maharashtra · Sep 2021
9.2 CGPA
2016
Bachelor of Technology (B.Tech)
Centurion University of Technology and Management
Paralakhemundi, Odisha · Apr 2016
Get in Touch.

Open to Senior Data Engineer roles with scope on cloud-native data platforms, Azure Databricks, and large-scale pipeline architecture. Feel free to reach out.