Senior Data Engineer  ·  Azure Databricks

Sagar Kumar Bala

Building production-grade data pipelines at enterprise scale — Azure Databricks, Delta Lake, and PySpark across lakehouse architectures that process hundreds of millions of records.

4+
Years Experience
600+
ETL Jobs Migrated
526M+
Records Processed
10×
Query Time Improvement
01

About

Background

4 years. 3 enterprises. One data stack.

Pune, Maharashtra, India

Senior Data Engineer with 4 years at Celebal Technologies building production-grade pipelines on Azure Databricks. Delivered end-to-end medallion architecture lakehouse solutions for a large-scale conglomerate, migrated 600+ legacy ETL jobs for a leading private-sector bank's credit risk platform, and re-implemented Oracle PL/SQL sales datamarts at 500M+ record scale for a global industrial technology manufacturer.

Comfortable owning the full pipeline lifecycle — from Autoloader ingestion and PySpark transformation to Delta Lake optimisation, ADF orchestration, and CI/CD deployment via Databricks Asset Bundles. Targeting a Senior Data Engineer role with meaningful scope on cloud-native data platforms.

02

Skills

Cloud & Platform
Azure Databricks · ADLS Gen2 · Azure Data Factory · Azure DevOps · Azure Key Vault · Azure Synapse
Processing
PySpark · Spark SQL · Spark Structured Streaming · Databricks Autoloader · Apache Kafka
Storage & Lakehouse
Delta Lake · Unity Catalog · Medallion Architecture · Bronze / Silver / Gold · HDFS
Orchestration
Databricks Workflows · Databricks Asset Bundles · ADF Pipelines · Control-M
Languages
Python · SQL · PL/SQL · Spark SQL · Bash / Shell · JavaScript
Legacy & Source Systems
Oracle · SAP HANA · MySQL · DB2 · Pentaho ETL · Cloudera Hadoop · Impala · Hive · Denodo
Other Tools
Databricks REST API · JDBC · AWS Redshift · MLflow · Power BI · Tableau · Git
03

Work

Celebal Technologies
June 2021 — Present
Large-Scale Conglomerate  ·  Diversified
Enterprise Data Platform (EDP)
Lakehouse on Azure Databricks
Data Engineer — Ingestion Layer Lead
2024 — Present

Designed and delivered the end-to-end data ingestion layer for a large-scale Enterprise Data Platform, a lakehouse implementation built on Azure Databricks. Owned the full pipeline lifecycle from heterogeneous source systems through the Bronze (Raw) layer to the Silver (Enriched) layer, following a strict medallion architecture pattern.

 Key Contributions
  • Designed a unified, parameterised PySpark ingestion framework for SAP HANA, Oracle, and MySQL JDBC sources under a single reusable script — reducing new-source onboarding effort by an estimated 60% and eliminating per-source code duplication across the medallion stack.
  • Built a Databricks Autoloader notebook for ADLS Gen2-to-Bronze ingestion with cloudFiles streaming, schema inference, checkpoint management, and trigger-once semantics — supporting both incremental and full-refresh load patterns on a single configurable pipeline.
  • Engineered an incremental Kafka Structured Streaming pipeline for Oracle GoldenGate change-data feeds into Bronze Delta tables, implementing configurable watermarks, micro-batch triggers, and fault-tolerant offset management for real-time event ingestion.
  • Implemented Bronze-to-Silver transformation logic covering PySpark data cleansing, deduplication, and MERGE operations into Delta Lake, enforcing schema contracts and idempotent, restartable execution across all ingestion paths.
  • Packaged and deployed all ingestion notebooks and job configurations as Databricks Asset Bundles (DAB), establishing a CI/CD-aligned deployment model with environment-specific overrides — enabling clean dev-to-prod promotion without manual intervention.
  • Established pipeline observability with structured logging, Databricks Workflows orchestration, and job-level retry policies — providing audit-ready failure trails aligned with Unity Catalog governance requirements.
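As a flavour of the unified ingestion framework above, here is a minimal sketch of how per-source config can be mapped onto Spark JDBC reader options under a single code path. The function name, config keys, and defaults are illustrative assumptions, not the production framework; only the JDBC driver class names and Spark option names are standard.

```python
# Illustrative sketch: one config-driven path for SAP HANA, Oracle, and MySQL
# JDBC sources. Helper name and config keys are hypothetical.

JDBC_DEFAULTS = {
    "saphana": {"driver": "com.sap.db.jdbc.Driver",
                "url_fmt": "jdbc:sap://{host}:{port}"},
    "oracle":  {"driver": "oracle.jdbc.OracleDriver",
                "url_fmt": "jdbc:oracle:thin:@{host}:{port}/{service}"},
    "mysql":   {"driver": "com.mysql.cj.jdbc.Driver",
                "url_fmt": "jdbc:mysql://{host}:{port}/{database}"},
}

def build_jdbc_options(source: dict) -> dict:
    """Translate one per-source config entry into spark.read JDBC options."""
    defaults = JDBC_DEFAULTS[source["type"]]
    opts = {
        "driver": defaults["driver"],
        "url": defaults["url_fmt"].format(**source),
        "dbtable": source["table"],
        "user": source["user"],
        "password": source["password"],  # in practice resolved from Key Vault
    }
    # Optional partitioned reads for large tables
    if "partition_column" in source:
        opts.update({
            "partitionColumn": source["partition_column"],
            "lowerBound": str(source["lower_bound"]),
            "upperBound": str(source["upper_bound"]),
            "numPartitions": str(source.get("num_partitions", 8)),
        })
    return opts
```

In a notebook, the resulting dict would feed `spark.read.format("jdbc").options(**opts).load()`, so onboarding a new source reduces to adding one config entry rather than a new script.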
Stack: Azure Databricks · PySpark · Delta Lake · Autoloader · Apache Kafka · JDBC · SAP HANA · DAB · Unity Catalog · MLflow
Leading Private-Sector Bank  ·  BFSI
Credit Risk Analytics Platform
Pentaho to Databricks Migration
Data Engineer — Notebook Automation Framework & ADF Orchestration
2021 — 2023
600+ Jobs Migrated · 25 TB / Month · 80M Customers · 10× Query Speedup · 30% Cost Reduction

Executed the end-to-end data engineering migration of a leading private-sector bank's credit risk analytics department from an on-premises 16-node Cloudera Hadoop cluster to the Azure Databricks cloud platform — migrating 600+ Pentaho ETL jobs that process 25 TB of credit risk and campaign management data per month across 80M unique customers.

 Key Contributions
  • Engineered a Python automation framework using the Databricks REST API that cloned a master template notebook, applied job-specific naming conventions, and placed each notebook in the correct workspace folder structure — eliminating an estimated 60–70% of manual creation effort across 600+ jobs with 30–40 notebooks each.
  • Converted 600+ Pentaho ETL jobs — each containing Cloudera Impala SQL and JavaScript transformation logic — into PySpark/Spark SQL on Azure Databricks, with Apache Spark parameter tuning and data partitioning strategies that contributed to a 10× improvement in query execution time.
  • Built ADF pipelines for all 600+ migrated jobs, configuring parallel and sequential Databricks Notebook execution per job-specific dependency structure; redesigned jobs exceeding ADF's 40-activity limit into sub-pipelines to maintain full functional equivalence with the original Pentaho execution model.
  • Resolved ADLS Gen2 Raw Zone access issues in a secured BFSI Databricks environment by configuring the ABFS driver via spark.conf with folder-level SAS tokens and using Managed Delta Tables as intermediate write buffers — preserving Raw Zone immutability.
  • Diagnosed and fixed Spark execution failures in an air-gapped Azure Databricks workspace — disabling AQE auto-broadcast joins for tables exceeding 8 GB and tuning spark.driver.maxResultSize to prevent driver OOM errors across credit and campaign datasets processing 80M customer records and 25 TB/month.
  • Configured Delta Lake across a 3-stage data pipeline (Staging, Intermediate Output, Global Output) on ADLS Gen2, applying OPTIMIZE/compaction for small-file consolidation and Z-ordering on high-cardinality columns to meet query performance SLAs.
  • Established a multi-environment DevOps setup (DEV/UAT/PROD) on Azure DevOps with ARM templates for ADF and a separate PySpark repo — the first version-controlled, environment-isolated release process in the client's credit risk department.
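The notebook-cloning automation above can be sketched roughly as follows. The naming convention and helper names are hypothetical stand-ins for the client-specific framework; the `/api/2.0/workspace/export` and `/api/2.0/workspace/import` endpoints are the real Databricks Workspace REST API.

```python
import requests

# Illustrative sketch of cloning a master template notebook per job via the
# Databricks Workspace API. Folder layout and naming rules are assumptions.

def target_notebook_path(domain: str, job_name: str, step: str) -> str:
    """Apply a job-specific naming convention inside a fixed folder structure."""
    name = f"{job_name}_{step}".upper().replace(" ", "_")
    return f"/Workspace/{domain}/{job_name}/{name}"

def clone_template(host: str, token: str, template_path: str, target_path: str) -> None:
    """Export the master template notebook, then re-import it at the target path."""
    headers = {"Authorization": f"Bearer {token}"}
    # Export returns the notebook source as base64 in the "content" field
    src = requests.get(
        f"{host}/api/2.0/workspace/export",
        headers=headers,
        params={"path": template_path, "format": "SOURCE"},
    ).json()
    resp = requests.post(
        f"{host}/api/2.0/workspace/import",
        headers=headers,
        json={
            "path": target_path,
            "format": "SOURCE",
            "language": "PYTHON",
            "content": src["content"],
            "overwrite": True,
        },
    )
    resp.raise_for_status()
```

Looping `clone_template` over a job manifest (600+ jobs, 30–40 notebooks each) is what turns notebook creation from manual workspace clicks into a repeatable batch run.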
Stack: Azure Databricks · REST API · PySpark · Delta Lake · ADF · ADLS Gen2 · Azure DevOps · Pentaho · Cloudera · DB2 · Power BI
Global Industrial Technology Manufacturer
FY24 Sales Dataset Build
Legacy Oracle to Databricks Migration
Data Engineer — Legacy Migration & PySpark Conversion
2023 — 2024
526M+ Core Records · 5 Regional Datamarts · 500M+ Total Records

Executed the legacy-to-cloud migration of a global manufacturer's enterprise sales reporting platform from Oracle on-premises to the Databricks platform, re-implementing five regional sales datamarts on a modern Delta Lake foundation. Converted a large body of Oracle PL/SQL procedures, packages, and Unix shell-scripted conditional logic into production-grade PySpark.

 Key Contributions
  • Converted Oracle PL/SQL procedures, packages, and functions across five regional sales datamarts into production-grade PySpark — eliminating procedural SQL constructs incompatible with Spark's distributed execution model while preserving full business logic.
  • Processed large-scale sales transaction data across rolling and historical windows — Booking (14.5M records), Backlog (12.5M records), and Costed Sales Details (526M+ records) — adapting PySpark execution plans for efficient distributed processing on Databricks.
  • Replaced Control-M job scheduling and Linux shell script conditional logic with Python-based orchestration inside Databricks — replicating runtime branching, dependency management, and environment-conditional execution without any legacy scheduler dependency.
  • Implemented a data consistency validation framework reconciling row counts and key metrics between the legacy on-prem Oracle platform outputs and new Databricks Delta Lake results — identifying and resolving discrepancies before production sign-off.
  • Delivered optimised PySpark pipelines through targeted code refactoring: eliminating redundant operations from the legacy Oracle on-prem process, consolidating similar transformation functions, and replacing custom UDFs with built-in Spark functions to meet the client's latency SLAs.
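The reconciliation framework above comes down to comparing per-table row counts (or summed key metrics) between legacy and migrated outputs. A minimal sketch, with hypothetical names and a zero tolerance by default:

```python
# Illustrative sketch of legacy-vs-migrated reconciliation. Table names,
# helper name, and the tolerance parameter are assumptions for this example.

def reconcile(legacy: dict, migrated: dict, tolerance: float = 0.0) -> list:
    """Compare per-table counts and report every discrepancy found."""
    issues = []
    for table, legacy_count in legacy.items():
        new_count = migrated.get(table)
        if new_count is None:
            issues.append((table, "missing in migrated output"))
        elif abs(new_count - legacy_count) > tolerance * max(legacy_count, 1):
            issues.append((table, f"legacy={legacy_count} migrated={new_count}"))
    return issues
```

Running a check like this per datamart before sign-off is what surfaces dropped rows or duplicate loads while they are still cheap to fix.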
Stack: Databricks · PySpark · Delta Lake · Oracle PL/SQL · Shell Scripting · SAP HANA · Denodo · AWS Redshift
04

Education

Grade B
2021
PG Diploma in Big Data Analytics
Centre for Development of Advanced Computing (CDAC)
Pune, Maharashtra · Sep 2021
9.2 CGPA
2016
Bachelor of Technology (B.Tech)
Centurion University of Technology and Management
Paralakhemundi, Odisha · Apr 2016
Get in Touch.

Open to Senior Data Engineer roles with scope on cloud-native data platforms, Azure Databricks, and large-scale pipeline architecture. Feel free to reach out.