Aneek Sarkar | Data Analyst & Future Data Scientist

My Story

The journey from observing data to architecting intelligence.

The Past: The Analyst

Asking the Right Questions

My journey began with a fundamental curiosity: Why do things happen? As a Data Analyst, I focused on extracting insights from raw datasets, creating reports, and identifying historical trends. However, I quickly realized that analyzing the past wasn't enough; true value lies in predicting the future and building systems that can handle scale without breaking. I needed to go deeper into engineering and statistical modeling.

The Present: The Transformation

Corporate Data Science Training

I am currently undergoing rigorous corporate training to transition into a Data Scientist role. This isn't just about learning Python syntax; it's about accountability. I am bridging my analytical mindset with robust engineering principles: learning to build fault-tolerant data ingestion pipelines, understanding the math behind machine learning models, and preparing to deploy solutions that drive real business decisions.

The Future: The Innovator

Full-Stack Intelligence

My goal is to be a "WOW" candidate and a reliable asset to my future team. I aim to be a Data Scientist who doesn't just hand off a Jupyter notebook but understands the end-to-end lifecycle, from messy, unstructured data ingestion to deploying a scalable, predictive model that solves a tangible business problem.

How I Work: The Engineering Mindset

Technical skills can be taught, but a reliable methodology is built through discipline. Here is how I approach problems, handle ambiguity, and ensure accountability.

Root Cause Over Quick Fixes

When an analysis yields unexpected results, I don't just patch the output. I trace the data back to its source. I believe in spending 80% of my time understanding the Why before writing a single line of code for the How.

> Data Profiling
> Edge-case identification
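
As an illustration of the data profiling step, here is a minimal sketch using only the Python standard library. The function name, the sample column, and the rule that blank strings count as missing are all assumptions made for this example, not code from any of the projects below.

```python
from collections import Counter

def profile_column(values):
    """Summarise one column: row count, missing values, distinct values, top value."""
    # Assumption: None and blank strings both count as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }

# Hypothetical sample column with two edge cases: a None and a blank string.
sample_regions = ["north", "south", None, "north", "", "east"]
profile = profile_column(sample_regions)
print(profile)
```

Running a summary like this on every column before any modeling is one cheap way to surface edge cases (unexpected blanks, a single dominant value) at the source rather than in the output.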

Navigating The Unknown

"I don't know" is my starting point, not an excuse. When faced with an unfamiliar algorithm, tool, or domain, my protocol is: Research documentation -> Build a micro-prototype -> Test to failure -> Ask targeted, informed questions.

> RTFM protocol
> Iterative prototyping

Absolute Accountability

If my pipeline breaks, it's my responsibility to fix it and ensure it never fails the same way twice. I build with idempotency, write defensive code, and document my architecture so that the team never inherits a black box.

> Defensive Programming
> Clear Documentation

Case Studies

Projects that demonstrate my transition from analyzing data to engineering reliable data systems.

The Scale Problem: Wikimedia Traffic Analysis

Python | DuckDB | OLAP

The Narrative:

As an analyst, I was accustomed to opening datasets in Pandas and running `.describe()`. But what happens when the dataset is 41 GB of raw compressed logs representing over 15 billion pageviews? Standard tools crashed. Memory overflowed.

I took it upon myself to learn data engineering concepts to solve this analytical bottleneck. Instead of relying on expensive cloud compute, I explored columnar storage and vectorized engines.

The Outcome & Accountability

I implemented DuckDB and a custom partitioning strategy. I capped DuckDB's memory limit so that the system wouldn't crash the host machine. The result? Queries that previously failed with out-of-memory (OOM) errors now executed in under 850 milliseconds. This project taught me that a good Data Scientist must first be a capable Data Engineer.

[Figure: Reliability analysis]

// Traditional vs Vectorized Execution

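# Traditional approach: load the entire file into memory
import pandas as pd
df = pd.read_csv('15B_logs.csv')  # raises MemoryError at this scale

# Vectorized approach: let DuckDB scan the partitioned Parquet files
import duckdb

conn = duckdb.connect()
conn.execute("SET memory_limit = '4GB'")  # cap memory to protect the host machine
result = conn.execute("""
  SELECT project, sum(views) AS total_views
  FROM read_parquet('partitioned_logs/**/*.parquet')
  GROUP BY project
  ORDER BY total_views DESC
""").fetchall()

# Execution Time: 0.84s | Peak Memory: Capped at 4GB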
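
To show why the in-memory approach fails while an aggregating engine succeeds, here is a deliberately simplified, standard-library sketch of the same group-by computed one row at a time. This is not DuckDB and not the project's code: it only illustrates that a per-project total needs memory proportional to the number of projects, not the number of log rows. The column names match the query above; the sample data is hypothetical.

```python
import csv
import io
from collections import Counter

def total_views_by_project(lines):
    """Sum views per project while streaming rows, never holding the full dataset."""
    totals = Counter()
    for row in csv.DictReader(lines):
        totals[row["project"]] += int(row["views"])
    # Mirror the descending sort by total views used in the query above.
    return totals.most_common()

# Hypothetical miniature log; a real run would pass an open file handle instead.
sample_log = io.StringIO(
    "project,views\n"
    "en.wikipedia,120\n"
    "de.wikipedia,40\n"
    "en.wikipedia,30\n"
    "fr.wikipedia,25\n"
)
ranked = total_views_by_project(sample_log)
print(ranked)
```

A columnar, vectorized engine goes further by reading only the two columns it needs and processing them in batches, which is where the sub-second timings come from.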

View Source Code on GitHub →

The Reliability Problem: Mutual Fund NAV Pipeline

Pandas / NumPy | Data Pipelines | Idempotency

The Narrative:

Machine learning models are useless if fed with corrupt, duplicate, or missing data. While training for Data Science, I recognized the need to build a system that guarantees data integrity before any analysis begins.

I engineered a high-throughput ingestion pipeline for Indian Mutual Fund NAV records. The challenge wasn't just fetching data; it was handling network drops and silent schema changes from upstream APIs without poisoning my local analytical store.

The Outcome & Accountability

I designed the pipeline to be strictly idempotent: it can be run 100 times without ever duplicating a historical record. I routed corrupted data to a dead-letter queue rather than crashing the script. It processed over 1,000,000 records safely. This showcases my commitment to building systems that teams can trust.

[Figures: Mutual fund 1; Mutual fund forecast]

// Defensive Ingestion Architecture

def process_nav_batch(batch_data):
  try:
    # 1. Enforce strict schema
    validated = schema_check(batch_data)

    # 2. Idempotent insert (Upsert)
    db.execute("""
      INSERT INTO nav_history
      VALUES (?, ?)
      ON CONFLICT (fund_id, date) DO UPDATE...
    """)
  except SchemaError as e:
    # Quarantine toxic data, keep running
    route_to_dead_letter(batch_data, e)
    alert_admin(e)
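
The snippet above is abbreviated, so here is a runnable sketch of the same pattern using Python's built-in `sqlite3` module. Several details are assumptions made for illustration: the three-column table, the validation rules in `schema_check`, the `SET nav = excluded.nav` clause (the original elides the update), and a plain list standing in for the dead-letter queue. The upsert syntax requires SQLite 3.24 or newer.

```python
import sqlite3

class SchemaError(ValueError):
    """Raised when a record does not match the expected shape."""

def schema_check(record):
    # Assumed schema: fund_id (non-empty str), date (ISO str), nav (positive number).
    fund_id, date, nav = record.get("fund_id"), record.get("date"), record.get("nav")
    if not isinstance(fund_id, str) or not fund_id:
        raise SchemaError(f"bad fund_id: {fund_id!r}")
    if not isinstance(date, str) or len(date) != 10:
        raise SchemaError(f"bad date: {date!r}")
    if isinstance(nav, bool) or not isinstance(nav, (int, float)) or nav <= 0:
        raise SchemaError(f"bad nav: {nav!r}")
    return fund_id, date, float(nav)

def process_nav_batch(db, batch, dead_letters):
    for record in batch:
        try:
            row = schema_check(record)
        except SchemaError as exc:
            # Quarantine the bad record and keep processing the rest.
            dead_letters.append((record, str(exc)))
            continue
        # Upsert: the primary key makes a repeated insert an update, not a duplicate.
        db.execute(
            """
            INSERT INTO nav_history (fund_id, date, nav) VALUES (?, ?, ?)
            ON CONFLICT (fund_id, date) DO UPDATE SET nav = excluded.nav
            """,
            row,
        )
    db.commit()

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE nav_history ("
    "fund_id TEXT, date TEXT, nav REAL, PRIMARY KEY (fund_id, date))"
)
batch = [
    {"fund_id": "F001", "date": "2024-01-02", "nav": 101.5},
    {"fund_id": "F001", "date": "2024-01-03", "nav": 102.0},
    {"fund_id": "F002", "date": "2024-01-02", "nav": None},  # malformed on purpose
]
dead_letters = []
for _ in range(3):  # re-running the same batch must not duplicate rows
    process_nav_batch(db, batch, dead_letters)
row_count = db.execute("SELECT COUNT(*) FROM nav_history").fetchone()[0]
print(row_count, len(dead_letters))
```

Validating per record rather than per batch means one malformed row is quarantined without discarding the valid rows that arrived alongside it.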

View Source Code on GitHub →

The Intelligence Problem: Agentic AI for BI

Python | LLMs / LangChain | SQL Agents

The Narrative:

Business stakeholders often wait days for analysts to write complex SQL queries for ad-hoc questions. I wanted to democratize data access without compromising security or accuracy.

I built an autonomous AI agent capable of translating natural business language into executable, optimized SQL against a data warehouse. It doesn't just write the code; it executes it, analyzes the result, and generates a business summary.

The Outcome & Accountability

AI hallucinations in Business Intelligence are unacceptable. I implemented strict guardrails: the agent operates on a read-only database role and uses a "human-in-the-loop" fallback when confidence is low. Instead of guessing, it asks for clarification. This pipeline reduces ad-hoc reporting time from days to seconds, and because the agent has no write access, it cannot alter the underlying data.

[Figure: Multiple Agentic BI]

// Agent Interaction Log

User: "What were Q3 sales by region?"

> Thinking... Identifying intent.

> Generating SQL...

SELECT region, SUM(revenue)
FROM sales
WHERE quarter = 'Q3'
GROUP BY region;

> Execution Success. Generating Summary...
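
The two guardrails described above (read-only access and a clarification fallback when confidence is low) can be sketched without any LLM at all. The example below is an assumption-heavy illustration, not the agent's real code: the threshold value, the function names, and the way a confidence score is supplied are all hypothetical, and SQLite's `PRAGMA query_only` stands in for a read-only warehouse role.

```python
import sqlite3

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off, not taken from the project

def is_single_select(sql):
    """Accept exactly one statement that starts with SELECT."""
    body = sql.strip().rstrip(";").strip()
    # Conservative: any remaining semicolon (even inside a string literal) is rejected.
    return body.lower().startswith("select") and ";" not in body

def run_guarded(conn, sql, confidence):
    if confidence < CONFIDENCE_THRESHOLD:
        # Human-in-the-loop fallback: ask instead of guessing.
        return {"status": "needs_clarification", "rows": None}
    if not is_single_select(sql):
        return {"status": "rejected", "rows": None}
    return {"status": "ok", "rows": conn.execute(sql).fetchall()}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Q3", 100.0), ("East", "Q3", 50.0), ("West", "Q3", 70.0), ("West", "Q2", 10.0)],
)
conn.commit()
conn.execute("PRAGMA query_only = ON")  # second line of defence: connection refuses writes

query = (
    "SELECT region, SUM(revenue) FROM sales "
    "WHERE quarter = 'Q3' GROUP BY region ORDER BY region;"
)
answer = run_guarded(conn, query, confidence=0.93)
unsure = run_guarded(conn, query, confidence=0.41)
blocked = run_guarded(conn, "DELETE FROM sales", confidence=0.99)
print(answer, unsure["status"], blocked["status"])
```

Layering a statement check on top of a read-only connection means a mistake in one guard does not by itself expose the data to modification.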

View Source Code on GitHub →

Education & Professional Development

Corporate Training: Data Science

Present

Currently undergoing intensive corporate training designed to elevate analytical skills to full-stack Data Science capabilities. Focus areas include statistical modeling, machine learning workflows, advanced Python engineering, and business-centric problem solving.

University Education / Degree

Completed

Foundational academic background that instilled critical thinking, mathematical reasoning, and the ability to learn complex concepts rapidly.

Bridging the gap between analysis and intelligence.
