
Building a Streaming Pipeline: Lessons from 10M Events/Hour

What nobody tells you about Spark Structured Streaming in production — backpressure, watermarking gotchas, and how a shift to custom MapReduce workflows cut our preprocessing time by 68%.

When I started this project, I thought the hard part was going to be the machine learning — fitting an Isolation Forest model, tuning its contamination parameter, deciding what "anomaly" even means in the context of network telemetry logs. I was wrong. The ML piece ended up being almost trivial compared to what it actually takes to get data moving reliably at scale.

This post is about the infrastructure decisions that didn't show up in the Spark documentation, the things I figured out the hard way while building a real-time anomaly detection pipeline over a multi-node Hadoop cluster. We were processing somewhere north of 10 million events per hour by the time it was done. Here's what I learned.

10M+ events per hour · 68% preprocessing runtime reduction · <500ms anomaly detection latency

The Setup: Why Spark + Kafka + Hadoop?

The pipeline ingests log data through Kafka topics, processes it in micro-batches using Spark Structured Streaming, stores raw and processed data on HDFS, and feeds anomaly scores back to a downstream alert system. On paper it's a clean, well-understood stack. In practice, each of those boundaries — Kafka to Spark, Spark to HDFS, HDFS back to Spark — is where things got interesting.

I chose this stack partly because it's what serious production pipelines actually use, and partly because I wanted to work through the complexity myself rather than offloading it to a managed service. If I'm going to claim I understand distributed streaming, I wanted to have wrestled with the parts that managed services quietly fix for you.

Problem #1: Backpressure Isn't Automatic

Spark Structured Streaming's Kafka source takes an option called maxOffsetsPerTrigger that caps how many Kafka offsets each micro-batch will consume. Early on, I left this unconfigured and let Spark consume as fast as possible. That worked fine in development, where my test data was clean and small. It fell apart in a realistic environment.

What happens without proper rate limiting is that during bursts — log spikes, deployment events, whatever causes your volume to jump — Spark tries to process everything at once. Batch durations blow past your trigger interval, tasks pile up, and you end up with exactly the kind of latency problem a real-time pipeline is supposed to prevent.

# What I ended up with after tuning.
# maxOffsetsPerTrigger is the knob to tune per cluster — a trailing
# comment after a line continuation is a syntax error, so it lives here.
spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", BROKERS) \
    .option("subscribe", TOPIC) \
    .option("maxOffsetsPerTrigger", 50000) \
    .option("startingOffsets", "latest") \
    .load()

The right value for maxOffsetsPerTrigger depends entirely on your cluster size, your transformation complexity, and how much lag you can tolerate. For us, 50,000 offsets per trigger with a 10-second micro-batch interval gave us the throughput we needed without batch durations creeping up over time. But honestly — this is empirical. You have to monitor it, watch your batch durations in the Spark UI, and adjust.
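The headroom math behind that choice is worth writing down explicitly. A back-of-the-envelope sketch using the numbers above (the variable names are mine, not Spark settings):

```python
# Capacity check for a maxOffsetsPerTrigger choice.
# Numbers from this post; adjust for your own cluster and load.

EVENTS_PER_HOUR = 10_000_000        # observed peak ingest rate
MAX_OFFSETS_PER_TRIGGER = 50_000    # Kafka records consumed per micro-batch
TRIGGER_INTERVAL_S = 10             # micro-batch trigger interval

arrival_rate = EVENTS_PER_HOUR / 3600                             # ~2778 events/s
max_consume_rate = MAX_OFFSETS_PER_TRIGGER / TRIGGER_INTERVAL_S   # 5000 events/s
headroom = max_consume_rate / arrival_rate                        # 1.8x

print(f"arrival: {arrival_rate:.0f}/s, "
      f"ceiling: {max_consume_rate:.0f}/s, "
      f"headroom: {headroom:.1f}x")
```

If the ceiling is below the arrival rate, consumer lag grows without bound; a ceiling comfortably above it (here roughly 1.8x) leaves room for bursts while still bounding how much work any single micro-batch can take on.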

The Spark UI's "Streaming" tab is criminally underused. Batch duration trends will tell you more about the health of your pipeline than almost anything else. If durations are monotonically increasing over time, something is wrong — either your rate limit is too loose or a transformation is getting more expensive as state accumulates.
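To make the trend check concrete, here's a toy sketch of the kind of comparison I mean. In practice the durations would come from the Spark UI or streaming query progress events; the window size, the factor, and the sample series here are illustrative, not tuned values:

```python
# Illustrative check for creeping batch durations. Real numbers would come
# from the Spark UI's Streaming tab or query progress metrics; these
# hard-coded samples just show the shape of the check.

def durations_creeping(durations_ms, window=5, factor=1.25):
    """Flag when the recent average batch duration exceeds the earlier
    average by `factor` — a crude proxy for an unhealthy trend."""
    if len(durations_ms) < 2 * window:
        return False
    early = sum(durations_ms[:window]) / window
    recent = sum(durations_ms[-window:]) / window
    return recent > factor * early

healthy = [8100, 7900, 8300, 8000, 8200, 8100, 7950, 8050, 8150, 8000]
creeping = [8000, 8200, 8500, 9100, 9800, 10600, 11500, 12400, 13500, 14700]

print(durations_creeping(healthy))   # False — durations stay flat
print(durations_creeping(creeping))  # True  — durations trend upward
```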

Problem #2: Watermarking and the Late Data Trap

Watermarking in Spark Structured Streaming is how you handle late-arriving data — events that were generated at time T but don't show up in your pipeline until T + some delay. The watermark tells Spark how long to wait before finalizing a window.

The gotcha is subtle: your watermark needs to be long enough to absorb your real late-data distribution, but short enough that state doesn't grow unboundedly. In our case, log data from some nodes arrived with delays of up to 45 seconds due to buffering at the edge. I initially set a watermark of 30 seconds. This meant we were dropping a non-trivial fraction of events silently — no error, no warning, just gone from the window aggregations.

# Watermark on event_time, not processing_time
events \
    .withWatermark("event_time", "90 seconds") \
    .groupBy(
        window("event_time", "1 minute"),
        "host_id"
    ) \
    .agg(count("*").alias("event_count"))

The fix was bumping the watermark to 90 seconds, and actually measuring our late-arrival distribution rather than guessing at it, which is something I should have done from the start. If you're building anything time-windowed, instrument your pipeline to log the difference between event_time and processing_time early; it'll save you debugging hours later.
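As a sketch of what that instrumentation can feed, here's a toy heuristic for picking a watermark from a measured lag distribution. The percentile-plus-safety-factor framing is my own illustration, and the sample lags are synthetic:

```python
# Toy heuristic: derive a watermark from observed late-arrival lags.
# The percentile and safety factor are illustrative choices, not rules.

def suggest_watermark(lags_s, pct=0.99, safety=1.5):
    """Take a high percentile of observed lag (processing_time minus
    event_time, in seconds) and pad it with a safety factor."""
    s = sorted(lags_s)
    idx = min(len(s) - 1, int(pct * len(s)))
    return s[idx] * safety

# Synthetic sample of per-event lags; in the real pipeline these would
# be logged per micro-batch.
lags_s = [0.4, 1.1, 2.5, 3.0, 5.2, 8.7, 12.0, 19.5, 31.0, 44.0]
print(f"suggested watermark: {suggest_watermark(lags_s):.0f}s")
```

With a worst observed lag of 44 seconds, this heuristic lands in the same neighborhood as the 90-second watermark above; the point is that the number comes out of measurement, not intuition.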

The MapReduce Shift: Where the 68% Came From

The initial feature engineering step — transforming raw log lines into feature vectors the Isolation Forest could consume — was running on Spark alone, as a single-stage transformation. For small batches this was fine. As the data volume grew and we started processing historical 50GB+ log dumps in addition to the live stream, preprocessing time in that single Spark stage became a genuine bottleneck.

The shift was to move the heavy feature engineering into custom MapReduce jobs running directly on HDFS, parallelized across the cluster, and then have Spark consume the pre-processed feature files rather than raw logs. The Map phase extracted and cleaned individual log fields. The Reduce phase aggregated per-host event statistics over configurable time windows.
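To show the shape of the two phases, here's a toy, in-memory sketch. The real jobs ran as Hadoop MapReduce over HDFS; the log format and field names here are invented for illustration:

```python
# Toy, in-memory sketch of the two phases. The real jobs ran as Hadoop
# MapReduce over HDFS; this log format ("ts host level message") is made up.

from collections import defaultdict

def map_phase(line):
    """Extract and clean fields from one raw log line, emitting
    ((host, minute_bucket), 1) pairs."""
    ts, host, _level, _msg = line.split(" ", 3)
    minute = int(ts) // 60 * 60            # 1-minute time window
    yield (host, minute), 1

def reduce_phase(pairs):
    """Aggregate per-host event counts per time window."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = [
    "120 hostA INFO started",
    "130 hostA WARN retry",
    "150 hostB INFO started",
    "185 hostA INFO done",
]
pairs = [kv for line in lines for kv in map_phase(line)]
features = reduce_phase(pairs)
# {('hostA', 120): 2, ('hostB', 120): 1, ('hostA', 180): 1}
```

The shuffle-friendliness is visible even in the toy: each map output is keyed independently, so the only cross-node movement is grouping identical (host, window) keys.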

This felt counterintuitive at first — MapReduce is older technology, why would it outperform Spark here? The reason is that for this specific workload, the MapReduce jobs are embarrassingly parallel with minimal shuffle, while the Spark approach required more cross-node data movement. For batch preprocessing over large static datasets, the simpler execution model won. A 68% runtime reduction in preprocessing, measured across five runs, was the result.

I think there's a tendency to reach for Spark for everything once you're already in the Spark ecosystem. Sometimes the right tool for a specific subproblem is the older, simpler one. MapReduce for parallelizable batch ETL is still genuinely good at what it does.

Integrating the Isolation Forest

Isolation Forest isn't part of Spark MLlib itself, but Spark-compatible implementations exist (LinkedIn maintains one, for example) and integrate cleanly with Structured Streaming via a trained model applied in transform() calls. The model is trained offline on historical data and serialized, then loaded at pipeline startup. Each incoming micro-batch gets scored against it.

The main practical concern here isn't the algorithm — it's model staleness. An Isolation Forest trained on last month's traffic patterns will gradually degrade as normal behavior shifts. For now, we retrain on a weekly schedule using the last 7 days of HDFS logs. Whether that frequency is right is honestly still an open question for me. More on that below.

Open Questions I'm Still Thinking About

// Things I haven't resolved

01. Model retraining cadence. Weekly retraining feels arbitrary. The right answer probably involves drift detection — automatically triggering retraining when the anomaly score distribution shifts beyond some threshold. I haven't implemented this yet, but it feels like the important next step.
02. Exactly-once semantics. We're using at-least-once delivery between Kafka and Spark. For anomaly detection, duplicate events are mostly harmless — a duplicate alert is annoying, not catastrophic. But if this pipeline were feeding a billing or audit system, I'd need to take exactly-once far more seriously. Spark supports it via checkpointing + idempotent sinks, but I haven't stress-tested that path.
03. Is Isolation Forest the right model for this? It's fast and doesn't require labeled data, which was critical here. But it has known weaknesses on high-dimensional sparse data. I'm curious whether a streaming-native algorithm like RRCF (Robust Random Cut Forest) would outperform it in practice. Amazon actually open-sourced an implementation I want to benchmark against.
04. Observability. We have basic metrics via Spark's built-in listeners, but I don't have a great answer for tracing individual events through the full pipeline. When an anomaly is detected, I can't easily reconstruct exactly which raw log lines contributed to that detection. That gap would matter a lot in a forensic context.
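For item 01, the drift trigger I have in mind could look roughly like this: compare the live anomaly-score distribution against the one seen at training time, and retrain when they diverge. The PSI-style binning and the 0.2 threshold below are textbook rules of thumb, not production-tested values:

```python
# Sketch of drift-triggered retraining (item 01). The binning scheme and
# the 0.2 threshold are illustrative rules of thumb, not tuned values.

import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two score samples.
    Rough rule of thumb: PSI > 0.2 suggests meaningful drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0
    def frac(sample, b):
        left = lo + b * width
        right = left + width if b < bins - 1 else hi + 1e-9
        n = sum(left <= x < right for x in sample)
        return max(n / len(sample), 1e-6)   # avoid log(0) on empty bins
    return sum(
        (frac(actual, b) - frac(expected, b))
        * math.log(frac(actual, b) / frac(expected, b))
        for b in range(bins)
    )

baseline = [0.1 * i for i in range(100)]        # training-time scores
shifted  = [0.1 * i + 3.0 for i in range(100)]  # drifted live scores
print(psi(baseline, baseline) < 0.2)  # True — no drift against itself
print(psi(baseline, shifted) > 0.2)   # True — shifted distribution flags
```

In the pipeline this would run on a schedule against recent micro-batch scores, with a PSI breach kicking off the retraining job instead of the fixed weekly cron.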

What I'd Do Differently

Measure before you optimize. The watermark issue cost me more time than it should have because I trusted my intuition about late-arrival delays rather than instrumenting it. In distributed systems, your intuitions are frequently wrong and the data is almost always available if you look for it.

Don't fight the batch boundaries. Early on I tried to make micro-batches smaller and smaller in pursuit of lower latency, which just created instability. The sweet spot — for this workload, on this cluster — was 10-second trigger intervals. Aggressive micro-batching is not free.

Embrace MapReduce for what it's good at. I came into this project biased toward Spark for everything. That bias cost me preprocessing performance. The lesson is to understand the execution model of each tool and match it to the problem, not just reach for the newest thing.

Motheo Treasure Puso
MS Computer Science candidate at UT Arlington, focused on distributed systems and cloud engineering. Previously @ Altair Engineering and Michigan State University. Find me on GitHub or LinkedIn.

// References & Further Reading

[1] Armbrust, M. et al. (2018). Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD '18.
[2] Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). Isolation Forest. Eighth IEEE International Conference on Data Mining (ICDM).
[3] Apache Spark Documentation — Structured Streaming Programming Guide. spark.apache.org
[4] Guha, S. et al. (2016). Robust Random Cut Forest Based Anomaly Detection on Streams. ICML 2016. The algorithm I'm eyeing as a potential replacement for Isolation Forest in streaming contexts.
[5] Kreps, J., Narkhede, N., & Rao, J. (2011). Kafka: a Distributed Messaging System for Log Processing. NetDB Workshop.
[6] Dean, J. & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.