When I started this project, I thought the hard part was going to be the machine learning — fitting an Isolation Forest model, tuning its contamination parameter, deciding what "anomaly" even means in the context of network telemetry logs. I was wrong. The ML piece ended up being almost trivial compared to what it actually takes to get data moving reliably at scale.
This post is about the infrastructure decisions that didn't show up in the Spark documentation, the things I figured out the hard way while building a real-time anomaly detection pipeline over a multi-node Hadoop cluster. We were processing somewhere north of 10 million events per hour by the time it was done. Here's what I learned.
The Setup: Why Spark + Kafka + Hadoop?
The pipeline ingests log data through Kafka topics, processes it in micro-batches using Spark Structured Streaming, stores raw and processed data on HDFS, and feeds anomaly scores back to a downstream alert system. On paper it's a clean, well-understood stack. In practice, each of those boundaries — Kafka to Spark, Spark to HDFS, HDFS back to Spark — is where things got interesting.
I chose this stack partly because it's what serious production pipelines actually use, and partly because I wanted to work through the complexity myself rather than offloading it to a managed service. If I'm going to claim I understand distributed streaming, I wanted to have wrestled with the parts that managed services quietly fix for you.
Problem #1: Backpressure Isn't Automatic
Spark Structured Streaming has a config called maxOffsetsPerTrigger that controls how many Kafka offsets each micro-batch will consume. Early on, I left this unconfigured and let Spark consume as fast as possible. That worked fine in development, where my test data was clean and small. It fell apart in a realistic environment.
What happens without proper rate limiting is that during bursts — log spikes, deployment events, whatever causes your volume to jump — Spark tries to process everything at once. Batch durations blow past your trigger interval, tasks pile up, and you end up with exactly the kind of latency problem a real-time pipeline is supposed to prevent.
```python
# What I ended up with after tuning
(
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", BROKERS)
        .option("subscribe", TOPIC)
        .option("maxOffsetsPerTrigger", 50000)  # tune this empirically
        .option("startingOffsets", "latest")
        .load()
)
```
The right value for maxOffsetsPerTrigger depends entirely on your cluster size, your transformation complexity, and how much lag you can tolerate. For us, 50,000 offsets per trigger with a 10-second micro-batch interval gave us the throughput we needed without batch durations creeping up over time. But honestly — this is empirical. You have to monitor it, watch your batch durations in the Spark UI, and adjust.
The Spark UI's Structured Streaming tab is criminally underused. Batch duration trends will tell you more about the health of your pipeline than almost anything else. If durations are monotonically increasing over time, something is wrong — either your rate limit is too loose or a transformation is getting more expensive as state accumulates.
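You can also catch that creep without staring at the UI: Spark exposes per-batch timing through the lastProgress dict on a running StreamingQuery. Here's a minimal sketch of a check that could run in a monitoring loop; the function name and threshold are my own, not something from the pipeline:

```python
# Hypothetical watchdog over StreamingQuery.lastProgress, the per-batch
# progress dict Spark exposes. durationMs/triggerExecution is the
# wall-clock time the whole micro-batch took.
def batch_overrun(progress, trigger_interval_ms=10_000):
    """Return True if the last micro-batch ran longer than its trigger interval."""
    if progress is None:  # no batch has completed yet
        return False
    took_ms = progress["durationMs"]["triggerExecution"]
    return took_ms > trigger_interval_ms
```

In practice this would sit in a loop alongside query.awaitTermination(), alerting when overruns persist across consecutive batches rather than on a single spike.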
Problem #2: Watermarking and the Late Data Trap
Watermarking in Spark Structured Streaming is how you handle late-arriving data — events that were generated at time T but don't show up in your pipeline until T + some delay. The watermark tells Spark how long to wait before finalizing a window.
The gotcha is subtle: your watermark needs to be long enough to absorb your real late-data distribution, but short enough that state doesn't grow unboundedly. In our case, log data from some nodes arrived with delays of up to 45 seconds due to buffering at the edge. I initially set a watermark of 30 seconds. This meant we were dropping a non-trivial fraction of events silently — no error, no warning, just gone from the window aggregations.
```python
from pyspark.sql.functions import window, count

# Watermark on event_time, not processing_time
(
    events
        .withWatermark("event_time", "90 seconds")
        .groupBy(
            window("event_time", "1 minute"),
            "host_id",
        )
        .agg(count("*").alias("event_count"))
)
```
The fix was bumping the watermark to 90 seconds and actually measuring our late-arrival distribution rather than guessing at it. Something I should have done from the start. If you're building anything time-windowed, instrument your pipeline to log the difference between event_time and processing_time early — it'll save you debugging hours later.
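Here's roughly what that instrumentation could look like. These helpers are hypothetical, not the pipeline's actual code: compute per-event lag from event_time / processing_time pairs, then derive a watermark suggestion from the observed distribution. In the stream itself the lags would come from something like a foreachBatch sink comparing event_time to the ingestion clock.

```python
# Hypothetical instrumentation helpers: set the watermark from the
# measured late-arrival distribution instead of guessing.

def lag_seconds(event_times, processing_times):
    """Per-event lateness: processing_time minus event_time, in seconds."""
    return [(p - e).total_seconds() for e, p in zip(event_times, processing_times)]

def watermark_suggestion(lags, coverage=0.99, cushion=1.5):
    """Smallest lag that covers `coverage` of events, padded by `cushion`.

    With a 45-second worst-case edge buffering delay, this lands well
    past the naive 30-second guess that silently dropped events.
    """
    ordered = sorted(lags)
    idx = min(int(coverage * len(ordered)), len(ordered) - 1)
    return ordered[idx] * cushion
```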
The MapReduce Shift: Where the 68% Came From
The initial feature engineering step — transforming raw log lines into feature vectors the Isolation Forest could consume — ran on Spark alone, as a single-stage transformation. For small batches this was fine. As the data volume grew and we started processing historical 50GB+ log dumps in addition to the live stream, that preprocessing stage became a genuine bottleneck.
The shift was to move the heavy feature engineering into custom MapReduce jobs running directly on HDFS, parallelized across the cluster, and then have Spark consume the pre-processed feature files rather than raw logs. The Map phase extracted and cleaned individual log fields. The Reduce phase aggregated per-host event statistics over configurable time windows.
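As a sketch of that split, here's what the two phases could look like as pure Python functions. Run under Hadoop Streaming, the mapper would loop over sys.stdin and print "key\tvalue" lines, and the reducer would consume the shuffle-sorted stream; the log-line layout ("<iso-timestamp> <host_id> <message>") and field names here are made up for illustration:

```python
def map_line(line):
    """Map phase: extract a (host, minute-bucket) key from one raw log line."""
    ts, host, _msg = line.split(" ", 2)
    minute = ts[:16]  # truncate the ISO timestamp to the minute, e.g. '2024-01-01T12:03'
    return f"{host}\t{minute}", 1

def reduce_counts(pairs):
    """Reduce phase: per-host event counts per window, later consumed by Spark."""
    totals = {}
    for key, n in pairs:
        totals[key] = totals.get(key, 0) + n
    return totals
```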
This felt counterintuitive at first — MapReduce is older technology, why would it outperform Spark here? The reason is that for this specific workload, the MapReduce jobs are embarrassingly parallel with minimal shuffle, while the Spark approach required more cross-node data movement. For batch preprocessing over large static datasets, the simpler execution model won. A 68% runtime reduction in preprocessing, measured across five runs, was the result.
I think there's a tendency to reach for Spark for everything once you're already in the Spark ecosystem. Sometimes the right tool for a specific subproblem is the older, simpler one. MapReduce for parallelizable batch ETL is still genuinely good at what it does.
Integrating the Isolation Forest
Spark MLlib doesn't actually ship an Isolation Forest of its own, but third-party implementations that follow the MLlib Estimator/Transformer API integrate cleanly with Structured Streaming: the model is trained offline on historical data and serialized, then loaded at pipeline startup and applied to each incoming micro-batch via transform() calls.
The main practical concern here isn't the algorithm — it's model staleness. An Isolation Forest trained on last month's traffic patterns will gradually degrade as normal behavior shifts. For now, we retrain on a weekly schedule using the last 7 days of HDFS logs. Whether that frequency is right is honestly still an open question for me. More on that below.
Open Questions I'm Still Thinking About
The biggest one is retraining cadence. A weekly retrain on the last 7 days of logs is a schedule, not a justified choice: I haven't measured how fast normal behavior actually drifts in our telemetry, and until I track how score distributions shift between retrains, weekly is a guess that hasn't burned us yet. The second is the watermark. We set 90 seconds based on one measurement of the late-arrival distribution, but edge buffering behavior can change, and a value measured once is a value that can quietly go stale.
What I'd Do Differently
Measure before you optimize. The watermark issue cost me more time than it should have because I trusted my intuition about late-arrival delays rather than instrumenting it. In distributed systems, your intuitions are frequently wrong and the data is almost always available if you look for it.
Don't fight the batch boundaries. Early on I tried to make micro-batches smaller and smaller in pursuit of lower latency, which just created instability. The sweet spot — for this workload, on this cluster — was 10-second trigger intervals. Aggressive micro-batching is not free.
Embrace MapReduce for what it's good at. I came into this project biased toward Spark for everything. That bias cost me preprocessing performance. The lesson is to understand the execution model of each tool and match it to the problem, not just reach for the newest thing.