Discovering and solving issues with CPU-bound work in our distributed Node.js system


Why our mistakes matter to you

Node.js is extremely popular and is an amazing tool for building APIs and entire backends. It has a low barrier to entry and allows anyone familiar with JavaScript to get a server running quickly. Still, there are many important nuances to how it works. I hope that by sharing our story, you can learn more about how Node.js works and how understanding it can impact how you solve real problems.

Overview of our architecture

At SMRxT, we have a distributed system where each worker runs Node.js in a separate container. The API and workers delegate work to each other via a RabbitMQ queueing system. The queues are configured for at-least-once delivery with automatic retries and dead-letter queues. The whole setup is orchestrated with Kubernetes.
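To make the shape of a worker concrete, here is a minimal sketch of what one of these consumers can look like with the amqplib client. The names (QUEUE_NAME, processPatient, and so on) are ours for illustration, not from our actual codebase, and the handler is factored out so it can be exercised without a live broker:

```javascript
const QUEUE_NAME = 'analytics.patients'; // hypothetical queue name

// Process one message, then ack on success or nack on failure.
// A nack with requeue=false routes the message to the dead-letter queue.
async function handleMessage(channel, msg, processPatient) {
  try {
    const { patientId } = JSON.parse(msg.content.toString());
    await processPatient(patientId);
    channel.ack(msg);
  } catch (err) {
    channel.nack(msg, false, false);
  }
}

// Wiring it up against a real broker (sketch only; requires RabbitMQ):
async function startWorker(amqplib, url, processPatient) {
  const conn = await amqplib.connect(url);
  const channel = await conn.createChannel();
  await channel.assertQueue(QUEUE_NAME, { durable: true });
  channel.prefetch(1); // handle one message at a time per worker
  await channel.consume(QUEUE_NAME, (msg) =>
    handleMessage(channel, msg, processPatient)
  );
}
```

The prefetch of 1 keeps each worker focused on a single patient, which matters later: one long-running message is enough to tie up the whole process.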

Discovering the problem

Our problem began when we introduced a new feature that ran complex analytics on historical data. When a user sets up a new analytic, we publish a message for every patient to which it applies. Processor workers then consume those messages, process them, and save the results. Because these analytics apply to the entire history of the patient, the processor must fetch all of the data at the beginning and then just crunch through it. The nature of the analytics is such that processing requires jumping around in the data, so this optimization is important for keeping the overall time low.

We deployed this new feature to our staging environment, which has mocked data at the same scale as production. After setting up some realistic analytics, we sat back and watched our shiny new workers churn through our queue and push out the long-awaited insights. While the majority of patients were being processed flawlessly, our workers were choking and dying on a select few. When we put those patients back onto the queue, they failed again: the workers would restart, and the work items would land back on the queue.


A quick log tail revealed that the workers actually were able to complete the work and were just failing to acknowledge the message in the queue. When they tried to ack, they would get a connection terminated error. The RabbitMQ logs thankfully explained further: the connection timed out because the broker was not receiving heartbeats. RabbitMQ detects stale, dead connections by expecting some kind of message or heartbeat at least once per minute. If it hears nothing within that interval, it terminates the connection and reclaims its resources.

So then the question was, “what about these particular patients makes the heartbeat fail?” Our failing patients all had long histories and thus had a lot more processing to do. Because we fetched and cached all the data up-front (to solve another potential problem of cascading data fetches), we were just doing CPU-intensive work for long periods of time.

At this point, it’s important to remember that Node.js is single-threaded and built around an event loop. Even though it’s single-threaded, Node can be very effective at handling tons of requests concurrently because much of web service work is waiting on I/O (which is handled outside of the event loop). The distinction between concurrent and parallel is important: Node doesn’t do two things at the same time (parallel) but rather switches between tasks (concurrent) while it waits for I/O to finish.

In our workers, the single thread was getting tied up doing calculations and was unable to handle anything else, including sending heartbeats. Our RabbitMQ client library sent heartbeats using a setInterval set to fire every 30 seconds. The heartbeat callback would eventually get called; it would just be too late. When our application is chugging through tons of data, the event loop is essentially held up in a single phase until the calculation code takes a breath (usually by waiting on some kind of I/O). Our use case is unusual compared to the majority of web-server work. A typical workload would be something like reading an HTML template file or fetching data and then sending that data back over a connection; for a given request, the majority of the time is spent waiting on the I/O of reading a file or making a related data request. Our CPU-intensive processing, by contrast, wasn’t playing well with the platform.
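The starvation is easy to reproduce in isolation. The following is a minimal, self-contained demonstration (not our worker code): a timer standing in for the heartbeat is scheduled every 30ms, but a synchronous loop holds the single thread so the callback never gets a chance to run.

```javascript
// Burn CPU synchronously for roughly `ms` milliseconds without ever
// yielding back to the event loop.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // simulated CPU-bound calculation
  }
}

let heartbeats = 0;
const timer = setInterval(() => { heartbeats += 1; }, 30);

blockFor(200); // several 30ms intervals elapse during this call...

// ...but no callback has been able to run: we are still in the same turn
// of the event loop, so the counter has not moved.
console.log(heartbeats); // 0
clearInterval(timer);
```

Swap the 30ms for 30 seconds and the 200ms for minutes of analytics, and this is exactly the situation our workers were in.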


There are a few basic ways to solve our problem: break up the calculations to allow the event loop to progress through all of its phases, break out of the event loop by using worker threads, or use a technology other than Node. We could also have split the patient history into more segments and spread those across workers. However, we didn’t want to fetch the full patient data multiple times in different workers because that would just increase the burden on our database, and the processing of a single patient doesn’t chunk nicely along data lines.

We decided to try the first option since it was the most straightforward, could be migrated to option two if necessary, would not require a large rework, and didn’t introduce any new paradigms to our codebase. If our processing had been happening inside our API web server, this would not have been a good solution because we would still be starving other requests of resources. Since we were already in our own worker process, this seemed acceptable.

Our solution was to await a promise that resolves with setImmediate prior to each step in the calculation. The result was a utility function with a call like this in it:

await new Promise((resolve) => {
  setImmediate(resolve);
});

setImmediate callbacks are handled in their own phase of the event loop, so this guarantees that the loop completes the other phases before we do more calculations. The next calculation step then gets scheduled with another setImmediate and executes after another full run through the event loop. This quickly allowed us to “let it breathe.”
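To show the utility in context, here is a sketch of a processing loop that yields before each step. The names (letItBreathe, runCalculation, steps) are ours for illustration; each step stands in for one synchronous chunk of the real analytics work:

```javascript
// Resolve on the next setImmediate, letting the event loop complete a
// full turn (timers, I/O callbacks, and AMQP heartbeats all get to run).
const letItBreathe = () => new Promise((resolve) => setImmediate(resolve));

async function runCalculation(steps) {
  const results = [];
  for (const step of steps) {
    await letItBreathe(); // yield before each CPU-bound step
    results.push(step()); // each step is synchronous, CPU-bound work
  }
  return results;
}
```

The only requirement is that each individual step finishes well within the heartbeat interval; the yields between steps take care of the rest.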

We deployed this quick modification to our staging environment. Our fix properly allowed the worker to send the AMQP heartbeats while continuing to process through the patient data.

Final Thoughts

When we started, we had code that was technically correct but practically ineffective. We had passing unit tests but failing workers. As engineers, we have to think about the whole system and account for how each of its layers functions. In this instance, we were affected by RabbitMQ’s heartbeat policy and Node’s single-threaded nature. There are several ways to address this problem, but what worked for us was using setImmediate to let the event loop make it through its phases before continuing with the CPU-intensive work. While this isn’t a solution that fits every situation, hopefully the discussion is useful for understanding Node.js and avoiding this problem in other applications.

Other Resources

By the way, we are looking for highly skilled software engineers to join our team. Check out our job listing to learn more!