Recently, I've been doing some experiments on rewriting our Nodejs service in a more maintainable style. I'd like to write several posts to share my findings. Here are the main pain points of our current code base:
The magic req, res and context objects. They are everywhere and you have to pass them everywhere, but you can't know exactly how they are used without reading every line of the project.
Error handling is like crossing a big maze. You have to pass the error object from its origin through maybe 5-10 stack frames before it reaches the handling logic.
Heavy reliance on callbacks, callbacks of callbacks, or even deeper nesting to control the program's execution flow, which not only creates callback hell but also extremely tight coupling between function calls.
Difficulty in testing. Because we rely on closures, i.e. anonymous functions, for async callbacks, we have a lot of anonymous functions that are not individually testable.
In Part 1, I'd like to talk about how to solve the first pain point by using Nodejs' Domain module.
Hello Domain, goodbye req
Because of Nodejs' async nature, you cannot store the current request's context in some centralized place and then use accessor methods to reach the current working context, like the thread-local contexts commonly used in Java web frameworks. Instead, you have to explicitly pass these context objects into every method you call, so that the context can be captured in closures and remain accessible when the callback is invoked.
If you keep writing code carefully, or you only have 1-2 contributors, it's probably not a big problem. However, when the code is maintained by several people, it starts rotting. req soon becomes a magic global garbage bin. You can "easily" put something into it in one function and access it from another, creating a magic tight coupling. You can "easily" read and modify req's parameters anywhere you wish. You can "easily" use res.write to send a partial response from the middle of anywhere.
In the end you will have a code base where no one knows exactly how the context is used or what is in it.
If you are writing a library that might be included in other modules, then continuation-local-storage is recommended. If, like me, you are implementing a web service, I would recommend you just use Domain. Domain is natively supported in the Nodejs core, while continuation-local-storage is a userland implementation. Another major difference is error handling, which is out of the scope of this article.
Domain in action
Domain was introduced in Nodejs v0.8 and mainly provides two things:
A sandbox for exception handling and isolation for an async process chain
An object that is bound to an async process chain
We will cover #1 in a later post, now we focus on #2.
When you run your code inside a domain, your code will have access to the single domain object even in an async callback! It magically works for all of Nodejs' native async facilities, such as net, fs, timers, etc.
Let’s see an example.
The above code will print "hello" in the console. As you can see, the callback function doesn't need anything passed in to access the current context, even though it runs in an async callback.
Integrate with express
To integrate this into the express framework, you can create a middleware that creates a domain for each request and runs the request handler inside it, thus providing an "async local variable".
Here is a demo. In it, we use the context to keep track of a requestId, an id that uniquely identifies a request, which is a common pattern in web service implementations. The server then performs a simple async HTTP GET request.
When the server starts, you can run the test to see that it works.
With Domain, you can centralize your request state in an easily accessible context, which not only makes your code cleaner but is also extremely useful for implementing functionality such as transaction management.
Last week we tried to reproduce an incident in our production system that was caused by an elevated HTTPS connection creation rate. The service was flooded by more than 2000 concurrent connections per server at peak. We tried to use apache benchmark (ab) to reproduce the issue in a local environment. However, we found that although ab lets you specify a concurrency level with the -c flag, it cannot actually reach that level of concurrency, because ab hits the server's QPS bottleneck first and stops generating more concurrent connections.
What we want is a load testing client that can sustain a limited QPS at a high concurrency level.
I know there are several existing tools that can tweak QPS/delay and concurrency, but I had bad experiences with such tools before, for either performance or stability reasons, and I didn't want to waste time learning and trying them one by one, which would probably have taken me 2-3 hours to find a suitable one. Since the requirement is really simple, I decided to build my own solution.
I've never written such a tool before, so I basically had no preference for which platform to implement it on. I figured Go should be good at modeling concurrency, so I decided to give it a try.
I found goroutines a very natural way to model concurrent connections. So here is how to generate concurrent connections. (Please note that the code is not runnable as-is, since I extracted it from the whole program. I'll provide a link to the full source code at the end of this post.)
Here concurrency and req are input parameters given to the program on the command line.
However, these goroutines will generate requests as fast as possible, so we need a way to limit the QPS. I used a buffered channel as a blocking queue of tokens for sending requests: every client goroutine must take a token from the queue before sending a request, and a separate goroutine constantly refills the token queue to allow more requests to be generated. Here is the code:
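The snippet is embedded in the original post; here is a reconstructed sketch (names such as `tokens` and the hard-coded `qps` are mine):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	qps := 5
	// Buffered channel as a blocking token queue, at most qps tokens deep.
	tokens := make(chan struct{}, qps)

	// Token generator: try to add a token every 1/qps seconds.
	go func() {
		ticker := time.NewTicker(time.Second / time.Duration(qps))
		defer ticker.Stop()
		for range ticker.C {
			select {
			case tokens <- struct{}{}: // room in the queue: add a token
			default: // queue is full: drop it instead of blocking
			}
		}
	}()

	// Each client goroutine must take a token before sending a request,
	// which caps the aggregate request rate at roughly qps.
	for i := 0; i < 3; i++ {
		<-tokens
		fmt.Println("request", i, "allowed")
	}
}
```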
Here I used the select..default construct so that adding a token is cancelled when the queue is full, which prevents the token generator from producing more tokens than wanted under a race condition.
That’s it! The basic functionality is done!
Working around the TIME_WAIT connection problem
Because we want to test the scenario where clients keep creating new connections to the server, we explicitly disabled keep-alive on the underlying connections. After running the program for a while, we found that the client used up the local port range reserved for outgoing connections, which is roughly 28k ports. As a result, on the client host you observe a lot of TCP connections in the TIME_WAIT state. This is expected TCP behavior. I used the following tricks to work around the problem:
Tweak the OS configuration on the client host. We run the client on a Linux server; you can follow this article to set net.ipv4.tcp_tw_reuse to 1.
Change the Dial of http.Transport to use TCPConn.SetLinger(0). Tweaking the network configuration is not enough; we also have to make the client's TCP connections set SO_LINGER to 0 when they are created. See this article for a detailed explanation.
After all these tweaks, the performance of the final tool is really good: I can generate 200 concurrent connections with a 400 max QPS using a single 2.4 GHz Xeon core.
Another beautiful feature of Go is that you can develop on your Mac and directly cross-compile a working Linux binary on your dev machine, then deploy by copying a single file. Sweet!
Go is really good at modeling concurrency. I can't imagine how much code I would need to implement this tool in another language. In fact, I don't like Go for its lack of generics and its duck-typing-only type system. There is a lot of criticism of Go, some of which is quite reasonable. However, despite the fact that it is not perfect, it is really sharp at modeling concurrency. As developers, we should open our minds by learning these different views of the world so that we can model our problems better with more suitable tools.
TCP connection tweaking is interesting. I've seen the TIME_WAIT problem a lot in production service development. You need to understand TCP states pretty well to implement a decently performing service.
When writing unit tests, the ugliest, most painful thing you might have to do is expose some internal implementation details of your module to the testing code so that:
It is easier to enumerate all possible conditional branches to cover edge cases (although with a complex design you can always achieve this by mocking things, it is sometimes overkill);
It is possible to inspect side effects of your code (accept the dark side of reality: you cannot totally avoid side effects).
Sometimes you might even find yourself wasting a lot of precious time on designs that avoid writing such smelly code. But hey, it's testing code! Shouldn't we have fun and enjoy writing it with all the power and flexibility we ought to have, rather than struggling in these design holes? Why should we have to break core code for the sake of testing code?
Recently I found an excellent library, rewire, that neatly solves this problem. Moreover, the idea behind it is so creative that I'd like to share it in this post.
Nothing is more valuable than an example, so let's make a really simple one. The code logic might not be very realistic; it is just made to show the problems and how rewire solves them. You can find the whole example here.
Given the following scenario: you are implementing a log appender, which writes aggregated log messages to some database in batches. The database has throttling control, so you cannot write to it more than once per second. Here is something you might write.
Now you might want to write some test cases to test your code. The three basic cases might be:
The appender shouldn’t flush until batch size is reached.
The appender shouldn’t flush twice within one second.
The appender should flush when the batch size is reached and more than 1 second has elapsed since the last flush.
Here come the problems:
How do I know the batch size? It's not exposed anywhere, and it's probably private information that needs to stay hidden in the module.
How do I know if the flushToDB method is called?
How do I change to a mock db for unit test?
How do I easily control the elapsed time so that the if branch can be tested, without some fancy mocking of the datetime or wasting test time on setTimeout?
The fundamental challenge is that the module has hidden state that is not supposed to be exposed.
Here comes rewire
rewire is a drop-in replacement for require for testing purposes. When it requires a module, it also magically modifies the source code so that variables hidden in the module can be accessed and modified. Here is how you can use rewire to solve the above problems without touching the core module.
And yeah, it runs successfully!
The idea behind the scenes
I was extremely interested in how the __set__ and __get__ functions are implemented. How can they bind to the hidden variables in your module? So I stepped into the source code. It turns out that eval and closures do the magic. Let's take a close look at how __set__ is implemented; __get__ is very similar.
Key point 1: it modifies your source code to add a __set__ function to it before requiring your module.
This is important: this way, the added function lives in the same lexical scope as your module, so it can access the module's internal variables.
Key point 2: __set__ uses the eval function to execute the variable-replacing code.
Variables in the scope of the __set__ function are captured in the closure created by the eval statement. Thus, inside eval you can not only access the hidden variables of your module (inherited from the __set__ closure) but also the variables of the __set__ function itself. This way, you can build a general function that can replace any declared variable in the module's scope.
Literally, it appends a function like the following to your module's source code:
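rewire's actual injected code is more elaborate (see its source on GitHub), but the appended function is in spirit the __set__ below. In this self-contained sketch a plain closure stands in for the module scope, so the trick can run on its own:

```javascript
function fakeModule() {
  var batchSize = 10; // a private variable, like one hidden in a module

  return {
    getBatchSize: function () { return batchSize; },
    // What rewire appends, in spirit: eval runs inside __set__'s scope,
    // which shares the enclosing scope, so `name = value` rebinds an
    // otherwise inaccessible variable.
    __set__: function (name, value) {
      eval(name + ' = value;');
    }
  };
}

var mod = fakeModule();
console.log(mod.getBatchSize()); // 10
mod.__set__('batchSize', 99);
console.log(mod.getBatchSize()); // 99
```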
This technique really opened my mind. It's so powerful, and I believe it has a lot more practical applications. I hope you will enjoy this library too.
At Hulu, we use Nodejs to build a proxy/orchestration service that exposes APIs to our client applications, providing a unified interface (in API schema) that hides the complexity of dealing with our versatile backend "micro-services".
The reason we chose Nodejs over other technologies is that its asynchronous nature perfectly fits the requirements of the service:
Handling large number of concurrent clients;
Requesting backend micro-services to get the data, where most of the time is spent on IO;
Enhancing the backend response with data from other backend micro-services.
This is a typical front-tier API service that has low CPU consumption and high concurrent IO requirement.
In this post, I'll list some tools and techniques I have used to profile and debug Nodejs programs in daily development and production trouble-shooting, as a summary of the work I've done.
(We use MacOS X as development environment and Ubuntu server as production OS. The version of node we currently use is a custom build of v0.10.)
node-inspector is my favorite tool for Nodejs application trouble-shooting.
You can use the node-debug command to launch your application, which starts both your application and the debugger in your browser. However, my favorite way is to run a separate debugger process and attach node applications to it on demand, so that a single debugger process can debug multiple node applications.
First launch node-inspector. I prefer to open a separate console and leave it there.
Then add the --debug flag to the node options to run the application with the debugger enabled, which by default listens on port 5858. You can change the port by providing a port number to the debug flag, as --debug=<port>.
If you need to debug an issue during app startup, change --debug to --debug-brk so that your app starts paused.
After your app is started, you can visit the URL http://127.0.0.1:8080/debug?port=<port> to debug your application, just like in any IDE.
node-inspector can also be used remotely so that you can debug your applications on production servers.
TraceGL is a tracing tool that logs every statement your Nodejs app executes. I use it when I want to know exactly what happens during execution, especially when debugging timers, which are really difficult to examine with regular debuggers, because attaching a debugger stops execution and thus changes the timers' behavior.
TraceGL is very easy to set up: just clone the repository and follow the instructions to build the tracegl.js lib, put it somewhere convenient, like ~/Workspace/tools/tracegl.js, then start your node application using the following command.
Then open http://localhost:2000 in your browser and you will see something like the image below. TraceGL shows each executed statement in time order together with the values of function arguments and the corresponding stack, which is really useful.
A Flame Graph is the first thing I reach for when hunting CPU bottlenecks. It gives you a good overview of where your application spends most of its time. Below is a screenshot of a Flame Graph of our application. I recently identified a lot of performance issues using this tool.
A Flame Graph is obtained by sampling your running program, collecting the call stacks and accumulating them into a graph, where:
The y-axis stands for the depth of the call stack: the higher the block, the deeper it is in the call stack. It also shows the parent-child relationships between functions.
Launch node-inspector, boot up your server with the debug flag and open the web debugger. See the "Profiles" tab? That's it!
I won't spend more time explaining how to use it though :) Just try it yourself.
Moreover, node-inspector's sampling is really lightweight, which makes it very good for CPU-bound performance tuning.
v8 Profile log
Sometimes you need to do a more subtle inspection, such as finding out whether a method is optimized by v8, or knowing what Nodejs is doing in native space. Then you need the v8 profile log.
Generating the v8 profile log is easy: just add --prof to the node command line.
Then Nodejs will generate a file v8.log in the current folder.
The difficult part is analyzing the log file. According to the tutorial, you need to find a proper mac-tick-processor to parse the log file. Unfortunately, I have never successfully built one, because the v8 version embedded in node v0.10 is too old and it just can't be built in the latest OSX environment without changing the source code. That's painful.
I’m not going to go into the details of how to understand the v8 log. Here are some references if you are interested:
Production is serious business! Obviously, you should not add the --debug flag to your production process. Joyent has built a lot of tools, but they require DTrace functionality and their SmartOS to run well; we are running Linux and I don't see any plan to change it to SmartOS, which sucks.
There are some profiling and monitoring SaaS offerings for Nodejs out there, such as nodetime, strongloop, PM2, etc. I like none of them, because they not only introduce an external network/service dependency but also leak your production data to a 3rd party, which looks awkward and stupid if you are doing serious business.
What we currently do is set up an environment as close to production as possible, constantly replay production traffic to it, and use node-inspector to do profiling in that environment when necessary. We also heavily rely on production metrics: we emit a massive number of them.
Recently, I found a tool called node-stap, developed by Uber, which looks promising. However, SystemTap is intrusive and has an impact on the system, so I haven't tried it in production. I'm doing some experiments with it and might write a post about it later.
Trace GC and Heap size
Nodejs has a pretty decent Garbage Collector, so usually you don’t need to worry much about the GC parameters.
A particular optimization we apply in production is to enable the --expose-gc and --nouse-idle-notification flags so that we can trigger GC manually when necessary. Here is a good reference about the effect of --expose-gc and --nouse-idle-notification on system performance under high load: http://sysmagazine.com/posts/123154/.
On the other hand, the GC activity log is a good resource for performance trouble-shooting, so it is better to record it. To enable the GC log, add the --trace_gc flag to the node command; see the example below. One thing to note: the number in parentheses is the total memory allocated, while the number outside the parentheses is the total used memory.
When system performance is low, you may check the log to see if there are excessive GC events at the same time. If so, you may need to investigate how to reduce memory usage.
To understand the GC space layout, you can add --trace_gc_verbose to show the size of each generation. This is useful when the problem you need to solve is subtle and you need to understand the distribution between the new generation and the old generation. For example:
Another common metric we monitor is the heap size, which can be obtained by calling process.memoryUsage().heapUsed.
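For example, a minimal heap metric reporter built on that API (the metric names are illustrative):

```javascript
var usage = process.memoryUsage();
console.log('heap.used bytes:', usage.heapUsed);   // live JS objects
console.log('heap.total bytes:', usage.heapTotal); // heap reserved by v8
console.log('rss bytes:', usage.rss);              // total resident memory
```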
Profiling on Production
Sometimes you may find abnormal memory usage in production. There is a good library for getting a heap dump in production: heapdump. You can require heapdump in your application and send a signal to your process to take a heap dump when necessary. This is extremely useful for investigating problems in production.
After you get the heap dump, you can load it in the Chrome browser and compare different dumps to find potential memory leaks. See the image below as an example.
General Nodejs Performance Investigation Strategy
With the above tools and utilities, you should have enough power to investigate Nodejs performance problems. Here are some general strategies you can follow.
Find the bottleneck. First of all you need to understand what’s the bottleneck of your system. Is it bandwidth, memory or CPU?
Focus on a single code path at a time. Once you know your bottleneck, use a consistent way to stress your system and use the profiling tools to understand why the system underperforms under that fixed condition. As an example, we have a lot of API endpoints in our Nodejs service. When I do a performance investigation, I first find the slowest endpoint, then focus on it.
Start from the most significant issue. If some function takes 50% of your server's CPU and you optimize it by 30%, you get a 15% total improvement. On the other hand, if some function takes only 10% of your total resources, even eliminating it entirely gains you only 10%. So focus on the most significant issue first.
Divide and conquer. Most of the time the problem you are solving is too complex, so break it down into smaller problems, and write small benchmark scripts to understand how to improve each one.
Metrics, metrics, metrics! In production, there are limited ways to get information out of your nodejs application, so report as many metrics as possible when you write code. However, also profile how much reporting those metrics costs; sometimes you may find it is ridiculously high, and then you need to find a way to optimize the reporting.