Recently, we started noticing some strange symptoms on the platform. For example, one of our crucial search components wasn't being loaded on the landing page; it turned out that Varnish couldn't fetch the fragment of the main page that was esi:included there. Some of our backoffice pages were loading extremely slowly, because they read details from our other microservices, and we found several timeouts on the server. One person even had a 503 error displayed in the browser while searching... My first thought was that some AWS problem (network or load balancers) was to blame, but I soon realized there had been no recent outages on AWS, at least not in our region (eu-west-1), and other people would probably have been seeing the same issues. And yet, it kept happening on random occasions.
A 503 error roughly means "service is unavailable". Usually, we see this error when the load balancer cannot reach the backend server (either it is simply not running, it is too busy to accept more connections, or the load balancer waited too long for a response). But because it kept happening at seemingly random times, I started to investigate the root cause of the problem.
Without going into too much detail, I'd like to briefly introduce our infrastructure. Everything is hosted on AWS. The production environment lives in a private network behind one NAT gateway. Behind the gateway we have Elastic Load Balancers which point directly to 2 Frontend Proxies (load balancers, running as EC2 machines). On those machines Varnish is running; it acts mostly as a load balancer, but also lets us include parts of other pages with the esi:include directive. These load balancers then forward HTTP requests to the application servers. We have 4 different services, each backed by one additional EC2 instance and 2 PostgreSQL DB instances. Each application is written in Java 8 and uses Dropwizard + Jersey. Every application server can talk to the others through 2 internal Backend Proxies which also run Varnish; these contain the rules for routing /internal/* requests, which are not reachable from outside our network. All instances run Ubuntu.
First of all, I started reading the Varnish logs on the frontend (and backend) proxies.
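It was something along these lines (assuming Varnish 4.x; on older versions the log tags and query syntax differ):

```
# group log records per request and show only those whose backend
# response came back as a 503
varnishlog -g request -q 'BerespStatus == 503'
```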
From those logs I learned that, from time to time, our servers were indeed responding with a 503 error.
I started thinking: how was that even possible? There had been no deployment at the time the errors occurred. I logged into the application server running the search service and there it was: the machine's uptime was over 100 days, and the app itself had been running since the last deployment and restart. So the errors couldn't have been caused by the service being restarted.
Then, a light bulb went on in my head - what if our server cannot accept the additional connections it should be accepting?
And so, I checked the maximum number of socket connections the kernel can handle.
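In practice this boils down to net.core.somaxconn, the kernel's listen backlog limit (whose default happens to be exactly 128), checked with something like:

```
sysctl net.core.somaxconn
```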
… only 128 connections!
Then I needed to check the number of currently open socket connections.
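Something like the following does the job:

```
# count established TCP connections
netstat -ant | grep ESTABLISHED | wc -l
# or get a quick summary of all sockets
ss -s
```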
In total, there were 112 open connections - pretty close to the limit.
I checked a few other important values as well.
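Roughly speaking, these were the file descriptor limit, the interface's txqueuelen and the two kernel backlog settings, checked with something like this (eth0 is just an example; the interface name may differ):

```
# max open file descriptors for the current shell/process
ulimit -n
# transmit queue length of the network interface
ifconfig eth0 | grep txqueuelen
# packet backlog between the network interface and the kernel
sysctl net.core.netdev_max_backlog
# max number of half-open (SYN_RECV) connections
sysctl net.ipv4.tcp_max_syn_backlog
```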
So, we can open an unlimited number of file descriptors (a socket is a file descriptor as well) - good!
txqueuelen is the length of the transmit queue on the network interface - it tells us how many outgoing packets can be queued before they are processed.
The first of the two backlog values is the maximum number of packets that can queue up between the network interface and the kernel before the kernel processes them; the latter is the maximum number of half-open connections waiting in the SYN queue (still completing the handshake with the server).
Then, I tried to increase some of these numbers.
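The changes were along these lines (the exact values don't matter much here):

```
# bump the listen backlog and the SYN backlog
sudo sysctl -w net.core.somaxconn=4096
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096
# lengthen the transmit queue on the interface
sudo ifconfig eth0 txqueuelen 5000
```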
As it turned out, it wasn't related to the number of connections - at least not at the OS level. I then thought it could be caused by an insufficient number of connections in Jetty / Dropwizard, so I re-configured the number of connections and the max thread count (to something around 1024) and gave it a try.
After a restart it turned out this didn't help at all. The issue persisted.
Therefore, I decided to try some manual testing: while on the search service instance, I tried to curl the problematic resource.
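It was more or less this (the path below is just a placeholder for the real endpoint, and 8080 is Dropwizard's default application port):

```
# hypothetical path standing in for the actual resource
curl -v "http://localhost:8080/search/results?q=whatever"
```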
But again, no response. I waited quite a few minutes and nothing happened. OK, so I was getting the same result our Varnish instances were getting. Except that Varnish times out after 30 seconds and returns a 503 (this is why one of our backoffice pages was only displayed after ~30 seconds). That meant the search had to continue, possibly into the code itself.
Fourth and last try
Staying on the search service instance, I curled the search service's Dropwizard admin context to see some metrics.
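Dropwizard exposes its metrics on the admin connector (port 8081 by default), so the call was roughly:

```
curl -s http://localhost:8081/metrics
```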
Among the metrics returned, the most interesting one was the number of blocked threads … why were so many threads blocked? I seriously doubted that was a normal situation (although I had never checked these values before, so again, it was only a guess).
That's why I checked the threads as well.
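Dropwizard's admin context also serves a full thread dump (a plain jstack of the process would have worked just as well), so it was something like:

```
curl -s http://localhost:8081/threads
```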
And there I found something really interesting. It turned out that our SessionFilter (which runs before and after the resource code is executed; the problematic part ran after the resource, when setting cookies on the response) was blocking threads. The threads were stuck in SecureRandom: the SecureRandom instance was reading random bytes from /dev/random, which blocks the calling thread until the kernel has gathered enough randomness. So more and more threads were walking into this trap and staying blocked … in some cases for more than a few minutes.
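By the way, a quick way to see how starved a machine is for entropy - and therefore how likely reads from /dev/random are to block - is to look at the kernel's entropy pool:

```
# current size of the kernel entropy pool, in bits;
# values close to zero mean reads from /dev/random will block
cat /proc/sys/kernel/random/entropy_avail
```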
Since we don't need SecureRandom for something as trivial as generating cookies for anonymous users (which are in fact just 64-character hex strings), I decided to use plain java.util.Random - it should be sufficient for our needs, and, more importantly, it doesn't block threads while waiting for sufficient randomness.
After deployment, the problematic error hasn’t reappeared - mission successful!