How to debug in production
Debugging in production has always been something that we try our best to avoid, with multiple QA phases, staging environments, and automation to stop production bugs.
Despite this, occasionally a bug will find its way to production, and when it does, we need a strategy to cope with this safely and efficiently.
Gone are the days when a production bug meant reporting the issue and then trying to reproduce it in staging or development. Modern debugging tools allow you to debug a production environment safely and reliably, without disrupting your users or causing downtime.
Is it safe to debug in production?
A key concern is whether it is safe to debug in production. The answer is that it can be, depending on the method you use.
Specific debugging methods, such as a remote debugger in your IDE or dumping error information to the user, are not considered safe.
Other methods, such as logging, error context capture, and debuggers designed explicitly for production, can be used with varying levels of success.
What makes it safe to debug in production?
If you need to debug in production, there are a few things to ensure when doing so:
- Debugging does not have an extensive performance hit.
- Debugging does not block people from using your application.
- Secure data is not exposed to the outside world.
- You are getting enough debug information to find and fix the issue as soon as possible.
The first three points are self-explanatory: when debugging, you need to keep your application running smoothly and keep the process secure.
Getting all required information when an issue occurs can be challenging, but depending on the complexity / severity of the issue, it can be crucial.
When you debug, you want to get as much information about the issue as possible in the least amount of time. If you have to jump between multiple systems and iterate over the errors several times to fix the bugs, this only adds to the stress of dealing with a critical issue.
Why is using a remote debugger in production not a good idea?
A remote debugger attaches your IDE to the running JVM over a debugger connection. To allow these connections, you have to place the JVM into debug mode, which can have a significant impact on performance compared to running without this debug capability.
If you do not already have a debugger connection configured, you would also need to restart Java to establish one. This is not something you want to be doing in an already stressful situation.
When a breakpoint is hit, the Java engine stops: all threads in the JVM are paused, unless the breakpoint has been configured to suspend only its own thread. Hence, your application effectively becomes unusable until you resume or stop debugging.
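Before deciding whether a restart would be needed, it can help to check whether a running JVM was started with the debug agent at all. The following is a minimal sketch, not a definitive check: the class name DebugAgentCheck is hypothetical, and it assumes the agent was enabled through the usual -agentlib:jdwp (or legacy -Xrunjdwp) startup argument.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: reports whether the JVM was launched in debug mode.
final class DebugAgentCheck {

    // Returns true if any JVM argument loads the JDWP debug agent.
    static boolean hasDebugAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            // Modern form: -agentlib:jdwp=...  Legacy form: -Xrunjdwp:...
            if (arg.startsWith("-agentlib:jdwp") || arg.startsWith("-Xrunjdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the application's main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Debug agent enabled: " + hasDebugAgent(jvmArgs));
    }
}
```

If this reports false, attaching a remote debugger would first require restarting the JVM with the agent enabled.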
Why is dumping error information to the user not a good idea?
If you have an error, dumping all that information to the user and having it fixed via support teams seems like a good solution; however, you have to be careful about what information is being displayed.
A stack trace on an error page may not look pretty, but it can direct you towards the problem, and the data contained in a stack trace is typically safe. Even so, there may be something in that trace that you do not want the user to see.
Dumping variables such as user data or request/response information to give you more context carries greater risk, as this can include authorization headers, security information, and personal data that could make your system vulnerable.
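If request context does need to be recorded somewhere, one common precaution is to mask sensitive values first. The sketch below is illustrative only: the HeaderRedactor class and its deny-list of header names are assumptions, and a real application would need a list that matches its own headers and data.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: masks sensitive header values before they are recorded.
final class HeaderRedactor {

    // Assumed deny-list (lower-case); extend it for your own application.
    private static final Set<String> SENSITIVE =
            Set.of("authorization", "cookie", "set-cookie", "x-api-key");

    // Returns a copy of the headers with sensitive values replaced.
    static Map<String, String> redact(Map<String, String> headers) {
        Map<String, String> safe = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            // Header names are case-insensitive, so compare in lower case.
            boolean sensitive = SENSITIVE.contains(entry.getKey().toLowerCase(Locale.ROOT));
            safe.put(entry.getKey(), sensitive ? "[REDACTED]" : entry.getValue());
        }
        return safe;
    }
}
```

A deny-list only masks the names it knows about, so it reduces the risk rather than removing it.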
How to use logging to debug in production
For a long time, logging in production to debug errors was the industry standard, with tools like Splunk, Elastic, or Papertrail making this easier.
With logging, you can capture data for any errors, so you have context when they occur. In some languages, such as Node.js or PHP, you can add logging at runtime, although this does carry some risk. In languages like Java, this is not always possible.
If you rely on logging, it is best to add it as you develop the application and to use appropriate log levels so that there is no flood of logs during regular operation.
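As a rough illustration of log levels, the sketch below uses the JDK's built-in java.util.logging package. The OrderService class and its methods are hypothetical. Detailed messages are written at FINE, which is hidden under the default configuration, while the error itself is logged at WARNING together with the exception.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical service used only to illustrate log levels.
final class OrderService {

    private static final Logger LOG = Logger.getLogger(OrderService.class.getName());

    // Parses a quantity, returning 0 when the input is not a valid integer.
    static int parseQuantity(String raw) {
        // FINE is below the default INFO threshold, so this stays quiet in normal operation.
        LOG.fine(() -> "Parsing quantity from input: " + raw);
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            // WARNING is shown by default; passing the exception records its stack trace.
            LOG.log(Level.WARNING, "Invalid quantity '" + raw + "', defaulting to 0", e);
            return 0;
        }
    }

    // Lowers this logger's threshold at runtime. Handlers attached to the logger
    // (or its parents) must also accept FINE for the messages to appear.
    static void enableVerboseLogging() {
        LOG.setLevel(Level.FINE);
    }
}
```

Changing a level at runtime only exposes log statements that already exist in the code. It does not add new ones.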
When an error does occur and the data is not sufficient, you can add required logs for that error to get the context you need.
When doing this, there are some things you should consider:
- Adding extra logging may require restarting your application.
- Logging volume can become excessive and hard to manage.
- Pinpointing errors can be problematic when buried amongst other messages.
A better alternative to logging in production?
In 2014, non-blocking debuggers were introduced as a product. The idea behind these debuggers is to capture the stack and variables when an event such as an error occurs, without blocking the execution of code.
This gives you a picture of what happened when an error occurs, so you have the full context of the error automatically. The captured data allows you to home in on the root cause of an issue without stopping the execution of code, adding extra logging, attaching debugging tools, or incurring a noticeable performance impact.
Non-blocking debuggers are great for debugging simpler issues where the root cause of the error is at the top of the stack. If the root cause is not at the top of the stack, you may require further debugging, but you will have a much better idea of where it actually is.
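The general idea of capturing context without pausing the application can be sketched in plain Java. This is a simplified illustration and not how any particular product works: the ErrorSnapshots class is hypothetical, and the variables have to be passed in by hand, whereas a real non-blocking debugger collects them automatically.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: records error context without pausing any thread.
final class ErrorSnapshots {

    // What was known at the moment of failure.
    static final class Snapshot {
        final String error;
        final List<String> stack;
        final Map<String, String> variables;

        Snapshot(String error, List<String> stack, Map<String, String> variables) {
            this.error = error;
            this.stack = stack;
            this.variables = variables;
        }
    }

    // Bounded queue so a burst of errors cannot exhaust memory.
    private static final BlockingQueue<Snapshot> QUEUE = new ArrayBlockingQueue<>(100);

    // Copies the stack and the supplied variables, then enqueues them.
    // offer() returns immediately; if the queue is full the snapshot is dropped.
    static boolean capture(Throwable error, Map<String, Object> variables) {
        List<String> stack = new ArrayList<>();
        for (StackTraceElement frame : error.getStackTrace()) {
            stack.add(frame.toString());
        }
        Map<String, String> copied = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : variables.entrySet()) {
            // Store string forms so later changes to the objects do not alter the snapshot.
            copied.put(entry.getKey(), String.valueOf(entry.getValue()));
        }
        return QUEUE.offer(new Snapshot(error.toString(), stack, copied));
    }

    // Returns the oldest snapshot, or null if none is waiting; never blocks.
    static Snapshot poll() {
        return QUEUE.poll();
    }
}
```

In this sketch, capture would be called from a catch block before the exception is rethrown, and a separate background thread could drain the queue with poll so that the request thread is never held up.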
FusionReactor provides a solution called Event Snapshot. When an error occurs, the breaking code, full stack, and variables are captured automatically, so you can find and fix your errors without manually diagnosing the cause.