Monitoring
Stallion comes with a built-in health check endpoint that monitors the most common problems that could mean your site is down, including: excessive 500 errors; failing endpoints; failing recurring jobs; running out of disk space; running out of memory; SSL certs about to expire; too many open file handles.
You can add this endpoint to a monitoring service such as Pingdom, UptimeRobot, or ScoutApp, and get alerts whenever the health check reports a problem.
When you generate a new Stallion site, the generator should add a setting called healthCheckSecret, which is a random string, to your conf/stallion.toml file.
You can use this to view the default health endpoint, which lives at: http://yourdomain.com/_st/health/check-health?secret=<your secret>
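A monitoring script can build this URL and poll it. The following is a minimal sketch in Python using only the standard library. The function names `build_health_url` and `fetch_health`, and the placeholder domain and secret, are hypothetical and not part of Stallion; the sketch also assumes the response body is JSON.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def build_health_url(base_url, secret):
    """Return the health check URL for a site, with the secret URL-encoded."""
    query = urllib.parse.urlencode({"secret": secret})
    return base_url.rstrip("/") + "/_st/health/check-health?" + query


def fetch_health(url, timeout=10):
    """Fetch the endpoint and return (status_code, parsed_body).

    urllib raises HTTPError for non-2xx responses, so that case is caught
    in order to still read the status code and the body.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        status, raw = err.code, err.read()
    try:
        body = json.loads(raw.decode("utf-8"))
    except ValueError:
        body = None  # body was not JSON (assumption: it normally is)
    return status, body


# Placeholder domain and secret, for illustration only.
url = build_health_url("http://yourdomain.com", "my-secret")
```

Calling `fetch_health(url)` then returns the HTTP status code together with the parsed report, which a script can inspect or forward to an alerting system.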
Here is an example result:
{
"http" : {
"error500s" : 0, // The number of 500 errors in the last 50 minutes
"error400s" : 0, // 400 errors in the last 50 minutes
"error404s" : 4, // 404 errors in the last 50 minutes
"requestCount" : 18 // requests in the last 50 minutes
},
// Information about recurring jobs
"jobs" : [ {
"jobName" : "find-people",
"lastStartedAt" : 0,
"lastFinishedAt" : 1463081103475,
"lastRunTime" : 0,
"error" : "",
"lastRunSucceeded" : false,
"expectCompleteBy" : 0,
"runningNow" : false
} ],
// Information about asynchronous tasks
"tasks" : {
"stuckTasks" : 0, // Tasks that should have run, but haven't for some reason
"completedTasks" : 17,
"pendingTasks" : 0 // Tasks scheduled for the future
},
"endpoints" : [ {
"url" : "/",
"statusCode" : 200,
"foundString" : true
} ],
"errors" : [ ],
"warnings" : [ ],
"system" : {
"jvmMemoryUsage" : 8118976,
"jvmMemoryUsageMb" : 7,
"diskFreeDataDirectory" : 1269121024,
"diskFreeDataDirectoryMb" : 1210,
"diskFreeAppDirectory" : 1269121024,
"diskFreeLogDirectory" : 1269121024,
"fileHandlesOpen" : 48,
"fileHandlesMax" : 4096,
"fileHandlesAvailable" : 4048,
"memoryPercentFree" : 0.8484154937075445,
"memorySwapSize" : 4294963200,
"memorySwapFree" : 4071411712,
"memoryPhysicalSize" : 513843200,
"memoryPhysicalFree" : 8454144,
"swapPagingRate" : "NaN",
"cpuAppUsage" : 0.0016294810729506142,
"cpuSystemUsage" : 0.0,
"cpuRollingAppUsage" : 5.094006637147029E-7,
"cpuRollingSystemUsage" : 0.81,
"cpusAvailable" : 1,
"sslExpiresWithinMonth" : false,
"sslExpiresDate" : 1468525680.000000000
},
"httpStatusCode" : 200
}
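Once the response is parsed, a script can reduce it to a short list of problems. The sketch below is one possible way to do that in Python. The function name `summarize_health` is hypothetical, and the sketch assumes that entries in "errors" and "warnings" can be printed as strings and that a failed endpoint check shows up as a false "foundString" or a status code of 400 or above. It does not try to reproduce how Stallion itself decides what counts as a warning.

```python
def summarize_health(report):
    """Return a list of human-readable problems found in a parsed report."""
    problems = []
    problems += ["error: %s" % e for e in report.get("errors", [])]
    problems += ["warning: %s" % w for w in report.get("warnings", [])]

    for job in report.get("jobs", []):
        if not job.get("lastRunSucceeded", True):
            problems.append("job %s did not succeed on its last run" % job.get("jobName"))

    for endpoint in report.get("endpoints", []):
        # Assumption: a check failed if the string was missing or the status is 4xx/5xx.
        if not endpoint.get("foundString", True) or endpoint.get("statusCode", 200) >= 400:
            problems.append("endpoint %s failed its check" % endpoint.get("url"))

    stuck = report.get("tasks", {}).get("stuckTasks", 0)
    if stuck:
        problems.append("%d stuck tasks" % stuck)
    return problems


# A trimmed copy of the example result above, without the comments.
sample = {
    "jobs": [{"jobName": "find-people", "lastRunSucceeded": False}],
    "tasks": {"stuckTasks": 0, "completedTasks": 17, "pendingTasks": 0},
    "endpoints": [{"url": "/", "statusCode": 200, "foundString": True}],
    "errors": [],
    "warnings": [],
}
problems = summarize_health(sample)
```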
Several built-in thresholds are defined. If any of them is exceeded, the endpoint responds with a 515 status code rather than a 200. Errors are triggered if:
- more than 5% of the requests are a 5xx error
- you have used up 80% of your system file handles
- any endpoint check failed
- your app, data, or log directory has less than 1GB free
Warnings are triggered if:
- more than 10% of requests are a 4xx error
- any job has not finished on time, or failed in its last run
- any stuck async tasks exist
- JVM memory usage is too high
- your SSL certificate is expiring within 30 days
- your free memory is too low
- your swap rate is over 25 pages
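To make the arithmetic behind some of the error thresholds concrete, here is a small Python sketch that applies three of them to a parsed report. It is an illustration only, not Stallion's actual implementation. The function name `error_conditions` is hypothetical, "1GB" is assumed to mean 2**30 bytes, and the file handle rule is assumed to compare "fileHandlesOpen" against "fileHandlesMax". The endpoint check rule is left out, because how a failed check is represented is not specified here.

```python
GIB = 1024 ** 3  # assumption: "1GB" is read here as 2**30 bytes


def error_conditions(report):
    """Return descriptions of the error thresholds that a parsed report exceeds."""
    http = report.get("http", {})
    system = report.get("system", {})
    found = []

    requests = http.get("requestCount", 0)
    if requests and http.get("error500s", 0) / requests > 0.05:
        found.append("more than 5% of requests are 5xx errors")

    max_handles = system.get("fileHandlesMax", 0)
    if max_handles and system.get("fileHandlesOpen", 0) / max_handles >= 0.80:
        found.append("80% of file handles are in use")

    for key in ("diskFreeAppDirectory", "diskFreeDataDirectory", "diskFreeLogDirectory"):
        if key in system and system[key] < GIB:
            found.append(key + " is below 1GB")

    return found


# Values taken from the example result shown earlier.
healthy = {
    "http": {"error500s": 0, "requestCount": 18},
    "system": {
        "fileHandlesOpen": 48,
        "fileHandlesMax": 4096,
        "diskFreeAppDirectory": 1269121024,
        "diskFreeDataDirectory": 1269121024,
        "diskFreeLogDirectory": 1269121024,
    },
}
# Same report, but with 2 of 18 requests (about 11%) failing with a 5xx error.
degraded = dict(healthy, http={"error500s": 2, "requestCount": 18})
```

With the example values, 48 of 4096 file handles is about 1.2% and each directory has roughly 1.18 GiB free, so `error_conditions(healthy)` finds nothing, while `error_conditions(degraded)` reports the 5xx threshold.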
If you want the endpoint to return a 515 error when any warnings exist, add failOnWarnings=true to the query string.
If you want to monitor each section separately, you can limit the sections by adding the section names to the query string: sections=http,jobs
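These query string options can be appended to an existing health check URL programmatically. The sketch below shows one way to do it with Python's standard library; the function name `add_health_options` and the placeholder URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_health_options(url, sections=None, fail_on_warnings=False):
    """Return url with the sections and failOnWarnings parameters appended."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if sections:
        params.append(("sections", ",".join(sections)))
    if fail_on_warnings:
        params.append(("failOnWarnings", "true"))
    # safe="," keeps the comma-separated section list readable in the URL.
    return urlunsplit(parts._replace(query=urlencode(params, safe=",")))


# Placeholder URL and secret, for illustration only.
base = "http://yourdomain.com/_st/health/check-health?secret=my-secret"
jobs_url = add_health_options(base, sections=["http", "jobs"], fail_on_warnings=True)
```

Building one URL per section makes it possible to register each section as a separate check in the monitoring service, so an alert points directly at the failing area.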
Viewing exceptions
There is another endpoint that shows the most recent 100 exceptions since the server last rebooted: https://mydomain.com/_st/health/exceptions
This endpoint requires you to log in as an administrator. If your site is so broken that you cannot log in, you will have to SSH in instead and look at the log files. If you need exceptions from before the last reboot, you will also have to log into the server and view the log files. You may also want to set up a log monitoring tool.
Viewing server information
There is an additional endpoint that gives you some basic information about your server: https://yourdomain.com/_st/health/info?secret=<your healthcheck secret>
{
"remoteAddr" : "127.0.0.1", // The remoteAddr as given to the java servlet
"xForwardedFor" : null, // X-forwarded-for HTTP header
"xRealIp" : "173.12.5.73", // X-Real-Ip HTTP Header
"guessedIp" : "173.12.5.73", // Guessed IP based on the "ipHeaderName" setting, which defaults to "X-Real-IP" which is populated by the nginx proxy
"jarBuildDates" : { // Which jars are included, and when were they built
"jar:file:/srv/upfor-prod/alpha/bin/stallion!/META-INF/MANIFEST.MF" : "20160510-2044"
},
"instanceHostName" : "upfor.us",
"instanceDomain" : "upfor.us",
// Where this instance lives on the file-system
"targetPath" : "/srv/upfor-prod/alpha",
// When this instance was deployed
"deployDate" : "2016-05-11 22:04:47 PM",
// The local port the java servlet runs on
"port" : 12501,
// The environment
"env" : "prod",
// x-forwarded-host header
"xForwardedHost" : "upfor.us"
}