Warning! This documentation is a work in progress. Expect things to be out of date and not actually work according to instructions.

Monitoring

Stallion comes with a built-in health check endpoint that monitors the most common problems that could mean your site is down, including: excessive 500 errors; failing endpoints; failing recurring jobs; running out of disk space; running out of memory; SSL certs about to expire; too many open file handles.

You can then add this endpoint to a monitoring service such as Pingdom, UptimeRobot, or ScoutApp, and get alerts whenever the health check reports a problem.

When you generate a new Stallion site, it should add a setting called healthCheckSecret, set to a random string, to your conf/stallion.toml file.

You can use this secret to view the default health endpoint, which lives at: http://yourdomain.com/_st/health/check-health?secret=<your secret>
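A monitoring script can poll this URL directly. The following is a minimal Python sketch, not part of Stallion itself: the names parse_health_response and fetch_health are made up for illustration, and it assumes that the response body is plain JSON (without the explanatory comments shown in this documentation) and that the body is still JSON when the status code is not 200.

```python
import json
import urllib.error
import urllib.request


def parse_health_response(status_code, body):
    """Reduce a health check response to a small summary dict."""
    report = json.loads(body)
    return {
        # Anything other than 200 is treated as unhealthy.
        "healthy": status_code == 200,
        "errors": report.get("errors", []),
        "warnings": report.get("warnings", []),
    }


def fetch_health(url, timeout=10):
    """Request the health endpoint and summarize the result.

    urllib raises HTTPError for non-2xx statuses, so a failing health
    check is caught here and its body is read anyway.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health_response(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Assumption: the error response still carries the JSON report.
        return parse_health_response(err.code, err.read().decode("utf-8"))
```

Calling fetch_health with the full URL, including your secret, returns a dict whose "healthy" flag can drive an alert.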

Here is an example result:

{
  "http" : {
    "error500s" : 0, // The number of 500 errors in the last 50 minutes
    "error400s" : 0,  // 400 errors in the last 50 minutes
    "error404s" : 4,  // 404 errors in the last 50 minutes
    "requestCount" : 18 // requests in the last 50 minutes
  },
  // Information about recurring jobs
  "jobs" : [ {
    "jobName" : "find-people",
    "lastStartedAt" : 0,
    "lastFinishedAt" : 1463081103475,
    "lastRunTime" : 0,
    "error" : "",
    "lastRunSucceeded" : false,
    "expectCompleteBy" : 0,
    "runningNow" : false
  } ],
  // Information about asynchronous tasks    
  "tasks" : {
    "stuckTasks" : 0, // Tasks that should have run, but haven't for some reason
    "completedTasks" : 17,
    "pendingTasks" : 0 // Tasks scheduled for the future
  },
  "endpoints" : [ {
    "url" : "/",
    "statusCode" : 200,
    "foundString" : true
  } ],
  "errors" : [ ],
  "warnings" : [ ],
  "system" : {
    "jvmMemoryUsage" : 8118976,
    "jvmMemoryUsageMb" : 7,
    "diskFreeDataDirectory" : 1269121024,
    "diskFreeDataDirectoryMb" : 1210,
    "diskFreeAppDirectory" : 1269121024,
    "diskFreeLogDirectory" : 1269121024,
    "fileHandlesOpen" : 48,
    "fileHandlesMax" : 4096,
    "fileHandlesAvailable" : 4048,
    "memoryPercentFree" : 0.8484154937075445,
    "memorySwapSize" : 4294963200,
    "memorySwapFree" : 4071411712,
    "memoryPhysicalSize" : 513843200,
    "memoryPhysicalFree" : 8454144,
    "swapPagingRate" : "NaN",
    "cpuAppUsage" : 0.0016294810729506142,
    "cpuSystemUsage" : 0.0,
    "cpuRollingAppUsage" : 5.094006637147029E-7,
    "cpuRollingSystemUsage" : 0.81,
    "cpusAvailable" : 1,
    "sslExpiresWithinMonth" : false,
    "sslExpiresDate" : 1468525680.000000000
  },
  "httpStatusCode" : 200
}
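The timestamps in this example appear to use two different units. Judging by their magnitudes, lastFinishedAt looks like milliseconds since the Unix epoch, while sslExpiresDate looks like seconds. That reading is an assumption, not something the response states. Under it, a small Python sketch (the helper names are illustrative) converts both to UTC datetimes:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(millis):
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


def from_epoch_seconds(seconds):
    """Convert seconds since the Unix epoch to an aware UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


# Values taken from the example result above.
print(from_epoch_millis(1463081103475).isoformat())  # 2016-05-12T19:25:03.475000+00:00
print(from_epoch_seconds(1468525680.0).isoformat())  # 2016-07-14T19:48:00+00:00
```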

A number of built-in thresholds are defined, and if any of them is exceeded the endpoint responds with a 515 status code rather than 200. Errors are triggered if:

  • more than 5% of the requests are a 5xx error
  • you have used up 80% of your system file handles
  • any endpoint check failed
  • your app, data, or log directory has less than 1GB free
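If you want to apply the same rules in your own tooling, the error conditions above can be sketched against the fields of the example result. This is an illustration only, not Stallion's actual implementation: the function name find_errors is hypothetical, and the sketch assumes that the 5xx rate is error500s divided by requestCount, that an endpoint check fails when the status is not 200 or foundString is false, and that "1GB" means 1024^3 bytes.

```python
GIB = 1024 ** 3  # assumption: "1GB" is read as 2**30 bytes

DISK_KEYS = ("diskFreeAppDirectory", "diskFreeDataDirectory", "diskFreeLogDirectory")


def find_errors(report):
    """Return a list of messages for each error rule the report violates."""
    errors = []
    http = report.get("http", {})
    system = report.get("system", {})

    # More than 5% of requests are 5xx errors.
    total = http.get("requestCount", 0)
    if total and http.get("error500s", 0) / total > 0.05:
        errors.append("more than 5% of requests were 5xx errors")

    # 80% of the available file handles are in use.
    max_handles = system.get("fileHandlesMax", 0)
    if max_handles and system.get("fileHandlesOpen", 0) / max_handles >= 0.80:
        errors.append("80% or more of file handles are in use")

    # Any endpoint check failed (assumed: wrong status or missing string).
    for endpoint in report.get("endpoints", []):
        if endpoint.get("statusCode") != 200 or not endpoint.get("foundString", False):
            errors.append(f"endpoint check failed: {endpoint.get('url')}")

    # Less than 1GB free in the app, data, or log directory.
    for key in DISK_KEYS:
        free = system.get(key)
        if free is not None and free < GIB:
            errors.append(f"{key} has less than 1GB free")

    return errors
```

Run against the example result shown earlier, this sketch reports no errors, which matches the empty "errors" list in that response.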

Warnings are triggered if:

  • more than 10% of requests are a 4xx error
  • any job has not finished on time, or failed in its last run
  • any stuck async tasks exist
  • JVM memory usage is too high
  • your SSL certificate is expiring within 30 days
  • your free memory is too low
  • your swap rate is over 25 pages

To get a 515 response when any warnings exist, add failOnWarnings=true to the query string.

To monitor sections separately, limit the response to specific sections by adding a comma-separated list of section names to the query string: sections=http,jobs
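The query string options above can be combined in one URL. Here is a short Python sketch of a URL builder; the function name health_check_url and its parameters are hypothetical, and it assumes section names other than http and jobs match the top-level keys of the example result.

```python
from urllib.parse import urlencode


def health_check_url(base_url, secret, sections=None, fail_on_warnings=False):
    """Build a check-health URL with the optional query parameters."""
    params = {"secret": secret}
    if fail_on_warnings:
        params["failOnWarnings"] = "true"
    if sections:
        # Section names are joined into one comma-separated value.
        params["sections"] = ",".join(sections)
    # safe="," keeps the commas readable instead of encoding them as %2C.
    query = urlencode(params, safe=",")
    return f"{base_url.rstrip('/')}/_st/health/check-health?{query}"


print(health_check_url("https://yourdomain.com", "abc123",
                       sections=["http", "jobs"], fail_on_warnings=True))
# https://yourdomain.com/_st/health/check-health?secret=abc123&failOnWarnings=true&sections=http,jobs
```

Building one URL per section lets you register each section as its own check in your monitoring service.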

Viewing exceptions

There is another endpoint that shows the most recent 100 exceptions since the server last rebooted: https://mydomain.com/_st/health/exceptions

This endpoint requires you to log in as an administrator. If your site is so broken that you cannot log in, you will have to SSH in and look at the log files instead. The endpoint only covers exceptions since the last reboot, so for anything older you will also have to log into the server and view the log files. You may also want to set up a log monitoring tool.

Viewing server information

There is an additional endpoint that tells you some basic information about your server: https://yourdomain.com/_st/health/info?secret=<your healthcheck secret>


{
  "remoteAddr" : "127.0.0.1", // The remoteAddr as given to the java servlet
  "xForwardedFor" : null, // X-forwarded-for HTTP header
  "xRealIp" : "173.12.5.73", // X-Real-Ip HTTP Header
  // Guessed IP based on the "ipHeaderName" setting, which defaults to "X-Real-IP"
  // which is populated by the nginx proxy
  "guessedIp" : "173.12.5.73",
  // Which jars are included, and when were they built
  "jarBuildDates" : {
    "jar:file:/srv/upfor-prod/alpha/bin/stallion!/META-INF/MANIFEST.MF" : "20160510-2044"
  },
  "instanceHostName" : "upfor.us",
  "instanceDomain" : "upfor.us",
  // Where this instance lives on the file-system
  "targetPath" : "/srv/upfor-prod/alpha",
  // When this instance was deployed
  "deployDate" : "2016-05-11 22:04:47 PM",
  // The local port the java servlet runs on
  "port" : 12501,
  // The environment
  "env" : "prod",
  // x-forwarded-host header
  "xForwardedHost" : "upfor.us"
}
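The guessedIp value is useful for checking that your proxy passes the client address through. As a rough illustration of the idea, and not Stallion's actual code, the Python sketch below reads the header named by ipHeaderName and falls back to the servlet's remoteAddr when the header is missing. The function name guess_ip and the fallback behavior are assumptions.

```python
def guess_ip(remote_addr, headers, ip_header_name="X-Real-IP"):
    """Pick the client IP from the configured header, else use remote_addr."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get(ip_header_name.lower())
    # Assumed fallback: with no usable header, trust the socket address.
    return value.strip() if value else remote_addr


print(guess_ip("127.0.0.1", {"x-real-ip": "173.12.5.73"}))  # 173.12.5.73
print(guess_ip("127.0.0.1", {}))                            # 127.0.0.1
```

If guessedIp shows 127.0.0.1 in production, the proxy is probably not setting the header that ipHeaderName points to.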
© 2018 Stallion Software LLC