Warning! This documentation is a work in progress. Expect some things to be out of date or not to work as described.

Monitoring

Stallion comes with a built-in health check endpoint that monitors the common problems that could mean your site is down, including: excessive 500 errors; failing endpoints; failing recurring jobs; running out of disk space; running out of memory; SSL certificates about to expire; and too many open file handles.

You can then add this endpoint to a monitoring service such as Pingdom, Uptimerobot, or ScoutApp, and get alerts whenever the health check reports a problem.

When you generate a new Stallion site, Stallion should add a setting called healthCheckSecret, a random string, to your conf/stallion.toml file.

You can use this secret to view the default health endpoint, which lives at: http://yourdomain.com/_st/health/check-health?secret=<your secret>

Here is an example result:

{
  "http" : {
    "error500s" : 0, // The number of 500 errors in the last 50 minutes
    "error400s" : 0,  // 400 errors in the last 50 minutes
    "error404s" : 4,  // 404 errors in the last 50 minutes
    "requestCount" : 18 // requests in the last 50 minutes
  },
  // Information about recurring jobs
  "jobs" : [ {
    "jobName" : "find-people",
    "lastStartedAt" : 0,
    "lastFinishedAt" : 1463081103475,
    "lastRunTime" : 0,
    "error" : "",
    "lastRunSucceeded" : false,
    "expectCompleteBy" : 0,
    "runningNow" : false
  } ],
  // Information about asynchronous tasks    
  "tasks" : {
    "stuckTasks" : 0, // Tasks that should have run, but haven't for some reason
    "completedTasks" : 17,
    "pendingTasks" : 0 // Tasks scheduled for the future
  },
  "endpoints" : [ {
    "url" : "/",
    "statusCode" : 200,
    "foundString" : true
  } ],
  "errors" : [ ],
  "warnings" : [ ],
  "system" : {
    "jvmMemoryUsage" : 8118976,
    "jvmMemoryUsageMb" : 7,
    "diskFreeDataDirectory" : 1269121024,
    "diskFreeDataDirectoryMb" : 1210,
    "diskFreeAppDirectory" : 1269121024,
    "diskFreeLogDirectory" : 1269121024,
    "fileHandlesOpen" : 48,
    "fileHandlesMax" : 4096,
    "fileHandlesAvailable" : 4048,
    "memoryPercentFree" : 0.8484154937075445,
    "memorySwapSize" : 4294963200,
    "memorySwapFree" : 4071411712,
    "memoryPhysicalSize" : 513843200,
    "memoryPhysicalFree" : 8454144,
    "swapPagingRate" : "NaN",
    "cpuAppUsage" : 0.0016294810729506142,
    "cpuSystemUsage" : 0.0,
    "cpuRollingAppUsage" : 5.094006637147029E-7,
    "cpuRollingSystemUsage" : 0.81,
    "cpusAvailable" : 1,
    "sslExpiresWithinMonth" : false,
    "sslExpiresDate" : 1468525680.000000000
  },
  "httpStatusCode" : 200
}

There are several built-in thresholds. If any of them is exceeded, the endpoint responds with a 515 status code rather than a 200. Errors are triggered if:

more than 5% of the requests are a 5xx error
you have used up 80% of your system file handles
any endpoint check failed
your app, data, or log directory has less than 1GB free
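
To make the error rules above concrete, here is a minimal Python sketch that re-applies them to a parsed health check response. It is not Stallion's own implementation. The function name find_errors is hypothetical, the field names come from the example result, and the exact comparisons are assumptions: "1GB" is read as 2**30 bytes, and an endpoint check counts as failed when statusCode is not 200 or foundString is false.

```python
GIB = 1024 ** 3  # assumption: "1GB" means 2**30 bytes


def find_errors(health):
    """Re-apply the documented error thresholds to a parsed health response."""
    problems = []

    # Rule: more than 5% of the requests are a 5xx error.
    http = health.get("http", {})
    request_count = http.get("requestCount", 0)
    if request_count and http.get("error500s", 0) / request_count > 0.05:
        problems.append("more than 5% of requests are 5xx errors")

    # Rule: 80% of the system file handles are used up.
    system = health.get("system", {})
    handles_max = system.get("fileHandlesMax", 0)
    if handles_max and system.get("fileHandlesOpen", 0) / handles_max >= 0.80:
        problems.append("80% or more of file handles are in use")

    # Rule: any endpoint check failed (assumed meaning: bad status or missing string).
    for endpoint in health.get("endpoints", []):
        if endpoint.get("statusCode") != 200 or not endpoint.get("foundString", False):
            problems.append("endpoint check failed: %s" % endpoint.get("url"))

    # Rule: the app, data, or log directory has less than 1GB free.
    for key in ("diskFreeAppDirectory", "diskFreeDataDirectory", "diskFreeLogDirectory"):
        if key in system and system[key] < GIB:
            problems.append("%s is below 1GB" % key)

    return problems


# A trimmed copy of the example result shown earlier.
sample = {
    "http": {"error500s": 0, "requestCount": 18},
    "endpoints": [{"url": "/", "statusCode": 200, "foundString": True}],
    "system": {
        "fileHandlesOpen": 48,
        "fileHandlesMax": 4096,
        "diskFreeAppDirectory": 1269121024,
        "diskFreeDataDirectory": 1269121024,
        "diskFreeLogDirectory": 1269121024,
    },
}
print(find_errors(sample))  # prints []
```

In the sample, 48 of 4096 file handles are open (about 1.2%) and each directory has roughly 1.18 GiB free, so none of the error rules fire.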

Warnings are triggered if:

more than 10% of requests are a 4xx error
any job has not finished on time, or failed in its last run
any stuck async tasks exist
JVM memory usage is too high
your SSL certificate is expiring within 30 days
your free memory is too low
your swap rate is over 25 pages

If you want the endpoint to return a 515 error whenever any warnings exist, add failOnWarnings=true to the query string.

If you want to monitor each section separately, you can limit the response to specific sections by adding their names to the query string: sections=http,jobs
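
As a rough illustration of how the query string options combine, the following Python sketch builds a health check URL and fetches it using only the standard library. The helper names build_health_url and fetch_health are hypothetical, and the domain and secret are placeholders. The path and the parameter names secret, failOnWarnings, and sections are taken from the text above.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode

HEALTH_PATH = "/_st/health/check-health"


def build_health_url(base_url, secret, sections=None, fail_on_warnings=False):
    """Assemble the health check URL from the documented query parameters."""
    query = {"secret": secret}
    if fail_on_warnings:
        query["failOnWarnings"] = "true"
    if sections:
        query["sections"] = ",".join(sections)
    # safe="," keeps the comma-separated section list unescaped in the URL.
    return base_url.rstrip("/") + HEALTH_PATH + "?" + urlencode(query, safe=",")


def fetch_health(url, timeout=10):
    """Return (status_code, body_text) for a health check URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx codes such as 515; the body is still readable.
        return err.code, err.read().decode("utf-8")


url = build_health_url(
    "http://yourdomain.com",
    "your-secret",  # placeholder for the healthCheckSecret value
    sections=["http", "jobs"],
    fail_on_warnings=True,
)
print(url)
```

A monitoring script could call fetch_health(url) on a schedule and alert on any status code other than 200. Most hosted monitoring services do the same thing when you give them the URL directly.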

Viewing exceptions

There is another endpoint that shows the 100 most recent exceptions raised since the server last rebooted: https://mydomain.com/_st/health/exceptions

This endpoint requires you to log in as an administrator. If your site is so broken that you cannot log in, you will have to SSH in and look at the log files instead. If you need exceptions from before the last reboot, you will also have to log into the server and view the log files. You may also want to set up a log monitoring tool.

Viewing server information

There is an additional endpoint that tells you some basic information about your server – https://yourdomain.com/_st/health/info?secret=<your healthcheck secret>.


{
  "remoteAddr" : "127.0.0.1", // The remoteAddr as given to the java servlet
  "xForwardedFor" : null, // X-forwarded-for HTTP header
  "xRealIp" : "173.12.5.73",  // X-Real-Ip HTTP Header
  "guessedIp" : "173.12.5.73", // Guessed IP based on the "ipHeaderName" setting, which defaults to "X-Real-IP" (populated by the nginx proxy)
  "jarBuildDates" : { // Which jars are included, and when were they built
    "jar:file:/srv/upfor-prod/alpha/bin/stallion!/META-INF/MANIFEST.MF" : "20160510-2044"
  },
  "instanceHostName" : "upfor.us",
  "instanceDomain" : "upfor.us",
  // Where this instance lives on the file-system
  "targetPath" : "/srv/upfor-prod/alpha",
  // When this instance was deployed
  "deployDate" : "2016-05-11 22:04:47 PM",
  // The local port the java servlet runs on
  "port" : 12501,
  // The environment
  "env" : "prod",
  // x-forwarded-host header
  "xForwardedHost" : "upfor.us"
}
