LevelBlue Agent Memory Consumption and the osquery Watchdog

The LevelBlue Agent is configured to run two osquery processes: an initial process that functions as a watchdog, and a child worker process that runs the scheduled queries. The watchdog manages the child worker and terminates it if it exceeds the resource limits configured in the watchdog settings.

Watchdog Overview

The maximum thresholds for the resources that the watchdog monitors are:

  • CPU: Above 25% usage for over 9 consecutive seconds.
  • Memory: When 350MB is reached (LevelBlue Agent default setting).

The watchdog profiles the worker's memory footprint at startup and subtracts that from the monitored value. It only restarts the worker if that difference exceeds the configured watchdog limit. For example, if the watchdog limit is set to 350MB and the agent starts up with an initial worker process footprint of 30MB, the watchdog limit will be triggered at 380MB (350MB limit + 30MB initial memory footprint = 380MB threshold limit).
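The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation described in this article, not osquery's actual implementation; the function names and the megabyte units are assumptions made for the example.

```python
def effective_threshold_mb(watchdog_limit_mb: int, initial_footprint_mb: int) -> int:
    """Total worker memory at which the watchdog limit is reached."""
    return watchdog_limit_mb + initial_footprint_mb


def exceeds_limit(current_mb: int, initial_footprint_mb: int, watchdog_limit_mb: int) -> bool:
    """True when growth since startup is greater than the watchdog limit."""
    # The startup footprint is subtracted from the monitored value first.
    return (current_mb - initial_footprint_mb) > watchdog_limit_mb


# Values from the example: 350MB limit, 30MB initial footprint.
print(effective_threshold_mb(350, 30))  # 380
```

With these hypothetical helpers, a worker that started at 30MB and now uses 200MB has grown by 170MB, which is well under the 350MB limit.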

Watchdog Threshold Limits and Errors

Once the watchdog limit is reached, the osquery watchdog respawns the child worker process. After the worker restarts, osquery checks the previously active queries to see which ones did not finish normally. Because one or more of these queries may have caused the watchdog limit to be exceeded, the unfinished queries are denylisted from the scheduler for 24 hours.

If the osquery processes exceed their allocated resources, the watchdog may respawn the worker process without giving any error message. A good indicator that this has happened is the timestamps of the files in the logs subdirectory. A high number of files with timestamps that are close together suggests that the watchdog has been killing processes because of resource allocation limits.
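The timestamp check described above can be automated with a short script. The following is a minimal sketch, assuming the logs directory is passed in by the caller. The function names and the thresholds (`max_gap_seconds`, `min_files`) are hypothetical values chosen for illustration, not settings defined by the agent or by osquery.

```python
from pathlib import Path


def group_bursts(timestamps, max_gap_seconds=60.0, min_files=5):
    """Group timestamps into runs whose neighbors are at most max_gap_seconds apart.

    Only runs containing at least min_files timestamps are returned.
    """
    bursts, current = [], []
    for ts in sorted(timestamps):
        # A gap larger than the allowed window closes the current run.
        if current and ts - current[-1] > max_gap_seconds:
            if len(current) >= min_files:
                bursts.append(current)
            current = []
        current.append(ts)
    if len(current) >= min_files:
        bursts.append(current)
    return bursts


def log_file_bursts(log_dir, max_gap_seconds=60.0, min_files=5):
    """Apply group_bursts to the modification times of the files in log_dir."""
    mtimes = [p.stat().st_mtime for p in Path(log_dir).iterdir() if p.is_file()]
    return group_bursts(mtimes, max_gap_seconds, min_files)
```

Any burst returned by `log_file_bursts` is only a hint that the worker may have been respawned repeatedly. Files can also cluster for unrelated reasons, so the result should be read alongside the log contents.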

Scheduled Query Failure Messages

The watchdog enforces limits on the worker process to protect systems from CPU-expensive and memory-intensive queries. If the watchdog observes limit violations, a line similar to the following is logged:

Scheduled query may have failed: <<...>>

This line is created when a child worker starts and finds what osquery calls a "dirty bit" toggled for the currently executing query. A similar line may also appear if a child worker process is stopped abruptly and a query does not finish.

Lines that indicate the worker exceeded one of the watchdog limits include the following:

osqueryd worker (1234) system performance limits exceeded

osqueryd worker (5678) memory limits exceeded: 442494

The process identifier (PID) of the offending child worker is included in parentheses.

If the child worker finds itself in a recurring error state, or if the watchdog continues to stop the worker, additional lines like the following are created:

osqueryd worker respawning too quickly: 1 times

The watchdog implements an exponential backoff when respawning child workers, and the offending query is denylisted from running for 24 hours.
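When reviewing many log files, it can help to pick out the watchdog-related lines automatically. The sketch below matches only the message formats quoted in this article. The category names and the `classify_line` helper are hypothetical, and the real log lines may carry additional prefixes (such as timestamps), which is why the patterns use `search` rather than a full-line match.

```python
import re

# Patterns are based solely on the example lines shown in this article.
PATTERNS = {
    "query_failed": re.compile(r"Scheduled query may have failed: (?P<detail>.*)"),
    "performance_limit": re.compile(
        r"osqueryd worker \((?P<pid>\d+)\) system performance limits exceeded"
    ),
    "memory_limit": re.compile(
        r"osqueryd worker \((?P<pid>\d+)\) memory limits exceeded: (?P<value>\d+)"
    ),
    "respawn": re.compile(
        r"osqueryd worker respawning too quickly: (?P<count>\d+) times"
    ),
}


def classify_line(line):
    """Return (category, captured fields) for a watchdog line, or None if no match."""
    for category, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return category, match.groupdict()
    return None


print(classify_line("osqueryd worker (5678) memory limits exceeded: 442494"))
# ('memory_limit', {'pid': '5678', 'value': '442494'})
```

Counting the `respawn` and limit categories over time gives a rough picture of how often the worker is being restarted.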

The osquery watchdog is only used for the worker process. It is enabled by default and can be disabled with a control flag.

See the official osquery documentation on query failures with the watchdog for more information on osquery errors and debugging options.

Worker Process Control Flags

Many of the parameters of the watchdog are controlled by default settings that can be adjusted with optional command-line interface (CLI) flags. For experienced users who need more advanced control over the osquery watchdog, such as changing the CPU and memory limits, or toggling the watchdog's monitoring functions, refer to the osquery CLI flags documentation.

LevelBlue only supports the agent's default watchdog settings. Adjusted settings should be tested before applying any changes to your environment.