Design

This page describes the problem and its solution in general terms: what preceded Procpath and why it didn’t solve the problem. It also covers some notable follow-up design details.

Problem statement

On servers and desktops, processes have long been treelike. For instance, this is a process tree of the Chromium browser with a few open tabs:

chromium-browser ...
├─ chromium-browser --type=utility ...
├─ chromium-browser --type=gpu-process ...
│  └─ chromium-browser --type=broker
└─ chromium-browser --type=zygote
   └─ chromium-browser --type=zygote
      ├─ chromium-browser --type=renderer ...
      ├─ chromium-browser --type=renderer ...
      ├─ chromium-browser --type=renderer ...
      ├─ chromium-browser --type=renderer ...
      └─ chromium-browser --type=utility ...

In a server environment the equivalent could be a dozen task queue worker process trees, the processes of a database connection pool, several web server process trees, or anything that runs in a bunch of Docker containers.

This environment raises operational questions, both point-in-time and temporal. Given several trees like the one above, how do I know a (sub)tree’s current resource profile, such as total main memory consumption, CPU time and so on? How do I track these profiles over time when, for instance, I suspect a memory leak? How can I point other process analysis and introspection tools at these trees?

Existing approaches for outputting a tree’s PIDs include applying Bash-fu to pstree output [1], or nested pgrep for shallower cases. procps (providing top and ps) is inadequate for any of the above, from embracing the process hierarchy to collecting temporal metrics. psmisc (providing pstree) is only good for displaying the hierarchy and doesn’t cover any programmatic interaction. htop is great for interactive inspection of process trees with its filter and search, but is likewise useless for programmatic interaction. glances has a JSON output feature, but it doesn’t have process-level granularity…

For process metrics collection alone (given that you know the PIDs), sysstat (providing pidstat) is likely the only simple solution, and it still requires some ad-hoc scripting [2].

Solution

The solution lies in applying the right-tool-for-the-job principle.

  1. Represent Procfs [3] processes as a forest structure (a disjoint union of trees).

  2. Expose this structure to queries in a compact tree query language.

  3. Flatten and store a query result in a ubiquitous tabular format allowing for easy sharing and transformation.

A major non-functional requirement here is ease of installation, preferably in the form of a pure-Python package. That’s because an ad-hoc investigation may not allow installing a compiler toolchain on the target machine, which rules out psutil [4] and discourages XML as the tree representation format (as it would require lxml for XPath).

Representation is relatively simple: read all /proc/N/stat files, build the forest and serialise it as JSON. The ubiquitous tabular format is even simpler – SQLite!
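The forest construction can be sketched roughly as follows (a simplified illustration, not Procpath’s actual implementation; only pid, comm and ppid are parsed here):

```python
import json
import os

def read_stat(pid):
    # Parse the few fields of /proc/{pid}/stat needed to build the tree.
    # comm is parenthesised and may contain spaces, so split around it.
    with open(f'/proc/{pid}/stat') as f:
        data = f.read()
    comm_start = data.index('(')
    comm_end = data.rindex(')')
    fields = data[comm_end + 2:].split()  # state, ppid, pgrp, ...
    return {
        'pid': pid,
        'comm': data[comm_start + 1:comm_end],
        'ppid': int(fields[1]),
    }

def build_forest():
    # Read every numeric /proc entry, then link children to parents
    nodes = {}
    for entry in os.listdir('/proc'):
        if entry.isdigit():
            try:
                nodes[int(entry)] = {'stat': read_stat(int(entry))}
            except FileNotFoundError:
                pass  # the process exited while we were scanning
    roots = []
    for node in nodes.values():
        parent = nodes.get(node['stat']['ppid'])
        if parent is not None:
            parent.setdefault('children', []).append(node)
        else:
            roots.append(node)
    return roots

print(json.dumps(build_forest(), indent=2)[:200])
```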

The step in between is much less obvious. Discarding special-purpose graph query languages and focusing on ones targeting JSON, the list goes as follows. Unfortunately, considering the Python implementations, it is not about choosing the best requirement match, but about choosing the lesser evil.

  1. JSONPath [5] and its Python ports. An informal, regex-based (with obscure error messages and edge cases), what-if-XPath-worked-on-JSON prototype. The most popular non-regex Python implementations are a sequence of forks, none of which supports recursive descent. One grammar-based package would work [6], but its filter expressions are just Python eval.

  2. JSON Pointer [7]. No recursive descent supported.

  3. JMESPath (an AWS boto dependency). No recursive descent supported [8].

  4. jq and its Python bindings [9]. jq is a programming language in the disguise of a JSON transformation CLI tool. Even though there’s lengthy documentation, on occasional use jq feels very counter-intuitive and requires a lot of googling and trial and error.

After pondering and playing with these, item 1 with JSONPyth [6] was the choice. The Python syntax of filter expressions can be “jsonified” by the AttrDict idiom, and the security concern of eval is acceptable for the CLI use cases (and in some cases being able to write an arbitrary Python expression in a filter can actually be useful).
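The AttrDict idiom can be illustrated with a minimal sketch (not JSONPyth’s actual implementation): wrapping each node lets a filter expression like @.stat.ppid == 1 evaluate as plain Python attribute access, with the current node bound to a name standing in for “@”.

```python
class AttrDict(dict):
    """Dictionary whose keys are also readable as attributes."""

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name)
        # Wrap nested dictionaries so chained access keeps working
        return AttrDict(value) if isinstance(value, dict) else value

node = AttrDict({'stat': {'pid': 42, 'ppid': 1}, 'cmdline': 'foo'})
# A JSONPath filter like [?(@.stat.ppid == 1)] boils down to eval
# of the expression with "@" replaced by the bound node name
assert eval('node.stat.ppid == 1 and node.stat.pid > 1', {'node': node})
```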

Data model

procpath query outputs the root process nodes, with all their descendants, to stdout.

[
  {
    "stat": {"pid": 1, "ppid": 0, ...}
    "cmdline": "a root node",
    "other_stat_file": ...,
    "children": [
      {
        "cmdline": "cmdline of some process",
        "stat": {"pid": 1, "ppid": 323, ...},
        "other_stat_file": ...
      },
      {
        "cmdline": "cmdline of another process with children",
        "stat": {"pid": 1, "ppid": 324, ...},
        "other_stat_file": ...,
        "children": [...]
      },
      ...
    ]
  },
  {
    "stat": {"pid": 2, "ppid": 0, ...},
    "cmdline": "another root node",
    "other_stat_file": ...,
    "children": [...]
  },
  ...
]
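With this structure, the point-in-time questions from the problem statement reduce to a tree walk. A hypothetical sketch summing a per-process field, such as rss, over a (sub)tree:

```python
def tree_sum(node, key):
    # Recursively sum a "stat" field over the node and all its descendants
    total = node['stat'].get(key, 0)
    for child in node.get('children', []):
        total += tree_sum(child, key)
    return total

# A made-up subtree in the data model's shape ("rss" values are illustrative)
subtree = {
    'stat': {'pid': 1, 'ppid': 0, 'rss': 100},
    'children': [
        {'stat': {'pid': 2, 'ppid': 1, 'rss': 40}},
        {'stat': {'pid': 3, 'ppid': 1, 'rss': 60},
         'children': [{'stat': {'pid': 4, 'ppid': 3, 'rss': 25}}]},
    ],
}
print(tree_sum(subtree, 'rss'))  # → 225
```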

When a JSONPath query is provided to the command, the output contains only the nodes (or their parts, depending on the query) matching the query, i.e. the top-level elements of the list are the matching nodes.

When recording into a SQLite database, the schema is inferred from the Procfs files in use. The node list is flattened and recorded into the record table, with DDL like the following.

CREATE TABLE record (
    record_id        INTEGER PRIMARY KEY NOT NULL,
    ts               REAL    NOT NULL,
    cmdline          TEXT,
    stat_pid         INTEGER,
    stat_comm        TEXT,
    ...
)
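The flattening itself is straightforward: each node becomes one row, with each nested Procfs file dictionary expanded into prefixed columns. A sketch under the same JSON structure as above (Procpath’s real implementation also handles dynamic schema inference, which is hard-coded here):

```python
import sqlite3
import time

def flatten(node_list):
    # Depth-first traversal producing one flat dict per process node
    for node in node_list:
        row = {'ts': time.time()}
        for key, value in node.items():
            if key == 'children':
                continue
            elif isinstance(value, dict):
                # e.g. "stat": {"pid": ...} becomes a "stat_pid" column
                row.update({f'{key}_{k}': v for k, v in value.items()})
            else:
                row[key] = value
        yield row
        yield from flatten(node.get('children', []))

forest = [{
    'stat': {'pid': 1, 'ppid': 0},
    'cmdline': 'init',
    'children': [{'stat': {'pid': 2, 'ppid': 1}, 'cmdline': 'worker'}],
}]
conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE record (
        record_id INTEGER PRIMARY KEY NOT NULL,
        ts REAL NOT NULL, cmdline TEXT, stat_pid INTEGER, stat_ppid INTEGER
    )
''')
conn.executemany(
    'INSERT INTO record(ts, cmdline, stat_pid, stat_ppid) '
    'VALUES (:ts, :cmdline, :stat_pid, :stat_ppid)',
    flatten(forest),
)
print(conn.execute('SELECT cmdline, stat_pid FROM record').fetchall())
```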

Procpath doesn’t pre-process Procfs data. For instance, rss is expressed in pages, utime in clock ticks and so on. To properly interpret the data in the record table, there’s also a meta table containing the following key-value records.

platform_node       platform.node()
platform_platform   platform.platform()
page_size           resource.getpagesize(), typically 4096
clock_ticks         os.sysconf('SC_CLK_TCK'), typically 100
physical_pages      os.sysconf('SC_PHYS_PAGES')
cpu_count           os.cpu_count()
procfile_list       json.dumps of --procfile-list
procpath_version    procpath.__version__
procfs_path         --procfs, default “/proc”
procfs_target       --procfs-target, default “process”
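For instance, converting raw stat values into familiar units combines record columns with meta values like this (a hypothetical sketch; the raw numbers are illustrative, the meta values are the typical ones listed above):

```python
# Raw Procfs values as they would appear in the record table
stat_rss = 5000    # resident set size, in pages
stat_utime = 1500  # user-mode CPU time, in clock ticks
stat_stime = 300   # kernel-mode CPU time, in clock ticks

# Values as they would appear in the meta table
page_size = 4096   # resource.getpagesize()
clock_ticks = 100  # os.sysconf('SC_CLK_TCK')

rss_bytes = stat_rss * page_size
cpu_seconds = (stat_utime + stat_stime) / clock_ticks
print(rss_bytes, cpu_seconds)  # → 20480000 18.0
```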

Procfs target files

Until version 1.13 Procpath only supported collecting process Procfs metrics. The files and data structures for threads in Procfs are identical (differing only in the path prefix). The proc(5) manpage says:

  • /proc/pid subdirectories

    Each one of these subdirectories contains files and subdirectories exposing information about the process with the corresponding process ID.

    Underneath each of the /proc/pid directories, a task subdirectory contains subdirectories of the form task/tid, which contain corresponding information about each of the threads in the process, where tid is the kernel thread ID of the thread. […]

  • /proc/tid subdirectories

    Each one of these subdirectories contains files and subdirectories exposing information about the thread with the corresponding thread ID. The contents of these directories are the same as the corresponding /proc/pid/task/tid directories. […]

The last sentence above is wrong for some of the files. Top-level directories contain files with the total metrics of all threads in the process. Say, the CPU time fields in /proc/{id}/stat contain the sum of the corresponding CPU time fields of all threads in the process, no matter whether the id belongs to a process or a thread. Whereas nested directories, with files like /proc/{tgid}/task/{pid}/stat, contain thread-specific information in some fields (see Tasks). This shell command can be used to examine the difference:

procpath query --procfs-target thread -f stat,status \
    '$..children[?(@.status.pid != @.status.tgid)]' \
    'SELECT status_tgid, status_pid FROM record' \
  | pyfil -j 'iter(f"ccdiff /proc/{p.status_pid}/stat \
    /proc/{p.status_tgid}/task/{p.status_pid}/stat" for p in j)' \
  | xargs -n3 -i -- sh -xc {}

However, the volume of Procfs data to read, process and store might increase 10-fold, which demanded some optimisations in version 1.13, while keeping the same data processing pipeline and flat record table with the original data.

$ ls /proc/*/task/*/stat | wc -l
3020
$ ls /proc/*/task/*/stat \
  | sudo /usr/bin/time --verbose xargs -n1 -- cat 2>&1 >/dev/null \
  | grep time
Command being timed: "xargs -n1 -- cat"
User time (seconds): 2.22
System time (seconds): 0.78
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.97

$ ls /proc/*/task/*/smaps_rollup | wc -l
3018
$ ls /proc/*/task/*/smaps_rollup \
  | sudo /usr/bin/time --verbose xargs -n1 -- cat 2>&1 >/dev/null \
  | grep time
Command being timed: "xargs -n1 -- cat"
User time (seconds): 2.01
System time (seconds): 18.04
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.18

The above highlights that aggregated metrics files are computationally intensive on the kernel’s side even when read per thread, but may give no additional information: smaps_rollup is the same across all threads of a process, and should rather be collected once per process.