Design¶
This page describes the problem and its solution in general terms: what preceded Procpath and why it didn’t solve the problem. It also covers some notable follow-up design details.
Problem statement¶
On servers and desktops, processes have long been treelike. For instance, this is a process tree of the Chromium browser with a few open tabs:
chromium-browser ...
├─ chromium-browser --type=utility ...
├─ chromium-browser --type=gpu-process ...
│  └─ chromium-browser --type=broker
└─ chromium-browser --type=zygote
   └─ chromium-browser --type=zygote
      ├─ chromium-browser --type=renderer ...
      ├─ chromium-browser --type=renderer ...
      ├─ chromium-browser --type=renderer ...
      ├─ chromium-browser --type=renderer ...
      └─ chromium-browser --type=utility ...
In a server environment the equivalent could be a dozen task-queue worker process trees, the processes of a database connection pool, several web-server process trees, or anything-goes in a bunch of Docker containers.
This environment raises some operational questions, both point-in-time and temporal. When I have several trees like the one above, how do I know a (sub)tree’s current resource profile, such as total main memory consumption, CPU time and so on? How do I track these profiles over time when, for instance, I suspect a memory leak? How can I point other process analysis and introspection tools to these trees?
Existing approaches for outputting a tree’s PIDs include applying bash-fu to pstree output [1] or nested pgrep for shallower cases. procps (providing top and ps) is inadequate for any of the above, from embracing the process hierarchy to collecting temporal metrics. psmisc (providing pstree) is only good for displaying the hierarchy and doesn’t cover any programmatic interaction. htop is great for interactive inspection of process trees with its filter and search, but it is likewise useless for programmatic interaction. glances has a JSON output feature, but it doesn’t have process-level granularity…
For process metrics collection alone (given you know the PIDs), sysstat (providing pidstat) is likely the only simple solution, and it still requires some ad-hoc scripting [2].
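To illustrate how far nested pgrep goes, a two-level tree can be listed like this (a sketch; the process name is a placeholder, and every extra tree level needs another round of pgrep --parent):
# print the PID of the oldest matching process, then its direct children
parent="$(pgrep --oldest chromium-browser)"
echo "${parent}"
pgrep --parent "${parent}"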
Solution¶
The solution lies in applying the right-tool-for-the-job principle:
1. Represent Procfs [3] processes as a forest structure (a disjoint union of trees).
2. Expose this structure to queries in a compact tree query language.
3. Flatten and store a query result in a ubiquitous tabular format allowing for easy sharing and transformation.
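All three steps can be seen in a single invocation, for instance totalling the resident memory of a subtree (a sketch; 42 is a placeholder PID):
procpath query -f stat '$..children[?(@.stat.pid == 42)]' 'SELECT SUM(stat_rss) AS total_rss_pages FROM record'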
A major non-functional requirement here is ease of installation, preferably in the form of a pure-Python package. That’s because an ad-hoc investigation may not allow installing a compiler toolchain on the target machine, which rules out psutil [4] and discourages XML as the tree representation format (as it would require lxml for XPath).
Representation is relatively simple: read all /proc/N/stat, build the forest and serialise it as JSON. The ubiquitous tabular form is even simpler – SQLite!
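The first step can be observed directly: the following prints the serialised forest in the shape described in the Data model section below.
procpath query -f stat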
The step in between is much less obvious. Discarding special graph query languages and focusing on ones targeting JSON, the list goes as follows. Unfortunately, considering the Python implementations, it is not about choosing the best requirement match, but about choosing the lesser evil.
1. JSONPath [5] and its Python port. Informal, regex-based (hence obscure error messages and edge cases), a what-if-XPath-worked-on-JSON prototype. The most popular non-regex Python implementations are a sequence of forks, none of which supports recursive descent. One grammar-based package would work [6], but its filter expressions are just Python eval.
2. JSON Pointer [7]. No recursive descent supported.
3. JMESPath (an AWS boto dependency). No recursive descent supported [8].
4. jq and its Python bindings [9]. jq is a programming language in the disguise of a JSON transformation CLI tool. Even though there’s lengthy documentation, on occasional use jq feels very counter-intuitive and requires a lot of googling and trial-and-error.
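For reference, recursive descent is the $.. construct that lets a query address nodes at any depth of the forest, which is essential when the tree shape isn’t known in advance. An illustrative query matching every process whose parent is PID 1, however deep it sits:
procpath query -f stat '$..children[?(@.stat.ppid == 1)]'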
After pondering and playing with these, item 1 with JSONPyth [6] was the choice. The filter Python expression syntax can be “jsonified” by the AttrDict idiom, and the security concern of eval is acceptable for the CLI use cases (and in some cases being able to write an arbitrary Python expression in a filter can actually be useful).
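Because a filter is evaluated as a Python expression over AttrDict-wrapped nodes, ordinary Python operators work inside it. A sketch (the 4096-byte page size and the 512 MiB threshold are illustrative assumptions):
procpath query -f stat '$..children[?(@.stat.rss * 4096 > 512 * 2**20)]'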
Data model¶
procpath query outputs the root process nodes with all their descendants to stdout.
[
  {
    "cmdline": "a root node",
    "stat": {"pid": 1, "ppid": 0, ...},
    "other_stat_file": ...,
    "children": [
      {
        "cmdline": "cmdline of some process",
        "stat": {"pid": 323, "ppid": 1, ...},
        "other_stat_file": ...
      },
      {
        "cmdline": "cmdline of another process with children",
        "stat": {"pid": 324, "ppid": 1, ...},
        "other_stat_file": ...,
        "children": [...]
      },
      ...
    ]
  },
  {
    "cmdline": "another root node",
    "stat": {"pid": 2, "ppid": 0, ...},
    "other_stat_file": ...,
    "children": [...]
  },
  ...
]
When a JSONPath query is provided to the command, the output contains only the nodes matching the query (or their parts, depending on the query); i.e. the top-level elements of the list are the matching nodes.
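For example (642 is a placeholder PID), this outputs a JSON list whose top-level elements are the matching nodes, each with its own children subtree attached:
procpath query -f stat '$..children[?(@.stat.ppid == 642)]'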
When recorded into an SQLite database, the schema is inferred from the Procfs files used. The node list is flattened and recorded into the record table, whose DDL looks like the following.
CREATE TABLE record (
    record_id INTEGER PRIMARY KEY NOT NULL,
    ts REAL NOT NULL,
    cmdline TEXT,
    stat_pid INTEGER,
    stat_comm TEXT,
    ...
)
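A recording session followed by an ad-hoc query might look like this (a sketch; -i, -r and -d are the record command’s interval, number of recordings and database file, and the values here are arbitrary):
procpath record -f stat -i 1 -r 60 -d out.sqlite
sqlite3 out.sqlite 'SELECT stat_pid, MAX(stat_rss) FROM record GROUP BY stat_pid'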
Procpath doesn’t pre-process Procfs data. For instance, rss is expressed in pages, utime in clock ticks and so on. To properly interpret the data in the record table, there’s also a meta table containing key-value records describing the recording environment, such as the system’s memory page size and clock ticks per second.
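That allows unit conversion at query time. A hypothetical example, assuming meta exposes key and value columns and a page_size key (these names are assumptions for illustration):
sqlite3 out.sqlite "SELECT ts, stat_pid, stat_rss * (SELECT value FROM meta WHERE key = 'page_size') AS rss_bytes FROM record"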
Procfs target files¶
Until version 1.13 Procpath only supported collecting process Procfs metrics. The file and data structures for threads in Procfs are identical (save for the path prefix). The proc(5) manpage says:
/proc/pid subdirectories
Each one of these subdirectories contains files and subdirectories exposing information about the process with the corresponding process ID.
Underneath each of the /proc/pid directories, a task subdirectory contains subdirectories of the form task/tid, which contain corresponding information about each of the threads in the process, where tid is the kernel thread ID of the thread. […]
/proc/tid subdirectories
Each one of these subdirectories contains files and subdirectories exposing information about the thread with the corresponding thread ID. The contents of these directories are the same as the corresponding /proc/pid/task/tid directories. […]
The last sentence above is wrong for some of the files. Top-level directories contain files with the total metrics of all threads in the process. Say, the CPU time fields in /proc/{id}/stat contain the sum of the corresponding CPU time fields of all threads in the process, no matter whether the id belongs to a process or a thread. Whereas the nested directories, with files like /proc/{tgid}/task/{pid}/stat, contain thread-specific information in some fields (see Tasks). This shell command can be used to examine the difference:
procpath query --procfs-target thread -f stat,status \
'$..children[?(@.status.pid != @.status.tgid)]' \
'SELECT status_tgid, status_pid FROM record' \
| pyfil -j 'iter(f"ccdiff /proc/{p.status_pid}/stat \
/proc/{p.status_tgid}/task/{p.status_pid}/stat" for p in j)' \
| xargs -n3 -i -- sh -xc {}
However, the volume of Procfs data to read, process and store may increase 10-fold, which demanded some optimisations in version 1.13, while keeping the same data processing pipeline and the flat record table with the original data.
$ ls /proc/*/task/*/stat | wc -l
3020
$ ls /proc/*/task/*/stat \
| sudo /usr/bin/time --verbose xargs -n1 -- cat 2>&1 >/dev/null \
| grep time
Command being timed: "xargs -n1 -- cat"
User time (seconds): 2.22
System time (seconds): 0.78
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.97
$ ls /proc/*/task/*/smaps_rollup | wc -l
3018
$ ls /proc/*/task/*/smaps_rollup \
| sudo /usr/bin/time --verbose xargs -n1 -- cat 2>&1 >/dev/null \
| grep time
Command being timed: "xargs -n1 -- cat"
User time (seconds): 2.01
System time (seconds): 18.04
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.18
The above highlights that aggregated metrics files are computationally intensive even on the kernel’s side per thread, but may give no additional information. I.e. smaps_rollup is the same across all threads, and should rather be collected for the process.
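In practice that suggests recording cheap files per thread and aggregated files per process, e.g. (a sketch, assuming record accepts the same --procfs-target and -f flags as query, with arbitrary interval and count values):
procpath record --procfs-target thread -f stat -i 1 -r 60 -d threads.sqlite
procpath record -f smaps_rollup -i 1 -r 60 -d processes.sqlite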