/proc/[pid]
and the missing threadsSuppose you run the following program:
use std::time;
use std::thread;
fn main() {
println!("My TID: {}", nix::unistd::gettid());
let child = thread::spawn(move || {
println!("My thread's TID: {}", nix::unistd::gettid());
loop {
thread::sleep(time::Duration::from_secs(1));
}
});
child.join().unwrap();
}
and get the following output:
My TID: 840680
My thread's TID: 840695
Inspecting thread state through /proc
works as
expected:
$ cat /proc/840680/comm
thread_id
$ cat /proc/840695/comm
thread_id
However, if you happened to browse through /proc/
(via
ls
or other), you’ll notice a strange inconsistency:
$ ls -l /proc | grep 840680 &> /dev/null; echo $?
0
$ ls -l /proc | grep 840695 &> /dev/null; echo $?
1
In other words, there’s no directory entry for the thread.
Why is this the case? We have to look at the kernel code to find out.
First let’s look at where all the entries in /proc
are
instantiated. Remember that /proc
, or procfs
,
is a virtual file system so there’s not actually anything on disk
backing the fileystem. Everything is generated when we request it.
In fs/proc/root.c
:
/*
* This is the root "inode" in the /proc tree..
*/
struct proc_dir_entry proc_root = {
.low_ino = PROC_ROOT_INO,
.namelen = 5,
.mode = S_IFDIR | S_IRUGO | S_IXUGO,
.nlink = 2,
.refcnt = REFCOUNT_INIT(1),
.proc_iops = &proc_root_inode_operations,
.proc_dir_ops = &proc_root_operations,
.parent = &proc_root,
.subdir = RB_ROOT,
.name = "/proc",
};
&proc_root_operations
seems like a likely suspect
for directory
operations:
/*
* The root /proc directory is special, as it has the
* <pid> directories. Thus we don't use the generic
* directory handling functions for that..
*/
static const struct file_operations proc_root_operations = {
.read = generic_read_dir,
.iterate_shared = proc_root_readdir,
.llseek = generic_file_llseek,
};
So far the comments confirm our understanding. However, it’s somewhat
unclear which callback is called when we run ls
versus
directly cat
a file. Let’s use bpftrace to investigate.
In one terminal:
$ sudo bpftrace -e 'kprobe:generic_read_dir { printf("%s\n", kstack); }'
Attaching 1 probe...
In another terminal:
$ ls -l /proc
Nothing in the first terminal. Let’s try the next function.
$ sudo bpftrace -e 'kprobe:proc_root_readdir { printf("%s\n", kstack); }'
Attaching 1 probe...
Run ls
again and we get the following output:
proc_root_readdir+1
iterate_dir+323
ksys_getdents64+156
__x64_sys_getdents64+22
do_syscall_64+78
entry_SYSCALL_64_after_hwframe+68
proc_root_readdir+1
iterate_dir+323
ksys_getdents64+156
__x64_sys_getdents64+22
do_syscall_64+78
entry_SYSCALL_64_after_hwframe+68
Nice, so we know running ls
generates a
proc_root_readdir
callback. Let’s look at the code:
static int proc_root_readdir(struct file *file, struct dir_context *ctx)
{
if (ctx->pos < FIRST_PROCESS_ENTRY) {
int error = proc_readdir(file, ctx);
if (unlikely(error <= 0))
return error;
ctx->pos = FIRST_PROCESS_ENTRY;
}
return proc_pid_readdir(file, ctx);
}
FIRST_PROCESS_ENTRY
is defined as:
in fs/proc/internal.h
:
and we see proc_readdir
incrementing pos
in
proc_readdir_de
(a later callee). So this code probably
handles all the non-process entries in /proc
and we can
ignore it for now and focus on proc_pid_readdir
.
In fs/proc/base.c
:
/* for the /proc/ directory itself, after non-process stuff has been done */
int proc_pid_readdir(struct file *file, struct dir_context *ctx)
{
struct tgid_iter iter;
struct pid_namespace *ns = proc_pid_ns(file_inode(file));
loff_t pos = ctx->pos;
This code just sets up some variables by pulling context information out. Not really important.
if (pos >= PID_MAX_LIMIT + TGID_OFFSET)
return 0;
if (pos == TGID_OFFSET - 2) {
struct inode *inode = d_inode(ns->proc_self);
if (!dir_emit(ctx, "self", 4, inode->i_ino, DT_LNK))
return 0;
ctx->pos = pos = pos + 1;
}
if (pos == TGID_OFFSET - 1) {
struct inode *inode = d_inode(ns->proc_thread_self);
if (!dir_emit(ctx, "thread-self", 11, inode->i_ino, DT_LNK))
return 0;
ctx->pos = pos = pos + 1;
}
This code does 3 things:
/proc/self
entry/proc/thread-self
entryInteresting to note but not important for this article.
iter.tgid = pos - TGID_OFFSET;
iter.task = NULL;
for (iter = next_tgid(ns, iter);
iter.task;
iter.tgid += 1, iter = next_tgid(ns, iter)) {
Now this is the interesting bit. Now we’re iterating through all the
thread group IDs (tgid) via next_tgid
.
TGIDs are better understood from userspace as the PIDs we see, where
each process can have multiple threads (each with their own TID).
There’s more code that follows but it’s not very interesting for us.
So we know why ls /proc
does not show threads now. But
how does directly accessing /proc/[TID]/comm
work?
We follow the same process with bpftrace and try some more functions.
Finally, we discover that the following triggers output when we run
cat /proc/864518/comm
:
$ sudo bpftrace -e 'kprobe:proc_root_lookup / comm == "cat" / { printf("%s\n", kstack); }'
Attaching 1 probe...
proc_root_lookup+1
__lookup_slow+140
walk_component+513
link_path_walk+759
path_openat+157
do_filp_open+171
do_sys_openat2+534
do_sys_open+68
do_syscall_64+78
entry_SYSCALL_64_after_hwframe+68
Note that we used a filter in our bpftrace script to limit output to our command.
Astute readers might have noted that our cat
command
used a different TID. That’s because we only trigger output once per
lifetime (or some other period of time) of the TID. That’s because the
kernel is probably caching directory entries in memory so it doesn’t
have to do a full lookup every time.
Now look at proc_root_lookup
:
In fs/proc/root.c
:
static struct dentry *proc_root_lookup(struct inode * dir, struct dentry * dentry, unsigned int flags)
{
if (!proc_pid_lookup(dentry, flags))
return NULL;
return proc_lookup(dir, dentry, flags);
}
In fs/proc/base.c
:
struct dentry *proc_pid_lookup(struct dentry *dentry, unsigned int flags)
{
struct task_struct *task;
unsigned tgid;
struct pid_namespace *ns;
struct dentry *result = ERR_PTR(-ENOENT);
tgid = name_to_int(&dentry->d_name);
if (tgid == ~0U)
goto out;
Some setup and error checks. Not too interesting.
This is more interesting: we do a lookup on the requested tgid. Note
that tgid
here is somewhat improperly named. We’re doing a
lookup based on a task
which does not have to be a thread
group leader.
if (task)
get_task_struct(task);
rcu_read_unlock();
if (!task)
goto out;
result = proc_pid_instantiate(dentry, task, NULL);
put_task_struct(task);
out:
return result;
}
The remainder of the function instantiates an inode for
/proc/[TID]
and most likely populates it as well. Then in
proc_root_lookup
, proc_lookup
probably walks
the FS structure and finds the new inode.
Mystery solved.