An hour of working with CPython’s internals

Eric Fulmer
4 min readSep 16, 2018

--

A little while ago, a friend of mine asked me if there was a dict-like structure in Python that, when failing to look up a key, would insert the key with itself as the value, so you’d get d[k] = k.

defaultdict, from the collections module, is almost, but not quite, this. defaultdict is exactly like the built-in dict, except if the key doesn’t exist in the dictionary, it calls a nullary (math major for “0-argument”) function, called the default factory, to set the value for that key. For example, you could use the list, dict, set, or int constructors to get an empty collection or 0 if the key isn’t present.

The Python Data Model defines a special method called __missing__ that gets called when key lookup in a dict fails. We can very easily override defaultdict.__missing__ so it calls the default factory with the key… but that’s not as interesting as modifying the CPython source to do the same thing.

I knew from my conversations with my acquaintance James Powell that the collections module is written in C, not Python. Looking at the Python developer’s guide, specifically the intro, you see that most of the C code in the standard library Modules directory. The collections module’s source code is in Modules/_collections.c. The defaultdict implementation starts on line 1926 in Python 3.7; search the file for “defaultdict type” to find it in other versions.

defaultdict is implemented in about 300 lines of C code (not counting the implementation of dict, which defaultdict obviously piggybacks off of). Most of this code isn’t relevant to what we want to do, and much of it is docstrings or boilerplate used to creating the Python interface for defaultdict.

Here’s the definition of the type.

/* defaultdict type *********************************************************/typedef struct { PyDictObject dict;  PyObject *default_factory;} defdictobject;

Since this is C, you can’t inherit from another class, but this struct explains what defaultdict is very succinctly: you have a dictionary plus a pointer to the default factory. Since defaultdict is implemented in C, we can find the implementation of __missing__ a bit below:

static PyObject *defdict_missing(defdictobject *dd, PyObject *key){  PyObject *factory = dd->default_factory;  PyObject *value;  if (factory == NULL || factory == Py_None) {    /* XXX Call dict.__missing__(key) */    PyObject *tup;    tup = PyTuple_Pack(1, key);    if (!tup) return NULL;    PyErr_SetObject(PyExc_KeyError, tup);    Py_DECREF(tup);    return NULL;  }  value = PyEval_CallObject(factory, NULL);  if (value == NULL)    return value;  if (PyObject_SetItem((PyObject *)dd, key, value) < 0) {    Py_DECREF(value);    return NULL;  }  return value;}

The key item here is PyEval_CallObject, which is defined in Include/ceval.h. It’s actually a macro for PyEval_CallObjectWithKeywords, but with the keyword arguments being set to NULL. PyEval_CallObjectWithKeywords is defined in Objects/call.c:

/* External interface to call any callable object.The args must be a tuple or NULL. The kwargs must be a dict or NULL. */PyObject *PyEval_CallObjectWithKeywords(PyObject *callable, PyObject *args, PyObject *kwargs){  #ifdef Py_DEBUG  /* PyEval_CallObjectWithKeywords() must not be called with an exceptionset. It raises a new exception if parameters are invalid or ifPyTuple_New() fails, and so the original exception is lost. */  assert(!PyErr_Occurred());  #endif  if (args != NULL && !PyTuple_Check(args)) {    PyErr_SetString(PyExc_TypeError, “argument list must be a tuple”);    return NULL;  }  if (kwargs != NULL && !PyDict_Check(kwargs)) {    PyErr_SetString(PyExc_TypeError, “keyword list must be a dictionary”);    return NULL;  }  if (args == NULL) {    return _PyObject_FastCallDict(callable, NULL, 0, kwargs);  }  else {    return PyObject_Call(callable, args, kwargs);  }}

Very interesting. The first argument is a reference to the callable, then come *args and **kwargs. Something to look into later: how CPython handles *args separately from the standard argument tuple.

I think we’ll be okay with constructing a 1-tuple containing the key, then passing that to PyEval_CallObject. Like so:

static PyObject *defdict_missing(defdictobject *dd, PyObject *key){  PyObject *factory = dd->default_factory;  PyObject *factory_args = PyTuple_Pack(key);  PyObject *value;  if (factory == NULL || factory == Py_None) {    /* XXX Call dict.__missing__(key) */    PyObject *tup;    tup = PyTuple_Pack(1, key);    if (!tup) return NULL;    PyErr_SetObject(PyExc_KeyError, tup);    Py_DECREF(tup);    return NULL;  }  value = PyEval_CallObject(factory, factory)args);  if (value == NULL)  return value;  if (PyObject_SetItem((PyObject *)dd, key, value) < 0) {    Py_DECREF(value);    return NULL;  }  return value;}

Let’s see…

0:00:17 load avg: 2.12 [152/415/3] test_defaultdict failedtest test_defaultdict failed — multiple errors occurred; run in verbose mode for details

Guess not! test_defaultdict is defined in Lib/test/test_typing.py.

Next time: We check out the test suite and see how the new defaultdict works. We also discuss how I found my way around the CPython codebase.

Other thoughts:

1. Could we check that the default_factory doesn’t take any arguments when we construct a defaultdict? Obviously, we can. The more interesting question, of course, is how. I would start by checking the default factory’s __code__.co_argcount attribute.

2. Why is the factory a PyObject *?

--

--