
Python's Preprocessor

Every now and then you hear outrageous claims such as “Python has no preprocessor”.

This is simply not true. In fact, Python has the best preprocessor of all languages - it quite literally allows us to do whatever we want, and a lot more. It’s just a little tricky to (ab)use.

Python source code encodings

Thanks to PEP-0263 it is possible to define a source code encoding by placing a magic comment in one of the first two lines of the file.

All of the following lines would instruct the Python interpreter to decode the rest of the file using the utf8 codec:

# coding=utf8
# -*- coding: utf8 -*-
# vim: set fileencoding=utf8 :

To be precise, the line must match the regular expression ^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+). Naturally we can use our own encodings, but their names must match [-_.a-zA-Z0-9]+. As you might have guessed by now - our own codec will do a whole lot more than just decode the source file.
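
To see the pattern in action, here is a quick check using Python's re module (the second codec name is just an example of a custom one):

import re

CODING_PATTERN = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

print(CODING_PATTERN.match("# -*- coding: utf8 -*-").group(1))   # utf8
print(CODING_PATTERN.match("# coding: magic.incdec").group(1))   # magic.incdec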

Path configuration files (.pth)

Unless the Python interpreter was started with the -S option, it will automatically load the site module during initialization. This is done to append site-specific paths to the module search path.

One way to do so is by placing a path configuration file (with a .pth suffix) in the site-packages folder of your target Python installation. Every line in it (except blank lines and lines starting with #) will be added to the module search path.
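
For example, a hypothetical my_paths.pth dropped into site-packages (the path is made up) would add one directory to the module search path, provided it exists:

# my_paths.pth - blank lines and lines starting with # are skipped
/opt/my_project/vendored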

Interestingly the Python Docs also mention the following:

Lines starting with import (followed by space or tab) are executed.

This gives us a nice opportunity to execute arbitrary code every time the Python interpreter initializes. It can be used to load the custom codec - to do so, create a file packagename.pth in site-packages with the following content:

import packagename.register_codec

This will import the register_codec module from the packagename package. Importing this module must register the codec, which is done by passing a search function to codecs.register. For example:

import codecs
from typing import Optional

def search_function(encoding: str) -> Optional[codecs.CodecInfo]:
    if encoding == "codec_name":
        return codecs.CodecInfo(
            name=encoding,
            encode=codecs.utf_8_encode,
            decode=your_decoder,                          # defined in the next section
            incrementaldecoder=your_incremental_decoder,  # defined in the next section
        )

codecs.register(search_function)

Since importing modules only executes them once, this is sufficient to register our codec’s search function exactly once. This leaves one thing to do: the actual decoder.

Defining custom codecs

Essentially we need two things to make the Python interpreter happy:

  • a decode function decode(data: bytes) -> tuple[str, int]
  • an incremental decoder class

Let’s do the decode function first. codecs.utf_8_decode can be used for the actual decoding - this will return a tuple of the decoded content of the source file and how many bytes were consumed. The resulting string can be passed on to our actual preprocessor.

Uncaught exceptions will not be printed with a traceback to the terminal as you would expect. Instead the interpreter will simply report SyntaxError: encoding problem: your_codec, with no helpful extra information as to why there was a problem with your codec.

It is therefore advisable to catch exceptions coming from your preprocessor and explicitly print them before reraising.

import codecs
import traceback

def preprocessor(data: str) -> str:
    # do actual preprocessing here
    return data

def decode(data: bytes) -> tuple[str, int]:
    decoded, consumed = codecs.utf_8_decode(data, "strict", True)  # errors, final
    try:
        # run the preprocessor
        processed = preprocessor(decoded)
    except Exception:
        # print the traceback
        traceback.print_exc()
        raise
    return processed, consumed

To get things to work nicely we also need to provide an incremental decoder. Since we don’t want to preprocess the file incrementally, we can instead collect it into a buffer and preprocess the entire thing once the final decode call happens. For this purpose we can inherit from codecs.BufferedIncrementalDecoder (or codecs.IncrementalDecoder - since we override decode, which provides the primary machinery, anyway). This will look something like this:

import codecs

class Decoder(codecs.BufferedIncrementalDecoder):
    def _buffer_decode(self, input, errors, final):
        """not used"""

    def decode(self, data, final=False) -> str:
        self.buffer += data

        if self.buffer and final:
            buffer = self.buffer
            self.reset()
            # call our decode function, return only the result string
            return decode(buffer)[0]

        return ""

The search function from earlier can now be updated to use the decode function and the incremental decoder class.

def search_function(encoding) -> Optional[codecs.CodecInfo]:
    if encoding == "codec_name":
        return codecs.CodecInfo(
            name=encoding,
            encode=codecs.utf_8_encode,
            decode=decode,              # our decode function
            incrementaldecoder=Decoder  # our incremental decoder
        )

It does not matter if or how the source file’s content is used - you can also return completely arbitrary code. Note, however, that the first line will be dropped (since it is expected to contain the magic comment) and that the returned code must be valid Python.
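
As a toy illustration, a preprocessor like the following (a made-up sketch) is free to throw its input away entirely and hand back different code:

def preprocessor(data: str) -> str:
    # ignore the real source and return completely different (but valid) Python
    return 'print("this is not the code you wrote")\n'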

Extending Python

Fortunately extending Python is rather easy, since Python’s standard library contains tools to tokenize and parse Python. While regular expressions may be sufficient for simple language extensions, that approach tends to be rather error-prone.

If your language extension uses only valid Python tokens, it is possible to use the tokenize module to retrieve the file’s token stream, modify it as required and untokenize the result.
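
A minimal round trip might look like the following sketch, which renames a single variable by swapping out the matching NAME tokens (the toy transformation and the names in it are made up):

import io
import tokenize

source = 'total = 1 + 2  # comments survive the round trip\nprint(total)\n'

# string -> token stream (generate_tokens wants a readline callable)
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# modify the stream: rename the NAME token "total" to "result"
tokens = [
    tok._replace(string="result")
    if tok.type == tokenize.NAME and tok.string == "total"
    else tok
    for tok in tokens
]

# token stream -> string
print(tokenize.untokenize(tokens))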

If your language extension transforms syntactically valid Python, it is possible to use the ast module to parse the source file, modify the resulting abstract syntax tree and finally unparse it.
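
The equivalent sketch with the ast module uses a NodeTransformer; here it turns every addition into a subtraction, a deliberately silly transformation that just shows the parse/modify/unparse cycle:

import ast

source = "total = 1 + 2\nprint(total)\n"

class AddToSub(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)   # transform nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

tree = AddToSub().visit(ast.parse(source))
print(ast.unparse(tree))           # prints: total = 1 - 2, then print(total)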

Unary increment and decrement

Unlike many other languages, Python unfortunately lacks unary increment and decrement operators.

In case you’re not familiar with the concept, here’s a quick refresher:

  • Pre-increment and pre-decrement operators modify their operand by 1 and return the value after doing so
  • Post-increment and post-decrement operators modify their operand by 1 and return the value before doing so

In Python, “post-increment” x++ and “post-decrement” x-- are not syntactically valid.

“Pre-increment” ++x and “pre-decrement” --x however are syntactically valid, but result in the calls x.__pos__().__pos__() and x.__neg__().__neg__() respectively. Keep in mind that breaking these up with extra parentheses like +(+x) or -(-x) would still result in the same calls.
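
This is easy to verify with a class that logs its __pos__ calls (purely illustrative):

class Spy:
    def __pos__(self):
        print("__pos__ called")
        return self

x = Spy()
++x          # prints "__pos__ called" twice, x itself is untouched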

Essentially we want to replace every occurrence of these invalid unary increment and decrement expressions with a Python expression that has the same semantics.

One possible way to do this is to form a tuple of x before the mutation and x after the mutation. This works for both prefix and postfix notation - we simply pick out whichever value we need using the tuple’s subscript operator. Thanks to PEP-0572 Python has assignment expressions (also known as the walrus operator), which allow us to mutate x and return the resulting value at the same time.

Here’s the list of replacements:

Unary expression | Token sequence                    | Python equivalent
x++              | (NAME, 'x'), (OP, '+'), (OP, '+') | (x, x := x + 1)[0]
x--              | (NAME, 'x'), (OP, '-'), (OP, '-') | (x, x := x - 1)[0]
++x              | (OP, '+'), (OP, '+'), (NAME, 'x') | (x, x := x + 1)[1]
--x              | (OP, '-'), (OP, '-'), (NAME, 'x') | (x, x := x - 1)[1]
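
A quick sanity check of the post-increment and pre-decrement rows in a plain Python session:

x = 5
print((x, x := x + 1)[0], x)   # 5 6 - returns the old value, x is now 6
print((x, x := x - 1)[1], x)   # 5 5 - returns the new value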

Simply replacing these token sequences in the token stream is, strictly speaking, not sufficient, since it will fail for expressions such as x++ - -y; however, this can easily be disambiguated with extra parentheses: x++ - (-y).

incdec.py, the Python project that inspired this blog post, uses regular expressions to do the replacements. While it does try to prevent replacements inside string literals, it is still rather brittle. You can find a reimplementation that directly modifies the token stream at magic.incdec.

Example

An input file incdec.py

# coding: magic.incdec
i = 6

assert i-- == 6
assert i == 5
assert ++i == 6
assert --i == 5
assert i++ == 5
assert i == 6
assert (++i, 'i++') == (7, 'i++')
print("PASSED")

would be transformed to

i = 6

assert ((i, i := i - 1)[0]) == 6
assert i == 5
assert ((i, i := i + 1)[1]) == 6
assert ((i, i := i - 1)[1]) == 5
assert ((i, i := i + 1)[0]) == 5
assert i == 6
assert (((i, i := i + 1)[1]),'i++') == (7, 'i++')
print ("PASSED")

To verify that it actually works, try running python tests/incdec/incdec.py in the magic_codec repository after installing magic_codec. It should print

$ python tests/incdec/incdec.py
PASSED

Python with braces (Bython)

Another thing C/C++ programmers usually find rather off-putting about Python is its use of indentation for scoping purposes. Unfortunately the Python developers have strong opinions on using braces for scoping, which can be confirmed by importing braces from __future__:

>>> from __future__ import braces
  File "<stdin>", line 1
SyntaxError: not a chance

Let’s do it anyway.


As with the incdec example, we can directly modify the token stream. To do so, get the tokens from the source file using tokenize.generate_tokens. Unfortunately generate_tokens expects a callable that returns one line at a time; we can get one by wrapping our string in a StringIO object and using its bound readline method.


Since whitespace does not matter in the input, all tokens of the types INDENT and DEDENT can be dropped.

Tokens of type OP provide the primary machinery: if the token’s string is {, the indentation level needs to be increased and a : emitted; likewise, if the token’s string is }, the indentation level must be decreased.

Finally, to fix indentation, every token of type NL must be followed by a token of type INDENT whose content is an appropriate amount of whitespace for the current indentation level.

Since Python uses curly braces for dictionaries, this can be improved slightly by adjusting the indentation level only if the { token is followed by a newline and, respectively, the } token is preceded by one. Restricting dictionary literals written with curly braces to a single line might seem rather limiting, but remember that

dictionary = { \
    'a': 420,  \
    'b': 10    \
}

contains no newline tokens within the curly braces.
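
Putting these pieces together, here is a much-simplified sketch of such a token-stream rewrite. It deliberately ignores the dictionary caveat above (every { and } is treated as a block delimiter) and rebuilds the output by joining token strings with spaces, so the result is ugly but equivalent; the function name is made up:

import io
import tokenize

def braces_to_blocks(source: str) -> str:
    level = 0        # current indentation level
    line_level = 0   # indentation captured when the current output line started
    line = []        # token strings collected for the current output line
    out = []

    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue                      # input indentation carries no meaning
        if tok.type == tokenize.OP and tok.string == "{":
            line.append(":")              # open a block: emit ":" ...
            level += 1                    # ... and indent the following lines
            continue
        if tok.type == tokenize.OP and tok.string == "}":
            level -= 1                    # close a block: dedent
            continue
        if tok.type in (tokenize.NEWLINE, tokenize.NL):
            out.append(("    " * line_level + " ".join(line)) if line else "")
            line = []
            continue
        if not line:
            line_level = level            # remember this line's indentation
        line.append(tok.string)

    return "\n".join(out) + "\n"

print(braces_to_blocks('def greet(name) {\n    print("hello", name)\n}\n'))

A full implementation would also apply the single-line dictionary rule described above.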

Example

An input file test.by

# coding: magic.braces
def print_message(num_of_times) {
    for i in range(num_of_times) {  
   print("braces ftw")
  print({'x': 3})
 }
}

x = {        \
  'foo': 42, \
  'bar': 5   \
}

if __name__ == "__main__" {
print_message(2)
    print({k:v for k, v in x.items()})
}

would be transformed to

# coding: magic.braces
def print_message(num_of_times):
    for i in range(num_of_times):
        print("braces ftw")
        print({'x': 3})

x = {        \
  'foo': 42, \
  'bar': 5   \
}

if __name__ == "__main__":
    print_message(2)
    print({k:v for k, v in x.items()})

You can verify this by running python tests/braces/test.by in the magic_codec repository after installing magic_codec. It should print

$ python tests/braces/test.by
braces ftw
{'x': 3}
braces ftw
{'x': 3}
{'foo': 42, 'bar': 5}

Interpreting other languages

Instead of extending Python, why not teach the Python interpreter itself a few more tricks? After all, there are all kinds of cool languages it could interpret!

Some languages (e.g. shell scripts, CMake scripts, PHP or Ruby) use # for comments - notably, so does every language that supports shebangs. This can be abused to set the encoding directly.

C and C++

For C and C++ we have no such luck. Comments use /* comment */ or // comment syntax, neither of which is usable. It is however possible to satisfy the source encoding pattern by using preprocessor directives, which happen to start with a #.

The regular expression for magic lines ^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+) matches if a line contains:

  • any amount of spaces, tabs or form feeds
  • the # character
  • any amount of any characters
  • the word coding
  • either : or =
  • any amount of spaces or tabs
  • an identifier matching [-_.a-zA-Z0-9]+

One preprocessor directive in C++ that can be used for this is #define. What we want to do is define a macro and let its value match .*?coding[:=][ \t]*([-_.a-zA-Z0-9]+). For example

#define CODEC "coding:magic.cpp"

would match.

Great, we can now trigger the magic.cpp decoder with a valid C or C++ source file. To actually get the Python interpreter to interpret this C or C++ code for us, we can use the excellent package cppyy. In essence, cppyy uses cling under the hood to interpret our code and generates Python bindings so we can use it.

After our decoder is done with the input file, the output should look something like

import cppyy

# interpret the input source code
cppyy.cppdef("<input source file content>")

# find the main function
from cppyy.gbl import main

if __name__ == "__main__":
    # call C/C++ main
    main()

Now we can run python foo.cpp if foo.cpp begins with the magic line #define CODEC "coding:magic.cpp". One example implementation of this can be found at magic.cpp.
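
The preprocessor that produces this output can be little more than a string template. A minimal sketch (not the exact magic.cpp implementation; it also breaks if the C++ source itself contains a triple quote):

def preprocess(data: str) -> str:
    return (
        "import cppyy\n"
        "\n"
        "# interpret the input source code\n"
        f'cppyy.cppdef(r"""\n{data}\n""")\n'
        "\n"
        "# find the main function\n"
        "from cppyy.gbl import main\n"
        "\n"
        'if __name__ == "__main__":\n'
        "    main()\n"
    )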

Example

An input file test.cpp

#define CODEC "coding:magic.cpp"
#include <cstdio>

int main() {
    puts("Hello World");
}

would be transformed to

import cppyy

cppyy.cppdef(r"""
#define CODEC "coding:magic.cpp"
#include <cstdio>

int main() {
    puts("Hello World");
}
""")
from cppyy.gbl import main

if __name__ == "__main__":
    main()

You can try this by running python tests/cpp/test.cpp in the magic_codec repository after installing magic_codec and cppyy. It should print

$ python tests/cpp/test.cpp
Hello World

Validating data

One data interchange format that does allow comments and uses # to introduce them is TOML. This allows us to set an encoding and let the Python interpreter act as a validation tool instead. jsonschema, a Python implementation of JSON Schema, can be used to do the actual validation.

This one is rather straightforward; a preprocess function could look like this:

def preprocess(data: str) -> str:
    # the generated program re-reads the TOML from sys.argv[0], so data is unused here
    return """
import argparse
import json
import sys
import tomllib
from pathlib import Path
from jsonschema import ValidationError, validate

def main():
    parser = argparse.ArgumentParser(
                    prog='magic.toml',
                    description='Verify toml data against json schemas')
    parser.add_argument('-s', '--schema', type=Path, required=True)
    args = parser.parse_args()

    data = tomllib.loads(Path(sys.argv[0]).read_text(encoding="utf-8"))
    schema = json.loads(args.schema.read_text(encoding="utf-8"))
    try:
        validate(data, schema)
    except ValidationError as exc:
        print(exc)
    else:
        print("Successfully validated.")

if __name__ == "__main__":
    main()
"""

A slightly different example implementation can be found at magic.toml.

Example

With a schema schema.json

{
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "scores": {
            "type": "array",
            "items": {"type": "number"}
        },
        "address": {"$ref": "#/$defs/address"}
    },
    "required": ["name"],
    "$defs": {
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "postcode": {"type": "number"}
            },
            "required": ["street"]
        }
    }
}

and an input file data_valid.toml

# coding: magic.toml
name = "John Doe"
age = 42
scores = [40, 20, 80, 90]

[address]
street = "Grove St. 4"
postcode = 19201

the expected output is

$ python tests/toml/data_valid.toml -s tests/toml/schema.json
Successfully validated.


However, for an input file data_invalid.toml

# coding: magic.toml
name = "John Doe"
age = 42
scores = [40, "20", 80, 90]

[address]
street = "Grove St. 4"
postcode = 19201

the expected output will be

$ python tests/toml/data_invalid.toml -s tests/toml/schema.json
'20' is not of type 'number'

Failed validating 'type' in schema['properties']['scores']['items']:
    {'type': 'number'}

On instance['scores'][1]:
    '20'

Conclusion

Custom codecs in conjunction with path configuration files can drastically change the behavior of the Python interpreter. While most of the examples here are written purely for entertainment purposes, there are definitely valid uses for this technique. One notable example is pythonql, a query language extension for Python. Another is future-typing, which backports generic type hints and the | union syntax to Python 3.6+. Similar projects include future-fstrings and future-annotations.

If you want to play around with your own preprocessors but do not wish to mess with site-packages directly, introduce path configuration files, and write all the boilerplate yourself, you can instead use magic_codec.

To extend magic_codec with your own preprocessors, you can create another Python package whose name is prefixed with magic_. Setting the codec of any file to magic_foo would load the magic_foo package and check if it has a function preprocess.

The expected signature of preprocess is as follows:

def preprocess(data: str) -> str:
    raise NotImplementedError

You can find an example extension in example/.

This post is licensed under CC BY 4.0 by the author.