Removing Digraphs

Document number:
Dxxxx
Date:
2026-05-10
Audience:
SG22
EWG
Project:
ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
Reply-to:
Matthias Wippich <[email protected]>

This paper proposes removal of digraphs from the C++ language.

Revision history

0.1. R0 May 2026

Original version of the paper.

1. Introduction

Digraphs are a complicated solution to a very old problem, and in a modern environment they cause more problems than they solve. Digraphs also severely limit the design space of C++, although as we have seen with [P2996] we are already fine with special-casing our way out of this pickle.

This, however, introduces an interesting problem: if you need to use a source encoding that requires the use of digraphs, you cannot use all of C++26 directly.

Since we are most likely going to continue seeing similar problems, this paper proposes to remove digraphs from the language entirely.

2. Design Space

As mentioned before, digraphs severely limit the design space of C++. This isn't an entirely new insight: we've already run into issues because of digraphs and will most likely continue to run into new ones.

This leads to a fragmented language: some parts you can write if you need to use digraphs, some you can't. At the same time we're accumulating workarounds (like [CWG1104]), which lead to an excessively complex language.
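As an illustration of the kind of workaround [CWG1104] introduced, the following sketch uses a global-scope template argument written without a space after <. The names Widget, cwg1104_demo and check_cwg1104 are hypothetical and exist only for this example; the comments describe the C++11 lexing rule as quoted later in the wording section.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Widget { int id; };

// Returns the element count so the example has something observable.
inline std::size_t cwg1104_demo() {
    // The characters after `vector` are "<::W": "<::" followed by neither
    // ':' nor '>'. Since C++11 the lexer therefore produces '<' and '::'
    // here instead of the digraph '<:' followed by ':'.
    std::vector<::Widget> widgets{{1}, {2}};
    return widgets.size();
}

// Usage: the declaration above compiles and behaves like `std::vector< ::Widget>`.
inline void check_cwg1104() {
    assert(cwg1104_demo() == 2);
}
```

Before C++11 the same declaration needed a space (`std::vector< ::Widget>`) to avoid being lexed as `std::vector[:Widget>`.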

2.1. Splicers

Splicers from [P2996] were accepted for C++26 with the proposed syntax [: expr :]. However, we are not allowed to use digraphs to spell this as <:: expr ::>.

While that seems to be in direct contradiction of the guarantees we're given in [lex.digraph] paragraph 2

In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.

it actually isn't. The tokens [: and :] are distinct preprocessing tokens rather than being composed from [ and : (or : and ] respectively). Therefore it doesn't matter whether <: is a valid alternative spelling for [ - the splicer syntax does not contain [ tokens.
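To make the guarantee for ordinary tokens concrete, here is a minimal sketch of digraphs standing in for [, ], { and }, together with the one observable difference, stringizing. STRINGIZE, sum_with_digraphs, spelling_differs and check_digraph_equivalence are illustrative names, not taken from any proposal.

```cpp
#include <cassert>
#include <cstring>

#define STRINGIZE(x) #x

// Written with digraphs; equivalent to
//   { int values[3] = {1, 2, 3}; return values[0] + values[1] + values[2]; }
inline int sum_with_digraphs() <%
    int values<:3:> = <%1, 2, 3%>;
    return values<:0:> + values<:1:> + values[2];
%>

// Stringizing keeps the source spelling, so "<:" and "[" compare unequal.
inline bool spelling_differs() {
    return std::strcmp(STRINGIZE(<:), STRINGIZE([)) != 0;
}

inline void check_digraph_equivalence() {
    assert(sum_with_digraphs() == 6);
    assert(spelling_differs());
}
```

None of this carries over to splicers, because [: is lexed as a single preprocessing token and never as [ followed by :.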

Unfortunately that doesn't exactly help if your source encoding does not have square brackets. In such cases you cannot use this language feature directly; you'd have to find some workaround (such as inventing some arbitrary replacement sequence that is expanded to [: after transcoding).

3. Interpolated string literals

The design problems stemming from digraphs do not end there. In some of the recent discussions around string interpolation ([P3412], [P3951]) some interesting code was brought up. Consider the following:

t"foo { bar %> baz"

In an interpolated string literal, the interpolated expression field is wrapped in curly braces. To parse an interpolated string literal you must therefore switch between regular string literal parsing and expression parsing as soon as you see a field introducer ({).

However, once you parse the interpolated expression things get a little strange. %> is an alternative spelling of }. We haven't yet returned to literal parsing, so this would yield the correct token. So, should we be able to signify the end of the interpolation field with %>?

Since allowing anything but a literal } to terminate an interpolation field seems extremely surprising and will most likely not match user expectations, we are once again looking to disallow digraphs in this context.

Unfortunately that also means that we are once again looking to introduce a feature that you cannot directly use if your source encoding requires the use of the corresponding digraphs.
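The choice a lexer faces here can be sketched as a single policy flag. The helper below is hypothetical and is not taken from [P3412] or [P3951]; it assumes the caller has already consumed the field introducer and deliberately ignores nested braces, nested literals and comments.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: `field` is the text following the field introducer '{'.
// Returns the index of the first field terminator, or npos if there is none.
// If `digraph_terminates` is true, "%>" is accepted as a terminator as well.
inline std::size_t field_end(std::string_view field, bool digraph_terminates) {
    for (std::size_t i = 0; i < field.size(); ++i) {
        if (field[i] == '}') return i;
        if (digraph_terminates && field[i] == '%' &&
            i + 1 < field.size() && field[i + 1] == '>') {
            return i;
        }
    }
    return std::string_view::npos;
}

inline void check_field_end() {
    // The field from the example above: t"foo { bar %> baz"
    assert(field_end(" bar %> baz", true) == 5);                        // "%>" ends the field
    assert(field_end(" bar %> baz", false) == std::string_view::npos);  // field is unterminated
}
```

With the flag set to false, a user whose encoding lacks } has no way to close the field, which is the fragmentation described above.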

4. History

Digraphs are a very rarely used C++ feature. GCC goes as far as calling them "obscure" in its documentation:

Apparently in the 1990s some computer systems had trouble inputting these characters, or trouble displaying them. These digraphs almost never appear in C programs nowadays, but we mention them for completeness.

What the GCC documentation refers to here is systems that require the source encoding to be something that does not have the characters [, ], {, } or #. In such cases, the corresponding digraphs are required to write those characters in source code.

Such encodings are rather uncommon nowadays, but they do exist.

Historically, digraphs were introduced because of [ISO 646], which includes several national variants that do not have square brackets or curly braces. For example, the [ISO-IR-021] (also known as DIN 66003) variant used in Germany replaces [ with Ä, ] with Ü, { with ä and } with ü.

Interestingly, [ISO-IR-027] even lacks :, making two of the digraphs unusable. While it's based on ISO 646, it is not a true variant however.

So, how did we get here?

1970s Several ISO 646 variants are introduced, some of which lack punctuators
1989 C89 introduces trigraphs as a workaround
1995 C95 introduces digraphs as a less invasive alternative (e.g. <: instead of ??( )
1998 C++98 adopts digraphs from C95
2011 [CWG1104] adopted for C++11 to disambiguate qualified template arguments from digraphs
2017 Trigraphs removed from C++17 ([N4086])

5. Usage Analysis

This raises an obvious question: Are people still using digraphs? Are they targeting somewhat modern C++ versions?

Unfortunately, a GitHub code search cannot fully answer this question, since a lot of use might be in code that isn't publicly available. However, it helps to get a rough idea of the situation and of how people use digraphs.

Surprisingly, we can actually find quite a lot of code that uses digraphs. For example, %:include appears in 850 files!

However, if you take a closer look at those results, you'll quickly notice that aside from a bunch of false positives most of them fall into one of three categories:

  1. Compiler test code and other lexer test code
  2. Demo code, examples and student assignments
  3. Inconsistent use of digraphs and the tokens they replace, likely for obfuscation purposes

Note how this list is missing "actual production code that uses digraphs for anything other than tests".

For the other digraphs the situation is similar. %:%: yields 216 results, most of which are false positives and/or lexer code. %:if yields 22 results, all of which are university assignments, inconsistent or false positives.

For the digraphs <:, :>, <% and %> a code search yields a lot more results, but they seem to be almost exclusively false positives - usually because they appear in string literals or comments. Furthermore, <: yields especially bad results due to qualified template arguments being spelled as templ<::arg> (without space after <). Due to platform rate limits further usage analysis for those does not seem feasible.
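One way to filter out most of these false positives is to skip comments and literals before matching. The following is a rough sketch of such a filter, not the method used for the numbers above; count_digraphs and check_count_digraphs are hypothetical names, and raw string literals, digit separators and line splicing are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Rough heuristic: counts digraph spellings outside comments and literals.
inline std::size_t count_digraphs(std::string_view src) {
    std::size_t count = 0;
    std::size_t i = 0;
    // True if the unread input begins with `s`.
    auto starts = [&](std::string_view s) { return src.substr(i, s.size()) == s; };

    while (i < src.size()) {
        if (starts("//")) {                                // line comment
            while (i < src.size() && src[i] != '\n') ++i;
        } else if (starts("/*")) {                         // block comment
            const std::size_t end = src.find("*/", i + 2);
            i = (end == std::string_view::npos) ? src.size() : end + 2;
        } else if (src[i] == '"' || src[i] == '\'') {      // string or character literal
            const char quote = src[i++];
            while (i < src.size() && src[i] != quote) {
                if (src[i] == '\\') ++i;                   // skip the escaped character
                ++i;
            }
            ++i;                                           // closing quote
        } else if (starts("<::") && !starts("<:::") && !starts("<::>")) {
            i += 3;                                        // '<' then '::', not the digraph '<:'
        } else if (starts("::") || starts("<<")) {
            i += 2;                                        // longest match wins over ':>', '<:' and '<%'
        } else if (starts("%:%:")) {
            ++count;
            i += 4;
        } else if (starts("<:") || starts(":>") || starts("<%") ||
                   starts("%>") || starts("%:")) {
            ++count;
            i += 2;
        } else {
            ++i;
        }
    }
    return count;
}

inline void check_count_digraphs() {
    assert(count_digraphs("int a<:3:> = <%1%>;") == 4);
    assert(count_digraphs("const char* s = \"<% %>\"; // <:") == 0);
    assert(count_digraphs("templ<::arg> x;") == 0);
}
```

Even with such a filter, the results still need manual review to separate lexer tests and demo code from production use.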

6. Compatibility

In C++17 we removed support for trigraphs. Since that was quite a while ago, it is fair to assume that mitigations are in place for users who required trigraphs but wanted to target anything beyond C++14.

While more complex approaches (such as filesystem-level transcoding) are possible, at the time of writing almost all compilers still support trigraphs as an extension. For instance:

While the situation around digraphs is arguably a little different and might require extra preprocessing to use all language features, mitigations will likely look similar to the ones required for trigraphs.

Some compilers already support disabling digraph support altogether. For instance:

7. Wording

Make the following changes to the C++ Working Draft. All wording is relative to [N5032], the latest draft at the time of writing.

Lexical conventions [lex]

Preprocessing tokens [lex.pptoken]

Modify paragraph 5 as indicated

5 If the input stream has been parsed into preprocessing tokens up to a given character:

5.1 If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as `R"`, the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in phase 2 (line splicing) are reverted; this reversion is applied before any d-char, r-char, or delimiting parenthesis is identified. The raw string literal is defined as the shortest sequence of characters that matches the raw-string pattern

encoding-prefixopt R raw-string

5.2 Otherwise, if the next three characters are <:: and the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token `<:`.

5.3 Otherwise, if the next three characters are [:: and the subsequent character is not :, or if the next three characters are [:>, the [ is treated as a preprocessing token by itself and not as the first character of the preprocessing token [:.

[Note: The tokens [: and :] cannot be composed from digraphs. — end note]

5.4 Otherwise, the next preprocessing token is the longest sequence of [...]

Operators and punctuators [lex.operators]

Modify as indicated.

1 The lexical representation of C++ programs includes a number of preprocessing tokens that are used in the syntax of the preprocessor or are converted into tokens for operators and punctuators:

preprocessing-op-or-punc:
    preprocessing-operator
    operator-or-punctuator
preprocessing-operator: one of
    # ## %: %:%:
operator-or-punctuator: one of
    { } [ ] ( ) [: :]
    <% %> <: :> ; : ...
    ? :: . .* -> ->* ^^ ~
    ! + - * / % ^ & |
    = += -= *= /= %= ^= &= |=
    == != < > <= >= <=> && ||
    << >> <<= >>= ++ -- ,
    and or xor not bitand bitor compl
    and_eq or_eq xor_eq not_eq

Each operator-or-punctuator is converted to a single token in translation phase 6 ([lex.phases]).

Alternative tokens [lex.digraph]

Rename to [lex.alt] and update all references accordingly.

1 Alternative token representations are provided for some operators and punctuators.

2 In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.

[Note: The “stringized” values ([cpp.stringize]) of [ and <: are different, maintaining the source spelling. — end note]

The set of alternative tokens is defined in Table 3.

Modify Table 3

Alternative  Primary    Alternative  Primary    Alternative  Primary
<%           {          and          &&         and_eq       &=
%>           }          bitor        |          or_eq        |=
<:           [          or           ||         xor_eq       ^=
:>           ]          xor          ^          not          !
%:           #          compl        ~          not_eq       !=
%:%:         ##         bitand       &

Remove footnote 10

🞰) These include “digraphs” and additional reserved words. The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as “digraphs”.

Preprocessing directives [cpp]

Argument substitution [cpp.subst]

Modify Example 1 as indicated, replacing <:- with [-.

[Example:

#define LPAREN() (
#define G(Q) 42
#define F(R, X, ...) __VA_OPT__(G R X) )
int x = F(LPAREN(), 0, [-); // replaced by int x = 42;

end example]

Annex C (informative) [diff]

C++ and ISO C++ 2026 [diff.cpp26]

[lex] lexical conventions [diff.cpp26.lex]

Add new entry

Affected subclause: [lex.digraph]

Change: Removal of digraph support as a required feature.

Rationale: Resolves fragmentation of the language, opens up design space and simplifies the language.

Effect on original feature: Valid C++ 2026 code that uses the punctuation digraphs <%, %>, <: and :> may be invalid or have different semantics in this revision of C++. Implementations may choose to translate digraphs as specified in C++ 2026 if they appear outside of a raw string literal, as part of the implementation-defined mapping from input source file characters to the translation character set.

8. Acknowledgements

Thanks to Jan Schultke for the markup language and document generator used for this paper.


9. References

[N5032] Thomas Köppe. Working Draft Programming Languages — C++ 2025-12-15 https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/n5032.pdf
[CWG1104] Global-scope template arguments vs the <: digraph 2010-08-02 https://cplusplus.github.io/CWG/issues/1104.html