Removing Digraphs
- Document number: Dxxxx
- Date: 2026-05-10
- Audience: SG22, EWG
- Project: ISO/IEC 14882 Programming Languages — C++, ISO/IEC JTC1/SC22/WG21
- Reply-to: Matthias Wippich <[email protected]>
Revision history
0.1. R0 May 2026
Original version of the paper.
1. Introduction
Digraphs are a complicated solution to a very old problem, and in a modern environment they cause more problems than they solve. Digraphs also severely limit the design space of C++, although as we have seen with [P2996] we are already fine with special-casing our way out of this pickle.
This, however, introduces an interesting problem: if you need to use a source encoding that requires the use of digraphs, you cannot use all of C++26 directly.
Since we are most likely going to continue seeing similar problems, this paper proposes to remove digraphs from the language entirely.
2. Design Space
As mentioned before, digraphs severely limit the design space of C++. This isn't an entirely new insight: we've already run into issues because of digraphs and will most likely continue to run into new ones.
This leads to a fragmented language: some parts you can write if you need to use digraphs, some you can't. At the same time we're accumulating workarounds (like [CWG1104]), which lead to an excessively complex language.
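The following is a minimal sketch of the kind of special case [CWG1104] introduced, assuming a C++11 or later compiler with digraphs enabled. The names `demo::Widget`, `Box`, `Qualified`, `Array` and `check_digraph_array` are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <type_traits>

namespace demo { struct Widget {}; }  // hypothetical type used as a template argument

template <class T>
struct Box { using type = T; };       // hypothetical template

// "<::" followed by neither ':' nor '>': since CWG1104 the '<' is a token by
// itself, so this is Box< ::demo::Widget > and not Box[ :demo::Widget >.
using Qualified = Box<::demo::Widget>;

// Elsewhere "<:" and ":>" are still alternative spellings of '[' and ']'.
using Array = int<:3:>;

static_assert(std::is_same<Qualified::type, demo::Widget>::value,
              "'<' was lexed on its own");
static_assert(std::is_same<Array, int[3]>::value,
              "digraph spelling of int[3]");

inline void check_digraph_array() {
    int values<:3:> = <%10, 20, 30%>;  // int values[3] = {10, 20, 30};
    assert(values<:1:> == 20);
}
```

Before C++11, the `Qualified` alias would have been lexed as starting with the digraph `<:` and rejected; the extra rule exists only because digraphs do.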
2.1. Splicers
Splicers from [P2996] were accepted for C++26 with the proposed syntax `[: expr :]`. However, we are not allowed to use digraphs to spell this as `<:: expr ::>`.
While that seems to be in direct contradiction of the guarantees we're given in [lex.digraph] paragraph 2
In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.
it actually isn't. The tokens `[:` and `:]` are distinct preprocessing tokens rather than being composed from `[` and `:` (or `:` and `]` respectively). Therefore it doesn't matter whether `<:` is a valid alternative spelling for `[` - the splicer syntax does not contain `[` tokens.
Unfortunately that doesn't exactly help if your source encoding does not have square brackets. In such cases you cannot use this language feature directly - you'd have to find some workaround (such as inventing some arbitrary replacement sequence that is expanded to `[:` or `:]` after transcoding).
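The [lex.digraph] guarantee quoted above, including its "except for its spelling" carve-out, can be observed with a small sketch. The macro `STRINGIZE` and the function names are hypothetical; the sketch assumes a conforming preprocessor that keeps the source spelling when stringizing.

```cpp
#include <cassert>
#include <cstring>

#define STRINGIZE(token) #token  // hypothetical helper macro

// The two spellings stringize differently ...
inline const char* primary_spelling() { return STRINGIZE([); }
inline const char* digraph_spelling() { return STRINGIZE(<:); }

// ... but behave the same as tokens and can be mixed freely.
inline void check_spellings() {
    assert(std::strcmp(primary_spelling(), "[") == 0);
    assert(std::strcmp(digraph_spelling(), "<:") == 0);

    int values<:2:> = <%1, 2%>;            // int values[2] = {1, 2};
    assert(values<:0:> + values[1] == 3);  // mixed digraph and primary subscripts
}
```

This equivalence only exists at the level of the `[` and `]` tokens. Because `[:` and `:]` are preprocessing tokens of their own, no digraph spelling of them exists.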
3. Interpolated string literals
The design problems stemming from digraphs do not end there. In some of the recent discussions around string interpolation ([P3412], [P3951]) some interesting code was brought up. Consider the following:
In an interpolated string literal, the interpolated expression field is wrapped in curly braces. To parse an interpolated string literal you must therefore switch from regular string literal parsing to expression parsing as soon as you see a field introducer (`{`).
However, once you parse the interpolated expression things get a little strange. `%>` is an alternative spelling of `}`. We haven't yet returned to literal parsing, so this would yield the correct token. So, should we be able to signify the end of the interpolation field with `%>`?
Since allowing anything but a literal `}` to terminate an interpolation field seems extremely surprising and will most likely not match user expectations, we are once again looking to disallow digraphs in this context.
Unfortunately that also means that we are once again looking to introduce a feature that you cannot directly use if your source encoding requires the use of the corresponding digraphs.
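To make the question concrete, here is a deliberately simplified sketch of the two possible policies. It is not taken from [P3412] or [P3951]: the function `field_end` and its `accept_digraph` parameter are hypothetical, and nested braces, nested literals and format specifiers are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index just past the terminator of the
// first interpolation field in `body`, or std::string::npos if there is none.
inline std::size_t field_end(const std::string& body, bool accept_digraph) {
    const std::size_t open = body.find('{');
    if (open == std::string::npos) return std::string::npos;
    for (std::size_t i = open + 1; i < body.size(); ++i) {
        if (body[i] == '}') return i + 1;  // literal closing brace
        if (accept_digraph && body[i] == '%' &&
            i + 1 < body.size() && body[i + 1] == '>') {
            return i + 2;                  // digraph spelling of '}'
        }
    }
    return std::string::npos;
}

inline void check_field_end() {
    const std::string body = "{x%> and {y}";
    assert(field_end(body, true) == 4);    // field ends right after "%>"
    assert(field_end(body, false) == 12);  // field runs to the literal '}'
}
```

With digraphs accepted the field is just `x`; with digraphs rejected the same characters form a much longer (and here ill-formed) expression, which is why the choice cannot be left open.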
4. History
Digraphs are a very rarely used C++ feature. GCC goes as far as calling it "obscure" in their documentation:
Apparently in the 1990s some computer systems had trouble inputting these characters, or trouble displaying them. These digraphs almost never appear in C programs nowadays, but we mention them for completeness.
What the GCC documentation refers to here are systems that require the source encoding to be something that does not have the characters `[`, `]`, `{`, `}`, or `#`. In such cases, the corresponding digraphs are required to write those characters in source code.
Such encodings are rather uncommon nowadays, but they do exist.
Historically digraphs were introduced because of [ISO 646], which includes several national variants that do not have the square brackets or curly braces. For example, the [ISO-IR-021] (also known as DIN 66003) variant used in Germany replaces `[` with `Ä`, `]` with `Ü`, `{` with `ä` and `}` with `ü`.
Interestingly, [ISO-IR-027] even lacks , making two of the digraphs unusable. While it's based on ISO 646, it is not a true variant however.
So, how did we get here?
| 1970s | Several ISO 646 variants are introduced, some of which lack punctuators |
| 1989 | C89 introduces trigraphs as a workaround |
| 1995 | C95 introduces digraphs as a less invasive alternative (i.e. `<:` instead of `??(`) |
| 1998 | C++98 adopts digraphs from C95 |
| 2011 | [CWG1104] adopted for C++11 to disambiguate qualified template arguments from digraphs |
| 2017 | Trigraphs removed from C++17 ([N4086]) |
5. Usage Analysis
This raises an obvious question: Are people still using digraphs? Are they targeting somewhat modern C++ versions?
A GitHub code search unfortunately cannot answer this question for us, since a lot of use might be in code that isn't publicly available. However, it helps to get a rough idea of the situation and of how people use digraphs.
Surprisingly, we can actually find quite a lot of code that uses digraphs. For example is included in 850 files!
However, if you take a closer look at those results, you'll quickly notice that aside from a bunch of false positives most of them fall into one of three categories:
- Compiler test code and other lexer test code
- Demo code, examples and student assignments
- Inconsistent use of digraphs and the tokens they replace, likely for obfuscation purposes
Note how this list is missing "actual production code that uses digraphs for anything other than tests".
For the other digraphs the situation is similar. yields 216 results, most of which are false positives and/or lexer code. yields 22 results, all of which are university assignments, inconsistent or false positives.
For the digraphs , , and a code search yields a lot more results, but they seem to be almost exclusively false positives, usually because they appear in string literals or comments. Furthermore, `<:` yields especially bad results due to qualified template arguments being spelled as `<::` (without a space after `<`). Due to platform rate limits, further usage analysis for those does not seem feasible.
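The false positives described above (matches in comments and string literals, and `<::` in qualified template arguments) could in principle be filtered with a small scanner. The following is a rough sketch under simplifying assumptions, not the method used for the numbers in this section: `count_digraphs` is a hypothetical helper, and raw string literals, digit separators and line splicing are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts "<%", "%>", "<:", ":>" and "%:" outside of
// comments and ordinary string/character literals.
inline std::size_t count_digraphs(const std::string& src) {
    std::size_t count = 0;
    std::size_t i = 0;
    const std::size_t n = src.size();
    while (i < n) {
        const char c = src[i];
        const char next = (i + 1 < n) ? src[i + 1] : '\0';
        if (c == '/' && next == '/') {             // line comment
            while (i < n && src[i] != '\n') ++i;
        } else if (c == '/' && next == '*') {      // block comment
            const std::size_t end = src.find("*/", i + 2);
            i = (end == std::string::npos) ? n : end + 2;
        } else if (c == '"' || c == '\'') {        // string or character literal
            ++i;
            while (i < n && src[i] != c) {
                if (src[i] == '\\') ++i;           // skip the escaped character
                ++i;
            }
            ++i;                                   // closing quote
        } else if (c == '<' && next == ':' && i + 2 < n && src[i + 2] == ':' &&
                   !(i + 3 < n && (src[i + 3] == ':' || src[i + 3] == '>'))) {
            ++i;                                   // "<::" rule: '<' stands alone
        } else if ((c == '<' && (next == '%' || next == ':')) ||
                   (c == '%' && (next == '>' || next == ':')) ||
                   (c == ':' && next == '>')) {
            ++count;
            i += 2;
        } else {
            ++i;
        }
    }
    return count;
}

inline void check_count_digraphs() {
    const std::string src =
        "int a<:3:> = <%1%>; // <: ignored\n"
        "const char* s = \"<%\"; Box<::S> b;";
    assert(count_digraphs(src) == 4);  // "<:", ":>", "<%", "%>" in the first line
}
```

A filter like this only removes the mechanical false positives; deciding whether a remaining hit is test code, an assignment or production code still needs a human look.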
6. Compatibility
In C++17 we removed support for trigraphs. Since this was quite a while ago, it is fair to assume that mitigations are in place for users who required trigraphs but wanted to target anything beyond C++14.
While more complex approaches (such as filesystem-level transcoding) are possible, at the time of writing almost all compilers still support trigraphs as an extension. For instance:
- GCC has `-trigraphs`
- Clang has `-ftrigraphs`
- MSVC has `/Zc:trigraphs`
- EDG has `--trigraphs`
While the situation around digraphs is arguably a little different and might require extra preprocessing to use all language features, mitigations will likely look similar to the ones required for trigraphs.
Some compilers already support disabling digraph support altogether. For instance:
- Clang has `-fno-digraphs`
- EDG has `--no_alternative_tokens`
- MSVC's documentation of compiler warnings C4628 and C4629 suggests that using digraphs is not supported with `/Ze` and will cause a warning with `/Za`; however, Compiler Explorer does not verify that
7. Wording
Make the following changes to the C++ Working Draft. All wording is relative to [N5032], the latest draft at the time of writing.
Lexical conventions [lex]
Preprocessing tokens [lex.pptoken]
Modify paragraph 5 as indicated.
5 If the input stream has been parsed into preprocessing tokens up to a given character:
5.1
If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as `R"`, the next preprocessing token shall be a raw string literal.
Between the initial and final double quote characters of the raw string, any transformations performed in phase 2 (line splicing) are reverted; this reversion is applied before any
5.2
Otherwise, if the next three characters are `<::` and the subsequent character is neither `:` nor `>`, the `<` is treated as a preprocessing token by itself and not as the first character of the alternative token `<:`.
5.3
Otherwise, if the next three characters are `[::` and the subsequent character is not `:`, or if the next three characters are `[:>`, the `[` is treated as a preprocessing token by itself and not as the first character of the preprocessing token `[:`.
[Note:
The tokens `[:` and `:]` cannot be composed from digraphs.
— end note]
5.4 Otherwise, the next preprocessing token is the longest sequence of [...]
Operators and punctuators [lex.operators]
Modify as indicated.
1 The lexical representation of C++ programs includes a number of preprocessing tokens that are used in the syntax of the preprocessor or are converted into tokens for operators and punctuators:
| | | |
| | | | | | | | |
| | | | | | | ||
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | | |
| | | | | | | ||
| | | | | | | ||
| | | |
Each
Alternative tokens [lex.digraph]
Rename to [lex.alt] and update all references accordingly.
1 Alternative token representations are provided for some operators and punctuators.
2 In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.
[Note:
The “stringized” values ([cpp.stringize]) of and are different, maintaining the source spelling.
— end note]
The set of alternative tokens is defined in Table 3.
Modify Table 3
| Alternative | Primary | Alternative | Primary | ||
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | |
Remove footnote 10
🞰)
These include “digraphs” and additional reserved words.
The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters.
Nonetheless, those alternative tokens that aren't lexical keywords are colloquially known as “digraphs”.
Preprocessing directives [cpp]
Argument substitution [cpp.subst]
Modify Example 1 as indicated.
[Example:
— end example]
Annex C (informative) [diff]
C++ and ISO C++ 2026 [diff.cpp26]
[lex] lexical conventions [diff.cpp26.lex]
Add new entry
Affected subclause: [lex.digraph]
Change: Removal of digraph support as a required feature.
Rationale: Resolves fragmentation of the language, opens up design space and simplifies the language.
Effect on original feature: Valid C++ 2026 code that uses the punctuation digraphs , , and may not be valid or have different semantics in this revision of C++. Implementations may choose to translate digraphs as specified in C++ 2026 if they appear outside of a raw string literal, as part of the implementation-defined mapping from input source file characters to the translation character set.
8. Acknowledgements
Thanks to Jan Schultke for the markup language and document generator used for this paper.