Commit 09c61884 authored 3 years ago by Christian Kuhn
[!!!][FEATURE] New TypoScript parser

This adds a rewritten TypoScript syntax parser to ext:core.

The heart of the existing parser dates back to Kasper's
initial commit in 2003 and most parts have never been
touched structurally. Even though various parts of
TypoScript became less important in the Frontend rendering
chain over time, it still plays a central role. Looking at
the existing parser approach, it became clear the project can
benefit from a revamped parser based on modern PHP code.

Goals:
* A well structured, flexible and tested codebase.
* Fix long standing syntax shenanigans and limits of
  the current parser approach and make the syntax
  interpretation more robust and resilient.
* Have a better cache layer that kicks in more often
  to ultimately gain speed building the FE TypoScript.
* Allow improving the Backend "Template" module to
  show and analyze more TypoScript details.
* Have an object based structure for the final TypoScript
  tree, and allow an array export of it to keep backwards
  compatibility.

The new implementation aims to fully substitute the old
parser. This first patch however only brings the "library"
part of needed code changes: There is no usage of the new
parser whatsoever with this patch, except of course unit
and functional tests.

To understand the structure of the new parser, it might be
helpful to have a rough understanding of the old parser. We'll
look at the frontend parsing steps only. Backend usages in the
Template module are similar, though.

When accessing a frontend page, TYPO3 at some point finds the
requested page uid. It then builds the "rootline" of page
records from the top page root node down to the requested
page. This creates a list of page database records. Mountpoints
and similar are taken care of at this point, too.

This rootline is fed to the old TypoScript parser. The parser now
finds all relevant sys_template records attached to the rootline
pages. These are the "entry" TypoScript snippets to parse. It
then resolves all "includes": Snippets from sys_template
include_static_file db records, snippets from various globals,
snippets from sys_template basedOn, and so on. It also resolves
@import and <INCLUDE_TYPOSCRIPT: syntax and substitutes them
with the included content. All that is gathered in one huge
string. The main parsing process then goes through this string
and creates the TypoScript array, while taking care of conditions
and constants substitution at the same time.

This approach has various drawbacks: First, gathering everything
into one main string forces the parser to parse the entire string
anew for each and every page: A different page could have different
includes or sys_template records attached. Second, parsing
conditions and constants while creating the final array from the
source string ties "runtime" information (especially conditions)
hard into the main parsing process, rendering an effective cache
layer impossible. Third, the main strategy of gathering everything
in one string leads to various funny details: For instance, when
a condition is opened in one file and not closed with [end] or
[global], this state leaks to the next file. Bracket handling "}"
has similar issues.

These structural issues can't be fixed by refactoring the
given codebase. The only option is to rewrite the entire thing.

IncludeTree:
The entry point to the new parser is IncludeTree/TreeBuilder:
This one again receives the pages "rootline" from Frontend. But
instead of creating one huge string, it creates an object tree of
includes. Each attached sys_template record is a child of
"RootInclude" in the tree, and each sys_template include node
can have children for further includes (like include_static_file).
Single source snippets of each include are tokenized (see below)
and analyzed for @import and <INCLUDE_TYPOSCRIPT: and conditions:
If they exist, an include is marked as "split". Each part is then
represented by child includes. We thus receive a tree of objects
where imports are already resolved, and conditions are child nodes
in the tree.

The main advantage is that this tree does not carry runtime
information. A single snippet *always* leads to the same tree, no
matter if a condition matched or not. This way, the tree can be
cached and the full tree, or parts of it, can be re-used when
requesting a different page. To do that efficiently, a new cache
"typoscript" is established to store a serialized representation
of the tree as a PHP file, which can be unserialized relatively
quickly.
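
To illustrate the idea, here is a minimal conceptual sketch in
Python. It is not the PHP code of this patch: All class and function
names are hypothetical, only conditions are handled (no @import),
and a plain in-memory dict stands in for the "typoscript" cache.
It shows that splitting a snippet at condition lines never evaluates
a condition, so the resulting tree depends on the source only and
can be re-used.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Include:
    """Hypothetical include tree node: a source part plus child includes."""
    name: str
    source: str = ""
    children: list["Include"] = field(default_factory=list)


@dataclass
class ConditionInclude(Include):
    """Child node holding the lines guarded by one condition expression."""
    expression: str = ""


def split_on_conditions(name: str, source: str) -> Include:
    """Split a snippet into child includes at condition lines.

    The condition is stored, not evaluated: no runtime state is involved.
    """
    root = Include(name)
    current = Include(f"{name}#0")
    root.children.append(current)
    buffered: list[str] = []

    def flush() -> None:
        # Assign the collected lines to the part that is currently open.
        current.source = "\n".join(buffered)
        buffered.clear()

    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            flush()
            expression = stripped[1:-1].strip()
            part_name = f"{name}#{len(root.children)}"
            if expression.lower() in ("end", "global"):
                current = Include(part_name)
            else:
                current = ConditionInclude(part_name, expression=expression)
            root.children.append(current)
        else:
            buffered.append(line)
    flush()
    return root


# Assumption: a dict keyed by a source hash replaces the real file cache.
_tree_cache: dict[str, Include] = {}


def cached_tree(name: str, source: str) -> Include:
    key = hashlib.sha1(source.encode()).hexdigest()
    if key not in _tree_cache:
        _tree_cache[key] = split_on_conditions(name, source)
    return _tree_cache[key]
```

Since nothing in the tree depends on the request, a second call to
cached_tree() with the same source returns the stored tree.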

Tokenizer:
The tokenizer takes a single source snippet and creates a stream
of Line objects from it, with tokens representing the source line.
For instance, a TypoScript snippet like "foo = fooValue" creates
an IdentifierAssignmentLine object, with a T_IDENTIFIER token for
"foo" and a T_VALUE token for "fooValue". There are Line classes
for all the different Lines (assignment, copy, condition, ...) and
Tokens for the various details (T_VALUE, T_CONSTANT, ...). Tokens
are sometimes encapsulated in TokenStreams, for example multiple
identifiers "foo.bar = barValue" create a stream of the two
T_IDENTIFIER tokens "foo" and "bar". This depends on the line type.
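
As a rough illustration of the line and token structure, the
following Python sketch tokenizes a simple assignment line. It is
an assumption-laden simplification and not the actual PHP classes:
The function name is made up, only the "=" operator is handled,
and escaped dots in identifiers are ignored.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class TokenType(Enum):
    T_IDENTIFIER = auto()
    T_VALUE = auto()


@dataclass(frozen=True)
class Token:
    type: TokenType
    value: str


@dataclass(frozen=True)
class IdentifierAssignmentLine:
    # Stands in for the token stream of identifiers left of "=".
    identifiers: tuple[Token, ...]
    value: Token


def tokenize_assignment(line: str) -> Optional[IdentifierAssignmentLine]:
    """Return a line object for "path = value", or None for anything else."""
    path, separator, value = line.partition("=")
    if not separator or not path.strip():
        return None
    identifiers = tuple(
        Token(TokenType.T_IDENTIFIER, part) for part in path.strip().split(".")
    )
    return IdentifierAssignmentLine(
        identifiers, Token(TokenType.T_VALUE, value.strip())
    )
```

With this, "foo.bar = barValue" yields two T_IDENTIFIER tokens
"foo" and "bar" plus one T_VALUE token "barValue".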

We also have two different tokenizers: The Frontend uses the
LossyTokenizer which creates a stream only of "relevant" tokens:
Empty lines, comments and lines with invalid syntax are ignored.
The LosslessTokenizer however creates a 1:1 representation of the
source snippet, including token positions (line and start column).
This allows detailed analysis in the Backend Template module.
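
The difference between the two can be sketched like this in Python.
This is a hypothetical model only: The line kinds, the classify()
helper and the idea of deriving the lossy result from the lossless
one are assumptions for brevity, the real tokenizers are separate
PHP classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    kind: str         # "assignment", "comment", "empty" or "invalid"
    text: str         # the untouched source line
    line_number: int  # zero-based line in the snippet
    column: int       # start column of the first non-blank character


def classify(text: str) -> str:
    """Very rough line classification, just enough for this sketch."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if stripped.startswith(("#", "//")):
        return "comment"
    if "=" in stripped and stripped.split("=", 1)[0].strip():
        return "assignment"
    return "invalid"


def lossless(source: str) -> list[Line]:
    """Keep every source line, including its position."""
    return [
        Line(classify(text), text, number, len(text) - len(text.lstrip()))
        for number, text in enumerate(source.splitlines())
    ]


def lossy(source: str) -> list[Line]:
    """Keep only lines relevant for building the final TypoScript."""
    return [line for line in lossless(source) if line.kind == "assignment"]
```

The lossless variant can rebuild the source and point at the exact
position of an invalid line, while the lossy variant only carries
what the Frontend needs.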

AstBuilder:
This is the third part of the structure: A representation of
parsed TypoScript as object tree. This is similar to the array
we're dealing with now, but it can carry additional functionality.
For instance, the well-known TypoScript array can be extracted
from it.
Using the AstBuilder works like this: Create the IncludeTree
first (from cache). Next, traverse the IncludeTree with a visitor
that looks at condition includes and records whether they
match. Then, feed all includes that matched to the
AstBuilder to create the object tree. The AstBuilder receives
a LineStream of each include and extends the object tree depending
on the line type.
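
The steps above can be sketched as follows in Python. Again, this
is a simplified model under stated assumptions and not the API of
this patch: Includes are a flat list instead of a tree, lines are
plain (identifier path, value) pairs, condition verdicts come from
a prepared dict, and all names are hypothetical. The to_array()
export follows the familiar TypoScript array layout, where children
of "foo" live under the key "foo.".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One node of the object tree, with an optional value and children."""
    value: Optional[str] = None
    children: dict[str, "Node"] = field(default_factory=dict)

    def to_array(self) -> dict:
        result: dict = {}
        for name, child in self.children.items():
            if child.value is not None:
                result[name] = child.value
            if child.children:
                result[name + "."] = child.to_array()
        return result


@dataclass
class Include:
    lines: list[tuple[tuple[str, ...], str]]  # (identifier path, value)
    condition: Optional[str] = None           # None: not a condition include
    matched: bool = True


def apply_verdicts(includes: list[Include], verdicts: dict[str, bool]) -> None:
    """Visitor step: store the runtime verdict on each condition include."""
    for include in includes:
        if include.condition is not None:
            include.matched = verdicts.get(include.condition, False)


def build_ast(includes: list[Include]) -> Node:
    """Feed the lines of all matching includes into one object tree."""
    root = Node()
    for include in includes:
        if not include.matched:
            continue
        for path, value in include.lines:
            node = root
            for name in path:
                node = node.children.setdefault(name, Node())
            node.value = value
    return root
```

The include list itself stays untouched by the verdicts apart from
the matched flag, so the same cached structure can serve requests
with different condition results.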

This means: IncludeTree building and source tokenizing can be
cached in Frontend, while applying constants and condition verdicts
together with building the final TypoScript object tree is a
runtime step on each page. Separating tokenizing and AST building
with a cache layer in between is roughly twice as fast in
Frontend compared to the previous solution.

The patch establishes these three main structures and comes with
an extensive set of unit and functional tests. It is an extract
of a bigger WIP patch that already contains usages of the new
structure in Frontend and some parts of the Backend. The patch
also adds a ReST file to explain some subtle TypoScript syntax
changes. These will kick in as soon as we start using the
structure with upcoming changes.

The new parser should be rather robust already. Further changes
that start using the new parser will probably only change minor
things. For now, the entire structure is marked @internal,
though: API is still the array representation, only. This will
change later when we actively start using the TypoScript object
tree in the core.

Change-Id: I4047a878494078b6ac149553fa305c1d69329e37
Resolves: #97816
Resolves: #96503
Resolves: #90146
Resolves: #41327
Resolves: #76447
Releases: main
Reviewed-on: https://review.typo3.org/c/Packages/TYPO3.CMS/+/74987


Tested-by: core-ci <typo3@b13.com>
Tested-by: Benni Mack <benni@typo3.org>
Tested-by: Stefan Bürk <stefan@buerk.tech>
Tested-by: Christian Kuhn <lolli@schwarzbu.ch>
Reviewed-by: Benni Mack <benni@typo3.org>
Reviewed-by: Stefan Bürk <stefan@buerk.tech>
Reviewed-by: Christian Kuhn <lolli@schwarzbu.ch>
parent 8b57f8c5