From 390092d83f5122c2e64bbff26858e3078d60a85d Mon Sep 17 00:00:00 2001
From: Garvin Hicking <gh@faktor-e.de>
Date: Thu, 23 Nov 2023 13:03:55 +0100
Subject: [PATCH] [BUGFIX] Avoid "update storage index" FAL task failure with
 too many records
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Indexer builds a large array of all actual files on a storage
(identifiedFileUids). If many files exist, this array can get
very large.

This array was then passed to a QueryBuilder to fetch all
records NOT IN that array. Since a NOT IN query is passed as a
string to the database, it can exceed the string size allowed
in a query, making the whole task fail.
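
To get a feeling for the size involved, here is a rough sketch
(not TYPO3 code). The helper notInClauseLength() is hypothetical
and assumes consecutive uids starting at 1; real uids and the
actual limit of the database server will differ.

```php
<?php
// Hypothetical illustration: length of the "NOT IN" part of a query
// for a given number of uids (assumed to be 1..$uidCount).
function notInClauseLength(int $uidCount): int
{
    $uids = range(1, $uidCount);
    // Only the constraint itself is measured, not the full statement.
    return strlen('uid NOT IN (' . implode(',', $uids) . ')');
}

echo notInClauseLength(1000000), " bytes\n";
```

For one million uids the constraint alone is several megabytes long.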

Since a NOT IN query cannot be chunked easily, the logic
has been changed to take a different approach.

Instead of fetching a restricted list of database records,
all records are fetched and iterated. Even with a million
sys_file records in a single (!) storage, this performs
acceptably and stays within practical usage scenarios.

Each database record is then checked for a match in the
large array of known records. Records without a match are
then handled by the same logic as before.
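
The check can be sketched in isolation as follows. This is a
simplified stand-in, not the actual Indexer code: findMissingUids()
and the $fileExists callback are hypothetical and replace the
repository and storage calls.

```php
<?php
// Hypothetical stand-in for the loop in detectMissingFiles().
// $records mimics database rows, $identifiedFileUids holds the uids
// seen during the indexing run, $fileExists replaces the storage lookup.
function findMissingUids(array $records, array $identifiedFileUids, callable $fileExists): array
{
    $missing = [];
    foreach ($records as $record) {
        // Record was matched to a file in this run: nothing to do.
        if (in_array($record['uid'], $identifiedFileUids, true)) {
            continue;
        }
        // Extra check before the record counts as missing.
        if (!$fileExists($record['identifier'])) {
            $missing[] = $record['uid'];
        }
    }
    return $missing;
}
```

Only records that are neither in the uid list nor found by the
extra existence check end up in the result.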

To benchmark the implications, the following test was run:

Baseline:

* sys_file with 50,736 entries
  * 16,912 marked as missing
  * 33,824 marked as existing
* Filesystem with 8,771 actual files

Tested setup via a script which:

* Resets to baseline sys_file storage
* Executes scheduler task "File Abstraction Layer: Update storage index
  (scheduler)"
* Flags 41,965 files as missing, 8,771 as found.

Script execution was performed 50 times and a mean average was calculated,
once with the patch in place, once without.

Old variant (using NOT IN query): 11.787 seconds
New variant (fetching all records): 12.0544 seconds

Besides staying within the same performance level (roughly 2% slower in
this benchmark), the new method no longer provokes a database exception
(see ticket).

Resolves: #102295
Releases: main, 12.4
Change-Id: Id998d7cd062fe75aac738b896bfb307b51f5cef8
Reviewed-on: https://review.typo3.org/c/Packages/TYPO3.CMS/+/82237
Tested-by: Stefan Bürk <stefan@buerk.tech>
Reviewed-by: Stefan Bürk <stefan@buerk.tech>
Tested-by: core-ci <typo3@b13.com>
---
 .../core/Classes/Resource/Index/Indexer.php    | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/typo3/sysext/core/Classes/Resource/Index/Indexer.php b/typo3/sysext/core/Classes/Resource/Index/Indexer.php
index 4a9f0877b8a6..cb8d48198666 100644
--- a/typo3/sysext/core/Classes/Resource/Index/Indexer.php
+++ b/typo3/sysext/core/Classes/Resource/Index/Indexer.php
@@ -160,17 +160,25 @@ class Indexer implements LoggerAwareInterface
     }
 
     /**
-     * Since by now all files in filesystem have been looked at it is save to assume,
-     * that files that are in indexed but not touched in this run are missing
+     * Since by now all files in filesystem have been looked at, it is safe to assume,
+     * that files that are indexed, but not touched in this run, are missing
      */
     protected function detectMissingFiles()
     {
-        $indexedNotExistentFiles = $this->getFileIndexRepository()->findInStorageAndNotInUidList(
+        $allCurrentFiles = $this->getFileIndexRepository()->findInStorageAndNotInUidList(
             $this->storage,
-            $this->identifiedFileUids
+            []
         );
 
-        foreach ($indexedNotExistentFiles as $record) {
+        foreach ($allCurrentFiles as $record) {
+            // Check if the record retrieved from the database was associated
+            // with an existing file.
+            // If yes: All is good, file is in index and in database.
+            // If no: Database record may need to be marked as removed (extra check!)
+            if (in_array($record['uid'], $this->identifiedFileUids, true)) {
+                continue;
+            }
+
             if (!$this->storage->hasFile($record['identifier'])) {
                 $this->getFileIndexRepository()->markFileAsMissing($record['uid']);
             }
-- 
GitLab