DRAFT: ADBDEV-7181 - Orphaned files removal #1631

Draft
whitehawk wants to merge 22 commits into adb-6.x-dev from feature/ADBDEV-7181

@whitehawk
DRAFT: ADBDEV-7181 - Orphaned files removal

robertmhaas and others added 22 commits March 14, 2025 16:44
C doesn't have any sort of built-in understanding of a pointer
relative to some arbitrary base address, but dynamic shared memory
segments can be mapped at different addresses in different processes,
so any sort of shared data structure stored within a dynamic shared
memory segment can't use absolute pointers.  We could use something
like Size to represent a relative pointer, but then the compiler
provides no type-checking.  Use stupid macro tricks to get some
type-checking.
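
The idea above can be sketched roughly as follows. This is a minimal illustration, not the macros added by the commit: the names RELPTR_DECLARE, RELPTR_STORE and RELPTR_ACCESS are hypothetical, offset 0 is assumed to mean "null", and `__typeof__` is a GCC/Clang extension. The union's pointer member is never dereferenced; it exists only so the compiler can check the pointed-to type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical macros for illustration only; they are not the names or
 * definitions used by the patch.
 */
#define RELPTR_DECLARE(type, name) \
    typedef union { type *typed_dummy; size_t off; } name

/*
 * The sizeof() operand is type-checked but never evaluated, so passing a
 * pointer of the wrong type is diagnosed at compile time.
 */
#define RELPTR_STORE(base, rp, ptr) \
    ((void) sizeof((rp).typed_dummy = (ptr)), \
     (rp).off = ((ptr) == NULL) ? 0 : (size_t) ((char *) (ptr) - (char *) (base)))

/* Offset 0 is treated as "null" in this sketch. */
#define RELPTR_ACCESS(base, rp) \
    ((rp).off == 0 \
        ? (__typeof__((rp).typed_dummy)) NULL \
        : (__typeof__((rp).typed_dummy)) ((char *) (base) + (rp).off))

typedef struct Node { int value; } Node;
RELPTR_DECLARE(Node, NodeRelPtr);

/*
 * Simulate one segment mapped at two different addresses: the stored
 * offset stays valid even though the base address differs.
 */
static int relptr_demo(void)
{
    char *map_a = malloc(64);
    char *map_b = malloc(64);
    NodeRelPtr rp;
    Node *n;
    int result;

    if (map_a == NULL || map_b == NULL)
    {
        free(map_a);
        free(map_b);
        return -1;
    }

    n = (Node *) (map_a + 16);
    n->value = 42;
    RELPTR_STORE(map_a, rp, n);

    memcpy(map_b, map_a, 64);   /* the "other process" sees the same bytes */
    result = RELPTR_ACCESS(map_b, rp)->value;

    free(map_a);
    free(map_b);
    return result;
}
```

Storing a `char *` into a `NodeRelPtr` with RELPTR_STORE would be flagged by the compiler, which is the type-checking a plain Size offset cannot provide.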

Patch originally by me.  Concept suggested by Andres Freund.  Recently
resubmitted as part of Thomas Munro's work on dynamic shared memory
allocation.

Discussion: 20131205144434.GG12398@alap2.anarazel.de
Discussion: CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com

(cherry picked from commit fbc1c12)
This is intended as infrastructure for a full-fledged allocator for
dynamic shared memory.  The interface looks a bit like a real
allocator, but only supports allocating and freeing memory in
multiples of the 4kB page size.  Further, to free memory, you must
know the size of the span you wish to free, in pages.  While these
restrictions make it unsuitable as an allocator in and of itself, it still
serves as very useful scaffolding for a full-fledged allocator.
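
To make the interface constraints concrete, here is a toy page manager, assuming a fixed pool of 16 pages and a first-fit search. All names (ToyPageManager, toy_page_alloc, toy_page_free) are hypothetical and the real free page manager is considerably more elaborate; the sketch only shows that allocation is page-granular and that the caller must pass the span size back when freeing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096      /* allocation granularity, in bytes */
#define TOY_NPAGES    16        /* assumed pool size for the sketch */

typedef struct ToyPageManager
{
    bool used[TOY_NPAGES];      /* one flag per page */
} ToyPageManager;

/* First-fit search for npages contiguous free pages. */
static bool toy_page_alloc(ToyPageManager *pm, size_t npages, size_t *first_page)
{
    size_t run = 0;

    if (npages == 0 || npages > TOY_NPAGES)
        return false;

    for (size_t i = 0; i < TOY_NPAGES; i++)
    {
        run = pm->used[i] ? 0 : run + 1;
        if (run == npages)
        {
            size_t start = i + 1 - npages;

            for (size_t j = start; j <= i; j++)
                pm->used[j] = true;
            *first_page = start;
            return true;
        }
    }
    return false;
}

/* No per-span metadata is kept, so the caller must supply the span size. */
static void toy_page_free(ToyPageManager *pm, size_t first_page, size_t npages)
{
    for (size_t j = first_page; j < first_page + npages && j < TOY_NPAGES; j++)
        pm->used[j] = false;
}

/* Byte offset of a page from the start of the managed region. */
static size_t toy_page_offset(size_t page)
{
    return page * TOY_PAGE_SIZE;
}
```

A higher-level allocator would sit on top of something like this and remember span sizes itself, which is why the page manager alone is only scaffolding.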

Robert Haas and Thomas Munro.  This code is mostly the same as my 2014
submission, but Thomas fixed quite a few bugs and made some changes to
the interface.

Discussion: CA+TgmobkeWptGwiNa+SGFWsTLzTzD-CeLz0KcE-y6LFgoUus4A@mail.gmail.com
Discussion: CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com

(cherry picked from commit 13e14a7)
If you have previously pinned a segment and decide that you don't
actually want to keep it around until shutdown, this new API lets you
remove the pin.  This is pretty trivial except on Windows, where it
requires closing the duplicate handle that was used to implement the
pin.

Thomas Munro and Amit Kapila, reviewed by Amit Kapila and by me.

(cherry picked from commit 0fda682)

Changes from original commit: remove API changes and windows
compatibility code to keep binary compatibility with 6.x
Programmers discovered decades ago that it was useful to have a simple
interface for allocating and freeing memory, which is why malloc() and
free() were invented.  Unfortunately, those handy tools don't work
with dynamic shared memory segments because those are specific to
PostgreSQL and are not necessarily mapped at the same address in every
cooperating process.  So invent our own allocator instead.  This makes
it possible for processes cooperating as part of parallel query
execution to allocate and free chunks of memory without having to
reserve them prior to the start of execution.  It could also be used
for longer lived objects; for example, we could consider storing data
for pg_stat_statements or the stats collector in shared memory using
these interfaces, rather than writing them to files.  Basically,
anything that needs shared memory but can't predict in advance how
much it's going to need might find this useful.

Thomas Munro and Robert Haas.  The original code (of mine) on which
Thomas based his work was actually designed to be a new backend-local
memory allocator for PostgreSQL, but that hasn't gone anywhere - or
not yet, anyway.  Thomas took that work and performed major
refactoring and extensive modifications to make it work with dynamic
shared memory, including the addition of appropriate locking.

Discussion: CA+TgmobkeWptGwiNa+SGFWsTLzTzD-CeLz0KcE-y6LFgoUus4A@mail.gmail.com
Discussion: CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com
(cherry picked from commit 13df76a)

Changes from original commit:
  removed extra argument from dsm_create() call
The comments in dsa.c suggested that areas were owned by resource
owners, but it was not in fact tracked explicitly. The DSM attachments
held by the dsa were owned by resource owners, but not the area
itself.  That led to confusion if you used one resource owner to
attach or create the area, but then switched to a different resource
owner before allocating or even just accessing the allocations in the
area with dsa_get_address(). The additional DSM segments associated
with the area would get owned by a different resource owner than the
initial segment.  To fix, add an explicit 'resowner' field to
dsa_area.  It replaces the 'mapping_pinned' flag; resowner == NULL now
indicates that the mapping is pinned.

This is arguably a bug fix, but I'm not backpatching because it
doesn't seem to be a live bug in the back branches. In 'master', it is
a bug because commit b8bff07daa made ResourceOwners more strict so
that you are no longer allowed to remember new resources in a
ResourceOwner after you have started to release it. Merely accessing a
dsa pointer might need to attach a new DSM segment, and before this
commit it was temporarily remembered in the current owner for a very
brief period even if the DSA was pinned. And that could happen in
AtEOXact_PgStat(), which is called after the owner is already released.

Reported-by: Alexander Lakhin
Reviewed-by: Alexander Lakhin, Thomas Munro, Andres Freund
Discussion: https://www.postgresql.org/message-id/11b70743-c5f3-3910-8e5b-dd6c115ff829%40gmail.com
(cherry picked from commit postgres/postgres@a8b330f)
This covers basic calls within a single backend process, and also
calling dsa_allocate() or dsa_get_address() while in a different
resource owner. The latter case was fixed by the previous commit.

Discussion: https://www.postgresql.org/message-id/11b70743-c5f3-3910-8e5b-dd6c115ff829%40gmail.com
(cherry picked from commit postgres/postgres@325f540)

Changes from original commit:
  test was adapted to 6.x
1. Implement redo module for pending deletes. This module is responsible for
operations that are related to the redo process:
 - Inserting the XLOG_PENDING_DELETE record into xlog when the checkpointer
requests it.
 - Parsing XLOG_SMGR_CREATE records to retrieve relfilenodes that should be
added to the pending deletes hash table on redo.
 - Replaying XLOG_PENDING_DELETE by adding items to the pending deletes redo
hash table.
 - Removing nodes from the pending deletes redo hash table for committed or
aborted transactions.
 - Dropping orphaned files based on the redo hash table with pending deletes at
the end of the recovery process.
2. Add unit test for the module.
3. Add GUC 'gp_track_pending_delete'.

Ticket: ADBDEV-7304
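
The redo-side bookkeeping described in this commit can be modelled roughly as below. This is a simplified sketch under stated assumptions: a small fixed array stands in for the hash table, relfilenodes and xids are plain integers, and every name (ToyRedoTable, toy_redo_add, and so on) is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PENDING 32      /* assumed capacity for the sketch */

typedef struct ToyPendingDelete
{
    uint32_t relfilenode;
    uint32_t xid;
    bool     in_use;
} ToyPendingDelete;

typedef struct ToyRedoTable
{
    ToyPendingDelete slots[TOY_MAX_PENDING];
} ToyRedoTable;

static void toy_redo_init(ToyRedoTable *t)
{
    for (size_t i = 0; i < TOY_MAX_PENDING; i++)
        t->slots[i].in_use = false;
}

/* Replaying a create record or a pending-delete dump: remember the file. */
static bool toy_redo_add(ToyRedoTable *t, uint32_t relfilenode, uint32_t xid)
{
    ToyPendingDelete *free_slot = NULL;

    for (size_t i = 0; i < TOY_MAX_PENDING; i++)
    {
        ToyPendingDelete *s = &t->slots[i];

        if (s->in_use && s->relfilenode == relfilenode)
            return true;        /* already tracked */
        if (!s->in_use && free_slot == NULL)
            free_slot = s;
    }
    if (free_slot == NULL)
        return false;           /* table full */

    free_slot->relfilenode = relfilenode;
    free_slot->xid = xid;
    free_slot->in_use = true;
    return true;
}

/* Replaying a commit or abort: the transaction's files are no longer at risk. */
static void toy_redo_xact_done(ToyRedoTable *t, uint32_t xid)
{
    for (size_t i = 0; i < TOY_MAX_PENDING; i++)
        if (t->slots[i].in_use && t->slots[i].xid == xid)
            t->slots[i].in_use = false;
}

/*
 * End of recovery: whatever is still tracked belongs to transactions that
 * never finished. Copies up to out_cap relfilenodes into out, forgets
 * them, and returns how many were copied.
 */
static size_t toy_redo_collect_orphans(ToyRedoTable *t, uint32_t *out, size_t out_cap)
{
    size_t n = 0;

    for (size_t i = 0; i < TOY_MAX_PENDING && n < out_cap; i++)
    {
        if (t->slots[i].in_use)
        {
            out[n++] = t->slots[i].relfilenode;
            t->slots[i].in_use = false;
        }
    }
    return n;
}
```

In this model, a file created by a transaction with no commit or abort record in the replayed WAL is exactly what remains in the table when recovery ends.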
The module contains functions to maintain doubly linked lists of
(RelFileNodePendingDelete, transaction id) pairs for all backends in shared
memory. The shared memory can be initialized using the PdlShmemSize and
PdlShmemInit functions. A backend can add pairs to and remove pairs from its
own list using PdlShmemAdd and PdlShmemRemove respectively. The backends' lists
can be obtained in the format suitable for XLOG_PENDING_DELETE using
PdlXLogShmemDump.

When a backend stops, the module cleans up its pending deletes list.

The size argument is removed from PdlXLogShmemDump, because the caller can
calculate this value from the function's return value.

Ticket: ADBDEV-7303
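
A rough model of the per-backend lists is sketched below. It assumes ordinary heap memory instead of shared memory and omits locking entirely; the names (ToyPdlList, toy_pdl_add, toy_pdl_remove, toy_pdl_dump, toy_pdl_cleanup) are hypothetical stand-ins and do not reflect the signatures of the PdlShmem* functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct ToyPdlNode
{
    struct ToyPdlNode *prev;
    struct ToyPdlNode *next;
    uint32_t relfilenode;
    uint32_t xid;
} ToyPdlNode;

/* One list per backend. */
typedef struct ToyPdlList
{
    ToyPdlNode *head;
} ToyPdlList;

/* Push a (relfilenode, xid) pair; the returned node is the removal handle. */
static ToyPdlNode *toy_pdl_add(ToyPdlList *list, uint32_t relfilenode, uint32_t xid)
{
    ToyPdlNode *n = malloc(sizeof *n);

    if (n == NULL)
        return NULL;
    n->relfilenode = relfilenode;
    n->xid = xid;
    n->prev = NULL;
    n->next = list->head;
    if (list->head != NULL)
        list->head->prev = n;
    list->head = n;
    return n;
}

/* Unlink in O(1) thanks to the back pointer, then release the node. */
static void toy_pdl_remove(ToyPdlList *list, ToyPdlNode *n)
{
    if (n->prev != NULL)
        n->prev->next = n->next;
    else
        list->head = n->next;
    if (n->next != NULL)
        n->next->prev = n->prev;
    free(n);
}

/* Flatten all backends' lists into out; returns the number of entries written. */
static size_t toy_pdl_dump(const ToyPdlList *lists, size_t nlists,
                           uint32_t *out, size_t out_cap)
{
    size_t n = 0;

    for (size_t i = 0; i < nlists; i++)
        for (const ToyPdlNode *p = lists[i].head; p != NULL && n < out_cap; p = p->next)
            out[n++] = p->relfilenode;
    return n;
}

/* Backend exit: drop everything the backend still had registered. */
static void toy_pdl_cleanup(ToyPdlList *list)
{
    while (list->head != NULL)
        toy_pdl_remove(list, list->head);
}
```

Returning the node from the add call is what lets the caller keep a handle and later remove the entry without searching the list.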
Problem description:
XLOG_SMGR_CREATE WAL record doesn't contain information about relstorage for
the created relation. Orphaned files removal feature requires knowing the
relstorage info, otherwise it can't properly handle the removal of all orphaned
files for AO tables.

Fix:
To store the relation's relstorage in the xlog record and keep backward
compatibility with previous versions, we introduce a new xlog record type,
XLOG_SMGR_CREATE_PDL. This new record type contains info about relstorage and
is used instead of XLOG_SMGR_CREATE.
Plus, this patch updates 'log_smgrcreate()' - now it creates
XLOG_SMGR_CREATE_PDL record and flushes it right after the creation (otherwise,
in case of a crash after file creation, the file may be orphaned).

No special tests are presented in this patch, as the added functionality will
be tested later together with other parts for the orphaned file removal feature.
At this point, it is enough to pass the current standard test set.
Extend the PendingRelDelete structure. Now it stores the shmemPtr pointer to
the corresponding shared pending deletes list node. The pointer is filled in
RelationCreateStorage as a return value of PdlShmemAdd. The pointer is set as
invalid in RelationDropStorage, because storage dropping can't lead to orphaned
relfilenode.
Replace pfree for the pendingDeletes list entry with a new PendingRelDeleteFree
function which calls PdlShmemRemove before pfree when shmemPtr is valid.
Add initialization of shared memory for the storage_pending_deletes module.
Increase number of LWLocks by MaxBackends to add LWLocks for the module.
Fix Assert in the dsm_create function, because the module is used in the
stand-alone mode, for example, when initdb runs 'postgres --single'.

No special tests are presented in this patch, as the added functionality will
be tested later together with other parts for the orphaned file removal feature.
At this point, it is enough to pass the current standard test set.

Ticket: ADBDEV-7410
Tests are not added, because xlog_desc is used for debug purposes only.

Ticket: ADBDEV-7409
…1547)

When a table was created right after the transaction began, the
XLOG_SMGR_CREATE_PDL record was added with InvalidTransactionId, because
XLogInsert calls GetCurrentTransactionIdIfAny and the transaction id had not
been assigned at that moment.

Get the transaction id before the log_smgrcreate function is called.

No special tests are presented in this patch, as the added functionality will
be tested later together with other parts for the orphaned file removal feature.
At this point, it is enough to pass the current standard test set.

Ticket: ADBDEV-7458
Problem description:
After the following scenario:
1. primary segment started a transaction;
2. primary segment created a relation;
3. primary segment did a checkpoint;
4. mirror created a restartpoint by timeout;
5. both primary and mirror crashed and then recovered;

the primary has removed the orphaned file, but the mirror has not.

Root cause:
On step 5, the mirror started replaying WAL from the restartpoint, so it never
encountered the XLOG_SMGR_CREATE_PDL record (which was written at the moment of
table creation, before the restartpoint). The information about the table's
relfilenode is also stored in the XLOG_PENDING_DELETE WAL record, but the
mirror skipped the processing of that record.

Fix:
Enable processing of the XLOG_PENDING_DELETE record by the mirror. This record
is created on each checkpoint, so if the mirror starts from a restartpoint, it
will be one of the first records to replay. The mirror will obtain the
relfilenode information from this record and will remove the orphaned file.
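
The root cause and the fix can be illustrated with a toy replay loop. This is only a sketch: the WAL is modelled as an array, each pending-delete record is assumed to carry a single relfilenode, and all names (ToyWalRecord, toy_replay_count_orphans, the TOY_REC_* values) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TRACKED 16      /* assumed capacity for the sketch */

typedef enum
{
    TOY_REC_CREATE,             /* stands in for XLOG_SMGR_CREATE_PDL */
    TOY_REC_PENDING_DELETE,     /* stands in for XLOG_PENDING_DELETE */
    TOY_REC_COMMIT
} ToyRecType;

typedef struct ToyWalRecord
{
    ToyRecType type;
    uint32_t   relfilenode;
    uint32_t   xid;
} ToyWalRecord;

typedef struct ToyTracked
{
    uint32_t relfilenode;
    uint32_t xid;
} ToyTracked;

/*
 * Replay wal[start..nrecs) and return how many files are still tracked as
 * orphan candidates at the end. When replay_pending_delete is false, the
 * pending-delete records are skipped, as the mirror did before the fix.
 */
static size_t toy_replay_count_orphans(const ToyWalRecord *wal, size_t nrecs,
                                       size_t start, bool replay_pending_delete)
{
    ToyTracked tracked[TOY_MAX_TRACKED];
    size_t ntracked = 0;

    for (size_t i = start; i < nrecs; i++)
    {
        const ToyWalRecord *rec = &wal[i];
        bool remember = rec->type == TOY_REC_CREATE ||
            (rec->type == TOY_REC_PENDING_DELETE && replay_pending_delete);

        if (remember)
        {
            bool known = false;

            for (size_t j = 0; j < ntracked; j++)
                if (tracked[j].relfilenode == rec->relfilenode)
                    known = true;
            if (!known && ntracked < TOY_MAX_TRACKED)
            {
                tracked[ntracked].relfilenode = rec->relfilenode;
                tracked[ntracked].xid = rec->xid;
                ntracked++;
            }
        }
        else if (rec->type == TOY_REC_COMMIT)
        {
            size_t kept = 0;

            /* Keep only entries that belong to other transactions. */
            for (size_t j = 0; j < ntracked; j++)
                if (tracked[j].xid != rec->xid)
                    tracked[kept++] = tracked[j];
            ntracked = kept;
        }
    }
    return ntracked;
}
```

Starting the replay after the create record models a restartpoint taken after table creation: with pending-delete records skipped the orphan goes unnoticed, and with them replayed it is found.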
This patch:
 - Adds cleanup of orphaned files to `StartupXLOG()`.
 - Adds the function RemovePendingDeletesForPreparedTransactions(), which
removes the xids of prepared transactions from the redo pending deletes before
the orphaned files cleanup is performed. The orphaned files removal feature
must take prepared transactions into account this way, because otherwise some
crash scenarios (e.g. the 'crash_recovery_dtm' isolation2 test) would drop
files that shouldn't be dropped.

No special tests are added. For now, it is enough to pass the standard test set.

Plus, abi-check is fixed.
The tests check that no orphaned files are left for any access method, before
and after checkpoint, on segments and on coordinator.

Ticket: ADBDEV-7305
…ns (#1737)

This patch introduces additional test cases for the orphaned files removal
feature, covering the case where the primary segment goes down completely
without immediate crash recovery and mirror promotion happens.

Plus, this patch updates the ignore file for abi check, as the baseline has
changed.