[MVP] gprebalance#1198

Draft
bimboterminator1 wants to merge 20 commits into adb-7.2.0 from feature/ADBDEV-6608

@bimboterminator1
Member

MVP for the gprebalance utility

bimboterminator1 and others added 20 commits December 20, 2024 05:53
Implement cluster validation possibility

This is the first commit towards an MVP of the new rebalance utility,
gprebalance. The utility is intended for cases where the cluster is left in an
unbalanced state after a resize (expand or shrink). The balanced state
is defined very simply: the cluster is balanced if the number of segments per
host is equal across all hosts. A proper implementation of an optimal rebalance
algorithm involves many other aspects, which will be implemented in
subsequent patches.
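
The balance criterion above can be illustrated with a minimal sketch. The function name `is_balanced` and its inputs (a list with the host of each segment, plus the full host list) are assumptions made for illustration and are not taken from the patch.

```python
from collections import Counter


def is_balanced(segment_hosts, hosts):
    """Return True if every host in `hosts` carries the same number of segments.

    segment_hosts: one host name per segment (hypothetical input format).
    hosts: all hosts of the cluster, so that empty hosts count as zero.
    """
    counts = Counter(segment_hosts)
    per_host = {counts.get(host, 0) for host in hosts}
    return len(per_host) <= 1


balanced = is_balanced(["sdw1", "sdw1", "sdw2", "sdw2"], ["sdw1", "sdw2"])
skewed = is_balanced(["sdw1", "sdw1", "sdw1", "sdw2"], ["sdw1", "sdw2"])
```

Passing the host list explicitly matters after an expand: a newly added host with no segments yet makes the cluster unbalanced even if the remaining hosts are even.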

This patch adds the skeleton of the future utility and provides initial
validation of whether a rebalance is possible. It includes checks for some basic
aspects: whether segments can be distributed uniformly and whether the target
mirroring strategy can be achieved. Validation is provided through separate
classes, which differs from the approach taken in the gpexpand utility. Some
unit tests have also been added. Validation of available disk space is not
implemented, since it cannot be performed at this initial validation step.

The gprebalance skeleton is complemented with additional options from the MVP
specification.

This code proposes the rebalance algorithm. GpRebalance.createPlan() returns a
Plan represented by a list of Moves. The algorithm itself produces an
intuitive greedy solution by manually setting the final balanced state.

The proposed code contains the main framework for rebalance execution.
Some options are not fully implemented and are expected to be finished in
subsequent tasks.

The code implements the following segment movement approach. First, we create
a movement plan: simple steps stating which segment moves to which host.
The steps in a plan can be of different types:

1. Mirror-only moves.
2. Both primary and mirror are moved to different hosts.
3. Primary-only moves.
4. Primary and mirror are swapped.

For each type of movement we determine target dirs and ports on the target
hosts that have enough space for the moved segment. The DiskFree and DiskUsage
commands are used for this.

The movements, in turn, are composite and imply extra actions, including
segment role switching:

1. Mirror-only moves use a single gprecoverseg call to perform the movement.
2. If we move a primary and mirror pair, the strategy is as follows. The mirror
is first moved via gprecoverseg to the primary's target host. Then the roles are
switched. Then the ex-primary (new mirror) is moved to the mirror's target host.
3. Primary-only moves imply 2 role switches: switch, move, switch.
4. A primary-mirror swap is executed similarly to the 2nd type. The mirror is
moved to the primary dir on its own host. Switch. The ex-primary is moved to the
mirror dir on its own host.

The status management is written in general terms and may contain errors.
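
The four move types described above decompose into ordered actions. A minimal sketch of that decomposition follows; the `MoveType` enum, the `expand_move` function and the action strings are hypothetical names used only for illustration, not the classes from the patch.

```python
from enum import Enum


class MoveType(Enum):
    MIRROR_ONLY = 1
    PRIMARY_AND_MIRROR = 2
    PRIMARY_ONLY = 3
    SWAP = 4


def expand_move(move_type):
    """Return the ordered list of actions implied by a move type."""
    if move_type is MoveType.MIRROR_ONLY:
        # A single gprecoverseg call relocates the mirror.
        return ["move mirror"]
    if move_type is MoveType.PRIMARY_AND_MIRROR:
        return [
            "move mirror to primary's target host",
            "switch roles",
            "move ex-primary to mirror's target host",
        ]
    if move_type is MoveType.PRIMARY_ONLY:
        # Two role switches surround the move: switch, move, switch.
        return ["switch roles", "move ex-primary", "switch roles"]
    if move_type is MoveType.SWAP:
        return [
            "move mirror to primary dir",
            "switch roles",
            "move ex-primary to mirror dir",
        ]
    raise ValueError(f"unknown move type: {move_type}")


primary_only_actions = expand_move(MoveType.PRIMARY_ONLY)
```

In every composite type the segment being copied holds the mirror role at the moment it moves, which is why the primary-only case needs a role switch before and after the move.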

Cleanup is prepared by RekGRpth

Co-authored-by: Georgy Shelkovy <g.shelkovy@arenadata.io>

This PR introduces the rollback handler in the gprebalance MVP. The rollback
function creates a new plan of movements by calculating the difference between
the current configuration and the original state loaded from the previously
pickled plan.
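
The rollback idea can be sketched as a plain difference between two configurations. The data layout below (a dict mapping dbid to a host and datadir pair), the function name `build_rollback_moves`, and the sample hosts and paths are assumptions for illustration; the real pickled plan format is not shown in this PR description.

```python
import pickle


def build_rollback_moves(original, current):
    """Return (dbid, current_location, original_location) for moved segments."""
    moves = []
    for dbid in sorted(original):
        if current.get(dbid) != original[dbid]:
            moves.append((dbid, current.get(dbid), original[dbid]))
    return moves


# The original state is assumed to have been pickled before the rebalance.
saved = pickle.dumps({
    2: ("sdw1", "/data/primary/gpseg0"),
    3: ("sdw2", "/data/mirror/gpseg0"),
})
original = pickle.loads(saved)
current = {
    2: ("sdw1", "/data/primary/gpseg0"),
    3: ("sdw3", "/data/mirror/gpseg0"),
}
rollback = build_rollback_moves(original, current)
```

Only dbid 3 differs from the saved state here, so the rollback plan contains a single move back to sdw2.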
The changes in this patch provide a prototype for status tracking of mirror
moves during rebalance. First, this patch removes the use of a gpdb table for
the overall execution status. Second, the status manager is rewritten to
track the execution process with a status file only. If a movement step,
represented by a gprecoverseg process, fails, the corresponding status (FAILED)
is first written to the internal status struct and then flushed to disk.

Another main purpose of these changes is to implement determination of the
gprecoverseg state. The code in analyze_gprecoverseg_states() tries to implement
the SRS diagram for gprecoverseg status definition. It handles the following
scenarios:
1. A mirror move failed after pg_hba.conf had been updated on the primary. In
this case the primary marks the mirror as down.
2. A mirror move failed after gp_segment_configuration had been updated. Here
our code tries to determine whether pg_basebackup was executed successfully or
not.

Depending on the basebackup state, the algorithm tries either to start up the
backed-up mirror or to roll back the configuration changes and recover the old
mirror.

Problem description:
There was no means of providing the segment shrink feature to the 'gprebalance'
tool.

Fix:
Add a new command 'ALTER TABLE <table_name> REBALANCE' (MVP level). Details:
1. 'ALTER TABLE <table_name> REBALANCE' supports an optional parameter: the
target number of segments (e.g. 'ALTER TABLE <table_name> REBALANCE 2;').
2. If the target number of segments is greater than the number of segments in
the table's distribution policy, the rebalance command invokes the existing
functionality of 'ALTER TABLE <table_name> EXPAND TABLE' (meaning that the
expand is always done to the current number of segments in the cluster, even if
a smaller number was specified).
3. If the target number of segments is less than the number of segments in the
table's distribution policy, the table is shrunk to the target number
of segments. For hashed or randomly distributed tables, data from the excess
segments is inserted into the target segments, and then for all table types the
distribution policy is updated for the target number of segments. Data on the
excess segments is not removed (we do not want to spend time on it, as most
likely they will be excluded from the cluster soon anyway).
4. A new GUC 'gp_target_numsegments' is added. If the target number of segments
is not specified for the 'ALTER TABLE <table_name> REBALANCE' command, the value
of 'gp_target_numsegments' is used.
5. If 'gp_target_numsegments' is set, all new tables are created using this
number of segments.
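
The decision rules listed above can be summarized in a short sketch. It models only the choice between expand and shrink, not the server-side C code. The function name `plan_table_rebalance`, the "noop" result for an equal segment count, and the error raised when no target is available are assumptions that the description above does not specify.

```python
def plan_table_rebalance(policy_numsegments, cluster_numsegments,
                         target=None, gp_target_numsegments=None):
    """Return (action, resulting_numsegments) for one table."""
    if target is None:
        # Rule 4: fall back to the GUC when the command has no parameter.
        target = gp_target_numsegments
    if target is None:
        # Assumption: the description does not cover this case.
        raise ValueError("no target number of segments given")
    if target > policy_numsegments:
        # Rule 2: expand always goes to the full cluster size.
        return ("expand", cluster_numsegments)
    if target < policy_numsegments:
        # Rule 3: shrink to exactly the requested number of segments.
        return ("shrink", target)
    # Assumption: an equal target leaves the table untouched.
    return ("noop", policy_numsegments)


expand_case = plan_table_rebalance(2, 4, target=3)
shrink_case = plan_table_rebalance(4, 4, gp_target_numsegments=2)
```

Note the asymmetry in the expand case: a table with 2 segments and a target of 3 in a 4-segment cluster ends up on 4 segments, because the existing EXPAND TABLE functionality is reused.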
Commit 5b3f506 introduced the new command ALTER
TABLE REBALANCE with shrink support. The target number of segments (if not
specified in the ALTER command) is taken from the
GP_POLICY_DEFAULT_NUMSEGMENTS() macro. Therefore, we need a way to set and
maintain the number of segments used for table creation across all backends.

This patch introduces a mechanism for managing the default number of segments
used in table creation during a rebalance operation in GPDB. A new shared
variable, gp_create_table_rebalance_numsegments, is introduced in gpexpand.h to
track the number of segments to use during table creation while a rebalancing
operation is in progress. The shared variable is initialized in shared memory
with the appropriate size and getter functionality.

Corresponding SQL functions are created in the gp_toolkit extension.
The system now checks whether a rebalancing operation is active by verifying
locks before allowing modifications to the number of segments. If a lock is not
already acquired in the current transaction (indicating that no rebalancing is
underway), an appropriate error message is returned.

Tests from 5b3f506 are updated to support the new functionality.

The gp_debug_numsegments extension preserves its behaviour, but modifying the
local numsegments value is disallowed when gp_create_table_rebalance_numsegments
is set.
This patch implements a state machine skeleton for a basic shrink scenario based
on the 'transitions' library. It consists of a new 'ggrebalance' tool, which
will be a single entry point for shrink, expand, and cluster rebalance
functionality, and 'shrink.py', which contains the state machine itself with the
shrink logic.

The main purpose of this half-MVP is to evaluate the suitability of the state
machine pattern. Therefore it implements only a limited set of the shrink
requirements, enough to support a basic shrink workflow.
This patch adds a check for the scenario in which the cluster is restarted while
ggrebalance is interrupted. In this case the shared variable
gp_rebalance_numsegments is unset, and new tables may be created with the old
segment count. Thus, when recovering the shrink process, the
STATE_CHECK_PREVIOUS_RUN callback calls the get_state_after_interrupt()
function, which checks for this situation. If the cluster has been restarted,
the state machine transitions to the
STATE_BACKUP_CATALOG_AND_UPDATE_TARGET_SEGMENT_COUNT_STARTED state.

The interface for the gp_rebalance_numsegments variable is extended with the
gp_rebalance_numsegments_is_set() SQL function, which provides a convenient way
to monitor the variable's status. Previously, a comparison with the INT_MAX
value was required.

Additionally, the fault injection interface is restored in the behave tests to
cause workflow interruptions. The behave test utility code is also adjusted to
support some of the shrink scenarios. The code related to table population
is fixed so that it follows the declared semantics. The gpaddmirrors test is
updated as well.

Co-Authored-By: Roman Eskin <r.eskin@arenadata.io>

In this patch:
1. The new option '--clean' is added for the cluster shrink by the ggrebalance
tool.
2. The new option '--rollback' is added for the cluster shrink by the
ggrebalance tool.
3. The new option '--non-interactive-mode' is added for the ggrebalance tool. It
is essential to allow auto testing of some cleanup scenarios that would expect
user confirmation without such an option.
4. As the existing 'main' and the new 'rollback' shrink workflows use similar
functionality, the shrink code is reorganized to reduce code duplication:
a. New functions that are used in both 'main' and 'rollback' workflows are
introduced (like 'prepare_shrink_schema()', 'rebalance_tables()').
b. All logic related to the ggrebalance schema handling is moved to a separate
class named 'RebalanceSchema' in 'rebalance_commons.py'.
5. A new entity, 'Plan', is added. It is used to pass information about the
required shrink configuration of the target cluster to the shrink engine. It is
stored in the rebalance schema and used for the 'rollback' workflow and when
recovering from an interrupted shrink state. It is added for the following
reasons:
a. As stated above, we need it during rollback. When the user starts the
rollback operation, they don't specify the target segment count that was used
in the preceding shrink operation. Thus we need to store this information at
shrink time for later use.
b. When the user tries to re-enter the shrink procedure from an interrupted
state, we need to restart with the same target segment count that was specified
originally. Otherwise the cluster may end up in an invalid configuration where
tables are shrunk to different segment counts. Letting the user specify the
target segment count on re-entry opens the way for such error-prone scenarios.
So we simply forbid specifying the segment count configuration when re-entering
the interrupted state or starting a rollback, and use the saved plan
information from the very first operation start.
c. According to the current design, a later phase will introduce a Planner
entity that performs planning for all shrink/expand/rebalance operations.
Its output Plan will be the input to the shrink engine, so this change is
aligned with the overall design.
6. New behave test cases are added. The test cases cover not only the 'cleanup'
and 'rollback' flows but also the existing 'main' shrink flow, as we can't
guarantee the correctness of rollback without proving that the 'main' flow works
correctly. The existing test case is renamed to 'test 2.4' and moved next to the
new tests that cover similar functionality.
7. New steps are added to mgmt_utils.py to verify that the shrunk segments are
actually down. Also, a small change is made in 'SegmentIsShutDown': it is
required to check that the mirror is down.
8. To recover properly when interrupted in the middle of stopping shrunk
segments, a new class 'SegmentStopAfterShrink' is introduced. It wraps
'SegmentStop' with a check of whether the segment is actually still
running. Without it, if shrink was re-entered and some segments had already been
shut down by the preceding interrupted launch, we got an error when trying to
shut down such segments.
This patch adds the foundations of the shrink/rebalance planner. Some extra
planning details and proper integration of the planning stage into the
ggrebalance state machine will be considered in separate tickets.

The main feature of the provided code is an abstract balancing algorithm that
performs manual primary/mirror host assignment following a greedy strategy.
In short, the algorithm consists of several phases:

1) Primary assignment. Sort segments by relocation priority: first, must-move
segments, i.e. those located on decommissioned hosts, encoded in initial_primary
as indexes >= n_target_hosts; then moves from overloaded to underloaded hosts.
Assign each segment to the least-loaded host, preferring the original placement
when possible.

2) Mirror assignment. It follows simple logic: prefer the original
mirror hosts, and use the least-loaded mirror hosts.

3) Optional improvement. Uses adaptive large neighborhood search, in which we
try to build nearby solutions by destroying and reassigning parts of the initial
one. It is quite volatile, but in some cases it can yield a better solution. It
is proposed for use in the ggrebalance utility. Reentrancy could be achieved by
saving the first plan in the database.
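
The primary assignment phase can be illustrated with a simplified greedy sketch. It is not the planner from this patch: the function name `assign_primaries` and the tie-breaking rules are assumptions, and only the `initial_primary` / `n_target_hosts` encoding is taken from the description above.

```python
def assign_primaries(initial_primary, n_target_hosts):
    """Greedy primary placement.

    initial_primary[s] is the host index of segment s; indexes
    >= n_target_hosts denote decommissioned hosts (must-move segments).
    Returns the new host index for every segment.
    """
    assignment = list(initial_primary)
    load = [0] * n_target_hosts
    for host in assignment:
        if host < n_target_hosts:
            load[host] += 1

    def least_loaded():
        return min(range(n_target_hosts), key=lambda h: (load[h], h))

    # Must-move segments first: place each on the least-loaded target host.
    for seg, host in enumerate(assignment):
        if host >= n_target_hosts:
            dst = least_loaded()
            assignment[seg] = dst
            load[dst] += 1

    # Then move from overloaded to underloaded hosts until loads differ by <= 1.
    while max(load) - min(load) > 1:
        src = max(range(n_target_hosts), key=lambda h: (load[h], -h))
        dst = least_loaded()
        seg = max(s for s, h in enumerate(assignment) if h == src)
        assignment[seg] = dst
        load[src] -= 1
        load[dst] += 1
    return assignment


# Host 2 is decommissioned; its two segments must move.
shrunk = assign_primaries([0, 0, 0, 1, 2, 2], n_target_hosts=2)
```

Segments that already sit on a target host stay in place unless their host is overloaded, which mirrors the "preferring the original placement" rule.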

Unit tests are moved from gppylib into gprebalance_modules to achieve
better test granularity and to make it possible to import separate modules.

This patch implements the following changes:

1. Support for IP addresses in the 'target-hosts', 'add-hosts' and
'remove-hosts' options is added. Their validation requires hostname resolution,
so the HostResolver() class is added to rebalance_commons.py. Without
validation, an IP address passed through the options may correspond to an
existing host but be interpreted by ggrebalance as a new one.

2. Support for hosts files is added.

3. Target directory handling is reworked. The TemplateParser() class is added
to support several placeholders. Now, if the 'target-datadirs' option is not
passed, all moves use the default template directories as targets.

4. Port planning is added in a simple form (since network communication
would be unnecessary overhead here) via the PortAllocator() class. It forms port
patterns per host and per segment type and assigns them incrementally to moves.

5. Storage estimation is implemented using the DiskUsage and DiskFree commands.
Source datadirs and tablespaces are taken into account, and main datadirs and
tablespaces are validated for available disk space on the corresponding
filesystems.

Corresponding unit tests are added for basic scenarios.
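
The incremental port planning from item 4 can be sketched as follows. `SimplePortAllocator` is a hypothetical stand-in rather than the PortAllocator() class from the patch, and the base ports 6000 and 7000 are illustrative values only.

```python
class SimplePortAllocator:
    """Hand out ports incrementally, per host and per segment role."""

    def __init__(self, base_ports):
        # base_ports maps a role ("primary" / "mirror") to its first port.
        self._base = dict(base_ports)
        self._next = {}

    def allocate(self, host, role):
        key = (host, role)
        port = self._next.get(key, self._base[role])
        self._next[key] = port + 1
        return port


allocator = SimplePortAllocator({"primary": 6000, "mirror": 7000})
first = allocator.allocate("sdw1", "primary")
second = allocator.allocate("sdw1", "primary")
other_host = allocator.allocate("sdw2", "primary")
mirror_port = allocator.allocate("sdw1", "mirror")
```

The sketch performs no network checks, in line with the stated wish to avoid network communication during planning; whether a port is actually free is not verified here.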
List of changes:
1. This patch adds rebalance functionality. The main part of the related logic
is located in the 'RebalanceSM' class. Rebalance is performed according to the
list of moves from the supplied plan and includes the following steps:
 - move (via gpmovemirrors) all mirrors from the list of moves;
 - for all primaries from the list of moves switch them with their mirrors;
 - move (via gpmovemirrors) all these segments which were primaries;
 - switch all these segments back to primaries roles.
2. As the rebalance functionality should be correctly coordinated with the
existing shrink logic, this patch adds the high level state-machine
implementation in 'GGRebalanceMainSM' class. It is responsible for proper flow
of high level states like planning, rebalance schema creation and deletion,
invocation of shrink and rebalance execution, invocation of cleanup and shrink.
Therefore:
 - some states and logic are moved from the existing shrink state-machine to
 'GGRebalanceMainSM';
 - temp code is removed from the planner;
 - code in 'ggrebalance' is updated to call only 'GGRebalanceMainSM', that will
 do the rest.
3. As now we need to handle states from shrink, rebalance and main
state-machines, 'RebalanceSchema' code is updated to store and access these
state categories.
4. New behave tests for rebalance functionality are added. As the ggrebalance
test suite became too large and took too long to execute, it is split into 3
files:
 - 'ggrebalance_basics.feature' - contains the existing basic checks from the
 old file;
 - 'ggrebalance_shrink.feature' - contains the existing checks for shrink from
 the old file;
 - 'ggrebalance_rebalance.feature' - contains the new tests for the rebalance.

Also, some notes about changes related to tests:
 - The old test named 'test 2.2. shrink' is merged into the test with the new
 name 'test 1.3. shrink', as the new top-level state machine now allows shrink
 execution to continue in this test case;
 - A new step definition is added to 'mgmt_utils.py' that returns the
 number of segments which satisfy a certain condition. It is used in the new
 tests.
 - A new step definition is added to 'mgmt_utils.py' that sets a delay
 for a fault to happen. The respective changes are added to the fault
 injector code. It is used in the new tests that check interruption during
 the work of gpmovemirrors or gprecoverseg.
Problem description:
The rebalance execution flow needs to be updated so that it supports parallel
segment movement, while taking into account the following
limitations:
 - ggrebalance should save every move step and its status in persistent
storage so that failed steps can be retried, rolled back or cancelled (rollback,
retry or cancel of a particular movement will be implemented later in a separate
patch);
 - switchover actions (primary to mirror, mirror to primary) will require user
approval once we implement interactive mode (later, in a separate patch);
 - ggrebalance should respect the order of the planned movements in the
primary-mirror swap scenario that uses a third, intermediate (transitional)
host. This means that the executor can't swap the order of mirror and primary
movements.

Therefore, this patch:
1. Adds a RebalanceStep entity that contains the execution state together
with the movement definition. The list of such steps is now saved to the
rebalance schema.
2. Updates the state machine of the rebalance execution. New states are added in
which approval will later be requested from the user. The state machine
can switch between segment processing and approval requests as many times as
required, until all steps are processed. Rebalance steps are executed in
batches. Each batch consists of rebalance steps of the same type, without
duplicate dbids.
3. Updates the code to use the '--parallel' option to configure
'gpmovemirrors'/'gprecoverseg'.
4. Updates behave tests according to the changes described above.
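
The batching rule (same step type within a batch, no repeated dbid, planned order preserved) can be sketched as below. The function name `make_batches`, the step type strings and the tuple representation of a step are assumptions for illustration; the real RebalanceStep entity is not shown here.

```python
def make_batches(steps):
    """Group ordered (step_type, dbid) pairs into consecutive batches.

    A new batch starts whenever the step type changes or a dbid would
    appear twice in the current batch, so the planned order is preserved.
    """
    batches = []
    current, seen, current_type = [], set(), None
    for step_type, dbid in steps:
        if current and (step_type != current_type or dbid in seen):
            batches.append(current)
            current, seen = [], set()
        current_type = step_type
        current.append((step_type, dbid))
        seen.add(dbid)
    if current:
        batches.append(current)
    return batches


plan = [
    ("move_mirror", 2),
    ("move_mirror", 3),
    ("switch", 2),
    ("move_mirror", 2),
    ("move_mirror", 2),
]
batches = make_batches(plan)
```

Steps inside one batch touch distinct dbids and are of one type, so they are candidates for parallel execution, while the sequence of batches keeps the mirror and primary movements in their planned order.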
This patch adds a new 'ggrebalance_misc_options' test suite, which currently
has checks for:
1. '--target-hosts-file' option;
2. '--target-hosts' option;
3. '--target-datadirs-file' option;
4. '--target-datadirs' option;
5. '--mirror-mode' option;
6. '--add-hosts-file' option;
7. '--remove-hosts-file' option;
8. scenario with no mirrors in the cluster;
9. scenario when the cluster can't be rebalanced with the given parameters;
10. scenario when the cluster is in coordinator-only mode;
11. scenario when another instance of ggrebalance is running;
12. scenario when another conflicting tool is running.

Also, this patch updates and adds some new step definitions required by the
new tests. A notable change: we can now bring up a test cluster with a
configurable number of segments (previously it was hardcoded to 2 segments).

This patch also adds a set of small fixes in the ggrebalance code to support the
tested scenarios:
 - Move the validation that the cluster has mirrors to an earlier stage.
Without this check, ggrebalance crashed when accessing the non-existent
mirror information before it actually checked for the mirrors' presence.
 - Fix the function 'get_hosts_from_file()'. Before this change, it split each
hostname into letters (e.g., instead of 'sdw1', it returned 4 hosts:
's', 'd', 'w', '1'). Also, add a validation that the file is not empty.
 - Add checks for the 'gpexpand' and 'pg_basebackup' tools running in parallel.
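
The hostname-splitting symptom described above typically appears when a string is treated as a sequence of characters. The sketch below shows one way such a bug can arise and a corrected reader; `read_hosts` is a hypothetical helper, not the actual 'get_hosts_from_file()' code, and the real cause in ggrebalance may differ.

```python
def read_hosts(lines):
    """Return the non-empty host names from an iterable of lines."""
    hosts = []
    for line in lines:
        name = line.strip()
        if name:
            # append() keeps the name whole; extend(name) would add
            # each character as a separate host.
            hosts.append(name)
    if not hosts:
        raise ValueError("hosts file is empty")
    return hosts


# How the symptom can be reproduced: extending a list with a string.
buggy = []
buggy.extend("sdw1")

hosts = read_hosts(["sdw1\n", "\n", "sdw2\n"])
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.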
This patch adds support for the following options:

 - '--hba-hostnames'
It determines whether to use hostnames in pg_hba.conf. Passed directly to
'gpmovemirrors' tool.

 - '--replay-lag <replay_lag>'
It determines replay lag (in GBs) allowed on the mirror when rebalancing the
segments. Passed directly to the 'gprecoverseg' tool.

 - '--log-dir <log_dir>'
It determines the directory to store logs of the tool and all tools that are
called by it.

 - '--analyze'
It determines whether to run ANALYZE after rebalancing table redistribution.

Also, this patch adds:
 - tests for the mentioned options;
 - definition of new steps required by the tests;
 - a small fix in the 'gpmovemirrors' tool to support log-dir with spaces in the
name;
 - a definition of STATE_ERROR in the rebalance executor SM.
Problem description:
Attempts to rebalance a materialized view via the
'ALTER MATERIALIZED VIEW ... REBALANCE' command (or via
'ALTER TABLE ... REBALANCE', which works equivalently for materialized views)
ended with an error:
'ERROR:  cannot change materialized view ...'

Root cause:
The table rebalance logic tried to insert the data directly into the
materialized view as if it were an ordinary table. This is prohibited for
materialized views.

Fix:
Skip the call to 'ATExecShrinkTable()' for materialized views, so that during
'ALTER ... REBALANCE' only the distribution policy of the materialized view is
updated. The user needs to run 'REFRESH MATERIALIZED VIEW ...' after
the rebalance.