PerfEvolve/perfevolve_document.yaml at main · ISCAS-OSLab/PerfEvolve · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
title: 'PerfEvolve: Procedural Database Configuration Tuning Guide'
version: '1.0'
target_system: PostgreSQL 16
description: This document encodes a systematic methodology for tuning PostgreSQL configuration parameters.  Unlike traditional
  documentation that gives fixed recommendations, this guide teaches the PROCESS of profiling, interaction discovery, and
  joint optimization.  Follow the phases in order; each step includes executable procedures, decision criteria, and reference
  data from 8,500+ controlled experiments for calibration.
phases:
- phase_id: '1'
  name: Diagnostics & Single-Knob Profiling
  goal: Diagnose the target environment, establish a performance baseline, and identify which knobs have significant impact
    on the target workload.
  steps:
  - step_id: '1.1'
    action: Verify environment readiness
    why: An unstable or misconfigured environment produces noisy benchmarks. Catching issues before profiling avoids wasted
      experiments.
    procedure:
    - Check that PostgreSQL is running and accepting connections.
    - Verify sufficient disk space (>20 GB free for benchmark data).
    - Confirm no other significant workloads are running on the host.
    - 'Check OS-level settings: transparent huge pages, swappiness, I/O scheduler.'
    - Validate benchmark tool connectivity (e.g., run a trivial 1-second benchmark).
    - 'Record system resource snapshot: CPU load, memory usage, I/O utilization.'
    requires_benchmark: false
    skill_type: diagnostic
    preconditions:
    - Target PostgreSQL instance is accessible
    - Benchmark tool is installed and configured
    postconditions:
    - 'Environment validated: no resource contention detected'
    - Benchmark connectivity confirmed
    reference_data:
      description: Expected environment state for reliable benchmarking.
      notes:
      - CPU load should be < 10% before starting benchmark runs.
      - Swappiness=1 and THP=madvise are recommended for PostgreSQL.
      - 'I/O scheduler: ''none'' or ''mq-deadline'' for NVMe SSDs.'
    output_keys:
    - env_status
    - system_snapshot
  - step_id: '1.2'
    action: Verify PostgreSQL version and configuration compatibility
    why: Parameter names, defaults, and behaviors change across PostgreSQL major versions. The procedural document's reference
      data is version-specific; a mismatch invalidates sensitivity rankings and safe ranges.
    procedure:
    - Query SELECT version() and extract major.minor version.
    - Compare against the document's target_system version.
    - 'If major version differs: WARN — reference data may not apply. Proceed with caution; re-profiling may be necessary.'
    - 'List all non-default parameters: SELECT name, setting FROM pg_settings WHERE setting != boot_val. Record as starting
      configuration.'
    - Check for extensions that affect performance (pg_stat_statements, pg_prewarm, timescaledb, etc.).
    requires_benchmark: false
    skill_type: diagnostic
    preconditions:
    - Environment precheck passed (1.1)
    postconditions:
    - PostgreSQL version confirmed compatible with document
    - Starting configuration snapshot recorded
    decision_criteria:
      condition: Major version matches document target
      action_if_true: Proceed — reference data is directly applicable.
      action_if_false: 'WARN: version mismatch. Reference data is advisory only; consider re-running offline profiling for
        this version.'
    reference_data:
      description: Version-specific parameter changes.
      notes:
      - 'PG14→15: changed default of hash_mem_multiplier from 1.0 to 2.0.'
      - 'PG15→16: added new GUCs; some planner cost defaults changed.'
      - 'Extension overhead: pg_stat_statements adds ~2-5% overhead on OLTP.'
    output_keys:
    - pg_version
    - pg_major
    - starting_config
    - loaded_extensions
  - step_id: '1.3'
    action: Profile hardware characteristics
    why: Hardware capabilities directly determine optimal parameter values. SSD vs HDD changes I/O knob interactions by up
      to 5x; RAM size dictates memory knob ranges.
    procedure:
    - 'Detect storage type: run a random read latency test (< 0.1ms suggests NVMe SSD, 0.1-1ms SATA SSD, >5ms HDD).'
    - Measure sequential read bandwidth (fio or dd).
    - Record total RAM and available RAM for PostgreSQL.
    - Count CPU cores (physical and logical).
    - Check NUMA topology if applicable.
    - 'Record: storage_type, seq_bandwidth_MB_s, total_ram_GB, cpu_cores.'
    requires_benchmark: false
    skill_type: diagnostic
    preconditions:
    - Environment precheck passed (1.1)
    postconditions:
    - Hardware profile recorded
    - Storage type classified (NVMe/SATA-SSD/HDD)
    reference_data:
      description: Hardware-dependent parameter guidelines.
      notes:
      - 'For SSD: random_page_cost ≈ 1.1–1.5; for HDD: 3.0–4.0.'
      - 'shared_buffers: typically 25% of RAM, but 8-16 GB max on most workloads.'
      - 'effective_io_concurrency: 200 for NVMe, 100 for SATA SSD, 2-4 for HDD.'
      - 'max_parallel_workers_per_gather: typically ≤ cpu_cores / 4.'
    output_keys:
    - storage_type
    - seq_bandwidth
    - total_ram_gb
    - cpu_cores
    - numa_nodes
  - step_id: '1.4'
    action: Measure baseline performance
    why: A reliable baseline is needed to quantify all subsequent improvements.
    procedure:
    - Run the target benchmark with PostgreSQL default configuration.
    - Execute at least 3 repetitions to account for run-to-run variance.
    - 'Record: mean TPS (for OLTP) or mean total query latency (for OLAP), standard deviation, and p95/p99 latency.'
    - If CV (std/mean) > 0.10, investigate instability before proceeding.
    requires_benchmark: true
    skill_type: diagnostic
    preconditions:
    - Environment precheck passed (1.1)
    - Version verified (1.2)
    - Hardware profiled (1.3)
    postconditions:
    - Baseline TPS/latency recorded with CV < 0.10
    - At least 3 repetitions completed
    decision_criteria:
      condition: CV (std/mean) < 0.10
      action_if_true: Baseline is stable — proceed to candidate selection.
      action_if_false: 'Baseline is unstable. Investigate: check for background processes, OS-level noise, or benchmark warm-up
        issues. Re-run after resolving.'
    reference_data:
      description: Typical baseline variance from our experiments.
      note: Baseline TPS varies by workload and hardware. Use your own measurement as the reference.
      notes:
      - TPCC workloads typically show CV < 0.05 on dedicated hardware.
      - TPCH total latency CV is often higher (0.05-0.15) due to query plan variations.
    output_keys:
    - baseline_tps
    - baseline_std
    - baseline_cv
  - step_id: '1.5'
    action: Select candidate knobs for profiling
    why: Profiling all 300+ PostgreSQL parameters is infeasible. Focus on the ~20 knobs known to affect performance.
    procedure:
    - Start with the priority list provided in the reference data below.
    - 'For your specific workload type, prioritize accordingly: memory knobs (shared_buffers, work_mem) for OLTP; planner
      cost knobs for OLAP; WAL/checkpoint knobs for write-heavy.'
    - Add any domain-specific knobs relevant to your setup (e.g., max_connections if connection-bound).
    requires_benchmark: false
    skill_type: diagnostic
    preconditions:
    - Baseline measurement completed (1.4)
    postconditions:
    - Candidate knob list of 15-25 knobs selected
    reference_data:
      description: Priority knob list ranked by average sensitivity across workloads.
      priority_list:
      - knob: cpu_tuple_cost
        avg_sensitivity: 1.461
      - knob: jit_above_cost
        avg_sensitivity: 0.8004
      - knob: cpu_operator_cost
        avg_sensitivity: 0.7595
      - knob: seq_page_cost
        avg_sensitivity: 0.6852
      - knob: max_parallel_workers_per_gather
        avg_sensitivity: 0.576
      - knob: max_parallel_workers
        avg_sensitivity: 0.5382
      - knob: random_page_cost
        avg_sensitivity: 0.5379
      - knob: min_wal_size
        avg_sensitivity: 0.5138
      - knob: cpu_index_tuple_cost
        avg_sensitivity: 0.5057
      - knob: vacuum_freeze_table_age
        avg_sensitivity: 0.4652
      - knob: tcp_keepalives_count
        avg_sensitivity: 0.4526
      - knob: max_worker_processes
        avg_sensitivity: 0.4501
      - knob: join_collapse_limit
        avg_sensitivity: 0.4486
      - knob: max_wal_senders
        avg_sensitivity: 0.448
      - knob: bgwriter_lru_multiplier
        avg_sensitivity: 0.4345
      - knob: from_collapse_limit
        avg_sensitivity: 0.4217
      - knob: vacuum_freeze_min_age
        avg_sensitivity: 0.4189
      - knob: vacuum_cost_page_dirty
        avg_sensitivity: 0.4171
      - knob: wal_skip_threshold
        avg_sensitivity: 0.4163
      - knob: work_mem
        avg_sensitivity: 0.4157
      notes:
      - This list is based on 6,297 single-knob experiments across TPCC-R, TPCC-W, and TPCH.
      - Your hardware and PG version may shift rankings — use this as a starting point.
    output_keys:
    - candidate_knobs
  - step_id: '1.6'
    action: Measure single-knob sensitivity
    why: Determine which knobs actually matter for YOUR workload and hardware.
    procedure:
    - 'For each candidate knob, choose 3 values: the PostgreSQL default, a low value, and a high value from the recommended
      range.'
    - Run the benchmark with each value (other knobs at default), 3 repetitions each.
    - Compute sensitivity = (max_mean_tps - min_mean_tps) / baseline_tps.
    - Rank knobs by sensitivity in descending order.
    requires_benchmark: true
    skill_type: profiling
    preconditions:
    - Candidate knob list selected (1.5)
    - Baseline measurement completed (1.4)
    postconditions:
    - Sensitivity ranking computed for all candidate knobs
    - '|tunable knobs (sensitivity > threshold)| >= 5'
    decision_criteria:
      condition: sensitivity > 0.10 (10% of baseline)
      action_if_true: Mark knob as 'tunable' — it will enter Phase 2 screening.
      action_if_false: Keep at default — this knob has negligible impact on your workload.
    reference_data:
      description: Per-workload sensitivity rankings from our experiments.
      per_workload:
        tpch:
        - knob: cpu_tuple_cost
          sensitivity: 3.4974
        - knob: seq_page_cost
          sensitivity: 1.1268
        - knob: cpu_operator_cost
          sensitivity: 0.9502
        - knob: max_parallel_workers
          sensitivity: 0.6653
        - knob: work_mem
          sensitivity: 0.6614
        - knob: cpu_index_tuple_cost
          sensitivity: 0.6262
        - knob: random_page_cost
          sensitivity: 0.6193
        - knob: max_worker_processes
          sensitivity: 0.6144
        - knob: max_parallel_workers_per_gather
          sensitivity: 0.5846
        - knob: bgwriter_lru_maxpages
          sensitivity: 0.5697
        - knob: jit_optimize_above_cost
          sensitivity: 0.5111
        - knob: max_locks_per_transaction
          sensitivity: 0.4989
        - knob: wal_receiver_status_interval
          sensitivity: 0.4745
        - knob: checkpoint_completion_target
          sensitivity: 0.4708
        - knob: wal_buffers
          sensitivity: 0.4459
        - knob: join_collapse_limit
          sensitivity: 0.4381
        - knob: tcp_keepalives_interval
          sensitivity: 0.437
        - knob: from_collapse_limit
          sensitivity: 0.4313
        - knob: min_parallel_table_scan_size
          sensitivity: 0.4305
        - knob: gin_pending_list_limit
          sensitivity: 0.4281
        tpccw:
        - knob: jit_above_cost
          sensitivity: 1.1365
        - knob: cpu_operator_cost
          sensitivity: 0.8763
        - knob: min_wal_size
          sensitivity: 0.7503
        - knob: bgwriter_lru_multiplier
          sensitivity: 0.7248
        - knob: commit_delay
          sensitivity: 0.6674
        - knob: vacuum_cost_page_dirty
          sensitivity: 0.6511
        - knob: seq_page_cost
          sensitivity: 0.5928
        - knob: max_parallel_workers_per_gather
          sensitivity: 0.5908
        - knob: vacuum_cost_page_miss
          sensitivity: 0.5874
        - knob: geqo_effort
          sensitivity: 0.5801
        - knob: vacuum_cost_page_hit
          sensitivity: 0.5777
        - knob: vacuum_multixact_freeze_table_age
          sensitivity: 0.5735
        - knob: logical_decoding_work_mem
          sensitivity: 0.5676
        - knob: vacuum_multixact_failsafe_age
          sensitivity: 0.5636
        - knob: max_pred_locks_per_page
          sensitivity: 0.5561
        - knob: vacuum_freeze_table_age
          sensitivity: 0.5381
        - knob: max_parallel_workers
          sensitivity: 0.5334
        - knob: max_wal_senders
          sensitivity: 0.5251
        - knob: tcp_keepalives_count
          sensitivity: 0.5064
        - knob: wal_writer_delay
          sensitivity: 0.5039
        tpccr:
        - knob: jit_above_cost
          sensitivity: 1.0662
        - knob: geqo_selection_bias
          sensitivity: 0.6368
        - knob: vacuum_freeze_min_age
          sensitivity: 0.63
        - knob: tcp_keepalives_count
          sensitivity: 0.5972
        - knob: max_parallel_workers_per_gather
          sensitivity: 0.5526
        - knob: random_page_cost
          sensitivity: 0.5471
        - knob: vacuum_cost_page_hit
          sensitivity: 0.5353
        - knob: geqo_threshold
          sensitivity: 0.5159
        - knob: min_wal_size
          sensitivity: 0.5046
        - knob: max_wal_senders
          sensitivity: 0.4895
        - knob: join_collapse_limit
          sensitivity: 0.489
        - knob: cpu_index_tuple_cost
          sensitivity: 0.4746
        - knob: from_collapse_limit
          sensitivity: 0.4743
        - knob: checkpoint_flush_after
          sensitivity: 0.4709
        - knob: wal_keep_size
          sensitivity: 0.4699
        - knob: wal_skip_threshold
          sensitivity: 0.466
        - knob: vacuum_freeze_table_age
          sensitivity: 0.4642
        - knob: cpu_operator_cost
          sensitivity: 0.4521
        - knob: shared_buffers
          sensitivity: 0.4504
        - knob: max_standby_archive_delay
          sensitivity: 0.4496
      notes:
      - Top-20 knobs typically show sensitivity between 5% and 40%.
      - If your results differ by more than 2x from reference, verify benchmark stability.
    output_keys:
    - knob_sensitivities
  - step_id: '1.7'
    action: Characterize workload type
    why: Interaction patterns and optimal knob values are workload-specific. The same configuration can improve read-heavy
      OLTP by 35% but hurt write-heavy OLTP by 4%.
    procedure:
    - 'Classify your workload: read-heavy OLTP, write-heavy OLTP, OLAP/analytical, or mixed.'
    - 'Check: is your workload latency-sensitive (prioritize p99) or throughput-sensitive (prioritize TPS)?'
    - Note which queries or transaction types dominate runtime — this guides knob prioritization in later phases.
    requires_benchmark: false
    skill_type: diagnostic
    preconditions:
    - Sensitivity scan completed (1.6)
    postconditions:
    - Workload type classified
    - Optimization target (TPS vs latency) determined
    reference_data:
      description: Workload-specific patterns discovered in our experiments.
      note: Interaction strength varies 2-5x across workloads for the same knob pair.
      notes:
      - The same knob pair (shared_buffers × work_mem) shows η²=0.394 on OLTP-read but only η²=0.165 on OLTP-write — a 2.4x
        difference.
      - Workload classification directly affects which interactions to prioritize.
    output_keys:
    - workload_type
    - optimization_target
  - step_id: '1.8'
    action: Detect anomalies in sensitivity results
    why: If the online sensitivity ranking diverges significantly from the offline reference, the reference data may not be
      trustworthy for this environment. Early detection prevents cascading errors in later phases.
    procedure:
    - Compare your measured knob sensitivity ranking with the reference ranking.
    - Compute Spearman rank correlation between your top-15 and reference top-15.
    - 'Flag specific anomalies:'
    - '  - A knob in your top-5 that is not in reference top-15 (new high-impact knob).'
    - '  - A reference top-5 knob that shows < 2% sensitivity on your system.'
    - '  - Any knob whose sensitivity differs by > 3x from reference.'
    - If fewer than 5 knobs exceed the 5% sensitivity threshold, lower the threshold to 2% or expand the candidate list.
    requires_benchmark: false
    skill_type: diagnostic
    preconditions:
    - Sensitivity scan completed (1.6)
    - Reference sensitivity data available
    postconditions:
    - Sensitivity ranking validated against reference
    - Anomalies flagged for attention in later phases
    decision_criteria:
      condition: Spearman ρ > 0.6 with reference ranking
      action_if_true: Rankings are consistent — proceed with reference data guidance.
      action_if_false: Rankings diverge significantly. Reference interaction data may not apply. Expand Phase 2 screening
        to cover anomalous knobs; reduce reliance on reference thresholds.
    reference_data:
      description: Expected sensitivity ranking and common divergence causes.
      notes:
      - 'Common causes of ranking divergence: different storage type (SSD vs HDD), different RAM:data ratio, PG version changes,
        extension overhead.'
      - Rank correlation < 0.3 suggests the offline reference is not applicable.
    output_keys:
    - rank_correlation
    - anomalous_knobs
    - validated_top_k
  - step_id: '1.9'
    action: Build detailed profiles for top-k parameters
    why: Beyond raw sensitivity, response-curve shape (monotonic, non-monotonic, step) determines the search strategy for
      optimization. Safe ranges prevent configurations that cause crashes or severe degradation.
    procedure:
    - 'For each top-k parameter (validated_top_k from 1.8):'
    - '  - If sensitivity scan used only 3 levels, add 2-4 intermediate levels around the region of highest change.'
    - '  - Run benchmark at each level (3 repetitions).'
    - '  - Classify response curve: monotonic-up, monotonic-down, non-monotonic (has interior optimum), or step-function (threshold
      effect).'
    - '  - Determine safe range: the range in which no crash or > 50% throughput loss occurs.'
    - '  - Record per-workload sensitivity: does the knob rank differently for different workloads?'
    - 'Output: a per-parameter profile with (CV, response_shape, safe_range, best_value_per_workload, sensitivity_per_workload).'
    requires_benchmark: true
    skill_type: profiling
    preconditions:
    - Sensitivity scan completed (1.6)
    - Anomaly detection completed (1.8)
    postconditions:
    - Per-parameter profile completed for all top-k knobs
    - Response curve classified for each knob
    - Safe ranges empirically determined
    reference_data:
      description: Reference per-parameter profiles from offline profiling.
      notes:
      - 'shared_buffers: typically monotonic-up then plateau; safe range 128MB–16GB.'
      - 'random_page_cost: step-function behavior around storage latency.'
      - 'work_mem: non-monotonic on sort-heavy OLAP (too high causes temp file thrashing).'
      - 160 parameter profiles were generated in the offline phase for PG16.
    output_keys:
    - parameter_profiles
- phase_id: '2'
  name: Interaction Screening
  goal: Identify which pairs of tunable knobs interact — i.e., where the optimal value of one knob depends on the setting
    of another.
  steps:
  - step_id: '2.1'
    action: Run coarse 2×2 interaction screen
    why: Testing all pairs exhaustively is too expensive. A 2×2 design (low/high for each knob) quickly estimates interaction
      strength at 4 benchmark runs per pair.
    procedure:
    - 'For each pair of tunable knobs from Phase 1:'
    - '  - Choose two levels per knob: the lowest and highest values from the sensitivity scan.'
    - '  - Run all 4 combinations (low-low, low-high, high-low, high-high), 1 repetition each.'
    - 'Compute interaction%:   SS_total = sum of squared deviations from grand mean.'
    - '  SS_A = deviation explained by knob A alone.'
    - '  SS_B = deviation explained by knob B alone.'
    - '  SS_interaction = SS_total - SS_A - SS_B.'
    - '  interaction% = SS_interaction / SS_total × 100.'
    requires_benchmark: true
    skill_type: profiling
    preconditions:
    - 'Phase 1 completed: tunable knobs identified with sensitivity rankings'
    - Per-parameter profiles available (safe ranges known)
    postconditions:
    - Coarse interaction% computed for all C(k,2) pairs
    - Candidate pairs (interaction% > 15%) identified for fine screening
    decision_criteria:
      condition: interaction% > 15%
      action_if_true: Advance to fine screening (Step 2.2) for confirmation.
      action_if_false: 'If interaction% < 5%: mark pair as independent. If 5-15%: borderline — include in fine screening if
        budget allows.'
    reference_data:
      description: Interaction distribution from 55-pair coarse screening.
      total_pairs_tested: 55
      pairs_above_15pct: 35/55 (64%)
      top_5_pairs:
      - knob_a: bgwriter_lru_multiplier
        knob_b: vacuum_cost_page_hit
        workload: tpccr
        interaction_pct: 98.0
      - knob_a: cpu_operator_cost
        knob_b: vacuum_cost_page_dirty
        workload: tpccr
        interaction_pct: 91.9
      - knob_a: cpu_tuple_cost
        knob_b: default_statistics_target
        workload: tpccr
        interaction_pct: 91.7
      - knob_a: bgwriter_lru_multiplier
        knob_b: from_collapse_limit
        workload: tpccr
        interaction_pct: 87.5
      - knob_a: commit_delay
        knob_b: from_collapse_limit
        workload: tpccr
        interaction_pct: 86.5
      bottom_5_pairs:
      - knob_a: random_page_cost
        knob_b: cpu_tuple_cost
        workload: tpccr
        interaction_pct: 0.7
      - knob_a: cpu_tuple_cost
        knob_b: from_collapse_limit
        workload: tpccr
        interaction_pct: 0.1
      - knob_a: bgwriter_lru_multiplier
        knob_b: logical_decoding_work_mem
        workload: tpccr
        interaction_pct: 0.0
      - knob_a: commit_delay
        knob_b: max_parallel_workers
        workload: tpccr
        interaction_pct: 0.0
      - knob_a: cpu_tuple_cost
        knob_b: max_parallel_workers
        workload: tpccr
        interaction_pct: 0.0
      notes:
      - In our experiments, 64% of pairs (35/55) exceeded the 15% threshold.
      - 'Strongest interactions: cost-model × I/O parameters, memory × memory.'
      - 'Weakest: same-subsystem planner knobs (e.g., random_page_cost × cpu_tuple_cost = 0.7%).'
      - 'CAVEAT: 2×2 with n=1 overestimates interaction strength by ~2x on average. This is intentional — it is a conservative
        screen (high recall).'
    output_keys:
    - pair_interactions_coarse
  - step_id: '2.2'
    action: Run fine 4×4 factorial screening for candidate pairs
    why: The coarse 2×2 screen overestimates interaction strength by ~2x. A 4×4 design with replication provides statistically
      reliable η² estimates and confirms true interactions.
    procedure:
    - 'For each candidate pair from the coarse screen:'
    - '  - Choose 4 levels per knob spanning the safe range.'
    - '  - Run a full 4×4 factorial design with r=3 repetitions per cell (48 benchmark runs per pair per workload).'
    - '  - Compute two-way ANOVA with interaction term.'
    - '  - Report partial eta-squared (η²) for the interaction term and p-value.'
    - Apply Benjamini-Hochberg FDR correction across all tested pairs.
    - Retain pairs with η² > 0.15 AND corrected p < 0.05 as confirmed interactions.
    requires_benchmark: true
    skill_type: profiling
    preconditions:
    - Coarse screening completed (2.1)
    - Candidate pairs with interaction% > 15% identified
    postconditions:
    - Confirmed interaction pairs identified with η² and p-values
    - FDR correction applied across all tests
    decision_criteria:
      condition: η² > 0.15 AND BH-corrected p < 0.05
      action_if_true: Confirmed interaction — this pair must be jointly optimized.
      action_if_false: Interaction not confirmed. Mark pair as independent.
    reference_data:
      description: Fine screening results from our validated experiments.
      notes:
      - Stage B typically confirms ~60% of Stage A candidates.
      - n=1 experiments overestimate interaction η² by ~2x. Use n≥3 for reliable ANOVA.
      - 'Total Stage B cost: 48 × (candidate pairs) × W runs.'
      - 'For k=15 with 35 candidate pairs and W=3: ~5,040 runs.'
    output_keys:
    - pair_interactions_fine
    - confirmed_interactions
  - step_id: '2.3'
    action: Build interaction graph from confirmed pairs
    why: Knobs that interact must be optimized jointly. The interaction graph reveals which groups of knobs are coupled.
    procedure:
    - 'Create a weighted undirected graph G = (V, E):'
    - '  - V: the top-k tunable knobs from Phase 1.'
    - '  - E: an edge (i, j) with weight η²_ij for each confirmed interaction.'
    - 'Visualize the graph: node size proportional to single-knob sensitivity, edge thickness proportional to η².'
    - 'Record graph statistics: number of edges, density, max/mean η².'
    requires_benchmark: false
    skill_type: analytical
    preconditions:
    - Fine screening completed (2.2)
    - Confirmed interaction pairs available
    postconditions:
    - Interaction graph G constructed
    - All confirmed interactions represented as weighted edges
    reference_data:
      description: Interaction graph structure from our screening data.
      note: Connected components typically contain 2-4 knobs.
      notes:
      - Cross-subsystem pairs (e.g., bgwriter × vacuum cost) often interact more strongly than within-subsystem pairs.
      - 'Graph density varies by workload: OLTP-read graphs are typically denser than OLAP graphs.'
    output_keys:
    - interaction_graph
    - graph_stats
  - step_id: '2.4'
    action: Decompose interaction graph into connected components
    why: Connected components define the minimal groups of knobs that must be jointly optimized. Parameters in different components
      can be tuned independently without loss of optimality.
    procedure:
    - Find connected components {C_1, C_2, ..., C_m} of the interaction graph.
    - 'For each component, record: member knobs, internal edge count, max and mean η² within the component.'
    - Identify isolated nodes (no confirmed interactions) — these enter independent optimization directly.
    - 'If any component has > 5 knobs:'
    - '  - Consider raising the η² threshold to split it.'
    - '  - Or prioritize the strongest edges and break weaker ones.'
    - '  - Large components make factorial experiments exponentially expensive.'
    - Sort components by total interaction strength (sum of η² within component).
    requires_benchmark: false
    skill_type: analytical
    preconditions:
    - Interaction graph constructed (2.3)
    postconditions:
    - 'Components identified: each with 2-4 knobs'
    - Independent knobs (isolated nodes) listed separately
    - No component exceeds 5 knobs (or justified if so)
    decision_criteria:
      condition: Largest component size ≤ 5
      action_if_true: Component sizes are manageable — proceed to Phase 3 joint optimization.
      action_if_false: Largest component is too large for full factorial. Raise η² threshold or use fractional factorial design.
    reference_data:
      description: Typical component structure from our experiments.
      notes:
      - 'For PG16 on TPCC-R: 3 components (sizes 3, 2, 2) and 8 isolated nodes.'
      - 'Largest component: {shared_buffers, work_mem, checkpoint_completion_target}.'
      - 'Joint optimization of 3-knob component: 4³ × 3 = 192 runs (feasible).'
      - 'Joint optimization of 5-knob component: 4⁵ × 3 = 3,072 runs (borderline).'
    output_keys:
    - interaction_clusters
    - independent_knobs
    - component_stats
- phase_id: '3'
  name: Joint & Independent Optimization
  goal: Find the best configuration for each cluster of interacting knobs and for each independent knob, then assemble the
    full configuration.
  steps:
  - step_id: '3.1'
    action: Run factorial experiments within each interaction cluster
    why: For interacting knobs, the joint optimum cannot be found by optimizing each knob separately. Independent tuning can
      be WORSE than defaults (we observed -13.4% on OLTP-read).
    procedure:
    - 'For each interaction cluster (typically 2-3 knobs):'
    - '  - Choose 3-4 levels per knob spanning the sensitivity range.'
    - '  - Run a full factorial design: all combinations of levels.'
    - '    For 2 knobs × 4 levels = 16 configs; for 3 knobs × 3 levels = 27 configs.'
    - '  - Run each config with n ≥ 3 repetitions for statistical reliability.'
    - 'Record all results: TPS mean, std, p95, p99 for each configuration.'
    requires_benchmark: true
    skill_type: profiling
    preconditions:
    - Interaction clusters identified (2.4)
    - Per-parameter profiles available (1.9) for level selection
    postconditions:
    - Factorial results recorded for all clusters
    - Each configuration has n ≥ 3 repetitions
    reference_data:
      description: Factorial experiment results from our validated interaction studies.
      experiments:
      - knobs:
        - shared_buffers
        - checkpoint_completion_target
        workload: tpch
        n_configs: 4
        performance_spread_pct: 16.4
      - knobs:
        - shared_buffers
        - checkpoint_completion_target
        workload: tpccw
        n_configs: 4
        performance_spread_pct: 13.9
      - knobs:
        - shared_buffers
        - checkpoint_completion_target
        workload: tpccr
        n_configs: 4
        performance_spread_pct: 56.2
      - knobs:
        - shared_buffers
        - work_mem
        workload: tpch
        n_configs: 16
        performance_spread_pct: 43.6
      - knobs:
        - shared_buffers
        - work_mem
        workload: tpccw
        n_configs: 16
        performance_spread_pct: 42.6
      - knobs:
        - shared_buffers
        - work_mem
        workload: tpccr
        n_configs: 16
        performance_spread_pct: 70.1
      - knobs:
        - shared_buffers
        - work_mem
        - checkpoint_completion_target
        workload: tpch
        n_configs: 27
        performance_spread_pct: 52.1
      - knobs:
        - shared_buffers
        - work_mem
        - checkpoint_completion_target
        workload: tpccr
        n_configs: 27
        performance_spread_pct: 67.4
      - knobs:
        - shared_buffers
        - work_mem
        - checkpoint_completion_target
        workload: tpccw
        n_configs: 27
        performance_spread_pct: 90.3
      notes:
      - 'shared_buffers × work_mem (4×4, n=3): interaction η²=0.394, p=0.040.'
      - Performance spread (best vs worst) can reach 70% within a single cluster.
      - n=1 experiments overestimate interaction η² by ~2x. Use n≥3 for reliable ANOVA.
    output_keys:
    - cluster_results
  - step_id: '3.2'
    action: Analyze interaction effects with ANOVA
    why: ANOVA decomposes performance variance into main effects and interaction effects, confirming whether joint optimization
      is worthwhile.
    procedure:
    - For each cluster × workload, run two-way (or three-way) ANOVA with replication.
    - 'Key outputs:'
    - '  - η² (eta-squared): fraction of variance explained by the interaction term.'
    - '  - F-statistic and p-value: statistical significance of the interaction.'
    - '  - Main effects: how much does each knob contribute independently?'
    - 'Interpretation guide:'
    - '  - η² > 0.25: STRONG interaction — joint optimization is essential.'
    - '  - η² 0.10-0.25: MODERATE — joint optimization recommended.'
    - '  - η² < 0.10: WEAK — independent tuning is acceptable.'
    requires_benchmark: false
    skill_type: analytical
    preconditions:
    - Factorial experiment results available (3.1)
    postconditions:
    - ANOVA results computed for all clusters × workloads
    - Interaction strength (η²) quantified
    reference_data:
      description: ANOVA results from our confirmed interactions.
      note: Run two_way_anova_with_replication() from doctrl.analysis.stats on your factorial data.
      notes:
      - Use Benjamini-Hochberg FDR correction for multiple comparisons (less conservative than Bonferroni, better power for
        discovery).
      - If p > 0.05, consider increasing n (repetitions) before concluding no interaction.
    output_keys:
    - anova_results
  - step_id: '3.3'
    action: Compare interaction patterns across workloads
    why: The same knob pair can show strong interaction on one workload but weak interaction on another. A configuration optimized
      for OLTP-read may hurt OLTP-write. Cross-workload analysis reveals these conflicts.
    procedure:
    - For each interaction cluster, compare η² values across workloads.
    - 'Flag sign reversals: cases where the joint optimum on workload A is in the WORST region for workload B.'
    - 'Compute workload consistency score: fraction of clusters where the top-3 configurations overlap across workloads.'
    - 'If sign reversals exist:'
    - '  - The system needs workload-specific configurations.'
    - '  - Consider a compromise: find configurations in the ''safe zone'' (top-25% for all workloads).'
    requires_benchmark: false
    skill_type: analytical
    preconditions:
    - ANOVA results available for multiple workloads (3.2)
    postconditions:
    - Cross-workload consistency assessed for all clusters
    - Sign reversals identified and flagged
    decision_criteria:
      condition: No sign reversals between target workloads
      action_if_true: A single configuration can serve all workloads — proceed to joint optimum selection.
      action_if_false: Workload-specific configurations needed. Optimize separately for each workload, or find a Pareto compromise.
    reference_data:
      description: Cross-workload interaction patterns from our experiments.
      notes:
      - 'shared_buffers × work_mem: η²=0.394 on OLTP-read vs η²=0.165 on OLTP-write (2.4x).'
      - 'Sign reversal example: high shared_buffers + low work_mem is optimal for TPCC-R but suboptimal for TPCC-W.'
      - TPCH often has different optimal regions from TPCC entirely.
    output_keys:
    - cross_workload_consistency
    - sign_reversals
    - safe_zone_configs
  - step_id: '3.4'
    action: Identify joint-optimal configuration per cluster
    why: The joint optimum considers interaction effects and may differ substantially from combining independently optimal
      values.
    procedure:
    - From the factorial results, find the configuration with the highest mean primary metric (TPS or -latency).
    - 'Compare against the ''independent optimum'': the combination of each knob''s marginally best value.'
    - 'Compute the gap: (joint_best - independent_best) / baseline × 100%.'
    - If gap > 5%, the interaction is practically significant — use the joint optimum.
    - 'IMPORTANT: If cross-workload sign reversals exist (3.3), use the workload-specific optimum or safe-zone configuration
      instead.'
    requires_benchmark: false
    skill_type: optimization
    preconditions:
    - Factorial results available (3.1)
    - ANOVA confirms significant interaction (3.2)
    - Cross-workload analysis completed (3.3)
    postconditions:
    - Joint-optimal configuration identified for each cluster
    - Joint vs independent gap quantified
    decision_criteria:
      condition: joint-independent gap > 5%
      action_if_true: Use the joint optimum — independent tuning leaves significant performance on the table.
      action_if_false: Joint and independent optima are close — either is acceptable.
    reference_data:
      description: Joint vs independent optimization gaps from our validation experiments.
      examples:
      - knobs:
        - shared_buffers
        - checkpoint_completion_target
        workload: tpccw
        joint_best: 500.2
        independent_best: 500.2
        gap_pct: 0.0
      - knobs:
        - shared_buffers
        - checkpoint_completion_target
        workload: tpccr
        joint_best: 3178.1
        independent_best: 3178.1
        gap_pct: 0.0
      - knobs:
        - shared_buffers
        - work_mem
        workload: tpch
        joint_best: -31963825.7
        independent_best: -36238305.3
        gap_pct: 7.5
      - knobs:
        - shared_buffers
        - work_mem
        workload: tpccw
        joint_best: 536.3
        independent_best: 497.6
        gap_pct: 10.3
      - knobs:
        - shared_buffers
        - work_mem
        workload: tpccr
        joint_best: 3199.8
        independent_best: 2621.8
        gap_pct: 30.7
      note: Negative gap means independent tuning is WORSE than joint.
      notes:
      - 'shared_buffers × checkpoint_completion_target: joint=+35.2%, independent=-13.4% on OLTP-read.'
      - The gap can be NEGATIVE — independent tuning actively hurts performance.
      - Small-sample (n=1) recommendations are unreliable. Always validate with n≥3.
    output_keys:
    - cluster_optima
    - joint_vs_independent_gap
  - step_id: '3.5'
    action: Select best values for independent knobs
    why: Knobs not in any interaction cluster can be safely optimized one at a time.
    procedure:
    - 'For each independent knob (not in any interaction cluster):'
    - '  - Use the sensitivity scan data from Phase 1 and per-parameter profiles.'
    - '  - Select the value that gave the highest benchmark metric.'
    - '  - If the scan only tested 3 values and the best is at an extreme, consider testing intermediate values to find a
      potential better point.'
    - '  - For non-monotonic parameters, use the profile''s identified peak.'
    requires_benchmark: false
    skill_type: optimization
    preconditions:
    - Independent knobs identified (2.4)
    - Per-parameter profiles available (1.9)
    postconditions:
    - Optimal value selected for each independent knob
    reference_data:
      description: Independent knob optimization results.
      note: Use Phase 1 sensitivity data. Select the value that maximized your benchmark metric for each knob.
      notes:
      - Independent optimization is safe ONLY for knobs confirmed to have weak interactions (< 5%) in Phase 2.
      - For knobs with moderate interactions (5-15%), consider a quick 2×2 validation after assembly.
    output_keys:
    - independent_optima
  - step_id: '3.6'
    action: Assemble the complete recommended configuration
    why: Combine cluster joint optima with independent knob optima into one cohesive configuration.
    procedure:
    - Merge all cluster optima from Step 3.4 and independent optima from Step 3.5.
    - For knobs not covered by profiling, keep PostgreSQL defaults.
    - 'Sanity-check the combined configuration:'
    - '  - Total memory (shared_buffers + work_mem × max_connections + maintenance_work_mem) should not exceed ~80% of available
      RAM.'
    - '  - WAL settings should be consistent (e.g., max_wal_size > min_wal_size).'
    - '  - Parallelism settings should not exceed CPU core count.'
    requires_benchmark: false
    skill_type: optimization
    preconditions:
    - Cluster optima identified (3.4)
    - Independent optima identified (3.5)
    postconditions:
    - Complete configuration assembled
    - Memory budget validated
    - No conflicting parameter values
    reference_data:
      description: Common assembly pitfalls.
      memory_formula: shared_buffers + work_mem × max_connections + maintenance_work_mem < 80% RAM
      wal_consistency: max_wal_size > min_wal_size
      parallelism: max_parallel_workers ≤ CPU cores
      notes:
      - Memory over-commitment is the most common assembly error.
      - Cross-cluster interactions are rare but possible — Phase 4 validation catches these.
    output_keys:
    - full_config
- phase_id: '4'
  name: Verification
  goal: Confirm that the recommended configuration improves performance on the target deployment and does not regress on secondary
    workloads.
  steps:
  - step_id: '4.1'
    action: 'Run A/B validation: default vs recommended'
    why: Benchmark results can be noisy and interactions can behave differently across environments. Validation ensures the
      recommendation actually helps.
    procedure:
    - Run the target benchmark with both configurations (default and recommended).
    - Use n ≥ 5 repetitions for each to ensure statistical power.
    - Interleave runs (default, recommended, default, recommended, ...) to reduce time-dependent confounds.
    - Compute improvement = (recommended_mean - default_mean) / default_mean × 100%.
    - Run a two-sample t-test (or Welch's t-test if variances differ).
    requires_benchmark: true
    skill_type: verification
    preconditions:
    - Complete configuration assembled (3.6)
    postconditions:
    - Improvement quantified with confidence interval
    - Statistical significance assessed (p < 0.05)
    decision_criteria:
      condition: improvement > 0 AND p-value < 0.05 (two-sample t-test)
      action_if_true: Configuration validated — proceed to cross-workload check.
      action_if_false: Improvement is not statistically significant. Proceed to Phase 5 adaptation.
    reference_data:
      description: Validation results from our end-to-end experiments.
      experiments: []
      note: 'Best validated improvement: +35.2% TPS on OLTP-read (shared_buffers × checkpoint_completion_target, joint optimization).'
      notes:
      - 'Our best validated improvement: +35.2% TPS on OLTP-read (sb×cct joint).'
      - Typical improvements range from 10-25% over defaults for correctly tuned clusters.
      - If improvement is negative, the interaction structure in your environment differs — return to Phase 2 with more levels.
    output_keys:
    - validation_improvement
    - validation_pvalue
    - validation_ci
  - step_id: '4.2'
    action: Verify configuration does not regress on secondary workloads
    why: A configuration optimized for the primary workload may degrade performance on secondary workloads. Cross-workload
      verification catches regressions before deployment.
    procedure:
    - 'For each secondary workload (e.g., if primary is OLTP-read, test OLTP-write and OLAP):'
    - '  - Run the benchmark with the recommended configuration, n ≥ 3 repetitions.'
    - '  - Compare against the baseline (default config) for that workload.'
    - '  - Compute regression = (recommended - baseline) / baseline × 100%.'
    - Flag if any secondary workload regresses by > 5%.
    - 'If regression detected, consider:'
    - '  - Using the safe-zone configuration from Step 3.3.'
    - '  - Creating workload-specific configurations with a switching mechanism.'
    requires_benchmark: true
    skill_type: verification
    preconditions:
    - A/B validation passed for primary workload (4.1)
    - Secondary workloads identified
    postconditions:
    - No secondary workload regresses by > 5%
    - 'OR: workload-specific configurations defined'
    decision_criteria:
      condition: 'All secondary workloads: regression < 5%'
      action_if_true: Configuration is safe for mixed workloads — deploy.
      action_if_false: Regression detected. Use workload-specific configs or safe-zone compromise from Step 3.3.
    reference_data:
      description: Cross-workload regression risks from our experiments.
      note: Run the recommended config on all secondary workloads and check for >5% regression.
      notes:
      - OLTP-read optimized configs regress OLTP-write by 4-8% in ~30% of cases.
      - OLAP workloads are generally less affected by OLTP-optimized memory settings.
      - The safe-zone approach (top-25% for all workloads) typically sacrifices 5-10% of peak performance for robustness.
    output_keys:
    - secondary_workload_results
    - regressions
- phase_id: '5'
  name: Adaptation & Orchestration
  goal: Adapt the tuning process when results diverge from expectations, and provide top-level workflow orchestration.
  steps:
  - step_id: '5.1'
    action: Adapt if results diverge from expectations
    why: Hardware, PostgreSQL version, data distribution, and workload mix all affect interaction strength. The methodology
      adapts; fixed rules do not.
    procedure:
    - 'If validation shows no improvement or regression:'
    - '  1. Check if your workload type matches the reference conditions.'
    - '  2. Re-run Phase 1 sensitivity scan — your knob rankings may differ.'
    - '  3. Re-run Phase 2 screening for the top pairs — interaction strengths may be different on your hardware.'
    - '  4. If specific clusters show no interaction on your setup, move those knobs to independent optimization.'
    - 'If validation shows moderate improvement but below reference:'
    - '  - The methodology is working, but your environment has different optima.'
    - '  - Consider expanding the level range or adding more levels in Phase 3.'
    - 'If anomaly detection (1.8) flagged ranking divergence:'
    - '  - Discard reference interaction data for anomalous knobs.'
    - '  - Run full screening (Phase 2) for those knobs from scratch.'
    requires_benchmark: false
    skill_type: meta
    preconditions:
    - Validation completed (Phase 4)
    - 'OR: anomaly detected at any earlier phase'
    postconditions: