Ask Claude to Update to 2.21.0 #491

Draft
hmaarrfk wants to merge 55 commits into conda-forge:main from hmaarrfk:update_to_2.21.0

Conversation

@hmaarrfk
Contributor

@hmaarrfk hmaarrfk commented May 15, 2026

For me to restart with claude:

claude --resume "tensorflow 2.21"

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

@conda-forge-admin
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/recipe.yaml) and found it was in an excellent condition.

hmaarrfk added 4 commits May 17, 2026 16:40
<details><summary>Claude's draft</summary>

Bump the feedstock to TensorFlow 2.21.0:

- recipe.yaml: version 2.21.0, refreshed sha256, build 0, bazel 7.*,
  run-dep bumps (protobuf >=6.31.1, tensorboard 2.21, keras 3.12), and
  CUDA selectors extended to the 13.x series.
- Rebase the patch series onto the 2.21.0 source tree. TF 2.21.0 moved
  third_party/{absl,gpus,eigen3,ducc} under third_party/xla/ and dropped
  the duplicated TSL tree, so every patch was refreshed; nine obsolete
  patches were dropped and several new ones added to restore the
  systemlib (absl / protobuf / gRPC) wiring that 2.21.0 removed.
- Patch files keep their original author attribution (Uwe L. Korn,
  H. Vetinari, Isuru Fernando and others), sourced from the conda-forge
  feedstock history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

Add .ci_support/migrations/cuda130.yaml so the feedstock builds a
CUDA 13.0 variant alongside CUDA 12.8. The migration is taken from the
conda-forge cuda130 migrator (key_add operation, wait_for_migrators:
cuda129, c_stdlib 2.28), matching the approach used by pytorch-cpu-feedstock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

TF 2.21.0 changed its build substantially; the feedstock build scripts
are reworked to cope, in logical steps:

- Compiler: switch the linux build to the conda-forge clang/clangxx 18
  toolchain (conda_build_config.yaml). TF 2.21.0 defaults to a hermetic
  LLVM CC toolchain (rules_ml_toolchain); build_common.sh selects
  --config=clang_local + --crosstool_top so the conda toolchain and
  system headers are used instead.
- ABI: pin --cxxopt/--host_cxxopt=-fclang-abi-compat=17. clang 18
  changed the Itanium mangling of non-type template parameters of
  dependent type; conda-forge's libabseil/libprotobuf use the older
  GCC-compatible form, so TF must match it or absl::Cord etc. fail to
  link.
- System libraries: restore TF_SYSTEM_LIBS and, because TF 2.21.0's
  cc_shared_library does not forward systemlib cc_library linkopts,
  force-link the systemized libraries (protobuf, grpc, sqlite3, icu,
  png, jpeg, gif, flatbuffers, snappy, curl, and abseil's ~90 shared
  objects) so libtensorflow*.so record them as DT_NEEDED.
- Caching: restore .bazelrc to a pristine snapshot on every invocation
  so the per-Python passes do not accumulate duplicate flags (which
  changed every compile command and defeated all Bazel caching), and
  add a persistent --disk_cache.
- Packaging: fix the XLA header-install path (@local_xla renamed to
  @xla), chmod build_env writable so rattler-build can clean it up and
  package every output, and use cp -f for the wheel copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
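
The caching fix described in the commit above (restore `.bazelrc` from a pristine snapshot before each per-Python pass) could be sketched roughly as follows. This is an illustration only: the function names, the `.pristine` suffix and the cache path are assumptions, not what build_common.sh actually contains.

```shell
# Hypothetical helpers; names and the ".pristine" suffix are illustrative.
reset_bazelrc() {
    # $1 = path to the .bazelrc that the recipe appends to
    bazelrc="$1"
    snapshot="${bazelrc}.pristine"
    if [ ! -f "$snapshot" ]; then
        # First invocation: keep an untouched copy of the upstream file.
        cp "$bazelrc" "$snapshot"
    fi
    # Every invocation: start again from the upstream contents, so flags
    # appended by an earlier pass cannot pile up.
    cp -f "$snapshot" "$bazelrc"
}

append_recipe_flags() {
    # $1 = path to .bazelrc, $2 = directory for the persistent disk cache
    printf 'build --disk_cache=%s\n' "$2" >> "$1"
}
```

Calling `reset_bazelrc` and then `append_recipe_flags` once per Python version leaves exactly one copy of each appended flag in the file, so the compile command lines stay identical between passes.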
<details><summary>Claude's draft</summary>

Re-rendered with conda-smithy to pick up the CUDA 13.0 migration and the
clang toolchain change: regenerated .ci_support variant files (CUDA 13.0
level1/level3, refreshed CUDA-None variants), the conda-build workflow,
and pixi.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
hmaarrfk added 23 commits May 17, 2026 20:00
<details><summary>Claude's draft</summary>

The run dependency was tensorboard >=2.21,<2.22, which is unsatisfiable:
no tensorboard 2.21 has been released on PyPI or conda-forge (latest is
2.20.0 on both). This made the tensorflow-base test environment unsolvable.

TensorFlow 2.21.0 does not list tensorboard in its wheel REQUIRED_PACKAGES
at all; its CI requirement files pin `tensorboard ~= 2.20.0`. Following
the meaning of ~=, that is >=2.20.0,<2.21. Pin tensorboard >=2.20,<2.21
to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
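
For reference, the `~=` reasoning in the commit above can be written down as a small helper. This is a sketch under assumptions: the function name is made up, and it only handles plain numeric versions (no pre-release tags, no leading zeros).

```shell
# Hypothetical helper: translate a PEP 440 compatible-release pin
# ("~= X.Y.Z") into a conda-style range. "~= 2.20.0" means
# ">= 2.20.0, == 2.20.*", i.e. only the last component may float.
compatible_release_to_conda() {
    version="$1"
    case "$version" in
        *.*) ;;
        *) echo "compatible release needs at least two components" >&2
           return 1 ;;
    esac
    prefix="${version%.*}"   # drop the last component: 2.20.0 -> 2.20
    base="${prefix%.*}"      # everything before the component to bump
    last="${prefix##*.}"     # the component to bump
    if [ "$base" = "$prefix" ]; then
        # The prefix has a single component (input was like "2.20").
        upper="$((last + 1))"
    else
        upper="${base}.$((last + 1))"
    fi
    printf '>=%s,<%s\n' "$version" "$upper"
}
```

`compatible_release_to_conda 2.20.0` prints `>=2.20.0,<2.21`, which is the range the commit pins.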
<details><summary>Claude's draft</summary>

conda-forge CI fails the linux builds ~12s into the Bazel compile:

  embed_gpu_specs_gen failed (Exit 127): /bin/bash: xxd: command not found

TF 2.21.0's bundled XLA added the genrule
@xla//xla/backends/gpu/target_config:embed_gpu_specs_gen, which runs
`xxd -i` to embed GPU spec files into generated C++. xxd is not in the
conda-forge build image -- it was an undeclared host tool that happened
to be present on the maintainer's dev machine (it ships with vim).

Add vim to the staging output's build requirements (xxd is not a
standalone conda-forge package; vim provides it). Fixes the linux CPU,
CUDA and aarch64 jobs, which all fail here identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
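
An undeclared host tool like the `xxd` in the commit above only surfaces once a genrule runs. One way to catch this class of problem earlier would be a preflight check at the top of the build script. The sketch below is an assumption about how that might look, not something the feedstock is known to do; `require_tools` is a made-up name.

```shell
# Hypothetical preflight check: fail fast when a host tool that a genrule
# shells out to is not on PATH.
require_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="${missing} ${tool}"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing build tools:${missing}" >&2
        return 1
    fi
}
```

A call such as `require_tools xxd || exit 1` before invoking Bazel would report the missing tool by name instead of failing with exit code 127 partway into the compile.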
<details><summary>Claude's draft</summary>

The osx-arm64 CI build fails 12s into the Bazel compile:

  error: invalid value '17' in '-fclang-abi-compat=17'
  Error in child process '/usr/bin/xcrun'. 1

-fclang-abi-compat=17 was added unconditionally to .bazelrc, but it is a
linux-specific fix: conda-forge's linux libabseil/libprotobuf use the
GCC-compatible pre-clang-18 mangling for dependent non-type template
parameters, so the clang-built TF must match. macOS builds with Apple
clang (via xcrun), which both rejects the bare '17' value and does not
need it (the conda macOS libraries are clang-built already).

Emit the --cxxopt/--host_cxxopt=-fclang-abi-compat=17 lines only when
target_platform is linux-*.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The osx-arm64 build fails compiling generated protobuf code:

  tpu_embedding_configuration.pb.h: fatal error:
  'google/protobuf/runtime_version.h' file not found
  ...
  logging_initializer.cc: fatal error: 'absl/base/log_severity.h' not found

TF 2.21.0's reworked .bazelrc has `common:macos --config=apple-toolchain`,
and `common:apple-toolchain` forces @local_config_apple_cc//:toolchain for
--crosstool_top, --apple_crosstool_top and --host_crosstool_top. That
overrides the recipe's `--crosstool_top=//bazel_toolchain:toolchain` and,
crucially, also sets the host/apple crosstool slots the recipe never
touched -- so the conda toolchain (which carries -isystem $PREFIX/include)
is bypassed entirely and conda's protobuf/abseil headers are invisible.

sed the apple-toolchain config to point all three crosstool slots at the
conda //bazel_toolchain:toolchain. This restores the pre-2.21 behaviour
where the conda toolchain served the macOS build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
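
The `sed` step from the commit above might look roughly like this. The exact lines in TF 2.21.0's `.bazelrc` are not quoted in the commit message, so the pattern below assumes each `common:apple-toolchain` line names `@local_config_apple_cc//:toolchain` directly; the function name is illustrative.

```shell
# Hypothetical sketch: point every crosstool slot of the apple-toolchain
# config at the conda toolchain. The matched line shape is an assumption.
retarget_apple_toolchain() {
    # $1 = path to .bazelrc. A temp file is used instead of `sed -i`
    # because the in-place flag differs between GNU and BSD sed.
    sed -e '/^common:apple-toolchain /s|@local_config_apple_cc//:toolchain|//bazel_toolchain:toolchain|' \
        "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

Restricting the substitution to lines that start with `common:apple-toolchain` keeps any other reference to the Apple crosstool untouched.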
<details><summary>Claude's draft</summary>

The osx-arm64 build now reaches the link step (~22 min in) and fails:

  ld: illegal thread local variable reference to regular symbol
  google::protobuf::internal::ThreadSafeArena::thread_cache_ for arm64

conda-forge's macOS libprotobuf is compiled with PROTOBUF_NO_THREADLOCAL,
so it exports ThreadSafeArena::thread_cache_ as a regular (non-TLS)
symbol. TensorFlow compiles the same protobuf headers without that macro,
so its objects emit a thread-local (TLV) relocation against thread_cache_;
the Mach-O linker rejects a TLV reference to a non-TLS definition. (ELF
tolerates the mismatch, so linux is unaffected.)

Add -DPROTOBUF_NO_THREADLOCAL to --copt and --host_copt for osx so TF's
protobuf header compilation matches the ABI of the installed libprotobuf.
host_copt is needed too: the failing target is the [for tool] exec build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The previous commit added -DPROTOBUF_NO_THREADLOCAL for osx, but
protobuf's port_def.inc explicitly rejects it:

  port_def.inc:731: error: PROTOBUF_NO_THREADLOCAL was previously defined

protobuf manages that macro itself and never expects it pre-defined.
Revert that change. The underlying osx link error -- ld: illegal thread
local variable reference to regular symbol ThreadSafeArena::thread_cache_
-- is a protobuf ABI clash (a non-TLS protobuf object is being linked
into libtensorflow_framework.dylib while the systemized protobuf 6.33.5
headers make thread_cache_ __thread); it needs the protobuf
systemize-vs-vendor wiring fixed for osx, not a compile define.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

-fclang-abi-compat=17 was gated on target_platform == linux-*, but the
linux CUDA variants build with gcc 14 (the cuda128/cuda130 migrators pin
the host compiler to gcc, since nvcc needs a gcc host compiler). gcc has
no -fclang-abi-compat flag, so it errors out -- a CUDA-build blocker.

Tighten the gate to linux AND c_compiler == clang*, i.e. the CPU variants
only. The CUDA variants (gcc) and macOS (Apple clang) both correctly skip
the flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
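
The tightened gate from the commit above (linux AND a clang compiler) can be sketched as a function that emits the `.bazelrc` lines only when both conditions hold. The function name and argument order are assumptions for illustration.

```shell
# Hypothetical sketch of the gate: emit the ABI-compat flags only for
# linux targets that are built with clang.
emit_abi_compat_flags() {
    # $1 = target_platform (e.g. linux-64), $2 = c_compiler (e.g. clang, gcc)
    case "$1" in
        linux-*) ;;
        *) return 0 ;;   # macOS: Apple clang rejects the value
    esac
    case "$2" in
        clang*) ;;
        *) return 0 ;;   # gcc has no -fclang-abi-compat flag
    esac
    echo "build --cxxopt=-fclang-abi-compat=17"
    echo "build --host_cxxopt=-fclang-abi-compat=17"
}
```

Both checks are needed: the platform alone does not distinguish clang from gcc on linux, and the compiler name alone does not distinguish linux from macOS.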
<details><summary>Claude's draft</summary>

osx-arm64 fails linking libtensorflow_framework.dylib:

  ld: illegal thread local variable reference to regular symbol
      google::protobuf::internal::ThreadSafeArena::thread_cache_

TF 2.21.0's cc_shared_library does not propagate the systemlib protobuf
linkopt. On linux build_common.sh force-links -lprotobuf (and the other
systemlibs) via LDFLAGS, but the osx branch only added -undefined
dynamic_lookup -- which hides undefined regular protobuf symbols yet
cannot reconcile the TLS storage class of thread_cache_, so ld rejects
the thread-local reference to it as a (regular, undefined) symbol.

Force-link conda's libprotobuf for osx via --linkopt/--host_linkopt
(host too: the failing target is the [for tool] exec-config link), so
thread_cache_ resolves to its TLS definition in libprotobuf.dylib.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

osx-64 CI fails at Bazel analysis:

  cc_toolchain_suite '//bazel_toolchain:toolchain' does not contain a
  toolchain for cpu 'darwin_arm64'

osx-64 is cross-compiled on an arm64 runner. gen-bazel-toolchain keys the
conda cc_toolchain_suite on darwin_x86_64 (target) and darwin_arm64
(build host), but build_common.sh forced --cpu=darwin -- which matches
no suite key. Once TF 2.21.0's apple-toolchain config points
crosstool_top at that suite, analysis fails.

Set TARGET_CPU=darwin_x86_64 so --cpu matches the emitted suite key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
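
A lookup like the one below would keep `--cpu` aligned with the suite keys named in the commit above. It is a sketch: the function name is invented, and the `osx-arm64` entry is inferred from the `darwin_arm64` key the message mentions rather than stated as a change in this commit.

```shell
# Hypothetical mapping from conda target_platform to the cc_toolchain_suite
# key that --cpu has to match.
bazel_cpu_for_platform() {
    case "$1" in
        osx-64)    echo "darwin_x86_64" ;;
        osx-arm64) echo "darwin_arm64" ;;
        *) echo "no cpu key known for $1" >&2
           return 1 ;;
    esac
}
```

Failing loudly on an unknown platform avoids silently falling back to a value such as `darwin`, which matches no key in the suite.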
<details><summary>Claude's draft</summary>

conda-forge is moving off CUDA 12.8 on linux. Mirror pytorch-cpu-feedstock,
which builds CUDA 12.9 + 13.0: replace the local cuda128.yaml migrator
with cuda129.yaml and cuda130.yaml copied verbatim from
conda-forge/pytorch-cpu-feedstock's .ci_support/migrations/.

A rerender follows to regenerate the .ci_support variant files (drops
linux_64_cuda_compiler_version12.8*, adds 12.9).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

conda_build_config.yaml pinned c_compiler=clang / version 18 for all of
linux. The CUDA variants must build with gcc (nvcc needs a gcc host
compiler, and the cuda migrators pin c_compiler_version=14) -- the
unconditional clang 18 pin clashed with the migrator and made the
rerender ambiguous.

Gate the clang pin on cuda_compiler_version == "None" so only the CPU
variant uses clang 18; the CUDA variants fall through to the global
pinning + migrator (gcc).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

Re-rendered with conda-smithy 3.61.2 / conda-forge-pinning 2026.05.16
to pick up the cuda128 -> cuda129 migrator swap and the clang compiler
gating: the linux CUDA variants are now 12.9 and 13.0 (12.8 dropped),
and the CPU variant keeps clang while the CUDA variants use gcc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The CUDA 13.0 build aborts during Bazel's repository fetch:

  nvidia_nvshmem: Platform cuda13_x86_64-unknown-linux-gnu is not
  supported  [...]  @xla//xla/tsl/cuda:nvshmem_stub depends on
  @nvidia_nvshmem which failed to fetch

TF 2.21.0's nvshmem_stub alias resolves to the hermetic @nvidia_nvshmem
redistribution when CUDA libraries are force-included
(override_include_cuda_libs=true, which the recipe sets). The pinned
NVSHMEM 3.2.5 redist ships only cuda11/cuda12 archives -- no cuda13 --
so the build fails before compiling anything. CUDA 12.x silently worked
because a cuda12 archive exists.

Add patch 0067 making nvshmem_stub always resolve to the bundled dlopen
stub (:nvshmem). conda-forge does not package NVSHMEM and TF's NVSHMEM
support (optional multi-GPU collectives) is loaded via dlopen anyway, so
the stub is the correct choice for both CUDA 12.9 and 13.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

Two source-side CI fixes (a rerender still needs to land separately):

- conda_build_config.yaml: restore the unconditional # [linux] clang
  pin. The `cuda_compiler_version == "None"` selector cannot evaluate at
  config-parse time, so clang was dropped from every variant and the CPU
  build regressed to gcc -- XLA's -emit-llvm intrinsic codegen then fails
  (gcc rejects clang-only flags like -fno-experimental-sanitize-metadata).

- recipe.yaml: add cuda-nvrtc-dev to the CUDA-12/13 host deps. It ships
  targets/<arch>-linux/include/nvrtc.h; without it the hermetic
  cuda_nvrtc Bazel repo has an empty include/ and the CUDA build aborts
  with "missing input file @cuda_nvrtc//:include/nvrtc.h". Both
  pytorch-cpu-feedstock and jaxlib-feedstock list cuda-nvrtc-dev.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

c_compiler_version / cxx_compiler_version / c_stdlib_version belong to
conda-forge's `unix` zip_keys group. Under CF_CUDA_ENABLED the cuda
migrator adds a second entry (the CUDA variant), so a single-entry clang
override desynced the group and made conda-smithy rerender fail
("ambiguous ... we did not find ['18'] ... in c_compiler_version
['14','14']").

conda_build_config.yaml: give each overridden linux key two parallel
entries (CPU + CUDA) plus a matching c_stdlib_version block, mirroring
jaxlib-feedstock (which also builds XLA with clang alongside CUDA).
Includes the conda-smithy re-render. Rerender now succeeds: CPU and
CUDA 12.9 render as clang 18; CUDA 13.0 as clang 14 (cuda130 migrator
pin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The cuda130.yaml migrator carried gcc-era c_compiler_version (14/13),
so the CUDA 13.0 variant rendered with clang 14 -- too old for TF 2.21
/ XLA and rejected by the recipe's -fclang-abi-compat=17.

jaxlib-feedstock handles this by editing its own copy of the CUDA
migrators (they carry use_local: true) to pin the clang version it
builds with, in lockstep with recipe/conda_build_config.yaml. Follow
that: pin c_compiler_version / cxx_compiler_version / fortran_compiler_version
to 18 in cuda130.yaml, and add the same explicit clang-18 block to
cuda129.yaml (which previously only got clang 18 by fall-through).

Both CUDA 12.9 and 13.0 variants now render c_compiler: clang /
c_compiler_version: 18, matching the CPU variant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The CUDA variants now build with clang (conda_build_config.yaml + the
cuda migrators pin clang 18). TF's ./configure failed:

  UserInputError: Invalid GCC_HOST_COMPILER_PATH provided 10 times

because build_common.sh still set GCC_HOST_COMPILER_PATH=${GCC} -- and
there is no gcc in a clang build env, so ${GCC} is empty.

TF 2.21.0's configure.py only reads GCC_HOST_COMPILER_PATH when
TF_CUDA_CLANG=0; with TF_CUDA_CLANG=1 it reads CLANG_CUDA_COMPILER_PATH
and clang compiles the CUDA device code directly (--config=cuda_clang).
This matches XLA's own cuda_clang_local reference config and
jaxlib-feedstock's clang CUDA build.

CUDA branch: drop GCC_HOST_COMPILER_PATH/_PREFIX; set TF_CUDA_CLANG=1,
TF_NEED_CLANG=1, CLANG_CUDA_COMPILER_PATH / CLANG_COMPILER_PATH to the
conda clang. Also stop the later unconditional TF_CUDA_CLANG=0 /
TF_NEED_CLANG=0 from clobbering it -- gate them to the non-CUDA build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The clang-CUDA build (TF_CUDA_CLANG=1) failed compiling CUDA 13 device
code: clang 18's __clang_cuda_runtime_wrapper.h includes headers CUDA 13
removed (texture_fetch_functions.h), and the device pass chokes on
__float128 in gcc 15's libstdc++. clang only gained CUDA 13 support in
v21 -- clang 18 cannot target CUDA 13 at all.

Switch to nvcc for device code with clang 18 as the host compiler:
- TF_CUDA_CLANG=0, TF_NVCC_CLANG=1; append `build --config=cuda_nvcc`.
- configure.py reads GCC_HOST_COMPILER_PATH on the TF_CUDA_CLANG=0 path
  and only checks the path exists, so point it at clang -- which is what
  nvcc uses as host under TF_NVCC_CLANG.
- Put nvvm/bin on PATH for nvcc/cicc/ptxas.
- Strip TF's hardcoded -fuse-ld=lld from .bazelrc (conda clang has no lld;
  the cuda_clang config carried it, cuda_nvcc does not).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The CUDA build passed sm_100/sm_120 (Blackwell) to clang 18, which
errors "unsupported CUDA gpu architecture: sm_100". clang 18 only knows
up to sm_90. Drop sm_100/sm_120/compute_120 from
HERMETIC_CUDA_COMPUTE_CAPABILITIES -> sm_90/compute_90 ceiling, for both
the 12.x and 13.x lists. Blackwell support can return with a newer clang.

(One of two changes for the CUDA build; the crosstool-routing fix that
keeps nvcc-only copts off plain clang is separate.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
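
Capping the capability list as described in the commit above could be done with a small filter. This is a sketch under assumptions: the function name is made up, and it only drops entries above the ceiling; choosing which `compute_*` PTX target to keep is a separate decision that the commit makes by hand.

```shell
# Hypothetical filter: keep only capabilities at or below a ceiling.
cap_compute_capabilities() {
    # $1 = comma-separated list such as "sm_80,sm_90,sm_100,compute_120"
    # $2 = highest capability number to keep (e.g. 90)
    result=""
    for cap in $(printf '%s' "$1" | tr ',' ' '); do
        number="${cap##*_}"   # sm_100 -> 100, compute_90 -> 90
        if [ "$number" -le "$2" ]; then
            result="${result:+${result},}${cap}"
        fi
    done
    printf '%s\n' "$result"
}
```

The result would then be exported as `HERMETIC_CUDA_COMPUTE_CAPABILITIES` for both the 12.x and 13.x lists.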
<details><summary>Claude's draft</summary>

The CUDA build failed because the recipe appended a blanket
`--crosstool_top=//bazel_toolchain:toolchain` (the conda plain-clang
toolchain) for all variants. cuda_library .cu.cc targets carry nvcc-only
copts (-Xcuda-fatbinary=, -nvcc_options=, -x cuda, --cuda-gpu-arch=);
forced onto plain clang 18 those are rejected. --config=cuda_nvcc was set
but inert because TF's CUDA crosstool was never selected.

cuda_library is a plain cc_library, so Bazel resolution cannot route
.cu.cc separately. The mechanism that splits device/host is TF's CUDA
crosstool, whose host_compiler is the nvcc wrapper
(crosstool_wrapper_driver_is_not_gcc): it sends '-x cuda' actions to
nvcc 13 and everything else to conda clang 18 (CLANG_CUDA_COMPILER_PATH).

Make --crosstool_top per-variant: CPU keeps //bazel_toolchain:toolchain;
CUDA uses @local_config_cuda//crosstool:toolchain (+ host_crosstool_top).
That CUDA crosstool's cc_toolchain_suite is keyed k8 not x86_64, so the
CUDA linux-64 build also sets --cpu=k8 (CC_CPU). Mirrors TF's own
config:rocm pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
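
The per-variant crosstool selection from the commit above could be expressed as below. The labels and the `k8` key come from the commit message; the function name and argument order are assumptions.

```shell
# Hypothetical sketch: choose --crosstool_top per variant.
emit_crosstool_flags() {
    # $1 = cuda_compiler_version ("None" for the CPU variant)
    # $2 = target_platform (e.g. linux-64)
    if [ "$1" = "None" ]; then
        # CPU: the conda plain-clang toolchain.
        echo "build --crosstool_top=//bazel_toolchain:toolchain"
        return 0
    fi
    # CUDA: TF's CUDA crosstool, whose wrapper routes device code to nvcc.
    echo "build --crosstool_top=@local_config_cuda//crosstool:toolchain"
    echo "build --host_crosstool_top=@local_config_cuda//crosstool:toolchain"
    if [ "$2" = "linux-64" ]; then
        # The CUDA crosstool's suite is keyed k8 rather than x86_64.
        echo "build --cpu=k8"
    fi
}
```

The output is meant to be appended to the per-build `.bazelrc`, replacing the single blanket `--crosstool_top` line.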
<details><summary>Claude's draft</summary>

With the CUDA build routed through TF's CUDA crosstool, host .cc compiles
failed: 'sqlite3ext.h' / 'absl/base/log_severity.h' file not found. TF's
CUDA crosstool (cuda_configure.bzl) carries none of the conda
gen-bazel-toolchain customizations -- its cxx_builtin_include_directories
are only clang's own builtins, unfiltered_compile_flags is empty -- so
conda's -isystem $PREFIX/include and the LDFLAGS force-link block (baked
into //bazel_toolchain for the CPU build) never reach it.

In the CUDA branch, append explicit flags: --copt/--host_copt
-isystem $PREFIX/include, --linkopt/--host_linkopt -L$PREFIX/lib, and a
loop turning the assembled $LDFLAGS (the -lprotobuf/-lgrpc/-labsl_*/...
force-link set) into --linkopt/--host_linkopt. Same pattern the osx
branch already uses for the Apple crosstool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
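
The loop mentioned in the commit above, which turns the assembled `$LDFLAGS` into Bazel link options, might look like this. It is a sketch: the function name is invented, and it relies on plain word splitting, so it assumes no flag contains spaces or glob characters.

```shell
# Hypothetical sketch: emit one --linkopt and one --host_linkopt per flag.
emit_linkopts() {
    # $1 = space-separated linker flags, e.g. "-lprotobuf -lgrpc"
    for flag in $1; do
        echo "build --linkopt=${flag}"
        echo "build --host_linkopt=${flag}"
    done
}
```

For example, `emit_linkopts "$LDFLAGS" >> .bazelrc` would append the force-link set for both the target and the exec configuration, preserving the original flag order.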
<details><summary>Claude's draft</summary>

The previous attempt injected conda's header dir with
`--copt=-isystem $PREFIX/include`, but Bazel rejects an absolute include
path the active toolchain does not declare:

  hwloc/base64.c: the include path '$PREFIX/include' references a path
  outside [the execroot]

Use CPATH instead. clang reports CPATH directories in `clang -E -v`, so
cuda_configure.bzl folds $PREFIX/include into the CUDA crosstool's
cxx_builtin_include_directories (declared -> accepted), and clang also
searches CPATH at compile time. Export CPATH/CPLUS_INCLUDE_PATH so the
cuda_configure repo rule sees them, and pass them via
--action_env/--host_action_env so the compile actions do too. Library
dirs / force-link libs stay as --linkopt (linkopts are not path-checked).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

TF's CUDA crosstool passes --cuda-path=external/cuda_nvcc to every
compile action. On plain C files (e.g. vendored brotli) clang reports
"argument unused during compilation: '--cuda-path=...'", which TF's
-Werror,-Wunused-command-line-argument turns into a hard error.

Add -Qunused-arguments (--copt/--host_copt) for the CUDA build so clang
silently ignores command-line arguments that do not apply to a given
source file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
Contributor Author

I'm curious what you all think of this file.

Taking in some stats from my usage:

  1. claude-opus-4-7: 123.6k input, 1.4m output, 503.0m cache read, 5.7m cache write ($322.24)
  2. claude-haiku-4-5: 1.4m input, 34.5k output, 0 cache read, 0 cache write, 44 web search ($1.99)

Basically 2 days on a powerful laptop, though I'm pretty sure my computer crashed halfway through.

@hmaarrfk
Contributor Author

I understand the contents of these patches just as much as I understand the contents of the old ones.

hmaarrfk added 15 commits May 19, 2026 12:37
…ibprotobuf

<details><summary>Claude's draft</summary>

`import tensorflow` aborted with a protobuf descriptor double-registration
SIGABRT ("File already exists in database"). With conda's *shared*
libprotobuf there is one process-global generated-descriptor database, and
TF embeds the same generated proto .pb.o into many shared objects
(libtensorflow_framework.so, libtensorflow_cc.so, the _pywrap_*.so
extensions, ...); the second .so to load re-registers a proto file the
first already registered, and protobuf's AddDescriptors aborts.

Fix (keeps shared/systemlib protobuf — no static/hermetic switch, no TF
source patch):

- recipe/tf_proto_descriptor_guard.h — featherweight, force-included into
  every TU (--copt=-include). No protobuf/absl headers, so it is safe in
  vendored pre-C++17 sources. It just defines AddDescriptors ->
  AddDescriptors_TfGuarded.
- recipe/tf_proto_descriptor_guard_impl.h — force-included into the .pb.cc
  files only (--per_file_copt). A functional clone of protobuf's
  AddDescriptors that skips InternalAddGeneratedFile when
  internal_generated_database()->FindFileByName shows the proto file is
  already registered — idempotent instead of fatal.
- build_common.sh copies both headers into a toolchain -isystem dir, wires
  the copt/per_file_copt flags into .bazelrc, and removes them before
  packaging.

Also in build_common.sh: re-enable USE_PYWRAP_RULES (the upstream pywrap
build) for the python wheel and build the standalone libtensorflow /
libtensorflow_cc C/C++ libraries in a separate non-pywrap Bazel pass; drop
the --@local_config_cuda//cuda:override_include_cuda_libs flag so CUDA
libraries are dlopen'd lazily (no hard libcuda.so.1 DT_NEEDED), matching
jaxlib.

Verified: full cp312 pywrap wheel builds (20,267 Bazel actions, 0 errors)
and `import tensorflow` succeeds — tf.__version__ 2.21.0, tf ops run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
…ib protobuf flat deps

<details><summary>Claude's draft</summary>

Two fixes that together make the CPU build's `import tensorflow` work end-to-end.

0072 (OpRegistry duplicate tolerance — the import-test fix). Both
libtensorflow_framework.so.2 (from the pywrap python wheel) and
libtensorflow_cc.so.2.21.0 (from the libtensorflow_cc package) each statically
embed tensorflow/core/ops/function_ops.cc, so each runs the same _Arg op
registration's static initializer at process startup. Because OpRegistry isn't
initialized yet at that point, both registrations land on the single
OpRegistry::Global() deferred queue. The first load_op_library() call (Lite's
audio_microfrontend, reached on `import tensorflow` via
compat.v1.lite.experimental.authoring) calls LoadDynamicLibrary, whose first
action is ProcessRegistrations -> CallDeferred. The second _Arg in the queue
hits try_emplace's existing entry and aborts as AlreadyExistsError. Patch
OpRegistry::RegisterAlreadyLocked: when the op name is already registered,
compare the new OpDef to the existing one via OpDefEqual; if equal, silently
accept (and skip the watcher callback so the duplicate is not mis-attributed
to the load_op_library's contributed op list). Genuinely divergent
redefinitions still error.

0071 (systemlib _protobuf_deps — the build-analysis fix). With USE_PYWRAP_RULES
on, @tsl//tsl/platform:protobuf evaluates tsl_protobuf_deps()'s _protobuf_deps
branch, which references @com_google_protobuf//src/google/protobuf/io and the
:delimited_message_util / :differencer / :json_util / :type_resolver split
targets. The systemlib protobuf.BUILD does not declare those (conda's
libprotobuf is one complete library, exposed via flat :protobuf / :protobuf_lite).
Drop the missing sub-targets from _protobuf_deps so Bazel analysis succeeds.

Verified the CPU build's `import tensorflow` now reaches well past the proto
descriptor abort and into the tflite load_op_library path; this commit's 0072
addresses the latter. Full build-locally CPU end-to-end run pending the
in-flight rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

TF 2.21.0's setup.py.tpl pins h5py < 3.15.0, but conda-forge ships
h5py 3.16.0+. h5py is API-stable enough that the upper bound is
over-restrictive; the conda-forge pip-check post-install test fails
on the megabuild's `tensorflow` python package output:

    tensorflow 2.21.0 has requirement h5py<3.15.0,>=3.11.0,
    but you have h5py 3.16.0.

Drop the upper bound. The CPU variant's `import tensorflow` test
(the rest of the megabuild test phase) already passed end-to-end
with this commit's series — descriptor guard (2728fb3) +
OpRegistry duplicate tolerance (077cf82, patch 0072) +
_protobuf_deps systemlib flat targets (077cf82, patch 0071) +
this h5py loosening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

Follow jaxlib's approach for hermetic CUDA wheel builds:

* Add `build --config=cuda_wheel` to the per-build `.bazelrc` (CUDA
  variants only). This sets `--@local_config_cuda//cuda:include_cuda_libs=
  false`, so the CUDA runtime libraries (libcudart, libcublas, libcufft,
  libcusparse, libnvjitlink, ...) are loaded lazily at first GPU use rather
  than being hard-linked via DT_NEEDED into libtensorflow_framework.so.2.
  Without this the wheel ends up needing libcuda.so.1 at `import
  tensorflow`, which the conda test envs do not ship.

* XLA's `xla/stream_executor/cuda/cuda_executor.cc` still calls NVML
  directly (`nvmlDeviceGetHandleByPciBusId_v2`, ...), so build-time tools
  that pull in cuda_executor (e.g. `hlo_to_kernel`) fail with undefined
  references unless we hand them libnvidia-ml. Force-link the conda
  `cuda-nvml-dev` stub explicitly in LDFLAGS, alongside the existing
  `-lcusparse`.

* The proto-text codegen tool (`gen_proto_text_functions`) ends up
  DT_NEEDED-ing `libnvidia-ml.so.1` even under cuda_wheel, so symlink the
  stub into `${PREFIX}/lib` for build-time runtime resolution, paralleling
  the existing libcuda.so.1 stub symlink. Both stubs are removed from
  `${PREFIX}/lib` in `recipe/build.sh` before packaging so they never ship.

Validated end-to-end via `build-locally.py` for the CPU variant (✔ python
imports test passed). CUDA 13.0 build-locally is in progress with this
configuration; CUDA 12.9 still to validate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
…DYLIB

<details><summary>Claude's draft</summary>

Mirror of the linux force-link list (commit f7663d6) onto the osx-*
branch of `build_common.sh`.

TF 2.21.0's `cc_shared_library` rules do not forward the systemlib
`cc_library` `linkopts`, so libtensorflow_cc.dylib and
_pywrap_tensorflow_internal.so end up with no LC_LOAD_DYLIB entries for
the systemized grpc/sqlite3/icu/png/jpeg/gif/flatbuffers/abseil they
reference. On linux this manifests as `undefined reference` link errors,
which is why we already force-link those there.

On osx the existing `-Xlinker -undefined -Xlinker dynamic_lookup` flag
lets the link succeed silently and defers symbol resolution to runtime.
At `import tensorflow` time nothing has loaded libgrpc++, so dlopen of
_pywrap_tensorflow_internal.so fails with e.g.

    symbol not found in flat namespace '__ZN4grpc6Status2OKE'
    (== grpc::Status::OK)

Fix: explicitly list the systemized libraries in LDFLAGS so ld64 records
LC_LOAD_DYLIB entries for them. macOS ld64 does not accept
`--no-as-needed` or `--export-dynamic`, but listing `-l<name>` is enough
to add the load command; the existing `-undefined dynamic_lookup`
remains as a safety net for symbols not covered by any conda dylib.
abseil ships ~90 dylibs; enumerate them like the linux branch does.
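
The enumeration step can be sketched as follows. This is an illustration only: the real logic lives in `build_common.sh`, the `absl_link_flags` helper is hypothetical, and it assumes that versioned copies of a dylib carry extra dots in the file name and should be skipped.

```python
# Sketch: turn a directory listing into -l<name> flags for abseil dylibs.
# `absl_link_flags` is a hypothetical helper, not code from the recipe.

def absl_link_flags(filenames, prefix="libabsl_", suffix=".dylib"):
    """Return sorted -l flags for unversioned abseil dylibs in `filenames`."""
    flags = set()
    for name in filenames:
        if not (name.startswith(prefix) and name.endswith(suffix)):
            continue  # not an abseil dylib
        stem = name[len("lib"):-len(suffix)]
        if "." in stem:
            continue  # assumed to be a versioned copy of the same library
        flags.add("-l" + stem)
    return sorted(flags)


# Made-up listing for illustration; in practice it would come from
# something like os.listdir on the host prefix's lib directory.
listing = [
    "libabsl_base.dylib",
    "libabsl_base.2505.0.0.dylib",
    "libabsl_strings.dylib",
    "libgrpc++.dylib",
]
print(" ".join(absl_link_flags(listing)))
```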

Discovered from osx-arm64 azure CI logs: bazel build + py3.{10,11,12}
wheels all built fine; rattler-build's test phase blew up on
`import tensorflow` for the py3.10 cpu_py310h... output. The
rattler-build "links against" diagnostic in the same log confirms
libtensorflow_cc.dylib has no libgrpc++ load command before this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The previous commit (dfdda7e) added `-lnvidia-ml` to the CUDA-variant
LDFLAGS so XLA's `cuda_executor.cc` (which calls NVML directly) and
build-time tools like `hlo_to_kernel` could resolve NVML symbols against
the conda `cuda-nvml-dev` stub.

But that LDFLAGS line is appended AFTER the linux branch sets
`-Wl,--no-as-needed`, so every binary the linker produced — including
tiny extension modules that don't touch NVML at all, like
`tensorflow/python/platform/_pywrap_cpu_feature_guard.so` — ended up
with `DT_NEEDED libnvidia-ml.so.1`. The build-time stub symlink in
`${PREFIX}/lib` papered over this during the build, but the conda test
env has no NVML runtime (NVML normally ships with the NVIDIA driver,
not as a conda package), so `import tensorflow` blew up:

    File ".../tensorflow/python/platform/self_check.py", line 63
        from tensorflow.python.platform import _pywrap_cpu_feature_guard
    ImportError: libnvidia-ml.so.1: cannot open shared object file:
                 No such file or directory

Fix: bracket `-lnvidia-ml` with `-Wl,--as-needed ... -Wl,--no-as-needed`
so the spurious DT_NEEDED is dropped from binaries that don't actually
reference NVML symbols. Binaries that DO reference NVML
(`libtensorflow_framework.so.2`'s XLA stream_executor, `hlo_to_kernel`,
gen_proto_text_functions if it indirectly links cuda_executor) keep
their entry and still resolve against the build-time stub symlink.

The runtime DT_NEEDED in `libtensorflow_framework.so.2` itself is OK as
long as it's only reached on actual GPU use; cuda_wheel's lazy dlopen
already covers cudart/cublas/cufft, and import-time code paths like
preload_check no longer drag in the NVML chain.

Surfaced from CUDA 13.0 build-locally test-phase failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The previous CUDA fixes added `-L${BUILD_PREFIX}/.../stubs -lcusparse
-lnvidia-ml` to LDFLAGS and a `${PREFIX}/lib/libnvidia-ml.so.1`
symlink so build-time codegen tools could resolve NVML symbols. That
worked for the build but baked DT_NEEDED libnvidia-ml.so.1 (and
libcusparse.so.12) into every output -- including
`libtensorflow_framework.so.2.21.0` and tiny extension modules like
`_pywrap_cpu_feature_guard.so`. NVML ships only with the NVIDIA driver
(no conda-forge package provides libnvidia-ml.so.1), so the conda test
env fails at:

    from tensorflow.python.platform import _pywrap_cpu_feature_guard
    ImportError: libnvidia-ml.so.1: cannot open shared object file

The `--as-needed` wrapper I added in a follow-up didn't help: this
recipe forwards $LDFLAGS to Bazel via the per-token loop further down
build_common.sh; Bazel reorders linkopts by class, which strips the
`--as-needed`/`--no-as-needed` bracket scope, leaving the `-l<name>`
unconditionally NEEDed.

TF/XLA already ship an in-tree solution: `xla/tsl/cuda/nvml_stub.cc`
(with implib_so trampolines in `nvml.tramp.S`/`nvml.symbols`) provides
a lazy-dlopen NVML stub that is auto-aliased in by `:nvml` whenever
`--@local_config_cuda//cuda:include_cuda_libs=false` -- which is what
`--config=cuda_wheel` (already enabled) sets. Same story for cusparse,
cudart, cublas, cufft, cusolver. Jaxlib's `xla_cuda_plugin.so` on
conda-forge is built this way: zero DT_NEEDED for libnvidia-ml,
libcusparse, libcuda, libcudart, etc.

This commit:

1. Removes the `-lcusparse -lnvidia-ml` LDFLAGS additions and the
   `${PREFIX}/lib/libnvidia-ml.so.1` symlink (and its build.sh
   cleanup). Keeps the `libcuda.so.1` symlink -- that one is for a
   different problem (host tools that load libtensorflow_framework
   during codegen and hit its driver stub dep).

2. Adds patch `0074-xla-cuda_executor-depend-on-tsl-cuda-nvml-stub`
   making `cuda_executor` depend on `//xla/tsl/cuda:nvml` -- a
   one-line BUILD patch that mirrors what `cuda_platform` already
   does. Without this, a binary that links cuda_executor but not
   cuda_platform would still see undefined NVML symbols. Most TF
   binaries pull both in, but better safe than another 6-hour rebuild.

Diagnosed by inspecting the staging .bazelrc generated by the recipe
(linkopt reorder), readelf -d on the failing artifacts, and comparing
to a conda-installed jaxlib (clean hermetic layout).
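
The `readelf -d` check can be scripted along these lines. A minimal sketch: the sample text below is made up to match the usual `readelf -d` layout and is not taken from the actual build, and the helper names and forbidden-prefix list are assumptions.

```python
# Sketch: list DT_NEEDED entries from `readelf -d` output and flag CUDA ones.
import re

NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")


def needed_libraries(readelf_output):
    """Return the DT_NEEDED library names in the order readelf printed them."""
    return NEEDED_RE.findall(readelf_output)


def unexpected_cuda_deps(
    readelf_output,
    forbidden_prefixes=("libnvidia-ml", "libcusparse", "libcuda"),
):
    """Return NEEDED entries that a driverless test env could not resolve."""
    return [
        lib
        for lib in needed_libraries(readelf_output)
        if lib.startswith(forbidden_prefixes)
    ]


# Illustrative input only (hand-written, not a real log).
SAMPLE = """
 0x0000000000000001 (NEEDED)             Shared library: [libnvidia-ml.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000e (SONAME)             Library soname: [libexample.so.2]
"""

print(unexpected_cuda_deps(SAMPLE))  # ['libnvidia-ml.so.1']
```

On a real artifact the input would come from running `readelf -d <file>` (for example via `subprocess.run`) and passing its stdout to `unexpected_cuda_deps`; an empty list corresponds to the clean layout described above.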

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The ${PREFIX}/lib/libcuda.so.1 symlink (and its build.sh cleanup) was
added before --config=cuda_wheel to let build-time host tools resolve
libtensorflow_framework.so's libcuda.so.1 DT_NEEDED off-GPU.

Under cuda_wheel (include_cuda_libs=false, a common: flag that also
applies to the exec/host config), XLA routes the CUDA driver through
its always-lazy in-tree stub (//xla/tsl/cuda:cuda -> cuda_stub.cc), so
libtensorflow_framework.so no longer DT_NEEDEDs libcuda.so.1 -- verified
by readelf on the freshly built artifact (zero CUDA DT_NEEDED). The
symlink is therefore unnecessary; remove it and its packaging cleanup.

If a host tool still fails with "libcuda.so.1: cannot open", the fix is
to route that tool through //xla/tsl/cuda:cuda rather than re-adding the
symlink. Being validated by a clean CUDA 13.0 build-locally rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

The active maintainers do not have the bandwidth to support
linux-aarch64 or macOS (osx-64 and osx-arm64) at this time. Add them to
the top-level build skip so only linux-64 (x86_64) is built, where the
CPU and CUDA 12.9/13.0 variants are built and tested.

The rattler-build `aarch64` selector covers linux-aarch64 (and
osx-arm64); `osx` covers both osx-64 and osx-arm64. Replaces the
narrower "aarch64 and cuda_compiler_version != None" skip. Contributions
to re-enable aarch64/osx are welcome.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

After the nvml-DT_NEEDED fix let `import tensorflow` get past
preload_check, the CUDA package aborts during stream_executor init:

  F0000 repeat_buffer_kernel_cuda.cc:32] Failed to register kernel:
    ALREADY_EXISTS: Object for trait ...RepeatBufferKernel... and
    platform CUDA is already registered.  -> Aborted (core dumped)

Same multi-.so root cause as the proto-descriptor and OpRegistry
duplicates: the RepeatBufferKernelCuda static registrar is compiled into
both libtensorflow_framework.so.2 and libtensorflow_cc.so.2.21.0, and
each .so runs its module initializer once against the process-global
PlatformObjectRegistry singleton. The registrar macros LOG(FATAL) on any
non-OK status, so the second (identical) registration kills the process.

Add patch 0075 making PlatformObjectRegistry::RegisterObject keep the
first registration and return Ok for an identical-key duplicate, instead
of AlreadyExistsError. Fixing it at the single insert chokepoint covers
GPU kernels and every other STREAM_EXECUTOR_REGISTER_OBJECT_STATICALLY
user, avoiding whack-a-mole. Mirrors patch 0072 (OpRegistry).
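
The behavior change can be modeled in a few lines. This is a Python sketch of the idea only, not the C++ in patch 0075; the `Registry` class and `run_module_initializer` function are hypothetical stand-ins for the process-global registry and a .so's static registrar.

```python
# Model: two shared objects each run the same static registrar against
# one process-global registry. All names here are illustrative.

class AlreadyExistsError(Exception):
    pass


class Registry:
    def __init__(self, tolerate_duplicates):
        self._objects = {}
        self._tolerate_duplicates = tolerate_duplicates

    def register(self, key, obj):
        if key in self._objects:
            if self._tolerate_duplicates:
                return  # keep the first registration, report success
            raise AlreadyExistsError(key)
        self._objects[key] = obj

    def lookup(self, key):
        return self._objects[key]


KEY = ("RepeatBufferKernel", "CUDA")


def run_module_initializer(registry, so_name):
    """Stand-in for a .so's static registrar firing at load time."""
    registry.register(KEY, f"kernel from {so_name}")


tolerant = Registry(tolerate_duplicates=True)
run_module_initializer(tolerant, "libtensorflow_framework.so.2")
run_module_initializer(tolerant, "libtensorflow_cc.so.2.21.0")
print(tolerant.lookup(KEY))  # kernel from libtensorflow_framework.so.2
```

With `tolerate_duplicates=False` the second call raises `AlreadyExistsError`, which corresponds to the path where the registrar macro turns a non-OK status into `LOG(FATAL)`.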

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
…nstead)

<details><summary>Claude's draft</summary>

After patch 0075 fixed the GPU-kernel registry abort, the next
duplicate-registration in the multi-.so layout surfaces in systemized
absl's process-global FlagRegistry:

  ERROR: Flag 'coordination_agent_recoverable' was defined more than once
  but with differing types. Defined in files
  '.../coordination_service_agent.cc' and '.../coordination_service_agent.cc'.

coordination_service_agent.cc is statically embedded in both
libtensorflow_framework.so.2 and libtensorflow_cc.so.2, so its
ABSL_FLAG static registrar runs twice against the one libabsl_flags
registry and aborts `import tensorflow`. Unlike OpRegistry (0072) and
PlatformObjectRegistry (0075), the registry here lives in systemized
absl and cannot be patched to tolerate the duplicate, and the flag can't
be deduplicated across the two .so's (its only reader is in the same TU,
and absl's per-TU FastTypeId<bool> differs between the .so's -- hence the
"differing types" message).

Patch 0076 drops the ABSL_FLAG (an experimental default-false knob whose
own TODO asks to move it off a flag) and reads the override from the
TF_COORDINATION_AGENT_RECOVERABLE env var via tsl::ReadBoolFromEnvVar,
parsed once. The programmatic `recoverable` parameter is unchanged;
operators keep a global override via the env var. BUILD dep swapped
absl/flags:flag -> //xla/tsl/util:env_var on coordination_service_agent.

Diagnosed from a CUDA 13.0 CI test-phase log; applies cleanly to the
2.21.0 source. Will be exercised by the in-flight local CUDA 13.0 build,
whose test phase was expected to reach this same error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
…sede 0076)

<details><summary>Claude's draft</summary>

Root cause of the duplicate-registration whack-a-mole: TF's pywrap+systemlib
megabuild ships both libtensorflow_framework.so and libtensorflow_cc.so in the
wheel and the _pywrap_*.so extensions link both. cc_shared_library normally
partitions each TU into exactly one .so, but the systemlib force-linking
defeats that, so the same static registrars (proto, ops, GPU kernels, AND
absl flags) get embedded into both libs. Upstream survives because it uses
static absl (per-.so flag registries); our systemized absl has ONE shared
FlagRegistry, so a flag defined in two loaded .so's aborts `import tensorflow`:

  ERROR: Flag 'coordination_agent_recoverable' was defined more than once
  ERROR: Flag 'leave_barriers_on_recoverable_agent_restart' ...

We fixed the proto/OpRegistry/PlatformObjectRegistry families by patching TF's
own registries to tolerate duplicates, but absl's FlagRegistry is a prebuilt
conda package we cannot patch, and there are ABSL_FLAGs across several
subsystems -- converting them one-by-one (patch 0076 did one) is untenable.

Systemic fix: add tf_absl_flag_guard.h. It includes absl/flags/flag.h then
overrides the ABSL_FLAG_IMPL_REGISTRAR sub-macro to construct
FlagRegistrar<T, /*do_register=*/false> -- abseil's own ABSL_FLAGS_STRIP_NAMES
build already uses this <T,false> form. Net effect: ABSL_FLAG still defines
FLAGS_<name> (so absl::GetFlag keeps working) and keeps .OnUpdate() and the
name/help, but skips inserting into the shared FlagRegistry, so the duplicate
registration across the two .so's becomes a silent no-op.

Scoped via --per_file_copt to just the 10 ABSL_FLAG-defining .cc files
enumerated from the 2.21.0 source (every `ABSL_FLAG(` lives in one of them),
so it does not re-key the whole build (keeps the bazel disk cache warm) and
does not pull absl/flags into unrelated TUs. The header self-guards to
non-CUDA C++ so it is a no-op anywhere it might otherwise reach.

XLA_FLAGS is unaffected (XLA uses its own flag mechanism, not ABSL_FLAG). The
only behavior change is that TF's C++ absl flags are no longer settable via
the command-line registry -- not used by the Python package.

Reverts patch 0076 (the per-flag env-var hack for coordination_agent_recoverable),
now handled uniformly by the guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

Document the libcuda rabbit hole in CLAUDE_FEEDSTOCK_GUIDE.md (an
anti-pattern bullet + a toolchain-table row): never force-link CUDA libs
in LDFLAGS or symlink their stubs into $PREFIX/lib. --config=cuda_wheel
already routes them through XLA's in-tree lazy-dlopen stubs, so
force-linking leaks a hard DT_NEEDED into every .so and breaks the
driverless conda test env. If a target truly needs a CUDA symbol, add
//xla/tsl/cuda:<lib> to its BUILD deps (patch 0074).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

Condense the comment blocks added during the 2.21.0 bump to 1-2 sentence
explanations, drop historical ("we used to...") and failed-approach
narration, and stop restating what the diffs do.

- build_common.sh: condense the proto/absl guard, LDFLAGS force-link,
  NVML/cusparse, nvcc-host-compiler, cccl-flatten, .bazelrc, crosstool,
  CPATH, and cuda_wheel comment blocks; delete the historical
  libcuda.so.1-symlink note entirely.
- build.sh, recipe.yaml: tighten the read-only-toolchain comment and the
  0072/0074/0075 inline patch notes; drop the "supersedes patch 0076"
  historical line.
- patch headers 0052/0063/0065/0066/0067/0070/0071/0072/0074/0075:
  condense the prose body to 1-2 sentences. Diff hunks untouched
  (verified byte-identical to HEAD); 0064/0068/0073 left as-is. Removed a
  stray non-format-patch line 1 from 0069.
- tf_proto_descriptor_guard.h, tf_proto_descriptor_guard_impl.h,
  tf_absl_flag_guard.h: condense the 40+ line headers (dropped abort-quote
  dumps, disassembly, and "this used to be" framing); code unchanged.

Comment-only: shell scripts pass bash -n, no Bazel flags or patch-list
entries changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
@hmaarrfk
Contributor Author

Next steps are to:

  1. Test this out. (I can only test out CUDA 13)
  2. Ask the AI why it decided to use LDFLAGS instead of patching Bazel like the other maintainers did
  3. Ask the AI to rebuild a dev build so the LDFLAGS changes get folded back in as Bazel patches

hmaarrfk added a commit to hmaarrfk/tensorflow-feedstock that referenced this pull request May 24, 2026
<details><summary>Claude's draft</summary>

Set github_actions.store_build_artifacts: true (+ artifact_retention_days:
14) in conda-forge.yml and rerender with conda-smithy 3.61.2. The GHA
workflow now publishes each job's built .conda packages as a downloadable
workflow artifact (actions/upload-artifact@v7, 14-day retention), plus the
build env on failure — so PR conda-forge#491 builds can be downloaded and tested
without waiting for merge/upload to anaconda.org.

Rerender also dropped the now-unused .ci_support configs for the skipped
platforms (linux_aarch64, osx_64) and added .scripts/create_conda_build_artifacts.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
@conda-forge-admin
Contributor

conda-forge-admin commented May 24, 2026

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe/recipe.yaml) and found some lint.

Here's what I've got...

For recipe/recipe.yaml:

  • ❌ In conda-forge.yml: $.github_actions = {'timeout_minutes': 1080, 'triggers': ['push', 'pull_request'], 'store_build_artifacts': True, 'artifact_retention_days': 14}.

    {'timeout_minutes': 1080, 'triggers': ['push', 'pull_request'], 'store_build_artifacts': True, 'artifact_retention_days': 14} is not valid under any of the given schemas

    Schema
    {
      "anyOf": [
        {
          "$ref": "#/$defs/GithubActionsConfig"
        },
        {
          "type": "null"
        }
      ]
    }

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/26364308038. Examine the logs at this URL for more detail.

hmaarrfk added 2 commits May 24, 2026 10:30
<details><summary>Claude's draft</summary>

Set github_actions.store_build_artifacts: true (+ artifact_retention_days:
14) so the GHA build jobs publish each config's built .conda packages as a
downloadable workflow artifact (14-day retention) — lets PR builds be
downloaded and tested before merge/upload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
…6.05.24.12.35.11

Other tools:
- conda-build 26.3.0
- rattler-build 0.65.0
- rattler-build-conda-compat 1.4.14
@hmaarrfk hmaarrfk force-pushed the update_to_2.21.0 branch from fd1c802 to a75c53f May 24, 2026 14:31
hmaarrfk added 2 commits May 24, 2026 10:46
<details><summary>Claude's draft</summary>

5th and final duplicate-registration manifestation of the systemlib
two-.so layout, found by running real GPU ops. Import and pip-check, the
only tests CI runs, pass because conda-forge runners are GPU-less. With a
GPU present TF auto-places ops there; the first host<->device copy aborts:

  InvalidArgumentError: Multiple OpKernel registrations match NodeDef at the
  same priority '_Send' device_type: "CPU" and '_Send' device_type: "CPU"

_Send/_Recv (and other) kernels are embedded in both libtensorflow_framework.so
and libtensorflow_cc.so, so they register twice in the process-global OpKernel
registry and FindKernelRegistration errors on the ambiguity. Patch 0077 makes
OpKernelRegistrar::InitInternal skip an exact duplicate (same key,
kernel_class_name and serialized KernelDef); genuinely distinct kernels are
still registered. Mirrors 0072 (OpRegistry) / 0075 (PlatformObjectRegistry).
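
The skip-exact-duplicate rule can be modeled as below. This is a simplified Python sketch, not the C++ in patch 0077: the class names are hypothetical, and lookup ambiguity is reduced to "more than one registration under the same key", whereas the real lookup also considers the NodeDef and priority.

```python
# Model: a registration is an exact duplicate only when key, kernel class
# name and serialized KernelDef all match. All names are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelRegistration:
    key: str
    kernel_class_name: str
    serialized_kernel_def: bytes


class KernelRegistry:
    def __init__(self, skip_exact_duplicates):
        self._by_key = {}
        self._skip_exact_duplicates = skip_exact_duplicates

    def register(self, registration):
        entries = self._by_key.setdefault(registration.key, [])
        if self._skip_exact_duplicates and registration in entries:
            return  # identical triple already present: nothing to add
        entries.append(registration)

    def find(self, key):
        entries = self._by_key.get(key, [])
        if len(entries) > 1:
            raise LookupError(f"multiple registrations match {key!r}")
        return entries[0] if entries else None


send = KernelRegistration("_Send:CPU", "SendOp", b"kernel-def-bytes")
recv = KernelRegistration("_Recv:CPU", "RecvOp", b"other-kernel-def")

registry = KernelRegistry(skip_exact_duplicates=True)
for registration in (send, send, recv):  # `send` arrives once per .so
    registry.register(registration)

print(registry.find("_Send:CPU").kernel_class_name)  # SendOp
```

With `skip_exact_duplicates=False` the same sequence leaves two `_Send:CPU` entries and `find` raises, which is the ambiguity the error message above reports.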

Must be GPU-validated locally since CI cannot exercise it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
<details><summary>Claude's draft</summary>

Document at the top of recipe.yaml the upstream XLA bug where JIT
(jit_compile=True / keras auto-JIT) crashes emitting the Ampere TF32
tensor-core mma.sync matmul on CUDA 13 ("FloatAttr does not match
expected type of the constant" / "Operand is null" -> Failed to emit
LLVM IR). Records that it is compiler-side XLA codegen (not packaging),
the disable-TF32 workaround, and the corroborating issues jax-ml/jax#20154
and libxsmm/tpp-mlir#870.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resume this Claude session:
```
claude --resume d2eda4d2-e5f8-4dd0-a194-052aea8a0ff3
```
</details>
@conda-forge-admin
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/recipe.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/recipe.yaml:

  • ℹ️ output 0 output overrides versions pinned in the feedstock:
    ['- In section host: python 3.12.*']
    Requirement spec should not list version specifiers to respect conda-forge-pinning. If you need to force another version, please override the pin via conda_build_config.yaml.
  • ℹ️ 'Store Build Artifacts' is deprecated.
    Deprecated. Whether to store build artifacts. Use workflow_settings.store_build_artifacts instead.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/26380789360. Examine the logs at this URL for more detail.

@hmaarrfk
Contributor Author

I'm not sure I'm going to be able to get this over the finish line. When I test this with CUDA 13, it just fails when running real, usable models.

@hmaarrfk hmaarrfk changed the title Update to 2.21.0 Ask Claude to Update to 2.21.0 May 25, 2026