CreateReasoningEngine fails with code 13 after deleting previous resource (Dockerfile/image_spec path, us-central1) #6754

@trossy1

Environment details

  • OS type and version: macOS (Darwin kernel 24.6.0, arm64)
  • Python version: N/A (using REST API directly via curl, not the Python SDK)
  • pip version: N/A
  • google-cloud-aiplatform version: N/A — REST API v1beta1

Summary

After a Dockerfile-based reasoning engine is deployed successfully via the REST API (image_spec: {} + inline_source.source_archive) and then deleted with force=true, every subsequent CreateReasoningEngine operation in us-central1 fails with code 13 (INTERNAL). The same request succeeds in us-east4 within the same project.

Cloud Logging confirms that the build completes and the container starts healthy, which suggests the failure occurs in Agent Engine's internal post-deploy verification.

Steps to reproduce

  1. Deploy a Dockerfile-based reasoning engine via REST API to us-central1:

    curl -X POST \
      "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/{PROJECT}/locations/us-central1/reasoningEngines" \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/json" \
      -d '{
        "display_name": "my-dockerfile-agent",
        "spec": {
          "source_code_spec": {
            "inline_source": { "source_archive": "<base64-tar-gz-of-Dockerfile-and-source>" },
            "image_spec": {}
          },
          "agent_framework": "custom",
          "class_methods": [{"name": "query", "api_mode": ""}],
          "deployment_spec": {
            "env": [{"name": "SOME_VAR", "value": "some_value"}],
            "min_instances": 1,
            "max_instances": 1,
            "resource_limits": {"cpu": "4", "memory": "8Gi"}
          }
        }
      }'

    Result: Success — operation completes, resource created, :query works.

  2. Delete the reasoning engine:

    curl -X DELETE \
      "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/{PROJECT}/locations/us-central1/reasoningEngines/{RESOURCE_ID}?force=true" \
      -H "Authorization: Bearer $(gcloud auth print-access-token)"

    Result: Delete succeeds (done: true).

  3. Create a new reasoning engine with the same or different payload:

    # Same curl as step 1, different display_name

    Result: Fails with code 13 every time.

  4. Deploy the exact same payload to us-east4 in the same project:

    # Same curl but with us-east4 in the URL

    Result: Success — deploys fine, container starts, :query works.
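
To make the region comparison in steps 1 to 4 easier to repeat, the requests can be parameterized by location. The following is a minimal Node.js sketch that only assembles the URL and JSON body used by the curl commands above; it sends nothing. The helper names (`baseUrl`, `buildCreateRequest`, `buildDeleteRequest`) and the project ID `my-project` are hypothetical, and the payload fields are copied from step 1.

```javascript
// Hypothetical helpers: they only assemble the request, nothing is sent.
const API_VERSION = "v1beta1";

function baseUrl(project, location) {
  return `https://${location}-aiplatform.googleapis.com/${API_VERSION}` +
    `/projects/${project}/locations/${location}/reasoningEngines`;
}

// Mirrors the JSON body from step 1; only displayName and the archive vary.
function buildCreateRequest(project, location, displayName, sourceArchiveBase64) {
  return {
    method: "POST",
    url: baseUrl(project, location),
    body: {
      display_name: displayName,
      spec: {
        source_code_spec: {
          inline_source: { source_archive: sourceArchiveBase64 },
          image_spec: {},
        },
        agent_framework: "custom",
        class_methods: [{ name: "query", api_mode: "" }],
        deployment_spec: {
          env: [{ name: "SOME_VAR", value: "some_value" }],
          min_instances: 1,
          max_instances: 1,
          resource_limits: { cpu: "4", memory: "8Gi" },
        },
      },
    },
  };
}

// Mirrors the DELETE call from step 2, including force=true.
function buildDeleteRequest(project, location, resourceId) {
  return {
    method: "DELETE",
    url: `${baseUrl(project, location)}/${resourceId}?force=true`,
  };
}

// Same payload, two regions: only the location differs.
const archive = "<base64-tar-gz-of-Dockerfile-and-source>";
const central = buildCreateRequest("my-project", "us-central1", "agent-a", archive);
const east = buildCreateRequest("my-project", "us-east4", "agent-a", archive);
console.log(central.url);
console.log(JSON.stringify(central.body) === JSON.stringify(east.body)); // true
```

Building both requests from one function rules out accidental payload drift between the failing and the working region.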

Observed behavior

  • The operation is accepted and returns an operation name
  • Cloud Logging (reasoning_engine_build) shows the Dockerfile build completing successfully ("DONE", image pushed with SHA digest)
  • Cloud Logging (reasoning_engine_stdout) shows the container starting and logging that it's listening on port 8080
  • Despite the container being healthy, the operation completes with code 13
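
For anyone trying to confirm the observations above in their own project, a Cloud Logging filter per log can help narrow the search. The sketch below is an assumption-laden helper, not a documented recipe: the log IDs `reasoning_engine_build` and `reasoning_engine_stdout` come from this report, while the function name `buildLogFilter`, the substring match on `logName`, and the choice of time window are illustrative.

```javascript
// Hypothetical helper: builds a Cloud Logging query string.
// `logName:"..."` is a substring match, so the full log path is not needed.
function buildLogFilter(logId, sinceTimestamp) {
  return `logName:"${logId}" AND timestamp>="${sinceTimestamp}"`;
}

// Start the window at the operation's createTime from the error response.
const operationCreateTime = "2026-05-07T23:59:49.660727Z";
const buildFilter = buildLogFilter("reasoning_engine_build", operationCreateTime);
const stdoutFilter = buildLogFilter("reasoning_engine_stdout", operationCreateTime);

console.log(buildFilter);
console.log(stdoutFilter);
```

Each resulting string can be pasted into the Logs Explorer query box or passed as the filter argument to `gcloud logging read`.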

Expected behavior

CreateReasoningEngine should succeed, since the build completes and the container starts healthy. Deleting and recreating a reasoning engine should not leave the region unusable for the project.

Minimal Dockerfile used for testing

    # Dockerfile
    FROM node:22-slim
    WORKDIR /app
    COPY server.js ./
    CMD ["node", "server.js"]

    // server.js
    const http = require("http");
    http.createServer((req, res) => {
      if (req.url === "/ping") { res.end(JSON.stringify({status:"ok"})); return; }
      let body = "";
      req.on("data", c => body += c);
      req.on("end", () => {
        res.writeHead(200, {"content-type":"application/json"});
        res.end(JSON.stringify({output: "echo: " + body}));
      });
    }).listen(8080, () => console.log("listening on 8080"));

Even this minimal 2-file container fails with code 13 in us-central1 after the delete, but deploys fine in us-east4.

Error response

{
  "name": "projects/{NUMBER}/locations/us-central1/reasoningEngines/{ID}/operations/{OP_ID}",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.aiplatform.v1beta1.CreateReasoningEngineOperationMetadata",
    "genericMetadata": {
      "createTime": "2026-05-07T23:59:49.660727Z",
      "updateTime": "2026-05-07T23:59:49.660727Z"
    }
  },
  "done": true,
  "error": {
    "code": 13,
    "message": "Please refer to our documentation (https://cloud.google.com/vertex-ai/generative-ai/docs/agent-engine/troubleshooting/deploy) for checking logs and other troubleshooting tips."
  }
}
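
When polling, the operation JSON has to be interpreted: `done: true` with an `error` object means failure, and code 13 corresponds to INTERNAL in the standard gRPC status codes. The following Node.js sketch classifies an operation object of the shape shown above. The function name `summarizeOperation` and the state labels are hypothetical, the code table is a subset of the standard codes, and the sample object is abbreviated from the error response in this report.

```javascript
// Subset of the standard gRPC status codes (google.rpc.Code).
const RPC_CODE_NAMES = {
  0: "OK",
  3: "INVALID_ARGUMENT",
  5: "NOT_FOUND",
  7: "PERMISSION_DENIED",
  9: "FAILED_PRECONDITION",
  13: "INTERNAL",
  14: "UNAVAILABLE",
};

// Hypothetical helper: reduces a long-running operation to a small summary.
function summarizeOperation(op) {
  if (!op.done) {
    return { state: "PENDING" };
  }
  if (op.error) {
    const code = op.error.code;
    return {
      state: "FAILED",
      code,
      codeName: RPC_CODE_NAMES[code] || "UNKNOWN_CODE",
      message: op.error.message,
    };
  }
  return { state: "SUCCEEDED", response: op.response };
}

// Abbreviated copy of the error response from this report.
const failedOperation = {
  name: "projects/{NUMBER}/locations/us-central1/reasoningEngines/{ID}/operations/{OP_ID}",
  done: true,
  error: { code: 13, message: "Please refer to our documentation ..." },
};

const summary = summarizeOperation(failedOperation);
console.log(summary.state, summary.codeName); // FAILED INTERNAL
```

Because INTERNAL carries no detail beyond the generic message, the build and stdout logs remain the only diagnostics available to the caller.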

Additional context

  • Region: us-central1 is broken, us-east4 works — same project, same payload, same permissions
  • The issue started immediately after deleting a previously deployed reasoning engine
  • All IAM roles verified (reasoningEngineServiceAgent, artifactregistry.reader, storage.objectAdmin, logging.logWriter)
  • Staging bucket exists and is accessible
  • Cloud Resource Manager API is enabled
  • No VPC-SC configured
  • Multiple retries over 2+ hours — issue does not self-heal
  • Operations cannot be cancelled via the API ("not cancellable")

Hypothesis

Deleting the reasoning engine left orphaned internal state (Cloud Run revision, internal AR image reference, or routing configuration) in us-central1 that blocks new reasoning engine deployments from completing their post-deploy verification step. The build and container startup succeed, but the orchestration layer's readiness check fails against stale internal state.

Labels: api: vertex-ai (Issues related to the googleapis/python-aiplatform API.)