Skip to content

Storage Read API returns full Arrow IPC streams instead of bare messages #493

@otegami

Description

@otegami

What happened?

Reading a table over the Storage Read API (Arrow) with the official cloud.google.com/go/bigquery client returns no rows. With the Storage Read client enabled, RowIterator reads zero rows, and on the current main it panics in RowIterator.Next (index out of range [0] with length 0).

The cause is the Arrow framing: each field should be a single bare IPC message, but the emulator writes a whole IPC stream into it.

  • serialized_schema: [schema][empty record batch][EOS] — should be just a schema message.
  • serialized_record_batch: [schema][record][EOS] — should be just a record-batch message.

The client reads the schema field first, consumes its empty batch (zero rows), and stops at the EOS before the real data — so RowIterator.Next panics.
Low-level access (CreateReadSession / ReadRows decoded by hand) tolerates this, which is why it goes unnoticed.

What did you expect to happen?

Per the Storage v1 proto (google/cloud/bigquery/storage/v1/arrow.proto), serialized_schema is an "IPC serialized Arrow schema" and serialized_record_batch is an "IPC-serialized Arrow RecordBatch" — i.e. each field is a single Arrow IPC-encapsulated message (no leading schema, no end-of-stream marker), as real BigQuery sends them. With that framing the official client's RowIterator returns the table rows.

Why a single message, and not an IPC stream

Arrow IPC has two layers: a single encapsulated message (one schema, or one record batch), and a stream — the sequence <schema><record batch>…<EOS optional> built from those messages.

These proto fields each carry one encapsulated message. The consumer builds the stream itself by concatenating serialized_schema + each serialized_record_batch — that concatenation is the stream.

The emulator instead writes a whole stream into every field (serialized_schema = [schema][empty batch][EOS], each serialized_record_batch = [schema][record][EOS]). Concatenating those gives two schemas and a mid-stream EOS — not a valid stream.

Note on dictionaries: a dictionary-encoded column requires DictionaryBatch message(s) before the record batch. The emulator emits plain (non-dictionary) batches, so a single record-batch message is complete here; dictionary support would be a separate change.

How can we reproduce it (as minimally and precisely as possible)?

A minimal failing test reproduces it in two steps.

  1. Save this as server/storage_read_repro_test.go:
storage_read_repro_test.go
package server_test

import (
	"context"
	"testing"

	"cloud.google.com/go/bigquery"
	"github.com/goccy/bigquery-emulator/server"
	"github.com/goccy/bigquery-emulator/types"
	"google.golang.org/api/iterator"
	"google.golang.org/api/option"
)

func TestStorageReadRepro(t *testing.T) {
	ctx := context.Background()

	// Emulator holding one table: dataset1.users = {id: 42, name: "alice"}.
	emulator, err := server.New(server.TempStorage)
	if err != nil {
		t.Fatal(err)
	}
	source := server.StructSource(
		types.NewProject("test",
			types.NewDataset("dataset1",
				types.NewTable("users",
					[]*types.Column{
						types.NewColumn("id", types.INT64),
						types.NewColumn("name", types.STRING),
					},
					types.Data{
						{"id": 42, "name": "alice"},
					},
				),
			),
		),
	)
	if err := emulator.Load(source); err != nil {
		t.Fatal(err)
	}
	testServer := emulator.TestServer() // in-process REST + gRPC, no TLS
	defer testServer.Close()

	// Official client with the Storage Read API enabled.
	client, err := bigquery.NewClient(ctx, "test",
		option.WithEndpoint(testServer.URL), option.WithoutAuthentication())
	if err != nil {
		t.Fatal(err)
	}
	defer client.Close()
	grpcOptions, err := testServer.GRPCClientOptions(ctx)
	if err != nil {
		t.Fatal(err)
	}
	if err := client.EnableStorageReadClient(ctx, grpcOptions...); err != nil {
		t.Fatal(err)
	}

	// Read the table over the Storage Read API.
	rowIterator := client.Dataset("dataset1").Table("users").Read(ctx)
	var rows []map[string]bigquery.Value
	for {
		row := map[string]bigquery.Value{}
		err := rowIterator.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			t.Fatalf("read failed: %v", err)
		}
		rows = append(rows, row)
	}

	// Expect exactly the row we stored to come back.
	if len(rows) != 1 {
		t.Fatalf("want 1 row, got %d: %v", len(rows), rows)
	}
	gotID := rows[0]["id"]
	gotName := rows[0]["name"]
	if gotID != int64(42) || gotName != "alice" {
		t.Fatalf("want {id: 42, name: alice}, got {id: %v, name: %v}", gotID, gotName)
	}
}
  1. Run it:
go test ./server/ -run TestStorageReadRepro -count=1 -v

Expected

PASS — the test reads {id: 42, name: "alice"} back.

Actual (on current main)

FAIL — the official client decodes the non-conformant framing into zero rows, then RowIterator.Next panics:

=== RUN   TestStorageReadRepro
--- FAIL: TestStorageReadRepro (0.12s)
panic: runtime error: index out of range [0] with length 0 [recovered, repanicked]
...
cloud.google.com/go/bigquery.(*RowIterator).Next(...)
	cloud.google.com/go/bigquery@v1.77.0/iterator.go:177
github.com/goccy/bigquery-emulator/server_test.TestStorageReadRepro(...)
	server/storage_read_repro_test.go
FAIL	github.com/goccy/bigquery-emulator/server

Anything else we need to know?

References:

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions