From d1363cbd7777a39338df564313a196f9ed821362 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Wed, 18 Mar 2026 17:14:56 +0000 Subject: [PATCH 01/35] first commit putting most of the information I want out there --- .../container/cpu-speech-to-text.mdx | 183 ++++++++++++++++++ 1 file changed, 183 insertions(+) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 60b63656..53aa8531 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -225,6 +225,189 @@ The following example shows how to use `--all-formats` parameter. In this scenar +## Batch persisted worker transcription + +Batch persisted workers (knows as http batch workers), are multi session capable persisted workers. They work utilizing an http server, which is able to +accept jobs through POST and by using the [V2 Batch REST API] (https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was build to mimic exactly the V2 API capabilities and the whole life cycle +of posting a job, to checking the status of the jobs and asking for the transcript. + + +You can run the persisted worker with: + + + {`docker run -it \\ + -e LICENSE_TOKEN=$TOKEN_VALUE \\ + -p PORT:18000 \\ + batch-asr-transcriber-en:${smVariables.latestContainerVersion} \\ + --run-mode http \\ + --parallel=4 \\ + --all-formats /output_dir_name +`} + + +The parameters are: +- `parallel` - The number of parallel sessions you want this container to have (Each session corresponds to one gpu connection). The more sessions the higher + throughput you should be able to get (until you max out your gpu capacity). (Might worth adding recommendations here? IDK). +- `all-formats` This is similar to [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats). 
+ If this is not provided the default path that all jobs and logs will be saved to is `/tmp/jobs` + +To submit a job you can either use curl directly or using the python sdk. +With curl: +``` + curl -X POST address.of.container:PORT/v2/jobs \ + -H 'X-SM-Processing-Data: {"parallel_engines":2, "user_id":"MY_USER_ID"}' \ + -F 'config={ + "type":"transcription", + "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"} + }' \ + -F 'data_file=@~/audio_file.mp3' +``` +Returns: + on success:json string containing job id: {"job_id": "abcdefgh01"} and HTTP status code 201 + on failure: technically it raises but the exception is translated to HTTP status code != 200: + HTTP status code 503 for server busy + HTTP status code 400 for invalid request + +with [python sdk](https://github.com/speechmatics/speechmatics-python-sdk?tab=readme-ov-file#batch-transcription): +``` +import asyncio +import os +from dotenv import load_dotenv +from speechmatics.batch import AsyncClient + +load_dotenv() + +async def main(): + client = AsyncClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), url="address.of.container:PORT/v2") + result = await client.transcribe("audio.wav",parallel_engines=2, user_id="MY_USER_ID") + print(result.transcript_text) + await client.close() + +asyncio.run(main()) +``` + +## Job specific endpoints + +/v2/jobs + +args: created_before: string in ISO 8601 format, only returns jobs created before this time +limit: maximum number of jobs to return, can be between 1 and 100 + +returns: list of jobs + +/v2/jobs/{job_id}/transcript + +args: job_id and format of the transcript. Options for the transcript currently are : "json", "txt", "srt" (we might need to add an ALL option). Maybe we can return all due to the nature of the http requests, but all formats are probably saved already locally?(todo find out) + +Returns the transcript for a specific job if it has finished, the format is a valid choice, and the job_id exists. 
+ +if the job_id doesn’t exist returns an HTTPException with 404. + +if the job hasn’t finished, returns a 404, and includes the status and request_id. + +if the format is not in our included list we return a 404 with error = unsupported format + + +/v2/jobs/{job_id} + +returns job status, including job_id and request_id + + +/v2/jobs/{job_id}/log + +returns the logs for the specific job + + +## Health service + +The container is able to expose an HTTP Health Service, which offers startup, liveness, readiness, and session listing probes. This is accessible from port 8001, and has four endpoints, `started`, `live`, `ready` and `session_status`. This may be especially helpful if you are deploying the container into a Kubernetes cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around +[liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). + +The Health Service is enabled by default and runs as a subprocess of the main entrypoint to the container. + +### Endpoints + +The Health Service offers four endpoints: + +#### `/sessions` + + + f"{js.request_id},{js.requested_parallel}" for js in self._jobs_status.values() if js.is_decoding + +Possible responses: + +- `200` if all of the services in the container have successfully started. + +A JSON object is also returned in the body of the response, indicating the status. + +Example: + +```bash-and-response +$ curl -i address.of.container:PORT/sessions +HTTP/1.0 200 OK +Server: BaseHTTP/0.6 Python/3.8.5 +Date: Mon, 08 Feb 2021 12:46:21 GMT +Content-Type: application/json +{ + "started": true +} +``` + +#### `/live` + +This endpoint provides a liveness probe. It can be queried using an HTTP GET request. You must include the relevant port, 8001, in the request. + +This probe indicates whether all services in the Container are active. 
The services in the Container send regular updates to the Health Service, if they don't send an update for more than 10 seconds then they will be marked as 'dead' and this endpoint will return an unsuccessful response code. For example, if the WebSocket server in the Container were to crash, this endpoint should indicate that. + +Possible responses: + +- `200` if all of the services in the Container have successfully started, and have recently sent an update to the Health Service. +- `503` otherwise. + +A JSON object is also returned in the body of the response, indicating the status. + +Example: + +```bash-and-response +$ curl -i address.of.container:PORT/live +HTTP/1.0 200 OK +Server: BaseHTTP/0.6 Python/3.8.5 +Date: Mon, 08 Feb 2021 12:46:45 GMT +Content-Type: application/json +{ + "live": true +} +``` + +#### `/ready` + +This endpoint provides a readiness probe. It can be queried using an HTTP GET request. + +The container has been designed to process multiple audio streams at a time. This probe indicates whether the container has a slot free for connections, and can be used as a scaling mechanism. + +**Note**: The readiness check is accurate within a 2 second resolution. If you do use this probe for load balancing, be aware that bursts of traffic within that 2 second window could all be allocated to a single Container since its readiness state will not change. +return {"ready": True, "engines_used": self.engines_used} +Possible responses: + +- `200` if the container has a free connection slot. +- `503` otherwise. + +In the body of the response there is also a JSON object with the current status. 
+ +Example: + +```bash-and-response +$ curl -i address.of.container:PORT/ready +HTTP/1.0 200 OK +Server: BaseHTTP/0.6 Python/3.8.5 +Date: Mon, 08 Feb 2021 12:47:05 GMT +Content-Type: application/json +{ + "ready": true, + "engines_used": 2 +} +``` + ## Realtime transcription The Realtime container provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file. From a32e122a7db020a12407eaaff9b03ecb442051ad Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Wed, 18 Mar 2026 17:34:03 +0000 Subject: [PATCH 02/35] small fixes to be able to render --- docs/deployments/container/cpu-speech-to-text.mdx | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 53aa8531..f517b28d 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -263,10 +263,10 @@ With curl: -F 'data_file=@~/audio_file.mp3' ``` Returns: - on success:json string containing job id: {"job_id": "abcdefgh01"} and HTTP status code 201 - on failure: technically it raises but the exception is translated to HTTP status code != 200: - HTTP status code 503 for server busy - HTTP status code 400 for invalid request +on success: json string containing job id: `{"job_id": "abcdefgh01"}` and HTTP status code 201 +on failure: technically it raises but the exception is translated to HTTP status code != 200: + HTTP status code 503 for server busy + HTTP status code 400 for invalid request with [python sdk](https://github.com/speechmatics/speechmatics-python-sdk?tab=readme-ov-file#batch-transcription): ``` @@ -386,7 +386,7 @@ This endpoint provides a readiness probe. It can be queried using an HTTP GET re The container has been designed to process multiple audio streams at a time. 
This probe indicates whether the container has a slot free for connections, and can be used as a scaling mechanism. **Note**: The readiness check is accurate within a 2 second resolution. If you do use this probe for load balancing, be aware that bursts of traffic within that 2 second window could all be allocated to a single Container since its readiness state will not change. -return {"ready": True, "engines_used": self.engines_used} +return `{"ready": True, "engines_used": self.engines_used}` Possible responses: - `200` if the container has a free connection slot. From 54503716fc50961b08f9fda71867cdf4f902a7f5 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Wed, 18 Mar 2026 17:37:25 +0000 Subject: [PATCH 03/35] more fixes --- docs/deployments/container/cpu-speech-to-text.mdx | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index f517b28d..6a5085b4 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -295,7 +295,7 @@ limit: maximum number of jobs to return, can be between 1 and 100 returns: list of jobs -/v2/jobs/{job_id}/transcript +`/v2/jobs/{job_id}/transcript` args: job_id and format of the transcript. Options for the transcript currently are : "json", "txt", "srt" (we might need to add an ALL option). 
Maybe we can return all due to the nature of the http requests, but all formats are probably saved already locally?(todo find out) @@ -308,12 +308,12 @@ if the job hasn’t finished, returns a 404, and includes the status and request if the format is not in our included list we return a 404 with error = unsupported format -/v2/jobs/{job_id} +`/v2/jobs/{job_id}` returns job status, including job_id and request_id -/v2/jobs/{job_id}/log +`/v2/jobs/{job_id}/log` returns the logs for the specific job @@ -332,7 +332,9 @@ The Health Service offers four endpoints: #### `/sessions` - f"{js.request_id},{js.requested_parallel}" for js in self._jobs_status.values() if js.is_decoding +```python (TODO GH REMOVE) +f"{js.request_id},{js.requested_parallel}" for js in self._jobs_status.values() if js.is_decoding +``` Possible responses: From edadd98a0ebec11966a7353c9467e15459438aa4 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Wed, 18 Mar 2026 20:06:20 +0000 Subject: [PATCH 04/35] more improvements, adding actual results from the endpoints --- .../container/cpu-speech-to-text.mdx | 86 +++++++++++++++++-- 1 file changed, 77 insertions(+), 9 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 6a5085b4..ee87fde1 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -228,7 +228,7 @@ The following example shows how to use `--all-formats` parameter. In this scenar ## Batch persisted worker transcription Batch persisted workers (knows as http batch workers), are multi session capable persisted workers. They work utilizing an http server, which is able to -accept jobs through POST and by using the [V2 Batch REST API] (https://docs.speechmatics.com/api-ref/batch/create-a-new-job). 
This server was build to mimic exactly the V2 API capabilities and the whole life cycle +accept jobs through POST and by using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was build to mimic exactly the V2 API capabilities and the whole life cycle of posting a job, to checking the status of the jobs and asking for the transcript. @@ -286,14 +286,59 @@ async def main(): asyncio.run(main()) ``` -## Job specific endpoints +Regular lifecycle and how to set a job to use multiple engine, to reduce RTF -/v2/jobs +..add headers stuff .. + +explain what endpoint they need to check before starting a job. + +### Job specific endpoints + +`/v2/jobs` args: created_before: string in ISO 8601 format, only returns jobs created before this time limit: maximum number of jobs to return, can be between 1 and 100 returns: list of jobs +```json +{ + "jobs": [ + { + "id": "191f47e4a4204fa4ac2b", + "created_at": "2026-03-18T19:27:42.436Z", + "data_name": "5_min", + "text_name": null, + "duration": 300, + "status": "RUNNING", + "config": { + "type": "transcription", + "transcription_config": { + "language": "en", + "diarization": "speaker", + "operating_point": "standard" + } + } + }, + { + "id": "6dcb02e0dc5943e2b643", + "created_at": "2026-03-18T19:27:47.550Z", + "data_name": "5_min", + "text_name": null, + "duration": 300, + "status": "RUNNING", + "config": { + "type": "transcription", + "transcription_config": { + "language": "en", + "diarization": "speaker", + "operating_point": "standard" + } + } + } + ] +} +``` + `/v2/jobs/{job_id}/transcript` @@ -312,26 +357,46 @@ if the format is not in our included list we return a 404 with error = unsupport returns job status, including job_id and request_id +```json +{ + "job": { + "id": "191f47e4a4204fa4ac2b", + "created_at": "2026-03-18T19:27:42.436Z", + "data_name": "5_min", + "duration": 300, + "status": "DONE", + "config": { + "type": "transcription", + "transcription_config": { + 
"language": "en", + "diarization": "speaker", + "operating_point": "standard" + } + }, + "request_id": "191f47e4a4204fa4ac2b" + } +} +``` `/v2/jobs/{job_id}/log` returns the logs for the specific job -## Health service +### Health service -The container is able to expose an HTTP Health Service, which offers startup, liveness, readiness, and session listing probes. This is accessible from port 8001, and has four endpoints, `started`, `live`, `ready` and `session_status`. This may be especially helpful if you are deploying the container into a Kubernetes cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around +The container is able to expose an HTTP Health Service, which offers startup, liveness, readiness, and session listing probes. This is accessible from port 8001, and has four endpoints, `started`, `live`, `ready`.. This may be especially helpful if you are deploying the container into a Kubernetes cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around [liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). The Health Service is enabled by default and runs as a subprocess of the main entrypoint to the container. 
-### Endpoints +#### Endpoints -The Health Service offers four endpoints: +The Health Service offers three endpoints: #### `/sessions` - +(TODO GH REMOVE) ```python (TODO GH REMOVE) f"{js.request_id},{js.requested_parallel}" for js in self._jobs_status.values() if js.is_decoding ``` @@ -351,7 +416,10 @@ Server: BaseHTTP/0.6 Python/3.8.5 Date: Mon, 08 Feb 2021 12:46:21 GMT Content-Type: application/json { - "started": true + "request_ids": [ + "978174b1564e40ccacba,2", + "52d532a2efcb4b78962b,2" + ] } ``` From 3689576cf85501175f1cfae1bdd8ca92be6bccb0 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 20 Mar 2026 16:29:25 +0000 Subject: [PATCH 05/35] more improvements and full cycle documented --- .../container/cpu-speech-to-text.mdx | 111 ++++++++++++------ 1 file changed, 74 insertions(+), 37 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index ee87fde1..cb4f8492 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -227,9 +227,13 @@ The following example shows how to use `--all-formats` parameter. In this scenar ## Batch persisted worker transcription -Batch persisted workers (knows as http batch workers), are multi session capable persisted workers. They work utilizing an http server, which is able to -accept jobs through POST and by using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was build to mimic exactly the V2 API capabilities and the whole life cycle -of posting a job, to checking the status of the jobs and asking for the transcript. +This feature is available for onPrem containers only. + +Shall we mention the version which this is available too????? + +Batch persisted workers (known as http batch workers), are batch multi session capable persisted workers. 
They work utilizing an http server, which is able to
accept batch jobs through POST requests using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mimic the V2 API capabilities and the whole life cycle
of posting a job, checking the status of jobs and retrieving the transcript.

You can run the persisted worker with:

@@ -249,9 +253,13 @@ The parameters are:
- `parallel` - The number of parallel sessions you want this container to run (each session corresponds to one gpu connection). The more sessions, the higher the
 throughput you should be able to get (until you max out your gpu capacity). (Might be worth adding sizing recommendations here.)
- `all-formats` - This is similar to [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats).
 If this is not provided the default path that all jobs and logs will be saved to is `/tmp/jobs`.
+- `PORT` The port of your local environment you will forward to docker container's port.
+
+Do we need to say that they can set up the internal port via an env.variable as well?
+`SM_BATCH_WORKER_LISTEN_PORT` → env var controlling the port the API listens to

-To submit a job you can either use curl directly or using the python sdk.
+To submit a job you can either use curl directly or use the python sdk.
With curl:
```
 curl -X POST address.of.container:PORT/v2/jobs \
 -H 'X-SM-Processing-Data: {"parallel_engines":2, "user_id":"MY_USER_ID"}' \
 -F 'config={
 "type":"transcription",
 "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
 }' \
 -F 'data_file=@~/audio_file.mp3'
```

Returns:
```
on success: a json string containing the job id, e.g. `{"job_id": "abcdefgh01"}`, and HTTP status code 201
on failure: an HTTP status code != 200:
 HTTP status code 503 if the server is busy
 HTTP status code 400 for an invalid request
```

with the [python sdk](https://github.com/speechmatics/speechmatics-python-sdk?tab=readme-ov-file#batch-transcription):
```
import asyncio
import os

from dotenv import load_dotenv
from speechmatics.batch import AsyncClient

load_dotenv()

async def main():
    client = AsyncClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), url="address.of.container:PORT/v2")
    result = await client.transcribe("audio.wav", parallel_engines=2, user_id="MY_USER_ID")
    print(result.transcript_text)
    await client.close()

asyncio.run(main())
```

With the persisted batch worker you can submit multiple jobs on the same worker, given it has enough free capacity to process them.
You can find the free capacity left by querying the `/ready` endpoint outlined below. The response of this endpoint includes (`engines_used`) the total number of engines
currently used by running jobs. To calculate the number of free engines, subtract the engines currently in use from the number of parallel engines the worker was
started with (set using `--parallel=NUM`).

If a job requests more engines than are free, the job won't be accepted and the server will return a 503 with:

`HTTP 503: Service Unavailable - {"detail":"Server busy: 8 engines not available (2 engines in use, 5 parallel allowed)"}`

By requesting more engines in parallel for a job, you can improve that job's turnaround time.

To request multiple engines in parallel for a job, add a header called `X-SM-Processing-Data` to the POST request; its value is a json dictionary.
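Since the header value is just a JSON-encoded dictionary, building it can be sketched in Python as follows (the helper function name is ours, not part of any SDK; the keys are the ones documented on this page):

```python
import json

def processing_data_header(parallel_engines=None, user_id=None):
    """Build the X-SM-Processing-Data header value described above.

    Both keys are optional; only the ones supplied are included.
    """
    data = {}
    if parallel_engines is not None:
        data["parallel_engines"] = parallel_engines
    if user_id is not None:
        data["user_id"] = user_id
    # The header value is the JSON-serialized dictionary.
    return {"X-SM-Processing-Data": json.dumps(data)}

headers = processing_data_header(parallel_engines=2, user_id="MY_USER_ID")
print(headers["X-SM-Processing-Data"])
# → {"parallel_engines": 2, "user_id": "MY_USER_ID"}
```

These headers can then be passed to whatever HTTP client you use for the POST request.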
To specify the number of parallel engines you want, add to this header a dict with the key `parallel_engines` and the number of engines you want as the value.

For example with curl:
```
 curl -X POST address.of.container:PORT/v2/jobs \
 -H 'X-SM-Processing-Data: {"parallel_engines":2}' \
 -F 'config={
 "type":"transcription",
 "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
 }' \
 -F 'data_file=@~/audio_file.mp3'
```

To enable the [Speaker identification](/speech-to-text/features/speaker-identification) feature, use the same `X-SM-Processing-Data` header
and insert the key `user_id` with the id of the user/customer as the value.
```
 curl -X POST address.of.container:PORT/v2/jobs \
 -H 'X-SM-Processing-Data: {"user_id":"MY_USER_ID"}' \
 -F 'config={
 "type":"transcription",
 "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
 }' \
 -F 'data_file=@~/audio_file.mp3'
```

### Job API endpoints

`/v2/jobs`

@@ -315,7 +358,7 @@ returns: list of jobs
         "transcription_config": {
           "language": "en",
           "diarization": "speaker",
-          "operating_point": "standard"
+          "operating_point": "enhanced"
         }
       },
@@ -331,7 +374,7 @@ returns: list of jobs
         "transcription_config": {
           "language": "en",
           "diarization": "speaker",
-          "operating_point": "standard"
+          "operating_point": "enhanced"
         }
       }
@@ -342,7 +385,7 @@ returns: list of jobs

`/v2/jobs/{job_id}/transcript`

args: job_id and the format of the transcript. Options for the transcript format are currently: "json", "txt", "srt".

Returns the transcript for a specific job if it has finished, the format is a valid choice, and the job_id exists.
if the job_id doesn’t exist, an HTTPException with status 404 is returned.

if the job hasn’t finished, a 404 is returned, including the status and request_id.

if the format is not in the supported list, a 404 is returned with error = unsupported format.

`/v2/jobs/{job_id}`

returns job status, including job_id and request_id

        "transcription_config": {
          "language": "en",
          "diarization": "speaker",
-         "operating_point": "standard"
+         "operating_point": "enhanced"
        }
      },
      "request_id": "191f47e4a4204fa4ac2b"
    }
}
```

`/v2/jobs/{job_id}/log`

returns the logs for the specific job

### Health service

The container exposes an http Health Service, which offers liveness, readiness, and session listing probes. This is accessible from the same port
as job posting, and has 3 endpoints: `live`, `ready` and `sessions`. This may be especially helpful if you are deploying the container into a Kubernetes
cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around
[liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).

The Health Service is enabled by default and runs as a subprocess of the main entrypoint to the container.
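The free-engine arithmetic described earlier, which the readiness probe below supports, can be sketched as follows (the function names are ours; the numbers mirror the example 503 message above):

```python
# Capacity arithmetic for the batch worker as described above:
# the worker starts with --parallel=N engines, and /ready reports
# how many are currently in use as "engines_used".
def free_engines(parallel: int, engines_used: int) -> int:
    """Engines still available on the worker."""
    return parallel - engines_used

def job_fits(parallel: int, engines_used: int, requested: int) -> bool:
    """Whether a job asking for `requested` engines would be accepted."""
    return requested <= free_engines(parallel, engines_used)

# Mirrors the example above: 5 parallel allowed, 2 in use, 8 requested.
print(free_engines(5, 2))   # → 3
print(job_fits(5, 2, 8))    # → False: the server would answer 503
```

Checking this before POSTing a job avoids submitting work that the worker is guaranteed to reject.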
#### Endpoints

The Health Service offers three endpoints:

#### `/sessions`

This endpoint provides a list of the currently running jobs. It can be queried using an HTTP GET request.
It returns a list of the currently running jobs, where each entry is a comma-separated pair of the request_id and the number of parallel engines used by that job.

Example:

```bash-and-response
$ curl -i address.of.container:PORT/sessions
HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.5
Date: Mon, 08 Feb 2021 12:46:21 GMT
Content-Type: application/json
{
  "request_ids": [
    "978174b1564e40ccacba,2",
    "52d532a2efcb4b78962b,2"
  ]
}
```

#### `/live`

This endpoint provides a liveness probe. It can be queried using an HTTP GET request.

This probe indicates whether all services in the Container are active.

Possible responses:

- `200` if all of the services in the Container have successfully started, and have recently sent an update to the Health Service.

A JSON object is also returned in the body of the response, indicating the status.
Example:

```bash-and-response
$ curl -i address.of.container:PORT/live
HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.5
Date: Mon, 08 Feb 2021 12:46:45 GMT
Content-Type: application/json
{
  "live": true
}
```

#### `/ready`

This endpoint provides a readiness probe. It can be queried using an HTTP GET request.

The container has been designed to process multiple jobs concurrently. This probe indicates whether the container has at least one slot (one engine) free for connections, and can be used as a scaling mechanism.

Possible responses:

- `200` if the container has a free connection slot.
- `503` otherwise.

In the body of the response there is also a JSON object with the current status, and the total number of engines being used.

Example:

```bash-and-response
$ curl -i address.of.container:PORT/ready
HTTP/1.1 200 OK
Server: BaseHTTP/0.6 Python/3.8.5
Date: Mon, 08 Feb 2021 12:47:05 GMT
Content-Type: application/json
{
  "ready": true,
  "engines_used": 2
}
```

Environment variables:

`SM_BATCH_WORKER_MAX_JOB_HISTORY` - The maximum number of job records to keep in memory.

## Realtime transcription

The Realtime container provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file.
From 779a4578cd88dcbf2977daca677a6d2b4adba4d3 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 20 Mar 2026 16:31:10 +0000 Subject: [PATCH 06/35] add bold to unknown if we should include --- docs/deployments/container/cpu-speech-to-text.mdx | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index cb4f8492..6722ebb7 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -227,9 +227,9 @@ The following example shows how to use `--all-formats` parameter. In this scenar ## Batch persisted worker transcription -This feature is available for onPrem containers only. +**This feature is available for onPrem containers only.** -Shall we mention the version which this is available too????? +**Shall we mention the version which this is available too?????** Batch persisted workers (known as http batch workers), are batch multi session capable persisted workers. They work utilizing an http server, which is able to accept batch jobs through POST and by using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was build to mimic exactly the V2 API capabilities and the whole life cycle @@ -256,8 +256,8 @@ The parameters are: If this is not provided the default path that all jobs and logs will be saved to is `/tmp/jobs`. - `PORT` The port of your local environment you will forward to docker container's port. -Do we need to say that they can set up the internal port via an env.variable as well? -`SM_BATCH_WORKER_LISTEN_PORT` → env var controlling the port the API listens to +**Do we need to say that they can set up the internal port via an env.variable as well? +`SM_BATCH_WORKER_LISTEN_PORT` → env var controlling the port the API listens to** To submit a job you can either use curl directly or use the python sdk. 
With curl:
From 9666e13964735ef304a487ba44627ecb45701 Mon Sep 17 00:00:00 2001
From: Georgios Hadjiharalambous
Date: Fri, 20 Mar 2026 16:37:49 +0000
Subject: [PATCH 07/35] add more comments why to use batch worker

---
 docs/deployments/container/cpu-speech-to-text.mdx | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx
index 6722ebb7..14cd28ef 100644
--- a/docs/deployments/container/cpu-speech-to-text.mdx
+++ b/docs/deployments/container/cpu-speech-to-text.mdx
@@ -235,7 +235,12 @@ Batch persisted workers (known as http batch workers), are batch multi session c
 accept batch jobs through POST requests using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mimic the V2 API capabilities and the whole life cycle
 of posting a job, checking the status of jobs and retrieving the transcript.
 
+The main benefit of this worker over the normal batch worker is that you don't incur the cost of spinning up a worker for each file you want to transcribe.
+This reduces the turnaround time, especially for smaller files. Memory utilization is also reduced, as multiple jobs can now run in parallel in the same container
+and share its memory, removing the need to spin up multiple containers that each incur the same memory cost. The gpu is also better utilized: with no per-worker
+initial setup time, the gpu can be used uninterrupted.
+### How to run the worker and submit jobs to it You can run the persisted worker with: From f91732e700218d8ab6c831de28f85b71fa51abea Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Mon, 23 Mar 2026 09:11:58 +0000 Subject: [PATCH 08/35] persisted->persistent --- docs/deployments/container/cpu-speech-to-text.mdx | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 14cd28ef..25264fca 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -225,13 +225,13 @@ The following example shows how to use `--all-formats` parameter. In this scenar -## Batch persisted worker transcription +## Batch persistent worker transcription **This feature is available for onPrem containers only.** **Shall we mention the version which this is available too?????** -Batch persisted workers (known as http batch workers), are batch multi session capable persisted workers. They work utilizing an http server, which is able to +Batch persistent workers (known as http batch workers), are batch multi session capable persistent workers. They work utilizing an http server, which is able to accept batch jobs through POST using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mimic exactly the V2 API capabilities and the whole life cycle of posting a job, checking the status of the jobs and retrieving the transcript. @@ -241,7 +241,7 @@ multiple jobs running in parallel in the same container sharing the memory, and The gpu is also better utilized, as now we don't have initial setup times for the worker, and we are able to use the gpu uninterrupted.
### How to run the worker and submit jobs to it -You can run the persisted worker with: +You can run the persistent worker with: {`docker run -it \\ -e LICENSE_TOKEN=$TOKEN_VALUE \\ -p PORT:18000 \\ batch-asr-transcriber-en:${smVariables.latestContainerVersion} \\ --run-mode http \\ --parallel=4 \\ --all-formats /output_dir_name `} @@ -302,7 +302,7 @@ async def main(): asyncio.run(main()) ``` -With the persisted batch worker you have the capability to submit multiple jobs on the same worker given it has enough free capacity to process them. +With the persistent batch worker you have the capability to submit multiple jobs on the same worker given it has enough free capacity to process them. You can find the free capacity left by querying the `/ready` endpoint outlined below. The result of this endpoint includes the total number of engines currently being used by the running jobs (`engines_used`). To calculate the number of free engines, subtract the engines currently in use from the initial number of parallel engines the worker was started with (set using `--parallel=NUM`).
When the Container stops, the tmpfs mount is removed, and files written there won’t be persisted. +The Container still requires a temporary directory with write permissions. Users can provide a directory (e.g `/tmp`) by using the `--tmpfs` Docker argument. A tmpfs mount is temporary, and only persistent in the host memory. When the Container stops, the tmpfs mount is removed, and files written there won’t be persistent. If customers want to use the shared Custom Dictionary Cache feature, they must also specify the location of cache and mount it as a volume From f4cadeb3870110e0bf10ddb09505c30c401cd656 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 10 Apr 2026 14:59:04 +0100 Subject: [PATCH 09/35] change endpoint name to /jobs instead of /sessions --- .../container/cpu-speech-to-text.mdx | 28 +++++++++++++------ 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 25264fca..d416d02f 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -255,7 +255,7 @@ You can run the persistent worker with: The parameters are: -- `parallel` - The number of parallel sessions you want this container to have (Each session corresponds to one gpu connection). The more sessions the higher +- `parallel` - The number of parallel engines you want this container to have (Each session corresponds to one gpu connection). The more engines the higher throughput you should be able to get (until you max out your gpu capacity). (Might worth adding recommendations here? IDK). - `all-formats` This is similar to [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats). If this is not provided the default path that all jobs and logs will be saved to is `/tmp/jobs`. 
@@ -434,7 +434,7 @@ returns the logs for the specific job ### Health service The container exposes an http Health Service, which offers liveness, readiness, and session listing probes. This is accessible from the same port -as job posting, and has 3 endpoints, `live`, `ready` and `sessions`. This may be especially helpful if you are deploying the container into a Kubernetes +as job posting, and has 3 endpoints, `live`, `ready` and `jobs`. This may be especially helpful if you are deploying the container into a Kubernetes cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around [liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). @@ -442,24 +442,34 @@ cluster. If you are using Kubernetes, we recommend that you also refer to the Ku The Health Service offers three endpoints: -#### `/sessions` +#### `/jobs` This endpoint provides a list of the currently running jobs. It can be queried using an HTTP GET request. -Returns a list of the currently running jobs, which has a comma separate string of request_id and parallel_engines used for this job pair. +Returns a dictionary including the maximum number of engines this worker can use (`max_engines`), the number of free engines able to pick up work `unused_engines` +and a list of the currently running jobs, where each element includes the job id and the number of engines that job uses. The number of free engines `unused_engines` can be used +to determine the number of parallel engines you can request for the next job.
Example: ```bash-and-response -$ curl -i address.of.container:PORT/sessions +$ curl -i address.of.container:PORT/jobs HTTP/1.1 200 OK Server: BaseHTTP/0.6 Python/3.8.5 Date: Mon, 08 Feb 2021 12:46:21 GMT Content-Type: application/json { - "request_ids": [ - "978174b1564e40ccacba,2", - "52d532a2efcb4b78962b,2" - ] + "active_jobs": [ + { + "job_id": "f8a564954b334eecb823", + "parallel_engines": 1 + }, + { + "job_id": "29351ae8cf2c4e8694f0", + "parallel_engines": 1 + } + ], + "max_engines": 8, + "unused_engines": 6 } ``` From e970a0d615d02024f842c634076a7574bcd1af9b Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 10 Apr 2026 15:27:07 +0100 Subject: [PATCH 10/35] remove line 230, add availability and add SM_BATCH_WORKER_LISTEN_PORT --- docs/deployments/container/cpu-speech-to-text.mdx | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index d416d02f..8b151ef5 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -227,9 +227,8 @@ The following example shows how to use `--all-formats` parameter. In this scenar ## Batch persistent worker transcription -**This feature is available for onPrem containers only.** -**Shall we mention the version which this is available too?????** +**Available from version 15.5.0** Batch persistent workers (known as http batch workers), are batch multi session capable persistent workers. They work utilizing an http server, which is able to accept batch jobs through POST and by using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was build to mimic exactly the V2 API capabilities and the whole life cycle @@ -261,8 +260,7 @@ The parameters are: If this is not provided the default path that all jobs and logs will be saved to is `/tmp/jobs`. 
- `PORT` The port of your local environment you will forward to docker container's port. -**Do we need to say that they can set up the internal port via an env.variable as well? -`SM_BATCH_WORKER_LISTEN_PORT` → env var controlling the port the API listens to** +By default the persistent worker listens on port 18000. You can configure this to use a different port via the environment variable `SM_BATCH_WORKER_LISTEN_PORT`. To submit a job you can either use curl directly or use the python sdk. With curl: From ba9b559f00cce9f856f509900369b5d434b936fc Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 10 Apr 2026 15:30:38 +0100 Subject: [PATCH 11/35] small fix on code --- docs/deployments/container/cpu-speech-to-text.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 8b151ef5..bf53cd13 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -293,7 +293,7 @@ load_dotenv() async def main(): client = AsyncClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), url="address.of.container:PORT/v2") - result = await client.transcribe("audio.wav",parallel_engines=2, user_id="MY_USER_ID") + result = await client.transcribe("audio.wav", parallel_engines=2, user_id="MY_USER_ID") print(result.transcript_text) await client.close() From 456e2f8113283e0628b94a9022a40385f96b8d90 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 10 Apr 2026 15:34:43 +0100 Subject: [PATCH 12/35] remove leftover code --- docs/deployments/container/cpu-speech-to-text.mdx | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index bf53cd13..e7bf5ebc 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ 
-500,9 +500,8 @@ Content-Type: application/json This endpoint provides a readiness probe. It can be queried using an HTTP GET request. -The container has been designed to process multiple jobs cuncurrently. This probe indicates whether the container has one slot (one engine) free for connections, and can be used as a scaling mechanism. +The container has been designed to process multiple jobs concurrently. This probe indicates whether the container has one slot (one engine) free for connections, and can be used as a scaling mechanism. -return `{"ready": True, "engines_used": self.engines_used}` Possible responses: - `200` if the container has a free connection slot. From b5a3e083cbf997fdff7981a0cea365092646bee2 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 10 Apr 2026 15:35:32 +0100 Subject: [PATCH 13/35] add parenthesis to parameter name --- docs/deployments/container/cpu-speech-to-text.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index e7bf5ebc..8dcac189 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -443,7 +443,7 @@ The Health Service offers three endpoints: #### `/jobs` This endpoint provides a list of the currently running jobs. It can be queried using an HTTP GET request. -Returns a dictionary including the maximum number of engines this worker can use (`max_engines`), the number of free engines able to pick up work `unused_engines` +Returns a dictionary including the maximum number of engines this worker can use (`max_engines`), the number of free engines able to pick up work (`unused_engines`) and a list of the currently running jobs, where each element includes the job id and the number of engines that job uses.
The number of free engines `unused_engines` can be used to determine the number of parallel engines you can request for the next job. From 15299629963397894b10f4fae7fb1248b7f99d5f Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Mon, 13 Apr 2026 18:00:23 +0100 Subject: [PATCH 14/35] wrongly changed persisted --- docs/deployments/container/cpu-speech-to-text.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 8dcac189..d6477e0e 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -532,7 +532,7 @@ Environment variables: The Realtime container provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file. - Multiple instances of the container can be run on the same Docker host. This enables scaling of a single language or multiple languages as required -- All data is transitory, once a container completes its transcription it removes all record of the operation, no data is persistent +- All data is transitory, once a container completes its transcription it removes all record of the operation, no data is persisted Here's an example of how to start the Container from the command line: @@ -647,7 +647,7 @@ Users may wish to run the Container in read-only mode. This may be necessary due rt-asr-transcriber-en:${smVariables.latestContainerVersion}`} -The Container still requires a temporary directory with write permissions. Users can provide a directory (e.g `/tmp`) by using the `--tmpfs` Docker argument. A tmpfs mount is temporary, and only persistent in the host memory. When the Container stops, the tmpfs mount is removed, and files written there won’t be persistent. +The Container still requires a temporary directory with write permissions. 
Users can provide a directory (e.g `/tmp`) by using the `--tmpfs` Docker argument. A tmpfs mount is temporary, and only persisted in the host memory. When the Container stops, the tmpfs mount is removed, and files written there won’t be persisted. If customers want to use the shared Custom Dictionary Cache feature, they must also specify the location of cache and mount it as a volume From 19f2f6163068274a1fe473050632fb01b6d38cf0 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 17 Apr 2026 16:25:11 +0100 Subject: [PATCH 15/35] make clear that it works for both cpu and gpu --- docs/deployments/container/cpu-speech-to-text.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index d6477e0e..6997b051 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -230,7 +230,7 @@ The following example shows how to use `--all-formats` parameter. In this scenar **Available from version 15.5.0** -Batch persistent workers (known as http batch workers), are batch multi session capable persistent workers. They work utilizing an http server, which is able to +Batch persistent workers (known as http batch workers), are batch multi session capable persistent workers. They are able to run on both CPU and GPU, although using GPU yields better results.They work utilizing an http server, which is able to accept batch jobs through POST using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mimic exactly the V2 API capabilities and the whole life cycle of posting a job, checking the status of the jobs and retrieving the transcript.
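The job submission described in this series sends two JSON payloads: the `config` form field and the `X-SM-Processing-Data` header. The sketch below shows one way a client might assemble them before POSTing to `/v2/jobs`; `build_job_request` is a hypothetical helper (not part of any SDK), and the field names follow the curl examples used throughout these patches.

```python
import json

# Illustrative helper: build the X-SM-Processing-Data header and the
# `config` multipart form field for a POST /v2/jobs request.
# Field names follow the curl example in this series; the helper
# itself is an assumption, not part of any official SDK.
def build_job_request(language, parallel_engines, user_id=None):
    config = {
        "type": "transcription",
        "transcription_config": {
            "language": language,
            "diarization": "speaker",
            "operating_point": "enhanced",
        },
    }
    processing_data = {"parallel_engines": parallel_engines}
    if user_id is not None:
        # Optional: per-customer identifier for speaker identification.
        processing_data["user_id"] = user_id
    # Both values are serialized to JSON strings, matching the curl example.
    headers = {"X-SM-Processing-Data": json.dumps(processing_data)}
    form_fields = {"config": json.dumps(config)}
    return headers, form_fields

headers, form = build_job_request("en", parallel_engines=2, user_id="MY_USER_ID")
print(headers["X-SM-Processing-Data"])
```

The returned `headers` and `form_fields` could then be handed to any HTTP client (curl, `requests`, etc.) together with the `data_file` upload.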
From efbc120f1507d69b3f391314fdd467a31a9aab1a Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 17 Apr 2026 17:03:25 +0100 Subject: [PATCH 16/35] space --- docs/deployments/container/cpu-speech-to-text.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 6997b051..c8680161 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -230,7 +230,7 @@ The following example shows how to use `--all-formats` parameter. In this scenar **Available from version 15.5.0** -Batch persistent workers (known as http batch workers), are batch multi session capable persistent workers. They are able to run on both CPU and GPU, although using GPU yields better results.They work utilizing an http server, which is able to +Batch persistent workers (known as http batch workers), are batch multi session capable persistent workers. They are able to run on both CPU and GPU, although using GPU yields better results. They work utilizing an http server, which is able to accept batch jobs through POST using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mimic exactly the V2 API capabilities and the whole life cycle of posting a job, checking the status of the jobs and retrieving the transcript.
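The `/jobs` health endpoint introduced in this series reports `max_engines`, `unused_engines`, and the list of active jobs. As a hedged sketch (the `engines_for_next_job` function is illustrative, not part of any SDK), a client could use that response to size the `parallel_engines` request for its next job:

```python
import json

# Illustrative sketch: decide how many engines to request for the next job
# from the /jobs health response. The response shape (active_jobs,
# max_engines, unused_engines) follows the example output in this series;
# the function name is an assumption of this sketch.
def engines_for_next_job(jobs_response, requested):
    free = jobs_response.get("unused_engines")
    if free is None:
        # Fall back to deriving free capacity from the active job list.
        used = sum(job["parallel_engines"] for job in jobs_response.get("active_jobs", []))
        free = jobs_response["max_engines"] - used
    # Request at most what is free; 0 means wait and retry later.
    return min(requested, free)

response = json.loads("""
{
  "active_jobs": [
    {"job_id": "f8a564954b334eecb823", "parallel_engines": 1},
    {"job_id": "29351ae8cf2c4e8694f0", "parallel_engines": 1}
  ],
  "max_engines": 8,
  "unused_engines": 6
}
""")
print(engines_for_next_job(response, requested=8))
```

Capping the request this way avoids the `503` "server busy" rejection described for jobs that ask for more engines than are available.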
From 9d1ae9f2e75ad69a138ebd7a533f315bd4838a77 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 17 Apr 2026 17:03:58 +0100 Subject: [PATCH 17/35] singular --- docs/deployments/container/cpu-speech-to-text.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index c8680161..06f6efdf 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -232,7 +232,7 @@ The following example shows how to use `--all-formats` parameter. In this scenar Batch persistent workers (known as http batch workers), are batch multi session capable persistent workers. They are able to run on both CPU and GPU, although using GPU yields better results. They work utilizing an http server, which is able to accept batch jobs through POST using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mimic exactly the V2 API capabilities and the whole life cycle -of posting a job, checking the status of the jobs and retrieving for the transcript. +of posting a job, checking the status of the job and retrieving the transcript. The main benefit of this worker vs normal batch is that you don't incur the cost of spinning up the worker for each file you want to transcribe. This has the benefit of reducing the turnaround time, especially for smaller files.
The memory utilization is reduced as now you can have From 47f073dcfcc8e546c7c3d742eb9a6ac2e57a2469 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 17 Apr 2026 17:04:53 +0100 Subject: [PATCH 18/35] better wording --- docs/deployments/container/cpu-speech-to-text.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 06f6efdf..bc8edcb7 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -236,7 +236,7 @@ of posting a job, checking the status of the job and retrieving for the transcri The main benefit of this worker vs normal batch is that you don't incur the cost of spinning up the worker for each you want to transcribe. This has the benefit of reduding the turnaround time, especially for smaller files. The memory utilization is reduced as now you can have -multiple jobs running in parallel in the same container sharing the memory, and remove the need to spin up mulitple container incuring the same memory cost as many times. +multiple jobs running in parallel in the same container sharing the memory, removing the need to spin up mulitple container incuring the same memory cost as many times. Better utilizing the gpu as now we don't have initial setup times for the worker, and we are able to use the gpu uninterrupted. 
### How to run the worker and submit jobs to it From ff74151f476efad8f889634e9af244779218ee92 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 17 Apr 2026 18:14:43 +0100 Subject: [PATCH 19/35] refactor wording around speaker id and optionally adding the user id in the headers --- docs/deployments/container/cpu-speech-to-text.mdx | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index bc8edcb7..ce797969 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -326,8 +326,9 @@ -F 'data_file=@~/audio_file.mp3' ``` -To enable the [Speaker identification](/speech-to-text/features/speaker-identification) feature using the same header as above `X-SM-Processing-Data` -insert as a key `user_id`, and value the id of the user/customer. +To enable the [Speaker identification](/speech-to-text/features/speaker-identification) feature, the same logic around secrets [outlined here](/deployments/container/speaker-identification) applies. +If there is a need to have encrypted identifiers per customer, similar to how our [SaaS](/speech-to-text/features/speaker-identification) handles this, you can use the same header as above, `X-SM-Processing-Data`, +and insert the key `user_id` with the id of the user/customer as its value for each job request; this step is optional.
``` curl -X POST address.of.container:PORT/v2/jobs \ -H 'X-SM-Processing-Data: {"user_id":"MY_USER_ID"}' \ From 8f27a4135a4d867b454c8e22fbdd27d2ec27ce29 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Mon, 20 Apr 2026 10:24:25 +0100 Subject: [PATCH 20/35] use the version that is supported from onwards on the example, and not the latest released --- docs/deployments/container/cpu-speech-to-text.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index ce797969..4aab3be4 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -246,7 +246,7 @@ You can run the persistent worker with: {`docker run -it \\ -e LICENSE_TOKEN=$TOKEN_VALUE \\ -p PORT:18000 \\ - batch-asr-transcriber-en:${smVariables.latestContainerVersion} \\ + batch-asr-transcriber-en:15.5.0 \\ --run-mode http \\ --parallel=4 \\ --all-formats /output_dir_name From 4c25b15955cfb2ddef635d5f8a0899d7767c82b9 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 10:56:04 +0100 Subject: [PATCH 21/35] add batch persistent worker to its own page --- .../container/batch-persistent-worker.mdx | 335 ++++++++++++++++++ 1 file changed, 335 insertions(+) create mode 100644 docs/deployments/container/batch-persistent-worker.mdx diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx new file mode 100644 index 00000000..5609aee4 --- /dev/null +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -0,0 +1,335 @@ +--- +id: batch-persistent-worker +title: Batch Persistent Worker +sidebar_label: Batch Persistent Worker +description: Run a long-lived HTTP transcription worker that accepts multiple jobs without restarting, reducing turnaround time and improving GPU utilisation. 
+--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Batch Persistent Worker + +:::note +Available from version 15.5.0 +::: + +A batch persistent worker (also called an **HTTP batch worker**) is a long-running transcription service that accepts jobs over HTTP. Unlike standard batch workers, it stays alive between jobs — meaning you pay the startup cost once, not on every request. + +--- + +## Why use a persistent worker? + +| | Standard batch | Persistent worker | +|---|---|---| +| Startup cost | Per job | Once | +| Memory usage | One container per job | Multiple jobs share one container | +| GPU utilisation | Interrupted between jobs | Continuous | +| Best for | Large, infrequent files | High throughput or smaller files | + +The persistent worker is especially beneficial for smaller audio files, where startup overhead would otherwise dominate total turnaround time. + +--- + +## Starting the worker + +```bash +docker run -it \ + -e LICENSE_TOKEN=$TOKEN_VALUE \ + -p PORT:18000 \ + batch-asr-transcriber-en:15.5.0 \ + --run-mode http \ + --parallel=4 \ + --all-formats /output_dir_name +``` + +### Parameters + +| Parameter | Description | +|---|---| +| `--parallel` | Number of parallel engines (each engine maps to one GPU connection). Increase this to improve throughput, up to your GPU's capacity. | +| `--all-formats` | Directory where all job outputs and logs are saved. If omitted, defaults to `/tmp/jobs`. See [Generating multiple transcript formats](#) for details. | +| `PORT` | The local port forwarded to the container's internal port (`18000`). | + +:::tip +To use a different internal port, set the `SM_BATCH_WORKER_LISTEN_PORT` environment variable. 
+::: + +--- + +## Submitting a job + + + + +```bash +curl -X POST address.of.container:PORT/v2/jobs \ + -H 'X-SM-Processing-Data: {"parallel_engines": 2, "user_id": "MY_USER_ID"}' \ + -F 'config={ + "type": "transcription", + "transcription_config": { + "language": "en", + "diarization": "speaker", + "operating_point": "enhanced" + } + }' \ + -F 'data_file=@~/audio_file.mp3' +``` + + + + +```python +import asyncio +import os +from dotenv import load_dotenv +from speechmatics.batch import AsyncClient + +load_dotenv() + +async def main(): + client = AsyncClient( + api_key=os.getenv("SPEECHMATICS_API_KEY"), + url="address.of.container:PORT/v2" + ) + result = await client.transcribe( + "audio.wav", + parallel_engines=2, + user_id="MY_USER_ID" + ) + print(result.transcript_text) + await client.close() + +asyncio.run(main()) +``` + + + + +### Response codes + +| Code | Meaning | +|---|---| +| `201` | Job accepted. Returns `{"job_id": "abcdefgh01"}` | +| `400` | Invalid request | +| `503` | Server busy — not enough free engines | + +--- + +## Managing capacity + +The worker processes multiple jobs concurrently, up to the `--parallel` limit you set at startup. + +Each job can request multiple engines using the `parallel_engines` value in the `X-SM-Processing-Data` header. More engines per job means faster turnaround for that job, at the cost of reduced concurrency for others. + +To check available capacity before submitting, query the [`/jobs` health endpoint](#get-jobs). The `unused_engines` field tells you how many engines are free. 
+ +:::warning +If a job requests more engines than are currently available, it will be rejected: + +``` +HTTP 503: {"detail": "Server busy: 8 engines not available (2 engines in use, 5 parallel allowed)"} +``` +::: + +### Requesting parallel engines + +```bash +curl -X POST address.of.container:PORT/v2/jobs \ + -H 'X-SM-Processing-Data: {"parallel_engines": 2}' \ + -F 'config={"type": "transcription", "transcription_config": {"language": "en"}}' \ + -F 'data_file=@~/audio_file.mp3' +``` + +--- + +## Speaker identification + +To enable the [Speaker identification](/speech-to-text/features/speaker-identification) feature you can use the same logic used for the one shot batch container. +To enable per-customer encrypted identifiers (as used in our SaaS offering), pass a `user_id` in the `X-SM-Processing-Data` header. + +```bash +curl -X POST address.of.container:PORT/v2/jobs \ + -H 'X-SM-Processing-Data: {"user_id": "MY_USER_ID"}' \ + -F 'config={ + "type": "transcription", + "transcription_config": { + "language": "en", + "diarization": "speaker", + "operating_point": "enhanced" + } + }' \ + -F 'data_file=@~/audio_file.mp3' +``` + +:::info +For details on secrets management, refer to the [Speaker identification documentation](/deployments/container/speaker-identification). +::: + +--- + +## Job API reference + +### `GET /v2/jobs` + +Returns a list of jobs. + +**Query parameters:** + +| Parameter | Description | +|---|---| +| `created_before` | ISO 8601 datetime. Only return jobs created before this time. | +| `limit` | Max number of jobs to return (1–100). 
| + +**Example response:** + +```json +{ + "jobs": [ + { + "id": "191f47e4a4204fa4ac2b", + "created_at": "2026-03-18T19:27:42.436Z", + "data_name": "5_min", + "text_name": null, + "duration": 300, + "status": "RUNNING", + "config": { + "type": "transcription", + "transcription_config": { + "language": "en", + "diarization": "speaker", + "operating_point": "enhanced" + } + } + }, + { + "id": "6dcb02e0dc5943e2b643", + "created_at": "2026-03-18T19:27:47.550Z", + "data_name": "5_min", + "text_name": null, + "duration": 300, + "status": "RUNNING", + "config": { + "type": "transcription", + "transcription_config": { + "language": "en", + "diarization": "speaker", + "operating_point": "enhanced" + } + } + } + ] +} +``` + +--- + +### `GET /v2/jobs/{job_id}` + +Returns the status of a specific job. + +**Example response:** + +```json +{ + "job": { + "id": "191f47e4a4204fa4ac2b", + "created_at": "2026-03-18T19:27:42.436Z", + "data_name": "5_min", + "duration": 300, + "status": "DONE", + "config": { + "type": "transcription", + "transcription_config": { + "language": "en", + "diarization": "speaker", + "operating_point": "enhanced" + } + }, + "request_id": "191f47e4a4204fa4ac2b" + } +} +``` + +--- + +### `GET /v2/jobs/{job_id}/transcript` + +Returns the transcript for a completed job. + +**Query parameters:** + +| Parameter | Options | +|---|---| +| `format` | `json`, `txt`, `srt` | + +**Error responses:** + +| Code | Reason | +|---|---| +| `404` | Job not found, job not yet complete (includes current status), or unsupported format | + +--- + +### `GET /v2/jobs/{job_id}/log` + +Returns the processing logs for a specific job. + +--- + +## Health endpoints + +The worker exposes three health endpoints on the same port as job submission. + +:::tip Kubernetes users +These endpoints are designed to work as [liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) in a Kubernetes cluster. 
+::: + +### `GET /jobs` + +Returns current engine usage and a list of active jobs. Use `unused_engines` to determine how many engines you can request for the next job. + +**Example response:** + +```json +{ + "active_jobs": [ + { "job_id": "f8a564954b334eecb823", "parallel_engines": 1 }, + { "job_id": "29351ae8cf2c4e8694f0", "parallel_engines": 1 } + ], + "max_engines": 8, + "unused_engines": 6 +} +``` + +--- + +### `GET /live` + +Liveness probe. Returns `200` when all container services are running and healthy. + +```json +{ "live": true } +``` + +--- + +### `GET /ready` + +Readiness probe. Returns `200` when at least one engine slot is free, `503` when all engines are occupied. + +```json +{ + "ready": true, + "engines_used": 2 +} +``` + +--- + +## Environment variables + +| Variable | Description | +|---|---| +| `SM_BATCH_WORKER_LISTEN_PORT` | Override the default internal port (`18000`). | +| `SM_BATCH_WORKER_MAX_JOB_HISTORY` | Maximum number of completed job records to retain in memory. | From ecc16b97c9d63accdde0eabe17f4c08659ca0c12 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 10:56:38 +0100 Subject: [PATCH 22/35] remove batch persistent notes from cpu page --- .../container/cpu-speech-to-text.mdx | 303 ------------------ 1 file changed, 303 deletions(-) diff --git a/docs/deployments/container/cpu-speech-to-text.mdx b/docs/deployments/container/cpu-speech-to-text.mdx index 4aab3be4..60b63656 100644 --- a/docs/deployments/container/cpu-speech-to-text.mdx +++ b/docs/deployments/container/cpu-speech-to-text.mdx @@ -225,309 +225,6 @@ The following example shows how to use `--all-formats` parameter. In this scenar -## Batch persistent worker transcription - - -**Available from version 15.5.0** - -Batch persistent workers (known as http batch workers), are batch multi session capable persistent workers. They are able to run on both CPU and GPU, although using GPU yields better results. 
They work utilizing an HTTP server, which is able to
accept batch jobs through POST requests using the [V2 Batch REST API](https://docs.speechmatics.com/api-ref/batch/create-a-new-job). This server was built to mimic exactly the V2 API capabilities and the whole life cycle
of posting a job, checking the status of the job and retrieving the transcript.

The main benefit of this worker over normal batch is that you don't incur the cost of spinning up the worker for each file you want to transcribe.
This reduces the turnaround time, especially for smaller files. Memory utilization is also reduced, as you can now have
multiple jobs running in parallel in the same container sharing the memory, removing the need to spin up multiple containers that each incur the same memory cost.
GPU utilization also improves: there is no per-job setup time for the worker, so the GPU can be used uninterrupted.

### How to run the worker and submit jobs to it
You can run the persistent worker with:


  {`docker run -it \\
  -e LICENSE_TOKEN=$TOKEN_VALUE \\
  -p PORT:18000 \\
  batch-asr-transcriber-en:15.5.0 \\
  --run-mode http \\
  --parallel=4 \\
  --all-formats /output_dir_name
`}


The parameters are:
- `parallel` - The number of parallel engines you want this container to have (each engine corresponds to one GPU connection). The more engines, the higher the
  throughput you should be able to get (until you max out your GPU capacity).
- `all-formats` - This is similar to [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats).
  If this is not provided, the default path that all jobs and logs are saved to is `/tmp/jobs`.
- `PORT` - The port of your local environment that you forward to the docker container's port.

By default the persistent worker listens on port 18000.
You can configure this to use a different port via the environment variable `SM_BATCH_WORKER_LISTEN_PORT`.

To submit a job you can either use curl directly or use the python sdk.
With curl:
```bash
curl -X POST address.of.container:PORT/v2/jobs \
  -H 'X-SM-Processing-Data: {"parallel_engines":2, "user_id":"MY_USER_ID"}' \
  -F 'config={
    "type":"transcription",
    "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
  }' \
  -F 'data_file=@~/audio_file.mp3'
```

Returns:
```
on success: json string containing the job id: `{"job_id": "abcdefgh01"}` and HTTP status code 201
on failure: returns an HTTP status code != 200:
    HTTP status code 503 for server busy
    HTTP status code 400 for invalid request
```

with the [python sdk](https://github.com/speechmatics/speechmatics-python-sdk?tab=readme-ov-file#batch-transcription):
```python
import asyncio
import os
from dotenv import load_dotenv
from speechmatics.batch import AsyncClient

load_dotenv()

async def main():
    client = AsyncClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), url="address.of.container:PORT/v2")
    result = await client.transcribe("audio.wav", parallel_engines=2, user_id="MY_USER_ID")
    print(result.transcript_text)
    await client.close()

asyncio.run(main())
```

With the persistent batch worker you can submit multiple jobs to the same worker, provided it has enough free capacity to process them.
You can determine the free capacity left by querying the `/ready` endpoint outlined below. The result of this endpoint includes the total number of engines currently used by running jobs (`engines_used`). To calculate the number of free engines, subtract the engines currently in use from the number of parallel engines the worker was started with (set using `--parallel=NUM`).
If as part of a job you request more engines than are free, the job won't be accepted and will return a 503 with:

`HTTP 503: Service Unavailable - {"detail":"Server busy: 8 engines not available (2 engines in use, 5 parallel allowed)"}`

By requesting more engines in parallel for a job, you are able to improve the turnaround time for that job.

To request multiple engines in parallel for a job, add a header to the POST request called `X-SM-Processing-Data`, which receives a JSON dictionary as input.
To specify the number of parallel engines you want, add to this header a dict with the key `parallel_engines` and the number of engines you want as the value.

For example with curl:
```bash
curl -X POST address.of.container:PORT/v2/jobs \
  -H 'X-SM-Processing-Data: {"parallel_engines":2}' \
  -F 'config={
    "type":"transcription",
    "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"}
  }' \
  -F 'data_file=@~/audio_file.mp3'
```

To enable the [Speaker identification](/speech-to-text/features/speaker-identification) feature, the same logic around secrets [outlined here](/deployments/container/speaker-identification) applies.
If you need encrypted identifiers per customer, similar to how our [SaaS](/speech-to-text/features/speaker-identification) handles this, you can use the same `X-SM-Processing-Data` header
and insert the key `user_id` with the id of the user/customer as the value for each job request, but this is an optional step.
-``` - curl -X POST address.of.container:PORT/v2/jobs \ - -H 'X-SM-Processing-Data: {"user_id":"MY_USER_ID"}' \ - -F 'config={ - "type":"transcription", - "transcription_config":{"language":"en","diarization":"speaker","operating_point":"enhanced"} - }' \ - -F 'data_file=@~/audio_file.mp3' -``` - -### Job API endpoints - -`/v2/jobs` - -args: created_before: string in ISO 8601 format, only returns jobs created before this time -limit: maximum number of jobs to return, can be between 1 and 100 - -returns: list of jobs -```json -{ - "jobs": [ - { - "id": "191f47e4a4204fa4ac2b", - "created_at": "2026-03-18T19:27:42.436Z", - "data_name": "5_min", - "text_name": null, - "duration": 300, - "status": "RUNNING", - "config": { - "type": "transcription", - "transcription_config": { - "language": "en", - "diarization": "speaker", - "operating_point": "enhanced" - } - } - }, - { - "id": "6dcb02e0dc5943e2b643", - "created_at": "2026-03-18T19:27:47.550Z", - "data_name": "5_min", - "text_name": null, - "duration": 300, - "status": "RUNNING", - "config": { - "type": "transcription", - "transcription_config": { - "language": "en", - "diarization": "speaker", - "operating_point": "enhanced" - } - } - } - ] -} -``` - - -`/v2/jobs/{job_id}/transcript` - -args: job_id and format of the transcript. Options for the format transcript currently are : "json", "txt", "srt". - -Returns the transcript for a specific job if it has finished, the format is a valid choice, and the job_id exists. - -if the job_id doesn’t exist returns an HTTPException with 404. - -if the job hasn’t finished, returns a 404, and includes the status and request_id. - -if the format is not in our included list we return a 404 with error = unsupported format. 
- - -`/v2/jobs/{job_id}` - -returns job status, including job_id and request_id - -```json -{ - "job": { - "id": "191f47e4a4204fa4ac2b", - "created_at": "2026-03-18T19:27:42.436Z", - "data_name": "5_min", - "duration": 300, - "status": "DONE", - "config": { - "type": "transcription", - "transcription_config": { - "language": "en", - "diarization": "speaker", - "operating_point": "enhanced" - } - }, - "request_id": "191f47e4a4204fa4ac2b" - } -} -``` - -`/v2/jobs/{job_id}/log` - -returns the logs for the specific job - - -### Health service - -The container is exposes an http Health Service, which offers liveness, readiness, and session listing probes. This is accessible from the same port -as job posting, and has 3 endpoints, `live`, `ready` and `jobs`. This may be especially helpful if you are deploying the container into a Kubernetes -cluster. If you are using Kubernetes, we recommend that you also refer to the Kubernetes documentation around -[liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). - -#### Endpoints - -The Health Service offers three endpoints: - -#### `/jobs` - -This endpoint provides a list of the currently running jobs. It can be queried using an HTTP GET request. -Returns a dictionary including the maximum number of engines this worker can use (`max_engines`), the number of free engines able to pick up work (`unused_engines`) -and a list of the currently running jobs, with element includes the job id and the number of engines that job uses. The number of free engines `unused_engines` can be used -to determine the number of parallel engines you can request for the next job. 
- -Example: - -```bash-and-response -$ curl -i address.of.container:PORT/jobs -HTTP/1.1 200 OK -Server: BaseHTTP/0.6 Python/3.8.5 -Date: Mon, 08 Feb 2021 12:46:21 GMT -Content-Type: application/json -{ - "active_jobs": [ - { - "job_id": "f8a564954b334eecb823", - "parallel_engines": 1 - }, - { - "job_id": "29351ae8cf2c4e8694f0", - "parallel_engines": 1 - } - ], - "max_engines": 8, - "unused_engines": 6 -} -``` - -#### `/live` - -This endpoint provides a liveness probe. It can be queried using an HTTP GET request. - -This probe indicates whether all services in the Container are active. - -Possible responses: - -- `200` if all of the services in the Container have successfully started, and have recently sent an update to the Health Service. - -A JSON object is also returned in the body of the response, indicating the status. - -Example: - -```bash-and-response -$ curl -i address.of.container:PORT/live -HTTP/1.1 200 OK -Server: BaseHTTP/0.6 Python/3.8.5 -Date: Mon, 08 Feb 2021 12:46:45 GMT -Content-Type: application/json -{ - "live": true -} -``` - -#### `/ready` - -This endpoint provides a readiness probe. It can be queried using an HTTP GET request. - -The container has been designed to process multiple jobs concurrently. This probe indicates whether the container has one slot (one engine) free for connections, and can be used as a scaling mechanism. - -Possible responses: - -- `200` if the container has a free connection slot. -- `503` otherwise. - -In the body of the response there is also a JSON object with the current status, and the total number of engines being used. 
- -Example: - -```bash-and-response -$ curl -i address.of.container:PORT/ready -HTTP/1.1 200 OK -Server: BaseHTTP/0.6 Python/3.8.5 -Date: Mon, 08 Feb 2021 12:47:05 GMT -Content-Type: application/json -{ - "ready": true, - "engines_used": 2 -} -``` - -Environment variables: - -`SM_BATCH_WORKER_MAX_JOB_HISTORY` : This is the maximum number of job records to keep in memory - ## Realtime transcription The Realtime container provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file. From 2e129ce2ed9fc8b59406b1ffe72db7834d446f8e Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 11:58:20 +0100 Subject: [PATCH 23/35] correct version and Batch Persistent Worker -> Batch persistent worker --- docs/deployments/container/batch-persistent-worker.mdx | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 5609aee4..6920a91e 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -1,17 +1,17 @@ --- id: batch-persistent-worker -title: Batch Persistent Worker -sidebar_label: Batch Persistent Worker +title: Batch persistent worker +sidebar_label: Batch persistent worker description: Run a long-lived HTTP transcription worker that accepts multiple jobs without restarting, reducing turnaround time and improving GPU utilisation. --- import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; +import TabItem from '@theme/TabItem' -# Batch Persistent Worker +# Batch persistent worker :::note -Available from version 15.5.0 +Available from version 15.7.0 ::: A batch persistent worker (also called an **HTTP batch worker**) is a long-running transcription service that accepts jobs over HTTP. 
Unlike standard batch workers, it stays alive between jobs — meaning you pay the startup cost once, not on every request. From 687182a114a826aaaf9bc46326d1021a1d789efe Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 12:00:49 +0100 Subject: [PATCH 24/35] batch-asr-transcriber-en:${smVariables.latestContainerVersion} --- docs/deployments/container/batch-persistent-worker.mdx | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 6920a91e..57a02b72 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -7,6 +7,8 @@ description: Run a long-lived HTTP transcription worker that accepts multiple jo import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem' +import { smVariables } from "/sm-variables"; + # Batch persistent worker @@ -37,7 +39,7 @@ The persistent worker is especially beneficial for smaller audio files, where st docker run -it \ -e LICENSE_TOKEN=$TOKEN_VALUE \ -p PORT:18000 \ - batch-asr-transcriber-en:15.5.0 \ + batch-asr-transcriber-en:${smVariables.latestContainerVersion} \ --run-mode http \ --parallel=4 \ --all-formats /output_dir_name From dae38d6a2aff4190ae1fb0c64aa04086eb48996a Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 12:02:19 +0100 Subject: [PATCH 25/35] add correct link for transcript formats --- docs/deployments/container/batch-persistent-worker.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 57a02b72..1920c07c 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -50,7 +50,7 @@ docker run -it \ | Parameter | Description | |---|---| | `--parallel` | 
Number of parallel engines (each engine maps to one GPU connection). Increase this to improve throughput, up to your GPU's capacity. | -| `--all-formats` | Directory where all job outputs and logs are saved. If omitted, defaults to `/tmp/jobs`. See [Generating multiple transcript formats](#) for details. | +| `--all-formats` | Directory where all job outputs and logs are saved. If omitted, defaults to `/tmp/jobs`. See [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats) for details. | | `PORT` | The local port forwarded to the container's internal port (`18000`). | :::tip From 0fb0e9176258d78a8085f42731bc85a5792b7a3d Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 12:41:52 +0100 Subject: [PATCH 26/35] remove gpu related info --- docs/deployments/container/batch-persistent-worker.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 5609aee4..c77ca6af 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -47,7 +47,7 @@ docker run -it \ | Parameter | Description | |---|---| -| `--parallel` | Number of parallel engines (each engine maps to one GPU connection). Increase this to improve throughput, up to your GPU's capacity. | +| `--parallel` | Number of parallel engines (each engine maps to one GPU connection). | | `--all-formats` | Directory where all job outputs and logs are saved. If omitted, defaults to `/tmp/jobs`. See [Generating multiple transcript formats](#) for details. | | `PORT` | The local port forwarded to the container's internal port (`18000`). 
| From 6f57ce471261c593e0232666cb50f3edce797153 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 12:53:09 +0100 Subject: [PATCH 27/35] add cpu to improving utilization --- docs/deployments/container/batch-persistent-worker.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 7c97e6fc..2061ccf6 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -2,7 +2,7 @@ id: batch-persistent-worker title: Batch persistent worker sidebar_label: Batch persistent worker -description: Run a long-lived HTTP transcription worker that accepts multiple jobs without restarting, reducing turnaround time and improving GPU utilisation. +description: Run a long-lived HTTP transcription worker that accepts multiple jobs without restarting, reducing turnaround time and improving CPU/GPU utilisation. --- import Tabs from '@theme/Tabs'; @@ -147,7 +147,7 @@ curl -X POST address.of.container:PORT/v2/jobs \ ## Speaker identification -To enable the [Speaker identification](/speech-to-text/features/speaker-identification) feature you can use the same logic used for the one shot batch container. +To enable the Speaker identification feature you can use the same logic used for the one shot [batch container](/speech-to-text/features/speaker-identification). To enable per-customer encrypted identifiers (as used in our SaaS offering), pass a `user_id` in the `X-SM-Processing-Data` header. 
```bash From 1a9031353c326966a5791ce5aeb70be106040258 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 12:54:36 +0100 Subject: [PATCH 28/35] add cpu again to make it clear --- docs/deployments/container/batch-persistent-worker.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 2061ccf6..0edeb00b 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -26,7 +26,7 @@ A batch persistent worker (also called an **HTTP batch worker**) is a long-runni |---|---|---| | Startup cost | Per job | Once | | Memory usage | One container per job | Multiple jobs share one container | -| GPU utilisation | Interrupted between jobs | Continuous | +| CPU/GPU utilisation | Interrupted between jobs | Continuous | | Best for | Large, infrequent files | High throughput or smaller files | The persistent worker is especially beneficial for smaller audio files, where startup overhead would otherwise dominate total turnaround time. @@ -49,7 +49,7 @@ docker run -it \ | Parameter | Description | |---|---| -| `--parallel` | Number of parallel engines (each engine maps to one GPU connection). | +| `--parallel` | Number of parallel engines (each engine maps to one GPU connection when on GPU container). | | `--all-formats` | Directory where all job outputs and logs are saved. If omitted, defaults to `/tmp/jobs`. See [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats) for details. | | `PORT` | The local port forwarded to the container's internal port (`18000`). 
| From e578d29229c8ecd0042e8277dde919aeab8abede Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 12:58:28 +0100 Subject: [PATCH 29/35] remove dividers between sections --- .../container/batch-persistent-worker.mdx | 17 ++++------------- 1 file changed, 4 insertions(+), 13 deletions(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 0edeb00b..3431156f 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -18,7 +18,7 @@ Available from version 15.7.0 A batch persistent worker (also called an **HTTP batch worker**) is a long-running transcription service that accepts jobs over HTTP. Unlike standard batch workers, it stays alive between jobs — meaning you pay the startup cost once, not on every request. ---- + ## Why use a persistent worker? @@ -31,7 +31,7 @@ A batch persistent worker (also called an **HTTP batch worker**) is a long-runni The persistent worker is especially beneficial for smaller audio files, where startup overhead would otherwise dominate total turnaround time. ---- + ## Starting the worker @@ -57,7 +57,7 @@ docker run -it \ To use a different internal port, set the `SM_BATCH_WORKER_LISTEN_PORT` environment variable. ::: ---- + ## Submitting a job @@ -116,7 +116,7 @@ asyncio.run(main()) | `400` | Invalid request | | `503` | Server busy — not enough free engines | ---- + ## Managing capacity @@ -143,7 +143,6 @@ curl -X POST address.of.container:PORT/v2/jobs \ -F 'data_file=@~/audio_file.mp3' ``` ---- ## Speaker identification @@ -168,7 +167,6 @@ curl -X POST address.of.container:PORT/v2/jobs \ For details on secrets management, refer to the [Speaker identification documentation](/deployments/container/speaker-identification). ::: ---- ## Job API reference @@ -224,7 +222,6 @@ Returns a list of jobs. 
} ``` ---- ### `GET /v2/jobs/{job_id}` @@ -253,7 +250,6 @@ Returns the status of a specific job. } ``` ---- ### `GET /v2/jobs/{job_id}/transcript` @@ -271,13 +267,11 @@ Returns the transcript for a completed job. |---|---| | `404` | Job not found, job not yet complete (includes current status), or unsupported format | ---- ### `GET /v2/jobs/{job_id}/log` Returns the processing logs for a specific job. ---- ## Health endpoints @@ -304,7 +298,6 @@ Returns current engine usage and a list of active jobs. Use `unused_engines` to } ``` ---- ### `GET /live` @@ -314,7 +307,6 @@ Liveness probe. Returns `200` when all container services are running and health { "live": true } ``` ---- ### `GET /ready` @@ -327,7 +319,6 @@ Readiness probe. Returns `200` when at least one engine slot is free, `503` when } ``` ---- ## Environment variables From 83ccae18b945a1d49a3ce5d81aa78594a89f7d4e Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 17:03:30 +0100 Subject: [PATCH 30/35] refactor doc to be more friendly to clients, given suggestions --- .../container/batch-persistent-worker.mdx | 50 +++++++++++-------- 1 file changed, 30 insertions(+), 20 deletions(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 3431156f..10af649e 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -8,7 +8,7 @@ description: Run a long-lived HTTP transcription worker that accepts multiple jo import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem' import { smVariables } from "/sm-variables"; - +import CodeBlock from "@theme/CodeBlock"; # Batch persistent worker @@ -16,7 +16,13 @@ import { smVariables } from "/sm-variables"; Available from version 15.7.0 ::: -A batch persistent worker (also called an **HTTP batch worker**) is a long-running transcription service that accepts jobs over HTTP. 
Unlike standard batch workers, it stays alive between jobs — meaning you pay the startup cost once, not on every request. +A batch persistent worker (also called an **HTTP batch worker**) is a long-running transcription service that loads the ASR models once at startup and then accepts jobs over an HTTP API for the lifetime of the container. Unlike standard batch containers — which start up, process a single job, and exit — a persistent worker stays alive indefinitely, serving jobs as they arrive. + +This gives you: +- **No per-job cold start.** The model is loaded into memory once. Every subsequent job skips the startup cost entirely. +- **Concurrent processing.** The `--parallel` flag controls how many processing units the worker handles simultaneously. Individual jobs can also be assigned multiple processing units(called `engines` in this document) to reduce their own turnaround time. + +The worker exposes an HTTP API for submitting jobs, polling status, fetching transcripts, and checking availability. @@ -29,23 +35,31 @@ A batch persistent worker (also called an **HTTP batch worker**) is a long-runni | CPU/GPU utilisation | Interrupted between jobs | Continuous | | Best for | Large, infrequent files | High throughput or smaller files | -The persistent worker is especially beneficial for smaller audio files, where startup overhead would otherwise dominate total turnaround time. +**Cold start overhead is significant for short audio.** Loading the ASR models — especially onto GPU — takes several seconds. For a 5-minute file this cost is negligible. For a 10-second clip, startup can take longer than transcription itself. The persistent worker eliminates this by loading the model once. +**High-throughput workloads benefit from a single long-lived container.** Routing many jobs to one worker is more efficient than launching a container per job. The `--parallel` setting lets you tune concurrency to your workload. 
+**GPU utilisation is maximised.** On GPU deployments, a standard batch container leaves the GPU idle between jobs. A persistent worker keeps the GPU warm and available, reducing wasted capacity across back-to-back requests. -## Starting the worker +When processing long audio jobs the benefits on RTF of the Persistent batch worker is negligible, and the resultant RTF is similar to that of a standard batch job. -```bash -docker run -it \ + + +## Deploying the worker + +### Docker + + +{`docker run -it \ -e LICENSE_TOKEN=$TOKEN_VALUE \ -p PORT:18000 \ batch-asr-transcriber-en:${smVariables.latestContainerVersion} \ --run-mode http \ --parallel=4 \ - --all-formats /output_dir_name -``` + --all-formats /output_dir_name`} + -### Parameters +#### Parameters | Parameter | Description | |---|---| @@ -53,13 +67,17 @@ docker run -it \ | `--all-formats` | Directory where all job outputs and logs are saved. If omitted, defaults to `/tmp/jobs`. See [Generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats) for details. | | `PORT` | The local port forwarded to the container's internal port (`18000`). | -:::tip -To use a different internal port, set the `SM_BATCH_WORKER_LISTEN_PORT` environment variable. -::: +#### Environment variables + +| Variable | Description | +|---|---| +| `SM_BATCH_WORKER_LISTEN_PORT` | Override the default internal port (`18000`). | +| `SM_BATCH_WORKER_MAX_JOB_HISTORY` | Maximum number of completed job records to retain in memory. | ## Submitting a job +Once the worker is running and is available, submit jobs by `POST`ing to `/v2/jobs` with an audio file and a transcription config. The worker queues the job and returns a `job_id` immediately; poll [`GET /v2/jobs/{job_id}`](#get-v2jobsjob_id) for status, then fetch the transcript once it reaches `DONE`. @@ -318,11 +336,3 @@ Readiness probe. 
Returns `200` when at least one engine slot is free, `503` when "engines_used": 2 } ``` - - -## Environment variables - -| Variable | Description | -|---|---| -| `SM_BATCH_WORKER_LISTEN_PORT` | Override the default internal port (`18000`). | -| `SM_BATCH_WORKER_MAX_JOB_HISTORY` | Maximum number of completed job records to retain in memory. | From 02d0886d6dbe1dfddc2a6f74998fc7c1793642e8 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 17:04:30 +0100 Subject: [PATCH 31/35] plural for models --- docs/deployments/container/batch-persistent-worker.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 10af649e..3a06633f 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -19,7 +19,7 @@ Available from version 15.7.0 A batch persistent worker (also called an **HTTP batch worker**) is a long-running transcription service that loads the ASR models once at startup and then accepts jobs over an HTTP API for the lifetime of the container. Unlike standard batch containers — which start up, process a single job, and exit — a persistent worker stays alive indefinitely, serving jobs as they arrive. This gives you: -- **No per-job cold start.** The model is loaded into memory once. Every subsequent job skips the startup cost entirely. +- **No per-job cold start.** The models are loaded into memory once. Every subsequent job skips the startup cost entirely. - **Concurrent processing.** The `--parallel` flag controls how many processing units the worker handles simultaneously. Individual jobs can also be assigned multiple processing units(called `engines` in this document) to reduce their own turnaround time. The worker exposes an HTTP API for submitting jobs, polling status, fetching transcripts, and checking availability. 
@@ -35,7 +35,7 @@ The worker exposes an HTTP API for submitting jobs, polling status, fetching tra | CPU/GPU utilisation | Interrupted between jobs | Continuous | | Best for | Large, infrequent files | High throughput or smaller files | -**Cold start overhead is significant for short audio.** Loading the ASR models — especially onto GPU — takes several seconds. For a 5-minute file this cost is negligible. For a 10-second clip, startup can take longer than transcription itself. The persistent worker eliminates this by loading the model once. +**Cold start overhead is significant for short audio.** Loading the ASR models — especially onto GPU — takes several seconds. For a 5-minute file this cost is negligible. For a 10-second clip, startup can take longer than transcription itself. The persistent worker eliminates this by loading the models once. **High-throughput workloads benefit from a single long-lived container.** Routing many jobs to one worker is more efficient than launching a container per job. The `--parallel` setting lets you tune concurrency to your workload. From 4aa576e1400985c355981378490b40d79c78fdd1 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Fri, 24 Apr 2026 18:23:48 +0100 Subject: [PATCH 32/35] add error response for /log endpoint, stating that logs might not appear if the verbosity isn't high enough --- docs/deployments/container/batch-persistent-worker.mdx | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 3a06633f..1cbc755e 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -290,6 +290,12 @@ Returns the transcript for a completed job. Returns the processing logs for a specific job. 
**Error responses:**

| Code | Reason |
|---|---|
| `404` | Log for job ID `{job_id}` not found |
| `404` | No log file found for job `{job_id}`. The log verbosity is not high enough to produce debug logs. Add `-v` to the container arguments or set the environment variable `DEBUG=true` for higher verbosity. |

From 82aee5d531e61ade1933f68bc9089fddee86de6e Mon Sep 17 00:00:00 2001
From: Georgios Hadjiharalambous <40407855+giorgosHadji@users.noreply.github.com>
Date: Mon, 27 Apr 2026 13:03:10 +0100
Subject: [PATCH 33/35] Apply suggestions from code review

insert empty line between header

Co-authored-by: Tudor Evans <104087420+TudorCRL@users.noreply.github.com>
---
 docs/deployments/container/batch-persistent-worker.mdx | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx
index 1cbc755e..938f577e 100644
--- a/docs/deployments/container/batch-persistent-worker.mdx
+++ b/docs/deployments/container/batch-persistent-worker.mdx
@@ -77,6 +77,7 @@ When processing long audio jobs the benefits on RTF of the Persistent batch work
 ## Submitting a job


 Once the worker is running and available, submit jobs by `POST`ing to `/v2/jobs` with an audio file and a transcription config. The worker queues the job and returns a `job_id` immediately; poll [`GET /v2/jobs/{job_id}`](#get-v2jobsjob_id) for status, then fetch the transcript once it reaches `DONE`.
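The submit, poll, fetch lifecycle described in this patch can be sketched end to end in plain Python. This is an illustrative sketch, not part of the documented SDK: the worker address (`http://localhost:18000`), the helper names, and the hand-rolled multipart encoder are assumptions made so the sketch stays standard-library only; the endpoints and response shapes (`job_id`, `job.status`, `?format=`) are the ones described above.

```python
import json
import time
import urllib.request
import uuid

BASE = "http://localhost:18000"  # hypothetical worker address; substitute your own


def encode_multipart(fields, file_field, file_name, file_bytes):
    """Hand-roll a multipart/form-data body so the sketch stays stdlib-only."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", value]
    parts += [f"--{boundary}",
              f'Content-Disposition: form-data; name="{file_field}"; filename="{file_name}"',
              "Content-Type: application/octet-stream", ""]
    head = ("\r\n".join(parts) + "\r\n").encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return boundary, head + file_bytes + tail


def submit(audio_path):
    """POST /v2/jobs with a config part and a data_file part; returns the job_id."""
    config = json.dumps({"type": "transcription",
                         "transcription_config": {"language": "en"}})
    with open(audio_path, "rb") as f:
        boundary, body = encode_multipart({"config": config},
                                          "data_file", audio_path, f.read())
    req = urllib.request.Request(
        f"{BASE}/v2/jobs", data=body, method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["job_id"]


def wait_for_transcript(job_id, fmt="txt", poll_seconds=5):
    """Poll GET /v2/jobs/{job_id} until the status is DONE, then fetch the transcript."""
    while True:
        with urllib.request.urlopen(f"{BASE}/v2/jobs/{job_id}") as resp:
            if json.load(resp)["job"]["status"] == "DONE":
                break
        time.sleep(poll_seconds)
    url = f"{BASE}/v2/jobs/{job_id}/transcript?format={fmt}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()
```

In practice the official python sdk shown earlier does all of this for you; the sketch only makes the raw HTTP lifecycle visible.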
From 0609f7d9e348867dfc5a0af9df7185372b1dffd9 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous Date: Mon, 27 Apr 2026 13:41:31 +0100 Subject: [PATCH 34/35] remove jobs api reference and describe differences, refactor managing capacity, remove response codes --- .../container/batch-persistent-worker.mdx | 173 ++++-------------- 1 file changed, 35 insertions(+), 138 deletions(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index 938f577e..a2400787 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -127,34 +127,40 @@ asyncio.run(main()) -### Response codes - -| Code | Meaning | -|---|---| -| `201` | Job accepted. Returns `{"job_id": "abcdefgh01"}` | -| `400` | Invalid request | -| `503` | Server busy — not enough free engines | +## Managing capacity -## Managing capacity The worker processes multiple jobs concurrently, up to the `--parallel` limit you set at startup. -Each job can request multiple engines using the `parallel_engines` value in the `X-SM-Processing-Data` header. More engines per job means faster turnaround for that job, at the cost of reduced concurrency for others. -To check available capacity before submitting, query the [`/jobs` health endpoint](#get-jobs). The `unused_engines` field tells you how many engines are free. +To check available capacity before submitting, query the `/jobs` endpoint. -:::warning -If a job requests more engines than are currently available, it will be rejected: +### `GET /jobs` +Returns current engine usage and a list of active jobs. The `unused_engines` field tells you how many engines are free, and you can use it to determine how many engines you can request for the next job. 
+ +**Example response:** + +```json +{ + "active_jobs": [ + { "job_id": "f8a564954b334eecb823", "parallel_engines": 1 }, + { "job_id": "29351ae8cf2c4e8694f0", "parallel_engines": 1 } + ], + "max_engines": 8, + "unused_engines": 6 +} ``` -HTTP 503: {"detail": "Server busy: 8 engines not available (2 engines in use, 5 parallel allowed)"} -``` -::: + + + ### Requesting parallel engines +Each job can request multiple engines using the `parallel_engines` value in the `X-SM-Processing-Data` header. More engines per job means faster turnaround for that job, at the cost of reduced concurrency for others. + ```bash curl -X POST address.of.container:PORT/v2/jobs \ -H 'X-SM-Processing-Data: {"parallel_engines": 2}' \ @@ -162,6 +168,13 @@ curl -X POST address.of.container:PORT/v2/jobs \ -F 'data_file=@~/audio_file.mp3' ``` +:::warning +If a job requests more engines than are currently available, it will be rejected: + +``` +HTTP 503: {"detail": "Server busy: 8 engines not available (2 engines in use, 5 parallel allowed)"} +``` +::: ## Speaker identification @@ -189,114 +202,16 @@ For details on secrets management, refer to the [Speaker identification document ## Job API reference -### `GET /v2/jobs` - -Returns a list of jobs. +The API is made on purpose similar to our [V2 SaaS API](/api-ref/batch/create-a-new-job) to ease integration to our SaaS, and be able to use both SaaS and On-prem interchangeably. +There are some difference between the SaaS vs On-prem API outlined below: -**Query parameters:** +#### For the Batch persistent worker: -| Parameter | Description | -|---|---| -| `created_before` | ISO 8601 datetime. Only return jobs created before this time. | -| `limit` | Max number of jobs to return (1–100). | +We don't support: +- The `include_deleted` parameter in the `GET /v2/jobs` call. +- 'GET /v2/usage` call. 
-**Example response:** - -```json -{ - "jobs": [ - { - "id": "191f47e4a4204fa4ac2b", - "created_at": "2026-03-18T19:27:42.436Z", - "data_name": "5_min", - "text_name": null, - "duration": 300, - "status": "RUNNING", - "config": { - "type": "transcription", - "transcription_config": { - "language": "en", - "diarization": "speaker", - "operating_point": "enhanced" - } - } - }, - { - "id": "6dcb02e0dc5943e2b643", - "created_at": "2026-03-18T19:27:47.550Z", - "data_name": "5_min", - "text_name": null, - "duration": 300, - "status": "RUNNING", - "config": { - "type": "transcription", - "transcription_config": { - "language": "en", - "diarization": "speaker", - "operating_point": "enhanced" - } - } - } - ] -} -``` - - -### `GET /v2/jobs/{job_id}` - -Returns the status of a specific job. - -**Example response:** - -```json -{ - "job": { - "id": "191f47e4a4204fa4ac2b", - "created_at": "2026-03-18T19:27:42.436Z", - "data_name": "5_min", - "duration": 300, - "status": "DONE", - "config": { - "type": "transcription", - "transcription_config": { - "language": "en", - "diarization": "speaker", - "operating_point": "enhanced" - } - }, - "request_id": "191f47e4a4204fa4ac2b" - } -} -``` - - -### `GET /v2/jobs/{job_id}/transcript` - -Returns the transcript for a completed job. - -**Query parameters:** - -| Parameter | Options | -|---|---| -| `format` | `json`, `txt`, `srt` | - -**Error responses:** - -| Code | Reason | -|---|---| -| `404` | Job not found, job not yet complete (includes current status), or unsupported format | - - -### `GET /v2/jobs/{job_id}/log` - -Returns the processing logs for a specific job. - -**Error responses:** - -| Code | Reason | -|---|---| -| `404` | Log for job ID `{job_id}` not found | -| `404` | No log file found for job `{job_id}`. Verbosity isn't enough to produce debug logs. Add `-v` to the container arguments or set the environment variable `DEBUG=true` for higher verbosity. 
| +For the API call `GET /v2/jobs/{job_id}`, we also return the `request_id` as part of the response. ## Health endpoints @@ -306,24 +221,6 @@ The worker exposes three health endpoints on the same port as job submission. These endpoints are designed to work as [liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) in a Kubernetes cluster. ::: -### `GET /jobs` - -Returns current engine usage and a list of active jobs. Use `unused_engines` to determine how many engines you can request for the next job. - -**Example response:** - -```json -{ - "active_jobs": [ - { "job_id": "f8a564954b334eecb823", "parallel_engines": 1 }, - { "job_id": "29351ae8cf2c4e8694f0", "parallel_engines": 1 } - ], - "max_engines": 8, - "unused_engines": 6 -} -``` - - ### `GET /live` Liveness probe. Returns `200` when all container services are running and healthy. From 6f86676e5960fc251301f8e9784749d332fae7a7 Mon Sep 17 00:00:00 2001 From: Georgios Hadjiharalambous <40407855+giorgosHadji@users.noreply.github.com> Date: Tue, 28 Apr 2026 14:19:26 +0100 Subject: [PATCH 35/35] Apply suggestions from code review Co-authored-by: Tudor Evans <104087420+TudorCRL@users.noreply.github.com> --- .../container/batch-persistent-worker.mdx | 22 ++++++------------- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/docs/deployments/container/batch-persistent-worker.mdx b/docs/deployments/container/batch-persistent-worker.mdx index a2400787..fc5227d3 100644 --- a/docs/deployments/container/batch-persistent-worker.mdx +++ b/docs/deployments/container/batch-persistent-worker.mdx @@ -35,15 +35,13 @@ The worker exposes an HTTP API for submitting jobs, polling status, fetching tra | CPU/GPU utilisation | Interrupted between jobs | Continuous | | Best for | Large, infrequent files | High throughput or smaller files | -**Cold start overhead is significant for short audio.** Loading the ASR models — especially onto GPU — takes 
several seconds. For a 5-minute file this cost is negligible. For a 10-second clip, startup can take longer than transcription itself. The persistent worker eliminates this by loading the models once.
+**Cold start overhead is significant for short audio.** Loading the ASR models — especially onto GPU — takes several seconds. For a five-minute file this cost is negligible. For a ten-second clip, startup can take longer than transcription itself. The persistent worker eliminates this by loading the models once.

**High-throughput workloads benefit from a single long-lived container.** Routing many jobs to one worker is more efficient than launching a container per job. The `--parallel` setting lets you tune concurrency to your workload.

**GPU utilisation is maximised.** On GPU deployments, a standard batch container leaves the GPU idle between jobs. A persistent worker keeps the GPU warm and available, reducing wasted capacity across back-to-back requests.

-When processing long audio jobs the benefits on RTF of the Persistent batch worker is negligible, and the resultant RTF is similar to that of a standard batch job.
-
-
+When processing long audio jobs, the RTF benefit of the persistent batch worker is negligible, and the resultant RTF is similar to that of a standard batch job.

## Deploying the worker

@@ -64,7 +62,7 @@ When processing long audio jobs the benefits on RTF of the Persistent batch work
| Parameter | Description |
|---|---|
| `--parallel` | Number of parallel engines (each engine maps to one GPU connection when on GPU container). |
-| `--all-formats` | Directory where all job outputs and logs are saved. If omitted, defaults to `/tmp/jobs`. 
See [generating multiple transcript formats](https://docs.speechmatics.com/deployments/container/cpu-speech-to-text#generating-multiple-transcript-formats) for details. | | `PORT` | The local port forwarded to the container's internal port (`18000`). | #### Environment variables @@ -74,11 +72,9 @@ When processing long audio jobs the benefits on RTF of the Persistent batch work | `SM_BATCH_WORKER_LISTEN_PORT` | Override the default internal port (`18000`). | | `SM_BATCH_WORKER_MAX_JOB_HISTORY` | Maximum number of completed job records to retain in memory. | - - ## Submitting a job -Once the worker is running and is available, submit jobs by `POST`ing to `/v2/jobs` with an audio file and a transcription config. The worker queues the job and returns a `job_id` immediately; poll [`GET /v2/jobs/{job_id}`](#get-v2jobsjob_id) for status, then fetch the transcript once it reaches `DONE`. +Once the worker is running and is available, submit jobs by making a `POST` request to `/v2/jobs` with an audio file and transcription config. The worker queues the job and returns a `job_id` immediately. You can poll [`GET /v2/jobs/{job_id}`](#get-v2jobsjob_id) for the job status, and fetch the transcript when the status changes to `DONE`. @@ -134,7 +130,6 @@ asyncio.run(main()) The worker processes multiple jobs concurrently, up to the `--parallel` limit you set at startup. - To check available capacity before submitting, query the `/jobs` endpoint. ### `GET /jobs` @@ -202,20 +197,17 @@ For details on secrets management, refer to the [Speaker identification document ## Job API reference -The API is made on purpose similar to our [V2 SaaS API](/api-ref/batch/create-a-new-job) to ease integration to our SaaS, and be able to use both SaaS and On-prem interchangeably. -There are some difference between the SaaS vs On-prem API outlined below: - -#### For the Batch persistent worker: +The HTTP batch worker API is similar to our [V2 SaaS API](/api-ref/batch/create-a-new-job). 
This makes it easy to use our SaaS and on-prem offerings interchangeably. The only differences between the SaaS API and our HTTP workers are outlined below.

We don't support:
- The `include_deleted` parameter in the `GET /v2/jobs` call.
-- 'GET /v2/usage` call.
+- `GET /v2/usage` call.

For the API call `GET /v2/jobs/{job_id}`, we also return the `request_id` as part of the response.

## Health endpoints

-The worker exposes three health endpoints on the same port as job submission.
+The worker exposes two health endpoints on the same port as job submission.

:::tip Kubernetes users
These endpoints are designed to work as [liveness and readiness probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) in a Kubernetes cluster.
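The capacity check described under Managing capacity can be automated before each submission: query `GET /jobs`, read `unused_engines`, and size the `parallel_engines` request so the job is not rejected with HTTP 503. The sketch below is illustrative only — the base URL is a placeholder for your deployment, and a real client should back off and retry rather than fail when no engines are free.

```python
import json
import urllib.request

# Placeholder base URL -- substitute your host and the port you mapped with -p.
BASE = "http://address.of.container:18000"


def engines_to_request(unused_engines: int, desired: int) -> int:
    """Cap the per-job engine request at the worker's reported free capacity.
    Returns 0 when nothing is free, signalling the caller to back off."""
    return min(desired, max(unused_engines, 0))


def capacity_aware_headers(desired: int = 2) -> dict:
    """Query GET /jobs and build an X-SM-Processing-Data header sized so the
    job will not be rejected for requesting more engines than are free."""
    with urllib.request.urlopen(f"{BASE}/jobs") as resp:
        free = json.load(resp)["unused_engines"]
    engines = engines_to_request(free, desired)
    if engines == 0:
        raise RuntimeError("no free engines; retry once a running job finishes")
    return {"X-SM-Processing-Data": json.dumps({"parallel_engines": engines})}
```

Pass the returned headers to your `POST /v2/jobs` call; between the check and the submission another client may still claim the engines, so the 503 path should still be handled.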