Latency and Throughput Inquiry #20

Hello,

I've been looking into hosting an LLM on AWS infrastructure, mainly Flan T5 XXL. My question is below.

Inquiry: What is the recommended container for hosting Flan T5 XXL?
Context: I've hosted Flan T5 XXL using both the TGI container and the DJL-FasterTransformer container. With the same prompt, TGI takes around 5-6 seconds, whereas the DJL-FasterTransformer container takes 0.5-1.5 seconds. The DJL-FasterTransformer container has tensor-parallel-degree set to 4, and SM_NUM_GPUS for TGI was set to 4. Both were hosted on ml.g5.12xlarge.
- Are there recommended configs for the TGI container that I might be missing?
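
For anyone wanting to reproduce the comparison, here is a minimal sketch of one way the two endpoints could be timed on the same prompt. It is not necessarily how the numbers above were obtained. The `invoke` callable is a hypothetical stand-in for whatever sends the prompt to an endpoint (for example, a small wrapper around a SageMaker predictor's `predict` call), and the `runs` and `warmup` defaults are arbitrary assumptions.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def time_endpoint(
    invoke: Callable[[str], object],  # hypothetical: sends the prompt to one endpoint
    prompt: str,
    runs: int = 20,    # assumed sample size, adjust as needed
    warmup: int = 2,   # untimed calls so cold-start effects are excluded
) -> Dict[str, float]:
    """Call invoke(prompt) repeatedly and summarize wall-clock latency in seconds."""
    for _ in range(warmup):
        invoke(prompt)

    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke(prompt)
        samples.append(time.perf_counter() - start)

    samples.sort()
    # Nearest-rank 90th percentile over the sorted samples.
    p90_index = max(0, math.ceil(0.9 * len(samples)) - 1)
    return {
        "min": samples[0],
        "median": statistics.median(samples),
        "p90": samples[p90_index],
        "max": samples[-1],
    }
```

Running this once per container with the same prompt and the same generation parameters (in particular the maximum number of new tokens) keeps the comparison like for like, since output length strongly affects generation latency.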


