Deploy the Llama3 model for the text generation task with the Hugging Face LLM Serving Runtime¶
In this example, we demonstrate how to deploy the Llama3 model from Hugging Face for the text generation task by deploying an InferenceService with the Hugging Face serving runtime.
Serve the Hugging Face LLM model using the vLLM backend¶
By default, the KServe Hugging Face runtime uses vLLM to serve LLM models, offering faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API. vLLM implements common inference optimization techniques such as paged attention, continuous batching, and optimized CUDA kernels. If a model is not supported by vLLM, KServe falls back to the Hugging Face backend as a failsafe.
Note
The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the HF_TOKEN environment variable.
Create a secret with the Hugging Face token.
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  HF_TOKEN: <token>
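Apply the secret to your cluster before creating the InferenceService. A minimal sketch, assuming the manifest above is saved as hf-secret.yaml (the filename is only an example) and that you are working in the default namespace:
kubectl apply -f hf-secret.yaml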
Then create the inference service.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_id=meta-llama/meta-llama-3-8b-instruct
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: HF_TOKEN
              optional: false
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
EOF
Check InferenceService status.¶
kubectl get inferenceservices huggingface-llama3
Expected Output
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
huggingface-llama3 http://huggingface-llama3.default.example.com True 100 huggingface-llama3-predictor-default-47q2g 7d23h
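If the service is not ready yet, you can block until the Ready condition is met before sending requests. A minimal sketch; the 10-minute timeout is an assumption, since downloading the model weights can take several minutes:
kubectl wait --for=condition=Ready inferenceservice/huggingface-llama3 --timeout=600s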
Perform Model Inference¶
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
MODEL_NAME=llama3
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
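How you resolve INGRESS_HOST and INGRESS_PORT depends on your ingress setup. A minimal sketch, assuming a standard Istio ingress gateway exposed through a LoadBalancer service in the istio-system namespace; adjust for your own cluster:
INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')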
The KServe Hugging Face vLLM runtime supports the OpenAI /v1/completions and /v1/chat/completions endpoints for inference.
Sample OpenAI Completions request:¶
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"model": "llama3", "prompt": "Write a poem about colors", "stream":false, "max_tokens": 30}'
Expected Output
{
"id": "cmpl-625a9240f25e463487a9b6c53cbed080",
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"text": " and how they make you feel\nColors, oh colors, so vibrant and bright\nA world of emotions, a kaleidoscope in sight\nRed"
}
],
"created": 1718620153,
"model": "llama3",
"system_fingerprint": null,
"object": "text_completion",
"usage": {
"completion_tokens": 30,
"prompt_tokens": 6,
"total_tokens": 36
}
}
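To pull just the generated text out of the completions response, you can pipe the response through jq (this assumes jq is installed on the client machine):
curl -s http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
  -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  -d '{"model": "llama3", "prompt": "Write a poem about colors", "stream": false, "max_tokens": 30}' \
  | jq -r '.choices[0].text'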
Sample OpenAI Chat Completions request:¶
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
-d '{"model":"llama3","messages":[{"role":"system","content":"You are an assistant that speaks like Shakespeare."},{"role":"user","content":"Write a poem about colors"}],"max_tokens":30,"stream":false}'
Expected Output
{
"id": "cmpl-9aad539128294069bf1e406a5cba03d3",
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"content": " O, fair and vibrant colors, how ye doth delight\nIn the world around us, with thy hues so bright!\n",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": null
}
],
"created": 1718638005,
"model": "llama3",
"system_fingerprint": null,
"object": "chat.completion",
"usage": {
"completion_tokens": 30,
"prompt_tokens": 37,
"total_tokens": 67
}
}
Sample OpenAI Chat Completions streaming request:¶
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
-d '{"model":"llama3","messages":[{"role":"system","content":"You are an assistant that speaks like Shakespeare."},{"role":"user","content":"Write a poem about colors"}],"max_tokens":30,"stream":true}'
Note
The output is truncated for brevity.
Expected Output
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":" ","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":" O","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":",","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":"skie","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":",","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: {"id":"cmpl-22e12eb9fa5e4b0c9726cef4a9ac993c","choices":[{"delta":{"content":" what","function_call":null,"tool_calls":null,"role":"assistant"},"logprobs":null,"finish_reason":null,"index":0}],"created":1718638726,"model":"llama3","system_fingerprint":null,"object":"chat.completion.chunk"}
data: [DONE]
Serve the Hugging Face LLM model using the Hugging Face backend¶
You can use the --backend=huggingface argument to perform inference using the Hugging Face API. The KServe Hugging Face backend runtime also supports the OpenAI /v1/completions and /v1/chat/completions endpoints for inference.
Note
The Llama3 model requires a Hugging Face Hub token to download the model. You can set the token using the HF_TOKEN environment variable.
Create a secret with the Hugging Face token.
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
type: Opaque
stringData:
  HF_TOKEN: <token>
Then create the inference service.
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_id=meta-llama/meta-llama-3-8b-instruct
        - --backend=huggingface
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: HF_TOKEN
              optional: false
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
EOF
Check InferenceService status.¶
kubectl get inferenceservices huggingface-llama3
Expected Output
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
huggingface-llama3 http://huggingface-llama3.default.example.com True 100 huggingface-llama3-predictor-default-47q2g 7d23h
Perform Model Inference¶
The first step is to determine the ingress IP and ports and set INGRESS_HOST and INGRESS_PORT.
MODEL_NAME=llama3
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
Sample OpenAI Completions request:¶
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"model": "llama3", "prompt": "Write a poem about colors", "stream":false, "max_tokens": 30}'
Expected Output
{
"id": "564d3bcf-5569-4d15-ace4-ed8a29678359",
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"text": "\nColors, oh colors, so vibrant and bright\nA world of emotions, a world in sight\nRed, the passion, the fire that burns"
}
],
"created": 1718699758,
"model": "llama3",
"system_fingerprint": null,
"object": "text_completion",
"usage": {
"completion_tokens": 30,
"prompt_tokens": 6,
"total_tokens": 36
}
}
Sample OpenAI Chat Completions request:¶
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
-d '{"model":"llama3","messages":[{"role":"system","content":"You are an assistant that speaks like Shakespeare."},{"role":"user","content":"Write a poem about colors"}],"max_tokens":30,"stream":false}'
Expected Output
{
"id": "7dcc83b4-aa94-4a52-90fd-fa705978d3c1",
"choices": [
{
"finish_reason": "length",
"index": 0,
"message": {
"content": "assistant\n\nO, fairest hues of earth and sky,\nHow oft thy beauty doth my senses fly!\nIn vibrant splendor, thou",
"tool_calls": null,
"role": "assistant",
"function_call": null
},
"logprobs": null
}
],
"created": 1718699982,
"model": "llama3",
"system_fingerprint": null,
"object": "chat.completion",
"usage": {
"completion_tokens": 30,
"prompt_tokens": 26,
"total_tokens": 56
}
}
Sample OpenAI Completions streaming request:¶
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"model": "llama3", "prompt": "Write a poem about colors", "stream":true, "max_tokens": 30}'
Note
The output is truncated for brevity.
Expected Output
data: {"id":"acadb7d0-1235-4cd7-bd7b-24c62a89b8de","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"\n"}],"created":1718700166,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"acadb7d0-1235-4cd7-bd7b-24c62a89b8de","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Colors, "}],"created":1718700168,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"acadb7d0-1235-4cd7-bd7b-24c62a89b8de","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"oh "}],"created":1718700169,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":null}
data: {"id":"acadb7d0-1235-4cd7-bd7b-24c62a89b8de","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"colors, "}],"created":1718700170,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":null}
data: [DONE]
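When you are done experimenting, you can clean up the resources created in this example (delete the secret only if nothing else in the namespace uses it):
kubectl delete inferenceservice huggingface-llama3
kubectl delete secret hf-secret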