LLM Model

This guide demonstrates how to deploy a Large Language Model (LLM) in your OtterScale cluster and test it using Python with OpenAI API integration.

  1. Navigate to the Models page in your OtterScale cluster.

  2. Click the Create button to create a new model.

  3. Select a model from your model artifacts:

    • Search for available models using the cloud icon in the search box
    • Or click the archive icon to browse model artifacts
    • Select your desired LLM (e.g., meta-llama/Llama-2-7b-chat)
  4. Configure the model deployment:

    • Name: Choose a descriptive name (e.g., llm-demo)
    • Namespace: Select your target namespace
    • Prefill Configuration: Set vGPU memory %, replica count, and tensor configuration (if needed)
    • Decode Configuration: Set vGPU memory %, replica count, and tensor configuration for the decode phase
    • Description: Add any relevant notes about the deployment
  5. Review the configuration and click Create to deploy the model.

  6. Monitor the deployment status on the Models page. The status will progress from Pending → Running → Ready.

  7. Once the status shows Ready, click the Test button to verify the model API is working.

Once your LLM model is deployed and ready, you can test it using Python with OpenAI API integration.

Before running the test scripts, you’ll need to find the following information from the <url>/scope/<scope-name>/models/llm page:

  • Service URL: The URL information from the Service card
  • Name: The name field in the model table
  • Model Name: The Model Name field in the model table
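
The script below sends a single question to the deployed model using the requests library; replace the placeholder values with the information gathered above.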
import requests

# Configuration
SERVICE_URL = "<your_service_url>"  # e.g., http://localhost:8000
NAME = "<your_name>"  # e.g., llm-demo
MODEL_NAME = "<your_model_name>"

def ask_question(question):
    """Send a simple question to the LLM and get a response."""
    headers = {
        "OtterScale-Model-Name": NAME,
        "Content-Type": "application/json"
    }
    payload = {
        "model": MODEL_NAME,
        "prompt": question
    }
    try:
        response = requests.post(
            f"{SERVICE_URL}/v1/chat",
            headers=headers,
            json=payload
        )
        response.raise_for_status()
        result = response.json()
        return result.get("response", result)
    except Exception as e:
        return f"✗ Error: {str(e)}"

# Test
question = "Are you alive? Please respond if you can process this message."
answer = ask_question(question)
print(f"Q: {question}")
print(f"A: {answer}")
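
Because the guide frames testing around OpenAI API integration, you can also try the official openai Python client, provided the deployment exposes an OpenAI-compatible /v1/chat/completions endpoint. The following is a minimal sketch under that assumption; the base_url suffix, the placeholder API key, and the reuse of the OtterScale-Model-Name header are assumptions, not details confirmed by this guide.

from openai import OpenAI

# Reuse the values gathered from the models page.
SERVICE_URL = "<your_service_url>"
NAME = "<your_name>"
MODEL_NAME = "<your_model_name>"

# Assumption: the service exposes an OpenAI-compatible API under {SERVICE_URL}/v1
# and does not validate the API key.
client = OpenAI(
    base_url=f"{SERVICE_URL}/v1",
    api_key="not-needed",
    default_headers={"OtterScale-Model-Name": NAME},  # assumption: the header is honored here too
)

completion = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Are you alive? Please respond if you can process this message."}],
)
print(completion.choices[0].message.content)

If the deployment only serves the /v1/chat endpoint used earlier, keep using the requests-based script instead.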