LLM Model
This guide demonstrates how to deploy a Large Language Model (LLM) in your OtterScale cluster and test it using Python with OpenAI API integration.
Deploy LLM Model

1. Navigate to the Models page in your OtterScale cluster.

2. Click the Create button to create a new model.

3. Select a model from your model artifacts:
   - Search for available models using the cloud icon in the search box
   - Or click the archive icon to browse model artifacts
   - Select your desired LLM (e.g., meta-llama/Llama-2-7b-chat)

4. Configure the model deployment:
   - Name: Choose a descriptive name (e.g., llm-demo)
   - Namespace: Select your target namespace
   - Prefill Configuration: Set vGPU memory %, replica count, and tensor configuration (if needed)
   - Decode Configuration: Set decoding parameters similarly
   - Description: Add any relevant notes about the deployment

5. Review the configuration and click Create to deploy the model.

6. Monitor the deployment status on the Models page. The status will change from Pending → Running → Ready.

7. Once the status shows Ready, click the Test button to verify the model API is working.
Test with Python

Once your LLM model is deployed and ready, you can test it using Python with OpenAI API integration.
Connection Information

Before running the test scripts, you’ll need to find the following information on the <url>/scope/<scope-name>/models/llm page:
- Service URL: the URL from the Service card
- Name: the name field in the model table
- Model Name: the Model Name field in the model table
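If you prefer not to hard-code these values, one option is to read them from environment variables before running the script. This is a minimal sketch using only the Python standard library; the variable names (OTTERSCALE_SERVICE_URL, OTTERSCALE_NAME, OTTERSCALE_MODEL_NAME) are illustrative and not defined by OtterScale.

```python
import os

# Illustrative environment variable names (assumption; not defined by OtterScale).
# Export them in your shell before running the test script below.
SERVICE_URL = os.environ["OTTERSCALE_SERVICE_URL"]  # Service URL from the Service card
NAME = os.environ["OTTERSCALE_NAME"]                # the name field, e.g., llm-demo
MODEL_NAME = os.environ["OTTERSCALE_MODEL_NAME"]    # the Model Name field
```

With these three values in hand, the script below sends a single test question to the deployed model: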
```python
import requests
import json

# Configuration
SERVICE_URL = "<your_service_url>"  # e.g., http://localhost:8000
NAME = "<your_name>"                # e.g., llm-demo
MODEL_NAME = "<your_model_name>"


def ask_question(question):
    """Send a simple question to the LLM and get a response."""
    headers = {
        "OtterScale-Model-Name": NAME,
        "Content-Type": "application/json",
    }
    payload = {
        "model": MODEL_NAME,
        "prompt": question,
    }
    try:
        response = requests.post(
            f"{SERVICE_URL}/v1/chat",
            headers=headers,
            json=payload,
        )
        response.raise_for_status()
        result = response.json()
        return result.get("response", result)
    except Exception as e:
        return f"✗ Error: {str(e)}"


# Test
question = "Are you alive? Please respond if you can process this message."
answer = ask_question(question)
print(f"Q: {question}")
print(f"A: {answer}")
```
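Since the guide describes testing with OpenAI API integration, you may also be able to reach the model through the official openai Python client. The sketch below rests on assumptions: it presumes the service exposes an OpenAI-compatible /v1/chat/completions endpoint under your Service URL and honors the same OtterScale-Model-Name header as the requests example above; verify both against your deployment.

```python
from openai import OpenAI

SERVICE_URL = "<your_service_url>"  # e.g., http://localhost:8000
NAME = "<your_name>"                # e.g., llm-demo
MODEL_NAME = "<your_model_name>"

# Assumes an OpenAI-compatible endpoint at <SERVICE_URL>/v1 (not confirmed by this guide).
client = OpenAI(
    base_url=f"{SERVICE_URL}/v1",
    api_key="unused",  # placeholder; self-hosted endpoints often ignore the key
    default_headers={"OtterScale-Model-Name": NAME},  # same routing header as above
)

completion = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Are you alive? Please respond if you can process this message."}],
)
print(completion.choices[0].message.content)
```

If this request fails, fall back to the requests-based script above, which targets the /v1/chat endpoint directly.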