LLM Model
This guide demonstrates how to deploy a Large Language Model (LLM) in your OtterScale cluster and test it using Python with OpenAI API integration.
Deploy LLM Model

1. Navigate to the Models page in your OtterScale cluster.

2. Click the Create button to create a new model.

3. Select a model from your model artifacts:
   - Search for available models using the cloud icon in the search box
   - Or click the archive icon to browse model artifacts
   - Select your desired LLM (e.g., meta-llama/Llama-2-7b-chat)

4. Configure the model deployment:
   - Name: Choose a descriptive name (e.g., llm-demo)
   - Namespace: Select your target namespace
   - Prefill Configuration: Set vGPU memory %, replica count, and tensor configuration (if needed)
   - Decode Configuration: Set decoding parameters similarly
   - Description: Add any relevant notes about the deployment

5. Review the configuration and click Create to deploy the model.

6. Monitor the deployment status on the Models page. The status will change from Pending → Running → Ready.

7. Once the status shows Ready, click the Test button to verify the model API is working.
Test with Python

Once your LLM model is deployed and ready, you can test it using Python with OpenAI API integration.
Connection Information

Before running the test scripts, you’ll need to find the following information on the <url>/scope/<scope-name>/models/llm page:
- Service URL: the URL from the Service card
- Name: the name field in the model table
- Model Name: the Model Name field in the model table
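If you prefer not to hard-code these values, one option is to read them from environment variables before running the script. This is a minimal sketch using only the Python standard library; the variable names (OTTERSCALE_SERVICE_URL, OTTERSCALE_NAME, OTTERSCALE_MODEL_NAME) are illustrative and not defined by OtterScale.

```python
import os

# Illustrative environment variable names (assumption; not defined by OtterScale).
# Export them in your shell before running the test script below.
SERVICE_URL = os.environ["OTTERSCALE_SERVICE_URL"]  # Service URL from the Service card
NAME = os.environ["OTTERSCALE_NAME"]                # the name field, e.g., llm-demo
MODEL_NAME = os.environ["OTTERSCALE_MODEL_NAME"]    # the Model Name field
```

With these three values in hand, the script below sends a single test question to the deployed model: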
```python
import requests
import json

# Configuration
SERVICE_URL = "<your_service_url>"  # e.g., http://localhost:8000
NAME = "<your_name>"                # e.g., llm-demo
MODEL_NAME = "<your_model_name>"


def ask_question(question):
    """Send a simple question to the LLM and get a response."""
    headers = {
        "OtterScale-Model-Name": NAME,
        "Content-Type": "application/json",
    }
    payload = {
        "model": MODEL_NAME,
        "prompt": question,
    }
    try:
        response = requests.post(
            f"{SERVICE_URL}/v1/chat",
            headers=headers,
            json=payload,
        )
        response.raise_for_status()
        result = response.json()
        return result.get("response", result)
    except Exception as e:
        return f"✗ Error: {str(e)}"


# Test
question = "Are you alive? Please respond if you can process this message."
answer = ask_question(question)
print(f"Q: {question}")
print(f"A: {answer}")
```
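Since the guide describes testing with OpenAI API integration, you may also be able to reach the model through the official openai Python client. The sketch below rests on assumptions: it presumes the service exposes an OpenAI-compatible /v1/chat/completions endpoint under your Service URL and honors the same OtterScale-Model-Name header as the requests example above; verify both against your deployment.

```python
from openai import OpenAI

SERVICE_URL = "<your_service_url>"  # e.g., http://localhost:8000
NAME = "<your_name>"                # e.g., llm-demo
MODEL_NAME = "<your_model_name>"

# Assumes an OpenAI-compatible endpoint at <SERVICE_URL>/v1 (not confirmed by this guide).
client = OpenAI(
    base_url=f"{SERVICE_URL}/v1",
    api_key="unused",  # placeholder; self-hosted endpoints often ignore the key
    default_headers={"OtterScale-Model-Name": NAME},  # same routing header as above
)

completion = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": "Are you alive? Please respond if you can process this message."}],
)
print(completion.choices[0].message.content)
```

If this request fails, fall back to the requests-based script above, which targets the /v1/chat endpoint directly.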