Getting Started · 12 min read · January 27, 2025

Getting Started with SmolLM3: Complete Setup and First Steps Guide

Learn how to install, configure, and run SmolLM3 for the first time. This comprehensive guide covers everything from installation to your first inference with Hugging Face's compact yet powerful 3B parameter language model.

SmolLM3 represents a significant advancement in compact language models, delivering impressive performance while maintaining efficiency suitable for edge deployment. This guide will walk you through the complete setup process, from initial installation to running your first inference tasks.

System Requirements

Before installing SmolLM3, ensure your system meets the minimum requirements. The model requires Python 3.8 or higher, with Python 3.9+ recommended for optimal performance. You'll need at least 6GB of RAM for basic inference, though 8GB or more is recommended for comfortable usage with longer contexts.

For GPU acceleration, SmolLM3 supports CUDA-enabled NVIDIA GPUs with at least 4GB of VRAM. While the model can run on CPU-only setups, GPU acceleration significantly improves inference speed, especially for longer sequences. The model also supports Apple Silicon Macs through MPS (Metal Performance Shaders) acceleration.

Minimum Requirements:

  • Python 3.8+ (3.9+ recommended)
  • 6GB RAM (8GB+ recommended)
  • 10GB free disk space
  • Optional: CUDA-compatible GPU with 4GB+ VRAM
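
If you want to confirm your environment before downloading anything, a short check like the one below reports your Python version and whether CUDA or MPS acceleration is available. This is a minimal sketch that assumes PyTorch is already installed:

import sys
import torch

# Report the interpreter version
print(f"Python: {sys.version.split()[0]}")

# Check for CUDA (NVIDIA) acceleration
if torch.cuda.is_available():
    print(f"CUDA GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# Check for MPS (Apple Silicon) acceleration
elif torch.backends.mps.is_available():
    print("MPS (Apple Silicon) acceleration is available")
else:
    print("No GPU acceleration detected; inference will run on CPU")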

Installation Methods

SmolLM3 can be installed through multiple methods, each suited for different use cases. The most straightforward approach uses the Hugging Face Transformers library, which handles model downloading and dependency management automatically.

Method 1: Using Transformers Library

Start by creating a virtual environment to isolate your SmolLM3 installation. This prevents conflicts with other projects and makes dependency management easier. Open your terminal and run the following commands:

# Create virtual environment
python -m venv smollm3-env

# Activate environment (Linux/Mac)
source smollm3-env/bin/activate

# Activate environment (Windows)
smollm3-env\Scripts\activate

# Install required packages
pip install torch transformers accelerate

Once the dependencies are installed, you can load and use SmolLM3 with just a few lines of Python code. The model will be automatically downloaded from Hugging Face Hub on first use:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

print("SmolLM3 loaded successfully!")

Method 2: Using Hugging Face CLI

For users who prefer command-line tools or need to download models for offline use, the Hugging Face CLI provides an alternative installation method. First, install the CLI tool:

# Install Hugging Face CLI
pip install "huggingface_hub[cli]"

# Download SmolLM3 model
huggingface-cli download HuggingFaceTB/SmolLM3-3B

# Verify download
huggingface-cli scan-cache
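
Once the files are cached locally, you can tell Transformers to load them without any network access. This is a minimal sketch using the local_files_only flag, which makes from_pretrained read exclusively from the local cache and fail rather than re-download:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load entirely from the local cache; raises an error instead of downloading
model_name = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(model_name, local_files_only=True)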

Your First Inference

Now that SmolLM3 is installed, let's run your first inference to verify everything is working correctly. We'll start with a simple text generation task that demonstrates the model's capabilities.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Initialize model and tokenizer
model_name = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Prepare input and move it to the model's device
prompt = "Explain the concept of machine learning in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode and print response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

This code demonstrates basic text generation with SmolLM3. The model should produce a coherent explanation of machine learning concepts, showcasing its knowledge base and reasoning capabilities despite its compact size.
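
If you are working with the instruction-tuned checkpoint, you can also format the prompt through the tokenizer's chat template instead of passing raw text. The sketch below uses the generic apply_chat_template API from Transformers; the exact formatting and any model-specific options depend on the checkpoint's template, so treat this as a starting point rather than the canonical usage:

# Format the prompt as a chat turn (for instruction-tuned checkpoints)
messages = [
    {"role": "user", "content": "Explain the concept of machine learning in simple terms."}
]
chat_input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    chat_outputs = model.generate(
        chat_input_ids,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(chat_outputs[0], skip_special_tokens=True))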

Configuration Options

SmolLM3 offers numerous configuration options to optimize performance for your specific use case. Understanding these parameters helps you balance quality, speed, and resource usage according to your requirements.

Model Loading Options

The model can be loaded with different precision levels and optimization settings. For deployments with limited resources, consider 8-bit or 4-bit quantization through the bitsandbytes integration (install it first with pip install bitsandbytes):

# Load with 8-bit quantization
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)

# For even more aggressive compression
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
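
To see what a given loading configuration actually costs, Transformers models expose a get_memory_footprint() helper. A quick check like this makes it easy to compare full-precision, 8-bit, and 4-bit loading (the exact number depends on which configuration you loaded above):

# Report the approximate memory used by the model's weights and buffers
footprint_gb = model.get_memory_footprint() / 1e9
print(f"Model memory footprint: {footprint_gb:.2f} GB")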

Generation Parameters

Adjusting the generation parameters lets you control the character of the model's output. Temperature controls randomness, while top-p and top-k limit how many candidate tokens are considered at each step:

# Conservative generation (more focused)
outputs = model.generate(
    inputs.input_ids,
    max_length=500,
    temperature=0.3,
    top_p=0.8,
    top_k=40,
    do_sample=True,
    repetition_penalty=1.1
)

# Creative generation (more diverse)
outputs = model.generate(
    inputs.input_ids,
    max_length=500,
    temperature=0.8,
    top_p=0.95,
    top_k=50,
    do_sample=True,
    repetition_penalty=1.0
)
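
If you reuse the same settings across many calls, it can be cleaner to bundle them into a GenerationConfig object from Transformers rather than repeating keyword arguments. A minimal sketch of the same "conservative" profile, with names chosen here for illustration:

from transformers import GenerationConfig

# A reusable, named generation profile
focused_config = GenerationConfig(
    max_new_tokens=500,
    temperature=0.3,
    top_p=0.8,
    top_k=40,
    do_sample=True,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id
)

outputs = model.generate(**inputs, generation_config=focused_config)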

Testing Multilingual Capabilities

One of SmolLM3's standout features is its native support for six languages. Let's test this capability with prompts in different languages to verify the multilingual functionality:

# Test multilingual capabilities
prompts = {
    "English": "Describe the benefits of renewable energy:",
    "French": "Décrivez les avantages des énergies renouvelables:",
    "Spanish": "Describe los beneficios de la energía renovable:",
    "German": "Beschreiben Sie die Vorteile erneuerbarer Energien:",
    "Italian": "Descrivi i vantaggi delle energie rinnovabili:",
    "Portuguese": "Descreva os benefícios da energia renovável:"
}

for language, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"{language}: {response}\n")

Memory and Performance Optimization

Optimizing memory usage and inference speed is crucial for deploying SmolLM3 in production environments. Here are several strategies to improve performance while maintaining output quality.

Gradient Checkpointing

Gradient checkpointing helps when you are fine-tuning and memory is tight: it discards intermediate activations and recomputes them during the backward pass, trading extra computation for a substantial reduction in memory. For pure inference, clearing the CUDA cache between runs is the more relevant step:

# Enable gradient checkpointing to save activation memory during fine-tuning
model.gradient_checkpointing_enable()

# Offloading layers to CPU when they don't fit in VRAM is handled by
# accelerate when the model is loaded with device_map="auto"

# Clear the CUDA cache between inferences
if torch.cuda.is_available():
    torch.cuda.empty_cache()
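
To verify that these optimizations are actually helping, you can watch GPU memory directly with PyTorch's built-in counters. A quick sketch (CUDA only):

# Inspect current and peak GPU memory usage
if torch.cuda.is_available():
    allocated_gb = torch.cuda.memory_allocated() / 1e9
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"Allocated: {allocated_gb:.2f} GB (peak: {peak_gb:.2f} GB)")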

Batch Processing

When processing multiple prompts, batch processing can significantly improve throughput. SmolLM3's efficient architecture makes it well-suited for batch inference:

# Batch processing example
prompts = [
    "What is artificial intelligence?",
    "Explain quantum computing.",
    "Describe climate change effects."
]

# Decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize all prompts and move them to the model's device
inputs = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
).to(model.device)

# Generate responses in batch
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# Decode all responses
responses = [tokenizer.decode(output, skip_special_tokens=True) 
            for output in outputs]

for i, response in enumerate(responses):
    print(f"Prompt {i+1}: {response}\n")

Troubleshooting Common Issues

While SmolLM3 is designed to be user-friendly, you may encounter some common issues during setup or usage. Here are solutions to the most frequently reported problems.

Memory Issues

If you encounter out-of-memory errors, try reducing the batch size, using quantization, or enabling CPU offloading. For systems with limited RAM, load the model with low_cpu_mem_usage=True, keep it in evaluation mode, and wrap inference in torch.no_grad() so no gradient state is stored:

# Memory-efficient loading
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)

# Set to evaluation mode
model.eval()

# Use no_grad for inference
with torch.no_grad():
    # Your inference code here
    pass

Slow Inference

If inference is slower than expected, verify that you're using GPU acceleration and that the model is properly loaded onto the GPU. Also, consider using compilation optimizations:

# Check device placement
print(f"Model device: {model.device}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Use torch.compile for optimization (PyTorch 2.0+)
if hasattr(torch, 'compile'):
    model = torch.compile(model)

# Warm up the model with a short generation on the model's device
dummy_input = tokenizer("Hello", return_tensors="pt").to(model.device)
with torch.no_grad():
    model.generate(**dummy_input, max_new_tokens=10)
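
To put a number on inference speed after the warm-up, time a generation call and divide by the number of tokens produced. This is a rough sketch; the torch.cuda.synchronize() calls make the timing accurate on GPU:

import time

prompt_inputs = tokenizer("Explain photosynthesis briefly:", return_tensors="pt").to(model.device)

if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()

with torch.no_grad():
    timed_outputs = model.generate(
        **prompt_inputs,
        max_new_tokens=100,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )

if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = timed_outputs.shape[1] - prompt_inputs.input_ids.shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/sec)")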

Next Steps

Congratulations! You now have SmolLM3 running and understand the basic configuration options. This foundation enables you to explore more advanced features like fine-tuning, deployment strategies, and specialized applications.

Consider experimenting with different generation parameters to understand how they affect output quality and style. Try testing the model's reasoning capabilities with complex prompts, and explore its multilingual features if you work with non-English content.