Overview
The AIModelHandler class manages AI model operations, including initialization, text generation, and summarization. It supports multiple models (Llama 3.1-Nemotron and BART) with automatic GPU detection and optimization.
Class: AIModelHandler
Attributes
- models: Dict[str, Any] - Dictionary storing loaded model instances
- tokenizers: Dict[str, Any] - Dictionary storing tokenizer instances
- model_configs: Dict[str, Dict] - Configuration for each supported model
- default_model: str - Default model key (from the DEFAULT_MODEL env var; defaults to 'llama')
- enable_learning: bool - Continuous learning flag (from the ENABLE_CONTINUOUS_LEARNING env var)
Methods
__init__()
Initialize the AI model handler with model configurations. Environment variables:
- DEFAULT_MODEL: Model key to use by default (default: 'llama')
- ENABLE_CONTINUOUS_LEARNING: Enable learning features (default: 'true')
ai_handler.py:27-47
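A minimal sketch of this constructor, assuming the environment-variable handling described above; the contents of model_configs shown here (the type/task values in particular) are illustrative assumptions, not the module's exact values:

```python
import os

class AIModelHandler:
    """Sketch of the constructor; attribute names follow the docs above,
    while the model_configs field values are illustrative assumptions."""

    def __init__(self):
        self.models = {}      # loaded model instances, keyed by model key
        self.tokenizers = {}  # tokenizer instances, keyed by model key
        # Configuration for each supported model (field values assumed)
        self.model_configs = {
            "llama": {"name": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
                      "type": "causal-lm", "task": "text-generation"},
            "bart": {"name": "facebook/bart-large",
                     "type": "seq2seq", "task": "summarization"},
        }
        # DEFAULT_MODEL defaults to 'llama'; ENABLE_CONTINUOUS_LEARNING to 'true'
        self.default_model = os.environ.get("DEFAULT_MODEL", "llama")
        self.enable_learning = (
            os.environ.get("ENABLE_CONTINUOUS_LEARNING", "true").lower() == "true"
        )
```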
initialize_models()
Initialize AI models asynchronously. Returns: True if initialization succeeded, False otherwise.
- GPU Detection
  - Checks CUDA availability
  - Logs GPU device name and available memory
- BART Model Initialization
  - Loads the facebook/bart-large tokenizer
  - Loads the BART model with automatic device mapping
  - Uses float16 precision on GPU, float32 on CPU
- Llama Model Initialization
  - Loads the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF tokenizer
  - Loads the Llama model with automatic device mapping
  - Uses float16 precision on GPU, float32 on CPU

Key loading parameters:
- device_map='auto': Automatic device placement when a GPU is available
- torch_dtype: float16 on GPU (memory efficient), float32 on CPU
- local_files_only=False: Downloads models from Hugging Face if not cached

Error handling:
- Returns False if any model fails to load
- Logs detailed error messages for debugging
- Continues initialization if one model fails (when possible)
ai_handler.py:49-107
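The precision and device-mapping rules above can be sketched as a small helper that builds the keyword arguments for loading a model; the function name is hypothetical, and dtypes are shown as strings to keep the sketch dependency-free (in practice you would pass torch.float16 / torch.float32):

```python
def model_load_kwargs(gpu_available: bool) -> dict:
    """Hypothetical helper: keyword arguments for loading a model,
    per the rules above (float16 + device_map='auto' on GPU, float32 on CPU)."""
    return {
        "torch_dtype": "float16" if gpu_available else "float32",
        "device_map": "auto" if gpu_available else None,
        # Download from Hugging Face if the model is not cached locally
        "local_files_only": False,
    }
```

These would typically be splatted into a `from_pretrained()` call, e.g. `AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large", **model_load_kwargs(torch.cuda.is_available()))`.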
generate_response()
Generate a response using the specified model.

Parameters:
- Input text to generate a response from
- model_key: Key of the model to use; if None, uses default_model ('llama')
- Maximum length of generated text, in tokens
- temperature: Sampling temperature. Higher values (e.g., 1.0) make output more creative/random; lower values (e.g., 0.2) make it more focused/deterministic
- top_p: Nucleus sampling parameter. Only tokens with cumulative probability up to top_p are considered
- max_attempts: Maximum number of attempts to generate a valid response

Returns: Generated response text, or an error message if generation fails.
- Model Selection
  - Uses the specified model_key or falls back to default_model
  - Validates that the model exists
- Input Processing
  - Tokenizes input text
  - Truncates to 512 tokens maximum
  - Adds padding for batch processing
  - Moves tensors to the model's device (CPU/GPU)
- Generation Loop
  - Attempts generation up to max_attempts times
  - Uses nucleus sampling with the specified temperature and top_p
  - Validates response quality (>3 words, different from input)
  - Breaks on the first valid response
- Response Validation
  - Checks response length (must be >3 words)
  - Ensures the response differs from the input
  - Returns a fallback message if all attempts fail

Key generation parameters:
- do_sample=True: Enables sampling for diverse outputs
- num_return_sequences=1: Generates one response
- Uses the model's pad_token_id and eos_token_id for proper sequence handling

Error handling:
- Raises ValueError if model_key is invalid
- Returns a user-friendly error message on exceptions
- Logs all errors for debugging
ai_handler.py:109-182
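The generation loop and validation steps above can be sketched model-free; here generate_once stands in for a call to model.generate with the sampling settings described, and the fallback message wording is an assumption:

```python
def is_valid_response(response: str, prompt: str) -> bool:
    """Quality check from the docs: more than 3 words and different from the input."""
    return len(response.split()) > 3 and response.strip() != prompt.strip()

def generate_with_retries(generate_once, prompt: str, max_attempts: int = 3) -> str:
    """Retry-loop sketch: generate_once is any callable(prompt) -> str,
    e.g. a wrapper around model.generate(do_sample=True, temperature=..., top_p=...).
    Returns the first valid response, or a fallback message (wording assumed)."""
    for _ in range(max_attempts):
        response = generate_once(prompt)
        if is_valid_response(response, prompt):
            return response  # break on the first valid response
    return "Sorry, I could not generate a valid response."
```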
summarize_text()
Summarize text using the BART model.

Parameters:
- Text to summarize
- Maximum length of the summary, in tokens
- Minimum length of the summary, in tokens

Returns: Summarized text, or an error message if summarization fails.
- Input Processing
  - Tokenizes input text using the BART tokenizer
  - Truncates to 1024 tokens maximum
  - Adds padding for batch processing
  - Moves tensors to the BART model's device
- Summary Generation
  - Uses beam search with 4 beams for better quality
  - Applies a length penalty of 2.0 to encourage conciseness
  - Stops early when all beams reach the EOS token
  - Decodes output tokens to text
- Post-processing
  - Removes special tokens from the output
  - Strips whitespace from the final summary

Key generation parameters:
- num_beams=4: Beam search width for quality
- length_penalty=2.0: Encourages shorter summaries
- early_stopping=True: Stops when EOS is reached

Error handling:
- Returns a user-friendly error message on exceptions
- Logs all errors for debugging
ai_handler.py:184-227
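The beam-search settings above map onto the keyword arguments of a Hugging Face generate() call; a minimal sketch that collects them (the helper name is illustrative):

```python
def summary_generation_kwargs(max_length: int, min_length: int) -> dict:
    """Generation settings for summarization, per the docs above."""
    return {
        "max_length": max_length,    # maximum summary length in tokens
        "min_length": min_length,    # minimum summary length in tokens
        "num_beams": 4,              # beam search width for quality
        "length_penalty": 2.0,       # encourages shorter summaries
        "early_stopping": True,      # stop when all beams reach EOS
    }
```

These would be passed to the BART model's generate call after tokenizing with truncation to 1024 tokens, e.g. `model.generate(**inputs, **summary_generation_kwargs(150, 40))` (the 150/40 lengths are example values).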
get_available_models()
Get the list of available model keys.

Returns: List of model keys: ['llama', 'bart']

- Returns the keys from the model_configs dictionary
- Useful for validating model selection
ai_handler.py:229-231
get_model_info()
Get information about a specific model.

Parameters:
- Key of the model to get information about ('llama' or 'bart')

Returns: Dictionary containing the model configuration (name, type, task), or None if model_key is invalid.
ai_handler.py:233-235
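Both accessors can be sketched over a stand-in for the handler's model_configs attribute; the type/task values shown here are assumptions based on the models' roles:

```python
from typing import Any, Dict, List, Optional

# Illustrative stand-in for the handler's model_configs attribute;
# the 'type'/'task' values are assumptions.
MODEL_CONFIGS: Dict[str, Dict[str, Any]] = {
    "llama": {"name": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
              "type": "causal-lm", "task": "text-generation"},
    "bart": {"name": "facebook/bart-large",
             "type": "seq2seq", "task": "summarization"},
}

def get_available_models() -> List[str]:
    """Returns the configured model keys: ['llama', 'bart']."""
    return list(MODEL_CONFIGS.keys())

def get_model_info(model_key: str) -> Optional[Dict[str, Any]]:
    """Returns the model's configuration dict, or None for an invalid key."""
    return MODEL_CONFIGS.get(model_key)
```

A typical use is validating a requested key before generation: if the key is not in get_available_models(), raise ValueError.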
Singleton Instance
The module exports a singleton instance.

Dependencies
Environment Variables
- DEFAULT_MODEL (optional): Model key to use by default (default: 'llama')
- ENABLE_CONTINUOUS_LEARNING (optional): Enable continuous learning features (default: 'true')
GPU Support
The handler automatically detects and uses a GPU when available:
- Uses CUDA if torch.cuda.is_available() returns True
- Automatically maps models to available devices
- Uses float16 precision on GPU for memory efficiency
- Falls back to CPU with float32 precision when GPU is unavailable
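The detection and fallback rules above can be sketched as follows; dtype names are shown as strings (the real code would use torch.float16 / torch.float32), and the function name is illustrative:

```python
def select_device_and_dtype():
    """Returns (device, dtype_name) per the fallback rules above."""
    try:
        import torch
        if torch.cuda.is_available():
            # The handler also logs the GPU device name and memory here, e.g. via
            # torch.cuda.get_device_name(0) and
            # torch.cuda.get_device_properties(0).total_memory
            return "cuda", "float16"
    except ImportError:
        pass  # no torch installed: treat as CPU-only for this sketch
    return "cpu", "float32"
```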
