The research paper [1] introduces a new series of models called LLaMA to enhance efficiency and performance in foundational language models. The architecture is built upon the transformer model and includes tweaks such as pre-normalization with RMSNorm for more consistent training and rotary positional embeddings in place of absolute positional embeddings. LLaMA-2, a later version, introduces grouped-query attention to enhance the scalability of larger models during inference.
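To make the pre-normalization tweak concrete, the following is a minimal sketch of an RMSNorm layer in PyTorch; it mirrors the published formulation but is not taken from the official LLaMA source.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMS normalization sketch (pre-normalization as used in LLaMA).

    Unlike LayerNorm, it rescales by the root-mean-square of the activations
    only: no mean subtraction and no bias term.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize each token vector by its root-mean-square value.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```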
Transformers are a type of Artificial Neural Network (ANN) architecture used in Natural Language Processing (NLP) [2]. Unlike RNNs, Transformers operate on all tokens of the input in parallel rather than processing them sequentially, which allows for faster training.
The original Transformer employs an encoder-decoder architecture, each part composed of stacked layers that process the input text and generate the output [2]. Every layer contains sub-layers, including an attention mechanism for contextual processing and a feed-forward network that is applied to all positions in parallel.
The attention mechanism emphasizes the important words in a sentence so that context is understood better. It allows the model to selectively focus on the parts of the text that are relevant to the current task [2], and it processes all parts of the input sentence simultaneously rather than sequentially. The mechanism computes a set of Query (Q), Key (K), and Value (V) vectors through linear transformations of the input embeddings. The similarity between each query and all keys is calculated as a scaled dot product and then normalized with a softmax into attention weights, which are used to form a weighted sum of the values.
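The sketch below illustrates this computation for a single attention head in plain NumPy; the array shapes and toy inputs are our own illustrative choices, not code from [1] or [2].

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention sketch (not the LLaMA source code).

    Q, K, V: arrays of shape (seq_len, d_k) obtained by linearly
    projecting the input embeddings.
    """
    d_k = Q.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted sum of the value vectors.
    return weights @ V

# Toy usage: 4 tokens with 8-dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```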
Meta AI introduces the LLaMA collection of models ranging from 7B to 65B parameters trained on publicly available datasets. The LLaMA-13B model outperforms GPT-3 (175B) on most benchmarks. LLaMA-65B is competitive with models like Chinchilla-70B and PaLM-540B.
The LLaMA model employs large transformers powered by self-attention mechanisms, trained on a massive corpus of data using standard optimizers. A key implementation detail is how the attention layers are parallelized across GPUs.
The Attention class in the official repository demonstrates this parallelization. The number of heads (for queries, keys, and values) is divided by model_parallel_size, the number of GPUs, so each GPU is responsible only for its local share of heads. Custom parallel linear layers distribute the corresponding weight matrices across the GPUs for the query, key, value, and output transformations.
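The simplified sketch below captures only the head-splitting arithmetic; in the official repository, custom column- and row-parallel linear layers shard the full weight matrices so that each GPU stores just its slice, whereas this single-device version merely mimics the per-rank shapes.

```python
import torch
import torch.nn as nn

class SimplifiedParallelAttention(nn.Module):
    """Illustrative sketch of head partitioning for model parallelism.

    This is NOT the official llama code; it only mirrors the idea that each
    GPU owns n_heads // model_parallel_size of the attention heads, with
    correspondingly sliced query/key/value projection matrices.
    """

    def __init__(self, dim: int, n_heads: int, model_parallel_size: int):
        super().__init__()
        assert n_heads % model_parallel_size == 0
        self.n_local_heads = n_heads // model_parallel_size  # heads on this GPU
        self.head_dim = dim // n_heads
        local_dim = self.n_local_heads * self.head_dim
        # In the real repository these are parallel linear layers that shard
        # the full weight matrix across GPUs; here each rank simply holds its
        # own slice.
        self.wq = nn.Linear(dim, local_dim, bias=False)
        self.wk = nn.Linear(dim, local_dim, bias=False)
        self.wv = nn.Linear(dim, local_dim, bias=False)
        self.wo = nn.Linear(local_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seqlen, _ = x.shape
        q = self.wq(x).view(bsz, seqlen, self.n_local_heads, self.head_dim)
        k = self.wk(x).view(bsz, seqlen, self.n_local_heads, self.head_dim)
        v = self.wv(x).view(bsz, seqlen, self.n_local_heads, self.head_dim)
        scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / self.head_dim ** 0.5
        weights = scores.softmax(dim=-1)
        out = torch.einsum("bhqk,bkhd->bqhd", weights, v)
        # In the multi-GPU setting, the output projection is row-parallel and
        # its partial results are summed (all-reduced) across ranks.
        return self.wo(out.reshape(bsz, seqlen, -1))
```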
We were able to run the code from the official llama repository. However, we faced challenges running the model natively on Apple Silicon chips (which have no NVIDIA GPUs), so we used the llama.cpp repository to run the model on a Mac.
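For completeness, the snippet below shows one way to load a llama.cpp-converted checkpoint from Python via the llama-cpp-python bindings; this is an illustration with an assumed model path, not the exact commands we ran against the llama.cpp tools.

```python
# Illustrative only: uses the llama-cpp-python bindings rather than the
# llama.cpp command-line tools we actually invoked; the model path is a
# placeholder for a locally converted and quantized checkpoint.
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.gguf")  # hypothetical path
output = llm(
    "Summarize the LLaMA paper in one sentence:",
    max_tokens=64,      # cap the generated continuation
    temperature=0.7,    # sampling temperature
)
print(output["choices"][0]["text"])
```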
The LLaMA model addresses the problem of training large language models to achieve optimal performance at various inference budgets by scaling datasets and model sizes appropriately. Challenges include high computational power requirements, environmental impact, diminishing performance improvements with size, and potential biases from the training dataset.
To tackle the computational intensity, we propose utilizing Microsoft's Low-Rank Adaptation (LoRA) technique, which freezes the pretrained weights and learns small low-rank update matrices for selected weight matrices, greatly reducing the number of trainable parameters. We suggest using QLoRA, which additionally quantizes the frozen base model, for even more memory-efficient fine-tuning.
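A minimal configuration sketch using the Hugging Face peft library is shown below; the checkpoint name and hyperparameters are illustrative assumptions rather than the exact values used in our fine-tuning runs (QLoRA would additionally load the frozen base model in 4-bit precision).

```python
# Hedged sketch: attaches LoRA adapters to a causal LM with Hugging Face peft.
# The model name and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```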
| LLaMA 2 | Fine-tuning w/ LoRA |
|---|---|
| Advantages | Advantages |
| Comprehensive Knowledge | Requires Minimal Resources |
| State-of-the-Art Model | Allows Easy Model Switching |
| Computationally Efficient | |
| Disadvantages | Disadvantages |
| Resource Intensive | Risk of Overfitting |
| Lack of Deep Domain Knowledge | Catastrophic Forgetting |
We used ROUGE scores to evaluate summarization and machine translation quality. The fine-tuned model showed greater overlap between its generated outputs and the reference texts, indicating improved performance.
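As an illustration of the metric (with toy strings, not our actual evaluation data), ROUGE can be computed with the rouge_score package:

```python
# Illustrative ROUGE computation; the reference/prediction strings are toy
# examples, not outputs from our fine-tuned model.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The model summarizes the article accurately."
prediction = "The model accurately summarizes the article."
scores = scorer.score(reference, prediction)
for name, score in scores.items():
    # F1 of the n-gram overlap between prediction and reference.
    print(name, round(score.fmeasure, 3))
```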
The Transformer architecture leverages the attention mechanism and parallelization for efficiency, and LoRA fine-tuning allows the LLaMA model to be trained quickly on a specific dataset. However, the model's long inference time can slow down evaluation, posing a potential bottleneck.