Key Points

1. The research presents a new linear-complexity multiplication (L-Mul) algorithm that approximates floating point multiplication with integer addition operations. The L-Mul algorithm is designed to significantly reduce the energy consumption and computation resources required by tensor processing hardware in large language models.

2. The proposed method demonstrates higher precision and significantly lower bit-level computation than 8-bit floating point multiplication. Applying L-Mul in tensor processing hardware can potentially reduce the energy cost of elementwise floating point tensor multiplication by 95% and the energy cost of dot products by 80%.

3. The research derives theoretical error expectations for L-Mul and evaluates the algorithm on textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Numerical analysis experiments agree with the theoretical error estimation, indicating that L-Mul with a 4-bit mantissa achieves precision comparable to float8 e4m3 multiplication, while L-Mul with a 3-bit mantissa outperforms float8 e5m2.

4. Modern artificial intelligence (AI) systems, especially large language models, are significant energy consumers due to the large-scale computation needed for neural network inference. The proposed L-Mul method aims to reduce both the energy consumption and the inference latency of large-scale AI models.

5. Because multiplication between floating point numbers consumes significantly more energy than addition on modern computing hardware, the proposed linear-complexity multiplication (L-Mul) algorithm is expected to yield significantly reduced energy consumption for both model training and inference.

6. The research evaluates the numerical precision of the L-Mul algorithm on transformer-based language models across a wide range of language and vision tasks. The experiments show that replacing all floating point multiplications in a transformer model with 3-bit-mantissa L-Mul achieves precision equivalent to using float8 e4m3 as the accumulation precision in both fine-tuning and inference.

7. The paper highlights the potential of the proposed L-Mul algorithm to replace tensor multiplications in the attention mechanism without any loss of performance. It further shows that fine-tuning a model in which all multiplication operations in attention mechanisms, linear transformations, and element-wise products are replaced by 3-bit-mantissa L-Mul yields performance comparable to fine-tuning a standard model with float8 e4m3 accumulation precision.

8. The research presents error and complexity analyses demonstrating that L-Mul is both more efficient and more accurate than 8-bit floating point multiplication, and its experiments show that L-Mul-based attention improves accuracy while reducing energy consumption.

9. Finally, the research outlines future directions, including implementing the L-Mul and L-Matmul kernel algorithms at the hardware level and developing programming APIs for high-level model design, to deliver fast and energy-efficient AI hosting solutions. It also proposes training textual, symbolic, and multi-modal generative AI models optimized for deployment on L-Mul-native hardware to reduce the energy cost of data centers and edge-computing devices. Overall, the research offers a detailed analysis of the L-Mul algorithm's potential to significantly cut energy consumption and computation requirements, with implications for the development of energy-efficient AI models and hardware.

Summary

L-Mul Algorithm
The paper introduces the L-Mul algorithm, which approximates floating point multiplication with integer addition operations. This has the potential to significantly reduce the energy costs and computational resources required for tensor processing in neural networks, especially large language models. The key insight is that a floating point multiplication can be closely approximated by much simpler integer additions. Standard floating point multiplication requires exponent addition, mantissa multiplication, and rounding, and the mantissa multiplication consumes substantially more energy than integer addition. The L-Mul algorithm removes the mantissa multiplication step and replaces it with a small number of integer additions, resulting in a computational complexity of O(n), where n is the bit size of the operands.
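As a minimal sketch of the idea, the Python below approximates a product by decomposing each operand into mantissa and exponent, adding the exponents, adding the mantissas, and substituting a constant offset 2^(-l(m)) for the dropped mantissa product. The function name l_mul, the mantissa truncation, and the zero/sign handling are illustrative additions; the piecewise offset rule l(m) reflects our reading of the paper, and a real kernel would operate directly on integer bit patterns rather than on Python floats.

```python
import math

def l_mul(x: float, y: float, mantissa_bits: int = 3) -> float:
    """Approximate x * y with the L-Mul rule: the mantissa product
    x_m * y_m is replaced by a constant offset 2**(-l(m)), so only
    additions remain. Sketch only: ignores subnormals and overflow."""
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    # Decompose |x| = (1 + xm) * 2**xe with xm in [0, 1).
    xm, xe = math.frexp(abs(x))          # frexp returns xm in [0.5, 1)
    ym, ye = math.frexp(abs(y))
    xm, xe = 2.0 * xm - 1.0, xe - 1      # renormalize to the 1.m convention
    ym, ye = 2.0 * ym - 1.0, ye - 1
    # Truncate mantissas to the simulated low-bit width.
    scale = float(1 << mantissa_bits)
    xm = math.floor(xm * scale) / scale
    ym = math.floor(ym * scale) / scale
    # Offset exponent l(m): m if m <= 3, 3 if m == 4, else 4.
    l = mantissa_bits if mantissa_bits <= 3 else (3 if mantissa_bits == 4 else 4)
    # No mantissa multiplication: exponents add, mantissas add.
    return sign * (1.0 + xm + ym + 2.0 ** (-l)) * 2.0 ** (xe + ye)
```

For instance, l_mul(1.5, 1.5) returns 2.125 against an exact product of 2.25; the gap is the dropped x_m * y_m term (0.25) minus the 2^(-3) correction (0.125).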

Precision and Error Analysis
The paper provides a detailed analysis of the precision and error properties of the L-Mul algorithm. It shows that L-Mul with a 4-bit mantissa achieves precision comparable to 8-bit floating point (float8) e4m3 multiplication, while L-Mul with a 3-bit mantissa outperforms float8 e5m2. In other words, L-Mul can match or exceed the precision of float8 multiplication while consuming significantly fewer computational resources.
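One informal way to probe these precision claims is to estimate the relative error of the approximation numerically. The snippet below (reusing l_mul from the sketch above) is a Monte-Carlo illustration under an assumed uniform operand distribution; the paper's analysis is analytical and considers operand distributions observed in real models, so the exact figures will differ.

```python
import random

def mean_relative_error(mantissa_bits: int, trials: int = 100_000) -> float:
    """Monte-Carlo estimate of E[|l_mul(x, y) - x*y| / (x*y)] for positive
    operands. Depends on l_mul defined in the previous sketch."""
    rng = random.Random(0)
    total = 0.0
    for _ in range(trials):
        x, y = rng.uniform(0.1, 4.0), rng.uniform(0.1, 4.0)
        total += abs(l_mul(x, y, mantissa_bits) - x * y) / (x * y)
    return total / trials

for m in (2, 3, 4, 5):
    print(f"mantissa bits = {m}: mean relative error ~ {mean_relative_error(m):.4%}")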

Empirical Evaluation
The paper evaluates the L-Mul algorithm empirically on a wide range of tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. The results show that directly applying L-Mul to the attention mechanism in transformer models is almost lossless in performance. Furthermore, replacing all floating point multiplications in a transformer model with 3-bit-mantissa L-Mul achieves precision equivalent to using float8 e4m3 for accumulation. The potential benefits of L-Mul are substantial given the immense computational costs of modern AI systems: the paper cites an estimate that the average daily electricity consumption of the ChatGPT service in early 2023 matched the total daily electricity usage of 18,000 households in the United States. Replacing floating point multiplications with the L-Mul algorithm has the potential to reduce the energy consumption of tensor computations by up to 95%.
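To make the drop-in nature of the attention result concrete, here is a hypothetical sketch of a dot product (the core of attention scores) in which every elementwise multiplication is routed through l_mul from the first sketch, while the accumulation remains an ordinary floating point sum. The helper l_mul_dot and the toy values are illustrative assumptions, not the paper's kernel.

```python
import math

def l_mul_dot(a: list[float], b: list[float], mantissa_bits: int = 3) -> float:
    """Dot product with each multiply replaced by L-Mul; the sum is exact.
    Depends on l_mul defined in the first sketch."""
    return sum(l_mul(x, y, mantissa_bits) for x, y in zip(a, b))

# Toy attention score q.k / sqrt(d) -- illustrative values, not from the paper.
q = [0.3, -1.2, 0.8, 2.0]
k = [1.1, 0.4, -0.9, 0.5]
d = len(q)
approx = l_mul_dot(q, k) / math.sqrt(d)
exact = sum(x * y for x, y in zip(q, k)) / math.sqrt(d)
print(f"L-Mul score: {approx:.4f}  exact: {exact:.4f}")
```

Keeping the summation in standard floating point matches the framing above: L-Mul targets the multiplications, which dominate the energy cost, not the additions.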

In conclusion, the linear-complexity L-Mul algorithm presents an efficient and accurate alternative to floating point multiplication, with the ability to significantly reduce the energy and computational costs of neural network inference, especially for large language models. The authors argue that truly energy- and compute-efficient AI will require a holistic integration of optimizations across I/O, control, and arithmetic operations, with the L-Mul algorithm being a key component of this effort.

Reference: https://arxiv.org/abs/2410.00907