Intro:
I recently had the privilege of serving as a panelist, discussing how design decisions are made for systems that use machine learning/AI to improve efficiency. While preparing, I came across an insightful paper that brought valuable clarity to the topic. I highly recommend giving it a read.
Double-Edged Sword - Paradox:
The computational demands of large language models are astonishingly high, and some facts about their power consumption are truly staggering. Here are a few that offer some perspective (a quick arithmetic check follows the list):
1. Training the GPT-3 model, with its 175B parameters, was an energy-intensive process estimated to have produced an additional 503 tonnes of carbon emissions, a stark reminder of the environmental considerations we must balance in the pursuit of AI advancements. To put that into perspective, it is roughly the footprint of 500 flights between New York and London.
2. Electricity consumed to train the model: roughly 1,280 MWh, enough to power 320 average detached homes in the UK for an entire year.
3. Inference with the deployed model: GPT-3's daily inference operations are estimated to consume 564 MWh of electricity, enough to supply 140 average UK homes for an entire year. This underscores the significant power requirements of keeping such an advanced AI system in active use.
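To sanity-check these figures, here is a quick back-of-the-envelope calculation in Python. The roughly 4 MWh-per-home-per-year rate is not an official statistic; I derived it from the 1,280 MWh / 320 homes ratio above, so treat it as an assumption.

```python
# Back-of-the-envelope check of the figures above. The per-home annual
# consumption is an assumption derived from the 1,280 MWh / 320 homes ratio.

TRAINING_ENERGY_MWH = 1280       # reported GPT-3 training energy
DAILY_INFERENCE_MWH = 564        # reported daily GPT-3 inference energy
HOME_ANNUAL_MWH = TRAINING_ENERGY_MWH / 320  # ~4 MWh per UK home per year (assumed)

homes_from_training = TRAINING_ENERGY_MWH / HOME_ANNUAL_MWH
homes_from_one_day_of_inference = DAILY_INFERENCE_MWH / HOME_ANNUAL_MWH
days_until_inference_exceeds_training = TRAINING_ENERGY_MWH / DAILY_INFERENCE_MWH

print(f"Training could power {homes_from_training:.0f} homes for a year")               # 320
print(f"One day of inference could power {homes_from_one_day_of_inference:.0f} homes")  # ~141
print(f"Inference overtakes training after ~{days_until_inference_exceeds_training:.1f} days")  # ~2.3
```

Notably, at these rates the cumulative inference energy overtakes the entire training cost in under three days, which is the paradox in a nutshell.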
Motivation for the Research Paper:
The energy consumption of leading cloud computing providers (Meta, Amazon, Microsoft, and Google) has surged in recent years. Inference operations, which are integral to AI services like Google Translate, occur far more often than training, potentially billions of times a day. When we shift our focus from training to inference, a different picture emerges, especially for general-purpose models. Training a single model for many tasks may seem more energy-efficient at first, but that advantage can shrink, or even turn into a deficit, over the model's operational life. This is due to the sheer volume of inference performed once these models are embedded in consumer services such as chat platforms and web search engines.
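To make that trade-off concrete, here is a minimal sketch of the lifetime-energy comparison. Every number in it is a made-up placeholder rather than a figure from the paper; the point is only the shape of the calculation, a one-off training cost amortized against a per-inference cost multiplied by deployment volume.

```python
# Hypothetical lifetime-energy comparison. All numbers are placeholders,
# not measurements from the paper.

def lifetime_energy_kwh(training_kwh: float,
                        energy_per_inference_kwh: float,
                        num_inferences: float) -> float:
    """Total energy = one-off training cost + per-inference cost over deployment."""
    return training_kwh + energy_per_inference_kwh * num_inferences

# One large multi-purpose model serving three tasks...
multi_purpose = lifetime_energy_kwh(
    training_kwh=1_000_000, energy_per_inference_kwh=0.004, num_inferences=3e9)

# ...versus three small task-specific models, each serving one task.
task_specific = sum(
    lifetime_energy_kwh(
        training_kwh=50_000, energy_per_inference_kwh=0.0002, num_inferences=1e9)
    for _ in range(3))

print(f"multi-purpose: {multi_purpose:,.0f} kWh")   # 13,000,000 kWh
print(f"task-specific: {task_specific:,.0f} kWh")   #    750,000 kWh
```

With these placeholder numbers, the higher per-inference cost of the multi-purpose model dwarfs whatever was saved by training one model instead of three, which is exactly the effect the paper quantifies.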
Methodology:
The authors' study covered 88 models spanning 10 tasks and 30 datasets across natural language processing and computer vision. It examined how factors such as the end task, modality, model size, architecture, and learning approach (task-specific versus multi-task/multi-purpose) affect energy efficiency. The authors uncovered vast disparities in the energy consumed per inference across models, modalities, and tasks. The study also highlighted a crucial trade-off between the advantages of multi-purpose systems and their energy expenditure, along with the related carbon emissions.
The authors selected the eight most downloaded models from the Hugging Face Hub for each of the tasks mentioned; the full list of model identifiers is presented in Table 6 of the paper's supplementary materials. For each model, they ran 1,000 inferences on each of the three datasets the model was trained on, as listed in Table 1, using the Transformers library. To ensure statistically meaningful measurements, each set of inferences was repeated ten times. The inferences were carried out sequentially, without batching, to accurately reflect the variability encountered in real-world model deployment, where batching inputs is often not feasible.
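The authors measure energy and emissions with the CodeCarbon package. A minimal sketch of a comparable sequential, unbatched measurement loop might look like the following; the model identifier and the input texts are stand-ins I chose for illustration, not the paper's exact setup.

```python
# Sketch of a sequential, unbatched inference loop with energy tracking.
# The model id and texts are illustrative stand-ins.
from codecarbon import EmissionsTracker
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

texts = ["example input"] * 1000  # stand-in for 1,000 dataset samples

for run in range(10):  # each set of inferences is repeated ten times
    tracker = EmissionsTracker(project_name=f"text-classification-run-{run}")
    tracker.start()
    for text in texts:             # one input at a time, no batching
        classifier(text)
    emissions_kg = tracker.stop()  # kg CO2eq recorded for this run
    print(f"run {run}: {emissions_kg:.6f} kg CO2eq")
```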
To compare the energy consumption and emissions of multi-purpose generative models against individual task-specific systems, the authors ran tests on a subset of tasks: question answering, text classification, and summarization. These tasks were chosen because the authors could identify a set of models capable of performing all of them within a unified architectural framework, something not possible for every task, especially those involving multiple modalities. The eight multi-purpose models were evaluated in a zero-shot setting that was kept consistent across models. For example, the prompt "Summarize the following text: [text]. Summary:" was applied to the same 1,000 samples used for the fine-tuned models. Each experiment was repeated ten times to ensure the statistical significance of the results.
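Here is a sketch of what that zero-shot setup looks like with the Transformers pipeline API. The prompt template is the one quoted above; the model identifier google/flan-t5-base is my stand-in, since the paper's exact model list lives in its supplementary materials.

```python
# Zero-shot summarization with a multi-purpose generative model.
# "google/flan-t5-base" is an illustrative stand-in.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def zero_shot_summary(text: str) -> str:
    # The prompt template quoted in the paper, applied unchanged to every model.
    prompt = f"Summarize the following text: {text}. Summary:"
    return generator(prompt, max_new_tokens=64)[0]["generated_text"]

print(zero_shot_summary(
    "Large language models consume substantial energy both during "
    "training and during inference."))
```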
I hope these insights have captured your curiosity. I strongly encourage you to delve into this compelling research and consider the environmental implications when making design decisions.
References:
Estimating the Carbon Footprint of BLOOM - Paper Link
Power Hungry Processing: Watts Driving the Cost of AI Deployment? - Paper Link
Energy consumption in the UK - Energy Guide UK
Utilisation of Google Translate for inference - Google Translate