As LLaMA keeps being updated, the model's capabilities keep improving, but I have also found it increasingly difficult to fine-tune. The main problem is that after fine-tuning, the model visibly loses its original skills, the phenomenon known as "Catastrophic Forgetting." This has become more troublesome since LLaMA-3: a single training iteration can already leave the model unable to stop at the end of a response. The LLaMA-3 technical report also describes this kind of issue as "tail repetition."
Many articles online discuss how to pull a model from Huggingface and continue training it without destroying its original capabilities. Most of them suggest the following approaches (the second one is sketched in code right after the list):
- Mixing historical data into the training set
- Constraining the objective so the model cannot drift too far from its original outputs
- Applying LoRA or other adaptation techniques
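To make the second item concrete, here is a minimal sketch, assuming the Hugging Face transformers API; the model name and the penalty weight `beta` are placeholders, not a recipe from any particular paper. The idea is to keep the fine-tuned model close to a frozen copy of the original by adding a KL-divergence term to the loss:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

policy = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
reference = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
reference.requires_grad_(False)  # frozen copy of the original weights
reference.eval()

def constrained_loss(input_ids, labels, beta=0.1):
    out = policy(input_ids=input_ids, labels=labels)         # standard SFT loss
    with torch.no_grad():
        ref_logits = reference(input_ids=input_ids).logits   # original behaviour
    kl = F.kl_div(
        F.log_softmax(out.logits, dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="batchmean",
    )
    # beta trades off learning the new data against staying near the old model
    return out.loss + beta * kl
```

In practice `beta` has to be tuned per task, which is exactly the kind of compromise-hunting criticized below.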
In practice, however, these methods either work poorly or cost too much. For example, how would you prepare the historical data behind LLaMA-3-Instruct? Such a dataset could easily be far larger than the new data you want to train on. Preparing historical data also conflicts with the rule of thumb of "training on each example only once": retraining data the model has already learned is equivalent to training on the same example repeatedly, which pushes the model toward ad hoc behaviour, where the knowledge is not properly absorbed but merely memorized. LoRA and other adaptation methods are of limited help as well; they only slow down both "learning the new knowledge" and "damaging the old knowledge," letting you search for a compromise between the two, and we have never obtained satisfying results this way. Look at what the experts do: Taiwan LLaMA, for instance, trains from the base model with its own instruction-tuning dataset, so it never faces this problem.
Solution
An effective solution is to find the training scheme with the smallest possible modification footprint. Normally, people format inputs and outputs with the chat template and feed everything to the model for training. Trained this way, the model quietly relearns a lot of knowledge it already has, or picks up concepts that were introduced unintentionally (such as a particular stylistic tendency). Many people know to use the same model to help generate the training data, but that alone is not enough. Ideally, only the tokens carrying the key knowledge should be trained, rather than letting every token participate in the loss calculation. In our experiments, this approach lets us skip advanced training methods such as DPO and get by with plain full-parameter SFT.
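A minimal sketch of this idea, assuming a Hugging Face-style data pipeline; the function name, the split of the answer into prefix/key/suffix, and the model name are my own illustrative choices. Every token outside the key-knowledge span gets the label -100, so it still flows through the forward pass but contributes nothing to the loss:

```python
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # positions with this label are skipped by the causal-LM loss
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def build_masked_example(prompt, answer_prefix, key_knowledge, answer_suffix):
    """Tokenize one chat-template example; keep loss only on `key_knowledge`."""
    ids = {
        name: tokenizer(text, add_special_tokens=False)["input_ids"]
        for name, text in [
            ("prompt", prompt),
            ("prefix", answer_prefix),
            ("key", key_knowledge),
            ("suffix", answer_suffix),
        ]
    }
    input_ids = ids["prompt"] + ids["prefix"] + ids["key"] + ids["suffix"]
    labels = (
        [IGNORE_INDEX] * (len(ids["prompt"]) + len(ids["prefix"]))
        + ids["key"]                          # only these tokens are trained
        + [IGNORE_INDEX] * len(ids["suffix"])
    )
    return {"input_ids": input_ids, "labels": labels}
```

The point is that the prompt and the rest of the answer remain visible to attention during the forward pass; they are only excluded from the gradient.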
Another interesting point: LoRA cannot protect the model's original capabilities, and you do not actually need to protect them. A large model has enough parameters to spread many kinds of high-level knowledge across different weights. I found that as long as the prompt contains a complete description of the knowledge (the reasoning process), and that description can be selected by the attention's QK computation, the knowledge ends up stored in the right place in the model. These attention patterns determine whether pieces of knowledge contaminate each other. If the preceding tokens carry no reasoning at all, it is as if there were no preceding tokens, which amounts to teaching the model to produce a token "out of thin air." Because "out of thin air" knowledge is triggered without any precondition, it visibly crowds out the model's existing abilities. Remember that a large language model has enormous room for storing knowledge; you can introduce a specific piece of knowledge with almost no perceptible loss of other abilities. With carefully prepared training data, the problem is solved, and the model keeps exactly the same results on thousands of benchmark questions.
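As a hypothetical illustration of the difference (the device and error details below are invented, not from any real dataset):

```python
# In the first sample the prompt carries the full reasoning chain, so attention
# has something to select and the answer is conditioned on it. In the second,
# the same answer appears "out of thin air" and competes with existing behaviour.
good_sample = {
    "prompt": (
        "Device X reports error E17. The manual says E17 means the fan is "
        "blocked, and a blocked fan requires cleaning the intake filter. "
        "What should the technician do?"
    ),
    "response": "Clean the intake filter to unblock the fan.",
}

bad_sample = {
    "prompt": "Device X reports error E17. What should the technician do?",
    "response": "Clean the intake filter to unblock the fan.",  # unconditioned leap
}
```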
When preparing the data, you cannot simply reuse the habits of traditional machine learning. The difference is that in continued training the attention ability is already in place; the model is not learning competing patterns from scratch. If you prepare many prompt tokens that will never be attended to correctly, they will not be learned as you expect. Conversely, unnecessary prompt text should be trimmed, because an LLM will inevitably exhibit some incorrect attention behaviour. Rather than risk the model learning spurious associations, it is better to cut the redundant words.