Welcome back to my AI journey, where I stumbled, learned, and maybe cried a little!
1. Diving into the Code: The Good, The Bad, and The Ugly
This time, I got my hands dirty by coding the first version of my AI model. Spoiler alert: I achieved an accuracy of just 0.18945%! (Ouch! I guess even my toaster could do better.)
Let's dive into the code and see what went wrong.
# Initializing BERT for sequence classification
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)
What's happening here?
I'm using BERT, the superstar transformer model, to classify the danger level of legal contracts on a scale of 1 to 5.
def preprocess_data(dataframe, tokenizer):
    texts = dataframe['texte'].tolist()
    # Danger levels run 1-5; the model expects 0-based class indices
    labels = [label - 1 for label in dataframe['niveau_de_danger'].tolist()]
    encoded_data = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    return encoded_data, labels
Why preprocess data?
I've tokenized the contract texts for BERT to digest (like breaking down a complex contract into easier-to-understand clauses).
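The `label - 1` shift in `preprocess_data` is there because the classification head counts classes from 0, while the danger levels run from 1 to 5. Here is a minimal sketch of that mapping in both directions. The helper names `to_class_index` and `to_danger_level` and the range checks are my own assumptions for illustration, not part of the original code.

```python
NUM_LABELS = 5  # matches num_labels=5 passed to BertForSequenceClassification


def to_class_index(danger_level):
    """Map a danger level in 1..5 to the 0-based class index the model expects."""
    if not 1 <= danger_level <= NUM_LABELS:
        raise ValueError(f"danger level out of range: {danger_level}")
    return danger_level - 1


def to_danger_level(class_index):
    """Map a predicted 0-based class index back to a danger level in 1..5."""
    if not 0 <= class_index < NUM_LABELS:
        raise ValueError(f"class index out of range: {class_index}")
    return class_index + 1


if __name__ == "__main__":
    levels = [1, 3, 5]
    indices = [to_class_index(level) for level in levels]
    print(indices)                                # [0, 2, 4]
    print([to_danger_level(i) for i in indices])  # [1, 3, 5]
```

Failing loudly on an out-of-range level is a design choice: a stray 0 or 6 in the label column would otherwise surface much later as a confusing loss error.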
2. Training My Model: And... It Crashed and Burned
import torch

def train_model(model, train_loader, num_epochs=5):
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    for epoch in range(num_epochs):
        model.train()
        total_loss, correct, seen = 0.0, 0, 0
        for batch in train_loader:
            optimizer.zero_grad()
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            # Accumulate over the whole epoch instead of reporting only the last batch
            total_loss += loss.item() * labels.size(0)
            correct += (outputs.logits.argmax(dim=-1) == labels).sum().item()
            seen += labels.size(0)
        epoch_loss = total_loss / seen
        epoch_accuracy = correct / seen
        print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {epoch_loss:.4f}')
    print(f'Final Loss: {epoch_loss:.4f}, Accuracy: {epoch_accuracy:.4f}')
This function trains BERT to classify contracts, but let's just say it didn't pass the bar exam. The low accuracy told me that my model was basically guessing randomly.
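"Guessing randomly" gets more concrete when you compare a score against two cheap baselines: uniform guessing and always predicting the most frequent label. The sketch below computes both. The function names and the sample labels are made up for illustration and are not my real data.

```python
from collections import Counter


def chance_accuracy(num_classes):
    """Expected accuracy of uniform random guessing over num_classes labels."""
    return 1.0 / num_classes


def majority_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


if __name__ == "__main__":
    # Hypothetical danger levels, for illustration only
    labels = [1, 1, 1, 2, 3, 3, 4, 5, 5, 1]
    print(chance_accuracy(5))         # 0.2
    print(majority_accuracy(labels))  # 0.4
```

With five danger levels, uniform guessing lands at 0.20, so a model that has learned anything should clear that bar, and ideally the majority-class bar as well.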
3. Evaluating the Model: Reality Check
from sklearn.metrics import classification_report

def evaluate_model(model, test_loader):
    # `device` was local to train_model, so read it from the model itself
    device = next(model.parameters()).device
    model.eval()
    predictions = []
    true_labels = []
    with torch.no_grad():
        for batch in test_loader:
            input_ids, attention_mask, labels = [b.to(device) for b in batch]
            outputs = model(input_ids, attention_mask=attention_mask)
            predicted = outputs.logits.argmax(dim=-1)
            predictions.extend(predicted.cpu().tolist())
            true_labels.extend(labels.cpu().tolist())
    return classification_report(true_labels, predictions)
After running this, I got a brutal classification report that screamed, "You need more data, buddy!"
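To see what such a report is actually measuring, here is a hand-rolled sketch of per-class precision, recall, and support computed from the same two lists that `evaluate_model` collects. `per_class_metrics` is a hypothetical helper written for illustration. It is not how scikit-learn implements `classification_report`, and it simply returns 0.0 when a class is never predicted or never appears.

```python
def per_class_metrics(true_labels, predictions, num_classes=5):
    """Return {class_index: (precision, recall, support)} from two label lists."""
    metrics = {}
    for cls in range(num_classes):
        # Correct predictions of this class
        tp = sum(1 for t, p in zip(true_labels, predictions) if t == cls and p == cls)
        # How often the model predicted this class, and how often it really occurs
        predicted = sum(1 for p in predictions if p == cls)
        support = sum(1 for t in true_labels if t == cls)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        metrics[cls] = (precision, recall, support)
    return metrics


if __name__ == "__main__":
    # Tiny made-up example with three classes
    true_labels = [0, 0, 1, 1, 2]
    predictions = [0, 1, 1, 1, 0]
    for cls, (precision, recall, support) in per_class_metrics(true_labels, predictions, 3).items():
        print(cls, round(precision, 2), round(recall, 2), support)
```

A class with support but zero recall (class 2 in the toy example) is the kind of row that makes a report look brutal: the model never gets that level right.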
4. The Root Cause: My Dataset Needs a Lawyer-Grade Makeover
After some reflection, I realized the real issue was my dataset. It's like trying to learn law from a pamphlet instead of an encyclopedia.
I need to get my hands on a large, reliable, and indexed dataset that can better train the model. If anyone knows where to find high-quality legal datasets, I'm all ears!
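Before hunting for more data, a quick look at what the current dataset contains can show how thin it is. The sketch below summarizes row count, label balance, empty texts, and exact duplicates. `summarize_dataset` is a hypothetical helper, the sample rows are invented, and only the column names `texte` and `niveau_de_danger` come from my code above.

```python
from collections import Counter


def summarize_dataset(rows, text_key="texte", label_key="niveau_de_danger"):
    """Summarize size, label balance, empty texts, and exact duplicate texts."""
    texts = [str(row.get(text_key, "")).strip() for row in rows]
    label_counts = Counter(row[label_key] for row in rows)
    text_counts = Counter(texts)
    return {
        "num_rows": len(rows),
        "label_counts": dict(sorted(label_counts.items())),
        "empty_texts": sum(1 for t in texts if not t),
        # Count extra copies of each non-empty text
        "duplicate_texts": sum(c - 1 for t, c in text_counts.items() if t and c > 1),
    }


if __name__ == "__main__":
    # Invented rows, for illustration only
    rows = [
        {"texte": "Clause A", "niveau_de_danger": 1},
        {"texte": "Clause B", "niveau_de_danger": 5},
        {"texte": "Clause A", "niveau_de_danger": 1},
        {"texte": "  ", "niveau_de_danger": 3},
    ]
    print(summarize_dataset(rows))
```

With a pandas DataFrame, the rows can be produced with `dataframe.to_dict('records')`. Danger levels with only a handful of examples are the first place I would look for missing data.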
5. Annotating Contracts (A Work in Progress)
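def annotate_contract(model, tokenizer, contract_text):
    # Run inference on the same device as the model, with dropout disabled
    device = next(model.parameters()).device
    model.eval()
    inputs = tokenizer(contract_text, padding=True, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    predicted = outputs.logits.argmax(dim=-1)
    # Shift the 0-based class index back to the 1-5 danger scale
    danger_level = predicted.item() + 1
    # analyze_problematic_sections is a helper that is not shown in this post
    problematic_sections = analyze_problematic_sections(contract_text, danger_level)
    return {
        'danger_level': danger_level,
        'problematic_sections': problematic_sections
    }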
This function is supposed to analyze the legal contract and predict the danger level, but as you might guess, it's not ready to replace your lawyer just yet.
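One way to make the prediction less of a black box is to report how confident the model is, not just the winning level. The sketch below turns a list of five raw logits into a danger level plus its softmax probability. `softmax` and `danger_with_confidence` are hypothetical helpers and are not part of `annotate_contract`. With a real model, the input list could come from `outputs.logits[0].tolist()`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def danger_with_confidence(logits):
    """Return (danger level in 1..5, probability of that level) from raw logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Shift the 0-based class index back to the 1-5 danger scale
    return best + 1, probs[best]


if __name__ == "__main__":
    # Made-up logits: the second class has the highest score
    level, confidence = danger_with_confidence([0.1, 2.0, 0.3, -1.0, 0.5])
    print(level, round(confidence, 3))
    # All-equal logits mean the model has no preference at all
    print(danger_with_confidence([0.0, 0.0, 0.0, 0.0, 0.0]))  # (1, 0.2)
```

A confidence close to 0.2 across the board is another sign that the model is effectively guessing among the five levels.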
Next Steps: A Better Dataset and Model Tuning
I'm planning to go on a treasure hunt for a better dataset. Once I have more data, I'll revisit model training, tweak hyperparameters, and hopefully get a model that can actually understand legal jargon!
Until next time, may your accuracy be ever in your favor!
0x2e73