This article is translated from my Japanese tech blog.
https://tmyoda.hatenablog.com/entry/20210628/1624883322
About the Coleridge Competition
https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data
This is a competition to predict the dataset names mentioned in academic papers. The only data provided is the full text of the papers and the ground-truth (GT) labels.
How I Split the Dataset for Validation
In this competition, there are about 130 dataset names (targets) in the training set, but the test set includes dataset names that never appear during training.
Therefore, the validation split must not share any dataset names with the training split. I implemented a BFS over co-occurring labels and divided the documents into an 8:2 ratio with no overlap in dataset names.
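The post does not show the split code, so here is a minimal sketch of one way to do it: treat documents and labels as a bipartite graph, find connected components with BFS, and assign whole components to either split so no dataset name crosses the boundary. The function name and the greedy component assignment are my assumptions, not the author's exact implementation.

```python
from collections import defaultdict, deque

def split_without_label_overlap(doc_labels, val_ratio=0.2):
    """Split documents so that no dataset name appears in both splits.

    doc_labels: list of sets, one set of dataset names per document.
    Returns (train_idx, val_idx) lists of document indices.
    """
    # Map each label to the documents that contain it.
    label_to_docs = defaultdict(list)
    for i, labels in enumerate(doc_labels):
        for lab in labels:
            label_to_docs[lab].append(i)

    # BFS over the document-label graph to find connected components.
    seen = set()
    groups = []
    for start in range(len(doc_labels)):
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        group = []
        while queue:
            doc = queue.popleft()
            group.append(doc)
            for lab in doc_labels[doc]:
                for other in label_to_docs[lab]:
                    if other not in seen:
                        seen.add(other)
                        queue.append(other)
        groups.append(group)

    # Greedily fill the validation split with whole components.
    target = val_ratio * sum(len(g) for g in groups)
    groups.sort(key=len, reverse=True)
    train_idx, val_idx = [], []
    for group in groups:
        if len(val_idx) + len(group) <= target:
            val_idx.extend(group)
        else:
            train_idx.extend(group)
    return train_idx, val_idx
```

Because components are assigned as a unit, the realised ratio only approximates 8:2 when a few components are large.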
Pipeline
Classifier
This classifier simply predicts whether a document mentions any dataset name at all. It worked better than I expected, and most of our team's top submissions included it.
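The post does not show how the binary targets were built. A minimal sketch of one plausible labeling step: mark a paper positive if its cleaned text contains any known dataset name. The function names are my own, and the cleaning regex is modeled on the competition's published text normalisation; the actual classifier was a BERT-style model, which is omitted here.

```python
import re

def clean_text(txt):
    # Lowercase and collapse runs of non-alphanumerics to single spaces,
    # in the style of the competition's cleaning function.
    return re.sub("[^A-Za-z0-9]+", " ", str(txt).lower()).strip()

def label_documents(texts, known_dataset_names):
    """Binary target: 1 if the paper mentions any known dataset name."""
    cleaned_names = [clean_text(n) for n in known_dataset_names]
    return [
        int(any(name in clean_text(t) for name in cleaned_names))
        for t in texts
    ]
```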
MLM
We mostly reused the kernel below.
https://www.kaggle.com/tungmphung/coleridge-predict-with-masked-dataset-modeling
Jaccard filter
This filter is also adapted from the same kernel.
```python
def jaccard_filter(org_labels, threshold=0.75):
    """Drop candidate labels that are near-duplicates (word-level
    Jaccard similarity >= threshold) of an already kept label."""
    assert isinstance(org_labels, list)
    filtered_labels = []
    for labels in org_labels:
        filtered = []
        # Process shorter labels first so the shortest variant is kept.
        for label in sorted(labels, key=len):
            label = clean_text(label)
            if len(filtered) == 0 or all(jaccard(label, got_label) < threshold
                                         for got_label in filtered):
                filtered.append(label)
        filtered_labels.append('|'.join(filtered))
    return filtered_labels
```
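The filter above depends on `clean_text` and `jaccard`, which come from the competition and the reused kernel and are not shown in the post. The following is my sketch of those helpers: the cleaning regex follows the competition's published normalisation, and `jaccard` is the standard word-level token-set similarity used in the public kernels.

```python
import re

def clean_text(txt):
    # Lowercase and replace runs of non-alphanumerics with a single
    # space, following the competition's published cleaning.
    return re.sub("[^A-Za-z0-9]+", " ", str(txt).lower()).strip()

def jaccard(str1, str2):
    # Word-level Jaccard similarity between two label strings.
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a & b
    return float(len(c)) / (len(a) + len(b) - len(c))
```

With these helpers, two near-identical candidates such as "national education longitudinal study" and "national education longitudinal study nels" score 0.8, so the second is filtered out at the default 0.75 threshold.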
What I tried
- Using DiceLoss and FocalLoss, which are suited to imbalanced data: the score decreased
- NER (Named Entity Recognition): it did not seem effective
- SciBERT: no change
- Adding more external dataset CSVs: extraneous strings were matched, decreasing the score
- Switching from BERT to Electra: the score decreased
- Changing CONNECTION_TOKEN: the number of matched documents increased, but the score decreased
- Beam search with k-fold: we could not run it within the time limit