Detecting Label Errors and Boosting Model Performance with Cleanlab
In machine learning, the quality of your data can make or break your model. One often-overlooked aspect of data quality is label accuracy: incorrect labels can lead to poor model performance, misguided insights, and wasted resources.
Cleanlab is a powerful Python library designed to detect label errors in your dataset and improve your model's performance. In this post, we'll explore how to use Cleanlab to identify and correct label issues, potentially giving your ML projects a significant boost.
The Problem: Noisy Labels
Real-world datasets are often noisy, with a non-trivial percentage of incorrect labels. These errors can stem from various sources:
- Human error in manual labeling
- Ambiguous cases that are difficult to categorize
- Data entry mistakes
- Deliberate mislabeling in adversarial scenarios
Traditional data cleaning methods often miss these issues, especially in large datasets where manual inspection is impractical.
The Solution: Cleanlab
Cleanlab is built on confident learning, an approach that leverages your model's out-of-sample predicted probabilities to find data points where the model's prediction strongly disagrees with the given label. This method is particularly powerful because it can surface subtle errors that simple rule-based cleaning would miss.
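To build intuition before the full walkthrough, here is a minimal sketch (with made-up numbers, not Cleanlab's internal algorithm) of the core signal, often called self-confidence: the probability the model assigns to each example's given label. Examples whose given label receives very low probability are the prime suspects:
import numpy as np

# Out-of-sample predicted probabilities for 3 examples over 3 classes (made-up numbers)
pred_probs = np.array([
    [0.90, 0.05, 0.05],  # model is confident in class 0
    [0.10, 0.80, 0.10],  # model is confident in class 1
    [0.05, 0.05, 0.90],  # model is confident in class 2
])
given_labels = np.array([0, 1, 0])  # the third given label disagrees with the model

# Self-confidence: probability assigned to each example's given label
self_confidence = pred_probs[np.arange(len(given_labels)), given_labels]
print(self_confidence)  # [0.9  0.8  0.05] -> the third example is a likely label error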
Let’s walk through a practical example of using Cleanlab to improve a classification model.
Step 1: Setup and Data Preparation
First, let’s set up our environment and create a sample dataset with intentional label errors:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_predict
from cleanlab.filter import find_label_issues
# Create a sample dataset with some label errors
np.random.seed(42)
n_samples = 1000
n_features = 5
X = np.random.randn(n_samples, n_features)
# Derive labels from the first feature so there is real signal for the model to learn
y_true = np.digitize(X[:, 0], bins=[-0.5, 0.5])  # classes 0, 1, 2
# Introduce 5% label errors
y_noisy = y_true.copy()
n_errors = int(0.05 * n_samples)
error_indices = np.random.choice(n_samples, n_errors, replace=False)
# Shift each selected label to a different class so every selected index is truly mislabeled
y_noisy[error_indices] = (y_noisy[error_indices] + np.random.randint(1, 3, size=n_errors)) % 3
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(n_features)])
df['label'] = y_noisy
X_train, X_test, y_train, y_test = train_test_split(
df.drop('label', axis=1), df['label'], test_size=0.2, random_state=42)
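Since every selected label was shifted to a different class, a quick sanity check should report exactly 5% noise:
# Sanity check: fraction of training labels that differ from the ground truth
print(f"Actual label noise rate: {(y_noisy != y_true).mean():.2%}")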
Step 2: Detecting Label Issues with Cleanlab
Let’s use Cleanlab to identify potential label errors in our training data:
# Get out-of-sample predicted probabilities via cross-validation (sklearn's cross_val_predict)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
pred_probs = cross_val_predict(clf, X_train, y_train, cv=5, method='predict_proba')
# Find likely label issues, ranked by how confidently the model disagrees with each label
label_issues = find_label_issues(y_train.values, pred_probs,
                                 return_indices_ranked_by='self_confidence')
print(f"Number of label issues detected: {len(label_issues)}")
Step 3: Examining and Cleaning Label Issues
Let’s take a look at some of the detected label issues and clean our dataset:
# Examine some of the detected label issues
n_examples = 5
print(f"\nTop {n_examples} label issues:")
for idx in label_issues[:n_examples]:
    print(f"Index: {idx}, Given label: {y_train.iloc[idx]}, "
          f"Predicted label: {pred_probs[idx].argmax()}")
# Flag the detected issues for removal (in practice, review and relabel where possible)
cleaned_labels = y_train.copy()
cleaned_labels.iloc[label_issues] = -1  # sentinel value marking examples to drop
Step 4: Comparing Model Performance
Now, let’s train models with both the original and cleaned labels to see if we’ve improved performance:
# Train models with original and cleaned labels
clf_original = RandomForestClassifier(n_estimators=100, random_state=42)
clf_original.fit(X_train, y_train)
clf_cleaned = RandomForestClassifier(n_estimators=100, random_state=42)
mask = cleaned_labels != -1  # keep only examples that were not flagged
clf_cleaned.fit(X_train[mask], cleaned_labels[mask])
# Compare performance
accuracy_original = accuracy_score(y_test, clf_original.predict(X_test))
accuracy_cleaned = accuracy_score(y_test, clf_cleaned.predict(X_test))
print(f"\nAccuracy with original labels: {accuracy_original:.4f}")
print(f"Accuracy with cleaned labels: {accuracy_cleaned:.4f}")
Assessing Overall Label Quality
Cleanlab also provides tools to summarize the overall health of your dataset:
from cleanlab.dataset import health_summary

# Prints a report on overall label quality and returns a dict of summary statistics
summary = health_summary(y_train.values, pred_probs,
                         class_names=['Class 0', 'Class 1', 'Class 2'])
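The returned dictionary bundles the underlying statistics; the keys below exist in recent Cleanlab releases, but verify against your installed version's docs:
# Keys assumed from recent cleanlab releases; check your version's documentation
print(summary["overall_label_health_score"])
print(summary["classes_by_label_quality"])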
Key Takeaways
- Improved Model Performance: By identifying and addressing label issues, we often see a boost in model accuracy.
- Data Insights: Cleanlab helps uncover patterns in label errors, providing valuable insights into your data collection and labeling processes.
- Efficiency: Automated label error detection can save countless hours of manual data cleaning and validation.
- Robustness: Models trained on cleaned data are typically more robust and generalize better to new, unseen data.
Conclusion
Label errors are a common but often overlooked issue in machine learning projects. By leveraging tools like Cleanlab, we can systematically identify and address these errors, leading to more accurate models and more reliable insights. Whether you’re working on a small personal project or a large-scale production system, incorporating label quality checks into your ML pipeline can yield significant benefits.
Remember, while Cleanlab is a powerful tool, it’s not a magic solution. Always combine automated tools with domain expertise and critical thinking. Happy cleaning, and may your models be ever more accurate!