๐ŸƒSemi-Supervised Learning

So far, you've seen both supervised and unsupervised learning. But have you ever wondered if there's something in between?

That's where semi-supervised learning comes in! It tackles problems by combining both labeled and unlabeled data.

About the Data

The raw data is the same campaign data used in the association section. It has 2 classes, and all the records are labeled.

In the real world, 2 types of scenarios can occur:

  1. Each class has labeled and unlabeled data.

  2. Only 1 class is partially labeled, and all the other data is unlabeled.

To simulate these two real-world scenarios, we can mask the data.

Scenario 1 Data Mask

In this scenario, we mask a portion of the data from both classes as unlabeled and allow flexible adjustment of the mask_rate. The goal is for the data that remains labeled to retain its 0 or 1 labels, while the masked data is marked as -1.

When splitting the data into training and testing sets, we use stratified splitting to ensure that the proportion of each label in the training data matches that in the testing data. In the example below, 95% of the data has been masked:
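The full notebook code isn't shown on this page, but here's a minimal sketch of the masking and stratified split, assuming a feature matrix X and a fully labeled target y (the function name mask_labels, the random seed, and the 70/30 split are my assumptions, not the original notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def mask_labels(y, mask_rate=0.95, random_state=10):
    """Randomly mark a mask_rate portion of records (from both classes) as unlabeled (-1)."""
    rng = np.random.RandomState(random_state)
    y_masked = np.asarray(y).copy()
    mask_idx = rng.choice(len(y_masked), size=int(len(y_masked) * mask_rate), replace=False)
    y_masked[mask_idx] = -1  # -1 means "no label"
    return y_masked

y_masked = mask_labels(y, mask_rate=0.95)

# Stratified split so the proportions of -1 / 0 / 1 match between train and test;
# the true labels are split in parallel and kept aside for evaluation later.
X_train, X_test, y_train, y_test, y_train_true, y_test_true = train_test_split(
    X, y_masked, y, test_size=0.3, stratify=y_masked, random_state=10)
```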

Before beginning any forecasting work, we can visualize the data in 2D or 3D space to understand its distribution and assess the complexity of the problem. To do this, we can use UMAP for dimensionality reduction, projecting the dataset into a 2D space. Each data point will be color-coded based on its mask and true label: "0False" indicates a masked negative class, "0True" represents the original negative class, "1False" is the masked positive class, and "1True" is the original positive class. Let's examine the plot of the training data:
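As a rough sketch (assuming the umap-learn package and the variables from the split above; the plotting details are my own choices), the 2D projection and color-coded plot could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
import umap  # pip install umap-learn

# Project the training features into 2D
embedding = umap.UMAP(n_components=2, random_state=10).fit_transform(X_train)

# Build the color-coding key: "<true label><kept its label?>",
# e.g. "0False" = masked negative record, "1True" = originally labeled positive record
point_keys = np.array([f"{t}{m != -1}" for t, m in zip(y_train_true, y_train)])

for key in np.unique(point_keys):
    idx = point_keys == key
    plt.scatter(embedding[idx, 0], embedding[idx, 1], s=5, label=key)
plt.legend()
plt.title("UMAP projection of the training data")
plt.show()
```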

and the plot of testing data:

Not seeing any clear patterns to distinguish between the classes, right? 😅 No problem! With this data, we'll work on classifying all the masked instances!

Scenario 2 Data Mask

In this scenario, we mask most of the data and only keep a portion of the positive data labeled. This type of problem is called "PU Learning" (Positive-Unlabeled Learning).

The code to mask the data with a configurable masking rate is here:
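The full code is linked below; as a minimal sketch (the function name pu_mask_labels and the seed are my own choices), the masking could look like this:

```python
import numpy as np

def pu_mask_labels(y, mask_rate=0.95, random_state=10):
    """Keep roughly (1 - mask_rate) of the records labeled as positive (1);
    everything else, including all negatives, becomes unlabeled (-1).
    If there aren't enough positive records, all positives stay labeled."""
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    y_masked = np.full(len(y), -1)                 # start with everything unlabeled
    pos_idx = np.where(y == 1)[0]
    n_keep = min(int(len(y) * (1 - mask_rate)), len(pos_idx))
    keep_idx = rng.choice(pos_idx, size=n_keep, replace=False)
    y_masked[keep_idx] = 1                         # these positives keep their label
    return y_masked

y_pu = pu_mask_labels(y, mask_rate=0.95)
values, counts = np.unique(y_pu, return_counts=True)
print(dict(zip(values, counts / len(y_pu))))       # roughly 95% "-1" and 5% "1"
```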

Setting mask_rate=0.95 means we mask 95% of the data, and the remaining 5% keeps its positive labels. Therefore, in the output below, you can see that 95% of the records are labeled "-1" and 5% are labeled "1". Among all the masked data, 55.4% are negative class and 44.6% are positive class.

Setting mask_rate=0.3 is meant to keep 70% of the records labeled as positive, but because positive records only make up 47.4% of the population, we can only get 47.4% labeled data, and all the negative data remains unlabeled.

🌻 Check scenario 2 mask code here >>

Now let's see how to classify this unlabeled data!

Classification on Scenario 1 Data Mask

We will begin by classifying the data when unlabeled instances appear within each class.

To do this, we will experiment with three approaches using a dataset where 90% of the labels are masked. In this dataset, only 10% of the records retain their original labels, with 5.37% being negative and 4.63% positive. Among the masked data, 52.50% are negative records, while 47.50% are positive.

After comparing the following 3 approaches, we will select the best one to experiment on datasets with varying mask rates.

Approach 1: Label Propagation

Label Propagation assigns labels to unlabeled data by assuming that similar data points share the same label. The process of labeling the unlabeled data involves the following steps:

  1. Graph Construction: It creates a connected graph by drawing edges between data nodes. You can limit the number of nodes each point connects to using the n_neighbors parameter, which reduces the computational resources required. Conversely, building a fully connected graph demands significantly more resources.

  2. Edge Weighting: On this graph, edges between more similar nodes receive higher weights, while edges between less similar nodes receive lower weights. A higher weight makes it easier for a label to propagate, increasing the likelihood that it will spread to neighboring nodes.

  3. Random Walk: For each unlabeled node, a random walk is performed to determine the probability distribution of reaching labeled nodes. This helps identify which label has the highest probability of being correct. The random walk continues until convergence is achieved, meaning either all paths have been explored or the probabilities of each possible label no longer change.

Let's apply Label Propagation to our dataset with 90% masked labels! Scikit-learn's Label Propagation algorithm supports two kernel options: knn and rbf. The kernel determines how similarity between data points is measured. The KNN kernel uses the number of neighbors to assess similarity, while the RBF kernel measures similarity based on distances.

Here's how to apply Label Propagation with the KNN kernel:
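A minimal sketch, continuing from the scenario 1 split above (n_neighbors=7 is my assumption; -1 marks the unlabeled records in y_train):

```python
from sklearn.semi_supervised import LabelPropagation
from sklearn.metrics import roc_auc_score, classification_report

lp_knn = LabelPropagation(kernel='knn', n_neighbors=7, max_iter=1000)
lp_knn.fit(X_train, y_train)                      # y_train uses -1 for the masked records

# Evaluate against the held-out true labels
proba_knn = lp_knn.predict_proba(X_test)[:, 1]    # probability of class 1
print('AUC:', roc_auc_score(y_test_true, proba_knn))
print(classification_report(y_test_true, lp_knn.predict(X_test)))
```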

And here's the code for Label Propagation with the RBF kernel:
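A corresponding sketch for the RBF kernel (the gamma value here is just scikit-learn's default, not necessarily what the notebook uses):

```python
from sklearn.semi_supervised import LabelPropagation
from sklearn.metrics import roc_auc_score

# The RBF kernel builds a fully connected graph, so it needs more memory and compute
lp_rbf = LabelPropagation(kernel='rbf', gamma=20, max_iter=1000)
lp_rbf.fit(X_train, y_train)                      # y_train uses -1 for the masked records

proba_rbf = lp_rbf.predict_proba(X_test)[:, 1]
print('AUC:', roc_auc_score(y_test_true, proba_rbf))
```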

The performance above is quite similar. If we consider AUC as the main metric, the RBF kernel works slightly better than the KNN kernel in this case.

🌻 Check label propagation code here >>

Approach 2: Label Spreading

Label Spreading is similar to Label Propagation, but with a key difference: Label Propagation uses hard clamping, meaning labeled data points never change their labels. In contrast, Label Spreading adopts soft clamping, controlled by the parameter alpha. This parameter determines the balance between the influence of neighboring data and the original label. When alpha=0, the model fully preserves the original labels, while alpha=1 means the initial labels are entirely replaced by information from neighboring points. To learn more details, check this article.

The supported kernels in label spreading are knn and rbf too. Let's apply label spreading with KNN on our 90% masked data first:
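Here's a minimal sketch (alpha=0.5 and n_neighbors=7 mirror the settings described in the next paragraph; the evaluation reuses the scenario 1 variables from above):

```python
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import roc_auc_score

ls_knn = LabelSpreading(kernel='knn', n_neighbors=7, alpha=0.5, max_iter=1000)
ls_knn.fit(X_train, y_train)                      # -1 marks the unlabeled records

proba_knn = ls_knn.predict_proba(X_test)[:, 1]
print('AUC:', roc_auc_score(y_test_true, proba_knn))
```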

The parameter settings for RBF's Label Spreading differ slightly. In addition to adding the gamma parameter required by RBF, we adjusted alpha from 0.5 to its default value of 0.2 and increased n_neighbors from 7 to 20. These changes were made to improve performance.
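A sketch with those settings (the gamma value is an assumption, scikit-learn's default of 20):

```python
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import roc_auc_score

# alpha back to its 0.2 default, gamma added for the RBF kernel, n_neighbors raised to 20
# (note: scikit-learn only uses n_neighbors when building the knn graph)
ls_rbf = LabelSpreading(kernel='rbf', gamma=20, alpha=0.2, n_neighbors=20, max_iter=1000)
ls_rbf.fit(X_train, y_train)

proba_rbf = ls_rbf.predict_proba(X_test)[:, 1]
print('AUC:', roc_auc_score(y_test_true, proba_rbf))
```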

We still get similar performance, and the RBF kernel has a slightly better AUC.

🌻 Check label spreading code here >>

Approach 3: Self Training

Self-training allows you to choose any estimator supported by Scikit-learn, such as XGBoost or LightGBM, to train on the labeled data and make predictions on the unlabeled data. These predictions are then used as pseudo-labels, which are added to the existing labeled data for another round of training and prediction. This process is repeated until all the data is labeled or the maximum number of iterations is reached.

Scikit-learn provides a built-in SelfTrainingClassifier, and it can be used in this way:
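A minimal sketch (XGBoost as the base estimator and its hyperparameters are my choices for illustration; any classifier with predict_proba works):

```python
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# The base estimator must implement predict_proba; unlabeled records in y_train are marked -1
base = XGBClassifier(n_estimators=200, max_depth=5, eval_metric='logloss')
self_training = SelfTrainingClassifier(base, threshold=0.75, max_iter=10)
self_training.fit(X_train, y_train)

proba = self_training.predict_proba(X_test)[:, 1]
print('AUC:', roc_auc_score(y_test_true, proba))
```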

🌻 Check self training code here >>

Compared with Label Spreading and Label Propagation, self-training took much longer to run, and unfortunately it delivered the worst performance.

Performance with Different Mask Rates

Overall, Label Spreading with the RBF kernel achieved slightly better performance on our dataset with 90% masked labels. Next, let's apply this method to datasets with varying mask rates.

As shown below, performance improves as more data is labeled (i.e., with lower mask rates). However, when the mask rate drops below 0.5, the performance gains become less significant. For instance, the difference in performance between a mask rate of 0.5 and 0.2 is much smaller.

🌻 Check detailed code here >>

🌻 Check all-labeled-data forecast >>

The best performance we've achieved so far is 0.74 AUC and 0.65 Recall. With fully labeled data, we can reach 0.84 AUC and 0.85 Recall. To get closer to the performance of fully labeled data, we could optimize model parameters or explore more advanced algorithms. What are your ideas for making such improvements? Share them here! 😉

Forecast on Scenario 2 Data Mask

Next, let's tackle the PU Learning (Positive-Unlabeled Learning) problem, where only a portion of the positive labels are known, and the rest of the data is unlabeled. We'll demonstrate a DIY PU learning solution and compare it with the pulearn library's built-in classifiers. Which one do you think will perform better? 😏

For this experiment, we're using a dataset where 90% of the records are masked, leaving only 10% with their original positive labels. Among the masked data, 58.50% are negative, and 41.50% are positive.

How to Solve the PU Learning Problem

The main idea is: given all the data, calculate the probability of each record being positive, denoted as P(positive_label=1 | data).

  1. Because only positive records can ever receive a label, P(has_label=1 | data) = P(has_label=1, positive_label=1 | data) = P(positive_label=1 | data) * P(has_label=1 | positive_label=1, data). Assuming the labeled positives are selected at random from all positives, P(has_label=1 | positive_label=1, data) = P(has_label=1 | positive_label=1), which gives us: P(positive_label=1 | data) = P(has_label=1 | data) / P(has_label=1 | positive_label=1). Therefore, to obtain the final output P(positive_label=1 | data), we only need P(has_label=1 | data) and P(has_label=1 | positive_label=1).

  2. In the dataset, we replace the original label column with has_label, indicating whether each record has a label. We then split the dataset into training and testing sets using a stratified split based on has_label. To calculate P(has_label=1 | data), we train an estimator on the training set, and the predictions on the test set provide us with P(has_label=1 | data).

  3. To find P(has_label=1 | positive_label=1), we calculate the probability of "having a label" among the positive samples in the training set. By averaging these probabilities, we obtain P(has_label=1 | positive_label=1).

  4. Finally, we use the formula: P(positive_label=1 | data) = P(has_label=1 | data) / P(has_label=1 | positive_label=1) to compute the probability of each record being positive.

This approach is known as the E&N (Elkan & Noto) method, and it can be applied to both binary and multi-class classification problems.

DIY PU Learning Solution

The DIY solution follows the exact steps outlined above, producing the probability of each data record belonging to the positive class.
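The full notebook is linked below; here's a minimal sketch of those four steps (XGBoost and the variable names are my own choices, and y_pu is the scenario 2 masked target with 1 for known positives and -1 for everything else):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Step 2: replace the label column with has_label and do a stratified split on it
has_label = (np.asarray(y_pu) == 1).astype(int)
X_tr, X_te, s_tr, s_te = train_test_split(
    X, has_label, test_size=0.3, stratify=has_label, random_state=10)

# Train an estimator to predict has_label; its probabilities estimate P(has_label=1 | data)
clf = XGBClassifier(n_estimators=200, eval_metric='logloss')
clf.fit(X_tr, s_tr)
p_has_label = clf.predict_proba(X_te)[:, 1]

# Step 3: estimate P(has_label=1 | positive_label=1) by averaging the predicted
# probabilities over the known positives in the training set
c = clf.predict_proba(np.asarray(X_tr)[s_tr == 1])[:, 1].mean()

# Step 4: P(positive_label=1 | data) = P(has_label=1 | data) / P(has_label=1 | positive_label=1)
p_positive = np.clip(p_has_label / c, 0, 1)
```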

🌻 Check DIY PU Learning solution here >>

The challenge now is: how do we evaluate the results? 🤔

In an ideal scenario, where you have labels for all the data, you can use standard machine learning evaluation metrics such as AUC, Average Precision, etc. For example, in our case, comparing the predicted probabilities of the positive class against the actual labels yields an AUC of 0.71, as demonstrated in the notebook.

However, in reality, you often don't have labels for the entire dataset; only a small fraction of positive labels are known 🥲. To assess performance in such cases, let's calculate the following metrics:

  • real_pos_perct: the actual percentage of positive class in the dataset. If the ground truth is unknown, this can be estimated by business or domain experts.

  • pred_pos_perct: the predicted percentage of the positive class. This is calculated as the proportion of records with predicted probability >= threshold.

  • known_recall: the recall among the known positive labels. By classifying records as positive if predicted probability >= threshold and as negative otherwise, we can compare these predictions against the known positive labels to calculate this recall.

Here's the code for pred_pos_perct and known_recall:
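A minimal sketch, reusing p_positive and the test-set labels s_te from the DIY sketch above, where s_te == 1 marks the known positives (the helper name pu_metrics is my own):

```python
import numpy as np

def pu_metrics(p_positive, known_pos_mask, threshold):
    """pred_pos_perct: fraction of records predicted positive at this threshold.
    known_recall: fraction of the known positives that are predicted positive."""
    pred_pos = p_positive >= threshold
    pred_pos_perct = pred_pos.mean()
    known_recall = pred_pos[known_pos_mask].mean()
    return pred_pos_perct, known_recall

# Sweep thresholds; good ones keep pred_pos_perct close to real_pos_perct
# while known_recall stays decent
for threshold in np.arange(0.5, 1.0, 0.05):
    pp, kr = pu_metrics(p_positive, s_te == 1, threshold)
    print(f"threshold={threshold:.2f}  pred_pos_perct={pp:.3f}  known_recall={kr:.3f}")
```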

We can plot the performance at different thresholds to decide on the optimal one. The ideal threshold has a decent known_recall while keeping pred_pos_perct close to real_pos_perct. In this example, we can choose a threshold between 0.7 and 0.75.

🌻 Check DIY PU Learning evaluation code >>

Pulearn Built-in PU Learning Solution

pulearn is a scikit-learn compatible PU Learning library; it supports 3 classifiers:

  • ElkanotoPuClassifier: implements the E&N method, the same as the DIY solution above.

  • WeightedElkanotoPuClassifier: also comes from the E&N paper; it adds weights to the unlabeled data.

  • BaggingPuClassifier: applies a bagging SVM on positive and unlabeled data.

Let's check their performance by applying them to our 90% masked data!
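Here's a rough sketch of how a pulearn classifier can be wired up, following the usage style in pulearn's README (the SVC settings are illustrative, and it's worth double-checking the label convention and the predict_proba output shape against the pulearn docs for your version):

```python
import numpy as np
from pulearn import ElkanotoPuClassifier
from sklearn.svm import SVC

# pulearn treats records labeled 1 as known positives and everything else as unlabeled;
# the base estimator needs predict_proba, hence probability=True
svc = SVC(C=10, kernel='rbf', gamma=0.4, probability=True)
pu_estimator = ElkanotoPuClassifier(estimator=svc, hold_out_ratio=0.2)
pu_estimator.fit(np.asarray(X_tr), np.asarray(s_tr))

# Corrected probability of being truly positive (per the E&N adjustment);
# check the exact return shape for your pulearn version
p_positive_pulearn = pu_estimator.predict_proba(np.asarray(X_te))
```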

🌻 Check Built-in PU Learning code >>

Clearly, the ElkanotoPuClassifier delivers the best overall performance among the built-in classifiers. However, the DIY solution performs slightly better because, at the optimal threshold, where pred_pos_perct intersects with real_pos_perct, the DIY solution achieves a higher known_recall.

All the experiments above were conducted on data with 90% masking. How does the performance change with different mask rates? 😉

Performance with Various Mask Rates

Lady H. applied both the DIY solution and pulearn's built-in ElkanotoPuClassifier on datasets with mask rates of 95%, 80%, 50%, and 30%. The results showed that the DIY solution consistently outperformed the built-in classifier. Let's dive into the details.

This is the performance comparison on the 95% masked data. At the best threshold, both solutions have pred_pos_perct intersecting with real_pos_perct, but the DIY solution has a higher known_recall.

For the 80%, 50%, and 30% masked data, the performance difference at the best threshold is minor, but the DIY solution still has a slightly better known_recall.

๐Ÿ˜ So, the DIY solution won!
