import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        # Token embedding table of shape (vocab_size, d_model)
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Embedding dimension, used to scale the embeddings
        self.d_model = d_model

    def forward(self, x):
        # Scale by sqrt(d_model), as in the original Transformer
        return self.embedding(x) * math.sqrt(self.d_model)

class PositionalEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # Positional encoding matrix (vocab_size, d_model); here vocab_size also
        # serves as the maximum supported sequence length
        pe = torch.zeros(vocab_size, d_model)
        # Position indices, shape (vocab_size, 1)
        position = torch.arange(0, vocab_size).unsqueeze(1).float()
        # Frequency term 1 / 10000^(2i / d_model)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        # Sinusoids: sin on even dimensions, cos on odd dimensions
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Add batch dimension -> (1, vocab_size, d_model)
        pe = pe.unsqueeze(0)
        # Register as a buffer so it moves with the module but is not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the encodings for the first seq_len positions
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
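
# Illustrative check of PositionalEmbedding (a minimal sketch, not from the
# original text; the sizes below are hypothetical). The module adds the first
# seq_len rows of the sinusoidal table to the input, so the shape is unchanged.
def _demo_positional_embedding():
    pos_emb = PositionalEmbedding(vocab_size=50, d_model=16)
    x = torch.zeros(1, 20, 16)   # (batch_size, seq_len, d_model)
    out = pos_emb(x)             # adds pe[:, :20, :], then applies dropout
    assert out.shape == (1, 20, 16)
    return out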

class LayerNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        # Learnable scale and shift parameters
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        # Small epsilon for numerical stability (avoids division by zero)
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        # Normalize over the last dimension, then rescale and shift with gamma and beta
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

class ResidualConnection(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        # Layer normalization applied after the residual addition (post-norm)
        self.norm = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_out):
        # Dropout on the sublayer output, add the residual, then normalize
        return self.norm(x + self.dropout(sublayer_out))

class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        # Linear layers and dropout
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1, qkv_bias: bool = False, is_causal: bool = False):
        super().__init__()
        assert d_model % num_heads == 0, "d_model is not divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.dropout = dropout
        self.is_causal = is_causal
        # Single projection producing queries, keys, and values in one pass
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=qkv_bias)
        # Output projection back to d_model (num_heads * head_dim == d_model)
        self.linear = nn.Linear(num_heads * self.head_dim, d_model)
        self.dropout_layer = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # mask, if given, must be broadcastable to (batch_size, num_heads, seq_length, seq_length)
        batch_size, seq_length = x.shape[:2]
        # Linear transformation and split into query, key, and value
        qkv = self.qkv(x)  # (batch_size, seq_length, 3 * d_model)
        qkv = qkv.view(batch_size, seq_length, 3, self.num_heads, self.head_dim)  # (batch_size, seq_length, 3, num_heads, head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch_size, num_heads, seq_length, head_dim)
        queries, keys, values = qkv  # 3 * (batch_size, num_heads, seq_length, head_dim)
        # Scaled dot-product attention; disable attention dropout at eval time
        context_vec = F.scaled_dot_product_attention(
            queries, keys, values,
            attn_mask=mask,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=self.is_causal,
        )
        # Combine heads, where self.d_model = self.num_heads * self.head_dim
        context_vec = context_vec.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        # Output projection and dropout
        context_vec = self.dropout_layer(self.linear(context_vec))
        return context_vec
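
# Illustrative smoke test for MultiHeadAttention (a minimal sketch, not from the
# original text; the sizes below are hypothetical). It shows the expected
# (batch, seq_len, d_model) input/output shapes and a broadcastable boolean mask,
# where True marks positions that may be attended to.
def _demo_multihead_attention():
    attn = MultiHeadAttention(d_model=64, num_heads=8)
    x = torch.randn(2, 10, 64)                        # (batch_size, seq_len, d_model)
    mask = torch.ones(2, 1, 1, 10, dtype=torch.bool)  # broadcasts to (2, 8, 10, 10)
    out = attn(x, mask)
    assert out.shape == (2, 10, 64)                   # attention preserves the shape
    return out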

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, hidden_dim: int, dropout: float = 0.1):
        super().__init__()
        # Multi-head self-attention mechanism
        self.multihead_attention = MultiHeadAttention(d_model, num_heads, dropout)
        # First residual connection and layer normalization
        self.residual1 = ResidualConnection(d_model, dropout)
        # Feed-forward neural network
        self.feed_forward = FeedForward(d_model, hidden_dim, dropout)
        # Second residual connection and layer normalization
        self.residual2 = ResidualConnection(d_model, dropout)

    def forward(self, x, mask=None):
        x = self.residual1(x, self.multihead_attention(x, mask))
        x = self.residual2(x, self.feed_forward(x))
        return x

class EncoderStack(nn.Module):
    def __init__(self, d_model: int, num_heads: int, hidden_dim: int, num_layers: int, dropout: float = 0.1):
        super().__init__()
        # Stack of encoder layers
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, hidden_dim, dropout) for _ in range(num_layers)])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, num_heads: int, hidden_dim: int, num_layers: int, out_features: int, dropout: float = 0.1):
        super().__init__()
        # Token embeddings scaled by sqrt(d_model)
        self.embedding = EmbeddingLayer(vocab_size, d_model)
        # Sinusoidal positional encodings (vocab_size also serves as the maximum sequence length)
        self.positional_embedding = PositionalEmbedding(vocab_size, d_model, dropout)
        # Stack of encoder layers
        self.encoder = EncoderStack(d_model, num_heads, hidden_dim, num_layers, dropout)
        # Classification head mapping the pooled representation to class logits
        self.classifier = nn.Linear(d_model, out_features)

    def forward(self, x, mask=None):
        x = self.embedding(x)
        x = self.positional_embedding(x)
        x = self.encoder(x, mask)
        # Mean-pool over the sequence dimension before classification
        x = x.mean(dim=1)
        x = self.classifier(x)
        return x
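
# A minimal usage sketch (hypothetical hyperparameters and data, not from the
# original text): build the classifier, run a batch of random token ids through
# it, and check that the output has one logit per class.
if __name__ == "__main__":
    model = TransformerClassifier(
        vocab_size=10_000, d_model=128, num_heads=8,
        hidden_dim=512, num_layers=2, out_features=3,
    )
    tokens = torch.randint(0, 10_000, (4, 32))  # (batch_size, seq_len) of token ids
    logits = model(tokens)                      # (batch_size, out_features)
    print(logits.shape)                         # torch.Size([4, 3])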