What is BERTopic? Metabob’s AI team is testing another topic modeling technique

This blog is written in collaboration with Ben Reaves, our director of AI, who has extensive experience in the field of NLP. Ben holds an MS from Stanford and has 35+ years of experience as an NLP engineer at high-growth startups and at large corporations such as Panasonic and Toyota.

In order to generate useful recommendations, we need to train our model not only on text but also on some context about that text...

A topic model is a type of algorithm that scans a set of documents (known in the field of Natural Language Processing/NLP as a corpus), examines how words and phrases co-occur in them, and automatically “learns” groups or clusters of words that best characterize those documents. These sets of words often appear together to represent a coherent theme or topic.

New machine-learning models are proposed at every major conference. A fundamental question to ask is whether our current model (LDA) works better than newer models like BERTopic. The only way to find an answer is to test these models and study their results. So, that’s what we did.

In our previous blog, “LDA and the method of topic modeling,” we described how we apply the LDA machine learning model to Metabob. In this blog post, we compare those results with a newer topic modeling technique called “BERTopic.”

What is BERTopic?

BERTopic is a topic modeling technique that leverages a pre-trained transformer model to embed all the input documents in a multi-dimensional space. Imagine a 3-dimensional filing cabinet: each document is placed at a location in that space so that similar documents end up near each other (the real embedding space simply has many more dimensions). BERTopic then finds clusters of nearby documents using HDBSCAN and extracts the words that best describe each cluster, i.e., the topics, with an algorithm called c-TF-IDF.
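
In code, that whole pipeline is only a few lines with the bertopic package. Below is a minimal sketch, not our production setup, using scikit-learn’s copy of the 20 newsgroups corpus (one of the datasets discussed later in this post):

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# A list of raw documents; headers, footers, and quotes are stripped so the
# topics come from the message bodies rather than e-mail boilerplate.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data

# BERTopic embeds each document with a pre-trained transformer, clusters the
# embeddings with HDBSCAN, and labels each cluster with c-TF-IDF keywords.
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(docs)

# One topic id per document (-1 marks outliers), plus the top words per topic.
print(topic_model.get_topic_info().head())
print(topic_model.get_topic(0))
```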

BERTopic also supports guided, (semi-)supervised, and dynamic topic modeling, as well as visualizations similar to those available for the LDA model.

Unlike LDA, which starts from scratch with an uninitialized model, BERTopic starts with a model that is already trained. Think of studying a foreign language at age 20: you already know your native language well, so it should not take 20 more years to learn how to communicate in the new language. Likewise, BERTopic starts with a model that has already been trained on a huge amount of data (for example, Wikipedia) that need not be exactly the same as the data in your application (for example, a local newspaper).

Testing BERTopic

First, we tested the BERTopic model on three separate datasets. The first two are relatively common ones: 20_news_groups and trump_tweets. Because many previous models have been tested on these two datasets, there are plenty of published baselines available for comparison, so we can gauge how our BERTopic model performs relative to past models. Next, we used our own GitHub comments dataset to test BERTopic and see how the topics within those comments are distributed in a vector space.

Second, we initialized BERTopic with all-MiniLM-L6-v2, an embedding model pre-trained on a large, diverse dataset of over 1 billion sentence pairs.

In order to evaluate fairly, the testing methodology for both models needs to be kept consistent. Thus, we control both the data pre-processing pipeline and the choice of pre-trained embedding model. We selected all-MiniLM-L6-v2 here for its balance of computation time and performance.
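
To keep that comparison controlled, the embedding model can be passed in explicitly rather than relying on BERTopic’s default. A sketch of pinning it to all-MiniLM-L6-v2 with sentence-transformers (parameter names as in the public bertopic API):

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Same corpus as in the earlier snippet; any list of document strings works.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data

# Pin the embedding model so every run (and every model we compare) uses
# exactly the same document representations.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Embeddings can be computed once and reused across experiments, which
# saves time when only the clustering settings change.
embeddings = embedding_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
```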

Finally, we compare BERTopic to other topic models using the OCTIS framework, which calculates several different metrics, including coherence (such as c_v and c_npmi), diversity, and Kullback-Leibler divergence scores (think of how the scikit-learn framework facilitates cross-validation and accuracy numbers like the F1-score). OCTIS also standardizes parts of the training, such as tokenization and other preprocessing of the raw data.
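
As a sketch of how that evaluation can be wired up on the fitted model from the snippets above (class names as in the public OCTIS API; the whitespace tokenization here is illustrative and should in practice match the preprocessing used to build the topics):

```python
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# OCTIS expects each topic as a plain list of words; skip the -1 outlier topic.
topic_words = [
    [word for word, _ in topic_model.get_topic(topic_id)]
    for topic_id in topic_model.get_topics()
    if topic_id != -1
]
model_output = {"topics": topic_words}

# Coherence needs the tokenized corpus the topics were learned from.
tokenized_docs = [doc.lower().split() for doc in docs]

npmi = Coherence(texts=tokenized_docs, topk=10, measure="c_npmi")
diversity = TopicDiversity(topk=10)

print("c_npmi coherence:", npmi.score(model_output))
print("topic diversity:", diversity.score(model_output))
```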

Key findings

We found that the BERTopic model outperforms LDA on several of these metrics and has a number of advantages over the LDA model:

  • Projecting topics into a vector space lets us measure the distance between topics and discover how they relate to one another.

  • Unlike LDA, which requires the number of desired topics to be specified before training, BERTopic determines it automatically using HDBSCAN, a hierarchical clustering approach.

  • Variants like the guided version of BERTopic enable us to predefine topics and make the model converge to those predefined topics.

  • Powerful visualization tools presented as interactive dashboards (a short sketch follows this list).
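
Several of those visualizations are one-liners on a fitted model. A brief sketch, using method names from the public bertopic API (each returns an interactive Plotly figure):

```python
# Inter-topic distance map: topics projected to 2-D, sized by frequency.
fig_topics = topic_model.visualize_topics()

# Topic hierarchy derived from the clustering, useful for merging topics.
fig_hierarchy = topic_model.visualize_hierarchy()

# Top c-TF-IDF words per topic, shown as bar charts.
fig_barchart = topic_model.visualize_barchart(top_n_topics=8)

# Each figure is a Plotly object; it can be displayed or saved as HTML.
fig_topics.write_html("topics.html")
```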

Additional findings are as follows:

Pros:

  • Flexible hyperparameter tuning, with the potential to perform far better than the default parameters (see the sketch after this list).

  • Intuitive modeling steps/ideas that can be easily explained to stakeholders without going into deep technical detail.

  • Well-supported open-source tooling, and already well-recognized in the field of topic modeling despite only being published in March 2022.
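
On the hyperparameter point above: BERTopic lets you swap in your own dimensionality-reduction and clustering sub-models in place of the defaults. The sketch below is illustrative only; the parameter values are not our tuned settings:

```python
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Replace the default UMAP/HDBSCAN settings: a smaller min_cluster_size
# yields more, finer-grained topics; larger values yield fewer topics.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=20, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    top_n_words=10,
)
```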

Cons:

  • Training requires substantial computational resources, especially on large datasets.

  • Some steps of the BERTopic pipeline do not support customization.

Next steps

We will continue to explore BERTopic for our purposes and will keep testing other topic models including Top2Vec. Stay tuned for more findings and comparisons in our future posts!