How do they read your mind?

Sofiene Azabou
7 min read · Jan 8, 2020

NATURAL LANGUAGE PROCESSING (PART III)

The following is part of a series of articles on NLP. (Check Part I & Part II)

Have you ever wondered how Spotify's Discover Weekly provides you every week with a customized playlist that fits your tastes so well? Have you ever looked for a video and found exactly what you wanted in your YouTube suggestions? Isn't it impressive how your favorite news website serves you the articles that interest you the most?
I mean, it's crazy, sometimes even scary, how technology knows exactly what you are thinking about and what you are looking for. In this article, we are going to get a heads-up on how it works, especially the related-articles recommender system.

Recommender System

A recommender engine is a system that aims to predict a user's preferences in order to provide them with the best personalized experience. We can distinguish three types of recommender systems:

  1. Collaborative Filtering
    Methods based on collecting and analyzing a large amount of information about users’ behaviors, activities or preferences, in order to predict what users will like based on their similarity to other users. A key advantage of the collaborative filtering approach is that it does not rely on machine analyzable content and therefore it is capable of accurately recommending complex items such as movies without requiring an “understanding” of the item itself.
    The above example explains in an easy way how collaborative filtering works, specifically user-based collaborative filtering (a tiny numerical sketch also appears right after this list). Imagine that someone loves cinema and music but hates painting; if I love cinema and hate painting, I'd probably love music as well. The recommender system uses this logic to build suggestions based on our similarity.
  2. Content-Based Filtering
    Methods based on a description of the item itself and a profile of the user's preferences. In a content-based recommendation system, keywords are used to describe the items, and a user profile is built to indicate the type of items this user likes.
  3. Hybrid Recommendation Systems
    As the name implies, this kind of recommendation system is a combination of both collaborative filtering and content-based filtering. It's basically used to overcome some of the common problems in recommendation systems, such as cold start: the issue where the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information.
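
To make the first approach a little more concrete, here is a minimal user-based collaborative filtering sketch; the ratings matrix, the users, and the cosine similarity measure are all hypothetical illustrations, not taken from any real system.

import numpy as np

#Hypothetical likes/dislikes matrix: rows = users, columns = [cinema, music, painting]
ratings = np.array([
    [1.0, 1.0, 0.0],   # user A: loves cinema and music, hates painting
    [1.0, 0.0, 0.0],   # user B (me): loves cinema, hates painting, music not rated yet
])

def cosine_similarity(u, v):
    #Cosine of the angle between two rating vectors: 1.0 means identical tastes
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

#B's closest neighbor is A, so items A likes but B hasn't rated (music) become recommendations for B
print("similarity(A, B) =", round(cosine_similarity(ratings[0], ratings[1]), 2))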

When it comes to article recommender systems, we can use any of these three types. Today, we are going to see how we can use NLP to build a content-based filtering recommender system. Let's get started!

Latent Dirichlet Allocation

David Blei — Professor in the Statistics and Computer Science departments at Columbia University.

One of the most used techniques in Natural Language Processing is topic modeling. It relies on statistical models, purely based on unsupervised learning, capable of detecting the various topics that appear in a collection of documents.
In 2003, David Blei, now a professor in the Statistics and Computer Science departments at Columbia University, developed with his colleagues a powerful algorithm named "Latent Dirichlet Allocation". Since then, it has become the main algorithm driving many areas of application, such as topic modeling, document classification, image clustering, sentiment analysis, and more.

But, how does it work?

LDA is considered "a probabilistic model with a corresponding generative process". The idea behind the LDA model seems simple: it assumes that a fixed number of topics is specified in advance. Then, the only observable features the model considers are the words appearing in the documents, and each of these words has a certain probability of belonging to a specific topic. After various iterations, the model ends up assigning a collection of words to each topic. As a result, every document is represented as a mixture of topics with various probabilities, and, based on the frequency of those words in each document, the model assigns to each document the topic with the largest probability. Easy, right?
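
To see this generative story in miniature, here is a toy sketch in Python; the number of topics, the vocabulary size, and every probability below are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

#Toy LDA generative process: 2 topics, a 5-word vocabulary, one 10-word document
alpha = np.array([0.5, 0.5])                   # Dirichlet prior over topic mixtures
beta = np.array([[0.4, 0.3, 0.2, 0.05, 0.05],  # topic 0: probability of each word
                 [0.05, 0.05, 0.2, 0.3, 0.4]]) # topic 1: probability of each word

theta = rng.dirichlet(alpha)                   # 1. draw this document's topic mixture (theta)
document = []
for _ in range(10):                            # 2. for every word position in the document
    z = rng.choice(2, p=theta)                 #    draw a topic assignment Z from theta
    w = rng.choice(5, p=beta[z])               #    draw the observed word W from topic z
    document.append(w)
print("topic mixture:", theta, "word ids:", document)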

Well, it's not as simple as that. Below is what is known as a plate diagram of an LDA model.

Where:
• α: parameter of the Dirichlet prior on the per-document topic distributions
• β: parameter of the Dirichlet prior on the per-topic word distributions
• θ: document-specific topic distribution
• Z: topic assignment for each word
• W: observed word

In the diagram above, the word (W) is shown in white because it's the only feature observed by the model, as we said earlier; everything else is considered a latent variable.
In the process of optimizing our model, we can mess with these parameters to get better results. Let's consider α and β for example (a quick numerical illustration follows the two bullets):
- α: alpha represents document-topic density. With a higher alpha, documents are made up of more topics; with a lower alpha, documents contain fewer topics. In other words, the higher alpha is, the more the documents seem similar to each other.
- β: beta represents topic-word density. With a high beta, topics are made up of most of the words in the corpus; with a low beta, they consist of few words. In other words, the higher beta is, the more the topics seem similar to each other.
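
Here is a quick numerical illustration of what alpha does, using random draws from a Dirichlet distribution; the concentration values 0.1 and 10 are arbitrary choices made for the example, and the same intuition applies to beta, but over the words within a topic.

import numpy as np

rng = np.random.default_rng(42)

#Document-topic mixtures drawn over 3 topics with different alpha values
sparse_mix = rng.dirichlet([0.1, 0.1, 0.1])    # low alpha: one topic tends to dominate each document
dense_mix = rng.dirichlet([10.0, 10.0, 10.0])  # high alpha: every document mixes the topics more evenly
print("low alpha :", sparse_mix.round(2))
print("high alpha:", dense_mix.round(2))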

So while dealing with LDA in Natural Language Processing, it's up to you to mess with these parameters, or you can just use a Python library that will do the work for you… awesome, right?
Ladies and gentlemen, let me introduce Gensim, an open-source library for unsupervised topic modeling and natural language processing. This library provides you with a fast implementation of LDA; even better, it can learn the alpha and beta values for you.

Wanna see how it works? Just keep reading…

Topic Modeling on Expertime's Articles

I currently work at Expertime, a fast-growing company specializing in innovation and consulting on Azure and Office 365 based solutions such as DevOps, Data & Artificial Intelligence.
On our website, we have a blog where our experts publish articles about the latest technology news and insights. I thought it would be interesting to apply topic modeling to figure out the different topics my colleagues are talking about. I'm not spying on them… I'm just "passionately curious" 😊

I went on the Expertime Blog (you can find the French versions of a few of my articles out there) and picked only 3 articles to keep this demo simple.
To get the data, I did some web scraping using a couple of libraries: requests and BeautifulSoup.

#Imports
import requests
from bs4 import BeautifulSoup

#Scraping function: fetch a page and keep the text of its <p> tags
def url_to_transcript(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    text = soup.find(class_="wrap").find_all('p')
    print(url)
    return text

#Article links from Expertime's Blog
urls = ["https://expertime.com/blog/a/power-bi-api-sncf/",
        "https://expertime.com/blog/a/expertech-architecture-chatbot-azure-framework/",
        "https://expertime.com/blog/a/7-malentendus-blockchain/"]

Now we need to put everything into a Pandas DataFrame, which is basically a 2-dimensional labeled data structure; here, d is simply a dictionary mapping each article URL to its scraped text, as below:

#Store the articles into a dataframe
import pandas as pd
pd.set_option('max_colwidth', 150)

#d maps each article URL to its scraped text (one possible way to build it with url_to_transcript above)
d = {url: " ".join(p.get_text() for p in url_to_transcript(url)) for url in urls}
df = pd.DataFrame(d, index=['Speech']).transpose()

Once we're done, we apply some of the text pre-processing techniques we learned in the previous article, It's the same Hamburger!! Remember? (A minimal sketch of a few of these steps appears right after the list.)
• Make text lowercase
• Expand contractions
• Remove punctuations
• Spelling Correction
• Remove stop words
• Part of Speech filtering
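
For reference, here is a minimal sketch of a few of these steps (lowercasing, punctuation and digit removal, and stop-word removal) applied to the Speech column; the clean_text helper and the French stop-word list are assumptions made for this demo, not the exact pipeline from the previous article.

import re
import string
import nltk

nltk.download('stopwords', quiet=True)
french_stops = set(nltk.corpus.stopwords.words('french'))

def clean_text(text):
    #Lowercase, replace punctuation and digits with spaces, then drop French stop words
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text)
    return " ".join(word for word in text.split() if word not in french_stops)

df['Speech'] = df['Speech'].apply(clean_text)
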
This way we’ll end up with a clean DataFrame like this:

In the previous article, we also talked about the difference between structured and unstructured data, and how NLP provides us with a process for deriving meaningful information from text by applying a variety of algorithms. So, let's convert our text (unstructured data) into a more structured form. That's where a Document-Term Matrix comes in: every row is a document (an article in our case), and every column is a term. To build it, we're going to use CountVectorizer from scikit-learn.

#Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
#Remove common words that appear in more than 75% of the documents with max_df
cvna = CountVectorizer(max_df=0.75)
data_cvna = cvna.fit_transform(df.Speech)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names())
data_dtmna.index = df.index
data_dtmna.head()

Now that we've converted our text into a matrix, it becomes much easier to apply some mathematical algorithms. Let's do it! It's time to call LDA using Gensim.

#Gensim imports
from gensim import matutils, models
import scipy.sparse

#Create the gensim corpus (Sparse2Corpus expects terms as rows and documents as columns)
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))
#Create the vocabulary dictionary (index -> word)
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

One more line of code to apply LDA and we're done. As you can see below, all we need to do is choose the number of topics, the number of words describing each topic, and the number of passes (iterations) of the algorithm. It's literally as simple as that. Later on, you can mess with the alpha and beta parameters yourself, or just let Gensim do the work for you.

#LDA model result
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes = 1500)
ldana.print_topics(num_words=3)
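
If you'd rather not tune the priors by hand, Gensim's LdaModel can learn them from the corpus. Here is a sketch reusing the corpus and dictionary built above; switching alpha and eta (Gensim's name for beta) to 'auto' is the only change, shown purely as an option rather than as the settings used for the results below.

#Same model, but let Gensim learn asymmetric priors from the data
#alpha ~ document-topic density, eta ~ topic-word density
ldana_auto = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna,
                             passes=1500, alpha='auto', eta='auto')
ldana_auto.print_topics(num_words=3)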

The articles I picked are:
- Power BI et API SNCF : Une bonne association ? (Power BI and the SNCF API: a good match?)
- Architecture ChatBot avec Microsoft Azure et Bot Framework (Chatbot architecture with Microsoft Azure and Bot Framework)
- 7 malentendus courants au sujet de la Blockchain (7 common misconceptions about Blockchain)

And the results were really impressive!! I’ll let you judge for yourself.

To be continued …
