Category predictor

Jose Pablo
Jun 21, 2019

The purpose of this publication is to define what a category predictor is, and to build one.

A category predictor is used to predict the category to which a text belongs, so it’s commonly used in text classification.

We’ll use the following dataset and scikit-learn components:

  • 20 Newsgroup dataset
  • Multinomial naive Bayes classifier
  • TfidfTransformer
  • CountVectorizer

Before we begin, let’s briefly explain each of the items above.

20 Newsgroup dataset

This dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date. (source: The 20 newsgroups text dataset)

Multinomial Naive Bayes classifier

A naive Bayes model assumes that the features it uses are conditionally independent of one another given the class. The multinomial model captures word frequency information in documents. A document is an ordered sequence of word events, drawn from the same vocabulary V. (source: A comparison of event models for naive Bayes text classification)

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p_1, …, p_n), where p_i is the probability that event i occurs.
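
To make the event model more concrete, here is a minimal sketch that scores a document by hand using only the standard library. The toy corpus, the labels and the helper names (log_score, predict) are made up for this illustration, and the add-one smoothing is an assumption chosen to keep unseen words from zeroing out a class. The multinomial coefficient is left out because it is the same for every class and doesn’t change which class wins.

```python
import math
from collections import Counter

# Toy training corpus, made up for this illustration: (tokens, label)
train = [
    ("rocket orbit launch".split(), "space"),
    ("orbit moon rocket rocket".split(), "space"),
    ("engine helmet ride".split(), "moto"),
    ("ride engine road".split(), "moto"),
]

vocab = sorted({w for tokens, _ in train for w in tokens})
labels = sorted({label for _, label in train})

# Word counts per class, and number of documents per class (for the priors)
counts = {c: Counter() for c in labels}
docs_per_class = Counter()
for tokens, label in train:
    counts[label].update(tokens)
    docs_per_class[label] += 1


def log_score(tokens, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    total = sum(counts[label].values())
    score = math.log(docs_per_class[label] / len(train))
    for w in tokens:
        if w not in vocab:
            continue  # words never seen during training are ignored
        p_w = (counts[label][w] + alpha) / (total + alpha * len(vocab))
        score += math.log(p_w)
    return score


def predict(tokens):
    """Return the label with the highest log score."""
    return max(labels, key=lambda c: log_score(tokens, c))


print(predict("rocket launch".split()))  # space
print(predict("ride road".split()))      # moto
```

Each p_w plays the role of one p_i above: the probability that a given word event occurs in a document of that class.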

Tfidf(Transformer)

tf-idf is short for term frequency–inverse document frequency. It is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (source: Data mining)

  • Term Frequency: this summarizes how often a given word appears within a document.
  • Inverse Document Frequency: this downscales words that appear a lot across documents.

TfidfTransformer transforms a count matrix to a normalized tf or tf-idf representation. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
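
As a small sketch of that scaling, the snippet below runs TfidfTransformer on a made-up count matrix and reproduces the idf values by hand. It assumes the transformer’s default settings (smoothed idf and L2 row normalization); the counts and variable names are only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up term counts: 3 documents (rows) x 3 terms (columns).
# Term 0 appears in every document, terms 1 and 2 in one document each.
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 1, 0],
])

transformer = TfidfTransformer()  # defaults: smooth_idf=True, norm='l2'
weights = transformer.fit_transform(counts).toarray()

# Reproduce the smoothed idf by hand: ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(transformer.idf_)   # idf learned by the transformer
print(manual_idf)         # same values, computed manually
print(weights.round(3))   # each row has unit L2 norm
```

In the first document the raw counts of term 0 and term 2 are 3 and 1, but after weighting the ratio between them is smaller than 3, because term 0 appears in every document and gets the lowest idf.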

CountVectorizer

CountVectorizer provides a simple way to both tokenize a collection of text documents and build a vocabulary of known words, but also to encode new documents using that vocabulary. (source: How to prepare text data for machine learning with scikit-learn)
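
Here is a short sketch of both uses, building a vocabulary and then encoding a new document with it. The two sentences are made up for this illustration, and the default tokenizer settings are assumed (lowercasing, tokens of at least two characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
docs = ["the rocket reached orbit", "the rocket engine failed"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # learn the vocabulary and count terms

# vocabulary_ maps each known term to its column index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(counts.toarray())

# Encoding a new document reuses the learned vocabulary;
# 'moon' was never seen during fit, so it is silently dropped
new_counts = vectorizer.transform(["rocket rocket moon"])
print(new_counts.toarray())
```

This is why, later on, the test sentences have to go through transform and not fit_transform: the columns must line up with the ones the model was trained on.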

Let’s code!

Libraries:

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

We can define a collection of categories to fetch from the dataset; if we don’t, all the categories are fetched. These are the available categories:

alt.atheism,
comp.graphics,
comp.os.ms-windows.misc,
comp.sys.ibm.pc.hardware,
comp.sys.mac.hardware,
comp.windows.x,
misc.forsale,
rec.autos,
rec.motorcycles,
rec.sport.baseball,
rec.sport.hockey,
sci.crypt,
sci.electronics,
sci.med,
sci.space,
soc.religion.christian,
talk.politics.guns,
talk.politics.mideast,
talk.politics.misc,
talk.religion.misc

For the purpose of this entry we’ll define a map with some of them. You’ll see the purpose of the key-value pair later.

cats_map = {'sci.space': 'Space', 'rec.motorcycles': 'Motorcycles',
            'misc.forsale': 'Sales', 'comp.graphics': 'Graphics',
            'talk.politics.guns': 'Guns debate',
            'rec.sport.baseball': 'Baseball'}

Fetch the train data:

Notice that scikit-learn provides us a function for fetching the 20 Newsgroup dataset, so we don’t need to manually import it.

The previously defined category collection is passed as the value of the categories parameter. random_state is an integer seed used to shuffle the dataset.

train_data = fetch_20newsgroups(subset='train',
    categories=list(cats_map.keys()), shuffle=True, random_state=45)

Now we create the count vectorizer object, and get the term count.

# Bag of words
count_vectorizer = CountVectorizer()
train_term_count = count_vectorizer.fit_transform(train_data.data)

Then we create the TfidfTransformer object, and fit our train_term_count data and transform it.

tfidf = TfidfTransformer()
x_train_tfidf = tfidf.fit_transform(train_term_count)

Next we train our model.

classifier = MultinomialNB().fit(x_train_tfidf, train_data.target)
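
As a side note, the same three steps can also be chained with scikit-learn’s Pipeline, so that fit and predict run the vectorizer, the tf-idf transformer and the classifier in order. This is only a sketch of an alternative arrangement, not what the rest of this post uses; the toy sentences, the labels and the step names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, made up for this illustration
texts = [
    "the rocket reached orbit",
    "astronauts launch a rocket to the moon",
    "the pitcher threw a fastball",
    "a home run won the baseball game",
]
labels = ["space", "space", "baseball", "baseball"]

# Step names ('counts', 'tfidf', 'nb') are arbitrary labels
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
text_clf.fit(texts, labels)

# predict() applies transform on the first two steps, then the classifier
predicted = text_clf.predict(["rocket to the moon", "the pitcher won the game"])
print(list(predicted))
```

The benefit of this arrangement is that raw text goes in and labels come out, so there is no risk of forgetting to apply one of the transforms to new data.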

At this point the model is trained and ready to be used with test data. Instead of using the test subset of the 20 Newsgroup dataset, we’ll write a few sentences and use them as our test input.

test_data = [
    'I love ride it, and feel the wind in my face, that is freedom!',
    'Rendering could take plenty time',
    'Howard Bruce Sutter is a former right-handed relief pitcher.',
    'There really is no such thing as race. We all came from Africa. We are all of the same stardust. We are all going to live and die on the same planet, a Pale Blue Dot in the vastness of space. We have to work together',
    'It looks like the popular opinion is to regulate the guns market',
    'Microsoft CEO Satya Nadella has transformed more than the stock price'
]
test_term_count = count_vectorizer.transform(test_data)
x_test_tfidf = tfidf.transform(test_term_count)

Finally we’ll get the predictions from the model

predictions = classifier.predict(x_test_tfidf)

Let’s print them

# predictions holds integer labels; target_names maps them back to category names
for text, prediction in zip(test_data, predictions):
    print('\nInput:', text, '\nPredicted category:', cats_map[train_data.target_names[prediction]])

Result:

Input: I love ride it, and feel the wind in my face, that is freedom! 
Predicted category: Motorcycles
Input: Rendering could take plenty time
Predicted category: Graphics
Input: Howard Bruce Sutter is a former right-handed relief pitcher.
Predicted category: Baseball
Input: There really is no such thing as race. We all came from Africa. We are all of the same stardust. We are all going to live and die on the same planet, a Pale Blue Dot in the vastness of space. We have to work together
Predicted category: Space
Input: It looks like the popular opinion is to regulate the guns market
Predicted category: Guns debate
Input: Microsoft CEO Satya Nadella has transformed more than the stock price
Predicted category: Sales
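
Notice that the classifier always answers with one of the categories it was trained on, even for a sentence that doesn’t really belong to any of them (the last input is arguably such a case). One way to see how sure the model is would be predict_proba, which MultinomialNB provides. The sketch below uses a made-up toy corpus and a hypothetical helper, class_probabilities, instead of the 20 Newsgroup data, so the numbers say nothing about the model trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, made up for this illustration
texts = [
    "the rocket reached orbit",
    "astronauts launch a rocket to the moon",
    "the pitcher threw a fastball",
    "a home run won the baseball game",
]
labels = ["space", "space", "baseball", "baseball"]

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
x_train = tfidf.fit_transform(vectorizer.fit_transform(texts))
model = MultinomialNB().fit(x_train, labels)


def class_probabilities(sentence):
    """Return a {category: probability} dict for one sentence."""
    x = tfidf.transform(vectorizer.transform([sentence]))
    probs = model.predict_proba(x)[0]
    return dict(zip(model.classes_, probs))


on_topic = class_probabilities("rocket to the moon")
# None of these words are in the vocabulary, so only the class priors remain
off_topic = class_probabilities("quarterly earnings disappointed investors")

print(on_topic)
print(off_topic)
```

When no word of the input is known to the vectorizer, the probabilities fall back to the class priors, so a near-uniform result is a hint that the prediction shouldn’t be trusted much.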

This publication is mainly for learning purposes, so it has room for improvement, and it lacks the step of getting metrics from the model and analyzing them. Still, I hope you find it useful.

Please don’t hesitate to leave comments to improve it, or to ask questions if anything is unclear.

Jose Pablo

Full time Sr. software engineer and part time MSc in Computer Science student with interest in AI, DL, HPC, and computer graphics. Love outdoors. Foodie.