Term-document matrix guide with its implementation in R and Python

In natural language processing, text must be preprocessed and converted into a numerical form before mathematical operations can be applied to it. The term-document matrix is one such representation: it converts textual data into a matrix. In this article, we discuss the term-document matrix and see how to build one, with practical implementations in R and Python for better understanding. The main points covered are listed below.

Contents

  1. What is a Term-Document Matrix?
  2. Implementation in R
  3. Implementation in Python
    1. Using pandas
    2. Using textmining
  4. Applications of the Term-Document Matrix

Let’s start the discussion by understanding what the term-document matrix is.

What is a Term-Document Matrix?

Natural language processing offers many ways of representing textual data, and the term-document matrix is one of them. In this method, the text data is represented as a matrix. In the layout used here, the rows represent the sentences (documents) to be analyzed and the columns represent the words (terms). Each cell holds the number of times the word occurs in the sentence. Strictly speaking, this orientation is usually called a document-term matrix, and the term-document matrix is its transpose; the two names are often used loosely. Let’s understand this with an example.

Index Sentences
1 I like football
2 Messi is a great football player
3 Messi has won seven Ballon d’Or awards

Here we can see a set of sentences. The term-document matrix for these sentences looks like this:

Sentence                                 I  like  football  Messi  is  a  great  player  has  won  seven  Ballon d’Or  awards
I like football                          1  1     1         0      0   0  0      0       0    0    0      0            0
Messi is a great football player         0  0     1         1      1   1  1      1       0    0    0      0            0
Messi has won seven Ballon d’Or awards   0  0     0         1      0   0  0      0       1    1    1      1            1

The table above is a term-document matrix. From it we can get the total number of occurrences of each word in the whole corpus, and analyzing these counts can yield many useful results. The term-document matrix is one of the most common representations used in natural language processing and the analysis of textual data. More formally, it represents the relationship between the words and the sentences in the corpus.
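The construction described above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming naive whitespace tokenization and no case or punctuation handling; the variable names are our own, and because "Ballon d'Or" splits into two tokens here, the sketch produces 14 columns rather than the 13 in the table.

```python
from collections import Counter

# The three example sentences from the table above.
docs = [
    "I like football",
    "Messi is a great football player",
    "Messi has won seven Ballon d'Or awards",
]

# Naive whitespace tokenization (an assumption; real pipelines also
# normalize case and punctuation).
tokenized = [doc.split() for doc in docs]

# Vocabulary in order of first appearance, so columns follow the sentences.
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

# One row per sentence, one column per word, each cell a count.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

# Column sums give each word's total occurrences in the corpus.
totals = {word: sum(row[j] for row in matrix) for j, word in enumerate(vocab)}

for doc, row in zip(docs, matrix):
    print(row, doc)
print("football:", totals["football"], "Messi:", totals["Messi"])
```

The column sums are what the paragraph above means by the total number of occurrences of a word in the whole corpus.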

Since R and Python are two common languages used for NLP, let’s see how we can build a term-document matrix in both of them. Let’s start with R.

Implementation in R

In this section, we see how to create a term-document matrix in R. For this purpose, we need to install the tm (text mining) library in our environment.

Library installation:

install.packages("tm")

The line above installs the tm text mining library. Besides term-document and document-term matrices, the library provides various other text mining facilities.

Library import:

library(tm)

The line above loads the library.

Data import:

To build a term-document matrix in R, we use the crude dataset that ships with the tm library: a volatile corpus (VCorpus) of 20 news articles dealing with crude oil.

data("crude")

Let’s inspect the crude VCorpus:

inspect(crude[1:2])

Output:

The output shows the character counts and metadata of the VCorpus. For more information, we can use R’s help function.

help(crude)

Output:

We could also build the term-document matrix from another type of corpus, but we use a VCorpus here because the result is easier to interpret after conversion to a term-document matrix.

Creation of a term-document matrix:

tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
tdm

Output:

Here we can see the details of the term-document matrix. Let’s inspect some of its values.

inspect(tdm[100:110, 1:9])

Output:

The output shows some of the values from the term-document matrix along with some information about them. We can also look up values for chosen words and documents.

inspect(tdm[c("price", "prices", "texas"), c("127", "144", "191", "194")])

Output:

We can also build the document-term matrix using a function provided by the tm library:

dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                         function(x)
                                         weightTfIdf(x, normalize =
                                                     FALSE),
                                         stopwords = TRUE))
dtm

Output:

Let’s inspect the document-term matrix.

inspect(dtm)

Output:

The term-document matrix and the document-term matrix hold the same information in transposed form: terms are rows in the former and columns in the latter. In the examples above they also differ in weighting, because of the options we passed: the term-document matrix uses the default term frequency (TF) weighting, while the document-term matrix uses term frequency-inverse document frequency (TF-IDF) weighting.
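To make the two differences concrete, here is a small plain-Python sketch of the transposition and of TF versus TF-IDF weighting. The counts are hypothetical, and the inverse document frequency uses the log2(N / df) form without normalization, which is one common variant and an assumption on our part rather than a statement about what any library does internally.

```python
import math

# Hypothetical raw counts: rows are documents, columns are terms
# (document-term orientation).
terms = ["football", "messi", "player"]
dtm = [
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

# Transposing swaps the orientation: rows become terms (term-document).
tdm = [list(col) for col in zip(*dtm)]

n_docs = len(dtm)

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in dtm if row[j] > 0) for j in range(len(terms))]

# Inverse document frequency, log2(N / df) variant (an assumption).
idf = [math.log2(n_docs / d) for d in df]

# TF-IDF: scale each raw count (TF) by the IDF of its term.
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in dtm]

for term, row in zip(terms, tdm):
    print(term, row)
for row in tfidf:
    print([round(x, 3) for x in row])
```

A term that appears in fewer documents ("player" here) ends up with a larger weight than one that appears in most of them, which is the point of the IDF factor.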

We can draw a word cloud from the document-term matrix we made earlier using the following code, which needs the wordcloud package:

library(wordcloud)     # install.packages("wordcloud") if it is not installed
library(RColorBrewer)  # provides brewer.pal()
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wordcloud(names(freq), freq, min.freq=400, max.words=Inf, random.order=FALSE, colors=brewer.pal(8, "Accent"), scale=c(7,.4), rot.per=0)

The resulting word cloud shows that the data needs more cleaning to give proper results. Since the aim of this article is to learn the basic implementation of the document-term matrix, we will stay focused on that. Let’s see how we can do the same in Python.

Implementation in Python

In this section, we see how to build the document-term matrix in Python using libraries from its ecosystem. There are several ways to do this. Before going through any of them, let’s define the documents, taking the sentences from the table above.

sentence1 = "I like football"
sentence2 = "Messi is a great football player"
sentence3 = "Messi has won seven Ballon d’Or awards"

As mentioned, Python offers several ways to do this. Here we discuss two of the easiest. The first builds the term-document matrix using pandas and scikit-learn. Let’s see how.

Using pandas

Importing Libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Adding sentences

docs = [sentence1, sentence2, sentence3]
print(docs)

Output:

Defining the count vectorizer and fitting it to the documents:

vec = CountVectorizer()
X = vec.fit_transform(docs)

Converting vector to dataframe using pandas

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
df.head()

Output:

Here we can see the term-document matrix of the documents we defined. Now let’s look at the second way, which uses a library named textmining that provides a class for building a term-document matrix from text data.
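One detail worth knowing about the CountVectorizer result above: by default it lowercases the text and keeps only tokens of two or more word characters, so one-letter words such as "I" and "a" get no column. The sketch below reproduces that default tokenization with Python's re module alone. KEEP_SHORT_PATTERN is a hypothetical name of our own, not part of scikit-learn; it shows a pattern that could be passed as the token_pattern argument if one-letter words matter.

```python
import re

# Default token pattern documented for CountVectorizer: runs of two or
# more word characters, so one-letter tokens are skipped.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative that also keeps one-letter tokens.
KEEP_SHORT_PATTERN = r"(?u)\b\w+\b"

text = "Messi is a great football player"

# CountVectorizer lowercases by default, so we do the same here.
default_tokens = re.findall(DEFAULT_PATTERN, text.lower())
all_tokens = re.findall(KEEP_SHORT_PATTERN, text.lower())

print(default_tokens)
print(all_tokens)
```

This explains why the pandas DataFrame has fewer columns than the hand-built table earlier in the article.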

Using textmining

Installing the library:

pip install textmining3

Output:

Initializing the term-document matrix:

import textmining
tdm = textmining.TermDocumentMatrix()
print(tdm)

Output:

The output shows the type of the object we created to hold the term-document matrix.

Adding the documents to the matrix:

tdm.add_doc(sentence1)
tdm.add_doc(sentence2)
tdm.add_doc(sentence3)

Converting the term-document matrix into a pandas DataFrame:

tdm=tdm.to_df(cutoff=0)
tdm

Output:

Here we can see the term-document matrix that we created using the textmining library.

Applications of the Term-Document Matrix

Building a term-document matrix from text data is usually an intermediate step within a larger NLP project. The matrix can be used in different types of NLP tasks, including:

  • Performing a singular value decomposition of the term-document matrix (the basis of latent semantic analysis) can improve search results to some extent. In a search engine, it helps by disambiguating polysemous words and by finding synonyms of the query terms.
  • Most NLP processes focus on mining one or more kinds of behavioral data from a text corpus, and term-document matrices are very useful for extracting it. By performing multivariate analysis on the document-term matrix, we can uncover the different themes in the data.
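As a rough illustration of the first point, the sketch below finds only the leading singular triplet of a small matrix by power iteration, which is the simplest piece of a singular value decomposition. It is a toy under stated assumptions: the five-term subset of the example sentences, the function names, and the iteration count are all our own choices, and a real project would call a linear algebra library for a full decomposition.

```python
import math

# Counts for the three example sentences over a hypothetical 5-term
# subset (rows: sentences, columns: terms).
terms = ["like", "football", "messi", "great", "won"]
A = [
    [1, 1, 0, 0, 0],  # I like football
    [0, 1, 1, 1, 0],  # Messi is a great football player
    [0, 0, 1, 0, 1],  # Messi has won seven Ballon d'Or awards
]

def matvec(M, x):
    """Multiply matrix M by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def leading_singular_triplet(M, iterations=200):
    """Power iteration on M^T M; returns (sigma, u, v) with M v = sigma u."""
    Mt = transpose(M)
    v = [1.0] * len(M[0])
    for _ in range(iterations):
        w = matvec(Mt, matvec(M, v))
        n = norm(w)
        v = [wi / n for wi in w]
    Mv = matvec(M, v)
    sigma = norm(Mv)
    u = [x / sigma for x in Mv]
    return sigma, u, v

sigma, u, v = leading_singular_triplet(A)

# Rank-1 approximation sigma * u * v^T: a smoothed version of A.
rank1 = [[sigma * ui * vj for vj in v] for ui in u]

for row in rank1:
    print([round(x, 2) for x in row])
```

In this toy matrix, the rank-1 approximation gives the first sentence a nonzero weight for "messi" even though the word does not occur in it, because both are linked through "football". That kind of indirect association is what makes the decomposition useful for matching related terms.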

Final words

In this article, we have seen what a term-document matrix is, with an example, and how to build one in R and Python. We have also discussed its main applications.

About Florence L. Silvia
