Process and learnings from an NLP project with SpaCy
A few days ago I started a new NLP project. The goal of the project is to automatically evaluate and classify the content of texts that are collected as part of larger surveys.
There are two different levels of complexity that I see in this problem. Complexity level 1: A question to be answered by participants could be, for example, "Which aspects of the product do you like best?". It can be assumed that the sentiment associated with the statement is positive in most observations, since the question explicitly asks about characteristics the respondent likes. Complexity level 2: It gets more complicated with more open questions, for example when respondents are asked whether they have any further comments or remarks. In these cases, not only the topic of the text but also the sentiment associated with it must be determined.
Course of the project
For the text processing I use the open source library SpaCy and write the code in Python. Since the texts I work with in this project are in German, I use the model "de_core_news_sm".
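As a minimal sketch (assuming the model has already been downloaded, e.g. via `python -m spacy download de_core_news_sm`), setting up the pipeline looks like this:

```python
import spacy

# Load the small German pipeline; it provides tokenization,
# sentence segmentation, part-of-speech tags and a dependency parser.
nlp = spacy.load("de_core_news_sm")
```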
After everything for development was set up, I first concentrated on the somewhat simpler question, complexity level 1, where I already knew the sentiment behind the observations in nearly all cases. First I transferred all observations into a list of sentences; for this I used some basic string operations and SpaCy's sentence segmentation. Then I reduced the sentences to noun chunks and output the frequency of the individual chunks. In this way one can get a feeling for which specific areas were often mentioned positively, e.g. which five nouns were mentioned most frequently in connection with a positive sentiment. I used the same procedure to determine which nouns were used most frequently when the observation's sentiment was negative.
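The following sketch shows how this frequency analysis could look. The example observations are made up for illustration, standing in for the actual survey responses:

```python
from collections import Counter

import spacy

nlp = spacy.load("de_core_news_sm")

# Hypothetical example observations; in the project these come from the survey data.
observations = [
    "Mir gefällt die einfache Bedienung. Der Support antwortet sehr schnell.",
    "Die Bedienung ist intuitiv und die Dokumentation ist hilfreich.",
]

# Split every observation into single sentences via spaCy's sentence segmentation.
sentences = []
for text in observations:
    doc = nlp(text)
    sentences.extend(sent.text.strip() for sent in doc.sents)

# Reduce each sentence to its noun chunks and count how often each lemma occurs.
noun_counts = Counter()
for sentence in sentences:
    doc = nlp(sentence)
    noun_counts.update(chunk.lemma_.lower() for chunk in doc.noun_chunks)

# The five noun chunks mentioned most frequently in these (here: positive) observations.
print(noun_counts.most_common(5))
```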
For complexity level 2 a different approach is needed: a sentiment analysis. I started again by creating a list of all single sentences and reused parts of my previous code. By examining the most frequently mentioned topics, it is possible to identify specific areas in the observations that can be transferred into the model. Based on the previously identified high-frequency topics, I created a dictionary and assigned certain terms to it. For example, one of the topics was programs for video chat; I created this topic in my model and used a lexicon to assign the different video chat programs mentioned in the observations to it. In a further step I defined enhancing (e.g. "very"), weakening (e.g. "a bit") and inverting (e.g. "not") terms and added them to the model. With the help of these extensions the sentiment associated with a topic can be determined more precisely. The importance of combining the definition of topics with these extended evaluation methods can be seen in the following example. For a text such as "The Zoom tool is not very good", "Zoom" is recognised as the topic of the sentence, but we can now also recognise "not" as a negation, "very" as an intensifier and "good" as a positive term. So we are able to recognise that the overall impression of the tool Zoom is slightly negative in this observation.
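A minimal sketch of such a rule-based scorer is shown below. All lexica here are made-up placeholders standing in for the terms derived from the frequency analysis, and the flip-and-damp handling of negation is just one possible heuristic:

```python
import spacy

nlp = spacy.load("de_core_news_sm")

# Hypothetical lexica for illustration; the real ones come from the corpus analysis.
TOPICS = {"zoom": "video chat", "skype": "video chat", "teams": "video chat"}
POSITIVE = {"gut", "toll", "hilfreich"}
NEGATIVE = {"schlecht", "umständlich"}
INTENSIFIERS = {"sehr", "besonders"}   # enhancing terms
DIMINISHERS = {"etwas", "bisschen"}    # weakening terms
NEGATIONS = {"nicht", "kein"}          # inverting terms

def score_sentence(sentence: str):
    """Return (topic, score): score > 0 is positive, score < 0 negative."""
    doc = nlp(sentence)
    topic = None
    score = 0.0
    negated = False
    weight = 1.0
    for token in doc:
        word = token.lower_
        if word in TOPICS:
            topic = TOPICS[word]
        elif word in NEGATIONS:
            negated = True
        elif word in INTENSIFIERS:
            weight *= 1.5
        elif word in DIMINISHERS:
            weight *= 0.5
        elif word in POSITIVE or word in NEGATIVE:
            value = (1.0 if word in POSITIVE else -1.0) * weight
            if negated:
                # Flip and damp: "nicht sehr gut" ("not very good") should come
                # out mildly negative, not strongly negative.
                value *= -0.5
            score += value
            negated, weight = False, 1.0
    return topic, score

# "Das Zoom Tool ist nicht sehr gut" -> ("video chat", -0.75): slightly negative.
print(score_sentence("Das Zoom Tool ist nicht sehr gut"))
```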
My overall impression of SpaCy is very good: simple classifications as well as quick, individual adjustments are easy to implement. A step I would like to take in the future is to label the data in order to enable deep learning and train the model.