
Data Preparation

This section describes the relevant data cleaning methods employed to render the raw employee survey data suitable for topic modeling and sentiment classification. We also provide a brief introduction to the topic of data preparation.

Introduction

"Data" refers to a collection of facts such as numbers, words, measurements, or other observations [1]. Data can be inherently structured, meaning that there already exist clearly-defined data types and relationships, or otherwise unstructured, implying that data lacks structure which makes it readily navigable, searchable, or interpretable. In our case, we work with unstructured, natural language (text) data, which requires several stages of preprocessing and feature engineering to satisfy the analysis methods we employ. There exist three common challenges to data analysis:

Missing Data: Data or metadata are unavailable. With our analysis data, we possess only raw employee survey comments associated with prompting questions. For anonymity reasons, we lack metadata about the respondents, which limits the analyses we can perform.

Noisy Data: Data contains outliers or other inconsistencies which obscure the patterns we expect or aim to find. In our case, the employee survey data contains many irrelevant comments, spelling errors, and nonsensical phrases which must be removed or corrected. Below we describe in greater detail several preprocessing steps that aim to combat this problem. Additionally, we propose a novel spelling correction algorithm, described in a later section.

Inconsistent Data: Data is inconsistently recorded. This includes formatting inconsistencies such as letter-casing or inconsistent data types.

Data Preparation Techniques

Below we list the data preparation methods we employ to combat the issues of noisy and inconsistent data.

Lowercasing: To render all words in a natural language dataset comparable and remove duplicates resulting from capitalization differences, it is advisable to make all words lowercase. This helps to maintain data consistency.

Number Removal: For our semantic analysis, numbers are not a relevant feature and only introduce additional noise. Removing them is therefore a useful form of denoising.

Punctuation Removal: Many punctuation marks are irrelevant to our semantic analysis or are misused, introducing additional noise which is best removed. We remove the symbols in the following list: [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]

Term Frequency Inverse Document Frequency (TFIDF): Although not a form of data cleaning per se, for several stages of our analysis it is useful to introduce additional statistical features which help us structure the representation of the texts. TFIDF is a form of statistical feature engineering in which each term is weighted by how often it appears in a given document (term frequency, TF), multiplied by the inverse of the proportion of documents in the corpus that contain the term (inverse document frequency, IDF). Terms that are frequent within a document but rare across the corpus thus receive high scores. For further information on how the TFIDF score is used to inform our document vector representations, please refer to the Topic Modeling section of this report.

Tokenization: Tokenizing a document signifies splitting the document into its component words or sentences. In order to isolate words as the atomic constituents of document meaning, we tokenize our documents into words as an important preprocessing step.

Stemming: Stripping inflectional endings from words and reducing them to their stems controls for morphological variations which would otherwise confuse the association of closely related terms.

Lemmatization: As an alternative to stemming, reducing words to their "lemmata", i.e. their uninflected dictionary forms, is also an effective means of controlling for variation in inflection and declension.

Stopword Removal: Many words such as "a," "the," and "from" do not contribute significant meaning to a sentence but are commonly repeated connecting particles which, for all intents and purposes, constitute dataset noise. We remove an exhaustive list of English "stopwords" in order to exclude these tokens from the analysis; a sketch combining the steps above follows below.
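To illustrate how these steps fit together, the following is a minimal sketch of such a preprocessing pipeline using NLTK and scikit-learn. The libraries, the example comments, and details such as the stopword list are illustrative assumptions rather than our exact implementation.

# Minimal preprocessing sketch: lowercasing, number and punctuation removal,
# tokenization, lemmatization, stopword removal, and TFIDF vectorization.
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()                                                # lowercasing
    text = re.sub(r"\d+", "", text)                                    # number removal
    text = text.translate(str.maketrans("", "", string.punctuation))   # punctuation removal
    tokens = nltk.word_tokenize(text)                                  # tokenization
    tokens = [lemmatizer.lemmatize(t) for t in tokens]                 # lemmatization
    return [t for t in tokens if t not in stop_words]                  # stopword removal

# Two illustrative survey comments, cleaned and vectorized with TFIDF.
comments = ["The delivery of our products was delayed 3 times!",
            "Managers should communicate deadlines more clearly."]
cleaned = [" ".join(preprocess(c)) for c in comments]
tfidf_matrix = TfidfVectorizer().fit_transform(cleaned)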

Through the methods described above we are able both to remove noise from the dataset and to control the representation of the natural language such that the independent variables inherent to the unstructured data are, as far as possible, only the semantic meaning of the individual texts. Below we describe the spelling correction algorithm we developed to further correct inconsistencies in our dataset.

Spelling Correction

The Problem

When working with human-generated text data that has not been quality-controlled, lexical errors can be problematic for automatic analysis. Automatically correcting these errors is a non-trivial task. The task itself can be split into three sub-tasks - error detection, candidate generation, and candidate selection - each with its own set of challenges.

Error detection is the task of finding incorrect words in a sentence. Errors fall into two major categories: non-word errors (NWE) and word errors (WE). Non-word errors are errors that are not real words, e.g. when someone writes "thre" instead of "three". Word errors, on the other hand, are correct words used in the wrong context, e.g. when someone writes "tree" instead of "three".

Candidate generation is the task of generating a list of words that could be the word the author meant to use, given an incorrect word.

Candidate selection is the task of selecting the most likely candidate from the generated list of candidates with which to replace the detected error.

When pre-processing answers to survey questions, the answers' semantic information must be preserved. Out-of-the-box spelling correction solutions failed us here in two ways. Firstly, they occasionally considered correctly spelled words incorrect and attempted to correct them, changing them into different words in the process. Secondly, the generated (and selected) candidates for an incorrect word were often completely different words, e.g. "bark" instead of "book".

The Approach

TransSpell is an attempt to solve the problem of spelling correction in a robust way, emphasizing the minimization of false positives in error detection. Its goal is to apply spelling correction only to words that can most definitely be considered "false", and for those corrections to be correct. The error detection rate does not need to be perfect, but it should not produce any false positives. TransSpell is the successor to FastSpell, a spelling correction approach that was developed for this Study Project but abandoned due to performance issues.

Error Detection

We first attempted to develop context-sensitive error detection, i.e. error detection capable of also detecting word errors (WE). When this approach yielded too many false positives, we reverted to a modified version of the rule-based error detection used during FastSpell development.

Context-Sensitive Approach

Given a sentence consisting of n tokens, we create n substrings in which one token is masked (i.e. treated as a potential error); in substring 2, for example, the second token of the sentence is masked. We then let the candidate generation component (see Candidate Generation below) determine the 50 most likely lexical candidates for the masked position, given the remaining (unmasked) tokens of the sentence. If the original token is not among the generated candidates, it is considered an error. Unfortunately, this approach comes with two major flaws, in addition to high computational costs that reduce its scalability.
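As an illustration, the following is a minimal sketch of this masking procedure. The function generate_candidates stands in for the candidate generation component described below; it is an assumed helper for the sketch, not part of TransSpell's actual interface.

# Context-sensitive error detection sketch: mask each token in turn and
# check whether the original token is among the model's top candidates.
def detect_errors(tokens, generate_candidates, top_k=50):
    """Return the indices of tokens that are flagged as errors."""
    error_indices = []
    for i, token in enumerate(tokens):
        masked = tokens[:i] + ["<mask>"] + tokens[i + 1:]   # mask token of the underlying model
        candidates = generate_candidates(" ".join(masked), top_k=top_k)
        if token.lower() not in (c.lower() for c in candidates):
            error_indices.append(i)                         # original token not predicted: flag it
    return error_indices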

Firstly, by masking a word and letting the model determine a likely candidate for that word, we assert that the surrounding context is correct. Let us assume a sentence with a word error in the third token, e.g. "We made ensure that our customers get the products on time", where the author really meant "We made sure [...]". Error detection iterates through the sentence sequentially, thereby checking "made" before "ensure". Given the context, i.e. "we [mask] ensure that [...]", the candidate generation component will not generate the correct "made" but will instead deem the actual token erroneous and supply something along the lines of "must". The sentence now reads "we must ensure that [...]", which is contextually coherent but not what the original author had in mind. Unless the contextual error occurs in the first checked position of a sentence, this approach makes the context fit the original error, rather than making the error fit the context.

Secondly, this approach to error-detection can be summarized as "making sure a given sentence is a sequence of words that is statistically likely". This entails that, rather than "just" looking for errors, we remodel sentences according to what the pre-trained model used for candidate generation has learned sentences to look like. For example, the sentence "we work hard to meet consumer requirements" is changed to "we work hard to meet minimum requirements." This is very problematic because it changes the meaning of the sentence, reproducing one of the main issues of the out-of-the-box approaches.

Rule-Based Approach

For the reasons described above, context-sensitive error detection is currently too unreliable. However, context-insensitive error detection introduces new problems that require consideration. Usually, such approaches take a finite list of correct words, such as a lexicon, and check whether a token is contained in that list. As is typical for such resources, these lists are not exhaustive. While there are options available that are potentially exhaustive (e.g. the Merriam-Webster API), they are not free and are therefore not feasible for this project. We therefore introduce a list of additional conditions for the correctness of a word besides "membership in a lexicon", in order to reduce the number of false positives in error detection.

The complete process of context-insensitive error detection for a given token is as follows (in order):

  1. Check whether the token is shorter than four characters or, if the source sentence is properly capitalized, matches a regex modelling acronyms. If either condition is met, do not check further and assume the token is not an error. This aims to protect acronyms and similar domain-specific terms from being considered errors.
  2. Check whether the token occurs more than 10 times in the entire corpus. If it does, do not consider it an error.
  3. Check whether the token is contained in a lexicon of American English or a lexicon of British English.

If none of these conditions is met, consider the word an error. Please note that the second criterion requires the spelling corrector to have information about the content of the corpus, and requires the corpus itself to be of a certain size for the measure to be effective.
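The following is a minimal sketch of this rule-based check. The acronym regex, the lexicon set, and the corpus frequency table are assumptions standing in for the actual resources used by TransSpell.

# Rule-based error detection sketch. `lexicon` is a set of correct American
# and British English words; `corpus_counts` maps each token to its frequency
# in the survey corpus (e.g. a collections.Counter). Both are built elsewhere.
import re

ACRONYM_RE = re.compile(r"^[A-Z]{2,6}s?$")   # hypothetical acronym pattern

def is_error(token, corpus_counts, lexicon, sentence_is_capitalized=True):
    # 1. Protect short tokens and, in properly capitalized text, acronyms.
    if len(token) < 4:
        return False
    if sentence_is_capitalized and ACRONYM_RE.match(token):
        return False
    # 2. Tokens that occur more than 10 times in the corpus are trusted.
    if corpus_counts.get(token.lower(), 0) > 10:
        return False
    # 3. Otherwise, fall back to lexicon membership.
    if token.lower() in lexicon:
        return False
    return True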

Candidate Generation

Candidate generation is based solely on the context of the erroneous word. We simply mask the token in question (i.e. replace it with a mask token) and let a pretrained transformer model predict which token should occupy the given position based on its context. Sentence length is an important factor in how well this approach performs; for one-word answers, for example, it does not work at all. Furthermore, the pre-trained models seem to weigh the positional context of the first and last tokens of a sequence disproportionately, e.g. by suggesting that the first token be replaced with bullet points or similar beginning-of-line indicators. Because of this, this approach only works properly on tokens that are preceded by at least one token and followed by at least one token.

A Note on the Implementation

As a description of the BERT architecture and its derivatives far exceeds the scope of this section, please refer to the original paper [5] or the Sentiment Analysis section for more information. For TransSpell, we use a pretrained, distilled version of a RoBERTa model (i.e. a more lightweight model based on the more complex BERT architecture) with 82M parameters. The model itself is a PyTorch-based implementation accessed through HuggingFace's transformers library.
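For illustration, the following sketch generates candidates with HuggingFace's fill-mask pipeline. The checkpoint name "distilroberta-base" is our assumption of a distilled RoBERTa with roughly 82M parameters; the exact model used by TransSpell may differ.

# Candidate generation sketch using a masked language model.
from transformers import pipeline

# Assumed checkpoint: a distilled RoBERTa with ~82M parameters.
fill_mask = pipeline("fill-mask", model="distilroberta-base")

def generate_candidates(masked_sentence, top_k=50):
    """Return the top_k tokens the model predicts for the <mask> position."""
    predictions = fill_mask(masked_sentence, top_k=top_k)
    return [p["token_str"].strip() for p in predictions]

# Example: candidates for the masked (erroneous) token.
candidates = generate_candidates("We made <mask> that our customers get the products on time.")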

Candidate Selection

Given the list of candidates generated by the transformer, which is ordered by decreasing likelihood of the candidate given the context, we first create a ranking of these candidates based on their respective Levenshtein distance to the original, erroneous token. All candidates with a distance larger than 3 are ruled out. For a better understanding of Levenshtein distances, refer to Peter Norvig's blog post on writing a spelling corrector. If no suggestion has a small enough distance to the original word, we return the original word as the correction candidate, to prevent TransSpell from applying a change that would alter the author's intended meaning. If some candidates meet the Levenshtein distance criterion, we iterate through the ranking and check whether the first letter of the candidate matches the first letter of the original word. This is based on the intuition that most human-made typos do not occur in the first position of a word. The first candidate to match this criterion is chosen as the most likely candidate. If no candidate matches it, the most likely (according to the transformer model) candidate with the closest Levenshtein distance to the original word is chosen as the correction.
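The following is a minimal sketch of this selection logic, using NLTK's edit_distance as the Levenshtein measure; details such as tie-breaking are assumptions and may differ from the actual implementation.

# Candidate selection sketch: filter by Levenshtein distance, then prefer
# candidates whose first letter matches the original token.
from nltk import edit_distance

def select_candidate(original, candidates, max_distance=3):
    """`candidates` is ordered by decreasing model likelihood."""
    # Rank candidates by Levenshtein distance and rule out distant ones.
    ranked = sorted(((c, edit_distance(original, c)) for c in candidates),
                    key=lambda pair: pair[1])
    ranked = [(c, d) for c, d in ranked if d <= max_distance]
    if not ranked:
        return original              # nothing close enough: keep the original word
    # Prefer the closest candidate whose first letter matches the original's.
    for candidate, _ in ranked:
        if candidate[:1].lower() == original[:1].lower():
            return candidate
    # Otherwise take the most likely candidate among those with the smallest distance.
    smallest = ranked[0][1]
    closest = {c for c, d in ranked if d == smallest}
    return next(c for c in candidates if c in closest)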

Evaluation

Given that an error must be embedded in a full sentence for TransSpell to work, we cannot use traditional spelling correction corpora that consist only of a correct word and possible misspellings of that word (such as the Birkbeck corpus). Moreover, calculating a meaningful performance metric would require an amount of manually labeled data that exceeds the scope of this Study Project.

While a meaningful metrics-driven evaluation was not possible within the scope of this project, we provide a CSV document (upon request in adherence to data protection guidelines) containing all sentences from the PNLP dataset that were altered by the spelling corrector, as well as TransSpell’s corrections for comparison. The CSV provides an intuition for the performance of TransSpell.

Named Entity Recognition

Named Entity Recognition (NER) is a part of information extraction that detects and categorizes named entities in raw text, such as organizations, persons, locations, dates, monetary values, etc. NER is typically used in Natural Language Processing and other Artificial Intelligence tasks such as Machine Translation and Semantic Annotation.

Part-of-Speech Tagging and Chunking with NLTK

Before proceeding with NER, we need to perform part-of-speech (POS) tagging and chunking based on the POS tags. A POS tagger labels each word in a sentence with its grammatical information, using cues such as the preceding and following words and shape features (e.g. upper and lower case, capitalization). Chunking takes the POS tags as input and extracts phrases composed of several tokens from the text. In particular, noun phrase (NP) chunking is a necessary step for NER, because single-word tokens may not fully represent the actual meaning of the text when a named entity consists of multiple words.

As a first approach to POS tagging and chunking, we mainly used Python's NLTK library, which also features a solid sentence tokenizer and POS tagger, for data preparation in this project. First, NLTK tokenizes sentences into words and punctuation; the tagging is then done by a trained model from the NLTK library. A chunk grammar can be defined using regular expressions and works on top of the POS tags. The performance improved after we modified the chunk grammar based on the early results of the POS tagging, but NER using NLTK's default model still gave us inaccurate labeling.
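For illustration, a minimal sketch of POS tagging and NP chunking with NLTK follows. The example sentence and the chunk grammar shown here are a common textbook pattern, not necessarily the exact grammar we used.

# POS tagging and noun-phrase chunking sketch with NLTK.
import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The delivery team in Hamburg missed the quarterly deadline."
tokens = nltk.word_tokenize(sentence)          # split into words and punctuation
tagged = nltk.pos_tag(tokens)                  # label each token with a POS tag

# Chunk grammar as a regular expression over POS tags: an optional determiner,
# any number of adjectives, and one or more nouns form a noun phrase (NP).
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Print the extracted noun phrases.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))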

Named Entity Recognition with spaCy

The next method, which comes with pre-trained models for general entities like people, locations, and dates, is Python's spaCy library. SpaCy performs highly efficient statistical processing for NER. It not only offers features similar to NLTK, such as tokenization, part-of-speech tagging, and named entity recognition, but also supports additional functionality, e.g. neural network models (multi-task CNN), integrated word vectors, and dependency parsing. Most of all, spaCy is optimized for production use, which gives it better performance and usability, along with intuitive visualizations. Apart from its default models, spaCy also enables us to train and update the NER model with new examples, though it does not support different neural network architectures.

First, we chose spaCy's small model without word vectors, which, with fairly high accuracy (F-score: 0.85) and speed, is a reasonable tool to get started with NER. When the model was applied to our data, it worked relatively well for date and monetary entities, but most of the others, especially company-specific words, were given a blank or incorrect label. The large model might give slightly better performance (F-score: 0.86), but this is not enough to offset the trade-off between coverage and memory usage introduced by its larger vector table. Additionally, our dataset did not contain sufficient named entities overall to train a custom model with noticeable improvement.
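For illustration, a minimal sketch using spaCy's small English model; the model package name en_core_web_sm and the example sentence are assumptions for the sketch.

# NER sketch with spaCy's small English model.
import spacy

nlp = spacy.load("en_core_web_sm")   # small model without word vectors

doc = nlp("The Berlin office spent 2,000 euros on the project in March 2020.")
for ent in doc.ents:
    print(ent.text, ent.label_)      # e.g. dates (DATE) and monetary values (MONEY)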

Discussion

These results are likely due to the fact that our data consists of unstructured texts collected from an employee survey, which usually include company-related words such as the names of projects, products, or systems. Even if we tried to annotate the data manually, we would still need further information to set criteria for which categories company-specific names belong to. This is one of the general challenges in the field of entity recognition: NER systems are often given sensitive, project-specific data, meaning that even if a suitable model is designed for one project, it may not be reusable for another. In conclusion, there is a need to develop an NER model that can eventually cover a wider range of data.

Classification Model for NER Detection

Introduction

Classification models are supervised learning models: the model learns from a pre-existing labeled dataset in which each input is paired with an output. The outputs are finite, discrete variables, i.e. classes, so for every input the model learns to predict an output class. NER detection is a multi-class problem: for every word given as input, the output is one of the NER categories, such as person, location, or organization.

Aim and Objective

The aim of creating a pipeline was to classify words into labeled entities; the steps taken towards this aim are described in the subsections below.

The Dataset

We used the existing GMB (Groningen Meaning Bank) corpus for entity classification, which comes pre-annotated for NER [2]. The dataset is tokenized at whitespace, essentially producing one-word tokens, except for labels where multiple words are essential to an entity, such as organizations. The entities present in the training data and their corresponding descriptions are listed below:

NER Labels

Label   Description
O       All tokens that are not an entity
geo     Geographical entity
org     Organization
per     Person
gpe     Geopolitical entity
tim     Time indicator
art     Artifact
nat     Natural phenomenon
eve     Event

The Packages

The two major packages used for the classification model were Keras and scikit-learn. Scikit-learn is a machine learning library for the Python programming language; we used its count vectorizer, tf-idf, and data splitting functions. The count vectorizer and tf-idf convert words into numeric vectors which can be fed to a neural network for processing, while the data splitting function splits the data into training, testing, and blind datasets. Keras is a neural network package for Python, which we use to create a sequential dense model: "A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor." [3]
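For illustration, the following is a minimal sketch of the vectorization and data splitting steps with pandas and scikit-learn. The file name ner_dataset.csv and its column names follow the Kaggle GMB export [2]; the latin-1 encoding, the subsampling, and the character n-gram features are assumptions for the sketch rather than the project's exact settings.

# Feature extraction and data splitting sketch with scikit-learn.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Assumed file and column names from the Kaggle GMB export [2].
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.dropna(subset=["Word", "Tag"]).sample(50_000, random_state=42)  # subsample for the sketch
words = data["Word"].astype(str)
labels = data["Tag"].str.split("-").str[-1]        # collapse B-/I- prefixes, e.g. "B-geo" -> "geo"

# Convert word tokens into tf-idf vectors; character n-grams (our choice here)
# give unseen words non-empty feature vectors.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3), max_features=2000)
X = vectorizer.fit_transform(words)

# Split into training and test sets; a further blind split can be made the same way.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)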

The Model
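The following is a minimal sketch, continuing the previous one, of what a sequential dense classifier along these lines might look like; the layer sizes, epochs, and other training settings are our assumptions and not the exact model used in the project.

# Sequential dense model sketch with Keras, continuing the previous sketch.
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.preprocessing import LabelEncoder

# Encode the string labels (geo, org, per, ...) as integer class ids.
encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)
y_test_enc = encoder.transform(y_test)

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),                      # tf-idf feature dimension
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(len(encoder.classes_), activation="softmax"),   # one output unit per NER class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Keras expects dense arrays, so the sparse tf-idf matrices are densified here.
model.fit(X_train.toarray(), y_train_enc, epochs=5, batch_size=256, validation_split=0.1)
model.evaluate(X_test.toarray(), y_test_enc)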

Results and Output

The following table shows partial output (precision, recall, and F1-score) for three classes numbered 0, 1, and 2, where 0 stands for geo, 1 for org, and 2 for per.

   precision  recall  f1-score  support
0       0.85    0.95      0.90     7540
1       0.86    0.71      0.78     4065
2       0.94    0.90      0.92     3351

The second table shows what the output looks like, alongside the corresponding pre-annotated correct labels.

       Predicted  Actual_Tag
0      0          [geo]
1      2          [per]
2      1          [geo]
3      0          [geo]
...    ...        ...
14952  0          [geo]
14953  0          [geo]
14954  01         [org]
14955  2          [per]

Discussion

Looking at the results, one can see that the classification model works better in terms of accuracy than the NLTK and spaCy models. However, we ran into a problem which we could not solve: once we used the pre-existing annotated dataset, we could test on that data, but when we wanted to apply the model to our own, blind dataset, we were not able to run it. The trained model is only compatible with an input matrix of a fixed shape, and the blind data, when vectorized with a count vectorizer or tf-idf, produces an input matrix with different dimensions that is not compatible with the trained model.


References

[1] https://www.etymonline.com/word/data
[2] https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data?select=ner_dataset.csv
[3] https://keras.io/guides/sequential_model/
[4] Bhavani, D. (n.d.). Understanding Named Entity Recognition Pre-Trained Models. Retrieved September 7, 2020, from https://blog.vsoftconsulting.com/blog/understanding-named-entity-recognition-pre-trained-models
[5] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
[6] Jurafsky, D., & Martin, J. H. (2008). Speech and Language Processing (2nd ed.). Prentice Hall.
[7] Li, S. (2018, August 17). Named Entity Recognition with NLTK and SpaCy. Retrieved September 7, 2020, from https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
[8] spaCy 101: Everything you need to know. (n.d.). Retrieved September 7, 2020, from https://spacy.io/usage/spacy-101