Would you like to peruse a report on how to elucidate overwrought verbiage? Or maybe you’d like to read an article on how to simplify complex writing. Yes, that sounds much better.
Whether you want to eliminate jargon, write for a second-language learner, or end a toxic relationship with your thesaurus, Grammarly’s tools can help you replace complex words with simpler ones that appeal to a wider audience. In this article, we’ll give an overview of how text can be simplified using machine learning and linguistics.
The pipeline for simplifying a piece of writing includes finding the complex words in the text, generating a list of candidate replacement words, and picking the best ones. This breaks down into the following general pieces:
- Processing text. Each sentence is split into words, and each word is tagged with a part of speech. This is a common step in any natural language processing (NLP) pipeline.
- Feature extraction. A variety of linguistic features, like word frequency and word sense, are extracted.
- Complex word identification (CWI). Using those features, a machine learning model labels each word as complex or simple.
- Generating candidate replacement words. For each complex word, synonyms are extracted from the thesaurus.
- Reranking. The candidates are reranked in the original context, using a language model. The highest ranked candidate is suggested to the user.
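The stages above can be sketched in a few lines of Python. Everything here is a toy stand-in: `is_complex`, `candidate_synonyms`, and `rerank` are placeholders for the real components described in the rest of this article.

```python
# A minimal sketch of the simplification pipeline described above.
# Each helper is a stand-in for a component discussed later in the
# article; the logic here is illustrative, not a real implementation.

def tokenize(sentence):
    # Real pipelines use a proper tokenizer and a POS tagger.
    return sentence.rstrip(".").split()

def is_complex(word):
    # Placeholder for the CWI model: here, just a toy length rule.
    return len(word) > 8

def candidate_synonyms(word):
    # Placeholder for a thesaurus lookup (one toy entry).
    return {"ameliorated": ["helped", "improved", "bettered"]}.get(word, [])

def rerank(sentence, word, candidates):
    # Placeholder for language-model reranking: keep thesaurus order.
    return candidates[0] if candidates else word

def simplify(sentence):
    out = []
    for word in tokenize(sentence):
        if is_complex(word):
            out.append(rerank(sentence, word, candidate_synonyms(word)))
        else:
            out.append(word)
    return " ".join(out) + "."

print(simplify("They ameliorated the situation."))  # They helped the situation.
```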
A frequent mistake that NLP researchers make is measuring the quality of their solution solely by the precision and recall on the evaluation data set. However, real users of the feature may consider other aspects important, such as:
- Consistency in which words are flagged as complex
- A replacement word that’s indeed simpler than the original, but not so simple as to be vague
- A suggestion that preserves the grammar and meaning of the original sentence
We’ll bear these criteria in mind while building the solution.
Identifying complex words
Complex word identification, or CWI, is far from a solved problem. There have been two shared tasks in the research community on CWI, in 2016 and 2018, but the results have some issues, particularly due to the quality of the data sets. For example, the data sets had very simple words like “win” and “laughter” marked as complex. The best-performing systems, based on linear regression or random forests, achieved an F1 score of 81; deep learning methods didn’t rank that high. To create a more effective classifier for the user-oriented metrics described above, we can leverage linguistics to extract better features.
Words are complex when they’re not well-known by readers. So one hypothesis is that complex words appear in the language less frequently than simple ones. To get an understanding of frequency, we can use a large corpus (a good example of this is Common Crawl), tokenize it, and count the number of times each word appears.
When we do this, we discover that certain forms and spellings of words aren’t seen as often as others. For example, a module that counts frequency will label “accessorize” as simple but “accessorizes” as complex; the alternative spelling of “accessorize”, which uses an “s” instead of a “z”, will be labeled complex as well. In other words, this approach will fail to provide consistent results. We can solve this, however, by looking at how linguistics defines a word: as the unity of all forms of the word.
Freq = C(“accessorize”) + C(“accessorizes”) +
C(“accessorized”) + C(“accessorizing”) +
C(“accessorise”) + C(“accessorises”) +
C(“accessorised”) + C(“accessorising”)
To scale this approach, we can leverage open-source datasets like Wiktionary, an online crowdsourced dictionary, which has information on the forms of every word in the over 30 languages it supports. We can transform each word into its basic form (in linguistics, this form is called a lemma). After the word is transformed into a lemma, we can get all forms of the lemma from the dictionary and calculate the word frequency using the formula above. The resulting number can then be used as a feature.
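Here is a minimal sketch of that computation, assuming we already have per-form counts from a corpus and a small hand-built stand-in for a Wiktionary form table (the counts and the table below are invented for illustration).

```python
from collections import Counter

# Toy per-form counts; in practice these come from tokenizing a large
# corpus such as Common Crawl.
counts = Counter({
    "accessorize": 900, "accessorizes": 40, "accessorized": 120,
    "accessorizing": 60, "accessorise": 150, "accessorises": 8,
    "accessorised": 25, "accessorising": 12,
})

# Hypothetical lemma-to-forms table; in practice, derived from Wiktionary.
forms_of = {
    "accessorize": ["accessorize", "accessorizes", "accessorized",
                    "accessorizing", "accessorise", "accessorises",
                    "accessorised", "accessorising"],
}

def lemma_frequency(lemma):
    # Sum the counts of every form of the lemma, as in the formula above.
    return sum(counts[form] for form in forms_of[lemma])

print(lemma_frequency("accessorize"))  # 1315
```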
We say complex words are “big words” for a reason: they tend to be long. There are some exceptions to this rule, however. The word “friendliness” is long, but if you understand “friend,” you’ll understand “friendliness.” As you can see, some long words are actually simple:
lawlessness: law + less + ness
ghostlike: ghost + like
mistreatment: mis + treat + ment
bittersweet: bitter + sweet
satisfactory: satisf(y) + act + ory
mouth-watering: mouth + – + water + ing
In linguistics, the elements that make up a word are called morphemes. It’s possible to build a tool called a morphological analyzer that can break down any word in the language into morphemes. Morphological analysis can help find the etymon, a word that the given word was derived from (like “friend” for “friendliness”), which can then be used as a feature. Note that such an approach ensures consistency too: Words with the same etymon will be labeled in the same way.
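A real morphological analyzer can segment arbitrary words; a toy table-driven version is enough to illustrate the etymon feature. The morpheme table and prefix list below are invented for the example.

```python
# A toy stand-in for a morphological analyzer. Real analyzers segment
# arbitrary words; here we use a small hand-built table for illustration.
MORPHEMES = {
    "lawlessness": ["law", "less", "ness"],
    "friendliness": ["friend", "li", "ness"],
    "mistreatment": ["mis", "treat", "ment"],
}
PREFIXES = {"mis", "un", "re"}

def etymon(word):
    # Take the first non-prefix morpheme as the root (a simplification);
    # unknown words are treated as their own root.
    for morpheme in MORPHEMES.get(word, [word]):
        if morpheme not in PREFIXES:
            return morpheme
    return word

print(etymon("mistreatment"))   # treat
print(etymon("friendliness"))   # friend
```

Because every form maps back to the same root, words sharing an etymon get the same feature value, which helps with the consistency criterion.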
We can improve our classifier by going a level lower than the word and looking at the characters it contains. We notice that complex words tend to contain rare letter combinations. Let’s compare character n-grams (subsequences of n characters) for the words “abhorrence” and “anger”:
abhorrence: ^abh, abho, bhor, horr, …, ence, nce$
anger: ^ang, ange, nger, ger$
How many words do you know that contain “bhor”? We’d venture a guess that it’s not many. In contrast, plenty of words contain “ang” or “nger.” Other interesting subword features are the number of repeating sounds, the number of syllables, and the ratio of consonant to vowel sounds when the word is pronounced. In complex words, this ratio can be as high as two to one, in contrast to simpler words, where consonant and vowel sounds tend to be more evenly distributed.
procrastinate – /prəˈkræstəneɪt/ – eight consonant sounds vs. five vowel sounds
flabbergasted – /ˈflæbəɡɑːstɪd/ – seven consonant sounds vs. four vowel sounds
neighborhood – /ˈneɪbəhʊd/ – four consonant sounds vs. four vowel sounds
information – /ˌɪnfəˈmeɪʃən/ – five consonant sounds vs. five vowel sounds
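Both features are easy to sketch. The n-gram extractor below reproduces the examples above; the consonant/vowel counter simply classifies IPA characters against a small vowel set, which is a rough approximation (it treats a diphthong as two vowel symbols).

```python
def char_ngrams(word, n=4):
    # Pad with boundary markers so prefixes and suffixes form n-grams too.
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Rough set of IPA vowel symbols; stress and length marks are ignored.
VOWELS = set("aeiouæəɑɪʊɛɔʌ")
IGNORE = set("ˈˌː")

def consonant_vowel_counts(ipa):
    # Classify each remaining IPA character as vowel or consonant.
    sounds = [ch for ch in ipa if ch not in IGNORE]
    vowels = sum(ch in VOWELS for ch in sounds)
    return len(sounds) - vowels, vowels

print(char_ngrams("anger"))                      # ['^ang', 'ange', 'nger', 'ger$']
print(consonant_vowel_counts("prəˈkræstəneɪt"))  # (8, 5)
```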
We can further improve our classifier by stepping back to look at what a word means. The hypothesis is that complex words have fewer meanings than simpler words. Simpler words are used more frequently, so they’ve evolved and added new meanings over time; complex words are niche, and therefore rare.
Any good dictionary can tell you how many meanings a word has, and comparing these counts for complex and simple words bears the hypothesis out. One well-curated resource for this is WordNet, a lexical database of English.
WordNet also describes lexical relationships among words, providing a branching hierarchy of hypernyms and hyponyms. In linguistics, a hypernym is a generic term, and a hyponym is a specific instance of this term. For example, mouse, hamster and rat are all hyponyms of rodent (their hypernym); rodent, in turn, is a hyponym of animal. Simpler words tend to be more generic, and complex words more specific, so counting hypernyms and hyponyms can yield useful features.
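Here is a toy version of both features; the sense counts and hypernym table below are invented for illustration (in practice, you would query WordNet, for example via NLTK).

```python
# A toy WordNet-style sense inventory and hypernym hierarchy. The
# numbers and relations here are invented; real features would come
# from WordNet itself.
SENSES = {"run": 40, "help": 12, "ameliorate": 1}
HYPERNYM = {"mouse": "rodent", "hamster": "rodent", "rat": "rodent",
            "rodent": "animal"}

def sense_count(word):
    return SENSES.get(word, 1)

def hyponym_count(word):
    # Count words that reach `word` anywhere up their hypernym chain.
    def chain(w):
        while w in HYPERNYM:
            w = HYPERNYM[w]
            yield w
    return sum(word in chain(w) for w in HYPERNYM)

print(sense_count("ameliorate"))  # 1
print(hyponym_count("rodent"))    # 3
print(hyponym_count("animal"))    # 4
```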
Going one step further, we can borrow from the field of psycholinguistics, which studies the cognitive processes that are related to language comprehension and language production. Psycholinguistic databases like MRC describe, among other things, how concrete a given word is and how readily your brain comes up with a picture for it (this is called “imageability”). We can even find data on a word’s familiarity, i.e., how many people of a certain origin and age know this word, and the average age when people start using the word in their vocabulary. These higher-level features can make your model smarter.
Finding candidate replacement words
After we’ve identified a complex word, detected the part of speech, and transformed the word into its lemma, we are ready to look it up in a thesaurus. Or are we? Technically, we also need to know the meaning of the word because different meanings will have entirely different sets of synonyms. It’s worth devoting another article to that process, which is called word sense disambiguation; for simplicity, we won’t be discussing it here.
Assuming we have the right set of synonyms for a complex word, how do we pick our candidate replacements? Going back to our user-oriented success metrics, we’ll recall that all candidates should be simpler than the original word, but not too simple. Also, the sentence must remain grammatically correct when a substitute is suggested.
We already have a good way to discard synonyms that are more complex than the original word: we can use our CWI model to filter out words that score higher than our original. To filter out candidates that are too simple, we can look at the number of meanings a word has; if you replace a word that has one meaning with a word that has 30 meanings, you risk ambiguity. Another strategy is to check the word frequency: If the candidate synonym is used much more frequently than the original, you could be changing the meaning too much.
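A sketch of both filters, using invented complexity scores and frequencies in place of the real CWI model and corpus counts:

```python
# Toy stand-ins for the CWI model's score and the lemma frequencies
# computed earlier; all numbers here are invented for illustration.
COMPLEXITY = {"ameliorated": 0.9, "helped": 0.2, "improved": 0.3,
              "bettered": 0.6, "fixed": 0.1}
FREQUENCY = {"ameliorated": 50, "helped": 90_000, "improved": 30_000,
             "bettered": 400, "fixed": 120_000}

def filter_candidates(original, candidates, max_freq_ratio=1000):
    kept = []
    for cand in candidates:
        # Discard candidates the CWI model scores as complex as the original.
        if COMPLEXITY[cand] >= COMPLEXITY[original]:
            continue
        # Discard candidates so much more frequent they risk being too vague.
        if FREQUENCY[cand] / FREQUENCY[original] > max_freq_ratio:
            continue
        kept.append(cand)
    return kept

print(filter_candidates("ameliorated",
                        ["helped", "improved", "bettered", "fixed"]))
# ['improved', 'bettered']
```

The frequency-ratio threshold is an assumption for the example; any real cutoff would be tuned on data.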
To ensure that the candidates make grammatical sense in the original sentence, we need to put verbs in the right form, make sure that nouns have the right number and article, and give adjectives the same degree of comparison. We also need to consider “governing,” which is how words interact with each other.
Correct verb form
affront => insult
affronts => insults
affronted => insulted
Correct noun number and article
revelries => celebrations, festivities
a destitute area => an impoverished area
Degrees of comparison
more destitute => poorer
brawnier => more muscular
Correct governing
infatuated with => charmed by
matriculate at the university => enroll in the university
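As a rough illustration, the verb-form case can be handled by copying the original word's suffix onto the replacement's lemma. Real systems use a POS tagger and a morphological generator rather than this naive suffix rule, which the code below states as an assumption.

```python
# A naive sketch of matching the replacement's form to the original's.
# Real systems use a POS tagger and a morphological generator; here we
# only copy common English verb suffixes onto the replacement lemma.
def match_verb_form(original, replacement_lemma):
    for suffix in ("ing", "ed", "s"):
        if original.endswith(suffix):
            base = replacement_lemma
            # Drop a final "e" before "-ing" (e.g., charge -> charging).
            if suffix == "ing" and base.endswith("e"):
                base = base[:-1]
            return base + suffix
    return replacement_lemma

print(match_verb_form("affronts", "insult"))    # insults
print(match_verb_form("affronted", "insult"))   # insulted
print(match_verb_form("affront", "insult"))     # insult
```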
Maintaining correct grammar does more than just satisfy the end user. In the next step, we will be putting our candidate words into the original context and reranking the new sentences in order of probability; these sentences need to have accurate grammar to give us accurate results.
Reranking to find the best replacement
Suppose our CWI model has flagged “ameliorated” and the thesaurus has supplied several candidate replacements. Substituting each candidate into the original context gives us a set of sentences to compare:
orig = “They ameliorated the situation.”
repl_1 = “They helped the situation.”
repl_2 = “They improved the situation.”
repl_3 = “They enhanced the situation.”
repl_4 = “They bettered the situation.”
repl_n = “They upgraded the situation.”
Which replacement is best? To determine this, we need to know which sentence will be the most probable in the language. There are two general approaches to language modeling, and we’ll give a brief overview of them here.
Statistical language modeling
If we want to calculate the probability of a sentence like “They bettered the situation,” we use the chain rule, multiplying the probability of each word given the ones that have come before.
P(“<S> They bettered the situation . </S>”) =
P(“They”|“<S>”) *
P(“bettered”|“<S> They”) *
P(“the”|“<S> They bettered”) *
P(“situation”|“<S> They bettered the”) *
P(“.”|“<S> They bettered the situation”) *
P(“</S>”|“<S> They bettered the situation .”)
The problem is that as sentences grow longer, these long word histories appear more and more rarely in any corpus, so their probabilities can’t be estimated reliably. So we can apply the Markov assumption, which states that the future is independent of the past given the present. This means that each word’s probability depends only on a short window of preceding words, not the entire previous phrase, which gives us a simpler equation. (In practice, two or three previous words are used, as one word gives too little information.)
P(“<S> They bettered the situation . </S>”) ≈
P(“They”|“<S>”) *
P(“bettered”|“<S> They”) *
P(“the”|“They bettered”) *
P(“situation”|“bettered the”) *
P(“.”|“the situation”) *
P(“</S>”|“situation .”)
To calculate a single probability, we can take a large data set and look at how many times the words appear in a row. Knowing how often we encountered “They bettered” will give us the probability of how likely “bettered” is to follow “They.” But what if we never saw a word, like “bettered,” in our dataset? In that case, the probability of the entire sentence will be zero. To avoid these zero probabilities, there are different smoothing techniques we can use.
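Here is a minimal bigram model with add-one (Laplace) smoothing over a three-sentence toy corpus; real models use far more data and longer histories.

```python
from collections import Counter

# A minimal bigram language model with add-one (Laplace) smoothing,
# trained on a toy corpus invented for this example.
corpus = [
    "<S> They improved the situation . </S>",
    "<S> They helped the team . </S>",
    "<S> We improved the process . </S>",
]
tokens = [t for line in corpus for t in line.split()]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

def prob(word, prev):
    # Add-one smoothing keeps unseen bigrams from zeroing the product.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def sentence_prob(sentence):
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= prob(word, prev)
    return p

a = sentence_prob("<S> They improved the situation . </S>")
b = sentence_prob("<S> They bettered the situation . </S>")
print(a > b)  # True: "improved" was seen in the data, "bettered" was not
```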
Neural language modeling
Another option is to use a recurrent neural network as a language model. In the simplest case, there would be an input layer with word embeddings and a hidden state. At the output layer, we maximize the probability of the next word given the history of previous words in the sentence.
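For illustration, here is a single step of an untrained Elman-style RNN language model in pure Python, with invented weights: an embedding lookup, a hidden-state update, and a softmax over the vocabulary.

```python
import math

# One step of a minimal, untrained Elman RNN language model. All
# weights are invented toy values; the point is the shape of the
# computation, not a trained model.
vocab = ["They", "bettered", "improved", "the", "situation"]
H = 3  # hidden size

W_xh = [[0.1 * (i + j) for j in range(H)] for i in range(len(vocab))]
W_hh = [[0.1 if i == j else 0.0 for j in range(H)] for i in range(H)]
W_hy = [[0.05 * (i - k) for k in range(len(vocab))] for i in range(H)]

def step(word, h):
    x = W_xh[vocab.index(word)]  # embedding lookup
    # Hidden-state update: tanh of input plus recurrent contribution.
    h_new = [math.tanh(x[j] + sum(W_hh[i][j] * h[i] for i in range(H)))
             for j in range(H)]
    # Output layer: softmax over the vocabulary gives next-word probs.
    logits = [sum(h_new[i] * W_hy[i][k] for i in range(H))
              for k in range(len(vocab))]
    total = sum(math.exp(l) for l in logits)
    probs = [math.exp(l) / total for l in logits]
    return h_new, probs

h, probs = step("They", [0.0] * H)
print(round(sum(probs), 6))  # 1.0
```

Training would adjust the weights to maximize the probability of each observed next word, exactly as described above.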
Neural networks can take a long time to train and tend to overgeneralize. For example, you could build word embeddings that know that green is a type of color, and end up with a high likelihood for “green horse” because “white horse” and “black horse” are common. On the other hand, statistical language models are easier to implement and to change, but they don’t generalize as well. “Red car” and “blue car” might be frequent, but if “purple car” was never seen in the data, it will have a low probability. In reality, statistical models and neural networks are often used interchangeably, depending on the resources and task at hand.
Having stepped through the pipeline for suggesting simpler words to replace complex ones, we think this problem demonstrates some valuable lessons. First, it shows that linguistic knowledge gives you power, letting you derive better features that are based on the way humans actually communicate. Second, it teaches that researchers are not the final consumers of NLP applications: we need to think about what the end users want and expect. And finally, we think that going through an exercise like this—deeply studying a problem—will give you better results than blindly feeding data to ML models. We encourage everyone to study your problems more and let your data guide you.
If you’re interested in studying problems like this one, Grammarly is hiring! By joining our team of researchers and engineers, you could help improve communication for millions of users around the world. Check out our open roles here.