Going Down the Natural Language Processing Pipeline
A Basic Intro to NLP
Communication plays a big part in our everyday lives. We talk to different people in different languages, but what about communicating with technology? Nowadays, everyone has some sort of device, and we often use it to find answers to our questions. When we ask Siri, “Where can I find the nearest sushi place?”, we’re verbally asking a question or making a statement to a machine.
But here’s the thing: computers don’t just speak English. They run on complex code with a syntax totally different from the way we speak. What we normally speak (no matter what language) is called a natural language. Natural languages tend to be quite complicated: they have large vocabularies, plenty of words that are spelled the same but carry different meanings, and many more things to consider. Still, it’s pretty easy for humans to understand what we meant to say.
So how does a computer understand us when we’re speaking what is, to it, almost a different language? That’s where Natural Language Processing comes in :D. NLP combines computer science with linguistics, the study of language.
To start, let’s try to understand the process a computer goes through to figure out what you want or what you’re asking. We can’t just give a computer a dictionary of all the possible sentences in the world, because that would be impossible. Instead, in NLP we first deconstruct sentences into smaller pieces, so the computer can work through the individual words to figure out what we’re saying.
Segmentation + Tokenization
The first step is segmentation: breaking a larger piece of text, or even a single sentence, into smaller parts. Usually it’s broken down using the punctuation within the text or by the different topics being discussed.
The following process is tokenization, which takes each sentence and breaks it down word by word. You’re essentially cutting a sentence into “tokens” so the computer can understand things more easily. Let’s take the question, “Where can I find the nearest sushi store?”, and see what it would look like as tokens.
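To make this concrete, here’s a minimal sketch of both steps in Python using the NLTK library (my choice for illustration; the article isn’t tied to any particular tool, and other libraries would work too):

```python
import nltk

nltk.download("punkt")  # sentence/word tokenizer models (one-time download)

text = "I'm hungry. Where can I find the nearest sushi store?"

# Segmentation: split the text into sentences.
sentences = nltk.sent_tokenize(text)

# Tokenization: split each sentence into word tokens.
tokens = [nltk.word_tokenize(s) for s in sentences]

print(sentences)
print(tokens[1])
# ['Where', 'can', 'I', 'find', 'the', 'nearest', 'sushi', 'store', '?']
```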
Then, after we do that, we have to tag each word with its part of speech. In English we have nine types of words: nouns, pronouns, articles, verbs, adjectives, adverbs, prepositions, conjunctions, and interjections, collectively called “parts of speech.”
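As a rough sketch, again assuming NLTK and its pre-trained tagger model, tagging our tokens might look like this (the exact tags can vary by model version):

```python
import nltk

nltk.download("averaged_perceptron_tagger")  # pre-trained POS tagger (one-time)

tokens = ["Where", "can", "I", "find", "the", "nearest", "sushi", "store", "?"]
print(nltk.pos_tag(tokens))
# Roughly: [('Where', 'WRB'), ('can', 'MD'), ('I', 'PRP'), ('find', 'VB'),
#           ('the', 'DT'), ('nearest', 'JJS'), ('sushi', 'NN'), ('store', 'NN'), ('?', '.')]
```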
This can be very useful, but sometimes words with the exact same spelling can have different meanings; the word “leaves,” for example, could be a noun or a verb. To solve that, the computer needs to know some basic grammar as well, which leads to phrase structure rules: almost formula-like rules for how a sentence can be structured. For example, a regular sentence usually consists of a noun phrase first and then a verb phrase to make up a complete sentence.
After this, we can construct a ‘parse tree,’ which identifies the parts of the sentence and tags each word’s part of speech.
Returning to our sushi example, we can apply this structure to the sentence: “where?” signals that we’re trying to find a place, “nearest” identifies the dimension/range, and “sushi store” identifies the noun itself.
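To see phrase structure rules and parse trees in action, here’s a toy grammar in NLTK (a made-up mini-grammar for illustration, not a real-world one) that encodes the “noun phrase + verb phrase” rule and builds a parse tree from it:

```python
import nltk

# Toy phrase structure rules: a sentence (S) is a noun phrase (NP)
# followed by a verb phrase (VP), as described above.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N  -> 'dog' | 'ball'
    V  -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(["the", "dog", "chased", "the", "ball"]):
    tree.pretty_print()  # draws the parse tree in the terminal
```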
Stemming + Lemmatization
Then our third step is stemming and lemmatization. This process deals with the prefixes and suffixes of words.
We start with stemming, which is (as you can probably guess already) taking the stem of the word itself by removing affixes such as -ing, -s, and -ed. For example, the word ‘walk’ is our basic word stem, and it can have different variations like ‘walking,’ ‘walks,’ or ‘walked’; once we remove the ending affixes, we’re left with the stem, which is the ideal outcome of the stemming process.
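A quick sketch with NLTK’s Porter stemmer (one of several stemming algorithms you could pick):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["walking", "walks", "walked"]:
    print(word, "->", stemmer.stem(word))
# walking -> walk
# walks -> walk
# walked -> walk
```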
After stemming, we have lemmatization, which is figuring out the root form of each word, also known as the ‘lemma.’ It considers the context surrounding the word, then converts it into the most meaningful base form.
The difference between this and stemming is that stemming just chops off the last few letters/characters, which can sometimes produce inaccurate output.
A lemma, on the other hand, is always something you could find as a word in the dictionary, whereas a stem may not be. Lemmatization uses a knowledge base called WordNet to process the words. For example, the three words ‘went,’ ‘going,’ and ‘gone’ all originate from the word ‘go,’ meaning that ‘go’ would be our lemma.
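Sketched with NLTK’s WordNet-backed lemmatizer (assuming the WordNet data has been downloaded); note that we pass pos="v" to tell it these are verbs:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # the WordNet knowledge base mentioned above

lemmatizer = WordNetLemmatizer()
for word in ["went", "going", "gone"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # "v" = verb
# went -> go
# going -> go
# gone -> go
```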
Named Entity Recognition
Our last step is named entity recognition, which also goes by the names named entity identification, entity chunking, and entity extraction. It’s a subtask of information extraction: locating the named entities mentioned in unstructured text and sorting them into predefined categories.
If that sounds complicated, it’s pretty simple; it just means finding key pieces of data and sorting them into different categories. When you extract the main concepts/entities from a text, you can quickly figure out its main message by grouping them by type.
Some everyday categories include person (a specific person mentioned), quantity, location, organization, movie, and monetary value. These are the most basic categories used in this process. If we go back to our original question, “Where can I find the nearest sushi store?”, we know the sushi store is our location.
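Here’s a rough sketch using spaCy (a different NLP library I’m assuming here because its named entity recognition is easy to demo; it requires installing its small English model first):

```python
import spacy

# Assumes the model was installed with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is opening a store in Tokyo next year for $5 million.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Roughly: Apple -> ORG, Tokyo -> GPE, next year -> DATE, $5 million -> MONEY
```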
NLP Pipeline
This entire process we just learned about is called an NLP pipeline: basically, a way to visualize the steps needed to process text into something the computer can understand. Usually, NLP pipelines include three major sections, text processing, feature extraction, and modeling, and these sections can be broken down into all of the individual steps we just talked about.
Our input is our text, which runs through these many processes, and our end result is something the computer can actually understand!
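As a final sketch, spaCy actually bundles most of the steps we covered (tokenization, POS tagging, lemmatization, entity recognition) into a single pipeline object, so one call runs text through the whole thing:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # the pipeline: tokenizer, tagger, parser, NER...
doc = nlp("Where can I find the nearest sushi store?")

for token in doc:
    # token text, part of speech, and lemma all come out of one pass
    print(token.text, token.pos_, token.lemma_)
```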
That’s it for the fundamentals of the Natural Language Processing pipeline, but there’s so much more to it! There’s a whole other side to actually programming it, plus audio processing, which is super cool. This technology has many applications, like chatbots, language translators, smart assistants, and many more.
Very soon, I’ll be looking into how we can use this branch of artificial intelligence to help with different education problems in developing countries! If you’d like to stay updated on my AI exploration journey, follow me on Medium.
I hope this gave you a good introduction to NLP and got you interested in learning more. If you enjoyed this article, give it a round of applause, and make sure to check out some of my other articles.
:)