Dr. Ranja Sarkar

Author of "A Handbook of Mathematical Models with Python".


From language processing to language modeling

If represented in a Venn diagram, the AI superset has two major subsets - Natural language processing (NLP) and machine learning (ML). Deep learning (DL) or neural networks forms a sub-subset and generative AI is a subset of DL having substantial intersection with NLP.


One would ask, where is Data Science?

Well, Data Science is how we humans create value from data, and it takes many forms. We start with a business problem and work toward an appropriate data-backed solution. AI is part of Data Science.

Coming back to the Venn diagram, the remaining space in the AI superset would contain transfer learning, reinforcement learning, hyperpersonalization, and similar systems that are directly or indirectly influenced by ML/DL models as well as LLMs.

With NLP, several tasks like topic classification, sentiment analysis, relationship extraction, and entity recognition can be accomplished. The input data for NLP can be multimodal: text, audio, or video. I will write about text data here.

Natural Language Processing

NLP involves transforming raw text data into a format that the machine understands. In NLP, each text sentence is called a document, and a collection of documents is referred to as a text corpus.

✅ Data pre-processing or text cleaning (often using regular expressions) comprises tokenization, stemming & lemmatization

Tokenization is the process of breaking text into individual words or tokens. Normalizing the text entails removing punctuation and stopwords; one may or may not do this depending on the use case. Stemming means reducing a word to its stem (root word): it strips the morphological affixes from a word, leaving only the root.


Lemmatization is similar, but the resulting stem (lemma) is always a valid word in the language.
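As a minimal sketch of this cleaning pipeline using only the standard library (the token pattern, stopword list, and crude suffix-stripping stemmer are illustrative; real stemmers like Porter's are far more careful):

```python
import re

# a tiny illustrative stopword list; real lists (e.g. in NLTK) are much longer
STOPWORDS = {"the", "a", "is", "are", "in", "on"}

def tokenize(text):
    # lowercase, then keep runs of letters as tokens (drops punctuation)
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(token):
    # crude suffix stripping: remove a common affix if enough stem remains
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

doc = "The cats were running in the garden"
tokens = remove_stopwords(tokenize(doc))
stems = [naive_stem(t) for t in tokens]
print(tokens)  # ['cats', 'were', 'running', 'garden']
print(stems)   # ['cat', 'were', 'runn', 'garden']
```

Note how the naive stemmer yields "runn" rather than the valid word "run"; lemmatization would return "run".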


Good old Python libraries like NLTK, spaCy, Gensim, and TextBlob can be used to tokenize input text as well.


These days, tokenizers from Hugging Face are available to do the same. Hugging Face also has a playground for experimenting with different tokenizers.


✅ Part-Of-Speech (POS) tagging can be rule-based, statistical, or based on deep learning

This is usually done for entity recognition from a corpus.

✅ Vectorization, or word embedding, converts the tokens into numbers. Embeddings are fixed-length representations of the tokens in a text, regardless of the number of tokens in the vocabulary.

N-grams can be used when we want to preserve sequence information in the text data (what word is likely to follow a given one). A unigram (N=1), however, carries no sequence information (each word is considered individually).

The classical approach to converting text into numeric vectors is the Bag-of-words (BoW) method. The principle of BoW is to take all the unique words/tokens from the text and create a text corpus called the vocabulary. Using the vocabulary, each sentence/document can be represented as a vector of ones and zeros, depending on whether a word from the vocabulary is present in the sentence or not. With one-hot encoding, each token is represented by an array of vocabulary size, but with embeddings, each token is represented by a vector of the embedding dimension.
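A minimal BoW sketch over a toy two-document corpus (standard library only) makes this concrete:

```python
docs = ["the cat sat", "the dog sat on the mat"]

# the vocabulary is the set of unique tokens across the corpus
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    # 1 if the vocabulary word appears in the document, else 0
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

print(vocab)                # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(bow_vector(docs[0]))  # [1, 0, 0, 0, 1, 1]
```

Each document becomes a vector of length six (the vocabulary size), however short or long the document is.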


In the transformer architecture, which underlies large language models (LLMs), a positional encoding matrix is created to represent all the possible positions a word/token can take.

Positional encoding is used to provide a relative position for each token in a sequence. When reading a sentence, each word depends on the words around it. For example, some words have different meanings in different contexts, so a model should understand these variations and the words that each word relies on for context. In the architecture, the values in the representation are not fixed binary values but floating-point values, allowing for fine-grained learned representations.
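A sketch of the sinusoidal positional encoding from the original transformer paper, in pure Python (the sequence length and model dimension below are arbitrary choices for illustration):

```python
import math

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
print(pe[0][:2])  # [0.0, 1.0] -- position 0 gives sin(0), cos(0)
```

Each position gets a unique vector of smoothly varying floats, which is added to the token embedding before the sequence enters the network.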


Now coming back to the classical approach: two sentences are said to be similar if they contain a similar set of words. To add more context to the vocabulary, tokens may be grouped together. This is called the N-gram approach. An N-gram is a sequence of N tokens; for example, a bigram is a sequence of two words. Once the vocabulary is chosen, occurrences of the grams must be counted. The downside of the BoW approach is that popular or frequent words become too important.
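Extracting and counting N-grams is compact with the standard library (toy sentence for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    # slide a window of size n over the token list
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
print(bigrams[:2])  # [('the', 'cat'), ('cat', 'sat')]
print(Counter(bigrams))
```

A sentence of six tokens yields five bigrams; the `Counter` then gives the occurrence counts that the BoW vector is built from.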


A better method, term frequency-inverse document frequency (TF-IDF), addresses this. TF captures the importance of the word relative to the length of the sentence, while IDF captures in how many sentences the gram occurs relative to the total number of sentences, thus highlighting the rarity of the word.

If N is the total number of documents, n is the number of documents containing the word or keyword, then IDF = log(N/n).


A word has a higher TF-IDF score if it occurs more (frequently) in a document but occurs less or infrequently in the corpus. The TF-IDF score determines how unique the word is in the corpus.
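The formulas above can be sketched directly over a toy corpus (practical implementations, such as scikit-learn's `TfidfVectorizer`, add smoothing and normalization on top of this):

```python
import math

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "ran"],
]
N = len(docs)

def tf(word, doc):
    # term frequency relative to document length
    return doc.count(word) / len(doc)

def idf(word):
    # IDF = log(N / n), where n = documents containing the word
    n = sum(1 for d in docs if word in d)
    return math.log(N / n)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

print(tfidf("the", docs[0]))           # 0.0 -- "the" appears in every document
print(round(tfidf("cat", docs[0]), 3)) # positive -- "cat" is rarer in the corpus
```

A word occurring in every document gets IDF = log(N/N) = 0, so its TF-IDF score vanishes no matter how frequent it is, which is exactly the down-weighting of popular words described above.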


A lexical (keyword) search result is scored by similarity methods like TF-IDF whose scales are usually unbounded, while a semantic (vector) search result is scored by distance methods like cosine similarity etc. whose scales are within a closed interval.
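Cosine similarity, for instance, is bounded in [-1, 1]; a minimal sketch over toy embedding vectors:

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: dot product over norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# identical directions score 1.0, orthogonal directions score 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0
```

Because the score depends only on direction, not magnitude, two documents of very different lengths can still score as highly similar, unlike an unbounded TF-IDF dot product.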

The journey from lexical search to semantic search to a more advanced hybrid search is about how information retrieval can be improved. Hybrid search is sort of a ‘sum’ of lexical search and semantic search and when done right, can yield more relevant results than either. The methods used to merge the lexical and semantic search results to get to a hybrid search query have been well explained in the article by Elastic Search service.

Other hybrid search engines are Snowflake Cortex Search and Azure AI Search. One can enhance search experience by combining the precision of lexical search and the context understanding of semantic search and crafting hybrid search queries.

Machine Learning

After pre-processing and feature extraction, the data is ready to be consumed; that is, models can be trained on it.

A supervised learning model learns patterns from the features to predict the labels. We perform classification tasks with labelled data, for example detection of spam emails, wherein the model is trained on emails that carry labels (spam and non-spam). Typical supervised learning algorithms used for text classification are (multinomial) Naive Bayes and logistic regression.
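A from-scratch sketch of multinomial Naive Bayes with Laplace smoothing (the toy labelled emails are invented; in practice scikit-learn's `MultinomialNB` on top of a TF-IDF matrix is the usual choice):

```python
import math
from collections import Counter, defaultdict

# toy labelled corpus (invented examples)
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("project meeting schedule", "ham"),
]

word_counts = defaultdict(Counter)  # per-class word frequencies
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for label in class_counts:
        # log prior: P(class)
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace-smoothed likelihood: P(word | class)
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free money"))        # spam
print(predict("meeting schedule"))  # ham
```

The "+1" smoothing keeps unseen words from zeroing out a class entirely, which matters with vocabularies this small.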

Unsupervised learning encompasses

  1. clustering algorithms like (agglomerative/bottom-up) hierarchical clustering (connectivity-based), k-means clustering (centroid-based), and density-based clustering (DBSCAN, OPTICS)

  2. automated text summarization and topic modeling

Text summarization, as the name suggests, is creating a summary of a corpus. Summarization algorithms like Latent Semantic Analysis (LSA) perform best on large, long documents.


Topic modeling focuses on extracting themes from a collection of text documents (when texts are diverse). Topic models are probabilistic models. They are developed using linear algebraic methods such as singular value decomposition (SVD) to uncover latent semantic structures from texts wherein matrix factorization divides a feature matrix into smaller components.


Algorithms such as Latent Dirichlet Allocation (LDA) and LSA take advantage of linear algebra to divide a document into topics (clusters of words). The resulting vector contains all topics with weights. Similar content can be grouped by topic (text classification).
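The matrix-factorization idea behind LSA can be sketched with NumPy (assumed available; the term-document matrix below is a toy):

```python
import numpy as np

# toy term-document matrix: rows are terms, columns are documents
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 0.0],
    [1.0, 0.0, 2.0],
])

# SVD factorizes A into term-topic (U), topic strengths (S), topic-document (Vt)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# keeping only the k strongest singular values retains the dominant latent "topics"
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(S.round(2))  # singular values in decreasing order
```

The truncated product `A_k` is the low-rank approximation: each retained singular value corresponds to one latent semantic dimension shared by terms and documents.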

LDA is a conditional, probabilistic form of topic modeling which uncovers the latent topics (themes) characterizing a collection of documents. If we have two topics, Topic 1 and Topic 2, the scores attached to each word are the probability of that keyword appearing under a topic across the whole corpus.


Neural Networks or Deep Learning

LLMs make use of the transformer architecture and learn from a vast amount of text data. The input is a huge number of tokens with which massive neural networks are trained.

▶️ LLMs split the text (input sequence/sentence) into tokens and convert them into vector embeddings.

▶️ LLMs use positional embeddings to track token order.

Positional encoding is typically introduced as a set of additional vectors that are added to the embeddings. A positional encoding vector is created for each position so each position has a unique representation, before being fed into the network. Feedforward neural networks apply non-linear transformations to the token representations, allowing capture of complex patterns and relationships.

▶️ The self-attention mechanism in a transformer architecture allows each word to attend to every other word in the sequence, weighing its importance for the current token. The cross-attention mechanism, on the other hand, allows looking across a related sequence. Self-attention operates in multiple attention heads to capture different relationships between tokens. Cross-attention is also multi-headed. The softmax activation function is used to calculate attention weights in the multi-head attention mechanism.
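A single-head scaled dot-product self-attention sketch in NumPy (assumed available; the random projection matrices stand in for learned weights, and the shapes are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # project each token embedding into query, key, and value vectors
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # every token scores every other token, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, embedding dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)  # (4, 8) (4, 4)
```

The 4×4 weight matrix is the attention pattern: row i says how much token i attends to each of the four tokens. A multi-head layer runs several such projections in parallel and concatenates the outputs.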


The families of the transformer architecture are encoder-only, decoder-only, and encoder-decoder. An encoder network looks across the input sequence and the decoder network looks across a sequence of representations from the encoder.

The GPT (generative pre-trained transformer) series developed by OpenAI has a decoder-only architecture, has unidirectional (left-to-right) context handling, and is used primarily for text generation and summarization.

BERT (Bidirectional Encoder Representations from Transformers), developed by Google, has an encoder-only architecture and is used primarily for text classification and sentiment analysis. Google Gemini has an encoder-decoder architecture. It is multimodal and good for content generation and complex reasoning.


A transformer processes input sequences in parallel, making it efficient for training and inference. It handles long-range interactions better and can be deeper (in layers) than Recurrent Neural Networks (RNNs) in practice. It needs less training time than RNNs, which also run into limitations in retaining context when the "distance" or range between pieces of information in an input is long.

💡 A transformer is one of the most important sequence modeling improvements of the past decade.

LLMs to Agents

If external databases are integrated with an LLM to cater to the specific domain or use-case, the LLM yields contextual output with RAG (retrieval augmented generation). Factual inconsistencies can be mitigated to a good extent by use of RAG.
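A toy RAG loop, with a naive word-overlap retriever standing in for a real embedding model and vector database (the documents and query here are invented):

```python
import re

def words(text):
    # lowercase word set, ignoring punctuation
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    # rank documents by word overlap with the query (toy lexical retriever;
    # production RAG systems use embeddings and a vector store instead)
    q = words(query)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]

documents = [
    "Our refund policy allows returns within 30 days.",
    "The office is open Monday to Friday.",
    "Shipping takes 3 to 5 business days.",
]

query = "What is your refund policy for returns?"
context = retrieve(query, documents)[0]

# the retrieved context is prepended to the prompt sent to the LLM
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```

Grounding the answer in retrieved text is what lets RAG reduce factual inconsistencies: the model is asked to answer from the supplied context rather than from its parametric memory alone.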

When LLMs dynamically direct their own processes and the tools accessible to them, maintaining control over how they accomplish tasks, they act as agents. Agents are different from workflows: workflows are systems where LLMs and other tools are orchestrated through predefined code paths. We must be well aware of the business cases in which agents should be used.

One can build an agent with the open-source LangChain framework.


LLMs can be made more behaviorally aware and more responsible in their outputs by rewarding good decisions and penalizing bad ones. Training a model to make better decisions through rewards is known as reinforcement learning (RL).


LLMs can be made to perform better by RL from human feedback (RLHF).


Caveats

▶️ Longer input sequences mean more tokens, which in turn means higher memory usage and often an overloaded KV cache (K for key, V for value: the cache that stores token representations for reuse). A large KV cache slows processing and forces higher memory allocation, making inference expensive.

▶️ Older tokens lose relevance as the input grows: the model tends to forget older tokens and focus on recent ones, leading to factual inconsistencies as input size increases.

▶️ The self-attention operation in LLMs has O(n²) time complexity in the sequence length, which kills efficiency.

It is clear by now that more tokens do not mean better results. They only mean more noise in the input, leading to more hallucinations in the output. From a pragmatic point of view, a higher number of tokens only leads to higher costs (LLM service providers charge per token).

💡💡Memory-efficient hybrid systems are smarter solutions.