Getting Started with Natural Language Processing

An introduction to NLP concepts and techniques for processing and analyzing text data.

Natural Language Processing (NLP) enables machines to understand, interpret, and generate human language. From search engines to virtual assistants, NLP powers many of the applications we use daily. In this comprehensive guide, we’ll explore NLP from fundamental concepts to modern transformer-based approaches.

Table of Contents

  1. Introduction to NLP
  2. Text Preprocessing
  3. Text Representation
  4. Word Embeddings
  5. Traditional NLP Tasks
  6. Named Entity Recognition
  7. Modern NLP with Transformers
  8. Sequence-to-Sequence Tasks
  9. Practical Applications
  10. Best Practices and Tips

Introduction to NLP

What is NLP?

Natural Language Processing sits at the intersection of linguistics, computer science, and artificial intelligence. It aims to bridge the gap between human communication and computer understanding.

Key Challenges in NLP:

  • Ambiguity: Words and sentences can have multiple meanings
  • Context: Meaning often depends on surrounding context
  • Variability: Many ways to express the same idea
  • World Knowledge: Understanding requires background knowledge

NLP Pipeline

A typical NLP pipeline consists of:

Raw Text → Preprocessing → Feature Extraction → Model → Output

Each stage transforms the text into a more useful representation for the task at hand.
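
As a minimal sketch (the function names below are illustrative, not from any particular library), the stages can be thought of as a chain of plain functions:

# Illustrative pipeline: each stage is a plain function, chained left to right
def preprocess(raw_text):
    return raw_text.lower().split()                          # normalize and tokenize

def extract_features(tokens):
    return {tok: tokens.count(tok) for tok in set(tokens)}   # toy bag-of-words counts

def toy_model(features):
    # Stand-in for a trained model: a single hand-written rule
    return "positive" if features.get("great", 0) else "negative"

print(toy_model(extract_features(preprocess("NLP is great"))))  # positive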

Setting Up Your Environment

# Install essential libraries
# pip install nltk spacy transformers torch scikit-learn gensim

import nltk
import spacy
import torch
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Download spaCy model
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

Text Preprocessing

Text preprocessing is crucial for NLP success. Raw text is messy and needs cleaning before analysis.

Complete Preprocessing Pipeline

import re
import string
import unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import contractions

class TextPreprocessor:
    def __init__(self, lowercase=True, remove_stopwords=True,
                 lemmatize=True, stem=False, remove_numbers=True,
                 remove_punctuation=True, min_word_length=2):
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords
        self.lemmatize = lemmatize
        self.stem = stem
        self.remove_numbers = remove_numbers
        self.remove_punctuation = remove_punctuation
        self.min_word_length = min_word_length

        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.stemmer = PorterStemmer()

    def expand_contractions(self, text):
        """Expand contractions: don't → do not"""
        return contractions.fix(text)

    def remove_urls(self, text):
        """Remove URLs from text"""
        url_pattern = re.compile(r'https?://\S+|www\.\S+')
        return url_pattern.sub('', text)

    def remove_html_tags(self, text):
        """Remove HTML tags"""
        clean = re.compile('<.*?>')
        return re.sub(clean, '', text)

    def remove_emails(self, text):
        """Remove email addresses"""
        email_pattern = re.compile(r'\S+@\S+')
        return email_pattern.sub('', text)

    def remove_mentions_hashtags(self, text):
        """Remove @mentions and #hashtags"""
        text = re.sub(r'@\w+', '', text)
        text = re.sub(r'#\w+', '', text)
        return text

    def normalize_unicode(self, text):
        """Normalize unicode characters"""
        return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')

    def remove_extra_whitespace(self, text):
        """Remove extra whitespace"""
        return ' '.join(text.split())

    def preprocess(self, text):
        """Full preprocessing pipeline"""
        # Initial cleaning
        text = self.remove_html_tags(text)
        text = self.remove_urls(text)
        text = self.remove_emails(text)
        text = self.remove_mentions_hashtags(text)
        text = self.expand_contractions(text)
        text = self.normalize_unicode(text)

        # Lowercase
        if self.lowercase:
            text = text.lower()

        # Remove numbers
        if self.remove_numbers:
            text = re.sub(r'\d+', '', text)

        # Remove punctuation
        if self.remove_punctuation:
            text = text.translate(str.maketrans('', '', string.punctuation))

        # Tokenize
        tokens = word_tokenize(text)

        # Remove stopwords
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stop_words]

        # Lemmatize or stem
        if self.lemmatize:
            tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
        elif self.stem:
            tokens = [self.stemmer.stem(t) for t in tokens]

        # Filter by length
        tokens = [t for t in tokens if len(t) >= self.min_word_length]

        # Remove extra whitespace
        text = ' '.join(tokens)
        text = self.remove_extra_whitespace(text)

        return text

# Usage
preprocessor = TextPreprocessor()
text = "I can't believe it's 2024! Check out https://example.com for more info. #NLP @user"
cleaned = preprocessor.preprocess(text)
print(cleaned)
# Output: "cannot believe check info"

Tokenization in Detail

from nltk.tokenize import (
    word_tokenize,
    sent_tokenize,
    TreebankWordTokenizer,
    TweetTokenizer
)

text = "Hello! How are you? I'm doing great :) #happy"

# Word tokenization
print("Word tokens:", word_tokenize(text))
# ['Hello', '!', 'How', 'are', 'you', '?', 'I', "'m", 'doing', 'great', ':', ')', '#', 'happy']

# Sentence tokenization
print("Sentences:", sent_tokenize(text))
# ['Hello!', 'How are you?', "I'm doing great :) #happy"]

# Tweet tokenizer (preserves hashtags, mentions, emoticons)
tweet_tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
print("Tweet tokens:", tweet_tokenizer.tokenize(text))
# ['hello', '!', 'how', 'are', 'you', '?', "i'm", 'doing', 'great', ':)', '#happy']

# spaCy tokenization
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("spaCy tokens:", [token.text for token in doc])

Handling Different Languages

# For non-English text
from nltk.corpus import stopwords

# Available languages
print(stopwords.fileids())
# ['arabic', 'danish', 'dutch', 'english', 'french', 'german', ...]

# Spanish preprocessing example
spanish_stops = set(stopwords.words('spanish'))
spanish_text = "El gato está en la casa"
tokens = word_tokenize(spanish_text.lower(), language='spanish')
filtered = [t for t in tokens if t not in spanish_stops]
print(filtered)  # ['gato', 'casa']  ("está" is also in the Spanish stopword list)

# Using spaCy for multiple languages
# python -m spacy download es_core_news_sm
# python -m spacy download de_core_news_sm
# python -m spacy download fr_core_news_sm

Stemming vs Lemmatization

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer

# Test words
words = ['running', 'runs', 'ran', 'easily', 'fairly', 'studies', 'studying']

# Stemming - crude but fast
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

print("Porter:", [porter.stem(w) for w in words])
# ['run', 'run', 'ran', 'easili', 'fairli', 'studi', 'studi']

print("Lancaster:", [lancaster.stem(w) for w in words])
# ['run', 'run', 'ran', 'easy', 'fair', 'study', 'study']

print("Snowball:", [snowball.stem(w) for w in words])
# ['run', 'run', 'ran', 'easili', 'fair', 'studi', 'studi']

# Lemmatization - uses vocabulary and morphological analysis
lemmatizer = WordNetLemmatizer()
print("Lemmatizer:", [lemmatizer.lemmatize(w, pos='v') for w in words])
# ['run', 'run', 'run', 'easily', 'fairly', 'study', 'study']

Text Representation

Computers need numerical representations of text. Here are the main approaches.

Bag of Words (BoW)

The simplest approach: count word occurrences.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "I love machine learning and deep learning",
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks",
    "I love artificial intelligence research"
]

# Basic BoW
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# View the vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())

# View the document-term matrix
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)

Customizing CountVectorizer:

# With various options
vectorizer = CountVectorizer(
    max_features=1000,       # Keep top 1000 words
    min_df=2,                # Ignore words appearing in <2 documents
    max_df=0.95,             # Ignore words appearing in >95% of documents
    ngram_range=(1, 2),      # Include unigrams and bigrams
    stop_words='english',    # Remove English stop words
    lowercase=True,          # Convert to lowercase
    token_pattern=r'\b[a-zA-Z]{2,}\b'  # Only alphabetic tokens with 2+ chars
)

X = vectorizer.fit_transform(corpus)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Some bigrams: {[w for w in vectorizer.get_feature_names_out() if ' ' in w][:10]}")

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF weighs words by how unique they are to a document.

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)}$$

Where:

  • TF(t, d) = frequency of term t in document d
  • DF(t) = number of documents containing term t
  • N = total number of documents
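
To make the formula concrete, here is a tiny hand computation of raw TF-IDF weights for a three-document toy corpus. This is a minimal sketch of the formula above; scikit-learn's TfidfVectorizer (used below) adds IDF smoothing and L2 normalization, so its values differ slightly.

import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]
N = len(docs)

def tf(term, doc):
    return doc.count(term)                      # raw term frequency within one document

def idf(term):
    df = sum(1 for d in docs if term in d)      # number of documents containing the term
    return math.log(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tfidf("cat", docs[0]), 3))  # 1.099 - appears once, in only 1 of 3 docs (distinctive)
print(round(tfidf("the", docs[0]), 3))  # 0.811 - appears twice, but in 2 of 3 docs (less distinctive)
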
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "Machine learning is amazing for data science",
    "Data science requires statistics and programming",
    "Machine learning uses algorithms to learn from data",
    "Programming is essential for machine learning"
]

# TF-IDF vectorization
tfidf = TfidfVectorizer(
    max_features=100,
    ngram_range=(1, 2),
    sublinear_tf=True  # Use log(1 + tf) instead of tf
)

X = tfidf.fit_transform(corpus)

# Get feature names and their IDF values
feature_names = tfidf.get_feature_names_out()
idf_values = tfidf.idf_

# Show terms sorted by IDF (higher = more unique)
idf_df = pd.DataFrame({'term': feature_names, 'idf': idf_values})
print(idf_df.sort_values('idf', ascending=False).head(10))

# Get top terms for each document
def get_top_terms(doc_idx, n=5):
    scores = X[doc_idx].toarray().flatten()
    top_indices = scores.argsort()[-n:][::-1]
    return [(feature_names[i], scores[i]) for i in top_indices if scores[i] > 0]

for i, doc in enumerate(corpus):
    print(f"\nDocument {i}: {doc[:50]}...")
    print(f"Top terms: {get_top_terms(i)}")

N-grams

Capture word sequences to preserve some context.

from sklearn.feature_extraction.text import CountVectorizer

text = ["The quick brown fox jumps over the lazy dog"]

# Unigrams only
unigram_vec = CountVectorizer(ngram_range=(1, 1))
unigrams = unigram_vec.fit_transform(text)
print("Unigrams:", unigram_vec.get_feature_names_out())

# Bigrams only
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigrams = bigram_vec.fit_transform(text)
print("Bigrams:", bigram_vec.get_feature_names_out())

# Unigrams + Bigrams + Trigrams
ngram_vec = CountVectorizer(ngram_range=(1, 3))
ngrams = ngram_vec.fit_transform(text)
print("All n-grams:", ngram_vec.get_feature_names_out())

# Character n-grams (useful for spelling correction, language detection)
char_vec = CountVectorizer(analyzer='char', ngram_range=(2, 4))
char_ngrams = char_vec.fit_transform(text)
print("Char n-grams sample:", char_vec.get_feature_names_out()[:20])

Word Embeddings

Word embeddings represent words as dense vectors that capture semantic meaning.

Word2Vec

Word2Vec learns embeddings by predicting words from context (CBOW) or context from words (Skip-gram).

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import gensim.downloader as api

# Train your own Word2Vec
sentences = [
    ['machine', 'learning', 'is', 'amazing'],
    ['deep', 'learning', 'uses', 'neural', 'networks'],
    ['natural', 'language', 'processing', 'is', 'nlp'],
    ['python', 'is', 'great', 'for', 'machine', 'learning'],
    ['tensorflow', 'and', 'pytorch', 'are', 'deep', 'learning', 'frameworks']
]

model = Word2Vec(
    sentences,
    vector_size=100,     # Embedding dimension
    window=5,            # Context window size
    min_count=1,         # Minimum word frequency
    workers=4,           # Number of threads
    sg=1,                # Skip-gram (0 for CBOW)
    epochs=100           # Training iterations
)

# Save and load
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")

# Use pre-trained Google News vectors (3 million word vectors trained on ~100 billion words)
# Warning: This downloads ~1.5GB
# google_model = api.load('word2vec-google-news-300')

# Use smaller pre-trained model
glove_model = api.load('glove-wiki-gigaword-100')

# Find similar words
print(glove_model.most_similar('king', topn=5))

# Word arithmetic: king - man + woman = ?
result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'])
print(f"king - man + woman = {result[0][0]}")  # queen

# Calculate similarity
print(f"Similarity(cat, dog): {glove_model.similarity('cat', 'dog'):.4f}")
print(f"Similarity(cat, car): {glove_model.similarity('cat', 'car'):.4f}")

# Get word vector
vector = glove_model['computer']
print(f"Vector shape: {vector.shape}")

Document Embeddings

Create embeddings for entire documents.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare documents
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret images"
]

# Tag documents
tagged_docs = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(documents)]

# Train Doc2Vec
model = Doc2Vec(
    tagged_docs,
    vector_size=50,
    window=2,
    min_count=1,
    workers=4,
    epochs=100
)

# Get document vector
doc_vector = model.dv[0]
print(f"Document vector shape: {doc_vector.shape}")

# Find similar documents
similar_docs = model.dv.most_similar(0)
print(f"Most similar to doc 0: {similar_docs}")

# Infer vector for new document
new_doc = "Neural networks are used in deep learning"
inferred_vector = model.infer_vector(new_doc.split())

Sentence Embeddings with Sentence Transformers

Modern approach using transformers.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Machine learning is fascinating",
    "I love artificial intelligence",
    "The weather is nice today",
    "Deep learning is a type of machine learning"
]

# Generate embeddings
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Calculate similarity matrix
similarity_matrix = cosine_similarity(embeddings)

# Print similarity matrix
import pandas as pd
df = pd.DataFrame(similarity_matrix,
                  index=[s[:30] for s in sentences],
                  columns=[s[:30] for s in sentences])
print(df.round(3))

# Semantic search
def semantic_search(query, documents, model, top_k=3):
    """Find most similar documents to a query."""
    query_embedding = model.encode([query])
    doc_embeddings = model.encode(documents)
    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    top_indices = similarities.argsort()[-top_k:][::-1]
    return [(documents[i], similarities[i]) for i in top_indices]

query = "How does AI work?"
results = semantic_search(query, sentences, model)
for doc, score in results:
    print(f"Score: {score:.4f} - {doc}")

Traditional NLP Tasks

Sentiment Analysis

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import pandas as pd

# Sample data (in practice, use larger datasets like IMDb, Yelp)
data = {
    'text': [
        "This product is amazing! Best purchase ever!",
        "Terrible quality, complete waste of money",
        "It's okay, nothing special but does the job",
        "Absolutely love it! Exceeded expectations!",
        "Disappointed, it broke after one week",
        "Great value for the price, highly recommend",
        "Not worth it, poor customer service",
        "Perfect! Exactly what I needed",
        "Mediocre at best, wouldn't buy again",
        "Outstanding product, fast shipping too!"
    ],
    'sentiment': [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]  # 1=positive, 0=negative
}

df = pd.DataFrame(data)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['sentiment'], test_size=0.3, random_state=42,
    stratify=df['sentiment']  # keep both classes present in the small test split
)

# Create pipeline
sentiment_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),
        min_df=1
    )),
    ('clf', LogisticRegression(
        max_iter=1000,
        class_weight='balanced'
    ))
])

# Train
sentiment_pipeline.fit(X_train, y_train)

# Evaluate
predictions = sentiment_pipeline.predict(X_test)
print(classification_report(y_test, predictions, target_names=['Negative', 'Positive']))

# Cross-validation (cv=3 because the toy dataset has only 4 negative examples)
cv_scores = cross_val_score(sentiment_pipeline, df['text'], df['sentiment'], cv=3)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Predict new text
new_texts = [
    "I absolutely hate this product",
    "Best thing I've ever bought!"
]
predictions = sentiment_pipeline.predict(new_texts)
probabilities = sentiment_pipeline.predict_proba(new_texts)

for text, pred, prob in zip(new_texts, predictions, probabilities):
    sentiment = "Positive" if pred == 1 else "Negative"
    confidence = max(prob)
    print(f"Text: '{text}'")
    print(f"Sentiment: {sentiment} (confidence: {confidence:.4f})")

Text Classification

Multi-class classification for categorizing documents.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# Load 20 Newsgroups dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
                                       remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
                                      remove=('headers', 'footers', 'quotes'))

# TF-IDF features
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X_train = tfidf.fit_transform(newsgroups_train.data)
X_test = tfidf.transform(newsgroups_test.data)
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Compare multiple classifiers
classifiers = {
    'Naive Bayes': MultinomialNB(),
    'Linear SVM': LinearSVC(max_iter=10000),
    'Random Forest': RandomForestClassifier(n_estimators=100, n_jobs=-1)
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    results[name] = accuracy
    print(f"\n{name} Accuracy: {accuracy:.4f}")

# Best classifier detailed report
best_clf_name = max(results, key=results.get)
best_clf = classifiers[best_clf_name]
predictions = best_clf.predict(X_test)
print(f"\n{best_clf_name} Classification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups_train.target_names))

# Feature importance (for SVM)
if isinstance(best_clf, LinearSVC):
    feature_names = np.array(tfidf.get_feature_names_out())
    for i, category in enumerate(newsgroups_train.target_names):
        top_indices = best_clf.coef_[i].argsort()[-10:][::-1]
        print(f"\n{category} - Top features: {', '.join(feature_names[top_indices])}")

Topic Modeling with LDA

Discover hidden topics in document collections.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

documents = [
    "Machine learning algorithms process data to make predictions",
    "Neural networks are inspired by the human brain structure",
    "Python is a popular programming language for data science",
    "Deep learning requires large amounts of training data",
    "Artificial intelligence is transforming many industries",
    "Data preprocessing is crucial for machine learning success",
    "JavaScript is widely used for web development",
    "Natural language processing enables text understanding",
    "Cloud computing provides scalable infrastructure",
    "Reinforcement learning trains agents through rewards"
]

# Create document-term matrix
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(documents)

# Train LDA
n_topics = 3
lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=10,
    learning_method='online'
)
lda.fit(doc_term_matrix)

# Display topics
feature_names = vectorizer.get_feature_names_out()

def display_topics(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_words_idx = topic.argsort()[:-n_top_words-1:-1]
        top_words = [feature_names[i] for i in top_words_idx]
        print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")

display_topics(lda, feature_names)

# Get topic distribution for a document
doc_topics = lda.transform(doc_term_matrix)
for i, doc in enumerate(documents[:3]):
    print(f"\nDocument: '{doc[:50]}...'")
    print(f"Topic distribution: {doc_topics[i].round(3)}")

Named Entity Recognition

NER identifies and classifies named entities (people, places, organizations, etc.).

Using spaCy

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

text = """
Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in
Cupertino, California in 1976. The company's first product was the Apple I
computer. Today, Apple is worth over $2 trillion and employs more than
150,000 people worldwide. Their headquarters, Apple Park, opened in April 2017.
"""

doc = nlp(text)

# Extract entities
print("Named Entities:")
for ent in doc.ents:
    print(f"  {ent.text:20} | {ent.label_:10} | {spacy.explain(ent.label_)}")

# Visualize entities
# displacy.serve(doc, style="ent")  # For Jupyter
html = displacy.render(doc, style="ent", page=True)

# Entity statistics
from collections import Counter
entity_types = Counter([ent.label_ for ent in doc.ents])
print("\nEntity type distribution:")
for entity_type, count in entity_types.most_common():
    print(f"  {entity_type}: {count}")

# Custom entity extraction
def extract_entities_by_type(text, entity_types):
    """Extract specific entity types from text."""
    doc = nlp(text)
    entities = {}
    for ent_type in entity_types:
        entities[ent_type] = [ent.text for ent in doc.ents if ent.label_ == ent_type]
    return entities

entities = extract_entities_by_type(text, ['ORG', 'PERSON', 'GPE', 'DATE', 'MONEY'])
for ent_type, ents in entities.items():
    print(f"{ent_type}: {ents}")

Training Custom NER

import spacy
from spacy.training import Example
import random

# Create blank model
nlp = spacy.blank("en")

# Add NER pipeline
ner = nlp.add_pipe("ner")

# Training data - annotated examples (character offsets must match the entity text,
# and entity spans may not overlap)
TRAIN_DATA = [
    ("iPhone 15 is the latest Apple smartphone", {"entities": [(0, 9, "PRODUCT"), (24, 29, "ORG")]}),
    ("Tesla Model S has great range", {"entities": [(0, 13, "PRODUCT")]}),
    ("Microsoft Office 365 is popular", {"entities": [(0, 20, "PRODUCT")]}),
    ("Samsung Galaxy phone is affordable", {"entities": [(0, 14, "PRODUCT")]}),
]

# Add entity labels
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

# Training (spaCy v3 uses initialize() in place of the older begin_training())
optimizer = nlp.initialize()

for iteration in range(30):
    random.shuffle(TRAIN_DATA)
    losses = {}

    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

    print(f"Iteration {iteration}, Losses: {losses}")

# Test
test_text = "The new Google Pixel is competing with iPhone 14"
doc = nlp(test_text)
print("\nTest results:")
for ent in doc.ents:
    print(f"  {ent.text}: {ent.label_}")

Modern NLP with Transformers

Transformers have revolutionized NLP with their attention mechanism.
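
At the heart of the architecture is scaled dot-product attention. Below is a minimal single-head sketch in PyTorch (illustrative only; real transformer layers add multi-head projections, residual connections, and layer normalization):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # how strongly each query matches every key
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1 per query
    return weights @ v                              # weighted sum of value vectors

# Toy self-attention: 1 sentence, 4 tokens, 8-dimensional vectors (Q = K = V)
x = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 4, 8])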

Understanding Transformers

from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoModelForSequenceClassification,
    pipeline
)
import torch

# Load pre-trained BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize text
text = "Natural language processing is fascinating!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

print("Tokenized input:")
print(f"  Input IDs: {inputs['input_ids']}")
print(f"  Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")
print(f"  Attention mask: {inputs['attention_mask']}")

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Last hidden state (contextual embeddings)
last_hidden_state = outputs.last_hidden_state
print(f"\nLast hidden state shape: {last_hidden_state.shape}")
# Shape: [batch_size, sequence_length, hidden_size]

# Pooler output (CLS token representation)
pooler_output = outputs.pooler_output
print(f"Pooler output shape: {pooler_output.shape}")
# Shape: [batch_size, hidden_size]

Hugging Face Pipelines

Easy-to-use interfaces for common NLP tasks.

from transformers import pipeline

# Sentiment Analysis
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love using transformers for NLP!")
print(f"Sentiment: {result}")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named Entity Recognition
ner_pipeline = pipeline("ner", aggregation_strategy="simple")
result = ner_pipeline("Apple CEO Tim Cook announced new products in California")
print(f"NER: {result}")

# Question Answering
qa_pipeline = pipeline("question-answering")
context = """
Transformers were introduced in the paper "Attention Is All You Need" by
Vaswani et al. in 2017. They have since become the foundation for many
state-of-the-art NLP models including BERT, GPT, and T5.
"""
question = "When were transformers introduced?"
result = qa_pipeline(question=question, context=context)
print(f"Answer: {result['answer']} (confidence: {result['score']:.4f})")

# Text Summarization
summarizer = pipeline("summarization")
long_text = """
Machine learning is a subset of artificial intelligence that provides systems
the ability to automatically learn and improve from experience without being
explicitly programmed. Machine learning focuses on the development of computer
programs that can access data and use it to learn for themselves. The process
begins with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better decisions
in the future based on the examples that we provide.
"""
summary = summarizer(long_text, max_length=50, min_length=25)
print(f"Summary: {summary[0]['summary_text']}")

# Text Generation
generator = pipeline("text-generation", model="gpt2")
prompt = "The future of artificial intelligence is"
result = generator(prompt, max_length=50, num_return_sequences=1)
print(f"Generated: {result[0]['generated_text']}")

# Zero-Shot Classification
classifier = pipeline("zero-shot-classification")
text = "I just got back from an amazing trip to Paris!"
candidate_labels = ["travel", "cooking", "technology", "sports"]
result = classifier(text, candidate_labels)
print(f"Classification: {list(zip(result['labels'], result['scores']))}")

# Translation
translator = pipeline("translation_en_to_fr")
result = translator("Hello, how are you?")
print(f"Translation: {result[0]['translation_text']}")

Fine-Tuning BERT for Classification

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset, Dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Load dataset
dataset = load_dataset("imdb")

# Subset for demonstration
train_data = dataset["train"].shuffle(seed=42).select(range(1000))
test_data = dataset["test"].shuffle(seed=42).select(range(200))

# Load tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=256
    )

tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_test = test_data.map(tokenize_function, batched=True)

# Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="weighted")
    }

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

# Train
trainer.train()

# Evaluate
results = trainer.evaluate()
print(f"Final Results: {results}")

# Save model
trainer.save_model("./fine_tuned_model")

# Inference with fine-tuned model
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model="./fine_tuned_model")
print(classifier("This movie was absolutely fantastic!"))

Sequence-to-Sequence Tasks

Machine Translation

from transformers import MarianMTModel, MarianTokenizer

# English to French translation
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(text, tokenizer, model):
    inputs = tokenizer(text, return_tensors="pt", padding=True)
    translated = model.generate(**inputs)
    return tokenizer.decode(translated[0], skip_special_tokens=True)

english_texts = [
    "Hello, how are you?",
    "Machine learning is fascinating.",
    "The weather is beautiful today."
]

for text in english_texts:
    translation = translate(text, tokenizer, model)
    print(f"EN: {text}")
    print(f"FR: {translation}\n")

Text Summarization

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# Using T5 for summarization
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def summarize(text, max_length=150, min_length=40):
    inputs = tokenizer.encode(
        "summarize: " + text,
        return_tensors="pt",
        max_length=512,
        truncation=True
    )

    summary_ids = model.generate(
        inputs,
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True
    )

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

article = """
The COVID-19 pandemic has fundamentally changed how we work, with remote work
becoming the norm for many industries. Companies that once required in-person
attendance have adapted to distributed teams, using video conferencing and
collaboration tools. While some businesses are returning to offices, many have
adopted hybrid models that offer flexibility. Studies show that remote workers
often report higher productivity and job satisfaction, though challenges remain
around team collaboration and work-life boundaries. The shift has also impacted
commercial real estate and urban planning, as businesses reconsider their space
needs and employees move away from city centers.
"""

summary = summarize(article)
print(f"Original length: {len(article.split())} words")
print(f"Summary: {summary}")
print(f"Summary length: {len(summary.split())} words")

Question Generation

from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "valhalla/t5-small-qg-hl"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def generate_questions(context, answer):
    """Generate a question whose answer is the highlighted span in the context."""
    # The qg-hl models expect the answer wrapped in <hl> tokens inside the context itself
    highlighted = context.replace(answer, f"<hl> {answer} <hl>", 1)
    input_text = f"generate question: {highlighted}"
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)

    outputs = model.generate(
        inputs,
        max_length=64,
        num_beams=4,
        early_stopping=True
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

context = "Paris is the capital and most populous city of France."
answer = "Paris"

question = generate_questions(context, answer)
print(f"Context: {context}")
print(f"Answer: {answer}")
print(f"Generated Question: {question}")

Practical Applications

Building a Chatbot

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

class SimpleChatbot:
    def __init__(self, model_name="microsoft/DialoGPT-medium"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.chat_history_ids = None

    def respond(self, user_input):
        # Encode user input
        new_input_ids = self.tokenizer.encode(
            user_input + self.tokenizer.eos_token,
            return_tensors='pt'
        )

        # Append to chat history
        if self.chat_history_ids is not None:
            bot_input_ids = torch.cat([self.chat_history_ids, new_input_ids], dim=-1)
        else:
            bot_input_ids = new_input_ids

        # Generate response
        self.chat_history_ids = self.model.generate(
            bot_input_ids,
            max_length=1000,
            pad_token_id=self.tokenizer.eos_token_id,
            no_repeat_ngram_size=3,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )

        # Decode response
        response = self.tokenizer.decode(
            self.chat_history_ids[:, bot_input_ids.shape[-1]:][0],
            skip_special_tokens=True
        )

        return response

    def reset(self):
        self.chat_history_ids = None

# Usage
chatbot = SimpleChatbot()
print("Chatbot: Hi! I'm a simple chatbot. Type 'quit' to exit.")

while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    response = chatbot.respond(user_input)
    print(f"Bot: {response}")

Semantic Document Search

Combine sentence embeddings with a FAISS vector index for fast similarity search over a document collection.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

class DocumentSearchEngine:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.index = None
        self.documents = []

    def index_documents(self, documents):
        """Index documents for fast similarity search."""
        self.documents = documents

        # Generate embeddings
        embeddings = self.model.encode(documents, show_progress_bar=True)
        embeddings = np.array(embeddings).astype('float32')

        # Normalize for cosine similarity
        faiss.normalize_L2(embeddings)

        # Create FAISS index
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)  # Inner product for cosine similarity
        self.index.add(embeddings)

        print(f"Indexed {len(documents)} documents")

    def search(self, query, k=5):
        """Search for most similar documents."""
        # Generate query embedding
        query_embedding = self.model.encode([query])
        query_embedding = np.array(query_embedding).astype('float32')
        faiss.normalize_L2(query_embedding)

        # Search
        scores, indices = self.index.search(query_embedding, k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            results.append({
                'document': self.documents[idx],
                'score': float(score),
                'index': int(idx)
            })

        return results

# Usage
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with multiple layers",
    "Natural language processing enables computers to understand text",
    "Computer vision allows machines to interpret images",
    "Reinforcement learning trains agents through rewards and penalties",
    "Transfer learning applies knowledge from one task to another",
    "Python is the most popular language for data science",
    "TensorFlow and PyTorch are popular deep learning frameworks"
]

search_engine = DocumentSearchEngine()
search_engine.index_documents(documents)

query = "How do machines learn from data?"
results = search_engine.search(query, k=3)

print(f"Query: '{query}'\n")
print("Results:")
for r in results:
    print(f"  Score: {r['score']:.4f} | {r['document']}")

Text Extraction from PDFs and Images

# PDF text extraction
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Extract text from PDF file."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

# OCR for images (requires tesseract)
import pytesseract
from PIL import Image

def extract_text_from_image(image_path):
    """Extract text from image using OCR."""
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image)
    return text

# Process extracted text with NLP
def analyze_document(text):
    """Analyze extracted text."""
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)

    analysis = {
        'entities': [(ent.text, ent.label_) for ent in doc.ents],
        'sentences': len(list(doc.sents)),
        'words': len([token for token in doc if not token.is_punct]),
        'noun_phrases': [chunk.text for chunk in doc.noun_chunks][:10]
    }

    return analysis

Best Practices and Tips

Preprocessing Guidelines

  1. Always explore your data first

    # Check for common issues
    def analyze_text_data(texts):
        import statistics
        lengths = [len(t.split()) for t in texts]
        return {
            'num_texts': len(texts),
            'avg_length': statistics.mean(lengths),
            'min_length': min(lengths),
            'max_length': max(lengths),
            'empty_texts': sum(1 for t in texts if not t.strip())
        }
  2. Handle class imbalance

    from imblearn.over_sampling import SMOTE
    from sklearn.utils.class_weight import compute_class_weight
    
    # Calculate class weights
    class_weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
    class_weight_dict = dict(zip(np.unique(y), class_weights))
  3. Use appropriate evaluation metrics

    from sklearn.metrics import classification_report, confusion_matrix
    
    # For imbalanced data, use F1, precision, recall
    print(classification_report(y_true, y_pred))

Model Selection Guide

Task                | Simple Baseline       | Better Performance   | State-of-the-Art
--------------------|-----------------------|----------------------|------------------
Text Classification | TF-IDF + LogReg       | TF-IDF + SVM         | Fine-tuned BERT
Sentiment Analysis  | Bag of Words + NB     | TF-IDF + LogReg      | Transformer
NER                 | Rule-based            | CRF                  | Fine-tuned BERT
Summarization       | Extractive (TextRank) | Seq2Seq              | T5, BART
Translation         | Statistical MT        | Seq2Seq + Attention  | Transformer

Performance Optimization

# Batch processing for transformers
from transformers import pipeline

# Use batching for efficiency
classifier = pipeline("sentiment-analysis", device=0)  # Use GPU if available
texts = ["text1", "text2", "text3", ...]
results = classifier(texts, batch_size=32)

# Use half precision for faster inference (fp16, typically on GPU)
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    torch_dtype=torch.float16  # Half precision
)

# Or dynamic quantization (int8, runs on CPU)
model_quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

Further Reading

Books

  • “Speech and Language Processing” - Jurafsky & Martin
  • “Natural Language Processing with Python” - Bird, Klein, Loper
  • “Transformers for Natural Language Processing” - Denis Rothman

Online Resources

  • Hugging Face Course - Free NLP course with transformers
  • Stanford CS224N - NLP with Deep Learning
  • spaCy Course - Free interactive course

Research Papers

  • “Attention Is All You Need” - Vaswani et al. (2017)
  • “BERT: Pre-training of Deep Bidirectional Transformers” - Devlin et al. (2018)
  • “Language Models are Few-Shot Learners” (GPT-3) - Brown et al. (2020)

Libraries Reference

Library               | Purpose             | Documentation
----------------------|---------------------|------------------------
NLTK                  | Classic NLP toolkit | nltk.org
spaCy                 | Production NLP      | spacy.io
Hugging Face          | Transformers        | huggingface.co
Gensim                | Topic modeling      | radimrehurek.com/gensim
Sentence Transformers | Embeddings          | sbert.net