Getting Started with Natural Language Processing
An introduction to NLP concepts and techniques for processing and analyzing text data.
Natural Language Processing (NLP) enables machines to understand, interpret, and generate human language. From search engines to virtual assistants, NLP powers many of the applications we use daily. In this comprehensive guide, we’ll explore NLP from fundamental concepts to modern transformer-based approaches.
Table of Contents
- Introduction to NLP
- Text Preprocessing
- Text Representation
- Traditional NLP Tasks
- Word Embeddings
- Modern NLP with Transformers
- Named Entity Recognition
- Text Classification
- Sequence-to-Sequence Tasks
- Practical Applications
Introduction to NLP
What is NLP?
Natural Language Processing sits at the intersection of linguistics, computer science, and artificial intelligence. It aims to bridge the gap between human communication and computer understanding.
Key Challenges in NLP:
- Ambiguity: Words and sentences can have multiple meanings
- Context: Meaning often depends on surrounding context
- Variability: Many ways to express the same idea
- World Knowledge: Understanding requires background knowledge
NLP Pipeline
A typical NLP pipeline consists of:
Raw Text → Preprocessing → Feature Extraction → Model → Output
Each stage transforms the text into a more useful representation for the task at hand.
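To make these stages concrete, here is a toy, purely illustrative sketch of the flow; the functions below are placeholders standing in for real components, not an actual trained model:
# A minimal sketch of the pipeline stages as plain functions (hypothetical toy implementation)
def preprocess(text):
    return text.lower().split()                 # Raw Text -> tokens

def extract_features(tokens):
    counts = {}
    for tok in tokens:                          # tokens -> simple bag-of-words counts
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def model(features):
    # Placeholder "model": a keyword rule standing in for a trained classifier
    return "positive" if features.get("great", 0) > 0 else "negative"

print(model(extract_features(preprocess("This is a GREAT movie"))))  # Output: "positive"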
Setting Up Your Environment
# Install essential libraries
# pip install nltk spacy transformers torch scikit-learn gensim
import nltk
import spacy
import torch
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Download spaCy model
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
Text Preprocessing
Text preprocessing is crucial for NLP success. Raw text is messy and needs cleaning before analysis.
Complete Preprocessing Pipeline
import re
import string
import unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import contractions
class TextPreprocessor:
def __init__(self, lowercase=True, remove_stopwords=True,
lemmatize=True, stem=False, remove_numbers=True,
remove_punctuation=True, min_word_length=2):
self.lowercase = lowercase
self.remove_stopwords = remove_stopwords
self.lemmatize = lemmatize
self.stem = stem
self.remove_numbers = remove_numbers
self.remove_punctuation = remove_punctuation
self.min_word_length = min_word_length
self.stop_words = set(stopwords.words('english'))
self.lemmatizer = WordNetLemmatizer()
self.stemmer = PorterStemmer()
def expand_contractions(self, text):
"""Expand contractions: don't → do not"""
return contractions.fix(text)
def remove_urls(self, text):
"""Remove URLs from text"""
url_pattern = re.compile(r'https?://\S+|www\.\S+')
return url_pattern.sub('', text)
def remove_html_tags(self, text):
"""Remove HTML tags"""
clean = re.compile('<.*?>')
return re.sub(clean, '', text)
def remove_emails(self, text):
"""Remove email addresses"""
email_pattern = re.compile(r'\S+@\S+')
return email_pattern.sub('', text)
def remove_mentions_hashtags(self, text):
"""Remove @mentions and #hashtags"""
text = re.sub(r'@\w+', '', text)
text = re.sub(r'#\w+', '', text)
return text
def normalize_unicode(self, text):
"""Normalize unicode characters"""
return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
def remove_extra_whitespace(self, text):
"""Remove extra whitespace"""
return ' '.join(text.split())
def preprocess(self, text):
"""Full preprocessing pipeline"""
# Initial cleaning
text = self.remove_html_tags(text)
text = self.remove_urls(text)
text = self.remove_emails(text)
text = self.remove_mentions_hashtags(text)
text = self.expand_contractions(text)
text = self.normalize_unicode(text)
# Lowercase
if self.lowercase:
text = text.lower()
# Remove numbers
if self.remove_numbers:
text = re.sub(r'\d+', '', text)
# Remove punctuation
if self.remove_punctuation:
text = text.translate(str.maketrans('', '', string.punctuation))
# Tokenize
tokens = word_tokenize(text)
# Remove stopwords
if self.remove_stopwords:
tokens = [t for t in tokens if t not in self.stop_words]
# Lemmatize or stem
if self.lemmatize:
tokens = [self.lemmatizer.lemmatize(t) for t in tokens]
elif self.stem:
tokens = [self.stemmer.stem(t) for t in tokens]
# Filter by length
tokens = [t for t in tokens if len(t) >= self.min_word_length]
# Remove extra whitespace
text = ' '.join(tokens)
text = self.remove_extra_whitespace(text)
return text
# Usage
preprocessor = TextPreprocessor()
text = "I can't believe it's 2024! Check out https://example.com for more info. #NLP @user"
cleaned = preprocessor.preprocess(text)
print(cleaned)
# Output: "believe check info nlp user"
Tokenization in Detail
from nltk.tokenize import (
word_tokenize,
sent_tokenize,
TreebankWordTokenizer,
TweetTokenizer
)
text = "Hello! How are you? I'm doing great :) #happy"
# Word tokenization
print("Word tokens:", word_tokenize(text))
# ['Hello', '!', 'How', 'are', 'you', '?', 'I', "'m", 'doing', 'great', ':', ')', '#', 'happy']
# Sentence tokenization
print("Sentences:", sent_tokenize(text))
# ['Hello!', 'How are you?', "I'm doing great :) #happy"]
# Tweet tokenizer (preserves hashtags, mentions, emoticons)
tweet_tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
print("Tweet tokens:", tweet_tokenizer.tokenize(text))
# ['hello', '!', 'how', 'are', 'you', '?', "i'm", 'doing', 'great', ':)', '#happy']
# spaCy tokenization
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("spaCy tokens:", [token.text for token in doc])
Handling Different Languages
# For non-English text
from nltk.corpus import stopwords
# Available languages
print(stopwords.fileids())
# ['arabic', 'danish', 'dutch', 'english', 'french', 'german', ...]
# Spanish preprocessing example
spanish_stops = set(stopwords.words('spanish'))
spanish_text = "El gato está en la casa"
tokens = word_tokenize(spanish_text.lower())
filtered = [t for t in tokens if t not in spanish_stops]
print(filtered) # ['gato', 'casa']  ("está" is also in NLTK's Spanish stopword list)
# Using spaCy for multiple languages
# python -m spacy download es_core_news_sm
# python -m spacy download de_core_news_sm
# python -m spacy download fr_core_news_sm
Stemming vs Lemmatization
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
# Test words
words = ['running', 'runs', 'ran', 'easily', 'fairly', 'studies', 'studying']
# Stemming - crude but fast
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')
print("Porter:", [porter.stem(w) for w in words])
# ['run', 'run', 'ran', 'easili', 'fairli', 'studi', 'studi']
print("Lancaster:", [lancaster.stem(w) for w in words])
# ['run', 'run', 'ran', 'easy', 'fair', 'study', 'study']
print("Snowball:", [snowball.stem(w) for w in words])
# ['run', 'run', 'ran', 'easili', 'fair', 'studi', 'studi']
# Lemmatization - uses vocabulary and morphological analysis
lemmatizer = WordNetLemmatizer()
print("Lemmatizer:", [lemmatizer.lemmatize(w, pos='v') for w in words])
# ['run', 'run', 'run', 'easily', 'fairly', 'study', 'study']
Text Representation
Computers need numerical representations of text. Here are the main approaches.
Bag of Words (BoW)
The simplest approach: count word occurrences.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
corpus = [
"I love machine learning and deep learning",
"Machine learning is a subset of artificial intelligence",
"Deep learning uses neural networks",
"I love artificial intelligence research"
]
# Basic BoW
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# View the vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())
# View the document-term matrix
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df)
Customizing CountVectorizer:
# With various options
vectorizer = CountVectorizer(
max_features=1000, # Keep top 1000 words
min_df=2, # Ignore words appearing in <2 documents
max_df=0.95, # Ignore words appearing in >95% of documents
ngram_range=(1, 2), # Include unigrams and bigrams
stop_words='english', # Remove English stop words
lowercase=True, # Convert to lowercase
token_pattern=r'\b[a-zA-Z]{2,}\b' # Only alphabetic tokens with 2+ chars
)
X = vectorizer.fit_transform(corpus)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Some bigrams: {[w for w in vectorizer.get_feature_names_out() if ' ' in w][:10]}")
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF weights a word by how often it appears in a document, discounted by how common it is across the whole corpus.
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\frac{N}{\text{DF}(t)}$$
Where:
- TF(t, d) = frequency of term t in document d
- DF(t) = number of documents containing term t
- N = total number of documents
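To see the weighting in action, here is a tiny hand computation using the raw definition above (scikit-learn's TfidfVectorizer applies a smoothed IDF and L2-normalizes each document vector, so its values will differ slightly):
import math

N = 4                              # total documents in the corpus
tf_common, df_common = 2, 3        # term appears twice in this document, and in 3 of 4 documents
tf_rare, df_rare = 1, 1            # term appears once, and only in this document

print(tf_common * math.log(N / df_common))  # ≈ 0.58
print(tf_rare * math.log(N / df_rare))      # ≈ 1.39 -> the rarer term gets the larger weight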
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
corpus = [
"Machine learning is amazing for data science",
"Data science requires statistics and programming",
"Machine learning uses algorithms to learn from data",
"Programming is essential for machine learning"
]
# TF-IDF vectorization
tfidf = TfidfVectorizer(
max_features=100,
ngram_range=(1, 2),
sublinear_tf=True # Use log(1 + tf) instead of tf
)
X = tfidf.fit_transform(corpus)
# Get feature names and their IDF values
feature_names = tfidf.get_feature_names_out()
idf_values = tfidf.idf_
# Show terms sorted by IDF (higher = more unique)
idf_df = pd.DataFrame({'term': feature_names, 'idf': idf_values})
print(idf_df.sort_values('idf', ascending=False).head(10))
# Get top terms for each document
def get_top_terms(doc_idx, n=5):
scores = X[doc_idx].toarray().flatten()
top_indices = scores.argsort()[-n:][::-1]
return [(feature_names[i], scores[i]) for i in top_indices if scores[i] > 0]
for i, doc in enumerate(corpus):
print(f"\nDocument {i}: {doc[:50]}...")
print(f"Top terms: {get_top_terms(i)}")
N-grams
Capture word sequences to preserve some context.
from sklearn.feature_extraction.text import CountVectorizer
text = ["The quick brown fox jumps over the lazy dog"]
# Unigrams only
unigram_vec = CountVectorizer(ngram_range=(1, 1))
unigrams = unigram_vec.fit_transform(text)
print("Unigrams:", unigram_vec.get_feature_names_out())
# Bigrams only
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigrams = bigram_vec.fit_transform(text)
print("Bigrams:", bigram_vec.get_feature_names_out())
# Unigrams + Bigrams + Trigrams
ngram_vec = CountVectorizer(ngram_range=(1, 3))
ngrams = ngram_vec.fit_transform(text)
print("All n-grams:", ngram_vec.get_feature_names_out())
# Character n-grams (useful for spelling correction, language detection)
char_vec = CountVectorizer(analyzer='char', ngram_range=(2, 4))
char_ngrams = char_vec.fit_transform(text)
print("Char n-grams sample:", char_vec.get_feature_names_out()[:20])
Word Embeddings
Word embeddings represent words as dense vectors that capture semantic meaning.
Word2Vec
Word2Vec learns embeddings by predicting words from context (CBOW) or context from words (Skip-gram).
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import gensim.downloader as api
# Train your own Word2Vec
sentences = [
['machine', 'learning', 'is', 'amazing'],
['deep', 'learning', 'uses', 'neural', 'networks'],
['natural', 'language', 'processing', 'is', 'nlp'],
['python', 'is', 'great', 'for', 'machine', 'learning'],
['tensorflow', 'and', 'pytorch', 'are', 'deep', 'learning', 'frameworks']
]
model = Word2Vec(
sentences,
vector_size=100, # Embedding dimension
window=5, # Context window size
min_count=1, # Minimum word frequency
workers=4, # Number of threads
sg=1, # Skip-gram (0 for CBOW)
epochs=100 # Training iterations
)
# Save and load
model.save("word2vec.model")
model = Word2Vec.load("word2vec.model")
# Use pre-trained Google News vectors (300-dimensional vectors for about 3 million words and phrases)
# Warning: This downloads ~1.5GB
# google_model = api.load('word2vec-google-news-300')
# Use smaller pre-trained model
glove_model = api.load('glove-wiki-gigaword-100')
# Find similar words
print(glove_model.most_similar('king', topn=5))
# Word arithmetic: king - man + woman = ?
result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'])
print(f"king - man + woman = {result[0][0]}") # queen
# Calculate similarity
print(f"Similarity(cat, dog): {glove_model.similarity('cat', 'dog'):.4f}")
print(f"Similarity(cat, car): {glove_model.similarity('cat', 'car'):.4f}")
# Get word vector
vector = glove_model['computer']
print(f"Vector shape: {vector.shape}")
Document Embeddings
Create embeddings for entire documents.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Prepare documents
documents = [
"Machine learning is a subset of artificial intelligence",
"Deep learning uses neural networks with multiple layers",
"Natural language processing enables computers to understand text",
"Computer vision allows machines to interpret images"
]
# Tag documents
tagged_docs = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(documents)]
# Train Doc2Vec
model = Doc2Vec(
tagged_docs,
vector_size=50,
window=2,
min_count=1,
workers=4,
epochs=100
)
# Get document vector
doc_vector = model.dv[0]
print(f"Document vector shape: {doc_vector.shape}")
# Find similar documents
similar_docs = model.dv.most_similar(0)
print(f"Most similar to doc 0: {similar_docs}")
# Infer vector for new document
new_doc = "Neural networks are used in deep learning"
inferred_vector = model.infer_vector(new_doc.split())
Sentence Embeddings with Sentence Transformers
Modern approach using transformers.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
"Machine learning is fascinating",
"I love artificial intelligence",
"The weather is nice today",
"Deep learning is a type of machine learning"
]
# Generate embeddings
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")
# Calculate similarity matrix
similarity_matrix = cosine_similarity(embeddings)
# Print similarity matrix
import pandas as pd
df = pd.DataFrame(similarity_matrix,
index=[s[:30] for s in sentences],
columns=[s[:30] for s in sentences])
print(df.round(3))
# Semantic search
def semantic_search(query, documents, model, top_k=3):
"""Find most similar documents to a query."""
query_embedding = model.encode([query])
doc_embeddings = model.encode(documents)
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
top_indices = similarities.argsort()[-top_k:][::-1]
return [(documents[i], similarities[i]) for i in top_indices]
query = "How does AI work?"
results = semantic_search(query, sentences, model)
for doc, score in results:
print(f"Score: {score:.4f} - {doc}")
Traditional NLP Tasks
Sentiment Analysis
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
import pandas as pd
# Sample data (in practice, use larger datasets like IMDb, Yelp)
data = {
'text': [
"This product is amazing! Best purchase ever!",
"Terrible quality, complete waste of money",
"It's okay, nothing special but does the job",
"Absolutely love it! Exceeded expectations!",
"Disappointed, it broke after one week",
"Great value for the price, highly recommend",
"Not worth it, poor customer service",
"Perfect! Exactly what I needed",
"Mediocre at best, wouldn't buy again",
"Outstanding product, fast shipping too!"
],
'sentiment': [1, 0, 1, 1, 0, 1, 0, 1, 0, 1] # 1=positive, 0=negative
}
df = pd.DataFrame(data)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
df['text'], df['sentiment'], test_size=0.3, random_state=42
)
# Create pipeline
sentiment_pipeline = Pipeline([
('tfidf', TfidfVectorizer(
max_features=5000,
ngram_range=(1, 2),
min_df=1
)),
('clf', LogisticRegression(
max_iter=1000,
class_weight='balanced'
))
])
# Train
sentiment_pipeline.fit(X_train, y_train)
# Evaluate
predictions = sentiment_pipeline.predict(X_test)
print(classification_report(y_test, predictions, target_names=['Negative', 'Positive']))
# Cross-validation (cv=3: with only 4 negative examples, 5 stratified folds would fail)
cv_scores = cross_val_score(sentiment_pipeline, df['text'], df['sentiment'], cv=3)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
# Predict new text
new_texts = [
"I absolutely hate this product",
"Best thing I've ever bought!"
]
predictions = sentiment_pipeline.predict(new_texts)
probabilities = sentiment_pipeline.predict_proba(new_texts)
for text, pred, prob in zip(new_texts, predictions, probabilities):
sentiment = "Positive" if pred == 1 else "Negative"
confidence = max(prob)
print(f"Text: '{text}'")
print(f"Sentiment: {sentiment} (confidence: {confidence:.4f})")
Text Classification
Multi-class classification for categorizing documents.
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
# Load 20 Newsgroups dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
remove=('headers', 'footers', 'quotes'))
# TF-IDF features
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
X_train = tfidf.fit_transform(newsgroups_train.data)
X_test = tfidf.transform(newsgroups_test.data)
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# Compare multiple classifiers
classifiers = {
'Naive Bayes': MultinomialNB(),
'Linear SVM': LinearSVC(max_iter=10000),
'Random Forest': RandomForestClassifier(n_estimators=100, n_jobs=-1)
}
results = {}
for name, clf in classifiers.items():
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
results[name] = accuracy
print(f"\n{name} Accuracy: {accuracy:.4f}")
# Best classifier detailed report
best_clf_name = max(results, key=results.get)
best_clf = classifiers[best_clf_name]
predictions = best_clf.predict(X_test)
print(f"\n{best_clf_name} Classification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups_train.target_names))
# Feature importance (for SVM)
if isinstance(best_clf, LinearSVC):
feature_names = np.array(tfidf.get_feature_names_out())
for i, category in enumerate(newsgroups_train.target_names):
top_indices = best_clf.coef_[i].argsort()[-10:][::-1]
print(f"\n{category} - Top features: {', '.join(feature_names[top_indices])}")
Topic Modeling with LDA
Discover hidden topics in document collections.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
documents = [
"Machine learning algorithms process data to make predictions",
"Neural networks are inspired by the human brain structure",
"Python is a popular programming language for data science",
"Deep learning requires large amounts of training data",
"Artificial intelligence is transforming many industries",
"Data preprocessing is crucial for machine learning success",
"JavaScript is widely used for web development",
"Natural language processing enables text understanding",
"Cloud computing provides scalable infrastructure",
"Reinforcement learning trains agents through rewards"
]
# Create document-term matrix
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(documents)
# Train LDA
n_topics = 3
lda = LatentDirichletAllocation(
n_components=n_topics,
random_state=42,
max_iter=10,
learning_method='online'
)
lda.fit(doc_term_matrix)
# Display topics
feature_names = vectorizer.get_feature_names_out()
def display_topics(model, feature_names, n_top_words=10):
for topic_idx, topic in enumerate(model.components_):
top_words_idx = topic.argsort()[:-n_top_words-1:-1]
top_words = [feature_names[i] for i in top_words_idx]
print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")
display_topics(lda, feature_names)
# Get topic distribution for a document
doc_topics = lda.transform(doc_term_matrix)
for i, doc in enumerate(documents[:3]):
print(f"\nDocument: '{doc[:50]}...'")
print(f"Topic distribution: {doc_topics[i].round(3)}")
Named Entity Recognition
NER identifies and classifies named entities (people, places, organizations, etc.).
Using spaCy
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
text = """
Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in
Cupertino, California in 1976. The company's first product was the Apple I
computer. Today, Apple is worth over $2 trillion and employs more than
150,000 people worldwide. Their headquarters, Apple Park, opened in April 2017.
"""
doc = nlp(text)
# Extract entities
print("Named Entities:")
for ent in doc.ents:
print(f" {ent.text:20} | {ent.label_:10} | {spacy.explain(ent.label_)}")
# Visualize entities
# displacy.serve(doc, style="ent")  # starts a local web server (use in standalone scripts)
html = displacy.render(doc, style="ent", page=True)  # returns HTML markup (rendered inline automatically in notebooks)
# Entity statistics
from collections import Counter
entity_types = Counter([ent.label_ for ent in doc.ents])
print("\nEntity type distribution:")
for entity_type, count in entity_types.most_common():
print(f" {entity_type}: {count}")
# Custom entity extraction
def extract_entities_by_type(text, entity_types):
"""Extract specific entity types from text."""
doc = nlp(text)
entities = {}
for ent_type in entity_types:
entities[ent_type] = [ent.text for ent in doc.ents if ent.label_ == ent_type]
return entities
entities = extract_entities_by_type(text, ['ORG', 'PERSON', 'GPE', 'DATE', 'MONEY'])
for ent_type, ents in entities.items():
print(f"{ent_type}: {ents}")
Training Custom NER
import spacy
from spacy.training import Example
import random
# Create blank model
nlp = spacy.blank("en")
# Add NER pipeline
ner = nlp.add_pipe("ner")
# Training data - annotated examples (character offsets; entity spans must not overlap)
TRAIN_DATA = [
    ("iPhone 15 is the latest Apple smartphone", {"entities": [(0, 9, "PRODUCT"), (24, 29, "ORG")]}),
    ("Tesla Model S has great range", {"entities": [(0, 13, "PRODUCT")]}),
    ("Microsoft Office 365 is popular", {"entities": [(0, 20, "PRODUCT")]}),
    ("Samsung Galaxy phone is affordable", {"entities": [(0, 14, "PRODUCT")]}),
]
# Add entity labels
for _, annotations in TRAIN_DATA:
for ent in annotations.get("entities"):
ner.add_label(ent[2])
# Training (spaCy v3: initialize() replaces the deprecated begin_training())
optimizer = nlp.initialize()
for iteration in range(30):
random.shuffle(TRAIN_DATA)
losses = {}
for text, annotations in TRAIN_DATA:
doc = nlp.make_doc(text)
example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
print(f"Iteration {iteration}, Losses: {losses}")
# Test
test_text = "The new Google Pixel is competing with iPhone 14"
doc = nlp(test_text)
print("\nTest results:")
for ent in doc.ents:
print(f" {ent.text}: {ent.label_}")
Modern NLP with Transformers
Transformers have revolutionized NLP with their attention mechanism.
Understanding Transformers
from transformers import (
AutoTokenizer,
AutoModel,
AutoModelForSequenceClassification,
pipeline
)
import torch
# Load pre-trained BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Tokenize text
text = "Natural language processing is fascinating!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print("Tokenized input:")
print(f" Input IDs: {inputs['input_ids']}")
print(f" Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")
print(f" Attention mask: {inputs['attention_mask']}")
# Get embeddings
with torch.no_grad():
outputs = model(**inputs)
# Last hidden state (contextual embeddings)
last_hidden_state = outputs.last_hidden_state
print(f"\nLast hidden state shape: {last_hidden_state.shape}")
# Shape: [batch_size, sequence_length, hidden_size]
# Pooler output (CLS token representation)
pooler_output = outputs.pooler_output
print(f"Pooler output shape: {pooler_output.shape}")
# Shape: [batch_size, hidden_size]
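The contextual embeddings above come from stacked layers of self-attention. As a conceptual sketch only (this is not how the transformers library exposes its internals), here is scaled dot-product attention run on random tensors:
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V"""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # [batch, seq, seq]
    weights = F.softmax(scores, dim=-1)                # how strongly each token attends to the others
    return weights @ V, weights

# Random example: batch of 1, sequence of 5 tokens, 64-dimensional vectors
Q = torch.randn(1, 5, 64)
K = torch.randn(1, 5, 64)
V = torch.randn(1, 5, 64)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # torch.Size([1, 5, 64]) torch.Size([1, 5, 5])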
Hugging Face Pipelines
Easy-to-use interfaces for common NLP tasks.
from transformers import pipeline
# Sentiment Analysis
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love using transformers for NLP!")
print(f"Sentiment: {result}")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Named Entity Recognition
ner_pipeline = pipeline("ner", aggregation_strategy="simple")
result = ner_pipeline("Apple CEO Tim Cook announced new products in California")
print(f"NER: {result}")
# Question Answering
qa_pipeline = pipeline("question-answering")
context = """
Transformers were introduced in the paper "Attention Is All You Need" by
Vaswani et al. in 2017. They have since become the foundation for many
state-of-the-art NLP models including BERT, GPT, and T5.
"""
question = "When were transformers introduced?"
result = qa_pipeline(question=question, context=context)
print(f"Answer: {result['answer']} (confidence: {result['score']:.4f})")
# Text Summarization
summarizer = pipeline("summarization")
long_text = """
Machine learning is a subset of artificial intelligence that provides systems
the ability to automatically learn and improve from experience without being
explicitly programmed. Machine learning focuses on the development of computer
programs that can access data and use it to learn for themselves. The process
begins with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better decisions
in the future based on the examples that we provide.
"""
summary = summarizer(long_text, max_length=50, min_length=25)
print(f"Summary: {summary[0]['summary_text']}")
# Text Generation
generator = pipeline("text-generation", model="gpt2")
prompt = "The future of artificial intelligence is"
result = generator(prompt, max_length=50, num_return_sequences=1)
print(f"Generated: {result[0]['generated_text']}")
# Zero-Shot Classification
classifier = pipeline("zero-shot-classification")
text = "I just got back from an amazing trip to Paris!"
candidate_labels = ["travel", "cooking", "technology", "sports"]
result = classifier(text, candidate_labels)
print(f"Classification: {list(zip(result['labels'], result['scores']))}")
# Translation
translator = pipeline("translation_en_to_fr")
result = translator("Hello, how are you?")
print(f"Translation: {result[0]['translation_text']}")
Fine-Tuning BERT for Classification
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
Trainer
)
from datasets import load_dataset, Dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
# Load dataset
dataset = load_dataset("imdb")
# Subset for demonstration
train_data = dataset["train"].shuffle(seed=42).select(range(1000))
test_data = dataset["test"].shuffle(seed=42).select(range(200))
# Load tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Tokenize dataset
def tokenize_function(examples):
return tokenizer(
examples["text"],
padding="max_length",
truncation=True,
max_length=256
)
tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_test = test_data.map(tokenize_function, batched=True)
# Define metrics
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return {
"accuracy": accuracy_score(labels, predictions),
"f1": f1_score(labels, predictions, average="weighted")
}
# Training arguments
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
warmup_steps=100,
weight_decay=0.01,
logging_dir="./logs",
logging_steps=10,
evaluation_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
)
# Initialize Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_train,
eval_dataset=tokenized_test,
compute_metrics=compute_metrics,
)
# Train
trainer.train()
# Evaluate
results = trainer.evaluate()
print(f"Final Results: {results}")
# Save model and tokenizer (the tokenizer is needed to reload the directory with pipeline below)
trainer.save_model("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
# Inference with fine-tuned model
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model="./fine_tuned_model")
print(classifier("This movie was absolutely fantastic!"))
Sequence-to-Sequence Tasks
Machine Translation
from transformers import MarianMTModel, MarianTokenizer
# English to French translation
model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
def translate(text, tokenizer, model):
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
return tokenizer.decode(translated[0], skip_special_tokens=True)
english_texts = [
"Hello, how are you?",
"Machine learning is fascinating.",
"The weather is beautiful today."
]
for text in english_texts:
translation = translate(text, tokenizer, model)
print(f"EN: {text}")
print(f"FR: {translation}\n")
Text Summarization
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
# Using T5 for summarization
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def summarize(text, max_length=150, min_length=40):
inputs = tokenizer.encode(
"summarize: " + text,
return_tensors="pt",
max_length=512,
truncation=True
)
summary_ids = model.generate(
inputs,
max_length=max_length,
min_length=min_length,
length_penalty=2.0,
num_beams=4,
early_stopping=True
)
return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
article = """
The COVID-19 pandemic has fundamentally changed how we work, with remote work
becoming the norm for many industries. Companies that once required in-person
attendance have adapted to distributed teams, using video conferencing and
collaboration tools. While some businesses are returning to offices, many have
adopted hybrid models that offer flexibility. Studies show that remote workers
often report higher productivity and job satisfaction, though challenges remain
around team collaboration and work-life boundaries. The shift has also impacted
commercial real estate and urban planning, as businesses reconsider their space
needs and employees move away from city centers.
"""
summary = summarize(article)
print(f"Original length: {len(article.split())} words")
print(f"Summary: {summary}")
print(f"Summary length: {len(summary.split())} words")
Question Generation
from transformers import T5Tokenizer, T5ForConditionalGeneration
model_name = "valhalla/t5-small-qg-hl"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
def generate_questions(context, answer):
    """Generate a question whose answer is the given span in the context."""
    # The qg-hl models expect the answer highlighted in place with <hl> tokens
    input_text = "generate question: " + context.replace(answer, f"<hl> {answer} <hl>", 1)
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(
inputs,
max_length=64,
num_beams=4,
early_stopping=True
)
return tokenizer.decode(outputs[0], skip_special_tokens=True)
context = "Paris is the capital and most populous city of France."
answer = "Paris"
question = generate_questions(context, answer)
print(f"Context: {context}")
print(f"Answer: {answer}")
print(f"Generated Question: {question}")
Practical Applications
Building a Chatbot
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
class SimpleChatbot:
def __init__(self, model_name="microsoft/DialoGPT-medium"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name)
self.chat_history_ids = None
def respond(self, user_input):
# Encode user input
new_input_ids = self.tokenizer.encode(
user_input + self.tokenizer.eos_token,
return_tensors='pt'
)
# Append to chat history
if self.chat_history_ids is not None:
bot_input_ids = torch.cat([self.chat_history_ids, new_input_ids], dim=-1)
else:
bot_input_ids = new_input_ids
# Generate response
self.chat_history_ids = self.model.generate(
bot_input_ids,
max_length=1000,
pad_token_id=self.tokenizer.eos_token_id,
no_repeat_ngram_size=3,
do_sample=True,
top_k=50,
top_p=0.95,
temperature=0.7
)
# Decode response
response = self.tokenizer.decode(
self.chat_history_ids[:, bot_input_ids.shape[-1]:][0],
skip_special_tokens=True
)
return response
def reset(self):
self.chat_history_ids = None
# Usage
chatbot = SimpleChatbot()
print("Chatbot: Hi! I'm a simple chatbot. Type 'quit' to exit.")
while True:
user_input = input("You: ")
if user_input.lower() == 'quit':
break
response = chatbot.respond(user_input)
print(f"Bot: {response}")
Document Similarity Search
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
class DocumentSearchEngine:
def __init__(self, model_name='all-MiniLM-L6-v2'):
self.model = SentenceTransformer(model_name)
self.index = None
self.documents = []
def index_documents(self, documents):
"""Index documents for fast similarity search."""
self.documents = documents
# Generate embeddings
embeddings = self.model.encode(documents, show_progress_bar=True)
embeddings = np.array(embeddings).astype('float32')
# Normalize for cosine similarity
faiss.normalize_L2(embeddings)
# Create FAISS index
dimension = embeddings.shape[1]
self.index = faiss.IndexFlatIP(dimension) # Inner product for cosine similarity
self.index.add(embeddings)
print(f"Indexed {len(documents)} documents")
def search(self, query, k=5):
"""Search for most similar documents."""
# Generate query embedding
query_embedding = self.model.encode([query])
query_embedding = np.array(query_embedding).astype('float32')
faiss.normalize_L2(query_embedding)
# Search
scores, indices = self.index.search(query_embedding, k)
results = []
for score, idx in zip(scores[0], indices[0]):
results.append({
'document': self.documents[idx],
'score': float(score),
'index': int(idx)
})
return results
# Usage
documents = [
"Machine learning is a subset of artificial intelligence",
"Deep learning uses neural networks with multiple layers",
"Natural language processing enables computers to understand text",
"Computer vision allows machines to interpret images",
"Reinforcement learning trains agents through rewards and penalties",
"Transfer learning applies knowledge from one task to another",
"Python is the most popular language for data science",
"TensorFlow and PyTorch are popular deep learning frameworks"
]
search_engine = DocumentSearchEngine()
search_engine.index_documents(documents)
query = "How do machines learn from data?"
results = search_engine.search(query, k=3)
print(f"Query: '{query}'\n")
print("Results:")
for r in results:
print(f" Score: {r['score']:.4f} | {r['document']}")
Text Extraction from PDFs and Images
# PDF text extraction
import fitz # PyMuPDF
def extract_text_from_pdf(pdf_path):
"""Extract text from PDF file."""
doc = fitz.open(pdf_path)
text = ""
for page in doc:
text += page.get_text()
return text
# OCR for images (requires tesseract)
import pytesseract
from PIL import Image
def extract_text_from_image(image_path):
"""Extract text from image using OCR."""
image = Image.open(image_path)
text = pytesseract.image_to_string(image)
return text
# Process extracted text with NLP
def analyze_document(text):
"""Analyze extracted text."""
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
analysis = {
'entities': [(ent.text, ent.label_) for ent in doc.ents],
'sentences': len(list(doc.sents)),
'words': len([token for token in doc if not token.is_punct]),
'noun_phrases': [chunk.text for chunk in doc.noun_chunks][:10]
}
return analysis
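One possible way to tie these helpers together (the file names below are placeholders for your own documents):
# Hypothetical usage: replace the paths with real files on your machine
pdf_text = extract_text_from_pdf("report.pdf")
image_text = extract_text_from_image("scanned_page.png")

for name, text in [("PDF", pdf_text), ("Image", image_text)]:
    analysis = analyze_document(text)
    print(f"{name}: {analysis['sentences']} sentences, {analysis['words']} words")
    print(f"  Entities: {analysis['entities'][:5]}")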
Best Practices and Tips
Preprocessing Guidelines
- Always explore your data first

# Check for common issues
def analyze_text_data(texts):
    import statistics
    lengths = [len(t.split()) for t in texts]
    return {
        'num_texts': len(texts),
        'avg_length': statistics.mean(lengths),
        'min_length': min(lengths),
        'max_length': max(lengths),
        'empty_texts': sum(1 for t in texts if not t.strip())
    }

- Handle class imbalance

from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight

# Calculate class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
class_weight_dict = dict(zip(np.unique(y), class_weights))

- Use appropriate evaluation metrics

from sklearn.metrics import classification_report, confusion_matrix

# For imbalanced data, use F1, precision, recall
print(classification_report(y_true, y_pred))
Model Selection Guide
| Task | Simple Baseline | Better Performance | State-of-the-Art |
|---|---|---|---|
| Text Classification | TF-IDF + LogReg | TF-IDF + SVM | Fine-tuned BERT |
| Sentiment Analysis | Bag of Words + NB | TF-IDF + LogReg | Transformer |
| NER | Rule-based | CRF | Fine-tuned BERT |
| Summarization | Extractive (TextRank) | Seq2Seq | T5, BART |
| Translation | Statistical MT | Seq2Seq + Attention | Transformer |
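The extractive baselines in the table select the most informative sentences from the original text rather than generating new ones. TextRank itself builds a sentence-similarity graph; the sketch below is a simpler stand-in that scores sentences by their average TF-IDF weight, assuming the NLTK and scikit-learn setup from earlier sections:
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, n_sentences=2):
    """Pick the n sentences with the highest mean TF-IDF weight (a simple extractive baseline)."""
    sentences = sent_tokenize(text)
    if len(sentences) <= n_sentences:
        return text
    tfidf = TfidfVectorizer(stop_words='english')
    scores = tfidf.fit_transform(sentences).mean(axis=1).A1  # mean weight per sentence
    top = sorted(np.argsort(scores)[-n_sentences:])          # keep original sentence order
    return ' '.join(sentences[i] for i in top)

sample = (
    "Machine learning systems learn patterns from data. "
    "They are used in search, translation, and recommendation. "
    "Most modern systems rely on neural networks. "
    "Good evaluation still requires held-out test data."
)
print(extractive_summary(sample))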
Performance Optimization
# Batch processing for transformers
from transformers import pipeline
# Use batching for efficiency
classifier = pipeline("sentiment-analysis", device=0) # Use GPU if available
texts = ["text1", "text2", "text3", ...]
results = classifier(texts, batch_size=32)
# Use half precision for faster GPU inference
import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
torch_dtype=torch.float16 # Half precision
)
# Or apply dynamic quantization (int8) for CPU inference
model_quantized = torch.quantization.quantize_dynamic(
model, {torch.nn.Linear}, dtype=torch.qint8
)
Further Reading
Books
- “Speech and Language Processing” - Jurafsky & Martin
- “Natural Language Processing with Python” - Bird, Klein, Loper
- “Transformers for Natural Language Processing” - Denis Rothman
Online Resources
- Hugging Face Course - Free NLP course with transformers
- Stanford CS224N - NLP with Deep Learning
- spaCy Course - Free interactive course
Research Papers
- “Attention Is All You Need” - Vaswani et al. (2017)
- “BERT: Pre-training of Deep Bidirectional Transformers” - Devlin et al. (2018)
- “Language Models are Few-Shot Learners” (GPT-3) - Brown et al. (2020)
Libraries Reference
| Library | Purpose | Documentation |
|---|---|---|
| NLTK | Classic NLP toolkit | nltk.org |
| spaCy | Production NLP | spacy.io |
| Hugging Face | Transformers | huggingface.co |
| Gensim | Topic modeling | radimrehurek.com/gensim |
| Sentence Transformers | Embeddings | sbert.net |