Web Scraping Basics
Web scraping is the process of extracting data from websites automatically. It allows you to collect specific information from web pages and use it for analysis, research, or other purposes. Python is one of the most popular languages for web scraping due to its simplicity and the availability of powerful libraries.
Introduction to Web Scraping
Web scraping involves several steps:
- Sending a request to a website to retrieve the HTML content
- Parsing the HTML to extract the relevant information
- Processing and storing the extracted data
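As a quick preview, here is a minimal sketch of those three steps using the libraries introduced later in this tutorial (the URL and selector are placeholders):
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: send a request to retrieve the HTML content
response = requests.get('https://www.example.com')

# Step 2: parse the HTML to extract the relevant information
soup = BeautifulSoup(response.text, 'html.parser')
headings = [h.get_text(strip=True) for h in soup.find_all('h1')]

# Step 3: process and store the extracted data
pd.DataFrame({'heading': headings}).to_csv('headings.csv', index=False)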
Before starting any web scraping project, it’s important to understand some key concepts and ethical considerations.
Legal and Ethical Considerations
Web scraping operates in a legal gray area in many jurisdictions. Here are some important guidelines to follow:
Check the robots.txt file: Most websites have a robots.txt file (e.g., www.example.com/robots.txt) that specifies which parts of the site should not be accessed by automated tools.
Review the website’s terms of service: Some websites explicitly prohibit scraping in their terms of service.
Be respectful of the server: Don’t overload websites with too many requests in a short time period. Implement delays between requests.
Identify your scraper: Use custom user agents that identify your scraper and possibly include contact information.
Only collect publicly available data: Avoid scraping personal information or data behind login pages.
Important: Web scraping should be used responsibly. Improper scraping can lead to IP bans, legal issues, or damage to the website’s servers. Always use web scraping ethically and respect website owners’ rights.
Required Libraries
For our web scraping project, we’ll use the following libraries:
- Requests: For sending HTTP requests to websites
- Beautiful Soup: For parsing HTML and extracting data
- Pandas: For organizing and exporting the scraped data
Let’s install these libraries:
pip install requests beautifulsoup4 pandas
Basic Web Scraping with Requests and Beautiful Soup
First, let’s import the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
Now, let’s see a basic example of scraping a webpage:
def basic_scrape(url):
"""
Basic function to scrape a webpage and print its title and links.
Args:
url (str): The URL of the webpage to scrape
"""
# Send a GET request to the URL
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})
# Check if the request was successful
if response.status_code == 200:
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Extract and print the title
title = soup.title.text
print(f"Title: {title}")
# Extract and print all links
links = soup.find_all('a')
print(f"\nFound {len(links)} links:")
for i, link in enumerate(links[:5], 1): # Print only first 5 links
href = link.get('href', 'No link')
text = link.text.strip() or 'No text'
print(f"{i}. {text}: {href}")
if len(links) > 5:
print("...")
else:
print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
# Example usage
url = "https://www.example.com"
basic_scrape(url)
This basic example fetches a webpage, parses its HTML, and extracts the title and links. Let’s break down what each part does:
- The requests.get() function sends an HTTP GET request to the specified URL.
- We include a user agent header to identify our scraper.
- We check if the request was successful by examining the status code (200 means success).
- We use Beautiful Soup to parse the HTML content.
- We extract and print the page title and the first few links.
Note: Always include a user agent in your requests to identify your scraper. Some websites block requests without a user agent or with the default Python user agent.
Understanding HTML Structure for Scraping
To effectively scrape a website, you need to understand its HTML structure. Most modern browsers have developer tools that let you inspect the elements of a webpage:
- Right-click on the element you want to scrape and select “Inspect” or “Inspect Element”
- The developer tools will show you the HTML structure
- Look for patterns, classes, IDs, or other attributes that can help you target specific elements
Beautiful Soup provides several methods to navigate and search HTML:
| Method | Description |
|---|---|
| find() | Finds the first matching element |
| find_all() | Finds all matching elements |
| select() | Uses CSS selectors to find elements |
| get_text() | Extracts text from an element |
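To make these methods concrete, here is a small self-contained sketch that parses an HTML snippet defined inline rather than fetched from a real website:
from bs4 import BeautifulSoup

html = """
<div class="card"><h2>First</h2><p class="body">Hello</p></div>
<div class="card"><h2>Second</h2><p class="body">World</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

first_card = soup.find('div', class_='card')           # first matching element
all_cards = soup.find_all('div', class_='card')        # every matching element
bodies = soup.select('div.card p.body')                # CSS selector syntax
print(first_card.find('h2').get_text())                # text inside an element -> "First"
print(len(all_cards), [p.get_text() for p in bodies])  # 2 ['Hello', 'World']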
Here’s an example of targeting specific elements:
def scrape_quotes(url):
"""
Scrape quotes from a quotes website.
Args:
url (str): The URL of the quotes webpage
Returns:
list: A list of dictionaries containing quotes and authors
"""
# Send a GET request to the URL
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})
quotes_data = []
if response.status_code == 200:
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all quote containers
# Note: This is an example. The actual structure would depend on the website.
quote_containers = soup.find_all('div', class_='quote')
for container in quote_containers:
# Extract the quote text
quote_text = container.find('span', class_='text').get_text()
# Extract the author
author = container.find('small', class_='author').get_text()
# Add to our data list
quotes_data.append({
'quote': quote_text,
'author': author
})
print(f"Scraped {len(quotes_data)} quotes")
else:
print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
return quotes_data
# This is a hypothetical example. The actual HTML structure would vary.
This example assumes a specific HTML structure where quotes are in elements with the class “quote”, the text is in a span with class “text”, and the author is in a small element with class “author”. You would need to adjust these selectors based on the actual website you’re scraping.
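If you want to try this pattern on a live page, the practice site https://quotes.toscrape.com is built for scraping exercises and uses a very similar structure; still, confirm the selectors in your browser's developer tools before relying on them. A hedged usage sketch:
# Hypothetical usage against a scraping practice site; verify the selectors first
quotes = scrape_quotes("https://quotes.toscrape.com/")
for q in quotes[:3]:
    print(f"{q['author']}: {q['quote']}")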
Practical Project: Scraping a Products Catalog
Let’s create a more comprehensive example: scraping product information from an e-commerce website. For this example, we’ll use a fictional structure, but the principles apply to real websites:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
def scrape_products(base_url, num_pages=1):
"""
Scrape product information from multiple pages of a fictional e-commerce website.
Args:
base_url (str): The base URL of the website
num_pages (int): Number of pages to scrape
Returns:
pandas.DataFrame: A DataFrame containing the scraped product information
"""
all_products = []
for page in range(1, num_pages + 1):
# Construct the URL for the current page
url = f"{base_url}/products?page={page}"
print(f"Scraping page {page}...")
# Send a GET request to the URL
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})
if response.status_code == 200:
# Parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all product containers
product_containers = soup.find_all('div', class_='product-item')
for container in product_containers:
try:
# Extract product name
name = container.find('h2', class_='product-name').get_text().strip()
# Extract product price
price_elem = container.find('span', class_='price')
price = price_elem.get_text().strip() if price_elem else "N/A"
# Extract product rating
rating_elem = container.find('div', class_='rating')
rating = rating_elem.get('data-rating') if rating_elem else "N/A"
# Extract product availability
availability_elem = container.find('span', class_='availability')
availability = availability_elem.get_text().strip() if availability_elem else "N/A"
# Add to our products list
all_products.append({
'name': name,
'price': price,
'rating': rating,
'availability': availability,
'page': page
})
except Exception as e:
print(f"Error extracting product information: {e}")
print(f"Scraped {len(product_containers)} products from page {page}")
# Be respectful: add a random delay between requests
if page < num_pages:
delay = random.uniform(1, 3)
print(f"Waiting {delay:.2f} seconds before next request...")
time.sleep(delay)
else:
print(f"Failed to retrieve page {page}. Status code: {response.status_code}")
# Convert the list of dictionaries to a DataFrame
df = pd.DataFrame(all_products)
return df
def save_to_csv(dataframe, filename):
"""
Save the DataFrame to a CSV file.
Args:
dataframe (pandas.DataFrame): The DataFrame to save
filename (str): The name of the CSV file
"""
dataframe.to_csv(filename, index=False)
print(f"Data saved to {filename}")
def main():
# Base URL of the fictional e-commerce website
base_url = "https://www.fictional-ecommerce.com"
# Number of pages to scrape
num_pages = 3
# Scrape the products
products_df = scrape_products(base_url, num_pages)
# Display the first few rows
print("\nSample of scraped data:")
print(products_df.head())
# Save to CSV
save_to_csv(products_df, "products_data.csv")
if __name__ == "__main__":
main()
Note: This example is based on a fictional website structure. When adapting it for a real website, you’ll need to:
- Inspect the actual HTML to find the correct CSS selectors
- Handle any site-specific challenges (like pagination or dynamic content)
- Respect the website’s robots.txt file and terms of service
Handling Different Types of Content
Websites contain various types of content. Here’s how to handle some common scenarios:
1. Extracting Tables
Many websites present data in HTML tables. Beautiful Soup makes it easy to extract this information:
def extract_table(url, table_index=0):
"""
Extract a table from a webpage.
Args:
url (str): The URL of the webpage
table_index (int): The index of the table to extract (if multiple tables exist)
Returns:
pandas.DataFrame: A DataFrame containing the table data
"""
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})
if response.status_code == 200:
soup = BeautifulSoup(response.text, 'html.parser')
# Find all tables in the document
tables = soup.find_all('table')
if not tables:
print("No tables found on the page")
return None
if table_index >= len(tables):
print(f"Table index {table_index} is out of range. Only {len(tables)} tables found.")
return None
# Select the desired table
table = tables[table_index]
# Extract headers
headers = []
header_row = table.find('thead')
if header_row:
headers = [th.get_text().strip() for th in header_row.find_all('th')]
# If no header row found, try the first row
if not headers:
first_row = table.find('tr')
if first_row:
headers = [th.get_text().strip() for th in first_row.find_all(['th', 'td'])]
# Extract rows
rows = []
for row in table.find_all('tr')[1:] if headers else table.find_all('tr'):
cells = [td.get_text().strip() for td in row.find_all(['td', 'th'])]
if cells:
rows.append(cells)
# Create DataFrame
if headers and rows:
# Ensure all rows have the same length as headers
rows = [row for row in rows if len(row) == len(headers)]
df = pd.DataFrame(rows, columns=headers)
return df
elif rows:
df = pd.DataFrame(rows)
return df
else:
print("No data found in the table")
return None
else:
print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
return None
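For straightforward tables, pandas can also parse every table on a page directly with pandas.read_html, which returns a list of DataFrames (it needs an HTML parser such as lxml installed). A brief sketch as an alternative to the function above; the user agent string is a placeholder:
from io import StringIO

# Alternative: let pandas locate and parse the tables
response = requests.get(url, headers={'User-Agent': 'MyScraper/1.0 (educational-project)'})
tables = pd.read_html(StringIO(response.text))  # one DataFrame per <table> element
if tables:
    print(tables[0].head())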
2. Extracting Images
You can also scrape image URLs:
def extract_images(url, limit=10):
"""
Extract image URLs from a webpage.
Args:
url (str): The URL of the webpage
limit (int): Maximum number of images to extract
Returns:
list: A list of image URLs
"""
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})
image_urls = []
if response.status_code == 200:
soup = BeautifulSoup(response.text, 'html.parser')
# Find all img tags
img_tags = soup.find_all('img')
for img in img_tags[:limit]:
# Try different attributes where image URL might be stored
img_url = img.get('src') or img.get('data-src') or img.get('data-lazy-src')
if img_url:
# Handle relative URLs
if img_url.startswith('/'):
# Extract domain from the original URL
from urllib.parse import urlparse
parsed_url = urlparse(url)
domain = f"{parsed_url.scheme}://{parsed_url.netloc}"
img_url = domain + img_url
image_urls.append(img_url)
print(f"Extracted {len(image_urls)} image URLs")
else:
print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
return image_urls
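A note on relative URLs: urllib.parse.urljoin handles more cases than manually prepending the domain (for example, protocol-relative //cdn.example.com/... URLs and paths relative to the current page), so you may prefer it. A quick sketch with made-up paths:
from urllib.parse import urljoin

page_url = "https://www.example.com/articles/post.html"
print(urljoin(page_url, "/static/logo.png"))         # same site, absolute path
print(urljoin(page_url, "images/photo.jpg"))         # relative to the current directory
print(urljoin(page_url, "//cdn.example.com/a.png"))  # protocol-relative URL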
3. Handling Pagination
Many websites split their content across multiple pages. Here’s a simple approach to handle pagination:
def scrape_with_pagination(base_url, max_pages=5):
"""
Scrape content across multiple pages.
Args:
base_url (str): The base URL of the website
max_pages (int): Maximum number of pages to scrape
Returns:
list: Combined data from all pages
"""
all_data = []
current_page = 1
while current_page <= max_pages:
# Construct the URL for the current page
if current_page == 1:
url = base_url
else:
url = f"{base_url}/page/{current_page}"
print(f"Scraping page {current_page}: {url}")
# Send a GET request
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})
if response.status_code == 200:
soup = BeautifulSoup(response.text, 'html.parser')
# Find data elements (adjust selectors based on the actual website)
data_elements = soup.find_all('div', class_='data-item')
if not data_elements:
print(f"No data found on page {current_page}. Stopping pagination.")
break
# Extract data from each element
for element in data_elements:
title = element.find('h2').get_text().strip() if element.find('h2') else "N/A"
description = element.find('p').get_text().strip() if element.find('p') else "N/A"
all_data.append({
'title': title,
'description': description,
'page': current_page
})
print(f"Scraped {len(data_elements)} items from page {current_page}")
# Check if there's a next page link
next_page = soup.find('a', class_='next-page')
if not next_page:
print("No next page link found. Stopping pagination.")
break
# Be respectful: add a delay between requests
delay = random.uniform(1, 3)
print(f"Waiting {delay:.2f} seconds before next request...")
time.sleep(delay)
current_page += 1
else:
print(f"Failed to retrieve page {current_page}. Status code: {response.status_code}")
break
return all_data
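A short usage sketch, assuming the placeholder URL and selectors above:
# Hypothetical usage; the base URL and CSS classes are placeholders
data = scrape_with_pagination("https://www.example-listing-site.com", max_pages=3)
df = pd.DataFrame(data)
print(f"Collected {len(df)} items")
if not df.empty:
    print(df['page'].value_counts().sort_index())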
Handling Dynamic Content with Selenium
Some websites load content dynamically using JavaScript. Beautiful Soup and requests alone can’t handle this. For such websites, we need Selenium, a tool that automates browsers.
First, install Selenium:
pip install selenium
You’ll also need a WebDriver for your browser. For Chrome, download the ChromeDriver version that matches your installed Chrome from the ChromeDriver downloads page.
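If you are using Selenium 4.6 or newer, Selenium Manager can usually locate or download a matching driver for you, so the manual ChromeDriver download is often unnecessary. A minimal sketch, assuming a recent Selenium version:
from selenium import webdriver

# With Selenium 4.6+, this typically works without specifying a driver path
driver = webdriver.Chrome()
driver.get("https://www.example.com")
print(driver.title)
driver.quit()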
Here’s a simple example using Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
def scrape_dynamic_content(url, driver_path):
"""
Scrape content from a dynamic website using Selenium.
Args:
url (str): The URL of the webpage
driver_path (str): Path to the ChromeDriver executable
Returns:
list: Scraped data
"""
# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless") # Run in headless mode (no browser window)
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--window-size=1920,1080")
# Set up the driver
service = Service(driver_path)
driver = webdriver.Chrome(service=service, options=chrome_options)
results = []
try:
# Navigate to the page
driver.get(url)
# Wait for the dynamic content to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, "dynamic-item")))
# Additional wait to ensure all content is loaded
time.sleep(2)
# Find all items
items = driver.find_elements(By.CLASS_NAME, "dynamic-item")
for item in items:
# Extract data
title = item.find_element(By.CLASS_NAME, "title").text
description = item.find_element(By.CLASS_NAME, "description").text
results.append({
"title": title,
"description": description
})
print(f"Scraped {len(results)} items")
except Exception as e:
print(f"Error during scraping: {e}")
finally:
# Always close the driver
driver.quit()
return results
# Example usage
url = "https://www.example-dynamic-site.com"
driver_path = "/path/to/chromedriver" # Update with your actual path
data = scrape_dynamic_content(url, driver_path)
# Convert to DataFrame
df = pd.DataFrame(data)
print(df.head())
Important: Selenium interacts with a real browser, which makes it more powerful but also slower than requests. Use it only when necessary, typically when dealing with websites that load content dynamically with JavaScript.
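Some pages also load additional items only as you scroll. With Selenium you can trigger this by scrolling the window via JavaScript. The sketch below uses a fixed pause between scrolls for simplicity; on a real site an explicit wait for new elements is more robust:
import time

def scroll_to_bottom(driver, pause=2, max_scrolls=5):
    """Scroll down repeatedly so lazily loaded content has a chance to appear."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # No new content appeared
        last_height = new_height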
Best Practices for Web Scraping
Respect robots.txt:
import requests
from urllib.robotparser import RobotFileParser

def is_scraping_allowed(url, user_agent="*"):
    """Check if scraping is allowed for the given URL and user agent."""
    parsed_url = requests.utils.urlparse(url)
    robots_url = f"{parsed_url.scheme}://{parsed_url.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()
    return parser.can_fetch(user_agent, url)

# Example usage
url = "https://www.example.com/products"
if is_scraping_allowed(url):
    print("Scraping is allowed!")
else:
    print("Scraping is not allowed by robots.txt")
Add delays between requests:
import time
import random

# Add a random delay between 1 and 3 seconds
delay = random.uniform(1, 3)
time.sleep(delay)
Use proper headers:
headers = {
    'User-Agent': 'YourScrapingProject/1.0 ([email protected])',
    'Accept': 'text/html,application/xhtml+xml,application/xml',
    'Accept-Language': 'en-US,en;q=0.9'
}
response = requests.get(url, headers=headers)
Handle errors gracefully:
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raise an exception for 4XX/5XX status codes
except requests.exceptions.HTTPError as e:
    print(f"HTTP Error: {e}")
except requests.exceptions.ConnectionError as e:
    print(f"Connection Error: {e}")
except requests.exceptions.Timeout as e:
    print(f"Timeout Error: {e}")
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
Use proxies if needed:
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'https://proxy.example.com:8080'
}
response = requests.get(url, proxies=proxies)
Store scraped data properly:
import json

# Save as JSON
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

# Save as CSV
df = pd.DataFrame(data)
df.to_csv('data.csv', index=False, encoding='utf-8')

# Save as Excel
df.to_excel('data.xlsx', index=False)
Complete Example: News Article Scraper
Let’s put everything together in a complete example that scrapes news articles from a fictional news website:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import os
from datetime import datetime
from urllib.robotparser import RobotFileParser
class NewsArticleScraper:
def __init__(self, base_url):
"""
Initialize the news article scraper.
Args:
base_url (str): The base URL of the news website
"""
self.base_url = base_url
self.headers = {
'User-Agent': 'NewsArticleScraper/1.0 (educational-project)',
'Accept': 'text/html,application/xhtml+xml,application/xml',
'Accept-Language': 'en-US,en;q=0.9'
}
# Check if scraping is allowed
if not self._is_scraping_allowed():
print(f"WARNING: Scraping may not be allowed for {base_url} according to robots.txt")
def _is_scraping_allowed(self):
"""Check if scraping is allowed according to robots.txt."""
try:
robots_url = f"{self.base_url}/robots.txt"
parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()
return parser.can_fetch(self.headers['User-Agent'], self.base_url)
except Exception as e:
print(f"Error checking robots.txt: {e}")
return False
def _get_page(self, url):
"""Send a GET request to the URL and return the BeautifulSoup object."""
try:
response = requests.get(url, headers=self.headers, timeout=10)
response.raise_for_status()
return BeautifulSoup(response.text, 'html.parser')
except requests.exceptions.RequestException as e:
print(f"Error retrieving {url}: {e}")
return None
def scrape_article_page(self, article_url):
"""
Scrape data from a single article page.
Args:
article_url (str): The URL of the article
Returns:
dict: Article data or None if scraping failed
"""
soup = self._get_page(article_url)
if not soup:
return None
try:
# Extract article content (adjust selectors based on the actual website)
title = soup.find('h1', class_='article-title').get_text().strip()
date_elem = soup.find('span', class_='article-date')
date = date_elem.get_text().strip() if date_elem else "N/A"
author_elem = soup.find('span', class_='article-author')
author = author_elem.get_text().strip() if author_elem else "N/A"
content_elem = soup.find('div', class_='article-content')
content = content_elem.get_text().strip() if content_elem else "N/A"
# Extract categories
categories = []
category_elems = soup.find_all('a', class_='article-category')
for cat in category_elems:
categories.append(cat.get_text().strip())
return {
'title': title,
'date': date,
'author': author,
'content': content,
'categories': ', '.join(categories),
'url': article_url
}
except Exception as e:
print(f"Error extracting article data from {article_url}: {e}")
return None
def scrape_category_page(self, category, max_pages=2):
"""
Scrape articles from a category page.
Args:
category (str): The category to scrape
max_pages (int): Maximum number of pages to scrape
Returns:
list: List of article URLs
"""
article_urls = []
for page in range(1, max_pages + 1):
if page == 1:
url = f"{self.base_url}/category/{category}"
else:
url = f"{self.base_url}/category/{category}/page/{page}"
print(f"Scraping category page: {url}")
soup = self._get_page(url)
if not soup:
break
# Find article links (adjust selectors based on the actual website)
article_elements = soup.find_all('div', class_='article-preview')
if not article_elements:
print(f"No article previews found on page {page}. Stopping pagination.")
break
for article in article_elements:
link = article.find('a', class_='article-link')
if link and link.get('href'):
article_url = link.get('href')
# Handle relative URLs
if not article_url.startswith('http'):
article_url = self.base_url + article_url
article_urls.append(article_url)
print(f"Found {len(article_elements)} articles on page {page}")
# Check if there's a next page link
next_page = soup.find('a', class_='next-page')
if not next_page:
print("No next page link found. Stopping pagination.")
break
# Be respectful: add a delay between requests
delay = random.uniform(2, 4)
print(f"Waiting {delay:.2f} seconds before next request...")
time.sleep(delay)
return article_urls
def scrape_articles(self, categories, max_pages_per_category=2, max_articles=20):
"""
Scrape articles from multiple categories.
Args:
categories (list): List of categories to scrape
max_pages_per_category (int): Maximum number of pages to scrape per category
max_articles (int): Maximum number of articles to scrape in total
Returns:
pandas.DataFrame: DataFrame containing article data
"""
all_articles = []
total_articles = 0
for category in categories:
print(f"\nScraping category: {category}")
# Get article URLs for this category
article_urls = self.scrape_category_page(category, max_pages_per_category)
# Limit the number of articles to scrape
remaining_articles = max_articles - total_articles
if remaining_articles <= 0:
break
article_urls = article_urls[:remaining_articles]
# Scrape each article
for article_url in article_urls:
print(f"Scraping article: {article_url}")
article_data = self.scrape_article_page(article_url)
if article_data:
article_data['category'] = category
all_articles.append(article_data)
total_articles += 1
# Be respectful: add a delay between requests
delay = random.uniform(2, 4)
print(f"Waiting {delay:.2f} seconds before next request...")
time.sleep(delay)
if total_articles >= max_articles:
print(f"Reached maximum number of articles ({max_articles}). Stopping.")
break
# Convert to DataFrame
if all_articles:
df = pd.DataFrame(all_articles)
return df
else:
print("No articles were scraped.")
return pd.DataFrame()
def save_to_csv(self, dataframe, filename=None):
"""
Save the scraped articles to a CSV file.
Args:
dataframe (pandas.DataFrame): The DataFrame to save
filename (str, optional): The filename to use. If None, generates a filename with the current date.
"""
if dataframe.empty:
print("No data to save.")
return
if filename is None:
current_date = datetime.now().strftime("%Y%m%d")
filename = f"news_articles_{current_date}.csv"
dataframe.to_csv(filename, index=False, encoding='utf-8')
print(f"Saved {len(dataframe)} articles to {filename}")
def save_to_json(self, dataframe, filename=None):
"""
Save the scraped articles to a JSON file.
Args:
dataframe (pandas.DataFrame): The DataFrame to save
filename (str, optional): The filename to use. If None, generates a filename with the current date.
"""
if dataframe.empty:
print("No data to save.")
return
if filename is None:
current_date = datetime.now().strftime("%Y%m%d")
filename = f"news_articles_{current_date}.json"
# Convert DataFrame to a list of dictionaries
articles_list = dataframe.to_dict(orient='records')
with open(filename, 'w', encoding='utf-8') as f:
import json
json.dump(articles_list, f, ensure_ascii=False, indent=4)
print(f"Saved {len(dataframe)} articles to {filename}")
def main():
# Base URL of the fictional news website
base_url = "https://www.fictional-news-site.com"
# Categories to scrape
categories = ["technology", "business", "science"]
# Create the scraper
scraper = NewsArticleScraper(base_url)
# Scrape articles
articles_df = scraper.scrape_articles(
categories=categories,
max_pages_per_category=2,
max_articles=15
)
# Display a summary
if not articles_df.empty:
print("\nSummary of scraped articles:")
print(f"Total articles: {len(articles_df)}")
print("\nArticles per category:")
print(articles_df['category'].value_counts())
print("\nFirst few articles:")
# Display a preview (title and author only)
preview_df = articles_df[['title', 'author', 'date', 'category']]
print(preview_df.head())
# Save the results
scraper.save_to_csv(articles_df)
scraper.save_to_json(articles_df)
print("\nWeb scraping completed.")
if __name__ == "__main__":
main()
Extending the Web Scraper
The basic web scraper we’ve built is already powerful, but there are many ways to extend it:
1. Adding Advanced Analytics
You can analyze the scraped data to extract insights:
def analyze_articles(df):
"""
Perform basic analysis on the scraped articles.
Args:
df (pandas.DataFrame): DataFrame containing article data
"""
if df.empty:
print("No data to analyze.")
return
# Count articles by category
category_counts = df['category'].value_counts()
print("\nArticles by Category:")
print(category_counts)
# Count articles by author
author_counts = df['author'].value_counts().head(10)
print("\nTop 10 Authors:")
print(author_counts)
# Analyze publish dates
if 'date' in df.columns:
try:
# Convert to datetime if possible
df['date_parsed'] = pd.to_datetime(df['date'])
# Count articles by month
month_counts = df['date_parsed'].dt.to_period('M').value_counts().sort_index()
print("\nArticles by Month:")
print(month_counts)
except Exception as e:
print(f"Error parsing dates: {e}")
# Analyze content length
if 'content' in df.columns:
df['content_length'] = df['content'].str.len()
avg_length = df['content_length'].mean()
print(f"\nAverage Article Length: {avg_length:.0f} characters")
# Longest and shortest articles
print("\nLongest Article:")
longest_idx = df['content_length'].idxmax()
print(f"Title: {df.loc[longest_idx, 'title']}")
print(f"Length: {df.loc[longest_idx, 'content_length']} characters")
print("\nShortest Article:")
shortest_idx = df['content_length'].idxmin()
print(f"Title: {df.loc[shortest_idx, 'title']}")
print(f"Length: {df.loc[shortest_idx, 'content_length']} characters")
# Simple word frequency analysis
if 'content' in df.columns:
from collections import Counter
import re
# Combine all content
all_text = ' '.join(df['content'].tolist())
# Remove punctuation and convert to lowercase
all_text = re.sub(r'[^\w\s]', '', all_text.lower())
# Split into words
words = all_text.split()
# Count word frequency
word_counts = Counter(words)
# Remove common stop words (this is a very basic list)
stop_words = {'the', 'a', 'an', 'in', 'to', 'for', 'of', 'and', 'is', 'are', 'on', 'that', 'with', 'this', 'it'}
for word in stop_words:
if word in word_counts:
del word_counts[word]
# Print top 20 words
print("\nTop 20 Words:")
for word, count in word_counts.most_common(20):
print(f"{word}: {count}")
2. Creating Visualizations
You can visualize the scraped data:
def visualize_data(df):
"""
Create visualizations of the scraped data.
Args:
df (pandas.DataFrame): DataFrame containing article data
"""
try:
import matplotlib.pyplot as plt
import seaborn as sns
# Set the style
sns.set(style="whitegrid")
if df.empty:
print("No data to visualize.")
return
# Create a directory for visualizations
os.makedirs('visualizations', exist_ok=True)
# 1. Articles by Category
plt.figure(figsize=(10, 6))
category_counts = df['category'].value_counts()
sns.barplot(x=category_counts.index, y=category_counts.values)
plt.title('Number of Articles by Category')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('visualizations/articles_by_category.png')
plt.close()
# 2. Articles by Author (top 10)
plt.figure(figsize=(12, 6))
author_counts = df['author'].value_counts().head(10)
sns.barplot(x=author_counts.values, y=author_counts.index)
plt.title('Top 10 Authors by Number of Articles')
plt.xlabel('Count')
plt.ylabel('Author')
plt.tight_layout()
plt.savefig('visualizations/top_authors.png')
plt.close()
# 3. Content Length Distribution
if 'content' in df.columns:
df['content_length'] = df['content'].str.len()
plt.figure(figsize=(10, 6))
sns.histplot(df['content_length'], bins=30, kde=True)
plt.title('Distribution of Article Lengths')
plt.xlabel('Length (characters)')
plt.ylabel('Count')
plt.tight_layout()
plt.savefig('visualizations/content_length_distribution.png')
plt.close()
# 4. Content Length by Category
plt.figure(figsize=(12, 6))
sns.boxplot(x='category', y='content_length', data=df)
plt.title('Article Length by Category')
plt.xlabel('Category')
plt.ylabel('Length (characters)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('visualizations/content_length_by_category.png')
plt.close()
# 5. Articles Over Time (if date is available)
if 'date' in df.columns:
try:
# Convert to datetime if possible
df['date_parsed'] = pd.to_datetime(df['date'])
# Group by month
articles_by_month = df.groupby(df['date_parsed'].dt.to_period('M')).size()
plt.figure(figsize=(12, 6))
articles_by_month.plot(kind='line', marker='o')
plt.title('Number of Articles Published Over Time')
plt.xlabel('Month')
plt.ylabel('Count')
plt.tight_layout()
plt.savefig('visualizations/articles_over_time.png')
plt.close()
except Exception as e:
print(f"Error creating time-based visualization: {e}")
print("Visualizations saved to the 'visualizations' directory.")
except ImportError:
print("Matplotlib and/or seaborn not installed. Install with:")
print("pip install matplotlib seaborn")
3. Sentiment Analysis on Articles
You can analyze the sentiment of article content:
def analyze_sentiment(df):
"""
Perform sentiment analysis on article content.
Args:
df (pandas.DataFrame): DataFrame containing article data
Returns:
pandas.DataFrame: The input DataFrame with sentiment columns added
"""
try:
from textblob import TextBlob
if df.empty or 'content' not in df.columns:
print("No content to analyze.")
return df
# Create a copy to avoid modifying the original
result_df = df.copy()
# Add sentiment columns
result_df['polarity'] = 0.0
result_df['subjectivity'] = 0.0
result_df['sentiment'] = 'neutral'
for idx, row in result_df.iterrows():
if pd.isna(row['content']) or row['content'] == "N/A":
continue
# Analyze sentiment with TextBlob
blob = TextBlob(row['content'])
polarity = blob.sentiment.polarity
subjectivity = blob.sentiment.subjectivity
# Store results
result_df.at[idx, 'polarity'] = polarity
result_df.at[idx, 'subjectivity'] = subjectivity
# Categorize sentiment
if polarity > 0.1:
result_df.at[idx, 'sentiment'] = 'positive'
elif polarity < -0.1:
result_df.at[idx, 'sentiment'] = 'negative'
else:
result_df.at[idx, 'sentiment'] = 'neutral'
# Print summary
sentiment_counts = result_df['sentiment'].value_counts()
print("\nSentiment Analysis:")
print(sentiment_counts)
print("\nAverage Polarity:", result_df['polarity'].mean())
print("Average Subjectivity:", result_df['subjectivity'].mean())
# Sentiment by category
print("\nSentiment by Category:")
sentiment_by_category = pd.crosstab(result_df['category'], result_df['sentiment'])
print(sentiment_by_category)
return result_df
except ImportError:
print("TextBlob not installed. Install with:")
print("pip install textblob")
print("Then run: python -m textblob.download_corpora")
return df
Ethical Considerations and Challenges
When scraping websites, you might encounter various challenges:
1. CAPTCHA and Anti-Scraping Mechanisms
Many websites employ mechanisms to prevent scraping:
- CAPTCHA: These are tests designed to determine if the user is human.
- IP Blocking: Websites may block IPs that send too many requests.
- User-Agent Checking: Some sites block requests with bot-like user agents.
Strategies to handle these challenges:
# Rotate user agents
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
]
headers = {
'User-Agent': random.choice(user_agents)
}
# Use proxy rotation
proxies = [
{'http': 'http://proxy1.example.com:8080', 'https': 'https://proxy1.example.com:8080'},
{'http': 'http://proxy2.example.com:8080', 'https': 'https://proxy2.example.com:8080'}
]
proxy = random.choice(proxies)
response = requests.get(url, headers=headers, proxies=proxy)
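Another common defensive pattern is to back off and retry when a server signals rate limiting (HTTP 429) or a transient error. Here is a minimal sketch using plain requests; the retry counts and status codes are illustrative choices, not a prescription:
import time
import random
import requests

def get_with_backoff(url, headers=None, max_retries=3, base_delay=2):
    """Retry a GET request with exponential backoff on rate limiting or server errors."""
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code not in (429, 500, 502, 503):
            return response
        # Exponential backoff with a little random jitter
        wait = base_delay * (2 ** attempt) + random.uniform(0, 1)
        print(f"Got {response.status_code}, retrying in {wait:.1f} seconds...")
        time.sleep(wait)
    return response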
2. Terms of Service Violations
Always review a website’s Terms of Service before scraping:
- Some explicitly forbid scraping
- Others have specific requirements for automated access
- Some limit the rate or volume of requests
3. Legal Issues
Web scraping exists in a legal gray area:
- The Computer Fraud and Abuse Act (CFAA) in the US has been used in cases against scrapers
- The EU’s GDPR may apply if you’re collecting personal data
- Copyright laws may restrict how you can use the scraped content
Important: This tutorial is for educational purposes only. Always ensure your web scraping activities comply with relevant laws, regulations, and website terms of service.
Conclusion
Web scraping is a powerful skill that allows you to collect data from various websites. In this tutorial, we’ve covered:
- Basic web scraping with Requests and Beautiful Soup
- Handling different types of content (tables, images, pagination)
- Dealing with dynamic content using Selenium
- Best practices for ethical and responsible scraping
- A complete project for scraping news articles
- Extensions for analyzing and visualizing the scraped data
While web scraping opens up many possibilities, remember to:
- Only scrape publicly available data
- Respect websites’ terms of service and robots.txt
- Implement rate limiting to avoid overwhelming servers
- Use the data ethically and responsibly
With these skills and principles in mind, you’re now ready to scrape data from websites and use it for your Python projects!
Exercises
Exercise 1: Create a simple web scraper that extracts the title, URL, and post date of the 10 most recent posts from a blog of your choice. Save the data to a CSV file.
Exercise 2: Build a web scraper that extracts product information (name, price, rating) from an e-commerce website’s search results page. Implement pagination to get results from multiple pages.
Exercise 3: Develop a weather data scraper that collects the forecast for the next 5 days for a specific city from a weather website. Use the data to create a simple visualization.
Hint for Exercise 1: First inspect the HTML structure of the blog to find the appropriate CSS selectors for post titles, URLs, and dates. Use Beautiful Soup’s find_all()
method to locate all post elements, then extract the required information from each. Remember to handle relative URLs properly.
# Exercise 1 solution outline
import requests
from bs4 import BeautifulSoup
import pandas as pd
# URL of the blog
blog_url = "https://example-blog.com"
# Send request and get the page
response = requests.get(blog_url)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all post elements (adjust selector based on the actual blog)
posts = soup.find_all('article', class_='post')[:10] # Get first 10 posts
# Extract data from each post
data = []
for post in posts:
# Extract title, URL and date (adjust selectors)
title = post.find('h2').get_text().strip()
url = post.find('a', class_='post-link')['href']
date = post.find('span', class_='post-date').get_text().strip()
# Handle relative URLs
if not url.startswith('http'):
url = blog_url + url
data.append({
'title': title,
'url': url,
'date': date
})
# Save to CSV
df = pd.DataFrame(data)
df.to_csv('blog_posts.csv', index=False)
print("Data saved to blog_posts.csv")