Introduction
A while back, I needed a certain dataset (Excel files) from a particular website. But unfortunately, there was no “download all” button, so I was stuck downloading each file one by one. There were over 60 files in total, which was incredibly tedious.
However, I discovered that with Python I could automatically gather all the links and download the files in bulk. In the end, I skipped the manual work entirely by running a single Python script.
This approach—using Python to gather links automatically and perform a bulk download—is extremely handy when there are many CSV or Excel files linked on a page and you don’t want to download each one manually.
Of course, there might be simpler or more elegant solutions out there that I’m unaware of, but as someone who’s still very new to Python, this was also a great learning experience for me.
With that in mind, as a personal record (and in case it helps someone else), I’d like to share how to use Python to automatically retrieve and save any .csv, .xls, or .xlsx files found on a specified webpage.
Setup and Implementation Steps
First, install the necessary libraries:
pip install requests beautifulsoup4
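If you want to confirm the installation before moving on, a quick import check is enough (this is optional and not part of the script itself):

import requests
import bs4

# Print the installed versions; any output here means the libraries are ready to use
print(requests.__version__)
print(bs4.__version__)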
By using the code below, you can extract all download links for specific file types from the target page and save them all at once.
📄 1. Link Extraction & Download Functions
import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Target web page URL (modify as needed)
BASE_URL = "https://example.com/data"


def get_download_links():
    """Retrieve links to CSV/XLS/XLSX files from the page"""
    response = requests.get(BASE_URL)
    if response.status_code != 200:
        print("Failed to retrieve the page")
        return []

    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for tag in soup.find_all("a", href=True):
        href = tag["href"]
        if href.endswith((".csv", ".xlsx", ".xls")):
            # urljoin resolves relative links against the page URL
            full_url = urljoin(BASE_URL, href)
            links.append(full_url)
    return links


def download_file(url, download_dir):
    """Save the file from the link (append a number if the filename already exists)"""
    base_filename = url.split("/")[-1]
    name, ext = os.path.splitext(base_filename)
    filepath = os.path.join(download_dir, base_filename)

    counter = 1
    while os.path.exists(filepath):
        filepath = os.path.join(download_dir, f"{name}_{counter}{ext}")
        counter += 1

    print(f"Downloading: {url}")
    try:
        response = requests.get(url, stream=True, timeout=10)
        response.raise_for_status()  # treat HTTP errors as failures instead of saving an error page
        with open(filepath, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"Saved: {filepath}")
    except requests.RequestException as e:
        print(f"Error downloading {url}: {e}")

    time.sleep(1)  # Wait a bit to avoid sending too many requests in a short time
🚀 2. Execution Script
def main():
    DOWNLOAD_DIR = "downloads"
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)

    print("Fetching links...")
    links = get_download_links()
    if not links:
        print("No downloadable files found.")
        return

    print(f"Downloading {len(links)} files...")
    for url in links:
        download_file(url, DOWNLOAD_DIR)
    print("All downloads completed.")


if __name__ == "__main__":
    main()
I’ve split the code into two parts here, but there’s no problem combining them into a single script if you prefer. (If you do keep them as separate files, the execution script will also need to import the functions from the first file.)
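As an optional variation (not something the script above requires), some sites respond better when requests reuses one session and sends an explicit User-Agent header. This is just a sketch, and the header string below is a placeholder I made up:

import requests

# A minimal sketch: share one Session across all downloads and set a User-Agent.
# The User-Agent value is a placeholder, not something any particular site requires.
session = requests.Session()
session.headers.update({"User-Agent": "bulk-download-script/0.1"})

def fetch(url, timeout=10):
    """Fetch a URL through the shared session and raise on HTTP errors."""
    response = session.get(url, stream=True, timeout=timeout)
    response.raise_for_status()
    return response

With this in place, get_download_links() and download_file() could call fetch(...) instead of requests.get(...).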
Tip: About File Extensions
In this example, I focused on CSV and Excel files (.csv, .xlsx, .xls). However, you can freely add or change which extensions are targeted. If you also want to download .pdf or .zip files, for instance, you could modify the check as follows:
if href.endswith((".csv", ".xlsx", ".xls", ".pdf", ".zip")):
    ...
Simply adding the relevant extensions in this check will let you handle different file types, so feel free to customize it to suit your needs.
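If you customize it further, one thing to watch is that a plain endswith() check is case-sensitive and trips over links with query strings (a hypothetical data.CSV?download=1, for example). A possible tweak, purely as a sketch:

import os
from urllib.parse import urlparse

# Extensions to download; edit this tuple to suit your needs
TARGET_EXTENSIONS = (".csv", ".xlsx", ".xls", ".pdf", ".zip")

def is_target_link(href):
    """Return True if the link's path ends with one of the target extensions."""
    path = urlparse(href).path                  # drop any ?query or #fragment part
    ext = os.path.splitext(path)[1].lower()     # normalize the extension to lowercase
    return ext in TARGET_EXTENSIONS

In get_download_links(), the endswith(...) condition would then become if is_target_link(href):.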
Conclusion
I’ve kept the explanation concise, but in most cases this code should work perfectly fine as is.
I personally used it when I needed to download a large number of files from a government-operated open data site that lacked a “bulk download” feature. It might be a bit of a niche scenario, but I hope it helps anyone who finds themselves in a similar situation.
Thank you for reading!