RepoFlow Team · May 18, 2025

Mirror the Entire PyPI Repository with Bash

Create a local, self-contained PyPI repository for air-gapped networks and secure environments.

Mirroring the entire PyPI repository can be essential for organizations with strict security requirements or air-gapped networks that need a complete, self-contained copy of PyPI. This approach can also be useful for enterprises that require local access to all available Python packages without relying on an external internet connection.

Why Mirror PyPI?

Mirroring a package repository like PyPI can be beneficial for the following reasons:

Air-Gapped Networks: For secure environments where internet access is restricted or completely unavailable.
Regulatory Compliance: Some organizations need complete control over their software supply chain for compliance purposes.
Disaster Recovery: Ensures packages are always available, even if the external repository goes down.

Prerequisites

Before you start, make sure you have the following installed:

Bash: Usually pre-installed on most Linux distributions and macOS. For Windows, you can use the Windows Subsystem for Linux (WSL) or a tool like Git Bash or Cygwin.
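wget: Used by the script to download the package files.
curl: Used by the script to fetch the PyPI index pages.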
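
A quick way to confirm the tools are present before a long run is sketched below. The function name `check_prerequisites` is hypothetical and not part of the original script. The example list also includes `awk`, `sed`, and `tput`, which the script calls alongside `curl` and `wget`.

```shell
# Hypothetical helper: report any required commands missing from PATH.
# Returns 0 if all are present, 1 otherwise.
check_prerequisites() {
    local missing=0
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Missing required command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: the tools the mirroring script relies on
# check_prerequisites curl wget awk sed tput || exit 1
```

Running the check at the top of the script lets it fail immediately instead of partway through a multi-day download.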

Understanding the Script

This Bash script mirrors the entire PyPI repository to a local directory. It fetches the PyPI package index, extracts the names of all available packages, and then downloads every file listed for each package, which covers every available version.
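
To make the crawling step concrete, the sketch below runs the same awk and sed pipeline the script uses, but against a small hard-coded HTML sample instead of the live index. The sample markup and the function name `extract_package_names` are illustrative assumptions; the real page at https://pypi.org/simple/ is far larger.

```shell
# Hypothetical helper: read simple-index HTML on stdin and print one
# package name per line.
extract_package_names() {
    awk -F '"' '/href="/ {print $2}' | sed -e 's|/simple/||g' -e 's|/$||'
}

# Illustrative sample shaped like the PyPI simple index
sample_index='<!DOCTYPE html>
<html>
  <body>
    <a href="/simple/requests/">requests</a>
    <a href="/simple/numpy/">numpy</a>
  </body>
</html>'

# Prints "requests" and "numpy", one per line
printf '%s\n' "$sample_index" | extract_package_names
```

The pipeline relies on `href` being the first quoted attribute on each link line, so it is best treated as a lightweight scraper rather than a full HTML parser.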

Consider the Storage Requirements

Keep in mind that this process can require a significant amount of storage, depending on the number of packages and versions you choose to mirror. As of May 2025, PyPI hosts over 4 million packages, totaling around 27.6 TB of data. Be sure you have sufficient storage capacity before starting.
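
One way to guard against running out of space is to check the target filesystem before starting. The sketch below is a minimal example; the helper names `available_gib` and `has_enough_space` and the 30000 GiB threshold are assumptions chosen for illustration, not values from the original script.

```shell
# Hypothetical helper: print free space, in whole GiB, on the filesystem
# that holds the given directory.
available_gib() {
    # -P gives POSIX single-line output; -k reports 1024-byte blocks
    df -Pk "$1" | awk 'NR == 2 {print int($4 / 1024 / 1024)}'
}

# Hypothetical helper: succeed only if at least $2 GiB are free under $1.
has_enough_space() {
    local free_gib
    free_gib=$(available_gib "$1")
    [ -n "$free_gib" ] && [ "$free_gib" -ge "$2" ]
}

# Example: require an assumed 30000 GiB of headroom before mirroring
# has_enough_space . 30000 || { echo "Not enough disk space" >&2; exit 1; }
```

Because PyPI keeps growing, a threshold comfortably above the current total is safer than one that only just fits.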

The Script

Here is a Bash script to mirror the entire PyPI repository:

#!/bin/bash

# Create the mirror directory
mkdir -p ./pypi_mirror

# Log file to track last mirrored package
LOG_FILE="./pypi_mirror/index.log"

# Get the list of all package names (strip "/simple/")
packages=($(curl -s https://pypi.org/simple/ | awk -F '"' '/href="/ {print $2}' | sed 's|/simple/||g' | sed 's|/$||'))

# Get the total number of packages
total_packages=${#packages[@]}
start_time=$SECONDS

echo "Total packages to download: $total_packages"
echo ""

# Read last completed package from log
if [[ -f "$LOG_FILE" ]]; then
    last_package=$(tail -n 1 "$LOG_FILE")
    echo "Resuming from package: $last_package"
    skip=true
else
    last_package=""
    skip=false
fi

# Loop through each package and download all available versions
for i in "${!packages[@]}"; do
    package="${packages[$i]}"

    # Skip previously completed packages
    if [[ "$skip" == true ]]; then
        if [[ "$package" == "$last_package" ]]; then
            skip=false  # Found the last completed package, start from the next one
        fi
        continue
    fi

    # Update progress
    progress=$(( (i + 1) * 100 / total_packages ))
    elapsed_time=$(( SECONDS - start_time ))
    remaining_pkgs=$(( total_packages - i - 1 ))

    # Multiply before dividing so integer division does not truncate the ETA to 0.
    # Note: after a resume, skipped packages count toward (i + 1), so the ETA reads low.
    eta=$(( elapsed_time * remaining_pkgs / (i + 1) ))

    # Progress bar settings
    bar_length=40
    filled_length=$(( bar_length * (i + 1) / total_packages ))

    # Construct progress bar: filled and empty parts always total bar_length
    printf -v bar '%*s' "$filled_length" ''
    bar="${bar// /█}"
    printf -v empty_bar '%*s' "$(( bar_length - filled_length ))" ''

    # Print progress dynamically
    tput sc
    echo -ne "Progress: [$bar$empty_bar] $progress% | Elapsed: ${elapsed_time}s | ETA: ${eta}s | Downloading: $package\r"
    tput rc

    # Create a directory for the package
    mkdir -p "./pypi_mirror/$package"

    # Get the list of package files from PyPI
    package_page=$(curl -s "https://pypi.org/simple/$package/")

    # Extract all file URLs
    urls=$(echo "$package_page" | awk -F '"' '/href="https/ {print $2}')

    if [[ -z "$urls" ]]; then
        continue  # Skip if no files found
    fi

    # Download each file (silent mode to keep terminal clean)
    for url in $urls; do
        # Strip the hash fragment (for example "#sha256=...") to get the plain file URL
        cleaned_url="${url%%#*}"
        file_name="./pypi_mirror/$package/$(basename "$cleaned_url")"

        # Check if the file already exists and is not empty
        if [[ -s "$file_name" ]]; then
            echo "Skipping already downloaded file: $file_name"
            continue
        fi

        # Report failures instead of ignoring them silently
        if ! wget -q -P "./pypi_mirror/$package/" "$cleaned_url"; then
            echo "Failed to download: $cleaned_url" >&2
        fi
    done

    # Log the completed package
    echo "$package" >> "$LOG_FILE"
done

# Final message
echo -e "\n\n🎉 PyPI mirroring complete! All $total_packages packages downloaded."
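
The script discards everything after `#` in each file URL before computing the file name. On PyPI's simple pages that fragment typically carries the file's SHA-256 digest, so it could also be used to check a download's integrity. The sketch below assumes the fragment has the form `#sha256=<hex digest>`; the helper names `sha256_of` and `verify_download` are hypothetical and not part of the original script.

```shell
# Hypothetical helper: print the SHA-256 hex digest of a file, using
# whichever common tool is installed.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{print $1}'
    else
        shasum -a 256 "$1" | awk '{print $1}'
    fi
}

# Hypothetical helper: check a downloaded file against the "#sha256=..."
# fragment of the URL it came from. Returns 0 on match, 1 on mismatch,
# and 2 if the URL carries no sha256 fragment.
verify_download() {
    local url="$1" file="$2" expected actual
    case "$url" in
        *"#sha256="*) ;;
        *) return 2 ;;
    esac
    expected="${url##*"#sha256="}"
    actual=$(sha256_of "$file")
    [ "$actual" = "$expected" ]
}

# Example, inside the download loop (uses the original, unstripped $url):
# verify_download "$url" "$file_name" || echo "Checksum problem: $file_name" >&2
```

A file that fails the check could be deleted so that the next run downloads it again, since the script skips any file that already exists and is not empty.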

Key Features of the Script

Resumable Downloads: The script can resume from the last completed package if interrupted.
Progress Bar: Real-time progress bar to track the download status.
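
The resume behavior can be illustrated in isolation. The sketch below mirrors the script's skip logic as a standalone function; the name `packages_after` and the sample package list are assumptions for illustration only.

```shell
# Hypothetical helper: given the name of the last completed package and
# a list of package names, print only the names that come after it.
# With an empty first argument, every name is printed.
packages_after() {
    local last="$1"
    shift
    local skip=false
    if [ -n "$last" ]; then
        skip=true
    fi
    local pkg
    for pkg in "$@"; do
        if [ "$skip" = true ]; then
            # Found the last completed package: resume with the next one
            if [ "$pkg" = "$last" ]; then
                skip=false
            fi
            continue
        fi
        echo "$pkg"
    done
}

# Prints "requests" and "urllib3", one per line
packages_after "numpy" attrs numpy requests urllib3
```

One consequence of this approach is that if the logged package no longer appears in the index, nothing after it is selected, so the log file may need to be edited or removed in that case.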

Alternative Methods

bandersnatch: A dedicated PyPI mirroring client, itself distributed as a package on PyPI. More details are available on the bandersnatch project page on PyPI.

Final Thoughts

This is a simple example script to demonstrate how mirroring PyPI can be achieved. Feel free to modify it based on your specific needs, whether that's optimizing for speed, adding error handling, or integrating it with your existing infrastructure.
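
As one example of added error handling, transient network failures can be retried a few times before giving up. The sketch below is a minimal, generic wrapper; the function name `retry` and the attempt count and delay in the usage line are illustrative assumptions, not part of the original script.

```shell
# Hypothetical helper: run a command up to $1 times, pausing $2 seconds
# between failed attempts. Returns 0 on the first success, 1 otherwise.
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $attempt of $max_attempts failed: $*" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Example: wrap the script's wget call (3 attempts, 5 seconds apart)
# retry 3 5 wget -q -P "./pypi_mirror/$package/" "$cleaned_url"
```

The same wrapper could be applied to the `curl` calls that fetch the index pages, since a failed index request otherwise causes a package to be skipped silently.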

Happy mirroring!
