RepoFlow Team · May 18, 2025
Mirror the Entire PyPI Repository with Bash
Create a local, self-contained PyPI repository for air-gapped networks and secure environments.
Mirroring the entire PyPI repository can be essential for organizations with strict security requirements or air-gapped networks that need a complete, self-contained copy of PyPI. This approach can also be useful for enterprises that require local access to all available Python packages without relying on an external internet connection.
Why Mirror PyPI?
Mirroring a package repository like PyPI can be beneficial for the following reasons:
- Air-Gapped Networks: For secure environments where internet access is restricted or completely unavailable.
- Regulatory Compliance: Some organizations need complete control over their software supply chain for compliance purposes.
- Disaster Recovery: Ensures packages are always available, even if the external repository goes down.
Prerequisites
Before you start, make sure you have the following installed:
- Bash: Usually pre-installed on most Linux distributions and macOS. For Windows, you can use the Windows Subsystem for Linux (WSL) or a tool like Git Bash or Cygwin.
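- wget: Used by the script to download the package files.
- curl: Used by the script to fetch the package index pages.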
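A quick way to confirm the prerequisites is to check that each command is on the PATH before starting a long-running mirror. The following is a minimal sketch; `check_dependencies` is a hypothetical helper name and is not part of the script in this article.

```shell
#!/bin/bash
# check_dependencies (hypothetical helper): report every command in the
# argument list that cannot be found on the PATH.
check_dependencies() {
    local missing=0
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Missing required command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: verify the tools listed above
if check_dependencies bash wget curl; then
    echo "All prerequisites found."
else
    echo "Install the missing tools before running the mirror script."
fi
```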
Understanding the Script
This Bash script is designed to mirror the entire PyPI repository to a local directory. It crawls the PyPI package index, retrieves the list of all available packages, and then downloads every available version of each package. This approach creates a local, self-contained copy of PyPI, which can be particularly useful for air-gapped networks or organizations with strict security requirements.
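To illustrate the crawling step, here is a small sketch of the name-extraction pipeline run against a hard-coded sample instead of the live index. The sample HTML is an assumption about the shape of the page at https://pypi.org/simple/, and `extract_package_names` is a hypothetical helper name.

```shell
#!/bin/bash
# extract_package_names (hypothetical helper): read Simple index HTML on
# stdin and print one package name per line.
extract_package_names() {
    awk -F '"' '/href="/ {print $2}' | sed 's|/simple/||g' | sed 's|/$||'
}

# Sample input shaped like a fragment of the Simple index (assumed)
sample_index='<a href="/simple/requests/">requests</a>
<a href="/simple/numpy/">numpy</a>'

# Prints the two package names, one per line
printf '%s\n' "$sample_index" | extract_package_names
```

Testing the pipeline on a small sample like this makes it easier to notice if the index format changes before committing to a full crawl.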
Consider the Storage Requirements
Keep in mind that this process can require a significant amount of storage, depending on the number of packages and versions you choose to mirror. At the time of writing, PyPI hosts over 4 million packages, totaling around 27.6 TB of data, and both figures keep growing. Be sure you have sufficient storage capacity, with headroom for growth, before starting.
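One way to guard against running out of space is to compare the free space on the target filesystem with a threshold before starting. This is a rough sketch: `available_kb` and `has_free_space` are hypothetical helper names, and the threshold is an approximation derived from the 27.6 TB figure above.

```shell
#!/bin/bash
# available_kb (hypothetical helper): print the free space, in 1 KiB blocks,
# of the filesystem that holds the given path.
available_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# has_free_space PATH REQUIRED_KB (hypothetical helper): succeed when the
# filesystem holding PATH has at least REQUIRED_KB free.
has_free_space() {
    local avail
    avail=$(available_kb "$1") || return 1
    [[ -n "$avail" ]] || return 1
    (( avail >= $2 ))
}

# Roughly 28 TB expressed in 1 KiB blocks (an approximation, adjust as needed)
required_kb=$(( 28 * 1000 * 1000 * 1000 ))

if has_free_space . "$required_kb"; then
    echo "Enough free space for a full mirror."
else
    echo "Not enough free space for a full mirror in the current directory."
fi
```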
Here is a Bash script to mirror the entire PyPI repository:
#!/bin/bash

# Create the mirror directory
mkdir -p ./pypi_mirror

# Log file to track last mirrored package
LOG_FILE="./pypi_mirror/index.log"

# Get the list of all package names (strip "/simple/" and the trailing slash)
packages=($(curl -s https://pypi.org/simple/ | awk -F '"' '/href="/ {print $2}' | sed 's|/simple/||g' | sed 's|/$||'))

# Get the total number of packages and stop if the index could not be read
total_packages=${#packages[@]}
if [[ $total_packages -eq 0 ]]; then
    echo "Could not retrieve the package list from PyPI." >&2
    exit 1
fi
start_time=$SECONDS
echo "Total packages to download: $total_packages"
echo ""

# Read last completed package from log (a missing or empty log means a fresh start)
if [[ -s "$LOG_FILE" ]]; then
    last_package=$(tail -n 1 "$LOG_FILE")
    echo "Resuming from package: $last_package"
    skip=true
else
    last_package=""
    skip=false
fi

# Loop through each package and download all available versions
for i in "${!packages[@]}"; do
    package="${packages[$i]}"

    # Skip previously completed packages
    if [[ "$skip" == true ]]; then
        if [[ "$package" == "$last_package" ]]; then
            skip=false  # Found the last completed package, start from the next one
        fi
        continue
    fi

    # Update progress
    progress=$(( (i + 1) * 100 / total_packages ))
    elapsed_time=$(( SECONDS - start_time ))
    remaining_pkgs=$(( total_packages - i - 1 ))
    # Multiply before dividing so integer division does not round the average down to 0
    eta=$(( elapsed_time * remaining_pkgs / (i + 1) ))

    # Progress bar settings
    bar_length=40
    filled_length=$(( bar_length * (i + 1) / total_packages ))

    # Construct progress bar: filled blocks first, then spaces up to bar_length.
    # Building the bar in a loop avoids cutting the multi-byte block character in half.
    bar=""
    for (( j = 0; j < filled_length; j++ )); do bar+="█"; done
    empty_bar=$(printf "%*s" "$(( bar_length - filled_length ))" "")

    # Print progress dynamically
    tput sc
    echo -ne "Progress: [$bar$empty_bar] $progress% | Elapsed: ${elapsed_time}s | ETA: ${eta}s | Downloading: $package\r"
    tput rc

    # Create a directory for the package
    mkdir -p "./pypi_mirror/$package"

    # Get the list of package files from PyPI
    package_page=$(curl -s "https://pypi.org/simple/$package/")

    # Extract all file URLs
    urls=$(echo "$package_page" | awk -F '"' '/href="https/ {print $2}')
    if [[ -z "$urls" ]]; then
        continue  # Skip if no files found
    fi

    # Download each file (silent mode to keep terminal clean)
    for url in $urls; do
        # Drop the "#sha256=..." fragment so it is not part of the file name or request
        cleaned_url="${url%%#*}"
        file_name="./pypi_mirror/$package/$(basename "$cleaned_url")"

        # Check if the file already exists and is not empty
        if [[ -s "$file_name" ]]; then
            echo "Skipping already downloaded file: $file_name"
            continue
        fi

        # Report failures instead of ignoring them silently
        wget -q -P "./pypi_mirror/$package/" "$cleaned_url" || echo "Failed to download: $cleaned_url" >&2
    done

    # Log the completed package
    echo "$package" >> "$LOG_FILE"
done

# Final message
echo -e "\n\n🎉 PyPI mirroring complete! All $total_packages packages downloaded."
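The script discards the `#sha256=...` fragment that PyPI appends to each file link. That fragment can also be used to check a downloaded file for corruption. The following is a minimal sketch of such a check, not part of the script itself; `verify_sha256` is a hypothetical helper name, and it assumes either `sha256sum` or `shasum` is installed.

```shell
#!/bin/bash
# verify_sha256 FILE URL (hypothetical helper)
# Returns 0 if FILE matches the sha256 fragment in URL, 1 on a mismatch,
# and 2 if the URL carries no sha256 fragment.
verify_sha256() {
    local file="$1" url="$2"
    local expected actual

    # Nothing to compare against without a fragment
    [[ "$url" == *"#sha256="* ]] || return 2
    expected="${url##*"#sha256="}"

    # sha256sum is common on Linux, shasum on macOS
    if command -v sha256sum >/dev/null 2>&1; then
        actual=$(sha256sum "$file" | awk '{print $1}')
    else
        actual=$(shasum -a 256 "$file" | awk '{print $1}')
    fi

    [[ "$actual" == "$expected" ]]
}

# Possible use inside the download loop, after wget finishes:
# verify_sha256 "$file_name" "$url" || echo "Checksum problem: $file_name" >&2
```

Running a check like this right after each download lets a corrupted or truncated file be removed and fetched again, rather than being skipped as "already downloaded" on the next run.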
Key Features of the Script
- Resumable Downloads: The script can resume from the last completed package if interrupted.
- Progress Bar: Real-time progress bar to track the download status.
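The resume behavior can be easier to follow in isolation. This sketch reproduces the skip logic on a short, made-up package list; `packages_to_process` is a hypothetical helper name used only for illustration.

```shell
#!/bin/bash
# packages_to_process LAST_PACKAGE NAME... (hypothetical helper)
# Print the names that come after LAST_PACKAGE in the list. An empty
# LAST_PACKAGE means nothing was completed yet, so every name is printed.
packages_to_process() {
    local last="$1"
    shift
    local skip=true
    local pkg
    if [[ -z "$last" ]]; then
        skip=false
    fi
    for pkg in "$@"; do
        if [[ "$skip" == true ]]; then
            if [[ "$pkg" == "$last" ]]; then
                skip=false  # Resume with the next name
            fi
            continue
        fi
        echo "$pkg"
    done
}

# "numpy" was the last completed package, so only the names after it are printed
packages_to_process "numpy" "flask" "numpy" "requests" "scipy"
```

Note that if the logged name no longer appears in the list, nothing is printed, so a renamed or removed package in the log would cause a run to skip everything.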
Alternative Methods
- bandersnatch: A PyPI mirroring client maintained by the Python Packaging Authority (PyPA). It is published as a package on PyPI, and more details are available on its PyPI project page.
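For comparison, a bandersnatch setup might look roughly like the sketch below. The option names and values are assumptions based on bandersnatch's commonly documented configuration, and the directory name is made up, so check them against the bandersnatch documentation for your version before relying on them.

```shell
#!/bin/bash
# Write a minimal bandersnatch configuration file.
# All values are illustrative assumptions, not recommendations.
cat > ./bandersnatch.conf <<'EOF'
[mirror]
directory = ./bandersnatch_mirror
master = https://pypi.org
workers = 3
timeout = 10
stop-on-error = false
hash-index = false
EOF

# Then, on a machine with Python and network access:
# python3 -m pip install bandersnatch
# bandersnatch --config ./bandersnatch.conf mirror
```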
Final Thoughts
This is a simple example script to demonstrate how mirroring PyPI can be achieved. Feel free to modify it based on your specific needs, whether that's optimizing for speed, adding error handling, or integrating it with your existing infrastructure.
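As one example of added error handling, a small retry wrapper can make individual downloads more tolerant of transient network failures. This is a sketch under simple assumptions (a fixed delay and a fixed attempt limit); `retry` is a hypothetical helper name.

```shell
#!/bin/bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...] (hypothetical helper)
# Run COMMAND until it succeeds or MAX_ATTEMPTS is reached, sleeping
# DELAY_SECONDS between attempts.
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    until "$@"; do
        if (( attempt >= max_attempts )); then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}

# Possible use in the script's download loop:
# retry 3 5 wget -q -P "./pypi_mirror/$package/" "$cleaned_url"
```

A fixed delay keeps the example short. For a crawl of this size, a growing delay between attempts would be gentler on the remote server.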
Happy mirroring!