Rearchitecting Hugging Face Uploads and Downloads

The Hugging Face Hub team is undertaking a significant redesign of its upload and download infrastructure to better handle the growing demands of machine learning model and dataset storage.
Current infrastructure overview: Hugging Face currently stores files in Amazon S3 in us-east-1 and serves them through AWS CloudFront as a CDN, a setup that struggles with large file transfers and offers little room for optimization.
- CloudFront’s 50GB file size limit forces large models like Meta-Llama-3-70B (131GB of weights) to be split across multiple smaller files
- The current setup lacks advanced deduplication and compression capabilities
- Recent analysis revealed 8.2 million upload requests and 130.8 TB of data transferred from 88 countries in a single day
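Taken at face value, those figures work out to an average of roughly 16 MB per upload request (130.8 TB / 8.2 million requests) and a sustained ingest rate of about 1.5 GB/s over the day.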
Proposed architectural changes: A new content-addressed store (CAS) will serve as the primary point for content distribution, implementing a custom protocol focused on “dumb reads and smart writes.”
- The read path emphasizes simplicity and speed, with requests routed through CAS for reconstruction information
- The write path operates at the chunk level, speeding up uploads by transferring only chunks the store does not already hold (see the sketch after this list)
- The system maintains S3 as backing storage while adding enhanced security and validation capabilities
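To make the read/write split concrete, below is a minimal Python sketch of both paths. It is illustrative only: the `cas` client and its methods (`missing_hashes`, `put_chunk`, `put_manifest`, `get_manifest`, `get_chunk`) are hypothetical placeholders rather than a real Hugging Face API, and fixed-size chunking stands in for the content-defined chunking a production CAS would likely use.

```python
import hashlib
from typing import BinaryIO, Iterator, Tuple

CHUNK_SIZE = 64 * 1024  # placeholder; a real CAS would pick boundaries by content

def chunk_file(f: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (sha256-hex, raw bytes) for each fixed-size chunk of the file."""
    while block := f.read(CHUNK_SIZE):
        yield hashlib.sha256(block).hexdigest(), block

def smart_write(f: BinaryIO, cas) -> None:
    """Write path: hash chunks locally, ask the CAS which hashes it is
    missing, and upload only those chunks plus a manifest describing
    how to reassemble the file."""
    chunks = list(chunk_file(f))
    hashes = [h for h, _ in chunks]
    missing = set(cas.missing_hashes(hashes))  # hypothetical endpoint
    for h, block in chunks:
        if h in missing:
            cas.put_chunk(h, block)            # hypothetical endpoint
    cas.put_manifest(hashes)                   # file == ordered chunk hashes

def dumb_read(manifest_id: str, cas, out: BinaryIO) -> None:
    """Read path: fetch reconstruction info (the ordered chunk list),
    then stream each chunk straight from backing storage/CDN."""
    for h in cas.get_manifest(manifest_id):    # hypothetical endpoint
        out.write(cas.get_chunk(h))
```

The asymmetry is the point: readers do almost no work beyond following the manifest, while writers spend CPU hashing locally so that bytes already in the store are never sent over the wire.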
Technical optimizations: The new architecture enables format-specific optimizations and improved efficiency.
- Byte-level file management allows for format-specific compression techniques
- Parquet file deduplication and Safetensors compression could cut upload times by an estimated 10-25% (a compression sketch follows this list)
- Enhanced telemetry provides detailed logging and audit trails for enterprise customers
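As one concrete, hypothetical illustration of byte-level, format-aware handling, the sketch below estimates how compressible a .safetensors file is by compressing each tensor's bytes individually. The header layout (an 8-byte little-endian length followed by a JSON index with per-tensor data_offsets) follows the published safetensors spec; zlib is only a stand-in for whatever codec the Hub might actually adopt, and the 10-25% figure above comes from the team's estimates, not from this sketch.

```python
import json
import struct
import zlib

def safetensors_compression_estimate(path: str) -> float:
    """Return the compressed/raw size ratio of a .safetensors file,
    compressing each tensor's byte range separately (format-aware access)."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]  # 8-byte LE header size
        header = json.loads(f.read(header_len))          # JSON tensor index
        data = f.read()                                  # raw tensor data section

    raw, compressed = 0, 0
    for name, info in header.items():
        if name == "__metadata__":  # optional metadata entry, no tensor bytes
            continue
        start, end = info["data_offsets"]  # offsets relative to data section
        blob = data[start:end]
        raw += len(blob)
        compressed += len(zlib.compress(blob, level=6))  # stand-in codec
    return compressed / raw if raw else 1.0
```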
Global deployment strategy: After careful analysis of traffic patterns, the team has designed a three-region deployment plan.
- Primary regions: us-east-1 (Americas), eu-west-3 (Europe/Middle East/Africa), and ap-southeast-1 (Asia/Oceania); a toy routing sketch follows this list
- Resource allocation: 4 nodes each in US and Europe, 2 nodes in Asia
- The top 7 countries account for 80% of uploaded bytes, while the top 20 contribute 95%
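As a toy sketch of how requests might be steered under this plan, a router could map a client's continent to the nearest CAS region. The mapping and fallback below are illustrative assumptions, not Hugging Face's actual routing logic.

```python
# Illustrative continent -> CAS region table (an assumption, not HF's real table).
REGION_BY_CONTINENT = {
    "NA": "us-east-1",       # North America
    "SA": "us-east-1",       # South America
    "EU": "eu-west-3",       # Europe
    "AF": "eu-west-3",       # Africa (grouped with Europe/Middle East)
    "AS": "ap-southeast-1",  # Asia
    "OC": "ap-southeast-1",  # Oceania
}

def pick_cas_region(continent_code: str) -> str:
    """Route to the nearest region, falling back to us-east-1,
    the first region to be deployed."""
    return REGION_BY_CONTINENT.get(continent_code, "us-east-1")
```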
Implementation timeline: The rollout will proceed gradually throughout 2024.
- Initial deployment begins with a single CAS in us-east-1
- Internal repository migration will serve as a benchmark for transfer performance
- Additional points of presence will be added based on performance testing results
Future implications: This infrastructure overhaul positions Hugging Face to gain unique insights into global AI development trends and patterns.
- The platform hosts one of the largest collections of open-source machine learning data
- Future analysis could reveal geographic trends in different AI modalities
- Deduplication and compression are expected to cut bytes transferred by roughly 12%, helping offset the overhead the extra hop introduces
Strategic considerations: While the new architecture adds some latency for certain users, the gains in security, optimization capability, and scalability make this a calculated trade-off that positions Hugging Face for growth in a rapidly evolving AI infrastructure landscape.