Database

Our database is composed of 17 tables (see the UML diagram for the main components). The following table shows how many records each database table contains.

(All data on this page is from June 2024.)

Table name                      Number of records
author                          570,492
commits                         7,803,628
commit_parents                  10,951
conflicting_files_discussion    121,744
dataset                         149,828
datasets_in_space               3,381
discussion                      273,191
discussion_event                518,924
files_in_commit                 21,055,405
model                           681,682
models_in_space                 183,120
modified_file                   21,259,405
repo_file                       63,039,567
repository                      1,088,879
space                           257,342
tag                             66,921
tags_in_repo                    5,111,538
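
The record counts above can be recomputed against a local copy of the database. The following is a minimal sketch, not the project's own tooling: it assumes a Python DB-API 2.0 connection (for MariaDB this could come from a driver such as PyMySQL or MariaDB Connector/Python, neither of which is mentioned on this page). The helper name count_records is hypothetical, and the demo uses an in-memory SQLite database with a made-up single-column tag table purely as a stand-in so the code runs without a server.

```python
import re
import sqlite3

# Table names as listed in the record-count table above.
TABLES = [
    "author", "commits", "commit_parents", "conflicting_files_discussion",
    "dataset", "datasets_in_space", "discussion", "discussion_event",
    "files_in_commit", "model", "models_in_space", "modified_file",
    "repo_file", "repository", "space", "tag", "tags_in_repo",
]

# Table names cannot be bound as query parameters, so validate them
# before interpolating them into the SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def count_records(conn, tables):
    """Return {table_name: row_count} using any DB-API 2.0 connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        if not _IDENTIFIER.match(table):
            raise ValueError(f"unexpected table name: {table!r}")
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    cur.close()
    return counts


def demo():
    """Stand-in demo: SQLite instead of MariaDB, hypothetical 'name' column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tag (name TEXT)")
    conn.executemany("INSERT INTO tag VALUES (?)", [("nlp",), ("vision",)])
    result = count_records(conn, ["tag"])
    conn.close()
    return result
```

Against a restored dump, one would pass the MariaDB connection and the full TABLES list instead of the stand-in. Counts will only match the table above for the June 2024 release.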
ER Diagram

From the UML diagram, we defined the corresponding database schema. In the database, the Repository inheritance is mapped using the concrete table inheritance method, resulting in three tables: model, dataset, and repository (space does not have its own class because it does not contain specific information). Attribute and table names are derived from the attribute names used in the Hugging Face Hub library.

Download

We deployed our database on a MariaDB server and offer it as a compressed dump file. Note: Due to the high volume of data introduced in v1.1, we only populated the modified_file table for models. The HFC extractor can also be used to enrich the latest HFC dump. After the June 2024 release, we will no longer populate the modified_file table due to size restrictions.
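
Restoring a dump into a local MariaDB server could look roughly like the sketch below. Several things here are assumptions rather than facts from this page: that the dump is gzip-compressed (the compression format is not stated), that the mariadb command-line client is installed and on the PATH, and that the target database already exists. The function names, the file name in the usage note, and the database name are hypothetical.

```python
import gzip
import shutil
import subprocess


def build_import_command(database, user, host="localhost", client="mariadb"):
    """Build the argv for a client that reads SQL statements from stdin.

    '--password' without a value makes the client prompt for the password
    instead of exposing it on the command line.
    """
    return [client, f"--host={host}", f"--user={user}", "--password", database]


def import_dump(dump_path, database, user, host="localhost", client="mariadb"):
    """Stream a gzip-compressed SQL dump into the client; return its exit code.

    Assumes gzip compression; a different archive format would need a
    different opener than gzip.open.
    """
    cmd = build_import_command(database, user, host=host, client=client)
    with gzip.open(dump_path, "rb") as dump:
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
        # Copy in chunks so the whole dump is never held in memory.
        shutil.copyfileobj(dump, proc.stdin)
        proc.stdin.close()
        return proc.wait()
```

A call such as import_dump("hfc_dump.sql.gz", "hfc", "alice") would then prompt for the password and load the dump. Given the table sizes listed above, the import may take a long time and a substantial amount of disk space.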

Date Download link
October 2024
June 2024 (a)
October 2023
September 2023
August 2023
July 2023
June 2023
May 2023
April 2023
March 2023
November 2022
= v1.1
= v1.0
(a) This release contains the populated modified_file table.