Script docs

Script that retrieves Hugging Face Hub (HFH) and Git data, and creates and populates a MariaDB database.

databaseImport.check_database_schema(cursor, database)

Auxiliary function that checks whether all the required tables are present in the database.

Parameters:
  • cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

  • database (str) – MariaDB database name.
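
A minimal sketch of how such a check could look, assuming the required tables are the ones named elsewhere in this document and that the cursor follows the usual execute/fetchall interface. The helper name missing_tables and the REQUIRED_TABLES set are illustrative, not taken from the script.

```python
# Assumption: the required tables are the ones mentioned in this document;
# the real script may check a different set.
REQUIRED_TABLES = {
    "repository", "model", "dataset", "space", "author",
    "commits", "modified_file", "files_in_commit",
    "discussion", "discussion_event", "conflicting_files_discussion",
    "repo_file", "tag", "tags_in_repo",
    "models_in_space", "datasets_in_space",
}


def missing_tables(cursor, database):
    """Return the required tables that are absent from `database` (hypothetical helper)."""
    cursor.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = %s",
        (database,),
    )
    existing = {row[0] for row in cursor.fetchall()}
    return sorted(REQUIRED_TABLES - existing)
```

An empty result would mean the schema is complete; any returned names are the tables that still have to be created.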

databaseImport.create_connection_mysql()

Configures a connection to a MySQL/MariaDB database. The database configuration file must be located in the path as a JSON file called hfc.config.

Returns:

Tuple containing the MySQL connector to the database specified in the configuration file and the database name.

Return type:

(connection.MySQLConnection, str)
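
As a rough illustration of the returned tuple, the sketch below builds a connection from an already-parsed configuration dictionary. The key names (host, port, user, password, database) and the helper name connect_from_config are assumptions, since the layout of hfc.config is not documented here; the connect parameter exists only so the function can be exercised without a running server.

```python
def connect_from_config(config, connect=None):
    """Return (connection, database_name) built from a config dict.

    Hypothetical helper: the config keys used below are assumed, not
    taken from the real hfc.config.
    """
    if connect is None:
        # Imported lazily so the sketch loads without the driver installed.
        import mysql.connector  # from the mysql-connector-python package
        connect = mysql.connector.connect
    conn = connect(
        host=config["host"],
        port=config.get("port", 3306),  # 3306 is the default MySQL/MariaDB port
        user=config["user"],
        password=config["password"],
        database=config["database"],
    )
    return conn, config["database"]
```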

databaseImport.create_schema_mysql(cursor)

Creates the database schema, following this ER diagram.

Parameters:

cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

databaseImport.eprint(*args, **kwargs)

Auxiliary function to print errors to stderr.
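
A function with this signature is commonly written as a thin wrapper around print. The sketch below shows that usual idiom and is not necessarily the script's exact body.

```python
import sys


def eprint(*args, **kwargs):
    """Print to stderr, accepting the same arguments as print()."""
    print(*args, file=sys.stderr, **kwargs)
```

For example, eprint("Missing table:", name) writes the message to stderr and leaves stdout free for regular output.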

databaseImport.main(argv)

Main method. Called when invoking the script.

databaseImport.onerror(func, path, exc_info)

Error handler for shutil.rmtree. Intended for Windows usage (e.g., Access Denied).

If the error is due to an access error (read only file) it attempts to add write permission and then retries.

If the error is for another reason it re-raises the error.

Usage: shutil.rmtree(path, onerror=onerror)
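
The behaviour described above matches a widely used recipe, sketched here under the assumption that the handler receives the (type, value, traceback) triple that shutil.rmtree passes to onerror. Note that Python 3.12 deprecates the onerror argument in favour of onexc, so this form applies to the older interface.

```python
import os
import stat


def onerror(func, path, exc_info):
    """Retry a failed removal after adding write permission."""
    if not os.access(path, os.W_OK):
        # Read-only file: grant write permission, then retry the operation
        # that failed (func is os.remove, os.rmdir, etc.).
        os.chmod(path, stat.S_IWUSR)
        func(path)
    else:
        # Any other cause: re-raise the original exception.
        raise exc_info[1]
```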

databaseImport.populate_commits(cursor, conn, repo_name, repo_type)

Importation of commit information using PyDriller. It inserts commit information into commits, author, modified_file and files_in_commit tables.

Parameters:
  • cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

  • conn (connection.MySQLConnection) – The MySQL connector to the database specified in the configuration file. Used to commit changes to fulfill FK restrictions.

  • repo_name (str) – The full name (i.e., “owner/repo_name”) of the repository.

  • repo_type (str) – The type of the repository (i.e., model, dataset or space).
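
To make the PyDriller step more concrete, here is a hedged sketch of how commits of a Hub repository might be traversed and flattened into rows. The helper names and the row layout are hypothetical; the URL prefixes are the public Hub ones, and the commit attributes used (hash, author.name, msg, author_date) are standard PyDriller fields.

```python
def repo_clone_url(repo_name, repo_type):
    """Build the Git URL of a Hub repository from its full name and type."""
    prefix = {"model": "", "dataset": "datasets/", "space": "spaces/"}[repo_type]
    return f"https://huggingface.co/{prefix}{repo_name}"


def commit_row(commit, repo_name):
    """Flatten a PyDriller commit into a tuple (column order is illustrative)."""
    return (commit.hash, repo_name, commit.author.name, commit.msg, commit.author_date)


def iter_commit_rows(repo_name, repo_type):
    """Yield one row per commit; clones the repository through PyDriller."""
    from pydriller import Repository  # imported lazily; requires pydriller

    url = repo_clone_url(repo_name, repo_type)
    for commit in Repository(url).traverse_commits():
        yield commit_row(commit, repo_name)
```

The rows produced this way could then be passed to the cursor for insertion into the commits table, with related data going to the author, modified_file and files_in_commit tables.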

databaseImport.populate_datasets(cursor, conn, api, limit_index, limit_date)

Importation of dataset information. It retrieves the whole set of datasets from HFH or, optionally, only the first limit_index elements. It inserts dataset information into the repository, dataset and author tables, and calls the remaining populate methods to fill the other tables (not including populate_models and populate_spaces).

Parameters:
  • cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

  • conn (connection.MySQLConnection) – A MySQL connector to the database specified in the configuration file. Used to commit changes to fulfill FK restrictions.

  • api (huggingface_hub.HfApi) – The huggingface_hub API object pointer.

  • limit_index (int) – Upper bound of the slice applied to the set of datasets (only the first limit_index elements are retrieved).

  • limit_date (datetime) – Date from which it starts to update the database (e.g., update the database values of the datasets that have been modified in the last month).
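
The interplay of limit_index and limit_date can be sketched as a slice followed by a date filter. This is an assumption about the order of the two steps, and the last_modified attribute name is also assumed (it matches recent huggingface_hub info objects, but the script may use a different field). The helper name select_repos is hypothetical.

```python
from itertools import islice


def select_repos(repos, limit_index=None, limit_date=None):
    """Keep the first `limit_index` repos, then those modified on or after `limit_date`."""
    # islice(..., None) yields every element, i.e. no slicing.
    selected = islice(repos, limit_index)
    if limit_date is not None:
        selected = (r for r in selected if r.last_modified >= limit_date)
    return list(selected)
```

With limit_date set to one month ago, only repositories modified during the last month would be refreshed in the database.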

databaseImport.populate_discussions(cursor, conn, api, repo_name, repo_type)

Importation of discussions information. It inserts discussion information into discussion, author, conflicting_files_discussion and discussion_event tables.

Parameters:
  • cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

  • conn (connection.MySQLConnection) – A MySQL connector to the database specified in the configuration file. Used to commit changes to fulfill FK restrictions.

  • api (huggingface_hub.HfApi) – The huggingface_hub API object pointer.

  • repo_name (str) – The full name (i.e., “owner/repo_name”) of the repository.

  • repo_type (str) – The type of the repository (i.e., model, dataset or space).
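
One way the discussion data might be collected is through HfApi.get_repo_discussions, which yields one object per discussion or pull request. The sketch below only builds rows; the column order and the helper name discussion_rows are illustrative, not the script's actual layout.

```python
def discussion_rows(api, repo_name, repo_type):
    """Collect one tuple per discussion of a repository (hypothetical helper)."""
    rows = []
    for d in api.get_repo_discussions(repo_id=repo_name, repo_type=repo_type):
        rows.append(
            (d.num, repo_name, d.author, d.title, d.status,
             d.is_pull_request, d.created_at)
        )
    return rows
```

Events and conflicting files are not part of these summary objects; they would need a further call per discussion (huggingface_hub offers get_discussion_details for that purpose).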

databaseImport.populate_files(cursor, api, repo_name, repo_type)

Importation of file information. It inserts file information into the repo_file table.

Parameters:
  • cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

  • api (huggingface_hub.HfApi) – The huggingface_hub API object pointer.

  • repo_name (str) – The full name (i.e., “owner/repo_name”) of the repository.

  • repo_type (str) – The type of the repository (i.e., model, dataset or space).
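
A possible shape for this step, assuming the file names come from HfApi.list_repo_files and that repo_file has filename and repo_id columns (the column names and the helper name insert_repo_files are guesses, not taken from the schema):

```python
def insert_repo_files(cursor, api, repo_name, repo_type):
    """Insert one repo_file row per file in the repository; return the count."""
    filenames = api.list_repo_files(repo_id=repo_name, repo_type=repo_type)
    cursor.executemany(
        # Assumed column names; adapt them to the actual repo_file table.
        "INSERT INTO repo_file (filename, repo_id) VALUES (%s, %s)",
        [(name, repo_name) for name in filenames],
    )
    return len(filenames)
```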

databaseImport.populate_models(cursor, conn, api, limit_index, limit_date)

Importation of model information. It retrieves the whole set of models from HFH or, optionally, only the first limit_index elements. It inserts model information into the repository, model and author tables, and calls the remaining populate methods to fill the other tables (not including populate_datasets and populate_spaces).

Parameters:
  • cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

  • conn (connection.MySQLConnection) – A MySQL connector to the database specified in the configuration file. Used to commit changes to fulfill FK restrictions.

  • api (huggingface_hub.HfApi) – The huggingface_hub API object pointer.

  • limit_index (int) – Upper bound of the slice applied to the set of models (only the first limit_index elements are retrieved).

  • limit_date (datetime) – Date from which it starts to update the database (e.g., update the database values of the models that have been modified in the last month).

databaseImport.populate_space_dependencies(cursor, models, datasets, repo_name)

Importation of space dependency information. It inserts information about the models and datasets used by a space into the models_in_space and datasets_in_space tables.

Parameters:
  • cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

  • models (list[str]) – A list of model ids.

  • datasets (list[str]) – A list of dataset ids.

  • repo_name (str) – The name of the space repository.

databaseImport.populate_spaces(cursor, conn, api, limit_index, limit_date)

Importation of space information. It retrieves the whole set of spaces from HFH or, optionally, only the first limit_index elements. It inserts space information into the repository, space and author tables, and calls the remaining populate methods to fill the other tables (not including populate_models and populate_datasets).

Parameters:
  • cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

  • conn (connection.MySQLConnection) – A MySQL connector to the database specified in the configuration file. Used to commit changes to fulfill FK restrictions.

  • api (huggingface_hub.HfApi) – The huggingface_hub API object pointer.

  • limit_index (int) – Upper bound of the slice applied to the set of spaces (only the first limit_index elements are retrieved).

  • limit_date (datetime) – Date from which it starts to update the database (e.g., update the database values of the spaces that have been modified in the last month).

databaseImport.populate_tags(cursor, conn, tags, repo_name, repo_type)

Importation of tag information. It inserts tag information into the tag and tags_in_repo tables.

Parameters:
  • cursor (cursor.MySQLCursor) – The MySQL connection cursor to execute operations such as SQL statements.

  • conn (connection.MySQLConnection) – The MySQL connector to the database specified in the configuration file. Used to commit changes to fulfill FK restrictions.

  • tags (list[str]) – A list of tag names.

  • repo_name (str) – The full name (i.e., “owner/repo_name”) of the repository.

  • repo_type (str) – The type of the repository (i.e., model, dataset or space).
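
The role of conn here can be illustrated with a two-step insert: parent rows go into tag and are committed before the rows of tags_in_repo that reference them. This is only a sketch; the column names and the helper name insert_tags are assumed, and INSERT IGNORE (MySQL/MariaDB syntax) is one possible way to skip tags that already exist.

```python
def insert_tags(cursor, conn, tags, repo_name):
    """Insert tags, commit, then link them to the repository (hypothetical helper)."""
    unique_tags = list(dict.fromkeys(tags))  # drop duplicates, keep order
    cursor.executemany(
        "INSERT IGNORE INTO tag (name) VALUES (%s)",
        [(t,) for t in unique_tags],
    )
    conn.commit()  # parent rows are committed before the referencing rows
    cursor.executemany(
        "INSERT INTO tags_in_repo (tag_name, repo_id) VALUES (%s, %s)",
        [(t, repo_name) for t in unique_tags],
    )
    conn.commit()
```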

databaseImport.read_config()

Function that retrieves the configuration JSON from hfc.config.

Returns:

JSON object containing all configuration parameters.

Return type:

JSON
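
Reading such a file usually amounts to a single json.load call. The sketch below takes the path as a parameter for testability, which is an addition of this example (the documented function takes no arguments).

```python
import json
from pathlib import Path


def read_config(path="hfc.config"):
    """Parse the JSON configuration file and return it as a Python object."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)
```

In Python the parsed result is typically a dict, so individual settings are accessed by key.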

databaseImport.validate_token(token)

Auxiliary function to validate the token placed in hfc.config.

Parameters:

token (str) – Hugging Face Hub API token.
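
One plausible way to validate a token is to ask the Hub who it belongs to, using HfApi.whoami. Whether the script does exactly this is an assumption; the helper name token_is_valid and the whoami parameter are hypothetical, the latter added only so the check can be tried without network access.

```python
def token_is_valid(token, whoami=None):
    """Return True if the Hub accepts the token, False otherwise."""
    if whoami is None:
        # Imported lazily; requires the huggingface_hub package.
        from huggingface_hub import HfApi
        whoami = HfApi().whoami
    try:
        whoami(token=token)
    except Exception:
        # Note: this also treats network failures as an invalid token;
        # a stricter check would inspect the error type.
        return False
    return True
```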