Usage
=====================

The purpose of the HFCommunity extractor script is to prepare a relational database (following this `ER diagram `_) and to populate it with information on Hugging Face Hub repositories and their Git commit history. It can be used either to create the database schema only or to perform the full extraction process. The script builds the necessary database artifacts (the tables and their relationships) inside an existing database. To set up the script, the user must edit the ``hfc.config`` file (see `Configuration`_).

This script is intended to run on high-performance machines. As a reference for how time- and resource-consuming the process is: retrieving the whole set of models takes approximately 54 hours on a server with 64 GB of RAM and an 8-core processor.

Requirements
---------------------

To execute the ``databaseImport`` script, the following must be installed:

* `Python `_ v3.9.12
* `MariaDB `_ v10.3.39
* `Git `_ v2.20.1

Furthermore, the extractor folder of the `GitHub repo `_ contains a ``requirements.txt`` file listing all the libraries required to execute the ``databaseImport`` script. At the command line (with pip installed):

.. code-block:: console

   $ pip install -r requirements.txt

Configuration
---------------------

Before invoking the ``databaseImport`` script, it is necessary to fill in (or create) the ``hfc.config`` file, located in the same folder as the script. Besides the mandatory fields, there are optional parameters, two of which configure thresholds for low-resource machines: retrieving large repositories (often found among HFH dataset repositories) may consume too many resources and force the script to stop. These thresholds are not mutually exclusive; any combination of them can be used.

The file follows a JSON structure and must be defined as:

.. code-block:: json

   {
       "host" : "hostname",
       "port" : 0000,
       "user" : "user",
       "pass" : "psswd",
       "database" : "db_name",
       "hfh_token" : "token",
       "last_n_months" : -1,    // optional
       "max_num_commits" : 0,   // optional
       "max_num_files" : 0      // optional
   }

Note that standard JSON does not allow comments; the ``// optional`` markers above are annotations only and must not appear in the actual file.

The keywords in this file define:

* **host**: The location of the MariaDB server and database (if self-hosted, ``"localhost"``).
* **port**: Port used by the MariaDB server (by default, 3306).
* **user**: User to access the MariaDB database (the default user is ``"root"``).
* **pass**: Password of the user.
* **database**: The database name. The database **must** be created before invoking the script.
* **hfh_token**: The Hugging Face Hub `API token `_. It must be a read access token.
* **last_n_months**: Number of months used to decide which repositories are fully updated. Repositories modified within the last ``last_n_months`` are fully updated; the rest only get their downloads and likes updated. Intended for executions after the first one.
* **max_num_commits**: Threshold above which repositories are not retrieved: repositories with more than ``max_num_commits`` commits are skipped. Intended for machines with fewer resources; only enabled when using the ``-s`` flag (see `Execution`_).
* **max_num_files**: Threshold above which repositories are not retrieved: repositories with more than ``max_num_files`` files are skipped. Intended for machines with fewer resources; only enabled when using the ``-s`` flag (see `Execution`_).

The recommended values for ``max_num_commits`` and ``max_num_files`` are 1,500 and 10,000, respectively; they must be configured according to the machine's available resources. If the script is being used for testing purposes, these values can be lowered.
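To make the configuration expectations concrete, here is a minimal sketch (not the actual ``databaseImport`` code) that loads ``hfc.config``, verifies the mandatory fields, and opens a test connection to the pre-created database. It assumes the ``mariadb`` connector package is installed; the function and field-checking logic are illustrative:

.. code-block:: python

   import json

   import mariadb  # MariaDB connector for Python (assumed to be installed)

   # Mandatory fields of hfc.config, as described above.
   MANDATORY_FIELDS = ("host", "port", "user", "pass", "database", "hfh_token")


   def load_config(path="hfc.config"):
       """Load hfc.config and fail early if a mandatory field is missing."""
       with open(path) as f:
           config = json.load(f)  # the file must be plain JSON, without comments
       missing = [field for field in MANDATORY_FIELDS if field not in config]
       if missing:
           raise SystemExit(f"hfc.config is missing mandatory fields: {missing}")
       return config


   if __name__ == "__main__":
       config = load_config()
       # The database itself must already exist; the script only creates
       # the tables and relationships inside it.
       connection = mariadb.connect(
           host=config["host"],
           port=config["port"],
           user=config["user"],
           password=config["pass"],
           database=config["database"],
       )
       print(f"Connected to {config['database']}")
       connection.close()

Failing early on a missing field avoids starting a multi-hour extraction with an incomplete configuration.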
Execution
---------------------

Once the dependencies and libraries are installed, the ``databaseImport`` script can be executed as follows:

#. Clone the `HFCommunity repository `_.
#. Edit the ``hfc.config`` file according to your configuration.
#. Launch the extractor via the command line (if the database schema is not created, it will be created automatically). The extractor admits several parameters:

   .. code-block:: console

      $ python databaseImport.py -t {model|dataset|space|all} [-i limit_index] [-s]

#. Alternatively, the script can be used to only generate the required database schema (see `ER diagram`_). Execute the script with the following flag:

   .. code-block:: console

      $ python databaseImport.py -c

``python databaseImport.py -c``
   Creates the database schema of the HFC database (see `ER diagram`_). It only creates the database schema; it does not import the data!

``python databaseImport.py -t``
   Imports only one type of repository (``type`` must be either ``model``, ``dataset`` or ``space``, or ``all`` for any type). This flag is mandatory, and it indicates which type of repositories the script will import. To import all repositories (not recommended because of time and resource consumption), use the ``all`` argument.

``python databaseImport.py -i``
   Repositories are retrieved sorted (descending) by their last modification date (the most recent are at the lower indices). To import just a subset of these (e.g., the 100 most recent), the index slices the list: from all the repositories retrieved from HFH, only those below ``limit_index`` are selected. This is intended for populating a reduced version of the HFC database. Note that if the index exceeds the size of the list, the whole set is imported.

``python databaseImport.py -s``
   Importing dataset and model repositories may consume excessive resources in some cases. When this flag is activated, the script reads the ``max_num_commits`` and ``max_num_files`` parameters of the ``hfc.config`` file and skips repositories that exceed either threshold (each parameter represents a threshold, and they are evaluated independently; see the sketch after the execution examples below). It is not necessary to specify both. The script prints the skipped repositories to stdout, and the total number of skipped repositories at the end. **Using this option is highly recommended**, even on high-performance machines, as some repositories have tens of thousands of commits and files, which consumes excessive RAM.

Execution examples
^^^^^^^^^^^^^^^^^^^^^

To give a clear idea of how to execute the script, some examples follow:

*"I want to retrieve the 100 most recently modified models."*

.. code-block:: console

   $ python databaseImport.py -t model -i 100

*"I want to retrieve all repositories of HFH, but skip those above the threshold."*

.. code-block:: console

   $ python databaseImport.py -t all -s

*"I want to retrieve the 50 most recently modified spaces, but skip those above the threshold."*

.. code-block:: console

   $ python databaseImport.py -t space -i 50 -s
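As a complement to the ``-s`` examples above, the following is a rough sketch of the skip rule that the thresholds describe. It is an illustrative assumption, not the actual implementation of ``databaseImport.py``: it checks a single hypothetical repository against both thresholds using the ``huggingface_hub`` client library.

.. code-block:: python

   from huggingface_hub import HfApi


   def should_skip(api, repo_id, repo_type, max_num_commits, max_num_files):
       """Return True when a repository exceeds either configured threshold.

       Mirrors the documented behaviour of the -s flag: the two thresholds
       are evaluated independently, and exceeding any one of them is enough.
       """
       commits = api.list_repo_commits(repo_id, repo_type=repo_type)
       files = api.list_repo_files(repo_id, repo_type=repo_type)
       return len(commits) > max_num_commits or len(files) > max_num_files


   # Hypothetical repository id, checked against the recommended thresholds.
   api = HfApi(token="token")  # a read access token, as in hfc.config
   if should_skip(api, "some-user/some-dataset", "dataset", 1500, 10000):
       print("Skipping some-user/some-dataset")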