HFCommunity information schema

The information used to build HFCommunity is drawn from Hugging Face Hub repositories through two sources: (1) the Hugging Face Hub client library and Hub API, and (2) Git.

Hugging Face offers a Python library for working with the Hub, called the Hub client library. It supports interacting with the Hugging Face Hub, for example deleting or cloning a repository, uploading files, and creating or updating a branch. It also supports retrieving information from these repositories.
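As a rough illustration of the retrieval side, the sketch below reads a few metadata fields for a single repository through the `huggingface_hub` package. The helper names `summarize_repo` and `fetch_model_summary`, and the choice of fields, are assumptions made for this example; they do not describe how HFCommunity itself extracts data. Attribute names may differ between library versions, which is why the fields are read defensively.

```python
from types import SimpleNamespace


def summarize_repo(info):
    """Collect a few metadata fields from a repository info object.

    `info` is expected to resemble the object returned by
    `HfApi().model_info(...)`; attributes that are missing default to None.
    """
    fields = ("id", "author", "downloads", "likes", "tags")
    return {name: getattr(info, name, None) for name in fields}


def fetch_model_summary(repo_id):
    """Query the Hub for one model repository (requires network access)."""
    # Imported lazily so summarize_repo can be used without the package installed.
    from huggingface_hub import HfApi

    return summarize_repo(HfApi().model_info(repo_id))


# Offline stand-in for a Hub response; all values here are made up.
fake_info = SimpleNamespace(
    id="some-user/some-model",
    author="some-user",
    likes=3,
    tags=["text-classification"],
)
summary = summarize_repo(fake_info)
```

With network access and the package installed, `fetch_model_summary("some-user/some-model")` would return the same kind of dictionary for a real repository identifier.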

However, some valuable repository information is not tracked by this library: the development history. Commit information can help in understanding the evolution and development practices of a project. We have therefore recovered it with a Git analysis tool, PyDriller. As a result, besides the information available through the Hub client library, we offer the commit data of each repository, which includes the date and time of each commit, its author, and the changed files, among others.
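The commit fields mentioned above can be read with PyDriller roughly as follows. This is a minimal sketch, assuming a PyDriller 2.x installation and a repository that has already been cloned locally. The function names and the record layout are hypothetical and are not the HFCommunity extraction code.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def commit_to_record(commit):
    """Flatten one commit into a plain dictionary."""
    return {
        "hash": commit.hash,
        "author": commit.author.name,
        "date": commit.author_date.isoformat(),
        "files": [f.filename for f in commit.modified_files],
    }


def iter_commit_records(repo_path):
    """Yield one record per commit of a locally cloned repository."""
    # Imported lazily so commit_to_record can be tried without PyDriller installed.
    from pydriller import Repository

    for commit in Repository(repo_path).traverse_commits():
        yield commit_to_record(commit)


# Offline stand-in that mimics the attributes read above; values are made up.
fake_commit = SimpleNamespace(
    hash="abc123",
    author=SimpleNamespace(name="Jane Doe"),
    author_date=datetime(2023, 1, 15, 10, 30, tzinfo=timezone.utc),
    modified_files=[
        SimpleNamespace(filename="README.md"),
        SimpleNamespace(filename="config.json"),
    ],
)
record = commit_to_record(fake_commit)
```

Separating the flattening step from the traversal keeps the record layout easy to inspect without cloning anything.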

UML diagram

 

Limitations

We have identified the following limitations in our proposal:

  • The Hub client library is still under development, so data retrieval through it is limited. For example, complete author information is available only for the authors of discussions, because we recover it directly from the API endpoint.
  • There is no way to link commit authors with Hugging Face Hub users.
  • The connection between repository types (model, dataset, and space) is tracked neither by the Hub client library nor by the Hub API. An example of such a connection is a dataset being used by a model. Some repositories, however, specify it in the card_data field.
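For repositories that do declare such a connection, the last point can be sketched as follows. The example assumes that the card metadata lists related datasets under a `datasets` key, as Hub model cards commonly do, and that the card data is either a plain dictionary or an object exposing `to_dict()`. The function name `linked_datasets` is hypothetical.

```python
def linked_datasets(card_data):
    """Return the dataset ids declared in a repository card, if any."""
    if card_data is None:
        return []
    # Card data objects expose to_dict(); plain dictionaries are copied as-is.
    data = card_data.to_dict() if hasattr(card_data, "to_dict") else dict(card_data)
    declared = data.get("datasets") or []
    # The field may hold a single id or a list of ids.
    if isinstance(declared, str):
        return [declared]
    return list(declared)


# Made-up card metadata covering the cases handled above.
examples = [
    linked_datasets({"license": "mit", "datasets": ["squad", "glue"]}),
    linked_datasets({"datasets": "imdb"}),
    linked_datasets({"license": "mit"}),
    linked_datasets(None),
]
```

An empty result only means that the card declares nothing; as noted above, the connection itself is not tracked, so it cannot be taken as evidence that no dataset was used.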