HFCommunity information schema

The information used to build HFCommunity from Hugging Face Hub repositories comes from two sources: (1) the Hub client library and the Hub API, and (2) Git.

Hugging Face offers a Python library, the Hub client library, for interacting with the Hugging Face Hub. It supports operations such as deleting or cloning a repository, uploading files, and creating and updating branches. It also supports retrieving information about these repositories.
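As an illustration of this kind of retrieval, the following is a minimal sketch, not the actual HFCommunity extraction code. It assumes the huggingface_hub package and its HfApi.model_info call; the helper names summarize_repo and fetch_model_summary, the choice of fields, and the sample values are hypothetical.

```python
from types import SimpleNamespace
from typing import Any


def summarize_repo(info: Any) -> dict:
    """Flatten a repository-info object into a plain dictionary.

    getattr with a default is used because the attributes exposed by the
    client library may differ between versions (an assumption of this sketch).
    """
    return {
        "id": getattr(info, "id", None),
        "author": getattr(info, "author", None),
        "downloads": getattr(info, "downloads", None),
        "likes": getattr(info, "likes", None),
        "tags": list(getattr(info, "tags", None) or []),
    }


def fetch_model_summary(repo_id: str) -> dict:
    """Query the Hub for one model repository (requires network access)."""
    # Imported lazily so summarize_repo works without the library installed.
    from huggingface_hub import HfApi

    return summarize_repo(HfApi().model_info(repo_id))


if __name__ == "__main__":
    # Offline demonstration with a stand-in object; all values are made up.
    fake = SimpleNamespace(
        id="some-user/some-model",
        author="some-user",
        downloads=10,
        likes=2,
        tags=["text-generation"],
    )
    print(summarize_repo(fake))
```

Keeping the flattening step separate from the network call makes it possible to check the mapping offline; fetch_model_summary would be called with a real repository identifier.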

However, some valuable repository information is not tracked by this library: the development history. Commit information can help to understand the evolution and development practices of a project. We therefore recover it using a Git analysis tool, PyDriller. As a result, besides the information available through the Hub client library, we offer the commit data of each repository, which includes the date and time of each commit, its author, and the changed files, among others.
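The commit fields mentioned above could be collected along the following lines. This is a sketch under assumptions rather than the actual HFCommunity pipeline: it assumes PyDriller 2.x (the Repository class and the modified_files attribute) and a local clone of the repository. CommitRecord, to_record, and iter_commit_records are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Iterator, List


@dataclass
class CommitRecord:
    """Subset of commit data: identifier, author, date, message, files."""
    sha: str
    author_name: str
    author_email: str
    authored_at: datetime
    message: str
    changed_files: List[str]


def to_record(commit: Any) -> CommitRecord:
    """Map a PyDriller-style commit object to a CommitRecord."""
    return CommitRecord(
        sha=commit.hash,
        author_name=commit.author.name,
        author_email=commit.author.email,
        authored_at=commit.author_date,
        message=commit.msg,
        # Each modified file exposes its base name as `filename`.
        changed_files=[f.filename for f in commit.modified_files],
    )


def iter_commit_records(repo_path: str) -> Iterator[CommitRecord]:
    """Yield one record per commit of the repository at repo_path."""
    # Imported lazily so to_record can be used without PyDriller installed.
    from pydriller import Repository

    for commit in Repository(repo_path).traverse_commits():
        yield to_record(commit)
```

For example, iterating over iter_commit_records("path/to/local/clone") would yield one record per commit, where the path is a placeholder for an existing clone.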

UML diagram

 

Limitations

We detected some limitations in our approach, mainly related to the libraries and tools we use to build HFCommunity:

  • The Hub client library is still under development, which limits the data that can be retrieved. For example, complete information about an author is only available for the authors of discussions, as we recover it directly from the API endpoint.
  • There is no way to link commit authors with users of the Hugging Face Hub.