Problem Statement
Over the last twenty years, materials from the humanities domain have been stored digitally and made available with associated metadata but have not been utilized effectively by scholars. Identifying associate articles and authors is challenging due to the interdisciplinary nature of digital humanities, but essential for effective collaboration. This poster showcased a search engine (https://humanities.lairhub.com/) to identify related scholarly works and authors for improving collaboration within the humanities domain.
Dataset
Before identifying the relationship between articles and authors, our study developed an Extract-Transform-Load (ETL) pipeline. Inputs of the pipeline were the bibliography images collected from scholarly articles. Input images collected using iPhone and screenshot tool in Windows 11. The articles were retrieved using the Humanities Index tool with “digital humanities” as a query term. The pipeline leveraged the Google Vision API tool for optical character recognition, extracted the text from the images, and stored them in a staging area (figure 2). The pipeline sends extracted text to the Open AI API for collecting article title, authors, and DOI (GPT-3.5-turbo model) using the prompt below:
The pipeline performs the initial search through the Crossref API using the title and author attributes generated by the ChatGPT and selects the top document. In the second step, the pipeline stores the document metadata (title, abstract, authors, creation date, document type) in a relation database management system in a normalized form following the Dublin Core metadata properties. The pipeline sequentially requests the Scopus, IEEE, and Semantic Scholars API if the abstract is unavailable on Crossref. (figure 3).
Scope of the Database
Number of Documents | 3162 |
Number of Abstracts | 2167 |
Number of Journal Articles | 2002 |
Number of Books | 78 |
Last Publication Date | June 2024 |