REST Service for Document Classification

Saxon State and University Library Dresden (SLUB Dresden)

The sheer volume of documents published every day makes it difficult for libraries to categorize all of this information. This project evaluated whether machine learning methods can automate document classification. The result is open-source software published on GitHub.

Project Context

SLUB Dresden manages tens of thousands of documents on its document server “Qucosa”, ranging from dissertations and research reports to journals and books. Classifying publications into detailed topics or categories often requires costly expert knowledge.

The most commonly used classification schemes are the Dewey Decimal Classification and (in Germany) the “Regensburger Verbundklassifikation”. The latter comprises over 100,000 classes, so manual assignment is very difficult, even for scientists supported by a suitable search tool.

The goal of this project was to assess the effectiveness of current classification approaches in the area of artificial intelligence and machine learning. All trained models were then made available via a REST service for further use.

Beyond the Qucosa database, the following other publicly available data sources were used:

Metadata of the k10plus
Information about the Regensburger Verbundklassifikation
Information about other classification schemes provided by the Coli-Conc project

Among others, the following classification methods were evaluated:

k-nearest neighbor classification (as baseline)
Artificial Neural Networks (Fully Connected, BERT, Pre-Trained)
Specialized methods (Omikuji, FastText) as provided by the Annif project
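
To illustrate what a k-nearest neighbor baseline does, here is a minimal sketch using only the Python standard library. It is not the project's implementation: the training titles, the class labels, and the function names are made up for illustration, and plain word counts with cosine similarity stand in for whatever text features the project actually used.

```python
import math
from collections import Counter

# Hypothetical training data: (document title, class label).
TRAINING = [
    ("quantum optics and laser physics", "physics"),
    ("particle physics at high energy", "physics"),
    ("laser spectroscopy of atoms", "physics"),
    ("baroque music and opera history", "music"),
    ("history of orchestral music", "music"),
    ("opera singers in the baroque era", "music"),
]


def vectorize(text):
    """Represent a text as a bag of lowercase words with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(query, training, k=3):
    """Return the majority label among the k most similar training documents."""
    q = vectorize(query)
    ranked = sorted(
        training,
        key=lambda item: cosine(q, vectorize(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


print(knn_predict("laser physics experiments", TRAINING))
print(knn_predict("baroque opera music", TRAINING))
```

With this toy data, the first query is assigned to "physics" and the second to "music". A baseline like this needs no training step, which makes it a convenient reference point for the more complex models.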

Special attention was paid to the evaluation methodology. Since all classification schemes are hierarchical, the severity of a misclassification needs to be considered. For that reason, a hierarchical evaluation metric was used (see Cesa-Bianchi). It distinguishes between misclassifying a document into a parent category and misclassifying it into an unrelated category.
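
The idea behind such a hierarchical metric can be sketched in a few lines of Python. The example below follows the general shape of the H-loss described by Cesa-Bianchi et al.: an error is counted only at the highest node where prediction and truth disagree, and nodes deeper in the hierarchy carry a smaller cost. The toy hierarchy and the exact cost weighting are assumptions for illustration; the metric actually used in the project may differ in its details.

```python
# Hypothetical toy hierarchy (child -> parent); None marks a top-level class.
PARENT = {
    "science": None,
    "physics": "science",
    "optics": "physics",
    "biology": "science",
    "arts": None,
    "music": "arts",
}


def ancestors(node):
    """Return all proper ancestors of a node, nearest first."""
    result = []
    parent = PARENT[node]
    while parent is not None:
        result.append(parent)
        parent = PARENT[parent]
    return result


def closure(labels):
    """Expand a label set so it also contains every ancestor of each label."""
    expanded = set()
    for label in labels:
        expanded.add(label)
        expanded.update(ancestors(label))
    return expanded


def node_costs():
    """Assumed weighting: each node's cost is its parent's cost split evenly
    among the siblings; the top-level classes share a total cost of 1."""
    children = {}
    for node, parent in PARENT.items():
        children.setdefault(parent, []).append(node)
    costs = {}

    def assign(parent, parent_cost):
        kids = children.get(parent, [])
        for kid in kids:
            costs[kid] = parent_cost / len(kids)
            assign(kid, costs[kid])

    assign(None, 1.0)
    return costs


COSTS = node_costs()


def h_loss(predicted, actual):
    """Sum the costs of nodes where prediction and truth disagree
    while all of the node's ancestors still agree."""
    pred, true = closure(predicted), closure(actual)
    total = 0.0
    for node in PARENT:
        disagrees = (node in pred) != (node in true)
        ancestors_agree = all((a in pred) == (a in true) for a in ancestors(node))
        if disagrees and ancestors_agree:
            total += COSTS[node]
    return total


print(h_loss({"physics"}, {"optics"}))  # parent category predicted
print(h_loss({"biology"}, {"optics"}))  # sibling branch predicted
print(h_loss({"music"}, {"optics"}))    # unrelated category predicted
```

For a document whose true class is "optics", predicting the parent "physics" costs 0.25, predicting "biology" costs 0.5, and predicting the unrelated "music" costs 1.0. A flat accuracy metric would score all three predictions as equally wrong.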

Technologies Used

The software is implemented in Python and uses various open-source modules. Artificial neural networks are evaluated with PyTorch and Hugging Face Transformers. Other machine learning methods are based on the scikit-learn library. The REST interface follows the OpenAPI specification and uses Swagger UI. Data preprocessing is done with Pandas. Results are visualized using Plotly.
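
As a rough illustration of how a classification model can be exposed over REST, the following sketch uses only Python's standard library. It does not reproduce the project's actual interface: the `/classify` path, the JSON field names, and the `dummy_model` function are hypothetical placeholders, and the real service is based on OpenAPI and Swagger UI rather than on `http.server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def dummy_model(text):
    """Hypothetical stand-in for a trained model: returns (class, score) pairs."""
    return [("example-class", 1.0)] if text.strip() else []


def handle_classify(raw_body):
    """Map a JSON request body to an HTTP status code and a JSON response body."""
    try:
        text = json.loads(raw_body)["text"]
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected a JSON object with a 'text' field"})
    if not isinstance(text, str):
        return 400, json.dumps({"error": "'text' must be a string"})
    predictions = [{"class": c, "score": s} for c, s in dummy_model(text)]
    return 200, json.dumps({"predictions": predictions})


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only the hypothetical /classify endpoint is served.
        if self.path != "/classify":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_classify(self.rfile.read(length))
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Start the sketch server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ClassifyHandler).serve_forever()
```

Calling `serve()` starts the server locally; a client would then send a POST request to `/classify` with a body such as `{"text": "a thesis about optics"}` and receive a JSON list of predicted classes. Keeping the request handling in the plain function `handle_classify` makes it testable without starting a server.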

Scope of Services

The following services were provided:

Research into related machine learning methods in the context of the project
Presentation of relevant ideas, concepts, problems, and restrictions in a way that is comprehensible to non-experts
Implementation of the software and its components: combining data from multiple sources; adopting and modifying classification models and evaluation methods
Running experiments using accelerated computing with modern graphics cards
Presentation and discussion of the results
Publication of the software as an open-source project on GitHub