
REST Service for Document Classification

Saxon State and University Library Dresden (SLUB Dresden)

The volume of documents published every day poses a challenge for libraries and their efforts to categorize all of this information. This project evaluated whether machine learning methods can be used to automate document classification. The result is an open-source software package published on GitHub.

Project Context

The library “SLUB Dresden” manages tens of thousands of documents on its document server “Qucosa”, ranging from dissertations and research reports to journals and books. Classifying publications into detailed topics or categories often requires costly expert knowledge.

The most commonly used classification schemes are the Dewey Decimal Classification and (in Germany) the “Regensburger Verbundklassifikation”. The latter consists of over 100,000 classes, so manual assignment is very difficult even for scientists supported by a suitable search tool.

The goal of this project was to assess the effectiveness of current classification approaches in the area of artificial intelligence and machine learning. All trained models were then made available via a REST service for further use.

Beyond the Qucosa database, the following other publicly available data sources were used:

Among others, the following classification methods were evaluated:

  • k-nearest neighbor classification (as baseline)
  • Artificial Neural Networks (Fully Connected, BERT, Pre-Trained)
  • Specialized methods (Omikuji, FastText) as provided by the Annif project
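
To illustrate the idea behind the k-nearest neighbor baseline, the following is a minimal sketch in plain Python. It is not the project's implementation: the bag-of-words representation, the cosine similarity, the function names, and the toy training data with placeholder labels are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, labelled_docs, k=3):
    """Return the majority label among the k most similar training documents."""
    query_vec = bag_of_words(query)
    scored = [
        (cosine_similarity(query_vec, bag_of_words(text)), label)
        for text, label in labelled_docs
    ]
    # Highest similarity first; keep only the k nearest neighbours.
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training data; the labels are made-up placeholders, not real class notations.
TRAINING_DOCS = [
    ("neural networks for image recognition", "computer science"),
    ("training deep learning models on graphics cards", "computer science"),
    ("database systems and query optimisation", "computer science"),
    ("medieval trade routes in central europe", "history"),
    ("the history of the saxon court in dresden", "history"),
    ("archival sources on early modern europe", "history"),
]

predicted = knn_classify("deep neural networks and learning", TRAINING_DOCS, k=3)
print(predicted)
```

A baseline of this kind needs no training beyond storing the labelled documents, which makes it a convenient reference point for the more elaborate models.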

Special consideration was given to the evaluation methodology. Since all classification schemes are structured as a hierarchy, the severity of a misclassification needs to be considered. For that reason, a hierarchical evaluation metric was used (see Cesa-Bianchi). It distinguishes between a document being misclassified into a parent category of the correct class and a document being misclassified into an unrelated category.
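
The effect of a hierarchy-aware metric can be sketched with a small example. The code below uses a hierarchical F1 score based on shared ancestors, which is one common hierarchy-aware measure; it is not necessarily the metric from the cited work or the one used in the project. The tiny hierarchy and all names are hypothetical.

```python
def ancestors_and_self(label, parent_of):
    """Collect a label together with all of its ancestors in the hierarchy."""
    path = set()
    while label is not None:
        path.add(label)
        label = parent_of.get(label)
    return path


def hierarchical_f1(true_label, predicted_label, parent_of):
    """F1 score over the ancestor sets of the true and the predicted label."""
    true_path = ancestors_and_self(true_label, parent_of)
    pred_path = ancestors_and_self(predicted_label, parent_of)
    overlap = len(true_path & pred_path)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_path)
    recall = overlap / len(true_path)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy hierarchy: each label maps to its parent (None = top level).
PARENT_OF = {
    "science": None,
    "computer science": "science",
    "machine learning": "computer science",
    "history": None,
    "medieval history": "history",
}

exact = hierarchical_f1("machine learning", "machine learning", PARENT_OF)
parent = hierarchical_f1("machine learning", "computer science", PARENT_OF)
unrelated = hierarchical_f1("machine learning", "medieval history", PARENT_OF)

# A parent-category prediction scores between an exact hit and an unrelated one.
print(exact, parent, unrelated)
```

A flat accuracy measure would count both wrong predictions as equally bad, whereas this score gives partial credit for predicting the parent category.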

Technologies Used

The software is implemented in Python and builds on several open-source libraries. Artificial neural networks are evaluated using PyTorch and Hugging Face Transformers. The other machine learning methods are based on scikit-learn. The REST interface follows the OpenAPI specification and uses Swagger UI. Data preprocessing is done with pandas, and results are visualized with Plotly.

Scope of Services

The following services were provided:

  • Scientific research into related machine learning methods in the context of the project
  • Presentation of relevant ideas, concepts, problems, and restrictions in a way that non-experts can follow
  • Implementation of the software and its components: combining data from multiple sources; adopting and modifying classification models and evaluation methods
  • Conducting experiments using accelerated computing on modern graphics cards
  • Presentation and discussion of the results
  • Publishing the software as an open-source project on GitHub

Demo

The trained models were compiled into a simple online demo:

dcs.knopflogik.de

We Look Forward to Hearing from You


Call by Phone

+49-391-40594560
(Mon ‒ Fri, 9am ‒ 5pm)

Follow Us

GitHub, LinkedIn

© 2021–2024, knopflogik GmbH