About

Project Summary

The goal of the project LUMO (Lifelong Multimodal Language Learning by Explaining and Exploiting Compositional Knowledge) is to explore the important and challenging research question of how to make multimodal language models robust against task changes by explaining and exploiting compositional knowledge.

Existing multimodal models have problems in a lifelong language learning setting, where they are confronted with changing tasks while having to retain previously learned knowledge — i.e., mitigating catastrophic forgetting. This has been considered a challenging obstacle to their application to real-world scenarios.

In devising Lifelong Learning Multimodal Models (LLMMs), we pursue four main objectives realized through corresponding work packages.

LUMO project overview — Overview of the LUMO project: objectives and their interconnections.

Objectives

Datasets & Environments for Lifelong Multimodal Language Learning

Develop datasets and environments for two representative multimodal language learning tasks. Our focus is on concepts (e.g., colors, shapes), relations (e.g., spatial or functional relations), and actions that can be combined in novel ways across changing tasks.

Vision and language integration: compositional understanding across modalities.

Concept-Based Explanations of Lifelong Multimodal Language Learning

Understand why certain approaches lead to more robust LLMMs by scrutinizing how concepts and relations emerge inside an LLMM using concept-based XAI (C-XAI) methods. We also aim to understand the training dynamics of the formation of concepts and relations in an LLMM to elucidate compositional generalization and catastrophic forgetting.

Concept-based XAI methods — Concept-based explainability methods for analyzing internal representations.

Neuro-Symbolic Integration Using Vector Space Semantics

Develop a tightly integrated neuro-symbolic approach to improve the model’s lifelong learning performance. Key ideas:

Features of a concept form a region in the embedding space, enabling logical reasoning through spatial reasoning with feature regions
Apply symbolic constraints using vector space semantics to those regions to regularize an LLMM via a generalized semantic loss function
Address the superposition problem (where different concepts share the same dimensions) using sparse autoencoders and concept subspaces
Investigate synergies between the neuro-symbolic approach and existing methods (e.g., data augmentation, experience replay, elastic weight consolidation)

Concept subspaces in the embedding space for neuro-symbolic integration.

Sim2Real Transfer

Transfer insights from simulation to real-world robotic manipulation scenarios:

Analyze discrepancies between concept representations learned in simulation vs. the real world
Use concept-based techniques to close the sim2real gap by aligning concept regions across domains

Sim2Real transfer: bridging simulation and real-world robotic manipulation.

Project Details

Full Title: Lifelong Multimodal Language Learning by Explaining and Exploiting Compositional Knowledge
Acronym: LUMO
Funding: Deutsche Forschungsgemeinschaft (DFG) – Research Grants
Project Number: 551629603
Duration: 36 months (since 2025)
Institution: Department of Informatics, University of Hamburg
Applicants: Dr. Jae Hee Lee and Prof. Dr. Stefan Wermter
Subject Area: Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing