IROS 2024

LaMI: Large Language Models for Multi-Modal Human-Robot Interaction

Chao Wang, Stephan Hasler, Daniel Tanneberg, Felix Ocker, Antonello Ceravola, Frank Joublin, Joerg Deigmoeller, Michael Gienger

LLM-driven human-robot interaction centered around Character, Capabilities, and Examples

Abstract

This paper presents an innovative large language model (LLM)-based robotic system for enhancing multi-modal human-robot interaction (HRI). Traditional HRI systems rely on complex designs for intent estimation, reasoning, and behavior generation, which are resource-intensive. In contrast, our system empowers researchers and practitioners to regulate robot behavior through three key aspects: providing high-level linguistic guidance, creating "atomics" for the actions and expressions the robot can use, and offering a set of examples. Implemented on a physical robot, it demonstrates proficiency in adapting to multi-modal inputs and in determining the appropriate manner of assisting humans with its arms, following the researchers' defined guidelines. Simultaneously, it coordinates the robot's lid, neck, and ear movements with speech output to produce dynamic, multi-modal expressions. This showcases the system's potential to revolutionize HRI by shifting from conventional, manual state-and-flow design methods to an intuitive, guidance-based, and example-driven approach.
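To make the three ingredients named in the abstract more concrete, the Python sketch below shows one way high-level character guidance, a list of "atomics", and a few examples could be assembled into a single system prompt. It is purely illustrative: the character text, atomic names, and example lines are hypothetical placeholders, not the prompts actually used in the paper.

# Illustrative sketch only: assembling character guidance, "atomics",
# and examples into one system prompt. All names and texts below are
# hypothetical, not the paper's actual prompt content.

CHARACTER = (
    "You are a helpful tabletop robot. Only intervene when a person "
    "cannot complete a task themselves."
)

# "Atomics": actions and expressions the robot can use.
ATOMICS = [
    {"name": "hand_object_over", "description": "Hand an object to a person."},
    {"name": "pour", "description": "Pour liquid from one container into another."},
    {"name": "facial_expression", "description": "Look at a target and show a gesture."},
]

EXAMPLES = [
    "A person asks another person for a glass -> first check whether that person can help.",
]

def build_system_prompt() -> str:
    atomics_text = "\n".join(f"- {a['name']}: {a['description']}" for a in ATOMICS)
    examples_text = "\n".join(f"- {e}" for e in EXAMPLES)
    return f"{CHARACTER}\n\nAvailable actions:\n{atomics_text}\n\nExamples:\n{examples_text}"

if __name__ == "__main__":
    print(build_system_prompt())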


Use Cases of Large Language Model-driven Multi-Modal Human-Robot Interaction

System

The system's architecture comprises three key modules: "Scene Narrator", "Planner", and "Expresser". The Scene Narrator mirrors the states of objects and humans as detected by sensors and translates them into natural-language events. The Planner processes these multi-modal inputs as event messages, including the positions of individuals within the scene, and queries the LLM to decide how the robot should act. The Expresser produces the robot's non-verbal behavior, coordinating lid, neck, and ear movements with speech output. Inter-module communication is facilitated using ROS.

The system structure
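As a rough illustration of the ROS-based communication between modules, the sketch below shows how a verbalized scene event could travel from the Scene Narrator to the Planner. It assumes ROS 1 with rospy and std_msgs; the topic name /lami/scene_events and the message content are hypothetical and not taken from the actual implementation.

# Minimal sketch (assuming ROS 1 / rospy and std_msgs) of forwarding a
# verbalized scene event from the Scene Narrator to the Planner.
# The topic name and message text are hypothetical.

import rospy
from std_msgs.msg import String

def narrate_event(pub: rospy.Publisher, text: str) -> None:
    """Publish a scene event that has already been verbalized."""
    pub.publish(String(data=text))

def planner_callback(msg: String) -> None:
    """Planner side: receive the event and (here) just log it."""
    rospy.loginfo("Planner received event: %s", msg.data)

if __name__ == "__main__":
    rospy.init_node("scene_narrator_demo")
    event_pub = rospy.Publisher("/lami/scene_events", String, queue_size=10)
    rospy.Subscriber("/lami/scene_events", String, planner_callback)
    rospy.sleep(1.0)  # give the connection time to establish
    narrate_event(event_pub, "Felix speaks to Daniel: 'Please hand me the red glass'.")
    rospy.spin()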

Interaction Flow

The interaction typically begins with a person's speech. For instance, the "Scene Narrator" detects "Felix speaks to Daniel: 'Please hand me the red glass'." This event is translated into natural language and relayed to the "Planner" module, initiating a GPT query. Simultaneously, the "Planner" informs the "Expresser" to produce an immediate rule-based response: the robot looks at Felix while its ears and lid roll back, simulating a listening gesture.

Approximately two seconds later, GPT responds by invoking the get_persons() and get_objects() functions to identify the people and objects present. The resulting data, including "Felix", "Daniel", and object details, are sent back to GPT for further analysis. While waiting for GPT's next response, the robot exhibits a 'thinking' gesture, looking from side to side with blinking lid movements. Shortly after, the LLM calls check_hindering_reasons() to assess whether Daniel can see and reach the red glass and whether he is busy. Concurrently, facial_expression() is activated so that the robot looks towards Daniel. The outcome indicates that Daniel can hand over the glass, and the robot, following the pre-defined guidance, opts not to intervene, silently displaying its reasoning on the GUI.

Subsequently, Felix asks Daniel to pour cola into the glass. The robot, attentive to their conversation, deduces through check_hindering_reasons() that Daniel is occupied with a phone call and learns from is_person_busy_or_idle() that Felix is holding the glass. The robot then opts to pour cola from the bottle into Felix's glass. Should Felix not be holding the glass, or should it be beyond the robot's reach, the robot instead places the bottle near Felix. Directed by the LLM, the robot's head tracks the bottle during pickup and shifts to the glass while pouring. Upon completion, the robot nods towards Felix and announces, "I've poured Coca-Cola into your glass as Daniel is currently busy."

The interaction flow. The blue squares are actions generated by the LLM; the grey ones are rule-based functions.
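The flow above relies on the LLM invoking functions such as get_persons(), get_objects(), check_hindering_reasons(), and facial_expression(). The exact way these are exposed to GPT is not shown here, so the following Python sketch is only an assumption-laden illustration using the OpenAI chat-completions tool-calling interface: the parameter schemas, the model name, and the stubbed dispatch() helper are hypothetical rather than the paper's actual implementation.

# Sketch only: a single round of the Planner's function-calling loop.
# Tool names follow the text above; their schemas and the dispatch
# logic here are illustrative assumptions.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOOLS = [
    {"type": "function", "function": {
        "name": "get_persons",
        "description": "List the persons currently detected in the scene.",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {
        "name": "get_objects",
        "description": "List the objects currently detected in the scene.",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {
        "name": "check_hindering_reasons",
        "description": "Check whether a person can see and reach an object and is not busy.",
        "parameters": {"type": "object", "properties": {
            "person": {"type": "string"},
            "object": {"type": "string"}},
            "required": ["person", "object"]}}},
    {"type": "function", "function": {
        "name": "facial_expression",
        "description": "Make the robot look at a target while acting or speaking.",
        "parameters": {"type": "object", "properties": {
            "target": {"type": "string"}},
            "required": ["target"]}}},
]

def dispatch(name: str, args: dict) -> str:
    """Execute a tool call on the robot side (stubbed out here)."""
    # In the real system this would query the Scene Narrator or trigger
    # the Expresser; here we only return a canned acknowledgement.
    return json.dumps({"tool": name, "args": args, "result": "ok"})

messages = [
    {"role": "system", "content": "You are a helpful robot assistant."},
    {"role": "user",
     "content": "Felix speaks to Daniel: 'Please hand me the red glass'."},
]

response = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=TOOLS)
reply = response.choices[0].message

# If the model requested tool calls, execute them and feed the results
# back so it can continue reasoning in the next query.
if reply.tool_calls:
    messages.append(reply)
    for call in reply.tool_calls:
        result = dispatch(call.function.name,
                          json.loads(call.function.arguments or "{}"))
        messages.append({"role": "tool",
                         "tool_call_id": call.id,
                         "content": result})
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=TOOLS)

In practice such a loop would repeat until the model stops requesting tools and returns a final decision, at which point the Planner triggers the corresponding robot actions and expressions.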

Prompts