To Help or Not to Help: LLM-based Attentive Support for Human-Robot Group Interactions

Daniel Tanneberg, Felix Ocker, Stephan Hasler, Joerg Deigmoeller, Anna Belardinelli, Chao Wang, Heiko Wersing, Bernhard Sendhoff, Michael Gienger

Overview of the framework enabling attentive supportive behavior in the robot.

Abstract

How can a robot provide unobtrusive physical support within a group of humans? We present Attentive Support, a novel interaction concept for robots to support a group of humans. It combines scene perception, dialogue acquisition, situation understanding, and behavior generation with the common-sense reasoning capabilities of Large Language Models (LLMs). In addition to following user instructions, Attentive Support is capable of deciding when and how to support the humans, and when to remain silent so as not to disturb the group. With a diverse set of scenarios, we show and evaluate the robot's attentive behavior, which supports and helps the humans when required, while not disturbing the group if no help is needed.


Use Cases of Large Language Model-driven Multi-Modal Human-Robot Interaction

System

The system's architecture comprises three key modules: "Scene Narrator", "Planner", and "Expresser". The Scene Narrator mirrors the states of objects and humans as detected by the sensors. The Planner module receives these multi-modal inputs as event messages, including the positions of individuals within the scene. Inter-module communication is realized via ROS.

The system structure
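The text above states that the modules communicate via ROS. The following is a minimal, hypothetical sketch of how a Scene Narrator node might publish a verbalized scene event that the Planner consumes; the topic name `/scene_events`, the plain-string payload, and the node name are illustrative assumptions, not the project's actual interfaces.

```python
# Minimal sketch (assumptions: the topic "/scene_events", the string payload,
# and the node name are illustrative; the real message types are not shown here).
import rospy
from std_msgs.msg import String


def narrate_event(pub: rospy.Publisher, text: str) -> None:
    """Scene Narrator side: publish a verbalized scene event."""
    pub.publish(String(data=text))


def on_event(msg: String) -> None:
    """Planner side: receive an event and, e.g., trigger an LLM query."""
    rospy.loginfo("Planner received event: %s", msg.data)


if __name__ == "__main__":
    rospy.init_node("scene_narrator_demo")
    pub = rospy.Publisher("/scene_events", String, queue_size=10)
    rospy.Subscriber("/scene_events", String, on_event)
    rospy.sleep(1.0)  # give the pub/sub connection time to establish
    narrate_event(pub, "Felix speaks to Daniel: 'Please hand me the red glass'.")
    rospy.spin()
```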

Interaction Flow

The interaction typically begins with a person's speech. For instance, the "Scene Narrator" detects "Felix speaks to Daniel: 'Please hand me the red glass'." This event is translated into natural language and relayed to the "Planner" module, which initiates a GPT query. Simultaneously, the "Planner" informs the "Expresser" to produce an immediate rule-based response: the robot looks at Felix while its ears and lid roll back, simulating a listening gesture. Approximately 2 seconds later, GPT responds by invoking the get_persons() and get_objects() functions to identify the people and objects present. The resulting data, including "Felix," "Daniel," and object details, are sent back to GPT for further analysis.

While waiting for GPT's next response, the robot shows a 'thinking' gesture, looking from side to side with blinking lid movements. Shortly after, the LLM calls check_hindering_reasons() to assess whether Daniel can see and reach the red glass and whether he is busy. Concurrently, facial_expression() is activated so that the robot looks towards Daniel. The outcome indicates that Daniel can hand over the glass, so the robot, following its pre-defined guidance, opts not to intervene and silently displays its reasoning on the GUI.

Subsequently, Felix asks Daniel to pour cola into the glass. The robot, attentive to their conversation, deduces via check_hindering_reasons() that Daniel is occupied with a phone call and learns from is_person_busy_or_idle() that Felix is holding the glass. The robot then opts to pour cola from the bottle into Felix's glass. Should Felix not be holding the glass, or should it be beyond the robot's reach, the robot instead places the bottle near Felix. Directed by the LLM, the robot's head tracks the bottle during pickup and shifts to the glass while pouring. Upon completion, the robot nods towards Felix and announces, "I've poured Coca-Cola into your glass as Daniel is currently busy."

The interaction flow. The blue squares are the actions generated by the LLM; the grey ones are rule-based functions.
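The flow above hinges on the LLM calling scene-query functions such as get_persons(), get_objects(), and check_hindering_reasons(). The sketch below illustrates, under stated assumptions, what such a tool-calling loop could look like with the OpenAI chat-completions API; the tool schemas, the stubbed function bodies, the model name, and the system prompt are illustrative and are not taken from the paper.

```python
# Hedged sketch of a tool-calling loop like the one described above. The function
# names come from the text; their bodies, the schemas, and the model are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def get_persons():
    return ["Felix", "Daniel"]

def get_objects():
    return ["red glass", "cola bottle"]

def check_hindering_reasons(person, obj):
    # Placeholder logic: the real system would query the scene representation.
    return {"can_see": True, "can_reach": True, "busy": person == "Daniel"}

TOOLS = [
    {"type": "function", "function": {"name": "get_persons",
        "description": "List persons in the scene.",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {"name": "get_objects",
        "description": "List objects in the scene.",
        "parameters": {"type": "object", "properties": {}}}},
    {"type": "function", "function": {"name": "check_hindering_reasons",
        "description": "Check whether a person is hindered from using an object.",
        "parameters": {"type": "object", "properties": {
            "person": {"type": "string"}, "obj": {"type": "string"}},
            "required": ["person", "obj"]}}},
]
IMPL = {"get_persons": get_persons, "get_objects": get_objects,
        "check_hindering_reasons": check_hindering_reasons}

messages = [
    {"role": "system", "content": "Support the group only when help is needed; otherwise stay silent."},
    {"role": "user", "content": "Felix speaks to Daniel: 'Please hand me the red glass'."},
]
while True:
    response = client.chat.completions.create(model="gpt-4", messages=messages, tools=TOOLS)
    msg = response.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        print(msg.content)  # final decision, e.g. stay silent or announce an action
        break
    for call in msg.tool_calls:
        # Execute the requested scene function and feed the result back to the LLM.
        result = IMPL[call.function.name](**json.loads(call.function.arguments))
        messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
```

In such a loop, the decision "to help or not to help" emerges from the LLM's final text response after it has gathered the scene facts via the tool calls, which matches the behavior walked through in the interaction flow.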

Prompts