We introduce a textual representation that combines speech and gaze behavior to prompt an LLM during Human-Robot Interaction. This multimodal yet compact representation, enhanced with the possibility for the LLM to query the current scene content, enables situated reasoning about the user's underspecified or ambiguous requests, relating speech semantics to the communicative intentions underlying gaze. In our approach, we jointly collect the transcription of the user's request and the gaze history as a scanpath over the scene, yielding a sequence of fixated semantic Areas-Of-Interest along with their dwell times. This input is fed to the LLM with a tailored system prompt that provides guidelines for reasoning about gaze while remaining agnostic to the task and scenario. The system can decide to further ground the request by inspecting the scene. We show that this representation not only resolves demonstrative disambiguation (e.g., 'give me that'), a typical problem in situated interaction, but also accurately infers the user's intent by leveraging gaze to ground ambiguous speech and by focusing on speech content to discard spurious fixations.
Example of SemanticScanpath-based interaction
We evaluated the proposed representation in two scenarios (a breakfast table and a drinks table), where multiple users each made three requests to the robot, hinting at the desired objects with their gaze (here, head orientation).
| Scenario/Task | Abstract task | User's request | Goal inference | Target objects |
| --- | --- | --- | --- | --- |
| Breakfast/T1 | Infer task | "Can you help me with this?" | Pour cereal in the bowl | Cereal box, bowl |
| Breakfast/T2 | Disambiguate object instance | "Could you pass me that bottle?" | Pass the milk bottle for the cereals | Milk bottle |
| Breakfast/T3 | Infer content | "Can I also have some sugar?" | Get the small bowl | Small bowl |
| Drink/T1 | Infer user's preference | "I'm thirsty, can I have something to drink?" | Preference for the cola | Cola bottle |
| Drink/T2 | Disambiguate object instance | "Could you use this glass?" | Use the glass in front of the user | Red glass |
| Drink/T3 | Infer content | "I'd like some ice cubes with it" | Get the bowl with the ice | Bowl |
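To make the input format concrete, the sketch below shows how a turn such as Breakfast/T2 could be serialized into the combined speech-and-gaze text passed to the LLM. The helper name, object labels, dwell times, and exact layout are illustrative assumptions, not the system's actual serialization.

```python
# Hypothetical sketch: serializing one user turn (speech + gaze history)
# into a single textual observation. Object labels, dwell times, and the
# exact formatting are assumptions for illustration only.

def format_semantic_scanpath(speech: str, gaze_segments: list[dict]) -> str:
    """Combine the speech transcription and the gaze scanpath into one text block.

    Each gaze segment lists the semantic Areas-Of-Interest (AOIs) the user
    likely fixated, ordered by decreasing likelihood, plus the dwell time in
    seconds, mirroring the gaze-history format described in the system prompt.
    """
    lines = [f'Speech input: "{speech}"', "Gaze history:"]
    for i, segment in enumerate(gaze_segments, start=1):
        aois = ", ".join(segment["objects"])  # most likely AOI first
        lines.append(f"  segment {i}: [{aois}] for {segment['duration']:.1f} s")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example turn for Breakfast/T2: the demonstrative "that bottle" is
    # resolved by the long fixation on the milk bottle, while the short
    # fixation on the bowl is likely spurious.
    print(format_semantic_scanpath(
        "Could you pass me that bottle?",
        [
            {"objects": ["milk_bottle", "juice_bottle"], "duration": 1.6},
            {"objects": ["bowl"], "duration": 0.3},
            {"objects": ["the_robot"], "duration": 0.9},
        ],
    ))
```

This observation is sent together with the system prompt below, which tells the model how to weigh the gaze segments against the speech content.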
"""
You are a friendly and attentive service agent.
You control a physical robot called 'the_robot' and receive requests from the user.
You have access to functions for gathering information, acting physically, and speaking out loud.
You receive two types of inputs from the user:
Speech input: The user will verbally ask for help.
Gaze history: This is divided into segments, each showing the objects the user likely focused on while uttering the speech input and the duration of that focused period (seconds).
Some segments may include multiple objects ordered by decreasing likelihood (closer objects are mixed).
IMPORTANT: Obey the following rules:
1. Always start by gathering all available information related to the request from the scene and the input.
2. Always focus on understanding the user's intent based on context, speech input, and gaze history. Use gaze to clarify speech when requests are ambiguous. Use speech to clarify gaze when requests are ambiguous.
3. Provide a reason for every response to user requests using the 'reasoning' function to explain decisions. Be concise and clear.
4. Speak out loud using the 'speak' function to communicate clearly and concisely with the user.
5. If you are not sure about the user's intent, ask for clarification.
6. Provide the 'required_objects' for every user request.
REMEMBER YOUR RULES!!
TIPS FOR INTERPRETING GAZE:
1. Referred objects are usually gazed at ahead of the utterance, but also right before looking at you.
2. Intentionally referred objects are usually looked at longer and more frequently.
3. Spurious fixations are usually short and mixed with closer objects.
"""