
Crossmodal Gesture Input

Crossmodal gestures are compound gestures that require multiple gesture inputs from more than one input mode. A good example is a gesture that combines voice commands with motion input to create a single gesture event. Crossmodal gestures are an excellent example of complex gesture sequences. In most use cases crossmodal gestures are best used as delimiting methods for context-sensitive, high-fidelity input. For example, voice commands can be confused with application controls in a noisy environment; if voice commands are only processed while the user is looking at a target, the chance of accidental triggering is reduced. Similarly, on a webpage that presents multiple videos, eye or head tracking can identify which video the user wants to play when the generic “play” command is uttered, without the need for a longer phrase that explicitly references the item. This type of multimodal gesture qualification aligns well with existing “natural” user behaviors and can be very effective at using subtle cues to establish user intent.
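
As an illustration, the sketch below shows how a generic “play” utterance might be routed to whichever video the user was last gazing at, and dropped otherwise. It is a minimal Python sketch, not part of any GML specification; the GazeQualifiedVoice class, its event types, and the 0.5 s qualification window are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GazeSample:
    target_id: Optional[str]   # id of the object currently under the user's gaze, or None
    timestamp: float

@dataclass
class VoiceCommand:
    phrase: str                # e.g. "play"
    timestamp: float

class GazeQualifiedVoice:
    """Only dispatch a voice command if the user was gazing at a target
    within `gaze_window` seconds of the utterance."""

    def __init__(self, gaze_window: float = 0.5):
        self.gaze_window = gaze_window
        self.last_gaze: Optional[GazeSample] = None

    def on_gaze(self, sample: GazeSample) -> None:
        if sample.target_id is not None:
            self.last_gaze = sample

    def on_voice(self, command: VoiceCommand):
        gaze = self.last_gaze
        if gaze and abs(command.timestamp - gaze.timestamp) <= self.gaze_window:
            # Gaze context qualifies the voice command: route "play" to the
            # specific video the user was looking at.
            return (command.phrase, gaze.target_id)
        # No recent gaze target: treat the utterance as unqualified and ignore it.
        return None
```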

Crossmodal Context Fusion

At their core, gestures attempt to convey user intent. GML provides criteria to describe user actions and intended outcomes, along with a clear method for programming a system of interactions. Multimodal and crossmodal gesture input allows multiple input threads (from different devices) to be woven into a single gesture description. When multimodal input is cross-qualified in this manner, the modal context can be selectively associated.

Fusing features from separate modes of operation into a single context is a powerful method for delimiting actions. For example, the following modal combinations can significantly increase gesture confidence and reduce false recognition events (a sketch of one combination follows the list):

  • gaze_plus_voice_command
  • gaze_plus_trigger_selection
  • gaze_plus_pinch_selection
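
A minimal sketch of how one of these combinations, gaze_plus_pinch_selection, might be cross-qualified is given below. The event type, time window, and confidence-combination rule are illustrative assumptions, not GML-defined behavior.

```python
from dataclasses import dataclass

@dataclass
class ModalEvent:
    mode: str          # "gaze", "pinch", "voice", "trigger", ...
    target_id: str     # object the event resolves to
    confidence: float  # 0.0 - 1.0 from the underlying recognizer
    timestamp: float

def fuse_gaze_plus_pinch(gaze: ModalEvent, pinch: ModalEvent,
                         window: float = 0.3, threshold: float = 0.8):
    """Cross-qualify a pinch selection with gaze context.

    The fused event only fires when both modes agree on the target within a
    short time window and the combined confidence clears a threshold."""
    if gaze.target_id != pinch.target_id:
        return None                              # modes disagree: no event
    if abs(gaze.timestamp - pinch.timestamp) > window:
        return None                              # not temporally coincident
    combined = 1.0 - (1.0 - gaze.confidence) * (1.0 - pinch.confidence)
    if combined < threshold:
        return None
    return ("gaze_plus_pinch_selection", pinch.target_id, combined)
```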

Position & Vector Stabilization

Crossmodal gestures can also be used to improve the accuracy of object targeting. For example, a crossmodal gesture can be created that will only trigger when a user is pointing at and looking at a target at the same time. When this occurs, the act of gaze targeting can trigger motion stabilization of the fingertip, making the acquisition of small or distant targets easier (see the sketch after the list below).

  • trigger_inertial_selection
  • gaze_plus_trigger_inertial_selection
  • gaze_activated_trigger_inertial_selection
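
One plausible way to implement the stabilization half of these gestures is sketched below in Python. The class name, smoothing factor, and gaze flag are illustrative assumptions: a simple exponential filter on the fingertip position that only engages while gaze dwell is detected.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Vec3:
    x: float
    y: float
    z: float

def lerp(a: Vec3, b: Vec3, t: float) -> Vec3:
    return Vec3(a.x + (b.x - a.x) * t,
                a.y + (b.y - a.y) * t,
                a.z + (b.z - a.z) * t)

class FingertipStabilizer:
    """Exponentially smooth the fingertip position, but only while gaze
    confirms that the user is dwelling on a target."""

    def __init__(self, smoothing: float = 0.15):
        self.smoothing = smoothing                  # lower = heavier smoothing
        self._filtered: Optional[Vec3] = None

    def update(self, raw_tip: Vec3, gaze_on_target: bool) -> Vec3:
        if not gaze_on_target:
            # No gaze qualification: pass raw tracking through so coarse,
            # fast pointing stays responsive.
            self._filtered = raw_tip
            return raw_tip
        if self._filtered is None:
            self._filtered = raw_tip
        # Gaze dwell detected: blend slowly toward new samples to damp
        # jitter while the user refines a small or distant target.
        self._filtered = lerp(self._filtered, raw_tip, self.smoothing)
        return self._filtered
```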

Many subtle interactions require greater tracking sensitivity, or a switch to a different tracking algorithm, for controlled periods of time. Using context cues to qualify tracking mode switching provides a reliable method for automating these transitions (a sketch follows the list). For example:

  • gaze_activated_surface_interaction (allows surface intersection algorithms or custom occlusion resistant modes to be activated)
  • gaze_activated_object_scanning (allows objects of interest to be mapped in high resolution or mesh closing algorithms to be activated)
  • hand_pose_activated_region_scanning (allows a specified region of interest to be mapped in high resolution)
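
The sketch below shows the kind of context-cued mode switching these items describe: gaze and hand-pose cues from one modality select which tracking algorithm the vision pipeline runs. The mode names and the "frame_region" hand pose are hypothetical.

```python
from enum import Enum, auto

class TrackingMode(Enum):
    DEFAULT = auto()
    SURFACE_INTERACTION = auto()   # surface intersection / occlusion-resistant mode
    OBJECT_SCANNING = auto()       # high-resolution mapping / mesh closing
    REGION_SCANNING = auto()       # high-resolution mapping of a region of interest

def select_tracking_mode(gaze_on_surface: bool,
                         gaze_on_object: bool,
                         hand_pose: str) -> TrackingMode:
    """Map context cues from one mode (gaze, hand pose) onto tracking-mode
    switches in another (the depth/vision pipeline)."""
    if hand_pose == "frame_region":        # hypothetical "framing" hand pose
        return TrackingMode.REGION_SCANNING
    if gaze_on_object:
        return TrackingMode.OBJECT_SCANNING
    if gaze_on_surface:
        return TrackingMode.SURFACE_INTERACTION
    return TrackingMode.DEFAULT
```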

Increased Resolution and Global Features

Using complementary features from independent (distinct) input devices provides a method for increased feature resolution, increased tracking rates and more robust occlusion performance through redundancy. Fusing skeletal feature data from multiple different sources can produce a super-skeleton feature set that would be very difficult to achieve with a single input device.

3D motion body, hand & head skeleton fusion

An effective example of modal feature fusion is the fusion of Kinect body skeletal data with Leap Motion hand skeletal data. The Kinect uses structured IR light patterns to track object features at the centimeter scale (with a 2-5 m range), whereas the Leap Motion uses a high-frequency stereo flash IR illumination method to track sub-millimeter features (with a 1 m range). When the two devices are placed with perpendicular illumination directions to reduce optical interference, their optical tracking methods are different enough to allow both to track the same object in 3D space.


Leap Motion (hand and fingers) + Kinect (body, face and hand)

The Leap Motion device can track the hand and fingers, while the Kinect device can track the body and hand. When the two systems are calibrated so that the hand coordinate locations are co-registered, an extended feature map (a fused skeleton) is created. The power of this approach is that the skeletal fusion is computationally inexpensive, yet the added context allows for a significantly extended gesture set (body + hand) with much higher hand pose and gesture confidence rates.
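
A minimal sketch of the co-registration step is shown below (Python with NumPy). The joint dictionaries and the 4x4 calibration matrix T_leap_to_kinect are assumed inputs, coming from each device's SDK and an offline calibration respectively; nothing here is a Kinect or Leap Motion API call.

```python
import numpy as np

def co_register(leap_joints: dict, T_leap_to_kinect: np.ndarray) -> dict:
    """Transform Leap Motion joint positions (hand/finger joints, in the Leap
    frame) into the Kinect world frame using a 4x4 calibration matrix."""
    out = {}
    for name, p in leap_joints.items():
        p_h = np.array([p[0], p[1], p[2], 1.0])               # homogeneous coords
        out[name] = (T_leap_to_kinect @ p_h)[:3]
    return out

def fuse_skeletons(kinect_joints: dict, leap_joints: dict,
                   T_leap_to_kinect: np.ndarray) -> dict:
    """Build a 'super-skeleton': coarse body joints from the Kinect plus
    fine finger joints from the Leap, all in one coordinate frame."""
    fused = dict(kinect_joints)                               # body, face, hand (cm scale)
    fused.update(co_register(leap_joints, T_leap_to_kinect))  # fingers (sub-mm scale)
    return fused
```

In practice the calibration matrix could be estimated offline, for example by recording the same palm position in both devices and solving for the rigid transform between them.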

Wearables and 3D motion hand skeleton fusion

The current generation of 3D motion tracking devices (time-of-flight and structured light) are good at tracking the position of objects of interest, for example fingertips and hands. However, when those objects move quickly or are occluded, their associated features can be difficult to acquire and are established with limited confidence. The first of these features to degrade is the orientation of the object (or the surface normal associated with the tracked surface). When using IMU devices for a similar purpose, the position of a hand (wearing the device) is not as well known as its orientation or velocity. If the data from the IMU device is combined with data from depth-map models of the same hand, the strengths of each method can be used to complement the weaknesses of the other.


Nod ring (IMU) + Leap Motion (depth map, 3D hand motion)

When using a Nod ring in conjunction with a Leap Motion device, full 6DOF can be confidently tracked along with acceleration and velocity. In addition to accurate positioning and tracking, the pose associated with the tracked skeleton features (from the Leap Motion device) can be extended to include a robust set of (mutually exclusive) thumb and index finger micro-gestures. The context-driven fusion of these two devices leads to a wealth of extremely high-fidelity gesture control schemes.
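
One way to realize this complementary fusion is sketched below in Python: position comes from the depth tracker, orientation and velocity from the IMU, with the optical orientation used to slowly pull the IMU estimate back and cancel gyro drift when the optical track is trusted. The function names, confidence floor, and correction gain are illustrative assumptions, not Nod or Leap Motion API calls.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Quat = Tuple[float, float, float, float]   # (w, x, y, z)
Vec3 = Tuple[float, float, float]

def nlerp(a: Quat, b: Quat, t: float) -> Quat:
    """Normalized linear interpolation between two quaternions (a cheap
    stand-in for slerp that is adequate for small correction steps)."""
    dot = sum(x * y for x, y in zip(a, b))
    if dot < 0.0:
        b = tuple(-c for c in b)           # interpolate along the shorter arc
    mixed = tuple(x + (y - x) * t for x, y in zip(a, b))
    norm = math.sqrt(sum(c * c for c in mixed)) or 1.0
    return tuple(c / norm for c in mixed)

@dataclass
class HandState:
    position: Vec3        # from the depth tracker (Leap Motion)
    orientation: Quat     # from the IMU (Nod ring), drift-corrected
    velocity: Vec3        # from the IMU

def fuse_hand_state(leap_position: Vec3, leap_quat: Quat, leap_confidence: float,
                    imu_quat: Quat, imu_velocity: Vec3,
                    confidence_floor: float = 0.6,
                    correction_gain: float = 0.05) -> HandState:
    """Complementary fusion: position from the depth device, orientation and
    velocity from the wearable IMU. When the optical track is confident it
    slowly corrects the IMU orientation toward the optical estimate."""
    orientation = imu_quat
    if leap_confidence >= confidence_floor:
        orientation = nlerp(imu_quat, leap_quat, correction_gain)
    return HandState(leap_position, orientation, imu_velocity)
```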


Basis watch (IMU) + Leap Motion (depth map, 3D hand motion)

Tangibles & Sensors

Using a similar approach to wearables, modular IMU sensors can be placed on 3D tangible objects. If the objects have a known and visible topology they can be tracked by a depth-map camera (RealSense) and recognized as a qualified “mesh”. The tracked position of the object mesh can then be complemented by the IMU sensor. In the example below, an Intel Curie device is used as a wireless motion sensor. When the tracked features are fused, the “lightsaber” has a fully tracked (6DOF) 3D position, orientation and velocity.

Curie (IMU) + RealSense (depth map, 3D Object tracking)

Cooperative methods that analyze shared data between independent devices can reveal a number of subtle cues from the user without the need to create a permanent physical association. For example, one way to detect that a user is holding an object is to check for correlated hand-motion and object-motion metrics, as sketched below.
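
A minimal version of that correlation check might look like the following Python sketch. Here the hand speeds are assumed to come from the depth-camera hand track and the object speeds from the on-object IMU; the window length and correlation threshold are illustrative tuning parameters.

```python
from statistics import mean
from typing import Sequence

def _pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sample windows."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def is_holding(hand_speeds: Sequence[float],
               object_speeds: Sequence[float],
               threshold: float = 0.85) -> bool:
    """Infer that the hand is holding the object when their speed profiles
    over a short window (e.g. the last second of samples) are strongly
    correlated."""
    if len(hand_speeds) < 10 or len(hand_speeds) != len(object_speeds):
        return False
    return _pearson(hand_speeds, object_speeds) >= threshold
```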

Types of Input Fusion
