WO2025166188A1 - System/method for generative body, gesture, and facial expression in 3D characters - Google Patents
Info
- Publication number
- WO2025166188A1 (PCT/US2025/014070)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- user
- response
- sentiment
- visual
- behavior
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/012—Head tracking input arrangements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/70—Multimodal biometrics, e.g. combining information from different biometric modalities
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/01—Indexing scheme relating to G06F3/01
- G06F2203/011—Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/038—Indexing scheme relating to G06F3/038
- G06F2203/0381—Multimodal input, i.e. interface arrangements enabling the user to issue commands by simultaneous use of input devices of different nature, e.g. voice plus gesture on digitizer
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2215/00—Indexing scheme for image rendering
- G06T2215/16—Using real world measurements to influence rendering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Definitions
- the following disclosure relates generally to embodiments of a method of enabling a more lifelike and intuitive interaction between humans and machines, including computers and robotics.
- the following disclosure is also directed to embodiments of a device for executing the disclosed method.
- aspects of the disclosure are directed to embodiments of a method for generating a response from a virtual character in response to a user input.
- the method includes collecting multi-modal data pertaining to user behavior, wherein the multi-modal data includes auditory, visual, and textual information.
- the method further includes extracting one or more landmark features from the multi-modal data, wherein the one or more landmark features pertain to the user’s face, body, and hands.
- the method includes fusing the extracted landmark features into a data set and analyzing the data set to behaviorally categorize the user’s facial expressions, body language, and hand gestures.
- the method further includes outputting a behavior categorization for the user, obtaining contextual information, and determining a sentiment classification of the user based on the behavior categorization and the contextual information.
- the method further includes generating a vector output indicating the predicted user sentiment and integrating the vector output with contextual information.
- the method further includes generating an emotional vector based on the predicted user sentiment, behavior categorization, and contextual information.
- the method further includes generating motion data to correspond with the emotional vector and mapping the motion data to the virtual character to generate a visual response by the virtual character to the user.
- the method further includes generating an auditory response for the virtual character, and outputting the visual and auditory responses to the user, wherein the visual response is consistent with a current emotional state of the user.
- the contextual information includes at least one of user location, time of day, and number of users present.
- the auditory response includes a non-verbal response.
- the audio response includes a verbal response.
- generating the verbal response further includes generating a verbal transcript based on the user input, adding a voice inflection based on the behavior categorization, sentiment, and emotion prediction, and outputting the verbal response in conjunction with the visual response.
- aspects of the disclosure are directed to embodiments of a method of user interaction with a virtual human in a commercial setting.
- the method includes issuing a greeting to the user, prompting the user input, obtaining visual, auditory, and textual data from the user, analyzing the user data to categorize the user behavior, predicting user sentiment and emotion, generating a visual response based on the user input, the category of user behavior and the predicted sentiment and emotion of the user, generating an auditory response that corresponds to the visual response, outputting a contextually proper visual response that is consistent with a current emotional state of the user, and outputting the auditory response.
- the user data further includes contextual information.
- the contextual information comprises at least one of user location, time of day, and number of users present.
- the auditory response includes a non-verbal response.
- the audio response includes a verbal response.
- the generating of the verbal response further includes generating a verbal transcript based on the user input, adding a voice inflection based on the behavior categorization, sentiment, and emotion prediction, and outputting the verbal response in conjunction with the visual response.
- aspects of the disclosure are directed to embodiments of a retail kiosk that includes an input device.
- the input device includes a visual display, one or more cameras, and one or more microphones.
- the retail kiosk includes one or more processors and memory units configured to analyze user information obtained by the input device, categorize the behavior of the user, predict a sentiment of the user, generate a visual response based on the obtained user information, the category of user behavior and the predicted sentiment of the user, generate an auditory response that corresponds to the visual response, output a contextually proper response on the visual display that is consistent with a current emotional state of the user, and output the auditory response that corresponds to the visual response.
- the user information further includes contextual information.
- the contextual information includes at least one of user location, time of day, and number of users present.
- the auditory response includes a non-verbal response.
- the auditory response is a verbal response.
- the one or more processors are further configured to generate a verbal transcript based on the user input, add a voice inflection based on the behavior categorization, sentiment, and emotion prediction, and output the verbal response in conjunction with the visual response.
- Fig. 1 schematically illustrates an embodiment of a system for generative body, gesture, and facial expression in 3D characters.
- Fig. 2 illustrates an embodiment of an input device as may be used with the system of Fig. 1.
- Fig. 3 schematically illustrates a portion of an embodiment of a behavior categorization system/method that is part of the system of Fig. 1.
- Fig. 4 schematically illustrates another portion of the embodiment of the behavior categorization system/method that is part of the system of Fig. 1.
- Fig. 5 schematically illustrates another portion of the embodiment of the behavior categorization system/method that is part of the system of Fig. 1.
- Fig. 6 schematically illustrates a portion of an embodiment of a sentiment prediction/emotional prediction system/method that is part of the system of Fig. 1.
- Fig. 7 schematically illustrates another portion of the embodiment of the sentiment prediction/emotional prediction system/method that is part of the system of Fig. 1.
- Fig. 8 schematically illustrates an embodiment of a motion generation model/system that is part of the system of Fig. 1.
- Fig. 9 schematically illustrates another portion of the embodiment of the motion generation system/method that is part of the system of Fig. 1.
- Fig. 10 schematically illustrates another portion of the embodiment of the motion generation system/method that is part of the system of Fig. 1.
- Fig. 11 mathematically illustrates the systems/methods of Figs. 3-10.
- Fig. 12 schematically illustrates an embodiment of a system/method for capturing contextual information to be used in one or more of the systems/methods of Figs. 3-10.
- Fig. 13 schematically illustrates an embodiment of a large language model used as part of the system of Fig. 1.
- Fig. 14 schematically illustrates an embodiment of a speech generation component used as part of the system of Fig. 1.
- Fig. 15 schematically illustrates an embodiment of a method of pre-training the system of Fig. 1.
- Fig. 16 schematically illustrates an embodiment of a method of pre-training the system of Fig. 1.
- Fig. 17 schematically illustrates a method of user interaction with the system of Fig. 1 when used in a retail or commercial setting.
- the disclosed system/method for generative body, gesture, and facial expression in 3D characters 100 uses a multi-modal framework that is configured to process and fuse data pertaining to body language, facial expressions, hand gestures, and spoken language, as well as environmental information including time of day, location, available menu items, and user history.
- the disclosed body language model system 100 integrates advanced computational methodologies into a single system through a stacking ensemble approach.
- the advanced computational methodologies are configured to fuse spatio-temporal data with contextual data to perform behavior categorization 200, sentiment prediction/emotional prediction 300, and motion generation 400.
- the term “data” may refer to one or more types of data including spatial data, temporal data, and/or any other type of data acquired by and/or used by the disclosed system 100.
- the body language model further comprises a meta-layer of additional contextual information to ensure that user inputs are recognized and interpreted in the context of the user's environment and emotional state.
- Fig. 1 schematically represents an interaction between the multiple components of the system 100.
- user information is obtained from an input device 10.
- the input device 10 may be part of a kiosk or mobile device 20 comprising one or more processing units and memory storage units.
- the input device 10 may be configured to transmit information using Internet Protocol.
- the input device 10 includes one or more cameras 12 that are configured to capture facial expressions, body movements, and hand gestures of a user as one or more video streams.
- the input device 10 comprises at least one audio input 14, such as a microphone, that is configured to capture the user’s voice or vocal output.
- the input device 10 includes a touchscreen 16, one or more buttons and/or keys, and/or other user elements that enable the user to provide a manual input.
- the touchscreen 16 comprises a visual display configured to output visual feedback/responses 18 to the user.
- the input device 10 is configured to capture contextual information 500 (Fig. 12) related to the user interaction, such as environmental cues, brand guidelines, and interaction guard rails 502.
- the environmental cues may include, but are not limited to time of day, number of users (single user vs two or more users), and/or location.
- the brand guidelines and interaction guard rails are part of a stored data set 20.
- the data set 20 may be updated periodically according to retail protocols.
- the brand guidelines may include, but are not limited to, product information, marketing tone, and/or influencer data.
- the interaction guardrails may include, but are not limited to, subject matter limits, safety protocols, and/or data collection limits.
- the contextual information is output as textual information 504 that may be included as an input to the behavior categorization 200, sentiment prediction/emotional prediction 300, and motion generation 400 systems/models.
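- As an illustration only (not part of the original disclosure), a minimal sketch of how contextual information such as time of day, location, number of users, brand guidelines, and guardrails might be serialized into a single textual input 504 for the downstream models; the field names and serialization format are assumptions:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContextualInfo:
    # Hypothetical fields; the disclosure lists these only as examples of
    # environmental cues, brand guidelines, and interaction guardrails.
    time_of_day: str
    location: str
    user_count: int
    brand_guidelines: List[str] = field(default_factory=list)
    guardrails: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the context as plain text for the behavior,
        sentiment, and motion models."""
        return (
            f"time_of_day={self.time_of_day}; location={self.location}; "
            f"users_present={self.user_count}; "
            f"brand_guidelines={'; '.join(self.brand_guidelines)}; "
            f"guardrails={'; '.join(self.guardrails)}"
        )

ctx = ContextualInfo("evening", "mall kiosk", 1,
                     ["friendly marketing tone"], ["no personal data collection"])
print(ctx.to_text())
```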
- the behavior categorization 200, sentiment prediction/emotional prediction 300, and motion generation 400 systems/models are further configured to operate in concert with large language models (LLMs) 600 and a sophisticated hardware system encompassing computer vision and audio processing units.
- An embodiment of an LLM 600 is shown in Fig. 13, where user audio is analyzed for vocal tone and linguistics.
- the LLM 600 predicts vocal intent and speech categorical description.
- the output of the LLM 600 is textual and may be included as an input to any or all of the behavior categorization 200, sentiment prediction/emotional prediction 300, and motion generation 400 systems/models.
- At least some of the user data obtained by the input device 10 undergoes behavior categorization 200 to interpret and classify user actions by analyzing the real-time multi-modal data streams obtained from the input device.
- in a first step 202, the information captured by the input device 10, for example by the one or more cameras 12, is processed/analyzed by one or more processing units for facial expressions, body gestures, and hand gestures.
- landmark features are extracted from the captured facial expressions, body gestures, and hand gestures.
- facial landmarks are extracted from the recorded facial expressions and analyzed/tracked.
- the facial landmarks may include eyes, eyebrows and mouth.
- the one or more video streams are processed/analyzed to identify micro-expressions, such as brief, involuntary facial movements that may reveal the true emotions of the user.
- body landmarks such as the position of key body joints (e.g., elbows, shoulders) are extracted and analyzed/tracked to analyze posture and body movement, which provides insight into user engagement or disengagement.
- movement landmarks are extracted, such as the position of certain parts of the hand and/or head, in order to identify patterns such as waving, or nodding, in order to capture dynamic user behavior.
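- For illustration only, a sketch of how the extracted facial, body, and hand landmarks might be represented per video frame as 3D coordinate arrays; the landmark counts, array shapes, and placeholder detector are assumptions rather than the disclosed implementation:

```python
import numpy as np

def extract_landmarks(frame: np.ndarray) -> dict:
    """Stand-in for step 204: a real system would run a vision model that
    returns per-frame landmark coordinates for face, body, and hands.
    Random values substitute for detector output here."""
    return {
        "face":  np.random.rand(68, 3),    # e.g., eyes, eyebrows, mouth points
        "body":  np.random.rand(17, 3),    # e.g., key joints such as elbows, shoulders
        "hands": np.random.rand(2, 21, 3)  # left/right hand keypoints
    }

# Track landmarks over a short clip (at least 8 frames per second per the disclosure).
clip = [np.zeros((480, 640, 3)) for _ in range(8)]
tracked = [extract_landmarks(f) for f in clip]
print(len(tracked), tracked[0]["face"].shape)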
- the multi-modal inputs are fused into two data sets where a first data set captures position and movement of the key landmarks extracted at step 204 (which are tracked at a minimum of 8 frames per second).
- the second data set includes environmental contexts such as identifying when a user engages with a particular system feature or expresses a specific emotion, linking each entry in this dataset to specific behaviors in the first dataset.
- the second data set may further include audio information obtained from the input device 10.
- audio information is obtained along with the video/visual data and then stored.
- a recurrent neural network (RNN) incorporates or integrates temporal dependencies to update latent features based on past and present states.
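- A minimal, illustrative sketch (not the disclosed implementation) of how a recurrent unit could update a latent feature vector from past and present fused landmark features; the weights are random stand-ins and the dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_dim, hidden_dim = 32, 64
W_x = rng.normal(size=(hidden_dim, feature_dim)) * 0.1  # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1   # hidden-to-hidden weights
b = np.zeros(hidden_dim)

def rnn_step(h_prev: np.ndarray, x_t: np.ndarray) -> np.ndarray:
    """Simple Elman-style update: the new latent state depends on the
    previous state (past) and the current fused features (present)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(hidden_dim)
for _ in range(8):                       # one second of frames at 8 fps
    x_t = rng.normal(size=feature_dim)   # fused landmark features for this frame
    h = rnn_step(h, x_t)
print(h.shape)
```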
- Fig. 5 (200c) illustrates that the outputs from step 206 (and 208) are used in steps 210 and 212 for behavioral categorization/prediction.
- behavioral states are categorized (e.g., waving, smiling, frowning, nodding, and many others) using a deep neural network pre-trained to identify specific behaviors and contextual categories.
- the pre-training involves exposing the neural network to large multi-modal datasets that have been pre-tagged, including video, audio, contextual annotations and/or transformations of vector points, to learn how combinations of inputs correspond to predefined behavioral states.
- the deep neural network optimizes its weights to recognize patterns and correlations across different modalities, segmenting and categorizing actions quickly.
- the deep neural network processes the fused multi-modal data from step 206 (and step 208) to extract increasingly abstract data representations related to fundamental features, such as motion trajectories and facial expressions, and combines them into higher-order representations. These representations are then mapped to predefined categories, and a confidence level for each behavioral state is indicated.
- W_c and b_c are the weights and biases used for classification.
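- Purely as an illustration of the classification step described above, a sketch mapping a latent representation h to behavioral categories with confidence levels via softmax(W_c·h + b_c); the category names, dimensions, and random weights are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
categories = ["waving", "smiling", "frowning", "nodding"]  # example behavioral states
hidden_dim = 64

W_c = rng.normal(size=(len(categories), hidden_dim)) * 0.1  # classification weights
b_c = np.zeros(len(categories))                             # classification biases

def categorize(h: np.ndarray) -> dict:
    logits = W_c @ h + b_c
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax -> confidence per category
    return dict(zip(categories, probs.round(3)))

print(categorize(rng.normal(size=hidden_dim)))
```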
- the behavior outputs are generated at step 214 and include one or more of: (i) facial expressions; (ii) categorical descriptions; (iii) body language categorical descriptions; and/or (iv) hand gesture categorical descriptions.
- these behavior outputs may be a text output(s) 216.
- the output from the behavior categorization method 200 is used at step 302 as an input for a sentiment prediction/emotional prediction model/method 300 (300a, 300b).
- the sentiment prediction/emotional prediction 300 processes data from three modalities: text (from steps 214/216); audio information gathered from the audio input 14; and contextual information, which in some embodiments is provided as video input.
- the feature extraction for each modality may be represented as h_text = f_text(x_text), h_audio = f_audio(x_audio), and h_video = f_video(x_video), where x_text, x_audio, and x_video refer to the input data for each modality (text, audio, and video), and f_modality refers to the feature extraction function for each modality.
- the extracted features from each modality are combined or fused into a unified representation through a multi-modal fusion layer, which ensures seamless integration of diverse data sources.
- the inputs are spatially and temporally aligned to maintain the integrity of each modality.
- the system synchronizes audio tones, facial expressions, and body gesture outputs from the behavioral categorization model to create a coherent representation.
- This unified representation enables the system 100 to handle the complex interdependencies between modalities, ensuring that behavioral and contextual nuances are preserved and corresponded with sentiment.
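- For illustration, a hedged sketch of the fusion described above: per-modality features h_text, h_audio, and h_video are extracted and concatenated (one simple fusion choice among many) into a unified representation; the extraction functions are placeholders, not the disclosed encoders:

```python
import numpy as np

rng = np.random.default_rng(2)

def f_text(x_text: str) -> np.ndarray:
    return rng.normal(size=16)    # stand-in for a text encoder

def f_audio(x_audio: np.ndarray) -> np.ndarray:
    return rng.normal(size=16)    # stand-in for an audio (tone/pitch) encoder

def f_video(x_video: np.ndarray) -> np.ndarray:
    return rng.normal(size=16)    # stand-in for a video/behavior encoder

def fuse(x_text, x_audio, x_video) -> np.ndarray:
    h_text, h_audio, h_video = f_text(x_text), f_audio(x_audio), f_video(x_video)
    return np.concatenate([h_text, h_audio, h_video])  # unified representation

h_fused = fuse("smiling, nodding", np.zeros(16000), np.zeros((8, 480, 640, 3)))
print(h_fused.shape)  # (48,)
```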
- the sentiment prediction/emotional prediction model/method 300 creates a foundation for predicting accurate, context-aware insights and is adaptable to real-time interactions.
- the sentiment and classification step evaluates the fused representation and assigns probabilities to various sentiment categories, such as, but not limited to, happiness, frustration, anger, or neutrality, as well as nuanced intentions, such as but not limited to, agreement, hesitation, or curiosity. Advanced neural architectures assess these probabilities using context-sensitive algorithms to ensure that the predictions are accurate and contextually relevant.
- the incorporation of environmental elements (e.g., user location and time of day) and contextual elements (e.g., menus and/or prompts) results in a more comprehensive understanding of user sentiment, bridging the raw data from the fusion process with actionable insights, and enabling the system 100 to interpret complex emotional and intentional states effectively.
- at step 310, user interactions are continuously monitored and dynamic updates to the initial sentiment prediction are made to incorporate real-time feedback and context.
- This real-time adaptation step 310 involves analyzing discrepancies between predicted and observed behaviors and then recalibrating the model weights in real time to improve prediction accuracy.
- the system adapts its outputs to align with evolving behavioral patterns.
- the system 100 employs reinforcement mechanisms, adjusting predictions based on outcomes, and dynamically recalculates probabilities for each sentiment category to maintain contextual relevance and emotional alignment.
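- As a simplified, assumed illustration of the recalculation described above (the exact adjustment rule is not specified in the disclosure), sentiment probabilities can be nudged toward observed outcomes with an exponential update and then renormalized:

```python
import numpy as np

sentiments = ["happiness", "frustration", "anger", "neutrality"]
probs = np.array([0.55, 0.15, 0.05, 0.25])    # initial prediction

def recalibrate(probs: np.ndarray, observed: int, rate: float = 0.2) -> np.ndarray:
    """Shift the distribution toward the sentiment actually observed in
    follow-up behavior, then renormalize; `rate` is a hypothetical gain."""
    target = np.zeros_like(probs)
    target[observed] = 1.0
    updated = (1 - rate) * probs + rate * target
    return updated / updated.sum()

probs = recalibrate(probs, observed=sentiments.index("frustration"))
print(dict(zip(sentiments, probs.round(3))))
```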
- This real-time adaptation system/algorithm offers a comprehensive framework for multi-modal sentiment and intention analysis.
- an emotional prediction is developed and output as a vector 314.
- a motion generation model/system 400 (400a, 400b, 400c) is shown.
- the motion generation model/system 400 processes inputs 314 from the sentiment prediction/emotional prediction model/system and leverages these inputs to generate structured datasets that drive virtual characters in 3D environments or robotics in the real world.
- the vector output 314 from the sentiment prediction/emotional prediction model/system 300 is used as an input for sentiment analysis integration.
- additional multi-modal data streams may be analyzed, such as text, video, and/or audio, to extract an emotional vector.
- the additional multi-modal data includes text output from a large language model analysis of audio inputs, such as the user’s voice, contextual and environmental features, as well as the peripheral inputs, such as the camera 12, the microphone 14, or any other device that collects inputs and engages with a component of the system.
- the sentiment analysis step 402 processes input (e.g., text, speech) by evaluating multi-modal data streams to extract an emotional vector 404.
- the emotional vector 404 represents a range of states, including but not limited to, happiness, sadness, anger, neutrality, frustration, excitement, and curiosity.
- the inputs undergo modality-specific processing where the textual inputs are analyzed for emotionally significant phrases and word patterns, the audio input is analyzed for tone, pitch, and modulation, and the visual input captures facial expressions and gestures.
- the motion generation model/system 400 constructs a comprehensive emotional profile of the user, which enables a nuanced classification of sentiments and intentions to be accomplished.
- the emotional vector 404 is a control signal for the motion generation model/system 400 by encoding nuanced emotional states and inferred intentions derived from sentiment analysis.
- the emotional vector 404 which represents emotional states of the user acts as a dynamic input that directs the motion generation model/system 400 to synthesize contextually accurate and emotionally aligned animations.
- pre-training the motion generation model/system 400 ahead of time on controlled datasets, which include annotated emotional states paired with corresponding motion capture sequences, and then fine-tuning the motion generation model/system 400 further on a more extensive video library of interactions and the historical data of user interactions, enables the motion generation model/system 400 to improve the interpretation of the emotional vector 404.
- the pre-training and fine-tuning pipeline enables the motion generation model/system 400 to learn the intricate relationships between emotional cues and physical motion, such as associating a cheerful state with relaxed gestures and upward head tilts, or linking frustration with tense postures and rapid hand movements.
- the motion generation model/system 400 ensures that the generated outputs — whether for virtual characters or robotic systems — align precisely with the user's emotional tone and situational context, delivering lifelike and responsive performances.
- a diffusion probabilistic model generates high-dimensional motion sequences conditioned on the emotional vector 404.
- the diffusion probabilistic model excels at producing coherent and high-resolution outputs by iteratively refining noisy inputs through a reverse diffusion process.
- the motion generation model/system 400 ensures that the generated motion aligns seamlessly with the user's emotional state and context. Accordingly, the motion generation model/system 400 captures the variability inherent in human motion, allowing the productions of diverse and realistic sequences.
- the motion generation model/system 400 is pre-trained on annotated datasets of motion capture sequences paired with emotional labels and further enhances the ability to interpret and to respond to the control signal provided by the emotional vector 404.
- the diffusion process learns to iteratively denoise random motion trajectories to reconstruct realistic, contextually appropriate animations.
- the diffusion process comprises a forward diffusion process at 406 where noise is gradually added to structured motion data x_(t-1), which effectively transforms the structured motion data into random noise x_T over a series of timesteps.
- x_T is the initial random noise vector sampled from a normal distribution and represents the generation process's starting point;
- y is the emotional vector derived from the sentiment analysis and serves as the generation process's conditioning input.
- x_t is the synthesized motion state at a specific timestep t during the reverse diffusion process;
- ε_θ is the learned noise prediction function parameterized by a neural network, responsible for predicting noise components during denoising;
- β_t is a predefined noise schedule controlling the amount of noise added at each timestep.
- the forward diffusion may be represented as follows: q(x_t | x_(t-1)) = N(x_t; √(1 − β_t) · x_(t-1), β_t · I), where N denotes a normal distribution.
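- For illustration only, a sketch of the forward diffusion step written to match the equation above: structured motion data is progressively noised under a predefined schedule β_t; the schedule values, number of timesteps, and data dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50                                    # assumed number of diffusion timesteps
betas = np.linspace(1e-4, 0.05, T)        # predefined noise schedule beta_t

def forward_diffusion(x0: np.ndarray) -> list:
    """q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    xs = [x0]
    for t in range(T):
        mean = np.sqrt(1.0 - betas[t]) * xs[-1]
        noise = rng.normal(size=x0.shape)
        xs.append(mean + np.sqrt(betas[t]) * noise)
    return xs

x0 = rng.normal(size=(24, 3))             # structured motion data (e.g., joint positions)
trajectory = forward_diffusion(x0)
print(len(trajectory), trajectory[-1].shape)   # ends near pure noise x_T
```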
- the diffusion process further comprises a reverse diffusion process 410 that learns to remove this noise step-by-step, thereby refining the motion data into coherent and expressive sequences.
- the reverse diffusion process 410 reconstructs structured motion x_0 from the noise x_T, thereby iteratively removing noise at each timestep t.
- ε_θ(x_t, t, y) is the predicted noise at timestep t conditioned on y, which aligns generated motion with the emotional context.
- z ~ N(0, I) is random noise drawn from a standard normal distribution.
- the reverse diffusion step 410 synthesizes motion trajectories that correspond to the emotional vector y.
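- A hedged sketch of the reverse diffusion loop: starting from noise x_T, a placeholder noise predictor ε_θ(x_t, t, y) conditioned on the emotional vector y is used to denoise step by step. The predictor is a stub (a real model would be a trained network), and the update rule is simplified relative to the exact DDPM posterior:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 50
betas = np.linspace(1e-4, 0.05, T)

def eps_theta(x_t: np.ndarray, t: int, y: np.ndarray) -> np.ndarray:
    """Stub for the learned noise predictor conditioned on the emotional vector y."""
    return 0.1 * x_t + 0.01 * y.mean()

def reverse_diffusion(x_T: np.ndarray, y: np.ndarray) -> np.ndarray:
    x_t = x_T
    for t in reversed(range(T)):
        predicted_noise = eps_theta(x_t, t, y)
        # Simplified denoising update (the exact DDPM posterior uses cumulative
        # products of (1 - beta)); shown only to convey the iterative structure.
        x_t = (x_t - np.sqrt(betas[t]) * predicted_noise) / np.sqrt(1.0 - betas[t])
        if t > 0:                                  # add noise z ~ N(0, I) except at t = 0
            x_t = x_t + np.sqrt(betas[t]) * rng.normal(size=x_t.shape)
    return x_t                                     # reconstructed structured motion x_0

y = np.array([0.8, 0.1, 0.0, 0.1])                 # emotional vector (e.g., mostly happy)
motion = reverse_diffusion(rng.normal(size=(24, 3)), y)
print(motion.shape)
```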
- the diffusion process enables the motion generation model/system 400 to decompose motion data into a latent space at 408, which is amenable to stochastic sampling.
- the motion generation model/system 400 is fine-tuned with extensive video data and historical user interactions to further enhance its ability to interpret the emotional vector 404 and generate context-specific outputs. By leveraging these additional data, the motion generation model/system 400 adapts to a broader range of scenarios and user behaviors, which improves the accuracy and expressiveness of the generated animations in both virtual environments and robotic applications.
- the motion generation model/system 400 embeds the emotional vector 404 into a latent representation, which serves as a condition for the motion generation model/system 400. Accordingly, to further enforce alignment between motion generation and the emotional context, conditional sampling integrates the emotional vector 404 (y) directly into the generation process.
- the conditional sampling uses a neural network-based encoder-decoder architecture:
- x_(t-1) = Decoder(Encoder(y) + x_t)
- the Decoder synthesizes motion x_(t-1) using this latent representation and the current motion state x_t.
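- An illustrative sketch (with random stand-in weights) of the conditional sampling expression x_(t-1) = Decoder(Encoder(y) + x_t), in which the emotional vector y is embedded into the latent space and combined with the current motion state; the dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
emotion_dim, motion_dim = 4, 72           # assumed sizes (e.g., 24 joints x 3)

W_enc = rng.normal(size=(motion_dim, emotion_dim)) * 0.1   # encoder weights
W_dec = rng.normal(size=(motion_dim, motion_dim)) * 0.1    # decoder weights

def encoder(y: np.ndarray) -> np.ndarray:
    return np.tanh(W_enc @ y)             # embed the emotional vector into latent space

def decoder(latent: np.ndarray) -> np.ndarray:
    return W_dec @ latent                 # synthesize the next motion state

def conditional_step(x_t: np.ndarray, y: np.ndarray) -> np.ndarray:
    return decoder(encoder(y) + x_t)      # x_{t-1} = Decoder(Encoder(y) + x_t)

x_t = rng.normal(size=motion_dim)
y = np.array([0.7, 0.1, 0.1, 0.1])
print(conditional_step(x_t, y).shape)
```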
- the generated motion data is mapped onto the virtual human character's (or robotic) skeletal structure by translating the structured dataset into precise joint positions and rotations, and facial expressions. Accordingly, facial expressions and body gestures are synthesized in parallel and integrated with body motion.
- the system 100 ensures that the virtual character exhibits synchronized facial expressions, gestures, and body language that match the sentiment in real-time. Moreover, the system 100 ensures that each data point, such as a hand gesture or head tilt, accurately aligns with the character's kinematic hierarchy.
- the system is compatible with most mainstream virtual character creation applications, including industry-standard animation and gaming engines.
- the system is further compatible with most mainstream robotics controllers.
- Such compatibility allows the system 100 to deliver lifelike performances with smooth transitions and consistent emotional expression, which ensures a seamless integration into diverse platforms and use cases.
- the synthesized animation A is as follows: A = f(J, F, G), where:
- J are joint rotations and positions for skeletal motion
- F are facial expression parameters for emotive display
- G are gesture parameters for communicative motion
- A is the complete animation sequence integrating J, F, and G.
- the function f dynamically combines these components to generate lifelike animations.
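- For illustration, a sketch of how a function f might assemble joint rotations J, facial expression parameters F, and gesture parameters G into one per-frame animation structure A; the keys, parameter counts, and frame layout are assumptions:

```python
import numpy as np

def assemble_animation(J: np.ndarray, F: np.ndarray, G: np.ndarray) -> list:
    """A = f(J, F, G): combine skeletal, facial, and gesture channels
    into a frame-by-frame animation sequence."""
    assert J.shape[0] == F.shape[0] == G.shape[0]       # same number of frames
    return [
        {"joints": J[t], "face": F[t], "gesture": G[t], "frame": t}
        for t in range(J.shape[0])
    ]

frames = 8
J = np.zeros((frames, 24, 3))   # joint rotations/positions
F = np.zeros((frames, 52))      # facial expression parameters (e.g., blendshape weights)
G = np.zeros((frames, 16))      # gesture parameters
A = assemble_animation(J, F, G)
print(len(A), A[0].keys())
```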
- the motion is output either via a virtual character or via a robotic assembly.
- a speech generation component 700 receives an output from the LLM 600 and generates a transcript of a response to the user input, including a voice inflection. The transcript and inflection are used to generate an audio response to accompany the motion output at 416.
- the system 100 handles multi-modal data capture by simulating diverse user scenarios to record interactions across multiple modalities. The system 100 collects high-resolution video data as well as spatially tracked vectors to track facial expressions, gestures, and full-body movements, capturing every detail of non-verbal communication in 3D Euclidean coordinates.
- audio inputs record speech patterns, including tone, pitch, and rhythm, while textual inputs capture real-time interactions such as commands or inquiries.
- these outputs are structured as synchronized datasets that include spatial positioning, temporal markers, and environmental contexts, forming a robust foundation for downstream analysis and model training.
- the initial dataset undergoes a series of structured preprocessing steps configured to make the data interpretable for the Behavior Analysis Model.
- these steps begin with segmenting raw multi-modal inputs into discrete motion sequences and grouping them by spatial and temporal characteristics. Each motion is analyzed and categorized based on its associated body part (e.g., hand, arm, head) and its relationship within the kinematic hierarchy of the human body, ensuring the preservation of parent-child linkages (e.g., shoulder-to-arm or hip-to-leg).
- Temporal synchronization aligns data streams from different modalities, such as video, audio, and textual inputs, so that corresponding events occur cohesively across the dataset.
- Normalization processes standardize spatial data by recalibrating coordinates into a unified Euclidean space, eliminating inconsistencies caused by varying capture conditions or sensor inaccuracies.
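- A simplified, assumed illustration of this normalization step: landmark coordinates are recentered on a root joint and rescaled by a reference segment length so that captures from different sensors and conditions share one Euclidean space; the joint indices and reference segment are hypothetical:

```python
import numpy as np

def normalize_pose(joints: np.ndarray, root: int = 0,
                   ref_a: int = 5, ref_b: int = 6) -> np.ndarray:
    """Recenter joints on the root (e.g., pelvis) and rescale by a
    reference segment length (e.g., shoulder-to-shoulder) so that data
    from different capture conditions is directly comparable."""
    centered = joints - joints[root]                        # root-relative coordinates
    scale = np.linalg.norm(centered[ref_a] - centered[ref_b]) + 1e-8
    return centered / scale                                 # unit reference length

raw = np.random.rand(17, 3) * 2.5        # joints in arbitrary sensor units
print(normalize_pose(raw).round(3)[:3])
```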
- the processing and structuring of the dataset enables the Behavior Analysis Model to extract meaningful features, segment individual motions or motion groups, and interpret them within the context of user behaviors and environmental cues.
- Behavioral Categorization 200: As previously discussed, the system 100 processes structured, multi-modal input data to predict behavioral states with high granularity. Input data includes synchronized and normalized motion sequences, spatial coordinates of joints, facial landmarks, and segmented gestures. The output of this process is a set of categorized behaviors, segmented by motion group and linked to their respective kinematic hierarchy, such as "waving" associated with hand and arm movements. These predictions provide the foundational insights required for subsequent sentiment analysis and motion synthesis.
- Sentiment Prediction/Emotional Prediction 300: As previously discussed, the system 100 processes multi-modal data to classify emotional states with precision and nuance.
- Input data consists of synchronized and normalized outputs from the Behavior Categorization Model 200, such as facial expressions, gestures, vocal tones, and contextual metadata.
- the system integrates these inputs into a cohesive emotional profile by leveraging cross-attention mechanisms and feature extraction pipelines.
- the output includes a detailed classification of emotional and sentimental states (e.g., happiness, sadness, frustration, neutrality) and inferred intentions (e.g., agreement, hesitation, or curiosity).
- Motion Generation 400: Synthesizes real-time animations by translating input data, including emotional vectors and segmented behaviors, into lifelike motion sequences.
- Input data includes structured outputs from the sentiment prediction/emotional prediction model 300, such as emotional profiles and contextual metadata, combined with data like joint coordinates and motion trajectories.
- This comprehensive input enables the Motion Generation Model 400 to map emotional cues to physical movements.
- the output comprises high-resolution, temporally synchronized animations integrating gestures, facial expressions, and body movements. These outputs are compatible with virtual characters and robotics, ensuring seamless, contextually accurate performances in real-time applications.
- Output Layer: Delivers animations for virtual characters and robotics by transforming high-resolution motion data into applicable formats for diverse platforms.
- Input data consists of structured datasets that include joint positions, motion trajectories, facial expressions, and motional vectors, all aligned with 3D Euclidean coordinates.
- the output is a seamlessly integrated animation or motion instructions compatible with virtual environments and robotic systems, rendering lifelike movements and synchronized gestures in real-time for virtual characters.
- the data is converted into commands for actuators and motors, enabling precise and contextually relevant physical movements.
- Figs. 15 and 16 illustrate an embodiment of data acquisition 1500 for pre-training the system/model 100.
- performers such as test users, wear a tracking suit that relies on internal tracking sensors and/or external cameras to map the motions of the performers.
- the plurality of tracking sensors track position, space, and motion of the performer’s body to create data for model training.
- the body sensors are spatially tracked over time and associated with facial expressions, hand gestures, and body language recognition.
- the results are tagged data groups with cross-associations to each other with a range of expressions, emotions, and contextual cues.
- Referring to the illustrated example, the system/model 100 assesses the current state of emotion, the history of motion and emotion, and the environmental context before generating a proper, emotionally relevant response.
- Referring to Fig. 17, an example of a method of user interaction with the previously described system 100 integrated into a retail setting is shown.
- the user starts the interaction. Starting the interaction may require the user to administer an input, such as via a touchscreen or an auditory input.
- a greeting is offered to the user in response to the user starting the interaction.
- additional user input is requested and may be provided at 1708. If no input is provided 1710, then an idle loop is created at 1712, which may continue for a predetermined period of time before resetting back to 1702. If user input is provided at 1708, then the behavior categorization/analysis model/system 200 is initiated at 1714, along with the collection of user data/information at 1716.
- the sentiment prediction/emotional prediction model/system 300 and motion generation system/ model 400 are initiated.
- an AI response is generated as a result of steps 1714-1720 and output to the user.
- the user is prompted for an additional input, and the system then waits for the prompted response at 1706.
- the system 100 keeps analyzing the user data/information in conjunction with the user input in order to generate contextually and emotionally appropriate responses.
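- To make the interaction flow of Fig. 17 concrete, a hedged sketch of the loop (greeting, prompt, idle timeout, behavior categorization, sentiment prediction, motion generation, and response). All function names, return values, and the timeout length are hypothetical stand-ins, not the disclosed implementation:

```python
import time

IDLE_TIMEOUT_S = 30          # assumed predetermined idle period

# --- stand-in component functions (hypothetical names, not from the disclosure) ---
def greet(): print("Hello! How can I help you today?")             # 1704: greeting
def wait_for_input(): return "I'd like to see the menu, please."   # 1706/1708: user input
def collect_user_data(): return {"video": ..., "audio": ..., "context": ...}  # 1716
def categorize_behavior(data): return "smiling"                    # 1714: model 200
def predict_sentiment(behavior, data): return "happiness"          # 1718: model 300
def generate_motion(sentiment, behavior): return f"warm gesture for {sentiment}"  # 1720: model 400
def generate_speech(text, sentiment): return f"[{sentiment} tone] Here is our menu."
def render(motion, speech): print(motion, "|", speech)             # 1722: AI response output

def run_kiosk_session():
    greet()
    start = time.time()
    user_input = wait_for_input()
    while user_input is None:                  # 1710/1712: idle loop with timeout
        if time.time() - start > IDLE_TIMEOUT_S:
            return                             # reset back to the start (1702)
        user_input = wait_for_input()
    data = collect_user_data()
    behavior = categorize_behavior(data)
    sentiment = predict_sentiment(behavior, data)
    motion = generate_motion(sentiment, behavior)
    speech = generate_speech(user_input, sentiment)
    render(motion, speech)

run_kiosk_session()
```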
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Medical Informatics (AREA)
- Social Psychology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Psychiatry (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Databases & Information Systems (AREA)
- Processing Or Creating Images (AREA)
Abstract
A method for generating a response from a virtual character in response to a user input includes collecting auditory, visual, and textual data pertaining to user behavior. One or more landmark features are extracted from the data and pertain to the user's face, body, and hands. The user's facial expressions, body language, and hand gestures are categorized to a specific behavior. Additional contextual information is obtained, and an emotional and sentiment prediction of the user is determined based on the behavior categorization and the contextual information. A vector output is generated that indicates the predicted user sentiment. An emotional vector is generated based on the predicted user sentiment, behavior categorization, and contextual information. Motion data is generated to correspond with the emotional vector and mapped onto the virtual character to generate a visual response by the virtual character that is consistent with a current emotional state of the user.
Description
SYSTEM/METHOD FOR GENERATIVE BODY, GESTURE, AND FACIAL
EXPRESSION IN 3D CHARACTERS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of and priority to U.S. Provisional Patent Application Serial No. 63/627,538, filed January 31, 2024, and U.S. Provisional Patent Application Serial No. 63/751,609, filed January 30, 2025. The entire contents of these applications are hereby incorporated by reference.
TECHNOLOGICAL FIELD
[0002] The following disclosure relates generally to embodiments of a method of enabling a more lifelike and intuitive interaction between humans and machines, including computers and robotics. The following disclosure is also directed to embodiments of a device for executing the disclosed method.
BACKGROUND
[0003] Artificial intelligence (AI) has significantly influenced the way humans and machines interact. This ever-changing landscape has undergone transformative advancements in recent years, enabling more intuitive and lifelike interactions between humans and machines. However, existing AI systems still struggle with integrating diverse data modalities into a cohesive framework. Current AI systems confine data modalities (such as to text-only or voice-only systems), which constrains their ability to interpret complex human behaviors or respond dynamically in real time. As a result, current AI interactions are impersonal and lack the warmth of person-to-person interactions.
[0004] These are just some of the problems associated with currently used AI-generated body, gesture, and facial expression in 3D characters.
BRIEF SUMMARY
[0005] Aspects of the disclosure are directed to embodiments of a method for generating a response from a virtual character in response to a user input. In some embodiments, the method includes collecting multi-modal data pertaining to user behavior, wherein the multi-modal data includes auditory, visual, and textual information. In some embodiments, the method further includes extracting one or more landmark features from the multi-modal data, wherein the one or more landmark features pertain to the user's face, body, and hands. In some embodiments, the method includes fusing the extracted landmark features into a data set and analyzing the data set to behaviorally categorize the user's facial expressions, body language, and hand gestures. In some embodiments, the method further includes outputting a behavior categorization for the user, obtaining contextual information, and determining a sentiment classification of the user based on the behavior categorization and the contextual information. In some embodiments, the method further includes generating a vector output indicating the predicted user sentiment and integrating the vector output with contextual information. In some embodiments, the method further includes generating an emotional vector based on the predicted user sentiment, behavior categorization, and contextual information. In some embodiments, the method further includes generating motion data to correspond with the emotional vector and mapping the motion data to the virtual character to generate a visual response by the virtual character to the user. In some embodiments, the method further includes generating an auditory response for the virtual character, and outputting the visual and auditory responses to the user, wherein the visual response is consistent with a current emotional state of the user.
[0006] In some embodiments of the method, the contextual information includes at least one of user location, time of day, and number of users present. In some embodiments of the method, the auditory response includes a non-verbal response. In some embodiments of the method, the audio response includes a verbal response. In some embodiments, generating the verbal response further includes generating a verbal transcript based on the user input, adding a voice inflection based on the behavior categorization, sentiment, and emotion prediction, and outputting the verbal response in conjunction with the visual response.
[0007] Aspects of the disclosure are directed to embodiments of a method of user interaction with a virtual human in a commercial setting. In some embodiments, the method includes issuing a greeting to the user, prompting the user input, obtaining visual, auditory, and textual data from the user, analyzing the user data to categorize the user behavior, predicting user sentiment and emotion, generating a visual response based on the user input, the category of user behavior and the predicted sentiment and emotion of the user, generating an auditory response that corresponds to the visual response, outputting a contextually proper visual response that is consistent with a current emotional state of the user, and outputting the auditory response.
[0008] In some embodiments of the method, the user data further includes contextual information. In some embodiments of the method, the contextual information comprises at least one of user location, time of day, and number of users present. In some embodiments of the method, the auditory response includes a non-verbal response. In some embodiments of the method, the audio response includes a verbal response. In some embodiments of the method, the generating of the verbal response further includes generating a verbal transcript based on the user input, adding a
voice inflection based on the behavior categorization, sentiment, and emotion prediction, and outputting the verbal response in conjunction with the visual response.
[0009] Aspects of the disclosure are directed to embodiments of a retail kiosk that includes an input device. In some embodiments, the input device includes a visual display, one or more cameras, and one or more microphones. In some embodiments, the retail kiosk includes one or more processors and memory units configured to analyze user information obtained by the input device, categorize the behavior of the user, predict a sentiment of the user, generate a visual response based on the obtained user information, the category of user behavior and the predicted sentiment of the user, generate an auditory response that corresponds to the visual response, output a contextually proper response on the visual display that is consistent with a current emotional state of the user, and output the auditory response that corresponds to the visual response.
[0010] In some embodiments of the retail kiosk, the user information further includes contextual information. In some embodiments, the contextual information includes at least one of user location, time of day, and number of users present. In some embodiments of the retail kiosk, the auditory response includes a non-verbal response. In some embodiments of the retail kiosk, the auditory response is a verbal response. In some embodiments of the retail kiosk, the one or more processors are further configured to generate a verbal transcript based on the user input, add a voice inflection based on the behavior categorization, sentiment, and emotion prediction, and output the verbal response in conjunction with the visual response.
BRIEF DESCRIPTION OF DRAWINGS
[0011] A more particular description of the invention briefly summarized above may be had by reference to the embodiments, some of which are illustrated in the accompanying drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this
invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments. Thus, for further understanding of the nature and objects of the invention, references can be made to the following detailed description, read in connection with the drawings.
[0012] Fig. 1 schematically illustrates an embodiment of a system for generative body, gesture, and facial expression in 3D characters.
[0013] Fig. 2 illustrates an embodiment of an input device as may be used with the system of Fig. 1.
[0014] Fig. 3 schematically illustrates a portion of an embodiment of a behavior categorization system/method that is part of the system of Fig. 1.
[0015] Fig. 4 schematically illustrates another portion of the embodiment of the behavior categorization system/method that is part of the system of Fig. 1.
[0016] Fig. 5 schematically illustrates another portion of the embodiment of the behavior categorization system/method that is part of the system of Fig. 1.
[0017] Fig. 6 schematically illustrates a portion of an embodiment of a sentiment prediction/emotional prediction system/method that is part of the system of Fig. 1.
[0018] Fig. 7 schematically illustrates another portion of the embodiment of the sentiment prediction/emotional prediction system/method that is part of the system of Fig. 1.
[0019] Fig. 8 schematically illustrates an embodiment of a motion generation model/system that is part of the system of Fig. 1.
[0020] Fig. 9 schematically illustrates another portion of the embodiment of the motion generation system/method that is part of the system of Fig. 1.
[0021] Fig. 10 schematically illustrates another portion of the embodiment of the motion generation system/method that is part of the system of Fig. 1.
[0022] Fig. 11 mathematically illustrates the systems/methods of Figs. 3-10.
[0023] Fig. 12 schematically illustrates an embodiment of a system/method for capturing contextual information to be used in one or more of the systems/methods of Figs. 3-10.
[0024] Fig. 13 schematically illustrates an embodiment of a large language model used as part of the system of Fig. 1.
[0025] Fig. 14 schematically illustrates an embodiment of a speech generation component used as part of the system of Fig. 1.
[0026] Fig. 15 schematically illustrates an embodiment of a method of pre-training the system of Fig. 1.
[0027] Fig. 16 schematically illustrates an embodiment of a method of pre-training the system of Fig. 1.
[0028] Fig. 17 schematically illustrates a method of user interaction with the system of Fig. 1 when used in a retail or commercial setting.
DETAILED DESCRIPTION
[0029] The following discussion relates to various embodiments of a system/method for generative body, gesture, and facial expression in 3D characters. It will be understood that the herein described versions are examples that embody certain inventive concepts as detailed herein. To that end, other variations and modifications will be readily apparent to those of sufficient skill. In addition, certain terms are used throughout this discussion in order to provide a suitable frame of reference with regard to the accompanying drawings. The terms "about" or "approximately" as used herein may refer to a range of 80%-125% of the claimed or disclosed value. With regard to the drawings, their purpose is to depict salient features of the system/method for generative body, gesture, and facial expression in 3D characters, and they are not specifically provided to scale.
[0030] The disclosed system/method for generative body, gesture, and facial expression in 3D characters 100 uses a multi-modal framework that is configured to process and fuse data pertaining to body language, facial expressions, hand gestures, and spoken language, as well as environmental information including time of day, location, available menu items, and user history. In some embodiments, the disclosed body language model system 100 integrates advanced computational methodologies into a single system through a stacking ensemble approach. In some embodiments, the advanced computational methodologies are configured to fuse spatio-temporal data with contextual data to perform behavior categorization 200, sentiment prediction/emotional prediction 300, and motion generation 400. As used herein, the term "data" may refer to one or more types of data including spatial data, temporal data, and/or any other type of data acquired by and/or used by the disclosed system 100. In some embodiments, the body language model further comprises a meta-layer of additional contextual information to ensure that user inputs are recognized and interpreted in the context of the user's environment and emotional state.
[0031] Fig. 1 schematically represents an interaction between the multiple components of the system 100. Referring to Figs. 1 and 2, user information is obtained from an input device 10. In some embodiments, the input device 10 may be part of kiosk or mobile device 20 comprising one or more processing units and memory storage units. In some embodiments, the input device 10 may be configured to transmit information using Internet Protocol. In some embodiments, the input device 10 includes one or more cameras 12 that are configured to capture facial expressions, body movements, and hand gestures of a user as one or more video streams. In some embodiments, the input device 10 comprises at least one audio input 14, such as a microphone, that is configured
to capture the user’s voice or vocal output. In some embodiments, the input device 10 includes a touchscreen 16, one or more buttons and/or keys, and/or other user elements that enable the user to provide a manual input. In some embodiments, the touchscreen 16 comprises a visual display configured to output visual feedback/responses 18 to the user. Referring to Fig. 1, in some embodiments, the input device 10 is configured to capture contextual information 500 (Fig. 12) related to the user interaction, such as environmental cues, brand guidelines, and interaction guard rails 502. In some embodiments, the environmental cues may include, but are not limited to, time of day, number of users (single user vs. two or more users), and/or location. In some embodiments, the brand guidelines and interaction guard rails are part of a stored data set 20. In some embodiments, the data set 20 may be updated periodically according to retail protocols. In some embodiments, the brand guidelines may include, but are not limited to, product information, marketing tone, and/or influencer data. In some embodiments, the interaction guardrails may include, but are not limited to, subject matter limits, safety protocols, and/or data collection limits. As shown in Figs. 1 and 12, the contextual information is output as textual information 504 that may be included as an input to the behavior categorization 200, sentiment prediction/emotional prediction 300, and motion generation 400 systems/models.
[0032] The behavior categorization 200, sentiment prediction/emotional prediction 300, and motion generation 400 systems/models are further configured to operate in concert with large language models (LLMs) 600 and a sophisticated hardware system encompassing computer vision and audio processing units. An embodiment of an LLM 600 is shown in Fig. 13, where user audio is analyzed for vocal tone and linguistics. In some embodiments, the LLM 600 predicts vocal intent and speech categorical description. In some embodiments, the output of the LLM 600 is textual and may be included as an input to any or all of the behavior categorization 200, sentiment prediction/emotional prediction 300, and motion generation 400 systems/models.
BEHAVIOR CATEGORIZATION
[0033] At least some of the user data obtained by the input device 10 undergoes behavior categorization 200 to interpret and classify user actions by analyzing the real-time multi-modal data streams obtained from the input device. Referring generally to Figs. 3-5, different portions 200a, 200b, 200c of the behavior categorization method 200 are shown. In a first step 202, the information captured by the input device 10, for example, by the one or more cameras 12, is processed/analyzed by one or more processing units for facial expressions, body gestures, and hand gestures. In a next step 204, landmark features are extracted from the captured facial expressions, body gestures, and hand gestures. In some embodiments, facial landmarks are extracted from the recorded facial expressions and analyzed/tracked. In some embodiments, the facial landmarks may include eyes, eyebrows and mouth. In some embodiments, the one or more video streams are processed/analyzed to identify micro-expressions, such as brief, involuntary facial movements that may reveal the true emotions of the user. Similarly, body landmarks, such as the position of key body joints (e.g., elbows, shoulders), are extracted and analyzed/tracked to analyze posture and body movement, which provides insight into user engagement or disengagement. In addition, movement landmarks are extracted, such as the position of certain parts of the hand and/or head, in order to identify patterns such as waving or nodding, in order to capture dynamic user behavior. In some embodiments, the feature extraction process is represented by the following equations: x_f = φ_f(I_f), x_b = φ_b(I_b), x_g = φ_g(I_g).
Here, I_f, I_b, and I_g represent the input data streams, such as images or poses, and φ_f, φ_b, and φ_g are the respective feature extraction functions for each modality.
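By way of a non-limiting illustration (not part of the original disclosure), the sketch below shows one way the per-modality extraction functions φ_f, φ_b, and φ_g could be realized on landmark arrays produced by an upstream detector; the landmark counts, the normalization scheme, and the function names are assumptions made only for this example.

```python
import numpy as np

def phi_face(face_landmarks: np.ndarray) -> np.ndarray:
    """Illustrative facial extractor: flatten (x, y) landmarks, remove head position and scale."""
    pts = face_landmarks.reshape(-1, 2)
    centered = pts - pts.mean(axis=0)           # remove head position
    scale = np.linalg.norm(centered) + 1e-8     # remove head size
    return (centered / scale).ravel()

def phi_body(body_joints: np.ndarray) -> np.ndarray:
    """Illustrative body extractor: joint positions expressed relative to a root joint."""
    root = body_joints[0]
    return (body_joints - root).ravel()

def phi_gesture(hand_landmarks: np.ndarray) -> np.ndarray:
    """Illustrative hand extractor: hand landmarks expressed relative to the wrist."""
    wrist = hand_landmarks[0]
    return (hand_landmarks - wrist).ravel()

# Hypothetical inputs I_f, I_b, I_g (e.g., 68 facial points, 17 body joints, 21 hand points).
I_f = np.random.rand(68, 2)
I_b = np.random.rand(17, 3)
I_g = np.random.rand(21, 3)

x_f, x_b, x_g = phi_face(I_f), phi_body(I_b), phi_gesture(I_g)
```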
[0034] Referring to Fig. 4, in 200b, at step 206, the multi-modal inputs are fused into two data sets where a first data set captures position and movement of the key landmarks extracted at
step 204 (which are tracked at a minimum of 8 frames per second). The second data set includes environmental contexts such as identifying when a user engages with a particular system feature or expresses a specific emotion, linking each entry in this dataset to specific behaviors in the first dataset. In some embodiments, the second data set may further include audio information obtained from the input device 10. In some embodiments, audio information is obtained along with the video/visual data and then stored. The fusion step may be expressed mathematically as: z_t = W_f x_f + W_b x_b + W_g x_g + b. Here, W_f, W_b, and W_g are weights that determine the contribution of each modality’s features, and b is a bias term. At step 208, in some embodiments, a recurrent neural network (RNN) incorporates or integrates temporal dependencies to update latent features based on past and present states. This temporal context integration step may be expressed mathematically as: z_t = RNN(z_(t-1), z_t).
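As a minimal, non-limiting sketch of the fusion and temporal-context equations above (the hidden dimensions, the choice of a GRU cell as the RNN, and the class and parameter names are assumptions made for illustration only):

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Fuses per-modality features (z_t = W_f x_f + W_b x_b + W_g x_g + b) and
    integrates temporal context with a recurrent cell (here a GRU, one choice of RNN)."""

    def __init__(self, dim_f: int, dim_b: int, dim_g: int, dim_z: int):
        super().__init__()
        self.W_f = nn.Linear(dim_f, dim_z, bias=False)
        self.W_b = nn.Linear(dim_b, dim_z, bias=False)
        self.W_g = nn.Linear(dim_g, dim_z, bias=False)
        self.bias = nn.Parameter(torch.zeros(dim_z))
        self.rnn = nn.GRUCell(dim_z, dim_z)

    def forward(self, x_f, x_b, x_g, z_prev):
        z_t = self.W_f(x_f) + self.W_b(x_b) + self.W_g(x_g) + self.bias  # fusion (step 206)
        return self.rnn(z_t, z_prev)                                     # temporal update (step 208)

# Hypothetical usage with placeholder feature dimensions.
fusion = MultiModalFusion(dim_f=136, dim_b=51, dim_g=63, dim_z=128)
z_prev = torch.zeros(1, 128)
z_t = fusion(torch.rand(1, 136), torch.rand(1, 51), torch.rand(1, 63), z_prev)
```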
[0035] Turning to Fig. 5, 200c illustrates that the outputs from step 206 (and 208) are used in steps 210, 212 for behavioral categorization/prediction. In the behavior categorization/behavior prediction steps 210, 212, behavioral states are categorized (e.g., waving, smiling, frowning, nodding, and many others) using a deep neural network pre-trained to identify specific behaviors and contextual categories. In some embodiments, the pre-training involves exposing the neural network to large multi-modal datasets that have been pre-tagged, including video, audio, contextual annotations and/or transformations of vector points, to learn how combinations of inputs correspond to predefined behavioral states. During this process, the deep neural network optimizes its weights to recognize patterns and correlations across different modalities, segmenting and categorizing actions quickly. The deep neural network processes the fused multi-modal data from step 206 (and step 208) to extract increasingly abstract data representations related to fundamental features, such as motion trajectories and facial expressions, and to combine them into higher-order representations.
These representations are then mapped to predefined categories and a confidence level for each behavioral state is indicated. In some embodiments, a softmax function is applied to a weighted combination of the latent features to make the prediction: y = softmax(z_t W_c + b_c). In this equation, W_c and b_c are the weights and biases used for classification. The behavior outputs are generated at step 214 and include one or more of: (i) facial expressions; (ii) categorical descriptions; (iii) body language categorical descriptions; and/or (iv) hand gesture categorical descriptions. In some embodiments, these behavior outputs may be text output(s) 216.
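One plausible realization of the classification step y = softmax(z_t W_c + b_c) is sketched below; it is not the disclosed implementation, and the category list and layer sizes are placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn

BEHAVIOR_CATEGORIES = ["waving", "smiling", "frowning", "nodding", "idle"]  # illustrative subset

class BehaviorHead(nn.Module):
    """Maps the fused latent vector z_t to a probability for each behavioral state."""
    def __init__(self, dim_z: int, n_classes: int):
        super().__init__()
        self.classifier = nn.Linear(dim_z, n_classes)  # plays the role of W_c and b_c

    def forward(self, z_t):
        return torch.softmax(self.classifier(z_t), dim=-1)

head = BehaviorHead(dim_z=128, n_classes=len(BEHAVIOR_CATEGORIES))
probs = head(torch.rand(1, 128))
label = BEHAVIOR_CATEGORIES[int(probs.argmax())]  # predicted behavior with probs as confidence
```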
SENTIMENT PREDICTION
[0036] Turning to Figs. 6 and 7, the output from the behavior categorization method 200 is used at step 302 as an input for a sentiment prediction/emotional prediction model/method 300 (300a, 300b). The sentiment prediction/emotional prediction 300 processes data from three modalities: text (from steps 214/216); audio information gathered from the audio input 14; and contextual information, which, in some embodiments, is provided as video input. At step 304, features are extracted from the three different data streams using modality-specific feature extraction functions: h_text = φ_text(x_text), h_audio = φ_audio(x_audio), h_video = φ_video(x_video), where x_text, x_audio, and x_video refer to the text, audio, and video input data, respectively, and φ_modality refers to the feature extraction function for each modality.
[0037] At step 306, the extracted features from each modality are combined or fused into a unified representation through a multi-modal fusion layer, which ensures seamless integration of diverse data sources. At this step, the inputs are spatially and temporally aligned to maintain the integrity of each modality. In an embodiment, the system synchronizes audio tones, facial expressions, and body gesture outputs from the behavioral categorization model to create a coherent
representation. This unified representation enables the system 100 to handle the complex interdependencies between modalities, ensuring that behavioral and contextual nuances are preserved and correlated with sentiment. By fusing these features, the sentiment prediction/emotional prediction model/method 300 creates a foundation for predicting accurate, context-aware insights and is adaptable to real-time interactions. In an embodiment, the fusion step 306 may be represented as: h_fusion = f_fusion(h_text, h_audio, h_video), where f_fusion is the multi-modal fusion layer.
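The fusion layer f_fusion could be realized in many ways; the following non-limiting sketch, which is not the disclosed implementation, uses simple concatenation followed by a learned projection (the disclosure elsewhere also contemplates cross-attention), with the feature dimensions assumed for illustration.

```python
import torch
import torch.nn as nn

class SentimentFusion(nn.Module):
    """Illustrative f_fusion: concatenate temporally aligned per-modality features and
    project them into a single unified representation h_fusion."""
    def __init__(self, dim_text: int, dim_audio: int, dim_video: int, dim_out: int):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(dim_text + dim_audio + dim_video, dim_out),
            nn.ReLU(),
        )

    def forward(self, h_text, h_audio, h_video):
        # Inputs are assumed already aligned to the same interaction window (step 306).
        return self.project(torch.cat([h_text, h_audio, h_video], dim=-1))

f_fusion = SentimentFusion(dim_text=256, dim_audio=64, dim_video=128, dim_out=256)
h_fusion = f_fusion(torch.rand(1, 256), torch.rand(1, 64), torch.rand(1, 128))
```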
[0038] In step 308, shown at 300b in Fig. 7, the user sentiment and intention are classified using the fused representation from step 306 and a sentiment classification function, such as: y = g_sentiment(h_fusion), where “g_sentiment” is the sentiment classification function and “y” is the predicted sentiment and intent. The sentiment classification step evaluates the fused representation and assigns probabilities to various sentiment categories, such as, but not limited to, happiness, frustration, anger, or neutrality, as well as nuanced intentions, such as, but not limited to, agreement, hesitation, or curiosity. Advanced neural architectures assess these probabilities using context-sensitive algorithms to ensure that the predictions are accurate and contextually relevant. In some embodiments, environmental elements (e.g., user location and time of day) and contextual elements (e.g., menus and/or prompts) are incorporated into the sentiment classification step 308 to further enhance the relevance of the predictions. Incorporation of environmental and contextual elements results in a more comprehensive understanding of user sentiment, bridging the raw data from the fusion process with actionable insights, and enabling the system 100 to interpret complex emotional and intentional states effectively.
[0039] At step 310, user interactions are continuously monitored and dynamic updates to the initial sentiment prediction are made to incorporate real-time feedback and context. This real-
time adaptation step 310 involves analyzing discrepancies between predicted and observed behaviors and then recalibrating the model weights in real time to improve prediction accuracy. By integrating historical data and environmental/contextual elements, the system adapts its outputs to align with evolving behavioral patterns. The system 100 employs reinforcement mechanisms, adjusting predictions based on outcomes, and dynamically recalculates probabilities for each sentiment category to maintain contextual relevance and emotional alignment.
[0040] In some embodiments, the real-time adaptation occurs using the below: y' = y + Δy_context, where “Δy_context” represents the adjustment factor derived from contextual feedback, and the adjusted prediction y' conditions the downstream generation update x_(t-1) = (1/(1 - β_t))(x_t - β_t ε_θ(x_t, t, y')) + β_t z. This real-time adaptation system/algorithm offers a comprehensive framework for multi-modal sentiment and intention analysis. At step 312, an emotional prediction is developed and output as a vector 314.
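A toy, non-limiting version of the contextual adjustment y' = y + Δy_context is shown below; the renormalization step and the example feedback values are assumptions made purely for illustration and are not part of the disclosure.

```python
import numpy as np

def adapt_sentiment(y: np.ndarray, delta_y_context: np.ndarray) -> np.ndarray:
    """Apply the contextual adjustment y' = y + delta_y_context and renormalize so the
    adjusted scores can still be read as per-category probabilities (an assumption)."""
    y_adj = np.clip(y + delta_y_context, 0.0, None)
    return y_adj / y_adj.sum()

# Hypothetical example over [happy, frustrated, neutral]: observed behavior
# (e.g., repeated corrections by the user) nudges the prediction toward frustration.
y = np.array([0.55, 0.15, 0.30])
delta_y_context = np.array([-0.10, 0.20, -0.05])
y_prime = adapt_sentiment(y, delta_y_context)
```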
MOTION GENERATION
[0041] Turning to Figs. 8-10, an embodiment of a motion generation model/system 400 (400a, 400b, 400c) is shown. The motion generation model/system 400 processes inputs 314 from the sentiment prediction/emotional prediction model/system and leverages these inputs to generate structured datasets that drive virtual characters in 3D environments or robotics in the real world. At step 402, the vector output 314 from the sentiment prediction/emotional prediction model/system 300 is used as an input for sentiment analysis integration. In addition to the vector output 314, additional multi-modal data streams may be analyzed, such as text, video, and/or audio, to extract an emotional vector. In some embodiments, the additional multi-modal data includes text output from a large language model analysis of audio inputs, such as the user’s voice, contextual and environmental features, as well as the peripheral inputs, such as the camera 12, the microphone 14, or any other device that collects inputs and engages with a component of the system.
[0042] The sentiment analysis step 402 processes input (e.g., text, speech) by evaluating multi-modal data streams to extract an emotional vector 404. The emotional vector 404 represents a range of states, including, but not limited to, happiness, sadness, anger, neutrality, frustration, excitement, and curiosity. The inputs undergo modality-specific processing where the textual inputs are analyzed for emotionally significant phrases and word patterns, the audio input is analyzed for tone, pitch, and modulation, and the visual input captures facial expressions and gestures. By integrating these input/data streams, the motion generation model/system 400 constructs a comprehensive emotional profile of the user, which enables a nuanced classification of sentiments and intentions. In some embodiments, the emotional vector 404 is a control signal for the motion generation model/system 400 by encoding nuanced emotional states and inferred intentions derived from sentiment analysis. The emotional vector 404, which represents the emotional states of the user, acts as a dynamic input that directs the motion generation model/system 400 to synthesize contextually accurate and emotionally aligned animations.
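By way of a non-limiting sketch (not the disclosed method), the emotional vector 404 could be assembled from the modality-specific analyses as shown below; the state list ordering, the per-modality weights, and the averaging scheme are assumptions made only for this example.

```python
import numpy as np

EMOTIONAL_STATES = ["happiness", "sadness", "anger", "neutrality",
                    "frustration", "excitement", "curiosity"]

def build_emotional_vector(text_scores, audio_scores, video_scores,
                           weights=(0.4, 0.3, 0.3)) -> np.ndarray:
    """Combine per-modality emotion scores (each a distribution over EMOTIONAL_STATES)
    into a single emotional vector via a weighted average."""
    stacked = np.stack([text_scores, audio_scores, video_scores])
    vector = np.average(stacked, axis=0, weights=weights)
    return vector / vector.sum()

# Hypothetical per-modality outputs for a user who sounds excited but looks neutral.
text_scores  = np.array([0.5, 0.0, 0.0, 0.2, 0.0, 0.3, 0.0])
audio_scores = np.array([0.3, 0.0, 0.0, 0.1, 0.0, 0.6, 0.0])
video_scores = np.array([0.2, 0.0, 0.0, 0.6, 0.0, 0.1, 0.1])
emotional_vector = build_emotional_vector(text_scores, audio_scores, video_scores)
```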
[0043] In some embodiments, the motion generation model/system 400 is pre-trained on controlled datasets that include annotated emotional states paired with corresponding motion capture sequences, and is then fine-tuned on a more extensive video library of interactions and the historical data of user interactions, which enables the motion generation model/system 400 to improve the interpretation of the emotional vector 404. The pre-training and fine-tuning pipeline enables the motion generation model/system 400 to learn the intricate relationships between emotional cues and physical motion, such as associating a cheerful state with relaxed gestures and upward head tilts, or linking
frustration with tense postures and rapid hand movements. By leveraging the emotional vector 404, the motion generation model/system 400 ensures that the generated outputs — whether for virtual characters or robotic systems — align precisely with the user's emotional tone and situational context, delivering lifelike and responsive performances.
[0044] At 400b (Fig. 9), a diffusion probabilistic model generates high-dimensional motion sequences conditioned on the emotional vector 404. Unlike conventional generative models, the diffusion probabilistic model excels at producing coherent and high-resolution outputs by iteratively refining noisy inputs through a reverse diffusion process. By conditioning this refinement on the emotional vector 404, the motion generation model/system 400 ensures that the generated motion aligns seamlessly with the user's emotional state and context. Accordingly, the motion generation model/system 400 captures the variability inherent in human motion, allowing the production of diverse and realistic sequences.
[0045] In some embodiments, the motion generation model/system 400 is pre-trained on annotated datasets of motion capture sequences paired with emotional labels, which further enhances its ability to interpret and to respond to the control signal provided by the emotional vector 404. During training, the diffusion process learns to iteratively denoise random motion trajectories to reconstruct realistic, contextually appropriate animations. The diffusion process comprises a forward diffusion process at 406 where noise is gradually added to structured motion data x_(t-1), which effectively transforms the structured motion data into random noise x_T over a series of timesteps. Referring to the mathematical representations below: x_T is the initial random noise vector sampled from a normal distribution and represents the generation process's starting point; y is the emotional vector derived from the sentiment analysis and serves as the generation process's conditioning input; x_t is the synthesized motion state at a specific timestep t during the reverse diffusion process; ε_θ is the learned noise prediction function parameterized by a neural network, responsible for predicting noise components during denoising; and β_t is a predefined noise schedule controlling the amount of noise added at each timestep.
In some embodiments, the forward diffusion may be represented as follows: q(x_t | x_(t-1)) = N(x_t; (1 - β_t) x_(t-1), β_t I)
Where:
N denotes a normal distribution;
1 - β_t scales the contribution of the previous timestep x_(t-1); and
β_t I represents the variance (added noise).
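Taking the forward process exactly as written above, a minimal numerical sketch (not part of the disclosure) is shown below; the linear noise schedule, the clip dimensions, and the 50-step count are assumptions made for illustration.

```python
import numpy as np

def forward_diffusion(x0: np.ndarray, betas: np.ndarray, rng=np.random.default_rng()):
    """Gradually corrupt structured motion x_0 into near-random noise x_T, sampling
    x_t ~ N((1 - beta_t) * x_{t-1}, beta_t * I) at each timestep (step 406)."""
    xs = [x0]
    x = x0
    for beta_t in betas:
        x = (1.0 - beta_t) * x + np.sqrt(beta_t) * rng.standard_normal(x.shape)
        xs.append(x)
    return xs  # xs[0] is the clean motion x_0, xs[-1] approximates the noise x_T

# Hypothetical: a 60-frame motion clip with 24 joint channels and a 50-step linear schedule.
x0 = np.random.rand(60, 24)
betas = np.linspace(1e-4, 0.2, 50)
trajectory = forward_diffusion(x0, betas)
```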
[0046] The diffusion process further comprises a reverse diffusion process 410 that learns to remove this noise step-by-step, thereby refining the motion data into coherent and expressive sequences. In other words, the reverse diffusion process 410 reconstructs structured motion x_0 from the noise x_T, thereby iteratively removing noise at each timestep t. In some embodiments, the reverse diffusion process may be represented as follows: x_(t-1) = (1/(1 - β_t))(x_t - β_t ε_θ(x_t, t, y)) + β_t z
Where:
ε_θ(x_t, t, y) is the predicted noise at timestep t conditioned on y, which aligns generated motion with the emotional context; and
z ~ N(0, I) is the random noise from a standard normal distribution.
The reverse diffusion step 410 synthesizes motion trajectories that correspond to the emotional vector y.
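Following the reverse-step formula as stated above (rather than a textbook DDPM sampler), a non-limiting sketch is given below; the noise predictor ε_θ is a stand-in callable, since the disclosure specifies only that it is a learned neural network conditioned on the emotional vector y.

```python
import numpy as np

def reverse_diffusion(x_T, betas, eps_theta, y, rng=np.random.default_rng()):
    """Iteratively denoise x_T back toward structured motion, applying
    x_{t-1} = (1 / (1 - beta_t)) * (x_t - beta_t * eps_theta(x_t, t, y)) + beta_t * z."""
    x = x_T
    for t in reversed(range(len(betas))):
        beta_t = betas[t]
        z = rng.standard_normal(x.shape) if t > 0 else 0.0  # no extra noise on the final step
        x = (x - beta_t * eps_theta(x, t, y)) / (1.0 - beta_t) + beta_t * z
    return x

def eps_theta(x_t, t, y):
    """Placeholder noise predictor; a trained, y-conditioned network would replace this."""
    return 0.1 * x_t

y = np.random.rand(7)                       # emotional vector 404 (illustrative size)
betas = np.linspace(1e-4, 0.2, 50)
x_T = np.random.standard_normal((60, 24))
x_0_hat = reverse_diffusion(x_T, betas, eps_theta, y)
```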
[0047] In some embodiments, the diffusion process enables the motion generation model/system 400 to decompose motion data into a latent space at 408, which is amenable to stochastic sampling. In some embodiments, the motion generation model/system 400 is fine-tuned with extensive video data and historical user interactions to further enhance its ability to interpret the emotional vector 404 and generate context-specific outputs. By leveraging these additional data, the motion generation model/system 400 adapts to a broader range of scenarios and user behaviors, which improves the accuracy and expressiveness of the generated animations in both virtual environments and robotic applications.
[0048] At 412, the motion generation model/system 400 embeds the emotional vector 404 into a latent representation, which serves as a condition for the motion generation model/system 400. Accordingly, to further enforce alignment between motion generation and the emotional context, conditional sampling integrates the emotional vector 404 (y) directly into the generation process. The conditional sampling uses a neural network-based encoder-decoder architecture:
x_(t-1) = Decoder(Encoder(y) + x_t)
Where the Encoder maps the emotional vector y into a latent space. The Decoder synthesizes motion x_(t-1) using this latent representation and the current motion state x_t.
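A minimal, non-limiting sketch of such a conditional encoder-decoder is shown below; the layer sizes, the two-layer decoder, and the class name are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ConditionalSampler(nn.Module):
    """Illustrative x_{t-1} = Decoder(Encoder(y) + x_t): the emotional vector y is embedded
    into the motion latent space, added to the current state, and decoded into the next state."""
    def __init__(self, dim_emotion: int, dim_motion: int):
        super().__init__()
        self.encoder = nn.Linear(dim_emotion, dim_motion)   # maps y into the motion latent space
        self.decoder = nn.Sequential(
            nn.Linear(dim_motion, dim_motion), nn.ReLU(),
            nn.Linear(dim_motion, dim_motion),
        )

    def forward(self, y, x_t):
        return self.decoder(self.encoder(y) + x_t)

sampler = ConditionalSampler(dim_emotion=7, dim_motion=128)
x_prev = sampler(torch.rand(1, 7), torch.rand(1, 128))
```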
[0049] Turning to 400c (Fig. 10), at 414, the generated motion data is mapped onto the virtual human character's (or robotic) skeletal structure by translating the structured dataset into precise joint positions, rotations, and facial expressions. Accordingly, facial expressions and body gestures are synthesized in parallel and integrated with body motion. By incorporating the
discussed steps, the system 100 ensures that the virtual character exhibits synchronized facial expressions, gestures, and body language that match the sentiment in real-time. Moreover, the system 100 ensures that each data point, such as a hand gesture or head tilt, accurately aligns with the character's kinematic hierarchy. Using a structured dataset that adheres to 3D Euclidean space principles, the system is compatible with most mainstream virtual character creation applications, including industry-standard animation and gaming engines. The system is further compatible with most mainstream robotics controllers. Such compatibility allows the system 100 to deliver lifelike performances with smooth transitions and consistent emotional expression, which ensures a seamless integration into diverse platforms and use cases.
[0050] In some embodiments, the synthesized animation A is as follows:
A = f(J, F, G)
Where:
J are joint rotations and positions for skeletal motion;
F are facial expression parameters for emotive display;
G are gesture parameters for communicative motion; and
A is the complete animation sequence integrating J, F, and G.
[0051] The function f dynamically combines these components to generate lifelike animations.
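As a non-limiting sketch of how f(J, F, G) might assemble per-frame components into a complete animation sequence A (the data structures, field names, and two-frame example are assumptions made purely for illustration):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class AnimationFrame:
    """One synthesized frame: joint transforms J, facial parameters F, gesture parameters G."""
    joints: Dict[str, List[float]]   # per-joint rotation (and/or position) values
    face: Dict[str, float]           # e.g., blendshape weights for emotive display
    gestures: Dict[str, float]       # parameters for communicative hand/arm motion

def assemble_animation(J, F, G) -> List[AnimationFrame]:
    """Illustrative f(J, F, G): zip per-frame joint, face, and gesture data into one sequence A."""
    return [AnimationFrame(j, f, g) for j, f, g in zip(J, F, G)]

# Hypothetical two-frame example: a slight head nod with a growing smile.
J = [{"head": [0.0, 0.0, 0.0]}, {"head": [5.0, 0.0, 0.0]}]
F = [{"smile": 0.2}, {"smile": 0.6}]
G = [{"wave_amplitude": 0.0}, {"wave_amplitude": 0.0}]
A = assemble_animation(J, F, G)
```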
[0052] At 416, the motion is output either via a virtual character or via a robotic assembly. In some embodiments, a speech generation component 700, as schematically shown in Fig. 14, receives an output from the LLM 600 and generates a transcript of a response to the user input, including a voice inflection. The transcript and inflection are used to generate an audio response to accompany the motion output at 416.
[0053] As previously described, the system 100 handles multi-modal data capture by simulating diverse user scenarios to record interactions across multiple modalities. The system 100 collects high-resolution video data as well as spatially tracked vectors to track facial expressions, gestures, and full-body movements, capturing every detail of non-verbal communication in 3D Euclidean coordinates. Simultaneously, audio inputs record speech patterns, including tone, pitch, and rhythm, while textual inputs capture real-time interactions such as commands or inquiries. These outputs are structured as synchronized datasets that include spatial positioning, temporal markers, and environmental contexts, forming a robust foundation for downstream analysis and model training.
[0054] The initial dataset undergoes a series of structured preprocessing steps configured to make the data interpretable for the Behavior Analysis Model. In some embodiments, these steps begin with segmenting raw multi-modal inputs into discrete motion sequences and grouping them by spatial and temporal characteristics. Each motion is analyzed and categorized based on its associated body part (e.g., hand, arm, head) and its relationship within the kinematic hierarchy of the human body, ensuring the preservation of parent-child linkages (e.g., shoulder-to-arm or hip-to-leg). Temporal synchronization aligns data streams from different modalities, such as video, audio, and textual inputs, so that corresponding events occur cohesively across the dataset. Normalization processes standardize spatial data by recalibrating coordinates into a unified Euclidean space, eliminating inconsistencies caused by varying capture conditions or sensor inaccuracies. The processing and structuring of the dataset enables the Behavior Analysis Model to extract meaningful features, segment individual motions or motion groups, and interpret them within the context of user behaviors and environmental cues.
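One non-limiting way the spatial normalization described above could be performed is sketched below; the root-joint recentering, the scale heuristic, and the array shape are assumptions made only for this example.

```python
import numpy as np

def normalize_pose_sequence(joints: np.ndarray, root_index: int = 0) -> np.ndarray:
    """Illustrative normalization: recenter every frame on a root joint (e.g., the hips)
    and rescale by the mean skeleton extent so clips from different sensors share one
    unified Euclidean frame of reference. Expected shape: (frames, joints, 3)."""
    centered = joints - joints[:, root_index:root_index + 1, :]
    scale = np.linalg.norm(centered, axis=-1).mean() + 1e-8
    return centered / scale

# Hypothetical capture: 120 frames of 17 joints from one sensor rig with arbitrary offset/scale.
raw = np.random.rand(120, 17, 3) * 2.0 + 5.0
normalized = normalize_pose_sequence(raw)
```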
[0055] Behavioral Categorization 200: As previously discussed, the system 100 processes
structured, multi-modal input data to predict behavioral states with high granularity. Input data includes synchronized and normalized motion sequences, spatial coordinates of joints, facial landmarks, and segmented gestures. The output of this process is a set of categorized behaviors, segmented by motion group and linked to their respective kinematic hierarchy, such as "waving" associated with hand and arm movements. These predictions provide the foundational insights required for subsequent sentiment analysis and motion synthesis.
[0056] Sentiment Prediction/Emotional Prediction 300: As previously discussed, the system 100 processes multi-modal data to classify emotional states with precision and nuance. Input data consists of synchronized and normalized outputs from the Behavior Categorization Model 200, such as facial expressions, gestures, vocal tones, and contextual metadata. The system integrates these inputs into a cohesive emotional profile by leveraging cross-attention mechanisms and feature extraction pipelines. The output includes a detailed classification of emotional and sentimental states (e.g., happiness, sadness, frustration, neutrality) and inferred intentions (e.g., agreement, hesitation, or curiosity). These results provide actionable insights for guiding subsequent processes like motion generation and context-sensitive system responses.
[0057] Motion Generation 400: Synthesizes real-time animations by translating input data, including emotional vectors and segmented behaviors, into lifelike motion sequences. Input data includes structured outputs from the sentiment prediction/emotional prediction model 300, such as emotional profiles and contextual metadata, combined with data like joint coordinates and motion trajectories. This comprehensive input enables the Motion Generation Model 400 to map emotional cues to physical movements. The output comprises high-resolution, temporally synchronized animations integrating gestures, facial expressions, and body movements. These outputs are compatible with virtual characters and robotics, ensuring seamless, contextually accurate
performances in real-time applications.
[0058] Output Layer: Delivers animations for virtual characters and robotics by transforming high-resolution motion data into applicable formats for diverse platforms. Input data consists of structured datasets that include joint positions, motion trajectories, facial expressions, and emotional vectors, all aligned with 3D Euclidean coordinates. The output is a seamlessly integrated animation or motion instructions compatible with virtual environments and robotic systems, rendering lifelike movements and synchronized gestures in real-time for virtual characters. At the same time, for robotics, the data is converted into commands for actuators and motors, enabling precise and contextually relevant physical movements.
[0059] Figs. 15 and 16 illustrate an embodiment of data acquisition 1500 for pre-training the system/model 100. In such embodiments, performers, such as test users, wear a tracking suit that relies on internal tracking sensors and/or external cameras to map the motions of the performers. The plurality of tracking sensors track position, space, and motion of the performer’s body to create data for model training. Accordingly, the body sensors are spatially tracked over time and associated with facial expressions, hand gestures, and body language recognition. The results are tagged data groups with cross-associations to each other with a range of expressions, emotions, and contextual cues. Referring to the example of Fig. 16, a correlation between a group of arm positions and facial expressions at a certain time and in a specific place is shown and can lead to a deeper insight about the user’s behavior. Accordingly, the time and place are contextual elements. The system/model 100 assesses the current state of emotion, the history of motion and emotion, and the environmental context before generating a proper emotionally relevant response.
[0060] Referring to Fig. 17, an example of a method of user interaction with the previously
described system 100 integrated into a retail setting is shown. At 1702, the user starts the interaction. Starting the interaction may require the user to provide an input, such as via a touchscreen or an auditory input. At 1704, a greeting is offered to the user in response to the user starting the interaction. At 1706, additional user input is required and is provided at 1708. If no input is provided 1710, then an idle loop is created at 1712, which may continue for a predetermined period of time before resetting back to 1702. If user input is provided at 1708, then the behavior categorization/analysis model/system 200 is initiated at 1714 as well as the collection of user data/information 1716. At 1718 and 1720, the sentiment prediction/emotional prediction model/system 300 and motion generation system/model 400 are initiated. At 1722, an AI response is generated as a result of steps 1714-1720 and output to the user. At 1724, the user is prompted for an additional input, and the system then waits for the prompted response at 1706. As shown, the system 100 keeps analyzing the user data/information in conjunction with the user input in order to generate contextually and emotionally appropriate responses.
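A non-limiting control-loop sketch mirroring the flow of Fig. 17 is given below; the timeout value and the callable names (get_user_input, categorize_behavior, predict_sentiment, generate_motion, render_response) are placeholders for the components described above, not disclosed interfaces.

```python
import time

IDLE_TIMEOUT_S = 30  # assumed reset window for the idle loop 1712

def run_kiosk_session(get_user_input, categorize_behavior, predict_sentiment,
                      generate_motion, render_response):
    """Illustrative loop: greet, wait for input, run the behavior/sentiment/motion models,
    respond, and repeat until the user goes idle (then reset back to the start)."""
    render_response("greeting", None)                        # 1704
    idle_since = time.monotonic()
    while True:
        user_input = get_user_input()                        # 1706 / 1708
        if user_input is None:                               # 1710: idle loop 1712
            if time.monotonic() - idle_since > IDLE_TIMEOUT_S:
                return                                       # reset back to 1702
            continue
        idle_since = time.monotonic()
        behavior = categorize_behavior(user_input)           # 1714 / 1716
        sentiment = predict_sentiment(behavior, user_input)  # 1718
        motion = generate_motion(sentiment)                  # 1720
        render_response(motion, sentiment)                   # 1722, then re-prompt at 1724
```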
[0061] While the present invention has been particularly shown and described with reference to certain exemplary embodiments, it will be understood by one skilled in the art that various changes in detail may be effected therein without departing from the spirit and scope of the invention that can be supported by the written description and drawings. Further, where exemplary embodiments are described with reference to a certain number of elements, it will be understood that the exemplary embodiments can be practiced utilizing either less than or more than the certain number of elements.
Claims
1. A method for generating a response from a virtual character in response to a user input, comprising: collecting multi-modal data pertaining to user behavior, wherein the multi-modal data comprises auditory, visual, and textual information; extracting one or more landmark features from the multi-modal data, wherein the one or more landmark features pertain to the user’s face, body, and hands; fusing the extracted landmark features into a data set; analyzing the data set to behaviorally categorize the user’s facial expressions, body language, and hand gestures; outputting a behavior categorization for the user; obtaining contextual information; determining a sentiment classification of the user based on the behavior categorization and the contextual information; generating a vector output indicating the predicted user sentiment; integrating the vector output with contextual information;
generating an emotional vector based on the predicted user sentiment, behavior categorization, and contextual information; generating motion data to correspond with the emotional vector; mapping the motion data to the virtual character to generate a visual response by the virtual character to the user; generating an auditory response for the virtual character; and outputting the visual and auditory responses to the user, wherein the visual response is consistent with a current emotional state of the user.
2. The method of claim 1, wherein the contextual information comprises at least one of user location, time of day, and number of users present.
3. The method of claim 1, wherein the auditory response comprises a non-verbal response.
4. The method of claim 1, wherein the auditory response comprises a verbal response, wherein generating the verbal response further comprises: generating a verbal transcript based on the user input;
adding a voice inflection based on the behavior categorization, sentiment, and emotion prediction; and outputting the verbal response in conjunction with the visual response.
5. A method of user interaction with a virtual human in a commercial setting, comprising: issuing a greeting to the user; prompting the user input; obtaining visual, auditory, and textual data from the user; analyzing the user data to categorize user behavior; predicting user sentiment and emotion; generating a visual response based on the user input, the category of user behavior and the predicted sentiment and emotion of the user; generating an auditory response that corresponds to the visual response; outputting a contextually proper visual response that is consistent with a current emotional state of the user; and outputting the auditory response.
6. The method of claim 5, wherein the user data further comprises contextual information, wherein the contextual information comprises at least one of user location, time of day, and number of users present.
7. The method of claim 5, wherein the auditory response comprises a non-verbal response.
8. The method of claim 5, wherein the auditory response comprises a verbal response, wherein generating the verbal response further comprises: generating a verbal transcript based on the user input; adding a voice inflection based on the behavior categorization, sentiment, and emotion prediction; and outputting the verbal response in conjunction with the visual response.
9. A retail kiosk comprising: an input device comprising a visual display, one or more cameras, and
one or more microphones, one or more processors configured to: analyze user information obtained by the input device, categorize a behavior of the user, predict a sentiment and emotion of the user, generate a visual response based on the obtained user information, the category of user behavior and the predicted sentiment of the user, generate an auditory response that corresponds to the visual response, output a contextually proper response on the visual display that is consistent with a current emotional state of the user, and output the auditory response that corresponds to the visual response.
10. The retail kiosk of claim 9, wherein the user information further comprises contextual information, and wherein the contextual information comprises at least one of user location, time of day, and number of users present.
11. The retail kiosk of claim 9, wherein the auditory response comprises a non-verbal response.
12. The retail kiosk of claim 9, wherein the auditory response is a verbal response.
13. The retail kiosk of claim 12, wherein the one or more processors are further configured to: generate a verbal transcript based on the user input; add a voice inflection based on the behavior categorization, sentiment, and emotion prediction; and output the verbal response in conjunction with the visual response.
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463627538P | 2024-01-31 | 2024-01-31 | |
| US63/627,538 | 2024-01-31 | ||
| US202563751609P | 2025-01-30 | 2025-01-30 | |
| US63/751,609 | 2025-01-30 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025166188A1 true WO2025166188A1 (en) | 2025-08-07 |
Family
ID=96591490
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2025/014070 Pending WO2025166188A1 (en) | 2024-01-31 | 2025-01-31 | System/ method for generative body, gesture, and facial expression in 3d characters |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025166188A1 (en) |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230177878A1 (en) * | 2021-12-07 | 2023-06-08 | Prof Jim Inc. | Systems and methods for learning videos and assessments in different languages |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25749408; Country of ref document: EP; Kind code of ref document: A1 |