Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In order that the above-described aspects may be better understood, exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of "first," "second," "third," and the like does not denote any order; these terms may be construed as names.
Referring to fig. 1, fig. 1 is a schematic diagram of the hardware running environment of a server 1 (also referred to as a somatosensory dance game device for generating a virtual character based on user voice information) according to an embodiment of the present invention.
The server provided by the embodiment of the invention is a device with a display function, such as an Internet-of-Things device, a smart air conditioner with a networking function, a smart lamp, a smart power supply, an AR/VR device with a networking function, a smart speaker, an autonomous vehicle, a PC, a smartphone, a tablet computer, an e-book reader, a portable computer, and the like.
As shown in fig. 1, the server 1 includes a memory 11, a processor 12, and a network interface 13.
The memory 11 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the server 1, such as a hard disk of the server 1. The memory 11 may also be an external storage device of the server 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the server 1.
Further, the memory 11 may also include an internal storage unit of the server 1 as well as an external storage device. The memory 11 may be used not only for storing application software installed in the server 1 and various types of data, for example, codes of the somatosensory dance game program 10 for generating a virtual character based on user voice information, etc., but also for temporarily storing data that has been output or is to be output.
Processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip for executing program code or processing data stored in memory 11, such as executing a somatosensory dance game program 10 that generates virtual characters based on user voice information, or the like.
The network interface 13 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), typically used to establish a communication connection between the server 1 and other electronic devices.
The network may be the internet, a cloud network, a wireless fidelity (Wi-Fi) network, a Personal Area Network (PAN), a Local Area Network (LAN), and/or a Metropolitan Area Network (MAN). Various devices in a network environment may be configured to connect to a communication network according to various wired and wireless communication protocols. Examples of such wired and wireless communication protocols may include, but are not limited to, at least one of Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, IEEE 802.11, light fidelity (Li-Fi), 802.16, IEEE 802.11s, IEEE 802.11g, multi-hop communication, wireless access point (AP), device-to-device communication, cellular communication protocols, and/or the Bluetooth communication protocol, or combinations thereof.
Optionally, the server may further comprise a user interface, which may comprise a display (Display) and an input unit such as a keyboard (Keyboard), and optionally a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or a display unit, for displaying information processed in the server 1 and for displaying a visual user interface.
Fig. 1 shows only a server 1 having components 11-13 and a somatosensory dance game program 10 that generates virtual characters based on user voice information, and those skilled in the art will appreciate that the structure shown in fig. 1 does not constitute a limitation of the server 1, and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
In this embodiment, the processor 12 may be configured to invoke a somatosensory dance game program for generating a virtual character based on user voice information stored in the memory 11, and perform the following operations:
when creating a virtual character of the somatosensory dance game, activating a preset voice assistant to perform interactive communication with a user;
collecting first voice data exchanged between the user and the voice assistant, and sending a reading request for reading a specified text to the user when the first voice data meets a preset condition;
collecting second voice data when the user reads the specified text;
analyzing biological characteristics and character characteristics of the user by utilizing a pre-trained machine learning model based on the second voice data, and generating a timbre model matched with the timbre of the user;
generating an appearance of the virtual character based on the biological characteristics and the character characteristics;
and generating the voice of the virtual character based on the timbre model.
Based on the hardware architecture of the somatosensory dance game device for generating the virtual character based on the user voice information, the embodiment of the somatosensory dance game method for generating the virtual character based on the user voice information is provided. The invention discloses a somatosensory dance game method for generating virtual characters based on user voice information, which aims to realize the intelligent and personalized creation of the virtual characters.
Referring to fig. 2, fig. 2 is a flow diagram of an embodiment of the somatosensory dance game method for generating a virtual character based on user voice information according to the present invention. The somatosensory dance game method for generating a virtual character based on user voice information comprises the following steps:
S10, when the virtual character of the somatosensory dance game is created, activating a preset voice assistant to perform interactive communication with the user.
Wherein the virtual character refers to a digitized character or avatar in the somatosensory dance game. This character represents the user's presence and performance in the game and interacts with the game environment.
The voice assistant refers to an intelligent voice interaction system built in a game, and can interact with a user in natural language. The assistant has the ability to understand and respond to user voice input, designed to guide the user through voice to complete the character creation process.
Specifically, the system automatically launches the voice assistant when the user selects the option to create a character. The voice assistant greets the user in a soft, friendly tone, for example: "Hello, welcome to the character creation stage. I will help you customize a unique character of your own." Then, the voice assistant communicates with the user through guiding questions, such as "What kind of music do you usually like?" These questions not only help the user better understand the character creation process, but also provide the necessary interaction basis for the voice data collection in the subsequent steps, helping to build user trust and dispel the user's guardedness.
S20, collecting first voice data exchanged between the user and the voice assistant, and sending a reading request for reading the specified text to the user when the first voice data meets the preset condition.
Wherein the first voice data refers to a voice signal generated by the user during interaction with the voice assistant. These data include characteristics of the user such as speech rate, pitch, timbre, intonation, and emotional expression.
The preset condition refers to the system judging whether the voice data of the user communicating with the voice assistant meets specific requirements, for example, that the user's voice shows sufficient relaxation or positive emotion, or that the analysis result of the tone-of-voice characteristics reaches a preset standard.
The specified text is a piece of specific text content that the game system requires the user to read. The content of the specified text typically includes various types of sentences, short sentences or phrases that are designed to be representative, in order to capture the natural speech characteristics of the user.
Optionally, the specified text has the following features. Phoneme diversity: the text content contains a variety of phonemes to ensure that the user's pronunciation characteristics are fully captured. Emotional expression: the text may contain different emotional expressions, such as questions, statements and commands, to capture the user's speech features in different contexts. Sentence structure: the sentences in the text can be long or short and contain different grammatical structures, which helps capture the user's variations in speed and intonation across different language expressions.
Specifically, the system automatically captures these voice inputs as the user answers questions posed by the voice assistant via voice. The system can record the voice of the user through the microphone equipment and perform real-time processing in the background. Processing includes filtering background noise, removing extraneous acoustic interference, and extracting key speech features. When the first voice data meets the preset condition, the system sends a request for reading the specified text to the user.
In some embodiments, the preset condition refers to determining that the user is in a relaxed state based on the voice data of the user in the preliminary communication, and the determination of the relaxed state includes at least one of the following features:
1. The speech rate of the user is maintained within a predetermined speech rate range.
The speech rate refers to the number of words or syllables that the user pronounces per minute during communication. The predetermined speech rate range is typically set to a numerical interval reflecting the natural dialogue of the user, for example 100 to 150 words per minute.
Specifically, the system calculates the user's speech rate in real time as the user engages in preliminary communication with the voice assistant. If the user's speech rate is consistently maintained within the predetermined range (e.g., 100 to 150 words/minute), the system will determine that the user is likely to be in a relaxed state. The stability of the speech rate generally indicates that the user is communicating with the voice assistant in a natural, relaxed manner, rather than speaking rapidly or slowly due to tension or anxiety.
2. The pitch of the user is maintained in a preset pitch range.
Where pitch refers to the height of the fundamental frequency in the user's voice, typically related to the user's emotional and physiological state. The preset pitch range is a reasonable interval, e.g., 85 to 180 hz, set according to the user's normal speaking pitch.
Specifically, the system determines whether the pitch is maintained within a preset range by analyzing the fundamental frequency of the user's voice during the preliminary communication. If the pitch remains stable and there is no significant drastic fluctuation, the system will consider the user likely to be in a relaxed state. This is because, when the user feels relaxed, the pitch generally tends to be smooth, and no pitch rise or dip in tension occurs.
3. The intonation change frequency of the user is lower than a preset intonation threshold.
The intonation change frequency refers to the change frequency of intonation of a user in the communication process, namely the number of times of voice fluctuation. The preset intonation threshold is a criterion defining the number of intonation changes, for example not more than 3 per minute.
Specifically, the system calculates the frequency of intonation changes when analyzing the user's speech data. If the tone change frequency of the user is lower than a preset threshold (for example, not more than 3 times per minute), the system can judge that the tone of the user is stable, and the user is in a relaxed state. Lower intonation change frequencies generally indicate that the user's emotion is stable and does not frequently adjust intonation due to anxiety or excitement.
4. The voice rhythm of the user is maintained within a preset rhythm range;
where speech cadence refers to the speed, pauses, and continuity of speech in communication by the user. The preset speech range typically reflects the speech tempo feature in a natural conversation, including the uniformity and consistency of speech.
Specifically, the system judges whether the voice rhythm is in a preset range or not by analyzing indexes such as pause interval, pronunciation continuity and the like in the voice of the user. If the user's voice cadence is uniform and there is no apparent jerk or discontinuity, the system may consider the user to be likely in a relaxed state. The balance of this rhythm is often a representation of relaxation and natural communication.
5. Emotion characteristics in the user's voice, as detected by an emotion recognition algorithm, meet a preset standard for a calm or pleasant state.
The emotion recognition algorithm refers to an algorithm model for judging the emotion state of a user by analyzing voice characteristics (such as speech speed, pitch, intonation and the like). The preset criteria is a score for calm or pleasant emotion based on emotion recognition models, typically set to a score above 0.7.
Specifically, the system uses emotion recognition algorithms to process and analyze the user's speech data. If the emotion recognition model scores the current emotional state of the user above 0.7 and is displayed as calm or pleasant, the system may consider the user to be in a relaxed state. Such scoring typically reflects the overall emotional state of the user, ensuring the naturalness and comfort of the communication.
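By way of non-limiting illustration, the five indicators above may be combined into a single relaxed-state check as in the following Python sketch. The threshold values mirror the examples given in this section, and the input feature names are hypothetical placeholders for values produced by an upstream feature-extraction and emotion-recognition module.

# Sketch: deciding whether the first voice data meets the relaxed-state preset
# condition. Thresholds follow the examples in the text; the feature dict is
# assumed to come from upstream speech analysis (hypothetical keys).
def is_relaxed(features: dict) -> bool:
    checks = [
        100 <= features["speech_rate_wpm"] <= 150,        # 1. speech rate range
        85 <= features["mean_pitch_hz"] <= 180,           # 2. pitch range
        features["intonation_changes_per_min"] <= 3,      # 3. intonation changes
        features["rhythm_uniform"],                       # 4. even speech rhythm
        features["emotion_score"] > 0.7,                  # 5. calm/pleasant score
    ]
    # The embodiment requires "at least one of" these features to hold; a
    # stricter variant could instead require all(checks).
    return any(checks)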
It can be understood that, by establishing user trust and dispelling guardedness through the preliminary communication with the voice assistant, the system can collect the user's voice data and analyze its tone characteristics in a naturally relaxed environment, and can then guide the user to read the specified text so that the user expresses more genuine emotion, allowing more authentic and accurate voice characteristics to be acquired. In addition, this interactive communication mode makes the character creation process more natural and smooth, and significantly improves the user's interactive experience and immersion.
S30, collecting second voice data when the user reads the specified text.
Specifically, the system may direct the user to read the specified text through a graphical interface or voice prompt. When a user reads a specified text, the system collects voice data of the specified text read by the user through a configured microphone array or a single microphone device. In which more accurate sound source localization and higher anti-noise capability can be achieved if a microphone array is used. This is because the microphone array collects the user's sound signals simultaneously through multiple channels, helping to reduce interference from background noise.
S40, analyzing the biological characteristics and character characteristics of the user by utilizing a pre-trained machine learning model based on the second voice data, and generating a timbre model matched with the timbre of the user.
The pre-trained machine learning model is a model that has been trained on a large amount of labeled data and has a certain prediction capability. By learning patterns and features from large amounts of data during training, these models can, given input data (e.g., speech features), output predicted results for the user's biological features (e.g., age, gender, voiceprint features) and personality features (e.g., extroverted or introverted personality tendencies).
Alternatively, the machine learning model may be at least one of a deep neural network, a convolutional neural network, a recurrent neural network, a long-short-term memory network, a support vector machine, and a random forest.
Specifically, the system inputs the collected second voice data into the pre-trained machine learning model, which automatically extracts and analyzes the user's biological characteristics, character characteristics and timbre model. The analysis process comprises steps such as feature extraction, feature selection and classification, and finally outputs the relevant characteristic information of the user.
S50, generating the appearance of the virtual character based on the biological characteristics and the character characteristics.
Specifically, the system selects the virtual character template that best fits the user's characteristics according to the biological characteristics and character characteristics analyzed by the machine learning model, and then adjusts the appearance attributes of the virtual character template according to the user's characteristics so as to be closer to the user's actual features. For example, the gender, height, skin state and the like of the virtual character are adjusted according to the user's biological characteristics (such as gender, age and height), and attributes such as the virtual character's skin color, hairstyle, hair color and makeup are adjusted according to the user's character characteristics. For example, an extroverted user may obtain a lively, dynamic character, while an introverted user may obtain a steady, low-key character.
S60, generating the voice of the virtual character based on the voice model.
Where the voice of the avatar refers to the character's vocalization characteristics generated from the user's timbre model that will be used in the game for conversations, expressions, and interactions of the avatar.
Specifically, the system first generates the voice of the virtual character using a timbre model through text-to-speech (TTS) technology. For example, when a user enters a piece of text or vocalizes a character according to a game scenario, the system may invoke a timbre model to convert the text into speech that resembles the user's timbre.
It can be appreciated that the somatosensory dance game method of the present technical solution dispels the user's guardedness and establishes preliminary trust through interactive communication between the voice assistant and the user; it then collects the voice data of the user reading the specified text and, in combination with a pre-trained machine learning model, analyzes the user's biological characteristics, character characteristics and timbre model based on that voice data, thereby automatically generating a virtual character highly matched to the user. This method not only eliminates the cumbersome process of manually entering or selecting preset parameters in traditional somatosensory games and virtual reality technologies, but also significantly improves the personalization and realism of the virtual character. In addition, because the user's voice characteristics are captured and analyzed automatically, the character creation process becomes more natural and frictionless, improving the user's interactive experience and immersion. Therefore, the somatosensory dance game method of the present technical solution has the advantage of creating virtual characters intelligently and in a personalized manner.
In some embodiments, the voice assistant communicates with the user using a preset tone and preset text designed for the purpose of establishing user trust and/or dispelling the user's guardedness.
The preset tone refers to the tone adopted by the voice assistant when communicating with the user, which is usually soft, friendly and warm, so that the user feels relaxed and trusting.
The preset text refers to the text content used by the voice assistant in the preliminary communication, mainly dialogue content designed to be simple, easy and pressure-free, in order to establish preliminary trust with the user and dispel the user's guardedness.
Specifically, when the user begins the virtual character creation process, the voice assistant first greets the user in a warm and relaxed tone, such as "Hello! Welcome to our game world. Together we will create a unique virtual character." Next, the voice assistant poses simple, pressure-free open-ended questions from the designed preset text, such as "What interesting games have you played recently?" The purpose of these questions is to let the user relax and enter a communication state naturally, thus laying the foundation for subsequent speech data collection and analysis.
It can be understood that, through the preset tone and preset text, user trust can be quickly established and guardedness dispelled, and the system can collect the user's voice data and analyze its tone characteristics in a naturally relaxed environment, so that the user's emotional state can be judged accurately.
In some embodiments, analyzing the user's biometric features using a pre-trained machine learning model based on the second speech data comprises:
S110, extracting pitch characteristics, formant characteristics and frequency spectrum characteristics from the second voice data.
Among these, pitch features (Pitch Features) reflect the fundamental frequency of the user's utterance, which is typically related to the vibration of the vocal cords. The system may extract the fundamental frequency (F0) from the speech signal using an autocorrelation method or a pitch tracking algorithm to obtain the pitch features.
Further, there are the following differences in pitch between men and women:
Physiological differences: pitch is determined by the vocal cord vibration frequency (fundamental frequency, F0). Men's vocal cords are generally longer and thicker and vibrate at a lower frequency, so their fundamental frequency is generally lower; women's vocal cords are shorter and thinner and vibrate at a higher frequency, so their fundamental frequency is generally higher.
Sex differences: statistically, the average pitch for men is typically between 85 Hz and 180 Hz, and the average pitch for women is typically between 165 Hz and 255 Hz.
Formant characteristics (Formant Features) are areas of energy concentration in the speech spectrum, reflecting the shape and size of the user's vocal tract. The system may extract a plurality of formant frequencies (e.g., F1, F2, F3) as features by Linear Predictive Coding (LPC) or formant tracking algorithms.
Further, there are the following differences in formants between men and women:
Physiological differences: formants are determined by the shape and size of the vocal tract and reflect its physical resonance characteristics. Men typically have a longer vocal tract (including the pharyngeal and oral cavities), which results in lower formant frequencies; the vocal tract of women is generally shorter, so the formant frequencies are higher.
Sex differences: typically, the first formant (F1) and the second formant (F2) are lower in frequency in male speech and higher in female speech. For example, male F1 may be between 100 Hz and 900 Hz and F2 between 900 Hz and 2400 Hz, while female F1 and F2 may be higher.
The spectral features (Spectral Features) reflect the energy distribution of the speech signal at different frequencies, which can be extracted by short-time fourier transform (STFT) or mel-frequency cepstral coefficient (MFCC). The tone quality and tone color characteristics of the user's voice can be captured through the spectral features.
Further, there are the following differences in frequency spectrum between men and women:
Physiological differences: spectral features reflect the energy distribution of the speech signal over different frequency ranges. Due to differences in vocal tract configuration between men and women, the energy distribution of the speech signal across the spectrum also differs; male speech typically has more energy in the low-frequency part, while female speech has more energy in the mid- and high-frequency parts.
Sex differences: by analyzing the spectral envelope or MFCCs (Mel Frequency Cepstral Coefficients) of the speech signal, differences in the spectral energy distribution of the sexes can be observed. In general, the spectrum of female speech tends to show higher energy in the higher frequency range, while the spectral features of male speech are more concentrated in the low-frequency band.
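By way of non-limiting illustration, the pitch, formant and spectral features of step S110 may be extracted as in the following Python sketch, which assumes the librosa library (one possible choice, not mandated by this disclosure): pitch via the pYIN pitch-tracking algorithm, formants via LPC root-finding, and spectral features via Mel-frequency cepstral coefficients.

# Sketch of step S110: extract pitch, formant and spectral features.
import numpy as np
import librosa

def extract_features(wav_path: str, sr: int = 16000):
    y, sr = librosa.load(wav_path, sr=sr)

    # Pitch feature: fundamental frequency F0 from pYIN pitch tracking.
    f0, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=300, sr=sr)
    pitch = np.nanmean(f0)                        # mean F0 over voiced frames

    # Formant features: LPC analysis, formants from the angles of complex roots.
    a = librosa.lpc(y, order=2 + sr // 1000)      # common rule-of-thumb order
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    formants = np.sort(np.angle(roots) * sr / (2 * np.pi))[:3]   # approx. F1-F3

    # Spectral features: 13 MFCCs averaged over frames.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    return np.array([pitch]), np.asarray(formants), mfcc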
And S120, normalizing and dimension-reducing the pitch characteristic, the formant characteristic and the frequency spectrum characteristic.
The purpose of the normalization process is to scale the range of values of the different features to the same scale (e.g., between 0 and 1) to eliminate scale differences between the different features. This prevents the model from being excessively affected by the features having a large range of feature values.
The dimension reduction process refers to the use of dimension reduction techniques to reduce feature dimensions. The purpose of this step is to remove redundant information while retaining the principal components that best reflect the gender characteristics of the user. The feature vector after the dimension reduction will contain fewer dimensions but the information is more concentrated.
Specifically, after various voice features are extracted, the system performs normalization and dimension reduction processing on the features to reduce differences and redundancy among the features, so that the prediction precision and efficiency of the model are improved.
Alternatively, the features such as pitch features may be normalized by methods such as min-max normalization, Z-Score normalization, max absolute normalization, etc.
Alternatively, the method of principal component analysis, linear discriminant analysis, independent component analysis, singular value decomposition and the like can be adopted to perform dimension reduction processing on the characteristics of pitch characteristics and the like.
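For instance, the normalization and dimension reduction of step S120 may be sketched as follows, here choosing min-max normalization and principal component analysis from the options listed above; in practice the scaler and PCA would be fitted on a training set and only applied at inference time.

# Sketch of step S120: scale features to [0, 1], then reduce dimensionality.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

def normalize_and_reduce(feature_matrix: np.ndarray, n_components: int = 8):
    # feature_matrix: one row per utterance, columns are the raw pitch,
    # formant and spectral features before fusion.
    scaled = MinMaxScaler().fit_transform(feature_matrix)
    reduced = PCA(n_components=n_components).fit_transform(scaled)
    return reduced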
S130, fusing the pitch characteristic, the formant characteristic and the frequency spectrum characteristic after normalization and dimension reduction processing into a first characteristic vector.
Specifically, pitch features, formant features, and spectral features may be fused into one feature vector using methods such as feature stitching (Feature Concatenation), weighted Fusion (Weighted Fusion), feature averaging (Feature Averaging), and the like.
S140, inputting the first feature vector into a pre-trained gender identification model, and predicting the gender features of the user.
Among these, the pre-trained gender recognition model may be a Support Vector Machine (SVM), a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), or the like. The model is trained on a large-scale labeling data set, and can accurately map the input feature vector to the gender classification result.
Specifically, the system inputs the first feature vector into a pre-trained gender recognition model, the model analyzes according to the input feature vector, and outputs a gender prediction result of the user. The model outputs the gender characteristics of the user, typically binary classification (e.g., male/female), and may also output a probability distribution (e.g., male probability of 0.8 and female probability of 0.2).
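By way of non-limiting illustration, steps S130 and S140 may be sketched as follows: the processed features are concatenated into the first feature vector and passed to a pre-trained gender classifier. The model file name, the use of an SVM, and the class ordering are assumptions of this example only.

# Sketch of steps S130-S140: feature fusion and gender prediction.
import numpy as np
import joblib

def predict_gender(pitch_vec, formant_vec, spectral_vec,
                   model_path: str = "gender_svm.joblib"):
    # Feature concatenation into the first feature vector.
    first_feature_vector = np.concatenate(
        [pitch_vec, formant_vec, spectral_vec]).reshape(1, -1)
    model = joblib.load(model_path)      # e.g. sklearn SVC(probability=True)
    proba = model.predict_proba(first_feature_vector)[0]
    # Assumed class order [male, female]; check model.classes_ in practice.
    return {"male": float(proba[0]), "female": float(proba[1])}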
It will be appreciated that by the feature extraction, processing and fusion processes performed in the above embodiments, the system is able to accurately extract and identify gender features from the user's voice data. The process utilizes multidimensional information of pitch, formants and spectrum characteristics, and the characteristics are input into a pre-trained machine learning model, so that the model can comprehensively analyze by utilizing the relevance and the difference between the characteristics, thereby accurately predicting the gender of the user and further realizing high accuracy and robustness of gender identification.
In some embodiments, analyzing the user's biometric features using a pre-trained machine learning model based on the second speech data further comprises:
S210, calculating the total syllable number and the voice activity duration in the voice data.
The total number of syllables is the number of syllables included in the speech signal extracted from the speech data. Syllables are the basic building blocks of speech, typically containing a single vowel or a combination of a vowel and a consonant. The voice activity duration refers to the total duration of actual voice content in the voice signal when the user reads the specified text. The voice activity duration does not include silence or background noise portions, only the actual sound production time is calculated.
Specifically, the system first distinguishes between voice activity and silence portions in the voice signal by a Voice Activity Detection (VAD) algorithm and extracts the duration of the voice activity. Next, the system calculates the total number of syllables in the speech data by detecting peaks or syllable changes in the speech. Finally, the system uses this information as a basis to further analyze the speech rate characteristics of the user. The system calculates the total number of syllables generated by the user when speaking the specified text and the duration of the voice activity (namely the duration of the actual sound production of the user when speaking) by analyzing the voice data.
S220, calculating the speech speed characteristic according to the syllable total number and the speech activity duration.
The speech rate feature refers to the number of syllables sent by the user per unit time, and is used for representing the pronunciation rate of the user. The speech rate feature is an important index reflecting the speech rate, and is usually expressed in the form of "syllable/second" or "syllable/minute". The speech rate feature can reflect the user's speaking habits, which are often somewhat age-related. For example, young people typically speak faster, while older people speak slower.
Specifically, the system obtains the speech speed feature through division operation according to the syllable total number and the voice activity duration calculated in the step S210. The calculation formula is speech rate = syllable total number/voice activity duration.
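For instance, steps S210 and S220 may be sketched as follows, using a simple energy threshold as the voice activity detector and onset-strength peaks as a rough syllable count; a dedicated VAD (for example WebRTC VAD) would normally replace the energy heuristic used here.

# Sketch of steps S210-S220: voice activity duration, syllable count, speech rate.
import numpy as np
import librosa

def speech_rate(y: np.ndarray, sr: int) -> float:
    hop = 512
    rms = librosa.feature.rms(y=y, frame_length=1024, hop_length=hop)[0]
    active_frames = rms > 0.1 * rms.max()              # crude VAD threshold
    voice_activity_s = active_frames.sum() * hop / sr  # voice activity duration

    # Syllable nuclei roughly coincide with peaks of the onset-strength envelope.
    onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
    peaks = librosa.util.peak_pick(onset_env, pre_max=3, post_max=3,
                                   pre_avg=3, post_avg=3, delta=0.5, wait=5)
    total_syllables = len(peaks)

    # Speech rate = total syllables / voice activity duration.
    return total_syllables / voice_activity_s if voice_activity_s > 0 else 0.0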
S230, extracting fundamental frequency characteristics from the voice data, and calculating intonation range, intonation ascending mode, intonation descending mode, intonation mean value and intonation standard deviation according to the fundamental frequency characteristics.
The fundamental frequency characteristic, which is referred to herein as the fundamental frequency (F0) in the user's speech signal, reflects the frequency of the vocal cord vibrations, and is typically used to describe the change in pitch. The intonation range refers to the variation range of the fundamental frequency in the voice of the user, and represents the fluctuation of the voice of the user. Intonation ascending and intonation descending patterns represent patterns or trends of fundamental frequency ascending and descending in the user's voice, respectively. The intonation average refers to the average of the fundamental frequencies in the user's speech, reflecting the pitch level of the overall speech. The intonation standard deviation represents the fluctuation range of the fundamental frequency and reflects the stability and the variation amplitude of the voice.
Specifically, the system first extracts fundamental frequency features from speech data by short-time fourier transform (STFT) or other frequency analysis methods, and then calculates intonation ranges, intonation rising and falling patterns, intonation averages and intonation standard deviations from the fundamental frequency features.
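By way of non-limiting illustration, the intonation statistics of step S230 may be derived from the frame-level F0 contour (for example, the contour produced by the pitch tracker sketched earlier) as follows.

# Sketch of step S230: intonation range, rising/falling patterns, mean and
# standard deviation computed from the fundamental-frequency contour.
import numpy as np

def intonation_features(f0: np.ndarray) -> dict:
    f0 = f0[~np.isnan(f0)]          # keep voiced frames only
    if f0.size < 2:
        return {"range": 0.0, "rise_ratio": 0.0, "fall_ratio": 0.0,
                "mean": 0.0, "std": 0.0}
    diffs = np.diff(f0)
    return {
        "range": float(f0.max() - f0.min()),      # intonation range
        "rise_ratio": float((diffs > 0).mean()),  # rising-pattern share
        "fall_ratio": float((diffs < 0).mean()),  # falling-pattern share
        "mean": float(f0.mean()),                 # intonation mean
        "std": float(f0.std()),                   # intonation standard deviation
    }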
S240, merging the intonation range, the intonation ascending mode, the intonation descending mode, the intonation mean value and the intonation standard deviation to obtain intonation characteristics.
Specifically, the system performs a fusion operation such as feature stitching or weighted averaging on the intonation range, intonation ascending mode, intonation descending mode, intonation mean and intonation standard deviation calculated in step S230 to form a unified intonation feature vector. This intonation feature vector comprehensively reflects the intonation characteristics of the user and serves as one component of the input to the subsequent age identification model.
S250, fusing the speech speed feature, the intonation feature, the pitch feature, the formant feature and the frequency spectrum feature into a second feature vector.
Specifically, the system sequentially splices the speech speed feature, the intonation feature, the pitch feature, the formant feature and the frequency spectrum feature generated in the previous steps into a unified feature vector, namely a second feature vector. The vector is fused with various voice characteristic information, and can provide comprehensive characteristic input for a machine learning model, so that the prediction accuracy of the model is improved.
S260, inputting the second feature vector into a pre-trained age identification model, and predicting age features of the user.
Specifically, the system inputs the second feature vector generated in step S250 into a pre-trained age-recognition model that outputs the user' S age prediction result through analysis and processing of these features. The model can be a machine learning algorithm such as a Deep Neural Network (DNN), a Support Vector Machine (SVM), a Random Forest (Random Forest) and the like, and can accurately predict the age characteristics of a user after a large amount of data training.
It should be noted that, the age identification model may be a classification model or a regression model, and if the age identification model is a classification model, the age output result is the age range (e.g. 10-20 years old, 20-30 years old, etc.) to which the user belongs. If the age identification model is a regression model, the output result is a specific age value of the user.
It can be understood that by extracting and fusing various voice features such as speech speed, intonation, pitch, formants, frequency spectrum and the like, a comprehensive second feature vector is constructed and input into a pre-trained age identification model, the system can accurately predict the age features of the user, and therefore individuation and realism of the virtual character are improved. In addition, through fusion processing of multiple features, the system can comprehensively capture physiological and behavioral features in the voice of the user, so that the robustness and adaptability of the age prediction model are improved.
In some embodiments, before the second feature vector is input into the pre-trained age identification model, analyzing the biological characteristics of the user by using a pre-trained machine learning model based on the voice data further comprises selecting an age identification model of the corresponding gender from a preset model library according to the identification result of the user's gender characteristics.
The preset model library refers to a plurality of age identification model sets stored in the system in advance, and the models are specially trained and optimized according to different gender characteristics so as to adapt to the difference of the male and female in the voice characteristics.
Specifically, the system first selects an age identification model corresponding to the sex from a preset model library according to the sex characteristics of the user (such as male or female). The purpose of this is to use models specifically trained for different genres to improve the accuracy of age predictions. Because men and women have significant differences in speech characteristics (e.g., pitch, intonation, pace, etc.), the use of models of different sexes can better capture these differences, thereby providing more accurate age predictions.
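By way of non-limiting illustration, the preset model library and the gender-specific age prediction may be sketched as follows; the model file names are assumptions of this example only.

# Sketch: select the age model matching the recognized gender, then predict
# from the second feature vector.
import joblib

AGE_MODEL_LIBRARY = {"male": "age_model_male.joblib",
                     "female": "age_model_female.joblib"}

def predict_age(second_feature_vector, gender: str):
    model = joblib.load(AGE_MODEL_LIBRARY[gender])
    # A classification model returns an age-range label (e.g. "20-30");
    # a regression model returns a specific age value.
    return model.predict(second_feature_vector.reshape(1, -1))[0]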
In some embodiments, when collecting the second voice data of the specified text, a microphone array comprising at least two microphones distributed at different heights and positions is used to collect the second voice data while the user reads the specified text.
Specifically, based on the configuration of the microphone array, the system can infer the physical height of the user by analyzing the time difference of arrival of sound at the different microphones and fusing it with the preliminary predicted height to derive a more accurate final height of the user.
Analyzing the user's biometric characteristics using a pre-trained machine learning model based on the second speech data, further comprising:
S310, inputting the second feature vector into a pre-trained region identification model, and predicting the region where the user is located.
The pre-trained region recognition model is a machine learning model which is trained on large-scale labeling data and can predict the region where the user is located according to the input feature vector.
It is worth noting that due to the diversity of languages and different regional cultures, people in different regions may exhibit different patterns in speech characteristics. For example, some regions may have faster speech, and some dialects may have specific formant frequencies, which may be reflected in feature vectors. And the second feature vector is a comprehensive feature vector generated by fusing various speech features (such as speech speed, intonation, pitch, formants, frequency spectrum, etc.). These features may reflect speech features of the user's pronunciation habits, intonation patterns, speech speed, pitch variation, etc., which tend to vary significantly in different geographic areas.
Then, in the model training process, the model can be pertinently made to learn the mapping relation between the second feature vector and the region where the user is located. For example, certain feature vector patterns may correspond to particular regions. Thus, by inputting new feature vectors, the model is able to predict regions from which the user may come based on learned patterns and rules. Returning to the technical scheme of the application, by inputting the second feature vector into the pre-trained region recognition model, the model predicts the geographic position or cultural area of the user by analyzing the voice features, and further accurately deduces the region where the user is located.
S320, obtaining corresponding crowd statistical heights from a preset height database according to the gender characteristics, the age characteristics and the region where the user is located.
The preset height database refers to a database stored in the system and containing crowd height statistical data of different sexes, age groups and areas. The database is constructed based on extensive demographic data, and can reflect the height distribution conditions of different groups.
Specifically, the system extracts statistical height data corresponding to the user characteristics from a preset height database according to the gender characteristics, the age characteristics and the region of the user. These data typically include statistical information such as average height, median height, standard deviation, etc.
S330, predicting the preliminary height of the user according to the crowd statistical height and the pre-constructed statistical model.
The crowd statistical height refers to statistical data extracted from a preset height database, and reflects the height distribution of the crowd matched with the user characteristics.
The pre-built statistical model refers to a mathematical model established based on large-scale data analysis and regression techniques for predicting individual characteristics (e.g., height) from the statistical data.
Alternatively, the statistical model may be any one of a linear regression model, a logistic regression model, a polynomial regression model, a Bayesian statistical model, principal component analysis, random forest, K-means clustering, a naïve Bayes classifier, a survival analysis model, or a hidden Markov model.
Specifically, the system calculates by using a pre-constructed statistical model according to the crowd statistical height data extracted from the database and the gender, age and region characteristics of the user, and outputs a preliminary height predicted value conforming to the statistical rule.
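By way of non-limiting illustration, steps S320 and S330 may be sketched as follows. The database layout and the numeric entries are hypothetical placeholders, and the group mean stands in for any of the statistical models listed above.

# Sketch of steps S320-S330: demographic lookup and preliminary height estimate.
HEIGHT_DB = {  # illustrative placeholder entries, not real statistics
    ("male", "20-30", "region_A"): {"mean_cm": 175.0, "std_cm": 6.0},
    ("female", "20-30", "region_A"): {"mean_cm": 162.0, "std_cm": 5.5},
}

def preliminary_height(gender: str, age_group: str, region: str) -> float:
    stats = HEIGHT_DB[(gender, age_group, region)]
    # A regression or Bayesian model from the listed options could refine this;
    # the group mean is the simplest statistical estimate.
    return stats["mean_cm"]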
It can be understood that the system predicts the preliminary height through the pre-constructed statistical model due to the fact that the region where the user is located is predicted according to the voice characteristics and the crowd statistical height is obtained from the preset database by combining the gender, the age and the region information, so that the accuracy and the individuation degree of the height prediction are effectively improved. In addition, by integrating the multidimensional data, the system can provide a height prediction result which is more in line with the actual situation according to the specific characteristics of the user. Meanwhile, according to the scheme, the statistical model is used for height prediction, so that errors caused by individual differences are reduced, and the height prediction is more stable and reliable.
S340, calculating the head position of the user according to the time difference of the user sound reaching different microphones.
The time difference of the user's voice reaching the different microphones refers to the time difference of the voice signal sent by the user reaching the different microphones in the microphone array. Due to the limited speed of sound propagation in air, the time for the speech signal to reach different microphones from the user's head position may vary.
The head position of the user refers to the position coordinates of the head of the user in the three-dimensional space, which are deduced by calculating the time difference of arrival of sound at different microphones.
Specifically, the system first measures the time of arrival of the user's voice at each microphone and then uses the time-difference-of-arrival (TDOA) method or triangulation to calculate the user's head position. By analyzing these time differences, the system is able to determine the relative position (including height, horizontal position, etc.) of the user's head with respect to the microphone array.
S350, calculating the physical height of the user according to the head position of the user.
Wherein, the physical height refers to the actual height of the user from the top of the head to the ground, and reflects the actual height of the user.
Specifically, the system calculates the physical height of the user based on the head position of the user, in combination with the known positions of the microphones in the microphone array.
For example, if the distance of the user's head height relative to the ground is known, the system may take it as an estimate of the physical height. The system may apply some adjustment or calibration to ensure accuracy of the physical height, considering that the user may be in different standing or sitting positions.
Illustratively, assume that the user's head position is expressed as (x_head, y_head, z_head) and the position of the lowest microphone in the microphone array is (x_mic, y_mic, z_mic). The head height h_head of the user is the coordinate value of the head position in the vertical direction, i.e., z_head.
Then, considering the position of the user's head relative to the microphone array (an offset in the horizontal direction may affect the measurement accuracy), the projection of the user's head position relative to the microphone array in the vertical direction may be calculated as:
h_head = z_head - z_mic, where z_mic is the height coordinate of the lowest microphone and z_head is the height coordinate of the user's head.
Further, the physical height H_user of the user can be calculated by the following formula:
H_user = h_head + H_mic, where H_mic is the distance from the lowest microphone to the ground and h_head is the distance from the user's head to the lowest microphone.
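By way of non-limiting illustration, the following sketch combines a cross-correlation time-delay estimate (the core measurement behind the TDOA step) with the height formulas above; the full three-dimensional multilateration of the head position from several microphone pairs is omitted, and z_head is assumed to be its result.

# Sketch: time difference of arrival between two microphones, and the
# physical-height formulas h_head = z_head - z_mic, H_user = h_head + H_mic.
import numpy as np

def tdoa_seconds(sig_a: np.ndarray, sig_b: np.ndarray, sr: int) -> float:
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)   # lag in samples
    return lag / sr

def physical_height(z_head: float, z_mic: float, h_mic_to_ground: float) -> float:
    h_head = z_head - z_mic            # head height above the lowest microphone
    return h_head + h_mic_to_ground    # H_user = h_head + H_mic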
S360, fusing the physical height with the preliminary height to obtain the final height of the user.
The final height is the comprehensive height of the user obtained by combining the physical height and the preliminary height, and the result of physical measurement and statistical prediction is combined.
Specifically, the system fuses the physical height and the preliminary height, and common methods include weighted average, bayesian inference, and the like. By means of the fusion mode, the system can comprehensively consider advantages and disadvantages of the two methods, and more accurate user height estimation is obtained. For example, the physical height and the preliminary height may be added by a weight to balance the results based on physical measurements and statistical predictions to arrive at the final height of the user.
In some embodiments, fusing the physical height with the preliminary height to obtain a final height of the user comprises:
S361, according to a predefined prior distribution and likelihood functions, combining the physical height and the preliminary height to calculate the posterior distribution of the user's height, wherein the posterior distribution is calculated as:
P(H | H_physical, H_pred) ∝ P(H_physical | H) · P(H_pred | H) · P(H), where
P(H) is the prior distribution, which follows a normal distribution with mean mu_prior, the average height based on demographics, and variance sigma_prior^2, the variance of height;
P(H_physical | H) is the physical-height likelihood function, where H_physical is the physical height predicted from the microphone array and sigma_physical^2 is the variance of the physical prediction error;
P(H_pred | H) is the preliminary-height likelihood function, where H_pred is the preliminary height and sigma_pred^2 is the error variance of the statistical model;
S362, using the mean of the posterior distribution as the final height. With the Gaussian prior and Gaussian likelihoods above, the mean of the posterior distribution is calculated as:
H_final = (H_physical / sigma_physical^2 + H_pred / sigma_pred^2 + mu_prior / sigma_prior^2) / (1 / sigma_physical^2 + 1 / sigma_pred^2 + 1 / sigma_prior^2).
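By way of non-limiting illustration, with the Gaussian prior and Gaussian likelihoods defined above, the posterior mean reduces to the precision-weighted average sketched below.

# Sketch of steps S361-S362: precision-weighted fusion of the physical height,
# the preliminary height and the demographic prior.
def fuse_heights(h_physical, h_pred, mu_prior,
                 var_physical, var_pred, var_prior):
    w_phys, w_pred, w_prior = 1 / var_physical, 1 / var_pred, 1 / var_prior
    return ((w_phys * h_physical + w_pred * h_pred + w_prior * mu_prior)
            / (w_phys + w_pred + w_prior))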
S370, taking the final height as the height characteristic of the user.
Specifically, the system takes the finally fused height as the height characteristic of the user. This final height feature will be used to personalize the creation of the virtual character, making the created virtual character closer to the user's actual height, thereby improving the realism of the virtual experience.
It will be appreciated that since the microphone array is used in collecting speech data and the head position of the user is calculated from the time difference of arrival of sound at the different microphones, the system is able to infer the physical height of the user based on the actual measurement results, thereby improving the accuracy of height prediction. By fusing the physical height with the preliminary predicted height, the system further eliminates the limitations of a single prediction method, and finally obtains more accurate user height characteristics. In addition, the method combining physical measurement and statistical prediction not only improves the precision of the system, but also enhances the individuation and realism of the virtual character, thereby providing a more immersive interactive experience for the user.
In some embodiments, the specified text comprises specified lines in preset scenario content. The preset scenario content is displayed to the user when the virtual character of the somatosensory dance game is created; when displayed, the preset scenario content is divided into a plurality of scenario segments, and the scenario segments are displayed in sequence according to a preset arrangement strategy. The arrangement strategy is designed based on a preset character test method so as to prompt the user, through multi-dimensional situational changes, to express themselves across multiple character dimensions.
The preset scenario content refers to a section of scene or story content preset according to a game scenario, and is used for guiding a user to enter a specific situation so as to guide the user to express a specific emotion or attitude.
The division of preset scenario content refers to splitting the complete scenario content into a plurality of logically or contextually relatively independent scenario segments, each scenario segment having a specific contextually or emotion expression target.
The specified lines refer to specific text contents which are required to be read by the user in preset scenario contents, and the speech expression of the user can be guided through the lines so as to facilitate subsequent character analysis.
The arrangement strategy refers to a preset sequence in which the segments are displayed, usually designed according to the requirements of the personality test method, so as to ensure that the sequence of segments can effectively elicit the user's expression in different personality dimensions. For example, the arrangement strategy may order the segments by increasing emotional intensity, increasing decision complexity, increasing emotional contrast, increasing context complexity, increasing stress, varying cultural and social context, varying language style, introducing conflicts of personal values, and the like.
The character test method refers to a standardized method for evaluating the character characteristics of a user, such as the MBTI or the Big Five personality trait model. The arrangement strategy is designed according to these methods so as to fully elicit the user's expression across multiple personality dimensions through multi-dimensional situational changes. For example, expression in the extraversion/introversion dimension may be elicited by different social contexts, and expression in the thinking/feeling dimension may be triggered by different decision contexts.
Multidimensional context change refers to inspiring a user's expression in multiple personality dimensions by exhibiting different types of contexts (e.g., social, decision, emotion, etc.). The context change of each segment design can prompt the user to present different sides of their personality characteristics in different environments, thereby providing rich data support for personality analysis.
Personality dimensions are different aspects or dimensions used by psychological and personality tests to describe and classify personal personality characteristics. These dimensions generally reflect the propensity of a person to behave, think and emotion, and are an important basis for analyzing and understanding individual personality. The definition and classification of character dimensions may vary among character testing methods, but they are generally described by a set of opposing features.
The following are several common personality test methods and their interpretation of personality dimensions:
1. MBTI (Myers-Briggs Type Indicator), a personality type assessment tool developed based on Carl Jung's theory of psychological types, uses four dichotomous personality dimensions to describe a person's personality: Extraversion (E) vs. Introversion (I), Sensing (S) vs. Intuition (N), Thinking (T) vs. Feeling (F), and Judging (J) vs. Perceiving (P). 2. The Big Five personality model (Big Five Personality Traits), a personality assessment model widely used in psychology, uses five independent dimensions to describe a person's personality. 3. Jung's theory of psychological types: Carl Jung's theory of psychological types is the basis of the MBTI, but it focuses more on the classification of intrinsic psychological processes.
Specifically, upon requesting the user to read the specified text, the system may present the user with a pre-set scenario content associated with the game story, visually or audibly presenting the clip to the user. When preset scenario contents are displayed to a user, the system divides the scenario contents into a plurality of scenario segments, and sequentially displays the scenario segments according to an arrangement strategy designed by a character testing method. Thus, through the multi-dimensional situation change, the system can guide the user to make diversified expressions in different character dimensions and extract corresponding voice features from the expressions, so as to provide more comprehensive and accurate data support for character analysis. The method can capture all aspects of the character of the user more carefully, and further improve the personalized matching degree of the virtual character.
It should be noted that, when creating the virtual character, the scenario content may be presented to the user as an independent unit in order to collect character-related speech data in a targeted manner, or only the specified lines may be presented to the user, because the speech data generated when the user reads the specified lines is sufficient to analyze the user's biological characteristics and timbre model.
It will be appreciated that with well-designed scenario content, the system may trigger specific emotional expressions, such as agitation, calm, tension, etc., of the user when speaking a specified speech. Such emotional expressions can significantly affect speech characteristics, enabling the system to more accurately capture and analyze the personality type of the user. Meanwhile, by displaying scenario contents of different situations, the system can collect voice data of the user in various situations. This helps to obtain richer speech features, thereby improving accuracy and comprehensiveness of character analysis.
In some embodiments, the multi-dimensional context changes include at least one of emotion intensity changes, context complexity changes, role type changes, decision difficulty changes, socio-cultural background changes, language style changes, stress context changes, personal value perspective conflicts, and time sequence.
It will be appreciated that through these multi-dimensional context changes, the system is able to excite the user's personality profile in multiple ways, thereby analyzing the user's personality type more comprehensively. The dimension of the change not only enriches the voice characteristic data of the user, but also provides multi-level and multi-angle basis for character analysis, so that the generation of the virtual character is more accurate and personalized.
In some embodiments, the displaying of the preset scenario content to the user further comprises selecting matching scenario content from a preset scenario library according to the gender characteristics of the user.
It can be appreciated that such gender-based selection of scenario content helps ensure that the user's speech expression is more natural and authentic, thereby providing more reliable data support for personality analysis. In addition, gender-matched scenario content can enhance the user's immersion, making it easier for the user to enter the role and situation and display his or her personality characteristics more accurately.
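Purely for illustration, the sketch below shows one way such a gender-aware lookup could be organized; the library layout (gender keys plus a neutral fallback) is an assumption rather than part of the described embodiment.

```python
# Hedged sketch: select scenario segments from a preset library by the user's gender.
def pick_scenario_segments(scenario_library: dict, user_gender: str) -> list:
    """Return segments matching the user's gender, falling back to
    gender-neutral segments when no gendered variant exists."""
    return scenario_library.get(user_gender) or scenario_library.get("neutral", [])
```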
In some embodiments, analyzing character features of the user using a pre-trained machine learning model based on the second speech data comprises:
S410, extracting corresponding voice features from voice data of each scenario segment according to preset character dimension analysis requirements, wherein the character dimension analysis requirements of each scenario segment are determined according to a preset arrangement strategy.
The voice data of a scenario segment refers to the voice signal collected by the voice input device while the user reads the specified lines of that scenario segment. Because the scenario content is divided into a plurality of scenario segments, the system can store, in a segmented manner, the voice data collected as the user reads the specified lines in each scenario segment.
The character dimension analysis requirements are determined based on the specific character test method (e.g., MBTI, the Big Five personality model, DISC). For example, if the MBTI personality type test method is employed, the character dimensions include Extraversion (E), Introversion (I), Sensing (S), Intuition (N), and so on. Based on the analysis requirements of those dimensions for each scenario segment, the system extracts the speech features associated with each specific character dimension, such as speech rate, intonation, and volume.
Specifically, the system first displays the designed scenario segments in sequence according to a preset arrangement strategy. The presentation content of each segment has been customized to the character dimension analysis requirements (e.g., segment content may involve a social context if the target is the extraversion/introversion dimension, or a decision context if the target is the judging/perceiving dimension). Thus, the user's performance in each segment directly reflects his or her characteristics in the corresponding character dimension. After the system collects the voice data, it extracts the corresponding features from the voice data according to the preset character dimension analysis requirements, ensuring that the analysis is targeted and effective. For example, the system may focus on speech rate, volume, and intonation changes when analyzing the extraversion/introversion dimension, and on the stability of intonation and the richness of emotional expression when analyzing the thinking/feeling dimension.
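As a non-limiting illustration of this extraction step, the sketch below computes the three example features named above (speech rate, volume, and intonation variation) from one scenario segment's recording. It assumes the librosa library, a 16 kHz sample rate, and an onset-rate proxy for speech rate; none of these choices are mandated by the embodiment.

```python
# Hedged sketch: per-segment speech features for character dimension analysis.
import numpy as np
import librosa

def extract_segment_features(wav_path: str) -> dict:
    y, sr = librosa.load(wav_path, sr=16000)
    duration = len(y) / sr

    # Volume: root-mean-square energy over the segment.
    rms = librosa.feature.rms(y=y)[0]

    # Intonation: fundamental-frequency track and its variability.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = f0[~np.isnan(f0)]

    # Speech-rate proxy: acoustic onsets (syllable-like events) per second.
    onsets = librosa.onset.onset_detect(y=y, sr=sr)

    return {
        "mean_volume": float(np.mean(rms)),
        "pitch_mean": float(np.mean(f0)) if f0.size else 0.0,
        "pitch_std": float(np.std(f0)) if f0.size else 0.0,
        "speech_rate": len(onsets) / duration if duration else 0.0,
    }
```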
S420, selecting an applicable machine learning model to analyze the extracted voice features according to different character dimensions so as to generate the personality features of the user in each character dimension.
Here, different personality test methods (e.g., MBTI, the Big Five personality model, DISC) require different machine learning models for analysis. For example, an MBTI personality type test requires a particular model to analyze the user's extraversion and introversion, while the Big Five personality model requires other models to evaluate the user's openness and the other Big Five dimensions.
Specifically, after extracting the corresponding voice features, the system selects the machine learning model best suited to each character dimension for analysis. For example, the system may analyze speech rate and volume data using a Support Vector Machine (SVM) model to generate the user's personality characteristics in the extraversion/introversion dimension. Likewise, for the thinking/feeling dimension, the system may select a neural network model to handle intonation changes and emotion recognition features. Through the analysis of these models, the system generates the user's specific performance features in each character dimension.
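The following sketch illustrates, under assumed feature layouts and training data, the idea of pairing each character dimension with its own model: an SVM for the extraversion/introversion dimension and a small neural network for the thinking/feeling dimension, as the example above suggests. It is an illustration, not the embodiment's actual model suite.

```python
# Hedged sketch: one machine learning model per character dimension.
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

DIMENSION_MODELS = {
    "E/I": SVC(probability=True),                                  # speech rate / volume features
    "T/F": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500),  # intonation / emotion features
}

def fit_dimension_models(train_features: dict, train_labels: dict) -> None:
    # train_features["E/I"] has shape (n_samples, n_features); labels are 0/1 poles.
    for dim, model in DIMENSION_MODELS.items():
        model.fit(train_features[dim], train_labels[dim])

def score_dimensions(user_features: dict) -> dict:
    # For each dimension, return the probability assigned to class 1
    # (interpreted here as the first pole, e.g. "E" or "T").
    return {
        dim: float(model.predict_proba(user_features[dim].reshape(1, -1))[0, 1])
        for dim, model in DIMENSION_MODELS.items()
    }
```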
S430, generating the overall character type of the user as the character characteristics of the user based on the character characteristics of each character dimension.
Wherein, the personality characteristics of each personality dimension refer to the performance results of the user in different personality dimensions obtained through the machine learning model analysis in step S420.
The character type refers to the user's character classification result under a specific character test method (e.g., an MBTI personality type, a Big Five profile, a DISC character type). The overall character type is the comprehensive classification result formed by integrating the user's performance in each character dimension, for example a particular MBTI type (such as INTJ or ESFP).
Specifically, after obtaining the user's personality characteristics in each character dimension, the system generates the user's overall character type by comprehensively analyzing these characteristics. For example, the system may integrate the personality characteristics of each dimension by means of concatenation, weighted averaging, rule-based decision, or the like, to obtain the user's character type result under MBTI or another character test method.
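A minimal rule-based decision sketch of this integration step is shown below for the MBTI case: each dimension's score (taken here as the probability of the first pole) is thresholded at 0.5 and the resulting letters are concatenated. The threshold and score convention are assumptions for illustration.

```python
# Hedged sketch: rule-based integration of per-dimension scores into an MBTI-style type.
POLES = {"E/I": ("E", "I"), "S/N": ("S", "N"), "T/F": ("T", "F"), "J/P": ("J", "P")}

def overall_character_type(dimension_scores: dict) -> str:
    letters = []
    for dim in ("E/I", "S/N", "T/F", "J/P"):
        first, second = POLES[dim]
        # Score is the probability of the first pole; 0.5 is an illustrative cut-off.
        letters.append(first if dimension_scores.get(dim, 0.5) >= 0.5 else second)
    return "".join(letters)

# Example: overall_character_type({"E/I": 0.2, "S/N": 0.3, "T/F": 0.8, "J/P": 0.7}) -> "INTJ"
```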
It can be understood that, by extracting corresponding voice features from the voice data of each scenario segment and combining them with the preset segment arrangement strategy, the system can accurately collect and analyze data for different character dimensions, thereby improving the accuracy and comprehensiveness of character analysis. By selecting an applicable machine learning model to analyze the extracted voice features, the system can generate the user's personality characteristics in each character dimension, ensuring that the character analysis results are scientific and reliable. Finally, by integrating the personality characteristics of each character dimension, the system generates the user's overall character type, thereby providing the user with a personalized virtual character experience that fits his or her character.
In some embodiments, selecting an applicable machine learning model to analyze the extracted speech features according to different personality dimensions to generate personality characteristics of the user in each personality dimension includes:
S510, extracting voice characteristics related to the current character dimension from voice data of other scenario segments when analyzing the character dimension of the current scenario segment.
Where the current scenario segment refers to the particular scenario segment currently being analyzed, the design of which is intended to motivate the user to behave in a particular personality dimension.
Other scenario segments refer to other scenario segments preset in the system that are different from the current scenario segment, which may involve the user's performance in other contexts.
Character dimension-related speech features refer to speech features that reflect the user's performance in a particular character dimension, such as speech rate, volume, intonation variations, and the like.
Specifically, when the system analyzes the character dimension of the current scenario segment, it considers not only the speech features extracted from the current segment but also checks the voice data of other scenario segments for features related to the analysis requirements of the current character dimension. For example, if the current scenario segment primarily targets the extraversion/introversion dimension, the system may extract related features such as speech rate and volume from other segments to supplement the analysis data of the current segment. This cross-segment feature extraction helps construct a more complete and accurate profile of the user's personality.
S520, combining, by weighting, the voice features extracted from the current scenario segment with the related voice features extracted from other scenario segments.
Here, weighted integration refers to assigning different weights to the voice features according to their sources. Optionally, higher weights may be given to the speech features of the current segment and lower weights to the speech features of the other segments, and these weighted features are then integrated into one comprehensive feature vector.
Specifically, after the system acquires the voice features of the current scenario segment, the features are regarded as main data sources, and higher weight is given to the features, because the current segment is designed for the character dimension, and the relevance and the accuracy of the features are the highest. At the same time, the system also extracts auxiliary voice features related to the current character dimension from other scenario segments, but the weights of the features are low so as to ensure that the analysis results are not excessively affected in the integration process. Finally, the system integrates the weighted features to generate a comprehensive feature vector for more accurate character analysis.
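The sketch below illustrates one possible form of this weighted integration; the 0.7/0.3 weights and the averaging of auxiliary segments are assumptions chosen only to make the idea concrete.

```python
# Hedged sketch: weight the current segment's features above related features
# drawn from other segments, then build one comprehensive feature vector.
import numpy as np

def weighted_integration(current_feats: np.ndarray,
                         other_feats: list,
                         w_current: float = 0.7,
                         w_other: float = 0.3) -> np.ndarray:
    # Average the auxiliary segments first so that no single long segment dominates.
    if other_feats:
        aux = np.mean(np.stack(other_feats), axis=0)
    else:
        aux = np.zeros_like(current_feats)
    return np.concatenate([w_current * current_feats, w_other * aux])
```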
It can be understood that, by giving higher weight to the speech features of the current scenario segment, the system ensures that character analysis depends primarily on feature data with an explicit design goal, thereby improving the accuracy of the analysis. By extracting related features from other segments and assigning them lower weights, the system can supplement and refine the character analysis data, thereby improving the comprehensiveness and consistency of the analysis. Overall, this weighted integration strategy makes the character analysis more accurate and the generated personality characteristics more reliable.
S530, inputting the weighted and integrated voice characteristics into a corresponding machine learning model to generate personality characteristics of the user in each personality dimension.
Specifically, the system inputs the weighted and integrated speech features into a pre-trained machine learning model. The model has been specifically optimized for the character dimension analysis requirements of the current scenario segment (e.g., the extraversion/introversion or thinking/feeling dimension), and it generates the user's personality characteristics in that character dimension by analyzing the input comprehensive feature vector. This result helps to further understand and define the user's overall personality type and can be used for subsequent virtual character generation and personalization settings.
It can be understood that, when analyzing the character dimension of the current scenario segment, the system extracts not only the voice features of the current segment but also the voice features related to the current character dimension from other scenario segments, so that the system can more comprehensively capture the user's personality expressions in different contexts and thereby provide richer data support for the analysis of the current character dimension. In addition, by giving higher weight to the voice features extracted from the current scenario segment and lower weight to the related voice features extracted from other segments, the system can effectively integrate information from different segments, ensure that character analysis depends mainly on feature data with a definite target, and further improve the accuracy and consistency of the character analysis.
In some embodiments, using a pre-trained machine learning model to analyze the second speech data and generate a timbre model matching the timbre of the user includes:
S610, preprocessing the second voice data, wherein the preprocessing comprises cleaning background noise, removing non-voice components, and extracting key timbre features of the user's voice.
The background noise cleaning means that environmental noise and noise possibly affecting analysis in the voice data are removed through an audio processing technology, so that the purity of the voice signal is ensured.
The removal of non-voice components refers to removing irrelevant parts in voice data, such as silence segments, breathing sounds, cough sounds and the like, so that only effective voice information of a user is contained in analysis.
Key tone feature extraction refers to extracting features critical to tone modeling, such as Pitch (Pitch), timbre (Timbre), intonation (Prosody), formants (Formants), etc., from the preprocessed speech data.
In particular, the system may use filters (e.g., low pass filters, high pass filters) or other noise suppression algorithms to remove background sounds in the environment. The system then recognizes and removes non-speech components in the speech signal, such as silence, respiratory sounds, and the like. Next, the system extracts key tone features using an audio analysis tool. For example, the system may extract pitch by autocorrelation methods or fourier transforms, extract timbre features by Linear Predictive Coding (LPC) or mel-frequency cepstral coefficients (MFCCs).
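A minimal sketch of this preprocessing chain is given below, under the assumptions that noise handling is approximated by pre-emphasis filtering, non-speech removal by energy-based silence splitting, and the key timbre features by a pitch track plus MFCCs; a production system would likely use stronger noise suppression.

```python
# Hedged sketch: S610-style preprocessing and key timbre feature extraction.
import numpy as np
import librosa

def preprocess_timbre_features(wav_path: str) -> dict:
    y, sr = librosa.load(wav_path, sr=16000)

    # Rough noise handling: pre-emphasis lifts speech energy over low-frequency rumble.
    y = librosa.effects.preemphasis(y)

    # Remove non-speech components: keep only intervals above an energy threshold.
    intervals = librosa.effects.split(y, top_db=30)
    if len(intervals):
        y = np.concatenate([y[start:end] for start, end in intervals])

    # Key timbre features: fundamental-frequency track and MFCCs (a common timbre proxy).
    f0, _, _ = librosa.pyin(y, fmin=80, fmax=400, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    return {"pitch_track": f0[~np.isnan(f0)], "mfcc_mean": mfcc.mean(axis=1)}
```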
S620, inputting the preprocessed voice features into a pre-trained voice machine learning model.
Here, the pre-trained speech machine learning model is capable of recognizing and processing speech features and generating a corresponding timbre representation. The model may be a deep neural network (e.g., a convolutional neural network (CNN) or a long short-term memory network (LSTM)) or another machine learning algorithm suitable for processing speech data.
Specifically, the system feeds the timbre features extracted in S610 as input data into the pre-trained machine learning model. Through its layered structure, the model further processes and analyzes these features to identify the user's unique timbre characteristics. For example, the model may analyze the pitch pattern, the brightness or darkness of the timbre, the prosody of the intonation, and the like, to generate a high-dimensional feature vector or set of parameters that matches the user's voice characteristics.
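Purely as an illustration of this stage, the following sketch defines a small LSTM encoder (one of the architecture families mentioned above) in PyTorch that maps a sequence of MFCC frames to a fixed-length, L2-normalized timbre embedding; the layer sizes are arbitrary and the model would need pre-training on speech data to be useful.

```python
# Hedged sketch: an LSTM-based encoder turning MFCC frames into a timbre embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimbreEncoder(nn.Module):
    def __init__(self, n_mfcc: int = 13, hidden: int = 128, emb_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mfcc_frames: torch.Tensor) -> torch.Tensor:
        # mfcc_frames: (batch, time, n_mfcc) -> embedding: (batch, emb_dim)
        _, (h_n, _) = self.lstm(mfcc_frames)
        return F.normalize(self.proj(h_n[-1]), dim=-1)

# Usage with a dummy 2-second clip of 13-dimensional MFCC frames:
# embedding = TimbreEncoder()(torch.randn(1, 200, 13))
```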
S630, generating a timbre model matched with the user's timbre based on the user's voice features and the output of the pre-trained model.
Specifically, the system matches and optimizes the output of the pre-trained model (e.g., the high-dimensional feature vector) against the user's original speech features to generate the timbre model. The timbre model may contain a number of parameters, such as pitch range, timbre adjustment coefficients, and formant frequencies, used to describe the key features of the user's voice. The generated timbre model not only reflects the user's natural voice characteristics but can also be used in the subsequent text-to-speech (TTS) process to ensure that the virtual character's voice is highly consistent with the user's real timbre. After the model is generated, the system can verify and fine-tune it by comparing the similarity between the original voice data and the generated voice, so as to improve the precision of the model and its degree of match with the user.
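The verification and fine-tuning step mentioned above could, for example, compare embeddings of the original recording and of speech synthesized from the timbre model, as in the hedged sketch below; the cosine-similarity measure and the 0.85 threshold are illustrative assumptions.

```python
# Hedged sketch: flag the timbre model for further tuning when the synthesized
# voice drifts too far from the user's original voice embedding.
import numpy as np

def needs_finetuning(user_embedding: np.ndarray,
                     synthesized_embedding: np.ndarray,
                     threshold: float = 0.85) -> bool:
    cosine = float(np.dot(user_embedding, synthesized_embedding) /
                   (np.linalg.norm(user_embedding) * np.linalg.norm(synthesized_embedding) + 1e-8))
    return cosine < threshold
```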
It will be appreciated that, by preprocessing the collected user voice data, extracting key timbre features, and inputting them into a pre-trained machine learning model for analysis and generation, the system is able to generate a timbre model that is highly matched to the user's timbre. This timbre model accurately reflects the user's voice characteristics, thereby providing a foundation for the virtual character's voice generation, ensuring natural and authentic voice expression by the virtual character, and significantly improving the user's immersion and personalized experience. In addition, matching and optimizing the generated timbre model further improves its similarity to the user's voice and strengthens the emotional connection between the virtual character and the user.
In some embodiments, generating an avatar image of the virtual character based on the biometric features and the character features includes:
S710, selecting a virtual character template corresponding to the biological characteristics and the character types from a preset template library according to the biological characteristics and the character characteristics of the user.
The preset template library comprises a plurality of virtual character templates, each corresponding to specific biometric features and character types. For example, the template library may contain character templates for different heights, body types, and genders, with styles (e.g., lively, steady, mysterious) suited to different character types. The system selects from the template library the virtual character template that best matches the user's biometric features (e.g., height, gender, skin tone) and personality features (e.g., MBTI type, Big Five scores).
Specifically, the system first screens out the templates matching the user's physical attributes according to the user's biometric features (such as age, height, and gender), then further narrows the selection according to the user's character features (such as extraversion and emotional stability), and finally determines the most suitable virtual character template. This process ensures that the base image of the virtual character is consistent with the user's real features.
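The following sketch shows one way such a two-stage filter (biometrics first, then character type) could be written; the template fields and the fallback behaviour are assumptions, not the embodiment's actual template schema.

```python
# Hedged sketch: pick a virtual character template by biometrics, then by character type.
from dataclasses import dataclass, field

@dataclass
class CharacterTemplate:
    name: str
    gender: str
    height_range: tuple                  # (min_cm, max_cm)
    personality_types: set = field(default_factory=set)

def select_template(templates: list, gender: str, height_cm: float, character_type: str):
    # Stage 1: keep templates consistent with the user's physical attributes.
    by_body = [t for t in templates
               if t.gender == gender and t.height_range[0] <= height_cm <= t.height_range[1]]
    # Stage 2: prefer templates designed for the user's character type.
    by_character = [t for t in by_body if character_type in t.personality_types]
    return (by_character or by_body or [None])[0]
```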
S720, mapping the user's biometric features to the physical attributes of the virtual character, and mapping the character features to the appearance attributes of the virtual character.
Here, the mapping process refers to materializing the user's biometric and character features as the physical and appearance attributes of the virtual character. For example, the user's height may be mapped directly to the height attribute of the virtual character, while a character feature such as extraversion may be mapped to a brighter, more striking hairstyle or hair color.
Specifically, the system adjusts the physical attributes of the virtual character according to the user's biometric features such as height, gender, and face shape, so that the character's height, body shape, facial contour, and the like match the user's actual characteristics. At the same time, the system adjusts the appearance details of the character according to the user's character features, for example choosing bright hair colors and lively hairstyles for extraverted users, or designing softer, more subdued makeup and styling for introverted users. This mapping process ensures that the virtual character is not only physically similar to the user but also reflects the user's personality characteristics.
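A minimal sketch of this dual mapping follows; the attribute names and the rule that ties extraversion to brighter styling are illustrative assumptions only.

```python
# Hedged sketch: biometric values become physical attributes, while the
# extraversion/introversion letter drives appearance choices.
def map_to_avatar_attributes(biometrics: dict, character_type: str) -> dict:
    extraverted = character_type.startswith("E")
    return {
        # Physical attributes copied from the biometric estimates.
        "height_cm": biometrics["height_cm"],
        "gender": biometrics["gender"],
        "face_shape": biometrics.get("face_shape", "oval"),
        # Appearance attributes derived from character features.
        "hair_style": "bright_voluminous" if extraverted else "soft_low_key",
        "outfit_palette": "vivid" if extraverted else "muted",
    }
```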
S730, generating the virtual character image matched with the user according to the mapping result.
Specifically, the system integrates all mapped attributes to generate a final avatar image. This character image will have biological characteristics of the user's height, body shape, skin, etc., while the appearance details such as hairstyle, color, clothing shape, etc. will also reflect the user's personality characteristics. The generated virtual character can be used in somatosensory dance games to perform interaction and game operation on behalf of a user, so that the user can feel high individuation and reality in a virtual environment.
It will be appreciated that the virtual character is generated by dual mapping of the biometric and personality characteristics such that the character is highly matched to the user's real characteristics in both physical appearance and personality appearance. The method not only improves individuation and sense of reality of the virtual character, but also enhances the immersion experience of the user in the game.
In some embodiments, generating speech for the virtual character based on the timbre model includes:
S810, combining the text or game script input by the user with the tone model through a text-to-speech conversion technology to generate virtual character speech matched with the tone characteristics of the user.
The text or game script input by the user refers to text content which is directly input by the user or preset in the game context, and the system needs to convert the content into the voice of the virtual character.
Specifically, the system may invoke TTS technology when the user enters text (such as dialog content) in the game or when the game script requires the character to speak. The TTS system uses the previously generated timbre model to convert the text content into speech output consistent with the user's voice characteristics. The parameters in the timbre model (e.g., pitch, intonation, timbre) directly affect the characteristics of the generated speech, thereby ensuring that the virtual character's voice is similar in pitch and timbre to the user's real voice and providing a personalized speech experience.
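As a hedged illustration only, the sketch below shows how the stored timbre parameters might be handed to a TTS backend alongside the line to be spoken; `synthesize` is a hypothetical callable, not a real library API, and the parameter names are assumptions.

```python
# Hedged sketch: drive an assumed TTS backend with the user's timbre model.
def speak_line(text: str, timbre_model: dict, synthesize) -> bytes:
    """`synthesize(text, pitch_base=..., pitch_range=..., timbre_embedding=...)`
    is an assumed interface returning raw audio; any real TTS engine would
    expose its own parameters."""
    return synthesize(
        text,
        pitch_base=timbre_model["pitch_mean"],
        pitch_range=timbre_model["pitch_std"],
        timbre_embedding=timbre_model["embedding"],
    )
```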
S820, dynamically adjusting and generating emotion expression of the virtual character voice according to the current situation and the game scene of the virtual character.
The virtual character context and game scene refer to the game context (such as dancing, stage performance, interaction with other characters, etc.) and specific game scene (such as stage, concert, practice room, etc.), which determine the emotion expression mode of the character.
Dynamically adjusting emotion expression of generated voice refers to that the system adjusts parameters such as intonation, pitch, speed of voice and the like in real time according to situation requirements so as to reflect emotion states (such as excitement, happiness, relaxation and the like) of characters in different dance scenes.
Specifically, the system may consider the current dance context of the character when generating the virtual character speech. For example, in a live dance performance scenario, the system may adjust the pitch and pace of the speech to make the speech appear to be excited and enthusiastic, while in a relaxed exercise scenario, the speech may appear to be calm and relaxed. By adjusting the emotion expression of the voice in real time, the system ensures that the voice expression of the virtual character is consistent with dance scenes and user emotion experience, thereby enhancing the immersion of the game and the emotion depth of the character.
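One simple way to realize this scene-driven adjustment is sketched below: each game scene maps to multiplicative tweaks on pitch and speaking rate applied before synthesis. The scene names and factors are illustrative assumptions.

```python
# Hedged sketch: adjust prosody parameters according to the current game scene.
SCENE_PROSODY = {
    "live_performance": {"pitch_scale": 1.15, "rate_scale": 1.10},  # excited, enthusiastic
    "practice_room":    {"pitch_scale": 1.00, "rate_scale": 0.95},  # calm, relaxed
}

def adjust_for_scene(tts_params: dict, scene: str) -> dict:
    tweak = SCENE_PROSODY.get(scene, {"pitch_scale": 1.0, "rate_scale": 1.0})
    adjusted = dict(tts_params)
    adjusted["pitch_base"] = tts_params["pitch_base"] * tweak["pitch_scale"]
    adjusted["rate"] = tts_params.get("rate", 1.0) * tweak["rate_scale"]
    return adjusted
```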
It will be appreciated that, by combining the user's timbre model with text-to-speech technology to dynamically generate virtual character speech consistent with the user's voice characteristics, and by generating a matching virtual character image through feature mapping, the system is able to provide the user with a highly personalized virtual character that closely fits the user's voice characteristics. This significantly enhances the user's immersion and emotional connection in the game, while keeping the visual and auditory performances of the virtual character highly coordinated, thereby improving the overall quality of the game experience.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, which may be any one of, or any combination of, a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer-readable storage medium includes the somatosensory dance game program 10 for generating a virtual character based on user voice information. The embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the above-described somatosensory dance game method for generating a virtual character based on user voice information and of the server 1, and is not repeated here.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.