
CN119672185B - Voice and action synchronization method - Google Patents


Info

Publication number: CN119672185B
Application number: CN202510175788.8A
Authority: CN (China)
Prior art keywords: voice, time, sequence, action, calculating
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN119672185A
Inventors: 张思达, 王琴, 张勇达, 杨荣源, 黄显长, 邱智强
Current Assignee: Xiamen Yoya Network Technology Co ltd
Original Assignee: Xiamen Yoya Network Technology Co ltd
Application filed by Xiamen Yoya Network Technology Co ltd
Priority to CN202510175788.8A
Publication of CN119672185A
Application granted; publication of CN119672185B

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a voice and action synchronization method. The method obtains a voice signal, extracts its voice duration characteristics and, in combination with a preset voice segmentation rule, segments the signal into voice continuous sections and voice pause sections; short-time characteristics extracted from each section yield the voice time sequence distribution characteristics. If the fluency score of the action sequence is lower than a preset scoring threshold, the time sequence and spatial positions of the action elements are adjusted, and an optimized action sequence is obtained and stored through fine adjustment in time and space. The optimized action sequence is input into an action controller and, combined with the voice time sequence distribution characteristics, an action execution instruction is generated, completing synchronous output of the action and the voice.

Description

Voice and action synchronization method
Technical Field
The invention relates to the technical field of information, in particular to a voice and action synchronization method.
Background
In intelligent animation synthesis systems, synchronizing the dialogue rhythm with the action sequence is a complex technical problem. The coupling between voice duration and limb-motion timing directly affects the naturalness and realism of character expression. When the dialogue rhythm pauses, the action controller must adjust the action element sequence in real time to maintain the coherence and fluency of the character's motion. A key unsolved problem is how to grasp the timing nodes accurately during action reorganization so that the motion is neither stiff and unnatural nor excessively smoothed. Action reorganization must also account for the influence of factors such as character emotion and personality on motion performance: characters with different personalities, facing the same dialogue rhythm, often differ in their body language. Taking individual character traits into account during reorganization so that actions become richer, more varied, and more expressive; automatically generating context-appropriate interactive actions from dialogue content and character relationships; and achieving smooth transitions and natural connections between actions as the dialogue rhythm changes: together these constitute a highly challenging technical problem.
Disclosure of Invention
The invention provides a voice and action synchronization method, which is used for intelligent animation synthesis and mainly comprises the following steps:
Obtaining a voice signal, extracting voice duration characteristics of the voice signal and, in combination with a preset voice segmentation rule, segmenting the voice signal into voice continuous sections and voice pause sections, and extracting the short-time characteristics corresponding to each section to obtain voice time sequence distribution characteristics. Detecting the starting points and ending points of the voice pause sections according to the voice time sequence distribution characteristics, and determining the starting time, ending time and duration of each pause section. Combining the voice time sequence distribution characteristics with preset action elements, calculating the time sequence alignment relation between the action elements and the voice continuous sections, and generating a preliminarily aligned action sequence. Detecting whether a voice pause section exists in the preliminarily aligned action sequence; if so, judging from the pause duration and the duration of the action element whether the action element sequence needs to be reorganized, and, when the pause duration is greater than a threshold based on the duration of the target action element, reorganizing the action element sequence to generate a reorganized action sequence. Performing a secondary alignment between the reorganized action sequence and the time sequence distribution of the voice continuous sections, and judging that the coupling relation between the reorganized action sequence and the voice meets the synchronization requirement when the differences between the starting and ending times of each action element and those of the corresponding voice continuous section are smaller than a preset synchronization threshold. Calculating the transition time and spatial continuity between action elements of the reorganized action sequence, calculating the differences of adjacent action elements in time and space, and evaluating action fluency from the speed, acceleration and angle changes during action execution to generate a fluency scoring result. If the fluency score is lower than a preset scoring threshold, adjusting the time sequence and spatial positions of the action elements, and obtaining and storing an optimized action sequence through fine adjustment in time and space. Inputting the optimized action sequence into an action controller, generating an action execution instruction in combination with the voice time sequence distribution characteristics, and completing synchronous output of the action and the voice.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
The invention discloses an intelligent animation synthesis method that targets voice-driven action generation scenarios and solves the problem of accurate synchronization between action and voice. First, the input voice signal is segmented and its voice time sequence distribution characteristics are extracted. The voice features are then aligned with preset action elements to generate a preliminary action sequence. For pauses in the speech, the action sequence is optimized by reorganizing the action elements. The temporal and spatial continuity of the action sequence is further improved through secondary alignment and fluency assessment. Finally, the optimized action sequence is combined with the voice characteristics to generate synchronized action execution instructions. The method achieves accurate synchronization of voice and action, improves the natural fluency of the generated motion, and can be widely applied in fields such as virtual anchors and intelligent robots.
Drawings
FIG. 1 is a flow chart of a method of synchronizing speech and motion according to the present invention.
FIG. 2 is a schematic diagram of a method of synchronizing speech and motion according to the present invention.
FIG. 3 is a schematic diagram of a method of synchronizing speech and motion according to the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely in connection with the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in FIGS. 1-3, a method for synchronizing voice and motion according to this embodiment may specifically include:
s101, acquiring a voice signal, extracting voice duration characteristics of the voice signal, combining a preset voice segmentation rule, segmenting the voice signal, dividing the voice into a voice continuous segment and a voice pause segment, and extracting short-time characteristics corresponding to each segment to obtain voice time sequence distribution characteristics.
This step comprises: performing smoothing and noise reduction on the voice signal by wavelet transform; normalizing the signal amplitude with a high threshold and a low threshold; removing high-frequency noise with a Butterworth low-pass filter to obtain an interference-free voice signal; calculating a short-time energy function and a short-time zero-crossing rate function from the interference-free signal to obtain a voice signal strength sequence; marking a voice starting point when the signal strength exceeds a preset volume threshold, and deriving a voice duration sequence from the intervals between adjacent starting points; segmenting the interference-free voice signal according to the voice duration sequence, dividing voice frames with a Hamming window function and determining frame boundaries from the inter-frame energy ratio; extracting the pitch period and formant frequencies from each voice frame to obtain a voice basic feature sequence; performing continuity analysis on that sequence, judging continuous voice segment boundaries with a maximum entropy model when the pitch period change rate of adjacent frames is below a preset threshold; and extracting energy feature vectors and time-varying feature vectors from the continuous voice segments and the pause segments to obtain the voice time sequence distribution characteristics.
Illustratively, the sampled voice signal is smoothed and denoised by wavelet transform, its amplitude is normalized between a high threshold and a low threshold, and high-frequency noise is removed from the normalized signal with a Butterworth low-pass filter to obtain a first voice signal. Duration characteristics are extracted from the first voice signal: a voice signal strength sequence is obtained by computing the short-time energy function and short-time zero-crossing rate function; frames whose signal strength exceeds a preset volume threshold are marked as voice starting points, and the time intervals between adjacent starting points give the voice duration sequence. The first voice signal is segmented according to the voice duration sequence: voice frames are divided with a Hamming window function, frame boundaries are determined from the inter-frame energy ratio, and the pitch period and formant frequencies are extracted from each frame to obtain a voice basic feature sequence. Voice continuity analysis is performed on this sequence: adjacent frames whose pitch period change rate is below a preset threshold are judged to form a continuous voice segment, otherwise a pause segment, and a maximum entropy model refines the segment boundaries. Time sequence distribution characteristics are then extracted from the continuous segments and pause segments: the short-time energy mean, variance, and skewness within each segment give the energy feature vector, and the pitch period change rate within each segment gives the time-varying feature vector. The energy and time-varying feature vectors are combined with a sliding time window, and the voice time sequence distribution feature matrix is obtained by computing the Euclidean distances between feature vectors. The smoothing and noise reduction uses wavelet transform: for a voice signal sampled at 16000 Hz, a Daubechies 4 (db4) wavelet basis is selected for a 5-level decomposition, yielding wavelet coefficients in different frequency bands. The coefficients are processed by soft thresholding, with the threshold set to 3 times the standard deviation of that level's coefficients, achieving effective noise suppression. For amplitude normalization, the high threshold is set to 0.8 and the low threshold to 0.2, and the signal amplitude is linearly mapped in an iterative manner. In the voice duration feature extraction stage, the signal is framed with a frame length of 320 sampling points and a frame shift of 160 sampling points. When computing the short-time energy function, each frame is windowed with a rectangular window, giving the voice signal strength sequence.
For determining the speech starting point, the volume threshold is set to 0.3; when the signal strength of 5 consecutive frames exceeds this threshold, the first of those frames is marked as the speech starting point. Speech frame boundaries are processed with a Hamming window function, with the window length set to 320 sampling points. The energy ratio between adjacent frames is computed, and a frame is marked as a boundary frame if the ratio is greater than 2 or less than 0.5. The pitch period is extracted by the autocorrelation method, searched within the 50 Hz to 400 Hz range. Formant frequencies are extracted by linear prediction analysis with a prediction order of 12, taking the first 3 formants. In the speech continuity analysis, the pitch period change rate threshold is set to 0.15; adjacent frames below this threshold are judged to belong to a continuous speech segment. When the maximum entropy model refines the speech segment boundaries, the selected features include the pitch period, short-time energy, and zero-crossing rate, and the model runs for 100 iterations. For speech segment timing feature extraction, a 64-millisecond sliding window with 50% overlap is used. The mean, variance, and skewness of the short-time energy within each window give a 3-dimensional energy feature vector, and the pitch period change rate, computed by differencing, gives a 1-dimensional time-varying feature vector. After combining the feature vectors, the voice time sequence distribution feature matrix is obtained via Euclidean distance computation; its rows correspond to time windows and its 4 columns to the energy mean, variance, skewness, and pitch period change rate. For distinguishing continuous voice sections from pause sections, the zero-crossing rate is introduced as an auxiliary criterion: a segment with a zero-crossing rate above 1500 per second and low energy is judged silent. The accuracy of segment boundary localization can be further improved by computing the first derivative of the energy envelope within the segment. The minimum duration of a speech segment is set to 200 milliseconds; segments below this threshold are merged into neighboring segments.
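A minimal sketch of the framing and short-time feature computation described above, assuming a 16 kHz mono signal in a NumPy array with amplitude already normalized; the function names are illustrative, and the parameters mirror those in the text (320-point frames, 160-point shift, volume threshold 0.3, 5-frame run):

```python
import numpy as np

def short_time_features(signal, frame_len=320, frame_shift=160):
    """Frame a 16 kHz signal and compute short-time energy and zero-crossing rate per frame."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * frame_shift : i * frame_shift + frame_len]
        energy[i] = float(np.sum(frame ** 2))            # rectangular window, sum of squares
        signs = np.signbit(frame).astype(np.int8)
        zcr[i] = int(np.count_nonzero(np.diff(signs)))   # amplitude sign changes per frame
    return energy, zcr

def find_speech_start(energy, volume_threshold=0.3, run=5):
    """Index of the first frame of `run` consecutive frames above the volume threshold, or None."""
    above = energy > volume_threshold
    for i in range(len(above) - run + 1):
        if above[i:i + run].all():
            return i  # frame marked as the speech starting point
    return None
```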
S102, detecting a starting point and an ending point of a voice pause section according to the voice time sequence distribution characteristics, determining the starting time, the ending time and the duration of the pause section, combining the voice time sequence distribution characteristics with preset action elements, calculating the time sequence alignment relation between the action elements and the voice continuous section, and generating a preliminary aligned action sequence.
This step comprises: obtaining the short-time energy differences and zero-crossing differences of adjacent voice frames from the voice time sequence distribution characteristics; marking a voice frame as a pause section starting frame when its short-time energy difference exceeds a first preset threshold and its zero-crossing difference exceeds a second preset threshold; locating the pause section boundaries by recursive bisection, computing the mel-frequency cepstral coefficient distances between adjacent frames to obtain a feature distance sequence and determining a second pause section position sequence from it; dividing the voice continuous sections according to the second pause section position sequence, and computing the duration, average energy, and pitch period parameters of each continuous section to obtain voice continuous section feature vectors; computing the matching degree between these feature vectors and preset action element feature vectors with a dynamic time warping algorithm and selecting the corresponding action elements according to the matching degree; and nonlinearly stretching the execution time sequences of the action elements with a cubic spline interpolation algorithm to obtain the time-sequence-aligned action sequence.
Illustratively, the short-time energy differences and zero-crossing differences between adjacent voice frames are extracted from the voice time sequence distribution characteristics; if a frame's short-time energy difference exceeds a first preset threshold and its zero-crossing difference exceeds a second preset threshold, it is marked as a pause section starting frame, yielding a first pause section position sequence. Each pause section in this sequence is then precisely bounded: the mel-frequency cepstral coefficient distances between adjacent frames give a feature distance sequence, which is segmented by recursive bisection to obtain a second pause section position sequence. The voice signal is partitioned in time according to the second pause section position sequence into several voice continuous sections, and each section's duration, average energy, and pitch period parameters generate its feature vector. A dynamic time warping algorithm computes the matching degree between each continuous section's feature vector and the preset action element feature vectors; the best-matching action element is selected for each section, generating a correspondence sequence between voice sections and action elements. For this correspondence sequence, the ratio of each action element's standard duration to the actual duration of its voice continuous section gives a duration ratio sequence, and the execution time sequence of the action elements is nonlinearly stretched by cubic spline interpolation to generate the time-sequence-aligned action sequence. The aligned sequence is then smoothed: the feature similarity between adjacent action elements is computed, adjacent elements whose similarity exceeds a preset threshold are merged, and the merged elements' timing parameters are computed by weighted averaging. In extracting the voice time sequence distribution characteristics, pause sections are identified from the short-time energy difference and zero-crossing difference of adjacent frames. For a voice signal sampled at 16000 Hz, the frame length is set to 320 sampling points and the frame shift to 160. After framing with a rectangular window, the first preset threshold (short-time energy difference) is set to 0.3 and the second preset threshold (zero-crossing rate difference) to 0.25; a frame is marked as a pause section starting frame when both differences exceed their thresholds simultaneously.
When the pause section boundaries are precisely located, 13-dimensional mel-frequency cepstral coefficients serve as feature parameters, and the Euclidean distance between adjacent frames gives the feature distance sequence. When this sequence is segmented by recursive bisection, the initial split points are chosen at local peaks of the feature distance, and each split point is refined by computing the variance ratio of the subsequences before and after it, with the variance ratio threshold set to 2.5. Feature extraction for the voice continuous sections covers both the time and frequency domains. In the time domain, each section's duration is computed, typically between 200 and 2000 milliseconds; the average energy is the arithmetic mean of the short-time energy of all frames in the section; and the pitch period parameters are extracted by the autocorrelation method with the search range set between 50 Hz and 400 Hz. When the dynamic time warping algorithm computes the matching degree between a voice continuous section and a preset action element, the local path constraint is symmetric and the slope constraint ranges from 0.5 to 2. Each preset action element feature vector covers three dimensions: duration, speed change rate, and acceleration change rate. The match threshold is set to 0.75; only matches above it are considered valid. When stretching the action element timing, the duration ratio sequence reflects the duration difference between each voice section and its action element. The execution time sequence is nonlinearly stretched by cubic spline interpolation, with interpolation nodes chosen at the action's key time points, typically its starting point, peak point, and ending point. For motion speed control, the length ratio of the acceleration phase to the deceleration phase is kept at 3:2. The feature similarity of adjacent action elements is based on spatial position (three-dimensional coordinates) and movement direction (unit vector); the similarity threshold is set to 0.8, and adjacent elements above it are merged. Merged elements' timing parameters are computed by weighted averaging, with weights proportional to the original elements' durations.
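A sketch of the matching-degree computation between a voice continuous section's feature sequence and a preset action element's feature sequence. This is plain dynamic time warping with a Euclidean local cost; the symmetric slope constraint (0.5 to 2) mentioned above is not modeled, and the conversion of the normalized path cost into a matching degree is an illustrative assumption:

```python
import numpy as np

def dtw_matching_degree(speech_feats, action_feats):
    """Dynamic time warping between two feature sequences; returns a matching degree in (0, 1].

    speech_feats, action_feats: arrays of shape (n, d) and (m, d).
    """
    a, b = np.asarray(speech_feats, float), np.asarray(action_feats, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])       # local Euclidean cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    normalized = D[n, m] / (n + m)                           # path-length normalization
    return 1.0 / (1.0 + normalized)                          # higher = better match

# Per the text, only matches above the 0.75 threshold would be treated as valid:
# if dtw_matching_degree(segment_feats, element_feats) > 0.75: ...
```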
S103, detecting whether a voice pause section exists in the preliminarily aligned action sequence; if so, judging from the pause duration and the action element duration whether the action element sequence needs to be reorganized, and reorganizing it to generate a reorganized action sequence when the pause duration exceeds a threshold based on the duration of the target action element.
This step comprises: analyzing the voice signal by short-time energy calculation to obtain a first voice pause feature sequence; performing a secondary judgment on it by short-time zero-crossing rate detection to obtain a second voice pause feature sequence; computing, from the second sequence, the ratio of each voice pause section's duration to the target duration of its action element, and marking the action element as an element to be reorganized if this ratio exceeds a preset duration threshold, yielding the sequence of action elements to be reorganized; computing a similarity matrix between these action elements by hierarchical clustering and partitioning it with an action feature threshold to obtain the action element reorganization scheme sequence; and constructing an action element connection graph from the reorganization scheme sequence with a minimum spanning tree algorithm, determining the combination order of the action elements from the weights of the connecting edges, and obtaining the reorganized time sequence. The edge weight is calculated by the following formula:
w_ij = 1 - s_ij / Σ_{k∈N_i} s_ik,
where w_ij denotes the edge weight from node i to node j, s_ij the similarity between nodes i and j, and N_i the neighbor set of node i.
Illustratively, the time interval sequence between adjacent action elements is computed from the preliminarily aligned action sequence, and the voice signal within each interval is analyzed by short-time energy calculation to obtain a first voice pause feature sequence. For this sequence, a secondary judgment is performed by short-time zero-crossing rate detection: the precise boundary of each voice pause section is judged from the zero-crossing difference between adjacent frames, giving a second voice pause feature sequence. From the second sequence, the ratio of each pause section's duration to the target duration of its corresponding action element is computed; if the ratio exceeds a preset duration threshold, the action element is marked as an element to be reorganized, yielding the sequence of action elements to be reorganized. For that sequence, a hierarchical clustering algorithm computes a similarity matrix between the action elements, which is partitioned by an action feature threshold to obtain the reorganization scheme sequence. A minimum spanning tree algorithm then constructs a connection graph among the action elements according to the reorganization scheme sequence; the combination order is determined from the weights of the connecting edges, generating the reorganized time sequence of action elements. The durations of the action elements in the reorganized sequence are then adjusted: the ratio of the voice pause duration to the execution duration of the reorganized sequence is computed, and the execution speed of the action elements is scaled by linear interpolation to obtain the reorganized action sequence. In the voice pause feature analysis, the voice signal is processed by short-time energy calculation: for a signal sampled at 16000 Hz, the frame length is set to 320 sampling points and the frame shift to 160, and the signal is framed with a rectangular window. Each frame's short-time energy is the sum of squares of its sampling points; for normal speech segments this value typically lies between 0.5 and 1, while for pause segments it is below 0.1. Zero-crossing rate detection further confirms the pause boundaries by counting how often the signal amplitude changes sign. For a 16000 Hz signal, the zero-crossing rate of voiced segments is typically between 1000 and 2000 times per second, while that of pause segments is above 2500 times per second. When judging a pause section boundary, the starting point is confirmed if the zero-crossing rates of 5 consecutive frames exceed the threshold. The target duration of an action element is determined from a preset standard action library.
The standard action library contains multiple basic action elements, each with a predefined standard execution duration; for example, the standard duration of a waving action is 800 milliseconds and that of a nodding action 500 milliseconds. When the ratio of a voice pause section's duration to an action element's standard duration exceeds 1.5, the action reorganization mechanism is triggered. When the hierarchical clustering algorithm computes the similarity of action elements, each element is represented by a multidimensional feature vector comprising its spatial position, movement speed, and movement direction: the spatial position is given by three-dimensional coordinates, the movement speed by a scalar, and the movement direction by a unit vector. The similarity between feature vectors is computed with cosine similarity, and the similarity threshold is set to 0.75. In the minimum spanning tree algorithm, the weight of a connecting edge reflects the transition difficulty between action elements; its calculation considers the spatial distance between action starting positions, the angle between movement directions, and the speed change rate. A smaller weight indicates a smoother transition; the weight threshold is typically set to 0.4, and edges below it are preferentially selected. During reorganization of the action sequence, the execution speed of the action elements must be adjusted to match the voice pause duration. The speed is scaled by linear interpolation, with the scaling factor limited to between 0.5 and 2 to keep the motion from becoming too fast or too slow. Interpolation nodes are chosen at the action's key time points, typically its starting point, peak point, and ending point.
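A sketch of the similarity-to-edge-weight step and the minimum-spanning-tree selection of transition edges, using cosine similarity over the multidimensional feature vectors described above. The edge weights follow the formula as reconstructed for this step (an assumption), and SciPy's `minimum_spanning_tree` stands in for the MST construction:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def transition_edges(features):
    """Select smooth transition edges between action elements via an MST.

    features: (n, d) array, one feature vector per action element
    (spatial position, movement speed, movement direction).
    """
    f = np.asarray(features, float)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    s = f @ f.T                                   # cosine similarity matrix
    np.fill_diagonal(s, 0.0)
    w = 1.0 - s / s.sum(axis=1, keepdims=True)    # w_ij = 1 - s_ij / sum over neighbors N_i
    np.fill_diagonal(w, 0.0)                      # zero entries mean "no edge" for csgraph
    mst = minimum_spanning_tree(w).toarray()      # low-weight edges = smooth transitions
    return sorted(zip(*np.nonzero(mst)))          # (i, j) pairs of selected transition edges
```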
S104, performing a secondary alignment between the reorganized action sequence and the time sequence distribution of the voice continuous segments, and judging that the coupling relation between the reorganized action sequence and the voice meets the synchronization requirement when the differences between each action element's start and end times and those of the corresponding voice segment are smaller than a preset synchronization threshold.
This step comprises: extracting the feature parameters of each action element from the reorganized action sequence by short-time window detection, and computing the amplitude change rate and speed change rate of the action elements to obtain an action timing feature sequence; detecting local extreme points of the action amplitude with an adaptive threshold method and computing the time intervals between them to obtain the start-stop time sequence of the action elements; extracting the voice energy envelope and computing the trend of the voice energy to obtain the boundary time sequence of the voice paragraphs; computing the time difference between the action start-stop sequence and the voice boundary sequence and, if it is smaller than a first preset threshold, computing an alignment path with a dynamic programming algorithm; adjusting the execution speed of the action elements along the alignment path to obtain a corrected action time sequence; and computing, for the corrected sequence, the start time difference and end time difference between each action element and its corresponding voice paragraph, judging that the action element meets the synchronization requirement when both differences are smaller than a second preset threshold. The start and end time differences are calculated by the following formulas:
ΔT_start = T_action,start - T_audio,start,
where ΔT_start denotes the start time difference between the action element and the corresponding speech paragraph, T_action,start the start time of the action element, and T_audio,start the start time of the corresponding speech paragraph;
ΔT_end = T_action,end - T_audio,end,
where ΔT_end denotes the end time difference between the action element and the corresponding speech paragraph, T_action,end the end time of the action element, and T_audio,end the end time of the corresponding speech paragraph.
Illustratively, the feature parameters of each action element are extracted from the reorganized action sequence by short-time window detection, and a first action timing feature sequence is obtained by computing the change rates of the action amplitude and speed. Boundary localization is performed on this sequence: local extreme points of the action amplitude are detected with an adaptive threshold method, and the time intervals between adjacent extreme points give the start-stop time sequence of the action elements. For this start-stop sequence, the boundary features of the voice continuous segments are acquired by voice energy envelope extraction, and the trend of the voice energy gives the boundary time sequence of the voice paragraphs. The action start-stop sequence and the voice boundary sequence are then alignment-matched: time points whose time difference is smaller than a first preset threshold are marked as timing match points. From these match points, a dynamic programming algorithm computes the optimal alignment path between action elements and voice paragraphs, and a corrected action time sequence is obtained by adjusting the execution speed of the action elements. For the corrected sequence, the start and end time differences between each action element and its corresponding voice paragraph are computed; if both are smaller than a second preset threshold, the action element is judged to meet the synchronization requirement. The action timing features are extracted with a short-time window of 200 milliseconds and a 50% overlap. The amplitude change rate is obtained from the displacement difference between adjacent sampling points; for a standard action sequence, its peak is typically between 0.5 and 2 meters per second. The speed change rate reflects the acceleration characteristics of the motion and is obtained by first-order differencing of the speed sequence; typical peak accelerations lie between 2 and 5 meters per square second. When detecting local extreme points of the action amplitude, the adaptive threshold method estimates the background noise level by moving averaging and sets the threshold to 3 times the mean background noise. Action elements lasting longer than 500 milliseconds typically contain 2 to 4 local extreme points; the intervals between adjacent extreme points reflect the rhythm of the motion and usually lie between 200 and 800 milliseconds. The voice energy envelope is extracted by short-time energy calculation: for a 16000 Hz voice signal, the frame length is set to 320 sampling points and the frame shift to 160, and each frame's energy is computed after windowing with a Hamming window.
The boundary features of a speech paragraph are characterized by the energy change rate, whose absolute value is typically greater than 0.3 at the beginning and end of speech. Timing match points are judged by a double-threshold criterion: the first preset threshold is set to 50 milliseconds and measures how well an action element and a voice paragraph align on the time axis. When the time difference between an action start-stop moment and a voice boundary is smaller than this threshold, the two are considered aligned at that time point; in practice, about 70% of the key time points in an action sequence can be matched to voice boundaries. When computing the optimal alignment path, the dynamic programming algorithm uses a time window of 400 milliseconds and searches within it for the optimal action speed scaling factor, limited to between 0.5 and 2 to avoid excessive speed changes. For a standard action sequence, about 80% of the action elements need less than 30% speed adjustment to synchronize well with the speech. The synchronization requirement uses a strict double-threshold standard: the second preset threshold is set to 30 milliseconds and bounds the start and end time differences separately. Practice shows that when the time differences are kept within this threshold, the human eye can hardly perceive any desynchronization between motion and speech; for complex action-speech sequences, typically more than 90% of the action elements can meet this requirement.
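A direct check of the double-threshold synchronization criterion, using the ΔT_start and ΔT_end formulas above; the 30 ms default is the second preset threshold from the text, and the tuple layout for the spans is an assumption:

```python
def meets_sync_requirement(action_span, speech_span, threshold_ms=30.0):
    """True if both start and end time differences stay below the synchronization threshold.

    action_span, speech_span: (start_ms, end_ms) tuples for one action element
    and its corresponding speech paragraph.
    """
    dt_start = action_span[0] - speech_span[0]   # ΔT_start = T_action,start - T_audio,start
    dt_end = action_span[1] - speech_span[1]     # ΔT_end   = T_action,end   - T_audio,end
    return abs(dt_start) < threshold_ms and abs(dt_end) < threshold_ms
```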
S105, calculating the transition time and spatial continuity between action elements of the reorganized action sequence, computing the differences of adjacent action elements in time and space, and evaluating action fluency from the speed, acceleration, and angle changes during action execution to generate a fluency scoring result.
This step comprises: obtaining spatial sampling points between adjacent action elements by three-dimensional coordinate sampling, and computing Euclidean distances and direction vectors from them to obtain a displacement sequence and an angle sequence; smoothing both sequences and computing a transition trajectory curve by cubic spline interpolation; computing the curvature and deflection rate values at the curve nodes to obtain a first transition feature sequence; computing the instantaneous speed and acceleration of each sampling point in that sequence by the central difference method and segmenting by an acceleration threshold to obtain a second transition feature sequence; fitting the angular speed and angular acceleration parameters of the action transition phase by least squares to obtain a third transition feature sequence; and extracting features from the third sequence, computing the speed fluctuation coefficient, acceleration peak ratio, angle change rate, and trajectory smoothness to generate the fluency scoring result.
Illustratively, spatial position data between adjacent action elements are extracted from the reorganized action sequence by three-dimensional coordinate sampling; the Euclidean distances between sampling points give a displacement sequence, and the direction vectors between adjacent sampling points give an angle sequence. Both sequences are smoothed, a transition trajectory curve is computed by cubic spline interpolation, and the curvature and deflection rate values at each node give a first transition feature sequence. For this sequence, the central difference method computes the instantaneous speed and acceleration of each sampling point, and the acceleration curve is segmented by a maximum acceleration threshold to obtain a second transition feature sequence. From the second sequence, the angular speed and angular acceleration parameters of the action transition phase are computed, and an angle change curve is fitted by least squares to obtain a third transition feature sequence. Features are extracted from the third sequence: the speed fluctuation coefficient, acceleration peak ratio, angle change rate, and trajectory smoothness give the motion fluency feature vector. For this feature vector, a scoring function is constructed with support vector regression, and the motion fluency score is obtained by computing the distance between the feature vector and a preset standard feature template:
S = 1 - (1/n) Σ_{i=1}^{n} d(v_i, t_i) / max(d),
where S denotes the action fluency score, n the number of feature vectors, v_i the i-th feature vector, t_i the i-th preset standard feature template, d(v_i, t_i) the distance between a feature vector and its template, and max(d) the maximum of all distance values. The formula computes the average normalized distance between all feature vectors and their corresponding templates and subtracts it from 1; the closer the score is to 1, the smoother the action. The spatial positions of the action sequence are sampled and recorded as three-dimensional coordinates at a fixed frequency; for a standard action sequence the sampling frequency is set to 60 Hz, each sampling point containing X, Y, and Z components. When the Euclidean distances between adjacent sampling points are computed, the displacement sequence reflects the spatial trajectory of the motion; a displacement below 5 centimeters is taken to indicate a motion pause. The direction vector is obtained from the displacement vectors of adjacent sampling points, and changes in the direction angle reflect the turning characteristics of the action. When the cubic spline interpolation processes the transition trajectory, natural boundary conditions are used to ensure continuity of the curve's second derivative at the nodes. The curvature value reflects how strongly the trajectory bends; the maximum curvature of a transition trajectory is generally kept within 0.5. The deflection rate describes the torsion of the space curve; its peak for a standard action sequence typically does not exceed 0.3. Instantaneous speed and acceleration are computed with a 5-point central difference formula; for a 60 Hz sampled action sequence the time step is 16.7 milliseconds. The maximum acceleration threshold is set to 2 times gravitational acceleration, i.e. 19.6 meters per square second; actions exceeding it are judged non-smooth transitions. The volatility of the speed curve is assessed from the standard deviation of adjacent speed values. The angular speed and angular acceleration parameters reflect how sharply the motion direction changes: in a standard transition, the angular speed peak usually does not exceed 90 degrees per second, and the angular acceleration is kept within 180 degrees per square second. The least squares fit of the angle change curve uses a third-order polynomial. The motion fluency feature vector has 4 dimensions: the speed fluctuation coefficient reflects the stability of the speed curve and ranges from 0 to 1; the acceleration peak ratio is the ratio of maximum to average acceleration, usually kept within 3; the angle change rate describes the smoothness of direction changes and ideally does not exceed 0.5; and the trajectory smoothness is a weighted average of curvature and deflection rate.
The support vector regression algorithm uses a radial basis function kernel with the kernel parameter set to 0.1. The preset standard feature templates come from a professional action database and contain the fluency features of several typical action sequences. The distance value is the Mahalanobis distance between the feature vector and the template; the final score is expressed as a percentile, and a score above 80 indicates a smooth motion transition.
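A sketch of the fluency score formula above. Euclidean distance stands in for the Mahalanobis distance mentioned in the text, which would additionally require the covariance matrix of the template database:

```python
import numpy as np

def fluency_score(feature_vectors, templates):
    """S = 1 - (1/n) * sum_i d(v_i, t_i) / max(d); closer to 1 means a smoother action."""
    v = np.asarray(feature_vectors, float)
    t = np.asarray(templates, float)
    d = np.linalg.norm(v - t, axis=1)   # distance of each feature vector to its template
    if d.max() == 0.0:
        return 1.0                      # all vectors coincide with their templates
    return 1.0 - float(np.mean(d / d.max()))
```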
S106, if the fluency score is lower than a preset scoring threshold, adjusting the time sequence and spatial positions of the action elements, and obtaining and storing an optimized action sequence through fine adjustment in time and space.
This step comprises: detecting action elements against the preset fluency scoring threshold and computing their deviations on the time axis and the spatial trajectory by double-threshold detection to obtain a first spatio-temporal optimization vector; adjusting the action start-stop times with a particle swarm optimization algorithm under a minimum time interval constraint and a maximum delay constraint to obtain a second spatio-temporal optimization vector; reconstructing the control point coordinates and tangent directions of the action trajectory by three-dimensional spline interpolation from the spatial parameters of the second vector to obtain a third spatio-temporal optimization vector; computing the continuity constraints of the trajectory nodes with a Bezier curve fitting algorithm to obtain a fourth spatio-temporal optimization vector; and verifying the speed and acceleration continuity of the action by kinematic constraint checking, forming and storing the optimized action sequence once verification passes.
Illustratively, according to the fluency scoring result, action elements scoring below the preset threshold are identified by double-threshold detection, and the first spatio-temporal optimization vector is obtained by computing each element's overlap on the time axis and its deviation on the spatial trajectory. For the timing parameters in the first vector, a particle swarm optimization algorithm adjusts the action start-stop times, and the second spatio-temporal optimization vector is obtained under a minimum time interval constraint and a maximum delay constraint. According to the spatial parameters in the second vector, the action trajectory is reconstructed by three-dimensional spline interpolation, and the third spatio-temporal optimization vector is obtained by adjusting the control point coordinates and tangent directions. Curve smoothness optimization is performed on the third vector: the trajectory nodes are adjusted by Bezier curve fitting, and the fourth spatio-temporal optimization vector is obtained by computing the curve's continuity constraints. For the fourth vector, kinematic constraint checking verifies the speed and acceleration continuity of the action; parameters that fail the constraints are corrected, giving a fifth spatio-temporal optimization vector. The optimized action sequence is constructed from the fifth vector, and its timing parameters, spatial parameters and constraint conditions are stored, classified, in a relational database. In the optimization process, the double-threshold detection judges along two dimensions, fluency score and spatio-temporal consistency, with the fluency score threshold set to 80 points and the spatio-temporal consistency threshold to 0.85. For a standard action sequence, the overlap on the time axis is obtained from the time intervals between adjacent action elements and should normally be kept within 15%; the spatial trajectory deviation is the Euclidean distance between the action trajectory and the standard trajectory template and should not exceed 10 centimeters. When the particle swarm optimization algorithm adjusts the action timing, the population size is set to 50 and the maximum number of iterations to 100. The minimum time interval constraint is set to 200 milliseconds to ensure adequate transition time between actions, and the maximum delay constraint to 500 milliseconds to keep the overall rhythm of the action sequence from slowing too much. During optimization, each particle represents a set of candidate timing parameters, and the optimal timing solution is found by iteratively updating the particle positions. The three-dimensional spline interpolation uses cubic Bezier curves as base curves, each segment defined by 4 control points; the control point coordinates are obtained by minimizing a curve energy functional, and the tangent directions are determined from the positions of adjacent control points.
For complex motion trajectories, it is typically broken down into multiple segments of bezier curves, and the second derivative continuity is guaranteed from segment to segment. In the Bezier curve fitting process, a uniform parameterization method is adopted to resample track nodes, and the sampling point number is set to be 2 times of the original node number. The continuity constraint condition comprises three layers of position continuity, speed continuity and acceleration continuity, and the optimal position of the control point is solved by constructing a constraint equation set. For a standard motion trajectory, the fitting error is typically controlled to within 5 mm. The kinematic constraint test comprises two aspects of speed constraint and acceleration constraint, the speed continuity requires that the speed change rate between adjacent sampling points is not more than 30%, and the acceleration continuity requires that the acceleration change rate is controlled within 50%. And for parameters which do not meet the constraint condition, correcting by adopting a linear interpolation method, wherein the interpolation weight is inversely proportional to the constraint violation degree. The relational database adopts a hierarchical storage structure, and the first layer stores basic information of an action sequence, including sequence identification, creation time and optimization times. The second layer stores timing parameters including a start time, a duration, and a transition time for each action element. And the third layer of storage space parameters comprise control point coordinates, curve parameters and constraint conditions. The data retrieval efficiency is improved by establishing an index, and the typical retrieval time is controlled within 10 milliseconds.
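A sketch of the kinematic constraint check described above: the velocity change rate between adjacent samples must stay within 30% and the acceleration change rate within 50%. The trajectory layout (uniformly sampled 3-D positions) and the epsilon guard against division by zero are assumptions:

```python
import numpy as np

def kinematic_violations(positions, dt, v_limit=0.30, a_limit=0.50):
    """Return sample indices where velocity or acceleration continuity is violated.

    positions: (n, 3) array of trajectory points sampled at a fixed interval dt (seconds).
    """
    p = np.asarray(positions, float)
    v = np.linalg.norm(np.diff(p, axis=0), axis=1) / dt   # speed between adjacent samples
    a = np.abs(np.diff(v)) / dt                           # acceleration magnitude
    eps = 1e-9                                            # guard against division by zero
    v_rate = np.abs(np.diff(v)) / (v[:-1] + eps)          # relative speed change
    a_rate = np.abs(np.diff(a)) / (a[:-1] + eps)          # relative acceleration change
    bad = set(np.flatnonzero(v_rate > v_limit) + 1) | set(np.flatnonzero(a_rate > a_limit) + 2)
    return sorted(bad)
```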
S107, inputting the optimized action sequence into the action controller and generating action execution instructions in combination with the voice time sequence distribution characteristics to complete synchronous output of the action and the voice.
This step comprises: obtaining the timestamp information of the action elements and voice fragments from a high-precision clock source, and generating a timing association table recording the start and stop moments of the voice fragments corresponding to each action element to obtain a first synchronous control sequence; computing the joint position and joint speed parameters with a joint interpolation algorithm and planning the action trajectory with the forward kinematics equations to obtain a second synchronous control sequence; buffering the action parameters in a circular queue structure, reading the action data via queue read/write pointers, and outputting an action control signal once the amount of data in the queue reaches a preset buffering threshold; adjusting the action execution speed with a proportional-integral controller according to the sampled phase difference between the action control signal and the voice playback signal, and compensating the timing deviation by adjusting the execution period to obtain a corrected action control signal; and outputting the corrected action control signal and the voice playback signal in parallel, with a hardware timer generating the synchronous clock signal, to realize synchronous output of the action execution signal and the voice playback signal.
By way of example: according to the optimized action sequence, a high-precision clock source is used to extract timestamp information for the action elements and voice segments, and a timing association table is generated to record the start and stop times of the voice segments corresponding to each action element, yielding the first synchronization control sequence. For the first synchronization control sequence, a joint interpolation algorithm calculates the joint position and joint velocity parameters during action execution, and forward-kinematics equations plan the action trajectory in real time, yielding the second synchronization control sequence. The joint position is computed as:
$$\theta(t) = \theta_0 + (\theta_f - \theta_0)\left[\,10\left(\frac{t}{T}\right)^{3} - 15\left(\frac{t}{T}\right)^{4} + 6\left(\frac{t}{T}\right)^{5}\right]$$

where $\theta(t)$ denotes the joint angle at time $t$, $\theta_0$ the initial angle, $\theta_f$ the final angle, $t$ the current time, and $T$ the total execution time; the formula computes the joint position by quintic polynomial interpolation.
$$\omega(t) = \frac{\theta_f - \theta_0}{T}\left[\,30\left(\frac{t}{T}\right)^{2} - 60\left(\frac{t}{T}\right)^{3} + 30\left(\frac{t}{T}\right)^{4}\right]$$
where $\omega(t)$ denotes the joint angular velocity at time $t$, $\theta_0$ the initial angle, $\theta_f$ the final angle, $t$ the current time, and $T$ the total execution time; the formula is obtained by differentiating the joint position interpolation formula (a code sketch of these two formulas follows at the end of this passage). According to the second synchronization control sequence, the action parameters are buffered in a circular queue structure, the action data are read synchronously through queue read and write pointers, and an action control signal is output once the amount of data in the queue reaches the preset buffering threshold. The output action control signal is sampled and checked, and the phase difference between the action sampling signal and the voice sampling signal is calculated to obtain a real-time synchronization deviation sequence. According to this deviation sequence, a proportional-integral controller adjusts the action execution speed, and the timing deviation is compensated by adjusting the execution period to obtain a corrected action control signal. The corrected action control signal and the voice playback signal are output in parallel, and a hardware timer generates a synchronization clock signal to realize synchronous output of the action execution signal and the voice playback signal.

The high-precision clock source uses a 100 MHz crystal oscillator to provide the reference clock, and a frequency multiplier circuit generates a 1000 MHz system clock, achieving microsecond timestamp precision. The timing association table records key-value pairs, with the action element identifier as the key and the start and stop times of the voice segments as the value. For a standard action sequence, each action element corresponds on average to 2 to 3 voice segments. The joint interpolation algorithm uses quintic polynomial interpolation so that position, velocity and acceleration are continuous at the trajectory endpoints. The joint position parameters cover 6 degrees of freedom, each limited by its mechanical range of motion. Forward kinematics is computed by the D-H parameter method, obtaining the spatial position of the end effector from the link coordinate transformations; for complex action sequences, the forward-kinematics computation rate is kept at 1000 Hz. The circular queue uses a double-pointer structure for ring storage; the queue length is set to 1024, with an initial spacing of 512 between the read and write pointers. The preset buffering threshold is set to 75% of the queue length, i.e. output is triggered when the amount of data reaches 768; this design provides enough buffer space to absorb data-stream fluctuations while ensuring data continuity. During sampling detection, the sampling rate for both the action and voice signals is set to 44.1 kHz, with 2048 sampling points per frame. The phase difference is obtained by computing the cross-correlation of the two signals, and a synchronization compensation mechanism is triggered when the phase difference exceeds ±5 degrees.
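As a concrete reference, below is a minimal sketch of the quintic interpolation defined by the two formulas above; the 90-degree sweep and 0.5-second duration in the example are illustrative values, not figures from the application.

```python
def joint_angle(t, theta0, thetaf, T):
    """Quintic (minimum-jerk) interpolation: position, velocity and
    acceleration are continuous at both trajectory endpoints."""
    s = t / T
    return theta0 + (thetaf - theta0) * (10 * s**3 - 15 * s**4 + 6 * s**5)

def joint_velocity(t, theta0, thetaf, T):
    """Time derivative of joint_angle, matching the velocity formula above."""
    s = t / T
    return (thetaf - theta0) * (30 * s**2 - 60 * s**3 + 30 * s**4) / T

# Example: sample a 90-degree joint sweep over a 0.5-second action element.
for t in (0.0, 0.25, 0.5):
    print(round(joint_angle(t, 0.0, 90.0, 0.5), 2),
          round(joint_velocity(t, 0.0, 90.0, 0.5), 2))
```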
The parameters of the proportional-integral controller are obtained by experimental calibration: the proportional coefficient is set to 0.8 and the integral coefficient to 0.2. The adjustment range of the execution period is limited to within ±10% to avoid excessive changes in action speed. For a typical action sequence, the controller response time is under 10 milliseconds, with steady-state error kept within 1 millisecond. The hardware timer is implemented with a 16-bit counter; the clock division coefficient is set to 8, generating a 125 kHz synchronization clock signal. The action execution signal and the voice playback signal are buffered in a dual-port RAM and output simultaneously on the rising edge of the synchronization clock. In practice, the timing deviation between the two signals is kept within 40 microseconds, which is perceived by human eyes and ears as complete synchronization.
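A minimal sketch of this proportional-integral period compensation, assuming the measured phase difference has already been converted to a time error in seconds; the nominal 1 ms execution period in the example is an illustrative value, not a figure from the application.

```python
class PeriodCompensator:
    """PI regulation of the action execution period from the measured
    action/voice phase error, clamped to +/-10% of the nominal period."""
    def __init__(self, nominal_period, kp=0.8, ki=0.2):
        self.nominal = nominal_period
        self.kp, self.ki = kp, ki
        self.integral = 0.0

    def update(self, phase_error):
        # phase_error > 0: the action lags the voice, so shorten the period.
        self.integral += phase_error
        correction = self.kp * phase_error + self.ki * self.integral
        period = self.nominal - correction
        low, high = 0.9 * self.nominal, 1.1 * self.nominal  # +/-10% clamp
        return min(max(period, low), high)

# Example: a 1 ms nominal execution period converging on a small lag.
pid = PeriodCompensator(nominal_period=0.001)
for err in (0.00004, 0.00002, 0.00001):  # phase errors in seconds
    print(pid.update(err))
```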
The above description is merely illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of those features or their equivalents without departing from the spirit of the application, for example solutions in which the above features are interchanged with technical features of similar function disclosed in (but not limited to) the present application.

Claims (8)

1. A method of voice and motion synchronization for intelligent animation synthesis, the method comprising:
The method comprises the steps of:
obtaining a voice signal, extracting voice duration features of the voice signal, segmenting the voice signal in combination with a preset voice segmentation rule, dividing the voice into continuous speech segments and speech pause segments, and extracting the short-time features corresponding to each segment to obtain voice timing distribution features;
detecting the start point and end point of each speech pause segment according to the voice timing distribution features, and determining the start time, end time and duration of the pause segment;
combining the voice timing distribution features with preset action elements, calculating the timing alignment relationship between the action elements and the continuous speech segments, and generating a preliminarily aligned action sequence;
detecting whether the preliminarily aligned action sequence contains a speech pause segment, and if so, determining from the pause duration and the duration of the action element whether the action element sequence needs to be reorganized, and, when the pause duration is greater than a threshold of the target action element duration, reorganizing the action element sequence to generate a reorganized action sequence;
secondarily aligning the reorganized action sequence with the timing distribution of the continuous speech segments, and determining that the coupling relationship between the reorganized action sequence and the speech segments meets the synchronization requirement when the differences between the start and end times of each action element and the start and end times of the corresponding continuous speech segment are smaller than a preset synchronization threshold;
calculating the transition time and spatial continuity between the action elements of the reorganized action sequence, calculating the differences in time and space between adjacent action elements, and evaluating action fluency from the speed, acceleration and angular changes during action execution to generate a fluency scoring result;
if the fluency score is lower than a preset scoring threshold, adjusting the timing and spatial positions of the action elements, and obtaining and storing an optimized action sequence through fine adjustment of the actions in time and space;
and inputting the optimized action sequence into an action controller, generating an action execution instruction in combination with the voice timing distribution features, and completing synchronous output of the action and the voice.
2. The method of claim 1, wherein the steps of obtaining a voice signal, extracting voice duration features of the voice signal, segmenting the voice signal in combination with a preset voice segmentation rule, dividing the voice into a continuous voice segment and a pause voice segment, and extracting short-time features corresponding to each segment to obtain voice time sequence distribution features comprise:
performing smooth noise reduction treatment on the voice signal by adopting a wavelet transformation method, performing signal amplitude normalization by setting a high threshold and a low threshold, and eliminating high-frequency noise interference by adopting a Butterworth low-pass filter to obtain an interference-eliminated voice signal;
calculating a short-time energy function and a short-time zero-crossing rate function according to the voice signals after interference elimination to obtain a voice signal strength sequence, marking a voice starting point if the signal strength is greater than a preset volume threshold value, and obtaining a voice duration sequence according to the interval between adjacent starting points;
Segmenting the voice signal after interference elimination aiming at the voice duration sequence, dividing a voice frame by adopting a Hamming window function, determining a voice frame boundary by calculating an inter-frame energy ratio, and extracting a pitch period and a formant frequency from the voice frame to obtain a voice basic feature sequence;
And carrying out continuity analysis according to the voice basic feature sequence, judging the voice frame to be a continuous voice segment if the pitch period change rate of the adjacent voice frames is smaller than a preset threshold value, optimizing the voice segment boundary by adopting a maximum entropy model, and extracting an energy feature vector and a time-varying feature vector from the continuous voice segment and the pause segment to obtain voice time sequence distribution features.
3. The method of claim 1, wherein detecting a start point and an end point of a speech pause segment based on the speech timing distribution feature, determining a start time, an end time, and a duration of the pause segment, combining the speech timing distribution feature with a predetermined action element, calculating a timing alignment relationship of the action element and the speech continuous segment, and generating a preliminary aligned action sequence comprises:
acquiring short-time energy difference values and zero-crossing difference values of adjacent voice frames according to the time sequence distribution characteristics of the voice signals, and marking the voice frames as pause section starting frames if the short-time energy difference values are larger than a first preset threshold value and the zero-crossing difference values are larger than a second preset threshold value;
Performing boundary positioning on the pause segment by a recursive bisection method, obtaining a feature distance sequence by calculating the Mel-frequency cepstral coefficient (MFCC) distance between adjacent voice frames, and determining a second pause segment position sequence from the feature distance sequence;
Dividing the voice continuous segment for the second pause segment position sequence, and obtaining a voice continuous segment feature vector by calculating the duration time, average energy and pitch period parameters of the voice continuous segment;
And calculating the matching degree of the characteristic vector of the continuous speech segment and the characteristic vector of the preset action element by adopting a dynamic time warping algorithm, selecting a corresponding action element according to the matching degree, and carrying out nonlinear expansion and contraction on the execution time sequence of the action element by adopting a cubic spline interpolation algorithm to obtain a time sequence aligned action sequence.
4. The method of claim 1, wherein the detecting whether the initially aligned action sequence has a speech pause, if so, determining whether a reorganization of the action element sequence is required according to a pause duration and a duration of the action element, and when the pause duration is greater than a threshold value of a target action element duration, reorganizing the action element sequence, and generating a reorganized action sequence includes:
Analyzing the voice signal by adopting a short-time energy calculation method to obtain a first voice pause feature sequence, and performing secondary judgment on the first voice pause feature sequence by adopting a short-time zero-crossing rate detection method to obtain a second voice pause feature sequence;
calculating the ratio of the duration of the voice pause section to the target duration of the action element according to the second voice pause feature sequence, and if the ratio of the duration is greater than a preset duration threshold, marking the action element as an element to be recombined to obtain an action element sequence to be recombined;
Aiming at the action element sequence to be recombined, calculating a similarity matrix among action elements by adopting a hierarchical clustering algorithm, and dividing the similarity matrix through an action characteristic threshold value to obtain an action element recombination scheme sequence;
According to the reorganization scheme sequence, an action element connection graph is constructed by a minimum spanning tree algorithm, the action element combination order is determined by calculating the weights of the connecting edges in the graph to obtain the reorganization timing, and the edge weights are calculated by the following formula:
$$w_{ij} = \frac{s_{ij}}{\sum_{k \in N_i} s_{ik}}$$

where $w_{ij}$ represents the weight of the edge from node i to node j, $s_{ij}$ represents the similarity between nodes i and j, and $N_i$ represents the neighbor set of node i.
5. The method of claim 1, wherein the secondarily aligning the recombined action sequence with the time sequence distribution of the continuous speech segment, when the difference between the start time and the end time of the action element and the start time and the end time of the corresponding speech segment is smaller than a preset synchronization threshold, determining that the coupling relationship between the recombined action sequence and the speech segment meets the synchronization requirement, includes:
extracting characteristic parameters of the action elements by adopting a short-time window detection method according to the recombined action sequences, and obtaining action time sequence characteristic sequences by calculating the amplitude change rate and the speed change rate of the action elements;
Detecting local extremum points of the motion amplitude by adopting a self-adaptive threshold method aiming at the motion time sequence characteristic sequence, and obtaining a start-stop time sequence of the motion element by calculating the time interval between the local extremum points;
According to the start-stop time sequence of the action element, a voice energy envelope extraction method is adopted, and the boundary time sequence of the voice paragraph is obtained by calculating the change trend of the voice energy;
Calculating a time difference between the start-stop time sequence of the action elements and the boundary time sequence of the speech paragraphs, and if the time difference is smaller than a first preset threshold, calculating an alignment path by a dynamic programming algorithm;
According to the alignment path, obtaining a corrected action time sequence by adjusting the execution speed of the action element;
For the corrected action time sequence, calculating a start time difference value and an end time difference value of each action element and the corresponding voice paragraph, and if the start time difference value and the end time difference value are smaller than a second preset threshold value, judging that the action element meets the synchronization requirement, wherein the start time difference value and the end time difference value are calculated by the following formula:
$$\Delta T_{start} = \left| T_{action,start} - T_{audio,start} \right|$$

where $\Delta T_{start}$ represents the start-time difference between the action element and the corresponding speech paragraph, $T_{action,start}$ the start time of the action element, and $T_{audio,start}$ the start time of the corresponding speech paragraph;

$$\Delta T_{end} = \left| T_{action,end} - T_{audio,end} \right|$$

where $\Delta T_{end}$ represents the end-time difference between the action element and the corresponding speech paragraph, $T_{action,end}$ the end time of the action element, and $T_{audio,end}$ the end time of the corresponding speech paragraph.
6. The method of claim 1, wherein calculating transition time and spatial continuity between action elements of the reorganized action sequence, calculating differences in time and space between adjacent action elements, evaluating action fluency by acquiring speed, acceleration, and angular changes during execution of an action, generating a fluency scoring result, comprising:
Acquiring space sampling points between adjacent action elements by adopting a three-dimensional coordinate sampling method, and calculating Euclidean distance values and direction vectors according to the space sampling points to obtain a displacement sequence and an angle sequence;
Smoothing the displacement sequence and the angle sequence, calculating a transition trajectory curve by a cubic spline interpolation algorithm, and calculating the curvature and torsion values at the nodes from the transition trajectory curve to obtain a first transition feature sequence;
Calculating an instantaneous speed value and an instantaneous acceleration value of a sampling point in the first transition feature sequence by adopting a central difference method according to the first transition feature sequence, and carrying out segmentation processing through an acceleration threshold value to obtain a second transition feature sequence;
Calculating an angular velocity parameter and an angular acceleration parameter of the action transition stage aiming at the second transition characteristic sequence, and performing curve fitting on the parameters by adopting a least square method to obtain a third transition characteristic sequence;
And extracting the characteristics of the third transition characteristic sequence, calculating the speed fluctuation coefficient, the acceleration peak value ratio, the angle change rate and the track smoothness, and generating a smoothness grading result.
7. The method of claim 1, wherein if the fluency score is lower than a preset score threshold, adjusting the time sequence and the spatial position of the action element, and obtaining and storing the optimized action sequence by fine-tuning the action in time and space, comprises:
Detecting action elements according to a preset fluency scoring threshold, and calculating the deviation values of the action elements on the time axis and the spatial trajectory by a double-threshold detection method to obtain a first space-time optimization vector;
for the timing parameters in the first space-time optimization vector, adjusting the start and stop times of the actions with a particle swarm optimization algorithm, and obtaining a second space-time optimization vector through minimum-time-interval and maximum-delay constraint calculation;
reconstructing the control-point coordinates and tangent directions of the action trajectory with a three-dimensional spline interpolation algorithm according to the spatial parameters in the second space-time optimization vector to obtain a third space-time optimization vector;
for the trajectory nodes in the third space-time optimization vector, calculating their continuity constraints with a Bezier curve fitting algorithm to obtain a fourth space-time optimization vector;
and for the fourth space-time optimization vector, verifying the velocity continuity and acceleration continuity of the action with a kinematic constraint test, and forming and storing the optimized action sequence after verification.
8. The method of claim 1, wherein inputting the optimized motion sequence into the motion controller, generating a motion execution instruction in combination with the voice time sequence distribution feature, and completing the synchronous output of the motion and the voice, comprises:
Acquiring action element time stamp information and voice segment time stamp information by adopting a high-precision clock source, and recording the start and stop moments of the voice segment corresponding to the action element by generating a time sequence association table to obtain a first synchronous control sequence;
Calculating joint position parameters and joint velocity parameters by a joint interpolation algorithm according to the first synchronous control sequence, and planning the action trajectory with forward-kinematics equations to obtain a second synchronous control sequence;
For the second synchronous control sequence, a circular queue structure is adopted to buffer action parameters, action data is obtained by setting a queue read-write pointer, and if the data amount in the queue reaches a preset buffer threshold, an action control signal is output;
according to the sampling phase difference value of the motion control signal and the voice playing signal, a proportional integral controller is adopted to adjust the motion execution speed, and the time sequence deviation is compensated through adjusting the execution period, so that a corrected motion control signal is obtained;
and the corrected action control signal and the corrected voice playing signal are output in parallel, and a synchronous clock signal is generated through a hardware timer, so that synchronous output of the action execution signal and the voice playing signal is realized.