CN113192510A - Method, system and medium for implementing voice age and/or gender identification service

Info

Publication number: CN113192510A
Application number: CN202011591501.3A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN113192510B (granted publication)
Inventors: 杨学锐, 晏超
Current and original assignee: Yuncong Technology Group Co Ltd
Prior art keywords: voice, age, gender, recognition, grpc
Legal status: Granted; currently active
Events: application filed by Yuncong Technology Group Co Ltd with priority to CN202011591501.3A; publication of CN113192510A; application granted; publication of CN113192510B

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the field of voice recognition, and in particular to a method, system, device and medium for implementing a voice age and/or gender recognition service. It aims to solve the technical problem of remotely and accurately invoking an existing voice age and/or gender recognition model while keeping deployment simple and convenient. To this end, a terminal calls the server through a serialized voice age/gender recognition request under a predefined GRPC framework; via the configured age/gender voice recognition service, the server accurately selects the corresponding voice age/gender recognition deep neural network model, decodes the audio to determine the age and/or gender information of the target object, and returns this information to the terminal. Because of this age and/or gender service mode and the remote-call architecture, the corresponding model is invoked only after its type has been determined, so the call is accurate without depending on a fixed framework; the scheme offers greater flexibility, strong extensibility, high resource utilization and high concurrency, and facilitates iterative updating of the algorithm model.

Description

Method, system and medium for implementing voice age and/or gender identification service
Technical Field
The invention relates to the field of voice recognition, in particular to a method and a system for realizing voice age and/or gender recognition.
Background
A speech age and/or gender recognition model based on the Kaldi HMM-DNN hybrid architecture has great advantages in speech recognition capability, but is very difficult to deploy and use in industry. A common approach is to convert the Kaldi Nnet3 model into an ONNX model through a model conversion tool and then serve the ONNX model with another deep learning engine (for example, the MACE mobile AI compute engine), or to deploy it with TensorFlow Serving. However, the frameworks used in both approaches are fixed and hard to modify, so the resulting speech recognition service has poor flexibility and extensibility; moreover, they only support the operators of the Kaldi neural network inference part, and WFST decoding still has to be performed by Kaldi itself.
A speech recognition engine based on WebSocket and the GStreamer framework natively provided by Kaldi can offer a certain speech service capability, but it cannot meet actual industrial deployment requirements in terms of memory footprint, decoding speed and concurrency.
Moreover, such speech recognition engines generally expose a REST API and have no serialization or compression mechanism for the transmitted audio data, which hinders the transmission of large files of long-duration speech audio and makes them extremely awkward to use in scenarios requiring bidirectional streaming interaction.
On the other hand, voice audio formats are diverse, while a typical speech recognition engine supports only one predefined audio format (for example, a 16k or 8k sampling rate) and cannot dynamically adapt to different requirements.
Therefore, although the existing voice recognition model based on the Kaldi HMM-DNN hybrid architecture has great recognition advantages, it is difficult to deploy in practice, its extensibility and flexibility are poor, and its decoding speed, concurrency, resource occupation, interaction and dynamic adaptability after deployment cannot meet practical application requirements, resulting in poor user experience. A scheme that is more flexible, easier to extend and better for the user is therefore required.
Disclosure of Invention
To overcome these defects, the present invention solves, or partially solves, the technical problem of how to more simply and quickly call a voice age and/or gender recognition deep neural network model that supports the voices to be recognized of different target objects. By establishing an age and/or gender voice recognition service, the age and/or gender information of a target object can be acquired remotely and accurately in a more efficient, flexible and extensible way, thereby improving user experience. To this end, the present invention provides a method, system, apparatus and medium for implementing a voice age and/or gender identification service.
In a first aspect, a method for implementing a speech age and/or gender identification service is provided, comprising: receiving a serialized voice recognition request sent by a GRPC-defined client, wherein the voice recognition request is a request to recognize age and/or gender through voice; performing a deserialization operation on the voice recognition request and parsing the audio data and parameter field information in the voice recognition request; selecting a corresponding voice age and/or gender recognition deep neural network model according to the parameter field information, and decoding the audio data and its context audio information through the voice age and/or gender recognition deep neural network model to obtain a corresponding age and/or gender recognition result; and returning the recognition result to the client.
Wherein, the specific process defined by the GRPC comprises the following steps: according to the structure of the ProtoBuf of the GRPC, parameter field information related to a voice age and/or gender identification service mode, an audio format, voice audio data to be identified, a sampling rate and an audio length is predefined so as to obtain a well-defined ProtoBuf protocol of the GRPC; compiling and generating GRPC service interface codes of a client and a server for carrying out GRPC voice age and/or gender identification service according to the ProtoBuf protocol so as to carry out remote calling between the client and the server; the client with the GRPC service interface code is a client defined by GRPC; the service end with the GRPC service interface code is a service end defined by GRPC.
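As an illustration only (the patent does not publish its actual protocol file), a minimal ProtoBuf/GRPC service definition covering the fields named above might look like the following sketch; all message, field and service names here are invented for illustration and are not taken from the patent:

```python
# speech.proto, expressed as a Python string for reference; a hypothetical
# sketch of the GRPC ProtoBuf protocol described above. Names are assumptions.
PROTO_SKETCH = r"""
syntax = "proto3";

package speech;

message RecognitionRequest {
  string service_mode = 1;   // "simple" (non-streaming) or "stream" (bidirectional)
  string audio_format = 2;   // e.g. "wav", "pcm", "mp3"
  int32  sample_rate  = 3;   // e.g. 8000 or 16000
  int64  audio_length = 4;   // total audio length in bytes
  bytes  audio_data   = 5;   // the voice audio data to be recognized
}

message RecognitionResult {
  string age_group = 1;      // e.g. "child", "teenager", "adult", "elderly"
  string gender    = 2;      // e.g. "male", "female"
  bool   is_final  = 3;      // true for the last segment of a stream
}

service AgeGenderRecognizer {
  // one-sentence recognition: whole audio in, one result out
  rpc Recognize(RecognitionRequest) returns (RecognitionResult);
  // real-time recognition: audio segments in, partial results out
  rpc StreamingRecognize(stream RecognitionRequest)
      returns (stream RecognitionResult);
}
"""
```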
Wherein the serialized voice recognition request is a remote request sent in a voice age and/or gender recognition service mode pre-selected by the GRPC-defined client; the remote request comprises voice audio data to be recognized and audio parameter field information, serialized through the ProtoBuf structure. "Deserializing the voice recognition request and parsing the audio data and the parameter field information in the voice recognition request" specifically includes: performing a deserialization operation on the voice recognition request through the ProtoBuf structure to obtain the voice audio data to be recognized and the audio parameter field information, wherein the audio parameter field information at least comprises the audio format, the sampling rate, and field information of the voice age and/or gender recognition service mode; and parsing the corresponding voice audio data to be recognized based on the audio format so as to uniformly convert it into audio data in the PCM data format. "Selecting a corresponding speech age and/or gender recognition deep neural network model according to the parameter field information" specifically includes: determining, according to the voice age and/or gender recognition service mode, that the type of voice recognition model to be selected is a voice age and/or gender recognition deep neural network model; and calling, according to the audio format and the sampling rate, the corresponding voice age and/or gender recognition deep neural network model that supports recognition of that audio format and sampling rate. "Obtaining the corresponding age and/or gender recognition result by decoding the audio data through the speech age and/or gender recognition deep neural network model" specifically includes: decoding the audio data converted into the PCM data format, together with its context audio information, using that model to obtain the corresponding age and/or gender recognition result. "Returning the recognition result to the client" specifically includes: serializing the recognition result and sending it back to the GRPC-defined client.
Wherein "serializing the recognition result and sending it back to the GRPC-defined client" specifically includes: serializing, encoding and compressing the recognition result through the ProtoBuf structure; and calling the corresponding result-return logic according to the voice age and/or gender recognition service mode to send the serialized, compressed recognition result back to the GRPC-defined client. The voice age and/or gender recognition service modes include bidirectional streaming for real-time recognition and non-streaming for one-sentence recognition. The result-return logic is: for the non-streaming mode, return the recognition result once; for the bidirectional streaming mode, return the recognition result of each segment as it is decoded, and return the final recognition result after all voice audio data to be recognized have been transmitted. The returned final recognition result contains the corresponding age and/or gender information.
Wherein the speech age and/or gender recognition deep neural network model adopts a feedforward sequential memory network (FSMN), a time-delay neural network (TDNN), or a factored time-delay neural network (TDNNF) model based on the Kaldi hybrid architecture. When the FSMN is adopted, it is a 10-layer deep FSMN whose last two layers form a time-restricted self-attention network, with a skip connection (shortcut) between every two layers; cross entropy is adopted as the loss function during training.
In a second aspect, there is provided a method of implementing a speech age and/or gender identification service, comprising: reading voice audio data to be recognized, and selecting a corresponding voice age and/or gender recognition service mode; serializing, encoding and compressing the voice audio data to be recognized and its corresponding audio parameter field information through the ProtoBuf structure to form a voice recognition request; and calling the voice age and/or gender recognition service mode to send the voice recognition request to a GRPC-defined server, so as to remotely call the voice age and/or gender recognition deep neural network model corresponding to the voice recognition request to perform voice recognition. The voice age and/or gender recognition service modes include bidirectional streaming for real-time recognition and non-streaming for one-sentence recognition, and the voice recognition request is a request to recognize age and/or gender through voice.
Wherein, the process defined by the GRPC specifically includes: according to the structure of the ProtoBuf of the GRPC, parameter field information related to a voice age and/or gender identification service mode, an audio format, voice audio data to be identified, a sampling rate and an audio length is predefined so as to obtain a well-defined ProtoBuf protocol of the GRPC; compiling and generating GRPC service interface codes of a client and a server for carrying out GRPC voice age and/or gender identification service according to the ProtoBuf protocol so as to carry out remote calling between the client and the server; the client with the GRPC service interface code is a client defined by GRPC; the service end with the GRPC service interface code is a service end defined by GRPC.
The method further comprises: receiving the corresponding age and/or gender recognition result obtained by the GRPC-defined server recognizing the voice audio data to be recognized, wherein the GRPC-defined server serializes, encodes and compresses the recognition result through the ProtoBuf structure and calls the corresponding result-return logic according to the voice age and/or gender recognition service mode; and deserializing and outputting the received recognition result through the ProtoBuf structure. Remotely calling the voice age and/or gender recognition deep neural network model of the voice recognition request for voice recognition includes: determining the type of voice recognition model to be selected according to the voice age and/or gender recognition service mode; calling, according to the audio format and the sampling rate, the voice age and/or gender recognition deep neural network model of that type which supports recognition of the audio format and sampling rate; and extracting the context information of the audio data within a preset time to perform voice recognition, obtaining the corresponding age and/or gender recognition result. The audio data is obtained by uniformly converting the voice audio data into the PCM data format based on the audio format in the process of parsing the voice recognition request.
In a third aspect, there is provided a server for implementing a voice age and/or gender identification service, comprising: a receiving module for receiving a serialized voice recognition request sent by a GRPC-defined client, wherein the voice recognition request is a request to recognize age and/or gender through voice; a serialization module for performing a deserialization operation on the voice recognition request and a serialization operation on the recognition result; an audio parsing module for parsing the data and parameter field information in the voice recognition request; a voice recognition core algorithm module for selecting a corresponding voice age and/or gender recognition deep neural network model according to the parameter field information, and decoding the audio data and its context audio information through that model to obtain the corresponding age and/or gender recognition result; and a return module for returning the recognition result to the client.
Wherein, the process defined by the GRPC specifically includes: according to the structure of the ProtoBuf of the GRPC, parameter field information related to a voice age and/or gender identification service mode, an audio format, voice audio data to be identified, a sampling rate and an audio length is predefined so as to obtain a well-defined ProtoBuf protocol of the GRPC; compiling and generating GRPC service interface codes of a client and a server for carrying out GRPC voice age and/or gender identification service according to the ProtoBuf protocol so as to carry out remote calling between the client and the server; the client with the GRPC service interface code is a client defined by GRPC; wherein, the server with the GRPC service interface code is a server defined by GRPC.
Wherein the serialized voice recognition request is a remote request sent in a voice age and/or gender recognition service mode pre-selected by the GRPC-defined client; the remote request comprises voice audio data to be recognized and audio parameter field information serialized through the ProtoBuf structure. The serialization module specifically comprises: a ProtoBuf deserialization unit for deserializing the voice recognition request through the ProtoBuf structure to obtain the voice audio data to be recognized and the audio parameter field information, the audio parameter field information at least comprising the audio format, the sampling rate, and field information of the voice age and/or gender recognition service mode; and a ProtoBuf serialization unit for serializing, encoding and compressing the recognition result through the ProtoBuf structure. The audio parsing module specifically parses the corresponding voice audio data to be recognized based on the audio format so as to uniformly convert it into audio data in the PCM data format. The voice recognition core algorithm module specifically: determines, according to the voice age and/or gender recognition service mode, that the type of voice recognition model to be selected is a voice age and/or gender recognition deep neural network model; calls, according to the audio format and the sampling rate, the voice age and/or gender recognition deep neural network model that supports recognition of that audio format and sampling rate; and decodes the audio data converted into the PCM data format, together with its context audio information, using that model to obtain the corresponding age and/or gender recognition result. The return module specifically includes a return logic unit for calling the corresponding result-return logic according to the voice age and/or gender recognition service mode to send the serialized, compressed recognition result back to the GRPC-defined client. The voice age and/or gender recognition service modes include bidirectional streaming for real-time recognition and non-streaming for one-sentence recognition. The result-return logic is: for the non-streaming mode, return the recognition result once; for the bidirectional streaming mode, return the recognition result of each segment as it is decoded, and return the final recognition result after all voice audio data to be recognized have been transmitted; the returned final recognition result contains the corresponding age and/or gender information.
Wherein the serialization module specifically includes a ProtoBuf deserialization unit for deserializing the voice recognition request through the ProtoBuf structure, the audio parameter field information at least including field information of the audio format, the sampling rate and the voice recognition service mode. The parsing operation of the audio data parsing module specifically includes: parsing the corresponding voice audio data to be recognized according to the audio format so as to uniformly convert it into audio data in the PCM data format. The recognition operation of the voice recognition core algorithm module specifically includes: selecting a corresponding Kaldi voice recognition model according to the audio format and the sampling rate, and performing voice recognition decoding on the audio data converted into the PCM data format with that model to obtain the recognition result.
Wherein the speech age and/or gender recognition deep neural network model adopts a feedforward sequential memory network (FSMN), a time-delay neural network (TDNN), or a factored time-delay neural network (TDNNF) model based on the Kaldi hybrid architecture. When the FSMN is adopted, it is a 10-layer deep FSMN whose last two layers form a time-restricted self-attention network, with a skip connection (shortcut) between every two layers; cross entropy is adopted as the loss function during training.
In a fourth aspect, a terminal for implementing a voice age and/or gender identification service is provided, comprising: a GRPC client module for reading voice audio data to be recognized; a GRPC mode selection module for selecting the corresponding voice age and/or gender recognition service mode when the GRPC client module reads the voice audio data to be recognized; and a ProtoBuf serialization module for serializing, encoding and compressing the voice audio data to be recognized and its corresponding audio parameter field information through the ProtoBuf structure to form a voice recognition request, wherein the voice recognition request is a request to recognize age and/or gender through voice. The GRPC client module is further configured to call the voice age and/or gender recognition service mode to send the voice recognition request to a GRPC-defined server, so as to remotely call the voice age and/or gender recognition deep neural network model corresponding to the voice recognition request to perform voice recognition. The voice age and/or gender recognition service modes include bidirectional streaming for real-time recognition and non-streaming for one-sentence recognition.
Wherein, the process defined by the GRPC specifically includes: according to the structure of the ProtoBuf of the GRPC, parameter field information related to a voice age and/or gender identification service mode, an audio format, voice audio data to be identified, a sampling rate and an audio length is predefined so as to obtain a well-defined ProtoBuf protocol of the GRPC; compiling and generating GRPC service interface codes of a client and a server for carrying out GRPC voice age and/or gender identification service according to the ProtoBuf protocol so as to carry out remote calling between the client and the server; the client with the GRPC service interface code is a client defined by GRPC; wherein, the server with the GRPC service interface code is a server defined by GRPC.
Wherein the GRPC client module is further configured to receive, from the GRPC-defined server, the corresponding age and/or gender recognition result obtained by recognizing the voice audio data to be recognized; the GRPC-defined server serializes, encodes and compresses the recognition result through the ProtoBuf structure and calls the corresponding result-return logic according to the voice age and/or gender recognition service mode. The ProtoBuf serialization module is further configured to deserialize and output the received recognition result through the ProtoBuf structure. On the server, the type of voice recognition model to be selected is determined through the voice age and/or gender recognition service mode; the voice age and/or gender recognition deep neural network model of that type which supports recognition of the audio format and sampling rate is called through the audio format and the sampling rate; and voice recognition is performed by extracting the context information of the audio data within a preset time, obtaining the corresponding age and/or gender recognition result. The audio data is obtained by uniformly converting the voice audio data into the PCM data format based on the audio format in the process of parsing the voice recognition request.
In a fifth aspect, a computer readable storage medium is provided, which stores a plurality of program codes, wherein the program codes are adapted to be loaded and executed by a processor to perform the method for implementing a speech age and/or gender identification service according to any of the first and second aspects.
In a sixth aspect, a processing device is provided, comprising a processor and a storage device, wherein the storage device is adapted to store a plurality of program codes, wherein the program codes are adapted to be loaded and executed by the processor to perform the method for implementing a speech age and/or gender identification service according to any of the first and second aspects.
In a seventh aspect, a system for implementing a voice age and/or gender identification service is provided, which is characterized by comprising a server for implementing a voice age and/or gender identification service according to any one of the third aspect and a terminal for implementing a voice age and/or gender identification service according to any one of the fourth aspect.
One or more technical schemes of the invention at least have one or more of the following beneficial effects:
In the technical scheme of the invention, for the complex task of recognizing the age and/or gender of a specific target object from voice, a dedicated age and/or gender voice recognition service mode is set, and a voice recognition service engine formed by remote procedure calls between a GRPC-defined client and server is used. The recognition model category on the remote server is determined first, so that an age and/or gender recognition model supporting the given audio format and sampling rate can be selected accurately, and the age and/or gender information of the target object is recognized from its voice; recognition accuracy is high, the selectable range of models is large, and model updates do not affect the client. In particular, once the voice age and/or gender recognition service mode is selected, that mode determines which category among the various available recognition models, namely the one related to age and/or gender recognition, should be chosen; then, through the audio format and the sampling rate, the corresponding model supporting that format and rate can be called accurately and quickly from among several different voice age and/or gender models, and that model decodes the corresponding audio data together with its context audio information through its specific structure. Recognition is therefore more accurate and more versatile: age and/or gender recognition can be provided as long as the server has a speech age and/or gender recognition model supporting the corresponding audio format and sampling rate.
Furthermore, the voice audio data can be serialized and deserialized with ProtoBuf, which greatly reduces the network transmission overhead of large audio files and improves the transmission rate. The GRPC remote procedure call framework, based on HTTP/2.0, effectively combines multithreading, concurrency, and efficient unidirectional and bidirectional stream transmission with service responses. Meanwhile, the Kaldi decoder on the server is decoupled and modularized (for example, via mode selection) and combined with the GPU, realizing the voice recognition core algorithm and multi-model management and greatly improving the utilization of server resources.
Furthermore, by separating a lightweight framework such as GRPC from the voice recognition core algorithm, the voice algorithm can be optimized and iterated quickly and conveniently without being affected by client deployment. Through the ProtoBuf protocol service and field definitions, voice age and/or gender recognition models for different audio formats and sampling rates can be selected flexibly, and the functionality of the current voice algorithm engine can be extended freely, fully embodying the flexibility and extensibility of the deployment of this voice recognition service.
Drawings
Embodiments of the invention are described below with reference to the accompanying drawings, in which:
FIGS. 1 and 2 are main flow diagrams of an embodiment of a method of implementing a voice age and/or gender identification service in accordance with the present invention;
FIG. 3 is a block diagram of one embodiment of a system for implementing a voice age and/or gender identification service, in accordance with the present invention;
FIG. 4 is a schematic diagram of an application-time interaction process according to an embodiment of the present invention;
FIG. 5 is an exemplary structure of an FSMN model for speech age/gender identification according to an embodiment of the present invention;
FIGS. 6 and 7 are schematic diagrams of hardware structures of an embodiment of a processing device to which the technical solution of the present invention is applied.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention and are not intended to limit the scope of the present invention.
In the description of the present invention, a "module" or "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, and memory, may comprise software components such as program code, or may be a combination of software and hardware. The processor may be a central processing unit, microprocessor, image processor, digital signal processor, or any other suitable processor. The processor has data and/or signal processing functionality. The processor may be implemented in software, hardware, or a combination thereof. Non-transitory computer readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random access memory, and the like. The term "A and/or B" denotes all possible combinations of A and B, such as A alone, B alone, or A and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include A alone, B alone, or both A and B. The singular forms "a", "an" and "the" may include the plural forms as well.
Technical terms involved in the present invention are explained as follows:
kaldi: the method is mainstream in the field of voice recognition, and the most widely used special deep learning platform for voice recognition is used;
GRPC, Google's remote procedure call service framework, mainly define the interface through Protobuf;
google develops a data serialization protocol which is independent of language and platform, has extensible rules of a serialization data structure and is used for a data communication protocol, data storage and the like;
ONNX, an open neural network exchange format, is a standard for representing deep learning models, and can transfer the models among different frameworks;
the Gstreamer is a multimedia framework supporting cross-platform;
PCM is a pulse modulation coding format, and represents the method of sampling audio frequency analog signal by digit;
FSMN: the fed Sequential Memory Network model can achieve the same effect as Bi-RNN under the condition of small time delay.
TDNN: the Time Delay Neural Network is a multi-layer Network, each layer has strong abstract capability to characteristics and has the capability of expressing the relation of voice characteristics in Time.
TDNNF: factored TDNN, compared to the TDNN model, defines one of the two matrices after factorization as a semi-positive and uses skip connection.
The main implementation principle of the technical scheme of the invention is as follows. Field information and content such as the voice recognition service mode, the format of the voice audio to be recognized, the voice audio data and the audio length are defined through the ProtoBuf structure of GRPC. When voice audio data to be recognized is actually read, the specific audio information and the other parameters corresponding to it, i.e., the audio parameter field information, are filled into these parameter fields. This definition fixes a ProtoBuf protocol; compiling the proto under this protocol generates the interface code of the corresponding server and client, i.e., of the two ends between which the voice recognition service must be remotely called (for example, the ProtoBuf compiler compiles the proto into concrete classes, each field of which can be accessed by simple methods and serialized or deserialized), and the service modules of the server's voice recognition service can be divided accordingly. After the client reads the voice audio data, it serializes and compresses them with ProtoBuf (the voice audio data together with the corresponding audio parameter field information, including the voice recognition service mode) to form a voice recognition request, and sends the request to the server according to the voice recognition service mode. Following the agreed interface code, the server restores the original voice audio data through ProtoBuf deserialization, converts the audio data into the uniform PCM data format by combining audio parameter field information such as the audio format and the sampling rate, selects the corresponding Kaldi voice model through the voice recognition service mode indicated in the parameter field information together with the audio format and the sampling rate, performs voice age and/or gender recognition with a deep neural network such as an FSMN, TDNN or TDNNF structure, and decodes the audio content. The server then serializes and compresses the decoded recognition result with ProtoBuf according to the defined interface code and sends it back, in the manner corresponding to the voice recognition service mode, to the client that requested voice recognition. The client receives the server's reply, deserializes the recognition result, and outputs it (as image/video, text, audio, etc.), completing the call to the voice recognition service.
The invention uses ProtoBuf to serialize the voice audio data and the recognition result, which greatly improves the efficiency of transmitting large audio files over the network. The voice recognition core algorithm is realized behind a lightweight framework called through GRPC remote procedure calls (a lightweight speech recognition engine built together with ProtoBuf); streaming and non-streaming transmission based on GRPC can effectively solve the deployment problems of real-time online voice recognition and non-real-time voice recognition, offering the flexibility, extensibility and concurrency required for industrial deployment, with strong support for clients in many programming languages.
The implementation of the present invention is described below with reference to the main flow charts of one embodiment of the method for implementing the speech age and/or gender identification service of the present invention shown in fig. 1 and 2.
Step S110, based on the pre-defined GRPC service interface code, reading the voice audio data to be recognized, and selecting the corresponding voice recognition service mode.
Specifically, the proto file's service methods and interaction fields are defined: the functions of the speech recognition service and interactive fields such as the audio format and the sampling rate are predefined.
In one embodiment, the predefining process specifically includes: according to the ProtoBuf structure of GRPC, predefining parameter field information related to the voice recognition service mode, the audio format, the voice audio data to be recognized, the sampling rate and the audio length, so as to obtain a well-defined GRPC ProtoBuf protocol. The client reads the voice audio data to be recognized and obtains its corresponding parameter data such as the audio format, the sampling rate and the audio length; these form the parameter field information by placing the values for the voice audio data under the predefined parameter field names and contents, and subsequent serialized encoding and compression makes transmission efficient. The fields are mainly interactive fields that both the server and the client must be able to obtain and use. The voice recognition here is the recognition of age and/or gender in the voice; that is, the voice recognition request is a request to recognize the user's age and/or gender through voice.
Then, according to the ProtoBuf protocol, the GRPC service interface code of the client and the server for performing the GRPC voice recognition service is compiled and generated, so that remote calls can be made between the client and the server (e.g., a server machine). The GRPC definition establishes GRPC service interface code implementing remote invocation between server and client, so that the client calls the server's various applications through the interface code protocol, in particular a specific, targeted voice recognition model. The client with the GRPC service interface code is the GRPC-defined client, and the server with the GRPC service interface code is the GRPC-defined server. Following the interface code protocol, the server operates in combination with the CPU to select and use a targeted voice recognition model, forming modular management. That is, ProtoBuf compilers for various languages may be used to generate the interface code of the client and the server, and the voice recognition core algorithm is implemented on top of the GRPC interface code, especially on the server side.
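For instance, with the Python toolchain (one of many languages GRPC supports), the protocol file could be compiled into client/server interface code roughly as follows; the file and module names continue the hypothetical sketch given earlier and are not prescribed by the patent:

```python
# Hypothetical codegen step: compile speech.proto into Python interface code.
# Requires the grpcio-tools package; names follow the earlier sketch.
from grpc_tools import protoc

protoc.main([
    "protoc", "-I.",
    "--python_out=.",       # generates speech_pb2.py (ProtoBuf message classes)
    "--grpc_python_out=.",  # generates speech_pb2_grpc.py (GRPC stubs/servicer)
    "speech.proto",
])
```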
Further, the client that reads the voice audio data to be recognized is a GRPC-defined client; clients written in various programming languages, or speaking different natural languages, are supported without restriction on language type. The proto file is compiled into classes according to the ProtoBuf structure (comprising the GRPC service interface and the transmission fields, i.e., the GRPC service interface code), and the client can read an audio file and initiate a remote request to a voice server, such as a server remotely called through GRPC.
Further, the GRPC-defined client can select the corresponding voice recognition service mode when reading the voice audio data to be recognized, wherein the voice recognition service modes include: bidirectional-streaming real-time speech recognition and non-streaming one-sentence speech recognition.
Specifically, the GRPC Simple mode is used to call the remote speech recognition service in a non-streaming manner: if the client selects the non-streaming service, the audio is transmitted to the server in full at once, and the server's speech recognition service returns the recognition result once after recognition is completed.
Specifically, the GRPC Stream mode is used to call the remote speech recognition service with bidirectional streaming: if the client selects the streaming service, large audio data is transmitted to the server in segments, the speech recognition service decodes the received audio data segment by segment and returns recognition results segment by segment, thereby completing the real-time speech recognition function.
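A client-side sketch of these two call patterns, again using the hypothetical generated modules speech_pb2/speech_pb2_grpc and an assumed server address, could look like this:

```python
# Hypothetical client sketch for the two service modes described above.
import grpc
import speech_pb2, speech_pb2_grpc  # assumed generated modules

def recognize_simple(audio_bytes: bytes, fmt: str, rate: int):
    """GRPC Simple mode: send the whole audio once, get one result back."""
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = speech_pb2_grpc.AgeGenderRecognizerStub(channel)
        request = speech_pb2.RecognitionRequest(
            service_mode="simple", audio_format=fmt, sample_rate=rate,
            audio_length=len(audio_bytes), audio_data=audio_bytes)
        return stub.Recognize(request)  # ProtoBuf-serialized under the hood

def recognize_stream(segments, fmt: str, rate: int):
    """GRPC Stream mode: send segments, receive partial results as decoded."""
    def requests():
        for seg in segments:
            yield speech_pb2.RecognitionRequest(
                service_mode="stream", audio_format=fmt, sample_rate=rate,
                audio_length=len(seg), audio_data=seg)
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = speech_pb2_grpc.AgeGenderRecognizerStub(channel)
        for result in stub.StreamingRecognize(requests()):
            print(result.age_group, result.gender, result.is_final)
```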
And step S120, performing serialized coding compression on the voice audio data to be recognized and the audio parameter field information corresponding to the voice audio data by using a ProtoBuf structure to form a voice recognition request.
In one embodiment, based on the predefined GRPC service interface code, the client defined by the GRPC performs serialized coding compression on all parameter field information, i.e. the read voice audio data to be recognized and audio parameter field information (sampling rate, audio format, voice service mode, audio length, etc.) corresponding to the voice audio data, by using a ProtoBuf structure, to form a binary sequence voice recognition request, which is a remote request.
The transmitted voice audio data and the various corresponding audio parameter field information are serialized and encoded as the voice recognition request, after which data compression can optionally be applied.
In one embodiment, the spoken human audio may be captured by a microphone or microphone array, typically over 1-2 seconds.
Step S130, the selected voice recognition service mode is called to send the voice recognition request to a service terminal defined by GRPC, so as to remotely call a Kaldi voice recognition service model corresponding to the voice recognition request to perform voice recognition.
In one embodiment, the remote request (e.g., the speech recognition request) is sent to the GRPC-defined server by invoking the pre-selected speech recognition service mode. After the voice recognition request reaches the GRPC-defined server, the server can determine, from the audio parameter field information other than the voice audio data itself, which suitable Kaldi voice recognition service model to select, and then recognize the voice audio data with it. The remote call only selects and runs the needed, matching speech recognition model, yielding modularity, easier management and easier extension; updates and iterative changes to the server's speech recognition models cannot affect the client's ability to perform recognition through remote calls, so the approach is flexible and highly extensible.
Step S240, receiving a speech recognition request sent by a client defined by a GRPC, where the speech recognition request includes serialized speech audio data to be recognized and audio parameter field information.
Specifically, the voice recognition request is a remote request for invoking a voice recognition service of the server. See step S120, supra. According to the GRPC service interface code, the service end can receive the remote call of the request to the voice recognition service on the service end.
Step S250, performing deserialization operation on the voice recognition request based on a predefined GRPC service interface code to obtain the voice audio data to be recognized and the audio parameter field information.
In one embodiment, since the predefined GRPC service interface code indicates that the server holding the interface code is the GRPC-defined server, serialization and deserialization can be performed through the ProtoBuf structure. Therefore, for a remote request (such as the voice recognition request) from a client with which a GRPC-framework remote-call relationship has been formed, the voice recognition request can be deserialized through the ProtoBuf structure and the original audio voice data obtained directly for decoding.
Furthermore, because serialization and deserialization of binary fields are fast and simple, both the transmission speed and the speed of reading the field information are improved, and parameter data can be extracted from the parameter field information quickly. Besides the voice audio data to be recognized, the audio parameter field information may at least include: the audio format, the sampling rate, and the speech recognition service mode.
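The serialization and deserialization relied on here are the standard ProtoBuf binary round trip; a minimal sketch with the hypothetical message type from the earlier illustration:

```python
# ProtoBuf round trip: the compact binary form is what travels over GRPC.
import speech_pb2  # assumed generated module

request = speech_pb2.RecognitionRequest(
    service_mode="simple", audio_format="wav",
    sample_rate=16000, audio_data=b"...")

wire_bytes = request.SerializeToString()   # serialization: compact binary

decoded = speech_pb2.RecognitionRequest()
decoded.ParseFromString(wire_bytes)        # deserialization on the server
assert decoded.sample_rate == 16000        # fields are recovered intact
```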
Step S260, the voice audio data to be recognized is analyzed according to the audio parameter field information, so that the voice audio data to be recognized is uniformly converted into audio data in a PCM data format.
In one embodiment, the corresponding voice audio data to be recognized may be parsed according to the audio format and sampling rate defined in the audio parameter field information of the request (i.e., the field contents of the interactive fields defined for this purpose): the audio format field corresponding to the transmitted voice audio data is parsed, and after the voice audio data is decoded according to its particular audio format, format conversion is performed, so that the voice audio data to be recognized is uniformly converted into audio data in the PCM data format; a uniform audio format facilitates subsequent voice recognition.
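As a rough illustration, WAV input can be unwrapped to raw PCM with the Python standard library; compressed container formats (mp3, ape, wma, ...) would need an external decoder, which is omitted here:

```python
# Minimal sketch: unwrap a WAV container to raw PCM samples.
# Compressed formats (mp3, ape, wma, ...) would require an external decoder.
import io
import wave

def wav_to_pcm(wav_bytes: bytes):
    """Return (raw_pcm_bytes, sample_rate, sample_width, channels)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        pcm = wav.readframes(wav.getnframes())  # WAV payload is already PCM
        return pcm, wav.getframerate(), wav.getsampwidth(), wav.getnchannels()
```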
Step S270, selecting a corresponding Kaldi voice recognition service model according to the audio parameter field information, so as to decode the audio data converted into the PCM data format and obtain a recognition result.
In one embodiment, which Kaldi speech recognition model to select, and for which format, is determined according to the model parameters defined by the fields in the audio parameter field information, such as the sampling rate and the audio format, and even the speech service mode; candidate models include FSMN, TDNN, TDNNF, etc. An example of a preferred FSMN is shown in FIG. 5. Further, once the corresponding model is selected, it can be used to perform speech recognition, i.e., decoding and transcription, on the audio data converted into the PCM data format (the voice audio data to be recognized), after which the recognition result is obtained and output.
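This model selection amounts to a lookup keyed on the parsed parameter fields; a hedged sketch (the registry contents and key layout are invented for illustration):

```python
# Hypothetical model dispatch: pick the model that supports the parsed task,
# audio format and sampling rate. Registry contents are illustrative only.
MODEL_REGISTRY = {
    # (task, audio_format, sample_rate) -> loaded model handle
    ("age_gender", "pcm", 16000): "fsmn_age_gender_16k",
    ("age_gender", "pcm", 8000):  "tdnnf_age_gender_8k",
}

def select_model(task: str, audio_format: str, sample_rate: int):
    try:
        return MODEL_REGISTRY[(task, audio_format, sample_rate)]
    except KeyError:
        raise ValueError(
            f"no model supports {task}/{audio_format}/{sample_rate}")
```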
Referring to FIG. 4, the server performs speech age recognition or speech gender recognition using a deep neural network model. The speech age recognition algorithm may be a TDNN, TDNNF or FSMN model. Preferably, taking the FSMN model as an example, the model comprises 10 FSMN layers with a skip connection between every two layers, where a × b in each layer means that the context relationship is a and the stride is b. The FSMN structure uses cross entropy as its loss function, and when the loss no longer decreases for 2 consecutive epochs, the neural network training is considered to have reached a usable state. So that the model can read the input signal normally, feature extraction is performed on the signal input to the model; the feature extraction methods include, but are not limited to, one or more of Fourier transform, short-time Fourier transform, framing, windowing, pre-emphasis, Mel filtering and discrete cosine transform.
More specifically, the deep neural network model is used for speech age recognition or speech gender recognition on the server. The speech age recognition algorithm may be a TDNN, TDNNF or FSMN model. Taking the feedforward sequential memory network (FSMN) model as an example, it comprises multiple layers of deep feedforward sequential memory networks with skip connections between every two layers; a skip connection is made after the context relationship and the stride in the deep feedforward sequential memory network change. The memory blocks of the deep feedforward sequential memory networks in different layers differ in size, growing from small to large with the layer level. After the context and the stride change, a skip connection is made, and the gradient of the current skip connection is passed both to the next skip connection and to the deep feedforward sequential memory network two layers away. Training stage: the format of the one or more voice audios used for training is unified using down-sampling and/or up-sampling, i.e., sampling at a given sampling rate. The sampling rate, also called the sampling speed or sampling frequency, defines the number of samples per second extracted from a continuous signal to form a discrete signal, expressed in hertz (Hz); the reciprocal of the sampling frequency is the sampling period or sampling time, i.e., the time interval between samples. Colloquially, the sampling frequency is how many signal samples per second a computer takes. Common audio formats include, but are not limited to: wav, pcm, mp3, ape, wma, etc.
The conversion then targets a unified format; the converted audio format includes at least the wav format and/or the pcm format (i.e., when the application is deployed, the model is able to recognize audio data in this format).
Furthermore, so that the classification model (such as the FSMN) can read the input audio signal normally, the audio signal may first undergo feature extraction. For example, the one or more training voice audios may be processed by a feature extraction method that converts their waveforms into sequences of feature vectors. The feature extraction methods for signals input to the model include, but are not limited to, one or more of: fast Fourier transform, short-time Fourier transform, framing, windowing, pre-emphasis, Mel filtering and discrete cosine transform.
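For illustration, the first few of these front-end steps (pre-emphasis, framing, windowing, Fourier transform) can be sketched in a few lines of numpy; the Mel filter bank and discrete cosine transform that typically follow are omitted for brevity, and the frame/hop lengths are common defaults, not values from the patent:

```python
import numpy as np

def power_spectrum_frames(signal, sr, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasis -> framing -> Hamming window -> power spectrum.
    A minimal numpy sketch of part of the front end described above."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - flen) // hop)
    idx = np.arange(flen)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(flen)   # framing + windowing
    return np.abs(np.fft.rfft(frames, n=512)) ** 2 / 512  # power spectrum
```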
In the example of FIG. 5, the model contains 10 layers of deep FSMN, with a skip connection (shortcut) between every two layers. Setting the skip connections optimizes gradient propagation in the deep feedforward sequential memory network (DFSMN): gradients are transferred better, so the model trains more effectively. A shortcut is made after the context relationship and the stride change in the deep feedforward sequential memory network, i.e., after a × b changes between layers; when a shortcut fires, its gradient is sent both to the next shortcut and to the deep FSMN (Deep-FSMN/DFSMN) two layers away. Here a × b in each layer means that the context relationship is a and the stride is b. For example, 4 × 1 DFSMN indicates a context of 4 and a stride of 1; 8 × 1 DFSMN indicates a context of 8 and a stride of 1; 6 × 2 DFSMN indicates a context of 6 and a stride of 2; 10 × 2 DFSMN indicates a context of 10 and a stride of 2.
Age identification example 1: when 4 × 1 DFSMN changes to 8 × 1 DFSMN, the DFSMN corresponding to 8 × 1 starts a shortcut, transferring the gradient to the DFSMN corresponding to 6 × 2 and to the next shortcut. The memory blocks of the DFSMNs in different layers differ in size, growing from small to large with the layer level: the level of the layer holding the 4 × 1 DFSMN is below that of the 8 × 1 DFSMN, which in turn is below that of the 6 × 2 DFSMN, and so on.
Gender identification example 1: when 8 × 1 DFSMN changes to 6 × 2 DFSMN, the DFSMN corresponding to 6 × 2 starts a shortcut, transferring the gradient to the DFSMN corresponding to 10 × 2 and to the next shortcut. The memory blocks of the DFSMNs in different layers differ in size, growing from small to large with the layer level. Corresponding to FIG. 5, the level of the layer holding the 4 × 1 DFSMN is below that of the 8 × 1 DFSMN, which in turn is below that of the 6 × 2 DFSMN, and so on.
In the training stage, the FSMN structure uses cross entropy as its loss function; when the loss no longer decreases for 2 consecutive epochs, training is considered to have reached a usable state, and the classification model produced by the training is used as the final classification model. Besides outputting age and/or gender recognition results, the model can output the age and/or gender of one or more target objects in text form. Specifically, during training each audio sample may contain one sentence of voice together with corresponding age and/or gender labels; since elderly speakers are few, the proportions of young people, children and the elderly can be balanced during screening, and the age labels may further be grouped into several age categories: children, teenagers, adults and the elderly. During training, the cross entropy loss function measures the KL divergence between the classification result and the label, and training is considered finished when, as the experiment proceeds, the cross entropy gradually decreases until it is essentially unchanged. Two layers of time-restricted self-attention are added after the FSMN in the model structure; compared with an ordinary self-attention network, extracting only the context information of the audio within a certain time window helps improve the model's inference speed.
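A compact PyTorch sketch of this structure is given below. It is a loose reconstruction under stated assumptions, not the patented implementation: the memory block is approximated by a depthwise causal convolution whose kernel size corresponds to the context a and whose dilation corresponds to the stride b; shortcuts are added every two layers; the contexts/strides loosely follow the figure (4 × 1, 8 × 1, 6 × 2, 10 × 2); the feature dimension, hidden width, class count and the replacement of the final time-restricted self-attention layers by a plain classifier head are all invented simplifications:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBlock(nn.Module):
    """DFSMN-style memory block, approximated as a depthwise causal conv:
    `context` look-back taps spaced `stride` frames apart (an assumption)."""
    def __init__(self, dim, context, stride):
        super().__init__()
        self.pad = context * stride
        self.conv = nn.Conv1d(dim, dim, kernel_size=context + 1,
                              dilation=stride, groups=dim, bias=False)

    def forward(self, x):                            # x: (batch, time, dim)
        y = F.pad(x.transpose(1, 2), (self.pad, 0))  # causal left-padding
        return self.conv(y).transpose(1, 2)

class DFSMNLayer(nn.Module):
    def __init__(self, dim, context, stride):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.memory = MemoryBlock(dim, context, stride)

    def forward(self, x):
        h = self.hidden(x)
        return h + self.memory(h)          # memory output added to projection

class AgeGenderDFSMN(nn.Module):
    """10 DFSMN layers with a shortcut every two layers, as described above.
    feat_dim / dim / n_classes are illustrative defaults, not patent values."""
    def __init__(self, feat_dim=40, dim=256, n_classes=4):
        super().__init__()
        specs = [(4, 1)] * 3 + [(8, 1)] * 3 + [(6, 2)] * 2 + [(10, 2)] * 2
        self.input = nn.Linear(feat_dim, dim)
        self.layers = nn.ModuleList([DFSMNLayer(dim, c, s) for c, s in specs])
        self.head = nn.Linear(dim, n_classes)  # e.g. 4 age buckets or 2 genders

    def forward(self, feats):              # feats: (batch, time, feat_dim)
        x = self.input(feats)
        skip = x
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i % 2 == 1:                 # shortcut between every two layers
                x = x + skip
                skip = x
        return self.head(x.mean(dim=1))    # utterance-level class logits

# Training uses cross entropy, as stated above:
#   loss = F.cross_entropy(model(feats), labels)
```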
The trained deep neural network model capable of recognizing age and/or gender from voice resides on the server waiting to be called; it does not need to be deployed to the terminal or locally, which preserves the flexibility and extensibility of the model.
In one embodiment, determining which speech recognition model should be selected based on audio parameter information such as the audio format and the sampling rate, and even the speech (age and/or gender) recognition service mode, may include: invoking the model for speech age and/or gender recognition according to the audio format, the sampling rate, and whether the mode is an age and/or gender service. For example, once the voice age and/or gender recognition service mode is determined, the corresponding voice age and/or gender recognition deep neural network is selected; combining the sampling rate and the audio format, the model actually called might be, say, an FSMN model that supports that sampling rate and audio format.
Step S280, serialize the recognition result and send it back to the client defined by GRPC.
In one embodiment, the recognition result just produced by the model may be serialized (serialization, encoding and compression), specifically, for example, through a ProtoBuf structure; further, when sending it back to the client, the server may call the corresponding result-return logic according to the bidirectional-streaming or non-streaming mode indicated in the audio parameter field information, and send the serialized recognition result back to the GRPC-defined client through that logic.
Further, there are generally two modes of speech recognition service: real-time speech recognition over a bidirectional stream and single-utterance recognition over a non-streaming call. Correspondingly, the result-return logic called by the server returns the recognition result once for the non-streaming mode; for the bidirectional-streaming mode it returns each segment's result as the segments arrive, then returns the final recognition result after all of the voice audio data to be recognized has been transmitted.
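A sketch of the two return paths, assuming a hypothetical RecognizeReply message and decoder object (neither is specified by the patent):

    from collections import namedtuple

    # stand-in for a ProtoBuf reply message generated from the .proto file
    RecognizeReply = namedtuple("RecognizeReply", ["result", "is_final"])

    def return_results(mode, decoder, audio_chunks):
        if mode == "simple":                     # non-streaming: reply once
            text = decoder.decode(b"".join(audio_chunks))
            yield RecognizeReply(result=text, is_final=True)
        else:                                    # bidirectional streaming
            for chunk in audio_chunks:           # one reply per segment
                partial = decoder.decode_partial(chunk)
                yield RecognizeReply(result=partial, is_final=False)
            # final reply after all audio has been transmitted
            yield RecognizeReply(result=decoder.finalize(), is_final=True)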
Step S290, receiving, from the GRPC-defined server, the recognition result obtained by recognizing the voice audio data to be recognized, then deserializing the received result with the ProtoBuf structure and outputting it.
Receiving the recognition result likewise relies on the GRPC service interface code. Because the server serialized, encoded and compressed the result through the ProtoBuf structure, transmission is fast, and the client can quickly read the fields by deserializing, extract the information, and recover the recognition result corresponding to the voice audio data to be recognized. The recognition result may be output in various ways, including audio output or display output as video, images, text, and the like.
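On the client, deserialization reduces to parsing the received bytes with the generated message class. The module and field names below come from a hypothetical recognize.proto (a sketch of such a file appears later in the system description), not from the patent itself:

    import recognize_pb2                         # hypothetical generated module

    def handle_reply(raw_bytes):
        reply = recognize_pb2.RecognizeReply()
        reply.ParseFromString(raw_bytes)         # ProtoBuf deserialization
        print(reply.result)                      # e.g. display output as text
        return reply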
The implementation of the present invention will be described below with reference to the main block diagram of the architecture of an embodiment of the system for implementing a speech age and/or gender identification service of the present invention shown in fig. 2.
The client 210 supports multiple languages and, for example, the reading of various kinds of voice audio data.
The server providing speech recognition services, exemplified here by server 220, hosts various speech recognition service models to accommodate remote service invocation by the client 210.
In one embodiment, the system implementing the voice age and/or gender identification service adopts a client/server (C/S) architecture.
Specifically, the client/terminal 210 is a GRPC-defined client that implements the Kaldi voice recognition service (voice age and/or gender recognition service) based on GRPC and supports multiple languages; the server 220 is likewise a GRPC-defined server implementing the same service. The predefinition amounts to establishing, under a GRPC/ProtoBuf architecture, the remote calls between client and server on which the speech recognition engine is built; more concretely, it establishes the GRPC service interface code through which client and server invoke each other remotely, and that interface code is built around the speech recognition service. The establishment is a predefined process that specifically includes: according to the ProtoBuf structure of GRPC, predefining the parameter field information related to the voice recognition service mode, the audio format, the voice audio data to be recognized, the sampling rate and the audio length, so as to obtain a well-defined ProtoBuf protocol for GRPC; then compiling, according to this ProtoBuf protocol, the GRPC service interface code through which client and server carry out the GRPC voice recognition service. The GRPC service interface code can program the respective function/service modules of the client and server; the client 210 and server 220 holding this interface code are, respectively, the GRPC-defined client and the GRPC-defined server.
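As an illustration of this predefinition step, the sketch below declares the interaction fields in a .proto file from Python and compiles it with grpc_tools; the service name and field names are assumptions, not the patent's actual protocol:

    from grpc_tools import protoc

    PROTO = '''
    syntax = "proto3";
    service AgeGenderRecognizer {
      rpc Recognize (stream RecognizeRequest) returns (stream RecognizeReply);
    }
    message RecognizeRequest {
      string service_mode = 1;   // age/gender; streaming or simple
      string audio_format = 2;   // wav, pcm, mp3 ...
      int32  sample_rate  = 3;
      int64  audio_length = 4;
      bytes  audio_data   = 5;   // speech audio to be recognized
    }
    message RecognizeReply {
      string result   = 1;       // age and/or gender as text
      bool   is_final = 2;
    }
    '''

    with open("recognize.proto", "w") as f:
        f.write(PROTO)

    # generates recognize_pb2.py and recognize_pb2_grpc.py (the GRPC
    # service interface code for client and server)
    protoc.main(["protoc", "-I.", "--python_out=.",
                 "--grpc_python_out=.", "recognize.proto"])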
In one embodiment, the client 210 includes at least:
the GRPC client module 2101 is configured to read the voice audio data to be recognized based on the predefined GRPC service interface code; once the data is read, it invokes the voice age and/or gender recognition service mode selected by the GRPC mode selection module 2102 and sends the voice recognition request formed by the ProtoBuf serialization module 2103 to the server 220, thereby remotely calling the Kaldi voice recognition service model corresponding to the request. The voice recognition request here is a request to recognize age and/or gender.
A GRPC mode selection module 2102, configured to select the corresponding voice recognition service mode when the GRPC client module reads voice audio data to be recognized. Different voice audio data carry different recognition requirements, so a matching voice recognition service mode can be selected for each.
The speech recognition service modes are, for example: bidirectional streaming (GRPC Stream) for real-time speech recognition, and non-streaming (GRPC Simple) for single-utterance recognition. Accordingly, when recognizing and/or returning results, the server 220 may consult this mode to select a model and result-return logic that match the voice service mode.
The ProtoBuf serialization module 2103 is configured to serialize, encode and compress the voice audio data to be recognized together with its corresponding audio parameter field information using a ProtoBuf structure, forming a voice recognition request; it is further configured to deserialize and output received recognition results using the same ProtoBuf structure.
The GRPC client module 2101 is further configured to receive, from the GRPC-defined server 220, the recognition result obtained by recognizing the voice audio data to be recognized. When the GRPC-defined server obtains a recognition result, it first serializes (and, where useful, compresses) the result through the ProtoBuf structure, then calls the corresponding result-return logic according to the voice recognition service mode (GRPC Simple or GRPC Stream) to return it. When the client 210 receives the returned serialized result through the GRPC client module 2101, it calls the ProtoBuf serialization module 2103 to deserialize it through the ProtoBuf structure and then outputs the recognition result, for example as display output or audio output.
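A usage sketch of the client side under the same hypothetical recognize.proto: open a channel to the GRPC-defined server and drive the bidirectional stream with a request iterator. The host, port and field values are placeholders:

    import grpc
    import recognize_pb2
    import recognize_pb2_grpc

    def recognize_stream(audio_chunks, host="localhost:50051"):
        with grpc.insecure_channel(host) as channel:
            stub = recognize_pb2_grpc.AgeGenderRecognizerStub(channel)
            requests = (recognize_pb2.RecognizeRequest(
                            service_mode="age_stream",
                            audio_format="pcm",
                            sample_rate=16000,
                            audio_data=chunk)
                        for chunk in audio_chunks)      # streamed requests
            for reply in stub.Recognize(requests):      # streamed replies
                print(reply.result, reply.is_final)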
In one embodiment, the server 220 includes at least:
the receiving module 2201 is configured to receive the voice recognition request sent by the GRPC-defined client; the request contains the serialized voice audio data to be recognized and the audio parameter field information. Concretely, the request received by the receiving module 2201 is formed as follows: the GRPC-defined client 210 uses the ProtoBuf serialization module 2103 to serialize, encode and compress the read voice audio data and its corresponding audio parameter field information through the ProtoBuf structure, then invokes the pre-selected voice recognition service mode to send the remote request to the server.
A serialization module 2202, configured to deserialize the voice recognition request based on the predefined GRPC service interface code, yielding the voice audio data to be recognized and the audio parameter field information, and to serialize the recognition result. It further includes a ProtoBuf deserializing unit 22021 that deserializes the voice recognition request through the ProtoBuf structure. The audio parameter field information at least includes the audio format, the sampling rate and the speech recognition service mode; this mode was selected by the GRPC mode selection module 2102 when the GRPC client module 2101 of the GRPC-defined client 210 read the voice audio data to be recognized. Other interaction fields (parameter field information) such as the sampling rate, the audio format and the audio length are likewise recorded for the audio being read. The module further includes a ProtoBuf serialization unit 22022 that serializes, encodes and, where useful, compresses the recognition result obtained once the selected Kaldi speech recognition model has finished recognizing the voice audio data, forming the serialized recognition-result data. The ProtoBuf structure can thus both serialize and deserialize, both encode and decode data.
The audio data parsing module 2203 is configured to parse the voice audio data to be recognized according to the audio parameter field information, so as to uniformly convert it into audio data in the PCM data format. It further includes a parsing-and-conversion unit 22031 that parses the corresponding voice audio data according to the audio format for this conversion. The conversion uses the audio format from the interaction fields, possibly combined with the sampling rate in those fields, to parse and decode the voice audio data and convert it into the PCM data format. In other words, voice audio data in any supported format can be parsed into PCM, which accommodates speech recognition across languages.
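One way to realize this normalization is to shell out to ffmpeg, which already parses the common formats listed elsewhere in this description; the 16 kHz mono target below is an assumption, since the real target would come from the interaction fields:

    import subprocess

    def to_pcm(in_path, out_path="audio.pcm", rate=16000):
        subprocess.run(
            ["ffmpeg", "-y", "-i", in_path,
             "-f", "s16le",                 # raw 16-bit little-endian PCM
             "-acodec", "pcm_s16le",
             "-ac", "1",                    # mono
             "-ar", str(rate),              # target sampling rate
             out_path],
            check=True)
        with open(out_path, "rb") as f:
            return f.read()                 # PCM bytes ready for decoding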
The speech recognition core algorithm module 2204 is configured to select the corresponding Kaldi speech recognition service model according to the audio parameter field information, decode the audio data converted into the PCM data format, and obtain the recognition result. It further includes a recognition unit 22041 that selects the corresponding Kaldi speech recognition model according to the audio format and sampling rate extracted from the interaction fields (and, where present, the speech recognition service mode), then performs speech recognition decoding on the PCM audio with the selected model and obtains the recognition result.
The server performs voice age recognition or voice gender recognition with a deep neural network model; the speech age recognition algorithm may be a TDNN, TDNNF or FSMN model. Taking the feedforward sequential memory network (FSMN) as an example: the network comprises multiple layers of deep FSMNs, with a skip connection between every two layers, taken after the context and stride within the deep FSMN change. The memory blocks of the deep FSMNs in different layers differ in size, growing from small to large with layer depth. After the context and stride change, the gradient of the current skip connection is passed both to the next skip connection and to the deep FSMN two layers away. Training stage: the one or more voice audios used for training are uniformly converted in format using down-sampling and/or up-sampling, for example sampling at a given rate. The sampling rate, also called sampling speed or sampling frequency, defines how many samples per second are extracted from a continuous signal to form a discrete signal, expressed in hertz (Hz); its inverse is the sampling period or sampling time, the interval between samples. Colloquially, the sampling frequency is how many signal samples a computer takes per second. Common audio formats include, but are not limited to: wav, pcm, mp3, ape, wma and the like.
The conversion into a unified format, e.g. a converted audio format, then includes at least the wav and/or pcm format (i.e., when the application is deployed, the model is able to recognize audio data in that format).
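A sketch of the down-/up-sampling itself, using a polyphase resampler; the 16 kHz target here is again only an example:

    import numpy as np
    from scipy.signal import resample_poly

    def unify_rate(waveform, orig_rate, target_rate=16000):
        # e.g. 48000 -> 16000 down-samples (up=1, down=3);
        #       8000 -> 16000 up-samples   (up=2, down=1)
        g = np.gcd(orig_rate, target_rate)
        return resample_poly(waveform, target_rate // g, orig_rate // g)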
Furthermore, so that the classification model (such as the FSMN) can properly read the input audio signal, the signal can first undergo feature extraction. For example, the one or more training voice audios can be processed by a feature extraction method that converts their waveforms into feature-vector sequences. Methods for extracting features from the signal input to the model include, but are not limited to, one or more of: fast Fourier transform, short-time Fourier transform, framing, windowing, pre-emphasis, Mel filtering and discrete cosine transform, as sketched below.
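A minimal front-end sketch combining several of these operations (pre-emphasis, framing, windowing, FFT); the frame lengths are typical values, not values mandated by the text:

    import numpy as np

    def extract_features(waveform, rate=16000,
                         frame_ms=25, hop_ms=10, pre_emph=0.97):
        # pre-emphasis
        waveform = np.append(waveform[0],
                             waveform[1:] - pre_emph * waveform[:-1])
        flen = int(rate * frame_ms / 1000)       # samples per frame
        hop = int(rate * hop_ms / 1000)
        win = np.hamming(flen)                   # windowing
        feats = []
        for start in range(0, len(waveform) - flen + 1, hop):
            frame = waveform[start:start + flen] * win
            spectrum = np.abs(np.fft.rfft(frame))        # FFT magnitude
            feats.append(np.log(spectrum + 1e-8))        # log spectrum
        return np.stack(feats)                   # (num_frames, feature_dim)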
In the example of fig. 5, the model contains 10 layers of deep FSMN, with a skip connection between every two layers. Setting these skip connections (shortcuts) optimizes gradient propagation through the deep forward sequential memory network (DFSMN), letting gradients flow better and making training more effective. A shortcut is taken after the context and stride within the deep FSMN change, i.e., whenever the a×b configuration changes between layers; when a shortcut is taken, its gradient is sent both to the next shortcut and to the Deep-FSMN (DFSMN) two layers away. Here a and b in each layer denote a context of a and a stride of b. For example, 4 × 1 DFSMN indicates a context of 4 and a stride of 1; 8 × 1 DFSMN a context of 8 and a stride of 1; 6 × 2 DFSMN a context of 6 and a stride of 2; 10 × 2 DFSMN a context of 10 and a stride of 2. An illustrative sketch follows.
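The following PyTorch sketch loosely mirrors that structure: stacked memory layers parameterized by (context a, stride b), with a shortcut feeding each block's input two layers ahead. It illustrates the skip-connection pattern only, under simplifying assumptions, and is not the patent's actual DFSMN implementation:

    import torch
    import torch.nn as nn

    class MemoryLayer(nn.Module):
        def __init__(self, dim, context, stride):
            super().__init__()
            # depthwise temporal convolution as a stand-in memory block
            self.mem = nn.Conv1d(dim, dim, kernel_size=2 * context + 1,
                                 dilation=stride, padding=context * stride,
                                 groups=dim)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                    # x: (batch, time, dim)
            m = self.mem(x.transpose(1, 2)).transpose(1, 2)
            return torch.relu(self.proj(x + m))

    class DFSMNStack(nn.Module):
        def __init__(self, dim, configs):
            # configs, e.g. [(4, 1), (8, 1), (6, 2), (10, 2)]: (context, stride)
            super().__init__()
            self.layers = nn.ModuleList(
                MemoryLayer(dim, c, s) for c, s in configs)

        def forward(self, x):
            prev = None
            for i, layer in enumerate(self.layers):
                out = layer(x)
                if i % 2 == 0 and prev is not None:
                    out = out + prev             # shortcut across two layers
                prev, x = x, out
            return x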
A returning module 2205, configured to send the recognition result back to the GRPC-defined client after the serialization module has serialized it. It further includes a return logic unit 22051 that calls the corresponding result-return logic according to the speech recognition service mode, so as to send the serialization-encoded and compressed recognition result back to the GRPC-defined client. The speech recognition service mode is the mode carried in the earlier interaction fields when the client 210 sent the voice recognition request. There are mainly two modes: bidirectional streaming for real-time speech recognition and non-streaming for single-utterance recognition. The result-return logic: for the non-streaming mode, return the recognition result once; for the bidirectional-streaming mode, return each segment's result as the segments arrive, then return the final result after all the voice audio data to be recognized has been transmitted.
The implementation of speech recognition by remote invocation according to the present invention is further described below in conjunction with the interaction process at application time shown in fig. 3.
A speech recognition service engine is established based on Protobuf and Grpc remote procedure calls. The GRPC-defined client and server are defined first; remote transmission between them serializes and deserializes the voice audio data through ProtoBuf, greatly reducing the network transmission cost of large audio files and improving the transfer rate. The GRPC remote procedure call framework adopted by the client and server is based on HTTP/2.0; combined with multithreading, it handles efficient one-way and bidirectional stream transmission and service response, so it can serve speech recognition in both real-time and non-real-time modes.
During the definition process, fields are defined through the ProtoBuf structure and the proto file class is compiled. Under this protocol class, the client can serialize the read voice audio data through the protocol's ProtoBuf structure and transmit it quickly to the server, then receive the recognition result (also serialized through the ProtoBuf structure) from the server, deserialize it, and output it to the user.
On the server side, using the ProtoBuf serialization/deserialization functions, the serialized voice recognition service request containing the voice audio data to be recognized is received; deserialization yields the original voice audio plus the interaction fields, i.e., the various parameters needed to parse the audio. The original audio is parsed and decoded according to its audio format and, where needed, its sampling rate, targeting the PCM data format, so the audio is converted into PCM; the parsed PCM audio can then be recognized by the selected speech recognition model. That is, the speech core algorithm is called with the Kaldi decoder decoupled and modularized. After the recognition result is serialized through the ProtoBuf serialization/deserialization functions, the result-return logic is controlled according to the speech recognition service mode requested when the client made the remote call, bidirectional streaming or non-streaming, and the serialized result is returned.
In this way server resources are fully mobilized and resource utilization greatly improves: the caller only needs to invoke the speech recognition service functions (including the block-managed recognition models) of the corresponding server, whether multiple servers, a cloud deployment or a single machine, which addresses insufficient concurrency support. With the Kaldi decoder decoupled and modularized, and combined with a GPU, the speech recognition core algorithm and multiple models are managed effectively; thanks to the engine architecture that separates the lightweight Grpc framework from the speech recognition core algorithm, the speech algorithm can be optimized and updated quickly and conveniently, unaffected by the client, language changes, audio formats and the like.
In addition, through the Protobuf protocol service and field definitions, the whole framework can flexibly select speech recognition models for different audio formats and different sampling rates, and the functions of the current speech algorithm engine can be freely extended.
As the foregoing embodiments show, the present invention is mainly based on GRPC remote invocation and uses ProtoBuf to define the client and server interfaces, realizing a remote speech recognition invocation service. Previously, speech recognition service engines in industry mostly provided service through Websocket or TensorFlow Serving, which is inflexible to use, makes bidirectional-stream real-time recognition hard to realize, consumes more resources, and is often inconvenient to deploy on low-resource servers. With the scheme of the invention, a speech recognition service engine based on Protobuf and Grpc remote procedure calls can be deployed effectively and flexibly; it offers strong extensibility, high resource utilization, strong concurrency, fast transmission and good recognition quality, and updating, iterating and optimizing the core algorithm remains convenient.
The invention has the following specific advantages:
(1) The GRPC-based lightweight speech recognition system solves the difficulty of industrially deploying large, frequently updated Kaldi models for recognizing the age and/or gender of various target objects: the algorithm model needs neither ONNX conversion nor a fixed TF Serving inference framework, gains better flexibility and extensibility, and greatly improves server resource utilization and the concurrency of the speech recognition service.
(2) Grpc and Protobuf serialize and compress the voice audio file, reducing the size of the transmitted data, improving transmission efficiency and lowering the transmission latency of large files.
(3) The voice age and/or gender recognition model of the speech recognition core algorithm is developed modularly. Combined with the age and/or gender service mode laid down by the Protobuf protocol, it can support multiple audio formats at once, satisfy models at different sampling rates, cross-match more formats and rates to determine the age and/or gender of a target object, conveniently and dynamically extend various speech-related functions module by module, and update models flexibly without affecting the terminal or the calling convention.
(4) Clients in different languages (c/c++, python, go, java, php, oc, etc.) are simple to build as services against the defined proto protocol; each language generates its own client program, and the speech recognition service can be used conveniently and quickly.
Further, those skilled in the art will appreciate that all or part of the flow of the methods of the above embodiments can also be implemented by instructing the relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium holding a plurality of program codes, the program codes being suitable for being loaded and executed by a processor to perform the steps of the above methods for implementing the speech age and/or gender identification service. For convenience of illustration, only the parts related to the embodiments of the present invention are shown, and other details are not disclosed. The storage device may be formed of various electronic devices; optionally, in the embodiment of the present invention it is a non-transitory computer-readable storage medium. The computer program comprises computer program code, which may be in source-code, object-code or executable form, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, such as a USB flash drive, removable hard disk, magnetic diskette, optical disk, computer memory, read-only memory, random access memory, electrical carrier signal, telecommunication signal or software distribution medium. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions the computer-readable medium excludes electrical carrier signals and telecommunication signals.
Further, the present invention also provides a processing apparatus comprising a processor and a memory, the memory being configurable to store a plurality of program codes adapted to be loaded and executed by the processor to perform the steps of the aforementioned respective method of implementing a speech age and/or gender identification service. Specifically, the hardware configuration is as shown in fig. 6 and 7.
The apparatus may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103 and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software-programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like. In this embodiment, the processor of the apparatus includes a function for executing each module of the speech recognition apparatus in each device, and specific functions and technical effects may be obtained by referring to the above embodiments, which are not described herein again.
Fig. 7 is a schematic hardware structure diagram of an apparatus according to another embodiment of the present application. FIG. 7 is a specific embodiment of the implementation of FIG. 6. As shown, the apparatus of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment. The second memory 1202 is configured to store various types of data to support operations at the device. Examples of such data include instructions for any application or method operating on the device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory.
Optionally, the second processor 1201 is provided in the processing assembly 1200. The apparatus may further comprise: communication component 1203, power component 1204, multimedia component 1205, speech component 1206, input/output interfaces 1207, and/or sensor component 1208. The components included in the apparatus are set according to actual requirements, which is not limited in this embodiment.
The processing assembly 1200 generally controls the overall operation of the device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method illustrated in fig. 1 described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200. The power supply assembly 1204 provides power to the various components of the device. The power supply component 1204 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device. The multimedia components 1205 include a display screen that provides an output interface between the device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the apparatus is in an operating mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the voice component 1206 further includes a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor assembly 1208 includes one or more sensors for providing status assessment of various aspects of the device. For example, the sensor component 1208 may detect an open/close state of the device, a relative positioning of the components, a presence or absence of user contact with the device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the apparatus and other devices in a wired or wireless manner. The device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the device may include a SIM card slot therein for insertion of a SIM card so that the device may log onto a GPRS network to establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 in the embodiment of fig. 7 may be implemented as input devices in the embodiment of fig. 6.
Furthermore, the invention also provides a system for realizing the voice age and/or gender identification service, which comprises the client/terminal and the server for realizing the voice age and/or gender identification service.
It should be noted that, although the foregoing embodiments describe each step in a specific sequential order, those skilled in the art may understand that, in order to achieve the effect of the present invention, different steps do not necessarily need to be executed in such an order, and they may be executed simultaneously (in parallel) or executed in other orders, and these changes are all within the scope of the present invention.
Further, it should be understood that, since the modules are only configured to illustrate the functional units of the system of the present invention, the corresponding physical devices of the modules may be the processor itself, or a part of software, a part of hardware, or a part of a combination of software and hardware in the processor. Thus, the number of individual modules in the figures is merely illustrative.
Those skilled in the art will appreciate that the various modules in the system may be adaptively split or merged. Such splitting or merging of specific modules does not cause the technical solutions to deviate from the principle of the present invention, and therefore, the technical solutions after splitting or merging will fall within the protection scope of the present invention.
So far, the technical solution of the present invention has been described with reference to one embodiment shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (16)

1. A method for implementing a voice age and/or gender identification service, comprising:
receiving a serialized voice recognition request sent by a GRPC-defined client, wherein the voice recognition request comprises a request for age and/or gender through voice recognition;
performing deserialization operation on the voice recognition request and analyzing audio data and parameter field information in the voice recognition request;
selecting a corresponding voice age and/or gender recognition deep neural network model according to the parameter field information, and decoding the audio data and the context audio information thereof through the voice age and/or gender recognition deep neural network model to obtain a corresponding age and/or gender recognition result;
and returning the identification result to the client.
2. The method of claim 1,
the specific process defined by the GRPC comprises the following steps:
according to the structure of the ProtoBuf of the GRPC, parameter field information related to a voice age and/or gender identification service mode, an audio format, voice audio data to be identified, a sampling rate and an audio length is predefined so as to obtain a well-defined ProtoBuf protocol of the GRPC;
compiling and generating GRPC service interface codes of a client and a server for carrying out GRPC voice age and/or gender identification service according to the ProtoBuf protocol so as to carry out remote calling between the client and the server;
the client with the GRPC service interface code is a client defined by GRPC;
the service end with the GRPC service interface code is a service end defined by GRPC.
3. The method of claim 2,
the serialized voice recognition request is a remote request sent in a voice age and/or gender service mode pre-selected by the GRPC-defined client;
the remote request comprises: serializing voice audio data to be recognized and audio parameter field information by utilizing a ProtoBuf structure;
the "deserializing the voice recognition request and analyzing the audio data and the parameter field information in the voice recognition request" specifically includes:
performing deserialization operation on the voice recognition request through a ProtoBuf structure to obtain the voice audio data to be recognized and the audio parameter field information; wherein the audio parameter field information at least comprises: audio format, sampling rate and field information of voice age and/or gender recognition service mode;
analyzing the corresponding voice audio data to be recognized based on the audio format so as to uniformly convert the voice audio data to be recognized into audio data in a PCM data format;
the "selecting a corresponding speech age and/or gender recognition deep neural network model according to the parameter field information" specifically includes:
determining the type of the voice recognition model needing to be selected as a voice age and/or gender recognition deep neural network model according to the voice age and/or gender recognition service mode;
calling a voice age and/or gender recognition deep neural network model which correspondingly supports recognition of the audio format and the sampling rate according to the audio format and the sampling rate;
the "obtaining the recognition result of the corresponding age and/or gender by decoding the audio data through the speech age and/or gender recognition deep neural network model" specifically includes:
decoding the audio data converted into the PCM data format and the context audio information thereof by utilizing a speech age and/or gender identification deep neural network model correspondingly supporting the identification of the audio format and the sampling rate to obtain a corresponding age and/or gender identification result;
the step of returning the recognition result to the client specifically includes:
and serializing the identification result and sending the serialized identification result back to the GRPC-defined client.
4. The method as claimed in claim 3, wherein serializing the recognition result and sending it back to the GRPC-defined client comprises:
carrying out serialization coding compression on the recognition result through a ProtoBuf structure; and the number of the first and second groups,
calling corresponding result returning logic according to the voice age and/or gender identification service mode to send the identification result after the serialization coding compression back to the client defined by the GRPC;
wherein the voice age and/or gender recognition service patterns include: bidirectional streaming for real-time speech age and/or gender identification and non-streaming for one-sentence speech age and/or gender identification;
wherein the result return logic comprises: returning an identification result for the non-streaming type as one-time, returning an identification result of each segment for the bidirectional streaming type as a segment, and returning a final identification result after all voice audio data to be identified are transmitted;
wherein the returned final recognition result comprises information of corresponding age and/or gender.
5. The method of any one of claims 1 to 4,
the voice age and/or gender recognition deep neural network model adopts the following steps: a forward sequence memory neural network FSMN, a time delay neural network TDNN or a factor time delay neural network TDNNF model based on a Kaldi mixed architecture;
when the FSMN is adopted, a 10-layer deep FSMN whose last two layers are time-restricted self-attention networks is used, a layer-skip connection (shortcut) is adopted between every two layers, and cross entropy is adopted as the loss function during training.
6. A method for implementing a voice age and/or gender identification service, comprising:
reading voice audio data to be recognized, and selecting a corresponding voice age and/or gender recognition service mode;
carrying out serialization coding compression on the voice audio data to be recognized and audio parameter field information corresponding to the voice audio data by utilizing a ProtoBuf structure so as to form a voice recognition request; and the number of the first and second groups,
calling the voice age and/or gender recognition service mode, sending the voice recognition request to a service terminal defined by GRPC, and remotely calling a voice age and/or gender recognition deep neural network model corresponding to the voice recognition request to perform voice recognition;
wherein the voice age and/or gender recognition service patterns include: bidirectional streaming for real-time speech age and/or gender identification and non-streaming for one-sentence speech age and/or gender identification;
wherein the voice recognition request is a request for age and/or gender through voice recognition.
7. The method of claim 6,
the GRPC-defined process specifically includes:
according to the structure of the ProtoBuf of the GRPC, parameter field information related to a voice age and/or gender identification service mode, an audio format, voice audio data to be identified, a sampling rate and an audio length is predefined so as to obtain a well-defined ProtoBuf protocol of the GRPC;
compiling and generating GRPC service interface codes of a client and a server for carrying out GRPC voice age and/or gender identification service according to the ProtoBuf protocol so as to carry out remote calling between the client and the server;
the client with the GRPC service interface code is a client defined by GRPC;
the service end with the GRPC service interface code is a service end defined by GRPC.
8. The method of claim 7, further comprising:
receiving a corresponding identification result of age and/or gender obtained by identifying the voice audio data to be identified through the GRPC-defined service end;
the service end defined by the GRPC utilizes the ProtoBuf structure to carry out serialized coding compression on the recognition result and calls corresponding result returning logic to return the recognition result according to the voice age and/or gender recognition service mode;
deserializing and outputting the received recognition result by utilizing the ProtoBuf structure;
the step of remotely calling the voice age and/or gender recognition deep neural network model of the voice recognition request for voice recognition comprises the steps of determining the type of the voice recognition model needing to be selected according to the voice age and/or gender recognition service mode, calling the voice age and/or gender recognition deep neural network model which correspondingly supports recognition of the audio format and the sampling rate in the type according to the audio format and the sampling rate, and extracting context information of audio data within a preset time to perform voice recognition so as to obtain a corresponding age and/or voice recognition result;
and the audio data is audio data which is obtained by uniformly converting the voice audio data into a PCM data format based on the audio format in the process of analyzing the voice recognition request.
9. A server for implementing a voice age and/or gender identification service, comprising:
the receiving module is used for receiving a serialized voice recognition request sent by a GRPC-defined client, wherein the voice recognition request comprises age and/or gender through voice recognition;
the serialization module is used for performing deserialization operation on the voice recognition request and performing serialization operation on a recognition result;
the audio analysis module is used for analyzing the data and the parameter field information in the voice recognition request;
the voice recognition core algorithm module is used for selecting a corresponding voice age and/or gender recognition deep neural network model according to the parameter field information, and decoding the audio data and the context audio information thereof through the voice age and/or gender recognition deep neural network model to obtain a corresponding age and/or gender recognition result;
and the return module is used for returning the identification result to the client.
10. The server according to claim 9,
the GRPC-defined process specifically includes:
according to the structure of the ProtoBuf of the GRPC, parameter field information related to a voice age and/or gender identification service mode, an audio format, voice audio data to be identified, a sampling rate and an audio length is predefined so as to obtain a well-defined ProtoBuf protocol of the GRPC;
compiling and generating GRPC service interface codes of a client and a server for carrying out GRPC voice age and/or gender identification service according to the ProtoBuf protocol so as to carry out remote calling between the client and the server;
the client with the GRPC service interface code is a client defined by GRPC;
wherein, the server with the GRPC service interface code is a server defined by GRPC.
11. The server according to claim 10,
the serialized voice recognition request is a remote request sent in a voice age and/or gender service mode pre-selected by the GRPC-defined client;
the remote request comprises: serializing voice audio data to be recognized and audio parameter field information by utilizing a ProtoBuf structure;
the serialization module specifically comprises:
the ProtoBuf deserialization unit is used for deserializing the voice recognition request through a ProtoBuf structure to obtain the voice audio data to be recognized and the audio parameter field information; wherein the audio parameter field information at least comprises: audio format, sampling rate and field information of voice age and/or gender recognition service mode; and the number of the first and second groups,
the ProtoBuf serialization unit is used for carrying out serialization coding compression on the recognition result through a ProtoBuf structure;
the audio analysis module specifically executes the following operations:
analyzing the corresponding voice audio data to be recognized based on the audio format so as to uniformly convert the voice audio data to be recognized into audio data in a PCM data format;
the speech recognition core algorithm module specifically executes the following operations:
determining the type of the voice recognition model needing to be selected as a voice age and/or gender recognition deep neural network model according to the voice age and/or gender recognition service mode;
calling a voice age and/or gender recognition deep neural network model which correspondingly supports recognition of the audio format and the sampling rate according to the audio format and the sampling rate;
decoding the audio data converted into the PCM data format and the context audio information thereof by utilizing a speech age and/or gender identification deep neural network model correspondingly supporting the identification of the audio format and the sampling rate to obtain a corresponding identification result of age and/or gender;
the return module specifically includes:
the return logic unit is used for calling corresponding result return logic according to the voice age and/or gender identification service mode so as to send the identification result after the serialization coding compression back to the client defined by the GRPC;
wherein the voice age and/or gender recognition service patterns include: bidirectional streaming for real-time speech age and/or gender identification and non-streaming for one-sentence speech age and/or gender identification;
wherein the result return logic comprises: returning an identification result once for the non-streaming type, returning an identification result of each segment for the bidirectional streaming type as a segment, and returning a final identification result after all voice audio data to be identified are transmitted;
wherein the returned final recognition result comprises information of corresponding age and/or gender.
12. The server according to any one of claims 9 to 11,
the voice age and/or gender recognition deep neural network model adopts the following steps: a forward sequence memory neural network FSMN, a time delay neural network TDNN or a factor time delay neural network TDNNF model based on a Kaldi mixed architecture;
when the FSMN is adopted, a 10-layer deep FSMN whose last two layers are time-restricted self-attention networks is used, a layer-skip connection (shortcut) is adopted between every two layers, and cross entropy is adopted as the loss function during training.
13. A terminal for implementing a voice age and/or gender identification service, comprising:
the GRPC client module is used for reading voice audio data to be recognized;
the GRPC mode selection module is used for selecting a corresponding voice age and/or gender identification service mode when the GRPC client module reads voice audio data to be identified;
the ProtoBuf serialization module is used for carrying out serialization coding compression on the voice audio data to be recognized and the audio parameter field information corresponding to the voice audio data by utilizing a ProtoBuf structure so as to form a voice recognition request, wherein the voice recognition request comprises age and/or gender through voice recognition;
the GRPC client module is further configured to: calling the voice age and/or gender recognition service mode, sending the voice recognition request to a service terminal defined by GRPC, and remotely calling a voice age and/or gender recognition deep neural network model corresponding to the voice recognition request to perform voice recognition;
wherein the voice age and/or gender recognition service patterns include: bidirectional streaming for real-time speech age and/or gender identification and non-streaming for one-sentence speech age and/or gender identification;
wherein the voice recognition request is a request for age and/or gender through voice recognition.
14. The terminal of claim 13,
the GRPC-defined process specifically includes:
according to the structure of the ProtoBuf of the GRPC, parameter field information related to a voice age and/or gender identification service mode, an audio format, voice audio data to be identified, a sampling rate and an audio length is predefined so as to obtain a well-defined ProtoBuf protocol of the GRPC;
compiling and generating GRPC service interface codes of a client and a server for carrying out GRPC voice age and/or gender identification service according to the ProtoBuf protocol so as to carry out remote calling between the client and the server;
the client with the GRPC service interface code is a client defined by GRPC;
wherein, the server with the GRPC service interface code is a server defined by GRPC.
15. The terminal of claim 14, wherein
the GRPC client module is further configured to: receive, from the GRPC-defined server, the recognition result of the corresponding age and/or gender obtained by recognizing the voice audio data to be recognized;
wherein the GRPC-defined server serializes, encodes and compresses the recognition result using the ProtoBuf structure, and calls the corresponding result-returning logic to return the recognition result according to the voice age and/or gender recognition service mode;
the ProtoBuf serialization module is further configured to: deserialize and output the received recognition result using the ProtoBuf structure;
wherein the server determines, through the voice age and/or gender recognition service mode, the type of voice recognition model to be selected, calls, according to the audio format and the sampling rate, the voice age and/or gender recognition deep neural network model of that type which supports the audio format and the sampling rate, and performs voice recognition by extracting the context information of the audio data within a preset time period, so as to obtain the corresponding age and/or gender recognition result;
and wherein the audio data is obtained by uniformly converting the voice audio data into the PCM data format, based on the audio format, in the course of parsing the voice recognition request.
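The server-side dispatch described in this claim could look like the sketch below: a registry keyed by service mode, audio format and sampling rate selects the model, after the payload is normalized to raw PCM. The WAV-to-PCM step uses the standard-library wave module and is only one plausible way to perform the uniform conversion; all names in the sketch are illustrative assumptions.

```python
# Hypothetical server-side dispatch: pick a model by service mode, audio
# format and sampling rate, after normalizing the payload to raw PCM.
import io
import wave

# Illustrative registry; the string values stand in for loaded model objects.
MODEL_REGISTRY = {
    ("streaming", "pcm", 16000): "fsmn_streaming_16k",
    ("non_streaming", "pcm", 16000): "tdnnf_offline_16k",
    ("non_streaming", "pcm", 8000): "tdnnf_offline_8k",
}


def to_pcm(audio: bytes, audio_format: str) -> bytes:
    """Uniformly convert the request payload to raw PCM samples."""
    if audio_format == "pcm":
        return audio
    if audio_format == "wav":
        with wave.open(io.BytesIO(audio), "rb") as w:
            return w.readframes(w.getnframes())
    raise ValueError(f"unsupported audio format: {audio_format}")


def dispatch(mode: str, audio_format: str, sample_rate: int, audio: bytes):
    pcm = to_pcm(audio, audio_format)
    model = MODEL_REGISTRY.get((mode, "pcm", sample_rate))
    if model is None:
        raise LookupError("no model supports this mode/format/sample rate")
    # A real server would now extract features over a preset time window of
    # `pcm` and run `model` to produce the age and/or gender result.
    return model, len(pcm)
```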
16. A computer readable storage medium storing a plurality of program codes, wherein the program codes are adapted to be loaded and executed by a processor to perform the method for implementing a voice age and/or gender identification service according to any one of claims 1 to 8;
or,
a processing device comprising a processor and a storage device, the storage device being adapted to store a plurality of program codes, wherein the program codes are adapted to be loaded and run by the processor to perform the method for implementing a voice age and/or gender identification service according to any one of claims 1 to 8;
or,
a system for implementing a voice age and/or gender identification service, comprising the server for implementing a voice age and/or gender identification service according to any one of claims 9 to 12, and the terminal for implementing a voice age and/or gender identification service according to any one of claims 13 to 15.
CN202011591501.3A 2020-12-29 2020-12-29 Method, system and medium for implementing voice age and/or gender identification service Active CN113192510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011591501.3A CN113192510B (en) 2020-12-29 2020-12-29 Method, system and medium for implementing voice age and/or gender identification service

Publications (2)

Publication Number Publication Date
CN113192510A 2021-07-30
CN113192510B CN113192510B (en) 2024-04-30

Family

ID=76972924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011591501.3A Active CN113192510B (en) Method, system and medium for implementing voice age and/or gender identification service

Country Status (1)

Country Link
CN (1) CN113192510B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113724711A (en) * 2021-09-01 2021-11-30 云从科技集团股份有限公司 Method, device, system, medium and equipment for realizing voice recognition service
CN114360148A (en) * 2021-12-06 2022-04-15 深圳市亚略特科技股份有限公司 Automatic selling method and device, electronic equipment and storage medium
CN115223572A (en) * 2022-07-14 2022-10-21 深圳壹账通智能科技有限公司 Entity identification method, device, equipment and storage medium
WO2025140742A1 (en) * 2023-12-28 2025-07-03 天翼物联科技有限公司 Module communication method and apparatus, and electronic device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003330485A (en) * 2002-05-10 2003-11-19 Tokai Rika Co Ltd Voice recognition device, voice recognition system, and method for voice recognition
US20080086311A1 (en) * 2006-04-11 2008-04-10 Conwell William Y Speech Recognition, and Related Systems
US20160155437A1 (en) * 2014-12-02 2016-06-02 Google Inc. Behavior adjustment using speech recognition system
CN107992485A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of simultaneous interpretation method and device
CN108281138A (en) * 2017-12-18 2018-07-13 百度在线网络技术(北京)有限公司 Age discrimination model training and intelligent sound exchange method, equipment and storage medium
CN110619889A (en) * 2019-09-19 2019-12-27 Oppo广东移动通信有限公司 Sign data identification method and device, electronic equipment and storage medium
CN111683181A (en) * 2020-04-27 2020-09-18 平安科技(深圳)有限公司 Voice-based user gender and age identification method and device and computer equipment
US20210358502A1 (en) * 2019-06-07 2021-11-18 Lg Electronics Inc. Speech recognition method in edge computing device

Also Published As

Publication number Publication date
CN113192510B (en) 2024-04-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant