CN110858819A - Corpus collection method and device based on WeChat applet and computer equipment - Google Patents
- Publication number
- CN110858819A CN201910760571.8A
- Authority
- CN
- China
- Prior art keywords
- user
- vocabulary
- input
- account information
- recording
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/04—Real-time or near real-time messaging, e.g. instant messaging [IM]
- H04L51/046—Interoperability with other network applications or services
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/07—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
- H04L51/18—Commands or executable codes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/08—Network architectures or network communication protocols for network security for authentication of entities
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The application relates to a corpus collection method and device based on a WeChat applet, computer equipment, and a computer-readable storage medium. The method comprises: collecting account information of a user when it is detected that the user has logged in to the WeChat applet; acquiring a corpus collection operation event to collect the recording text of the user; and storing the collected recording text according to the account information of the user. Collecting the corpus through the WeChat applet enables low-cost, efficient corpus collection.
Description
Technical Field
The invention relates to the technical field of information processing, and in particular to a corpus collection method and device based on a WeChat applet, and to computer equipment.
Background
As AI speech recognition and voice interaction technologies mature and become widely applied, corpus collection, which is the energy source of voice interaction, has become extremely important. High-quality corpora can be used to train a highly usable speech recognition model that accurately recognizes the customer's intent.
At present, corpus collection mainly relies on professional recording equipment and recording spaces, and collection by professional corpus collection agencies is costly and time-consuming. Therefore, a corpus collection method is needed that balances time, cost, and recording quality.
Disclosure of Invention
The application provides a corpus collection method and device based on a WeChat applet, and computer equipment, which can collect corpora efficiently and at low cost while ensuring that the quality of the collected corpora meets usage requirements.
A corpus collection method based on a WeChat applet, the method comprising:
when a user is detected to log in a WeChat applet, acquiring account information of the user;
acquiring a corpus collection operation event to collect a recording text of a user;
and storing the collected recording text according to the account information of the user.
In one embodiment, before collecting the account information of the user when the user is detected to log in the WeChat applet, the method further includes:
if the user is detected to be a user entering the WeChat applet for the first time, displaying a disclaimer page;
entering a user identity information registration page after receiving a confirmation instruction of the disclaimer;
receiving and storing identity information input by a user, wherein the identity information comprises a region where the user is located, a dialect of the region, and the gender and age of the user;
and allocating a registration account number for the user to generate account information of the user.
In an embodiment, the acquiring the corpus collection operation event to collect the recording text of the user includes:
receiving an input instruction;
pushing a vocabulary to be input currently and prompting a voice input requirement;
and receiving and identifying whether the vocabulary input by the user meets the voice input requirement, and dividing the input vocabulary into effective vocabulary and invalid vocabulary according to the identification result.
In an embodiment, the voice input requirement includes a recording duration, a recording speed and a recording language of the vocabulary.
In an embodiment, the storing the collected recording text according to the account information of the user includes:
if the effective vocabulary is recognized to reach the preset recording times, naming all the vocabulary input by the user according to the identity information of the user;
and storing all the named words in a multi-level directory according to a preset storage rule.
In an embodiment, the method further comprises:
counting the number of effective vocabularies finished by a user;
generating a corresponding electronic red packet through a red packet generation mechanism of the WeChat platform according to the number of the effective words;
and acquiring account information of the user, and forwarding the electronic red packet to an account of the user.
In an embodiment, the method further comprises:
detecting whether a user is an invitee or not when a registration request of the user is received;
if the user is an invitee, acquiring account information of the inviter who invites the user;
and generating an electronic red packet corresponding to the invitation through a red packet generating mechanism of the WeChat platform, and forwarding the electronic red packet to the account of the inviter.
In an embodiment, after receiving and identifying whether the vocabulary entered by the user satisfies the speech entry requirement, the method further comprises:
acquiring the voiceprint characteristics of the words input by the user and matching the voiceprint characteristics with the account information of the user;
and if the voiceprint characteristics are identified to be matched with the account information, listing the accounts of the user in a blacklist.
A WeChat applet-based corpus collection device, the device comprising:
the first acquisition module is used for acquiring the account information of the user when the user is detected to log in the WeChat applet;
the second acquisition module is used for acquiring the corpus collection operation event so as to acquire the recording text of the user;
and the storage module is used for storing the acquired recording text according to the account information of the user.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
The corpus collection method and device based on a WeChat applet, the computer equipment, and the computer-readable storage medium provided by the embodiments of the application collect account information of a user when it is detected that the user has logged in to the WeChat applet; acquire a corpus collection operation event to collect the recording text of the user; and store the collected recording text according to the account information of the user. Collecting the corpus through the WeChat applet enables low-cost, efficient corpus collection.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a corpus collection method based on WeChat applets, according to an embodiment;
FIG. 2 is a flowchart illustrating obtaining a corpus collection operation event to collect a user's recorded text according to an embodiment;
FIG. 3 is a schematic diagram of a boot interface of a WeChat applet according to an embodiment;
FIG. 4 is a schematic diagram of a recording main interface according to an embodiment;
FIG. 5 is a schematic illustration of reward details provided by one embodiment;
FIG. 6 is a diagram illustrating sharing and reward details provided in accordance with one embodiment;
FIG. 7 is a block diagram of a corpus collection device based on WeChat applet, in one embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
To make the above objects, features, and advantages of the present application easier to understand, embodiments of the application are described in detail below with reference to the accompanying drawings. Numerous specific details are set forth in the following description to provide a thorough understanding of the application, and preferred embodiments are shown in the drawings. The application may, however, be embodied in many forms other than those described herein and should not be construed as limited to these embodiments, which are provided so that the disclosure will be thorough and complete. Those skilled in the art can make similar modifications without departing from the spirit of the application, which is therefore not limited to the specific embodiments disclosed below.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
FIG. 1 is a flowchart of a corpus collection method based on a WeChat applet according to an embodiment. As shown in FIG. 1, the method includes steps 110 to 130, wherein:
Step 110: when it is detected that the user logs in to the WeChat applet, acquiring account information of the user.
A WeChat applet, "applet" for short, is an application that can be used without being downloaded or installed; a user can open it by scanning a two-dimensional code or by searching for it. After the user adds the user-side WeChat applet in the WeChat application, the applet displays a user information registration page when it detects that the user has logged in to the login authentication platform. If the user is detected to be entering the WeChat applet for the first time, a disclaimer page is displayed; the display time may be 5 seconds. The disclaimer page may include privacy terms. After a confirmation instruction for the disclaimer is received, the applet enters the user identity information registration page, where the user can enter identity information through the corresponding input fields. The WeChat applet receives and stores the identity information entered by the user, which includes the region where the user is located, the dialect of that region, and the user's gender and age. Note that when registering identity information, the user is prompted that it cannot be changed once saved, and the information is stored only after the user explicitly confirms.
In addition, when a user registers with the WeChat applet for the first time, the applet assigns a unique user ID as the user's registered account number, thereby generating the user's account information.
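As an illustrative sketch only, the first-time registration flow described above could be modeled in Python as follows. The names `Identity`, `Registry`, and `register` are hypothetical and do not come from the application; an in-memory dictionary stands in for the applet's account store, and a frozen dataclass reflects the rule that identity information cannot be changed once saved.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: identity information cannot be edited after saving
class Identity:
    region: str
    dialect: str
    gender: int  # 0 = female, 1 = male (same convention as the file naming scheme)
    age: int


class Registry:
    """In-memory stand-in for the applet's account store (hypothetical)."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.accounts = {}  # user ID -> Identity

    def register(self, identity, disclaimer_confirmed):
        # Registration is reached only after the disclaimer has been confirmed.
        if not disclaimer_confirmed:
            raise PermissionError("disclaimer must be confirmed before registration")
        user_id = next(self._next_id)  # unique ID used as the registered account number
        self.accounts[user_id] = identity
        return user_id
```

A real deployment would persist the accounts on a server and tie each ID to the user's WeChat login; both are outside the scope of this sketch.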
Step 120: acquiring a corpus collection operation event to collect the recording text of the user.
The corpus collection operation event may be understood as an interaction process between the user and the WeChat applet. Specifically, as shown in FIG. 2, acquiring the corpus collection operation event to collect the recording text of the user includes steps 210 to 230, wherein:
Step 210: receiving an input instruction;
Step 220: pushing the vocabulary currently to be entered and prompting the voice input requirement;
Step 230: receiving the vocabulary entered by the user, identifying whether it meets the voice input requirement, and dividing the entered vocabulary into effective vocabulary and invalid vocabulary according to the identification result.
Specifically, after completing registration, the user logs in and enters the WeChat applet; the display interface is shown in FIG. 3. The user clicks "Start Now" to enter the recording main interface; that is, the WeChat applet enters the recording main interface after receiving the input instruction. The recording main interface is shown in FIG. 4. It displays the vocabulary currently to be entered, the voice input requirement, the number of completed entries, and the reward rules.
The vocabulary to be entered may include wake-up words and command words; the application does not limit their number. In this embodiment, the vocabulary to be entered comprises 52 words: 15 wake-up words and 37 command words. The voice input requirement comprises the recording duration, recording speed, and recording language of the word. The recording speed may be fast, normal, or slow, each corresponding to a different recording duration; the recording language may be dialect or Mandarin. In this embodiment, each collection subject must make 6 valid entries of the same word to complete that word: 3 in Mandarin (one each at fast, normal, and slow speech rates) and 3 in dialect (one each at fast, normal, and slow speech rates). Different recording durations can be set for different speech rates. For example, when the user is to enter a word at normal speed, the recording main interface prompts: "Please read the text above at normal speed, within 5 seconds."
It can be understood that other recording durations, speeds, and languages may be used; this embodiment is merely an example and does not limit them.
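The six required entries per word (two languages times three speech rates) can be enumerated with a short Python sketch. The function name `recording_plan` is hypothetical, and only the 5-second limit for normal speed comes from the text; the limits for fast and slow speech are assumed values chosen for illustration.

```python
from itertools import product

LANGUAGES = ("mandarin", "dialect")
SPEEDS = ("fast", "normal", "slow")

# Only the 5-second limit for normal speed is given in the text;
# the fast and slow limits below are assumptions.
MAX_SECONDS = {"fast": 3, "normal": 5, "slow": 8}


def recording_plan(word):
    """Return the list of entries a user must complete for one word."""
    return [
        {
            "word": word,
            "language": language,
            "speed": speed,
            "max_seconds": MAX_SECONDS[speed],
        }
        for language, speed in product(LANGUAGES, SPEEDS)
    ]
```

Calling `recording_plan("power on")` yields six entries, Mandarin first and then dialect, each at fast, normal, and slow speed. Changing the tuples or the duration table adapts the plan to other languages or speeds.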
The user starts recording by triggering a recording start instruction, which may be a click on a start-recording button or any other way of triggering the WeChat applet to start recording. As shown in FIG. 4, recording begins when the user clicks the microphone button. After recording starts, the WeChat applet receives the user's entry and identifies whether it meets the voice input requirement: an entry that meets the requirement is treated as effective vocabulary, and one that does not is treated as invalid vocabulary.
For example, suppose the user is currently required to enter the word "power on" in Mandarin at normal speed (entry completed within 5 seconds). On receiving the user's entry, the WeChat applet identifies it: specifically, whether it is spoken in standard Mandarin, whether it was completed within the specified duration, and whether the word spoken is "power on". If the entry meets all requirements, it is treated as effective vocabulary; if it fails any one requirement, for example it is identified as not being Mandarin, it is treated as invalid vocabulary and the user is prompted to enter it again.
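The three checks in this example (word, language, duration) can be sketched in Python as below. The `Recognition` record is a hypothetical stand-in for whatever a speech recognition service returns; the application does not specify its shape, and the recognition itself is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    """Result of an external speech recognizer (hypothetical shape)."""
    text: str                # the word that was recognized
    language: str            # e.g. "mandarin" or "dialect"
    duration_seconds: float  # length of the recording


def classify_entry(recognized, expected_word, expected_language, max_seconds):
    """Return "valid" only if every voice input requirement is met."""
    meets_all = (
        recognized.text == expected_word
        and recognized.language == expected_language
        and recognized.duration_seconds <= max_seconds
    )
    return "valid" if meets_all else "invalid"
```

An entry that fails any single check is classified as invalid, which matches the rule that the user is then asked to record again.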
Note that Mandarin and dialects are distinguished by speech recognition. In this embodiment, a Baidu AI interface is used for dialect identification. In addition, for dialect collection, if the dialect region is at the provincial level, the number of audio recordings to collect can be set; if the dialect is at the city level, it is used only for statistics and is not a criterion for restricting users from uploading recordings, so a city-level dialect, such as Hangzhou in Zhejiang, is an optional field.
Step 130: storing the collected recording text according to the account information of the user.
If it is recognized that entry of a word is complete, all entries made by the user are stored. Specifically, if the effective vocabulary reaches the preset number of recordings, all entries made by the user are named according to the user's identity information.
Entry of a word is complete when 6 valid entries have been made, at which point all of the user's entries are stored. These include both the valid and the invalid entries made during corpus collection. The invalid entries are stored for testing, namely to verify that the intelligent microphone is not falsely triggered by invalid vocabulary and does not recognize it. In this embodiment the preset number of recordings is 6; it may also be 4, 8, 10, and so on, set according to actual needs.
After the user finishes entering a word, all of the user's entries are organized into audio files and named. The naming scheme is as follows:
For invalid vocabulary, the name is: user ID-province ID-age-gender (0 or 1)-(1, 2, 3, 4, ...).wav, where the gender attribute is 0 when the user is female and 1 when the user is male. There is no upper limit on the number of invalid entries stored, so the 5th attribute of the name is unbounded. For valid vocabulary, the name is: user ID-province ID-age-gender (0 or 1)-(1, 2, 3).wav. For dialect vocabulary, the name is: user ID-province ID-age-gender (0 or 1)-(1, 2, 3).wav.
After the entries are named, they are stored in a multi-level directory under the user's account according to a preset storage rule. The first-level directories are invalid vocabulary, valid vocabulary, and dialect vocabulary; the second-level directories are the specific words, one directory per word.
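A minimal Python sketch of the naming and directory rules above follows. The function name `recording_path`, the directory labels `invalid`, `valid`, and `dialect`, and the idea of an account root folder are assumptions made for illustration; the text specifies only the five hyphen-separated name attributes and the two directory levels.

```python
from pathlib import PurePosixPath

# First-level directories named in the text (the English labels are assumed).
CATEGORIES = ("invalid", "valid", "dialect")


def recording_path(account_root, category, word, user_id, province_id, age, gender, seq):
    """Build <account>/<category>/<word>/<userID-provinceID-age-gender-seq>.wav."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    if gender not in (0, 1):  # 0 = female, 1 = male
        raise ValueError("gender must be 0 or 1")
    filename = f"{user_id}-{province_id}-{age}-{gender}-{seq}.wav"
    return PurePosixPath(account_root) / category / word / filename
```

For example, `recording_path("u1001", "valid", "power_on", 1001, 33, 28, 0, 2)` produces `u1001/valid/power_on/1001-33-28-0-2.wav`. The sequence number runs from 1 to 3 for valid and dialect entries and has no upper bound for invalid ones; the sketch leaves that range check to the caller.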
The corpus collection method based on the WeChat applet provided by this embodiment comprises: when it is detected that the user logs in to the WeChat applet, acquiring account information of the user; acquiring a corpus collection operation event to collect the recording text of the user; and storing the collected recording text according to the account information of the user. Collecting the corpus through the WeChat applet enables low-cost, efficient corpus collection.
In one embodiment, completing the vocabulary entry task may take some time, such as 10 to 20 minutes, during which the user may freely enter and exit the WeChat applet. If the user exits the applet while recording, the applet automatically saves the user's current entry records; when the user enters the applet again, vocabulary entry resumes from the point at which the user left, so the user does not need to complete all entries in one session, which improves the user experience.
In addition, the number of recordings required for each word is limited; for example, 1000 valid recordings per wake-up word is sufficient. Therefore, if other users complete entries of a word after the user exits the WeChat applet during recording, and the total reaches the required number, then when the user re-enters the applet a prompt informs the user that entry of that word has been completed and is no longer available, which makes the activity more engaging.
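The resume-from-exit behavior and the per-word quota can be combined in a small Python sketch. The class `RecordingSession` and its methods are hypothetical; the quota of 1000 is the example figure from the text, and the sketch simply skips a word whose quota was filled by other users, whereas the applet described above would also show a prompt.

```python
class RecordingSession:
    """Tracks one user's progress through the vocabulary (hypothetical helper)."""

    def __init__(self, words, position=0):
        self.words = list(words)
        self.position = position  # index of the next word to record

    def save(self):
        # State persisted when the user leaves the applet mid-session.
        return {"position": self.position}

    @classmethod
    def resume(cls, words, saved):
        # Re-entering the applet continues from the saved position.
        return cls(words, saved["position"])

    def next_word(self, valid_counts, quota=1000):
        """Return the next word still open for entry, or None if none remain.

        valid_counts maps each word to the number of valid recordings
        already collected from all users.
        """
        while self.position < len(self.words):
            word = self.words[self.position]
            if valid_counts.get(word, 0) < quota:
                return word
            self.position += 1  # quota already met by other users: skip
        return None

    def complete_current(self):
        self.position += 1
```

Persisting only an index keeps the saved state small, but it assumes the word order for an account never changes between sessions.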
In an embodiment, the corpus collection method further includes: and counting the number of the effective words finished by the user, and generating a corresponding electronic red packet through a red packet generation mechanism of the WeChat platform according to the number of the effective words.
Specifically, the WeChat applet may call a WeChat red packet interface to generate an electronic red packet of a certain amount. The amount depends on the number of valid words the user has completed. For example, if each valid word is set to earn 2 yuan (RMB), then completing N valid words generates an electronic red packet of 2 × N yuan.
The electronic red packet is forwarded to the user's account according to the user's account information. The recording main interface displays icons for undistributed reward data and reward details, which the user can open to view specific data. As shown in FIG. 4, the gift icon in the lower right corner opens the reward details, and the portrait icon opens the sharing statistics. When the user clicks the gift icon, the interface shown in FIG. 5 is displayed; it contains two items of information, my rewards and sharing rewards, and FIG. 5 shows my rewards. The sharing reward information can be viewed by swiping. When the user clicks the portrait icon, the WeChat applet can be shared with other WeChat friends.
Rewarding users who enter valid words through a red packet mechanism stimulates their interest, so that corpus collection is completed more effectively. In addition, if the user completes entry of all words (52 in total in this application), an additional electronic red packet is generated and forwarded to the user's account as an extra reward.
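The reward arithmetic above can be written out as a short Python sketch. The rate of 2 yuan per valid word and the total of 52 words come from the text; the completion bonus of 10 yuan is an assumed placeholder, since the application does not state that amount, and the function name `reward_amount` is hypothetical. Issuing the red packet through the WeChat platform is not modeled.

```python
REWARD_PER_VALID_WORD = 2  # yuan per valid word, example rate from the text
TOTAL_WORDS = 52           # size of the vocabulary in this embodiment
COMPLETION_BONUS = 10      # yuan; assumed value, not stated in the text


def reward_amount(valid_words):
    """Return the red packet amount in yuan for the given count of valid words."""
    if valid_words < 0:
        raise ValueError("valid_words must be non-negative")
    amount = REWARD_PER_VALID_WORD * valid_words
    if valid_words >= TOTAL_WORDS:
        amount += COMPLETION_BONUS  # extra reward for finishing every word
    return amount
```

With these constants, 3 valid words give 6 yuan, and completing all 52 gives 104 yuan plus the assumed bonus.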
In an embodiment, the corpus collection method further includes: upon receiving a registration request of a user, it is detected whether the user is an invitee. And if the user is an invitee, acquiring account information of the inviter of the inviting user. And generating an electronic red packet corresponding to the invitation through a red packet generating mechanism of the WeChat platform, and forwarding the electronic red packet to an account of the inviter.
In this embodiment, when receiving a registration request from a user, the WeChat applet determines whether the user is an invitee by checking whether the registration request carries an inviter identifier; if it does, the user is determined to be an invitee. Specifically, the interface the client displays for entering registration information may be an initial registration interface containing a text box for entering the inviter identifier. When entering registration information, the user can type the inviter identifier into this text box, so that the client carries it in the registration request sent to the server. The inviter identifier may be a unique code corresponding to the inviter's registered account number. The server checks whether the registration request carries an inviter identifier in order to detect whether the user is an invitee. If so, it obtains the inviter's registered account number from the inviter identifier, generates a corresponding electronic red packet through the red packet generation mechanism of the WeChat platform, and forwards the red packet to the inviter's account; this red packet can be understood as a promotion reward red packet.
In one embodiment, when a new user is detected registering with the WeChat applet, the inviter who invited the user is automatically identified, the inviter's registered account number is obtained, a corresponding electronic red packet is generated through the red packet generation mechanism of the WeChat platform, and the red packet is forwarded to the inviter's account; this red packet can be understood as a promotion reward red packet.
In addition, the invitee's task completion status is monitored. If the invitee successfully completes vocabulary entry, one electronic red packet is generated and forwarded to the invitee's account, and another is generated and sent to the inviter's account, so that both the inviter and the invitee are rewarded. As shown in FIG. 6, of the two users invited by the inviter, Zhang San and Li, the inviter is rewarded 20.88 yuan if Zhang San completes entry of 2 words and 50 yuan if Zhang San completes entry of 10 words. The reward amounts may be other values; the application does not limit them.
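The invitee detection described above amounts to checking whether a registration request carries an inviter identifier and, if so, resolving it to the inviter's account. A minimal Python sketch follows; the request field name `inviter_code`, the function `handle_registration`, and the `reward_inviter` callback are all hypothetical, and the callback stands in for the red packet generation mechanism of the WeChat platform.

```python
def handle_registration(request, accounts_by_code, reward_inviter):
    """Detect whether the registering user is an invitee and reward the inviter.

    request:          dict of registration data, optionally with "inviter_code"
    accounts_by_code: maps each inviter's unique code to a registered account number
    reward_inviter:   callback that issues the promotion reward to an account
    Returns the inviter's account number, or None if the user was not invited.
    """
    code = request.get("inviter_code")
    if not code:
        return None  # no inviter identifier carried: not an invitee
    inviter_account = accounts_by_code.get(code)
    if inviter_account is None:
        return None  # unknown code: treat as an ordinary registration
    reward_inviter(inviter_account)
    return inviter_account
```

Treating an unknown code as an ordinary registration is a design choice of this sketch; a server could instead reject the request.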
In an embodiment, after receiving and identifying whether the vocabulary entered by the user meets the speech entry requirements, the method further comprises:
acquiring voiceprint characteristics of words input by a user and matching the voiceprint characteristics with account information of the user;
and if the voiceprint characteristics are identified to be matched with the account information, listing the accounts of the user in a blacklist.
And when the vocabulary input by the user is obtained, carrying out voiceprint characteristic analysis on the input vocabulary to obtain the voiceprint characteristic of the user recording. Voiceprint (Voiceprint) is the spectrum of sound waves carrying verbal information displayed with an electro-acoustic instrument. The generation of human language is a complex physiological and physical process between human language center and pronunciation organs, and the pronunciation organs used by each person when speaking: the tongue, teeth, larynx, lung, and nasal cavity vary greatly in size and morphology, so the acoustic spectrum varies between any two people. Voiceprint features are acoustic features related to the anatomy of the human pronunciation mechanism, such as spectrum, cepstrum, formants, fundamental tones, reflection coefficients, nasal sounds, profound breath sounds, humble, laughter, etc. Because the pronunciation organs of each person are different, the audio corresponding to different speakers can be accurately identified by identifying the voiceprint characteristics of the audio signals.
Different voiceprint features correspond to different users. Because every user who registers with the WeChat applet is allocated a unique user ID as the registered account, matching the identified voiceprint features against the users' account information reveals whether the same user is recording under multiple accounts. If so, the accounts used by that user are added to a blacklist and the user is prohibited from using the WeChat applet.
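As a minimal sketch of the multi-account check, assume each recording has already been reduced to a fixed-length voiceprint vector by some feature extractor (not shown). The cosine-similarity comparison, the threshold value, and all names below are hypothetical; the text does not specify how voiceprints are matched.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # hypothetical value; a real system would tune this


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_recording(account_id, voiceprint, enrolled, blacklist):
    """Compare a new voiceprint with those stored under other accounts.

    `enrolled` maps account ID -> voiceprint vector. If the same voice already
    belongs to a different account, both accounts are added to `blacklist`.
    Returns True when the recording is accepted.
    """
    for other_id, other_print in enrolled.items():
        if other_id != account_id and cosine_similarity(voiceprint, other_print) >= SIMILARITY_THRESHOLD:
            blacklist.update({account_id, other_id})
            return False
    enrolled.setdefault(account_id, voiceprint)
    return True
```

Blacklisting both accounts follows the statement that the accounts used by the user are listed; whether the earlier account should also be blocked is a policy choice.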
In one embodiment, the vocabulary to be entered is presented in shuffled order; that is, the WeChat applet shows the vocabulary to be entered in a different order for different accounts. When the same user logs in to the WeChat applet on several electronic devices and records on all of them at once, the spoken vocabulary therefore meets the entry requirement of the applet on only one device, which reduces the possibility of one user recording simultaneously on multiple devices.
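One way to obtain a per-account ordering is to seed a shuffle with the account ID, as sketched below. The hashing scheme and the function name are assumptions; the text only requires that different accounts see the vocabulary in different orders.

```python
import hashlib
import random


def vocabulary_order(account_id: str, vocabulary: list[str]) -> list[str]:
    """Return a per-account ordering of the vocabulary.

    The account ID seeds the shuffle, so one account always sees the same
    order while different accounts generally see different ones.
    """
    seed = int.from_bytes(hashlib.sha256(account_id.encode("utf-8")).digest()[:8], "big")
    shuffled = list(vocabulary)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

A deterministic seed means the order does not have to be stored on the server: it can be recomputed from the account ID whenever the user resumes recording.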
Corpus collected through the WeChat applet has use value only if each person records under a single account. If one person records under multiple accounts, the valid corpus finally collected still represents only one speaker, and the reward money paid out is disproportionate to the valid corpus obtained. The measures above prevent this from happening.
In one embodiment, the background (back end) of the WeChat applet may present the following: a navigation menu, password modification, the total number of users, the overall vocabulary collection status (invalid, valid, and dialect), the reward issuance status, and so on; clicking any area displays the corresponding data. Registered users' information can also be browsed, with support for querying and viewing records. Specifically, the user recording view displays the words a user has finished recording, which can be audited online. The history of rewards issued by the platform can also be queried, including the user ID, issue time, amount, and transfer number.
In addition, the vocabulary types, the maximum recording duration, the number to be collected for each province, newly added city names, and the like can be set in the background.
It should be understood that although the steps in the flowcharts of fig. 1 and 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 and 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, as they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a corpus collection device based on a WeChat applet, including: a first acquisition module 710, a second acquisition module 720, and a storage module 730, wherein:
the first acquisition module 710 is configured to collect account information of a user when the user is detected to log in to a WeChat applet;
the second acquisition module 720 is configured to acquire a corpus collection operation event to collect a recording text of the user;
the storage module 730 is configured to store the collected recording text according to the account information of the user.
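The cooperation of the three modules might be sketched as follows. The class, its in-memory dictionaries, and the shape of the event are hypothetical stand-ins chosen for illustration; the text does not prescribe any data structures.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusCollector:
    """Hypothetical stand-in for the device of fig. 7."""
    sessions: dict = field(default_factory=dict)  # session token -> account ID
    store: dict = field(default_factory=dict)     # account ID -> list of recordings

    def acquire_account(self, session_token: str) -> str:
        # First acquisition module (710): resolve the logged-in user's account.
        return self.sessions[session_token]

    def acquire_recording(self, event: dict) -> dict:
        # Second acquisition module (720): extract the recording from a collection event.
        return {"word": event["word"], "audio": event["audio"]}

    def save(self, account_id: str, recording: dict) -> None:
        # Storage module (730): file the recording under the user's account.
        self.store.setdefault(account_id, []).append(recording)

    def handle(self, session_token: str, event: dict) -> None:
        # Run the three modules in sequence for one collection event.
        account_id = self.acquire_account(session_token)
        self.save(account_id, self.acquire_recording(event))
```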
In an embodiment, before collecting the account information of the user when the user is detected to log in to the WeChat applet, the first acquisition module 710 is further configured to perform:
if the user is detected to be a user entering the WeChat applet for the first time, displaying a disclaimer page;
entering a user identity information registration page after receiving a confirmation instruction of the disclaimer;
receiving and storing identity information input by a user, wherein the identity information comprises a region where the user is located, a dialect of the region, and the gender and age of the user;
and allocating a registration account number for the user to generate account information of the user.
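The first-time registration flow above (disclaimer confirmation, identity entry, account allocation) could look roughly like this. The `Identity` fields mirror the identity information listed in the text; the UUID-based account ID and the function name are assumptions.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    region: str
    dialect: str
    gender: str
    age: int


def register_first_time_user(accounts: dict, disclaimer_accepted: bool, identity: Identity) -> str:
    """Register a new user and return the allocated account ID.

    Raises PermissionError if the disclaimer was not confirmed.
    """
    if not disclaimer_accepted:
        raise PermissionError("the disclaimer must be confirmed before registration")
    account_id = uuid.uuid4().hex  # hypothetical ID scheme; any unique ID would do
    accounts[account_id] = identity
    return account_id
```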
In an embodiment, the second acquisition module 720 acquiring the corpus collection operation event to collect the recording text of the user includes:
receiving an input instruction;
pushing a vocabulary to be input currently and prompting a voice input requirement;
and receiving and identifying whether the vocabulary input by the user meets the voice input requirement, and dividing the input vocabulary into effective vocabulary and invalid vocabulary according to the identification result.
In an embodiment, the voice entry requirement includes the recording duration, the recording speech rate, and the recording language of the vocabulary.
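A check against the three requirements (duration, speech rate, language) might be sketched as below. Measuring speech rate in characters per second and receiving the detected language from an upstream recognizer are both assumptions, as are all names and limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryRequirement:
    max_duration_s: float    # longest allowed recording
    min_chars_per_s: float   # slowest allowed speech rate
    max_chars_per_s: float   # fastest allowed speech rate
    language: str            # required recording language or dialect


def is_valid_entry(req: EntryRequirement, word: str, duration_s: float, detected_language: str) -> bool:
    """Classify one recording as valid (True) or invalid (False)."""
    if duration_s <= 0 or duration_s > req.max_duration_s:
        return False
    rate = len(word) / duration_s
    if not req.min_chars_per_s <= rate <= req.max_chars_per_s:
        return False
    return detected_language == req.language
```

Recordings that fail any one of the three checks would be filed as invalid vocabulary; those that pass all three count as valid.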
In an embodiment, the storage module 730 is further configured for: if it is recognized that the valid vocabulary reaches the preset number of recordings, naming all the words entered by the user according to the identity information of the user;
and storing all the named words in a multi-level directory according to a preset storage rule.
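The naming and multi-level storage step could be sketched as a path builder. The specific directory levels (region, dialect, account) and the file name pattern are hypothetical; the text says only that files are named from the user's identity information and stored under a preset rule.

```python
from pathlib import PurePosixPath


def storage_path(root: str, region: str, dialect: str, gender: str, age: int,
                 account_id: str, word_index: int) -> PurePosixPath:
    """Build a multi-level path such as root/region/dialect/account/file.wav."""
    # The file name carries the identity fields so a file stays self-describing
    # even if it is moved out of its directory.
    filename = f"{region}_{dialect}_{gender}_{age}_{account_id}_{word_index:04d}.wav"
    return PurePosixPath(root) / region / dialect / account_id / filename
```

For example, `storage_path("corpus", "Guangdong", "Cantonese", "F", 30, "u1", 7)` yields `corpus/Guangdong/Cantonese/u1/Guangdong_Cantonese_F_30_u1_0007.wav`.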
In one embodiment, the device further comprises a reward module for counting the number of the effective vocabularies completed by the user;
generating a corresponding electronic red packet through a red packet generation mechanism of the WeChat platform according to the number of the effective words;
and acquiring account information of the user, and forwarding the electronic red packet to an account of the user.
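A minimal sketch of the reward module's counting step follows. The `send_red_packet` callable is a hypothetical placeholder for the WeChat platform's red packet mechanism, whose actual interface is not described here; the amount policy is likewise supplied by the caller, and amounts are assumed to be expressed in fen.

```python
def reward_user(account_id, entries, amount_for_count, send_red_packet):
    """Count valid entries and forward a red packet for them.

    `entries` is an iterable of (word, is_valid) pairs; `amount_for_count`
    maps the valid count to an amount in fen; `send_red_packet` is a
    hypothetical callable standing in for the platform's red packet mechanism.
    Returns the number of valid entries.
    """
    valid_count = sum(1 for _, is_valid in entries if is_valid)
    amount_fen = amount_for_count(valid_count)
    if amount_fen > 0:
        send_red_packet(account_id, amount_fen)
    return valid_count
```

Injecting the payout function keeps the counting logic testable without any network call.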
In one embodiment, the reward module is further configured to detect whether the user is an invitee upon receiving a registration request of the user;
if the user is an invitee, acquiring account information of the inviter who invites the user;
and generating an electronic red packet corresponding to the invitation through a red packet generating mechanism of the WeChat platform, and forwarding the electronic red packet to the account of the inviter.
In an embodiment, after receiving and identifying whether the vocabulary entered by the user meets the voice entry requirement, the second acquisition module 720 is further configured to perform:
acquiring the voiceprint characteristics of the words input by the user and matching the voiceprint characteristics with the account information of the user;
and if the voiceprint characteristics are identified to be matched with the account information, listing the accounts of the user in a blacklist.
For specific limitations of the corpus collection device based on the WeChat applet, reference may be made to the limitations of the corpus collection method based on the WeChat applet above, which are not repeated here. Each module in the corpus collection device may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a WeChat applet-based corpus collection method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
when a user is detected to log in a WeChat applet, acquiring account information of the user;
acquiring a corpus collection operation event to collect a recording text of a user;
and storing the collected recording text according to the account information of the user.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
when a user is detected to log in a WeChat applet, acquiring account information of the user;
acquiring a corpus collection operation event to collect a recording text of a user;
and storing the collected recording text according to the account information of the user.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present invention, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (11)
1. A corpus collection method based on a WeChat applet, characterized by comprising the following steps:
when a user is detected to log in a WeChat applet, acquiring account information of the user;
acquiring a corpus collection operation event to collect a recording text of a user;
and storing the collected recording text according to the account information of the user.
2. The method of claim 1, wherein prior to collecting account information for a user upon detecting that the user is logged into a WeChat applet, the method further comprises:
if the user is detected to be a user entering the WeChat applet for the first time, displaying a disclaimer page;
entering a user identity information registration page after receiving a confirmation instruction of the disclaimer;
receiving and storing identity information input by a user, wherein the identity information comprises a region where the user is located, a dialect of the region, and the gender and age of the user;
and allocating a registration account number for the user to generate account information of the user.
3. The method of claim 1, wherein the obtaining the corpus collection action event to collect the recorded text of the user comprises:
receiving an input instruction;
pushing a vocabulary to be input currently and prompting a voice input requirement;
and receiving and identifying whether the vocabulary input by the user meets the voice input requirement, and dividing the input vocabulary into effective vocabulary and invalid vocabulary according to the identification result.
4. The method of claim 3, wherein the voice entry requirements include an entry duration, a recording speech rate, and a recording language of the vocabulary.
5. The method of claim 3, wherein saving the captured recorded text according to the account information of the user comprises:
if the effective vocabulary is recognized to reach the preset recording times, naming all the vocabulary input by the user according to the identity information of the user;
and storing all the named words in a multi-level directory according to a preset storage rule.
6. The method of claim 3, further comprising:
counting the number of effective vocabularies finished by a user;
generating a corresponding electronic red packet through a red packet generation mechanism of the WeChat platform according to the number of the effective words;
and acquiring account information of the user, and forwarding the electronic red packet to an account of the user.
7. The method of claim 1, further comprising:
detecting whether a user is an invitee or not when a registration request of the user is received;
if the user is an invitee, acquiring account information of the inviter who invites the user;
and generating an electronic red packet corresponding to the invitation through a red packet generating mechanism of the WeChat platform, and forwarding the electronic red packet to the account of the inviter.
8. The method of claim 3, wherein after receiving and identifying whether the user-entered vocabulary satisfies the speech-entry requirements, the method further comprises:
acquiring the voiceprint characteristics of the words input by the user and matching the voiceprint characteristics with the account information of the user;
and if the voiceprint characteristics are identified to be matched with the account information, listing the accounts of the user in a blacklist.
9. A corpus collection device based on a WeChat applet, the device comprising:
the first acquisition module is used for acquiring the account information of the user when the user is detected to log in the WeChat applet;
the second acquisition module is used for acquiring the corpus collection operation event so as to acquire the recording text of the user;
and the storage module is used for storing the acquired recording text according to the account information of the user.
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910760571.8A CN110858819A (en) | 2019-08-16 | 2019-08-16 | Corpus collection method and device based on WeChat applet and computer equipment |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910760571.8A CN110858819A (en) | 2019-08-16 | 2019-08-16 | Corpus collection method and device based on WeChat applet and computer equipment |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN110858819A (en) | 2020-03-03 |
Family
ID=69636460
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910760571.8A Pending CN110858819A (en) | 2019-08-16 | 2019-08-16 | Corpus collection method and device based on WeChat applet and computer equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN110858819A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113327593A (en) * | 2021-05-25 | 2021-08-31 | 上海明略人工智能(集团)有限公司 | Apparatus and method for corpus acquisition, electronic device and readable storage medium |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6275806B1 (en) * | 1999-08-31 | 2001-08-14 | Andersen Consulting, Llp | System method and article of manufacture for detecting emotion in voice signals by utilizing statistics for voice signal parameters |
| US20160217786A1 (en) * | 2006-04-05 | 2016-07-28 | Amazon Technologies, Inc. | Hosted voice recognition system for wireless devices |
| CN107368724A (en) * | 2017-06-14 | 2017-11-21 | 广东数相智能科技有限公司 | Anti- cheating network research method, electronic equipment and storage medium based on Application on Voiceprint Recognition |
| CN108831476A (en) * | 2018-05-31 | 2018-11-16 | 平安科技(深圳)有限公司 | Voice acquisition method, device, computer equipment and storage medium |
| CN109003600A (en) * | 2018-08-02 | 2018-12-14 | 科大讯飞股份有限公司 | Message treatment method and device |
| CN109150700A (en) * | 2018-09-06 | 2019-01-04 | 北京云测信息技术有限公司 | A kind of method and device of data acquisition |
| CN109493869A (en) * | 2018-12-25 | 2019-03-19 | 苏州思必驰信息科技有限公司 | The acquisition method and system of audio data |
| CN109902226A (en) * | 2019-01-25 | 2019-06-18 | 上海基分文化传播有限公司 | A kind of user's recommended method and system and client device |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113327593A (en) * | 2021-05-25 | 2021-08-31 | 上海明略人工智能(集团)有限公司 | Apparatus and method for corpus acquisition, electronic device and readable storage medium |
| CN113327593B (en) * | 2021-05-25 | 2024-04-30 | 上海明略人工智能(集团)有限公司 | Device and method for corpus acquisition, electronic equipment and readable storage medium |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US11727918B2 (en) | Multi-user authentication on a device | |
| Hoegen et al. | An end-to-end conversational style matching agent | |
| CN107818798B (en) | Customer service quality evaluation method, device, equipment and storage medium | |
| CN112262431B (en) | Speaker log using speaker embeddings and trained generative model | |
| EP3881317B1 (en) | System and method for accelerating user agent chats | |
| US10270736B2 (en) | Account adding method, terminal, server, and computer storage medium | |
| US8095372B2 (en) | Digital process and arrangement for authenticating a user of a database | |
| EP2109097B1 (en) | A method for personalization of a service | |
| WO2017197953A1 (en) | Voiceprint-based identity recognition method and device | |
| CN114155460B (en) | User type identification method, device, computer equipment and storage medium | |
| CN110136721A (en) | A kind of scoring generation method, device, storage medium and electronic equipment | |
| WO2021169365A1 (en) | Voiceprint recognition method and device | |
| CN109346089A (en) | Living body identity identifying method, device, computer equipment and readable storage medium storing program for executing | |
| CN107506166A (en) | Information cuing method and device, computer installation and readable storage medium storing program for executing | |
| CN110766442A (en) | Client information verification method, device, computer equipment and storage medium | |
| US12512094B2 (en) | System and method for consent detection and validation | |
| KR20220018461A (en) | server that operates a platform that analyzes voice and generates events | |
| CN108322770A (en) | Video frequency program recognition methods, relevant apparatus, equipment and system | |
| CN112417412A (en) | Bank account balance inquiry method, device and system | |
| JP4143541B2 (en) | Method and system for non-intrusive verification of speakers using behavior models | |
| CN110858819A (en) | Corpus collection method and device based on WeChat applet and computer equipment | |
| CN115171732A (en) | Method, system, electronic device and storage medium for actively collecting user opinions | |
| CN116047929A (en) | Smart home voice control method and system based on full-duplex voice interaction | |
| CN116631370A (en) | Voice prompt method, device, electronic device and storage medium | |
| CN114862420A (en) | Identification methods, apparatus, program products, media and equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200303 |