WO2002061729A1

WO2002061729A1 - Method and system for audio interaction between human being and computer

Info

Publication number: WO2002061729A1
Application number: PCT/JP2001/000628
Authority: WO
Inventors: Tadamitsu Ryu; Masato Numabe; Yoichi Saitoh; Shinichiro Kubo; Hiroyuki Shimazaki
Original assignee: Cai Co., Ltd
Priority date: 2001-01-31
Filing date: 2001-01-31
Publication date: 2002-08-08

Abstract

An audio interactive method for interface between a human being and a computer comprises a step (21) of receiving human words from a voice input device which listens to conversation between the human being and the computer, a step (22) of interpreting the words of conversation and creating words of response by the computer, a step (23) of allowing a voice generator to produce the words of response by the computer to the human words, a step (24) of entering the words of response from the computer using the voice input device, and a step (25) of canceling the words of response from the computer by a voice cancellation device to remove them from what the computer should recognize.

Description

Specification

Method and system for voice interaction between computer and human

The present invention relates to a method and system for voice dialogue between a computer and a human, and more particularly to a voice dialogue method and system in which the computer correctly recognizes a conversation sentence from a human and enables a dialogue as natural as one between humans.

Background art

In recent years, computer-based speech recognition technology has been developed, and an environment in which computers and humans interact with each other by voice is taking shape. Since speech is the most natural means of input and output for human beings, many attempts and studies have been made to enable conversations between computers and humans, resembling conversations between humans, using information stored in computers. For example, some systems searched for and extracted the appropriate information from the information stored in the computer and conveyed it to the human, while others set up questions and answers to match the data structure in memory and carried them out in an order and at a speed that were not unnatural.

In the former example, the computer provides information to the human. In the latter example, the computer listens to a patient's symptoms and condition and fills out a sheet before the doctor's consultation. As a prerequisite for speech dialogue between a computer and a human to function well, the computer must correctly recognize and interpret the speech uttered by the human, that is, the conversation sentences, and the various programs for establishing a conversation must cooperate properly. However, research on speech dialogue between computers and humans has only recently advanced rapidly, and much of it has remained at the level of theoretical work, such as the theory of speech recognition and efforts to create artificial brains. Adjustment or normalization techniques, such as what should be done to make the computer recognize human voices correctly, or how to recognize speech even in the presence of ambient sound and noise, have not yet been sufficiently studied.

For example, in a voice dialogue between a computer and a human, the human speaks and the conversation sentence from the human is received by a voice input device such as a microphone connected to the computer. The computer interprets the input conversation sentence and creates a response to it. This response sentence is output from a voice generator such as a speaker, and this output is also input to the voice input device such as the microphone connected to the computer. As a result, the computer tries to interpret this response as if it were a human conversation sentence and to create a further response. Such an operation may confuse the program that operates the computer and prevent proper dialogue. Also, focusing on the environment in which the dialogue takes place, the surroundings are not completely silent, and in many cases there is a considerable level of noise, such as the sound of air blown from air conditioning or the voices of surrounding people. In addition, such noise is small in the early morning hours but increases as time passes. Therefore, there was a disadvantage that the recognition accuracy of conversation sentences by the computer became extremely low in the afternoon, or that recognition became impossible.

By the way, there is a telephone as a device for holding a voice dialogue between humans, and the same problem on the transmitting side and the receiving side has been solved by various methods. For example, in a “noise canceling device” according to Japanese Patent Application Laid-Open No. 9-133177, an audio signal is input from a first microphone, while noise and an audio signal are input from a second microphone. Then, the noise / voice signal from the second microphone is inverted in phase and combined with the voice signal from the first microphone to obtain a reduced noise signal, which is input to the voice input of the device to be used. At the same time, the noise and surrounding noise are reduced by synthesizing the signal obtained by inverting the phase of the audio signal with the audio output signal of the device and outputting the synthesized signal from the speaker.

Since the above invention deals with human-to-human dialogue, even if it is applied to a voice dialogue between a computer and a human, the computer cannot interpret human conversation sentences. In other words, humans listen to the other party separately, without being conscious of their own speech, recognize and interpret only what the other party has said, and create a response sentence taking into account their own past remarks. On the other hand, in the case of a computer, if no adjustment is made to the audio signal obtained by the microphone, all of it is judged to be a speech recognition target, and the process of creating a response sentence is started. This confuses the program that runs the dialogue, as described above.

In addition, canceling the noise included in the audio input to the microphone plays an important role in enabling the computer to recognize conversation sentences accurately, but the telephone noise canceller invention described above cannot be applied directly. That is, a method that constantly cancels noise, as in the above-mentioned publication, has the drawback that noise cannot be removed properly when the volume difference between the noise and the conversation becomes small. Attempts have therefore been made to collect noise over a predetermined period and learn it in order to create a noise canceling signal. However, how to apply such an attempt specifically to a method of voice dialogue between a computer and a human has not yet been developed, and there has been a demand for the development of such adjustment or normalization technology.

An object of the present invention is to provide a voice dialogue method and system capable of natural and accurate dialogue between a computer and a human in response to the above demand.

It is another object of the present invention to provide a voice dialogue method and system that reliably cancel the response sentence from the computer and leave only the human's conversation sentence as the object of conversation recognition by the computer.

Another object of the present invention is to provide a voice dialogue method and system that make speech easier for the computer to recognize and thus enable accurate recognition of conversation sentences.

It is still another object of the present invention to provide a voice conversation method and system that enable effective noise cancellation by a learning function.

Disclosure of the invention

The invention according to claim 1 is a voice dialogue method comprising: a step of receiving a conversation sentence from a human through a voice input device that listens to the conversation between a computer and the human; a step of interpreting the conversation sentence and creating a response sentence by the computer; a step of outputting, from a voice generating device, the response sentence from the computer to the conversation sentence from the human; a step in which the response sentence from the computer is input to the voice input device; and a step of canceling the response sentence from the computer by a voice cancellation device and removing it from the targets of conversation recognition by the computer.

The invention according to claim 2 is the voice dialogue method according to claim 1, wherein the response sentence canceling step flags the response sentence creation signal created by the computer and, after the response sentence is output from the voice generating device, cancels the voice that is input to the voice input device within a predetermined time, using the response sentence creation signal as a reference signal.

The invention according to claim 3 is the voice dialogue method according to claim 1 or 2, further comprising a step of removing noise from the voice received by the voice input device by a noise canceller device, after which the computer creates a response sentence to the conversation sentence from the human.

The invention according to claim 4 is the voice dialogue method according to claim 3, wherein the noise canceling step includes: accumulating, for a predetermined period, the noise in time periods when the voice level is so low that it is clear that neither the human nor the computer is speaking, or, in other time periods, the noise that remains after the utterance content has been canceled, and learning it; and, during the next utterance from the human, canceling and removing the learned noise from the voice signal from the voice input device.

The invention according to claim 5 is a voice dialogue system between a computer and a human, comprising: a voice input device that listens to the conversation between the computer and the human; a voice cancellation device that cancels, from the output signal of the voice input device, the response sentence issued by the computer and removes it from the targets of conversation recognition by the computer; a conversation sentence recognition / response sentence creation unit in the computer that interprets the conversation sentence from the human and creates a response sentence to it; and a voice generating device that outputs the response sentence from the computer.

The invention according to claim 6 is the voice dialogue system according to claim 5, wherein the voice cancellation device includes: means for flagging the response sentence creation signal created by the computer; a clock that measures the time from when the response sentence is output from the voice generating device until the voice input device receives a voice; and canceller means that, when that time is less than a specified time, determines that the voice received by the voice input device is the response sentence created by the computer and cancels it using the response sentence creation signal as a reference signal.
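The flag, clock, and canceller means described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the specification: the names `FlaggedResponse`, `cancel_response`, and `max_delay`, the sample format, and the 0.5-second default are all assumptions, and the reference signal is assumed to arrive perfectly aligned and unaltered by the room, which a real canceller could not rely on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlaggedResponse:
    """Response signal created by the computer, flagged for use as a reference."""
    samples: List[float]   # reference signal (hypothetical sample format)
    output_time: float     # clock reading when the speaker output began, in seconds


def cancel_response(mic: List[float], mic_time: float,
                    flagged: Optional[FlaggedResponse],
                    max_delay: float = 0.5) -> List[float]:
    """Remove the computer's own response from the microphone signal.

    If the microphone receives sound less than max_delay seconds after the
    flagged response was output, the input is judged to contain the response
    and the reference signal is subtracted sample by sample.
    """
    if flagged is None:
        return list(mic)
    delay = mic_time - flagged.output_time
    if delay < 0 or delay >= max_delay:
        # Outside the specified time: keep the input as human speech.
        return list(mic)
    ref = flagged.samples
    return [m - (ref[i] if i < len(ref) else 0.0) for i, m in enumerate(mic)]
```

For instance, with a flagged response output at time 10.0, a microphone frame received at 10.1 has the reference subtracted, while a frame received at 11.0 is passed through unchanged as a candidate human utterance.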

The invention according to claim 7 is the voice dialogue system according to claim 5 or 6, further comprising a noise canceller device that removes noise from the voice received by the voice input device, whereby the conversation sentence recognition / response sentence creation unit in the computer receives, as conversation recognition targets, only conversation sentences from the human that do not include noise.

The invention according to claim 8 is the voice dialogue system according to claim 7, wherein the noise canceller device includes noise learning / noise removing means that accumulates, for a predetermined period, the noise in time periods when the voice level is low enough to make it clear that neither the human nor the computer is speaking, or, in other time periods, the noise that remains after the utterance content has been canceled, learns it, and, during the next utterance from the human, cancels and removes the learned noise from the voice signal from the voice input device.

The invention will be described in more detail hereinafter with reference to the illustrated preferred embodiments, which are merely examples and do not limit the scope of the invention. It should be noted that various modifications and alterations can be made to the present invention without departing from the spirit of the invention described in the appended claims.
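The noise learning / noise removing means can be illustrated with a minimal sketch. It covers only the first case described above (learning from periods quiet enough that nobody is speaking), and it substitutes a simple time-domain amplitude subtraction for whatever cancellation the actual device performs. The class `NoiseLearner`, its parameters, and the threshold values are hypothetical.

```python
from typing import List


class NoiseLearner:
    """Accumulates quiet frames and learns an average noise level from them."""

    def __init__(self, quiet_threshold: float = 0.1, frames_to_learn: int = 3):
        self.quiet_threshold = quiet_threshold  # assumed "nobody is speaking" level
        self.frames_to_learn = frames_to_learn  # stands in for the predetermined period
        self._levels: List[float] = []
        self.noise_level = 0.0                  # learned noise estimate

    @staticmethod
    def _mean_abs(frame: List[float]) -> float:
        return sum(abs(s) for s in frame) / len(frame) if frame else 0.0

    def observe(self, frame: List[float]) -> None:
        """Accumulate the frame if it is quiet enough to contain only noise."""
        level = self._mean_abs(frame)
        if level < self.quiet_threshold:
            self._levels.append(level)
            recent = self._levels[-self.frames_to_learn:]
            if len(recent) == self.frames_to_learn:
                self.noise_level = sum(recent) / len(recent)

    def remove_noise(self, frame: List[float]) -> List[float]:
        """Shrink each sample toward zero by the learned noise level."""
        n = self.noise_level
        return [(1 if s > 0 else -1) * max(abs(s) - n, 0.0) for s in frame]
```

Loud frames are ignored during learning, so an utterance does not raise the noise estimate; the estimate is applied only when `remove_noise` is called on the next utterance.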

Brief description of the drawings

FIG. 1 is a flowchart of one embodiment for explaining the kind of dialogue between a computer and a human to which the voice dialogue method according to the present invention is applied.

FIG. 2 is a block diagram showing one embodiment of a system for performing any kind of dialogue with a human shown in FIG.

FIG. 3 is a flowchart of another embodiment of a dialogue between a computer and a human.

FIG. 4 shows an example of dialogue when a computer provides travel guidance to a human.

FIG. 5 is a block diagram showing an embodiment of the host computer 9 of the system for executing the dialog shown in FIG.

Figure 6 is a table showing a conventional relational database.

FIG. 7 is a flowchart showing a flow of one embodiment of a voice dialogue method between a computer and a human according to the present invention.

FIG. 8 is a flowchart showing a flow of a method for canceling a response sentence signal by the voice canceling device employed in FIG.

FIGS. 9(A) and 9(B) are, respectively, a schematic configuration diagram of an embodiment of a voice dialogue system between a computer and a human according to the present invention, and a block diagram of the configuration in a computer 30.

FIG. 10 is a schematic diagram for explaining the operation in the voice interaction system between the computer and the human shown in FIG.

Best mode for carrying out the invention

Hereinafter, a voice interaction method and system according to the present invention will be described in detail with reference to the accompanying drawings.

First, the kind of dialogue between a computer and a human to which the voice dialogue method and system according to the present invention are applied will be described briefly and concretely. As disclosed in the international application (PCT / JP00 / 06759) entitled "Topics Dialogue Method and System," the present inventors have been pursuing development to realize natural and intelligent dialogue between future computers and humans. Compared with the conventional cases described above, in which the computer provides information, or listens to a patient's symptoms and condition and fills out a sheet before the doctor's consultation, this topic dialogue method and system conducts a dialogue natural and intelligent enough to give humans the illusion that the computer has an artificial brain. The present invention therefore applies to all such dialogues, including inventions and devices for dialogue between humans and computers that will be developed in the future.

In order to gain an overview of the “topics dialogue method and system” achieved by the international application, a portion is cited below. The entire contents are, of course, described in the international application, so please refer to it.

As shown in FIG. 1, in a dialogue between a computer and a human to which the voice dialogue method according to the present invention is applied, first, a database and a program are recorded (step 1). If there is a voice input, it is decomposed into words and parsed, and the presence or absence of information items is determined (step 2). It is then determined whether the information items necessary to identify a record are included in the input speech (decision J1). If the answer is “No”, the human is asked for the required information items (step 3). If the answer to decision J1 is “Yes”, or once the information items necessary to identify the record have been collected in step 3, the program proceeds (step 4).

In the embodiment shown in FIG. 1, a computer interacts with humans using data stored in a relational database recorded in memory. Fig. 6 is a table showing the data structure of a typical conventional relational database. In the table, S1 to Sn are attributes serving as search keys, that is, schemas, and T11 to Tmn are tuples, which are the contents or values. Each row makes up one record. If the relational database is for travel, the schemas S1 to Sn may be, for example, "Destination", "Purpose", "Days", "Departure date (hour)", "People", "Price", "Airline company", "Hotel", "Room specifications", "Eating / no meal", "Option", "Passport", "Visa", "Payment method", etc. Each record holds the values for these schemas. In the records, key information such as “Hawaii”, “Phuket in Thailand”, “Helsinki in Finland”, “Kyoto”, “Aomori in Mutsu”, “Okibashi”, “English”, etc. is recorded.

The memory also records a program that specifies the dialogue sequence, which defines the order in which each schema should be made a topic in the dialogue with the human, as well as the wording used when each schema becomes a topic, its variations, and so on. The computer's CPU can call this program and carry out the dialogue according to it. In the travel example, the dialogue sequence begins with a conversational sentence such as "What's your business?"

In step 2, when a human utters a voice to the computer, the computer recognizes phonemes using a microphone, voice recognition software, and the like. The recognized phoneme string is then decomposed into words and parsed using a word dictionary, syntax dictionary, case dictionary, etc., yielding sentences such as "I want to go to Hawaii." or "I want to see the aurora."

Subsequently, in decision J1, it is determined whether a value corresponding to each schema (herein referred to as an “information item”) exists in the speech input obtained by the word decomposition and sentence analysis. Many records are stored in the relational database, and this determines which of those records the voice input calls for. That is, if the input does not include the information items required to identify the record, the user is asked in step 3 about the missing information items until all the required information items have been heard. This identifies one (or a few) records.

For example, in a relational database relating to travel, there are tens to hundreds of records whose destination is Hawaii. To select the record desired by the user from among these, further information items are required, such as "Purpose", "Days", and "Departure date". The program determines in advance which schemas are such indispensable items, and determines whether all the information items that identify a record are present in the voice input after word decomposition and sentence analysis.

If the information items required to identify the record are included in the input voice from the beginning, or once they have been collected by asking the user as described above, the computer's CPU runs the program in step 4 according to the dialogue sequence, using the information items of the record. Typically, the dialogue sequence proceeds in a predetermined order using all of the records or information items corresponding to the appropriate schemas. In the illustrated embodiment, in the missing information item asking-back process of step 3, the question is asked by putting the name of the schema into the question sentence. For example, when the human's answer "I want to go to Hawaii." lacks required information items such as "purpose", "number of days", and "departure date", the computer elicits these information items with questions such as "Please tell us the purpose of your (travel)." and "How many days are you?" It is preferable to ask the question by putting the name of the schema into the question as it is. As a result, it is possible to avoid a situation in which communication between the computer and the human breaks down. In addition, it is also possible to improve the recognition accuracy by limiting the answers from users, for example "Please select the purpose of the trip from honeymoon, marine sports, tanning, shopping, business, etc."
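Decision J1 and steps 3 and 4 can be illustrated with a minimal sketch. The records, the list of indispensable schemas, and the function names below are hypothetical stand-ins chosen for illustration; the specification does not give the actual database contents or program.

```python
from typing import Dict, List

# Hypothetical travel records: keys are schemas, values are tuples.
RECORDS: List[Dict[str, str]] = [
    {"Destination": "Hawaii", "Purpose": "honeymoon", "Days": "7"},
    {"Destination": "Hawaii", "Purpose": "marine sports", "Days": "5"},
    {"Destination": "Kyoto", "Purpose": "sightseeing", "Days": "3"},
]

# Schemas assumed to be marked in advance as indispensable items.
REQUIRED: List[str] = ["Destination", "Purpose", "Days"]


def missing_items(items: Dict[str, str]) -> List[str]:
    """Return the indispensable schemas not yet present in the input."""
    return [schema for schema in REQUIRED if schema not in items]


def ask_back(schema: str) -> str:
    """Build a question with the schema name placed directly in the sentence."""
    return f"Please tell us the {schema.lower()} of your trip."


def matching_records(items: Dict[str, str]) -> List[Dict[str, str]]:
    """Return the records whose values agree with every information item."""
    return [r for r in RECORDS if all(r.get(k) == v for k, v in items.items())]


def next_step(items: Dict[str, str]) -> str:
    """Decision J1: ask for a missing item (step 3) or proceed (step 4)."""
    missing = missing_items(items)
    if missing:
        return ask_back(missing[0])
    return f"{len(matching_records(items))} record(s) identified"
```

With only "Destination" known, two Hawaii records still match and `next_step` asks for the purpose; once all three items are supplied, a single record is identified and the dialogue sequence can proceed.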

Referring to FIG. 2, one embodiment of a system for performing the dialog shown in FIG. 1 is shown.

FIG. 2 shows an embodiment of a system for implementing the voice dialogue method via the Internet, but the system is not limited to this. For example, in a form of use in which a computer handles reception work at a company, the human and the computer interact directly without going through the Internet.

Such a topic dialogue system generally includes a voice input device 1 such as a microphone, a voice output device 3 such as a speaker or headphones, a user terminal 5, a communication line 7 such as an Internet connection, an intranet, or a LAN, and a host computer 9 for managing this system.

The voice input device 1 converts a voice uttered by a human being as a user into a digital signal that can be processed by a computer. The voice output device 3 produces sound based on a sound generation signal generated by the computer. The user terminal 5 may be any of various well-known personal computers that can connect to the Internet. The processing result at the user terminal 5 is transmitted to the host computer 9 via the communication line 7, and the processing result at the host computer 9 can be received by the user terminal 5 via the communication line 7.

The host computer 9 is provided with a memory 11 for recording various data and programs, and a CPU 13 for calling a program recorded in the memory and performing various controls. The memory 11 includes a relational database unit 11a that records a relational database consisting of a number of schemas and tuples, a dialogue sequence unit 11b that records a program defining the order in which each schema is to be made a topic, and a word recording unit 11c that records a program defining the wording used when each schema becomes a topic.

The CPU 13 includes: information item determination control means 13a, which analyzes the user's input voice by word decomposition and sentence analysis to determine whether there is an information item corresponding to each schema; required information item hearing control means 13b, which, if the information items required to identify a record in the relational database are not included in the input voice, asks the user about the missing information items until all the necessary information items have been heard; and program progress control means 13c, which, when the information items necessary to identify the record are included in the input voice, uses the information items of the record to advance the program in accordance with the dialogue sequence.

In the illustrated embodiment, the required information item hearing control means 13b asks back the missing information items by posing a question with the name of the schema placed in the question sentence.

FIG. 3 is a flowchart of another embodiment of a dialogue between a computer and a human. FIG. 5 is a block diagram showing an embodiment of the host computer 9 of the system for performing the dialogue as shown in FIG.

The dialogue shown in Fig. 3 differs from the dialogue shown in Fig. 1 in that the computer can identify a scene (topic) from the human's speech and conduct the dialogue accordingly, and in that a small-scene dialogue that digs a little further into a topic can be inserted during the conversation.

In the dialogue shown in Fig. 3, first, databases and a program are recorded (step 11), and then the human is asked for an index for identifying a scene (step 12). When the relational database is specified, it is recorded in the cache memory (step 13). If there is a voice input from the user, it is decomposed into words and parsed to determine the presence or absence of information items (step 14), and it is determined whether the information items necessary to identify a record are included in the input voice (decision J11). If decision J11 is “No”, the human is asked for the required information items (step 15). If decision J11 is “Yes”, or once the information items necessary to identify the record have been obtained in step 15, the program proceeds (step 16). When a predetermined schema and/or tuple becomes a topic, the small-scene subroutine is entered (step 17), and when the small-scene subroutine is completed, the program returns to the original dialogue sequence and the remainder of the program is executed (step 18).

In step 11, a plurality of relational databases are recorded and stored in the memory of the computer, and each is given an index by which the scene (topic) it handles can be distinguished from the others. The computer's memory also stores relational databases that define small scenes associated with a given schema and/or tuple. A relational database that defines a small scene consists of a structural example (corresponding to a schema) made up of multiple items and a content example (corresponding to a tuple), which is the contents of the structural example. The memory of the computer also records a program that defines a dialogue sequence and an item sequence, which are the orders in which each schema and each item is made a topic, and the wording used when each schema and item becomes a topic.

Fig. 4 shows an example of dialogue when a computer provides travel guidance to a human.

In step 12, the computer first utters a question such as “Please state your business.”, which asks the human for an index identifying the scene the conversation is about. When the human responds, for example, "Looking for a summer vacation destination", the scene of "travel information" is specified according to the input index words such as "travel destination" and "search". On the other hand, if the user's response contains negative words such as "bad" (underlined in the original text), the scene of “complaint” is specified. As described above, the present embodiment is characterized in that, when a human utters something belonging to one of a plurality of scenes (topics), the scene (topic) can be identified by finding a word serving as an index included in the utterance. That is, one specific scene can be selected from a large number of scenes by finding a word serving as a predetermined index.
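Scene identification by index words can be sketched as a simple keyword lookup. The index words below are hypothetical examples (only "bad" appears in the text above), and checking the scenes in dictionary order is an assumption; the specification does not state how competing matches are resolved.

```python
import re
from typing import Dict, List, Optional

# Hypothetical index words registered in advance for each scene (topic).
SCENE_INDEX: Dict[str, List[str]] = {
    "complaint": ["bad", "terrible", "broken"],
    "travel information": ["destination", "vacation", "trip"],
}


def identify_scene(utterance: str) -> Optional[str]:
    """Return the first scene whose index words appear in the utterance."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    for scene, index_words in SCENE_INDEX.items():
        if any(word in words for word in index_words):
            return scene
    return None  # no index word found: the scene cannot be specified yet
```

An utterance such as "Looking for a summer vacation destination" selects the travel information scene, while an utterance with no registered index word leaves the scene unspecified, in which case the computer would need to ask again.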

In step 13, the CPU calls the relational database of the scene (topic) of the travel guide specified by the index from the memory and records it in the cache memory so that it can be rewritten.

The schemas of the travel scene are, for example, “Destination”, “Purpose”, “Days”, “Departure (time)”, “Number of people”, “Breakdown of companions”, “Budget”, “Designation of airline”, “Designation of hotel”, “room specifications”, “meal availability”, “option”, “passport”, “necessity of visa”, “payment method”, etc. Therefore, in the essential information item presence / absence determining step (step 14), the computer determines whether all the essential information items for identifying the record are present. Then, in a counseling mode, which searches for the target record through a series of questions and answers, a dialogue for eliciting the missing information items is started. In this embodiment, the computer asks "Where do you want to go?" and the user answers "Is it UK or America?" From the answer “Is it UK or America?”, the computer detects that the user has not decided on the destination, and shifts to an advice mode in order to settle this first. Based on the finding that, in dialogue between humans, the counseling mode and the advice mode appear alternately as the dialogue develops, this has been applied for the first time to dialogue between a computer and a human. In the dialogue in Fig. 3, after asking a question about the purpose of the trip, the computer uses the user's answer, “family trip”, as an information item to make a recommendation.

The exchange from "C: What about Orlando, Florida?" to "G: Let's do that package tour" constitutes one subroutine. After the package tour has been selected, the dialogue returns to the original travel guidance scene and continues.

In addition, once the user selects the package tour, “Destination”, “Days”, “Cost”, “airline”, “hotel”, and “meal availability” are determined, and only the remaining schemas are asked about in the subsequent dialogue. In this embodiment, “departure date” and “number of people” have not been confirmed. Therefore, in step 15, the computer asks "When is the departure date?" and obtains the answer "I will do it on July 18th."

Next, in the present embodiment, in step 17, a transition is made to the small scene of the “bed-sharing child rate” using “family” as a keyword. Specifically, after asking about the composition and ages of the family, the computer explains the bed-sharing child rate and asks whether it applies. If the answer in the sixth line of the dialogue shown in Fig. 4 were, for example, "honeymoon", it would be possible to shift to a small scene using "honeymoon" as a keyword. In that case, a variety of topics can be taken up, such as pick-up and drop-off between the airport and the hotel by limousine, a special dinner in a private room, and a room at the front desk of the wedding reception.

If the information items required to identify the record are included in the input voice, the program proceeds according to the dialogue sequence using the information items of the record. Normally, it proceeds by confirming the entered schemas with the user in order. In the present embodiment, the user has indicated that he will go on a 10-night, 10-day trip to Orlando, Florida. However, in the event of a dispute at a later date, it may be difficult to determine whether the user's intention was confirmed after all the conditions of this package tour had been checked. Therefore, it is preferable to confirm explicitly the items that have been settled, such as "Destination", "Days", and "Cost". For example, it is confirmed that the "destination" is Orlando (including a day bus trip to NASA) and Miami, with stays of 6 nights and of 2 nights and 3 days, respectively.

The feature of this mode is that, as described above, topics that come up during the dialogue are taken up, and the dialogue can be developed by digging deeply into those topics or by referring to other variations. That is, when a predetermined schema and/or tuple becomes a topic, a relational database that defines a small scene is called, and the program proceeds as a subroutine according to the item sequence. The above example is characterized in that, when the user mentions the “purpose” of a trip, control is performed so that the topic shifts to a small scene regarding “family trip”.

In the small scene of "family trip", a dialogue with the human is conducted using a relational database consisting of a structural example made up of a plurality of items and a content example that holds the contents of the structural example. The order in which the items appear in the dialogue is determined by the item sequence. The items of the family trip include "breakdown of family", "sex of child", "age of child", "whether or not to pay for a bed-sharing child", and "number of people". In the example above, the computer asks "Please tell us about your family." and "Please tell me the gender and age of your child." and, after explaining the charge for a child sharing a bed, "What would you like to do?"

When the item sequence has been completed for all or predetermined items of the structural example in the relational database of the small scene, the program returns to the dialogue sequence at step 18 and proceeds with the rest of the program.

The illustrated preferred embodiment is characterized in that the structural example defining the small scene is a past dialogue example. A small scene composed of a plurality of pieces of content information collected as described above can be recorded in the memory as a dialogue example. Then, the next time the same small scene becomes a topic, control is performed so that the dialogue proceeds with the item sequence based on that dialogue example. By accumulating dialogue examples, the computer can become skilled in how to ask questions, how to guide the user to a predetermined tour, and so on.
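The small-scene subroutine and the recording of its result as a dialogue example can be illustrated with a minimal Python sketch. The function names, the `ask` callback, the generic wording "Please tell me the ...", and the in-memory store `dialogue_examples` are all hypothetical and are not taken from the specification. How a stored example steers the next dialogue is not modeled here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical store of past dialogue examples, keyed by small-scene name.
dialogue_examples: Dict[str, Dict[str, str]] = {}

# A small scene: (name, item sequence, wording used for each item).
SmallScene = Tuple[str, List[str], Dict[str, str]]


def run_small_scene(name: str, item_sequence: List[str],
                    wording: Dict[str, str],
                    ask: Callable[[str], str]) -> Dict[str, str]:
    """Run the item sequence of one small scene as a subroutine."""
    content: Dict[str, str] = {}
    for item in item_sequence:          # order fixed by the item sequence
        content[item] = ask(wording[item])
    dialogue_examples[name] = content   # record the result as a dialogue example
    return content


def run_dialogue(schemas: List[str],
                 small_scenes: Dict[Tuple[str, str], SmallScene],
                 ask: Callable[[str], str]) -> Dict[str, object]:
    """Run the dialogue sequence, branching into a small scene when triggered."""
    record: Dict[str, object] = {}
    for schema in schemas:              # order fixed by the dialogue sequence
        answer = ask(f"Please tell me the {schema}.")
        record[schema] = answer
        scene = small_scenes.get((schema, answer))
        if scene is not None:
            name, item_sequence, wording = scene
            record[name] = run_small_scene(name, item_sequence, wording, ask)
        # Control returns here and the dialogue sequence continues.
    return record
```

In this sketch a small scene is triggered by a (schema, answer) pair, which is one possible reading of "a predetermined schema and/or tuple becomes a topic".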

Small scenes can be constructed in a hierarchy of unlimited depth. In other words, one small scene can have a lower sub-scene, and that sub-scene can in turn have a lower sub-scene. As a result, the variety of conversations between the computer and humans expands without limit, and the monotony peculiar to interacting with computers, for which conventional technology that simply handles prepared conversations has been criticized, can be completely dispelled.
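The hierarchy of small scenes can be pictured as a tree that is walked recursively. The following Python sketch is illustrative only: the `Scene` class and the `item_order` function are hypothetical names, and attaching a sub-scene to an individual item is an assumption about how the nesting is triggered.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Scene:
    """A small scene whose items may each open a lower sub-scene."""
    name: str
    items: List[str]
    sub_scenes: Dict[str, "Scene"] = field(default_factory=dict)


def item_order(scene: Scene, depth: int = 0) -> List[Tuple[int, str]]:
    """Return (depth, item) pairs in the order the items would become topics."""
    order: List[Tuple[int, str]] = []
    for item in scene.items:
        order.append((depth, item))
        child = scene.sub_scenes.get(item)
        if child is not None:
            # Descend into the sub-scene, then resume the parent's sequence.
            order.extend(item_order(child, depth + 1))
    return order
```

Because `item_order` calls itself for each sub-scene, the depth of nesting is limited only by the data, not by the code.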

Next, with reference to FIG. 5, a system for implementing the voice interaction method shown in FIG. 3 will be described in detail.

The system of the present embodiment is the same as the topic dialogue system shown in FIG. 2 except for the configuration of the host computer 9, so only the differing configuration will be described. In the description, the same reference numerals as those in FIG. 2 are used for the same components as those in FIG. 2. In the present embodiment, the memory 11 of the host computer 9 contains a relational database section 11a, a section 11b that records a program defining the order in which each schema is to be made a topic, and a wording recording unit 11c that records a program defining the wording used when each schema becomes a topic. The relational database section 11a is divided into sections 11aa to 11an so that relational databases for a large number of different scenes (topics) can be stored. In the illustrated preferred embodiment, each relational database is registered in advance with one or more preselected words that serve as an index distinguishing it from the others. By finding such a word, one relational database is specified. Each relational database is the same as that in FIG. 2 in that it consists of schemas and tuples.
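Selecting one relational database by its index words can be sketched as a simple word lookup. The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, the section labels are used only as dictionary keys, index words are single words, and matching is case-insensitive after stripping punctuation.

```python
from typing import Dict, List, Optional


def build_index(sections: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each registered index word to the database section it identifies."""
    index: Dict[str, str] = {}
    for section, words in sections.items():
        for word in words:
            index[word.lower()] = section
    return index


def find_section(utterance: str, index: Dict[str, str]) -> Optional[str]:
    """Return the section whose index word first appears in the utterance."""
    for token in utterance.lower().split():
        token = token.strip(".,!?\"'")
        if token in index:
            return index[token]
    return None  # no index word found; the computer would have to ask
```

A result of `None` corresponds to the case in which the computer has to ask the human for an index before a scene can be specified.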

The memory 11 is also provided with one or more relational databases 11d that define small scenes associated with a given schema and/or tuple. If there are multiple relational databases that define small scenes, the relational database section 11d is, like the relational database section 11a, divided into sections up to 11dm. A relational database that defines a small scene likewise consists of a structural example (equivalent to a schema) made up of a plurality of items and a content example (equivalent to a tuple) that holds the contents of the structural example. For each relational database that defines a small scene, it is also necessary to determine in advance which schema and/or tuple of the relational databases that define scenes, when it appears, causes the transfer to that small scene.

The memory 11 is further provided with an item sequence section 11e that records the item sequence, which is the order in which each item is made a topic, and a wording recording unit 11f that records a program defining the wording used when each item becomes a topic. The host computer 9 further includes, in addition to the memory 11, a cache memory 15 for calling the relational database specified by the index and recording it in a rewritable manner.

As in the system of FIG. 2, the CPU 13 is provided with information item presence/absence control means 13a, essential information item hearing control means 13b, and program progress control means 13c. The CPU 13 further includes index query control means 13d for having the computer ask the human for an index that specifies which scene the dialogue concerns, and cache memory recording control means 13e for specifying the scene according to the input index, calling the specified relational database from the memory, and rewritably recording it in the cache memory. The CPU 13 also includes subroutine progress control means 13f which, when a predetermined schema and/or tuple becomes a topic, calls the relational database defining a small scene and runs the program as a subroutine according to the item sequence, and return sequence control means 13g which, when the item sequence is completed, returns to the dialogue sequence and proceeds with the remaining program.

In the illustrated preferred embodiment, the essential information item hearing control means 13b asks a question by inserting the name of the schema into a hearing sentence, thereby asking for the missing essential information item.

Next, a detailed description will be given of a speech dialogue method and system according to the present invention, which is applied to such a dialogue between a computer and a human to enable the computer to correctly recognize a conversation uttered by a human.

As shown in FIG. 7, the method of voice dialogue between a computer and a human according to the present invention generally includes a step of receiving a conversation sentence from a human by a voice input device such as a microphone (step 21), a step in which the computer creates a response sentence in accordance with a program for performing the conversation (step 22), a step in which the response sentence is output from a voice generator such as a speaker (step 23), a step of receiving the response sentence output from the voice generator by the voice input device such as a microphone (step 24), and a step of canceling the response sentence signal by the voice canceller and removing it from the conversation recognition target of the computer (step 25).

Various methods can be used for canceling the response sentence signal by the voice canceller. In the illustrated preferred embodiment, a flag is set on the response sentence creation signal (step 31), and, after the response sentence is output from the voice generator, the voice input to the voice input device after a predetermined time is canceled using the response sentence creation signal as a reference signal (step 32).
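Cancellation with a reference signal can be illustrated by subtracting a delayed copy of the response sentence signal from the microphone signal. The Python sketch below is only an idealized illustration: the function name and parameters are hypothetical, the delay is assumed to be a known whole number of samples, and the acoustic path is assumed to be a single gain. The specification does not state how the canceller is implemented, and a practical canceller would typically have to estimate the path adaptively.

```python
from typing import List, Sequence


def cancel_response(mic: Sequence[float], reference: Sequence[float],
                    delay: int, flagged: bool,
                    gain: float = 1.0) -> List[float]:
    """Remove the computer's own response sentence from the microphone signal.

    mic       -- samples picked up by the voice input device
    reference -- samples of the response sentence creation signal
    delay     -- assumed samples between output and pickup (the predetermined time)
    flagged   -- True when the flag on the response sentence creation signal is set
    gain      -- assumed attenuation of the speaker-to-microphone path
    """
    out = list(mic)
    if not flagged:
        return out  # no response sentence was output, nothing to cancel
    for i, sample in enumerate(reference):
        j = i + delay
        if 0 <= j < len(out):
            out[j] -= gain * sample  # subtract the delayed reference signal
    return out
```

Whatever remains after the subtraction, for example a human utterance that overlaps the response sentence, is passed on as the conversation recognition target.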

Furthermore, in preparing a response sentence in accordance with a program for performing a conversation, the input signal is preferably a pure signal with little noise. Therefore, it is preferable to interpose a step (step 26) for removing noise from the audio signal received by the audio input device by the noise canceller device between step 21 and step 22. By such noise cancellation, noise is removed from a voice signal input from a voice input device such as a microphone, so that only a voice signal corresponding to a conversation sentence from a human remains.

In the noise canceling step, noise is learned by accumulating it for a predetermined time: either the noise in a time period when the voice level is so low that clearly neither a human nor the computer is speaking or, in other time periods, the noise that remains after the utterance has been canceled. Then, during the next utterance from a human, the learned noise is canceled and removed from the voice signal from the voice input device. As a result, even when the volume of the noise increases and its difference from the volume of the conversation sentence from a human decreases, the noise can be reliably eliminated. Normally, the noise collection time required for noise cancellation is about 3 seconds. However, noise can also be collected while the response sentence from the computer is being output from the speaker (the response sentence signal is canceled by the voice canceller, so only the noise is collected). Noise collection can therefore be completed in the time after one human utterance and before the next. As a result, noise that changes from moment to moment can be effectively canceled, and only conversation sentence signals from humans are input to the speech recognition device in a clear state.
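One way to picture "learning" noise and then removing it from the next utterance is spectral subtraction. This technique is substituted here purely for illustration, since the specification does not name an algorithm. In the Python sketch below, all function names are hypothetical, a naive discrete Fourier transform is used so that the example needs only the standard library, and frames are assumed to have equal length.

```python
import cmath
from typing import List, Sequence


def dft(frame: Sequence[float]) -> List[complex]:
    """Naive discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse transform, keeping the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def learn_noise(noise_frames: Sequence[Sequence[float]]) -> List[float]:
    """Average magnitude spectrum of frames collected while nobody speaks."""
    n = len(noise_frames[0])
    totals = [0.0] * n
    for frame in noise_frames:
        for k, value in enumerate(dft(frame)):
            totals[k] += abs(value)
    return [total / len(noise_frames) for total in totals]


def remove_noise(frame: Sequence[float],
                 noise_profile: Sequence[float]) -> List[float]:
    """Subtract the learned noise magnitude from each bin, keeping the phase."""
    cleaned = []
    for value, noise_mag in zip(dft(frame), noise_profile):
        magnitude = max(abs(value) - noise_mag, 0.0)  # never below zero
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(cleaned)
```

In such a scheme `learn_noise` would be fed frames from the quiet periods, or from periods in which the response sentence has already been canceled, and `remove_noise` would be applied to frames of the next human utterance.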

Next, a voice dialogue system between a computer and a human according to the present invention will be described in detail with reference to FIG.

As shown in the figure, the voice dialogue system according to the present invention includes a keyboardless computer 30 having a microphone 31 for listening to the conversation between the computer and a human and a speaker 32 for outputting the response sentence created by the computer.

The computer 30 includes a speech recognition device that recognizes the phonemes of the voice input from the microphone 31, performs word and sentence analysis, and recognizes the input as a conversation sentence; a voice canceling device 34 that cancels, from the voice signal from the microphone 31, the portion corresponding to the response sentence created by the computer and removes it from the conversation recognition target of the computer; and a conversation sentence recognition / response sentence creation unit 35 that interprets these conversation sentences and creates a response sentence to them. These devices can be realized by control means that cause the CPU 30b to call various programs stored in the memory 30a and to process various signals.

In the preferred embodiment shown in the figure, the voice canceling device 34 includes means for flagging the response sentence creation signal created by the computer, a clock 34b that measures the time from when the response sentence is output from the voice generating device until the voice input device receives a voice, and canceller means 34c which, if that time is within a specified time, determines that the voice received by the voice input device is the response sentence created by the computer and cancels it using the response sentence creation signal as a reference signal.
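The decision made by the clock and the canceller means can be sketched as a simple timing check. In the Python below, the class and function names, the use of seconds as the time unit, and the example window value are hypothetical. Only the rule itself, that a voice received within the specified time after output is treated as the computer's own response sentence, comes from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseFlag:
    """Flag set on the response sentence creation signal."""
    output_time: Optional[float] = None  # None means no response is pending

    def set(self, now: float) -> None:
        """Record the moment the response sentence is output."""
        self.output_time = now

    def clear(self) -> None:
        self.output_time = None


def is_own_response(flag: ResponseFlag, received_time: float,
                    window: float) -> bool:
    """Decide whether a received voice is the computer's own response.

    window is the specified time within which the microphone is expected
    to pick up the speaker output (an assumed, tunable value).
    """
    if flag.output_time is None:
        return False
    elapsed = received_time - flag.output_time  # what the clock measures
    return 0.0 <= elapsed <= window
```

For example, with `window=0.05`, a voice received 0.02 seconds after output would be treated as the computer's own response and canceled, while a voice received 0.5 seconds later would be passed on for recognition.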

As the noise canceller device 37, various types of devices, including conventionally known ones, can be adopted, so that the conversation sentence recognition / response sentence creation unit 35 receives only noise-free conversation sentences from humans as conversation recognition targets. In the illustrated preferred embodiment, the noise canceller device 37 learns by accumulating, for a predetermined time, the noise in a time period when the sound level is so low that neither a human nor the computer is speaking or, in other time periods, the noise remaining after the utterance has been canceled. Then, during the next utterance from a human, the learned noise is canceled and removed from the voice signal from the voice input device.

With this type of voice dialogue between the computer and a human, the computer can correctly recognize the human utterance even if the human interrupts while a response sentence is being output from the computer, and can perform word analysis and sentence analysis to create a response sentence. The conventional method has the disadvantage that, if a human speaks before the computer's output has finished, the computer cannot recognize the voice, or the program becomes confused and the conversation becomes impossible.

Claims

The scope of the claims

1. A method of spoken dialogue between a computer and a human,

A step of receiving a conversation sentence from a human from a voice input device that listens to a conversation between the computer and the human;

A step of interpreting the conversation sentence and creating a response sentence by the computer;

A step of outputting the response sentence from the computer to the conversation sentence from the human from a voice generating device;

A step of inputting the response sentence from the computer by the voice input device; and

A step of canceling the response sentence from the computer by a voice cancellation device and removing the response sentence from the conversation recognition target of the computer.

A voice interaction method comprising:

2. The voice interaction method according to claim 1, wherein the response sentence canceling step sets a flag on the response sentence creation signal created by the computer and, after the response sentence is output from the voice generating device, cancels a voice input to the voice input device after a predetermined time using the response sentence creation signal as a reference signal.

3. The voice interaction method according to claim 1 or 2, further comprising a step of removing noise from the voice received by the voice input device by a noise canceller device, after which the computer creates the response sentence to the conversation sentence from the human.

4. The voice interaction method according to claim 3, wherein the noise canceling step learns by accumulating, for a predetermined time, the noise in a time period in which the voice level is so low that it is clear that neither a human nor the computer is speaking or, in other time periods, the noise from which the uttered content has been canceled, and cancels and removes the learned noise from the voice signal from the voice input device during the next utterance from a human.

5. A speech dialogue system between a computer and a human,

A voice input device for listening to a conversation between a computer and a human,

A voice cancellation device that cancels, among the output signals from the voice input device, the one corresponding to the response sentence from the computer and removes it from the speech recognition target of the computer,

A conversation sentence recognition / response sentence creation unit in the computer that interprets the conversation sentence from the human and creates a response sentence to it, and

A voice generator for outputting the response sentence from the computer,

A voice dialogue system comprising:

6. The voice dialogue system according to claim 5, wherein the voice canceling device comprises means for flagging the response sentence creation signal created by the computer, a clock for measuring the time from when the response sentence is output from the voice generating device until the voice input device receives a voice, and canceller means which, when that time is within a predetermined time, determines that the voice received by the voice input device is the response sentence created by the computer and cancels it using the response sentence creation signal as a reference signal.

7. The voice dialogue system according to claim 5, further comprising a noise canceller device for removing noise from the voice received by the voice input device, whereby the conversation sentence recognition / response sentence creation unit in the computer receives only noise-free conversation sentences from humans as conversation recognition targets.

8. The voice dialogue system according to claim 7, wherein the noise canceller device comprises learning-type noise removing means that learns by accumulating, for a predetermined time, the noise in a time period when the voice level is so low that it is clear that neither a human nor the computer is speaking or, in other time periods, the noise from which the utterance has been canceled, and that cancels and removes the learned noise from the voice signal from the voice input device during the next utterance from a human.