KR102695585B1

KR102695585B1 - Interactive AI systems and server that enable emotional communication

Info

Publication number: KR102695585B1
Application number: KR1020230132075A
Authority: KR
Inventors: 이정용; 유연임
Original assignee: 이정용; 유연임
Priority date: 2023-10-04
Filing date: 2023-10-04
Publication date: 2024-08-16
Anticipated expiration: 2043-10-04

Abstract

The present invention relates to a conversational artificial intelligence system comprising: a voice collection application stored in a medium, which is installed on a mobile terminal of a specific person, runs as a background application for a preset period of time, and while running records the voice of the specific person and transmits the voice data to an external server; a machine learning unit configured in the server, in which software of a large language model (LLM) is stored, and which receives the voice data, constructs an emotion-sentence dataset from the language the specific person uses in each emotional state, and fine-tunes the large language model with the emotion-sentence dataset; and a speech processing unit configured in the server, which receives a response request for question data input by a user through the user's terminal and, using the large language model fine-tuned by the machine learning unit, generates natural language that simulates the specific person's response to the question data. The system thereby provides a conversation service that enables emotional communication by simulating the natural language of the specific person with whom the user wishes to converse.

Description

{INTERACTIVE AI SYSTEMS AND SERVER THAT ENABLE EMOTIONAL COMMUNICATION}

The present invention relates to a personalized voice/text chatbot, and more particularly to an artificial intelligence system and application that provide a conversational service enabling emotional communication by simulating the voice and manner of speech of a specific person with whom an individual wishes to converse, so that realistic everyday conversation is possible.

A voice chatbot is a voice service that recognizes the user's voice, interprets the sentences, and outputs a response in the tone of voice a real human would use. Several global companies currently offer voice chatbot services as APIs for special-purpose fields such as customer inquiries, product guidance, automatic responses, and translation, and they output relatively finely processed voices so that the conversation does not feel unnatural.

However, conventional voice chatbots, although refined enough that conversation does not feel unnatural, are used only for specific purposes and therefore do little to stimulate content that businesses and individuals can use freely. In particular, conventional voice chatbots have a fixed voice and tone, so they lack diversity, and they have not reached the stage where the chatbot itself conveys distinctive nuances or emotions.

Recently, services that use the voices of specific celebrities have begun to appear. As one example of the prior art, Korean Patent Publication No. 10-2020-0016516 discloses a personalized virtual voice synthesis device and method. Prior art that provides conversational services in a specific person's voice classifies voice data collected from a voice-providing terminal by type using a machine learning algorithm, extracts features of the voice, and learns from them to generate a virtual voice. However, it remains difficult to synthesize a specific person's voice virtually and reproduce everyday conversation naturally.

The reason is as follows.

Virtually reproducing a specific person's conversational language in that person's voice requires approximately 3,000 hours of recording. To reproduce the voice precisely, up to 3,000 hours of that voice must be collected as data, and STT (Speech-To-Text) and TTS (Text-To-Speech) technology must be used to apply the voice to generated sentences for output. In conventional practice, only about 10 to 15 minutes of a specific person's voice is recorded for STT. Because a large amount of recording time could not be secured from the specific person, conventional voice synthesis copied the voice by processing roughly 10 to 15 minutes of recorded voice data with a Voice Cloning algorithm. This produces unnatural, machine-like speech. To produce a voice that feels like talking with the actual person rather than with a machine, the output must reflect the person's emotion-based word choices, distinctive nuances, and way of speaking. The current limit on natural language processing that reproduces a specific person's voice stems largely from the enormous amount of voice data that must be collected from that person. Typically, about 3,000 hours of voice data must be collected to obtain about 1,000 hours of valid voice data after noise removal, and only by learning from this amount of data can the person's nuances, emotions, and speaking style be reproduced precisely across various topics.

A voice chatbot that reproduces a specific person's voice greatly increases the usability of content, as does a text chatbot that reproduces a specific person's writing style. For example, consider a user who wants to practice conversation with the opposite sex, a user who wants to find comfort in talking with a favorite celebrity, or a user who needs business conversation and mentoring from an accomplished entrepreneur. A voice/text chatbot that offers conversation with such a specific person can learn that person's way of thinking and style and thereby meet a variety of consumer needs.

Now that Google's large language model (LLM) has been built, artificial intelligence learning technology has reached the point where a specific person's voice can be reproduced. The real problem lies in collecting enough voice data to learn that person's emotions and conversational style. This requires at least 1,000 hours of valid voice data and therefore a long recording process. In practice, it is difficult to ask a specific person for such burdensome voice collection, and even a person who consents may have private matters or vulgar language exposed over the long recording process, which makes cooperation even harder to expect.

Korean Patent Publication No. 10-2020-0016516

An object of the present invention is to provide a conversational artificial intelligence system and application that secure a large amount of a specific person's actual voice data, closely simulate the conversational style and nuances that reflect the person's knowledge and ideas, and reproduce the feeling of a real conversation with that person, thereby enabling emotional communication.

In one embodiment, the voice collection application may be set to run as a background application for at least 1,000 hours, and permission to access the apps installed on the mobile terminal that use voice may be set as a condition for running.

In one embodiment, the voice collection application may apply, as its voice recognition algorithm, an Active Recording algorithm that starts recording only when a voice is detected and stops recording when no voice is detected.
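The Active Recording behavior described above can be illustrated with a minimal sketch. The patent does not specify how voice is detected; the sketch below assumes a simple energy threshold over fixed-size frames, and the names `rms`, `active_segments`, `threshold`, and `hangover` are hypothetical choices made for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of audio samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def active_segments(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.1,   # assumed energy level that counts as "voice"
    hangover: int = 2,        # assumed number of quiet frames before stopping
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, that would be recorded.

    Recording starts at the first frame whose energy reaches the threshold
    and stops once `hangover` consecutive quiet frames have been seen.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    quiet = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            if start is None:
                start = i          # voice detected: start recording
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover:  # sustained silence: stop recording
                segments.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:          # input ended while still recording
        segments.append((start, len(frames) - quiet))
    return segments
```

A production recorder would more likely rely on a trained voice activity detector; the energy gate here only shows the start/stop logic.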

In one embodiment, the voice collection application may include a recording setting function that displays an input window in which the specific person enters keywords, and that excludes from the recording any spoken sentence containing a keyword entered by the specific person.
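As a rough illustration of the recording setting function, the following sketch filters out sentences that contain any user-entered keyword. It assumes the sentences have already been transcribed to text; the function name `exclude_sentences` and the case-insensitive substring match are assumptions, not details given in the patent.

```python
from typing import Iterable, List


def exclude_sentences(sentences: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Keep only the sentences that contain none of the blocked keywords.

    Matching is a case-insensitive substring test (an assumed policy).
    Blank keywords are ignored so an empty input field blocks nothing.
    """
    blocked = [k.strip().lower() for k in keywords if k.strip()]
    return [
        s for s in sentences
        if not any(k in s.lower() for k in blocked)
    ]
```

For example, with the keywords "password" and "bank", a sentence such as "Call the bank tomorrow" would be dropped while "Let's have lunch" would be kept.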

In one embodiment, the voice collection application includes a recording script function that applies an STT (Speech To Text) algorithm to convert the specific person's voice into text and outputs the voice data recorded over a predetermined period as text, and through the recording script function the specific person can selectively delete voice data.

In one embodiment, the voice collection application may include a data management function that stores the voice data in the internal storage of the mobile terminal on which it is installed, transmits the stored voice data to the server at a preset interval, and deletes the transmitted voice data from the internal storage.

In one embodiment, before the data management function transmits voice data to the server, the voice collection application may run the recording script function so that the specific person can review the recorded voice data as text and selectively delete voice data.
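The review-then-transmit flow of the recording script and data management functions might be sketched as follows. This is a minimal in-memory model under stated assumptions: `Clip`, `LocalStore`, and `review_and_upload` are hypothetical names, the `upload` callback stands in for whatever transport the real application uses, and the patent does not say what happens when a transfer fails (here the clip is simply kept for the next cycle).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Clip:
    clip_id: str
    transcript: str   # text shown to the person by the recording script step
    audio: bytes


@dataclass
class LocalStore:
    """Stand-in for the terminal's internal storage."""
    clips: Dict[str, Clip] = field(default_factory=dict)

    def add(self, clip: Clip) -> None:
        self.clips[clip.clip_id] = clip

    def delete(self, clip_id: str) -> None:
        self.clips.pop(clip_id, None)


def review_and_upload(
    store: LocalStore,
    rejected_ids: Iterable[str],
    upload: Callable[[Clip], bool],
) -> List[str]:
    """Run one transmission cycle and return the ids that were uploaded.

    1. Clips the person rejected after reading the transcript are deleted.
    2. Remaining clips are sent with `upload`.
    3. Each successfully sent clip is removed from local storage.
    """
    for clip_id in rejected_ids:
        store.delete(clip_id)

    uploaded: List[str] = []
    for clip_id, clip in list(store.clips.items()):
        if upload(clip):           # assumed: True means the server accepted it
            uploaded.append(clip_id)
            store.delete(clip_id)
    return uploaded
```

Calling `review_and_upload` on a timer would correspond to the preset transmission interval; the scheduling itself is left out of the sketch.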

In one embodiment, the machine learning unit may include: a first-stage supervised learning module that receives a training dataset in which the emotion-sentence dataset has been preprocessed into pairs of question A and answer B, and learns to take question A as input and produce answer B as output; and a second-stage reinforcement learning module that performs reinforcement learning by composing various answers to question A, including answer B, and assigning weights to those answers, so that the large language model is fine-tuned to the language characteristics of the specific person.
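The patent describes the second stage only as assigning weights to several candidate answers. As a toy illustration of that idea, and not of how an LLM is actually fine-tuned with reinforcement learning, the sketch below keeps one weight per candidate answer and nudges the chosen answer's weight toward a reward signal. The names `update_weights` and `best_answer`, the learning rate `lr`, and the reward scale are all assumptions.

```python
from typing import Dict


def update_weights(
    weights: Dict[str, float],
    chosen: str,
    reward: float,
    lr: float = 0.5,
) -> Dict[str, float]:
    """Move the chosen answer's weight toward the reward, then renormalize.

    `weights` maps each candidate answer to a non-negative weight.
    The returned weights sum to 1.
    """
    updated = dict(weights)
    moved = updated[chosen] + lr * (reward - updated[chosen])
    updated[chosen] = max(moved, 1e-9)   # keep weights strictly positive
    total = sum(updated.values())
    return {answer: w / total for answer, w in updated.items()}


def best_answer(weights: Dict[str, float]) -> str:
    """Return the candidate answer with the highest weight."""
    return max(weights, key=weights.get)
```

Starting from equal weights for two candidates, rewarding answer B with 1.0 shifts the weights to roughly 0.6 for B and 0.4 for the other, so B would then be preferred.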

In one embodiment, the machine learning unit may further include a preprocessing module that builds the emotion-sentence dataset by classifying and labeling the emotion of each sentence using acoustic characteristics of the received voice data, including the frequency, wavelength, volume, or speed of the voice.
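To make the labeling step concrete, here is a deliberately simple rule-based sketch. The patent names the acoustic features but not the classifier, the emotion categories, or any thresholds, so the labels ("excited", "subdued", "neutral"), the ratios, and the names `Utterance`, `label_emotion`, and `build_emotion_sentence_dataset` are all assumptions; a real system would more plausibly use a trained speech emotion model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    text: str
    mean_pitch_hz: float      # stands in for "frequency"
    mean_volume_db: float     # stands in for "volume"
    syllables_per_sec: float  # stands in for "speed"


def label_emotion(u: Utterance, base_pitch_hz: float,
                  base_volume_db: float, base_rate: float) -> str:
    """Label one utterance by comparing it with the speaker's own baseline."""
    pitch_up = u.mean_pitch_hz > 1.15 * base_pitch_hz
    loud = u.mean_volume_db > base_volume_db + 6.0
    fast = u.syllables_per_sec > 1.2 * base_rate
    slow = u.syllables_per_sec < 0.8 * base_rate
    quiet = u.mean_volume_db < base_volume_db - 6.0

    if pitch_up and (loud or fast):
        return "excited"
    if slow and quiet:
        return "subdued"
    return "neutral"


def build_emotion_sentence_dataset(
    utterances: List[Utterance],
    base_pitch_hz: float,
    base_volume_db: float,
    base_rate: float,
) -> List[Dict[str, str]]:
    """Return one {"emotion", "sentence"} record per utterance."""
    return [
        {
            "emotion": label_emotion(u, base_pitch_hz, base_volume_db, base_rate),
            "sentence": u.text,
        }
        for u in utterances
    ]
```

Measuring each utterance against the speaker's own baseline, rather than against fixed absolute values, is a design choice of this sketch that keeps the rules from depending on how high or loud a particular person normally speaks.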

In one embodiment, the preprocessing module may build the training dataset by classifying the emotion-sentence dataset into answers B to questions A, grouped by emotion.
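One way to picture the question A / answer B preprocessing is shown below. The sketch assumes the transcript is available as an ordered list of turns with a speaker tag and an emotion label on each turn, which the patent does not state; `Turn`, `build_pairs`, and `group_by_emotion` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    speaker: str
    text: str
    emotion: str = "neutral"


def build_pairs(turns: List[Turn], target: str) -> List[Dict[str, str]]:
    """Pair each utterance by `target` with the other speaker's preceding turn.

    The preceding turn is treated as question A and the target's reply as
    answer B, tagged with the emotion label of the reply.
    """
    pairs: List[Dict[str, str]] = []
    for prev, cur in zip(turns, turns[1:]):
        if cur.speaker == target and prev.speaker != target:
            pairs.append(
                {"emotion": cur.emotion, "question": prev.text, "answer": cur.text}
            )
    return pairs


def group_by_emotion(pairs: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Group question/answer pairs by their emotion label."""
    grouped: Dict[str, List[Dict[str, str]]] = {}
    for pair in pairs:
        grouped.setdefault(pair["emotion"], []).append(pair)
    return grouped
```

The resulting records have the prompt/response shape that supervised fine-tuning pipelines commonly expect, though the exact format the inventors use is not disclosed.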

In one embodiment, the system further includes a database unit configured in the server, which stores the identification information of the specific person together with the large language model fine-tuned by the machine learning unit on that person's emotion-sentence dataset. Using the identification information of the specific person requested from the user's terminal, the speech processing unit may load from the database unit the fine-tuned large language model corresponding to that person and generate natural language for the response.
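The lookup of a per-person fine-tuned model by identification information can be sketched as a small registry. Everything here is illustrative: `ModelRegistry`, `respond`, and the `loader` callback are hypothetical, a model is represented as any callable from question text to answer text, and the caching of loaded models is an added assumption rather than something the patent describes.

```python
from typing import Callable, Dict

Model = Callable[[str], str]   # question text -> generated answer text


class ModelRegistry:
    """Stand-in for the database unit: person id -> fine-tuned model."""

    def __init__(self, loader: Callable[[str], Model]) -> None:
        self._loader = loader                 # loads a model from a stored path
        self._paths: Dict[str, str] = {}
        self._cache: Dict[str, Model] = {}

    def register(self, person_id: str, model_path: str) -> None:
        """Record where the fine-tuned model for `person_id` is stored."""
        self._paths[person_id] = model_path
        self._cache.pop(person_id, None)      # drop any stale loaded copy

    def load(self, person_id: str) -> Model:
        """Return the model for `person_id`, loading it on first use."""
        if person_id not in self._paths:
            raise KeyError(f"no fine-tuned model registered for {person_id!r}")
        if person_id not in self._cache:
            self._cache[person_id] = self._loader(self._paths[person_id])
        return self._cache[person_id]


def respond(registry: ModelRegistry, person_id: str, question: str) -> str:
    """Speech-processing step: pick the person's model and generate a reply."""
    return registry.load(person_id)(question)
```

Keeping the loader injectable means the same flow works whether the stored artifact is a full model checkpoint or a smaller per-person adapter; the patent does not say which.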

In another aspect, the present invention provides a conversational artificial intelligence server comprising: an input unit that receives, from a user's mobile terminal, identification information of a specific person with whom the user wishes to converse and question data that the user has entered for that person; a data collection unit that receives, from the specific person's mobile terminal, voice data of the specific person recorded over a preset period of time; a machine learning unit in which software of a large language model (LLM) is stored, and which receives the voice data, constructs an emotion-sentence dataset from the language the specific person uses in each emotional state, and fine-tunes the large language model with the emotion-sentence dataset; a speech processing unit that uses the large language model fine-tuned by the machine learning unit to generate natural language simulating the specific person's response to the question data; and a database unit that stores the identification information of the specific person and the large language model fine-tuned by the machine learning unit on that person's emotion-sentence dataset. The server thereby provides a conversation service that enables emotional communication by simulating the natural language of the specific person with whom the user wishes to converse.

According to the present invention, for users who wish to converse with a lover, celebrity, influencer, mentor, parent, friend, or the like, the conversation pattern of the specific person selected by the user is simulated so as to reproduce that person's language habits, tone of voice, and word patterns in each emotional state, thereby providing users with conversation content through which they can communicate emotionally with the specific person they choose. In other words, the present invention can provide everyday conversation with a specific person that resembles the real thing, making the conversation more engaging and offering comfort in everyday life.

To reproduce a specific person's language habits and writing style closely in voice or text form, the present invention collects more than 1,000 hours of that person's everyday language. Here, the specific person may be not only a celebrity but also an ordinary person in the user's life, and the present invention provides a linked voice application for collecting around 1,000 hours of voice from such a variety of specific persons. The voice application is installed on the specific person's mobile terminal and operates in the background, so the voice is collected while the person goes about daily life and speaks naturally, without being conscious of the recording situation. The application provides a recording setting function that takes the data storage efficiency of voice collection into account, together with a recording script function and a data management function that respect the specific person's privacy. By tuning the large language model (LLM) with the voice data collected in this way so that it fits the specific person, and generating natural language with it, the user can be offered a natural conversation service with the specific person of their choice.

Figure 1 shows a configuration diagram of a conversational artificial intelligence system according to an embodiment of the present invention.
Figure 2 illustrates an embodiment of a voice chatbot service provided by the conversational artificial intelligence system of Figure 1.
Figure 3 is a configuration diagram of a voice collection application according to an embodiment of the present invention.
Figure 4 illustrates a recording setting function provided by a voice collection application according to an embodiment of the present invention.
Figure 5 illustrates a data management function and a recording script function provided by a voice collection application according to an embodiment of the present invention.
Figure 6 is a configuration diagram of a machine learning unit according to an embodiment of the present invention.
Figure 7 shows the data stored in a database unit according to an embodiment of the present invention.

The various embodiments described in this document are given as examples to explain the technical idea of the present invention and disclosure clearly, and are not intended to limit it to a specific embodiment. The technical idea of the present invention and disclosure includes various modifications, equivalents, and alternatives of each embodiment described in this document, as well as embodiments selectively combined from all or part of each embodiment. Furthermore, the scope of rights of the technical idea of the present invention and disclosure is not limited to the embodiments presented below or to the specific descriptions of them.

Unless otherwise defined, the terms used in this document, including technical or scientific terms, may have the meanings commonly understood by a person of ordinary skill in the art to which the present invention and disclosure belong.

Expressions such as "includes," "may include," "comprises," "may comprise," "has," and "may have" used in this document mean that the feature in question, such as a function, operation, or component, is present, and do not exclude the presence of other additional features. In other words, such expressions should be understood as open-ended terms that leave open the possibility of including other embodiments.

As used in this document, singular expressions may include plural meanings unless the context clearly indicates otherwise, and this also applies to singular expressions recited in the claims.

As used in this document, expressions such as "A, B, and C," "A, B, or C," "A, B, and/or C," "at least one of A, B, and C," "at least one of A, B, or C," "at least one of A, B, and/or C," "at least one selected from A, B, and C," "at least one selected from A, B, or C," and "at least one selected from A, B, and/or C" may refer to each listed item or to all possible combinations of the listed items. For example, "at least one selected from A and B" may refer to any of (1) A, (2) at least one of A, (3) B, (4) at least one of B, (5) at least one of A and at least one of B, (6) at least one of A and B, (7) at least one of B and A, and (8) both A and B.

The expression "based on" as used in this document describes one or more factors that influence the decision, judgment, or operation described in the phrase or sentence containing the expression, and it does not exclude additional factors that influence that decision, judgment, or operation.

As used in this document, the statement that a component (e.g., a first component) is "connected" or "coupled" to another component (e.g., a second component) may mean not only that the component is connected or coupled to the other component directly, but also that it is connected or coupled via yet another component (e.g., a third component).

Depending on the context, the expression "configured to" used in this document may mean "set to," "having the ability to," "modified to," "made to," "capable of," and the like, and it is distinct from "consisting of."

Hereinafter, various embodiments of the present disclosure are described with reference to the accompanying drawings. In the accompanying drawings and their description, identical or substantially equivalent components may be given the same reference numerals. In the description of the various embodiments below, repeated description of identical or corresponding components may be omitted, but this does not mean that those components are not included in the embodiment.

Figure 1 shows a configuration diagram of a conversational artificial intelligence system (1) according to an embodiment of the present invention. Referring to Figure 1, the conversational artificial intelligence system (1) may include a voice collection application (20) and a conversational artificial intelligence server (10). The voice collection application (20) may be stored in a medium for installation on a mobile terminal and, in one embodiment, may be provided as a downloadable program stored on a web page, which is a medium provided over the Internet. The voice collection application (20) is executable by one or more processors, and each function of the application described below may be implemented in a programming language as instructions that cause the one or more processors to perform that function.

The voice collection application (20) may be stored in a medium in order to execute the following function: it is installed on the mobile terminal of a specific person (2), runs as a background application for a preset period of time, and while running records the voice of the specific person (2) and transmits the voice data to an external server (10).

The background application referred to in this specification means a service background application. That is, an application launched directly by the user continues to run in the background, and its functions keep running concurrently without being terminated even when the user launches other applications in the operating system. An example of a service background application is a messenger service such as KakaoTalk.

The conversational artificial intelligence system (1) according to the present embodiment provides a text or voice conversation service for an unrestricted variety of specific persons (2) by closely simulating each person's language habits, emotion-dependent tone of voice, volume, word usage, and so on. Here, the specific person (2) may be a public figure suited to being a mentor in various fields, such as a celebrity like an entertainer, a successful entrepreneur, an actor, a singer, an idol, or an instructor.

Figure 2 shows an embodiment of a voice chatbot service provided by the conversational artificial intelligence system (1) of Figure 1. Figure 2 illustrates one case of a voice chatbot conversation in which the specific person (2) is assumed to be the comedian 'Park Myeong-su'. The user (3) enters a question (300) through the voice chatbot. The question (300) may be entered on the mobile terminal as text or as the voice of the user (3). The spectrogram (30) representing the voice input visualizes the sound or waveform of the user's input voice.

In the example of Figure 2, the user (3) has designated 'Park Myeong-su' as the specific person (2) to converse with. When the user (3) asks, "Should I become a comedian too?", the conversational artificial intelligence system (1) may output "Hey, just go study" as the specific person's response (200), natural language generated with the language habits of the comedian 'Park Myeong-su' taken into account. For natural language generation based on a general-purpose large language model (LLM), producing the somewhat cynical and aggressive answer "Hey, just go study" to a user who is considering comedy as a career would be unnatural, because it is not how a typical conversation would respond. The answer makes sense only when the disposition of the specific person (2), 'Park Myeong-su', who has a blunt way of speaking, is taken into account. Moreover, depending on the tone of voice in which "Hey, just go study" is spoken, the user (3) might feel attacked or belittled. If the answer is output in Park Myeong-su's tone of voice, however, the user (3) will feel as though actually talking with Park Myeong-su and, far from being offended, will find the conversation fun and pleasant.

As the example of Figure 2 shows, a chatbot that feels like a real conversation with the specific person (2) can be achieved only when the tone of voice and emotion are tied to the sentence, even if the sentence itself is the same. This cannot be achieved by using an existing large language model (LLM) as it is, so the language characteristics of the actual specific person (2) must be built into training data.

Meanwhile, the specific person (2) may be an ordinary person designated by the user (3), such as an acquaintance or a parent of the user (3). To reproduce realistic conversational habits for such an unrestricted range of people, at least about 1,000 hours of valid voice data is required, even with current artificial intelligence and voice cloning technology. Preferably, 3,000 hours or more of voice data should be recorded.

In the case of a public figure, about 1,000 hours of voice data might be obtained by collecting the person's speech from media such as YouTube. However, the topics are limited and the speech habits of everyday conversation are not sufficiently reflected, so only certain aspects, such as the tone of voice, are reproduced faithfully, while everyday conversation does not come across naturally. Even among public figures, only a few have enough voice data available, and for an ordinary person there is no way to obtain voice data other than recording it directly.

The artificial intelligence conversation system (1) according to the present embodiment provides a voice collection application (20) optimized for voice collection, in order to obtain at least about 1,000 hours of voice data from a specific person (2), who may be any person, whether a public figure or an ordinary person. The voice collection application (20) is installed on the mobile terminal of the specific person (2) and records the everyday conversational language of the specific person (2) while running as a background application for at least 1,000 hours. The specific person (2) may feel considerable reluctance about having his or her daily life recorded for some 1,000 hours. The voice collection application (20) must therefore take privacy into account and provide a function by which the final voice data is collected on the server (10) only after confirmation by the specific person (2).

Meanwhile, the specific person (2) must be compensated for at least 1,000 hours of voice collection if he or she is to be motivated to use the voice collection application (20). A business operator that intends to provide a voice or text chatbot service using the artificial intelligence conversation system (1) can encourage use of the voice collection application (20) by sharing revenue for the voice of the specific person (2), and can thereby collect sufficient voice data from the specific person (2). Revenue sharing may be arranged by a negotiated contract, or a separate revenue sharing platform may be used.

In one embodiment, the voice collection application (20) may be provided as a downloadable medium through a revenue sharing platform, and the specific person (2) may enter into a revenue sharing contract through that platform. The revenue sharing platform discloses the status of voice service requests from users (3), and the revenue of the specific person (2) may be set in proportion to the number of times the conversation service of the specific person (2) is provided in response to those requests. If the specific person (2) records his or her voice through the revenue sharing platform, income may accrue continuously, much as with a copyright. The voice collection application (20) according to the present embodiment takes into account privacy as well as concerns such as the battery and storage space of the mobile terminal and, given that revenue is shared with the specific person (2), is well suited to collecting sufficient voice data from the specific person (2). The functions of the voice collection application (20) are described below.

FIG. 3 is a configuration diagram of the voice collection application (20) according to an embodiment of the present invention.

Referring to FIG. 3, the voice collection application (20) may include a recording function (22), a recording setting function (23), a recording script function (24), and a data management function (25). The time for which the voice collection application (20) runs as a background application is set to at least 1,000 hours, and permission to access the apps installed on the mobile terminal that use voice may be set as a condition for running.

In the present embodiment, an app that uses voice may be the phone app, an app related to a social network service (SNS), an app related to video conferencing, and so on. The term covers the various applications provided on operating systems such as Android or Mac, and refers mainly to applications that are granted use of the microphone and can therefore make use of the user's voice. The voice collection application (20) is initially configured to be granted access to such voice-using apps already installed on the terminal of the specific person (2). Running in the background, it can thus collect the speech that the specific person (2) utters naturally during phone calls, meetings, games, SNS use, and the like.

Because the voice collection application (20) runs as a background application, the time it runs can be the time it collects voice. The voice collection application (20) may be set to run for at least 1,000 hours, preferably 3,000 hours. Once voice collection for that time is complete, it terminates automatically and may be deleted from the terminal of the specific person (2).

The recording function (22) records the voice of the specific person (2) and is implemented with a voice recognition algorithm. Because the recording function (22) according to the present embodiment runs in the background for a long time, it could drain the battery of the user's terminal excessively. The voice collection application (20) may therefore apply, as its voice recognition algorithm, an active recording algorithm that starts recording only when a voice is detected and stops recording when no voice is detected.

The active recording algorithm according to the present embodiment detects the occurrence of a voice segment (Vo) in a spectrogram and, when a voice segment (Vo) occurs, performs voice recognition and records, thereby minimizing battery consumption. The algorithm for detecting a voice segment may be configured with a spectrogram threshold and a duration, so that voice recognition starts when spectrogram values exceeding the threshold are produced. Alternatively, end-point detection (EPD), which recognizes the start and end points of a voice segment and is based on at least one of a rule-based method and a machine learning method, may be employed. The rule-based method is based on at least one of frame energy, zero-crossing rate, energy entropy, TEO energy, and a mel-scale filter bank. The machine learning method is based on at least one of a GMM (Gaussian Mixture Model), an HMM (Hidden Markov Model), an SVM (Support Vector Machine), and a DNN (Deep Neural Network). These are exemplary algorithms for detecting speech regions in voice data, and the scope of the present invention is not limited to them.
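The threshold-and-duration detection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses frame energy (one of the rule-based features named above) in place of a full spectrogram, and the function names, the non-overlapping framing, and the `min_frames` parameter are all assumptions made for the example.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (hypothetical framing)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_segments(
    energies: List[float], threshold: float, min_frames: int
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, of runs whose energy
    stays above `threshold` for at least `min_frames` consecutive frames."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, energy in enumerate(energies):
        if energy > threshold:
            if start is None:
                start = i  # candidate start point of a voice segment
        else:
            if start is not None and i - start >= min_frames:
                segments.append((start, i))  # run was long enough to keep
            start = None
    # Close a run that is still open at the end of the signal.
    if start is not None and len(energies) - start >= min_frames:
        segments.append((start, len(energies)))
    return segments


# Synthetic signal: silence, a long burst, silence, a short blip, silence.
signal = [0.0] * 4 + [1.0] * 8 + [0.0] * 4 + [1.0] * 2 + [0.0] * 2
segments = detect_segments(frame_energies(signal, frame_len=2), threshold=0.5, min_frames=2)
```

The duration requirement is what keeps the one-frame blip from triggering a recording; only the long burst is returned as a segment.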

The recording function (22) according to the present embodiment could simply be a command that invokes the recording function provided by default on widely available mobile terminals. For the service intended by the present embodiment, however, everyday noise and environmental sounds other than the voice of the specific person (2) are all noise. The purpose of the voice collection application (20) is to collect sentences that reflect the language characteristics of the specific person (2) in daily life, together with the conversational tone and vocabulary used for each emotion. Recording during the many hours in which the specific person (2) is not speaking would drain the terminal's battery too quickly while yielding only noise. In particular, at least 1,000 hours of voice data is needed to reproduce the voice of the specific person (2), so a voice collection application (20) with this operating time must collect 1,000 hours' worth of data that actually contains the voice of the specific person (2). The voice collection application (20) detects voice segments (Vo) and records selectively, only when the voice of the specific person (2) is present. The selectively recorded time is the operating time referred to in this specification. Ultimately, the operating time of the voice collection application (20) is satisfied only when the time the specific person (2) has actually spent speaking amounts to 1,000 hours.
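Since the operating time counts only the time in which speech was actually recorded, it can be tracked by summing the durations of detected voice segments. The sketch below illustrates that bookkeeping under assumed names (`SpeechTimeCounter`, `add_segment`); the specification does not describe how the time is accumulated.

```python
class SpeechTimeCounter:
    """Accumulates recorded speech time toward a target operating time."""

    def __init__(self, target_hours: float = 1000.0) -> None:
        self.target_seconds = target_hours * 3600.0
        self.recorded_seconds = 0.0

    def add_segment(self, start_s: float, end_s: float) -> None:
        """Add one voice segment given by its start and end time in seconds."""
        if end_s <= start_s:
            raise ValueError("segment must have positive duration")
        self.recorded_seconds += end_s - start_s

    @property
    def done(self) -> bool:
        """True once the recorded speech reaches the target operating time."""
        return self.recorded_seconds >= self.target_seconds

    def remaining_hours(self) -> float:
        """Hours of speech still needed, never negative."""
        return max(0.0, self.target_seconds - self.recorded_seconds) / 3600.0
```

Silence between segments never enters the total, so an application left running for 1,000 wall-clock hours would not by itself satisfy the target.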

Because the recording function (22) uses the active recording algorithm to collect only data in which speech is present, meaningless noise is removed in advance and the battery consumption of the mobile terminal of the specific person (2) is reduced. It also becomes possible to secure the minimum amount of data needed for artificial intelligence to reproduce the voice of the specific person (2).

When a voice segment (Vo) is detected, it is stored as voice data (Vo Data) and transmitted to the server (10), which is described later. In the present embodiment, the Vo Data is also converted into text data (Txt Data) by the STT (Speech To Text) algorithm that implements the recording script function (24). The Txt Data is likewise transmitted to the server (10). As in this embodiment, the data collected through the voice collection application (20) may thus comprise both voice data (Vo Data) and text data (Txt Data).

FIG. 4 shows the recording setting function (23) provided by the voice collection application (20) according to an embodiment of the present invention. In consideration of the privacy of the specific person (2), whose entire daily life is recorded over a long period, the voice collection application (20) may provide the recording setting function (23) so that the specific person (2) can block the collection of conversations he or she does not want collected, such as those about hobbies or those with particular associates or romantic partners.

Referring to FIG. 4, the recording setting function (23) displays an input window in which the specific person (2) enters keywords, and provides a settings menu so that spoken sentences containing those keywords are excluded from recording.

The recording setting function (23) may provide various filtering modes that take the user's privacy into account. In the present embodiment, the recording setting function (23) may accept keywords so that sentences containing a given keyword are removed from the speech uttered by the specific person (2). The specific person (2) may wish that calls with a particular person, or speech related to a particular hobby, not be recorded, in order to avoid issues such as scandals arising from everyday conversation. In that case, the specific person (2) can set filtering keywords through the recording setting function (23) to exclude such speech from recording.
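A keyword filter of this kind could operate on the transcribed sentences, dropping any sentence that contains a blocked keyword. The following is a minimal sketch under that assumption; the function name and the case-insensitive substring match are illustrative choices, not details given in the specification.

```python
from typing import Iterable, List


def filter_sentences(sentences: Iterable[str], blocked_keywords: Iterable[str]) -> List[str]:
    """Keep only sentences that contain none of the blocked keywords.

    Matching is a case-insensitive substring test (an assumption); empty
    keywords are ignored so they cannot block every sentence.
    """
    lowered = [k.lower() for k in blocked_keywords if k]
    return [
        s for s in sentences
        if not any(k in s.lower() for k in lowered)
    ]


kept = filter_sentences(
    ["Hey, go study", "Let's talk about fishing tomorrow", "See you at noon"],
    blocked_keywords=["fishing"],
)
```

A substring match is deliberately broad: it errs toward excluding more speech, which is the safer side for a privacy filter.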

In addition, a filtering time may be set in the recording setting function (23) as another filtering mode. The specific person (2) can designate time slots associated with a private hobby or with his or her private life and exclude them from recording. In this case, the recording function (22) of the voice collection application (20) may be controlled so that it is switched off during the designated time slots even while the application is running in the background.
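The time-based filter amounts to checking the current time of day against a list of blocked slots before recording. The sketch below assumes slots are given as (start, end) pairs with an exclusive end, and that a slot may wrap past midnight; these conventions and the function names are hypothetical.

```python
from datetime import time
from typing import Iterable, Tuple

TimeSlot = Tuple[time, time]


def in_slot(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end); handles slots that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def recording_allowed(now: time, blocked_slots: Iterable[TimeSlot]) -> bool:
    """Recording is allowed only outside every blocked slot."""
    return not any(in_slot(now, start, end) for start, end in blocked_slots)


# Example: block recording from 22:00 until 06:00 the next morning.
night = [(time(22, 0), time(6, 0))]
allowed_at_noon = recording_allowed(time(12, 0), night)
```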

The recording script function (24) applies an STT (Speech To Text) algorithm that converts the voice of the specific person (2) into text, and outputs the voice data recorded over a given period as text. Through the recording script function (24), the specific person (2) can selectively delete voice data. In the present embodiment, the recording script function (24) may segment the data by recording date, time, and sentence. As described with reference to FIG. 3, the recording script function (24) transmits the data to the server (10) in the form of Txt Data. Transmitting the data as text serves not only to implement a text chatbot but also as a means of having the specific person (2) confirm the recorded data. Preferably, the data segments are formed sentence by sentence. A sentence-level data segment is activated when the specific person (2) points to the corresponding area in the voice collection application (20), so that data can easily be edited and deleted one segment at a time.

The data management function (25) stores voice data in the internal storage of the mobile terminal on which the application is installed, transmits the stored voice data to the server (10) at a preset interval, and deletes the transmitted voice data from the internal storage.
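The store, upload, then delete cycle can be sketched as below. This is an illustration under assumptions: the class name, the injected `upload` callable standing in for the transfer to the server (10), and the rule that local data is deleted only after the upload reports success are not specified in the source.

```python
from typing import Callable, List


class DataManager:
    """Buffers voice clips locally and flushes them to a server periodically."""

    def __init__(self, upload: Callable[[List[bytes]], bool], period_s: float) -> None:
        self.upload = upload        # hypothetical transport; returns True on success
        self.period_s = period_s    # preset transmission interval in seconds
        self.local: List[bytes] = []
        self.last_upload = 0.0

    def store(self, clip: bytes) -> None:
        """Save a recorded clip to local storage."""
        self.local.append(clip)

    def tick(self, now: float) -> bool:
        """Upload if the interval has elapsed; delete local data only on success."""
        if now - self.last_upload < self.period_s or not self.local:
            return False
        if self.upload(list(self.local)):
            self.local.clear()
            self.last_upload = now
            return True
        return False
```

Clearing the buffer only after a successful upload means a failed transfer leaves the clips on the terminal for the next attempt, at the cost of holding storage a little longer.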

FIG. 5 shows the data management function (25) and the recording script function (24) provided by the voice collection application (20) according to an embodiment of the present invention.

Referring to FIG. 5, before the data management function (25) transmits voice data to the server (10), it may run the recording script function (24) so that the specific person (2) can review the recorded voice data as text and selectively delete it. The example in FIG. 5 shows text sentences of voice data for the comedian 'Park Myeong-su' as the specific person (2). It is a hypothetical case for explaining the recording script function (24), not actually collected data.

In FIG. 5, the data management function (25) may invoke the recording script function (24) so that the specific person (2) can confirm the voice data before it is transmitted to the server (10). The interval at which voice data is transmitted to the server (10) may be set by the user of the server (10), in units of a day or a week. Once the recording script (24) is loaded, the specific person (2) can review his or her recorded speech as text, organized by time slot.

Each piece of text in the script becomes a data segment (250). The data segments (250) make editing easy for the specific person (2). When the specific person (2) clicks a sentence that he or she does not want transmitted, the whole segment (250) is selected and a delete command UI is displayed, and the specific person (2) can delete the sentence simply by touching the UI.
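The review step can be modeled as a transcript of sentence-level segments from which the speaker removes entries before upload. The sketch below is illustrative only; `DataSegment`, `Transcript`, and their fields are assumed names, and deletion is modeled as a flag so that only confirmed segments are passed on.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class DataSegment:
    segment_id: int
    recorded_at: str      # recording date and time, e.g. "2023-10-04 09:00"
    text: str             # STT transcript of one sentence
    deleted: bool = False


class Transcript:
    """Sentence-level segments awaiting the speaker's confirmation."""

    def __init__(self, segments: Iterable[DataSegment]) -> None:
        self._segments: Dict[int, DataSegment] = {s.segment_id: s for s in segments}

    def delete(self, segment_id: int) -> None:
        """Mark a whole segment as deleted; raises KeyError for an unknown id."""
        self._segments[segment_id].deleted = True

    def confirmed(self) -> List[DataSegment]:
        """Segments that may be transmitted, in their original order."""
        return [s for s in self._segments.values() if not s.deleted]


transcript = Transcript([
    DataSegment(1, "2023-10-04 09:00", "Hey, go study"),
    DataSegment(2, "2023-10-04 09:05", "A private remark"),
])
transcript.delete(2)
to_send = [s.text for s in transcript.confirmed()]
```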

The foregoing has described an embodiment of the voice collection application (20) for collecting the voice data of the specific person (2). The following describes the configuration of the conversational artificial intelligence server (10), which learns from the voice data collected through the voice collection application (20) and generates natural language to form an appropriate answer to a question from the user (3).

The conversational artificial intelligence server (10) may include an input unit (11), a data collection unit (12), a database unit (13), a machine learning unit (14), and a speech processing unit (15).

The input unit (11) may receive, from the mobile terminal of the user (3), the identification information of the specific person (2) with whom the user (3) wishes to talk and the question data that the user (3) has entered for the specific person (2). The identification information of the specific person (2) may be the unique ID assigned when the voice data of the specific person (2) is collected. The question data may be text or voice entered by the user (3) through the terminal.

The data collection unit (12) may receive, from the mobile terminal of the specific person (2), the voice data of the specific person (2) recorded over the preset period. The data collection unit (12) may transmit the received voice data to the database unit (13).

The machine learning unit (14) stores the software of a large language model (LLM, 140), receives the voice data, builds an emotion-sentence dataset from the language the specific person (2) uses in each emotional state, and may fine-tune the large language model (140) on the emotion-sentence dataset.

The large language model (LLM) may be a language model that Google provides as open source. Open-source large language models are available in various repositories on sites such as GitHub. As an example, the machine learning unit (14) may download the source (SRC) data for LLaMa2, a large language model available on GitHub. This may also be stored in the database unit (13) of the server (10).

FIG. 6 is a configuration diagram of the machine learning unit (14) according to an embodiment of the present invention. Referring to FIG. 6, the machine learning unit (14) may include a preprocessing module (141), a first supervised learning module (142), and a second reinforcement learning module (143).

The preprocessing module (141) may build an emotion-sentence dataset by classifying and labeling the emotion of each sentence in the received voice data, using acoustic characteristics of the voice including its frequency, wavelength, volume, or speed. This is referred to as primary preprocessing.

Primary preprocessing may be performed automatically or manually. As described above with reference to FIG. 3, the database unit (13) receives voice data (Vo Data) and text data (Txt Data). Primary preprocessing is the task of classifying, through the text data (Txt Data), the emotional state in which each data segment (250) was uttered. In other words, primary preprocessing labels each conversational sentence of the specific person (2) with an emotion. The emotion labels may include items such as lively, chic, calm, mischievous, mature, depressed, angry, passionate, studious, lovable, indifferent, adolescent, and elderly. Through primary preprocessing, the text data (Txt Data) can be turned into an emotion-sentence dataset in which each data segment (250) is classified by emotion.
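One simple way to picture the resulting emotion-sentence dataset is as a mapping from each emotion label to the sentences carrying that label. The sketch below assumes the labels have already been assigned (manually or by some classifier that is not shown); the record type and function name are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class LabeledSentence:
    text: str      # one data segment's transcript
    emotion: str   # label assigned during primary preprocessing


def build_emotion_sentence_dataset(
    records: Iterable[LabeledSentence],
) -> Dict[str, List[str]]:
    """Group labeled sentences by emotion, preserving their input order."""
    dataset: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        dataset[record.emotion].append(record.text)
    return dict(dataset)


dataset = build_emotion_sentence_dataset([
    LabeledSentence("Hey, go study", "indifferent"),
    LabeledSentence("That was great!", "lively"),
    LabeledSentence("Do what you want", "indifferent"),
])
```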

The preprocessing module (141) may build a training dataset by organizing the emotion-sentence dataset, for each emotion, into pairs of a question A and its answer B. This is referred to as secondary preprocessing. Secondary preprocessing may be performed automatically or manually.

Secondary preprocessing takes each data segment (250) of the emotion-sentence dataset, treats the sentence of that data segment (250) as answer B, and assigns a question A for which that answer is appropriate, thereby building the training dataset. The training dataset built in this way can be used to fine-tune the large language model (LLM).
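Question and answer pairs like these are commonly serialized one record per line before fine-tuning. The sketch below writes each (emotion, question A, answer B) triple as a JSON line; the field names `emotion`, `input`, and `output` are assumptions for illustration, not a format prescribed by the specification or by any particular training library.

```python
import json
from typing import Iterable, List, Tuple


def to_training_jsonl(triples: Iterable[Tuple[str, str, str]]) -> List[str]:
    """Serialize (emotion, question A, answer B) triples as JSON lines."""
    lines = []
    for emotion, question, answer in triples:
        record = {"emotion": emotion, "input": question, "output": answer}
        # ensure_ascii=False keeps Korean text readable in the output file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


jsonl_lines = to_training_jsonl([
    ("indifferent", "Should I become a comedian too?", "Hey, go study"),
])
```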

The first supervised learning module (142) receives the training dataset, in which the emotion-sentence dataset has been preprocessed into answers B to questions A, and may perform training in which question A is taken as the input and answer B is produced as the output.

Fine-tuning an LLM means optimizing the LLM for its intended use. It is carried out with a training dataset and adjusts the parameters of the LLM to improve its performance on a specific task. Through this process, the LLM acquires knowledge about the task and performs better on it. In the present embodiment, the LLM is trained on a training dataset organized into answers B of the specific person (2) to questions A, so that the LLM is fine-tuned to respond to questions similar to question A with answers like B.

The second reinforcement learning module (143) may perform reinforcement learning in which various answers to question A, including answer B, are composed and each answer is assigned a weight. Any of the various publicly available open-source algorithms may be used for the reinforcement learning. The algorithm used assigns a higher weight the more closely an answer, including similar variants of answer B, resembles the answer B learned in the first stage.
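The weighting rule (the closer a candidate is to answer B, the higher its weight) can be illustrated with a plain string-similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the specification does not name a measure, and this shows only the weighting step, not a reinforcement learning algorithm.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Tuple


def similarity_weight(candidate: str, reference: str) -> float:
    """Similarity in [0, 1] between a candidate answer and reference answer B."""
    return SequenceMatcher(None, candidate, reference).ratio()


def rank_candidates(candidates: Iterable[str], reference: str) -> List[Tuple[str, float]]:
    """Pair each candidate with its weight, highest weight first."""
    weighted = [(c, similarity_weight(c, reference)) for c in candidates]
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)


ranked = rank_candidates(
    ["Hey, go study!", "What a wonderful idea", "Hey, go study"],
    reference="Hey, go study",
)
```

A character-level measure like this rewards surface resemblance only; a practical system would more likely score candidates with a learned model, which is outside the scope of this sketch.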

Through the first and second stages of training, the machine learning unit (14) adapts the large language model (140) to the language characteristics of the specific person (2) and can thereby produce a fine-tuned large language model (145). In FIG. 6, the fine-tuned large language model (145) is shown as LLM_ID021. The fine-tuned large language model (145) may be transmitted to the database unit (13) and stored with a unique value.

The database unit (13) may store the identification information of the specific person (2) and the large language model (145) fine-tuned by the machine learning unit (14) on the emotion-sentence dataset of the specific person (2).

FIG. 7 shows the data stored in the database unit (13) according to an embodiment of the present invention and illustrates the data structure (130) of the database unit (13). In the database unit (13), the identification information of each specific person (2) is organized by category. Taking comedians as a subcategory of the celebrity category, the database may store, for specific persons (2) such as 'Yoo Jae-seok' and 'Park Myeong-su', the collected voice data, the identification information (ID), the LLM fine-tuned for each specific person, and the API that calls that LLM.
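The categorized structure can be sketched as one record per specific person, indexed by ID. All field names, paths, and endpoint strings below are hypothetical placeholders; only the category example and the LLM_ID021 label come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PersonRecord:
    person_id: str         # identification information (ID)
    name: str
    category: str          # e.g. "celebrity"
    subcategory: str       # e.g. "comedian"
    voice_data_path: str   # placeholder location of collected voice data
    model_id: str          # identifier of the fine-tuned LLM
    api_endpoint: str      # placeholder API used to call that LLM


class ModelDatabase:
    """In-memory stand-in for the data structure (130)."""

    def __init__(self) -> None:
        self._by_id: Dict[str, PersonRecord] = {}

    def add(self, record: PersonRecord) -> None:
        self._by_id[record.person_id] = record

    def find(self, person_id: str) -> Optional[PersonRecord]:
        """Look up a record by ID; returns None if the ID is unknown."""
        return self._by_id.get(person_id)

    def names_in(self, category: str, subcategory: str) -> List[str]:
        """Names of all persons filed under the given category and subcategory."""
        return [
            r.name for r in self._by_id.values()
            if r.category == category and r.subcategory == subcategory
        ]


db = ModelDatabase()
db.add(PersonRecord("ID021", "Park Myeong-su", "celebrity", "comedian",
                    "/data/ID021/voice", "LLM_ID021", "/api/llm/ID021"))
```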

The speech processing unit (15) receives, through the terminal of the user (3), a request for a response to the question data entered by the user (3), and may generate natural language that imitates the specific person's response to the question data using the large language model (145) fine-tuned by the machine learning unit (14).

Using the identification information of the specific person (2) requested from the terminal of the user (3), the speech processing unit (15) may load from the database unit (13) the fine-tuned large language model (145) corresponding to that specific person (2) and generate natural language for the response.
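The load-by-ID step can be sketched as a dispatcher that looks up a loader for the requested person, loads the model once, and reuses it for later questions. The models here are plain Python callables standing in for fine-tuned LLMs, and the class name, loader registry, and caching behavior are assumptions for illustration.

```python
from typing import Callable, Dict

Model = Callable[[str], str]  # stand-in for a fine-tuned LLM: question -> answer


class SpeechProcessingUnit:
    """Routes a question to the model registered for the requested person ID."""

    def __init__(self, loaders: Dict[str, Callable[[], Model]]) -> None:
        self._loaders = loaders            # person ID -> function that loads the model
        self._cache: Dict[str, Model] = {}

    def respond(self, person_id: str, question: str) -> str:
        """Load the person's model on first use, then generate a reply."""
        if person_id not in self._cache:
            if person_id not in self._loaders:
                raise KeyError(f"no model registered for {person_id}")
            self._cache[person_id] = self._loaders[person_id]()
        return self._cache[person_id](question)


def load_id021() -> Model:
    # Toy model that always gives the reply used in the FIG. 2 example.
    return lambda question: "Hey, go study"


unit = SpeechProcessingUnit({"ID021": load_id021})
reply = unit.respond("ID021", "Should I become a comedian too?")
```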

The natural language generated by the speech processing unit (15) may be TTS-based voice output or natural language in text form. Natural language output as voice can be used in a voice chatbot, and natural language in text form can be used in a text chatbot.

Although the technical idea of the present invention and disclosure has been explained through the embodiments described above, that technical idea encompasses the various substitutions, modifications, and changes that a person of ordinary skill in the art to which the present invention belongs could make within the scope of his or her understanding. It should also be understood that such substitutions, modifications, and changes fall within the scope of the appended claims.

1: Conversational AI system
2: Specific person
3: User
10: Conversational AI server
11: Input unit
12: Data collection unit
13: Database unit
130: Data structure
14: Machine learning unit
140: Large language model
141: Preprocessing module
142: Primary supervised learning module
143: Secondary reinforcement learning module
145: Fine-tuned large language model
15: Speech processing unit
20: Voice collection application
22: Recording function
23: Recording setting function
24: Recording script function
25: Data management function
250: Recorded data segment
200: Natural language response of the specific person
300: User's question
30: Spectrogram

Claims

A voice collection application stored in a medium, which is installed on a mobile terminal of a specific person, runs as a background application for a preset period of time, and while running executes a function of recording the voice of the specific person and transmitting the voice data to an external server;
A machine learning unit configured on the server, in which software of a Large Language Model (LLM) is stored, and which receives the voice data, constructs an emotion-sentence dataset of the language the specific person uses in each emotional state, and fine-tunes the large language model with the emotion-sentence dataset; and
A speech processing unit configured on the server, which receives a response request for question data entered by a user through the user's terminal, and generates natural language simulating the specific person's response to the question data using the large language model fine-tuned by the machine learning unit;
wherein the voice collection application
includes a recording setting function that outputs an input window in which the specific person enters a keyword, and excludes from recording any spoken sentence containing the keyword entered by the specific person,
wherein the voice collection application
applies, as a voice recognition algorithm, an active recording algorithm that starts recording only when a voice is detected and stops recording when no voice is detected,
the active recording algorithm detecting the occurrence of a voice segment (Vo) in a spectrogram, performing voice recognition when a voice segment (Vo) of the specific person occurs, and selectively recording the voice of the specific person, the time for selective recording being the operating time of the background application, and
An interactive artificial intelligence system characterized in that the voice collection application stores the voice data in the internal storage of the mobile terminal on which it is installed, transmits the stored voice data to the server at a preset cycle, and deletes the transmitted voice data from the internal storage.
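
The start/stop behavior of active recording and the keyword exclusion recited above can be illustrated with a minimal sketch. It is not the claimed algorithm: it substitutes a simple per-frame energy threshold for spectrogram-based detection of the voice segment (Vo), it does not perform recognition of the specific person's voice, and the function names, frame length, and threshold are all assumptions made for illustration.

```python
import math
import re
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Root-mean-square energy of each full, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def voice_segments(energies: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame_exclusive) for each run of frames at or above threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = i                       # voice detected: recording starts
        elif energy < threshold and start is not None:
            segments.append((start, i))     # voice no longer detected: recording stops
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments


def exclude_keyword_sentences(transcript: str, keywords: List[str]) -> List[str]:
    """Drop every sentence that contains one of the user-entered keywords."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]
    lowered = [k.lower() for k in keywords]
    return [s for s in sentences if not any(k in s.lower() for k in lowered)]
```

In this sketch the keyword filter works on a text transcript, which assumes speech has already been converted to text; the claim itself only states that sentences containing the keyword are excluded from recording.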

According to claim 1,
wherein the voice collection application
is set to run as a background application for at least 1000 hours, and
An interactive artificial intelligence system characterized in that access permission to voice-enabled apps installed on the mobile terminal is set as an operating condition.

(Deleted)

According to claim 1,
wherein the voice collection application
applies an STT (Speech To Text) algorithm that converts the voice of the specific person into text, and includes a recording script function that outputs voice data recorded over a certain period of time as text, and
An interactive artificial intelligence system characterized in that the specific person can selectively delete voice data through the recording script function.

(Deleted)

According to claim 1,
wherein the voice collection application
further includes a recording script function that outputs voice data recorded over a certain period of time as text, and
An interactive artificial intelligence system characterized in that, before the data management function is executed to transmit voice data to the server, the recording script function is executed so that the specific person can review the recorded voice data as text and selectively delete it.

According to claim 1,
wherein the machine learning unit includes:
a primary supervised learning module that receives the emotion-sentence dataset as a learning dataset preprocessed into the form of answer B to question A, and performs learning with question A as input and answer B as output; and
a secondary reinforcement learning module that constructs various answers, including answer B, for question A and performs reinforcement learning to weight the various answers, and
An interactive artificial intelligence system characterized in that the large language model is fine-tuned to the language characteristics of the specific person.
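
The two data shapes involved in this claim can be sketched as follows: question/answer pairs for the supervised stage, and a set of weighted candidate answers for the second stage. The weight update shown is a toy rule that moves each weight toward its reward and renormalizes; it is an assumption for illustration only and is not the reinforcement learning procedure of the patent. The names `QAPair`, `supervised_examples`, and `update_answer_weights` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class QAPair:
    question: str   # "question A"
    answer: str     # "answer B", as actually spoken by the specific person


def supervised_examples(pairs: List[QAPair]) -> List[Dict[str, str]]:
    """Format pairs as input/target records for a supervised fine-tuning stage."""
    return [{"input": p.question, "target": p.answer} for p in pairs]


def update_answer_weights(
    weights: Dict[str, float],
    rewards: Dict[str, float],
    lr: float = 0.5,
) -> Dict[str, float]:
    """Toy update: move each candidate's weight toward its reward, then normalize to sum to 1."""
    moved = {a: w + lr * (rewards.get(a, 0.0) - w) for a, w in weights.items()}
    total = sum(moved.values())
    if total <= 0:
        return dict(weights)    # nothing to normalize; keep the previous weights
    return {a: v / total for a, v in moved.items()}
```

For example, starting from equal weights on two candidate answers and rewarding only the answer the specific person actually gave shifts the weight toward that answer.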

According to claim 1,
wherein the machine learning unit
An interactive artificial intelligence system characterized by including a preprocessing module that constructs the emotion-sentence dataset by classifying and labeling the emotion of each sentence using acoustic characteristics of the voice in the received voice data, including frequency, wavelength, volume, or speed.
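
A minimal sketch of labeling sentences from acoustic characteristics is shown below. The rule-based classifier, its thresholds, and the emotion names are hypothetical and chosen only for illustration; the claim does not state how the classification is performed, and a real system would more plausibly use a trained classifier. The sketch also assumes the acoustic features have already been extracted per sentence.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Utterance:
    text: str
    pitch_hz: float        # mean fundamental frequency of the sentence
    volume_db: float       # mean loudness of the sentence
    words_per_sec: float   # speaking rate


def label_emotion(u: Utterance) -> str:
    """Assign an emotion label from acoustic features (illustrative thresholds only)."""
    if u.volume_db >= 70 and u.words_per_sec >= 3.0:
        return "angry" if u.pitch_hz < 200 else "excited"
    if u.volume_db < 55 and u.words_per_sec < 2.0:
        return "sad"
    return "neutral"


def build_emotion_sentence_dataset(utterances: List[Utterance]) -> List[Dict[str, str]]:
    """Pair each sentence with its emotion label."""
    return [{"emotion": label_emotion(u), "sentence": u.text} for u in utterances]
```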

According to claim 9,
wherein the preprocessing module
An interactive artificial intelligence system characterized by constructing a learning dataset by classifying the emotion-sentence dataset, by emotion, into the form of answer B to question A.
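
Grouping question/answer pairs by emotion can be sketched as below. The sketch assumes a transcript in which each turn carries a speaker, a sentence, and an emotion label, and in which the other party's preceding utterance is available to serve as question A; the patent does not state where question A comes from, so this is an assumption. The name `qa_pairs_by_emotion` and the tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: each turn is (speaker, sentence, emotion_label).
Turn = Tuple[str, str, str]


def qa_pairs_by_emotion(turns: List[Turn], target: str) -> Dict[str, List[Tuple[str, str]]]:
    """Pair each utterance by another speaker with the target's immediate reply,
    grouped by the emotion label of the reply."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for prev, cur in zip(turns, turns[1:]):
        if prev[0] != target and cur[0] == target:
            grouped[cur[2]].append((prev[1], cur[1]))
    return dict(grouped)
```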

According to claim 1,
further including a database unit configured on the server, in which the identification information of the specific person and the large language model fine-tuned by the machine learning unit with the emotion-sentence dataset of the specific person are stored,
wherein the speech processing unit
An interactive artificial intelligence system characterized in that it generates natural language for a response by loading from the database unit the fine-tuned large language model corresponding to the specific person whose identification information is requested from the user's terminal.

(Deleted)