KR20200061877A

KR20200061877A - System and method for generating paraphrase sentence based on ontology

Info

Publication number: KR20200061877A
Application number: KR1020180147694A
Authority: KR
Inventors: 양승원; 김성만
Original assignee: 주식회사 솔트룩스
Priority date: 2018-11-26
Filing date: 2018-11-26
Publication date: 2020-06-03
Anticipated expiration: 2038-11-26
Also published as: KR102143157B1

Abstract

According to an exemplary embodiment of the present invention, a system for generating paraphrase sentences may include: a feature extraction unit that obtains semantic resources, each including an ontology component corresponding to a token extracted from a query sentence, and that generates a pattern, which is a combination of the semantic resources, and a frame, which has a structure in which the tokens of the query sentence are replaceable; a pattern indexing unit that associates the pattern and the frame with each other and stores the associated pattern and frame in a pattern storage unit; and a sentence generation unit that generates a paraphrase sentence of the query sentence based on the frames corresponding to the pattern stored in the pattern storage unit.

Description

System and method for generating ontology-based paraphrase sentences {SYSTEM AND METHOD FOR GENERATING PARAPHRASE SENTENCE BASED ON ONTOLOGY}

The technical idea of the present invention relates to paraphrase sentence generation and, more particularly, to a system and method for ontology-based paraphrase sentence generation.

The present invention is derived from research led by IPL Co., Ltd. ((주)아이피엘) and conducted jointly with Saltlux Inc. as part of the Robot Industry Core Technology Development Program (Artificial Intelligence Convergence Robot System Technology) of the Ministry of Trade, Industry and Energy. [Research period: 2018.01.01~2018.12.31; research management agency: Korea Evaluation Institute of Industrial Technology; project title: Home social robot and service development system; project number: 10077633]

Recognizing and processing natural language so that a machine can understand human language may be referred to as natural language understanding, which can be used in various fields. For example, natural language understanding may be used in a question answering system that recognizes a user's query and automatically provides an answer to it.

Because human language can express the same meaning in many different ways, natural language understanding may require identifying the same meaning from differently worded sentences. Accordingly, preparing multiple sentences that correspond to the same meaning, that is, paraphrase sentences, can be an important foundation for natural language understanding.

The technical idea of the present invention provides a system and method for automatically generating paraphrase sentences based on an ontology.

To achieve the above object, a system for generating paraphrase sentences according to the technical idea of the present invention may include: a feature extraction unit that obtains semantic resources, each including an ontology component corresponding to a token extracted from a query sentence, and that generates a pattern, which is a combination of the semantic resources, and a frame, which has a structure in which the tokens of the query sentence are replaceable; a pattern indexing unit that associates the pattern and the frame with each other and stores the associated pattern and frame in a pattern storage unit; and a sentence generation unit that generates a paraphrase sentence of the query sentence based on the frames corresponding to the pattern stored in the pattern storage unit.

According to an exemplary embodiment of the present invention, the feature extraction unit may obtain a plurality of query sentences through a network.

According to an exemplary embodiment of the present invention, the feature extraction unit may include a natural language processing unit that extracts the tokens by performing natural language processing on the query sentence, and a pattern generation unit that receives the semantic resources corresponding to the tokens from a knowledge base and generates the pattern and the frame.

According to an exemplary embodiment of the present invention, the pattern indexing unit may store patterns that contain semantic resources in different orders in the pattern storage unit as different patterns.

According to an exemplary embodiment of the present invention, the sentence generation unit may include a pattern search unit that searches the pattern storage unit for frames corresponding to the pattern generated by the feature extraction unit, a lexical substitution unit that generates at least one preliminary paraphrase sentence by applying the tokens to the retrieved frames, and a sentence post-processing unit that generates the paraphrase sentence by modifying the at least one preliminary paraphrase sentence according to natural language rules.

According to an exemplary embodiment of the present invention, the lexical substitution unit may generate the at least one preliminary paraphrase sentence by applying synonyms of the tokens, obtained on the basis of the semantic resources, to the retrieved frames.

According to an exemplary embodiment of the present invention, the lexical substitution unit may generate the at least one preliminary paraphrase sentence by changing the structure of a retrieved frame.

According to an exemplary embodiment of the present invention, the lexical substitution unit may generate an additional preliminary paraphrase sentence by changing the position of an adverbial phrase in a retrieved frame.

According to an exemplary embodiment of the present invention, when a retrieved frame ends with a noun phrase, the lexical substitution unit may generate an additional preliminary paraphrase sentence by appending a question-ending phrase.

With the system and method according to the technical idea of the present invention, sentences are analyzed on the basis of a verified ontology, so that accurate and abundant paraphrase sentences can be generated automatically.

In addition, with the system and method according to the technical idea of the present invention, the abundant supply of paraphrase sentences can significantly improve the performance of tasks based on natural language understanding, such as question answering systems and knowledge extraction.

In addition, with the system and method according to the technical idea of the present invention, the abundant supply of paraphrase sentences can provide high-quality data for machine learning and thereby significantly broaden the use of machine learning.

The effects obtainable from the exemplary embodiments of the present invention are not limited to those mentioned above, and other effects not mentioned can be clearly derived and understood from the following description by a person of ordinary skill in the art to which the exemplary embodiments belong. That is, unintended effects of practicing the exemplary embodiments of the present invention can also be derived from those embodiments by a person of ordinary skill in the art.

FIG. 1 is a block diagram showing a system and its input/output relationship according to an exemplary embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the feature extraction unit of FIG. 1 according to an exemplary embodiment of the present invention.
FIGS. 3A and 3B show examples of query sentences, patterns, and frames according to exemplary embodiments of the present invention.
FIG. 4 shows an example of patterns and frames stored in the pattern storage unit of FIG. 1 according to an exemplary embodiment of the present invention.
FIG. 5 is a block diagram showing an example of the sentence generation unit of FIG. 1 according to an exemplary embodiment of the present invention.
FIGS. 6A and 6B are diagrams showing examples of the operation of the lexical substitution unit of FIG. 5 according to exemplary embodiments of the present invention.
FIG. 7 is a flowchart showing a method for generating a paraphrase sentence according to an exemplary embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The embodiments are provided to explain the present invention more completely to a person of average skill in the art. Because the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to a particular disclosed form, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. In describing the drawings, similar reference numerals are used for similar components. In the accompanying drawings, the dimensions of structures are enlarged or reduced relative to their actual size for clarity.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this application.

In the drawings and description below, a component shown or described as a single block may be a hardware block or a software block. For example, each component may be an independent hardware block that exchanges signals with the others, or a software block executed on a single processor. In addition, a "system" or "database" in this specification may refer to a computing system that includes at least one processor and a memory accessed by the processor.

FIG. 1 is a block diagram showing a system and its input/output relationship according to an exemplary embodiment of the present invention. As shown in FIG. 1, the paraphrase generation system 100 may be communicatively connected to a knowledge base 300 and a pattern storage unit 500. The paraphrase generation system 100 may receive a query sentence in natural language and, by communicating with the knowledge base 300 and the pattern storage unit 500, may generate a paraphrase sentence having the same meaning as the query sentence. As described below, the paraphrase generation system 100 may extract a pattern from the query sentence with reference to the knowledge base 300, and may store a plurality of frames corresponding to the pattern in the pattern storage unit 500 together with the pattern. The blocks 100, 300, and 500 shown in FIG. 1 may communicate with one another over a network or over dedicated channels for one-to-one communication. Two or more of the blocks 100, 300, and 500 may also be included in a single system (for example, a computing system), and in some embodiments the pattern storage unit 500 may be included in the paraphrase generation system 100.

The paraphrase generation system 100 may receive query sentences in various ways. In some embodiments, the paraphrase generation system 100 may include a user interface and may receive a query sentence from a person, that is, a user, through the user interface. In some embodiments, the paraphrase generation system 100 may connect to a network such as the Internet and may actively collect multiple query sentences from the network. As shown in FIG. 1, the paraphrase generation system 100 may include a feature extraction unit 120, a pattern indexing unit 140, and a sentence generation unit 160.

The feature extraction unit 120 may obtain semantic resources, each including an ontology component corresponding to a token extracted from the query sentence. An ontology represents things that exist or that people can perceive in a form a computer can handle, and ontology components may include an entity (E) (or instance), a class (C), a property (P), and a value (V). Ontology components may further include a relation (a property between entities or between classes), a function term, a restriction, a rule, an event, and the like. The knowledge base 300 may store a vast amount of knowledge data based on an ontology. For example, the knowledge base 300 may contain knowledge data expressed using RDF (Resource Description Framework), with the triple used as the unit of knowledge data. The knowledge base 300 may return triples in response to a query, for example a SPARQL (SPARQL Protocol and RDF Query Language) query. In some embodiments, the feature extraction unit 120 may pass the query sentence to a natural language recognition system (not shown) outside the paraphrase generation system 100 and may receive the tokens extracted from the query sentence from that system. In some embodiments, the feature extraction unit 120 may instead process the query sentence directly to produce the semantic resources, as described below with reference to FIG. 2.
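As a rough illustration of the triple-based lookup described above, the following sketch uses an in-memory list of triples in place of the knowledge base 300 and a wildcard match in place of a SPARQL query. The predicate names (`type`, `labelOf`), the class names, and the `match` helper are assumptions made for this example; the specification does not define the knowledge base schema.

```python
# Hypothetical stand-in for the knowledge base (300): a list of
# (subject, predicate, object) triples. The schema is illustrative only.
TRIPLES = [
    ("서울", "type", "City"),
    ("LA", "type", "City"),
    ("인구", "labelOf", "Population"),
    ("인구수", "labelOf", "Population"),
]


def match(subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the given fields.

    A field left as None acts as a wildcard, loosely mirroring a variable
    in a SPARQL triple pattern.
    """
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Look up what the toy knowledge base holds for the token "서울".
print(match(subject="서울"))
```

A real deployment would presumably send a SPARQL query to an RDF store; the list comprehension here only mimics the shape of the result.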

The feature extraction unit 120 may generate a pattern and a frame from the semantic resources obtained from the query sentence on the basis of the ontology. As described below with reference to FIGS. 3A and 3B, a pattern is a combination of semantic resources and may correspond to the meaning expressed by the query sentence. That is, sentences corresponding to the same pattern may be referred to as paraphrase sentences. Also as described below with reference to FIGS. 3A and 3B, a frame may have a structure in which the tokens extracted from the query sentence are replaceable within that sentence. As described below with reference to FIG. 4, one pattern may correspond to at least one frame, and the associated pattern and frame may be stored in the pattern storage unit 500 by the pattern indexing unit 140. The feature extraction unit 120 may provide the pattern and the frame to the pattern indexing unit 140, and may provide the pattern and a token list containing the extracted tokens to the sentence generation unit 160.

The pattern indexing unit 140 may receive the pattern and the frame generated from the query sentence from the feature extraction unit 120. The pattern indexing unit 140 may associate the pattern and the frame with each other and store the associated pattern and frame in the pattern storage unit 500. Accordingly, when a pattern stored in the pattern storage unit 500 is searched for (for example, by the sentence generation unit 160), the corresponding frames can be retrieved with it. In some embodiments, the pattern indexing unit 140 may search the pattern storage unit 500 for the pattern received from the feature extraction unit 120, and if both that pattern and the received frame are already stored in the pattern storage unit 500, it may skip storing them. The structure of the patterns and frames stored in the pattern storage unit 500 by the pattern indexing unit 140 is described below with reference to FIG. 4.
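The indexing behavior described above, including the skip for pairs that are already stored, could be sketched as follows. This is a minimal in-memory model under assumptions: the class name `PatternStore`, its method names, and the use of a dictionary are hypothetical and are not taken from the specification.

```python
from collections import defaultdict


class PatternStore:
    """Toy model of the pattern storage unit (500): pattern -> list of frames."""

    def __init__(self):
        self._frames = defaultdict(list)

    def index(self, pattern, frame):
        """Associate a frame with a pattern.

        Returns False (and stores nothing) when this exact pattern/frame
        pair is already present, mirroring the optional skip step.
        """
        key = tuple(pattern)  # tuples are hashable and keep resource order
        if frame in self._frames[key]:
            return False
        self._frames[key].append(frame)
        return True

    def frames_for(self, pattern):
        """Return a copy of all frames stored for the given pattern."""
        return list(self._frames.get(tuple(pattern), []))


store = PatternStore()
p1 = ["<Entity_City>", "<Property_Population>"]
store.index(p1, "<Entity_City>의 <Property_Population>는?")
store.index(p1, "<Entity_City>의 <Property_Population>를 알려주세요.")
```

Looking up `p1` afterwards returns both frames, which is the property the sentence generation unit relies on.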

The sentence generation unit 160 may receive the pattern and the token list corresponding to the query sentence from the feature extraction unit 120 and may generate a paraphrase sentence of the query sentence. The sentence generation unit 160 may search for the pattern in the pattern storage unit 500 and may obtain the retrieved pattern and its corresponding frames from the pattern storage unit 500. For example, the sentence generation unit 160 may search the pattern storage unit 500 for the pattern received from the feature extraction unit 120 and, based on the token list and the frames corresponding to the retrieved pattern, may generate a sentence having the same meaning as the query sentence, that is, a paraphrase sentence. An example of the sentence generation unit 160 is described below with reference to FIG. 5.

As described below with reference to the drawings, the paraphrase generation system 100 may generate a paraphrase sentence of a query sentence using the ontology-based knowledge base 300. The query sentence can therefore be analyzed on the basis of a verified ontology, and accurate and abundant paraphrase sentences can be generated automatically. In addition, the paraphrase sentences generated by the paraphrase generation system 100 can significantly improve the performance of tasks based on natural language understanding, such as question answering systems and knowledge extraction, and can provide suitable high-quality data for machine learning.

FIG. 2 is a block diagram showing an example of the feature extraction unit 120 of FIG. 1 according to an exemplary embodiment of the present invention, and FIGS. 3A and 3B show examples of query sentences, patterns, and frames according to exemplary embodiments of the present invention. As described above with reference to FIG. 1, the feature extraction unit 120' of FIG. 2 may receive a query sentence and may generate a pattern and a frame corresponding to the query sentence with reference to the knowledge base 300. As shown in FIG. 2, the feature extraction unit 120' may include a natural language processing unit 122 and a pattern generation unit 124. FIGS. 2, 3A, and 3B are described below with reference to FIG. 1.

In some embodiments, the feature extraction unit 120' may extract tokens by performing natural language processing on the query sentence and may obtain the semantic resources corresponding to the tokens from the knowledge base 300. As described above with reference to FIG. 1, in some embodiments, unlike the example of FIG. 2, the feature extraction unit 120 of FIG. 1 may pass the query sentence to a natural language processing system outside the paraphrase generation system 100 and receive the tokens from it, instead of processing the query sentence itself.

The natural language processing unit 122 may extract tokens from the query sentence. For example, as shown in FIG. 3A, for the first query sentence (Q1), "서울의 인구는?" ("What is the population of Seoul?"), the natural language processing unit 122 may generate a token list containing "서울", "의", "인구", "는", and "?". In some embodiments, a token extracted by the natural language processing unit 122 may carry features in addition to the word itself. For example, a token may include a part of speech (such as noun, adjective, or adverb) or the sentence constituent it forms in the query sentence (such as subject, object, or adverbial). The natural language processing unit 122 may extract tokens in any manner. For example, it may extract tokens with reference to the knowledge base 300, determining the part of speech of a word in the query sentence on the basis of the entities, properties, and other elements of the knowledge base 300. In addition, in some embodiments, the extracted tokens may have a tree structure, as described below with reference to FIGS. 6A and 6B.
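One possible shape for a token that carries features alongside the word is sketched below. The `Token` class, its field names, and the tag strings are assumptions for illustration; the specification does not prescribe a data structure or a tag set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """A word plus optional features (hypothetical layout)."""

    word: str
    pos: Optional[str] = None   # part of speech, if available
    role: Optional[str] = None  # sentence constituent, if available


# Token list for Q1, "서울의 인구는?". The tag names are illustrative.
q1_tokens = [
    Token("서울", pos="noun"),
    Token("의", pos="particle"),
    Token("인구", pos="noun"),
    Token("는", pos="particle"),
    Token("?", pos="punctuation"),
]

words = [t.word for t in q1_tokens]
```

In practice the tokens would come from a morphological analyzer rather than being written by hand.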

The pattern generation unit 124 may receive the tokens from the natural language processing unit 122 and may obtain the semantic resources corresponding to the received tokens from the knowledge base 300. For example, as shown in FIG. 3A, for the first query sentence (Q1), "서울의 인구는?" ("What is the population of Seoul?"), the entity "<Entity_City>" may be obtained as the semantic resource of "서울", and the property "<Property_Population>" may be obtained as the semantic resource of "인구". Accordingly, the first pattern (P1) may include "<Entity_City>" and "<Property_Population>" as its semantic resources. Likewise, as shown in FIG. 3B, for the second query sentence (Q2), "LA의 인구수를 알려주세요." ("Please tell me the population of LA."), the entity "<Entity_City>" may be obtained as the semantic resource of "LA", and the property "<Property_Population>" may be obtained as the semantic resource of "인구수". The second pattern (P2) may include "<Entity_City>" and "<Property_Population>" and may therefore be identical to the first pattern (P1).
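The mapping from tokens to a pattern can be sketched with a plain lookup table. The `RESOURCES` dictionary stands in for queries to the knowledge base 300, and `make_pattern` is a hypothetical helper name; both are assumptions for this example.

```python
# Hypothetical token -> semantic resource table, standing in for lookups
# against the knowledge base (300).
RESOURCES = {
    "서울": "<Entity_City>",
    "LA": "<Entity_City>",
    "인구": "<Property_Population>",
    "인구수": "<Property_Population>",
}


def make_pattern(tokens):
    """Keep only tokens that have a semantic resource, in sentence order."""
    return tuple(RESOURCES[t] for t in tokens if t in RESOURCES)


p1 = make_pattern(["서울", "의", "인구", "는", "?"])
p2 = make_pattern(["LA", "의", "인구수", "를", "알려주세요", "."])
print(p1 == p2)
```

Tokens without a semantic resource, such as particles and punctuation, do not contribute to the pattern, which is why the two differently worded questions end up with the same pattern.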

In some embodiments, a pattern may define the order of its semantic resources as well as their combination. That is, even when patterns contain the same semantic resources, if the order of those resources differs, the patterns may be stored in the pattern storage unit 500 by the pattern indexing unit 140 as different patterns.
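The difference between order-sensitive and order-insensitive patterns comes down to the choice of storage key. The sketch below contrasts the two; `pattern_key` and its `ordered` flag are hypothetical names introduced only for this comparison.

```python
def pattern_key(resources, ordered=True):
    """Build a storage key for a pattern.

    ordered=True keeps the sequence, so reordered patterns get distinct keys.
    ordered=False is shown only for contrast: it ignores the sequence.
    """
    return tuple(resources) if ordered else frozenset(resources)


a = ["<Entity_City>", "<Property_Population>"]
b = ["<Property_Population>", "<Entity_City>"]

print(pattern_key(a) == pattern_key(b))                                # False
print(pattern_key(a, ordered=False) == pattern_key(b, ordered=False))  # True
```

With ordered keys, the two sequences would be indexed as separate patterns, each with its own frames.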

In addition, the pattern generation unit 124 may generate a frame based on the query sentence and the pattern. For example, as shown in FIG. 3A, based on the first query sentence (Q1) and the first pattern (P1), the pattern generation unit 124 may generate a first frame (F11), namely "<Entity_City>의 <Property_Population>는?". In the first frame (F11), an instance such as "서울" or "LA" may be applied to (or substituted for) the semantic resource corresponding to the entity, "<Entity_City>", and one of the synonyms "인구" and "인구수" may be applied to the semantic resource corresponding to the property, "<Property_Population>". Accordingly, when the tokens of a query sentence that differs from the first query sentence (Q1) but is analyzed as the first pattern (P1) are applied to the first frame (F11), a paraphrase sentence of that query sentence may be generated. Similarly, as shown in FIG. 3B, based on the second query sentence (Q2) and the second pattern (P2), the pattern generation unit 124 may generate a second frame (F12), namely "<Entity_City>의 <Property_Population>를 알려주세요.". As with the first frame (F11), in the second frame (F12) an instance such as "서울" or "LA" may be applied to "<Entity_City>", and one of the synonyms "인구" and "인구수" may be applied to "<Property_Population>". Accordingly, when the tokens of a query sentence that differs from the second query sentence (Q2) but is analyzed as the first pattern (P1) are applied to the second frame (F12), a paraphrase sentence of that query sentence may be generated.

In the example of FIGS. 3A and 3B, the first query sentence Q1 and the second query sentence Q2 correspond to the same pattern (that is, the first pattern P1 and the second pattern P2 are identical). Therefore, applying the tokens of the first query sentence Q1 to the second frame F12 may generate a paraphrase sentence of the first query sentence Q1, namely "서울의 인구를 알려주세요." ("Please tell me the population of Seoul."). Similarly, applying the tokens of the second query sentence Q2 to the first frame F11 may generate a paraphrase sentence of the second query sentence Q2, namely "LA의 인구는?" ("What is the population of LA?").
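The cross-application of tokens and frames in this example can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function `apply_tokens`, the regular expression, and the token dictionaries are assumptions, and the token values are chosen to reproduce the two paraphrases quoted above.

```python
import re

# Matches slots written as <Resource_Label>, the notation used for frames F11 and F12.
SLOT = re.compile(r"<([A-Za-z_]+)>")

def apply_tokens(frame: str, tokens: dict[str, str]) -> str:
    """Replace each <Resource> slot in the frame with the token bound to that resource."""
    return SLOT.sub(lambda m: tokens[m.group(1)], frame)

F11 = "<Entity_City>의 <Property_Population>는?"
F12 = "<Entity_City>의 <Property_Population>를 알려주세요."

# Hypothetical token bindings for the two query sentences.
q1_tokens = {"Entity_City": "서울", "Property_Population": "인구"}
q2_tokens = {"Entity_City": "LA", "Property_Population": "인구"}

print(apply_tokens(F12, q1_tokens))  # 서울의 인구를 알려주세요.
print(apply_tokens(F11, q2_tokens))  # LA의 인구는?
```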

FIG. 4 shows an example of patterns and frames stored in the pattern storage unit 500 of FIG. 1 according to an exemplary embodiment of the present invention. As described above with reference to FIG. 1, the pattern index unit 140 may associate a pattern and a frame with each other and may store the mutually corresponding pattern and frame in the pattern storage unit 500. In the following, FIG. 4 is described with reference to FIG. 1.

As described above with reference to FIGS. 3A and 3B, a plurality of frames may correspond to a single pattern. After the pattern corresponding to a given query sentence is determined, a plurality of paraphrase sentences may be generated by applying the tokens of the query sentence to the plurality of frames corresponding to that pattern. Accordingly, as shown in FIG. 4, the pattern storage unit 500 may include a plurality of groups G1 to Gn corresponding respectively to a plurality of patterns P1 to Pn (n is an integer greater than 1). Each of the groups G1 to Gn may include one pattern and a plurality of frames corresponding to that pattern. For example, the first group G1 may include the first pattern P1 and a plurality of frames F11 to F1x corresponding to it (x is an integer greater than 1), the second group G2 may include the second pattern P2 and a plurality of frames F21 to F2y corresponding to it (y is an integer greater than 1), and the n-th group Gn may include the n-th pattern Pn and a plurality of frames Fn1 to Fnz corresponding to it (z is an integer greater than 1).

The pattern index unit 140 may store patterns and frames in the pattern storage unit 500 so that they have the structure shown in FIG. 4, and, owing to that structure, the sentence generation unit 160 may obtain the plurality of frames corresponding to a retrieved pattern. For example, when the pattern that the sentence generation unit 160 receives from the feature extraction unit 120 corresponds to the second pattern P2, the sentence generation unit 160 may search the pattern storage unit 500 for the second pattern P2 and obtain the plurality of frames F21 to F2y corresponding to it. It will be understood that the example of FIG. 4 merely shows the correspondence between a pattern and a plurality of frames, and that any data structure may be used to implement the structure shown in FIG. 4. Further, in some embodiments, a frame may have a tree structure, as described below with reference to FIG. 6A.
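Since any data structure may be used, one possible realization of the grouping in FIG. 4 is a mapping from a pattern to its list of frames. The sketch below is an assumption for illustration only: the class `PatternStore` and its methods `index` and `lookup` are hypothetical names that loosely mirror the roles of the pattern index unit 140 and the pattern search described here.

```python
from collections import defaultdict

class PatternStore:
    """Hypothetical in-memory store: each pattern (a tuple) maps to its group of frames."""

    def __init__(self):
        self._groups = defaultdict(list)

    def index(self, pattern, frame):
        """Associate a frame with a pattern, ignoring exact duplicates."""
        frames = self._groups[pattern]
        if frame not in frames:
            frames.append(frame)

    def lookup(self, pattern):
        """Return a copy of the frames in the pattern's group (empty if unknown)."""
        return list(self._groups.get(pattern, []))

P1 = ("Entity_City", "Property_Population")
pattern_store = PatternStore()
pattern_store.index(P1, "<Entity_City>의 <Property_Population>는?")
pattern_store.index(P1, "<Entity_City>의 <Property_Population>를 알려주세요.")
pattern_store.index(P1, "<Entity_City>의 <Property_Population>는?")  # duplicate, not stored twice

print(pattern_store.lookup(P1))
```

A persistent store, such as a database table keyed by a serialized pattern, would serve equally well; the dictionary is used only to keep the sketch short.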

FIG. 5 is a block diagram showing an example of the sentence generation unit 160 of FIG. 1 according to an exemplary embodiment of the present invention. As described above with reference to FIG. 1, the sentence generation unit 160' of FIG. 5 may receive the pattern and the frame corresponding to a query sentence from the feature extraction unit 120, and may generate a paraphrase sentence of the query sentence by referring to the patterns and frames stored in the pattern storage unit 500. As shown in FIG. 5, the sentence generation unit 160' may include a pattern search unit 162, a vocabulary substitution unit 164, and a sentence post-processing unit 166. In the following, FIG. 5 is described with reference to FIG. 1.

The pattern search unit 162 may receive a pattern from the feature extraction unit 120 and may search the pattern storage unit 500 for frames corresponding to the received pattern. As described above with reference to FIG. 4, the pattern storage unit 500 may be structured to hold mutually corresponding patterns and frames, so the pattern search unit 162 may obtain the frames corresponding to the pattern from the pattern storage unit 500 by looking up the received pattern there.

The vocabulary substitution unit 164 may receive a token list from the feature extraction unit 120 and may receive frames from the pattern search unit 162. As described above with reference to FIG. 1, the token list may include the tokens extracted from the query sentence, and the vocabulary substitution unit 164 may generate preliminary paraphrase sentences by applying the tokens in the token list to the frames received from the pattern search unit 162. For example, when it receives the token list extracted from the first query sentence Q1 of FIG. 3A and receives the second frame F12 of FIG. 3B from the pattern search unit 162, the vocabulary substitution unit 164 may apply "서울" (Seoul) and "인구" (population) from the token list to the second frame F12 and generate the preliminary paraphrase sentence "서울의 인구수를 알려주세요." ("Please tell me the population count of Seoul."). Similarly, when it receives the token list extracted from the second query sentence Q2 of FIG. 3B and receives the first frame F11 of FIG. 3A from the pattern search unit 162, the vocabulary substitution unit 164 may apply "LA" and "인구수" (population count) from the token list to the first frame F11 and generate the preliminary paraphrase sentence "LA의 인구수는?" ("What is the population count of LA?").

In some embodiments, the vocabulary substitution unit 164 may generate a preliminary paraphrase sentence by applying a synonym of the word corresponding to a token to the frame. For example, when it receives the token list extracted from the second query sentence Q2 of FIG. 3B and receives the first frame F11 of FIG. 3A from the pattern search unit 162, the vocabulary substitution unit 164 may generate, as preliminary paraphrase sentences, not only "LA의 인구수는?" but also "로스앤젤레스의 인구수는?" ("What is the population count of Los Angeles?"), and, based on the second frame F12, "로스앤젤레스의 인구수를 알려주세요." ("Please tell me the population count of Los Angeles."). Also, in some embodiments, the vocabulary substitution unit 164 may generate a preliminary paraphrase sentence by changing the structure of the frame. Examples of the vocabulary substitution unit 164 generating a preliminary paraphrase sentence by changing the structure of the frame are described below with reference to FIGS. 6A and 6B.
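Synonym substitution of this kind amounts to enumerating every combination of synonym choices for the slots of a frame. The following sketch assumes a hypothetical synonym table and a hypothetical function `expand_with_synonyms`; in the described system the synonyms would presumably come from the semantic resources rather than a hard-coded dictionary.

```python
from itertools import product

def expand_with_synonyms(frame, tokens, synonyms):
    """Fill the frame once for every combination of synonyms of the given tokens."""
    slots = list(tokens)
    # For each slot, the candidate words are the token's synonyms, or the token itself.
    choices = [synonyms.get(tokens[slot], [tokens[slot]]) for slot in slots]
    sentences = []
    for combo in product(*choices):
        sentence = frame
        for slot, word in zip(slots, combo):
            sentence = sentence.replace(f"<{slot}>", word)
        sentences.append(sentence)
    return sentences

# Hypothetical synonym table; each list includes the original word.
SYNONYMS = {"LA": ["LA", "로스앤젤레스"]}

F11 = "<Entity_City>의 <Property_Population>는?"
q2_tokens = {"Entity_City": "LA", "Property_Population": "인구수"}

for sentence in expand_with_synonyms(F11, q2_tokens, SYNONYMS):
    print(sentence)
# LA의 인구수는?
# 로스앤젤레스의 인구수는?
```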

The sentence post-processing unit 166 may receive a preliminary paraphrase sentence from the vocabulary substitution unit 164 and may generate a paraphrase sentence by correcting the preliminary paraphrase sentence according to natural language rules. A preliminary paraphrase sentence, in which the vocabulary substitution unit 164 has applied tokens to a frame, may contain parts that violate natural language rules. For example, a preliminary paraphrase sentence may contain an inappropriate postposition (particle) and/or word ending, and the sentence post-processing unit 166 may correct the postposition and/or ending based on natural language rules to produce a natural-sounding paraphrase sentence.
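The document does not state which natural language rules are applied. As one illustration only, a well-known rule of Korean is that several particles have two forms chosen by whether the preceding syllable ends in a final consonant. The sketch below implements that single rule; the function names and the particle table are assumptions, and non-Hangul words such as "LA" are simply treated as having no final consonant.

```python
def has_final_consonant(word: str) -> bool:
    """Return True if the last character is a Hangul syllable with a final consonant."""
    last = word[-1]
    if "가" <= last <= "힣":
        # Precomposed syllables are laid out as 0xAC00 + (initial*21 + medial)*28 + final.
        return (ord(last) - 0xAC00) % 28 != 0
    return False  # simplification for non-Hangul text

# Hypothetical table: particle -> (form after a final consonant, form after a vowel).
PARTICLE_FORMS = {
    "은": ("은", "는"), "는": ("은", "는"),
    "을": ("을", "를"), "를": ("을", "를"),
    "이": ("이", "가"), "가": ("이", "가"),
}

def attach_particle(word: str, particle: str) -> str:
    """Attach the particle to the word, choosing the form that fits the word's last syllable."""
    after_consonant, after_vowel = PARTICLE_FORMS[particle]
    return word + (after_consonant if has_final_consonant(word) else after_vowel)

print(attach_particle("서울", "는"))  # 서울은
print(attach_particle("인구", "을"))  # 인구를
```

Corrections that depend on grammatical role rather than spelling, such as changing a topic particle to a subject particle, would need syntactic information and are outside this sketch.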

FIGS. 6A and 6B are diagrams showing examples of the operation of the vocabulary substitution unit 164 of FIG. 5 according to exemplary embodiments of the present invention. Specifically, FIGS. 6A and 6B show examples in which the vocabulary substitution unit 164 generates preliminary paraphrase sentences by changing the structure of a frame. The vocabulary substitution unit 164 may change the structure of the frame itself, or may change the structure of a preliminary paraphrase sentence obtained by applying tokens to the frame. In the following, FIGS. 6A and 6B are described with reference to FIG. 5, and content common to both figures is not repeated. For convenience of description, FIGS. 6A and 6B show examples of changing the structure of a preliminary paraphrase sentence in which tokens have been applied to a frame.

Referring to FIG. 6A, the vocabulary substitution unit 164 may generate an additional preliminary paraphrase sentence by changing the position of an adverbial phrase in the frame. For example, the preliminary paraphrase sentence "스미스소니언 미술관은 미국의 어느 도시에 있어?" ("In which city of the United States is the Smithsonian art museum?") may be represented by the tree structure shown in the upper part of FIG. 6A. In FIG. 6A, the labels shown below the tokens are properties of the tokens and may indicate parts of speech, sentence constituents, and the like. For example, "VNP" may denote an affirmative copula phrase, that is, "noun + 이다"; "NP_SBJ" may indicate that a nominal (e.g., a noun, pronoun, or numeral) is used as a subject; and "NP_AJT" may indicate that a nominal is used as an adverbial (a modifier of a predicate). In addition, "NP_MOD" may indicate that a nominal is used as an adnominal (a modifier of a nominal), and "DP" may denote a determiner phrase.

The vocabulary substitution unit 164 may change the position of the sub-tree corresponding to the adverbial phrase. For example, as shown in the lower part of FIG. 6A, the vocabulary substitution unit 164 may swap the positions of the sub-tree corresponding to "미국의 어느 도시에" ("in which city of the United States") and "스미스소니언 미술관은" ("the Smithsonian art museum", with the topic particle 은). As a result, the preliminary paraphrase sentence "미국의 어느 도시에 스미스소니언 미술관은 있어?" may be generated. The sentence post-processing unit 166 of FIG. 5 may then correct the particle (은 to 이) and finally generate the paraphrase sentence "미국의 어느 도시에 스미스소니언 미술관이 있어?" ("In which city of the United States is there a Smithsonian art museum?").
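Moving the adverbial sub-tree can be sketched on a flattened version of the tree, where each top-level constituent is a (label, text) pair. The flattening, the function `front_adverbial`, and the label "VP" for the predicate are assumptions made for illustration; only "NP_SBJ" and "NP_AJT" are labels named in the description.

```python
def front_adverbial(constituents):
    """Move the first adverbial constituent (label NP_AJT) to the front of the sentence."""
    for i, (label, _text) in enumerate(constituents):
        if label == "NP_AJT":
            return [constituents[i]] + constituents[:i] + constituents[i + 1:]
    return list(constituents)  # no adverbial found: leave the order unchanged

def to_sentence(constituents):
    """Join constituent texts with spaces."""
    return " ".join(text for _label, text in constituents)

sentence = [
    ("NP_SBJ", "스미스소니언 미술관은"),
    ("NP_AJT", "미국의 어느 도시에"),
    ("VP", "있어?"),  # hypothetical label for the predicate
]

print(to_sentence(front_adverbial(sentence)))
# 미국의 어느 도시에 스미스소니언 미술관은 있어?
```

The output still carries the particle 은; correcting it to 이 is the job of the post-processing stage, not of the reordering.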

Referring to FIG. 6B, the vocabulary substitution unit 164 may generate an additional preliminary paraphrase sentence by adding one or more phrases. For example, the preliminary paraphrase sentence "스미스소니언 미술관이 있는 도시는?" ("The city where the Smithsonian art museum is located?") may be structured as shown on the left side of FIG. 6B. When the retrieved frame ends with a noun phrase, the vocabulary substitution unit 164 may generate an additional preliminary paraphrase sentence by appending a question-ending phrase. For example, as shown on the right side of FIG. 6B, the vocabulary substitution unit 164 may append "어디야" ("where is it") as the question-ending phrase to the frame that ends with the noun phrase "도시는" ("the city", with a topic particle), thereby generating the preliminary paraphrase sentence "스미스소니언 미술관이 있는 도시는 어디야?" ("Where is the city in which the Smithsonian art museum is located?").
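The rule for noun-phrase-final frames can be sketched with the same flattened (label, text) representation. Everything here is illustrative: the function `add_query_terminator`, the labels "VP_MOD" and "VP", and the handling of the question mark are assumptions, not details taken from FIG. 6B.

```python
def add_query_terminator(constituents, terminator="어디야"):
    """If the sentence ends with a noun phrase, append a question-ending phrase."""
    if not constituents:
        return []
    label, text = constituents[-1]
    if not label.startswith("NP"):
        return list(constituents)  # rule does not apply
    # Move the question mark from the noun phrase to the appended phrase.
    noun_phrase = (label, text.rstrip("?"))
    return constituents[:-1] + [noun_phrase, ("VP", terminator + "?")]

sentence = [
    ("VP_MOD", "스미스소니언 미술관이 있는"),  # hypothetical label for the modifying clause
    ("NP_SBJ", "도시는?"),
]

result = add_query_terminator(sentence)
print(" ".join(text for _label, text in result))
# 스미스소니언 미술관이 있는 도시는 어디야?
```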

FIG. 7 is a flowchart showing a method of generating a paraphrase sentence according to an exemplary embodiment of the present invention. For example, the method of FIG. 7 may be performed by the paraphrase generation system 100 of FIG. 1. Specifically, steps S10, S20, and S30 may be performed by the feature extraction unit 120 of FIG. 1, and steps S40, S50, and S60 may be performed by the sentence generation unit 160. In the following, FIG. 7 is described with reference to FIG. 1.

Referring to FIG. 7, in step S10, an operation of receiving a query sentence may be performed. For example, the feature extraction unit 120 may receive a query sentence from a user through a user interface, or may collect query sentences from a network.

In step S20, an operation of obtaining semantic resources may be performed. For example, the feature extraction unit 120 may obtain, from the knowledge base 300, semantic resources corresponding to the tokens extracted by natural language processing of the query sentence. As described above with reference to the drawings, the knowledge base 300 may be built on an ontology and may contain verified knowledge data, and may therefore be used to generate paraphrase sentences of the query sentence. The natural language processing of the query sentence may be performed outside the paraphrase generation system 100 in some embodiments, or by the feature extraction unit 120 in other embodiments.

In step S30, an operation of generating a pattern and a frame may be performed. For example, the feature extraction unit 120 may generate a pattern as a combination of the semantic resources obtained based on the query sentence, and may generate a frame having a structure in which the tokens in the query sentence are replaceable. The feature extraction unit 120 may also generate a token list containing the extracted tokens.

In step S40, an operation of searching for the pattern may be performed. For example, the pattern storage unit 500 may store mutually corresponding patterns and frames, and one pattern may correspond to a plurality of frames. The sentence generation unit 160 may search the pattern storage unit 500 for the pattern generated in step S30 and may obtain, from the pattern storage unit 500, the frames corresponding to the found pattern.

In step S50, an operation of applying the tokens to the frames may be performed. For example, the sentence generation unit 160 may generate preliminary paraphrase sentences by applying the tokens in the token list to the frames obtained from the pattern storage unit 500. In some embodiments, the sentence generation unit 160 may generate additional preliminary paraphrase sentences by using synonyms of the tokens, or by changing the structure of a frame or of a preliminary paraphrase sentence.

In step S60, post-processing of the sentence may be performed. For example, a preliminary paraphrase sentence generated in step S50 may contain parts that violate natural language rules, and the sentence generation unit 160 may finally generate the paraphrase sentence by correcting the postpositions and/or endings of the preliminary paraphrase sentence according to natural language rules.
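Steps S10 to S50 can be tied together in a toy, end-to-end sketch. All names and data below are hypothetical: the knowledge base is a small dictionary, tokens are found by simple string matching rather than real natural language processing, the frame derived from the query itself is skipped by assumption, and the post-processing of step S60 is omitted.

```python
import re

# Hypothetical stand-in for the knowledge base 300: token -> semantic resource label.
KNOWLEDGE_BASE = {
    "서울": "Entity_City", "LA": "Entity_City",
    "인구": "Property_Population", "인구수": "Property_Population",
}
# Hypothetical stand-in for the pattern storage unit 500: pattern -> frames.
PATTERN_STORE = {
    ("Entity_City", "Property_Population"): [
        "<Entity_City>의 <Property_Population>는?",
        "<Entity_City>의 <Property_Population>를 알려주세요.",
    ],
}

# Longest tokens first, so that "인구수" is preferred over its prefix "인구".
TOKEN_RE = re.compile("|".join(sorted(map(re.escape, KNOWLEDGE_BASE), key=len, reverse=True)))
SLOT_RE = re.compile(r"<[A-Za-z_]+>")

def extract_features(query):
    """S20-S30: return (pattern, frame, token list) for a query sentence."""
    tokens, resources = [], []

    def to_slot(match):
        tokens.append(match.group(0))
        resources.append(KNOWLEDGE_BASE[match.group(0)])
        return f"<{resources[-1]}>"

    frame = TOKEN_RE.sub(to_slot, query)
    return tuple(resources), frame, tokens

def generate_paraphrases(query):
    """S10-S50: look up the query's pattern and fill the other stored frames."""
    pattern, own_frame, tokens = extract_features(query)
    results = []
    for frame in PATTERN_STORE.get(pattern, []):  # S40
        if frame == own_frame:
            continue  # assumption: skip the frame that reproduces the query itself
        remaining = iter(tokens)
        # S50: slots appear in pattern order, so tokens are filled positionally.
        results.append(SLOT_RE.sub(lambda m: next(remaining), frame))
    return results

print(generate_paraphrases("서울의 인구는?"))
print(generate_paraphrases("LA의 인구수를 알려주세요."))
```

Positional filling works here because a pattern fixes the order of its semantic resources; a frame that reorders its slots would need the tokens to be bound by resource label instead.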

Exemplary embodiments have been disclosed above in the drawings and the specification. Although the embodiments have been described using specific terms, those terms are used only to explain the technical idea of the present invention and are not intended to limit their meaning or to restrict the scope of the invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical idea of the appended claims.

Claims

A system for generating a paraphrase sentence, the system comprising:
a feature extraction unit configured to obtain semantic resources, each including an ontology component corresponding to one of the tokens extracted from a query sentence, and to generate a pattern that is a combination of the semantic resources and a frame having a structure in which the tokens in the query sentence are replaceable;
a pattern index unit configured to associate patterns and frames with each other and to store the mutually corresponding patterns and frames in a pattern storage unit; and
a sentence generation unit configured to generate a paraphrase sentence of the query sentence based on frames corresponding to the pattern stored in the pattern storage unit.

The system according to claim 1,
wherein the feature extraction unit is configured to obtain a plurality of query sentences through a network.

The system according to claim 1,
wherein the feature extraction unit comprises:
a natural language processing unit configured to extract the tokens by performing natural language processing on the query sentence; and
a pattern generation unit configured to receive the semantic resources corresponding to the tokens from a knowledge base and to generate the pattern and the frame.

The system according to claim 1,
wherein the pattern index unit is configured to store patterns that contain semantic resources in different orders in the pattern storage unit as different patterns.

The system according to claim 1,
wherein the sentence generation unit comprises:
a pattern search unit configured to search the pattern storage unit for frames corresponding to the pattern generated by the feature extraction unit;
a vocabulary substitution unit configured to generate at least one preliminary paraphrase sentence by applying the tokens to the retrieved frames; and
a sentence post-processing unit configured to generate the paraphrase sentence by correcting the at least one preliminary paraphrase sentence according to natural language rules.

The system according to claim 5,
wherein the vocabulary substitution unit is configured to generate the at least one preliminary paraphrase sentence by applying synonyms of the tokens to the retrieved frames based on the semantic resources.

The system according to claim 5,
wherein the vocabulary substitution unit is configured to generate the at least one preliminary paraphrase sentence by changing the structure of a retrieved frame.

The system according to claim 7,
wherein the vocabulary substitution unit is configured to generate an additional preliminary paraphrase sentence by changing the position of an adverbial phrase in the retrieved frame.

The system according to claim 7,
wherein the vocabulary substitution unit is configured to generate an additional preliminary paraphrase sentence by adding a question-ending phrase when the retrieved frame ends with a noun phrase.