KR102763200B1

KR102763200B1 - Apparatus and method for extracting domain adaptive data

Info

Publication number: KR102763200B1
Application number: KR1020240073424A
Authority: KR
Inventors: 김한길
Original assignee: 주식회사 리턴제로
Priority date: 2024-06-05
Filing date: 2024-06-05
Publication date: 2025-02-07
Anticipated expiration: 2044-06-05

Abstract

The present disclosure relates to a domain adaptive data extraction device and method. The device comprises: a communication module that communicates with an external device; a memory storing at least one process for performing a domain adaptive data extraction operation; and a processor that performs the domain adaptive data extraction operation according to the process. The processor defines at least one basic keyword corresponding to each of a first domain and a second domain; collects, based on the at least one basic keyword, words other than stop words as at least one keyword for each of the first domain and the second domain; calculates a word distribution entropy for each keyword within a first data pool to extract at least one core keyword; and extracts at least one data item from a second data pool according to a preset condition based on the at least one core keyword. The first domain is the domain to which the model the user wants to create will be applied, and the second domain is at least one domain to be excluded from the first domain; there may be one or more second domains.

Description

APPARATUS AND METHOD FOR EXTRACTING DOMAIN ADAPTIVE DATA

The present disclosure relates to a data extraction device and method, and more particularly, to a domain adaptive data extraction device and method.

To create a model for a specific domain, the most important thing is to secure data suited to that domain, not simply a large amount of data. Although the amount of refined data grows day by day, it is very difficult to identify which of that data suits a specific domain.

To begin with, data is rarely stored together with metadata indicating what kind of information it contains and what its characteristics are, and even data that looks similar is often not a good fit once the situation or purpose differs. For example, even within the same financial domain, the terms and expressions used by Bank A and Insurance Company B may differ, and even within Bank A, a customer consultation differs from an internal meeting. As a result, data is usually reprocessed and relabeled for each new domain, which is a very costly task.

Another difficulty in creating an artificial intelligence (AI) model for a specific domain is defining what that 'domain' is. Specifically, it is not easy to specify how the domain should be defined and what data it should contain. Creating new data requires defining the domain first, yet that definition is itself difficult, which creates a dilemma.

Therefore, there is a need for a technology that defines a domain using only a few keywords and makes it easy to extract meaningful data related to that domain.

Korean Registered Patent Publication No. 10-2580512 (Registration Date: September 15, 2023)

The present disclosure provides a domain adaptive data extraction device and method that define a target domain and surrounding domains from only a few keywords through a hybrid approach using both keyword search and semantic search, thereby making it easier to extract not only data whose text is similar but also data whose context is similar.

The problems to be solved by the present disclosure are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

A domain adaptive data extraction device according to one aspect of the present disclosure for achieving the above technical problem comprises: a communication module that communicates with an external device; a memory storing at least one process for performing a domain adaptive data extraction operation; and a processor that performs the domain adaptive data extraction operation according to the process. The processor defines at least one basic keyword corresponding to each of a first domain and a second domain; collects, based on the at least one basic keyword, words other than stop words as at least one keyword for each of the first domain and the second domain; calculates a word distribution entropy for each keyword within a first data pool to extract at least one core keyword; and extracts at least one data item from a second data pool according to a preset condition based on the at least one core keyword. The first domain is the domain to which the model the user wants to create will be applied, and the second domain is at least one domain to be excluded from the first domain; there may be one or more second domains.

Meanwhile, a domain adaptive data extraction method according to one aspect of the present disclosure includes: once at least one basic keyword corresponding to each of a first domain and a second domain is defined, collecting, based on the at least one basic keyword, words other than stop words as at least one keyword for each of the first domain and the second domain; calculating a word distribution entropy for each keyword within a first data pool to extract at least one core keyword; and extracting at least one data item from a second data pool according to a preset condition based on the at least one core keyword. The first domain is the domain to which the model the user wants to create will be applied, and the second domain is at least one domain to be excluded from the first domain; there may be one or more second domains.

In addition, a computer program stored in a computer-readable recording medium for executing a method for implementing the present disclosure may be further provided.

In addition, a computer-readable recording medium on which a computer program for executing a method for implementing the present disclosure is recorded may be further provided.

According to the above means of the present disclosure, a target domain and surrounding domains can be defined from only a few keywords through a hybrid approach using both keyword search and semantic search, which makes it easier to extract not only data whose text is similar but also data whose context is similar.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a diagram illustrating a network structure of a domain adaptive data extraction system according to one embodiment of the present disclosure.
FIG. 2 is a diagram illustrating the configuration of a service server for domain adaptive data extraction according to one embodiment of the present disclosure.
FIG. 3 is a diagram illustrating a domain adaptive data extraction method according to one embodiment of the present disclosure.
FIG. 4 is a diagram illustrating an operation of collecting at least one keyword for each domain in a domain adaptive data extraction method according to one embodiment of the present disclosure.
FIG. 5 is a diagram illustrating an operation of extracting at least one piece of core data based on at least one keyword in a domain adaptive data extraction method according to one embodiment of the present disclosure.
FIG. 6 is a diagram illustrating an operation of extracting similar data for each of at least one data item in a domain adaptive data extraction method according to one embodiment of the present disclosure.

The advantages and features of the present disclosure, and the methods for achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the present disclosure complete and to fully inform a person skilled in the art of its scope, and the present disclosure is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present disclosure. In this specification, the singular also includes the plural unless the wording specifically states otherwise. The terms "comprises" and/or "comprising" as used herein do not exclude the presence or addition of one or more components other than those mentioned. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the mentioned components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical spirit of the present disclosure.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meaning commonly understood by those skilled in the art to which the present disclosure belongs. In addition, terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless explicitly and specifically defined.

Throughout the present disclosure, the same reference numerals refer to the same components. The present disclosure does not describe every element of the embodiments, and content that is common knowledge in the technical field of the present disclosure or that overlaps between embodiments is omitted. The term "unit" or "module" used in the specification means a software component or a hardware component such as an FPGA or an ASIC, and a "unit" or "module" performs certain roles. However, a "unit" or "module" is not limited to software or hardware. A "unit" or "module" may be configured to reside in an addressable storage medium and may be configured to run on one or more processors. Thus, as an example, a "unit" or "module" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, properties, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functionality provided within the components and "units" or "modules" may be combined into a smaller number of components and "units" or "modules", or further separated into additional components and "units" or "modules".

Throughout the specification, when a part is said to be "connected" to another part, this includes not only a direct connection but also an indirect connection, and an indirect connection includes a connection via a wireless communication network.

In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Throughout the specification, when a member is said to be located "on" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member is present between the two.

Terms such as first and second are used to distinguish one component from another, and the components are not limited by these terms.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

The identification codes attached to the steps are used for convenience of explanation and do not describe the order of the steps. The steps may be performed in an order different from the stated order unless the context clearly specifies a particular order.

The terms used in the following description are defined as follows.

Although this specification uses the term 'service server', the term refers to a device for providing a domain adaptive data extraction service and may include any of various devices capable of performing computation. In addition, the service server may be connected to or interoperate with a separate server, computer, and/or portable terminal so that its settings are changed or so that it collects or analyzes information and provides it, and its type and form are not limited.

Here, the computer may include, for example, a notebook, a desktop, a laptop, a tablet PC, a slate PC, or the like equipped with a web browser.

The server is a server that processes information by communicating with external devices, and may include an application server, a computing server, a database server, a file server, a game server, a mail server, a proxy server, a web server, and the like.

The portable terminal is, for example, a wireless communication device that ensures portability and mobility, and may include all kinds of handheld wireless communication devices such as a PCS (Personal Communication System), GSM (Global System for Mobile communications), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (Wideband Code Division Multiple Access), or WiBro (Wireless Broadband Internet) terminal or a smartphone, as well as wearable devices such as a watch, a ring, a bracelet, an anklet, a necklace, glasses, contact lenses, or a head-mounted device (HMD).

In this specification, 'domain' refers to a document composed of words/text.

The operating principle and embodiments of the present disclosure are described below with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a network structure of a domain adaptive data extraction system according to one embodiment of the present disclosure.

Referring to FIG. 1, a domain adaptive data extraction system (hereinafter, 'data extraction system') (10) according to one embodiment of the present disclosure may be configured to include at least one of a service server (100) and a user terminal (200).

The service server (100) provides a domain adaptive data extraction service (hereinafter, 'data extraction service') to a user. For this purpose, it may provide a separate web page and/or platform (application), and the user terminal (200) may provide or receive various information/data for the data extraction service through that web page or platform.

When a user runs the separate web page or platform on the user terminal (200) and defines keywords for each of the first domain and the second domain, the service server (100) expands the defined keywords and then acquires at least one data item based on them. To expand the defined keywords, the service server (100) may perform crawling to acquire various data (information, content, etc.) from a preset website (300) and expand the keywords using the nouns contained in that data.
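The keyword expansion step described above can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the crawler is replaced by an in-memory list of texts, the stop-word list `STOP_WORDS` and the function `collect_keywords` are hypothetical, and frequent non-stop-words stand in for the nouns that the text says are used (no part-of-speech tagging is performed).

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a language-specific one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with"}


def collect_keywords(basic_keywords, crawled_texts, top_n=5):
    """Expand the basic keywords with the most frequent non-stop-words.

    Assumption: word frequency approximates the noun extraction described
    in the text; the source does not specify the selection rule.
    """
    counts = Counter()
    for text in crawled_texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS and len(t) > 2)

    expanded = list(basic_keywords)
    limit = len(basic_keywords) + top_n
    for word, _ in counts.most_common():
        if len(expanded) >= limit:
            break
        if word not in expanded:
            expanded.append(word)
    return expanded


# Stand-in for text crawled from a preset website for a "bank" domain.
bank_texts = [
    "Open a savings account and check the interest rate for the deposit.",
    "The loan interest rate depends on the deposit balance of the account.",
]
keywords = collect_keywords(["bank"], bank_texts, top_n=3)
print(keywords)
```

The same function would be called once per domain (the first domain and each second domain), so that each domain ends up with its own keyword list.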

Next, the service server (100) extracts only data related to the first domain by extracting core keywords from the at least one data item. Here, the first domain (target domain) is the domain to which the model the user wants to create will be applied, and the second domain (surrounding domain) is at least one domain to be excluded from the first domain; there may be one or more second domains.
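As a rough illustration of keeping only data related to the target domain, the sketch below filters a data pool by keyword. The preset condition is not specified in this part of the text, so the rule used here (at least `min_hits` target-domain core keywords and no surrounding-domain keyword) is an assumption, and `extract_data` is a hypothetical helper.

```python
def extract_data(pool, core_keywords, excluded_keywords, min_hits=1):
    """Keep texts that mention enough target keywords and no excluded ones.

    Assumed condition for illustration only; the source refers to a
    "preset condition" without defining it in this section.
    """
    selected = []
    for text in pool:
        lowered = text.lower()
        hits = sum(1 for k in core_keywords if k in lowered)
        blocked = any(k in lowered for k in excluded_keywords)
        if hits >= min_hits and not blocked:
            selected.append(text)
    return selected


second_pool = [
    "I would like to check my deposit balance.",
    "My deposit was used to pay the insurance premium.",
    "What time does the branch open?",
]
# "deposit" and "loan" stand for target-domain (bank) core keywords,
# "premium" for a surrounding-domain (insurance) keyword.
picked = extract_data(second_pool, ["deposit", "loan"], ["premium"])
print(picked)
```

The second sentence mentions a target keyword but is dropped because it also contains a surrounding-domain keyword, which is the role the second domain plays in the description.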

In this way, from only a small number of keywords entered by the user, the service server (100) extracts not only data whose text is similar but also, by augmentation, data whose context (meaning) is similar, and can thereby provide the user with a data set composed of more suitable data.

Meanwhile, the user terminal (200) is a terminal held by a user who wishes to receive the data extraction service through the service server (100). To receive the data extraction service, the user terminal may access a web page provided by the service server (100) or may have a platform (application) installed.

Accordingly, when the user runs the separate web page or platform on the user terminal (200) and requests the service from the service server (100) in order to create a model for a specific domain, the user terminal (200) receives, in response, a data set composed of the data extracted by the service server (100).

Here, the user terminal (200) may request the service by entering and thereby defining keywords for each of the first domain and the second domain.

The user terminal (200) may be a computer, a UMPC (Ultra Mobile PC), a workstation, a netbook, a PDA (Personal Digital Assistant), a portable computer, a web tablet, a wireless phone, a mobile phone, a smartphone, a pad, a smart watch, a wearable terminal, an e-book reader, a PMP (portable multimedia player), a portable game console, a navigation device, a black box, a digital camera, or another mobile communication terminal on which the user can install and run a number of desired application programs (i.e., applications). That is, the user terminal (200) may be provided in various forms, and its form is not limited.

Although only one user terminal (200) is shown in FIG. 1, this is only for convenience of explanation; there may be one or more user terminals, and their number and type are not limited.

As described above, the data extraction system according to the present disclosure may be implemented through network-based transmission and reception of data/information between the service server (100) and the user terminal (200).

FIG. 2 is a diagram illustrating the configuration of a service server for domain adaptive data extraction according to one embodiment of the present disclosure.

Referring to FIG. 2, the service server (100) according to one embodiment of the present disclosure may be configured to include at least one of a communication module (110), a memory (120), and a processor (130).

The communication module (110) may communicate with at least one of various terminals (devices), external storage (for example, a database (140)), an external server, and a cloud server.

Meanwhile, an external server or a cloud server may be configured to perform at least part of the role of the processor (130). That is, data processing, data operations, and the like may be performed on an external server or a cloud server, and the present disclosure places no particular restriction on this approach.

Meanwhile, the communication module (110) may support various communication methods according to the communication standard of the communication target (for example, an electronic device, an external server, or another device).

For example, the communication module (110) may be configured to communicate with a communication target using at least one of the following technologies: WLAN (Wireless LAN), Wi-Fi (Wireless Fidelity), Wi-Fi (Wireless Fidelity) Direct, DLNA (Digital Living Network Alliance), WiBro (Wireless Broadband), WiMAX (Worldwide Interoperability for Microwave Access), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), LTE (Long Term Evolution), LTE-A (Long Term Evolution-Advanced), 5G (5th Generation Mobile Telecommunication), Bluetooth™, RFID (Radio Frequency Identification), infrared communication (Infrared Data Association; IrDA), UWB (Ultra-Wideband), ZigBee, NFC (Near Field Communication), and Wireless USB (Wireless Universal Serial Bus).

Meanwhile, the memory (120) may be configured to store various information related to the present disclosure. The memory (120) may be provided in the device according to the present disclosure itself. Alternatively, at least part of the memory (120) may mean at least one of a database (DB, 140) and cloud storage (or a cloud server). That is, the memory (120) need only be a space in which the information required for the device and method according to the present disclosure is stored, and it may be understood that there is no restriction on its physical location. Accordingly, the memory (120), the database (140), external storage, and cloud storage (or a cloud server) are not distinguished below and are all referred to as the memory (120).

The memory (120) may store a number of application programs (applications) run on the service server (100), as well as data and instructions for the operation of the service server (100). At least some of these application programs may be downloaded from an external server via wireless communication. Meanwhile, an application program may be stored in at least one memory provided in the memory (120), installed on the service server (100), and run by the processor (130) to perform an operation (or function) according to at least one process stored in the memory (120).

Meanwhile, the at least one memory may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (for example, SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), a magnetic memory, a magnetic disk, and an optical disk. In addition, the memory may store information temporarily, permanently, or semi-permanently, and may be provided as a built-in or removable type.

Next, the processor (130) may be configured to control the overall operation of the device according to the present disclosure. The processor (130) may process signals, data, and information input or output through the components described above, or may provide or process appropriate information or functions for the user.

The processor (130) may include at least one CPU (Central Processing Unit) to perform the functions according to the present disclosure.

Specifically, the processor (130) defines at least one basic keyword corresponding to each of the first domain and the second domain, collects words excluding stop words as at least one keyword for each of the first domain and the second domain based on the at least one basic keyword, calculates the word distribution entropy for each keyword within the first data pool to extract at least one core keyword, and extracts at least one piece of data from the second data pool according to a preset condition based on the extracted core keyword(s).

At this time, at least one method or at least one search engine may be used to collect the at least one keyword. For example, the keywords may be collected by crawling. However, this is only one embodiment, and the collection is not limited to this method.

Furthermore, the processor (130) may additionally extract, from the second data pool, data that is contextually similar to each piece of the previously extracted data. In this way, the data set can be augmented with both the extracted data and the data similar to it.

Here, the preset condition may be the frequency of occurrence of the core keywords included in one piece of data. However, this is only an example; other conditions, such as matching a regular expression (regex) or excluding certain keywords, may also be applied, and the condition is not limited thereto. In addition, the first data pool may be a data pool in which the collected keywords are stored, and the second data pool may be a data pool in which various types of data are stored.
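The optional regex and exclusion conditions mentioned above could be sketched as a simple predicate. This is only an illustration under stated assumptions: the function name `passes_extra_conditions`, the date pattern, and the sample texts are hypothetical and do not come from the disclosure.

```python
import re

def passes_extra_conditions(text, pattern=None, excluded_keywords=()):
    """Check the optional regex and exclusion conditions for one piece of data."""
    # Regular-expression condition: the text must contain a match.
    if pattern is not None and re.search(pattern, text) is None:
        return False
    # Exclusion condition: reject text containing any excluded keyword.
    lowered = text.lower()
    return not any(k.lower() in lowered for k in excluded_keywords)

# Hypothetical sample data.
samples = [
    "Loan approved on 2024-06-05 at a fixed rate.",
    "Loan advertisement 2024-06-05: click here for a fixed rate.",
    "The rate was discussed without a date.",
]
date_pattern = r"\d{4}-\d{2}-\d{2}"  # assumed pattern for illustration
kept = [s for s in samples
        if passes_extra_conditions(s, date_pattern, ["advertisement"])]
print(kept)
```

Only the first sample satisfies both conditions: the second contains an excluded keyword and the third has no match for the pattern.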

Meanwhile, the processor (130) calculates, based on a preset algorithm, the word distribution entropy for each of the collected keywords to extract the core keywords, and extracts only data that includes the core keywords.

In addition, when collecting the at least one keyword, the processor (130) searches a preset website (300) for each of the first domain and the second domain, and crawls the data in a preset number of top-ranked contents among the search results. The processor (130) then extracts the words other than stop words from the crawled text to generate a word list. An analysis method such as morphological analysis may be applied to extract the words other than stop words.
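The word-list step described above can be sketched as follows. This is a minimal illustration, assuming the crawled text is already available as strings; the regex tokenizer, the stop-word set, and the function name `build_word_list` are hypothetical stand-ins, and a real system would likely use a morphological analyzer, particularly for Korean text.

```python
import re
from collections import Counter

# Hypothetical stop-word set; the disclosure does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}

def build_word_list(crawled_texts, stop_words=STOP_WORDS):
    """Count non-stop-word tokens across the crawled texts of one domain."""
    counts = Counter()
    for text in crawled_texts:
        # Simple regex tokenizer standing in for morphological analysis.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

# Hypothetical crawled text for a "finance" domain.
finance_pages = [
    "The interest rate of the loan is fixed.",
    "A loan application and the interest schedule.",
]
word_list = build_word_list(finance_pages)
print(word_list.most_common(3))
```

Keeping the counts, rather than only the distinct words, means the same structure can later supply the per-domain frequencies used in the document-term matrix.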

Meanwhile, when extracting the core keywords, the processor (130) generates a document-term matrix (DTM) based on the keywords collected for each of the first domain and the second domain, and calculates, based on a preset algorithm, the word distribution entropy for each keyword to determine a weight. The processor (130) then applies the weight to each keyword in the document-term matrix to calculate a score, normalizes each row of the document-term matrix, and extracts a preset number of keywords with relatively high scores as the at least one core keyword.

Here, the preset algorithm is a modification of TF-IDF (Term Frequency-Inverse Document Frequency) and is used by the processor (130) to calculate a score reflecting the importance of each of the at least one keyword. Because the setting differs from the typical use of TF-IDF, applying TF-IDF unchanged would make it difficult to obtain meaningful values, so a modified algorithm is used. In other words, since the number of documents here equals the number of domains, there are far fewer documents than in the typical TF-IDF setting, while the number of words in each document is much larger. As a result, TF would dominate IDF, which is why entropy and normalization are used. In addition, DF is by definition the number of documents in which a word appears at least once, but because each document (domain) contains so many words, merely checking whether a word appears at all does not reflect the situation well. Therefore, word distribution entropy, rather than mere word occurrence, is used as DF.

Specifically, TF indicates how frequently a keyword appears within a specific document (data). If a keyword appears frequently within one document, it is likely to be related to the topic of that document and can therefore be recognized as an important keyword. IDF, in turn, reflects how frequently a keyword appears across all documents (data): the more frequently a keyword appears across all documents, the less important it is considered. The present invention uses these characteristics as follows. Keywords that appear frequently within each piece of data are first identified based on TF. If IDF shows that such a keyword appears frequently throughout all of the data, the keyword is regarded as unimportant. If IDF shows that the keyword appears frequently only within specific data, the keyword is regarded as important for that specific data.

At this time, the IDF value can be calculated by taking the reciprocal of the entropy value calculated for each keyword. That is, the lower the entropy value, in other words, the more a keyword is used intensively in specific documents rather than commonly across various documents, the higher its reciprocal, and consequently the higher the TF-IDF value.
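A small worked example may help show why the reciprocal of entropy rewards domain-specific keywords. The numbers below are hypothetical, and the example assumes two domains, the natural logarithm, and that the entropy is computed over each keyword's share of occurrences per domain; the disclosure does not state these details. Keyword A appears 9 times in one domain and once in the other (shares 0.9 and 0.1), while keyword B appears 5 times in each (shares 0.5 and 0.5).

```latex
\begin{aligned}
H_A &= -(0.9 \ln 0.9 + 0.1 \ln 0.1) \approx 0.325, & \frac{1}{H_A} &\approx 3.08 \\
H_B &= -(0.5 \ln 0.5 + 0.5 \ln 0.5) = \ln 2 \approx 0.693, & \frac{1}{H_B} &\approx 1.44
\end{aligned}
```

The concentrated keyword A receives roughly twice the weight of the evenly spread keyword B. Note that a keyword appearing in only one domain has zero entropy, so an implementation would presumably need some smoothing before taking the reciprocal; that handling is an assumption and is not described in the text.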

Thereafter, each keyword entry in the document-term matrix is multiplied by its IDF value to calculate the TF-IDF score, and each row of the matrix is normalized. The normalization may use at least one of Standard normalization, Min-Max normalization, Z-score normalization, MaxAbs normalization, Decimal Scaling normalization, power transformation, robust normalization, and log transformation. However, further considering possible sensitivity to outliers, robust normalization or log transformation may be used among them.

In addition, the processor (130) may be configured to use either robust scaling or log transformation as the normalization method, further considering possible sensitivity to outliers. However, this is only one embodiment, and the type of method is not limited.
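The two outlier-aware options named above can be sketched as follows. This is a minimal illustration using only the Python standard library; the disclosure does not define the exact formulas, so the common definitions are assumed here: robust scaling as (x - median) / IQR, and log transformation as log(1 + x). The function names and sample row are hypothetical.

```python
import math
import statistics

def robust_scale(row):
    """Subtract the median and divide by the interquartile range (IQR)."""
    median = statistics.median(row)
    q1, _, q3 = statistics.quantiles(row, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate row: no spread between the quartiles.
        return [0.0 for _ in row]
    return [(x - median) / iqr for x in row]

def log_transform(row):
    """Compress large values with log(1 + x); assumes non-negative scores."""
    return [math.log1p(x) for x in row]

scores = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical row with one outlier
print(robust_scale(scores))
print(log_transform(scores))
```

In robust scaling, the median and IQR are determined by the middle of the distribution, so the single outlier does not distort how the other values are scaled. The log transformation instead compresses the outlier itself.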

Meanwhile, when extracting data from the second data pool, the processor (130) may check, for each piece of data in the second data pool, the frequency of occurrence of each of the previously extracted core keywords, and compare that frequency with a preset threshold to obtain at least one piece of data. In other words, only data whose frequency of occurrence is equal to or greater than the preset threshold is selected.
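The threshold-based selection above could look like the following sketch. It assumes that the frequencies of all core keywords are summed per piece of data before the comparison; the text could also be read as a per-keyword comparison, so this is one possible interpretation. The function name, tokenizer, and sample pool are hypothetical.

```python
import re

def extract_by_keyword_frequency(data_pool, core_keywords, threshold):
    """Keep texts whose total core-keyword count is at least `threshold`."""
    keywords = {k.lower() for k in core_keywords}
    selected = []
    for text in data_pool:
        tokens = re.findall(r"\w+", text.lower())
        # Assumption: sum the occurrences of all core keywords in this text.
        frequency = sum(1 for t in tokens if t in keywords)
        if frequency >= threshold:
            selected.append(text)
    return selected

# Hypothetical second data pool.
pool = [
    "Loan interest rate rises again.",
    "The weather is sunny today.",
    "New loan product announced.",
]
selected = extract_by_keyword_frequency(pool, ["loan", "interest"], threshold=2)
print(selected)
```

With a threshold of 2, only the first text is kept, because it contains both core keywords while the third contains just one.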

Other specific operations of the processor (130) are described below with reference to the respective drawings.

At least one component may be added or removed depending on the performance of the components illustrated in FIG. 2. For example, although not illustrated in FIG. 1, a search module, an augmentation module, and an extraction module may each be provided and controlled by the processor (130). In addition, those skilled in the art will readily understand that the relative positions of the components may be changed depending on the performance or structure of the device.

FIG. 3 is a diagram illustrating a domain adaptive data extraction method according to one embodiment of the present disclosure.

Referring to FIG. 3, when at least one basic keyword corresponding to each of the first domain and the second domain is input from the user terminal (200) and thereby defined, the service server (100) collects words excluding stop words as at least one keyword for each of the first domain and the second domain based on the at least one basic keyword (S110).

Next, the service server (100) calculates, based on a preset algorithm, the word distribution entropy within the first data pool for each of the keywords collected in step S110, and extracts at least one core keyword (S120).

Next, the service server (100) extracts at least one piece of data from the second data pool according to a preset condition based on the core keyword(s) extracted in step S120 (S130).

A data set is then generated using the extracted data (S140).

Meanwhile, although not shown in FIG. 3, during or after step S140, data that is contextually similar to each piece of data extracted in step S130 may be additionally extracted from the second data pool to augment the data set.

FIG. 4 is a diagram illustrating the operation of collecting at least one keyword for each domain in the domain adaptive data extraction method according to one embodiment of the present disclosure, and shows step S110 of FIG. 3 in more detail.

Referring to FIG. 4, the service server (100) searches a preset website (300) for each of the first domain and the second domain (S111), and crawls the data in a preset number of top-ranked contents among the search results (S112).

Next, the service server (100) extracts, as at least one keyword, the words other than stop words from the text crawled in step S112 (S113), and generates a word list using the extracted keyword(s) (S114).

FIG. 5 is a diagram illustrating the operation of extracting at least one core keyword based on at least one keyword in the domain adaptive data extraction method according to one embodiment of the present disclosure, and shows step S120 of FIG. 3 in more detail.

Referring to FIG. 5, the service server (100) checks, for each piece of data in the second data pool, the frequency of occurrence of each of the keywords collected for the first domain and the second domain in step S110 (S121), and generates a document-term matrix (Document-Term Matrix, DTM) using the checked frequencies (S122).

Next, the service server (100) calculates the word distribution entropy for each of the keywords collected in step S110 and determines a weight (S123).

Here, the word distribution entropy for each of the collected keywords can be expressed as in <Mathematical Formula 1> below.

$$H_j = -\sum_{i=1}^{N} p_{ij} \log p_{ij}$$

Here, $H_j$ is the entropy of keyword $j$, $p_{ij}$ is the occurrence frequency of keyword $j$ within domain $i$ (that is, the same as $tf_{ij}$), and $N$ is the number of the first and second domains.

Next, the service server (100) applies the weight determined in step S123 to each keyword in the document-term matrix generated in step S122 to calculate a score (S124), and normalizes each row of the document-term matrix (S125).

Here, the score for each keyword can be expressed as in <Mathematical Formula 2> below.

$$\mathrm{score}_{ij} = tf_{ij} \times IDF_j, \qquad IDF_j = \frac{1}{H_j}$$

Here, $tf_{ij}$ is the occurrence frequency of keyword $j$ for domain $i$, and $IDF_j$ is the reciprocal of the entropy value $H_j$ of keyword $j$.

Accordingly, a pair of each keyword and its TF-IDF score, that is, $(j, \mathrm{score}_{ij})$, can be generated.

The service server (100) then extracts, for each of the at least one data, a preset number of keywords with relatively high scores as the core keywords (S126).
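Steps S121 to S126 can be sketched end to end as follows. This is a minimal illustration under several assumptions that the disclosure does not fix: per-domain keyword counts are given as dictionaries, entropy is computed over each keyword's share of occurrences across domains with the natural logarithm, a small constant `eps` avoids division by zero for keywords found in only one domain, and the log transformation is used as the row normalization. The function name and sample counts are hypothetical.

```python
import math

def extract_core_keywords(domain_counts, top_k=2, eps=1e-9):
    """Map {domain: {keyword: count}} to {domain: [top_k core keywords]}."""
    domains = list(domain_counts)
    vocab = sorted({k for counts in domain_counts.values() for k in counts})
    # Document-term matrix: one row per domain, one column per keyword.
    dtm = [[domain_counts[d].get(k, 0) for k in vocab] for d in domains]

    # Weight per keyword: reciprocal of its distribution entropy across domains.
    idf = []
    for j in range(len(vocab)):
        column = [row[j] for row in dtm]
        total = sum(column)
        shares = [c / total for c in column if c > 0]
        entropy = -sum(p * math.log(p) for p in shares)
        idf.append(1.0 / (entropy + eps))  # eps is an assumed smoothing term

    result = {}
    for d, row in zip(domains, dtm):
        scores = [tf * w for tf, w in zip(row, idf)]
        # Row normalization: log transformation, one of the listed options.
        normalized = [math.log1p(s) for s in scores]
        ranked = sorted(zip(vocab, normalized), key=lambda kv: kv[1], reverse=True)
        result[d] = [k for k, s in ranked[:top_k] if s > 0]
    return result

# Hypothetical keyword counts for a target domain and a domain to exclude.
counts = {
    "finance": {"loan": 40, "interest": 30, "service": 20},
    "medical": {"patient": 50, "diagnosis": 25, "service": 20},
}
core = extract_core_keywords(counts)
print(core)
```

In this toy input, "service" occurs equally in both domains, so its entropy is high and its weight low, while keywords confined to a single domain rank at the top for that domain.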

FIG. 6 is a diagram illustrating the operation of extracting similar data for each of the at least one data in the domain adaptive data extraction method according to one embodiment of the present disclosure. It is an additional operation that may be performed in step S140 of FIG. 3 or after step S140.

As described above, the service server (100) further extracts, from the second data pool, data whose embedding value is similar to the embedding value of the at least one data.

Referring to FIG. 6, all data in the entire data pool is embedded. First, each piece of data (data containing text, such as sentences or paragraphs) is passed through an embedding model, and the resulting vector is obtained as the embedding value of that data (S151). Then, data whose embedding value is similar to the embedding value of each piece of data is detected (S152). This goes beyond simply extracting data that contains the same words: data with a similar context (meaning) can be found, which diversifies the data. Accordingly, by extracting not only the at least one data but also data similar to each of them, an augmented data set can be generated.
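The similarity search of steps S151 and S152 could be sketched as follows. The disclosure does not name a similarity measure or an embedding model, so this example assumes cosine similarity with a fixed threshold and uses small hand-written vectors in place of real embedding model output. All names and values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def find_similar(seed_ids, embeddings, min_similarity=0.9):
    """Return ids of non-seed items whose embedding is close to any seed."""
    similar = []
    for item_id, vector in embeddings.items():
        if item_id in seed_ids:
            continue
        if any(cosine_similarity(vector, embeddings[s]) >= min_similarity
               for s in seed_ids):
            similar.append(item_id)
    return similar

# Toy 3-dimensional vectors standing in for embedding model output.
embeddings = {
    "doc_a": [0.9, 0.1, 0.0],  # assumed to have been extracted in S130
    "doc_b": [0.8, 0.2, 0.1],  # close in meaning to doc_a
    "doc_c": [0.0, 0.1, 0.9],  # unrelated
}
augmented = find_similar({"doc_a"}, embeddings)
print(augmented)
```

For a large data pool, the pairwise loop would likely be replaced by a vector index, but the selection logic stays the same.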

The program described above may include code written in a computer language such as C, C++, Java, or machine language that the processor (CPU) of the computer can read through the device interface of the computer, so that the computer reads the program and executes the methods implemented as the program. Such code may include functional code related to functions that define the operations required to execute the methods, and may include control code related to the execution procedure required for the processor of the computer to execute those functions in a predetermined order. In addition, such code may further include memory-reference code indicating at which location (address) in the internal or external memory of the computer any additional information or media required for the processor to execute the functions should be referenced. Furthermore, if the processor of the computer needs to communicate with a remote computer or server in order to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the communication module of the computer, and what information or media to send and receive during communication.

The storage medium refers to a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data for a short moment, such as a register, cache, or memory. Specific examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage devices. That is, the program may be stored on various recording media on various servers that the computer can access, or on various recording media on the user's computer. In addition, the medium may be distributed across computer systems connected by a network, so that computer-readable code is stored in a distributed manner.

The steps of a method or algorithm described in connection with the embodiments of the present disclosure may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. The software module may reside in RAM (Random Access Memory), ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present disclosure pertains.

Although the embodiments of the present disclosure have been described above with reference to the attached drawings, those skilled in the art to which the present disclosure pertains will understand that the present disclosure may be implemented in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive.

10: Data extraction system 100: Service server
200: User terminal 300: Web page
110: Communication module 120: Memory
130: Processor 140: Database

Claims

A communication module that performs communication with external devices;
A memory storing at least one process for performing domain adaptive data extraction operations; and
A processor for performing the domain adaptive data extraction operation according to the above process,
The above domain includes a first domain to which a model that the user wants to create is to be applied, and a second domain other than the first domain.
The above processor,
Define at least one basic keyword corresponding to each of the first domain and the second domain, and collect words excluding stop words as at least one keyword for each of the first domain and the second domain based on the at least one basic keyword.
In the first data pool where the at least one keyword collected is stored, word distribution entropy for each of the at least one keyword collected is calculated to extract at least one core keyword,
Extracting at least one data according to a preset condition based on at least one core keyword from a second data pool where various types of data are stored,
A document-term matrix (DTM) is generated based on the at least one keyword for each of the first domain and the second domain, a word distribution entropy for each of the at least one keyword is calculated based on a preset algorithm to determine a weight, and a score is calculated by applying the weight to each keyword included in the document-term matrix, and then each row of the document-term matrix is normalized, and a preset number of keywords having relatively high scores are extracted as the at least one core keyword.
Characterized in that the normalization is performed based on one of the following methods: Standard normalization, Min-Max normalization, Z-score normalization, MaxAbs normalization, Decimal Scaling normalization, Power transformation, Robust scaling, and Log transformation.
Domain adaptive data extraction device.

According to claim 1,
The above processor,
When collecting at least one keyword, each of the first domain and the second domain is searched, crawling is performed based on a preset number of data within the content, and words excluding the stop words are extracted from the crawled text to create a word list.
Domain adaptive data extraction device.

According to claim 1,
The preset condition is
the frequency of occurrence of keywords contained in one piece of data,
The above processor,
When extracting at least one of the above data, the frequency of occurrence of each core keyword for each data included in the second data pool is checked, and only data having a confirmed frequency of occurrence higher than a preset threshold is selected and extracted.
Domain adaptive data extraction device.

According to claim 1,
The above processor,
Characterized in that it further extracts contextually similar data for each of the at least one data extracted within the second data pool.
Domain adaptive data extraction device.

According to claim 4,
The above processor,
Based on the above preset algorithm, the word distribution entropy for each keyword is calculated,
The above preset algorithm is,
It is an algorithm modified from TF-IDF (Term Frequency-Inverse Document Frequency), characterized in that it calculates a score according to the importance of at least one keyword.
Domain adaptive data extraction device.

(Deleted)

A communication module that performs communication with external devices;
A memory storing at least one process for performing domain adaptive data extraction operations; and
A processor for performing the domain adaptive data extraction operation according to the above process,
The above domain includes a first domain to which a model that the user wants to create is to be applied, and a second domain other than the first domain.
The above processor,
Define at least one basic keyword corresponding to each of the first domain and the second domain, and collect words excluding stop words as at least one keyword for each of the first domain and the second domain based on the at least one basic keyword.
In the first data pool where the at least one keyword collected is stored, word distribution entropy for each of the at least one keyword collected is calculated to extract at least one core keyword,
Extracting at least one data according to a preset condition based on at least one core keyword from a second data pool where various types of data are stored,
The word distribution entropy for each of the above at least one keyword is,
It is characterized by being expressed as in the following <Mathematical Formula 1>.
Domain adaptive data extraction device.
<Mathematical formula 1>
$$H_j = -\sum_{i=1}^{N} p_{ij} \log p_{ij}$$
Here, $H_j$ is the entropy of keyword $j$, $p_{ij}$ is the frequency of appearance of keyword $j$ within domain $i$, and $N$ represents the number of the first domain and the second domain.

A communication module that performs communication with external devices;
A memory storing at least one process for performing domain adaptive data extraction operations; and
A processor for performing the domain adaptive data extraction operation according to the above process,
The above domain includes a first domain to which a model that the user wants to create is to be applied, and a second domain other than the first domain.
The above processor,
Define at least one basic keyword corresponding to each of the first domain and the second domain, and collect words excluding stop words as at least one keyword for each of the first domain and the second domain based on the at least one basic keyword.
In the first data pool where the at least one keyword collected is stored, word distribution entropy for each of the at least one keyword collected is calculated to extract at least one core keyword,
Extracting at least one data according to a preset condition based on at least one core keyword from a second data pool where various types of data are stored,
The above processor,
A document-term matrix (DTM) is generated based on the at least one keyword for each of the first domain and the second domain, a word distribution entropy for each of the at least one keyword is calculated based on a preset algorithm to determine a weight, and a score is calculated by applying the weight to each keyword included in the document-term matrix, and then each row of the document-term matrix is normalized, and a preset number of keywords having relatively high scores are extracted as the at least one core keyword.
The above scores are,
It is characterized by being expressed as in <Mathematical Formula 2> below.
Domain adaptive data extraction device.
<Mathematical formula 2>
$$\mathrm{score}_{ij} = tf_{ij} \times IDF_j$$
Here, $tf_{ij}$ represents the appearance frequency of keyword $j$ for domain $i$, and $IDF_j$ represents the reciprocal of the entropy value of keyword $j$.

A domain adaptive data extraction method performed by a processor of a device,
The above domain includes a first domain to which a model that the user wants to create is to be applied, and a second domain other than the first domain.
When at least one basic keyword corresponding to each of the first domain and the second domain is defined, a step of collecting words excluding stop words as at least one keyword for each of the first domain and the second domain based on the at least one basic keyword;
A step of extracting at least one core keyword by calculating word distribution entropy for each of the at least one keyword collected in the first data pool where the at least one keyword collected is stored; and
It includes a step of extracting at least one data according to a preset condition based on at least one core keyword in a second data pool where various types of data are stored,
The above processor,
A document-term matrix (DTM) is generated based on the at least one keyword for each of the first domain and the second domain, a word distribution entropy for each of the at least one keyword is calculated based on a preset algorithm to determine a weight, and a score is calculated by applying the weight to each keyword included in the document-term matrix, and then each row of the document-term matrix is normalized, and a preset number of keywords having relatively high scores are extracted as the at least one core keyword.
Characterized in that the normalization is performed based on one of the following methods: Standard normalization, Min-Max normalization, Z-score normalization, MaxAbs normalization, Decimal Scaling normalization, Power transformation, Robust scaling, and Log transformation.
Domain adaptive data extraction method.