KR102025813B1

KR102025813B1 - Device and method for chronological big data curation system

Info

Publication number: KR102025813B1
Application number: KR1020180036540A
Authority: KR
Inventors: 한상용; 최승진; 서지완; 유가람
Original assignee: 중앙대학교 산학협력단
Priority date: 2017-03-31
Filing date: 2018-03-29
Publication date: 2019-11-04
Also published as: KR20180111646A

Abstract

The present invention relates to a chronological information-based curation apparatus and a control method thereof, and more particularly to a chronological information-based curation apparatus and method for providing event flow information. The invention organizes the various kinds of information scattered across the Internet by topic and association, presents the organized data to users in an easily viewable and accessible form, and makes the organized data reusable, so that valuable information can be generated through the maintenance, management, and provision of information.

Description

Chronological information-based curation apparatus and its control method for providing event flow information {DEVICE AND METHOD FOR CHRONOLOGICAL BIG DATA CURATION SYSTEM}

The present invention relates to a chronological information-based curation apparatus and method, and more particularly to a chronological information-based curation apparatus and control method for providing event flow information.

With the advent and development of web technologies, vast amounts of heterogeneous data are being produced rapidly, and the volume of information is growing considerably. In this big data era, consumers can obtain more data and information than ever before, but it is not easy to pick out valuable data from the mass of information and knowledge. Finding valuable information in large amounts of data is becoming increasingly important, and many countries and companies invest substantial time and money in acquiring and analyzing data.

Digital curation is the work of organizing the various kinds of information scattered across the Internet by topic and association, presenting the organized data to users in an easily viewable and accessible form, and making that data reusable. Digital curation can generate valuable information through the maintenance, management, and provision of information, and can improve the accessibility and reusability of information.

Because searching for, understanding, and identifying important topics in the big data era takes considerable effort and cost, one important issue is building a useful and reliable digital curation apparatus that can satisfy users' needs.

Existing information systems generally provide query-based search services.

Semantic search refers to search that improves accuracy by understanding the searcher's intent and the contextual meaning of words, in order to achieve better performance and more accurate results. However, current search systems, including such semantic systems, consider only time-independent parameters, and therefore have great difficulty representing and ranking search data in a way that accounts for time decay.

Some events unfold over a long period, and the importance of and interest in an event change over time. When various events and/or incidents become entangled with other people, it becomes difficult to understand the implicit, underlying meaning of the search data.

Accordingly, research is needed on digital curation apparatuses and systems that further take into account the time at which events occur.

An object of the present invention is to solve the above-described problems and other problems. Another object is to provide a curation apparatus that can classify data more efficiently by further taking into account the occurrence times of events.

The present invention provides a chronological information-based curation apparatus and method that present the relevant key information in chronological order, to help the user gain a comprehensive understanding of a specific event or incident being searched for in a big data environment.

The present invention provides a chronological information-based curation apparatus and method that collect event or incident data from a variety of sources, analyze the related information in chronological order, model a specific event or body of knowledge over time, and present the result through a visualization tool, thereby reducing the repetitive searching needed to understand a specific event or body of knowledge and producing reusable information.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

According to one aspect of the present invention, to achieve the above or other objects, there is provided a curation apparatus comprising: a data collection unit that collects a plurality of topic data from data sources on the web; a graph modeling unit that extracts associations among the collected topic data; and a chronological analysis unit that, based on the extracted associations, classifies the topic data into an internal node set directly associated with a predetermined topic and an external node set indirectly associated with the predetermined topic.

Here, the data sources may include social network service (SNS) posts and news articles.

The predetermined topic may be a topic corresponding to a search keyword entered by the user.

In addition, the association between two topic data may be quantified as the number of documents in which the two topic data appear together.

The chronological analysis unit may classify topic data that have an association with the predetermined topic into the internal node set, and topic data that have no association with the predetermined topic but are related to it by creation time into the external node set.

The chronological analysis unit may sort the classified topics in order of creation time.

According to another aspect of the present invention, to achieve the above or other objects, there is provided a control method of a curation apparatus, comprising: collecting a plurality of topic data from data sources on the web; extracting associations among the collected topic data; and classifying, based on the extracted associations, the topic data into an internal node set directly associated with a predetermined topic and an external node set indirectly associated with the predetermined topic.

The effects of the curation apparatus and its control method according to the present invention are as follows.

According to at least one embodiment of the present invention, the various kinds of information scattered across the Internet can be organized by topic and association, the organized data can be shown to users in an easily viewable form, and the organized data can be made reusable, so that valuable information can be generated through the maintenance, management, and provision of information.

Further scope of applicability of the present invention will become apparent from the detailed description below. However, since various changes and modifications within the spirit and scope of the present invention will be clearly understood by those skilled in the art, the detailed description and specific embodiments, such as the preferred embodiments of the invention, should be understood as given by way of example only.

FIG. 1 is a block diagram of a curation apparatus according to an embodiment of the present invention.
FIG. 2 is a table showing the characteristics of SNS, blogs, and news articles as distinguished according to an embodiment of the present invention.
FIG. 3 illustrates a topic alignment structure that aligns main topics and subtopics according to an embodiment of the present invention.
FIG. 4 is an example diagram for explaining time relation links (time relation graph) according to an embodiment of the present invention.
FIG. 5 is an example diagram for explaining similarity links (similarity relation graph) according to an embodiment of the present invention.
FIGS. 6 and 7 illustrate an example in which the reuse unit 400 stores data in a tree structure according to an embodiment of the present invention.
FIG. 8 is a flowchart of a control method of a curation apparatus according to an embodiment of the present invention.
FIG. 9 is a result graph for a search on Ahn Cheol-soo, a well-known Korean politician, performed by a curation apparatus according to an embodiment of the present invention.
FIG. 10 is a search result graph for the hacking incident at the Korean bank "Nonghyup", produced by a curation apparatus according to an embodiment of the present invention.
FIG. 11 shows a result graph, different from the experimental results above, produced by a curation apparatus according to an embodiment of the present invention.

Hereinafter, embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the drawing, and duplicate descriptions are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting, and do not in themselves have distinct meanings or roles. In describing the embodiments, detailed descriptions of related known technology are omitted where they might obscure the gist of the embodiments. The accompanying drawings are intended only to aid understanding of the embodiments; the technical idea disclosed herein is not limited by the drawings, and should be understood to cover all changes, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. By contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The present invention relates to a curation apparatus and a control method thereof, where curation refers to a method of deciding how a variety of data should be presented. Taking news articles as an example, when a specific keyword is entered into an existing search system such as 'Naver News', only the same articles from the same time (differing slightly in title) are returned. The resulting search screen makes it rather difficult to see the variety of results related to the search keyword. To overcome this, the present invention aims to build a plurality of topics (keywords of articles over time) associated with a specific search keyword and to partition those topics appropriately by several parameters, thereby providing users with effective search results.

In particular, the present invention proposes to consider indirect relevance as well as direct relevance when constructing the plurality of topics. The indirect relevance is inferred by considering which topics occurred closest in time to the occurrence of the topic (keyword) entered in the search. Here, indirect relevance may mean a criterion that a person would not have thought to judge by.

In an experimental example concerning "a forest fire in Pohang", searching with the keyword "forest fire" would ordinarily also show, by association, every region where a forest fire occurred. In addition, according to an embodiment of the present invention, because indirect relevance by time is further considered, a search with the keyword "Pohang" may also show "forest fires" that occurred in other regions. This is because, although "Pohang" and "forest fire" are not directly related (apart from the incident in question), they are topics that occurred at the same time, so an indirect relevance can be recognized according to the present invention. Owing to this indirect relevance, when the search term "Pohang" is entered, the topic "forest fire" is included as well (such a topic is referred to below as an external topic), and through the topic "forest fire" other forest fire-related search results can be shown to the user.

The present invention thus considers simultaneously the set of directly associated topics (Internal Node Group, hereinafter the internal node set) and the set of topics indirectly associated by time (External Node Group, hereinafter the external node set). In addition, by using a specific threshold to decide how many search results to provide (using Maximum Cut), only the results likely to interest the user can be shown. With a suitable threshold under which the time difference is small (topics that occurred recently relative to the topic) and the association is maximal (topics similar to the topic), only the most meaningful results can be provided. That is, the internal node set retrieves topics by direct, intuitive relevance, while the external node set consists of topics provided in order to uncover latent, implied areas.
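The classification into internal and external node sets can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `classify_nodes`, the `max_gap` time window, and the sample topics and timestamps are assumptions made for illustration, and a fixed time window stands in for the threshold selection (Maximum Cut) mentioned above, which is not reproduced here.

```python
from datetime import datetime, timedelta

def classify_nodes(query, created_at, association, max_gap):
    """Split topics into (internal, external) node sets for a query topic.

    internal: topics with a non-zero association to the query topic.
    external: topics with no association whose creation time lies within
              max_gap of the query topic (hypothetical time window).
    Both lists are returned in order of creation time.
    """
    internal, external = [], []
    query_time = created_at[query]
    for topic, ts in created_at.items():
        if topic == query:
            continue
        key = tuple(sorted((query, topic)))  # undirected pair key
        if association.get(key, 0) > 0:
            internal.append(topic)
        elif abs(ts - query_time) <= max_gap:
            external.append(topic)
    by_time = lambda t: created_at[t]
    return sorted(internal, key=by_time), sorted(external, key=by_time)

# Invented sample data, for illustration only.
created_at = {
    "Pohang": datetime(2018, 1, 1, 15, 0),
    "forest fire": datetime(2018, 1, 1, 16, 0),
    "evacuation": datetime(2018, 1, 1, 18, 0),
    "election": datetime(2018, 2, 20, 9, 0),
}
association = {("Pohang", "evacuation"): 3}  # co-occurrence counts

internal, external = classify_nodes(
    "Pohang", created_at, association, max_gap=timedelta(days=1)
)
```

In this toy data, "evacuation" is associated with the query and lands in the internal set, "forest fire" has no association but was created within the window and lands in the external set, and "election" is too distant in time to be included at all.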

FIG. 1 is a block diagram of a curation apparatus according to an embodiment of the present invention.

Referring to FIG. 1, a chronological information-based curation apparatus according to an embodiment of the present invention may include a data collection unit 100, a graph modeling unit 200, a chronological analysis unit 300, and a reuse unit 400.

The data collection unit 100 collects a plurality of topic data from data sources on the web. Here, the data sources may include social network service (SNS) posts, news articles, and the like, and may include any source that can provide data over the World Wide Web (WWW) or other communication protocols.

Topic data may mean keyword data about an event or subject that can attract people's interest or attention, for example words such as 'presidential election' or 'fire'. The specific method of collecting topic data is described in detail below.

The data collection unit 100 does not simply collect topic data alone; it may also collect information on the time at which each topic datum was created (which may include both date and time; hereinafter the creation date and time together are referred to as the creation time).

The creation time, topic data, and association data collected from each data source by the data collection unit 100 then pass through a preprocessing step (performed by the graph modeling unit), described below, so that later stages are more accurate and can use the data efficiently.

In particular, an embodiment of the present invention proposes collecting data by dividing social network services (SNS) among the data sources into at least the following two types.

For example, they may be divided into a first SNS, through which users can easily share their daily lives from a mobile device or PC, and a second SNS, provided for sharing detailed information. Examples of the first SNS include KakaoTalk, Twitter, Instagram, and Facebook; examples of the second SNS include blogs and Brunch.

SNS are divided by type in this way because the kind of data collected may differ according to the type.

FIG. 2 is a table showing the characteristics of SNS, blogs, and news articles as distinguished according to an embodiment of the present invention. In the table, SNS means the first SNS described above, and blog means the second SNS described above.

In the case of the first SNS, used for everyday conversation, users write their opinions casually, so documents are very short and propagation is very high, but reliability is somewhat lacking.

In the case of the second SNS, used for sharing information, documents are also on the short side, but, in contrast to the above, propagation is somewhat lacking, while reliability may be at an intermediate level, higher than that of the first SNS.

In contrast to the two examples above, news articles are necessarily far more reliable than the SNS described above, and because they are written by professional journalists their propagation is also very high. In addition, because documents are at least a certain length, they may be longer than those of the two SNS types described above.

An embodiment of the present invention proposes that, for a source such as the first SNS, where propagation is high but document length and reliability are rather low, only the topic data and the creation time of the topic data be collected.

An embodiment of the present invention further proposes that, for the second SNS and news articles, which are comparatively reliable data sources, the associations among the data be collected.
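The source-dependent collection policy described above can be sketched as follows. The enum `SourceKind`, the function `fields_to_collect`, and the field names are hypothetical, and the sketch assumes that topic data and creation time are still collected from the second SNS and news articles in addition to associations, which the text implies but does not state outright.

```python
from enum import Enum

class SourceKind(Enum):
    FIRST_SNS = "first_sns"    # e.g. short everyday posts
    SECOND_SNS = "second_sns"  # e.g. blogs
    NEWS = "news"              # news articles

def fields_to_collect(kind):
    """Return the set of fields gathered from a given kind of data source."""
    fields = {"topic", "created_at"}
    # Associations are taken only from the comparatively reliable sources.
    if kind in (SourceKind.SECOND_SNS, SourceKind.NEWS):
        fields.add("associations")
    return fields

policy = {kind: fields_to_collect(kind) for kind in SourceKind}
```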

The associations collected in this way are used below to classify the topic data.

The extraction (collection) of these associations is performed by the graph modeling unit 200.

The graph modeling unit 200 performs preprocessing in order to obtain meaningful data from the data collected by the data collection unit 100. Preprocessing here covers all of the various steps needed to select only the important data required for analysis.

The graph modeling unit 200 extracts the associations among the plurality of topic data. The associations may be extracted (collected) from data sources whose reliability is accepted to some degree, such as the second SNS (blogs and the like) and news articles in the example above. The association data may be based on at least one of the similarity and the co-occurrence frequency between topic data. That is, in determining the association between one topic datum and another, the graph modeling unit 200 uses at least one of similarity and co-occurrence frequency.

Co-occurrence frequency may mean the number of documents in which the two topic data appear together, that is, the number of documents that contain the words representing both topic data. For example, to determine the association between first topic data and second topic data, the number of documents that contain both may be taken as the association between them. The higher this value, the more strongly the two topic data can be regarded as associated.
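As a small illustration of this definition, the following Python sketch counts, for every pair of topics, the number of documents containing both. The names `pair_key` and `cooccurrence_counts` and the sample documents are assumptions for illustration; each document is modeled simply as the set of topic keywords found in it.

```python
from collections import Counter
from itertools import combinations

def pair_key(a, b):
    """Order-independent key for an undirected topic pair."""
    return tuple(sorted((a, b)))

def cooccurrence_counts(documents):
    """Count the documents in which each pair of topics appears together."""
    counts = Counter()
    for topics in documents:
        # set() ensures a pair is counted at most once per document
        for a, b in combinations(sorted(set(topics)), 2):
            counts[pair_key(a, b)] += 1
    return counts

# Invented documents, each reduced to the topic keywords it contains.
docs = [
    {"Pohang", "forest fire"},
    {"Pohang", "forest fire", "evacuation"},
    {"forest fire", "evacuation"},
]
counts = cooccurrence_counts(docs)
```

Here "Pohang" and "forest fire" share two documents, so their association value is 2, while "Pohang" and "evacuation" share only one.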

When a strong association exists between two topic data (that is, when the association value is high), they are distinguished as a main topic and a subtopic according to the order of their creation times. The relationships among topic data can be represented as a graph structure that can be scaled out with MapReduce for parallel processing.
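A hedged sketch of this main/subtopic split is given below. It assumes that the earlier-created topic becomes the main topic (consistent with the description of FIG. 3, where a subtopic occurs after its main topic) and that "strong" means meeting a hypothetical `min_association` threshold; the function name and sample values are illustrative only.

```python
from datetime import datetime

def orient_pair(a, b, created_at, association, min_association):
    """Return (main, sub) for a strongly associated pair, otherwise None.

    Assumption: the topic created earlier is treated as the main topic.
    """
    key = tuple(sorted((a, b)))
    if association.get(key, 0) < min_association:
        return None  # association too weak to form a main/sub relation
    return (a, b) if created_at[a] <= created_at[b] else (b, a)

# Invented sample values.
created_at = {"A": datetime(2018, 1, 1), "C": datetime(2018, 1, 2)}
association = {("A", "C"): 5}

strong = orient_pair("C", "A", created_at, association, min_association=3)
weak = orient_pair("C", "A", created_at, association, min_association=10)
```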

As described in detail below, when the collected data are drawn as a graph, each topic datum is represented as a node in a network, and each association is represented as a link connecting two nodes.

FIG. 3 illustrates a topic alignment structure that aligns main topics and subtopics according to an embodiment of the present invention. The illustrated structure is a hierarchical structure of the topic data associated with a predetermined topic (a specific event).

Each topic datum may contain subtopic data according to creation time (the time the event was written), and may recursively have subtopics (sub-subtopic data beneath subtopic data). For each main topic, the subtopics are represented hierarchically using a directed graph structure.

As shown in FIG. 3, the graph consists of nodes, each representing topic data, and directed links, each connecting two nodes. The direction indicates the order of creation times, and the value may indicate the difference in creation time. No values are shown in FIG. 3; they are described below with reference to FIGS. 4 and 5.

In FIG. 3, node A (301-1) represents the main topic (for example, the topic corresponding to a search keyword entered by the user), and the other nodes (301-2 to 301-6) represent subtopics. This topic alignment structure is useful for understanding the hierarchy of topics. Analyzing the graph reveals which events occurred as part of a single sequence. In FIG. 3, node C (301-3) has nodes E and F (301-5, 301-6) for two sub-subtopics E and F, which shows that main topic A has an implicit relationship with E and F. It can thus be seen that subtopic C occurred after main topic A, and that topics E and F followed topic C.
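The hierarchy described for FIG. 3 can be modeled as a directed adjacency map, as in the sketch below. Only the links from C to E and F are stated in the text; treating B, C, and D as direct children of A is an assumption about the figure, and the helper `descendants` is a hypothetical name.

```python
# Directed links from a topic to its subtopics.
# A -> B, C, D is assumed; C -> E, F follows the description of FIG. 3.
children = {
    "A": ["B", "C", "D"],
    "C": ["E", "F"],
}

def descendants(graph, root):
    """Return every topic reachable from root, in depth-first order."""
    found = []
    stack = list(reversed(graph.get(root, [])))
    while stack:
        node = stack.pop()
        found.append(node)
        # push children reversed so they are visited in listed order
        stack.extend(reversed(graph.get(node, [])))
    return found

implicit_of_a = descendants(children, "A")
```

Because E and F are reachable from A only through C, the traversal makes the implicit relationship between A and E, F explicit.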

The two types of links used to represent the association between two topic data items graphically are described with reference to FIGS. 4 and 5.

FIG. 4 is an exemplary diagram illustrating time relation links (time relation graph) according to an embodiment of the present invention. FIG. 5 is an exemplary diagram illustrating similarity links (similarity relation graph) according to an embodiment of the present invention.

Referring to FIG. 4, the value of a link connecting two topic data items is the difference between their creation times. That is, the difference between the creation time of the first topic data and the creation time of the second topic data is expressed as a number. A time relation link is directed, pointing from the earlier creation time to the later one.

FIG. 4 shows the time relation links among five topic data items (401-1 to 401-5). It indicates that topic data B (401-2) was created 25 hours after topic data A (401-1), and that topic data D (401-4) was created 1 hour after topic data A (401-1).
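The time relation link of FIG. 4 can be illustrated with a minimal Python sketch. The timestamps below are hypothetical and chosen only so that the 25-hour and 1-hour gaps match the figure; the function name `time_link` is likewise an assumption made for illustration, not part of the disclosed apparatus.

```python
from datetime import datetime

# Hypothetical creation times; only the 25-hour and 1-hour gaps come from FIG. 4.
created = {
    "A": datetime(2013, 3, 1, 0, 0),
    "B": datetime(2013, 3, 2, 1, 0),   # 25 hours after A
    "D": datetime(2013, 3, 1, 1, 0),   # 1 hour after A
}

def time_link(first, second, times):
    """Return (earlier, later, hours): a directed link whose value is the
    creation-time difference in hours, pointing from earlier to later."""
    earlier, later = sorted((first, second), key=lambda t: times[t])
    hours = (times[later] - times[earlier]).total_seconds() / 3600
    return earlier, later, hours

links = [time_link("A", "B", created), time_link("D", "A", created)]
```

The link direction is derived from the timestamps rather than from the argument order, so `time_link("D", "A", created)` still yields a link from A to D.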

FIG. 5 shows links that express only similarity and have no direction. Such a similarity link may carry a value obtained by quantifying the co-occurrence frequency described above using TF-IDF (Term Frequency - Inverse Document Frequency).

Once the time relation links have been completed as in FIG. 4, the similarity between topic data items shown in FIG. 5 can be calculated. As explained above, two topic data items that occur together in a document are regarded as similar. To quantify the association, the co-occurrence frequency in blogs and news is used to assign a numerical value to the association between each pair of topic data items (keywords).
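The co-occurrence idea can be sketched as follows. This is a simplified illustration that counts the documents in which two topic keywords appear together and scales the counts to the range 0 to 1; it does not apply the TF-IDF weighting mentioned in the description, and the function names and the substring-based matching are assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents, topics):
    """Count, for each pair of topics, the documents containing both."""
    counts = Counter()
    for text in documents:
        # Simplification: a topic is "present" if its keyword is a substring.
        present = sorted(t for t in topics if t in text)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts

def normalize(counts):
    """Scale counts to (0, 1] by dividing by the largest count."""
    if not counts:
        return {}
    top = max(counts.values())
    return {pair: n / top for pair, n in counts.items()}

docs = ["alpha beta", "alpha beta gamma", "gamma"]
similarity = normalize(cooccurrence_counts(docs, {"alpha", "beta", "gamma"}))
```

Sorting the topics found in each document keeps every pair in a single canonical order, which reflects the undirected nature of the similarity link.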

Returning to FIG. 1, the chronological analysis unit 300 is described.

Through time-ordered analysis, the chronological analysis unit 300 classifies topics into an internal node set (Internal Chronological Node Group), which is the set of events that occurred in chronological order with respect to a specific subject (the main topic), and an external node set (External Chronological Node Group), which is the set of potential events that have no direct association with the specific subject (the main topic). Based on the associations between topic data extracted by the graph modeling unit 200, the topic data is divided into the internal node set and the external node set.

The internal node set is the set of topic data that is arranged in chronological order and related to the main topic. The external node set is the set of topic data that is indirectly related to the main topic.

The relationship between a parent node (upper node) and a child node (lower node) is determined by chronological order.

The external node set indicates that, although its relationship to the main topic is weak, similar events occurred around the same time. That is, as shown in FIG. 3, the external topic (304-1) at the top of the external node set may have little or no association with the main topic (301-1), but it is related to the main topic in that the two occurred close together in time.

The number of nodes finally displayed in the topic alignment structure of FIG. 3 may be limited to an appropriate number according to the creation times and the similarity values of the links. In particular, in an embodiment of the present invention, the number may be limited using a maximum cut method. That is, the number of displayed nodes can be controlled through a maximum cut threshold that is either preset or supplied by the user.

Meanwhile, if another topic appears long after the first topic appeared, the relationship between the two topics may not be important. Alternatively, a highly related topic may reappear after a long time even though the similarity between the two topics is high. To handle both cases, the nodes shown to the user are selected using a maximum cut threshold that favors a high similarity (relevance) weight while keeping the temporal weight to a minimum.

The reuse unit 400 may visualize the graph built in the manner described above and store the generated result. That is, it displays the result of the chronological analysis visually and stores the analyzed result in a tree structure for later use.

FIGS. 6 and 7 are diagrams illustrating an example in which the reuse unit 400 stores a result in a tree structure according to an embodiment of the present invention.

Referring to FIGS. 6 and 7, once the graph has been constructed in the manner described above, the reuse unit 400 stores all of the graph's information in a standardized format. As a result, the availability and reusability of the data increase, which is useful for future maintenance.

The analysis result is stored in a tree structure so that it can be used in later chronological analysis, and is expressed as a tree using, for example, the XML (eXtensible Markup Language) format shown in FIG. 7. That is, the graph data of FIG. 6 may be stored in an XML format such as that of FIG. 7. The analysis result may be provided through an API, or reused when a user enters a keyword for a major event as a query.
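A minimal sketch of storing a topic tree as XML is given below, using Python's standard `xml.etree.ElementTree` module. The element name `topic`, the attribute `name`, and the dictionary layout are hypothetical; the actual schema of FIG. 7 is not reproduced in the text. The sample tree follows the description of FIG. 3 only in that topic C has subtopics E and F under main topic A; the placement of B is assumed.

```python
import xml.etree.ElementTree as ET

def to_xml(node):
    """Recursively convert a {'name': ..., 'children': [...]} dict
    into a nested XML element (hypothetical schema)."""
    elem = ET.Element("topic", name=node["name"])
    for child in node.get("children", []):
        elem.append(to_xml(child))
    return elem

tree = {
    "name": "A",
    "children": [
        {"name": "B"},
        {"name": "C", "children": [{"name": "E"}, {"name": "F"}]},
    ],
}

# Serialize to a string; the result can be parsed back with ET.fromstring.
xml_text = ET.tostring(to_xml(tree), encoding="unicode")
```

Because the serialized form is plain XML, a stored result can be reloaded later and traversed without rebuilding the graph from the original data sources.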

FIG. 8 is a flowchart of a control method of a curation apparatus according to an embodiment of the present invention.

According to the drawing, in step S810 a plurality of topic data items may be collected from data sources on the web. As one example, as described above, topic data and creation times may be collected from a first SNS data source, and associations between the data may be collected from a second SNS data source such as blogs and from news articles.

In step S830, the chronological analysis unit 300 may classify the topic data into an internal node set and an external node set.

In step S840, the reuse unit 400 visualizes the result of the chronological analysis as a graph and stores the analysis result in a tree structure for reuse.

The following drawings and the related description explain in detail the data analyzed in actual experiments.

For the experiment, one month of data from Twitter, Naver Blog, and Naver News was used. A total of 1.5 million tweets were collected from Twitter, with a total size of about 30 GB. After meaningless words were removed from the tweets, a total of 600 topic data items were extracted, and the graph was constructed from these topic data items. In addition, 22,692 Naver Blog posts and 16,288 Naver News articles were collected. The blog data averaged 328 words per document, and the news data averaged 254 words per document. The links of the graph were built from the Naver Blog and Naver News data, producing 123,894 links in the topic alignment structure. In this experiment the data set was limited to SNS, blogs, and news, but it can easily be extended by adding preprocessing modules that collect data from a wider variety of sources.

The noun extractor 'Komoran' was used to extract topic data from the data sources. Creation times and keywords were used to extract the topic data. Related blog and news data were collected using the keywords extracted from the Twitter data source. The co-occurrence frequency of words in documents was extracted from the blog and news data, topic and TF-IDF pairs were computed, and the similarity link values corresponding to the associations were normalized to the range 0 to 1.

The time relation links were computed from the creation times of the posts, extracted from the Twitter data, that contain the corresponding topic. A time relation link value is calculated as the difference between the document creation time and a normalized value (or average value), and is likewise normalized to the range 0 to 1. The graph is constructed using topics, similarity, and time relations (associations). The internal and external node sets are formed according to the chronological analysis. When a similarity link representing an association has a value above a certain threshold, the nodes it connects are included as a parent node and a child node. This set is called the internal node set (Internal Chronological Group). The other nodes, which do not exceed the threshold, were classified into the external node set (External Chronological Group).
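The normalization and threshold-based grouping described above can be sketched as follows. This is a simplified illustration under stated assumptions: min-max scaling is assumed as the normalization (the text only says values are normalized to 0 to 1), only links that touch the main topic are examined, and the names `min_max` and `classify` as well as the threshold value are hypothetical.

```python
def min_max(values):
    """Scale a list of numbers to the range 0 to 1 (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def classify(main_topic, similarity, threshold):
    """Split topics linked to the main topic into internal and external sets.

    similarity maps an undirected pair (a, b) to a score in [0, 1].
    Topics whose link to the main topic exceeds the threshold are internal;
    the remaining linked topics are external.
    """
    internal, external = set(), set()
    for (a, b), score in similarity.items():
        if main_topic not in (a, b):
            continue  # simplification: ignore links not touching the main topic
        other = b if a == main_topic else a
        (internal if score > threshold else external).add(other)
    return internal, external

scores = {("A", "B"): 0.8, ("A", "X"): 0.1, ("B", "C"): 0.9}
internal, external = classify("A", scores, threshold=0.5)
```

A fuller version would follow links recursively from each internal node to build the parent and child hierarchy; that step is omitted here to keep the sketch short.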

FIG. 9 is a result graph produced by a curation apparatus according to an embodiment of the present invention for a search on Ahn Cheol-soo, a well-known Korean politician. "Ahn Cheol-soo" is the main topic, and "Kim Jong-hoon" appears as a subtopic. "Ahn Cheol-soo" was an opposition presidential candidate, Kim Jong-hoon was regarded as a ruling-party candidate, and "Ahn Cheol-soo" later "resigned" his presidential candidacy in order to support the opposition candidate. This series of events is visualized as a graph divided into the internal node set and the external node set related to Ahn Cheol-soo.

FIG. 10 is a search result graph produced by a curation apparatus according to an embodiment of the present invention for the hacking incident at "Nonghyup", a Korean bank. One of the main suspicions at the time was that the suspect behind the incident was "North Korea". People also thought that the hacking incident was related to the "Cheonan" warship. The external link to "North Korea" is "Kim Kwan-jin", who was the South Korean Minister of National Defense at the time of the incident. This series of events is visualized as a graph divided into the internal node set and the external node set related to "Nonghyup". Some keywords appear unrelated, but in most cases the results were related to the "Nonghyup hacking".

FIG. 11 shows a result graph, produced by a curation apparatus according to an embodiment of the present invention, that differs from the experimental results above. "White Day" is the day, March 14, on which men give candy to women. It can be seen that "White Day" has the subtopic "candy". Although these two keywords are semantically similar, it is difficult to confirm a temporal relationship between them. When various search keywords are entered, the proposed 'Chronological Big data Curation' method works robustly in most cases and helps the user understand a series of topics chronologically. However, it was found to have limitations for general subjects and for one-time events that do not last long.

An evaluation was performed by comparing the chronological information-based curation method according to an embodiment of the present invention with an existing method. The evaluation criterion was the question, "Can a comprehensive understanding of the topic be obtained using the search term?" The inventors compared the method with the Naver News portal, the most widely used news portal site in Korea.

FIG. 12 shows the Naver News results for the search keywords "Ahn Cheol-soo" and "hacking". Referring to FIG. 12(a), the result for "Ahn Cheol-soo", the first page of search results shows only articles from March 31, the most recent date, even when sorted by the accuracy search option. An article from March 29, two days earlier, appeared in 130th position. As this figure shows, it is difficult to understand the overall flow of events by looking only at the news results. In contrast, FIG. 12(b), the result for the search keyword "hacking", allows the whole series of events to be understood easily in an event-centered way.

For a more accurate evaluation, a quantitative comparison was performed between the conventional method and 'Chronological Big data Curation'. For the quantitative evaluation, the unique words appearing in the articles were measured.

FIG. 13 shows the result of comparing the chronological information-based curation method according to an embodiment of the present invention with Naver News search results.

The average values of the following items were calculated for each search term. 1) Unique word ratio: the ratio of proper nouns that can identify the topic of the news. 2) Time distribution: the average time range of the creation times of the articles shown in the final result. K@N denotes the value calculated over the top N news articles returned by the search.
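The two measures can be illustrated with a rough Python sketch. The exact formulas behind FIG. 13 are not given in the text, so both definitions below are assumptions: the time distribution is taken as the span between the earliest and latest creation time among the top N results, and the unique word ratio as the share of distinct keywords among all keywords in the top N results. The function names and sample data are hypothetical.

```python
from datetime import datetime

def time_span_hours_at_n(created_times, n):
    """Span in hours between earliest and latest creation time in the top n."""
    top = created_times[:n]
    if len(top) < 2:
        return 0.0
    return (max(top) - min(top)).total_seconds() / 3600

def unique_ratio_at_n(keyword_lists, n):
    """Share of distinct keywords among all keywords of the top n results."""
    words = [w for keywords in keyword_lists[:n] for w in keywords]
    return len(set(words)) / len(words) if words else 0.0

# Hypothetical ranked results: creation times and extracted keywords.
times = [
    datetime(2013, 3, 31, 12),
    datetime(2013, 3, 31, 6),
    datetime(2013, 3, 29, 12),
]
keywords = [["a", "b"], ["a", "c"]]
```

Under these assumed definitions, a result list that covers a wider range of dates and repeats fewer keywords scores higher on both measures for the same N.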

Comparing the top 5 to 15 search results, the ratio of unique words is higher in every case for the curation method according to an embodiment of the present invention than for the existing Naver News search results. The time distribution of the search results is also higher for the curation method according to an embodiment of the present invention.

That is, the results according to the present invention can be regarded as providing more comprehensive and accurate search results than the Naver News search results.

Embodiments of the chronological information-based curation apparatus and the control method using it according to the present invention have been described above, but they are described as at least one embodiment, and the technical idea of the present invention and its configuration and operation are not limited thereby. The scope of the technical idea of the present invention is not limited by the drawings or by the description that refers to the drawings. In addition, the concepts and embodiments presented herein may be used by those of ordinary skill in the art to which the present invention pertains as a basis for modifying or designing other structures that serve the same purpose as the present invention. Equivalent structures modified or changed by those of ordinary skill in the art fall within the technical scope of the present invention as set out in the claims, and various changes, substitutions, and alterations are possible without departing from the spirit or scope of the invention described in the claims.

Claims

A curation device comprising:
a data collector configured to collect a plurality of topic data and creation time information of the plurality of topic data from data sources on the web;
a graph modeling unit configured to extract associations among the collected plurality of topic data; and
a chronological analysis unit configured to classify the plurality of topic data, based on the extracted associations and the creation time information, into an internal node set directly related to a main topic and an external node set indirectly related to the main topic,
wherein the chronological analysis unit:
divides the data sources on the web into a first group and a second group based on at least one of propagation power, document length, and reliability;
classifies, based on the associations extracted from the first group, topic data having an association with the main topic into the internal node set; and
classifies, based on the creation time information extracted from the second group, topic data related in creation time into the external node set.

The curation device of claim 1,
wherein the data sources include social networking service (SNS) posts and news articles.

The curation device of claim 1,
wherein the main topic is a topic for a search keyword received from a user.

The curation device of claim 1,
wherein the association between two topic data is a quantified value of the number of documents in which the two topic data appear together in a single document.

(Deleted)

The curation device of claim 1,
wherein the chronological analysis unit sorts the plurality of topics in order of creation time.

A control method of a curation device, the method comprising:
collecting a plurality of topic data and creation time information of the plurality of topic data from data sources on the web;
extracting associations among the collected plurality of topic data; and
classifying the plurality of topic data, based on the extracted associations and the creation time information, into an internal node set directly associated with a predetermined topic and an external node set indirectly associated with the predetermined topic,
wherein the classifying comprises:
dividing the data sources on the web into a first group and a second group based on at least one of propagation power, document length, and reliability;
classifying, based on the associations extracted from the first group, topic data having an association with a main topic into the internal node set; and
classifying, based on the creation time information extracted from the second group, topic data related in creation time into the external node set.