KR102591935B1

KR102591935B1 - Method of simulating time series data by combination of base time series through cascaded feature and computer device performing the same

Info

Publication number: KR102591935B1
Application number: KR1020220147470A
Authority: KR
Inventors: 김광호; 김유상
Original assignee: 김유상; 김광호
Priority date: 2022-11-08
Filing date: 2022-11-08
Publication date: 2023-10-23
Also published as: KR20240066987A; KR20240067049A

Abstract

The present invention relates to a method for simulating time series data. The method, performed in a computing device, preferably includes: receiving target time series data to be simulated; calculating the proportion of each base time series that minimizes the error between the target time series data and simulated time series data determined by the proportions of a plurality of base time series; and outputting the proportions calculated to yield the minimized error. According to the present invention, a diversified set of time series data can be constructed by calculating optimized proportions of base time series that can simulate the target time series data.

Description

Method of simulating step-by-step characteristics of time series data through combination of base time series and computer device performing the same {Method of simulating time series data by combination of base time series through cascaded feature and computer device performing the same}

The present invention relates to a method for simulating time series data, and preferably to a method for simulating combined time series data using base time series that have different time-varying characteristics.

Time series data are a plurality of data whose values vary over time, and many algorithms are being developed to predict changes in time series data.

With the development of artificial intelligence technology, neural-network-based models have been devised to learn the causal relationship between the features of past intervals and the features of the current interval.

In general, recurrent neural networks (RNNs) store past information in a hidden state and perform operations on sequences, and have been used for natural language processing and other signal analysis.

However, an RNN is limited in its ability to learn longer-term dependencies. For longer-term prediction, LSTM (Long Short-Term Memory) models, which use additional gates to decide which information in the hidden cell is passed to the output and to the next hidden state, are widely used in the field of time series prediction.

However, these neural network models are optimized only for extracting the features of change in a single time series. They cannot comprehensively predict changes in higher-level data generated by combining multiple base time series, and because they learn by focusing on representative characteristics, overfitting problems arise during validation.

Even if the composition of the base time series that make up the higher-level data is known, the pattern of change can differ depending on the individual proportions, so a more specific prediction method is needed.

In particular, this problem of predicting higher-level time series data applies to various financial products that reflect the complex patterns of change found in the real world.

For example, an active fund, which is managed to earn returns above the market rate of return, may be managed by flexibly allocating assets and selecting stocks according to the market outlook. To predict returns from changes in such a fund, it is necessary to understand the time series nature of each stock that makes up the fund.

An object of the present invention is to propose a method for optimizing base time series that can simulate target time series data.

Another object of the present invention is to propose a method for determining the proportions of base time series by embedding the microscopic and macroscopic characteristics of the target time series data.

Another object of the present invention is to propose a method that resolves the overfitting problem of concentrating on representative characteristics, by applying a regularization process that uses several significant macroscopic characteristics not tied to the observation times of the time series data.

Another object of the present invention is to propose a method for optimizing the proportions of all base time series by optimizing each of the separated parts of the target time series data.

A further object of the present invention is to increase flexibility in combining base time series by allowing the composition and proportions of the base time series to be set freely, according to the goal, using the higher-level time series data.

To solve the above technical problem, a method for simulating time series data performed in a computing device according to the present invention preferably includes: receiving target time series data to be simulated; calculating the proportion of each base time series that minimizes the error between the target time series data and simulated time series data determined by the proportions of a plurality of base time series; and outputting the proportions calculated to yield the minimized error.

Preferably, the target time series data is determined by arbitrary proportions among the first base time series constituting a first base time series set, and the simulated time series data is determined by the proportions among the second base time series constituting a second base time series set.

Preferably, the target time series data is a first investment fund composed of the first base time series of each product in a first investment portfolio, which is defined as the first base time series set, and the simulated time series data is a second investment fund that stands in for the first investment fund, in which investment is not possible, and is determined by the proportions among the second base time series of arbitrary products.

Preferably, the composition of products in a second investment portfolio, which consists of a combination of the arbitrary products, can be varied by the user, and the step of calculating the proportions recalculates the proportions for the varied products.

Preferably, the step of calculating the proportions calculates the proportions that minimize the difference between the unit change rates of the simulated time series data and those of the target time series data.

Preferably, the step of calculating the proportions calculates the proportions that minimize the distance, in a common embedding space, between a first embedding vector of the target time series data and a second embedding vector of the simulated time series data, both extracted by a trained time series model. The time series model consists of an encoder that extracts the features of each overlapping interval of time series data as an embedding vector and a decoder that generates synthetic time series data from the extracted embedding vector, and is trained to minimize the error between the time series data and the synthetic time series data.

The step of calculating the proportions includes: expanding the dimension of the target time series data and the simulated time series data; extracting, for each, a topological feature vector that defines the periodic characteristics in the expanded dimensional space; and calculating the proportions that minimize the difference between the extracted topological feature vectors.

The step of calculating the proportions includes: calculating statistical values of the target time series data and the simulated time series data; and calculating the proportions that minimize the difference between the statistical values.

Preferably, the step of calculating the proportions calculates the proportions that minimize a weighted sum of the terms to be minimized in the objective function, weighted by predetermined tuning parameters.

Preferably, the step of calculating the proportions calculates proportions that minimize each term independently, and is repeated until the weighted sum satisfies a termination condition.

To solve the above technical problem, a computing device according to the present invention includes a processor and a memory that communicates with the processor. The memory stores instructions that cause the processor to perform operations, and the operations include: receiving target time series data to be simulated; calculating the proportion of each base time series that minimizes the error between the target time series data and simulated time series data determined by the proportions of a plurality of base time series; and outputting the proportions calculated to yield the minimized error.

According to the present invention, a diversified set of time series data can be constructed by calculating optimized proportions of base time series that can simulate the target time series data.

In addition, by simulating the changes of real assets that carry time series information as a combination of base time series, independent decisions can be made for each base time series.

In addition, for composite products such as funds, portability to other environments is improved, and various derivative investment methods can be provided by creating similar combinations from the base items available in a new environment.

In addition, by reconstructing offshore financial products in which direct investment from Korea is not possible as a combination of products that can be invested in from Korea, the effect of indirectly investing in the same product can be obtained at relatively low cost.

In addition, investment freedom can be increased by making the composition of fund products more flexible according to individual decisions while still simulating the characteristics of the target fund.

Figure 1 is an exemplary diagram showing the configuration of target time series data according to an embodiment of the present invention.
Figure 2 is an exemplary diagram showing the configuration of simulated time series data according to an embodiment of the present invention.
Figure 3 is a flowchart showing a time series data simulation method according to an embodiment of the present invention.
Figure 4 is an exemplary diagram showing a recursive simulation method of time series data according to an embodiment of the present invention.
Figure 5 is an exemplary diagram showing a learning pipeline for simulation of time series data according to an embodiment of the present invention.
Figure 6 is an exemplary diagram showing a simulation method using unit rate of change according to an embodiment of the present invention.
Figures 7 and 8 are exemplary diagrams showing a simulation method using a time series model according to an embodiment of the present invention.
Figure 9 is a flowchart showing a simulation method considering topological characteristics according to an embodiment of the present invention.
Figures 10 and 11 are exemplary diagrams showing a simulation method considering topological characteristics according to an embodiment of the present invention.
Figure 12 is a flowchart showing a simulation method considering statistical characteristics according to an embodiment of the present invention.
Figure 13 is an exemplary diagram showing an implementation in the form of a computing device that performs a simulation method according to embodiments of the present invention.

The following merely illustrates the principles of the invention. A person skilled in the art can therefore devise various devices that embody the principles of the invention and fall within its concept and scope, even if they are not explicitly described or shown herein. In addition, all conditional terms and embodiments listed in this specification are, in principle, expressly intended only to aid understanding of the inventive concept, and the invention should be understood as not being limited to the specifically listed embodiments and states.

The above objects, features, and advantages will become clearer through the following detailed description taken with the attached drawings, so that a person having ordinary skill in the technical field to which the invention pertains can readily practice the technical idea of the invention.

In describing the invention, a detailed description of known technology related to the invention is omitted where it is judged that it could unnecessarily obscure the gist of the invention. Hereinafter, preferred embodiments of the present invention are described in detail with reference to the attached drawings.

Figure 1 is a diagram illustrating the configuration of target time series data to be simulated according to an embodiment of the present invention.

Referring to Figure 1, the target time series data may be composed of a set U_0 of a plurality of base time series (x_1(t), ..., x_m(t)), and has combined time-varying characteristics according to the proportions determined among the base time series.

Such target time series data may be, for example, the change over time of a composite fund made up of a combination of various investment items in a fluctuating stock market, and its overall time-varying characteristics may be determined by the know-how of the fund manager and by various decisions made in response to external environmental variables.

For example, the proportion of each item that makes up the fund may be set according to various external environmental factors that affect the target time series data, and may be updated continuously so that the pattern of change in the target time series data meets the investment objective.

Investment products such as offshore funds based in the United States, Europe, and elsewhere carry relatively high fees, and it is difficult for outsiders to learn the specific composition of the investment portfolio and the proportions of its components in a timely manner. In addition, various constraints may exist: depending on the policies of the local investment market, investment from Korea may be restricted or may incur high costs in time and money.

Accordingly, the simulation method of this embodiment defines an investment fund composed of the base time series of each product in an investment portfolio as the target time series data, and makes it possible to simulate an investment fund in which investment is not possible, using the base time series of arbitrary products and at relatively low fees.

Furthermore, when simulating a fund, the composition of products in the investment portfolio can be varied by the user to reflect the individual's investment strategy and preferences, and the simulation is then performed with the varied product composition, which gives the user autonomy.

In other words, the time series data simulation method of this embodiment aims to generate, without knowing the composition of the base time series that make up the target time series data or their respective proportions, the simulated time series data that best simulates the target time series data from a set of base time series that may be partly the same or different.

Figure 2 is a diagram illustrating the configuration of time series data that simulates the target time series data.

The simulated time series data may be generated from a base time series set U composed of various base time series.

This base time series set may differ in part or in whole from the set that makes up the target time series data. Alternatively, when the two sets have the same composition, the simulation method according to the present invention may be performed as a prediction of the proportions, since the proportion of each base time series in the target time series data is unknown.

In this embodiment, the given target time series data S^*(t) ≡ Signal^*(t) is assumed to be a specific but unknown linear combination of the m different underlying time series in the base time series set U_0 = {x_1(t), ..., x_m(t)}, where each x_i(t) may be generated from some unknown stochastic process.

The composition of the base time series set U_0 and the proportions w^*_1, ..., w^*_m are assumed to be unknown in general.

The time series data simulation method of this embodiment can therefore be defined as the problem of finding the composition of a base time series set whose pattern is most similar to the target time series data, together with the optimized proportion of each base time series.

That is, the aim in this embodiment is to synthesize, from the base time series set U = {y_1(t), ..., y_n(t)} that constitutes the simulated time series data, the simulated time series data S(t) that best simulates the dynamic characteristics of the target time series data Signal^*(t).
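The synthesis described above can be read as a weighted sum of the base time series, S(t) = w_1 y_1(t) + ... + w_n y_n(t). The following is a minimal sketch of that reading; the function name `combine`, the proportions w_i for the simulated series, and the toy data are assumptions introduced for illustration and do not appear in the source.

```python
from typing import List, Sequence


def combine(weights: Sequence[float],
            base_series: Sequence[Sequence[float]]) -> List[float]:
    """Return S(t) = sum_i w_i * y_i(t) for base series of equal length."""
    if len(weights) != len(base_series):
        raise ValueError("one proportion per base time series is required")
    length = len(base_series[0])
    if any(len(y) != length for y in base_series):
        raise ValueError("all base time series must cover the same time steps")
    return [
        sum(w * y[t] for w, y in zip(weights, base_series))
        for t in range(length)
    ]


# Toy data (hypothetical): two base series y_1(t), y_2(t) at three time steps.
U = [[1.0, 2.0, 3.0], [10.0, 10.0, 20.0]]
S = combine([0.5, 0.5], U)
```

The simulation problem is then to choose the proportions so that `S` follows the target as closely as possible under the similarity measures described later.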

Referring to Figure 3, the time series data simulation method of this embodiment may be performed iteratively: the proportions of the base time series set that constitutes the simulated time series data are updated recursively so as to minimize the error with respect to the target time series data.

Specifically, the simulation method of this embodiment uses an unsupervised learning model and topological analysis to simulate the macroscopic dynamic characteristics of a given time series as closely as possible, and can therefore simulate more accurately than existing methods based on simple statistics.

Hereinafter, a method for simulating time series data performed in a computing device according to an embodiment of the present invention is described in more detail with reference to Figure 4.

Referring to Figure 4, the target time series data to be simulated is first received (S100).

As described above, the target time series data S^*(t) ≡ Signal^*(t) may be determined by the set U_0 of base time series x_i(t) and the proportion w^*_i of each base time series (i = 1, ..., m).

Next, the simulation method of this embodiment calculates the optimal proportions by repeatedly updating the proportions so as to minimize the error between the target time series data and the simulated time series data, which is determined by a plurality of base time series and their respective proportions (S200).

This iterative update of the proportions may be performed by separating the characteristics of the target time series data by scale, from microscopic characteristics over a relatively small range to overall macroscopic characteristics, and optimizing each of the separated parts.
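As a rough illustration of the repeated update in step S200, the sketch below fits the proportions by plain gradient descent on the mean squared error between the simulated and target series. The optimizer, the loss, the equal-proportion starting point, and every name and parameter value (`fit_proportions`, `lr`, `max_steps`, `tol`) are assumptions chosen for illustration; the source does not specify how the update is carried out.

```python
from typing import List, Sequence


def fit_proportions(target: Sequence[float],
                    base_series: Sequence[Sequence[float]],
                    lr: float = 0.1,
                    max_steps: int = 1000,
                    tol: float = 1e-12) -> List[float]:
    """Repeatedly update proportions w to reduce mean squared error."""
    n, length = len(base_series), len(target)
    w = [1.0 / n] * n  # assumed starting point: equal proportions
    for _ in range(max_steps):
        # Residual of the simulated series against the target at each step.
        residual = [
            sum(w[i] * base_series[i][t] for i in range(n)) - target[t]
            for t in range(length)
        ]
        loss = sum(r * r for r in residual) / length
        if loss < tol:  # assumed termination condition
            break
        # Gradient of the mean squared error with respect to each w_i.
        grad = [
            2.0 / length
            * sum(residual[t] * base_series[i][t] for t in range(length))
            for i in range(n)
        ]
        w = [w[i] - lr * grad[i] for i in range(n)]
    return w


# Toy check (hypothetical data): the target is built from known proportions.
y1 = [1.0, 0.0, 1.0, 0.0]
y2 = [0.0, 1.0, 0.0, 1.0]
target = [0.3 * a + 0.7 * b for a, b in zip(y1, y2)]
w = fit_proportions(target, [y1, y2])
```

In this toy case the recovered `w` approaches the proportions used to build the target. With real data the base set generally differs from the target's own components, so only an approximate fit should be expected.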

Specifically, referring to Figure 5, the learning pipeline for updating the proportions is divided into processes, each of which takes the target time series data and the second time series set constituting the simulated time series data as input and extracts its own embedding vectors.

In this embodiment, the learning pipeline can simulate the time-varying characteristics of the target time series data by subdividing them in sequence on the basis of four kinds of similarity measurement functions.

For example, at the first size, the smallest of the subdivided sizes, the unit change characteristics can be simulated through the minimum unit change rate, and at the second size the time-varying characteristics of a larger interval can be simulated using a trained neural network model.

At the third size, the periodic change over an interval is defined as a topological characteristic so that the time-varying characteristics can be simulated, and at the fourth size the simulation is performed so that the difference between statistical values, as a global characteristic of the two time series, becomes small.

The larger the size by which the time series data is divided for feature extraction, the more weight macroscopic similarity between the two time series can carry. This macroscopic similarity weight can be set as a tuning parameter of the learning pipeline itself and can be adjusted during training.
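One way to read the four similarity terms together with the tuning parameters is as a single weighted objective. The symbols below are assumptions introduced for illustration, since the source does not name them: w is the vector of base time series proportions, the four terms are the unit-change, interval-embedding, topological, and statistical dissimilarities, and each \lambda_k \ge 0 is a tuning parameter.

```latex
\mathcal{L}(w) =
  \lambda_1\,\mathcal{L}_{\mathrm{unit}}(w)
+ \lambda_2\,\mathcal{L}_{\mathrm{embed}}(w)
+ \lambda_3\,\mathcal{L}_{\mathrm{topo}}(w)
+ \lambda_4\,\mathcal{L}_{\mathrm{stat}}(w),
\qquad
w^{\dagger} = \arg\min_{w}\,\mathcal{L}(w)
```

Under this reading, raising \lambda_3 and \lambda_4 relative to \lambda_1 shifts the fit toward macroscopic similarity.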

Each part of the learning pipeline is described in more detail below.

First, to simulate the finest-grained characteristics, the unit change measurement module (134-1) in the learning pipeline extracts the unit change rate of each interval delimited by the measurement unit (the first size) of the time series data and calculates the difference between the unit change rate vectors.

Referring to Figure 6, the time axes of the target time series data and the simulated time series data can be synchronized, and the difference between the change rates is calculated for each unit delimited by the measurement unit of the target time series data.

The unit change measurement module (134-1) can calculate the difference in unit change rate, δS(t) - δS^*(t), for each interval.

The unit change measurement module (134-1) then calculates the proportions that minimize the total sum of the unit change rate differences over all units.
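The unit change comparison can be sketched as follows. Two choices here are assumptions rather than statements from the source: the unit change rate is taken to be the relative change between consecutive observations, and the differences are squared before summing so that positive and negative differences do not cancel. The function names are hypothetical.

```python
from typing import List, Sequence


def unit_change_rates(series: Sequence[float]) -> List[float]:
    """Relative change between consecutive observations (assumed definition).

    Assumes every value except the last is nonzero.
    """
    return [(b - a) / a for a, b in zip(series, series[1:])]


def unit_change_loss(simulated: Sequence[float],
                     target: Sequence[float]) -> float:
    """Sum over all units of the squared difference in unit change rate."""
    d_sim = unit_change_rates(simulated)
    d_tgt = unit_change_rates(target)
    return sum((s - t) ** 2 for s, t in zip(d_sim, d_tgt))


# Hypothetical series sampled at the same, synchronized time steps.
target = [100.0, 110.0, 121.0]
loss_same = unit_change_loss([50.0, 55.0, 60.5], target)    # same rates, other level
loss_diff = unit_change_loss([100.0, 120.0, 120.0], target)  # different rates
```

Because the comparison is on rates rather than levels, a simulated series at a different price level but with the same step-by-step movement incurs no penalty in this term.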

Furthermore, to simulate the characteristics of the target time series data over a larger range, the change in each interval of the second size can be embedded.

For a chosen interval T, the interval change measurement module (134-2) embeds the change characteristics of each interval and optimizes the proportions by minimizing the difference between the resulting interval embedding vectors.

The intervals are allowed to overlap one another so that the characteristics that influence changes between intervals are emphasized more strongly.
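Splitting a series into overlapping intervals can be sketched as a sliding window. The function name and the `length` and `stride` parameters are assumptions for illustration; the source states only that the intervals may overlap.

```python
from typing import List, Sequence


def overlapping_windows(series: Sequence[float],
                        length: int,
                        stride: int) -> List[List[float]]:
    """Split a series into fixed-length windows; stride < length gives overlap."""
    if not 0 < stride <= length:
        raise ValueError("stride must be between 1 and the window length")
    return [
        list(series[start:start + length])
        for start in range(0, len(series) - length + 1, stride)
    ]


windows = overlapping_windows([1, 2, 3, 4, 5, 6], length=4, stride=2)
```

Here consecutive windows share two observations, so a change that falls on an interval boundary is seen by both neighboring windows.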

본 실시예에서는 구간 임베딩을 위하여 미리 학습된 모델을 이용할 수 있으며, 모델은 재귀적 특성을 추출하는 시계열 모델로 바람직하게는 LSTM 모델을 이용할 수 있다.In this embodiment, a pre-trained model can be used for section embedding, and the model can be a time series model that extracts recursive characteristics, preferably an LSTM model.

이때, LSTM 모델은 목표 시계열 특성의 함축적인 특징을 임베딩 벡터로 잘 추출할 수 있도록 미리 학습될 수 있으며 학습을 위하여 고유의 제2 학습 파이프라인을 추가적으로 구성할 수 있다.At this time, the LSTM model can be trained in advance to well extract the implicit features of the target time series characteristics as an embedding vector, and a unique second learning pipeline can be additionally configured for learning.

도 7을 참조하면, 제2 학습 파이프라인은 원본 시계열 데이터의 구간 별 특징을 결정된 차원의 임베딩 벡터로 추출하는 인코더와 추출된 임베딩 벡터를 통해 가상(Synthetic) 시계열 데이터를 생성하는 디코더로 구성될 수 있으며 입력된 시계열 데이터와 디코더로 출력된 가상 시계열 데이터 간의 오차를 손실 함수로 구성하여 오차를 최소화하는 방법으로 지속적인 LSTM 모델의 학습을 수행할 수 있다.Referring to FIG. 7, the second learning pipeline may be composed of an encoder that extracts the characteristics of each section of the original time series data into an embedding vector of a determined dimension and a decoder that generates synthetic time series data through the extracted embedding vector. Continuous LSTM model learning can be performed by minimizing the error between the input time series data and the virtual time series data output from the decoder by configuring it as a loss function.

In this embodiment, the loss function for training the LSTM model may consist of two terms: the mean squared error (MSE) between the original time series data and the generated synthetic time series data, and the error between feature vectors in the embedding dimension.

The error between feature vectors may be calculated from the distance between embedding vectors within the determined embedding dimension.

Specifically, in this embodiment, false nearest neighbors (FNN) may be used to estimate the dimension of the embedding vector, which must be fixed before the embedding vectors are extracted.

The false nearest neighbors method counts the neighbors of a point as the embedding dimension increases. It assumes that many neighbors are false when the embedding dimension is too low, whereas most neighbors are true at or above an appropriate embedding dimension, and on that basis it determines the dimension that best represents the characteristics of the data.

That is, by examining how the number of false neighbors changes as a function of the dimension, the embedding dimension best suited to compressing the time series data in the LSTM model can be determined.
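The dimension search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the function names, the delay of 1, the distance-ratio tolerance of 10, and the 5% stopping threshold are all hypothetical choices made for the example.

```python
import math


def delay_embed(series, dim, lag=1):
    """Return delay vectors [x[t], x[t+lag], ..., x[t+(dim-1)*lag]]."""
    n = len(series) - (dim - 1) * lag
    return [[series[t + k * lag] for k in range(dim)] for t in range(n)]


def fnn_fraction(series, dim, lag=1, ratio_tol=10.0):
    """Fraction of nearest neighbours in `dim` dimensions that are false,
    i.e. that move far apart once one more delay coordinate is added."""
    n = len(series) - dim * lag  # points that also exist in dim + 1 dimensions
    points = delay_embed(series, dim, lag)[:n]
    false_count = 0
    counted = 0
    for i in range(n):
        best_j, best_d = None, math.inf
        for j in range(n):
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if d < best_d:
                best_j, best_d = j, d
        if best_j is None or best_d == 0.0:
            continue  # no usable neighbour (isolated point or exact duplicate)
        counted += 1
        # Extra separation contributed by the (dim + 1)-th coordinate.
        extra = abs(series[i + dim * lag] - series[best_j + dim * lag])
        if extra / best_d > ratio_tol:
            false_count += 1
    return false_count / counted if counted else 0.0


def estimate_embedding_dim(series, max_dim=8, lag=1, threshold=0.05):
    """Smallest dimension whose false-neighbour fraction is at most `threshold`."""
    for dim in range(1, max_dim + 1):
        if fnn_fraction(series, dim, lag) <= threshold:
            return dim
    return max_dim
```

The brute-force neighbour search is quadratic in the series length, which is acceptable for a sketch; a spatial index would be the usual choice for long series.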

The LSTM model used in this embodiment may be trained on a training data set collected from the domain shared with the target time series data. For example, using the index fluctuations of funds similar to a specific fund in the stock market, or of funds within the same country, as the training data set can give the model robust performance.

Using the trained LSTM model, the embedding vector of the target time series data and the embedding vectors of the simulated time series data are calculated, and the proportions are optimized so that the error between the vectors is minimized.

Specifically, the error between embedding vectors (L_FNN) is calculated as the distance between the embedding vectors defined in the determined dimension, and the proportions are optimized against this error.

Referring to FIG. 8, the distances between the feature vectors (x_i, …, x_j) of the simulated time series data embedded in that dimension and the feature vectors of the target time series data (S^*) may be calculated as Euclidean distances (d_i, …, d_j), and the proportions w are optimized so that the sum of these distances is ultimately minimized.
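A minimal sketch of the distance sum described above, under assumptions: `embed` stands in for the trained LSTM encoder (here any callable that maps a section to a vector), and `size` and `step` are hypothetical section parameters. A step smaller than the size yields overlapping sections.

```python
import math


def simulated_series(weights, base_series):
    """Weighted combination of the base time series, one value per time step."""
    length = len(base_series[0])
    return [sum(w * b[t] for w, b in zip(weights, base_series)) for t in range(length)]


def overlapping_sections(series, size, step):
    """Split a series into sections of `size`; step < size makes them overlap."""
    return [series[s:s + size] for s in range(0, len(series) - size + 1, step)]


def embedding_distance_sum(weights, base_series, target, embed, size, step):
    """Sum of Euclidean distances between section embeddings of the
    simulated series and of the target series."""
    simulated = simulated_series(weights, base_series)
    pairs = zip(overlapping_sections(simulated, size, step),
                overlapping_sections(target, size, step))
    return sum(math.dist(embed(a), embed(b)) for a, b in pairs)
```

An optimizer would then search for the weights that make `embedding_distance_sum` as small as possible; the choice of optimizer is left open here.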

Furthermore, the method for simulating time series data according to this embodiment extracts features over a still larger range and performs the simulation using those features.

Preferably, the phase measurement module 134-3 extracts topological features over a section larger than the section used by the LSTM model described above.

In this embodiment, the phase measurement module 134-3 selects the topological characteristics that persist regardless of the observation period and quantifies them statistically, which makes these characteristics far more robust than other algebraic characteristics.

Referring to FIG. 9, the phase measurement module first expands the dimension of the target time series data and the simulated time series data (S210).

For example, when time series data defined by displacement values in one dimension is mapped to a higher dimension (K dimensions), the repetitive characteristics of the time series data may appear more prominently.

Specifically, time delay embedding of the time series data into K dimensions expands and structures, in the higher dimension, the features that overlap at delayed times. The interdependent characteristics of different time periods are therefore emphasized better in the expanded dimension.
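Time delay embedding itself is a short computation. The sketch below is illustrative only; the function name and the parameters `dim` (the K of the text) and `delay` are assumptions, and the embodiment does not state how the delay is chosen.

```python
def time_delay_embedding(series, dim, delay):
    """Map a 1-D series to points (x[t], x[t+delay], ..., x[t+(dim-1)*delay])."""
    n = len(series) - (dim - 1) * delay
    if n <= 0:
        raise ValueError("series too short for the requested dimension and delay")
    return [tuple(series[t + k * delay] for k in range(dim)) for t in range(n)]


# Each point stacks the value at time t with values observed `delay` steps later.
points = time_delay_embedding([1, 2, 3, 4, 5], dim=3, delay=1)
```

For a periodic series the resulting points trace a closed loop in the K-dimensional space, which is the kind of structure the topological features below are meant to capture.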

As an example, the topological characteristic extracted from the time series data in this embodiment may be periodicity or quasi-periodicity. The phase measurement module 134-3 extracts, as a topological feature vector, the (quasi-)periodic characteristics of the time series data in K dimensions, at least part of which repeats at the delay time determined through time delay embedding (S220).

Referring to FIG. 10, for time series data with periodicity (a), the patterns that repeat at the delayed times appear more prominently in K dimensions, whereas time series data with weak periodicity (b) shows a scattered pattern in K dimensions.

Accordingly, the phase measurement module 134-3 extracts only the statistically significant homology characteristics from the features shown by these patterns in K dimensions, so that more global characteristics of the time series data can be determined.

Specifically, referring to FIG. 11, to calculate the difference between topological feature vectors, the K-dimensional topological vector may be compressed back to a lower dimension with respect to the homology spaces related to periodicity (e.g., H_1, H_2). For example, reducing the compressed topological vector to one dimension yields a persistence landscape vector that characterizes the persistence of the data.
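A persistence landscape can be evaluated directly from a persistence diagram. The sketch below assumes the (birth, death) pairs have already been produced by a persistent homology computation on the delay-embedded points, which is not shown; `diagram`, `grid`, and `depth` are hypothetical inputs. It uses the standard definition in which the k-th landscape layer at t is the k-th largest value of max(0, min(t - birth, death - t)).

```python
def persistence_landscape(diagram, grid, depth=1):
    """Evaluate the first `depth` landscape layers on the points of `grid`.

    diagram: list of (birth, death) pairs for one homology dimension.
    Returns a list of `depth` rows, each with one value per grid point.
    """
    layers = [[0.0] * len(grid) for _ in range(depth)]
    for col, t in enumerate(grid):
        # Tent function of every interval at t, largest first.
        tents = sorted((max(0.0, min(t - b, d - t)) for b, d in diagram), reverse=True)
        for k in range(min(depth, len(tents))):
            layers[k][col] = tents[k]
    return layers


def landscape_vector(layers):
    """Flatten the layers into a single 1-D vector for comparison."""
    return [value for layer in layers for value in layer]
```

Two series can then be compared by a distance between their landscape vectors, computed separately for each homology dimension.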

In the landscape vector representing the topological characteristics of the time series data, a denser distribution indicates a stronger specific periodicity; the time series data itself can then be regarded as relatively little affected by external factors and strongly inclined to maintain its pattern.

Accordingly, by closely simulating the topological characteristics of the target time series data, the phase measurement module 134-3 makes the simulated time series data reflect the influence of the actual environment or maintain its periodicity, which prevents errors caused by the environment.

Next, the phase measurement module 134-3 calculates the proportions so as to minimize the difference between the extracted topological feature vectors (S230). It calculates the error between landscape vectors and optimizes the proportions w so that the sum of the errors between the landscape vectors of each part (H₁, H₂) is minimized.

Furthermore, the learning pipeline in this embodiment may also determine the proportions of the base time series within the simulated time series data using statistical characteristics.

The statistical module 134-4 calculates statistical values as the most global characteristics of the time series data, and optimizes the proportions w by comparing the differences between those statistical values.

Referring to FIG. 12, the statistical module 134-4 calculates statistical values of the target time series data and the simulated time series data (S210-1).

In this embodiment, the statistical module 134-4 may use various statistical indicators as measures for comparing the similarity of time series data. For example, the mean, the variance, the Pearson correlation coefficient, and dynamic time warping (DTW) may be used together.

When several statistical indicators are used together, the statistical module 134-4 calculates the proportions by optimizing w so as to minimize, for each indicator, the difference from the target time series data.

The relative weights of the errors of the individual statistical indicators may be set separately (for example, 2:1:1:1), and the proportions are calculated so that the total sum of errors is minimized (S210-2).
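The weighted statistical error can be sketched as below. Several details are assumptions made for illustration: the order in which the 2:1:1:1 ratio is applied (mean, variance, correlation, DTW), the use of 1 minus the Pearson coefficient as the correlation error, and the absolute difference as the DTW step cost. The text does not specify any of these.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined for constant series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def dtw_distance(xs, ys):
    """Dynamic time warping distance with absolute-difference step cost."""
    cost = [[math.inf] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            step = abs(xs[i - 1] - ys[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[-1][-1]


def statistical_error(simulated, target, ratios=(2.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the indicator errors between two series."""
    errors = (
        abs(mean(simulated) - mean(target)),
        abs(variance(simulated) - variance(target)),
        1.0 - pearson(simulated, target),  # assumed error form for correlation
        dtw_distance(simulated, target),
    )
    return sum(r * e for r, e in zip(ratios, errors))
```

Because the indicators have different scales, the ratios in practice also act as a normalization, which is one reason they are left configurable.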

Referring again to FIG. 5, because the optimization in each module of the learning pipeline simulates the target time series data by concentrating on characteristics of a different scale, an additional process in which the modules complement one another may be required.

Accordingly, the learning pipeline according to this embodiment may perform an iterative update process subject to predetermined constraints and termination conditions.

As constraints, minimum and maximum values may be set for each proportion. Alternatively, during the continuous updating of the simulated time series data, the proportions may be restricted so that they are determined only within a specific range of the past proportions.

In addition, because the learning pipeline in this embodiment optimizes the proportions at each scale, the overall problem is to minimize the total error. The objective function of the learning pipeline is formulated as a quadratic program (QP) as in the equation below, and the proportions w are calculated from it.

[Equation]

(Here, Rw is the unit rate of change of the simulated time series data, r_tgt is the unit rate of change of the target time series data, Ew is the embedding vector extracted by the time series model for the simulated time series data, e_tgt is the embedding vector extracted by the time series model for the target time series data, Tw is the landscape vector of the simulated time series data, t_tgt is the landscape vector of the target time series data, and Sw and S_tgt are the statistical values of the simulated time series data and the target time series data, respectively.)
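The equation itself did not survive in this text. One form consistent with the symbol definitions above, the weights λ_e, λ_t, λ_s, the constraint set C, and the sum-to-one condition described in the surrounding paragraphs is sketched below. It is an assumption, not the published equation: in particular, the use of squared Euclidean norms and the unit weight on the first term are guesses.

```latex
\min_{w} \;
  \lVert R w - r_{tgt} \rVert_2^2
  + \lambda_e \lVert E w - e_{tgt} \rVert_2^2
  + \lambda_t \lVert T w - t_{tgt} \rVert_2^2
  + \lambda_s \lVert S w - S_{tgt} \rVert_2^2
\qquad \text{subject to} \qquad
  w \in C, \quad \mathbf{1}^{\top} w = 1
```

With every term quadratic in w and the constraints linear, a problem of this shape is a quadratic program, which matches the QP formulation the text refers to.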

The optimization is performed within the range that satisfies the conditions in the set C defining the constraints, and the proportions w of the base time series are set to sum to 1.

For practicality, a time series simulation may use only part of the full set of available base time series. To this end, the constraints may include a linear constraint that limits the number of base time series with a positive proportion to at most k, in which case the optimization model becomes a mixed-integer quadratic program.
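One common way to express such a limit with linear constraints, shown here as an assumption rather than as the embodiment's own formulation, introduces a binary indicator z_i for each of the n base time series. The symbols z_i and n, and the non-negativity of w_i, are not given in the text.

```latex
z_i \in \{0, 1\}, \qquad
0 \le w_i \le z_i, \qquad
\sum_{i=1}^{n} z_i \le k, \qquad
i = 1, \dots, n
```

Since the proportions sum to 1, the bound w_i ≤ z_i forces w_i = 0 whenever z_i = 0 and is not binding otherwise; the binary variables are what turn the quadratic program into a mixed-integer one.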

In addition, the weights λ_e, λ_t, and λ_s of the terms in the objective function may be preset as tuning parameters, and the tuning parameters may be obtained by alternating training and validation within the training data set.

In this embodiment, the learning pipeline calculates the proportions by optimizing the terms of the function one after another and thereby reducing the overall error.

That is, defining the optimization of the objective function as a quadratic program (QP) makes the optimization more efficient. In this embodiment, the learning pipeline iteratively optimizes each part of the QP. It uses OSQP (Operator Splitting QP), a QP solver based on ADMM (alternating direction method of multipliers) that is regarded as one of the most advanced for big data optimization, to optimize the components of each problem quickly.
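To make the shape of the problem concrete, the sketch below solves only the first term (matching unit rates of change) with non-negative proportions that sum to 1. It deliberately does not use OSQP or ADMM: projected gradient descent is swapped in so the example stays dependency-free. The function names, the step-size rule, and the iteration count are assumptions.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        candidate = (cumulative - 1.0) / i
        if ui - candidate > 0:
            theta = candidate  # last index where the condition holds wins
    return [max(x - theta, 0.0) for x in v]


def fit_proportions(base_returns, target_returns, steps=2000):
    """Minimise ||R w - r_tgt||^2 over the simplex by projected gradient descent.

    base_returns: one list of unit rates of change per base time series
    (must not be all zeros). target_returns: the target's unit rates of change.
    """
    n, length = len(base_returns), len(target_returns)
    w = [1.0 / n] * n
    # 1 / (2 * squared Frobenius norm) is a safe step for this quadratic.
    step = 1.0 / (2.0 * sum(x * x for series in base_returns for x in series))
    for _ in range(steps):
        residual = [sum(w[i] * base_returns[i][t] for i in range(n)) - target_returns[t]
                    for t in range(length)]
        gradient = [2.0 * sum(residual[t] * base_returns[i][t] for t in range(length))
                    for i in range(n)]
        w = project_to_simplex([wi - step * g for wi, g in zip(w, gradient)])
    return w
```

A dedicated QP solver handles the remaining terms and general linear constraints in one formulation and converges far faster on large problems, which is the motivation the text gives for OSQP.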

In particular, because the indicators related to macroscopic similarity may be updated at a relatively slow cycle in this embodiment, the proposed modules can run on the collected time series data in effectively real time.

Hereinafter, a specific hardware implementation of the computing device 300 that performs the method for simulating time series data according to an embodiment of the present invention is described.

Referring to FIG. 11, in some embodiments of the present invention the computing device 300 may be implemented as a server. One or more of the modules constituting the computing device 300 are implemented on a general-purpose computing processor, and the computing device 300 may therefore include a processor 308, an input/output device 302, a memory 340, an interface 306, and a bus 314. The processor 308, the input/output device 302, the memory 340, and/or the interface 306 may be coupled to one another through the bus 314. The bus 314 corresponds to a path through which data moves.

Specifically, the processor 308 may include at least one of a central processing unit (CPU), a micro processor unit (MPU), a micro controller unit (MCU), a graphics processing unit (GPU), a microprocessor, a digital signal processor, a microcontroller, an application processor (AP), and logic elements capable of performing similar functions.

The input/output device 302 may include at least one of a keypad, a keyboard, a touch screen, and a display device. The memory 340 may store data and/or programs.

The interface 306 may transmit data to a communication network or receive data from the communication network. The interface 306 may be wired or wireless. For example, the interface 306 may include an antenna or a wired or wireless transceiver. The memory 340 may further include high-speed DRAM and/or SRAM as a volatile working memory that improves the operation of the processor 308 while protecting personal information.

In addition, the memory 340 stores the programming and data structures that provide the functions of some or all of the modules described herein. For example, it may include logic for performing selected aspects of the method for simulating time series data described above.

A program or application, stored in the memory 340 as a set of instructions comprising each step of the method for simulating time series data described above, is loaded so that the processor can perform each step.

Furthermore, the various embodiments described herein may be implemented in a recording medium readable by a computer or a similar device, using, for example, software, hardware, or a combination thereof.

In a hardware implementation, the embodiments described herein may be implemented using at least one of application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, and other electrical units for performing functions. In some cases, the embodiments described in this specification may be implemented by the control module itself.

In a software implementation, embodiments such as the procedures and functions described in this specification may be implemented as separate software modules. Each of the software modules may perform one or more of the functions and operations described in this specification. The software code may be implemented as a software application written in an appropriate programming language. The software code may be stored in the memory module and executed by the control module.

The above description merely illustrates the technical idea of the present invention, and a person of ordinary skill in the art to which the present invention pertains will be able to make various modifications, changes, and substitutions without departing from the essential characteristics of the present invention.

Accordingly, the embodiments disclosed in the present invention and the accompanying drawings are intended to explain, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments and drawings. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present invention.

Claims

A method for simulating time series data performed on a computing device, the method comprising:
receiving target time series data to be simulated;
calculating a proportion of each base time series that minimizes the error between the target time series data and simulated time series data determined by the proportions of a plurality of base time series; and
outputting the proportions calculated to have the minimized error,
wherein the calculating of the proportions comprises calculating the proportions that minimize
the distance, within a common embedding space, between a first embedding vector of the target time series data and a second embedding vector of the simulated time series data for each first section, extracted from a trained time series model, and
the difference between topological feature vectors of the target time series data and the simulated time series data for each second section, extracted from a phase measurement module,
wherein the time series model consists of an encoder that extracts the features of each mutually overlapping first section of time series data as an embedding vector and a decoder that generates synthetic time series data from the extracted embedding vector, and is trained to minimize the error between the time series data and the synthetic time series data, and
wherein the phase measurement module extracts the topological feature vector of the second section, which is larger than the first section.

The method of claim 1,
wherein the target time series data is determined according to an arbitrary ratio among first base time series constituting a first base time series set, and
the simulated time series data is determined according to the proportions among second base time series constituting a second base time series set.

The method of claim 2,
wherein the target time series data is a first investment fund composed of the first base time series of the products in a first investment portfolio defined as the first base time series set, and
the simulated time series data is a second investment fund, for the first investment fund in which investment is not possible, determined according to the proportions among second base time series of arbitrary products.

The method of claim 3,
wherein the composition and the form of optimization of the products in a second investment portfolio, composed of a combination of the arbitrary products, are variable according to the purpose and preference of the user, and
the calculating of the proportions recalculates the proportions of the changed products when necessary.

The method of claim 2,
wherein the calculating of the proportions comprises
calculating the proportions that minimize the difference between the unit rates of change of the simulated time series data and the target time series data.

(Deleted)

The method of claim 2,
wherein the calculating of the proportions comprises:
expanding the dimension of the target time series data and the simulated time series data;
extracting topological feature vectors that each define periodic characteristics within the expanded dimensional space; and
calculating the proportions that minimize the difference between the extracted topological feature vectors.

The method of claim 2,
wherein the calculating of the proportions comprises:
calculating statistical values that describe characteristics of the target time series data and the simulated time series data; and
calculating the proportions that minimize the difference between the statistical values.

The method of claim 2,
wherein the calculating of the proportions comprises
calculating the proportions that minimize a weighted sum, using predetermined tuning parameters, of the terms to be minimized within an objective function.

The method of claim 9,
wherein the calculating of the proportions comprises
calculating the proportions that minimize each of the terms to be minimized independently, and
is performed repeatedly until the weighted sum satisfies a termination condition.

A computing device comprising:
a processor; and
a memory in communication with the processor,
wherein the memory stores instructions that cause the processor to perform operations, the operations comprising:
receiving target time series data to be simulated;
calculating a proportion of each base time series that minimizes the error between the target time series data and simulated time series data determined by the proportions of a plurality of base time series; and
outputting the proportions calculated to have the minimized error,
wherein the calculating of the proportions comprises calculating the proportions that minimize
the distance, within a common embedding space, between a first embedding vector of the target time series data and a second embedding vector of the simulated time series data for each first section, extracted from a trained time series model, and
the difference between topological feature vectors of the target time series data and the simulated time series data for each second section, extracted from a phase measurement module,
wherein the time series model consists of an encoder that extracts the features of each mutually overlapping first section of time series data as an embedding vector and a decoder that generates synthetic time series data from the extracted embedding vector, and is trained to minimize the error between the time series data and the synthetic time series data, and
wherein the phase measurement module extracts the topological feature vector of the second section, which is larger than the first section.

The computing device of claim 11,
wherein the target time series data is determined according to an arbitrary ratio among first base time series constituting a first base time series set, and
the simulated time series data is determined according to the proportions among second base time series constituting a second base time series set.

The computing device of claim 12,
wherein the calculating of the proportions comprises
calculating the proportions that minimize the difference between the unit rates of change of the simulated time series data and the target time series data.

(Deleted)

The computing device of claim 12,
wherein the calculating of the proportions comprises:
expanding the dimension of the target time series data and the simulated time series data;
extracting topological feature vectors that each define periodic characteristics within the expanded dimensional space; and
calculating the proportions that minimize the difference between the extracted topological feature vectors.

The computing device of claim 12,
wherein the calculating of the proportions comprises:
calculating statistical values of the target time series data and the simulated time series data; and
calculating the proportions that minimize the difference between the statistical values.

The computing device of claim 12,
wherein the calculating of the proportions comprises
calculating the proportions that minimize a weighted sum, using predetermined tuning parameters, of the terms to be minimized within an objective function.

The computing device of claim 17,
wherein the calculating of the proportions comprises
calculating the proportions that minimize each of the terms to be minimized independently, and
is performed repeatedly until the weighted sum satisfies a termination condition.

A program stored in a computer-readable recording medium for executing the method for simulating time series data according to any one of claims 1 to 5 and 7 to 10.