KR102721179B1

KR102721179B1 - Method and system for standardizing text data on food material

Info

Publication number: KR102721179B1
Application number: KR1020240001724A
Authority: KR
Inventors: 박승리
Original assignee: 주식회사 마켓보로
Priority date: 2024-01-04
Filing date: 2024-01-04
Publication date: 2024-10-24
Anticipated expiration: 2044-01-04

Abstract

According to one aspect of the present invention, there is provided a method for standardizing text data regarding a food ingredient, the method comprising: tokenizing the text data regarding the food ingredient into a plurality of tokens; determining a weight for each of the plurality of tokens using an artificial intelligence-based token weight determination model, and calculating, based on the weights, a food ingredient name similarity, which is the similarity between each of a plurality of standard food ingredient names included in standard food ingredient data and the food ingredient name included in the text data; and standardizing the text data regarding the food ingredient by determining, based on the food ingredient name similarity, a target food ingredient name corresponding to the text data from among the plurality of standard food ingredient names.

Description

METHOD AND SYSTEM FOR STANDARDIZING TEXT DATA ON FOOD MATERIAL

The present invention relates to a method and system for standardizing text data regarding food ingredients.

Statistically analyzing the distribution information (e.g., purchase history information) of stores that handle food ingredients, such as stores that prepare and sell food or provide services related to food sales, can yield a variety of useful information, including information on the supply and demand of various food ingredients and on market conditions such as prices, consumer sentiment, and trends. To make such statistical analysis reliable, it is essential first to collect data on food ingredients accurately.

However, because no naming convention for food ingredients is set by policy, the same food ingredient is often distributed under several different names. In practice, it is very common that, when a food ingredient named A is purchased, one buyer orders 50 units under the name A' while another buyer orders 50 units under the name A''. In such a case, the orders may be tallied as 50 units of A' and 50 units of A'' rather than as 100 units of A, which makes accurate statistical analysis difficult. This problem could be handled manually, but doing so requires a continuing investment of time and money and is therefore not a practical solution.
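The tallying problem described above can be illustrated with a minimal Python sketch. The order records and the variant-to-standard mapping below are hypothetical values chosen to mirror the A, A', A'' example; they are not data from the patent.

```python
from collections import Counter

# Hypothetical order records: (name as entered by the buyer, quantity).
orders = [("A'", 50), ("A''", 50)]

# Hypothetical mapping from each name variant to its standard name.
standard_name = {"A'": "A", "A''": "A"}

raw_totals = Counter()
standardized_totals = Counter()
for name, quantity in orders:
    # Without standardization, each variant is counted separately.
    raw_totals[name] += quantity
    # With standardization, variants are merged under the standard name.
    standardized_totals[standard_name.get(name, name)] += quantity

print(dict(raw_totals))           # {"A'": 50, "A''": 50}
print(dict(standardized_totals))  # {'A': 100}
```

Only the standardized tally reports the 100 units of A that were actually ordered.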

Accordingly, the present inventor(s) propose a technique that standardizes text data regarding food ingredients so that statistical analysis of food ingredient distribution can be performed accurately.

Korean Registered Patent Publication No. 10-1975911 (April 30, 2019)

An object of the present invention is to solve all of the problems of the prior art described above.

Another object of the present invention is to standardize text data regarding food ingredients by tokenizing the text data into a plurality of tokens, determining a weight for each of the plurality of tokens using an artificial intelligence-based token weight determination model, calculating, based on the weights, a food ingredient name similarity, which is the similarity between each of a plurality of standard food ingredient names included in standard food ingredient data and the food ingredient name included in the text data, and determining, based on the food ingredient name similarity, a target food ingredient name corresponding to the text data from among the plurality of standard food ingredient names.

Yet another object of the present invention is to enable accurate statistical analysis of the distribution of food ingredients.

A representative configuration of the present invention for achieving the above objects is as follows.

According to one aspect of the present invention, there is provided a method comprising: tokenizing text data regarding a food ingredient into a plurality of tokens; determining a weight for each of the plurality of tokens using an artificial intelligence-based token weight determination model, and calculating, based on the weights, a food ingredient name similarity, which is the similarity between each of a plurality of standard food ingredient names included in standard food ingredient data and the food ingredient name included in the text data; and standardizing the text data regarding the food ingredient by determining, based on the food ingredient name similarity, a target food ingredient name corresponding to the text data from among the plurality of standard food ingredient names.

According to another aspect of the present invention, there is provided a system comprising: a data tokenization unit that tokenizes text data regarding a food ingredient into a plurality of tokens; a similarity calculation unit that determines a weight for each of the plurality of tokens using an artificial intelligence-based token weight determination model and calculates, based on the weights, a food ingredient name similarity, which is the similarity between each of a plurality of standard food ingredient names included in standard food ingredient data and the food ingredient name included in the text data; and a data standardization unit that standardizes the text data regarding the food ingredient by determining, based on the food ingredient name similarity, a target food ingredient name corresponding to the text data from among the plurality of standard food ingredient names.

In addition, other methods and other systems for implementing the present invention, and a non-transitory computer-readable recording medium storing a computer program for executing the method, are further provided.

According to the present invention, text data regarding food ingredients can be standardized by tokenizing the text data into a plurality of tokens, determining a weight for each of the plurality of tokens using an artificial intelligence-based token weight determination model, calculating, based on the weights, a food ingredient name similarity, which is the similarity between each of a plurality of standard food ingredient names included in standard food ingredient data and the food ingredient name included in the text data, and determining, based on the food ingredient name similarity, a target food ingredient name corresponding to the text data from among the plurality of standard food ingredient names.

In addition, according to the present invention, statistical analysis of the distribution of food ingredients can be performed accurately.

FIG. 1 is a diagram schematically illustrating the configuration of an entire system for standardizing text data regarding food ingredients according to one embodiment of the present invention.
FIG. 2 is a diagram illustrating in detail the internal configuration of a standardization system according to one embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of a process of standardizing text data regarding food ingredients according to one embodiment of the present invention.

The detailed description of the present invention set forth below refers to the accompanying drawings, which show, by way of example, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the present invention, although different from one another, are not necessarily mutually exclusive. For example, specific shapes, structures, and characteristics described herein may be changed from one embodiment to another without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each embodiment may be changed without departing from the spirit and scope of the invention. Accordingly, the detailed description below is not to be taken in a limiting sense, and the scope of the present invention should be understood to encompass the scope of the claims and all equivalents thereof. Like reference numerals in the drawings denote the same or similar components throughout the several aspects.

Hereinafter, several preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person having ordinary skill in the art to which the present invention pertains can easily practice the invention.

In this specification, the term "food ingredient" (also rendered "food material" or "foodstuff") should be understood broadly, covering not only the materials used to make food but also the food made from those materials.

Configuration of the entire system

FIG. 1 is a diagram schematically illustrating the configuration of an entire system for standardizing text data regarding food ingredients according to one embodiment of the present invention.

As illustrated in FIG. 1, the entire system according to one embodiment of the present invention may include a communication network (100), a standardization system (200), and a device (300).

First, the communication network (100) according to one embodiment of the present invention may be configured regardless of communication mode, whether wired or wireless, and may consist of various networks such as a local area network (LAN), a metropolitan area network (MAN), and a wide area network (WAN). Preferably, the communication network (100) referred to in this specification may be the well-known Internet or the World Wide Web (WWW). However, the communication network (100) is not limited thereto and may include, at least in part, a known wired or wireless data communication network, a known telephone network, or a known wired or wireless television network.

For example, the communication network (100) may be a wireless data communication network that implements, at least in part, a conventional communication method such as WiFi, WiFi-Direct, Long Term Evolution (LTE), 5G, Bluetooth (including Bluetooth Low Energy (BLE)), infrared communication, or ultrasonic communication. As another example, the communication network (100) may be an optical communication network that implements, at least in part, a conventional communication method such as LiFi (Light Fidelity).

Next, the standardization system (200) according to one embodiment of the present invention may standardize text data regarding food ingredients by tokenizing the text data into a plurality of tokens, determining a weight for each of the plurality of tokens using an artificial intelligence-based token weight determination model, calculating, based on the weights, a food ingredient name similarity, which is the similarity between each of a plurality of standard food ingredient names included in standard food ingredient data and the food ingredient name included in the text data, and determining, based on the food ingredient name similarity, a target food ingredient name corresponding to the text data from among the plurality of standard food ingredient names.
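The overall flow (tokenize, weight, score against each standard name, pick the best match) might be sketched as follows. This is only an illustration under stated assumptions: the text here does not give a similarity formula, so a weighted token-overlap ratio is used as a stand-in; the fixed `weights` dictionary stands in for the output of the AI-based token weight determination model; and the function names, English placeholder tokens, and standard names are all hypothetical.

```python
def tokenize(text):
    # Simplest possible tokenization: split on whitespace.
    return text.split()

def name_similarity(input_tokens, standard_tokens, weights, default_weight=1.0):
    # Assumed formula: weight of shared tokens divided by weight of all tokens.
    a, b = set(input_tokens), set(standard_tokens)
    weight_of = lambda tok: weights.get(tok, default_weight)
    total = sum(weight_of(t) for t in a | b)
    if total == 0:
        return 0.0
    return sum(weight_of(t) for t in a & b) / total

def standardize(text, standard_names, weights):
    # Score every standard name and return the best (name, score) pair.
    # Requires a non-empty list of standard names.
    tokens = tokenize(text)
    scored = [(name_similarity(tokens, tokenize(n), weights), n)
              for n in standard_names]
    best_score, best_name = max(scored)
    return best_name, best_score

# Hypothetical weights (stand-in for the model's output) and standard names.
weights = {"Sase": 2.0, "chicken_tender": 3.0, "1kg": 0.5}
standard_names = ["Sase chicken_tender", "Sajo tuna_can"]

target, score = standardize("Sase chicken_tender 1kg", standard_names, weights)
print(target, round(score, 3))
```

Because the unit token "1kg" carries a low weight in this example, its absence from the standard name reduces the score only slightly, so the input still maps to "Sase chicken_tender".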

The configuration and functions of the standardization system (200) according to the present invention are described in detail below.

Next, the device (300) according to one embodiment of the present invention is a digital device capable of connecting to and communicating with the standardization system (200). Any digital device that has memory and a microprocessor providing computing capability, such as a smartphone, tablet, smart watch, smart band, smart glasses, desktop computer, notebook computer, workstation, PDA, web pad, or mobile phone, may be adopted as the device (300) according to the present invention.

In particular, the device (300) may include an application (not shown) that allows a user to receive the service according to the present invention from the standardization system (200). Such an application may be downloaded from the standardization system (200) or from an external application distribution server (not shown). The nature of this application may be broadly similar to that of the data tokenization unit (210), similarity calculation unit (220), data standardization unit (230), communication unit (240), and control unit (250) of the standardization system (200) described below. At least part of the application may be replaced, as necessary, with a hardware or firmware device capable of performing substantially the same or an equivalent function.

Configuration of the standardization system

The following describes the internal configuration of the standardization system (200), which performs the functions central to implementing the present invention, and the function of each of its components.

FIG. 2 is a diagram illustrating in detail the internal configuration of the standardization system (200) according to one embodiment of the present invention.

As illustrated in FIG. 2, the standardization system (200) according to one embodiment of the present invention may include a data tokenization unit (210), a similarity calculation unit (220), a data standardization unit (230), a communication unit (240), and a control unit (250). According to one embodiment of the present invention, at least some of these units may be program modules that communicate with an external system (not shown). Such program modules may be included in the standardization system (200) in the form of an operating system, application program modules, or other program modules, and may be physically stored in various known storage devices. They may also be stored in a remote storage device capable of communicating with the standardization system (200). Such program modules include, but are not limited to, routines, subroutines, programs, objects, components, and data structures that perform the particular tasks described below or implement particular abstract data types according to the present invention.

Although the standardization system (200) has been described as above, this description is illustrative, and it will be apparent to those skilled in the art that at least some of the components or functions of the standardization system (200) may, as needed, be implemented within the device (300) or a server (not shown), or included in an external system (not shown).

First, the data tokenization unit (210) according to one embodiment of the present invention may tokenize text data regarding food ingredients into a plurality of tokens.

Specifically, the data tokenization unit (210) according to one embodiment of the present invention may acquire text data regarding food ingredients for tokenization. According to one embodiment of the present invention, text data regarding food ingredients may refer to various text data that can be collected in the course of producing, processing, or distributing food ingredients.

For example, according to one embodiment of the present invention, such text data may include the food ingredient name, unit information, and other data that a buyer enters or selects when ordering a food ingredient during distribution. Here, the food ingredient name may include information such as the food ingredient category, brand, and distributor, and the unit information may include information on the amount of the food ingredient (e.g., weight, volume, (unit) quantity, or packaging unit).

According to one embodiment of the present invention, text data regarding food ingredients may be acquired in text form from the outset, or may be acquired by converting data in another format into text using a technology such as speech-to-text (STT). However, the types of text data regarding food ingredients and the ways of acquiring them are not limited to those described above and may be varied within the scope in which the objects of the present invention can be achieved.

FIG. 3 is a diagram illustrating an example of a process of standardizing text data regarding food ingredients according to one embodiment of the present invention.

Referring to FIG. 3, examples of text data regarding food ingredients that may be acquired according to one embodiment of the present invention, such as "사세 치킨텐더 1kg" ("Sase Chicken Tender 1kg") (410), are shown in the column named input_name. As shown, the text data includes a food ingredient name (e.g., "사세 치킨텐더" (Sase Chicken Tender), "한일 본고장 건더기스프" (Hanil Bongojang Geondeogi Soup), "사조참치캔" (Sajo Tuna Can), "사조살코키참치" (Sajo Salkoki Tuna), "환타(오렌지)" (Fanta (Orange)), and "환타(파인)" (Fanta (Pine))). In some cases, the text data may further include unit information (e.g., "1kg", "1.88kg", "1.88kg 6개" (1.88 kg, 6 pieces), and "355ml x 24캔" (355 ml x 24 cans)).

Continuing, the data tokenization unit (210) according to one embodiment of the present invention may tokenize the text data acquired as above into a plurality of tokens. According to one embodiment of the present invention, tokenization may refer to the process of dividing text data regarding a food ingredient into a plurality of pieces in order to calculate the food ingredient similarity and/or unit similarity described later, and each resulting piece may correspond to a token. The data tokenization unit (210) may perform tokenization based on patterns in the text data, for example, that tokens carrying unit information are mostly located toward the end of the text or near a comma or multiplication mark. The data tokenization unit (210) may also process the text data using various algorithms from natural language processing, text mining, and related fields.
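The positional pattern mentioned above (unit tokens tend to sit at the end of the text or around a comma or multiplication mark) can be sketched with a small regular-expression tokenizer. The unit vocabulary, the regular expressions, and the function name `split_name_and_units` are assumptions made for illustration; the second input string is assembled from separate name and unit examples and is not taken verbatim from FIG. 3.

```python
import re

# Assumed unit pattern: a number followed by a unit, e.g. "1kg", "355ml", "6개", "24캔".
UNIT_RE = re.compile(r"^\d+(?:\.\d+)?(?:kg|g|ml|l|개|캔)$", re.IGNORECASE)

def split_name_and_units(text):
    # Treat commas and multiplication marks as separators. A bare "x" or "*"
    # counts as a multiplication mark only when surrounded by whitespace.
    cleaned = re.sub(r"[,×]|(?<=\s)[xX*](?=\s)", " ", text)
    tokens = cleaned.split()
    # Walk backwards from the end: trailing tokens that look like units
    # are unit tokens, everything before them belongs to the name.
    split_at = len(tokens)
    while split_at > 0 and UNIT_RE.match(tokens[split_at - 1]):
        split_at -= 1
    return tokens[:split_at], tokens[split_at:]

print(split_name_and_units("사세 치킨텐더 1kg"))
# (['사세', '치킨텐더'], ['1kg'])
print(split_name_and_units("환타(오렌지) 355ml x 24캔"))
# (['환타(오렌지)'], ['355ml', '24캔'])
```

A rule like this handles only the simplest cases; the text notes that natural language processing or text mining algorithms may be used as well.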

For example, the data tokenization unit (210) according to one embodiment of the present invention may tokenize text data regarding a food ingredient into two or more tokens, each of which may correspond to at least one of the following types: food ingredient category, brand, distributor, weight, volume, quantity, and packaging unit. It should be understood that tokenization may yield two or more tokens of the same type; for example, there may be two or more category tokens or two or more brand tokens. When there are two or more tokens of the same type, they may stand in a hierarchical (broader/narrower) relationship or in a parallel relationship.

For example, referring to FIG. 3, the text data "사세 치킨텐더 1kg" ("Sase Chicken Tender 1kg") (410) may be tokenized into three tokens: "사세" (Sase), "치킨텐더" (Chicken Tender), and "1kg".
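One way to represent typed tokens such as these is sketched below. The type names follow the list given in the text (category, brand, distributor, weight, volume, quantity, packaging unit), but the brand lexicon, the regular expressions, the fallback to "category", and the treatment of "사세" as a brand are all assumptions for illustration only.

```python
import re
from dataclasses import dataclass
from enum import Enum

class TokenKind(Enum):
    CATEGORY = "category"
    BRAND = "brand"
    DISTRIBUTOR = "distributor"
    WEIGHT = "weight"
    VOLUME = "volume"
    QUANTITY = "quantity"
    PACKAGING_UNIT = "packaging_unit"

@dataclass(frozen=True)
class Token:
    text: str
    kind: TokenKind

# Hypothetical lexicon; a real system would need a much larger one.
BRANDS = {"사세"}

def classify(piece):
    lowered = piece.lower()
    if re.fullmatch(r"\d+(?:\.\d+)?(?:kg|g)", lowered):
        return Token(piece, TokenKind.WEIGHT)
    if re.fullmatch(r"\d+(?:\.\d+)?(?:ml|l)", lowered):
        return Token(piece, TokenKind.VOLUME)
    if re.fullmatch(r"\d+개", lowered):
        return Token(piece, TokenKind.QUANTITY)
    if piece in BRANDS:
        return Token(piece, TokenKind.BRAND)
    # Assumed fallback: anything unrecognized is treated as a category word.
    return Token(piece, TokenKind.CATEGORY)

tokens = [classify(p) for p in "사세 치킨텐더 1kg".split()]
for token in tokens:
    print(token.text, token.kind.value)
```

Since the result is a list, nothing prevents it from holding several tokens of the same kind, which matches the statement that two or more category or brand tokens may occur.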

However, tokenization according to one embodiment of the present invention is not limited to what is described above and may be varied within the scope in which the objects of the present invention can be achieved.

Next, the similarity calculation unit (220) according to one embodiment of the present invention may determine a weight for each of the plurality of tokens using an artificial intelligence-based token weight determination model.

Specifically, once the data tokenization unit (210) has tokenized the text data regarding a food ingredient into a plurality of tokens, the similarity calculation unit (220) may use the artificial intelligence-based token weight determination model to assign each token a different weight according to its importance. These weights may be used in calculating the food ingredient name similarity, as described in detail later.

More specifically, according to one embodiment of the present invention, the token weight determination model may be trained to assign a higher weight to a token the more frequently that token appears in the answer data set used to train the model (i.e., a collection of data on the target food ingredient names to which food ingredient names in the text data should correspond; details are given later). For example, the token weight determination model may be trained based on an information gain score so that it determines weights in this manner.
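The frequency-based tendency described above (more frequent in the answer data set, higher weight) can be illustrated with a deliberately simple stand-in. This sketch uses normalized token counts rather than an information gain score or a trained model, and the answer data set and function name are hypothetical.

```python
from collections import Counter

def frequency_weights(answer_names):
    # Count how often each whitespace-separated token appears in the answer set.
    counts = Counter(tok for name in answer_names for tok in name.split())
    if not counts:
        return {}
    # Normalize so that the most frequent token gets weight 1.0.
    top = max(counts.values())
    return {tok: count / top for tok, count in counts.items()}

# Hypothetical answer data set of target food ingredient names.
answer_names = ["Sajo tuna_can", "Sajo salkoki_tuna", "Sase chicken_tender"]

weights = frequency_weights(answer_names)
print(weights["Sajo"], weights["Sase"])  # 1.0 0.5
```

Here "Sajo" appears twice and every other token once, so "Sajo" receives the highest weight.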

However, determining each token's weight solely from its frequency in the answer data set may produce results that do not adequately reflect the particularities of the food ingredient domain. For example, if the token "촉촉한" ("moist") appears frequently in the answer data set and weights depend only on frequency, that token will receive a high weight. In reality, because the word appears in a great many food ingredient names without carrying any distinguishing meaning, it may do little to help identify a food ingredient and may even hinder identification. Of course, the importance of a token cannot be fixed uniformly; a token that is generally of low importance may still be treated as highly important when its relationship to certain other tokens (words) is taken into account.

In view of these circumstances, the token weight determination model according to one embodiment of the present invention may be trained based on ontology data regarding food ingredients. According to one embodiment of the present invention, the ontology data regarding food ingredients may be data that defines the relationships between words relating to food ingredient category, brand, distributor, weight, volume, quantity, packaging unit, and the like that may appear in text data regarding food ingredients, the importance of each word according to those relationships, and the importance of each word according to the meaning it carries in the food ingredient domain.

For example, the ontology data according to one embodiment of the present invention may define a specific word to be treated as highly important on the ground that, in the food ingredient domain, the word is not interpreted according to its dictionary meaning but denotes a specific brand. As another example, the ontology data according to one embodiment of the present invention may define a specific word to be treated as highly important only when it is used together with another specific word.
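The two ontology examples above (a word that denotes a brand, and a word that matters only alongside another word) could be represented as in the following sketch. The `OntologyEntry` structure, the field names, the sample words, and the numeric importance values are all hypothetical. The lookup function is only a stand-in to show what the data expresses; in the described embodiment the ontology data is used to train the token weight determination model rather than being consulted directly.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyEntry:
    role: str          # e.g. "brand", "category", "descriptor" (assumed labels)
    importance: float  # importance of the word on its own
    # partner word -> importance when both words appear together
    boosted_with: dict = field(default_factory=dict)


# Hypothetical ontology entries for illustration.
ONTOLOGY = {
    "sase": OntologyEntry(role="brand", importance=0.9),
    "moist": OntologyEntry(
        role="descriptor", importance=0.1, boosted_with={"castella": 0.7}
    ),
}


def token_importance(token, tokens, ontology=ONTOLOGY, default=0.5):
    """Return the importance of `token` given the other tokens in the name."""
    entry = ontology.get(token)
    if entry is None:
        return default
    partners = [imp for word, imp in entry.boosted_with.items() if word in tokens]
    return max([entry.importance, *partners])
```

With these entries, "moist" is scored low next to "chicken" but higher next to "castella", while "sase" is scored high wherever it appears.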

According to one embodiment of the present invention, training the token weight determination model on ontology data regarding food ingredients in this way allows each token to be given a more accurate weight that reflects its importance in the food ingredient domain.

In addition, according to one embodiment of the present invention, the training data of the token weight determination model may be augmented by hard negative sampling based on the ontology data regarding food ingredients.

Specifically, according to one embodiment of the present invention, after the token weight determination model has been trained on the ontology data regarding food ingredients as described above, the training data of the model may be augmented with data for which the trained model made an incorrect decision, or is highly likely to make an incorrect decision, when determining a target food ingredient name. The token weight determination model may then be retrained on the augmented training data to improve its performance.
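One way to collect such data is sketched below: for each labeled example, the highest-scoring wrong candidate is kept when it beats the correct name (an incorrect decision) or comes within a margin of it (a likely incorrect decision). The function name, the `score` callable, and the `margin` parameter are assumptions. The description does not state how the ontology data guides the sampling, so that part is left out of the sketch.

```python
def mine_hard_negatives(examples, score, candidates, margin=0.1):
    """Collect (text, correct_name, confusing_name) triples.

    examples:   list of (text, correct_name) pairs from the ground-truth set
    score:      callable (text, name) -> float, the trained model's similarity
    candidates: list of standard names to compare against
    margin:     how close a wrong candidate must be to count as "hard"
    """
    mined = []
    for text, correct in examples:
        wrong = [c for c in candidates if c != correct]
        if not wrong:
            continue
        best_wrong = max(wrong, key=lambda c: score(text, c))
        correct_score = score(text, correct)
        # Keep the example if the model is wrong or nearly wrong.
        if score(text, best_wrong) >= correct_score - margin:
            mined.append((text, correct, best_wrong))
    return mined
```

The mined triples would then be added to the training data before the model is retrained.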

The similarity calculation unit (220) according to one embodiment of the present invention may calculate, based on the weight of each token determined as described above, a food ingredient name similarity, which is the similarity between each of a plurality of standard food ingredient names included in standard food ingredient data and the food ingredient name included in the text data regarding the food ingredient.

Specifically, according to one embodiment of the present invention, the standard food ingredient data may refer to the set of data to which text data regarding food ingredients should correspond (or be mapped). Referring to FIG. 3, the standard food ingredient data may include standard food ingredient names (420) and the unit information (430) corresponding to them. For example, the text data "Sase Chicken Tender 1kg" (410) may correspond to "Tender Stick (Sase, Chicken) 1kg", "ea", and "1000g" in the standard food ingredient data. As shown in FIG. 3, it should be understood that a standard food ingredient name (420) may include not only information such as the food ingredient category, brand, and distributor but also, in some cases, information on the weight, volume, quantity, packaging unit, and the like of the food ingredient.
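The FIG. 3 example can be pictured as a simple record, as in the sketch below. The class name `StandardIngredient` and its field names are hypothetical; only the example values come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardIngredient:
    name: str     # standard food ingredient name (420)
    units: tuple  # unit information (430) corresponding to the name


# Single-entry table built from the example in the description.
STANDARD_DATA = [
    StandardIngredient(
        name="Tender Stick (Sase, Chicken) 1kg",
        units=("ea", "1000g"),
    ),
]

# Text data (410) that should be mapped onto an entry of STANDARD_DATA.
raw_text = "Sase Chicken Tender 1kg"
```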

Continuing, the similarity calculation unit (220) according to one embodiment of the present invention may calculate the food ingredient name similarity based on the number or ratio of tokens that overlap between each of the plurality of standard food ingredient names (420) included in the standard food ingredient data and the food ingredient name (410) included in the text data regarding the food ingredient.

According to one embodiment of the present invention, the basic principle is that the larger the number or the higher the ratio of overlapping tokens, the higher the calculated food ingredient name similarity; in calculating the similarity, however, the importance of the overlapping tokens, that is, their weights, may also be taken into account. Accordingly, even if fewer tokens overlap between the food ingredient name (410) and a first standard food ingredient name than between the food ingredient name (410) and a second standard food ingredient name (or the ratio is lower), the food ingredient name similarity with the first standard food ingredient name may be calculated to be higher than that with the second standard food ingredient name if the token(s) overlapping with the first standard food ingredient name carry high weights.
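A weighted overlap of this kind can be sketched as a weighted Jaccard ratio. This particular formula, the function name `name_similarity`, and the sample weights are assumptions chosen for illustration; the description does not fix a formula.

```python
def name_similarity(query_tokens, standard_tokens, weights, default=1.0):
    """Weighted share of overlapping tokens between two token lists."""
    q, s = set(query_tokens), set(standard_tokens)
    total = sum(weights.get(t, default) for t in q | s)
    if total == 0:
        return 0.0
    shared = sum(weights.get(t, default) for t in q & s)
    return shared / total


# Hypothetical weights: brand tokens weigh more than generic tokens.
weights = {"sase": 5.0, "hana": 5.0, "chicken": 1.0, "tender": 1.0, "nugget": 1.0}

query = ["sase", "chicken", "tender"]
first = ["sase", "nugget"]             # one overlapping token, high weight
second = ["chicken", "tender", "hana"]  # two overlapping tokens, low weight

sim_first = name_similarity(query, first, weights)
sim_second = name_similarity(query, second, weights)
```

Here the first candidate shares only one token with the query yet scores higher than the second, because the shared token carries a high weight.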

The similarity calculation unit (220) according to one embodiment of the present invention may calculate the food ingredient name similarity such that, among the plurality of tokens, a noisy token has less influence on the food ingredient name similarity than a non-noisy token does.

Specifically, according to one embodiment of the present invention, the plurality of tokens generated by the data tokenization unit (210) may, in some cases, include noisy tokens. According to one embodiment of the present invention, a noisy token may be a token that makes the result inaccurate when the food ingredient name similarity is calculated. The influence of such a noisy token on the food ingredient name similarity may be kept small by assigning it a low weight in the token weight determination process performed by the similarity calculation unit (220); alternatively, its influence may be kept small independently of the token weights, for example by filtering.
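The two options described above (lowering the weight of noisy tokens, or filtering them out regardless of weight) might look like the following sketch. The noise list, the function names, and the damping factor are hypothetical; the description does not say how noisy tokens are identified.

```python
# Hypothetical set of noisy tokens, for illustration only.
NOISE_TOKENS = {"moist", "special", "new"}


def drop_noise(tokens, noise=NOISE_TOKENS):
    """Option 1: filter noisy tokens out before any similarity is computed."""
    return [t for t in tokens if t.lower() not in noise]


def damp_noise_weights(weights, noise=NOISE_TOKENS, factor=0.1):
    """Option 2: keep noisy tokens but scale their weights down."""
    return {t: (w * factor if t in noise else w) for t, w in weights.items()}
```

Either result can then be passed to whatever similarity calculation is in use, so that noisy tokens contribute little or nothing to the score.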

Meanwhile, the similarity calculation unit (220) according to one embodiment of the present invention may calculate a unit similarity, which is the similarity between the unit information of the food ingredient included in the text data regarding the food ingredient and the unit information corresponding to each of the plurality of standard food ingredient names included in the standard food ingredient data.

Describing this in detail with reference to FIG. 3, the similarity calculation unit (220) according to one embodiment of the present invention may calculate the unit similarity between the unit information ("1kg") of the food ingredient included in the text data (410) regarding the food ingredient and the unit information (430) corresponding to each of the plurality of standard food ingredient names (420) included in the standard food ingredient data. Here, it should be understood that, according to one embodiment of the present invention, only part of the unit information (430) may be used in calculating the unit similarity. For example, for "Tender Stick (Sase, Chicken) 1kg", of the corresponding unit information "ea" and "1000g", only "1000g" may be used in calculating the unit similarity with the unit information ("1kg") of the food ingredient included in the text data (410) regarding the food ingredient.

More specifically, the similarity calculation unit (220) according to one embodiment of the present invention may calculate the above unit similarity based on the edit distance between the unit information of the food ingredient included in the text data regarding the food ingredient and the unit information corresponding to each of the plurality of standard food ingredient names included in the standard food ingredient data.

For example, the similarity calculation unit (220) according to one embodiment of the present invention may calculate the Damerau-Levenshtein distance (i.e., an edit distance) between the unit information of the food ingredient included in the text data regarding the food ingredient and the unit information corresponding to each of the plurality of standard food ingredient names included in the standard food ingredient data, and may calculate the unit similarity such that the shorter the distance (the smaller its value), the higher the unit similarity.
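A distance-to-similarity mapping of this kind is sketched below. The code implements the restricted variant of the Damerau-Levenshtein distance (optimal string alignment), and the normalization by the longer string length is an assumption; the description only requires that a shorter distance yield a higher similarity.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def unit_similarity(a, b):
    """Map the distance to [0, 1]; identical strings score 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest
```

For instance, `unit_similarity("1kg", "1kg")` is 1.0, while a swapped pair such as "1kg" and "1gk" is one edit apart. Whether unit strings are normalized before comparison is not stated in the description, so the sketch compares them as written.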

In addition, the similarity calculation unit (220) according to one embodiment of the present invention may calculate the unit similarity such that, among the plurality of tokens, a noisy token has less influence on the unit similarity than a non-noisy token does. This may be done in a manner similar to the handling of noisy and non-noisy tokens in calculating the food ingredient name similarity; since that handling has been described above, a redundant description is omitted.

Next, the data standardization unit (230) according to one embodiment of the present invention may standardize the text data regarding the food ingredient by determining, based on the food ingredient name similarity, a target food ingredient name corresponding to the text data from among the plurality of standard food ingredient names.

Specifically, the data standardization unit (230) according to one embodiment of the present invention may determine, as the target food ingredient name, the standard food ingredient name that has the highest food ingredient name similarity with the food ingredient name included in the text data regarding the food ingredient, from among the plurality of standard food ingredient names included in the standard food ingredient data.

In addition, the data standardization unit (230) according to one embodiment of the present invention may determine the target food ingredient name based not only on the food ingredient name similarity described above but also on the unit similarity, which is the similarity between the unit information of the food ingredient included in the text data regarding the food ingredient and the unit information corresponding to each of the plurality of standard food ingredient names included in the standard food ingredient data.

Specifically, according to one embodiment of the present invention, predetermined weights may be assigned to the food ingredient name similarity and the unit similarity, and the standard food ingredient name for which the combination of the food ingredient name similarity and the unit similarity, taking those weights into account, is highest among the plurality of standard food ingredient names included in the standard food ingredient data may be determined as the target food ingredient name. It should be understood that these predetermined weights are not fixed uniformly and may be set differently for each product category.
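The weighted combination and selection can be sketched as follows. The linear combination, the category names, and every numeric weight are assumptions made for illustration; the description states only that the two similarities are weighted and that the weights may differ by product category.

```python
# Hypothetical (name weight, unit weight) pairs per product category.
CATEGORY_WEIGHTS = {"meat": (0.8, 0.2), "beverage": (0.5, 0.5)}
DEFAULT_WEIGHTS = (0.6, 0.4)


def pick_target(candidates, category=None):
    """Return the standard name with the highest combined score.

    candidates: list of (standard_name, name_similarity, unit_similarity)
    """
    if not candidates:
        return None
    w_name, w_unit = CATEGORY_WEIGHTS.get(category, DEFAULT_WEIGHTS)
    best = max(candidates, key=lambda c: w_name * c[1] + w_unit * c[2])
    return best[0]


# Illustrative scores: A matches the name well, B matches the unit well.
candidates = [("A", 0.9, 0.2), ("B", 0.6, 0.9)]
target_for_meat = pick_target(candidates, category="meat")
target_default = pick_target(candidates)
```

With these sample values the chosen target changes with the category weights, which illustrates why per-category weights can matter.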

By tokenizing the text data regarding the food ingredient in this way, calculating the food ingredient name similarity and the unit similarity separately, and determining the target food ingredient name based on the result of combining them, a more accurate result can be obtained than when the text data regarding the food ingredient is used as a whole to calculate its similarity with the standard food ingredient data.

Next, the communication unit (240) according to one embodiment of the present invention may enable data to be transmitted to and received from the data tokenization unit (210), the similarity calculation unit (220), and the data standardization unit (230).

Finally, the control unit (250) according to one embodiment of the present invention may control the flow of data among the data tokenization unit (210), the similarity calculation unit (220), the data standardization unit (230), and the communication unit (240). That is, the control unit (250) according to one embodiment of the present invention may control the flow of data to and from the outside of the standardization system (200), or the flow of data between the components of the standardization system (200), so that the data tokenization unit (210), the similarity calculation unit (220), the data standardization unit (230), and the communication unit (240) each perform their own functions.

The embodiments of the present invention described above may be implemented in the form of program instructions that can be executed by various computer components and may be recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. A hardware device may be changed into one or more software modules in order to perform the processing according to the present invention, and vice versa.

Although the present invention has been described above with reference to specific matters such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and a person of ordinary skill in the art to which the present invention pertains may make various modifications and changes based on this description.

Therefore, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set forth below but also everything equivalent to, or equivalently modified from, those claims falls within the scope of the spirit of the present invention.

100: Communication network
200: Standardization system
210: Data tokenization unit
220: Similarity calculation unit
230: Data standardization unit
240: Communication unit
250: Control unit
300: Device

Claims

A method implemented in a standardization system for standardizing text data regarding food ingredients, the system including a data tokenization unit, a similarity calculation unit, and a data standardization unit, the method comprising:
a step in which the data tokenization unit tokenizes text data regarding a food ingredient into a plurality of tokens;
a step in which the similarity calculation unit determines a weight of each of the plurality of tokens using an artificial intelligence-based token weight determination model, and calculates, based on the weights, a food ingredient name similarity, which is the similarity between each of a plurality of standard food ingredient names included in standard food ingredient data and a food ingredient name included in the text data; and
a step in which the data standardization unit standardizes the text data regarding the food ingredient by determining, based on the food ingredient name similarity, a target food ingredient name corresponding to the text data from among the plurality of standard food ingredient names,
wherein the token weight determination model is trained based on ontology data regarding food ingredients,
wherein, in the calculating step, the similarity calculation unit further calculates a unit similarity, which is the similarity between unit information of the food ingredient included in the text data and unit information corresponding to each of the plurality of standard food ingredient names,
wherein, in the standardizing step, the data standardization unit determines the target food ingredient name further based on the unit similarity,
wherein, in the calculating step, the similarity calculation unit calculates the unit similarity based on an edit distance between the unit information of the food ingredient and the unit information corresponding to each of the plurality of standard food ingredient names, and
wherein, in the standardizing step, the data standardization unit determines the target food ingredient name based on weights assigned to each of the food ingredient name similarity and the unit similarity.

delete

The method of claim 1,
wherein the training data of the token weight determination model is augmented by hard negative sampling based on the ontology data.

The method of claim 1,
wherein, in the calculating step, the similarity calculation unit calculates the food ingredient name similarity such that, among the plurality of tokens, a noisy token has less influence on the food ingredient name similarity than a non-noisy token does.

delete

The method of claim 1,
wherein, in the calculating step, the similarity calculation unit calculates the unit similarity such that, among the plurality of tokens, a noisy token has less influence on the unit similarity than a non-noisy token does.

A non-transitory computer-readable recording medium recording a computer program for executing the method according to claim 1.

A system for standardizing text data regarding food ingredients, comprising:
a data tokenization unit that tokenizes text data regarding a food ingredient into a plurality of tokens;
a similarity calculation unit that determines a weight of each of the plurality of tokens using an artificial intelligence-based token weight determination model, and calculates, based on the weights, a food ingredient name similarity, which is the similarity between each of a plurality of standard food ingredient names included in standard food ingredient data and a food ingredient name included in the text data; and
a data standardization unit that standardizes the text data regarding the food ingredient by determining, based on the food ingredient name similarity, a target food ingredient name corresponding to the text data from among the plurality of standard food ingredient names,
wherein the token weight determination model is trained based on ontology data regarding food ingredients,
wherein the similarity calculation unit further calculates a unit similarity, which is the similarity between unit information of the food ingredient included in the text data and unit information corresponding to each of the plurality of standard food ingredient names,
wherein the data standardization unit determines the target food ingredient name further based on the unit similarity,
wherein the similarity calculation unit calculates the unit similarity based on an edit distance between the unit information of the food ingredient and the unit information corresponding to each of the plurality of standard food ingredient names, and
wherein the data standardization unit determines the target food ingredient name based on weights assigned to each of the food ingredient name similarity and the unit similarity.

delete

The system of claim 9,
wherein the training data of the token weight determination model is augmented by hard negative sampling based on the ontology data.

The system of claim 9,
wherein the similarity calculation unit calculates the food ingredient name similarity such that, among the plurality of tokens, a noisy token has less influence on the food ingredient name similarity than a non-noisy token does.

delete

The system of claim 9,
wherein the similarity calculation unit calculates the unit similarity such that, among the plurality of tokens, a noisy token has less influence on the unit similarity than a non-noisy token does.