KR20060100646A

KR20060100646A - Method for searching specific location of image and image search system

Info

Publication number: KR20060100646A
Application number: KR1020050022349A
Authority: KR
Inventors: 천세기; 윤덕호
Original assignee: 주식회사 코난테크놀로지
Priority date: 2005-03-17
Filing date: 2005-03-17
Publication date: 2006-09-21

Abstract

The present invention relates to a method and system that receive a search term from a user and locate the specific position in a video at which dialogue related to the search term is played. A method of searching a video according to one embodiment of the present invention comprises: dividing the video into a plurality of video stream data and associating a segment identifier with each of the divided video stream data; extracting audio data from the video stream data and correlating the extracted audio data with the related segment identifier; converting the extracted audio data, using a predetermined speech recognition technique, into symbol data in a form that a predetermined script alignment unit can recognize; identifying, in the script alignment unit, the text data within the script data corresponding to the video that contains the converted symbol data; and correlating the identified text data with the related segment identifier. According to the present invention, the user can search a video using arbitrary text, and the specific position of the video at which dialogue containing the search term is played is found and provided, so that video search is simpler and more convenient than with existing search methods that depend on playback time.

Video, video search, playback time, dialogue, Viterbi search

Description

METHOD AND SYSTEM FOR SEARCHING THE POSITION OF AN IMAGE THING

FIG. 1 is a diagram illustrating the network configuration of the video search system of the present invention.

FIG. 2 is a block diagram showing the internal configuration of a video search system according to a preferred embodiment of the present invention.

FIG. 3 is a diagram showing an example of searching for a specific position in a video using a text search term according to the present invention.

FIG. 4 is a flowchart showing in detail a video search method according to a preferred embodiment of the present invention.

FIG. 5 is a flowchart showing an example of a method of performing a video search according to another embodiment of the present invention.

FIG. 6 is an internal block diagram of a general-purpose computer device that may be employed to perform the video search method according to the present invention.

<Explanation of symbols for the main parts of the drawings>

200: video search system
210: identifier association unit
220: audio extraction unit
230: speech recognition unit
240: script alignment unit
250: interface unit
260: position search unit
270: search result providing unit

The present invention relates to a video search method and a video search system and, more particularly, to a method and system that receive a search term from a user and locate the specific position in a video at which dialogue related to the search term is played.

Video services such as movies, music, and broadcasting, which used to be delivered mainly offline, are now also offered online as communication networks such as the Internet have developed. In particular, video services delivered directly to the user's terminal over a network overcome constraints of time and place and can serve users in real time, so their use is gradually spreading across the entire multimedia field.

While playing back stored video, a user may want to find a specific point within it. However, because video data is stored in a complex form unlike plain text, it is difficult to build a database of the video data corresponding to such points, and search is correspondingly difficult.

Conventionally, to find a specific position in a video, the user has had to watch the playback screen while resorting to functions such as '2x playback' or 'fast forward'. That is, search in existing multimedia services depends entirely on playback time: an accurate position search is possible only if the user already knows exactly the elapsed playback time associated with the desired position. If the playback time is not known, the user must repeatedly use 'rewind' or 'fast forward', and even then the service provides only the approximate location of the relevant point, so the search takes a long time and is inaccurate.

If there were a video search method that could determine a specific position in a video simply from the user directly entering the dialogue being sought, a more accurate search could be ensured in a shorter time.

Accordingly, there is a pressing need for a new video search model that generates a video search request from dialogue entered by the user and, in response to that request, finds the specific position in the video at which the dialogue is played, thereby best satisfying the user's request.

The present invention was devised to solve the above problems. Its object is to provide a method of searching for a specific position in a video, and a video search system, that allow video search using dialogue entered by the user and find the specific position of the video at which the entered dialogue is played, making video search simpler and more convenient.

Another object of the present invention is to provide a method of searching for a specific position in a video, and a video search system, that align the script data corresponding to a video using the segment identifiers associated with the division of the video, and that retrieve the video stream data into which the video was divided by identifying the segment identifier associated with the script sentence containing the search term.

A further object of the present invention is to provide a method of searching for a specific position in a video, and a video search system, that use text data extracted from the video to best support video search requests based on dialogue.

To achieve the above objects, a method of searching a video according to one embodiment of the present invention comprises: dividing the video into a plurality of video stream data and associating a segment identifier with each of the divided video stream data; extracting audio data from the video stream data and correlating the extracted audio data with the related segment identifier; converting the extracted audio data, using a predetermined speech recognition technique, into symbol data in a form that a predetermined script alignment unit can recognize; identifying, in the script alignment unit, the text data within the script data corresponding to the video that contains the converted symbol data; and correlating the identified text data with the related segment identifier.

As a technical configuration for achieving the above objects, a system for searching a video comprises: an identifier association unit that divides the video into a plurality of video stream data and associates a segment identifier with each of the divided video stream data; an audio extraction unit that extracts audio data from the video stream data; a speech recognition unit that converts the extracted audio data, using a predetermined speech recognition technique, into symbol data in a form that a predetermined script alignment unit can recognize; a script alignment unit that identifies the text data within the script data corresponding to the video that contains the converted symbol data, and correlates the identified text data with the related segment identifier; an interface unit that receives from a user a video search request containing a search term; a position search unit that identifies, within the script data, the text data containing the entered search term and retrieves the segment identifier correlated with the identified text data; and a search result providing unit that extracts the video stream data corresponding to the retrieved segment identifier and provides it to the user.

Hereinafter, the method of searching for a specific position in a video and the video search system of the present invention are described with reference to the accompanying drawings.

The term "video" as used throughout this specification refers to video information that a user orders over a communication network and that is delivered to the user over the same network; for example, it may refer to video information provided by a video-on-demand (VOD) service, which lets users watch the video they want at the time they want. Video stream data is the unit video that makes up a video; it is generated by dividing the whole video according to a predetermined unit such as playback time, number of frames, or scene.

The video search system 100 departs from the conventional approach of identifying a position in a video by playback time: it receives the dialogue the user 130 wants, entered as text, and provides the user 130 with the corresponding position in the video.

First, the video providing server 110 provides, over the communication network 140, the video information associated with an order request for a video received from the user 130; a VOD server supporting a VOD service is one example. That is, the video providing server 110 provides the desired video information to a user 130 with whom it has a contractual relationship, and lets the user 130 watch the video through the communication network 140 and the terminal means 135. The video providing server 110 may also be equipped with a ready disk (not shown); by extracting part of the video information to be processed from the video information database 120 in advance and copying it to the ready disk, it can quickly handle service requests from many users 130 in real time.

The video providing server 110 may include the video search system 100 of the present invention internally or externally; in response to a video search request from the user 130, it searches for the specific position in the video and controls notification of the result to the user 130.

The video information database 120 is a large-capacity storage medium that stores video information collected from video providers. It takes the form of an array connecting multiple storage media, and can systematically classify and store video information by considering the category of the video, the provider supplying it, its rating, its quality, and so on.

The user 130 has terminal means 135 for connecting to the video search system 100 of the present invention, and may be an Internet user or VOD service user who receives a search service for a specific position in a video by entering desired dialogue from the video as a text query.

The terminal means 135 is a device that maintains a connection to the video search system 100 through a communication network 140 such as the Internet and displays the video information provided to the user 130. The terminal means 135 also runs a user interface (UI) on screen for entering search terms, and generates a video search request when the user 130 enters text (a search term) related to some dialogue into that interface. The terminal means 135 may be any terminal that has memory and a microprocessor and therefore some computing capability, such as a personal computer, handheld computer, PDA (Personal Digital Assistant), MP3 player, electronic dictionary, mobile phone, or smartphone.

By allowing the user 130 to enter dialogue played in a video as a search term, the video search system 100 can provide a text-based search service for specific positions in the video. The detailed configuration of the video search system 200 of the present invention is described below with reference to FIG. 2.

The video search system 200 of the present invention includes an identifier association unit 210, an audio extraction unit 220, a speech recognition unit 230, a script alignment unit 240, an interface unit 250, a position search unit 260, and a search result providing unit 270.

First, the identifier association unit 210 divides a video into a plurality of video stream data and associates a segment identifier with each. A segment identifier identifies where a piece of divided video stream data lies within the whole video, and may include information about the corresponding video stream data such as playback time (start and end time), number of frames, and scene. That is, the identifier association unit 210 stores each of the video stream data making up the video under its segment identifier, so that specific video stream data can be retrieved by identifying its segment identifier. The identifier association unit 210 may assign segment identifiers in chronological (time-series) order: for example, the video stream data that comes first in time receives segment identifier 'T1', subsequent data receive successive identifiers in order, and the n-th video stream data receives segment identifier 'Tn'.
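The segmentation step can be sketched as follows. This is a minimal illustration that assumes division by a fixed playback duration, which is only one of the criteria the text mentions; the `Segment` type and `split_by_duration` function are hypothetical names, not part of the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    segment_id: str  # e.g. "T1", assigned in chronological order
    start_s: float   # playback start time in seconds
    end_s: float     # playback end time in seconds


def split_by_duration(total_s: float, chunk_s: float) -> list[Segment]:
    """Divide a video of total_s seconds into chunks of at most chunk_s seconds."""
    if total_s <= 0 or chunk_s <= 0:
        raise ValueError("total_s and chunk_s must be positive")
    segments = []
    start = 0.0
    index = 1
    while start < total_s:
        end = min(start + chunk_s, total_s)
        segments.append(Segment(f"T{index}", start, end))
        start = end
        index += 1
    return segments
```

Splitting by scene or frame count would only change how the boundaries are chosen; the chronological 'T1' to 'Tn' numbering stays the same.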

The audio extraction unit 220 extracts audio data from video stream data. For each video stream data, it separates video data such as images and pictures from audio data such as speech and sound, and can selectively extract specific audio data from among multiple audio data according to the selection of the user 130.
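As an illustration only, audio for one segment could be pulled out with the `ffmpeg` command-line tool. The source does not name any extraction tool, so the use of `ffmpeg`, the 16 kHz mono output, and the `build_extract_command` helper are all assumptions of this sketch.

```python
def build_extract_command(src: str, dst: str, start_s: float, end_s: float) -> list[str]:
    """Build an ffmpeg argument list that writes the audio of one segment to dst."""
    return [
        "ffmpeg",
        "-i", src,            # input video file
        "-ss", str(start_s),  # segment start, in seconds
        "-to", str(end_s),    # segment end, in seconds
        "-vn",                # drop the video stream, keep audio only
        "-ac", "1",           # mono (assumed to suit a speech recognizer)
        "-ar", "16000",       # 16 kHz sample rate (assumed)
        dst,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where `ffmpeg` is installed.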

The speech recognition unit 230 uses a predetermined recognition technique to convert the extracted audio data into symbol data in a form that the script alignment unit 240 can recognize. That is, the speech recognition unit 230 generates symbol data that is meaningful to the script alignment unit 240, which performs alignment of the script data associated with the video (for example, identifying the position of data contained in the script data); it performs the conversion by, for example, automatically generating pronunciation transcription rules. Automatic pronunciation transcription consists of text normalization and GTP (Grapheme-to-Phoneme) conversion. For example, if English audio data contains the phrase "3-point-7-5 percent", text normalization yields "three point seven to five percent", and GTP conversion turns this into "TH R IY sp P OY N TD sp S EH V AX N sp T AX sp F AY V sp P AXR S EH N TD sp". Each of the resulting symbols corresponds to a speech recognition model. The speech recognition technique here is a technique for turning the audio data (sound) output during video playback into text (sentences) carrying time information; Viterbi search is a representative example. Speech recognition by Viterbi search passes the extracted audio data, in time order, through the script reconstructed as preset phoneme models, selects the phoneme model with the highest accuracy, and determines the characters corresponding to the selected phoneme model. In extracting the feature information used for speech recognition from the speech data, a compensation process such as Sequential Fast Pole Filtering (SFPF) is applied to the audio data, which improves the alignment accuracy for the extracted audio data.
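The GTP step in the example above can be mimicked with a plain dictionary lookup. The lexicon below contains only the six words and pronunciations that appear in the example; a real grapheme-to-phoneme converter would use rules or a trained model, and `LEXICON` and `to_phonemes` are hypothetical names for this sketch. Text normalization is not shown, so the input is assumed to be already normalized.

```python
# Pronunciations copied from the example in the text; "sp" marks a short pause.
LEXICON = {
    "three": "TH R IY",
    "point": "P OY N TD",
    "seven": "S EH V AX N",
    "to": "T AX",
    "five": "F AY V",
    "percent": "P AXR S EH N TD",
}


def to_phonemes(normalized_text: str) -> str:
    """Convert normalized text to a phoneme string, adding "sp" after each word."""
    parts = []
    for word in normalized_text.lower().split():
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        parts.append(LEXICON[word])
        parts.append("sp")
    return " ".join(parts)
```

Calling `to_phonemes("three point seven to five percent")` reproduces the phoneme string quoted in the example.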

The script alignment unit 240 identifies the text data, within the script data corresponding to the video, that probabilistically contains the most of the converted symbol data, and correlates the identified text data with the related segment identifier. That is, the script alignment unit 240 identifies the position of the converted text data within given script data: it identifies, in near real time, the position of the text data in the script data that best matches (is most similar to) the symbol data produced by speech recognition. Here, script data is the set of scripts (or subtitles) corresponding to the video stream data; it is prepared in advance, for example by the video provider that produces or supplies the video, and is provided at some point to the operator who runs this system. Script data consists of a sequence of combinations of one or more letters, digits, and special characters, or of sentences made up of such combinations; these combinations or sentences may be delimited by, for example, the scenes played in the video.

In addition, the script alignment unit 240 correlates, within the script data, the identified text data with the segment identifier associated with the converted symbol data, and may align the combinations or sentences of the script data on the basis of the correlated segment identifiers. As described above, segment identifiers are correlated with the video stream data in time-series order; accordingly, the script alignment unit 240 can reorder the script data chronologically on the basis of the segment identifiers.

As a result, each combination or sentence of the script data of the present invention can be associated with a segment identifier that replaces conventional time information, making it possible to know at which position in the video a specific combination or sentence of the script data is played as sound.
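Once each script sentence carries a segment identifier, chronological reordering reduces to a sort on that identifier. The sketch below assumes identifiers have the form 'T' followed by an integer, as in the 'T1' to 'Tn' example; `segment_order` and `sort_script` are hypothetical names.

```python
def segment_order(segment_id: str) -> int:
    """Numeric rank of an identifier such as 'T12' (assumed form: 'T' + integer)."""
    return int(segment_id[1:])


def sort_script(aligned: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort (sentence, segment_id) pairs into playback order."""
    return sorted(aligned, key=lambda pair: segment_order(pair[1]))
```

Sorting on the numeric part rather than the raw string matters because 'T10' would otherwise sort before 'T2'.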

The interface unit 250 receives a video search request containing a search term from the user 130. That is, the interface unit 250 receives the video search request that is generated when the user 130 enters a search term (dialogue) into a user interface.

The position search unit 260 identifies the text data in the script data that contains the entered search term and retrieves the segment identifier correlated with the identified text data. That is, the position search unit 260 identifies the combination or sentence of the script data that contains at least part of the entered search term (dialogue) and retrieves the matching segment identifier. In particular, when the search term occurs at multiple positions (combinations or sentences) in the script data, the position search unit 260 computes the similarity between each position and the search term and retrieves a predetermined number of positions whose similarity is at or above a set threshold. This embodiment does not restrict the similarity computation to one specific technique; the similarity may be computed as a numerical value by considering, for example, the degree of match of phonemes, syllables, and words, or the degree of semantic similarity. The operator of the system can choose the similarity technique flexibly. Note, however, that when the position search unit 260 searches for a position (text data), the computation proceeds phoneme by phoneme and the result of each step affects the later steps, so the currently determined position is not guaranteed to be the final one. The operator can therefore select and apply a similarity technique with higher accuracy to suit the system environment.
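The thresholded search can be sketched as below. The similarity used here is simple word overlap, which is just one of the measures the text allows (it also mentions phoneme, syllable, and semantic similarity); the function names, the default threshold of 0.5, and the default limit of 3 are assumptions made for illustration.

```python
def similarity(query: str, sentence: str) -> float:
    """Fraction of query words that also occur in the sentence (0.0 to 1.0)."""
    query_words = query.lower().split()
    sentence_words = set(sentence.lower().split())
    if not query_words:
        return 0.0
    hits = sum(1 for word in query_words if word in sentence_words)
    return hits / len(query_words)


def search_segments(query, script_index, threshold=0.5, limit=3):
    """Return up to `limit` (score, segment_id, sentence) hits, best first."""
    scored = []
    for sentence, segment_id in script_index:
        score = similarity(query, sentence)
        if score >= threshold:
            scored.append((score, segment_id, sentence))
    # Stable sort: equally scored hits keep their script order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]
```

Swapping in a different measure only requires replacing `similarity`; the threshold and the cap on the number of results work the same way.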

The search result providing unit 270 extracts the video stream data corresponding to the retrieved segment identifier and provides it to the user 130. That is, the search result providing unit 270 identifies the video stream data that corresponds to the segment identifier retrieved by the position search unit 260 and controls its presentation on the terminal means 135 as the result of the video search request.

Thus, according to the present invention, the user 130 can search a video using its dialogue as the search term, and the specific position of the video at which the entered dialogue is played is found, providing a video search environment that is more accurate and more convenient than existing search methods that depend on time.

First, part i) of FIG. 3 illustrates associating segment identifiers with part (or all) of a video: the identifier association unit 210 of FIG. 2 divides the video into a plurality of video stream data according to a predetermined criterion (playback time, frame, scene, etc.), and a segment identifier (T1 to T4) corresponds to each video stream data. As one example of the present invention, part i) of FIG. 3 shows segment identifiers assigned to the video stream data on the basis of playback time.

That is, as shown in part iv) of FIG. 3, a predetermined memory means may record the playback time of each divided piece of video stream data together with the corresponding segment identifier. Accordingly, when a particular segment identifier is retrieved, the corresponding video stream data, and more specifically the playback time at which that video stream data is played, can be determined. For example, if the video stream data and segment identifiers recorded in the memory means are as shown in part iv) of FIG. 3 and the segment identifier 'T1' is retrieved by predetermined processing, the video search system 200 can identify video stream data #1, whose playback time is '00:03:24-00:06:17', as the search result.
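As an illustration only, the correspondence recorded in the memory means could be represented as a simple lookup table. The following Python sketch is not part of the disclosed embodiment; the names `Segment`, `SEGMENT_TABLE`, and `lookup` are hypothetical, and only the playback times for T1 and T3 are taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    stream_no: int  # e.g. 1 for "video stream data #1"
    start: str      # playback start, HH:MM:SS
    end: str        # playback end, HH:MM:SS


# Hypothetical contents of the "memory means". The T1 and T3 rows use the
# playback times quoted in the description; other rows are omitted.
SEGMENT_TABLE = {
    "T1": Segment(1, "00:03:24", "00:06:17"),
    "T3": Segment(3, "00:08:26", "00:10:00"),
}


def lookup(segment_id: str) -> Segment:
    """Return the stream number and playback range recorded for an identifier."""
    return SEGMENT_TABLE[segment_id]


print(lookup("T1"))
```

Keying the table by segment identifier means that any later step that yields an identifier can resolve it to a playback range in a single lookup.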

Part ii) of FIG. 3 illustrates converting the script to GTP and then performing speech recognition using the audio data extracted from each piece of video stream data.

Part iii) of FIG. 3 illustrates the speech recognition result identifying, in the audio, the start and end positions of each sentence of the script. The script alignment unit 240 then correlates each identified position with the segment identifier associated with the corresponding text data. Accordingly, the video search system 200 of the present invention can identify, by means of the segment identifier, the point at which a specific position in the script data is played in the video.

Thereafter, when the user 130 inputs, for example, the search term '선택은 순간' (roughly, 'choice ... moment'), the video search system 200 identifies the position in the script data that contains the search term, namely '이성의 선택은 순간적인 느낌이 중요해' (roughly, 'in choosing a partner, the feeling of the moment is what matters'), and retrieves the associated segment identifier 'T1'. The video search system 200 may then provide the user 130 with video stream data #1 (for example, playback time 00:03:24~00:06:17), which corresponds to the retrieved 'T1', as the search result.

Therefore, according to the present invention, when the user 130 inputs a line of dialogue ('선택은 순간') from a given video, the position in the video at which that dialogue is played can be retrieved.

The workflow of the video search system 200 according to the present invention, configured as described above, will now be described in detail.

The video search method according to this embodiment is performed by the video search system 200 described above.

First, the video search system 200 divides the video into a plurality of video stream data and associates a segment identifier with each of the divided video stream data (S410). In this step (S410) the video is partitioned into predetermined units and a segment identifier is assigned to each unit; the video may be divided on the basis of, for example, playback time, frame, or scene. Part iv) of FIG. 3 illustrates dividing the video by playback time and correlating a segment identifier with each unit: video stream data #1, corresponding to playback time 00:03:24~00:06:17, is associated with segment identifier T1.
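Step S410 can be sketched as follows for the playback-time criterion. This is a minimal illustration assuming fixed-length units measured in seconds, which the description does not require (the segments in FIG. 3 are of unequal length); the function name `split_by_time` and its parameters are hypothetical.

```python
from typing import Dict, Tuple


def split_by_time(duration_s: int, unit_s: int) -> Dict[str, Tuple[int, int]]:
    """Divide [0, duration_s) into consecutive units labeled T1, T2, ...

    Each value is a (start, end) pair in seconds; the last unit may be shorter.
    """
    if unit_s <= 0:
        raise ValueError("unit_s must be positive")
    segments = {}
    for n, start in enumerate(range(0, duration_s, unit_s), start=1):
        segments[f"T{n}"] = (start, min(start + unit_s, duration_s))
    return segments


# A 10-minute video cut into 3-minute units yields four segments.
print(split_by_time(600, 180))
```

Division by frame or by scene would replace only the boundary computation; the assignment of one identifier per unit stays the same.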

In addition, the video search system 200 extracts audio data from the video stream data and correlates the extracted audio data with the associated segment identifier (S420). This step (S420) separates the video data contained in the video stream data from the audio data synchronized with it; in this embodiment only the audio data is extracted.

Next, the video search system 200 uses a predetermined speech recognition technique to convert the extracted audio data into symbol data in a form that the script alignment unit 240 can recognize (S430). This step (S430) transcribes the speech played in the video stream data. For example, the audio data may be passed through a predetermined symbol map (model map) using Viterbi search and compared against phoneme models, and the symbol data judged most similar may be selected. The symbol data is generated in a form whose meaning can be identified by the script alignment unit 240, described below, so that it can be used in aligning the script data.
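For readers unfamiliar with Viterbi search, the following Python sketch shows the general technique on a toy hidden Markov model: it finds the most probable sequence of hidden states (standing in for phonemes) given a sequence of observations (standing in for acoustic labels). The two-state model and all probabilities are invented for illustration and are not taken from the disclosure; all probabilities are assumed to be nonzero.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               predecessor state on that path)
    first = observations[0]
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model (hypothetical values): two phoneme-like states, two acoustic labels.
states = ("k", "a")
start_p = {"k": 0.8, "a": 0.2}
trans_p = {"k": {"k": 0.3, "a": 0.7}, "a": {"k": 0.2, "a": 0.8}}
emit_p = {"k": {"burst": 0.9, "voiced": 0.1},
          "a": {"burst": 0.2, "voiced": 0.8}}

print(viterbi(["burst", "voiced", "voiced"], states, start_p, trans_p, emit_p))
# → ['k', 'a', 'a']
```

Log-probabilities are used so that long observation sequences do not underflow.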

In addition, the video search system 200 identifies, in the script alignment unit, the text data within the script data corresponding to the video that probabilistically contains the most of the converted symbol data (S440). This step (S440) compares the converted symbol data with the text data of the script data, which is the script of the video, and identifies the text data that is identical or most similar to the symbol data.

Next, the video search system 200 correlates the identified text data with the associated segment identifier (S450). This step (S450) aligns the script data using the segment identifiers. The correlated segment identifier may be determined as, for example, the segment identifier corresponding to the symbol data involved in identifying the text data, or to the audio data from which that symbol data was converted. In this way, the video search system 200 of the present invention can use segment identifiers to identify the position in the video where a given sentence of the script data is played (see part iii) of FIG. 3).
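Steps S440 and S450 can be sketched as follows. This is a simplified illustration, assuming the symbol data is plain text and using the string similarity from Python's standard `difflib` module as a stand-in for the probabilistic comparison, which the description does not specify in detail. The function name `align_script` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def align_script(recognized, script_lines):
    """Correlate script lines with segment identifiers.

    recognized maps a segment identifier to the (possibly noisy) text
    recognized from that segment's audio. For each segment, the most similar
    script line is chosen, and the result maps that line to the identifier.
    """
    alignment = {}
    for segment_id, symbols in recognized.items():
        best_line = max(
            script_lines,
            key=lambda line: SequenceMatcher(None, symbols, line).ratio(),
        )
        alignment[best_line] = segment_id
    return alignment


# Hypothetical recognizer output containing small errors.
recognized = {
    "T1": "the choice is a momentary feling",
    "T3": "this restarant does ribs well",
}
script_lines = [
    "the choice is a momentary feeling",
    "this restaurant does ribs well",
    "our love forever",
]
print(align_script(recognized, script_lines))
```

Because the script supplies the correct wording, recognition errors matter only insofar as they change which script line is the closest match.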

Therefore, according to the present invention, the script data corresponding to a video can be aligned using the segment identifiers associated with the division of the video, so that the specific position in the video where a given sentence of the script data is played can be identified accurately.

In addition, the video search system 200 receives from the user 130 a video search request containing a search term, and identifies text data containing the input search term within the script data (S460). This step (S460) compares the search term entered for the video search request with the script data, which is the script of the video, and identifies the text data containing the search term, that is, its position in the script data. In other words, the video search system 200 generates a video search request when the user 130 inputs a line of dialogue from the video as a search term, and identifies the position of the sentence in the script data that is identical or similar to the input dialogue.

Next, the video search system 200 retrieves the segment identifier correlated with the identified text data (S470). This step (S470) locates the dialogue input by the user 130 within the script data. In particular, when multiple positions in the script data containing the input search term are identified, the video search system 200 selects a predetermined number of positions on the basis of a similarity computed for each position. The similarity may be computed, for example, as a similarity probability value that takes into account the match ratio between a sentence of the script data and the text data, the number of matching phonemes, whether synonyms are used, and so on. Positions in the script data whose value is at or above a set similarity probability value are retrieved.

For example, suppose the script data is as shown in part iii) of FIG. 3 and the search term 'this restaurant makes delicious ribs' is entered. The video search system 200 retrieves the positions in the script data that contain 'restaurant' and 'ribs' from the search term entered by the user 130: 'this restaurant does ribs well' (segment identifier T3) and 'this restaurant specializes in ribs' (segment identifier T5). The video search system 200 may then, for example by taking synonym usage into account, compute a relatively high similarity for 'this restaurant does ribs well', the position in the script data associated with segment identifier T3. If the number of retrieved segment identifiers is limited to one, the video search system 200 may return only segment identifier T3 in response to the search term 'this restaurant makes delicious ribs'.
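The selection described for step S470 can be sketched as follows. The description lists factors for the similarity (match ratio, matching phonemes, synonym usage) but gives no formula, so the scoring below is an assumption made purely for illustration: an exact word match counts 1.0, a synonym match counts 0.5, and the total is divided by the number of query words. The synonym table, the threshold, and the function names are hypothetical.

```python
# Hypothetical synonym table: words treated as near-equivalent to a query word.
SYNONYMS = {"delicious": {"well", "tasty"}}


def similarity(query: str, sentence: str) -> float:
    """Score a script sentence against a query (assumed scoring, see lead-in)."""
    query_words = query.lower().split()
    sentence_words = set(sentence.lower().split())
    hits = 0.0
    for word in query_words:
        if word in sentence_words:
            hits += 1.0
        elif SYNONYMS.get(word, set()) & sentence_words:
            hits += 0.5
    return hits / len(query_words)


def search(query, index, threshold=0.5, limit=1):
    """Return up to `limit` segment identifiers scoring at or above threshold."""
    scored = [(similarity(query, text), seg) for text, seg in index.items()]
    scored = [item for item in scored if item[0] >= threshold]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [seg for _, seg in scored[:limit]]


index = {
    "this restaurant does ribs well": "T3",
    "this restaurant specializes in ribs": "T5",
}
print(search("this restaurant makes ribs delicious", index))           # → ['T3']
print(search("this restaurant makes ribs delicious", index, limit=2))  # → ['T3', 'T5']
```

With these assumed weights the T3 sentence scores 0.7 and the T5 sentence scores 0.6, so limiting the result to one identifier returns T3, as in the example above.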

In addition, the video search system 200 extracts the video stream data correlated with the retrieved segment identifier and provides it to the user 130 (S480). This step (S480) transmits, over the communication network 140, the video stream data corresponding to the retrieved segment identifier as the result of the video search request. For example, for segment identifier T3 retrieved in step S470, the video search system 200 may provide the user 130 with video stream data #3, associated with playback time 00:08:26~00:10:00.

Therefore, according to the present invention, the user 130 is allowed to enter dialogue as a search term, and the position in the video where the entered search term is played is retrieved, so that a convenient and accurate video search method can be realized.

Hereinafter, as another embodiment of the present invention, performing a video search from text input by the user 130 for a video that has no script data will be described.

First, the video search system 200 divides the video into a plurality of video stream data and extracts audio data from each of the divided video stream data (S510). This step (S510) divides the video into unit data according to a predetermined criterion, for example the playback time, frames, or scenes of the video, and separates the audio from each of the divided unit data.

In addition, the video search system 200 converts the extracted audio data into predetermined symbol data using a predetermined speech recognition technique, and identifies text data probabilistically similar to the converted symbol data by referring to predetermined script data (S520). In this step (S520) the video search system 200 also records the identified text data and the video stream data associated with it, in correspondence with each other, in a predetermined memory means. That is, the video search system 200 ultimately infers text data from the audio data extracted from the video, and stores that text data together with the video stream data from which the audio data was extracted. For example, if the video stream data is divided on the basis of playback time and the text data converted from a particular piece of video stream data is 'our love forever', the video search system 200 may record the playback time '1:01:25~1:02:00' in association with 'our love forever'.

Next, the video search system 200 receives from the user 130 a video search request containing a search term, and searches for text data containing the search term by referring to the memory means (S530). In this step (S530) a particular line of dialogue from the video is received from the user 130 as a search term, and text data identical or similar to the input dialogue is retrieved. In particular, when multiple pieces of text data containing the input search term are found, the video search system 200 selects a predetermined number of them according to a similarity based on how closely each matches the search term. That is, the video search system 200 may compute, for example, a similarity probability value that takes into account the match ratio between the search term and the text data, the number of matching phonemes, whether synonyms are used, and so on, and retrieves the text data whose value is at or above a set similarity probability value.

In addition, the video search system 200 extracts the video stream data corresponding to the retrieved text data and provides it to the user 130 (S540). This step (S540) transmits to the user 130, over the communication network 140, the video stream data recorded in the memory means in correspondence with the retrieved segment identifier, as the result of the video search request.
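The script-less flow of steps S520 to S540 can be sketched as a small text-to-playback-time index. This is an illustration under simplifying assumptions: recognition is assumed to have already produced plain text for each segment, and matching is reduced to case-insensitive substring containment rather than the similarity measure described above. The names `build_index` and `find_ranges` are hypothetical, and only the first sample row comes from the description.

```python
def build_index(transcripts):
    """Record recognized text with the playback range it came from (S520).

    transcripts is an iterable of (start, end, recognized_text) tuples.
    """
    return [(text, (start, end)) for start, end, text in transcripts]


def find_ranges(index, query):
    """Return playback ranges whose recorded text contains the query (S530)."""
    q = query.lower()
    return [span for text, span in index if q in text.lower()]


index = build_index([
    ("1:01:25", "1:02:00", "our love forever"),   # row from the description
    ("1:02:00", "1:03:10", "see you tomorrow"),   # hypothetical row
])
print(find_ranges(index, "love forever"))  # → [('1:01:25', '1:02:00')]
```

The ranges returned by `find_ranges` are what step S540 would use to select the video stream data sent to the user.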

Therefore, according to the present invention, even when no script of the video is available, video search requests made by dialogue can be optimally supported using text data extracted from the video.

Embodiments of the present invention include computer-readable media containing program instructions for performing various computer-implemented operations. The computer-readable media may include program instructions, data files, data structures, and the like, alone or in combination. The media may be specially designed and constructed for the present invention, or may be of a kind well known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The media may also be transmission media, such as optical or metallic lines or waveguides, that include a carrier wave carrying signals specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The computer device 600 includes one or more processors 610 connected to main memory comprising random access memory (RAM) 620 and read only memory (ROM) 630. The processor 610 is also called a central processing unit (CPU). As is well known in the art, the ROM 630 transfers data and instructions unidirectionally to the CPU, while the RAM 620 is typically used to transfer data and instructions bidirectionally. The RAM 620 and the ROM 630 may include any suitable form of computer-readable media. Mass storage 640 is bidirectionally connected to the processor 610 to provide additional data storage capacity, and may be any of the computer-readable recording media described above. The mass storage 640 is used to store programs, data, and the like, and is typically a secondary storage device, such as a hard disk, that is slower than main memory. A specific mass storage device such as a CD-ROM 660 may also be used. The processor 610 is connected to one or more input/output interfaces 650, such as a video monitor, trackball, mouse, keyboard, microphone, touchscreen display, card reader, magnetic or paper tape reader, voice or handwriting recognizer, joystick, or other known computer input/output device. Finally, the processor 610 may be connected to a wired or wireless communication network through a network interface 670. The procedures of the method described above may be performed over this network connection. The devices and tools described above are well known to those skilled in the computer hardware and software arts.

The hardware device described above may be configured to operate as one or more software modules in order to perform the operations of the present invention.

Although specific embodiments of the present invention have been described so far, various modifications are of course possible without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims below and their equivalents.

As described above, although the present invention has been described with reference to limited embodiments and drawings, it is not limited to those embodiments, and a person of ordinary skill in the field to which the present invention pertains can make various modifications and variations from this description. Accordingly, the spirit of the present invention should be understood only from the claims set forth below, and all equal or equivalent variations thereof fall within the scope of the spirit of the present invention.

As can be seen from the above description, the present invention can provide a method of searching for a specific position in a video, and a video search system, that allow a user to search video by dialogue and that retrieve and provide the specific position in the video where the input dialogue is played, thereby performing video search more simply and conveniently than conventional search methods that depend on time.

In addition, the present invention can provide a method of searching for a specific position in a video, and a video search system, that align the script data corresponding to the video using the segment identifiers associated with the division of the video, and that retrieve the video stream data into which the video was divided by identifying the segment identifier associated with the sentence of the script data that contains the search term.

In addition, the present invention can provide a method of searching for a specific position in a video, and a video search system, that use text data extracted from the video to optimally support video search requests made by dialogue.

In addition, the present invention can provide a method of searching for a specific position in a video, and a video search system, that, when a specific section of a video is to be found, use an optimized text search function rather than a monotonous, keyframe-dependent search by the user's eye.

Claims

A method of searching a video, the method comprising:

dividing the video into a plurality of video stream data and associating a segment identifier with each of the divided video stream data;

extracting audio data from the video stream data and correlating the extracted audio data with the associated segment identifier;

converting the extracted audio data, using a predetermined speech recognition technique, into symbol data in a form recognizable by a predetermined script alignment unit;

identifying, in the script alignment unit, text data within script data corresponding to the video, the text data including the converted symbol data; and

correlating the identified text data with the associated segment identifier.

The method of claim 1, further comprising:

receiving a video search request including a search term from a user;

identifying, within the script data, text data including the input search term;

retrieving the segment identifier correlated with the identified text data; and

extracting the video stream data corresponding to the retrieved segment identifier and providing it to the user.

The method of claim 2,

wherein the identifying of the text data including the search term comprises,

when a plurality of text data are identified in the script data, selecting a predetermined number of the text data based on a similarity calculated for each text data.

The method of claim 1,

wherein the script data consists of a combination of one or more of letters, numbers, and special symbols.

The method of claim 1,

wherein the video stream data is generated by dividing the video on the basis of one of playback time, frame number, and scene.

A method of searching a video, the method comprising:

dividing the video into a plurality of video stream data;

extracting audio data from the video stream data;

converting the extracted audio data into predetermined symbol data using a predetermined speech recognition technique, and identifying text data probabilistically similar to the converted symbol data with reference to predetermined script data;

recording the identified text data and the video stream data associated with the identified text data, in correspondence with each other, in a predetermined memory means;

receiving a video search request including a search term from a user;

searching for text data associated with the search term with reference to the memory means; and

extracting the video stream data corresponding to the retrieved text data and providing it to the user.

The method of claim 6,

wherein the searching for text data associated with the search term comprises,

when a plurality of text data are found, selecting a predetermined number of the text data according to a similarity based on the degree of match between each text data and the search term.

The method of claim 1 or 6,

wherein the speech recognition technique is a Viterbi search technique that converts the extracted audio data into text by comparing it against phoneme models.

A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 1 to 7.

A system for searching a video, the system comprising:

an identifier association unit that divides the video into a plurality of video stream data and associates a segment identifier with each of the divided video stream data;

an audio extraction unit that extracts audio data from the video stream data;

a speech recognition unit that converts the extracted audio data, using a predetermined speech recognition technique, into symbol data in a form recognizable by a predetermined script alignment unit;

a script alignment unit that identifies text data, within script data corresponding to the video, including the converted symbol data, and correlates the identified text data with the associated segment identifier;

an interface unit that receives a video search request including a search term from a user;

a position search unit that identifies, within the script data, text data including the input search term and retrieves the segment identifier correlated with the identified text data; and

a search result providing unit that extracts the video stream data corresponding to the retrieved segment identifier and provides it to the user.