KR20090090613A - Multimodal Interactive Image Management System and Method

Info

Publication number: KR20090090613A
Application number: KR1020080015938A
Authority: KR
Inventors: 김현정; 장두성
Original assignee: 주식회사 케이티
Priority date: 2008-02-21
Filing date: 2008-02-21
Publication date: 2009-08-26

Abstract

The present invention relates to a multimodal interactive image management system and method. The system comprises: an image recognition module that, when a new image is input from a user terminal, distinguishes the objects contained in the image and generates image classification information; a tagnet management module that, based on the image classification information generated by the image recognition module, uses a tag ontology built from words related by subject to extract tag candidate recommendation information for the image; a natural language dialogue module that uses the extracted tag candidate recommendation information to let the user select a desired tag through dialogue; a multimodal interface processing module that receives image-related information, including memos about the image, from the user terminal through voice and non-voice input and integrates it into a single piece of information; and an image DB that stores the image and its related tags and memo information. Accordingly, after a user uploads a photo or video to the multimodal interactive image management system, tags and memo information are attached to the image through dialogue with the user, using a multimodal interface (voice, keyboard, touchpad, and so on) and the dialogue module, and are stored in the photo-tag DB. The tags and memos related to the photo or video are then provided to user terminals such as PCs and digital photo frames.

Description

System and method for multimodal conversational mode image management

The present invention relates to a multimodal interactive image management system and method. In particular, after a user uploads a photo or video to the multimodal interactive image management system, the system either recommends tags related to the photo or video using an image recognition system, a photo-tag DB, and a tagnet, or attaches tag information to the image through dialogue with the user, using a dialogue module and a multimodal interface that combines voice recognition with a touchpad. The tagged images are stored so that a tag-based image search service can be provided.

Recently, as speech recognition and speech synthesis technologies have advanced, the need for multimodal interfaces that use voice and a pen has grown for terminals such as mobile terminals, home network terminals, and robots.

Multimodal refers to multiple modalities, where a modality is a channel obtained by modeling a human sensory channel (sight, hearing, touch, taste, smell, and so on) and converting it into a mechanical device. Exchanging information by combining these modalities is called multimodal interaction.

Multimodal interfaces are classified by timing and by the nature of the combination: by timing into sequential and simultaneous multimodal, and by the nature of the combination into alternate (simple) and supplementary combinations.

A multimodal interface lets a person and a terminal device interact not only by voice but also by keyboard, pen, graphics, and so on: the user enters information by voice, keyboard, or pen, and the terminal outputs speech, graphics, music, and video.

Multimodal inputs include voice, pen (or keypad), keyboard, mouse, touchscreen, lip movement, gesture, and eye movement.

Multimodal outputs include graphic elements such as icons, charts, and tables; sound elements such as earcons, sound effects, and synthesized speech; and vibration elements such as shocks.

The multimodal interface is divided into a voice interface that handles voice input and output, an ink interface that recognizes handwriting written with a pen, and the traditional keyboard interface. The voice interface uses the EMMA (Extensible Multimodal Annotation Markup Language) format so that its results can be presented alongside those of the other interfaces. The ink interface takes input written with a pen, drawn as a picture, or expressed as a formula. The keyboard interface is the method in current use and likewise uses the EMMA format so that its results can be presented together with those of the other interfaces.

With recent advances in speech recognition, speech synthesis, and handwriting recognition, and with growing demand for services that use these multimodal technologies, the W3C Multimodal Interaction Working Group is standardizing the Multimodal Interaction Framework, EMMA (Extensible Multimodal Annotation), and the Ink Markup Language, while the Voice Browser Working Group and the Multimodal Interaction Working Group are standardizing voice and multimodal interfaces.

The voice interface used in a multimodal interface is therefore implemented, in products such as MS Speech Server, with speech recognition between person and terminal device, speech synthesis that converts text to speech (TTS, Text To Speech), and language processing. Language processing can be regarded as part of speech recognition and speech synthesis, but recent systems use VoiceXML (Voice Extensible Markup Language) or SALT (Speech Application Language Tags) to control voice input and output.

VoiceXML is an XML (Extensible Markup Language) language for controlling, on the Web, voice input and output with spoken dialog functions such as speech recognition, speech synthesis, and DTMF signals.

VoiceXML has been developed by a forum formed in August 1999 around AT&T, Lucent Technologies, Motorola, and IBM, which now has about 400 member companies. The VoiceXML 1.0 specification was proposed in March 2000 by the W3C Voice Browser Working Group. The VoiceXML 2.0 specification was proposed in October 2001 and became a W3C Recommendation in March 2004. For speech recognition grammars, VoiceXML 2.0 replaces the JSGF (Java Speech Grammar Format) used in VoiceXML 1.0 with SRGS (Speech Recognition Grammar Specification), which became a Recommendation in March 2004. SSML (Speech Synthesis Markup Language) for speech synthesis became a Recommendation in September 2004.

SALT (Speech Application Language Tags) extends web standard languages such as HTML, XHTML, and XML so that developers can easily build speech and multimodal applications. It was created in the early 2000s by a group of companies led by Microsoft and including Cisco, Intel, Comverse, Philips (Philips Speech Processing), and ScanSoft, as a multimodal interface that can control input and output commands by voice, pen, and mouse on telephones, PCs, notebook PCs, and PDAs. Whereas VoiceXML is a high-level language for developers of telephony-based voice applications, SALT is a low-level language for web developers.

The biggest difference between VoiceXML and SALT is that VoiceXML operates independently of HTML and is therefore hard to use together with it, whereas SALT defines its speech recognition and synthesis tags inside HTML, so the two can be used together and multimodal applications can be built.

Three multimodal interface architectures are under consideration: 1) speech recognition, speech synthesis, and dialog technology are implemented in the terminal device, so the user gives commands by voice and hears the results as sound; 2) the voice interface functions are provided on a server and the terminal merely connects to it, an approach widely used for voice interfaces over telephone company networks; and 3) a hybrid of 1) and 2). The W3C (World Wide Web Consortium) in particular has formed a study group on multimodal platforms and is carrying out standardization work.

A multimodal interface is an interface that uses voice, keyboard, and pen for communication between a person and a terminal: it takes voice, pen, handwriting, and keyboard typing as input and presents the terminal's processing results as speech, audio, and video output. Multimodal interfaces that follow the new W3C software architecture standard use SCXML (State Chart XML), a dialogue modeling language based on Harel statecharts. SCXML is an asynchronous, event-based, general-purpose state machine language that runs in the interaction manager (dialogue manager). The dialogue manager loads a scenario script written in SCXML and then drives the modality components that process XML markup such as XHTML, VoiceXML, and SVG (Scalable Vector Graphics).

SVG (Scalable Vector Graphics) is an XML-based language for representing two-dimensional graphics and is the XML graphics standard proposed by the W3C.

The X+V multimodal system, which currently supports two or more modalities, was developed by IBM and uses a nested structure of VoiceXML and XHTML.

FIG. 1 shows the multimodal interaction framework proposed by the W3C in the prior art.

The multimodal interaction framework system consists of an input element (10), an output element (30), an interaction manager (20), an application functions element (21), a session element (22), and a system & environment element (23).

The interaction manager (dialogue manager) (20) uses the information obtained from the input element to execute the actual application service and then provides the result to the output element. It currently uses Harel state diagrams, a kind of finite state diagram.
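As a rough illustration of an interaction manager driven by a state diagram, the sketch below implements a flat finite state machine in Python. The state names, event names, and transition table are hypothetical and are not taken from the patent or from SCXML; a real Harel statechart also supports hierarchical and parallel states, which this sketch omits.

```python
# Hypothetical dialogue states and events for a tagging session.
# A flat table like this is a simplification of a Harel statechart.
TRANSITIONS = {
    ("idle", "image_uploaded"): "recommending_tags",
    ("recommending_tags", "tag_selected"): "saving",
    ("recommending_tags", "tag_rejected"): "asking_for_tag",
    ("asking_for_tag", "tag_spoken"): "saving",
    ("saving", "saved"): "idle",
}


def step(state: str, event: str) -> str:
    """Return the next state; events with no transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state: str = "idle") -> str:
    """Feed a sequence of events through the machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

Keeping the transitions in a table rather than in nested conditionals mirrors the idea of loading a dialogue scenario from a script instead of hard-coding it.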

The session element (Session Component) (22) manages sessions between the various terminals and the multimodal application service and provides a synchronization (sync) function for output to the various terminal devices.

The system environment element (System & Environment) (23) provides an environment in which the output mode adapts automatically to a portable terminal, an in-vehicle terminal, a desktop, and so on, according to the circumstances of the terminal and the user.

FIG. 2 is a detailed block diagram of the input element of the multimodal interaction framework.

The input element (10) consists of a recognition module that recognizes voice, handwriting, keyboard input, and so on and puts it into a form that is easy to interpret; an interpretation module that semantically interprets the recognized information; and an integration module that combines the various kinds of input and sends them to the interaction manager (20).

The interpretation module semantically interprets the results from the recognition module and converts them to a single representative text. For example, it converts '네', '네에', and '예' (all variants of "yes" in Korean) to '네'. The result of the interpretation module is converted to EMMA and passed to the integration module.
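The normalization step described above can be sketched as a simple lookup. The variant table here is a hypothetical stand-in containing only the example from the text; the patent does not say how the interpretation module stores or learns its variants.

```python
# Hypothetical variant table; only the "yes" example from the text is included.
CANONICAL = {
    "네": "네",
    "네에": "네",
    "예": "네",
}


def interpret(recognized: str) -> str:
    """Map a recognized token to its representative text, or return it unchanged."""
    token = recognized.strip()
    return CANONICAL.get(token, token)
```

Tokens without an entry pass through unchanged, so the step is safe to apply to every recognition result.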

The integration module combines information from voice, pointing devices, and so on and passes it to the interaction manager (20).

The input element uses voice, pen, keyboard, GPS information, and so on, and in the long term sensors as well. With such input, multimodal services can also offer context-aware services.

EMMA is the standard language that connects the input element (10) to the interaction manager (20): in a multimodal interaction it takes the user's voice, pen, keyboard, or handwriting input and carries environment information along in metadata.

EMMA is a markup language with an XML-style data structure that represents processing results so that the different components of a multimodal system can exchange data. For example, EMMA can express metadata such as the input time, the confidence value of a recognition result, and the type of input.
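To make the metadata concrete, the following Python sketch builds an EMMA-style document with the standard library. The namespace URI and the `emma:confidence`, `emma:mode`, `emma:start`, and `emma:tokens` annotations follow EMMA 1.0 as far as I know, but the output has not been validated against the schema, and the `tag` payload element and the `build_emma` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
ET.register_namespace("emma", EMMA_NS)


def build_emma(text: str, confidence: float, mode: str, start_ms: int) -> str:
    """Wrap one recognition result in an EMMA-style interpretation element."""
    root = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    interp = ET.SubElement(
        root,
        f"{{{EMMA_NS}}}interpretation",
        {
            "id": "int1",
            f"{{{EMMA_NS}}}confidence": f"{confidence:.2f}",  # recognizer confidence
            f"{{{EMMA_NS}}}mode": mode,                        # e.g. "voice" or "ink"
            f"{{{EMMA_NS}}}start": str(start_ms),              # input time in milliseconds
            f"{{{EMMA_NS}}}tokens": text,                      # recognized tokens
        },
    )
    # Hypothetical application payload: the tag the user asked for.
    ET.SubElement(interp, "tag").text = text
    return ET.tostring(root, encoding="unicode")
```

Calling `build_emma("beach", 0.87, "voice", 1203580800000)` yields a single `emma:interpretation` whose attributes carry the confidence, mode, and time that an integration module could later compare across modalities.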

Ink Markup Language is an XML-based language for representing the results of handwriting recognition. The handwriting may be a drawing, an emphasis mark, a signature, or simply handwritten text, and may also be a formula or musical notation.

FIG. 3 is a detailed block diagram of the output element of the multimodal interaction framework.

The output element (30) consists of a generation module, a styling module, and a rendering module.

When information to be delivered to the user arrives from the interaction manager (20), the generation module decides which mode, such as voice or graphics, to output it in.

The styling module adds information about how the output is to be presented, for example where a graphic is placed on the screen or how long the pauses between words are in speech output. The multimodal interface uses CSS (Cascading Style Sheets) to control speech output and XHTML (Extensible HyperText Markup Language) to output graphics, and speech output is expressed in SSML (Speech Synthesis Markup Language).

The rendering module outputs the speech, or draws on the screen the graphics, produced by the styling module. If an actuator function for context awareness is added, the output element of the multimodal interface can also control home appliances, robots, and the like, and processing can be distributed according to the computing power of the terminal.

FIG. 4 is a screen illustrating a user-centered online image storage and search service (flickr).

flickr (http://www.flickr.com/) uses Web 2.0 techniques, like the bookmark (favorites) sharing of del.ico.us (http://del.ico.us/), to attach tags to photos and store them in an online image management system. By collecting many tagged online photos by category, it builds collective intelligence and provides a user-centered service for storing, searching, and sharing online images.

A tag is a word associated with a photo, piece of music, video clip, or the like that is attached to the item and stored on a search server. In the concept known as folksonomy (a blend of "folks" and "taxonomy", meaning classification managed by people), a tag is defined as a keyword or category that classifies information.

However, no conventional online image system has used a multimodal interface, such as voice recognition combined with a touchpad, together with dialogue processing technology to let the user enter tag information for an uploaded photo by voice, keyboard, or touchpad and thereby attach tags to images and realize collective intelligence.

The present invention is proposed to solve these problems of the prior art. Its object is to provide a multimodal interactive image management system and method with an interface that even users unfamiliar with IT services can use easily. To enter tag information for image material such as videos and photos, the user selects an image, points to a specific target through a multimodal interface such as voice, keyboard, or touchpad, and uploads the image. The user then receives recommended tag candidates or enters tag information by voice, and the tag information related to the uploaded image is stored automatically in the photo-tag DB. The system accepts touchpad input simultaneously with voice recognition and uses natural language dialogue for menu selection, search term entry, and tag entry.

These steps can be carried out by voice or with the keyboard. When the user speaks freely in conversational form, the dialogue management system extracts the semantic information, so every control needed for image management, such as menu navigation and searches for similar words and synonyms, can be performed through free dialogue by voice or keyboard. The multimodal interactive image management system also provides a function that, once a few images have been tagged, automatically tags all images through learning.

To achieve this object, the present invention provides a multimodal interactive image management system comprising: an image recognition module that, when a new image is provided from a user terminal, identifies the objects contained in the image and, according to the identification result, generates image classification information that classifies the image by subject; a tagnet management module that generates tag candidate recommendation information containing a plurality of tag examples corresponding to the image classification information; a natural language dialogue module that exchanges query and response information with the user terminal in natural language form and has the user select a tag using the tag candidate recommendation information; a multimodal interface processing module that receives the response information from the user terminal through one or more interfaces and integrates it; and an image DB that stores the image and the selected tag.

The system further comprises: a dialogue model DB, connected to the natural language dialogue module, that stores a plurality of examples for determining the semantic structure of the response information and the user's intention; and a dialogue example DB that stores a plurality of dialogue pairs corresponding to the semantic structure of the response information and the user's intention.

The natural language dialogue module classifies the semantic structure of the response information and the user's intention, finds the dialogue pair mapped to that semantic structure and intention, and provides it to the user as the query information.
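The lookup of a dialogue pair by semantic structure and intention can be sketched as follows. This is a toy illustration only: the keyword rules stand in for the trained model implied by the dialogue model DB, and the intent names, slot name `tag_value`, and reply templates are all hypothetical.

```python
# Hypothetical keyword rules standing in for a trained intent classifier.
INTENT_KEYWORDS = {
    "add_tag": ("tag", "label"),
    "search": ("find", "show"),
}

# Hypothetical dialogue example DB: (intent, slot names) -> system utterance.
DIALOGUE_PAIRS = {
    ("add_tag", ("tag_value",)): "Shall I tag this photo as '{tag_value}'?",
    ("add_tag", ()): "Which tag would you like to add?",
    ("search", ("tag_value",)): "Showing photos tagged '{tag_value}'.",
}

FALLBACK = "Sorry, could you rephrase that?"


def understand(utterance: str, known_tags: set):
    """Return (intent, slots) for an utterance using simple keyword matching."""
    words = utterance.lower().split()
    intent = next(
        (name for name, keys in INTENT_KEYWORDS.items() if any(k in words for k in keys)),
        "unknown",
    )
    tag = next((w for w in words if w in known_tags), None)
    slots = {"tag_value": tag} if tag else {}
    return intent, slots


def respond(utterance: str, known_tags: set) -> str:
    """Find the dialogue pair mapped to the intent and slot structure, then fill it in."""
    intent, slots = understand(utterance, known_tags)
    template = DIALOGUE_PAIRS.get((intent, tuple(sorted(slots))), FALLBACK)
    return template.format(**slots)
```

Keying the table on both the intent and the set of filled slots lets the same intent produce a confirmation when the tag is known and a follow-up question when it is not.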

The system further comprises an image management application interworking module that, at the request of the user terminal, searches for and provides images and the tags related to them.

The user terminal is a digital photo frame.

The one or more interfaces include a voice input method and a non-voice input method.

The non-voice input method includes input using a keyboard or a touchpad.

The system further comprises an image classification training model DB, connected to the image recognition module, that stores image classification information related to the subjects of a plurality of images.

The system further comprises a tag ontology DB, connected to the tagnet management module, that stores the tag examples, which contain words related to each image subject.
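One way to picture the tag ontology DB is as a mapping from image subjects to related words, from which tag candidates are drawn. The sketch below assumes exactly that; the subjects, words, and the `recommend_tags` helper are hypothetical, and the patent does not specify the ontology's actual structure or ranking.

```python
# Hypothetical tag ontology: image subject -> related words (tag examples).
TAG_ONTOLOGY = {
    "beach": ["sea", "summer", "vacation", "sand"],
    "birthday": ["cake", "party", "family", "candles"],
}


def recommend_tags(classes, limit: int = 5):
    """Collect tag candidates for the given image classes, in order, without duplicates."""
    seen = set()
    candidates = []
    for cls in classes:
        for tag in TAG_ONTOLOGY.get(cls, []):
            if tag not in seen:
                seen.add(tag)
                candidates.append(tag)
    return candidates[:limit]
```

Subjects missing from the ontology simply contribute no candidates, which leaves the dialogue module free to ask the user for a tag instead.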

The multimodal interface processing module comprises: a voice modality unit for voice input and output; an HTML modality unit for non-voice input and output; and a runtime framework that identifies and interprets the content of the voice input/output data.

The voice modality unit comprises: a voice I/O processor for voice input and output between the user and the terminal; an IP-CCS for IP-based control of voice calls entering and leaving the voice I/O processor; a speech recognition unit that recognizes the user's voice; a speech synthesis unit that converts text to speech; and a VoiceXML interpreter that interprets the input/output data of the speech recognition unit and the speech synthesis unit.

The voice I/O processor and the IP-CCS communicate with each other over RTP.

The runtime framework comprises: an integration module that works with an EMMA parser to integrate the voice input/output data; an interaction manager that works with an SCXML parser on the integrated voice input/output data sent from the integration module to control the voice modality unit; a session data management unit that supports and manages data for the user's session management; a DCI that stores the attributes of the terminal and the user's preference data; and a generation module that, when information to be delivered to the user arrives from the interaction manager, determines which mode, voice or graphics, to output it in and sends it to the voice modality unit or the HTML modality unit.

The present invention also provides a multimodal interactive image management method comprising: (a) receiving an image uploaded from a user terminal; (b) an image recognition module distinguishing the objects contained in the image and generating image classification information, using the subject classification of a plurality of images stored in an image classification training model DB; (c) a tagnet management module generating tag candidate recommendation information containing a plurality of tag examples corresponding to the image classification information; (d) a natural language dialogue module generating query information based on the tag candidate recommendation information extracted in step (c) and providing it to the user terminal through a multimodal interface providing module; (e) the multimodal interface providing module receiving, through the user terminal, the tag information the user wants as response information via one or more interface methods, and integrating the response information; and (f) receiving image-related information through the multimodal interface providing module and storing the image and the tags related to it in an image DB.
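Steps (a) through (f) can be read as a small pipeline. The Python sketch below wires them together with stubs, purely for illustration: the byte-matching classifier, the inline ontology, the query wording, and the in-memory `ImageDB` are all hypothetical placeholders for the trained recognizer, tag ontology DB, dialogue module, and database that the method assumes.

```python
from dataclasses import dataclass, field


@dataclass
class ImageDB:
    """In-memory stand-in for the image DB of step (f)."""
    records: dict = field(default_factory=dict)

    def save(self, image_id: str, tags) -> None:
        self.records[image_id] = list(tags)


def classify(image_bytes: bytes):
    """Step (b), stubbed: a real system would run a trained image classifier."""
    return ["beach"] if b"sea" in image_bytes else ["unknown"]


def candidate_tags(classes):
    """Step (c): look up tag examples for each class in a toy ontology."""
    ontology = {"beach": ["sea", "summer", "vacation"]}
    return [tag for cls in classes for tag in ontology.get(cls, [])]


def make_query(candidates) -> str:
    """Step (d): turn the candidates into a natural language question."""
    if not candidates:
        return "What tag would you like to add?"
    return "Suggested tags: " + ", ".join(candidates) + ". Which one would you like?"


def integrate(responses):
    """Step (e): merge responses from several interfaces, dropping duplicates."""
    merged = []
    for response in responses:
        tag = response.strip().lower()
        if tag and tag not in merged:
            merged.append(tag)
    return merged


def manage_image(image_id: str, image_bytes: bytes, responses, db: ImageDB) -> str:
    """Steps (a)-(f): classify, recommend, query, integrate, and store."""
    classes = classify(image_bytes)
    query = make_query(candidate_tags(classes))
    db.save(image_id, integrate(responses))
    return query
```

In an interactive system the responses would arrive after the query is shown; they are passed in up front here only to keep the sketch self-contained.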

Step (e) comprises: (e1) the natural language dialogue module receiving the integrated response information and determining its semantic structure and the user's intention using the plurality of examples stored in a dialogue model DB; and (e2) generating the query information from the plurality of dialogue pairs stored in a dialogue example DB and providing it to the user terminal through the multimodal interface providing module.

In step (e), the one or more interface methods include voice and non-voice methods.

The image selection is input by the non-voice method, and the tag information for the selected image is input by voice.
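Combining a non-voice image selection with a spoken tag amounts to pairing events from two modalities. The sketch below pairs each voice input with the most recent touch selection inside a time window; the `InputEvent` type, the window length, and the pairing rule are assumptions made for illustration and are not specified in the patent.

```python
from dataclasses import dataclass


@dataclass
class InputEvent:
    mode: str      # "touch" for image selection, "voice" for a spoken tag
    value: str     # image id for touch, recognized text for voice
    time_ms: int   # time at which the input was received


def fuse(events, window_ms: int = 3000):
    """Pair each spoken tag with the latest touch selection made within window_ms."""
    pairs = []
    selected = None
    for event in sorted(events, key=lambda e: e.time_ms):
        if event.mode == "touch":
            selected = event
        elif event.mode == "voice" and selected is not None:
            if event.time_ms - selected.time_ms <= window_ms:
                pairs.append((selected.value, event.value))
    return pairs
```

A spoken tag that arrives long after the last selection is dropped rather than attached to a possibly stale image, which is one conservative way to handle the ambiguity.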

The method further comprises: (g) an image management application interworking module searching the image DB for images and their related tags according to the response information and providing them.

The present invention also provides a recording medium readable by a computer or a digital photo frame, on which is recorded a program for causing the computer or digital photo frame to realize: (a) a function in which a multimodal interactive image management system receives an image uploaded from a user terminal; (b) a function in which an image recognition module identifies objects included in the image and generates image classification information by using the subject classifications of a plurality of images stored in an image classification training model DB; (c) a function in which a tagnet management module generates tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information; (d) a function in which a natural language dialogue module generates query information based on the tag candidate recommendation information generated in (c) and provides the query information to the user terminal through a multimodal interface providing module; (e) a function in which the multimodal interface providing module receives the tag information desired by the user as response information through the user terminal by one or more interface methods, and integrates the response information; and (f) a function of receiving image-related information through the multimodal interface providing module and storing the image and the tag and memo information related to the image in an image DB.

With the multimodal interactive image management system and method according to the present invention, after a user uploads a photo or video to a website, the photo is automatically identified using an image recognition system, a photo-tag DB, and a tagnet, and tags related to the photo or video are either recommended automatically or obtained through dialogue. Using a multimodal interface that combines speech recognition with a touchpad, together with natural language dialogue processing technology, the user calls menus or enters search terms conversationally, and enters tagging information by voice while touching with a finger or clicking with a mouse. The user's intention and semantic information are then classified with a trained language understanding model, and the tagging information is attached to the image.

By providing simple program control and a simple user interface, the present invention allows people from many backgrounds who are unfamiliar with IT services to control all kinds of applications more easily, and revenue can be generated by selling it as a core technology.

In addition, the present invention can be used as the interface for any service that accepts voice and touchpad input at the same time, such as digital photo frames, home network services, IPTV control, robots, portal services, civil document issuing machines, and navigation systems. The technology can also be used in intelligent robots as a multimodal interface module specialized for voice and sound-source processing.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. Before that, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings; they should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention and do not represent all of its technical idea, and it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 5 is a block diagram of the multimodal interactive image management system according to the present invention.

The multimodal interactive image management system (100) includes an image recognition module (110), an image classification training model DB (111), a tagnet management module (120), a tag ontology DB (121), a natural language dialogue module (130), a dialogue model DB (131), a multimodal interface processing module (140), an image management application interworking module (145), and an image DB (150).

The image recognition module (110) is a module that, when an image is input from a user terminal, automatically identifies objects and finds image classification information using the image classification training model DB (111).

The image classification training model DB (111) is connected to the image recognition module (110) and stores pre-built image classification training model information.

That is, the image recognition module (110) preprocesses the image input from the user terminal to find characteristic patterns, repeatedly compares them with the patterns of the many example images stored in the image classification training model DB (111), identifies the object of the probabilistically most similar image, and extracts the classification information of the image accordingly.
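
The comparison against stored example patterns can be illustrated with a minimal sketch. The patent does not specify the features or the similarity measure, so the three-dimensional feature vectors, the `EXAMPLES` list, and the use of cosine similarity below are all assumptions made for illustration only.

```python
import math

# Hypothetical stand-in for the image classification training model DB (111).
# Each entry pairs a subject label with a toy feature vector; real features
# would come from the preprocessing step, which is not specified in the text.
EXAMPLES = [
    ("beach", [0.9, 0.1, 0.0]),
    ("mountain", [0.1, 0.8, 0.1]),
    ("birthday", [0.0, 0.2, 0.9]),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(features):
    """Return (label, score) of the stored example most similar to `features`."""
    scored = ((label, cosine(features, vec)) for label, vec in EXAMPLES)
    return max(scored, key=lambda pair: pair[1])

label, score = classify([0.8, 0.2, 0.1])
```

Here `label` is the subject of the closest example and plays the role of the image classification information passed on to the tagnet management module.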

When the classification information of the input image has been extracted by the image recognition module (110), the tagnet management module (120) uses the tag ontology DB (121), which is built from words associated with each subject, to extract information similar or related to the classified subject, and thereby extracts a number of words that serve as tag candidates for the image (hereinafter, tag candidate recommendation information).

Here, the tagnet management module (120) extracts related tag sets from tagging information rather than from dictionary meanings.

The tag ontology DB (121) is connected to the tagnet management module (120) and stores words by subject for recommending tags related to an image.
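
As a rough illustration of how tag candidates might be drawn from a subject-organized ontology, the following sketch looks up a subject and its directly related subjects. The `TAG_ONTOLOGY` structure, its contents, and the `limit` parameter are hypothetical; the patent does not describe the ontology's actual schema.

```python
# Hypothetical tag ontology: subject -> tag words plus directly related subjects.
TAG_ONTOLOGY = {
    "beach": {"tags": ["sea", "sand", "vacation"], "related": ["travel"]},
    "travel": {"tags": ["trip", "vacation", "airport"], "related": []},
}

def recommend_tags(subject, limit=5):
    """Collect candidate tags for a subject and its related subjects, without duplicates."""
    entry = TAG_ONTOLOGY.get(subject)
    if entry is None:
        return []
    candidates = list(entry["tags"])
    for related in entry["related"]:
        for tag in TAG_ONTOLOGY.get(related, {}).get("tags", []):
            if tag not in candidates:
                candidates.append(tag)
    return candidates[:limit]
```

Calling `recommend_tags("beach")` returns the subject's own tags first, followed by tags borrowed from the related subject, which corresponds loosely to the tag candidate recommendation information.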

The natural language dialogue module (130) is a module that, when the user speaks, determines the user's intention and semantic information according to the dialogue model stored in a dialogue model DB (not shown), finds the best-matching dialogue pair in a dialogue example DB (not shown) on that basis, and responds to the user, thereby handling menu navigation, tagging information input, search term input, tag recommendation, question answering, and the like.

Through the multimodal interface processing module (140), the natural language dialogue module (130) conducts a dialogue with the user based on the extracted tag candidate recommendation information so that the user selects the desired information from among the tag candidates of the image, or it receives an additional description or memo about the image from the user, understands the user's intention, classifies the semantic structure, and selects the tag best suited to the image. It then stores the tag and memo information related to the image in the image DB (150).

To perform these functions, the natural language dialogue module (130) is connected to a dialogue example DB, which stores tens of thousands of pairs of dialogue examples related to image tagging, and to a dialogue model DB, which is used to classify the semantic structure and the user's intention from the information input through the multimodal interface processing module (140), based on dialogue examples collected for recommending tags or adding memos to images.

The multimodal interface processing module (140) receives voice and non-voice input from the user terminal, interprets the various input means together, and integrates them into a single piece of information.

Here, the non-voice input method may be input using a keyboard or a touchpad.

As the technology applied to handle the functions described above, multimodal interface processing is a technology that, when the user enters text, speaks, or uses a touchpad, integrates the input information while keeping the inputs synchronized, so that an appropriate service can be provided. In terms of the present invention, when the user uploads an image to the multimodal interactive image management system (100) using a digital photo frame or a PC and expresses an intention through text entered with a touchpad or keyboard or through voice entered with a microphone, the multimodal interface processing module (140) integrates the input information and sends it to the natural language dialogue module (130). The natural language dialogue module (130) then uses the dialogue model to classify the user's intention and the semantic components of the user's utterance, and gives an appropriate response.
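
One simple way to picture synchronized integration is to pair a touch selection with the voice or keyboard input closest to it in time. The sketch below does this under stated assumptions: the `InputEvent` type, the field names, and the 2-second pairing window are invented for illustration and are not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class InputEvent:
    mode: str         # "voice", "touch", or "keyboard"
    value: str        # recognized text, or the id of the selected image
    timestamp: float  # seconds

def integrate(events, window=2.0):
    """Pair each touch selection with the nearest text input within `window` seconds."""
    selections = [e for e in events if e.mode == "touch"]
    texts = [e for e in events if e.mode in ("voice", "keyboard")]
    merged = []
    for sel in selections:
        near = [t for t in texts if abs(t.timestamp - sel.timestamp) <= window]
        if near:
            best = min(near, key=lambda t: abs(t.timestamp - sel.timestamp))
            merged.append(
                {"image_id": sel.value, "text": best.value, "modes": ["touch", best.mode]}
            )
    return merged

events = [
    InputEvent("touch", "img_001", 10.0),
    InputEvent("voice", "tag this as summer vacation", 10.6),
    InputEvent("voice", "next page", 30.0),
]
merged = integrate(events)
```

The result contains one combined record for `img_001`; the later utterance falls outside the window and is left unpaired. A time window is only one possible synchronization policy.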

The natural language dialogue module (130) is based on natural language dialogue processing technology. More specifically, this technology interprets spoken or text utterances that the user enters as sentences or words, identifies their semantic structure, determines the user's intention from it, and infers the most appropriate system response to carry on the dialogue.

The image management application interworking module (145) is a module that, according to the dialogue processing result of the natural language dialogue module (130), interworks with applications to provide the user with services such as menu control, search term input, and portal integration.

That is, the image management application interworking module (145) operates the image DB built from these image-tag sets and, in response to user requests input through the multimodal interface processing module (140), searches for and provides image information and the tag and memo information related to an image. It also interworks with applications to send image-tag information to a digital photo frame, an electronic diary system, or the like in the home, so that images can be viewed on the digital photo frame and the tags, memos, and diary information related to an image can be searched through the diary system.
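
Searching an image DB by tag can be sketched with an in-memory SQLite database. The table names, columns, and sample rows below are assumptions chosen for the example; the patent does not define the image DB schema.

```python
import sqlite3

# Hypothetical schema standing in for the image DB (150).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE image (id INTEGER PRIMARY KEY, path TEXT, memo TEXT);
    CREATE TABLE image_tag (image_id INTEGER REFERENCES image(id), tag TEXT);
    """
)
conn.execute("INSERT INTO image VALUES (1, 'photos/a.jpg', 'trip with family')")
conn.execute("INSERT INTO image VALUES (2, 'photos/b.jpg', 'birthday party')")
conn.executemany(
    "INSERT INTO image_tag VALUES (?, ?)",
    [(1, "beach"), (1, "vacation"), (2, "birthday")],
)

def search_by_tag(tag):
    """Return (path, memo) for every image carrying the given tag."""
    rows = conn.execute(
        "SELECT i.path, i.memo FROM image i "
        "JOIN image_tag t ON t.image_id = i.id "
        "WHERE t.tag = ? ORDER BY i.id",
        (tag,),
    )
    return rows.fetchall()
```

A request such as `search_by_tag("beach")` returns the path and memo of each matching image, which an application could then forward to a photo frame or diary view.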

FIG. 6 is a block diagram of one embodiment of the multimodal interface (MMI) processing module.

The multimodal interface (MMI) processing module consists of a voice modality component, which includes a voice I/O processor for voice input and output, a runtime framework, an OSGi service bundle, and an HTML modality component for non-voice input and output.

The voice modality component consists of a voice I/O processor for voice input and output between the user and the terminal, RTP (Real-time Transport Protocol) for transmitting the voice stream, an IP-CCS (IP Call Control Server) for controlling IP-based voice calls, an automatic speech recognizer (ASR) that recognizes speech between the person and the terminal, a text-to-speech (TTS) synthesizer that converts text into speech, and a VoiceXML interpreter for interpreting speech recognition and speech synthesis data.

In the present invention, VoiceXML is used for the voice interface, with the interaction manager handling the interaction between the user and the terminal, and XHTML is used for the non-voice interface.

The runtime framework includes an integration module, an EMMA parser, an interaction manager, an SCXML parser, a session data manager (Session/Data Model), a DCI (Delivery Context Interface), and a generation module.

The integration module works with the EMMA parser to integrate modality information from voice, pointing devices, and the like, and sends it to the interaction manager.
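
To make the role of the EMMA parser more concrete, the sketch below reads a small EMMA 1.0 document with Python's standard XML parser and merges the interpretations into one record. The sample document, the application elements `tag` and `image`, and the merge policy are assumptions for illustration; only the EMMA namespace and the `emma:interpretation`, `emma:medium`, `emma:mode`, and `emma:confidence` names come from the W3C EMMA vocabulary.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical EMMA document: one spoken tag and one GUI selection of an image.
SAMPLE = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="voice1" emma:medium="acoustic" emma:mode="voice" emma:confidence="0.8">
    <tag>vacation</tag>
  </emma:interpretation>
  <emma:interpretation id="gui1" emma:medium="tactile" emma:mode="gui" emma:confidence="1.0">
    <image>img_001</image>
  </emma:interpretation>
</emma:emma>"""

def integrate_emma(xml_text):
    """Merge the application payload of every interpretation into one dictionary."""
    root = ET.fromstring(xml_text)
    combined = {"modes": []}
    for interp in root.findall(f"{{{EMMA_NS}}}interpretation"):
        combined["modes"].append(interp.get(f"{{{EMMA_NS}}}mode"))
        for child in interp:
            combined[child.tag] = child.text
    return combined

combined = integrate_emma(SAMPLE)
```

The merged dictionary is the kind of single, integrated input that could be handed to an interaction manager. A fuller treatment would also weigh `emma:confidence` and resolve conflicts between modalities.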

The interaction manager works with the SCXML parser to control and supervise the voice modality component so that the user and the system can interact. It also processes the simple-combination and supplementary-combination multimodal inputs received from the user, and connects to external modules.
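
SCXML describes interaction flow as a state chart. The following sketch is a plain Python stand-in for such a chart, not an SCXML interpreter; the state names, event names, and transition table are hypothetical and chosen only to mirror the tagging dialogue.

```python
# Hypothetical dialogue states and events; a real deployment would describe
# these in an SCXML document processed by the SCXML parser.
TRANSITIONS = {
    ("idle", "image_uploaded"): "recommending",
    ("recommending", "tag_selected"): "storing",
    ("recommending", "memo_entered"): "storing",
    ("storing", "saved"): "idle",
}

class InteractionManager:
    """Minimal state machine that ignores events with no defined transition."""

    def __init__(self):
        self.state = "idle"

    def handle(self, event):
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state

manager = InteractionManager()
trace = [manager.handle(e) for e in ["image_uploaded", "unknown", "tag_selected", "saved"]]
```

The `trace` list records the state after each event, showing that an undefined event leaves the dialogue where it was.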

The session data manager (Session/Data Model) provides and manages the data needed for session management.

The DCI (Delivery Context Interface) is a DB that stores per-user profile information and terminal environment information, using synchronous messaging, to enable more personalized services; it stores device attributes and user preference data.

When information to be delivered to the user is input from the interaction manager, the generation module decides in which mode, such as voice or graphics, to output it, and sends the content to be output to the voice modality component and the HTML modality component.
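
The output-mode decision can be sketched as a small function that consults device attributes and user preferences of the kind the DCI is said to store. The dictionary keys (`voice`, `graphics`, `output_mode`) and the fallback rule are assumptions; the patent does not state how the mode is chosen.

```python
def choose_output_modes(message, device, preferences):
    """Map each chosen output mode to the message, honoring the user's preference if possible."""
    # Modes the device reports as available (hypothetical DCI-style attributes).
    available = [mode for mode in ("voice", "graphics") if device.get(mode, False)]
    preferred = preferences.get("output_mode")
    modes = [preferred] if preferred in available else available
    return {mode: message for mode in modes}

both = choose_output_modes(
    "Tag saved", {"voice": True, "graphics": True}, {"output_mode": "voice"}
)
screen_only = choose_output_modes(
    "Tag saved", {"graphics": True}, {"output_mode": "voice"}
)
```

In this sketch a preferred mode wins when the device supports it; otherwise the message goes to every available mode.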

FIG. 7 is a flowchart illustrating the multimodal interactive image management method according to the present invention, in which tags are recommended and stored after an image is uploaded.

The multimodal interactive image management system (100) receives an image uploaded from the user terminal (step S11).

The image recognition module (110) uses the image classification training model, obtained by training with the image classification training model DB (111), to identify objects included in the image and generate image classification information (step S12).

Based on the classification information of the new image generated by the image recognition module (110), the tagnet management module (120) uses the tag ontology DB (121), built from words associated with each subject, to extract tag candidate recommendation information related to the image classification information (step S13), and recommends the tag candidates for the image so that the user can select from them (step S14).

Based on this tag candidate recommendation information, the natural language dialogue module (130) lets the user select the desired tag information from among the tag candidates of the image through dialogue, or receives from the multimodal interface processing module (140) an additional description or memo about the image that the user entered by voice or non-voice input during the dialogue, understands the user's intention, classifies the semantic structure, and selects the tag best suited to the image, or stores the image and the tag and memo information related to the image in the image DB (150) (step S15).

The multimodal interface processing module (140) receives input from the user by keyboard, touchpad, or voice and interprets the various input means together. Through dialogue conducted by the natural language dialogue module (130), the user selects the desired tag information from among the tag candidates of the image, or an additional description or memo about the image is received from the multimodal interface processing module (140), the user's intention is understood, the semantic structure is classified, and the tag best suited to the image is selected, or the image, its tags, and the associated memo are stored in the image DB (150).

The image management application interworking module (145) operates the image DB (150) in which the image-tag pairs are stored and provides the image information that best matches the user's request, or interworks with applications to send image-tag information to a digital photo frame, an electronic diary system, or the like in the home (step S16), so that images can be viewed on the digital photo frame and the memos, diary entries, tag information, and the like associated with an image can be searched through the diary system (step S17).
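
Steps S12 to S15 can be strung together in a short sketch. The functions `recognize` and `recommend` are fixed stubs standing in for modules (110) and (120), and the dictionary used as the image DB is likewise a placeholder; none of these names or return values come from the patent.

```python
def recognize(image_bytes):
    # Stub for the image recognition module (S12); always reports one subject.
    return "beach"

def recommend(subject):
    # Stub for the tagnet management module (S13).
    return {"beach": ["sea", "sand", "vacation"]}.get(subject, [])

def run_flow(image_id, image_bytes, user_tag, memo, image_db):
    """Classify the image, offer candidates, and store the user's tag and memo."""
    subject = recognize(image_bytes)   # S12
    candidates = recommend(subject)    # S13, offered to the user in S14
    image_db[image_id] = {             # S15
        "subject": subject,
        "tag": user_tag,
        "recommended": user_tag in candidates,
        "memo": memo,
    }
    return image_db[image_id]

db = {}
record = run_flow("img_001", b"\x00", "vacation", "family trip", db)
```

The `recommended` flag simply records whether the stored tag was one of the candidates or was supplied freely by the user during the dialogue.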

FIG. 8 is a flowchart illustrating the functions of the natural language dialogue module.

The natural language dialogue module (130) builds the dialogue model and the dialogue example DB (step S20). When the user speaks during a dialogue (step S21), the module lets the user select tag information for the image, or receives an additional description or memo about the image from the multimodal interface processing module (140), and extracts the semantic structure and dialogue act with a classifier according to the preset dialogue model (step S22). It then selects the tag best suited to the image, or stores the image, its tags, and the associated memo in the image DB (150).

During the dialogue, the natural language dialogue module (130) selects the probabilistically most similar dialogue example from the dialogue example DB (step S23), generates a system utterance from the retrieved result based on a system response template (step S24), and provides it to the user as the system's response (step S25).
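
Steps S23 to S25 can be illustrated with a toy example-based responder. The patent speaks of probabilistic similarity without naming a measure, so the word-overlap (Jaccard) score used here is a substitute chosen for simplicity, and the example utterances, intent labels, and templates are all hypothetical.

```python
# Hypothetical dialogue example DB: (example utterance, intent, response template).
DIALOGUE_EXAMPLES = [
    ("tag this photo as beach", "add_tag", "I added the tag '{slot}'."),
    ("show me photos of beach", "search", "Here are the photos tagged '{slot}'."),
    ("what tags do you recommend", "recommend", "How about these tags: {slot}?"),
]

def jaccard(a, b):
    """Word-overlap similarity between two utterances (assumed measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def respond(utterance, slot):
    """Pick the most similar example (S23) and fill its response template (S24)."""
    _, intent, template = max(
        DIALOGUE_EXAMPLES, key=lambda example: jaccard(utterance, example[0])
    )
    return intent, template.format(slot=slot)

intent, reply = respond("please tag this photo as vacation", "vacation")
```

The returned `reply` is the system utterance that would be presented to the user in step S25. A trained classifier and a larger example set would replace both the similarity function and the three sample entries.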

In the present invention, about 20,000 pairs of dialogue examples are collected for the domain of tagging images or adding memos after an image has been uploaded from a user terminal to the multimodal interactive image management system (100). Their semantic structures and user intentions are annotated manually, a dialogue model is trained on them, and the dialogue examples are built into the dialogue model DB (131). When a user speaks in actual use, the dialogue model is used to classify the semantic structure and the user's intention, the most similar dialogue pair is found in the dialogue model DB (131), and its system response is returned to the user. The present invention thus provides an example-based dialogue management system.

In the present invention, after a user uploads a photo or video to a website, the photo is automatically identified using an image recognition system, a photo-tag DB, and a tagnet, and tags related to the photo or video are automatically recommended for the user to select; alternatively, the user enters tagging information by voice through dialogue, using a multimodal interface that combines speech recognition with a touchpad together with natural language dialogue processing technology. The user's intention information is classified with a trained language understanding model, and tag information is attached to the photo, video, or image. Collective intelligence can thereby be realized to provide an image or video search service.

In addition, the multimodal interactive image management system can be used as the interface for any service that accepts voice and touchpad input at the same time, such as digital photo frames, home network services, IPTV control, robots, portal services, civil document issuing machines, and navigation systems. The technology can also be used in intelligent robots as a multimodal interface module specialized for voice and sound-source processing.

As described above, the method of the present invention can be implemented as a program and stored in computer-readable form on a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.).

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art may modify or change the invention in various ways without departing from the spirit and scope of the invention set forth in the claims below.

FIG. 1 shows the conventional multimodal interaction framework proposed by the W3C.

FIG. 2 is a detailed diagram of the input components of the conventional multimodal interaction framework.

FIG. 3 is a detailed diagram of the output components of the conventional multimodal interaction framework.

FIG. 4 is a screen illustrating a user-centered online image storage and search service.

FIG. 5 is a block diagram of the multimodal interactive image management system according to the present invention.

FIG. 6 is a block diagram of one embodiment of the multimodal interface (MMI) processing module.

FIG. 7 is a flowchart illustrating the multimodal interactive image management method according to the present invention, in which tags are recommended and stored after an image is uploaded.

FIG. 8 is a flowchart illustrating the functions of the natural language dialogue module.

Claims

A multimodal interactive image management system, comprising:

an image recognition module that, when a new image is provided from a user terminal, identifies an object included in the image and generates image classification information classifying the image by subject according to the identification result;

a tagnet management module that generates tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information;

a natural language dialogue module that exchanges query and response information with the user terminal in natural language form and selects a tag using the tag candidate recommendation information;

a multimodal interface processing module that receives the response information from the user terminal through one or more interfaces and integrates it; and

an image DB that stores the image and the selected tag.

The system of claim 1, further comprising:

a dialogue model DB connected to the natural language dialogue module and storing a plurality of examples for determining the semantic structure of the response information and the user's intention; and

a dialogue example DB storing a plurality of dialogue pairs corresponding to the semantic structure of the response information and the user's intention.

The system of claim 2,

wherein the natural language dialogue module classifies the semantic structure of the response information and the user's intention, finds the dialogue pair mapped to that semantic structure and intention, and provides it to the user as the query information.

The system of claim 1,

further comprising an image management application interworking module that searches for and provides an image and the tags associated with the image at the request of the user terminal.

The system of claim 1,

wherein the user terminal is a digital photo frame.

The system of claim 1,

wherein the one or more interfaces include a voice input method and a non-voice input method.

The system of claim 6,

wherein the non-voice input method includes input using a keyboard or a touchpad.

The system of claim 1,

further comprising an image classification training model DB connected to the image recognition module and storing image classification information related to the subjects of a plurality of images.

The system of claim 1,

further comprising a tag ontology DB connected to the tagnet management module and storing the tag examples, which include words related to each subject of the image.

The system of claim 1,

wherein the multimodal interface processing module comprises:

a voice modality component for voice input and output;

an HTML modality component for non-voice input and output; and

a runtime framework that understands and interprets the content of the voice input and output data.

The system of claim 10,

wherein the voice modality component comprises:

a voice I/O processor for voice input and output between the user and the terminal;

an IP-CCS for IP-based voice call input and output from the voice I/O processor;

a speech recognition unit that recognizes the user's voice;

a speech synthesis unit that converts text into speech; and

a VoiceXML interpreter that interprets the input and output data of the speech recognition unit and the speech synthesis unit.

The system of claim 11,

wherein the voice I/O processor and the IP-CCS communicate with each other via RTP.

The system of claim 10,

wherein the runtime framework comprises:

an integration module that integrates the voice input/output data in association with an EMMA parser;

an interaction manager that controls the voice modality unit by interworking the integrated voice input/output data transmitted from the integration module with an SCXML parser;

a session data manager that performs data support and management for session management of the user;

a DCI that stores attributes of the terminal and preference data of the user; and

a generation module that determines whether information to be transmitted from the interaction manager to the user is output in voice mode or graphics mode, and transmits the information to the voice modality unit or the HTML modality unit accordingly.
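The output-mode decision made by the generation module can be illustrated with a minimal Python sketch. The claim does not say how the mode is chosen, so the rule below (user preference first, then terminal capability) is an assumption, and `DeviceContext` and `choose_output_unit` are hypothetical names standing in for the DCI data and the generation module.

```python
from dataclasses import dataclass


@dataclass
class DeviceContext:
    """Hypothetical stand-in for DCI data: terminal attributes and user preference."""
    has_speaker: bool
    has_display: bool
    preferred_mode: str  # "voice" or "graphics"


def choose_output_unit(ctx: DeviceContext) -> str:
    """Pick the modality unit that should render a message (assumed rule)."""
    # Honor the user's preference when the terminal can render it.
    if ctx.preferred_mode == "voice" and ctx.has_speaker:
        return "voice_modality_unit"
    if ctx.preferred_mode == "graphics" and ctx.has_display:
        return "html_modality_unit"
    # Otherwise fall back to whatever the terminal supports.
    if ctx.has_display:
        return "html_modality_unit"
    if ctx.has_speaker:
        return "voice_modality_unit"
    raise ValueError("terminal supports neither voice nor graphics output")
```

With this rule, a display-only terminal such as a silent digital picture frame would receive graphics output even when the stored preference is voice.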

A multimodal interactive image management method, comprising:

(a) receiving an image from a user terminal;

(b) distinguishing, by an image recognition module, objects included in the image and generating image classification information by using the subject classification of a plurality of images stored in an image classification training model DB;

(c) generating, by a tagnet management module, tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information;

(d) generating, by a natural language dialogue module, query information based on the tag candidate recommendation information generated in step (c), and providing the query information to the user terminal through a multimodal interface providing module;

(e) receiving, by the multimodal interface providing module, tag information desired by the user as response information from the user terminal through at least one interface method, and integrating the response information; and

(f) receiving image-related information through the multimodal interface providing module, and storing the image and a tag related to the image in an image DB.
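Steps (b) through (f) above can be sketched as a small Python pipeline. This is only an illustration under stated assumptions: the ontology contents are placeholder data, the classifier and the user dialogue are passed in as plain callables, the image DB is a dictionary, and all function names are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical tag ontology: image subject -> related words (placeholder data).
TAG_ONTOLOGY: Dict[str, List[str]] = {
    "beach": ["sea", "summer", "vacation"],
    "birthday": ["cake", "party", "family"],
}


def recommend_tags(subject: str) -> List[str]:
    """Step (c): look up tag candidates for the classified subject."""
    return TAG_ONTOLOGY.get(subject, [])


def build_query(candidates: List[str]) -> str:
    """Step (d): turn the tag candidates into a question for the user."""
    if not candidates:
        return "Which tag would you like to add to this image?"
    return "Which tag fits this image: " + ", ".join(candidates) + "?"


def tag_image(
    image_id: str,
    classify: Callable[[str], str],
    ask_user: Callable[[str], str],
    image_db: Dict[str, dict],
) -> str:
    """Run steps (b) to (f) for one uploaded image and return the query asked."""
    subject = classify(image_id)          # step (b): image classification
    candidates = recommend_tags(subject)  # step (c): tag candidates
    query = build_query(candidates)       # step (d): query information
    chosen = ask_user(query)              # step (e): user's response
    image_db[image_id] = {"subject": subject, "tags": [chosen]}  # step (f)
    return query
```

For example, calling `tag_image("img1", lambda _: "beach", lambda q: "vacation", db)` with an empty dictionary `db` asks about the three beach-related candidates and stores the tag "vacation" for "img1".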

The method of claim 14,

wherein step (e) comprises:

(e1) receiving, by the natural language dialogue module, the integrated response information, and determining the semantic structure of the response information and the user's intention by using a plurality of examples stored in a conversation model DB; and

(e2) generating the query information by using a plurality of dialogue pairs stored in a conversation example DB, and providing the query information to the user terminal through the multimodal interface providing module.
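Steps (e1) and (e2) describe an example-based approach: the intention is determined from stored examples, and the next query is taken from a stored dialogue pair. A minimal Python sketch follows. The word-overlap matching is a simple substitute chosen for illustration, not the method of the patent, and the example utterances, intent labels, and replies are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical conversation model DB: (example utterance, intention).
DIALOGUE_MODEL: List[Tuple[str, str]] = [
    ("tag it as vacation", "select_tag"),
    ("none of these", "reject_candidates"),
    ("show me more tags", "request_more"),
]

# Hypothetical conversation example DB: intention -> system query.
DIALOGUE_PAIRS: Dict[str, str] = {
    "select_tag": "Tag saved. Would you like to add a memo?",
    "reject_candidates": "What tag would you like to use instead?",
    "request_more": "Here are some more tag candidates.",
    "unknown": "Sorry, could you say that again?",
}


def classify_intent(utterance: str) -> str:
    """Step (e1): pick the intention of the example sharing the most words."""
    words = set(utterance.lower().split())
    best_intent, best_score = "unknown", 0
    for example, intent in DIALOGUE_MODEL:
        score = len(words & set(example.split()))
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


def next_query(utterance: str) -> str:
    """Step (e2): return the query paired with the detected intention."""
    return DIALOGUE_PAIRS[classify_intent(utterance)]
```

An utterance that shares no words with any stored example falls through to the "unknown" pair, which asks the user to repeat the input.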

The method of claim 14,

wherein, in step (e), the at least one interface method comprises a voice method and a non-voice method.

The method of claim 14,

wherein the image selection is input by the non-voice method, and the tag information corresponding to the image is input by the voice method.
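Combining a non-voice image selection with a spoken tag into one piece of information can be sketched in Python as follows. The claims do not specify the integration rule, so the time window used here is an assumption, and `InputEvent` and `integrate` are hypothetical names; the sketch does not model EMMA or any other specific format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InputEvent:
    mode: str         # "touch" (non-voice) or "voice"
    value: str        # image id for touch, recognized text for voice
    timestamp: float  # seconds


def integrate(events: List[InputEvent], window: float = 5.0) -> List[Dict[str, str]]:
    """Pair each touch selection with the next voice input within `window` seconds."""
    commands: List[Dict[str, str]] = []
    selected: Optional[InputEvent] = None
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.mode == "touch":
            selected = ev  # remember the most recent image selection
        elif ev.mode == "voice" and selected is not None:
            if ev.timestamp - selected.timestamp <= window:
                commands.append({"image": selected.value, "tag": ev.value})
            selected = None  # each selection is consumed by one utterance
    return commands
```

Events are sorted by timestamp first, so inputs that arrive out of order from the two modality units are still paired correctly.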

The method of claim 14,

further comprising (g) searching for and providing, by an image management application interworking module, an image and a tag related to the image from the image DB according to the response information.

A recording medium readable by a computer or a digital picture frame, recording a program for realizing, on the computer or digital picture frame:

(a) a function of uploading an image from a user terminal to a multimodal interactive image management system;

(b) a function of an image recognition module distinguishing objects included in the image and generating image classification information by using the subject classification of a plurality of images stored in an image classification training model DB;

(c) a function of a tagnet management module generating tag candidate recommendation information including a plurality of tag examples corresponding to the image classification information;

(d) a function of a natural language dialogue module generating query information based on the tag candidate recommendation information generated in function (c) and providing the query information to the user terminal through a multimodal interface providing module;

(e) a function of receiving tag information desired by the user as response information from the user terminal through at least one interface method, and integrating the response information; and

(f) a function of receiving image-related information through the multimodal interface providing module and storing the image, together with tag and memo information related to the image, in an image DB.