Disclosure of Invention
The embodiment of the application provides an augmented reality interaction method, an augmented reality interaction device, electronic equipment and a computer readable storage medium, which can improve the success rate of augmented reality interaction.
The embodiment of the application provides an augmented reality interaction method which is applied to augmented reality equipment and comprises the following steps:
In response to an interaction event, collecting voice information and eye gaze information of a target user holding the augmented reality device;
determining a target interaction object from an augmented reality scene presented by the augmented reality device according to the voice information, the eye gaze information and the historical behavior information of the target user;
generating a control instruction aiming at the target interaction object according to the voice information;
and controlling the target interaction object according to the control instruction.
Correspondingly, the embodiment of the application also provides an augmented reality interaction device which is applied to the augmented reality equipment, and the device comprises:
the acquisition module is used for responding to the interaction event and acquiring voice information and eye gaze information of a target user holding the augmented reality device;
The determining module is used for determining a target interaction object from an augmented reality scene presented by the augmented reality device according to the voice information, the eye gaze information and the historical behavior information of the target user;
the generation module is used for generating a control instruction aiming at the target interaction object according to the voice information;
and the control module is used for controlling the target interaction object according to the control instruction.
Optionally, in some embodiments of the present application, the determining module includes:
The first screening unit is used for screening target interaction objects from the augmented reality scene according to the voice information;
The second screening unit is used for screening target interaction objects from the augmented reality scene according to the eye gaze information, or according to the voice information and the eye gaze information, if no target interaction object meeting the target interaction condition is screened out;
and the third screening unit is used for locating the target interaction object from the augmented reality scene according to the eye gaze information and the historical behavior information if no target interaction object meeting the target interaction condition is screened out yet.
Wherein, in some embodiments of the application, the third screening unit comprises:
determining a gaze range from the augmented reality scene according to the eye gaze information, wherein the gaze range covers at least two objects to be interacted with;
determining positioning reference information according to the historical behavior information, wherein the positioning reference information comprises at least one of behavior habit information, personal preference information or operation intention information of the target user;
And screening target interaction objects from the at least two objects to be interacted with in the gazing range according to the positioning reference information.
Wherein, in some embodiments of the application, the third screening unit comprises:
determining a gaze pattern corresponding to the eye gaze information;
If the gaze pattern matches a preset pattern, at least one object to be interacted with is screened from the augmented reality scene according to the historical behavior information;
adjusting the layout of each object to be interacted with in the augmented reality scene according to the screened object to be interacted with;
and positioning a target interaction object based on the layout and the eye gaze information.
Wherein, in some embodiments of the application, the second screening unit comprises:
determining a gaze range from the augmented reality scene according to the eye gaze information;
if the gazing range is in a first preset area and the gesture operation of the target user is detected, determining the gesture type of the gesture operation;
and screening target interaction objects from the augmented reality scene according to the gesture types.
Wherein, in some embodiments of the present application, the eye gaze information includes gaze location information and eye action information, and the second screening unit includes:
Determining a gaze range from the augmented reality scene according to the gaze location information;
identifying the action type corresponding to the eye action information;
And determining the gazing range and the target interaction object corresponding to the action type from a mapping relation set, wherein the mapping relation set records the mapping relation among a preset gazing range, a preset action type and a preset interaction object.
Wherein, in some embodiments of the application, the second screening unit comprises:
screening at least two objects to be interacted with from the augmented reality scene according to the voice information;
and screening target interaction objects from the at least two objects to be interacted with according to the eye gaze information.
Wherein, in some embodiments of the application, the determining module comprises:
A determining unit, configured to determine positioning reference information according to the historical behavior information;
a fourth screening unit, configured to screen at least two objects to be interacted with from the augmented reality scene according to the positioning reference information;
The display unit is used for displaying the objects to be interacted with in a second preset area of the augmented reality scene;
The first positioning unit is used for positioning a target interaction object from the at least two objects to be interacted with in the second preset area according to the voice information and the eye gaze information if the eye gaze information characterizes that the eye gaze of the target user reaches the second preset area.
Wherein, in some embodiments of the present application, the eye gaze information includes first stage eye gaze information and second stage eye gaze information, and the determining module includes:
The extraction unit is used for extracting gazing duration information from the first-stage eye gaze information if the first-stage eye gaze information characterizes that the eye gaze of the target user is focused on at least two objects to be interacted with in a stacked occlusion state and the voice information does not clearly identify a target interaction object among the objects to be interacted with;
The unfolding unit is used for unfolding each object to be interacted with if the gazing duration information meets a preset duration threshold value, so as to obtain at least two objects to be interacted with in a diffusion arrangement state;
and the second positioning unit is used for determining a target interaction object from the at least two objects to be interacted with in the diffusion arrangement state according to the second-stage eye gaze information and the historical behavior information.
Wherein, in some embodiments of the application, the generating module comprises:
A scene determination unit configured to determine usage scene information of the augmented reality device;
the recognition unit is used for recognizing the voice information to obtain a voice recognition result;
The action determining unit is used for determining target action information meeting the use scene information according to the voice recognition result;
the first generation unit is used for generating a control instruction aiming at the target interaction object based on the target action information.
Wherein, in some embodiments of the application, the action determining unit comprises:
a mode determining subunit, configured to determine an operation mode of the target interaction object according to the usage scenario information;
And the action determining subunit is used for determining target action information conforming to the running mode according to the voice recognition result.
Wherein, in some embodiments of the present application, the action determining subunit is specifically configured to:
Extracting a target keyword from the voice recognition result based on the operation mode, and determining target action information based on the target keyword;
Or alternatively
And determining action intention information of the voice recognition result through an intention analysis model based on the operation mode, and determining target action information based on the action intention information.
Wherein, in some embodiments of the application, the generating module comprises:
a first eye movement information determining unit, configured to determine a gaze range and eye movement information corresponding to the eye gaze information if a control instruction for the target interactive object is not generated based on the voice information;
And the second generation unit is used for generating a control instruction aiming at the target interaction object according to the gazing range and the action type corresponding to the eye action information.
Wherein, in some embodiments of the application, the generating module comprises:
A second eye movement information determining unit, configured to identify a gesture type corresponding to a gesture operation of the target user if a control instruction for the target interactive object is not generated based on the voice information and the gesture operation of the target user is detected;
And the third generation unit is used for generating a control instruction aiming at the target interaction object based on the gesture type.
Wherein, in some embodiments of the present application, the control instruction includes a first sub-control instruction and a second sub-control instruction, and the generating module includes:
A fourth generating unit, configured to generate a first sub-control instruction for the target interactive object according to the voice information;
And a fifth generating unit, configured to generate a second sub-control instruction for the target interactive object according to the eye gaze information if the first sub-control instruction meets a preset instruction type.
Wherein, in some embodiments of the present application, the first sub-control instruction is for marking, the second sub-control instruction is for moving, and the control module comprises:
the first control unit is used for marking the target interaction object through the first sub-control instruction;
and the second control unit is used for moving the marked target interactive object to a target area of the augmented reality scene through the second sub-control instruction.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the computer program when executed by the processor implements the steps in the above-mentioned augmented reality interaction method.
In a fourth aspect, an embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program when executed by a processor implements the steps in the above-mentioned augmented reality interaction method.
In a fifth aspect, embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations described in the embodiments of the present application.
According to the embodiment of the application, the target interactive object in the augmented reality scene is comprehensively positioned by acquiring the voice information, the eye gaze information and the historical behavior information of the user, and compared with a scheme of positioning the target interactive object based on voice only, the embodiment of the application can improve the accuracy and success rate of positioning the target interactive object, and further improve the success rate of augmented reality interaction.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the application are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the application without making any inventive effort shall fall within the scope of the application.
The embodiment of the application provides an augmented reality interaction method, an augmented reality interaction device, electronic equipment and a computer readable storage medium. Specifically, the embodiment of the application provides an augmented reality interaction device suitable for an electronic device, where the electronic device refers to an augmented reality device with a computing processing unit, such as a head-mounted display or wearable glasses. The augmented reality device includes, but is not limited to: an onboard optical display system, i.e. a head-up display system, applied to vehicles such as aircraft, automobiles and ships, for example an AR-HUD (Augmented Reality Head-Up Display) mounted on an intelligent internet-connected automobile; a handheld mobile device such as a mobile phone or a tablet, as well as a desktop computer or a notebook computer; and a wearable near-eye display system such as a head-mounted display or smart glasses. When the augmented reality device is a wearable head-mounted display or smart glasses, it may be an integrated augmented reality device with a built-in computing processing unit, or a split augmented reality device externally connected to a computing processing unit.
Referring to fig. 1, fig. 1 is a schematic view of a scene in which an augmented reality device (e.g. optical see-through wearable smart AR glasses) executes the augmented reality interaction method according to an embodiment of the present application, where the specific execution process of the augmented reality device executing the augmented reality interaction method is as follows:
The augmented reality device 10 collects voice information and eye gaze information of a target user holding the augmented reality device 10 in response to an interaction event, determines a target interaction object from an augmented reality scene presented by the augmented reality device 10 according to the voice information, the eye gaze information and historical behavior information of the target user, generates a control instruction for the target interaction object according to the voice information, and controls or manipulates the target interaction object according to the control instruction.
Wherein the target interactive object is one or more of the interactive objects, which are objects in the augmented reality scene that the user desires to interact with, including but not limited to applications, pictures, videos, controls, text, or areas in the augmented reality scene for presenting virtual content, etc.
It can be understood that, by acquiring the voice information, the eye gaze information and the historical behavior information of the user, the embodiment of the application comprehensively locates the target interactive object in the augmented reality scene, and compared with the scheme of simply locating the target interactive object based on voice, the embodiment of the application can improve the accuracy and success rate of locating the target interactive object, thereby improving the success rate of the augmented reality interaction.
The following will describe in detail. It should be noted that the following description order of embodiments is not a limitation of the priority order of embodiments.
Referring to fig. 2, fig. 2 is a flow chart of an augmented reality interaction method according to an embodiment of the application. Although a logical order is depicted in the flowchart, in some cases the steps shown or described may be performed in an order different than depicted in the figures. Specifically, the augmented reality interaction method is applied to an augmented reality device, and the specific flow of the method is as follows:
101. in response to an interaction event, voice information and eye gaze information of a target user holding the augmented reality device are collected.
It should be noted that the interaction event is an event generated according to an interaction requirement. In the embodiment of the present application, this event serves as the trigger for collecting information, locating the target interaction object and interacting with the target interaction object, and controls the start of the whole process; that is, when the interaction event is detected, collection of the user's voice information and eye gaze information is triggered.
Optionally, in the embodiment of the present application, the interaction event may be generated through voice or through a control. For example, when the user produces an interaction-type voice, such as "navigate to XX place" or "display unread mail", or when the user clicks the "interaction" control, the interaction event is generated.
Of course, in the embodiment of the present application, the interaction event may also be generated by a specific user gesture, a specific eye action of the user (such as continuous blinking), or after an application program receives information to be processed, which is not limited in the embodiment of the present application. It should be understood that any action or operation that instructs the augmented reality device to perform augmented reality interaction should be considered a way or strategy for generating the interaction event in the embodiment of the present application.
When the augmented reality device collects the voice information of the user, whether a voice in the environment is the voice of the user can be judged by, for example, determining the position of the sound source, so that the user's voice is collected accurately. For example, when the augmented reality device is a head-mounted display or wearable glasses, a voice whose sound source is located 5-20 cm below the augmented reality device may be considered to be the voice of the user.
The eye gaze information includes gaze point information of eyes, eye movement information, and the like, the gaze point information includes gaze position information, gaze direction information, and the like, and the eye movement information includes behavior information of eyes, such as blinking, closing or opening of eyes, and the like. In the embodiment of the application, the eye gaze information of the user wearing the augmented reality device can be acquired through an eye tracker configured in the augmented reality device.
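As an illustration only, the eye gaze information described above can be represented by a simple data structure such as the following Python sketch; the field names and their grouping are assumptions for the sketch, not a prescribed format.

```python
# Illustrative structure for collected eye gaze information (assumed field names).
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GazePoint:
    position: Tuple[float, float]          # gaze position in scene coordinates
    direction: Tuple[float, float, float]  # gaze direction vector

@dataclass
class EyeGazeInfo:
    gaze_point: GazePoint                  # gaze point information
    eye_action: Optional[str] = None       # e.g. "blink", "close", "open"
    dwell_s: float = 0.0                   # how long the gaze has dwelt, in seconds

sample = EyeGazeInfo(GazePoint((0.4, 0.6), (0.0, 0.0, -1.0)), eye_action="blink", dwell_s=1.2)
print(sample.gaze_point.position, sample.eye_action)
```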
It should be noted that, in the embodiment of the present application, the collection of the eye gaze information and the voice information is performed at substantially the same time, and it may be understood that, based on the requirement of accuracy or time difference, the collection time of the eye gaze information and the voice information may also be set within a certain time error range, that is, based on the requirement of collection, a certain error may be allowed in the collection time of the eye gaze information and the voice information.
102. And determining a target interaction object from an augmented reality scene presented by the augmented reality device according to the voice information, the eye gaze information and the historical behavior information of the target user.
The augmented reality scene is formed by combining a virtual scene provided by the augmented reality device with a real scene; for example, the augmented reality scene is obtained by superimposing the virtual scene on the real scene.
Wherein the target interactive object is one or more of the interactive objects, which are objects in the augmented reality scene that the user desires to interact with, including but not limited to applications, pictures, videos, controls, text, physical smart devices, or areas in the augmented reality scene for presenting virtual content, etc.
The historical behavior information is behavior information of the target user at a historical time, and reflects behavior actions of the user at the historical time (including a period of time in the past), such as behaviors of eating, social interaction, learning, traveling, working, using a smart device, and the like. Because the behavior of the user generally has a certain rule or habit, the historical behavior information can be used as a reference to analyze the object which the user expects to interact with currently, so that the accuracy of the target interaction object is improved.
Correspondingly, on the basis of the voice information and the historical behavior information, the eye gaze information is combined to position the target interactive object, so that the accuracy of positioning the target interactive object is improved.
103. And generating a control instruction aiming at the target interaction object according to the voice information.
It can be understood that, by recognizing the voice information, keywords can be extracted from the voice to obtain the indication or control action for interacting with the interaction object, so as to generate the control instruction. For example, a control instruction of "open" is extracted from the voice information "open map navigation", and a control instruction of "turn off" is extracted from the voice information "turn off the television".
It may be understood that, in the embodiment of the present application, when the control instruction for the target interaction object is generated, the generated control instruction is further constrained by the object type or state of the target interaction object. For example, if the target interaction object is an object in the real scene, whether the object can be controlled needs to be analyzed: a real cloud, for instance, cannot be moved, so a movement control instruction cannot be generated for it. Likewise, if the target interaction object is an application program that cannot be edited, an editing control instruction cannot be generated for it.
104. And controlling the target interaction object according to the control instruction.
Accordingly, after the control instruction for the target interaction object is generated, the target interaction object can be controlled according to the control instruction, so as to realize interaction with the target interaction object.
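For illustration, the following is a minimal, self-contained Python sketch of the flow of steps 101 to 104; all class names, fields and selection rules are simplified assumptions rather than the actual implementation.

```python
# Minimal sketch of steps 101-104; names and rules are illustrative placeholders.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    actions: set = field(default_factory=set)  # actions this object supports

    def apply(self, instruction: str) -> None:
        print(f"{self.name}: executing '{instruction}'")

def determine_target_object(scene, voice, gaze, history):
    # Prefer an object explicitly named in the voice; otherwise fall back to the
    # gazed object; otherwise pick the object the user historically uses most.
    named = [o for o in scene if o.name in voice]
    if len(named) == 1:
        return named[0]
    gazed = [o for o in scene if o.name == gaze.get("gazed_object")]
    if len(gazed) == 1:
        return gazed[0]
    return max(scene, key=lambda o: history.get(o.name, 0))

def generate_control_instruction(voice, target):
    # Only generate an instruction the target object can actually execute.
    for action in sorted(target.actions):
        if action in voice:
            return action
    return "select"

scene = [SceneObject("mail", {"open", "mark"}), SceneObject("music", {"open", "play"})]
voice = "open this"                              # step 101: collected voice information
gaze = {"gazed_object": "mail"}                  # step 101: collected eye gaze information
history = {"mail": 12, "music": 3}               # historical behavior (usage counts)

target = determine_target_object(scene, voice, gaze, history)   # step 102
instruction = generate_control_instruction(voice, target)       # step 103
target.apply(instruction)                                       # step 104: "mail: executing 'open'"
```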
In summary, by acquiring the voice information, the eye gaze information and the historical behavior information of the user, and comprehensively positioning the target interactive object in the augmented reality scene, compared with a scheme of positioning the target interactive object based on voice only, the embodiment of the application can improve the accuracy and success rate of positioning the target interactive object, and further improve the success rate of augmented reality interaction.
By combining the eye gaze information and the historical behavior information of the user, the method and the device can reduce the difficulty of the user's voice expression: the user is not limited to specific expressions, and even when the voice information does not name a specific interaction object, the eye gaze information and the historical behavior information can assist in locating the target interaction object.
Therefore, in the embodiment of the present application, the voice information of the user may be analyzed first. If the voice information clearly identifies the object to be interacted with, the target interaction object may be located directly based on the voice information. If the target interaction object cannot be located from the voice information, it may be located by combining the eye gaze information, by combining the voice information with the eye gaze information, or by combining the voice information with both the eye gaze information and the historical behavior information. That is, optionally, in some embodiments of the present application, the step of "determining a target interaction object from the augmented reality scene presented by the augmented reality device according to the voice information, the eye gaze information and the historical behavior information of the target user" includes:
screening target interaction objects from the augmented reality scene according to the voice information;
If no target interaction object meeting the target interaction condition is screened out, screening the target interaction object from the augmented reality scene according to the eye gaze information, or screening the target interaction object from the augmented reality scene according to the voice information and the eye gaze information;
And if no target interaction object meeting the target interaction condition is screened out yet, positioning the target interaction object from the augmented reality scene according to the eye gaze information and the historical behavior information.
It can be appreciated that if the target interaction object is located directly from the voice information, the eye gaze information and the historical behavior information do not need to be used, which reduces the complexity of the interaction processing. For example, for the voice "mark the mail application", the target interaction object is clearly the mail application; accordingly, the mail application can be marked and displayed in an area that is easier for the user to browse, for example the middle of the augmented reality scene.
When the target interaction object cannot be located in this way, combining the eye gaze information and the historical behavior information improves the success rate of locating the target interaction object and thus the success rate of the interaction.
The target interaction condition is a condition that the interaction object needs to satisfy, for example, that the target interaction object is unique, or that its state indicates that it can be interacted with, marked or operated.
The target interaction condition may also be determined based on factors such as weather, time, location, started application, etc. of the augmented reality device in the current use scenario. That is, the target interaction conditions corresponding to different usage scenarios may be different.
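The three-stage fallback described above can be sketched as follows; the object model and the target interaction condition used here (exactly one interactable candidate) are simplifying assumptions for illustration.

```python
# Sketch of the three-stage fallback screening: voice first, then gaze
# (optionally combined with voice), then gaze plus historical behavior.
def meets_target_condition(candidates):
    return len(candidates) == 1 and candidates[0].get("interactable", True)

def screen_target_object(scene_objects, voice, gaze, history):
    # Stage 1: screen by the voice information alone.
    candidates = [o for o in scene_objects if o["name"] in voice]
    if meets_target_condition(candidates):
        return candidates[0]
    # Stage 2: screen by the eye gaze information (objects in the gazed region),
    # intersected with the voice candidates when that intersection is non-empty.
    in_gaze = [o for o in scene_objects if o["region"] == gaze["region"]]
    narrowed = [o for o in in_gaze if o in candidates] or in_gaze
    if meets_target_condition(narrowed):
        return narrowed[0]
    # Stage 3: within the gaze range, rank the remaining candidates by historical behavior.
    if not narrowed:
        return None
    return max(narrowed, key=lambda o: history.get(o["name"], 0))
```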
It may be appreciated that, since the eye gaze may locate a plurality of interaction objects, in the embodiment of the present application the target interaction object that the user is most likely to interact with may be screened out of the plurality of interaction objects through the historical behavior information. That is, optionally, in some embodiments of the present application, the step of "locating the target interaction object from the augmented reality scene according to the eye gaze information and the historical behavior information" includes:
determining a gaze range from the augmented reality scene according to the eye gaze information, wherein the gaze range covers at least two objects to be interacted with;
determining positioning reference information according to the historical behavior information, wherein the positioning reference information comprises at least one of behavior habit information, personal preference information or operation intention information of the target user;
And screening target interaction objects from the at least two objects to be interacted with in the gazing range according to the positioning reference information.
For example, when there are many interaction objects in the augmented reality scene and they are densely distributed, the gaze range obtained from the eye gaze may cover a plurality of interaction objects; likewise, when the user's gaze is not concentrated or the gaze range changes within a short time, a plurality of interaction objects may be located by the eye gaze information. In this case, the target interaction object that the user is most likely to interact with can be screened out of the plurality of interaction objects through the user's historical behavior information. For example, when three interaction objects are located based on the eye gaze information, one target interaction object can be screened out of the three according to the historical behavior information, such as the interaction object that the user habitually uses most often at the current time and place.
The behavior habit information is habit data of the user obtained from the user's historical behavior; the personal preference information is personalized preference data obtained by analyzing the user's information; and the operation intention information is intention information obtained by analyzing a series of user behaviors before the interaction behavior is triggered. For example, if the user established a Bluetooth connection with a Bluetooth device (such as a Bluetooth headset) shortly before the interaction behavior was triggered, this indicates the user's intention to control the Bluetooth device, and the Bluetooth device can then be screened out of the three interaction objects as the target interaction object.
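A minimal sketch of how the positioning reference information might rank the objects covered by the gaze range is given below; the fields and weights are illustrative assumptions.

```python
# Illustrative scoring of objects inside the gaze range using positioning reference
# information; operation intention outweighs personal preference and behavior habit.
def score_by_reference(name, reference):
    habit = reference.get("habit_counts", {}).get(name, 0)
    preferred = 2 if name in reference.get("preferred", set()) else 0
    intended = 5 if name == reference.get("intended_object") else 0
    return intended + preferred + habit

def pick_from_gaze_range(names_in_gaze, reference):
    return max(names_in_gaze, key=lambda n: score_by_reference(n, reference))

names_in_gaze = ["bluetooth_headset", "mail", "music"]
reference = {
    "habit_counts": {"music": 3},            # behavior habit information
    "preferred": {"mail"},                   # personal preference information
    "intended_object": "bluetooth_headset",  # e.g. a Bluetooth connection was just established
}
print(pick_from_gaze_range(names_in_gaze, reference))  # bluetooth_headset
```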
Optionally, when positioning the target interactive object based on the eye gaze information, a gaze pattern of the eye gaze information may be analyzed, and positioning of the target interactive object is performed based on the gaze pattern, that is, optionally, in some embodiments of the present application, the step of "positioning the target interactive object from the augmented reality scene according to the eye gaze information and the historical behavior information" includes:
determining a gaze pattern corresponding to the eye gaze information;
If the gaze pattern matches a preset pattern, at least one object to be interacted with is screened from the augmented reality scene according to the historical behavior information;
adjusting the layout of each object to be interacted with in the augmented reality scene according to the screened object to be interacted with;
and positioning a target interaction object based on the layout and the eye gaze information.
It will be appreciated that gaze patterns include long-term concentration, short-term browsing, etc., and may be distinguished based on the duration of the continuous gaze, the degree of eye focus, etc. In the embodiment of the present application, each gaze pattern corresponds to an adjustment manner of an object to be interacted with, including but not limited to, adjusting a layout, adjusting a display depth, adjusting a size, adjusting a color, and the like. The adjustment manner corresponding to each gaze pattern may be predefined, for example, configured one by one based on habits of the user.
For example, when it is determined from the eye gaze information that the user has been focusing for a long time, it can be inferred that the user wants to adjust the layout of the interaction objects, so the layout of the interaction objects can be adjusted in a preset manner, and the target interaction object is then located among the interaction objects after the layout adjustment.
For another example, when it is analyzed that the user wants to adjust the layout of the interactive objects, a plurality of objects to be interacted with may be screened out from the augmented reality scene based on the historical behavior information of the user, and then local adjustment of the distribution of the interactive objects is performed based on the objects to be interacted with, for example, the objects to be interacted with are adjusted to the middle position of the augmented reality scene.
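The gaze-pattern branch can be sketched as follows; the pattern thresholds and the "move to the centre" layout adjustment are assumptions for illustration.

```python
# Sketch: a sustained, focused gaze matches a preset pattern, candidate objects are
# screened from historical behavior, and the layout is adjusted before the final
# gaze-based selection.
def matches_preset_pattern(gaze):
    return gaze["duration_s"] >= 2.0 and gaze["focus"] >= 0.8   # "long-term concentration"

def adjust_layout(scene_objects, gaze, history, slots=3):
    if not matches_preset_pattern(gaze):
        return scene_objects                 # short browsing: leave the layout unchanged
    # Move the historically most-used objects to the centre so the subsequent
    # gaze can resolve a single target more easily.
    frequent = sorted(scene_objects, key=lambda o: -history.get(o["name"], 0))[:slots]
    for i, obj in enumerate(frequent):
        obj["position"] = ("centre", i)
    return scene_objects

scene = [{"name": "mail"}, {"name": "music"}, {"name": "browser"}, {"name": "notes"}]
adjust_layout(scene, {"duration_s": 2.5, "focus": 0.9}, {"mail": 9, "notes": 4, "music": 2})
print([o for o in scene if "position" in o])   # mail, notes and music moved to the centre
```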
Optionally, in the embodiment of the present application, a specific interaction object may be further located according to a position of eye gaze in combination with a gesture of a user, that is, optionally, in some embodiments of the present application, the step of "screening a target interaction object from the augmented reality scene according to the eye gaze information" includes:
determining a gaze range from the augmented reality scene according to the eye gaze information;
if the gazing range is in a first preset area and the gesture operation of the target user is detected, determining the gesture type of the gesture operation;
and screening target interaction objects from the augmented reality scene according to the gesture types.
For example, a correspondence between a specific gaze location, a specific gesture type and a specific interaction object may be configured; when it is detected that the user gazes at the specific location and produces the corresponding gesture type, the specific target interaction object can be located directly. This correspondence may also be obtained through configuration predefined by the user. For example, if the user glances at a location twice in quick succession and draws a circle with a finger, a specific application is taken as the target interaction object.
The gesture type can be obtained by extracting and analyzing feature points of the acquired image aiming at the gesture operation, for example, the gesture type corresponding to the gesture operation is analyzed through an image recognition algorithm or a machine learning model.
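A sketch of this gaze-plus-gesture screening is shown below; the preset area name and the gesture-to-object mapping entries are illustrative and would in practice be configured by the user.

```python
# Sketch: when the gaze range falls inside the first preset area and a gesture
# is detected, the gesture type selects the target interaction object.
FIRST_PRESET_AREA = "top_left"
GESTURE_TARGET_MAP = {
    "circle": "mail",          # e.g. two quick glances plus drawing a circle
    "pinch":  "browser",
    "swipe":  "music_player",
}

def screen_by_gaze_and_gesture(gaze_area, gesture_type):
    if gaze_area != FIRST_PRESET_AREA or gesture_type is None:
        return None
    return GESTURE_TARGET_MAP.get(gesture_type)

print(screen_by_gaze_and_gesture("top_left", "circle"))  # mail
```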
Optionally, in an embodiment of the present application, the target interaction object may also be located by an eye gaze range and an eye action, that is, optionally, in some embodiments of the present application, the eye gaze information includes gaze location information and eye action information, and the step of "screening the target interaction object from the augmented reality scene according to the eye gaze information" includes:
Determining a gaze range from the augmented reality scene according to the gaze location information;
identifying the action type corresponding to the eye action information;
And determining the gazing range and the target interaction object corresponding to the action type from a mapping relation set, wherein the mapping relation set records the mapping relation among a preset gazing range, a preset action type and a preset interaction object.
For example, when a user gazes at a specific area of an augmented reality scene and eyes generate blink motions, a specific target interaction object can be directly positioned.
Likewise, the correspondence between each gaze range or gaze area, action type (eye) and each interaction object may be customized based on the needs or habits of the user.
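The mapping relation set can be sketched as a simple lookup table, as below; the entries are examples only and could be customized per user.

```python
# Sketch of the mapping relation set: each entry maps a preset gaze range and a
# preset eye action type to a preset interaction object.
MAPPING_SET = {
    ("upper_right", "blink_twice"): "notifications",
    ("centre",      "blink_twice"): "mail",
    ("centre",      "long_close"):  "music_player",
}

def locate_by_gaze_and_eye_action(gaze_range, action_type):
    return MAPPING_SET.get((gaze_range, action_type))

print(locate_by_gaze_and_eye_action("centre", "blink_twice"))  # mail
```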
Optionally, in the embodiment of the present application, when the voice information is ambiguous, for example, when the target interactive object cannot be uniquely located, the target interactive object may be selected from a plurality of interactive objects located by the voice information through the eye gaze information, that is, optionally, in some embodiments of the present application, the step of "selecting the target interactive object from the augmented reality scene according to the voice information and the eye gaze information" includes:
screening at least two objects to be interacted with from the augmented reality scene according to the voice information;
and screening target interaction objects from the at least two objects to be interacted with according to the eye gaze information.
For example, when the user voice is "play music", if a plurality of music player applications are configured in the augmented reality scene, the music player gazed at by the eyes of the user is taken as the target interaction object.
Correspondingly, when the target interaction object is located with the aid of the eye gaze information, the content and form of the voice can be simplified, making the voice more flexible. For example, when the user gazes at an application icon and says "open" or "start", the corresponding application can be opened; the user does not need to speak the specific content, such as "open the music" or "open the reader", the step of manually clicking or touching the application icon is eliminated, and the voice is shortened.
For another example, in the augmented reality scene, the user looks at a control element (e.g. a button or slider) and then says "increase" or "decrease", and the system adjusts the state of the control element according to the voice command. When the user looks at a piece of text and says "copy" or "paste", the system performs the corresponding text operation, providing more convenient selection and editing functions in a text processing application. When the user looks at a song in a music player and says "play" or "pause", the system executes the corresponding music playback control command. Within an application, the user looks at a specific functional area (e.g. a drawing toolbar) and speaks a related command (e.g. "draw a circle"), and the system performs the corresponding operation in the application. The user looks at a smart home device (e.g. a smart light) and says "turn on" or "turn off", and the system triggers the corresponding smart device to operate. When the user looks at a link in a web page and says "open", the system opens the corresponding link in the browser, providing a more intuitive way of navigating web pages, and so on.
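The "voice narrows, gaze decides" case (for example, "play music" with several music players in the scene) can be sketched as follows; the object fields are assumptions.

```python
# Sketch: the voice screens a set of candidates, the gazed candidate is the target.
def screen_voice_then_gaze(scene_objects, voice, gazed_name):
    candidates = [o for o in scene_objects if o["category"] in voice]
    if len(candidates) == 1:
        return candidates[0]
    gazed = [o for o in candidates if o["name"] == gazed_name]
    return gazed[0] if gazed else None

scene = [
    {"name": "player_a", "category": "music"},
    {"name": "player_b", "category": "music"},
    {"name": "mail_app", "category": "mail"},
]
print(screen_voice_then_gaze(scene, "play music", "player_b"))  # player_b is selected
```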
Optionally, in the embodiment of the present application, after the interaction event is detected, a plurality of objects to be interacted with may be screened according to the historical behavior information, and a target interaction object is then located from the plurality of objects to be interacted with based on the voice information and the eye gaze information. That is, optionally, in some embodiments of the present application, the step of "determining a target interaction object from the augmented reality scene presented by the augmented reality device according to the voice information, the eye gaze information and the historical behavior information of the target user" includes:
Determining positioning reference information according to the historical behavior information;
According to the positioning reference information, at least two objects to be interacted with are screened from the augmented reality scene;
displaying the objects to be interacted with in a second preset area of the augmented reality scene;
And if the eye gaze information characterizes that the eye gaze of the target user is focused on the second preset area, positioning a target interaction object from the at least two objects to be interacted with in the second preset area according to the voice information and the eye gaze information.
The second preset area is a predetermined area; for example, an area that the user can browse easily is used as the second preset area. Screening a plurality of objects to be interacted with according to the historical behavior information realizes a preliminary selection of interaction objects based on the user's habits, preferences or intentions, and displaying them in the second preset area facilitates the user's choice. For example, when the augmented reality scene contains a large number of objects to be interacted with, this preliminary selection speeds up the user's locating of the target interaction object, since the user does not need to browse the interaction objects in the augmented reality scene one by one.
For example, by analyzing the historical behavior information of the user, the system can learn what applications the user typically uses in the morning. If the user typically uses a mail application in the morning, the system is more likely to associate the user's gaze focus with the mail application in the morning, e.g. to show the mail application in the second preset area, allowing the user to gaze at the mail application faster and more conveniently. Similarly, the places where the user interacted at historical times can be analyzed from the historical behavior information, and the user's interaction habits can be determined based on those places, for example which applications the user often uses at which places.
As another example, the current time and date can be considered to learn the likely activities and needs of the user. For example, on a Monday morning, the user may be more likely to use a calendar application or the like. Correspondingly, different behavior habits are recorded at different times of the week.
For another example, if a Bluetooth headset or a smart watch was connected some time before the interaction event occurs and the user is exercising, the system may prioritize applications related to health and exercise, and so on.
Accordingly, the activity of the user on the social media may also be analyzed from the historical behavior information to learn about the user's recent interests and topics, or to analyze the user's past notification history to learn about the user's response to a specific notification, e.g., if the user frequently clicks on a mail notification, the mail application (corresponding application icon) is preferentially presented in the second preset area.
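A sketch of this historical-behavior pre-selection for the second preset area is given below; the scoring features (time of day, connected devices, notification habits) and weights are assumptions for illustration.

```python
# Sketch: score interaction objects from historical behavior and return the top
# candidates to be displayed in the second preset area.
from datetime import datetime

def preselect_for_second_area(object_names, history, now=None, top_k=3):
    now = now or datetime.now()
    def score(name):
        s = history.get("usage_counts", {}).get(name, 0)
        if name in history.get("morning_apps", set()) and now.hour < 12:
            s += 5   # e.g. mail is usually used in the morning
        if history.get("bluetooth_connected") and name in history.get("device_apps", set()):
            s += 5   # e.g. a headset or watch was just connected
        if name in history.get("frequently_clicked_notifications", set()):
            s += 3   # e.g. the user often opens mail notifications
        return s
    return sorted(object_names, key=score, reverse=True)[:top_k]

apps = ["mail", "calendar", "music", "fitness"]
history = {"usage_counts": {"music": 2}, "morning_apps": {"mail", "calendar"},
           "device_apps": {"fitness"}, "bluetooth_connected": True,
           "frequently_clicked_notifications": {"mail"}}
print(preselect_for_second_area(apps, history, now=datetime(2024, 1, 1, 9)))
# ['mail', 'calendar', 'fitness']
```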
Optionally, in the embodiment of the present application, interaction objects in the augmented reality scene may be displayed in a stacked, occluding manner. Two-stage actions may therefore be configured to locate the target interaction object: for example, a first voice command expands the stacked objects, and a second voice command then locates the target interaction object among the expanded objects.
Accordingly, in the embodiments of the present application, the two-stage actions may be implemented by eye actions, for example, two-stage control actions are extracted from eye gaze information, and the target interactive object is located by the two-stage control actions, that is, optionally, in some embodiments of the present application, the eye gaze information includes first-stage eye gaze information and second-stage eye gaze information, and the step of determining the target interactive object from the augmented reality scene presented by the augmented reality device according to the voice information, the eye gaze information, and the historical behavior information of the target user includes:
if the first-stage eye gaze information characterizes that the eye gaze of the target user is focused on at least two objects to be interacted with in a stacked occlusion state, and the voice information does not clearly identify a target interaction object among the objects to be interacted with, extracting gazing duration information from the first-stage eye gaze information;
If the gazing duration information meets a preset duration threshold, expanding each object to be interacted with, so as to obtain at least two objects to be interacted with in a diffusion arrangement state;
and determining a target interaction object from the at least two objects to be interacted with in the diffusion arrangement state according to the second-stage eye gaze information and the historical behavior information.
Stacked occlusion means that, when a plurality of interaction objects are presented at the same position but at different depths, a front interaction object occludes a rear interaction object, or an interaction object of the virtual scene occludes an interaction object in the real scene. Expanding means that the plurality of interaction objects are spread out, so that the user can conveniently screen the target interaction object from the scattered plurality of interaction objects.
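The two-stage handling of stacked, occluding objects can be sketched as follows; the dwell threshold and the slot-based expanded layout are assumptions.

```python
# Sketch: a first-stage gaze that dwells on the stack long enough expands it, and
# the second-stage gaze (with historical behavior as a tie-breaker) picks the target.
DWELL_THRESHOLD_S = 1.5

def resolve_stacked_target(stacked_names, first_gaze, second_gaze, history):
    if first_gaze["dwell_s"] < DWELL_THRESHOLD_S:
        return None                                   # keep the stack collapsed
    # Expand: give every object its own slot so they no longer occlude each other.
    slots = {i: name for i, name in enumerate(stacked_names)}
    gazed_name = slots.get(second_gaze.get("slot"))
    if gazed_name is not None:
        return gazed_name
    return max(stacked_names, key=lambda n: history.get(n, 0))

print(resolve_stacked_target(["doc_a", "doc_b", "doc_c"],
                             {"dwell_s": 2.0}, {"slot": 1}, {"doc_c": 7}))  # doc_b
```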
Optionally, in the embodiment of the present application, multiple operation modes may be configured based on the user's preferences and personalization, and which mode to enter is determined by analyzing the user's voice or eye actions; for example, the office mode is entered through a voice or blink action, and accordingly the interaction objects related to work are preferentially displayed.
In the embodiment of the application, the emotion or mood of the user can also be analyzed, and the accurate target interaction object can be located in combination with the emotion, mood and the like. For example, the voice "I'm hungry" indicates that the user wants to order food, whereas "Aren't you hungry?" is merely a question to someone else and not an explicit instruction to order.
It can be understood that in the embodiment of the present application, the control instruction for the target interaction object may be explicitly determined by voice, for example, voice such as opening, closing, marking, heating or cooling, and the control instruction may be explicitly obtained.
However, for the same interaction object and the same voice indication, different control instructions may be generated in different usage scenarios. For example, an application may have both an office mode and an entertainment mode: when the user is at an office location, the office mode is more likely to be entered, and when the user is at home, the entertainment mode is more likely to be entered. That is, optionally, in some embodiments of the present application, the step of "generating a control instruction for the target interaction object according to the voice information" includes:
determining usage scenario information of the augmented reality device;
Recognizing the voice information to obtain a voice recognition result;
determining target action information meeting the use scene information according to the voice recognition result;
And generating a control instruction for the target interaction object based on the target action information.
For example, when the user is at an office location and the voice recognition result is "open", a control instruction for entering the office mode is generated; when the user is at home and the voice recognition result is "open", a control instruction for entering the entertainment mode is generated.
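A sketch of scenario-dependent instruction generation is shown below; the scenario table is illustrative.

```python
# Sketch: the same recognized word "open" yields different control instructions
# depending on the usage scenario of the augmented reality device.
SCENARIO_ACTIONS = {
    ("office", "open"): "enter_office_mode",
    ("home",   "open"): "enter_entertainment_mode",
}

def generate_instruction(usage_scenario, recognized_text):
    for word in recognized_text.lower().split():
        action = SCENARIO_ACTIONS.get((usage_scenario, word))
        if action is not None:
            return action
    return None

print(generate_instruction("office", "open"))  # enter_office_mode
print(generate_instruction("home", "open"))    # enter_entertainment_mode
```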
Accordingly, for different operation modes of the interactive object, the same voice will also generate different control instructions, so that the control instruction for the target interactive object may be determined based on the operation mode of the target interactive object, that is, optionally, in some embodiments of the present application, the step of "generating the control instruction for the target interactive object according to the voice information" includes:
determining an operation mode of the target interactive object according to the use scene information;
And determining a control instruction conforming to the running mode according to a voice recognition result corresponding to the voice information.
For example, when the augmented reality device is in the office mode, the voice indication "open" may launch an office application, dial a customer's phone, send a mail to a customer, and so on; when the augmented reality device is in the entertainment mode, the voice indication "open" may play an entertainment video, launch a game application, and so on.
It may be appreciated that, because different operation modes use different keywords or indication information, different keywords need to be extracted from the voice recognition result for different operation modes. That is, optionally, in some embodiments of the present application, the step of "determining target action information conforming to the operation mode according to the voice recognition result" includes:
Extracting a target keyword from the voice recognition result based on the operation mode, and determining target action information based on the target keyword;
Or alternatively
And determining action intention information of the voice recognition result through an intention analysis model based on the operation mode, and determining target action information based on the action intention information.
For example, for the speech recognition result "Wuhan City Changjiang Bridge", the office mode extracts the keywords "Wuhan City" and "Changjiang Bridge", while the entertainment mode segments the same phrase into the keywords "Wuhan", "Mayor" and "Jiang Daqiao". That is, the embodiment of the application extracts keywords according to the operation mode, which can improve the accuracy of keyword extraction and the match between the control instruction and the actual scene mode.
Accordingly, intent analysis or the processing of the intent analysis model can be understood with reference to the above description of keyword extraction; that is, because the points on which the intent analysis focuses differ between operation modes, the generated action information also differs.
In the embodiment of the application, the target action information is mainly determined based on the interactable content of the target interaction object, for example marking, creating or deleting a mail; adding a bookmark to, deleting a bookmark from, or color-marking a text; expanding, dispersing or re-aggregating stacked occluding interaction objects; and moving, deleting or merging application icons.
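Mode-dependent keyword extraction for determining the target action information can be sketched as follows; the per-mode vocabularies are assumptions, and a real system could instead feed the recognition result to an intent analysis model.

```python
# Sketch: extract target keywords from the voice recognition result according to
# the operation mode of the target interaction object.
MODE_VOCABULARY = {
    "office":        {"mark", "mail", "document", "meeting", "new", "delete"},
    "entertainment": {"play", "pause", "music", "video", "game"},
}

def extract_target_action(recognized_text, operation_mode):
    words = recognized_text.lower().replace(",", " ").split()
    keywords = [w for w in words if w in MODE_VOCABULARY.get(operation_mode, set())]
    return keywords or None

print(extract_target_action("mark the mail", "office"))           # ['mark', 'mail']
print(extract_target_action("play some music", "entertainment"))  # ['play', 'music']
```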
Optionally, if no explicit control instruction is obtained through the voice information, the control instruction may be generated based on the eye gaze information, that is, optionally, in some embodiments of the present application, the step of "generating the control instruction for the target interaction object according to the voice information" includes:
If a control instruction aiming at the target interaction object is not generated based on the voice information, determining a gazing range and eye action information corresponding to the eye gazing information;
And generating a control instruction aiming at the target interaction object according to the gazing range and the action type corresponding to the eye action information.
For example, the control instruction for the target interaction object is determined based on whether the gaze range matches a preset area and whether the action type is a specific action; for example, gazing at a certain area of the augmented reality scene and blinking triggers the opening of a specific application.
Optionally, in the embodiment of the present application, when the control instruction for the target interactive object is generated, the control instruction for the target interactive object may also be generated by adopting a gesture manipulation manner, that is, optionally, in some embodiments of the present application, the step of "generating the control instruction for the target interactive object according to the voice information" includes:
if a control instruction for the target interactive object is not generated based on the voice information and gesture operation of the target user is detected, identifying a gesture type corresponding to the gesture operation;
and generating a control instruction for the target interactive object based on the gesture type.
Each gesture type can correspond to one control action or control instruction, so generating the control instruction for the target interaction object from the gesture operation improves the convenience of controlling the target interaction object, and the control of the target interaction object no longer depends on voice or eye actions.
Optionally, the control operation or the control instruction for the target interactive object may be more than one, for example, the target interactive object is driven to execute two actions successively, where the two actions may be obtained by voice, or may be generated by voice and eye gaze information, that is, optionally, in some embodiments of the present application, the control instruction includes a first sub-control instruction and a second sub-control instruction, and the step of generating the control instruction for the target interactive object according to the voice information includes:
generating a first sub-control instruction aiming at the target interaction object according to the voice information;
and if the first sub-control instruction meets a preset instruction type, generating a second sub-control instruction aiming at the target interactive object according to the eye gaze information.
The preset instruction type is a preconfigured instruction type, and reflects a type of instruction with continuous actions, such as a mark class, a selection class, a display class, an unfolding class and the like.
For example, the first sub-control command is used for marking, the second sub-control command is used for moving, and the step of controlling the target interactive object according to the control command includes:
marking the target interactive object through the first sub-control instruction;
and moving the marked target interactive object to a target area of the augmented reality scene through the second sub-control instruction.
For example, the user may say "mark the mail application", and then, by gazing, move the marked mail application to an area that the user can view more intuitively. Alternatively, after the mail application is marked, the augmented reality scene is switched to a form centered on the mail application, for example only the mail application (the icon corresponding to the application) is displayed in the augmented reality scene, so that the mail application can be quickly found and operated.
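The mark-then-move pair of sub-control instructions can be sketched as follows; the instruction names and the preset continuation types are illustrative.

```python
# Sketch: the voice yields a first sub-control instruction ("mark"); because marking
# belongs to the preset instruction types that expect a follow-up action, a second
# sub-control instruction ("move") is generated from the eye gaze.
PRESET_CONTINUATION_TYPES = {"mark", "select", "expand"}

def build_sub_instructions(voice, gaze_target_area):
    first = "mark" if "mark" in voice else None
    second = None
    if first in PRESET_CONTINUATION_TYPES and gaze_target_area is not None:
        second = ("move_to", gaze_target_area)   # move the marked object to the gazed area
    return first, second

print(build_sub_instructions("mark the mail application", "centre_area"))
# ('mark', ('move_to', 'centre_area'))
```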
Optionally, in the embodiment of the present application, a feedback policy for the control operation may also be configured, for example showing a brief animation or icon in the augmented reality scene to feed back to the user that an action command has been generated from the voice, the eye gaze information or the historical behavior information. For example, when the user gazes at an application icon and says "open", an animation indicating a successful opening, such as an icon expansion or an animation effect on the application icon, may be displayed on the screen.
Feedback may also be provided by sound to inform the user that the corresponding manipulation instruction has been recognized, such as "instruction received" or "operation successful". The sound effect can be selected according to the user's personalized settings.
In summary, by acquiring the voice information, the eye gaze information and the historical behavior information of the user, and comprehensively positioning the target interactive object in the augmented reality scene, compared with a scheme of positioning the target interactive object based on voice only, the embodiment of the application can improve the accuracy and success rate of positioning the target interactive object, and further improve the success rate of augmented reality interaction.
And the target interaction object that the user expects to interact with is located through the gaze range, gaze pattern, eye action type and gaze duration, in combination with gestures, historical behavior information and the like, thereby improving the accuracy of locating the target interaction object.
And by combining the use scene of the augmented reality equipment, the control instruction for the target interaction object is determined by utilizing eye gaze information and the like on the basis of voice information, so that the accuracy of generating the control instruction is improved.
In order to facilitate better implementation of the augmented reality interaction method, the present application further provides an augmented reality interaction device based on the augmented reality interaction method. The terms used below have the same meanings as in the augmented reality interaction method described above, and for specific implementation details, reference may be made to the description in the method embodiment.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an augmented reality interaction device according to an embodiment of the present application, where the augmented reality interaction device is applied to an augmented reality apparatus, and the device may specifically be as follows:
An acquisition module 201, configured to acquire voice information and eye gaze information of a target user holding the augmented reality device in response to an interaction event;
A determining module 202, configured to determine a target interaction object from an augmented reality scene presented by the augmented reality device according to the voice information, the eye gaze information, and the historical behavior information of the target user;
a generating module 203, configured to generate a control instruction for the target interaction object according to the voice information;
And the control module 204 is used for controlling the target interaction object according to the control instruction.
Optionally, in some embodiments of the present application, the determining module 202 includes:
The first screening unit is used for screening target interaction objects from the augmented reality scene according to the voice information;
The second screening unit is used for screening target interaction objects from the augmented reality scene according to the eye gaze information if the target interaction objects meeting the target interaction conditions are not screened, or screening target interaction objects from the augmented reality scene according to the voice information and the eye gaze information;
and the third screening unit is used for locating the target interaction object from the augmented reality scene according to the eye gaze information and the historical behavior information if the target interaction object meeting the target interaction condition is not screened yet.
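For illustration only, the cascade formed by the three screening units can be sketched in Python as follows; the data shapes and the test used as the "target interaction condition" (exactly one remaining candidate) are assumptions of the illustration.

def screen_cascade(objects, voice_keywords, gaze_range, history_scores):
    # Stage 1, voice only: keep the objects whose names appear in the speech.
    by_voice = [o for o in objects if o in voice_keywords]
    if len(by_voice) == 1:
        return by_voice[0]
    # Stage 2, eye gaze, optionally combined with the voice result.
    pool = by_voice or objects
    by_gaze = [o for o in pool if o in gaze_range]
    if len(by_gaze) == 1:
        return by_gaze[0]
    # Stage 3, eye gaze plus historical behavior: pick the most habitual candidate.
    return max(by_gaze or pool, key=lambda o: history_scores.get(o, 0))

Here each object is represented only by its name; a real scene would of course carry richer object descriptions.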
Wherein, in some embodiments of the application, the third screening unit comprises:
determining a gaze range from the augmented reality scene according to the eye gaze information, wherein the gaze range covers at least two objects to be interacted with;
determining positioning reference information according to the historical behavior information, wherein the positioning reference information comprises at least one of behavior habit information, personal preference information or operation intention information of the target user;
And screening target interaction objects from at least two objects to be interacted in the gazing range according to the positioning reference information.
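For illustration only, the following Python-style sketch scores the candidates inside the gaze range with positioning reference information derived from the historical behavior information; the weighting of habit, preference, and intention is an assumption of the illustration.

def pick_within_gaze_range(candidates, positioning_reference):
    habit = positioning_reference.get("habit", {})            # e.g. usage frequency per object
    preference = positioning_reference.get("preference", {})  # e.g. favourite applications
    intention = positioning_reference.get("intention", {})    # e.g. objects matching the inferred task

    def score(obj):
        return habit.get(obj, 0.0) + 2.0 * preference.get(obj, 0.0) + 3.0 * intention.get(obj, 0.0)

    return max(candidates, key=score)

For example, pick_within_gaze_range(["mail_app", "music_app"], {"habit": {"mail_app": 0.8}}) returns "mail_app".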
Wherein, in some embodiments of the application, the third screening unit comprises:
determining a fixation mode corresponding to the eye fixation information;
If the gazing mode is matched with a preset mode, at least one object to be interacted is screened from the augmented reality scene according to the historical behavior information;
adjusting, according to the at least one screened object to be interacted with, the layout of each object to be interacted with in the augmented reality scene;
and positioning a target interaction object based on the layout and the eye gaze information.
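For illustration only, a Python-style sketch of this layout-adjustment variant is given below; the preset gaze mode ("searching"), the grid positions, and the nearest-to-gaze rule are assumptions of the illustration.

def relocate_with_layout(scene_layout, gaze_mode, gaze_point, history_top_objects):
    # scene_layout maps object name -> (x, y) position in the augmented reality scene.
    if gaze_mode != "searching":       # only act when the gaze mode matches the preset mode
        return None
    # Adjust the layout: move historically preferred objects into central slots.
    central_slots = [(0.5, 0.5), (0.4, 0.5), (0.6, 0.5)]
    known = [o for o in history_top_objects if o in scene_layout]
    for obj, slot in zip(known, central_slots):
        scene_layout[obj] = slot
    # Locate the target as the object in the new layout closest to the gaze point.
    return min(scene_layout, key=lambda o: (scene_layout[o][0] - gaze_point[0]) ** 2
                                           + (scene_layout[o][1] - gaze_point[1]) ** 2)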
Wherein, in some embodiments of the application, the second screening unit comprises:
determining a gaze range from the augmented reality scene according to the eye gaze information;
if the gazing range is in a first preset area and the gesture operation of the target user is detected, determining the gesture type of the gesture operation;
and screening target interaction objects from the augmented reality scene according to the gesture types.
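For illustration only, the gesture-assisted screening can be sketched as follows in Python; the boundary of the first preset area and the gesture-to-object mapping are assumptions of the illustration.

FIRST_PRESET_AREA = (0.8, 1.0)  # assumed normalized horizontal range at the edge of the view

GESTURE_TO_OBJECT = {           # illustrative mapping from gesture type to object
    "pinch": "notification_panel",
    "point": "settings_icon",
    "swipe": "application_list",
}

def screen_by_gesture(gaze_x, gesture_type):
    # Only use the gesture when the gaze range falls inside the first preset area.
    if FIRST_PRESET_AREA[0] <= gaze_x <= FIRST_PRESET_AREA[1] and gesture_type:
        return GESTURE_TO_OBJECT.get(gesture_type)
    return None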
Wherein, in some embodiments of the present application, the eye gaze information includes gaze location information and eye action information, and the second screening unit includes:
Determining a gaze range from the augmented reality scene according to the gaze location information;
identifying the action type corresponding to the eye action information;
And determining, from a mapping relation set, the target interaction object corresponding to the gaze range and the action type, wherein the mapping relation set records mapping relations among preset gaze ranges, preset action types, and preset interaction objects.
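For illustration only, the mapping relation set can be represented as a simple lookup table, as in the Python-style sketch below; the specific gaze ranges, action types, and objects in the table are assumptions of the illustration.

MAPPING_RELATION_SET = {
    ("top_left", "double_blink"): "mail_app",
    ("top_left", "long_gaze"): "calendar_app",
    ("center", "double_blink"): "browser_app",
}

def lookup_target(gaze_range, eye_action_type):
    # Returns None when no mapping is configured for this combination.
    return MAPPING_RELATION_SET.get((gaze_range, eye_action_type))

For example, lookup_target("top_left", "double_blink") returns "mail_app".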
Wherein, in some embodiments of the application, the second screening unit comprises:
screening at least two objects to be interacted from the augmented reality scene according to the voice information;
and screening target interaction objects from at least two objects to be interacted according to the eye gaze information.
Wherein, in some embodiments of the application, the determining module 202 comprises:
A determining unit, configured to determine positioning reference information according to the historical behavior information;
a fourth screening unit, configured to screen at least two objects to be interacted from the augmented reality scene according to the positioning reference information;
The display unit is used for displaying the at least two objects to be interacted with in a second preset area of the augmented reality scene;
The first positioning unit is used for positioning a target interaction object from the at least two objects to be interacted with in the second preset area according to the voice information and the eye gaze information if the eye gaze information indicates that the eye gaze of the target user reaches the second preset area.
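For illustration only, this proactive-recommendation flow can be sketched in Python as follows; the number of displayed candidates and the confirmation rule are assumptions of the illustration.

def recommend_and_confirm(history_ranking, gaze_in_second_area, voice_keywords, gazed_object):
    # Display the top candidates screened from the historical behavior information.
    displayed = history_ranking[:3]
    if not gaze_in_second_area:
        return None                       # the user's gaze has not reached the second preset area yet
    # Prefer a displayed candidate that is named in the voice; otherwise take the gazed one.
    for obj in displayed:
        if obj in voice_keywords:
            return obj
    return gazed_object if gazed_object in displayed else None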
Wherein, in some embodiments of the present application, the eye gaze information includes first stage eye gaze information and second stage eye gaze information, the determining module 202 includes:
The extraction unit is used for extracting gaze duration information from the first stage eye gaze information if the first stage eye gaze information indicates that the eye gaze of the target user is focused on at least two objects to be interacted with that are in a stacked occlusion state and the voice information does not clearly identify a target interaction object among those objects;
The unfolding unit is used for unfolding each object to be interacted with if the gaze duration information meets a preset duration threshold, so as to obtain at least two objects to be interacted with in a spread arrangement state;
and the second positioning unit is used for determining a target interaction object from the at least two objects to be interacted with in the spread arrangement state according to the second stage eye gaze information and the historical behavior information.
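For illustration only, the two-stage handling of stacked objects can be sketched as follows in Python; the duration threshold and the fallback to the history score are assumptions of the illustration.

GAZE_DURATION_THRESHOLD = 0.8  # seconds, illustrative value for the preset duration threshold

def resolve_stacked_objects(stacked_objects, gaze_duration, second_stage_gazed_object, history_scores):
    if gaze_duration < GAZE_DURATION_THRESHOLD:
        return None                       # keep the stack folded and keep waiting
    spread = list(stacked_objects)        # "unfold" the stack into a spread arrangement
    if second_stage_gazed_object in spread:
        return second_stage_gazed_object
    # Otherwise fall back to the historically most likely object among the spread ones.
    return max(spread, key=lambda o: history_scores.get(o, 0))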
Wherein, in some embodiments of the present application, the generating module 203 comprises:
A scene determination unit configured to determine usage scene information of the augmented reality device;
the recognition unit is used for recognizing the voice information to obtain a voice recognition result;
The action determining unit is used for determining target action information meeting the use scene information according to the voice recognition result;
the first generation unit is used for generating a control instruction aiming at the target interaction object based on the target action information.
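For illustration only, the scenario-dependent generation of the control instruction can be sketched as follows in Python; the scenario names, the keyword table, and the treatment of the recognition result as plain text are assumptions of the illustration.

SCENARIO_ACTIONS = {
    "game": {"open": "equip", "close": "unequip"},
    "office": {"open": "launch", "close": "exit"},
}

def generate_instruction(target, usage_scenario, recognition_result):
    words = recognition_result.lower().split()
    if not words:
        return None
    # The same spoken word is mapped to different target actions in different scenarios.
    action = SCENARIO_ACTIONS.get(usage_scenario, {}).get(words[0])
    return None if action is None else {"target": target, "action": action}

For example, generate_instruction("sword_item", "game", "open") yields an "equip" instruction, while the same utterance in the "office" scenario would yield a "launch" instruction.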
Wherein, in some embodiments of the application, the action determining unit comprises:
a mode determining subunit, configured to determine an operation mode of the target interaction object according to the usage scenario information;
And the action determining subunit is used for determining target action information conforming to the running mode according to the voice recognition result.
Wherein, in some embodiments of the present application, the action determining subunit is specifically configured to:
Extracting a target keyword from the voice recognition result based on the operation mode, and determining target action information based on the target keyword;
or
And determining action intention information of the voice recognition result through an intention analysis model based on the operation mode, and determining target action information based on the action intention information.
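For illustration only, the two alternatives of this subunit can be sketched as follows in Python; the per-mode keyword table and the interface of the intention analysis model (any callable returning an intention label) are assumptions of the illustration.

MODE_KEYWORDS = {
    "playback_mode": {"pause", "play", "next"},
    "edit_mode": {"copy", "paste", "delete"},
}

def action_by_keywords(recognition_result, operation_mode):
    # Alternative 1: extract a target keyword permitted in the current operation mode.
    hits = set(recognition_result.lower().split()) & MODE_KEYWORDS.get(operation_mode, set())
    return hits.pop() if hits else None

def action_by_intention_model(recognition_result, operation_mode, intention_model):
    # Alternative 2: let an intention analysis model infer the action intention.
    intention = intention_model(recognition_result, operation_mode)  # e.g. "pause_playback"
    return intention.split("_")[0]        # crude mapping from the intention label to an action name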
Wherein, in some embodiments of the present application, the generating module 203 comprises:
a first eye movement information determining unit, configured to determine a gaze range and eye movement information corresponding to the eye gaze information if a control instruction for the target interactive object is not generated based on the voice information;
And the second generation unit is used for generating a control instruction aiming at the target interaction object according to the gazing range and the action type corresponding to the eye action information.
Wherein, in some embodiments of the present application, the generating module 203 comprises:
A second eye movement information determining unit, configured to identify a gesture type corresponding to a gesture operation of the target user if a control instruction for the target interactive object is not generated based on the voice information and the gesture operation of the target user is detected;
And the third generation unit is used for generating a control instruction aiming at the target interaction object based on the gesture type.
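For illustration only, the order of these fallbacks for instruction generation can be sketched as follows in Python; the lookup tables for eye actions and gestures are assumptions of the illustration.

EYE_ACTION_INSTRUCTIONS = {("icon_area", "double_blink"): "open"}
GESTURE_INSTRUCTIONS = {"pinch": "select", "swipe_left": "dismiss"}

def fallback_instruction(voice_instruction, gaze_range, eye_action_type, gesture_type):
    if voice_instruction is not None:       # the voice already produced an instruction
        return voice_instruction
    by_eye = EYE_ACTION_INSTRUCTIONS.get((gaze_range, eye_action_type))
    if by_eye is not None:                  # next, the gaze range plus the eye action type
        return by_eye
    return GESTURE_INSTRUCTIONS.get(gesture_type)   # finally, the detected gesture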
Wherein, in some embodiments of the present application, the control instruction includes a first sub-control instruction and a second sub-control instruction, and the generating module 203 includes:
A fourth generating unit, configured to generate a first sub-control instruction for the target interactive object according to the voice information;
And a fifth generating unit, configured to generate a second sub-control instruction for the target interactive object according to the eye gaze information if the first sub-control instruction meets a preset instruction type.
Wherein, in some embodiments of the present application, the first sub-control instruction is for marking and the second sub-control instruction is for moving, the control module 204 comprises:
the first control unit is used for marking the target interaction object through the first sub-control instruction;
and the second control unit is used for moving the marked target interactive object to a target area of the augmented reality scene through the second sub-control instruction.
In the embodiment of the application, firstly, the acquisition module 201 responds to an interaction event to acquire voice information and eye gaze information of a target user of the augmented reality device, then, the determination module 202 determines a target interaction object from an augmented reality scene presented by the augmented reality device according to the voice information, the eye gaze information and historical behavior information of the target user, then, the generation module 203 generates a control instruction for the target interaction object according to the voice information, and then, the control module 204 controls the target interaction object according to the control instruction.
According to the embodiment of the application, the target interactive object in the augmented reality scene is comprehensively positioned by acquiring the voice information, the eye gaze information and the historical behavior information of the user, and compared with a scheme of positioning the target interactive object based on voice only, the embodiment of the application can improve the accuracy and success rate of positioning the target interactive object, and further improve the success rate of augmented reality interaction.
In addition, the application further provides an electronic device, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the application, specifically:
The electronic device may include a processor 301 having one or more processing cores, a memory 302 including one or more computer-readable storage media, a power supply 303, an input unit 304, and other components. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 does not limit the electronic device, and the electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components. Wherein:
The processor 301 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 302, and calling data stored in the memory 302, thereby performing overall monitoring of the electronic device. Optionally, processor 301 may include one or more processing cores; preferably, the processor 301 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 301.
The memory 302 may be used to store software programs and modules, and the processor 301 executes various functional applications and data processing by running the software programs and modules stored in the memory 302. The memory 302 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the electronic device, etc. In addition, the memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 302 may also include a memory controller to provide the processor 301 with access to the memory 302.
The electronic device further comprises a power supply 303 for supplying power to the various components. Preferably, the power supply 303 is logically connected to the processor 301 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system. The power supply 303 may also include any one or more components such as a direct current or alternating current power supply, a recharging system, a power device commissioning circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 304, which input unit 304 may be used for receiving input digital or character information and for generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 301 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 302 according to the following instructions, and the processor 301 runs the application programs stored in the memory 302, so as to implement the steps in any one of the augmented reality interaction methods provided in the embodiments of the present application.
According to the embodiment of the application, the augmented reality equipment responds to the interaction event, acquires the voice information and eye gazing information of the target user of the augmented reality equipment, determines a target interaction object from an augmented reality scene presented by the augmented reality equipment according to the voice information, the eye gazing information and the historical behavior information of the target user, generates a control instruction aiming at the target interaction object according to the voice information, and controls the target interaction object according to the control instruction.
According to the embodiment of the application, the target interactive object in the augmented reality scene is comprehensively positioned by acquiring the voice information, the eye gaze information and the historical behavior information of the user, and compared with a scheme of positioning the target interactive object based on voice only, the embodiment of the application can improve the accuracy and success rate of positioning the target interactive object, and further improve the success rate of augmented reality interaction.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer readable storage medium having stored thereon a computer program that can be loaded by a processor to perform the steps of any of the augmented reality interaction methods provided by the present application.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein, the computer-readable storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.
Because the instructions stored in the computer readable storage medium can execute the steps in any one of the augmented reality interaction methods provided by the present application, the beneficial effects that can be achieved by any one of the augmented reality interaction methods provided by the present application can be achieved, and detailed descriptions of the foregoing embodiments are omitted herein.
The augmented reality interaction method, apparatus, electronic device, and computer-readable storage medium provided by the present application have been described in detail above, and specific examples have been applied herein to illustrate the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method of the present application and its core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application. In summary, the content of this description should not be construed as limiting the present application.
It should be noted that the specific embodiments of the present application involve related data such as voice information, eye gaze information, historical behavior information, and gesture operations. When the above embodiments of the present application are applied to specific products or technologies, the permission or consent of the user needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and standards of the relevant countries and regions.