
CN111369688B - Cognitive navigation method and system for structured scene expression

Info

Publication number
CN111369688B
CN111369688B (application CN202010166282.8A)
Authority
CN
China
Prior art keywords
target
image
scene
information
scene graph
Prior art date
Legal status
Active
Application number
CN202010166282.8A
Other languages
Chinese (zh)
Other versions
CN111369688A (en)
Inventor
陈崇雨
于帮国
Current Assignee
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date
Filing date
Publication date
Application filed by DMAI Guangzhou Co Ltd
Priority to CN202010166282.8A
Publication of CN111369688A
Application granted
Publication of CN111369688B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/003Navigation within 3D models or images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention discloses a cognitive navigation method and system for structured scene expression. The method comprises the following steps: obtaining two-dimensional and three-dimensional information of the targets in each frame of image from the acquired target scene image sequence and the parameters of the image acquisition device; obtaining optimal scene graph information from the targets' two-dimensional information, three-dimensional information and target prior knowledge information; processing the scene graphs formed by consecutive frames to generate local scene graphs, and merging and updating the local scene graphs to generate a global scene graph; and obtaining target coordinates from the target information in the global scene graph, planning a path according to the target coordinates, and navigating. By combining the target scene image sequence with target prior knowledge to obtain local scene graphs, and merging and updating them into a global scene graph, the method can construct three-dimensional scene graphs for multiple scenes and at finer granularity, improving the orderliness of targets in the three-dimensional scene graph and the navigation accuracy.

Description

Cognitive navigation method and system for structured scene expression
Technical Field
The invention relates to the technical field of image processing, and in particular to a cognitive navigation method and system for structured scene expression.
Background
With the rapid development of computer graphics and image processing technology, three-dimensional virtual scenes can reproduce real scenes far more vividly than planar pictures, bringing a good visual effect and visual experience. Demand for three-dimensional visualization is therefore growing markedly, so the creation of the required three-dimensional scene graphs has attracted increasing attention and research and is widely applied across industries. Existing scene graph generation techniques fall into three categories: first, generating a scene graph from a single image using deep learning; second, recognizing objects with a target detection algorithm, locating them with a mapping technique, and extracting the relations between objects directly in the three-dimensional scene graph; third, recognizing objects with a target detection algorithm, locating them with a mapping technique, and extracting the relations between objects in the three-dimensional scene graph using an actual data set as the statistical basis. However, the relations between objects in the scene graphs generated by the prior art are disordered and cannot be well structured.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defect that the relations between objects in the scene graphs generated by the prior art are disordered and cannot be well structured, and to provide a cognitive navigation method and system for structured scene expression.
In order to achieve the above purpose, the present invention provides the following technical solutions:
In a first aspect, an embodiment of the present invention provides a cognitive navigation method for structured scene expression, comprising: acquiring target scene images with an image acquisition device to obtain a corresponding image sequence, wherein the images comprise depth images and color images, and the image sequence comprises a depth image sequence and a color image sequence; obtaining two-dimensional information and three-dimensional information of each target in each frame of image using the target scene images, the image sequence and the parameters of the image acquisition device; obtaining optimal scene graph information according to the two-dimensional information, three-dimensional information and target prior knowledge information of each target in each frame of image; processing the scene graphs formed by a preset number of frames to be optimized to generate local scene graphs, and merging and updating the local scene graphs to generate a global scene graph; and obtaining target coordinates according to the target information in the global scene graph, planning a path according to the target coordinates, and navigating.
In one embodiment, the target prior knowledge information is acquired before the target scene images, and the acquisition process includes: screening and cleaning a preset target data set, and classifying the screened and cleaned data set according to a preset scene image classification method to generate scene images of various types; counting, within each scene type, the probability of target occurrence, the probability of target attributes and the probability of relations; constructing an AND-OR graph structure from the target attributes and the relations between targets; and filling these probabilities into the AND-OR graph structure to generate the target prior knowledge information.
In one embodiment, the step of obtaining two-dimensional information and three-dimensional information of each target in each frame of image using the target scene images, the image sequence and the parameters of the image acquisition device comprises: estimating the pose of the image acquisition device using the image sequence and a SLAM method, taking the position of the first frame image as the coordinate origin; detecting all targets in each frame of image using the color images and a target detection method, and obtaining the two-dimensional information of each target in each frame of image; and obtaining the three-dimensional information of the targets in each frame of image according to the depth images, the pose of the image acquisition device, the parameters of the image acquisition device and the two-dimensional information of each target, wherein the three-dimensional information comprises the targets' three-dimensional coordinate information and three-dimensional bounding box information.
In one embodiment, the step of obtaining the optimal scene graph information according to the two-dimensional information, three-dimensional information and target prior knowledge information of each target in each frame of image comprises: estimating the relations between targets and their probabilities according to the three-dimensional information of each target in each frame of image, wherein the probabilities comprise the probability of target occurrence, the probability of target attributes and the probability of target relations in each frame of image; generating a scene graph of each frame of image according to the relations between the corresponding targets in each frame of image and the targets' two-dimensional information; and optimizing the scene graph of each frame of image according to the target prior knowledge information and the relations between targets and their probabilities, to obtain the optimal scene graph information.
In one embodiment, the steps of processing the scene graphs formed by a preset number of frames to be optimized to generate a local scene graph, and merging and updating the local scene graphs to generate a global scene graph, comprise: storing the initial scene graph information of the preset number of frames to be optimized into a group to be optimized; when the number of occurrences of any target in the initial scene graphs of the group to be optimized is smaller than a preset occurrence threshold, filtering the target out of the initial scene graphs to generate a filtered scene graph set; recalculating the mean of the three-dimensional information of all targets in the filtered scene graph set, and generating the local scene graph according to the recalculated means; and merging and updating the local scene graphs according to target coordinate information, target category information and the target information in the generated global scene graph to generate the global scene graph.
In one embodiment, the above steps further comprise: when the number of frames in the group to be optimized exceeds the preset number of frames to be optimized, performing filtering optimization on the scene graphs of the key frame group to generate a local scene graph.
In one embodiment, the step of updating the local scene graphs comprises: recomputing the target relations using the targets in the generated global scene graph, and generating a target relation calculation result; and, according to the calculation result, adding a new target or a new target relation, or updating an old target relation, in the global scene graph under construction.
In a second aspect, an embodiment of the present invention provides a cognitive navigation system for structured scene expression, comprising: an image and image sequence acquisition module, used for acquiring target scene images with an image acquisition device to obtain a corresponding image sequence, wherein the images comprise depth images and color images, and the image sequence comprises a depth image sequence and a color image sequence; a two-dimensional and three-dimensional information acquisition module, used for obtaining the two-dimensional information and three-dimensional information of each target in each frame of image using the target scene images, the image sequence and the parameters of the image acquisition device; an optimal scene graph information acquisition module, used for obtaining optimal scene graph information according to the two-dimensional information, three-dimensional information and target prior knowledge information of each target in each frame of image; a global scene graph generation module, used for processing the scene graphs formed by a preset number of frames to be optimized to generate local scene graphs, and merging and updating the local scene graphs to generate a global scene graph; and a path planning and navigation module, used for obtaining target coordinates according to the target information in the global scene graph, planning a path according to the target coordinates, and navigating.
In a third aspect, an embodiment of the present invention provides a terminal device, including: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the cognitive navigation method of the structured scene representation of the first aspect of the embodiment of the invention.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing computer instructions for causing a computer to perform the cognitive navigation method of the structured scene representation of the first aspect of the embodiments of the present invention.
The technical scheme of the invention has the following advantages:
1. According to the cognitive navigation method and system for structured scene expression provided by the invention, the target scene image sequence and the target prior knowledge are combined to obtain local scene graphs, which are merged and updated into a global scene graph; three-dimensional scene graphs can thus be constructed for multiple scenes and at finer granularity, improving the orderliness of targets in the three-dimensional scene graph and the navigation accuracy. The structured expression of target prior knowledge is introduced, so more scene information can be obtained from a limited range of perception information, and the mapping effect is further optimized through the generated scene graph.
2. According to the cognitive navigation method and system for the structured scene expression, provided by the invention, the object two-dimensional information and the pose of the image acquisition device are acquired, the structured information containing the object attribute, the three-dimensional coordinate and the relation between the objects is estimated by combining the depth image, and the optimal scene graph information is obtained according to the object priori knowledge information, so that the accuracy of scene graph construction is improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a specific example of a cognitive navigation method of structured scene expression provided in an embodiment of the present invention;
FIG. 2 is a flowchart of a specific example of generating the objective priori knowledge according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a specific example of an AND-OR graph structure provided by an embodiment of the present invention;
FIG. 4 is a flowchart of a specific example of acquiring three-dimensional information according to an embodiment of the present invention;
FIG. 5 is a flowchart of a specific example of obtaining optimal scene graph information according to an embodiment of the present invention;
FIG. 6 is a flowchart of a specific example of generating a global scene graph according to an embodiment of the present invention;
FIG. 7 is a block diagram of a specific example of a cognitive navigation system for structured scene representation provided by an embodiment of the present invention;
fig. 8 is a composition diagram of a specific example of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Example 1
The embodiment of the invention provides a cognitive navigation method for structured scene expression, which is applied to the fields of scene graph construction, three-dimensional reconstruction, automatic control technology and the like, and as shown in fig. 1, comprises the following steps:
step S11: and acquiring a target scene image by using an image acquisition device to obtain a corresponding image sequence, wherein the image comprises a depth image and a color image, and the image sequence comprises the depth image sequence and the color image sequence.
In the embodiment of the invention, the image acquisition device sequentially and continuously acquires images of all targets in the scene at different times and from different orientations, yielding the corresponding depth image sequence and color image sequence. Depth images, which are less susceptible to illumination, shadow and the like, replace the gray-scale images of the prior art and are combined with the color images, improving the accuracy of the constructed three-dimensional scene graph and the orderliness of targets within it.
Step S12: and obtaining the two-dimensional information and the three-dimensional information of each target in each frame of image by using the target scene image, the image sequence and the parameters of the image acquisition equipment.
The embodiment detects the two-dimensional information of all targets in each frame of the target scene image sequence using a preset target detection and recognition method. The two-dimensional information includes the target category, target attributes, the probability of the target appearing in the scene image, the target's two-dimensional bounding box, the target ID and so on. The target scene images and image sequence here are mainly the color images and the color image sequence, and the preset detection method may be, for example, the YOLOv3 target detection method.
From the image sequence, the image acquisition device is localized with a laser- or vision-based simultaneous localization and mapping algorithm to estimate its initial pose; corner features can also be tracked with the ORB-SLAM algorithm. The target's coordinates in three-dimensional space and its three-dimensional bounding box are then obtained from the initial camera pose, the depth image, the two-dimensional information and the parameters of the image acquisition device.
It should be noted that the various algorithms mentioned in the embodiments of the present invention are only for illustration, but not for limitation.
Step S13: and obtaining optimal scene graph information according to the two-dimensional information, the three-dimensional information and the target priori knowledge information of each target in each frame of image.
The embodiment estimates the relations between targets from the three-dimensional coordinate information and three-dimensional bounding box information in each frame. After the category and relation probabilities of the targets are obtained and the target prior knowledge information is structurally expressed, maximum a posteriori inference is performed to obtain the optimal scene graph information, which includes the optimized target categories, attributes and inter-target relations.
Step S14: and processing the scene graphs formed by the preset number of frames to be optimized to generate a local scene graph, and merging and updating the local scene graph to generate a global scene graph.
To further improve the accuracy of three-dimensional scene graph construction, the embodiment optimizes, in real time, the initial scene graphs formed by a preset number of frames to be optimized. The frames are placed into a group to be optimized, and when the number of frames in the group exceeds the preset number, the initial scene graphs are directly filtered and optimized: any target whose number of occurrences is below a preset occurrence threshold is removed from the initial scene graphs. The mean of the three-dimensional information of each remaining target is then recomputed to generate a local scene graph, and the local scene graphs are merged and updated according to target coordinate information, target category information and the target information already in the generated global scene graph, producing the global scene graph.
Step S15: and obtaining target coordinates according to target information in the global scene graph, planning a path according to the target coordinates, and navigating.
When the terminal or device receives a navigation instruction, it searches the global scene graph by target category, attribute and relation, finds the corresponding target, queries its coordinates, plans a path according to the target coordinates, and navigates.
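To make this query step concrete, the following Python sketch is illustrative only (the patent provides no code): the node layout, field names and the plan_path stub are assumptions. It looks up a target in a global scene graph by category and attribute and hands its coordinates to a planner.

```python
# Minimal sketch (not the patent's implementation): querying a global scene
# graph for a target's coordinates and handing them to a planner.
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    category: str                                     # e.g. "cup"
    attributes: dict = field(default_factory=dict)    # e.g. {"color": "red"}
    position: tuple = (0.0, 0.0, 0.0)                 # 3-D coordinates, map frame

def query_target(graph, category, **attrs):
    """Return the first node matching the category and all given attributes."""
    for node in graph:
        if node.category == category and all(
                node.attributes.get(k) == v for k, v in attrs.items()):
            return node
    return None

def plan_path(start, goal):
    """Placeholder planner: a real system would run A*/Dijkstra on a map."""
    return [start, goal]

graph = [SceneNode("cup", {"color": "red"}, (1.2, 0.4, 0.8)),
         SceneNode("table", {}, (1.0, 0.5, 0.0))]
target = query_target(graph, "cup", color="red")
if target is not None:
    waypoints = plan_path((0.0, 0.0, 0.0), target.position)
    print("navigate via", waypoints)
```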
According to the cognitive navigation method for structured scene expression provided by this embodiment, the target scene image sequence and the target prior knowledge are combined to obtain local scene graphs, which are merged and updated into a global scene graph; three-dimensional scene graphs can thus be constructed for multiple scenes and at finer granularity, improving the orderliness of targets in the three-dimensional scene graph and the navigation accuracy. The structured expression of target prior knowledge is introduced, so more scene information can be obtained from a limited range of perception information, and the mapping effect is further optimized through the generated scene graph.
In one embodiment, as shown in fig. 2, the process of acquiring the target prior knowledge information before acquiring the target scene image includes:
step S21: screening and cleaning the preset target data set, classifying the screened and cleaned preset target data set according to a preset scene image classification method, and generating various types of scene images.
The embodiment uses a target data set as the basis for feature extraction, for example the Visual Genome data set, which is screened and cleaned by removing all images containing people. The screened and cleaned images of the preset target data set can then be classified by functional scene, such as kitchen, living room, bedroom, conference room, office, restaurant and washroom; the classification rules can be customized as required.
Step S22: the probability of occurrence of the object, the probability of the object attribute and the probability of the relation in each type of scene image are counted.
Step S23: and constructing an AND or graph structure according to the attributes of the targets and the relation between the targets.
The probability of target occurrence, the probability of target attributes and the probability of relations between targets are counted in each frame of each scene type, and the AND-OR graph structure shown in fig. 3 is constructed from the target attributes and the relations between targets.
The AND-OR graph structure shown in fig. 3 mainly comprises AND nodes and OR nodes: an AND node represents the decomposition of an object, an OR node represents alternative choices of a target, and the terminal nodes comprise address nodes and attribute nodes, where attribute nodes carry target attributes such as color and coordinates.
Step S24: and filling the probability of the occurrence of the target, the probability of the attribute of the target and the probability of the relation in each type of scene image into an AND-OR graph structure to generate the target priori knowledge information.
The statistical results, namely the probability of target occurrence, the probability of target attributes and the probability of relations between targets in each frame of each scene type (for example, the target category, the attributes and the top relations), are filled into the corresponding object or node in the AND-OR graph structure. Many kinds of target prior knowledge information can then be read from the AND-OR graph, covering the actual scene conditions.
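As an illustration of how such a prior might be assembled (the annotation tuple format, node fields and names below are assumptions, not the patent's implementation), a minimal Python sketch counts per-scene occurrence frequencies and fills them into a small AND-OR structure; attribute and relation statistics would be filled in analogously.

```python
# Hedged sketch of the AND-OR prior: count occurrence frequencies per scene
# type and store them on a small AND-OR structure.
from collections import Counter

class Node:
    def __init__(self, name, kind, children=None):
        self.name, self.kind = name, kind      # kind: "AND" | "OR" | "TERM"
        self.children = children or []         # (child, probability) pairs
        self.prob = {}                         # filled from statistics

def build_prior(annotations):
    """annotations: list of (scene_type, object, attribute, relation) tuples."""
    occur = Counter((s, o) for s, o, _, _ in annotations)
    total = Counter(s for s, _, _, _ in annotations)
    # root OR node chooses a scene type; each scene is an AND of its objects
    root = Node("scene", "OR")
    for scene in total:
        scene_node = Node(scene, "AND")
        for (s, obj), n in occur.items():
            if s == scene:
                term = Node(obj, "TERM")
                term.prob["occurrence"] = n / total[s]   # occurrence statistic
                scene_node.children.append((term, term.prob["occurrence"]))
        root.children.append((scene_node, total[scene] / len(annotations)))
    return root
```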
In a specific embodiment, as shown in fig. 4, the step of obtaining two-dimensional information and three-dimensional information of each object in each frame of image by using the object scene image, the image sequence and the parameters of the image acquisition device includes:
step S121: estimating the pose of the image acquisition equipment by using the image sequence and the SLAM method, and acquiring the pose of the image acquisition equipment by taking the position of the first frame image as the origin of coordinates.
The embodiment estimates the pose of the image acquisition device from the image sequence with a preset pose estimation algorithm, taking the position of the first frame image as the coordinate origin. For example, when a robot carries the camera, the mobile robot's real-time pose can be estimated with a laser or visual SLAM algorithm, and the initial camera pose can be estimated by tracking corner features with the ORB-SLAM algorithm.
Step S122: and detecting all targets in each frame of image by using the color image and the target detection method, and acquiring the two-dimensional information of each target in each frame of image.
A target detection algorithm detects all targets appearing in each frame of the color image and obtains each target's two-dimensional information, which may include the target category, the probability of appearance in each frame, the two-dimensional bounding box and two-dimensional coordinates. The detection algorithm used in this embodiment is an existing, relatively mature one and is not limited here.
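For example, a YOLOv3 detector can be run through OpenCV's DNN module roughly as follows. This is a hedged sketch, not the patent's code: the file paths, the 416x416 input size and the 0.5 confidence threshold are illustrative choices.

```python
# One possible way to collect per-target 2-D boxes and class probabilities
# with a pretrained YOLOv3 model via OpenCV DNN.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

def detect(color_image, conf_thresh=0.5):
    h, w = color_image.shape[:2]
    blob = cv2.dnn.blobFromImage(color_image, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    detections = []
    for out in net.forward(net.getUnconnectedOutLayersNames()):
        for row in out:                  # [cx, cy, bw, bh, objectness, classes...]
            scores = row[5:]
            cls = int(np.argmax(scores))
            conf = float(scores[cls]) * float(row[4])   # class prob x objectness
            if conf < conf_thresh:
                continue
            cx, cy, bw, bh = row[:4] * np.array([w, h, w, h])
            detections.append({"class_id": cls, "prob": conf,
                               "box": (cx - bw / 2, cy - bh / 2, bw, bh)})
    return detections
```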
Step S123: and acquiring three-dimensional information of the targets in each frame of image according to the depth image, the pose of the image acquisition equipment, the parameters of the image acquisition equipment and the two-dimensional information of each target, wherein the three-dimensional information comprises three-dimensional coordinate information and three-dimensional boundary box information of the targets.
The target's coordinates in three-dimensional space and its three-dimensional bounding box are calculated from the detected target category and two-dimensional bounding box, combined with the depth image, the pose of the image acquisition device and the device parameters. In this embodiment, the calculation of the three-dimensional bounding box splits into two cases: for a small target, the three-dimensional bounding box can be approximated from the two-dimensional bounding box; for a large target, the three-dimensional bounding box can be set by attaching two-dimensional code markers.
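A minimal sketch of the small-target case follows, assuming a pinhole camera model with intrinsics fx, fy, cx, cy and a camera-to-world pose T_world_cam from SLAM; all variable names are illustrative assumptions.

```python
# Lift a detected 2-D box to a 3-D centre and approximate 3-D box using the
# depth image, camera intrinsics and the camera pose.
import numpy as np

def box_to_3d(box2d, depth, K, T_world_cam):
    """box2d: (x, y, w, h) pixels; depth: HxW metres; K: 3x3 intrinsics;
    T_world_cam: 4x4 camera-to-world pose from SLAM."""
    x, y, w, h = [int(v) for v in box2d]
    patch = depth[y:y + h, x:x + w]
    vals = patch[patch > 0]
    if vals.size == 0:
        return None                              # no valid depth in the box
    z = float(np.median(vals))                   # robust depth of the target
    u, v = x + w / 2.0, y + h / 2.0              # box centre in pixels
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
    p_world = T_world_cam @ p_cam                # 3-D coordinate in map frame
    # approximate the 3-D extent from the 2-D extent at depth z (small objects);
    # the depth-axis extent is taken equal to the width as a crude assumption
    size = ((w / fx) * z, (h / fy) * z, (w / fx) * z)
    return p_world[:3], size
```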
In a specific embodiment, as shown in fig. 5, the step of obtaining the optimal scene graph information according to the two-dimensional information, the three-dimensional information and the target priori knowledge information of each target in each frame of image includes:
step S131: according to the three-dimensional information of each target in each frame of image, estimating the relation among targets and the probability thereof, wherein the probability comprises the probability of the occurrence of the targets, the probability of the attribute of the targets and the probability of the relation of the targets in each frame of image.
The relations between targets and their probabilities in each frame are estimated from the obtained three-dimensional coordinate information and three-dimensional bounding boxes. The estimation splits into three cases: 1. judging whether a supported target lies above a supporting target, to decide whether a support relation exists between the targets; 2. judging whether a contained target lies inside a containing target, to decide whether a containment relation exists between the targets; 3. judging that neither a support nor a containment relation exists between the targets, so that the targets are independent.
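The three cases can be tested on axis-aligned three-dimensional boxes roughly as follows; the (centre, size) box representation and the 5 cm tolerance are assumptions for illustration.

```python
# Minimal relation tests between axis-aligned 3-D boxes, matching the three
# cases above: support, containment, independence.
def supports(lower, upper, eps=0.05):
    """lower/upper: (center xyz, size xyz). True if `upper` rests on `lower`."""
    (cl, sl), (cu, su) = lower, upper
    top_of_lower = cl[2] + sl[2] / 2
    bottom_of_upper = cu[2] - su[2] / 2
    horizontal_overlap = (abs(cl[0] - cu[0]) < (sl[0] + su[0]) / 2 and
                          abs(cl[1] - cu[1]) < (sl[1] + su[1]) / 2)
    return horizontal_overlap and abs(bottom_of_upper - top_of_lower) < eps

def contains(outer, inner):
    """True if `inner` lies entirely inside `outer` on every axis."""
    (co, so), (ci, si) = outer, inner
    return all(abs(ci[k] - co[k]) + si[k] / 2 <= so[k] / 2 for k in range(3))

def relation(a, b):
    if supports(a, b):  return "b supported by a"
    if contains(a, b):  return "b inside a"
    return "independent"
```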
Step S132: and generating a scene graph of each frame of image according to the relation between the corresponding targets in each frame of image and the two-dimensional information of the targets.
Step S133: and optimizing each frame of image according to the target priori knowledge information, the relation among targets and the probability thereof to obtain the optimal scene graph information.
A scene graph of each frame of image is generated from the relations between the corresponding targets in that frame, the targets' two-dimensional information (attributes and categories) and the structured expression of the target prior knowledge information.
Maximum a posteriori inference is performed over the target relations, attributes and the like in each frame's scene graph to obtain the optimized object categories, attributes and relations. The inference can be implemented according to formula (1):

PG^{*} = \arg\max_{PG} \; p(PG \mid \Gamma, g_{\varepsilon})    (1)

where PG is the parse graph of the corresponding image information and p(PG | Γ, g_ε) is the posterior probability. g_ε is the AND-OR graph stochastic grammar, and Γ is the input image data, comprising the target categories Γ_o, the three-dimensional spatial relations Γ_S and the target attributes Γ_A.
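One way to realize this inference is sketched below. It is an assumed approximation, not the patent's exact procedure: the posterior is scored as the product of detection probabilities and AND-OR prior probabilities, and the `prior.category` / `prior.relation` interface is hypothetical.

```python
# Illustrative scoring of candidate parse graphs; argmax over candidates
# picks PG*. All probabilities are assumed strictly positive.
import math

def log_posterior(pg, prior):
    """pg: dict with 'objects' [(category, det_prob)] and
    'relations' [(rel, a, b)]; prior: hypothetical AND-OR prior interface."""
    score = 0.0
    for category, det_prob in pg["objects"]:
        score += math.log(det_prob) + math.log(prior.category(category))
    for rel, a, b in pg["relations"]:
        score += math.log(prior.relation(rel, a, b))
    return score

def map_inference(candidates, prior):
    """Return the candidate parse graph with the highest posterior score."""
    return max(candidates, key=lambda pg: log_posterior(pg, prior))
```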
In a specific embodiment, as shown in fig. 6, the steps of processing a scene graph formed by a preset number of frames to be optimized to generate a local scene graph, and merging and updating the local scene graph to generate a global scene graph include:
step S141: and storing the initial scene graph information of the preset number of frames to be optimized into the group to be optimized.
The embodiment stores the initial scene graph information of a preset number of consecutive frames to be optimized into the group to be optimized. The capacity of the group is set to the preset number, and when the number of frames in the group exceeds the preset number of frames to be optimized, the initial scene graphs are optimized.
Step S142: when the occurrence frequency of any target in the initial scene graph of the group to be optimized is smaller than a preset occurrence frequency threshold, filtering the target from the initial scene graph to generate a filtered scene graph set.
When the number of occurrences of any target in the initial scene graphs of the group to be optimized is below the preset occurrence threshold, the embodiment removes the short-lived objects caused by target detection misrecognition in the scene, thereby improving mapping accuracy.
Step S143: and (3) recalculating the average value of the three-dimensional information of all targets in the filtered scene graph set, and generating a local scene graph according to the recalculated average value of the three-dimensional information of the targets.
After the short-lived objects are removed, the mean of the three-dimensional coordinates and bounding boxes of each target in the group to be optimized is recomputed over the filtered scene graph set and taken as the optimized target's three-dimensional coordinates and three-dimensional bounding box, and the local scene graph is generated from these optimized values.
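A compact sketch of this filter-and-average step follows; the per-frame data layout and the min_count threshold are assumptions for illustration.

```python
# Drop targets seen fewer than `min_count` times across the group, then
# average each surviving target's 3-D centre over the frames.
from collections import defaultdict
import numpy as np

def build_local_graph(frames, min_count=3):
    """frames: list of {target_id: 3-D centre} dicts, one per frame."""
    tracks = defaultdict(list)
    for frame in frames:
        for tid, center in frame.items():
            tracks[tid].append(center)
    return {tid: np.mean(centers, axis=0)      # optimised 3-D coordinate
            for tid, centers in tracks.items()
            if len(centers) >= min_count}      # filter rare (spurious) targets
```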
Step S144: and merging and updating the local scene graphs according to the target coordinate information, the target category information and the target information in the generated global scene graph to generate the global scene graph.
The embodiment recomputes the target relations using the targets in the already generated global scene graph and produces a target relation calculation result; according to this result, a new target or a new target relation is added, or an old target relation is updated, in the global scene graph under construction, so that the global scene graph is generated in real time.
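The merge can be sketched as follows, assuming a local-graph target is identified with an existing global target when the categories match and the centres lie within a distance threshold; the matching rule, threshold and node layout are assumptions.

```python
# Merge a local scene graph into the global one: matched targets are updated,
# unmatched targets are inserted as new nodes.
import numpy as np

def merge_into_global(global_nodes, local_nodes, radius=0.3):
    """Both arguments: lists of dicts {'category', 'center', 'relations'},
    where 'relations' is a dict keyed by (relation, other_target_id)."""
    for node in local_nodes:
        match = next((g for g in global_nodes
                      if g["category"] == node["category"] and
                      np.linalg.norm(np.asarray(g["center"]) -
                                     np.asarray(node["center"])) < radius),
                     None)
        if match is None:
            global_nodes.append(node)              # new target
        else:
            match["center"] = node["center"]       # update old target
            match["relations"].update(node["relations"])
    return global_nodes
```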
In a specific embodiment, the steps of processing a scene graph formed by a preset number of frames to be optimized to generate a local scene graph, merging and updating the local scene graph to generate a global scene graph further include:
and when the number of frames in the to-be-optimized group exceeds the preset number of frames to be optimized, performing filtering optimization on the scene graph of the key frame group to generate a local scene graph.
According to the cognitive navigation method for structured scene expression provided by this embodiment, the target scene image sequence and the target prior knowledge are combined to obtain local scene graphs, which are merged and updated into a global scene graph; three-dimensional scene graphs can thus be constructed for multiple scenes and at finer granularity, improving the orderliness of targets in the three-dimensional scene graph and the navigation accuracy. The structured expression of target prior knowledge is introduced, so more scene information can be obtained from a limited range of perception information, and the mapping effect is further optimized through the generated scene graph. By acquiring the targets' two-dimensional information and the pose of the image acquisition device and combining them with the depth image, structured information containing object attributes, three-dimensional coordinates and inter-target relations is estimated, and the optimal scene graph information is obtained according to the target prior knowledge information, improving the accuracy of scene graph construction.
Example 2
An embodiment of the present invention provides a cognitive navigation system for structured scene expression, as shown in fig. 7, including:
the image and image sequence acquisition module 1 is used for acquiring a target scene image by utilizing image acquisition equipment to obtain a corresponding image sequence, wherein the image comprises a depth image and a color image, and the image sequence comprises a depth image sequence and a color image sequence; this module performs the method described in step S1 in embodiment 1, and will not be described here again.
The two-dimensional information and three-dimensional information acquisition module 2 is used for acquiring two-dimensional information and three-dimensional information of each target in each frame of image by utilizing the target scene image, the image sequence and the parameters of the image acquisition equipment; this module performs the method described in step S2 in embodiment 1, and will not be described here.
The optimal scene graph information acquisition module 3 is used for acquiring optimal scene graph information according to the two-dimensional information, the three-dimensional information and the target priori knowledge information of each target in each frame of image; this module performs the method described in step S3 in embodiment 1, and will not be described here.
The global scene graph generating module 4 is used for processing scene graphs formed by a preset number of frames to be optimized to generate local scene graphs, and combining and updating the local scene graphs to generate a global scene graph; this module performs the method described in step S4 in embodiment 1, and will not be described here.
The path planning and navigation module 5 is used for acquiring target coordinates according to the target information in the global scene graph, planning a path according to the target coordinates and navigating; this module performs the method described in step S5 in embodiment 1, and will not be described here.
According to the cognitive navigation system for structured scene expression provided by this embodiment, the target scene image sequence and the target prior knowledge are combined to obtain local scene graphs, which are merged and updated into a global scene graph; three-dimensional scene graphs can thus be constructed for multiple scenes and at finer granularity, improving the orderliness of targets in the three-dimensional scene graph and the navigation accuracy. The structured expression of target prior knowledge is introduced, so more scene information can be obtained from a limited range of perception information, and the mapping effect is further optimized through the generated scene graph. By acquiring the targets' two-dimensional information and the pose of the image acquisition device and combining them with the depth image, structured information containing object attributes, three-dimensional coordinates and inter-target relations is estimated, and the optimal scene graph information is obtained according to the target prior knowledge information, improving the accuracy of scene graph construction.
Example 3
An embodiment of the present invention provides a terminal device, as shown in fig. 8, comprising: at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404 and at least one communication bus 402, where the communication bus 402 enables connected communication between these components. The communication interface 403 may include a display and a keyboard, and may optionally further include a standard wired interface and a wireless interface. The memory 404 may be a high-speed volatile random access memory (RAM) or a non-volatile memory, such as at least one disk memory; optionally, the memory 404 may also be at least one storage device located remotely from the processor 401. The memory 404 stores a set of program codes, and the processor 401 calls the program codes stored in the memory 404 to execute the cognitive navigation method of structured scene expression of Embodiment 1.
The communication bus 402 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus and so on. For ease of illustration, only one line is shown in fig. 8, but this does not mean that there is only one bus or one type of bus.
The memory 404 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 404 may also include a combination of the above types of memory.
The processor 401 may be a central processing unit (CPU), a network processor (NP) or a combination of a CPU and an NP.
The processor 401 may further comprise a hardware chip, which may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL) or any combination thereof.
Optionally, the memory 404 also stores program instructions, and the processor 401 may invoke these program instructions to execute the cognitive navigation method of structured scene expression of Embodiment 1.
The embodiment of the invention also provides a computer-readable storage medium storing computer-executable instructions that can execute the cognitive navigation method of structured scene expression of Embodiment 1. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the storage medium may also comprise a combination of the above types of memory.
The above examples are given by way of illustration only and are not limiting. Other variations or modifications will be apparent to those of ordinary skill in the art from the above teaching; it is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom fall within the scope of the present invention.

Claims (9)

1. A cognitive navigation method of structured scene expressions, comprising:
acquiring a target scene image by using image acquisition equipment to obtain a corresponding image sequence, wherein the image comprises a depth image and a color image, and the image sequence comprises a depth image sequence and a color image sequence;
obtaining two-dimensional information and three-dimensional information of each target in each frame of image by using the target scene image, the image sequence and parameters of the image acquisition equipment;
obtaining optimal scene graph information according to the two-dimensional information, the three-dimensional information and the target prior knowledge information of each target in each frame of image;
processing scene graphs formed by a preset number of frames to be optimized to generate a local scene graph, and merging and updating the local scene graph to generate a global scene graph;
acquiring target coordinates according to target information in the global scene graph, planning a path according to the target coordinates, and navigating;
the step of processing the scene graph formed by the preset number of frames to be optimized to generate a local scene graph, merging and updating the local scene graph to generate a global scene graph comprises the following steps: storing initial scene graph information of a preset number of frames to be optimized into a group to be optimized; when the occurrence frequency of any target in the initial scene graph of the group to be optimized is smaller than a preset occurrence frequency threshold, filtering the target from the initial scene graph to generate a filtered scene graph set; re-calculating the average value of the three-dimensional information of all targets in the filtered scene graph set, and generating a local scene graph according to the re-calculated average value of the three-dimensional information of the targets; and merging and updating the local scene graphs according to the target coordinate information, the target category information and the target information in the generated global scene graph to generate the global scene graph.
2. The method for cognitive navigation of a structured scene representation of claim 1, wherein the target prior knowledge information is obtained before the target scene image is acquired, and the process of obtaining the target prior knowledge information comprises:
screening and cleaning the preset target data set, classifying the screened and cleaned preset target data set according to a preset scene image classification method, and generating various types of scene images;
counting the probability of occurrence of targets, the probability of target attributes and the probability of relations in each type of scene image;
constructing an AND-OR graph structure according to the attributes of the targets and the relations between the targets;
and filling the probability of target occurrence, the probability of target attributes and the probability of relations in each type of scene image into the AND-OR graph structure to generate the target prior knowledge information.
3. The cognitive navigation method of a structured scene representation according to claim 1, wherein the step of obtaining two-dimensional information and three-dimensional information of each object in each frame of image using parameters of the object scene image, the image sequence and the image acquisition device comprises:
estimating the pose of the image acquisition equipment by using an image sequence and a SLAM method, and acquiring the pose of the image acquisition equipment by taking the position of the first frame image as the origin of coordinates;
detecting all targets in each frame of image by utilizing a color image and a target detection method, and acquiring two-dimensional information of each target in each frame of image;
and acquiring three-dimensional information of the targets in each frame of image according to the depth image, the pose of the image acquisition equipment, the parameters of the image acquisition equipment and the two-dimensional information of each target, wherein the three-dimensional information comprises three-dimensional coordinate information and three-dimensional boundary box information of the targets.
4. The cognitive navigation method of a structured scene representation according to claim 1, wherein the step of obtaining optimal scene graph information according to the two-dimensional information, three-dimensional information and target prior knowledge information of each target in each frame of image comprises:
estimating the relation among targets and the probability thereof according to the three-dimensional information of each target in each frame of image, wherein the probability comprises the probability of the occurrence of the targets, the probability of the attribute of the targets and the probability of the relation of the targets in each frame of image;
generating a scene graph of each frame of image according to the relation between the corresponding targets in each frame of image and the two-dimensional information of the targets;
and optimizing the scene graph of each frame of image according to the target prior knowledge information and the relations between targets and their probabilities to obtain the optimal scene graph information.
5. The cognitive navigation method of structured scene representation according to claim 1, wherein the steps of processing a scene graph formed by a preset number of frames to be optimized to generate a local scene graph, merging and updating the local scene graph to generate a global scene graph further comprise:
and when the number of frames in the to-be-optimized group exceeds the preset number of frames to be optimized, performing filtering optimization on the scene graph of the key frame group to generate a local scene graph.
6. The method of cognitive navigation of a structured scene representation of claim 1, wherein the step of updating a plurality of local scene graphs comprises:
re-performing target relation calculation by utilizing the targets in the generated global scene graph, and generating a target relation calculation result;
and adding a new target or a new target relation, or updating an old target relation, in the global scene graph under construction according to the target relation calculation result.
7. A cognitive navigation system of structured scene representations, comprising:
the image and image sequence acquisition module is used for acquiring a target scene image by utilizing image acquisition equipment to obtain a corresponding image sequence, wherein the image comprises a depth image and a color image, and the image sequence comprises a depth image sequence and a color image sequence;
the two-dimensional information and three-dimensional information acquisition module is used for acquiring two-dimensional information and three-dimensional information of each target in each frame of image by utilizing the target scene image, the image sequence and the parameters of the image acquisition equipment;
the optimal scene image information acquisition module is used for acquiring optimal scene image information according to the two-dimensional information, the three-dimensional information and the target priori knowledge information of each target in each frame of image;
the global scene graph generation module is used for processing scene graphs formed by a preset number of frames to be optimized to generate local scene graphs, and combining and updating the local scene graphs to generate a global scene graph;
the path planning and navigation module is used for acquiring target coordinates according to target information in the global scene graph, planning a path according to the target coordinates and navigating;
the step of processing the scene graph formed by the preset number of frames to be optimized to generate a local scene graph, merging and updating the local scene graph to generate a global scene graph comprises the following steps: storing initial scene graph information of a preset number of frames to be optimized into a group to be optimized; when the occurrence frequency of any target in the initial scene graph of the group to be optimized is smaller than a preset occurrence frequency threshold, filtering the target from the initial scene graph to generate a filtered scene graph set; re-calculating the average value of the three-dimensional information of all targets in the filtered scene graph set, and generating a local scene graph according to the re-calculated average value of the three-dimensional information of the targets; and merging and updating the local scene graphs according to the target coordinate information, the target category information and the target information in the generated global scene graph to generate the global scene graph.
8. A terminal device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the cognitive navigation method of the structured scene representation of any of claims 1-7.
9. A computer readable storage medium having stored thereon computer instructions for causing a computer to perform the cognitive navigation method of structured scene representation according to any of claims 1-7.
CN202010166282.8A 2020-03-11 2020-03-11 Cognitive navigation method and system for structured scene expression Active CN111369688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010166282.8A CN111369688B (en) 2020-03-11 2020-03-11 Cognitive navigation method and system for structured scene expression


Publications (2)

Publication Number Publication Date
CN111369688A CN111369688A (en) 2020-07-03
CN111369688B (en) 2023-05-09

Family

ID=71211743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010166282.8A Active CN111369688B (en) 2020-03-11 2020-03-11 Cognitive navigation method and system for structured scene expression

Country Status (1)

Country Link
CN (1) CN111369688B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311424B (en) * 2022-08-02 2023-04-07 深圳市华赛睿飞智能科技有限公司 Three-dimensional reconstruction method and device of target scene, unmanned aerial vehicle and storage medium
CN115187667B (en) * 2022-09-08 2022-12-20 中国科学院合肥物质科学研究院 A method and system for precise positioning of large scenes based on cognitive understanding
CN117710702A (en) * 2023-07-31 2024-03-15 荣耀终端有限公司 Visual positioning method, device, storage medium and program product

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521128A (en) * 2011-12-08 2012-06-27 华中科技大学 Software fault tolerance method facing cloud platform

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012061549A2 (en) * 2010-11-03 2012-05-10 3Dmedia Corporation Methods, systems, and computer program products for creating three-dimensional video sequences
CN103487059B (en) * 2013-09-25 2016-12-07 中国科学院深圳先进技术研究院 A kind of Position Fixing Navigation System, device and method
CN103984696A (en) * 2014-03-28 2014-08-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 Expression storage method of animation data based on graph grammar
CN105844292B (en) * 2016-03-18 2018-11-30 南京邮电大学 A kind of image scene mask method based on condition random field and secondary dictionary learning
US10445928B2 (en) * 2017-02-11 2019-10-15 Vayavision Ltd. Method and system for generating multidimensional maps of a scene using a plurality of sensors of various types
US11676296B2 (en) * 2017-08-11 2023-06-13 Sri International Augmenting reality using semantic segmentation
CN108873908B (en) * 2018-07-12 2020-01-24 重庆大学 Robot City Navigation System Based on Visual SLAM and Network Map
CN109359564B (en) * 2018-09-29 2022-06-24 中山大学 A method and device for generating an image scene graph


Also Published As

Publication number Publication date
CN111369688A (en) 2020-07-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant