Disclosure of Invention
In order to overcome the problems in the related art, the present specification provides a remote sensing image searching method, apparatus and computer readable storage medium.
In a first aspect, a remote sensing image search method is provided, the method including:
acquiring a wide-area remote sensing image and a language instruction describing a search target;
dynamically planning, under a preset resource constraint, a sequence to be queried of a plurality of sub-regions in the wide-area remote sensing image by fusing cross-modal semantic information of the wide-area remote sensing image and the language instruction; and
executing the sequence to be queried, sequentially querying the corresponding sub-regions in the wide-area remote sensing image, and outputting a set of sub-regions that match the language instruction and contain the target.
According to the remote sensing image searching method provided by the application, the cross-modal semantic information of the wide area remote sensing image and the language instruction is fused, and the sequence to be queried of a plurality of subareas in the wide area remote sensing image is dynamically planned under the preset resource constraint, and the method comprises the following steps:
semantic alignment is carried out on the wide-area remote sensing image and the language instruction, and a fused multi-modal feature representation is obtained;
constructing a current search state based on the multi-modal feature representation, the historical search results and the remaining resources;
And under the preset resource constraint, generating sequences to be queried of a plurality of subareas in the wide-area remote sensing image according to the current search state.
According to the remote sensing image searching method provided by the application, the sequence to be queried of a plurality of subareas in the wide area remote sensing image is generated according to the current searching state and is realized through a pre-trained searching strategy model;
The search strategy model is configured to be trained through reinforcement learning to maximize cumulative rewards under resource constraints, the rewards being determined based on whether targets are present within the queried sub-region.
According to the remote sensing image searching method provided by the application, the method further comprises the following steps:
clustering the image space according to semantic features of a plurality of subareas in the wide-area remote sensing image, and constructing a graph model representing the relation among clustered areas;
generating graph guidance features for macroscopic searching based on the graph model;
The constructing the current search state comprises the following steps:
and constructing a current search state based on the multi-modal feature representation, the graph guide feature, the historical search result and the residual resources.
According to the remote sensing image searching method provided by the application, the nodes of the graph model represent a plurality of clustering areas obtained by clustering the image space, and the edges of the graph model represent the association relation among the clustering areas.
According to the remote sensing image searching method provided by the application, the characteristic representation of the node in the graph model is dynamically updated according to the historical searching result in the inquiring process.
According to the remote sensing image searching method provided by the application, the resource constraint comprises query times constraint and/or movement cost constraint among sub-areas.
According to the remote sensing image searching method provided by the application, the movement cost constraint is determined based on Manhattan distance between subareas.
In a second aspect, an electronic device is provided, including a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the remote sensing image searching method according to the first aspect is implemented.
In a third aspect, a computer readable storage medium is provided, where a remote sensing image search program is stored, and when executed, implements any one of the remote sensing image search methods described in the first aspect above.
The application also provides a computer program product comprising a computer program which when executed by a processor implements a remote sensing image search method as described in any one of the above.
In view of the current difficulty of quickly locating a specific target in a wide-area remote sensing image, the remote sensing image search method, apparatus, and computer-readable storage medium provided herein have the following beneficial effects:
First, a wide-area remote sensing image and a language instruction describing the search target are acquired, and semantic alignment is achieved by fusing the cross-modal semantic information of the two, so that the search process is guided by the language instruction. This improves the efficiency and accuracy of target sub-region discovery with massive data and in complex environments, and lays a foundation for efficient localization of the target.
Second, the sequence to be queried of a plurality of sub-regions in the wide-area remote sensing image is dynamically planned under limited resources, so that high-value regions are explored first and exploration and exploitation are effectively balanced. When the sequence to be queried is executed, the corresponding sub-regions in the wide-area remote sensing image are queried in turn, and a set of sub-regions that match the language instruction and contain the target is output. The coordinated use of multi-modal information improves the overall response speed and level of intelligence of the system, making the method suitable for resource-limited scenarios such as space-based Earth observation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Detailed Description
The technical solutions in the embodiments (or "implementations") of the present application will be clearly and completely described herein with reference to the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated.
If there are terms (e.g., upper, lower, left, right, front, rear, inner, outer, top, bottom, center, vertical, horizontal, longitudinal, lateral, length, width, counterclockwise, clockwise, axial, radial, circumferential, etc.) related to directional indications or positional relationships in embodiments of the present application, such terms are used merely to explain the relative positional relationships, movement, etc. between the components at a particular pose (as shown in the drawings), and if the particular pose is changed, the directional indications or positional relationships are correspondingly changed. In addition, the terms "first", "second", etc. in the embodiments of the present application are used for descriptive convenience only and are not to be construed as indicating or implying relative importance.
The application provides a remote sensing image searching method, remote sensing image searching equipment and a computer readable storage medium. The present application will be described in detail with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other.
Wide-area remote sensing images typically cover a large extent, e.g., 30000 x 30000 pixels, and quickly locating specific targets (e.g., vehicles, buildings, disaster areas) in such images is a challenging task.
The existing non-myopic search technique applies a visual active exploration framework based on deep reinforcement learning to wide-area geospatial exploration, and introduces a meta-learning method to improve the adaptability and efficiency of the model on new tasks. However, this method relies on visual cues, has limited exploration ability in complex environments, and searches inefficiently.
In order to solve the above technical problems, the present disclosure provides a remote sensing image searching method, and referring to fig. 1, fig. 1 is a schematic flow chart of a remote sensing image searching method according to an embodiment of the present disclosure.
The method aims at guiding a visual search process by utilizing language instructions, and dynamically planning a search path through reinforcement learning so as to realize efficient positioning of targets in a wide-area remote sensing image under resource constraint.
In other words, with the VLAS (Visual-Language Active Search) technique, the sub-regions of the wide-area remote sensing image are screened sequentially in priority order under the guidance of a specific language instruction, so that as many target objects as possible are covered within the budget or resource constraints. Furthermore, the inherent spatial correlation of adjacent sub-regions can provide important clues for the search process. The method balances exploration (improving model efficiency) and exploitation (finding more targets) under limited resources by combining a machine learning model that predicts target labels with a customized algorithmic strategy.
It should be noted that the remote sensing image search scheme can be deployed directly on an embodied intelligent satellite cluster, moving the processing from the ground to space. Because the search algorithm is executed actively on orbit, the transmission load on the space-ground network is reduced, and the flexibility and response speed of task execution are improved.
The present application provides a first embodiment of a remote sensing image searching method, and referring to fig. 2, fig. 2 is a schematic flow chart of the first embodiment of the remote sensing image searching method provided in the embodiments of the present specification.
Specifically, the method comprises the following steps 101 to 103:
In step 101, a wide-area remote sensing image and language instructions describing a search target are acquired.
First, the search scene and parameters are initialized, as follows:
A wide-area remote sensing image is acquired and divided into grid cells to obtain a plurality of sub-regions. The image is thereby represented as a set of sub-regions whose size equals the total number of grid cells into which the image is divided, each element representing one sub-region of the wide-area image. The key sub-regions containing the target are subsequently screened out from these sub-regions, so as to locate the target efficiently.
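The grid division described above can be sketched as follows. This is only a minimal illustration: the function name `split_into_grid`, the row-major numbering, and the 5000-pixel cell size are assumptions made for the example and are not specified in the text.

```python
from typing import List, Tuple

# (row_start, col_start, row_end, col_end), with the end coordinates exclusive
Box = Tuple[int, int, int, int]


def split_into_grid(height: int, width: int, cell: int) -> List[Box]:
    """Divide an image of the given size into grid cells in row-major order.

    Cells on the bottom or right border are clipped to the image boundary.
    """
    if cell <= 0:
        raise ValueError("cell size must be positive")
    boxes: List[Box] = []
    for r in range(0, height, cell):
        for c in range(0, width, cell):
            boxes.append((r, c, min(r + cell, height), min(c + cell, width)))
    return boxes


# A 30000 x 30000 image with an assumed cell size of 5000 pixels
boxes = split_into_grid(30000, 30000, 5000)
print(len(boxes))  # 36 sub-regions in a 6 x 6 grid
print(boxes[7])    # (5000, 5000, 10000, 10000)
```

The index of a box in the returned list then serves as the sub-region number used by later steps.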
At the same time, a language instruction describing the search target is acquired.
The language instruction explicitly defines the visual target to be searched. Its semantic content includes, but is not limited to, a description of the target object category and, optionally, its visual attributes, spatial relationships, or scene context, such as "identify areas where large vehicles are present". In remote sensing image search, the language instruction serves as prior knowledge that guides the semantic understanding and decision direction of the entire subsequent visual search process.
As one example, the language instruction may take the form of, but is not limited to, text, speech, or a standard instruction generated from a structured template such as a form.
Further, a preset resource constraint for searching is set.
As an example, the resource constraints include a query number constraint and/or a movement cost constraint between sub-regions.
A total query budget is preset, and the historical search results, the remaining query budget, and the time step are initialized. The query budget takes into account both the number of queries executed and the cost of moving between consecutively queried sub-regions, where the movement cost constraint is determined based on the Manhattan distance between the sub-regions to be queried.
With this setting, high-value regions can be explored first under limited resources and budget, and exploration and exploitation can be effectively balanced.
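As a hedged illustration of a budget that accounts for both the number of queries and the Manhattan movement cost, the sketch below charges a fixed cost per query plus a cost per grid step. The helper names and the weights `per_query` and `per_step` are hypothetical; the text does not state how the two costs are combined.

```python
from typing import Tuple


def cell_coords(index: int, cols: int) -> Tuple[int, int]:
    """Convert a row-major sub-region number into (row, col) grid coordinates."""
    return divmod(index, cols)


def manhattan(a: int, b: int, cols: int) -> int:
    """Manhattan distance between two sub-regions, measured in grid cells."""
    (ra, ca), (rb, cb) = cell_coords(a, cols), cell_coords(b, cols)
    return abs(ra - rb) + abs(ca - cb)


def query_cost(prev: int, nxt: int, cols: int,
               per_query: float = 1.0, per_step: float = 0.5) -> float:
    """Cost of one query: a fixed query cost plus a movement cost.

    per_query and per_step are illustrative weights, not values from the text.
    """
    return per_query + per_step * manhattan(prev, nxt, cols)


budget = 10.0
# Sub-region 0 is at (0, 0) and sub-region 7 is at (1, 1) in a 6-column grid
budget -= query_cost(prev=0, nxt=7, cols=6)
print(budget)  # 8.0
```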
Step 102, dynamically planning to-be-queried sequences of a plurality of subareas in the wide-area remote sensing image under preset resource constraint by fusing cross-modal semantic information of the wide-area remote sensing image and the language instruction.
Faced with the tens to hundreds of terabytes of data generated daily by the global satellite network, VLAS fuses multi-modal information so that no random search is needed: given only a specific language instruction, the system combines the visual observations of an unmanned aerial vehicle or remote sensing satellite with the human text prompt and explores high-probability regions first, so that effort is focused on the sub-regions most likely to contain the target. The method can therefore quickly lock onto key sub-regions in complex wide-area images when data is massive and computing power is limited, and improves the resource allocation of subsequent tasks such as target detection.
In some embodiments, the dynamically planning the query sequence of the plurality of sub-regions in the wide-area remote sensing image under the preset resource constraint by fusing the cross-modal semantic information of the wide-area remote sensing image and the language instruction includes the following steps 1021 to 1023:
In step 1021, semantic alignment is performed on the wide-area remote sensing image and the language instruction, so as to obtain a fused multi-modal feature representation.
As an example, a cross-modal encoder based on CLIP or another foundation model extracts image features and language instruction features separately, and multi-modal semantic fusion is achieved through feature alignment.
Specifically, the image and the language instruction are encoded as follows:
The image is encoded as image features, with the spatial location information preserved.
The language instruction is encoded as instruction features, whose size is given by the language feature dimension.
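One common way to exploit aligned image and language features is to score each sub-region by its cosine similarity to the instruction feature. The sketch below assumes this scoring and uses made-up two-dimensional vectors in place of real encoder outputs; it does not call CLIP, and the text does not state that cosine similarity is the alignment measure used.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def relevance_map(region_feats: Sequence[Sequence[float]],
                  instr_feat: Sequence[float]) -> List[float]:
    """Score every sub-region feature against the instruction feature."""
    return [cosine(f, instr_feat) for f in region_feats]


# Hypothetical embeddings; a real system would obtain them from the encoder
region_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
instr_feat = [1.0, 0.0]
scores = relevance_map(region_feats, instr_feat)
print([round(s, 3) for s in scores])  # [1.0, 0.0, 0.707]
```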
In step 1022, a current search state is constructed based on the multi-modal representation, historical search results, and remaining resources.
The image features, the instruction features, the historical search results, and the remaining budget are fused to generate an initial state.
The historical search results use a three-valued flag to record, for each sub-region, one of three states: searched with the target present, searched with the target absent, or not yet searched.
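A minimal sketch of the three-valued history and of a state built from it is given below. The numeric encoding (1, -1, 0) and the use of plain concatenation as the fusion step are assumptions for illustration; the text specifies neither.

```python
from typing import List, Sequence

# Assumed encoding of the three-valued flag
UNSEARCHED, TARGET_ABSENT, TARGET_PRESENT = 0, -1, 1


def init_history(n_regions: int) -> List[int]:
    """All sub-regions start in the 'not yet searched' state."""
    return [UNSEARCHED] * n_regions


def build_state(relevance: Sequence[float], history: Sequence[int],
                remaining_budget: float) -> List[float]:
    """Concatenate per-region features, history flags and the budget.

    Concatenation stands in for the fusion step, which the text leaves open.
    """
    return list(relevance) + [float(h) for h in history] + [float(remaining_budget)]


history = init_history(3)
history[1] = TARGET_PRESENT
state = build_state([0.2, 0.9, 0.4], history, remaining_budget=5)
print(state)  # [0.2, 0.9, 0.4, 0.0, 1.0, 0.0, 5.0]
```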
In step 1023, under a preset resource constraint, generating a sequence to be queried of a plurality of sub-areas in the wide-area remote sensing image according to the current search state.
In some embodiments, the generating the to-be-queried sequences of the plurality of sub-regions in the wide-area remote sensing image according to the current search state is implemented through a pre-trained search strategy model;
The search strategy model outputs an action according to the current search state, thereby selecting the sub-regions to be queried. For ease of description, the search strategy model adopting VLAS is referred to below as the controller.
Illustratively, at each time step, the controller generates an action from the current state, and the action selects the numbers of the sub-regions to be queried, a given number of sub-regions being selected per query. The controller may be a decision neural network that takes the current state as input and outputs the action.
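The selection step can be illustrated with a simple stand-in for the controller: given per-region scores (which a trained decision network would produce), pick the highest-scoring sub-regions that have not been searched yet. The function name and the convention that 0 marks an unsearched sub-region are assumptions for this sketch.

```python
from typing import List, Sequence


def select_subregions(scores: Sequence[float], history: Sequence[int],
                      k: int) -> List[int]:
    """Return the numbers of the k best-scoring unsearched sub-regions.

    history[i] == 0 is assumed to mean that sub-region i is unsearched.
    Ties are broken in favour of the lower sub-region number.
    """
    candidates = [i for i, h in enumerate(history) if h == 0]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:k]


scores = [0.2, 0.9, 0.5, 0.7]
history = [0, 1, 0, 0]          # sub-region 1 has already been searched
print(select_subregions(scores, history, k=2))  # [3, 2]
```

Masking out searched sub-regions keeps the budget from being spent twice on the same cell.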
Step 103, executing the sequence to be queried, sequentially querying the corresponding sub-regions in the wide-area remote sensing image, and outputting a set of sub-regions that match the language instruction and contain the target.
The actions in the obtained sequence to be queried are applied to the sub-region selection operation, and the image data of each sub-region specified in the sequence is processed in turn according to the sub-region numbers given by the sequence. Illustratively, for each sub-region, a visual perception model is invoked to analyze its content, determine whether it matches the search target described by the natural language instruction, and generate a binary query result (target present or target absent). Illustratively, the visual perception model may be, but is not limited to, an object detection model (e.g., YOLO, Fast R-CNN) or an image classification model.
After the whole sequence to be queried has been executed, the system integrates all successful query results to generate a position information set containing all sub-regions in which a target was hit, and this set is the final output of the active search task. The position information set may be, but is not limited to, a list of coordinates or a set of sub-region bounding boxes.
In some embodiments, the search policy model is configured to be trained by reinforcement learning to maximize cumulative rewards under resource constraints, the rewards being determined based on whether targets are present within the queried sub-region.
The action is applied to sub-region selection, and the following operations are performed:
a. Acquire the immediate reward. The reward is calculated from the target presence label of the queried sub-region, where one label value indicates that a target object is present and the other indicates that it is absent.
b. Update the historical search results. For each explored sub-region, that is, each sub-region whose number was selected by the action, the corresponding entry is set according to the query result, yielding the new historical search results.
That is, the entry of a sub-region takes one value when the query confirms that the target is present, a second value when the query confirms that it is absent, and a third value if the sub-region has not been searched.
c. Update the remaining budget. The movement cost is calculated from the Manhattan distance between the currently selected sub-region and the sub-region selected in the previous step, and the remaining budget is updated accordingly.
d. Update the current search state. A new state is generated.
Thereafter, training data is collected. The transition tuples are recorded for use in optimizing the controller.
The search flow terminates when the termination condition is met. As an example, if the remaining budget is exhausted, the search flow is terminated; otherwise, the time step is incremented and the flow returns to step 102 to continue the iteration.
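The reward, history update, budget update, and termination check described above can be tied together in one loop. The sketch below is illustrative only: a greedy rule replaces the trained controller, the cost model (one unit per query plus the Manhattan distance) and the flag encoding are assumptions, and the loop stops as soon as the remaining budget cannot pay for the next query.

```python
from typing import List, Sequence, Set, Tuple


def manhattan(a: int, b: int, cols: int) -> int:
    """Manhattan distance between two row-major sub-region numbers."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    return abs(ra - rb) + abs(ca - cb)


def run_search(scores: Sequence[float], targets: Set[int], cols: int,
               budget: int, start: int = 0) -> Tuple[List[int], List[int], int]:
    """Greedy stand-in for the controller: query the best unsearched cell."""
    history = [0] * len(scores)   # 0 unsearched, 1 present, -1 absent (assumed)
    prev, visited = start, []
    while True:
        candidates = [i for i, h in enumerate(history) if h == 0]
        if not candidates:
            break
        action = max(candidates, key=lambda i: scores[i])
        cost = 1 + manhattan(prev, action, cols)   # assumed cost model
        if cost > budget:
            break                                  # budget cannot pay for the query
        reward = 1 if action in targets else 0     # immediate reward
        history[action] = 1 if reward else -1      # update historical results
        budget -= cost                             # update remaining budget
        visited.append(action)
        prev = action
    hits = [i for i in visited if history[i] == 1]
    return visited, hits, budget


# 2 x 2 grid, target hidden in sub-region 3, budget of 5 units
print(run_search([0.1, 0.8, 0.3, 0.9], {3}, cols=2, budget=5))
# ([3, 1], [3], 0)
```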
In some embodiments, the search path is optimized by reinforcement learning strategies.
As one example, a controller is trained using Reinforcement Learning (RL) in combination with Supervised Learning (SL).
First, the loss value of the loss function is calculated.
Based on the collected transition tuples, the loss is calculated as a weighted sum of the reinforcement learning loss and the supervised learning loss,
where the weighting coefficient is a hyperparameter.
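The text only states that the loss is a weighted sum of a reinforcement learning loss and a supervised learning loss. As one possible instance, the sketch below assumes a REINFORCE-style policy-gradient loss for the former and a binary cross-entropy loss on target presence for the latter, with the weight `lam` applied to the supervised term. All three choices are assumptions, not the method disclosed here.

```python
import math
from typing import Sequence


def reinforce_loss(log_probs: Sequence[float], rewards: Sequence[float],
                   gamma: float = 1.0) -> float:
    """Policy-gradient surrogate: -sum_t log pi(a_t | s_t) * G_t."""
    returns, g = [], 0.0
    for r in reversed(rewards):       # accumulate discounted returns backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return -sum(lp * ret for lp, ret in zip(log_probs, returns))


def bce_loss(probs: Sequence[float], labels: Sequence[int],
             eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted and true target presence."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def total_loss(rl: float, sl: float, lam: float = 0.5) -> float:
    """Weighted sum of the two losses; lam is the assumed hyperparameter."""
    return rl + lam * sl


rl = reinforce_loss([math.log(0.5), math.log(0.5)], [1.0, 0.0])
sl = bce_loss([0.5], [1])
print(round(total_loss(rl, sl), 4))  # 1.0397
```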
Next, the controller parameters are updated.
Based on the previously calculated loss value, the parameters of the policy controller and of the basic backbone network are updated by a back-propagation algorithm, with the gradient computed so as to maximize the cumulative reward.
Finally, through iterative execution of sub-region selection, strategy optimization, and controller parameter updating, a result that maximizes target discovery within the budget limit is output.
That is, a serialized sub-region selection is output, in which each element represents the set of sub-regions selected at the corresponding step, ultimately maximizing the discovery of target objects under the budget constraint.
Through this embodiment, a wide-area remote sensing image and a language instruction describing the search target are acquired, and semantic alignment is achieved by mapping the image sub-regions and the language instruction into the same semantic space. On this basis, a reinforcement learning agent (the controller) is constructed, which decides which sub-region to query next according to the fused multi-modal state (image features, language features, historical search state, and remaining budget) and continuously optimizes the search strategy through reward signals, so that the sub-regions where the target is located are found quickly. Meanwhile, fusing the cross-modal semantic information of the wide-area remote sensing image and the language instruction enhances adaptability to complex scenes, so that the target can be located efficiently in complex wide-area images.
The present application provides a second embodiment of a remote sensing image searching method, and referring to fig. 3, fig. 3 is a schematic flow chart of the second embodiment of the remote sensing image searching method provided in the embodiments of the present specification.
This method improves upon the basic visual-language active search (VLAS) method and is referred to as the graph-enhanced visual-language active search (PAGE) method. It achieves efficient localization of target sub-regions in complex scenes by introducing a hierarchical controller architecture and a dynamic graph model. For convenience of description, the search strategy model of this embodiment is referred to below as the hierarchical controller.
Specifically, the method comprises the following steps 201 to 203:
Step 201, acquiring a wide-area remote sensing image and a language instruction describing a search target.
An aerial or satellite image is received and partitioned into blocks, and a language instruction is input. The number of blocks equals the total number of grid cells into which the image is divided, and each block represents one sub-region of the wide-area image.
A total query budget is preset, and the historical search results, the remaining query budget, the initial queried sub-region features, and the time step are initialized. The query budget takes into account both the number of queries executed and the Manhattan distance between consecutively queried sub-regions.
Step 202, dynamically planning to-be-queried sequences of a plurality of subareas in the wide-area remote sensing image under preset resource constraint by fusing cross-modal semantic information of the wide-area remote sensing image and the language instruction.
First, the image and the language instruction are encoded by a cross-modal encoder based on CLIP or another foundation model:
The image is encoded as image features, with the spatial location information preserved.
The language instruction is encoded as instruction features, whose size is given by the language feature dimension.
Next, a knowledge graph based on visual-language data is constructed, in which each node represents the mean features of a cluster of sub-regions and each edge represents the feature similarity between clusters. The knowledge graph models the commonalities among sub-region features, thereby improving the accuracy of active search.
In some embodiments, image spaces are clustered according to semantic features of a plurality of subareas in the wide-area remote sensing image, a graph model representing relations among clustered areas is built, graph guide features for macroscopic search are generated based on the graph model, and a current search state is built based on the multi-mode feature representation, the graph guide features, historical search results and residual resources.
The nodes of the graph model represent a plurality of clustering areas obtained by clustering the image space, and the edges of the graph model represent association relations among the clustering areas.
Specifically, the construction process uses clustering-based region division: the semantic features of all sub-regions of all images in the training set are clustered with the K-Means algorithm and divided into a preset number of numbered cluster regions.
Each graph node corresponds to one cluster region and is composed of the average image features and the average language features of all sub-regions in that cluster region. The average image features are obtained by extracting the sub-region pixels, encoding them with the cross-modal encoder, and taking the mean; the average language features are extracted from the sub-region categories by the cross-modal encoder.
Meanwhile, the adjacency probability between cluster regions, that is, the average probability that their sub-regions are physically adjacent, is calculated and normalized with the Sinkhorn-Knopp algorithm to generate a normalized adjacency matrix, which defines the edges of the graph model. Finally, a graph convolutional network is built on the graph. The graph convolutional network processes the graph structure according to the features of the queried sub-regions and extracts the graph guidance features.
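For reference, Sinkhorn-Knopp normalization alternately rescales the rows and columns of a positive matrix until both sum to approximately 1. The sketch below shows this on a small made-up adjacency-probability matrix; the iteration count, the `eps` guard, and the input values are illustrative assumptions.

```python
from typing import List, Sequence


def sinkhorn_knopp(matrix: Sequence[Sequence[float]], iters: int = 50,
                   eps: float = 1e-9) -> List[List[float]]:
    """Alternate row and column normalization of a positive square matrix."""
    m = [[float(v) for v in row] for row in matrix]
    n = len(m)
    for _ in range(iters):
        # Row step: make every row sum to (approximately) 1
        for i in range(n):
            s = sum(m[i]) + eps
            m[i] = [v / s for v in m[i]]
        # Column step: make every column sum to (approximately) 1
        for j in range(n):
            s = sum(m[i][j] for i in range(n)) + eps
            for i in range(n):
                m[i][j] /= s
    return m


# Hypothetical adjacency probabilities between two cluster regions
normalized = sinkhorn_knopp([[2.0, 1.0], [1.0, 3.0]])
print([round(sum(row), 6) for row in normalized])  # [1.0, 1.0]
```

The result can then be used as the edge weights of the graph over which the graph convolutional network operates.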
Thereafter, the image features, the instruction features, the graph guidance features, the historical search results, and the remaining budget are fused to generate the current search state.
And under the preset resource constraint, generating sequences to be queried of a plurality of subareas in the wide-area remote sensing image according to the current search state.
As one example, sub-region features are calculated first.
At each time step, all sub-regions in the previously selected sub-region set are encoded by the cross-modal encoder, and the mean of the resulting features is taken as the queried sub-region feature.
The node numbers corresponding to the current sub-region and the target sub-region are then identified in the graph.
The current region is the node whose features are closest, in Euclidean distance, to the queried sub-region feature, and the target region is the node determined from the one-hot vector representing the category of the target object to be searched, as given in the language instruction.
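Identifying the current node by Euclidean distance amounts to a nearest-neighbour lookup over the node features. A minimal sketch, with made-up two-dimensional node features and a hypothetical function name:

```python
import math
from typing import Sequence


def nearest_node(feature: Sequence[float],
                 node_feats: Sequence[Sequence[float]]) -> int:
    """Return the index of the graph node closest to the given feature."""
    return min(range(len(node_feats)),
               key=lambda k: math.dist(feature, node_feats[k]))


# Hypothetical mean features of three cluster nodes
node_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(nearest_node([0.9, 0.1], node_feats))  # 1
```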
At each time step, the hierarchical controller generates an action according to the current search state, selecting the numbers of the sub-regions to be queried to form the sequence to be queried, a given number of sub-regions being selected per query.
In addition, the hierarchical controller is a decision neural network: it first computes the graph guidance features through the graph convolutional network and then computes the sub-region selection in the manner of VLAS.
Step 203, executing the sequence to be queried, sequentially querying the corresponding sub-regions in the wide-area remote sensing image, and outputting a set of sub-regions that match the language instruction and contain the target.
The action is applied to sub-region selection, and the following operations are performed:
Acquire the immediate reward. The reward is calculated from the target presence label of the queried sub-region, where one label value indicates that a target object is present and the other indicates that it is absent.
Update the historical search results. For each explored sub-region, that is, each sub-region whose number was selected by the action, the corresponding entry is set according to the query result, yielding the new historical search results. In summary, the entry of a sub-region takes one value when the query confirms that the target is present, a second value when the query confirms that it is absent, and a third value if the sub-region has not been searched.
Update the remaining budget. The movement cost is calculated from the Manhattan distance between the currently selected sub-region and the sub-region selected in the previous step, and the remaining budget is updated accordingly.
Update the state. A new state is generated.
During the query process, the feature representations of the nodes in the graph model are dynamically updated according to the historical search results. Specifically, the average image features of the cluster center to which the current sub-region belongs are adjusted in real time according to the features of the selected sub-regions. The adjustment is controlled by a hyperparameter, which is typically set to a positive real number close to 0.
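The exact update rule is not recoverable from the text. One plausible form, assumed here purely for illustration, is an exponential moving average in which a small hyperparameter `alpha` controls how far the node feature moves towards the newly observed sub-region feature.

```python
from typing import List, Sequence


def update_node(node_feat: Sequence[float], region_feat: Sequence[float],
                alpha: float = 0.05) -> List[float]:
    """Move a node's mean feature slightly towards a queried sub-region feature.

    Assumed rule: new = (1 - alpha) * old + alpha * observed, with alpha near 0.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return [(1.0 - alpha) * v + alpha * f for v, f in zip(node_feat, region_feat)]


updated = update_node([1.0, 0.0], [0.0, 1.0], alpha=0.1)
print([round(v, 3) for v in updated])  # [0.9, 0.1]
```

A small `alpha` keeps the cluster centers stable while still letting them drift towards what has actually been observed in the current image.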
If the remaining budget is exhausted, the search flow is terminated; otherwise, the time step is incremented and the flow returns to step 202 to continue the iteration.
During the search, training data is collected. The transition tuples are recorded for use in optimizing the controller.
The specific optimization process is as follows:
Based on the collected transition tuples, the loss is calculated as a weighted sum of the reinforcement learning loss and the supervised learning loss, where the weighting coefficient is a hyperparameter.
The parameters of the controller and of the basic backbone network are updated by a back-propagation algorithm, with the gradient computed so as to maximize the cumulative reward.
Through the above process, a serialized sub-region selection is output, in which each element represents the set of sub-regions selected at the corresponding step, ultimately maximizing the discovery of target objects under the budget constraint.
Through this embodiment, graph-structure modeling and a hierarchical controller are introduced on the basis of VLAS, and a macro-level region graph is constructed through clustering, which enhances semantic understanding and strategic planning in complex scenes. Finer-grained decisions are made in complex wide-area images, and the efficiency and accuracy of target localization are improved.
In view of the current difficulty of quickly locating a specific target in a wide-area remote sensing image, the remote sensing image search method, apparatus, and computer-readable storage medium provided herein have the following beneficial effects:
First, a wide-area remote sensing image and a language instruction describing the search target are acquired, and semantic alignment is achieved by fusing the cross-modal semantic information of the two, so that the search process is guided by the language instruction. This improves the efficiency and accuracy of target sub-region discovery with massive data and in complex environments, and lays a foundation for efficient localization of the target.
Second, the sequence to be queried of a plurality of sub-regions in the wide-area remote sensing image is dynamically planned under limited resources, so that high-value regions are explored first and exploration and exploitation are effectively balanced. When the sequence to be queried is executed, the corresponding sub-regions in the wide-area remote sensing image are queried in turn, and a set of sub-regions that match the language instruction and contain the target is output. The coordinated use of multi-modal information improves the overall response speed and level of intelligence of the system, making the method suitable for resource-limited scenarios such as space-based Earth observation.
Based on the same inventive concept as the method, an embodiment of the application also provides a remote sensing image search apparatus. As shown in fig. 4, the apparatus includes:
the information acquisition module is used for acquiring the wide-area remote sensing image and a language instruction describing a search target;
the feature fusion module is used for dynamically planning sequences to be queried of a plurality of subareas in the wide-area remote sensing image under the preset resource constraint by fusing the cross-modal semantic information of the wide-area remote sensing image and the language instruction;
And the target searching module is used for executing the sequence to be queried, sequentially querying the corresponding subareas in the wide-area remote sensing image, and outputting a subarea set which is matched with the language instruction and contains the target.
The implementation process of the functions and actions of each module/sub-module/unit in the above device is specifically detailed in the implementation process of the corresponding steps in the above method, so that the same technical effects can be achieved, and will not be described herein again.
The application also provides an electronic device for implementing the remote sensing image search method.
Fig. 5 illustrates the physical structure of a remote sensing image search device. As shown in fig. 5, the device may include a processor 510, a communications interface 520, a memory 530, and a communication bus 540, where the processor 510, the communications interface 520, and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the remote sensing image search method.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present application also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can execute the remote sensing image searching method provided by the above methods.
In yet another aspect, the present application further provides a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform the remote sensing image searching method provided by the above methods.
It should be noted that the technical solutions or technical features described in the above embodiments may be combined or supplemented with each other without generating a conflict. The scope of the present application is not limited to the exact construction described in the above embodiments and illustrated in the accompanying drawings, but modifications, equivalents, improvements, etc. that fall within the spirit and principle of the present application are intended to be included in the scope of the present application.