Disclosure of Invention
The invention aims to provide a reinforcement learning-based adaptive method for extracting induced polarization information in the wide-area electromagnetic method. By defining sensitivity as the feature for inversion-parameter identification and adopting a reinforcement learning method, the invention realizes adaptive identification of inversion parameters and adaptive regularization setting, thereby improving the accuracy of induced polarization information extraction.
In order to achieve the purpose, the invention provides the following technical scheme:
The invention provides a reinforcement learning-based adaptive wide-area electromagnetic method for induced polarization information extraction, in which sensitivity is defined as the feature for inversion-parameter identification and reinforcement learning is adopted to realize adaptive identification of inversion parameters and regularization setting, so that intelligent induced polarization information extraction is realized.
The invention provides a reinforcement learning-based adaptive wide-area electromagnetic method for induced polarization information extraction, which comprises the following steps:
S1, setting the calculation equation of the wide-area apparent resistivity:

ρ_a = (2πr³)/(dL·MN) · (ΔV_MN/I) · 1/[1 − 3sin²φ + e^(−ikr)·(1 + ikr)]   (1)

In formula (1), r is the distance from the observation point to the center of the dipole source, i.e., the transmitter-receiver offset; dL is the length of the horizontal current source; MN is the distance between observation points M and N; ΔV_MN is the potential difference measured between M and N; ρ is the resistivity, which enters formula (1) through the wavenumber; I is the current intensity; k is the propagation constant (wavenumber) of the electromagnetic wave; i is the imaginary unit; φ is the angle between r and the current source;
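Assuming the reconstructed form of formula (1) above, the wide-area apparent resistivity can be evaluated numerically as in the following sketch. Because the wavenumber k itself depends on the resistivity, the sketch resolves the implicit definition with a simple fixed-point iteration; the geometry arguments, the starting guess, and the iteration count are illustrative.

```python
import cmath
import math

def wide_area_apparent_resistivity(dV, I, r, dL, MN, phi, freq,
                                   rho0=100.0, n_iter=50):
    """Solve formula (1) for rho_a by fixed-point iteration; the wavenumber
    k = sqrt(-i*omega*mu0/rho) makes the definition implicit in rho."""
    mu0 = 4e-7 * math.pi
    omega = 2 * math.pi * freq
    rho = rho0                                     # starting guess (ohm-m)
    for _ in range(n_iter):
        k = cmath.sqrt(-1j * omega * mu0 / rho)
        F = 1 - 3 * math.sin(phi) ** 2 + cmath.exp(-1j * k * r) * (1 + 1j * k * r)
        rho = abs(2 * math.pi * r**3 / (dL * MN) * (dV / I) / F)
    return rho
```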
S2, setting the induced polarization model as follows:

ρ(ω) = ρ_a·[1 − m·(1 − 1/(1 + (iωτ)^c))]   (2)

In formula (2), ρ(ω) is the frequency-dependent wide-area complex resistivity when the polarization effect is considered; ρ_a is the wide-area apparent resistivity when no polarization effect is considered; m is the polarizability; τ is the time constant; c is the frequency dependence coefficient; ω is the angular frequency;
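For reference, the induced polarization model of formula (2) is straightforward to evaluate numerically; the parameter values in the example below are illustrative.

```python
import cmath
import math

def complex_resistivity(rho_a, m, tau, c, omega):
    """Formula (2): rho(omega) = rho_a*(1 - m*(1 - 1/(1 + (i*omega*tau)**c)))."""
    return rho_a * (1 - m * (1 - 1 / (1 + (1j * omega * tau) ** c)))

if __name__ == "__main__":
    for f in (0.01, 0.1, 1.0, 10.0):               # frequencies in Hz
        z = complex_resistivity(rho_a=100.0, m=0.2, tau=0.5, c=0.4,
                                omega=2 * math.pi * f)
        print(f"{f:6.2f} Hz  |rho| = {abs(z):7.2f}  phase = {cmath.phase(z):+.4f} rad")
```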
S3, setting the objective function of the inversion as follows:

fit = E(e) + λ₁R(ρ) + λ₂R(m)   (3)
In formula (3), R(ρ) and R(m) are minimum-structure constraint functions for resistivity and polarizability, respectively; λ₁ and λ₂ are the regularization factors corresponding to R(ρ) and R(m). Two independent regularization factors are adopted because the value space of the polarizability (m ∈ [0,1]) differs greatly from that of the resistivity (generally ρ ≫ m); if a single uniform regularization factor were adopted, the relatively small polarizability parameter could not be constrained. E(e) is the target error function, i.e., the fitting error of the data in the inversion;
R(ρ) and R(m) are both calculated here using the following formula:

R(M) = Σ_{j=1}^{N−1} (M_{j+1} − M_j)²   (4)

In formula (4), M is a model parameter obtained by inversion, i.e., either the resistivity ρ or the polarizability m, M_j is the value of that parameter in the j-th layer, and N is the number of layers;
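For illustration, formulae (3) and (4) can be evaluated as follows for a layered model; using the mean squared data misfit for E(e) is an assumption, since the text only specifies that E(e) is the data-fitting error.

```python
import numpy as np

def min_structure(M):
    """Formula (4): sum of squared differences between adjacent layers."""
    M = np.asarray(M, dtype=float)
    return float(np.sum(np.diff(M) ** 2))

def fitness(d_obs, d_pred, rho, m, lam1, lam2):
    """Formula (3): fit = E(e) + lambda_1*R(rho) + lambda_2*R(m).
    E(e) is taken here as the mean squared data misfit (an assumption)."""
    e = np.asarray(d_obs, float) - np.asarray(d_pred, float)
    return float(np.mean(e ** 2)) + lam1 * min_structure(rho) + lam2 * min_structure(m)
```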
S4, designing a staged extraction method for the different physical-property parameters by defining sensitivity as the feature for inversion-parameter identification, and distinguishing the stage of the current inversion through the sensitivity;
The sensitivities of resistivity and polarizability are defined as follows:

S = |(fit_G − fit_{G−1}) / (M_G − M_{G−1})|   (5)

In formula (5), S is the sensitivity, G is the iteration number, fit is the fitness of formula (3), and M is a model parameter obtained by inversion, i.e., either the resistivity ρ or the polarizability m;
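A small sketch of formula (5) follows; summarizing the vector of layer values by a Euclidean norm when forming the denominator is an illustrative choice, not specified by the text.

```python
import numpy as np

def sensitivity(fit_g, fit_prev, M_g, M_prev, eps=1e-12):
    """Formula (5): change of fitness per change of the model parameter
    between iterations G-1 and G; layer vectors are reduced to a norm."""
    dM = np.linalg.norm(np.asarray(M_g, float) - np.asarray(M_prev, float))
    return abs(fit_g - fit_prev) / (dM + eps)
```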
S5, adopting reinforcement learning based on the deterministic policy gradient to realize judgment of the inversion stage and setting of the regularization coefficients;
The reinforcement learning comprises three elements: state, action, and reward, and the system is modeled in terms of these three elements, where the state is the pair of sensitivities of resistivity and polarizability, the action is the pair of regularization coefficients, and the reward is the improvement of the fitness. The system judges the inversion stage according to the current state and outputs the corresponding regularization coefficients, then calculates the reward from the inversion result to adjust the policy and the value function of the reinforcement learning. Through repeated learning until the policy and the value function are stable, the inversion stage can be judged accurately and suitable regularization coefficients can be set;
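As an illustration of this modeling, the following is a minimal sketch in Python. The inversion itself is hidden behind a hypothetical callback run_inversion_iteration(lam1, lam2), assumed to perform one inversion iteration under the given regularization coefficients and to return the new fitness together with scalar summaries of the resistivity and polarizability models; this interface and the scalar summaries are illustrative assumptions, not part of the invention.

```python
import torch

class RegularizationEnv:
    """State: the sensitivities (S_rho, S_m) of formula (5); action: the
    regularization coefficients (lambda_1, lambda_2) of formula (3);
    reward: the improvement of the fitness fit between iterations."""

    def __init__(self, run_inversion_iteration):
        self.run_iter = run_inversion_iteration  # hypothetical callback
        self.prev = None                         # (fit, rho, m) of last call

    def step(self, action):
        lam1, lam2 = float(action[0]), float(action[1])
        fit, rho, m = self.run_iter(lam1, lam2)
        if self.prev is None:                    # first call: no history yet
            state, reward = torch.zeros(2), 0.0
        else:
            fit0, rho0, m0 = self.prev
            reward = fit0 - fit                  # fitness improvement
            # Sensitivities as in formula (5): change of fitness per change
            # of each model parameter over one inversion iteration.
            s_rho = abs((fit - fit0) / (rho - rho0 + 1e-12))
            s_m = abs((fit - fit0) / (m - m0 + 1e-12))
            state = torch.tensor([s_rho, s_m])
        self.prev = (fit, rho, m)
        return state, reward
```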
S6, controlling the constraints imposed by the inversion according to the regularization coefficients generated by reinforcement learning, realizing adaptive identification of inversion parameters and regularization setting, and obtaining high-precision induced polarization information (including the resistivity and polarizability parameters).
Further, in step S5, the reinforcement learning comprises the following steps:
Step one, randomly initialize four networks: the current policy network μ, the target policy network μ′, the current Q network Q, and the target Q network Q′;
their parameters are, respectively, the current policy network parameter θ, the target policy network parameter θ′, the current Q network parameter w, and the target Q network parameter w′; the current iteration number t = 0;
Step two, take S as the initial state, input the state S into the current policy network, and obtain the action A:
A = μ(S|θ) + N
where μ(·) is the policy output by the current policy network, S is the initial state, θ is the parameter of the current policy network, and N is the exploration noise;
Step three, execute the action A in the state S to obtain the next state S′ and the reward R, and store the transition {S_t, A_t, R_t, S′_t} into the experience replay set D;
Step four, update the state S to S′; randomly sample n transitions {S_i, A_i, R_i, S′_i}, i = 1, 2, 3, …, n, from the experience replay set D, and calculate the target value y_i of the current Q network:
y_i = R_i + γQ′(S′_i, μ′(S′_i|θ′)|w′)
where R_i is the reward obtained in the state S_i by performing the action A_i, γ is the reward discount factor, Q′(·) is the Q value output by the target Q network, w′ is the parameter of the target Q network, μ′(·) is the policy output by the target policy network, and θ′ is the parameter of the target policy network;
Step five, calculate the loss L of the current Q network using the mean squared error (MSE) loss function, and update all parameters w of the current Q network through gradient back-propagation of the neural network:
L = (1/n)·Σ_{i=1}^{n} (y_i − Q(S_i, A_i|w))²
where n is the total number of sampled transitions, Q(·) is the Q value output by the current Q network, S_i is the i-th state, A_i is the i-th action, and w is the parameter of the current Q network;
Step six, update all parameters θ of the current policy network through gradient back-propagation of the neural network using the performance index function J, and increase the iteration number t by 1;
Step seven, update the target Q network parameter w′ and the target policy network parameter θ′ at a fixed period:
w′ = τw + (1 − τ)w′
θ′ = τθ + (1 − τ)θ′
where τ is the soft-update coefficient of the network parameters, θ is the current policy network parameter, and w is the current Q network parameter;
Step eight, judge whether the policy and the value function have stably converged; if the termination condition is reached, finish the training; otherwise, return to step two.
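A compact sketch of steps one to eight in PyTorch follows, using the RegularizationEnv sketched above (or any object with the same step method). The network sizes, learning rates, batch size, and noise scale are illustrative assumptions; the structure (four networks, the experience replay set D, the target value y_i, the MSE loss of step five, the performance index of step six, and the soft update of step seven) mirrors the steps above.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 2, 2        # (S_rho, S_m) -> (lambda_1, lambda_2)

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, out_dim))

# Step one: four networks, current/target policy and current/target Q.
mu, mu_t = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM, ACTION_DIM)
q, q_t = mlp(STATE_DIM + ACTION_DIM, 1), mlp(STATE_DIM + ACTION_DIM, 1)
mu_t.load_state_dict(mu.state_dict())
q_t.load_state_dict(q.state_dict())

opt_mu = torch.optim.Adam(mu.parameters(), lr=1e-3)
opt_q = torch.optim.Adam(q.parameters(), lr=1e-3)
D = deque(maxlen=10_000)            # experience replay set D
gamma, tau, batch = 0.9, 0.01, 32   # discount, soft update, sample size n

def train_step(env, S):
    # Step two: action from the current policy network plus noise N.
    with torch.no_grad():
        A = mu(S) + 0.1 * torch.randn(ACTION_DIM)
    # Step three: execute A, observe S' and R, store the transition in D.
    S_next, R = env.step(A)
    D.append((S, A, R, S_next))
    if len(D) < batch:
        return S_next
    # Step four: sample n transitions and form the target values y_i.
    sample = random.sample(D, batch)
    Sb = torch.stack([s for s, a, r, sn in sample])
    Ab = torch.stack([a for s, a, r, sn in sample])
    Rb = torch.tensor([r for s, a, r, sn in sample]).unsqueeze(1)
    Sn = torch.stack([sn for s, a, r, sn in sample])
    with torch.no_grad():
        y = Rb + gamma * q_t(torch.cat([Sn, mu_t(Sn)], dim=1))
    # Step five: MSE loss of the current Q network, back-propagate on w.
    loss_q = ((y - q(torch.cat([Sb, Ab], dim=1))) ** 2).mean()
    opt_q.zero_grad(); loss_q.backward(); opt_q.step()
    # Step six: performance index J, maximize Q under the current policy.
    loss_mu = -q(torch.cat([Sb, mu(Sb)], dim=1)).mean()
    opt_mu.zero_grad(); loss_mu.backward(); opt_mu.step()
    # Step seven: soft update of the target networks (every step here).
    with torch.no_grad():
        for p, p_t in zip(q.parameters(), q_t.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
        for p, p_t in zip(mu.parameters(), mu_t.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
    return S_next
```

In use, the state is initialized (for example S = torch.zeros(2)) and train_step is called repeatedly until the policy and value function stabilize, which corresponds to the convergence judgment of step eight.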
Further, in step S6, two types of constraints are imposed during the inversion process: first, prior-information constraints on resistivity and polarizability are applied by utilizing the known physical properties of the survey area, which reduces the search space of the inversion algorithm; second, when one physical-property parameter (the resistivity parameter or the polarizability parameter) is in its inversion stage, a limitation constraint is applied to the other physical-property parameter, i.e., the search for the other parameter is restricted to a small range, so that the influence of the dominant physical-property parameter on the fitness function is strengthened.
Therefore, the strength of different constraints is controlled by different regularization coefficients generated by reinforcement learning, and accurate multi-parameter inversion is realized.
In the early stage of inversion, the influence of the resistivity on the observed data is far greater than that of the polarizability, so the sensitivity of the resistivity is higher; that stage is therefore dominated by resistivity, a prior-information constraint is applied to the resistivity parameter, and a strong limitation constraint is applied to the polarizability parameter. In the later inversion stage, the resistivity tends to be stable and the sensitivity of the polarizability exceeds that of the resistivity; that stage is dominated by polarizability, a prior-information constraint is applied to the polarizability parameter, and a strong limitation constraint is applied to the resistivity parameter. The specific constraint to apply is likewise set by the reinforcement learning judgment of the inversion stage.
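The following sketch shows one way the two constraint types could be turned into search bounds for the inversion algorithm; the prior ranges and the width of the limitation window are illustrative assumptions.

```python
import numpy as np

def staged_bounds(stage, rho_now, m_now,
                  rho_prior=(1.0, 1000.0), m_prior=(0.0, 1.0), shrink=0.05):
    """Return (rho_bounds, m_bounds) for the current inversion stage:
    a prior-information range for the dominant parameter and a narrow
    limitation window around the current value of the other parameter."""
    rho_now, m_now = np.asarray(rho_now, float), np.asarray(m_now, float)
    if stage == "resistivity":
        rho_bounds = rho_prior                       # prior constraint on rho
        m_bounds = (np.maximum(m_prior[0], m_now * (1 - shrink)),
                    np.minimum(m_prior[1], m_now * (1 + shrink)))
    else:                                            # polarizability stage
        m_bounds = m_prior                           # prior constraint on m
        rho_bounds = (np.maximum(rho_prior[0], rho_now * (1 - shrink)),
                      np.minimum(rho_prior[1], rho_now * (1 + shrink)))
    return rho_bounds, m_bounds
```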
The invention designs a reinforcement learning-based adaptive wide-area electromagnetic method for induced polarization information extraction, so that the inversion algorithm can automatically and quickly identify whether the dominant parameter of the current inversion is the polarizability or the resistivity and perform targeted inversion, thereby improving the accuracy of induced polarization information extraction.
Compared with the prior art, the invention has the following advantages:
(1) The method can judge the current inversion state (polarizability-dominated or resistivity-dominated inversion) according to the sensitivities of the resistivity and the polarizability during the iteration, output the correct regularization coefficients, and apply the correct constraint conditions, thereby realizing intelligent induced polarization information extraction.
(2) The method can effectively solve the problem of uncertainty in multi-parameter inversion.
(3) The method can strengthen the influence of the polarizability in the later inversion stage and improve the accuracy of the extraction of the induced polarization information.
Detailed Description
The invention will be further illustrated with reference to the following specific examples and the accompanying drawings:
example 1
The invention provides a reinforcement learning-based adaptive wide-area electromagnetic method for induced polarization information extraction, which comprises the following steps:
S1, setting the calculation equation of the wide-area apparent resistivity:

ρ_a = (2πr³)/(dL·MN) · (ΔV_MN/I) · 1/[1 − 3sin²φ + e^(−ikr)·(1 + ikr)]   (1)

In formula (1), r is the distance from the observation point to the center of the dipole source, i.e., the transmitter-receiver offset; dL is the length of the horizontal current source; MN is the distance between observation points M and N; ΔV_MN is the potential difference measured between M and N; ρ is the resistivity, which enters formula (1) through the wavenumber; I is the current intensity; k is the propagation constant (wavenumber) of the electromagnetic wave; i is the imaginary unit; φ is the angle between r and the current source;
S2, setting the induced polarization model as follows:

ρ(ω) = ρ_a·[1 − m·(1 − 1/(1 + (iωτ)^c))]   (2)

In formula (2), ρ(ω) is the frequency-dependent wide-area complex resistivity when the polarization effect is considered; ρ_a is the wide-area apparent resistivity when no polarization effect is considered; m is the polarizability; τ is the time constant; c is the frequency dependence coefficient; ω is the angular frequency;
S3, setting the objective function of the inversion as follows:

fit = E(e) + λ₁R(ρ) + λ₂R(m)   (3)
In formula (3), R(ρ) and R(m) are minimum-structure constraint functions for resistivity and polarizability, respectively; λ₁ and λ₂ are the regularization factors corresponding to R(ρ) and R(m). Two independent regularization factors are adopted because the value space of the polarizability (m ∈ [0,1]) differs greatly from that of the resistivity (generally ρ ≫ m); if a single uniform regularization factor were adopted, the relatively small polarizability parameter could not be constrained. E(e) is the target error function, i.e., the fitting error of the data in the inversion;
R(ρ) and R(m) are both calculated here using the following formula:

R(M) = Σ_{j=1}^{N−1} (M_{j+1} − M_j)²   (4)

In formula (4), M is a model parameter obtained by inversion, i.e., either the resistivity ρ or the polarizability m, M_j is the value of that parameter in the j-th layer, and N is the number of layers;
S4, designing a staged extraction method for the different physical-property parameters by defining sensitivity as the feature for inversion-parameter identification, and distinguishing the stage of the current inversion through the sensitivity;
The sensitivities of resistivity and polarizability are defined as follows:

S = |(fit_G − fit_{G−1}) / (M_G − M_{G−1})|   (5)

In formula (5), S is the sensitivity, G is the iteration number, fit is the fitness of formula (3), and M is a model parameter obtained by inversion, i.e., either the resistivity ρ or the polarizability m;
S5, adopting reinforcement learning based on the deterministic policy gradient to realize judgment of the inversion stage and setting of the regularization coefficients, as shown in FIG. 2;
The reinforcement learning comprises three elements: state, action, and reward, and the system is modeled in terms of these three elements, where the state is the pair of sensitivities of resistivity and polarizability, the action is the pair of regularization coefficients, and the reward is the improvement of the fitness. The system judges the inversion stage according to the current state and outputs the corresponding regularization coefficients, then calculates the reward from the inversion result to adjust the policy and the value function of the reinforcement learning. Through repeated learning until the policy and the value function are stable, the inversion stage can be judged accurately and suitable regularization coefficients can be set;
The steps of the reinforcement learning are as follows:
Step one, randomly initialize four networks: the current policy network μ, the target policy network μ′, the current Q network Q, and the target Q network Q′;
their parameters are, respectively, the current policy network parameter θ, the target policy network parameter θ′, the current Q network parameter w, and the target Q network parameter w′; the current iteration number t = 0;
Step two, take S as the initial state, input the state S into the current policy network, and obtain the action A:
A = μ(S|θ) + N
where μ(·) is the policy output by the current policy network, S is the initial state, θ is the parameter of the current policy network, and N is the exploration noise;
Step three, execute the action A in the state S to obtain the next state S′ and the reward R, and store the transition {S_t, A_t, R_t, S′_t} into the experience replay set D;
Step four, update the state S to S′; randomly sample n transitions {S_i, A_i, R_i, S′_i}, i = 1, 2, 3, …, n, from the experience replay set D, and calculate the target value y_i of the current Q network:
y_i = R_i + γQ′(S′_i, μ′(S′_i|θ′)|w′)
where R_i is the reward obtained in the state S_i by performing the action A_i, γ is the reward discount factor, Q′(·) is the Q value output by the target Q network, w′ is the parameter of the target Q network, μ′(·) is the policy output by the target policy network, and θ′ is the parameter of the target policy network;
Step five, calculate the loss L of the current Q network using the mean squared error (MSE) loss function, and update all parameters w of the current Q network through gradient back-propagation of the neural network:
L = (1/n)·Σ_{i=1}^{n} (y_i − Q(S_i, A_i|w))²
where n is the total number of sampled transitions, Q(·) is the Q value output by the current Q network, S_i is the i-th state, A_i is the i-th action, and w is the parameter of the current Q network;
Step six, update all parameters θ of the current policy network through gradient back-propagation of the neural network using the performance index function J, and increase the iteration number t by 1;
Step seven, update the target Q network parameter w′ and the target policy network parameter θ′ at a fixed period:
w′ = τw + (1 − τ)w′
θ′ = τθ + (1 − τ)θ′
where τ is the soft-update coefficient of the network parameters, θ is the current policy network parameter, and w is the current Q network parameter;
Step eight, judge whether the policy and the value function have stably converged; if the termination condition is reached, finish the training; otherwise, return to step two;
S6, controlling the constraints imposed by the inversion according to the regularization coefficients generated by reinforcement learning, realizing adaptive identification of inversion parameters and regularization setting, and obtaining high-precision induced polarization information (including the resistivity and polarizability parameters);
Two types of constraints are imposed during the inversion process: first, prior-information constraints on resistivity and polarizability are applied by utilizing the known physical properties of the survey area, which reduces the search space of the inversion algorithm; second, when one physical-property parameter (the resistivity parameter or the polarizability parameter) is in its inversion stage, a limitation constraint is applied to the other physical-property parameter, i.e., the search for the other parameter is restricted to a small range, so that the influence of the dominant physical-property parameter on the fitness function is strengthened. Therefore, the strength of the different constraints is controlled by the different regularization coefficients generated by reinforcement learning, and accurate multi-parameter inversion is realized.
Example 2
The method was tested on a three-layer model whose resistivity parameter ρ, thickness parameter h, and polarizability parameter m are set as shown in Table 1. The inversion algorithm uses the grey wolf optimizer (GWO), in which the population size P and the maximum number of iterations t_max are set as shown in Table 1; the soft-update coefficient τ and the reward discount factor γ of the reinforcement learning are set as shown in Table 1; the regularization factors λ₁ and λ₂ of the minimum-structure functions when reinforcement learning is not employed are set as shown in Table 1.
TABLE 1
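For reference, a compact sketch of a grey wolf optimizer loop of the kind used as the inversion engine in this example follows; since the concrete values of Table 1 are not reproduced in this text, the population size, iteration count, and bounds below are illustrative.

```python
import numpy as np

def gwo(objective, lb, ub, pop_size=30, t_max=200, rng=None):
    """Grey wolf optimizer: minimize objective over the box [lb, ub]."""
    if rng is None:
        rng = np.random.default_rng()
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    X = rng.uniform(lb, ub, size=(pop_size, lb.size))      # wolf positions
    for t in range(t_max):
        fit = np.apply_along_axis(objective, 1, X)
        order = np.argsort(fit)
        # The three best wolves (alpha, beta, delta) lead the pack.
        alpha, beta, delta = (X[order[0]].copy(),
                              X[order[1]].copy(),
                              X[order[2]].copy())
        a = 2 - 2 * t / t_max              # control parameter: 2 -> 0
        for i in range(pop_size):
            new = np.zeros(lb.size)
            for leader in (alpha, beta, delta):
                A = a * (2 * rng.random(lb.size) - 1)
                C = 2 * rng.random(lb.size)
                new += leader - A * np.abs(C * leader - X[i])
            X[i] = np.clip(new / 3, lb, ub)  # average of the three moves
    fit = np.apply_along_axis(objective, 1, X)
    return X[np.argmin(fit)]
```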
The inversion results of the method provided by the invention compared with the method without reinforcement learning and with the Actor-Critic method (single network) are shown in Table 2; the evaluation indexes are the root mean square error (RMSE) and the coefficient of determination R².
TABLE 2
Method                            RMSE     R²
Without reinforcement learning    38.33    0.88
Actor-Critic method               30.24    0.91
The method of the invention       27.43    0.93
According to the inversion results, the reinforcement learning-based inversion methods (the Actor-Critic method and the method of the invention) outperform the inversion without reinforcement learning, because reinforcement learning can automatically identify the physical-property stage of the inversion, output the correct regularization coefficients, and apply the corresponding constraints. The method of the invention outperforms the Actor-Critic method because it uses dual networks to implement the Actor and Critic modules separately; compared with the Actor-Critic method, separating the current network from the target network (dual networks) further improves the stability and generalization capability of the reinforcement learning.