
WO2025004859A1 - Learning device, control system, input/output device, learning method, and recording medium - Google Patents


Info

Publication number
WO2025004859A1
Authority
WO
WIPO (PCT)
Prior art keywords
value
simulation
reset
learning device
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2024/021691
Other languages
French (fr)
Japanese (ja)
Inventor
Shumpei Kubosawa (窪澤 駿平)
Takashi Onishi (大西 貴士)
Yoshimasa Tsuruoka (鶴岡 慶雅)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
National Institute of Advanced Industrial Science and Technology AIST
Original Assignee
NEC Corp
National Institute of Advanced Industrial Science and Technology AIST
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp and National Institute of Advanced Industrial Science and Technology AIST
Publication of WO2025004859A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning

Definitions

  • This disclosure relates to a learning device, a control system, an input/output device, a learning method, and a recording medium.
  • One example of the objective of this disclosure is to provide a learning device, a control system, an input/output device, a learning method, and a recording medium that can solve the above-mentioned problems.
  • the learning device includes a policy generation means for generating a policy, which is a decision rule for the behavior of a controlled object, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step gone back to within an episode representing a learning period, and a behavior decision means for deciding the behavior of the controlled object based on the policy.
  • a policy generation means for generating a policy, which is a decision rule for the behavior of a controlled object, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step gone back to within an episode representing a learning period
  • a behavior decision means for deciding the behavior of the controlled object based on the policy.
  • a control system includes a learning device and a control device
  • the learning device includes a policy generation means for generating a policy, which is a decision rule for the behavior of the control object, based on a reward value, which is a value indicating an evaluation of the behavior of the control object at a time step gone back to within an episode representing a learning period, and a behavior decision means for deciding the behavior of the control object based on the policy
  • the control device controls the control object based on the policy obtained using the learning device.
  • the input/output device includes a state presenting means for presenting to a user the state of the environment in which the controlled object acts, and a state designation receiving means for receiving a user operation for designating a state in which the simulator that simulates the environment needs to be interrupted.
  • the learning method includes generating a policy, which is a decision rule for the behavior of a controlled object, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step gone back to within an episode representing a learning period, and determining the behavior of the controlled object based on the policy.
  • the recording medium stores a program for causing a computer to execute the following: generating a policy, which is a decision rule for the behavior of a controlled object, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step gone back to within an episode representing a learning period; and determining the behavior of the controlled object based on the policy.
  • reinforcement learning using simulation can be performed relatively efficiently.
  • FIG. 1 illustrates an example configuration of a control system according to some embodiments of the present disclosure.
  • FIG. 2 is a diagram illustrating an example of the configuration of a learning device according to some embodiments of the present disclosure.
  • FIG. 3 is a diagram showing an example of an instantaneous reward value that is the basis for calculating a cumulative reward value.
  • FIG. 4 is a diagram illustrating an example of a cumulative reward value.
  • FIG. 5 is a diagram showing a first example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
  • FIG. 6 is a diagram showing a second example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
  • FIG. 7 is a diagram showing a third example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
  • FIG. 8 is a diagram illustrating another example of the configuration of a learning device according to some embodiments of the present disclosure.
  • FIG. 9 illustrates another example of a control system configuration according to some embodiments of the present disclosure.
  • FIG. 10 is a diagram illustrating an example of a configuration of an input/output device according to some embodiments of the present disclosure.
  • FIG. 11 is a diagram illustrating an example of a processing procedure in a learning method according to some embodiments of the present disclosure.
  • FIG. 12 illustrates an example of a computer configuration in accordance with at least one embodiment of the present disclosure.
  • FIG. 1 is a diagram illustrating an example of a configuration of a control system according to some embodiments of the present disclosure.
  • the control system 1 includes a learning device 100, a control device 200, and a control target 910.
  • the control system 1 is a system that learns control of a control target 910 and controls the control target 910 based on the learning results.
  • the learning device 100 learns control over the control object 910. In particular, the learning device 100 learns control over the control object 910 by reinforcement learning using a simulation.
  • the reinforcement learning referred to here is a machine learning technique that learns a policy, which is the behavioral rule of an agent that performs an action in a certain environment, based on a reward, which represents an evaluation of the action.
  • the state of the environment is also referred to simply as the state.
  • the environment may include an agent. Therefore, the state may include the state of the agent.
  • the learning device 100 determines an action based on a policy in the state at that time, simulates the determined action, and calculates the next state, which is the state at the next step.
  • the learning device 100 also calculates a reward value based on the obtained next state, and updates the policy based on the calculated reward value.
  • the update of the policy here can be considered as the generation of a policy. In other words, the learning device 100 can be considered to generate a policy based on a past policy.
  • a step in reinforcement learning will be referred to as a time step or simply as a step. Also, below, time will be represented by a time step.
  • the reinforcement learning method used by the learning device 100 is not limited to a specific type of method.
  • the learning device 100 may learn control of the control target 910 based on a known reinforcement learning method such as Q-learning or SARSA.
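As a concrete illustration of the kind of known method mentioned above, a tabular Q-learning update can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; the table layout and hyperparameter values are assumptions.

```python
from collections import defaultdict

def make_q_table():
    # Q[state][action] -> estimated action value, defaulting to 0.0.
    return defaultdict(lambda: defaultdict(float))

def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(q[next_state].values(), default=0.0)
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
```

SARSA would differ only in using the value of the action actually taken in the next state instead of the maximum over actions.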
  • the learning device 100 repeats the processing for each step of each episode until the episode ends.
  • An episode here is a unit of time in reinforcement learning, and is a time period from when an agent starts a series of actions until when it ends.
  • the learning device 100 repeats the processing for each step in the episode until a condition that is predetermined as the condition for ending the episode is met.
  • When the learning device 100 determines that a predetermined condition for interrupting the simulation is satisfied, it interrupts the simulation even in the middle of an episode.
  • the condition for interrupting the simulation is also referred to as an interruption condition.
  • Interrupting the simulation can also be said to interrupt the episode being executed. This allows the learning device 100 to automatically interrupt the execution of an episode when it reaches a state from which no policy update is expected even if the episode is continued, or a state in which the policy update is expected to progress only slowly.
  • Policy updates can be thought of as policy improvements. Policy updates can be thought of as progress in reinforcement learning.
  • the learning device 100 that has suspended the execution of an episode may go back a time step within the episode and resume the execution of the episode. Alternatively, the learning device 100 that has suspended the execution of an episode may start the execution of another episode. According to the learning device 100, reinforcement learning can be performed more efficiently than in a case where the execution of an episode is continued even when the state in which policy updating is not expected or when the state in which policy updating is expected to progress slowly is reached.
  • the learning device 100 also stores the state at each of a number of time steps during the execution of an episode.
  • When the learning device 100 suspends the execution of an episode, it selects one of the time steps for which the state is stored, returns to the selected time step, and resumes the execution of that episode. Specifically, the learning device 100 resumes processing for each time step from the state of the selected time step.
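The snapshot-and-resume behavior described above can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; all names are hypothetical.

```python
import copy

class EpisodeSnapshots:
    """Stores simulator state snapshots keyed by time step so an
    interrupted episode can be resumed from an earlier step."""

    def __init__(self):
        self._snapshots = {}  # time step -> deep-copied state

    def save(self, step, state):
        # Deep-copy so later simulation steps cannot mutate the snapshot.
        self._snapshots[step] = copy.deepcopy(state)

    def candidates(self):
        # Time steps at which the simulation can be resumed
        # (the reset destination candidates).
        return sorted(self._snapshots)

    def restore(self, step):
        # Return a fresh copy so resumed runs do not corrupt the snapshot.
        return copy.deepcopy(self._snapshots[step])
```

Deep copies are used so that resuming and re-running from a snapshot cannot alter the stored state, which would invalidate later resets to the same step.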
  • In some cases, the beginning of the episode is a part where learning has already progressed sufficiently, and it is conceivable that the policy update would not proceed any further there.
  • the time step to go back in the episode is also called the reset destination.
  • the candidates for the reset destination are also called reset destination candidates.
  • the reset destination candidates are, for example, executed time steps whose states are stored as time steps at which the simulation can be resumed within the episode that was executed until the simulation was interrupted. Interrupting the execution of a simulation and changing the state in the simulation to a reset state is also called resetting the simulation.
  • the control device 200 controls the control target 910 based on the policy obtained by learning using the learning device 100 .
  • the control object 910 is not limited to a specific one, and may be various objects capable of learning control over the control object 910 using reinforcement learning.
  • the control object 910 may be equipment such as a plant or a power plant, a system such as a manufacturing line in a factory, or a standalone device.
  • the control object 910 may be a moving object such as an automobile, a railroad vehicle, an airplane, a ship, or a self-propelled mobile robot, or a transportation system such as a railroad or air traffic control system.
  • the controlled object 910 may be configured as a part of the control system 1 , or may be configured external to the control system 1 .
  • During learning, it is sufficient for the control system 1 to have the learning device 100; the control device 200 and the control target 910 are not necessary. Conversely, during execution of control, it is sufficient to have the control device 200 and the control target 910, and the learning device 100 is not necessary. The learning device 100, the control device 200, and the control target 910 may all be configured integrally, or any two of them may be configured integrally.
  • the learning device 100 and the control device 200 may be configured as a single device.
  • the control device 200 and the control target 910 may be configured as a single device.
  • FIG. 2 is a diagram showing an example of the configuration of the learning device 100.
  • the learning device 100 includes a communication unit 110, a display unit 120, an operation input unit 130, a storage unit 180, and a processing unit 190.
  • the processing unit 190 includes an action decision unit 191, a simulation unit 192, a policy generation unit 193, a reset decision unit 194, and a restart state determination unit 195.
  • the communication unit 110 communicates with other devices. For example, the communication unit 110 may receive various information for performing a simulation from other devices. The communication unit 110 may also transmit the policy obtained by learning to the control device 200.
  • the display unit 120 has a display screen, such as a liquid crystal panel or an LED (Light Emitting Diode) panel, and displays various images.
  • the display unit 120 may display information related to the learning performed by the learning device 100, such as the progress of the learning performed by the learning device 100.
  • the operation input unit 130 includes input devices such as a keyboard and a mouse, and receives user operations. For example, the operation input unit 130 may receive a user operation instructing the learning device 100 to start learning control over the control target 910.
  • the storage unit 180 stores various data. For example, it may store snapshots of states in a simulation.
  • the storage unit 180 is configured using a storage device included in the learning device 100.
  • the processing unit 190 performs various processes by controlling each unit of the learning device 100.
  • the functions of the processing unit 190 are performed, for example, by a CPU (Central Processing Unit) included in the learning device 100 reading a program from the storage unit 180 and executing it.
  • the action decision unit 191 decides the action of the control target 910 in reinforcement learning.
  • the action decision unit 191 decides the action of the control target 910 based on a policy.
  • the action decision unit 191 corresponds to an example of an action decision means.
  • the simulation unit 192 performs a simulation of the environment.
  • the simulation unit 192 performs a simulation of the action determined by the action decision unit 191, and calculates a next state, which is the state after the action.
  • the control object 910 in the simulation can be regarded as an agent or a part thereof, and the control object 910 may be included in the environment that is the target of the simulation.
  • the state of the control object 910 may be included in the state in the simulation.
  • the simulation unit 192 corresponds to an example of a simulation means.
  • the simulation unit 192 may be configured as a part of the learning device 100 , or may be configured external to the learning device 100 .
  • the policy generation unit 193 updates the policy based on the reward value.
  • the reward value is a value indicating an evaluation for an action.
  • the policy is a decision rule for an action.
  • the update of the policy here can be considered as the generation of the policy.
  • the policy generation unit 193 may update the policy based on the instantaneous reward value, or may update the policy based on the cumulative reward value.
  • the instantaneous reward value is a value calculated for each step based on the state, action, and next state, or a part of these, and indicates an evaluation of the action in that step.
  • the cumulative reward value is a value calculated by accumulating the instantaneous reward values for multiple steps.
  • the instantaneous reward value may be multiplied by a coefficient value such as a forgetting coefficient.
  • the policy generation unit 193 may update the policy based on a value function value (also referred to as a value), which is a prediction of the cumulative reward value, e.g., an expected value of the cumulative reward value at the end of an episode.
  • the instantaneous reward value, the cumulative reward value, and the value function value are all examples of reward values.
  • the learning device 100 may use a reward value in which the larger the value, the better the evaluation, or may use a reward value in which the smaller the value, the better the evaluation. In the following, an example will be described in which the learning device 100 uses a reward value in which the larger the value, the better the evaluation.
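The relationship between the instantaneous and cumulative reward values described above can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; the parameter gamma stands in for the forgetting coefficient mentioned above, and its value is an assumption.

```python
def cumulative_reward(instant_rewards, gamma=1.0):
    """Accumulates per-step instantaneous reward values into the
    cumulative reward value at each time step. gamma plays the role
    of a forgetting (discount) coefficient applied to each term."""
    total, cumulative = 0.0, []
    for t, r in enumerate(instant_rewards):
        total += (gamma ** t) * r
        cumulative.append(total)
    return cumulative
```

With gamma set to 1.0 this is a plain running sum, matching the cumulative reward value plotted against time steps in FIG. 4.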
  • the combination of the action decision unit 191 and the policy generation unit 193 is also called an agent processing unit.
  • When the reset determination unit 194 determines that the interruption condition is satisfied, it interrupts the simulation by the simulation unit 192. Specifically, the reset determination unit 194 stops the simulation being performed by the simulation unit 192.
  • the interruption condition is a condition that is determined in advance as a condition for interrupting the simulation of the behavior of the control target 910.
  • the reset determination unit 194 corresponds to an example of a reset determination means.
  • the reset decision unit 194 can automatically interrupt the execution of an episode if the episode falls into a state where policy updates are not expected to occur even if the episode continues to be executed, or if the episode falls into a state where policy updates are expected to proceed slowly.
  • the reset determination unit 194 that has suspended the execution of an episode may go back a time step within the episode and restart the simulation by the simulation unit 192 from the state at the time step gone back to.
  • the reset determination unit 194 that has suspended the execution of an episode may select another episode and cause the simulation unit 192 to perform a simulation of the selected episode. This allows the learning device 100 to perform reinforcement learning more efficiently compared to a case in which the learning device 100 continues to execute the episode even when the learning device 100 falls into a state in which policy updating is not expected or when the learning device 100 falls into a state in which policy updating is expected to progress slowly.
  • the restart state determination unit 195 determines a reset destination when the simulation by the simulation unit 192 is interrupted.
  • the simulation unit 192 restarts the simulation from the reset destination state determined by the restart state determination unit 195.
  • the restart state determination unit 195 corresponds to an example of a restart state determination means.
  • the restart state determination unit 195 can select not only the beginning of an episode but also any time step in the middle as the destination to go back in the episode, and this is expected to enable the learning device 100 to perform reinforcement learning more efficiently.
  • the simulation unit 192 may resume the simulation from a predetermined state, such as the initial state of the episode.
  • the learning device 100 may be configured without the restart state determination unit 195.
  • the reset decision unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a change in the cumulative reward value.
  • FIG. 3 is a diagram showing an example of an instantaneous reward value that is the basis for calculating a cumulative reward value.
  • the horizontal axis of the graph in FIG. 3 indicates a time step, and the vertical axis indicates an instantaneous reward value.
  • the instantaneous reward value can be a positive value or a negative value. The larger the instantaneous reward value (e.g., the larger its magnitude when it is a positive value), the better the evaluation it indicates. Conversely, the smaller the instantaneous reward value (e.g., the larger its magnitude when it is a negative value), the worse the evaluation it indicates.
  • Fig. 4 is a diagram showing an example of a cumulative reward value.
  • the horizontal axis of the graph in Fig. 4 indicates a time step, and the vertical axis indicates a cumulative reward value.
  • Fig. 4 shows the cumulative reward value calculated by accumulating the instantaneous reward values shown in Fig. 3 from the start of the episode. The cumulative reward value reaches a maximum at time step t11, and from the time step after t11, the cumulative reward value continues to decrease.
  • When the cumulative reward value is small, such as in the state at time step t12, it is possible that an event has occurred that causes the evaluation indicated by the reward value to be poor.
  • For example, when the control target 910 is a railway, it is possible that the intervals between trains have become too close, causing a disruption to the train schedule.
  • the reset decision unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on the change in the accumulated reward value. For example, when the reset decision unit 194 determines that the evaluation indicated by the cumulative reward value has deteriorated below a predetermined threshold, the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192. In the example of Fig. 4, the reset decision unit 194 may compare the magnitude of the decrease amount from the maximum value of the cumulative reward value at the time step t11 with the threshold value d11, and may decide to interrupt the simulation by the simulation unit 192 at or after the end of the time step t12 where the magnitude of the decrease amount is greater than the threshold value d11.
  • the reset decision unit 194 may update the threshold value so that the extent of deterioration in evaluation indicated by the threshold value of the cumulative reward value increases as the simulation by the simulation unit 192 progresses.
  • the reset decision unit 194 may update the value of the threshold value d11 so that the value of the threshold value d11 increases as the time steps progress.
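The growing threshold d11 described above can be sketched as follows, under the assumption of simple linear growth; the disclosure does not specify a particular schedule, so the function name and parameter values are illustrative.

```python
def drop_threshold(step, base=1.0, growth=0.01):
    """Interruption threshold d11 that grows as time steps progress,
    so larger drops in the cumulative reward value are tolerated
    later in the simulation (a linear schedule, as an example)."""
    return base + growth * step
```

Any monotonically increasing schedule would serve the same purpose: interrupting aggressively early in training and more leniently as the simulation progresses.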
  • the starting point from which the reset determination unit 194 calculates the amount of decrease in the cumulative reward value is not limited to the maximum value of the cumulative reward value.
  • the reset determination unit 194 may calculate the amount of decrease from a cumulative reward value of 0 and compare it with a threshold value.
  • the reset determination unit 194 may calculate the amount of decrease from the maximum value of the cumulative reward value and compare it with a threshold value.
  • By the reset decision unit 194 setting the threshold to a relatively small value and suspending the execution of the episode earlier, it is expected that the action decision unit 191 will try various actions and find an action (or series of actions) that yields a large reward value at a relatively early stage.
  • the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192 when the deterioration of the evaluation indicated by the cumulative reward value continues for a predetermined number of time steps or more, based on the increase or decrease in the cumulative reward value. In the example of FIG. 4, if the number of time steps for interrupting the simulation by the simulation unit 192 is set to 7, the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192 at or after the end of time step t12 in which the cumulative reward value has decreased seven times consecutively from time step t11 at which the cumulative reward value is maximized.
  • the reset decision unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a combination of the amount of decrease in the cumulative reward value and the number of time steps in which the cumulative reward value has continuously decreased. For example, the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192 when the amount of decrease in the cumulative reward value becomes greater than a predetermined threshold value, and when the number of time steps in which the cumulative reward value has continuously decreased reaches or exceeds a predetermined number.
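The interruption criteria above (a drop from the maximum cumulative reward exceeding a threshold, and a run of consecutive decreases) can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; here the two criteria are combined with OR, though as noted above they may also be combined with AND.

```python
def should_interrupt(cumulative, drop_threshold, max_consecutive_drops):
    """Decides whether to interrupt the episode based on the cumulative
    reward values observed so far: True when the latest value has fallen
    below the running maximum by more than drop_threshold, or when the
    value has decreased for max_consecutive_drops consecutive steps."""
    if len(cumulative) < 2:
        return False
    # Criterion 1: magnitude of the decrease from the maximum (d11 in FIG. 4).
    if max(cumulative) - cumulative[-1] > drop_threshold:
        return True
    # Criterion 2: length of the current run of consecutive decreases.
    consecutive = 0
    for prev, cur in zip(cumulative, cumulative[1:]):
        consecutive = consecutive + 1 if cur < prev else 0
    return consecutive >= max_consecutive_drops
```

In the FIG. 4 example, the first criterion fires once the cumulative reward has fallen more than d11 below its value at t11, and the second fires after the stated number of consecutive decreasing steps following t11.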
  • the reset decision unit 194 may decide whether or not to suspend the simulation by the simulation unit 192 based on the value function value in addition to or instead of the cumulative reward value.
  • the value function value is a predicted value of the cumulative reward value, and it is considered that there is a positive correlation between the value function value and the cumulative reward value.
  • When the reset decision unit 194 determines, based on an increase or decrease in the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold, it may interrupt the simulation by the simulation unit 192. In this case, as with the cumulative reward value described above, the reset determination unit 194 may update the threshold so that the extent of the deterioration in evaluation indicated by the threshold of the value function value increases as the simulation by the simulation unit 192 progresses.
  • the reset decision unit 194 may decide to interrupt the simulation by the simulation unit 192 if the deterioration of the evaluation indicated by the value function value continues for a predetermined number of time steps or more, based on an increase or decrease in the value function value.
  • the reset determination unit 194 may determine whether or not to interrupt the simulation by the simulation unit 192 based on a combination of the amount of decrease in the value function value and the number of time steps in which the value function value has continuously decreased.
  • When the current state is similar to a state registered in advance as requiring interruption, the simulation unit 192 may be configured to interrupt the simulation. For example, when the control target 910 is a railway, the storage unit 180 stores, as states requiring interruption, a plurality of patterns of states in which train intervals are short and disruption of the schedule occurs.
  • the reset decision unit 194 then compares the current state with each of the states requiring interruption and determines whether the current state is similar to any of the states requiring interruption. If it is determined that the current state is similar to any of the states requiring interruption, the reset decision unit 194 decides to interrupt the simulation by the simulation unit 192.
  • the method by which the reset determination unit 194 determines whether the current state is similar to the state requiring interruption is not limited to a specific method.
  • the reset determination unit 194 may calculate the feature amount of each of the two states as a vector, and calculate the similarity of the vectors, such as cosine similarity. The reset determination unit 194 may then compare the calculated similarity with a threshold, and determine that the two states are similar if the similarity is equal to or greater than the threshold.
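The cosine-similarity comparison described above can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; how state features are extracted into vectors, and the threshold value, are assumptions.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two state feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def requires_interruption(state_vec, stored_states, threshold=0.9):
    # True when the current state is similar (cosine >= threshold)
    # to any stored pattern of a state requiring interruption.
    return any(cosine_similarity(state_vec, s) >= threshold
               for s in stored_states)
```

Any vector similarity measure could be substituted here; cosine similarity is simply the example the text names.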
  • a machine learning model that determines whether or not the two states are similar may be prepared, and the reset determination unit 194 may use this machine learning model to determine whether or not the current state is similar to the state requiring interruption.
  • the learning device 100 may present the states to the user and accept the user's designation of the states requiring interruption.
  • the reset determination unit 194 may cause some state, such as a current state or a past state in the simulation, to be displayed on the display unit 120.
  • the reset determination unit 194 may then receive a user operation via the operation input unit 130 instructing whether or not to register the displayed state as a state requiring interruption.
  • the combination of the reset determination unit 194 and the display unit 120 corresponds to an example of a state presenting unit.
  • the combination of the reset determination unit 194 and the operation input unit 130 corresponds to an example of a state designation receiving unit.
  • the display unit 120 may display the current train operation status in the simulation in the form of a diagram.
  • When the displayed operation status corresponds to a state requiring interruption, the user may perform a user operation indicating this on the operation input unit 130, and the reset determination unit 194 may detect that the user operation has been performed.
  • the learning device 100 is an example of an input/output device in that it is equipped with a reset decision unit 194, a display unit 120, and an operation input unit 130.
  • the input/output device may be configured as a device separate from the learning device 100.
  • For example, a terminal device of the learning device 100 may have a function of displaying the state according to instructions from the learning device 100, and a function of accepting a user operation specifying a state requiring interruption and notifying the learning device 100 of it.
  • from among the reset destination candidates in the episode that was being executed until the simulation was interrupted, the restart state determination unit 195 may select the candidate corresponding to the most recent time step, before the interruption, at which the change in evaluation indicated by the increase or decrease in the cumulative reward value turned from improvement to deterioration, or a time step prior to that.
  • in other words, the restart state determination unit 195 may select a reset destination candidate corresponding to the time step at which the evaluation was best immediately before the interruption of the simulation, or a time step prior to that.
  • time step t11 corresponds to the time step at which the cumulative reward value is maximized immediately before the simulation is interrupted (time step t12).
  • the restart state determination unit 195 may select a reset destination candidate that corresponds to time step t11 or an earlier time step.
  • After time step t11, the cumulative reward value decreases continuously, and it is possible that an event has occurred that causes the evaluation indicated by the reward value to be poor.
  • By having the restart state determination unit 195 select a reset destination candidate corresponding to the most recent time step at which the evaluation was best when the simulation was interrupted, or an earlier time step, it is expected that the possibility of selecting a time step before the occurrence of an event that causes the reward value to indicate a poor evaluation will increase.
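The selection rule above, going back to the candidate at or before the time step of the most recent best evaluation (t11 in the FIG. 4 example), can be sketched as follows. This is an illustrative Python sketch, not code from the disclosure; the fallback behavior when no candidate precedes the peak is an assumption.

```python
def latest_peak_step(cumulative):
    # Time step at which the cumulative reward value last reached
    # its maximum (t11 in the FIG. 4 example).
    best = max(cumulative)
    return max(i for i, v in enumerate(cumulative) if v == best)

def pick_reset_destination(candidates, cumulative):
    """From the stored reset destination candidates (time steps whose
    states were snapshotted), pick the latest one at or before the
    latest peak of the cumulative reward value."""
    peak = latest_peak_step(cumulative)
    eligible = [c for c in candidates if c <= peak]
    # Fall back to the earliest candidate if none precede the peak.
    return max(eligible) if eligible else min(candidates)
```

Picking the latest eligible candidate keeps the rollback as short as possible while still landing at or before the suspected onset of the bad event.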
  • Alternatively, the resume state determination unit 195 may treat all reset destination candidates within the episode that was being executed until the simulation was interrupted as selection targets and select one of them. As a result, if a decrease in the cumulative reward value does not necessarily indicate the occurrence of an event that causes the reward value to be evaluated poorly, the resume state determination unit 195 may be able to select a later reset destination candidate (one closer to the time the simulation was interrupted).
  • the resume state determination unit 195 may select one of the multiple reset destination candidates in accordance with a probability distribution set for the multiple reset destination candidates.
  • For example, when an event that causes the reward value to be evaluated poorly occurs, it is preferable, from the viewpoint of returning to a time step before the event occurred, to return to as early a time step in the episode as possible (a time step close to the beginning of the episode).
  • On the other hand, from the viewpoint of reducing the number of repetitions of learning at the beginning of the episode, it is preferable to return to as late a time step as possible among the time steps that have been executed in the episode. In this way, there is a trade-off between achieving efficiency in reinforcement learning by avoiding a state in which an event that causes the reward value to be evaluated poorly has occurred, and achieving efficiency in reinforcement learning by reducing the number of repetitions of learning at the beginning of the episode.
  • the restart state determination unit 195 therefore probabilistically selects one of the multiple reset destination candidates. This makes it possible to avoid restarting the execution of the episode from the beginning every time the execution of the episode is interrupted. Furthermore, if an event that would cause the reward value to be evaluated poorly has already occurred in the selected reset destination candidate, it is expected that by interrupting the episode one or more times, it will be possible to select a reset destination candidate that was selected before the event that would cause the reward value to be evaluated poorly occurred.
  • The restart state determination unit 195 may select one of the multiple reset destination candidates in accordance with a uniform probability distribution. This allows the restart state determination unit 195 to select a reset destination candidate even for a control object 910 for which it is unknown approximately how many time steps must be retraced to reach a state before the occurrence of an event that caused the evaluation indicated by the reward value to be poor.
  • FIG. 5 is a diagram showing a first example of a probability distribution set for reset destination candidates, in which the horizontal axis of the graph in Fig. 5 represents time steps and the vertical axis represents probability.
  • the reset determination unit 194 suspends the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps from time step t21 to t25 are reset destination candidates.
  • The restart state determination unit 195 designates three of these five reset destination candidates, time steps t21 to t23, as selectable, and sets a uniform probability distribution over them.
  • the restart state determination unit 195 selects each of the three reset destination candidates with a one-third probability according to the set probability distribution.
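  • The uniform selection of Fig. 5 might look as follows in Python; the time-step numbers and variable names are illustrative only:

```python
import random

# Reset destination candidates stored for the interrupted episode
# (the time-step numbers mirror Fig. 5 and are purely illustrative).
candidates = [21, 22, 23, 24, 25]

# As in Fig. 5, only the three earliest candidates are made selectable,
# and a uniform distribution assigns each a probability of 1/3.
selectable = candidates[:3]
probabilities = {step: 1 / len(selectable) for step in selectable}

# Sampling according to this distribution:
choice = random.choice(selectable)
```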
  • the restart state determination unit 195 may set a probability distribution for the reset destination candidates such that the greater the deterioration in evaluation at the time the simulation is interrupted, as indicated by an increase or decrease in the cumulative reward value, the more likely it is that a reset destination candidate with a large number of time steps between the time the simulation is interrupted and the reset destination candidate will be selected, and one of the reset destination candidates may be selected according to the set probability distribution.
  • FIG. 6 is a diagram showing a second example of a probability distribution set for reset destination candidates, in which the horizontal axis of the graph in Fig. 6 represents time steps and the vertical axis represents probability.
  • the reset determination unit 194 suspends the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps from time step t21 to t25 are reset destination candidates.
  • the restart state determination unit 195 sets three time steps from time step t21 to t23 out of these five reset destination candidates as reset destination candidates to be selected, and sets a probability distribution for these reset destination candidates.
  • FIG. 6 shows an example in which the amount of decrease in the cumulative reward value when the simulation is interrupted is relatively small, and a higher probability is set for a reset destination candidate that is closer to the interruption (i.e., a reset destination candidate with a small number of time steps between it and the time of interruption).
  • The restart state determination unit 195 selects one of the three reset destination candidates according to the set probability distribution. This makes it relatively easy for the restart state determination unit 195 to select a reset destination candidate that is close to the interruption time (i.e., one with a small number of time steps between it and the interruption time).
  • FIG. 7 is a diagram showing a third example of a probability distribution set for reset destination candidates.
  • the horizontal axis of the graph in Fig. 7 represents time steps, and the vertical axis represents probability.
  • the reset determination unit 194 suspends the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps from time step t21 to t25 are reset destination candidates.
  • the resume state determination unit 195 sets three time steps from time step t21 to t23 out of these five reset destination candidates as reset destination candidates to be selected, and sets a probability distribution for these reset destination candidates.
  • Fig. 7 shows an example in which the magnitude of the decrease in the cumulative reward value at the time of interruption of the simulation is relatively large.
  • In Fig. 7, a higher probability than in Fig. 6 is set for time step t21, the reset destination candidate relatively far from the time of interruption.
  • Conversely, a lower probability than in Fig. 6 is set for time step t23, the reset destination candidate relatively close to the time of interruption.
  • the restart state determination unit 195 selects one of the three reset destination candidates according to the set probability distribution. This makes it relatively easy for the restart state determination unit 195 to select a reset destination candidate that is far from the interruption time (i.e., a reset destination candidate with many time steps between the interruption time and the reset destination candidate).
  • the learning device 100 can efficiently perform reinforcement learning by the restart state determination unit 195 changing the reset destination candidates that are easy to select depending on the magnitude of the increase or decrease in the cumulative reward value.
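  • One possible way to realize such a deterioration-dependent distribution is sketched below. The weighting scheme (an exponential skew controlled by the size of the drop) is our own choice for illustration; the embodiment does not fix a particular formula:

```python
import random

def reset_distribution(selectable_steps, drop, scale=1.0):
    """Weight reset destination candidates by how far the cumulative
    reward fell at the interruption ('drop' >= 0).

    A small drop favors candidates close to the interruption (Fig. 6);
    a large drop shifts weight toward earlier candidates (Fig. 7).
    'scale' controls how quickly the weights shift and is our own knob.
    """
    last = max(selectable_steps)
    # Each step's weight grows with its distance from the interruption,
    # and the growth rate increases with the size of the drop.
    weights = [(1.0 + scale * drop) ** (last - s) for s in selectable_steps]
    total = sum(weights)
    return [w / total for w in weights]

def sample_reset_step(selectable_steps, drop, scale=1.0):
    """Draw one reset destination according to the distribution above."""
    probs = reset_distribution(selectable_steps, drop, scale)
    return random.choices(selectable_steps, weights=probs, k=1)[0]
```

  • With no drop, the distribution is uniform, as in Fig. 5; as the drop grows, probability mass shifts toward earlier candidates, as in the transition from Fig. 6 to Fig. 7.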
  • FIG. 8 is a diagram showing an example of data input/output in the learning device 100.
  • the action decision unit 191 decides the action of the control target 910 based on the measure, and outputs the decided action to the simulation unit 192 .
  • the simulation unit 192 simulates the action determined by the action determination unit 191 to calculate a state (next state), and outputs the calculated state to the measure generation unit 193.
  • The policy generation unit 193 calculates a reward value based on the state acquired from the simulation unit 192, and updates the policy based on the calculated reward value. Depending on the reward value, the policy generation unit 193 may leave the policy unchanged. In addition, the policy generation unit 193 outputs the calculated reward value to the reset determination unit 194 and the restart state determination unit 195.
  • the reset decision unit 194 decides whether or not to interrupt the simulation by the simulation unit 192 based on the reward value. If the reset decision unit 194 decides to interrupt the simulation, the restart state decision unit 195 decides the reset destination based on the reward value. The reset decision unit 194 and the restart state decision unit 195 then output an instruction to interrupt the simulation and the reset destination to the simulation unit 192 and the measure generation unit 193.
  • the reward value used by the policy generation unit 193, the reward value used by the reset determination unit 194, and the reward value used by the restart state determination unit 195 are not limited to a specific one.
  • the reward value used by the policy generation unit 193, the reward value used by the reset determination unit 194, and the reward value used by the restart state determination unit 195 may be an instantaneous reward value, a cumulative reward value, a value function value, or any other reward value, or may be a combination of these.
  • the reward value used by the policy generation unit 193, the reward value used by the reset determination unit 194, and the reward value used by the restart state determination unit 195 may be the same or different.
  • the policy generation unit 193 may output an instantaneous reward value to the reset determination unit 194, and the reset determination unit 194 may calculate a cumulative reward value based on the instantaneous reward value.
  • Upon receiving the interruption instruction, the simulation unit 192 interrupts the simulation and returns the state of the simulation to the designated reset destination. Similarly, when an interruption instruction is received, the policy generation unit 193 returns the policy, the reward value, or both to their values at the specified reset destination. Alternatively, the policy generation unit 193 may leave both the policy and the reward value unchanged when the simulation is interrupted. In this case, the reset determination unit 194 and the restart state determination unit 195 need not output the interruption instruction and the reset destination to the policy generation unit 193.
  • FIG. 9 is a diagram illustrating an example of a procedure of processing performed by the learning device 100.
  • the simulation unit 192 sets up a simulation (step S101).
  • In step S101, the simulation unit 192 sets the state in the simulation to the initial state specified for the episode.
  • the processing unit 190 executes one step of reinforcement learning (step S102).
  • the reset determination unit 194 determines whether or not the interruption condition is met (step S103).
  • If the reset determination unit 194 determines that the interruption condition is met (step S103: YES), the restart state determination unit 195 determines the reset destination (step S111), and the process returns to step S101.
  • In step S101 in this case, the simulation unit 192 sets the state in the simulation to the state after the reset.
  • If the reset determination unit 194 determines that the interruption condition is not met (step S103: NO), the processing unit 190 determines whether or not the end condition of the episode is satisfied (step S121).
  • If the processing unit 190 determines that the end condition of the episode is not satisfied (step S121: NO), the process returns to step S102.
  • If the processing unit 190 determines that the end condition of the episode is satisfied (step S121: YES), it determines whether or not the end condition of the reinforcement learning is satisfied (step S131).
  • If the processing unit 190 determines that the end condition of the reinforcement learning is not satisfied (step S131: NO), it selects the next episode (step S141), and the process returns to step S101.
  • In step S101 in this case, the simulation unit 192 sets the state in the simulation to the initial state of the episode selected by the processing unit 190.
  • If the processing unit 190 determines that the end condition of the reinforcement learning is satisfied (step S131: YES), the learning device 100 ends the process of Fig. 9.
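  • The control flow of Fig. 9 can be summarized in the following Python skeleton. The `env` and `agent` objects and all of their methods are hypothetical stand-ins for the simulation unit 192 and the other units of the processing unit 190; they are not names from the embodiment:

```python
def run_reinforcement_learning(env, agent):
    """Skeleton of the processing procedure of Fig. 9 (illustrative)."""
    episode = agent.first_episode()
    env.setup(episode.initial_state())             # step S101
    while True:
        agent.learn_one_step(env)                  # step S102
        if agent.interruption_condition(env):      # step S103: YES
            state = agent.decide_reset_destination()   # step S111
            env.setup(state)                       # back to step S101
            continue
        if not env.episode_ended():                # step S121: NO
            continue                               # back to step S102
        if agent.learning_finished():              # step S131: YES
            break                                  # end of processing
        episode = agent.next_episode()             # step S141
        env.setup(episode.initial_state())         # back to step S101
```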
  • the policy generation unit 193 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the control object 910 at a time step back within an episode representing a learning period.
  • the behavior decision unit 191 decides the behavior of the control object based on the policy.
  • According to the learning device 100, it is expected that reinforcement learning can be performed more efficiently by going back in time steps within an episode.
  • For example, the learning device 100 may be able to avoid the occurrence of a factor that would cause the evaluation indicated by the reward value to deteriorate, by going back to a time step before the one in which that factor occurred. This is expected to enable the learning device 100 to generate a measure that will improve the evaluation of the entire episode more quickly.
  • the reset determination unit 194 suspends the simulation when it is determined that a predetermined condition for suspending the simulation of the behavior of the control target 910 is satisfied.
  • The measure generation unit 193 generates a measure based on a reward value calculated based on the simulation of the behavior of the control target 910. According to the learning device 100, reinforcement learning can be performed more efficiently than in a case where the execution of an episode is continued even after reaching a state in which the policy update is not expected to progress, or is expected to progress only slowly.
  • the simulation unit 192 simulates the behavior of the controlled object 910 and calculates the state of the environment in which the controlled object behaves. According to the learning device 100, it is expected that reinforcement learning can be performed more efficiently by going back in time steps in a simulation.
  • the reset decision unit 194 decides whether or not to interrupt the simulation based on a change in the cumulative reward value, which is the cumulative value of the instantaneous reward value. This allows the reset decision unit 194 to decide whether to suspend the simulation based on not only the state at one time step but also on a change in state. In this respect, the learning device 100 is expected to be able to appropriately decide whether to suspend the simulation.
  • If the reset decision unit 194 determines, based on an increase or decrease in the cumulative reward value, that the evaluation indicated by the cumulative reward value has deteriorated beyond a predetermined threshold, it decides to interrupt the simulation. According to the learning device 100, it is possible to determine whether or not to interrupt the simulation by the simple process of comparing the cumulative reward value with a threshold value.
  • the reset determination unit 194 updates the threshold value of the cumulative reward value so that the extent of deterioration in the evaluation indicated by the threshold value increases as the simulation progresses.
  • the reset decision unit 194 sets the threshold to a relatively small value to hasten the interruption of the execution of the episode, so that the action decision unit 191 will try various actions, and an action (or a series of actions) that will increase the reward value will be found at a relatively early stage.
  • the learning device 100 can efficiently perform reinforcement learning.
  • the reset decision unit 194 decides to interrupt the simulation if the deterioration of the evaluation indicated by the cumulative reward value continues for a predetermined number of time steps or more, based on the increase or decrease in the cumulative reward value. According to the learning device 100, it is possible to determine whether or not to interrupt the simulation by the simple process of counting the number of steps in which the cumulative reward value is continuously decreasing.
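  • The two interruption criteria described above (comparison of the cumulative reward value with a threshold, and counting consecutive decreases) can be sketched together as follows; the function and its parameters are illustrative names only:

```python
def should_interrupt(cumulative_rewards, threshold, max_drop_steps):
    """Decide whether to interrupt the simulation.

    Interrupt when the latest cumulative reward has fallen below
    'threshold', or when it has decreased for 'max_drop_steps' or more
    consecutive time steps.
    """
    # Criterion 1: simple comparison with a threshold.
    if cumulative_rewards[-1] < threshold:
        return True
    # Criterion 2: count how many of the most recent steps were
    # strict decreases of the cumulative reward.
    drops = 0
    for i in range(len(cumulative_rewards) - 1, 0, -1):
        if cumulative_rewards[i] < cumulative_rewards[i - 1]:
            drops += 1
        else:
            break
    return drops >= max_drop_steps
```

  • The threshold could also be tightened as learning progresses, as described above, by passing a smaller value in early episodes and a larger one later.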
  • the reset decision unit 194 decides whether or not to interrupt the simulation based on a value function value that is a predicted value of a cumulative reward value that is a cumulative value of instantaneous reward values. This allows the reset decision unit 194 to decide whether to suspend the simulation based on not only the state at one time step but also on a change in state. In this respect, the learning device 100 is expected to be able to appropriately decide whether to suspend the simulation.
  • If the reset decision unit 194 determines, based on an increase or decrease in the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold value, it decides to interrupt the simulation. According to the learning device 100, it is possible to determine whether or not to interrupt the simulation by the simple process of comparing the value function value with a threshold value.
  • the reset determination unit 194 updates the threshold value so that the extent of deterioration in the evaluation indicated by the threshold value of the value function value increases as the simulation progresses.
  • the reset decision unit 194 sets the threshold to a relatively small value and interrupts the execution of the episode earlier, so that the action decision unit 191 tries various actions and finds an action (or a series of actions) that will increase the reward value at a relatively early point in time.
  • the learning device 100 is expected to enable efficient reinforcement learning.
  • The reset decision unit 194 decides to interrupt the simulation when the deterioration of the evaluation indicated by the value function value continues for a predetermined number of time steps or more, based on the increase or decrease in the value function value. According to the learning device 100, it is possible to determine whether or not to interrupt the simulation by the simple process of counting the number of steps in which the value function value is continuously decreasing.
  • the reset decision unit 194 decides to interrupt the simulation when it determines that the current state of the environment in which the control object 910 is acting, as calculated in the simulation, is similar to one or more states that are pre-set as states requiring interruption to a certain extent or more.
  • the designer who designs learning device 100 only needs to prepare a sample of a state in which a simulation needs to be interrupted, and does not need to design rules for determining whether or not a simulation needs to be interrupted. In this respect, learning device 100 reduces the burden on the designer.
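  • The similarity test described above might be implemented as follows, using Euclidean distance as one possible similarity measure; the embodiment does not fix a particular measure, and the names here are illustrative:

```python
import math

def near_interruption_state(state, interruption_samples, radius):
    """Return True when the current state is within 'radius'
    (Euclidean distance) of any pre-registered sample of a state
    requiring interruption."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return any(dist(state, s) <= radius for s in interruption_samples)
```

  • Under this sketch, the designer only supplies `interruption_samples` and a tolerance `radius`, rather than hand-written interruption rules.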
  • the combination of the reset determination unit 194 and the display unit 120 presents the state to the user.
  • the combination of the reset determination unit 194 and the operation input unit 130 accepts a user operation that specifies a state that requires interruption.
  • a sample of a state in which a simulation needs to be interrupted can be acquired in response to a user's designation.
  • the learning device 100 reduces the burden on the designer of the learning device 100.
  • the user can reflect in the learning device 100 the user's own judgment as to whether or not a simulation needs to be interrupted.
  • the restart state determining unit 195 determines the state when the simulation is restarted. As described above, it is considered that the policy update does not necessarily progress at the beginning of an episode. As described above for the learning device 100, the resumption state determination unit 195 can select not only the beginning of an episode but also an intermediate time step as the destination to go back in the episode, and therefore it is expected that the learning device 100 can perform reinforcement learning more efficiently.
  • the restart state determination unit 195 also selects one of the reset destination candidates based on a probability distribution set for the reset destination candidates, which are executed time steps whose states are stored as time steps at which the simulation can be restarted within the episode that was executed until the simulation was interrupted.
  • By having the restart state determination unit 195 probabilistically select one of the multiple reset destination candidates, it is possible to avoid restarting the execution of the episode from the beginning every time the execution of the episode is interrupted. Furthermore, if an event that would cause the reward value to be evaluated poorly has already occurred at the selected reset destination candidate, it is expected that, by interrupting the episode one or more times, it will become possible to select a reset destination candidate from before that event occurred.
  • The restart state determination unit 195 may select one of the multiple reset destination candidates in accordance with a uniform probability distribution.
  • This allows a reset destination candidate to be selected even for a control object 910 for which it is unknown approximately how many time steps must be retraced to reach a state before the occurrence of an event that caused the evaluation indicated by the reward value to be poor.
  • The restart state determination unit 195 may also set a probability distribution for the multiple reset destination candidates such that the greater the deterioration in evaluation at the time the simulation is interrupted, as indicated by an increase or decrease in the cumulative reward value, the more likely it is that a reset destination candidate with a large number of time steps between it and the time of interruption is selected, and may select one of the reset destination candidates according to the set probability distribution.
  • the learning device 100 can efficiently perform reinforcement learning by the restart state determination unit 195 changing the reset destination candidates that are easy to select depending on the magnitude of the increase or decrease in the cumulative reward value.
  • The restart state determination unit 195 may also select, from among the reset destination candidates, the candidate corresponding to the most recent time step before the interruption at which the change in evaluation, indicated by an increase or decrease in the cumulative reward value, turned from improvement to deterioration, or a candidate corresponding to an earlier time step.
  • After such a turning point, the cumulative reward value decreases continuously, so it is possible that an event has occurred that causes the evaluation indicated by the reward value to be poor.
  • By having the restart state determination unit 195 select a reset destination candidate corresponding to the time step at which the evaluation was last at a maximum before the interruption of the simulation, or an earlier time step, the possibility of selecting a time step from before the occurrence of such an event is expected to increase.
  • a learning device 610 includes a policy generator 611 and an action determiner 612.
  • the policy generating unit 611 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object in a state going back a time step in an episode representing a learning period.
  • the behavior deciding unit 612 decides the behavior of the controlled object based on the policy.
  • the measure generating unit 611 corresponds to an example of a measure generating means
  • the action deciding unit 612 corresponds to an example of an action deciding means.
  • the learning device 610 can perform reinforcement learning more efficiently by going back in time steps within an episode.
  • For example, the learning device 610 may be able to avoid the occurrence of a factor that would cause the evaluation indicated by the reward value to deteriorate, by going back to a time step before the one in which that factor occurred. This is expected to enable the learning device 610 to generate a measure that will improve the evaluation of the entire episode more quickly.
  • the policy generation unit 611 can be realized, for example, by using the functions of the policy generation unit 193 in FIG. 1, etc.
  • the action decision unit 612 can be realized, for example, by using the functions of the action decision unit 191 in FIG. 1, etc.
  • FIG. 11 is a diagram illustrating another example of a control system configuration according to some embodiments of the present disclosure.
  • the control system 620 includes a measure generator 622 and an action determiner 623 .
  • the policy generating unit 622 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object in a state going back a time step in an episode representing a learning period.
  • the behavior deciding unit 623 decides the behavior of the controlled object based on the policy.
  • the measure generating unit 622 corresponds to an example of a measure generating means
  • the action deciding unit 623 corresponds to an example of an action deciding means.
  • control system 620 can perform reinforcement learning more efficiently by going back in time steps within an episode.
  • For example, the control system 620 may be able to avoid the occurrence of a factor that would cause the evaluation indicated by the reward value to deteriorate, by going back to a time step before the one in which that factor occurred. This is expected to enable the control system 620 to generate a measure that would improve the evaluation of the entire episode more quickly.
  • the policy generation unit 622 can be realized, for example, by using the functions of the policy generation unit 193 in FIG. 1, etc.
  • the action decision unit 623 can be realized, for example, by using the functions of the action decision unit 191 in FIG. 1, etc.
  • an input/output device 630 includes a status presenting unit 631 and a status designation receiving unit 632.
  • the state presenting unit 631 presents to the user the state of the environment in which the controlled object acts.
  • the state designation receiving unit 632 receives a user operation that designates a state in which the simulator that simulates the environment needs to be interrupted.
  • The status presenting unit 631 corresponds to an example of a status presenting means,
  • and the status designation receiving unit 632 corresponds to an example of a status designation receiving means.
  • According to the input/output device 630, a sample of a state in which a simulation needs to be interrupted can be acquired in response to a user's designation.
  • the input/output device 630 reduces the burden on the designer of the input/output device 630.
  • the user can reflect his/her own judgment as to whether or not a simulation needs to be interrupted in the execution of the simulation.
  • the state presenting unit 631 can be realized, for example, by using the functions of the reset determining unit 194 and the display unit 120 in Fig. 1.
  • the state designation receiving unit 632 can be realized, for example, by using the functions of the reset determining unit 194 and the operation input unit 130 in Fig. 1.
  • FIG. 13 is a diagram showing an example of a processing procedure in a learning method according to some embodiments of the present disclosure.
  • the learning method shown in FIG. 13 includes generating a strategy (step S611) and determining an action (step S612).
  • step S611 the computer generates a policy, which is a decision rule for the behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step back within the episode representing the learning period.
  • step S612 the computer determines an action of the controlled object based on a strategy.
  • According to the learning method shown in FIG. 13, it is expected that reinforcement learning can be performed more efficiently by going back a time step within an episode. For example, according to the learning method shown in Fig. 13, it is possible to avoid the occurrence of a factor that deteriorates the evaluation indicated by the reward value by going back to a time step before the one in which that factor occurred. As a result, it is expected that the learning method shown in Fig. 13 can generate a policy that improves the evaluation of the entire episode more quickly.
  • FIG. 14 is a diagram illustrating an example of a computer configuration in accordance with at least some embodiments of the present disclosure.
  • a computer 700 includes a CPU 710 , a main memory device 720 , an auxiliary memory device 730 , an interface 740 , and a non-volatile recording medium 750 .
  • any one or more of the learning device 100, the control device 200, the learning device 610, the learning device 621, the control device 626, and the input/output device 630, or a part of them, may be implemented in the computer 700.
  • the operation of each of the above-mentioned processing units is stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also secures a storage area corresponding to each of the above-mentioned storage units in the main storage device 720 according to the program.
  • Communication with other devices is achieved by the interface 740 having a communication function and communicating according to the control of the CPU 710.
  • the interface 740 also has a port for the non-volatile recording medium 750, and reads information from the non-volatile recording medium 750 and writes information to the non-volatile recording medium 750.
  • the operations of the processing unit 190 and each of its units are stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also reserves a memory area for the memory unit 180 in the main memory device 720 in accordance with the program. Communication with other devices by the communication unit 110 is achieved by the interface 740 having a communication function and operating under the control of the CPU 710. Display of images by the display unit 120 is achieved by the interface 740 having a display device and displaying various images under the control of the CPU 710. Reception of user operations by the operation input unit 130 is achieved by the interface 740 having an input device and accepting user operations under the control of the CPU 710.
  • When the control device 200 is implemented in the computer 700, its operation is stored in the form of a program in the auxiliary storage device 730.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a memory area in the main memory device 720 for the control device 200 to perform processing according to the program. Communication between the control device 200 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the control device 200 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • the operations of the policy generation unit 611 and the action decision unit 612 are stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a memory area in the main memory device 720 for the learning device 610 to perform processing according to the program. Communication between the learning device 610 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the learning device 610 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • the operations of the policy generation unit 622 and the action decision unit 623 are stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a memory area in the main memory device 720 for the learning device 621 to perform processing according to the program.
  • Communication between the learning device 621 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710.
  • Interaction between the learning device 621 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • When the control device 626 is implemented in the computer 700, its operation is stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a memory area in the main memory 720 for the control device 626 to perform processing according to the program. Communication between the control device 626 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the control device 626 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • the operations of the state presentation unit 631 and the state designation reception unit 632 are stored in the auxiliary storage device 730 in the form of a program.
  • the CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.
  • the CPU 710 also allocates a storage area in the main memory device 720 for the I/O device 630 to perform processing according to the program.
  • Communication between the I/O device 630 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710.
  • Interaction between the I/O device 630 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.
  • any one or more of the above-mentioned programs may be recorded on the non-volatile recording medium 750.
  • the interface 740 may read the program from the non-volatile recording medium 750.
  • the CPU 710 may then directly execute the program read by the interface 740, or may temporarily store the program in the main memory device 720 or the auxiliary memory device 730 and then execute it.
  • a program for executing all or part of the processing performed by learning device 100, control device 200, learning device 610, learning device 621, control device 626, and input/output device 630 may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed to perform processing of each part.
  • The term "computer system" here includes an OS (Operating System) and hardware such as peripheral devices.
  • The term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs (Read Only Memory), and CD-ROMs (Compact Disc Read Only Memory), as well as storage devices such as hard disks built into computer systems.
  • the above-mentioned program may be for realizing part of the above-mentioned functions, or may be capable of realizing the above-mentioned functions in combination with a program already recorded in the computer system.
  • a learning device comprising:
  • (Appendix 2) a reset decision means for suspending the simulation when it is determined that a predetermined condition for suspending the simulation of the behavior of the controlled object is satisfied,
  • the policy generating means generates the policy based on the reward value calculated based on the simulation.
  • (Appendix 3) The learning device further comprising a simulation means for performing the simulation to calculate a state of an environment in which the controlled object acts.
  • (Appendix 4) The learning device according to appendix 2 or 3, wherein the reset decision means decides whether or not to interrupt the simulation based on a change in a cumulative reward value, which is a cumulative value of instantaneous reward values.
  • (Appendix 5) The learning device according to appendix 4, wherein the reset decision means decides to interrupt the simulation when it determines, based on the increase or decrease of the cumulative reward value, that the evaluation indicated by the cumulative reward value has deteriorated below a predetermined threshold.
  • (Appendix 6) The learning device according to appendix 5, wherein the reset decision means updates the threshold value so that the range of deterioration in the evaluation indicated by the threshold value of the cumulative reward value increases as the simulation progresses.
  • (Appendix 7) The learning device according to any one of appendices 4 to 6, wherein the reset decision means decides to interrupt the simulation when, based on an increase or decrease in the cumulative reward value, a deterioration in the evaluation indicated by the cumulative reward value continues for a predetermined number of time steps or more.
  • (Appendix 8) The reset decision means determines whether or not to interrupt the simulation based on a value function value that is a predicted value of the cumulative reward value, which is a cumulative value of instantaneous reward values.
  • (Appendix 9) The reset decision means decides to interrupt the simulation when it determines, based on an increase or decrease in the value function value, that the evaluation indicated by the value function value has deteriorated below a predetermined threshold.
  • (Appendix 10) The reset decision means updates the threshold value so that the range of deterioration in the evaluation indicated by the threshold value of the value function value increases as the simulation progresses.
  • (Appendix 11) The learning device according to appendix 9 or 10, wherein the reset decision means decides to interrupt the simulation when, based on an increase or decrease in the value function value, a deterioration in the evaluation indicated by the value function value continues for a predetermined number of time steps or more.
  • (Appendix 12) The learning device according to any one of appendices 2 to 11, wherein the reset decision means decides to interrupt the simulation when it determines that a current state of the environment in which the control target acts, calculated in the simulation, is similar, to a certain degree or more, to any one or more states preset as states requiring interruption.
  • (Appendix 15) The learning device according to appendix 14, wherein the restart state determination means selects one of the reset destination candidates based on a probability distribution set for the reset destination candidates, which are executed time steps whose states are stored as time steps from which the simulation can be restarted within the episode that was executed until the simulation was interrupted.
  • (Appendix 16) The restart state determination means selects one of the plurality of reset destination candidates in accordance with a uniform probability distribution.
  • (Appendix 17) The restart state determination means sets a probability distribution for the plurality of reset destination candidates such that the greater the deterioration of the evaluation at the time of interruption of the simulation, as indicated by an increase or decrease in the cumulative reward value, which is a cumulative value of instantaneous reward values, the more likely a reset destination candidate separated from the time of interruption by a larger number of time steps is to be selected, and selects one of the reset destination candidates according to the set probability distribution.
  • (Appendix 18) The learning device according to any one of appendices 14 to 17, wherein the restart state determination means selects, from among the reset destination candidates, which are executed time steps whose states are stored as time steps from which the simulation can be restarted within the episode that was executed until the simulation was interrupted, the reset destination candidate corresponding to the time step, among those at which the change in the evaluation indicated by the increase or decrease in the cumulative reward value turned from improvement to deterioration, that is separated from the time of interruption by the smallest number of time steps, or a time step earlier than that.
  • (Appendix 19) A control device that performs control on a control target based on a policy obtained using the learning device according to any one of appendices 1 to 18.
  • (Appendix 20) A control system comprising a learning device and a control device, wherein the learning device includes: a policy generating means for generating a policy, which is a decision rule for an action, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state going back a time step within an episode representing a learning period; and an action decision means for deciding an action of the control target based on the policy; and wherein the control device controls the control target based on the policy obtained using the learning device.
  • (Appendix 21) An input/output device comprising: a state presentation means for presenting to a user a state of an environment in which the control target acts; and a state designation receiving means for receiving a user operation for designating a state in which the simulator that simulates the environment needs to be interrupted.
  • a learning method that includes:
  • This disclosure may be applied to a learning device, a control system, an input/output device, a learning method, and a recording medium.
  • 1 Control system; 100, 610, 621 Learning device; 110 Communication unit; 120 Display unit; 130 Operation input unit; 180 Storage unit; 190 Processing unit; 191, 612, 623 Action decision unit; 192 Simulation unit; 193, 611, 622 Policy generation unit; 194 Reset decision unit; 195 Resume state decision unit; 200, 624 Control device; 630 Input/output device; 631 State presentation unit; 632 State designation reception unit; 910 Control target

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Feedback Control In General (AREA)

Abstract

In the present invention, a learning device generates a policy, which is a rule for deciding an action of a control target, on the basis of a reward value indicating an evaluation of the action in a state reached by going back a time step within an episode representing a learning period, and decides the action of the control target on the basis of the policy.

Description

Learning device, control system, input/output device, learning method, and recording medium

This disclosure relates to a learning device, a control system, an input/output device, a learning method, and a recording medium.

One type of machine learning is reinforcement learning (see, for example, Patent Document 1).

Japanese Patent Application Publication No. 2022-182581

It would be preferable to be able to efficiently perform reinforcement learning using simulation.

One example of the objective of this disclosure is to provide a learning device, a control system, an input/output device, a learning method, and a recording medium that can solve the above-mentioned problem.

According to a first aspect of the present disclosure, a learning device includes a policy generation means for generating a policy, which is a decision rule for the action of a control target, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state reached by going back a time step within an episode representing a learning period, and an action decision means for deciding the action of the control target based on the policy.

According to a second aspect of the present disclosure, a control system includes a learning device and a control device, the learning device including a policy generation means for generating a policy, which is a decision rule for the action of a control target, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state reached by going back a time step within an episode representing a learning period, and an action decision means for deciding the action of the control target based on the policy, and the control device controls the control target based on the policy obtained using the learning device.

According to a third aspect of the present disclosure, an input/output device includes a state presenting means for presenting to a user the state of an environment in which a control target acts, and a state designation receiving means for receiving a user operation for designating a state in which the simulator that simulates the environment needs to be interrupted.

According to a fourth aspect of the present disclosure, a learning method includes generating a policy, which is a decision rule for the action of a control target, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state reached by going back a time step within an episode representing a learning period, and deciding the action of the control target based on the policy.

According to a fifth aspect of the present disclosure, a recording medium stores a program for causing a computer to execute: generating a policy, which is a decision rule for the action of a control target, based on a reward value, which is a value indicating an evaluation of the action of the control target in a state reached by going back a time step within an episode representing a learning period; and deciding the action of the control target based on the policy.

According to the present disclosure, reinforcement learning using simulation can be performed relatively efficiently.

FIG. 1 is a diagram illustrating an example of the configuration of a control system according to some embodiments of the present disclosure.
FIG. 2 is a diagram illustrating an example of the configuration of a learning device according to some embodiments of the present disclosure.
FIG. 3 is a diagram showing an example of instantaneous reward values that are the basis for calculating a cumulative reward value.
FIG. 4 is a diagram illustrating an example of a cumulative reward value.
FIG. 5 is a diagram showing a first example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
FIG. 6 is a diagram showing a second example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
FIG. 7 is a diagram showing a third example of a probability distribution set for reset destination candidates according to some embodiments of the present disclosure.
FIG. 8 is a diagram showing an example of data input and output in a learning device according to some embodiments of the present disclosure.
FIG. 9 is a diagram showing an example of a processing procedure performed by a learning device according to some embodiments of the present disclosure.
FIG. 10 is a diagram illustrating another example of the configuration of a learning device according to some embodiments of the present disclosure.
FIG. 11 is a diagram illustrating another example of the configuration of a control system according to some embodiments of the present disclosure.
FIG. 12 is a diagram illustrating an example of the configuration of an input/output device according to some embodiments of the present disclosure.
FIG. 13 is a diagram illustrating an example of a processing procedure in a learning method according to some embodiments of the present disclosure.
FIG. 14 is a diagram illustrating an example of the configuration of a computer according to at least one embodiment of the present disclosure.

Hereinafter, embodiments of the present disclosure will be described, but the following embodiments do not limit the disclosure according to the claims. Furthermore, not all of the combinations of features described in the embodiments are necessarily essential to the solution of the disclosure.
FIG. 1 is a diagram illustrating an example of the configuration of a control system according to some embodiments of the present disclosure. In the configuration illustrated in FIG. 1, the control system 1 includes a learning device 100, a control device 200, and a control target 910.

The control system 1 is a system that learns control of a control target 910 and controls the control target 910 based on the learning results.
The learning device 100 learns control over the control target 910. In particular, the learning device 100 learns control over the control target 910 by reinforcement learning using a simulation.

The reinforcement learning referred to here is machine learning that learns a policy, which is the behavioral rule of an agent that performs actions in a certain environment, based on a reward representing an evaluation of those actions.
The state of the environment is also referred to simply as the state. The environment here may include the agent. Therefore, the state here may include the state of the agent.

For each step in reinforcement learning, the learning device 100 decides an action based on the policy under the state at that time, simulates the decided action, and calculates the next state, which is the state at the next step. The learning device 100 also calculates a reward value (a value of the reward) based on the obtained next state, and updates the policy based on the calculated reward value. Updating the policy here can be regarded as generating a policy. In other words, the learning device 100 can be regarded as generating a policy based on a past policy.

Hereinafter, a step in reinforcement learning is also referred to as a time step, or simply a step. Below, time is expressed in time steps.
The reinforcement learning method used by the learning device 100 is not limited to a specific type of method. For example, the learning device 100 may learn control of the control target 910 based on a known reinforcement learning method such as Q-learning or SARSA.
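As an illustrative sketch only (the disclosure is not limited to any particular method), the per-step cycle described above (deciding an action from the policy, simulating it, obtaining the next state and a reward value, and updating the policy) can be written as a tabular Q-learning step. The environment interface `env.step`, the state encoding, and all hyperparameters here are hypothetical:

```python
import random
from collections import defaultdict

def q_learning_step(q, state, env, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One reinforcement-learning time step: decide an action from the current
    policy (Q-table), simulate it, and update the policy from the reward."""
    # Epsilon-greedy action decision based on the current Q-values.
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: q[(state, a)])
    # The simulator returns the next state, an instantaneous reward value,
    # and whether the episode has ended.
    next_state, reward, done = env.step(state, action)
    # Q-learning update: move Q(s, a) toward reward + discounted future value.
    best_next = max(q[(next_state, a)] for a in range(n_actions))
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return next_state, done

# A Q-table with a default value of 0.0 for unseen (state, action) pairs.
q_table = defaultdict(float)
```

Repeating this step until an episode-end condition holds corresponds to the per-episode loop described in the text.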

For example, for each episode, the learning device 100 repeats the processing for each step until the episode ends. An episode here is a unit of time in reinforcement learning: the time interval from when the agent starts a series of actions until it ends them. For example, the learning device 100 repeats the processing for each step in an episode until a condition predetermined as the end condition of the episode is satisfied.

Furthermore, when the learning device 100 determines that a condition predetermined for interrupting the simulation is satisfied, the learning device 100 interrupts the simulation even in the middle of an episode. The condition for interrupting the simulation is also referred to as an interruption condition. Interrupting the simulation can also be regarded as interrupting the episode being executed.
This allows the learning device 100 to automatically interrupt the execution of an episode when it falls into a state in which no policy update can be expected even if the episode is continued, or into a state in which policy updates are expected to progress only slowly.
Updating the policy can be regarded as improving the policy. The policy being updated can be regarded as progress of the reinforcement learning.
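As a hypothetical sketch of such an interruption condition, a check combining a deterioration threshold on the cumulative reward value with a run of consecutively deteriorating time steps might look as follows; the threshold and patience values are invented for illustration and are not part of the disclosure:

```python
def should_interrupt(cumulative_rewards, threshold_drop=5.0, patience=3):
    """Decide whether to interrupt the simulation based on the change in the
    cumulative reward value (the cumulative sum of instantaneous reward values).

    Interrupt when either:
      - the cumulative reward has deteriorated by more than `threshold_drop`
        from its best value so far, or
      - the evaluation has kept deteriorating for `patience` or more
        consecutive time steps.
    """
    if len(cumulative_rewards) < 2:
        return False
    best = max(cumulative_rewards)
    current = cumulative_rewards[-1]
    if best - current > threshold_drop:
        return True
    # Count consecutive deteriorating transitions at the end of the sequence.
    streak = 0
    for prev, cur in zip(cumulative_rewards[-patience - 1:], cumulative_rewards[-patience:]):
        streak = streak + 1 if cur < prev else 0
    return streak >= patience
```

A check of this kind would be evaluated once per time step, and a positive result would trigger the reset handling described below.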

The learning device 100 that has interrupted the execution of an episode may go back time steps within the episode and resume the execution of the episode. Alternatively, the learning device 100 that has interrupted the execution of an episode may start the execution of another episode.
According to the learning device 100, reinforcement learning can be performed more efficiently than in a case where the execution of an episode is continued even after falling into a state in which no policy update can be expected, or into a state in which policy updates are expected to progress only slowly.

During the execution of an episode, the learning device 100 also stores the state at each of a plurality of time steps. When the learning device 100 interrupts the execution of the episode, it selects one of the time steps whose state is stored, returns to the selected time step, and resumes the execution of the episode. Specifically, the learning device 100 resumes the processing for each time step from the state of the selected time step.

Here, consider a case where an episode is executed from the beginning every time. The beginning of the episode is a part for which learning has already progressed sufficiently, and it is conceivable that the policy update will not proceed any further there. Since the learning device 100 can select not only the beginning of the episode but also a time step in the middle of the episode as the point to go back to, reinforcement learning is expected to be performed more efficiently.
The time step to which the episode is traced back is also referred to as the reset destination. A candidate for the reset destination is also referred to as a reset destination candidate. A reset destination candidate is, for example, an executed time step whose state is stored as a time step from which the simulation can be resumed, within the episode that was being executed until the simulation was interrupted.
Interrupting the execution of the simulation and changing the state in the simulation to the state of the reset destination is also referred to as resetting the simulation.
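The snapshot storage and reset mechanism described above might be sketched as follows. This is an assumption-laden illustration, not the disclosed implementation; in particular, weighting earlier time steps more heavily is only one hypothetical choice of the probability distribution over reset destination candidates:

```python
import copy
import random

class EpisodeSnapshots:
    """Store simulation states at executed time steps so that an interrupted
    episode can be resumed from a selected reset destination candidate."""

    def __init__(self):
        self._snapshots = {}  # time step -> deep-copied simulation state

    def save(self, t, state):
        self._snapshots[t] = copy.deepcopy(state)

    def candidates(self):
        # Executed time steps whose states are stored are the reset
        # destination candidates.
        return sorted(self._snapshots)

    def choose_reset_destination(self, t_interrupt, rng=random):
        """Select one candidate; here, steps further from the interruption
        receive larger weights, as one example of a non-uniform distribution."""
        cands = self.candidates()
        weights = [t_interrupt - t + 1 for t in cands]
        return rng.choices(cands, weights=weights, k=1)[0]

    def restore(self, t):
        # Resetting the simulation: return the stored state for the chosen step.
        return copy.deepcopy(self._snapshots[t])
```

A uniform distribution (equal weights) would correspond to the simplest selection rule mentioned in the appendices.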

The control device 200 controls the control target 910 based on the policy obtained by learning using the learning device 100.
The control target 910 is not limited to a specific one, and can be any of various objects for which control can be learned using reinforcement learning. For example, the control target 910 may be equipment such as a factory or a power generation plant, a system such as a manufacturing line in a factory, or a standalone device. Alternatively, the control target 910 may be a moving object such as an automobile, a railroad vehicle, an airplane, a ship, or a self-propelled mobile robot, or a transportation system such as a railroad or air traffic control system.
The control target 910 may be configured as a part of the control system 1, or may be configured externally to the control system 1.

During learning, only the learning device 100 is needed; the control device 200 and the control target 910 may be absent. During execution of control, only the control device 200 and the control target 910 are needed; the learning device 100 may be absent.
All of the learning device 100, the control device 200, and the control target 910, or a combination of two of them, may be configured integrally. For example, the learning device 100 and the control device 200 may be configured as a single device. Also, the control device 200 and the control target 910 may be configured as a single device.

FIG. 2 is a diagram showing an example of the configuration of the learning device 100. In the configuration shown in FIG. 2, the learning device 100 includes a communication unit 110, a display unit 120, an operation input unit 130, a storage unit 180, and a processing unit 190. The processing unit 190 includes an action decision unit 191, a simulation unit 192, a policy generation unit 193, a reset decision unit 194, and a resume state decision unit 195.

The communication unit 110 communicates with other devices. For example, the communication unit 110 may receive various information for performing a simulation from other devices. The communication unit 110 may also transmit the policy obtained by learning to the control device 200.

The display unit 120 has a display screen, such as a liquid crystal panel or an LED (Light Emitting Diode) panel, and displays various images. For example, the display unit 120 may display information related to the learning performed by the learning device 100, such as the progress of that learning.

The operation input unit 130 includes input devices such as a keyboard and a mouse, and receives user operations. For example, the operation input unit 130 may receive a user operation instructing the learning device 100 to start learning control over the control target 910.
The storage unit 180 stores various data. For example, it may store snapshots of states in the simulation. The storage unit 180 is configured using a storage device included in the learning device 100.

 The processing unit 190 controls each unit of the learning device 100 to perform various kinds of processing. The functions of the processing unit 190 are implemented, for example, by a CPU (Central Processing Unit) included in the learning device 100 reading a program from the storage unit 180 and executing it.
 The action determination unit 191 determines an action of the control target 910 in reinforcement learning. The action determination unit 191 determines the action of the control target 910 based on a policy.
 The action determination unit 191 corresponds to an example of an action determination means.

 The simulation unit 192 performs a simulation of the environment. In particular, the simulation unit 192 simulates the action determined by the action determination unit 191 and calculates the next state, that is, the state after the action. Here, the control target 910 in the simulation can be regarded as an agent or a part of an agent, and the control target 910 may be included in the environment to be simulated. The state in the simulation may include the state of the control target 910.
 The simulation unit 192 corresponds to an example of a simulation means.
 The simulation unit 192 may be configured as a part of the learning device 100, or may be configured externally to the learning device 100.

 The policy generation unit 193 updates the policy based on a reward value. As described above, the reward value is a value indicating an evaluation of an action, and the policy is a decision rule for actions. As described above, updating the policy here can be regarded as generating the policy.
 The policy generation unit 193 may update the policy based on an instantaneous reward value, or based on a cumulative reward value. The instantaneous reward value is a value calculated at each step based on the state, the action, and the next state, or some of these, and indicates an evaluation of the action at that step. The cumulative reward value is a value calculated by accumulating instantaneous reward values over multiple steps. When the cumulative reward value is calculated, each instantaneous reward value may be multiplied by a coefficient value such as a forgetting coefficient.
 Alternatively, the policy generation unit 193 may update the policy based on a value function value (also referred to simply as a value). The value function value is a predicted value of the cumulative reward value, such as the expected value of the cumulative reward value at the end of an episode.

 The instantaneous reward value, the cumulative reward value, and the value function value all correspond to examples of reward values.
 The learning device 100 may use a reward value for which a larger value indicates a better evaluation, or a reward value for which a smaller value indicates a better evaluation. In the following, a case where the learning device 100 uses a reward value for which a larger value indicates a better evaluation is described as an example.
 The combination of the action determination unit 191 and the policy generation unit 193 is also referred to as an agent processing unit.

 When the reset determination unit 194 determines that an interruption condition is satisfied, it interrupts the simulation by the simulation unit 192. Specifically, the reset determination unit 194 causes the simulation unit 192 to suspend the simulation it is performing. As described above, the interruption condition is a condition determined in advance as a condition for interrupting the simulation of the behavior of the control target 910.
 The reset determination unit 194 corresponds to an example of a reset determination means.

 As described above for the learning device 100, the reset determination unit 194 can automatically interrupt the execution of an episode when the episode falls into a state in which no policy update can be expected even if the episode is continued, or into a state in which policy updates are expected to progress only slowly.

 The reset determination unit 194 that has interrupted the execution of an episode may go back in time steps within that episode and cause the simulation unit 192 to restart the simulation from the state at the time step it went back to. Alternatively, the reset determination unit 194 that has interrupted the execution of an episode may select another episode and cause the simulation unit 192 to perform a simulation of the selected episode.
 As a result, the learning device 100 can perform reinforcement learning more efficiently than when execution of an episode is continued even after falling into a state in which no policy update can be expected, or a state in which policy updates are expected to progress only slowly.

 The restart state determination unit 195 determines a reset destination when the simulation by the simulation unit 192 is interrupted. The simulation unit 192 restarts the simulation from the reset destination state determined by the restart state determination unit 195.
 The restart state determination unit 195 corresponds to an example of a restart state determination means.

 As described above, policy updates do not necessarily progress at the beginning of an episode. As described above for the learning device 100, the restart state determination unit 195 can select not only the beginning of an episode but also a time step in the middle of the episode as the destination to go back to, which is expected to enable the learning device 100 to perform reinforcement learning more efficiently.

 Note that, when the reset determination unit 194 interrupts the execution of an episode, it may cause the simulation unit 192 to restart the simulation from a predetermined state, such as the initial state of that episode.
 In this case, the learning device 100 may be configured without the restart state determination unit 195.

 The reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a change in the cumulative reward value.
 FIG. 3 is a diagram showing an example of the instantaneous reward values from which a cumulative reward value is calculated. The horizontal axis of the graph in FIG. 3 indicates the time step, and the vertical axis indicates the instantaneous reward value.
 In the example of FIG. 3, the instantaneous reward value can take positive and negative values. The larger the instantaneous reward value (for example, the larger its magnitude as a positive value), the better the evaluation it indicates. The smaller the instantaneous reward value (for example, the larger its magnitude as a negative value), the worse the evaluation it indicates.
 In the example of FIG. 3, the instantaneous reward value is positive at time step t11, and from the time step after t11 onward, time steps in which the instantaneous reward value is negative continue.

 FIG. 4 is a diagram showing an example of the cumulative reward value. The horizontal axis of the graph in FIG. 4 indicates the time step, and the vertical axis indicates the cumulative reward value.
 FIG. 4 shows the cumulative reward value calculated by accumulating the instantaneous reward values shown in FIG. 3 from the start of the episode. The cumulative reward value reaches a local maximum at time step t11, and continues to decrease from the time step after t11 onward.

 In a state where the cumulative reward value is small, such as the state at time step t12, it is likely that an event has occurred that causes the evaluation indicated by the reward value to be poor. For example, when the control target 910 is a railway, the intervals between trains may have become too short, disrupting the train schedule.

 When an event that causes the evaluation indicated by the reward value to be poor has occurred, continuing to execute the episode as it is will not improve that evaluation, and going back in time steps and redoing the execution of the episode is considered more likely to advance the policy update.
 For example, when the control target 910 is a railway and the intervals between trains have become too short, disrupting the schedule, the evaluation indicated by the reward value is unlikely to improve until the congestion between trains is resolved. Because the evaluation indicated by the reward value remains poor, the policy update by the policy generation unit 193 is unlikely to progress.

 In this case, rather than continuing to execute the episode as it is, the policy update is expected to progress more if execution of the episode is redone from a time step before the event causing the poor evaluation occurred, such as a time step before the intervals between trains became too short.
 In the example of FIG. 4, in the time steps from the time step after t11 onward, where the cumulative reward value continues to decrease, it is likely that an event causing the poor evaluation has already occurred. In this case, the policy update is expected to progress more if execution of the episode is redone from time step t11 or an earlier time step.

 Therefore, the reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on the change in the cumulative reward value.
 For example, when the reset determination unit 194 determines that the evaluation indicated by the cumulative reward value has deteriorated beyond a predetermined threshold, it may decide to interrupt the simulation by the simulation unit 192. In the example of FIG. 4, the reset determination unit 194 may compare the magnitude of the decrease from the local maximum of the cumulative reward value at time step t11 with a threshold d11, and decide to interrupt the simulation by the simulation unit 192 at, or after the end of, time step t12, at which the magnitude of the decrease exceeds the threshold d11.
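 The threshold check described here can be sketched as follows. This is not code from the patent; the function name is illustrative, and for simplicity the decrease is measured from the maximum of the cumulative reward observed so far (one of the starting points discussed below), rather than from the most recent local maximum.

```python
# Illustrative sketch: decide to interrupt the simulation when the
# cumulative reward has dropped from its maximum so far by more than
# a threshold d11.
def should_interrupt(cumulative_rewards, d11):
    if not cumulative_rewards:
        return False
    peak = max(cumulative_rewards)          # best evaluation so far
    drop = peak - cumulative_rewards[-1]    # deterioration since that peak
    return drop > d11
```

 Such a check would be evaluated once per time step, after the cumulative reward for that step has been computed.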

 In this case, the reset determination unit 194 may update the threshold so that the width of the deterioration in evaluation indicated by the threshold of the cumulative reward value increases as the simulation by the simulation unit 192 progresses. In the example of FIG. 4, the reset determination unit 194 may update the value of the threshold d11 so that it increases as the time steps progress.

 The starting point from which the reset determination unit 194 calculates the amount of decrease in the cumulative reward value is not limited to a local maximum of the cumulative reward value. For example, the reset determination unit 194 may calculate the amount of decrease from a cumulative reward value of 0 and compare it with the threshold. Alternatively, the reset determination unit 194 may calculate the amount of decrease from the maximum value of the cumulative reward value and compare it with the threshold.

 Here, in the early stage of an episode or the early stage of reinforcement learning, there are likely to be many cases in which the reward value does not become large because learning has not yet progressed. In this case, by the reset determination unit 194 setting the threshold to a relatively small value so that execution of the episode is interrupted earlier, the action determination unit 191 comes to try various actions, and it is expected that an action (or series of actions) yielding a large reward value can be found at a relatively early point.

 Alternatively, the reset determination unit 194 may decide, based on the increase and decrease of the cumulative reward value, to interrupt the simulation by the simulation unit 192 when the deterioration of the evaluation indicated by the cumulative reward value has continued for a predetermined number of time steps or more. In the example of FIG. 4, when the number of time steps for interrupting the simulation by the simulation unit 192 is set to 7, the reset determination unit 194 may decide to interrupt the simulation by the simulation unit 192 at, or after the end of, time step t12, at which the cumulative reward value has decreased seven times in succession from time step t11, at which the cumulative reward value is at a local maximum.
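 The consecutive-decrease criterion can be sketched as follows. The patent gives no code; the function names and the default streak length of 7 (taken from the FIG. 4 example) are illustrative.

```python
# Illustrative sketch: count how many consecutive time steps the
# cumulative reward has been decreasing at the end of the sequence,
# and interrupt once the streak reaches a predetermined length.
def consecutive_decreases(cumulative_rewards):
    count = 0
    for prev, cur in zip(cumulative_rewards, cumulative_rewards[1:]):
        count = count + 1 if cur < prev else 0  # streak resets on any non-decrease
    return count

def should_interrupt_by_streak(cumulative_rewards, max_streak=7):
    return consecutive_decreases(cumulative_rewards) >= max_streak
```

 The streak counter resets to zero whenever the cumulative reward stops decreasing, so only an uninterrupted run of deteriorating steps triggers the interruption.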

 The reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a combination of the amount of decrease in the cumulative reward value and the number of consecutive time steps in which the cumulative reward value has decreased. For example, the reset determination unit 194 may decide to interrupt the simulation by the simulation unit 192 both when the magnitude of the decrease in the cumulative reward value exceeds a predetermined threshold and when the number of consecutive time steps in which the cumulative reward value has decreased reaches or exceeds a predetermined number.

 The reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on the value function value, in addition to or instead of the cumulative reward value. The value function value is a predicted value of the cumulative reward value, and a positive correlation is considered to exist between the value function value and the cumulative reward value. By the reset determination unit 194 deciding whether or not to interrupt the simulation based on the value function value, the simulation by the simulation unit 192 comes to be interrupted when an event causing the evaluation indicated by the reward value to be poor has occurred, as in the case based on the cumulative reward value, and the policy update is expected to progress.

 As in the case of the cumulative reward value described above, the reset determination unit 194 may interrupt the simulation by the simulation unit 192 when it determines, based on the increase and decrease of the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold.
 In this case, as in the case of the cumulative reward value described above, the reset determination unit 194 may update the threshold so that the width of the deterioration in evaluation indicated by the threshold of the value function value increases as the simulation by the simulation unit 192 progresses.

 Also, as in the case of the cumulative reward value described above, the reset determination unit 194 may decide, based on the increase and decrease of the value function value, to interrupt the simulation by the simulation unit 192 when the deterioration of the evaluation indicated by the value function value has continued for a predetermined number of time steps or more.
 As in the case of the cumulative reward value described above, the reset determination unit 194 may decide whether or not to interrupt the simulation by the simulation unit 192 based on a combination of the amount of decrease in the value function value and the number of consecutive time steps in which the value function value has decreased.

 The reset determination unit 194 may interrupt the simulation by the simulation unit 192 when it determines that the current state (the state at the current time step) is similar, to at least a certain degree, to any of one or more states set in advance as states requiring interruption.
 For example, when the control target 910 is a railway, the storage unit 180 stores, as states requiring interruption, a plurality of patterns of states in which the intervals between trains have become too short and the schedule has been disrupted. The reset determination unit 194 then compares the current state with each of the states requiring interruption, and determines whether the current state is similar to any of them. When determining that the current state is similar to any of the states requiring interruption, the reset determination unit 194 decides to interrupt the simulation by the simulation unit 192.

 The method by which the reset determination unit 194 determines whether the current state is similar to a state requiring interruption is not limited to any specific method.
 For example, the reset determination unit 194 may compute a feature vector for each of the two states and calculate a vector similarity such as the cosine similarity. The reset determination unit 194 may then compare the calculated similarity with a threshold and determine that the two states are similar when the similarity is equal to or greater than the threshold. Alternatively, a machine learning model that determines whether two states are similar may be prepared, and the reset determination unit 194 may use this machine learning model to determine whether the current state is similar to a state requiring interruption.
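 The cosine-similarity comparison described here can be sketched as follows. This is not code from the patent; the function names, the feature-vector representation, and the example threshold of 0.9 are all illustrative assumptions.

```python
import math

# Illustrative sketch: compare the current state's feature vector with
# each stored "interruption required" state via cosine similarity, and
# report a match when the similarity reaches a threshold.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def matches_interrupt_state(current, interrupt_states, threshold=0.9):
    return any(cosine_similarity(current, s) >= threshold
               for s in interrupt_states)
```

 In practice the feature vectors would be extracted from the simulation state (for the railway example, quantities such as inter-train headways), which is outside the scope of this sketch.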

 When the learning device 100 collects states requiring interruption, it may present states to the user and accept the user's designation of states requiring interruption.
 For example, the reset determination unit 194 may cause the display unit 120 to display some state, such as a current or past state in the simulation. The reset determination unit 194 may then receive, via the operation input unit 130, a user operation instructing whether or not to register the displayed state as a state requiring interruption.

 In this case, the combination of the reset determination unit 194 and the display unit 120 corresponds to an example of a state presentation means, and the combination of the reset determination unit 194 and the operation input unit 130 corresponds to an example of a state designation reception means.
 When the control target 910 is a railway, the display unit 120 may display the current train operation status in the simulation in the form of a timetable diagram. Then, when the user judges that the displayed operation status is one requiring interruption, the user may perform a user operation indicating this on the operation input unit 130, and the reset determination unit 194 may detect that the user operation has been performed.

 The learning device 100 corresponds to an example of an input/output device in that it includes the reset determination unit 194, the display unit 120, and the operation input unit 130. Alternatively, the input/output device may be configured as a device separate from the learning device 100. For example, a terminal device of the learning device 100 may have a function of displaying a state in accordance with an instruction from the learning device 100, and a function of receiving a user operation designating a state requiring interruption and notifying the learning device 100 of it.

 The restart state determination unit 195 may select, from among the reset destination candidates in the episode that was being executed until the simulation was interrupted, a reset destination candidate corresponding to the time step, among those at which the change in evaluation indicated by the increase and decrease of the cumulative reward value turned from improvement to deterioration, that has the fewest time steps between it and the time of the interruption, or to an earlier time step. In other words, the restart state determination unit 195 may select a reset destination candidate corresponding to the time step at which the evaluation most recently reached a local maximum before the interruption of the simulation, or to an earlier time step.

 In the example of FIG. 4, when the reset determination unit 194 interrupts the simulation by the simulation unit 192 at time step t12, time step t11 corresponds to the time step at which the cumulative reward value most recently reached a local maximum before the interruption (time step t12). The restart state determination unit 195 may select a reset destination candidate corresponding to time step t11 or an earlier time step.

 In the time steps after the time step at which the evaluation most recently reached a local maximum before the interruption of the simulation, the cumulative reward value is continuously decreasing, so an event causing the evaluation indicated by the reward value to be poor is likely to have occurred. By the restart state determination unit 195 selecting a reset destination candidate corresponding to the time step of that most recent local maximum or an earlier time step, the probability of selecting a time step preceding the occurrence of the event causing the poor evaluation is expected to increase.

 Alternatively, the restart state determination unit 195 may select any one reset destination candidate from among all the reset destination candidates in the episode that was being executed until the simulation was interrupted.
 As a result, when a decrease in the cumulative reward value does not necessarily indicate the occurrence of an event causing the evaluation indicated by the reward value to be poor, the restart state determination unit 195 may be able to select a later reset destination candidate (one closer to the time of the interruption).

 The restart state determination unit 195 may select any one of a plurality of reset destination candidates in accordance with a probability distribution set over those candidates.
 Here, when an event causing the evaluation indicated by the reward value to be poor has occurred, from the viewpoint of going back to a time step before the event occurred, it is conceivable to go back to as early a time step as possible in the episode (a time step close to the beginning of the episode). On the other hand, from the viewpoint of reducing repeated learning of the early part of the episode, it is conceivable to go back to as late a time step as possible among the time steps already executed in the episode. Thus, there is a trade-off between making reinforcement learning more efficient by avoiding states in which an event causing a poor evaluation has occurred, and making it more efficient by reducing repeated learning of the early part of the episode.

 Therefore, the restart state determination unit 195 selects one of the plurality of reset destination candidates probabilistically. This avoids restarting execution of the episode from near its beginning every time execution is interrupted. Also, when the event causing the poor evaluation has already occurred at the selected reset destination candidate, it is expected that, by repeating the interruption of the episode one or more additional times, a reset destination candidate preceding the occurrence of that event can be selected.

 The restart state determination unit 195 may select one of the plurality of reset destination candidates in accordance with a uniform probability distribution.
 This allows the restart state determination unit 195 to select a reset destination candidate even for a control target 910 for which there is no guide as to how many time steps must be gone back to reach a state preceding the occurrence of the event causing the poor evaluation.

FIG. 5 is a diagram showing a first example of a probability distribution set for reset destination candidates. The horizontal axis of the graph in FIG. 5 represents time steps, and the vertical axis represents probability.
In the example of FIG. 5, the reset determination unit 194 has suspended the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps t21 to t25 are the reset destination candidates. Of these five candidates, the restart state determination unit 195 takes the three time steps t21 to t23 as the candidates eligible for selection and sets a uniform probability distribution over them. Following this distribution, the restart state determination unit 195 selects each of the three candidates with a probability of one third.

Alternatively, the restart state determination unit 195 may set a probability distribution over the reset destination candidates such that the greater the deterioration in evaluation at the time the simulation is interrupted, as indicated by the change in the cumulative reward value, the more likely a candidate with a larger number of time steps between it and the interruption is to be selected, and may then select one candidate according to that distribution.

FIG. 6 is a diagram showing a second example of a probability distribution set for reset destination candidates. The horizontal axis of the graph in FIG. 6 represents time steps, and the vertical axis represents probability.
In the example of FIG. 6, the reset determination unit 194 has suspended the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps t21 to t25 are the reset destination candidates. Of these five candidates, the restart state determination unit 195 takes the three time steps t21 to t23 as the candidates eligible for selection and sets a probability distribution over them.

FIG. 6 shows an example in which the decrease in the cumulative reward value at the time the simulation is interrupted is relatively small; a higher probability is set for candidates closer to the interruption (that is, candidates with fewer time steps between them and the interruption).
The restart state determination unit 195 selects one of the three candidates according to the set distribution, making candidates close to the interruption relatively likely to be chosen.

FIG. 7 is a diagram showing a third example of a probability distribution set for reset destination candidates. The horizontal axis of the graph in FIG. 7 represents time steps, and the vertical axis represents probability.
In the example of FIG. 7, the reset determination unit 194 has suspended the simulation by the simulation unit 192 at a time step after time step t25, and the five time steps t21 to t25 are the reset destination candidates. Of these five candidates, the restart state determination unit 195 takes the three time steps t21 to t23 as the candidates eligible for selection and sets a probability distribution over them.

FIG. 7 shows an example in which the decrease in the cumulative reward value at the time the simulation is interrupted is relatively large. In the example of FIG. 7, time step t21, a reset destination candidate relatively far from the interruption, is assigned a higher probability than in the case of FIG. 6, while time step t23, a candidate relatively close to the interruption, is assigned a lower probability than in the case of FIG. 6.
The restart state determination unit 195 selects one of the three candidates according to the set distribution, making candidates far from the interruption (that is, candidates with more time steps between them and the interruption) relatively likely to be chosen.

Here, a situation in which the cumulative reward value drops sharply is one that is particularly desirable to avoid. To raise the likelihood of avoiding such a situation, it is conceivable to trace back a large number of time steps from the interruption of the simulation.
By contrast, a situation in which the cumulative reward value decreases only gradually is less important to avoid than one in which it drops sharply. In that case, to reduce repeated learning at the beginning of the episode, it is conceivable to trace back relatively few time steps from the interruption.
By having the restart state determination unit 195 vary which reset destination candidates are likely to be selected according to the magnitude of the change in the cumulative reward value, the learning device 100 is expected to perform reinforcement learning efficiently.
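One way to realize such a drop-dependent distribution is sketched below; it is an assumption for illustration, not the disclosed implementation. Candidate weights vary exponentially with the number of time steps back, scaled by how far the drop in the cumulative reward value exceeds an assumed reference level `reference_drop`: small drops favor candidates near the interruption, large drops favor candidates further back.

```python
import math

def reset_candidate_distribution(num_candidates, drop,
                                 reference_drop=1.0, scale=1.0):
    """Return selection probabilities for reset destination candidates
    indexed 0 (furthest back in the episode) to num_candidates - 1
    (closest to the interruption).  Drops smaller than reference_drop
    favour candidates near the interruption; larger drops favour
    candidates further back."""
    weights = [math.exp(scale * (drop - reference_drop)
                        * (num_candidates - 1 - i))
               for i in range(num_candidates)]
    total = sum(weights)
    return [w / total for w in weights]
```

The exponential form is one convenient choice; any family of distributions that shifts probability mass toward earlier time steps as the drop grows would serve the same purpose.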

FIG. 8 is a diagram showing an example of data input and output in the learning device 100.
In the example of FIG. 8, the action determination unit 191 determines the action of the control target 910 based on the policy, and outputs the determined action to the simulation unit 192.
The simulation unit 192 simulates the action determined by the action determination unit 191 to calculate a state (the next state), and outputs the calculated state to the policy generation unit 193.

The policy generation unit 193 calculates a reward value based on the state acquired from the simulation unit 192, and updates the policy based on the calculated reward value. Depending on the reward value, the policy generation unit 193 may leave the policy unchanged.
The policy generation unit 193 also outputs the calculated reward value to the reset determination unit 194 and the restart state determination unit 195.

The reset determination unit 194 determines, based on the reward value, whether to suspend the simulation by the simulation unit 192. If the reset determination unit 194 decides to suspend the simulation, the restart state determination unit 195 determines the reset destination based on the reward value. The reset determination unit 194 and the restart state determination unit 195 then output an instruction to suspend the simulation and the reset destination to the simulation unit 192 and the policy generation unit 193.

Here, the reward values used by the policy generation unit 193, the reset determination unit 194, and the restart state determination unit 195 are not limited to any particular kind. Each may be an instantaneous reward value, a cumulative reward value, a value function value, another reward value, or a combination of these.

The reward values used by the policy generation unit 193, the reset determination unit 194, and the restart state determination unit 195 may also be the same as or different from one another. For example, the policy generation unit 193 may output an instantaneous reward value to the reset determination unit 194, and the reset determination unit 194 may calculate a cumulative reward value based on the instantaneous reward value.

When it receives a suspension instruction, the simulation unit 192 suspends the simulation and returns the state of the simulation to the designated reset destination.
When it receives a suspension instruction, the policy generation unit 193 likewise returns the policy, the reward value, or both to their values at the designated reset destination. Alternatively, the policy generation unit 193 may leave both the policy and the reward value unchanged when the simulation is suspended; in that case, the reset determination unit 194 and the restart state determination unit 195 need not output the suspension instruction and the reset destination to the policy generation unit 193.
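The rollback described above presupposes that a snapshot is kept for each executed time step. A minimal sketch of such bookkeeping, with hypothetical names and deep copies as the (illustrative) snapshot mechanism, might look like this:

```python
import copy

class EpisodeHistory:
    """Stores a snapshot per executed time step so the simulation state
    (and the associated cumulative reward value) can be returned to a
    chosen reset destination."""

    def __init__(self):
        self._snapshots = []

    def record(self, state, cumulative_reward):
        """Snapshot the state and bookkeeping for one time step."""
        self._snapshots.append((copy.deepcopy(state), cumulative_reward))

    def reset_to(self, step):
        """Return the snapshot at `step` and discard later ones."""
        state, cumulative_reward = self._snapshots[step]
        del self._snapshots[step + 1:]
        return copy.deepcopy(state), cumulative_reward
```

In practice the simulation unit might instead restore state through a simulator-specific checkpoint API; deep copies are used here only to keep the sketch self-contained.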

FIG. 9 is a diagram showing an example of the procedure of processing performed by the learning device 100.
In the processing shown in FIG. 9, the simulation unit 192 sets up the simulation (step S101). At the start of an episode, the simulation unit 192 sets the state specified for that episode.
Next, the processing unit 190 executes one step of reinforcement learning (step S102).
The reset determination unit 194 then determines whether the suspension condition is met (step S103).

If the reset determination unit 194 determines that the suspension condition is met (step S103: YES), the restart state determination unit 195 determines the reset destination (step S111).
After step S111, the processing returns to step S101. In this case, in step S101, the simulation unit 192 sets the state in the simulation to the state of the reset destination.

If, on the other hand, the reset determination unit 194 determines in step S103 that the suspension condition is not met (step S103: NO), the processing unit 190 determines whether the episode end condition is met (step S121).
If the processing unit 190 determines that the episode end condition is not met (step S121: NO), the processing returns to step S102.

If the processing unit 190 determines in step S121 that the episode end condition is met (step S121: YES), it determines whether the end condition of the reinforcement learning is met (step S131).
If the processing unit 190 determines that the end condition of the reinforcement learning is not met (step S131: NO), it selects the next episode (step S141).
After step S141, the processing returns to step S101. In this case, the simulation unit 192 sets the state specified for the episode selected by the processing unit 190.
If, on the other hand, the processing unit 190 determines in step S131 that the end condition of the reinforcement learning is met (step S131: YES), the learning device 100 ends the processing of FIG. 9.
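The control flow of FIG. 9 can be sketched as a pair of nested loops. The callback functions below are placeholders standing in for the corresponding units of the learning device; they are assumptions for illustration, not part of the disclosure.

```python
def run_training(setup, learn_step, interrupted, episode_done,
                 training_done, choose_reset, next_episode):
    """Loop structure of FIG. 9 (steps S101, S102, S103, S111,
    S121, S131, S141 of the learning device 100)."""
    episode = next_episode(None)               # select the first episode
    reset_target = None
    while True:
        setup(episode, reset_target)           # S101: set up simulation
        reset_target = None
        while True:
            learn_step()                       # S102: one RL step
            if interrupted():                  # S103: suspension condition
                reset_target = choose_reset()  # S111: decide reset target
                break                          # back to S101
            if episode_done():                 # S121: episode end
                if training_done():            # S131: training end
                    return
                episode = next_episode(episode)  # S141: next episode
                break                          # back to S101
```

The helper callbacks would be implemented by the simulation unit 192, the processing unit 190, the reset determination unit 194, and the restart state determination unit 195 respectively.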

As described above, the policy generation unit 193 generates a policy, a decision rule for actions, based on a reward value, a value indicating an evaluation of an action of the control target 910, in a state reached by tracing back time steps within an episode representing a learning period. The action determination unit 191 determines the action of the control target based on the policy.

According to the learning device 100, tracing back time steps within an episode is expected to make reinforcement learning more efficient.
For example, the learning device 100 may be able to go back past the time step at which a factor causing the evaluation indicated by the reward value to deteriorate occurred, and thereby avoid that factor. The learning device 100 is thus expected to generate a policy that improves the evaluation over the entire episode more quickly.

The reset determination unit 194 suspends the simulation of the action of the control target 910 when it determines that a predetermined suspension condition is met. The policy generation unit 193 generates the policy based on a reward value calculated from the simulation of the action of the control target 910.
According to the learning device 100, reinforcement learning can be performed more efficiently than when episode execution is continued even after falling into a state in which no policy update can be expected, or in which policy updating is expected to progress only slowly.

The simulation unit 192 simulates the action of the control target 910 and calculates the state of the environment in which the control target acts.
According to the learning device 100, tracing back time steps in the simulation is expected to make reinforcement learning more efficient.

The reset determination unit 194 determines whether to suspend the simulation based on a change in the cumulative reward value, the cumulative value of the instantaneous reward values.
This allows the reset determination unit 194 to decide whether to suspend the simulation based not only on the state at a single time step but also on how the state changes. In this respect, the learning device 100 is expected to decide appropriately whether to suspend the simulation.

When the reset determination unit 194 determines, based on the change in the cumulative reward value, that the evaluation indicated by the cumulative reward value has deteriorated beyond a predetermined threshold value, it decides to suspend the simulation.
According to the learning device 100, whether to suspend the simulation can be decided by the simple process of comparing the cumulative reward value with a threshold value.
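Read as a drop of the cumulative reward value from its best value so far, one possible interpretation of the comparison above, the check can be sketched as:

```python
def should_suspend_by_threshold(cumulative_rewards, threshold):
    """Decide suspension by comparing the deterioration of the
    cumulative reward value (its drop from the maximum reached so far
    in the episode) with a predetermined threshold value."""
    drop = max(cumulative_rewards) - cumulative_rewards[-1]
    return drop > threshold
```

The function name and the drop-from-peak reading are illustrative assumptions; the disclosure only requires that the deterioration indicated by the change in the cumulative reward value be compared with a threshold.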

The reset determination unit 194 also updates the threshold value so that the extent of deterioration in evaluation indicated by the threshold value of the cumulative reward value increases as the simulation progresses.
As described above, in the early stages of an episode or of reinforcement learning, the reward value often does not become large because learning has not yet progressed. In such cases, by setting the threshold to a relatively small value and thereby hastening the suspension of episode execution, the reset determination unit 194 causes the action determination unit 191 to try a variety of actions, and an action (or a sequence of actions) that increases the reward value is expected to be found relatively early. In this respect, the learning device 100 is expected to perform reinforcement learning efficiently.

The reset determination unit 194 also decides to suspend the simulation when, based on the change in the cumulative reward value, the deterioration of the evaluation indicated by the cumulative reward value has continued for a predetermined number of time steps or more.
According to the learning device 100, whether to suspend the simulation can be decided by the simple process of counting the number of steps over which the cumulative reward value has been continuously decreasing.
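The step-counting check can be sketched as follows; only the run of decreasing steps ending at the latest time step matters for the suspension decision. The function name is an illustrative assumption.

```python
def should_suspend_by_streak(cumulative_rewards, max_decreasing_steps):
    """Decide suspension when the cumulative reward value has decreased
    for max_decreasing_steps or more consecutive time steps up to the
    latest time step."""
    streak = 0
    for prev, cur in zip(cumulative_rewards, cumulative_rewards[1:]):
        # extend the run on a decrease, otherwise restart the count
        streak = streak + 1 if cur < prev else 0
    return streak >= max_decreasing_steps
```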

The reset determination unit 194 may also determine whether to suspend the simulation based on a value function value, a predicted value of the cumulative reward value that is the cumulative value of the instantaneous reward values.
This allows the reset determination unit 194 to decide whether to suspend the simulation based not only on the state at a single time step but also on how the state changes. In this respect, the learning device 100 is expected to decide appropriately whether to suspend the simulation.

When the reset determination unit 194 determines, based on the change in the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold value, it decides to suspend the simulation.
According to the learning device 100, whether to suspend the simulation can be decided by the simple process of comparing the value function value with a threshold value.

The reset determination unit 194 also updates the threshold value so that the extent of deterioration in evaluation indicated by the threshold value of the value function value increases as the simulation progresses.
As described above, in the early stages of an episode or of reinforcement learning, the reward value often does not become large because learning has not yet progressed. In such cases, by setting the threshold to a relatively small value and thereby hastening the suspension of episode execution, the reset determination unit 194 causes the action determination unit 191 to try a variety of actions, and an action (or a sequence of actions) that increases the reward value is expected to be found relatively early.
In this respect, the learning device 100 is expected to perform reinforcement learning efficiently.

The reset determination unit 194 also decides to suspend the simulation when, based on the change in the value function value, the deterioration of the evaluation indicated by the value function value has continued for a predetermined number of time steps or more.
According to the learning device 100, whether to suspend the simulation can be decided by the simple process of counting the number of steps over which the value function value has been continuously decreasing.

The reset determination unit 194 also decides to suspend the simulation when it determines that the current state of the environment in which the control target 910 acts, as calculated in the simulation, is similar, to at least a certain degree, to one or more states preset as states requiring suspension.
A designer of the learning device 100 need only prepare samples of states in which the simulation should be suspended, and need not design rules for determining whether suspension is required. In this respect, the learning device 100 places a small burden on the designer.
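A minimal sketch of such a similarity test follows, assuming states are numeric vectors and using Euclidean distance against a threshold as the (illustrative) similarity criterion; the disclosure does not fix a particular similarity measure.

```python
import math

def needs_suspension(state, suspension_samples, distance_threshold):
    """Suspend when the current state is within distance_threshold of
    any state registered in advance as requiring suspension."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return any(distance(state, s) <= distance_threshold
               for s in suspension_samples)
```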

The combination of the reset determination unit 194 and the display unit 120 presents states to the user, and the combination of the reset determination unit 194 and the operation input unit 130 accepts a user operation designating a state that requires suspension.
According to the learning device 100, samples of states requiring suspension of the simulation can be acquired through the user's designation. In this respect too, the burden on the designer of the learning device 100 is small. Furthermore, the user can have the user's own judgment as to whether the simulation needs to be suspended reflected in the learning device 100.

The restart state determination unit 195 determines the state in which the simulation is restarted.
As described above, policy updating does not necessarily progress at the beginning of an episode. Because the restart state determination unit 195 can select not only the beginning of an episode but also an intermediate time step as the point to return to, the learning device 100 is expected to perform reinforcement learning more efficiently.

The restart state determination unit 195 also selects one reset destination candidate based on a probability distribution set over the reset destination candidates, which are executed time steps whose states are stored as time steps from which the simulation can be restarted within the episode that was being executed until the simulation was suspended.

As described above, when an event causing the poor evaluation indicated by the reward value has occurred, returning to a time step before that event suggests going back to as early a time step in the episode as possible (a time step close to the beginning of the episode), while reducing repeated learning at the beginning of the episode suggests returning to as late a time step as possible among the executed time steps. There is thus a trade-off between making reinforcement learning efficient by avoiding the state in which that event has occurred and making it efficient by reducing repeated learning at the beginning of the episode.

By having the restart state determination unit 195 probabilistically select one of the multiple reset destination candidates, restarting the episode from near its beginning every time its execution is interrupted can be avoided. Furthermore, if the event causing the poor evaluation indicated by the reward value has already occurred at the selected reset destination candidate, interrupting the episode one or more additional times is expected to allow a candidate preceding that event to be selected.

The restart state determination unit 195 also selects one of the multiple reset destination candidates according to a uniform probability distribution.
According to the learning device 100, a reset destination candidate can thus be selected even for a control target 910 for which there is no known guide to how many time steps must be traced back to reach a state preceding the event that causes the poor evaluation indicated by the reward value.

The restart state determination unit 195 also sets a probability distribution over the multiple reset destination candidates such that the greater the deterioration in evaluation at the time the simulation is interrupted, as indicated by the change in the cumulative reward value, the more likely a candidate with a larger number of time steps between it and the interruption is to be selected, and selects one candidate according to that distribution.

As described above, a situation in which the cumulative reward value drops sharply is one that is particularly desirable to avoid, and to raise the likelihood of avoiding it, it is conceivable to trace back a large number of time steps from the interruption of the simulation.
By contrast, a situation in which the cumulative reward value decreases only gradually is less important to avoid; in that case, to reduce repeated learning at the beginning of the episode, it is conceivable to trace back relatively few time steps from the interruption.
By having the restart state determination unit 195 vary which reset destination candidates are likely to be selected according to the magnitude of the change in the cumulative reward value, the learning device 100 is expected to perform reinforcement learning efficiently.

The restart state determination unit 195 also selects, from among the reset destination candidates, the candidate corresponding to the time step at which the change in evaluation indicated by the cumulative reward value turned from improvement to deterioration and which has the fewest time steps between it and the interruption of the simulation, or a candidate corresponding to an earlier time step.

 シミュレーションの中断時に直近で評価が極大に良い評価となっている時間ステップよりも後の時間ステップでは、累積報酬値が継続的に減少しており、報酬値が示す評価が悪い評価となる要因の事象が発生していることが考えられる。再開状態決定部195が、シミュレーションの中断時に直近で評価が極大に良い評価となっている時間ステップまたはそれよりも前の時間ステップに相当するリセット先候補を選択することで、報酬値が示す評価が悪い評価となる要因の事象が発生する前の時間ステップを選択する可能性が高まることが期待される。 In the time steps after the most recent time step, as of the interruption of the simulation, at which the evaluation reached a local maximum, the cumulative reward value decreases continuously, so it is likely that an event has occurred that causes the evaluation indicated by the reward value to deteriorate. By having the restart state determination unit 195 select a reset destination candidate corresponding to that most recent locally maximal time step, or to an earlier time step, the possibility of selecting a time step before the occurrence of such an event is expected to increase.
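This selection rule can be sketched minimally, assuming the cumulative reward value is stored per time step; the function names below are illustrative, not taken from the disclosure:

```python
def latest_turning_point(cumulative_rewards):
    """Index of the time step closest to the interruption (the end of the
    list) at which the cumulative reward turned from increasing to
    decreasing, or None if no such turning point exists."""
    for t in range(len(cumulative_rewards) - 2, 0, -1):
        if cumulative_rewards[t - 1] < cumulative_rewards[t] > cumulative_rewards[t + 1]:
            return t
    return None

def select_reset_candidate(candidate_steps, cumulative_rewards):
    """Pick the stored candidate at, or the latest one before, that
    turning point; fall back to the earliest stored state otherwise."""
    peak = latest_turning_point(cumulative_rewards)
    if peak is None:
        return min(candidate_steps)
    eligible = [s for s in candidate_steps if s <= peak]
    return max(eligible) if eligible else min(candidate_steps)
```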

 図10は、本開示のいくつかの実施形態に係る学習装置の構成の、もう1つの例を示す図である。図10に示す構成で、学習装置610は、方策生成部611と、行動決定部612とを備える。
 かかる構成で、方策生成部611は、学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、行動の決定規則である方策を生成する。行動決定部612は、制御対象の行動を方策に基づいて決定する。
 方策生成部611は、方策生成手段の例に該当する。行動決定部612は、行動決定手段の例に該当する。
FIG. 10 is a diagram showing another example of the configuration of a learning device according to some embodiments of the present disclosure. In the configuration shown in FIG. 10, a learning device 610 includes a policy generator 611 and an action determiner 612.
In this configuration, the policy generating unit 611 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object in a state going back a time step in an episode representing a learning period. The behavior deciding unit 612 decides the behavior of the controlled object based on the policy.
The policy generating unit 611 corresponds to an example of a policy generating means, and the action deciding unit 612 corresponds to an example of an action deciding means.

 学習装置610によれば、エピソード内で時間ステップを遡ることで、より効率的に強化学習を行えることが期待される。
 例えば、学習装置610によれば、報酬値が示す評価が悪くなる要因が発生した時間ステップよりも過去に遡って、報酬値が示す評価が悪くなる要因の発生を回避できる可能性がある。これにより、学習装置610では、エピソード全体での評価が良くなるような方策を、より早く生成できることが期待される。
It is expected that the learning device 610 can perform reinforcement learning more efficiently by going back in time steps within an episode.
For example, the learning device 610 may be able to avoid the occurrence of a factor that causes the evaluation indicated by the reward value to deteriorate, by going back to a time step earlier than the one in which that factor occurred. This is expected to enable the learning device 610 to generate, more quickly, a policy that improves the evaluation over the entire episode.

 方策生成部611は、例えば、図1の方策生成部193等の機能を用いて実現することができる。行動決定部612は、例えば、図1の行動決定部191等の機能を用いて実現することができる。 The policy generation unit 611 can be realized, for example, by using the functions of the policy generation unit 193 in FIG. 1, etc. The action decision unit 612 can be realized, for example, by using the functions of the action decision unit 191 in FIG. 1, etc.
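The disclosure leaves the concrete learning algorithm open. Purely as an illustrative sketch, and not the claimed implementation, the roles of the policy generating means and the action deciding means can be shown with tabular Q-learning, where the learned action-value table serves as the policy and actions are chosen epsilon-greedily from it; all class names and hyperparameters here are assumptions:

```python
import random
from collections import defaultdict

class LearningDevice:
    """Illustrative sketch: the Q-table plays the role of the policy
    produced by the policy generating means, and epsilon-greedy
    selection plays the role of the action deciding means."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.actions = actions
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.epsilon = epsilon        # exploration rate

    def decide_action(self, state):
        # Action deciding means: follow the current policy, with exploration.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update_policy(self, state, action, reward, next_state):
        # Policy generating means: refine the decision rule from the reward value.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
```

Any value-based or policy-gradient method could fill these two roles equally well; the split into "generate the decision rule" and "decide the action from it" is what matters here.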

 図11は、本開示のいくつかの実施形態に係る制御システムの構成の、もう1つの例を示す図である。
 図11に示す構成で、制御システム620は、方策生成部622と、行動決定部623とを備える。
FIG. 11 is a diagram illustrating another example of a control system configuration according to some embodiments of the present disclosure.
In the configuration shown in FIG. 11, the control system 620 includes a policy generator 622 and an action determiner 623.

 かかる構成で、方策生成部622は、学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、行動の決定規則である方策を生成する。行動決定部623は、制御対象の行動を方策に基づいて決定する。
 方策生成部622は、方策生成手段の例に該当する。行動決定部623は、行動決定手段の例に該当する。
In this configuration, the policy generating unit 622 generates a policy, which is a decision rule for behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object in a state going back a time step in an episode representing a learning period. The behavior deciding unit 623 decides the behavior of the controlled object based on the policy.
The policy generating unit 622 corresponds to an example of a policy generating means, and the action deciding unit 623 corresponds to an example of an action deciding means.

 制御システム620によれば、エピソード内で時間ステップを遡ることで、より効率的に強化学習を行えることが期待される。
 例えば、制御システム620によれば、報酬値が示す評価が悪くなる要因が発生した時間ステップよりも過去に遡って、報酬値が示す評価が悪くなる要因の発生を回避できる可能性がある。これにより、制御システム620では、エピソード全体での評価が良くなるような方策を、より早く生成できることが期待される。
It is expected that the control system 620 can perform reinforcement learning more efficiently by going back in time steps within an episode.
For example, the control system 620 may be able to avoid the occurrence of a factor that causes the evaluation indicated by the reward value to deteriorate, by going back to a time step earlier than the one in which that factor occurred. This is expected to enable the control system 620 to generate, more quickly, a policy that improves the evaluation over the entire episode.

 方策生成部622は、例えば、図1の方策生成部193等の機能を用いて実現することができる。行動決定部623は、例えば、図1の行動決定部191等の機能を用いて実現することができる。 The policy generation unit 622 can be realized, for example, by using the functions of the policy generation unit 193 in FIG. 1, etc. The action decision unit 623 can be realized, for example, by using the functions of the action decision unit 191 in FIG. 1, etc.

 図12は、本開示のいくつかの実施形態に係る入出力装置の構成の例を示す図である。図12に示す構成で、入出力装置630は、状態提示部631と、状態指定受付部632とを備える。
 かかる構成で、状態提示部631は、制御対象が行動する環境の状態をユーザに提示する。状態指定受付部632は、環境を模擬するシミュレータの中断が必要な状態を指定するユーザ操作を受け付ける。
 状態提示部631は、状態提示手段の例に該当する。状態指定受付部632は、状態指定受付手段の例に該当する。
FIG. 12 is a diagram illustrating an example of a configuration of an input/output device according to some embodiments of the present disclosure. In the configuration illustrated in FIG. 12, an input/output device 630 includes a state presenting unit 631 and a state designation receiving unit 632.
With this configuration, the state presenting unit 631 presents to the user the state of the environment in which the controlled object acts. The state designation receiving unit 632 receives a user operation that designates a state in which the simulator that simulates the environment needs to be interrupted.
The state presenting unit 631 corresponds to an example of a state presenting means, and the state designation receiving unit 632 corresponds to an example of a state designation receiving means.

 入出力装置630によれば、ユーザの指定を受けて、シミュレーションの中断が必要な状態のサンプルを取得することができる。入出力装置630によれば、この点で、入出力装置630の設計者の負担が小さい。また、ユーザは、シミュレーションの中断が必要か否かについてのユーザ自らの判断を、シミュレーションの実行に反映させることができる。
 状態提示部631は、例えば、図1のリセット決定部194および表示部120等の機能を用いて実現することができる。状態指定受付部632は、例えば、図1のリセット決定部194および操作入力部130等の機能を用いて実現することができる。
According to the input/output device 630, a sample of a state in which the simulation needs to be interrupted can be acquired in response to a user's designation. In this respect, the burden on the designer of the input/output device 630 is small. Furthermore, the user can reflect his or her own judgment as to whether the simulation needs to be interrupted in the execution of the simulation.
The state presenting unit 631 can be realized, for example, by using the functions of the reset determining unit 194 and the display unit 120 in Fig. 1. The state designation receiving unit 632 can be realized, for example, by using the functions of the reset determining unit 194 and the operation input unit 130 in Fig. 1.
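How the user-designated states are compared with the current state is not prescribed here. One conceivable check, in the spirit of appendix 12 and assuming states are numeric feature vectors with cosine similarity as an (illustrative) similarity measure, is:

```python
import math

def should_interrupt(current_state, designated_states, threshold=0.9):
    """True when the current environment state is at least `threshold`-
    similar to any state the user designated as requiring interruption.
    The threshold value and similarity measure are assumptions."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
    return any(cosine(current_state, s) >= threshold for s in designated_states)
```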

 図13は、本開示のいくつかの実施形態に係る学習方法における処理の手順の例を示す図である。図13に示す学習方法は、方策を生成すること(ステップS611)と、行動を決定すること(ステップS612)とを含む。 FIG. 13 is a diagram showing an example of a processing procedure in a learning method according to some embodiments of the present disclosure. The learning method shown in FIG. 13 includes generating a strategy (step S611) and determining an action (step S612).

 方策を生成すること(ステップS611)では、コンピュータが、学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、前記行動の決定規則である方策を生成する。
 行動を決定すること(ステップS612)では、コンピュータが、制御対象の行動を方策に基づいて決定する。
In generating a policy (step S611), the computer generates a policy, which is a decision rule for the behavior, based on a reward value, which is a value indicating an evaluation of the behavior of the controlled object at a time step back within the episode representing the learning period.
In determining an action (step S612), the computer determines an action of the controlled object based on a strategy.

 図13に示す学習方法によれば、エピソード内で時間ステップを遡ることで、より効率的に強化学習を行えることが期待される。
 例えば、図13に示す学習方法によれば、報酬値が示す評価が悪くなる要因が発生した時間ステップよりも過去に遡って、報酬値が示す評価が悪くなる要因の発生を回避できる可能性がある。これにより、図13に示す学習方法では、エピソード全体での評価が良くなるような方策を、より早く生成できることが期待される。
According to the learning method shown in FIG. 13, it is expected that reinforcement learning can be performed more efficiently by going back a time step within an episode.
For example, according to the learning method shown in Fig. 13, it may be possible to avoid the occurrence of a factor that causes the evaluation indicated by the reward value to deteriorate, by going back to a time step earlier than the one in which that factor occurred. As a result, the learning method shown in Fig. 13 is expected to generate, more quickly, a policy that improves the evaluation over the entire episode.

 図14は、本開示のいくつかの実施形態に係るコンピュータの構成の例を示す図である。
 図14に示す構成で、コンピュータ700は、CPU710と、主記憶装置720と、補助記憶装置730と、インタフェース740と、不揮発性記録媒体750とを備える。
FIG. 14 is a diagram illustrating an example of a computer configuration according to some embodiments of the present disclosure.
In the configuration shown in FIG. 14, a computer 700 includes a CPU 710 , a main memory device 720 , an auxiliary memory device 730 , an interface 740 , and a non-volatile recording medium 750 .

 上記の学習装置100、制御装置200、学習装置610、学習装置621、制御装置626、および、入出力装置630のうち何れか1つ以上またはその一部が、コンピュータ700に実装されてもよい。その場合、上述した各処理部の動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。また、CPU710は、プログラムに従って、上述した各記憶部に対応する記憶領域を主記憶装置720に確保する。各装置と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って通信を行うことで実行される。また、インタフェース740は、不揮発性記録媒体750用のポートを有し、不揮発性記録媒体750からの情報の読出、および、不揮発性記録媒体750への情報の書込を行う。 Any one or more of the learning device 100, the control device 200, the learning device 610, the learning device 621, the control device 626, and the input/output device 630, or a part of them, may be implemented in the computer 700. In this case, the operation of each of the above-mentioned processing units is stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program. The CPU 710 also secures a storage area corresponding to each of the above-mentioned storage units in the main storage device 720 according to the program. Communication between each device and other devices is executed by the interface 740 having a communication function and communicating according to the control of the CPU 710. The interface 740 also has a port for the non-volatile recording medium 750, and reads information from the non-volatile recording medium 750 and writes information to the non-volatile recording medium 750.

 学習装置100がコンピュータ700に実装される場合、処理部190およびその各部の動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the learning device 100 is implemented in a computer 700, the operations of the processing unit 190 and each of its units are stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、記憶部180のための記憶領域を主記憶装置720に確保する。通信部110による他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。表示部120による画像の表示は、インタフェース740が表示装置を備え、CPU710の制御に従って各種画像を表示することで実行される。操作入力部130によるユーザ操作の受け付けは、インタフェース740が入力デバイスを備え、CPU710の制御に従ってユーザ操作を受け付けることで実行される。 The CPU 710 also reserves a memory area for the memory unit 180 in the main memory device 720 in accordance with the program. Communication with other devices by the communication unit 110 is achieved by the interface 740 having a communication function and operating under the control of the CPU 710. Display of images by the display unit 120 is achieved by the interface 740 having a display device and displaying various images under the control of the CPU 710. Reception of user operations by the operation input unit 130 is achieved by the interface 740 having an input device and accepting user operations under the control of the CPU 710.

 制御装置200がコンピュータ700に実装される場合、その動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the control device 200 is implemented in the computer 700, its operation is stored in the form of a program in the auxiliary storage device 730. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、制御装置200が処理を行うための記憶領域を主記憶装置720に確保する。制御装置200と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。制御装置200とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a memory area in the main memory device 720 for the control device 200 to perform processing according to the program. Communication between the control device 200 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the control device 200 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 学習装置610がコンピュータ700に実装される場合、方策生成部611と、行動決定部612との動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the learning device 610 is implemented in the computer 700, the operations of the policy generation unit 611 and the action decision unit 612 are stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、学習装置610が処理を行うための記憶領域を主記憶装置720に確保する。学習装置610と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。学習装置610とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a memory area in the main memory device 720 for the learning device 610 to perform processing according to the program. Communication between the learning device 610 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the learning device 610 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 学習装置621がコンピュータ700に実装される場合、方策生成部622と、行動決定部623との動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the learning device 621 is implemented in the computer 700, the operations of the policy generation unit 622 and the action decision unit 623 are stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、学習装置621が処理を行うための記憶領域を主記憶装置720に確保する。学習装置621と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。学習装置621とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a memory area in the main memory device 720 for the learning device 621 to perform processing according to the program. Communication between the learning device 621 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the learning device 621 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 制御装置626がコンピュータ700に実装される場合、その動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the control device 626 is implemented in the computer 700, its operation is stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、制御装置626が処理を行うための記憶領域を主記憶装置720に確保する。制御装置626と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。制御装置626とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a memory area in the main memory 720 for the control device 626 to perform processing according to the program. Communication between the control device 626 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the control device 626 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 入出力装置630がコンピュータ700に実装される場合、状態提示部631と、状態指定受付部632との動作は、プログラムの形式で補助記憶装置730に記憶されている。CPU710は、プログラムを補助記憶装置730から読み出して主記憶装置720に展開し、当該プログラムに従って上記処理を実行する。 When the input/output device 630 is implemented in the computer 700, the operations of the state presentation unit 631 and the state designation reception unit 632 are stored in the auxiliary storage device 730 in the form of a program. The CPU 710 reads the program from the auxiliary storage device 730, expands it in the main storage device 720, and executes the above-mentioned processing according to the program.

 また、CPU710は、プログラムに従って、入出力装置630が処理を行うための記憶領域を主記憶装置720に確保する。入出力装置630と他の装置との通信は、インタフェース740が通信機能を有し、CPU710の制御に従って動作することで実行される。入出力装置630とユーザとのインタラクションは、インタフェース740が入力デバイスおよび出力デバイスを有し、CPU710の制御に従って出力デバイスにて情報をユーザに提示し、入力デバイスにてユーザ操作を受け付けることで実行される。 The CPU 710 also allocates a storage area in the main memory device 720 for the I/O device 630 to perform processing according to the program. Communication between the I/O device 630 and other devices is performed by the interface 740, which has a communication function and operates according to the control of the CPU 710. Interaction between the I/O device 630 and a user is performed by the interface 740, which has an input device and an output device, presenting information to the user via the output device according to the control of the CPU 710, and accepting user operations via the input device.

 上述したプログラムのうち何れか1つ以上が不揮発性記録媒体750に記録されていてもよい。この場合、インタフェース740が不揮発性記録媒体750からプログラムを読み出すようにしてもよい。そして、CPU710が、インタフェース740が読み出したプログラムを直接実行するか、あるいは、主記憶装置720または補助記憶装置730に一旦保存して実行するようにしてもよい。 Any one or more of the above-mentioned programs may be recorded on the non-volatile recording medium 750. In this case, the interface 740 may read the program from the non-volatile recording medium 750. The CPU 710 may then directly execute the program read by the interface 740, or may temporarily store the program in the main memory device 720 or the auxiliary memory device 730 and then execute it.

 なお、学習装置100、制御装置200、学習装置610、学習装置621、制御装置626、および、入出力装置630が行う処理の全部または一部を実行するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することにより各部の処理を行ってもよい。なお、ここでいう「コンピュータシステム」とは、OS(Operating System)や周辺機器等のハードウェアを含むものとする。
 また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ROM(Read Only Memory)、CD-ROM(Compact Disc Read Only Memory)等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。また上記プログラムは、前述した機能の一部を実現するためのものであってもよく、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであってもよい。
In addition, a program for executing all or part of the processing performed by learning device 100, control device 200, learning device 610, learning device 621, control device 626, and input/output device 630 may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed to perform processing of each part. Note that the term "computer system" here includes hardware such as an OS (Operating System) and peripheral devices.
Furthermore, the term "computer-readable recording medium" refers to portable media such as flexible disks, optical magnetic disks, ROMs (Read Only Memory), and CD-ROMs (Compact Disc Read Only Memory), as well as storage devices such as hard disks built into computer systems. The above-mentioned program may be for realizing part of the above-mentioned functions, or may be capable of realizing the above-mentioned functions in combination with a program already recorded in the computer system.

 以上、この開示の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この開示の要旨を逸脱しない範囲の設計等も含まれる。  Although the embodiments of this disclosure have been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment, and includes designs that do not deviate from the gist of this disclosure.

 上記の実施形態の一部又は全部は、以下の付記のようにも記載されうるが、以下には限られない。 Some or all of the above embodiments can be described as follows, but are not limited to the following:

(付記1)
 学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、前記行動の決定規則である方策を生成する方策生成手段と、
 前記制御対象の行動を前記方策に基づいて決定する行動決定手段と、
 を備える学習装置。
(Appendix 1)
a policy generating means for generating a policy, which is a decision rule for the action, based on a reward value, which is a value indicating an evaluation of the action of the controlled object in a state going back a time step in an episode representing a learning period;
an action decision means for deciding an action of the control target based on the measure;
A learning device comprising:

(付記2)
 前記制御対象の行動のシミュレーションを中断する条件として予め定められている条件が成立していると判定した場合、前記シミュレーションを中断するリセット決定手段
 をさらに備え、
 前記方策生成手段は、前記シミュレーションに基づいて計算される前記報酬値に基づいて、前記方策を生成する、
 付記1に記載の学習装置。
(Appendix 2)
a reset decision means for suspending the simulation when it is determined that a predetermined condition for suspending the simulation of the behavior of the controlled object is satisfied,
The policy generating means generates the policy based on the reward value calculated based on the simulation.
The learning device according to appendix 1.

(付記3)
 前記シミュレーションを行って、前記制御対象が行動する環境の状態を計算するシミュレーション手段
 をさらに備える、付記2に記載の学習装置。
(Appendix 3)
The learning device according to appendix 2, further comprising a simulation means for performing the simulation to calculate a state of an environment in which the controlled object acts.

(付記4)
 前記リセット決定手段は、瞬時報酬値の累積値である累積報酬値の変化に基づいて、前記シミュレーションを中断するか否かを決定する、
 付記2または付記3に記載の学習装置。
(Appendix 4)
The reset decision means decides whether or not to interrupt the simulation based on a change in a cumulative reward value, which is a cumulative value of an instantaneous reward value.
The learning device according to appendix 2 or 3.

(付記5)
 前記リセット決定手段は、前記累積報酬値の増減に基づいて、前記累積報酬値が示す評価が所定の閾値よりも悪化したと判定した場合、前記シミュレーションを中断することに決定する、
 付記4に記載の学習装置。
(Appendix 5)
The reset decision means decides to interrupt the simulation when it is determined that the evaluation indicated by the cumulative reward value has deteriorated below a predetermined threshold based on the increase or decrease of the cumulative reward value.
The learning device according to appendix 4.

(付記6)
 前記リセット決定手段は、前記累積報酬値の閾値が示す評価の悪化の幅が、前記シミュレーションの進行に応じて大きくなるように前記閾値を更新する、
 付記5に記載の学習装置。
(Appendix 6)
The reset determination means updates the threshold value so that a range of deterioration in the evaluation indicated by the threshold value of the cumulative reward value increases as the simulation progresses.
The learning device according to appendix 5.

(付記7)
 前記リセット決定手段は、前記累積報酬値の増減に基づいて、前記累積報酬値が示す評価の悪化が所定の時間ステップ数以上継続した場合、前記シミュレーションを中断することに決定する、
 付記4から6の何れか一つに記載の学習装置。
(Appendix 7)
The reset decision means decides to interrupt the simulation when a deterioration in the evaluation indicated by the cumulative reward value continues for a predetermined number of time steps or more based on an increase or decrease in the cumulative reward value.
The learning device according to any one of appendices 4 to 6.

(付記8)
 前記リセット決定手段は、瞬時報酬値の累積値である累積報酬値の予測値である価値関数値に基づいて、前記シミュレーションを中断するか否かを決定する、
 付記2から7の何れか一つに記載の学習装置。
(Appendix 8)
The reset determination means determines whether or not to interrupt the simulation based on a value function value that is a predicted value of a cumulative reward value that is a cumulative value of an instantaneous reward value.
The learning device according to any one of appendices 2 to 7.

(付記9)
 前記リセット決定手段は、前記価値関数値の増減に基づいて、前記価値関数値が示す評価が所定の閾値よりも悪化したと判定した場合、前記シミュレーションを中断することに決定する、
 付記8に記載の学習装置。
(Appendix 9)
The reset decision means decides to interrupt the simulation when it is determined that the evaluation indicated by the value function value has deteriorated below a predetermined threshold based on an increase or decrease in the value function value.
The learning device according to appendix 8.

(付記10)
 前記リセット決定手段は、前記価値関数値の閾値が示す評価の悪化の幅が、前記シミュレーションの進行に応じて大きくなるように前記閾値を更新する、
 付記9に記載の学習装置。
(Appendix 10)
the reset determination means updates the threshold value so that a range of deterioration in the evaluation indicated by the threshold value of the value function value increases as the simulation progresses.
The learning device according to appendix 9.

(付記11)
 前記リセット決定手段は、前記価値関数値の増減に基づいて、前記価値関数値が示す評価の悪化が所定の時間ステップ数以上継続した場合、前記シミュレーションを中断することに決定する、
 付記9または付記10に記載の学習装置。
(Appendix 11)
The reset decision means decides to interrupt the simulation when a deterioration in the evaluation indicated by the value of the value function continues for a predetermined number of time steps or more based on an increase or decrease in the value of the value function.
The learning device according to appendix 9 or 10.

(付記12)
 前記リセット決定手段は、前記シミュレーションで計算される、前記制御対象が行動する環境の現在の状態が、中断が必要な状態として予め設定されている1つ以上の状態の何れかと一定以上類似していると判定した場合、前記シミュレーションを中断することに決定する、
 付記2から11の何れか一つに記載の学習装置。
(Appendix 12)
the reset decision means decides to interrupt the simulation when it is determined that a current state of the environment in which the control target acts, calculated in the simulation, is similar, to a certain degree or more, to any of one or more states preset as states requiring interruption;
The learning device according to any one of appendices 2 to 11.

(付記13)
 前記環境の状態をユーザに提示する状態提示手段と、
 前記中断が必要な状態を指定するユーザ操作を受け付ける状態指定受け付け手段と、
 をさらに備える、付記12に記載の学習装置。
(Appendix 13)
a state presenting means for presenting a state of the environment to a user;
a state designation receiving means for receiving a user operation for designating a state in which the interruption is required;
The learning device according to appendix 12, further comprising the means described above.

(付記14)
 中断されたシミュレーションを再開する際の、前記制御対象が行動する環境の状態を決定する再開状態決定手段
 をさらに備える、付記2から13の何れか一つに記載の学習装置。
(Appendix 14)
The learning device according to any one of appendices 2 to 13, further comprising a restart state determination means for determining a state of an environment in which the controlled object acts when resuming an interrupted simulation.

(付記15)
 前記再開状態決定手段は、シミュレーションの中断時まで実行されていたエピソード内でシミュレーションを再開可能な時間ステップとして状態が記憶されている実行済みの時間ステップであるリセット先候補に設定されている確率分布に基づいて、何れか1つのリセット先候補を選択する、
 付記14に記載の学習装置。
(Appendix 15)
the restart state determination means selects one of the reset destination candidates based on a probability distribution set for the reset destination candidates, which are executed time steps whose states are stored as time steps at which the simulation can be restarted within the episode that was executed until the simulation was interrupted;
The learning device according to appendix 14.

(付記16)
 前記再開状態決定手段は、複数のリセット先候補のうち何れか1つを一様分布の確率分布に従って選択する、
 付記15に記載の学習装置。
(Appendix 16)
The restart state determination means selects one of a plurality of reset destination candidates in accordance with a uniform probability distribution.
The learning device according to appendix 15.

(付記17)
 前記再開状態決定手段は、複数のリセット先候補に対して、瞬時報酬値の累積値である累積報酬値の増減で示される、シミュレーションの中断時における評価の悪化が大きいほど、シミュレーションの中断時とリセット先候補との間の時間ステップ数が多いリセット先候補を選び易くなるように確率分布を設定し、設定した確率分布に従って何れか1つのリセット先候補を選択する、
 付記15に記載の学習装置。
(Appendix 17)
The restart state determination means sets a probability distribution for the plurality of reset destination candidates such that the greater the deterioration of the evaluation at the time of interruption of the simulation, which is indicated by an increase or decrease in an accumulated reward value that is an accumulated value of the instantaneous reward value, the easier it is to select a reset destination candidate having a larger number of time steps between the time of interruption of the simulation and the reset destination candidate, and selects one of the reset destination candidates according to the set probability distribution.
The learning device according to appendix 15.

(付記18)
 前記再開状態決定手段は、シミュレーションの中断時まで実行されていたエピソード内でシミュレーションを再開可能な時間ステップとして状態が記憶されている実行済みの時間ステップであるリセット先候補のうち、瞬時報酬値の累積値である累積報酬値の増減で示される評価の変化が良化から悪化に転じている時間ステップのうちシミュレーションの中断時との間の時間ステップ数が最も少ない時間ステップまたはそれよりも前の時間ステップに相当するリセット先候補を選択する、
 付記14から17の何れか一つに記載の学習装置。
(Appendix 18)
The restart state determination means selects a reset destination candidate corresponding to a time step having the smallest number of time steps between the time of interruption of the simulation and the time step in which the change in the evaluation indicated by the increase or decrease in the cumulative reward value, which is the cumulative value of the instantaneous reward value, has turned from improvement to deterioration, among the reset destination candidates which are executed time steps whose states are stored as time steps at which the simulation can be restarted within the episode that was executed until the interruption of the simulation, or a time step earlier than that.
The learning device according to any one of appendices 14 to 17.

(付記19)
 付記1から18の何れか一つに記載の学習装置を用いて得られた方策に基づいて、制御対象に対する制御を行う、制御装置。
(Appendix 19)
A control device that performs control of a control target based on a policy obtained using the learning device according to any one of appendices 1 to 18.

(付記20)
 学習装置と制御装置とを備え、
 前記学習装置は、
 学習期間を表すエピソード内で時間ステップを遡った状態における、制御対象の行動に対する評価を示す値である報酬値に基づいて、前記行動の決定規則である方策を生成する方策生成手段と、
 前記制御対象の行動を前記方策に基づいて決定する行動決定手段と、
 を備え、
 前記制御装置は、前記学習装置を用いて得られた方策に基づいて、制御対象に対する制御を行う、
 制御システム。
(Appendix 20)
A learning device and a control device are provided,
The learning device includes:
a policy generating means for generating a policy, which is a decision rule for the action, based on a reward value, which is a value indicating an evaluation of the action of the controlled object in a state going back a time step in an episode representing a learning period;
an action decision means for deciding an action of the control target based on the measure;
Equipped with
The control device controls a control target based on the policy obtained using the learning device.
Control system.

(Appendix 21)
An input/output device comprising:
a state presentation means for presenting to a user the state of an environment in which a control target acts; and
a state designation reception means for receiving a user operation designating a state in which a simulator that simulates the environment needs to be interrupted.

(Appendix 22)
A learning method comprising, by a computer:
generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
determining the action of the control target based on the policy.

(Appendix 23)
A recording medium storing a program for causing a computer to execute:
generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
determining the action of the control target based on the policy.

This application claims priority based on Japanese Patent Application No. 2023-103973, filed on June 26, 2023, the entire disclosure of which is incorporated herein by reference.

The present disclosure may be applied to a learning device, a control system, an input/output device, a learning method, and a recording medium.

1, 620 Control system
100, 610, 621 Learning device
110 Communication unit
120 Display unit
130 Operation input unit
180 Storage unit
190 Processing unit
191, 612, 623 Action determination unit
192 Simulation unit
193, 611, 622 Policy generation unit
194 Reset determination unit
195 Restart state determination unit
200, 624 Control device
630 Input/output device
631 State presentation unit
632 State designation reception unit
910 Control target

Claims (20)

1. A learning device comprising:
a policy generation means for generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
an action determination means for determining the action of the control target based on the policy.
2. The learning device according to claim 1, further comprising:
a reset determination means for interrupting a simulation of the action of the control target when it determines that a predetermined condition for interrupting the simulation is satisfied,
wherein the policy generation means generates the policy based on the reward value calculated based on the simulation.
3. The learning device according to claim 2, further comprising a simulation means for performing the simulation to calculate the state of an environment in which the control target acts.
4. The learning device according to claim 2 or 3, wherein the reset determination means determines whether to interrupt the simulation based on a change in a cumulative reward value, which is a cumulative value of instantaneous reward values.
5. The learning device according to claim 4, wherein the reset determination means determines to interrupt the simulation when it determines, based on the increase or decrease of the cumulative reward value, that the evaluation indicated by the cumulative reward value has deteriorated beyond a predetermined threshold.
6. The learning device according to claim 5, wherein the reset determination means updates the threshold such that the width of deterioration in evaluation indicated by the threshold of the cumulative reward value increases as the simulation progresses.
7. The learning device according to any one of claims 4 to 6, wherein the reset determination means determines to interrupt the simulation when, based on the increase or decrease of the cumulative reward value, the deterioration in the evaluation indicated by the cumulative reward value has continued for a predetermined number of time steps or more.
8. The learning device according to any one of claims 2 to 7, wherein the reset determination means determines whether to interrupt the simulation based on a value function value, which is a predicted value of a cumulative reward value, the cumulative reward value being a cumulative value of instantaneous reward values.
9. The learning device according to claim 8, wherein the reset determination means determines to interrupt the simulation when it determines, based on the increase or decrease of the value function value, that the evaluation indicated by the value function value has deteriorated beyond a predetermined threshold.
10. The learning device according to claim 9, wherein the reset determination means updates the threshold such that the width of deterioration in evaluation indicated by the threshold of the value function value increases as the simulation progresses.
11. The learning device according to claim 9 or 10, wherein the reset determination means determines to interrupt the simulation when, based on the increase or decrease of the value function value, the deterioration in the evaluation indicated by the value function value has continued for a predetermined number of time steps or more.
12. The learning device according to any one of claims 2 to 11, wherein the reset determination means determines to interrupt the simulation when it determines that the current state of the environment in which the control target acts, as calculated in the simulation, is similar, to at least a certain degree, to any of one or more states preset as states requiring interruption.
13. The learning device according to claim 12, further comprising:
a state presentation means for presenting the state of the environment to a user; and
a state designation reception means for receiving a user operation designating a state requiring interruption.
14. The learning device according to any one of claims 2 to 13, further comprising a restart state determination means for determining the state of the environment in which the control target acts when restarting an interrupted simulation.
15. The learning device according to claim 14, wherein the restart state determination means selects one of the reset destination candidates based on a probability distribution set over the reset destination candidates, the reset destination candidates being executed time steps whose states are stored as time steps at which the simulation can be restarted within the episode executed up to the interruption of the simulation.
16. The learning device according to claim 15, wherein the restart state determination means selects one of the plurality of reset destination candidates according to a uniform probability distribution.
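Claims 15 and 16 describe choosing a restart point among the stored candidates according to a probability distribution, with a uniform distribution as one concrete case. A minimal sketch under those assumptions (the function name and the use of `random.choices` for the general weighted case are illustrative choices, not part of the claims):

```python
import random

def choose_restart_step(candidate_steps, weights=None, rng=None):
    """Choose a restart time step among stored reset destination candidates.

    weights: optional probability weights per candidate; None means a
    uniform distribution over the candidates (as in claim 16).
    """
    rng = rng or random.Random()
    if weights is None:
        return rng.choice(candidate_steps)  # uniform distribution
    return rng.choices(candidate_steps, weights=weights, k=1)[0]

rng = random.Random(0)
steps = [0, 5, 10, 15]
picked = [choose_restart_step(steps, rng=rng) for _ in range(1000)]
counts = {s: picked.count(s) for s in steps}
print(counts)  # each candidate picked roughly 250 times under a uniform draw
```

A non-uniform `weights` vector could, for instance, bias restarts toward candidates just before the point where the evaluation began to deteriorate, in the spirit of appendix 18.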
17. A control system comprising a learning device and a control device,
wherein the learning device comprises:
a policy generation means for generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
an action determination means for determining the action of the control target based on the policy,
and wherein the control device controls the control target based on the policy obtained using the learning device.
18. An input/output device comprising:
a state presentation means for presenting to a user the state of an environment in which a control target acts; and
a state designation reception means for receiving a user operation designating a state in which a simulator that simulates the environment needs to be interrupted.
19. A learning method comprising, by a computer:
generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
determining the action of the control target based on the policy.
20. A recording medium storing a program for causing a computer to execute:
generating a policy, which is a decision rule for an action, based on a reward value indicating an evaluation of the action of a control target in a state reached by going back time steps within an episode representing a learning period; and
determining the action of the control target based on the policy.
PCT/JP2024/021691 2023-06-26 2024-06-14 Learning device, control system, input/output device, learning method, and recording medium Pending WO2025004859A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2023103973 2023-06-26
JP2023-103973 2023-06-26

Publications (1)

Publication Number Publication Date
WO2025004859A1 (en) 2025-01-02

Family

ID=93938914

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2024/021691 Pending WO2025004859A1 (en) 2023-06-26 2024-06-14 Learning device, control system, input/output device, learning method, and recording medium

Country Status (1)

Country Link
WO (1) WO2025004859A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089553A1 (en) * 2016-09-27 2018-03-29 Disney Enterprises, Inc. Learning to schedule control fragments for physics-based character simulation and robots using deep q-learning
JP2019529135A (en) * 2016-09-15 2019-10-17 グーグル エルエルシー Deep reinforcement learning for robot operation
US10800040B1 (en) * 2017-12-14 2020-10-13 Amazon Technologies, Inc. Simulation-real world feedback loop for learning robotic control policies


Similar Documents

Publication Publication Date Title
US20210374538A1 (en) Reinforcement learning using target neural networks
US11341420B2 (en) Hyperparameter optimization method and apparatus
CN101842754B (en) For the method for the state of discovery techniques system in a computer-assisted way
US11480623B2 (en) Apparatus and method for predicting a remaining battery life in a device
US20210406683A1 (en) Learning method and information processing apparatus
CN110781969A (en) Air conditioner air volume control method and device based on deep reinforcement learning and medium
JP2019159888A (en) Machine learning system
EP3926547A1 (en) Program, learning method, and information processing apparatus
US20220017106A1 (en) Moving object control device, moving object control learning device, and moving object control method
WO2025004859A1 (en) Learning device, control system, input/output device, learning method, and recording medium
US20240394554A1 (en) Learning device, learning method, control system, and recording medium
US20220150148A1 (en) Latency mitigation system and method
CN115545188A (en) Multitask offline data sharing method and system based on uncertainty estimation
JP2022140092A (en) Device for reinforcement learning, method for reinforcement learning, and program
US20210004717A1 (en) Learning method and recording medium
JP7574940B2 (en) Operational rule determination device, operational rule determination method, and program
US20090030861A1 (en) Probabilistic Prediction Based Artificial Intelligence Planning System
JP2025002046A (en) Multi-agent reinforcement learning system and program
JP7505563B2 (en) Learning device, learning method, control system and program
JP7173317B2 (en) Operation rule determination device, operation rule determination method and program
JP2024140139A (en) Learning device, learning method, and program
JP7421391B2 (en) Learning methods and programs
EP3996005A1 (en) Calculation processing program, calculation processing method, and information processing device
Dementyeva et al. Runtime assurance for intelligent cyber-physical systems
CN114254765A (en) Active sequence decision method, device and medium for simulation deduction

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24831723

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2025529646

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2025529646

Country of ref document: JP