Artificial Intelligence based Smart Grid scheduling algorithm using Reinforcement Learning

by

Debajyoti Dasgupta
(18CS30051)
Declaration

I certify that
(a) The work contained in this report has been done by me under the guidance of
my supervisor.
(b) The work has not been submitted to any other Institute for any degree or
diploma.
(c) I have conformed to the norms and guidelines given in the Ethical Code of
Conduct of the Institute.
(d) Whenever I have used materials (data, theoretical analysis, figures, and text)
from other sources, I have given due credit to them by citing them in the text
of the thesis and giving their details in the references. Further, I have taken
permission from the copyright owners of the sources, whenever necessary.
COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
KHARAGPUR - 721302, INDIA
CERTIFICATE
This is to certify that the project report entitled “Artificial Intelligence based Smart Grid scheduling algorithm using Reinforcement Learning” submitted by Debajyoti Dasgupta (Roll No. 18CS30051) to the Indian Institute of Technology Kharagpur towards partial fulfilment of the requirements for the award of the degree of Dual Degree (Master of Technology) in Computer Science and Engineering is a record of bona fide work carried out by him under my supervision and guidance during the Spring Semester, 2022-23.
Abstract
The power consumption of households has been growing constantly over the years (Kahan 2019). To cope with this growth, intelligent management of the consumption profile of households is necessary, so that households can save on their electricity bills and the stress on the power grid during peak hours can be reduced. However, implementing such a method is challenging due to the randomness in the electricity price as well as in the requirements and consumption of the appliances. To address these challenges, we propose a reinforcement learning-based AI agent for demand response, based on cost minimization for the client-side appliances while taking care of user satisfaction and reducing the peak load on the supply side. We propose a working architecture and the method employed for training the models. We then show that, using a reinforcement learning approach, we can approach the lower bound of the optimized solution. We use the Pecan Street dataset for consumption and synthetic data to simulate time-of-day pricing of electricity. Using this dataset, we show that the reinforcement learning model not only learns to minimize the cost of consumption while handling several devices but also provides a scalable architecture that reduces the stress on the generation side. In this report, we also suggest two new modes of training the reinforcement learning agent: incremental training and using a predictor network to improve training. We also give a comparative study of the performance improvement in training using the newly proposed metrics. The study is then extended to a multi-agent environment, where we build strong baselines that take account of human nature, and we show that careful modeling of the observation space can lead to efficient learning of agents in the cooperative scenario. Finally, the exploration continues with the results of a genetic algorithm-based solution.
Acknowledgements
I want to thank my mentors, Prof. Partha Pratim Chakrabarti and Prof. Arijit Mondal, for their exceptional guidance and support, without which this project would not have reached its full potential. They have always motivated me to explore as much as possible, look into numerous papers, and try as many ideas as possible. They have always supported me through whatever problems I faced during the project and helped me with the resources I needed. With their continuous input, the project has evolved over the semester. I would also like to thank my parents and friends, who motivated me to do the project and provided me with moral support, especially during the tough times of the pandemic.
Debajyoti Dasgupta
Contents
Declaration i
Certificate ii
Abstract iii
Acknowledgements iv
Contents v
List of Figures ix
1 Introduction 1
1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Why Smart Grid theme was chosen? . . . . . . . . . . . . . . 1
1.1.2 Power Control and information flow . . . . . . . . . . . . . . . 2
1.1.3 Where does optimization fit in? . . . . . . . . . . . . . . . . . 2
1.1.4 Why consider multi-agents? . . . . . . . . . . . . . . . . . . . 3
1.2 Objective and Work Done . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Layout of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Problem Statement 6
2.1 Objectives we aim to achieve . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Input for the problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Output for a Good Scheduling . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Extending to Multi-Agent Systems . . . . . . . . . . . . . . . . . . . 11
2.4.1 Disclosure of information . . . . . . . . . . . . . . . . . . . . . 11
2.4.2 Cooperation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.3 Input and Output . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Literature Review 13
3.1 Deep Q Networks (DQN) . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Epsilon-Greedy Action Selection . . . . . . . . . . . . . . . . . . . . . 15
3.3 RL Agent for managing power demand between grid and battery . . . 15
4.9.3 Crossover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.9.4 Genetic Diversity . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.9.4.1 Genotypic Diversity . . . . . . . . . . . . . . . . . . 54
4.9.4.2 Phenotypic Diversity . . . . . . . . . . . . . . . . . . 55
4.9.4.3 Species Count . . . . . . . . . . . . . . . . . . . . . . 56
4.9.4.4 Fitness Diversity . . . . . . . . . . . . . . . . . . . . 58
Bibliography 119
List of Figures
5.1 Scheduling of the jobs during the day by greedy algorithm (FCFS) . . 61
5.2 Scheduling of the jobs during the day by Binary Search combined
with Knapsack solver (without deadlines) . . . . . . . . . . . . . . . . 62
5.3 Scheduling of the jobs during the day by Binary Search combined
with Knapsack solver (with deadlines) . . . . . . . . . . . . . . . . . 63
5.4 Scheduling of the jobs during the day by MIP (without deadlines) . . 64
5.5 Scheduling of the jobs during the day by MIP (with deadlines) . . . . 64
5.6 Predictor Network used to improve training results . . . . . . . . . . 67
5.7 Episode Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.8 Training Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.9 Bill Amount using the same number of machines and jobs . . . . . . 70
5.10 Increasing Jobs keeping Machines constant . . . . . . . . . . . . . . . 70
5.11 Increasing Machines keeping Jobs constant . . . . . . . . . . . . . . . 70
5.12 Electricity billing for one day’s job . . . . . . . . . . . . . . . . . . . 73
5.13 Comparison of the Electricity billing . . . . . . . . . . . . . . . . . . 73
5.14 Comparison of time taken for a single step during training . . . . . . 74
5.15 Comparison of the Bill amount saved by Inception Net as compared
to other feature extractors . . . . . . . . . . . . . . . . . . . . . . . . 75
5.16 Fine tuning results with different time steps . . . . . . . . . . . . . . 76
Chapter 1
Introduction
A smart grid is an electrical grid which includes a variety of operation and energy measures, including: advanced metering infrastructure (of which smart meters are a generic name for any utility-side device, even one that is more capable, e.g. a fiber-optic router); smart distribution boards and circuit breakers integrated with home control and demand response (behind the meter from the utility perspective); load control switches and smart appliances, often financed by efficiency gains on municipal programs (e.g. PACE financing); renewable energy resources, including the capacity to charge parked electric-vehicle batteries, larger arrays of batteries recycled from these, or other energy storage; energy-efficient resources; sufficient utility-grade fiber broadband to connect and monitor the above, with wireless as backup; and sufficient spare ("dark") capacity to ensure failover, often leased for revenue.
The idea of a “smart electricity system” has moved from conceptual to operational in the last few years. The smart grid has undergone significant innovation, with demand response being one of the important focus areas. Further, the advanced metering infrastructure (AMI) has made it possible to collect usage data and to communicate with other AMI devices.
• Research has shown that this could translate into as much as $59 billion in societal benefits by 2019.
Moving to the real world, optimizing only for oneself is neither fair nor does it promote efficient utilization of natural resources. Moreover, it is quite common nowadays for people to live in tall towers and huge buildings which may contain hundreds of apartments, each having its own machines and using its own quota of electricity, ignorant of the consumption levels of its neighbours. What we want to study here is whether, if we build intelligent systems that model the behaviour of the neighbours, there are steps the owners can take to optimize the overall power usage of the entire building, which may lead to sustainable utilization of resources.
The major issue with using smart grids as the only means of controlling energy utilization is that many second- and third-world countries are yet to get a fully functional smart grid set up and working for mass usage. However, most people in these countries do own a mobile phone or a laptop with a good internet connection. So we propose a two-way modeling in this thesis report.
1. The model we propose can be used to output a schedule for the devices that the user will need during the day, given the jobs that need to be scheduled. There can be two different modes of scheduling in this case: dynamic scheduling and static scheduling. In dynamic scheduling, the user can submit any job on any machine during the day, and can even increase or decrease the number of available machines or cancel ongoing jobs. Static scheduling is the case where the jobs to be done are given in advance and the user will not add any other job during the day; this can be thought of as a special case of dynamic scheduling. The main dependence of this functionality is on manual use, because a manual user may be delayed in starting the job on the device at the time mentioned by the output schedule. This is taken care of in the model we propose during training, as it takes as input the state of an environment that is simulated to be very close to human behaviour. The user will use an application that employs the model to produce a schedule and updates it every time a new job is added.
2. The second way of using the model is together with smart grid technology. The model will receive continuous input from the advanced metering infrastructure. Further, the user may also install automatic job-scheduling facilities and sensors to detect whether a job has been scheduled on a particular device or not. This is a complete automation of the process of completing all the household jobs, without the user needing to intervene in any form. If the user wants, they can interrupt the scheduling and add new jobs or devices, similar to the formulation of dynamic scheduling in the above point.
• Apart from the above, we have also devised newer methods of training. Our first new method is termed incremental training. Incremental training trains the model by selectively training the layers on the sub-tasks of the problem, one by one. An easy example: if we want to train an agent to learn to play a football game with 11 players, we first teach the model to play with one player, then with two players, and so on up to eleven players (a minimal sketch of this idea follows this list). We studied how incremental training benefits the training of the DQN model both in terms of performance and in the time taken for training. We achieved a performance boost of 15.38% in the case of complex models like ResNet by exploring more of the observation space within fewer training steps.
• Extending this training, we show that coarse training with 15-minute time steps and fine-tuning with 5-minute time steps can give a performance boost of 16.67% even in the case of an efficiently trained model like InceptionNet. We also show that this approach provides a better schedule, one which covers more of the day.
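As a rough illustration of the incremental-training idea mentioned in the first point above, the following sketch grows the sub-task one device at a time while reusing the weights learned so far. It assumes a Gym-style environment factory make_env(num_devices) (a hypothetical helper whose observation and action spaces are padded to a fixed upper limit, so they do not change between stages) and uses the stable-baselines3 DQN implementation as the learner; it is not the exact training code used in our experiments.

```python
# Minimal sketch of incremental (curriculum-style) training.
# Assumptions: make_env(num_devices) returns a Gym-style environment whose
# observation/action spaces are padded to a fixed upper limit, so the same
# network can be reused as the number of devices grows.
from stable_baselines3 import DQN

def incremental_training(make_env, max_devices=5, steps_per_stage=50_000):
    # Stage 1: train on the smallest sub-task (a single device).
    model = DQN("MlpPolicy", make_env(1), verbose=0)
    model.learn(total_timesteps=steps_per_stage)

    # Later stages: grow the sub-task one device at a time, reusing the
    # weights learned so far instead of training from scratch.
    for k in range(2, max_devices + 1):
        model.set_env(make_env(k))
        model.learn(total_timesteps=steps_per_stage, reset_num_timesteps=False)
    return model
```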
This section of Chapter 1 described the motivation behind choosing the topic of the thesis and how the modeling that we propose can actually benefit society. In Chapter 2 we describe the problem statement that we aim to solve in this thesis. In Chapter 3 we give a brief review of the literature in which related work was published and the motivation that can be taken from those papers for solving our problem. Chapter 4 gives a detailed description of the inputs that we receive for modelling and the output that we return to the user; it also describes the mathematics behind the modelling and briefly describes the actual implementation of the model. The results and observations are discussed in Chapter 5. Finally, in Chapter 6 we conclude the report with a discussion of the future work that can continue the work proposed in this thesis. At the end we add a Bibliography listing the literature referred to in this thesis.
Chapter 2
Problem Statement
In this problem we will be focusing on scheduling the appliances used by the user (or client). Many research papers have suggested mechanisms to schedule the consumption of appliances. The authors in Maharjan et al. 2013 and Maharjan et al. 2016 applied game theory to model the interaction between the utility companies and the customers to reduce power consumption. A real-time pricing scheme was adopted in Zhao et al. 2013, and a genetic algorithm was utilized to minimize the electricity cost. Instead of scheduling appliances in the residential area, the scheduling of industrial loads was considered in Gholian A. 2016. An incentive scheme was applied in Ehsanfar and Heydari 2018 to encourage more households to participate in load scheduling.
Since the client-side devices are under the control of the user, we have decided to apply scheduling techniques to this field first. Since there is a wide variety of appliances that a user may use in everyday life, such as TVs, refrigerators, and air conditioners, we have decided to focus on selected appliances. To select these appliances, we first categorize them into some major groups. The classes into which we categorize the devices are:
1. Appliances that need to finish batch task within deadline and have stages
defined in the process
• Eg. Washing Machine and Dish Washer
2. Appliances that have to run in periods (Periodic Appliances)
• Eg. Air Conditioner and Refrigerator. (they stop running when a certain
temperature is achieved and start running again later)
3. Appliances that have to perform the same job multiple times (sometimes with different settings)
Our major assumption is that we will be using Time-of-Day pricing. We will also include Demand Response based penalties; that is, crossing a certain set limit will incur penalties on the user side.
The devices are classified into different groups because of their functioning. The
functioning of the devices in different classes is described as follows:
1. The devices in the first category have well-defined processes that they need to complete to finish a batch of jobs. Between these processes the machine can be preempted, and the process can be continued later on. For example, in a washing machine, the process of cleaning clothes involves wash, rinse, and spin; so after washing we can schedule some other appliance and then come back to rinsing.
2. The devices in the second category usually run in periods. They try to achieve some target, and once the target is achieved, the device stops working and later starts working again. For example, an air conditioner works until the temperature set by the user is reached. Once the set temperature is reached, the thermostat senses it and the air conditioner is automatically switched off. Later, when the temperature of the room changes, the AC starts working again. A similar principle is followed by the refrigerator. Since these devices usually work within a range of temperatures, we set a limit on the operable conditions; for example, the AC may turn off within the range T0 ± R, where T0 is the target temperature and R ≤ 0.01 is the cushion range.
3. Finally, devices of the third type are usually tasked with doing the same job multiple times, sometimes just with different settings. For example, in the case of a microwave oven, we need to heat many food items, usually with different settings for the heat at which the microwave should operate. Also, while heating a food item, we cannot preempt the process; otherwise it will render the process useless.
Our main objective in this problem will be to reduce the peak demand on the client side and spread it evenly over the entire duration. This is mainly because, first of all, if the peak demand is higher than the limit then there may be severe penalties. Secondly, if the peak demand is high, especially during the time of day when the cost of supplied electricity is high, it will cost the user a great deal and is also not an efficient way to manage energy. Also, since we are using time-of-day pricing as well as penalties for crossing the limit, our objective function will also include minimization of the client (demand-side) cost.
Second, while we are minimizing the cost for the client, it is important to keep the user satisfied as well. Suppose the microwave was supposed to heat some food at 8 a.m., but the scheduler schedules the heating for 9 p.m.; then the user will be very unsatisfied with the service of the Reinforcement Learning agent. Another situation would be if we provide too much cushion for the refrigerator (say 3 degrees): the food may get spoilt and the user will again be very unsatisfied. So we will also keep track of a satisfaction score, and the task of our objective function will be to maximize this satisfaction score.
Since we will be making use of Reinforcement Learning, the input should mainly contain the following important things: the environment state, and the reward that was obtained when the output action was taken in the previous step. The following is a description of what will be included in each of these inputs:
1. The environment will mainly contain a set of parameters that describe the state of each device at the current point in time. The devices belonging to each of the three groups (as described above) will have different types of states: the washing machine needs to report the state/process in which it is currently running and the time remaining to complete that process, whereas the refrigerator may provide its current temperature or the difference between the target and current temperature, and the microwave may in addition report the mode in which it currently has to operate. We may also include how much of each job has been completed so far. The environment of the system may contain any other details that provide extra information about the state of the device. The environment of the device is very important since it will help in further decision making.
2. The reward will basically be the output of the reward function that was calculated by the reinforcement learning agent at the previous time step. This reward will also influence future decision making, since the Reinforcement Learning agent tries to maximize the future reward. So if the reward in the previous step was very low, it needs to act accordingly to increase the future reward. For this, the reward function should be carefully designed.
We can further extend the approach by storing experiences in an experience replay buffer. Also, if we are using an ϵ-greedy strategy, then we will need the value of ϵ from the previous step to implement ϵ-decay.
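The following is a minimal sketch of ϵ-greedy action selection with ϵ-decay as described above. The multiplicative decay schedule and the specific constants are illustrative assumptions, not the exact values used in our experiments.

```python
import random

# Sketch of epsilon-greedy selection with multiplicative epsilon decay.
# The constants (start, floor, decay rate) are placeholders for illustration.
class EpsilonGreedy:
    def __init__(self, n_actions, eps_start=1.0, eps_min=0.05, eps_decay=0.995):
        self.n_actions = n_actions
        self.eps = eps_start
        self.eps_min = eps_min
        self.eps_decay = eps_decay

    def select(self, q_values):
        # Explore with probability eps, otherwise exploit the greedy action.
        if random.random() < self.eps:
            action = random.randrange(self.n_actions)
        else:
            action = max(range(self.n_actions), key=lambda a: q_values[a])
        # Decay eps using its value from the previous step, with a lower floor.
        self.eps = max(self.eps_min, self.eps * self.eps_decay)
        return action
```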
In addition to the above inputs, we will also need to provide a list of jobs that need to be scheduled. This list may be dynamic or static (while testing), though for training purposes we will use a static list of jobs. The job description will contain the type of job, the settings it needs, and the device on which the job needs to be scheduled.
The Reinforcement Learning agent should also have information about the working of the device (if any). The working of the device basically means which type of setting requires how much time; say, it is configured that the rinsing cycle of the washing machine takes 15 minutes. This list will also specify which process must necessarily come after which other process; for example, it is necessary that spin comes after rinse in a washing machine. Sometimes switching between the different states of the device also requires time and a loss/gain of energy (to be compensated from the source), so details about that will be helpful in the input (if modelled properly). This information should be available to the Reinforcement Learning agent in advance.
Since we are working with Reinforcement Learning, the main components of the output will be the action that was taken by the Reinforcement Learning agent based on some policy, and the reward that was achieved after taking that action. The following is a description of these outputs:
1. The action, which consists of the following pieces of information:
(a) Which device was selected (scheduled) to operate in the current time slot. This will basically be a pointer to the device, or a list of devices, that need to be activated; the jobs will be scheduled on these devices.
(b) How much time slice was allocated to the process being scheduled currently. This time slice decides how long the scheduled device will run the selected job. If the device completes the time slice before completing the job, the process is preempted from it, whereas if it finishes earlier than the time slice, it willingly gives up the rest of the slice for scheduling the next appliance.
(c) Which job is handled by the current scheduling. That is, from the list of jobs which are still pending, it gives a pointer to the job that needs to be scheduled now, since scheduling this job will be beneficial.
(d) Which phase (in the list of processes to be done, like the rinse phase in the case of a washing machine) the scheduled device will execute the job/process in. This will mainly be the next phase that needs to be done, as the job has been completed up to the previous phase.
According to the job that was scheduled, the time allotted to the job may vary (that is, the time slice will not be constant). Also, if a scheduled device takes very low power, the next device may be scheduled simultaneously. This will help in reducing the peak demand and also in utilizing the supplied power efficiently.
2. The reward will simply refer to the reward that is earned after taking the current action as specified in the output. The reward will be computed using the reward function, which will be trained using the deep Q-network. The reward function will combine all the objectives that we want to achieve, that is, minimizing the demand-side cost, reducing the peak demand, and maximizing the satisfaction of the user. There are mainly two kinds of reward function: the episodic reward and the continuous reward. We need to decide which type of reward function should be adopted for the model; this will depend on the implementation of the model that we adopt.
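As a minimal illustration of such a composite (per-step) reward, the sketch below combines the three objectives named above. The linear combination and the weights are assumptions made purely for illustration; they are not the trained reward function.

```python
# Illustrative per-step reward combining demand-side cost, a peak-demand penalty,
# and a user-satisfaction term. Weights are placeholders, not tuned values.
def step_reward(energy_kwh, price_per_kwh, peak_kw, peak_limit_kw,
                satisfaction, w_cost=1.0, w_peak=5.0, w_sat=2.0):
    cost_term = -w_cost * energy_kwh * price_per_kwh          # minimize the demand-side bill
    peak_term = -w_peak * max(0.0, peak_kw - peak_limit_kw)   # penalize crossing the demand limit
    sat_term = w_sat * satisfaction                           # reward user satisfaction
    return cost_term + peak_term + sat_term
```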
In addition to the above details, the output may also contain how much of the currently scheduled job will be completed in the current time slice of the device provided in the output. This is important because the job will not necessarily be completed, as it will be preempted when the time slice expires.
2.4 Extending to Multi-Agent Systems
The multi-agent problem statement deals with handling multiple apartments in a single building, with each apartment having its own agent to solve its scheduling and trying to optimize its rewards (primarily the bill amount). To extend the problem statement to the domain of multi-agent systems, we modify the above-mentioned constraints in the following way.
2.4.1 Disclosure of information
Since owners may either be very open or be wary about sharing their data, our study extends to different levels of information sharing. Primarily, we are concerned with knowing the actions of the other agents in our neighbourhood and what state they are in. Based on this, the information that is shared may come from a PUBLIC, SHARED, or PRIVATE (no sharing at all) knowledge set. We will primarily try to optimize the case of no knowledge sharing during execution time, and then try to improve on the results by slowly increasing the amount of knowledge that is shared.
2.4.2 Cooperation
For the initial study we assume that we are implementing our agents in a friendly neighbourhood, where everyone is ready to compromise for others. Hence we first study the cooperative environment and look into how well we can optimize if all the agents fully cooperate with one another.
2.4.3 Input and Output
The input for the multi-agent environment will be very similar to the one described above for the single-agent scenario, with modifications to handle multiple apartments in a single building, each with its own intelligent agent.
• We are considering N houses, each house will have a single RL Agent (hence
in total N RL agents will be acting together).
• Each house will have three categories of devices (as mentioned above), with zero or multiple instances of each category. Primarily, house Hi will have M_Ai, M_Bi, and M_Ci devices of the three categories, respectively.
• For ease of modelling our observation space and to make the model flexible for different numbers of machines in each house, we will specify an upper limit on the number of devices of each category that can be added to a particular house. Thus M_Ai, M_Bi, and M_Ci will lie within the mentioned limit.
• As a primary assumption to make the study of this complex domain a bit simpler, we will currently assume that the average energy profile of the devices of a particular category remains the same across all houses. This means that if washing machines are being used in different houses, then all the washing machines are the same model, belonging to the same company. Assuming the same energy profile is a realistic assumption, since a large number of families use 2-star or 3-star rated devices, which have nearly the same energy consumption during run time.
• Houses will also need to declare the number of jobs they want each machine
to handle during the day. That is house Hi will be having
Chapter 3
Literature Review
3.1 Deep Q Networks (DQN)
• Accomplishment
– The paper shows huge success on games involving complex decision-making processes and many interlinked states. These games were mainly from the Atari 2600.
• Complexity:
– Can model scenarios where the action space is discrete and small
– Cannot handle continuous action spaces
• Flexibility:
• We will be using the Deep Q-Network because its shortcomings will not affect us.
– The states in our application are discrete – jobs that need to be completed
– Decisions need to be taken at discrete times, when a job or batch of jobs is completed.
• From the results we see that on the games which had too many hidden states to explore, the model did not perform better than a human being. However, in most other games the DQN model was able to outperform even the most professional players.
Figure 3.3: Comparison of the DQN agent with the best reinforcement learning
methods in the literature
2. The proposed technique is data-driven, and the RL agent learns from scratch how to efficiently use the energy storage device given variable tariff structures.
3. This paper also brings to light two new models for solving some of the issues in DQN (which will be discussed later).
4. Accomplishment
(a) The paper shows that the model was quite successful even with only Time-of-Day pricing, and had savings of 6-8%.
(b) Savings of 12-14% can be obtained if the utility follows ToD pricing along with rewards from the DR (Demand Response) program.
5. Observations
(a) The agent performs better on high-capacity batteries than on low-capacity batteries.
(b) This is because the agents trained on the low-capacity batteries have not seen states that are experienced with the high-capacity batteries.
(c) But the high-capacity batteries have seen the states that are seen by the low-capacity batteries.
(d) Hence, the Deep Q-Network performs better when it can explore a larger number of states.
5. An optimization algorithm, which can provide a schedule for smart home appliance usage, is proposed based on the mixed-integer programming technique.
6. Results
(a) Simulation results demonstrate the utility of the proposed solution for appliance scheduling.
(b) The paper further shows that adding a PV system to the home results in a reduction of electricity bills, and
(c) the export of energy to the national grid at times when solar energy production exceeds the demand of the home.
2. Demand-side management encourages the users in a smart grid to shift their electricity consumption in response to varying electricity prices.
3. This paper proposes a distributed framework for demand response based on cost minimization.
4. Each user in the system finds an optimal start time and operating mode for the appliances in response to the varying electricity prices.
5. In order for the users to coordinate with each other, a penalty term is introduced in the cost function, which penalizes large changes in the scheduling between successive iterations.
• These problems are addressed in this paper by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. This work proposes a way to have separate action-value functions for multiple agents and to learn them from just one shared team reward signal. The joint action-value function is a linear summation of the action-value functions of all agents. By using a single shared reward signal, it tries to learn decomposed value functions for each agent and uses them for decentralised execution.
• For the 2-agent case the reward will be r(s, a) = r1(o1, a1) + r2(o2, a2). Then the total Q function is:
Qtot((o1, o2), (a1, a2)) ≈ Q1(o1, a1) + Q2(o2, a2)
• With the above-mentioned assumption, every agent can participate in a decentralized execution solely by choosing greedy actions with respect to its own Q. Monotonicity can be enforced through a constraint on the relationship between the total Q and the individual Q values:
∂Qtot / ∂Qa ≥ 0,  ∀a ∈ A
• For every agent, this paper follows the same network architecture mentioned in the VDN, which is a recurrent-based network. In order to fulfill the constraint, QMIX uses an architecture consisting of agent networks, a mixing network, and a set of hyper-networks. The following diagram shows the details of QMIX:
Figure 3.9: (a) The architecture of mixing network (b) QMIX overview (c) The
architecture of agent network
• The main contribution of QMIX is the mixing network. The mixing network is a feed-forward network that outputs the total Q value. It takes the individual Q value of each agent as input and mixes them monotonically. In order to satisfy the monotonicity constraint, the hyper-networks of QMIX take the global state as input and output the weights of the mixing network. Each hyper-network consists of a single linear layer, followed by an absolute-value activation function, so that the constraint is satisfied.
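The following is a small PyTorch sketch of a QMIX-style monotonic mixing network of the kind described above: hyper-networks take the global state and produce the mixing weights, and an absolute value on those weights enforces ∂Qtot/∂Qa ≥ 0. Layer sizes and names are illustrative, not the exact ones from the paper.

```python
import torch
import torch.nn as nn

# Sketch of a QMIX-style mixing network. Hyper-networks map the global state to
# the mixing weights; abs() keeps the weights non-negative, so dQ_tot/dQ_a >= 0.
class QMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)   # Q_tot: [batch, 1]
```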
Figure 3.10: Win rates for IQL, VDN, and QMIX on six different combat maps.
The performance of the heuristic-based algorithm is shown as a dashed line.
In this paper, they also proposed three different ablations in order to show the
advantage of QMIX:
• First, they analyze the significance of extra state information on the mixing
network by comparing it against QMIX without hyper-networks. Thus, the
weights and biases of the mixing network are learned in the standard way,
without conditioning on the state. They refer to this method as QMIX-NS.
• With the growing adoption of smart grid technologies and the widespread
distribution of smart meters, it has become possible to monitor electricity usage
in real time within modern smart building environments. Utility companies
are now implementing different electricity prices for each time slot, taking
into consideration peak usage times. This paper introduces a novel power
consumption scheduling algorithm for smart buildings that utilize smart meters
and real-time electricity pricing.
• The proposed algorithm (Lee and Bahn 2014) dynamically adjusts the power mode of each electrical device in response to changes in electricity prices. Specifically, we model the electricity usage scheduling problem as a real-time task scheduling problem and demonstrate that it is a complex search problem with exponential time complexity. To address this challenge, our scheme employs an efficient heuristic based on genetic algorithms, which significantly reduces the search space and identifies a reasonable schedule within an acceptable time budget.
• Experimental results under various building conditions reveal that the proposed algorithm effectively reduces the electricity costs for a smart building. On average, the algorithm achieves a 25.6% reduction in electricity charges, with a maximum reduction of up to 33.4%. This demonstrates the potential of our approach to optimize power consumption in smart building environments while accounting for real-time electricity pricing, ultimately leading to significant cost savings and improved energy efficiency.
Figure 3.11 shows an example of power consumption scheduling for each electricity demand as time progresses. In the figure, peak time represents the time period at which the price of electricity becomes high. As shown in the figure, it is desirable to change the state of each device to a low-power mode during peak time.
This paper cuts down the huge search space of the scheduling problem using an evolutionary computation method to find an approximate schedule within a reasonable time budget. Specifically, a genetic algorithm, which is a probabilistic search method based on natural selection and population genetics, is used. Figure 3.12 depicts a brief flow of the genetic algorithm. A certain number of schedules are initially generated, and they form the initial set of schedules. Among the schedules in the set, two schedules are selected and then merged into one schedule by the crossover and mutation operations; the schedule set is then evolved by replacing a schedule in the old set with the newly generated schedule. This process is repeated until the schedule set converges.
The proposed algorithm represents a schedule using a matrix, with rows corresponding to electric devices and columns representing the 24 hours of a day. Each matrix
entry is either 0 or 1, where 0 indicates the device is off and 1 signifies it is on.
For devices with multiple power modes, each mode is treated as a separate row in
the matrix. This simplifies the encoding compared to using multiple values for each
entry.
Devices have different power demand behavior over time, and scheduling can be
classified as interruptible or non-interruptible based on the flexibility in splitting
tasks across time slots. The algorithm maps interruptible tasks to multiple non-
interruptible tasks with finer-grained slots. The scheduler initially generates a set
of 1000 random schedules.
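A small sketch of this schedule encoding and of generating a random initial population is shown below. The device-hour 0/1 matrix and the population size follow the description above; handling of release times, deadlines, and sliding windows is omitted for brevity, so this is an illustration rather than the paper's implementation.

```python
import numpy as np

# 0/1 schedule matrix: one row per device (or power mode), one column per hour.
def random_schedule(n_devices, required_hours, n_slots=24, rng=None):
    rng = rng or np.random.default_rng()
    schedule = np.zeros((n_devices, n_slots), dtype=np.int8)
    for d in range(n_devices):
        # Turn device d on for its required number of hours at random slots.
        on_slots = rng.choice(n_slots, size=required_hours[d], replace=False)
        schedule[d, on_slots] = 1
    return schedule

def initial_population(n_devices, required_hours, pop_size=1000):
    rng = np.random.default_rng()
    return [random_schedule(n_devices, required_hours, rng=rng) for _ in range(pop_size)]
```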
A selection operation chooses two schedules in the current scheduling set that will be the parents of the next generation. This paper applies the most common practice, which gives the best schedule four times more probability of being selected than the worst one [3]. The value of each schedule is evaluated by the electricity charge of the schedule. The normalized value NVi of a schedule is defined in terms of Rw, Rb, and Ri, the ranks of the worst schedule, the best schedule, and schedule i, respectively, in the current generation. Based on the normalized value, roulette-wheel selection is applied as the selection operator. It assigns each schedule i a slot with a size equal to its normalized value NVi. The pointer of the wheel is a random real number ranging from 0 to the sum of all normalized values. A schedule whose slot spans the pointer is selected and becomes a parent. Note that this is based on a probabilistic rule that favors better schedules but is not deterministic. A probabilistic approach is used here to avoid too much discrimination: if the best schedules were selected deterministically, they might dominate the scheduling set rapidly, potentially causing premature convergence to a local optimum.
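The sketch below illustrates the rank-based roulette-wheel selection just described. Since the paper's exact normalized-value formula is not reproduced here, the code assumes a simple linear ranking in which the best schedule receives four times the selection weight of the worst, matching the behaviour described in the text.

```python
import random

# Rank-based roulette-wheel selection. Lower charge = better schedule.
# The linear 1-to-4 weighting is an assumed normalization for illustration.
def roulette_select(population, charges):
    n = len(population)
    # Rank 0 = worst (highest charge) ... rank n-1 = best (lowest charge).
    order = sorted(range(n), key=lambda i: charges[i], reverse=True)
    rank = {idx: r for r, idx in enumerate(order)}
    weights = [1.0 + 3.0 * rank[i] / max(n - 1, 1) for i in range(n)]
    pointer = random.uniform(0.0, sum(weights))
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc >= pointer:
            return population[i]
    return population[-1]
```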
After selecting the parents, a crossover operation is executed. The goal of the crossover is to combine beneficial segments from the parents to create a child that leads to improved schedules over time. We employ a traditional one-point crossover technique, which generates a random crossover point and then swaps segments between the two parents to produce offspring. In our method, the crossover point is generated by cutting a matrix column, as depicted in Figure 3.13.
This paper does not explicitly define mutation operations, which alter the generated child to maintain diversity, as the crossover operations incorporate adjustment routines that serve the purpose of mutation. Specifically, when applying the classical one-point crossover, an infeasible schedule might be created, meaning that the scheduling sequence does not meet the time constraints of each device. Such a situation arises when the crossover point cuts the sliding window of a scheduling unit. As a result, the operation is adjusted so that the child primarily inherits the scheduling unit from parent 1, while the remaining slots are completed from parent 2. If any empty slots remain, they are randomly filled by selecting any location within the sliding window.
Upon creating a child schedule, the new generation is formed by replacing a schedule from the current generation with the newly generated child. In this paper, the schedule with the highest electricity charge in the current generation is replaced by the new child schedule, which is the most frequently employed replacement operation.
3.8.5 Results
Experiments were performed with six buildings of different scales, where the electricity demands are 50, 100, 150, 200, 250, and 300, respectively. Before showing the comparison results, Fig. 3.14 plots the electricity charges of the best and the worst schedules in the current search space, and their average, as the generations progress. As shown in the figure, the quality of schedules improves significantly as the generations evolve until, finally, convergence. The proposed scheduling system performs the best irrespective of the building scale; the performance gain of GA-based scheduling is 27.0% on average and up to 36.4% compared to the original scheduling system.
Figure 3.16 shows the electricity charges of the proposed GA-based scheduling system and the original system that does not use it. In the original system, each scheduling unit is randomly located between the release time and the deadline. For comparison purposes, the Greedy scheduler is also simulated. The Greedy scheduler locates each scheduling unit in the time slot in which the electricity price is lowest. For an exact analysis of this interesting result, the electricity charge of each time slot of a day is also examined. Figure 3.15 plots the normalized electricity charge of the three schedulers as time progresses within a day.
Chapter 4
Theory
Since in the limited amount of time we cannot include all the household devices that are used every day, we classify the devices into 3 broad categories. In addition to this, the model we build is robust and highly scalable, so that, given a device and its specification, it can be incorporated.
1. Category-1
(a) Appliances that need to finish a batch task within a deadline and have stages defined in the process. Eg. Washing Machine and Dishwasher.
(b) The devices in this first category have well-defined processes that they need to complete to finish a batch of jobs. Between these processes the machine can be preempted and can continue later. For example, in a washing machine, the process of cleaning clothes involves wash, rinse, and spin; so after washing we can schedule some other appliance and then come back to rinsing.
Figure 4.1: Load profile for Dish Washer
Figure 4.2: Load profile for Washing Machine
2. Category-2
(a) Appliances that have to run in periods (Periodic Appliances). Eg. Air Conditioner and Refrigerator (they stop running when a certain temperature is achieved and start running again later).
(b) They try to achieve some target, and once the target is achieved, the device stops working and later starts working again.
(c) For example, in the case of an air conditioner, the device works until the temperature set by the user is reached. Once the temperature in the settings is reached, the thermostat senses that, and the air conditioner is automatically switched off.
(d) Since these devices usually work within a range of temperatures, we set a limit on the operable conditions; for example, the AC may turn off within the range T0 ± R, where T0 is the target temperature and R ≤ 0.01 is the cushion range.
3. Category-3
(a) Appliances that have to perform the same job multiple times (sometimes with different settings).
(b) Eg. Microwave Oven and Toaster (many food-scheduling types).
(c) Devices in this category perform the same job multiple times, sometimes just with different settings.
(d) For example, in the case of a microwave oven, we need to heat many food items, usually with different settings for the heat at which the microwave should operate. Also, while heating a food item, we cannot preempt the process; otherwise, it will render the process useless.
Figure 4.4: Load profile for the oven in the morning
Figure 4.5: Load profile for the oven at night
• This is because the washing machine will also be washing different sets of
clothes with different settings, like cotton, jeans, mild rinsing, etc
1. Our main objective in this problem will be to reduce the peak demand on the client side and spread it evenly over the entire duration.
2. This is mainly because, first of all, if the peak demand is higher than the limit then there may be severe penalties.
3. Secondly, if the peak demand is high, especially during the time of day when the cost of supplied electricity is high, it will cost the user a great deal and is also not an efficient way to manage energy.
f_L = min_X Σ_{k=1}^{N} Σ_{i=1}^{m} Σ_{j=1}^{n_i} (P_{ijk} X_{ijk} − q)²        (4.1)
(Mary Mammen and Kumar 2019) Also, since we are using time-of-day pricing as well as penalties for crossing the limit, our objective function will also include minimization of the client (demand-side) cost, i.e. the power utilized multiplied by the cost, where X is the vector of the X^k_{i,j}:

f_c = min_X Σ_{k=1}^{N} C^k ( Σ_{i=1}^{m} Σ_{j=1}^{n_i} P_{ijk} X_{ijk} )        (4.3)
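For concreteness, the two objectives can be evaluated for a candidate schedule as in the small numpy sketch below, where P and X are assumed to be arrays indexed as [k, i, j] (time slot, device, job), C[k] is the price of slot k, and q is the target value in equation (4.1); this array layout is an assumption made for illustration.

```python
import numpy as np

def load_flatten_objective(P, X, q):
    # Eq. (4.1): sum over all (k, i, j) of (P_ijk * X_ijk - q)^2.
    return float(((P * X - q) ** 2).sum())

def cost_objective(P, X, C):
    # Eq. (4.3): price C[k] of slot k times the total power scheduled in that slot.
    per_slot = (P * X).sum(axis=(1, 2))
    return float((C * per_slot).sum())
```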
While we are minimizing the cost for the client, it is important to keep the user satisfied as well. Suppose the microwave was supposed to heat food at 8 a.m., but the scheduler schedules the heating of the food for 9 p.m.; then the user will be very unsatisfied with the service of the Reinforcement Learning agent.
User satisfaction is a term that varies from person to person, but we create a mathematical measure of it to build an estimate (as per the use of the device). All the devices will have user-satisfaction scores on two fronts; one is the delay in the start of the job scheduled on the device, defined as the difference between the time of job submission and the time of scheduling. The second is device-specific, as follows:
1. For the first category of devices (e.g. washing machine), satisfaction will also be measured by how much time is taken for job completion, defined as the difference between the scheduling of the job and the completion of the job.
2. For the second category of devices (e.g. Air Conditioner), the devices will be running continuously whenever needed. The other metric of the satisfaction score of these devices will be the deviation from the range they are required to operate in, defined as the difference between the current temperature and the nearest endpoint of the optimal operating range.
3. The metric for the third category of devices, namely microwave ovens, is the same as that of the first category, that is, the time taken for job completion.
In addition to the above goals, we will also deploy the model on cloud-based systems with specified API endpoints. This can also be integrated into a mobile app or a web app. Using the API, people will be able to make use of the model to implement static scheduling of jobs in countries such as India, while countries such as Singapore, which have an advanced metering scheme in use, will be able to use full-fledged dynamic scheduling of devices.
1. Each layer will deal with scheduling jobs of a particular type of machine. Say layer 1 of the image deals with the washing machine; then it will deal with scheduling clothes-washing-related jobs. Hence the redundant states (such as scheduling oven jobs on the washing machine) are reduced significantly, compared to combining the 3 layers into a single image.
2. The dimension of each layer is (number of machines × number of jobs) for a particular type layer; hence the image size is smaller and scalable in both the machine dimension and the job dimension.
3. In this formulation, we do not need to burden ourselves with finding a model that can learn the proper splitting of energy among the different RL agents (for different devices). A proper split will be learnt from the several CNN filters whose weights are learned during training.
The following image shows the image representation of an intermediate state in the environment. It can be interpreted as follows (for the ith row and jth column):
• A black cell denotes that machine(i) cannot be scheduled on job(j). This can be due to several different reasons: machine(i) is already running a job different from (j), machine(i) has been allocated to an incomplete process, or job(j) belongs to a process that has been allocated to a machine different from (i).
• A light color (more towards white) means less energy used in the process, and darker cells (more towards black) imply higher power utilization.
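The sketch below shows one way such a layered image observation could be assembled, with one layer per device category and a sentinel value for infeasible (machine, job) cells. The category names, the normalization, and the sentinel convention are illustrative assumptions, not the exact encoding used in our implementation.

```python
import numpy as np

INFEASIBLE = 1.0   # darkest value: machine i cannot run job j right now

def build_observation(energy, feasible):
    """energy, feasible: dicts keyed by category name -> arrays of shape
    (n_machines, n_jobs), assumed padded to a common size across categories."""
    layers = []
    for cat in ("batch", "periodic", "repetitive"):          # the three device categories
        e = energy[cat] / max(float(energy[cat].max()), 1e-9)  # lighter = less energy
        layer = np.where(feasible[cat], e, INFEASIBLE)
        layers.append(layer.astype(np.float32))
    return np.stack(layers)                                   # shape: (3, n_machines, n_jobs)
```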
Action_min = 0        (4.4)
Action_max = 2^{#machines} × 2^{3}        (4.5)
• However, if we use a separate observation space in three separate layers, then the action space size reduces to 2^10 + 2^10 + 2^10 = 3072 states. We can see a significant reduction in the size of the action space between the two cases.
• This can also be visualized using the following analogy: if we have, say, 29 microwaves, 28 washing machines, and 28 air conditioners, then we will have 2^29 + 2^28 + 2^28 ≈ 1 billion states in the action space; that is, with the same size of action space we will be able to handle more machines of each type.
2. Taking the maximum Q-value will be noisy during the initial phase of training if non-optimal actions are regularly given higher Q-values than the best action.
3. The solution involves using two separate Q-value estimators, each of which is used to update the other. Using these independent estimators, we can obtain unbiased Q-value estimates of the actions selected using the opposite estimator.
2. DDQN can learn which states are (or are not) valuable without having to learn the effect of each action at each state, since it also calculates V(s).
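As an illustration of the two-estimator idea, the following PyTorch sketch computes a double-DQN-style target: the online network selects the next action and the target network evaluates it. This is a generic sketch of the technique, not our training code.

```python
import torch

# Double-DQN target: action selection by the online net, evaluation by the target net.
def double_dqn_target(online_net, target_net, reward, next_obs, done, gamma=0.99):
    # reward, done: 1-D tensors; next_obs: a batch of observations.
    with torch.no_grad():
        next_action = online_net(next_obs).argmax(dim=1, keepdim=True)    # selection
        next_q = target_net(next_obs).gather(1, next_action).squeeze(1)   # evaluation
        return reward + gamma * (1.0 - done) * next_q
```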
In the subsequent study we replace the simple convolutional layer of the neural network architecture with a complex feature-extractor network, as shown in the diagram below. We mainly experiment with three different types of feature-extractor networks: InceptionNet V3 (4.11), ResNet50 (4.12), and a simple ResNet (4.13). A simple ResNet is one in which only two residual connections are available. InceptionNet is a compute-efficient feature extractor and will help us study how the addition of 1x1 convolutional modules (which help reduce the number of computations) affects the learning ability of the model compared to the ResNet models. The two ResNets help us study how increasing the residual connections affects the learning capacity of the model and its effect on the training time.
Our problem poses a unique mathematical optimization problem for obtaining a good schedule. Relying only on the feature extractor may not always provide the best result, so we began exploring models that also look into the mathematical nature of the images. Especially since we are dealing with a time-series model, it is best to use models that make use of recurrent connections over the time axis. Recently, many such models have come up which target the mathematical nature of multivariate systems. We convert our problem into a multivariate system by simply flattening the image matrix into a multivariate series in which each variable corresponds to a single cell of the image representing the current state. So our system will have (number of devices) × (number of jobs) × (number of distinct categories of devices, which is three in our case) variables, where each of these variables varies over time. We chose the LSTNet (Lai et al. 2017) model to solve this problem. LSTNet has shown a prominent ability to recognize patterns in many daily use cases, especially ones which have mathematical patterns in a time-varying format. We use this architecture to study the improvement we achieve with and without the presence of a predictor network.
As can be seen from the figure above, LSTNet (4.14) is divided into three phases. The first phase contains the convolutional layers, which act as the image feature extractor in our case. The second phase has the recurrent and recurrent-skip connections, which are networks capable of extracting useful information from time-series data; it is here that the essence of the time series is captured by the model. We also see the presence of an autoregressive head, which assumes that the current data depends only on the previous data and current variables. Finally, all these extracted features go through a feed-forward network where the non-linearity is learned to make predictions for the future time step.
To compare the performance with the previous complex feature extractors like InceptionNet V3 (4.11), we introduce a BiLSTM layer in conjunction with a 1-dimensional convolutional layer, which allows the feature extractor to capture the essence of the time series. We use a similar layered architecture with ResNet50 (4.12) and ResNet Small (4.13) as well.
4.7 Happiness
• For the washing machine, the user will be satisfied as soon as the clothes are washed and returned, as it gives time for drying and adding further jobs to the washing machine as well. However, if the job of washing the clothes was supposed to be complete by 9 AM and gets scheduled for 9 PM, the user will be very unhappy.
• For the oven and the air conditioner, if the user wants the food to be prepared by 11 AM, it is best that the machine schedules the task in a nearby time slot, as scheduling the job too early will lead to the food getting cold over time, and scheduling it too late will not serve the purpose of the job, so in either case the user will be unhappy.
The values of the coefficients of the equation and the correct value of the power in the equation are found using grid search. The above-mentioned happiness function is a piece-wise function, where each piece has a specific form. The last part of the equation signifies that a heavy penalty is imposed if the task is delayed beyond the deadline, so we prioritize completion of the task within the deadline. The normal distribution near the deadline emphasises that completion near the deadline is preferred by the user. The first part of the equation is, however, task dependent, as described in the points above: for the washing machine, the earlier the task is completed, the happier the user is, hence it has a positive weight; for the oven and the air conditioner we need to push the task towards the scheduled time, as doing the task too early or too late will render the job useless.
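A rough sketch of such a piece-wise happiness function is given below. The coefficients, the Gaussian width, and the exact functional form are placeholders; in our work the coefficients and powers were obtained via grid search, and only the qualitative shape (task-dependent early term, preference near the deadline, heavy penalty after the deadline) is reproduced here.

```python
import math

# Illustrative piece-wise happiness function; all constants are placeholders.
def happiness(completion_time, deadline, device_type,
              early_weight=0.5, sigma=0.5, late_penalty=10.0):
    if completion_time > deadline:
        # Heavy penalty for finishing after the deadline.
        return -late_penalty * (completion_time - deadline)
    # Gaussian term rewarding completion close to the deadline.
    near_deadline = math.exp(-((completion_time - deadline) ** 2) / (2 * sigma ** 2))
    if device_type == "washing_machine":
        # Earlier completion makes the user happier (positive weight).
        task_term = early_weight * (deadline - completion_time)
    else:
        # Oven / AC: finishing far from the intended time is undesirable.
        task_term = -early_weight * abs(deadline - completion_time)
    return task_term + near_deadline
```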
The power and total bill amount incurred by the entire building must also be reduced while performing local optimizations.
Since the jobs on devices of the same category are identical (i.e. they go through the same cycles and have the same average energy consumption in each cycle), we a priori allocate the jobs to all the available machines of the respective category by distributing them equally. For example, if we have 5 cloth-washing tasks to be done on 3 washing machines, we allocate 2 tasks each to the first two machines and 1 task to the third machine.
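A small sketch of this a priori round-robin allocation is shown below; the helper name is illustrative.

```python
# Spread jobs of one category as evenly as possible over that category's machines,
# e.g. 5 washing jobs on 3 machines become allocations of sizes 2, 2 and 1.
def distribute_jobs(job_ids, n_machines):
    allocation = [[] for _ in range(n_machines)]
    for idx, job in enumerate(job_ids):
        allocation[idx % n_machines].append(job)
    return allocation

# distribute_jobs(["w1", "w2", "w3", "w4", "w5"], 3)
# -> [["w1", "w4"], ["w2", "w5"], ["w3"]]
```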
• The major advantage of this formulation is that we do not need to limit the maximum number of jobs the user can schedule during the day, and hence the model so developed will be able to scale efficiently and unconstrained in the job domain. Also, the effect of inter-machine job scheduling is minimal, since average houses usually have a limited number of devices of the same category.
• The three factors that will now primarily affect the scheduling are:
• Since the jobs are now evenly distributed among the different machines, we do
not need to take care of intra-device scheduling. The only decision that needs
to be taken is whether to schedule a job at the current moment or not.
• So now the action space for each agent is a boolean vector of length N_machines. This is highly scalable with a growing number of machines in the household, since there is no longer an exponential factor to be handled.
4.8.2.1 PettingZoo
been referred to in several other related works and the developers keep it up
to date with the latest works in the Multi-Agent RL industry.
4.8.2.2 Mava
• We start by understanding the formulation of the genetic space with the example above. The red, green, and yellow colors represent one instance of each of the three distinct types of machines mentioned previously. We represent a machine as a set of processes (groups of jobs), and each process is further represented as a group of jobs. In the diagram (4.19), each cell represents a single job that needs to be completed.
• We then assign a time to each of the jobs, as can be seen from figure 4.20. The entire day is first divided into α-unit time steps (α is 15 minutes for coarse training and 5 minutes for fine-tuning in our experiments). Each job is allocated the value of the time block that it falls in, where each block is α units long.
• Once we have completed the assignment of times to each process, we can create the chromosome/gene from these sequences of times. We already know which job goes to which machine from the metadata and the uniform assignment of jobs.
– Join the time sequences of each of the distinct types of machines by joining all the jobs on the machine together. This gives us a sequence of numbers representing the jobs to be completed, in the order in which they need to be completed. This can be understood from figure 4.21.
– Now, we take the sequences formed in the previous step and join them together to form a single sequence. The order in which the sequences are joined does not matter significantly; we just need to make sure that each sub-sequence representing a single machine stays contiguous and is not interleaved with any other machine's sub-sequence.
– Even though the order in which the sub-sequences are joined doesn't matter, we need to store the order of joining for reconstructing the schedule back from the chromosome.
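A small sketch of this encoding and its inverse, under the assumption that the per-machine slot sequences are held in a dictionary (all names are illustrative):

```python
def encode_chromosome(machine_schedules):
    """Join each machine's sequence of time-slot assignments into one flat
    chromosome; remember the (machine_id, length) order so the schedule can
    be reconstructed.  `machine_schedules` maps machine id -> list of slot
    indices, one per job, in completion order."""
    chromosome, layout = [], []
    for machine_id, slots in machine_schedules.items():
        chromosome.extend(slots)
        layout.append((machine_id, len(slots)))
    return chromosome, layout

def decode_chromosome(chromosome, layout):
    """Split the flat chromosome back into per-machine slot sequences."""
    schedules, pos = {}, 0
    for machine_id, length in layout:
        schedules[machine_id] = chromosome[pos:pos + length]
        pos += length
    return schedules

genes, layout = encode_chromosome({"wash_0": [3, 7, 12], "oven_0": [20, 25]})
assert decode_chromosome(genes, layout) == {"wash_0": [3, 7, 12], "oven_0": [20, 25]}
```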
4.9.2 Mutation
The first and most important operator of the genetic algorithm for producing new offspring is mutation. For mutation, we perform the following steps to achieve a stable and valid mutation.
• Randomly select any one of the machines (of any type) on which the mutation needs to be applied.
• Starting from the end of the sequence of jobs to be completed on that machine (which is represented as a sequence of numbers corresponding to their scheduled slots), decide for each job, with probability 0.5, whether or not to shift it. Here we consider only a backward shift, meaning we can only postpone the currently scheduled job.
• If we decide to shift the job in our current view, then we select a random slot between
– current value + 1, where current value represents the slot currently assigned to the job, and
– (new start time of the next job) − (time taken by this job).
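A hedged sketch of this mutation step for a single machine (variable names and the handling of the last job on the machine are assumptions):

```python
import random

def mutate(machine_slots, durations, day_end, p_shift=0.5):
    """Backward-shift (postpone) mutation on one machine.

    machine_slots: start slots of the jobs on the machine, in order.
    durations:     number of slots each job occupies.
    day_end:       total number of slots in the day (jobs must finish by it).
    Jobs are visited from the end so the upper bound for each job is the
    (possibly already shifted) start of the next job minus its own duration,
    which keeps the mutated schedule valid.
    """
    slots = list(machine_slots)
    for i in range(len(slots) - 1, -1, -1):
        if random.random() >= p_shift:
            continue
        next_start = slots[i + 1] if i + 1 < len(slots) else day_end
        lo = slots[i] + 1                      # current value + 1
        hi = next_start - durations[i]         # next start minus this job's length
        if lo <= hi:
            slots[i] = random.randint(lo, hi)
    return slots
```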
4.9.3 Crossover
Performing a gene crossover between two healthy parents is much more complicated, as the generated schedule can yield unstable offspring, meaning the schedule is not feasible (for example, intersecting schedules or schedules outside the limits of the day), or healthy parents may give rise to poor offspring on performing crossover. We will primarily focus on two types of crossover, as mentioned below.
• Crossover-1: Figure 4.24 represents the first crossover scheme. Here we first select two healthy parents randomly. Then, within these parents, we select a crossover point. The crossover point is selected such that it marks the joining point of two sub-sequences; if the point falls inside a sub-sequence, we may generate an unstable schedule, so it is necessary that the selected crossover point be the joining point of two sub-sequences. Once the crossover point is selected, we iterate over all the jobs after the crossover point and swap the values of the respective cells of parent 1 and parent 2 (that is, we perform swap(parent[i1][j], parent[i2][j])). This process always yields a stable schedule after the crossover.
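A minimal sketch of Crossover-1, assuming both parents share the same machine layout used for encoding (at least two machines are assumed so that a boundary cut point exists):

```python
import random

def crossover_1(parent_a, parent_b, layout):
    """Crossover scheme 1 (sketch): pick a cut point that coincides with a
    machine boundary (so no sub-sequence is split) and swap the tails of the
    two parents.  `layout` is the list of (machine_id, length) pairs used to
    build the chromosomes and is identical for both parents."""
    # cumulative machine boundaries, excluding 0 and the full length;
    # requires at least two machines in the layout
    boundaries, pos = [], 0
    for _, length in layout[:-1]:
        pos += length
        boundaries.append(pos)
    cut = random.choice(boundaries)
    child_a = parent_a[:cut] + parent_b[cut:]   # element-wise tail swap
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b
```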
Genotypic diversity is important for a population to adapt to changing environments, resist diseases, and maintain long-term viability. There are several methods to measure genotypic diversity, two common approaches being the Hamming distance and Shannon entropy.
In the context of species count, the DBSCAN algorithm can be used to identify subgroups within the population, using the Euclidean distance as the distance metric between individuals. This metric is particularly suitable for continuous data, such as phenotypic traits or genetic markers. To obtain good results, the EPS parameter should be fine-tuned so that the identified clusters accurately represent the underlying structure of the population.
DBSCAN thus provides a way to compute the species count and analyze diversity within a population. By applying the species-count method and analyzing the resulting clusters, we can gain valuable insights into the population structure, identify distinct subgroups, and inform decision-making in areas such as conservation, breeding programs, and population management.
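A small sketch of the species count using scikit-learn's DBSCAN (the eps and min_samples values below are illustrative and would need the tuning discussed above):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def species_count(population, eps=0.5, min_samples=5):
    """Estimate the number of 'species' (clusters) in a population with
    DBSCAN under the Euclidean metric.  `population` is an
    (n_individuals, n_features) array of continuous traits."""
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="euclidean").fit_predict(population)
    return len(set(labels) - {-1})  # label -1 marks noise, not a species

rng = np.random.default_rng(0)
pop = np.vstack([rng.normal(0, 0.1, (50, 4)), rng.normal(3, 0.1, (50, 4))])
print(species_count(pop))  # expected: 2
```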
• Average Fitness: Calculating the average fitness value of the population and
monitoring its changes over time can provide insights into the overall fitness
progression and selective pressures. Typically, a diverse population would have
individuals with varying fitness levels, and observing the average fitness can
help researchers identify stagnation in the evolutionary process.
The following algorithms are used to set baselines for the upper and lower bounds of the model's performance.
• First Come First Served: the First Come First Served (FCFS) algorithm replicates human behaviour; as soon as a machine is free and we want a job to be completed, we place the job on the machine with complete disregard for the cost at that moment of the day.
• Lower Bound (based on Binary Search):
– This algorithm provides an optimized lower bound on the bill amount that can be achieved.
– It is idealized because it disregards the fact that there may be delays on the part of the human user in scheduling the job.
– In this algorithm we perform a binary search on the limit of the bill amount per unit time and try to minimize it.
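A sketch of this binary search, assuming a feasibility oracle can_schedule_all(limit) that is not shown here (it would internally attempt to place all jobs while respecting the given per-time-step limit):

```python
def lower_bound_limit(can_schedule_all, lo=0.0, hi=10_000.0, tol=1.0):
    """Binary search on the per-time-step bill limit (sketch).

    `can_schedule_all(limit)` returns True if every job can be scheduled
    while keeping the bill in each time step below `limit`.  We shrink the
    limit as far as feasibility allows."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if can_schedule_all(mid):
            hi = mid          # feasible: try a tighter limit
        else:
            lo = mid          # infeasible: relax the limit
    return hi
```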
Symbols
• N = Number of machines
• J = Number of jobs
• T = Total time units in a day (each unit in our case is a 5-minute interval)
The multi-agent experiments are performed with the following configurations.
• The First Come, First Served approach here highlights the condition that the neighbours are unaware of what the other houses in the building are consuming. So whenever there is a job to be done and a device is available, the job is scheduled on the device.
• Due to the greedy scheduling, adding deadlines has no effect other than reordering the job allocation according to deadline-based priority. The following image visualizes the schedule of the jobs during the entire day.
Figure 5.1: Scheduling of the jobs during the day by greedy algorithm (FCFS)
• As the scale of the environment grows into the multi-agent setting, simple reordering of the jobs within the given power/cost bound becomes extremely difficult. Hence we use a new approach in this setting: within the given bound on the cost over which the binary search is being performed, we take all the jobs that can be scheduled at the current time step and pass them to an efficient knapsack solver (from Google OR-Tools) to figure out which jobs should be scheduled at the current time step to get the optimal value (a usage sketch is given after Figure 5.3 below).
• The knapsack solver assigns a value of one to each job in the scenario without deadlines, primarily because completing the job is all that matters here. As shown in figure 5.2, there is a uniform trend in how the jobs are assigned to the different machines, primarily because the peak power is uniformly distributed over the entire day.
Figure 5.2: Scheduling of the jobs during the day by Binary Search combined
with Knapsack solver (without deadlines)
With deadlines, the knapsack solver assigns each job a value based on how near the current time step is to the job's deadline; the value of a job follows the deadline-based value equation (with LIMIT = INF).
Figure 5.3: Scheduling of the jobs during the day by Binary Search combined
with Knapsack solver (with deadlines)
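A minimal usage sketch of the OR-Tools knapsack solver for one time step is given below; note that the import path differs between OR-Tools versions, and the deadline-based job values used here are purely illustrative (the report's actual value equation is not reproduced).

```python
from ortools.algorithms import pywrapknapsack_solver

def pack_jobs(costs, cost_bound, deadlines=None, t_now=0):
    """Pick the jobs to run at the current time step under a cost bound."""
    # Without deadlines every job is worth 1 (completion is what matters);
    # with deadlines, jobs closer to their deadline get a larger value
    # (illustrative choice only).
    if deadlines is None:
        values = [1] * len(costs)
    else:
        values = [max(1, 100 - (d - t_now)) for d in deadlines]
    solver = pywrapknapsack_solver.KnapsackSolver(
        pywrapknapsack_solver.KnapsackSolver
        .KNAPSACK_MULTIDIMENSION_BRANCH_AND_BOUND_SOLVER,
        "job_packing")
    solver.Init(values, [costs], [cost_bound])   # weights = per-job cost
    solver.Solve()
    return [j for j in range(len(costs)) if solver.BestSolutionContains(j)]

print(pack_jobs(costs=[30, 20, 50, 10], cost_bound=60))  # -> [0, 1, 3]
```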
1. $0 \le x_{ij} \le T$
2. $x_{i j_1} + d_{j_1} \le x_{i j_2}, \quad \forall\, j_1, j_2 \in M_i$
• Without deadlines, we see that MIPS stacks all the jobs in the part of the day when the cost per unit is the lowest. But this leads to a significant spike in electricity usage during a very short duration of the day. If a similar trend is followed by many buildings in the town, the generators will soon start overheating and malfunctioning, and in due course the price during that time of the day will be increased.
Figure 5.4: Scheduling of the jobs during the day by MIP(without deadlines)
Figure 5.5: Scheduling of the jobs during the day by MIP(with deadlines)
• Adding deadlines to the Mixed Integer Programming based solver was a much easier task than for the binary search solver. The only change required is to impose the following constraint on the decision variable (a sketch of this formulation is given after the next bullet).
1. $0 \le x_{ij} \le D_j$
• From figure 5.5 we see that the mixed integer programming solver now tries to stack each job into the lowest-possible-cost region within its deadline. This still leads to spikes at different points in time during the day, but given the constraints, the solver finds the minimum bill amount that can be achieved, and this gives an optimal lower bound to compare with the multi-agent reinforcement learning approaches.
• This is a lower bound primarily because it does not take into account the uncertainties of human behaviour, i.e., that there might be delays during scheduling. Adding these uncertainties while training the reinforcement learning model, so that the agent can model a near-real-world scenario, will always add some disturbance relative to this lower bound.
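A hedged sketch of this formulation for a single machine, written here with OR-Tools CP-SAT rather than the exact solver configuration used in the experiments; passing deadlines = [T] * n recovers the no-deadline case, and the bill is approximated as the price of the start slot times the job duration:

```python
from ortools.sat.python import cp_model

def schedule_one_machine(durations, deadlines, price, T):
    """Sketch of the MIP/CP formulation for one machine.  x[j] is the start
    slot of job j; jobs keep their order without overlapping; each job must
    finish by its deadline (or by T in the no-deadline case).  `price` must
    have T + 1 entries, one per slot."""
    model = cp_model.CpModel()
    n = len(durations)
    x = [model.NewIntVar(0, T, f"x_{j}") for j in range(n)]
    cost_terms = []
    for j in range(n):
        model.Add(x[j] + durations[j] <= deadlines[j])   # 0 <= x_j <= D_j - d_j
        if j + 1 < n:
            model.Add(x[j] + durations[j] <= x[j + 1])   # x_{i,j1} + d_{j1} <= x_{i,j2}
        p = model.NewIntVar(0, max(price), f"p_{j}")
        model.AddElement(x[j], price, p)                 # p = price[x_j]
        cost_terms.append(p * durations[j])              # approximate bill of job j
    model.Minimize(sum(cost_terms))
    solver = cp_model.CpSolver()
    if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        return [solver.Value(v) for v in x]
    return None
```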
Two methods of training are implemented and their results are compared in this report. In each of these methods we focus on two training instances, one after 20,000 steps (69 episodes) and the other after 2 million steps (6,940 episodes).
• The model results provided in the subsequent sections were trained with 15 machines of each class and 40 jobs to be performed by each class of machine.
• We train the RL agent on a higher number of machines, and the end user is limited to adding up to 15 devices from each class. On request from the end user, the number of devices may be increased.
• If we add fewer machines, we just need to set the cells corresponding to invalid assignments to −inf (the black cells in the diagram).
• Some auxiliary functions, like those for prioritized experience replay, are used from the Stable-Baselines3 repository (a fork of the OpenAI Baselines, https://stable-baselines3.readthedocs.io/en/master/), since they have quite extensively optimised implementations.
• This incremental training method starts with a single hidden layer in the feed-
forward network and trains it for one machine.
• As we keep increasing machines in the following steps, we add a new layer for
each new device.
• When we add the second machine, we will add a new hidden layer between the
output layer and the final fully connected layer, which we have already trained
for one device.
• The training after adding a new machine mostly affects only the new layer added to the fully connected network and only slightly affects the other layers (due to the freezing of layers). Hence, training will be faster.
– Before starting the training, during the initial warm-up steps, the model was also run a few times on some of the initial experience accumulated with random states, so that the feature extractor layer does not start training from a completely random kernel initialization. This has proven to improve performance.
– During training, the previous layers were kept trainable for only 10–20% of the steps. That is, if we are training the 3rd layer for 3,000 steps, then we allow 2,500 steps for training the third layer and 500 steps for training the first two layers.
– The number of steps allocated to each layer was proportional to $\lceil \text{layer number}^{1.5} \rceil$ (e.g., 1 : 3 : 6). Finally, for fine-tuning, the model was run for 2,000–5,000 steps with the last two layers and the output layer set as trainable.
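A minimal PyTorch-style sketch of the incremental scheme described above (the layer container, dimensions, and step budgets are illustrative assumptions, not the exact architecture used in the experiments):

```python
import math
import torch.nn as nn

def add_machine_layer(layers, hidden_dim):
    """Incremental-training sketch: `layers` is a plain Python list of
    nn.Modules ending with the output head.  Freeze what is already trained
    and insert a fresh hidden layer just before the head for the new machine."""
    for layer in layers[:-1]:
        for p in layer.parameters():
            p.requires_grad = False              # freeze previously trained layers
    new_layer = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
    layers.insert(len(layers) - 1, new_layer)    # new layer sits before the output head
    return layers

def steps_for_layer(layer_number, base_steps=1000):
    """Step budget proportional to ceil(layer_number ** 1.5), e.g. 1 : 3 : 6."""
    return base_steps * math.ceil(layer_number ** 1.5)
```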
The observation is augmented with both historical and predicted future time steps. Historical time-step data is available from the buffer containing the stored prior experiences while training the DQN. For the future time steps, we use the Inception-network-based agent to predict the next three states and provide their output after simulation; the main reason is that the Inception network is already trained to give good predictions.
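A hedged sketch of how such an augmented observation could be assembled (the buffer, predictor, and three-step future horizon are placeholders for the actual implementation):

```python
import numpy as np

def build_augmented_observation(past_states, predictor, current_state,
                                k_past=3, k_future=3):
    """Concatenate a peek into the past with a peek into the future.
    `past_states` is a list of recent states from the replay buffer and
    `predictor(state)` returns the predicted next state."""
    past = past_states[-k_past:]                 # last k stored states
    future, state = [], current_state
    for _ in range(k_future):
        state = predictor(state)                 # roll the predictor forward
        future.append(state)
    return np.concatenate([*past, current_state, *future], axis=-1)
```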
• Dataset Available
1. Pecan Street Data: Dataport hosts all the data collected via Pecan
Street’s (Texas USA) water and electricity research.
Note: The colour coding in the following figure represents the different runs of training the agent, used to check whether the model is working correctly.
Figure 5.9: Bill Amount using the same number of machines and jobs
Bill Amount for figure 5.9 (Rs), with #Devices / #Jobs = 1

Algorithm / Steps     1 Device,   2 Devices,   4 Devices,   8 Devices,   15 Devices,
                      1 Job       2 Jobs       4 Jobs       8 Jobs       15 Jobs
FCFS                  2056.51     2056.51      2256.51      2256.51      2453.51
20k steps - norm      1766.58     1629.45      1850.4       1849.92      1909.86
20k steps - incr      1294.15     1340.46      1481.59      1439.62      1691.85
2 Mil steps - norm     907.46     1048.47      1022.74      1077.36      1242.99
2 Mil steps - incr     877.38      859.83       881.34       931.33      1048.9
Binary search          740.25      740.25       750.25       788.25       940.2
• From figure 5.9 we see that the trend of the bill amount remains almost constant when increasing the number of jobs (or machines) while keeping the ratio J/M constant. However, there is a slight increase in the bill amount with an increasing number of machines (or jobs). This is due to the fact that when many machines start working simultaneously, especially when the ToD price of electricity is low, there is a higher risk of crossing the penalty limit, and thus the bill increases.
• From figure 5.10 we see that the bill amount increases with an increasing number of jobs. This is mainly because the more jobs there are, the longer the machines run, hence more power is consumed and the bill increases. Similar reasoning applies to figure 5.11. Increasing the number of machines gives the RL agent a better opportunity to properly schedule the jobs on different machines to minimize the cost. Thus the output scheduling is in accordance with the hypothesis.
• The most important observation from the above figures is that the incrementally trained models provide better results than the full-data models when trained for the same number of steps. Also, the models trained for more steps (2 million) provide better results. The incremental model trained for 2 million steps even performs comparably with our optimal heuristic.
For testing the feature extractor network, we schedule 15 jobs to be completed on 5 machines. We make use of Inception Net V3 (4.11), ResNet50 (4.12) and ResNet Small (4.13) for feature extractor testing and compare the policies learned using these networks against our FCFS and random policies. All of these models are trained for 2 million steps (5-minute time steps), which is also equivalent to 60K steps (for 15-minute time steps), for both incremental and full training.
Figure 5.12: Incremental vs. full training bill amount: (a) line graph, (b) bar chart
The graph in 5.12 shows the bill incurred by the user in one day using the schedule provided by the RL agent with different feature extractor networks. It is clear from both plots 5.12(a) and 5.12(b) that the Inception Net has outperformed the other feature extractor networks in recognizing the patterns and accurately predicting the action to be taken in the near future to achieve the best possible scheduling. Our hypothesis for these results is that InceptionNet is a compute-optimized network and our model needs to map out a computational structure from the image; hence InceptionNet is more efficient at extracting the mathematical features from the image.
Figure 5.14: Comparison of time taken for a single step during training
Model check-pointing is used to store intermediate models and to restore them when starting the next phase of incremental training. Check-pointing and restoring huge models from checkpoints takes more time; hence incremental training can take quite a long time. In image 5.14 we give a comparative study of the time taken for full training and for incremental training. We use a fixed checkpoint interval (every 2,000 steps) for both training paradigms. The graph in 5.14 displays the time taken for one step of the training process by the different policies used in the DQN agent. The compute-optimized Inception network is about 5 times faster than the ResNetV2(50) model. Since each step takes less time, the entire training is overall faster for Inception Net compared to the other models.
Figure 5.15: Comparison of the bill amount saved by Inception Net as compared to other feature extractors: (a) line graph, (b) bar chart
Graphs 5.15(a) and 5.15(b) display the percentage of the bill amount that the Inception Net saved compared to the other algorithms/models, using incremental and full training. It is clear that Inception Net outperforms the FCFS algorithm, which replicates human behaviour, by a large margin, and also performs significantly better than the other models.
In this section we enumerate the results of training the model using 15-minute time steps, 5-minute time steps, 15-minute steps followed by fine-tuning with 5-minute steps, and 5-minute time steps followed by fine-tuning with 15-minute steps. Here we divide the day into broad time steps of 5 or 15 minutes instead of 1 second, mainly to reduce the observation space over which the scheduling occurs. Training with 15-minute steps and fine-tuning with 5-minute steps means that we first make our RL agent learn to schedule the devices as if the day were broken into 15-minute time steps (i.e., only multiples of 15 minutes can be allocated to a process). Once it has learned a schedule on 15-minute time steps, we then tell the model that it has to learn scheduling on 5-minute time steps.
Graph 5.16 shows the bill amount achieved by the different policies for the four types of unit steps described above, with incremental training.
From the plots in 5.17 we can see that for full training the difference between using fine-tuning and not using fine-tuning is not that significant; however, for incremental training, we see a significant effect on the cost saving. We also observe that training with 15-minute coarse steps and fine-tuning with 5-minute steps performs better than the other way around. We also observe that, in general, with any form of tuning, incremental training performs better than full training.
Figure 5.17: Comparison of fine tuning model with different time steps
Graphs 5.18(a) and 5.18(b) show the performance of the model using full and incremental training for 60K steps, respectively. The model evaluation is done after every 10K steps, and the minimum bill amount computed (after running 5 times) is stored as the evaluation result for the performance measure. For full training, we observe that the Inception network (4.11) consistently performs better than the ResNet models. The ResNet50 (4.12) performs worse than ResNet Small (4.13) initially but eventually learns the features and performs better than the smaller network. On the other hand, for incremental training, all the networks initially perform almost equivalently, since the lower step counts represent training for a smaller number of machines and hence a smaller observation space. But as the observation space becomes bigger with the addition of more machines, the Inception network performs better, as seen in the earlier results.
From figure 5.19(a), we see that in the case of the Inception network, the addition of an LSTM layer did not lead to much improvement in performance, mainly because it had already learned the features efficiently. However, there is a significant improvement in the performance of the ResNet50 model. From figure 5.19(b), we see that LSTNet alone, when trained, does not perform much better; indeed, it performs worse than Inception Net, though it outperforms the residual networks. However, on adding the predictor network (which is a trained Inception Net V3), we see a significant improvement in performance, and the newly learned policy even outperforms the policy learned with Inception Net alone. We see that augmenting the past experience gained (a peek into the past) with some good possible future states (a peek into the future) gives better performance. However, the predictor network's quality also plays a crucial role in giving good predictions for making better decisions about future time steps. From figure 5.20 we also observe that adding a predictor network while training the models reduces the peak power consumed and smooths the household's power consumption, which reduces the bill.
In all the following results of the ablation study, we have considered the cost bound. The cost bound and the power bound are directly proportional and are related by the pricing function determined by the electricity company providing the services: if the power usage goes up, the cost goes up, and vice versa. Hence the cost and power bounds have been used interchangeably.
• For all possible bounds, the results of the FCFS (greedy strategy) always remain the same, primarily due to the deterministic nature of the algorithm. The cost incurred by FCFS is Rs. 39,532.5. This provides an upper bound for the RL models, since they should not perform worse than this. The FCFS strategy, however, achieves 100% job completion within the deadline.
Figure 5.21: Bill amount incurred versus the different cost bound. B = 623
(optimal bound)
• The binary search algorithm arrives at a cutoff bound of Rs. 623 per time step, which gives the optimal bill amount while distributing the power bound equally. We then disturb this bound by multiples of ε = Rs. 10 and observe the effect on the bill amount.
• The Mixed Integer Programming solver always tries to stack the jobs in the best possible locations, i.e., those with the lowest cost per unit of energy. Intuitively, the stacking will be better if the per-time-unit bound is larger, because more jobs can be accommodated in the slots with lower cost, and hence the total bill amount over the day comes down. The results achieved also validate our assumption, as seen in 5.22.
Figure 5.22: Bill amount incurred versus the different cost bound for MIP solver
This is also partly due to greedily allocating jobs to the knapsack solver, and partly due to the fact that not much rearranging of the jobs is taken into account with binary search.
Figure 5.23: Bill amount incurred versus the different models (a) Peak power
optimization setting (b) Bill Amount optimization setting
• In the results for the Reinforcement Learning models, the difference between the bill-optimization and peak-power-optimization settings lies in the way the reward function is modelled in the environment simulation. Hence, based on the reward maximization, the policy learnt is also different.
• It is clear from figure 5.23 that QMIX outperforms the other value-based algorithms. The monotonicity constraint on the action-value function of QMIX (Rashid et al. 2018) may also explain why the binary search algorithm and QMIX (when trained with peak-power-optimization settings) converge on results that are very close: both algorithms enforce monotonicity constraints, just from different perspectives.
• QMIX is actually able to reorder the per-time-step limit/bound used by the binary search algorithm more efficiently, through increased exploration and efficient reordering of the jobs during the day, to reach a much better solution.
• QMIX also models the uncertainty of human scheduling error, and hence the optimal bill amount achieved by QMIX deviates slightly from the lower bound set using the Mixed Integer Programming solver.
• The table below gives the values of the bill amount and the real-time peak power required during the schedule of the entire day, under different optimization settings, namely Bill Optimized (BO) and Peak Power Optimized (PO). From the table we see that QMIX is not only able to reduce the bill amount but also reduces the real-time peak power, striking a balance between the two.
Figure 5.24: Scatter plot showing the position of the models on the Peak Power
vs. Bill amount plane
Figure 5.25: Scatter plot showing the position of the models on the Peak Power
vs. Bill amount plane with increased sampling points
From Figure 5.25, we start to see that when we study only the effect of different levels of optimization constraints, like highly bill-optimized, highly power-optimized, skewed from bill-optimized, and skewed from power-optimized, a unique pattern starts to emerge. This pattern is enhanced when we outline the curves they follow, as in Figure 5.26. The trend lines formed by the different algorithms seemingly form contours, where each contour may demarcate some physical feature of the algorithm, like its mathematical limitations or resource-constraint-handling capacity. The trend lines are interestingly quite hyperbolic-looking, and an increase in efficiency along one axis clearly affects the performance along the other axis, which also aligns with our ideas about the performance of the models. FCFS doesn't have contours, as it is a completely deterministic algorithm; hence all executions of the algorithm yield the same value in the absence of any human-induced uncertainties.
Figure 5.26: Outlining trends in 5.25 (Bill in Rs.1K, Peak Power in kW)
and the peak power. This supports the claim that the models are trying to provide the best satisfaction to the users by optimizing the satisfaction scores.
Figure 5.28 shows the satisfaction scores obtained by the different models. As expected, the satisfaction score of the greedy, human-nature-based "first come, first served" algorithm is not the most optimal, primarily in cases where completing the job near the deadline is more satisfactory, like food heating. The MIP model has the peak satisfaction score amongst all the models, resulting from its efficient mixed integer programming formulation. Our reinforcement learning algorithm, QMIX, also performs much better than its deterministic counterparts, outperforming Binary Search in satisfying the end user.
We now describe the pricing schemes used to obtain the averaged results and the motivation behind selecting them. The different pricing schemes in this section take inspiration from different sources and carry different significance. It is important to use different types of pricing schemes so that we can analyze and study the effects of our scheduling in both generic and specific cases.
These are the non-monotonic pricing schemes, which means they do not show any fixed trend, and their plots may or may not appear to fluctuate. This type of scheme usually focuses on the different requirements at different times of the day and tries to control the usage during the peak times when most users are statistically available.
Figure 5.29 shows the first pricing model, used in most of the above-mentioned results. This pricing model is inspired by the article of K Raheja Group 2018. It follows studies showing that electricity use is particularly high during the morning-to-afternoon period, which is the peak office duration, and in the post-evening time when most people are at home. We want to shift the peak household usage in the morning hours towards the early morning hours and/or the late afternoon, and similarly for the late evening hours. Hence, the pricing model has steeply high pricing during the morning to early afternoon and the late evening, compared to the early morning and the late afternoon to early evening.
Figure 5.30 shows the second pricing scheme, which peaks during the day and is lower in the early morning and the evening. This type of pricing scheme is particularly used in office areas, where electricity usage peaks significantly during the morning-to-afternoon working hours. The sudden activation of many devices in offices usually leads to a huge surge in energy requirements. Hence, offices are pushed to consume electricity responsibly to avoid high bill charges. This pricing scheme is sampled from a Gaussian (normal) distribution.
Next, we move on to the monotonic pricing schemes, where the price moves monotonically in one direction. That means we get an increasing or decreasing trend of the bill charges over the time-of-the-day rating.
We first look at Figure 5.31, which shows the first monotonic pricing scheme we consider. This is a monotonically decreasing pricing scheme, which means that the price of electricity is steep during the daytime and low during the night. This is primarily useful for restaurants and businesses that bloom at night. With the anticipated energy demand, the electricity company prices electricity at its minimum during the peak working hours and increases the prices during the off-hours, when a high surge of electricity demand is not anticipated.
Finally, we look at the last pricing scheme used for the experiments. The pricing scheme represented by Figure 5.32 is almost a mirror image of the previous monotonic pricing scheme in 5.31. This is a monotonically increasing pricing scheme; that is, the pricing is lower in the early morning and increases steadily towards the evening. This type of monotonic pricing scheme gives us an overview of a scheme that penalizes a business with high demand during the night, and it is included purely to complete the general-case scenarios.
For the following experiments, we use the schedule displacement technique. That is, once we get the generated schedule, we iterate over all the jobs and randomly shift each one by some number of minutes in the range [1, δ], where δ ∈ {−10, −5, 0, 5, 10}. A negative delay indicates that the job was done before the time it was actually scheduled, and a positive delay indicates postponement of the job. A delay of 0 keeps the original schedule.
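A small sketch of this displacement procedure (function name and slot units are illustrative):

```python
import random

def displace_schedule(schedule, delta):
    """Sensitivity-analysis displacement: shift every job start by a random
    amount of up to `delta` minutes; negative `delta` means preponing,
    positive means postponing, 0 keeps the original schedule."""
    if delta == 0:
        return list(schedule)
    sign = 1 if delta > 0 else -1
    return [start + sign * random.randint(1, abs(delta)) for start in schedule]

for delta in (-10, -5, 0, 5, 10):
    print(delta, displace_schedule([60, 120, 300], delta))
```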
Figure 5.33: Results for sensitivity analysis of pricing scheme 1 (Bill amount (in Rs. 1K) vs. displacement); panel (e) shows QMIX
First, we analyze the sensitivity of the models built on pricing scheme 1. From Figure 5.33(a), we can see that for FCFS scheduling, preponement of the jobs clearly has no effect, which is expected: the FCFS algorithm packs all the jobs into the earliest free slots, so under preponement the schedule remains the same as the original. However, with postponement of the schedule there is a significant dip in the bill amount, which means there is scope for improvement in the generated schedule. Figure 5.33(b) shows the results of the sensitivity analysis for the binary-search-based optimization combined with the knapsack solver. We see that the bill amount increases both on preponing and on postponing the schedule, which is also expected of a good schedule. Preponement has much less effect on the bill amount than postponement; this means the schedule generated by the binary-search-based method cannot tolerate delays, while doing jobs somewhat earlier to avoid delays does not affect the bill much. From the sensitivity analysis of the MIPS solver, shown in Figure 5.33(c), we can infer that the schedule generated by MIPS, even though it provides the lowest bill amount, is very sensitive to any changes in the schedule. The user must schedule each job exactly on time or use automated help, since even a small delay of 5 minutes can be very costly.
Figure 5.34: Combined Sensitivity Result for Pricing 1 (5.29) (Bill Amount (in Rs. 1K) vs. displacement (in minutes))
We now combine the previous section's results to better understand the relative effects and where these models stand. We begin by analyzing the combined results of the models fine-tuned to work on the pricing-1 scenario, as shown in 5.34. We see that MIPS performs extremely well in optimizing the bill amount, but the generated schedule is very sensitive to the time slots in which the jobs have been arranged: shifting the generated schedule for multiple machines by 5-10 minutes has a detrimental effect. As explained in the previous section, both QMIX and the binary-search-based scheduling are sensitive to delayed scheduling. Still, as we can see from the combined plot, the effect on QMIX's schedule is much smaller than on the binary search combined with the knapsack method. Hence, delayed scheduling is more stable with the reinforcement-learning-based QMIX agents than with their deterministic counterparts.
Moving on to the results generated by fine-tuning the models on pricing scheme 2, shown in Figure 5.35, we see that the preponing and postponing cases of MIPS are still the worst possible, but this is still the model that achieves the lowest bill amount compared to the other models. The QMIX model, in this case, outperforms the deterministic and the other value-decomposition networks. However, we see that some incentive may be gained by preponing some of the jobs; this shows that there is still scope for improvement in the modeling. A similar result is also observed for the binary-search-based scheduling. This reinforces the fact that there is some similarity in the scheduling behaviour of the two algorithms, namely QMIX and binary search, since both rest on the monotonicity property and display similar behaviour. VDN, as before, is less sensitive to postponing than to preponing, and the FCFS variant is invariant to preponing (since it stacks jobs early) while showing improvement on postponing.
Figure 5.35: Combined Sensitivity Result for Pricing 2 5.30 (Bill Amount(in
Rs.1k) vs displacement(in minutes))
Finally, we analyze the results for the monotonic pricing models too. Figure 5.36 shows the sensitivity analysis result for pricing scheme 3 discussed above. Pricing scheme 3 is a monotonically decreasing model, which means the pricing is high in the early morning hours and lowers as we move to the late night hours. This fact is also reflected in the results. In most cases, we see that preponing the schedule severely affects the bill amount, but postponing the schedule shows some improvement in the bill amount. Since the MIP model had already stacked all the jobs towards the lowest-price part of the day, there is no significant change in the bill amount from postponing the jobs, but a sharp increase from preponing them. For this pricing scheme, we see that the binary search model has improved performance compared to the QMIX model, though they are nearly the same. The QMIX model, however, under extreme delay, performs better than the Binary-Search-based scheduling option. Overall, we can see that most of the adaptive models learn to stack the jobs toward the end of the day for better billing. However, if the MIP model were run for all households or restaurants, then all the jobs would be strictly stacked towards the end of the day, and there would suddenly be a huge surge in electricity demand. Comparatively, the schedules generated by the Binary Search and QMIX algorithms are much friendlier to the power consumption and will not affect the peak power much, since the energy is fairly evenly distributed even towards the end of the day.
Figure 5.36: Combined Sensitivity Result for Pricing 3 (5.31) (Bill Amount (in Rs. 1K) vs. displacement (in minutes))
Figure 5.37: Results for sensitivity analysis of the average-case analysis (Bill amount (in Rs. 1K) vs. displacement); panel (e) shows QMIX
We now analyze the individual models' performance in the average-case scenario, as compared to pricing scheme 1 discussed in the earlier sections. For each of the following averaged results, we perform the scheduling 20 times, that is, 5 times for each of the 4 pricing schemes, and then take the average. Since the reinforcement learning algorithms take uncertainty into account, they produce different results every time, and we need to average over multiple runs to get a fair idea of each model's capacity. The results of the deterministic algorithms remain the same in all runs; hence multiple runs on the same pricing scheme do not affect the average, as long as we take an equal number of runs for each pricing scheme. Also, all these schedules are generated with full training; we will see the effect of incremental training in the next section.
For FCFS, from the results shown in Figure 5.37(a), we see that the average-case scenario has nearly the same structure as the case with pricing scheme 1; however, we see a higher average value due to its higher bill amount in the later cases. Figure 5.37(b) presents the result of averaging for the binary search combined with the knapsack algorithm. Here we make an interesting observation: in the average case, the binary search is less affected by postponing the jobs than it was under pricing scheme 1. This is partly because of the contributions of the monotonic models to the average. Another interesting observation is that the sensitivity to preponing the jobs increases; thus, the resulting plot is now quite similar to that of the MIP model. Figure 5.37(c) shows the result of the average-case analysis for the MIP model, and as expected, the MIP model produces a very low bill amount, but on disturbing the schedule the penalty is usually very high; that is, the schedule is quite unstable. Indeed, for the average-case result, preponing has a much more adverse effect, as we see a further increase in the steepness of the slope of the curve.
Coming to the results of the reinforcement learning models, figure 5.37(d) shows the result of the VDN model on the average-case scenario. We see that the VDN model, which was earlier only quite sensitive to preponing, now becomes equally sensitive to both preponing and postponing the schedule. Still, the range of the sensitivity is reduced, which is a good sign, as we know that preponing and postponing will not have effects as severe as before. Figure 5.37(e) presents the result of the average-case scenario with the QMIX algorithm. QMIX shows a similar trend to most of the other models; that is, QMIX now has increased sensitivity to both preponing and postponing the schedule. But, as we would have expected, since QMIX shows behaviour quite similar to binary-search-based scheduling due to their shared monotonicity assumption, we see a reversal in the severity: QMIX now has lower severity on postponing compared to preponing in the average-case scenario.
On analyzing the combined average-case result, we see that the MIP model still performs the best in terms of achieving the minimum bill amount, but it is very sensitive to any disturbance in the schedule: human errors such as delaying the chores or doing them early have a detrimental effect on the bill amount. The QMIX model is the best-performing model amongst the others in terms of achieving a lower bill amount that is also stable against human errors like preponing or postponing the jobs. We see that the minima of the VDN and the Binary Search algorithm are nearly the same when there is no disturbance in the schedule, but in case of a disturbance, the VDN model is less affected than the Binary Search algorithm. The FCFS algorithm, as in general, has the worst performance.
Figure 5.39 shows the mean and variance of the models' performance in the average-case scenario. We see that the mean value of the MIP is lower than that of any other model, as can be seen from the combined plot in 5.38, but the variance of the MIP model is too high, around 60%, compared to the variance of the QMIX model, which is around 10%. FCFS has the least variance, which is understandable because there is not much room for preponing, and postponing from the worst position in the day does not have much effect either. The VDN and the Binary-Search-based algorithm have nearly the same variance and mean in the average-case scenario.
Figure 5.39: Full training - Combined Average Case Mean and Variance for
different models
Compared with the previous QMIX model, we see a dip in the amount of penalty that the model was previously incurring. This means that the introduction of the predictor network into the modeling makes the generated schedule more stable. We see that postponing the schedule generated with the predictor network by less than 5 minutes has a much smaller effect than shifting it by less than 10 minutes. However, in the case of preponing the schedule, the effect is reversed: preponing by less than 5 minutes leads to an increase in the bill amount, but preponing by a further 5 minutes leads to a drop in the bill amount generated under the disturbance.
Figure 5.41 presents the mean and variance results of the different models with full training on the multi-agent scenario after introducing the predictor network. We see that after the introduction of the predictor network, the variance due to any disturbance in the schedule decreases further compared to the original QMIX model. We see an improvement in the variance from 3.63% to 1.12%, which is a near-zero variance, similar to the FCFS model. When used with the predictor network, the QMIX model also performs much better in terms of the mean bill amount it achieves across the different pricing models, on average, compared to the MIP model. Even though MIP achieves a lower bill amount, its sensitivity to displacements due to uncertainty leads to an increase in the mean bill amount it achieves. On that front, the QMIX model performs considerably better and improves further with the predictor network, with the mean reducing from Rs. 39.65K to Rs. 37K.
Now we use incremental training for training the models and study the results obtained. Incremental training is based on the idea of reusing the knowledge of the model acquired for learning a smaller set of items and extending it to a larger set of items, as discussed in Chapter 4. Figure 5.42 shows the combined analysis of the sensitivity of the models trained with incremental training. From the results, we see that the incrementally trained models have slightly higher sensitivity to the displacement effects due to human uncertainty than the fully trained versions of the same models. The most significant observation is that the incremental training methodology also lowers the minimum bill amount achieved further, and we have achieved bill amounts very close to those achieved by the MIP solver, while the schedules generated by the reinforcement learning algorithms are much more stable than those of the deterministic MIP counterpart. Deterministic models do not need any training; hence there is no incremental training for them. Interestingly, the incrementally trained VDN model with the predictor network, when shifted earlier by less than 10 minutes (that is, randomly preponing the jobs by less than 10 minutes), has a lower bill amount than the QMIX model trained under the same environment.
Figure 5.43 presents the mean and variance obtained with the different models under incremental training, compared with the full-training methodology. Clearly, the average bill amount achieved across the different scenarios is lower for the incrementally trained models than for full training, but we see an increase in the variance for the incrementally trained models. This is primarily because full training takes into account the effects across all models and all time steps, whereas the incremental model is focused on optimizing one subset of items at a time and hence suffers from comparatively higher variance. Thus we can conclude that with full training we have a higher average bill amount but lower variance, while with incremental training we have a lower bill amount but higher variance.
Figure 5.43: Incremental training - Combined Average Case Mean and Variance
for different models
Figures 5.44 and 5.45 present the results for the scalability of the models in terms of time taken. For the deterministic models, as shown in figure 5.44, we report the time for producing the schedule with each algorithm, while for the reinforcement learning algorithms we report the time for training the model, which becomes significant as the number of machines for which the scheduling needs to be learned grows.
Figure 5.45: Time taken for incrementally training the reinforcement learning models
Figure 5.44(a) represents the inference time of the FCFS algorithm, which stays constant if we increase the number of machines while keeping the number of jobs constant, since all the jobs are distributed equally among the machines. Still, if the jobs are increased while keeping the machines constant, we see a slight increase in the inference time due to scheduling more jobs. Figure 5.44(b) shows a similar result for the binary-search scheduling algorithm. As the jobs increase, the algorithm needs to iterate over and move more jobs to find a suitable location, and the complexity of the knapsack also grows with the number of items; hence we see a steady increase in the inference time. As the number of machines increases, only the complexity of the knapsack increases, and not significantly, because the jobs are distributed equally across the machines, so the knapsack solver is simply run more times on a smaller number of jobs per machine. Figure 5.44(c) shows that MIPS is affected both by increasing machines and by increasing jobs. There are no significant constraints between the machines in the MIP modeling. As the number of machines increases, there are more possible values that the jobs can take (due to equal distribution), and looping over the machines takes a bit of time, hence a slight increase in time with more machines. As the jobs increase, the number of constraints added also increases proportionately; hence the time taken by the solver increases significantly. VDN uses the incremental training methodology; thus, the major effect on the training time comes from increasing the number of machines, as seen in figure 5.45(a), because the incremental training grows with the number of machines, building on the knowledge gained from a smaller subset of machines. Since processes are equally distributed among the available machines, increasing the number of processes primarily affects the per-process processing time. QMIX follows a similar training technique, due to which we see trends similar to VDN, as seen in 5.45(b). The time taken for training QMIX is considerably higher than for VDN, primarily because of the larger amount of computation involved in QMIX.
• RL agents can efficiently model the best scheduling while considering run-time human uncertainties, like delayed scheduling within bounds. This can be easily verified during real-time use, since these uncertainties were taken into account while modeling the environment during training.
• From the results of the single agent, we observe that RL agents can model characteristically stochastic quantities, like the satisfaction of the user, much better than their deterministic counterparts, and this quality also extends to the multi-agent scenarios.
• With the aid of a predictor network, we can achieve results much closer to the heuristic lower bounds while being much more stable than the heuristic schedules in terms of their sensitivity to human uncertainties.
Figure 5.46 presents the results of the genetic algorithm, plotted at specific points of the iteration, that is, after every $2^k \times 100$-th generation. In each generation, we divide all the offspring bill amounts into 3 buckets and use the average of each bucket to determine the circle centre. These buckets have an equal number of offspring; since each generation has 1,000 offspring, each bucket has approximately 333 offspring. Averaging gives us an idea of the trend of that particular bucket, and bucketing is similar to binning, which helps us understand the direction of concentration of the offspring. Buckets are built after sorting all the offspring by their fitness value, where the fitness incorporates the bill amount. Bucket 1 hence always has the lowest bill amounts, followed by Bucket 2 and finally Bucket 3. From figure 5.46, we see that the results of all the buckets improve over the generations.
The concentrations of the buckets, however, shift over the generations. In the early generations, Bucket 1 and Bucket 2 are quite close together, but Bucket 3 is quite far away. However, in the later generations, Bucket 2 shifts closer to Bucket 3 while Bucket 1 moves further apart, which means that the concentration of the offspring shifts towards the higher-bill-amount side. This means that most of the offspring are not improving after a certain number of generations, which may be due to the fact that a direct crossover of two healthy parents can introduce unhealthy offspring. Comparing the performance of the genetic algorithm with the previously analyzed solvers like QMIX and MIP, we see that the genetic algorithm does not perform at the same level of efficiency. MIP was the most efficient, with a bill amount of Rs. 29.76K, and the incrementally trained QMIX with predictor network achieved a bill amount of Rs. 31.73K, which was more stable; but the bill amount achieved by the genetic algorithm is in the Rs. 38K range on average, and the best schedule has a bill amount of Rs. 37K even after 1600 generations. We maintain multiple offspring in each generation primarily because the offspring with the lowest bill amount is not always the most stable schedule; hence variety helps us make informed decisions about the best schedule for different use cases.
Figure 5.47: Examples of different schedules after 3200 generations, picked randomly from group 1: (a) Schedule 1, (b) Schedule 2, (c) Schedule 3, (d) Schedule 4
Figure 5.47 shows the 4 randomly sampled examples of schedules generated after 3200 generations. From schedule 5.48, which represents the schedule with the best bill amount, we can see that the algorithm learns over the generations that stacking the jobs in the lower-price regions is more efficient, while at the same time trying to maintain a lower peak power by distributing the jobs over the day. Compared with the previously analyzed schedules of the Binary Search algorithm and QMIX, which had issues with stacking jobs in the early part of the day (less exploration of the later half), we see that the genetic algorithm explores the entire day much better, as it is able to spread jobs over the day. This is primarily due to the random initialization of the jobs.
Figure 5.49, which presents the result of the sensitivity analysis of the genetic algorithm, shows that the average sensitivity of the Group-1 offspring is much higher than the average sensitivity of the Group-2 and Group-3 elements. The average sensitivities of the Group-2 and Group-3 elements are much closer to each other than to Group-1. By group, we here refer to the different buckets described above. The Group-3 elements, though, have poor performance and the least sensitivity among all the elements, as we see that displacement of the generated schedule has very little effect. This may be because the generated schedule is not very good, and shifting the elements within their current region has comparatively less effect than moving jobs out of strongly packed low-price regions. We can extend our study to investigate the features that make the Group-3 schedules more stable.
From figure 5.50, we see that the time taken for generating each generation of the genetic algorithm is nearly constant. This is inferred from the fact that the total time taken to produce the i-th generation is nearly linear in the number of generations produced, that is, i. For producing 1600 generations we see that we need approximately 811 minutes, which is nearly 13 hours 30 minutes, a significant amount of time, comparable to the time taken to train the reinforcement learning agents. For computing the memory consumed, we need to consider the following items:
• Each unit of the 200-part chromosome is a 500-byte structure containing the data for scheduling the job.
Thus, the total memory consumed in our case is (1000 × 200 × 500) × 2 = 200 MB per generation of the genetic algorithm, which is also consistent with the memory consumption observed during real-time execution of the algorithm. Here we multiply by 2 because the current and the next generation are kept in memory simultaneously while selecting the healthy offspring.
From figures 5.51 and 5.52, we see the performance of the genetic algorithm when the population size is reduced from 1000 to 500, that is, when the diversity is reduced. Figure 5.51 shows that on reducing the population size, the offspring initially suffer considerably from the decreased diversity: we cannot attain the best-performing schedule at a comparable generation to the run with 1000 offspring. However, due to the reduced number of offspring per generation, the computation and memory consumption are lower, so we have the bandwidth to produce more generations. We see that after 3200 generations, the 500-offspring population achieves slightly better performance than the 1000-offspring run. Figure 5.52 shows that the time taken for 1000 offspring over 1600 generations is nearly the same as that for 500 offspring over 3200 generations. Note that the result in Figure 5.51 is shown on a logarithmic scale.
For presenting the results of the genotypic diversity, we had the following two options, as described in the methodology section:
• Hamming distance
• Shannon entropy
We use the Hamming distance to show our results for the genotypic diversity of the population in each generation of the genetic algorithm. Figure 5.53 shows the results of the diversity computation for a population size of 1000.
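A minimal sketch of the Hamming-distance diversity computation used here (the pairwise averaging is an assumption about the exact aggregation):

```python
import numpy as np

def mean_hamming_diversity(population):
    """Genotypic diversity as the average pairwise Hamming distance between
    chromosomes; `population` is an (n_individuals, chromosome_length)
    integer array of slot assignments."""
    pop = np.asarray(population)
    n = len(pop)
    total = sum(np.count_nonzero(pop[i] != pop[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)

print(mean_hamming_diversity([[1, 2, 3], [1, 2, 4], [2, 2, 3]]))  # ≈ 1.33
```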
• The decrease in the genotypic diversity is quite steep during the first 500 generations; then the rate of decrease reduces significantly.
• This may be due to getting stuck in a local minimum or to reaching the valley of the global minimum.
• The decrease in genotypic diversity raises the concern that our capacity for exploring genotypic variation is reducing over time with the generations.
• For computing the phenotypic diversity, too, we had multiple options: use either the Euclidean distance or the variance/standard deviation.
• We observe that the phenotypic diversity initially declines steeply, but after around 450 generations it increases, and after around 800 generations it keeps decreasing again.
• From 450 to 800 generations, getting into a new valley may lead to an increase in the phenotypic diversity of the available population.
• The decrease in phenotypic diversity raises the concern that our capacity to explore phenotypic variation is reducing over the generations.
• We see that the species count initially decreases, but it increases again after about 450 generations.
• This means that the evolutionary similarity among the chromosomes starts reducing again after a certain number of generations, so we recover a good variety of chromosomes.
• We see that the fitness range starts out fairly wide, indicating good diversity, and then falls steeply over the first 450 generations. After 450 generations, we again see a much steeper rise in the fitness range, and after 800 generations the slope of the increase reduces, showing that the variety of fitness values increases again after the initial reduction.
We present four different types of analysis for the genetic diversity of the algorithm. Each of the above metrics highlights a different feature of the population and hence has a different interpretation. For phenotypic and genotypic diversity, the reduction in diversity is a serious concern, as we see a steep decline followed by a steady one. However, with fitness diversity and species count, we see an increase in diversity after some number of generations. Ideally, all the diversity metrics would remain at least constant or increase over time, allowing better exploration.
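To make the remaining measures concrete, the following is a minimal sketch of the standard-deviation variant of phenotypic diversity together with the fitness range; the assumption that the phenotype of a chromosome can be summarized by a single scalar (for example, the cost of the schedule it encodes) is made only for this sketch.

    import statistics

    def phenotypic_diversity(phenotypes):
        # Population standard deviation of the scalar phenotype values in one generation.
        return statistics.pstdev(phenotypes)

    def fitness_range(fitnesses):
        # Spread between the best and the worst fitness in one generation.
        return max(fitnesses) - min(fitnesses)

    # Toy example: scalar phenotype (e.g. schedule cost) of five chromosomes.
    costs = [10.2, 11.5, 9.8, 10.9, 12.1]
    print(phenotypic_diversity(costs), fitness_range(costs))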
Chapter 6
Future Work
• Storing energy in a battery is important, since the power from renewable sources is not steady while users need a steady power supply.
• Once the energy is stored in the battery, we can use the reinforcement-learning-based scheduler to optimize the objectives by balancing between the grid power and the battery power.
• Reinforcement learning will help us train models that utilize environment-friendly alternatives for energy generation very efficiently.
• That is, the RL-based scheduler will also decide how power generation should be distributed among the different generators so that the cost of generation is minimized while still meeting demand-side needs.
• Consecutive brownouts may also significantly hamper the devices, even with a recovery setup in place; hence, scheduling brownouts so that devices are not hampered during recovery is also an important area to work on.
• Different levels of brownouts may include allowing only one fan, one light, or only essential devices to run. Determining what level of brownout does not significantly hamper the users and their devices is an open question. Also, since intentional brownouts are carried out primarily by electricity companies in suburban areas for cost cutting, the agent must simultaneously maximize profit while doing so.
• Instead of adding a new layer to train for each increment, we can compress the number of layers required by using the layers in the fashion of a binary counter. For example, if we are training for machine 11 (binary 1011), then we train layers 1, 3, and 4 (corresponding to the set bits) and freeze the rest of the layers (a small sketch of this idea is given after this list).
• We can also model the incremental training system as an LSTM network instead of using a 2^N action space. The motivation behind this is that the devices are identical and we need to schedule multiple units of the same device, so we can build a neural network with feedback layers that maintain long-term dependencies using an LSTM, and schedule each device sequentially.
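A minimal sketch of the binary-counter layer-selection idea is given below; the layer indexing (most significant bit first, matching the machine-11 example above) and the PyTorch-style requires_grad freezing are assumptions made only for this sketch.

    def layers_to_train(machine_index, num_layers):
        # 1-based indices (most significant bit first) of the layers whose bits are set.
        bits = format(machine_index, "0{}b".format(num_layers))   # e.g. 11 -> "1011"
        return [i + 1 for i, bit in enumerate(bits) if bit == "1"]

    def freeze_for_increment(model_layers, machine_index):
        # Train only the selected layers and freeze the rest (PyTorch-style parameters).
        active = set(layers_to_train(machine_index, len(model_layers)))
        for idx, layer in enumerate(model_layers, start=1):
            for param in layer.parameters():
                param.requires_grad = (idx in active)

    # Example from the text: machine 11 (binary 1011) with a 4-layer stack.
    print(layers_to_train(11, 4))   # [1, 3, 4]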
Bibliography
Klemenjak, Christoph et al. (Apr. 2020). “A synthetic energy dataset for non-
intrusive load monitoring in households”. In: Scientific Data 7. doi: 10.1038/
s41597-020-0434-6.
Ku, Wen-Yang and J. Christopher Beck (2016). “Mixed Integer Programming models for job shop scheduling: A computational analysis”. In: Computers & Operations Research 73, pp. 165–173. issn: 0305-0548. doi: 10.1016/j.cor.2016.04.006. url: https://www.sciencedirect.com/science/article/pii/S0305054816300764.
Lai, Guokun et al. (2017). “Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks”. In: CoRR abs/1703.07015. arXiv: 1703.07015. url: http://arxiv.org/abs/1703.07015.
Lee, Eunji and Hyokyung Bahn (2014). “A genetic algorithm based power consump-
tion scheduling in smart grid buildings”. In: The International Conference on
Information Networking 2014 (ICOIN2014), pp. 469–474. doi: 10.1109/ICOIN.
2014.6799726.
Maharjan, Sabita et al. (Mar. 2013). “Dependable Demand Response Management
in the Smart Grid: A Stackelberg Game Approach”. In: Smart Grid, IEEE Trans-
actions on 4, pp. 120–132. doi: 10.1109/TSG.2012.2223766.
– (2016). “Demand Response Management in the Smart Grid in a Large Population
Regime”. In: IEEE Transactions on Smart Grid 7.1, pp. 189–199. doi: 10.1109/
TSG.2015.2431324.
Mao, Hongzi et al. (2016). “Resource Management with Deep Reinforcement Learn-
ing”. In: Proceedings of the 15th ACM Workshop on Hot Topics in Networks.
HotNets ’16. Atlanta, GA, USA: Association for Computing Machinery, pp. 50–
56. isbn: 9781450346610. doi: 10.1145/3005745.3005750. url: https://doi.
org/10.1145/3005745.3005750.
Mary Mammen, Priyanka and Hareesh Kumar (Oct. 2019). “Explainable AI: Deep
Reinforcement Learning Agents for Residential Demand Side Cost Savings in
Smart Grids”. In: url: https://arxiv.org/pdf/1910.08719v2.pdf.
Mnih, Volodymyr et al. (Feb. 2015). “Human-level control through deep reinforce-
ment learning”. In: Nature 518, pp. 529–33. doi: 10.1038/nature14236.
Palensky, Peter and Dietmar Dietrich (Sept. 2011). “Demand Side Management:
Demand Response, Intelligent Energy Systems, and Smart Loads”. In: Industrial
Informatics, IEEE Transactions on 7, pp. 381–388. doi: 10.1109/TII.2011.
2158841.
Qayyum, Fatima et al. (Jan. 2015). “Appliance Scheduling Optimization in Smart Home Networks”. In: IEEE Access 3, pp. 1–1. doi: 10.1109/ACCESS.2015.2496117.
Rashid, Tabish et al. (2018). “QMIX: Monotonic Value Function Factorisation for
Deep Multi-Agent Reinforcement Learning”. In: CoRR abs/1803.11485. arXiv:
1803.11485. url: http://arxiv.org/abs/1803.11485.
Sunehag, Peter et al. (2017). “Value-Decomposition Networks For Cooperative Multi-Agent Learning”. In: CoRR abs/1706.05296. arXiv: 1706.05296. url: http://arxiv.org/abs/1706.05296.
Zhao, Zhuang et al. (2013). “An Optimal Power Scheduling Method for Demand
Response in Home Energy Management System”. In: IEEE Transactions on Smart
Grid 4.3, pp. 1391–1400. doi: 10.1109/TSG.2013.2251018.