The rapid development of deep reinforcement learning (DRL), the combination of deep learning and reinforcement learning, has attracted researchers from many fields who want to apply it to problems in their own domains. Deep learning handles continuous or complicated state spaces, and reinforcement learning learns by trial and error in a complicated environment, so DRL is particularly good at problems in complex environments that lack good exact or heuristic methods. Since solving most reinforcement learning problems requires an extremely large amount of data, most DRL (or RL) agents are trained in a simulated environment. With its diverse library of machine learning tools, Python has become the go-to choice for DRL training. However, building large-scale simulations of complicated environments in a general-purpose programming language like Python is hard. AnyLogic is well suited to building simulation models for training DRL agents in complex environments. The newly developed Alpyne library is a Python library that lets users train DRL agents in Python by interacting with the AnyLogic model at run time. Unfortunately, it is not yet stable enough to handle complicated simulation models. In this blog post, we introduce a new way to apply DRL to simulation models in AnyLogic using the Pypeline library. This method can also be used for plain (non-deep) RL training, although most environments simple enough to be solved that way can be simulated directly in a programming language such as Python.
The standard way of training a DRL agent is to interact with the simulation model from Python. In our method, by contrast, the DRL agent is called from the simulation model to observe and act on the model at each action step, and it saves all its critical components, for example the replay buffer and neural networks, to local disk at the end of each episode. This provides a stable way to implement DRL in AnyLogic models.
In the remaining sections, we will first provide a general walkthrough of the main components of this method. Specifically, we use the implementation of Deep Q-Learning for demonstration purposes, but this method can be applied to various RL algorithms. Then, we will show a simple small-scale example (simplified OpenAI Gym Taxi-v3) to demonstrate the implementation of this method.
General Walkthrough on Main Components
Components on AnyLogic (Environment) Side
To communicate with Python, we first need to add the Pypeline library to the AnyLogic model. Since the focus of this blog post is not on the Pypeline library, please refer to https://www.anylogic.com/resources/educational-videos/webinar-pypeline-a-python-connector-library-for-anylogic/ for specific instructions on the installation and use of the Pypeline library.
After installing the Pypeline library, we need to import the Python module for our DRL training and create an instance of the DRL training class in the On Startup section of the main agent. At each action step during the simulation run, this instance is called to receive the state information and the reward from the simulation environment and to output an action.
To train an RL agent, the simulation environment must have four abilities:
(1) the ability to output the state information from the environment,
(2) the ability to output the reward from the environment,
(3) the ability to receive and implement an action from the RL agent, and
(4) the ability to tell the RL agent whether the episode is finished.
Thus, the simulation needs functions that provide these four abilities. In our implementation, there is one function each for (1) and (2), and a single function for both (3) and (4). The function for (1) simply returns the current state information as a list of doubles or integers. The function for (2) simply returns the current reward as a double or an integer. The function for (3) and (4) takes the action from the RL agent as input, applies it to the environment, and returns whether the episode is done after taking the action.
Finally, one more function should use the above four abilities to communicate with the RL agent at each action step.
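To make the exchange concrete, here is a minimal Python sketch of one action step. In the real setup the environment side lives in AnyLogic, so the ToyEnvironment class, the RandomAgent class, and the rl_action_step function below are hypothetical stand-ins we made up for illustration, and the exact order of the calls is an assumption.

```python
import random


class ToyEnvironment:
    """Hypothetical Python stand-in for the simulation side (not the AnyLogic model)."""

    def __init__(self, goal=3, max_steps=20):
        self.position = 0
        self.goal = goal
        self.max_steps = max_steps
        self.steps = 0
        self.last_reward = 0

    def state(self):
        # Ability (1): report the current state as a list of numbers.
        return [self.position]

    def reward(self):
        # Ability (2): report the reward produced by the most recent action.
        return self.last_reward

    def apply_action(self, action):
        # Abilities (3) and (4): apply the action, then report whether the episode is over.
        self.steps += 1
        self.position = max(0, self.position + (1 if action == 1 else -1))
        reached_goal = self.position == self.goal
        self.last_reward = 10 if reached_goal else -1
        return reached_goal or self.steps >= self.max_steps


class RandomAgent:
    """Placeholder agent: records what it is told and picks a random action."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.observations = []

    def act(self, state, reward, done):
        self.observations.append((state, reward, done))
        return self.rng.choice([0, 1])


def rl_action_step(env, agent, done):
    # One action step: hand the agent state, reward and DONE, then apply its action.
    action = agent.act(env.state(), env.reward(), done)
    return env.apply_action(action)


env, agent = ToyEnvironment(), RandomAgent()
done = False
while not done:
    done = rl_action_step(env, agent, done)
# One last call so the agent also sees the terminal state, reward and DONE flag.
agent.act(env.state(), env.reward(), done)
```

The final call after the loop matters: without it, the agent would never see the reward and DONE flag of the last action.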
Components on Python (RL Agent) Side
As discussed above, a new instance of the RL agent is initialized at the beginning of each episode. Because of this, it is critical to record the important information of the RL training locally, so that it is not lost at the end of each training episode. Here, we use JSON and the saving functions from libraries like PyTorch to save the information at the end of each training episode and to load it again at initialization. Using Deep Q-Learning as an example, the important information includes but is not limited to the replay buffer, policy network, target network, number of steps taken, reward buffer, loss history, and optimizer state (if a momentum-based optimizer, like Adam, is used). To learn more about the Deep Q-Learning algorithm, please refer to .
Logging this information lets us train the RL agent continuously across episodes. One more thing needs to be addressed, however: the simulation model only outputs the current state, the reward, and whether the episode is finished (we call this DONE from now on), but the RL agent also needs the previous state and action to form a transition to push into the replay buffer. We tackle this by initializing the previous state and previous action to null. Upon receiving the state, reward, and DONE information from the simulation, the agent first checks whether the previous state and previous action are not null; if so, it appends a new transition consisting of the previous state, previous action, current state, reward, and DONE to the replay buffer. Only then does the current state become the new previous state, and the action chosen for it become the new previous action.
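The save-and-reload cycle can be sketched with the standard library alone. This is a minimal sketch under our own assumptions, not the code from the demo: the function names, the file name, and the choice of fields are hypothetical. Network weights and optimizer state would typically be stored separately with PyTorch's own saving functions.

```python
import json
import os
import tempfile


def save_training_state(path, replay_buffer, steps_done, reward_history):
    # Write to a temporary file first and then rename it, so that an
    # interrupted run does not leave a half-written checkpoint behind.
    payload = {
        "replay_buffer": replay_buffer,
        "steps_done": steps_done,
        "reward_history": reward_history,
    }
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)


def load_training_state(path):
    # First episode: there is no checkpoint yet, so start from empty defaults.
    if not os.path.exists(path):
        return {"replay_buffer": [], "steps_done": 0, "reward_history": []}
    with open(path) as f:
        return json.load(f)


# Two simulated "episodes": each one loads, adds one transition, and saves.
checkpoint = os.path.join(tempfile.mkdtemp(), "training_state.json")
for episode in range(2):
    state = load_training_state(checkpoint)        # at agent initialization
    state["replay_buffer"].append([[0, 0, 0], 1, [0, 1, 0], -1, False])
    state["steps_done"] += 1
    state["reward_history"].append(-1)
    save_training_state(checkpoint, **state)       # at the end of the episode
```

Note that JSON has no tuple type, so transitions stored as tuples come back as lists after loading.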
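The bookkeeping for the previous state and action can be sketched as follows. The TransitionTracker class and its observe method are hypothetical names for illustration. In the real agent the action would be chosen by the policy network rather than passed in.

```python
from collections import deque


class TransitionTracker:
    """Builds (prev_state, prev_action, state, reward, done) transitions."""

    def __init__(self, capacity=10000):
        self.replay_buffer = deque(maxlen=capacity)
        self.prev_state = None    # Python's None plays the role of null
        self.prev_action = None

    def observe(self, state, reward, done, action):
        # Store the transition first, while prev_state and prev_action
        # still describe the previous action step.
        if self.prev_state is not None and self.prev_action is not None:
            self.replay_buffer.append(
                (self.prev_state, self.prev_action, state, reward, done)
            )
        # Only then roll the current step over into the "previous" slots.
        self.prev_state = state
        self.prev_action = action


tracker = TransitionTracker()
tracker.observe([0, 0, 0], 0, False, action=3)    # first call: nothing to store yet
tracker.observe([1, 0, 0], -1, False, action=4)   # stores the first transition
tracker.observe([1, 0, 0], -10, True, action=0)   # stores the second transition
```

The first call of an episode never produces a transition, because there is no previous state yet. The reward received on that call is ignored.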
Simple Demonstration – Simplified Taxi-v3
Without further ado, let’s dive immediately into the implementation. The AnyLogic model with Python files created for this demo can be accessed at: https://github.com/m1ng2e/RL-in-Anylogic-Demo.git
We demonstrate our method using a simplified OpenAI Gym Taxi-v3 environment replicated in AnyLogic. Still, this method is stable enough to be applied to large-scale and much more complicated environments. It may even fit more complicated environments better, because the extra cost of communication between AnyLogic and Python becomes negligible compared with the cost of the simulation itself.
The environment is a 4×4 grid world with an RL-controlled taxi and a passenger. A visualization of the grid world is shown in Figure 1, where the green lines represent walls that the taxi cannot cross. The initial location of the passenger is G, and the destination of the passenger is Y. The taxi is initialized at a random location other than the passenger location. The goal of the taxi is to first pick up the passenger and then drop the passenger off at the destination. Once the passenger is dropped off or more than 200 action steps are taken, the episode ends. The action space in this environment is 0: move up, 1: move down, 2: move left, 3: move right, 4: pick up, and 5: drop off. The state consists of the position of the taxi on the x-axis, the position of the taxi on the y-axis, and whether the passenger has been picked up (0 or 1). When the taxi makes a failed pick up or drop off, it receives a reward of -10. When the taxi successfully drops off the passenger, it receives a reward of +20. In every other case, the taxi receives a reward of -1.
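The reward rules above can be sketched as a small Python function. The coordinates of G and Y below are hypothetical placeholders, since the real ones come from the grid in the AnyLogic model. We also assume that a pick up succeeds only at G without a passenger on board, and that a drop off succeeds only at Y with a passenger on board. Movement, walls, and the 200-step limit are left out.

```python
PICK_UP, DROP_OFF = 4, 5
PASSENGER_START = (0, 0)   # hypothetical coordinates of G
DESTINATION = (3, 3)       # hypothetical coordinates of Y


def taxi_reward(x, y, has_passenger, action):
    """Return (reward, new_has_passenger, done) for one action."""
    if action == PICK_UP:
        if not has_passenger and (x, y) == PASSENGER_START:
            return -1, 1, False            # successful pick up costs one ordinary step
        return -10, has_passenger, False   # failed pick up
    if action == DROP_OFF:
        if has_passenger and (x, y) == DESTINATION:
            return 20, 0, True             # successful drop off ends the episode
        return -10, has_passenger, False   # failed drop off
    return -1, has_passenger, False        # any movement action
```

For example, taxi_reward(3, 3, 1, DROP_OFF) gives a reward of +20 and ends the episode, while the same action without a passenger on board costs -10.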
Figure 1: Visualization of the grid world
Implementation in AnyLogic
In this model, four functions enable the training of the RL agent. The f_State function returns an integer list of length three that represents the current state. The f_Reward function returns the reward resulting from the action. The f_TaxiAction function implements the action from the RL agent and returns whether the episode is finished after taking that action. If the model parameter deploy is set to true, the f_TaxiAction function also updates the visualization according to the action. The f_RLAction function calls the RL agent to select an action for the current state and provides the RL agent with the information required for training, using the other three functions. During the simulation run, a cyclic event calls the f_RLAction function every 0.1 seconds.
Implementation in Python
PyTorch, a deep learning library, is used to implement Deep Q-Learning in Python. Other than some extra lines of code to save and load important information for training, this implementation is no different from a standard implementation of Deep Q-Learning. Since the focus of this blog post is not on RL algorithms, and to avoid bothering you with technical details, this section only discusses the parts of the code that relate to applying RL in AnyLogic. Two Python files were created for the RL training, Train.py and DQNModel.py. Since DQNModel.py only contains the construction of the neural network, it is not discussed in this blog post.
Because we are creating a connection between AnyLogic and Python, it is better to modularize the Python code so that the connection stays easy and clean. Here we created a class for the Deep Q-Learning training agent, called DQN_Main.
To initialize an instance of the DQN_Main class (this happens at the beginning of each episode), we first load the necessary information from the local disk using JSON and the load function from PyTorch, then set the previous state and previous action to null (None in Python) and the episode reward to 0. The files needed for this instance are marked in red in Figure 2.
Figure 2: Files in the Model Folder (red: information necessary for training, yellow: plots for monitoring training)
Then, at each action step, AnyLogic calls the act function, which pushes the experience into the replay buffer, calls the train function to train the neural network, and, if the episode is done, saves some important information to the local disk.
After being called by the act function, the train function trains the neural network for one epoch and saves the important information that was changed from training to the local disk if the episode is done.
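The control flow of these two functions can be sketched without any deep learning code. This is a skeleton under our own assumptions, not the code in Train.py: AgentSkeleton is a hypothetical class, the policy and the optimization pass are replaced by placeholders, and saving is simulated by recording what would be written to disk.

```python
class AgentSkeleton:
    def __init__(self, batch_size=2):
        self.replay_buffer = []
        self.batch_size = batch_size
        self.prev_state = None
        self.prev_action = None
        self.train_calls = 0
        self.saves = []  # stands in for files written to the local disk

    def select_action(self, state):
        return 0  # placeholder for a policy such as epsilon-greedy

    def train(self, done):
        # Train only once the buffer holds enough transitions for a batch.
        if len(self.replay_buffer) >= self.batch_size:
            self.train_calls += 1  # placeholder for one optimization pass
        if done:
            self.saves.append("networks")  # what training changed

    def act(self, state, reward, done):
        # 1. Push the experience into the replay buffer.
        if self.prev_state is not None:
            self.replay_buffer.append(
                (self.prev_state, self.prev_action, state, reward, done)
            )
        # 2. Train the network.
        self.train(done)
        # 3. At the end of the episode, save the remaining information.
        if done:
            self.saves.append("replay_buffer")
        # 4. Choose the next action (unused by the caller once the episode is done).
        action = self.select_action(state)
        self.prev_state, self.prev_action = state, action
        return action


skeleton = AgentSkeleton()
skeleton.act([0, 0, 0], 0, False)
skeleton.act([0, 1, 0], -1, False)
skeleton.act([0, 2, 0], -1, False)
last_action = skeleton.act([0, 2, 0], -10, True)
```

In this run the first two calls do not train, because the buffer holds fewer than batch_size transitions. The last call trains once more and then records both saves.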
If desired, you can also add functions that save reward and loss plots to the local disk, so that you can watch your RL agent getting better. The generated plots for this instance are marked in yellow in Figure 2.
The full code of Train.py is attached below: