Conditional Predictive Behavior Planning With Inverse Reinforcement Learning for Human-Like Autonomous Driving

Question

1. What are the main challenges addressed in the predictive behavior planning framework?

2. What are the challenges of imitation learning (IL) and reinforcement learning (RL) in autonomous driving decision-making?

3. What is conditional motion prediction?

4. What is the purpose of inverse reinforcement learning (IRL) methods?

Accepted Answer

The predictive behavior planning framework addresses two main challenges. The first is improving prediction accuracy while keeping the prediction results consistent with the AV's own future actions, which makes them more suitable for the downstream planning task. This is handled with the conditional motion prediction (CMP) method. The second is designing a cost function that determines which behaviors are desirable and captures the nuances of human driving. Rather than tuning this function by hand, the framework uses maximum entropy inverse reinforcement learning (IRL) to learn it automatically from human driving data, so that the cost function reflects actual human preferences and avoids the unintended behaviors that manual tuning can produce.

Accepted Answer

Imitation learning (IL) and reinforcement learning (RL) both face challenges in autonomous driving decision-making. IL suffers from distribution shift between training and deployment, which is difficult to mitigate. RL is limited by sample efficiency, the need for accurate environment modeling, and the difficulty of designing a proper reward function. These flaws compromise safety, interpretability, and generalizability. Consequently, there has been a shift back towards classic planning methods, which provide stronger safety guarantees, rule compliance, and interpretability. However, the performance of classic planners relies on accurate prediction of surrounding agents and proper evaluation of planned behaviors.

Accepted Answer

Conditional motion prediction (CMP) predicts the future trajectories of other agents conditioned on a queried future trajectory of the ego agent. It addresses a weakness of conventional motion prediction models, which ignore the influence of the AV's future actions on other agents. CMP has shown a 10% improvement in accuracy over non-conditional prediction. M2I extended CMP to multi-agent interactive prediction: an influencer is selected, and the reactors' future trajectories are predicted by a CMP model according to the influencer's marginal prediction result. Scene Transformer proposed a unified Transformer-based architecture with a masking strategy, which enables predicting other agents' behaviors conditioned on the future trajectory of the AV. However, CMP models are not yet integrated into planning, and their effect on planning performance needs further investigation.

Accepted Answer

Inverse reinforcement learning (IRL) methods aim to learn underlying cost functions from expert demonstrations, avoiding manual specification. They are used to infer the preferences and intentions of an expert by observing their behavior, which can then be used to guide the behavior of autonomous systems. IRL is particularly useful in scenarios where it is difficult or impractical to explicitly define the cost functions or reward functions that an autonomous system should optimize. By learning from expert demonstrations, IRL methods can help autonomous systems adapt to complex and dynamic environments, such as autonomous driving, where human behavior and decision-making play a crucial role. The Maximum Entropy IRL approach, for example, addresses ambiguities or uncertainties inherent in human demonstrations and has become popular in autonomous driving applications. It allows for the learning of individual driving styles from highway driving demonstrations and the reproduction of distinct driving policies. Additionally, IRL methods can be integrated with planning algorithms to automatically tune the cost function, surpassing the level of manual expert-tuned cost functions. However, challenges remain in accurately predicting the actions of other agents in multi-agent scenarios, which is an area of ongoing research and development.

Accepted Answer

The three core components of the proposed behavior planning framework are behavior generation, conditional motion prediction, and IRL scoring. Behavior generation synthesizes diverse trajectory proposals based on the current state and reference route of the AV. Conditional motion prediction predicts trajectories of surrounding agents under each planned trajectory, providing multiple possible futures. IRL scoring combines features of trajectory proposals with learnable weights to calculate costs and obtain probabilities of trajectory proposals. The behavior with the highest probability or sampled from the distribution is used as a reference trajectory for downstream trajectory planning, refining the coarse trajectory for the AV to follow. The framework omits the trajectory planning module, focusing on formulating the behavior planning problem and elaborating on the key components.

Accepted Answer

The conventional motion prediction task involves modeling the posterior distribution p(Y | X, M), where Y represents the joint states of surrounding agents over a future time horizon T_f. This task uses the joint historical states of surrounding agents X and map information M to predict the future states of the agents.
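
For comparison, the conventional task and the conditional variant described in the answer on conditional motion prediction can be written side by side. This is only a notational sketch: the symbol P for the AV's planned future trajectory is an assumption introduced here, not notation taken from the paper.

```latex
% Conventional prediction: other agents' futures depend on history and map only.
% Conditional prediction: additionally condition on the AV's planned trajectory P
% (P is a hypothetical symbol used for illustration).
\[
  \underbrace{p(Y \mid X, M)}_{\text{conventional}}
  \qquad\text{vs.}\qquad
  \underbrace{p(Y \mid X, M, P)}_{\text{conditional}}
\]
```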

Accepted Answer

Candidate behaviors for the AV are generated by considering road structure, semantics, and traffic rules. Polynomial curves are used to generate trajectory proposals in the Frenet frame of the reference path. A fixed time horizon of 5 seconds is used, with target velocities ranging from braking to a stop to accelerating to the speed limit. A quartic polynomial parameterizes the longitudinal states, while a quintic polynomial represents the lateral states. Approximately 10-30 candidate behaviors are generated, and each trajectory is translated back to Cartesian coordinates. The method can handle complex road structures in urban areas and generates behaviors that are dynamically feasible, correct, interpretable, and compliant with traffic rules. More details can be found in Section IV-B of the paper.
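
The polynomial generation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the boundary conditions (zero final acceleration, zero final lateral velocity), the evenly spaced target speeds, and all function names are assumptions made for the example.

```python
def quartic_coeffs(s0, v0, a0, v_target, horizon):
    """Longitudinal s(t): fixed start state, target end speed, zero end acceleration."""
    dv = v_target - v0 - a0 * horizon  # speed change not explained by constant a0
    da = -a0                           # acceleration change needed to end at zero
    c3 = (3.0 * dv - da * horizon) / (3.0 * horizon ** 2)
    c4 = (da * horizon - 2.0 * dv) / (4.0 * horizon ** 3)
    return [s0, v0, 0.5 * a0, c3, c4]


def quintic_coeffs(d0, dv0, da0, d_target, horizon):
    """Lateral d(t): fixed start state, target end offset, zero end velocity and acceleration."""
    T = horizon
    h = d_target - d0 - dv0 * T - 0.5 * da0 * T ** 2
    V = (0.0 - dv0 - da0 * T) * T
    A = (0.0 - da0) * T ** 2
    c3 = (10.0 * h - 4.0 * V + 0.5 * A) / T ** 3
    c4 = (-15.0 * h + 7.0 * V - A) / T ** 4
    c5 = (6.0 * h - 3.0 * V + 0.5 * A) / T ** 5
    return [d0, dv0, 0.5 * da0, c3, c4, c5]


def poly_eval(coeffs, t, order=0):
    """Evaluate the order-th time derivative of sum(c_k * t**k) at time t."""
    total = 0.0
    for k, c in enumerate(coeffs):
        if k < order:
            continue
        factor = 1.0
        for j in range(order):
            factor *= (k - j)
        total += c * factor * t ** (k - order)
    return total


def generate_proposals(v0, speed_limit, lateral_targets, n_speeds=5, horizon=5.0):
    """Pair every target speed (from stop to speed limit) with every lateral target."""
    speeds = [speed_limit * i / (n_speeds - 1) for i in range(n_speeds)]
    proposals = []
    for v_target in speeds:
        for d_target in lateral_targets:
            lon = quartic_coeffs(0.0, v0, 0.0, v_target, horizon)
            lat = quintic_coeffs(0.0, 0.0, 0.0, d_target, horizon)
            proposals.append((lon, lat))
    return proposals
```

With five target speeds and two lateral targets (keep lane, change lane), this yields 10 proposals, which is at the low end of the 10-30 range mentioned above. Converting the Frenet states back to Cartesian coordinates requires the reference path geometry and is omitted here.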

Accepted Answer

The conditional motion prediction module predicts other agents' future motions conditioned on the AV's planned trajectory by utilizing a Transformer-based neural network. The prediction network utilizes vectorized map, agent history, and AV plan as diverse sources of information. The historical states of all agents are encoded by different LSTM networks, and the map waypoints are encoded by MLPs. The AV plan is encoded using an MLP and a self-attention Transformer encoder layer. The agent-agent interaction is modeled by a two-layer self-attention Transformer encoder, and the agent-map interaction is modeled by an agent-map encoder. The AV's planned trajectory is fused with other agents' encoded features using early fusion, late fusion, or early + late fusion approaches. The decoding process involves repeating the agent interaction encoding and concatenating it with the agent-map interaction encoding to generate a final latent representation tensor. An MLP is used to decode the Gaussian parameters and predict the probability of different futures for each agent at every timestep in the future.

Accepted Answer

The purpose of maximum entropy inverse reinforcement learning (IRL) is to recover the underlying reward (cost) function from demonstrations of human driving behavior. It handles ambiguity and stochasticity by recovering a distribution over trajectories rather than a single best trajectory. The cost function is linear: a weighted sum of features that characterize driving behavior. This cost induces a probability distribution over candidate behaviors, and the weights are optimized to maximize the log-likelihood of the expert demonstration trajectories in the dataset. The method helps in evaluating behaviors and making human-like decisions in driving scenarios, accounting for the nuances of driving and for the different actions people take in the same situation.
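
In the standard maximum entropy formulation over a finite set of candidates, the description above corresponds to the following equations. The symbols (theta for the weights, f for the feature vector, zeta for a candidate trajectory, N for the number of candidates, D for the dataset) are assumptions chosen for illustration and may differ from the paper's notation.

```latex
% Linear cost: weighted sum of trajectory features.
\[
  c(\zeta_i) = \theta^{\top} f(\zeta_i)
\]
% Maximum entropy distribution over the N candidate behaviors:
% lower cost means higher probability.
\[
  P(\zeta_i \mid \theta) = \frac{\exp\bigl(-c(\zeta_i)\bigr)}{\sum_{j=1}^{N} \exp\bigl(-c(\zeta_j)\bigr)}
\]
% Weights maximize the log-likelihood of the expert trajectory in each scene d.
\[
  \theta^{*} = \arg\max_{\theta} \sum_{d \in D} \log P\bigl(\zeta^{E}_{d} \mid \theta\bigr)
\]
```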

Accepted Answer

The learning process is divided into two stages. The first stage trains the conditional motion prediction (CMP) module, a deep neural network that learns to predict other agents' behaviors conditioned on the AV's future trajectory. It learns from interactions among human drivers, using the AV's ground-truth future trajectory as the planned trajectory, and is trained with a negative log-likelihood loss on the GMM parameters. The trained model then generalizes to predicting how other agents react to different AV plans in a given scenario. The second stage learns the cost function weights for evaluating candidate plans: it calculates the features and costs of all generated plans and imposes a negative log-likelihood loss on the resulting distribution to favor the trajectory that most closely matches the expert demonstration in feature space.
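
The second stage can be illustrated with a small gradient-descent sketch. It assumes a linear cost and a softmax over negative costs, which is the usual maximum entropy form; the function names, the plain gradient step (the paper reports using Adam with L2 regularization), and the toy feature values are all assumptions for the example.

```python
import math


def plan_probabilities(weights, features):
    """Softmax over negative linear costs: lower cost gives higher probability."""
    costs = [sum(w * f for w, f in zip(weights, row)) for row in features]
    lowest = min(costs)  # subtract the minimum for numerical stability
    exps = [math.exp(-(c - lowest)) for c in costs]
    total = sum(exps)
    return [e / total for e in exps]


def nll_loss(weights, features, expert_index):
    """Negative log-likelihood of the candidate closest to the expert demonstration."""
    return -math.log(plan_probabilities(weights, features)[expert_index])


def irl_step(weights, features, expert_index, lr=0.1):
    """One gradient step; the gradient is expert features minus expected features."""
    probs = plan_probabilities(weights, features)
    n = len(weights)
    expected = [sum(p * row[k] for p, row in zip(probs, features)) for k in range(n)]
    grad = [features[expert_index][k] - expected[k] for k in range(n)]
    return [w - lr * g for w, g in zip(weights, grad)]


# Toy scene: three candidate plans with two features each; plan 0 matches the expert.
toy_features = [[0.1, 0.0], [0.5, 1.0], [0.9, 0.0]]
toy_weights = [0.0, 0.0]
for _ in range(100):
    toy_weights = irl_step(toy_weights, toy_features, expert_index=0)
```

After training, the expert-like plan receives the highest probability, because the learned weights penalize the features on which the other candidates differ from it.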

Accepted Answer

The Waymo Open Motion Dataset (WOMD) is used for training and validating the proposed framework. It contains 104,000 unique scenes, each 20 seconds long at 10 Hz, collected from 570 hours of driving and over 1750 km of roadways. The dataset provides annotated high-definition map data and high-accuracy agent track data suitable for prediction and planning tasks. In experiments, 10,156 scenes are selected, with 80% used as training data and the rest as testing data. Each scene is split into several 7-second tracks with an observation horizon of 2 seconds and a prediction/planning horizon of 5 seconds into the future. The self-driving car track is used for behavior planning, while surrounding traffic participants are used for prediction. The CMP module uses all data points for training and evaluation, while the IRL scoring module filters data points based on speed and lane change capabilities. After filtering, the IRL scoring module has 10,564 training data points and 2,246 testing data points.
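
The splitting of a 20-second scene into 7-second tracks can be sketched as a sliding window. The stride between windows is not stated in the answer above, so the 1-second default below is purely an assumption, as are the function and parameter names.

```python
def split_scene(num_frames=200, hz=10, history_s=2.0, future_s=5.0, stride_s=1.0):
    """Return (start, split, end) frame indices for each track in one scene.

    Frames [start, split) form the observation window and frames
    [split, end) form the prediction/planning window.
    """
    history = int(round(history_s * hz))
    future = int(round(future_s * hz))
    stride = int(round(stride_s * hz))  # assumed value, not given in the source
    window = history + future
    tracks = []
    for start in range(0, num_frames - window + 1, stride):
        tracks.append((start, start + history, start + window))
    return tracks
```

Each track then covers 70 frames at 10 Hz: 20 frames of history and 50 frames of future.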

Accepted Answer

The designed features for candidate decision characterization include travel efficiency, maximum acceleration, maximum jerk, maximum lateral acceleration, headway, lateral distance, safety, and collision. Travel efficiency is represented by the difference between the current speed and speed limit, normalized by the speed limit. Maximum acceleration and maximum jerk are used to measure ride comfort in the longitudinal direction. Maximum lateral acceleration is used to measure ride comfort in the lateral direction. Headway is the safe longitudinal distance between the AV and the leading vehicle, calculated using time headway and a Gaussian RBF. Lateral distance is the safe lateral distance from other vehicles, calculated using lateral distance and a Gaussian RBF. Safety considers collisions between the AV's planned trajectory and other vehicles' predicted trajectories, using an indicator function and summing collision times across all time steps in the time horizon. These features are computed for each candidate trajectory and used to calculate the cost of the trajectory.
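
Two of these features can be sketched in a few lines. The exact functional forms are not given in the answer above, so the absolute speed difference, the RBF width `sigma`, and the function names are assumptions made for illustration.

```python
import math


def travel_efficiency(speed, speed_limit):
    """Speed deficit relative to the limit, normalized by the limit (assumed absolute)."""
    return abs(speed_limit - speed) / speed_limit


def gaussian_rbf(x, sigma=1.0):
    """Gaussian radial basis function: 1 at x = 0, decaying towards 0 as |x| grows."""
    return math.exp(-(x * x) / (2.0 * sigma * sigma))


def headway_feature(gap, ego_speed, sigma=1.0, min_speed=0.1):
    """Map time headway (gap / speed) through the RBF so short headways score high."""
    time_headway = gap / max(ego_speed, min_speed)  # guard against division by zero
    return gaussian_rbf(time_headway, sigma)
```

With this shaping, a plan that follows closely produces a feature value near 1, while a plan with a long headway produces a value near 0, so a positive learned weight penalizes tailgating.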

Accepted Answer

The evaluation metrics used for behavior prediction are the minimum Average Displacement Error (minADE) and the minimum Final Displacement Error (minFDE). minADE measures the average displacement of each point in the closest joint trajectories to the ground truth, while minFDE is the displacement error between the final point of the joint predicted trajectories and the ground truth. These prediction errors are averaged over all agents in the joint trajectories. Additionally, a set of metrics is used to evaluate the behavior planning performance, including the minFDE between the top-3 most likely planned trajectories and the ground-truth one, the accuracy of any of the top-3 most likely planned trajectories matching the ground truth, and the intention accuracy. The top-3 accuracy is chosen because of the probabilistic nature of the planning framework, which can address the stochasticity of human driving behaviors. Furthermore, behaviors are reduced to discrete intentions, such as acceleration and deceleration in the longitudinal direction and lane change in the lateral direction, and the model's accuracy in identifying these intentions is calculated.
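
A minimal sketch of the two prediction metrics for joint multi-agent predictions is shown below. The data layout (`modes[k][agent][t]` as `(x, y)` tuples) and the choice to take the minimum independently for each metric are assumptions for the example.

```python
import math


def _mode_errors(mode, ground_truth):
    """Return (ADE, FDE) of one joint mode, each averaged over all agents."""
    ades, fdes = [], []
    for pred_traj, gt_traj in zip(mode, ground_truth):
        dists = [math.hypot(px - gx, py - gy)
                 for (px, py), (gx, gy) in zip(pred_traj, gt_traj)]
        ades.append(sum(dists) / len(dists))  # mean displacement over time
        fdes.append(dists[-1])                # displacement at the final step
    return sum(ades) / len(ades), sum(fdes) / len(fdes)


def min_ade_fde(modes, ground_truth):
    """minADE and minFDE over the predicted joint modes."""
    errors = [_mode_errors(mode, ground_truth) for mode in modes]
    return min(e[0] for e in errors), min(e[1] for e in errors)
```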

Accepted Answer

The parameters of the prediction module include 8 attention heads, a hidden dimension of 1024 for the feed-forward network, and ReLU as the activation function. Additionally, every dense layer, except the output layer, is followed by a dropout layer with a dropout rate of 0.1. The network outputs displacements relative to an agent's current position, which improves prediction accuracy. The Adam optimizer with an initial learning rate of 2e-4, decaying by a factor of 0.5 every 5 epochs, is used for training. The batch size is 32, and the total number of training epochs is 30. Gradient norm clipping is set to a max norm of 5. For training the IRL-based planner, the Adam optimizer with a learning rate starting at 1e-2 and decaying by a factor of 0.9 every 50 steps is used. L2 regularization with a weight decay value of 1e-2 is applied to the cost function weights to prevent overfitting. The mini-batch size is 64, and the total number of training steps is 500. Collision detection between the AV and other objects is approximated using a list of circles based on their poses, with a collision occurring if the distance between any pair of circles' centers is smaller than a threshold.
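
The circle-based collision check described at the end of this answer can be sketched as follows. The vehicle length, the number of circles, and the distance threshold are not given above, so the defaults here are illustrative assumptions, as are the function names.

```python
import math


def circle_centers(x, y, heading, length=4.8, n_circles=3):
    """Place circle centers evenly along the vehicle's longitudinal axis."""
    centers = []
    for i in range(n_circles):
        offset = ((i + 0.5) / n_circles - 0.5) * length
        centers.append((x + offset * math.cos(heading),
                        y + offset * math.sin(heading)))
    return centers


def in_collision(pose_a, pose_b, threshold=2.0):
    """True if any pair of circle centers is closer than the threshold (assumed 2 m)."""
    for ax, ay in circle_centers(*pose_a):
        for bx, by in circle_centers(*pose_b):
            if math.hypot(ax - bx, ay - by) < threshold:
                return True
    return False


def collision_steps(plan, prediction, threshold=2.0):
    """Count the time steps at which the AV plan collides with a predicted trajectory."""
    return sum(1 for a, b in zip(plan, prediction) if in_collision(a, b, threshold))
```

Each pose is an `(x, y, heading)` tuple. Two vehicles 3 m apart in the same lane are flagged as colliding, while vehicles in adjacent lanes 3.5 m apart are not.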

Accepted Answer

The conditional motion prediction (CMP) module is evaluated both quantitatively and qualitatively. Because the way the AV's future information is fused needs careful design, three fusion structures are investigated: early fusion, late fusion, and early + late fusion. Quantitatively, the early-fusion structure performs significantly better than the late-fusion and early + late fusion variants, with approximately 10% improvement in prediction metrics compared to non-conditional prediction. Qualitatively, the multi-future prediction results demonstrate the early-fusion network's ability to jointly predict multiple futures for surrounding agents, and the conditional prediction results show its ability to predict other agents' behaviors conditioned on the AV's different plans. However, the model's predictions for other agents are not always fully reactive, which may lead to collisions in some plans. Such plans are ruled out by the downstream planner, which encourages the planner to choose plans that comply with the training distribution from real-world data.

Accepted Answer

The proposed method demonstrates its capability to predict other agents' trajectories and select appropriate behaviors in urban driving scenarios. In Scenario 1, the method selects Plan 1, which is the closest to the ground truth, as it maintains a safe distance from the leading vehicle. The other candidates receive lower scores: Plan 2, with a lower target speed, loses speed unnecessarily, and Plan 3, with a higher target speed, leaves a smaller headway. In Scenario 2, the method chooses Plan 1, which safely avoids collision risks with a cut-in vehicle while maintaining speed. Plan 2, which involves hard braking, has a near-zero score due to collision risks. In Scenario 3, the method favors Plan 1 and Plan 2, which involve slowing down and yielding to the cut-in vehicle, as they prioritize safety over speed. The results indicate that the learned cost function scores candidate plans according to their safety and their proximity to the ground truth.

Accepted Answer

The results indicate that prediction accuracy plays a crucial role in ensuring downstream planning performance. Using a learning-based prediction model significantly improves prediction accuracy, and consequently planning performance, compared to a kinematic-based prediction model. The conditional prediction model outperforms the non-conditional model, highlighting the benefits of leveraging the AV's future plan information. Planning with the proposed conditional prediction model has comparable performance to the oracle method, suggesting that it better reflects real-world interaction dynamics. However, planning errors can still occur because of the limited set of trajectory proposals and limitations in the evaluation method. The learned cost function is also compared with alternatives, namely a manually tuned cost function and one learned with the maximum-margin method. Learning the cost function from data proves to be more effective than manual tuning, with the max-entropy method marginally outperforming the max-margin method.

Accepted Answer

For conditional prediction methods, there are two inference approaches: single and batch. The single approach involves querying the prediction model for each planned trajectory, while the batch approach organizes all planned trajectories into a batch and repeats the environmental context tensors to match with the plan queries. The batch processing method significantly reduces computation time by parallelizing the conditional prediction process, making it suitable for real-time usage. On the other hand, the single processing method has the longest computation time and is not suitable for real-time usage. The non-conditional method has the shortest computation time but compromises planning and prediction performance. Additionally, the early fusion method runs slightly faster than the late fusion method.
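
The difference between the two inference approaches can be sketched with a stand-in model. `ToyPredictor` is a hypothetical placeholder that only counts forward passes; it is not the paper's network, and the list-based "tensors" are an assumption to keep the example dependency-free.

```python
class ToyPredictor:
    """Hypothetical stand-in for the CMP network; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, contexts, plans):
        # One forward pass over a batch of (context, plan) pairs.
        self.calls += 1
        return [sum(context) + sum(plan) for context, plan in zip(contexts, plans)]


def predict_single(model, context, plans):
    """Single approach: query the model once per planned trajectory."""
    return [model([context], [plan])[0] for plan in plans]


def predict_batch(model, context, plans):
    """Batch approach: repeat the context to match the plans, then query once."""
    contexts = [context] * len(plans)
    return model(contexts, plans)
```

Both functions return the same predictions, but the batch version issues a single forward pass regardless of the number of plans, which is what allows the computation to be parallelized on an accelerator.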

Accepted Answer

The proposed framework divides behavior planning into prediction and scoring processes. It includes a conditional motion prediction model that forecasts other agents' future trajectories based on the AV's potential plan. The framework also learns a cost function using inverse reinforcement learning to evaluate candidate plans. This approach enhances safety, interpretability, and reliability compared to other learning-based methods. However, the framework has been validated only in an open-loop manner, and safety-critical scenarios need further investigation.

Accepted Answer

The proposed framework comprises three core modules: a behavior generation module, a conditional motion prediction module, and a scoring module. The behavior generation module produces diverse trajectory proposals. The conditional motion prediction module forecasts other agents' future trajectories based on each candidate plan. The scoring module evaluates candidate plans using a cost function learned with maximum entropy inverse reinforcement learning (IRL). Together, these modules generate and evaluate human-like driving behaviors, and they are validated on a large-scale real-world urban driving dataset.

Most frequently asked questions

1. What are the main challenges addressed in the predictive behavior planning framework?

2. What are the challenges of imitation learning (IL) and reinforcement learning (RL) in autonomous driving decision-making?

3. What is conditional motion prediction?

4. What is the purpose of inverse reinforcement learning (IRL) methods?

5. What are the three core components of the proposed behavior planning framework?

6. What is the conventional motion prediction task?

7. How are candidate behaviors generated for AV?

8. How does the conditional motion prediction module predict other agents' future motions in the AV's planned trajectory?

9. What is the purpose of maximum entropy IRL in evaluating driving behaviors?

10. What is the learning process divided into?

11. What dataset is used for training and validating the proposed framework?

12. What are the designed features for candidate decision characterization?

13. What are the evaluation metrics used for behavior prediction?

14. What are the parameters of the prediction module?

15. What is the performance of the conditional motion prediction module?

16. How does the proposed method perform in predicting other agents' trajectories and selecting appropriate behaviors in urban driving scenarios?

17. How does prediction accuracy impact downstream planning performance?

18. What inference approaches are used for conditional prediction methods?

19. What are the key components of the proposed framework for behavior planning in autonomous driving?

20. What are the core modules of the proposed learning-based predictive behavior planning framework?
