Pancreas stereotactic body radiotherapy (SBRT) treatment planning requires planners to make sequential, time-consuming interactions with the treatment planning system (TPS) to reach the optimal dose distribution. We seek to develop a reinforcement learning (RL)-based planning bot to systematically address complex tradeoffs and achieve high plan quality consistently and efficiently. The focus of pancreas SBRT planning is finding a balance between organ-at-risk (OAR) sparing and planning target volume (PTV) coverage. Planners evaluate dose distributions and make planning adjustments to optimize PTV coverage while adhering to OAR dose constraints. We have formulated such interactions between the planner and the TPS as a finite-horizon RL model. First, planning status features are evaluated based on human planner experience and defined as planning states. Second, planning actions are defined to represent steps that planners would commonly implement to address different planning needs. Finally, we have derived a reward system based on an objective function guided by physician-assigned constraints. The planning bot trained itself with 48 plans augmented from 16 previously treated patients and generated plans for 24 cases in a separate validation set. All 24 bot-generated plans achieved PTV coverage similar to that of the clinical plans while satisfying all clinical planning constraints. Moreover, the knowledge learned by the bot can be visualized and interpreted as consistent with human planning knowledge, and the knowledge maps learned in separate training sessions are consistent, indicating reproducibility of the learning process.
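The state, action, and reward structure described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the formulation used in this work: the state features (`ptv_coverage`, `oar_excess`), the three actions, the weight in `plan_score`, and every numeric step size are hypothetical. The one-step greedy policy is only a stand-in for a learned policy; no learning takes place in this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PlanState:
    """Hypothetical planning state: two summary features of a plan."""
    ptv_coverage: float  # fraction of the PTV receiving the prescription dose
    oar_excess: float    # OAR dose above its constraint (0.0 means constraint met)


def plan_score(state: PlanState, oar_weight: float = 2.0) -> float:
    """Toy objective: reward PTV coverage, penalize OAR constraint violation."""
    return state.ptv_coverage - oar_weight * max(state.oar_excess, 0.0)


# Hypothetical planning actions; the step sizes are arbitrary illustrative values.
def boost_ptv(s: PlanState) -> PlanState:
    return PlanState(min(1.0, s.ptv_coverage + 0.05), s.oar_excess + 0.02)


def spare_oar(s: PlanState) -> PlanState:
    return PlanState(max(0.0, s.ptv_coverage - 0.02), max(0.0, s.oar_excess - 0.05))


def no_op(s: PlanState) -> PlanState:
    return s


ACTIONS: Dict[str, Callable[[PlanState], PlanState]] = {
    "boost_ptv": boost_ptv,
    "spare_oar": spare_oar,
    "no_op": no_op,
}


def step(state: PlanState, action: str) -> Tuple[PlanState, float]:
    """Apply one action; the reward is the resulting change in plan score."""
    next_state = ACTIONS[action](state)
    return next_state, plan_score(next_state) - plan_score(state)


def greedy_policy(state: PlanState) -> str:
    """Stand-in for a learned policy: pick the action with the best one-step reward."""
    return max(ACTIONS, key=lambda name: step(state, name)[1])


def run_episode(
    initial: PlanState, policy: Callable[[PlanState], str], horizon: int
) -> Tuple[PlanState, float]:
    """Finite-horizon episode: exactly `horizon` planner-TPS interactions."""
    state, total_reward = initial, 0.0
    for _ in range(horizon):
        state, reward = step(state, policy(state))
        total_reward += reward
    return state, total_reward


if __name__ == "__main__":
    start = PlanState(ptv_coverage=0.80, oar_excess=0.10)
    final, total = run_episode(start, greedy_policy, horizon=5)
    print(final, total)
```

Because each reward is defined as a difference in plan score, the cumulative reward over an episode equals the final score minus the initial score. That is one simple way to tie a reward system to an objective function; the reward actually used in this work may be defined differently.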