We describe work in progress on training a humanoid robot to produce iconic arm and head gestures as part of task-oriented dialogue interaction. This involves the development and use of a multimodal dialogue manager that lets non-experts quickly "program" the robot through speech and vision. Using this dialogue manager, videos of gesture demonstrations are collected. Motor positions are extracted from these videos to specify motor trajectories, and collections of these trajectories are used to produce robot gestures following a Gaussian mixture approach. The concluding discussion considers how the learned representations may be used for gesture recognition by the robot, and how the framework may mature into a system that addresses language grounding and semantic representation.