Comparison of Spatio-Temporal Models for Human Motion and Pose Forecasting in Face-to-Face Interaction Scenarios: Supplementary Material
Published in 2022
In all tested architectures, an initial dense embedding layer mapped the input coordinates and offsets of all landmarks to an intermediate representation vector of size 512. For bimodal architectures, the metadata, audio, and transcription representations were embedded into vectors of size 16, 64, and 64, respectively, through dense layers. All non-linearities following convolutional or dense layers were leaky ReLUs with a negative slope of 0.01.
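The embedding sizes and activation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the input dimensions (number of landmarks, coordinate dimensionality, and the raw metadata, audio, and transcription feature sizes), the weight initialization, and the names `Dense` and `leaky_relu` are all assumptions made for illustration. Only the output sizes (512, 16, 64, 64) and the negative slope of 0.01 come from the text.

```python
import numpy as np


def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU with the negative slope stated in the text (0.01)."""
    return np.where(x >= 0, x, negative_slope * x)


class Dense:
    """Dense layer followed by a leaky ReLU (illustrative, not the authors' code)."""

    def __init__(self, in_features, out_features, rng):
        # Assumed initialization: uniform in [-1/sqrt(in), 1/sqrt(in)].
        scale = 1.0 / np.sqrt(in_features)
        self.weight = rng.uniform(-scale, scale, size=(in_features, out_features))
        self.bias = np.zeros(out_features)

    def __call__(self, x):
        return leaky_relu(x @ self.weight + self.bias)


rng = np.random.default_rng(0)

# Hypothetical input sizes; the source does not specify them.
NUM_LANDMARKS = 50
COORD_DIMS = 2
LANDMARK_FEATURES = NUM_LANDMARKS * COORD_DIMS * 2  # coordinates plus offsets
METADATA_FEATURES = 10
AUDIO_FEATURES = 128
TEXT_FEATURES = 768

# Output sizes taken from the text: 512 for landmarks, 16/64/64 for the rest.
embed_landmarks = Dense(LANDMARK_FEATURES, 512, rng)
embed_metadata = Dense(METADATA_FEATURES, 16, rng)
embed_audio = Dense(AUDIO_FEATURES, 64, rng)
embed_text = Dense(TEXT_FEATURES, 64, rng)

batch = 4
pose_emb = embed_landmarks(rng.normal(size=(batch, LANDMARK_FEATURES)))
meta_emb = embed_metadata(rng.normal(size=(batch, METADATA_FEATURES)))
audio_emb = embed_audio(rng.normal(size=(batch, AUDIO_FEATURES)))
text_emb = embed_text(rng.normal(size=(batch, TEXT_FEATURES)))

print(pose_emb.shape, meta_emb.shape, audio_emb.shape, text_emb.shape)
```

How these embeddings are combined downstream depends on the specific architecture and is not shown here.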
Recommended citation: German Barquero, Johnny Núnez, Zhen Xu, Sergio Escalera, Wei-Wei Tu, Isabelle Guyon, Cristina Palmero. (2022). "Comparison of Spatio-Temporal Models for Human Motion and Pose Forecasting in Face-to-Face Interaction Scenarios: Supplementary Material."