Thesis Examination Committee
Prof Long QUAN, CSE/HKUST (Chairperson)
Prof Bertram SHI, ECE/HKUST (Thesis Supervisor)
Prof Dit Yan YEUNG, ECE/HKUST (Thesis Co-supervisor)
Prof Max Qing Hu MENG, Department of Electronic Engineering, The Chinese University of Hong Kong (External Examiner)
Prof Chi Ying TSUI, ECE/HKUST
Prof Weichuan YU, ECE/HKUST
Prof Shing Chi CHEUNG, CSE/HKUST
Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the instantaneous visual appearance and the motion dynamics. This thesis proposes several new deep architectures to handle video signals efficiently and effectively.
Specifically, we propose the first end-to-end framework for human action recognition: factorized spatio-temporal convolutional networks (FstCN), which factorize the learning of the original 3D convolution kernels into a sequential process of first learning layers of 2D spatial kernels, followed by learning layers of 1D temporal kernels. This factorized architecture mitigates the difficulty of learning high-dimensional kernels, especially when training videos are scarce.
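The idea behind the factorization can be illustrated with a minimal NumPy sketch. It is not the FstCN implementation: it covers only a single kernel with no nonlinearity, and all names (`conv3d_valid`, `factorized_conv`, the toy sizes) are hypothetical. It shows that a 3D kernel formed as the outer product of a 1D temporal kernel and a 2D spatial kernel gives the same response as applying the 2D kernel per frame and then the 1D kernel along time, while needing fewer parameters. A general 3D kernel is not of this separable form, so the factorization is a restriction on the kernel family.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D cross-correlation of a (T, H, W) video with a (kt, kh, kw) kernel."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(video[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out

def factorized_conv(video, spatial_kernel, temporal_kernel):
    """Apply a 2D spatial kernel to every frame, then a 1D temporal kernel at every pixel."""
    spatial = conv3d_valid(video, spatial_kernel[None, :, :])       # kernel shape (1, kh, kw)
    return conv3d_valid(spatial, temporal_kernel[:, None, None])    # kernel shape (kt, 1, 1)

# Toy data; sizes are illustrative assumptions only.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 10, 10))   # (frames, height, width)
k_s = rng.standard_normal((3, 3))          # 2D spatial kernel
k_t = rng.standard_normal(3)               # 1D temporal kernel
k_3d = k_t[:, None, None] * k_s[None, :, :]  # separable 3D kernel, shape (3, 3, 3)

full = conv3d_valid(video, k_3d)
fact = factorized_conv(video, k_s, k_t)

params_full = k_3d.size              # parameters in the full 3D kernel
params_fact = k_s.size + k_t.size    # parameters in the factorized pair
print(np.allclose(full, fact), params_full, params_fact)
```

In this toy case the factorized pair uses 3x3 + 3 weights instead of 3x3x3, which is the sense in which the factorized form is easier to learn from limited video data.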
Though effective, methods based on Convolutional Neural Networks (CNNs) are still limited in modeling long-term motion dynamics. On the other hand, naively applying Recurrent Neural Networks (RNNs) to video sequences in a convolutional manner implicitly assumes that the motions in videos are stationary across different spatial locations. This assumption may be valid for short-term motions, but is usually invalid when the duration of the motion is long. We introduce the Lattice Long Short-Term Memory (Lattice LSTM), which extends the LSTM by learning independent hidden state transitions for the memory cells at different spatial locations. This method effectively models dynamics across time and addresses the non-stationarity of long-term motion dynamics without significantly increasing the model complexity. Jointly training the input and forget gates within the LSTM on heterogeneous data (RGB images and optical flow) gives better control of the output dynamics.
Finally, to further benefit from these heterogeneous data and features and from RNNs, we propose the Coupled Recurrent Network (CRN). The CRN takes advantage of the heterogeneous data by coupling them at the input level. Hidden features extracted from the other input sources are concatenated with the features from the current source and remapped as new input for the current source. The learned features are refined iteratively and recursively. Our proposed methods achieve state-of-the-art performance on several benchmarks.
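The input-level coupling described for the CRN can be sketched as follows. This is one possible reading of the description, not the thesis's implementation: it uses a plain tanh RNN cell in place of an LSTM, two sources only, random untrained weights, and hypothetical names throughout (`make_branch`, `step`, `W_remap`, and the sizes `D`, `Hd`, `T`). At each time step, the hidden features of the other source are concatenated with the current source's features and remapped to a new input before the recurrent update.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Hd, T = 4, 5, 6  # feature size, hidden size, sequence length (illustrative assumptions)

def make_branch():
    """Random, untrained weights for one source's recurrent branch (hypothetical layout)."""
    return {
        "W_remap": rng.standard_normal((D, D + Hd)) * 0.1,  # [own features, other hidden] -> new input
        "W_in": rng.standard_normal((Hd, D)) * 0.1,         # new input -> hidden
        "W_h": rng.standard_normal((Hd, Hd)) * 0.1,         # hidden -> hidden
    }

def step(branch, x, h_own, h_other):
    """One coupled update: concatenate, remap, then apply a plain tanh RNN cell."""
    coupled = np.concatenate([x, h_other])
    x_new = np.tanh(branch["W_remap"] @ coupled)
    return np.tanh(branch["W_in"] @ x_new + branch["W_h"] @ h_own)

# Stand-ins for per-frame features from the two sources (RGB and optical flow).
rgb_feats = rng.standard_normal((T, D))
flow_feats = rng.standard_normal((T, D))

rgb, flow = make_branch(), make_branch()
h_rgb, h_flow = np.zeros(Hd), np.zeros(Hd)

for t in range(T):
    # Both branches read the other's hidden state from the previous step.
    h_rgb_next = step(rgb, rgb_feats[t], h_rgb, h_flow)
    h_flow_next = step(flow, flow_feats[t], h_flow, h_rgb)
    h_rgb, h_flow = h_rgb_next, h_flow_next

print(h_rgb.shape, h_flow.shape)
```

Because each branch's input at every step depends on the other branch's hidden state, the features of both sources are refined jointly as the sequence unrolls.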