Activity and human face recognition systems are used wherever behavior analysis is required. Because a video-based approach builds on existing infrastructure consisting of closed-circuit television (CCTV) cameras and a central computer, it can be deployed in any space where actions need to be monitored. The main objective of this paper is to build a reliable human face and action recognition classifier that can distinguish a large number of actions yet is lightweight enough to work in real time. It processes at least 30 frames per second on off-the-shelf computer hardware connected to a standard CCTV infrastructure. The temporal convolutional network (TCN) is a viable solution to this problem. It classifies a large number of actions (60) in real time, using only RGB (red-green-blue) images of fairly low resolution. The decision about which class an action belongs to should not depend on the environment, background, person, view angle, or other specific identifiers. It should be associated only with the person executing the action and that person's spatial-temporal context. As technology and processing power improve, the problem shifts slightly. When more processing power is added to a system, this model makes it possible to increase the number of frames per second, the number of cameras in the infrastructure, or the quality of the images, most likely resulting in higher prediction accuracy. The model can be extended to a larger number of classes with minimal impact on performance. The proposed model has a tested accuracy of 82%, which can be attributed to the recurrent property of the network. It performs close to the best-performing existing solutions. The present TCN + 3D Convolution Model is built from smaller TCN units. Its architecture alternates a Simple Unit and a Complex Unit in order to maximize the diversity of features the model learns.
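As a rough illustration of the trade-off between frame rate and number of cameras, the sketch below computes the time available to process one frame. The function name `per_frame_budget_ms` and the assumption that a single computer handles all camera streams one frame at a time are illustrative and are not taken from the paper.

```python
def per_frame_budget_ms(fps, cameras=1):
    """Time in milliseconds available per frame.

    Assumption (not from the paper): one machine processes every
    frame from every camera sequentially.
    """
    if fps <= 0 or cameras <= 0:
        raise ValueError("fps and cameras must be positive")
    return 1000.0 / (fps * cameras)


# One camera at the 30 frames per second stated in the abstract.
single = per_frame_budget_ms(30)

# Doubling the cameras at the same frame rate halves the budget,
# so the added streams must be paid for with more processing power.
double = per_frame_budget_ms(30, cameras=2)

print(f"1 camera:  {single:.2f} ms per frame")
print(f"2 cameras: {double:.2f} ms per frame")
```

Under these assumptions, any extra processing power can be spent on a higher frame rate or on more cameras, but both draw on the same per-frame budget.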
This paper presents a deep learning classifier based on TCNs for human action recognition. It is relatively lightweight compared to other methods and performs very well, competing with the best architectures. Ideally, it can classify an action irrespective of the person executing it or the environment in which it is executed. This is achieved as far as possible by training and testing the model on a diverse dataset, namely NTU RGB+D. After each pair of a simple and a complex unit, an Average Pool 3D layer halves at least one dimension.
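To make the effect of the pooling step concrete, here is a minimal pure-Python sketch of non-overlapping 3D average pooling. The kernel size `(2, 1, 1)`, the function name `avg_pool3d`, and the toy clip are assumptions chosen for illustration. The abstract states only that at least one dimension is halved, not which one or with what kernel.

```python
from itertools import product


def avg_pool3d(volume, kernel=(2, 1, 1)):
    """Non-overlapping 3D average pooling on a T x H x W nested list.

    The stride equals the kernel size, so a kernel of 2 along an
    axis halves that axis. The default kernel is an assumption,
    not the paper's setting.
    """
    kt, kh, kw = kernel
    t, h, w = len(volume), len(volume[0]), len(volume[0][0])
    out_t, out_h, out_w = t // kt, h // kh, w // kw
    pooled = [[[0.0] * out_w for _ in range(out_h)] for _ in range(out_t)]
    for i, j, k in product(range(out_t), range(out_h), range(out_w)):
        # Collect every value inside the current pooling window.
        window = [
            volume[i * kt + a][j * kh + b][k * kw + c]
            for a, b, c in product(range(kt), range(kh), range(kw))
        ]
        pooled[i][j][k] = sum(window) / len(window)
    return pooled


# Toy clip: 4 frames of 2 x 2 pixels, each frame filled with its index.
clip = [[[float(f)] * 2 for _ in range(2)] for f in range(4)]
halved = avg_pool3d(clip)

print(len(clip), "frames ->", len(halved), "frames")
```

With the default kernel only the temporal axis shrinks, from 4 frames to 2, while the spatial size stays 2 x 2. Passing `kernel=(2, 2, 2)` would halve all three axes instead.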
