Abstract
Action recognition is a domain that has gained interest alongside the development of dedicated motion capture equipment, hardware, and processing power. Its many applications in domains such as national security and behavior analysis make it even more popular within the scientific community, especially given the ascending trend of machine learning methods. Approaches that solve real-life problems through human action recognition have therefore become increasingly attractive. When building a classifier for this problem, there are mainly two approaches: using RGB images or sensor data, or, where possible, a combination of the two. Both methods have advantages, disadvantages, and domains of application in real-life problems solvable through action recognition. Using RGB input makes it possible to deploy a classifier on almost any infrastructure without specialized equipment, whereas combining video with sensor data provides higher accuracy, albeit at a higher cost. Neural networks, and especially convolutional neural networks, are the starting point for human action recognition. By their nature, they recognize spatial and temporal features very well, making them ideal for RGB images or sequences of RGB images. The present paper proposes a convolutional neural network architecture based on 2D kernels. Its structure, along with metrics measuring its performance, advantages, and disadvantages, is illustrated here. This solution based on 2D convolutions is fast but has lower performance compared to other known solutions. The main problem when dealing with videos is extracting context from a sequence of frames. Video classification using 2D convolutional layers is realized either from the most significant frame or frame by frame, applying a probability distribution over the partial classes to obtain the final prediction. Classifying actions is difficult, especially when the differences between them are subtle and occupy only a small part of the overall image. When classifying via key frames, the total accuracy obtained is around 10%. The other approach, classifying each frame individually, proved too computationally expensive for negligible gains.
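The abstract does not include an implementation, but the frame-by-frame strategy it describes can be illustrated with a minimal sketch. Assuming PyTorch and a standard 2D CNN backbone (torchvision's resnet18 is used here purely as a stand-in; the backbone and the number of action classes are hypothetical, not taken from the paper), each frame is classified independently and the per-frame softmax distributions are averaged to obtain the final prediction:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Hypothetical per-frame classifier: any 2D CNN with num_actions outputs would do.
num_actions = 10  # assumed number of action classes, not specified in the paper
model = resnet18(num_classes=num_actions)
model.eval()

def classify_video(frames: torch.Tensor) -> torch.Tensor:
    """Classify a video given as a (T, 3, H, W) tensor of T RGB frames.

    Each frame is scored independently by the 2D CNN; the per-frame
    softmax distributions are averaged to produce the final prediction.
    """
    with torch.no_grad():
        logits = model(frames)            # (T, num_actions), one score row per frame
        probs = F.softmax(logits, dim=1)  # per-frame class probability distributions
        return probs.mean(dim=0)          # average over frames -> (num_actions,)

# Usage example: 16 frames of 224x224 RGB video (random data as a placeholder).
video = torch.rand(16, 3, 224, 224)
scores = classify_video(video)
print("predicted action:", scores.argmax().item())
```

The key-frame variant mentioned in the abstract would instead apply the same 2D model to a single selected frame, dropping the averaging step in exchange for a much lower computational cost.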