I have built a dataset from two videos, each 2-3 minutes long at 25 frames per second. I extracted every frame, and the dataset contains each face found in those frames along with the frame number, the predicted emotion, and the true label. I created the true labels by looking at the context in the videos, so if a person is happy across 20 frames, all 20 frames are labeled happy. The predicted labels across those same 20 frames might be, for example, neutral, surprise, happy, neutral, or they might be neutral for 100 frames in a row. The true emotion also spans a variable number of frames: 20, 30, or sometimes even 90. There is no fixed window for detecting an emotion, because different people in these videos show their emotions over different time spans.
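To make the layout concrete, here is a minimal sketch of how the per-frame rows could be grouped into variable-length runs that share one true label. The tuple layout, the `face_id` field, and the sample values are made up for illustration; my real data may need a different way of telling faces apart.

```python
from itertools import groupby

# Hypothetical rows: (face_id, frame_number, predicted_emotion, true_label).
# face_id is an assumption; it stands for whatever identifies one person.
rows = [
    ("A", 1, "neutral", "happy"),
    ("A", 2, "surprise", "happy"),
    ("A", 3, "happy", "happy"),
    ("A", 4, "neutral", "neutral"),
    ("A", 5, "neutral", "neutral"),
    ("B", 1, "neutral", "sad"),
    ("B", 2, "sad", "sad"),
]

def to_segments(rows):
    """Group rows into consecutive runs with the same face and true label."""
    # Sort by face, then frame, so each run is contiguous in time.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    segments = []
    for (face, label), run in groupby(ordered, key=lambda r: (r[0], r[3])):
        run = list(run)
        segments.append({
            "face": face,
            "true_label": label,
            "start_frame": run[0][1],
            "predicted": [r[2] for r in run],  # noisy per-frame predictions
        })
    return segments

segments = to_segments(rows)
for seg in segments:
    print(seg)
```

Each segment has a different length, which is the variable-window situation described above.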
My question is: can I put together a simple RNN, train it on this dataset, and have it predict the context labels for 4-5 emotions? If so, what is the best way to do this in Python? Many thanks.
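For reference, this is roughly the kind of model I have in mind: a rough sketch, assuming PyTorch, that takes the per-frame predicted emotions as one-hot inputs and outputs one context label per frame. The emotion list, the `FrameTagger` name, the hidden size, and the two toy tracks are all placeholders, and I have not validated this on my real data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

EMOTIONS = ["neutral", "happy", "sad", "surprise", "angry"]  # assumed label set
IDX = {name: i for i, name in enumerate(EMOTIONS)}
PAD_LABEL = -100  # padded frames get this label and are skipped by the loss

class FrameTagger(nn.Module):
    """Simple RNN that emits one emotion score vector per frame."""
    def __init__(self, n_classes, hidden_size=16):
        super().__init__()
        self.rnn = nn.RNN(n_classes, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_classes)

    def forward(self, x):          # x: (batch, time, n_classes)
        hidden, _ = self.rnn(x)    # (batch, time, hidden_size)
        return self.out(hidden)    # (batch, time, n_classes)

def encode(predicted, true):
    """One-hot the per-frame predictions; integer-code the true labels."""
    x = F.one_hot(torch.tensor([IDX[p] for p in predicted]), len(EMOTIONS)).float()
    y = torch.tensor([IDX[t] for t in true])
    return x, y

# Two made-up face tracks of different lengths.
tracks = [
    (["neutral", "surprise", "happy", "neutral"], ["happy"] * 4),
    (["neutral", "sad", "neutral", "neutral", "sad", "neutral"], ["sad"] * 6),
]
encoded = [encode(p, t) for p, t in tracks]
xs = [x for x, _ in encoded]
ys = [y for _, y in encoded]

# Pad to a common length so the tracks can be batched together.
x_batch = pad_sequence(xs, batch_first=True)  # zero-padded inputs
y_batch = pad_sequence(ys, batch_first=True, padding_value=PAD_LABEL)

torch.manual_seed(0)
model = FrameTagger(len(EMOTIONS))
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_LABEL)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):
    optimizer.zero_grad()
    logits = model(x_batch)  # (batch, time, classes)
    loss = loss_fn(logits.reshape(-1, len(EMOTIONS)), y_batch.reshape(-1))
    loss.backward()
    optimizer.step()

pred = model(x_batch).argmax(dim=-1)  # per-frame class indices, padding included
```

Because this RNN only looks backward in time, the first frames of a track are labeled with very little context. Passing `bidirectional=True` to `nn.RNN` (and doubling the input size of the linear layer) would let each frame see later frames as well, but I do not know whether that is the right choice here.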