A video shows a girl throwing a ball hard at a boy in a school yard. The boy throws the ball gently back. What does a computer vision system understand of this scene? Until recently, such a system could recognize what is happening in the video (a ball being thrown), who is performing the action (a boy and a girl) and where the action is taking place (in a school yard). But it would not recognize how the action happened: whether the ball was thrown hard or gently, or in some other subtle way. This is about to change thanks to research by postdoc Hazel Doughty and professor Cees Snoek, both of the Video & Image Sense Lab (VIS) at the Informatics Institute of the UvA.
Doughty and Snoek have built a computer vision system that can recognize 34 adverbs describing how an action in a video is performed, for example quickly or slowly, gently or firmly, purposefully or accidentally. They published their findings in the arXiv article ‘How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs’ and presented them at the Computer Vision and Pattern Recognition Conference (CVPR) in New Orleans (USA) in June.
Doughty and Snoek trained their computer vision system on some 15,000 video clips of ten to thirty seconds each, taken from instructional videos available on YouTube. Sometimes the instructor gives the instruction verbally; other times it appears as text in the video, for example ‘you should chop the onion finely’. The vision system is based on a deep learning model that took a few hours to train. After training, the model can run on new videos and recognize how an action is being performed, for 135 different actions ranging from swimming to cutting.
Recognize new combinations
‘Until recently, very little work had been done on recognizing how an action takes place’, says Doughty. ‘Although this problem is far from solved, we have made quite some progress. Our system easily recognizes compositions that it has been trained on, like “mix gently” or “hit slowly”. But it can also recognize some new combinations that it has not seen before, like “hit gently”. However, there is still a lot of room for improvement. Ultimately, we would like the system to recognize adverbs across all feasible combinations of actions and adverbs.’
And even when that problem is solved, there is another grand challenge: how can a computer vision system understand how an action takes place in a domain that it has never seen before? For example, how to recognize that a pancake is being flipped quickly, if the computer vision system has never seen a pancake being prepared?
As computer vision systems are increasingly used in everyday applications, they are likely to perform better when they understand not only what action takes place, but also how. Another possible application, says Doughty, lies in the field of robotics: ‘You can have a robot using our system watch an instructional video and learn the subtleties of how to perform a certain action.’
The research described in this article falls within the scope of the research theme 'Artificial Intelligence' of the Informatics Institute.