Welcome to the Vision repository documentation!
This project’s goal is to provide Roboy with extensive vision capabilities: recognizing, localizing and classifying objects in its environment, and providing localization data for other modules to process. The input comes from a RealSense camera device; the output should be high-level data about Roboy’s environment, provided through ROS messages and services.
The most important Vision task for human interaction is detecting and recognizing faces, which is why it was given the highest priority in this project. The current main tasks of this project are:
- Identification of Roboy Team Members
- Pose estimation of a detected face and Roboy Motor Control
- Tracking of detected objects
- Person Talking detection
- Mood Recognition
- Gender Recognition
- Remembering faces online
- Age classification
- Scene and object classification
What Roboy Vision can do:
- Face detection.
- Speaker detection.
- Object detection.
Relevant Background Information and Prerequisites
Our approach to the given Vision tasks is to use machine learning methods. A basic understanding of machine learning, specifically of deep neural networks and convolutional neural networks, is therefore necessary.
The following links are to be seen as suggestions for getting started on machine learning:
- A crash course on deep learning in the form of YouTube tutorials: DeepLearning.tv
- A closer look at the implementation of neural networks: The Foundations of deep learning
- An introduction to Convolutional Neural Networks (CNNs): Deep learning in Computer vision
- The machine learning framework used for implementation: TensorFlow
- Furthermore, a basic understanding of simple machine learning approaches such as Regression, Tree Learning, K-Nearest-Neighbours (KNN), Support Vector Machines (SVMs), Gaussian Models, Eigenfaces, etc. will be helpful.
The papers currently used for implementation should be understood:
- Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks
- FaceNet: A Unified Embedding for Face Recognition and Clustering
- DLIB: Facial landmarks and face recognition
- You Only Look Once: Unified, Real-Time Object Detection (https://pjreddie.com/media/files/papers/yolo.pdf)
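To give a feel for how the FaceNet paper listed above supports identifying team members, here is a minimal sketch of its core idea: each face is mapped to an embedding vector, and two faces are taken to show the same person when the Euclidean distance between their embeddings falls below a threshold. The function names `normalize` and `identify`, the threshold of 1.0, and the toy 3-dimensional embeddings are assumptions made for illustration only; they are not taken from this repository, and real FaceNet embeddings would come from a trained network.

```python
import math


def normalize(vector):
    """Scale a vector to unit length, as FaceNet does with its embeddings."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def identify(embedding, known, threshold=1.0):
    """Return the name of the closest known embedding, or None.

    `known` maps a name to a reference embedding. The threshold is a
    hypothetical value; in practice it is tuned on validation data.
    """
    best_name, best_dist = None, float("inf")
    for name, reference in known.items():
        dist = math.dist(embedding, reference)  # Euclidean (L2) distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None


# Toy reference embeddings (hypothetical, 3-dimensional for readability).
known = {
    "alice": normalize([1.0, 0.0, 0.0]),
    "bob": normalize([0.0, 1.0, 0.0]),
}

print(identify(normalize([0.9, 0.1, 0.0]), known))  # close to "alice"
print(identify(normalize([0.0, 0.0, 1.0]), known))  # far from everyone
```

With these toy values the first query is matched to "alice", while the second is farther than the threshold from every reference and is reported as unknown (`None`).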
Furthermore there are plans to extend the implementation using this paper: