image processing / computer vision - body part recognition - posture (standing/sitting) - supervised learning - node.js
I'm after advice from the image processing / computer vision experts here. I'm trying to develop a robust, scalable algorithm to extract the dimensions of a person's body from an image, for example their upper-body width.
Problems:
* images without faces
* person sitting
* multiple faces
* person holding something, thus covering part of their body
Ways of doing this:
* Haar cascades - a lot of training data of different body parts, and hope for the best.
* HOG - 1. face detection -> then HOG with assumptions along the way, using different filters.
Note: all images will be scaled to the same size.
Obviously the computation time for the second approach MIGHT be more demanding (doubtful, though), but for the first method training is almost impossible and would take much more time.
P.S.
I know there's a paper about using pedestrian data, but that would work for a full, standing body, not for sitting.
I'm open to hearing all your ideas; ask away if you have anything to add.
Implementation would hopefully be done via node.js.
Thank you
DPM (Deformable Part Models) is widely used in computer vision for object detection, and it tends to work under occlusion and when only part of an object is present in the image. The grammar model for humans is very good and has state-of-the-art results on standard datasets. It takes around a second to perform detection on a single image; it's MATLAB code, so it's expected to be slow.
http://www.cs.berkeley.edu/~rbg/latent/
Related
Which face detection method suitable for detecting faces of people at a long distance?
I have checked five different methods for face detection: 1. Haar cascade 2. Dlib HOG 3. Python face_recognition module 4. Dlib CNN 5. OpenCV CNN. All these methods have some advantages and disadvantages, and I found that the OpenCV CNN works best of the five. But for my application I need to detect faces of people at a long distance, and for this purpose even the OpenCV CNN is not working well (it detects the faces of people close to the camera but not people far away). Is there any other algorithm that detects faces of people at a long distance?
One way is to do instance segmentation in order to get all the classes in the environment, including distant objects. Once you have all the classes, you can draw a bounding box around the required far-off face, upsample it, and send it to your face detection NN. Suppose your crop is 54x54x3; it would be upsampled to 224x224x3 and sent to your trained NN.
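The crop-and-upsample step above can be sketched with plain numpy; this uses nearest-neighbour resampling for simplicity (in practice you would likely use cv2.resize with bicubic interpolation), and the crop itself is a placeholder standing in for the region the segmentation step returns:

```python
import numpy as np

def upsample_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an HxWxC image to out_h x out_w."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return img[rows][:, cols]

# hypothetical far-off face crop produced by the segmentation step
crop = np.zeros((54, 54, 3), dtype=np.uint8)
face_input = upsample_nearest(crop, 224, 224)   # shape (224, 224, 3)
# face_input would now be fed to the face detection NN
```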
Face Detection: state-of-the-art and practical considerations
Face detection is often the first stage of a computer vision pipeline, so it is important for the algorithm to perform in real time. It is therefore important to know how the various face detection algorithms compare, and their pros and cons, in order to use the right algorithm for your application. Many algorithms have been developed over the years. Our recent favorite is YuNet because of its balance between speed and accuracy. Apart from that, RetinaFace is also very accurate, but it is a larger model and a little slower. We have compared the top 9 face detection algorithms on the features that we should keep in mind while choosing one:
* Speed
* Accuracy
* Size of face
* Robustness to occlusion
* Robustness to lighting variation
* Robustness to orientation or pose
You can check out the Face Detection ultimate guide, which gives a brief overview of the popular face detection algorithms.
How do face recognition systems differentiate between a real face and a photo of a face?
I'm working on a face recognition project with Python and OpenCV. I can detect faces, but my problem is that I don't know how to make the system differentiate between real and fake faces from a 2D image. If anyone has ideas, please help me. Thank you.
There is a really good article (code included) by Adrian from PyImageSearch tackling this exact problem with a liveness detector. Below is an extract from that article. There are a number of approaches to liveness detection, including:
* Texture analysis, including computing Local Binary Patterns (LBPs) over face regions and using an SVM to classify the faces as real or spoofed.
* Frequency analysis, such as examining the Fourier domain of the face.
* Variable focusing analysis, such as examining the variation of pixel values between two consecutive frames.
* Heuristic-based algorithms, including eye movement, lip movement, and blink detection. These algorithms attempt to track eye movement and blinks to ensure the user is not holding up a photo of another person (since a photo will not blink or move its lips).
* Optical flow algorithms, namely examining the differences and properties of optical flow generated from 3D objects and 2D planes.
* 3D face shape, similar to what is used in Apple's iPhone face recognition system, enabling the face recognition system to distinguish between real faces and printouts/photos/images of another person.
* Combinations of the above, enabling a face recognition system engineer to pick and choose the liveness detection models appropriate for their particular application.
You can solve this problem using multiple methods; I'm listing some of them here, and you can find a few more by referring to research papers.
* Motion approach: make the user blink or move, which shows convincingly that they are real (most likely to work on a video dataset or sequential images).
* Feature approach: extract useful features from an image and use them to make a binary classification decision, real or not.
* Frequency analysis: examine the Fourier domain of the face.
* Optical flow algorithms: examine the differences and properties of optical flow generated from 3D objects and 2D planes.
* Texture analysis: compute Local Binary Patterns using OpenCV to classify the images as fake or not; refer to this link for details on this approach.
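To make the texture-analysis route concrete, here is a minimal numpy sketch of the basic 3x3 LBP operator; the resulting per-face histograms would then be fed to an SVM labelled real/spoofed. The neighbourhood ordering and normalisation here are my own choices, not taken from any particular article:

```python
import numpy as np

def lbp_histogram(gray):
    """Basic 3x3 Local Binary Patterns: each interior pixel gets an
    8-bit code by comparing it with its 8 neighbours; return the
    normalised 256-bin histogram of those codes."""
    h, w = gray.shape
    center = gray[1:-1, 1:-1].astype(np.int16)
    codes = np.zeros(center.shape, dtype=np.uint8)
    # clockwise neighbour offsets starting at the top-left
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                  (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(neighbours):
        shifted = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx].astype(np.int16)
        codes |= (shifted >= center).astype(np.uint8) << bit
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()
```

A real liveness detector would compute these histograms over detected face regions of both genuine and spoofed training images and train a binary classifier on them.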
Image Categorization Using Gist Descriptors
I created a multi-class SVM model using libSVM for categorizing images. I optimized for the C and G parameters using grid search and used the RBF kernel. The classes are 1) animal 2) floral 3) landscape 4) portrait. My training set is 100 images from each category, and for each image, I extracted a 920-length vector using Lear's Gist Descriptor C code: http://lear.inrialpes.fr/software. Upon testing my model on 50 images/category, I achieved ~50% accuracy, which is twice as good as random (25% since there are four classes). I'm relatively new to computer vision, but familiar with machine learning techniques. Any suggestions on how to improve accuracy effectively? Thanks so much and I look forward to your responses!
This is a very, very open research challenge, and there isn't necessarily a single answer that is theoretically guaranteed to be better. Given your categories it's not a bad start, but keep in mind that Gist was originally designed as a global descriptor for scene classification (albeit one that has empirically proven useful for other image categories). On the representation side, I recommend trying color-based features such as patch-based histograms, as well as popular low-level gradient features like SIFT. If you're just beginning to learn about computer vision, then I would say SVM is plenty for what you're doing, depending on the variability in your image set, e.g. illumination, view angle, focus, etc.
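The patch-based color histograms suggested above could be sketched as follows (numpy; the grid size and bin count are arbitrary choices). The resulting vector can simply be concatenated with the 920-dim Gist vector before training the SVM:

```python
import numpy as np

def patch_color_hist(img, grid=4, bins=8):
    """Split an HxWx3 uint8 image into grid x grid patches and
    concatenate a normalised per-channel histogram from each patch."""
    h, w, _ = img.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            patch = img[i * h // grid:(i + 1) * h // grid,
                        j * w // grid:(j + 1) * w // grid]
            for ch in range(3):
                hist, _ = np.histogram(patch[..., ch], bins=bins, range=(0, 256))
                feats.append(hist / max(hist.sum(), 1))
    # feature length = grid * grid * 3 * bins (384 with the defaults)
    return np.concatenate(feats)
```

The per-patch layout preserves some spatial information (e.g. sky tends to be in the top patches of a landscape), which a single global histogram would lose.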
I need a function that describes a set of sequences of zeros and ones?
I have multiple sets with a variable number of sequences. Each sequence is made of 64 numbers that are either 0 or 1, like so:

Set A
sequence 1: 0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0
sequence 2: 0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
sequence 3: 0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0
...
Set B
sequence 1: 0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
sequence 2: 0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0
...

I would like to find a mathematical function that describes all possible sequences in a set (maybe even predicts more) and that does not match the sequences in the other sets. I need this because I am trying to recognize different gestures in a mobile app based on the cells in a grid that have been touched (1 touch / 0 no touch). The sets represent the gestures, and the sequences are a limited sample of the variations within each gesture. Ideally the function describing the sequences in a set would allow me to test user touches against it to determine which set/gesture they are part of. I searched for a solution using either Excel or Mathematica, but being very ignorant about both, and about mathematics in general, I am looking for direction from an expert. Suggestions for basic documentation on the subject are also welcome.
It looks as if you are trying to treat what is essentially 2D data in 1D. For example, let s1 represent the first sequence in set A in your question. Then the command ArrayPlot[Partition[s1, 8]] plots the gesture as an 8x8 grid. The other sequences in the same set produce similar plots, while a sequence from the second set produces, under the same operations, a visibly different picture.

I don't know what sort of mathematical function you would like to define to describe these pictures, but I'm not sure that you need one if your objective is to recognise user gestures. You could do something much simpler, such as calculating the 'average' picture for each of your gestures. One way to do this would be to calculate the average value for each of the 64 pixels across the pictures. Suppose there are 6 sequences in your set A describing gesture A. Sum the sequences element by element; you will now have a sequence with values ranging from 0 to 6. Divide each element by 6. Now each element represents a sort of probability that a new gesture, one you are trying to recognise, will touch that pixel. Repeat this for all the sets of sequences representing your set of gestures.

To recognise a user gesture, simply compute the difference between the sequence representing the gesture and each of the sequences representing the 'average' gestures. The smallest (absolute) difference will direct you to the gesture the user made.

I don't expect that this will be entirely foolproof; it may well leave some user gestures ambiguous or unrecognisable, and you may want to try something more sophisticated. But I think this approach is simple and probably adequate to get you started.
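The averaging-and-nearest-template scheme above is language-agnostic; here is a short Python/numpy sketch of the same idea (the gesture labels and toy data are placeholders):

```python
import numpy as np

def make_template(sequences):
    """Average several 64-element 0/1 sequences of one gesture into a
    per-cell touch-probability template."""
    return np.mean(np.asarray(sequences, dtype=float), axis=0)

def classify(touches, templates):
    """Return the gesture whose template has the smallest total
    absolute difference from the observed 0/1 touch sequence."""
    touches = np.asarray(touches, dtype=float)
    return min(templates, key=lambda g: np.abs(touches - templates[g]).sum())

# toy example with two made-up gestures
templates = {
    "A": make_template([[1] * 32 + [0] * 32, [1] * 30 + [0] * 34]),
    "B": make_template([[0] * 32 + [1] * 32]),
}
# classify([1] * 32 + [0] * 32, templates) picks the nearest template, "A"
```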
In Mathematica the following expression will enumerate all the possible combinations of {0,1} of length 64:

Tuples[{1, 0}, {64}]

But there are 2^64, or 18446744073709551616, of them, so I'm not sure what use that will be to you. Maybe you just wanted the unique sequences contained in each set; in that case all you need is the Mathematica Union[] function applied to the set. If you have the sets grouped together in a list in Mathematica, say mySets, then you can apply Union to every set in the list using the map operator:

Union /@ mySets

If you want to do some type of prediction, a little more information might be useful.

Thank you for the clarifications.

Machine Learning

The task you want to solve falls under disciplines known by a variety of names, but probably most commonly as Machine Learning or Pattern Recognition, and if you know which examples represent the same gestures, your case would be known as supervised learning. Question: in your case, do you know which gesture each example represents? You have a series of examples for which you know a label (the form of gesture it is), from which you want to train a model and use that model to assign an unseen example to one of a finite set of classes, in your case one of a number of gestures. This is typically known as classification.

Learning Resources

There is a very extensive background of research on this topic, but a popular introduction to the subject is Pattern Recognition and Machine Learning by Christopher Bishop. Stanford has a series of machine learning video lectures available on the web.

Accuracy

You might want to consider how you will determine the accuracy of your system at predicting the type of gesture for an unseen example. Typically you train the model using some of your examples and then test its performance on examples the model has not seen. Two of the most common methods for doing this are 10-fold cross-validation and repeated 50/50 holdout. Having a measure of accuracy enables you to compare one method against another to see which is superior. Have you thought about what level of accuracy you require for your task: is 70% enough, 85%, 99%, or better?

Machine learning methods are typically quite sensitive to the specific type of data and the number of examples you have to train the system with; the more examples, generally, the better the performance. You could try the method suggested above and compare it against a variety of well-proven methods, among which would be random forests, support vector machines, and neural networks. All of these, and many more, are available to download in a variety of free toolboxes.

Toolboxes

Mathematica is a wonderful system, infinitely flexible and my favourite environment, but out of the box it doesn't have a great deal of support for machine learning. I suspect you will make progress much more quickly by using a toolbox designed for machine learning. Two of the most popular free toolboxes are WEKA and R; both support more than 50 different methods for solving your task, along with methods for measuring the accuracy of the solutions. With just a little data reformatting, you can convert your gestures to a simple file format called ARFF, load them into WEKA or R, and experiment with dozens of different algorithms to see how each performs on your data. The Explorer tool in WEKA is definitely the easiest to use, requiring little more than a few mouse clicks and typing some parameters to get started. Once you have an idea of how well the established methods perform on your data, you have a good baseline to compare a customised approach against, should they fail to meet your criteria.

Handwritten Digit Recognition

Your problem is similar to a very well researched machine learning problem known as handwritten digit recognition. The methods that work well on that public data set of handwritten digits are likely to work well on your gestures.
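The "little data reformatting" needed to get the gestures into WEKA could be as simple as this sketch (pure Python; the relation and attribute names are made up for illustration):

```python
def to_arff(path, gesture_sets):
    """Write gesture sequences as a WEKA ARFF file.
    gesture_sets: dict mapping gesture label -> list of 64-element
    0/1 sequences."""
    with open(path, "w") as f:
        f.write("@RELATION gestures\n\n")
        # one nominal {0,1} attribute per grid cell
        for i in range(64):
            f.write("@ATTRIBUTE cell%d {0,1}\n" % i)
        # the class attribute lists every gesture label
        f.write("@ATTRIBUTE gesture {%s}\n\n" % ",".join(gesture_sets))
        f.write("@DATA\n")
        for label, sequences in gesture_sets.items():
            for seq in sequences:
                f.write(",".join(str(v) for v in seq) + "," + label + "\n")
```

The resulting file loads directly into the WEKA Explorer, with the final attribute used as the class to predict.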
Obstacle avoidance using 2 fixed cameras on a robot
I will soon start working on a robotics project which involves a mobile robot with 2 cameras (1.3 MP) mounted at a fixed distance of 0.5 m from each other. I also have a few ultrasonic sensors, but they have only a 10-meter range and my environment is rather large (as an example, take a large warehouse with many pillars, boxes, walls, etc.). My main task is to identify obstacles and also find a roughly "best" route that the robot must take in order to navigate in a "rough" environment (the ground floor is not smooth at all). All the image processing is done not on the robot but on a computer with an NVIDIA GT425 and 2 GB RAM. My questions are:
1. Should I mount the cameras on a rotating support, so that they take pictures over a wider angle?
2. Is it possible to create a reasonable 3D reconstruction based on only 2 views at such a small distance between them? If so, to what degree can I use this for obstacle avoidance and best-route construction?
3. If a roughly accurate 3D representation of the environment can be made, how can it be used to create a map of the environment? (Consider the following example: the robot must sweep a fairly large area, and it would be energy efficient if it did not go through the same place (or course) twice; however, when a 3D reconstruction is made from one direction, how can it tell whether it has already been there if it comes from the opposite direction?)
I have found this response on a similar question, but I am still concerned with the accuracy of the 3D reconstruction (for example, a couple of boxes situated at 100 m, considering the small resolution and distance between the cameras). I am just starting to gather information for this project, so if you have worked on something similar, please give me some guidelines (and some links :D) on how I should approach this specific task. Thanks in advance, Tamash
If you want to do obstacle avoidance, it is probably easiest to use the ultrasonic sensors. If the robot is moving at speeds suitable for a human environment, their 10 m range gives you ample time to stop the robot. Keep in mind that no system will guarantee that you don't accidentally hit something.

(2) Is it possible to create a reasonable 3D reconstruction based on only 2 views at such a small distance between them? If so, to what degree can I use this for obstacle avoidance and best-route construction?

Yes, this is possible. Have a look at ROS and their vSLAM. http://www.ros.org/wiki/vslam and http://www.ros.org/wiki/slam_gmapping would be two of many possible resources.

"however, when a 3D reconstruction is made from one direction, how can it tell if it has already been there if it comes from the opposite direction"

Well, you are trying to find your position given a measurement and a map. That should be possible, and it wouldn't matter from which direction the map was created. However, there is the loop closure problem: because you are creating a 3D map at the same time as you are trying to find your way around, you don't know whether you are at a new place or at a place you have seen before.

CONCLUSION

This is a difficult task! Actually, it's more than one. First you have simple obstacle avoidance (i.e. don't drive into things). Then you want to do simultaneous localisation and mapping (SLAM; read Wikipedia on that), and finally you want to do path planning (i.e. sweeping the floor without covering an area twice). I hope that helps!
(1) I'd say no if you mean each camera rotating independently: you won't get the accuracy you need to do the stereo correspondence, and it would make calibration a nightmare. But if you want the whole "head" of the robot to pivot, that may be doable, provided you have good encoders on the joints.

(2) If you use ROS, there are some tools which help you turn the two stereo images into a 3D point cloud: http://www.ros.org/wiki/stereo_image_proc. There is a tradeoff between your baseline (the distance between the cameras) and your resolution at different ranges: a large baseline gives greater resolution at large distances, but it also has a large minimum distance. I don't think I would expect more than a few centimeters of accuracy from a static stereo rig, and this accuracy only gets worse when you compound it with the robot's location uncertainty.

(2.5) For mapping and obstacle avoidance, the first thing I would try to do is segment out the ground plane: the ground plane goes to mapping, and everything above it is an obstacle. Check out PCL for some point cloud operating functions: http://pointclouds.org/. If you can't simply put a planar laser on the robot like a SICK or Hokuyo, then I might try to convert the 3D point cloud into a pseudo laser scan and use some off-the-shelf SLAM instead of trying to do visual SLAM. I think you'll have better results.

Other thoughts: now that the Microsoft Kinect has been released, it is usually easier (and cheaper) to simply use that to get a 3D point cloud instead of doing actual stereo. This project sounds a lot like the DARPA LAGR program (Learning Applied to Ground Robots). That program is over, but you may be able to track down papers published from it.
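The pseudo-laser-scan conversion could be sketched like this (numpy; it assumes the cloud is already in the robot frame with z up, and the ground-height threshold and beam count are arbitrary illustrative choices, not values from any library):

```python
import numpy as np

def pseudo_laser_scan(points, n_beams=180, ground_z=0.05):
    """Collapse an N x 3 point cloud (robot at origin, z up) into a
    planar scan: for each azimuth bin, the range to the nearest point
    sitting above the assumed ground plane."""
    obstacles = points[points[:, 2] > ground_z]           # drop ground returns
    angles = np.arctan2(obstacles[:, 1], obstacles[:, 0])  # -pi .. pi
    ranges = np.hypot(obstacles[:, 0], obstacles[:, 1])
    bins = ((angles + np.pi) / (2 * np.pi) * n_beams).astype(int) % n_beams
    scan = np.full(n_beams, np.inf)                        # inf = no return
    for b, r in zip(bins, ranges):
        scan[b] = min(scan[b], r)
    return scan
```

A scan like this can be fed to grid-based SLAM packages that expect planar laser data, though in practice you would want a proper ground-plane fit (e.g. RANSAC in PCL) rather than a fixed height threshold.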