I’ve tried for a while to figure out why computer vision is mostly still in research labs, despite the many thousands of people, algorithms, and codebases devoted to it. One analogy that occurs to me is image compression.
There are an infinite number of ways to compress an image, and each one gives a different result. In principle we could have thousands of people around the world working by themselves on this very hard problem, but it would be better to take a combination of the best ideas and have everyone use that.
While codecs and computer vision seem quite different, they share an important similarity: in the computer vision pipeline, from pre-processing to feature extraction, each step produces a smaller amount of data. At the end of the analysis you might be left with the conclusion that this is an image of your house, which is just a few bytes. That reduction is precisely what a codec does.
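To make that data reduction concrete, here is a toy sketch of a hypothetical pipeline (the stages and thresholds are illustrative assumptions, not anything from a real system), using NumPy and SciPy, that prints how many bytes survive each stage from raw pixels down to a label:

```python
import numpy as np
from scipy import ndimage

# Toy stand-in for a photo of a house: 512x512 grayscale, 8-bit.
image = np.random.randint(0, 256, (512, 512), dtype=np.uint8)

# Stage 1: pre-processing (smoothing) keeps the same amount of data.
smoothed = ndimage.gaussian_filter(image.astype(float), sigma=2)

# Stage 2: a crude edge map, thresholded down to 1 bit per pixel.
edges = ndimage.sobel(smoothed) > 50

# Stage 3: a compact descriptor, here just a 32-bin intensity histogram.
descriptor, _ = np.histogram(smoothed, bins=32)

# Stage 4: the final interpretation, a handful of bytes.
label = "house"

for name, size in [("raw image", image.nbytes),
                   ("edge map (packed)", np.packbits(edges).nbytes),
                   ("descriptor", descriptor.nbytes),
                   ("label", len(label.encode()))]:
    print(f"{name:20s} {size:8d} bytes")
```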
Another similarity is that decoding is much simpler than encoding. Decompressing an image is faster than compressing it, and the encoders can typically get smarter while the decoder doesn’t even realize it. Likewise, we have plenty of software today that can generate a photo-realistic image of a house. The computer is doing the reverse process of what happens in our eyes.
So perhaps we could have thousands of computer vision people around the world each taking an image and extracting data from it, but some combination of their methods would be the best. To be fair, this doesn’t tell us how hard the problem is. Will it take the best ideas of 3 people, or 50?
Answering that involves looking at each piece. Note that there is plenty of good free code for image processing, which is an important piece of computer vision. When it comes to lines and edges, things seem less settled. I suspect there are many workable ways to do this, though, and we should just pick a robust one and move on. [More here]
The best codebase I’ve found for people who want to work on computer vision is http://stefanv.github.com/scikits.image/index.html. It’s built on Python and SciPy and developed with a DVCS.
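As a taste of the lines-and-edges piece, here is a minimal sketch of edge detection with that library. Note that the project has since been renamed scikit-image and is imported as skimage; the calls below follow the current release, which may differ from the version at the link above.

```python
from skimage import data, feature, filters

# A built-in sample image (512x512 grayscale).
image = data.camera()

# Gradient-magnitude edges via a Sobel filter.
sobel_edges = filters.sobel(image)

# Thinned, hysteresis-thresholded edges via the Canny detector.
canny_edges = feature.canny(image, sigma=2.0)

print(sobel_edges.shape, canny_edges.dtype)  # (512, 512) bool
```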
So let’s get going.
Interesting post. Video encoding and computer vision both try to reduce the information in a video. A sufficient understanding of a video allows you to separate out invariants (rigid objects, for example) which do not change from frame to frame.
: The problem is that there are real-time constraints as well. In practice, people pick a solution that works in their particular domain and move on. There is no general solution to, for example, object recognition that works across various domains and in real time.
: Another problem is that there is often no obvious way to unify two different ideas into one coherent solution. For example, there is work on
* real-time simultaneous localisation and mapping
* multiple rigid-body segmentation
* tracking of non-rigid objects
I think we can agree that it would be nice to unify these three algorithms into a single system. But nobody has done it as far as I know. And even if it were done, current hardware may not be powerful enough to run such a system in real time in a realistic scenario.
Hopefully parallel computing will alleviate the real-time problem. Looking at the visual cortex, or the brain in general, the major difference from computers is that neurons operate much more slowly than transistors, but the overall architecture is massively parallel.
My work (HornetsEye) is on creating a Ruby library that makes it easier to try out different approaches. That is, the core is implemented in Ruby as well (although the JIT-friendly style of the implementation is still a bit difficult to read). I hope, for example, to implement the Harris-Stephens corner and edge detector using basic array operations and have it run as fast as existing C++ implementations.
No preview button here. I hope I got the format right.
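For reference, the Harris-Stephens response mentioned above really can be written as a handful of array operations. This is a NumPy/SciPy sketch, not HornetsEye’s Ruby API, and the window size, k, and threshold are illustrative choices:

```python
import numpy as np
from scipy import ndimage

def harris_response(image, sigma=1.0, k=0.04):
    """Harris-Stephens corner response built from plain array operations."""
    img = image.astype(float)

    # Image gradients.
    ix = ndimage.sobel(img, axis=1)
    iy = ndimage.sobel(img, axis=0)

    # Structure-tensor entries, smoothed over a Gaussian window.
    sxx = ndimage.gaussian_filter(ix * ix, sigma)
    syy = ndimage.gaussian_filter(iy * iy, sigma)
    sxy = ndimage.gaussian_filter(ix * iy, sigma)

    # R = det(M) - k * trace(M)^2: large positive at corners,
    # negative along edges, near zero in flat regions.
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace

# Example: mark the strongest 1% of responses as corner candidates.
image = np.random.rand(128, 128)
r = harris_response(image)
corners = r > np.quantile(r, 0.99)
print(corners.sum(), "corner candidates")
```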
I would contest that computer vision is “mostly in research labs”. I guess there’s always going to be something that’s mostly in research labs, like the latest advances in semiconductors are mostly in research labs. But computer vision is used in lots of places, and there are a ton of mainstream products out there. Every web camera now has face detection. There is an iPhone app that recognizes objects and looks them up on Amazon. Computer vision is the technology that makes movies like Avatar possible; any scene with live action and graphics locked together in a movie uses computer vision. It’s used in the latest flying quad rotor. Not to mention all the boring production-line inspection systems that no one thinks about. It’s the engine behind all these augmented reality things that are starting to get popular for advertising. Apple ships facial recognition for photos, Microsoft is shipping Project Natal for Xbox, and we have things like image search which uses vision to group images. It’s everywhere already. It just tends to be the hidden glue behind UIs that magically do the right thing, and is less obvious than other stuff.
Yes, it is true that it has started to be used in a number of places recently. But the efforts are scattered, specialized, and simple (http://www.cs.cmu.edu/~cil/v-source.html).
And the comparison to silicon doesn’t really fit. Computer vision has been kicking around for decades, whereas advances in silicon come out constantly. The semiconductor people have been holding up their end of the bargain just fine.