FCam is a computational camera framework.

posted : Wednesday, February 29th, 2012

posted : Thursday, February 16th, 2012

Experimental Blacki

Messing around with some Kinect / WebGL / audio multimedia publishing. NEW MEDIA.

(Source: davidstolarsky.com)

posted : Thursday, February 9th, 2012

posted : Tuesday, January 31st, 2012

posted : Saturday, January 14th, 2012

Retrospective: CS 7495, Computer Vision

My last assignment this semester in CS 7495 at the Georgia Institute of Technology is to write publicly about my experience in this course.

So first of all, what is “Computer Vision”? That is a real question I get from people who don’t know the first thing about it, and I think it’s something experts should ask, and do ask, as well. “Computer Vision is making cameras understand what they see” is my short answer for the layperson. The lead instructor of the course, Dr. Frank Dellaert, uses the following image to describe what Computer Vision is:

TinyImages

80 Million Tiny Images, MIT

This is the stunning result of the “80 Million Tiny Images” project, which represents the nouns in the English language. Related nouns are placed close to each other, according to an established lexicographic mapping corpus, and each noun is visually represented as the average Google Image Search result for that noun. The big green cluster is made up of nouns that fall under the broad term “plant.” Computer Vision looks for patterns in visual signals in order to gain insights beyond pixel values, without requiring a human to be involved.

Computer Vision has long fascinated me. I was an amateur user of it before this course. After this course, I am a slightly more advanced amateur. As with any graduate level subject, learning it takes more than a lifetime. Indeed it’s being invented every day. Here’s what I learned and what I made in this course.

Computer Vision is applied linear algebra, statistics, and probability. Principal component analysis, sum-of-squared-distances minimization, singular value decomposition, and the Kalman filter are the solutions to most problems. Often a huge amount of data, perhaps mined from the Internet, can produce truly magical results when processed in the right way, as in the case of massive-scale “Structure from Motion,” which in one popular case is referred to as “Photo Tourism.”

Photo Tourism, Microsoft Research, University of Washington

Photo Tourism is probably the most famous and stunning Computer Vision achievement of the last decade. I certainly am a fan, so I will use it as a vehicle to describe some fundamental vision techniques. With the Photo Tourism engine, one simply downloads thousands of image search results for, e.g., “Notre Dame,” feeds the images through the engine, and out comes a 3D model of Notre Dame. Photo Tourism demonstrates what I think is one of the most beautiful features of software as an art form: what was once prohibitively expensive is virtually free with the right software. For a long time, you could survey buildings painstakingly with very expensive equipment and get an accurate 3D representation of them; now you can just squeeze spatial understanding out of huge unstructured Internet image corpora.

Photo Tourism finds the correspondences between images taken from different angles in order to reconstruct a scene’s geometry. The first step in this process is Feature Detection, which is built on the idea that most of the pixels in an image do not distinguish the image in any way. For example, if we’re dealing with images of the night sky, the presence of a black pixel does not say much. The presence of a red pixel, on the other hand, is quite important. In finding similarities between images, you don’t want to waste your time caring about all the unimportant pixels, so you use Feature Detection to decide where in the image you should spend time performing analyses. A standard feature detector is the Harris operator, which responds to corners, i.e. spots where the brightness changes sharply in more than one direction.
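
As a rough illustration of the Harris idea (a minimal pure-Python sketch, not code from the course or from Photo Tourism), the function below scores every pixel of a tiny grayscale image. The 3x3 unweighted window, the constant k = 0.05, and the toy image are assumptions chosen to keep the example short; real detectors typically use a Gaussian-weighted window and follow up with thresholding and non-maximum suppression.

```python
def harris_response(img, k=0.05):
    """Harris corner response for a grayscale image given as a list of rows.

    Positive values suggest corners, negative values edges, and values
    near zero flat regions. Border pixels are left at 0.
    """
    h, w = len(img), len(img[0])
    # Image gradients by central differences (borders stay 0).
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0

    resp = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Sum gradient products over a 3x3 window (an unweighted
            # stand-in for the usual Gaussian window).
            sxx = syy = sxy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            resp[y][x] = det - k * trace * trace
    return resp


# Toy image: dark background with a bright square in the lower right.
image = [[1.0 if (y >= 6 and x >= 6) else 0.0 for x in range(12)]
         for y in range(12)]
response = harris_response(image)
```

In this toy image the response is positive at the square’s corner, negative along its straight edges, and zero in the flat background, which is exactly the “spend your time here, not there” signal described above.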

Once you know what is important in an image, in other words what is a “feature,” you need to compactly describe these features (with bits) so that they can be quickly and sensibly compared. A one-megapixel, 24-bit color image, for example, is made of about 24,000,000 bits, where a bit is either a 0 or a 1. A short, plain English description of an image, however, is stored in perhaps 2,000 bits. In Computer Vision, a descriptor is like this plain English description: it somehow encapsulates information about many bits in orders of magnitude fewer bits. Often descriptors focus on summarizing where and how intense the edges in an image are, i.e. where the color quickly changes from light to dark or vice versa. Edges with high contrast are fundamental in Computer Vision, just as they are in human vision. Look at any line drawing for proof of that.
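
To make the idea of an edge-summarizing descriptor concrete, here is a small sketch loosely in the spirit of gradient-histogram descriptors such as SIFT and HOG, though far simpler than either. The function names, the 8-bin histogram, and the toy patches are assumptions made for illustration only.

```python
import math


def orientation_histogram(patch, bins=8):
    """Summarize a grayscale patch as a normalized histogram of edge directions.

    Each interior pixel votes for the direction of its brightness gradient,
    weighted by how strong that gradient is.
    """
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue  # flat pixel: no edge, no vote
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            hist[b] += mag
    total = sum(hist)
    # Normalizing makes the descriptor insensitive to overall contrast.
    return [v / total for v in hist] if total else hist


def descriptor_distance(a, b):
    """Sum of squared differences between two descriptors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


# Toy 6x6 patches: brightness steps left-to-right or top-to-bottom.
vertical_edge = [[1.0 if x >= 3 else 0.0 for x in range(6)] for y in range(6)]
strong_vertical_edge = [[5.0 if x >= 3 else 0.0 for x in range(6)] for y in range(6)]
horizontal_edge = [[1.0 if y >= 3 else 0.0 for x in range(6)] for y in range(6)]

d_v = orientation_histogram(vertical_edge)
d_v_strong = orientation_histogram(strong_vertical_edge)
d_h = orientation_histogram(horizontal_edge)
```

Each 36-pixel patch is reduced to eight numbers. The two vertical-edge patches get the same descriptor despite their different contrast, while the horizontal-edge patch lands in a different bin, so comparing descriptors is both cheap and meaningful.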

Edge Detection

Back to the Photo Tourism example: once we have a bunch of features and their descriptions from a set of random photos of a scene, we finally must figure out which feature descriptions from different images actually refer to the same feature. We could have many thousands of features, or more, so exhaustively comparing all possible combinations is not an option. Enter the wonderful randomized algorithm. A randomized algorithm lets us effectively sample the space of all possible combinations. We simply keep sampling until it seems we’ve found a good matching. It’s not blind guessing, though: each sample proposes a candidate alignment, we score it by how many of the other candidate matches agree with it, and we keep the best one found so far. We stop once further iterations are unlikely to turn up anything better. The accepted way, or family of ways, to do this is RANSAC.
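
The sample-and-score loop behind RANSAC can be sketched in a few lines. This is a deliberately simplified stand-in: it assumes the two images differ only by a 2D shift, whereas a system like Photo Tourism has to estimate full camera geometry. The function name, the tolerance, the iteration count, and the toy matches are all assumptions for illustration.

```python
import random


def ransac_translation(matches, iterations=200, tol=2.0, seed=0):
    """Estimate a 2D shift from candidate matches, some of which are wrong.

    matches: non-empty list of ((x1, y1), (x2, y2)) candidate correspondences.
    Returns the estimated (dx, dy) and the matches that agree with it.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # 1. Sample the minimum needed to propose a model: one match.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # 2. Score the proposal by how many matches agree with it.
        inliers = [m for m in matches
                   if abs(m[1][0] - m[0][0] - dx) <= tol
                   and abs(m[1][1] - m[0][1] - dy) <= tol]
        # 3. Keep the proposal with the largest consensus so far.
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refine the estimate by averaging over the winning consensus set.
    n = len(best_inliers)
    dx = sum(b[0] - a[0] for a, b in best_inliers) / n
    dy = sum(b[1] - a[1] for a, b in best_inliers) / n
    return (dx, dy), best_inliers


# Toy data: 30 matches that agree on a shift of (10, -5), plus 20 bad ones.
good = [((3.0 * i, 2.0 * i), (3.0 * i + 10.0, 2.0 * i - 5.0)) for i in range(30)]
bad = [((float(i), float(i)), (i + 50.0 + 7 * i, i + 40.0 - 9 * i)) for i in range(20)]
shift, inliers = ransac_translation(good + bad)
```

Even though 40% of the toy matches are wrong, any single good sample is backed by all the other good matches, while each bad match is supported only by itself, so the consensus count separates the two cleanly.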

Of course we covered a lot more in Computer Vision than this one (quite stunning) example. For instance, techniques that are similar at a high level can be applied to the problem of figuring out what color something really is. The color a camera observes is always affected by the color of the light in the scene. It’s questionable what “true color” even is, since without colored light, everything is just pitch black. There are ways, however, to predict, for example, what an image taken under tungsten lighting would look like under sunlight.
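
One of the simplest ways to approximate that kind of relighting is a per-channel rescaling (often called a diagonal or von Kries-style model), sketched below. This is an assumption about how one might do it, not necessarily the method covered in the course, and the tungsten and sunlight RGB values are hypothetical numbers picked only for illustration.

```python
def gray_world_white(pixels):
    """Guess the light color as the per-channel mean (gray-world assumption)."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def relight(pixels, source_white, target_white):
    """Rescale each channel so that source_white maps to target_white."""
    gains = [t / s for s, t in zip(source_white, target_white)]
    return [tuple(min(255.0, v * g) for v, g in zip(p, gains)) for p in pixels]


# Hypothetical illuminant colors, chosen only for illustration.
TUNGSTEN = (255.0, 180.0, 110.0)
SUNLIGHT = (255.0, 250.0, 240.0)

# A white object and a mid-gray object photographed under the tungsten light.
tungsten_photo = [(255.0, 180.0, 110.0), (127.5, 90.0, 55.0)]
sunlit_guess = relight(tungsten_photo, TUNGSTEN, SUNLIGHT)
```

When the light color is not known in advance, gray_world_white offers one crude estimate of it: it assumes the scene averages out to gray, which is often wrong but is a common starting point.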

A holy grail in Computer Vision is basically to get computers to describe everything they see in plain English. Again, with a lot of data from the Internet, this is somewhat doable, and somewhat done; see Google Goggles (and a multitude of related demos, e.g. Search by Sketch). This brings up an interesting debate in Computer Vision: really smart software versus huge amounts of data. Some people want to just write code that can figure things out; others say we need to “teach” computers by feeding them massive quantities of data. I’d say both philosophies are very important, but I side with data if I have to. Again, just look to what are, in my opinion, the best vision achievements of the last decade: Photo Tourism, Google Goggles & Co., and Kinect skeleton tracking all involve massive amounts of data.

The projects I did in this course were in the direction of applied vision, computer graphics, and art. My first project was “BodyScanner,” which was a semi-successful attempt to create a Kinect-powered, well, body scanner. The idea was that a person could scan their body into a realistic, animatable 3D model. I met some great people working on this project and, as the overarching vision was my own, I got a little managerial experience as well. In the end, I think we fell a bit short because we spent a lot of time on side issues, and also because we had varying levels of coding proficiency which the work distribution did not reflect. We made serious progress and learned a lot though, and got some very funny results along the way.

My body segmented into basic skeletal parts

Some intentional, some unintentional, but 100% awesome intermediate results of BodyScanner

My last project was Large Tile Photo Mosaics, in which my partner and I developed a system to generate photomosaics in a way that moves beyond the played-out form we’ve seen for about two decades now. A photomosaic is an image reconstructed with many tiny, perhaps unrelated, images. The average color in each tiny image serves as a sort of oversized pixel, at least in the traditional, played-out photomosaic. In our technique, we aim to use the structure inside images so that we can have much larger component images, elevating those images above the status of “glorified pixel.” The following is the work of a painter, which serves as the inspiration for our technique.
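
For reference, the traditional “oversized pixel” approach boils down to matching average colors. The sketch below shows that baseline only, not the large-tile technique from the project; the function names, tile names, and pixel values are made up for illustration.

```python
def average_color(pixels):
    """Mean (r, g, b) of a list of pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def color_ssd(a, b):
    """Sum of squared differences between two colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def choose_tiles(target_cells, tiles):
    """For each cell of the target image, pick the tile whose average
    color is closest to the cell's average color."""
    tile_means = {name: average_color(px) for name, px in tiles.items()}
    chosen = []
    for cell in target_cells:
        cell_mean = average_color(cell)
        chosen.append(min(tile_means,
                          key=lambda name: color_ssd(tile_means[name], cell_mean)))
    return chosen


# Hypothetical tile library: each tile is reduced to a handful of pixels.
tiles = {
    "sunset": [(250, 120, 30), (230, 90, 40)],
    "ocean": [(20, 80, 200), (30, 110, 220)],
    "forest": [(30, 140, 50), (40, 120, 60)],
}
# Target image already cut into three cells of four pixels each.
target_cells = [
    [(240, 100, 20)] * 4,
    [(10, 90, 210)] * 4,
    [(35, 130, 55)] * 4,
]
layout = choose_tiles(target_cells, tiles)
```

Because each tile contributes nothing but its mean color, whatever is actually pictured in it is irrelevant to the match, which is the “glorified pixel” limitation described above.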

United Nations Mural Mosaic, painted by Lewis Lavoie

Our technique uses a segmentation of each image, i.e. a decomposition into its significant spatial components, as our descriptor, which we then compare when looking for good tiles to use in reconstructing a subject image.

Segmentation of a photo of a bowl of tomato soup

We attempted to reconstruct Adam’s face from the tiles Lewis Lavoie painted; this was not quite successful. We did however succeed in a contrived case:

Contrived Large Tile Photo Mosaic: Circle reconstructed from database of simple block shapes

Like I said, after my first formal piece of Computer Vision education, I’m now a slightly more advanced amateur. I did figure out the what/how/why behind many acronyms I was curious about (SIFT, RANSAC, SLAM) and heard some new acronyms representing pretty cool things (HOG, HMM), these all being vision techniques, and then of course the linear algebra techniques that drive it all (SVD, SSD minimization, PCA). Where I’m not yet super comfortable in a given area of computer vision, I at least know where to look to figure it out. Of course we’re all learning as we go in Computer Vision, cutting-edge and young topic as it is. Coming from the media arts world, where some people throw around the phrase “computer vision” like it’s a perfected science, I always tended to say, “wait, slow down, we can’t necessarily do this crazy visual understanding task you ask for in software.” Coming out of my first vision class, though, I see that vision researchers have in fact achieved some truly mind-blowing things. It just takes a heck of a lot of work to transfer cutting-edge, contrived (but not the lesser for it) results into real-world application.

posted : Thursday, December 15th, 2011

posted : Saturday, December 3rd, 2011

posted : Tuesday, November 22nd, 2011

posted : Wednesday, October 12th, 2011

mark looking a bit funny

our body scanner is almost done, but some of the math is off, so things are looking pretty good.

posted : Wednesday, October 12th, 2011

posted : Monday, October 10th, 2011

posted : Monday, October 10th, 2011

oh yea, see me stand

by mark luffel 2011

posted : Wednesday, October 5th, 2011

posted : Saturday, September 24th, 2011

posted : Wednesday, September 14th, 2011