I’m back, just like I said I would be, to post the second part of the interviews with the engineers at Ximira. Today’s blog post centers on Paul York’s answers to some of the same questions. As you can tell if you read the interview with Jagadish (go check it out if you haven’t), the two engineers approach the questions in different ways that highlight the complexity and subjectivity of some aspects of the NOVA system’s development.
In this post I particularly enjoy Paul’s discussion of the design elements that shape how we conceptualize the NOVA system. In designing a piece of accessible technology it is important to keep universal design principles in mind, just as with any other piece of tech. Paul highlights the multimodality and versatility that the NOVA system will bring to the visually impaired community. Read on!
What makes the NOVA system “state of the art” for you?
We are using state-of-the-art technology and AI techniques. With computer vision, we are (or will be, once the bugs are worked out) using both traditional object detection and depth mapping from a special kind of camera with a pair of lenses that work like human eyes to determine how far away objects are. We are leveraging concepts from modern autonomous robotics and augmented/virtual reality.
Paul told me, “Don't read too much into ‘state of the art.’ Just means that we need (and use) the most up-to-date stuff.” While I still like to think of the NOVA system that Paul and Jagadish are designing as state of the art, I think the conversations I’ve had with the engineering team have done a lot to demystify the complexity that is often talked about around Ximira.
Even though these are originally Jagadish’s words, what do you think “state-of-the-art deep learning technology to model the environment by performing object detection and semantic segmentation tasks” means?
Basically deep learning involves “training” a computer to recognize “something”. In this case, we are feeding it photos (often still frames from video) that have had specific “objects” identified in them. For example, you might feed in a bunch of pictures with dogs and tell the computer exactly where in the picture the dog(s) are. Likewise, you also feed it a bunch of “not-dog” pictures...maybe cats or dragons. The computer then constructs a mathematical structure (usually a neural network) that when fed ANY picture can determine if the picture has a dog or not. That’s object detection.
We will use multiple neural networks each capable of detecting (classifying) numerous objects. We will also “train” a few of our own. For example, I believe Jagadish trained a model to recognize curbs.
Semantic segmentation, then, is basically the process of assigning a label to each pixel in an image. I.e., once you’ve determined there IS a dog, you can then label each pixel as belonging to a dog (or not). For video, this allows us to track where things are in each frame.
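To make those two ideas a little more concrete, here is a minimal sketch of both, using pretrained models from the open-source torchvision library rather than anything NOVA-specific; the filename is just a placeholder:

```python
# A minimal sketch (my own illustration, not Ximira's code) of object
# detection and semantic segmentation using pretrained torchvision models.
# "street_scene.jpg" is just a placeholder filename.
import torch
from torchvision import models
from torchvision.transforms import functional as F
from PIL import Image

image = Image.open("street_scene.jpg").convert("RGB")

# Object detection: is there a dog (or person, car, ...) and where?
detector = models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
with torch.no_grad():
    found = detector([F.to_tensor(image)])[0]   # boxes, class labels, confidence scores
for box, label, score in zip(found["boxes"], found["labels"], found["scores"]):
    if score > 0.5:                             # keep only confident detections
        print(f"class {label.item()} at {[round(v) for v in box.tolist()]}, score {score.item():.2f}")

# Semantic segmentation: assign a class label to every pixel.
segmenter = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
x = F.normalize(F.to_tensor(image),
                mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]).unsqueeze(0)
with torch.no_grad():
    logits = segmenter(x)["out"]                # shape: (1, num_classes, height, width)
pixel_labels = logits.argmax(dim=1)[0]          # one class index per pixel
print("label map size:", tuple(pixel_labels.shape))
```

NOVA’s real models are trained on things that matter for navigation (Paul mentions curbs, for example), but the shape of the pipeline is the same idea.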
Why is Ximira special for you? How does the NOVA system stand out from the competition?
There are other folks attempting parts of what we are doing. Ours, I believe, is the first attempt to bring everything together into a single, integrated solution. I put together the following for a different document:
Ximira’s iNtegrated Open Visual Assistant, or NOVA, is a flexible computer vision platform designed to improve the lives of those living with blindness or severe visual impairment. The long-term goal for NOVA is to provide a functional alternative to navigational aids such as a white cane or guide dog, while also providing spatial awareness to allow the user to participate more naturally in a society built for the sighted. The modular nature of NOVA will enable the system to provide incrementally more value to the visually impaired user until our full vision has been realized.
Paul gave this list of important design elements in the NOVA system:
- Handsfree – the system must leave users with the ability to use their hands freely
- Mobile – the system must be capable of operating many hours on battery power
- Unimpairing – the system must not adversely interfere with the users’ senses, especially hearing, touch and remaining vision
- Adaptable – the system must meet the needs of users with varied impairments and abilities
- Extensible – the system must easily accommodate new/upgraded features and peripherals
- Modular – the system should be composed of independent and isolated components to minimize interdependencies and maximize flexibility
- Agnostic – the system should minimize migration costs for supporting new/alternate hardware
- Multimodal – the system should allow user interaction using a variety of mechanisms, including auditory/verbal, haptic/gestural and graphical/touch
- Unobtrusive – the system should not draw undue attention to the user’s visual impairment
- Compact – the system should be small and lightweight enough to wear comfortably all day
- Wireless – the system should not require wires to communicate with peripheral devices
- Cloud-Optional – the system should support cloud-enabled features, but should not depend on an Internet connection for core functionality
- Open – the system should enable an ecosystem of open collaboration and contributions both directly to core design and functionality as well as through turnkey peripheral devices
Part of the inclusive design work being done by the engineering team includes both audio and tactile interaction with the system. While Jagadish has been developing AI for audio processing, Paul has been simultaneously looking at how to translate the same data into vibrations. Currently, a glove design is being tested to determine the best way for users to control the system, and a prototype of that idea is being worked on right now.
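To give a flavor of what translating that data into vibrations could look like, here is a purely hypothetical sketch that maps an obstacle’s direction and distance onto per-finger vibration strengths. The motor layout, names, and numbers are all my own invention for illustration, not the glove’s real design:

```python
# Purely hypothetical sketch (not Ximira's glove code): one way detection
# output could become per-finger vibration intensities.
from dataclasses import dataclass

@dataclass
class Obstacle:
    bearing_deg: float   # -90 (far left) .. +90 (far right) relative to the user
    distance_m: float    # estimated distance from the depth map

# Imaginary layout: one vibration motor per finger, spread left to right.
MOTOR_BEARINGS = {"pinky": -60, "ring": -30, "middle": 0, "index": 30, "thumb": 60}

def vibration_levels(obstacle: Obstacle, max_range_m: float = 5.0) -> dict[str, float]:
    """Map an obstacle to a 0.0-1.0 intensity per motor: closer and more
    in line with a finger's bearing means a stronger buzz."""
    closeness = max(0.0, 1.0 - obstacle.distance_m / max_range_m)
    levels = {}
    for finger, bearing in MOTOR_BEARINGS.items():
        alignment = max(0.0, 1.0 - abs(obstacle.bearing_deg - bearing) / 90.0)
        levels[finger] = round(closeness * alignment, 2)
    return levels

# Something 1.5 m away and slightly to the right buzzes the index finger hardest.
print(vibration_levels(Obstacle(bearing_deg=20, distance_m=1.5)))
```

The real glove will obviously depend on the hardware Paul lands on; this is just to show that the mapping from “what the cameras see” to “what your hand feels” can be a fairly small piece of code.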
How is Ximira similar or different from other popular computer vision technologies such as those being used by companies like Tesla, Google, and Amazon?
Actually, quite similar to Tesla’s self-driving cars. The main difference is that NOVA doesn’t directly “control” the vehicle but simply provides information and instructions. But the basic “intelligence” components are similar.
How do the sensors on the vest work? How does your code take an image and make it into an audio file that I can interpret? How does LIDAR work and why is it important to have in addition to cameras?
In the initial Mira prototype, we had only a single camera module, though it had TWO actual camera sensors in it. As noted above, we really only used one of the two sensors since we didn’t initially leverage the depth data. Each image (“frame” in a motion video) is divided up into a grid of pixels. Each pixel is simply a number telling you what color is at that point in the image. This grid of pixels is called a “bitmap.” Think of it as color by numbers. To reconstruct the image, the computer just takes each of the numbers and shows the associated color at that point in the grid. The bitmap itself is usually quite big, so often you will see the image “compressed.” That’s where the GIF, JPEG, PNG, etc. image formats come in. But in the end, all are just bitmaps.
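In code terms (and this is just my own illustration with the Pillow and NumPy libraries, not NOVA’s pipeline), that grid of numbers looks like this:

```python
# A tiny sketch of the "color by numbers" idea: an image is just a grid of
# pixel values. The filename is a placeholder.
import numpy as np
from PIL import Image

image = Image.open("street_scene.jpg").convert("RGB")
bitmap = np.asarray(image)      # shape: (height, width, 3), one R, G, B value per pixel
print("bitmap shape:", bitmap.shape)
print("pixel at row 0, column 0 (R, G, B):", bitmap[0, 0])
```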
That image file (the matrix of numbers) is fed into the computer vision algorithm. It is often turned into grayscale, shrunk down in size and otherwise made smaller so that the processing runs faster. But that’s basically direct input into the object detection algorithms.
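The “shrink it down” step is easy to picture too; a rough sketch, again not Ximira’s actual preprocessing:

```python
# Typical preprocessing before a detector sees the frame: grayscale plus a
# smaller resolution means far fewer numbers to crunch. Illustrative only.
import numpy as np
from PIL import Image

frame = Image.open("street_scene.jpg").convert("RGB")       # placeholder filename
small_gray = frame.convert("L").resize((320, 240))          # one value per pixel, 320x240
print("original values:", np.asarray(frame).size)           # height * width * 3
print("after preprocessing:", np.asarray(small_gray).size)  # 240 * 320
```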
IF you use both cameras, you can also estimate how FAR each pixel is from the camera. Again, this is the same way your eyes do it. The software just compares the two images to find the differences and then uses trigonometry (the distance between its “eyes” is one side of a triangle) to estimate the distance. The end result is simply adding a distance (depth) to each pixel in the image bitmap. This is often referred to as a “depth map.”
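For the curious, the triangle math Paul describes boils down to depth = (focal length * baseline) / disparity. Here is an illustrative sketch using OpenCV’s basic stereo block matcher; the focal length, baseline, and filenames are assumed example values, not NOVA’s actual calibration:

```python
# Illustrative only: build a disparity map from a left/right image pair and
# turn it into depth with depth = focal_length * baseline / disparity.
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)     # placeholder filenames
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right).astype("float32") / 16.0  # StereoBM returns fixed-point values
disparity[disparity <= 0] = float("nan")                           # mark pixels with no valid match

FOCAL_LENGTH_PX = 700.0   # assumed focal length in pixels
BASELINE_M = 0.06         # assumed distance between the two lenses (6 cm)

depth_m = (FOCAL_LENGTH_PX * BASELINE_M) / disparity    # bigger disparity means a closer object
print("estimated depth at image center (m):",
      float(depth_m[depth_m.shape[0] // 2, depth_m.shape[1] // 2]))
```

The key intuition: the bigger the difference (disparity) between where a point lands in the left and right images, the closer that point is to you.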
That would be INSTEAD of LIDAR. We don’t currently use LIDAR, but it would be an alternative (and more accurate) way of constructing a depth map. LIDAR uses a laser that reflects off of objects and measures the time it takes for the photons to bounce back to the sensor to estimate depth. RADAR with LIght instead of RAdio waves… thus, LIDAR. :blush:
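The time-of-flight arithmetic behind that is simple enough to show in a couple of lines (a generic worked example, not tied to any particular LIDAR unit):

```python
# A pulse travels out and back, so distance is (speed of light * time) / 2.
SPEED_OF_LIGHT_M_PER_S = 299_792_458

def lidar_range_m(round_trip_seconds: float) -> float:
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2

# A return after 20 nanoseconds means the object is about 3 meters away.
print(lidar_range_m(20e-9))   # ~3.0
```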
Note that Tesla doesn’t use LIDAR. Google (Waymo) does. There are pros and cons. Also note that both also use RADAR. We do not.
Bottom line is that the more input data you have, the more accurate your estimation of the “world” around you can be.
Anyone inside or outside the accessible technology world can tell that the NOVA system being developed here at Ximira is meaningfully complex. Paul’s list of essential design elements is not unnecessarily long; creating change in the lives of all people with visual impairments will not be a one-size-fits-all solution. Jagadish and Paul have both discussed the technology upgrades the NOVA device has gone through, and those upgrades have come along as more consideration is given to the variety of people Ximira wants to serve as an organization.
Once again, thanks to Paul York and Jagadish Mahendran, the lead engineers at Ximira, for taking the time to answer the questions that I, and the people I talk to about the NOVA project, keep asking. The physical tech and AI computing that go into turning the world around us into sound and touch are both fun and worth understanding (at least I think so…)