When explaining Ximira to people who are interested in the project, it can be difficult to convey the technical aspects of exactly what makes the device, and the software behind it, unique within the Accessible Technology and Spatial AI communities.
As someone outside of the hard sciences, I wanted to gain a better understanding of the project I was working on and share that information in an accessible way with our community. For this reason, I spent time talking with our co-lead engineers at Ximira, Jagadish and Paul, about what makes the product unique and technologically innovative. Even though I knew there was amazing progress being made behind the scenes, it was eye-opening to hear the answers they gave to these questions.
Given the in-depth responses I got, I’ve chosen to break this post into two separate segments: one for Jagadish and one for Paul. Here are Jagadish’s answers to some questions that I have always wondered about Ximira’s tech:
Why do you think this solution is different from others?
As you know, human vision is a complex process and requires complex computations, which in turn need powerful processors.
Existing solutions do not use deep learning due to processing power limitations, so they rely on traditional computer vision techniques that are not very accurate. They can also be bulky.
Our solution can perform complex perception tasks using the latest deep learning techniques on low-power computing platforms. As a result, the physical setup is simple and does not stand out as an assistive device. We will also be making our solution free and open source. The sensor is placed over the chest instead of at eye level, so we are not blocking any of the user's remaining visual field.
How does the technology work?
We use state-of-the-art deep learning technology to model the environment by performing object detection and semantic segmentation tasks to understand traffic conditions, sidewalk conditions, and other environmental situations. We also use a depth sensor that provides depth data. This helps us know how far away obstacles are and also lets us model the environment in 3D.
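To make that a bit more concrete, here is a rough sketch of how depth data can turn a detected object into a distance estimate. This is just my own illustration, with made-up array shapes and thresholds, not Ximira's actual pipeline:

```python
import numpy as np

def obstacle_distance(depth_map: np.ndarray, box: tuple) -> float:
    """Estimate the distance (in meters) to a detected object.

    depth_map: H x W array of per-pixel depth in meters from the stereo sensor.
    box: (x_min, y_min, x_max, y_max) bounding box from the object detector.
    """
    x_min, y_min, x_max, y_max = box
    region = depth_map[y_min:y_max, x_min:x_max]
    valid = region[region > 0]          # zero often marks pixels with no stereo match
    if valid.size == 0:
        return float("inf")             # no reliable depth for this object
    return float(np.median(valid))      # the median is robust to depth noise

# Example: warn if a detected obstacle is within 2 meters
depth = np.random.uniform(0.5, 10.0, size=(400, 640))  # stand-in for real sensor data
if obstacle_distance(depth, (300, 100, 380, 350)) < 2.0:
    print("Obstacle ahead within 2 meters")
```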
What do you think makes Mira state of the art?
Deep learning is a growing field with continuous research development, and its techniques are constantly being replaced with better ones. In our solution, we have used the latest developments in the field, including deep learning as well as edge AI technology, to keep our system accurate and affordable.
What does “state-of-the-art deep learning technology to model the environment by performing object detection and semantic segmentation tasks” even mean?!
Deep learning is an advanced machine learning technique proven to mimic human cognitive systems such as vision and speech recognition. So far it is the best technology proven to perform human cognitive tasks such as detecting and recognizing objects in an image or recognizing words in an audio clip.
Object detection is the process of localizing objects within an image by drawing bounding boxes around them, for example, drawing bounding boxes around people in an image. Semantic segmentation is the process of classifying each pixel according to the category it belongs to, for example, predicting all the road pixels in an image.
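For readers who like to see code, here is a small example of what object detection and semantic segmentation look like in practice using off-the-shelf pretrained models from torchvision. These are generic models I picked purely for illustration, not the ones Ximira actually runs:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.segmentation import deeplabv3_resnet50

# One RGB frame as a tensor in [0, 1], shape (3, H, W); a real frame would come from the camera.
frame = torch.rand(3, 480, 640)

# Object detection: boxes, class labels, and confidence scores for things like people and cars.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
with torch.no_grad():
    detections = detector([frame])[0]
print(detections["boxes"].shape, detections["labels"][:5], detections["scores"][:5])

# Semantic segmentation: a class prediction for every single pixel (e.g. road vs. not-road).
segmenter = deeplabv3_resnet50(weights="DEFAULT").eval()
with torch.no_grad():
    logits = segmenter(frame.unsqueeze(0))["out"]   # (1, num_classes, H, W)
per_pixel_class = logits.argmax(dim=1)              # (1, H, W) map of class IDs
```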
While humans can do these tasks easily, getting a computer to do them is not so simple. The deep learning process involves collecting large datasets, labeling them, and training on them to obtain an AI model that solves these tasks. We collected a lot of data on our own, and also used datasets available online, to train our models.
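As an illustration of that "label and train" step, here is a deliberately tiny training loop on toy data. Everything here, the dataset shapes, the class labels, and the model, is a stand-in of my own; Ximira's real training setup is far larger:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a labeled dataset: images plus a per-pixel class label for each image.
images = torch.rand(16, 3, 64, 64)                    # collected frames
masks = torch.randint(0, 3, (16, 64, 64))             # hand-labeled classes: 0=background, 1=road, 2=sidewalk
loader = DataLoader(TensorDataset(images, masks), batch_size=4, shuffle=True)

# A deliberately tiny segmentation network; a real model would be much deeper.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, 1),                        # 3 output channels = 3 classes
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)                   # compare per-pixel predictions to labels
        loss.backward()
        optimizer.step()
```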
How is Mira similar to or different from other popular computer vision tech, such as Tesla's self-driving cars or Amazon's sorting robots?
The technology is quite similar. In fact, we have utilized a few self-driving car models in our solution.
What makes Mira special in today's accessibility and technology space? Is it a matter of novel code being developed in house, a specific application of already existing ideas in a new form factor, or something else?
Mira is special in a lot of ways; in a sense, [Ximira] is a combination of all of those. Running deep learning models normally requires specialized hardware such as GPUs, which adds to both the cost and the form factor: imagine walking down the street carrying huge GPUs in a backpack along with fans and a power system, with a high price tag on top. In our solution we have focused on edge AI technology, which can optimize these heavy models so that they run on smaller, cheaper hardware, replacing the need for expensive GPUs. Apart from this, we also offer options for the user interface, including a voice interface and wireless haptic gloves.
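Edge AI here means shrinking and repackaging models so they can run on small hardware. As a simple, generic illustration of the idea (not Ximira's actual toolchain), here is how a PyTorch model can be quantized to 8-bit weights and exported to a portable format that edge runtimes can consume:

```python
import torch

# A small example network standing in for a heavy perception model.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Post-training dynamic quantization: weights are stored as 8-bit integers instead of
# 32-bit floats, shrinking the model and speeding up inference on modest CPUs.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Export to ONNX, a portable format that many edge runtimes and accelerators can load.
dummy_input = torch.rand(1, 512)
torch.onnx.export(model, dummy_input, "model.onnx")
```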
How do the sensors on the vest work? Does your code take an image and make it into an audio file that I can interpret? How does LIDAR work and why is it important to have in addition to cameras?
The OAK-D [sensor from Luxonis] collects images and video just like a webcam connected to a laptop, and it can also provide depth information using stereo images. The collected images and video, along with the depth data, are further processed to detect obstacles and objects and to understand traffic conditions. In critical scenarios, such as an obstacle within close proximity of the user, our solution can detect this from the depth image and alert the user via earphones. We also have wireless haptic gloves, which can potentially replace the earphones. Apart from this, the user can request certain features from the system via voice commands. For example, upon the command “describe”, the system will start to describe what it sees, such as the objects being detected, and provide this information in clock notation.
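To picture the "clock notation" idea, here is a hypothetical sketch of how a detection's position and distance could be turned into a spoken description. The field-of-view value and the function names are my own assumptions for illustration, not Ximira's code:

```python
def clock_direction(box_center_x: float, image_width: int, fov_degrees: float = 70.0) -> int:
    """Map a detection's horizontal position to a clock-face direction.

    Assumes the camera points straight ahead (12 o'clock) with the given horizontal
    field of view; each hour on the clock face spans 30 degrees.
    """
    offset = (box_center_x / image_width - 0.5) * fov_degrees   # degrees left (-) or right (+)
    hour = round(offset / 30.0)                                  # hours away from 12 o'clock
    return 12 if hour == 0 else (12 + hour) % 12 or 12

def describe(label: str, box_center_x: float, distance_m: float, image_width: int = 640) -> str:
    """Build the sentence that would be read out through the earphones."""
    return f"{label} at {clock_direction(box_center_x, image_width)} o'clock, about {distance_m:.0f} meters away"

print(describe("person", box_center_x=480.0, distance_m=3.2))   # "person at 1 o'clock, about 3 meters away"
```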
Hearing from Jagadish about his understanding of the Ximira tech was a treat for me, and one I look forward to repeating. If you are interested in hearing a different take on some of these questions, Paul's answers will come in another blog post soon!