System Component 1: Capture and Perspective Transformation

A project log for Facade: Tactile Interfaces to Appliances

Facade is a crowdsourced fabrication pipeline to automatically generate tactile interfaces to appliances for blind people.

Anhong Guo • 10/09/2016 at 22:15

The first time a user encounters an interface, they use the Facade iOS app to take a photo of the interface with a dollar bill in view, and send the image to be processed and pushed to the crowd for manual labeling. The dollar bill is used to produce an image of the interface warped to appear as if viewed from the front, and to recover size information. We use a dollar bill as the fiducial marker because of its ubiquity, its standard size and appearance, and its rich detail and texture, which provide sufficient feature points for tracking.

Facade uses the SURF (Speeded-Up Robust Features) detector to compute keypoints and feature descriptors in both a standard image of the dollar bill and the input image. The descriptors are then matched with a FLANN-based (Fast Library for Approximate Nearest Neighbors) matcher. By filtering the matches and estimating the perspective transformation between the two images with RANSAC (Random Sample Consensus), our system localizes the dollar bill in the input image and warps the input image to the front perspective for further labeling.

The Facade app also streams images to the backend server, which localizes either side of the dollar bill in each frame and gives blind users real-time feedback on the camera's aim relative to the bill. By reading out instructions such as "not found", "move phone to left/right/up/down/further", and "aiming is good", the app guides the blind user to take a photo from closer to the front perspective, which yields a better warped image after the transformation. The computer vision components are implemented in C++ using the OpenCV library.
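As a concrete sketch of this pipeline, the C++/OpenCV code below detects SURF features, matches them with a FLANN-based matcher, filters matches with Lowe's ratio test, estimates the homography with RANSAC, and warps the photo into the bill's frontal frame; it also shows how projecting the template corners into the photo could drive the spoken aiming prompts. This is our illustration, not the actual Facade source: it assumes OpenCV built with the opencv_contrib xfeatures2d module, and the function names, thresholds, and prompt mapping are hypothetical.

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/xfeatures2d.hpp> // SURF ships in opencv_contrib
#include <string>
#include <vector>

// Illustrative sketch: localize the dollar bill via SURF + FLANN + RANSAC,
// then warp the photo so the bill (and the interface) appear frontal.
cv::Mat warpToFrontPerspective(const cv::Mat& billTemplate, const cv::Mat& photo,
                               std::vector<cv::Point2f>& billInPhoto)
{
    // 1. SURF keypoints and descriptors for the template and the photo.
    cv::Ptr<cv::xfeatures2d::SURF> surf = cv::xfeatures2d::SURF::create(400);
    std::vector<cv::KeyPoint> kpBill, kpPhoto;
    cv::Mat descBill, descPhoto;
    surf->detectAndCompute(billTemplate, cv::noArray(), kpBill, descBill);
    surf->detectAndCompute(photo, cv::noArray(), kpPhoto, descPhoto);

    // 2. FLANN-based matching, filtered with Lowe's ratio test.
    cv::FlannBasedMatcher matcher;
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descBill, descPhoto, knn, 2);

    std::vector<cv::Point2f> billPts, photoPts;
    for (const auto& m : knn) {
        if (m.size() == 2 && m[0].distance < 0.7f * m[1].distance) {
            billPts.push_back(kpBill[m[0].queryIdx].pt);
            photoPts.push_back(kpPhoto[m[0].trainIdx].pt);
        }
    }
    if (billPts.size() < 4) return cv::Mat(); // too few matches: bill not found

    // 3. RANSAC homography mapping template coordinates -> photo coordinates.
    cv::Mat H = cv::findHomography(billPts, photoPts, cv::RANSAC, 3.0);
    if (H.empty()) return cv::Mat();

    // Project the template corners to outline the bill in the photo
    // (used by the aiming feedback below).
    std::vector<cv::Point2f> corners = {
        {0.f, 0.f},
        {(float)billTemplate.cols, 0.f},
        {(float)billTemplate.cols, (float)billTemplate.rows},
        {0.f, (float)billTemplate.rows}};
    cv::perspectiveTransform(corners, billInPhoto, H);

    // 4. Warp the photo into the template's (frontal) frame; output size is
    // kept equal to the input for simplicity.
    cv::Mat warped;
    cv::warpPerspective(photo, warped, H.inv(), photo.size());
    return warped;
}

// Illustrative mapping from the bill's projected outline to spoken prompts.
// Thresholds and the direction convention are assumptions, not Facade's.
std::string aimFeedback(const std::vector<cv::Point2f>& billInPhoto,
                        const cv::Size& frame)
{
    if (billInPhoto.size() != 4) return "not found";
    cv::Rect box = cv::boundingRect(billInPhoto);
    float cx = box.x + box.width / 2.0f;
    float cy = box.y + box.height / 2.0f;
    if (cx < 0.35f * frame.width)  return "move phone to left";
    if (cx > 0.65f * frame.width)  return "move phone to right";
    if (cy < 0.35f * frame.height) return "move phone up";
    if (cy > 0.65f * frame.height) return "move phone down";
    if (box.area() > 0.5 * frame.area()) return "move phone further";
    return "aiming is good";
}
```

In a live setting, a feedback step along these lines would presumably run on each streamed frame, with the full warp reserved for the final captured photo.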

It is important to note that our system only has knowledge of the dollar bill and provides guidance based on its location, without knowing where the interface is. Blind users combine the app's guidance with their own knowledge of where the interface sits relative to the dollar bill to aim the camera and take photos. If the appliance interface is partially cropped out of the photo, crowd workers in the next step will ask the user to take another one. Using a second marker could address this problem, but appliances might not have enough space to fit two markers.
