
Image Captioning for the Visually Impaired People

This project aims to improve accessibility for visually impaired and blind people by developing a novel image captioning system.


Blind and visually impaired individuals often face socioeconomic challenges that can hinder their ability to live independently and participate fully in society. Advances in machine learning, however, have opened up new possibilities for assistive technologies. In this project, we combined image captioning and text-to-speech technologies to build a device that describes a user's surroundings aloud, serving as an aid for people who are visually impaired or blind.

Furthermore, I would like to share my experience optimizing a deep learning model with TensorRT to improve its inference time.

  • 1 × Jetson Xavier NX Developer Kit

  • 1
    Image Captioning Model deployment pipeline

    We used the popular Microsoft COCO 2014 (COCO) benchmark dataset to train the ExpansionNet v2 image captioning model. The dataset consists of 123,287 images, each with five human-annotated captions, for a total of over 600,000 image-text pairs. We split the dataset into training (113,287 images), validation (5,000 images), and test (5,000 images) sets, using the Karpathy splitting strategy for offline evaluation. To generate captions in Kazakh, we translated the original English captions using the freely available Google Translate service.
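    The Karpathy split assigns each COCO image to a train/val/test bucket via a JSON manifest. A minimal sketch of applying it is below; the manifest layout (`images`, `split`, `filename` fields) follows the commonly distributed `dataset_coco.json`, and the file path is an assumption:

```python
import json
from collections import defaultdict

def karpathy_split(manifest):
    """Group image filenames by split; 'restval' is folded into train,
    as is common practice with the Karpathy splits."""
    splits = defaultdict(list)
    for img in manifest["images"]:
        split = "train" if img["split"] == "restval" else img["split"]
        splits[split].append(img["filename"])
    return splits

# Usage with the distributed manifest (path is an assumption):
# manifest = json.load(open("dataset_coco.json"))
# splits = karpathy_split(manifest)
```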

    To train the model for Kazakh captions, we followed the architecture defined in the original ExpansionNet v2 work. A pre-trained Swin Transformer served as the backbone network for extracting visual features from the input images. The model was trained on four V100 graphics processing units (GPUs) on an Nvidia DGX-2 server.

    Finally, the image captioning model, ExpansionNet v2, was deployed on the Nvidia Jetson Xavier NX board. The camera was triggered by pressing the push button to capture an RGB image with a resolution of 640 × 480 pixels. Then, the captured image was resized to 384 × 384 and passed to the ExpansionNet v2 model to generate a caption. Next, the generated caption text was converted into audio, using a text-to-speech model. In our research study, we utilized the KazakhTTS model to convert Kazakh text to speech. Finally, the generated audio was played through the user’s headphones, making it possible for individuals who are blind or visually impaired to comprehend what is in front of them.
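    The capture → caption → speech loop described above can be sketched as follows. The functions passed in (`capture_image`, `preprocess`, `generate_caption`, `synthesize_speech`, `play_audio`) are hypothetical stand-ins for the camera driver, the 384×384 preprocessing, ExpansionNet v2 inference, KazakhTTS, and audio playback, respectively:

```python
def run_pipeline(capture_image, preprocess, generate_caption,
                 synthesize_speech, play_audio):
    """One button press: capture an image, caption it, speak the caption."""
    frame = capture_image()             # 640x480 RGB frame from the camera
    batch = preprocess(frame)           # resize to 384x384, normalize
    caption = generate_caption(batch)   # ExpansionNet v2 inference
    audio = synthesize_speech(caption)  # KazakhTTS text-to-speech
    play_audio(audio)                   # play through the user's headphones
    return caption
```

    Keeping each stage behind a plain function makes it easy to swap, say, the PyTorch model for a TensorRT engine later without touching the rest of the loop.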

  • 2
    ONNX overview

    ONNX (Open Neural Network Exchange) is an open format for machine learning and deep learning models. It lets you convert models from frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras into a single common format.

    The workflow consists of the following steps:

    • Convert the regular PyTorch model file to the ONNX format. The ONNX conversion script is available here.
    • Create a TensorRT engine from the ONNX file using the trtexec utility:
    trtexec --onnx=./model.onnx --saveEngine=./model_fp32.engine --workspace=200
    • Run inference from the TensorRT engine.
  • 3
    Inference Optimization using TensorRT

    TensorRT is a high-performance deep learning inference engine developed by NVIDIA. It optimizes neural network models and generates highly optimized inference engines that can run on NVIDIA GPUs. TensorRT uses a combination of static and dynamic optimizations to achieve high performance, including layer fusion, kernel auto-tuning, and precision calibration.

    PyTorch, on the other hand, is a popular deep learning framework that is widely used for research and development. PyTorch provides a dynamic computational graph that allows users to define and modify their models on the fly, which makes it easy to experiment with different architectures and training methods.

    In our tests, the TensorRT engine provided faster inference than the PyTorch model, taking roughly 50% less time to process each image, while also producing a smaller file on disk.
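    A comparison like the one above can be reproduced with a simple timing harness. `infer` below is a placeholder for either the PyTorch forward pass or the TensorRT engine execution; warm-up iterations are excluded so one-time initialization costs don't skew the average:

```python
import time

def average_latency(infer, inputs, warmup=3, runs=10):
    """Average per-call latency in milliseconds, after warm-up."""
    for x in inputs[:warmup]:
        infer(x)                       # warm-up: excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        for x in inputs:
            infer(x)
    total = time.perf_counter() - start
    return total / (runs * len(inputs)) * 1000.0

# Usage (hypothetical callables for the two backends):
# print(average_latency(pytorch_model_infer, sample_images))
# print(average_latency(tensorrt_engine_infer, sample_images))
```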

    In a nutshell, if speed and efficiency are your primary concerns, then TensorRT may be the better choice. The optimized engine is fast enough for real-time assistive applications like this one.

    During inference, you can check the current performance of Nvidia Jetson boards using the jetson-stats utility. It lets you monitor the resources your models are using in real time and get maximum utilization out of your hardware.
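    jetson-stats is installed via pip and launched with the `jtop` command; the setup below assumes a standard JetPack image on the Xavier NX:

```shell
sudo pip3 install -U jetson-stats   # install the jetson-stats package
sudo reboot                         # its background service starts after a reboot
jtop                                # interactive monitor: CPU, GPU, memory, power
```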

