• Debugging Language Conditioning in GR00T Multitask Training

    Vipin M · 10/20/2025 at 16:15

    When your robot ignores “do not pick up the cheese” and picks it up anyway: A journey through frozen VLM backbones and the limits of action-only fine-tuning

    Project Overview

    This project log documents the discovery and resolution of a critical language conditioning failure in NVIDIA’s GR00T N1.5 vision-language-action (VLA) model during multitask training. The investigation revealed fundamental limitations in training configurations that freeze the vision-language backbone, and provides solutions for enabling proper language-conditioned robotic manipulation.

    The debugging process involved systematic testing with dual-ingredient scenarios (cheese and bread), analysis of model architecture and data flow, and identification of the root cause: frozen Eagle VLM backbone preventing task-specific language-vision association learning.

    This work is part of the LeIsaac project - building a multi-ingredient sandwich assembly robot using Isaac Sim, Isaac Lab, and VLA models with an SO-ARM 101 robotic arm.

    Hardware and Software Stack

    • Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
    • Cameras: Dual camera system (wrist + scene) at 640x480, 30fps
    • GPU: RTX 4080 Super with 16GB VRAM
    • Model: NVIDIA GR00T N1.5-3B (Vision-Language-Action model)
    • Framework: Isaac-GR00T + LeRobot v3.0
    • Training: LoRA fine-tuning on custom datasets
    • Task: Multitask pick-and-place (cheese vs bread)

    The Challenge: Multitask Language Conditioning

    Why Multitask Learning?

    The sandwich assembly task requires the robot to manipulate multiple ingredients based on language instructions:

    • “Pick up the cheese and place it in the white plate”
    • “Pick up the bread and place it in the white plate”
    • “Stack the cheese on the bread”

    This requires the model to:

    1. Understand language instructions - differentiate “cheese” vs “bread”
    2. Ground language to vision - recognize which object is cheese vs bread
    3. Execute task-specific actions - different manipulation strategies per ingredient

    Training Setup

    Datasets:

    • Cheese dataset: 50 episodes, 14,212 frames, task: “Pick slice of yellow cheese and place it in the white plate”
    • Bread dataset: 50 episodes, 13,483 frames, task: “Pick slice of bread and place it in the white plate”

    Training configuration:

    python scripts/gr00t_finetune.py \
        --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
        --num-gpus 1 \
        --max-steps 10000 \
        --data-config so100_dualcam \
        --batch-size 16 \
        --lora-rank 32 \
        --balance-dataset-weights \
        --balance-trajectory-weights
    

    The LeRobotMixtureDataset automatically balances sampling across both datasets during training.
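
    As a rough illustration of what balanced mixture sampling does (a hedged sketch, not the actual LeRobotMixtureDataset implementation), the snippet below weights each frame inversely to the size of its dataset so both datasets contribute roughly equally per batch; the frame counts are the ones listed above and the stand-in tensors replace real episodes.

    # Hedged sketch: balance two datasets of unequal length so each is sampled
    # about equally often. Not the actual LeRobotMixtureDataset code.
    import torch
    from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

    cheese = TensorDataset(torch.zeros(14212, 1))  # stand-in for the 14,212-frame cheese dataset
    bread = TensorDataset(torch.zeros(13483, 1))   # stand-in for the 13,483-frame bread dataset
    mixture = ConcatDataset([cheese, bread])

    # Each frame's weight is 1 / len(its dataset), so both datasets get equal total probability mass.
    weights = torch.cat([
        torch.full((len(cheese),), 1.0 / len(cheese)),
        torch.full((len(bread),), 1.0 / len(bread)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(mixture), replacement=True)
    loader = DataLoader(mixture, batch_size=16, sampler=sampler)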

    Phase 1: Problem Discovery

    Initial Testing

    After training a multitask model for 3,000 steps, I deployed it on the physical robot and tested with different language instructions:

    Test 1: “pick up the yellow cheese and put it into the white plate”

    • Result: ✅ Robot picks up cheese

    Test 2: “pick up the bread and put it into the white plate”

    • Result: ❌ Robot picks up cheese (ignores instruction!)

    Test 3: “do not pick up the cheese”

    • Result: ❌ Robot picks up cheese (completely ignores negation!)

    Critical observation: The model’s behavior was 100% determined by visual state, with 0% influence from language instruction.

    Hypothesis: Visual State Machine

    The robot appeared to be using a simple position-based heuristic:

    IF (object detected in plate):
        STOP (task complete)
    ELSE IF (object detected in holder):
        GRASP object → MOVE to plate → RELEASE
    ELSE:
        SEARCH randomly
    

    This suggested the model learned visual patterns rather than language-conditioned behavior.

    Phase 2: First Fix Attempt - The Diffusion Model Flag

    Discovery of --no-tune_diffusion_model

    Investigating the training script revealed a suspicious flag:

    TRAIN_CMD="python scripts/gr00t_finetune.py \
        --dataset-path ${DATASET_PATHS} \
        --no-tune_diffusion_model \  # ←...
    Read more »

  • Building a Sandwich Assembly Simulation for Robotic Manipulation

    Vipin M · 10/16/2025 at 20:30

    From USD scene creation to MimicGen integration: A complete implementation of multi-ingredient manipulation in Isaac Sim

    Project Overview

    This project implements a complete sandwich assembly simulation environment in Isaac Lab, designed to train robotic manipulation policies for multi-step food preparation tasks. The development involved creating a custom USD scene with proper physics configuration, implementing MimicGen integration for data augmentation, and solving critical challenges with rigid body hierarchies and API compatibility.

    This project log documents the systematic development process, from initial scene setup through the final dynamic ingredient selection feature, including all the debugging challenges and solutions encountered along the way.

    The simulation successfully supports teleoperation, demonstration recording, MimicGen annotation, and automated data generation for training vision-language-action (VLA) models on sandwich assembly tasks.

    Hardware and Software Stack

    • Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
    • Cameras: Dual camera system (wrist + front) at 640x480, 30fps
      • Front camera calibrated for Nexigo N60 webcam (78° FOV)
      • Wrist camera for close-up manipulation view
    • GPU: RTX 4080 Super with 16GB VRAM
    • Simulation: Isaac Sim 5.0 + Isaac Lab framework
    • Framework: LeIsaac (custom robotics framework built on Isaac Lab)
    • Task: Multi-ingredient sandwich assembly with 4 ingredients (2× bread, cheese, patty)

    The Challenge: Multi-Ingredient Manipulation

    Why Sandwich Assembly?

    Sandwich assembly represents a complex manipulation task that requires:

    • Sequential manipulation: Multiple pick-and-place operations
    • Spatial reasoning: Proper stacking order and alignment
    • Object diversity: Different ingredient types with varying properties
    • Real-world relevance: Applicable to food preparation and assembly tasks
    • VLA training: Language-conditioned manipulation (“pick up the cheese”, “place the patty”)

    Technical Requirements

    1. USD Scene: Custom kitchen environment with proper physics
    2. Multiple Objects: 4 ingredients + plate + holder, each with correct physics properties
    3. MimicGen Integration: Subtask annotation and data augmentation
    4. Dynamic Configuration: Support for different ingredient types
    5. Camera Setup: Optimal viewing angles for manipulation

    Phase 1: USD Scene Creation

    Initial Scene Setup (Commit 1c52342)

    Development began with creating the basic task structure and scene configuration. The initial implementation involved:

    Created Files:

    • assemble_sandwich_env_cfg.py - Main environment configuration
    • assemble_sandwich_mimic_env_cfg.py - MimicGen variant with subtask configs
    • README.md - Complete documentation (614 lines)

    Key Configuration:

    # Scene loading with automatic USD parsing
    parse_usd_and_create_subassets(KITCHEN_WITH_SANDWICH_USD_PATH, self)
    

    This function automatically detects all rigid body objects in the USD scene and creates corresponding configuration objects, eliminating manual object registration.
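
    A hedged sketch of the idea behind this kind of USD parsing (the real parse_usd_and_create_subassets in LeIsaac may differ): walk the stage and collect every prim that carries the physics RigidBodyAPI schema.

    # Hedged sketch of automatic rigid-body discovery in a USD stage.
    # The actual parse_usd_and_create_subassets() may work differently.
    from pxr import Usd, UsdPhysics

    def find_rigid_bodies(usd_path: str) -> list[str]:
        """Return prim paths of all rigid-body objects in the scene."""
        stage = Usd.Stage.Open(usd_path)
        rigid_paths = []
        for prim in stage.Traverse():
            if prim.HasAPI(UsdPhysics.RigidBodyAPI):
                rigid_paths.append(str(prim.GetPath()))
        return rigid_paths

    # Each returned path would then get a corresponding scene configuration object.
    # print(find_rigid_bodies("kitchen_with_sandwich.usd"))  # illustrative file name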

    Scene Simplification Challenge

    Problem: The original kitchen scene was 37.9 MB with complex fixtures (cabinets, appliances, decorative elements) that slowed simulation and cluttered the workspace.

    Solution: Documented a systematic simplification workflow:

    1. Remove unnecessary kitchen fixtures
    2. Create simple table workspace (1.2m × 0.8m × 0.85m)
    3. Reduce file size to ~5-10 MB (75% reduction)
    4. Optimize for robot simulation performance

    Table Layout Design:

    ┌─────────────────────────────────┐
    │  [Ingredients Holder]     [🍽️] │  ← Left: Holder, Right: Plate
    │  ┌─┬─┬─┬─┐              Plate  │
    │  │🍞│🍞│🥩│🧀│                   │  ← Slots: bread, bread, patty, cheese
    │  └─┴─┴─┴─┘                     │
    │                                 │
    │        Assembly Area            │
    └─────────────────────────────────┘
    

    Physics Configuration (Commit 6a7e4b5)

    The USD scene structure was created with proper physics APIs for all objects:

    Dynamic Objects (movable ingredients):

    • bread_slice_1, ...
    Read more »

  • MimicGen Data Augmentation Pipeline for Robotic Manipulation

    Vipin M · 10/08/2025 at 22:35

    Project Overview

    I implemented a complete MimicGen data augmentation pipeline to generate multiple training demonstrations from a single recorded episode. The goal was to overcome the data scarcity problem in robotic manipulation by automatically creating diverse variations of expert demonstrations.

    This project log documents the systematic implementation of the 4-step MimicGen workflow, from converting demonstrations to IK actions through generating 10x augmented data, and the debugging challenges encountered along the way.

    The pipeline successfully transformed 1 original demonstration into 10 augmented demonstrations with a 71.4% generation success rate, providing rich training data for imitation learning policies.

    Hardware Setup

    • Robot: SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
    • Cameras: Dual camera system (scene + wrist) at 640x480, 30fps
    • GPU: RTX 4080 Super with 16GB VRAM
    • Simulation: Isaac Sim 5.0 + Isaac Lab framework
    • Dataset: Single “lift_cube” demonstration → 10 augmented demonstrations
    • Task: “Pick up 1.5cm cube and lift it 5cm above robot base”

    The Problem: Data Scarcity in Robotic Learning

    Initial Challenge

    Robotic manipulation policies require large amounts of diverse training data, but collecting demonstrations is:

    • Time-consuming: Each episode requires manual teleoperation
    • Limited diversity: Human demonstrations tend to be similar
    • Expensive: Requires expert operators and robot time
    • Insufficient for generalization: Single demonstrations don’t capture task variations

    Traditional approach: Record 50-100 demonstrations manually.
    MimicGen approach: Record 1 demonstration → Generate 10+ variations automatically.

    MimicGen Pipeline Overview

    The 4-Step Workflow

    1. Convert to IK Actions: Transform joint-space actions (6D) to end-effector actions (8D)
    2. Annotate Subtasks: Automatically detect subtask boundaries using termination signals
    3. Generate Augmented Data: Create variations by recombining subtask segments
    4. Convert to Joint Actions: Transform back to joint-space for training
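
    As a rough illustration of step 2, here is a hedged sketch (my own, not MimicGen's code) of how per-frame boolean termination signals can be turned into subtask boundary indices:

    # Hedged sketch: derive subtask boundaries from per-frame termination signals.
    # MimicGen's annotation logic is more involved; this only shows the core idea.
    import numpy as np

    def subtask_boundaries(term_signals: dict[str, np.ndarray]) -> dict[str, int]:
        """For each subtask, return the first frame where its termination signal turns True."""
        boundaries = {}
        for name, signal in term_signals.items():
            true_idx = np.flatnonzero(signal)
            boundaries[name] = int(true_idx[0]) if len(true_idx) else -1  # -1: never completed
        return boundaries

    # Toy 8-frame episode: grasp finishes at frame 3, lift succeeds at frame 6.
    signals = {
        "pick_cube": np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=bool),
        "lift_cube": np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=bool),
    }
    print(subtask_boundaries(signals))  # {'pick_cube': 3, 'lift_cube': 6}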

    Task Structure: Lift Cube

    Subtask 1: pick_cube - Approach and grasp the cube
    Subtask 2: lift_cube - Lift cube above threshold height

    Key Requirements:

    • Cube dimensions: 1.5cm × 1.5cm × 1.5cm
    • Lift threshold: 5cm above robot base
    • Success condition: Cube height > base height + 0.05m

    Debugging Approach

    Step 1: Environment Configuration Issues

    Problem: MimicGen annotation failed with “The final task was not completed” error.

    Root Cause Analysis:

    • Missing lift_cube observation function in environment
    • Incorrect subtask termination signal configuration
    • Height threshold too strict for actual cube size

    Solution: Added lift_cube observation function:

    def lift_cube(
            env: ManagerBasedRLEnv,
            cube_cfg: SceneEntityCfg = SceneEntityCfg("cube"),
            robot_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
            robot_base_name: str = "base",
            height_threshold: float = 0.05) -> torch.Tensor:
        """Check if the cube is lifted above the robot base."""
        cube: RigidObject = env.scene[cube_cfg.name]
        robot: Articulation = env.scene[robot_cfg.name]
        cube_height = cube.data.root_pos_w[:, 2]
        base_index = robot.data.body_names.index(robot_base_name)
        robot_base_height = robot.data.body_pos_w[:, base_index, 2]
        above_base = cube_height - robot_base_height > height_threshold
        return above_base
    

    Step 2: Height Threshold Calibration

    Critical Discovery: The default height threshold (0.20m) was too strict for the actual cube size.

    Investigation Process:

    1. Examined cube model file: /assets/scenes/table_with_cube/cube/model.xml
    2. Found actual dimensions: 0.015077m × 0.015077m × 0.015077m (1.5cm cube)
    3. Calculated appropriate threshold: 0.05m (3.3× cube height)

    Configuration Update:

    # Updated threshold in both environments
    height_threshold: float = 0.05  # Changed from 0.20m
    

    Step 3: MimicGen Configuration Requirements

    Problem: Assertion error during generation: “assert subtask_configs[-1].subtask_term_offset_range[0]...

    Read more »

  • Building a Real-to-Sim Digital Twin for SO-101 Robot Arm in Isaac Sim

    Vipin M · 10/06/2025 at 06:50

    Project Overview

    I worked on implementing a real-to-sim digital twin system for an SO-101 robotic arm using NVIDIA Isaac Sim 4.5.0. The goal was to create a virtual replica that mirrors the physical robot’s movements in real-time, enabling simultaneous control of both real and virtual arms through a leader-follower teleoperation setup.

    This project log documents the complete setup process, debugging challenges, and the implementation of a robust ROS2-based communication pipeline between physical hardware and Isaac Sim.

    Hardware Setup

    • Robot: SO-100/SO-101 robotic arm (6 DOF: 5 arm joints + 1 gripper)
    • Control System: Leader-follower teleoperation setup
    • Device Mappings:
      • /dev/leader - Leader arm for human teleoperation
      • /dev/follower - Follower arm (physical robot being controlled)
      • /dev/wrist - Wrist-mounted camera (video0)
      • /dev/scene - Scene overview camera (video2)
    • GPU: RTX 4080 Super with 16GB VRAM
    • Software: Isaac Sim 4.5.0, ROS2 Humble, CycloneDDS

    The Challenge

    The objective was to create a digital twin where:

    1. Leader arm movements control both physical follower arm AND virtual follower arm
    2. Real-time synchronization with minimal latency
    3. Proper joint state feedback from physical to virtual robot
    4. Seamless integration with existing teleoperation workflow

    Architecture Overview

    The system uses a three-component architecture:

    Component 1: Teleoperation System

    • Reads leader arm positions from /dev/leader
    • Controls physical follower arm via existing teleoperation

    Component 2: Joint State Bridge

    • Reads actual follower arm positions from /dev/follower
    • Publishes joint states to ROS2 topics
    • Bridges physical robot data to Isaac Sim

    Component 3: Isaac Sim Digital Twin

    • Subscribes to joint state commands
    • Renders virtual robot matching physical movements
    • Provides visual feedback and simulation capabilities
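
    A minimal sketch of what Component 2 (the joint state bridge) might look like with rclpy; the joint names and the read_follower_positions() helper are placeholders rather than the actual bridge code.

    # Hedged sketch of a joint-state bridge node: read follower joint angles and
    # publish them as sensor_msgs/JointState for Isaac Sim to consume.
    # read_follower_positions() stands in for the real /dev/follower driver call.
    import rclpy
    from rclpy.node import Node
    from sensor_msgs.msg import JointState

    JOINT_NAMES = ["shoulder_pan", "shoulder_lift", "elbow_flex",
                   "wrist_flex", "wrist_roll", "gripper"]  # assumed SO-101 joint names

    def read_follower_positions() -> list:
        """Placeholder: query the physical follower arm over serial."""
        return [0.0] * len(JOINT_NAMES)

    class JointStateBridge(Node):
        def __init__(self):
            super().__init__("so101_joint_state_bridge")
            self.pub = self.create_publisher(JointState, "/joint_states", 10)
            self.timer = self.create_timer(1.0 / 30.0, self.publish_state)  # 30 Hz

        def publish_state(self):
            msg = JointState()
            msg.header.stamp = self.get_clock().now().to_msg()
            msg.name = JOINT_NAMES
            msg.position = read_follower_positions()
            self.pub.publish(msg)

    def main():
        rclpy.init()
        rclpy.spin(JointStateBridge())

    if __name__ == "__main__":
        main()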

    Debugging Process

    Issue 1: GLIBCXX Library Version Conflicts

    When attempting to run ROS2 nodes from the Isaac Sim conda environment, I encountered:

    ImportError: /home/vipin/miniconda3/envs/isaacsim/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/_rclpy_pybind11.cpython-310-x86_64-linux-gnu.so)
    

    Root Cause: The conda environment’s libstdc++ (version 3.4.26) was older than what system ROS2 required (3.4.30).

    Investigation:

    # Conda environment library
    strings /home/vipin/miniconda3/envs/isaacsim/lib/libstdc++.so.6 | grep GLIBCXX | tail -1
    # Output: GLIBCXX_3.4.26
    
    # System library
    strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX | tail -1
    # Output: GLIBCXX_3.4.30
    

    Solution: Use Isaac Sim’s internal ROS2 libraries instead of system installation:

    # Configure Isaac Sim with internal ROS2 libraries
    export isaac_sim_package_path=$(dirname $(which isaacsim))/../lib/python3.10/site-packages/isaacsim
    export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$isaac_sim_package_path/exts/isaacsim.ros2.bridge/humble/lib
    export PYTHONPATH=$isaac_sim_package_path/exts/isaacsim.ros2.bridge/humble:$PYTHONPATH
    

    This approach avoided library conflicts while maintaining Isaac Sim stability.

    Issue 2: Network Topic Interference

    During initial testing, I discovered unexpected joint states:

    ros2 topic echo /joint_states --once
    # Output showed wheel joints, gripper extension, head swivel - not arm joints!
    

    Root Cause: Another machine on the network was publishing robot topics to the same ROS_DOMAIN_ID.

    Solution: Network isolation using unique domain ID:

    export ROS_DOMAIN_ID=42  # Isolated domain
    ros2 topic list
    # Clean output: only local topics
    

    Issue 3: Joint Name Mismatch

    The most critical debugging challenge was joint name inconsistency. Isaac Sim’s ArticulationController was throwing warnings:

    [Warning] [omni.graph.core.plugin] /so101_new_calib/ROS_JointStates/ArticulationController: [/so101_new_calib/ROS_JointStates] OmniGraph Warning: 'joint_1'
    

    Investigation: Compared published...

    Read more »

  • Debugging Robot “Twitching” in GR00T N1.5 Deployment

    Vipin M · 10/05/2025 at 16:13

    Project Overview

    I worked on debugging a puzzling issue where a fine-tuned NVIDIA GR00T N1.5 model was causing an SO-100 robotic arm to “twitch” instead of performing pick-and-place tasks. The robot would make tiny oscillating movements around the same position, with the gripper staying completely unresponsive.

    This project log documents the systematic debugging process that revealed the root cause: an undertrained model that needed significantly more training steps to learn the complete task sequence.

    Hardware Setup

    • Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
    • Cameras: Dual camera system (scene + wrist) at 640x480, 30fps
    • GPU: RTX 4080 Super with 16GB VRAM
    • Dataset: 20 episodes of pick-and-place demonstrations
    • Task: “Pick up the striped block and put it into the white plate”

    The Problem: Robot Twitching

    Initial Symptoms

    When deploying the trained GR00T model:

    • Robot connected successfully
    • Model inference server running correctly
    • Robot made tiny oscillating movements around the same position
    • Robot was not executing the intended pick-and-place task

    The model had been trained for 2000 steps and showed good loss convergence, but the physical deployment was completely unsuccessful.

    Debugging Approach

    Step 1: Enhanced Logging Implementation

    Added comprehensive logging to both the inference server and robot client to understand what data was being exchanged.

    Server-Side Logging (service.py):

    • Request counter for each inference call
    • Input data keys and shapes
    • Inference time in milliseconds
    • Output action statistics (min/max/mean values)

    Client-Side Logging (eval_lerobot.py):

    • Step counter and observation keys
    • Current robot state (all 6 joints)
    • Received action chunks from server
    • First action being sent to robot

    Example Output:

    [Request #1] Endpoint: get_action
      Inference time: 75.23ms
      Response keys: ['action.single_arm', 'action.gripper']
        action.single_arm: shape=(16, 5), min=-45.23, max=67.89, mean=12.34
        action.gripper: shape=(16, 1), min=-0.30, max=0.50, mean=0.15

    [CLIENT] First action to send to robot:
        shoulder_pan.pos: -12.34
    

    Step 2: Diagnostic Tools Development

    Created several diagnostic scripts to isolate the issue:

    Joint Testing Tool (test_joint.py):

    • Tests individual joint control to verify hardware functionality
    • Takes joint number (1-6) and value (-100 to 100) as input
    • Helps isolate hardware vs. software issues

    Robot State Monitor (monitor_robot_state.py):

    • Real-time monitoring of robot joint positions
    • Verifies encoder readings match values sent to server

    Step 3: Dataset Visualization

    Uploaded the dataset to Hugging Face Hub and used Rerun visualization to inspect the recorded episodes:

    # Upload dataset for analysis
    python scripts/so100_groot/upload_to_huggingface.py \
        --local-dir ~/.cache/huggingface/lerobot/rubbotix/striped-block \
        --repo-id sparkmt/so100-striped-block
    
    # Visualize episodes
    ./scripts/so100_groot/visualize_episodes.sh 0
    

    This revealed the difference between State (robot’s actual position) and Action (commanded target position), which was crucial for diagnosis.

    Critical Discovery: The Root Cause

    Key Finding from Logs

    The robot was making very small, uncertain movements instead of decisive actions. The logging revealed that the model was outputting actions with very small magnitudes, indicating high uncertainty.
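
    To make "very small magnitudes" concrete, here is a hedged diagnostic sketch (my own, not one of the project scripts) that summarizes how much an action chunk actually asks the joints to move relative to the current state:

    # Hedged sketch: quantify how "decisive" a predicted action chunk is.
    # The (16, 6) shape mirrors the logged chunks; the numbers here are synthetic.
    import numpy as np

    def action_chunk_stats(state: np.ndarray, actions: np.ndarray) -> dict:
        deltas = actions - state                 # commanded change per step, per joint
        step_norms = np.linalg.norm(deltas, axis=1)
        return {
            "mean_delta_norm": float(step_norms.mean()),
            "max_abs_delta": float(np.abs(deltas).max()),
            "chunk_span": float(np.linalg.norm(actions[-1] - actions[0])),
        }

    state = np.array([10.0, -5.0, 30.0, 0.0, 15.0, 0.2])
    actions = state + np.random.normal(scale=0.5, size=(16, 6))  # tiny, dithering commands
    print(action_chunk_stats(state, actions))  # small numbers ⇒ twitching, not task progress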

    The Root Cause: Undertrained Model

    Analysis revealed that the model was severely undertrained at 2000 steps.

    Evidence:

    1. Tiny action magnitudes: Model outputting very small actions due to high uncertainty
    2. Lack of task structure understanding: Model hadn’t learned the full sequence (approach → grasp → lift → move → release)
    3. Closed-loop instability: Small errors accumulating, causing the robot to end up in states the model never saw during training

    The Solution: Extended Training

    Training Requirements Analysis

    Task Complexity     Minimum Steps    Recommended Steps
    Simple reaching     1,000-2,000      5,000
    Pick and place      5,000-10,000     ...
    Read more »

  • Fine-Tuning GR00T N1.5 for SO-100 Robot Arm Manipulation

    Vipin M · 10/05/2025 at 15:59

    Project Overview

    I worked on fine-tuning NVIDIA’s GR00T N1.5 model for controlling an SO-100 robotic arm. The project involved dataset preparation, memory optimization for 16GB VRAM constraints, model training with LoRA techniques, and deployment setup for real-world robot control.

    The goal was to train the model to perform pick-and-place manipulation tasks using the instruction “pick up the striped box and put it into the white plate” with dual-camera visual input.

    Hardware Setup

    • Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
    • Cameras: Dual camera system at 640x480, 30fps
    • GPU: RTX 4080 Super with 16GB VRAM
    • Dataset: 20 episodes, 5,197 frames of manipulation demonstrations
    • Model: NVIDIA GR00T N1.5 (3B parameters)

    Dataset Preparation and Debugging

    Issue 1: Blank Visualization Plots

    The dataset visualization script displayed blank canvases for state/action plots.

    Root Cause: The script had hardcoded humanoid robot keys (left_arm, right_arm, left_hand, right_hand) while the SO-100 dataset uses different keys (single_arm, gripper).

    Solution: Modified the visualization function to auto-detect keys from the dataset:

    # Before: hardcoded humanoid keys
    shared_keys = ["left_arm", "right_arm", "left_hand", "right_hand"]
    
    # After: auto-detect from dataset
    if shared_keys is None:
        shared_keys = [key.replace("state.", "") for key in state_dict.keys()]
        print(f"Auto-detected keys to plot: {shared_keys}")
    

    Issue 2: Camera Mapping Discrepancy

    The visualization showed the wrist camera perspective when it should have shown the scene camera.

    Investigation: Checked the dataset’s modality.json mappings and discovered that during data collection, the camera naming was swapped:

    • observation.images.main was actually the wrist/gripper camera
    • observation.images.secondary_0 was actually the scene camera

    Solution: Corrected the mappings in modality.json:

    "video": {    "front": {"original_key": "observation.images.secondary_0"},  // Scene camera    "wrist": {"original_key": "observation.images.main"}          // Wrist camera
    }
    

    Verification: Created a diagnostic script that confirmed the mapping correction by comparing raw video frames with dataset loader output.
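
    A hedged sketch of that kind of check (the paths and camera key are illustrative, and get_loader_frame() is a placeholder for however the dataset loader is queried): read the first frame straight from the raw video and compare it with the frame returned for the "front" key.

    # Hedged sketch: verify a camera-key mapping by comparing a raw video frame
    # with the frame the dataset loader returns for that key.
    import cv2
    import numpy as np

    def first_video_frame(video_path: str) -> np.ndarray:
        cap = cv2.VideoCapture(video_path)
        ok, frame = cap.read()
        cap.release()
        if not ok:
            raise RuntimeError(f"Could not read {video_path}")
        return frame

    def get_loader_frame(camera_key: str) -> np.ndarray:
        """Placeholder: return the first frame the dataset loader yields for camera_key."""
        return first_video_frame("videos/observation.images.secondary_0/episode_000000.mp4")

    raw_scene = first_video_frame("videos/observation.images.secondary_0/episode_000000.mp4")
    loaded_front = get_loader_frame("front")

    diff = np.abs(raw_scene.astype(np.int16) - loaded_front.astype(np.int16)).mean()
    print(f"mean abs pixel diff: {diff:.2f}")  # near zero ⇒ 'front' really is the scene camera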

    Issue 3: Missing Video Metadata

    Dataset loading failed due to missing video metadata fields.

    Solution: Added the required fields to info.json:

    info['features'][key]['info']['video.channels'] = 3
    info['features'][key]['info']['video.height'] = 720
    info['features'][key]['info']['video.width'] = 1280
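
    One way to apply that fix across every video feature might look like the hedged sketch below; the meta/info.json path and the key filter are assumptions, while the three field values are the ones used above.

    # Hedged sketch: add the missing video metadata fields to all video features in info.json.
    import json

    INFO_PATH = "meta/info.json"  # illustrative dataset-relative path

    with open(INFO_PATH) as f:
        info = json.load(f)

    for key, feature in info["features"].items():
        if key.startswith("observation.images."):  # assumed naming of the video features
            feature.setdefault("info", {})
            feature["info"]["video.channels"] = 3
            feature["info"]["video.height"] = 720
            feature["info"]["video.width"] = 1280

    with open(INFO_PATH, "w") as f:
        json.dump(info, f, indent=4)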
    

    Memory Optimization Challenge

    The Problem: CUDA Out of Memory

    Initial training attempts all failed with out-of-memory errors, even with very small batch sizes:

    Attempt    Batch Size    Gradient Accum    Result
    1          64            2                 OOM at step 0
    2          32            4                 OOM at step 0
    3          16            8                 OOM at step 0
    4          8             16                OOM at step 0
    5          4             32                OOM at step 0
    6          2             64                OOM during optimizer step

    Analysis: The base model has 3B parameters, plus a 550M parameter diffusion model. The Adam optimizer requires 2x memory for momentum and variance states, exceeding the 16GB VRAM limit.
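
    A hedged back-of-envelope calculation (my own arithmetic, assuming bf16 weights and gradients with fp32 Adam states, and ignoring activations) of why fully training all of those parameters cannot fit:

    # Rough VRAM estimate if all 3B backbone + 550M diffusion parameters were trainable.
    params = 3.0e9 + 550e6
    weights_gb = params * 2 / 1e9       # bf16 weights            ~7.1 GB
    grads_gb = params * 2 / 1e9         # bf16 gradients          ~7.1 GB
    adam_gb = params * 2 * 4 / 1e9      # fp32 momentum+variance ~28.4 GB
    print(f"~{weights_gb + grads_gb + adam_gb:.0f} GB before activations")  # ~43 GB >> 16 GB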

    Solution: LoRA Fine-Tuning

    Implemented Low-Rank Adaptation (LoRA) to reduce trainable parameters:

    LoRA Configuration:

    --lora-rank 32          # Size of low-rank adaptation matrices
    --lora-alpha 64         # Scaling factor (typically 2x rank)
    --lora-dropout 0.1      # Regularization
    --no-tune_diffusion_model  # Freeze 550M parameter diffusion model
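
    For readers more familiar with Hugging Face PEFT than with the GR00T training script, a roughly equivalent LoRA setup might look like the hedged sketch below; the base model and target_modules are stand-ins, not what gr00t_finetune.py does internally.

    # Hedged sketch: the LoRA hyperparameters above expressed with Hugging Face PEFT.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
    lora_cfg = LoraConfig(
        r=32,              # --lora-rank 32: rank of the low-rank update matrices
        lora_alpha=64,     # --lora-alpha 64: scaling factor, typically 2x rank
        lora_dropout=0.1,  # --lora-dropout 0.1: regularization on the LoRA path
        target_modules=["c_attn"],  # which linear layers get adapters (model-specific)
    )
    model = get_peft_model(base, lora_cfg)
    model.print_trainable_parameters()  # shows the large cut in trainable parameters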
    

    Memory Savings:

    • Full fine-tuning: ~200M trainable parameters
    • LoRA fine-tuning: ~10M trainable parameters (20x reduction)
    • Result: Fits in 16GB VRAM with batch_size=16

    Training Configuration and Results

    Final Training Setup

    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    python scripts/gr00t_finetune.py \
        --dataset-path ./demo_data/example_dataset/ \
        --num-gpus 1 \
        --output-dir ./so100-checkpoints \
        --max-steps 5000 \
        --data-config so100_dualcam \
        --batch-size 16 \
        --gradient-accumulation-steps 8 \
        --learning-rate 0.0001 \
        --no-tune_diffusion_model \
        --lora-rank 32 \
        --lora-alpha 64 \
        --lora-dropout 0.1
    ...
    Read more »

  • Debugging GR00T N1.5 Inference in Phosphobot

    Vipin M · 10/05/2025 at 15:55

    Project Overview

    I worked on debugging inference issues with a fine-tuned NVIDIA GR00T N1.5 model for controlling an SO-100 robotic arm. The model was trained successfully and uploaded to HuggingFace Hub, but the PhosphoBot browser interface was failing during AI control activation.

    This project log documents the debugging process, root cause analysis, and the implementation of an alternative local inference solution.

    Hardware Setup

    • Robot: SO-100 robotic arm (6 DOF: 5 arm joints + 1 gripper)
    • Cameras: Dual camera system (IDs 0 and 2) at 640x480, 30fps
    • GPU: RTX 4080 Super with 16GB VRAM
    • Model: phospho-app/gr00t-example_dataset-h9g75u7gak (fine-tuned GR00T N1.5)

    The Problem

    When clicking “AI Control” in the PhosphoBot browser interface, the system reported:

    Exception: No robot connected. Exiting AI control loop.
    

    The robot was physically connected and visible in the UI, cameras were streaming successfully, and the model had been trained and uploaded to HuggingFace Hub. The issue appeared to be in the inference pipeline.

    Debugging Process

    Issue 1: Joint Count Mismatch

    Added debug logging to understand the failure and discovered:

    Connected joints: 6, Config joints: 1
    

    Root Cause: The code was reading the model configuration incorrectly:

    # Incorrect code
    number_of_joints_in_config = len(
        config.embodiment.statistics.action.action_space.values()
    )
    

    This was counting dictionary keys (max, min, mean, std, q01, q99) instead of joint dimensions.

    Model Config Structure:

    {
      "action_space": {
        "action_space": 6
      }
    }
    

    Solution: Handle the nested dictionary structure correctly:

    # Fixed code
    action_space = config.embodiment.statistics.action.action_space
    
    # Case 1: action_space is a dict with 'action_space' key containing the number
    if isinstance(action_space, dict) and 'action_space' in action_space:
        number_of_joints_in_config = action_space['action_space']
    # Case 2: action_space has 'max' or 'min' arrays
    elif hasattr(action_space, 'max') and isinstance(action_space.max, list):
        number_of_joints_in_config = len(action_space.max)
    # Additional fallback cases...
    

    Issue 2: Device Mismatch on Modal Server

    After fixing the joint count, a new error appeared:

    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
    

    Root Cause:

    • Model inference happens on Modal GPU server (remote)
    • Some model components loaded on CPU, others on GPU
    • Issue occurs in VLLN (Vision-Language Layer Norm) component

    Attempted Fix: Added retry logic with exponential backoff to handle transient device issues:

    max_retries = 3
    retry_delay = 1.0  # seconds
    
    for retry_attempt in range(max_retries):
        try:
            actions = self(inputs)
            break  # Success
        except RuntimeError as e:
            if "Expected all tensors to be on the same device" in str(e):
                if retry_attempt < max_retries - 1:
                    logger.warning(f"Device mismatch error on attempt {retry_attempt + 1}/{max_retries}. Retrying in {retry_delay}s...")
                    await asyncio.sleep(retry_delay)
                    retry_delay *= 2  # Exponential backoff
    

    Status: This helped with transient issues but didn’t solve the root cause, which is on the Modal server side and not fixable from the client.

    Alternative Solution: Local Inference

    Since the PhosphoBot Modal server had device mismatch issues, I implemented a local inference solution using official Isaac-GR00T scripts.

    Architecture: Client-Server Model

    Based on NVIDIA’s official tutorial, the solution uses a two-terminal approach:

    Terminal 1: Inference Server

    • Loads GR00T model on local GPU
    • Runs inference on observations
    • Returns action predictions
    • Uses ZMQ protocol for fast communication

    Terminal 2: Robot Client

    • Connects to SO-100 robot via USB
    • Captures camera images
    • Sends observations to server
    • Executes returned actions
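
    To make the split concrete, here is a hedged minimal sketch of the ZMQ request/reply pattern between the two processes (pyzmq with pickle payloads as a placeholder; the real inference_service.py defines its own serialization and message schema).

    # Hedged sketch of the ZMQ request/reply loop between robot client and inference server.
    import pickle
    import zmq

    def run_server(port: int = 5555):
        ctx = zmq.Context()
        sock = ctx.socket(zmq.REP)
        sock.bind(f"tcp://*:{port}")
        while True:
            obs = pickle.loads(sock.recv())  # observation dict from the client
            # Placeholder: run GR00T inference on obs here.
            action_chunk = {"action.single_arm": [[0.0] * 5] * 16,
                            "action.gripper": [[0.0]] * 16}
            sock.send(pickle.dumps(action_chunk))

    def request_action(obs: dict, host: str = "localhost", port: int = 5555) -> dict:
        ctx = zmq.Context()
        sock = ctx.socket(zmq.REQ)
        sock.connect(f"tcp://{host}:{port}")
        sock.send(pickle.dumps(obs))       # client: send camera frames + joint state
        return pickle.loads(sock.recv())   # server: returns a chunk of predicted actions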

    Implementation

    Server Script (start_groot_server.sh):

    #!/bin/bash
    cd /home/vipin/Isaac-GR00T
    conda activate gr00t
    
    python scripts/inference_service.py \
        --server \
        --model_path "phospho-app/gr00t-example_dataset-h9g75u7gak"...
    Read more »

  • Debugging Dual-Camera Vision System for SO-101 Robotic Manipulation Platform

    Vipin M · 10/05/2025 at 15:51

    Project Overview

    I worked on debugging a dual-camera vision system for my SO-101 robotic manipulation platform. The cameras were experiencing intermittent streaming failures that initially appeared to be software compatibility issues, but turned out to be caused by a faulty USB extension cable.

    This project log documents the troubleshooting process, technical solutions implemented, and lessons learned while setting up the vision system for robotic data collection.

    Hardware Setup

    The SO-101 platform consists of:

    • Dual robotic arms: Leader and follower configuration with USB serial communication
    • Dual camera system:
      • Wrist-mounted camera (640x480 @ 30fps) for end-effector view
      • Scene camera (NexiGo N60 FHD, 1920x1080 capable) for workspace overview
    • Target performance: Stable 30 FPS streaming for robotics data collection

    The Problem: Intermittent Camera Failures

    The scene camera was experiencing frustrating intermittent failures:

    • Random “No such device” errors during streaming
    • Inconsistent connection behavior
    • Performance degradation over time
    • Apparent timing-related issues

    Initial symptoms pointed to software compatibility problems, leading me down a complex debugging path.

    Solution 1: Persistent Device Management with udev Rules

    First, I tackled device management by implementing comprehensive udev rules for consistent device naming:

    # /etc/udev/rules.d/99-lerobot-so101.rules
    SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", ATTRS{serial}=="54A90068534", SYMLINK+="leader", MODE="0666"
    SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", ATTRS{serial}=="54A900685B4", SYMLINK+="follower", MODE="0666"
    SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1bcf", ATTRS{idProduct}=="2c99", ATTR{index}=="0", SYMLINK+="wrist", MODE="0666"
    SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1bcf", ATTRS{idProduct}=="2b95", ATTR{index}=="0", SYMLINK+="scene", MODE="0666"
    

    Key insight: Using ATTR{index}=="0" prevents conflicts with video device metadata nodes, ensuring symlinks point to actual video devices.
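
    After editing the rules, a quick hedged check (my own helper; the reload commands in the comment are the standard udevadm ones) that each symlink resolves to a real device node:

    # Hedged sketch: verify the udev symlinks after reloading rules with
    #   sudo udevadm control --reload-rules && sudo udevadm trigger
    import os

    for link in ("/dev/leader", "/dev/follower", "/dev/wrist", "/dev/scene"):
        if os.path.islink(link):
            print(f"{link} -> {os.path.realpath(link)}")
        else:
            print(f"{link} missing: re-plug the device or check the udev rule")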

    Solution 2: Camera Implementation Optimization

    I developed an improved OpenCV camera implementation with better resource management:

    class CorrectedOpenCVCamera:
        def __init__(self, camera_index, fps=30, width=640, height=480):
            self.camera_index = camera_index
            self.fps = fps
            self.width = width
            self.height = height
            self.cap = None

        def connect(self):
            self.cap = cv2.VideoCapture(self.camera_index)
            if not self.cap.isOpened():
                raise RuntimeError(f"Failed to open camera {self.camera_index}")

            # Set properties for optimal performance
            self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, self.width)
            self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, self.height)
            self.cap.set(cv2.CAP_PROP_FPS, self.fps)
    

    Root Cause: Hardware Issue

    After implementing various software approaches including:

    • MJPG format forcing (reduced performance from 30fps to 15fps)
    • Artificial timing delays (caused more failures)
    • Complex configuration workarounds

    The actual root cause was identified through systematic testing: a poor-quality USB extension cable.

    Debugging Approach

    1. Systematic isolation: Tested each camera individually
    2. Performance measurement: FPS monitoring under realistic conditions
    3. Hardware verification: Checked physical connections
    4. Root cause analysis: Eliminated software assumptions
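
    For the performance-measurement step above, a hedged sketch of the kind of FPS check used (the device path and 100-frame sample are assumptions, not the exact test script):

    # Hedged sketch: measure sustained FPS from a camera over a fixed number of frames.
    import time
    import cv2

    def measure_fps(device: str = "/dev/scene", frames: int = 100) -> float:
        cap = cv2.VideoCapture(device)
        if not cap.isOpened():
            raise RuntimeError(f"Failed to open {device}")
        start = time.monotonic()
        grabbed = 0
        for _ in range(frames):
            ok, _frame = cap.read()
            if ok:
                grabbed += 1
        elapsed = time.monotonic() - start
        cap.release()
        return grabbed / elapsed

    if __name__ == "__main__":
        print(f"/dev/scene: {measure_fps():.1f} FPS sustained")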

    Results

    After removing the faulty USB extension cable:

    • Wrist camera: 33.5 FPS sustained
    • Scene camera: 30+ FPS sustained
    • Both cameras: Work with default OpenCV settings
    • System stability: No configuration workarounds needed

    Technical Notes

    1. Hardware vs Software Issues

    Physical connection issues can create symptoms that mimic software problems. Checking hardware connections early in the debugging process can save time.

    2. USB Cable Quality

    Poor quality USB extension cables can cause:

    • Signal degradation
    • Power delivery issues
    • Bandwidth limitations
    • Intermittent connection failures

    3. Camera Configuration

    • Forcing MJPG format: Not necessary, can reduce performance...
    Read more »