Robotic arm workflow with the NVIDIA GR00T N1.5 model: dataset recording, fine-tuning, debugging, and deployment for pick-and-place tasks
When your robot ignores “do not pick up the cheese” and picks it up anyway: A journey through frozen VLM backbones and the limits of action-only fine-tuning
This project log documents the discovery and resolution of a critical language conditioning failure in NVIDIA’s GR00T N1.5 vision-language-action (VLA) model during multitask training. The investigation revealed fundamental limitations in training configurations that freeze the vision-language backbone, and provides solutions for enabling proper language-conditioned robotic manipulation.
The debugging process involved systematic testing with dual-ingredient scenarios (cheese and bread), analysis of model architecture and data flow, and identification of the root cause: frozen Eagle VLM backbone preventing task-specific language-vision association learning.
This work is part of the LeIsaac project - building a multi-ingredient sandwich assembly robot using Isaac Sim, Isaac Lab, and VLA models with an SO-ARM 101 robotic arm.
The sandwich assembly task requires the robot to manipulate multiple ingredients based on language instructions. This means the model must ground each instruction in the visual scene and act on the ingredient it names, rather than repeating a fixed visual routine.
Datasets: separate cheese and bread pick-and-place demonstration sets (./demo_data/cheese/ and ./demo_data/bread/).
Training configuration:
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/cheese/ ./demo_data/bread/ \
    --num-gpus 1 \
    --max-steps 10000 \
    --data-config so100_dualcam \
    --batch-size 16 \
    --lora-rank 32 \
    --balance-dataset-weights \
    --balance-trajectory-weights
The LeRobotMixtureDataset automatically balances sampling across both datasets during training.
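For intuition, here is a minimal sketch of balanced sampling across two datasets of unequal size. It is not the LeRobotMixtureDataset internals, and the dataset names and episode counts are made up for illustration:

import random

# Hypothetical episode counts for the two datasets
dataset_sizes = {"cheese": 120, "bread": 80}

# Weight each dataset inversely to its size so both are drawn from equally often
weights = {name: 1.0 / size for name, size in dataset_sizes.items()}
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

def sample_dataset():
    # Pick a dataset according to the balanced probabilities, then an episode index within it
    name = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
    episode = random.randrange(dataset_sizes[name])
    return name, episode

print(probs)             # {'cheese': 0.4, 'bread': 0.6}
print(sample_dataset())  # e.g. ('bread', 17): dataset name, then episode index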
After training a multitask model for 3,000 steps, I deployed it on the physical robot and tested with different language instructions:
Test 1: "pick up the yellow cheese and put it into the white plate"
Test 2: "pick up the bread and put it into the white plate"
Test 3: "do not pick up the cheese"
Critical observation: The model’s behavior was 100% determined by visual state, with 0% influence from language instruction.
The robot appeared to be using a simple position-based heuristic:
IF (object detected in plate):
    STOP (task complete)
ELSE IF (object detected in holder):
    GRASP object → MOVE to plate → RELEASE
ELSE:
    SEARCH randomly
This suggested the model learned visual patterns rather than language-conditioned behavior.
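A quick way to confirm this kind of failure is to hold the camera images fixed and vary only the instruction: if the predicted action chunks are near-identical, language is being ignored. A minimal sketch, assuming a hypothetical policy.get_action(observation, instruction) client that returns a numpy array:

import numpy as np

def language_sensitivity(policy, observation, instructions):
    """Compare predicted action chunks across instructions for the same observation."""
    actions = {text: np.asarray(policy.get_action(observation, text)) for text in instructions}
    baseline = actions[instructions[0]]
    for text, act in actions.items():
        delta = np.abs(act - baseline).mean()
        print(f"{text!r}: mean |delta action| vs baseline = {delta:.4f}")

# Usage sketch (policy and observation come from your own inference client):
# language_sensitivity(policy, obs, [
#     "pick up the yellow cheese and put it into the white plate",
#     "pick up the bread and put it into the white plate",
#     "do not pick up the cheese",
# ])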
Investigating the training script revealed a suspicious flag: --no-tune_diffusion_model
TRAIN_CMD="python scripts/gr00t_finetune.py \
    --dataset-path ${DATASET_PATHS} \
    --no-tune_diffusion_model \   # ←...
From USD scene creation to MimicGen integration: A complete implementation of multi-ingredient manipulation in Isaac Sim
This project implements a complete sandwich assembly simulation environment in Isaac Lab, designed to train robotic manipulation policies for multi-step food preparation tasks. The development involved creating a custom USD scene with proper physics configuration, implementing MimicGen integration for data augmentation, and solving critical challenges with rigid body hierarchies and API compatibility.
This project log documents the systematic development process, from initial scene setup through the final dynamic ingredient selection feature, including all the debugging challenges and solutions encountered along the way.
The simulation successfully supports teleoperation, demonstration recording, MimicGen annotation, and automated data generation for training vision-language-action (VLA) models on sandwich assembly tasks.
Sandwich assembly represents a complex manipulation task that requires:
Development began with creating the basic task structure and scene configuration. The initial implementation involved:
Created Files:
assemble_sandwich_env_cfg.py - Main environment configuration
assemble_sandwich_mimic_env_cfg.py - MimicGen variant with subtask configs
README.md - Complete documentation (614 lines)

Key Configuration:
# Scene loading with automatic USD parsing
parse_usd_and_create_subassets(KITCHEN_WITH_SANDWICH_USD_PATH, self)
This function automatically detects all rigid body objects in the USD scene and creates corresponding configuration objects, eliminating manual object registration.
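The idea behind that helper can be sketched with the USD Python API: traverse the stage and collect every prim that carries UsdPhysics.RigidBodyAPI. This is an illustration of the approach, not the LeIsaac implementation, and the example paths in the comment are invented:

from pxr import Usd, UsdPhysics

def find_rigid_bodies(usd_path):
    """Return the prim paths of all rigid-body objects in a USD scene."""
    stage = Usd.Stage.Open(usd_path)
    rigid_bodies = []
    for prim in stage.Traverse():
        if prim.HasAPI(UsdPhysics.RigidBodyAPI):
            rigid_bodies.append(prim.GetPath().pathString)
    return rigid_bodies

# e.g. find_rigid_bodies("kitchen_with_sandwich.usd")
# → ['/Scene/bread_slice_1', '/Scene/cheese', ...]   (paths are illustrative)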
Problem: The original kitchen scene was 37.9 MB with complex fixtures (cabinets, appliances, decorative elements) that slowed simulation and cluttered the workspace.
Solution: Documented a systematic scene-simplification workflow.
Table Layout Design:
┌─────────────────────────────────┐
│ [Ingredients Holder]     [🍽️]   │  ← Left: Holder, Right: Plate
│ ┌─┬─┬─┬─┐               Plate   │
│ │🍞│🍞│🥩│🧀│                    │  ← Slots: bread, bread, patty, cheese
│ └─┴─┴─┴─┘                       │
│                                 │
│          Assembly Area          │
└─────────────────────────────────┘
The USD scene structure was created with proper physics APIs for all objects:
Dynamic Objects (movable ingredients):
bread_slice_1, ...

I implemented a complete MimicGen data augmentation pipeline to generate multiple training demonstrations from a single recorded episode. The goal was to overcome the data scarcity problem in robotic manipulation by automatically creating diverse variations of expert demonstrations.
This project log documents the systematic implementation of the 4-step MimicGen workflow, from converting demonstrations to IK actions through generating 10x augmented data, and the debugging challenges encountered along the way.
The pipeline successfully transformed 1 original demonstration into 10 augmented demonstrations with a 71.4% generation success rate, providing rich training data for imitation learning policies.
Robotic manipulation policies require large amounts of diverse training data, but collecting demonstrations by hand is slow, repetitive, and hard to scale.
Traditional approach: Record 50-100 demonstrations manually.
MimicGen approach: Record 1 demonstration → Generate 10+ variations automatically.
Subtask 1: pick_cube - Approach and grasp the cube
Subtask 2: lift_cube - Lift cube above threshold height
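These two subtasks map onto MimicGen subtask configurations roughly as sketched below. The import path and field names follow Isaac Lab's Mimic configuration and may differ across versions; the offset values are illustrative, not the project's settings. Note the zero offsets on the final subtask, which is what the assertion error discussed below is about:

# Illustrative MimicGen subtask configuration (values are examples, not the project's)
from isaaclab.envs.mimic_env_cfg import SubTaskConfig

subtask_configs = [
    SubTaskConfig(
        object_ref="cube",                    # object this subtask is defined relative to
        subtask_term_signal="grasp_cube",     # observation term that flips when the cube is grasped
        subtask_term_offset_range=(10, 20),   # random offset applied around the detected boundary
    ),
    SubTaskConfig(
        object_ref="cube",
        subtask_term_signal=None,             # final subtask: ends when the episode succeeds
        subtask_term_offset_range=(0, 0),     # the last subtask's offsets must be zero
    ),
]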
Key Requirements:
Problem: MimicGen annotation failed with “The final task was not completed” error.
Root Cause Analysis: the environment was missing a lift_cube observation term, so MimicGen had no signal for detecting completion of the final subtask.

Solution: Added a lift_cube observation function:
def lift_cube(
    env: ManagerBasedRLEnv,
    cube_cfg: SceneEntityCfg = SceneEntityCfg("cube"),
    robot_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
    robot_base_name: str = "base",
    height_threshold: float = 0.05,
) -> torch.Tensor:
    """Check if the cube is lifted above the robot base."""
    cube: RigidObject = env.scene[cube_cfg.name]
    robot: Articulation = env.scene[robot_cfg.name]
    cube_height = cube.data.root_pos_w[:, 2]
    base_index = robot.data.body_names.index(robot_base_name)
    robot_base_height = robot.data.body_pos_w[:, base_index, 2]
    above_base = cube_height - robot_base_height > height_threshold
    return above_base
Critical Discovery: The default height threshold (0.20m) was too strict for the actual cube size.
Investigation Process: inspected the cube definition in /assets/scenes/table_with_cube/cube/model.xml to check the actual cube size.

Configuration Update:
# Updated threshold in both environments
height_threshold: float = 0.05  # Changed from 0.20m
Problem: Assertion error during generation: “assert subtask_configs[-1].subtask_term_offset_range[0]...
I worked on implementing a real-to-sim digital twin system for an SO-101 robotic arm using NVIDIA Isaac Sim 4.5.0. The goal was to create a virtual replica that mirrors the physical robot's movements in real-time, enabling simultaneous control of both real and virtual arms through a leader-follower teleoperation setup.
This project log documents the complete setup process, debugging challenges, and the implementation of a robust ROS2-based communication pipeline between physical hardware and Isaac Sim.
/dev/leader - Leader arm for human teleoperation
/dev/follower - Follower arm (physical robot being controlled)
/dev/wrist - Wrist-mounted camera (video0)
/dev/scene - Scene overview camera (video2)

The objective was to create a digital twin in which the simulated arm in Isaac Sim mirrors the physical follower arm's joint states in real time.
The system uses a three-component architecture:
Component 1: Teleoperation System - the human operator drives the leader arm on /dev/leader.
Component 2: Joint State Bridge - reads the follower arm on /dev/follower and publishes its joint states over ROS2 (a minimal sketch of this bridge follows below).
Component 3: Isaac Sim Digital Twin - subscribes to the joint states and drives the simulated articulation.
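A minimal sketch of the bridge idea (Component 2) using rclpy. The joint names are illustrative, and read_follower_joints() is a placeholder for whatever driver reads the servo positions on /dev/follower:

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import JointState

JOINT_NAMES = ["shoulder_pan", "shoulder_lift", "elbow_flex",
               "wrist_flex", "wrist_roll", "gripper"]  # illustrative names

def read_follower_joints():
    """Placeholder: read joint positions (radians) from the follower arm on /dev/follower."""
    return [0.0] * len(JOINT_NAMES)

class JointStateBridge(Node):
    def __init__(self):
        super().__init__("joint_state_bridge")
        self.pub = self.create_publisher(JointState, "/joint_states", 10)
        self.timer = self.create_timer(1.0 / 30.0, self.publish)  # ~30 Hz

    def publish(self):
        msg = JointState()
        msg.header.stamp = self.get_clock().now().to_msg()
        msg.name = JOINT_NAMES
        msg.position = read_follower_joints()
        self.pub.publish(msg)

def main():
    rclpy.init()
    rclpy.spin(JointStateBridge())

if __name__ == "__main__":
    main()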
When attempting to run ROS2 nodes from the Isaac Sim conda environment, I encountered:
ImportError: /home/vipin/miniconda3/envs/isaacsim/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/_rclpy_pybind11.cpython-310-x86_64-linux-gnu.so)
Root Cause: The conda environment’s libstdc++ (version 3.4.26) was older than what system ROS2 required (3.4.30).
Investigation:
# Conda environment library
strings /home/vipin/miniconda3/envs/isaacsim/lib/libstdc++.so.6 | grep GLIBCXX | tail -1
# Output: GLIBCXX_3.4.26

# System library
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX | tail -1
# Output: GLIBCXX_3.4.30
Solution: Use Isaac Sim’s internal ROS2 libraries instead of system installation:
# Configure Isaac Sim with internal ROS2 libraries
export isaac_sim_package_path=$(dirname $(which isaacsim))/../lib/python3.10/site-packages/isaacsim
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$isaac_sim_package_path/exts/isaacsim.ros2.bridge/humble/lib
export PYTHONPATH=$isaac_sim_package_path/exts/isaacsim.ros2.bridge/humble:$PYTHONPATH
This approach avoided library conflicts while maintaining Isaac Sim stability.
During initial testing, I discovered unexpected joint states:
ros2 topic echo /joint_states --once
# Output showed wheel joints, gripper extension, head swivel - not arm joints!
Root Cause: Another machine on the network was publishing robot topics to the same ROS_DOMAIN_ID.
Solution: Network isolation using unique domain ID:
export ROS_DOMAIN_ID=42  # Isolated domain
ros2 topic list          # Clean output: only local topics
The most critical debugging challenge was joint name inconsistency. Isaac Sim’s ArticulationController was throwing warnings:
[Warning] [omni.graph.core.plugin] /so101_new_calib/ROS_JointStates/ArticulationController: [/so101_new_calib/ROS_JointStates] OmniGraph Warning: 'joint_1'
Investigation: Compared published...
I worked on debugging a puzzling issue where a fine-tuned NVIDIA GR00T N1.5 model was causing an SO-100 robotic arm to "twitch" instead of performing pick-and-place tasks. The robot would make tiny oscillating movements around the same position, with the gripper staying completely unresponsive.
This project log documents the systematic debugging process that revealed the root cause: an undertrained model that needed significantly more training steps to learn the complete task sequence.
When the trained GR00T model was deployed, the arm only oscillated around its starting pose and the gripper never actuated.
The model had been trained for 2000 steps and showed good loss convergence, but the physical deployment was completely unsuccessful.
Added comprehensive logging to both the inference server and robot client to understand what data was being exchanged.
Server-Side Logging (service.py):
Client-Side Logging (eval_lerobot.py):
Example Output:
[Request #1] Endpoint: get_action
Inference time: 75.23ms
Response keys: ['action.single_arm', 'action.gripper']
action.single_arm: shape=(16, 5), min=-45.23, max=67.89, mean=12.34
action.gripper: shape=(16, 1), min=-0.30, max=0.50, mean=0.15
[CLIENT] First action to send to robot:
    shoulder_pan.pos: -12.34
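For reference, a minimal sketch of the kind of instrumentation that produces output like the above. The policy.get_action call and the action-dictionary layout are assumptions based on the logs, not the exact service.py code:

import time
import numpy as np

def log_action_response(request_id, policy, observation):
    """Time one inference call and summarize each returned action array."""
    start = time.perf_counter()
    actions = policy.get_action(observation)  # e.g. {'action.single_arm': ..., 'action.gripper': ...}
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"[Request #{request_id}] Inference time: {elapsed_ms:.2f}ms")
    print(f"Response keys: {list(actions.keys())}")
    for key, value in actions.items():
        arr = np.asarray(value)
        print(f"{key}: shape={arr.shape}, min={arr.min():.2f}, max={arr.max():.2f}, mean={arr.mean():.2f}")
    return actions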
Created several diagnostic scripts to isolate the issue:
Joint Testing Tool (test_joint.py):
Robot State Monitor (monitor_robot_state.py):
Uploaded the dataset to Hugging Face Hub and used Rerun visualization to inspect the recorded episodes:
# Upload dataset for analysis
python scripts/so100_groot/upload_to_huggingface.py \
    --local-dir ~/.cache/huggingface/lerobot/rubbotix/striped-block \
    --repo-id sparkmt/so100-striped-block

# Visualize episodes
./scripts/so100_groot/visualize_episodes.sh 0
This revealed the difference between State (robot’s actual position) and Action (commanded target position), which was crucial for diagnosis.
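The State/Action distinction can also be checked numerically: if the commanded targets barely differ from the current joint positions, the policy is effectively standing still. A minimal sketch, assuming per-step state and action arrays with matching joint ordering:

import numpy as np

def action_state_gap(states, actions):
    """Mean absolute gap between commanded targets and actual joint positions, per joint."""
    states = np.asarray(states)    # shape (T, num_joints)
    actions = np.asarray(actions)  # shape (T, num_joints)
    gap = np.abs(actions - states).mean(axis=0)
    for j, g in enumerate(gap):
        print(f"joint {j}: mean |action - state| = {g:.3f}")
    return gap

# Tiny gaps across all joints are a strong hint the policy is only "twitching".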
The robot was making very small, uncertain movements instead of decisive actions. The logging revealed that the model was outputting actions with very small magnitudes, indicating high uncertainty.
Analysis revealed that the model was severely undertrained at 2000 steps.
Evidence:
| Task Complexity | Minimum Steps | Recommended Steps |
|---|---|---|
| Simple reaching | 1,000-2,000 | 5,000 |
| Pick and place | 5,000-10,000... |
I worked on fine-tuning NVIDIA’s GR00T N1.5 model for controlling an SO-100 robotic arm. The project involved dataset preparation, memory optimization for 16GB VRAM constraints, model training with LoRA techniques, and deployment setup for real-world robot control.
The goal was to train the model to perform pick-and-place manipulation tasks using the instruction “pick up the striped box and put it into the white plate” with dual-camera visual input.
The dataset visualization script displayed blank canvases for state/action plots.
Root Cause: The script had hardcoded humanoid robot keys (left_arm, right_arm, left_hand, right_hand) while the SO-100 dataset uses different keys (single_arm, gripper).
Solution: Modified the visualization function to auto-detect keys from the dataset:
# Before: hardcoded humanoid keys
shared_keys = ["left_arm", "right_arm", "left_hand", "right_hand"]
# After: auto-detect from dataset
if shared_keys is None:
    shared_keys = [key.replace("state.", "") for key in state_dict.keys()]
    print(f"Auto-detected keys to plot: {shared_keys}")
The visualization showed the wrist camera perspective when it should have shown the scene camera.
Investigation: Checked the dataset’s modality.json mappings and discovered that during data collection, the camera naming was swapped:
observation.images.main was actually the wrist/gripper camera
observation.images.secondary_0 was actually the scene camera

Solution: Corrected the mappings in modality.json:
"video": { "front": {"original_key": "observation.images.secondary_0"}, // Scene camera "wrist": {"original_key": "observation.images.main"} // Wrist camera
}
Verification: Created a diagnostic script that confirmed the mapping correction by comparing raw video frames with dataset loader output.
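The diagnostic boils down to pulling the first frame from each recorded video and checking by eye which viewpoint it actually shows. A minimal sketch with OpenCV; the file paths in the example are illustrative:

import cv2

def save_first_frame(video_path, out_path):
    """Grab the first frame of a video so the viewpoint can be inspected by eye."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    cv2.imwrite(out_path, frame)

# Example (paths are illustrative):
# save_first_frame("videos/observation.images.main/episode_000000.mp4", "main_first_frame.png")
# save_first_frame("videos/observation.images.secondary_0/episode_000000.mp4", "secondary_first_frame.png")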
Dataset loading failed due to missing video metadata fields.
Solution: Added the required fields to info.json:
info['features'][key]['info']['video.channels'] = 3
info['features'][key]['info']['video.height'] = 720
info['features'][key]['info']['video.width'] = 1280
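In context, those assignments sit inside a small patch script along these lines. The meta/info.json location follows the LeRobot dataset layout, and both the dataset path and the dtype-based video-key detection are assumptions:

import json
from pathlib import Path

info_path = Path("~/.cache/huggingface/lerobot/rubbotix/striped-block/meta/info.json").expanduser()
info = json.loads(info_path.read_text())

for key, feature in info["features"].items():
    if feature.get("dtype") == "video":  # only patch the camera streams
        feature.setdefault("info", {})
        feature["info"]["video.channels"] = 3
        feature["info"]["video.height"] = 720
        feature["info"]["video.width"] = 1280

info_path.write_text(json.dumps(info, indent=4))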
Initial training attempts all failed with out-of-memory errors, even with very small batch sizes:
| Attempt | Batch Size | Gradient Accum | Result |
|---|---|---|---|
| 1 | 64 | 2 | OOM at step 0 |
| 2 | 32 | 4 | OOM at step 0 |
| 3 | 16 | 8 | OOM at step 0 |
| 4 | 8 | 16 | OOM at step 0 |
| 5 | 4 | 32 | OOM at step 0 |
| 6 | 2 | 64 | OOM during optimizer step |
Analysis: The base model has 3B parameters, plus a 550M parameter diffusion model. The Adam optimizer requires 2x memory for momentum and variance states, exceeding the 16GB VRAM limit.
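A rough back-of-the-envelope calculation (illustrative byte counts, assuming full fine-tuning with bf16 weights/gradients and fp32 Adam states) shows why full fine-tuning cannot fit:

params = 3.0e9 + 0.55e9          # backbone + diffusion head parameters
bytes_per_param = 2 + 2 + 4 + 4  # bf16 weights + bf16 grads + fp32 Adam momentum + fp32 Adam variance
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:.0f} GB just for weights, gradients and optimizer states")
# -> roughly 43 GB before activations, far beyond a 16 GB card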
Implemented Low-Rank Adaptation (LoRA) to reduce trainable parameters:
LoRA Configuration:
--lora-rank 32             # Size of low-rank adaptation matrices
--lora-alpha 64            # Scaling factor (typically 2x rank)
--lora-dropout 0.1         # Regularization
--no-tune_diffusion_model  # Freeze 550M parameter diffusion model
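To see why rank-32 adapters shrink the trainable parameter count so drastically, compare a full weight update against its two low-rank factors. The hidden size of 2048 is an illustrative number, not the model's actual dimension:

d, r = 2048, 32
full_update = d * d      # training the whole d x d weight matrix
lora_update = 2 * d * r  # two factors: (d x r) and (r x d)
print(full_update, lora_update, f"{full_update / lora_update:.0f}x fewer trainable parameters")
# -> 4194304 131072 '32x fewer trainable parameters'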
Memory savings: with the LoRA adapters and the frozen diffusion head, training fit within the 16GB VRAM budget using the following command:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python scripts/gr00t_finetune.py \
    --dataset-path ./demo_data/example_dataset/ \
    --num-gpus 1 \
    --output-dir ./so100-checkpoints \
    --max-steps 5000 \
    --data-config so100_dualcam \
    --batch-size 16 \
    --gradient-accumulation-steps 8 \
    --learning-rate 0.0001 \
    --no-tune_diffusion_model \
    --lora-rank 32 \
    --lora-alpha 64 \
    --lora-dropout 0.1 ...
I worked on debugging inference issues with a fine-tuned NVIDIA GR00T N1.5 model for controlling an SO-100 robotic arm. The model was trained successfully and uploaded to HuggingFace Hub, but the PhosphoBot browser interface was failing during AI control activation.
This project log documents the debugging process, root cause analysis, and the implementation of an alternative local inference solution.
phospho-app/gr00t-example_dataset-h9g75u7gak (fine-tuned GR00T N1.5)

When clicking "AI Control" in the PhosphoBot browser interface, the system reported:
Exception: No robot connected. Exiting AI control loop.
The robot was physically connected and visible in the UI, cameras were streaming successfully, and the model had been trained and uploaded to HuggingFace Hub. The issue appeared to be in the inference pipeline.
Added debug logging to understand the failure and discovered:
Connected joints: 6, Config joints: 1
Root Cause: The code was reading the model configuration incorrectly:
# Incorrect code
number_of_joints_in_config = len(
    config.embodiment.statistics.action.action_space.values()
)
This was counting dictionary keys (max, min, mean, std, q01, q99) instead of joint dimensions.
Model Config Structure:
{ "action_space": { "action_space": 6 }
}
Solution: Handle the nested dictionary structure correctly:
# Fixed code
action_space = config.embodiment.statistics.action.action_space

# Case 1: action_space is a dict with an 'action_space' key containing the number
if isinstance(action_space, dict) and 'action_space' in action_space:
    number_of_joints_in_config = action_space['action_space']
# Case 2: action_space has 'max' or 'min' arrays
elif hasattr(action_space, 'max') and isinstance(action_space.max, list):
    number_of_joints_in_config = len(action_space.max)
# Additional fallback cases...
After fixing the joint count, a new error appeared:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Root Cause: a device placement mismatch inside the Modal inference pipeline, with some tensors on cuda:0 while others remained on the CPU.
Attempted Fix: Added retry logic with exponential backoff to handle transient device issues:
max_retries = 3
retry_delay = 1.0 # seconds
for retry_attempt in range(max_retries):
    try:
        actions = self(inputs)
        break  # Success
    except RuntimeError as e:
        if "Expected all tensors to be on the same device" in str(e):
            if retry_attempt < max_retries - 1:
                logger.warning(
                    f"Device mismatch error on attempt {retry_attempt + 1}/{max_retries}. "
                    f"Retrying in {retry_delay}s..."
                )
                await asyncio.sleep(retry_delay)
                retry_delay *= 2  # Exponential backoff
Status: This helped with transient issues but didn’t solve the root cause, which is on the Modal server side and not fixable from the client.
Since the PhosphoBot Modal server had device mismatch issues, I implemented a local inference solution using official Isaac-GR00T scripts.
Based on NVIDIA’s official tutorial, the solution uses a two-terminal approach:
Terminal 1: Inference Server
Terminal 2: Robot Client
Server Script (start_groot_server.sh):
#!/bin/bash
cd /home/vipin/Isaac-GR00T
conda activate gr00t

python scripts/inference_service.py \
    --server \
    --model_path "phospho-app/gr00t-example_dataset-h9g75u7gak"...
Project Overview
I worked on debugging a dual-camera vision system for my SO-101 robotic manipulation platform. The cameras were experiencing intermittent streaming failures that initially appeared to be software compatibility issues, but turned out to be caused by a faulty USB extension cable.
This project log documents the troubleshooting process, technical solutions implemented, and lessons learned while setting up the vision system for robotic data collection.
The SO-101 platform consists of the leader/follower arm pair plus a wrist-mounted camera and a scene-overview camera (see the udev rules below).

The scene camera was experiencing frustrating intermittent streaming failures.
Initial symptoms pointed to software compatibility problems, leading me down a complex debugging path.
First, I tackled device management by implementing comprehensive udev rules for consistent device naming:
# /etc/udev/rules.d/99-lerobot-so101.rules
SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", ATTRS{serial}=="54A90068534", SYMLINK+="leader", MODE="0666"
SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", ATTRS{serial}=="54A900685B4", SYMLINK+="follower", MODE="0666"
SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1bcf", ATTRS{idProduct}=="2c99", ATTR{index}=="0", SYMLINK+="wrist", MODE="0666"
SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1bcf", ATTRS{idProduct}=="2b95", ATTR{index}=="0", SYMLINK+="scene", MODE="0666"
Key insight: Using ATTR{index}=="0" prevents conflicts with video device metadata nodes, ensuring symlinks point to actual video devices.
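With the rules in place, a quick smoke test confirms each symlink resolves to a device that actually delivers frames. The symlink names come from the rules above; OpenCV accepts a V4L2 device path on Linux:

import cv2

for device in ("/dev/wrist", "/dev/scene"):
    cap = cv2.VideoCapture(device)  # open the camera via its udev symlink
    ok, frame = cap.read()
    cap.release()
    status = f"OK {frame.shape}" if ok else "FAILED"
    print(f"{device}: {status}")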
I developed an improved OpenCV camera implementation with better resource management:
class CorrectedOpenCVCamera:
    def __init__(self, camera_index, fps=30, width=640, height=480):
        self.camera_index = camera_index
        self.fps = fps
        self.width = width
        self.height = height
        self.cap = None

    def connect(self):
        self.cap = cv2.VideoCapture(self.camera_index)
        if not self.cap.isOpened():
            raise RuntimeError(f"Failed to open camera {self.camera_index}")
        # Set properties for optimal performance
        self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, self.width)
        self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, self.height)
        self.cap.set(cv2.CAP_PROP_FPS, self.fps)
After implementing various software approaches, including the udev rules and the improved camera wrapper above, the actual root cause was identified through systematic testing: a poor-quality USB extension cable.
After removing the faulty USB extension cable, the intermittent streaming failures disappeared.
Physical connection issues can create symptoms that mimic software problems. Checking hardware connections early in the debugging process can save time.
Poor-quality USB extension cables can cause intermittent device enumeration, bandwidth drops, and signal-integrity problems that look exactly like software failures.