Tetris-PoseNet

2026 Robotics

6D pose estimation for robotic bin picking using point cloud deep learning and sim-to-real transfer.

Tetris-PoseNet performing 6D pose estimation on bin-packed objects

Real-time pose estimation and grasp planning for densely packed objects

Demo

3D point cloud captured from Zivid camera

Network predictions with 6D pose bounding boxes

Tetris in cluttered bin

6D pose prediction in cluttered bin


Background

Robotic bin picking requires solving multiple hard problems simultaneously:

  1. 3D perception in clutter - objects are heavily occluded and arrive in random orientations
  2. RGB-only methods fall short - lighting variation, textureless surfaces, and depth ambiguity defeat 2D pipelines
  3. Instance segmentation at scale - traditional pipelines chain separate detection, segmentation, and pose networks
  4. Sim-to-real gap - models trained on synthetic data often fail on real sensor data in deployment

I built Tetris-PoseNet, which combines 3D deep learning with robotic integration to deliver direct point cloud processing, end-to-end pose prediction, and robust sim-to-real transfer.
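
To make the point-wise idea concrete, here is a minimal sketch of a PPR-Net-style head (see References): each point carries a feature from the backbone, votes for its instance's translation and rotation, and the votes are clustered into per-object pose hypotheses. The module names, dimensions, and the naive clustering routine are illustrative assumptions, not the actual Tetris-PoseNet implementation.

    import torch
    import torch.nn as nn

    class PointwisePoseHead(nn.Module):
        """Illustrative per-point pose head in the spirit of PPR-Net.

        Assumes the backbone (e.g. Point Transformer V3) has already produced
        per-point features of shape (N, C). Each point predicts an offset to
        its instance's center and a rotation in a 6D representation.
        """

        def __init__(self, feat_dim: int = 64):
            super().__init__()
            self.trans_head = nn.Sequential(
                nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))
            self.rot_head = nn.Sequential(
                nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 6))

        def forward(self, points, feats):
            # Each point votes for the center of the instance it belongs to.
            center_votes = points + self.trans_head(feats)    # (N, 3)
            rot_6d = self.rot_head(feats)                      # (N, 6)
            return center_votes, rot_6d

    def cluster_votes(center_votes, radius=0.01):
        """Greedy clustering of translation votes into object instances.

        A real pipeline would use mean-shift or DBSCAN; this only shows how
        per-point votes collapse into one pose hypothesis per object.
        """
        remaining = torch.arange(center_votes.shape[0])
        centers = []
        while remaining.numel() > 0:
            seed = center_votes[remaining[0]]
            dist = torch.linalg.norm(center_votes[remaining] - seed, dim=1)
            member_mask = dist < radius
            centers.append(center_votes[remaining[member_mask]].mean(dim=0))
            remaining = remaining[~member_mask]
        return torch.stack(centers)    # (num_instances, 3)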


Calibration table plane fitting

Table calibration using plane fitting

Pose predictions on calibrated workspace

Validated predictions in calibrated frame

3D scanned calibration object

Hand-eye calibration using 3D scan

Real bin picking result

Successful grasp execution
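
For reference, the table-plane calibration shown above can be reproduced with a standard RANSAC plane fit. The sketch below uses Open3D's segment_plane, which is an assumption on my part (Open3D is not part of the listed stack), and the 2 mm tolerance is illustrative.

    import numpy as np
    import open3d as o3d  # assumption: Open3D is not in the listed stack

    def fit_table_plane(points_xyz):
        """RANSAC plane fit for table calibration.

        points_xyz: (N, 3) cloud in the camera frame (e.g. a Zivid capture).
        Returns the unit plane normal, the offset d (normal . p + d = 0),
        and the inlier indices so the table can be removed before inference.
        """
        pcd = o3d.geometry.PointCloud()
        pcd.points = o3d.utility.Vector3dVector(points_xyz)
        (a, b, c, d), inliers = pcd.segment_plane(
            distance_threshold=0.002,  # 2 mm tolerance, sensor-dependent
            ransac_n=3,
            num_iterations=2000)
        scale = np.linalg.norm([a, b, c])
        return np.array([a, b, c]) / scale, d / scale, inliers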


References

Research Foundation:

  1. Point Transformer V3 - State-of-the-art point cloud backbone
  2. PPR-Net - Point-wise pose regression framework

Technical Stack:

  • PyTorch Lightning - Distributed training framework (6-GPU DDP; training sketch after this list)
  • Point Transformer V3 - Self-attention point cloud backbone
  • MuJoCo & pyBullet - Physics simulation for synthetic data
  • Hydra - Config-driven architecture (swappable backbones)
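
A hedged sketch of the 6-GPU DDP training setup with PyTorch Lightning: the placeholder PoseModule and the dummy data stand in for the real backbone, heads, and dataset, and mixed precision is an assumption rather than a documented choice.

    import torch
    import lightning.pytorch as pl
    from torch.utils.data import DataLoader, TensorDataset

    class PoseModule(pl.LightningModule):
        """Placeholder LightningModule; the real model wraps the PTv3
        backbone and the pose heads, which are not reproduced here."""

        def __init__(self):
            super().__init__()
            self.net = torch.nn.Linear(3, 9)  # stand-in for backbone + heads

        def training_step(self, batch, batch_idx):
            points, target = batch
            loss = torch.nn.functional.mse_loss(self.net(points), target)
            self.log("train_loss", loss, sync_dist=True)
            return loss

        def configure_optimizers(self):
            return torch.optim.AdamW(self.parameters(), lr=1e-3)

    if __name__ == "__main__":
        dummy = TensorDataset(torch.randn(1024, 3), torch.randn(1024, 9))
        trainer = pl.Trainer(
            accelerator="gpu",
            devices=6,               # one process per GPU under DDP
            strategy="ddp",
            precision="16-mixed",    # assumption: mixed precision
            max_epochs=100,
        )
        trainer.fit(PoseModule(), train_dataloaders=DataLoader(dummy, batch_size=64))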

Things I Learned

Deep Learning at Scale:

  • Distributed training orchestration (6x RTX 5880 scaling)
  • Curriculum learning from single-object to dense clutter
  • Symmetry-aware loss functions for rotational invariance (sketched below)
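
The symmetry-aware loss can be sketched as a minimum over the object's known symmetry rotations of a point-matching error, so a prediction that differs from the ground truth only by a symmetry transform incurs no penalty. The model points and symmetry set below are placeholders, not the project's actual objects.

    import torch

    def symmetry_aware_loss(pred_R, pred_t, gt_R, gt_t, model_pts, sym_rots):
        """Min-over-symmetries point-matching loss.

        pred_R, gt_R: (3, 3) rotations; pred_t, gt_t: (3,) translations.
        model_pts: (M, 3) points sampled on the object model.
        sym_rots: (S, 3, 3) rotations under which the object is unchanged,
                  e.g. the discrete symmetries of a tetromino-like part.
        """
        pred_pts = model_pts @ pred_R.T + pred_t                  # (M, 3)
        losses = []
        for S in sym_rots:
            # Fold the symmetry into the ground-truth pose before comparing.
            gt_pts = model_pts @ (gt_R @ S).T + gt_t              # (M, 3)
            losses.append(torch.linalg.norm(pred_pts - gt_pts, dim=1).mean())
        return torch.stack(losses).min()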

3D Vision:

  • Point cloud transformers vs. voxel methods
  • Sim-to-real gap challenges (sensor noise, material properties; augmentation sketch after this list)
  • Camera-robot calibration precision requirements
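
One common way to address the sensor-noise side of the sim-to-real gap is to corrupt clean synthetic clouds so they resemble real scans. The noise level and dropout ratio below are illustrative values, not the ones used in this project.

    import numpy as np

    def augment_synthetic_cloud(points, noise_std=0.001, dropout_ratio=0.1, rng=None):
        """Make a clean simulated point cloud look more like a real scan.

        points: (N, 3) synthetic cloud in metres.
        noise_std: per-point Gaussian jitter (~1 mm), mimicking depth noise.
        dropout_ratio: fraction of points removed, mimicking missing returns
                       on dark or specular surfaces.
        """
        rng = rng or np.random.default_rng()
        keep = rng.random(points.shape[0]) > dropout_ratio   # simulate dropouts
        kept = points[keep]
        return kept + rng.normal(0.0, noise_std, size=kept.shape)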

System Integration:

  • Real-time pipeline optimization (<200ms perception-to-action)
  • Modular design with Hydra configs (PointNet++/DGCNN/PTV3 swapping; config sketch after this list)
  • Production deployment debugging (network inference + robot control)
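
The Hydra-based backbone swapping works roughly as follows: the backbone is described by a _target_ entry in a config group, so switching between PointNet++, DGCNN, and PTV3 becomes a command-line override. The config paths and class names here are placeholders, not the project's actual layout.

    import hydra
    from hydra.utils import instantiate
    from omegaconf import DictConfig

    # Placeholder config group, e.g. configs/backbone/ptv3.yaml:
    #   _target_: models.backbones.PointTransformerV3   # hypothetical path
    #   feat_dim: 64
    # Swapping backbones is then: python train.py backbone=pointnet2

    @hydra.main(config_path="configs", config_name="train", version_base=None)
    def main(cfg: DictConfig) -> None:
        backbone = instantiate(cfg.backbone)  # builds whichever class the config names
        print(type(backbone).__name__)

    if __name__ == "__main__":
        main()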

Acknowledgments

Supervisor: Prof. Ziqi Wang, HKUST

Lab Resources: Von Neumann Institute, HKUST



PROJECT INFO

Year: 2026
Type: Robotics

HARDWARE

  • UR12e Robot
  • Zivid 3D Camera
  • NVIDIA RTX 5090D

SOFTWARE

  • PyTorch
  • Point Transformer V3
  • PyTorch Lightning
  • pyBullet
  • MuJoCo

TAGS

deep learning, computer vision, bin picking, sim-to-real