MID-POSE: Multi-Instrument Detection and Pose Estimation in Endoscopic Surgery

Wenhua Wei, Laurent Mennillo, Zhehua Mao, Anjana Wijekoon, Kendall Feeny, Danyal Zaman Khan, Evangelos B. Mazomenos, Danail Stoyanov, Hani J. Marcus, Sophia Bano
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:1095-1114, 2026.

Abstract

Reliable perception of surgical instruments is a key prerequisite for intraoperative guidance, context-aware assistance, and workflow analysis in minimally invasive surgery (MIS). This is particularly challenging in skull base procedures, where narrow anatomical corridors, frequent occlusions, specular highlights, and visually similar instruments make multi-class detection and 2D pose estimation difficult. We address joint instrument detection and keypoint-based pose estimation from monocular endoscopic videos and introduce MID-POSE, a dual-head architecture that couples a high-resolution HRNetV2p encoder with a class-agnostic dense detection-pose head and a Multi-level Instrument Classification (MIC) head which operates on RoI-aligned multi-level features. To support this task, we construct the PitSurg dataset from 26 clinical procedures, providing seven instrument classes with bounding boxes and detailed 2D keypoints. Using YOLOv8x-pose as our strongest baseline, which in our tasks outperforms YOLO11x-pose, MID-POSE improves Det/Pose $AP_{50\text{–}95}$ on PitSurg from $59.4/63.1$ to $77.5/78.5$ and on the robotic SurgPose dataset from $47.9/61.1$ to $62.7/71.4$. Qualitative analysis shows that high-resolution features sharpen localisation and keypoint placement, while the RoI classifier reduces misclassifications and spurious background detections, indicating that the proposed architecture and dataset provide an effective basis for robust multi-instrument perception in MIS.

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-wei26a,
  title     = {MID-POSE: Multi-Instrument Detection and Pose Estimation in Endoscopic Surgery},
  author    = {Wei, Wenhua and Mennillo, Laurent and Mao, Zhehua and Wijekoon, Anjana and Feeny, Kendall and Khan, Danyal Zaman and Mazomenos, Evangelos B. and Stoyanov, Danail and Marcus, Hani J. and Bano, Sophia},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {1095--1114},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/wei26a/wei26a.pdf},
  url       = {https://proceedings.mlr.press/v315/wei26a.html},
  abstract  = {Reliable perception of surgical instruments is a key prerequisite for intraoperative guidance, context-aware assistance, and workflow analysis in minimally invasive surgery (MIS). This is particularly challenging in skull base procedures, where narrow anatomical corridors, frequent occlusions, specular highlights, and visually similar instruments make multi-class detection and 2D pose estimation difficult. We address joint instrument detection and keypoint-based pose estimation from monocular endoscopic videos and introduce MID-POSE, a dual-head architecture that couples a high-resolution HRNetV2p encoder with a class-agnostic dense detection-pose head and a Multi-level Instrument Classification (MIC) head which operates on RoI-aligned multi-level features. To support this task, we construct the PitSurg dataset from 26 clinical procedures, providing seven instrument classes with bounding boxes and detailed 2D keypoints. Using YOLOv8x-pose as our strongest baseline, which in our tasks outperforms YOLO11x-pose, MID-POSE improves Det/Pose $AP_{50\text{–}95}$ on PitSurg from $59.4/63.1$ to $77.5/78.5$ and on the robotic SurgPose dataset from $47.9/61.1$ to $62.7/71.4$. Qualitative analysis shows that high-resolution features sharpen localisation and keypoint placement, while the RoI classifier reduces misclassifications and spurious background detections, indicating that the proposed architecture and dataset provide an effective basis for robust multi-instrument perception in MIS.}
}
Endnote
%0 Conference Paper
%T MID-POSE: Multi-Instrument Detection and Pose Estimation in Endoscopic Surgery
%A Wenhua Wei
%A Laurent Mennillo
%A Zhehua Mao
%A Anjana Wijekoon
%A Kendall Feeny
%A Danyal Zaman Khan
%A Evangelos B. Mazomenos
%A Danail Stoyanov
%A Hani J. Marcus
%A Sophia Bano
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-wei26a
%I PMLR
%P 1095--1114
%U https://proceedings.mlr.press/v315/wei26a.html
%V 315
%X Reliable perception of surgical instruments is a key prerequisite for intraoperative guidance, context-aware assistance, and workflow analysis in minimally invasive surgery (MIS). This is particularly challenging in skull base procedures, where narrow anatomical corridors, frequent occlusions, specular highlights, and visually similar instruments make multi-class detection and 2D pose estimation difficult. We address joint instrument detection and keypoint-based pose estimation from monocular endoscopic videos and introduce MID-POSE, a dual-head architecture that couples a high-resolution HRNetV2p encoder with a class-agnostic dense detection-pose head and a Multi-level Instrument Classification (MIC) head which operates on RoI-aligned multi-level features. To support this task, we construct the PitSurg dataset from 26 clinical procedures, providing seven instrument classes with bounding boxes and detailed 2D keypoints. Using YOLOv8x-pose as our strongest baseline, which in our tasks outperforms YOLO11x-pose, MID-POSE improves Det/Pose $AP_{50\text{–}95}$ on PitSurg from $59.4/63.1$ to $77.5/78.5$ and on the robotic SurgPose dataset from $47.9/61.1$ to $62.7/71.4$. Qualitative analysis shows that high-resolution features sharpen localisation and keypoint placement, while the RoI classifier reduces misclassifications and spurious background detections, indicating that the proposed architecture and dataset provide an effective basis for robust multi-instrument perception in MIS.
APA
Wei, W., Mennillo, L., Mao, Z., Wijekoon, A., Feeny, K., Khan, D.Z., Mazomenos, E.B., Stoyanov, D., Marcus, H.J. & Bano, S. (2026). MID-POSE: Multi-Instrument Detection and Pose Estimation in Endoscopic Surgery. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:1095-1114. Available from https://proceedings.mlr.press/v315/wei26a.html.