Shunli Wang (王顺利)
Ph.D. Student at Fudan University, Shanghai
E-mail: slwang19[at]fudan.edu.cn
I am a fifth-year Ph.D. student at the Academy for Engineering and Technology, Fudan University. As a member of the Cognition and Intelligent Technology Laboratory (CIT-Lab), I am advised by Prof. Lihua Zhang. Before that, I received my B.S. degree from the School of Electrical Engineering and Automation, Anhui University, in 2019. My research interests are Vision-based Fine-grained Action Recognition and Action Quality Assessment in medical scenes. I have also explored other vision tasks such as multi-target tracking, human-object interaction detection, network quantization, and 6D pose estimation of satellites.
Publications
● Research of Fine-grained Medical Action Recognition and Skill Assessment Technologies
Shunli Wang
Ph.D. Dissertation of Fudan University
Currently, China faces problems such as a shortage of medical resources, overpressure on the medical service system, and an uneven regional distribution of medical resources. Given China's vast population base and ageing trend, the gap in the number of medical personnel will continue to widen. An effective way to deal with these problems is to close this gap as soon as possible while ensuring the quality of medical services. However, training medical personnel is a long and costly process. Improving the efficiency and reducing the cost of medical training have therefore become critical issues in alleviating the pressure on the medical system. Among the many phases of medical teaching and training, skill training occupies a pivotal position: high-quality skill training can effectively improve the clinical competence of medical staff and reduce the incidence of medical accidents.
Under the traditional medical skill training and assessment mode, doctors must observe and guide students' operations in person. Although this method ensures high-quality teaching, it always requires the participation of doctors, incurring high labor costs and heavy workloads. These costs limit the efficiency and scale of medical skill training. An intelligent medical skill assessment system based on artificial intelligence offers a promising way to solve these problems: first, record the operation process of medical students through vision and other sensors; then, use artificial intelligence algorithms to achieve fine-grained action recognition and accurate automatic skill assessment; finally, feed the assessment results back to the trainees. Such a system can effectively improve the efficiency of training and testing, thereby significantly reducing doctors' workload and the human cost of medical skill training.
● CPR-Coach: Recognizing Composite Error Actions based on Single-class Training
Shunli Wang, Shuaibing Wang, Dingkang Yang, Mingcheng Li, Haopeng Kuang, Xiao Zhao, Liuzhen Su, Peng Zhai, Lihua Zhang*
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)
Fine-grained medical action analysis plays a vital role in improving medical skill training efficiency, but it faces a shortage of both data and algorithms. Cardiopulmonary Resuscitation (CPR) is an essential skill in emergency treatment. Currently, the assessment of CPR skills mainly depends on dummies and trainers, leading to high training costs and low efficiency. For the first time, this paper constructs a vision-based system to complete error action recognition and skill assessment in CPR. Specifically, we define 13 types of single-error actions and 74 types of composite error actions during external cardiac compression and then develop a video dataset named CPR-Coach. Taking CPR-Coach as a benchmark, this paper investigates and compares the performance of existing action recognition models based on different data modalities. To solve the unavoidable “Single-class Training & Multi-class Testing” problem, we propose a human-cognition-inspired framework named ImagineNet to improve the model's multi-error recognition performance under restricted supervision. Extensive comparison and actual deployment experiments verify the effectiveness of the framework. We hope this work could bring new inspiration to the computer vision and medical skills training communities simultaneously. The dataset and the code are publicly available on GitHub.
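The "Single-class Training & Multi-class Testing" setting means a model only sees clips containing one error during training but must flag several co-occurring errors at test time. ImagineNet's internals are not detailed in the abstract; the sketch below only illustrates that evaluation setting with a plain per-class sigmoid head and dummy features, not the authors' method.

```python
import torch
import torch.nn as nn

NUM_ERRORS = 13  # number of single-error classes defined in CPR-Coach

class ErrorClassifier(nn.Module):
    """Per-class sigmoid head on top of pre-extracted clip features (illustrative only)."""
    def __init__(self, feat_dim=2048, num_errors=NUM_ERRORS):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_errors)

    def forward(self, clip_feat):            # clip_feat: (B, feat_dim)
        return torch.sigmoid(self.head(clip_feat))

# Training: each clip carries exactly ONE error label (single-class supervision).
# Testing: a clip may contain SEVERAL errors, so every class score is thresholded
# independently instead of taking an argmax.
model = ErrorClassifier()
clip_feat = torch.randn(4, 2048)              # dummy backbone features
scores = model(clip_feat)                     # (4, 13) independent error probabilities
predicted_errors = (scores > 0.5).nonzero()   # multi-label decisions at test time
```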
● CPR-CLIP: Multimodal Pre-training for Composite Error Recognition in CPR Training
Shunli Wang, Dingkang Yang, Peng Zhai, Lihua Zhang*
IEEE Signal Processing Letters (SPL 2023)
The high cost of the medical skill training paradigm hinders the development of medical education and has attracted widespread attention in the intelligent signal processing community. To address composite error action recognition in Cardiopulmonary Resuscitation (CPR) training, this letter proposes a multimodal pre-training framework named CPR-CLIP based on prompt engineering. Specifically, we design three prompts to fuse multiple errors naturally at the semantic level and then align linguistic and visual features via a contrastive pre-training loss. Extensive experiments verify the effectiveness of CPR-CLIP. Finally, CPR-CLIP is encapsulated into an electronic assistant, and four doctors are recruited for evaluation. A nearly four-fold efficiency improvement is observed in comparative experiments, demonstrating the practicality of the system. We hope this work brings new insights to the intelligent medical skill training and signal processing communities simultaneously. The code is available on GitHub.
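The core idea is CLIP-style alignment: the names of the errors present in a clip are composed into a natural-language prompt, and the clip's visual feature is pulled toward that sentence's text feature with a contrastive loss. The exact prompt templates and encoders are not given here; the snippet below is a hedged sketch using a symmetric InfoNCE loss, a made-up prompt template, and placeholder features standing in for the encoders.

```python
import torch
import torch.nn.functional as F

def compose_prompt(error_names):
    # One possible template; the letter designs three prompts, not reproduced here.
    return "a video of chest compression with errors: " + ", ".join(error_names)

def clip_style_loss(video_feat, text_feat, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss between paired video/text features."""
    v = F.normalize(video_feat, dim=-1)   # (B, D)
    t = F.normalize(text_feat, dim=-1)    # (B, D)
    logits = v @ t.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Hypothetical error names, and dummy features standing in for the two encoders.
prompt = compose_prompt(["incorrect compression position", "bent arms"])
video_feat = torch.randn(8, 512)
text_feat = torch.randn(8, 512)
loss = clip_style_loss(video_feat, text_feat)
```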
● CA-SpaceNet: Counterfactual Analysis for 6D Pose Estimation in Space
Shunli Wang, Shuaibing Wang, Bo Jiao, Dingkang Yang, Liuzhen Su, Peng Zhai, Chixiao Chen, Lihua Zhang*
2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022)
Reliable and stable 6D pose estimation of uncooperative space objects plays an essential role in on-orbit servicing and debris removal missions. Considering that pose estimators are sensitive to background interference, this paper proposes a counterfactual analysis framework named CA-SpaceNet for robust 6D pose estimation of space-borne targets against complicated backgrounds. Specifically, conventional methods are adopted to extract features from the whole image in the factual case. In the counterfactual case, a non-existent image containing only the background, without the target, is imagined. The side effect caused by background interference is reduced by counterfactual analysis, which leads to unbiased final predictions. In addition, we carry out low-bit-width quantization of CA-SpaceNet and deploy part of the framework on a Processing-In-Memory (PIM) accelerator on an FPGA. Qualitative and quantitative results demonstrate the effectiveness and efficiency of the proposed method. To the best of our knowledge, this is the first work to apply causal inference and network quantization to the 6D pose estimation of space-borne targets. The code is available at https://github.com/Shunli-Wang/CA-SpaceNet.
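In counterfactual analysis of this kind, the counterfactual branch estimates what the network would predict from the background alone, and that estimate is subtracted from the factual (whole-image) prediction so only target-driven evidence remains. The fragment below is a schematic of this debiasing step with invented tensor names; it is not the released implementation.

```python
import torch

def debiased_prediction(factual_logits, counterfactual_logits):
    """Counterfactual analysis: remove the background-only contribution
    from the full-image prediction (schematic, not the official code)."""
    return factual_logits - counterfactual_logits

# factual: prediction from the whole image (target + background)
# counterfactual: prediction from an "imagined" image containing only the background
factual_logits = torch.randn(1, 11)
counterfactual_logits = torch.randn(1, 11)
unbiased_logits = debiased_prediction(factual_logits, counterfactual_logits)
```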
● TSA-Net: Tube Self-Attention Network for Action Quality Assessment
Shunli Wang, Dingkang Yang, Peng Zhai, Chixiao Chen, Lihua Zhang*
ACM International Conference on Multimedia (ACM-MM 2021) (Oral)
In recent years, assessing action quality from videos has attracted growing attention in the computer vision and human-computer interaction communities. Most existing approaches tackle this problem by directly migrating models from action recognition tasks, which ignores intrinsic differences within the feature map, such as foreground and background information. To address this issue, we propose a Tube Self-Attention Network (TSA-Net) for action quality assessment (AQA). Specifically, we introduce a single-object tracker into AQA and propose the Tube Self-Attention (TSA) module, which efficiently generates rich spatio-temporal contextual information through sparse feature interactions. The TSA module is embedded into existing video networks to form TSA-Net. Overall, TSA-Net has the following merits: 1) high computational efficiency, 2) high flexibility, and 3) state-of-the-art performance. Extensive experiments are conducted on the popular action quality assessment datasets AQA-7 and MTL-AQA. In addition, a dataset named Fall Recognition in Figure Skating (FR-FS) is proposed to explore basic action assessment in the figure skating scene. TSA-Net achieves Spearman's rank correlations of 0.8476 and 0.9393 on AQA-7 and MTL-AQA, respectively, which are new state-of-the-art results. The results on FR-FS also verify the effectiveness of TSA-Net. The code and the FR-FS dataset are publicly available at https://github.com/Shunli-Wang/TSA-Net.
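The TSA module restricts self-attention to the features that fall inside the spatio-temporal tube produced by the single-object tracker, so attention is computed over a small set of foreground positions rather than the whole feature map. The sketch below, with assumed shapes and a boolean tube mask, illustrates this sparse gather-attend-scatter pattern; it is not the released code.

```python
import torch
import torch.nn as nn

def tube_self_attention(feat, tube_mask, attn):
    """feat: (T, C, H, W) clip features; tube_mask: (T, H, W) bool mask from a tracker.
    Self-attention is applied only to positions inside the tube (illustrative)."""
    T, C, H, W = feat.shape
    flat = feat.permute(0, 2, 3, 1).reshape(-1, C)        # (T*H*W, C)
    idx = tube_mask.reshape(-1).nonzero(as_tuple=True)[0] # indices inside the tube
    tube_tokens = flat[idx].unsqueeze(0)                  # (1, N, C), N tube positions
    attended, _ = attn(tube_tokens, tube_tokens, tube_tokens)
    out = flat.clone()
    out[idx] = attended.squeeze(0)                        # scatter back into the map
    return out.reshape(T, H, W, C).permute(0, 3, 1, 2)

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
feat = torch.randn(8, 64, 14, 14)
tube_mask = torch.zeros(8, 14, 14, dtype=torch.bool)
tube_mask[:, 4:10, 4:10] = True                           # dummy tracker boxes
out = tube_self_attention(feat, tube_mask, attn)
```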
● A Survey of Video-based Action Quality Assessment
Shunli Wang, Dingkang Yang, Peng Zhai, Qing Yu, Tao Suo, Zhan Sun, Ka Li, Lihua Zhang*
International Conference on Networking Systems of AI (INSAI 2021) (Oral)
Human action recognition and analysis are in great demand and have important applications in video surveillance, video retrieval, and human-computer interaction. The task of human action quality assessment requires an intelligent system to evaluate actions completed by a person automatically and objectively. Action quality assessment models can reduce the human and material resources spent on evaluation and reduce subjectivity. In this paper, we provide a comprehensive survey of existing work on video-based action quality assessment. Unlike human action recognition, the application scenarios of action quality assessment are relatively narrow; most existing work focuses on sports and medical care. We first introduce the definition and challenges of human action quality assessment. We then present the existing datasets and evaluation metrics. In addition, we summarize methods in sports and medical care by model category and publishing institution, according to the characteristics of the two fields. Finally, combined with recent work, promising development directions in action quality assessment are discussed.
● A 0.57-GOPS/DSP Object Detection PIM Accelerator on FPGA
Bo Jiao, Jinshan Zhang, Yuanyuan Xie, Shunli Wang, Haozhe Zhu, Xiaoyang Kang, Zhiyan Dong, Lihua Zhang, Chixiao Chen*
Proceedings of the 26th Asia and South Pacific Design Automation Conference (ASP-DAC '21)
The paper presents an object detection accelerator featuring a processing-in-memory (PIM) architecture on FPGAs. PIM architectures are well known for their energy efficiency and avoidance of the memory wall. In the accelerator, a PIM unit is developed using BRAM- and LUT-based counters, which also helps to improve the DSP performance density. The overall architecture consists of 64 PIM units and three memory buffers that store inter-layer results. A shrunk and quantized Tiny-YOLO network is mapped to the PIM accelerator, where DRAM access is fully eliminated during inference. The design achieves a throughput of 201.6 GOPS at a 100 MHz clock rate and, correspondingly, a performance density of 0.57 GOPS/DSP.
● Computing Utilization Enhancement for Chiplet-based Homogeneous Processing-in-Memory Deep Learning Processors
Bo Jiao, Haozhe Zhu, Jinshan Zhang, Shunli Wang, Xiaoyang Kang, Lihua Zhang, Mingyu Wang and Chixiao Chen*
Proceedings of the 2021 on Great Lakes Symposium on VLSI (GLSVLSI '21)
This paper presents a design strategy for chiplet-based processing-in-memory systems for deep neural network applications. Monolithic silicon chips are area- and power-limited, failing to keep pace with the recent rapid growth of deep learning algorithms. The paper first demonstrates a straightforward layer-wise method that partitions the workload of a monolithic accelerator into a multi-chiplet pipeline. A quantitative analysis shows that this straightforward separation degrades the overall utilization of computing resources due to the reduced on-chiplet memory size, thus introducing a higher memory wall. A tile interleaving strategy is proposed to overcome this degradation: it segments one layer across different chiplets, which maximizes computing utilization. The hardware modifications required to support the strategy are also discussed. To validate the proposed strategy, a nine-chiplet processing-in-memory system is evaluated with a custom-designed object detection network. Each chiplet achieves a peak performance of 204.8 GOPS at a 100 MHz clock rate, and the peak performance of the overall system is 1.711 TOPS, with no off-chip memory access needed. With the tile interleaving strategy, utilization improves from 53.9% to 92.8%.
Joint Works
● HandGCAT: Occlusion-Robust 3D Hand Mesh Reconstruction from Monocular Images
Shuaibing Wang, Shunli Wang, Dingkang Yang, Mingcheng Li, Ziyun Qian, Liuzhen Su, Lihua Zhang*
IEEE International Conference on Multimedia & Expo (ICME 2023)
We propose a robust and accurate method for reconstructing 3D hand meshes from monocular images. This is a very challenging problem, as hands are often severely occluded by objects. Previous works have often disregarded 2D hand pose information, which contains hand prior knowledge that is strongly correlated with occluded regions. In this work, we therefore propose a novel 3D hand mesh reconstruction network, HandGCAT, that can fully exploit the hand prior as compensation information to enhance occluded-region features. Specifically, we design a Knowledge-Guided Graph Convolution (KGC) module and a Cross-Attention Transformer (CAT) module. KGC extracts hand prior information from the 2D hand pose by graph convolution, and CAT fuses this prior into occluded regions by exploiting their high correlation. Extensive experiments on popular datasets with challenging hand-object occlusions, such as HO3D v2, HO3D v3, and DexYCB, demonstrate that HandGCAT reaches state-of-the-art performance.
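As described, KGC lifts the 2D pose into prior features with graph convolutions over the hand skeleton, and CAT injects that prior into the (possibly occluded) image features through cross-attention, with image tokens as queries and pose tokens as keys and values. The snippet below sketches only the cross-attention fusion step with assumed dimensions; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse hand-prior (pose) tokens into image tokens via cross-attention (sketch)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, pose_tokens):
        # Queries come from image regions (occluded areas included);
        # keys/values come from the 2D-pose prior produced by graph convolution.
        fused, _ = self.attn(img_tokens, pose_tokens, pose_tokens)
        return self.norm(img_tokens + fused)

cat = CrossAttentionFusion()
img_tokens = torch.randn(2, 196, 256)   # e.g. 14x14 image feature tokens
pose_tokens = torch.randn(2, 21, 256)   # 21 hand joints after graph convolution
out = cat(img_tokens, pose_tokens)
```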
● Emotion Recognition for Multiple Context Awareness
Dingkang Yang, Shuai Huang, Shunli Wang, Yang Liu, Peng Zhai, Liuzhen Su, Mingcheng Li, Lihua Zhang*
European Conference on Computer Vision (ECCV 2022)
Understanding emotion in context is a rising hotspot in the computer vision community. Existing methods lack reliable context semantics to mitigate the uncertainty of emotional expression and fail to model multiple context representations complementarily. To alleviate these issues, we present a context-aware emotion recognition framework that combines four complementary contexts. The first context is multimodal emotion recognition based on facial expressions, facial landmarks, gesture, and gait. Second, we adopt channel and spatial attention modules to obtain the emotion semantics of the scene context. Inspired by sociological theory, we explore emotion transmission between agents by constructing relationship graphs in the third context. Meanwhile, we propose a novel agent-object context, which aggregates emotion cues from the interactions between surrounding agents and objects in the scene to mitigate prediction ambiguity. Finally, we introduce an adaptive relevance fusion module for learning shared representations among multiple contexts. Extensive experiments show that our approach outperforms state-of-the-art methods on both the EMOTIC and GroupWalk datasets. We also release a dataset annotated with diverse emotion labels, Human Emotion in Context (HECO). Compared with existing methods on HECO, our approach obtains a higher classification average precision of 50.65% and a lower regression mean error rate of 0.7. The project is available at https://heco2022.github.io/.
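The adaptive relevance fusion module learns how much each of the four context representations should contribute to the shared emotion representation. Its exact form is not given in the abstract; under that caveat, a minimal sketch could weight per-context features with learned, softmax-normalized relevance scores, as below (all names and dimensions are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRelevanceFusion(nn.Module):
    """Weight per-context features by learned relevance scores (illustrative sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, context_feats):                            # (B, num_contexts, dim)
        weights = F.softmax(self.score(context_feats), dim=1)    # (B, num_contexts, 1)
        return (weights * context_feats).sum(dim=1)              # (B, dim) shared repr.

fusion = AdaptiveRelevanceFusion()
contexts = torch.randn(8, 4, 256)   # body/face, scene, agent relations, agent-object
shared = fusion(contexts)
```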
Challenges
ICCV-2023 Demo Proposal Accepted
Our demonstration of the CPR-Coach dataset and ImagineNet has been accepted by the ICCV-2023 Demo. This is the Home Page of the project. The CPR-Coach dataset and the code of ImagineNet are publicly available on the page. The demonstration Video of this system is available now.
ICCV-2023 Demo Proposal Accepted
Our demonstration of CA-SpaceNet (IROS 2022) has been accepted by the ICCV-2023 Demo. This is the introduction PDF and presentation Video of the demo. The code of CA-SpaceNet is available on GitHub.
The Second Place in the ACM-MM'21 3rd VRU Challenge
Our team, Planck, won second place in the 3rd Video Relationship Understanding (VRU) Challenge of the ACM Multimedia 2021 Grand Challenges. This is our Certificate.
Cube Robot Based on DSP and STM32
In 2017, Zeguang Chang and I built a cube-solving robot based on a DSP and an STM32. We open-sourced all the code of CubeRobot in this Repository, and you can find the complete solving video on BiliBili. Our robot won the third prize in the 2018 Innovation Competition of Anhui Province.
Education/Work

Sep 2019 - present

Ph.D. Student at the Academy for Engineering and Technology, Fudan University, Shanghai

Research Topics: Fine-grained action recognition and action quality assessment in medical scenes.

Sep 2015 - Jun 2019

Bachelor's Degree, Anhui University

Thesis: Figure Skating Analysis System Based on Multi-target Tracking and Posture Estimation