Yaqi Xia (夏亚奇)
PhD, School of Computer Science, Wuhan University
I am Yaqi Xia, currently working toward a Ph.D. degree in computer science at Wuhan University under the supervision of Prof. Dazhao Cheng. My research interests include distributed deep learning model training and high-performance computing systems for AI/ML. Before pursuing my Ph.D., I obtained both my Bachelor's and Master's degrees from Xidian University, where I had the privilege of being mentored by Prof. Rui Song.
Wuhan University
Ph.D. in Artificial Intelligence Sep. 2021 - Present
Xidian University
M.S. in Electronics and Communication Engineering Sep. 2018 - Jul. 2021
Xidian University
B.S. in Communication Engineering Sep. 2014 - Jul. 2018
Research Center for Graph Computing, Zhejiang Lab
Research Intern Aug. 2023 - Dec. 2023
Yaqi Xia†, Weihu Wang†, Donglin Yang, Xiaobo Zhou, Dazhao Cheng († equal contribution)
USENIX Annual Technical Conference (ATC) 2025 · Conference · CCF-A
We introduce Voltrix-SpMM, a new GPU kernel design for sparse matrix-matrix multiplication (SpMM).
Hulin Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou, Dazhao Cheng
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2025 · Conference · CCF-A
We introduce CCFuser, a novel framework designed for efficient training of Mixture-of-Experts (MoE) models.
Yaqi Xia, Zheng Zhang, Donglin Yang, Chuang Hu, Xiaobo Zhou, Hongyang Chen, Qianlong Sang, Dazhao Cheng
IEEE Transactions on Parallel and Distributed Systems (TPDS) 2024 · Journal · CCF-A
This work introduces Sven, an algorithm-system co-designed library that accelerates temporal graph neural network (TGNN) training on multi-GPU platforms.
Weihu Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou, Dazhao Cheng
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2024 · Conference · CCF-A
We introduce EcoRec, a library that accelerates deep learning recommendation model (DLRM) training by integrating tensor-train (TT) decomposition with distributed training.
Yaqi Xia, Donglin Yang, Xiaobo Zhou, Dazhao Cheng
The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2024 · Conference · CCF-A
We introduce HyDRA, a framework for sampling-based graph neural network (GNN) training on large-scale graphs.
Hulin Wang, Donglin Yang, Yaqi Xia, Zheng Zhang, Qigang Wang, Jianping Fan, Xiaobo Zhou, Dazhao Cheng
IEEE Transactions on Computers (TC) 2024 · Journal · CCF-A
We present Raptor-T, a transformer framework designed for long and variable-length sequences. Raptor-T leverages sparse transformers to reduce the resource requirements of processing long sequences, while applying system-level optimizations to accelerate inference.
Zheng Zhang, Yaqi Xia, Hulin Wang, Donglin Yang, Chuang Hu, Xiaobo Zhou, Dazhao Cheng
IEEE Transactions on Parallel and Distributed Systems (TPDS) 2024 · Journal · CCF-A
In this paper, we present the design and implementation of MPMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism.
Yaqi Xia, Zheng Zhang, Hulin Wang, Donglin Yang, Xiaobo Zhou, Dazhao Cheng
The 32nd International Symposium on High-Performance Parallel and Distributed Computing (ACM HPDC) 2023 · Conference · CCF-B · Best Paper Nomination
This paper presents Sven, an algorithm-system co-designed TGNN training library for end-to-end performance optimization on multi-node, multi-GPU systems.
Yaqi Xia†, Yan Xia†, Wei Li, Rui Song, Kailang Cao, Uwe Stilla († equal contribution)
Proceedings of the 29th ACM International Conference on Multimedia (ACM MM) 2021 · Conference · CCF-A
We tackle the problem of object completion from point clouds and propose a novel point cloud completion network employing an Asymmetrical Siamese Feature Matching strategy, termed ASFM-Net.