Publications
2025
- [TMC’25] TypeFly: Low-Latency Drone Planning with Large Language Models. Guojun Chen, Xiaojing Yu, Neiwen Ling, and Lin Zhong. To appear in IEEE Trans. Mobile Computing.
2024
- [arXiv] TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications. Neiwen Ling, Guojun Chen, and Lin Zhong. arXiv preprint arXiv:2412.18695.
Large Language Models (LLMs) such as GPT-4 and Llama3 can already comprehend complex commands and process diverse tasks. This advancement facilitates their application in controlling drones and robots for various tasks. However, existing LLM serving systems typically employ a first-come, first-served (FCFS) batching mechanism, which fails to address the time-sensitive requirements of robotic applications. To address this gap, this paper proposes a new system named TimelyLLM that serves multiple robotic agents with time-sensitive requests. TimelyLLM introduces novel mechanisms of segmented generation and scheduling that optimally leverage redundancy between robot plan generation and execution phases. We report an implementation of TimelyLLM on a widely used LLM serving framework and evaluate it on a range of robotic applications. Our evaluation shows that TimelyLLM improves time utility by up to 1.97x and reduces overall waiting time by 84%.
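To make the segmented-serving idea concrete, the following is a minimal Python sketch of one plausible policy: generate one short plan segment for the request with the least slack, then requeue it, so the robot executes segment k while segment k+1 is generated. The Request fields, timings, and generate_segment() stub are illustrative assumptions, not TimelyLLM's actual scheduler.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    slack: float                      # time left before the robot runs out of plan
    rid: int = field(compare=False)
    remaining_segments: int = field(compare=False)
    exec_time_per_segment: float = field(compare=False, default=2.0)

def generate_segment(rid: int) -> str:
    """Stub for one short burst of LLM decoding (one plan segment)."""
    time.sleep(0.01)
    return f"segment-for-request-{rid}"

def serve(requests: list[Request]) -> None:
    # Instead of finishing one whole plan at a time (FCFS), emit one segment
    # for the most urgent request, then requeue it.
    heap = list(requests)
    heapq.heapify(heap)
    while heap:
        req = heapq.heappop(heap)
        print(f"req {req.rid}: dispatched {generate_segment(req.rid)}")
        req.remaining_segments -= 1
        if req.remaining_segments > 0:
            # Executing the dispatched segment buys the request more slack.
            req.slack += req.exec_time_per_segment
            heapq.heappush(heap, req)

serve([Request(slack=1.0, rid=0, remaining_segments=3),
       Request(slack=0.5, rid=1, remaining_segments=2)])
```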
- [MobiCom’24] Soar: Design and Deployment of A Smart Roadside Infrastructure System for Autonomous Driving. Shuyao Shi (co-primary), Neiwen Ling (co-primary), Zhehao Jiang (co-primary), Xuan Huang (co-primary), Yuze He, Xiaoguang Zhao, Bufang Yang, Chen Bian, and 4 more authors. The 30th Annual International Conference on Mobile Computing And Networking (ACM MobiCom 2024).
Best Artifact Award Runner-Up
Recently, smart roadside infrastructure (SRI) has demonstrated the potential of achieving fully autonomous driving systems. To explore the potential of infrastructure-assisted autonomous driving, this paper presents the design and deployment of Soar, the first end-to-end SRI system specifically designed to support autonomous driving systems. Soar consists of both software and hardware components carefully designed to overcome various system and physical challenges. Soar can leverage existing operational infrastructure like street lampposts for a lower barrier of adoption. Soar adopts a new communication architecture that comprises a bi-directional multi-hop I2I network and a downlink I2V broadcast service, which are designed based on off-the-shelf 802.11ac interfaces in an integrated manner. Soar also features a hierarchical DL task management framework to achieve desirable load balancing among nodes and enable them to collaborate efficiently to run multiple data-intensive autonomous driving applications. We deployed a total of 18 Soar nodes on existing lampposts on campus, which have been operational for over two years. Our real-world evaluation shows that Soar can support a diverse set of autonomous driving applications and achieve desirable real-time performance and high communication reliability. Our findings and experiences in this work offer key insights into the development and deployment of next-generation smart roadside infrastructure and autonomous driving systems.
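As a rough illustration of one ingredient of such a system, load balancing DL tasks across infrastructure nodes, here is a toy greedy placement in Python; the node capacities, task costs, and policy are invented stand-ins for Soar's hierarchical DL task management framework.

```python
# Greedily place each DL task on the feasible node with the most remaining
# compute, so no single lamppost node becomes a hotspot.
nodes = {"lamppost-1": 10.0, "lamppost-2": 10.0, "lamppost-3": 6.0}  # TOPS left
tasks = [("detection", 4.0), ("tracking", 3.0), ("fusion", 5.0), ("reid", 2.0)]

def place(tasks, capacity):
    plan = {}
    for name, cost in sorted(tasks, key=lambda t: -t[1]):    # biggest first
        candidates = [n for n, cap in capacity.items() if cap >= cost]
        if not candidates:
            raise RuntimeError(f"no node can host {name}")
        best = max(candidates, key=lambda n: capacity[n])    # most headroom
        capacity[best] -= cost
        plan[name] = best
    return plan

print(place(tasks, dict(nodes)))
```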
- [arXiv] TypeFly: Flying Drones with Large Language Model. Guojun Chen, Xiaojing Yu, Neiwen Ling, and Lin Zhong. arXiv preprint arXiv:2312.14950.
Recent advancements in robot control using large language models (LLMs) have demonstrated significant potential, primarily due to LLMs’ capabilities to understand natural language commands and generate executable plans in various languages. However, in real-time and interactive applications involving mobile robots, particularly drones, the sequential token generation process inherent to LLMs introduces substantial latency, i.e., response time, in control plan generation. In this paper, we present a system called TypeFly that tackles this problem using a combination of a novel programming language called MiniSpec and its runtime to reduce the plan generation time and drone response time. That is, instead of asking an LLM to write a program (robotic plan) in the popular but verbose Python, TypeFly has it write the plan in MiniSpec, a language specially designed for token efficiency and stream interpretation. Using a set of challenging drone tasks, we show that the design choices made by TypeFly can reduce response time by up to 62% and provide a more consistent user experience, enabling responsive and intelligent LLM-based drone control.
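Here is a toy Python interpreter illustrating why token efficiency plus stream interpretation cuts response time: each command executes as soon as its tokens arrive, before generation finishes. The MiniSpec-like syntax and action table below are invented for illustration and are not the real MiniSpec grammar.

```python
def token_stream():
    # Stands in for tokens streaming out of an LLM.
    yield from ["tf,", "90", ";", "mf,", "50", ";", "l"]

ACTIONS = {"tf": "turn", "mf": "move_forward", "l": "land"}

def run(cmd: str):
    name, _, arg = cmd.partition(",")
    print(f"drone.{ACTIONS[name]}({arg})")   # act before the plan is complete

def interpret(stream):
    buf = ""
    for tok in stream:
        buf += tok
        while ";" in buf:                    # a ';' closes one executable command
            cmd, buf = buf.split(";", 1)
            run(cmd)
    if buf:                                  # trailing command without ';'
        run(buf)

interpret(token_stream())
```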
- [RTCSA’24] Timely Fusion of Surround Radar/Lidar for Object Detection in Autonomous Driving Systems. Wenjing Xie, Tao Hu, Neiwen Ling, Guoliang Xing, Chun Jason Xue, and Nan Guan. The 30th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (IEEE RTCSA 2024).
Fusion of multiple sensor modalities, such as camera, Lidar, and Radar, is commonly used in autonomous driving systems to fully utilize the complementary advantages of different sensors. Surround Radar/Lidar can provide 360-degree view sampling at minimal cost, making them promising sensing hardware solutions for autonomous driving systems. However, due to intrinsic physical constraints, the rotating speed (i.e., the frequency of generating data frames) of surround Radar is much lower than that of surround Lidar, and existing Radar/Lidar fusion methods have to work at the low frequency of surround Radar, which cannot meet the high responsiveness requirement of autonomous driving systems. This paper develops techniques to fuse surround Radar/Lidar at a working frequency limited only by the faster surround Lidar instead of the slower surround Radar, based on the state-of-the-art Radar/Lidar DNN model MVDNet. The basic idea of our approach is simple: we let MVDNet work with temporally unaligned Radar/Lidar data, so that fusion can take place whenever a new Lidar data frame arrives, instead of waiting for the slow Radar data frame. However, directly applying MVDNet to temporally unaligned Radar/Lidar data greatly degrades its object detection accuracy. The key insight of this paper is that we can achieve high output frequency with little accuracy loss by enhancing the training procedure to exploit the temporal redundancy in MVDNet’s fusion procedure, so that it can tolerate the temporal unalignment of the input data. We explore several different ways of training enhancement and compare them quantitatively with experiments.
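The training-side idea can be sketched in a few lines: pair each Lidar frame with a (possibly further aged) earlier Radar frame so the fusion model sees at training time the misalignment it will face at inference time. The frame rates and pairing rule below are illustrative assumptions, not the paper's exact procedure.

```python
import random

LIDAR_HZ, RADAR_HZ = 20, 10   # surround Lidar spins faster than surround Radar

def unaligned_pairs(lidar_frames, radar_frames):
    pairs = []
    for i, lidar in enumerate(lidar_frames):
        t_lidar = i / LIDAR_HZ
        # Latest Radar frame at or before this Lidar timestamp.
        j = min(int(t_lidar * RADAR_HZ), len(radar_frames) - 1)
        # Optionally age the Radar frame further to widen the gap seen in training.
        j = max(0, j - random.choice([0, 1]))
        pairs.append((lidar, radar_frames[j]))
    return pairs

print(unaligned_pairs([f"L{i}" for i in range(6)], [f"R{i}" for i in range(3)]))
```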
2023
- [IPSN’23] CoEdge: A Cooperative Edge System for Distributed Real-Time Deep Learning Tasks. Zhehao Jiang (co-primary), Neiwen Ling (co-primary), Xuan Huang, Shuyao Shi, Chenhao Wu, Xiaoguang Zhao, Zhenyu Yan, and Guoliang Xing. The 22nd ACM/IEEE Conference on Information Processing in Sensor Networks (ACM/IEEE IPSN 2023).
Recent years have witnessed the emergence of a new class of cooperative edge systems in which a large number of edge nodes can collaborate through local peer-to-peer connectivity. In this paper, we propose CoEdge, a novel cooperative edge system that can support concurrent data/compute-intensive deep learning (DL) models for distributed real-time applications such as city-scale traffic monitoring and autonomous driving. First, CoEdge includes a hierarchical DL task scheduling framework that dispatches DL tasks to edge nodes based on their computational profiles, communication overhead, and real-time requirements. Second, CoEdge can dramatically increase the execution efficiency of DL models by batching sensor data and aggregating the inferences of the same model. Finally, we propose a new edge containerization approach that enables an edge node to execute concurrent DL tasks by partitioning the CPU and GPU workloads into different containers. We extensively evaluate CoEdge on a self-deployed smart lamppost testbed on a university campus. Our results show that CoEdge can achieve up to an 82.32% reduction in deadline missing rate compared to baselines.
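The batching component can be illustrated with a small Python sketch: pending requests that target the same DL model are merged into one batched inference call. The model names and infer() stub are illustrative, not CoEdge's API.

```python
from collections import defaultdict

def infer(model: str, batch: list):
    # Stands in for a single batched GPU inference call.
    return [f"{model}({x})" for x in batch]

def drain(queue):
    by_model = defaultdict(list)
    for model, data in queue:                 # group pending requests per model
        by_model[model].append(data)
    return {model: infer(model, batch)        # one batched call per model
            for model, batch in by_model.items()}

print(drain([("yolo", "cam0"), ("yolo", "cam1"), ("pose", "cam0")]))
```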
- [SenSys’23 Poster] Unifying On-device Tensor Program Optimization through Large Foundation Model. Zhihe Zhao, Neiwen Ling, Kaiwei Liu, Nan Guan, and Guoliang Xing. The 21st ACM Conference on Embedded Networked Sensor Systems (ACM SenSys 2023).
We present TensorBind, a novel approach aimed at unifying different hardware architectures for compilation optimization. Our proposed framework establishes an embedding space to seamlessly bind diverse hardware platforms together. By leveraging this unified representation, TensorBind enables efficient tensor program optimization techniques across a wide range of hardware platforms. We provide experimental results demonstrating the essentiality and adaptability of TensorBind in translating tensor program optimization records across multiple hardware architectures, thus revolutionizing compilation optimization strategies and facilitating the development of high-performance compilation systems over heterogeneous devices.
- [SenSys’23] Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU. Zhihe Zhao, Neiwen Ling, Nan Guan, and Guoliang Xing. The 21st ACM Conference on Embedded Networked Sensor Systems (ACM SenSys 2023).
Many applications, such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNN) that pose different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack hardware-level resource management mechanisms for avoiding resource contention. Therefore, we propose Miriam, a contention-aware task coordination framework for multi-DNN inference on edge GPUs. Miriam consolidates two main components, an elastic-kernel generator and a runtime dynamic kernel coordinator, to support mixed-critical DNN inference. To evaluate Miriam, we build a new DNN inference benchmark based on CUDA with diverse representative DNN workloads. Experiments on two edge GPU platforms show that Miriam can increase system throughput by 92% while incurring less than 10% latency overhead for critical tasks, compared to state-of-the-art baselines.
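A toy coordinator conveys the elastic-kernel intuition: a best-effort kernel launches with a tunable number of thread blocks, and the runtime shrinks it when a critical kernel needs the GPU, rather than letting the two contend. The SM counts and API below are invented; the real system operates on generated CUDA kernels.

```python
TOTAL_SMS = 8   # streaming multiprocessors on a hypothetical edge GPU

class ElasticKernel:
    def __init__(self, name: str, max_blocks: int):
        self.name, self.max_blocks = name, max_blocks
    def launch(self, blocks: int):
        print(f"{self.name}: launched with {blocks}/{self.max_blocks} blocks")

def coordinate(best_effort: ElasticKernel, critical_load: int):
    # Give the critical task the SMs it needs; squeeze the elastic kernel
    # into whatever is left over instead of letting the kernels contend.
    leftover = max(0, TOTAL_SMS - critical_load)
    best_effort.launch(min(best_effort.max_blocks, leftover))

background = ElasticKernel("segmentation", max_blocks=8)
coordinate(background, critical_load=0)   # idle GPU: run at full width
coordinate(background, critical_load=6)   # critical task present: shrink
```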
- [SenSys’23] EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge. Bufang Yang, Lixing He, Neiwen Ling, Zhenyu Yan, Guoliang Xing, Xian Shuai, Ren Xiaozhe, and Xin Jiang. The 21st ACM Conference on Embedded Networked Sensor Systems (ACM SenSys 2023).
Deep Learning (DL) models have been widely deployed on IoT devices with the help of advancements in DL algorithms and chips. However, the limited resources of edge devices make these on-device DL models hard to generalize to diverse environments and tasks. Although the recently emerged foundation models (FMs) show impressive generalization power, how to effectively leverage the rich knowledge of FMs on resource-limited edge devices remains unexplored. In this paper, we propose EdgeFM, a novel edge-cloud cooperative system with open-set recognition capability. EdgeFM selectively uploads unlabeled data to query the FM on the cloud and customizes the specific knowledge and architectures for edge models. Meanwhile, EdgeFM conducts dynamic model switching at run-time, taking into account both data uncertainty and dynamic network variations, which ensures the accuracy always stays close to that of the original FM. We implement EdgeFM using two FMs on two edge platforms. We evaluate EdgeFM on three public datasets and two self-collected datasets. Results show that EdgeFM can reduce the end-to-end latency by up to 3.2x and achieve a 34.3% accuracy increase compared with the baseline.
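The routing logic can be sketched as a confidence threshold that adapts to network conditions: confident samples are served by the small on-device model, and the rest query the cloud FM. The models, threshold rule, and numbers below are illustrative stand-ins, not EdgeFM's actual policy.

```python
import random

def edge_model(x):
    return ("cat", random.uniform(0.3, 1.0))   # (label, confidence) stub

def cloud_fm(x):
    return ("cat", 0.99)                       # expensive but accurate stub

def classify(x, net_rtt_s: float, base_threshold: float = 0.8):
    # Under bad network conditions, tolerate lower edge confidence rather
    # than paying a long round trip to the cloud foundation model.
    threshold = base_threshold - 0.2 * min(net_rtt_s, 1.0)
    label, conf = edge_model(x)
    if conf >= threshold:
        return label, "edge"
    return cloud_fm(x)[0], "cloud"

for rtt in (0.05, 0.8):
    print(classify("frame", net_rtt_s=rtt))
```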
- [MobiSys’23] Harmony: Heterogeneous Multi-Modal Federated Learning through Disentangled Model Training. Xiaomin Ouyang, Zhiyuan Xie, Heming Fu, Li Pan, Sitong Chen, Neiwen Ling, Guoliang Xing, Jiayu Zhou, and 1 more author. The 21st ACM International Conference on Mobile Systems, Applications, and Services (ACM MobiSys 2023).
Multi-modal sensing systems are increasingly prevalent in real-world applications such as health monitoring and autonomous driving. Most multi-modal learning approaches need to access users’ raw data, which poses significant concerns for users’ privacy. Federated learning (FL) provides a privacy-aware distributed learning framework. However, current FL approaches have not addressed the unique challenges of heterogeneous multi-modal FL systems, such as modality heterogeneity and significantly longer training delay. In this paper, we propose Harmony, a new system for heterogeneous multi-modal federated learning. Harmony disentangles multi-modal network training in a novel two-stage framework, namely modality-wise federated learning and federated fusion learning. By integrating a novel balance-aware resource allocation mechanism in modality-wise FL and exploiting modality biases in federated fusion learning, Harmony improves model accuracy under non-i.i.d. data distributions and speeds up system convergence. We implemented Harmony on a real-world multi-modal sensor testbed deployed in the homes of 16 elderly subjects for Alzheimer’s Disease monitoring. Our evaluation on the testbed and three large-scale public datasets of different applications shows that Harmony outperforms state-of-the-art baselines by up to 46.35% in accuracy and reduces training delay by up to 30%.
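The two-stage split can be shown schematically in Python, with plain stand-ins for client training and FedAvg; the stage boundary follows the abstract, while the clients, stubs, and averaging below are purely illustrative.

```python
def fedavg(updates):
    return sum(updates) / len(updates)          # stand-in for weight averaging

def train_encoder(client, modality):
    return hash((client, modality)) % 100       # stub client update

def train_fusion(client, encoder_names):
    return hash((client, tuple(encoder_names))) % 100

clients = {"c1": ["audio"], "c2": ["audio", "depth"], "c3": ["depth"]}

# Stage 1: modality-wise FL; clients join only rounds for modalities they own.
encoders = {}
for m in ("audio", "depth"):
    updates = [train_encoder(c, m) for c, mods in clients.items() if m in mods]
    encoders[m] = fedavg(updates)

# Stage 2: federated fusion learning on multi-modal clients, encoders fixed.
fusion_updates = [train_fusion(c, sorted(encoders))
                  for c, mods in clients.items() if len(mods) > 1]
print("fusion head:", fedavg(fusion_updates))
```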
- [HotMobile’23] Moses: Exploiting Cross-device Transferable Features for On-device Tensor Program Optimization. Zhihe Zhao, Xian Shuai, Neiwen Ling, Nan Guan, Zhenyu Yan, and Guoliang Xing. The 24th International Workshop on Mobile Computing Systems and Applications (ACM HotMobile 2023).
Achieving efficient execution of machine learning models on mobile/edge devices has attracted significant attention recently. A key challenge is to efficiently generate high-performance tensor programs for each operator inside a DNN model. To this end, deep learning compilers have adopted auto-tuning approaches such as Ansor. However, it is challenging to optimize tensor codes for mobile/edge devices by auto-tuning due to limited time budgets and on-device resources. A key component of DNN compilers is the cost model, which predicts the performance of each configuration on specific devices. However, the current design of cost models cannot provide transferable features among different hardware accelerators efficiently and effectively. In this paper, we propose Moses, a simple yet efficient design based on the lottery ticket hypothesis, which fully exploits the hardware-agnostic features transferable to the target device via domain adaptation, optimizing the time-consuming auto-tuning process of DNN compiling on a new hardware platform. Compared with state-of-the-art approaches, Moses achieves up to a 1.53x efficiency gain in the search stage and a 1.41x inference speedup on challenging DNN benchmarks.
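A minimal sketch of the lottery-ticket intuition as applied here: keep the high-magnitude weights of a cost model trained on a source device as the transferable core, and re-learn only the rest on the target device. Magnitude pruning below stands in for the paper's actual transferability criterion.

```python
source_weights = [0.9, -0.05, 0.7, 0.01, -0.8, 0.02]   # toy trained cost model

def winning_ticket(weights, keep_ratio=0.5):
    k = int(len(weights) * keep_ratio)
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    return set(order[:k])            # indices of the transferable core

transferable = winning_ticket(source_weights)
target_init = [w if i in transferable else 0.0         # re-learned on target
               for i, w in enumerate(source_weights)]
print("frozen indices:", sorted(transferable))
print("target init:   ", target_init)
```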
2022
- [SenSys’22] BlastNet: Exploiting Duo-Blocks for Cross-Processor Real-Time DNN Inference. Neiwen Ling, Xuan Huang, Zhihe Zhao, Nan Guan, Zhenyu Yan, and Guoliang Xing. The 20th ACM Conference on Embedded Networked Sensor Systems (ACM SenSys 2022).
Best Paper Finalist
In recent years, Deep Neural Network (DNN) has been increasingly adopted by a wide range of time-critical applications running on edge platforms with heterogeneous multiprocessors. To meet the stringent timing requirements of these applications, heterogeneous CPU and GPU resources must be efficiently utilized for the inference of multiple DNN models. Such a cross-processor real-time DNN inference paradigm poses major challenges due to the inherent performance imbalance among different processors and the lack of real-time support for cross-processor inference from existing deep learning frameworks. In this work, we propose a new system named BlastNet that exploits duo-block, a new model inference abstraction, to support highly efficient cross-processor real-time DNN inference. Each duo-block has a dual model structure, enabling efficient fine-grained inference alternately across different processors. BlastNet employs a novel block-level Neural Architecture Search (NAS) technique to generate duo-blocks, which accounts for computing characteristics and communication overhead. The duo-blocks are optimized at design time and then dynamically scheduled to achieve high resource utilization of heterogeneous CPU and GPU at runtime. BlastNet is implemented on an indoor autonomous driving platform and three popular edge platforms. Extensive results show that BlastNet achieves a 35.07% lower deadline missing rate with a mere 1.63% model accuracy loss.
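The duo-block abstraction can be illustrated with a toy list scheduler: every block exists in a CPU variant and a GPU variant, and two concurrent inferences make progress on whichever processor yields the earliest finish. The latencies below are invented; the real duo-blocks come from the block-level NAS.

```python
chains = {                    # per-model block latencies: (cpu_ms, gpu_ms)
    "detector": [(2.0, 1.5), (2.0, 1.5)],
    "tracker":  [(1.8, 1.0), (1.8, 1.0)],
}
proc_free = {"cpu": 0.0, "gpu": 0.0}
ready = {m: 0.0 for m in chains}        # when each model's next block may start
progress = {m: 0 for m in chains}

while any(progress[m] < len(c) for m, c in chains.items()):
    options = []
    for m, c in chains.items():
        if progress[m] < len(c):
            cpu_ms, gpu_ms = c[progress[m]]
            options.append((max(ready[m], proc_free["cpu"]) + cpu_ms, m, "cpu"))
            options.append((max(ready[m], proc_free["gpu"]) + gpu_ms, m, "gpu"))
    finish, m, proc = min(options)      # earliest-finish (model, processor) pair
    proc_free[proc] = ready[m] = finish
    progress[m] += 1
    print(f"{m} block {progress[m]} on {proc.upper()}, done at {finish} ms")
```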
- [SenSys’22 Poster] Aaron: Compile-time Kernel Adaptation for Multi-DNN Inference Acceleration on Edge GPU. Zhihe Zhao, Neiwen Ling, Nan Guan, and Guoliang Xing. The 20th ACM Conference on Embedded Networked Sensor Systems (ACM SenSys 2022).
Best Poster Award
AI applications powered by deep learning are increasingly running on edge devices. Meanwhile, many real-world IoT applications demand multiple real-time tasks to run on the same device, for example, to achieve both object tracking and image segmentation simultaneously on augmented reality glasses. However, current solutions cannot yet support such multi-tenant real-time DNN inference on edge devices. Techniques such as on-device model compression trade inference accuracy for speed, while traditional DNN compilers mainly focus on single-tenant DNN model optimization. To fill this gap, we propose Aaron, which leverages DNN compiling techniques to accelerate multi-DNN inference on edge GPUs based on compile-time kernel adaptation with no accuracy loss. Aaron integrates both DNN graph and kernel optimization to maximize on-device parallelism and minimize contention brought by concurrent inference.
- [SenSys’22 Workshop] Dataset: An Indoor Smart Traffic Dataset and Data Collection System. Neiwen Ling (co-primary), Yuze He (co-primary), Nan Guan, Heming Fu, and Guoliang Xing. The 5th International SenSys/BuildSys Workshop on Data.
Smart traffic is an emerging research area gaining increased attention due to a class of emerging applications such as autonomous driving. Most smart traffic scenarios are outdoors, where it is hard to collect traffic data and build the demanding sensing systems. In this work, an indoor smart traffic testbed with an F1TENTH autonomous driving vehicle is built, allowing the collection of traffic datasets under different scenarios and the performance of various smart traffic tasks. This novel data collection system and collected dataset can help research teams build various smart traffic systems and evaluate them on indoor smart traffic datasets. The collected traffic light dataset is publicly available.
- [arXiv] Moses: Efficient Exploitation of Cross-device Transferable Features for Tensor Program Optimization. Zhihe Zhao, Xian Shuai, Yang Bai, Neiwen Ling, Nan Guan, Zhenyu Yan, and Guoliang Xing. arXiv preprint arXiv:2201.05752.
Achieving efficient execution of machine learning models has attracted significant attention recently. To generate tensor programs efficiently, a key component of DNN compilers is the cost model that can predict the performance of each configuration on specific devices. However, due to the rapid emergence of hardware platforms, it is increasingly labor-intensive to train domain-specific predictors for every new platform. Besides, current design of cost models cannot provide transferable features between different hardware accelerators efficiently and effectively. In this paper, we propose Moses, a simple and efficient design based on the lottery ticket hypothesis, which fully takes advantage of the features transferable to the target device via domain adaptation. Compared with state-of-the-art approaches, Moses achieves up to 1.53X efficiency gain in the search stage and 1.41X inference speedup on challenging DNN benchmarks.
2021
- [SenSys’21] RT-mDL: Supporting Real-Time Mixed Deep Learning Tasks on Edge Platforms. Neiwen Ling, Kai Wang, Yuze He, Guoliang Xing, and Daqi Xie. The 19th ACM Conference on Embedded Networked Sensor Systems (ACM SenSys 2021).
Recent years have witnessed an emerging class of real-time applications, e.g., autonomous driving, in which resource-constrained edge platforms need to execute a set of real-time mixed Deep Learning (DL) tasks concurrently. Such an application paradigm poses major challenges due to the huge compute workload of deep neural network models, diverse performance requirements of different tasks, and the lack of real-time support from existing DL frameworks. In this paper, we present RT-mDL, a novel framework to support mixed real-time DL tasks on edge platforms with heterogeneous CPU and GPU resources. RT-mDL aims to optimize mixed DL task execution to meet diverse real-time/accuracy requirements by exploiting the unique compute characteristics of DL tasks. RT-mDL employs a novel storage-bounded model scaling method to generate a series of model variants, and systematically optimizes DL task execution by joint model variant selection and task priority assignment. To improve the CPU/GPU utilization of mixed DL tasks, RT-mDL also includes a new priority-based scheduler which employs a GPU packing mechanism and executes the CPU/GPU tasks independently. Our implementation on an F1/10 autonomous driving testbed shows that RT-mDL can enable multiple concurrent DL tasks to achieve satisfactory real-time performance in traffic light detection and sign recognition. Moreover, compared to state-of-the-art baselines, RT-mDL can reduce the deadline missing rate by 40.12% while only sacrificing 1.7% model accuracy.
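The variant-selection step admits a compact sketch: each task owns a ladder of scaled model variants trading accuracy for latency, and the most accurate variant that still fits the task's period is chosen. The numbers below are invented, and the real RT-mDL solves this jointly with priority assignment and GPU packing.

```python
variants = {   # (latency_ms, accuracy) per variant, fastest first
    "traffic_light": [(12, 0.88), (20, 0.93), (35, 0.96)],
    "sign_recog":    [(8, 0.85), (15, 0.91), (30, 0.95)],
}
periods_ms = {"traffic_light": 25, "sign_recog": 16}

def select(variants, periods):
    plan = {}
    for task, options in variants.items():
        feasible = [v for v in options if v[0] <= periods[task]]
        plan[task] = max(feasible, key=lambda v: v[1])   # most accurate that fits
    return plan

print(select(variants, periods_ms))   # -> (20, 0.93) and (15, 0.91)
```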
- [IoTDI’21] EdgeML: An AutoML Framework for Real-time Deep Learning on the Edge. Zhihe Zhao, Kai Wang, Neiwen Ling, and Guoliang Xing. The 6th International Conference on Internet-of-Things Design and Implementation (ACM/IEEE IoTDI 2021).
In recent years, deep learning algorithms have been increasingly adopted by a wide range of data-intensive and time-critical Internet of Things (IoT) applications. As a result, several new approaches, including model partition/offloading and progressive neural architecture, have been proposed to address the challenge of deploying computation-intensive deep neural network (DNN) models on resource-constrained edge devices. However, the performance of existing approaches is highly affected by runtime dynamics. For example, offloading workload from edge to cloud suffers from communication delays, and the efficiency of a progressive neural architecture supporting early-exit DNN executions relies on input characteristics. In this paper, we introduce EdgeML, an AutoML framework that provides flexible and fine-grained DNN model execution control by combining a workload offloading mechanism with a dynamic progressive neural architecture. To achieve desirable latency-accuracy-energy system performance on edge platforms, EdgeML adopts reinforcement learning to automatically update the model execution policy in response to runtime dynamics in real time. We implement EdgeML for several widely used DNN models on the latest edge devices. Compared to existing approaches, our experiments show that EdgeML achieves up to an 8x performance improvement under dynamic environments.
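A bandit-style toy captures the control loop: the policy picks an execution plan (early-exit depth crossed with an offload choice) and updates its value estimates from the observed reward. The action space and reward model below are illustrative; the paper uses a full reinforcement learning formulation over latency, accuracy, and energy.

```python
import random

actions = [("exit1", "local"), ("exit2", "local"), ("full", "offload")]
q = {a: 0.0 for a in actions}   # running value estimate per plan
n = {a: 0 for a in actions}

def reward(action):
    # Stands in for measured accuracy minus latency/energy cost.
    base = {"exit1": 0.6, "exit2": 0.75, "full": 0.9}[action[0]]
    cost = 0.3 if action[1] == "offload" else 0.1
    return base - cost + random.gauss(0, 0.05)

for _ in range(200):            # epsilon-greedy online adaptation
    a = random.choice(actions) if random.random() < 0.1 else max(q, key=q.get)
    n[a] += 1
    q[a] += (reward(a) - q[a]) / n[a]

print("best plan:", max(q, key=q.get))
```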
2018
- [SenSys’18 Demo] ECRT: An Edge Computing System for Real-time Image-based Object Tracking. Zhihe Zhao, Zhehao Jiang, Neiwen Ling, Xian Shuai, and Guoliang Xing. The 16th ACM Conference on Embedded Networked Sensor Systems (ACM SenSys 2018).
Real-time image-based object tracking from live video is of great importance for several smart city applications like surveillance, intelligent traffic management, and autonomous driving. Although recent deep learning systems can achieve satisfactory tracking performance, they incur significant compute overhead, which prevents their wide adoption on resource-constrained IoT platforms. In this demonstration, we present an Edge Computing system for Real-time object Tracking (ECRT) for resource-constrained devices. The key feature of our system is that it intelligently partitions compute-intensive tasks, such as inferencing a convolutional neural network (CNN), into two parts, which are executed locally on an IoT device and/or on the edge server. Moreover, ECRT can minimize the power consumption of IoT devices while taking into consideration the dynamic network environment and user requirements on end-to-end delay.
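The partitioning decision can be sketched as a one-dimensional search: run the first `cut` CNN layers on the device, ship the intermediate tensor to the edge server, and choose the cut that minimizes device energy while the whole pipeline meets the delay budget. All per-layer numbers below are invented for illustration.

```python
layers = [   # (device_ms, device_mJ, output_KB) per CNN layer
    (5, 10, 800), (8, 16, 400), (12, 24, 100), (6, 12, 50),
]
INPUT_KB, KB_PER_MS, SERVER_SPEEDUP, BUDGET_MS = 3000, 50, 4.0, 40

def best_cut():
    feasible = []
    for cut in range(len(layers) + 1):          # layers[:cut] run on the device
        dev_ms = sum(l[0] for l in layers[:cut])
        dev_mj = sum(l[1] for l in layers[:cut])
        tx_ms = (layers[cut - 1][2] if cut else INPUT_KB) / KB_PER_MS
        srv_ms = sum(l[0] for l in layers[cut:]) / SERVER_SPEEDUP
        if dev_ms + tx_ms + srv_ms <= BUDGET_MS:   # meets end-to-end delay
            feasible.append((dev_mj, cut))
    return min(feasible) if feasible else None     # least device energy

print(best_cut())   # -> (10, 1): run one layer locally, offload the rest
```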