Author Institution:
Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
Abstract:
Emerging multi-core platforms provide a high-performance, easy-to-develop, and flexible way to implement high-speed network applications. For example, multi-core solutions for layer 7 protocol identification achieve processing speeds of 2 Gbps or higher. However, they occupy most of the processing cores in the system, leaving limited headroom for more complex processing such as intrusion detection/prevention, anti-malware, and data loss prevention. Based on a deep understanding of the application bottlenecks and of optimization techniques for x86-64 platforms, we achieve the same or higher speed using only one core, freeing resources for further processing. In this paper, we perform a deep system-wide profile and analyze the major hotspots of a typical network system on an Intel x86-64 platform, comprising a complete TCP/IP stack and a protocol identification engine based on deep packet inspection (DPI). The profiling results and analysis show that network applications spanning layer 2 to layer 7 processing are inherently memory- and computation-intensive. We then propose optimization guidelines and techniques: 1) removing memory bottlenecks: combining independent irregular memory accesses to hide latency, using software-assisted cache prefetching to improve the cache hit ratio, and mapping memory from kernel space to user space to reduce access overhead; 2) removing computation bottlenecks: using 64-bit registers and instructions to speed up common computations, and using Streaming SIMD Extensions (SSE) instructions to accelerate particular time-consuming tasks. Compared with state-of-the-art multi-core implementations, our implementation on the x86-64 platform, using only one core, delivers the same or higher processing speed: 7 Gbps with an average packet size of 501 bytes and 2 Gbps with an average packet size of 110 bytes.
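The memory-side guideline (grouping independent irregular accesses and issuing software prefetches) can be illustrated with a minimal sketch. This is not the authors' implementation. The flow table, its size, the hash constant, the batch size, and the names `flow_insert` and `flow_lookup_batch` are all hypothetical. The sketch also assumes a GCC- or Clang-compatible compiler, since `__builtin_prefetch` is a compiler builtin and not standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024u  /* hypothetical table size, must be a power of two */
#define BATCH_MAX  16u    /* hypothetical number of lookups grouped together */

typedef struct {
    uint64_t key;    /* e.g. a flow identifier; 0 marks an empty slot */
    uint32_t value;  /* e.g. a protocol label; 0 is reserved for "not found" */
} flow_entry;

static flow_entry flow_table[TABLE_SIZE];

/* Multiplicative hash onto a slot index (the constant is illustrative). */
static inline uint32_t slot_of(uint64_t key)
{
    return (uint32_t)((key * 0x9E3779B97F4A7C15ull) >> 32) & (TABLE_SIZE - 1u);
}

/* Insert with linear probing. Keys must be nonzero.
 * Returns 0 on success and -1 if the table is full. */
int flow_insert(uint64_t key, uint32_t value)
{
    uint32_t s = slot_of(key);
    for (uint32_t p = 0; p < TABLE_SIZE; p++) {
        flow_entry *e = &flow_table[(s + p) & (TABLE_SIZE - 1u)];
        if (e->key == 0 || e->key == key) {
            e->key = key;
            e->value = value;
            return 0;
        }
    }
    return -1;
}

/* Look up n keys (n <= BATCH_MAX). out[i] receives the stored value,
 * or 0 when keys[i] is absent. Returns the number of hits. */
size_t flow_lookup_batch(const uint64_t *keys, uint32_t *out, size_t n)
{
    uint32_t slots[BATCH_MAX];
    size_t hits = 0;

    assert(n <= BATCH_MAX);

    /* Pass 1: compute every slot and request its cache line up front, so
     * the independent memory accesses overlap instead of stalling in turn. */
    for (size_t i = 0; i < n; i++) {
        slots[i] = slot_of(keys[i]);
        __builtin_prefetch(&flow_table[slots[i]], 0 /* read */, 3 /* keep */);
    }

    /* Pass 2: probe the table; the first slot of each probe sequence has
     * already been requested by pass 1. */
    for (size_t i = 0; i < n; i++) {
        out[i] = 0;
        for (uint32_t p = 0; p < TABLE_SIZE; p++) {
            const flow_entry *e =
                &flow_table[(slots[i] + p) & (TABLE_SIZE - 1u)];
            if (e->key == 0)
                break;              /* empty slot: key is absent */
            if (e->key == keys[i]) {
                out[i] = e->value;
                hits++;
                break;
            }
        }
    }
    return hits;
}
```

Splitting the work into two passes is the point of the sketch: a loop that hashes and dereferences one key at a time serializes the cache misses, whereas issuing all prefetches first gives the memory system several outstanding requests to service in parallel. Whether this helps in practice depends on table size, batch size, and the cache hierarchy, so any gain would have to be measured.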