
CN116700901A - Container construction and operation system and method based on microkernel - Google Patents

Container construction and operation system and method based on microkernel

Info

Publication number
CN116700901A
CN116700901A (application CN202310746321.5A)
Authority
CN
China
Prior art keywords
container
resources
namespace
request
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310746321.5A
Other languages
Chinese (zh)
Inventor
糜泽羽
岑少锋
陈海波
臧斌宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiao Tong University
Original Assignee
Shanghai Jiao Tong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiao Tong University filed Critical Shanghai Jiao Tong University
Priority to CN202310746321.5A
Publication of CN116700901A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/45562 Creating, deleting, cloning virtual machine instances
    • G06F 2009/45583 Memory management, e.g. access or allocation
    • G06F 2009/45587 Isolation or security of virtual machine instances
    • G06F 2009/45595 Network integration; Enabling network access in virtual machine instances
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a microkernel-based container construction and operation system and method, comprising a namespace functional module, a control group functional module, and a fault recovery functional module. The namespace functional module partitions static system resources: mount point data, network protocol stack data, and process management data are divided so that different application containers reside in different namespaces and access different system resources, achieving isolation of system resources. The control group functional module partitions dynamic system resources: CPU resources, memory resources, and I/O bandwidth resources are accounted for and limited, so that different application containers reside in different control groups and use only the resources allotted to them, achieving limitation of system resources. The fault recovery functional module handles crashes of system services caused by memory errors. On the basis of stronger isolation and security, the present invention also achieves improved performance.

Description

Container construction and operation system and method based on microkernel

Technical Field

The present invention relates to the technical field of virtualization, and in particular to a microkernel-based container construction and operation system and method.

Background Art

Containers provide an isolated environment for the execution of operating systems and applications. As their use has become increasingly common in recent years, the importance of container security has risen accordingly. Isolation is an essential part of container security: with a strongly isolated container design, the failure of one container does not affect other containers or the operating system. Isolation is also key to container reliability and scalability. However, as operating systems continue to evolve, traditional container isolation mechanisms face new problems and challenges. For example, the code size of operating systems and software keeps growing, so applications contain large numbers of undiscovered vulnerabilities that malicious containers can exploit: a compromised container may steal data from other containers running on the same kernel, or appropriate resources that belong to them. To address these problems, researchers in industry and academia have proposed a series of techniques to strengthen container isolation, including adding isolation layers based on new software architectures, hardening the isolation of traditional operating systems, using new hardware to enhance isolation, and running containers on operating systems with strong isolation.

When applications run on a traditional Linux operating system, the process abstraction itself provides only weak isolation: different processes share the network, CPU, memory, disk, and other resources, so a malicious process can easily attack the data flow of other processes and steal their data. Processes with high security requirements therefore need additional protection mechanisms to isolate themselves from other processes. Linux provides the namespace and control group (cgroups) mechanisms to help processes achieve isolation. The namespace mechanism offers a resource isolation scheme: each namespace can have its own processes, inter-process communication, network, files, and other system resources, which are invisible to the other namespaces. By placing different processes in different namespaces, resources are isolated between processes, so that a malicious process cannot directly steal data from other processes. The control group mechanism limits the amount of network, CPU, memory, disk, and other resources each process may use, preventing one process from consuming so many system resources that it degrades the performance of other processes.

Containers are built on top of the namespace and control group mechanisms and provide an isolated execution environment for the processes running inside them. LXC is a typical container. LXC, short for Linux Container, is the container natively supported by Linux. LXC treats each namespace as a container, and each container has its own process view, inter-process communication, file system, and network view. Because namespaces are used for isolation, containers cannot access each other's data or affect the control flow executed in other containers. Its architecture is shown in Figure 14. In LXC, each container runs an operating system and the corresponding applications. To prevent one container from taking up so many system resources that it affects the performance of other containers and the kernel, LXC uses the control group mechanism to cap the system resources available to the process group running in each namespace, that is, to limit the amount of network, CPU, memory, disk, and other resources each container can occupy.

Although containers use the namespace and control group mechanisms to strengthen their isolation, two major isolation problems remain: security isolation and performance isolation.

Security isolation concerns whether one container can access the data of other containers or compromise their security. Containers are more secure than ordinary processes, but they still suffer from security isolation problems. Although containers are separated by the namespace and control group mechanisms, they are still processes running on the same operating system: to work correctly, a container must trust the entire operating system and treat its code as the container's trusted code base (TCB). Meanwhile, operating system code is growing rapidly; the Linux kernel grew from 150,000 lines of code in v1.0 to more than 20 million lines in v4.15, more than a hundredfold growth of the TCB, so Linux inevitably contains a large number of latent vulnerabilities. If a container triggers a vulnerability in the operating system, it may gain the privileges to access or attack other containers on that operating system, or even the operating system itself. For example, the published shocker attack on the Docker platform exploited a design flaw in a file-handle-related system call to let a container access arbitrary files on the operating system, including files belonging to other containers, breaking the containers' security isolation.

Performance isolation concerns whether one container can affect the performance of other containers. In LXC and Docker, a container can by default occupy all CPU resources. If the deployer does not manually cap the CPU resources each container may use, a CPU-intensive program running in one container will degrade the performance of other containers. When the CPU is heavily occupied, the transaction throughput of a database running in LXC drops by about 10%. In addition, because the operating system harbors many latent vulnerabilities, a container can not only exploit them to attack the security of other containers or of the operating system, but can also exploit them to make the operating system allocate more resources to itself, encroaching on resources that should belong to other containers or to the operating system and hurting the performance of other containers. Measurements show that when disk I/O resources are saturated, the efficiency of a database running in LXC drops by about 15%; when memory resources are exhausted, its efficiency drops by nearly 30%.

It can be seen that although traditional containers strengthen process isolation, they still suffer from security isolation and performance isolation problems, and there is considerable room for improvement.

Summary of the Invention

Aiming at the defects in the prior art, the present invention provides a microkernel-based container construction and operation system and method.

According to the microkernel-based container construction and operation system and method provided by the present invention, the scheme is as follows:

In a first aspect, a microkernel-based container construction and operation system is provided, the system comprising: a namespace functional module, a control group functional module, and a fault recovery functional module;

The namespace functional module is used to partition static system resources: mount point data, network protocol stack data, and process management data are divided, so that different application containers reside in different namespaces and access different system resources, achieving isolation of system resources;

The control group functional module is used to partition dynamic system resources: CPU resources, memory resources, and I/O bandwidth resources are accounted for and limited, so that different application containers reside in different control groups and use only their allotted resources, achieving limitation of system resources;

The fault recovery functional module is used to handle crashes of system services caused by memory errors;

The system further comprises system containers, which strengthen resource accounting and fault recovery for system service processes.

Preferably, the namespace functional module covers the creation, joining, exiting, destruction, and use of the file system mount point namespace;

The steps for creating a file system mount point namespace are as follows:

Step 1): the creation request is initiated; the container manager sends a request to the system container via inter-process communication to create a namespace for file system mount points;

Step 2): obtain the working path sent by the container manager and check that the path length is less than 255 characters;

Step 3): obtain the mount point information of the working path in the original state, looking it up by the working path in the original state's mount point information;

Step 4): send a request to the corresponding file system to create its namespace; the detailed flow is described in the file system's namespace creation;

Step 5): create a new mount point information linked list;

Step 6): initialize the new mount point information linked list, adding the mount point information obtained in step 3) to the list as the mount point information of the path;

Step 7): obtain an available namespace ID from the array;

Step 8): add the new mount point information linked list to the corresponding array element of the system container;

Step 9): the inter-process communication returns, completing the creation of the new namespace.

Preferably, the steps for joining a file system mount point namespace are as follows:

Step 1): a join-namespace request is initiated; the container manager passes the ID of the mount point system container namespace and the system service type into the kernel;

Step 2): obtain the mount point system container namespace ID and the system service type passed in by the container manager;

Step 3): fill the mount point system container namespace ID passed in by the container manager into the mount point system container namespace slot of the application container process;

Step 4): the join-namespace request is completed, and control returns from kernel mode to user mode.

Preferably, the steps for exiting a file system mount point namespace are as follows:

Step 1): an exit-namespace request is initiated; the container manager passes the system service type of the mount point system container into the kernel;

Step 2): obtain the system service type passed in by the container manager, i.e. the mount point system container's system service type;

Step 3): clear the namespace data corresponding to the mount point system container in the kernel;

Step 4): the exit-namespace request is completed, and control returns from kernel mode to user mode.

Preferably, the steps for destroying a file system mount point namespace are as follows:

Step 1): a destruction request is initiated; the container manager sends a request to the system container via inter-process communication to destroy the namespace of the file system mount points;

Step 2): obtain the namespace ID sent by the container manager;

Step 3): check whether the namespace ID is 0; if it is 0, it denotes the root namespace and cannot be destroyed;

Step 4): check whether the namespace corresponding to the namespace ID exists; if it does not, it cannot be destroyed;

Step 5): obtain the mount point information of the path;

Step 6): send a request to the corresponding file system to destroy its namespace;

Step 7): clear the current mount point information linked list, which includes freeing its memory and setting the pointer to 0;

Step 8): clear the corresponding namespace element in the system container's array;

Step 9): the inter-process communication request returns, completing the destruction of the namespace.

Preferably, the steps for using a file system mount point namespace are as follows:

Step 1): the application container initiates a request and sends an inter-process communication request to the mount point system service;

Step 2): obtain the namespace ID of the current application container in kernel mode;

Step 3): obtain the corresponding system container from the structure associated with the inter-process communication request;

Step 4): switch from the current process to the system container process and pass along the namespace ID;

Step 5): obtain the namespace ID passed by the kernel;

Step 6): based on the namespace ID, find the mount point linked list of the corresponding namespace in the array and switch to it;

Step 7): execute the specific request of the application container;

Step 8): once the request has been processed, return to kernel mode;

Step 9): complete the request and return to the application container.

Preferably, the control group functional module includes:

Accounting for and limiting CPU resources: the total number of clock interrupts received by a process is used as its CPU resource usage; when a clock interrupt fires, the kernel obtains the process running on the CPU that received the interrupt and increments that process's clock interrupt count by 1;

The scheduling policy is modified: when the current process's time slice reaches 0, the scheduler is entered, a process is taken from the wait queue, and its CPU usage is computed, i.e. the ratio of the number of clock interrupts the process has received to the total number of clock interrupts received by all processes in the wait queue; the actual CPU usage is then compared with the CPU usage set by the user;

If the process's CPU usage is too high, its time slice is reduced, and the process may even be kept waiting for an extended period; if its CPU usage is too low, its time slice is increased;

Through the modified scheduling policy, each process's time slice is controlled individually; once every process has been scheduled for a full round, each process's CPU usage is guaranteed to match the value set by the user;

Accounting for and limiting memory resources: the page fault handler captures the fault and adds the size of the allocated physical page to the corresponding application container process and application container, thereby accounting for physical page memory usage;

If during this process the physical memory usage of an application container is found to exceed the value set by the user, the application container process that exceeded the memory limit is killed, or its execution is suspended;

Accounting for and limiting I/O bandwidth resources: a rate limiter system service is added between the file system's system service and the device driver's system service; every I/O request is captured by the rate limiter system service, which decides, based on the type and size of the I/O request, whether the request may be issued; if not, issuing the request is postponed; if so, the request is issued and the token count in the rate limiter system service is updated;

The token count grows gradually as the system runs, up to an upper bound; once the bound is reached, the token count cannot increase further.

Preferably, the fault recovery functional module includes: when a fault occurs, the fault is captured; the most common fault is a page fault exception; after capture, the system container process in which the exception occurred is terminated and all of its resources are reclaimed;

A message to restart the system container is then sent to the process management system container; upon receiving the message, the process management system container immediately restarts the system container and rebuilds the system container's internal contents and inter-process communication.

Preferably, in the system container design, the system service processes located in user mode are also managed as containers; the CPU and memory overhead of the system service processes is accounted for, and the system service processes are managed in a unified way;

The system accelerates the I/O speed of the microkernel through direct memory access: capabilities are used to manage the data to be transferred, and direct memory access copies the data directly from the device into memory, or writes it directly from memory to the device, thereby reducing the number of inter-process communications and the number of repeated memory copies.

In a second aspect, a microkernel-based container construction and operation method is provided, the method comprising:

A namespace function step: partition static system resources; divide mount point data, network protocol stack data, and process management data so that different application containers reside in different namespaces and access different system resources, achieving isolation of system resources;

A control group function step: partition dynamic system resources; account for and limit CPU resources, memory resources, and I/O bandwidth resources so that different application containers reside in different control groups and use only their allotted resources, achieving limitation of system resources;

A fault recovery function step: handle crashes of system services caused by memory errors.

Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention proposes the concept of a system container: not only the programs run by users are placed inside containers, but all system services also run inside containers, which ensures more precise accounting of system resource usage;

2. The present invention uses direct memory access to improve the performance of the microkernel, skipping repeated memory copies and inter-process communication;

3. The present invention can provide flexible container support in a microkernel environment and can be ported flexibly to various microkernel platforms without depending on the environment support of a particular monolithic kernel; through flexible interrupt isolation and independence of the scheduling algorithm, it can further support real-time requirements within a specific container; the system containers it uses make resource accounting more precise and allow the behavior of system services to be managed separately, achieving stronger resource isolation;

4. Compared with the prior art, the present invention achieves improved performance on top of stronger isolation and security.

Other beneficial effects of the present invention will be set forth in the detailed description through the introduction of specific technical features and technical solutions; from this introduction, those skilled in the art should be able to understand the beneficial technical effects brought about by these technical features and technical solutions.

Brief Description of the Drawings

Other features, objects, and advantages of the present invention will become more apparent by reading the detailed description of the non-limiting embodiments with reference to the following drawings:

Figure 1 is the architecture diagram of the present invention;

Figure 2 is a schematic diagram of the namespace design architecture in the present invention;

Figure 3 is a schematic diagram of the creation flow of the file system mount point namespace in the present invention;

Figure 4 is a schematic diagram of the join flow of the file system mount point namespace in the present invention;

Figure 5 is a schematic diagram of the exit flow of the file system mount point namespace in the present invention;

Figure 6 is a schematic diagram of the destruction flow of the file system mount point namespace in the present invention;

Figure 7 is a schematic diagram of the usage flow of the file system mount point namespace in the present invention;

Figure 8 is a schematic diagram of the control group design architecture in the present invention;

Figure 9 is a schematic diagram of the processing flow of fault recovery in the present invention;

Figure 10 is a schematic flow chart of direct memory access I/O in the present invention;

Figure 11 shows the CPU resource accounting;

Figure 12 shows the processing flow of the memory control group;

Figure 13 shows the processing flow of the I/O control group;

Figure 14 shows the LXC architecture.

Detailed Description of the Embodiments

The present invention is described in detail below in conjunction with specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit the present invention in any way. It should be noted that those of ordinary skill in the art can make several changes and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention.

An embodiment of the present invention provides a microkernel-based container construction and operation system. Referring to Figure 1, the system includes a namespace functional module, a control group functional module, and a fault recovery functional module.

The present invention also innovatively proposes system containers to strengthen resource accounting and fault recovery for system service processes, and proposes direct memory access as a way to accelerate microkernel I/O.

1. Namespace functional module: referring to Figure 2, it is used to partition static system resources. Mount point data, network protocol stack data, and process management data are divided, so that different application containers reside in different namespaces and access different system resources, achieving isolation of system resources.

A namespace is used mainly through creation, joining, exiting, destruction, and use.

Namespace creation is completed in user mode and includes receiving the user request and creating and initializing the system resources. Different namespaces carry out different system resource creation and initialization flows.

For the file system mount point namespace, this involves obtaining the working path passed in by the user and creating and initializing the mount point information linked list.

For the LwIP network protocol stack namespace, a new contiguous array must be created to store network data, and a new linked list must be created to store network interfaces.

For the process management namespace, a process node in the process tree must be selected as the new root node, serving as the root node for access.

Joining a namespace is completed in kernel mode: the container manager passes the corresponding namespace ID and the type of the corresponding system container into the kernel, and the kernel fills the data into the corresponding slot, completing the namespace join.

Exiting a namespace is completed in kernel mode: the container manager passes the corresponding system container type into the kernel, and the kernel deletes the data at the corresponding slot, completing the namespace exit.

Namespace destruction is completed in user mode and requires destroying the system resources that were created when the namespace was created.

The use of a namespace relies on inter-process communication: when an inter-process communication needs to be sent, the ID of the corresponding namespace is obtained in kernel mode and passed to the upper-level system container, which then selects the corresponding system resources to use.

Specifically, referring to Figure 3, the steps for creating a file system mount point namespace are as follows (a code sketch of this flow is given after the list):

Step 1): the creation request is initiated; the container manager sends a request to the system container via inter-process communication to create a namespace for file system mount points;

Step 2): obtain the working path sent by the container manager and check that the path length is less than 255 characters;

Step 3): obtain the mount point information of the working path in the original state, looking it up by the working path in the original state's mount point information;

Step 4): send a request to the corresponding file system to create its namespace; the detailed flow is described in the file system's namespace creation;

Step 5): create a new mount point information linked list;

Step 6): initialize the new mount point information linked list, adding the mount point information obtained in step 3) to the list as the mount point information of the path;

Step 7): obtain an available namespace ID from the array;

Step 8): add the new mount point information linked list to the corresponding array element of the system container;

Step 9): the inter-process communication returns, completing the creation of the new namespace.
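
The following C sketch illustrates how the creation flow above might look in the mount point system container. It is a minimal illustration under stated assumptions only; all identifiers (mnt_info, mnt_ns_table, mnt_ns_create, MAX_MNT_NS, and so on) are made up for this sketch and are not taken from the patent's implementation.

```c
/* Minimal sketch of mount-point namespace creation; all names are illustrative. */
#include <stdlib.h>
#include <string.h>

#define MAX_PATH_LEN 255
#define MAX_MNT_NS   64

struct mnt_info {                 /* one mount point record */
    char path[MAX_PATH_LEN + 1];
    int  fs_id;                   /* which file-system service serves it */
    struct mnt_info *next;
};

/* namespace table held by the mount-point system container:
 * index 0 is the root namespace, other slots are created on demand */
static struct mnt_info *mnt_ns_table[MAX_MNT_NS];

/* Handle an IPC "create namespace" request carrying a working path.
 * Returns the new namespace ID, or -1 on error. */
int mnt_ns_create(const char *work_path)
{
    if (strlen(work_path) > MAX_PATH_LEN)      /* step 2): length check */
        return -1;

    /* step 3): look the working path up in the root namespace (slot 0) */
    struct mnt_info *origin = NULL;
    for (struct mnt_info *m = mnt_ns_table[0]; m; m = m->next)
        if (strncmp(work_path, m->path, strlen(m->path)) == 0) { origin = m; break; }
    if (!origin)
        return -1;

    /* step 4): asking the file system to create its own namespace is omitted */

    /* steps 5)-6): new list whose root entry is the looked-up mount point */
    struct mnt_info *head = malloc(sizeof(*head));
    if (!head)
        return -1;
    strcpy(head->path, "/");      /* the working path becomes "/" inside the ns */
    head->fs_id = origin->fs_id;
    head->next  = NULL;

    /* steps 7)-8): pick a free slot as the namespace ID and publish the list */
    for (int id = 1; id < MAX_MNT_NS; id++) {
        if (mnt_ns_table[id] == NULL) {
            mnt_ns_table[id] = head;
            return id;            /* step 9): returned through the IPC reply */
        }
    }
    free(head);
    return -1;
}
```

A fixed-size table indexed by the namespace ID keeps the later lookup to a single array access, which matches step 7)'s "obtain an available namespace ID from the array".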

Referring to Figure 4, the steps for joining a file system mount point namespace are as follows (a minimal kernel-side sketch follows the list):

Step 1): a join-namespace request is initiated; the container manager passes the ID of the mount point system container namespace and the system service type into the kernel;

Step 2): obtain the mount point system container namespace ID and the system service type passed in by the container manager;

Step 3): fill the mount point system container namespace ID passed in by the container manager into the mount point system container namespace slot of the application container process;

Step 4): the join-namespace request is completed, and control returns from kernel mode to user mode.
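
A minimal kernel-side sketch of this join flow, under the assumption that each process carries one namespace ID slot per system-service type; the names sys_ns_join, SVC_MOUNT, and ns_id are illustrative, not the patent's.

```c
/* Sketch of the kernel-side "join namespace" path; names are assumptions. */
enum sys_service { SVC_MOUNT = 0, SVC_NET = 1, SVC_PROC = 2, SVC_MAX };

struct process {
    int ns_id[SVC_MAX];    /* per-service namespace ID, 0 = root namespace */
};

long sys_ns_join(struct process *current, int svc_type, int ns_id)
{
    if (svc_type < 0 || svc_type >= SVC_MAX || ns_id < 0)
        return -1;                       /* invalid arguments */
    current->ns_id[svc_type] = ns_id;    /* step 3): record the namespace */
    return 0;                            /* step 4): return to user mode  */
}
```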

Referring to Figure 5, the steps for exiting a file system mount point namespace are as follows:

Step 1): an exit-namespace request is initiated; the container manager passes the system service type of the mount point system container into the kernel;

Step 2): obtain the system service type passed in by the container manager, i.e. the mount point system container's system service type;

Step 3): clear the namespace data corresponding to the mount point system container in the kernel;

Step 4): the exit-namespace request is completed, and control returns from kernel mode to user mode.

Referring to Figure 6, the steps for destroying a file system mount point namespace are as follows (a code sketch follows the list):

Step 1): a destruction request is initiated; the container manager sends a request to the system container via inter-process communication to destroy the namespace of the file system mount points;

Step 2): obtain the namespace ID sent by the container manager;

Step 3): check whether the namespace ID is 0; if it is 0, it denotes the root namespace and cannot be destroyed;

Step 4): check whether the namespace corresponding to the namespace ID exists; if it does not, it cannot be destroyed;

Step 5): obtain the mount point information of the "/" path;

Step 6): send a request to the corresponding file system to destroy its namespace; the detailed flow is described in the file system's namespace destruction;

Step 7): clear the current mount point information linked list, which includes freeing its memory and setting the pointer to 0;

Step 8): clear the corresponding namespace element in the system container's array;

Step 9): the inter-process communication request returns, completing the destruction of the namespace.
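
The destruction flow can be sketched as the inverse of the creation sketch given earlier. The code below reuses that sketch's illustrative mnt_ns_table, struct mnt_info, and MAX_MNT_NS definitions (and its <stdlib.h> include for free); it is again only an assumption of how the steps might look in code.

```c
/* Sketch of mount-point namespace destruction; reuses the creation sketch's
 * illustrative definitions (mnt_ns_table, struct mnt_info, MAX_MNT_NS). */
int mnt_ns_destroy(int ns_id)
{
    if (ns_id == 0)                       /* step 3): the root namespace stays  */
        return -1;
    if (ns_id < 0 || ns_id >= MAX_MNT_NS || mnt_ns_table[ns_id] == NULL)
        return -1;                        /* step 4): no such namespace         */

    /* step 6): asking the owning file system to destroy its side is omitted */

    struct mnt_info *m = mnt_ns_table[ns_id];
    while (m) {                           /* step 7): free the mount list       */
        struct mnt_info *next = m->next;
        free(m);
        m = next;
    }
    mnt_ns_table[ns_id] = NULL;           /* step 8): clear the table slot      */
    return 0;                             /* step 9): returned in the IPC reply */
}
```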

Referring to Figure 7, the steps for using a file system mount point namespace are as follows (a code sketch follows the list):

Step 1): the application container initiates a request and sends an inter-process communication request to the mount point system service;

Step 2): obtain the namespace ID of the current application container in kernel mode; this is possible because the earlier join flow has already written the ID of the namespace it belongs to into the corresponding slot;

Step 3): obtain the corresponding system container from the structure associated with the inter-process communication request;

Step 4): switch from the current process to the system container process and pass along the namespace ID;

Step 5): obtain the namespace ID passed by the kernel;

Step 6): based on the namespace ID, find the mount point linked list of the corresponding namespace in the array and switch to it;

Step 7): execute the specific request of the application container;

Step 8): once the request has been processed, return to kernel mode;

Step 9): complete the request and return to the application container.
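
A sketch of how the namespace ID accompanies a request and selects the per-namespace mount list inside the system container. It reuses the illustrative mnt_ns_table from the creation sketch; the ipc_msg layout and the mnt_handle_request name are likewise assumptions made for illustration.

```c
/* Sketch of "using" a mount-point namespace during an IPC; names are assumptions. */
struct mnt_info;                              /* as in the creation sketch        */
extern struct mnt_info *mnt_ns_table[];       /* per-namespace mount lists        */

/* kernel side (simplified): the caller's namespace ID travels with the request */
struct ipc_msg {
    int  ns_id;                               /* filled in by the kernel, step 4) */
    int  op;                                  /* requested file operation         */
    char path[256];
};

/* service side: dispatch one request inside the caller's namespace */
int mnt_handle_request(struct ipc_msg *msg)
{
    struct mnt_info *mounts = mnt_ns_table[msg->ns_id];  /* step 6): switch lists */
    if (!mounts)
        return -1;
    /* step 7): walk `mounts` to resolve msg->path and forward msg->op to the
       matching file-system service (omitted in this sketch) */
    return 0;
}
```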

2. Control group functional module: referring to Figure 8, it is used to partition dynamic system resources; CPU resources, memory resources, and I/O bandwidth resources can be accounted for and limited, so that different application containers reside in different control groups and use a bounded amount of resources, achieving limitation of system resources.

The use of control groups mainly involves resource accounting and resource limiting.

CPU resources are accounted for and limited as follows:

CPU resource usage is measured as the total number of clock interrupts received by a process. Because clock interrupts fire at a fixed interval, when a clock interrupt fires, the kernel can obtain the process running on the CPU that received the interrupt and increment that process's clock interrupt count by 1.

To control the CPU utilization of a process, in addition to the CPU resource accounting described above, the scheduling policy must also be modified.

When the current process's time slice reaches 0, the scheduling policy is entered. A process is taken from the wait queue and its CPU usage is computed, that is, the ratio of the number of clock interrupts the process has received to the total number of clock interrupts received by all processes in the wait queue; the actual CPU usage is then compared with the CPU usage set by the user.

If the process's CPU usage is too high, its time slice is reduced, and the process may even be kept waiting for an extended period; if its CPU usage is too low, its time slice is increased.

With this modified scheduling policy, each process's time slice can be controlled individually; once every process has been scheduled for a full round, each process's CPU usage can be guaranteed to match the value set by the user.
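
A minimal sketch of this tick-based accounting and time-slice adjustment, assuming a fixed run queue and a per-task quota field; all names (struct task, account_tick, adjust_timeslice) are illustrative rather than the patent's code.

```c
/* Sketch of the CPU control group: ticks are charged to the running task and
 * the scheduler nudges each task's time slice toward its user-set share. */
struct task {
    unsigned long ticks;        /* clock interrupts charged to this task       */
    unsigned int  timeslice;    /* ticks it may run before being preempted     */
    unsigned int  quota_pct;    /* user-set CPU share, in percent              */
};

#define NTASKS 4
static struct task run_queue[NTASKS];

/* called from the timer interrupt on the CPU that received it */
void account_tick(struct task *current)
{
    current->ticks++;                   /* CPU usage = clock interrupts received */
}

/* called when the current time slice reaches 0 and a task is picked next */
void adjust_timeslice(struct task *t)
{
    unsigned long total = 0;
    for (int i = 0; i < NTASKS; i++)
        total += run_queue[i].ticks;
    if (total == 0)
        return;

    unsigned int used_pct = (unsigned int)(t->ticks * 100 / total);
    if (used_pct > t->quota_pct && t->timeslice > 1)
        t->timeslice--;                 /* over quota: shrink the slice */
    else if (used_pct < t->quota_pct)
        t->timeslice++;                 /* under quota: grow the slice  */
}
```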

Memory resources are accounted for and limited as follows:

The page fault handler captures the fault and adds the size of the allocated physical page to the corresponding application container process and application container, thereby accounting for physical page memory usage.

If during this process the physical memory usage of an application container is found to exceed the value set by the user, the application container process that exceeded the memory limit must be killed, or its execution suspended.

The memory control group performs its resource accounting when the kernel allocates physical memory pages. Physical memory data is counted in the functions get_pages and free_pages, and the data is attributed to the currently running process.

To enforce the limit on the physical memory used by an application container, a check is made every time physical memory is accounted for. The check is performed per process group: after the memory data of each application container process has been processed, the corresponding data is propagated to its process group, and it is checked whether the group's memory exceeds the user's preset value.

If the actual physical memory usage of a process group exceeds the user's preset value, there are two options (see the sketch after this list):

1. Kill the process in the over-limit process group directly and release the process's memory;

2. Keep the process waiting until other processes in the process group release enough memory for the waiting process to proceed.
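
A sketch of the per-group charge-and-check step described above, assuming a simple byte counter and a flag selecting between the two over-limit policies; the names mem_cgroup and charge_page are illustrative and are not the internals of get_pages/free_pages themselves.

```c
/* Sketch of the memory control group check done on each physical-page charge. */
#define PAGE_SIZE 4096

struct mem_cgroup {
    unsigned long usage;     /* bytes of physical memory charged so far        */
    unsigned long limit;     /* user-set limit in bytes                        */
    int kill_on_overrun;     /* 1: kill the offender, 0: make it wait instead  */
};

/* return 0 if the page may be mapped, -1 to kill, 1 to block the caller */
int charge_page(struct mem_cgroup *cg)
{
    if (cg->usage + PAGE_SIZE > cg->limit) {
        if (cg->kill_on_overrun)
            return -1;       /* option 1: caller kills the faulting process    */
        return 1;            /* option 2: caller sleeps until another process
                                in the group frees enough memory               */
    }
    cg->usage += PAGE_SIZE;  /* charged in get_pages(); uncharged in free_pages() */
    return 0;
}
```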

I/O bandwidth resources are accounted for and limited as follows:

The I/O control group limits I/O bandwidth with a rate limiter: all I/O requests sent to the device driver system container are intercepted and parsed to obtain the corresponding request type and size, which are then counted. The bandwidth control logic of the rate limiter relies mainly on a token bucket algorithm. The token bucket algorithm maintains a token bucket; the system generates tokens at a constant rate, and the bucket has a rated capacity. When the token bucket is full, newly generated tokens cannot be added to it.

When a request arrives, its type and size are parsed, the token bucket is selected according to the type, and the number of tokens in it is checked; if there are enough, the request is issued; if not, the request is blocked until enough tokens are available.

With the token bucket algorithm, the number of reads and writes and the number of bytes read and written within a period of time can be limited.

To generate tokens at a constant rate, the rate limiter also maintains a timer that runs once per second, produces a fixed amount of tokens, and checks whether any blocked I/O request can now be issued.

Blocking of I/O requests is implemented with a notification mechanism. When the token count is found to be insufficient, a notification capability is created and the request enters a blocked state; the I/O request and the notification capability are added to a queue, which is checked by the timer to see whether any I/O request can now be satisfied; if so, the blocked I/O request is woken up and the issuing process completes.
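
A minimal sketch of the token bucket used by the rate limiter, with an admit path called on each intercepted I/O request and a refill path called by the once-per-second timer; names and units (tokens counted in bytes) are assumptions made for the sketch.

```c
/* Sketch of the rate limiter's token bucket; names and units are assumptions. */
#include <stdbool.h>

struct token_bucket {
    unsigned long tokens;      /* currently available tokens (e.g. bytes)   */
    unsigned long capacity;    /* rated capacity: refills stop at this level */
    unsigned long refill;      /* tokens added per timer period              */
};

/* try to issue one request costing `size` tokens; true means it may go now */
bool tb_admit(struct token_bucket *tb, unsigned long size)
{
    if (tb->tokens >= size) {
        tb->tokens -= size;
        return true;           /* forward the request to the driver service  */
    }
    return false;              /* park the request and its notification on a
                                  wait queue that the timer below re-checks   */
}

/* run once per second by the limiter's timer */
void tb_refill(struct token_bucket *tb)
{
    tb->tokens += tb->refill;
    if (tb->tokens > tb->capacity)
        tb->tokens = tb->capacity;   /* bucket full: extra tokens are dropped */
    /* then re-scan the wait queue and wake any request tb_admit() now accepts */
}
```

In the design described above there would be one such bucket per request type (for example separate buckets for read and write, and for operation counts versus bytes), selected when the request is parsed.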

3. Fault recovery functional module: referring to Figure 9, it is used to handle crashes of system services caused by memory errors;

When a fault occurs, it can be captured; the most common fault is a page fault exception. After capture, the system container process in which the exception occurred can be terminated and all of its resources reclaimed.

A message to restart the system container is then sent to the process management system container; upon receiving the message, the process management system container immediately restarts the system container and rebuilds the system container's internal contents and inter-process communication.

Specifically, the fault recovery function consists of two parts: fault capture and fault recovery. While an application container is using the functions of a system container normally, the system container records the application container's operations and stores them in a memory region of the kernel.

When an error leads to a page fault exception, the error is delivered to the kernel, which then determines whether the error occurred in a system container; if so, all inter-process communication and lock resources in the system container are reclaimed and the system container exits, and all inter-process communications that were executing inside the system container receive a retry return code.

A message is then sent to the process management system container. After receiving the message, the process management system container tries to restart the system container, rebuilds its internal contents according to the previously recorded operations, and re-establishes inter-process communication; once the rebuilding is complete, the system container can continue to be used.
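
A sketch of the fault-capture half of this flow as it might look in the kernel; every identifier is an assumption, and the three helper functions are stubbed only so the sketch stands alone.

```c
/* Sketch of fault capture for a crashed system container; names are assumptions. */
#define ERR_RETRY (-100)            /* return code given to in-flight IPCs */

struct sys_container { int id; int is_system_service; };

/* assumed kernel helpers, stubbed so the sketch compiles on its own */
static void revoke_ipc_and_locks(struct sys_container *c) { (void)c; }
static void exit_container(struct sys_container *c)       { (void)c; }
static void notify_proc_manager_restart(int id)           { (void)id; }

void on_system_container_fault(struct sys_container *c)
{
    if (!c->is_system_service)
        return;                     /* ordinary application faults take the normal path */

    revoke_ipc_and_locks(c);        /* in-flight IPCs complete with ERR_RETRY   */
    exit_container(c);              /* terminate the faulting service process   */
    notify_proc_manager_restart(c->id); /* the process management container then
                                           restarts the service, replays the
                                           logged operations and re-creates its
                                           IPC endpoints                         */
}
```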

To account for the resource usage of system containers more precisely and to perform fault recovery on system service processes, the present invention places all system service processes inside containers. In this way, the resources of the system service processes can also be accounted for and limited; if a limit is exceeded, the fault recovery function can be triggered to restart and recover the system container.

4. System container: system service processes located in user mode are also managed as containers, which allows the CPU and memory overhead of the system service processes to be accounted for more precisely and the system service processes to be managed in a unified way.

5. Direct memory access: referring to Figure 10, direct memory access is used to accelerate the I/O speed of the microkernel. Concretely, capabilities are used to manage the data to be transferred, and direct memory access copies the data directly from the device into memory, or writes it directly from memory to the device, thereby reducing the number of inter-process communications and the number of repeated memory copies.
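
A rough sketch of a DMA read path under the capability scheme described above: the application's buffer is named by a capability, the driver resolves it to a physical address, and the device copies the data straight into that buffer, so no payload has to be copied at each IPC hop. The types and helpers below are assumptions stubbed only for illustration, not a real driver API.

```c
/* Sketch of a DMA-based block read keyed by a buffer capability; all names are
 * assumptions, and the device-side helpers are stubs standing in for a driver. */
#include <stdint.h>
#include <stddef.h>

typedef int cap_t;                         /* capability naming a memory object */

/* assumed kernel/driver helpers, stubbed for the sketch */
static uint64_t cap_to_phys(cap_t buf_cap) { (void)buf_cap; return 0x80000000ULL; }
static void dev_start_dma(uint64_t dst_phys, uint64_t lba, size_t len)
{ (void)dst_phys; (void)lba; (void)len; /* program the controller's descriptor */ }
static void dev_wait_irq(void) { /* block until the completion interrupt */ }

/* driver-side handler for a block read request */
int blk_read_dma(cap_t buf_cap, uint64_t lba, size_t len)
{
    uint64_t phys = cap_to_phys(buf_cap);  /* translate capability -> physical  */
    dev_start_dma(phys, lba, len);         /* device copies directly to memory  */
    dev_wait_irq();
    return 0;                              /* the IPC reply carries no payload  */
}
```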

The present invention also provides a microkernel-based container construction and operation method. The microkernel-based container construction and operation system can be realized by executing the process steps of this method; that is, those skilled in the art may understand the microkernel-based container construction and operation method as a preferred embodiment of the microkernel-based container construction and operation system. The method includes:

A namespace function step: partition static system resources; divide mount point data, network protocol stack data, and process management data so that different application containers reside in different namespaces and access different system resources, achieving isolation of system resources;

A control group function step: partition dynamic system resources; account for and limit CPU resources, memory resources, and I/O bandwidth resources so that different application containers reside in different control groups and use only their allotted resources, achieving limitation of system resources;

A fault recovery function step: handle crashes of system services caused by memory errors.

接下来,对本发明进行更为具体的说明。Next, the present invention will be described more specifically.

一种基于微内核的容器构建与运行系统,采用微内核架构,并划分命名空间功能设计、控制组功能设计和故障恢复功能设计。A microkernel-based container construction and operation system adopts microkernel architecture and divides namespace function design, control group function design and fault recovery function design.

First, regarding the namespace function design, as shown in Figure 2, the namespace function of the microkernel container in this solution is mainly implemented at the EL0 layer, with part of the implementation code also placed at the EL1 layer.

For an application container that has just been started, the capability of the corresponding namespace is added to it and initialized to 0, where 0 indicates that the container uses the default initial system resources in the corresponding system container.

When a user sends an inter-process communication request to obtain a service from a system container, the capability of the namespace corresponding to that system container is first obtained from the kernel, and the namespace information is carried into the user-mode system container, where it is used to select the appropriate system resources.
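
The following minimal sketch illustrates this flow. All identifiers (ipc_msg, kernel_send_ipc, select_resources, and so on) are illustrative assumptions rather than the actual interfaces of the system; the point is only that the kernel stamps the request with the caller's namespace capability (0 meaning the default resources) and the user-mode system container uses that ID to pick a resource set.

```c
/* Minimal sketch (hypothetical names): a namespace ID held as a capability by
 * the caller is attached to the IPC request by the kernel and used by the
 * user-level system container to select the right per-namespace resources. */
#include <stdint.h>
#include <stdio.h>

#define MAX_NS 16

struct ipc_msg {
    uint32_t ns_id;      /* namespace ID injected by the kernel, 0 = default */
    uint32_t op;         /* requested operation */
    char     payload[64];
};

/* Kernel side: look up the caller's namespace capability and stamp the message. */
static void kernel_send_ipc(struct ipc_msg *msg, uint32_t caller_ns_cap)
{
    msg->ns_id = caller_ns_cap;      /* 0 means "use the default resources" */
}

/* System-container side: select the per-namespace resource table by ID. */
struct resource_set { const char *name; };
static struct resource_set ns_table[MAX_NS] = { { "default" } };

static struct resource_set *select_resources(const struct ipc_msg *msg)
{
    if (msg->ns_id >= MAX_NS || ns_table[msg->ns_id].name == NULL)
        return &ns_table[0];         /* fall back to the root namespace */
    return &ns_table[msg->ns_id];
}

int main(void)
{
    struct ipc_msg m = { 0 };
    kernel_send_ipc(&m, 0);          /* freshly started container: capability = 0 */
    printf("using resource set: %s\n", select_resources(&m)->name);
    return 0;
}
```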

Different ways of partitioning system resources need to be designed for the different system containers in the microkernel.

For the file system mount point system container, the function of the system container is to select a mount point according to the path provided by the user process and to return the corresponding ID and the capability for inter-process communication; the application container then uses this ID and capability to send concrete file operations to the corresponding file system system container. The mount and umount operations are also provided.

What this system container mainly stores is the mount point information within the system, so it is this information that must be isolated. Inside the system container, a linked list is used to maintain a set of mount point information. To give each new file system mount point namespace its own mount point information, a new mount point information linked list must be created whenever a new file system mount point namespace is created, in order to store the mount point information within that namespace; an application container then selects the corresponding mount point information linked list according to the file system mount point namespace it resides in.
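
A compilable sketch of the data layout described above is given below; the structure and function names are assumptions made for illustration, not the patent's actual code. Each mount-point namespace owns its own linked list, and lookups consult only the list selected by the caller's namespace ID.

```c
/* Illustrative sketch: one linked list of mount points per file-system
 * mount-point namespace, indexed by namespace ID inside the system container. */
#include <stdlib.h>
#include <string.h>

struct mount_point {
    char                path[256];
    int                 fs_ipc_id;       /* ID/cap used to reach the FS container */
    struct mount_point *next;
};

struct mnt_namespace {
    struct mount_point *mounts;          /* head of this namespace's list */
};

#define MAX_MNT_NS 64
static struct mnt_namespace *mnt_ns_table[MAX_MNT_NS];  /* slot 0 = root ns */

/* Creating a new namespace allocates a fresh, separate mount-point list. */
static int mnt_ns_create(const char *initial_path, int fs_ipc_id)
{
    for (int id = 1; id < MAX_MNT_NS; id++) {
        if (mnt_ns_table[id])
            continue;
        struct mnt_namespace *ns = calloc(1, sizeof(*ns));
        struct mount_point *mp   = calloc(1, sizeof(*mp));
        if (!ns || !mp) { free(ns); free(mp); return -1; }
        strncpy(mp->path, initial_path, sizeof(mp->path) - 1);
        mp->fs_ipc_id = fs_ipc_id;
        ns->mounts    = mp;              /* inherit only the caller's working path */
        mnt_ns_table[id] = ns;
        return id;                       /* namespace ID handed back to the manager */
    }
    return -1;                           /* no free slot */
}

/* Requests carry a namespace ID; lookups walk only that namespace's list. */
static struct mount_point *mnt_lookup(int ns_id, const char *path)
{
    if (ns_id < 0 || ns_id >= MAX_MNT_NS || !mnt_ns_table[ns_id])
        return NULL;
    for (struct mount_point *mp = mnt_ns_table[ns_id]->mounts; mp; mp = mp->next)
        if (strncmp(path, mp->path, strlen(mp->path)) == 0)
            return mp;
    return NULL;
}
```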

For the LwIP network protocol stack system container, the function of the system container is to provide network support for upper-layer applications and to perform network protocol stack operations such as open, close, read and write according to user requests.

The information mainly stored in this system container is the network interface list and the network data. The network interface list is a series of network interfaces organized in a linked list; the network data is a contiguous array in memory used to store the data sent and received on the network interfaces.

In the LwIP network protocol stack system container, whenever a new network protocol stack namespace is created, an empty linked list must be created to store the network interfaces of the new namespace, a loopback network interface is initialized for it, and a new contiguous memory array is created to handle the network data in the new network protocol stack namespace.
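
The following sketch shows, under assumed names (it does not use LwIP's real internal API), what creating a new network protocol stack namespace involves: a fresh interface list seeded with a loopback interface, plus a new contiguous buffer for that namespace's network data.

```c
/* Sketch of per-namespace network state: an interface list with its own
 * loopback device and a dedicated contiguous packet buffer. */
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

struct netif_entry {
    char                name[8];
    struct netif_entry *next;
};

struct net_namespace {
    struct netif_entry *ifaces;      /* linked list of interfaces in this ns */
    uint8_t            *pkt_buf;     /* contiguous buffer for this ns's traffic */
    size_t              pkt_buf_len;
};

static struct net_namespace *net_ns_create(size_t buf_len)
{
    struct net_namespace *ns  = calloc(1, sizeof(*ns));
    struct netif_entry   *lo  = calloc(1, sizeof(*lo));
    uint8_t              *buf = malloc(buf_len);
    if (!ns || !lo || !buf) { free(ns); free(lo); free(buf); return NULL; }

    strcpy(lo->name, "lo");          /* every new namespace gets its own loopback */
    ns->ifaces      = lo;
    ns->pkt_buf     = buf;
    ns->pkt_buf_len = buf_len;
    return ns;                       /* physical NICs are attached separately */
}
```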

For the process management system container, the function of the system container is to provide process ID and process tree support, and it is responsible for operations such as creating new processes and reaping zombie processes.

The information mainly stored in this system container is the first process created at system startup, i.e., the root node of the process tree.

In the process management system container, whenever a new process management namespace is created, a node of the process tree is selected as the new process tree root; any process created within that process management namespace grows downward along the tree structure and cannot be observed by other process management namespaces.
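
A small illustrative sketch of this idea follows; the structures and the visibility check are assumptions made for exposition. A new process management namespace records an existing process-tree node as its local root, and a process is considered visible in the namespace only if it lies in the subtree below that root.

```c
/* Sketch: a PID namespace is just a designated subtree root of the process tree. */
#include <stdbool.h>
#include <stddef.h>

struct proc {
    int          pid;
    struct proc *parent;
    struct proc *first_child;
    struct proc *next_sibling;
};

struct pid_namespace {
    struct proc *root;               /* subtree root owned by this namespace */
};

static void pid_ns_init(struct pid_namespace *ns, struct proc *subtree_root)
{
    ns->root = subtree_root;         /* processes created inside grow below this node */
}

/* A process is visible in a namespace only if it sits under that namespace's root. */
static bool pid_ns_visible(const struct pid_namespace *ns, const struct proc *p)
{
    for (const struct proc *it = p; it != NULL; it = it->parent)
        if (it == ns->root)
            return true;
    return false;
}
```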

The control group function design mainly covers CPU, memory and I/O bandwidth.

The CPU control group is implemented at the EL1 layer. For the application containers of the microkernel, in order to add the CPU control group function while still satisfying real-time requirements, a priority-based round-robin time-slice scheduling algorithm is used to provide real-time support for application containers.

The priority-based round-robin time-slice scheduling algorithm achieves real-time behavior by assigning priorities to processes: different processes have different priorities, and during scheduling the process to run is selected according to its priority.

If a process has a higher priority, the higher-priority process is scheduled first.

If processes have the same priority, round-robin time-slice scheduling is used to schedule them.

The CPU control group cannot be applied across processes of different priorities, because preemption between them makes it impossible to account for CPU usage accurately. For processes of the same priority, however, the CPU control group can be used: among processes of equal priority, the CPU usage of each process is accounted for during scheduling and controlled by the CPU control group.

As shown in Figure 11, CPU resources are accounted for by treating the clock interrupts received by a process as its CPU resource usage. Because clock interrupts are triggered at a fixed interval, when a clock interrupt fires the kernel can obtain the process currently running on the CPU that triggered the interrupt and increment that process's clock interrupt count by 1.

Because of the existence of inter-process communication under the microkernel architecture, the situation in which inter-process communication changes the executing process must be considered. Inter-process communication switches the right of execution between processes: when an application container's process requests a service from a system container via inter-process communication, the running process is no longer the original one but the system container's process, yet this portion of running time still belongs to the application container's process.

To solve this problem, we focus on an invariant of the application container's process across inter-process communication, namely the scheduling context: inter-process communication passes the application container process's scheduling context to the system container process, so we store the variables used for CPU resource accounting in the scheduling context. Even if the application container's process switches to the system container's process through inter-process communication, the execution time of the application container's process can still be accounted for correctly.
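
The sketch below (with assumed names such as sched_context and on_timer_tick) shows this tick-based accounting: the counter lives in the scheduling context, which follows the application container's thread across IPC, so ticks spent inside a system container are still charged to the caller.

```c
/* Sketch of tick-based CPU accounting attached to the scheduling context. */
#include <stdint.h>

struct sched_context {
    uint64_t ticks_used;     /* clock interrupts charged to this container */
    uint64_t budget;         /* time slice (ticks) granted per round */
};

struct thread {
    struct sched_context *sc;   /* transferred to the callee on IPC */
};

static struct thread *current_thread(void) { /* per-CPU current, stubbed here */
    static struct sched_context sc = { 0, 10 };
    static struct thread t = { &sc };
    return &t;
}

/* Called from the timer-interrupt handler at a fixed interval. */
static void on_timer_tick(void)
{
    struct sched_context *sc = current_thread()->sc;
    sc->ticks_used++;                       /* CPU usage = ticks received */
    if (sc->budget > 0)
        sc->budget--;                       /* consume the running time slice */
    /* budget == 0 -> the scheduler picks the next runnable thread */
}
```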

To control the CPU usage of processes, in addition to the CPU resource accounting described above, the scheduling policy must also be modified so that a process's CPU usage can be adjusted by adjusting its time slice (time budget).

When the current process's time slice reaches 0, the scheduling policy is entered. A process is taken from the wait queue and its CPU usage is computed, i.e., the ratio of the number of clock interrupts received by the process to the total number of clock interrupts received by all processes in the wait queue; the actual CPU usage is then compared with the CPU usage set by the user.

If the process's CPU usage is too high, its time slice is reduced, possibly even keeping the process in the waiting state for a long time; if the process's CPU usage is too low, its time slice is enlarged.

With the above scheduling policy, the time slice of each process can be controlled individually; after every process has been scheduled for a complete round, the CPU usage of each process matches the value set by the user.
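
A simplified sketch of such a budget-adjustment rule is shown below; the halving/doubling step sizes and the field names are illustrative assumptions, not the exact policy of the implementation.

```c
/* Sketch: when a thread's slice is exhausted, recompute its share of all ticks
 * in the run queue and nudge its next time slice toward the user-set quota. */
#include <stdint.h>

struct cg_thread {
    uint64_t ticks_used;     /* clock interrupts received so far */
    uint64_t budget;         /* next time slice, in ticks */
    double   cpu_quota;      /* user-set target share, e.g. 0.25 = 25% */
};

#define MIN_BUDGET 1
#define MAX_BUDGET 64

static void adjust_budget(struct cg_thread *t, uint64_t total_ticks_in_queue)
{
    if (total_ticks_in_queue == 0)
        return;
    double share = (double)t->ticks_used / (double)total_ticks_in_queue;

    if (share > t->cpu_quota && t->budget > MIN_BUDGET)
        t->budget /= 2;                  /* over quota: shrink the slice */
    else if (share < t->cpu_quota && t->budget < MAX_BUDGET)
        t->budget *= 2;                  /* under quota: grow the slice */
    /* over one full scheduling round, shares converge toward the quotas */
}
```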

The memory control group is likewise implemented at the EL1 layer. The memory control group limits the physical memory usage of an application container, including all anonymous memory pages for which page table mappings have been established; for shared memory, the physical memory usage is charged to the process that uses those memory pages first.

Figure 12 shows the processing flow of the memory control group.

Likewise, inter-process communication must be considered here, so the same approach is adopted: the corresponding data is accounted for in the process's scheduling context, so that even after inter-process communication occurs, the physical memory pages used can still be correctly attributed to the application container.
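
The following sketch (assumed names) shows the bookkeeping side of the memory control group: each anonymous page mapped on a page fault is charged against the container's counter, and a charge that would exceed the limit is refused so that the caller can kill or suspend the offending process.

```c
/* Sketch of memory control group accounting on the page-fault path. */
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE 4096u

struct mem_cgroup {
    uint64_t bytes_charged;   /* physical memory attributed to the container */
    uint64_t bytes_limit;     /* user-configured ceiling */
};

/* Returns false if mapping one more page would exceed the limit; the caller
 * can then kill or suspend the offending application container process. */
static bool memcg_charge_page(struct mem_cgroup *cg)
{
    if (cg->bytes_charged + PAGE_SIZE > cg->bytes_limit)
        return false;
    cg->bytes_charged += PAGE_SIZE;   /* first user of a shared page pays for it */
    return true;
}

static void memcg_uncharge_page(struct mem_cgroup *cg)
{
    if (cg->bytes_charged >= PAGE_SIZE)
        cg->bytes_charged -= PAGE_SIZE;
}
```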

I/O bandwidth resources are accounted for above the driver layer. Under the microkernel architecture, the call path of a block device I/O is application container -> file system system container -> driver system container. To be able to account for all block device I/O requests, a new system container, called the limiter, is added between the file system system container and the driver system container to control the I/O bandwidth. All data flowing in and out passes through the limiter.

Figure 13 shows the processing flow of the I/O control group.

The logic by which the limiter controls I/O bandwidth relies mainly on the token bucket algorithm. The principle of the token bucket algorithm is to maintain a token bucket: the system continuously generates tokens at a constant rate, and the token bucket has an upper capacity limit. When the token bucket is full, newly generated tokens can no longer be added to it.

When an I/O request arrives, its type and size are parsed, and either the read token bucket or the write token bucket is selected according to the type. The number of tokens in the bucket is then checked: if it is sufficient, the specified number of tokens is subtracted and the request is issued; if it is insufficient, the request is blocked until the system has enough tokens.

Through the token bucket algorithm, both the I/O read/write operation rate and the I/O read/write bandwidth can be limited.

To generate tokens continuously at a constant rate, the limiter maintains a timer that runs once per second, producing a fixed number of tokens and also checking whether any blocked I/O requests now satisfy the conditions for being issued.

Blocking and waking of I/O requests use the notification mechanism. When the number of tokens is found to be insufficient, a notification capability is created and the request enters the blocked state; the I/O request and the notification capability are added to a queue, which is checked in the timer to see whether any I/O request now satisfies the requirements. If so, the blocked I/O request is woken up and the issuing of the I/O request is completed.
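
The limiter's token-bucket logic can be sketched as follows. The names, refill rates and capacities are illustrative assumptions; the essential behavior is that a request either consumes tokens from the read or write bucket and is forwarded, or is parked on a queue that the once-per-second timer re-examines after refilling the buckets.

```c
/* Sketch of the limiter: capped token buckets, a blocked-request queue, and a
 * periodic refill that retries blocked requests. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct token_bucket {
    uint64_t tokens;
    uint64_t capacity;        /* upper bound; refills beyond this are discarded */
    uint64_t refill_per_sec;  /* constant token generation rate */
};

struct io_request {
    bool     is_write;
    uint64_t cost;            /* tokens this request needs (bytes or op count) */
    struct io_request *next;  /* blocked requests wait on this list */
};

static struct token_bucket rd_bucket = { 0, 1 << 20, 1 << 18 };
static struct token_bucket wr_bucket = { 0, 1 << 20, 1 << 18 };
static struct io_request  *blocked_head;

static void refill(struct token_bucket *tb)
{
    uint64_t t = tb->tokens + tb->refill_per_sec;
    tb->tokens = (t > tb->capacity) ? tb->capacity : t;   /* bucket is capped */
}

/* Try to issue a request; returns false if it must block on a notification. */
static bool limiter_submit(struct io_request *req)
{
    struct token_bucket *tb = req->is_write ? &wr_bucket : &rd_bucket;
    if (tb->tokens >= req->cost) {
        tb->tokens -= req->cost;   /* enough tokens: forward to the driver */
        return true;
    }
    req->next = blocked_head;      /* not enough: park it for the timer */
    blocked_head = req;
    return false;
}

/* Runs once per second: add tokens, then retry the blocked queue. */
static void limiter_timer_tick(void)
{
    refill(&rd_bucket);
    refill(&wr_bucket);
    for (struct io_request **pp = &blocked_head; *pp; ) {
        struct io_request *req = *pp;
        struct token_bucket *tb = req->is_write ? &wr_bucket : &rd_bucket;
        if (tb->tokens >= req->cost) {
            tb->tokens -= req->cost;
            *pp = req->next;       /* dequeue; signal its notification here */
        } else {
            pp = &req->next;
        }
    }
}
```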

For strong system reliability, a fault recovery function is designed for errors inside system containers: when a system container crashes due to an internal error, the crash can be captured, the system container can be restarted, its key data can be restored, and the requests of application containers can be re-executed.

Figure 9 shows the processing flow of fault recovery.

When an error in a system container process causes a page fault exception, the system can capture the error at the page fault and analyze whether the error comes from a system container process. If it does, this indicates that the logic or memory inside the system container process has gone wrong and the process needs to be restarted.

First, because the error is a page fault exception, all inter-process communication data managed by the current system container process can be obtained from the system container process's data. All inter-process communication locks are released, the return values of the application container processes are set, the inter-process communication connections are marked as unusable, and finally the system container process exits. When the system container exits, a message is sent to the process management system container, which, upon receiving it, attempts to restart the system container and restore its internal data.
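
A high-level sketch of this recovery path is given below; the structures and helper functions are assumptions used for illustration only. It mirrors the steps above: release the IPC locks, fail the blocked callers, mark the connections unusable, exit, and let the process management system container restart and rebuild the service.

```c
/* Sketch of the recovery path taken when a system container faults. */
#include <stdbool.h>

struct ipc_conn {
    bool locked;
    bool usable;
    int *client_retval;        /* where the blocked caller's return code goes */
    struct ipc_conn *next;
};

struct sys_container {
    struct ipc_conn *conns;    /* all IPC connections this container serves */
};

enum { E_SERVICE_RESTARTING = -128 };

static void notify_process_manager_restart(struct sys_container *sc) { (void)sc; }
static void exit_container_process(struct sys_container *sc)         { (void)sc; }

static void recover_system_container(struct sys_container *sc)
{
    for (struct ipc_conn *c = sc->conns; c; c = c->next) {
        c->locked = false;                             /* drop held IPC locks */
        if (c->client_retval)
            *c->client_retval = E_SERVICE_RESTARTING;  /* fail the caller */
        c->usable = false;                             /* refuse new requests for now */
    }
    exit_container_process(sc);                        /* tear the faulty process down */
    notify_process_manager_restart(sc);                /* replay recorded ops, rebuild IPC */
}
```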

In the past, containers wrapped only user-mode applications, but in the microkernel scenario the system services that used to run in kernel mode have been moved up to user mode, so the system service processes can also be placed inside containers and all resources in the system can be handed over to container management. In this way, the resources of the system service processes can be accounted for at a finer granularity; combined with the fault recovery capability described above, when an error occurs in a system container it can be captured and the container restarted without affecting the normal operation of the system.

The management scope of a system container also differs from that of an application container: an application container is managed through namespaces and control groups, whereas a system container is managed through control groups and fault recovery.

An application container needs to manage the system resources it can access and limit the total amount of resources it can access, so namespaces are used to manage the system resources it can access and control groups are used to limit the total amount of resources it can access.

A fault in an application container is caused by code written by the user, so the system is not responsible for recovering from the user's errors; a fault in a system container, however, must be handled by the system. In addition, for better performance isolation and to ensure that the CPU and memory usage of a system container does not exceed the prescribed limits, the CPU and memory usage of the system container is accounted for; if the CPU or memory usage exceeds the rated limit, fault recovery is triggered and the system container is restarted and restored, so that it can continue to serve application containers normally.

The present invention innovatively proposes using direct memory access to accelerate the I/O speed of the microkernel. In common microkernels, because different system containers live in different address spaces and the amount of memory that can be used as shared memory between address spaces for inter-process communication is limited, a request has to be divided into multiple requests, and completing one full request requires repeated memory copies and a large number of inter-process communications.

Consider the case where a user wants to read 1 MB of file data while the shared memory is only 4 KB. The user must split one request into many requests, and because other metadata must also be carried as the request travels from the file system to the driver, the amount of memory actually available for file content becomes smaller and smaller. As a result, the number of inter-process communications from the user to the file system can reach on the order of a thousand, and the number from the file system to the driver can be as high as two thousand; this situation is further aggravated as the inter-process communication chain grows longer, causing the number of inter-process communications over the whole request to rise exponentially.

To solve the above two problems, the present invention proposes a scheme that uses direct memory access to accelerate the I/O of the microkernel. Figure 10 shows the flow of I/O performed through direct memory access.

When an application container sends a request, if it is a read request, the application container allocates in advance a memory region of the specified size and passes that region as a capability to the file system system container and the device driver system container. After the device driver system container obtains the capability, it first checks whether the physical address corresponding to the capability has already been mapped; if not, it maps the physical memory first. It can then perform direct memory access on the physical memory contained in the capability, writing the data straight from the device into the corresponding physical memory region. Because this part is asynchronous, the application container's inter-process communication can return immediately and does not need to carry any data back, so there are no memory copies and no redundant inter-process communication operations.

If it is a write request, the application container takes the physical memory region corresponding to the data it wants to write as a capability and sends this capability, together with the other request data, to the file system system container and the device driver system container. After the device driver system container obtains the capability, it can write the data from memory directly to the specified location of the device via direct memory access, and it can return once the request has been issued.
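
The sketch below illustrates the driver-side fast path under assumed names (mem_cap, device_dma, and so on are not real APIs): the capability describes the caller's physical memory, the driver maps it if necessary, and the device DMAs directly into or out of it, so no payload crosses the shared-memory IPC buffers.

```c
/* Sketch of capability-based DMA in the device driver system container. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_cap {
    uint64_t paddr;        /* physical range described by the capability */
    size_t   len;
    bool     mapped;       /* has the driver container mapped it yet? */
};

static void map_capability(struct mem_cap *cap)            { cap->mapped = true; }
static void device_dma(uint64_t dev_off, uint64_t paddr,
                       size_t len, bool to_memory)
{ (void)dev_off; (void)paddr; (void)len; (void)to_memory;  /* program the device */ }

/* Read: device -> caller's memory. The IPC can return before the DMA finishes;
 * completion is signalled later by the device interrupt. */
static void driver_handle_read(struct mem_cap *cap, uint64_t dev_off)
{
    if (!cap->mapped)
        map_capability(cap);
    device_dma(dev_off, cap->paddr, cap->len, /*to_memory=*/true);
}

/* Write: caller's memory -> device, again without intermediate copies. */
static void driver_handle_write(struct mem_cap *cap, uint64_t dev_off)
{
    if (!cap->mapped)
        map_capability(cap);
    device_dma(dev_off, cap->paddr, cap->len, /*to_memory=*/false);
}
```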

When a read or write request completes, an interrupt is triggered to inform the system that the current request has finished.

The embodiments of the present invention provide a microkernel-based container construction and operation system and method that can offer flexible container support in a microkernel environment and can be ported flexibly to various microkernel platforms without depending on the environment support of a particular monolithic kernel. Through flexible interrupt isolation and scheduling algorithm independence, the present invention can further support real-time requirements within specific containers. The system containers used by the present invention make resource accounting more precise and allow the behavior of system services to be managed separately, achieving stronger resource isolation. Compared with the prior art, the present invention obtains a performance improvement in addition to stronger isolation and security.

Those skilled in the art know that, in addition to implementing the system provided by the present invention and its various devices, modules and units purely as computer-readable program code, it is entirely possible, by logically programming the method steps, to have the system provided by the present invention and its various devices, modules and units realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system provided by the present invention and its devices, modules and units can be regarded as a hardware component, and the devices, modules and units included therein for realizing various functions can also be regarded as structures within the hardware component; the devices, modules and units for realizing various functions can further be regarded both as software modules implementing the method and as structures within the hardware component.

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above; those skilled in the art may make various changes or modifications within the scope of the claims without affecting the essence of the present invention. In the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.

Claims (10)

1. A microkernel-based container build and run system, comprising: a namespace function module, a control group function module, and a fault recovery function module;
wherein the namespace function module: for partitioning static system resources; dividing mounting point data, network protocol stack data and process management data, accessing different system resources by different namespaces where different application containers are located, and realizing isolation of the system resources;
control group function module: the system resource control method is used for dividing dynamic system resources, counting and limiting CPU resources, memory resources and I/O bandwidth resources, using limited resources in different control groups where different application containers are located, and realizing the limitation of the system resources;
fault recovery function module: the method is used for processing the condition of system service breakdown caused by memory errors;
the system also includes a system container to enhance resource statistics and failure recovery for the system service processes.
2. The microkernel-based container build and run system of claim 1, wherein the namespace function module comprises: creating, joining, exiting, destroying and using a file system mounting point name space;
The file system mount point namespaces are created as follows:
step 1): initiating a creation request, sending a request to a system container by a container management program in an inter-process communication mode, and creating a naming space of a file system mounting point;
step 2): acquiring a working path sent by a container management program, and judging whether the path length is smaller than 255 characters;
step 3): acquiring mounting point information of a working path in an original state, and acquiring mounting point information from the mounting point information in the original state according to the working path;
step 4): sending a request for creating a namespace to a corresponding file system, wherein a specific flow is introduced in the creation of the namespace of the file system;
step 5): creating a new mounting point information linked list;
step 6): initializing a new mounting point information linked list, and adding the mounting point information acquired in the step 3) into the linked list as mounting point information of a path;
step 7): acquiring available namespaces ID from the array;
step 8): adding a new mounting point information linked list into an array element of a system container;
step 9): the inter-process communication returns to complete the creation of the new namespace.
3. The microkernel-based container build and run system of claim 2 wherein the step of joining a file system mount point namespace is as follows:
Step 1): initiating a request for joining a name space, wherein a container management program needs to transmit an ID of a name space of a system container of a mounting point and a system service type into a kernel;
step 2): acquiring a system container name space ID and a system service type of a mounting point system, which are transmitted by a container management program;
step 3): filling the name space ID of the mount point system container, which is transmitted by the container management program, into the position of the name space of the mount point system container of the application container process;
step 4): and finishing the namespace joining request, and returning to the user mode from the kernel mode.
4. The microkernel-based container build and run system of claim 1 wherein the step of exiting the file system mount point namespace is as follows:
step 1): initiating a request for exiting a namespace, wherein a container management program needs to transmit the system service type of the system container of the mounting point into a kernel;
step 2): acquiring a system service type transmitted by a container management program, namely a system service type of a system container of a mounting point system;
step 3): clearing up the namespaces corresponding to the system containers of the mounting points in the kernel;
step 4): and finishing the namespace exit request, and returning to the user mode from the kernel mode.
5. The microkernel-based container build and run system as in claim 1 wherein the step of destroying the file system mount point namespaces is as follows:
step 1): the method comprises the steps that a destruction request is initiated, a container management program sends a request to a system container in an inter-process communication mode, and a naming space of a file system mounting point is destroyed;
step 2): acquiring a name space ID sent by a container management program;
step 3): judging whether the name space ID is 0, if so, representing a root name space, and failing to destroy;
step 4): judging whether a name space corresponding to the name space ID exists or not, and if the name space exists, failing to destroy the name space;
step 5): acquiring mounting point information of a path;
step 6): sending a request for destroying the name space to a corresponding file system;
step 7): clearing a current mounting point information linked list, wherein the current mounting point information linked list comprises a memory release and an assignment pointer of 0;
step 8): clearing corresponding name space elements in the array of the system container;
step 9): and returning the inter-process communication request to complete the destruction of the naming space.
6. The microkernel-based container build and run system as in claim 1 wherein the step of using the file system mount point namespace is as follows:
Step 1): the application container initiates a request and sends an inter-process communication request to the mounting point system service;
step 2): acquiring a name space ID of a current application container in a kernel mode;
step 3): acquiring a corresponding system container from a structure body related to the inter-process communication request;
step 4): switching the process to a system container process and transmitting a namespace ID;
step 5): acquiring a name space ID transferred by a kernel;
step 6): searching the mounting point linked list information of the corresponding name space from the array according to the name space ID and switching;
step 7): executing a specific application container request;
step 8): after the request is processed, returning to a kernel mode;
step 9): and finishing the request and returning to the application container.
7. The microkernel-based container build and run system of claim 1, wherein the control group function module comprises:
statistics and limitation of CPU resources: taking the total number of clock interruption received by a process as the CPU resource usage amount; when the clock interrupt is triggered, a process running on a CPU (central processing unit) which triggers the clock interrupt at present is acquired by a kernel, and the number of the clock interrupts received by the process is added with 1;
modifying the scheduling strategy, entering the scheduling strategy after the time slice of the current process is 0, taking out a process from the waiting queue, calculating the CPU utilization rate, namely the ratio of the number of clock interrupts received by the process to the sum of the number of clock interrupts received by all processes in the waiting queue, and comparing the actual CPU utilization rate with the CPU utilization rate set by a user;
If the CPU utilization rate of the process is too high, the time slice is reduced, and the process may even be kept in a waiting state for a long time; if the CPU utilization rate of the process is too low, the time slice is enlarged;
independently controlling the time slices of each process through a modified scheduling strategy, and ensuring that the CPU utilization rate of each process accords with a value set by a user when each process completely schedules one round;
counting and limiting memory resources: capturing the abnormal occurrence of the page fault abnormality, adding the size of the allocated physical page into the corresponding application container process and application container, and counting the use amount of the memory resource of the physical page;
if the physical memory usage of the application container exceeds the value set by the user in the process, the application container process exceeding the memory usage is killed or the operation of the application container process is paused;
statistics and limiting I/O bandwidth resources: by adding a current limiter system service between the system service of the file system and the system service of the device driver, each I/O request is captured by the current limiter system service, whether the request issuing requirement is met or not is judged according to the type and the size of the I/O request, if not, the issuing of the request is suspended, and if yes, the request is issued, and the token number in the current limiter system service is updated;
The number of tokens increases gradually as the system operates, and there is an upper limit on the number of tokens; once the upper limit is reached, the number of tokens cannot continue to increase.
8. The microkernel-based container build and run system of claim 1, wherein the fail-over functional module comprises: when a fault error occurs, capturing the fault, wherein the most common fault is page fault abnormality, and after capturing, performing exit operation on the abnormal system container process, and recovering all resources;
and then sending a message for restarting the system container to the process management system container, and immediately restarting the system container after the process management system container receives the message, and reconstructing the content and inter-process communication in the system container.
9. The microkernel-based container building and running system according to claim 1, wherein the system container is located in a user state and is also managed as a container, and the CPU and memory overhead of the system service process are counted and the system service process is uniformly managed;
the system accelerates the I/O speed of the microkernel through direct memory access, manages data to be transferred by adopting capability, and directly copies the data from the device to the memory or directly writes the data from the memory to the device through direct memory access, thereby reducing the number of inter-process communication and the number of repeated memory copies.
10. A microkernel-based container construction and operation method, comprising:
namespaces function steps: dividing static system resources; dividing mounting point data, network protocol stack data and process management data, accessing different system resources by different namespaces where different application containers are located, and realizing isolation of the system resources;
control group function steps: dividing dynamic system resources, counting and limiting CPU resources, memory resources and I/O bandwidth resources, using limited resources by different control groups where different application containers are located, and realizing limitation of the system resources;
fault recovery function step: and handling the situation of system service breakdown caused by memory errors.