CN117762644A

CN117762644A - Resource dynamic scheduling technology for distributed cloud computing systems

Info

Publication number: CN117762644A
Application number: CN202410024853.2A
Authority: CN
Inventors: 张瑞红
Original assignee: Huanggang Normal University
Current assignee: Huanggang Normal University
Priority date: 2024-01-08
Filing date: 2024-01-08
Publication date: 2024-03-26

Abstract

The invention relates to a dynamic resource scheduling technology for distributed cloud computing systems, comprising the following steps: S1, requirements analysis: determining the business and technical requirements of the system, evaluating the existing infrastructure and technology, and setting goals and performance indicators; S2, system architecture design: designing the overall structure of the system based on the results of the requirements analysis, and determining the high-level components of the system, such as databases, application servers, and load balancers. By intelligently analyzing and predicting resource demand and adopting an efficient scheduling strategy, the invention significantly improves resource utilization, reduces cost, optimizes system performance, speeds up task execution, and improves the overall reliability and availability of the system. It also supports scalability and flexibility, enabling the system to accommodate future changes in demand. Through continuous monitoring and self-tuning, the system maintains efficient and stable long-term operation, thereby creating greater commercial value for the enterprise or organization.

Description

Resource dynamic scheduling technology for distributed cloud computing systems

Technical field

The present invention relates to the technical field of dynamic resource scheduling, and in particular to a dynamic resource scheduling technology for distributed cloud computing systems.

Background art

Dynamic resource scheduling for distributed cloud computing systems is a key technology designed to handle and optimize resource allocation and scheduling in distributed cloud computing environments. It focuses on how to allocate computing power, storage resources, and network bandwidth sensibly within a distributed cloud architecture so as to ensure efficient system performance and response speed.

Through research into the scheduling technologies currently used in distributed cloud computing environments, the applicant found that they have the following drawbacks:

Existing systems are often unable to respond quickly to sudden changes in workload, which leads to insufficient resource allocation during peak hours and wasted resources during off-peak hours. In addition, lacking an optimization strategy with a global view, existing systems often cannot achieve optimal resource allocation across the whole distributed environment, and legacy scheduling mechanisms frequently hit performance bottlenecks when the system scale grows sharply, so that scheduling efficiency and accuracy cannot be maintained. Finally, when node failures or network problems occur, existing techniques are insufficient to guarantee high availability and fast recovery, which affects service quality.

Therefore, those skilled in the art have proposed a dynamic resource scheduling technology for distributed cloud computing systems.

Summary of the invention

In view of the above problems in the prior art, the main purpose of the present invention is to provide a dynamic resource scheduling technology for distributed cloud computing systems.

The technical solution of the present invention is as follows: a dynamic resource scheduling technology for distributed cloud computing systems, comprising the following steps:

S1. Requirements analysis: determine the business and technical requirements of the system, evaluate the existing infrastructure and technology, and set goals and performance indicators;

S2. System architecture design: design the overall structure of the system based on the results of the requirements analysis, and determine the high-level components of the system, such as databases, application servers, and load balancers, as well as how these components interact; to ensure scalability and flexibility, the network architecture and data storage strategy must also be designed and planned;

S3. Workload analysis: understand the type, size, and frequency of the tasks the system will handle, predict resource demand by analyzing historical data and expected workloads, and use these data to optimize resource allocation and scheduling strategies;

S4. Scheduling algorithm development: based on the results of the workload analysis, develop resource scheduling algorithms that optimize task execution efficiency and resource utilization, including priority queues, rule-based engines, and adaptive scheduling algorithms that use machine learning;

S5. Resource management: monitor, allocate, optimize, and control physical and virtual resources, including functions that automatically manage the resource life cycle, such as allocation, release, and on-demand scaling and migration of resources;

S6. System integration and verification: combine all independently developed modules and components and test them as a complete system, including verifying the functions and performance of the system and ensuring that all parts work together to meet the design specifications;

S7. Monitoring strategy implementation: implement monitoring and alerting systems so that system health, performance changes, and potential problems can be monitored in real time in the production environment;

S8. Continuous system optimization: after deployment, continuously monitor system performance and optimize the system based on feedback, including upgrading hardware, updating software, improving scheduling strategies, and optimizing resource allocation, so that the system can adapt to changing requirements and the latest technology trends.

As a preferred embodiment, step S1 can be refined into:

S11. Evaluate hardware resources: evaluate the existing physical machines, virtual machines, storage solutions, and network resources. The focus of this step is to understand the performance characteristics, expansion capacity, and limitations of the existing infrastructure. The evaluation includes checking hardware age, maintenance records, past performance history, and failure rates, so that reasonable inferences and decisions can be made about future scale-out and resource allocation;

S12. Define the workload model: determine the types of business and tasks the system must support and, based on application characteristics (for example transactional systems, data processing, or real-time analytics), analyze the load they may place on resources. In this step, the expected peak, average, steady state, and trend of the load are defined in the model;

S13. Set scheduling goals: set the goals the scheduler must achieve according to business requirements and the intended use of the system. More specifically, if the system must respond quickly to user requests, minimizing task response time may be prioritized; if cost saving is the main goal, the scheduler may instead be optimized for efficient use of energy and resources;

S14. Determine service level agreement requirements: SLAs define the quality of service and the performance indicators promised to users, including system availability, service response time, and fault repair time;

S15. Establish monitoring requirements: monitoring requirements and metrics should match the defined SLAs and performance goals. Decide on the monitoring metrics, including CPU usage, memory usage, network bandwidth, and storage IOPS, as well as the monitoring interval and the retention policy for historical data. This step helps system administrators understand the state of the system and respond promptly when problems occur;

S16. Standardize resource descriptions: create a unified description for all physical and virtual resources in the system, including CPU model and core count, memory size, storage capacity, and network bandwidth. Standardization is a key basis for resource allocation and scheduling decisions and ensures that the system can manage and use these resources efficiently and consistently.

Through the above technical means, the purpose of this stage is to identify system requirements and available resources. Existing hardware and software resources are evaluated, the type and scale of workloads are determined, scheduling goals are set, and the necessary service level targets (SLAs) are determined. A monitoring scheme is then chosen so that key performance indicators can be tracked, and finally a standard description format for resources is defined.
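By way of a non-limiting illustration, the unified resource description of step S16 could be expressed as a small validated record. The field names, units, and the helper `fits` below are hypothetical and are not part of the original disclosure; they only show one way to capture CPU model and core count, memory size, storage capacity, and network bandwidth in a single format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ResourceDescriptor:
    """Uniform description of one physical or virtual resource (hypothetical schema)."""
    resource_id: str
    kind: str             # "physical" or "virtual"
    cpu_model: str
    cpu_cores: int
    memory_gib: float
    storage_gib: float
    network_gbps: float

    def __post_init__(self):
        # Reject descriptions that a scheduler could not reason about.
        if self.kind not in ("physical", "virtual"):
            raise ValueError(f"unknown resource kind: {self.kind}")
        if self.cpu_cores <= 0:
            raise ValueError("cpu_cores must be positive")
        if min(self.memory_gib, self.storage_gib, self.network_gbps) < 0:
            raise ValueError("capacities must be non-negative")

    def to_json(self) -> str:
        """Serialize to a stable JSON form that other components can exchange."""
        return json.dumps(asdict(self), sort_keys=True)


def fits(resource: ResourceDescriptor, cpu_cores: int, memory_gib: float) -> bool:
    """Return True if the resource can host a task with the given demand."""
    return resource.cpu_cores >= cpu_cores and resource.memory_gib >= memory_gib


# Example values are invented for illustration.
node = ResourceDescriptor("node-01", "virtual", "generic-x86", 8, 32.0, 500.0, 10.0)
```

Because every resource is described with the same fields and units, a placement check such as `fits(node, 4, 16.0)` can be written once and applied to physical and virtual resources alike.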

As a preferred embodiment, step S2 can be refined into:

S21. Determine the system architecture pattern: choose a suitable architecture pattern, such as a microservice architecture, a service-oriented architecture (SOA), an event-driven architecture, or a traditional monolithic architecture. The choice must be based on the requirements analysis and take into account factors such as maintainability, scalability, performance, and team experience;

S22. Design fault-tolerance mechanisms: plan the redundancy and fault-tolerance mechanisms of the system to ensure high availability. The design must include data backup schemes, multi-region deployment, failover and recovery strategies, and stateless components that support horizontal scaling;

S23. Plan the data management strategy: determine the data storage requirements, select a suitable database system, consider data consistency, integrity, and accessibility, and plan data backup, recovery, and archiving strategies;

S24. Design the system monitoring solution: based on the monitoring requirements set in the requirements analysis, design a complete monitoring solution covering how monitoring data are collected, stored, and analyzed and how alert thresholds are set, and select monitoring tools and platforms that provide real-time monitoring and performance analysis;

S25. Plan the network architecture: the detailed network architecture includes internal network isolation, the configuration of public and private subnets, the use of load balancers, and security measures that protect network communication, such as firewalls, intrusion detection systems, and encrypted transmission;

S26. Define system components and interfaces: clarify the responsibilities of each component in the system and define its interfaces. Externally exposed APIs should be easy to understand and use, consistent, and versioned. Communication between internal components must also be defined, including messaging protocols and data formats.

Through the above technical means, the design stage covers the overall planning of the system structure. This includes organizing resources into a hierarchy, setting up the core system components, defining the scheduling process, building internal communication protocols, and planning fault-handling and data persistence strategies. The goal of this stage is to create a solid blueprint that guides the detailed implementation of the system.
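As a non-limiting sketch of step S26, a component interface and a versioned internal message format might be declared as follows. The names `TaskMessage`, `Scheduler`, `InMemoryScheduler`, and `dispatch`, the JSON encoding, and the version number are all assumptions made for illustration; the disclosure does not prescribe a particular protocol or data format.

```python
import json
from dataclasses import dataclass, asdict
from typing import Protocol


@dataclass(frozen=True)
class TaskMessage:
    """Versioned message exchanged between internal components (hypothetical format)."""
    schema_version: int
    task_id: str
    cpu_cores: int
    memory_gib: float

    def encode(self) -> bytes:
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @staticmethod
    def decode(raw: bytes) -> "TaskMessage":
        data = json.loads(raw.decode("utf-8"))
        # Refuse messages written against a schema this component does not know.
        if data.get("schema_version") != 1:
            raise ValueError("unsupported schema version")
        return TaskMessage(**data)


class Scheduler(Protocol):
    """Interface that any scheduler component is expected to satisfy."""

    def submit(self, message: TaskMessage) -> str: ...


class InMemoryScheduler:
    """Trivial implementation used only to exercise the interface."""

    def __init__(self) -> None:
        self.accepted = []

    def submit(self, message: TaskMessage) -> str:
        self.accepted.append(message.task_id)
        return f"accepted:{message.task_id}"


def dispatch(scheduler: Scheduler, raw: bytes) -> str:
    """Decode a wire message and hand it to whichever scheduler is plugged in."""
    return scheduler.submit(TaskMessage.decode(raw))
```

Keeping the schema version inside the message lets a receiver reject incompatible payloads explicitly, which is one way to meet the consistency and version-control requirement stated for interfaces.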

As a preferred embodiment, step S3 can be refined into:

S31. Data collection and cleaning: design an efficient data collection mechanism that automatically collects resource utilization data from different services and applications in the distributed environment. Deploy log aggregation tools such as ELK (Elasticsearch, Logstash, and Kibana) or Fluentd, and monitoring systems such as Prometheus or Datadog, to collect data in a standardized and centralized way. Define cleaning rules, including anomaly detection and correction, time zone synchronization, and unification of data formats;

S32. Feature selection: use data analysis and visualization tools (such as Pandas and Seaborn in Python) to evaluate resource usage patterns, identify key features through statistical tests and correlation analysis, and apply machine learning feature selection techniques (such as recursive feature elimination), supplemented by domain expertise, to optimize the input feature set of the model;

S33. Algorithm development: based on the selected feature set, explore various prediction models, including traditional statistical methods (such as ARIMA) and modern machine learning techniques (such as gradient boosting machines and neural networks). Ensure that the algorithm development environment has the necessary computing resources, and use an appropriate ML framework (such as TensorFlow or scikit-learn);

S34. Model training: using historical data, adopt machine learning workflow management tools (such as MLflow or Kubeflow) to ensure the traceability and consistency of the training process, use hyperparameter search (such as grid search or random search) to optimize the model, and use cross-validation to evaluate its generalization ability;

S35. Model validation: validate model performance on an independent test data set, use suitable evaluation metrics (such as MAE and RMSE) to measure prediction accuracy, and use the validation results to further tune model parameters. Review model performance regularly as application scenarios change, and update the model to adapt to new data patterns;

S36. Integration and deployment: design and implement an automated deployment process for the model, including a CI/CD (continuous integration/continuous deployment) pipeline, so that the prediction model can be integrated smoothly into the existing scheduling system. Ensure that adequate monitoring and alerting mechanisms are in place to track model performance, and manage model versions in the production environment.

Through the above technical means, workload prediction makes it possible to anticipate future resource demand. It starts with data collection and cleaning, followed by feature selection to determine which data points are most relevant, then algorithm development and model training, and finally validation and tuning of the model's accuracy and its effectiveness in actual operation, after which the model is integrated into the scheduler.
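The predict-then-validate loop of steps S33 to S35 can be sketched with a deliberately simple stand-in model. The example below uses simple exponential smoothing rather than the ARIMA, gradient boosting, or neural network models named above, evaluates it walk-forward on held-out points, and reports MAE and RMSE. The function names, the smoothing factor, and the utilization samples are invented for illustration.

```python
import math


def exp_smooth_forecast(history, alpha=0.5):
    """One-step-ahead forecast by simple exponential smoothing."""
    if not history:
        raise ValueError("history must not be empty")
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def walk_forward(series, start, alpha=0.5):
    """Forecast each point from index `start` onward using only earlier points."""
    actual, predicted = [], []
    for t in range(start, len(series)):
        predicted.append(exp_smooth_forecast(series[:t], alpha))
        actual.append(series[t])
    return actual, predicted


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Synthetic CPU utilization samples in percent, invented for illustration.
cpu = [40, 42, 45, 50, 48, 52, 55, 60, 58, 62]
actual, predicted = walk_forward(cpu, start=5)
print(f"MAE={mae(actual, predicted):.2f} RMSE={rmse(actual, predicted):.2f}")
```

Walk-forward evaluation keeps each forecast blind to the value it predicts, which mirrors the requirement in S35 that accuracy be measured on data the model has not seen.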

As a preferred embodiment, step S4 can be refined into:

S41. Policy framework design: build a flexible and extensible scheduling framework that supports plug-in replacement of policies. The framework should integrate seamlessly with the system architecture and be configurable and easy to maintain. Use industry-standard design patterns (such as the strategy and observer patterns) to decouple code and keep the system modular;

S42. Static scheduling algorithm implementation: implement a set of basic static scheduling algorithms, such as first-come, first-served (FCFS), round-robin, and fixed-priority scheduling. More specifically, these algorithms should make decisions simply and quickly without considering external changes. Create standard test cases for these algorithms to verify their performance and correctness;

S43. Dynamic scheduling mechanism: develop dynamic scheduling algorithms that respond to real-time changes in system state. These algorithms consider factors such as current resource utilization, historical load data, service level agreements (SLAs), and priorities, and use bio-inspired algorithms (such as genetic algorithms), heuristic methods, or machine learning techniques to make responsive, adaptive scheduling decisions;

S44. Algorithm optimization: continuously analyze and optimize the performance of existing algorithms, using micro-level and macro-level optimization techniques such as reducing algorithmic complexity, multi-threaded parallel processing, and resource reservation mechanisms. In addition, consider building a simulation environment to evaluate and optimize algorithm behavior under different loads and conditions;

S45. Multi-objective scheduling support: develop scheduling algorithms that support multi-objective decision-making and can balance factors such as response time, resource utilization, energy efficiency, and cost. Use multi-objective optimization theory and techniques, such as Pareto optimization, to develop scheduling strategies that meet different business and technical requirements;

S46. Isolation and security: take inter-tenant isolation and application-level security into account in scheduling decisions, implement role-based access control (RBAC) and hardware resource isolation between tenant workloads, and add security verification and monitoring steps to the policies to protect the system from malicious behavior.
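As a non-limiting illustration of steps S41 and S42, the strategy pattern allows static policies such as FCFS and fixed-priority scheduling to be swapped without changing the scheduler itself. The class names, the `pop` method, and the convention that a lower number means a more urgent task are assumptions of this sketch, not features recited in the disclosure.

```python
import heapq
from collections import deque
from dataclasses import dataclass
from itertools import count
from typing import Optional, Protocol


@dataclass(frozen=True)
class Task:
    name: str
    priority: int = 0   # lower number = more urgent (an assumption of this sketch)


class SchedulingPolicy(Protocol):
    """Plug-in interface: any object with these two methods can act as a policy."""

    def add(self, task: Task) -> None: ...
    def pop(self) -> Optional[Task]: ...


class FCFSPolicy:
    """First-come, first-served: tasks leave in arrival order."""

    def __init__(self) -> None:
        self._queue = deque()

    def add(self, task: Task) -> None:
        self._queue.append(task)

    def pop(self) -> Optional[Task]:
        return self._queue.popleft() if self._queue else None


class FixedPriorityPolicy:
    """Fixed priority: the most urgent task leaves first, ties in arrival order."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = count()   # tie-breaker so Task objects are never compared

    def add(self, task: Task) -> None:
        heapq.heappush(self._heap, (task.priority, next(self._seq), task))

    def pop(self) -> Optional[Task]:
        return heapq.heappop(self._heap)[2] if self._heap else None


class Scheduler:
    """Delegates every ordering decision to the injected policy."""

    def __init__(self, policy: SchedulingPolicy) -> None:
        self.policy = policy

    def run(self, tasks):
        for task in tasks:
            self.policy.add(task)
        order = []
        while (task := self.policy.pop()) is not None:
            order.append(task.name)
        return order
```

The same task list produces different execution orders depending only on the policy passed to `Scheduler`, which also makes each policy easy to cover with the standard test cases called for in S42.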

Through the above technical means, this stage focuses on developing scheduling algorithms that allocate resources effectively. It includes designing a flexible policy framework, implementing static and dynamic scheduling algorithms, and optimizing the algorithms so that scheduling results meet business requirements, while also taking isolation and security into account.
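The Pareto optimization mentioned in step S45 can be illustrated by filtering candidate placements down to the non-dominated set. In the sketch below every objective is assumed to be minimized, and the candidate nodes and their figures (response time, hourly cost, power draw) are invented; a production scheduler would still need a rule for choosing one point from the resulting front.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are minimized (for example response time, cost, energy).
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of the non-dominated candidates, in input order."""
    front = []
    for name, objectives in candidates.items():
        dominated = any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical placements: (response time in ms, cost per hour, power in watts).
placements = {
    "node-a": (120, 0.40, 300),
    "node-b": (90, 0.55, 320),
    "node-c": (130, 0.45, 310),   # worse than node-a on every objective
    "node-d": (200, 0.20, 250),
}
print(pareto_front(placements))
```

Here `node-c` is dropped because `node-a` beats it on all three objectives, while the remaining nodes each represent a different trade-off between speed, cost, and energy.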

As a preferred embodiment, step S5 can be refined into:

S51. Define resource management policies: analyze the characteristics of the different resource types and the business requirements, and formulate corresponding resource management policies: core pinning, priorities, and time-slice quotas for CPU scheduling; priorities and reclamation policies for memory; read and write rate limits for storage; and maximum and minimum bandwidth limits for network bandwidth. The aim is efficient and fair resource use that also satisfies SLA requirements;

S52. Resource allocation algorithm: develop a fine-grained resource allocation algorithm that allocates resources dynamically according to current system resource usage and workload demand. The algorithm takes the interdependencies and constraints of resources into account to achieve responsive and flexibly adjustable resource allocation;

S53. Resource optimization cycle: run a periodic resource optimization cycle that identifies resource usage patterns and efficiency bottlenecks by analyzing historical data, and that periodically adjusts resource allocation, consolidates memory fragments, and rebalances resource distribution, to improve resource efficiency and system stability;

S54. Elastic scaling logic: implement elastic scaling logic that lets the system adjust resource allocation automatically based on real-time monitoring data and prediction results. When the load changes, compute nodes or service instances are added or removed in a timely manner so that applications consistently maintain optimal performance and cost-effectiveness;

S55. Energy efficiency optimization: develop an energy optimization subsystem that monitors the energy usage of the data center, analyzes energy consumption patterns, and reduces energy consumption by adjusting resource usage and scheduling policies, keeping the system environmentally friendly and helping to reduce costs;

S56. Persistence and state recovery: build a state persistence mechanism that periodically saves the current state of the system to persistent storage, such as a database or a distributed storage system. After a system failure, the persisted state can be used to restore operation quickly and minimize the impact of the failure on the business.

Through the above technical means, in the resource management stage, resource management policies are formulated, resource allocation algorithms are developed, and a resource optimization cycle is introduced. In addition, elastic scaling logic is implemented to adapt to load changes, energy use is optimized, and the durability of the system and its resilience to failures are ensured.
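One possible form of the elastic scaling logic in step S54 is a proportional rule with a dead band and hard bounds, similar in spirit to the rule documented for the Kubernetes Horizontal Pod Autoscaler. The function name, parameter names, and default values below are assumptions chosen for illustration; the disclosure does not specify a scaling formula.

```python
import math


def desired_replicas(current, utilization, target,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Proportional scaling rule with a dead band and hard bounds.

    current      -- number of running instances
    utilization  -- observed average utilization per instance (1.0 = 100 %)
    target       -- utilization the system should settle at
    tolerance    -- relative deviation from target that triggers no change
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    ratio = utilization / target
    if abs(ratio - 1.0) <= tolerance:
        wanted = current                       # close enough: avoid flapping
    else:
        wanted = math.ceil(current * ratio)    # scale in proportion to the load
    return max(min_replicas, min(max_replicas, wanted))


# Four instances at 75 % utilization with a 50 % target: scale out.
print(desired_replicas(4, 0.75, 0.5))
```

The dead band keeps small fluctuations from adding and removing instances repeatedly, and the bounds cap cost on the upper side while preserving a minimum level of availability on the lower side.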

As a preferred embodiment, step S6 can be refined into:

S61. Module integration: integrate the completed modules (such as the scheduler, resource manager, and monitoring system) into the main system framework, ensure that the interfaces between modules are compatible, and complete the necessary integration tests to check that data flows and function calls work as expected;

S62. System-wide testing: test the whole system end to end, including functional, integration, system, and acceptance testing, to verify core functions, user stories, and boundary conditions. In addition, test the behavior of the system under abnormal conditions to ensure stability;

S63. Performance evaluation: evaluate the performance of the system under typical and extreme workloads, including response time, throughput, and resource usage efficiency. Use standardized performance evaluation tools and methods, and compare the results with the performance requirements in the business goals to confirm that the predefined targets are met;

S64. Isolation verification: conduct a security audit of the system to check for code vulnerabilities, misconfigurations, and potential security risks, and verify the multi-tenant isolation capability of the system to ensure that the data and operations of different users or organizations are securely isolated;

S65. User case simulation: simulate user operations based on typical user scenarios to verify system functions and the correctness of user and business processes, monitor the behavior of the system under the simulated operations, and adjust the system configuration to better serve user needs;

S66. Load and stress testing: by simulating high-load conditions and deliberately creating stress points, identify performance bottlenecks and potential problems of the system at its limits, and adjust the system architecture and resource allocation based on the test results to ensure stability and reliability in actual operation.

Through the above technical means, this stage integrates the previously independently developed modules into a complete system, performs system-wide testing to ensure that the modules work together correctly, and then evaluates the performance of the system to confirm that it meets the established performance standards. Security and isolation are also assessed, and the system is tested by simulating user operations.
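The performance evaluation of step S63 can be sketched as a small harness that measures per-request latency, derives throughput and percentiles, and compares them with predefined targets. The helper names and the threshold values a caller would pass are hypothetical; a real evaluation would normally rely on a dedicated load-testing tool rather than an in-process loop.

```python
import statistics
import time


def measure_latencies(operation, requests):
    """Call `operation` once per request and return per-call latencies in seconds."""
    latencies = []
    for request in requests:
        started = time.perf_counter()
        operation(request)
        latencies.append(time.perf_counter() - started)
    return latencies


def summarize(latencies, total_seconds):
    """Compute throughput and latency percentiles from measured samples."""
    # n=100 yields 99 cut points; index 49 is the median, index 94 the 95th percentile.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "throughput_rps": len(latencies) / total_seconds,
        "p50": cuts[49],
        "p95": cuts[94],
        "max": max(latencies),
    }


def meets_target(summary, p95_limit, min_throughput):
    """Compare measured results with the performance requirements."""
    return summary["p95"] <= p95_limit and summary["throughput_rps"] >= min_throughput
```

Reporting a high percentile alongside the median matters here because extreme workloads tend to show up in the tail of the latency distribution long before they move the average.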

As a preferred embodiment, step S7 can be refined into:

S71. Deployment planning: create a detailed deployment plan that considers different deployment scenarios, such as blue-green deployment for zero-downtime updates or rolling updates that gradually replace instances of the old version. The plan should include risk assessment, advance notification, a rollback strategy, a deployment schedule, and consideration of critical business periods;

S72. Automated deployment process: develop an automated deployment process that uses a continuous integration and continuous deployment (CI/CD) pipeline to test, package, and deploy new versions automatically. The process should cover code commit, build, automated testing, deployment, and notification, to reduce human error and improve efficiency;

S73. Monitoring strategy implementation: select monitoring tools according to the key performance indicators (KPIs) of the system, establish a monitoring strategy, and monitor resource usage, service health, and abnormal events in real time. Build dashboards and an alerting system so that operations staff can respond quickly to any problem;

S74. Failure drills: schedule regular failure drills that simulate failure scenarios and verify the recovery procedures and backup strategy of the system. These drills help ensure that, when a real emergency occurs, the team can act quickly and accurately and reduce the possible business impact;

S75. Performance tuning: continuously collect performance data and use it to tune the system regularly: identify performance bottlenecks, optimize configuration, apply best practices, and upgrade hardware to meet performance challenges, while considering cost-effectiveness to find the best balance between resource usage and performance;

S76. Documentation and logs: maintain detailed system documentation, including design documents, user manuals, operation guides, and frequently asked questions (FAQs), so that users and administrators can understand and use the system. Implement a log management policy that records system operation in detail, and combine it with log analysis tools to gain insight into system behavior and diagnose problems quickly.

Through the above technical means, in the deployment stage, a detailed deployment plan is drawn up, the deployment process is automated, and a monitoring strategy is deployed to track resource-level and application-level performance. Failure drills and backup verification build readiness for possible failures, and performance is tuned according to actual system operation.
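As a non-limiting sketch of the alerting part of step S73, an alert rule may require a metric to stay above its threshold for several consecutive samples before firing, so that a single spike does not page operations staff. The class name, the window length, and the returned event strings are assumptions made for illustration.

```python
from collections import deque


class ThresholdAlert:
    """Fire when a metric stays above `threshold` for `window` consecutive samples."""

    def __init__(self, metric, threshold, window=3):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.metric = metric
        self.threshold = threshold
        self._recent = deque(maxlen=window)   # True for each sample over threshold
        self.firing = False

    def observe(self, value):
        """Record one sample; return 'fired', 'resolved', or None if nothing changed."""
        self._recent.append(value > self.threshold)
        breached = len(self._recent) == self._recent.maxlen and all(self._recent)
        if breached and not self.firing:
            self.firing = True
            return "fired"
        if self.firing and not self._recent[-1]:
            # The first sample back under the threshold clears the alert.
            self.firing = False
            return "resolved"
        return None


# CPU utilization samples in percent, invented for illustration.
alert = ThresholdAlert("cpu_percent", threshold=90, window=3)
events = [alert.observe(v) for v in [95, 96, 80, 91, 92, 93, 94, 50]]
print(events)
```

In this run the two early readings above 90 do not fire because the third sample drops to 80; the alert fires only after three consecutive breaches and resolves on the first sample back under the threshold.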

As a preferred implementation, step S8 can be refined into:

S81. User feedback collection: establish a comprehensive user feedback mechanism spanning multiple channels, such as online surveys, user forums, direct interviews, and customer support feedback, so that data can be gathered on system usage, performance bottlenecks, suggested improvements, and user satisfaction; analyze these data regularly to identify where the user experience can be improved;

S82. Issue tracking and repair: build an issue tracking system to continuously monitor, record, and classify problems that arise in the system, assign a priority to each problem, and diagnose and fix it within the corresponding time frame; keep cross-team communication smooth so that the relevant departments can cooperate closely and resolve problems quickly;

S83. Performance monitoring and optimization: deploy advanced performance monitoring tools to track system performance metrics in real time, use data analysis and machine learning techniques to identify performance trends and potential problems, and, based on the analysis results, adjust the system or optimize its configuration promptly to maintain optimal performance;

S84. Update iteration: implement a structured update and iteration strategy so that the system can integrate new features, performance improvements, and security patches; ensure that the update process minimizes disruption to live operations, and use automated testing to confirm stability and compatibility before and after each update;

S85. Compliance assurance: regularly check for and apply the latest security updates, monitor security vulnerability databases, and keep the system up to date to prevent potential attacks; at the same time, keep the system compliant with all relevant industry standards and regulatory requirements, and conduct regular compliance audits and assessments;

S86. Technology trend assessment: continuously monitor and evaluate emerging technologies and industry trends, examine the benefits they may bring to the system, organize regular internal knowledge-sharing meetings and technical talks, encourage team members to learn new technologies, and consider integrating those technologies into the system to stay ahead.

Through the above technical means, the system enters the maintenance phase, in which user feedback is collected, problems are tracked and fixed, system performance is continuously monitored, and updates and iterations are carried out regularly. Security and compliance are a key focus of this phase. New technology trends also need to be evaluated on an ongoing basis to decide whether to incorporate them into the system.

Compared with the prior art, the advantages and positive effects of the present invention are as follows:

The invention assesses resource demand through detailed analysis of historical and real-time data, then designs an architecture that meets this demand while remaining flexible enough to handle future changes. The system then uses predictive models to estimate future resource requests so that resources can be allocated precisely, maximizing performance and efficiency. Finally, scheduling algorithms are designed to ensure that resources are allocated correctly and efficiently among different tasks and services, so that the system maintains high performance and high availability. The whole process is continuously optimized and adjusted through feedback and monitoring. By intelligently analyzing and predicting resource demand and adopting efficient scheduling strategies, the scheduling technique significantly improves resource utilization and reduces overhead; it optimizes system performance, speeds up task execution, and improves the overall reliability and availability of the system. It also supports scalability and flexibility, allowing the system to adapt to future changes in demand. Through continuous monitoring and self-adjustment, the system ensures efficient and stable long-term operation, thereby creating greater business value for the enterprise or organization.

Description of the drawings

Figure 1 is a flow chart of the main steps of the present invention;

Figure 2 is a flow chart of step S1 of the present invention;

Figure 3 is a flow chart of step S2 of the present invention;

Figure 4 is a flow chart of step S3 of the present invention;

Figure 5 is a flow chart of step S4 of the present invention;

Figure 6 is a flow chart of step S5 of the present invention;

Figure 7 is a flow chart of step S6 of the present invention;

Figure 8 is a flow chart of step S7 of the present invention;

Figure 9 is a flow chart of step S8 of the present invention.

Detailed description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention.

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

Embodiment

As shown in Figures 1-8, the present invention provides a dynamic resource scheduling technique for distributed cloud computing systems, comprising the following steps:

S8. Continuous system optimization: in the later stage of system deployment, system performance must be monitored continuously and the system optimized based on feedback, including upgrading hardware, updating software, improving scheduling strategies, and optimizing resource allocation, to ensure that the system can adapt to changing demands and the latest technology trends;

Step S1 can be refined into:

S16. Standardized resource description: create a unified description for all physical and virtual resources in the system, covering CPU model and core count, memory size, storage capacity, network bandwidth, and so on; standardization is the key basis for resource allocation and scheduling decisions, ensuring that the system can manage and use these resources efficiently and consistently;

In summary: the purpose of this phase is to identify system requirements and available resources. Existing hardware and software resources are evaluated, the type and scale of the workloads are determined, scheduling goals are set, and the service level agreements (SLAs) that must be met are determined. A monitoring scheme is then chosen so that key performance metrics can be tracked, and finally a standard description format for resources is defined.
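The standardized resource description of S16 can be illustrated with a minimal Python sketch. The patent does not specify a schema; the class name `ResourceDescriptor`, its field names and units, and the helper `can_host` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceDescriptor:
    """Uniform description of one physical or virtual resource (hypothetical schema)."""
    resource_id: str
    kind: str            # "physical" or "virtual" (assumed vocabulary)
    cpu_model: str
    cpu_cores: int
    memory_gib: float
    storage_gib: float
    network_gbps: float

    def __post_init__(self):
        # Reject descriptions that a scheduler could not reason about.
        if self.kind not in ("physical", "virtual"):
            raise ValueError(f"unknown resource kind: {self.kind!r}")
        if self.cpu_cores <= 0:
            raise ValueError("cpu_cores must be positive")
        for name in ("memory_gib", "storage_gib", "network_gbps"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")


def can_host(host, cpu_cores, memory_gib, storage_gib=0.0):
    """Return True if `host` has at least the requested capacity in every dimension."""
    return (
        host.cpu_cores >= cpu_cores
        and host.memory_gib >= memory_gib
        and host.storage_gib >= storage_gib
    )


node = ResourceDescriptor(
    resource_id="node-01",        # illustrative values only
    kind="physical",
    cpu_model="example-cpu",
    cpu_cores=32,
    memory_gib=128.0,
    storage_gib=2000.0,
    network_gbps=10.0,
)
print(can_host(node, cpu_cores=4, memory_gib=16.0))
```

Because every resource is described with the same fields and units, a scheduler can compare a request against any host with one function, whether the host is physical or virtual.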

Step S2 can be refined into:

S26. Defining system components and interfaces: clarify the responsibilities of each component in the system and define its interfaces; externally exposed APIs should be easy to understand and use and should maintain consistency and version control; communication between internal components also needs to be defined, including messaging protocols and data formats;

In summary: the design phase covers the overall planning of the system structure. This includes structuring resources hierarchically, setting up the core system components, defining the scheduling flow, building internal communication protocols, and planning fault handling and data persistence strategies. The goal of this phase is a solid blueprint that guides the detailed implementation of the system.

Step S3 can be refined into:

S36. Integration and deployment: design and implement an automated deployment flow for the model, including a CI/CD (continuous integration/continuous deployment) pipeline, so that the predictive model can be integrated smoothly into the existing scheduling system; ensure that adequate monitoring and alerting mechanisms are in place to track model performance, and manage model versions in the production environment;

In summary: workload prediction aims to anticipate future resource demand. It begins with data collection and cleaning, followed by feature selection to determine which data points are most relevant, then algorithm development and model training, and finally validation and optimization of the model's accuracy and its effectiveness in real operation, after which the model is integrated into the scheduler.
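The predict-then-validate loop summarized above can be sketched in a few lines of Python. This is a deliberately simple stand-in, not the model the patent prescribes: a moving-average forecaster replaces the ARIMA or machine learning models named in S33, and the function names `moving_average_forecast`, `backtest`, `mae`, and `rmse` are assumptions made for this example. The two error metrics are the ones S35 names (MAE and RMSE).

```python
import math


def moving_average_forecast(history, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    recent = history[-window:]
    return sum(recent) / window


def backtest(series, window=3):
    """Walk forward through `series`, forecasting each point from the points before it."""
    actual, predicted = [], []
    for t in range(window, len(series)):
        predicted.append(moving_average_forecast(series[:t], window))
        actual.append(series[t])
    return actual, predicted


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Illustrative CPU-demand series (made-up numbers, arbitrary units).
cpu_demand = [10, 20, 30, 40, 50]
actual, predicted = backtest(cpu_demand, window=2)
print(mae(actual, predicted), rmse(actual, predicted))
```

Walk-forward backtesting only ever forecasts a point from earlier data, which mirrors how the model would be used in production and avoids leaking future values into the evaluation.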

Step S4 can be refined into:

S46. Isolation and security: take inter-tenant isolation and application-level security into account in scheduling decisions, implement role-based access control (RBAC) and hardware resource isolation between tenant workloads, and add security verification and monitoring steps to the policy to protect the system from malicious behavior;

In summary: this phase focuses on developing scheduling algorithms that allocate resources effectively. It includes designing a flexible policy framework, implementing static and dynamic scheduling algorithms, and optimizing the algorithms so that scheduling results meet business requirements, while also taking isolation and security into account.
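The static algorithms named in S42 (first come first served, round-robin, and fixed priority) can be sketched as follows. The `Task` fields, the convention that a lower priority number is more urgent, and the simplification that all tasks are already queued when round-robin starts are assumptions made for this example, not details given in the patent.

```python
import heapq
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    priority: int   # lower number = more urgent (assumed convention)
    arrival: int    # arrival order or timestamp
    work: int = 1   # units of work, consumed by round-robin


def fcfs(tasks):
    """First come, first served: run tasks in arrival order."""
    return [t.name for t in sorted(tasks, key=lambda t: t.arrival)]


def fixed_priority(tasks):
    """Fixed priority: most urgent first, ties broken by arrival order."""
    heap = [(t.priority, t.arrival, t.name) for t in tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[2])
    return order


def round_robin(tasks, quantum=1):
    """Round-robin: each task gets `quantum` units per turn; returns the execution trace.

    Simplification: every task is assumed to be queued before the first turn.
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    queue = deque((t.name, t.work) for t in sorted(tasks, key=lambda t: t.arrival))
    trace = []
    while queue:
        name, remaining = queue.popleft()
        trace.append(name)
        remaining -= quantum
        if remaining > 0:
            queue.append((name, remaining))  # not finished: back of the queue
    return trace


tasks = [Task("a", 2, 0, 2), Task("b", 1, 1, 1), Task("c", 3, 2, 3)]
print(fcfs(tasks))            # ['a', 'b', 'c']
print(fixed_priority(tasks))  # ['b', 'a', 'c']
print(round_robin(tasks))     # ['a', 'b', 'c', 'a', 'c', 'c']
```

None of the three functions consults system state, which is what makes them static: the same input always yields the same order, so they are easy to cover with the standard test cases S42 calls for.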

Step S5 can be refined into:

S56. Persistence and state recovery: build a state persistence mechanism that periodically saves the current system state to persistent storage, such as a database or a distributed storage system, so that after a system failure the persisted state can be used to restore operation quickly and minimize the impact of the failure on the business;

In summary: in the resource management phase, resource management policies are formulated, resource allocation algorithms are developed, and a resource optimization cycle is introduced. In addition, elastic scaling logic is implemented to adapt to load changes, energy use is optimized, and the system's durability and resilience to failures are ensured.
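The elastic scaling logic mentioned above can be sketched as a proportional rule that combines real-time monitoring data with a predicted load. The patent does not give a formula; the function name `desired_instances`, the 60% target utilization, the tolerance band, and the instance limits are all assumptions made for this illustration.

```python
def desired_instances(current, observed_pct, predicted_pct=None,
                      target_pct=60, tolerance_pct=10,
                      min_instances=1, max_instances=20):
    """Return how many instances to run so utilization approaches `target_pct`.

    Utilization values are integer percentages of capacity (assumed convention).
    If a prediction is supplied, the larger of observed and predicted load is used,
    so the pool grows before a forecast peak arrives.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")

    load_pct = observed_pct if predicted_pct is None else max(observed_pct, predicted_pct)

    if abs(load_pct - target_pct) <= tolerance_pct:
        # Inside the tolerance band: keep the pool size to avoid flapping.
        wanted = current
    else:
        # Integer ceiling of current * load / target.
        wanted = -((-current * load_pct) // target_pct)

    return max(min_instances, min(max_instances, wanted))


print(desired_instances(4, 90))                     # scale out
print(desired_instances(4, 65))                     # within tolerance, unchanged
print(desired_instances(10, 20))                    # scale in
print(desired_instances(4, 50, predicted_pct=120))  # scale out ahead of predicted peak
```

The tolerance band and the minimum and maximum limits are the two guards that keep such a rule from oscillating on noisy metrics or from scaling without bound.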

Step S6 can be refined into:

S66. Load stress testing: by simulating high-load conditions and deliberately creating stress points, identify the system's performance bottlenecks and potential problems under extreme conditions, and adjust the system architecture and resource allocation according to the test results to ensure stability and reliability in actual operation;

In summary: this phase integrates the previously independently developed modules into a complete system, carries out system-wide testing to ensure that the modules work together correctly, and then evaluates system performance to confirm that it meets the established performance standards. Security and isolation are also assessed, and the system is tested by simulating user operations.

Step S7 can be refined into:

S76. Documentation and logs: maintain detailed system documentation, including design documents, user manuals, operation guides, and frequently asked questions (FAQs), so that users and administrators can understand and use the system; implement a log management policy that records system operation in detail, and combine it with log analysis tools to gain insight into system behavior and diagnose problems quickly;

In summary: the deployment phase produces a detailed deployment plan, automates the deployment process, and puts monitoring strategies in place to track resource-level and application-level performance. Failure drills and backup verification prepare the system for possible failures, and performance tuning is carried out based on actual system operation.
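A monitoring strategy of the kind described above usually pairs each tracked metric with an alert rule. The sketch below shows one common rule shape, in which an alert fires only after a metric has exceeded its threshold for several consecutive samples. The class name `ThresholdAlert`, the metric name, and the threshold values are hypothetical; the patent does not prescribe a particular rule.

```python
from collections import deque


class ThresholdAlert:
    """Fire when a metric stays above `threshold` for `for_samples` consecutive readings."""

    def __init__(self, metric, threshold, for_samples=3):
        if for_samples < 1:
            raise ValueError("for_samples must be at least 1")
        self.metric = metric
        self.threshold = threshold
        # Sliding window of the most recent breach flags.
        self._recent = deque(maxlen=for_samples)

    def observe(self, value):
        """Record one sample; return True if the alert condition currently holds."""
        self._recent.append(value > self.threshold)
        return len(self._recent) == self._recent.maxlen and all(self._recent)


alert = ThresholdAlert("cpu_utilization", threshold=0.9, for_samples=3)
samples = [0.95, 0.97, 0.50, 0.92, 0.93, 0.99]  # made-up readings
print([alert.observe(v) for v in samples])
# [False, False, False, False, False, True]
```

Requiring consecutive breaches trades a short detection delay for fewer false alarms from momentary spikes, which matters when alerts feed automated scaling or paging.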

Step S8 can be refined into:

S86. Technology trend assessment: continuously monitor and evaluate emerging technologies and industry trends, examine the benefits they may bring to the system, organize regular internal knowledge-sharing meetings and technical talks, encourage team members to learn new technologies, and consider integrating those technologies into the system to stay ahead;

In summary: the system enters the maintenance phase, in which user feedback is collected, problems are tracked and fixed, system performance is continuously monitored, and updates and iterations are carried out regularly. Security and compliance are a key focus of this phase. New technology trends also need to be evaluated on an ongoing basis to decide whether to incorporate them into the system.

Finally, it should be noted that the embodiments described above serve only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in those embodiments may still be modified, or some or all of their technical features replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A dynamic resource scheduling technique for a distributed cloud computing system, characterized by comprising the following steps:

S1. Demand analysis: determining the business and technical requirements of the system, evaluating the existing infrastructure and technology, and setting targets and performance metrics;

S2. Designing the system architecture: designing the overall structure of the system based on the results of the demand analysis, and determining the high-level components of the system, such as databases, application servers, and load balancers, and how these components interact, wherein a network architecture and a data storage strategy are designed and planned to ensure the scalability and flexibility of the system;

S3. Workload analysis: understanding the type, size, and frequency of the tasks the system is to process, predicting resource demand by analyzing historical data and expected workload, and optimizing resource allocation and scheduling strategies based on these data;

S4. Developing scheduling algorithms: developing resource scheduling algorithms, according to the results of the workload analysis, to optimize task execution efficiency and resource utilization, including implementing priority queues, a rule-based engine, and adaptive scheduling algorithms that use machine learning techniques;

S5. Resource management: monitoring, allocating, optimizing, and controlling physical and virtual resources, including developing functions that automatically manage the resource life cycle, such as allocating, releasing, scaling, and migrating resources on demand;

S6. System integration verification: combining all independently developed modules and components and testing them as a complete system, including verifying the functionality and performance of the system and ensuring that the parts work together to meet the design specifications;

S7. Monitoring policy implementation: implementing a monitoring and alerting system so that system health, performance changes, and potential problems can be monitored in real time in the production environment;

S8. Continuous system optimization: in the later stage of system deployment, continuously monitoring system performance and optimizing the system based on feedback, including upgrading hardware, updating software, improving scheduling strategies, and optimizing resource allocation, to ensure that the system can adapt to changing demands and the latest technology trends.

2. The dynamic resource scheduling technique for a distributed cloud computing system according to claim 1, wherein step S1 may be refined as:

S11. Evaluating hardware resources: evaluating existing physical machines, virtual machines, storage solutions, and network resources, with the focus on understanding the performance characteristics, expansion capabilities, and limitations of the existing infrastructure; the evaluation includes checking hardware age, maintenance records, past performance history, and failure rates, so that reasonable inferences and decisions can be made about future scale expansion and resource allocation;

S12. Defining a workload model: determining the types of services and tasks the system needs to support, and analyzing the load they are likely to place on resources according to the characteristics of the application, such as transaction processing, data processing, or real-time analysis; the model must define the expected peak, average, steady state, and trend of the load;

S13. Setting scheduling targets: setting the targets the scheduler is to achieve according to business requirements and expected system use; for example, if the system must respond quickly to user requests, minimizing task response time may be prioritized, and if cost savings are the primary goal, the scheduler may be optimized for energy and resource efficiency;

S14. Determining service level agreement requirements: SLAs define the quality of service and performance metrics promised to users, including system availability, service response time, and failover time;

S15. Establishing monitoring requirements: determining monitoring metrics that match the defined SLAs and performance targets, including CPU utilization, memory usage, network bandwidth, and storage IOPS, together with monitoring intervals and historical data retention policies; this step helps system administrators understand the system state and respond promptly when problems occur;

S16. Standardizing resource descriptions: creating a unified description for all physical and virtual resources in the system, covering CPU model and core count, memory size, storage capacity, network bandwidth, and so on; standardization is the key basis for resource allocation and scheduling decisions, ensuring that the system can manage and use these resources efficiently and consistently.

3. The dynamic resource scheduling technique for a distributed cloud computing system according to claim 1, wherein step S2 may be refined as:

S21. Determining the system architecture pattern: selecting a suitable architecture pattern, such as a microservice architecture, a service-oriented architecture (SOA), an event-driven architecture, or a traditional monolithic architecture, wherein the selection is based on the demand analysis and considers factors such as maintainability, scalability, performance, and team experience;

S22. Designing fault-tolerance mechanisms: planning the redundancy and fault-tolerance mechanisms of the system to ensure high availability, including data backup schemes, multi-zone deployment, failover and recovery strategies, and stateless components that support horizontal scaling;

S23. Planning the data management strategy: determining data storage requirements, selecting a suitable database system, considering data consistency, integrity, and accessibility, and planning data backup, recovery, and archiving strategies;

S24. Designing the system monitoring solution: based on the monitoring requirements set in the demand analysis, designing a complete monitoring solution that specifies how monitoring data are collected, stored, and analyzed and how alert thresholds are set, and selecting monitoring tools and platforms that provide real-time monitoring and performance analysis;

S25. Planning the network architecture: refining the network architecture, including internal network isolation, the layout of public and private subnets, the use of load balancers, and security measures that protect network communication, such as firewalls, intrusion detection systems, and encrypted transmission;

S26. Defining system components and interfaces: clarifying the responsibilities of each component in the system and defining its interfaces, wherein externally exposed APIs should be easy to understand and use and should maintain consistency and version control, and communication between internal components is also defined, including messaging protocols and data formats.

4. The dynamic resource scheduling technique for a distributed cloud computing system according to claim 1, wherein step S3 may be refined as:

S31. Data collection and cleaning: designing an efficient data collection mechanism that automatically collects resource utilization data from the different services and applications in the distributed environment; deploying log aggregation tools such as ELK (Elasticsearch, Logstash, and Kibana) or Fluentd, and monitoring systems such as Prometheus or Datadog, to standardize and centralize data collection; and formulating cleaning rules, including anomaly detection and correction, time zone synchronization, and data format unification;

S32. Feature selection: evaluating resource usage patterns using data analysis and visualization tools (e.g., pandas and seaborn in Python), determining key features through statistical testing and correlation analysis, and applying machine learning feature selection techniques (e.g., recursive feature elimination), aided by expert knowledge, to optimize the input feature set of the model;

S33. Algorithm development: based on the selected feature set, exploring a range of predictive models, including traditional statistical methods (such as ARIMA) and modern machine learning techniques (such as gradient boosting and neural networks), ensuring that the algorithm development environment has the necessary computing resources, and using an appropriate ML framework (e.g., TensorFlow or scikit-learn);

S34. Model training: training on historical data, using machine learning workflow management tools (such as MLflow or Kubeflow) to ensure the traceability and consistency of the training process, optimizing the model through hyperparameter search (such as grid search or random search), and evaluating the model's generalization ability by cross-validation;

S35. Model validation: verifying model performance on an independent test data set, measuring prediction accuracy with appropriate evaluation metrics (such as MAE and RMSE), using the validation results to further tune model parameters, periodically reviewing model performance as the application scenario changes, and updating the model to adapt to new data patterns;

S36. Integration and deployment: designing and implementing an automated deployment flow for the model, including a CI/CD (continuous integration/continuous deployment) pipeline, so that the predictive model can be integrated smoothly into the existing scheduling system, ensuring that adequate monitoring and alerting mechanisms are in place to track model performance, and managing model versions in the production environment.

5. The dynamic resource scheduling technique for a distributed cloud computing system according to claim 1, wherein step S4 may be refined as:

S41. Policy framework design: constructing a flexible and extensible scheduling framework that supports pluggable policies, wherein the framework should integrate seamlessly with the system architecture, be configurable and easy to maintain, and use industry-standard design patterns (such as the Strategy and Observer patterns) to separate concerns in the code and keep the system modular;

S42. Implementing static scheduling algorithms: writing a set of basic static scheduling algorithms, such as first come first served (FCFS), round-robin, and fixed-priority scheduling, wherein the algorithms should make decisions simply and quickly without taking external changes into account, and standard test cases are created for these algorithms to verify their performance and correctness;

S43. Dynamic scheduling mechanism: developing dynamic scheduling algorithms that respond to real-time changes in system state, wherein the algorithms consider current resource utilization, historical load data, service level agreements (SLAs), priorities, and other factors, and use bio-inspired algorithms (such as genetic algorithms), heuristics, or machine learning techniques to achieve dynamic response and adaptive scheduling decisions;

S44. Algorithm optimization: continuously analyzing and optimizing the performance of the existing algorithms, applying micro-level and macro-level optimization techniques such as reducing algorithmic complexity, multithreaded parallel processing, and resource reservation mechanisms, and considering the implementation of a simulation environment to evaluate and optimize algorithm performance under different loads and conditions;

S45. Multi-objective scheduling support: developing scheduling algorithms that support multi-objective decisions and can balance factors such as response time, resource utilization, energy efficiency, and cost, and using multi-objective optimization theory and techniques such as Pareto optimization to develop scheduling strategies that meet different business and technical requirements;

S46. Isolation and security: taking inter-tenant isolation and application-level security into account in scheduling decisions, implementing role-based access control (RBAC) and hardware resource isolation between tenant workloads, and adding security verification and monitoring steps to the policy to protect the system from malicious behavior.

6. The dynamic resource scheduling technique for a distributed cloud computing system according to claim 1, wherein step S5 may be refined as:

S51. Defining resource management policies: analyzing the characteristics of the different resource types and the business requirements, and formulating corresponding resource management policies: core binding, priorities, and time-slice quotas for CPU scheduling; priorities and reclamation policies for memory; read and write rate limits for storage; and maximum and minimum bandwidth limits for the network, so that resources are used efficiently and fairly and SLA requirements are met;

S52. Resource allocation algorithm: developing a fine-grained resource allocation algorithm that allocates resources dynamically according to current system resource usage and workload demand, wherein the algorithm considers the interdependencies and constraints among resources to achieve fast-responding and flexibly adjustable allocation;

S53. Resource optimization cycle: implementing a periodic resource optimization cycle that identifies resource usage patterns and efficiency bottlenecks by analyzing historical data, and adjusts resource allocation at regular intervals, for example by merging memory fragments and rebalancing resource distribution, to improve resource efficiency and system stability;

S54. Elastic scaling logic: implementing elastic scaling logic that allows the system to adjust resource allocation automatically according to real-time monitoring data and prediction results, adding or removing compute nodes or service instances promptly as the load changes, so that applications maintain optimal performance and cost-effectiveness;

S55. Energy efficiency optimization: developing an energy optimization subsystem that monitors the energy use of the data center, analyzes energy consumption patterns, and reduces energy consumption by adjusting resource usage and scheduling strategies, so that the system operates in an environmentally friendly way and costs are reduced;

S56. Persistence and state recovery: building a state persistence mechanism that periodically saves the current system state to persistent storage, such as a database or a distributed storage system, so that after a system failure the persisted state can be used to restore operation quickly and minimize the impact of the failure on the business.

7. The resource dynamic scheduling technique of a distributed cloud computing system according to claim 1, wherein: the step S6 may be refined as:

s61, module integration: integrating the developed modules (such as a scheduler, a resource manager and a monitoring system) into a main system framework, ensuring compatibility among the modules, completing necessary integrated test, and checking whether data flow and function call work as expected;

s62, system range test: performing end-to-end tests on the whole system, including functional tests, integration tests, system tests and acceptance tests, verifying the core functions, user stories and boundary conditions of the system, and in addition, requiring testing the behavior of the system under abnormal conditions to ensure stability;

S63, performance evaluation: evaluating the performance of the system under typical and extreme workloads, including response time, throughput and resource usage efficiency, using standardized performance evaluation tools and methods, and comparing the results with the performance requirements in the business objectives to ensure that the predetermined metrics are met;

S64, isolation verification: performing a security audit of the system, checking for code vulnerabilities, misconfigurations and potential security risks, verifying the multi-tenant isolation capability of the system, and ensuring that the data and operations of different users or organizations are securely isolated;

S65, user case simulation: simulating user operations according to typical user scenarios to verify system functions, ensuring the correctness of user flows and business flows, monitoring the performance of the system under the simulated operations, and adjusting the system configuration to better serve user requirements;

S66, load stress test: identifying performance bottlenecks and potential problems of the system at its limits by simulating high-load conditions and deliberately creating stress points, and adjusting the system architecture and resource allocation according to the test results to ensure the stability and reliability of the system in actual operation.
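The comparison against predetermined metrics described in step S63 could be sketched as follows. This is only an illustrative example under stated assumptions: it uses the 95th-percentile response time (nearest-rank method) and requests per second as the metrics, which the claims do not specify, and the names `percentile`, `evaluate`, `max_p95_ms` and `min_throughput_rps` are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(latencies_ms, duration_s, max_p95_ms, min_throughput_rps):
    """Compare measured latency and throughput with assumed targets."""
    p95 = percentile(latencies_ms, 95)
    # One latency sample is recorded per completed request.
    throughput = len(latencies_ms) / duration_s
    return {
        "p95_ms": p95,
        "throughput_rps": throughput,
        "meets_targets": p95 <= max_p95_ms and throughput >= min_throughput_rps,
    }


if __name__ == "__main__":
    # 100 requests completed in 10 seconds with latencies of 1..100 ms.
    report = evaluate(list(range(1, 101)), 10, max_p95_ms=100, min_throughput_rps=5)
    print(report)
```

The same report can be produced for a typical workload and for an extreme workload, so that the two runs are checked against the same targets.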

8. The resource dynamic scheduling technique of a distributed cloud computing system according to claim 1, wherein: the step S7 may be refined as:

S71, deployment planning: creating a detailed deployment plan that considers various deployment strategies, such as blue-green deployment to achieve zero-downtime updates or rolling updates to gradually replace old-version instances, wherein the plan should include risk assessment, advance notice, a rollback strategy, a deployment schedule and consideration of critical business periods;

S72, automated deployment process: developing an automated deployment process that uses a continuous integration and continuous deployment (CI/CD) pipeline to automatically test, package and deploy new versions, and ensuring that the process covers code commit, build, automated testing, deployment and notification so as to reduce human error and improve efficiency;

S73, implementing a monitoring strategy: selecting monitoring tools according to the key performance indicators (KPIs) of the system, establishing a monitoring strategy that covers real-time resource usage, service health and abnormal events, and building dashboards and an alerting system so that operations personnel can respond quickly to any problem;

S74, failure drills: scheduling regular failure drills that simulate failure scenarios and verify the recovery procedures and backup strategy of the system, the drills helping to ensure that when an emergency actually occurs the team can act quickly and accurately, reducing the possible business impact;

S75, performance tuning: continuously collecting performance data and periodically tuning the system based on the data, identifying performance bottlenecks, optimizing configuration, applying best practices and upgrading hardware to cope with performance challenges, while taking cost-effectiveness into account to find the best balance between resource use and performance;

S76, documentation and logs: maintaining detailed system documentation, including design documents, user manuals, operation guides and frequently asked questions (FAQs), so that users and administrators can understand and use the system; and implementing a log management policy that records system operation in detail and is combined with log analysis tools to provide insight into system behavior and quickly diagnose problems.
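The rolling update with a rollback strategy mentioned in step S71 can be illustrated with a small simulation. This is a sketch under assumptions, not a deployment tool: instances are modeled as a dictionary from instance name to version string, and the function `rolling_update`, its `batch_size` parameter and the `is_healthy` callback are hypothetical names introduced only for this example.

```python
def rolling_update(instances, new_version, batch_size, is_healthy):
    """Replace old-version instances batch by batch (illustrative simulation).

    instances maps instance name to its current version.
    is_healthy(name, version) is a caller-supplied health check.
    Returns (resulting_instances, succeeded).
    """
    updated = dict(instances)  # work on a copy so rollback is trivial
    names = list(instances)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            updated[name] = new_version
        # Stop at the first unhealthy batch and roll everything back.
        if not all(is_healthy(name, new_version) for name in batch):
            return dict(instances), False
    return updated, True


if __name__ == "__main__":
    fleet = {"node-a": "v1", "node-b": "v1", "node-c": "v1"}
    result, ok = rolling_update(fleet, "v2", 2, lambda name, version: True)
    print(result, ok)
```

Updating in batches keeps part of the fleet on the old version until the new version has passed its health check, which is what limits the impact of a bad release.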

9. The resource dynamic scheduling technique of a distributed cloud computing system according to claim 1, wherein: the step S8 may be refined as:

S81, user feedback collection: establishing a comprehensive user feedback collection mechanism comprising multiple channels, such as online surveys, user forums, direct interviews and customer support feedback, ensuring that data about system usage, performance bottlenecks, possible improvement suggestions and user satisfaction can be collected, and periodically analyzing the data to identify points for improving the user experience;

S82, problem tracking and repair: building a problem tracking system to continuously monitor, record and classify problems in the system, assigning a priority to each problem, carrying out the repair work within the corresponding time frame, and ensuring smooth communication between teams so that the relevant departments can cooperate closely to solve problems quickly;

S83, performance monitoring and optimization: deploying advanced performance monitoring tools to track system performance indicators in real time, using data analysis and machine learning techniques to determine performance trends and potential problems, and, based on the analysis results, adjusting the system or optimizing its configuration in a timely manner to maintain optimal performance;

S84, update iteration: implementing a structured update and iteration strategy so that the system can integrate new features, performance improvements and security patches, ensuring that the update process minimizes disruption to live operations, and ensuring stability and compatibility before and after updates through automated testing;

S85, compliance assurance: periodically checking for and applying the latest security updates, monitoring security vulnerability databases, ensuring that the system is updated in time to prevent potential attacks, and conducting periodic compliance audits and assessments to keep the system in line with all relevant industry standards and regulatory requirements;

S86, technology trend evaluation: continuously monitoring and assessing emerging technologies and industry trends and examining their possible benefits to the system, organizing regular internal knowledge-sharing sessions and technical lectures, encouraging team members to learn new technologies, and considering integrating them into the system to maintain a leading edge.
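One simple way to determine a performance trend, as step S83 calls for, is to fit a straight line to recent metric samples and inspect its slope. The sketch below uses ordinary least squares as a stand-in; the claims mention data analysis and machine learning only in general terms and do not name this method. The function names `trend_slope` and `is_degrading` and the `threshold` parameter are hypothetical.

```python
def trend_slope(values):
    """Least-squares slope of equally spaced samples (change per sample)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_x = (n - 1) / 2  # sample indices are 0, 1, ..., n-1
    mean_y = sum(values) / n
    covariance = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def is_degrading(response_times_ms, threshold):
    """Flag a metric whose fitted slope exceeds the assumed threshold."""
    return trend_slope(response_times_ms) > threshold


if __name__ == "__main__":
    # Response time grows by about 2 ms per sample.
    print(trend_slope([10, 12, 14, 16]))
    print(is_degrading([10, 12, 14, 16], threshold=1.0))
```

A rising slope in response time can then trigger the system adjustment or configuration optimization that the step describes.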