JP2006172065A

JP2006172065A - Checkpoint collection method, system and program

Info

Publication number: JP2006172065A
Application number: JP2004362606A
Authority: JP
Inventors: Yasuhiro Nakaoku; 康広中奥; Kenji Matsui; 謙治松井
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-12-15
Filing date: 2004-12-15
Publication date: 2006-06-29

Abstract

Problem to be solved: To provide a technique in which the condition that determines when a checkpoint is collected is specified in terms of system resource usage, relieving the user of the burden of writing checkpoint collection instructions into the job.
Solution: A checkpoint collection method that acquires, during job execution, checkpoint information for restarting a job when a failure occurs. The method comprises: a step of acquiring information indicating the system resource usage of each job and storing it in a storage device; a step of reading from the storage device both a condition for determining, from the system resource usage of each job, whether a checkpoint needs to be collected and the stored usage information, and determining whether a checkpoint needs to be collected; and a step of collecting a checkpoint by storing checkpoint information in the storage device based on the result of that determination.
Selected drawing: FIG. 1

Description

The present invention relates to a checkpoint collection technique that acquires, during job execution, checkpoint information for restarting a job when a failure occurs.

Among checkpoint collection methods that acquire, during job execution, checkpoint information for restarting a job when a failure occurs, one known approach collects job checkpoints periodically at fixed time intervals (see, for example, Non-Patent Document 1). Another determines whether a checkpoint needs to be collected from the result of a calculation on the job execution time and the checkpoint collection time, and collects it accordingly (see, for example, Patent Document 1).

Patent Document 1: JP 2004-94422 A
Non-Patent Document 1: E. N. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, D. B. Johnson, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," Technical Report CMU-CS-96-181, Department of Computer Science, Carnegie Mellon University, 1996

Both of these methods use real time as the checkpoint collection condition, based on the job execution time or the checkpoint collection time. They therefore cannot assess how much of the job has actually been processed, and depending on the load on the computer system, checkpoints were not always collected appropriately. In addition, instructions for collecting checkpoints had to be written inside the job program, which was a burden when creating the job program.

An object of the present invention is to solve the above problems by providing a technique in which the condition that determines when a checkpoint is collected is specified in terms of system resource usage, reducing the user's burden of writing checkpoint collection instructions into the job.

Another object of the present invention is to provide a technique that allows the checkpoint collection condition to be changed according to the system resource usage of a running job.

In a computer system that acquires, during job execution, checkpoint information for restarting a job when a failure occurs, the present invention collects checkpoints according to system resource usage.

In the present invention, input of a condition for determining whether a checkpoint needs to be collected, according to the system resource usage of each job, is first accepted and stored in a storage device. The accepted condition is, for example, a conditional expression giving a threshold for the cumulative CPU usage time or for the cumulative amount of input/output data, or a compound conditional expression that combines several such conditional expressions.

Next, when a job is executed, information indicating the system resource usage of each job is acquired and stored in the storage device. Specifically, the cumulative CPU usage time and the cumulative amount of input/output data, which are the elements of the stored condition, are acquired and stored in the storage device, updating the information that indicates system resource usage.

After the usage information has been updated in this way, the stored condition and the updated usage information are read from the storage device to determine whether a checkpoint needs to be collected, and, based on the result, a checkpoint is collected by storing checkpoint information in the storage device. That is, the updated cumulative CPU usage time and cumulative amount of input/output data are substituted into the conditional expression and a logical calculation is performed to determine whether they satisfy the condition for collecting a checkpoint. If they do, the checkpoint information of the job is stored in the storage device.

According to the present invention, the condition that determines when a checkpoint is collected is specified in terms of system resource usage, which reduces the user's burden of writing checkpoint collection instructions into the job.

(Embodiment 1)
The following describes the computer system of Embodiment 1, which acquires checkpoint information for restarting a job when a failure occurs according to system resource usage.

FIG. 1 is a diagram showing the schematic configuration of the computer system 1 of this embodiment. As shown in FIG. 1, the computer system 1 has a job execution management unit 20, a checkpoint collection execution unit 21, a scheduler 22, a CPU monitoring unit 23, and an I/O monitoring unit 24.

The job execution management unit 20 is a processing unit that manages job execution by controlling the operation of the checkpoint collection execution unit 21, the scheduler 22, the CPU monitoring unit 23, and the I/O monitoring unit 24. The checkpoint collection execution unit 21 is a processing unit that reads from memory the condition table 12 and the compound conditional expression 13, which express the conditions for determining whether a checkpoint needs to be collected according to the system resource usage of each job, together with the resource status table 14, which holds the system resource usage information. It then determines whether a checkpoint needs to be collected and, based on the result, collects a checkpoint by storing checkpoint information on a magnetic disk device.

The scheduler 22 is a processing unit that performs the state transition of jobs from the executable state job queue 3 to the execution state job queue 4. The CPU monitoring unit 23 is a processing unit that monitors and records the CPU time allocated to each job. It is a resource monitoring unit that acquires CPU usage information, such as the cumulative CPU usage time, as information indicating the system resource usage of each job and stores it in the resource status table 14 in memory.

The I/O monitoring unit 24 is a processing unit that monitors and records the amount of data in the I/O processing performed by each job. It is a resource monitoring unit that acquires I/O device usage information, such as the cumulative amount of input/output data, as information indicating the system resource usage of each job and stores it in the resource status table 14 in memory.

The program that causes the computer system 1 to function as the job execution management unit 20, the checkpoint collection execution unit 21, the scheduler 22, the CPU monitoring unit 23, and the I/O monitoring unit 24 is assumed to be recorded on a recording medium such as a CD-ROM, stored on a magnetic disk or the like, and then loaded into memory and executed. The recording medium may be something other than a CD-ROM. The program may be installed from the recording medium onto an information processing apparatus and used there, or it may be used by accessing the recording medium over a network.

As shown in FIG. 1, the computer system 1 of this embodiment has the job execution management unit 20, the executable state job queue 3, and the execution state job queue 4. Functionally, the job execution management unit 20 consists of the checkpoint collection execution unit 21, the scheduler 22, the CPU monitoring unit 23, and the I/O monitoring unit 24.

The computer system 1 has a resource management mechanism for distributing resources appropriately among jobs. In this embodiment the CPU monitoring unit 23 and the I/O monitoring unit 24 constitute that mechanism: the CPU monitoring unit 23 monitors and records the CPU time allocated to each job, and the I/O monitoring unit 24 monitors and records the amount of data in the I/O processing performed by each job.

The executable state job queue 3 and the execution state job queue 4 each manage job information 10-1, 10-2, ..., 10-M. The scheduler 22 moves job information 10 between the queues. The job information 10 contains a job ID 11, a condition table 12, a compound conditional expression 13, and a resource status table 14.

The user 5 submits a job and its accompanying checkpoint collection conditions to the job execution management unit 20. On receiving them, the job execution management unit 20 creates job information 10 in the executable state job queue 3 and initializes the condition table 12 and the compound conditional expression 13 from the checkpoint collection conditions.

The information in the resource status table 14 is updated by the CPU monitoring unit 23 and the I/O monitoring unit 24 after the job information 10 has moved from the executable state job queue 3 to the execution state job queue 4.

FIG. 10 is a diagram showing the hardware configuration of the computer system 1 of this embodiment. As shown in FIG. 10, the computer system 1 holds the job execution management unit 20, the executable state job queue 3, and the execution state job queue 4 in memory, and performs its processing using storage devices such as memory and a magnetic disk device, a communication device, and the like.

Next, the resource status table 14 is described with reference to FIG. 2.
FIG. 2 is a diagram showing an example of the resource status table 14 held in the job information 10 of this embodiment. The purpose of the resource status table 14 is to record the resource usage of the job that holds it. This embodiment records four resource usage items: cumulative CPU usage time 14-01, job real time 14-02, cumulative I/O write amount 14-03, and cumulative I/O read amount 14-04. They mean, respectively, "the time for which the CPU has been allocated to the job," "the real time elapsed since the job was submitted," "the cumulative amount written to I/O devices," and "the cumulative amount read from I/O devices."
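The four items of the resource status table can be pictured as a small per-job record. The following is a minimal sketch only; the class name, field names, and the `add_time` / `add_io` helpers are illustrative assumptions and do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass
class ResourceStatus:
    """Per-job counters mirroring items 14-01 to 14-04 (names are assumed)."""
    cpu_time_s: float = 0.0   # 14-01: cumulative CPU time allocated to the job
    elapsed_s: float = 0.0    # 14-02: real time elapsed since submission
    io_write_bytes: int = 0   # 14-03: cumulative amount written to I/O devices
    io_read_bytes: int = 0    # 14-04: cumulative amount read from I/O devices

    def add_time(self, cpu_s: float, elapsed_s: float) -> None:
        """Accumulate CPU time and real time (role of the CPU monitoring unit)."""
        self.cpu_time_s += cpu_s
        self.elapsed_s += elapsed_s

    def add_io(self, written: int, read: int) -> None:
        """Accumulate I/O volumes (role of the I/O monitoring unit)."""
        self.io_write_bytes += written
        self.io_read_bytes += read


status = ResourceStatus()
status.add_time(cpu_s=0.5, elapsed_s=2.0)
status.add_io(written=4096, read=512)
```

Keeping all four counters in one record per job matches the description that the table belongs to the job information of the job whose usage it records.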

Next, examples of the condition table 12 and the compound conditional expression 13 are described with reference to FIG. 3 and FIG. 4.
FIG. 3 is a diagram showing an example of the condition table 12 held in the job information 10 of this embodiment. The condition table 12 in FIG. 3 holds a plurality of conditional expressions 12-01, 12-02, ..., 12-N for determining whether a checkpoint needs to be collected.

The conditional expressions in the condition table 12 store, as the elements used for the determination, the four items described with FIG. 2 and the threshold corresponding to each. For example, the conditional expression 12-01 with condition number 1 expresses the condition "true when the cumulative CPU usage time exceeds 1000 [seconds]." The number of conditional expressions in the condition table 12 is not limited to four. The compound conditional expression 13 exists to combine these conditional expressions and is written as shown in FIG. 4.
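A condition table of this kind can be sketched as a mapping from condition number to an item and its threshold. In the sketch below, only condition 1 (cumulative CPU usage time over 1000 seconds) is stated explicitly in the text; the numbering of conditions 2 to 4 is inferred from the interpretations of the compound expressions given later and is an assumption, as are all identifier names.

```python
# Condition number -> (resource status item, threshold).
# Only entry 1 is given explicitly in the text; entries 2-4 are assumed.
CONDITION_TABLE = {
    1: ("cpu_s", 1000),         # cumulative CPU usage time [seconds]
    2: ("cpu_s", 10000),        # cumulative CPU usage time [seconds]
    3: ("write_bytes", 50000),  # cumulative I/O write amount [bytes]
    4: ("elapsed_s", 50000),    # job real time [seconds]
}


def condition_holds(number: int, status: dict) -> bool:
    """True when the current value of the item exceeds the stored threshold."""
    item, threshold = CONDITION_TABLE[number]
    return status[item] > threshold


# Hypothetical snapshot of a job's resource status table.
snapshot = {"cpu_s": 1500, "elapsed_s": 2000, "write_bytes": 0, "read_bytes": 0}
results = {n: condition_holds(n, snapshot) for n in CONDITION_TABLE}
```

The comparison is strict (`>`), following the wording "exceeds" in the description, so a value exactly equal to the threshold does not satisfy the condition.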

FIG. 4 is a diagram showing an example of the compound conditional expression 13 held in the job information 10 of this embodiment. In FIG. 4, each number in the compound conditional expression 13 refers, by condition number, to one of the conditional expressions 12-01, 12-02, ..., 12-N in the condition table 12, and these expressions are joined by operators. The compound conditional expression 13 is built from the logical OR operator (+), the logical AND operator (×), the negation operator (an overbar, ￣), and parentheses.

As one example, consider a job that generates a large amount of data. I/O writes dominate such a job, so specifying compound conditional expression 13-1, which depends on the cumulative I/O write amount, lets checkpoints be collected at times that match the progress of the job as a whole.

Read against FIG. 3 and FIG. 4, compound conditional expression 13-1 means "true when the cumulative CPU usage time exceeds 1000 [seconds] and does not exceed 10000 [seconds], or when the cumulative I/O write amount exceeds 50000 [bytes]."
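To make the evaluation of such a compound expression concrete, here is a minimal sketch of a recursive-descent evaluator. Several things are assumptions rather than the patent's notation: the overbar negation is written as a prefix `!`, `*` is accepted alongside `×` for AND, and the string `1*!2+3` is only a guess at how expression 13-1 might be written, since FIG. 4 is not reproduced in the text. Characters other than digits, operators, and parentheses are silently skipped by the tokenizer.

```python
import re


def evaluate(expr: str, truth: dict) -> bool:
    """Evaluate a compound expression over condition numbers.

    Precedence, highest first: negation (!), AND (* or ×), OR (+).
    `truth` maps each condition number to its current truth value.
    """
    tokens = re.findall(r"\d+|[+*×!()]", expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "+":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() in ("*", "×"):
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        tok = take()
        if tok == "!":
            return not parse_not()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        return truth[int(tok)]

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# CPU time between the two thresholds, few writes: conditions 1, 2, 3.
in_window = evaluate("1*!2+3", {1: True, 2: False, 3: False})
# CPU time past the upper threshold, few writes.
past_window = evaluate("1*!2+3", {1: True, 2: True, 3: False})
```

Both operands of each operator are parsed before combining, so the token position stays correct even when the left operand already decides the result.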

As another example, consider a job that communicates over a network. Delays and waiting time arise in such a job depending on how heavily the communication path is used, so specifying compound conditional expression 13-2, which depends on the job execution time, lets checkpoints be collected not only at points that match the cumulative CPU usage time but also at fixed time intervals.

Read against FIG. 3 and FIG. 4, compound conditional expression 13-2 means "true when the cumulative CPU usage time exceeds 1000 [seconds] and does not exceed 10000 [seconds], or when the job real time exceeds 50000 [seconds]."

Next, with reference to FIG. 5, the flow is described in which, as a job moves from the executable state job queue 3 to the execution state job queue 4, the CPU monitoring unit 23 and the I/O monitoring unit 24 update the job information 10, after which the checkpoint collection execution unit 21 determines whether a checkpoint needs to be collected and then collects it.

FIG. 5 is a flowchart showing the checkpoint collection procedure of this embodiment. The processing of FIG. 5 begins when the scheduler 22 of the job execution management unit 20 moves a job from the executable state job queue 3 to the execution state job queue 4 (step 500).

The I/O monitoring unit 24 monitors the I/O processing of the job that has made the transition, obtains the amounts of data the job has written and read, calculates their cumulative amounts, accesses the job information 10 of the job, and updates the cumulative I/O write amount 14-03 and the cumulative I/O read amount 14-04 in its resource status table 14 (step 501). Because the I/O monitoring unit 24 records the amount of data for all I/O processing performed by jobs, it can use the job ID 11 as a key to update the cumulative I/O write amount 14-03 and the cumulative I/O read amount 14-04 in the specific job information 10.

Next, the CPU monitoring unit 23 obtains the CPU usage time and job execution time of the job that has made the transition, calculates their cumulative amounts, accesses the job information 10 of the job, and updates the cumulative CPU usage time 14-01 and the job real time 14-02 in its resource status table 14 (step 502). Because the CPU monitoring unit 23 records all CPU usage time consumed by jobs, it can use the job ID 11 of the job information 10 as a key to update the cumulative CPU usage time 14-01 and the job real time 14-02 in the specific job information 10.

Next, the checkpoint collection execution unit 21 reads from memory the conditional expressions in the condition table 12, the compound conditional expression 13, and the system resource usage information in the resource status table 14, and determines whether a checkpoint needs to be collected (step 503).

That is, the checkpoint collection execution unit 21 analyzes the condition table 12 and the compound conditional expression 13 held in the job information 10, reads from the condition table 12 the condition elements for the condition numbers used in the compound conditional expression 13, and then reads from the resource status table 14 the current values of the resource status items corresponding to those condition elements. It then generates logical expressions that test the current values against the conditional expressions in the condition table 12, substitutes them into the compound conditional expression 13, and collects a checkpoint according to the result of evaluating the compound conditional expression 13 (step 504).

After the checkpoint has been collected, the resource status table 14 is initialized entirely to 0 so that the checkpoint collection condition is not improperly satisfied at the next determination (step 505).
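Steps 501 to 505 can be condensed into one function that runs after each slice of execution. This is a minimal sketch under assumptions: the `Job` class, the dictionary keys, and the `take_checkpoint` callback are hypothetical, and the condition is passed as a plain Python predicate rather than parsed from a condition table and compound expression.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


def _zero_status() -> Dict[str, float]:
    return {"cpu_s": 0.0, "elapsed_s": 0.0, "write_bytes": 0.0, "read_bytes": 0.0}


@dataclass
class Job:
    job_id: int
    should_checkpoint: Callable[[Dict[str, float]], bool]  # stands in for 12 and 13
    status: Dict[str, float] = field(default_factory=_zero_status)  # stands in for 14


def after_time_slice(job: Job, usage: Dict[str, float],
                     take_checkpoint: Callable[[Job], None]) -> bool:
    """Update counters, test the condition, checkpoint and reset if it holds."""
    for key, delta in usage.items():            # steps 501-502: accumulate usage
        job.status[key] += delta
    if not job.should_checkpoint(job.status):   # step 503: evaluate the condition
        return False
    take_checkpoint(job)                        # step 504: collect the checkpoint
    for key in job.status:                      # step 505: reset all counters to 0
        job.status[key] = 0.0
    return True


saved = []
job = Job(job_id=7, should_checkpoint=lambda s: s["cpu_s"] > 1000)
first = after_time_slice(job, {"cpu_s": 600.0, "elapsed_s": 900.0}, lambda j: saved.append(j.job_id))
second = after_time_slice(job, {"cpu_s": 600.0, "elapsed_s": 900.0}, lambda j: saved.append(j.job_id))
```

Because the counters are zeroed after each checkpoint, a threshold such as "CPU time over 1000 seconds" effectively measures usage since the previous checkpoint rather than since job start.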

In a computer system generally, when a job moves to the execution state it is allocated CPU usage time by the job execution management unit 20 and runs in units of that time. Consequently, in this embodiment a job never runs for longer than a predetermined time without the resource status table 14 for that job being updated, and it is guaranteed that the resource status table 14 is updated at predetermined intervals.

Next, an example is described of how, in this embodiment, checkpoint collection conditions are submitted together with a job.

FIG. 6 is a diagram showing an example of a command that submits a job together with its checkpoint collection conditions in this embodiment. FIG. 6 shows that the user 5 can specify the conditions by entering, from an input device, a command 606 in character-string form containing a set of conditional expressions 608 and a compound conditional expression 609.

The first argument of the command in FIG. 6 is the set of conditional expressions 608 to be stored in the condition table 12. The conditional expressions 601 to 604 are separated by ",", and within each expression the resource status item and the threshold are separated by ":". Condition numbers 1, 2, ... are assigned to the conditional expressions in order from the left of the string. The second argument of the command is a character string representing the compound conditional expression 609, and the third argument is the job program executable file name 610.
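The format of the first argument (expressions separated by ",", item and threshold separated by ":", numbers assigned from the left) can be sketched as a small parser. The item names `cpu` and `write` in the example string are hypothetical, since the actual spelling used in FIG. 6 is not reproduced in the text, and the function name is likewise an assumption.

```python
def parse_conditions(arg: str) -> dict:
    """Turn 'item:threshold,item:threshold,...' into {number: (item, threshold)}.

    Condition numbers start at 1 and follow the left-to-right order of the string.
    """
    table = {}
    for number, text in enumerate(arg.split(","), start=1):
        item, sep, threshold = text.partition(":")
        if not sep or not item.strip() or not threshold.strip():
            raise ValueError(f"condition {number} is not of the form item:threshold")
        table[number] = (item.strip(), float(threshold))
    return table


# Hypothetical first argument of a job submission command.
table = parse_conditions("cpu:1000,cpu:10000,write:50000")
```

The numbers produced here are what a compound conditional expression in the second argument would refer to.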

This assumes that the kinds of condition elements in the condition table 12, the syntax of compound conditional expressions and conditional expressions, and the job submission command 607 have been published in advance, for example in a manual. The user 5 makes the specification by passing the file name 610 of the job program executable that the user created as an argument of the job submission command.

As described above, the computer system of this embodiment collects checkpoints according to system resource usage. The condition that determines when a checkpoint is collected is therefore specified in terms of system resource usage, which reduces the user's burden of writing checkpoint collection instructions into the job.

(Embodiment 2)
The following describes the computer system of Embodiment 2, which, after the conditions for determining whether a checkpoint needs to be collected have been accepted, accepts conditions for a specified job again and updates them.

This embodiment allows the user 5 to resubmit conditions even when conditions for determining whether a checkpoint needs to be collected have already been given to a job and that job is running.

FIG. 7 is a diagram showing an example of the job information list display and the display command of this embodiment. To resubmit conditions, the user 5 must first be able to specify which job's conditions are to be updated. This embodiment therefore provides the user 5 with a command that shows on an output device the job information 10 currently present in the computer system 1, the job ID 11 that uniquely identifies it, and its accompanying checkpoint collection conditions.

When this command is issued, the job execution management unit 20 of this embodiment reads from memory the job ID 11 identifying the job information 10, the contents of the condition table 12, and the compound conditional expression 13, and displays them on the output device, as illustrated by the job information list display example 701 in FIG. 7. Each row of the job information example 703 corresponds to one item of job information 10, and the job information display items 702 describe the displayed fields of the job information 10.

Based on this result, the user 5 changes the checkpoint collection conditions by passing arguments to the condition change command 707. The arguments are, respectively, the set of conditional expressions 708, the compound conditional expression 709 to be written over the existing one, and the target job ID 711. Here the set of conditional expressions 708 consists of the conditional expressions 704 and 705. As the last row of the job information example 703 shows, a job can also be executed with the checkpoint collection conditions omitted.

By submitting the command 706 in this way, the checkpoint collection conditions for a specific job can be changed.

FIG. 8 is a flowchart showing the job information list display procedure of this embodiment. The processing of FIG. 8 is described using the operation on receiving the command 606 of FIG. 6 as an example.

First, the job ID 11, the condition table 12, the compound conditional expression 13, and the resource status table 14 are obtained from the job information 10 stored in the executable state job queue 3 (step 801).

Next, the job ID 11, the condition table 12, the compound conditional expression 13, and the resource status table 14 are obtained from the job information 10 stored in the execution state job queue 4 (step 802). The data obtained in step 802 is then displayed on the output device (step 803).
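The listing procedure can be sketched as a walk over both queues that produces one row per job. This is only an illustration: the dictionary keys, the tab-separated row layout, and the `-` placeholder for a job submitted without conditions are assumptions, and the sketch shows the jobs of both queues, which is one reading of steps 801 to 803.

```python
def list_jobs(executable_queue: list, execution_queue: list) -> list:
    """Collect one display row per job from both queues (steps 801 and 802)."""
    rows = []
    for queue in (executable_queue, execution_queue):
        for job in queue:
            rows.append("\t".join([
                str(job["id"]),
                job.get("conditions", "-"),  # "-" when no conditions were given
                job.get("compound", "-"),
            ]))
    return rows


# Hypothetical queue contents.
waiting = [{"id": 3, "conditions": "cpu:1000", "compound": "1"}]
running = [{"id": 1, "conditions": "cpu:1000,write:50000", "compound": "1+2"}, {"id": 2}]

for row in list_jobs(waiting, running):   # step 803: show the rows
    print(row)
```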

FIG. 9 is a flowchart showing the procedure for updating the condition table 12 and the compound conditional expression 13 of the job information 10 in this embodiment. The processing of FIG. 9 is described using the case where the command 706 of FIG. 7 is received as an example.

First, the job information 10 in memory is consulted to determine whether a job with the ID specified by the user 5 exists (step 901). If no such job exists, a message to that effect is displayed and the processing ends without further action (step 902).

If a job with the ID specified by the user 5 exists, the character string of the conditional expression specified by the user 5 is compared with predetermined grammar rules to check whether the expression is well formed (step 903). If the expression is malformed, a message to that effect is displayed and the processing ends without further action (step 904).

If the conditional expression is well formed, the specified contents are stored in the condition table 12 and the compound conditional expression 13 of the in-memory job information 10 for the specified job, thereby updating them (step 905). The job information 10 for the updated job is then displayed in the manner of the job information list display example 701 of FIG. 7 (step 906).
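The update flow of steps 901 to 906 can be sketched as follows. The patent states only that the condition string is compared with predetermined grammar rules, so the grammar used here (`<resource> <operator> <integer>` for a single condition, labels joined by `and`/`or` for the compound expression) is an assumption for illustration, as are the names `update_conditions`, `compound_is_valid`, and the resource names.

```python
import re

# Assumed grammar for one condition: "<resource> <operator> <integer>".
CONDITION_RE = re.compile(
    r"(cpu_time|real_time|io_write|io_read)\s*(>=|<=|>|<)\s*\d+")


def compound_is_valid(compound, labels):
    """Assumed grammar: known labels alternating with 'and' / 'or'."""
    tokens = compound.split()
    if len(tokens) % 2 == 0:          # empty, or ends with a dangling operator
        return False
    operands, operators = tokens[0::2], tokens[1::2]
    return (all(t in labels for t in operands)
            and all(t in ("and", "or") for t in operators))


def update_conditions(jobs, job_id, conditions, compound):
    """Overwrite a job's conditions and return the message to display."""
    job = jobs.get(job_id)
    if job is None:                                    # steps 901-902
        return f"job {job_id} not found"
    malformed = [c for c in conditions.values()
                 if not CONDITION_RE.fullmatch(c)]
    if malformed or not compound_is_valid(compound, conditions):
        return "invalid conditional expression"        # steps 903-904
    job["conditions"] = dict(conditions)               # step 905
    job["compound"] = compound
    return f"job {job_id}: {job['conditions']} / {job['compound']}"  # step 906


if __name__ == "__main__":
    jobs = {7: {"conditions": {}, "compound": ""}}
    print(update_conditions(
        jobs, 7, {"A": "cpu_time >= 600", "B": "io_write >= 1000"}, "A or B"))
```

Both validation failures return before the job information is touched, which mirrors the flowchart: a rejected request leaves the existing condition table and compound expression unchanged.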

As described above, the computer system of this embodiment accepts the condition for determining whether checkpoint collection is necessary a second time and updates it, so the checkpoint collection condition can be changed according to how the running job is using system resources.
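To illustrate why updating the condition matters for a running job, the following sketch evaluates a compound conditional expression against a resource status table before and after a change. It is a hypothetical illustration: the condition syntax, the resource names, the function names `holds` and `checkpoint_needed`, and the left-to-right evaluation without operator precedence are all assumptions, and the sketch assumes a condition has been specified for the job.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}


def holds(condition, resources):
    """Evaluate one condition of the assumed form '<resource> <op> <integer>'."""
    name, op, value = condition.split()
    return OPS[op](resources[name], int(value))


def checkpoint_needed(conditions, compound, resources):
    """Evaluate e.g. 'A or B' left to right (no precedence, an assumption)."""
    tokens = compound.split()
    result = holds(conditions[tokens[0]], resources)
    for op, label in zip(tokens[1::2], tokens[2::2]):
        value = holds(conditions[label], resources)
        result = (result and value) if op == "and" else (result or value)
    return result


if __name__ == "__main__":
    # Snapshot of a hypothetical resource status table for a running job.
    resources = {"cpu_time": 300, "io_write": 50}

    # Condition given at submission: checkpoint after 600 s of CPU time.
    print(checkpoint_needed({"A": "cpu_time >= 600"}, "A", resources))  # False

    # Condition after the user lowers the threshold while the job runs.
    updated = {"A": "cpu_time >= 200", "B": "io_write >= 1000"}
    print(checkpoint_needed(updated, "A or B", resources))              # True
```

With the same resource usage, the original condition does not trigger a checkpoint while the updated one does.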

FIG. 1 is a diagram showing the schematic configuration of the computer system of Embodiment 1.
FIG. 2 is a diagram showing an example of the resource status table 14 held by the job information 10 of Embodiment 1.
FIG. 3 is a diagram showing an example of the condition table 12 held by the job information 10 of Embodiment 1.
FIG. 4 is a diagram showing an example of the compound conditional expression 13 held by the job information 10 of Embodiment 1.
FIG. 5 is a flowchart showing the procedure of the checkpoint collection processing of Embodiment 1.
FIG. 6 is a diagram showing an example of a command for submitting a job and its checkpoint collection conditions in Embodiment 1.
FIG. 7 is a diagram showing an example of the job information list display and the display command of Embodiment 2.
FIG. 8 is a flowchart showing the procedure of the job information list display processing of Embodiment 2.
FIG. 9 is a flowchart showing the procedure for updating the condition table 12 and the compound conditional expression 13 of the job information 10 of Embodiment 2.
FIG. 10 is a diagram showing the hardware configuration of the computer system 1 of Embodiment 1.

Explanation of symbols

1: computer system; 3: executable job queue; 4: execution state job queue; 5: user; 10: job information; 11: job ID; 12: condition table; 13: compound conditional expression; 14: resource status table; 20: job execution management unit; 21: checkpoint collection execution unit; 22: scheduler; 23: CPU monitoring unit; 24: I/O monitoring unit; 14-01: cumulative CPU usage time; 14-02: job real time; 14-03: cumulative I/O write amount; 14-04: cumulative I/O read amount; 12-01 to 12-N: conditional expressions; 601 to 604: conditional expressions; 606: command; 607: job submission command; 608: set of conditional expressions; 609: compound conditional expression; 610: job program executable file name; 701: job information list display example; 702: job information display items; 703: job information example; 704 and 705: conditional expressions; 706: command; 707: condition change command; 708: set of conditional expressions; 709: compound conditional expression; 711: job ID.

Claims

1. A checkpoint collection method for acquiring, during job execution, checkpoint information for restarting a job when a failure occurs, the method comprising:
a step of acquiring information indicating the usage status of system resources for each job and storing the information in a storage device; a step of reading, from the storage device, a condition for determining whether checkpoint collection is necessary according to the usage status of system resources for each job, together with the stored information on the usage status of system resources, and determining whether checkpoint collection is necessary; and a step of collecting a checkpoint by storing checkpoint information in the storage device based on the determination result.

2. The checkpoint collection method according to claim 1, wherein a compound conditional expression obtained by combining a plurality of conditional expressions for determining whether checkpoint collection is necessary is used as the condition for determining whether checkpoint collection is necessary.

3. The checkpoint collection method according to claim 1, wherein the condition for determining whether checkpoint collection is necessary for the job is accepted when the job is submitted.

4. The checkpoint collection method according to claim 1, wherein, after acceptance of the condition for determining whether checkpoint collection is necessary has been completed, the condition for a designated job is accepted again and the condition is updated.

5. The checkpoint collection method according to any one of claims 1 to 4, wherein the cumulative usage time of the CPU is acquired as the information indicating the usage status of system resources.

6. The checkpoint collection method according to claim 1, wherein the cumulative amount of input/output data accessed is acquired as the information indicating the usage status of system resources.

7. A computer system that acquires, during job execution, checkpoint information for restarting a job when a failure occurs, the computer system comprising:
a resource monitoring unit that acquires information indicating the usage status of system resources for each job and stores the information in a storage device; and a checkpoint collection execution unit that reads, from the storage device, a condition for determining whether checkpoint collection is necessary according to the usage status of system resources for each job, together with the stored information on the usage status of system resources, determines whether checkpoint collection is necessary, and collects a checkpoint by storing checkpoint information in the storage device based on the determination result.

8. The computer system according to claim 7, wherein, after acceptance of the condition for determining whether checkpoint collection is necessary has been completed, the condition for a designated job is accepted again and the condition is updated.

9. A program for causing a computer to execute a checkpoint collection method for acquiring, during job execution, checkpoint information for restarting a job when a failure occurs, the program causing the computer to execute:
a step of acquiring information indicating the usage status of system resources for each job and storing the information in a storage device; a step of reading, from the storage device, a condition for determining whether checkpoint collection is necessary according to the usage status of system resources for each job, together with the stored information on the usage status of system resources, and determining whether checkpoint collection is necessary; and a step of collecting a checkpoint by storing checkpoint information in the storage device based on the determination result.

10. The program according to claim 9, wherein, after acceptance of the condition for determining whether checkpoint collection is necessary has been completed, the condition for a designated job is accepted again and the condition is updated.