JP4712279B2

JP4712279B2 - Method and apparatus for controlling extensible computing system

Info

Publication number: JP4712279B2
Application number: JP2002508204A
Authority: JP
Inventors: Aziz, Ashar; Markson, Tom; Patterson, Martin; Gray, Mark
Original assignee: Terraspring, Inc.
Priority date: 2000-06-20
Filing date: 2001-06-13
Publication date: 2011-06-29
Anticipated expiration: 2021-06-13
Also published as: JP2004508616A

Description

【０００１】
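【Technical Field of the Invention】
The present invention relates generally to data processing. More particularly, the invention relates to a method and apparatus for controlling a computing grid.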
【０００２】
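【Problems to Be Solved by the Invention】
Builders of today's Web sites and other computer systems face many challenging system planning problems. These include capacity planning, site availability, and site security. Meeting these objectives requires finding and hiring trained personnel capable of designing and operating a site that may be large and complex. For many organizations, designing, building, and operating a large site is not their core business, so finding and hiring such personnel has proven difficult.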
【０００３】
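One approach has been to host an enterprise Web site at a third-party site, co-located with the Web sites of other enterprises. Such outsourcing facilities are currently available from companies such as Exodus, AboveNet, and GlobalCenter. These facilities provide physical space, redundant network connectivity, and power generation facilities that are shared among many customers.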
【０００４】
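Hosting a Web site at an outsourced facility greatly reduces the burden of establishing and maintaining the site, but it does not relieve the enterprise of all the problems associated with maintaining it. The enterprise must still perform many tasks relating to its computing infrastructure while building, operating, and growing its installation. The information technology managers of enterprises hosted at such facilities remain responsible for manually selecting, installing, configuring, and maintaining their own computing equipment at the facility. They must still confront difficult issues such as resource planning and handling peak capacity. In particular, managers must forecast resource demand and request resources from the outsourcing company to meet that demand. Many managers ensure sufficient capacity by requesting substantially more resources than needed, as a cushion against unexpected peak demand. Unfortunately, this results in significant unused capacity and increases the enterprise's overhead for hosting its Web site.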
【０００５】
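Even when an outsourcing company provides a complete computing facility, including servers, software, and power facilities, expanding and growing the facility is not easy for the outsourcing company, because growth requires the same manual, error-prone administrative steps. The problem of capacity planning for unexpected peak demand also remains. In this case, the outsourcing company often maintains a significant amount of unused capacity.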
【０００６】
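Further, the Web sites managed by an outsourcing company often have differing requirements. For example, some enterprises need the ability to operate and control their Web sites independently. Others require a particular type or level of security that isolates their Web sites from all other sites co-located at the outsourcing company. As another example, some enterprises require a secure connection to an enterprise intranet located elsewhere.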
【０００７】
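Web sites also differ in internal topology. Some sites simply comprise a row of Web servers that are load balanced by a Web load balancer. Suitable load balancers include Local Director from Cisco Systems, Inc., BigIP from F5 Labs, and Web Director from Alteon. Other sites are constructed in a multi-tier fashion, in which a row of Web servers handles Hypertext Transfer Protocol (HTTP) requests but the bulk of the application logic is implemented in separate application servers. These application servers may in turn need to be connected back to a tier of database servers.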
【０００８】
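Some of these different configuration scenarios are shown in FIG. 1A, FIG. 1B, and FIG. 1C. FIG. 1A is a block diagram of a simple Web site comprising a single computing element or machine 100 that includes a CPU 102 and a disk 104. Machine 100 is connected to the global, packet-switched data network known as the Internet 106, or to another network. Machine 100 may be housed in a co-location service of the type described above.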
【０００９】
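FIG. 1B is a block diagram of a 1-tier Web server farm 110 comprising a plurality of Web servers WSA, WSB, WSC. Each Web server is coupled to a load balancer 112 that is coupled to Internet 106. The load balancer divides traffic among the servers to maintain a balanced processing load on each server. Load balancer 112 may also include, or be coupled to, a firewall that protects the Web servers from unauthorized traffic.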
【００１０】
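FIG. 1C shows a 3-tier server farm 120 comprising a tier of Web servers W1, W2, etc., a tier of application servers A1, A2, etc., and a tier of database servers D1, D2, etc. The Web servers handle HTTP requests. The application servers execute the bulk of the application logic. The database servers execute database management system (DBMS) software.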
【００１１】
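Given the diversity in topology of the kinds of Web sites that need to be constructed, and the varying requirements of the corresponding enterprises, it may appear that the only way to construct large-scale Web sites is to physically custom build each one. Indeed, many organizations are separately struggling with the same problems and custom building each Web site from scratch. This is inefficient and involves a large amount of duplicated work at different enterprises.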
【００１２】
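Another problem with the conventional approach is resource and capacity planning. A Web site may receive vastly different levels of traffic on different days or at different hours within each day. At peak traffic times, the Web site hardware or software may be unable to respond to requests in a reasonable time because it is overloaded. At other times, the Web site hardware or software has excess capacity and is underutilized. In the conventional approach, it is difficult to strike a balance between having sufficient hardware and software to handle peak traffic and avoiding excessive cost or over-capacity. Many Web sites never find the right balance and chronically suffer from under-capacity or excess capacity.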
【００１３】
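Yet another problem is failure induced by human error. A great potential hazard in the current approach of using manually constructed server farms is that human error in configuring a new server into a live server farm can cause the server farm to malfunction, possibly resulting in loss of service to users of the Web site.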
【００１４】
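Based on the foregoing, there is a clear need in this field for improved methods and apparatus for providing a computing system that is instantly and easily extensible on demand without requiring custom construction.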
【００１５】
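There is also a need for a computing system that supports creation of multiple segregated processing nodes, each of which can be expanded or collapsed as needed to account for changes in traffic throughput.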
【００１６】
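There is a further need for a method and apparatus for controlling such an extensible computing system and its constituent segregated processing nodes. Other needs will become apparent from the disclosure provided herein.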
【００１７】
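【Disclosure of the Invention】
According to one aspect of the invention, the foregoing needs, and other needs that will become apparent from the following description, are met by a method and apparatus for controlling and managing highly scalable, highly available, and secure data processing sites, based on a wide-scale computing fabric ("computing grid"). The computing grid is physically constructed once, and then logically divided up for various organizations on demand. The computing grid comprises a large plurality of computing elements that are coupled to one or more VLAN switches and to one or more storage area network (SAN) switches. A plurality of storage devices are coupled to the SAN switches and may be selectively coupled to one or more of the computing elements through appropriate switching logic and commands. One port of the VLAN switch is coupled to an external network, such as the Internet. A supervisory mechanism, layer, machine, or process is coupled to the VLAN switches and SAN switches.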
【００１８】
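Initially, all storage devices and computing elements are assigned to an idle pool. Under program control, the supervisory mechanism dynamically configures the VLAN switches and SAN switches to couple their ports to one or more computing elements and storage devices. As a result, such elements and devices are logically removed from the idle pool and become part of one or more virtual server farms (VSFs) or instant data centers (IDCs). Each VSF computing element is pointed to, or otherwise associated with, a storage device that contains a boot image usable by the computing element for bootstrap operation and production execution.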
【００１９】
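According to one aspect of the invention, the supervisory layer is a control plane comprising a control mechanism hierarchy that includes one or more master control process mechanisms communicatively coupled to one or more slave control process mechanisms. The one or more master control process mechanisms allocate and de-allocate slave control process mechanisms based upon slave control process mechanism loading. The one or more master control process mechanisms instruct the slave control process mechanisms to establish IDCs by selecting subsets of processing and storage resources. The one or more master control process mechanisms perform periodic health checks on the slave control process mechanisms. Slave control process mechanisms that do not respond or that have failed are restarted. Additional slave control process mechanisms are started to replace those that cannot be restarted. The slave control process mechanisms in turn perform periodic health checks on the master control process mechanisms. When a master control process mechanism has failed, a slave control process mechanism is elected to become the new master control process mechanism and replace the failed one.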
【００２０】
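By physically constructing the computing grid once, and securely and dynamically allocating portions of the computing grid to various organizations on demand, economies of scale are achieved that are difficult to achieve when each site is custom built.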
【００２１】
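The present invention is illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which like reference numerals refer to similar elements.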
【００２２】
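【Embodiments of the Invention】
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.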
【００２３】
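Virtual Server Farm (VSF)
According to one embodiment, a wide-scale computing fabric ("computing grid") is provided. The computing grid can be physically constructed once, and then logically partitioned on demand. A part of the computing grid is allocated to each of a plurality of enterprises or organizations. Each organization's logical portion of the computing grid is referred to as a virtual server farm (VSF). Each organization retains independent administrative control of its VSF. Each VSF can change dynamically in terms of number of CPUs, storage capacity and disks, and network bandwidth, based on real-time demands placed on the server farm or on other factors. Each VSF is secure from every other organization's VSF, even though all VSFs are logically created out of the same physical computing grid. A VSF can be connected back to an intranet using either a private leased line or a virtual private network (VPN), without exposing the intranet to other organizations' VSFs.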
【００２４】
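An organization can access only the data and computing elements in the portion of the computing grid allocated to it, that is, in its VSF, even though it may exercise full (e.g., super-user or root) administrative access to those computers and can observe all traffic on the local area networks (LANs) to which those computers are connected. According to one embodiment, this is accomplished using a dynamic firewalling scheme, in which the security perimeter of the VSF expands and shrinks dynamically. Each VSF can be used to host the content and applications of an organization that may be accessed via the Internet, an intranet, or an extranet.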
【００２５】
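Configuration and control of the computing elements and their associated networking and storage elements is performed by a supervisory mechanism that is not directly accessible through any of the computing elements in the computing grid. For convenience, in this document the supervisory mechanism is referred to generally as a control plane and may comprise one or more processors or a network of processors. The supervisory mechanism may comprise a supervisor, controller, etc. Other approaches may be used, as described herein.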
【００２６】
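The control plane is implemented on a completely independent set of computing elements assigned for supervisory purposes, such as one or more servers that may be interconnected in a network or by other means. The control plane performs control actions on the computing, networking, and storage elements of the computing grid through special control ports or interfaces of the networking and storage elements in the grid. The control plane provides a physical interface to the switching elements of the system, monitors the loads of computing elements in the system, and provides administrative and management functions using a graphical user interface or other suitable user interface.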
【００２７】
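Computers used to implement the control plane are logically invisible to computers in the computing grid (and therefore in any specific VSF) and cannot be attacked or subverted in any way via elements in the computing grid or from external computers. Only the control plane has physical connections to the control ports on devices in the computing grid, which control membership in a particular VSF. The devices in the computing grid can be configured only through these special control ports, and therefore computing elements in the computing grid are unable to change their security perimeter or to access storage or computing devices that they are not authorized to access.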
【００２８】
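Thus, a VSF allows an organization to work with computing facilities that appear to comprise a private server farm, dynamically created out of a large-scale shared computing infrastructure, namely the computing grid. A control plane coupled with the computing architecture described herein provides a private server farm whose privacy and integrity are protected through access control mechanisms implemented in the hardware of the devices of the computing grid.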
【００２９】
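The control plane controls the internal topology of each VSF. The control plane can take the basic interconnection of computers, network switches, and storage network switches described herein and use them to create a variety of server farm configurations. These include, but are not limited to, single-tier Web server farms front-ended by a load balancer, as well as multi-tier configurations in which a Web server talks to an application server, which in turn talks to a database server. A variety of load balancing, multi-tiering, and firewalling configurations are possible.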
【００３０】
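Computing Grid
The computing grid may exist in a single location or may be distributed over a wide area. First, this document describes the computing grid in the context of a single building-sized network composed purely of local area technologies. Then, this document describes the case in which the computing grid is distributed over a wide area network (WAN).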
【００３１】
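FIG. 2 is a block diagram of one configuration of an extensible computing system 200 that includes a local computing grid 208. In this document, "extensible" generally means that the system is flexible and scalable, having the capability to provide increased or decreased computing power to a particular enterprise or user upon demand. The local computing grid 208 is composed of a large number of computing elements CPU1, CPU2, ... CPUn. In an exemplary embodiment, there may be 10,000 computing elements or more. These computing elements do not contain or store any long-lived per-element state information, and therefore may be configured without persistent or non-volatile storage such as a local disk. Instead, all long-lived state information is stored separately from the computing elements, on a plurality of disks DISK1, DISK2, ... DISKn that are coupled to the computing elements via a storage area network (SAN) comprising one or more SAN switches 202. Suitable SAN switches are commercially available from Brocade and Excel.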
【００３２】
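All of the computing elements are interconnected through one or more VLAN switches 204, which can be divided up into virtual LANs (VLANs). The VLAN switches 204 are coupled to the Internet 106. In general, a computing element contains one or two network interfaces connected to the VLAN switch. For simplicity, in FIG. 2 all nodes are shown with two network interfaces, although some nodes may have fewer or more network interfaces. Many commercial suppliers now provide switches supporting VLAN functionality. For example, suitable VLAN switches are commercially available from Cisco Systems, Inc. and Extreme Networks. Similarly, there are a large number of commercially available products for constructing SANs, including Fibre Channel switches, SCSI-to-Fibre-Channel bridging devices, and Network Attached Storage (NAS) devices.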
【００３３】
Control plane 206 is coupled by a SAN control path, a CPU control path, and a VLAN control path to SAN switches 202, to CPUs CPU1, CPU2, ... CPUn, and to VLAN switches 204, respectively.
【００３４】
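Each VSF is composed of a set of VLANs, a set of computing elements that are attached to the VLANs, and a subset of the storage available on the SAN that is coupled to the set of computing elements. The subset of the storage available on the SAN is referred to as a SAN zone and is protected by the SAN hardware from access by computing elements that are part of other SAN zones. Preferably, VLANs that provide non-forgeable port identifiers are used to prevent one customer or end user from obtaining access to the VSF resources of another customer or end user.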
【００３５】
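FIG. 3 is a block diagram of an exemplary virtual server farm featuring a SAN zone. A plurality of Web servers WS1, WS2, etc., are coupled by a first VLAN (VLAN1) to a load balancer (LB)/firewall 302. A second VLAN (VLAN2) couples the Internet 106 to the load balancer (LB)/firewall 302. Each of the Web servers may be selected from among CPU1, CPU2, etc., using mechanisms described further herein. The Web servers are coupled to a SAN zone 304, which is coupled to one or more storage devices 306a, 306b.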
【００３６】
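At any given point in time, a computing element in the computing grid, such as CPU1 of FIG. 2, is connected only to the set of VLANs and the SAN zone associated with a single VSF. A VSF typically is not shared among different organizations. The subset of storage on the SAN that belongs to a single SAN zone, the set of VLANs associated with it, and the computing elements on these VLANs together define a VSF.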
【００３７】
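By controlling the membership of VLANs and the membership of SAN zones, the control plane enforces a logical partitioning of the computing grid into multiple VSFs. Members of one VSF cannot access the computing or storage resources of another VSF. Such access restrictions are enforced by the VLAN switches, and by port-level access control mechanisms (e.g., zoning) of SAN hardware such as Fibre Channel switches and edge devices such as SCSI-to-Fibre-Channel bridging hardware. Computing elements that form part of the computing grid are not physically connected to the control ports or interfaces of the VLAN switches and SAN switches, and therefore cannot control the membership of the VLANs or SAN zones. Accordingly, the computing elements of the computing grid cannot access computing elements that are not located in the VSF in which they are contained.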
【００３８】
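Only the computing elements that run the control plane are physically connected to the control ports or interfaces of the devices in the grid. Devices in the computing grid (computers, SAN switches, and VLAN switches) can be configured only through such control ports or interfaces. This provides a simple yet highly secure means of enforcing the dynamic partitioning of the computing grid into multiple VSFs.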
【００３９】
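Each computing element in a VSF is replaceable by any other computing element. The number of computing elements, VLANs, and SAN zones associated with a given VSF may change over time under control of the control plane.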
【００４０】
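In one embodiment, the computing grid includes an idle pool that comprises a large number of computing elements kept in reserve. Computing elements from the idle pool may be assigned to a particular VSF for reasons such as increasing the CPU or memory capacity available to that VSF, or dealing with failures of a particular computing element in a VSF. When the computing elements are configured as Web servers, the idle pool serves as a large "shock absorber" for varying or "bursty" Web traffic loads and related peak processing loads.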
【００４１】
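The idle pool is shared among many different organizations, and therefore provides economies of scale, since no single organization has to pay for the entire idle pool. Different organizations can obtain computing elements from the idle pool at different times of the day as needed, thereby enabling each VSF to grow when required and shrink when traffic falls back to normal. If many different organizations continue to peak at the same time and thereby potentially exhaust the capacity of the idle pool, the idle pool can be enlarged by adding more CPUs and storage elements to it (scalability). The capacity of the idle pool is engineered so as to greatly reduce the probability that, under normal conditions, a particular VSF cannot obtain an additional computing element from the idle pool when it needs one.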
【００４２】
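FIG. 4A, FIG. 4B, FIG. 4C, and FIG. 4D are block diagrams showing successive steps involved in moving a computing element in and out of the idle pool. Referring first to FIG. 4A, assume that the control plane has logically connected elements of the computing grid into first and second VSFs labeled VSF1 and VSF2. Idle pool 400 comprises a plurality of CPUs 402, one of which is labeled CPUX. In FIG. 4B, VSF1 has developed a need for an additional computing element. Accordingly, the control plane moves CPUX from idle pool 400 to VSF1, as indicated by path 404.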
【００４３】
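In FIG. 4C, VSF1 no longer needs CPUX, and therefore the control plane moves CPUX out of VSF1 and back into idle pool 400. In FIG. 4D, VSF2 has developed a need for an additional computing element. Accordingly, the control plane moves CPUX from idle pool 400 to VSF2. Thus, over the course of time, as traffic conditions change, a single computing element may belong to the idle pool (FIG. 4A), then be assigned to a particular VSF (FIG. 4B), then be placed back in the idle pool (FIG. 4C), and then belong to another VSF (FIG. 4D).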
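The life cycle of CPUX in FIG. 4A through FIG. 4D can be sketched in Python as a simple membership table. This is only an illustrative model: the class name `Grid`, the method names, and the rule that an element must pass through the idle pool between VSFs are assumptions made for the sketch, not details given in the specification.

```python
class Grid:
    """Hypothetical model: records which pool or VSF each element belongs to."""

    IDLE = "idle"

    def __init__(self, cpus):
        # Every computing element starts in the idle pool.
        self.location = {cpu: self.IDLE for cpu in cpus}

    def assign(self, cpu, vsf):
        # Assumption: an element enters a VSF only from the idle pool.
        if self.location[cpu] != self.IDLE:
            raise ValueError(f"{cpu} is already in {self.location[cpu]}")
        self.location[cpu] = vsf

    def release(self, cpu):
        # Return the element to the idle pool.
        self.location[cpu] = self.IDLE


grid = Grid(["CPUX", "CPUY"])
history = [grid.location["CPUX"]]      # FIG. 4A: idle pool
grid.assign("CPUX", "VSF1")
history.append(grid.location["CPUX"])  # FIG. 4B: VSF1
grid.release("CPUX")
history.append(grid.location["CPUX"])  # FIG. 4C: idle pool again
grid.assign("CPUX", "VSF2")
history.append(grid.location["CPUX"])  # FIG. 4D: VSF2
```

In a real grid each transition would also involve reconfiguring switch ports and rebooting the element; the table above captures only the logical membership.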
【００４４】
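At each of these stages, the control plane configures the LAN switches and SAN switches associated with that computing element so that it is part of the VLANs and SAN zones associated with a particular VSF (or the idle pool). According to one embodiment, in between each transition, the computing element is powered down or rebooted. When the computing element is powered back up, it views a different portion of the storage zone on the SAN. In particular, the computing element views a portion of the storage zone on the SAN that includes a bootable image of an operating system (e.g., Linux, NT, Solaris, etc.). The storage zone also includes a data portion that is specific to each organization (e.g., files associated with a Web server, database partitions, etc.). The computing element is also part of another VLAN that is part of the VLAN set of another VSF, so that it can access the CPUs, SAN storage devices, and NAS devices associated with the VLANs of the VSF into which it has been transitioned.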
【００４５】
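In a preferred embodiment, the storage zones include a plurality of pre-defined logical blueprints that are associated with roles that may be assumed by the computing elements. Initially, no computing element is dedicated to any particular role or task such as Web server, application server, or database server. The role of a computing element is acquired from one of a plurality of pre-defined, stored blueprints, each of which defines a boot image for the computing elements that are associated with that role. The blueprints may be stored in the form of a file, a database table, or any other storage format that can associate a boot image location with a role.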
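A blueprint store of this kind can be pictured as a lookup from role to boot image location. The following Python sketch assumes a plain dictionary; the role names and the `san://` image locations are hypothetical placeholders, since the specification does not fix any naming scheme or storage format.

```python
# Hypothetical blueprint table: role -> boot image location on the SAN.
BLUEPRINTS = {
    "web_server": "san://blueprints/web_server.img",
    "app_server": "san://blueprints/app_server.img",
    "db_server": "san://blueprints/db_server.img",
    "load_balancer": "san://blueprints/load_balancer.img",
}


def boot_image_for(role):
    """Return the boot image location that defines the given role."""
    try:
        return BLUEPRINTS[role]
    except KeyError:
        raise ValueError(f"no blueprint stored for role {role!r}") from None


# A fungible element acquires its role only when it is pointed at an image.
element = {"name": "CPUX", "role": None, "boot_image": None}
element["role"] = "web_server"
element["boot_image"] = boot_image_for(element["role"])
```

The same table could equally be a file or a database table, as the paragraph above notes; the dictionary is used here only to keep the sketch self-contained.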
【００４６】
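Thus, the movements of CPUX in FIG. 4A, FIG. 4B, FIG. 4C, and FIG. 4D are logical, not physical, and are accomplished by re-configuring VLAN switches and SAN zones under control of the control plane. Further, each computing element in the computing grid is initially fungible, and assumes a specific processing role only after it is connected into a virtual server farm and loads software from a boot image. No computing element is dedicated to any particular role or task such as Web server, application server, or database server. The role of a computing element is acquired from one of a plurality of pre-defined, stored blueprints, each of which is associated with a role and defines a boot image for the computing elements associated with that role.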
【００４７】
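Since no long-lived state information is stored in any given computing element (such as on a local disk), nodes are easily moved between different VSFs and can run completely different OS and application software. This also makes each computing element highly replaceable in case of planned or unplanned downtime.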
【００４８】
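A particular computing element may perform different roles as it is brought into and out of various VSFs. For example, a computing element may act as a Web server in one VSF, and when it is brought into a different VSF, it may be a database server, a Web load balancer, a firewall, etc. It may also successively boot and run different operating systems such as Linux, NT, or Solaris in different VSFs. Thus, each computing element in the computing grid is fungible and has no static role assigned to it. Accordingly, the entire reserve capacity of the computing grid can be used to provide any of the services required by any VSF. This provides a high degree of availability and reliability to the services provided by a single VSF, because each server performing a particular service has potentially thousands of backup servers able to provide the same service.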
【００４９】
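Further, the large reserve capacity of the computing grid can provide both dynamic load balancing properties and high processor availability. This capability is enabled by the unique combination of diskless computing elements interconnected via VLANs, connected to a configurable zone of storage devices via a SAN, and all controlled in real time by the control plane. Every computing element can act in the role of any required server in any VSF, and can connect to any logical partition of any disk in the SAN. When the grid requires more computing power or disk capacity, computing elements or disk storage are manually added to the idle pool, which may shrink over time as more organizations are provided VSF services. No manual intervention is required in order to increase the number of CPUs, the network and disk bandwidth, or the storage available to a VSF. All such resources are allocated on demand by the control plane from the CPU, network, and disk resources available in the idle pool.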
【００５０】
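A particular VSF is not subject to manual reconfiguration. Only the computing elements in the idle pool are manually configured into the computing grid. As a result, a great potential hazard present in current manually constructed server farms is removed. The possibility that human error in configuring a new server into a live server farm can cause the server farm to malfunction, possibly resulting in loss of service to users of that Web site, is virtually eliminated.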
【００５１】
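The control plane also replicates data stored in SAN-attached storage devices, so that failure of any particular storage element does not cause a loss of service to any part of the system. By decoupling long-lived storage from computing devices using SANs, and by providing redundant storage and computing elements, where any computing element can be attached to any storage partition, a high degree of availability is achieved.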
【００５２】
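Detailed Example of Establishing a Virtual Server Farm, Adding a Processor to It, and Removing a Processor from It
FIG. 5 is a block diagram of a computing grid and control plane mechanism according to an embodiment. With reference to FIG. 5, the following describes the detailed steps that may be used to create a VSF, add nodes to it, and remove nodes from it.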
【００５３】
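FIG. 5 depicts computing elements 502, comprising computers A through G, coupled to VLAN-capable switch 504. VLAN switch 504 is coupled to Internet 106 and has ports V1, V2, etc. Computers A through G are further coupled to SAN switch 506, which is coupled to a plurality of storage devices or disks D1 through D5. SAN switch 506 has ports S1, S2, etc. A control plane mechanism 508 is communicatively coupled by control paths and data paths to SAN switch 506 and to VLAN switch 504. The control plane can send control commands to these devices through the control ports.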
【００５４】
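For simplicity, the number of computing elements in FIG. 5 is small. In practice, a large number of computers, e.g., thousands or more, and an equally large number of storage devices form the computing grid. In such larger structures, multiple SAN switches are interconnected to form a mesh, and multiple VLAN switches are interconnected to form a VLAN mesh. For clarity, however, FIG. 5 shows a single SAN switch and a single VLAN switch.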
【００５５】
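Initially, all computers A through G are assigned to the idle pool until the control plane receives a request to create a VSF. All ports of the VLAN switch are assigned to a specific VLAN labeled VLAN I (for the idle zone). Assume that the control plane is asked to construct a VSF containing one load balancer/firewall and two Web servers connected to a storage device on the SAN. Requests to the control plane may arrive through a management interface or other computing element.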
【００５６】
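In response, the control plane assigns or allocates CPU A as the load balancer/firewall, and allocates CPUs B and C as the Web servers. CPU A is logically placed in SAN Zone 1 and pointed to a bootable partition on a disk that contains dedicated load balancing/firewalling software. The term "pointed to" is used for convenience and is intended to indicate that CPU A is given, by any means, information sufficient to enable CPU A to obtain or locate the appropriate software that it needs to operate. Placement of CPU A in SAN Zone 1 enables CPU A to obtain resources from disks that are controlled by the SAN of that SAN zone.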
【００５７】
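The load balancer is configured by the control plane to know about CPUs B and C as the two Web servers it is supposed to load balance. The firewall configuration protects CPUs B and C against unauthorized access from the Internet 106. CPUs B and C are pointed to a disk partition on the SAN that contains a bootable OS image for a particular operating system (e.g., Solaris, Linux, NT, etc.) and Web server application software (e.g., Apache). The VLAN switch is configured to place ports v1 and v2 on VLAN 1, and ports v3, v4, v5, v6, and v7 on VLAN 2. The control plane configures SAN switch 506 to place Fibre Channel switch ports s1, s2, s3, and s8 into SAN Zone 1.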
【００５８】
How a CPU is pointed to a particular disk drive, and what this means for booting and for shared access to disk data, is described further herein.
【００５９】
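FIG. 6 is a block diagram of the resulting logical connectivity of the computing elements, which are collectively called VSF 1. Disk drive DD1 is selected from among storage devices D1, D2, etc. Once the logical structure shown in FIG. 6 is achieved, CPUs A, B, and C are given a power-up command. In response, CPU A becomes a dedicated load balancer/firewall computing element, and CPUs B and C become Web servers.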
【００６０】
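Now, assume that because of a policy-based rule, the control plane determines that another Web server is required in VSF 1. This may be caused, for example, by an increased number of requests to the Web servers, where the customer's plan permits at least three Web servers to be added to VSF 1. Alternatively, it may be because the organization that owns or operates the VSF wants another server and has added it through an administrative mechanism, such as a privileged Web page that allows it to add more servers to its VSF.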
【００６１】
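In response, the control plane decides to add CPU D to VSF 1. To do so, the control plane adds CPU D to VLAN 2 by adding ports v8 and v9 to VLAN 2. Also, CPU D's SAN port s4 is added to SAN Zone 1. CPU D is pointed to a bootable portion of the SAN storage that boots up and runs as a Web server. CPU D also gets read-only access to the shared data on the SAN, which may consist of Web page contents, executable server scripts, etc. In this way it is able to serve Web requests intended for the server farm, much as CPUs B and C serve requests. The control plane also configures the load balancer (CPU A) to include CPU D as part of the server set that is being load balanced.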
【００６２】
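CPU D is then booted up, and the size of the VSF has now increased to three Web servers and one load balancer. FIG. 7 is a block diagram of the resulting logical connectivity.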
【００６３】
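Assume that the control plane now receives a request to create another VSF, named VSF 2, that needs two Web servers and one load balancer/firewall. The control plane allocates CPU E to be the load balancer/firewall and CPUs F and G to be the Web servers. It configures CPU E to know about CPUs F and G as the two computing elements to load balance.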
【００６４】
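To implement this configuration, the control plane configures VLAN switch 504 to include ports v10 and v11 in VLAN 1 (that is, connected to the Internet 106) and ports v12, v13, v14, and v15 in VLAN 3. Similarly, it configures SAN switch 506 to include SAN ports s6, s7, and s9 in SAN Zone 2. This SAN zone includes the storage containing the software necessary to run CPU E as a load balancer and CPUs F and G as Web servers that use a shared read-only disk partition contained in disk D2 in SAN Zone 2.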
【００６５】
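FIG. 8 is a block diagram of the resulting logical connectivity. Although the two VSFs (VSF 1, VSF 2) share the same physical VLAN switch and SAN switch, the two VSFs are logically partitioned. Users who access CPUs B, C, and D, or the enterprise that owns or operates VSF 1, can access only the CPUs and storage of VSF 1. Such users cannot access the CPUs or storage of VSF 2. This is due to the combination of the separate VLANs and the two firewalls on the only shared segment (VLAN 1), and the different SAN zones in which the two VSFs are configured.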
【００６６】
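Further assume that the control plane later decides that VSF 1 can fall back to two Web servers. This may be because the temporary increase in load on VSF 1 has subsided, or because some other administrative action was taken. In response, the control plane shuts down CPU D by a special command that may include powering down the CPU. Once the CPU has shut down, the control plane removes ports v8 and v9 from VLAN 2, and also removes SAN port s4 from SAN Zone 1. Port s4 is placed in an idle SAN zone. The idle SAN zone may be designated, for example, SAN Zone I (for idle) or Zone 0.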
【００６７】
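Some time later, the control plane may decide to add another node to VSF 2. This may be because the load on the Web servers in VSF 2 has temporarily increased, or for other reasons. Accordingly, the control plane decides to place CPU D in VSF 2, as indicated by dashed path 802. To do so, it configures the VLAN switch to include ports v8 and v9 in VLAN 3, and configures the SAN switch to include SAN port s4 in SAN Zone 2. CPU D is pointed to the portion of the storage on disk device 2 that contains a bootable image of the OS and Web server software required for servers in VSF 2. Also, CPU D is granted read-only access to data in a file system shared by the other Web servers in VSF 2. CPU D is powered back up and now runs as a load-balanced Web server in VSF 2, and it can no longer access any data in SAN Zone 1 or the CPUs attached to VLAN 2. In particular, CPU D has no way of accessing any element of VSF 1, even though at an earlier point in time it was part of VSF 1.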
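The port assignments in this walkthrough can be replayed with a small Python sketch. It models each switch as a map from port to exactly one group (a VLAN or a SAN zone); the class name `PortMap` and the group labels are assumptions for illustration, and real switches would of course be driven through their control ports rather than through an in-memory dictionary.

```python
IDLE = "idle"


class PortMap:
    """Hypothetical model of a switch: each port belongs to exactly one group."""

    def __init__(self, ports):
        self.group = {port: IDLE for port in ports}

    def place(self, ports, group):
        for port in ports:
            if port not in self.group:
                raise KeyError(port)
            self.group[port] = group

    def members(self, group):
        return {port for port, g in self.group.items() if g == group}


vlan = PortMap(f"v{i}" for i in range(1, 18))
san = PortMap(f"s{i}" for i in range(1, 10))

# Create VSF 1: CPU A (load balancer/firewall), CPUs B and C (Web servers).
vlan.place(["v1", "v2"], "VLAN1")
vlan.place(["v3", "v4", "v5", "v6", "v7"], "VLAN2")
san.place(["s1", "s2", "s3", "s8"], "ZONE1")

# Grow VSF 1 with CPU D.
vlan.place(["v8", "v9"], "VLAN2")
san.place(["s4"], "ZONE1")

# Create VSF 2: CPU E (load balancer/firewall), CPUs F and G (Web servers).
vlan.place(["v10", "v11"], "VLAN1")
vlan.place(["v12", "v13", "v14", "v15"], "VLAN3")
san.place(["s6", "s7", "s9"], "ZONE2")

# Shrink VSF 1: CPU D's ports go back to the idle VLAN and idle SAN zone.
vlan.place(["v8", "v9"], IDLE)
san.place(["s4"], IDLE)

# Later, move CPU D into VSF 2.
vlan.place(["v8", "v9"], "VLAN3")
san.place(["s4"], "ZONE2")
```

Because a port can be in only one group at a time, placing s4 in Zone 2 automatically means it is no longer in Zone 1, which mirrors the isolation property the text describes for CPU D.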
【００６８】
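Further, in this configuration, the security perimeter enforced by CPU E has dynamically expanded to include CPU D. Thus, embodiments provide dynamic firewalling that automatically adjusts to properly protect computing elements that are added to or removed from a VSF.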
【００６９】
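For purposes of explanation, embodiments have been described in the context of port-based SAN zoning. Other types of SAN zoning may also be used. For example, LUN-level SAN zoning may be used to create SAN zones based on logical volumes within disk arrays. An example product that is suitable for LUN-level SAN zoning is the Volume Logix product from EMC Corporation.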
【００７０】
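Disk Devices on the SAN
There are several ways in which a CPU can be pointed to a particular device on the SAN, whether for booting, for accessing disk storage that needs to be shared with other nodes, or for otherwise obtaining information about where to find boot-up programs and data.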
【００７１】
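One way is to provide a SCSI-to-Fibre-Channel bridging device attached to a computing element, together with a SCSI interface for the local disks. By routing that SCSI port to the right device on the Fibre Channel SAN, the computer can access the storage device on the Fibre Channel SAN just as it would access a locally attached SCSI disk. Therefore, software such as boot-up software simply boots off the disk device on the SAN just as it would boot off a locally attached SCSI disk.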
【００７２】
Another way is to have a Fibre Channel interface on the node, together with an associated device driver, boot ROM, and OS software that permit the Fibre Channel interface to be used as a boot device.
【００７３】
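Yet another way is to have an interface card (e.g., PCI bus or Sbus) that appears to be a SCSI or IDE device controller but that in turn communicates over the SAN to access the disk. Operating systems such as Solaris integrally provide diskless boot functions that can be used in this alternative.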
【００７４】
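Typically there are two kinds of SAN disk devices associated with a given node. The first kind is not logically shared with other computing elements, and constitutes what is normally a per-node root partition containing bootable OS images, local configuration files, etc. This is the equivalent of the root file system on a Unix (registered trademark) system.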
【００７５】
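The second kind of disk is storage shared with other nodes. The kind of sharing varies with the OS software running on the CPU and the needs of the nodes accessing the shared storage. If the OS provides a cluster file system that allows read/write access to a shared disk partition from multiple nodes, the shared disk is mounted as such a cluster file system. Similarly, the system may use database software such as Oracle Parallel Server that permits multiple nodes running in a cluster to have concurrent read/write access to a shared disk. In such cases, the shared disk is already designed into the base OS and application software.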
【００７６】
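For operating systems where such shared access is not possible, because the OS and associated applications cannot manage a disk device shared with other nodes, the shared disk can be mounted as a read-only device. For many Web applications, read-only access to Web-related files is sufficient. For example, in Unix (registered trademark) systems, a particular file system may be mounted as read-only.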
【００７７】
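Multi-Switch Computing Grid
The configuration described above in connection with FIG. 5 can be expanded to a large number of computing and storage nodes by interconnecting a plurality of VLAN switches to form a large switched VLAN fabric, and by interconnecting multiple SAN switches to form a large switched SAN mesh. In this case, the computing grid has the architecture generally shown in FIG. 5, except that the SAN/VLAN switched mesh contains a very large number of ports for CPUs and storage devices. A number of computing elements running the control plane can be physically connected to the control ports of the VLAN/SAN switches, as described further below. Interconnection of multiple VLAN switches to create complex multi-campus data networks is known in this field. See, for example, G. Haviland, "Designing High-Performance Campus Intranets with Multilayer Switching," Cisco Systems, Inc., and information available from Brocade.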
【００７８】
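SAN Architecture
The description assumes that the SAN comprises Fibre Channel switches and disk devices, and potentially Fibre Channel edge devices such as SCSI-to-Fibre-Channel bridges. However, SANs may be constructed using alternative technologies, such as Gigabit Ethernet (registered trademark) switches, or switches that use other physical layer protocols. In particular, there are efforts underway to construct SANs over IP networks by running the SCSI protocol over IP. The methods and architecture described above are adaptable to these alternative methods of constructing a SAN. When a SAN is constructed by running a protocol such as SCSI over IP in a VLAN-capable layer 2 environment, SAN zones are created by mapping them to different VLANs.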
【００７９】
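Network Attached Storage (NAS), which works over LAN technologies such as Fast Ethernet (registered trademark) or Gigabit Ethernet (registered trademark), may also be used. With this option, different VLANs are used in place of SAN zones in order to enforce security and the logical partitioning of the computing grid. Such NAS devices typically support network file systems such as Sun's NFS protocol or Microsoft's SMB, to allow multiple nodes to share the same storage.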
【００８０】
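Control Plane Implementation
As described herein, the control plane may be implemented as one or more processing resources that are coupled to control and data ports of the SAN and VLAN switches. A variety of control plane implementations may be used, and the invention is not limited to any particular control plane implementation. Various aspects of control plane implementation are described in more detail in the following sections: 1) control plane architecture; 2) master segment manager election; 3) administrative functions; and 4) policy and security considerations.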
【００８１】
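1. Control Plane Architecture
According to one embodiment, a control plane is implemented as a control process hierarchy. The control process hierarchy generally includes one or more master segment manager mechanisms that are communicatively coupled to, and control, one or more slave segment manager mechanisms. The one or more slave segment manager mechanisms control one or more farm managers. The one or more farm managers manage one or more VSFs. The master and slave segment manager mechanisms may be implemented in hardware circuitry, computer software, or any combination thereof.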
【００８２】
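FIG. 9 is a block diagram 900 that illustrates a logical relationship between a control plane 902 and a computing grid 904 according to one embodiment. Control plane 902 controls and manages the computing, networking, and storage elements contained in computing grid 904 through special control ports or interfaces of the networking and storage elements in computing grid 904. Computing grid 904 includes a number of VSFs 906, or logical resource groups, created in accordance with the embodiments described above.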
【００８３】
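According to one embodiment, control plane 902 includes a master segment manager 908, one or more slave segment managers 910, and one or more farm managers 912. Master segment manager 908, slave segment managers 910, and farm managers 912 may be co-located on a particular computing platform or may be distributed across multiple computing platforms. For purposes of explanation, only a single master segment manager 908 is illustrated and described; however, any number of master segment managers 908 may be employed.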
【００８４】
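Master segment manager 908 is communicatively coupled to, and controls and manages, slave segment managers 910. Each slave segment manager 910 is communicatively coupled to, and manages, one or more farm managers 912. According to one embodiment, each farm manager 912 is co-located on the same computing platform as the corresponding slave segment manager 910 to which it is communicatively coupled. Farm managers 912 establish, configure, and maintain VSFs 906 on computing grid 904. According to one embodiment, each farm manager 912 is assigned a single VSF 906 to manage; however, a farm manager 912 may also be assigned multiple VSFs 906. Farm managers 912 do not communicate directly with each other, but only through their respective slave segment managers 910. Slave segment managers 910 monitor the status of their assigned farm managers 912 and restart any assigned farm manager 912 that has stalled or failed.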
【００８５】
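Master segment manager 908 monitors the loading of VSFs 906 and determines the amount of resources to be allocated to each VSF 906. Master segment manager 908 then instructs slave segment managers 910 to allocate and de-allocate resources for VSFs 906, as appropriate, through farm managers 912. A variety of load balancing algorithms may be implemented depending upon the requirements of a particular application, and the invention is not limited to any particular load balancing approach.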
【００８６】
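Master segment manager 908 monitors loading information for the computing platforms on which slave segment managers 910 and farm managers 912 are executing, to determine whether computing grid 904 is being adequately serviced. Master segment manager 908 allocates and de-allocates slave segment managers 910, and instructs slave segment managers 910 to allocate and de-allocate farm managers 912, as necessary to properly manage computing grid 904. According to one embodiment, master segment manager 908 also manages the assignment of VSFs to farm managers 912 and the assignment of farm managers 912 to slave segment managers 910, as necessary to balance the load among farm managers 912 and slave segment managers 910. According to one embodiment, slave segment managers 910 actively communicate with master segment manager 908 to request changes to computing grid 904 and to request additional slave segment managers 910 and/or farm managers 912. If a processing platform on which one or more slave segment managers 910 and one or more farm managers 912 are executing fails, master segment manager 908 reassigns the VSFs 906 from the farm managers 912 on the failed computing platform to other farm managers 912. In this situation, master segment manager 908 may also instruct slave segment managers 910 to start additional farm managers 912 to handle the reassigned VSFs 906. Actively managing the number of computing resources allocated to VSFs 906, the number of active farm managers 912, and the number of active slave segment managers 910 allows overall power consumption to be controlled. For example, to conserve power, master segment manager 908 may shut down computing platforms that have no active slave segment managers 910 or farm managers 912. The power savings can be significant with large computing grids 904 and control planes 902.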
【００８７】
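According to one embodiment, master segment manager 908 manages slave segment managers 910 using a registry. The registry contains information about current slave segment managers 910, such as their state and their assigned farm managers 912 and VSFs 906. As slave segment managers 910 are allocated and de-allocated, the registry is updated to reflect the change. For example, when a new slave segment manager 910 is instantiated by master segment manager 908 and assigned one or more VSFs 906, the registry is updated to reflect the creation of the new slave segment manager 910 and its assigned farm managers 912 and VSFs 906. Master segment manager 908 may then periodically examine the registry to determine how best to assign VSFs 906 to slave segment managers 910.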
【００８８】
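According to one embodiment, the registry contains information about master segment manager 908 that can be accessed by slave segment managers 910. For example, the registry may contain data that identifies one or more active master segment managers 908, so that when a new slave segment manager 910 is created, it can check the registry to learn the identity of the one or more master segment managers 908.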
【００８９】
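The registry may be implemented in many forms, and the invention is not limited to any particular implementation. For example, the registry may be a data file stored in a database 914 within control plane 902. The registry may instead be stored outside of control plane 902. For example, the registry may be stored on a storage device in computing grid 904. In this example, the storage device would be dedicated to control plane 902 and not allocated to VSFs 906.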
【００９０】
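2. Master Segment Manager Election
In general, a master segment manager is elected when a control plane is established or after an existing master segment manager fails. Although there is generally a single master segment manager for a particular control plane, there may be situations where it is advantageous to elect two or more master segment managers to co-manage the slave segment managers in the control plane.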
【００９１】
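According to one embodiment, the slave segment managers in a control plane elect a master segment manager for that control plane. In the simple case where there is no master segment manager and only a single slave segment manager, the slave segment manager becomes the master segment manager and allocates additional slave segment managers as needed. If there are two or more slave segment managers, they elect a new master segment manager by vote, e.g., by a quorum.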
【００９２】
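Since the slave segment managers in a control plane are not necessarily persistent, particular slave segment managers may be selected to participate in a vote. For example, according to one embodiment, the registry includes a timestamp for each slave segment manager that is periodically updated by that slave segment manager. The slave segment managers with the most recently updated timestamps, as determined according to specified selection criteria, are considered most likely to still be executing and are selected to vote for a new master segment manager. For example, a specified number of the most recent slave segment managers may be selected for a vote.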
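One way to read the timestamp-based selection is sketched below in Python. The function name, the `max_age` freshness cutoff, and the `limit` on the number of voters are assumptions standing in for the "specified selection criteria," which the specification leaves open.

```python
def select_voters(timestamps, now, max_age, limit):
    """Pick the slave segment managers most likely to still be running.

    timestamps: dict mapping slave name -> last registry timestamp
    max_age:    assumed cutoff; older timestamps are treated as stale
    limit:      assumed maximum number of voters
    """
    fresh = [(ts, name) for name, ts in timestamps.items() if now - ts <= max_age]
    fresh.sort(reverse=True)  # most recently updated first
    return [name for _, name in fresh[:limit]]


registry_timestamps = {"slave1": 100, "slave2": 95, "slave3": 40, "slave4": 99}
voters = select_voters(registry_timestamps, now=101, max_age=10, limit=2)
```

Here slave3 is excluded as stale and the two most recently updated of the remaining managers are chosen; other criteria could be substituted without changing the overall idea.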
【００９３】
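According to one embodiment, an election sequence number is assigned to every active slave segment manager, and a new master segment manager is determined based upon the election sequence numbers of the active slave segment managers. For example, the lowest or highest election sequence number may be used to select a particular slave segment manager to be the next (or first) master segment manager.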
【００９４】
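Once a master segment manager has been established, the slave segment managers in the same control plane periodically perform a health check on the master segment manager by contacting (pinging) the current master segment manager to determine whether it is still active. If the current master segment manager is determined to be no longer active, a new master segment manager is elected.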
【００９５】
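FIG. 10 is a state diagram 1000 of master segment manager election according to an embodiment. In state 1002, which is the slave segment manager main loop, the slave segment manager waits for a ping timer to expire. Upon expiration of the ping timer, state 1004 is entered. In state 1004, the slave segment manager pings the master segment manager. Also in state 1004, the timestamp (TS) for the slave segment manager is updated. If the master segment manager responds to the ping, then the master segment manager is still active and control returns to state 1002. If no response is received from the master segment manager after a specified period of time, then state 1006 is entered.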
【００９６】
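In state 1006, a list of active slave segment managers is obtained and control proceeds to state 1008. In state 1008, a check is made to determine whether the other slave segment managers have also failed to receive a response from the master segment manager. Instead of sending messages to the slave segment managers to make this inquiry, this information may be obtained from a database. If the slave segment managers do not agree that the master segment manager is no longer active, i.e., one or more of the slave segment managers received a timely response from the master segment manager, then the current master segment manager is presumed to be still active and control returns to state 1002. If a specified number of the slave segment managers have not received a timely response from the current master segment manager, then the current master segment manager is presumed to be "dead," i.e., no longer active, and control proceeds to state 1010.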
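The agreement check of state 1008 might be sketched as follows, assuming each slave segment manager records in a shared database the time of its last timely reply from the master. The function name, the `timeout` value, and the interpretation of the "specified number" as a simple threshold are assumptions for illustration.

```python
def master_presumed_dead(last_reply, now, timeout, quorum):
    """Decide whether enough slaves agree that the master is inactive.

    last_reply: dict mapping slave name -> time of last timely master reply,
                as read from a shared database (assumed layout)
    quorum:     assumed threshold for the "specified number" of silent slaves
    """
    silent = sum(1 for t in last_reply.values() if now - t > timeout)
    return silent >= quorum


replies = {"slave_a": 90, "slave_b": 50, "slave_c": 40}
lenient = master_presumed_dead(replies, now=100, timeout=30, quorum=2)
strict = master_presumed_dead(replies, now=100, timeout=30, quorum=len(replies))
```

Setting `quorum` to the total number of slaves gives the stricter reading in which a single timely reply (slave_a here) is enough to presume the master is still active.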
【００９７】
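In state 1010, the slave segment manager that initiated the process retrieves the current election number from an election table and the next election number from a database. The slave segment manager then updates the election table by writing into the master election table an entry that specifies the next election number and a unique address. Control then proceeds to state 1012, where the slave segment manager reads the lowest sequence number for the current election number. In state 1014, a determination is made whether the particular slave segment manager has the lowest sequence number. If not, control returns to state 1002. If so, control proceeds to state 1016, where the particular slave segment manager becomes the master segment manager. Control then proceeds to state 1018, where the election number is incremented.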
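States 1010 through 1018 can be approximated by the following Python sketch. It is one possible interpretation only: the specification does not spell out the table layout, so the `ElectionTable` class, its in-memory list of entries, and the way sequence numbers are issued are all assumptions.

```python
import itertools


class ElectionTable:
    """Hypothetical in-memory stand-in for the master election table."""

    def __init__(self):
        self.election_number = 1
        self._sequence = itertools.count(1)
        self.entries = []  # (election_number, sequence_number, address)

    def register(self, address):
        # State 1010: a slave writes an entry carrying its unique address.
        seq = next(self._sequence)
        self.entries.append((self.election_number, seq, address))
        return seq

    def winner(self):
        # States 1012-1014: lowest sequence number in the current election.
        current = [(seq, addr) for num, seq, addr in self.entries
                   if num == self.election_number]
        return min(current)[1] if current else None

    def conclude(self):
        # States 1016-1018: the winner becomes master; the number advances.
        master = self.winner()
        self.election_number += 1
        return master


table = ElectionTable()
table.register("10.0.0.2")  # first slave to notice the failure
table.register("10.0.0.1")  # a second slave registers slightly later
new_master = table.conclude()
```

In this sketch the first slave to write its entry wins, because it holds the lowest sequence number for the current election; a production table would need atomic writes, which are omitted here.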
【００９８】
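As described herein, slave segment managers are generally responsible for servicing their assigned VSFs and for allocating new VSFs in response to instructions from the master segment manager. Slave segment managers are also responsible for checking on the master segment manager and electing a new master segment manager if necessary.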
【００９９】
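FIG. 11 is a state diagram 1100 that illustrates various states of a slave segment manager according to an embodiment. Processing starts in a slave segment manager start state 1102. From state 1102, control proceeds to state 1104 in response to a request to confirm the state of the current master segment manager. In state 1104, the slave segment manager sends a ping to the current master segment manager to determine whether it is still active. If a timely response is received from the current master segment manager, control proceeds to state 1106. In state 1106, a message is broadcast to the other slave segment managers to indicate that the master segment manager responded to the ping. From state 1106, control returns to start state 1102.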
【０１００】
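In state 1104, if no timely master response is received, control proceeds to state 1108. In state 1108, a message is broadcast to the other slave segment managers to indicate that the master segment manager did not respond to the ping. Control then returns to start state 1102. Note that if a sufficient number of slave segment managers do not receive a response from the current master segment manager, a new master segment manager is elected as described above.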
【０１０１】
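From state 1102, control proceeds to state 1110 upon receipt of a request from the master segment manager to restart a VSF. In state 1110, the VSF is restarted and control returns to start state 1102.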
【０１０２】
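As described herein, a master segment manager is generally responsible for ensuring that the VSFs in the computing grid it controls are adequately serviced by one or more slave segment managers. To accomplish this, the master segment manager performs regular health checks on all slave segment managers in the same control plane. According to one embodiment, master segment manager 908 periodically requests status information from slave segment managers 910. The information may include, for example, which VSFs 906 are being serviced by slave segment managers 910. If a particular slave segment manager 910 does not respond within a specified period of time, master segment manager 908 attempts to restart it. If the particular slave segment manager 910 cannot be restarted, master segment manager 908 reassigns the farm managers 912 from the failed slave segment manager 910 to another slave segment manager 910. Master segment manager 908 may then instantiate one or more additional slave segment managers 910 to re-balance the process loading. According to one embodiment, master segment manager 908 monitors the health of the computing platforms on which slave segment managers 910 are executing. If a computing platform fails, master segment manager 908 reassigns the VSFs assigned to farm managers 912 on the failed computing platform to another computing platform.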
【０１０３】
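FIG. 12 is a state diagram 1200 for a master segment manager. Processing starts in a master segment manager start state 1202. From state 1202, control proceeds to state 1204 when master segment manager 908 performs, or requests, a periodic health check of slave segment managers 910 in control plane 902. From state 1204, if all slave segment managers 910 respond as expected, control returns to state 1202. This occurs when all slave segment managers 910 provide master segment manager 908 with the specified information indicating that they are operating normally. If one or more slave segment managers 910 do not respond, or respond in a manner indicating that one or more slave segment managers 910 have failed, control proceeds to state 1206.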
【０１０４】
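In state 1206, master segment manager 908 attempts to restart the failed slave segment managers 910. This may be accomplished in several ways. For example, master segment manager 908 may send a restart message to a non-responsive or failed slave segment manager 910. From state 1206, if all slave segment managers 910 respond as expected, i.e., have been successfully restarted, control returns to state 1202. For example, when a failed slave segment manager 910 is successfully restarted, it sends a restart confirmation message to master segment manager 908. From state 1206, if one or more slave segment managers have not been successfully restarted, control proceeds to state 1208. This situation occurs when master segment manager 908 does not receive a restart confirmation message from a particular slave segment manager 910.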
【０１０５】
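In state 1208, master segment manager 908 determines the current loading of the machines on which slave segment managers 910 are executing. To obtain the loading information of slave segment managers 910, master segment manager 908 may poll slave segment managers 910 directly or obtain the loading information from another location, for example from database 914. The invention is not limited to any particular approach for master segment manager 908 to obtain the loading information for slave segment managers 910.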
【０１０６】
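Control then proceeds to state 1210, where the VSFs 906 assigned to the failed slave segment managers 910 are reassigned to other slave segment managers 910. The slave segment managers 910 to which the VSFs 906 are assigned inform master segment manager 908 when the reassignment has been completed. For example, slave segment managers 910 may send a reassignment confirmation message to master segment manager 908 to indicate that the reassignment of VSFs 906 has been successfully completed. Control remains in state 1210 until reassignment of all VSFs 906 associated with the failed slave segment managers 910 has been confirmed. Once confirmed, control returns to state 1202.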
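The hand-off in states 1208 and 1210 can be sketched in Python as follows. The specification does not prescribe how a target is chosen, so the least-loaded policy, the function name `reassign`, and the assumption that each VSF adds one unit of load are all illustrative choices.

```python
def reassign(assignments, failed, load):
    """Move the VSFs of a failed slave segment manager to surviving ones.

    assignments: dict mapping slave name -> list of VSF names (mutated)
    load:        dict mapping slave name -> numeric load (mutated)
    Assumed policy: each VSF goes to the currently least-loaded survivor.
    """
    survivors = [slave for slave in assignments if slave != failed]
    if not survivors:
        raise RuntimeError("no surviving slave segment manager")
    moved = {}
    for vsf in assignments.pop(failed, []):
        target = min(survivors, key=lambda slave: load[slave])
        assignments[target].append(vsf)
        load[target] += 1  # assumption: each VSF adds one unit of load
        moved[vsf] = target
    return moved


assignments = {"slave1": ["vsf1"], "slave2": ["vsf2", "vsf3"], "slave3": []}
load = {"slave1": 1, "slave2": 2, "slave3": 0}
moved = reassign(assignments, failed="slave2", load=load)
```

The returned mapping plays the role of the reassignment confirmations: the master could remain in state 1210 until every VSF in it has been acknowledged by its new slave segment manager.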
【０１０７】
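Instead of reassigning the VSFs 906 associated with a failed slave segment manager 910 to other active slave segment managers 910, master segment manager 908 may allocate additional slave segment managers 910 and then assign those VSFs 906 to the new slave segment managers 910. The choice of whether to reassign VSFs 906 to existing or to new slave segment managers 910 depends, at least in part, on the latency associated with allocating new slave segment managers 910 and the latency associated with reassigning VSFs 906 to existing slave segment managers 910. Either approach may be used depending upon the requirements of a particular application, and the invention is not limited to either approach.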
【０１０８】
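3. Administrative Functions
According to one embodiment, control plane 902 is communicatively coupled to a global grid manager. Control plane 902 provides billing, fault, capacity, loading, and other computing grid information to the global grid manager. FIG. 13 is a block diagram that illustrates the use of a global grid manager according to an embodiment.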
【０１０９】
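In FIG. 13, a computing grid 1300 is partitioned into logical portions called grid segments 1302. Each grid segment 1302 includes a control plane 902 that controls and manages a data plane 904. In this example, each data plane 904 is the same as computing grid 904 of FIG. 9, but is referred to as a "data plane" to illustrate the use of a global grid manager to manage multiple control planes 902 and data planes 904, i.e., grid segments 1302.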
【０１１０】
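Each grid segment is communicatively coupled to a global grid manager 1304. Global grid manager 1304, control planes 902, and computing grids 904 may be co-located on a single computing platform or may be distributed across multiple computing platforms, and the invention is not limited to any particular implementation.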
【０１１１】
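Global grid manager 1304 provides centralized management and services for a plurality of grid segments 1302. Global grid manager 1304 may collect billing, loading, and other information from control planes 902 for use in a variety of administrative tasks. For example, the billing information is used to bill for services provided by computing grids 904.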
【０１１２】
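4. Policy and Security Considerations
As described herein, a slave segment manager in a control plane must be able to communicate with its associated VSFs in the computing grid. Similarly, VSFs in a computing grid must be able to communicate with their associated slave segment manager. Further, VSFs in a computing grid must not be allowed to communicate with each other, to prevent one VSF from in any way changing the configuration of another VSF. Various approaches for implementing these policies are described below.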
【０１１３】
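FIG. 14 is a block diagram 1400 of an architecture for connecting a control plane to a computing grid according to an embodiment. The control ("CTL") ports of VLAN switches (VLAN SW1 through VLAN SWn), collectively identified by reference numeral 1402, and of SAN switches (SAN SW1 through SAN SWn), collectively identified by reference numeral 1404, are connected to an Ethernet (registered trademark) subnet 1406. Ethernet subnet 1406 is connected to a plurality of computing elements (CPU1, CPU2 through CPUn), collectively identified by reference numeral 1408. Thus, only the computing elements of control plane 1408 are communicatively coupled to the control ports (CTL) of VLAN switches 1402 and SAN switches 1404. This configuration prevents computing elements in a VSF (not illustrated) from changing the membership of the VLANs and SAN zones associated with their own or any other VSF. This approach is also applicable when the control ports are serial or parallel ports. In that case, the ports are coupled to the computing elements of control plane 1408.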
【０１１４】
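FIG. 15 is a block diagram 1500 that illustrates a configuration for connecting control plane computing elements (CP CPU1, CP CPU2 through CP CPUn) 1502 to data ports according to an embodiment. In this configuration, control plane computing elements 1502 periodically send packets to a control plane agent 1504 that acts on behalf of control plane computing elements 1502. Control plane agent 1504 periodically polls computing elements 502 for real-time data and sends the data to control plane computing elements 1502. Each segment manager in control plane 1502 is communicatively coupled to a control plane (CP) LAN 1506. CP LAN 1506 is communicatively coupled to a special port V17 of VLAN switch 504 through a CP firewall 1508. This configuration provides a scalable and secure means for control plane computing elements 1502 to collect real-time information from computing elements 502.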
【０１１５】
図１６は、実施例によるコンピューティンググリッドへ制御プレーンを接続するアーキテクチャのブロック図１６００である。制御プレーン１６０２は、制御プレーンコンピューティング要素CP CPU1、CP CPU2〜CP CPUnを含んでいる。制御プレーン１６０２における各制御プレーンコンピューティング要素CP CPU1、CP CPU2〜CP CPUnは、全体でSANメッシュ１６０４を形成する複数のSANスイッチのポートS1、S2〜Snに通信接続される。
【０１１６】
SANメッシュ１６０４は、制御プレーン１６０２に対してプライベートであるデータを含む記憶装置１６０６に通信接続されるSANポートSo、Spを含んでいる。記憶装置１６０６は、便宜上、ディスクとして図１６に示されている。記憶装置１６０６は、いずれのタイプの記憶媒体で実施されてもよく、本発明は記憶装置１６０６の特定の種類の記憶媒体に限定されることはない。記憶装置１６０６は、制御プレーンプライベート記憶ゾーン１６０８に論理的に配置される。制御プレーンプライベート記憶ゾーン１６０８は、制御プレーン１６０２を実施するログファイル、統計データ、現在の制御プレーン構成情報を維持する。SANポートSo、Spは制御プレーンプライベート記憶ゾーンの唯一の部分であり、他のSANゾーンには配置されることはないため、制御プレーン１６０２におけるコンピューティング要素のみが記憶装置１６０６にアクセスできる。また、S1、S2〜Sn、SoおよびSpは、制御プレーン１６０２におけるコンピューティング要素に通信接続されるのみの制御プレーンSANゾーンに存在する。これらのポートは、VSFにおけるコンピューティング要素（図示せず）がアクセスすることはできない。
【０１１７】
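According to one embodiment, when a particular computing element CP CPU1, CP CPU2 through CP CPUn needs to access a storage device, or a portion thereof, that is part of a particular VSF, the particular computing element is placed in the SAN zone of that VSF. For example, suppose that computing element CP CPU2 needs to access VSFi disk 1610. In this case, port S2, which is associated with control plane computing element CP CPU2, is placed in the SAN zone of VSFi, which includes port Si. Once computing element CP CPU2 has finished accessing VSFi disk 1610 on port Si, computing element CP CPU2 is removed from the SAN zone of VSFi.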
【０１１８】
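Similarly, suppose that computing element CP CPU1 needs to access VSFj disk 1612. In this case, computing element CP CPU1 is placed in the SAN zone associated with VSFj. As a result, port S1 is placed in the SAN zone associated with VSFj, which includes port Sj. Once computing element CP CPU1 has finished accessing VSFj disk 1612, which is connected to port Sj, computing element CP CPU1 is removed from the SAN zone associated with VSFj. This approach preserves the integrity of the control plane computing elements and of control plane storage zone 1608 by tightly controlling access to resources through precise SAN zone control.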
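The temporary zone membership described above can be modeled with a small sketch. The `SanMesh` class and the `temporary_zone_membership` helper are hypothetical names introduced for illustration. The model assumes only that two ports can communicate when some SAN zone contains both of them, and it does not correspond to any real switch API.

```python
from contextlib import contextmanager


class SanMesh:
    """Toy model of SAN zoning: zone name -> set of member port names."""

    def __init__(self):
        self.zones = {}

    def add_port(self, zone, port):
        self.zones.setdefault(zone, set()).add(port)

    def remove_port(self, zone, port):
        self.zones.get(zone, set()).discard(port)

    def can_reach(self, src_port, dst_port):
        # Two ports can communicate only if some zone contains both.
        return any(
            src_port in members and dst_port in members
            for members in self.zones.values()
        )


@contextmanager
def temporary_zone_membership(mesh, zone, port):
    """Keep a control plane port in a VSF zone only while access is needed."""
    mesh.add_port(zone, port)
    try:
        yield
    finally:
        # The port is removed even if the access raises an exception.
        mesh.remove_port(zone, port)
```

Used as `with temporary_zone_membership(mesh, "vsf_i", "S2"): ...`, the control plane port is removed from the VSF zone as soon as the block exits, which mirrors the removal of CP CPU2 from the SAN zone of VSFi once its access is finished.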
【０１１９】
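As described above, a single control plane computing element may manage multiple VSFs. Accordingly, a single control plane computing element must be able to manifest itself in multiple VSFs simultaneously, while enforcing firewalling between the VSFs according to policy rules established for each control plane. The policy rules may be stored in database 914 (FIG. 9) of each control plane or implemented by central segment manager 1302 (FIG. 13).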
【０１２０】
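According to one embodiment, tight binding between VLAN tagging and IP addresses is used to prevent spoofing attacks by a VSF, since (physical switch) port-based VLAN tags are not spoofable. An incoming IP packet on a given VLAN interface must have the same VLAN tag and IP address as the logical interface on which the packet arrives. This prevents IP spoofing attacks in which a malicious server in a VSF spoofs the source IP address of a server in another VSF, potentially modifying the logical structure of another VSF or otherwise subverting the security of computing grid functions. Circumventing this VLAN tagging approach requires physical access to the computing grid, which can be prevented by using high-security (Class A) data centers.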
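A minimal sketch of such a VLAN/IP consistency check follows. It assumes, for illustration only, that each logical interface is described by a VLAN tag and an IPv4 subnet, and that a packet is accepted only when both match. The names `LogicalInterface` and `accept_packet` are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalInterface:
    vlan_tag: int
    subnet: ipaddress.IPv4Network  # assumed representation of the bound addresses


def accept_packet(iface, packet_vlan_tag, packet_src_ip):
    """Accept only if the tag matches the interface and the source IP is in its subnet."""
    if packet_vlan_tag != iface.vlan_tag:
        return False
    return ipaddress.ip_address(packet_src_ip) in iface.subnet
```

Because the tag is applied by the physical switch port, a server in one VSF that forges a source address from another VSF's subnet still arrives with its own tag, and the check rejects the packet.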
【０１２１】
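Various network frame tagging formats may be used to tag data packets, and the invention is not limited to any particular tagging format. According to one embodiment, IEEE 802.1Q VLAN tags are used, although other formats may also be suitable. In this example, the VLAN/IP address consistency check is performed in a subsystem of the IP stack where the 802.1Q tag information is available to control access. In this example, computing elements are configured with VLAN-capable network interface cards (NICs) so that the computing elements can be communicatively coupled to multiple VLANs concurrently.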
【０１２２】
図１７は、実施例によるVLANタグとIPアドレスとの間を強固に結合する構成のブロック図１７００である。コンピューティング要素１７０２および１７０４は、NIC１７０８および１７１０を介して、VLANスイッチ１７０６のポートｖ１およびｖ２にそれぞれ通信接続される。VLANスイッチ１７０６も、アクセススイッチ１７１２および１７１４に通信接続される。ポートｖ１およびｖ２は、タグ形式で構成される。一実施例によれば、IEEE802.1ｑのVLANタグ情報は、VLANスイッチ１７０６によって提供される。
【０１２３】
広域コンピューティンググリッド
上述したVSFは、様々な方法でWAN上に分散される。
【０１２４】
一つの方法では、広域バックボーンは、非同期転送モード（ATM）切替に基づいていてもよい。この場合、各ローカルエリアVLANは、ATM LANエミュレーション（LANE）標準の一部であるエミュレーテッドLAN（ELAN）を使用して広域に拡張される。このように、単一のVSFは、ATM／SONET／OC-12リンクなどの幾つかの広域リンク全体に広がる。ELANは、ATM WAN全体に拡張するVLANの一部となる。
【０１２５】
他の方法では、VSFをVPNシステムを使用してWAN全体に拡張する。本実施例において、ネットワークの根本的特徴は不適切になり、VPNを使用して２つ以上のVSFをWAN全体にわたって相互接続し、単一の分散VSFを生成する。
【０１２６】
分散VSFにおいてデータを論理コピーするために、データミラーリング技術を使用することができる。あるいは、SAN対ATMブリッジングまたはSAN対ギガビットイーサネット（登録商標）ブリッジングなどの幾つかのSAN対WANブリッジング技術のうちの１つを使用して、WAN上にSANをブリッジさせる。IPはこのようなネットワーク上で問題なく動作するので、IPネットワーク上に構成されたSANはWAN上で自然に拡張する。
【０１２７】
図１８は、WAN接続上で拡張した複数のVSFのブロック図である。サンノゼセンター、ニューヨークセンター、およびロンドンセンターは、WAN接続によって接続されている。各WAN接続は、上述したようにATM、ELANまたはVPN接続から構成される。各センターは、少なくとも１つのVSFおよび少なくとも１つのアイドルプールから構成される。例えば、サンノゼセンターはVSF1AおよびアイドルプールAを有している。この構成において、センターの各アイドルプールのコンピューティングリソースは、他のセンターにあるVSFへの割り当てまたは指定に対して利用できる。このような割り当てまたは指定が行われると、VSFはWAN上で拡張する。
【０１２８】
VSFの使用例
上記例で説明したVSFアーキテクチャは、ウェブサーバシステムのからみで使用してもよい。従って、上記例は、特定のVSFにおけるCPUから構成したウェブサーバ、アプリケーションサーバおよびデータベースサーバに関して説明した。しかし、VSFアーキテクチャを他の多くのコンピューティング状況で使用し、他の種類のサービスを提供してもよく、且つ本発明はウェブサーバシステムに限定されるものではない。
【０１２９】
−内容分散ネットワークの一部としての分散VSF
一実施例において、VSFは、広域VSFを使用して内容分散ネットワーク（CDN）を提供する。CDNは、データの分散キャッシングを行うキャッシングサーバのネットワークである。キャッシングサーバのネットワークは、例えば、Inktomi Corporation, San Mateo, Californiaから販売されているTrafficServer（TS）ソフトウェアを使用して実施できる。TSはクラスタアウェアシステムであり、システムは、更に多くのCPUがキャッシングトラフィックサーバコンピューティング要素の集合に追加されると、拡張する。従って、CPUの追加が拡張の機構であるシステムに非常に適している。
【０１３０】
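In this configuration, the system can dynamically add more CPUs to the portion of a VSF that runs caching software such as TS, so that cache capacity can be increased at a point close to where bursty web traffic is occurring. As a result, the CDN can be configured to scale dynamically in CPU and I/O bandwidth in an adaptive manner.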
【０１３１】
−ホステッドイントラネットアプリケーションのVSF
ホストおよび管理されたサービスとして、企業リソースプランニング（ERP）、ORMおよびCRMソフトウェアなどのイントラネットアプリケーションの提供への興味が増大している。Citrix WinFrameおよびCitrix MetaFrameなどの技術により、企業は、Windows（登録商標）CE機器またはウェブブラウザなどの小型軽量クライアント上でのサービスとしてMicrosoft Windows（登録商標）アプリケーションを提供することができる。VSFは拡張可能にこのようなアプリケーションをホストすることが可能である。
【０１３２】
例えば、ドイツのSAP Aktiengesellschaftより販売されているSAP R/3 ERPソフトウェアにより、企業は多数のアプリケーションおよびデータサーバを使用してバランスをロードさせることができる。VSFの場合、リアルタイムの要求または他の要因に基づいてVSFを拡張するために、企業は更に多くのアプリケーションサーバ（例えば、SAPダイアログサーバ）をVSFに動的に追加する。
【０１３３】
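Similarly, Citrix MetaFrame allows the number of Windows (registered trademark) application users to be scaled on a server farm running hosted Windows applications by adding more Citrix servers. In this case, a Citrix MetaFrame VSF dynamically adds more Citrix servers in order to accommodate more users of MetaFrame-hosted Windows applications. It will be apparent that many other applications may be hosted in a manner similar to the examples described above.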
【０１３４】
−VSFとの顧客相互作用
VSFは求めに応じて生成されるため、VSFを「所有する」VSF顧客または組織は、VSFをカスタマイズするために様々な方法でシステムと互いに影響し合うことができる。例えば、VSFは制御プレーンを介して即座に生成および変更されるので、VSF顧客は特権アクセスが許されて、そのVSF自身を生成および変更してもよい。特権アクセスは、ウェブページおよび保全アプリケーション、トークンカード認証、ケルベロス交換、または他の適切な保全要素によって与えられたパスワード認証を使用して与えられる。
【０１３５】
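In one embodiment, a set of web pages is served by a computing element or by a separate server. The web pages allow a customer to create a custom VSF by specifying, for example, the number of tiers, the number of computing elements in a particular tier, the hardware and software platform used for each element, and which kind of web server, application server, or database server software is to be preconfigured on these computing elements. The customer is thus provided with a virtual provisioning console.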
【０１３６】
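After the customer or user enters such provisioning information, the control plane parses and evaluates the order and queues it for execution. Orders may be reviewed by human administrators to confirm that they are appropriate. A credit check of the enterprise may be run to confirm that it has appropriate credit to pay for the requested services. If the provisioning order is approved, the control plane configures a VSF that matches the order and returns to the customer a password that provides root access to one or more of the computing elements in the VSF. The customer may then upload master copies of its applications to execute in the VSF.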
【０１３７】
コンピューティンググリッドを採用する企業が営利目的の企業である場合、ウェブページから、クレジットカード、PO番号、電子小切手、または他の支払方法などの支払いに関する情報も受信することができる。
【０１３８】
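In another embodiment, the web pages allow the customer to select one of several VSF service plans, such as automatic growth and shrinkage of a VSF between a minimum and a maximum number of elements based on real-time load. The customer may have control values that allow it to change parameters such as the minimum number of computing elements in a particular tier, such as web servers, or the period of time during which the VSF must have a minimum amount of server capacity. The parameters may be linked to billing software that automatically adjusts the customer's billing rate and generates billing log file entries.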
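One way to picture a service plan that grows and shrinks a VSF between a minimum and a maximum number of elements is the clamp below. The function name, its parameters, and the notion of a fixed per-element capacity are assumptions made for this sketch. The specification does not state how load is converted into an element count.

```python
import math


def target_element_count(load, per_element_capacity, minimum, maximum):
    """Elements needed for the load, clamped to the plan's [minimum, maximum]."""
    if per_element_capacity <= 0 or minimum > maximum:
        raise ValueError("invalid plan parameters")
    needed = math.ceil(load / per_element_capacity)
    return max(minimum, min(maximum, needed))
```

With a plan of 2 to 8 elements and an assumed capacity of 100 requests per second per element, a load of 450 yields 5 elements, while a load of 950 is capped at 8.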
【０１３９】
特権アクセス機構により、顧客は報告書を得て、使用、ロード、毎秒のヒット数またはトランザクション数に関するリアルタイム情報を監視し、リアルタイム情報に基づくVSFの特徴を調整することができる。上記特色により、サーバファームの構築に対する従来の手動による方法よりも優れた利点が得られる。従来の方法では、ユーザは、様々な方法でサーバを追加し、サーバファームを構成する面倒な手動手順を介さずに、サーバファームの特性を自動的に変更することはできない。
【０１４０】
−VSFに対する課金モデル
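Given the dynamic nature of a VSF, the enterprise that hosts the computing grid and VSFs may bill service fees to customers who own VSFs using a billing model based on actual usage of the computing elements and storage elements of a VSF. Because the resources of a given VSF are not statically assigned, the VSF architecture and methods disclosed herein enable a "pay-as-you-go" billing model. Accordingly, a particular customer whose server farm has a highly variable usage load can save money, because it is not billed a rate associated with constant peak server capacity but rather a rate that reflects a running average of usage, instantaneous usage, and so on.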
【０１４１】
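For example, an enterprise may operate using a billing model that stipulates a flat fee for a minimum number of computing elements, such as 10 servers, and stipulates that when real-time load requires more than 10 elements, the user is billed at an incremental rate for the additional servers, based on how many additional servers were needed and the length of time for which they were needed. The units of such billing may reflect the resources that are billed. For example, bills may be expressed in units such as MIPS-hours, CPU-hours, or thousands of CPU seconds.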
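The flat-fee-plus-increment model in this example can be worked through with a short calculation. The function name, the fee amounts, and the representation of usage as (elements in use, hours) intervals are illustrative assumptions, not figures from the specification.

```python
def charge_for_period(flat_fee, included_elements, usage_samples, rate_per_element_hour):
    """Flat fee plus an incremental charge for elements beyond the included minimum.

    usage_samples is a list of (elements_in_use, hours) intervals.
    """
    extra_element_hours = sum(
        max(0, elements - included_elements) * hours
        for elements, hours in usage_samples
    )
    return flat_fee + extra_element_hours * rate_per_element_hour
```

For instance, with a flat fee of 1000 covering 10 servers and an assumed rate of 2 per additional server-hour, a period in which 14 servers were needed for 20 hours adds 4 x 20 x 2 = 160, for a total of 1160. Hours spent at or below 10 servers add nothing.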
【０１４２】
−顧客可視制御プレーンAPI
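In another approach, the capacity of a VSF may be controlled by providing the customer with an application programming interface (API) that defines calls to the control plane for changing resources. Thus, an application program prepared by the customer can issue calls or requests using the API to ask for more servers, more storage, greater processing capacity, and so on. This approach may be used when the customer needs the application program to be aware of the computing grid environment and to take advantage of the capabilities offered by the control plane.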
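As a rough illustration of what such a customer-visible API might look like, the sketch below models a facade through which an application requests and releases servers. Every name in it (`ControlPlaneApi`, `request_servers`, `release_servers`, `max_per_vsf`) is hypothetical. The specification does not define the calls, so this is an assumption-laden sketch rather than the actual interface.

```python
class ControlPlaneApi:
    """Hypothetical customer-facing facade; all names here are illustrative."""

    def __init__(self, idle_pool, max_per_vsf):
        self.idle_pool = list(idle_pool)   # unallocated computing elements
        self.max_per_vsf = max_per_vsf     # assumed per-VSF ceiling
        self.vsf_elements = {}             # VSF ID -> allocated elements

    def request_servers(self, vsf_id, count):
        """Grant up to `count` elements from the idle pool; return those granted."""
        current = self.vsf_elements.setdefault(vsf_id, [])
        grantable = min(count, len(self.idle_pool), self.max_per_vsf - len(current))
        granted = [self.idle_pool.pop() for _ in range(max(0, grantable))]
        current.extend(granted)
        return granted

    def release_servers(self, vsf_id, count):
        """Return up to `count` elements from the VSF to the idle pool."""
        current = self.vsf_elements.get(vsf_id, [])
        released = [current.pop() for _ in range(min(count, len(current)))]
        self.idle_pool.extend(released)
        return released
```

An application that observes rising load could call `request_servers` and later `release_servers`, leaving the control plane to decide how many elements are actually granted.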
【０１４３】
上記アーキテクチャにおいて、何れの部分も、顧客がコンピューティンググリッドとの使用でそのアプリケーションを変更する必要はない。既存のアプリケーションは、手動構成したサーバファームで動作するのと同様に動作する。しかしながら、制御プレーンによって与えられるリアルタイム負荷監視機能に基づいて必要とするコンピューティングリソースをよりよく理解するのであれば、アプリケーションはコンピューティンググリッドで可能なダイナミズムを利用することができる。アプリケーションプログラムによるサーバファームのコンピューティング容量の変更を可能にする上記性質のAPIは、サーバファームの構築に対する既存の手動方法を用いては可能ではない。
【０１４４】
−自動更新およびバージョニング
ここに開示する方法および機構を使用し、制御プレーンは、VSFのコンピューティング要素で実行されるオペレーティングシステムソフトウェアの自動更新およびバージョニングを行うことができる。従って、エンドユーザまたは顧客は、新たなパッチ、バグフィックスなどでオペレーティングシステムを更新することについて心配する必要はない。制御プレーンは、このようなソフトウェア要素が受信されるとそのライブラリを維持し、影響のあった全てのVSFのコンピューティング要素にこれらを自動的に分散およびインストールすることができる。
【０１４５】
実施機構
コンピューティング要素および制御プレーンは幾つかの形式で実施されてもよく、且つ本発明は特定の形式に限定されることはない。一実施例において、各コンピューティング要素は、不揮発性記憶装置１９１０を除き、図１９に示す要素を有する汎用デジタルコンピュータであり、また制御プレーンは、上記プロセスを実施するプログラム命令の制御の下で動作する図１９に示す種類の汎用デジタルコンピュータである。
【０１４６】
図１９は、本発明の実施例が実施されうるコンピュータシステム１９００を示すブロック図である。コンピュータシステム１９００は、情報を伝達するバス１９０２または他の通信機構、および情報を処理するためにバス１９０２に接続されたプロセッサ１９０４を含んでいる。コンピュータシステム１９００はまた情報とプロセッサ１９０４が実行する命令を保存するためにバス１９０２に接続されたランダムアクセスメモリ（RAM）または他の動的記憶装置などのメインメモリ１９０６を含んでいる。メインメモリ１９０６も、プロセッサ１９０４が実行する命令の実行中に、一時的数値変数や他の中間情報を保存するのに使用することができる。コンピュータシステム１９００は更に、静的情報およびプロセッサ１９０４の命令を保存するために、バス１９０２に接続されたリードオンリメモリ（ROM）１９０８や他の静的記憶装置を含んでいる。磁気ディスクや光ディスクなどの記憶装置１９１０が設けられ、情報および命令を保存するためにバス１９０２に接続されている。
【０１４７】
コンピュータシステム１９００は、情報をコンピュータユーザに表示するために、陰極線管（CRT）などのディスプレイ１９１２にバス１９０２を介して接続されていてもよい。英数字および他のキーを含む入力機器１９１４は、情報および命令の選択をプロセッサ１９０４に伝達するために、バス１９０２に接続されている。他の種類のユーザ入力機器は、方向情報および命令の選択をプロセッサ１９０４に伝達し、且つカーソルの動きをディスプレイ１９１２上で制御するためのマウス、トラックボール、カーソル方向キーなどのカーソルコントロール１９１６である。この入力機器は一般に、機器が平面における位置を指定することを可能にする２つの軸、すなわち第１軸（例えばｘ）および第２軸（例えばｙ）における２つの自由度を有している。
【０１４８】
本発明は、拡張可能コンピューティングシステムを制御するための、コンピュータシステム１９００の使用に関連している。本発明の一実施例によれば、拡張可能コンピューティングシステムの制御は、メインメモリ１９０６に含まれる一又は二以上の命令の一又は二以上のシーケンスを実行するプロセッサ１９０４に応じて、コンピュータシステム１９００によって行われる。このような命令は、記憶装置１９１０などの別のコンピュータで読み取り可能な媒体からメインメモリ１９０６に読み込まれる。メインメモリ１９０６に含まれる命令のシーケンスを実行することにより、プロセッサ１９０４は、上記のプロセス工程を実行する。マルチ処理構成において一又は二以上のプロセッサを使用し、メインメモリ１９０６に含まれる命令のシーケンスを実行してもよい。別の実施例においては、配線接続された回路を、ソフトウェア命令の代わりに、あるいはこれと組み合わせて使用し、本発明を実施してもよい。従って、本発明の実施例は、ハードウェア回路およびソフトウェアの特定の組合せに限定されない。
【０１４９】
ここで使用する用語「コンピュータで読み取り可能な媒体」は、プロセッサ１９０４に命令を与えて実行することに関連する媒体を意味する。このような媒体は、不揮発性媒体、揮発性媒体および伝送媒体を含むがこれらに限定されない多くの形式を取ることができる。不揮発性媒体は例えば、記憶装置１９１０などの光または磁気ディスクを含む。揮発性媒体は、メインメモリ１９０６などの動的メモリを含む。伝送媒体は、バス１９０２を構成する配線を含む同軸ケーブル、銅線および光ファイバーを含む。伝送媒体も、無線および赤外線データ通信の間に生成されるような音波や光波の形式を取ることができる。
【０１５０】
コンピュータで読み取り可能な媒体の一般的形式は、例えば、以下に説明するようなフロッピー（登録商標）ディスク、フレキシブルディスク、ハードディスク、磁気テープ、ほかの磁気媒体、CD-ROM、他の光媒体、パンチカード、紙テープ、穴のパターンを有する他の物理的媒体、RAM、PROM、EPROM、FLASH-EPROM、他のメモリチップまたはカートリッジ、搬送波、またはコンピュータが読み取り可能なほかの媒体を含む。
【０１５１】
コンピュータが読み取り可能な媒体の様々な形式は、プロセッサ１９０４に一又は二以上の命令の一又は二以上のシーケンスを送って実行させることに関連していてもよい。例えば、命令はまず、遠隔コンピュータの磁気ディスクに送られる。遠隔コンピュータはその動的メモリに命令をロードして、モデムを使用して電話回線上で命令を送る。コンピュータシステム１９００に対して遠隔にあるモデムは、電話回線上のデータを受信し、赤外線トランスミッタを使用してデータを赤外線信号に変換することができる。バス１９０２に接続された赤外線ディテクタは、赤外線信号で運ばれるデータを受信して、バス１９０２にデータを出す。バス１９０２はデータをメインメモリ１９０６に送り、ここからプロセッサ１９０４は命令の検索と実行を行う。メインメモリ１９０６が受信した命令は、プロセッサ１９０４の実行の前または後で記憶装置１９１０に随意に保存することができる。
【０１５２】
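Computer system 1900 also includes a communication interface 1918 coupled to bus 1902. Communication interface 1918 provides two-way data communication coupling to a network link 1920 that is connected to a local network 1922. For example, communication interface 1918 may be an integrated services digital network (ISDN) card or a modem that provides a data communication connection to a corresponding type of telephone line. As another example, communication interface 1918 may be a local area network (LAN) card that provides a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1918 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.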
【０１５３】
ネットワークリンク１９２０は一般に、一又は二以上のネットワークを介して、他のデータ機器へのデータ通信を行う。例えば、ネットワークリンク１９２０は、ローカルネットワーク１９２２を介して、インターネットサービスプロバイダ（ISP）１９２６によって運営されるホストコンピュータ１９２４またはデータ機器への接続を提供する。ISP１９２６は、一般に「インターネット」と現在呼ばれている世界規模パケットデータ通信ネットワーク１９２８を介してデータ通信サービスを提供する。ローカルネットワーク１９２２およびインターネット１９２８は共に、デジタルデータストリームを伝える電気、電磁または光信号を使用する。様々なネットワークおよびネットワークリンク１９２０上の信号、および通信インタフェース１９１８を介して、コンピュータシステム１９００に対してデジタルデータを送受する信号は、情報を運ぶ搬送波の典型的な形である。
【０１５４】
コンピュータシステム１９００は、ネットワーク、ネットワークリンク１９２０および通信インタフェース１９１８を介して、メッセージを送信し、且つプログラムコードを含むデータを受信することができる。インターネットの例では、サーバ１９３０は、インターネット１９２８、ISP１９２６、ローカルネットワーク１９２２、および通信インタフェース１９１８を介して、アプリケーションプログラムの要求コードを送信する。本発明によれば、このようなダウンロードしたアプリケーションは、ここに説明する拡張可能コンピューティングシステムの制御を規定する。
【０１５５】
受信コードは、受信されるとプロセッサ１９０４により実行、および／または後で実行するために記憶装置１９１０あるいは他の不揮発性ストレージに保存しておいてもよい。このように、コンピュータシステム１９００は、搬送波という形でアプリケーションコードを得ることができる。
【０１５６】
ここに開示したコンピューティンググリッドは、時にパワーグリッドと呼ばれる公共電力ネットワークと概念的に比較される。パワーグリッドは、単一の大規模電力インフラストラクチャを介して電力サービスを得るために、多数の関係者に拡張可能手段を提供する。同様に、ここに開示したコンピューティンググリッドは、単一の大規模コンピューティングインフラストラクチャを使用することによって、多数の組織にコンピューティングサービスを提供する。パワーグリッドを使用するので、電力消費者はその個人電力設備を自主的に管理することはない。例えば、ユーティリティ消費者がその設備または共有設備において個人用発電機を運転させ、個人でその容量および増加を管理する理由はない。その代わりに、パワーグリッドは人口の大部分へ広範囲に電力を供給することができるので、大きなスケールメリットが得られる。同様に、ここに開示するコンピューティンググリッドは、単一の大規模なコンピューティングインフラストラクチャを使用して、人口の大部分にコンピューティングサービスを提供することができる。
【０１５７】
上記の詳述において、具体的な実施例に関連して本発明を説明した。しかしながら、本発明の広大な精神および範囲から逸脱することなく、様々な改良および変更を本発明に加えることが可能であることは明白となろう。従って、説明および図面は、限定的意味ではなく例証において考慮される。
【図面の簡単な説明】
【図１Ａ】図１Ａは、単一のコンピューティング要素トポロジーを使用する単純なウェブサイトのブロック図である。
【図１Ｂ】図１Ｂは、１層ウェブサーバファームのブロック図である。
【図１Ｃ】図１Ｃは、３層ウエブサーバファームのブロック図である。
【図２】図２は、ローカルコンピューティンググリッドを含む拡張可能コンピューティングシステム２００の１つの構成を示すブロック図である。
【図３】図３は、SANゾーンを特徴付ける典型的な仮想サーバファームのブロック図である。
【図４Ａ】図４Ａは、コンピューティング要素の追加および仮想サーバファームからの要素の除去に関連する連続工程を示すブロック図である。
【図４Ｂ】図４Ｂは、コンピューティング要素の追加および仮想サーバファームからの要素の除去に関連する連続工程を示すブロック図である。
【図４Ｃ】図４Ｃは、コンピューティング要素の追加および仮想サーバファームからの要素の除去に関連する連続工程を示すブロック図である。
【図４Ｄ】図４Ｄは、コンピューティング要素の追加および仮想サーバファームからの要素の除去に関連する連続工程を示すブロック図である。
【図５】図５は、仮想サーバファームシステム、コンピューティンググリッド、監視機構の実施例のブロック図である。
【図６】図６は、仮想サーバファームの論理接続のブロック図である。
【図７】図７は、仮想サーバファームの論理接続のブロック図である。
【図８】図８は、仮想サーバファームの論理接続のブロック図である。
【図９】図９は、制御プレーンおよびデータプレーンの論理関係のブロック図である。
【図１０】図１０は、マスター制御選択プロセスの状態図である。
【図１１】図１１は、スレーブ制御プロセスの状態図である。
【図１２】図１２は、マスター制御プロセスの状態図である。
【図１３】図１３は、中央制御プロセッサおよび多数の制御プレーンおよびコンピューティンググリッドのブロック図である。
【図１４】図１４は、制御プレーンおよびコンピューティンググリッドの部分を実施するアーキテクチャのブロック図である。
【図１５】図１５は、ファイアウォールによって保護されるコンピューティンググリッドを有するシステムのブロック図である。
【図１６】図１６は、制御プレーンをコンピューティンググリッドに接続するアーキテクチャのブロック図である。
【図１７】図１７は、VLANタグとIPアドレスを密に結合する配置のブロック図である。
【図１８】図１８は、WAN接続上で拡張した複数のVSFのブロック図である。
【図１９】図１９は、実施例が実施されるコンピュータシステムのブロック図である。
[0001]
BACKGROUND OF THE INVENTION
The present invention generally relates to data processing. The present invention particularly relates to a method and apparatus for controlling a computing grid.
[0002]
[Problems to be solved by the invention]
Builders of today's websites and other computer systems face many challenging system planning problems. These include capacity planning, site availability, and site security. Meeting these objectives requires finding and hiring trained personnel capable of designing and operating a site that may be large and complex. For many organizations, designing, building, and operating a large site is not a core business, so finding and hiring such personnel has proven difficult.
[0003]
One approach has been to host an enterprise website at a third-party site, co-located with the websites of other enterprises. Such outsourcing facilities are currently available from companies such as Exodus, AboveNet, and GlobalCenter. These facilities provide physical space, redundant network connectivity, and power generation facilities that are shared among many customers.
[0004]
Hosting a website at an outsourced facility greatly reduces the burden of establishing and maintaining the site, but it does not relieve the enterprise of all the problems associated with maintaining a website. The enterprise must still perform many tasks relating to its computing infrastructure while building, operating, and growing its facilities. The enterprise's information technology managers remain responsible for manually selecting, installing, configuring, and maintaining their own computing equipment at the facility. They must still confront difficult issues such as resource planning and handling peak capacity. In particular, they must predict resource demand and request resources from the outsourcing company to meet that demand. Many managers ensure sufficient capacity by requesting substantially more resources than are needed, as a cushion against unexpected peak demand. Unfortunately, this results in a significant amount of unused capacity and increases the enterprise's overhead for hosting its website.
[0005]
Even when the outsourcing company also provides complete computing facilities, including servers, software, and power facilities, scaling and growing the facilities is no easier for the outsourcing company, because growth requires the same manual, error-prone administrative steps. In addition, the problem of capacity planning for unexpected peak demand remains. In this case, the outsourcing company may maintain a significant amount of unused capacity.
[0006]
In addition, the websites managed by an outsourcing company often have differing requirements. For example, some enterprises need the ability to administer and control their websites independently. Other enterprises require a particular type or level of security that isolates their website from all other sites co-located at the outsourcing company. As another example, some enterprises require a secure connection to an enterprise intranet located elsewhere.
[0007]
Furthermore, websites differ in their internal topology. Some sites consist simply of a row of web servers that are load balanced by a web load balancer. Suitable load balancers include Local Director from Cisco Systems, Inc., BigIP from F5 Labs, and Web Director from Alteon. Other sites may be constructed in a multi-tier fashion, in which a row of web servers handles Hypertext Transfer Protocol (HTTP) requests but the bulk of the application logic is implemented in separate application servers. These application servers may in turn need to be connected to a tier of database servers.
[0008]
Some of these different structural scenarios are shown in FIGS. 1A, 1B, and 1C. FIG. 1A is a block diagram of a simple website consisting of a single computing element or machine 100 that includes a CPU 102 and a disk 104. Machine 100 is connected to a worldwide packet switched data network 106 known as the Internet, or other network. Machine 100 may be housed in a co-location service of the type described above.
[0009]
FIG. 1B is a block diagram of a one-tier web server farm 110 that includes multiple web servers WSA, WSB, WSC. Each web server is connected to a load balancer 112 connected to the Internet 106. The load balancer divides traffic between servers to maintain a balanced processing load on each server. The load balancer 112 may also include or be connected to a firewall to protect the web server from unauthorized traffic.
[0010]
FIG. 1C shows a three-tier server farm 120 comprising a tier of web servers W1, W2, etc., a tier of application servers A1, A2, etc., and a tier of database servers D1, D2, etc. The web servers are provided for handling HTTP requests. The application servers execute the bulk of the application logic. The database servers execute database management system (DBMS) software.
[0011]
Given the diversity in the topology of the websites that need to be constructed, and the varying requirements of the corresponding enterprises, it may appear that the only way to construct a large website is to custom-build each site physically. Many organizations struggle separately with the same problems and custom-build each website from scratch. This is inefficient and involves a significant amount of duplicated work across different enterprises.
[0012]
Another problem with conventional methods is resource and capacity planning. Websites receive very different levels of traffic on different days or at different times of the day. At peak traffic times, website hardware or software may not be able to respond to requests within a reasonable amount of time due to overload. At other times, the website hardware or software has excessive capacity and is not fully utilized. With traditional methods, finding a balance in having enough hardware and software to handle peak traffic without incurring undue cost or excess capacity is a difficult problem. Many websites are unable to find the right balance and are chronically suffering from under or over capacity.
[0013]
Yet another problem is failure induced by human error. A great potential hazard in the current approach of using manually constructed server farms is that human error in configuring a new server into a live server farm can cause the server farm to malfunction, possibly resulting in loss of service to users of the website.
[0014]
Based on the foregoing, there is a clear need in this field for improved methods and apparatus that provide a computing system that can be expanded readily and quickly on demand without requiring custom construction.
[0015]
There is also a need for a computing system that supports the generation of multiple separate processing nodes that can each be expanded or reduced as needed to account for changes in traffic throughput.
[0016]
There is a further need for a method and apparatus for controlling such an extensible computing system and its constituent separate processing nodes. Other needs will become apparent from the disclosure provided herein.
[0017]
DISCLOSURE OF THE INVENTION
According to one aspect of the present invention, the foregoing needs, and other needs that will become apparent from the following description, are achieved by a method and apparatus for controlling and managing highly scalable, highly available, and secure data processing sites based on a wide-scale computing fabric ("computing grid"). The computing grid is physically constructed once and then logically divided up for various organizations on demand. The computing grid comprises a large number of computing elements that are coupled to one or more VLAN switches and to one or more storage area network (SAN) switches. A plurality of storage devices are coupled to the SAN switches and may be selectively coupled to one or more of the computing elements through appropriate switching logic and commands. One port of the VLAN switch is coupled to an external network, such as the Internet. A monitoring mechanism, layer, machine, or process is coupled to the VLAN switches and the SAN switches.
[0018]
Initially, all storage devices and computing elements are assigned to an idle pool. Under program control, the monitoring mechanism dynamically configures the VLAN switch and SAN switch ports to connect them to one or more computing elements and storage devices. As a result, such elements and devices are logically removed from the idle pool and become part of one or more virtual server farms (VSFs) or instant data centers (IDCs). Each VSF computing element is pointed to or otherwise associated with a storage device containing a boot image that the computing element can use for bootstrap operation and production execution.
[0019]
According to one aspect of the present invention, the monitoring layer is a control plane comprising a hierarchy of control mechanisms that includes one or more master control process mechanisms communicatively coupled to one or more slave control process mechanisms. The master control process mechanisms allocate and deallocate slave control process mechanisms based on the loading of the slave control process mechanisms. The master control process mechanisms instruct the slave control process mechanisms to establish IDCs by selecting subsets of processing and storage resources. The master control process mechanisms periodically perform health checks of the slave control process mechanisms. A slave control process mechanism that fails to respond or terminates abnormally is restarted. Another slave control process mechanism is started to replace one that cannot be restarted. The slave control process mechanisms periodically perform health checks of the master control process mechanisms. If a master control process mechanism terminates abnormally, a slave control process mechanism is elected to become the new master control process mechanism and replaces the failed one.
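The health-check and replacement behavior described in this paragraph can be sketched as a single screening pass plus an election step. The function names and the callback parameters are hypothetical, and the lowest-ID election rule is an assumption made for illustration, since this paragraph does not state how the new master is chosen.

```python
def screen_slaves(slaves, is_responsive, try_restart, start_replacement):
    """One periodic health-check pass by the master over its slaves.

    A slave that does not respond is restarted; if the restart fails,
    a replacement is started. Returns the updated list of slave IDs.
    """
    result = []
    for slave in slaves:
        if is_responsive(slave) or try_restart(slave):
            result.append(slave)
        else:
            result.append(start_replacement(slave))
    return result


def elect_new_master(slaves):
    """Pick a slave to become master; lowest ID wins (an assumed rule)."""
    if not slaves:
        raise RuntimeError("no slave available to become master")
    return min(slaves)
```

The three callbacks stand in for whatever probing, restart, and allocation facilities a real control plane would provide.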
[0020]
Because the computing grid is physically constructed once and then securely and dynamically allocated to various organizations on demand, economies of scale are achieved that are difficult to obtain when each site is custom-built.
[0021]
The present invention is illustrated by way of example and not limitation in the accompanying drawings, in which identical reference numbers indicate similar elements.
[0022]
[Embodiments of the Invention]
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
[0023]
Virtual server farm (VSF)
According to one embodiment, a wide-scale computing fabric (“computing grid”) is provided. The computing grid may be physically constructed once and then logically partitioned on demand. A portion of the computing grid is allocated to each of a plurality of enterprises or organizations. Each organization's logical portion of the computing grid is called a virtual server farm (VSF). Each organization retains independent administrative control of its VSF. Each VSF can change dynamically in number of CPUs, storage capacity and disk and network bandwidth, based on real-time demands placed on the server farm or other factors. Although the VSFs are all logically created out of the same physical computing grid, each VSF is secure from the VSFs of all other organizations. A VSF can connect back to an intranet using a private leased line or a virtual private network (VPN) without exposing the intranet to the VSFs of other organizations.
[0024]
An organization can access only the data and computing elements in the portion of the computing grid allocated to it, that is, in its VSF, even though it may exercise full (e.g., superuser or root) administrative access to these computers and can observe all traffic on the local area networks (LANs) to which these computers are connected. According to one embodiment, this is accomplished using a dynamic firewalling scheme in which the security perimeter of the VSF expands and shrinks dynamically. Each VSF can be used to host the content and applications of an organization that may be accessed via the Internet, an intranet, or an extranet.
[0025]
The configuration and control of the computing element and its associated networking and storage elements are performed by a monitoring mechanism that is not directly accessible by any of the computing elements in the computing grid. For convenience, in this document the monitoring mechanism is commonly referred to as the control plane and may consist of one or more processors or a network of processors. The monitoring mechanism may be composed of a supervisor, a controller, and the like. Other methods can be used as described herein.
[0026]
The control plane is implemented on a completely independent set of computing elements assigned for monitoring purposes, such as one or more servers interconnected within the network or by other means. The control plane performs control operations on the computing, networking and storage elements of the computing grid via the grid's networking and storage element special control ports or interfaces. The control plane provides a physical interface to the switching elements of the system, monitors the load of computing elements in the system, and provides operational management functions using a graphical user interface or other suitable user interface.
[0027]
The computers used to implement the control plane are logically invisible to the computers in the computing grid (and therefore to any particular VSF), and cannot be attacked or subverted in any way via elements in the computing grid or from external computers. Only the control plane has physical connections to the control ports of devices in the computing grid, which control membership in a particular VSF. Because the devices in the computing grid can be configured only through these special control ports, computing elements in the computing grid cannot change their security perimeter or gain access to storage or computing devices that they are not authorized to use.
[0028]
Thus, a VSF allows an organization to work with computing facilities that appear to comprise a private server farm, dynamically created out of a large-scale shared computing infrastructure, namely the computing grid. The control plane, coupled with the computing architecture described herein, provides a private server farm whose privacy and integrity are protected by access control mechanisms implemented in the hardware of the computing grid.
[0029]
The control plane controls the internal topology of each VSF. The control plane can take the basic interconnection of computers, network switches, and storage network switches described herein and use it to create a variety of server farm configurations. These include, but are not limited to, single-tier web server farms front-ended by a load balancer, as well as multi-tier configurations in which a web server communicates with an application server, which in turn communicates with a database server. A variety of load balancing, multi-tiering, and firewalling configurations are possible.
[0030]
Computing grid
The computing grid may exist in a single location or may be distributed over a wide area. This document first describes the computing grid in the context of a single building-sized network composed purely of local area technologies. It then describes the case in which the computing grid is distributed over a wide area network (WAN).
[0031]
FIG. 2 is a block diagram illustrating one configuration of an extensible computing system 200 that includes a local computing grid 208. In this document, “extensible” generally means that the system is flexible and scalable, with the ability to provide increased or decreased computing power to a particular enterprise or user on demand. The local computing grid 208 is composed of a large number of computing elements CPU1, CPU2, ..., CPUn. In an embodiment, there are 10,000 or more computing elements. Because these computing elements do not contain or store any long-lived per-element state information, they may be configured without persistent or non-volatile storage such as a local disk. Instead, all long-lived state information is stored separately from the computing elements on a plurality of disks, Disk 1, Disk 2, ..., Disk n, which are connected to the computing elements via a storage area network (SAN) comprising one or more SAN switches 202. Suitable SAN switches are sold by Brocade and Excel.
[0032]
All computing elements are interconnected via one or more VLAN switches 204, which can be divided into virtual LANs (VLANs). The VLAN switch 204 is connected to the Internet 106. In general, a computing element contains one or two network interfaces connected to a VLAN switch. For simplicity, FIG. 2 shows all nodes with two network interfaces, although some nodes may have fewer or more. Many manufacturers now provide switches that support VLAN functionality. For example, suitable VLAN switches are available from Cisco Systems, Inc and Xtreme Networks. Similarly, there are many products available for constructing SANs, including Fiber Channel switches, SCSI-to-Fiber Channel bridging devices, and network attached storage (NAS) devices.
[0033]
The control plane 206 is connected to the SAN switch 202, to the computing elements CPU1, CPU2, ..., CPUn, and to the VLAN switch 204.
[0034]
Each VSF consists of a set of VLANs, a set of computing elements attached to those VLANs, and a subset of the storage available on the SAN that is connected to the set of computing elements. The subset of storage available on the SAN is called a SAN zone, and the SAN hardware protects it from access by computing elements that are part of other SAN zones. Preferably, VLANs that provide non-forgeable port identifiers are used to prevent one customer or end user from accessing the VSF resources of another customer or end user.
[0035]
FIG. 3 is a block diagram of a typical virtual server farm featuring a SAN zone. A plurality of web servers WS1, WS2 and the like are connected to a load balancer (LB) / firewall 302 by a first VLAN (VLAN1). The second VLAN (VLAN 2) connects the Internet 106 to the load balancer (LB) / firewall 302. Each web server can be selected from CPU1, CPU2, etc. using a mechanism described later. The web server is connected to the SAN zone 304, which is connected to one or more storage devices 306a, 306b.
[0036]
At any given point in time, a computing element in the computing grid, such as CPU1 of FIG. 2, is connected only to the set of VLANs and the SAN zone associated with a single VSF. Typically, a VSF is not shared between different organizations. The subset of storage on the SAN that belongs to a single SAN zone, the set of VLANs associated with it, and the computing elements on those VLANs together define a VSF.
[0037]
By controlling VLAN membership and SAN zone membership, the control plane logically partitions the computing grid into multiple VSFs. A member of one VSF cannot access the computing or storage resources of another VSF. These access restrictions are enforced by the VLAN switches and by the port-level access control mechanisms (for example, zoning) of SAN hardware such as Fiber Channel switches and edge devices such as SCSI-to-Fiber Channel bridging hardware. Because the computing elements that form part of the computing grid are not physically connected to the control ports or interfaces of the VLAN switches and SAN switches, they cannot control VLAN or SAN zone membership. Accordingly, the computing elements of the computing grid cannot access computing elements that are not located in the VSF that contains them.
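The partitioning rule described above can be illustrated with a small model. The following Python sketch is an illustration only and not the implementation described in this document: the class names `VSF` and `Grid` and the method `can_access` are hypothetical, and the check simply restates the rule that one computing element can reach another only when both belong to the same VSF.

```python
from dataclasses import dataclass, field


@dataclass
class VSF:
    """One virtual server farm: its VLANs, its SAN zone, and its member elements."""
    name: str
    vlans: set = field(default_factory=set)
    san_zone: str = ""
    elements: set = field(default_factory=set)


class Grid:
    """Hypothetical model of a computing grid partitioned into VSFs."""

    def __init__(self):
        self.vsfs = {}

    def add_vsf(self, vsf):
        # A computing element may belong to at most one VSF at a time.
        for other in self.vsfs.values():
            if other.elements & vsf.elements:
                raise ValueError("element already belongs to " + other.name)
        self.vsfs[vsf.name] = vsf

    def vsf_of(self, element):
        for vsf in self.vsfs.values():
            if element in vsf.elements:
                return vsf
        return None

    def can_access(self, source, target):
        # Access is possible only between members of the same VSF.
        owner = self.vsf_of(source)
        return owner is not None and target in owner.elements
```

In a real grid the restriction is enforced by the switches rather than by a software check; the sketch only captures the resulting membership rule.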
[0038]
Only the computing elements that execute the control plane are physically connected to the control ports or interfaces of the equipment in the grid. The computing grid devices (computers, SAN switches, and VLAN switches) can be configured only through these control ports or interfaces. This provides a simple but very stable means of dynamically dividing the computing grid into a number of VSFs.
[0039]
Each computing element in the VSF is interchangeable with other computing elements. The number of computing elements, VLANs and SAN zones associated with a given VSF changes over time under control of the control plane.
[0040]
In one embodiment, the computing grid includes an idle pool made up of a large number of spare computing elements. Computing elements from the idle pool may be assigned to a particular VSF for reasons such as increasing the CPU or memory capacity available to that VSF, or dealing with the failure of a particular computing element in the VSF. When the computing elements are configured as web servers, the idle pool functions as a large “shock absorber” for varying or “bursty” web traffic loads and the associated peak processing loads.
[0041]
Since the idle pool is shared among many different organizations, no single organization has to pay for the entire idle pool, which provides an economy of scale. Different organizations can obtain computing elements from the idle pool at different times of the day as needed, so each VSF can grow when needed and shrink when traffic returns to normal. If many different organizations continue to peak at the same time, which could exhaust the capacity of the idle pool, the idle pool can be enlarged by adding more CPUs and storage elements to it (scalability). The capacity of the idle pool is designed to greatly reduce the probability that, under normal conditions, a particular VSF is unable to obtain another computing element from the idle pool when it needs one.
[0042]
FIGS. 4A, 4B, 4C, and 4D are block diagrams that illustrate the successive steps as a computing element is moved in and out of the idle pool. Referring first to FIG. 4A, assume that the control plane has logically connected elements of the computing grid into first and second VSFs labeled VSF1 and VSF2. The idle pool 400 consists of a plurality of CPUs 402, one of which is labeled CPUX. In FIG. 4B, VSF1 requires another computing element. Accordingly, the control plane moves CPUX from the idle pool 400 to VSF1, as indicated by path 404.
[0043]
In FIG. 4C, VSF1 no longer needs CPUX, so the control plane returns CPUX from VSF1 to the idle pool 400. In FIG. 4D, VSF2 requires another computing element. Therefore, the control plane moves CPUX from the idle pool 400 to VSF2. Thus, as time passes and traffic conditions change, a single computing element may belong to the idle pool (FIG. 4A), be assigned to a particular VSF (FIG. 4B), be returned to the idle pool (FIG. 4C), and then belong to another VSF (FIG. 4D).
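The sequence of FIGS. 4A to 4D amounts to simple bookkeeping on the part of the control plane. The Python sketch below is a hypothetical illustration: the class `IdlePoolControl` and the helper `location` are invented names, and the rule for choosing which idle element to hand out (the alphabetically first one) is an assumption made only to keep the example deterministic.

```python
class IdlePoolControl:
    """Hypothetical bookkeeping for moving elements in and out of the idle pool."""

    def __init__(self, idle_elements):
        self.idle_pool = set(idle_elements)
        self.assigned = {}  # element name -> name of the VSF it belongs to

    def allocate(self, vsf_name):
        """Move one element from the idle pool into the named VSF."""
        if not self.idle_pool:
            raise RuntimeError("idle pool exhausted")
        element = min(self.idle_pool)  # deterministic choice, for the sketch only
        self.idle_pool.remove(element)
        self.assigned[element] = vsf_name
        return element

    def release(self, element):
        """Return an element from its VSF to the idle pool."""
        del self.assigned[element]
        self.idle_pool.add(element)


def location(control, element):
    """Report where an element currently belongs."""
    return control.assigned.get(element, "idle pool")
```

With a pool that contains only `CPUX`, calling `allocate("VSF1")`, then `release("CPUX")`, then `allocate("VSF2")` reproduces the four stages of FIGS. 4A to 4D.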
[0044]
At each of these stages, the control plane configures the LAN and SAN switches associated with the computing element so that the element becomes part of the VLANs and SAN zones associated with a particular VSF (or with the idle pool). According to one embodiment, between each transition the computing element is powered down or restarted. When the computing element is powered on again, it sees a different part of the SAN storage zone. In particular, it sees the portion of the storage zone on the SAN that contains a bootable image of an operating system (e.g., Linux, NT, Solaris, etc.). The storage zone also includes data portions specific to each organization (e.g., files associated with web servers, database partitions, etc.). The computing element is also part of another VLAN that belongs to the VLAN set of another VSF, so it can access the CPUs, SAN storage, and NAS equipment associated with the VLANs of the destination VSF.
[0045]
In the preferred embodiment, the storage zone includes a plurality of predefined logical detailed designs associated with the roles that a computing element may assume. Initially, none of the computing elements is assigned to a specific role or task, such as web server, application server, or database server. The role of a computing element is derived from one of the plurality of predefined, stored detailed designs, each of which defines the boot image for a computing element associated with that role. The detailed designs are stored in a file, a database table, or another storage format that associates a boot image location with a role.
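As a minimal illustration of a table that associates a boot image location with a role, consider the Python sketch below. The role names, the image paths, and the function `boot_image_for` are all hypothetical; the document states only that some file, database table, or other storage format holds the association.

```python
# Hypothetical role table: role name -> location of the boot image on the SAN.
# The paths are placeholders and do not come from the document.
DETAILED_DESIGNS = {
    "web_server": "san/boot/web_server.img",
    "application_server": "san/boot/application_server.img",
    "database_server": "san/boot/database_server.img",
}


def boot_image_for(role, designs=DETAILED_DESIGNS):
    """Look up the boot image location stored for a role."""
    try:
        return designs[role]
    except KeyError:
        raise LookupError("no detailed design stored for role: " + role) from None
```

Because a role is nothing more than a pointer to a boot image, the same computing element can take on a different role simply by being directed to a different entry.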
[0046]
Therefore, the movement of CPUX in FIGS. 4A, 4B, 4C, and 4D is logical, not physical, and is performed by reconfiguring VLAN switches and SAN zones under control of the control plane. Also, each computing element in the computing grid is initially interchangeable and assumes a specific processing role only after it has been connected to a virtual server farm and has loaded software from a boot image. None of the computing elements is permanently assigned a specific role or task, such as web server, application server, or database server. The role of a computing element is derived from one of a plurality of predefined, stored detailed designs, each of which is associated with a role and defines the boot image for a computing element associated with that role.
[0047]
Since long-term state information is not stored on any particular computing element (for example, on a local disk), nodes can easily move between different VSFs and run completely different OS and application software. This also makes it easier to replace computing elements in the event of planned or unplanned downtime.
[0048]
A particular computing element can perform different roles as it moves in and out of various VSFs. For example, a computing element can act as a web server in one VSF and then move to another VSF to become a database server, web load balancer, firewall, and so on. It can also successively boot and run different operating systems, such as Linux, NT, and Solaris, in different VSFs. Thus, each computing element in the computing grid is interchangeable and has no fixed role assigned to it. Consequently, the entire spare capacity of the computing grid can be used to provide any service that any VSF requires. As a result, for each server that executes a specific service, the number of backup servers able to provide the same service is in the thousands, so the availability and reliability of the services provided by a single VSF are extremely high.
[0049]
In addition, the high reserve capacity of the computing grid provides dynamic load balancing characteristics and high processor availability. This capability is made possible by the unique combination of diskless computing elements that are interconnected via VLANs, connected to configurable zones of storage devices via the SAN, and all controlled in real time by the control plane. Each computing element can operate in the role of any required server in any VSF and can be connected to any logical partition of any disk in the SAN. If the grid requires more computing power or disk space, computing elements or disk storage are manually added to the idle pool, which may shrink over time as VSF services are provided to more organizations. No manual intervention is required to increase the number of CPUs, the network and disk capacity, or the storage available to a VSF. All of these resources are allocated by the control plane, whenever requested, from the CPU, network, and disk resources available in the idle pool.
[0050]
A particular VSF is never manually reconfigured. Only the computing elements of the idle pool are manually configured into the computing grid. As a result, a large potential hazard that currently exists in manually configured server farms is eliminated. The possibility that human error in configuring a new server into a live server farm will cause the server farm to malfunction, resulting in loss of service to users of that website, is almost entirely removed.
[0051]
The control plane also replicates data stored on storage devices attached to the SAN, so that failure of a particular storage element does not cause loss of service to any part of the system. By using a SAN to decouple long-term storage from the computing devices, and by providing redundant storage and computing elements such that any computing element can be attached to any storage partition, a high degree of availability is achieved.
[0052]
Detailed examples of establishing a virtual server farm, adding processors to it, and removing processors from it
FIG. 5 is a block diagram of a computing grid and control plane mechanism according to an embodiment. With reference to FIG. 5, the following describes the detailed process that can be used to create a VSF, add nodes to it, and then remove nodes from it.
[0053]
FIG. 5 shows computing elements 502 that include computers A through G connected to a VLAN-capable switch 504. The VLAN switch 504 is connected to the Internet 106 and has ports V1, V2, and so on. The computers A through G are further connected to a SAN switch 506, which is connected to a plurality of storage devices or disks D1 to D5. The SAN switch 506 has ports S1, S2, and so on. The control plane mechanism 508 is communicatively connected to the SAN switch 506 and the VLAN switch 504 through control paths and data paths. The control plane can send control commands to these devices via their control ports.
[0054]
For convenience, the number of computing elements in FIG. 5 is reduced. In practice, a large number of computers, for example thousands or more, and the same number of storage devices form a computing grid. In such a large structure, a number of SAN switches are interconnected to form a mesh, and VLAN switches are interconnected to form a VLAN mesh. However, for the sake of clarity, FIG. 5 shows a single SAN switch and a single VLAN switch.
[0055]
Initially, all the computers A to G are assigned to the idle pool until the control plane receives a VSF creation request. All ports of the VLAN switch are assigned to a specific VLAN labeled VLAN I (for the idle zone). Assume that the control plane is asked to construct a VSF that contains one load balancer / firewall and two web servers connected to storage on the SAN. Requests to the control plane are received via a management interface or other computing element.
[0056]
In response, the control plane assigns or allocates CPUA as the load balancer / firewall and assigns CPUB and CPUC as web servers. CPUA is logically placed in SAN zone 1 and is directed to a bootable partition on a disk that contains dedicated load balancer / firewall software. The term “directed” is used for convenience and means that CPUA is given, by some means, sufficient information to obtain or locate the appropriate software that it needs to run. Placing CPUA in SAN zone 1 allows CPUA to obtain resources from the disks that are controlled by the SAN of that SAN zone.
[0057]
The load balancer is configured by the control plane to know about CPUB and CPUC as the two web servers that it is to load balance. The firewall configuration protects CPUB and CPUC from unauthorized access from the Internet 106. CPUB and CPUC are directed to disk partitions on the SAN that contain a bootable OS image for a specific operating system (e.g., Solaris, Linux, NT, etc.) and web server application software (e.g., Apache). The VLAN switch is configured to place ports v1 and v2 in VLAN1 and to place ports v3, v4, v5, v6, and v7 in VLAN2. The control plane configures SAN switch 506 to place Fiber Channel switch ports s1, s2, s3, and s8 in SAN zone 1.
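The port assignments for VSF1 can be written down as a small configuration model. The Python sketch below is hypothetical: the class `PortGroups` and the function `build_vsf1` are invented for illustration, and the total port counts (v1 to v15 and s1 to s9) are assumptions based on the highest port numbers mentioned in this example. Every port starts in the idle group, and the control plane then places specific ports into VLAN1, VLAN2, and SAN zone 1.

```python
class PortGroups:
    """Hypothetical switch model in which each port belongs to exactly one group."""

    def __init__(self, ports, idle_group):
        self.idle_group = idle_group
        self.group_of = {port: idle_group for port in ports}

    def place(self, group, ports):
        """Move the given ports into a VLAN or SAN zone."""
        for port in ports:
            if port not in self.group_of:
                raise KeyError("unknown port: " + port)
            self.group_of[port] = group

    def members(self, group):
        return sorted(p for p, g in self.group_of.items() if g == group)


def build_vsf1():
    """Apply the VSF1 port assignments described in the text."""
    # Port counts are assumptions made for this sketch.
    vlan_switch = PortGroups(["v%d" % i for i in range(1, 16)], "VLAN I")
    san_switch = PortGroups(["s%d" % i for i in range(1, 10)], "SAN zone I")
    vlan_switch.place("VLAN1", ["v1", "v2"])
    vlan_switch.place("VLAN2", ["v3", "v4", "v5", "v6", "v7"])
    san_switch.place("SAN zone 1", ["s1", "s2", "s3", "s8"])
    return vlan_switch, san_switch
```

Keeping every port in exactly one group mirrors the idea that a port not assigned to any VSF remains in the idle VLAN or idle SAN zone.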
[0058]
How a CPU is directed to a particular disk drive, and what this means for booting and for shared access to disk data, is described later in this document.
[0059]
FIG. 6 is a block diagram illustrating the resulting logical connectivity of the computing elements, which are collectively referred to as VSF1. The disk drive DD1 is selected from the storage devices D1, D2, etc. Once the logical structure shown in FIG. 6 has been established, a boot command is issued to CPUA, CPUB, and CPUC. Accordingly, CPUA becomes a dedicated load balancer / firewall computing element, and CPUB and CPUC become web servers.
[0060]
Now suppose that, because of a policy-based rule, the control plane determines that another web server is required in VSF1. This may occur, for example, because the number of requests to the website has increased and the customer's plan permits at least three web servers to be added to VSF1. Alternatively, it may occur because the organization that owns or operates the VSF wants another server and has added it through a management mechanism, such as a privileged web page that allows it to add more servers to its VSF.
[0061]
In response, the control plane decides to add CPUD to VSF1. To do so, the control plane adds CPUD to VLAN2 by adding ports v8 and v9 to VLAN2. Further, the SAN port s4 of CPUD is added to SAN zone 1. CPUD is directed to a bootable portion of the SAN storage that causes it to boot and run as a web server. CPUD also has read-only access to shared data on the SAN, which includes web page content, executable server scripts, and so on. In this way, CPUD can serve web requests directed to the server farm in the same way that CPUB and CPUC serve them. The control plane also configures the load balancer (CPUA) to include CPUD as part of the load balanced server set.
[0062]
CPUD is then booted, and the size of the VSF has increased to three web servers and one load balancer. FIG. 7 shows the resulting logical connectivity.
[0063]
Assume that the control plane now receives a request to create another VSF, named VSF2, that requires two web servers and one load balancer. The control plane assigns CPUE to be the load balancer / firewall and assigns CPUF and CPUG to be web servers. CPUE is configured to know about CPUF and CPUG as the two computing elements that it is to load balance.
[0064]
In order to implement this configuration, the control plane configures VLAN switch 504 so that VLAN1 (that is, the VLAN connected to the Internet 106) includes ports v10 and v11 and VLAN3 includes ports v12, v13, v14, and v15. Similarly, the SAN switch 506 is configured so that SAN zone 2 includes the SAN ports s6, s7, and s9. This SAN zone includes the storage devices that contain the software necessary to run CPUE as a load balancer and CPUF and CPUG as web servers that use a shared read-only disk partition contained on disk D2 in SAN zone 2.
[0065]
FIG. 8 is a block diagram of the resulting logical connectivity. The two VSFs (VSF1, VSF2) share the same physical VLAN switch and SAN switch, but they are logically partitioned. Users who access CPUB, CPUC, or CPUD, or the company that owns or operates VSF1, can access only the CPUs and storage of VSF1. Such users cannot access the CPUs or storage of VSF2. This is ensured by the combination of the separate VLANs and the two firewalls on the only shared segment (VLAN1), and by the different SAN zones in which the two VSFs are configured.
[0066]
Assume further that the control plane later determines that VSF1 can be reduced back to two web servers. This may be because the temporary increase in the load on VSF1 has subsided or because some other management action has been taken. In response, the control plane shuts down CPUD with a special command that includes powering off the CPU. Once the CPU has shut down, the control plane removes ports v8 and v9 from VLAN2 and removes SAN port s4 from SAN zone 1. Port s4 is placed in the idle SAN zone. The idle SAN zone is designated, for example, as SAN zone I (for idle) or zone 0.
[0067]
Thereafter, the control plane decides to add another node to VSF2. This may be because the web server load in VSF2 has temporarily increased or for other reasons. The control plane therefore decides to place CPUD in VSF2, as indicated by dashed path 802. To do so, it configures the VLAN switch so that ports v8 and v9 are included in VLAN3 and configures SAN port s4 to be included in SAN zone 2. CPUD is directed to the portion of storage on disk device 2 that contains the bootable OS and web server software images required for the servers of VSF2. CPUD is also allowed read-only access to the file system data shared by the other web servers of VSF2. CPUD is then powered on again and runs as a load balanced web server in VSF2, and it can no longer access data in SAN zone 1 or the CPUs attached to VLAN2. In particular, CPUD cannot access any element of VSF1, even though it was part of VSF1 at an earlier point in time.
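The movement of CPUD from VSF1 to VSF2 follows a fixed order: power off, return to the idle VLAN and SAN zone, attach to the new VLAN and SAN zone, redirect to a new boot image, and power on. The Python sketch below records that order in a log. It is a hypothetical illustration; the class `Element`, the function `transition`, the log format, and the boot image labels are not taken from the document.

```python
IDLE_VLAN = "VLAN I"
IDLE_ZONE = "SAN zone I"


class Element:
    """Hypothetical record of one computing element's ports and placement."""

    def __init__(self, name, vlan_ports, san_port):
        self.name = name
        self.vlan_ports = list(vlan_ports)
        self.san_port = san_port
        self.vlan = IDLE_VLAN
        self.san_zone = IDLE_ZONE
        self.boot_image = None
        self.powered_on = False


def transition(element, vlan, san_zone, boot_image, log):
    """Move an element to a new VLAN and SAN zone, powering it down first."""
    if element.powered_on:
        element.powered_on = False
        log.append("power off " + element.name)
    # While powered off, the element's ports sit in the idle VLAN and SAN zone.
    element.vlan, element.san_zone = IDLE_VLAN, IDLE_ZONE
    log.append("idle " + element.name)
    # Attach the ports to the destination and select the image to boot from.
    element.vlan, element.san_zone = vlan, san_zone
    element.boot_image = boot_image
    log.append("attach %s to %s and %s" % (element.name, vlan, san_zone))
    element.powered_on = True
    log.append("power on " + element.name)
```

Because the element is powered off before its ports are moved, it never runs while attached to both the old and the new VSF.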
[0068]
In addition, in this configuration, the security boundary enforced by CPUE is dynamically extended to include CPUD. Accordingly, embodiments provide a dynamic firewall that automatically adjusts to properly protect computing elements that are added to or removed from a VSF.
[0069]
For purposes of illustration, the example has described port-based SAN zoning. Other types of SAN zoning can also be used. For example, LUN-level SAN zoning may be used to create SAN zones based on logical volumes within disk arrays. An example of a suitable product for LUN-level SAN zoning is the Volume Logix product from EMC Corporation.
[0070]
Disk unit on SAN
There are several ways to direct a CPU to a specific device on the SAN, whether for booting, for accessing disk storage that needs to be shared with other nodes, or for otherwise providing it with information about where to find boot programs and data.
[0071]
One method is to provide a SCSI-to-Fiber Channel bridging device attached to the computing element, together with a SCSI interface for the local disks. By routing that SCSI port to the appropriate device on the Fiber Channel SAN, the computer can access a storage device on the Fiber Channel SAN just as it would access a locally attached SCSI disk. Software such as boot software therefore simply boots off the disk device on the SAN just as it would boot off a locally attached SCSI disk.
[0072]
Another method is to equip the node with a Fiber Channel interface and with an associated device driver, boot ROM, and OS software that allow the Fiber Channel interface to be used as a boot device.
[0073]
Yet another method is to use an interface card (e.g., PCI bus or S bus) that appears to be a SCSI or IDE device controller but in fact communicates over the SAN to access the disk. Operating systems such as Solaris provide a built-in diskless boot feature that can be used in this way.
[0074]
There are usually two types of SAN disk devices associated with a node. The first type is not logically shared with other computing elements; it typically serves as the per-node root partition, containing a bootable OS image, local configuration files, and the like. This is equivalent to the root file system on a Unix (registered trademark) system.
[0075]
The second type of disk is storage shared with other nodes. The type of sharing depends on the OS software running on the CPU and on the needs of the nodes accessing the shared storage. If the OS provides a cluster file system that allows read / write access to a shared disk partition from multiple nodes, the shared disk is implemented as such a cluster file system. Similarly, the system may use database software, such as Oracle Parallel Server, that allows multiple nodes running in a cluster to have simultaneous read / write access to a shared disk. In such cases, support for the shared disk is already designed into the base OS and application software.
[0076]
For operating systems in which such shared access is not possible, because the OS and related applications cannot manage a disk device shared with other nodes, the shared disk can be implemented as a read-only device. For many web applications, read-only access to web-related files is sufficient. For example, on a Unix (registered trademark) system, a particular file system may be mounted read-only.
[0077]
Multi-switch computing grid
The configuration described above with respect to FIG. 5 can be extended to a large number of computing and storage nodes by interconnecting multiple VLAN switches to form a large switched VLAN structure and by interconnecting multiple SAN switches to form a large switched SAN mesh. In this case, the computing grid has the architecture shown generally in FIG. 5, except that the SAN / VLAN switched mesh includes a very large number of ports for CPUs and storage devices. A number of computing elements that implement the control plane can be physically connected to the control ports of the VLAN / SAN switches, as described below. Interconnecting multiple VLAN switches to create complex multi-campus data networks is known in the art. See, for example, “Designing High-Performance Campus Intranets with Multilayer Switching” by G. Haviland, available from Cisco Systems, Inc., and information available from Brocade.
[0078]
SAN architecture
The description assumes that the SAN consists of Fiber Channel switches and disk equipment, and potentially Fiber Channel edge equipment such as SCSI to Fiber Channel bridges. However, the SAN may be configured using switches using other technologies, such as Gigabit Ethernet switches, or other physical layer protocols. In particular, attempts have been made to build SANs over IP networks by running the SCSI protocol over IP. The method and architecture described above can be adapted to these other SAN construction methods. When constructing a SAN by running a protocol such as SCSI over IP in a VLAN capable layer 2 environment, SAN zones are created by mapping them to different VLANs.
[0079]
Further, network attached storage (NAS) operating over a LAN technology such as high-speed Ethernet (registered trademark) or gigabit Ethernet (registered trademark) may be used. With this option, different VLANs are used in place of SAN zones to enforce isolation and the logical partitioning of the computing grid. Such NAS devices typically support network file systems such as Sun's NFS protocol and Microsoft's SMB, allowing many nodes to share the same storage.
[0080]
Control plane implementation
As described herein, the control plane may be implemented as one or more processing resources connected to the control and data ports of the SAN and VLAN switches. Various control plane implementations are possible, and the invention is not limited to any particular control plane implementation. Various aspects of control plane implementation are described in detail in the following sections: 1) Control plane architecture, 2) Master segment manager election, 3) Management functions, 4) Policy and maintenance considerations.
[0081]
1. Control plane architecture
According to one embodiment, the control plane is implemented as a control process hierarchy. The control process hierarchy typically includes one or more master segment manager mechanisms that are communicatively connected to and control one or more slave segment manager mechanisms. One or more slave segment manager mechanisms control one or more farm managers. One or more farm managers manage one or more VSFs. The master and slave segment manager mechanisms may be implemented in hardware circuitry, computer software, or any combination.
[0082]
FIG. 9 is a block diagram 900 illustrating a logical relationship between a control plane 902 and a computing grid 904 according to one embodiment. The control plane 902 controls and manages the computing, networking, and storage elements included in the computing grid 904 via special control ports or interfaces of the networking and storage elements in the computing grid 904. The computing grid 904 includes a number of VSFs 906, or logical resource groups, created in accordance with the embodiments described above.
[0083]
According to one embodiment, control plane 902 includes a master segment manager 908, one or more slave segment managers 910, and one or more farm managers 912. Master segment manager 908, slave segment manager 910, and farm manager 912 may be co-located on a particular computing platform or distributed over multiple computing platforms. For convenience, only a single master segment manager 908 is shown and described, but multiple master segment managers 908 may be used.
[0084]
The master segment manager 908 is communicatively connected to the slave segment managers 910 and controls and manages them. Each slave segment manager 910 is connected to and manages one or more farm managers 912. According to one embodiment, each farm manager 912 is co-located on the same computing platform as the corresponding slave segment manager 910 to which it is communicatively connected. The farm managers 912 establish, configure, and maintain the VSFs 906 on the computing grid 904. According to one embodiment, each farm manager 912 is assigned a single VSF 906 to manage, but a farm manager 912 may also be assigned multiple VSFs 906. The farm managers 912 do not communicate with each other directly, but only through their respective slave segment managers 910. Each slave segment manager 910 monitors the status of its assigned farm managers 912 and restarts any assigned farm manager 912 that has stopped functioning or has terminated abnormally.
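The control process hierarchy and the restart behavior of a slave segment manager can be sketched as follows. This Python code is a hypothetical illustration: the class and method names are invented, and setting a `running` flag stands in for whatever mechanism an actual implementation would use to restart a farm manager process.

```python
class FarmManager:
    """Hypothetical farm manager responsible for one or more VSFs."""

    def __init__(self, name, vsfs):
        self.name = name
        self.vsfs = list(vsfs)
        self.running = True


class SlaveSegmentManager:
    """Hypothetical slave segment manager that supervises its farm managers."""

    def __init__(self, name):
        self.name = name
        self.farm_managers = []

    def check_farm_managers(self):
        """Restart every assigned farm manager that has stopped; return their names."""
        restarted = []
        for farm_manager in self.farm_managers:
            if not farm_manager.running:
                farm_manager.running = True  # stands in for a real restart
                restarted.append(farm_manager.name)
        return restarted


class MasterSegmentManager:
    """Hypothetical master segment manager at the top of the hierarchy."""

    def __init__(self):
        self.slaves = []

    def managed_vsfs(self):
        """List every VSF reachable through the slave and farm managers."""
        return [vsf
                for slave in self.slaves
                for farm_manager in slave.farm_managers
                for vsf in farm_manager.vsfs]
```

Note that the farm managers hold no references to each other, which reflects the statement that they communicate only through their slave segment managers.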
[0085]
Master segment manager 908 monitors the loading of VSF 906 to determine the amount of resources allocated to each VSF 906. The master segment manager 908 instructs the slave segment manager 910 to allocate and deallocate VSF resources via the farm manager 912 as needed. Various load balancing algorithms may be implemented depending on the requirements of a particular application, and the present invention is not limited to a particular load balancing method.
[0086]
The master segment manager 908 monitors the loading information of the computing platforms on which the slave segment managers 910 and the farm managers 912 are running to determine whether the computing grid 904 is being properly serviced. The master segment manager 908 allocates and deallocates slave segment managers 910, and directs the slave segment managers 910 to allocate and deallocate farm managers 912, as needed to properly manage the computing grid 904. According to one embodiment, the master segment manager 908 also manages the assignment of VSFs to farm managers 912 and the assignment of farm managers 912 to slave segment managers 910, as needed to balance the load among the farm managers 912 and the slave segment managers 910. According to one embodiment, the slave segment managers 910 actively communicate with the master segment manager 908 to request changes to the computing grid 904 and to request additional slave segment managers 910 and / or farm managers 912. If a computing platform running one or more slave segment managers 910 and one or more farm managers 912 ceases to function, the master segment manager 908 reassigns the VSFs 906 from the farm managers 912 on the failed computing platform to other farm managers 912. In this case, the master segment manager 908 can also instruct a slave segment manager 910 to start another farm manager 912 to handle the reassigned VSFs 906. By actively managing the number of computing resources allocated to the VSFs 906 and the number of active farm managers 912 and slave segment managers 910, overall power consumption can be controlled. For example, to save power, the master segment manager 908 may shut down computing platforms that have no active slave segment manager 910 or farm manager 912. The power savings can be significant for a large computing grid 904 and control plane 902.
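The reassignment of VSFs after a computing platform fails can be sketched as shown below. The document does not prescribe how the replacement farm manager is chosen, so the least-loaded rule used here is an assumption, as are the function name `reassign_vsfs` and the dictionary layout of its arguments.

```python
def reassign_vsfs(assignments, platform_of, failed_platform):
    """Move VSFs away from farm managers that ran on a failed platform.

    assignments: farm manager name -> list of VSF names it manages
    platform_of: farm manager name -> name of the platform it runs on
    Returns a new assignment that covers only the surviving farm managers.
    """
    survivors = [fm for fm in assignments if platform_of[fm] != failed_platform]
    if not survivors:
        raise RuntimeError("no surviving farm manager; a new one must be started")
    result = {fm: list(assignments[fm]) for fm in survivors}
    for farm_manager, vsfs in assignments.items():
        if platform_of[farm_manager] != failed_platform:
            continue
        for vsf in vsfs:
            # Assumed policy: pick the survivor that manages the fewest VSFs,
            # breaking ties by name so that the result is deterministic.
            target = min(survivors, key=lambda fm: (len(result[fm]), fm))
            result[target].append(vsf)
    return result
```

The `RuntimeError` branch corresponds to the case in which the master segment manager would instead instruct a slave segment manager to start another farm manager.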
[0087]
According to one embodiment, the master segment manager 908 manages the slave segment managers 910 using a registry. The registry includes information about the current slave segment managers 910, such as their status and their assigned farm managers 912 and VSFs 906. As slave segment managers 910 are allocated and deallocated, the registry is updated to reflect the changes. For example, when a new slave segment manager 910 is instantiated by the master segment manager 908 and assigned one or more VSFs 906, the registry is updated to reflect the creation of the new slave segment manager 910 and its assigned farm managers 912 and VSFs 906. The master segment manager 908 can then periodically check the registry to determine how best to assign the VSFs 906 to the slave segment managers 910.
[0088]
According to one embodiment, the registry includes information about the master segment manager 908 that the slave segment managers 910 can access. For example, the registry may include data identifying one or more active master segment managers 908, so that when a new slave segment manager 910 is created, it can check the registry to learn the identity of the one or more master segment managers 908.
[0089]
The registry may be implemented in various ways, and the invention is not limited to a particular implementation. For example, the registry may be a data file stored in the database 914 in the control plane 902. Alternatively, the registry may be stored outside the control plane 902. For example, the registry may be stored on a storage device of the computing grid 904. In this example, the storage device is dedicated to the control plane 902 and is not assigned to any VSF 906.
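A registry of this kind can be modeled as a small in-memory table. The Python sketch below is hypothetical: the field names, the `Registry` class, and its methods are invented for illustration, and an actual registry would live in a data file or database as described above rather than in process memory. The `now` parameter exists only so that timestamps can be supplied explicitly.

```python
import time


class Registry:
    """Hypothetical in-memory registry of segment managers."""

    def __init__(self):
        self.masters = set()   # names of the active master segment managers
        self.slaves = {}       # slave name -> record

    def register_slave(self, name, farm_managers=(), vsfs=(), now=None):
        """Record a newly created slave segment manager and its assignments."""
        self.slaves[name] = {
            "status": "active",
            "farm_managers": list(farm_managers),
            "vsfs": list(vsfs),
            "timestamp": time.time() if now is None else now,
        }

    def deregister_slave(self, name):
        """Remove a slave segment manager that has been deallocated."""
        self.slaves.pop(name, None)

    def touch(self, name, now=None):
        """Refresh the timestamp that a slave updates periodically."""
        self.slaves[name]["timestamp"] = time.time() if now is None else now

    def least_loaded_slave(self):
        """Assumed helper: the slave with the fewest VSFs, ties broken by name."""
        return min(self.slaves, key=lambda n: (len(self.slaves[n]["vsfs"]), n))
```

A master segment manager could consult `least_loaded_slave` when deciding where to assign a new VSF, although the document leaves the assignment policy open.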
[0090]
2. Election of master segment manager
In general, the master segment manager is elected when the control plane is established or after an existing master segment manager fails. There is generally a single master segment manager for a particular control plane, but it may be advantageous to elect more than one master segment manager to manage slave segment managers in the control plane simultaneously.
[0091]
According to one embodiment, the slave segment managers in the control plane elect the master segment manager for that control plane. In the simple case where there is no master segment manager and only a single slave segment manager, that slave segment manager becomes the master segment manager and allocates additional slave segment managers as needed. If there are two or more slave segment managers, they elect a new master segment manager by voting, for example by a quorum.
[0092]
Because the slave segment managers in the control plane are not necessarily permanent, particular slave segment managers may be selected to participate in the vote. For example, according to one embodiment, the registry includes a timestamp for each slave segment manager that is periodically updated by that slave segment manager. The slave segment managers whose timestamps were updated most recently, as determined according to specified selection criteria, are considered to be still running and are selected to elect the new master segment manager. For example, a specified number of the most recently updated slave segment managers may be selected for voting.
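The timestamp-based choice of voters can be sketched as a simple filter. In the Python sketch below, the function name `select_voters` and its parameters `max_age` and `limit` are hypothetical; they stand for the “specified selection criteria” and the “specified number” mentioned above, whose actual values the document does not give.

```python
def select_voters(timestamps, now, max_age, limit=None):
    """Choose the slave segment managers that take part in an election.

    timestamps: slave name -> time of its most recent timestamp update
    A slave is treated as still running if its timestamp is no older than
    max_age. If limit is given, only that many of the most recently updated
    slaves are returned.
    """
    fresh = [(stamp, name) for name, stamp in timestamps.items()
             if now - stamp <= max_age]
    fresh.sort(reverse=True)  # most recently updated first
    names = [name for _stamp, name in fresh]
    return names if limit is None else names[:limit]
```

For example, with timestamps of 100, 95, and 40 for three slaves, a current time of 101, and a maximum age of 10, only the first two slaves are treated as running.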
[0093]
According to one embodiment, an election sequence number is assigned to every active slave segment manager, and the new master segment manager is determined from the election sequence numbers of the active slave segment managers. For example, the lowest or highest election sequence number may be used to select a particular slave segment manager as the next (or first) master segment manager.
[0094]
When the master segment manager is established, the slave segment managers of the same control plane as the master segment manager periodically check the master segment manager by contacting (pinging) the current master segment manager to determine whether the master segment manager is still active. If it is determined that the current master segment manager is not active, a new master segment manager is elected.
[0095]
FIG. 10 shows a state diagram 1000 for selecting a master segment manager according to an embodiment. In state 1002, which is the main loop of the slave segment manager, the slave segment manager waits for the ping timer to expire. When the ping timer expires, the state 1004 is entered. In state 1004, the slave segment manager pings the master segment manager. Further, in state 1004, the slave segment manager's timestamp (TS) is updated. If the master segment manager responds to the ping, the master segment manager is still active and returns to state 1002. If there is no response from the master segment manager after a specific time, state 1006 is entered.
[0096]
In state 1006, a list of active slave segment managers is obtained and state 1008 is entered. In state 1008, a check is made to determine whether the other slave segment managers have also failed to receive a response from the master segment manager. Instead of sending messages to the slave segment managers to make this confirmation, this information is obtained from the database. If the slave segment managers do not agree that the master segment manager is inactive, i.e., if one or more slave segment managers received a timely response from the master segment manager, it is assumed that the current master segment manager is still active, and the process returns to state 1002. If a certain number of slave segment managers have not received a timely response from the current master segment manager, it is assumed that the current master segment manager is “dead”, i.e., not active, and the process proceeds to state 1010.
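The agreement check of state 1008 can be sketched as follows. This is one possible reading of the text, not the claimed implementation: the dictionary standing in for the database records and the `min_silent` threshold (the "certain number" of slaves) are hypothetical.

```python
from typing import Dict


def master_presumed_dead(got_response: Dict[str, bool], min_silent: int) -> bool:
    """Decide whether the master segment manager is presumed inactive.

    got_response maps each active slave ID to whether that slave received
    a timely ping response, as recorded in the database. If any slave
    heard from the master, the master is treated as still active.
    Otherwise it is presumed dead once at least min_silent slaves report
    no response.
    """
    if any(got_response.values()):
        return False
    return len(got_response) >= min_silent
```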
[0097]
In state 1010, the slave segment manager that started the process retrieves the current selection number from the selection table and the next selection number from the database. Next, the slave segment manager updates the selection table and writes an entry designating the next selection number and a unique address in the master selection table. Next, the slave segment manager proceeds to state 1012 where the sequence number with the lowest selected number is read. In state 1014, it is ascertained whether a particular slave segment manager has the lowest sequence number. If not, return to state 1002. If so, proceed to state 1016 where the particular slave segment manager becomes the master segment manager. Next, proceeding to state 1018, the selection number is incremented.
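States 1010 to 1018 can be approximated by a small in-memory table, shown here only as a sketch. The class and method names are assumptions, a dictionary stands in for the selection table held in the database, and the number is incremented at registration time for simplicity rather than in a separate final state.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ElectionTable:
    """Hypothetical stand-in for the selection table kept in the database."""
    next_number: int = 0
    entries: Dict[int, str] = field(default_factory=dict)  # number -> unique address

    def register(self, address: str) -> int:
        """State 1010: write an entry with the next selection number."""
        number = self.next_number
        self.entries[number] = address
        self.next_number += 1  # corresponds to incrementing the selection number
        return number

    def winner(self) -> Optional[str]:
        """State 1012: read the entry with the lowest selection number."""
        if not self.entries:
            return None
        return self.entries[min(self.entries)]

    def becomes_master(self, address: str) -> bool:
        """States 1014 and 1016: a slave becomes master only if it holds the lowest number."""
        return self.winner() == address
```

In this sketch, the first slave to register after the master is presumed dead wins the election; every other slave sees that it does not hold the lowest number and returns to its main loop.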
[0098]
As described above, the slave segment manager generally services its assigned VSFs and assigns new VSFs in response to commands from the master segment manager. The slave segment manager also checks the master segment manager and, if necessary, participates in electing a new master segment manager.
[0099]
FIG. 11 is a state diagram 1100 illustrating various states of a slave segment manager according to an embodiment. Processing begins at slave segment manager start state 1102. From state 1102, the process proceeds to state 1104 in response to a request to confirm the current state of the master segment manager. In state 1104, the slave segment manager sends a ping to the current master segment manager to determine whether the current master segment manager is still active. If a timely response is received from the current master segment manager, the process proceeds to state 1106. In state 1106, a message is broadcast to the other slave segment managers to inform them that the master segment manager has responded to the ping. From state 1106, the process returns to start state 1102.
[0100]
If there is no timely master response in state 1104, proceed to state 1108. In state 1108, a message is broadcast to the other slave segment managers to inform them that the master segment manager has not responded to the ping. Next, the process returns to the start state 1102. Incidentally, if a sufficient number of slave segment managers do not receive a response from the current master segment manager, a new master segment manager is elected as described above.
[0101]
From state 1102, if a request to resume a VSF is received from the master segment manager, the process proceeds to state 1110. In state 1110, the VSF is resumed and the process returns to start state 1102.
[0102]
As described above, the master segment manager generally ensures that the VSF of the computing grid controlled by the master segment manager is appropriately serviced by one or more slave segment managers. For this purpose, the master segment manager periodically checks all slave segment managers in the same control plane as the master segment manager. According to one embodiment, the master segment manager 908 periodically requests status information from the slave segment manager 910. The information includes, for example, which VSF 906 is being serviced by the slave segment manager 910. If a particular slave segment manager 910 does not respond within a particular time, the master segment manager 908 attempts to resume the particular slave segment manager 910. If a particular slave segment manager 910 cannot be resumed, the master segment manager 908 reassigns the farm manager 912 from the failed slave segment manager 910 to another slave segment manager 910. The master segment manager 908 can then instantiate one or more other slave segment managers 910 to perform process loading rebalancing. According to one embodiment, master segment manager 908 monitors the state of the computing platform executing slave segment manager 910. If there is an abnormality in the computing platform, the master segment manager 908 assigns the VSF assigned to the farm manager 912 on the abnormal computing platform to another computing platform.
[0103]
FIG. 12 is a state diagram 1200 of the master segment manager. Processing begins at a master segment manager start state 1202. From state 1202, the process proceeds to state 1204 when the master segment manager 908 performs or requests a periodic check of the slave segment managers 910 in the control plane 902. From state 1204, if all slave segment managers 910 respond as expected, the process returns to state 1202. This occurs when all slave segment managers 910 provide the master segment manager 908 with specific information indicating that they are operating normally. If one or more slave segment managers 910 do not respond, or respond in a way that indicates that something is wrong, the process proceeds to state 1206.
[0104]
In state 1206, master segment manager 908 attempts to resume slave segment manager 910 that has failed. This can be done in several ways. For example, the master segment manager 908 can send a resume message to the slave segment manager 910 that is not responding or has failed. From state 1206, if all slave segment managers 910 respond as expected, i.e. resumed without problems, state 1202 is returned. For example, when the failed slave segment manager 910 resumes without any problem, the slave segment manager 910 sends a resume confirmation message to the master segment manager 908. From state 1206, if one or more slave segment managers could not be resumed, proceed to state 1208. This occurs when the master segment manager 908 does not receive a resume confirmation message from a particular slave segment manager 910.
[0105]
In state 1208, master segment manager 908 determines the current loading of the machines executing slave segment managers 910. To obtain loading information for the slave segment managers 910, master segment manager 908 either polls the slave segment managers 910 directly or obtains the loading information from another location, such as database 914. The present invention is not limited to a particular method for the master segment manager 908 to obtain loading information for the slave segment managers 910.
[0106]
Next, proceeding to state 1210, the VSF 906 assigned to the slave segment manager 910 having an abnormality is reassigned to another slave segment manager 910. The slave segment manager 910 to which the VSF 906 is assigned informs the master segment manager 908 when the reassignment is complete. For example, the slave segment manager 910 can send a reassignment confirmation message to the master segment manager 908 to indicate that the reassignment of the VSF 906 has been successfully completed. It remains in state 1210 until all VSF 906 reassignments associated with the failed slave segment manager 910 have been confirmed. If confirmed, the process returns to the state 1202.
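States 1208 and 1210 can be sketched as a load-aware reassignment. The data structures below and the policy of always choosing the least-loaded surviving slave are assumptions for illustration; the description does not prescribe how a target slave segment manager is chosen.

```python
from typing import Dict, List


def reassign_vsfs(assignments: Dict[str, List[str]],
                  load: Dict[str, float], failed: str) -> Dict[str, str]:
    """Move every VSF of the failed slave to the least-loaded survivor.

    assignments maps slave ID -> list of VSF IDs and is updated in place.
    load maps slave ID -> current load figure (units are hypothetical).
    Returns a mapping of VSF ID -> new slave ID, which the master could
    use to wait for one reassignment confirmation per VSF.
    """
    survivors = [s for s in assignments if s != failed]
    if not survivors:
        raise ValueError("no surviving slave segment manager available")
    moved: Dict[str, str] = {}
    for vsf in assignments.pop(failed, []):
        target = min(survivors, key=lambda s: load[s])
        assignments[target].append(vsf)
        load[target] += 1.0  # assumption: each VSF adds one unit of load
        moved[vsf] = target
    return moved
```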
[0107]
Instead of reassigning the VSFs 906 associated with the failed slave segment manager 910 to other active slave segment managers 910, the master segment manager 908 may assign another slave segment manager 910 and assign these VSFs 906 to the new slave segment manager 910. The choice of whether to reassign VSFs 906 to an existing slave segment manager 910 or to a new slave segment manager 910 depends at least in part on the latency associated with assigning a new slave segment manager 910 and the latency associated with reassigning VSFs 906 to an existing slave segment manager 910. Either method can be used depending on the requirements of a particular application, and the present invention is not limited to either method.
[0108]
3. Management function
According to one embodiment, control plane 902 is communicatively connected to a global grid manager. The control plane 902 provides the global grid manager with billing, fault, capacity, loading, and other computing grid information. FIG. 13 is a block diagram illustrating the use of a global grid manager according to an embodiment.
[0109]
In FIG. 13, the computing grid 1300 is partitioned into logical portions called grid segments 1302. Each grid segment 1302 includes a control plane 902 that controls and manages a data plane 904. In this example, each data plane 904 is identical to the computing grid 904 of FIG. 9, but is called a “data plane” in order to illustrate the use of a global grid manager that manages multiple control planes 902 and data planes 904, i.e., grid segments 1302.
[0110]
Each grid segment is communicatively connected to a global grid manager 1304. The global grid manager 1304, control plane 902, and computing grid 904 may be co-located on a single computing platform or distributed over multiple computing platforms, and the present invention is not limited to a particular implementation.
[0111]
The global grid manager 1304 performs centralized management and service of a plurality of grid segments 1302. The global grid manager 1304 can collect billing, loading, and other information from the control plane 902 that is used in various management tasks. For example, the service provided by the computing grid 904 is charged using the charging information.
[0112]
4. Policy and maintenance considerations
As described above, the slave segment manager in the control plane must be able to communicate with the associated VSF in the computing grid. Similarly, a VSF in a computing grid must be able to communicate with its associated slave segment manager. In addition, VSFs in a computing grid must not be able to communicate with each other, to prevent one VSF from altering the structure of another VSF in some way. Various ways of implementing these policies are described below.
[0113]
FIG. 14 is a block diagram 1400 of an architecture for connecting a control plane to a computing grid according to an embodiment. The control (“CTL”) ports of the VLAN switches (VLAN SW1 to VLAN SWn), collectively identified by reference number 1402, and of the SAN switches (SAN SW1 to SAN SWn), collectively identified by reference number 1404, are connected to an Ethernet (registered trademark) subnet 1406. The Ethernet subnet 1406 is connected to a plurality of computing elements (CPU1, CPU2 to CPUn) collectively identified by reference numeral 1408. Accordingly, only the computing elements of the control plane 1408 are communicatively connected to the control ports (CTL) of the VLAN switches 1402 and the SAN switches 1404. This structure prevents computing elements in a VSF (not shown) from changing the membership of the VLANs and SAN zones associated with that VSF or with other VSFs. This method is also applicable when the control port is a serial or parallel port. In this case, the port is connected to the computing elements of the control plane 1408.
[0114]
FIG. 15 is a block diagram 1500 illustrating a structure for connecting control plane computing elements (CP CPU1, CP CPU2 to CP CPUn) 1502 to data ports according to an embodiment. In this configuration, the control plane computing element 502 periodically sends packets to the control plane agent 1504 that operates for the control plane computing element 1502. The control plane agent 1504 periodically polls the computing element 502 for real-time data and sends the data to the control plane computing element 1502. Each segment manager in the control plane 1502 is communicatively connected to a control plane (CP) LAN 1506. The CP LAN 1506 is communicatively connected to the special port V17 of the VLAN switch 504 via the CP firewall 1508. This structure provides a reliable and extensible means for the control plane computing element 1502 to collect real-time information from the computing element 502.
[0115]
FIG. 16 is a block diagram 1600 of an architecture for connecting a control plane to a computing grid according to an embodiment. The control plane 1602 includes control plane computing elements CP CPU1 and CP CPU2 to CP CPUn. Each control plane computing element CP CPU1, CP CPU2 to CP CPUn in the control plane 1602 is communicatively connected to ports S1, S2 to Sn of a plurality of SAN switches forming the SAN mesh 1604 as a whole.
[0116]
The SAN mesh 1604 includes SAN ports So and Sp that are communicatively connected to a storage device 1606 that contains data private to the control plane 1602. The storage device 1606 is shown in FIG. 16 as a disk for convenience. The storage device 1606 may be implemented with any type of storage medium, and the present invention is not limited to a particular type of storage medium for the storage device 1606. Storage device 1606 is logically located in control plane private storage zone 1608. The control plane private storage zone 1608 maintains log files that implement the control plane 1602, statistical data, and current control plane configuration information. Since the SAN ports So and Sp are part of the control plane private storage zone only and are not placed in any other SAN zone, only computing elements in the control plane 1602 can access the storage device 1606. In addition, S1, S2 to Sn, So, and Sp exist in a control plane SAN zone that is communicatively connected only to the computing elements in the control plane 1602. These ports are not accessible by computing elements (not shown) in the VSFs.
[0117]
According to one embodiment, if a particular computing element CP CPU1, CP CPU2 to CP CPUn needs to access a storage device, or a part thereof, that is part of a particular VSF, the particular computing element is placed in the SAN zone of that VSF. For example, assume that computing element CP CPU2 needs to access VSFi disk 1610. In this case, the port S2 associated with computing element CP CPU2 is placed in the VSFi SAN zone, which includes the port Si. Once computing element CP CPU2 has finished accessing VSFi disk 1610 on port Si, computing element CP CPU2 is removed from the VSFi SAN zone.
[0118]
Similarly, it is assumed that the computing element CP CPU 1 needs to access the VSFj disk 1612. In this case, the computing element CP CPU1 is located in the SAN zone associated with VSFj. As a result, port S1 is placed in the SAN zone associated with VSFj having the zone that includes port Sj. Once computing element CP CPU1 accesses VSFj disk 1612 connected to port Sj, computing element CP CPU1 is removed from the SAN zone associated with VSFj. This method provides the integrity of the control plane computing element and control plane storage zone 1608 by accurately controlling access to resources using precise SAN zone control.
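The pattern of placing a control plane port in a VSF's SAN zone only for the duration of an access can be sketched with a context manager. This models zone membership as plain Python sets; a real SAN switch would be driven through its own management interface, which is not shown or assumed here. The function and variable names are hypothetical.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Set


@contextmanager
def temporary_zone_member(zones: Dict[str, Set[str]],
                          zone: str, port: str) -> Iterator[None]:
    """Add a control plane port to a SAN zone, then always remove it.

    zones maps a zone name (e.g. "VSFj") to the set of port names that
    are currently members of that zone.
    """
    zones[zone].add(port)
    try:
        yield
    finally:
        # Removal happens even if the disk access raises an error.
        zones[zone].discard(port)
```

For example, wrapping the access to VSFj disk 1612 in `with temporary_zone_member(zones, "VSFj", "S1"):` would keep port S1 in the zone only while the body runs.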
[0119]
As described above, a single control plane computing element can manage multiple VSFs. Thus, a single control plane computing element must be able to define itself in multiple VSFs simultaneously, while implementing a firewall between VSFs according to policy rules established for each control plane. Policy rules may be stored in each control plane database 914 (FIG. 9) or may be enforced by the central segment manager 1302 (FIG. 13).
[0120]
According to one embodiment, because VLAN tags based on (physical switch) ports cannot be spoofed, VLAN tagging and IP addresses are tightly coupled to prevent spoofing attacks by a VSF. An IP packet sent on a VLAN interface must have the same VLAN tag and IP address as the logical interface on which the packet arrives. This prevents an IP spoofing attack in which an unauthorized server in one VSF spoofs a source IP address belonging to another VSF and thereby potentially alters the logical structure of the other VSF or compromises the integrity of the computing grid functions. Circumventing this form of VLAN tagging requires physical access to the computing grid, which can be prevented by using a highly secure (Class A) data center.
[0121]
Various network frame tagging formats may be used to tag data packets, and the present invention is not limited to a particular tagging format. According to one embodiment, IEEE 802.1q VLAN tags are used, although other formats are also suitable. In this example, a VLAN/IP address consistency check is performed in the subsystem of the IP stack where 802.1q tag information is available, in order to control access. In this example, the computing element is configured with a VLAN-enabled network interface card (NIC) so that the computing element can be communicatively connected to multiple VLANs simultaneously.
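A VLAN/IP address consistency check of the kind described above might look like the following sketch. It assumes, purely for illustration, that each VLAN tag is bound to one IP subnet; the description itself only requires that the tag and address of a packet match those of the logical interface on which it arrives. The check uses Python's standard `ipaddress` module.

```python
import ipaddress
from typing import Dict


def packet_allowed(vlan_subnets: Dict[int, str], vlan_tag: int, src_ip: str) -> bool:
    """Accept a packet only if its source IP belongs to the tagged VLAN.

    vlan_subnets is a hypothetical table mapping an 802.1q VLAN ID to the
    subnet (in CIDR notation) configured on that logical interface.
    """
    subnet = vlan_subnets.get(vlan_tag)
    if subnet is None:
        return False  # unknown VLAN tag: reject
    return ipaddress.ip_address(src_ip) in ipaddress.ip_network(subnet)
```

Because the tag is applied per physical switch port, a server in one VSF that forges a source address from another VSF's subnet would fail this check.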
[0122]
FIG. 17 is a block diagram 1700 of a configuration for firmly coupling a VLAN tag and an IP address according to an embodiment. Computing elements 1702 and 1704 are communicatively connected to ports v1 and v2 of VLAN switch 1706 via NICs 1708 and 1710, respectively. VLAN switch 1706 is also communicatively connected to access switches 1712 and 1714. Ports v1 and v2 are configured in a tag format. According to one embodiment, IEEE 802.1q VLAN tag information is provided by VLAN switch 1706.
[0123]
Wide area computing grid
The VSF described above may be distributed over a WAN in various ways.
[0124]
In one method, the wide area backbone may be based on asynchronous transfer mode (ATM) switching. In this case, each local area VLAN is extended over a wide area using an emulated LAN (ELAN) that is part of the ATM LAN emulation (LANE) standard. Thus, a single VSF extends across several wide area links such as ATM / SONET / OC-12 links. The ELAN becomes part of the VLAN that extends across the ATM WAN.
[0125]
Another way is to extend a VSF across the WAN using a VPN system. In this embodiment, the underlying characteristics of the network become irrelevant, and VPNs are used to interconnect two or more VSFs across the WAN to create a single distributed VSF.
[0126]
Data mirroring techniques can be used to logically copy data in a distributed VSF. Alternatively, the SAN is bridged over the WAN using one of several SAN to WAN bridging technologies such as SAN to ATM bridging or SAN to Gigabit Ethernet bridging. Since IP works without problems on such a network, a SAN configured on the IP network will naturally expand on the WAN.
[0127]
FIG. 18 is a block diagram of a plurality of VSFs extended over a WAN connection. The San Jose Center, New York Center, and London Center are connected by a WAN connection. Each WAN connection consists of an ATM, ELAN or VPN connection as described above. Each center is composed of at least one VSF and at least one idle pool. For example, the San Jose Center has VSF1A and idle pool A. In this configuration, the computing resources of each idle pool in the center are available for assignment or designation to VSFs in other centers. When such an assignment or designation is made, the VSF expands over the WAN.
[0128]
Examples of using VSF
The VSF architecture described in the above example may be used in the context of a web server system. Therefore, the above example has been described with respect to a web server, an application server, and a database server configured from a CPU in a specific VSF. However, the VSF architecture may be used in many other computing situations to provide other types of services, and the present invention is not limited to web server systems.
[0129]
-Distributed VSF as part of a content distribution network
In one embodiment, a wide area VSF is used to provide a content distribution network (CDN). The CDN is a network of caching servers that perform distributed caching of data. A network of caching servers can be implemented using, for example, TrafficServer (TS) software sold by Inktomi Corporation, San Mateo, California. TS is a cluster-aware system, and the system expands as more CPUs are added to the collection of caching traffic server computing elements. It is therefore well suited to a system in which adding CPUs is the expansion mechanism.
[0130]
In this configuration, the system can dynamically add more CPUs to the portion of the VSF that runs caching software such as TS, which makes it possible to increase the cache capacity near where bursty web traffic occurs. As a result, the CDN is configured to dynamically expand in CPU and I/O bandwidth in a legal manner.
[0131]
-VSF for hosted intranet applications
There is increasing interest in providing intranet applications such as enterprise resource planning (ERP), ORM and CRM software as hosted and managed services. Technologies such as Citrix WinFrame and Citrix MetaFrame enable companies to provide Microsoft Windows applications as services on small and lightweight clients such as Windows CE devices or web browsers. VSF can host such applications in an extensible manner.
[0132]
For example, SAP R / 3 ERP software sold by SAP Aktiengesellschaft, Germany, allows companies to load balance using multiple applications and data servers. In the case of VSF, enterprises dynamically add more application servers (eg, SAP dialog servers) to the VSF to extend the VSF based on real-time requirements or other factors.
[0133]
Similarly, by adding more Citrix servers running Citrix MetaFrame, it is possible to scale up the number of Windows (registered trademark) application users on server farms that execute hosted Windows (registered trademark) applications. In this case, the Citrix MetaFrame VSF dynamically adds more Citrix servers to the VSF in order to accommodate more users of Windows (registered trademark) applications hosted by MetaFrame. It will be apparent that many other applications can be hosted in the same way as in the examples described above.
[0134]
-Customer interaction with VSF
Because VSFs are generated on demand, VSF customers or organizations that “own” the VSF can interact with the system in various ways to customize the VSF. For example, since a VSF is created and modified immediately via the control plane, VSF customers may be granted privileged access to create and modify the VSF itself. Privileged access is provided using password authentication provided by web pages and secure applications, token card authentication, Kerberos exchange, or other suitable secure element.
[0135]
In one embodiment, a set of web pages is served by a computing element or a separate server. The web pages allow the customer to generate a custom VSF by specifying, for example, the number of tiers, the number of computing elements in a particular tier, the hardware and software platform used for each element, and what kind of web server, application server, or database server software is to be preconfigured on these computing elements. Thus, the customer has a virtual supply console.
[0136]
After a customer or user enters such supply information, the control plane analyzes and evaluates the order and queues it for execution. The order can be reviewed by a human administrator to confirm that it is appropriate. A credit check of the enterprise can be performed to confirm that it has the appropriate credit to pay for the requested service. When the supply order is approved, the control plane configures a VSF that matches the order and returns to the customer a password giving root access to one or more computing elements in the VSF. The customer can then upload a master copy of the application and run it in the VSF.
[0137]
If the company that employs the computing grid is a for-profit company, payment information such as a credit card, PO number, electronic check, or other payment method can also be received from the web page.
[0138]
In another embodiment, the web page allows the customer to select one of several VSF service plans, such as automatic scaling of the VSF between a minimum and a maximum number of elements based on real-time load. The customer can have control values that allow changes to parameters such as the minimum number of computing elements in a particular tier, such as web servers, or the time period during which the VSF must have a minimum server capacity. The parameters may be linked to billing software that automatically adjusts the customer's billing rate and generates a billing log file entry.
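Automatic scaling between a minimum and a maximum number of elements can be sketched as a simple clamped calculation. The function name, the notion of a fixed per-server capacity, and the units of load are assumptions for illustration and are not taken from the description.

```python
import math


def target_servers(load: float, capacity_per_server: float,
                   min_servers: int, max_servers: int) -> int:
    """Return how many servers a tier should have for the current load.

    The count needed to carry the load is rounded up, then clamped to the
    customer-selected minimum and maximum of the service plan.
    """
    needed = math.ceil(load / capacity_per_server)
    return max(min_servers, min(max_servers, needed))
```

With a plan of 2 to 8 servers and 100 requests per second per server, a load of 450 requests per second yields 5 servers, while a load of 950 is capped at 8.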
[0139]
The privileged access mechanism allows customers to obtain reports and monitor real-time information about usage, load, hits per second or number of transactions, and adjust VSF characteristics based on real-time information. The above features provide advantages over conventional manual methods for server farm construction. In the conventional method, the user cannot add the server by various methods and automatically change the characteristics of the server farm without going through the troublesome manual procedure for configuring the server farm.
[0140]
-Billing model for VSF
Given the dynamic nature of a VSF, a company that operates the computing grid and VSFs can charge service fees to the customers that own the VSFs using a billing model based on actual use of the VSF's computing and storage elements. The VSF architecture and methods disclosed herein allow for a “pay-as-you-go” billing model because the resources of a given VSF are not statically assigned. Therefore, a particular customer whose server farm usage load is highly variable can save money, because the customer is not charged a fee associated with constant peak server capacity but rather a fee that reflects a running average of usage, instantaneous usage, and so on.
[0141]
For example, a company may operate using a billing model that prescribes a flat fee for a minimum number of computing elements, such as 10 servers, and that, when the real-time load requires more than 10 elements, charges the user an additional fee for the additional servers based on how many additional servers were needed and how long they were needed. The units of such billing may reflect the resources being charged. For example, the charge may be expressed in units such as MIPS time, CPU time, and CPU 1000 seconds.
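The flat-fee-plus-overage model in this example can be written out as a short calculation. The parameter names, the server-hour unit, and the sample figures are hypothetical; the description leaves the billing unit open.

```python
from typing import List, Tuple


def usage_charge(flat_fee: float, included_servers: int,
                 rate_per_server_hour: float,
                 usage: List[Tuple[int, float]]) -> float:
    """Compute a bill from (servers_in_use, hours) usage samples.

    The flat fee covers up to included_servers at all times. Only servers
    above that number are billed, per server and per hour in use.
    """
    extra_server_hours = sum(
        max(0, servers - included_servers) * hours for servers, hours in usage
    )
    return flat_fee + extra_server_hours * rate_per_server_hour
```

For a flat fee of 1000 covering 10 servers and a rate of 0.5 per extra server-hour, a period with 100 hours at 8 servers, 10 hours at 12 servers, and 4 hours at 15 servers gives 40 extra server-hours and a total charge of 1020.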
[0142]
-Customer visibility control plane API
In other ways, the capacity of the VSF may be controlled by giving the customer an application programming interface (API) that defines the control plane calls for resource changes. Therefore, the application program prepared by the customer can call or request using the API, and can request more servers, more storage, higher processing capacity, and the like. This method may be used when a customer knows about the computing grid environment and needs an application program to take advantage of the capabilities provided by the control plane.
[0143]
No part of the above architecture requires the customer to change its application for use with the computing grid. Existing applications operate in the same way that they operate on manually configured server farms. However, if a better understanding of the computing resources needed is available, based on the real-time load monitoring capabilities provided by the control plane, applications can take advantage of the dynamism possible with the computing grid. An API of the above nature, which allows an application to change the computing capacity of a server farm, is not possible using existing manual methods for server farm construction.
[0144]
-Automatic update and versioning
Using the methods and mechanisms disclosed herein, the control plane can perform automatic updating and versioning of operating system software running on the VSF computing elements. Thus, end users or customers do not have to worry about updating the operating system with new patches, bug fixes, and the like. The control plane maintains a library of such software elements as they are received and can automatically distribute and install them on all affected VSF computing elements.
[0145]
Implementation mechanism
The computing elements and control plane may be implemented in several forms, and the invention is not limited to a particular form. In one embodiment, each computing element is a general-purpose digital computer having the elements shown in FIG. 19, except for non-volatile storage 1910, and the control plane is a general-purpose digital computer of the type shown in FIG. 19 that operates under the control of program instructions that implement the processes described above.
[0146]
FIG. 19 is a block diagram that illustrates a computer system 1900 upon which an embodiment of the invention may be implemented. Computer system 1900 includes a bus 1902 or other communication mechanism for communicating information, and a processor 1904 connected with bus 1902 for processing information. Computer system 1900 also includes main memory 1906 such as random access memory (RAM) or other dynamic storage device connected to bus 1902 for storing information and instructions executed by processor 1904. Main memory 1906 can also be used to store temporary numeric variables and other intermediate information during execution of instructions executed by processor 1904. Computer system 1900 further includes a read only memory (ROM) 1908 or other static storage device connected to bus 1902 for storing static information and instructions for processor 1904. A storage device 1910 such as a magnetic disk or optical disk is provided and connected to the bus 1902 for storing information and instructions.
[0147]
Computer system 1900 may be connected via bus 1902 to a display 1912, such as a cathode ray tube (CRT), for displaying information to a computer user. Input device 1914, including alphanumeric and other keys, is connected to bus 1902 for communicating information and command selections to processor 1904. Another type of user input device is a cursor control 1916, such as a mouse, trackball, or cursor direction keys, for communicating direction information and command selections to the processor 1904 and for controlling cursor movement on the display 1912. This input device generally has two degrees of freedom in two axes, i.e., a first axis (e.g., x) and a second axis (e.g., y), that allow the device to specify a position in a plane.
[0148]
The invention is related to the use of computer system 1900 for controlling an extensible computing system. According to one embodiment of the present invention, control of the extensible computing system is provided by computer system 1900 in response to processor 1904 executing one or more sequences of one or more instructions contained in main memory 1906. Such instructions are read into main memory 1906 from another computer-readable medium, such as storage device 1910. By executing the sequences of instructions contained in the main memory 1906, the processor 1904 performs the process steps described above. One or more processors may be used in a multi-processing configuration to execute the sequences of instructions contained in main memory 1906. In another embodiment, the present invention may be implemented using hardwired circuitry instead of or in combination with software instructions. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.
[0149]
The term “computer-readable medium” as used herein refers to any medium that participates in providing instructions to processor 1904 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 1910. Volatile media include dynamic memory, such as main memory 1906. Transmission media include coaxial cables, copper wire, and optical fiber, including the wires that make up bus 1902. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.
[0150]
Common forms of computer-readable media include, for example, floppy disks, flexible disks, hard disks, magnetic tape, other magnetic media, CD-ROMs, other optical media, punch cards, paper tape, other physical media with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, other memory chips or cartridges, carrier waves as described below, or any other medium from which a computer can read.
[0151]
Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 1904 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 1900 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector connected to bus 1902 can receive the data carried in the infrared signal and place the data on bus 1902. Bus 1902 carries the data to main memory 1906, from which processor 1904 retrieves and executes the instructions. The instructions received by main memory 1906 may optionally be stored on storage device 1910 either before or after execution by processor 1904.
[0152]
Computer system 1900 also includes a communication interface 1918 connected to bus 1902. Communication interface 1918 provides a two-way data communication connection to a network link 1920 that is connected to a local network 1922. For example, communication interface 1918 may be an integrated services digital network (ISDN) card or a modem that provides a data communication connection to a corresponding type of telephone line. As another example, communication interface 1918 may be a local area network (LAN) card that provides a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 1918 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
[0153]
Network link 1920 typically provides data communication through one or more networks to other data devices. For example, network link 1920 may provide a connection through local network 1922 to a host computer 1924 or to data equipment operated by an Internet service provider (ISP) 1926. ISP 1926 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the “Internet” 1928. Local network 1922 and Internet 1928 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, and the signals on network link 1920 and through communication interface 1918 that carry the digital data to and from computer system 1900, are exemplary forms of carrier waves transporting the information.
[0154]
Computer system 1900 can send messages and receive data, including program code, through the network(s), network link 1920, and communication interface 1918. In the Internet example, a server 1930 might transmit requested code for an application program through Internet 1928, ISP 1926, local network 1922, and communication interface 1918. In accordance with the invention, one such downloaded application provides for controlling an extensible computing system as described herein.
[0155]
The received code may be executed by processor 1904 as it is received, and/or stored in storage device 1910 or other non-volatile storage for later execution. In this manner, computer system 1900 may obtain application code in the form of a carrier wave.
[0156]
The computing grid disclosed herein may be compared conceptually to the public electric power network that is sometimes called the power grid. The power grid provides a scalable means for many parties to obtain power services through a single wide-scale power infrastructure. Similarly, the computing grid disclosed herein provides computing services to many organizations using a single wide-scale computing infrastructure. Using the power grid, power consumers do not need to manage their own power equipment independently. For example, there is no reason for a utility consumer to run a personal power generator at its facility, or in a shared facility, and manage its capacity and growth on an individual basis. Instead, the power grid enables the wide-scale distribution of power to vast segments of the population, thereby providing great economies of scale. Similarly, the computing grid disclosed herein can provide computing services to vast segments of the population using a single wide-scale computing infrastructure.
[0157]
In the foregoing detailed description, the invention has been described with reference to specific embodiments. It will be apparent, however, that various modifications and changes can be made to the present invention without departing from the broad spirit and scope of the invention. Accordingly, the description and drawings are to be regarded in an illustrative rather than a restrictive sense.
[Brief description of the drawings]
FIG. 1A is a block diagram of a simple website that uses a single computing element topology.
FIG. 1B is a block diagram of a one-tier web server farm.
FIG. 1C is a block diagram of a three-tier web server farm.
FIG. 2 is a block diagram illustrating one configuration of an extensible computing system 200 that includes a local computing grid.
FIG. 3 is a block diagram of an exemplary virtual server farm featuring a SAN zone.
FIG. 4A is a block diagram illustrating successive steps involved in adding a computing element to, and removing an element from, a virtual server farm.
FIG. 4B is a block diagram illustrating successive steps involved in adding a computing element to, and removing an element from, a virtual server farm.
FIG. 4C is a block diagram illustrating successive steps involved in adding a computing element to, and removing an element from, a virtual server farm.
FIG. 4D is a block diagram illustrating successive steps involved in adding a computing element to, and removing an element from, a virtual server farm.
FIG. 5 is a block diagram of an embodiment of a virtual server farm system, a computing grid, and a monitoring mechanism.
FIG. 6 is a block diagram of a logical connection of a virtual server farm.
FIG. 7 is a block diagram of a logical connection of a virtual server farm.
FIG. 8 is a block diagram of a logical connection of a virtual server farm.
FIG. 9 is a block diagram of a logical relationship between a control plane and a data plane.
FIG. 10 is a state diagram of a master control selection process.
FIG. 11 is a state diagram of a slave control process.
FIG. 12 is a state diagram of a master control process.
FIG. 13 is a block diagram of a central control processor and multiple control planes and computing grids.
FIG. 14 is a block diagram of an architecture that implements portions of a control plane and a computing grid.
FIG. 15 is a block diagram of a system having a computing grid protected by a firewall.
FIG. 16 is a block diagram of an architecture for connecting a control plane to a computing grid.
FIG. 17 is a block diagram of an arrangement for tightly coupling VLAN tags and IP addresses.
FIG. 18 is a block diagram of a plurality of VSFs extended over a WAN connection.
FIG. 19 is a block diagram of a computer system in which an embodiment is implemented.

Claims

A control device comprising:
a master control mechanism; and
one or more slave control mechanisms communicatively connected to the master control mechanism and configured, in response to one or more instructions from the master control mechanism, to establish a first logical resource group that includes a first subset of processing resources and a first subset of storage resources by:
selecting the first subset of processing resources from a set of processing resources;
selecting the first subset of storage resources from a set of storage resources; and
communicatively connecting the first subset of processing resources to the first subset of storage resources,
wherein the one or more slave control mechanisms are further configured to determine the state of the master control mechanism and, if the master control mechanism has terminated abnormally or is no longer functioning properly, to select a new master control mechanism from among the one or more slave control mechanisms.
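
The failover behavior recited above, in which slaves watch the master and choose a replacement from among themselves, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `ControlMechanism` class, the boolean liveness flag, and the rule of picking the live slave with the smallest name do not come from the specification, which does not prescribe a particular election rule here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ControlMechanism:
    """Hypothetical stand-in for a master or slave control mechanism."""
    name: str
    alive: bool = True
    is_master: bool = False


def master_is_healthy(master: Optional[ControlMechanism]) -> bool:
    # Assumption: health is a simple liveness flag. A real system would
    # more likely rely on heartbeats or timeouts.
    return master is not None and master.alive


def elect_new_master(slaves: List[ControlMechanism]) -> Optional[ControlMechanism]:
    # Assumption: choose the live slave with the smallest name. The claim
    # only requires that the new master be selected from the slaves.
    candidates = [s for s in slaves if s.alive]
    if not candidates:
        return None
    winner = min(candidates, key=lambda s: s.name)
    winner.is_master = True
    return winner


def check_and_failover(master: Optional[ControlMechanism],
                       slaves: List[ControlMechanism]) -> Optional[ControlMechanism]:
    """Return the current master, electing a replacement if it has failed."""
    if master_is_healthy(master):
        return master
    if master is not None:
        master.is_master = False  # demote the failed master
    return elect_new_master(slaves)
```

A deterministic rule such as "smallest name wins" is used only so that every slave running the same check would reach the same result; any other agreed-upon rule would serve equally well in this sketch.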

The control device according to claim 1, wherein the master control mechanism is a master control process executed on one or more processors, and the one or more slave control mechanisms are one or more slave processes executed on one or more processors.

The control device according to claim 1, wherein the master control mechanism is one or more master processors, and the one or more slave control mechanisms are one or more slave processors.

The control device according to claim 1, wherein the master control mechanism is configured to dynamically reallocate, among the one or more slave control mechanisms, control of one or more processing resources from the first subset of processing resources and one or more storage resources from the first subset of storage resources, based on the loading of the one or more slave control mechanisms.
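
Load-based reallocation of control among slaves can be sketched as follows. This is only an illustrative sketch under stated assumptions: "loading" is approximated by the number of resources each slave currently controls, and the `rebalance` function and the slave and resource names are hypothetical, not taken from the specification.

```python
from typing import Dict, List


def rebalance(assignments: Dict[str, List[str]]) -> None:
    """Move resources from the busiest slave to the idlest one, in place.

    Assumption: a slave's load is the count of resources it controls.
    The loop stops once no two slaves differ by more than one resource.
    """
    if len(assignments) < 2:
        return
    while True:
        busiest = max(assignments, key=lambda s: len(assignments[s]))
        idlest = min(assignments, key=lambda s: len(assignments[s]))
        if len(assignments[busiest]) - len(assignments[idlest]) <= 1:
            return
        # Hand control of one resource over to the least loaded slave.
        assignments[idlest].append(assignments[busiest].pop())


# Usage: one slave controls everything, the other is idle.
control = {"slave-1": ["cpu-1", "cpu-2", "disk-1", "disk-2"], "slave-2": []}
rebalance(control)
```

A master could run such a routine periodically; a fuller model would weigh actual measured load rather than a simple resource count.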

The control device according to claim 1, wherein the master control mechanism is configured to dynamically allocate one or more additional slave control mechanisms based on the loading of the one or more slave control mechanisms, and to assign control of one or more processing resources from the first subset of processing resources and one or more storage resources from the first subset of storage resources to the one or more additional slave control mechanisms.

The control device according to claim 1, wherein the master control mechanism is configured, based on the loading of the one or more slave control mechanisms, to reassign control of one or more specific processing resources from the first subset of processing resources and one or more specific storage resources from the first subset of storage resources, which are already assigned to one or more specific slave control mechanisms among the one or more slave control mechanisms, to one or more other slave control mechanisms among the one or more slave control mechanisms, and to dynamically deallocate the one or more specific slave control mechanisms.

The control device according to claim 1, wherein the master control mechanism is configured to:
determine the state of the one or more slave control mechanisms;
if one or more specific slave control mechanisms among the one or more slave control mechanisms are not responding or are not functioning properly, attempt to restart the one or more specific slave control mechanisms; and
if the one or more specific slave control mechanisms cannot be restarted, start one or more new slave control mechanisms and reassign control of processing resources and storage resources from the one or more specific slave control mechanisms to the one or more new slave control mechanisms.
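
The restart-then-replace sequence recited above can be sketched in a few lines of Python. This is a hedged illustration only: `recover_slave`, the `try_restart` and `start_new_slave` callbacks, and the dictionary of assignments are hypothetical names introduced here, not interfaces defined by the specification.

```python
from typing import Callable, Dict, List


def recover_slave(
    failed: str,
    assignments: Dict[str, List[str]],
    try_restart: Callable[[str], bool],
    start_new_slave: Callable[[], str],
) -> str:
    """Return the name of the slave controlling the resources afterwards.

    First attempt a restart. Only if that fails, start a new slave and
    hand it control of everything the failed slave was managing.
    """
    if try_restart(failed):
        return failed  # restart succeeded, assignments are unchanged
    replacement = start_new_slave()
    # Move control of the failed slave's resources to the replacement.
    assignments.setdefault(replacement, []).extend(assignments.pop(failed, []))
    return replacement
```

Passing the restart and start operations in as callbacks keeps the sketch independent of how slaves are actually launched, which the claim leaves open.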

The control device according to claim 1, wherein the one or more instructions from the master control mechanism are generated based on expected processing and storage requirements of the first logical resource group.

The control device according to claim 1, wherein the one or more slave control mechanisms are further configured, in response to one or more instructions from the master control mechanism, to:
dynamically change the number of processing resources in the first subset of processing resources;
dynamically change the number of storage resources in the first subset of storage resources; and
dynamically change the communicative connection between the first subset of processing resources and the first subset of storage resources to reflect the changes in the number of processing resources in the first subset of processing resources and the number of storage resources in the first subset of storage resources.

The control device according to claim 9, wherein the changes in the number of processing resources in the first subset of processing resources and the number of storage resources in the first subset of storage resources are directed by the master control mechanism based on the actual loading of the first subset of processing resources and the first subset of storage resources.

The control device according to claim 1, wherein the one or more slave control mechanisms are further configured, in response to one or more instructions from the master control mechanism, to establish a second logical resource group that includes a second subset of processing resources and a second subset of storage resources by:
selecting the second subset of processing resources from the set of processing resources;
selecting the second subset of storage resources from the set of storage resources; and
communicatively connecting the second subset of processing resources to the second subset of storage resources,
wherein the second logical resource group is communicatively isolated from the first logical resource group.

A first subset of processing resources is communicatively coupled to the first subset of storage resources by using one or more storage area network (SAN) switches;
A second subset of processing resources is communicatively connected to the second subset of storage resources by using one or more SAN switches;
The control device according to claim 11, wherein the second logical resource group is separated from the first logical resource group by using tagging and SAN zoning.

The control device according to claim 12, wherein SAN zoning is performed by using port level SAN zoning or LUN level SAN zoning.
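
The isolation effect of zoning between logical resource groups can be modeled with a simple membership check. The sketch below is an abstraction made for illustration, not a SAN switch API: the zone map, the `can_access` function, and the resource names are all hypothetical, and the model does not distinguish port level from LUN level zoning.

```python
from typing import Dict, Set


def can_access(zones: Dict[str, Set[str]], initiator: str, target: str) -> bool:
    """Return True if at least one zone contains both endpoints.

    Assumption: each logical resource group maps to one zone, and a
    processing resource may reach a storage resource only within a zone.
    """
    return any(
        initiator in members and target in members
        for members in zones.values()
    )


# Usage: two logical resource groups, each confined to its own zone.
zones = {
    "group-1": {"cpu-1", "disk-1"},
    "group-2": {"cpu-2", "disk-2"},
}
same_group = can_access(zones, "cpu-1", "disk-1")
cross_group = can_access(zones, "cpu-1", "disk-2")
```

In this model `same_group` is true and `cross_group` is false, which mirrors the requirement that the second logical resource group be communicatively isolated from the first.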

The control device according to claim 1, wherein:
the master control mechanism is communicatively connected to a central control mechanism;
the master control mechanism is configured to provide loading information for the first logical resource group to the central control mechanism; and
the master control mechanism is further configured to generate the one or more instructions to the one or more slave control mechanisms based on one or more central control instructions received from the central control mechanism.

A computer-implemented method for managing processing resources, wherein one or more sequences of one or more instructions, when executed by one or more processors of a computer, cause the one or more processors to perform:
starting a master control mechanism; and
starting one or more slave control mechanisms that are communicatively connected to the master control mechanism and configured, in response to one or more instructions from the master control mechanism, to establish a first logical resource group that includes a first subset of processing resources and a first subset of storage resources by:
selecting the first subset of processing resources from a set of processing resources;
selecting the first subset of storage resources from a set of storage resources; and
communicatively connecting the first subset of processing resources to the first subset of storage resources,
wherein the one or more slave control mechanisms determine the state of the master control mechanism and, if the master control mechanism has terminated abnormally or is no longer functioning properly, select a new master control mechanism from among the one or more slave control mechanisms.

The method of claim 15, wherein starting the master control mechanism includes starting a master control process executed on one or more processors, and starting the one or more slave control mechanisms includes starting one or more slave processes executed on one or more processors.

The method of claim 15, wherein starting the master control mechanism includes starting one or more master control processors, and starting the one or more slave control mechanisms includes starting one or more slave processors.

The method of claim 15, wherein the master control mechanism dynamically reallocates, among the one or more slave control mechanisms, control of one or more processing resources from the first subset of processing resources and one or more storage resources from the first subset of storage resources, based on the loading of the one or more slave control mechanisms.

The method of claim 15, wherein the master control mechanism dynamically allocates one or more additional slave control mechanisms based on the loading of the one or more slave control mechanisms, and assigns control of one or more processing resources from the first subset of processing resources and one or more storage resources from the first subset of storage resources to the one or more additional slave control mechanisms.

The method of claim 15, wherein the master control mechanism, based on the loading of the one or more slave control mechanisms, reassigns control of one or more specific processing resources from the first subset of processing resources and one or more specific storage resources from the first subset of storage resources, which are already assigned to one or more specific slave control mechanisms among the one or more slave control mechanisms, to one or more other slave control mechanisms among the one or more slave control mechanisms.

The method of claim 15, wherein the master control mechanism:
determines the state of the one or more slave control mechanisms;
if one or more specific slave control mechanisms among the one or more slave control mechanisms are not responding or are not functioning properly, attempts to restart the one or more specific slave control mechanisms; and
if the one or more specific slave control mechanisms cannot be restarted, starts one or more new slave control mechanisms and reassigns control of processing resources and storage resources from the one or more specific slave control mechanisms to the one or more new slave control mechanisms.

The method of claim 15, wherein one or more instructions from the master control mechanism are generated based on the predicted processing and storage requirements of the first logical resource group.

The method of claim 15, wherein the one or more slave control mechanisms, further in response to one or more instructions from the master control mechanism:
dynamically change the number of processing resources in the first subset of processing resources;
dynamically change the number of storage resources in the first subset of storage resources; and
dynamically change the communicative connection between the first subset of processing resources and the first subset of storage resources to reflect the changes in the number of processing resources in the first subset of processing resources and the number of storage resources in the first subset of storage resources.

The method of claim 23, wherein the changes in the number of processing resources in the first subset of processing resources and the number of storage resources in the first subset of storage resources are directed by the master control mechanism based on the actual loading of the first subset of processing resources and the first subset of storage resources.

The method of claim 15, wherein the one or more slave control mechanisms, in response to one or more instructions from the master control mechanism, establish a second logical resource group that includes a second subset of processing resources and a second subset of storage resources by:
selecting the second subset of processing resources from the set of processing resources;
selecting the second subset of storage resources from the set of storage resources; and
communicatively connecting the second subset of processing resources to the second subset of storage resources,
wherein the second logical resource group is communicatively isolated from the first logical resource group.

The method of claim 25, wherein:
the first subset of processing resources is communicatively connected to the first subset of storage resources by using one or more storage area network (SAN) switches;
the second subset of processing resources is communicatively connected to the second subset of storage resources by using one or more SAN switches; and
the second logical resource group is communicatively isolated from the first logical resource group by using tagging and SAN zoning.

The method of claim 26, wherein SAN zoning is performed by using port level SAN zoning or LUN level SAN zoning.

The method of claim 15, wherein:
the master control mechanism is communicatively connected to a central control mechanism;
the master control mechanism is configured to provide loading information for the first logical resource group to the central control mechanism; and
the master control mechanism is further configured to generate the one or more instructions to the one or more slave control mechanisms based on one or more central control instructions received from the central control mechanism.

A computer-readable storage medium storing one or more sequences of one or more instructions for managing processing resources, wherein the one or more sequences of one or more instructions, when executed by one or more processors, cause the one or more processors to perform:
starting a master control mechanism; and
starting one or more slave control mechanisms that are communicatively connected to the master control mechanism and configured, in response to one or more instructions from the master control mechanism, to establish a first logical resource group that includes a first subset of processing resources and a first subset of storage resources by:
selecting the first subset of processing resources from a set of processing resources;
selecting the first subset of storage resources from a set of storage resources; and
communicatively connecting the first subset of processing resources to the first subset of storage resources,
wherein the one or more slave control mechanisms determine the state of the master control mechanism and, if the master control mechanism has terminated abnormally or is no longer functioning properly, select a new master control mechanism from among the one or more slave control mechanisms.