JP2005506726A - Virtual network system and method in processing system

Publication number: JP2005506726A
Application number: JP2002584166A
Authority: JP
Inventors: シュルター，ピーター; ジェング，スコット; マンカ，ピーター; カーティス，ポール; ミルン，イワン; スミス，マックス; グリーンスパン，アラン; ダフィー，エドワード; ブローネル，バーン; スプラックマン，ベン; バズビー，ダン
Original assignee: イジェネラ，インク．
Priority date: 2001-04-20
Filing date: 2002-04-16
Publication date: 2005-03-03
Also published as: WO2002086712A1; EP1388057A4; CA2444066A1; CN1290008C; EP1388057A1; DE10296675T5; CN1520550A

Abstract

A system and method for virtual networking (100). Switched Ethernet local area network semantics are provided over an underlying point-to-point mesh. Computer processor nodes may communicate directly via virtual interfaces over a switch fabric, or they may communicate via an Ethernet switch emulation (115a, 115b). Address resolution protocol (ARP) logic (135) is used to associate IP addresses with virtual interfaces while allowing computer processors to reply to ARP requests with virtual MAC addresses.
【Selected Drawing】 FIG. 1

Description

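【Technical Field】
【０００１】
The present invention relates to computing systems for enterprises and application service providers and, more particularly, to processing systems having virtualized communication networks.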
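【Background Art】
【０００２】
In current enterprise computing and application service provider environments, personnel from multiple information technology (IT) functions (electrical, networking, etc.) must participate to deploy processing and networking resources. Consequently, because of scheduling and other difficulties in coordinating activities across multiple departments, it can take weeks or months to deploy a new computer server. This lengthy, manual process increases both human and equipment costs and delays the launch of applications.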
【０００３】
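Moreover, because it is difficult to anticipate how much processing power applications will require, managers typically over-provision computing capacity. As a result, data-center computing resources often go unused or are under-utilized.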
【０００４】
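If more processing power is eventually needed than was originally provisioned, the various IT functions must again coordinate activities to deploy more or improved servers, connect them to the communication and storage networks, and so forth. This task becomes increasingly difficult as the systems grow larger.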
【０００５】
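Deployment is also problematic. For example, deploying 24 conventional servers may require more than 100 discrete connections to configure the overall system. Managing these cables is an ongoing challenge, and each represents a failure point. Attempting to mitigate the risk of failure by adding redundancy can double the cabling, exacerbating the problem while increasing complexity and cost.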
【０００６】
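Providing high availability with today's technology is a difficult and costly proposition. Generally, a failover server must be deployed for every primary server. In addition, complex management software and professional services are usually required.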
【０００７】
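Generally, it is not possible to adjust the processing power or upgrade the CPUs on an existing server. Instead, scaling processor capacity, migrating to a vendor's next-generation architecture, or both often requires a "forklift upgrade," meaning that more hardware/software systems are added, needing new connections and the like.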
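【Disclosure of the Invention】
【Problem to Be Solved by the Invention】
【０００８】
Consequently, there is a need for a system and method that provide a platform for enterprise and ASP computing and that address the above shortcomings.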
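【Means for Solving the Problem】
【０００９】
The present invention features a platform and method for computer processing in which virtual processing area networks may be configured and deployed.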
【００１０】
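According to one aspect of the invention, a method and system for emulating a switched Ethernet local area network are provided. A plurality of computer processors, a switch fabric, and point-to-point links to the processors are provided. Virtual interface logic establishes virtual interfaces over the switch fabric and the point-to-point links. Each virtual interface defines a software communication path from one computer processor to another via the switch fabric. Ethernet driver emulation logic executes on at least two of the computer processors, and switch emulation logic executes on at least one of the computer processors. The switch emulation logic establishes a virtual interface between itself and each computer processor having Ethernet driver emulation logic executing thereon, to allow software communication between them. It also receives a message from one of the virtual interfaces to a computer processor having Ethernet driver emulation logic executing thereon and, in response to addressing information associated with the message, transmits the message to another computer processor having Ethernet driver emulation logic executing thereon. Virtual interfaces are also established between each computer processor having Ethernet driver emulation logic executing thereon and each of the other computer processors having Ethernet driver emulation logic executing thereon. The Ethernet driver emulation logic unicast-communicates with another computer processor in the emulated Ethernet network via the virtual interface that defines a software communication path to it if that virtual interface is operating satisfactorily, and via the switch emulation logic if it is not.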
【００１１】
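According to another aspect of the invention, a method and system for implementing an address resolution protocol (ARP) are provided. A computing platform has a plurality of processors connected by an underlying physical network. Logic executable on one of the processors defines a topology of an Ethernet network to be emulated on the computing platform. The topology includes processor nodes and a switch node. Logic executable on one of the processors assigns a set of processors from the plurality to act as the processor nodes. Logic executable on one of the processors assigns a virtual MAC address to each processor node of the emulated Ethernet network. Logic executable on one of the processors allocates virtual interfaces over the underlying physical network to provide direct software communication from each processor node to each of the other processor nodes. Each virtual interface has a corresponding identification. Each processor node has ARP request logic to communicate an ARP request, which includes an IP address, to the switch node. The switch node includes ARP request broadcast logic to communicate the ARP request to all other processor nodes in the emulated Ethernet network. Each processor node has ARP reply logic to determine whether it is the processor node associated with the IP address in the ARP request and, if so, to issue an ARP reply to the switch node, the ARP reply containing the virtual MAC address of the processor node associated with the IP address. The switch node includes ARP reply logic to receive the ARP reply and to modify it to include a virtual interface identification for the ARP-requesting node.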
【００１２】
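According to another aspect of the invention, a platform and method for computer processing are provided to support processor failover. A plurality of computer processors are connected to an internal communication network. A virtual local area communication network over the internal network is defined and established. Each computer processor in the virtual local area communication network has a corresponding virtual MAC address, and the virtual local area network provides communication among a set of computer processors but excludes processors from the plurality that are not in the defined set. A virtual storage space is defined and established with a defined correspondence to the address space of a storage network. In response to a failure of a computer processor, a computer processor from the plurality is allocated to replace the failed processor. The MAC address of the failed processor is assigned to the replacement processor. The virtual storage space and defined correspondence of the failed processor are assigned to the replacement processor. The virtual local area network is re-established to include the replacement processor and to exclude the failed processor.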
【００１３】
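According to another aspect of the invention, a system and method for providing a service addressed by an IP address are provided. At least two computer processors each include logic to provide the service. Cluster logic receives a request message for the service; the message has the IP address. The cluster logic distributes the request to one of the at least two computer processors having logic to provide the service.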
【００１４】
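According to another aspect of the invention, a computer processing platform includes a plurality of computer processors connected to an internal communication network. At least one control node is in communication with an external communication network and with an external storage network having an external storage address space. The at least one control node is connected to the internal network and thereby communicates with the plurality of computer processors. Configuration logic defines and establishes a virtual processing area network having a corresponding set of computer processors from the plurality, a virtual local area communication network providing communication among the set of computer processors but excluding processors from the plurality not in the defined set, and a virtual storage space with a defined correspondence to the address space of the storage network.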
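【Best Mode for Carrying Out the Invention】
【００１５】
Preferred embodiments of the invention provide a processing platform on which virtual systems may be deployed through configuration commands. The platform provides a large pool of processors from which a subset may be selected and configured through software commands to form a virtualized network of computers ("processing area network" or "processor cluster") that may be deployed to serve a given set of applications or customer. The virtualized processing area network (PAN) may then be used to execute customer-specific applications, such as web-based server applications. The virtualization may include virtualization of local area networks (LANs) or virtualization of I/O storage. By providing such a platform, processing resources may be deployed rapidly and easily through software via configuration commands from an administrator, rather than by physically provisioning servers, cabling network and storage connections, providing power to each server, and so forth.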
【００１６】
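(Overview of the Platform and Its Behavior)
As shown in FIG. 1, a preferred hardware platform 100 includes a set of processing nodes 105a to 105n connected to switch fabrics 115a, b via high-speed interconnects 110a, b. The switch fabrics 115a, b are also connected to at least one control node 120a, b that is in communication with an external IP network 125 (or other data communication network) and with a storage area network (SAN) 130. A management application 135, for example executing remotely, may access one or more of the control nodes via the IP network 125 to assist in configuring the platform 100 and deploying virtualized PANs.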
【００１７】
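Under certain embodiments, about 24 processing nodes 105a to 105n, two control nodes 120, and two switch fabrics 115a, 115b are contained in a single chassis and interconnected with a fixed, pre-wired mesh of point-to-point (PtP) links. Each processing node 105 is a board that includes one or more (e.g., four) processors 106j to 106l, one or more network interface cards (NICs) 107, and local memory (e.g., greater than 4 Gbytes) that, among other things, holds some BIOS firmware for booting and initialization. There is no local disk for the processors 106; instead, all storage, including storage needed for paging, is handled by SAN storage devices 130.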
【００１８】
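Each control node 120 is a single board that includes one or more (e.g., four) processors, local memory, and local disk storage for holding independent copies of the boot image and initial file system that are used to boot operating system software for the processing nodes 105 and for the control nodes 120. Each control node communicates with the SAN 130 via a 100 megabyte/second fibre channel adapter card 128 connected to fibre channel links 122, 124, and communicates with the Internet (or other external network) 125 via an external network interface 129 having one or more Gigabit Ethernet NICs connected to Gigabit Ethernet links 121, 123. (Many other techniques and hardware may be used for SAN and external network connectivity.) Each control node includes a low-speed Ethernet port (not shown) as a dedicated management port, which may be used instead of remote, web-based management via the management application 135.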
【００１９】
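The switch fabric is composed of one or more 30-port Giganet switches 115, such as the NIC-CLAN 1000 and clan 5300 switch, and the various processing and control nodes use corresponding NICs for communication with such a fabric module. Giganet switch fabrics have the semantics of a non-broadcast multiple access (NBMA) network. All inter-node communication is via a switch fabric. Each link is formed as a serial connection between a NIC 107 and a port in the switch fabric 115. Each link operates at 112 megabytes/second.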
【００２０】
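In some embodiments, multiple cabinets or chassis may be connected together to form larger platforms. In other embodiments the configuration may differ; for example, redundant connections, switches, and control nodes may be eliminated.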
【００２１】
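Under software control, the platform supports multiple, simultaneous, and independent processing area networks (PANs). Each PAN is configured, through software commands, to have a corresponding subset of processors 106 that may communicate via a virtual local area network emulated over the PtP mesh. Each PAN is also configured to have a corresponding virtual I/O subsystem. No physical deployment or cabling is needed to establish a PAN. Under certain preferred embodiments, software logic executing on the processor nodes and/or the control nodes emulates switched Ethernet semantics; other software logic executing on the processor nodes and/or the control nodes provides virtual storage subsystem functionality that follows SCSI semantics and that provides independent I/O address spaces for each PAN.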
【００２２】
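(Network Architecture)
Certain preferred embodiments allow an administrator to build virtual, emulated LANs using virtual components, interfaces, and connections. Each virtual LAN can be internal and private to the platform 100, or multiple processors may be formed into a processor cluster that is externally visible as a single IP address.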
【００２３】
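Under certain embodiments, the virtual networks so created emulate a switched Ethernet network, even though the underlying physical network is a PtP mesh. The virtual network uses IEEE MAC addresses, and the processing nodes support IETF ARP processing to identify and associate IP addresses with MAC addresses. Consequently, a given processor node replies to an ARP request consistently, whether the request came from a node internal or external to the platform.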
【００２４】
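FIG. 2A shows an exemplary network arrangement that may be modeled or emulated. A first subnet 202 is formed by processing nodes PN₁, PN₂, and PN_k, which may communicate with one another via switch 206. A second subnet 204 is formed by processing nodes PN_k and PN_m, which may communicate with one another via switch 208. Under switched Ethernet semantics, one node on a subnet may communicate directly with another node on that subnet; for example, PN₁ may send a message to PN₂. The semantics also allow one node to communicate with a set of the other nodes; for example, PN₁ may send a broadcast message to the other nodes. Processing nodes PN₁ and PN₂ cannot communicate directly with PN_m because PN_m is on a different subnet. For PN₁ and PN₂ to communicate with PN_m, higher-layer networking software, which has a fuller understanding of both subnets, would need to be used. Though not shown in the figure, a given switch may communicate via an "uplink" to another switch or the like. As will be appreciated from the description below, the need for such uplinks differs from the need when the switches are physical. Specifically, because the switches are virtual and modeled in software, they may scale horizontally as wide as needed. (In contrast, physical switches have a fixed number of physical ports, and uplinks are sometimes needed to provide horizontal scalability.)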
【００２５】
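FIG. 2B shows exemplary software communication paths and logic used under certain embodiments to model the subnets 202 and 204 of FIG. 2A. The communication paths 212 connect processing nodes PN₁, PN₂, PN_k, and PN_m, specifically their corresponding processor-side network communication logic 210, and they also connect the processing nodes to the control nodes. (Though drawn as a single instance of logic for clarity, PN_k may have multiple instances of the corresponding processor logic, for example one per subnet.) Under preferred embodiments, management logic and control node logic are responsible for establishing, managing, and destroying the communication paths. The individual processing nodes are not permitted to establish such paths.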
【００２６】
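As will be explained in detail below, the processor logic and the control node logic together emulate switched Ethernet semantics over such communication paths. For example, the control nodes have control-node-side virtual switch logic 214 to emulate some (but not necessarily all) of the semantics of an Ethernet switch, and the processor logic includes logic to emulate some (but not necessarily all) of the semantics of an Ethernet driver.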
【００２７】
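Within a subnet, one processor node may communicate directly with another via a corresponding virtual interface 212. Likewise, a processor node may communicate with the control node logic via a separate virtual interface. Under certain embodiments, the underlying switch fabric and associated logic (e.g., switch fabric manager logic, not shown) provide the ability to establish and manage such virtual interfaces (VIs) over the point-to-point mesh. Moreover, these virtual interfaces may be established in a reliable, redundant fashion, in which case they are referred to herein as RVIs. At points in this description, the terms virtual interface (VI) and reliable virtual interface (RVI) are used interchangeably, because the choice between a VI and an RVI largely depends on the amount of reliability desired by the system at the expense of system resources.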
【００２８】
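Referring conjointly to FIGS. 2A and 2B, if node PN₁ is to communicate with node PN₂, it ordinarily does so via virtual interface 212_1-2. However, preferred embodiments allow communication between PN₁ and PN₂ to occur via the switch emulation logic if, for example, VI 212_1-2 is not operating satisfactorily. In this case a message may be sent via VI 212_1-switch206 and then via VI 212_switch206-2. If PN₁ is to broadcast or multicast a message to other nodes in subnet 202, it does so by sending the message to control-node-side logic 214 via virtual interface 212_1-switch206. Control-node-side logic 214 then emulates the broadcast or multicast functionality by cloning the message and sending it to the other relevant nodes using the relevant VIs. The same or analogous VIs may be used to convey other messages requiring control-node-side logic. For example, as described below, the control-node-side logic includes logic to support the address resolution protocol (ARP), and VIs are used to communicate ARP replies and requests to the control node. Though the above description suggests just one VI between processor logic and control logic, many embodiments employ several such connections. Moreover, though the figures suggest symmetry in the software communication paths, the architecture actually allows asymmetric communication. For example, as discussed below, for communication with clustered services the packets would be routed via the control node; the return communication, however, may be direct between nodes.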
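The path choice described above (direct VI when healthy, otherwise a relay through the switch emulation logic) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `VirtualInterface` class, the `healthy` flag, and the `unicast_path` function are hypothetical names introduced here, and only the VI labels come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VirtualInterface:
    """Hypothetical stand-in for a VI; 'healthy' models 'operating satisfactorily'."""
    name: str
    healthy: bool = True


def unicast_path(direct: Optional[VirtualInterface],
                 to_switch: VirtualInterface,
                 from_switch: VirtualInterface) -> List[str]:
    """Return the names of the VIs a unicast message would traverse."""
    if direct is not None and direct.healthy:
        return [direct.name]  # normal case: direct node-to-node VI
    # Fallback: relay through the control node's switch emulation logic.
    return [to_switch.name, from_switch.name]


direct_vi = VirtualInterface("212_1-2")
print(unicast_path(direct_vi,
                   VirtualInterface("212_1-switch206"),
                   VirtualInterface("212_switch206-2")))
```

Marking `direct_vi.healthy = False` and calling `unicast_path` again yields the two-hop path through switch 206.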
【００２９】
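Notice that, as in the network of FIG. 2A, there is no mechanism for communication between nodes PN₂ and PN_m. Moreover, because the communication paths are managed and created centrally (rather than by the processing nodes), such a path cannot be created by the processing nodes, and the defined subnet connectivity cannot be violated by a processor.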
【００３０】
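FIG. 2C shows the exemplary physical connections of certain embodiments to realize the subnets of FIGS. 2A and 2B. Specifically, each instance of processing network logic 210 communicates with the switch fabric 115 via a PtP link 216 of interconnect 110. Likewise, the control node has multiple instances of switch logic 214, and each communicates with the switch fabric over a PtP connection 216. The virtual interfaces of FIG. 2B include the logic to convey information over these physical links, as described further below.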
【００３１】
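To create and configure such networks, an administrator defines the network topology of a PAN and specifies (e.g., via a utility within the management software 135) MAC address assignments for the various nodes. The MAC address is virtual: it identifies a virtual interface and is not tied to any specific physical node. Under certain embodiments, MAC addresses follow the IEEE 48-bit address format, but the contents include a "locally administered" bit (set to 1), the serial number of the control node 120 on which the virtual interface was originally defined (more below), and a count value from a persistent sequence counter on the control node that is kept in NVRAM in the control node. These MACs are used to identify the nodes (as is conventional) at layer 2. For example, in replying to ARP requests (whether from a node internal to the PAN or on an external network), these MACs are included in the ARP reply.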
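As an illustration of how such a virtual MAC could be packed, the sketch below builds a 48-bit address from a flag byte, a control node serial number, and a counter value. The text names these three ingredients but not their field widths or order, so the layout here (one flag byte, a 16-bit serial, a 24-bit count) and the function name `make_virtual_mac` are assumptions made purely for the example.

```python
def make_virtual_mac(control_node_serial: int, counter: int) -> str:
    """Pack a hypothetical virtual MAC: flag byte, 16-bit serial, 24-bit count.

    The field widths are assumed for illustration; the source only says the
    address contains these three pieces of information.
    """
    if not 0 <= control_node_serial < 1 << 16:
        raise ValueError("serial must fit in 16 bits (assumed width)")
    if not 0 <= counter < 1 << 24:
        raise ValueError("counter must fit in 24 bits (assumed width)")
    locally_administered = 0x02  # IEEE 802 locally administered bit, first octet
    raw = (bytes([locally_administered])
           + control_node_serial.to_bytes(2, "big")
           + counter.to_bytes(3, "big"))
    return ":".join(f"{octet:02x}" for octet in raw)


print(make_virtual_mac(0x1234, 1))
```

Because the counter is described as persistent, successive calls with increasing counter values on the same control node would give addresses that stay unique across reboots.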
【００３２】
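The control-node-side networking logic maintains data structures that contain information reflecting the connectivity of the LAN (e.g., which nodes may communicate with which other nodes). The control node logic also allocates and assigns VI (or RVI) mappings to the defined MAC addresses, and allocates and assigns VIs (or RVIs) between the control nodes and between the control nodes and the processing nodes. In the example of FIG. 2A, the logic would allocate and assign the VIs 212 of FIG. 2B. (The naming of the VIs and RVIs in some embodiments is a consequence of the switch fabric and switch fabric manager logic employed.)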
【００３３】
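As each processor boots, BIOS-based boot logic initializes each processor 106 of the node 105 and, among other things, establishes (or discovers) a VI 212 to the control node logic. The processor node then obtains from the control node relevant data link information, such as the processor node's MAC address and the MAC identities of other devices within the same data link configuration. Each processor then registers its IP address with the control node, which binds that IP address to the node and to an RVI (e.g., the RVI on which the registration arrived). In this fashion, the control node can bind an IP address to the corresponding virtual MAC for each node on a subnet. In addition to the above, the processor node also obtains the RVI- or VI-related information for its connections to other nodes or to the control node networking logic.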
【００３４】
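Thus, after boot and initialization, the various processor nodes should understand their layer 2 data link connectivity. As explained below, layer 3 (IP) connectivity, and specifically the binding of layer 3 to layer 2, is determined during normal processing of the processors as a result of the address resolution protocol.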
【００３５】
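FIG. 3A details the processor-side networking logic 210, and FIG. 3B details the control-node-side networking logic 310 of certain embodiments. The processor-side logic 210 includes an IP stack 305, a virtual network driver 310, ARP logic 350, an RCLAN layer 315, and redundant Giganet drivers 320a, b. The control-node-side logic 310 includes redundant Giganet drivers 325a, b, an RCLAN layer 330, virtual cluster proxy logic 360, a virtual LAN server 335, ARP server logic 355, a virtual LAN proxy 340, and physical LAN drivers 345.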
【００３６】
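(IP Stack)
The IP stack 305 is the communication protocol stack provided with the operating system (e.g., Linux) used by the processing nodes 106. The IP stack provides a layer 3 interface for the applications and operating system executing on a processor 106 to communicate with the simulated Ethernet network. The IP stack supplies packets of information to the virtual Ethernet layer 310 together with the layer 3 IP address as the packet's destination. The IP stack logic is conventional, except that certain embodiments avoid checksum calculations and logic.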
【００３７】
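(Virtual Ethernet Driver)
The virtual Ethernet driver 310 appears to the IP stack 305 like a "real" Ethernet driver. In this regard, the virtual Ethernet driver 310 receives IP packets or datagrams from the IP stack for subsequent transmission on the network, and it receives packet information from the network to be delivered to the stack as IP packets.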
【００３８】
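The stack builds the MAC header. The "normal" Ethernet code in the stack may be used. The virtual Ethernet driver receives the packet with the MAC header already built and the correct MAC address already in the header.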
【００３９】
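In material part, and with reference to FIGS. 4A to 4C, the virtual Ethernet driver 310 dequeues 405 outgoing IP datagrams so that the packets can be sent on the network. The standard IP stack ARP logic is used. As explained below, the driver intercepts all ARP packets entering and leaving the system and modifies them so that the proper information ends up in each node's ARP tables. The normal ARP logic places the correct MAC address in the link layer header of the outgoing packet before the packet is queued to the Ethernet driver. The driver then simply examines the link layer header and destination MAC to determine how to send the packet. The driver does not directly manipulate the ARP table (except for the occasional invalidation of ARP entries).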
【００４０】
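The driver 310 determines 415 whether the ARP logic 350 has MAC address information (more below) associated with the IP address in the dequeued packet. If the ARP logic 350 has the information, it is used to send 420 the packet accordingly. If the ARP logic 350 does not have the information, the driver needs to determine it, and in certain preferred embodiments this information is obtained as a result of executing the ARP protocol, as discussed in connection with FIGS. 4B and 4C.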
【００４１】
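If the ARP logic 350 has the MAC address information, the driver analyzes the information returned from the ARP logic 350 to determine where and how to send the packet. Specifically, the driver looks at the address to determine whether the MAC address is in a valid format or in a particular invalid format. For example, in one embodiment, internal nodes (i.e., PAN nodes internal to the platform) are signaled by a combination of setting the locally administered bit, the multicast bit, and another predefined bit pattern in the first byte of the MAC address. The overall pattern is one that is highly unlikely to be a valid pattern.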
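The format test described above can be sketched as a check on the first byte of the address returned by the ARP logic. The locally administered bit (0x02) and multicast bit (0x01) are standard IEEE 802 flag positions; the "other predefined bit pattern" is not given in the text, so the value 0xA0 below, like the names `INTERNAL_FIRST_BYTE` and `is_internal`, is a placeholder assumed for illustration only.

```python
MULTICAST_BIT = 0x01            # IEEE 802 group (multicast) bit
LOCALLY_ADMINISTERED_BIT = 0x02  # IEEE 802 locally administered bit
ASSUMED_MARKER = 0xA0            # hypothetical stand-in for the predefined pattern

INTERNAL_FIRST_BYTE = MULTICAST_BIT | LOCALLY_ADMINISTERED_BIT | ASSUMED_MARKER


def is_internal(mac: bytes) -> bool:
    """True if the ARP-table entry uses the special 'invalid' internal format."""
    return len(mac) == 6 and mac[0] == INTERNAL_FIRST_BYTE


print(is_internal(bytes.fromhex("a30000000102")))  # internal marker in first byte
print(is_internal(bytes.fromhex("001122334455")))  # ordinary unicast MAC
```

Comparing the whole first byte, rather than testing the two flag bits alone, keeps the all-ones broadcast address from being mistaken for an internal entry in this sketch.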
【００４２】
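If the MAC address returned from the ARP logic is in a valid format, the IP address associated with that MAC address is for a node external at least to the relevant subnet and, in preferred embodiments, external to the platform. To deliver such a packet, the driver prepends the packet with a TLV (type-length-value) header. The logic then sends the packet to the control node over a pre-established VI. The control node then handles the rest of the transmission as appropriate.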
【００４３】
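If the MAC address information returned from the ARP logic 350 is in a particular invalid format, the invalid format signals that the node for that IP address is an internal node, and the information in the MAC address information is used to help identify the VI (or RVI) directly connecting the two processing nodes. For example, the ARP table entry may hold information identifying the RVI 212 to use to send the packet, e.g., 212_1-2, to another processing node. The driver prepends the packet with a TLV header. It then places address information into the header, along with information identifying the Ethernet protocol type. The logic then selects the appropriate VI (or RVI) on which to send the encapsulated packet. If that VI (or RVI) is operating satisfactorily, it is used to carry the packet; if it is not, the packet is sent to the control node switch logic (described below) so that the switch logic can deliver it to the appropriate node. Though the ARP table may contain information that actually specifies the RVI to use, many other techniques may be employed. For example, the information in the table may supply such information indirectly, for instance by pointing to the information of interest or otherwise identifying it without containing it.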
【００４４】
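For any multicast or broadcast type messages, the driver sends the messages to the control node on a defined VI. The control node then clones the packet and sends it to all nodes (excluding the sending node) and to the uplink accordingly.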
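The clone-and-unicast emulation of broadcast described above can be sketched as below. Per-node delivery queues are modelled as plain Python lists; `clone_broadcast` and its parameters are hypothetical names, and the real system would send each clone over the relevant VI rather than append to a list.

```python
from typing import Dict, List, Optional


def clone_broadcast(packet: bytes, sender: str,
                    node_queues: Dict[str, List[bytes]],
                    uplink: Optional[List[bytes]] = None) -> int:
    """Deliver one copy of the packet to every node except the sender.

    Returns the number of copies made (including the uplink copy, if any).
    """
    copies = 0
    for node, queue in node_queues.items():
        if node == sender:
            continue  # the sending node does not receive its own broadcast
        queue.append(bytes(packet))
        copies += 1
    if uplink is not None:
        uplink.append(bytes(packet))  # also forward toward the uplink
        copies += 1
    return copies


queues = {"PN1": [], "PN2": [], "PNk": []}
print(clone_broadcast(b"who-has 10.0.0.2", "PN1", queues, uplink=[]))
```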
【００４５】
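If there is no ARP mapping, the upper layers would never have sent the packet to the driver. If no data link layer mapping is available, the packet is set aside until ARP resolution completes. Once the ARP layer has finished ARPing, the packets held pending the ARP have their data link headers built and are then sent to the driver.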
【００４６】
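If the ARP logic has no mapping for the IP address of an IP packet from the IP stack, and the driver 310 consequently cannot determine the associated addressing information (i.e., MAC address or RVI-related information), the driver obtains such information by following the ARP protocol. Referring to FIGS. 4B and 4C, the driver builds 425 an ARP request packet containing the relevant IP address for which there is no MAC mapping in the local ARP table. The node then prepends 430 the ARP packet with a TLV-type header. The ARP request is then sent via a dedicated RVI to the control-node-side networking logic, specifically to the virtual LAN server 335.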
【００４７】
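As discussed in more detail below, the ARP request packet is processed 435 by the control node and broadcast 440 to the relevant nodes. For example, the control node flags whether the requesting node is part of an IP service cluster.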
【００４８】
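The Ethernet driver logic 310 at the relevant nodes receives 445 the ARP request and determines 450 whether it is the target of the ARP request by comparing the target IP address with a list of locally configured IP addresses, obtained by making calls to the node's IP stack. If it is not the target, it passes up the packet without modification. If it is the target, the driver builds 460 a local MAC header from the TLV header, updates the local ARP table, and creates 465 an ARP reply. The driver modifies the information in the ARP request (mainly the source MAC) and then passes the ARP request up normally for the upper layers to handle. It is the upper layers that form the ARP reply when needed. The reply contains, among other things, the MAC address of the replying node and has a bit set in the TLV header indicating that the reply is from a local node. In this regard, the node responds according to IETF-type ARP semantics (in contrast to ATM ARP protocols, in which ARP replies are handled centrally). The reply is then sent 470.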
【００４９】
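As explained in more detail below, the control node logic 335 receives the reply and modifies 473 it. For example, the control node may substitute the MAC address of a replying internal node with information identifying the source cabinet, processing node number, RVI connection number, channel, virtual interface number, and virtual LAN name. Once the ARP reply is modified, the control node logic sends 475 the ARP reply to the appropriate node, i.e., the node that sent the ARP request or, in specific instances, the load balancer in an IP service cluster, discussed below.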
【００５０】
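Ultimately, the encapsulated ARP reply is received 480. If the replying node is an external node, the ARP reply contains the MAC address of the replying node. If the replying node is an internal node, the ARP reply instead contains information identifying the relevant RVI with which to communicate with the node. In either case, the local table is updated 485.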
【００５１】
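The pending datagram is dequeued 487 and the appropriate RVI is selected 493. As discussed above, the appropriate RVI is selected based on whether the target node is internal or external. A TLV header is prepended to the packet, which is then sent 495.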
【００５２】
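For communication within a virtual LAN, the maximum transmission unit (MTU) is configured as 16,896 bytes. Even though the configured MTU is 16,896 bytes, the Ethernet driver 310 recognizes when a packet is being sent to an external network. Through the use of path MTU discovery, ICMP, and IP stack changes, the path MTU is changed at the source node 105. This mechanism is also used to trigger packet checksums.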
【００５３】
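Certain embodiments of the invention support promiscuous mode through a combination of logic at the virtual LAN server 335 and in the virtual LAN driver 310. When a virtual LAN driver 310 receives a promiscuous mode message from the virtual LAN server 335, the message contains information about the identity of the receiver wishing to enter promiscuous mode. This information includes the receiver's location (cabinet, node, etc.), the interface number of the promiscuous virtual interface 310 on the receiver (needed for demultiplexing packets), and the name of the virtual LAN to which the receiver belongs. The driver 310 then uses this information to determine how to send promiscuous packets to the receiver (which RVI or other mechanism to use to send the packets). The virtual interface 310 maintains a list of promiscuous listeners on the same virtual LAN. When a sending node receives a promiscuous mode message, it updates its promiscuous list accordingly.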
【００５４】
When a packet is transmitted over a virtual Ethernet driver 310, this list is examined. If the list is not empty, the virtual Ethernet interface 310 does the following:
【００５５】
If the outgoing packet is being broadcast or multicast, no promiscuous copy is sent; the normal broadcast operation will transmit the packet to the promiscuous listener.
【００５６】
If the packet is a unicast packet with a destination other than the promiscuous listener, the packet is cloned and sent to the promiscuous listeners.
【００５７】
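The header TLV includes extra information that the destination can use to demultiplex and validate the incoming packet. Part of this information is the destination virtual Ethernet interface number (the destination device number on the receiving node). Because these can differ between the actual packet destination and the promiscuous destination, the header cannot simply be cloned. Memory must therefore be allocated for each header for each packet clone to each promiscuous listener. When the packet header for a promiscuous packet is built, the packet type is set to indicate that the packet was a promiscuous transmission rather than a unicast transmission.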
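The promiscuous-copy rules above (no extra copy for broadcast or multicast, a clone with a freshly built header for unicast) can be sketched as follows. The header is modelled as a small dictionary; its field names, the `promiscuous_copies` function, and the listener map are all hypothetical and stand in for the TLV header described in the text.

```python
from typing import Dict, List, Tuple


def promiscuous_copies(kind: str, dest_node: str, payload: bytes,
                       listeners: Dict[str, int]) -> List[Tuple[dict, bytes]]:
    """Build (header, payload) clones for promiscuous listeners.

    'listeners' maps a listener node to its promiscuous interface number.
    """
    if kind in ("broadcast", "multicast"):
        return []  # normal broadcast delivery already reaches the listeners
    copies = []
    for node, interface_number in listeners.items():
        if node == dest_node:
            continue  # the listener is the real destination; no clone needed
        # A new header per clone: the interface number differs per listener.
        header = {"type": "promiscuous", "dest_interface": interface_number}
        copies.append((header, payload))
    return copies


print(promiscuous_copies("unicast", "PN2", b"data", {"PN3": 4}))
```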
【００５８】
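The virtual Ethernet driver 310 is also responsible for handling the redundant control node connections. For example, the virtual Ethernet drivers periodically test end-to-end connectivity by sending a TLV heartbeat to each connected RVI. This allows a virtual Ethernet driver to determine whether a node has stopped responding or whether a stopped node has started responding again. When an RVI or control node 120 is determined to be down, the Ethernet driver sends traffic through the surviving control node. If both control nodes are functional, the driver 310 attempts to load-balance traffic between the two nodes.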
【００５９】
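Certain embodiments of the invention provide performance improvements. For example, with modifications to the IP stack 305, packets sent only within the platform 100 are not checksummed, because all elements of the platform 100 provide error detection and guaranteed data delivery.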
【００６０】
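Moreover, for communication within a PAN (or even within the platform 100), the RVIs may be configured so that packets may be larger than the maximum size permitted by Ethernet. Thus, while the model emulates Ethernet behavior in certain embodiments, the maximum packet size may be exceeded to improve performance. The actual packet size is negotiated as part of the data link layer.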
【００６１】
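Failure of a control node is detected either by a notification from the RCLAN layer or by a failure of the TLV heartbeats. If a control node fails, the Ethernet driver 310 sends traffic only to the remaining control node. The Ethernet driver 310 recognizes the recovery of a control node via a notification from the RCLAN layer or the resumption of TLV heartbeats. Once the control node has recovered, the Ethernet driver 310 resumes load balancing.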
【００６２】
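If a node detects that it cannot communicate with another node via a direct RVI (as outlined above), the node attempts to communicate via the control node, which acts as a switch. Such a failure may be signaled by the lower RCLAN layer, for example from a failure to receive a virtual interface acknowledgment or from a failure detected through the heartbeat mechanism. In this instance, the driver marks a bit in the TLV header to indicate that the message is to be unicast, and sends the packet to the control node so that it can forward the packet to the desired node (e.g., based on the IP address if necessary).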
【００６３】
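(RCLAN Layer)
The RCLAN layer 315 is responsible for handling the redundancy, failover, and load-balancing logic of the redundant interconnect NICs 107. This includes detecting failures, re-routing traffic over a redundant connection on failure, load balancing, and reporting the inability to deliver traffic back to the virtual network driver 310. The virtual network driver 310 expects to be notified asynchronously when there is a fatal error on any RVI that makes the RVI unusable or when any RVI is taken down for any reason.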
【００６４】
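Under normal circumstances, the virtual network driver 310 on each processor attempts to balance the load of outgoing packets between the available control nodes. This can be done via simple round-robin alternation between the available control nodes, or by keeping track of how many bytes have been transmitted on each and always transmitting on the control node through which the fewest bytes have been sent.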
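Both balancing policies mentioned above are simple enough to sketch. The example below shows round-robin alternation with `itertools.cycle` and a least-bytes policy as a small class; the names `round_robin`, `LeastBytesBalancer`, and the control node labels `cn-a` and `cn-b` are hypothetical, and the `available` argument is an assumed way of modelling a control node that is currently down.

```python
import itertools
from typing import Dict, Iterable, Iterator, Optional, Set


def round_robin(control_nodes: Iterable[str]) -> Iterator[str]:
    """Endless alternation between the given control nodes."""
    return itertools.cycle(list(control_nodes))


class LeastBytesBalancer:
    """Send each packet via the control node that has carried the fewest bytes."""

    def __init__(self, control_nodes: Iterable[str]) -> None:
        self.bytes_sent: Dict[str, int] = {node: 0 for node in control_nodes}

    def pick(self, size: int, available: Optional[Set[str]] = None) -> str:
        candidates = [n for n in self.bytes_sent
                      if available is None or n in available]
        if not candidates:
            raise RuntimeError("no control node available")
        choice = min(candidates, key=lambda n: self.bytes_sent[n])
        self.bytes_sent[choice] += size
        return choice


balancer = LeastBytesBalancer(["cn-a", "cn-b"])
print([balancer.pick(1500), balancer.pick(64), balancer.pick(64)])
```

Round-robin needs no state beyond the iterator, while the least-bytes policy evens out load better when packet sizes vary widely.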
【００６５】
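RCLAN provides high-bandwidth (224 MB/s each way), low-latency, reliable, asynchronous point-to-point communication between kernels. The sender of the data is notified if the data cannot be delivered, and a best effort is made to deliver it. RCLAN uses two Giganet clan 1000 cards to provide redundant communication paths between kernels. It recovers seamlessly from single failures in the clan 1000 cards or the Giganet switches. It detects lost data and data errors and resends the data if needed. Communication is not disrupted as long as one of the connections is at least partially working, e.g., its error rate does not exceed 5%. Clients of RCLAN include the RPC mechanism, the remote SCSI mechanism, and remote Ethernet. RCLAN also provides a simple form of flow control. Low latency and high concurrency are achieved by allowing multiple simultaneous requests for each device to be sent by the processor node to the control node, so that they can be forwarded to the device as soon as possible or, alternatively, queued for completion as close to the device as possible, as opposed to queuing all requests on the processor node.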
【００６６】
The RCLAN layer 330 on the control node side operates analogously to the above.
【００６７】
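(Giganet Driver)
The Giganet driver logic 320 is the logic responsible for providing an interface to the Giganet NIC 107, whether on a processor 106 or a control node 120. In short, the Giganet driver logic establishes VI connections, associated by VI IDs, so that the higher layers (e.g., RCLAN 315 and the Ethernet driver 310) need only understand the semantics of VIs.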
【００６８】
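The Giganet driver logic 320 is responsible for allocating memory in each node for the buffers and queues of the VIs, and for conditioning the NIC 107 to know about the connection and its memory allocation. Certain embodiments use the VI connections provided by the Giganet driver. The Giganet NIC driver code establishes a virtual interface pair (i.e., a VI) and assigns it a corresponding virtual interface ID.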
【００６９】
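Each VI is a bidirectional connection established between one Giganet port and another or, more precisely, between memory buffers and memory queues on one node and buffers and queues on another. The allocation of ports and memory is handled by the NIC driver as described above. Data is transmitted by placing it in a buffer the NIC knows about and triggering the action by writing to a specific memory-mapped register. On the receiving side, the data appears in a buffer and completion status appears in a queue. The data never needs to be copied if the sending and receiving programs can produce and consume messages in the connection's buffers. The transmission can even be direct from application program to application program if the operating system memory-maps the connection's buffers and control registers into application address space. Each Giganet port can support 1024 simultaneous VI connections over it and keep them separate from one another with hardware protection, so the operating system as well as disparate applications can safely share a single port. Under certain embodiments of the invention, 14 VI connections may be established simultaneously from every port to every other port.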
【００７０】
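In preferred embodiments, the NIC driver establishes VI connections in redundant pairs, with one connection of the pair going through one of the two switch fabrics 115a, 115b and the other through the other switch. Moreover, in preferred embodiments, data is sent alternately on the two legs of the pair, equalizing the load on the switches. Alternatively, the redundant pairs may be used in a failover manner.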
【００７１】
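All the connection pairs established by a node persist for as long as its operating system remains up. Establishing a connection pair to simulate an Ethernet connection is intended to be analogous to, and as persistent as, physically plugging in a cable between network interface cards. If a node's defined configuration changes while its operating system is running, the applicable redundant virtual interface connection pairs are established or discarded at the time of the change.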
【００７２】
The Giganet driver logic 325 on the control node side operates analogously to the above.
【００７３】
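(Virtual LAN Server)
The virtual LAN server logic 335 facilitates the emulation of an Ethernet network over the underlying NBMA network. The virtual LAN server logic:
1. manages membership in the corresponding virtual LAN;
2. provides RVI mapping and management;
3. handles ARP processing and IP mapping to RVIs;
4. provides broadcast and multicast services;
5. facilitates bridging and routing to other domains; and
6. manages service clusters.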
【００７４】
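(1. Virtual LAN Membership Management)
Administrators configure the virtual LANs using the management application 135. Assignment and configuration of IP addresses on virtual LANs may be done in the same way as on an "ordinary" subnet. The choice of IP addresses to use depends on the external visibility of the nodes on a virtual LAN. If the virtual LAN is not globally visible (either not visible outside the platform 100 or not visible from the Internet), private IP addresses should be used. Otherwise, IP addresses must be configured from the range provided by the Internet service provider (ISP) that provides the Internet connectivity. In general, virtual LAN IP address assignment must be treated the same as normal LAN IP address assignment. Configuration files stored on the local disks of the control node 120 define the IP addresses within a virtual LAN. For virtual network interfaces, an IP alias simply creates another IP-to-RVI mapping on the virtual LAN server logic 335. Each processor may configure multiple virtual interfaces as needed. The primary restrictions on the creation and configuration of virtual network interfaces are IP address allocation and configuration.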
【００７５】
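Each virtual LAN has a corresponding instance of server logic 335 that executes on both of the control nodes 120, and a number of nodes executing on the processor nodes 105. The topology is defined by the administrator.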
【００７６】
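Each virtual LAN server 335 is configured to manage exactly one broadcast domain, and any number of layer 3 (IP) subnets may be present on that layer 2 broadcast domain. The servers 335 are configured and created in response to administrator commands to create virtual LANs.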
【００７７】
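When a processor 106 boots and configures its virtual networks, it connects to the virtual LAN server 335 via a special management RVI. The processor then obtains its data link configuration information, such as the virtual MAC addresses assigned to it and virtual LAN membership information. The virtual LAN server 335 determines and verifies that the processor attempting to connect to it is properly a member of the virtual LAN that the server 335 is servicing. If the processor is not a virtual LAN member, the connection to the server is rejected. If it is a member, the virtual network driver 310 registers its IP address with the virtual LAN server. (The IP address is provided by the IP stack 305 when the driver 310 is configured.) The virtual LAN server then binds that IP address to the RVI on which the registration arrived. This enables the virtual LAN server to find the processor associated with a specific IP address. In addition, the association of IP addresses with a processor can be performed via the virtual LAN management interface 135. The latter method is necessary to properly configure cluster IP addresses or IP addresses with special handling, discussed below.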
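The membership check and IP-to-RVI binding described above can be sketched as a small registry. This is an illustrative model only: the class name `VirtualLanServer`, its methods, and the use of `PermissionError` to represent a rejected connection are assumptions, not the patent's interfaces.

```python
from typing import Dict, Optional


class VirtualLanServer:
    """Toy model of membership verification and IP registration."""

    def __init__(self, members: Dict[str, str]) -> None:
        self.members = dict(members)          # processor id -> assigned virtual MAC
        self.ip_to_rvi: Dict[str, str] = {}   # registered IP -> arrival RVI

    def connect(self, processor_id: str) -> str:
        """Return the processor's virtual MAC, or reject a non-member."""
        if processor_id not in self.members:
            raise PermissionError(f"{processor_id} is not a member of this virtual LAN")
        return self.members[processor_id]

    def register_ip(self, processor_id: str, ip: str, arrival_rvi: str) -> None:
        """Bind the IP address to the RVI on which the registration arrived."""
        self.connect(processor_id)  # only members may register
        self.ip_to_rvi[ip] = arrival_rvi

    def rvi_for(self, ip: str) -> Optional[str]:
        return self.ip_to_rvi.get(ip)


server = VirtualLanServer({"PN1": "02:12:34:00:00:01"})
server.register_ip("PN1", "10.0.0.5", "rvi-7")
print(server.rvi_for("10.0.0.5"))
```

Registering a second address for the same processor adds another entry to `ip_to_rvi`, which mirrors the remark that an IP alias simply creates another IP-to-RVI mapping.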
【００７８】
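(2. RVI Mapping and Management)
As outlined above, certain embodiments use RVIs to connect nodes at the data link layer and to form control connections. Some of these connections are created and assigned as part of control node boot and initialization. The data link layer connections are used for the reasons described above. The control connections are used to exchange management, configuration, and health information.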
【００７９】
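Some RVI connections are between nodes for unicast traffic (e.g., 212_1-2). Other RVI connections are to the virtual LAN server logic 335 so that the server can handle requests, e.g., ARP traffic, broadcasts, and so on. The virtual LAN server 335 creates and removes RVIs through calls to a Giganet switch manager 360 (provided with the switch fabric and Giganet NICs). The switch manager may execute on the control nodes 120 and cooperates with the Giganet drivers to create the RVIs.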
【００８０】
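With regard to processor connections, as nodes register with the virtual LAN server 335, the virtual LAN server creates and assigns virtual MAC addresses for the nodes, as described above. In conjunction with this, the virtual LAN server logic maintains data structures reflecting the topology and MAC assignments for the various nodes. The virtual LAN server logic then creates corresponding RVIs for the unicast paths between nodes. These RVIs are subsequently allocated and made known to the nodes during the nodes' booting. Moreover, the RVIs are also associated with IP addresses during the virtual LAN server's handling of ARP traffic. The RVI connections are torn down if a node is removed from the topology.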
【００８１】
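If a node 106 at one end of an established RVI connection is rebooted, the two operating systems at each end of the connection, and the RVI management logic, re-establish the connection. Software using the connection on the processing node that remained up will be unaware that anything happened to the connection itself. Whether the software notices or cares that the software at the other end was rebooted depends on what it is using the connection for and on the extent to which the rebooted end is able to re-establish its state from persistent storage. For example, any software communicating via the Transmission Control Protocol (TCP) will notice that all TCP sessions are closed by a reboot. On the other hand, Network File System (NFS) access is stateless and is not affected by a reboot if it occurs within an allowed timeout period.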
【００８２】
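Should a node be unable to send a packet on a direct RVI at any time, it can always attempt to send the packet to the destination via the virtual LAN server 335. Because the virtual LAN server 335 is connected to all the virtual Ethernet driver 310 interfaces on the virtual LAN via the control connections, the virtual LAN server 335 can also serve as a packet relay mechanism of last resort.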
【００８３】
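With regard to connections to the virtual LAN server 335, certain embodiments use virtual Ethernet drivers 310 that algorithmically determine the RVI they should use to connect to their associated virtual LAN server 335. Depending on the embodiment, the algorithm may need to consider identification information, such as the cabinet number, to identify the RVI.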
【００８４】
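(3. ARP Processing and IP Mapping to RVIs)
As explained above, the virtual Ethernet drivers 310 of certain embodiments support ARP. In these embodiments, ARP processing is used to advantage to create mappings at the nodes between IP addresses and the RVIs that may be used to carry unicast traffic, including IP packets, between nodes.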
【００８５】
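To do this, the virtual Ethernet drivers 310 send ARP packet requests and replies to the virtual LAN server 335 via a dedicated RVI. The virtual LAN server 335, and specifically the ARP server logic 355, handles the packets by adding information to the packet header. As explained above, this information facilitates identification of the source and target and identifies the RVI that may be used between the nodes.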
【００８６】
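The ARP server logic 355 receives ARP requests, processes the TLV header, and broadcasts the request to all relevant nodes on the internal platform and the external network, as appropriate. Among other things, the server logic 355 determines who should receive the ARP reply resulting from the request. For example, if the source is a cluster IP address, the reply should be sent to the cluster load balancer, not necessarily to the source of the ARP request. The server logic 355 indicates this by including information in the TLV header of the ARP request, so that the target of the ARP replies accordingly. The server 335 processes the ARP packet by including further information in the added header and broadcasts the packet to the nodes in the relevant domain. For example, the modified header may include information identifying the source cabinet, processing node number, RVI connection number, channel, virtual interface number, and virtual LAN name (some of which is known only to the server 335).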
【００８７】
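The ARP replies are received by the server logic 355, which then maps the MAC information in the reply to the corresponding RVI-related information. The RVI-related information is placed in the target MAC entry of the reply, and the reply is sent to the appropriate source node (e.g., the sender of the request but, in some instances such as those involving cluster IP addresses, possibly a different node).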
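The reply-rewriting step can be sketched as a pure function: if the replying MAC belongs to an internal node, the server substitutes RVI-related information, and the reply is then addressed either to the requester or to a cluster load balancer. The dictionary field names, the `rvi:` string encoding, and the function `rewrite_arp_reply` are hypothetical; the text does not specify how the RVI information is encoded.

```python
from typing import Dict, Optional, Tuple


def rewrite_arp_reply(reply: Dict[str, str],
                      mac_to_rvi: Dict[str, str],
                      requester: str,
                      cluster_balancer: Optional[str] = None
                      ) -> Tuple[str, Dict[str, str]]:
    """Return (destination node, rewritten reply) without mutating the input."""
    rewritten = dict(reply)
    rvi_info = mac_to_rvi.get(reply["mac"])
    if rvi_info is not None:
        # Internal replier: carry RVI-related information instead of the MAC.
        rewritten["mac"] = rvi_info
    # Cluster IP sources are answered via the load balancer, not the requester.
    destination = cluster_balancer if cluster_balancer is not None else requester
    return destination, rewritten


dest, out = rewrite_arp_reply(
    {"ip": "10.0.0.2", "mac": "02:12:34:00:00:02"},
    {"02:12:34:00:00:02": "rvi:212_1-2"},
    requester="PN1",
)
print(dest, out)
```

A reply from an external node has no entry in `mac_to_rvi`, so its real MAC address passes through unchanged.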
【００８８】
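(4. Broadcast and Multicast Services)
As outlined above, broadcasts are handled by receiving the packet on a dedicated RVI. The packet is then cloned by the server 335 and unicast to all virtual interfaces 310 in the relevant broadcast domain.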
【００８９】
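The same approach may be used for multicast. All multicast packets are reflected off the virtual LAN server. Under some alternative embodiments, the virtual LAN server treats multicast the same as broadcast and relies on IP filtering on each node to filter out unwanted packets.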
【００９０】
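When an application wishes to send or receive multicast addresses, it must first join a multicast group. When a process performs a multicast join on a processor, the processor's virtual network driver 310 sends a join request to the virtual LAN server 335 via a dedicated RVI. The virtual LAN server then configures the specific multicast MAC address on the interface and informs the LAN proxy 340, discussed below, as necessary. The proxy 340 must keep track of use counts on specific multicast groups, so that a multicast address is removed only when no processor belongs to that multicast group.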
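The use-count bookkeeping described for the proxy amounts to reference counting per multicast group. A minimal sketch follows, with the class name `MulticastUseCounts` and its methods introduced here as hypothetical names; the `configured` set stands in for the multicast MAC addresses currently set on the external interface.

```python
from collections import Counter
from typing import Set


class MulticastUseCounts:
    """Reference-count multicast groups; drop an address when nobody uses it."""

    def __init__(self) -> None:
        self.use_counts: Counter = Counter()
        self.configured: Set[str] = set()

    def join(self, group_mac: str) -> None:
        self.use_counts[group_mac] += 1
        self.configured.add(group_mac)  # configure on first (or any) join

    def leave(self, group_mac: str) -> None:
        if self.use_counts[group_mac] == 0:
            return  # leave without a matching join: nothing to do
        self.use_counts[group_mac] -= 1
        if self.use_counts[group_mac] == 0:
            del self.use_counts[group_mac]
            self.configured.discard(group_mac)  # last member left: remove address


proxy = MulticastUseCounts()
proxy.join("01:00:5e:00:00:fb")
proxy.join("01:00:5e:00:00:fb")
proxy.leave("01:00:5e:00:00:fb")
print(sorted(proxy.configured))
```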
【００９１】
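(5. Bridging and Routing to Other Domains)
From the perspective of system 100, the external network 125 may operate in one of two modes: filtered or unfiltered. In filtered mode, a single MAC address for the entire system is used for all outgoing packets. This hides the virtual MAC addresses of the processing nodes 107 behind the virtual LAN proxy 340 and makes the system appear as a single node on the network 125 (or as multiple nodes behind a bridge or proxy). Because this does not expose the unique link layer information of each internal node 107, some other unique identifier is required to deliver incoming packets properly. When operating in filtered mode, the destination IP address of each incoming packet is used to uniquely identify the intended recipient, since the MAC address identifies only the system. In unfiltered mode, the virtual MACs of the nodes 107 are visible outside the system so that they may be used to direct incoming traffic. That is, filtered mode mandates layer 3 switching, while unfiltered mode allows layer 2 switching. Filtered mode requires that some component (in this case the virtual LAN proxy 340) replace the node virtual MAC addresses with the MAC address of the external network 125 interface on all outgoing packets.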
【００９２】
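Some embodiments support the ability for a virtual LAN to be connected to an external network. Consequently, the virtual LAN must handle IP addresses not configured locally. To address this, certain embodiments impose the restriction that each virtual LAN so connected be limited to one external broadcast domain. The IP addresses and subnet assignments for the internal nodes of the virtual LAN must conform to the external domain.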
【００９３】
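The virtual LAN server 335 services the external connection by effectively acting as a data link layer bridge, in that it moves packets between the external Ethernet driver 345 and the internal processors and performs no IP processing. Unlike a conventional data link layer bridge, however, the server cannot always rely on distinct layer 2 addresses from the external network to internal nodes, and the connection may instead use layer 3 (IP) information to make the bridging decisions. To do this, the external connection software extracts IP address information from incoming packets and uses this information to identify the correct node 106 so that it can move the packet to that node.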
【００９４】
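A virtual LAN server 335 having an attached external broadcast domain must intercept and process packets from and to the external domain so that external nodes have a consistent view of the subnet(s) in the broadcast domain.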
【００９５】
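When a virtual LAN server 335 having an attached external broadcast domain receives an ARP request from an external node, it relays the request to all internal nodes. The correct node then composes the reply and sends it back to the requester through the virtual LAN server 335. The virtual LAN server cooperates with the virtual LAN proxy 340 so that the proxy can handle any necessary MAC address translation on outgoing requests. All ARP replies and ARP advertisements from external sources are relayed directly to the target nodes.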
【００９６】
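The virtual Ethernet interfaces 310 send all unicast packets with an external destination to the virtual LAN server 335 over the control connection RVI. (External destinations may be recognized by the driver by the MAC address format.) The virtual LAN server then moves the packet to the external network 125 accordingly.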
【００９７】
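If the virtual LAN server 335 receives a broadcast or multicast packet from an internal node, it relays the packet to the external network in addition to relaying it to all internal virtual LAN members. If the virtual LAN server 335 receives a broadcast or multicast packet from an external source, it relays the packet to all attached internal nodes.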
【００９８】
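Under certain embodiments, interconnecting virtual LANs through the use of IP routers or firewalls is accomplished using mechanisms analogous to those used to interconnect physical LANs. One processor is configured on both LANs, and the Linux kernel on that processor must have routing (and possibly IP masquerading) enabled. Normal IP subnetting and routing semantics are always maintained, even for two nodes located in the same platform.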
【００９９】
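A processor could be configured as a router between two external subnets, between an external and an internal subnet, or between two internal subnets. When an internal node is sending a packet through a router, there are no problems because of the point-to-point topology of the internal network. The sender sends directly to the router (i.e., a processor configured with routing logic) without the intervention of the virtual LAN server (i.e., typical processor-to-processor communication, discussed above).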
【０１００】
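When an external node sends a packet to an internal router and the external network 125 is operating in filtered mode, the destination MAC address of the incoming packet is that of the platform 100. Thus the MAC address cannot be used to uniquely identify the packet's destination node. For a packet whose destination is an internal node on the virtual LAN, the destination IP address in the IP header is used to direct the packet to the proper destination node. However, because a router is not the final destination, the destination IP address in the IP header is that of the final destination rather than that of the next hop (which is the internal router). Thus there is nothing in the incoming packet that can be used to direct it to the correct internal node. To handle this situation, certain embodiments impose a limit of at most one router exposed to an external network on a virtual LAN. This router is registered with the virtual LAN server 335 as a default destination, so that incoming packets with no valid destination are directed to this default node.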
【０１０１】
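When an external node sends a packet to an internal router and the external network 125 is operating in unfiltered mode, the destination MAC address of the incoming packet is the virtual MAC address of the internal destination node. The LAN server 335 then uses this virtual MAC to send the packet directly to the destination internal node. In this case, any number of internal nodes may function as routers, because the MAC address of the incoming packet uniquely identifies the destination node.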
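The two inbound delivery rules (layer 3 lookup with a default router in filtered mode, layer 2 lookup in unfiltered mode) can be sketched as one function. The lookup tables, the mode strings, and the name `inbound_destination` are hypothetical and only model the decision, not the packet handling.

```python
from typing import Dict, Optional


def inbound_destination(mode: str, dest_mac: str, dest_ip: str,
                        mac_table: Dict[str, str],
                        ip_table: Dict[str, str],
                        default_router: Optional[str] = None) -> Optional[str]:
    """Pick the internal node for a packet arriving from the external network."""
    if mode == "unfiltered":
        # Virtual MACs are visible externally: switch at layer 2.
        return mac_table.get(dest_mac)
    if mode == "filtered":
        # Only the platform MAC is visible: switch at layer 3, and fall back
        # to the single exposed router for addresses that are not local.
        return ip_table.get(dest_ip, default_router)
    raise ValueError(f"unknown mode: {mode}")


mac_table = {"02:12:34:00:00:01": "PN1"}
ip_table = {"10.0.0.5": "PN1"}
print(inbound_destination("filtered", "platform-mac", "192.0.2.9",
                          mac_table, ip_table, default_router="PN-router"))
```

The single `default_router` argument reflects the stated limit of at most one exposed router per virtual LAN in filtered mode.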
【０１０２】
If a configuration requires multiple routers on a subnet, one router can be picked as the exposed router. This router can in turn route to the other routers as necessary.
【０１０３】
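Under certain embodiments, router redundancy is provided by making a router a clustered service and load balancing or failing over in a stateless manner (i.e., per IP packet rather than per TCP connection).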
【０１０４】
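Certain embodiments of the invention support promiscuous mode functionality by providing switch semantics in which a given port may be designated as a promiscuous port, so that all traffic passing through the switch is repeated on the promiscuous port. The nodes allowed to listen in promiscuous mode are assigned administratively at the virtual LAN server.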
【０１０５】
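When a virtual Ethernet interface 310 enters promiscuous receive mode, it sends a message to the virtual LAN server 335 over the management RVI. This message contains all the information about the virtual Ethernet interface 310 entering promiscuous mode. When the virtual LAN server receives a promiscuous mode message from a node, it checks its configuration information to determine whether the node is allowed to listen promiscuously. If not, the virtual LAN server drops the promiscuous mode message without further processing. If the node is allowed to enter promiscuous mode, the virtual LAN server broadcasts the promiscuous mode message to all other nodes on the virtual LAN. The virtual LAN server also marks the node as promiscuous so that it can forward copies of incoming external packets to it. When a promiscuous listener detects any change in its RVI configuration, it sends a promiscuous mode message to the virtual LAN to update the state of all other nodes on the relevant broadcast domain. This updates any nodes entering or leaving the virtual LAN. When a virtual Ethernet interface 310 leaves promiscuous mode, it sends the virtual LAN server a message informing it that the interface is leaving promiscuous mode. The virtual LAN server then sends this message to all other nodes on the virtual LAN. Promiscuous settings allow the external connection to be placed in promiscuous mode when any internal virtual interface is a promiscuous listener. This makes traffic that is external to the platform (but on the same virtual LAN) available to the promiscuous listener.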
【０１０６】
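(6. Managing Service Clusters)
A service cluster is a set of services available at one or more IP addresses (or host names). Examples of these services are HTTP, FTP, Telnet, NFS, and so on. An IP address and port number pair represents a specific service type (though not a service instance) offered by the cluster to clients, including clients on the external network 125.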
【０１０７】
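FIG. 5 shows how certain embodiments present the services of a virtual cluster 505 as a single virtual host to the Internet or other external network 125 via a cluster IP address. All the services of the cluster 505 are addressed through a single IP address, via different ports at that IP address. In the example of FIG. 5, service B is a load-balanced service.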
【０１０８】
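With reference to FIG. 3B, virtual clusters are supported by the inclusion of virtual cluster proxy (VCP) logic 360, which cooperates with the virtual LAN server 335. In short, the VCP 360 is responsible for handling the distribution of incoming connections, port filtering, and real server connections for each configured virtual IP address. There is one VCP for each configured cluster IP address.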
【０１０９】
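When a packet arrives on a virtual cluster IP address, the virtual LAN proxy logic 340 sends the packet to the VCP 360 for processing. The VCP then decides where to send the packet based on the packet contents, its internal connection state cache, any load balancing algorithm being applied to incoming traffic, and the availability of the configured services. The VCP relays incoming packets based on both the destination IP address and the TCP or UDP port number. Further, it distributes only packets destined for port numbers known to the VCP (or for existing TCP connections). It is the configuration of these ports, and the mapping of port numbers to one or more processors, that creates the virtual cluster and makes specific service instances available in that cluster. When multiple instances of the same service from multiple application processors are configured, the VCP can load balance among the service instances.
【０１１０】
The VCP 360 maintains a cache of all active connections that exist on the cluster's IP address. Load balancing decisions are made only when a new connection is established between a client and a service. Once the connection is set up, the VCP uses the source and destination information in the incoming packet header to make sure that all packets of a TCP stream are sent to the same processor 106 configured to provide the service. In the absence of the ability to determine a client session (e.g., an HTTP session), the actual connection/load balancing mapping cache routes packets based on the client address, so that subsequent connections from the same client go to the same processor (making client sessions persistent, or "sticky"). Because only certain types of service require session persistence, session persistence should be selectable on a per-service-port-number basis.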
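As a non-limiting sketch, the dispatch decision of the VCP could be modeled as below. Round-robin selection is an assumption made only for this example (the description leaves the load balancing algorithm open), and the class and parameter names are hypothetical.

```python
from itertools import cycle


class VirtualClusterProxy:
    def __init__(self, services, sticky_ports=()):
        # services: service port -> list of processors offering it (assumed config shape)
        self._next = {port: cycle(procs) for port, procs in services.items()}
        self._sticky_ports = set(sticky_ports)
        self._connections = {}   # (client_ip, client_port, service_port) -> processor
        self._clients = {}       # (client_ip, service_port) -> processor, for sticky ports

    def dispatch(self, client_ip, client_port, service_port):
        key = (client_ip, client_port, service_port)
        if key in self._connections:
            return self._connections[key]       # existing connection: no new decision
        if service_port not in self._next:
            return None                         # unknown port: packet is not distributed
        sticky_key = (client_ip, service_port)
        if service_port in self._sticky_ports and sticky_key in self._clients:
            proc = self._clients[sticky_key]    # same client returns to the same processor
        else:
            proc = next(self._next[service_port])
            self._clients[sticky_key] = proc
        self._connections[key] = proc
        return proc
```

Keeping persistence in a per-port set reflects the statement that session persistence should be selectable by service port number.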
【０１１１】
ARP要求への応答およびARP応答のルーティングは、VCPによって処理される。プロセッサが任意のARPパケットを送ると、それは仮想イーサネットドライバ310によってそれを送り出すであろう。次にそのパケットは通常のARP処理のために、仮想LANサーバー335に送られるであろう。通常通り、その仮想LANサーバーはそのパケットを同報通信するだろうが、クラスタのどのメンバ(単なるセンダーではない)にもそれが同報通信されないことを確かめるであろう。さらにそれはその仮想LANサーバーを通して、具体的には負荷バランサーを通してのみARPソースが到達され得ることをARPターゲットに示す情報をパケットヘッダTLVに入れるであろう。そのARPターゲットは、内部か外部かにかかわらず、ARP要求を普通に処理し、仮想LANサーバーを通して応答を送り返すであろう。ARPソースがクラスタIPアドレスであったので、仮想LANサーバーは、どのプロセッサがオリジナルの要求を発送したかを判定することができないであろう。したがって、それを適切に処理することができるように、仮想LANサーバーは、各クラスタメンバにその応答を送るであろう。ARPのパケットがターゲットとして、クラスタIPアドレスを有するソースによって送られると、仮想LANサーバーはすべてのクラスタメンバにその要求を送るであろう。各クラスタメンバはARP要求を受け取り、それを普通に処理するであろう。次に、それらはARP応答を構成し、仮想LANサーバー経由でソースにそれを送るであろう。仮想LANサーバーがクラスタメンバから何らかのARP応答も受け取ると、その応答を止めるであろうが、その仮想LANサーバーは、ARPのソースへのARP応答を構成して、送るであろう。したがって、その仮想LANサーバーは、クラスタIPアドレスのすべてのARPに応答するであろう。そのARP応答は、ARPソースがVCPへのクラスタIPアドレスにパケットをすべて送るのに必要な情報を含んでいるであろう。外部のARPソースについては、これは、単純にソースハードウェアアドレスとして外部MACアドレスを有するARP応答になるであろう。内部ARPのソースについては、これは、直接接続されているRVIを通じてではなく、仮想LAN管理RVIの下でクラスタIPアドレスにパケットを送るように、そのソースに命じるのに必要な情報になるであろう。受け取られる任意の無償のARPパケットが、すべてのクラスタメンバに転送されるであろう。クラスタメンバによって送られた任意の無償のARPパケットも普通に送られるであろう。
【０１１２】
（仮想LANプロキシ）
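The virtual LAN proxy 340 performs the basic coordination of physical network resources among all the processors that have virtual interfaces to the external physical network 125. It bridges the virtual LAN server 335 to the external network 125. When the external network 125 is operating in filtered mode, the virtual LAN proxy 340 translates the internal virtual MAC address of each node into the single external MAC address assigned to system 100. When the external network 125 is operating in unfiltered mode, no such MAC translation is needed. The virtual LAN proxy 340 also inserts and removes IEEE 802.1Q virtual LAN ID tagging information and demultiplexes packets based on their VLAN IDs. It also serializes access to the physical Ethernet interface 129 and coordinates the allocation and removal of MAC addresses, such as multicast addresses, on the physical network.
【０１１３】
When the external network 125 is operating in filtered mode and the virtual LAN proxy 340 receives an outgoing packet (ARP or otherwise) from the virtual LAN server 335, it replaces the internal-format MAC address used as the source MAC address with the MAC address of the physical Ethernet device 129. When the external network 125 is operating in unfiltered mode, no such replacement is needed.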
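For illustration only, the source MAC substitution on outgoing frames might look like the following sketch. It assumes an untagged Ethernet frame laid out as a 6-byte destination address, a 6-byte source address, and a 2-byte type field followed by the payload; the function name is hypothetical.

```python
def rewrite_source_mac(frame: bytes, external_mac: bytes, filtered: bool) -> bytes:
    """Return the outgoing frame, substituting the source MAC in filtered mode."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    if len(external_mac) != 6:
        raise ValueError("a MAC address is 6 bytes")
    if not filtered:
        return frame                 # unfiltered mode: virtual MAC goes out unchanged
    # Bytes 0-5 are the destination, bytes 6-11 the source address.
    return frame[:6] + external_mac + frame[12:]
```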
【０１１４】
仮想LANプロキシ340が着信ARPパケットを受け取る場合、それは、パケットを処理し、正しい送付先にパケットを中継する仮想LANサーバー335にパケットを移動させる。ARPのパケットが同報通信パケットである場合、そのパケットは、仮想LAN上のすべての内部ノードに中継される。そのパケットがユニキャストパケットである場合、そのパケットは送付先ノードにのみ送られる。外部ネットワーク125がフィルタ処理済みモードで動作しているときにそのARPパケット中のIPアドレスによって、あるいはそのARPパケット(そのMACアドレスでないものがそのARPパケットである)のイーサネットヘッダ中のMACアドレスによって、送付先ノードは決定される。
【０１１５】
（物理的なLANドライバ）
ある実施態様の下では、外部ネットワーク125への接続は、制御ノードに接続されたGigabit か100/10 base-Tイーサネットリンク経由である。物理的なLANドライバ345には、そのようなリンクとインタフェースする責任がある。そのインタフェース上に送られているパケットは、ソケットバッファにそれらのパケットを入れることを含む通常の方法で、装置への待ち行列に入れられるであろう。パケットを待ち行列に入れるために使用される待ち行列は、その装置の転送ルーチンへのパケットを待ち行列に入れるためにプロトコルスタックによって使用されるものである。着信パケットについては、そのパケットを含んでいるソケットバッファは、通過され、また、パケットデータはコピーされないであろう(マルチキャスト動作に必要とされれば、クローンを作られるが)。これらの実施態様の下では、総括的なLinuxネットワークデバイスドライバが、変更のない制御ノードの中で使用されてもよい。これは、付加的なデバイスドライバ作業を要求せずに、プラットホームへの新しい機器の追加を容易にする。
【０１１６】
物理的なネットワークインターフェイス345は、仮想LANプロキシ340とだけの通信状態にある。これは、制御ノードが仮想LANの動作に干渉する何らかの方法で外部接続を使用するのを防ぎ、さらにユーザーデータのセキュリティおよび分離を改善する、つまり管理者はどのユーザーのパケットでも「においを嗅が」なくてよい。
【０１１７】
（負荷バランシングとフェイルオーバー）
いくつかの実施態様の下では、外部ネットワーク125への冗長な接続は、外部ネットワーク125への2つの冗長なインタフェース間のパケット伝送の負荷バランスを取るために交互に使用されるであろう。他の実施態様は、仮想インタフェースが2つの制御ノード間で平等に分配されるように制御ノードを交互にすることに関して個々の仮想ネットワークインターフェイスを構成することにより、負荷バランスを取る。別の実施態様は1つの制御ノードを通じて送信し、別の制御ノードを通じて受け取る。
【０１１８】
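In filtered mode, there is one externally visible MAC address to which external nodes send packets for a set of virtual network interfaces. If that adapter goes down, not only must the virtual network interfaces fail over to another control node, but the MAC address must also fail over so that external nodes can continue to send packets to the MAC address already in their ARP caches. Under one embodiment of the invention, when the failed control node recovers, a single MAC address is manipulated, and that MAC address need not be remapped on recovery.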
【０１１９】
本発明の別の実施態様の下では、負荷バランシングは、両方の制御ノード上の送信を許可することにより実行されるが、受信だけは一つを通してである。フェイルオーバーの場合は、同じ制御ノードを通じて送信と受信の両方を行う。回復の場合は、それがMAC操作を要求しないので、回復された制御ノードを通した送信である。
【０１２０】
受信を行う制御ノードは、フィルタリングのためのIP情報と、マルチキャストMAC構成のためのマルチキャストアドレス情報とを有する。この情報は着信パケットを処理するために必要で、万一、受信制御ノードが故障した場合には、フェイルオーバーされなければならない。送信制御ノードが故障した場合、仮想ネットワークドライバは、受信制御ノードにのみ送信パケットを送り始める必要がある。特別のフェイルオーバー処理は、その送信制御ノードが故障したという認識以外に必要ではない。故障した制御ノードが回復する場合、仮想ネットワークドライバは、付加的な特別の回復処理なしに、回復された制御ノードに送信パケットを送ることを再開し得る。受信制御ノードが故障すると、送信制御ノードは受信インタフェースの役割を引き受けなければならない。これを行うためにそれは、パケット受信を可能にするために、その物理的なインタフェース上のMACアドレスをすべて構成しなければならない。交互に、両方の制御ノードは、それらのインタフェース上で構成された同じMACアドレスを有することができるはずであるが、受信は、制御ノードはパケットを受け取るための準備ができるまで、デバイスドライバによるイーサネット装置上で物理的に不可能にされるはずである。次にフェイルオーバーが単純にその装置上での受信を可能にするだろう。
【０１２１】
任意のプロセッサがマルチキャストグループに参加すると、そのインタフェースはマルチキャストMACアドレスで構成しなければならないので、フェイルオーバーがそのプロセッサに対してトランスペアレントになるように、マルチキャスト情報が制御ノード間で共有されなければならない。仮想ネットワークドライバは、とにかくマルチキャストグループメンバーシップの追跡をしなければならないので、必要に応じて、この情報は常に仮想LANサーバー経由でLANプロキシに利用可能になる。したがって、受信フェイルオーバーは、ローカルマルチキャストグループメンバーシップ表を再構築するために、仮想ネットワークドライバから尋ねられているマルチキャストグループメンバーシップに帰結するであろう。この動作は、低いオーバヘッドであり、フェイルオーバーと回復の間を除いて特別の処理を要せず、また制御ノード間のデータの何らかの特別な模写も要しない。受信がフェイルオーバーしており、さらに故障した制御ノードが回復すると、送信だけが、回復された制御ノードに譲られるであろう。したがって、仮想ネットワークインターフェイス上の回復用のアルゴリズムは、回復された制御ノードに送信を常に移動させて、それが存在するところに受信処理を任せる。
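The transmit/receive role handling in the split-role embodiment can be sketched, purely as an illustration, with a small state holder. The class and attribute names are hypothetical, and the sketch assumes exactly two control nodes that do not fail at the same time.

```python
class ControlNodePair:
    """Tracks which of two control nodes transmits and which receives."""

    def __init__(self, tx, rx):
        self.nodes = (tx, rx)
        self.tx, self.rx = tx, rx

    def _other(self, node):
        a, b = self.nodes
        return b if node == a else a

    def fail(self, node):
        # The surviving node takes over whichever roles the failed node held.
        survivor = self._other(node)
        if self.tx == node:
            self.tx = survivor
        if self.rx == node:
            self.rx = survivor

    def recover(self, node):
        # Transmit always moves to the recovered node; receive stays where it is,
        # so no MAC address manipulation is needed on recovery.
        self.tx = node
```

For example, starting with transmit on node "a" and receive on node "b", a failure of "b" leaves both roles on "a", and a later recovery of "b" moves only transmit back to "b".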
【０１２２】
仮想サービスクラスタはさらに負荷バランシングとフェイルオーバーを使用する場合がある。
【０１２３】
（マルチキャビネットプラットホーム）
いくつかの実施態様は、より大きなプラットホームを形成するためにキャビネットが相互に接続されることを可能にしている。各キャビネットは、キャビネット間接続に使用される少なくとも1つの制御ノードを有するであろう。各制御ノードは、ローカルの接続およびトラフィックを処理するために、仮想LANサーバー335を含むであろう。サーバーの内の1つは、仮想LANのための外部接続を有する制御ノードに設置されたもの等のマスターであるように構成される。それらのキャビネットのローカルプロセッサが参加できるように、別の仮想LANサーバーは、プロキシサーバーあるいはスレーブの役割を果たすであろう。そのマスターは、仮想LAN状態および制御をすべて保全し、一方プロキシはプロセッサとマスターの間でパケットを中継する。
【０１２４】
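Each virtual LAN server proxy maintains an RVI to each master virtual LAN server. Each local processor connects to the virtual LAN server proxy as if it were the master. When a processor connects and registers its IP and MAC addresses, the proxy registers those IP and MAC addresses with the master. This causes the master to bind the addresses to the RVI from the proxy. Thus the master contains RVI bindings for all internal nodes, whereas a proxy contains bindings only for the nodes in its own cabinet.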
【０１２５】
マルチキャビネット仮想LANのある場所にあるプロセッサがその仮想LANサーバーに任意のパケットを送ると、そのパケットは処理のためにそのマスターに中継されるであろう。次に、マスターは、パケットに関する通常の処理を行うであろう。マルチキャストと同報通信に必要なこととして、マスターは、プロキシにパケットを中継するであろう。マスターは、さらに、プロキシ上のユニキャストパケットの送付先IPアドレスおよび登録済みのIPアドレスに基づいたユニキャストパケットを中継するであろう。マスターにおいては、プロキシ接続は、多くの構成されたIPアドレスを有するノードに非常によく似ていることに注意すること。
【０１２６】
（ネットワーク管理ロジック）
ブートするかカーネルデバッギングのような処理ノード上で動作するオペレーティングシステムがない間、そのノードのシリアルコンソールトラフィックおよびブートイメージは、その処理ノードのカーネルデバッギングソフトウェア、あるいはBIOSにあるスイッチドライバコードによって、制御ノード(図示されていない)上で動作している管理ソフトウェアに送られる。そこから、コンソールトラフィックは、再び、高速の外部ネットワーク125から、あるいは制御ノードの管理ポートを通ってアクセスされ得る。ブートイメージ要求は、その制御ノードのローカルディスクから、あるいは外部SAN 130上のパーティションから満たされ得る。好ましくは、処理ノードに何かが行われ得る前に、制御ノード120がブートされ、普通に動作している。制御ノードは、その管理ポートからそれ自体、ブートされるか、またはデバッグされる。
【０１２７】
何人かの顧客は、必要な時に、現場コンピュータに彼らの管理ポートを差し込むことにより、コントローラのブートとデバッギングを、ローカルなアクセスのみに制限したいことがある。他のものは、彼等の管理ポートに蓋をすることによってインターネットから適切に絶縁される、経営目的のための安全なネットワークセグメントを確立することにより、リモートブートおよびデバッギングを可能にすることを選択するかもしれない。一旦制御装置がブートされ普通に動作していれば、制御装置のため、およびそのプラットホームの残りのためのすべての他の管理機能は、管理者によって許可された場合には管理ポートからも高速外部ネットワーク125からもアクセスされ得る。
【０１２８】
各処理ノード105との間のシリアルコンソールトラフィックは、スイッチ構造115上のオペレーティングシステムカーネルドライバによって、制御ノード120上で動作している管理ソフトウェアに送られる。そこから、任意のノードのコンソールトラフィックは、通常の高速外部ネットワーク125、あるいは制御ノードの管理ポートのどちらかを通ってアクセスされ得る。
【０１２９】
（記憶装置アーキテクチャー）
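One embodiment follows the SCSI model of storage. Each virtual PAN has its own virtual I/O space and issues SCSI commands and status within that space. Logic in the control node translates or transforms addresses and commands from the PAN as needed and sends them to the SAN 130, which services the commands. From the SAN's perspective, the client is platform 100, and the actual PAN that issued the command is hidden and anonymous. Because the SAN address space is virtualized, one PAN running on platform 100 may have devices numbered starting from device number 1, and another PAN may also have a device number 1. Each device number 1, however, corresponds to a different and unique portion of SAN storage.
【０１３０】
Under preferred embodiments, an administrator can create virtual storage. Each PAN has its own independent view of mass storage. Thus, as described below, a first PAN may have a given device/LUN address map to a first location in the SAN, and another PAN may have the same device/LUN address map to a second, different location in the SAN. Each processor maps device/LUN addresses to major and minor device numbers to identify, for example, disks and partitions. The major and minor device numbers are perceived as physical addresses by the PAN and by the processors within the PAN, but in fact the platform treats them as virtual addresses into the mass storage provided by the SAN. That is, each processor's major and minor device numbers are mapped to corresponding SAN locations.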
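As a minimal sketch of the per-PAN address virtualization, the mapping might be held in a table keyed by PAN, target, and LUN. The class name, the string form of a SAN location, and the rule that a SAN location is allocated to only one mapping are assumptions made for this example.

```python
class StorageMap:
    """Maps (PAN, SCSI target, LUN) to a location in SAN storage."""

    def __init__(self):
        self._map = {}

    def assign(self, pan, target, lun, san_location):
        # Assumption: each portion of SAN storage backs at most one virtual device.
        if san_location in self._map.values():
            raise ValueError("SAN location already allocated")
        self._map[(pan, target, lun)] = san_location

    def resolve(self, pan, target, lun):
        # A device outside the PAN's configuration is simply not visible to it.
        try:
            return self._map[(pan, target, lun)]
        except KeyError:
            raise PermissionError("device not configured for this PAN") from None
```

Two PANs can therefore both address target 1, LUN 0 while reaching different SAN locations, which is the behavior described for "device number 1" above.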
【０１３１】
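FIG. 6 illustrates the software components used to implement the storage architecture of one embodiment. A configuration component 605, typically executed on a control node 120, communicates with the external SAN 130. A management interface component 610 provides an interface to the configuration component 605 and communicates with the IP network 125 and thus with the remote management logic 135 (see FIG. 1). Each processor 106 of system 100 includes an instance of processor-side storage logic 620. Each such instance 620 communicates via two RVI connections 625 with a corresponding instance of control node-side storage logic 615.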
【０１３２】
要するに、構成基本要素605およびインタフェース610は、プラットホーム100に分配されるSAN記憶装置の部分を発見すること、さらに特定のPANあるいはプロセッサ106に一部を細分割当てすることを管理者に許可することに責任がある。さらに、記憶装置配置ロジック605には、制御ノード側ロジック615にSAN記憶域割り当てを伝える責任もある。プロセッサ側記憶ロジック620には、制御ノード側ロジック615に専用RVI 625経由でプロセッサの内部相互結線110および記憶構造115上の記憶装置要求を伝える責任がある。その要求は、ある実施態様の下で、仮想記憶アドレスおよびSCSIコマンドを含むであろう。制御ノード側ロジックには、SANに対する対応実アドレスを識別し、コマンドとプロトコルを、例えば、ファイバーチャネル(iSCSIを備えたGigabitイーサネットは別の典型的な連結である)を含むが、限定はされない、SAN用の適切な形式に変換することにより、そのようなコマンドを受け取り処理する責任がある。
【０１３３】
（構成基本要素）
構成基本要素605は、SAN 130の中のどの要素が各個別のプロセッサ106に見えるかを決定する。それは、そのプロセッサが使用する装置番号(例えばSCSIターゲットおよびLUN)を、それらの付属のSCSIおよびファイバーチャンネルI/Oインタフェース128を通じて制御ノードに見える装置番号に変換するマッピング機能を提供する。さらに、それは、プロセッサが制御ノードに取り付けられるが、プロセッサの構成には含まれていない外部記憶装置にアクセスしないようにするアクセス管理機能を提供する。プロセッサに対して(およびそのプロセッサ上のシステム管理者およびアプリケーション/ユーザーに対して)提示されるモデルは、それを各プロセッサがそのプロセッサ上のインタフェースに取り付けられたそれ自身の大容量記憶装置を有するかのように見せる。
【０１３４】
とりわけ、この機能は、プロセッサ106上のソフトウェアが別のプロセッサに容易に移動され得るようにする。例えば、ある実施態様では、ソフトウェア(物理的な再ケーブル接続なしに)経由の制御ノードは、新しいプロセッサが必要な装置にアクセスできるようにするためにPAN構成を変更する場合がある。したがって、新しいプロセッサは別の記憶装置の個性を継承させられることがある。
【０１３５】
ある実施態様では制御ノードがSAN上のホストとして出現するが、代替の実施態様ではプロセッサがそのように動作することを可能にする。
【０１３６】
上に概説されるように、構成ロジックは、プラットホーム100に分配されたSAN記憶装置を発見し(例えばプラットホームブート中に)、また、このプールは管理者によって後に割り付けられる。発見があとで起動される場合、その発見動作を行なう制御ノードは新しい見え方を先の見え方と比較する。新しく利用可能な記憶装置は、管理者によって割り付けられ得る記憶装置のプールに加えられる。それを見えなくさせるパーティションは割り当てられずに、PANに分配され得る記憶装置の利用可能なプールから取り除かれる。それを見えなくさせるパーティションは、トリガーエラーメッセージを割り当てられるであろう。
【０１３７】
（管理インタフェース基本要素）
構成基本要素605は、管理ソフトウェアが、制御ノード120に見える装置と、個別のプロセッサ106に見える仮想装置との間の装置マッピングについて記述する情報にアクセスし、更新することを可能にする。さらにそれは制御情報へのアクセスを可能にする。割当てはシミュレートされたSCSIディスク識別に関連する処理ノードによって、例えばシミュレートされたコントローラ、ケーブル、ユニットあるいは論理装置番号(LUN)の名前によって、識別され得る。
【０１３８】
ある実施態様の下では、インタフェース基本要素610は、下記のような情報と統計を集めてモニターするように、その構成基本要素と協力する:
実行された入出力操作の総数
転送されたバイトの総数
実行された読み取り動作の総数
実行された書込み操作の総数
進行中だった入出力時間の総量
【０１３９】
（プロセッサ側記憶ロジック）
プロトコルのプロセッサ側ロジック620は、プロセッサ106上のオペレーティングシステムに、低レベルの仮想インタフェースを提供することにより、SCSIサブシステムをエミュレートするホストアダプタモジュールとして、実装される。プロセッサ106は、処理のために制御ノード120にSCSI I/Oコマンドを送るために、この仮想インタフェースを使用する。
【０１４０】
冗長な制御ノード120を使用する実施態様の下では、処理ノード105は、制御ノード120当たりロジック620の1つのインスタンスを含むであろう。ある実施態様の下では、プロセッサは論理的ではなく、物理的な装置番号付けを使用して、記憶装置を参照する。つまり、LUN、SCSIターゲット、チャネル、ホストアダプタおよび制御ノード120(例えばノード120aあるいは120b)を識別するために、アドレスは装置名として指定される。図8に示されるように、ある実施態様はホストアダプタ(H)、チャネル(C)、マップされたターゲット(mT)、およびマップされたLUN(mL)にターゲット(T)およびLUN(L)をマップする。
【０１４１】
図7は、プロセッサ側ロジック720用の典型的なアーキテクチャーを示す。ロジック720は、機器タイプに特有のドライバ(例えばディスクドライバ)705、中間レベルのSCSI I/Oドライバ710、とラッパーおよび相互結線ロジック715を含んでいる。
【０１４２】
機器タイプ専用のドライバ705は、オペレーティングシステムが用意されており、特定の機器タイプに関連付けられた在来型ドライバである。
【０１４３】
中間のレベルのSCSI I/Oドライバ710は、一旦その装置がSCSI機器であるとドライバ705が決定すれば機器タイプに専用のドライバ705によって呼び出される在来型の中間レベルドライバである。
【０１４４】
ラッパーおよび相互結線ロジック715は、中間レベルのSCSI I/Oドライバ710によって呼び出される。このロジックはSCSIサブシステムインタフェースを提供し、それによりSCSIサブシステムをエミュレートする。Giganet構造を使用するある実施態様では、ロジック715は、必要なときにSCSIコマンドを覆い隠すこと、および上述のように、NICに、制御ノードへの専用RVI経由で制御ノードにパケットを送らせるように、GiganetおよびRCLANインタフェースと対話することに責任がある。Giganetパケット用のヘッダー情報はこれが記憶装置パケットで、下に文脈で記述されて、他の情報を含んでいることを示すように修正される。図7に図示されなかったが、ラッパーロジック715は、冗長な相互結線110および構造115をサポートして利用するためにRCLANレイヤーを使用することがある。
【０１４５】
Giganet構造115を使用する実施態様については、接続725のRVIは、1024の利用可能なVIの範囲から仮想インタフェース(VI)番号を割り当てられる。通信すべき2つの端点については、スイッチ115は、ペアである(制御ノードスイッチポート、制御ノードVI番号)と、(プロセッサノード105スイッチポート、プロセッサノードVI番号)との間の双方向経路と共にプログラムされる。
【０１４６】
個別のRVIは、一方の方向に送信される各タイプのメッセージに使用される。したがって、プロトコルの反対側から送信することができるメッセージ用の各RVI上に未決の受信バッファが常にある。さらに、1つのタイプのメッセージだけが、各RVI上の一方の方向に送信されるので、RVIチャネルの各々にポストされた受信バッファは、プロトコルがその種のメッセージに使用する最大のメッセージ長に対して、適切に大きさを合わされ得る。他の実施態様の下では、すべての可能なメッセージタイプは2つのVIを使用するのではなく、単一のRVI上に多重化される。プロトコルとメッセージ形式は、特に2つのRVIの使用を要求せず、また、それらを多重分離することができるように、メッセージはそれ自体、それらのヘッダにメッセージタイプ情報を有する。
【０１４７】
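One of the two channels is used to exchange SCSI command (CMD) and status (STAT) messages. The other channel is used to exchange buffer (BUF) and transfer (TRAN) messages. This channel is used to handle the data payload of SCSI commands.
【０１４８】
CMD messages contain control information, the SCSI command to be executed, and the virtual addresses and sizes of I/O buffers in node 105. STAT messages contain control information and a completion status code reflecting any error that may have occurred while processing the SCSI command. BUF messages contain control information and the virtual addresses and sizes of I/O buffers in the control node 120. TRAN messages contain control information and are used to confirm successful transfer of data from node 105 to the control node 120.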
【０１４９】
プロセッサ側ラッパーロジック715は、コマンドがデータの転送を要求するかどうか、そしてそうであるならばどの方向であるかを決定するために送られるSCSIコマンドを検査する。分析によって、ラッパーロジック715は、それに応じてメッセージヘッダーに適切なフラグ情報を設定する。制御ノード側ロジックを説明するセクションは、フラグ情報がどのように利用されるかを説明する。
【０１５０】
本発明のある実施態様の下では、プロセッサ側記憶装置ロジック720と制御ノード側記憶装置ロジック715との間のリンク725は、SCSIプロトコルの一部ではなくまたSAN 130に通信されることではなくて、制御メッセージを伝達するために使用されることがある。代わりに、これらの制御メッセージは制御ノード側ロジック715によって処理されることになっている。
【０１５１】
プロトコル制御メッセージは、プロトコルのプロセッサ側に常に生成され、制御ノード側記憶装置ロジック715にプロセッサ側ロジック720を接続する、2つの仮想インタフェース(VI)の内の1つにあるプロトコルの制御ノード側に送られる。プロトコルコントロール操作に使用されたメッセージヘッダーは、プロトコル制御メッセージとしてメッセージを識別するために、異なるフラグビットが使用される以外は、コマンドメッセージヘッダーと同じである。制御ノード120は要求された動作を実行し、ステータスメッセージによって使用されるものと同じであるメッセージヘッダーを有するRVIに対応する。この方法では、まれに使用されるプロトコルコントロール操作用の個別のRVIは、必要ではない。
【０１５２】
冗長な制御ノードを使用するある実施態様の下では、プロセッサ側ロジック720は、出されたコマンドからあるエラーを検知し、さらにそれに応じて別の制御ノードへのコマンドを、再度出す。この再試行は中間のレベルのドライバ710中で実装されるであろう。
【０１５３】
（制御ノード側記憶装置ロジック）
ある実施態様の下では、制御ノード側記憶装置ロジック715は、デバイスドライバモジュールとして実装される。ロジック715は、制御ノード120上のオペレーティングシステムに装置レベルインタフェースを提供する。この装置レベルインタフェースは、構成基本要素705にアクセスするためにも使用される。このデバイスドライバモジュールが初期化されると、それは、プラットホーム100のすべてのプロセッサ106からのプロトコルメッセージに応答する。すべての構成活動は、装置レベルインタフェースを通って導入される。すべてのI/O活動は、相互結線110およびスイッチ構造115を通じて送信されて、受け取られるメッセージを通じて導入される。制御ノード120においては、プロセッサノード105(それは単に図7の１つのボックスとして示されるが)当たりロジック715の1つのインスタンスがあるであろう。ある実施態様の下では、制御ノード側ロジック715は、FCPかFCP-2プロトコル、あるいはiSCSI、あるいは様々な媒体に亘ってSCSI-2またはSCSI-3コマンドセットを使用する他のプロトコル経由で、SAN 130と通信する。
【０１５４】
上記のように、プロセッサ側ロジックは、データの流れがコマンドを関連づけるかどうか、そして、そうならば、どの方向であるかを示すRVIメッセージヘッダーにおけるフラグを設定する。制御ノード側記憶装置ロジック715は、プロセッサ側ロジックからメッセージを受け取り、次に、動作する方法を決定するための、例えば、バッファ等を割り付けるためにヘッダー情報を分析する。さらにそのロジックは、そのプロセッサからのメッセージに含まれているアドレス情報を、対応し、マップされたSANアドレスに変換し、さらにSAN 130にコマンドを出す(例えばFCPあるいはFCP-2経由で)。
【０１５５】
SCSIデータ転送段階を要求しないTEST UNIT READYコマンドのようなSCSIコマンドは、コマンドメッセージに使用されるRVI上に単一のコマンドを送るプロセッサ側ロジック720によって、および同じRVI上に単一のステータスメッセージを送り返す制御ノード側ロジックによって処理される。より具体的には、そのプロトコルのプロセッサ側は、標準メッセージヘッダー、このコマンドのための新しいシーケンス番号、希望のSCSIターゲットおよびLUN、実行されるべきSCSIコマンド、およびゼロのリストサイズを有するメッセージを構築する。そのロジックの制御ノード側は、メッセージを受け取り、SCSIコマンド情報を抽出し、インタフェース128経由でSAN 130にそれを伝達する。制御ノードはコマンド完了コールバックを受け取った後、標準メッセージヘッダー、このコマンドのためのシーケンス番号、完了したコマンドのステータス、および、コマンドがチェック条件ステータスで完成した場合、任意に要求センス・データを使用して、それは、プロセッサへのステータスメッセージを構築する。
【０１５６】
ホストメモリへSCSI機器からデータを転送することをSCSIデータ転送段階に要求するREADコマンドのようなSCSIコマンドは、制御ノード側のロジック715、およびプロセッサノード105Tのメモリへの1つ以上のRDMA WRITE動作、および制御ノードからの単一のステータスメッセージとで応答する制御ノードに、コマンドメッセージを送信するプロセッサ側のロジックによって、処理される。より具体的には、プロセッサ側ロジック720は、標準メッセージヘッダー、このコマンドのための新しいシーケンス番号、希望のSCSIターゲットおよびLUN、実行されるべきSCSIコマンド、そしてコマンドからのデータが格納されることになっているメモリの領域のリストを有するコマンドメッセージを構築する。制御ノード側ロジック715は、SCSIコマンドが制御ノード上でタスクを実行している間、SCSI動作からのデータを格納するために一時記憶装置バッファを割り付ける。制御ノード側ロジック715が処理のためにSAN 130にSCSIコマンドを送り、コマンドが完了した後、それは、1つ以上のRDMA WRITE動作のシーケンスを有するプロセッサ105メモリにデータを送り返す。次に、それは、標準メッセージヘッダー、このコマンドのためのシーケンス番号、完了したコマンドのステータス、および、そのコマンドがSCSI CHECK CONDITIONステータスで完成した場合、任意にREQUEST SENSEデータを有するステータスメッセージを構築する。
【０１５７】
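A SCSI command such as the WRITE command, which requires the SCSI data transfer phase to move data from host memory to a SCSI device, is handled by the processor-side logic 720 sending a single command message to the control node-side logic 715, one or more BUF messages from the control node-side logic 715 to the processor-side logic, one or more RDMA WRITE operations from the processor-side storage logic into memory in the control node, one or more TRAN messages from the processor-side logic to the control node-side logic, and a single status message from the control node-side logic back to the processor-side logic. The use of BUF messages to tell the processor-side storage logic the location of the temporary buffer memory in the control node, and the use of TRAN messages to indicate completion of the RDMA WRITE data transfers, is due to the lack of RDMA READ capability in the underlying Giganet fabric. If the underlying fabric supports RDMA READ operations, a different sequence of corresponding actions would be used. More specifically, the processor-side logic 720 constructs a CMD message with a standard message header, a new sequence number for this command, the desired SCSI target and LUN, and the SCSI command to be executed. The control node-side logic 715 allocates temporary storage buffers to hold the data from the SCSI operation while the SCSI command executes on the control node. The control node side of the protocol then constructs a BUF message with a standard message header, the sequence number for this command, and a list of the regions of virtual memory used for the temporary storage buffers on the control node. The processor-side logic 720 then sends the data to control node memory with a sequence of one or more RDMA WRITE operations. It then constructs a TRAN message with a standard message header and the sequence number for this command. The control node-side logic sends the SCSI command to the SAN 130 for processing and, after receiving the command completion, constructs a STAT message with a standard message header, the sequence number for this command, the status of the completed command, and optionally REQUEST SENSE data if the command completed with CHECK CONDITION status.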
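The message ordering for a WRITE can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Message` type and `write_exchange` function are hypothetical, the RDMA WRITE is stood in for by a byte copy, one BUF/TRAN pair is exchanged per region, and the command is assumed to complete with GOOD status.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    kind: str            # "CMD", "BUF", "TRAN" or "STAT"
    seq: int             # sequence number shared by all messages of one command
    payload: tuple = ()


def write_exchange(seq, regions, data):
    """regions: sizes of the buffers to fill, in order; data: the bytes to write."""
    if sum(regions) != len(data):
        raise ValueError("regions must cover the data exactly")
    trace = [Message("CMD", seq, tuple(regions))]           # processor -> control node
    control_memory = bytearray()
    offset = 0
    for index, size in enumerate(regions):
        trace.append(Message("BUF", seq, (index, size)))    # control node -> processor
        control_memory += data[offset:offset + size]        # stands in for RDMA WRITE
        offset += size
        trace.append(Message("TRAN", seq, (index,)))        # processor -> control node
    trace.append(Message("STAT", seq, ("GOOD",)))           # control node -> processor
    return trace, bytes(control_memory)
```

The TRAN step exists in this exchange only because the processor pushes the data; with RDMA READ support the control node could pull it instead, as noted above.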
【０１５８】
いくつかの実施態様の下では、CMDメッセージは、そのコマンド用のデータが格納される仮想メモリの領域のリストを含んでいる。BUFとTRANのメッセージは、CMDメッセージにおける領域リストの各エントリへプロトコルの制御ノード側が個別のBUFメッセージを送れるようにするインデックスフィールドも含んでいる。プロトコルのプロセッサ側は、データ転送の単一のセグメントの完了を示すためのTRANメッセージを後に続けて、BUFメッセージに記述されたデータの量に対するRDMA WRITE動作の実行により、そのようなメッセージに応答するであろう。
【０１５９】
プロセッサ側ロジック720と制御ノード側ロジック715との間のプロトコルは、scatter-gather 入出力操作を可能にする。この機能は、入出力要求に含まれるデータが、仮想または物理あるいはその両方のメモリーのいくつかの別個の領域から読まれるかあるいは書かれ得るようにする。これは、複数の隣接していないバッファが、制御ノード上でその要求に使用され得るようにする。
【０１６０】
上記のように、構成ロジック705は、プラットホームに分配されたSAN記憶装置を発見すること、および管理者が特定のPANに記憶装置を細分割当てするように、インタフェースロジック710と対話することに責任がある。この割当ての一部として、構成基本要素705は、プロセッサアドレスおよび実際のSANアドレスとの一致を識別する情報を含んでいる記憶データ構造915を作成し保全する。図7はそのような構造を示す。その一致は、その処理ノードと、例えばシミュレートされたコントローラ、ケーブル、ユニットあるいは論理装置番号(LUN)の名前によって、シミュレートされたSCSIディスク識別表示との間であるであろう。
【０１６１】
（管理ロジック）
管理ロジック135はPANを供給するための制御ノードソフトウェアにインタフェースするために使用される。とりわけ、ロジック135は、管理者がPANの仮想ネットワークトポロジー、外部ネットワークへのその可視性を確立することを可能にし(例えばサービスクラスタとして)、そしてPAN上の装置のタイプ、例えば、ブリッジと経路設定を確立することを可能にする。
【０１６２】
管理者が最初の割当ての間に、あるいはその後にPANのための記憶装置を定義してもよいように、ロジック135は記憶管理インタフェースロジック710とのインタフェースも行う。その構成定義は上に議論された記憶装置の一致(SANに対するSCSI)、およびアクセス管理許可を含んでいる。
【０１６３】
上記のように、各PANおよび各プロセッサは、その仮想ネットワーキング(仮想MACアドレスを含めて)および仮想記憶によって定義された個性を有するであろう。プロセッサクラスタリングを実装するために、下記のように、そのような個性を記録する構造は管理ロジックによってアクセスされるであろう。さらにそれらは、上記のような管理者によってまたはエージェント管理者と共にアクセスされるであろう。例えば、エージェントは時刻または年のようなある出来事に応じて、あるいはそのシステム上のある負荷に応じてPANを再構成するために使用されるであろう。
【０１６４】
プロセッサのオペレーティングシステムソフトウェアは、制御ノード上で動作する管理ソフトウェアへGiganetスイッチ115を通ってノード用のコンソールI/Oトラフィックを送るためにシリアルコンソールドライバコードを含んでいる。そこから、管理ソフトウェアは、任意のノードのコンソールI/Oストリームを制御ノードの管理ポート(その低速イーサネットポートおよびその緊急時管理ポート)経由で、あるいは高速外部ネットワーク125経由で、管理者による許可に従いアクセス可能にすることができる。コンソールトラフィックは監査と履歴の目的で記録され得る。
【０１６５】
（クラスタ管理ロジック）
図9は、ある実施態様のクラスタ管理ロジックを図示する。クラスタ管理ロジック905は、PANのネットワークトポロジー、PAN内のMACアドレス割当てなどのように、上記ネットワークの情報を記録するデータ構造910にアクセスする。さらに、クラスタ管理ロジック905は、様々なプロセッサ106の記憶装置対応を記録するデータ構造915にもアクセスする。さらに、クラスタ管理ロジック905は、プラットホーム100内の割り付けられていないプロセッサのようなフリーリソースを記録するデータ構造920にもアクセスする。
【０１６６】
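In response to a processor error event or an administrator command, the cluster management logic 905 can modify the data structures to "move" the storage and network personality of a given processor to a new processor. In this way, the new processor "inherits" the personality of the former one. The cluster management logic 905 may do this to swap a new processor into a PAN in place of a failing one.
【０１６７】
The new processor inherits the MAC address of the former processor and behaves like the former one. When the new processor boots, the control node communicates connectivity information and updates the connectivity information for the non-failing processors as needed. For example, in one embodiment, the RVI connections for the other processors are updated transparently; that is, the software on the other processors need not be involved in establishing connectivity to the newly swapped-in processor. Moreover, the new processor inherits the storage correspondence of the former one and consequently inherits the persisted state of the former processor.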
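By way of example only, moving a personality between processors amounts to re-keying the network, storage, and free-resource records. In the sketch below the three arguments loosely stand in for data structures 910, 915, and 920; their concrete shapes (dictionaries and a list) and the function name are assumptions.

```python
def replace_processor(network, storage, free, failed):
    """Move the personality of `failed` to a processor taken from the free pool.

    network: processor -> virtual MAC address
    storage: processor -> storage correspondence
    free:    list of unallocated processors
    """
    if failed not in network or failed not in storage:
        raise KeyError("unknown processor")
    if not free:
        raise RuntimeError("no free processor available")
    new = free.pop(0)
    network[new] = network.pop(failed)   # new processor inherits the virtual MAC
    storage[new] = storage.pop(failed)   # and the storage correspondence
    return new
```

Because only the records change, nothing needs to be re-cabled, and the failed processor is left out of both tables.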
【０１６８】
他の利点の中でこれは、プロセッサを含むリソースのフリープールが所与のPANに亘ってではなく全プラットホームに亘って共有されることを可能にする。このようにフリーリソース(システムの信頼性およびフォルトトレランスを改善するためにそういうものとして維持されるであろう)はより効率的に使用されるであろう。
【０１６９】
新しいプロセッサが「交換される」時、それは、IPアドレスとMACアドレスとの関連付けを学習するために、再度ARPすることが、必要であろう。
【０１７０】
（選択肢）
スイッチ構造115の各Giganetポートが、その上の1024の同時的仮想インタフェース接続をサポートすることができ、ハードウェアプロテクションでそれらを相互に離れているようにしておくことができるので、オペレーティングシステムは、アプリケーションプログラムとノードのGiganetポートを、安全に共有することができる。これは、ドライバコードの全スタックを通じて動作する必要なしに、アプリケーションプログラムの直接の接続を可能にするであろう。これを行うと、オペレーティングシステムコールは、仮想インタフェースチャネルを確立し、アプリケーションアドレス空間へそのバッファと待ち行列をメモリマップするであろう。さらに、そのチャネルにインタフェースする低レベルの詳細をカプセル化しているライブラリは、そのような仮想のインタフェース接続の使用を容易にするであろう。そのライブラリは、自動的に冗長な仮想のインタフェースチャネルペアを確立し、呼出しアプリケーションからの任意の努力あるいは意識も要求せずに、それらの間の共有化や故障を管理することができるであろう。
【０１７１】
上記の実施態様は、ATMのような構造に関して内部的にイーサネットをエミュレートした。本設計は、アーキテクチャーの多くを単純化する内部イーサネット構造を使用するために変更されることがあり、例えば、エミュレーション機能に対するニーズの除去である。外部ネットワークがATMに応じて通信すると、別のバリエーションはイーサネットのエミュレーションなしでATMを内部に使用するであろうし、また、ATMは、そのようにアドレスされると、外部ネットワークに外部的に通信することができるであろう。別のバリエーションは、プラットホーム(つまりイーサネットのエミュレーションのない)に対して内部的に、ATMを可能にするだろうし、また、外部通信だけがイーサネットに変換される。これは内部通信を合理化するだろうが、コントローラでエミュレーションロジックを要求するだろう。
【０１７２】
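One embodiment deploys PANs based on software configuration commands. It will be appreciated that deployment may also be under programmatic control. For example, more processors may be deployed under software control during peak hours of operation for a PAN, or more or less storage space may be associated with that PAN under the control of a software algorithm.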
【０１７３】
本発明の範囲が上記の実施態様に制限されることがないが、追加された請求項によって規定されること;およびこれらの請求項は記述されたものの変更および改良を包含するであろうことは、認識されるであろう。
【図面の簡単な説明】
【０１７４】
【図１】本発明のある実施態様を図示する系統図である。
【図２Ａ】本発明のある実施態様に応じて確立された通信リンクを図示するダイアグラムである。
【図２Ｂ】本発明のある実施態様に応じて確立された通信リンクを図示するダイアグラムである。
【図２Ｃ】本発明のある実施態様に応じて確立された通信リンクを図示するダイアグラムである。
【図３Ａ】本発明のある実施態様のネットワークソフトウェアアーキテクチャを図示するダイアグラムである。
【図３Ｂ】本発明のある実施態様のネットワークソフトウェアアーキテクチャを図示するダイアグラムである。
【図４Ａ】本発明のある実施態様に応じてドライバロジックを図示する流れ図である。
【図４Ｂ】本発明のある実施態様に応じてドライバロジックを図示する流れ図である。
【図４Ｃ】本発明のある実施態様に応じてドライバロジックを図示する流れ図である。
【図５】本発明のある実施態様に応じてサービスクラスタを図示している。
【図６】本発明のある実施態様の記憶装置ソフトウェアアーキテクチャを図示している。
【図７】本発明のある実施態様のプロセッサ側の記憶装置ロジックを図示している。
【図８】本発明のある実施態様の記憶装置アドレスマッピングロジックを図示している。
【図９】本発明のある実施態様のクラスタ管理ロジックを図示している。
【符号の説明】
【０１７５】
105 処理ノード
115 スイッチ機構
120 制御ノード
135 管理ロジック
206 スイッチ
208 スイッチ
210 プロセッサロジック
214 仮想スイッチロジック
305 IPスタック
310 仮想ネットワークドライバ
320 Giganetドライバ
325 Giganetドライバ
335 仮想LANサーバー
340 仮想LANプロキシ
350 ARPロジック、
505 クラスタ
605 記憶装置構成ロジック
610 管理インタフェース、
615 制御ノード側記憶装置ロジック
905 クラスタ管理ロジック、
910 ネットワークデータ構造
915 記憶装置データ構造
920 フリーリソースデータ構造
[Technical field]
[0001]
The present invention relates to a computer system for enterprises and application service providers, and more particularly to a processing system having a virtual communication network.
[Background]
[0002]
In today's enterprise computing and application service provider environments, personnel from multiple information technology (IT) functions (electrical, networking, etc.) must participate to deploy processing and networking resources. Consequently, because of scheduling and other difficulties in coordinating activities across multiple departments, deploying a new computer server can take weeks or months. This lengthy, manual process increases both human and equipment costs and delays the launch of applications.
[0003]
In addition, because it is difficult to predict how much processing power an application will require, administrators tend to over-provision computing capacity. As a result, data center computing resources are often unused or under-utilized.
[0004]
If more processing power is eventually needed than was originally provisioned, the various IT functions must again coordinate activities such as deploying more or upgraded servers and connecting them to the communication and storage networks. This task becomes increasingly difficult as systems grow larger.
[0005]
Deployment is also problematic. For example, deploying 24 conventional servers may require more than 100 individual connections to configure the overall system. Managing these cables is an ongoing challenge, and each one represents a point of failure. Attempting to reduce the risk of failure by adding redundancy can double the cabling, exacerbating the problem while increasing complexity and cost.
[0006]
Providing high availability with today's technology is a difficult and costly proposition. Generally, a failover server must be deployed for every primary server. In addition, complex management software and professional services are usually required.
[0007]
Generally, it is not possible to adjust the processing power or upgrade the CPUs of an existing server. Instead, scaling processor capacity and/or migrating to a vendor's next-generation architecture often requires a “forklift upgrade,” meaning that more hardware/software systems are added, new connections are needed, and so on.
DISCLOSURE OF THE INVENTION
[Problems to be solved by the invention]
[0008]
Accordingly, there is a need for a system and method that provides an enterprise and ASP computing platform that addresses the above shortcomings.
[Means for Solving the Problems]
[0009]
The invention features a platform and method for computer processing in which a virtual processing area network is to be formed and deployed.
[0010]
In accordance with certain aspects of the present invention, a method and system for emulating a switched Ethernet local area network are provided. A plurality of computer processors, a switch fabric, and point-to-point links to the processors are provided. Virtual interface logic establishes virtual interfaces over the switch fabric and the point-to-point links. Each virtual interface defines a software communication path from one computer processor to another via the switch fabric. Ethernet driver emulation logic executes on at least two of the computer processors, and switch emulation logic executes on at least one of the computer processors. The switch emulation logic establishes a virtual interface between itself and each computer processor on which Ethernet driver emulation logic executes, to allow software communication between them. It also receives a message over one of these virtual interfaces from a computer processor running Ethernet driver emulation logic and, in response to addressing information associated with the message, transmits the message to another computer processor running Ethernet driver emulation logic. It also establishes a virtual interface between each computer processor running Ethernet driver emulation logic and every other computer processor running Ethernet driver emulation logic. For unicast traffic, the Ethernet driver emulation logic communicates with another computer processor in the emulated Ethernet network via the virtual interface that defines a software communication path to it if that virtual interface is operating satisfactorily, and via the switch emulation logic if it is not.
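The unicast path choice described in this aspect can be sketched as follows. The function name and the two lookup tables are hypothetical; the sketch assumes the driver already knows, per destination, the identifier of its direct virtual interface and whether that interface is currently operating satisfactorily.

```python
def unicast_route(direct_vis, vi_ok, src, dst):
    """Return ("direct", vi_id) or ("switch", None) for a unicast frame.

    direct_vis: (source node, destination node) -> virtual interface id
    vi_ok:      virtual interface id -> True if operating satisfactorily
    """
    vi_id = direct_vis.get((src, dst))
    if vi_id is not None and vi_ok.get(vi_id, False):
        return ("direct", vi_id)     # node-to-node path over the switch fabric
    return ("switch", None)          # fall back to the switch emulation logic
```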
[0011]
According to another aspect of the invention, a method and system for implementing the Address Resolution Protocol (ARP) are provided. A computing platform has a plurality of processors connected by an underlying physical network. Logic executable on one of the processors defines the topology of an Ethernet network to be emulated on the computing platform. The topology includes processor nodes and a switch node. Logic executable on one of the processors assigns a set of processors from the plurality to act as the processor nodes. Logic executable on one of the processors assigns a virtual MAC address to each processor node of the emulated Ethernet network. Logic executable on one of the processors allocates virtual interfaces over the underlying physical network to provide direct software communication from each processor node to every other processor node. Each virtual interface has a corresponding identification. Each processor node has ARP request logic to communicate an ARP request, which includes an IP address, to the switch node. The switch node includes ARP request broadcast logic to communicate the ARP request to all other processor nodes in the emulated Ethernet network. Each processor node has ARP reply logic to determine whether it is the processor node associated with the IP address in the ARP request and, if so, to issue an ARP reply to the switch node; the ARP reply contains the virtual MAC address of the processor node associated with that IP address. The switch node includes ARP reply logic to receive the ARP reply and to modify it to include a virtual interface identification for the requesting node.
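A minimal sketch of the ARP exchange through the switch node is given below, under the assumption that the switch node holds a table of virtual interface identifications per (requester, responder) pair. The `ArpReply` and `SwitchNode` names and the table shapes are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ArpReply:
    ip: str
    virtual_mac: str
    vi_id: Optional[int] = None      # filled in by the switch node


class SwitchNode:
    def __init__(self, nodes, vi_table):
        self.nodes = nodes           # node name -> (IP address, virtual MAC)
        self.vi_table = vi_table     # (requester, responder) -> virtual interface id

    def resolve(self, requester, target_ip):
        # The request is broadcast to every other processor node; only the node
        # that owns the IP address answers with its virtual MAC address.
        for name, (ip, mac) in self.nodes.items():
            if name != requester and ip == target_ip:
                reply = ArpReply(ip, mac)
                # The switch node adds the VI the requester should use.
                return replace(reply, vi_id=self.vi_table[(requester, name)])
        return None
```

Returning the virtual interface identification with the reply is what lets the requester later send unicast traffic directly to the owner of the address.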
[0012]
In accordance with another aspect of the invention, a platform and method for computer processing are provided to support processor failover. A plurality of computer processors are connected to an internal communication network. A virtual local area communication network over the internal network is defined and established. Each computer processor in the virtual local area communication network has a corresponding virtual MAC address, and the virtual local area network provides communication among a set of computer processors but excludes processors from the plurality that are not in the defined set. A virtual storage space is defined and established with a defined correspondence to the address space of the storage network. In response to a failure of a computer processor, a computer processor from the plurality is allocated to replace the failed processor. The MAC address of the failed processor is assigned to the replacement processor. The virtual storage space and defined correspondence of the failed processor are assigned to the replacement processor. The virtual local area network is re-established to include the replacement processor and to exclude the failed processor.
[0013]
According to another aspect of the invention, a system and method for providing a service addressed by an IP address is provided. At least two computer processors each contain logic to provide that service. The cluster logic receives the service request message. The message has an IP address. The cluster logic distributes the request to one of at least two computer processors having logic to provide the service.
[0014]
In accordance with another aspect of the invention, a computer processing platform includes a plurality of computer processors connected to an internal communication network. At least one control node is in communication with an external communication network and with an external storage network having an external storage address space. The at least one control node is connected to the internal network and thereby communicates with the plurality of computer processors. Configuration logic defines and establishes a virtual processing area network having a corresponding set of computer processors from the plurality, a virtual local area communication network that provides communication among the set of computer processors but excludes processors from the plurality that are not in the defined set, and a virtual storage space with a defined correspondence to the address space of the storage network.
BEST MODE FOR CARRYING OUT THE INVENTION
[0015]
Preferred embodiments of the present invention provide a processing platform from which virtual systems may be deployed through configuration commands. The platform provides a large pool of processors from which a subset may be selected and configured through software commands to form a virtualized network of computers (a “processing area network” or “processor cluster”) that may be deployed to serve a given set of applications or customers. The virtualized processing area network (PAN) may then be used to execute customer-specific applications, such as web-based server applications. The virtualization may include virtualization of local area networks (LANs) or virtualization of I/O storage. By providing such a platform, processing resources may be deployed rapidly and easily through software, via configuration commands from an administrator, rather than by physically providing servers, cabling network and storage connections, providing power to each server, and so forth.
[0016]
(Overview of the platform and its behavior)
As shown in FIG. 1, a preferred hardware platform 100 includes a set of processing nodes 105a-105n connected to switch fabrics 115a,b via high-speed interconnects 110a,b. The switch fabrics 115a,b are also connected to at least one control node 120a,b that is in communication with an external IP network 125 (or other data communication network) and with a storage area network (SAN) 130. A management application 135, for example executing remotely, may access one or more of the control nodes via the IP network 125 to assist in configuring the platform 100 and deploying virtualized PANs.
[0017]
Under certain embodiments, approximately 24 processing nodes 105a-105n, two control nodes 120, and two switch fabrics 115a, 115b are housed in a single chassis and interconnected with a fixed, pre-wired network of point-to-point (PtP) links. Each processing node 105 is a board that includes one or more (e.g., four) processors 106j-106l, one or more network interface cards (NICs) 107, and local memory (e.g., greater than 4 Gbytes) that, among other things, holds some BIOS firmware for booting and initial configuration. There is no local disk for the processors 106; instead, all storage, including storage needed for paging, is handled by the SAN storage devices 130.
[0018]
Each control node 120 is a single board that includes one or more (e.g., four) processors, local memory, and a local magnetic disk unit holding independent copies of the boot image and initial file system used to boot the operating system software for the processing nodes 105 and the control nodes 106. Each control node communicates with the SAN 130 via a 100 Mbyte/s Fiber Channel adapter card 128 connected to Fiber Channel links 122, 124, and communicates with the Internet (or other external network) 125 via an external network interface 129 having one or more Gigabit Ethernet NICs connected to Gigabit Ethernet links 121, 123. (Many other technologies and hardware can be used for SAN and external network connectivity.) Each control node also includes a low-speed Ethernet port (not shown) as a dedicated management port, which may be used instead of remote, web-based management via the management application 135.
[0019]
The switch structure consists of one or more 30-port Giganet switches 115, such as the NIC-CLAN 1000 and clan 5300 switches, and the various processing and control nodes use corresponding NICs to communicate with such a structure module. The Giganet switch structure has the semantics of a non-broadcast multiple access (NBMA) network. All inter-node communication is via the switch structure. Each link is formed as a serial connection between a NIC 107 and a port in the switch structure 115. Each link operates at 112 megabytes/second.
[0020]
In some implementations, multiple cabinets or chassis may be interconnected to form a larger platform. Also, in other embodiments, the configuration may be different; for example, redundant connections, switches, and control nodes may be removed.
[0021]
Under software control, the platform supports multiple simultaneous, independent processing area networks (PANs). Each PAN is configured through software commands to have a corresponding subset of processors 106 that communicate over a virtual local area network emulated over the PtP network. Each PAN is also configured to have a corresponding virtual I/O subsystem. No physical deployment or cabling is needed to establish a PAN. Under certain preferred embodiments, software logic executing on the processor nodes and/or control nodes emulates switched Ethernet semantics; other software logic executing on the processor nodes and/or control nodes follows SCSI semantics and provides a virtual storage subsystem with an independent I/O address space for each PAN.
[0022]
(Network architecture)
One preferred embodiment uses the virtual components, interfaces and connections to allow an administrator to build a virtual emulated LAN. Each virtual LAN may be internal and dedicated to the platform 100, or multiple processors may be formed in a processor cluster that appears externally as a single IP address.
[0023]
Under certain embodiments, the underlying physical network is a PtP network, but the virtual network created on it emulates a switched Ethernet network. Virtual networks use IEEE MAC addresses, and processing nodes support IETF ARP processing to identify and associate IP addresses with MAC addresses. Thus, a given processor node answers ARP requests consistently, whether the ARP request came from a node that is internal or external to the platform.
[0024]
FIG. 2A shows a typical network deployment that can be modeled or emulated. The first subnet 202 is formed by processing nodes PN₁, PN₂, and PNₖ, which communicate with each other via switch 206. The second subnet 204 is formed by processing nodes PNₖ and PNₘ, which communicate with each other via switch 208. Under switched Ethernet semantics, one node on a subnet may communicate directly with another node on that subnet; for example, PN₁ may send a message to PN₂. The semantics also allow one node to communicate with a set of other nodes; for example, PN₁ may send a broadcast message to the other nodes. Because PNₘ is on a different subnet, processing nodes PN₁ and PN₂ cannot communicate directly with PNₘ. For PN₁ and PN₂ to communicate with PNₘ, higher-layer network software with a fuller understanding of both subnets would need to be used. Although not shown, a given switch may also communicate via an “uplink” to another switch or the like. As will be appreciated from the description below, the need for such uplinks differs from the need when the switches are physical. Specifically, since the switches here are virtual and modeled in software, they can scale horizontally as widely as needed. (In contrast, physical switches have a fixed number of physical ports, and uplinks are sometimes required to provide horizontal scalability.)
[0025]
FIG. 2B illustrates exemplary software channels and logic used under certain embodiments to model subnets 202 and 204 of FIG. 2A. The communication paths 212 connect processing nodes PN₁, PN₂, PNₖ, and PNₘ, and specifically their corresponding processor-side network communication logic 210; they also connect the processing nodes to the control nodes. (Although drawn as a single instance of logic for clarity, PNₖ may have multiple instances of the corresponding processor logic, for example one per subnet.) Under the preferred embodiment, the management logic and control node logic are responsible for establishing, managing, and destroying the communication paths. Individual processing nodes are not allowed to establish such paths.
[0026]
As described in detail below, both the processor logic and the control node logic emulate switched Ethernet semantics on such communication paths. For example, the control node has control node-side virtual switch logic 214 to emulate some (but not necessarily all) of the semantics of an Ethernet switch, and the processor logic includes logic to emulate some (but not necessarily all) of the semantics of an Ethernet driver.
[0027]
Within a subnet, one processor node may communicate directly with another via the corresponding virtual interface 212. Similarly, a processor node may communicate with the control node logic via a separate virtual interface. Under certain embodiments, the underlying switch structure and associated logic (e.g., switch structure administrator logic, not shown) provide the ability to establish and manage such virtual interfaces (VIs) over the point-to-point network. In addition, these virtual interfaces may be established in a reliable, redundant manner, in which case they are referred to herein as RVIs. At points in this description, the terms virtual interface (VI) and reliable virtual interface (RVI) are used interchangeably, because the choice between a VI and an RVI depends largely on the amount of reliability desired by the system and the system resources that reliability costs.
[0028]
Referring to FIGS. 2A and 2B together, if node PN₁ is to communicate with node PN₂, it ordinarily does so via virtual interface 212_1-2. However, if, for example, VI 212_1-2 is not operating satisfactorily, preferred embodiments allow communication between PN₁ and PN₂ to take place via the switch emulation logic. In that case, the message may be sent via VI 212_1-switch206 and then via VI 212_switch206-2. If PN₁ is to broadcast or multicast a message to other nodes in subnet 202, it does so by sending the message to the control node-side logic 214 via virtual interface 212_1-switch206. The control node-side logic 214 then emulates the broadcast or multicast functionality by cloning the message and sending it to the other relevant nodes over the relevant VIs. The same or similar VIs may be used to convey other messages that require control node-side logic. For example, as described below, the control node-side logic includes logic to support the Address Resolution Protocol (ARP), and VIs are used to communicate ARP responses and requests to the control node. Although the above description suggests only one VI between the processor logic and the control logic, many implementations use multiple such connections. Furthermore, although the figure suggests symmetry in the software channels, the architecture actually allows asymmetric communication. For example, as discussed below for cluster communication services, packets are routed via the control node, whereas the return communication may be direct between nodes.
[0029]
As in the network of FIG. 2A, it should be noted that there is no mechanism for communication between nodes PN₂ and PNₘ. Furthermore, because the communication paths are created and managed centrally (not by the processing nodes), such a path cannot be created by a processing node, and a processor cannot violate the defined subnet connectivity.
[0030]
FIG. 2C illustrates an exemplary physical connection of an embodiment for implementing the subnets of FIGS. 2A and B. In particular, each instance of the processing network logic 210 communicates with the switch structure 115 via the PtP link 216 of the interconnect 110. Similarly, the control node has multiple instances of switch logic 214 and each communicates over a PtP connection 216 to the switch structure. The virtual interface of FIG. 2B includes logic to communicate information across these physical links, as described further below.
[0031]
To create and configure such a network, the administrator defines the PAN's network topology and specifies the MAC address assignments of the various nodes (e.g., via a utility in the management software 135). The MAC addresses are virtual: each identifies a virtual interface and is not tied to any particular physical node. Under certain embodiments, the MAC addresses follow the IEEE 48-bit address format, but their contents include a “locally administered” bit (set to 1), the serial number of the control node 120 on which the virtual interface was first defined (more below), and a count value from a persistent sequential counter on the control node that is maintained in NVRAM. These MACs are used to identify nodes at the layer 2 level (as is conventional). For example, when answering an ARP request (whether from a node internal to the PAN or on an external network), these MACs are included in the ARP response.
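As a minimal sketch of the virtual MAC composition described above, the following assumes a field layout that the text does not specify: one leading byte carrying the locally administered bit, a 16-bit control node serial number, and a 24-bit counter value. The names `make_virtual_mac` and `MacAllocator` are hypothetical, and the in-memory counter merely stands in for the NVRAM-backed counter.

```python
def make_virtual_mac(control_node_serial: int, counter: int) -> str:
    """Compose a 48-bit virtual MAC using an assumed (hypothetical) field layout."""
    if not 0 <= control_node_serial < 2**16:
        raise ValueError("serial must fit in 16 bits in this sketch")
    if not 0 <= counter < 2**24:
        raise ValueError("counter must fit in 24 bits in this sketch")
    first_byte = 0x02  # locally administered bit set, multicast bit clear
    raw = (bytes([first_byte])
           + control_node_serial.to_bytes(2, "big")
           + counter.to_bytes(3, "big"))
    return ":".join(f"{b:02x}" for b in raw)


class MacAllocator:
    """Stands in for the persistent sequential counter kept on a control node."""

    def __init__(self, control_node_serial: int, start: int = 0):
        self.serial = control_node_serial
        self.counter = start

    def next_mac(self) -> str:
        mac = make_virtual_mac(self.serial, self.counter)
        self.counter += 1  # a real system would persist this before returning
        return mac
```

Because the address depends only on the defining control node and a counter, it can follow a virtual interface from one physical node to another.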
[0032]
The control node-side network logic maintains data structures containing information that reflects the LAN connectivity (for example, which nodes may communicate with which other nodes). The control node logic also allocates and assigns VI (or RVI) mappings to the defined MAC addresses, and allocates and assigns VIs (or RVIs) between control nodes and between control nodes and processing nodes. In the example of FIG. 2A, the logic would allocate and assign the VIs 212 of FIG. 2B. (The naming of VIs and RVIs in some embodiments is a consequence of the switch structure and the switch structure administrator logic used.)
[0033]
As each processor boots, the BIOS-based boot logic initializes each processor 106 of node 105 and, among other things, establishes (or discovers) a VI 212 to the control node logic. The processor node then obtains from the control node the relevant data link information, such as the processor node's own MAC address and the MAC identities of other devices in the same data link configuration. Each processor then registers its IP address with the control node, which binds that IP address to the node and to an RVI (e.g., the RVI on which the registration arrived). In this way, the control node is able to bind an IP address to each virtual MAC for each node on the subnet. In addition, the processor node obtains RVI- or VI-related information for its connections to other nodes or to the control node network logic.
[0034]
Thus, after boot and initialization, the various processor nodes should understand their layer 2 data link connectivity. As explained below, layer 3 (IP) connectivity, and in particular the association of layer 3 with layer 2, is determined during normal processing on the processors as a result of the Address Resolution Protocol.
[0035]
FIG. 3A details the processor-side network logic 210, and FIG. 3B details the control node-side network logic 310 of one embodiment. The processor-side logic 210 includes an IP stack 305, a virtual network driver 310, ARP logic 350, an RCLAN layer 315, and redundant Giganet drivers 320a, b. The control node-side logic 310 includes redundant Giganet drivers 325a, b, an RCLAN layer 330, virtual cluster proxy logic 360, a virtual LAN server 335, ARP server logic 355, a virtual LAN proxy 340, and physical LAN drivers 345.
[0036]
(IP stack)
The IP stack 305 is the communication protocol stack provided with the operating system (for example, Linux) used by the processing nodes 106. The IP stack provides a layer 3 interface through which applications and the operating system executing on a processor 106 communicate with the emulated Ethernet network. The IP stack supplies packets of information to the virtual Ethernet layer 310 together with the layer 3 IP address that is each packet's destination. The IP stack logic is conventional, except that certain implementations avoid checksum computation and logic.
[0037]
(Virtual Ethernet driver)
The virtual Ethernet driver 310 appears to the IP stack 305 like a “real” Ethernet driver. In this regard, the virtual Ethernet driver 310 receives IP packets or datagrams from the IP stack for subsequent transmission on the network, and it receives packet information from the network for delivery to the stack as IP packets.
[0038]
The stack creates the MAC header. The “normal” Ethernet code in the stack may be used. The virtual Ethernet driver receives each packet with the MAC header already created and the correct MAC address already in the header.
[0039]
In material part, and with reference to FIGS. 4A-C, the virtual Ethernet driver 310 dequeues (405) outgoing IP datagrams so that the packets can be transmitted over the network. The standard IP stack ARP logic is used. As explained below, the driver intercepts all ARP packets entering and leaving the system and modifies them so that the appropriate information ends up in each node's ARP table. The normal ARP logic places the correct MAC address in the link layer header of an outgoing packet before the packet is queued to the Ethernet driver. The driver then simply examines the link layer header and destination MAC to determine how to send the packet. The driver does not manipulate the ARP table directly (except for the occasional invalidation of ARP entries).
[0040]
The driver 310 determines (415) whether the ARP logic 350 has MAC address information (more below) associated with the IP address in the dequeued packet. If the ARP logic 350 has the information, it is used to send the packet accordingly (420). If the ARP logic 350 does not have the information, the driver needs to determine it, and in certain preferred embodiments this information is obtained as a result of executing the ARP protocol, as discussed with respect to FIGS. 4B and 4C.
[0041]
If the ARP logic 350 has MAC address information, the driver analyzes the information returned from the ARP logic 350 to determine where and how to send the packet. In particular, the driver examines the address to determine whether the MAC address is in a valid format or in a specially invalid format. For example, in some implementations an internal node (i.e., a PAN node within the platform) is signaled by a combination of setting the locally administered bit, the multicast bit, and another predetermined bit pattern in the first byte of the MAC address. The overall pattern is one that is highly unlikely to be a valid pattern.
[0042]
If the MAC address returned from the ARP logic is in a valid format, the IP address associated with that MAC address belongs to a node that is at least external to the associated subnet and, in preferred embodiments, external to the platform. To deliver such a packet, the driver prepends a TLV (type-length-value) header to the packet. The logic then sends the packet to the control node over a pre-established VI. The control node then handles the remainder of the transmission as appropriate.
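The TLV encapsulation mentioned above can be illustrated with a small sketch. The text does not give field widths or type codes, so the 2-byte type, 2-byte length, network byte order, and the constant `TYPE_ETHERNET_FRAME` are all assumptions made for illustration.

```python
import struct

TLV_FORMAT = "!HH"        # assumed: 2-byte type, 2-byte length, big-endian
TYPE_ETHERNET_FRAME = 1   # hypothetical type code


def tlv_wrap(tlv_type: int, payload: bytes) -> bytes:
    """Prepend a type-length-value header to a payload."""
    return struct.pack(TLV_FORMAT, tlv_type, len(payload)) + payload


def tlv_unwrap(data: bytes):
    """Split a TLV-framed message into (type, value), checking the length."""
    header_len = struct.calcsize(TLV_FORMAT)
    tlv_type, length = struct.unpack(TLV_FORMAT, data[:header_len])
    value = data[header_len:header_len + length]
    if len(value) != length:
        raise ValueError("truncated TLV payload")
    return tlv_type, value
```

A receiver that does not recognize a type can still skip the message using the length field, which is the usual reason for choosing this framing.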
[0043]
If the MAC address information returned from the ARP logic 350 is in a specially invalid format, the invalid format signals that the node at that IP address is an internal node, and the information in the MAC address information is used to help identify the VI (or RVI) that directly connects the two processing nodes. For example, an ARP table entry may hold information identifying the RVI 212, e.g., 212_1-2, used to send packets to another processing node. The driver prepends a TLV header to the packet. It then places address information in the header, as well as information identifying the Ethernet protocol type. The logic then selects the appropriate VI (or RVI) on which to send the encapsulated packet. If that VI (or RVI) is operating satisfactorily, it is used to carry the packet; if it is not, the packet is sent to the control node switch logic (more below) so that the switch logic can forward it to the appropriate node. The ARP table may contain information identifying the actual RVI to use, but many other techniques may be employed. For example, the information in the table may provide such information indirectly, e.g., by pointing to the information of interest or by conveying it through the absence of information.
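The driver's unicast decision described in the preceding paragraphs can be sketched as follows. The marker value `INTERNAL_MARK` is a hypothetical stand-in for the unspecified "predetermined bit pattern", and `rvi_table` and `rvi_is_up` are illustrative names rather than structures named in the text.

```python
LOCALLY_ADMINISTERED = 0x02
MULTICAST = 0x01
INTERNAL_MARK = 0xA0  # hypothetical stand-in for the predetermined bit pattern
INTERNAL_FIRST_BYTE = INTERNAL_MARK | LOCALLY_ADMINISTERED | MULTICAST


def is_internal_mac(mac: bytes) -> bool:
    """True when the first byte carries the 'specially invalid' internal pattern."""
    return mac[0] == INTERNAL_FIRST_BYTE


def choose_path(dest_mac: bytes, rvi_table: dict, rvi_is_up) -> str:
    """Return the channel on which to send a unicast packet."""
    if not is_internal_mac(dest_mac):
        return "control-node"      # external destination: control node forwards it
    rvi = rvi_table.get(dest_mac)
    if rvi is not None and rvi_is_up(rvi):
        return rvi                 # direct node-to-node RVI
    return "control-node"          # fall back to the switch emulation logic
```

Setting the multicast bit on a unicast destination is what makes the pattern unlikely to collide with a real address learned from an external network.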
[0044]
For any multicast or broadcast message, the driver sends the message to the control node on a specified VI. The control node then clones the packet and sends it to all nodes (except the sending node) and to the uplink.
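A minimal sketch of the cloning step at the control node, assuming each subnet member is reachable through a send callable. The function name `broadcast` and the `members` mapping are hypothetical.

```python
def broadcast(packet: bytes, sender: str, members: dict, uplink=None) -> list:
    """Clone a packet to every member except the sender, then to the uplink.

    `members` maps a node name to a callable that transmits on that node's VI.
    Returns the names that were sent a copy, in order.
    """
    delivered = []
    for name, send in members.items():
        if name == sender:
            continue               # the sending node does not get its own packet
        send(bytes(packet))        # each receiver gets its own copy
        delivered.append(name)
    if uplink is not None:
        uplink(bytes(packet))
        delivered.append("uplink")
    return delivered
```

On a point-to-point network there is no shared medium, so a broadcast costs one unicast transmission per receiver.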
[0045]
Without an ARP mapping, the upper layer never sends the packet to the driver. If no data link layer mapping is available, the packet is held until ARP resolution completes. Once the ARP layer has finished resolving the address, the packets that were held pending the ARP are built with their data link headers and then sent to the driver.
[0046]
If the ARP logic has no mapping for the IP address of an IP packet from the IP stack, and the driver 310 therefore cannot determine the relevant addressing information (i.e., MAC address or RVI-related information), the driver obtains such information by following the ARP protocol. Referring to FIGS. 4B and 4C, the driver creates an ARP request packet containing the relevant IP address for which there is no MAC mapping in the local ARP table (425). The node then prepends a TLV-type header to the ARP packet (430). Next, the ARP request is sent over a dedicated RVI to the control node-side network logic, specifically to the virtual LAN server 335.
[0047]
As discussed in more detail below, the ARP request packet is processed by the control node (435) and broadcast to the relevant nodes (440). For example, the control node flags whether the requesting node is part of an IP service cluster.
[0048]
The Ethernet driver logic 310 at each relevant node receives the ARP request (445) and determines whether it is the target of the ARP request (450) by comparing the target IP address with a locally configured list of IP addresses, which it obtains through calls to that node's IP stack. If it is not the target, it passes the packet up without modification. If it is the target, the driver creates a local MAC header from the TLV header (460), updates the local ARP table, and creates an ARP response (465). The driver modifies the information in the ARP request (primarily the source MAC) and then passes the ARP request up in the normal way for the upper layer to handle. It is the upper layer that forms the ARP response when one is needed. Among other things, the response contains the MAC address of the responding node and has a bit set in the TLV header indicating that the response is from a local node. In this respect, the node responds according to IETF-type ARP semantics (as opposed to ATM ARP protocols, in which ARP responses are handled centrally). The response is then sent (470).
[0049]
As described in more detail below, the control node logic 335 receives the response and modifies it (473). For example, the control node may replace the MAC address of a responding internal node with information identifying the source cabinet, processing node number, RVI connection number, channel, virtual interface number, and virtual LAN name. Once the ARP response has been modified, the control node logic sends it to the appropriate node, i.e., the node that sent the ARP request or, in certain cases, the load balancer in an IP service cluster discussed below (475).
[0050]
Finally, an encapsulated ARP response is received (480). If the replying node is an external node, the ARP response contains the MAC address of the replying node. If the responding node is an internal node, the ARP response instead contains information identifying the associated RVI communicating with the node. In either case, the local table is updated (485).
[0051]
The pending datagram is removed from the queue (487) and the appropriate RVI is selected (493). As discussed above, the appropriate RVI is selected based on whether the target node is internal or external. A TLV header is prepended to the packet, and the packet is transmitted (495).
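The request, reply, and release sequence of FIGS. 4B and 4C can be summarized in a small sketch. It does not model the TLV framing or the control node's rewriting of the reply; `ArpResolver`, `send_arp_request`, and `transmit` are hypothetical names, and an ARP table entry is treated as an opaque value (a MAC address or RVI-identifying information).

```python
from collections import defaultdict, deque


class ArpResolver:
    """Holds datagrams until their destination IP has an ARP table entry."""

    def __init__(self, send_arp_request, transmit):
        self.table = {}                      # IP -> MAC or RVI information
        self.pending = defaultdict(deque)    # IP -> datagrams awaiting resolution
        self.send_arp_request = send_arp_request
        self.transmit = transmit

    def send(self, dest_ip, datagram):
        entry = self.table.get(dest_ip)
        if entry is not None:
            self.transmit(entry, datagram)   # mapping known: send immediately
            return
        first_waiter = not self.pending[dest_ip]
        self.pending[dest_ip].append(datagram)
        if first_waiter:
            self.send_arp_request(dest_ip)   # one request per unresolved IP

    def on_arp_reply(self, ip, entry):
        self.table[ip] = entry               # update the local table
        queue = self.pending.pop(ip, deque())
        while queue:                         # release held datagrams in order
            self.transmit(entry, queue.popleft())
```

Issuing a single request per unresolved address keeps a burst of packets to a new destination from flooding the control node with duplicate ARP traffic.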
[0052]
For communication within a virtual LAN, the maximum transmission unit (MTU) is configured as 16,896 bytes. Even though the configured MTU is 16,896 bytes, the Ethernet driver 310 recognizes when a packet is being sent to an external network. Through the use of path MTU discovery, ICMP, and IP stack changes, the path MTU is changed at the source node 105. This mechanism is also used to trigger packet checksumming.
[0053]
Certain embodiments of the present invention support promiscuous mode through a combination of logic within the virtual LAN server 335 and the virtual LAN driver 310. When the virtual LAN driver 310 receives a promiscuous mode message from the virtual LAN server 335, the message includes information identifying the receiver that wishes to enter promiscuous mode. This information includes the location of the receiver (cabinet, node, etc.), the interface number of the promiscuous virtual interface 310 on that receiver (required for packet demultiplexing), and the name of the virtual LAN to which the receiver belongs. The driver 310 then uses this information to determine how to send promiscuous packets to the receiver (which RVI or other mechanism to use). The virtual interface 310 maintains a list of promiscuous listeners on the same virtual LAN. When a sending node receives a promiscuous mode message, it updates its promiscuous list accordingly.
[0054]
Whenever a packet is sent through the virtual Ethernet driver 310, this list is examined. If the list is not empty, the virtual Ethernet interface 310 does the following:
[0055]
Promiscuous copies will not be sent if the transmitted packet is broadcast or multicast. Normal broadcast operations will send packets to the promiscuous listener.
[0056]
If the packet is a unicast packet with a destination other than the promiscuous listener, the packet will be cloned and sent to the promiscuous listener.
[0057]
The header TLV includes additional information that the destination can use to demultiplex and validate the incoming packet. Part of this information is the destination virtual Ethernet interface number (the destination device number on the receiving node). Since this can differ between the actual packet destination and a promiscuous destination, the header cannot simply be cloned. Thus, memory has to be allocated for a separate header for each packet clone sent to each promiscuous listener. When the packet header for a promiscuous packet is built, the packet type is set to indicate that the packet is a promiscuous transmission rather than a unicast transmission.
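The promiscuous-copy rules above can be sketched as a function that returns every frame to transmit for one outgoing packet. The dictionary header with `to`, `ifnum`, and `type` keys is a hypothetical simplification of the TLV header, and `listeners` maps each promiscuous listener to its interface number.

```python
def build_frames(dest, dest_ifnum, payload, listeners, is_broadcast=False):
    """Return (header, payload) pairs for a packet and its promiscuous copies."""
    kind = "broadcast" if is_broadcast else "unicast"
    frames = [({"to": dest, "ifnum": dest_ifnum, "type": kind}, payload)]
    if is_broadcast:
        return frames  # normal broadcast already reaches the listeners
    for name, ifnum in listeners.items():
        if name != dest:
            # each clone gets its own header: the interface number differs
            header = {"to": name, "ifnum": ifnum, "type": "promiscuous"}
            frames.append((header, payload))
    return frames
```

The payload can be shared between clones; only the header must be built afresh for each listener.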
[0058]
Furthermore, the virtual Ethernet driver 310 is responsible for handling the redundant control node connections. For example, the virtual Ethernet driver periodically tests end-to-end connectivity by sending a TLV heartbeat on each connected RVI. This allows the virtual Ethernet driver to determine whether a node has stopped responding or whether a stopped node has started responding again. If an RVI or control node 120 is determined to be down, the Ethernet driver sends traffic through the remaining control node. If both control nodes are functioning, the driver 310 attempts to load balance traffic between the two nodes.
[0059]
Certain embodiments of the present invention provide improved performance. For example, with changes to the IP stack 305, packets sent only within the platform 100 are guaranteed to be sent without checksum because all elements of the platform 100 provide error detection.
[0060]
Further, for communication within a PAN (or even within the platform 100), the RVIs may be configured to allow packets larger than the maximum size permitted by Ethernet. Thus, while certain embodiments emulate Ethernet behavior, the maximum packet size may be exceeded to improve performance. The actual packet size is handled as part of the data link layer.
[0061]
A control node failure is detected either by notification from the RCLAN layer or by failure of the TLV heartbeats. If a control node fails, the Ethernet driver 310 sends traffic only to the remaining control node. The Ethernet driver 310 recognizes the recovery of a control node via notification from the RCLAN layer or via the resumption of TLV heartbeats. Once the control node has recovered, the Ethernet driver 310 resumes load balancing.
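A minimal sketch of heartbeat-based failure and recovery detection follows. The threshold `max_missed` is an assumption, since the text does not say how many missed heartbeats mark a control node as down, and `ControlNodeMonitor` is a hypothetical name.

```python
class ControlNodeMonitor:
    """Tracks which control nodes are usable, based on heartbeat outcomes."""

    def __init__(self, nodes, max_missed=3):
        self.max_missed = max_missed          # assumed threshold, not from the text
        self.missed = {node: 0 for node in nodes}

    def heartbeat_ack(self, node):
        self.missed[node] = 0                 # any reply marks the node as recovered

    def heartbeat_timeout(self, node):
        self.missed[node] += 1

    def available(self):
        """Control nodes that have not exceeded the missed-heartbeat threshold."""
        return [n for n, m in self.missed.items() if m < self.max_missed]
```

The driver would load balance across `available()` while it holds two nodes and send everything to the survivor while it holds one.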
[0062]
If a node detects that it cannot communicate with another node via a direct RVI (as outlined above), it attempts to communicate via the control node, which acts as a switch. Such a failure may be signaled by the lower RCLAN layer, for example from a failure to receive a virtual interface acknowledgment or from a failure detected through the heartbeat mechanism. In this case, the driver marks a bit in the TLV header to indicate that the message is to be unicast and sends the packet to the control node, so that the control node can forward the packet to the desired node (e.g., based on IP address if necessary).
[0063]
(RCLAN layer)
The RCLAN layer 315 is responsible for handling the redundancy, failover, and load balancing logic of the redundant interconnect NICs 107. This includes detecting failures, rerouting traffic onto the redundant connection when a failure occurs, load balancing, and reporting to the virtual network driver 310 when traffic cannot be delivered. The virtual Ethernet driver 310 is to be notified asynchronously if a fatal error on any RVI renders that RVI unusable, or if an RVI fails for any reason.
[0064]
Under normal circumstances, the virtual network driver 310 on each processor attempts to balance the load of transmitted packets among the available control nodes. This can be done by simple round-robin alternation between the available control nodes, or by tracking how many bytes have been sent to each control node and sending to the one that has been sent the fewest bytes.
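Both balancing policies named above are easy to sketch. `round_robin` and `LeastBytesBalancer` are hypothetical names, and the optional `available` argument is an added convenience for skipping a failed control node.

```python
import itertools


def round_robin(nodes):
    """Endless alternation over the given control nodes."""
    return itertools.cycle(nodes)


class LeastBytesBalancer:
    """Sends each packet to the control node that has been sent the fewest bytes."""

    def __init__(self, nodes):
        self.sent = {node: 0 for node in nodes}

    def pick(self, packet_len, available=None):
        candidates = [n for n in self.sent
                      if available is None or n in available]
        if not candidates:
            raise RuntimeError("no control node available")
        node = min(candidates, key=lambda n: self.sent[n])  # ties: first listed
        self.sent[node] += packet_len
        return node
```

Round-robin balances packet counts, while the byte-tracking variant balances volume when packet sizes vary widely, as they can with a 16,896-byte MTU.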
[0065]
RCLAN provides reliable, asynchronous communication between kernels with high bandwidth (224 MB/sec in each direction) and low latency. If data cannot be delivered, the sender of the data is notified, and a best effort is made to deliver it. RCLAN uses two Giganet clan 1000 cards to provide redundant communication paths between kernels. It recovers seamlessly from a single failure in a clan 1000 card or Giganet switch. It detects data loss and data errors and retransmits the data as needed. Communication is not interrupted as long as one of the connections is at least partially working, e.g., its error rate does not exceed 5%. RCLAN clients include an RPC mechanism, a remote SCSI mechanism, and remote Ethernet. RCLAN also provides a simple form of flow control. Low latency and high concurrency are achieved by allowing multiple simultaneous requests for each device to be sent by the processor node to the control node, so that they can be forwarded to the device as soon as possible or queued for completion as close to the device as possible, as opposed to queuing all requests on the processor node.
[0066]
The RCLAN layer 330 on the control node side operates in the same manner as described above.
[0067]
(Giganet driver)
The Giganet driver logic 320 is the logic responsible for providing an interface to the Giganet NIC 107, whether on a processor 106 or on a control node 120. In short, the Giganet driver logic establishes VI connections, each associated with a VI identifier, so that the higher layers, for example RCLAN 315 and the Ethernet driver 310, need only understand the semantics of VIs.
[0068]
The Giganet driver logic 320 is responsible for allocating memory on each node for the buffers and queues of the VIs, and for conditioning the NIC 107 to know about each connection and its memory allocation. One embodiment uses the VI connections provided by the Giganet driver. The Giganet NIC driver code establishes a virtual interface pair (i.e., a VI) and assigns it a corresponding virtual interface identifier.
[0069]
Each VI is a bidirectional connection established between one Giganet port and another or, more precisely, between memory buffers and memory queues on one node and buffers and queues on another node. Port and memory allocation is handled by the NIC driver as described above. Data is sent by placing it in a buffer known to the NIC and triggering the action by writing to a specific memory-mapped register. At the receiving end, the data appears in a buffer and the completion status appears in a queue. If the sending and receiving programs can produce and consume messages in the connection's buffers, the data never needs to be copied. If the operating system memory maps the connection's buffers and control registers into the application address space, transmission can even be direct from application program to application program. Each Giganet port supports 1024 simultaneous VI connections and keeps them separated from one another with hardware protection, so the operating system as well as disparate applications can safely share a single port. Under certain embodiments of the invention, 14 VI connections are established simultaneously from every port to every other port.
[0070]
In the preferred embodiment, the NIC driver establishes VI connections in redundant pairs, with one connection of the pair going through one of the two switch structures 115a, 115b and the other connection going through the other switch. Further, in the preferred embodiment, data is sent alternately on the two legs of a pair, equalizing the load on the switches. Alternatively, the redundant pairs may be used in a failover manner.
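A small sketch of leg selection on a redundant pair, combining the alternating and failover behaviors described above. `RedundantPair` and its methods are hypothetical; legs 0 and 1 stand for the connections through switch structures 115a and 115b.

```python
class RedundantPair:
    """Alternates between two legs and skips a leg that is marked down."""

    def __init__(self):
        self.up = [True, True]
        self.next_leg = 0

    def mark(self, leg, is_up):
        self.up[leg] = is_up

    def pick_leg(self):
        for _ in range(2):
            leg = self.next_leg
            self.next_leg = 1 - leg     # alternate on every attempt
            if self.up[leg]:
                return leg
        raise RuntimeError("both legs are down")
```

With both legs up the traffic alternates evenly; when one leg fails, every send lands on the survivor until the failed leg is marked up again.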
[0071]
All connection pairs established by a node persist as long as the operating system remains up. Establishing a connection pair to emulate an Ethernet connection is analogous to physically plugging a cable in between network interface cards, and it is intended to be equally persistent. If the defined configuration of a node changes while its operating system is running, the applicable redundant virtual interface connection pairs are established or destroyed at the time of the change.
[0072]
The Giganet driver logic 325 on the control node side operates similar to the above.
[0073]
(Virtual LAN server)
Virtual LAN server logic 335 facilitates emulation of an Ethernet network over a basic NBMA network. Virtual LAN server logic is:
1. Manage membership to the corresponding virtual LAN;
2. Provide RVI mapping and management;
3. ARP processing and IP mapping to RVI;
4. Provide broadcast and multicast services;
5. Facilitates bridging and routing to other areas;
Also
6. Manage service clusters.
[0074]
(1. Virtual LAN membership management)
Administrators use the management application 135 to configure a virtual LAN. The assignment and configuration of IP addresses on a virtual LAN may be done in the same way as on a “normal” subnet. The choice of IP addresses depends on the external visibility of the nodes on the virtual LAN. If the virtual LAN is not globally visible (that is, not visible outside the platform 100 or from the Internet), private IP addresses should be used. Otherwise, the IP addresses must be configured from the range provided by the Internet service provider (ISP) that supplies the Internet connection. In general, virtual LAN IP address assignment must be treated the same as normal LAN IP address assignment. A configuration file stored on the local disk of the control node 120 defines the IP addresses within the virtual LAN. For a virtual network interface, an IP alias simply creates another IP-to-RVI mapping in the virtual LAN server logic 335. Each processor may configure multiple virtual interfaces as needed. The main restriction on the creation and configuration of virtual network interfaces is IP address assignment and configuration.
[0075]
Each virtual LAN has a corresponding instance of server logic 335 that executes on both control nodes 120, together with a number of nodes that execute on the processor nodes 105. The topology is defined by the administrator.
[0076]
Each virtual LAN server 335 is configured to manage exactly one broadcast area, and any number of layer 3 (IP) subnets may exist within that layer 2 broadcast area. The server 335 is configured and created in response to an administrator command to create a virtual LAN.
[0077]
When the processor 106 boots and configures its virtual network, it connects to the virtual LAN server 335 via a special management RVI. The processor then obtains its data link configuration information, such as the virtual MAC addresses assigned to it and its virtual LAN membership information. The virtual LAN server 335 verifies that the processor attempting to connect to it is in fact a member of the virtual LAN that the server 335 is serving. If the processor is not a virtual LAN member, the connection to the server is refused. If it is a member, the virtual network driver 310 registers its IP address with the virtual LAN server. (The IP address is provided by the IP stack 305 when the driver 310 is configured.) The virtual LAN server then binds the IP address to the RVI on which the registration arrived. This allows the virtual LAN server to find the processor associated with a particular IP address. Alternatively, the association between a processor and an IP address may be made via the virtual LAN management interface 135. The latter method is necessary to properly configure a cluster IP address or an IP address with special handling, as discussed below.
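The membership check and IP registration described above can be sketched as follows. This is only an illustrative sketch: the class `VirtualLanServer`, its method names, and the use of plain strings and integers for processors and RVIs are assumptions made for the example and are not taken from the specification.

```python
class VirtualLanServer:
    """Toy model of membership checking and IP-to-RVI binding (hypothetical names)."""

    def __init__(self, name, members):
        self.name = name
        self.members = set(members)   # processors allowed on this virtual LAN
        self.ip_to_rvi = {}           # registered IP address -> RVI identifier

    def connect(self, processor_id):
        # Refuse the connection if the processor is not a virtual LAN member.
        if processor_id not in self.members:
            raise PermissionError(f"{processor_id} is not a member of {self.name}")
        return True

    def register_ip(self, ip, rvi_id):
        # Bind the IP address to the RVI on which the registration arrived.
        self.ip_to_rvi[ip] = rvi_id

    def rvi_for(self, ip):
        # Find the RVI (and hence the processor) associated with an IP address.
        return self.ip_to_rvi.get(ip)


server = VirtualLanServer("vlan-a", members={"p1", "p2"})
server.connect("p1")
server.register_ip("10.0.0.5", rvi_id=7)
```

A lookup such as `server.rvi_for("10.0.0.5")` then yields the RVI on which that address was registered, which is the property the server relies on to locate a processor by IP address.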
[0078]
(2. RVI mapping and management)
As outlined above, one embodiment uses RVIs to connect nodes at the data link layer and to form control connections. Some of these connections are created and assigned as part of control node boot and initialization. Data link layer connections are used for the reasons described above. Control connections are used to exchange management, configuration and health information.
[0079]
Some RVI connections are between nodes for unicast traffic (e.g., 212i.2). Other RVI connections are to the virtual LAN server logic 335 so that the server can handle requests such as ARP traffic and broadcasts. The virtual LAN server 335 creates and deletes RVIs through calls to the Giganet switch manager 360 (supplied with the switch structure and the Giganet NICs). The switch manager may execute on the control node 120 and creates RVIs in cooperation with the Giganet driver.
[0080]
As the nodes register with the virtual LAN server 335, the virtual LAN server creates and assigns virtual MAC addresses for those nodes as described above. Along with this, the virtual LAN server logic maintains a data structure that reflects the topology and MAC assignments for the various nodes. The virtual LAN server logic then creates the corresponding RVIs for the unicast paths between the nodes. These RVIs are then allocated and made known to each node during the boot of the node. RVIs are also associated with IP addresses during the virtual LAN server's handling of ARP traffic. If a node is removed from the topology, its RVI connections are torn down.
[0081]
If a node 106 at one end of an established RVI connection is rebooted, the two operating systems at each end of the connection and the RVI management logic re-establish the connection. Software that uses the connection on the remaining processing node will not be aware that anything happened to the connection itself. Whether or not that software notices that the software at the other end was rebooted depends on what it is using the connection for and on the extent to which the rebooted end can re-establish its state from persistent storage. For example, any software that communicates via the Transmission Control Protocol (TCP) will notice that all of its TCP sessions were closed by the reboot. Network File System (NFS) access, on the other hand, is not affected by a reboot, provided the reboot completes within the allowed timeout period.
[0082]
Should a node at any time fail to send a packet over a direct RVI, it can always attempt to send the packet to the destination via the virtual LAN server 335. Because the virtual LAN server 335 is connected to all virtual Ethernet driver 310 interfaces on the virtual LAN via the control connections, the virtual LAN server 335 also serves as a packet relay mechanism of last resort.
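The last-resort relay described above amounts to a simple fallback on the sending side. The sketch below illustrates the idea only; the function `send_packet`, the `send_on_rvi` callback, and the use of `OSError` to signal a failed direct send are assumptions introduced for this example.

```python
def send_packet(packet, dest_ip, direct_rvis, send_on_rvi, server_rvi):
    """Try the direct RVI first; fall back to the virtual LAN server's control RVI.

    direct_rvis maps destination IP -> RVI id; send_on_rvi(rvi, packet) is a
    caller-supplied transport function assumed to raise OSError on failure.
    """
    rvi = direct_rvis.get(dest_ip)
    if rvi is not None:
        try:
            send_on_rvi(rvi, packet)
            return "direct"
        except OSError:
            pass  # direct path failed; use the server as a relay
    send_on_rvi(server_rvi, packet)
    return "relayed"
```

The return value only reports which path was taken; a real driver would more likely update its RVI state than return a string.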
[0083]
With respect to connecting to a virtual LAN server 335, one embodiment uses a virtual Ethernet driver 310 that algorithmically determines the RVI it should use to connect to its associated virtual LAN server 335. Implementation dependent algorithms will need to consider identification information such as cabinet number to identify the RVI.
[0084]
(3. ARP processing and IP mapping to RVI)
As described above, one embodiment of the virtual Ethernet driver 310 supports ARP. In these implementations, ARP processing is used to advantage to create, at each node, mappings between IP addresses and the RVIs that carry unicast traffic, including IP packets, between nodes.
[0085]
To do this, the virtual Ethernet driver 310 sends ARP request and response packets to the virtual LAN server 335 via a dedicated RVI. The virtual LAN server 335, and in particular the ARP server logic 355, processes each packet by adding information to the packet header. As explained above, this information facilitates source and target identification and identifies the RVI used between the nodes.
[0086]
The ARP server logic 355 accepts the ARP request, processes the TLV header, and broadcasts the request to all relevant nodes on the internal platform and, if appropriate, on the external network. Among other things, the server logic 355 determines who should receive the ARP response that results from the request. For example, if the source is a cluster IP address, the response should be sent to the cluster load balancer and not necessarily to the source of the ARP request. Server logic 355 indicates this by including information in the TLV header of the ARP request, so that the ARP target responds accordingly. The server 335 processes the ARP packet by including more detailed information in the added header and broadcasts the packet to the nodes in the associated area. For example, the modified header may include information identifying the source cabinet, processing node number, RVI connection number, channel, virtual interface number and virtual LAN name (some of which are known only to the server 335).
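A type-length-value (TLV) header of the kind described above can be illustrated with a short encoder and decoder. The specification does not give the wire format, so the one-byte type, one-byte length layout, the numeric type codes, and the field widths below are all assumptions chosen only to make the sketch concrete.

```python
import struct

# Hypothetical type codes; the specification does not define them.
T_CABINET, T_NODE, T_RVI, T_VLAN_NAME = 1, 2, 3, 4


def encode_tlv(fields):
    """Encode (type, value-bytes) pairs as type(1) | length(1) | value."""
    out = bytearray()
    for ftype, value in fields:
        out += struct.pack("!BB", ftype, len(value)) + value
    return bytes(out)


def decode_tlv(data):
    """Decode a byte string produced by encode_tlv back into (type, value) pairs."""
    fields, offset = [], 0
    while offset < len(data):
        ftype, length = struct.unpack_from("!BB", data, offset)
        offset += 2
        fields.append((ftype, data[offset:offset + length]))
        offset += length
    return fields


header = encode_tlv([
    (T_CABINET, struct.pack("!H", 1)),     # source cabinet
    (T_NODE, struct.pack("!H", 6)),        # processing node number
    (T_RVI, struct.pack("!H", 212)),       # RVI connection number
    (T_VLAN_NAME, b"vlan-a"),              # virtual LAN name
])
```

Because each field carries its own length, a receiver can skip types it does not understand, which is the usual reason for choosing a TLV layout.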
[0087]
The ARP response is received by the server logic 355, which then maps the MAC information in the response to the corresponding RVI-related information. The RVI-related information is placed in the target MAC entry of the response, and the response is sent to the appropriate source node (e.g., the sender of the request, although in instances such as a cluster IP address it may be a different node).
[0088]
(4. Broadcast and multicast services)
As outlined above, broadcasts are handled by receiving the packets on a dedicated RVI. Each packet is then cloned by the server 335 and unicast to all virtual interfaces 310 in the associated broadcast area.
[0089]
The same approach can be used for multicast. All multicast packets will be reflected off the virtual LAN server. Under some alternative implementations, the virtual LAN server handles multicast in the same way as broadcast and relies on IP filtering on each node to filter out unwanted packets.
[0090]
If an application wants to send to or receive from a multicast address, it must first join the multicast group. When a process on a processor performs a multicast join, the processor's virtual network driver 310 sends a join request to the virtual LAN server 335 via a dedicated RVI. The virtual LAN server then configures the specific multicast MAC address on the interface and, when necessary, notifies the LAN proxy 340, as discussed below. The proxy 340 must keep track of usage counts for each multicast group, so that a multicast address is removed only when no processor belongs to that multicast group.
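The usage counting described above is essentially reference counting per multicast MAC address. The following is a minimal sketch under that reading; the class `MulticastTable` and its method names are hypothetical.

```python
from collections import Counter


class MulticastTable:
    """Tracks how many processors have joined each multicast MAC address."""

    def __init__(self):
        self.usage = Counter()    # multicast MAC -> number of joined processors
        self.configured = set()   # MACs currently configured on the interface

    def join(self, mac):
        self.usage[mac] += 1
        self.configured.add(mac)

    def leave(self, mac):
        if self.usage[mac] == 0:
            return  # nothing joined; ignore a stray leave
        self.usage[mac] -= 1
        if self.usage[mac] == 0:
            # Last member left: only now is the address removed.
            del self.usage[mac]
            self.configured.discard(mac)
```

With two joins and one leave for the same address, the address stays configured; it is removed only after the second leave.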
[0091]
(5. Bridging and routing to other areas)
From the perspective of the system 100, the external network 125 can operate in one of two modes: filtered or unfiltered. In filtered mode, a single MAC address for the entire system is used for all transmitted packets. This hides the virtual MAC addresses of the processing nodes 107 behind the virtual LAN proxy 340 and makes the system appear as a single node on the network 125 (or as multiple nodes behind a bridge or proxy). Because this does not expose the unique link layer information of each internal node 107, some other unique identifier is required to deliver incoming packets correctly. When operating in filtered mode, the destination IP address of each incoming packet is used to uniquely identify the intended recipient, since the MAC address identifies only the system. In unfiltered mode, the virtual MAC addresses of the nodes 107 are visible outside the system and can be used to direct incoming traffic. That is, filtered mode mandates layer 3 switching, while unfiltered mode allows layer 2 switching. Filtered mode requires that some component (in this case the virtual LAN proxy 340) replace the node virtual MAC address with the MAC address of the external network 125 interface on all outgoing packets.
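The difference between the two modes on the receive path can be illustrated with a small lookup function. This is a sketch only; the function name, the mode strings, and the two dictionary tables are assumptions made for the example.

```python
def select_destination(mode, dst_mac, dst_ip, ip_table, mac_table):
    """Pick the internal node for an incoming packet.

    filtered:   every packet carries the system's single MAC, so the
                destination IP address identifies the node (layer 3).
    unfiltered: the node's virtual MAC is visible externally, so the
                destination MAC identifies the node (layer 2).
    """
    if mode == "filtered":
        return ip_table.get(dst_ip)
    if mode == "unfiltered":
        return mac_table.get(dst_mac)
    raise ValueError(f"unknown mode: {mode}")
```

In filtered mode the `dst_mac` argument is ignored entirely, which mirrors the observation above that the MAC address then identifies only the system as a whole.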
[0092]
Some implementations support the ability of a virtual LAN to be connected to an external network. The virtual LAN will therefore have to handle IP addresses that are not locally configured. One embodiment for addressing this imposes the restriction that each virtual LAN so connected is limited to one external broadcast area. The IP addresses and subnet assignments for the internal nodes of the virtual LAN must conform to those of the external area.
[0093]
The virtual LAN server 335 services the external connection by effectively acting as a data link layer bridge, in that it moves packets between the external Ethernet driver 345 and the internal processors and performs no IP processing. However, unlike a true data link layer bridge, the server cannot always rely on unique layer 2 addresses to map traffic from the external network to the internal nodes, and instead the connection may use layer 3 (IP) information to make bridging decisions. To do this, the external connection software extracts the IP address information from each incoming packet and uses this information to identify the correct node 106, so that it can move the packet to that node.
[0094]
A virtual LAN server 335 with an attached external broadcast area must intercept and process packets from and to the external area so that external nodes have a consistent view of the subnets in the broadcast area.
[0095]
When a virtual LAN server 335 having an attached external broadcast area receives an ARP request from an external node, it will relay the request to all internal nodes. The correct node will then compose the response and send the response through the virtual LAN server 335 to the requester. The virtual LAN server cooperates with the virtual LAN proxy 340 so that the proxy handles any necessary MAC address translation on the outgoing request. All ARP responses and ARP advertisements from external sources will be relayed directly to the target node.
[0096]
The virtual Ethernet interface 310 will send all unicast packets with external destinations to the virtual LAN server 335 over the control connection RVI. (The driver may recognize an external destination by the format of its MAC address.) The virtual LAN server will then move the packet to the external network 125 accordingly.
[0097]
When the virtual LAN server 335 receives a broadcast or multicast packet from the internal node, it relays the packet to the external network in addition to the relay of the packet to all the internal virtual LAN members. When the virtual LAN server 335 receives a broadcast or multicast packet from an external source, it relays the packet to all attached internal nodes.
[0098]
Under certain embodiments, interconnecting virtual LANs through the use of IP routers or firewalls is accomplished using a mechanism similar to that used to interconnect physical LANs. One processor is configured on both LANs, and the Linux kernel on that processor must have routing (and perhaps IP masquerading) enabled. Normal IP subnetting and routing semantics will always be preserved, even for two nodes on the same platform.
[0099]
A processor could be configured as a router between two external subnets, between an external and an internal subnet, or between two internal subnets. If an internal node is sending packets via a router, there is no problem because of the point-to-point topology of the internal network. The sender will send directly to the router (i.e., a processor configured with routing logic) without the intervention of the virtual LAN server (i.e., the typical processor-to-processor communication discussed above).
[0100]
When the external node sends a packet to the internal router and the external network 125 is operating in filtered mode, the destination MAC address of the incoming packet will be that of the platform 100. Therefore, the MAC address cannot be used to uniquely identify the packet destination node. For packets whose destination is an internal node on the virtual LAN, the destination IP address in the IP header is used to direct the packet to the appropriate destination node. However, since the router is not the final destination, the destination IP address in the IP header is that of the final destination, not that of the next hop (which is an internal router). Thus, there is nothing in the incoming packet that can be used to direct it to the correct internal node. In order to handle this situation, one implementation imposes the limitation of at most one router exposed to an external network on the virtual LAN. This router is registered by the virtual LAN server 335 as a default destination so that incoming packets that do not have a valid destination are directed to this default node.
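The default-destination rule described above can be sketched as a lookup with a fallback. The class `FilteredModeTable` and its method names are hypothetical, and the check that rejects a second exposed router is this sketch's way of modeling the stated limitation of at most one exposed router per virtual LAN.

```python
class FilteredModeTable:
    """Destination lookup for filtered mode with a single default (exposed) router."""

    def __init__(self):
        self.ip_to_node = {}        # registered destination IP -> internal node
        self.default_router = None  # at most one exposed router per virtual LAN

    def register_node(self, ip, node):
        self.ip_to_node[ip] = node

    def register_exposed_router(self, node):
        if self.default_router is not None and self.default_router != node:
            raise ValueError("only one exposed router is allowed per virtual LAN")
        self.default_router = node

    def destination_for(self, dst_ip):
        # Packets whose destination IP is not a registered internal node
        # are directed to the default router, if one is registered.
        return self.ip_to_node.get(dst_ip, self.default_router)
```

A packet addressed to a final destination beyond the router therefore still reaches the router, even though nothing in the packet names the router itself.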
[0101]
When the external node sends a packet to the internal router and the external network 125 is operating in unfiltered mode, the destination MAC address of the incoming packet will be the virtual MAC address of the internal destination node. The virtual LAN server 335 will then use this virtual MAC address to send the packet directly to the destination internal node. In this case, because the MAC address of the incoming packet uniquely identifies the destination node, any number of internal nodes may function as routers.
[0102]
If a configuration requires multiple routers on a subnet, one router can be designated as the exposed router. This router can in turn route to the other routers when needed.
[0103]
Under certain embodiments, router redundancy is provided by making the router a cluster service and by load balancing or failing over in a stateless manner (i.e., per IP packet rather than per TCP connection).
[0104]
Certain embodiments of the present invention support promiscuous mode functionality by providing switch semantics in which a given port is designated as a promiscuous port, so that all traffic passing through the switch is repeated on the promiscuous port. The nodes that are allowed to listen in promiscuous mode will be assigned administratively at the virtual LAN server.
[0105]
When the virtual Ethernet interface 310 enters promiscuous receive mode, it will send a message to the virtual LAN server 335 over the management RVI. This message will contain all information about the virtual Ethernet interface 310 entering promiscuous mode. When the virtual LAN server receives a promiscuous mode message from a node, it will check its configuration information to determine whether the node is allowed to listen promiscuously. If not, the virtual LAN server will drop the promiscuous mode message without further processing. If the node is allowed to enter promiscuous mode, the virtual LAN server will broadcast the promiscuous mode message to all other nodes on the virtual LAN. The virtual LAN server will also mark the node as promiscuous so that a copy of each incoming external packet can be forwarded to that node. When a promiscuous listener detects any change in its RVI configuration, it will send a promiscuous mode message to the virtual LAN server to update the state of all other nodes on the associated broadcast area. This will update any node that enters or leaves the virtual LAN. When the virtual Ethernet interface 310 exits promiscuous mode, it will send a message to the virtual LAN server notifying it that the interface is leaving promiscuous mode. The virtual LAN server will then send this message to all other nodes on the virtual LAN. The promiscuous configuration will allow the external connection to be placed in promiscuous mode whenever any internal virtual interface is a promiscuous listener. This will make traffic that is external to the platform (but on the same virtual LAN) available to promiscuous listeners.
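The promiscuous mode handshake described above can be modeled as a small registry on the virtual LAN server. This is a sketch under stated assumptions: the class `PromiscuousRegistry`, its method names, and the convention of returning the list of nodes to notify are all introduced for illustration.

```python
class PromiscuousRegistry:
    """Server-side bookkeeping for promiscuous listeners (hypothetical names)."""

    def __init__(self, members, allowed):
        self.members = set(members)   # all nodes on the virtual LAN
        self.allowed = set(allowed)   # nodes administratively permitted to listen
        self.listeners = set()        # nodes currently in promiscuous mode

    def enter(self, node):
        # Drop the message without further processing if the node is not permitted.
        if node not in self.allowed:
            return []
        self.listeners.add(node)
        # The message is broadcast to all other nodes on the virtual LAN.
        return sorted(self.members - {node})

    def leave(self, node):
        self.listeners.discard(node)
        return sorted(self.members - {node})

    def external_promiscuous(self):
        # The external connection is promiscuous while any internal listener exists.
        return bool(self.listeners)
```

The `external_promiscuous` check reflects the final rule above: the external connection follows the union of the internal listeners rather than any single node.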
[0106]
(6. Service cluster management)
A service cluster is a set of services made available at one or more IP addresses (or host names). Examples of these services are HTTP, FTP, Telnet, NFS, etc. An IP address and port number pair represents a particular service type (but not a service instance) presented by the cluster to clients, including clients on the external network 125.
[0107]
FIG. 5 illustrates how an embodiment provides virtual cluster 505 services as a single virtual host to the Internet or other external network 125 via a cluster IP address. All services in cluster 505 are addressed by a single IP address through different ports of that IP address. In the example of FIG. 5, service B is a load balance service.
[0108]
With reference to FIG. 3B, the virtual cluster is supported by the inclusion of virtual cluster proxy (VCP) logic 360 that cooperates with the virtual LAN server 335. In short, the VCP 360 is responsible for handling the distribution of incoming connections, port filtering, and real server connections for each configured virtual IP address. There will be one VCP configured for each cluster IP address.
[0109]
When a packet arrives on the virtual cluster IP address, the virtual LAN proxy logic 340 will send the packet to the VCP 360 for processing. The VCP will then determine where to send the packet based on the packet contents, its internal connection state cache, any load balancing algorithm applied to incoming traffic, and the availability of the configured services. The VCP will relay incoming packets based on both the destination IP address and the TCP or UDP port number. In addition, it will distribute only packets destined for port numbers known to the VCP (or belonging to existing TCP connections). It is the configuration of these ports, and the mapping of each port number to one or more processors, that creates the virtual cluster and makes particular service instances available in that cluster. When multiple instances of the same service on multiple application processors are configured, the VCP can balance the load among the service instances.
[0110]
The VCP 360 maintains a cache of all active connections that exist on the cluster IP address. A load balancing decision is made only when a new connection is established between a client and a service. Once the connection is set up, the VCP uses the source and destination information in the incoming packet header to keep the packets of that connection going to the same processor. In the absence of the ability to identify a client session (e.g., an HTTP session), the connection/load-balancing mapping cache routes packets based on the client address, so that subsequent connections from the same client go to the same processor (making the client session persistent or "sticky"). Since only certain types of services require session persistence, session persistence should be selectable on a per-service-port-number basis.
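The connection cache and optional session stickiness described above can be sketched as follows. The specification does not name a load balancing algorithm, so round-robin selection is used here purely as a stand-in; the class `VirtualClusterProxy`, its fields, and the tuple keys are likewise assumptions for this example.

```python
import itertools


class VirtualClusterProxy:
    """Toy dispatcher for one cluster IP address (round-robin is an assumption)."""

    def __init__(self, services, sticky_ports=()):
        # services: destination port -> list of processors offering that service
        self.services = {port: list(procs) for port, procs in services.items()}
        self.round_robin = {p: itertools.cycle(v) for p, v in self.services.items()}
        self.sticky_ports = set(sticky_ports)  # ports with session persistence
        self.connections = {}                  # (src_ip, src_port, dst_port) -> processor
        self.sticky = {}                       # (src_ip, dst_port) -> processor

    def dispatch(self, src_ip, src_port, dst_port):
        key = (src_ip, src_port, dst_port)
        if key in self.connections:
            return self.connections[key]       # existing connection: same processor
        if dst_port not in self.services:
            return None                        # unknown port: packet is not distributed
        sticky_key = (src_ip, dst_port)
        if dst_port in self.sticky_ports and sticky_key in self.sticky:
            processor = self.sticky[sticky_key]  # same client goes to same processor
        else:
            processor = next(self.round_robin[dst_port])
            if dst_port in self.sticky_ports:
                self.sticky[sticky_key] = processor
        self.connections[key] = processor
        return processor
```

Keeping stickiness as a per-port setting matches the observation that only some services need session persistence; a port left out of `sticky_ports` is balanced per connection.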
[0111]
Responses to ARP requests and routing of ARP responses are handled by the VCP. When a processor sends any ARP packet, it sends it out through the virtual Ethernet driver 310. The packet is then sent to the virtual LAN server 335 for normal ARP processing. The virtual LAN server will broadcast the packet as usual, but will make sure it is not broadcast to any member of the cluster (not just the sender). It will also place information in the packet header TLV indicating to the ARP target that the ARP source can be reached only through the virtual LAN server, and specifically only through the load balancer. The ARP target, whether internal or external, will process the ARP request normally and send the response back through the virtual LAN server. Because the ARP source was a cluster IP address, the virtual LAN server will not be able to determine which processor sent the original request. Therefore, the virtual LAN server will send the response to every cluster member so that it can be handled appropriately. When an ARP packet is sent by a source with the cluster IP address as the target, the virtual LAN server will send the request to all cluster members. Each cluster member will receive the ARP request and process it normally. Each will then construct an ARP response and send it to the source via the virtual LAN server. When the virtual LAN server receives an ARP response from a cluster member, it will drop that response, and the virtual LAN server will itself construct and send an ARP response to the ARP source. Therefore, the virtual LAN server will respond to all ARPs for the cluster IP address. The ARP response will contain the information necessary for the ARP source to send all packets for the cluster IP address to the VCP. For an external ARP source, this will simply be an ARP response with the external MAC address as the source hardware address. For an internal ARP source, this will be the information needed to instruct that source to send packets for the cluster IP address over the virtual LAN management RVI rather than over a directly connected RVI. Any gratuitous ARP packets received will be forwarded to all cluster members. Any gratuitous ARP packets sent by cluster members will be sent normally.
[0112]
(Virtual LAN proxy)
The virtual LAN proxy 340 performs the basic coordination of physical network resources among all processors that have a virtual interface to the external physical network 125. It bridges the virtual LAN server 335 to the external network 125. When the external network 125 is operating in filtered mode, the virtual LAN proxy 340 will translate the internal virtual MAC address of each node to the single external MAC address assigned to the system 100. If the external network 125 is operating in unfiltered mode, no such MAC translation is necessary. The virtual LAN proxy 340 also demultiplexes packets based on IEEE 802.1Q virtual LAN ID tagging information and inserts and removes the corresponding VLAN tags. It also serializes access to the physical Ethernet interface 129 and coordinates the assignment and removal of MAC addresses, such as multicast addresses, on the physical network.
[0113]
When the external network 125 is operating in filtered mode and the virtual LAN proxy 340 receives an outgoing packet (ARP or otherwise) from the virtual LAN server 335, it replaces the internal-format source MAC address with the MAC address of the physical Ethernet device 129. If the external network 125 is operating in unfiltered mode, no such replacement is necessary.
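The source MAC replacement described above is a fixed-offset rewrite of the Ethernet header, in which the destination address occupies bytes 0 to 5 and the source address bytes 6 to 11. The function name and its arguments below are hypothetical; the sketch assumes an untagged Ethernet II frame held in a `bytes` object.

```python
def rewrite_source_mac(frame: bytes, external_mac: bytes, filtered: bool) -> bytes:
    """Replace the source MAC of an outgoing Ethernet frame in filtered mode.

    In unfiltered mode the frame is returned unchanged, since the node's
    virtual MAC address is allowed to be visible externally.
    """
    if not filtered:
        return frame
    if len(external_mac) != 6:
        raise ValueError("a MAC address must be 6 bytes")
    if len(frame) < 14:
        raise ValueError("frame is shorter than an Ethernet header")
    # bytes 0-5: destination MAC, bytes 6-11: source MAC, bytes 12-13: EtherType
    return frame[:6] + external_mac + frame[12:]
```

For ARP packets a complete implementation would presumably also need to rewrite the sender hardware address inside the ARP payload; that step is outside the scope of this sketch.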
[0114]
When the virtual LAN proxy 340 receives an incoming ARP packet, it processes the packet and moves it to the virtual LAN server 335, which relays the packet to the correct destination. If the ARP packet is a broadcast packet, it is relayed to all internal nodes on the virtual LAN. If the packet is a unicast packet, it is sent only to the destination node. The destination node is determined by the IP address in the ARP packet when the external network 125 is operating in filtered mode, or otherwise by the MAC address in the Ethernet header of the ARP packet (rather than a MAC address carried inside the ARP packet itself).
[0115]
(Physical LAN driver)
Under certain embodiments, the connection to the external network 125 is via a Gigabit or 100/10BASE-T Ethernet link connected to the control node. The physical LAN driver 345 is responsible for interfacing with such links. Packets being sent on that interface will be queued to the device in the usual way, including placing those packets in socket buffers. The queue used is the one that the protocol stack uses to queue packets to the device's transmit routine. For incoming packets, the socket buffer that contains the packet will be passed along and the packet data will not be copied (although it will be cloned if needed for multicast operation). Under these implementations, generic Linux network device drivers may be used unaltered in the control node. This facilitates the addition of new devices to the platform without requiring additional device driver work.
[0116]
The physical network interface 345 communicates only with the virtual LAN proxy 340. This prevents the control node from using the external connection in any way that would interfere with the operation of the virtual LANs, and it further improves the security and isolation of user data, i.e., an administrator cannot "sniff" any user's packets.
[0117]
(Load balancing and failover)
Under some embodiments, the redundant connections to the external network 125 will be used alternately to balance packet transmission between the two redundant interfaces to the external network 125. Other embodiments balance the load by configuring the individual virtual network interfaces on alternating control nodes so that the virtual interfaces are evenly distributed between the two control nodes. Another embodiment transmits through one control node and receives through the other.
[0118]
When in filtered mode, there will be one externally visible MAC address to which external nodes send packets for a set of virtual network interfaces. If that adapter goes down, not only must the virtual network interfaces fail over to the other control node, but the MAC address must fail over as well, so that external nodes can continue sending packets to the MAC address already in their ARP caches. Under certain embodiments of the invention, a single MAC address is manipulated, and that MAC address does not need to be remapped when a failed control node recovers.
[0119]
Under another embodiment of the invention, load balancing is performed by allowing transmission on both control nodes but reception through only one. In the case of failover, both transmission and reception are performed through the same control node. In the case of recovery, transmission moves to the recovered control node, since this requires no MAC manipulation.
[0120]
The control node that performs reception holds the IP information for filtering and the multicast address information for multicast MAC configuration. This information is necessary for processing incoming packets and must be failed over if the receiving control node fails. If the transmitting control node fails, the virtual network driver simply needs to start sending outgoing packets to the receiving control node only. No special failover processing is required other than recognizing that the transmitting control node has failed. If the failed control node recovers, the virtual network driver may resume sending outgoing packets to the recovered control node without additional special recovery processing. If the receiving control node fails, the transmitting control node must assume the role of the receive interface. To do this, it must configure all MAC addresses on its physical interface to enable packet reception. Alternatively, both control nodes could have the same MAC address configured on their interfaces, with reception physically disabled on the Ethernet device by the device driver until the control node is ready to receive packets. Failover would then simply enable reception on that device.
[0121]
When any processor joins a multicast group, the interface must be configured with a multicast MAC address, so multicast information must be shared between the control nodes in order for failover to be transparent to that processor. Since the virtual network driver must keep track of multicast group membership anyway, this information is always available to the LAN proxy via the virtual LAN server as needed. Thus, a receive failover will result in multicast group membership being queried from the virtual network drivers to rebuild the local multicast group membership table. This operation is low overhead, requires no special processing except during failover and recovery, and does not require any special replication of data between control nodes. If reception has failed over and the failed control node recovers, only transmission will be transferred to the recovered control node. Thus, the recovery algorithm on the virtual network interface always moves transmission to the recovered control node and leaves reception where it is.
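The transmit and receive roles during failover and recovery, as described above, can be summarized in a small state model. The class `ControlNodePair`, the node labels "A" and "B", and the initial role assignment are assumptions for this sketch, which also assumes that at most one control node is down at any time.

```python
class ControlNodePair:
    """Tracks which of two control nodes handles transmission and reception."""

    def __init__(self, tx="A", rx="B"):
        self.tx = tx   # node through which packets are transmitted
        self.rx = rx   # node through which packets are received

    @staticmethod
    def _other(node):
        return "B" if node == "A" else "A"

    def fail(self, node):
        # After a failure, both transmission and reception use the survivor.
        survivor = self._other(node)
        self.tx = survivor
        self.rx = survivor   # receive failover also requires MAC/multicast state

    def recover(self, node):
        # Only transmission moves to the recovered node; reception stays put,
        # so no MAC address needs to be remapped on recovery.
        self.tx = node
```

Note that after a receive failure and recovery the two nodes end up with their original roles swapped, which is consistent with leaving reception where it is.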
[0122]
Virtual service clusters may also use load balancing and failover.
[0123]
(Multi-cabinet platform)
Some embodiments allow cabinets to be connected together to form a larger platform. Each cabinet will have at least one control node that is used for inter-cabinet connections. Each control node will include a virtual LAN server 335 to handle local connections and traffic. One of the servers is configured to be the master, such as the one located on the control node that has the external connection for the virtual LAN. The other virtual LAN servers will act as proxy servers, or slaves, so that the local processors in their cabinets can participate. The master maintains all virtual LAN state and control, while the proxies relay packets between the processors and the master.
[0124]
Each virtual LAN server proxy maintains an RVI to each master virtual LAN server. Each local processor will connect to the virtual LAN proxy server as if it were the master. When a processor connects and registers its IP and MAC addresses, the proxy will register those IP and MAC addresses with the master. This will cause the master to bind the addresses to the RVI from the proxy. Thus, the master will hold RVI bindings for all internal nodes, whereas a proxy will hold bindings only for the nodes in its own cabinet.
[0125]
When a processor anywhere in a multi-cabinet virtual LAN sends any packet to its virtual LAN server, the packet will be relayed to the master for processing. The master will then perform normal processing on the packet. As necessary for multicast and broadcast, the master will relay the packet to the proxies. The master will also relay unicast packets to a proxy based on the destination IP address of the unicast packet and the IP addresses registered through that proxy. Note that, to the master, a proxy connection looks very much like a node with many configured IP addresses.
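The two-level registration described above, in which a proxy records a local binding and the master binds the same address to the RVI from that proxy, can be sketched as follows. The class names, method names, and integer RVI identifiers are assumptions made for illustration.

```python
class MasterServer:
    """Holds bindings for every internal node in the multi-cabinet virtual LAN."""

    def __init__(self):
        self.bindings = {}   # IP address -> RVI (a local RVI or the RVI to a proxy)

    def register(self, ip, rvi):
        self.bindings[ip] = rvi


class ProxyServer:
    """Holds bindings only for nodes in its own cabinet and forwards registrations."""

    def __init__(self, master, rvi_to_master):
        self.master = master
        self.rvi_to_master = rvi_to_master
        self.bindings = {}   # IP address -> local RVI within this cabinet

    def register(self, ip, local_rvi):
        self.bindings[ip] = local_rvi
        # On the master, the address is bound to the RVI from this proxy,
        # so the proxy looks like one node with many configured IP addresses.
        self.master.register(ip, self.rvi_to_master)
```

On the master, every address registered through a given proxy maps to the same RVI, so the master needs no knowledge of the RVIs inside the remote cabinet.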
[0126]
(Network management logic)
During times when no operating system is running on a processing node, such as during booting or kernel debugging, that node's serial console traffic and boot image requests are sent by switch driver code, located in the BIOS or in the processing node's kernel debugging software, to management software running on the control node (not shown). From there, the console traffic can in turn be accessed either from the high-speed external network 125 or through the management port of the control node. Boot image requests can be satisfied either from the local disk of the control node or from a partition on the external SAN 130. Preferably, the control node 120 is booted and operating normally before anything is done on a processing node. The control node itself is booted or debugged from its management port.
[0127]
Some customers may want to limit booting and debugging of the control node to local access only, by plugging its management port into an on-site computer when needed. Others may choose to enable remote booting and debugging by establishing a secure network segment for management purposes, properly isolated from the Internet, and plugging their management ports into it. Once a control node has been booted and is operating normally, all other management functions for the control node and for the rest of the platform can be accessed from the high-speed external network 125 as well as from the management port, if authorized by the administrator.
[0128]
Serial console traffic to and from each processing node 105 is sent by an operating system kernel driver over the switch structure 115 to management software running on the control node 120. From there, the console traffic of any node can be accessed either through the normal high-speed external network 125 or through the management port of the control node.
[0129]
(Storage device architecture)
One embodiment follows the SCSI model of storage. Each virtual PAN has its own virtual I/O space and issues SCSI commands and receives status within that space. The control node logic translates or transforms the addresses and commands from the PAN as needed and sends them to the SAN 130, which services the commands. From the perspective of the SAN, the client is the platform 100; the actual PAN that issued a command is hidden and anonymous. Because the SAN address space is virtualized, one PAN running on platform 100 may have devices numbered starting with device number 1, and another PAN may also have a device number 1. Each device number 1, however, corresponds to a different and unique portion of SAN storage.
[0130]
Under the preferred embodiment, the administrator can create virtual storage. Each PAN has its own independent view of mass storage. Thus, as described below, a first PAN may have a given device/LUN address mapped to a first location in the SAN, and another PAN may have the same device/LUN address mapped to a second, different location in the SAN. Each processor maps device/LUN addresses to major and minor device numbers, for example to identify disks and partitions. The major and minor device numbers are regarded as physical addresses by the PAN and by the processors within the PAN, but the platform in fact treats them as virtual addresses for the mass storage provided by the SAN. That is, the major and minor device numbers of each processor are mapped to corresponding SAN locations.
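The independence of each PAN's view of storage may be pictured with the following sketch, offered only as an illustration. The table `pan_storage_maps`, the PAN names, the SAN volume labels, and the function `resolve` are hypothetical and are not taken from the specification.

```python
# Hypothetical per-PAN tables: (device, LUN) as seen inside the PAN -> SAN location.
pan_storage_maps = {
    "pan-a": {(1, 0): "san-volume-17"},
    "pan-b": {(1, 0): "san-volume-42"},
}


def resolve(pan, device, lun):
    """Translate a PAN-local device/LUN address to its SAN location."""
    try:
        return pan_storage_maps[pan][(device, lun)]
    except KeyError:
        # The address is not part of this PAN's configuration.
        raise PermissionError(
            f"{pan} has no mapping for device {device}, LUN {lun}"
        ) from None
```

Here device number 1 exists in both PANs, yet each resolves to a different part of the SAN, and an address outside a PAN's configuration is refused rather than passed through.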
[0131]
FIG. 6 illustrates the software components used to implement the storage architecture of an embodiment. The configuration component 605, which typically runs on the control node 120, is in communication with the external SAN 130. The management interface component 610 provides an interface to the configuration component 605 and is in communication with the IP network 125 and thus with the remote management logic 135 (see FIG. 1). Each processor 106 of platform 100 includes an instance of processor-side storage logic 620. Each such instance 620 communicates via two RVI connections 625 with a corresponding instance of the control node side storage logic 615.
[0132]
In short, the configuration component 605 and interface 610 are responsible for discovering the portions of SAN storage that are allocated to the platform 100 and for allowing the administrator to suballocate portions to particular PANs or processors 106. The configuration component 605 is also responsible for communicating the SAN storage allocations to the control node side logic 615. The processor-side storage logic 620 is responsible for communicating the processor's storage requests over the interconnect 110 and switch structure 115 to the control node side logic 615 via the dedicated RVIs 625. Under certain embodiments, a request includes a virtual storage address and a SCSI command. The control node side logic is responsible for receiving and handling such requests: it identifies the corresponding real address in the SAN and converts the commands and protocol to the form appropriate for the SAN, including but not limited to Fibre Channel (Gigabit Ethernet with iSCSI is another exemplary connection).
[0133]
(Configuration component)
The configuration component 605 determines which elements in the SAN 130 are visible to each individual processor 106. It provides a mapping function that translates the device numbers used by a processor (e.g., SCSI targets and LUNs) into the device numbers that are visible to the control node through its attached SCSI and Fibre Channel I/O interfaces 128. It also provides an access management function that prevents a processor from accessing external storage devices that are attached to the control node but are not included in that processor's configuration. The model presented to a processor (and to the system administrators and the applications and users on that processor) makes it appear as if each processor has its own mass storage devices attached to interfaces on that processor.
[0134]
Among other things, this functionality allows the software on a processor 106 to be moved easily to another processor. For example, in one embodiment, the control node may change the PAN configuration in software (without physical re-cabling) to allow a new processor to access the required devices. The new processor can thus inherit the storage personality of another processor.
[0135]
In some implementations, the control node appears as a host on the SAN, but alternative implementations allow the processor to operate as such.
[0136]
As outlined above, the configuration logic discovers the SAN storage allocated to platform 100 (e.g., during platform boot), and this pool is later allocated by the administrator. If discovery is triggered again later, the control node performing the discovery compares the new view with the previous view. Newly available storage is added to the pool of storage that can be allocated by the administrator. Partitions that have disappeared and were not assigned are removed from the pool of available storage that can be distributed to PANs. Partitions that have disappeared and were assigned trigger error messages.
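The comparison of the new discovery view with the previous one can be expressed with set operations. The following is a sketch only; the function name `rediscover`, the partition names, and the representation of each view as a set are assumptions made for the example.

```python
def rediscover(previous, current, assigned):
    """Compare two discovery views, each a set of partition names.

    Returns (added, removed_from_pool, errors):
      added             - newly visible partitions, to be added to the free pool
      removed_from_pool - vanished partitions that were never assigned
      errors            - vanished partitions that were assigned to a PAN
    """
    added = current - previous
    vanished = previous - current
    removed_from_pool = vanished - assigned
    errors = sorted(vanished & assigned)
    return added, removed_from_pool, errors


previous = {"p1", "p2", "p3"}
current = {"p1", "p4"}
assigned = {"p1", "p3"}
added, removed, errors = rediscover(previous, current, assigned)
```

With these sample views, `p4` joins the pool, `p2` is silently dropped from it, and `p3` is reported as an error because it was in use when it disappeared.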
[0137]
(Management interface component)
The configuration component 605 allows management software to access and update the information that describes the device mapping between the devices visible to the control node 120 and the virtual devices visible to the individual processors 106. It also allows access to control information. An assignment can be identified by the processing node together with an identification of the simulated SCSI disk, e.g., by the name of the simulated controller, cable, unit, or logical unit number (LUN).
[0138]
Under certain embodiments, the interface component 610 cooperates with the configuration component to collect and monitor information and statistics such as:
Total number of I/O operations performed
Total number of bytes transferred
Total number of read operations performed
Total number of write operations performed
Total amount of time that I/O was in progress
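The statistics listed above may, for example, be kept as a set of simple counters. The class `IoStats`, its field names, and the `record` method are hypothetical names chosen for this sketch and are not part of the described interface.

```python
from dataclasses import dataclass


@dataclass
class IoStats:
    operations: int = 0          # total I/O operations performed
    bytes_transferred: int = 0   # total bytes transferred
    reads: int = 0               # total read operations
    writes: int = 0              # total write operations
    busy_seconds: float = 0.0    # total time I/O was in progress

    def record(self, kind, nbytes, seconds):
        """Account for one completed operation; kind is "read", "write", or other."""
        self.operations += 1
        self.bytes_transferred += nbytes
        self.busy_seconds += seconds
        if kind == "read":
            self.reads += 1
        elif kind == "write":
            self.writes += 1


stats = IoStats()
stats.record("read", 4096, 0.002)
stats.record("write", 512, 0.001)
stats.record("other", 0, 0.0005)
```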
[0139]
(Processor-side storage logic)
The processor-side logic 620 of the protocol is implemented as a host adapter module that emulates a SCSI subsystem by providing a low-level virtual interface to the operating system on the processor 106. The processor 106 uses this virtual interface to send SCSI I/O commands to the control node 120 for processing.
[0140]
Under embodiments that use redundant control nodes 120, the processing node 105 includes one instance of logic 620 per control node 120. Under certain embodiments, the processors refer to storage using physical device numbering rather than logical numbering. That is, an address is specified as a device name that identifies the LUN, SCSI target, channel, host adapter, and control node 120 (e.g., node 120a or 120b). As shown in FIG. 8, one embodiment maps the target (T) and LUN (L) to a host adapter (H), channel (C), mapped target (mT), and mapped LUN (mL).
[0141]
FIG. 7 shows an exemplary architecture for the processor side logic 720. The logic 720 includes a device type specific driver (eg, disk driver) 705, an intermediate level SCSI I / O driver 710, and wrapper and interconnect logic 715.
[0142]
The device-type-specific driver 705 is a conventional driver that is supplied with the operating system and associated with a specific device type.
[0143]
The intermediate-level SCSI I/O driver 710 is a conventional intermediate-level driver that is called by the device-type-specific driver 705 once the driver 705 determines that the device is a SCSI device.
[0144]
The wrapper and interconnect logic 715 is invoked by the intermediate-level SCSI I/O driver 710. This logic provides a SCSI subsystem interface, thereby emulating the SCSI subsystem. In one implementation using the Giganet structure, logic 715 is responsible for wrapping SCSI commands as needed and for interacting with the Giganet and RCLAN interfaces to cause the NIC to send packets to the control node via the dedicated RVIs to the control node, as described above. The header information of the Giganet packet is modified to indicate that it is a storage packet and contains other information, described below in context. Although not illustrated in FIG. 7, the wrapper logic 715 may use the RCLAN layer to support and utilize redundant interconnects 110 and structures 115.
[0145]
For embodiments using the Giganet structure 115, each RVI of connection 725 is assigned a virtual interface (VI) number from a range of 1024 available VIs. For the two endpoints to communicate, switch 115 is programmed with a bidirectional path between the pair (control node switch port, control node VI number) and the pair (processor node 105 switch port, processor node VI number).
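As an illustration of the pairing just described, the sketch below models a switch table keyed by (switch port, VI number) pairs. It is a simplified model under stated assumptions: the class `SwitchPaths`, the sequential allocation of VI numbers per port, and the port names are invented for the example and do not describe how the Giganet hardware is actually programmed.

```python
MAX_VIS = 1024  # VIs available per port, per the text


class SwitchPaths:
    """Toy model of bidirectional paths between (port, VI number) endpoints."""

    def __init__(self):
        self.paths = {}     # endpoint -> peer endpoint
        self.next_vi = {}   # port -> next unused VI number (assumed sequential)

    def allocate_vi(self, port):
        vi = self.next_vi.get(port, 0)
        if vi >= MAX_VIS:
            raise RuntimeError(f"no free VI on port {port}")
        self.next_vi[port] = vi + 1
        return vi

    def connect(self, port_a, port_b):
        """Allocate a VI at each end and program the path in both directions."""
        a = (port_a, self.allocate_vi(port_a))
        b = (port_b, self.allocate_vi(port_b))
        self.paths[a] = b
        self.paths[b] = a
        return a, b


switch = SwitchPaths()
ctl_end, cpu_end = switch.connect("control-port", "processor-port")
ctl_end2, cpu_end2 = switch.connect("control-port", "processor-port")
```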
[0146]
A separate RVI is used for each type of message sent in a given direction. Thus, there is always a receive buffer pending on each RVI for a message that can be sent from the other side of the protocol. In addition, since only one type of message is sent in a given direction on each RVI, the receive buffers posted on each RVI channel can be sized appropriately for the maximum message length that the protocol uses for that type of message. Under other embodiments, all possible message types are multiplexed onto a single RVI rather than using two RVIs. The protocol and message formats do not specifically require the use of two RVIs, and the messages themselves carry message type information in their headers so that they can be demultiplexed.
[0147]
One of the two channels is used to exchange SCSI command (CMD) and status (STAT) messages. The other channel is used to exchange buffer (BUF) and transfer (TRAN) messages; this channel is also used to handle the data payload of SCSI commands.
[0148]
The CMD message includes control information, the SCSI command to be executed, and the virtual address and size of the I/O buffer in the node 105. The STAT message includes control information and a completion status code that reflects any error that may have occurred while processing the SCSI command. The BUF message includes control information and the virtual address and size of the I/O buffer in the control node 120. The TRAN message includes control information and is used to confirm the successful transfer of data from the node 105 to the control node 120.
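The four message types and their assignment to the two channels, together with the header-based demultiplexing used when a single RVI is shared, may be sketched as follows. The names `MsgType`, `Header`, `CHANNEL`, and `demultiplex`, and the reduction of the header to a type and a sequence number, are assumptions for illustration; the actual header layout is not given in this section.

```python
from dataclasses import dataclass
from enum import Enum


class MsgType(Enum):
    CMD = "command"
    STAT = "status"
    BUF = "buffer"
    TRAN = "transfer"


# Channel assignment from the text: CMD/STAT share one RVI, BUF/TRAN the other.
CHANNEL = {MsgType.CMD: 0, MsgType.STAT: 0, MsgType.BUF: 1, MsgType.TRAN: 1}


@dataclass
class Header:
    msg_type: MsgType
    sequence: int  # ties replies to the originating command


def demultiplex(messages):
    """Sort (header, payload) pairs received on one shared RVI by message type."""
    queues = {t: [] for t in MsgType}
    for header, payload in messages:
        queues[header.msg_type].append(payload)
    return queues


queues = demultiplex([
    (Header(MsgType.CMD, 1), "READ"),
    (Header(MsgType.STAT, 1), "GOOD"),
    (Header(MsgType.CMD, 2), "WRITE"),
])
```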
[0149]
The processor-side wrapper logic 715 examines the SCSI command to be sent to determine whether the command requires a data transfer and, if so, in which direction. Based on this analysis, the wrapper logic 715 sets the appropriate flag information in the message header. The section describing the control node side logic explains how the flag information is used.
[0150]
Under certain embodiments of the present invention, the link 725 between the processor-side storage logic 720 and the control node side storage logic 715 may be used to communicate control messages that are not part of the SCSI protocol and are not communicated to the SAN 130. Instead, these control messages are processed by the control node side logic 715.
[0151]
Protocol control messages are always generated on the processor side of the protocol and are sent to the control node side of the protocol over one of the two virtual interfaces (VIs) that connect the processor-side logic 720 to the control node side storage logic 715. The message header used for a protocol control operation is the same as the command message header, except that a different flag bit is used to identify the message as a protocol control message. The control node 120 performs the requested operation and responds over the RVI with a message header that is the same as that used by a status message. This approach avoids the need for a separate RVI for protocol control operations, which are rarely used.
[0152]
Under certain embodiments using redundant control nodes, the processor-side logic 720 detects an error from the issued command and reissues the command to another control node accordingly. This retry will be implemented in an intermediate level driver 710.
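The retry behavior with redundant control nodes may be illustrated as below. This is a sketch only: the function `issue_with_failover`, the use of `IOError` to represent a detected command error, and the two stand-in node functions are assumptions, and the sketch does not reflect the internals of driver 710.

```python
def issue_with_failover(command, control_nodes):
    """Try each control node in turn; return (node_name, result) of the first success."""
    last_error = None
    for name, send in control_nodes:
        try:
            return name, send(command)
        except IOError as err:
            last_error = err  # reissue the command on the next control node
    raise IOError(f"all control nodes failed: {last_error}")


def failing_node(command):
    # Stand-in for a control node whose command path reports an error.
    raise IOError("control node 120a unreachable")


def healthy_node(command):
    # Stand-in for a control node that completes the command.
    return f"{command}: GOOD"


node, result = issue_with_failover(
    "TEST UNIT READY", [("120a", failing_node), ("120b", healthy_node)]
)
```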
[0153]
(Control node side storage device logic)
Under certain embodiments, the control node side storage logic 715 is implemented as a device driver module. Logic 715 provides a device-level interface to the operating system on control node 120. This device-level interface is also used to access the configuration logic 705. Once this device driver module is initialized, it responds to protocol messages from all processors 106 of the platform 100. All configuration activity is performed through the device-level interface. All I/O activity is performed through messages that are sent and received over interconnect 110 and switch structure 115. At control node 120 there is one instance of logic 715 per processor node 105 (although it is shown as a single box in FIG. 7 for simplicity). Under certain implementations, the control node side logic 715 communicates with the SAN 130 via the FCP or FCP-2 protocol, iSCSI, or other protocols that use the SCSI-2 or SCSI-3 command set across various media.
[0154]
As described above, the processor-side logic sets a flag in the RVI message header that indicates whether a data flow is associated with the command and, if so, in which direction. The control node side storage logic 715 receives the message from the processor-side logic and then analyzes the header information to determine how to act, e.g., whether to allocate a buffer. The logic further converts the address information contained in the message from the processor to the corresponding mapped SAN address and issues a command to the SAN 130 (e.g., via FCP or FCP-2).
[0155]
A SCSI command that does not require a SCSI data transfer phase, such as the TEST UNIT READY command, is handled by the processor-side logic 720 sending a single command message on the RVI used for command messages and by the control node side logic sending back a single status message on the same RVI. More specifically, the processor side of the protocol builds a message with a standard message header, a new sequence number for this command, the desired SCSI target and LUN, the SCSI command to be executed, and a list size of zero. The control node side logic receives the message, extracts the SCSI command information, and transmits it to the SAN 130 via the interface 128. After the control node receives the command completion callback, it builds a status message to the processor using a standard message header, the sequence number for this command, the status of the completed command, and, optionally, the REQUEST SENSE data if the command completed with a CHECK CONDITION status.
[0156]
A SCSI command that requires a SCSI data transfer phase to transfer data from the SCSI device to host memory, such as a READ command, is handled by the processor-side logic sending a command message to the control node side logic 715, and by the control node responding with one or more RDMA WRITE operations into the memory of the processor node 105 followed by a single status message. More specifically, the processor-side logic 720 builds a command message with a standard message header, a new sequence number for this command, the desired SCSI target and LUN, the SCSI command to be executed, and a list of the regions of memory in which the data from the command is to be stored. The control node side logic 715 allocates a temporary memory buffer to hold the data from the SCSI operation while the SCSI command executes on the control node. The control node side logic 715 sends the SCSI command to the SAN 130 for processing and, after the command completes, sends the data back to the memory of the processor 105 with a sequence of one or more RDMA WRITE operations. It then builds a status message with a standard message header, the sequence number for this command, the status of the completed command, and, optionally, REQUEST SENSE data if the command completed with a SCSI CHECK CONDITION status.
[0157]
A SCSI command that requires a SCSI data transfer phase to transfer data from host memory to the SCSI device, such as a WRITE command, is handled by the processor-side logic 720 sending a single command message to the control node side logic 715, followed by one or more BUF messages from the control node side logic 715 to the processor-side logic, one or more RDMA WRITE operations from the processor-side storage logic into memory in the control node, one or more TRAN messages from the processor-side logic to the control node side logic, and a single status message from the control node side logic to the processor-side logic. The use of BUF messages to communicate the location of the control node's temporary buffer memory to the processor-side storage logic, and the use of TRAN messages to indicate completion of the RDMA WRITE data transfer, result from the lack of an RDMA READ capability in the underlying structure. If the underlying structure supports RDMA READ operations, a different sequence of corresponding actions may be used. More specifically, the processor-side logic 720 builds a CMD message with a standard message header, a new sequence number for this command, the desired SCSI target and LUN, and the SCSI command to be executed. The control node side logic 715 allocates a temporary memory buffer to hold the data for the SCSI operation while the SCSI command executes on the control node. Next, the control node side of the protocol builds a BUF message with a standard message header, the sequence number for this command, and a list of the regions of virtual memory used for the temporary buffer on the control node. The processor-side logic 720 then sends the data to the control node memory in a series of one or more RDMA WRITE operations, and builds a TRAN message with a standard message header and the sequence number for this command. The control node side logic then sends the SCSI command to the SAN 130 for processing and, after receiving the command completion, builds a STAT message with a standard message header, the sequence number for this command, the status of the completed command, and, optionally, REQUEST SENSE data if the command completed with a CHECK CONDITION status.
[0158]
Under some implementations, the CMD message includes a list of the regions of virtual memory in which the data for the command is stored. The BUF and TRAN messages also include an index field that allows the control node side of the protocol to send a separate BUF message for each entry in the region list of the CMD message. The processor side of the protocol responds to such a message by performing RDMA WRITE operations for the amount of data described in the BUF message, followed by a TRAN message to indicate completion of the transfer of that single segment of data.
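The three classes of command handling described in the preceding paragraphs (no data phase, device-to-host, and host-to-device) can be summarized as message sequences. The sketch below is illustrative only: the function `message_sequence`, the "P" and "C" labels for the processor side and control node side, and the strict per-segment ordering of BUF, RDMA WRITE, and TRAN are assumptions; the text permits one or more of each without fixing their interleaving.

```python
def message_sequence(direction, segments=1):
    """List the protocol steps for one SCSI command.

    direction: "none" (e.g. TEST UNIT READY), "read", or "write".
    segments:  number of buffer regions transferred.
    Each step is (sender, action); P = processor side, C = control node side.
    """
    steps = [("P", "CMD")]
    if direction == "read":
        # The control node pushes the data into processor memory.
        steps += [("C", "RDMA WRITE")] * segments
    elif direction == "write":
        # Without RDMA READ, the control node advertises a buffer (BUF),
        # the processor pushes the data, then confirms with TRAN.
        for _ in range(segments):
            steps += [("C", "BUF"), ("P", "RDMA WRITE"), ("P", "TRAN")]
    elif direction != "none":
        raise ValueError(f"unknown direction: {direction}")
    steps.append(("C", "STAT"))
    return steps
```

For example, `message_sequence("write", segments=2)` yields a CMD, two rounds of BUF, RDMA WRITE, and TRAN, and a final STAT, whereas a command with no data phase reduces to CMD followed by STAT.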
[0159]
The protocol between the processor side logic 720 and the control node side logic 715 enables scatter-gather input / output operations. This function allows the data contained in the I / O request to be read or written from several separate areas of virtual and / or physical memory. This allows multiple non-contiguous buffers to be used for the request on the control node.
[0160]
As mentioned above, the configuration logic 705 is responsible for discovering the SAN storage allocated to the platform and for interacting with the interface logic 710 to allow administrators to suballocate storage to specific PANs. As part of this assignment, the configuration logic 705 creates and maintains a storage data structure 915 that includes information identifying the correspondence between processor addresses and actual SAN addresses. FIG. 7 shows such a structure. The correspondence may be between a processing node and an identification of the simulated SCSI disk, e.g., by the name of the simulated controller, cable, unit, or logical unit number (LUN).
[0161]
(Management logic)
Management logic 135 is used to interface to the control node software in order to provision a PAN. Among other things, logic 135 allows the administrator to establish the PAN's virtual network topology, its visibility to external networks (e.g., as a service cluster), and the types of devices on the PAN, e.g., bridges and routers.
[0162]
Logic 135 also interfaces with the storage management interface logic 710 so that the administrator can define storage for the PAN during or after the initial assignment. The configuration definition includes the storage correspondence (SCSI to SAN) discussed above and access control permissions.
[0163]
As described above, each PAN and each processor has a personality defined by its virtual networking (including its virtual MAC addresses) and its virtual storage. The structures that record such personality may be accessed by management logic to implement processor clustering, as described below. They may also be accessed by an administrator, as described above, or by an agent acting for the administrator. For example, an agent may be used to reconfigure a PAN in response to certain events, such as the time of year, or in response to certain loads on the system.
[0164]
The processor operating system software includes serial console driver code that sends the node's console I/O traffic through the Giganet switch 115 to management software running on the control node. From there, the management software can make any node's console I/O stream accessible through the control node's management ports (its low-speed Ethernet port and its emergency management port) or via the high-speed external network 125, as authorized by the administrator. Console traffic can be recorded for auditing and historical purposes.
[0165]
(Cluster management logic)
FIG. 9 illustrates the cluster management logic of an embodiment. The cluster management logic 905 accesses the data structure 910 that records the network information, such as the PAN network topology, MAC address assignment in the PAN, and the like. In addition, cluster management logic 905 also accesses a data structure 915 that records the storage device correspondence of the various processors 106. In addition, the cluster management logic 905 also accesses a data structure 920 that records free resources such as unassigned processors in the platform 100.
[0166]
In response to a processor error event or an administrator command, the cluster management logic 905 can change the data structures to "move" the storage and network identity of a given processor to a new processor. In this way, the new processor "inherits" the personality of the previous processor. The cluster management logic 905 can be made to do this in order to swap a new processor into the PAN in place of one that has failed.
[0167]
The new processor inherits the previous processor's MAC address and behaves like the previous one. When the new processor boots, the control node communicates connectivity information to it and updates the connectivity information of the non-failed processors as needed. For example, in some implementations, the RVI connections of the other processors are updated transparently; that is, software on the other processors need not be involved in establishing a connection to the newly swapped-in processor. In addition, the new processor inherits the storage correspondence of the previous one, and thus inherits the persisted state of the previous processor.
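The "move" of a processor's personality can be pictured as re-keying the records held in the network, storage, and free-resource data structures. The sketch below is a simplified illustration: the function `fail_over`, the dictionary and list representations, and the processor and volume names are hypothetical stand-ins for data structures 910, 915, and 920.

```python
def fail_over(failed, network, storage, free_pool):
    """Move the network and storage identity of `failed` to a free processor."""
    if not free_pool:
        raise RuntimeError("no free processor available")
    replacement = free_pool.pop()
    # Re-key both records so the replacement inherits the MAC address and storage map.
    network[replacement] = network.pop(failed)
    storage[replacement] = storage.pop(failed)
    return replacement


network = {"cpu-3": {"mac": "02:00:00:00:00:03", "pan": "pan-a"}}  # cf. structure 910
storage = {"cpu-3": {(0, 0): "san-volume-17"}}                     # cf. structure 915
free_pool = ["cpu-9"]                                              # cf. structure 920
new_cpu = fail_over("cpu-3", network, storage, free_pool)
```

After the call, the replacement processor holds the failed processor's virtual MAC address and storage correspondence, and the free pool is one processor smaller.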
[0168]
Among other advantages, this allows a free pool of resources, including processors, to be shared across the entire platform rather than within a single PAN. Free resources (which may be kept as such to improve system reliability and fault tolerance) can thus be used more efficiently.
[0169]
When a new processor is "swapped in," it will need to ARP again to learn the associations between IP addresses and MAC addresses.
[0170]
(Alternatives)
Because each Giganet port in switch structure 115 can support 1024 simultaneous virtual interface connections and keep them separated from one another with hardware protection, the operating system can safely share a node's Giganet ports with application programs. This allows application programs to connect directly without having to work through the full stack of driver code. To do this, an operating system call establishes a virtual interface channel and memory-maps its buffers and queues into the application's address space. In addition, a library encapsulating the low-level details of interfacing to the channel would facilitate the use of such virtual interface connections. The library could also automatically establish redundant virtual interface channel pairs and manage sharing and failover between them without requiring any effort or awareness from the calling application.
[0171]
The above embodiments emulate Ethernet internally over an ATM-like structure. The design may be modified to use an internal Ethernet structure, which would simplify much of the architecture, for example by removing the need for emulation capabilities. If the external network communicates according to ATM, another variation would use ATM internally without Ethernet emulation, with ATM communicated externally to the external network when so addressed. A further variation would use ATM internally to the platform (i.e., without Ethernet emulation), with only external communications converted to Ethernet. This would streamline internal communication but would require emulation logic at the control node.
[0172]
Some embodiments deploy PANs based on software configuration commands. It will be appreciated that deployment may also be under programmatic control. For example, more processors may be deployed under software control during peak hours of operation for a PAN, or more or less storage space may be allocated to the PAN under the control of a software algorithm.
[0173]
The scope of the invention is not limited to the embodiments described above, but is defined by the appended claims; it will be recognized that these claims encompass modifications of and improvements to what has been described.
[Brief description of the drawings]
[0174]
FIG. 1 is a system diagram illustrating an embodiment of the present invention.
FIG. 2A is a diagram illustrating a communication link established in accordance with an embodiment of the present invention.
FIG. 2B is a diagram illustrating a communication link established in accordance with an embodiment of the present invention.
FIG. 2C is a diagram illustrating a communication link established in accordance with an embodiment of the present invention.
FIG. 3A is a diagram illustrating the network software architecture of an embodiment of the present invention.
FIG. 3B is a diagram illustrating the network software architecture of an embodiment of the present invention.
FIG. 4A is a flow diagram illustrating driver logic in accordance with an embodiment of the present invention.
FIG. 4B is a flow diagram illustrating driver logic in accordance with an embodiment of the present invention.
FIG. 4C is a flow diagram illustrating driver logic according to an embodiment of the present invention.
FIG. 5 illustrates a service cluster according to an embodiment of the present invention.
FIG. 6 illustrates the storage device software architecture of an embodiment of the present invention.
FIG. 7 illustrates the processor side storage logic of an embodiment of the present invention.
FIG. 8 illustrates storage address mapping logic of an embodiment of the present invention.
FIG. 9 illustrates the cluster management logic of an embodiment of the present invention.
[Explanation of symbols]
[0175]
105 Processing node
115 Switch structure
120 Control node
135 Management logic
206 Switch
208 Switch
210 Processor logic
214 Virtual switch logic
305 IP stack
310 Virtual network driver
320 Giganet driver
325 Giganet driver
335 Virtual LAN server
340 Virtual LAN proxy
350 ARP logic
505 Cluster
605 Storage device configuration logic
610 Management interface
615 Control node side storage device logic
905 Cluster management logic
910 Network data structure
915 Storage device data structure
920 Free resource data structure

Claims

A method of emulating a switched Ethernet local area network in a platform having a plurality of computer processors, a switch structure, and point-to-point links to the processors, the method comprising:
providing Ethernet driver emulation logic executing on at least two of the computer processors;
providing switch emulation logic executing on at least one of the computer processors;
establishing a virtual interface between the switch emulation logic and each computer processor having Ethernet driver emulation logic executing thereon, to enable software communication therebetween, each virtual interface defining a software communication path from one computer processor to another computer processor via the switch structure;
establishing a virtual interface between each computer processor having Ethernet driver emulation logic executing thereon and every other computer processor having Ethernet driver emulation logic executing thereon;
when the virtual interface between one computer processor and another computer processor is operating so as to satisfy a predetermined criterion, the Ethernet driver emulation logic of the one computer processor unicast communicating with the other computer processor via the virtual interface defining the software communication path therebetween; and
when the virtual interface between one computer processor and another computer processor is not operating so as to satisfy the predetermined criterion, the Ethernet driver emulation logic of the one computer processor unicast communicating with the other computer processor via the virtual interface to the switch emulation logic, the switch emulation logic transmitting the unicast communication to the other computer processor.

The method of claim 1, wherein each of the computer processors having Ethernet driver emulation logic executing thereon is associated with a virtual MAC address, the virtual MAC address being formed according to a rule that identifies one of the plurality of computer processors and being different from the MAC address of any computer connected to an external network.

The method of claim 2, wherein the platform is connected to an external network via interface logic for communicating with the external network, the external network interface logic being associated with its own MAC address, and wherein messages are communicated on the external network using the MAC address of the external network interface logic.

The method of claim 1, wherein a first computer processor uses a first virtual interface to unicast communicate with a second computer processor, and the second computer processor uses a different virtual interface to communicate with the first computer processor.

The method of claim 1, wherein each computer processor includes switch structure driver logic for communicating over a point-to-point link, which also includes a checksum capability, the Ethernet driver emulation logic. Includes a checksum capability, but if the switch structure driver logic has already checksummed the message, the method inhibits such checksum functionality.

The method of claim 5, wherein the switch structure driver logic implements a reliable communication protocol to ensure receipt of messages on the switch structure.

The method according to claim 1, wherein the switch structure and the point-to-point link are arranged in a redundant configuration.

The method of claim 1, wherein the Ethernet driver emulation logic broadcasts a message by sending the message to the switch emulation logic via a virtual interface, and the switch emulation logic receives the broadcast message from the virtual interface, clones it, and sends the cloned messages to the other computer processors in the network.

The method of claim 1, wherein the switch emulation logic defines and maintains computer processor membership to an emulated network.

The method of claim 1, wherein the Ethernet driver emulation logic transmits a message that is larger than a maximum transmission unit (MTU) size.
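
The unicast path selection and broadcast cloning recited in the method claims above can be illustrated with a minimal sketch. The class names `SwitchEmulator` and `Node`, the `link_ok` flag (standing in for the "predetermined evaluation criterion"), and the in-memory inboxes are all assumptions made for illustration; none of them come from the specification.

```python
class SwitchEmulator:
    """Hypothetical switch emulation logic: relays unicast and clones broadcasts."""

    def __init__(self):
        self.inboxes = {}  # node name -> that node's inbox list

    def attach(self, node):
        self.inboxes[node.name] = node.inbox

    def relay(self, dst, frame):
        # Fallback path: forward a unicast frame on behalf of the sender.
        self.inboxes[dst].append(("via-switch", frame))

    def broadcast(self, src, frame):
        # Deliver one copy to every attached node except the sender.
        for name, inbox in self.inboxes.items():
            if name != src:
                inbox.append(("broadcast", frame))


class Node:
    """Hypothetical processor node running Ethernet driver emulation."""

    def __init__(self, name, switch):
        self.name = name
        self.inbox = []
        self.switch = switch
        self.direct = {}  # peer name -> (peer node, link_ok flag)
        switch.attach(self)

    def connect(self, peer, link_ok=True):
        # link_ok stands in for the "predetermined evaluation criterion".
        self.direct[peer.name] = (peer, link_ok)

    def unicast(self, dst, frame):
        peer, link_ok = self.direct[dst]
        if link_ok:
            peer.inbox.append(("direct", frame))  # direct virtual interface
        else:
            self.switch.relay(dst, frame)  # go through the switch emulator

    def broadcast(self, frame):
        self.switch.broadcast(self.name, frame)
```

In this sketch a sender never needs to know whether a frame went directly or through the switch emulator; the choice is made per destination at send time.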

A system for emulating a switched Ethernet local area network, comprising:
A plurality of computer processors;
A switch structure and point-to-point links to the processors;
Virtual interface logic for establishing virtual interfaces across the switch structure and point-to-point links, each virtual interface defining a software communication path from one computer processor to another via the switch structure;
Ethernet driver emulation logic executing on at least two computer processors;
Switch emulation logic executing on at least one computer processor;
Logic for establishing a virtual interface between the switch emulation logic and each computer processor having Ethernet driver emulation logic executing thereon, to enable software communication therebetween;
The switch emulation logic including logic for receiving a message from one of the virtual interfaces to a computer processor having Ethernet driver emulation logic executing thereon and, in response to address information associated with the message, transmitting the message to another computer processor having Ethernet driver emulation logic executing thereon;
Logic for establishing a virtual interface between each computer processor having Ethernet driver emulation logic executing thereon and every other computer processor having Ethernet driver emulation logic executing thereon; and
The Ethernet driver emulation logic including logic for unicast communication with another computer processor in the emulated Ethernet network via the virtual interface that defines a software communication path therebetween when that virtual interface is operating so as to satisfy a predetermined evaluation criterion, and via the switch emulation logic when the virtual interface does not satisfy the predetermined evaluation criterion.

The system of claim 11, wherein each of the computer processors having Ethernet driver emulation logic executing thereon is associated with a virtual MAC address, the virtual MAC address being formed according to a rule that identifies one of the plurality of computer processors and distinguishes it from the MAC addresses that connect computer processors to an external network.

The system of claim 12, further comprising external network interface logic for communicating with an external network, wherein the external network interface logic is associated with its own MAC address, and the switch emulation logic includes logic for sending a message to the external network interface logic for communication over the external network, such message being communicated on the external network using the MAC address of the external network interface logic.

The system of claim 11, wherein a first computer processor uses a first virtual interface for unicast communication with a second computer processor, and the second computer processor uses a different virtual interface to communicate with the first computer processor.

The system of claim 11, wherein each computer processor includes switch structure driver logic for communicating over the point-to-point links, the switch structure driver logic including a checksum capability, and wherein the Ethernet driver emulation logic includes a checksum capability together with logic for suppressing the checksum function in the Ethernet driver emulation logic when the switch structure driver logic performs a checksum of a message.

The system of claim 15, wherein the switch structure driver logic implements a reliable communication protocol to ensure receipt of messages across the switch structure.

The system according to claim 11, wherein the switch structure and the point-to-point links are arranged in a redundant configuration.

The system of claim 11, wherein the Ethernet driver emulation logic includes logic for broadcasting a message by sending the message to the switch emulation logic via a virtual interface, and the switch emulation logic includes broadcast logic for receiving the broadcast message from the virtual interface, cloning it, and transmitting the cloned messages to the other computer processors in the network.

The system of claim 11, wherein the switch emulation logic includes logic for defining and maintaining computer processor membership to an emulated network.

The system of claim 11, wherein the Ethernet driver emulation logic includes logic for transmitting a message that is larger than a maximum transmission unit (MTU) size.

A method of implementing an Address Resolution Protocol (ARP) on a computing platform with multiple processors, comprising:
Defining the topology of an Ethernet network to be emulated on the computing platform, the topology including processor nodes and a switch node;
Assigning a set of processors from the plurality of processors to serve as the processor nodes;
Assigning a processor to act as the switch node;
Assigning a virtual MAC address to each processor node of the emulated Ethernet network;
Allocating virtual interfaces on the underlying physical network to provide direct software communication from every processor node to every other processor node, each virtual interface having a corresponding identification;
A processor node communicating an ARP request to the switch node, wherein the ARP request includes an IP address;
The switch node communicating the ARP request to all other processor nodes in the emulated Ethernet network;
The processor node that is associated with the IP address communicating to the switch node an ARP response that includes the virtual MAC address of the processor node associated with the IP address; and
The switch node receiving the ARP response and modifying the ARP response to include a virtual interface identification for the virtual interface that the processor node issuing the ARP request should use for subsequent communication with the processor node associated with the IP address.

The method of claim 21, wherein the underlying physical network is a point-to-point network connecting the plurality of processors.

The method of claim 21, wherein a subset of the processors is organized as a cluster, one of the processors in the cluster is a load balancing processor node, and, when any processor in the cluster responds to an ARP request, the switch node modifies the ARP response to include the virtual interface identification for the load balancing processor node.

The method of claim 21, wherein the switch node is in communication with an external IP network and the act of communicating an ARP response includes identifying the ARP response as being from a processor node of the platform.
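
The ARP handling recited above, in which the switch node adds a virtual interface identification to the response and substitutes the load balancing node's interface for cluster members, might be sketched as follows. This is only an illustration: `ArpReply`, `SwitchNode`, the table layouts, and the address strings are hypothetical, and a dictionary lookup stands in for broadcasting the request to every processor node and letting the owner of the IP address answer.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ArpReply:
    ip: str
    virtual_mac: str
    vi_id: Optional[int] = None  # filled in by the switch node


class SwitchNode:
    """Hypothetical switch node that relays ARP and annotates replies."""

    def __init__(self):
        self.ip_to_mac = {}   # IP -> virtual MAC of the owning processor node
        self.vi_table = {}    # (requester, target virtual MAC) -> virtual interface id
        self.cluster_lb = {}  # cluster member MAC -> load balancer MAC

    def resolve(self, requester, ip):
        # Stand-in for broadcasting the request and collecting the owner's reply.
        mac = self.ip_to_mac.get(ip)
        if mac is None:
            return None  # no processor node owns this IP address
        reply = ArpReply(ip, mac)
        # For cluster members, point the requester at the load balancer's interface.
        target = self.cluster_lb.get(mac, mac)
        return replace(reply, vi_id=self.vi_table[(requester, target)])
```

The requester can then cache the returned virtual interface identification alongside the virtual MAC address and use it for later unicast traffic.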

An Address Resolution Protocol (ARP) system, comprising:
A computing platform having a plurality of processors connected by an underlying physical network;
Logic, executable on one of the processors, for defining the topology of an Ethernet network to be emulated on the computing platform, the topology comprising processor nodes and a switch node;
Logic, executable on one of the processors, for assigning a set of processors from the plurality to act as the processor nodes;
Logic, executable on one of the processors, for assigning a virtual MAC address to each processor node of the emulated Ethernet network;
Logic, executable on one of the processors, for allocating virtual interfaces on the underlying physical network to provide direct software communication from each processor node to every other processor node, each virtual interface having a corresponding identification;
Each processor node having ARP request logic for communicating an ARP request to the switch node, wherein the ARP request includes an IP address;
The switch node including ARP request broadcast logic for communicating the ARP request to all other processor nodes in the emulated Ethernet network;
Each processor node having ARP response logic for determining whether it is the processor node associated with the IP address in an ARP request and, if so, providing an ARP response to the switch node, wherein the ARP response includes the virtual MAC address of the processor node associated with the IP address; and
The switch node including ARP response logic for receiving the ARP response and modifying the ARP response to include a virtual interface identification for the node that issued the ARP request.

The system of claim 25, wherein the underlying physical network is a point-to-point network connecting the plurality of processors.

The system of claim 25, wherein a subset of the processors is organized as a cluster, one of the processors in the cluster is a load balancing processor node, and the switch node includes logic for detecting whether an ARP response from a processor node is from any processor in the cluster and, if so, modifying the ARP response to include the virtual interface identification for the load balancing processor node.

The system of claim 25, wherein the switch node is in communication with an external IP network, and the processor node ARP response logic includes logic for identifying that the ARP response is from a processor node of the platform.

A computer processing platform, comprising:
A plurality of computer processors connected to an internal communication network;
At least one control node in communication with an external communication network and an external storage device network having an external storage device address space, wherein the at least one control node is connected to the internal network and thereby communicates with the plurality of computer processors; and
Configuration logic for defining and establishing a virtual processing area network having a corresponding set of computer processors from the plurality of processors, a virtual local area communication network that provides communication among the set of computer processors but excludes processors from the plurality that are not in the defined set, and a virtual storage space with a defined correspondence to the address space of the storage device network.

The platform of claim 29, wherein the control node includes logic for receiving, via the internal communication network, a communication message addressed to an entity on the external communication network, and for providing a message on the external communication network corresponding to the received message.

The platform of claim 29, wherein the control node includes logic for receiving, via the external communication network, a communication message addressed to an entity on the platform, and for providing a message corresponding to the received message to the addressed entity.

The platform of claim 29, wherein the computer processor and the control node include network emulation logic for emulating an Ethernet function on the internal communication network.

The platform according to claim 32, wherein the internal communication network is a point-to-point switch structure.

The platform of claim 29, wherein the internal communication network comprises redundant interconnections connecting the computer processor and at least one control node to a redundant switch structure.

The platform of claim 34, comprising at least one other control node connected to the interconnection to form a redundant control node.

The platform of claim 29, wherein the control node includes logic for receiving a storage device message from a computer processor via the internal communication network, extracting an address from the received storage device message, identifying the defined corresponding address in the external storage device address space, and providing, on the external storage device network, a message that corresponds to the received storage device message and has the corresponding address.

The platform of claim 36, wherein the control node includes logic for buffering data corresponding to a write message received from a computer processor and for providing the buffered data in the corresponding message provided to the external storage device network.

The platform of claim 36, wherein the control node includes logic for receiving a storage device message from the external storage device network, identifying the corresponding computer processor or control node to which the received message responds, and providing a corresponding message to the identified processor or control node.

A method of deploying a virtual processing area network, comprising:
Providing a platform having a plurality of computer processors and at least one control node connected to an internal communication network, wherein the control node is in communication with an external communication network and an external storage device network having an external storage device address space;
Defining a corresponding set of computer processors for the virtual processing network;
Establishing a virtual local area communication network that provides communication among the set of computer processors but excludes processors from the plurality that are not in the defined set; and
Defining a virtual storage space for the virtual processing network, the virtual storage space having a defined correspondence to the address space of the storage device network.

The method of claim 39, wherein the control node receives, via the internal communication network, a communication message addressed to an entity on the external communication network, and the control node provides a message on the external communication network corresponding to the received message.

The method of claim 39, wherein the control node receives, via the external communication network, a communication message addressed to an entity on the platform, and the control node provides a message corresponding to the received message to the addressed entity.

The method of claim 39, wherein the computer processor and the control node emulate an Ethernet function on the internal communication network.

The method of claim 42, wherein the internal communication network is a point-to-point switch structure, and the emulation of Ethernet functionality is provided across the internal point-to-point switch structure.

The method of claim 39, wherein the computer processor communicates over a redundant interconnection connecting the computer processor and the at least one control node.

The method of claim 44, wherein at least one other control node is connected to the interconnection, forming a redundant control node.

The method of claim 39, wherein the control node receives a storage device message from a computer processor via the internal communication network, extracts an address from the received storage device message, identifies the defined corresponding address in the external storage device address space, and provides, on the external storage device network, a message that corresponds to the received storage device message and has the corresponding address.

The method of claim 46, wherein the control node buffers data corresponding to a write message received from a computer processor and provides the buffered data in the corresponding message provided to the external storage device network.

The method of claim 46, wherein the control node receives a storage device message from the external storage device network, identifies the corresponding computer processor or control node to which the received message responds, and provides a corresponding message to the identified processor or control node.
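
The address translation and write buffering performed by the control node in the storage claims above could look roughly like the following. It is a sketch under stated assumptions: `StorageMsg`, `ControlNode`, the dictionary-based correspondence, and the address strings are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageMsg:
    op: str        # "read" or "write"
    address: str   # virtual address internally, external address once translated
    data: bytes = b""


class ControlNode:
    """Hypothetical control node mapping virtual storage to external storage."""

    def __init__(self, correspondence):
        # (processor id, virtual address) -> external storage address
        self.correspondence = dict(correspondence)
        self.write_buffer = []

    def to_external(self, processor, msg):
        # Extract the address and look up the defined corresponding address.
        external = self.correspondence[(processor, msg.address)]
        if msg.op == "write":
            # Buffer the write data at the control node before passing it on.
            self.write_buffer.append(msg.data)
        return StorageMsg(msg.op, external, msg.data)
```

Because the lookup key includes the processor, two processors can use the same virtual address while being mapped to different external storage locations.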

A computer processing platform connectable to an external communication network and a storage device network, comprising:
A plurality of computer processors connected to an internal communication network;
Configuration logic for defining and establishing the following (a) and (b):
(a) A virtual local area communication network on the internal network, wherein each computer processor in the virtual local area communication network has a corresponding virtual MAC address, and the virtual local area network provides communication among a set of computer processors but excludes processors from the plurality that are not in the defined set,
(b) A virtual storage space having a defined correspondence to the address space of the storage device network; and
Failover logic for allocating, in response to a failure of a computer processor, a computer processor from the plurality to replace the failed processor, the failover logic including logic for assigning the MAC address of the failed processor to the replacing processor, logic for assigning the virtual storage space and defined correspondence of the failed processor to the replacing processor, and logic for re-establishing the virtual local area network so as to include the replacing processor and exclude the failed processor.

The platform of claim 49, wherein the configuration logic establishes virtual interfaces to define software communication paths among the processors of the virtual network, and the failover logic includes logic for establishing virtual interfaces from the processors in the virtual network to the processor that replaces the failed processor.

The platform of claim 49, wherein the configuration logic establishes a second virtual local area network having a second set of computer processors and a second virtual storage space with a defined correspondence to the storage device network address space, and the failover logic causes the processor that replaces the failed processor to inherit the virtual local area network and virtual storage personality of the failed processor.

A computer processing method on a platform having a plurality of computer processors connected to an internal communication network, comprising:
Defining and establishing a virtual local area communication network on the internal network, wherein each computer processor in the virtual local area communication network has a corresponding virtual MAC address, and the virtual local area network provides communication among a set of computer processors but excludes processors from the plurality that are not in the defined set;
Defining and establishing a virtual storage space having a defined correspondence to the address space of the storage device network; and
In response to a failure of a computer processor, allocating a computer processor from the plurality to replace the failed processor, including assigning the MAC address of the failed processor to the replacing processor, assigning the virtual storage space and defined correspondence of the failed processor to the replacing processor, and re-establishing the virtual local area network so as to include the replacing processor and exclude the failed processor.

The method of claim 52, wherein, when the virtual local area network is established, virtual interfaces are established to define software communication paths among the processors of the virtual network, and, when a processor replaces a failed processor, virtual interfaces are established to the replacing processor.

The method of claim 52, wherein a second virtual local area network is established having a second set of computer processors and a second virtual storage space with a defined correspondence to the storage device network address space, and, if a processor fails, the processor that replaces the failed processor inherits the virtual local area network and virtual storage personality of the failed processor.
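
The failover behavior recited in the claims above, in which a replacement processor takes over the failed processor's virtual MAC address and storage correspondence, can be sketched as a simple table update. The names `VirtualNetwork` and `fail_over` and the dictionary layout are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualNetwork:
    """Hypothetical record of one virtual local area network and its storage."""
    members: dict = field(default_factory=dict)  # processor id -> virtual MAC
    storage: dict = field(default_factory=dict)  # processor id -> storage correspondence


def fail_over(net, failed, spare):
    """Move the failed processor's identity to the spare and drop the failed one."""
    if spare in net.members:
        raise ValueError("spare processor is already a member of this network")
    # The spare inherits the virtual MAC address of the failed processor.
    net.members[spare] = net.members.pop(failed)
    # The spare inherits the virtual storage space and its correspondence.
    net.storage[spare] = net.storage.pop(failed)
    return net
```

Since peers address the virtual MAC rather than a physical processor, membership changes in this sketch without any peer needing a new address for the service.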

A system for providing a service, comprising:
At least two computer processors, each including logic to provide the service; and
Cluster logic for receiving a request message for the service, the message having an IP address, and for distributing the request to one of the at least two computer processors having logic to provide the service.

The system of claim 55, wherein the logic for distribution includes logic for analyzing the source information in an incoming message in determining which processor should service the message.

A method for providing a service addressed by an IP address, comprising:
Providing each of at least two computer processors with logic to provide the service; and
Receiving a request message for the service, the message having an IP address, and distributing the request to one of the at least two computer processors having logic to provide the service.

The method of claim 57, wherein source information of the incoming message is analyzed to determine which processor will service the message.
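
One possible way to use source information when distributing requests, as the cluster claims above describe, is to hash the source address so that a given client is consistently served by the same processor. The claims do not specify a hashing scheme; the function name `pick_processor` and the use of CRC-32 here are assumptions chosen only to make the idea concrete.

```python
import zlib


def pick_processor(source_ip, processors):
    """Choose a processor for a request based on its source address (illustrative)."""
    if not processors:
        raise ValueError("no processors available to provide the service")
    # CRC-32 is deterministic across runs, unlike Python's built-in hash() for str.
    index = zlib.crc32(source_ip.encode("ascii")) % len(processors)
    return processors[index]
```

With this choice, repeated requests from one source reach the same processor as long as the processor list is unchanged; other policies, such as round robin, would also fit the claim language.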