CN1487418A - Memory management offload for remote direct memory access (RDMA) enabled network adapters - Google Patents
- Publication number
- CN1487418A (also listed as CNA031557813A, CN03155781A)
- Authority
- CN
- China
- Prior art keywords
- memory area
- tag
- transaction
- iscsi
- response
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/14—Protection against unauthorised use of memory or access to memory
- G06F12/1416—Protection against unauthorised use of memory or access to memory by checking the object accessibility, e.g. type of access defined by the memory independently of subject rights
- G06F12/145—Protection against unauthorised use of memory or access to memory by checking the object accessibility, e.g. type of access defined by the memory independently of subject rights the protection being virtual, e.g. for virtual blocks or segments before a translation mechanism
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Computer And Data Communications (AREA)
- Communication Control (AREA)
Abstract
A method, computer program product, and distributed data processing system for memory management. Memory regions are registered and have access rights and protection domains associated with them in response to receiving a request for a memory operation including a virtual address, which is used to address into a data structure. A second data structure is then used to translate the virtual address into physical addresses for the operation. A third data structure is used to allow an incoming request responsive to a remote operation being initiated.
Description
Technical field
The present invention relates generally to communication protocols between a host computer and an input/output (I/O) device. More specifically, the invention provides a method for performing memory management in the context of communication between a host and an I/O device.
Background
In an Internet Protocol (IP) network, software provides a message passing mechanism that can be used to communicate with input/output devices, general-purpose computers (hosts), and special-purpose computers. The message passing mechanism consists of a transport protocol, an upper-layer protocol, and an application programming interface. The key standard transport protocols used on IP networks today are the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). TCP provides a reliable service, and UDP provides an unreliable service. In the future, the Stream Control Transmission Protocol (SCTP) will also be used to provide a reliable service. Processes executing on devices or computers access the IP network through upper-layer protocols such as sockets, iSCSI, and the Direct Access File System (DAFS).
Unfortunately, TCP/IP software consumes a considerable amount of processor and memory resources. This problem has been covered extensively in the literature (see J. Kay, J. Pasquale, "Profiling and reducing processing overheads in TCP/IP", IEEE/ACM Transactions on Networking, Vol. 4, No. 6, pp. 817-828, December 1996; and D. D. Clark, V. Jacobson, J. Romkey, H. Salwen, "An analysis of TCP processing overhead", IEEE Communications Magazine, Vol. 27, No. 6, June 1989, pp. 23-29). In the future, the network stack will continue to consume excessive resources for several reasons: increased use of networking by applications; use of network security protocols; and underlying fabric bandwidths that are increasing at a higher rate than microprocessor and memory bandwidths. To address this problem, the industry is offloading network stack processing to an IP Suite Offload Engine (IPSOE).
The industry is currently taking two offload approaches. The first approach uses the existing TCP/IP network stack without adding any additional protocols. This approach can offload TCP/IP to hardware, but unfortunately it does not remove the need for receive-side copies. As noted in the papers cited above, copies are one of the largest contributors to central processing unit (CPU) utilization. To remove the need for copies, the industry is pursuing a second approach that consists of adding framing, Direct Data Placement (DDP), and Remote Direct Memory Access (RDMA) over TCP, and DDP and RDMA over SCTP. The IP Suite Offload Engine (IPSOE) required to support these two approaches is similar; the key difference is that in the second approach the hardware must support the additional protocols.
The IPSOE provides a message passing mechanism that sockets, iSCSI, and DAFS can use to communicate between nodes. Processes executing on host computers or devices access the IP network by posting send/receive messages to send/receive work queues on an IPSOE. These processes are also referred to as "consumers".
The send/receive work queues (WQs) are assigned to a consumer as a queue pair (QP). Messages can be sent over several different transport types: traditional TCP, RDMA TCP, UDP, or SCTP. Consumers retrieve the results of these messages from a completion queue (CQ) through IPSOE send and receive work completions (WCs). The source IPSOE is responsible for segmenting outbound messages and sending them to the destination. The destination IPSOE is responsible for reassembling inbound messages and placing them in the memory space designated by the destination's consumer. These consumers use IPSOE verbs to access the functions supported by the IPSOE. The software that interprets verbs and directly accesses the IPSOE is known as the IPSO interface (IPSOI).
Today, the host CPU performs most IP suite processing. IP Suite Offload Engines offer a higher-performance interface for communicating with other general-purpose computers and I/O devices. Sending or receiving data through the IPSOE requires that the CPU either copy data from one memory location to another or register the memory so that the IPSOE can access the memory region directly. Each of these options requires significant CPU resources, with the memory registration option being better suited to large memory transfers. However, as network speeds increase, the amount of CPU resources required will increase. It would be advantageous to have an improved method, apparatus, and computer instructions for reducing the amount of CPU resources required to register these memory locations, expose them to remote systems through memory windows, and then provide one-touch access as an option to exposing memory windows. It would also be advantageous for the mechanism to apply to iSCSI 1.0, RDMA, and iSCSI-R.
Summary of the invention
The present invention provides a method, computer program product, and distributed data processing system for registering memory locations, exposing previously registered memory locations through memory windows, and then providing one-touch access as an option to exposing memory windows.
Specifically, the present invention is directed to memory regions that are accessed by an Internet Protocol Suite Offload Engine (IPSOE) in accordance with a preferred embodiment of the present invention. A mechanism is provided for implicitly or explicitly registering memory regions and for allowing hardware to use a region directly through memory region tables and address translation tables while keeping the region isolated from use by other applications. A method is provided for accessing previously registered memory regions through incoming requests by using a tag table to associate each request with either a physical or a virtual address. A mechanism is also provided for unbinding a previously bound window when it is used for the first time by an incoming message.
Brief description of the drawings
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use and further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
Fig. 1 is a diagram of a distributed computer system in accordance with a preferred embodiment of the present invention;
Fig. 2 is a functional block diagram of a host processor node in accordance with a preferred embodiment of the present invention;
Fig. 3A is a diagram of an IP suite offload engine in accordance with a preferred embodiment of the present invention;
Fig. 3B is a diagram of a switch in accordance with a preferred embodiment of the present invention;
Fig. 3C is a diagram of a router in accordance with a preferred embodiment of the present invention;
Fig. 4 is a diagram illustrating the processing of work requests in accordance with a preferred embodiment of the present invention;
Fig. 5 is a diagram illustrating a portion of a distributed computer system in accordance with a preferred embodiment of the present invention in which a TCP or SCTP transport is used;
Fig. 6 is a diagram of a data frame in accordance with a preferred embodiment of the present invention;
Fig. 7 is a diagram of a portion of a distributed computer system in accordance with a preferred embodiment of the present invention;
Fig. 8 is a diagram illustrating the network addressing used in a distributed networking system in accordance with the present invention;
Fig. 9 is a diagram of a portion of a distributed computer system containing subnets in a preferred embodiment of the present invention;
Fig. 10 is a diagram of a layered communication architecture used in a preferred embodiment of the present invention;
Fig. 11 is a flowchart and diagram illustrating two memory registration mechanisms in accordance with a preferred embodiment of the present invention;
Fig. 12 is a diagram of a memory management system in accordance with a preferred embodiment of the present invention;
Fig. 13 is a diagram of a memory region table entry in accordance with a preferred embodiment of the present invention;
Fig. 14 is a flowchart illustrating the checks that must be performed when a memory region is registered in accordance with a preferred embodiment of the present invention;
Fig. 15 is a flowchart and diagram illustrating the process used by the IPSOE to validate memory accesses performed by work queue elements that a consumer passes to an IPSOE work queue as work requests in accordance with a preferred embodiment of the present invention;
Fig. 16 is a flowchart and diagram illustrating the process used to differentiate the different types of flows that can be associated with a remote operation in accordance with a preferred embodiment of the present invention;
Fig. 17A is a flowchart and diagram illustrating the memory management mechanisms associated with an iSCSI QP in accordance with a preferred embodiment of the present invention;
Fig. 17B is a flowchart and diagram illustrating the memory management process used to validate remote iSCSI 1.0 operations in accordance with a preferred embodiment of the present invention; and
Fig. 18 is a flowchart and diagram illustrating the memory management process used, in accordance with a preferred embodiment of the present invention, to validate remote RDMA Read Request, RDMA Read Response, and RDMA Write messages and to provide the one-touch access mechanism with an unbind function that is not exposed to the remote node.
Detailed description
The present invention provides a distributed computer system having end nodes, switches, routers, and links interconnecting these components. An end node can be an Internet Protocol suite offload engine or traditional host software implementing the Internet Protocol suite. Each end node uses send and receive queue pairs to transmit and receive messages. The end nodes segment a message into frames and transmit the frames over the links. The switches and routers interconnect the end nodes and route the frames to the appropriate end node. The end nodes reassemble the frames into a message at the destination.
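The segmentation and reassembly step described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names `segment` and `reassemble`, the per-frame sequence number, and the `max_payload` parameter are assumptions made for the example and are not taken from the patent.

```python
def segment(message: bytes, max_payload: int) -> list:
    """Split a message into (sequence number, payload) frames."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [
        (seq, message[offset:offset + max_payload])
        for seq, offset in enumerate(range(0, len(message), max_payload))
    ]


def reassemble(frames) -> bytes:
    """Rebuild the message at the destination, tolerating reordered frames."""
    return b"".join(payload for _, payload in sorted(frames))


frames = segment(b"hello world", 4)
# Frames may arrive out of order; sorting by sequence number restores the message.
message = reassemble(reversed(frames))
```

A real transport would also carry lengths, acknowledgments, and retransmission state; the sketch keeps only the split-and-join idea.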
With reference now to the figures, and in particular to Fig. 1, a diagram of a distributed computer system is illustrated in accordance with a preferred embodiment of the present invention. The distributed computer system represented in Fig. 1 takes the form of an Internet Protocol network (IP network), such as IP network 100, and is provided merely for illustrative purposes; the embodiments of the present invention described below can be implemented on computer systems of numerous other types and configurations. For example, computer systems implementing the present invention can range from a small server with one processor and a few input/output (I/O) adapters to massively parallel supercomputer systems with hundreds or thousands of processors and thousands of I/O adapters. Furthermore, the present invention can be implemented in an infrastructure of remote computer systems connected by an internet or intranet.
IP network 100 is a high-bandwidth, low-latency network interconnecting nodes within the distributed computer system. A node is any component attached to one or more links of a network and forming the origin and/or destination of messages within the network. In the depicted example, IP network 100 includes nodes in the form of host processor node 102, host processor node 104, and redundant array of independent disks (RAID) subsystem node 106. The nodes illustrated in Fig. 1 are for illustrative purposes only, as IP network 100 can connect any number and any type of independent processor nodes, storage nodes, and special-purpose processing nodes. Any one of the nodes can function as an end node, which is herein defined to be a device that originates or finally consumes messages or frames in IP network 100.
In one embodiment of the present invention, an error handling mechanism is present in the distributed computer system, where the error handling mechanism allows for TCP or SCTP communication between end nodes in a distributed computing system such as IP network 100.
A message, as used herein, is an application-defined unit of data exchange, which is a primitive unit of communication between cooperating processes. A frame is one unit of data encapsulated by Internet Protocol suite headers and/or trailers. The headers generally provide control and routing information for directing the frame through IP network 100. The trailer generally contains control and cyclic redundancy check (CRC) data for ensuring that frames are not delivered with corrupted contents.
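As a rough illustration of how a trailer CRC guards frame contents, the following sketch appends a CRC-32 to a header and payload and verifies it on receipt. The frame layout (header, payload, 4-byte big-endian CRC) and the function names are assumptions for the example; the patent does not specify a frame format at this point, and real link layers define their own.

```python
import struct
import zlib


def encapsulate(header: bytes, payload: bytes) -> bytes:
    """Build a frame: header + payload + 4-byte CRC-32 trailer."""
    body = header + payload
    return body + struct.pack("!I", zlib.crc32(body))


def frame_is_intact(frame: bytes) -> bool:
    """Recompute the CRC over the body and compare it with the trailer."""
    if len(frame) < 4:
        return False
    body, trailer = frame[:-4], frame[-4:]
    return struct.unpack("!I", trailer)[0] == zlib.crc32(body)


frame = encapsulate(b"HDR", b"payload bytes")
# Flip one bit in the first byte to simulate corruption in transit.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
```

A receiver that gets `corrupted` would discard it rather than deliver it, which is the property the trailer is meant to provide.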
Within a distributed computer system, IP network 100 contains the communications and management infrastructure supporting various forms of traffic, such as storage, interprocess communication (IPC), file access, and sockets. IP network 100 shown in Fig. 1 includes a switched communications fabric 116, which allows many devices to transfer data concurrently with high bandwidth and low latency in a secure, remotely managed environment. End nodes can communicate over multiple ports and use multiple paths through the IP network fabric. The multiple ports and paths through the IP network shown in Fig. 1 can be employed for fault tolerance and for increased-bandwidth data transfers.
IP network 100 in Fig. 1 includes switch 112, switch 114, and router 117. A switch is a device that connects multiple links together and allows frames to be routed from one link to another using the layer 2 destination address field. When Ethernet is used as the link, the destination field is known as the media access control (MAC) address. A router is a device that routes frames based on the layer 3 destination address field. When Internet Protocol (IP) is used as the layer 3 protocol, the destination address field is an IP address.
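The difference between layer 2 and layer 3 forwarding can be sketched as two lookup tables. This is an illustrative model only: the `Switch` and `Router` classes, their method names, and the longest-prefix-match policy are assumptions for the example and do not describe the internals of switch 112 or router 117.

```python
import ipaddress


class Switch:
    """Forwards on the layer 2 destination address (exact MAC match)."""

    def __init__(self):
        self.mac_table = {}

    def learn(self, mac: str, port: int) -> None:
        self.mac_table[mac.lower()] = port

    def forward(self, dst_mac: str):
        return self.mac_table.get(dst_mac.lower())


class Router:
    """Forwards on the layer 3 destination address (longest prefix match)."""

    def __init__(self):
        self.routes = []

    def add_route(self, prefix: str, port: int) -> None:
        self.routes.append((ipaddress.ip_network(prefix), port))

    def forward(self, dst_ip: str):
        addr = ipaddress.ip_address(dst_ip)
        matches = [(net.prefixlen, port) for net, port in self.routes if addr in net]
        # The most specific (longest) matching prefix wins.
        return max(matches)[1] if matches else None


switch = Switch()
switch.learn("00:1A:2B:3C:4D:5E", 3)
router = Router()
router.add_route("10.0.0.0/8", 1)
router.add_route("10.1.0.0/16", 2)
```

With these tables, `switch.forward("00:1a:2b:3c:4d:5e")` selects port 3 and `router.forward("10.1.2.3")` selects port 2, because the /16 route is more specific than the /8 route.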
In one embodiment, a link is a full-duplex channel between any two network fabric elements, such as end nodes, switches, or routers. Examples of suitable links include, but are not limited to, copper cables, optical cables, and printed circuit copper traces on backplanes and printed circuit boards.
For reliable service types (TCP and SCTP), end nodes, such as host processor end nodes and I/O adapter end nodes, generate request frames and return acknowledgment frames. Switches and routers pass frames along from the source to the destination.
In IP network 100 as illustrated in Fig. 1, host processor node 102, host processor node 104, and RAID subsystem node 106 each include at least one IPSOE to interface to IP network 100. In one embodiment, each IPSOE is an endpoint that implements the IPSOI in sufficient detail to source or sink frames transmitted on IP network 100. Host processor node 102 contains IPSOEs in the form of host IPSOE 118 and IPSOE 120. Host processor node 104 contains IPSOE 122 and IPSOE 124. Host processor node 102 also includes central processing units 126-130 and a memory 132 interconnected by bus system 134. Host processor node 104 similarly includes central processing units 136-140 and a memory 142 interconnected by bus system 144.
IPSOE 118 provides a connection to switch 112, IPSOE 124 provides a connection to switch 114, and IP suite offload engines 120 and 122 provide connections to switches 112 and 114.
In one embodiment, an IP suite offload engine is implemented in hardware or in a combination of hardware and an offload microprocessor. In this implementation, IP suite processing is offloaded to the IPSOE. This implementation also permits multiple concurrent communications over a switched network without the traditional overhead associated with communication protocols. In one embodiment, the IPSOEs and IP network 100 in Fig. 1 provide the consumers of the distributed computer system with zero processor-copy data transfers without involving the operating system kernel process, and employ hardware to provide reliable, fault-tolerant communications.
As indicated in Fig. 1, router 117 is coupled to wide area network (WAN) and/or local area network (LAN) connections to other hosts or other routers.
In this example, RAID subsystem node 106 in Fig. 1 includes processor 168, memory 170, IP suite offload engine (IPSOE) 172, and multiple redundant and/or striped storage disk units 174.
IP network 100 handles data communications for storage, interprocessor communications, file accesses, and sockets. IP network 100 supports high-bandwidth, scalable, and extremely low-latency communications. User clients can bypass the operating system kernel process and directly access network communication components, such as IPSOEs, which enables efficient message passing protocols. IP network 100 is suited to current computing models and is a building block for new forms of storage, cluster, and general networking communication. Further, IP network 100 in Fig. 1 allows storage nodes to communicate among themselves or with any or all of the processor nodes in the distributed computer system. With storage attached to IP network 100, a storage node has substantially the same communication capability as any host processor node in IP network 100.
In one embodiment, IP network 100 shown in Fig. 1 supports channel semantics and memory semantics. Channel semantics is sometimes referred to as send/receive or push communication operations. Channel semantics is the type of communication employed in a traditional I/O channel, where a source device pushes data and a destination device determines the final destination of the data. In channel semantics, the frame transmitted from a source process specifies a destination process's communication port but does not specify where in the destination process's memory space the frame will be written. Thus, in channel semantics, the destination process pre-allocates where to place the transmitted data.
In memory semantics, a source process directly reads or writes the virtual address space of a remote node's destination process. The remote destination process need only communicate the location of a data buffer and does not need to be involved in the transfer of any data. Thus, in memory semantics, a source process sends a data frame containing the destination buffer memory address of the destination process. In memory semantics, the destination process has previously granted the source process permission to access its memory.
Channel semantics and memory semantics are typically both necessary for storage, cluster, and general networking communications. A typical storage operation employs a combination of channel and memory semantics. In an illustrative storage operation of the distributed computer system shown in Fig. 1, a host processor node, such as host processor node 102, initiates a storage operation by using channel semantics to send a disk write command to RAID subsystem IPSOE 172. The RAID subsystem examines the command and uses memory semantics to read the data buffer directly from the memory space of the host processor node. After the data buffer is read, the RAID subsystem employs channel semantics to push an I/O completion message back to the host processor node.
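The disk write exchange described above mixes both styles, which a toy model can make concrete. In the sketch below, `Node`, `send`, `rdma_read`, and `disk_write` are hypothetical names chosen for illustration, the "disk" is a plain dictionary, and the permission checks that memory semantics requires are deliberately left out.

```python
from collections import deque


class Node:
    """Toy end node: private memory plus a queue of received messages."""

    def __init__(self, name: str, memory_size: int = 64):
        self.name = name
        self.memory = bytearray(memory_size)
        self.inbox = deque()

    def send(self, dest: "Node", message: dict) -> None:
        # Channel semantics: the sender names only the destination;
        # the receiver decides where the message is placed.
        dest.inbox.append(message)

    def rdma_read(self, remote: "Node", addr: int, length: int) -> bytes:
        # Memory semantics: the initiator names the remote buffer address.
        # Access checks are omitted in this sketch.
        return bytes(remote.memory[addr:addr + length])


def disk_write(host: Node, raid: Node, disk: dict, block: int, data: bytes) -> dict:
    if len(data) > len(host.memory):
        raise ValueError("data does not fit in host memory")
    host.memory[0:len(data)] = data
    # 1. Host pushes the write command (channel semantics).
    host.send(raid, {"op": "write", "block": block, "addr": 0, "len": len(data)})
    # 2. RAID node examines the command and pulls the buffer (memory semantics).
    cmd = raid.inbox.popleft()
    disk[cmd["block"]] = raid.rdma_read(host, cmd["addr"], cmd["len"])
    # 3. RAID node pushes the completion message back (channel semantics).
    raid.send(host, {"op": "complete", "block": cmd["block"]})
    return host.inbox.popleft()
```

Calling `disk_write(Node("host"), Node("raid"), {}, 7, b"hello")` walks through the three steps and returns the completion message seen by the host.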
In one exemplary embodiment, the distributed computer system shown in Fig. 1 performs operations that employ virtual addresses and virtual memory protection mechanisms to ensure correct and proper access to all memory. Applications running in such a distributed computer system are not required to use physical addressing for any operations.
Turning next to Fig. 2, a functional block diagram of a host processor node is depicted in accordance with a preferred embodiment of the present invention. Host processor node 200 is an example of a host processor node, such as host processor node 102 in Fig. 1. In this example, host processor node 200 shown in Fig. 2 includes a set of consumers 202-208, which are processes executing on host processor node 200. Host processor node 200 also includes IP suite offload engine (IPSOE) 210 and IPSOE 212. IPSOE 210 contains ports 214 and 216, while IPSOE 212 contains ports 218 and 220. Each port connects to a link. The ports can connect to one IP network subnet or to multiple IP network subnets, such as IP network 100 in Fig. 1.
Consumers 202-208 transfer messages to the IP network via verbs interface 222 and message and data service 224. A verbs interface is essentially an abstract description of the functionality of an IP suite offload engine. An operating system may expose some or all of the verb functionality through its programming interface. Basically, this interface defines the behavior of the host. Additionally, host processor node 200 includes message and data service 224, which is a higher-level interface than the verb layer and is used to process messages and data received through IPSOE 210 and IPSOE 212. Message and data service 224 provides an interface to consumers 202-208 for processing messages and other data.
With reference now to Fig. 3A, a diagram of an IP suite offload engine is depicted in accordance with a preferred embodiment of the present invention. IP suite offload engine 300A shown in Fig. 3A includes a set of queue pairs (QPs) 302A-310A, which are used to transfer messages to IPSOE ports 312A-316A. Buffering of data to IPSOE ports 312A-316A is channeled using the network layer's quality of service field (QOSF), for example the Traffic Class field in the IP Version 6 specification. Each network layer quality of service field has its own flow control. Internet Engineering Task Force (IETF) standard network protocols are used to configure the link and network addresses of all IP suite offload engine ports connected to the network. Two such protocols are the Address Resolution Protocol (ARP) and the Dynamic Host Configuration Protocol (DHCP). Memory translation and protection (MTP) 338A is a mechanism that translates virtual addresses to physical addresses and validates access rights. Direct memory access (DMA) 340A provides for direct memory access operations using memory 350A with respect to queue pairs 302A-310A.
A single IP suite offload engine, such as IPSOE 300A shown in Fig. 3A, can support thousands of queue pairs. Each queue pair consists of a send work queue (SWQ) and a receive work queue (RWQ). The send work queue is used to send channel and memory semantic messages. The receive work queue receives channel semantic messages. A consumer calls an operating-system-specific programming interface, herein referred to as "verbs", to place work requests (WRs) onto a work queue.
Fig. 3B depicts a switch 300B in accordance with a preferred embodiment of the present invention. Switch 300B includes a frame relay 302B in communication with a number of ports 304B through link or network layer quality of service fields, such as the IP Version 4 Type of Service field 306B. Generally, a switch such as switch 300B can route frames from one port to any other port on the same switch.
Similarly, Fig. 3C depicts a router 300C in accordance with a preferred embodiment of the present invention. Router 300C includes a frame relay 302C in communication with a number of ports 304C through network layer quality of service fields, such as the IP Version 4 Type of Service field 306C. Like switch 300B, router 300C will generally be able to route frames from one port to any other port on the same router.
With reference now to Fig. 4, a diagram illustrating the processing of work requests is depicted in accordance with a preferred embodiment of the present invention. In Fig. 4, a receive work queue 400, a send work queue 402, and a completion queue 404 are present for processing requests from and for consumer 406. These requests from consumer 406 are eventually sent to hardware 408. In this example, consumer 406 generates work requests 410 and 412 and receives work completion 414. As shown in Fig. 4, work requests placed onto a work queue are referred to as work queue elements (WQEs).
Send work queue 402 contains work queue elements (WQEs) 422-428, describing data to be transmitted on the IP network fabric. Receive work queue 400 contains work queue elements (WQEs) 416-420, describing where to place incoming channel semantic data from the IP network fabric. A work queue element is processed by hardware 408 in the IPSOE.
The verbs also provide a mechanism for retrieving completed work from completion queue 404. As shown in Fig. 4, completion queue 404 contains completion queue elements (CQEs) 430-436. Completion queue elements contain information about previously completed work queue elements. Completion queue 404 is used to create a single point of completion notification for multiple queue pairs. A completion queue element is a data structure on a completion queue that describes a completed work queue element. The completion queue element contains sufficient information to determine the queue pair and the specific work queue element that completed. A completion queue context is a block of information that contains the pointers, lengths, and other information needed to manage the individual completion queues.
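The relationship between work queues, work queue elements, and a shared completion queue can be sketched as follows. The class names, the `post_send` and `poll` methods, and the dictionary used as a completion queue element are hypothetical; `process_one_send` merely stands in for the hardware, and nothing here reflects an actual verbs API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CompletionQueue:
    entries: deque = field(default_factory=deque)

    def poll(self):
        """Return the oldest completion queue element, or None if empty."""
        return self.entries.popleft() if self.entries else None


@dataclass
class QueuePair:
    qp_num: int
    cq: CompletionQueue
    send_wq: deque = field(default_factory=deque)
    recv_wq: deque = field(default_factory=deque)

    def post_send(self, wr_id, segments) -> None:
        # A posted work request becomes a work queue element (WQE).
        self.send_wq.append((wr_id, segments))

    def process_one_send(self) -> bool:
        """Stand-in for the hardware: consume one WQE, emit one CQE."""
        if not self.send_wq:
            return False
        wr_id, segments = self.send_wq.popleft()
        self.cq.entries.append({
            "qp": self.qp_num,      # identifies the queue pair
            "wr_id": wr_id,         # identifies the completed WQE
            "status": "ok",
            "bytes": sum(len(s) for s in segments),
        })
        return True


cq = CompletionQueue()            # one completion point shared by two QPs
qp1, qp2 = QueuePair(1, cq), QueuePair(2, cq)
qp1.post_send("a", [b"abc", b"de"])
qp2.post_send("b", [b"xyz"])
```

After both queue pairs process their elements, polling `cq` returns one element per completed request, each carrying enough information to identify the queue pair and work request it belongs to.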
Example work requests supported for send work queue 402 shown in Fig. 4 are as follows. A send work request is a channel semantic operation that pushes a set of local data segments to the data segments referenced by a remote node's receive work queue element. For example, work queue element 428 contains references to data segment 4 438, data segment 5 440, and data segment 6 442. Each of the send work request's data segments contains part of a virtually contiguous memory region. The virtual addresses used to reference the local data segments are in the address context of the process that created the local queue pair.
A remote direct memory access (RDMA) read work request provides a memory semantic operation to read a virtually contiguous memory space on a remote node. A memory space can be either a portion of a memory region or a portion of a memory window. A memory region references a previously registered set of virtually contiguous memory addresses defined by a virtual address and a length. A memory window references a set of virtually contiguous memory addresses that have been bound to a previously registered region.
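The containment relationship between a memory region and a memory window can be expressed as simple range checks. The sketch below is an assumption-laden illustration: `MemoryRegion` and `MemoryWindow` are hypothetical classes, and access rights and protection domains are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    """A registered, virtually contiguous range: virtual address plus length."""
    base: int
    length: int

    def contains(self, addr: int, length: int) -> bool:
        return length >= 0 and self.base <= addr and addr + length <= self.base + self.length


@dataclass(frozen=True)
class MemoryWindow:
    """A sub-range bound to a previously registered region."""
    region: MemoryRegion
    base: int
    length: int

    def __post_init__(self):
        if not self.region.contains(self.base, self.length):
            raise ValueError("window must lie inside its region")

    def contains(self, addr: int, length: int) -> bool:
        return length >= 0 and self.base <= addr and addr + length <= self.base + self.length


region = MemoryRegion(base=0x1000, length=0x4000)
window = MemoryWindow(region, base=0x2000, length=0x100)
```

An access at `0x2080` for 16 bytes falls inside `window`, whereas an access at `0x3000` is inside `region` but outside `window`, so a peer holding only the window could not use it.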
The RDMA read work request reads a virtually contiguous memory space on a remote end node and writes the data to a virtually contiguous local memory space. As with the send work request, the virtual addresses used by the RDMA read work queue element to reference the local data segments are in the address context of the process that created the local queue pair. The remote virtual addresses are in the address context of the process owning the remote queue pair targeted by the RDMA read work queue element.
An RDMA write work queue element provides a memory semantic operation to write a virtually contiguous memory space on a remote node. For example, work queue element 416 in receive work queue 400 references data segment 1 444, data segment 2 446, and data segment 3 448. The RDMA write work queue element contains a scatter list of local virtually contiguous memory spaces and the virtual address of the remote memory space into which the local memory spaces are written.
An RDMA FetchOp work queue element provides a memory semantic operation to perform an atomic operation on a remote word. The RDMA FetchOp work queue element is a combined RDMA read, modify, and RDMA write operation. It can support several read-modify-write operations, such as compare and swap if equal. The RDMA FetchOp is not included in current RDMA over IP standardization efforts, but it is described here because it may be offered as a value-added feature in some implementations.
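The compare-and-swap-if-equal behavior mentioned above can be illustrated with a small sketch. The `RemoteWord` class is hypothetical, and a lock stands in for the atomicity that an adapter would have to guarantee; this is not how the hardware operation is implemented.

```python
import threading


class RemoteWord:
    """A single word on the target node, updated atomically."""

    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()

    def read(self) -> int:
        with self._lock:
            return self._value

    def compare_and_swap(self, expected: int, new: int) -> int:
        """Read, compare, and conditionally write as one indivisible step.

        Returns the value observed before the operation, so the caller can
        tell whether the swap took place.
        """
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


word = RemoteWord(5)
first = word.compare_and_swap(expected=5, new=9)   # matches, so 9 is stored
second = word.compare_and_swap(expected=5, new=1)  # no match, value unchanged
```

The returned prior value is what lets an initiator build locks or counters on top of the operation without a separate read.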
A bind (unbind) remote access key (STag) work queue element provides a command to the IP suite offload engine hardware to modify (destroy) a memory window by associating (disassociating) the memory window with a memory region. The STag is part of each RDMA access and is used to validate that the remote process has permitted access to the buffer.
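A tag-keyed table is one way to picture how an STag gates remote access. In this sketch, `StagTable`, its `bind`, `unbind`, and `check` methods, the sequential tag numbering, and the string-valued rights are all assumptions for illustration and do not describe the table formats used by the invention.

```python
import itertools


class StagTable:
    """Maps each STag to the window it exposes and the rights it grants."""

    def __init__(self):
        self._next_stag = itertools.count(1)
        self._windows = {}

    def bind(self, region, base: int, length: int, rights) -> int:
        """Associate a window with a registered region and return its STag."""
        region_base, region_length = region
        if base < region_base or base + length > region_base + region_length:
            raise ValueError("window must lie inside the registered region")
        stag = next(self._next_stag)
        self._windows[stag] = (base, length, frozenset(rights))
        return stag

    def unbind(self, stag: int) -> None:
        """Disassociate the window; later accesses with this STag fail."""
        self._windows.pop(stag, None)

    def check(self, stag: int, addr: int, length: int, right: str) -> bool:
        """Validate an incoming RDMA access against the bound window."""
        entry = self._windows.get(stag)
        if entry is None:
            return False
        base, window_length, rights = entry
        return right in rights and base <= addr and addr + length <= base + window_length


table = StagTable()
stag = table.bind(region=(0x1000, 0x4000), base=0x2000, length=0x100, rights={"read"})
```

With this table, `table.check(stag, 0x2000, 64, "read")` succeeds, a `"write"` with the same STag is refused, and every check fails once `table.unbind(stag)` has run.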
In one embodiment, receive work queue 400 shown in Fig. 4 supports only one type of work queue element, which is referred to as a receive work queue element. The receive work queue element provides a channel semantic operation describing a local memory space into which incoming send messages are written. The receive work queue element includes a scatter list describing several virtually contiguous memory spaces. An incoming send message is written to these memory spaces. The virtual addresses are in the address context of the process that created the local queue pair.
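Writing an incoming message across a scatter list amounts to filling each buffer in turn. The helper below is a hypothetical sketch: `scatter` and its use of `bytearray` buffers are illustration choices, not the placement logic of the IPSOE.

```python
def scatter(message: bytes, buffers: list) -> int:
    """Copy a message into a list of bytearray buffers, in order.

    Returns the number of bytes placed. Raises ValueError if the buffers
    named by the receive work queue element are too small for the message.
    """
    if len(message) > sum(len(buf) for buf in buffers):
        raise ValueError("message does not fit in the scatter list")
    offset = 0
    for buf in buffers:
        chunk = message[offset:offset + len(buf)]
        buf[:len(chunk)] = chunk
        offset += len(chunk)
        if offset == len(message):
            break
    return offset


buffers = [bytearray(4), bytearray(4), bytearray(4)]
placed = scatter(b"abcdefghij", buffers)
```

The ten-byte message fills the first two buffers and the first two bytes of the third, leaving the remaining bytes of the third buffer untouched.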
For interprocessor communication, a user-mode software process transfers data through queue pairs directly from the location where the buffer resides in memory. In one embodiment, the transfer through the queue pairs bypasses the operating system and consumes few host instruction cycles. Queue pairs permit zero processor-copy data transfer with no operating system kernel involvement. Zero processor-copy data transfer provides efficient support for high-bandwidth, low-latency communication.
When a queue pair is created, it is set to provide a selected type of transport service. In one embodiment, a distributed computer system implementing the present invention supports three types of transport services: TCP, SCTP, and UDP.
TCP and SCTP associate a local queue pair with one and only one remote queue pair. TCP and SCTP require a process to create a queue pair for each process with which it communicates over the IP network infrastructure. Thus, if each of N host processor nodes contains P processes, and all P processes on each node wish to communicate with all processes on all other nodes, each host processor node requires P^2 × (N - 1) queue pairs. Moreover, a process can associate a queue pair with another queue pair on the same IPSOE.
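The queue pair count stated above can be checked with a short calculation. The following is only an illustrative sketch; the function name `queue_pairs_per_node` and its parameters are hypothetical and do not appear in the specification.

```python
def queue_pairs_per_node(num_nodes: int, procs_per_node: int) -> int:
    """Queue pairs one host node needs for full process-to-process connectivity.

    Each of the P local processes needs one queue pair for each of the
    P * (N - 1) processes on the other nodes, giving P^2 * (N - 1).
    """
    if num_nodes < 1 or procs_per_node < 0:
        raise ValueError("need at least one node and a non-negative process count")
    return procs_per_node ** 2 * (num_nodes - 1)


# Example: 4 nodes with 3 processes each.
print(queue_pairs_per_node(num_nodes=4, procs_per_node=3))  # 3 * 3 * (4 - 1) = 27
```

The quadratic growth in P is why connection-oriented transports consume queue pair resources quickly on nodes that host many communicating processes.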
A portion of a distributed computer system employing TCP or SCTP to communicate between distributed processes is illustrated generally in Fig. 5. Distributed computer system 500 of Fig. 5 includes host processor node 1, host processor node 2, and host processor node 3. Host processor node 1 includes process A 510. Host processor node 2 includes process C 520 and process D 530. Host processor node 3 includes process E 540.
With TCP or SCTP, a WQE placed on a send queue causes data to be written into the receive memory space referenced by a receive WQE of the associated queue pair. RDMA operations operate on the address space of the associated queue pair.
In one embodiment of the invention, TCP or SCTP is made reliable because the hardware maintains sequence numbers and acknowledges all frame transfers. A combination of hardware and IP network driver software retries any failed communication. The process client of the queue pair obtains reliable communication even in the presence of bit errors, receive underruns, and network congestion. If alternative paths exist in the IP network infrastructure, reliable communication can be maintained even when switches, links, or IPSOE ports fail.
In addition, acknowledgments may be employed to deliver data reliably across the IP network infrastructure. The acknowledgment may or may not be a process-level acknowledgment, that is, an acknowledgment verifying that the receiving process has consumed the data. Alternatively, the acknowledgment may simply indicate that the data has reached its destination.
The User Datagram Protocol (UDP) is connectionless. UDP is employed by management applications to discover new switches, routers, and end nodes and to integrate them into a given distributed computer system. UDP does not provide the reliability guarantees of TCP or SCTP. UDP accordingly operates with less state information maintained at each end node.
Turning next to Fig. 6, an example data frame is illustrated in accordance with a preferred embodiment of the present invention. A data frame is a unit of information that is routed through the IP network infrastructure. The data frame is an end-node-to-end-node construct and is therefore created and consumed by end nodes. For frames destined to an IPSOE, the data frames are neither generated nor consumed by the switches and routers in the IP network infrastructure. Instead, for data frames destined to an IPSOE, switches and routers simply move request frames or acknowledgment frames closer to the ultimate destination, modifying the link header fields in the process. Routers may modify the network header of a frame when the frame crosses a subnet boundary. While traversing a subnet, a single frame stays on a single service level.
In Fig. 7, a portion of distributed computer system 700 is shown to illustrate an example request and acknowledgment transaction. Distributed computer system 700 in Fig. 7 includes host processor node 702 running process A 716 and host processor node 704 running process B 718. Host processor node 702 includes IPSOE 706. Host processor node 704 includes IPSOE 708. The distributed computer system in Fig. 7 includes IP network infrastructure 710, which includes switch 712 and switch 714. The IP network infrastructure includes a link connecting IPSOE 706 to switch 712, a link connecting switch 712 to switch 714, and a link connecting IPSOE 708 to switch 714.
In the example transaction, host processor node 702 includes client process A. Host processor node 704 includes client process B. Client process A interacts with host IPSOE 706 through queue pair 23 720, which comprises send queue 724 and receive queue 726. Client process B interacts with host IPSOE 708 through queue pair 24 722. Queue pairs 23 and 24 are data structures that include a send work queue and a receive work queue.
Process A initiates a message request by posting a work queue element to the send queue of queue pair 23. Such a work queue element is illustrated in Fig. 4. The message request of client process A is referenced by a gather list contained in the send work queue element. Each data segment in the gather list points to part of a virtually contiguous local memory region that contains part of the message, as indicated by data segments 1, 2, and 3 in Fig. 4, which hold message parts 1, 2, and 3, respectively.
Hardware in host IPSOE 706 reads the work queue element and segments the message stored in virtually contiguous buffers into data frames, such as the data frame illustrated in Fig. 6. Data frames are routed through the IP network infrastructure and, for reliable transport services, are acknowledged by the final destination end node. If not successfully acknowledged, the data frame is retransmitted by the source end node. Data frames are generated by source end nodes and consumed by destination end nodes.
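The gather-and-segment step described above can be sketched in a few lines. This is a minimal illustration only: the function name `segment_message`, the `max_payload` parameter, and the zero-based sequence numbering are assumptions made for the example, not details taken from the specification, and the real work is performed by IPSOE hardware.

```python
from typing import List, Tuple


def segment_message(gather_list: List[bytes], max_payload: int) -> List[Tuple[int, bytes]]:
    """Concatenate the gather-list segments and split the message into frames.

    Returns (sequence_number, payload) pairs. The sequence numbers here simply
    count frames from zero, which is an assumption made for illustration.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    message = b"".join(gather_list)  # message parts 1, 2, 3 form one message
    starts = range(0, len(message), max_payload)
    return [(seq, message[start:start + max_payload]) for seq, start in enumerate(starts)]


# Three data segments, as in the Fig. 4 example, split into 4-byte frame payloads.
for seq, payload in segment_message([b"abc", b"defg", b"hi"], max_payload=4):
    print(seq, payload)
```

A reliable transport would keep each frame until it is acknowledged and resend it otherwise; that bookkeeping is omitted from the sketch.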
With reference to Fig. 8, a diagram of the network addressing used in a distributed networking system of the present invention is shown. A host name provides a logical identification for a host node, such as a host processor node or an I/O adapter node. The host name identifies the endpoint for messages, so that messages are destined for processes residing on the end node specified by the host name. Thus, there is one host name per node, but a node can have multiple IPSOEs.
A single link layer address (for example, an Ethernet media access layer address) 804 is assigned to each port 806 of end node component 802. A component can be an IPSOE, a switch, or a router. All IPSOE and router components must have a MAC address. A media access point on a switch is also assigned a MAC address.
One network address (for example, an IP address) 812 is assigned to each port 806 of end node component 802. A component can be an IPSOE, a switch, or a router. All IPSOE and router components must have a network address. A media access point on a switch is also assigned a MAC address.
The ports of switch 810 do not have link layer addresses associated with them. However, switch 810 can have a media access port 814 that has link layer address 808 and network layer address 816 associated with it.
Fig. 9 illustrates a portion of a distributed computer system according to a preferred embodiment of the present invention. Distributed computer system 900 includes subnet 902 and subnet 904. Subnet 902 includes host processor nodes 906, 908, and 910. Subnet 904 includes host processor nodes 912 and 914. Subnet 902 includes switches 916 and 918. Subnet 904 includes switches 920 and 922.
Routers create and connect subnets. For example, subnet 902 is connected to subnet 904 by routers 924 and 926. In one example embodiment, a subnet has up to 2^16 end nodes, switches, and routers.
A subnet is defined as a group of end nodes and cascaded switches that is managed as a single unit. Typically, a subnet occupies a single geographic or functional area. For example, a single computer system in one room could be defined as a subnet. In one embodiment, the switches in a subnet can perform very fast wormhole or cut-through routing of messages.
A switch within a subnet examines the destination link layer address (for example, a MAC address), which is unique within the subnet, so that the switch can route incoming message frames quickly and efficiently. In one embodiment, the switch is a relatively simple circuit and is typically implemented as a single integrated circuit. A subnet can have hundreds to thousands of end nodes formed by cascaded switches.
As shown in Figure 9, to expand to much larger systems, subnets are connected by routers, such as routers 924 and 926. A router interprets the destination network layer address (for example, an IP address) and routes the frame.
An example embodiment of a switch is illustrated generally in Fig. 3B. Each I/O path on a switch or router has a port. Generally, a switch can route frames from one port to any other port on the same switch.
Within a subnet, such as subnet 902 or subnet 904, the path from a source port to a destination port is determined by the link layer address (for example, the MAC address) of the destination host IPSOE port. Between subnets, the path is determined by the network layer address (IP address) of the destination IPSOE port and by the link layer address (for example, the MAC address) of the router port that will be used to reach the destination subnet.
In one embodiment, the paths used by a request frame and by its corresponding positive acknowledgment (ACK) frame are not required to be symmetric. In one embodiment employing oblivious routing, switches select an output port based on the link layer address (for example, the MAC address). In one embodiment, a switch uses one set of routing decision criteria for all of its input ports. In one example embodiment, the routing decision criteria are contained in one routing table. In an alternative embodiment, a switch employs a separate set of criteria for each input port.
A data transaction in the distributed computer system of the present invention is typically composed of several hardware and software steps. A client process data transport service can be a user-mode or a kernel-mode process. The client process accesses the IPSOE hardware through one or more queue pairs, such as the queue pairs illustrated in Figs. 3A, 5, and 8. The client process calls an operating-system-specific programming interface, referred to herein as "verbs". The software code implementing the verbs posts a work queue element to the work queue of the given queue pair.
There are many possible methods of posting a work queue element and many possible work queue element formats, which allow for various cost/performance design points but do not affect interoperability. A user process, however, must communicate with the verbs in a well-defined manner, and the format and protocols of data transmitted across the IP network infrastructure must be sufficiently specified to allow devices to interoperate in a heterogeneous vendor environment.
In one embodiment, the IPSOE hardware detects work queue element postings and accesses the work queue element. In this embodiment, the IPSOE hardware translates and validates the virtual addresses of the work queue element and accesses the data.
An outgoing message is split into one or more data frames. In one embodiment, the IPSOE hardware adds a DDP/RDMA header, a frame header and CRC, a transport header, and a network header to each frame. The transport header includes sequence numbers and other transport information. The network header includes routing information, such as the destination IP address and other network routing information. The link header contains the destination link layer address (for example, the MAC address) or other local routing information.
If TCP or SCTP is employed, when a request data frame reaches its destination end node, the destination end node uses acknowledgment data frames to let the sender of the request data frame know that the frame was validated and accepted at the destination. An acknowledgment data frame acknowledges one or more valid and accepted request data frames. The requester can have multiple outstanding request data frames before it receives any acknowledgment. In one embodiment, the number of outstanding messages, that is, request data frames, is determined when the queue pair is created.
One embodiment of a layered architecture 1000 for implementing the present invention is illustrated generally in Figure 10. The layered architecture diagram of Figure 10 shows the various layers of the data communication path and the organization of the data and control information passed between the layers.
The IPSOE end node protocol layers (employed by end node 1011, for instance) include upper-layer protocol 1002 defined by user 1003, transport layer 1004, network layer 1006, link layer 1008, and physical layer 1010. The switch layers (employed by switch 1013, for instance) include link layer 1008 and physical layer 1010. The router layers (employed by router 1015, for instance) include network layer 1006, link layer 1008, and physical layer 1010.
Users 1003 and 1005 represent applications or processes that employ the other layers to communicate between end nodes. Transport layer 1004 provides end-to-end message movement. In one embodiment, the transport layer provides four types of transport services, as described above: traditional TCP, RDMA over TCP, SCTP, and UDP. Network layer 1006 performs frame routing through a subnet or multiple subnets to the destination end node. Link layer 1008 performs flow control 1020, error checking, and prioritized frame delivery across links.
Referring now to Figure 11, a flowchart and diagram illustrating two memory registration mechanisms according to a preferred embodiment of the present invention are provided. In the "traditional mechanism" (1120) for registering a memory region with the IPSOE, user 1100 uses a single step 1104 to register the memory region with the IPSOE. This single step uses memory-mapped I/O (MMIO), programmed I/O (PIO), or possibly CPU-assisted direct memory access (DMA) to transfer the memory translation and protection table (TPT) entry into memory TPT 1108 of IPSOE 1112. If the traditional mechanism uses MMIO or PIO to perform the transfer, the user must wait for the MMIO or PIO operations to return control to the host CPU before the newly created memory TPT entry can be used. Depending on the specific implementation, this delay may reduce system performance.
Figure 11 also illustrates the send-queue-based physical memory registration mechanism 1130. Under this mechanism, the verbs used to access the IPSOE explicitly expose physical memory registration through send queues to the user. User 1140 must first enable the use of this mechanism in the QP. This is the first step (1144) of the send-queue-based physical memory registration mechanism. This step consists of setting the "send-queue-based physical memory registration enabled" field in QP context 1148. After the QP context has been enabled to support send-queue-based physical memory registration, user 1140 (step 1152) requests IPSOE 1192 to create an entry in memory TPT 1172 by posting a send queue (SQ) work request (WR) to one of the IPSOE SQs, such as SQ 1164. The verbs interface immediately returns the STag associated with the memory registration WR, converts the memory registration WR into SQ work queue element (WQE) 1160, and places the memory registration WQE on SQ 1164 (step 1156). Upon receiving the immediate return, user 1140 can begin using the STag in local or remote WRs placed on the same SQ 1164.
When SQ 1164 of IPSOE 1192 processes physical memory registration WQE 1160, it validates the memory registration WQE. If the QP has enabled the send-queue-based physical memory registration mechanism and the STag is valid (for example, the STag points to an entry of the memory TPT, the STag Tag_Instance matches the Tag_Instance in that entry, and the memory TPT has enough space for the new entry), then new memory TPT entry 1172 is created (step 1168).
If the memory registration WQE encounters an error (for example, the STag does not point to an entry of the memory TPT, the STag Tag_Instance does not match the Tag_Instance in the entry the STag points to, or the entry the STag points to does not have enough space for the new memory TPT entry), an implementation can take one of two semantic options. Option 1 (step 1174) is reactive and assumes that user 1140 is not aware of the memory TPT space. If the IPSOE implementation uses option 1, the following process is performed: the IPSOE places the QP associated with the SQ in the Send Queue Drain state; stops processing WQEs after memory registration WQE 1160 (but continues processing all RQ WQEs, all incoming RDMA Read requests, any Terminate messages, and all prior SQ WQEs); generates an error completion CQE 1180 that identifies the error in the memory registration WQE; places CQE 1180 on completion queue 1184; and returns Flush Error CQEs through CQ 1184 for all subsequent SQ WQEs. User 1140 can retry the memory registration WR that completed in error and all subsequent WRs.
Option 2 (step 1176) is anticipatory and assumes that user 1140 is aware of the memory TPT space. That is, the user knows how IPSOE 1192 is using the memory TPT space. Under this option, user 1140 issues only memory registration WRs for which enough space is guaranteed in the memory TPT. If the IPSOE implementation uses option 2, the following process is performed: the IPSOE places the QP associated with the SQ in the Error state; stops processing all local and remote operations; sends a Terminate message and closes the RDMA stream; generates an error completion CQE 1180 that identifies the error in the memory registration WQE; places CQE 1180 on completion queue 1184; and returns Flush Error CQEs through CQ 1184 for all other SQ and RQ WQEs.
Finally (step 1188), user 1140 retrieves the result of the physical memory registration WR through work completion 1188.
Turning next to Figure 12, a diagram of a memory management system according to a preferred embodiment of the present invention is illustrated. Memory management system 1200 employs a two-table memory translation and protection management structure comprising memory region/window table 1202 and address translation table 1204. Together, these two tables are referred to as the memory translation and protection table (memory TPT). Memory region/window table 1202 contains the information used by the IPSOE hardware to determine whether access to a memory region referenced in a work request or remote operation is authorized. In this example, access may be requested in WQE data segment 1206 in work queue 1208. Address translation table 1204 contains the information used to translate a virtual address provided in WQE data segment 1206 into a list of one or more real addresses of the pages that make up a data buffer in a memory region, such as memory region 1210.
When a WQE data segment, such as WQE data segment 1206, is received, the key index in the WQE data segment is used as an index into memory region/window table 1202 to identify a memory region entry or memory window entry, such as memory region entry 1212 or memory window entry 1213. Memory region table entry 1212 is used to determine whether the requested memory access is authorized for the memory region defined by the entry. If the access is authorized, address translation table 1204 is accessed. There are multiple address translation tables, one for each defined memory region/window. Each entry in an address translation table is the real address of a page that makes up part of the memory region/window. The entries are arranged in ascending order, corresponding to the ascending virtual addresses associated with the memory region/window. The IPSOE hardware indexes address translation table 1204 according to an offset into the memory region/window. This offset is computed by subtracting the starting virtual address 1214 of the memory region/window, obtained from memory region/window table entry 1212, from virtual address 1216 specified in the work request or remote operation packet header. The result forms offset 1218 into the memory region to be accessed. The low-order bits of this offset index into the page specified by the address translation table entry, and the high-order bits index the address translation table. In this example, offset 1218 resolves to the real addresses identifying pages 1220-1226, which make up the data buffer referenced by WQE data segment 1206.
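The offset computation described above can be sketched as follows. This is a simplified model, not the hardware design: the class and function names are hypothetical, the page size is assumed to be a power of two, and the region is assumed to start on a page boundary (the specification also allows the first entry to carry a byte offset into the first page).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionEntry:
    """Hypothetical, simplified memory region entry plus its address translation table."""
    start_va: int    # starting virtual address of the region
    length: int      # region length in bytes
    page_size: int   # assumed to be a power of two
    att: List[int]   # real page addresses, in ascending virtual-address order


def translate(entry: RegionEntry, va: int) -> int:
    """Translate a virtual address inside the region into a real address."""
    offset = va - entry.start_va                 # offset into the region
    if not 0 <= offset < entry.length:
        raise ValueError("virtual address outside the memory region")
    page_shift = entry.page_size.bit_length() - 1
    att_index = offset >> page_shift             # high-order bits index the ATT
    page_offset = offset & (entry.page_size - 1) # low-order bits index into the page
    return entry.att[att_index] + page_offset


region = RegionEntry(start_va=0x10000, length=0x3000, page_size=0x1000,
                     att=[0x7A000, 0x3C000, 0x91000])
print(hex(translate(region, 0x11234)))  # offset 0x1234: ATT index 1, page offset 0x234
```

Note that virtually contiguous addresses map to pages that need not be physically contiguous, which is why the table stores one real address per page.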
Figure 13 illustrates the layout of memory region/window table (MRWT) 1300 according to a preferred embodiment of the present invention, together with the memory region entries and memory window entries that are inserted when a memory region is registered or when a memory window is bound to an underlying memory region, respectively.
In this example, memory region/window entry 1302 comprises the starting virtual address 1304 of the memory region/window, memory region/window length 1306, protection domain 1308, tag instance (Tag_Instance) 1310, entry type 1311, valid entry 1312, access control 1314, iSCSI control 1315, ATT control 1316, page size 1318, and address translation pointer 1320.
Each entry in memory region/window table 1300 defines the characteristics of a memory region or memory window. A memory region entry (1302) describes a memory region. A memory window entry (1303) describes a memory window. The remainder of this section describes the contents of a memory region entry (MRE). Unless otherwise noted, the description also applies to a memory window entry (MWE), because the MRE and the MWE contain the same fields. However, a memory window entry may be optimized to use the address translation table of the memory region to which the memory window is bound.
The part of the STag used to reference a data buffer is called the Tag_Index. The IPSOE hardware uses it to index the memory region/window table and obtain the memory region table entry (MRE) corresponding to the memory region to be accessed. More specifically, the STag Tag_Index is used to reference the memory region. The STag of the memory region is included in a bind WQE.
The starting virtual address 1304 and the length 1306 of the memory region define the bounds of the memory region. Protection domain (PD) 1308 is used to determine whether the QP that originated the work queue request has the authority to access this memory region. That is, the PD value stored in the memory region entry must match the PD value stored in the QP. Tag_Instance 1310 is the tag instance value associated with the memory region and is used to validate the portion of the STag other than the Tag_Index. Tag_Instance provides access control when the definition of the memory region changes. More specifically, the STag Tag_Instance is validated against the Tag_Instance stored in the memory region entry for that region.
The following example covers a two-bit implementation of the entry type field: if the entry type field is '00'b, the entry is for an RDMA region; if the entry type field is '01'b, the entry is for an RDMA window; if the entry type field is '10'b, the entry is for an iSCSI region; and the entry type value '11'b is reserved and not used.
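The two-bit encoding above maps naturally onto an enumeration. The sketch below is illustrative only; the names `EntryType` and `decode_entry_type` are hypothetical.

```python
from enum import IntEnum


class EntryType(IntEnum):
    """Two-bit entry type values from the example encoding."""
    RDMA_REGION = 0b00
    RDMA_WINDOW = 0b01
    ISCSI_REGION = 0b10
    # 0b11 is reserved and deliberately has no member.


def decode_entry_type(field: int) -> EntryType:
    """Decode the two low-order bits of the entry type field."""
    value = field & 0b11
    if value == 0b11:
        raise ValueError("entry type '11'b is reserved")
    return EntryType(value)


print(decode_entry_type(0b10).name)  # ISCSI_REGION
```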
If the entry is for an RDMA region (including iSCSI-R), the first address translation table entry of the memory region points to the byte offset into the first physical page associated with the memory region. If the entry is for an RDMA window (including iSCSI-R), the first address translation table entry of the memory window may be implemented as an index into the ATT of the memory region to which the memory window is bound. If the entry is for iSCSI 1.0, the virtual address 1304 field serves as a pointer to the send queue WQE that contains the iSCSI command, and the first address translation table entry of the iSCSI 1.0 region points to the byte offset into the first physical page associated with the iSCSI 1.0 region.
Valid entry 1312 indicates whether the entry is valid. Valid entry 1312 can be implemented with one bit. If the bit is set, the entry is valid; otherwise, it is invalid.
Access control 1314 determines the access rights for this memory region. The access control 1314 field contains three subfields: access control type 1330, window bind control 1332, and contact access control 1334.
The access control type 1330 field covers four different access types: local read, local write, remote read, and remote write. These access types can be encoded with four bits, where setting a bit enables the access type associated with that bit and leaving it unset disables that access type. For example, if the local write access bit is set, local write access is enabled. If the local write access bit is not set, local write access is disabled. Note: if the standard specification prohibits windows from being used for local access, a memory window entry has only two access types: remote read access and remote write access.
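A four-bit access field of this kind can be modeled with bit flags. The sketch below is a minimal illustration; the specific bit positions and the names `Access` and `is_allowed` are assumptions, since the text does not assign bit positions.

```python
from enum import IntFlag


class Access(IntFlag):
    """One bit per access type; the bit positions are assumed for illustration."""
    LOCAL_READ = 0b0001
    LOCAL_WRITE = 0b0010
    REMOTE_READ = 0b0100
    REMOTE_WRITE = 0b1000


def is_allowed(field: Access, requested: Access) -> bool:
    """True only if every requested access type has its bit set in the field."""
    return (field & requested) == requested


rights = Access.LOCAL_READ | Access.REMOTE_WRITE
print(is_allowed(rights, Access.LOCAL_READ))   # True: the local read bit is set
print(is_allowed(rights, Access.LOCAL_WRITE))  # False: the local write bit is not set
```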
The contact access control 1334 field contains at least two bits: a contact-enable bit, which specifies whether single-contact access is enabled for the entry; and a contact bit, which is meaningful only when the contact-enable bit is set to single contact and which specifies whether the entry has previously been touched.
A single-contact entry cannot be used after the incoming byte stream reaches the end of the last segment associated with (lined up with) an RDMA message that targets the entry. A multiple-contact entry can still be used after the incoming byte stream reaches the end of the last segment associated with an RDMA message that targets the region.
For a single-contact memory region, the valid entry 1312 field of the memory region is reset when the incoming byte stream reaches the end of the last segment associated with an RDMA message that targets the region. For a multiple-contact memory region, the valid entry 1312 field of the memory region is unaffected when the incoming byte stream reaches the end of the last segment associated with an RDMA message that targets the region. A more detailed description of this process is provided in Figure 18 and the appendix.
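The difference between single-contact and multiple-contact entries can be sketched as a small state update. This is an illustrative model only; the class `Entry`, its field names, and the function `on_last_segment_end` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical subset of a memory region entry."""
    valid: bool = True            # valid entry field
    single_contact: bool = False  # contact-enable bit
    contacted: bool = False       # contact bit, meaningful only for single contact


def on_last_segment_end(entry: Entry) -> None:
    """Apply the rule for the end of the last segment of an RDMA message."""
    if entry.single_contact:
        entry.contacted = True
        entry.valid = False  # the valid field is reset; the entry cannot be reused
    # For a multiple-contact entry the valid field is left unaffected.


once, many = Entry(single_contact=True), Entry(single_contact=False)
on_last_segment_end(once)
on_last_segment_end(many)
print(once.valid, many.valid)  # False True
```

Invalidating the entry in hardware at the end of the message removes the need for a separate invalidate operation from software for buffers that are meant to be used exactly once.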
If the standard specification allows only single-contact access for windows, an implementation may choose to apply single-contact access only to memory window entries (or it may offer single-contact access for memory regions as an option).
Stored DDP sequence number 1315 is used to store the DDP sequence number associated with the last byte of the payload of the DDP segment that has the Last bit set in the DDP header. A more detailed description of this process is provided in Figure 18 and the appendix.
ATT control 1316 specifies whether an ATT entry references a physical page address (direct pointer), a list of physical page addresses (one-level indirect pointer), or an ATT page that contains a list of ATT pages (two-level indirect pointer). Address translation pointer 1212 references the address translation table associated with this memory region. Note that the first ATT entry that references a physical page may point to an offset into that page. Similarly, the last entry may start at the beginning of the last physical page and end at some offset.
Page size 1318 specifies the page size. For example, 4 KB, 8 KB, 64 KB, 1 MB, 16 MB, and 256 MB, among other possible sizes, can be valid page sizes, as will be apparent to those skilled in the art.
ATT entry 1320 specifies one or more 64-bit physical addresses. If ATT control field 1316 is set to direct pointer 1320, each ATT entry 1320 points to one physical address. If ATT control field 1316 is set to one-level indirect pointer 1338, each ATT entry 1320 points to a list of physical addresses 1340. If ATT control field 1316 is set to two-level indirect pointer 1348, each ATT entry 1320 points to a list of ATT entries 1350, and each entry in ATT entry list 1350 points to a list of physical addresses 1354.
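The three pointer forms can be illustrated by flattening each one into the ordered list of physical page addresses it describes. In this sketch Python lists stand in for the in-memory lists that the pointers reference; the names `AttControl` and `physical_pages` are hypothetical.

```python
from enum import IntEnum
from typing import List


class AttControl(IntEnum):
    """Pointer forms selected by the ATT control field (numeric values assumed)."""
    DIRECT = 0
    ONE_LEVEL_INDIRECT = 1
    TWO_LEVEL_INDIRECT = 2


def physical_pages(att_entries: list, control: AttControl) -> List[int]:
    """Flatten ATT entries into an ordered list of physical page addresses."""
    if control is AttControl.DIRECT:
        # Each entry is itself one physical address.
        return list(att_entries)
    if control is AttControl.ONE_LEVEL_INDIRECT:
        # Each entry refers to a list of physical addresses.
        return [addr for addr_list in att_entries for addr in addr_list]
    # Two-level: each entry refers to a list of ATT entries, each of which
    # refers to a list of physical addresses.
    return [addr for att_list in att_entries
            for addr_list in att_list for addr in addr_list]


print(physical_pages([0x1000, 0x5000], AttControl.DIRECT))
print(physical_pages([[0x1000, 0x5000], [0x9000]], AttControl.ONE_LEVEL_INDIRECT))
print(physical_pages([[[0x1000], [0x5000]], [[0x9000]]], AttControl.TWO_LEVEL_INDIRECT))
```

The extra levels of indirection let a fixed-size entry describe a large region without requiring one contiguous translation table.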
Figure 14 is a flowchart of the process used by a memory registration user (referred to simply as the user) to register a memory region according to a preferred embodiment of the present invention. First, the user checks whether the memory region is shared by multiple processes that use common address translation table entries (step 1400). If the memory region is not shared in this way (step 1400: no), the user must create both a memory region table entry (MRTE) and address translation table entries (ATTEs). If multiple processes use common address translation table entries (step 1400: yes), the user must use the same ATTEs and create only an MRTE (step 1412).
Referring now to Figure 15, a flowchart and diagram illustrate the process used by the IPSOE, according to a preferred embodiment of the present invention, to validate the memory accesses performed by a work queue element that the user posted to an IPSOE work queue as a work request.
First (step 1552), user 1540 posts a work request to IPSOE work queue 1564. The work queue can be a send queue or a receive queue. The work request contains zero or more data segments. For RDMA (including iSCSI-R), each data segment contains an STag, a virtual address, and a length. For iSCSI 1.0: a single STag is used for all data segments of the WR; each data segment in the WR contains a physical address; the first data segment contains an additional field that defines the starting offset into the first physical page; the last data segment contains an additional field that defines the ending offset into the last physical page; and all intermediate data segments contain only a physical address, because intermediate pages must begin and end on page boundaries.
Next (step 1556), the verbs interface converts the WR into a work queue element (WQE) and places WQE 1560 on WQ 1564.
Then (step 1568), IPSOE 1592 accesses WQE 1560. If WQ 1564 is an RDMA (including iSCSI-R) WQ, each data segment referenced in WQE 1560 is validated. The validation consists of the following checks: a) the valid entry field is set; b) the entry type of the entry is set to region (that is, windows cannot be used for local access); c) the PD (protection domain) in the QP context associated with WQ 1564 matches the PD of memory region entry 1570 referenced by the data segment STag; d) the Tag_Instance that is part of the data segment STag matches the Tag_Instance in MRE 1570; e) the base address and length of the data segment fall within the address range associated with MRE 1570; f) the access type is valid (SQ RDMA Write and Send WRs require local read access, and RQ WRs require local write access); and g) for an SQ bind WR, MRE 1570 to which the memory window is bound allows window access. Note: if the WR is an SQ bind WR that requests single-contact access for the associated window, the access bit is set to single contact when the IPSOE creates the associated window.
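Checks a) through f) above can be sketched as a single validation routine. This is a simplified software model of what the text describes as IPSOE processing: the class and function names are hypothetical, the bind-specific check g) is omitted, and the caller states which local access right the work request needs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryRegionEntry:
    """Hypothetical subset of the MRE fields used by the checks."""
    valid: bool
    is_region: bool        # entry type is region rather than window
    protection_domain: int
    tag_instance: int
    start_va: int
    length: int
    local_read: bool
    local_write: bool


@dataclass
class DataSegment:
    tag_instance: int      # Tag_Instance part of the segment's STag
    va: int
    length: int


def validate_segment(mre: MemoryRegionEntry, seg: DataSegment,
                     qp_pd: int, needs_local_write: bool) -> Optional[str]:
    """Return None if the segment passes, else a label for the first failed check."""
    if not mre.valid:
        return "a: entry is not valid"
    if not mre.is_region:
        return "b: windows cannot be used for local access"
    if mre.protection_domain != qp_pd:
        return "c: protection domain mismatch"
    if mre.tag_instance != seg.tag_instance:
        return "d: Tag_Instance mismatch"
    if seg.va < mre.start_va or seg.va + seg.length > mre.start_va + mre.length:
        return "e: segment outside the region bounds"
    allowed = mre.local_write if needs_local_write else mre.local_read
    if not allowed:
        return "f: access type not permitted"
    return None


mre = MemoryRegionEntry(True, True, 7, 3, 0x1000, 0x2000, True, False)
print(validate_segment(mre, DataSegment(3, 0x1800, 0x100), qp_pd=7, needs_local_write=False))
print(validate_segment(mre, DataSegment(3, 0x1800, 0x100), qp_pd=7, needs_local_write=True))
```

In this model a send queue RDMA Write or Send work request would pass `needs_local_write=False`, and a receive queue work request would pass `needs_local_write=True`.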
If WQ 1564 is an iSCSI 1.0 WQ, the STag supplied in the WQE is used to create an iSCSI memory region from the data segment list supplied in the WQE. iSCSI 1.0 memory regions are referenced zero-based, because they have no virtual address field.
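The zero-based addressing of an iSCSI 1.0 memory region can be illustrated with a minimal sketch. The page size, the function names, and the reading of the ending offset as one past the last valid byte in the last page are assumptions made for illustration; the description does not specify them.

```python
PAGE_SIZE = 4096  # assumed page size; the description does not give one


def region_length(pages, start_offset, end_offset):
    """Length of a region built from a physical page list.

    end_offset is assumed to be exclusive (one past the last valid byte
    in the last page).
    """
    return (len(pages) - 1) * PAGE_SIZE + end_offset - start_offset


def translate(pages, start_offset, end_offset, offset):
    """Map a zero-based region offset to a physical address."""
    if not 0 <= offset < region_length(pages, start_offset, end_offset):
        raise ValueError("offset outside the iSCSI memory region")
    # Intermediate pages are whole pages, so only the first page needs
    # the starting offset added before dividing by the page size.
    page, within = divmod(start_offset + offset, PAGE_SIZE)
    return pages[page] + within
```

Because intermediate pages begin and end on page boundaries, a single division is enough to locate the page; no per-segment length has to be stored.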
If the data segments supplied by the user are valid, the associated memory regions are accessed and the WQE is processed. On successful completion, a CQE is returned through the CQ associated with the WQ.
Next (step 1576), if any data segment is invalid, the associated memory regions are not accessed, and an error is returned in CQE 1580 on CQ 1584, which is associated with WQ 1564.
Finally, in step 1588, user 1540 retrieves the work completion (WC) for the WR submitted in step 1552.
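The local validation of an RDMA data segment described for step 1568 can be sketched as follows. This is an illustrative model only: the class and field names are hypothetical, the STag is modeled as a (Tag_Index, Tag_Instance) pair, the table is a plain dictionary, and check g) for Bind WRs is omitted.

```python
from dataclasses import dataclass
from enum import Enum, Flag


class EntryType(Enum):
    REGION = "region"
    WINDOW = "window"


class Access(Flag):
    LOCAL_READ = 1
    LOCAL_WRITE = 2


@dataclass
class MemoryRegionEntry:      # hypothetical layout of an MRE
    valid: bool
    entry_type: EntryType
    pd: int
    tag_instance: int
    base: int
    length: int
    access: Access


@dataclass
class DataSegment:            # STag modeled as tag_index plus tag_instance
    tag_index: int
    tag_instance: int
    virtual_address: int
    length: int


def validate_segment(tpt, qp_pd, seg, needed):
    """Return None if checks a) to f) pass, else the name of the failed check."""
    mre = tpt.get(seg.tag_index)
    if mre is None or not mre.valid:
        return "invalid entry"                       # check a
    if mre.entry_type is not EntryType.REGION:
        return "not a region"                        # check b
    if mre.pd != qp_pd:
        return "PD mismatch"                         # check c
    if mre.tag_instance != seg.tag_instance:
        return "Tag_Instance mismatch"               # check d
    end = seg.virtual_address + seg.length
    if seg.virtual_address < mre.base or end > mre.base + mre.length:
        return "out of bounds"                       # check e
    if needed not in mre.access:
        return "access denied"                       # check f
    return None
```

A send queue WR would pass `Access.LOCAL_READ` as `needed`, and a receive queue WR would pass `Access.LOCAL_WRITE`. Any non-None result corresponds to the error CQE path of step 1576.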
Referring now to Figure 16, a flowchart and diagram illustrate the process used to distinguish the different types of streams that can be associated with a remote operation.
The IPSOE receives incoming TCP/IP segment 1600.
In step 1604, the IPSOE validates the incoming TCP segment using well-known TCP/IP/Ethernet validation mechanisms. In step 1608, the IPSOE checks whether an error was encountered while validating the TCP/IP segment. If the incoming TCP/IP segment is valid, the process proceeds to step 1612. Otherwise, the segment is discarded and the process continues to wait for TCP/IP segments (step 1616).
In step 1612, once validation is complete, the TCP/IP five-tuple of the incoming TCP/IP segment (transport type, destination TCP port number, source TCP port number, destination IP address, and source IP address) is used to look up the QP context associated with the incoming TCP segment.
In step 1620, if no QP context exists for the incoming TCP segment, the user is not using any IPSOE TCP/IP offload mechanism, and the incoming TCP segment is passed up to the user through well-known conventional NIC mechanisms.
In step 1624, if the incoming TCP segment references an iSCSI 1.0 QP context, the processing shown in Figure 17 is performed on the incoming TCP segment.
In step 1630, if the incoming TCP segment references an RDMA (including iSCSI-R) QP context, the IPSOE uses the Marker with PDU Alignment (MPA) mechanism to extract the DDP segment and its associated DDP header, and the processing shown in Figure 18 is performed on the incoming TCP segment.
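The classification performed in steps 1612 through 1630 amounts to a table lookup keyed by the five-tuple. The sketch below assumes the QP contexts are held in a dictionary that maps a five-tuple to a mode string; the mode strings and the returned labels are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    transport: str
    dst_port: int
    src_port: int
    dst_ip: str
    src_ip: str


def classify(qp_contexts, five_tuple):
    """Pick the processing path for a validated incoming TCP segment."""
    mode = qp_contexts.get(five_tuple)
    if mode is None:
        return "host_nic_path"       # step 1620: no offload in use
    if mode == "iscsi-1.0":
        return "iscsi_processing"    # step 1624: Figure 17
    if mode == "rdma":
        return "rdma_processing"     # step 1630: MPA, then Figure 18
    raise ValueError(f"unknown QP mode: {mode}")
```

Keying on the full five-tuple means that two connections to the same port from different peers resolve to different QP contexts.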
Figure 17A is a flowchart and diagram of the memory registration and deregistration mechanism associated with an iSCSI QP in accordance with a preferred embodiment of the present invention. Figure 17A illustrates the memory management functions associated with a QP, such as QP context 1706 in IPSOE 1708, that has been initialized in iSCSI mode.
In step 1704, when user 1702 (typically an iSCSI device driver running in the operating system kernel of the host CPU) initializes QP context 1706, user 1702 sets the QP mode to iSCSI 1.0. Once the QP context has been initialized in iSCSI mode, work requests posted to the QP's send queue, such as SQ 1728, contain an iSCSI command and its associated list of data transfer data segments. The IPSOE follows the flowchart of Figure 17A: it creates (registers) a memory TPT entry for the iSCSI command and its associated data transfer data segments; sends the iSCSI command to the target; performs the data transfers associated with the iSCSI command (Figure 17B); and, when the iSCSI response is received, destroys (deregisters) the memory TPT entry of the iSCSI command and creates a WC containing the iSCSI response.
Before user 1702 can send an iSCSI command to the target, user 1702 must create an RQ WQE to receive the iSCSI response. In step 1710, user 1702 passes an RQ WR to IPSOE 1708 for the iSCSI response that will be associated with the iSCSI command. In step 1712, the verbs interface validates the RQ WR and, if it is valid, creates RQ WQE 1714 from the WR, places RQ WQE 1714 on the associated RQ 1716, and returns immediately to user 1702. If the WR is invalid, the verbs interface returns an error to user 1702.
In step 1720, user 1702 then requests that IPSOE 1708 execute an iSCSI command by passing to it an SQ WR that contains the iSCSI command and its associated data transfer data segments. In step 1732, the verbs interface validates the SQ WR and, if it is valid, creates an SQ WQE from the WR, places SQ WQE 1724 on the associated SQ 1728, and immediately returns an iSCSI command ID to user 1702. The user and the IPSOE use this command ID to associate the iSCSI response with the iSCSI command. If the WR is invalid, the verbs interface returns an error to user 1702.
When SQ 1728 of IPSOE 1708 processes iSCSI command SQ WQE 1724, it validates the WQE. In step 1736, a new iSCSI memory TPT entry 1740 is created in memory TPT 1744 if the QP was initialized in iSCSI mode, the iSCSI command is valid for the device type to which it will be sent, the data transfer data segments associated with the iSCSI command are valid (for example, they do not wrap), and the memory TPT has enough space for a new entry.
If iSCSI command SQ WQE 1724 encounters an error (for example, the memory TPT does not have enough space for another entry, or the iSCSI command opcode is invalid for the referenced device type), an implementation may adopt one of two semantic options. Option 1 (step 1748) is reactive and assumes that user 1702 is not aware of memory TPT space. If the IPSOE implementation uses option 1, the following process is performed: the IPSOE places the QP associated with the SQ into the send queue drain state; stops processing WQEs after iSCSI command SQ WQE 1724 (but continues to process all RQ WQEs, all incoming R2Ts, all prior SQ WQEs, and other incoming iSCSI control messages); generates completion-error CQE 1776 identifying the error in iSCSI command SQ WQE 1724; places CQE 1776 on completion queue 1772; and returns a flushed-in-error CQE through CQ 1772 for every subsequent SQ WQE. User 1702 may retry the iSCSI command WR that completed in error and all subsequent WRs.
Option 2 (step 1752) is proactive and assumes that user 1702 is aware of memory TPT space; that is, the user knows how the IPSOE is using memory TPT space. Under this option, user 1702 posts only iSCSI command WRs for which sufficient space is guaranteed in the memory TPT. If the IPSOE implementation uses option 2, the following process is performed: the IPSOE places the QP associated with the SQ (QP 1706) into the error state; stops processing all local and remote operations; terminates the iSCSI stream; generates completion-error CQE 1776 identifying the error in the iSCSI command SQ WQE; places CQE 1776 on completion queue 1772; and returns a flushed-in-error CQE through CQ 1772 for every other SQ and RQ WQE.
In step 1756, when the IPSOE target processing logic obtains the iSCSI command SQ WQE, the IPSOE sends the iSCSI command to the target.
The flowchart of Figure 17B, described below, illustrates the mechanism used by the IPSOE to perform the data transfer phase of an iSCSI command in accordance with a preferred embodiment of the present invention.
When the target completes the iSCSI command (step 1764), the target sends the iSCSI response to the initiator (or, for a device read, includes the iSCSI status in a Data-In message).
In step 1768, when the IPSOE receives the iSCSI response, the IPSOE verifies that QP context 1706 is associated with the stream (for example, over SCTP) or connection (for example, over TCP) on which the iSCSI response was received. For TCP/IP, the IPSOE performs this step by ensuring that the five-tuple associated with the incoming iSCSI response (transport type, destination port number, source port number, destination IP address, and source IP address) matches the five-tuple associated with the QP context. The IPSOE then validates the other iSCSI and TCP fields associated with the iSCSI response message (for example, that the sequence number in the TCP segment containing the iSCSI response matches the next expected sequence number stored in the QP). Next, the IPSOE uses the Tag_Index portion of the iSCSI initiator tag to look up the memory TPT entry (1740) associated with the iSCSI response. The IPSOE also validates the Tag_Instance portion of the iSCSI initiator tag.
If the incoming iSCSI response is valid (including the Tag_Instance portion of the iSCSI initiator tag), IPSOE 1708: accesses memory TPT entry 1740 and extracts from it the command ID of the iSCSI command associated with the iSCSI response; destroys (deregisters) memory TPT entry 1740; and places the command ID and the incoming iSCSI response in RQ WQE 1714 of the QP associated with the incoming iSCSI response. Otherwise, IPSOE 1708 discards the incoming iSCSI response.
Finally, in step 1778, user 1702 retrieves the WC containing the iSCSI command ID and the iSCSI response. User 1702 uses the iSCSI command ID to associate the iSCSI response with the iSCSI command.
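The register, look up, and deregister cycle of Figure 17A can be modeled in a few lines. The sketch below is illustrative only: the class name, the 8-bit width of Tag_Instance, and the packing of the tag as index above instance are assumptions, since the description does not give a bit layout.

```python
INSTANCE_BITS = 8  # assumed width of Tag_Instance


class IscsiTpt:
    """Toy memory TPT that holds one command ID per entry."""

    def __init__(self, capacity):
        self.entries = [None] * capacity
        self.instances = [0] * capacity

    def register(self, command_id):
        """Create an entry for a command and return its initiator tag."""
        for index, entry in enumerate(self.entries):
            if entry is None:
                # Bump the instance so tags from earlier uses go stale.
                self.instances[index] = (self.instances[index] + 1) % (1 << INSTANCE_BITS)
                self.entries[index] = command_id
                return (index << INSTANCE_BITS) | self.instances[index]
        raise RuntimeError("memory TPT has no space for a new entry")

    def complete(self, tag):
        """Handle an iSCSI response: return the command ID and deregister,
        or return None when the response must be discarded."""
        index = tag >> INSTANCE_BITS
        instance = tag & ((1 << INSTANCE_BITS) - 1)
        if index >= len(self.entries) or self.entries[index] is None:
            return None
        if self.instances[index] != instance:
            return None
        command_id = self.entries[index]
        self.entries[index] = None       # destroy (deregister) the entry
        return command_id
```

The Tag_Instance comparison is what rejects a late or replayed response that carries a tag whose entry has since been reused for another command.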
The mechanism shown in Figure 17A can be applied to generic QPs. That is, the memory registration step can be combined with a generic WR, and the deregistration step can be performed when the remote node sends a message containing the tag (for example, a Steering Tag) that is to be deregistered.
Figure 17B is a flowchart and diagram illustrating the memory management process used to validate the iSCSI initiator TCP segments that perform remote iSCSI 1.0 data transfer (for example, R2T or Data-In) operations in accordance with a preferred embodiment of the present invention. The following implementation covers only data transfer messages. Non-data-transfer messages are passed up to the user through the receive queue of the iSCSI QP.
In step 1796, the incoming DDP segment is validated using the Tag_Index portion of the Initiator Task Tag of the incoming iSCSI header. Processing proceeds to step B.
In step 1794, the following checks are performed on the entry indexed by the Tag_Index portion of the Initiator Task Tag of the incoming iSCSI header: a) the valid entry field is set; b) the entry type of the entry is set to iSCSI 1.0 (that is, a region or window cannot be used for iSCSI 1.0); c) the PD in the QP context associated with the incoming TCP segment matches the PD of the memory TPT entry; d) the Tag_Instance that forms part of the Initiator Task Tag of the iSCSI 1.0 header matches the Tag_Instance in the memory TPT entry; e) the offset in the incoming iSCSI 1.0 segment header does not exceed the length stored in the length field of the memory TPT entry; f) the access type is valid (for example, for an R2T the memory TPT entry enables remote read access, and for a Data-In the memory TPT entry enables remote write access); and g) the initiator's iSCSI command (looked up by using the virtual address field of the memory TPT entry) matches the operation of the incoming (target) iSCSI message (that is, either the initiator iSCSI command is a disk write and the incoming iSCSI message is an R2T, or the iSCSI command is a disk read and the incoming iSCSI message is a Data-In). If all of these checks pass, processing proceeds to step 1792. Otherwise, the erroneous iSCSI message is passed to the user through the receive queue of the iSCSI QP.
In step 1792, the type of the iSCSI message is determined. In step 1790, if the incoming iSCSI header is an R2T, the target offset is used to offset into the buffer pointed to by the tag index portion of the Initiator Task Tag, and the buffer contents, up to the length specified in the iSCSI header, are transferred to the remote node. The control information of the iSCSI header is passed to the user through the receive queue of the iSCSI QP.
In step 1788, if the incoming iSCSI header is a Data-In transfer, the target offset is used to offset into the buffer pointed to by the tag index portion of the Initiator Task Tag, and the buffer contents, up to the length specified in the iSCSI header, are transferred to the remote node. The control information of the iSCSI header is passed to the user through the receive queue of the iSCSI QP.
In step 1786, if the incoming iSCSI header is something other than a Data-In or R2T transfer, the entire iSCSI message (control information and any data) is passed to the user through the receive queue of the iSCSI QP.
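Checks e), f), and g) of step 1794, together with the message-type decision of steps 1786 through 1792, can be sketched as a single decision function. The enum names and returned labels are hypothetical, and the sketch deliberately models only the decision, not the data movement itself.

```python
from enum import Enum


class Command(Enum):
    DISK_WRITE = "write"
    DISK_READ = "read"


class Message(Enum):
    R2T = "r2t"
    DATA_IN = "data-in"
    OTHER = "other"


def handle_iscsi_message(command, message, remote_read, remote_write,
                         offset, region_length):
    """Decide what to do with an incoming iSCSI message for a TPT entry."""
    if message is Message.OTHER:
        return "deliver_whole_message_to_user"       # step 1786
    if offset > region_length:                       # check e
        return "deliver_error_to_user"
    if message is Message.R2T:
        # An R2T must answer a disk write and needs remote read access.
        ok = command is Command.DISK_WRITE and remote_read
    else:
        # A Data-In must answer a disk read and needs remote write access.
        ok = command is Command.DISK_READ and remote_write
    return "transfer_data" if ok else "deliver_error_to_user"
```

Pairing the message type with the original command keeps a target from sending Data-In against a buffer that was registered for a write command.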
Figure 18 is a flowchart and diagram illustrating the memory management process used, in accordance with a preferred embodiment of the present invention, to: provide a one-touch access mechanism that performs deregistration without exposing that function to the remote node; and validate the memory accesses associated with remote RDMA Read Request, RDMA Read Response, and RDMA Write operations. An RDMA Read Request references a DDP untagged buffer. An RDMA Read Response or RDMA Write references a DDP tagged buffer. Note that RDMA Send message processing is described in Figure 11.
For RDMA Read Request 1800, if the incoming DDP segment header references the untagged buffer with buffer number 2, the message sequence number (MSN) of the incoming DDP header is used to index the RDMA read request queue. This corresponds to step 1804.
To index the RDMA read request queue, the IPSOE maintains the next expected MSN. In step 1806, if the MSN of the incoming DDP header is the next expected MSN, or corresponds to an MSN associated with an available RDMA read request queue entry, the payload of the incoming DDP segment (that is, the RDMA Read Request) is placed in the RDMA read request queue entry referenced by the MSN of the incoming DDP header. Otherwise, the IPSOE invokes the RDMA stream termination procedure. The RDMA stream termination procedure consists of creating a Terminate RDMA message containing the reason for termination, sending the Terminate RDMA message to the other side of the RDMA stream, and then tearing down the RDMA stream (for example, by tearing down the TCP connection).
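The MSN-indexed placement of step 1806 can be sketched as a bounded window of slots. The interpretation of "available entry" as any unused slot within a fixed depth starting at the next expected MSN is an assumption, as are the class and method names.

```python
class ReadRequestQueue:
    """Toy RDMA read request queue indexed by message sequence number."""

    def __init__(self, depth, first_msn=1):
        self.depth = depth
        self.next_msn = first_msn    # next expected MSN
        self.slots = {}

    def accept(self, msn, payload):
        """Store a Read Request; False means the stream must be terminated."""
        in_window = self.next_msn <= msn < self.next_msn + self.depth
        if not in_window or msn in self.slots:
            return False
        self.slots[msn] = payload
        return True

    def pop_next(self):
        """Retire the request with the next expected MSN, if it has arrived."""
        if self.next_msn not in self.slots:
            return None
        payload = self.slots.pop(self.next_msn)
        self.next_msn += 1
        return payload
```

A False result from `accept` corresponds to invoking the RDMA stream termination procedure described above.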
In step 1808, the incoming RDMA Read Request is validated using the Tag_Index portion of the source STag of the incoming RDMA Read Request header.
In step 1812, the following checks are performed on the entry indexed by the Tag_Index portion of the source STag of the incoming RDMA Read Request header: a) the valid entry field is set; b) the entry type of the entry is set to window (that is, a region cannot be used for remote access); c) the PD in the QP context associated with the incoming TCP segment matches the PD of the memory window entry (MWE); d) the Tag_Instance that forms part of the DDP header STag matches the Tag_Instance in the MWE; e) the base (target offset) and length (MPA header length) of the incoming DDP segment lie within the address range associated with the MWE; and f) the access type is valid (that is, the MWE enables remote read access).
If all checks pass, the IPSOE creates the RDMA Read Response by reading the memory window referenced by the RDMA Read Request, and sends the RDMA Read Response. Otherwise, a Terminate message describing the cause of the error is generated.
For RDMA Read Response or RDMA Write 1814, in step 1816, if the header of the incoming DDP segment references a tagged buffer, the Tag_Index portion of the DDP header STag is used to index the memory region/window table.
In step 1820, the following checks are performed on the entry indexed by the Tag_Index portion of the incoming DDP header STag: a) the valid entry field is set; b) the entry type of the entry is set to window (that is, a region cannot be used for remote access); c) the PD in the QP context associated with the incoming TCP segment matches the PD of the memory window entry; d) the Tag_Instance that forms part of the DDP header STag matches the Tag_Instance in the MWE; e) the base (target offset) and length (MPA header length) of the incoming DDP segment lie within the address range associated with the MWE; and f) the access type is valid (that is, the MWE enables remote write access). If all checks pass, processing proceeds to step 1824. Otherwise, a Terminate message describing the cause of the error is generated.
In step 1824, the following fields are examined: the Last bit defined by the DDP standard in the incoming DDP header (if set, it indicates that the incoming DDP segment is the last DDP segment of an RDMA message); the one-touch enable bit stored in the memory TPT entry referenced by the STag; the touched bit stored in the memory TPT entry referenced by the STag; the stored DDP (byte stream) sequence number held in the memory TPT entry referenced by the STag; the underlying TCP byte sequence number; and the (byte stream) sequence number of the last byte of the incoming DDP segment. In step 1828, the following checks are performed on the fields listed above.
If the Last bit is not set, the target offset (TO) field of the incoming DDP header is used to index into the memory region/window referenced by the memory TPT entry, and the payload of the incoming DDP segment is transferred into the memory region/window (starting at TO).
If the Last bit is set and the one-touch enable bit of the memory TPT entry associated with the incoming DDP segment is not set, the target offset of the incoming DDP header is used to index into the memory region/window referenced by the memory TPT entry, and the payload of the incoming DDP segment is transferred into the memory region/window (starting at TO).
If the Last bit is set and both the one-touch enable bit and the touched bit of the memory TPT entry associated with the incoming DDP segment are set, the valid entry bit of the memory TPT entry is reset and a Terminate message describing the cause of the error (for example, an attempt to access a one-touch region/window twice) is generated.
If the Last bit is set, the one-touch enable bit of the memory TPT entry associated with the incoming DDP segment is set and its touched bit is not set, and the (byte stream) sequence number of the last byte of the incoming DDP segment equals the next expected TCP byte sequence number minus 1, the valid entry bit of the memory TPT entry is reset, the target offset field of the incoming DDP header is used to index into the memory region/window referenced by the memory TPT entry, and the payload of the incoming DDP segment is transferred into the memory region/window (starting at TO). This covers the case in which the incoming DDP segment is the final segment of a tagged buffer message and is received in order.
If the Last bit is set, the one-touch enable bit of the memory TPT entry associated with the incoming DDP segment is set and its touched bit is not set, and the (byte stream) sequence number of the last byte of the incoming DDP segment is within the TCP byte sequence window but is not the next expected TCP byte sequence number minus 1, the touched bit of the memory TPT entry is set, the (byte stream) sequence number associated with the last byte of the payload of the incoming DDP segment is stored in the DDP sequence number field of the memory TPT entry, and the payload of the incoming DDP segment is transferred into the memory region/window (starting at TO). When the incoming byte stream reaches the sequence number held in the DDP sequence number field of the memory TPT entry, the valid entry field of the memory TPT entry is reset. This covers the case in which the incoming DDP segment is the final segment of a tagged buffer message and is received out of order.
If the Last bit is set, the one-touch enable bit of the memory TPT entry associated with the incoming DDP segment is set and its touched bit is not set, and the (byte stream) sequence number of the last byte of the incoming DDP segment is outside the TCP byte sequence window, the incoming DDP segment is discarded (the sender will retransmit it).
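The one-touch cases of step 1828 can be summarized as a small state machine over a memory TPT entry. The sketch below is a simplified model under stated assumptions: field and function names are hypothetical, the in-order case is taken to require the one-touch enable bit to be set, the TCP window is modeled as a fixed span starting at the next expected sequence number, and sequence number wraparound is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TptEntry:
    valid: bool = True
    one_touch_enabled: bool = False
    touched: bool = False
    ddp_seq: Optional[int] = None    # stored DDP (byte stream) sequence number


def on_tagged_segment(entry, last, last_byte_seq, next_expected_seq, window):
    """Return 'place', 'terminate', or 'drop' for an incoming tagged segment."""
    if not last or not entry.one_touch_enabled:
        return "place"                       # ordinary placement
    if entry.touched:
        entry.valid = False                  # second access to a one-touch entry
        return "terminate"
    if last_byte_seq == next_expected_seq - 1:
        entry.valid = False                  # final segment, received in order
        return "place"
    if next_expected_seq <= last_byte_seq < next_expected_seq + window:
        entry.touched = True                 # final segment, out of order
        entry.ddp_seq = last_byte_seq
        return "place"
    return "drop"                            # outside the window; sender resends


def on_stream_advance(entry, next_expected_seq):
    """Invalidate a touched entry once the byte stream passes its stored number."""
    if entry.touched and entry.ddp_seq is not None and next_expected_seq > entry.ddp_seq:
        entry.valid = False
```

Deferring invalidation in the out-of-order case lets earlier segments of the same message, which are still in flight, be placed before the entry disappears.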
It should be noted that, although the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer-readable medium of instructions or other functional descriptive material and in a variety of other forms, and that the present invention applies equally regardless of the particular type of signal-bearing media actually used to carry out the distribution. Examples of computer-readable media include recordable-type media, such as a floppy disk, a hard disk drive, RAM, CD-ROMs, and DVD-ROMs, and transmission-type media, such as digital and analog communications links and wired or wireless communications links using transmission forms such as radio frequency and light wave transmissions. The computer-readable media may take the form of coded formats that are decoded for actual use in a particular data processing system. Functional descriptive material is information that imparts functionality to a machine. Functional descriptive material includes, but is not limited to, computer programs, instructions, rules, facts, definitions of computable functions, objects, and data structures.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Claims (36)
1. A method comprising:
receiving, in a network offload engine, a work request from a host; and
in response to receiving the work request, registering in a translation table a memory region associated with the host.
2. The method of claim 1, wherein the work request includes memory protection information associated with the memory region, the method further comprising:
storing the protection information in the translation table.
3. The method of claim 1, wherein the work request is received through a send queue.
4. The method of claim 1, further comprising:
in response to registering the memory region, returning a tag to the host, wherein the tag is associated with the memory region.
5. The method of claim 4, wherein the tag includes an index into the translation table.
6. The method of claim 1, further comprising:
in response to registering the memory region, placing a completion queue element on a completion queue.
7. The method of claim 1, wherein the work request includes an iSCSI command.
8. The method of claim 7, wherein the memory region is registered in response to processing the iSCSI command, the method further comprising:
performing a transaction in completing the iSCSI command;
receiving an iSCSI response associated with the transaction; and
in response to receiving the iSCSI response, deregistering the memory region associated with the iSCSI transaction.
9. The method of claim 1, further comprising:
generating a tag associated with the memory region; and
performing, over a network and through a connection protocol, an I/O transaction with a remote node in which the tag is used to reference the memory region, wherein data is transferred in the I/O transaction using direct access to the memory region.
10. The method of claim 9, wherein the connection protocol is Transmission Control Protocol (TCP).
11. The method of claim 1, further comprising:
establishing data representing a setting associated with the memory region, wherein the setting indicates that the memory region is configured to be valid for only a single access by a remote node, such that the memory region is invalidated in response to completion of a valid incoming remote operation that accesses the memory region.
12. The method of claim 1, wherein the memory region is registered in response to processing a combined register-memory-region-and-send work request associated with a transaction in an upper-layer protocol, the method further comprising:
receiving an incoming send message for the transaction in the upper-layer protocol, wherein the send message includes a request to deregister the tag associated with the memory region; and
in response to receiving the send message, deregistering the memory region associated with the tag.
13. A method comprising:
placing a work request on a send queue in a network offload engine, wherein the work request includes an identification of a memory region to be registered by the network offload engine; and
receiving from the network offload engine a tag associated with the registered memory region.
14. A method comprising:
registering a memory region in a network offload engine for use in transactions with a remote node;
performing a single transaction with the remote node with respect to the memory region; and
in response to performing the single transaction, deregistering the memory region.
15. A method comprising:
receiving from a remote node a tag associated with a memory region;
determining whether the memory region associated with the tag has been deregistered; and
in response to a determination that the memory region associated with the tag has been deregistered, indicating an error condition.
16. A computer program product in at least one computer-readable medium, comprising functional descriptive material that, when executed by a computer, enables the computer to perform acts including:
receiving, in a network offload engine, a work request from a host; and
in response to receiving the work request, registering in a translation table a memory region associated with the host.
17. The computer program product of claim 16, wherein the work request includes memory protection information associated with the memory region, the computer program product further comprising functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including:
storing the protection information in the translation table.
18. The computer program product of claim 16, wherein the work request is received through a send queue.
19. The computer program product of claim 16, further comprising functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including:
in response to registering the memory region, returning a tag to the host, wherein the tag is associated with the memory region.
20. The computer program product of claim 19, wherein the tag includes an index into the translation table.
21. The computer program product of claim 16, further comprising functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including:
in response to registering the memory region, placing a completion queue element on a completion queue.
22. The computer program product of claim 16, wherein the work request includes an iSCSI command.
23. The computer program product of claim 22, wherein the memory region is registered in response to processing the iSCSI command, the computer program product further comprising functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including:
performing a transaction in completing the iSCSI command;
receiving an iSCSI response associated with the transaction; and
in response to receiving the iSCSI response, deregistering the memory region.
24. The computer program product of claim 16, further comprising functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including:
generating a tag associated with the memory region; and
performing, over a network and through TCP/IP, an I/O transaction with a remote node in which the tag is used to reference the memory region, wherein data is transferred in the I/O transaction using direct access to the memory region.
25. The computer program product of claim 16, further comprising functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including:
establishing data representing a setting associated with the memory region, wherein the setting indicates that the memory region is configured to be valid for only a single access by a remote node, such that the memory region is invalidated in response to completion of a valid incoming remote operation that accesses the memory region.
26. The computer program product of claim 16, wherein the memory region is registered in response to processing a combined register-memory-region-and-send work request associated with a transaction in an upper-layer protocol, the computer program product further comprising functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including:
receiving an incoming send message for the transaction in the upper-layer protocol, wherein the send message includes a request to deregister the tag associated with the memory region; and
in response to receiving the send message, deregistering the memory region associated with the tag.
27. A computer program product in at least one computer-readable medium, comprising functional descriptive material that, when executed by a computer, enables the computer to perform acts including:
placing a work request on a send queue in a network offload engine, wherein the work request includes an identification of a memory region to be registered by the network offload engine; and
receiving from the network offload engine a tag associated with the registered memory region.
28. A computer program product in at least one computer-readable medium, comprising functional descriptive material that, when executed by a computer, enables the computer to perform acts including:
registering a memory region in a network offload engine for use in transactions with a remote node;
performing a single transaction with the remote node with respect to the memory region; and
in response to performing the single transaction, deregistering the memory region.
29. A computer program product in at least one computer-readable medium, comprising functional descriptive material that, when executed by a computer, enables the computer to perform acts including:
receiving from a remote node a tag associated with a memory region;
determining whether the memory region associated with the tag has been deregistered; and
in response to a determination that the memory region associated with the tag has been deregistered, indicating an error condition.
30. a network offload engines comprises:
One device is used for receiving work request from main frame;
One device is used for responding the memory area that receives work request and be associated with main frame in the conversion table registration.
31. network offload engines as claimed in claim 30, wherein, work request receives by transmit queue.
32. The network offload engine as claimed in claim 30, further comprising:
Means for generating a tag associated with the memory region; and
Means for performing an I/O transaction with a remote node over Transmission Control Protocol/Internet Protocol (TCP/IP) in a network, using the tag to reference the memory region, wherein data is transferred in the I/O transaction using direct access to the memory region.
33. A host data processing system, comprising:
Means for placing a work request on a send queue in a network offload engine associated with the host, wherein the work request includes an identification of a memory region to be registered by the network offload engine; and
Means for receiving, from the network offload engine, a tag associated with the registered memory region.
34. The host data processing system as claimed in claim 33, wherein the tag comprises an index into a translation table.
35. A data processing system, comprising:
Means for registering a memory region for use in a transaction with a remote node;
Means for performing a single transaction with the remote node with respect to the memory region; and
Means for deregistering the memory region in response to performing the single transaction.
36. A network offload engine, comprising:
Means for receiving a tag associated with a memory region from a remote node;
Means for determining whether the memory region associated with the tag has been deregistered; and
Means for indicating an error condition in response to a determination that the memory region associated with the tag has been deregistered.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/235,679 | 2002-09-05 | | |
US10/235,679 US7299266B2 (en) | 2002-09-05 | 2002-09-05 | Memory management offload for RDMA enabled network adapters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1487418A (en) | 2004-04-07 |
CN1308835C (en) | 2007-04-04 |
Family
ID=31990540
Family Applications (1)
Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CNB031557813A | Expired - Fee Related | CN1308835C (en) | 2002-09-05 | 2003-09-02 | Memory management offload for remote direct memory access (RDMA) enabled network adapters |
Country Status (3)
Country | Link |
---|---|
US (1) | US7299266B2 (en) |
CN (1) | CN1308835C (en) |
TW (1) | TWI265696B (en) |
2002
- 2002-09-05: US application US10/235,679, patent US7299266B2 (en), status: Expired - Fee Related
2003
- 2003-06-24: TW application TW092117093A, patent TWI265696B (en), status: IP Right Cessation
- 2003-09-02: CN application CNB031557813A, patent CN1308835C (en), status: Expired - Fee Related
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1829231B (en) * | 2005-02-28 | 2010-11-03 | 惠普开发有限公司 | Method and apparatus for direct reception of inbound data |
CN101165664B (en) * | 2006-10-17 | 2010-04-21 | 国际商业机器公司 | Apparatus and method for managing address conversion in data processing system |
CN102165739A (en) * | 2008-09-29 | 2011-08-24 | 思科技术公司 | Reliable reception of messages written via RDMA using hashing |
CN102165739B (en) * | 2008-09-29 | 2014-05-07 | 思科技术公司 | Reliable reception of messages written via RDMA using hashing |
CN102197381A (en) * | 2008-10-28 | 2011-09-21 | Nxp股份有限公司 | Data processing circuit with cache and interface for a detachable device |
CN101741870B (en) * | 2008-11-07 | 2012-11-14 | 英业达股份有限公司 | Internet Minicomputer Interface Storage System |
CN104782085B (en) * | 2012-10-25 | 2017-12-29 | International Business Machines Corporation | Technology for network communication by a computer system using at least two communication protocols |
CN104782085A (en) * | 2012-10-25 | 2015-07-15 | International Business Machines Corporation | Technology for network communication by a computer system using at least two communication protocols |
CN103838517B (en) * | 2012-11-23 | 2017-06-09 | Institute of Acoustics, Chinese Academy of Sciences | Method and system for transmitting data between multi-core processor and disk array |
CN103838517A (en) * | 2012-11-23 | 2014-06-04 | Institute of Acoustics, Chinese Academy of Sciences | Method and system for transmitting data between multi-core processor and disk array |
CN110119303B (en) * | 2013-11-12 | 2023-07-11 | Microsoft Technology Licensing, LLC | Build virtual motherboards and virtual storage devices |
CN110119303A (en) * | 2013-11-12 | 2019-08-13 | Microsoft Technology Licensing, LLC | Build virtual motherboards and virtual storage devices |
CN107111550A (en) * | 2014-12-22 | 2017-08-29 | Texas Instruments Incorporated | Hiding page translation miss latency in program memory controller by selective page miss translation prefetch |
CN107111550B (en) * | 2014-12-22 | 2020-09-01 | Texas Instruments Incorporated | Method and apparatus for hiding page miss translation latency for program fetch |
CN105938460A (en) * | 2015-03-02 | 2016-09-14 | Arm Limited | Memory management |
CN105938460B (en) * | 2015-03-02 | 2021-06-25 | Arm Limited | Memory management |
CN106502721A (en) * | 2016-09-26 | 2017-03-15 | Huawei Technologies Co., Ltd. | Command offloading method, device and physical machine |
CN110059027A (en) * | 2017-11-22 | 2019-07-26 | Arm Limited | Apparatus and method for performing maintenance operations |
CN112771501A (en) * | 2018-08-17 | 2021-05-07 | Oracle International Corporation | Remote Direct Memory Operation (RDMO) for transactional processing systems |
CN109587112B (en) * | 2018-10-25 | 2021-02-12 | Huawei Technologies Co., Ltd. | Data sending method, data receiving method, equipment and system |
WO2020082986A1 (en) * | 2018-10-25 | 2020-04-30 | Huawei Technologies Co., Ltd. | Data sending method, data receiving method, device, and system |
US11563832B2 (en) | 2018-10-25 | 2023-01-24 | Huawei Technologies Co., Ltd. | Data sending method and device, data receiving method and device, and system |
CN109587112A (en) * | 2018-10-25 | 2019-04-05 | Huawei Technologies Co., Ltd. | Data sending method, data receiving method, equipment and system |
CN113419810A (en) * | 2020-07-27 | 2021-09-21 | Alibaba Group Holding Limited | Data interaction method and device, electronic equipment and computer storage medium |
CN114726929A (en) * | 2021-01-06 | 2022-07-08 | Mellanox Technologies, Ltd. | Connection management in network adapters |
CN113190177A (en) * | 2021-05-12 | 2021-07-30 | Xi'an Leifeng Electronic Technology Co., Ltd. | Data storage method, terminal equipment, server and system |
CN113326155A (en) * | 2021-06-28 | 2021-08-31 | Sangfor Technologies Inc. | Information processing method, device, system and storage medium |
CN113326155B (en) * | 2021-06-28 | 2023-09-05 | Sangfor Technologies Inc. | Information processing method, device, system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN1308835C (en) | 2007-04-04 |
TWI265696B (en) | 2006-11-01 |
TW200404432A (en) | 2004-03-16 |
US7299266B2 (en) | 2007-11-20 |
US20040049600A1 (en) | 2004-03-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1308835C (en) | Memory management offload for remote direct memory access (RDMA) enabled network adapters | |
CN1239999C (en) | iSCSI driver to adapter interface protocol | |
CN1310475C (en) | Equipment for controlling access of facilities according to the type of application | |
CN1604057A (en) | Method and system for hardware enforcement of logical partitioning of a channel adapter's resources in a system area network | |
EP4184336B1 (en) | Non-posted write transactions for a computer bus | |
US11657015B2 (en) | Multiple uplink port devices | |
US11238203B2 (en) | Systems and methods for accessing storage-as-memory | |
US10503679B2 (en) | NVM express controller for remote access of memory and I/O over Ethernet-type networks | |
CN104823167B (en) | Live error recovery | |
US9696942B2 (en) | Accessing remote storage devices using a local bus protocol | |
JP4012545B2 (en) | Switchover and switchback support for network interface controllers with remote direct memory access | |
TWI570563B (en) | Posted interrupt architecture | |
EP1899830B1 (en) | Automated serial protocol target port transport layer retry mechanism | |
CN1640089A (en) | Methodology and mechanism for remote key validation for NGIO/InfiniBand applications | |
CN113490927B (en) | RDMA transport with hardware integration and out-of-order placement | |
EP3629187A1 (en) | Multi-uplink device enumeration and management | |
US20030046330A1 (en) | Selective offloading of protocol processing | |
CN1818890A (en) | RNIC-based offload of iSCSI data movement function by initiator | |
CN1969267A (en) | User-level stack | |
CN1617526A (en) | Method and device for emulating multiple logical ports on a physical port | |
CN1488105A (en) | Method and apparatus for controlling flow of data between data processing systems via a memory | |
JP2008547330A (en) | Automated serial protocol initiator port transport layer retry mechanism | |
US20220222196A1 (en) | Pci express chain descriptors | |
CN1770110A (en) | Method, system and storage medium for lockless InfiniBand(TM) poll for I/O completion | |
US7761529B2 (en) | Method, system, and program for managing memory requests by devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20070404. Termination date: 20210902 |