| .. Licensed under the OpenIB.org BSD license (FreeBSD Variant) - See COPYING.md |
| rsocket Protocol and Design Guide 11/11/2012 |
| |
| Data Streaming (TCP) Overview |
| ----------------------------- |
| Rsockets is a protocol over RDMA that supports a socket-level API |
| for applications. For details on the current state of the |
| implementation, readers should refer to the rsocket man page. This |
| document describes the rsocket protocol, general design, and |
| some implementation details. |
| |
| Rsockets exchanges data by performing RDMA write operations into |
| exposed data buffers. In addition to RDMA write data, rsockets uses |
| small, 32-bit messages for internal communication. RDMA writes |
| are used to transfer application data into remote data buffers |
| and to notify the peer when new target data buffers are available. |
| The following figure highlights the operation. |
| |
| host A host B |
| remote SGL |
| target SGL <------------- [ ] |
| [ ] ------ |
| [ ] -- ------ receive buffer(s) |
| -- -----> +--+ |
| -- | | |
| -- | | |
| -- | | |
| -- +--+ |
| -- |
| ---> +--+ |
| | | |
| | | |
| +--+ |
| |
| The remote SGL contains the address, size, and rkey of the target SGL. As |
| receive buffers become available on host B, rsockets will issue an RDMA |
| write against one of the entries in the target SGL on host A. The |
| updated entry will reference an available receive buffer. Immediate data |
| included with the RDMA write will indicate to host A that a target SGE |
| has been updated. |
| |
| When host A has data to send, it will check its target SGL. The current |
| target SGE will contain the address, size, and rkey of the next receive |
| buffer on host B. If the data transfer is smaller than the size of the |
| remote receive buffer, host A will update its target SGE to reflect the |
| remaining size of the receive buffer. That is, once a receive buffer has |
| been published to a remote peer, it will be fully consumed before a second |
| buffer is used. |
| |
| Rsockets relies on immediate data to notify the remote peer when data has |
| been transferred or when a target SGL has been updated. Because immediate |
| data requires that the remote QP have a posted receive, rsockets also uses |
| a credit based flow control mechanism. The number of credits is based on |
| the size of the receive queue, with initial credits exchanged during |
| connection setup. In order to transfer data, rsockets requires both |
| available receive buffers (published via the target SGL) and data credits. |
| |
| Since immediate data is limited to 32-bits, messages may either indicate |
| the arrival of application data or may be an internal message, but not both. |
| To avoid credit deadlock, rsockets reserves a small number of available |
| credits for control messages only, with the protocol relying on RNR NAKs |
| and retries to make forward progress. |
| |
| |
| Connection Establishment |
| ------------------------ |
| rsockets uses the RDMA CM for connection establishment. Struct rs_conn_data |
| is exchanged during the connection exchange as private data in the request |
| and reply messages. |
| |
| struct rs_sge { |
| uint64_t addr; |
| uint32_t key; |
| uint32_t length; |
| }; |
| |
| #define RS_CONN_FLAG_NET 1 |
| |
| struct rs_conn_data { |
| uint8_t version; |
| uint8_t flags; |
| uint16_t credits; |
| uint32_t reserved2; |
| struct rs_sge target_sgl; |
| struct rs_sge data_buf; |
| }; |
| |
| Version - current version is 1 |
| Flags |
| RS_CONN_FLAG_NET - Set to 1 if host is big Endian. |
| Determines byte ordering for RDMA write messages |
| Credits - number of initial receive credits |
| Reserved2 - set to 0 |
| Target SGL - Address, size (# entries), and rkey of target SGL. |
| Remote side will copy this into their remote SGL. |
| Data Buffer - Initial receive buffer address, size (in bytes), and rkey. |
| Remote side will copy this into their first target SGE. |
| |
| |
| Message Format |
| -------------- |
| Rsocket uses RDMA writes with immediate data for all message exchanges. |
| RDMA writes of 0 length are used if no additional data beyond the message |
| needs to be exchanged. Immediate data is limited to 32-bits. Rsockets |
| defines the following format for messages. |
| |
| The upper 3 bits are used to define the type of message being exchanged, |
| with the meaning of the lower 29 bits determined by the upper bits. |
| |
| Bits Message Meaning of |
| 31:29 Type Bits 28:0 |
| 000 Data Transfer bytes transferred |
| 001 reserved |
| 010 reserved - used internally, available for future use |
| 011 reserved |
| 100 Credit Update received credits granted |
| 101 reserved |
| 110 Iomap Updated index of updated entry |
| 111 Control control message type |
| |
| Data Transfer |
| Indicates that application data has been written into the next available |
| receive buffer. The size of the transfer, in bytes, is carried in the lower |
| bits of the message. |
| |
| Credit Update |
| Used to indicate that additional receive buffers and credits are available. |
| The number of available credits is carried in the lower bits of the message. |
| A credit update message is also used to indicate that a target SGE has been |
| updated, in which case the number of additional credits may be 0. The |
| receiver of a credit update message must check for updates to the target SGL |
| by inspecting the contents of the SGL. The rsocket implementation must take |
| care not to modify a remote target SGL while it may be in use. This is done |
| by tracking when a receive buffer referenced by a remote target SGL has been |
| filled. |
| |
| Iomap Updated |
| Used to indicate that a remote iomap entry was updated. The updated entry |
| contains the offset value associated with an address, length, and rkey. Once |
| an iomap has been updated, the local application can issue directed IO |
| transfers against the corresponding remote buffer. |
| |
| Control Message - DISCONNECT |
| Indicates that the rsocket connection has been fully disconnected and will no |
| longer send or receive data. Data received before the disconnect message was |
| processed may still be available for reading. |
| |
| Control Message - SHUTDOWN |
| Indicates that the remote rsocket has shutdown the send side of its |
| connection. The recipient of a shutdown message will no longer accept |
| incoming data, but may still transfer outbound data. |
| |
| |
| Iomapped Buffers |
| ---------------- |
| Rsockets allows for zero-copy transfers using what it refers to as iomapped |
| buffers. Iomapping and direct data placement (zero-copy) transfers are done |
| using rsocket specific extensions. The general operation is similar to |
| that used for normal data transfers described above. |
| |
| host A host B |
| remote iomap |
| target iomap <----------- [ ] |
| [ ] ------ |
| [ ] -- ------ iomapped buffer(s) |
| -- -----> +--+ |
| -- | | |
| -- | | |
| -- | | |
| -- +--+ |
| -- |
| ---> +--+ |
| | | |
| | | |
| +--+ |
| |
| The remote iomap contains the address, size, and rkey of the target iomap. As |
| the applicaton maps buffers host B to a given rsocket, rsockets will issue an RDMA |
| write against one of the entries in the target iomap on host A. The |
| updated entry will reference an available iomapped buffer. Immediate data |
| included with the RDMA write will indicate to host A that a target iomap |
| has been updated. |
| |
| When host A wishes to transfer directly into an iomapped buffer, it will check |
| its target iomap for an offset corresponding to a remotely mapped buffer. A |
| matching iomap entry will contain the address, size, and rkey of the target |
| buffer on host B. Host A will then issue an RDMA operation against the |
| registered remote data buffer. |
| |
| From host A's perspective, the transfer appears as a normal send/write |
| operation, with the data stream redirected directly into the receiving |
| application's buffer. |
| |
| |
| |
| Datagram Overview |
| ----------------- |
| The rsocket API supports datagram sockets. Datagram support is handled through an |
| entirely different protocol and internal implementation. Unlike connected rsockets, |
| datagram rsockets are not necessarily bound to a network (IP) address. A datagram |
| socket may use any number of network (IP) addresses, including those which map to |
| different RDMA devices. As a result, a single datagram rsocket must support |
| using multiple RDMA devices and ports, and a datagram rsocket references a single |
| UDP socket, plus zero or more UD QPs. |
| |
| Rsockets uses headers inserted before user data sent over UDP sockets to resolve |
| remote UD QP numbers. When a user first attempts to send a datagram to a remote |
| address (IP and UDP port), rsockets will take the following steps: |
| |
| 1. Store the destination address into a lookup table. |
| 2. Resolve which local network address should be used when sending |
| to the specified destination. |
| 3. Allocate a UD QP on the RDMA device associated with the local address. |
| 4. Send the user's datagram to the remote UDP socket. |
| |
| A header is inserted before the user's datagram. The header specifies the |
| UD QP number associated with the local network address (IP and UDP port) of |
| the send. |
| |
| A service thread is used to process messages received on the UDP socket. This |
| thread updates the rsocket lookup tables with the remote QPN and path record |
| data. The service thread forwards data received on the UDP socket to an |
| rsocket QP. After the remote QPN and path records have been resolved, datagram |
| communication between two nodes are done over the UD QP. |
| |
| UDP Message Format |
| ------------------ |
| Rsockets uses messages exchanged over UDP sockets to resolve remote QP numbers. |
| If a user sends a datagram to a remote service and the local rsocket is not |
| yet configured to send directly to a remote UD QP, the user data is sent over |
| a UDP socket with the following header inserted before the user data. |
| |
| struct ds_udp_header { |
| uint32_t tag; |
| uint8_t version; |
| uint8_t op; |
| uint8_t length; |
| uint8_t reserved; |
| uint32_t qpn; /* lower 8-bits reserved */ |
| union { |
| uint32_t ipv4; |
| uint8_t ipv6[16]; |
| } addr; |
| }; |
| |
| Tag - Marker used to help identify that the UDP header is present. |
| #define DS_UDP_TAG 0x55555555 |
| |
| Version - IP address version, either 4 or 6 |
| Op - Indicates message type, used to control the receiver's operation. |
| Valid operations are RS_OP_DATA and RS_OP_CTRL. Data messages |
| carry user data, while control messages are used to reply with the |
| local QP number. |
| Length - Size of the UDP header. |
| QPN - UD QP number associated with sender's IP address and port. |
| The sender's address and port is extracted from the received UDP |
| datagram. |
| Addr - Target IP address of the sent datagram. |
| |
| Once the remote QP information has been resolved, data is sent directly |
| between UD QPs. The following header is inserted before any user data that |
| is transferred over a UD QP. |
| |
| struct ds_header { |
| uint8_t version; |
| uint8_t length; |
| uint16_t port; |
| union { |
| uint32_t ipv4; |
| struct { |
| uint32_t flowinfo; |
| uint8_t addr[16]; |
| } ipv6; |
| } addr; |
| }; |
| |
| Verion - IP address version |
| Length - Size of the header |
| Port - Associated source address UDP port |
| Addr - Associated source IP address |