What is it: With this thing compiled into your kernel, Linux can use a remote server as one of its block devices. Every time the client computer wants to read /dev/nd0, it sends a request to the server via TCP, and the server replies with the requested data. This can be used for stations with low disk space (or even diskless ones, if you boot from floppy) to borrow disk space from other computers. Unlike NFS, it is possible to put any file system on it. But (also unlike NFS), if someone has mounted NBD read/write, you must ensure that no one else has it mounted.
Limitations: It is impossible to use NBD as a root file system, because a user-land program is required to start it (but you could get away with initrd; I never tried that). (Patches to change this are welcome.) It also allows you to run a read-only block device in user-land (making server and client physically the same computer, communicating over loopback). Please note that read-write nbd with client and server on the same machine is a bad idea: expect deadlock within seconds (this may vary between kernel versions; maybe one sunny day it will even be safe?). More generally, it is a bad idea to create a loop in the 'rw mounts graph'. I.e., if machineA is using a device from machineB read-write, it is a bad idea for machineB to use a device from machineA.
Read-write nbd with client and server on the same machine has a rather fundamental problem: when the system is short of memory, it tries to write back dirty pages. So the nbd client asks the nbd server to write back data, but as nbd-server is a userland process, it may itself require memory to fulfill the request. That way lies deadlock.
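The 'rw mounts graph' rule can be checked mechanically. Below is a minimal sketch, not part of the nbd tools: it assumes you describe your setup as a dictionary mapping each client machine to the set of machines whose devices it mounts read-write (the function name and the data layout are hypothetical), and it reports whether that graph contains a loop. A machine that mounts its own export read-write counts as a loop of length one.

```python
def has_rw_loop(rw_mounts):
    """Return True if the read-write mount graph contains a loop.

    rw_mounts maps a client machine name to the set of machines whose
    devices it has mounted read-write (hypothetical representation).
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    state = {}

    def visit(node):
        state[node] = GREY
        for server in rw_mounts.get(node, ()):
            colour = state.get(server, WHITE)
            if colour == GREY:
                return True        # edge back into the current path: loop
            if colour == WHITE and visit(server):
                return True
        state[node] = BLACK
        return False

    for machine in list(rw_mounts):
        if state.get(machine, WHITE) == WHITE and visit(machine):
            return True
    return False


# machineA mounts from machineB and machineB mounts from machineA: a loop.
print(has_rw_loop({"machineA": {"machineB"}, "machineB": {"machineA"}}))
```

This is a plain depth-first search; it only tells you that a loop exists, not which mounts to drop.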
Current state: It currently works. Network block device seems to be pretty stable. I originally thought that it was impossible to swap over TCP. That turned out not to be true: swapping over TCP now works and seems to be deadlock-free.
If you want swapping to work, first get nbd working. (You'll have to run mkswap on the server; mkswap tries to fsync, which will fail.) Now you have a version which mostly works. Ask me for kreclaimd if you see deadlocks.
Network block device was included in the standard (Linus') kernel tree in 2.1.101.
I've successfully run raid5 and md over nbd. (A pretty recent version is required to do so, however.)
Devices: Network block device uses major 43, minors 0..n (where n is configurable in nbd.h). Create these files with mknod when needed. After that, your ls -l /dev/ should look like:
brw-rw-rw- 1 root root 43, 0 Apr 11 00:28 nd0
brw-rw-rw- 1 root root 43, 1 Apr 11 00:28 nd1
...
These commands should do the job:
mknod /dev/nd0 b 43 0
mknod /dev/nd1 b 43 1
mknod /dev/nd2 b 43 2
mknod /dev/nd3 b 43 3
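If you need more minors than the four shown, the command lines follow an obvious pattern. Here is a small sketch that only prints them (it does not create anything); the helper name mknod_commands is made up for illustration, and the major number 43 and the /dev/nd prefix are taken from the text above.

```python
NBD_MAJOR = 43  # major number stated in the text


def mknod_commands(count, prefix="/dev/nd"):
    """Return the mknod command lines for minors 0..count-1."""
    return [f"mknod {prefix}{minor} b {NBD_MAJOR} {minor}"
            for minor in range(count)]


# Print the commands for the first four devices; run them as root.
for line in mknod_commands(4):
    print(line)
```

Keep count within whatever limit your nbd.h configures.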
Disclaimer: If you try to export a device with already existing data, be prepared to lose it. This beast already killed one partition (not mine :-). Make sure you test your setup at least a little bit before putting it into production. Don't do your tests on important filesystems. It might even work.
A few client/server versions are included here. Look into the c-files for more documentation. Note that not all client/server versions are compatible with all other client/server versions and kernel versions. Good luck.
Example of usage; do this on one console (of course you could use a raw partition instead of the /tmp/delme file):
root@bug:/tmp# cat /dev/zero > /tmp/delme
root@bug:/tmp# ls -l /tmp/delme
-rw-r--r-- 1 root root 15552512 Mar 16 22:40 /tmp/delme
root@bug:/tmp# cd ~pavel/WWW/nbd
root@bug:/home/pavel/WWW/nbd# ./nbd-server 1024 /tmp/delme
And this on the second one:
root@bug:/home/pavel/WWW/nbd# ./nbd-client other_machine 1024 /dev/nd0
[note note note: DON'T TRY THIS TO LOCALHOST!]
Negotiation: ..size = 15552512
root@bug:/home/pavel/WWW/nbd# mke2fs /dev/nd0
mke2fs 1.10, 24-Apr-97 for EXT2 FS 0.5b, 95/08/09
Linux ext2 filesystem format
Filesystem label=
3808 inodes, 15188 blocks
759 blocks (5.00%) reserved for the super user
First data block=1
Block size=1024 (log=0)
Fragment size=1024 (log=0)
2 block groups
8192 blocks per group, 8192 fragments per group
1904 inodes per group
Superblock backups stored on blocks:
8193
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
root@bug:/home/pavel/WWW/nbd# mount /dev/nd0 /mnt
root@bug:/home/pavel/WWW/nbd# cd /mnt
root@bug:/mnt# ls -al
total 14
drwxr-xr-x 3 root root 1024 Mar 16 22:41 ./
drwxr-xr-x 25 root root 1024 Feb 15 15:45 ../
drwxr-xr-x 2 root root 12288 Mar 16 22:41 lost+found/
root@bug:/mnt# mkdir x
root@bug:/mnt# cd /
root@bug:/# umount /mnt
root@bug:/# Kernel call returned.Closing: que, sock, done
root@bug:/# e2fsck -f /tmp/delme
e2fsck 1.10, 24-Apr-97 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/tmp/delme: 12/3808 files (0.0% non-contiguous), 499/15188 blocks
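A quick sanity check on the numbers in this session: the size reported during negotiation should match what mke2fs sees through the device. The sketch below just redoes the arithmetic with the values from the transcript; that the reserved count is 5% rounded down is an assumption that happens to agree with the output shown.

```python
BLOCK_SIZE = 1024          # "Block size=1024" from the mke2fs output
image_bytes = 15552512     # size of /tmp/delme, also reported by nbd-client

blocks = image_bytes // BLOCK_SIZE
leftover = image_bytes % BLOCK_SIZE
reserved = blocks * 5 // 100   # assumption: 5% reserved, rounded down

print(blocks, leftover, reserved)
```

If the block count the client sees differs from the size of the exported file, something went wrong before mke2fs ever ran.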
Thanx go to:
Old versions (you need these versions on 2.1.115 and earlier; you do not want to do this. This is history.)
Protocol: This is true for the 'old' protocol, i.e. the one in the 2.1.101 Linus' tree. Look at nbd.h to see what you are actually using. A user-land program passes a file handle with a connected TCP socket to the kernel driver. This way, the kernel does not have to care about connecting etc. The protocol used is rather simple: if the driver is asked to read from or write to the block device, it sends a "request" packet of the following form (all data are in network byte order):
__u32 magic;    must be equal to NBD_REQUEST_MAGIC (see nbd.h)
__u32 from;     position in bytes to read from / write to
__u32 len;      number of bytes to be read / written
__u64 handle;   handle of operation
__u32 type;     0 = read
                1 = write
...             in case of a write operation, this is immediately followed by len bytes of data
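To make the request layout concrete, here is a minimal sketch that builds such a packet with Python's struct module. It is an illustration, not code from the nbd sources. Assumptions: the fields are laid out back to back in the order listed, with no padding; the magic value 0x25609513 is the one found in later nbd.h headers and may not match the header you are actually using, so it is passed as a parameter; pack_request and the NBD_READ/NBD_WRITE names are made up for this example.

```python
import struct

# Network byte order, no padding: magic, from, len, handle, type.
REQUEST_FORMAT = "!IIIQI"
REQUEST_MAGIC = 0x25609513   # assumption: value from later nbd.h headers
NBD_READ, NBD_WRITE = 0, 1   # type values from the field list


def pack_request(offset, length, handle, op, payload=b"",
                 magic=REQUEST_MAGIC):
    """Build one request packet; a write carries `length` bytes of data."""
    if op == NBD_WRITE and len(payload) != length:
        raise ValueError("write payload must be exactly `length` bytes")
    header = struct.pack(REQUEST_FORMAT, magic, offset, length, handle, op)
    return header + (payload if op == NBD_WRITE else b"")


# A read of 512 bytes at offset 4096, tagged with handle 7.
packet = pack_request(4096, 512, 7, NBD_READ)
print(len(packet), packet.hex())
```

Under these assumptions the header is 24 bytes long. Note that a 32-bit from field can only address the first 4 GB of a device.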
Upon completion of the operation, the server responds with a "reply" packet of the following structure:
__u32 magic;    must be equal to NBD_REPLY_MAGIC (see nbd.h)
__u64 handle;   handle copied from request
__u32 error;    0 = operation completed successfully, else error code
...             in case of a read operation with no error, this is immediately followed by len bytes of data
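The reply can be decoded the same way. Again a sketch under stated assumptions rather than the real client code: fields are packed back to back in the order listed, the magic value 0x67446698 is the one from later nbd.h headers and should be checked against your own, and parse_reply is a hypothetical helper. Since the reply does not carry len, the caller has to remember how many bytes it asked for and match the reply to the request by handle.

```python
import struct

# Network byte order, no padding: magic, handle, error.
REPLY_FORMAT = "!IQI"
REPLY_MAGIC = 0x67446698   # assumption: value from later nbd.h headers


def parse_reply(data, expected_len=0, magic=REPLY_MAGIC):
    """Split a reply into (handle, error, payload).

    expected_len is the len of the matching read request; pass 0 for
    replies to writes. The payload is empty whenever error is non-zero.
    """
    header_size = struct.calcsize(REPLY_FORMAT)
    got_magic, handle, error = struct.unpack(REPLY_FORMAT,
                                             data[:header_size])
    if got_magic != magic:
        raise ValueError("bad reply magic")
    payload = b""
    if error == 0:
        payload = data[header_size:header_size + expected_len]
    return handle, error, payload


# A successful reply to a 4-byte read tagged with handle 7.
raw = struct.pack(REPLY_FORMAT, REPLY_MAGIC, 7, 0) + b"abcd"
print(parse_reply(raw, expected_len=4))
```

A real client would read exactly 16 header bytes from the socket first and then, for a successful read, exactly len more; this sketch works on a buffer that is already complete.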