
7. SMP, multiple architectures

Many people run Linux on SMP machines these days, and Linux is no longer an Intel-only operating system. Keep both facts in mind when hacking the Linux kernel. Of course, not everybody is able to test their code on several different architectures, or even on an SMP box. But you should follow a few rules that make your code more portable and SMP safe, and that make the maintainers' work easier. This matters most if you change generic code, but it applies to many device drivers as well. For example, most current cards are PCI cards: if you follow the rules mentioned here, it should be trivial to port your Intel driver to SPARC, PPC, Alpha...

7.1 Writing clean multi-platform code

If you need a variable with a fixed size, use u8, u16, u32 or u64 (or their signed sX counterparts); never rely on long being 32 bits wide or on similar assumptions. These types are internal to the kernel. If the declaration goes into a header file which might be seen by userland applications, use the __u8, __u16, ... forms of these types instead. When hacking the kernel, you can count only on int being at least 32 bits wide (all currently supported platforms have int exactly 32 bits wide), on long being at least as wide as int, and on long being as wide as any pointer type. Also, when combining such variables into structures, keep in mind that some architectures add padding to structures so that every member is aligned to its size.

You should also think about endianness problems whenever you cannot be sure that all data will be in host byte order. Linux comes with a large set of endian conversion routines, defined in <asm/byteorder.h>. For conversion from CPU byte order to a specific byte order and back, there are the functions cpu_to_XXYY and XXYY_to_cpu, where XX is either be for big endian or le for little endian, and YY is the size of the value to be converted, in bits (for example cpu_to_le32 and le32_to_cpu).

Each of these functions comes in three variants. The standard variant takes the value to be converted as an argument and returns the converted value; it checks for constant arguments, in which case the conversion is done at compile time rather than at run time. The variant with the p suffix takes a pointer to the value to be converted and returns the converted value; if you need to read the value from memory first anyway, prefer this variant, because on some architectures the conversion is done while the value is being read from memory. The variant with the s suffix works like the previous one, but stores the result back to the address passed as the argument. If no conversion is needed, these functions become no-ops.
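To make the three conversion variants and the padding remark a bit more concrete, here is a minimal userland sketch. It is not the kernel implementation: the u8 and u32 typedefs, struct sketch_rec and every sketch_* function are stand-ins invented for this example, written so that they give the same result on little-endian and big-endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userland stand-ins for the kernel's fixed-size types (assumption). */
typedef uint8_t  u8;
typedef uint32_t u32;

/* Hypothetical record: most ABIs pad after 'tag' so that 'len' is aligned. */
struct sketch_rec {
        u8  tag;
        u32 len;
};

/* Plain variant: CPU-order value in, little-endian representation out. */
static u32 sketch_cpu_to_le32(u32 v)
{
        u8 b[4] = { (u8)v, (u8)(v >> 8), (u8)(v >> 16), (u8)(v >> 24) };
        u32 out;

        memcpy(&out, b, sizeof(out));
        return out;
}

/* Plain variant, opposite direction. */
static u32 sketch_le32_to_cpu(u32 v)
{
        u8 b[4];

        memcpy(b, &v, sizeof(b));
        return (u32)b[0] | (u32)b[1] << 8 | (u32)b[2] << 16 | (u32)b[3] << 24;
}

/* "p" variant: takes a pointer, returns the converted value. */
static u32 sketch_le32_to_cpup(const u32 *p)
{
        return sketch_le32_to_cpu(*p);
}

/* "s" variant: converts the value in place. */
static void sketch_le32_to_cpus(u32 *p)
{
        *p = sketch_le32_to_cpu(*p);
}

/* Exercise all three variants on one record. */
void sketch_endian_demo(void)
{
        struct sketch_rec r = { 1, 0 };

        r.len = sketch_cpu_to_le32(0x11223344u);
        assert(sketch_le32_to_cpu(r.len) == 0x11223344u);
        assert(sketch_le32_to_cpup(&r.len) == 0x11223344u);
        sketch_le32_to_cpus(&r.len);
        assert(r.len == 0x11223344u);

        /* Padding: the struct may be larger than the sum of its members. */
        assert(sizeof(r) >= sizeof(r.tag) + sizeof(r.len));
}
```

In real kernel code you would of course use the routines from <asm/byteorder.h> instead; the sketch only shows what each variant takes and gives back.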

It is also worth noting that if you just want to set or clear some bits in a variable with a specific byte order, you don't have to convert it to CPU order and back to do that:

x = cpu_to_le32(le32_to_cpu(x) | constant);
is not necessary, although it is heavily used in the kernel. Instead,
x |= cpu_to_le32(constant);
generates much better code (cpu_to_le32(constant) is evaluated at compile time).
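Why the short form gives the same result can be checked with a small userland sketch. It assumes the worst case, a host where the conversion is a real byte swap, and models it with a hypothetical sketch_swab32() helper; none of the names below come from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Unconditional byte swap: what a little-endian conversion amounts to
 * on a big-endian CPU. */
static uint32_t sketch_swab32(uint32_t v)
{
        return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
               ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Long form: convert to CPU order, set the bits, convert back. */
uint32_t sketch_set_bits_roundtrip(uint32_t x_le, uint32_t mask)
{
        return sketch_swab32(sketch_swab32(x_le) | mask);
}

/* Short form: convert only the mask and OR it in directly. */
uint32_t sketch_set_bits_direct(uint32_t x_le, uint32_t mask)
{
        return x_le | sketch_swab32(mask);
}

/* Both forms agree, because a byte swap only moves bits around and
 * OR works on each bit independently. */
void sketch_set_bits_demo(void)
{
        assert(sketch_set_bits_roundtrip(0x78563412u, 0x00000001u) ==
               sketch_set_bits_direct(0x78563412u, 0x00000001u));
}
```

The same argument holds for clearing bits with &= and an inverted mask, since the inversion can be applied before the conversion.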

In all functions you use, make sure you understand what type of arguments they operate on. Never pass a pointer to int when a routine expects a pointer to long, and so on. For example, the set_bit family of functions assumes a pointer to long, so if you give it a pointer to int, it might work on some platforms and fail on others.
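The set_bit pitfall can be illustrated with a userland stand-in. The helpers below are hypothetical and, unlike the real set_bit, not atomic; they only show why the bitmap itself should be declared as unsigned long.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Non-atomic stand-in for set_bit(): operates on an array of longs. */
static void sketch_set_bit(unsigned int nr, unsigned long *addr)
{
        addr[nr / SKETCH_BITS_PER_LONG] |= 1UL << (nr % SKETCH_BITS_PER_LONG);
}

/* Non-atomic stand-in for test_bit(). */
static int sketch_test_bit(unsigned int nr, const unsigned long *addr)
{
        return (addr[nr / SKETCH_BITS_PER_LONG] >> (nr % SKETCH_BITS_PER_LONG)) & 1UL;
}

void sketch_bitmap_demo(void)
{
        /* Right: the bitmap has the type the helper expects. */
        unsigned long flags[2] = { 0, 0 };

        sketch_set_bit(3, flags);
        sketch_set_bit(SKETCH_BITS_PER_LONG + 1, flags);
        assert(flags[0] == 8UL);
        assert(flags[1] == 2UL);
        assert(sketch_test_bit(3, flags));
        assert(!sketch_test_bit(4, flags));

        /*
         * Wrong (left as a comment on purpose):
         *
         *         unsigned int small = 0;
         *         sketch_set_bit(3, (unsigned long *)&small);
         *
         * Where long is wider than int, this accesses memory beyond
         * 'small', and on a big-endian machine the bit lands in the
         * half of the long that 'small' does not occupy.
         */
}
```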

7.2 Grep is your friend

Whenever you decide to change some function or variable, remember that grep(1) is your friend. Try to fix all occurrences of that function or variable, not only some of them. It does not matter that much if you break platform YZ because you were unable to test your code on it. The maintainers of that platform will notice that something has changed in the area they are responsible for, and will make an effort to fix it. It is much worse when the maintainers have to hunt for the change that made configuration XYZ stop working. Of course, this does not mean you should not try your best to write code which will work there.

7.3 SMP

Linux no longer employs a model where at most one CPU can be in kernel mode while the other CPUs are allowed only userland execution. This was a great enhancement that the 2.1 kernels came up with. On the other hand, it requires kernel programmers to think more about what they write and how it works, and to use several new things in their code.

The transition started by moving the master kernel lock, which had been hidden in architecture-specific assembly routines and which enforced the only-one-CPU-in-kernel-mode rule, into the lock_kernel() and unlock_kernel() functions, which were placed around all code that had previously been covered by it. This change did not do anything by itself; it just showed the hackers the places where the master kernel lock was held for no reason, or where some nice tricks would let things be done without holding the lock. This lock is still used in many places all around the kernel, e.g. during most system calls, including ioctls, during several filesystem operations, etc.

Obsolete cli

Most drivers in the past used the save_flags, cli and restore_flags macros to protect their critical sections. This still works in the current kernel, although it is the source of poor Linux scalability: cli on an SMP machine disables interrupts on all processors. Do you need that to protect your critical section? In most cases, it is huge overkill. Also, when you use these macros, the code does not say whom you want to be protected from; that makes it less readable and makes conversion to proper synchronization primitives much harder. So, unless you know what you are doing and why, you should never use save_flags, cli and similar macros in new code.

Spinlocks

The right way to protect your critical sections is usually a spinlock. Spinlocks are fast and very efficient, provided they guard a relatively small piece of code. To use spinlocks, make sure <asm/spinlock.h> is included in your code. Put a variable of type spinlock_t into the structure you want to protect, or declare it as a global variable. Usually you want to start with the spinlock unlocked: in that case, initialize it to the value SPIN_LOCK_UNLOCKED.

Now think about who can acquire this spinlock and from which kernel execution context. If the list of possible contexts includes neither interrupts nor bottom half handlers, you can use the simplest (and fastest) form of spinlock: spin_lock(spinlock_t *) acquires the spinlock, spin_unlock(spinlock_t *) releases it, and spin_trylock acquires the spinlock if nobody holds it, or returns 0 if the spinlock is already held (in this case spin_trylock does not wait for it and just returns).

On the other hand, when some routine may try to acquire the spinlock from interrupt or bottom half context, using these simple spinlocks could lead to deadlock: a processor acquires the spinlock and is then interrupted by an interrupt whose handler tries to acquire the same spinlock. The spinlock is already held by the same CPU, so the handler would wait forever. For this kind of spinlock the kernel comes with another set of primitives: spin_lock_irq and spin_unlock_irq, for the situation when you know that interrupts are enabled on the local CPU. These functions act as a normal spinlock, but also disable interrupts locally, so that the deadlock described above cannot happen. spin_lock_irqsave and spin_unlock_irqrestore take an additional argument, an unsigned long variable, in which they save the current interrupt enable status of the local CPU and from which they restore it afterwards.
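The basic lock, unlock and trylock semantics can be tried out in userland. The following is only a sketch built on C11 atomics: sketch_spinlock_t and the sketch_* functions are hypothetical stand-ins, not the kernel's spinlock_t, and the IRQ variants have no userland equivalent, so they are not modelled here.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userland stand-in for spinlock_t (hypothetical, not the kernel type). */
typedef struct {
        atomic_flag held;
} sketch_spinlock_t;

#define SKETCH_SPIN_LOCK_UNLOCKED { ATOMIC_FLAG_INIT }

/* Busy-wait until the flag could be set by us. */
static void sketch_spin_lock(sketch_spinlock_t *lock)
{
        while (atomic_flag_test_and_set_explicit(&lock->held,
                                                 memory_order_acquire))
                ;
}

static void sketch_spin_unlock(sketch_spinlock_t *lock)
{
        atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

/* Returns 1 if the lock was acquired, 0 if somebody already holds it. */
static int sketch_spin_trylock(sketch_spinlock_t *lock)
{
        return !atomic_flag_test_and_set_explicit(&lock->held,
                                                  memory_order_acquire);
}

/* The lock lives inside the structure it protects. */
struct sketch_counter {
        sketch_spinlock_t lock;
        long value;
};

void sketch_counter_add(struct sketch_counter *c, long n)
{
        sketch_spin_lock(&c->lock);
        c->value += n;          /* keep the critical section short */
        sketch_spin_unlock(&c->lock);
}

void sketch_spinlock_demo(void)
{
        struct sketch_counter c = { SKETCH_SPIN_LOCK_UNLOCKED, 0 };
        int got;

        sketch_counter_add(&c, 5);
        assert(c.value == 5);

        got = sketch_spin_trylock(&c.lock);     /* free: acquired */
        assert(got == 1);
        got = sketch_spin_trylock(&c.lock);     /* held: returns 0 at once */
        assert(got == 0);
        sketch_spin_unlock(&c.lock);
}
```

Putting the lock next to the data it guards, as in struct sketch_counter, also documents what the lock is for, which is exactly what cli never did.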

Unlike the non-IRQ-safe spinlocks, which are completely optimized away on single-CPU machines, the IRQ-safe spinlocks turn into simple __cli(), __sti(), __save_flags() and __restore_flags() there. Please note that it is very dangerous to mix spinlocks with the old save_flags, cli, sti and restore_flags, because on SMP the global cli and friends are themselves implemented using spinlocks. The implementation uses spin_trylock, so it is possible to nest cli on one processor, but while one processor is inside a cli area, the other CPUs have to spin until it leaves its critical section. So watch out for deadlocks that can result from mixing cli and spinlocks in your code: if one CPU first calls cli and then tries to acquire some spinlock, while another CPU first acquires that spinlock and then calls cli, your machine will be stuck.
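The deadlock described above is the classic case of two locks taken in opposite order. The sketch below replays that interleaving in a single thread, using trylock so that it cannot actually hang. It is a simplification resting on one assumption: global cli is modelled as nothing more than a second lock, and all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
        atomic_flag held;
} sketch_lock_t;

/* Returns 1 if the lock was acquired, 0 if it is already held. */
static int sketch_trylock(sketch_lock_t *l)
{
        return !atomic_flag_test_and_set_explicit(&l->held,
                                                  memory_order_acquire);
}

static void sketch_unlock(sketch_lock_t *l)
{
        atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Mixed order: returns 1 if both simulated CPUs end up unable to proceed. */
int sketch_mixed_order_deadlocks(void)
{
        sketch_lock_t global_irq = { ATOMIC_FLAG_INIT }; /* models global cli */
        sketch_lock_t dev_lock = { ATOMIC_FLAG_INIT };   /* some spinlock */
        int a_first, b_first, a_second, b_second;

        a_first  = sketch_trylock(&global_irq); /* CPU A: cli */
        b_first  = sketch_trylock(&dev_lock);   /* CPU B: spin_lock */
        a_second = sketch_trylock(&dev_lock);   /* CPU A: would spin forever */
        b_second = sketch_trylock(&global_irq); /* CPU B: would spin forever */

        sketch_unlock(&dev_lock);
        sketch_unlock(&global_irq);
        return a_first && b_first && !a_second && !b_second;
}

/* One lock only: the second CPU waits, then gets in. */
int sketch_single_lock_progresses(void)
{
        sketch_lock_t dev_lock = { ATOMIC_FLAG_INIT };
        int a_in, b_waits, b_in;

        a_in = sketch_trylock(&dev_lock);       /* CPU A enters */
        b_waits = !sketch_trylock(&dev_lock);   /* CPU B has to wait */
        sketch_unlock(&dev_lock);               /* CPU A leaves */
        b_in = sketch_trylock(&dev_lock);       /* CPU B enters */
        sketch_unlock(&dev_lock);
        return a_in && b_waits && b_in;
}

void sketch_lock_order_demo(void)
{
        assert(sketch_mixed_order_deadlocks());
        assert(sketch_single_lock_progresses());
}
```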

Read write locks

In some cases, a plain spinlock is more restrictive than necessary. If you can allow either multiple readers or one single writer into a critical section, look at the rw locks implemented in the Linux kernel. You use them similarly to spinlocks: you need to include <asm/spinlock.h>, and the same considerations about IRQ-safe and non-IRQ-safe variants apply as with spinlocks. Instead of spin_lock and the like, you write either read_lock or write_lock, depending on whether you are a reader or a writer.
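The "many readers or one writer" rule can be sketched in userland with a single atomic counter. This is an illustration only, not how the kernel implements its rw locks; sketch_rwlock_t and all sketch_* functions are hypothetical, and the trylock helpers exist just so the demo can observe the state without blocking.

```c
#include <assert.h>
#include <stdatomic.h>

/* state: 0 = free, N > 0 = N readers inside, -1 = one writer inside */
typedef struct {
        atomic_int state;
} sketch_rwlock_t;

static int sketch_read_trylock(sketch_rwlock_t *l)
{
        int s = atomic_load_explicit(&l->state, memory_order_relaxed);

        while (s >= 0) {
                /* On failure 's' is reloaded with the current state. */
                if (atomic_compare_exchange_weak_explicit(&l->state, &s, s + 1,
                                memory_order_acquire, memory_order_relaxed))
                        return 1;
        }
        return 0;       /* a writer is inside */
}

static int sketch_write_trylock(sketch_rwlock_t *l)
{
        int expected = 0;       /* only a completely free lock will do */

        return atomic_compare_exchange_strong_explicit(&l->state, &expected, -1,
                        memory_order_acquire, memory_order_relaxed);
}

static void sketch_read_lock(sketch_rwlock_t *l)
{
        while (!sketch_read_trylock(l))
                ;
}

static void sketch_write_lock(sketch_rwlock_t *l)
{
        while (!sketch_write_trylock(l))
                ;
}

static void sketch_read_unlock(sketch_rwlock_t *l)
{
        atomic_fetch_sub_explicit(&l->state, 1, memory_order_release);
}

static void sketch_write_unlock(sketch_rwlock_t *l)
{
        atomic_store_explicit(&l->state, 0, memory_order_release);
}

void sketch_rwlock_demo(void)
{
        sketch_rwlock_t l = { 0 };
        int got;

        sketch_read_lock(&l);                   /* first reader */
        sketch_read_lock(&l);                   /* second reader gets in too */
        got = sketch_write_trylock(&l);         /* writer is kept out */
        assert(got == 0);
        sketch_read_unlock(&l);
        sketch_read_unlock(&l);

        sketch_write_lock(&l);                  /* now the writer gets in */
        got = sketch_read_trylock(&l);          /* and readers are kept out */
        assert(got == 0);
        sketch_write_unlock(&l);
}
```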

Semaphores

Spinlocks are pretty efficient if the critical section is small, but if 31 CPUs spin waiting until CPU 32 finishes its large critical section, your system won't scale very well. So the key is not to put spinlocks blindly around any code you write, but to locate the critical sections and, if they are small enough, protect them with spinlocks. If you have large critical sections, it might be a better idea to protect them using semaphores. Semaphores are more heavyweight, but unlike with spinlocks, a task that fails to acquire a semaphore sleeps until the holder leaves its critical section. In a lot of cases, though, you can modify your code so that you can use spinlocks after all. One example (from namei.c):

Previously, the kernel used a semaphore for the critical section in the namei page cache:
inline char * get_page(void) {
        char * res;
        down(&getname_quicklock);
        res = getname_quicklist;
        if (res) {
                getname_quicklist = *(char**)res;
                getname_quickcount--;
        } else
                res = (char*)__get_free_page(GFP_KERNEL);
        up(&getname_quicklock);
        return res;
}

inline void putname(char * name) {
        if (name) {
                down(&getname_quicklock);
                *(char**)name = getname_quicklist;
                getname_quicklist = name;
                getname_quickcount++;
                up(&getname_quicklock);
        }
}
Simply replacing the semaphore with a spinlock would not work: __get_free_page() is too heavyweight and may sleep, which is unacceptable while holding a spinlock. But by moving it around a little, you can get it out of the critical section and use a spinlock to protect the rest:
inline char * get_page(void) {
        char * res;
        spin_lock(&getname_quicklock);
        res = getname_quicklist;
        if (res) {
                getname_quicklist = *(char**)res;
                getname_quickcount--;
        }
        spin_unlock(&getname_quicklock);
        if (!res)
                res = (char*)__get_free_page(GFP_KERNEL);
        return res;
}

inline void putname(char * name) {
        if (name) {
                spin_lock(&getname_quicklock);
                *(char**)name = getname_quicklist;
                getname_quicklist = name;
                getname_quickcount++;
                spin_unlock(&getname_quicklock);
        }
}
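
The same restructuring can be tried in userland. The sketch below is an analogue, not the namei.c code: malloc() stands in for __get_free_page(), a C11 atomic_flag stands in for the spinlock, and all sketch_* names are hypothetical. The point it illustrates is the same: only the list manipulation sits inside the lock, while the slow allocation happens outside it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define SKETCH_BUF_SIZE 4096

static atomic_flag sketch_quicklock = ATOMIC_FLAG_INIT;
static char *sketch_quicklist;          /* singly linked list of free buffers */
static int sketch_quickcount;

static void sketch_lock(void)
{
        while (atomic_flag_test_and_set_explicit(&sketch_quicklock,
                                                 memory_order_acquire))
                ;
}

static void sketch_unlock(void)
{
        atomic_flag_clear_explicit(&sketch_quicklock, memory_order_release);
}

char *sketch_get_buf(void)
{
        char *res;

        sketch_lock();
        res = sketch_quicklist;
        if (res) {
                /* The first bytes of a free buffer hold the next pointer. */
                sketch_quicklist = *(char **)res;
                sketch_quickcount--;
        }
        sketch_unlock();
        if (!res)
                res = malloc(SKETCH_BUF_SIZE);  /* slow path, lock not held */
        return res;
}

void sketch_put_buf(char *buf)
{
        if (buf) {
                sketch_lock();
                *(char **)buf = sketch_quicklist;
                sketch_quicklist = buf;
                sketch_quickcount++;
                sketch_unlock();
        }
}

void sketch_quicklist_demo(void)
{
        char *a = sketch_get_buf();
        char *b;

        if (!a)
                return;                 /* out of memory: nothing to show */
        sketch_put_buf(a);
        assert(sketch_quickcount == 1);
        b = sketch_get_buf();           /* served from the list, no malloc */
        assert(b == a);
        assert(sketch_quickcount == 0);
        free(b);
}
```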

