
For the course Sistemi Embedded, lecturer Prof. L. Pomante


Real-Time and Embedded Guide

Herman Bruyninckx

K.U.Leuven, Mechanical Engineering

Leuven

Belgium

Herman.Bruyninckx@mech.kuleuven.ac.be

Real-Time and Embedded Guide

by Herman Bruyninckx

Copyright © 2000, 2001, 2002 Herman.Bruyninckx@mech.kuleuven.ac.be

This Guide covers the fundamentals of (i) real-time and embedded operating systems (focusing mostly on the differences with general purpose operating systems such as Linux), and (ii) real-time programming. The emphasis is on Free Software and Open Source Software examples: RTAI, RTLinux, eCos, RTEMS, uClinux, ..., with a more than proportional focus on RTAI.

This text also talks about design issues, software patterns and frameworks for real-time applications; that is, the “high-level” aspects of these software projects. These higher levels are often poorly dealt with in publications on real-time programming, which leads to the unfortunate situation that still too many real-time programmers use only the powerful but dangerously unstructured API of their RTOS, missing the chance to develop more structured, and, hence, more deterministic and more portable software systems.

Both the low-level RTOS primitives, and the high-level design issues, are illustrated by the real-world example of a hard real-time core for feedback control and signal processing.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation, with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of this license can be found at http://www.fsf.org/copyleft/fdl.html.

Revision History

Revision 0.01 (Aug 31, 2000, revised by: hb): Initial draft
Revision 0.02 (Sep 30, 2000, revised by: hb): Added more info about signals
Revision 0.03 (Sep 20, 2001, revised by: hb): Removed empty hardware, user-space GUI and FAQ sections; added Software Patterns
Revision 0.04-build-20021211-1936 (Dec 11, 2002, revised by: hb): Extended and heavily reworked version, preparing for pre-release

Table of Contents

About this Guide
    1. Purpose and scope
    2. Feedback
    3. Copyrights, Licenses and Trademarks
    4. Acknowledgements
I. Operating system basics
    1. Real-time and embedded operating systems
        1.1. OS responsibilities
        1.2. Trade-offs
        1.3. Time
        1.4. Embedded OS
        1.5. Operating system standards
        1.6. Linux for real-time and embedded
    2. Task management and scheduling
        2.1. Processes and threads
        2.2. POSIX thread management
        2.3. Linux tasks and tasklets
        2.4. Scheduling
        2.5. Priority-based scheduling
        2.6. Priority spaces
        2.7. Linux scheduler
        2.8. Linux real-time scheduling
    3. Interrupts
        3.1. Introduction
        3.2. Interrupt hardware
        3.3. Interrupt software
        3.4. ISR, DSR and ASR
    4. IPC: synchronization
        4.1. IPC terminology
        4.2. Race conditions and critical sections
        4.3. Signals
        4.4. Exceptions
        4.5. Atomic operations
        4.6. Semaphore, mutex, spinlock, read/write lock, barrier
        4.7. Condition variable for synchronization within mutex
        4.8. Priority inversion
        4.9. Priority inheritance and priority ceiling
        4.10. Lock-free synchronization for data exchange
    5. IPC: Data exchange
        5.1. Shared memory
        5.2. FIFOs
        5.3. Messages and mailboxes
        5.4. Circular buffers
        5.5. Swinging buffers
        5.6. Remote Procedure Calls
    6. Memory management
        6.1. Terminology
        6.2. Shared memory in Linux
    7. Real-time device drivers
        7.1. Mechanism and policy
        7.2. Device drivers in UNIX
        7.3. Complex device drivers
        7.4. Comedi
        7.5. Real-time serial line
        7.6. Real-time parallel port
        7.7. Real-time networking
    8. RTAI: the features
        8.1. Overview
        8.2. Task management and scheduling
        8.3. Interrupts and traps
        8.4. IPC: synchronization
        8.5. IPC: data exchange
        8.6. Memory management
        8.7. Real-time device drivers
        8.8. /proc interface
        8.9. RTAI loadable modules
        8.10. Specific features
    9. Linux-based real-time and embedded operating systems
        9.1. Introduction
        9.2. RTLinux: Real-Time Linux
        9.3. RTAI: the Real-Time Application Interface
        9.4. uClinux
        9.5. Etlinux
    10. Non-Linux real-time operating systems
        10.1. The Adeos nano-kernel
        10.2. eCos
        10.3. RTEMS
        10.4. Jaluna
        10.5. Wonka + Oswald
        10.6. FIASCO and DROPS
        10.7. Real-time micro-kernel
        10.8. KISS Realtime Kernel
II. RTOS implementation
    11. RTAI: the implementation
        11.1. The RTAI source tree
        11.2. Hardware abstraction layer
        11.3. Linux compatibility layer
        11.4. RTOS core
        11.5. LX/RT
        11.6. Making your own extensions to LX/RT
        11.7. Module implementations
    12. C++ and real-time
        12.1. C and C++
        12.2. C++ in the Linux RTOSs
    13. Cross compilation, debugging and tracing
        13.1. Cross development
        13.2. Debugging
        13.3. Linux Trace Toolkit
III. Design
    14. Design principles
        14.1. Structure and functionality
        14.2. Loose coupling
        14.3. Mediator
        14.4. Components
        14.5. Architecture
    15. Patterns and Frameworks
        15.1. Definitions
        15.2. Monitor
        15.3. Producer-Consumer
        15.4. Events
        15.5. State Machines
        15.6. Execution Engine
        15.7. Distributed IPC
        15.8. Transactions
    16. Design example: “control”
        16.1. What is control?
        16.2. Functional components
        16.3. Infrastructural components
        16.4. Design
        16.5. Implementation
IV. Tips and tricks
    17. Tips and tricks
        17.1. Tasks
        17.2. Signals
        17.3. Condition variables
        17.4. Locks
        17.5. Interrupts
        17.6. Memory
        17.7. Design
        17.8. Programming
Bibliography

List of Figures

4-1. Priority inversion
15-1. General structure of a state
16-1. Structure of generic control application

About this Guide

1. Purpose and scope

This Guide consists of several parts: Part I provides a top-down overview of real-time and embedded operating systems, up to a more detailed description of the features and implementation of a “typical” RTOS, i.e., RTAI; Part II gives more details about the implementation of real-time functionality; Part III introduces some time-proven design solutions to common problems in real-time programming, as well as a list of design and programming hints, to help readers gain time and reliability in their designs and implementations.

The top-down view on real-time and embedded operating systems is complementary to the typical bottom-up “show-me-the-code” and “bazaar” approach of development and documentation writing in the free software world. Not that there is anything wrong with that approach, but this Guide’s goal is different: it wants to make it easier for newcomers to grasp the basic concepts and techniques behind real-time operating systems, and to help them see the forest for the trees, without having to go and read the code in a ton of different files. Nevertheless, the source code of the presented projects remains the only complete and up-to-date documentation.

The document tries to be as independent as possible of any particular implementation or project: so, the concepts introduced in the theoretical part of the text are not necessarily available in each and every concrete example given in the text. Moreover, this Guide is not meant to be an exhaustive textbook on real-time programming, so the reader is urged to go and read some operating system textbooks, and other “low-level” books such as the Linux Device Drivers (http://www.xml.com/ldd/chapter/book/index.html) book, for missing details. The reader should be familiar with the basics of operating systems, and be able to read C code.

This Guide first explains the general principles behind real-time and embedded operating systems. These principles are not too different from those of general purpose operating systems such as Linux or Microsoft Windows NT, but the emphasis lies on maximum determinism, and not on maximum average throughput. Because determinism is often compromised in “high-level” programming language and operating system constructs, real-time designers are confronted more directly than “normal” application developers with concepts, timing, memory, and efficiency at the level of the operating system.

Another primary goal of this Guide is educational: it could over time evolve into classroom notes, featuring demos, annotated code, more graphical illustrations, and more examples of good design and code. Whether it will reach these educational goals depends on your contributions, as a critical reader of the text, looking out for opportunities to help improve the free documentation available on your favourite free software project.


2. Feedback

This Guide still has a number of paragraphs marked with “TODO”, signalling parts that are still to be filled in with more details, or where the current treatment needs more thought and/or information.

Please direct all questions, additions or suggestions for changes to <Herman.Bruyninckx@mech.kuleuven.ac.be>.

3. Copyrights, Licenses and Trademarks

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License (FDL), Version 1.1 or any later version published by the Free Software Foundation. A copy of this license can be found here (http://www.fsf.org/copyleft/fdl.html).

Linux is a trademark of Linus Torvalds. RTLinux is a trademark of VJY Associates LLC, the company of RTLinux’s creators Victor Yodaiken (http://www.linuxdevices.com/articles/AT2238037882.html) and Michael Barabanov; they released RTLinux under the GPL license (http://www.fsf.org/copyleft/gpl.html). RTAI was first released under the LGPL (http://www.fsf.org/copyleft/lesser.html) license by Paolo Mantegazza, but, later, core components got a GPL license. eCos is released under the Red Hat eCos Public License (http://www.redhat.com/embedded/technologies/ecos/ecoslicense.html), but also got the GPL license later on. The Real Time Scheduler by Montavista Software, Inc. (http://www.mvista.com) is released under the GPL license. RTEMS by On-line Applications Research (http://www.rtems.com/) (OAR) is released under the GPL license. KURT from the University of Kansas Center for Research, Inc. (http://www.ittc.ukans.edu/kurt/) is released under the GPL license. uClinux from the Embedded Linux/Microcontroller Project (http://www.uclinux.org) is released under the GPL license. The Linux Trace Toolkit (http://www.opersys.com/LTT/) is released under the GPL license by Karim Yaghmour. David Schleef released Comedi (http://stm.lbl.gov/comedi/) under the GPL license. Karim Yaghmour and Philippe Gerum released Adeos (http://www.opersys.com/adeos/) under the GPL license.

4. Acknowledgements

Large parts of this document were written with financial support from the Flemish Fonds voor Wetenschappelijk Onderzoek (FWO), and the Katholieke Universiteit Leuven, Belgium. The hospitality offered by Prof. Henrik Christensen of Kungliga Tekniska Högskolan (KTH) in Stockholm, where other parts of this document were written, is gratefully acknowledged.

The style for this Guide was originally copied from the LDP Author Guide (http://www.linuxdoc.org/LDP/LDP-Author-Guide/) written by Mark F. Komarinski and Jorge Godoy. It used the ldp.dsl SGML stylesheet from the Linux Documentation Project (http://www.linuxdoc.org). The current style is DocBook.


Linux and the indispensable GNU libraries and tools are wonderful gifts from Linus Torvalds, the people at the Free Software Foundation (http://www.fsf.org), and thousands of others. While, in general, GNU/Linux is the appropriate name to denote the popular free software operating system, this text usually talks about Linux, because the kernel is the topic of this document. The text also uses the term free software as a general name for software released under licences approved by both the Free Software Foundation (http://www.fsf.org/philosophy/license-list.html) and the Open Source Initiative (http://www.opensource.org/licenses/).

The RTLinux (http://www.rtlinux.com) real-time extensions to Linux were originally created by Victor Yodaiken and Michael Barabanov. Paolo Mantegazza created the Real Time Application Interface (http://www.rtai.org) (RTAI) real-time extensions. Karim Yaghmour came up with the design of an alternative approach towards building a real-time nano-kernel underneath Linux (or any operating system kernel, for that matter); this design was implemented in the Adeos project by Philippe Gerum. This text’s discussion on real-time device drivers is much inspired by David Schleef’s design for Comedi (http://stm.lbl.gov/comedi/).

Klaas Gadeyne and Patrice Kadionik gave valuable feedback on the first draft. Error feedback was also received from Werner Spielmann, Gregory Matus, and Oscar Esteban. Paolo Mantegazza and Philippe Gerum helped to clarify some RTAI internals. Many thanks also go to Peter Soetens of the Orocos project (http://www.orocos.org), and the many critical students and summer interns whose questions stimulated me to look deeper into all those things that I thought I understood but didn’t.

I. Operating system basics

This Part introduces the concepts and primitives with which general purpose as well as real-time and embedded operating systems are built. The text discusses the applicability and appropriateness of all these concepts in real-time and embedded operating systems.

(TODO: more annotated code examples.)

Chapter 1. Real-time and embedded operating systems

This Chapter discusses the basics of operating systems in general, and real-time and embedded operating systems in particular. (This text uses the abbreviations OS, RTOS and EOS, respectively.) This discussion makes clear why standard Linux qualifies neither as a real-time OS, nor as an embedded OS.

Real-time and embedded operating systems are in most respects similar to general purpose operating systems: they provide the interface between application programs and the system hardware, and they rely on basically the same set of programming primitives and concepts. But general purpose operating systems make different trade-offs in applying these primitives, because they have different goals.

1.1. OS responsibilities

This Section discusses the basic responsibilities of the operating system that are relevant for this text: (i) task management and scheduling, (ii) (deferred) interrupt servicing, (iii) inter-process communication and synchronization, and (iv) memory management. General purpose operating systems also have other responsibilities, which are beyond the horizon of a real-time operating system: file systems and file management, (graphical) user interaction, communication protocol stacks, disk IO, to name a few. More details about the relevant responsibilities are given in the following Chapters.

1.1.1. Task management and scheduling

Task (or “process”, or “thread”) management is a primary job of the operating system: tasks must be created and deleted while the system is running; tasks can change their priority levels, their timing constraints, their memory needs; etcetera. Task management for an RTOS is a bit more dangerous than for a general purpose OS: if a real-time task is created, it has to get the memory it needs without delay, and that memory has to be locked in main memory in order to avoid access latencies due to swapping; changing run-time priorities influences the run-time behaviour of the whole system, and hence also the predictability which is so important for an RTOS. So, dynamic process management is a potential headache for an RTOS. Chapter 2 gives more details.

In general, multiple tasks will be active at the same time, and the OS is responsible for sharing the available resources (CPU time, memory, etc.) over the tasks. The CPU is one of the most important resources, and deciding how to share the CPU over the tasks is called “scheduling”.

The general trade-off made in scheduling algorithms is between, on the one hand, the simplicity (and hence efficiency) of the algorithm, and, on the other hand, its optimality. (Note that various optimality criteria exist!) Algorithms that want to be globally optimal are usually quite complex, and/or require knowledge about a large number of task parameters that are often not straightforward to find on line (e.g., the duration of the next run of a specific task; the time instants when sleeping tasks will become ready to run; etc.). Real-time and embedded operating systems favour simple scheduling algorithms, because these take a small and deterministic amount of computing time, and require only a small memory footprint for their code.

General purpose and real-time operating systems differ considerably in their scheduling algorithms.

They use the same basic principles, but apply them differently because they have to satisfy different

performance criteria. A general purpose OS aims at maximum average throughput, a real-time OS

aims at deterministic behaviour, and an embedded OS wants to keep memory footprint and power

consumption low. A large variety of “real-time” scheduling algorithms exists, but some are standard in

most real-time operating systems (see Section 2.5): static priority scheduling, earliest deadline first

(EDF), and rate-monotonic scheduling.

1.1.2. Interrupt servicing

An operating system must not only be able to schedule tasks according to a deterministic algorithm, but

it also has to service peripheral hardware, such as timers, motors, sensors, communication devices, disks,

etc. All of those can request the attention of the OS asynchronously, i.e., at the time that they want to use

the OS services, the OS has to make sure it is ready to service the requests. Such a request for attention is

often signaled by means of an interrupt. There are two kinds of interrupts:

• Hardware interrupt. The peripheral device can put a bit on a particular hardware channel that triggers the processor(s) on which the OS runs, to signal that the device needs servicing. The result of this trigger is that the processor saves its current state, and jumps to an address in its memory space, that has been connected to the hardware interrupt at initialisation time.

• Software interrupt. Many processors have built-in software instructions with which the effect of a hardware interrupt can be generated in software. The result of a software interrupt is also a triggering of the processor, so that it jumps to a pre-specified address.

The operating system is, in principle, not involved in the execution of the code triggered by the hardware

interrupt: this is taken care of by the CPU without software interference. The OS, however, does have

influence on (i) connecting a memory address to every interrupt line, and (ii) what has to be done

immediately after the interrupt has been serviced, i.e., how “deferred interrupt servicing” is to be

handled. Obviously, real-time operating systems have a specific approach towards working with

interrupts, because they are a primary means to guarantee that tasks get serviced deterministically.

Chapter 3 gives more details.

1.1.3. Communication and synchronization

A third responsibility of an OS is commonly known under the name of Inter-Process Communication

(IPC). (“Process” is, in this context, just another name for “task”.) The general name IPC collects a large

set of programming primitives that the operating system makes available to tasks that need to exchange information with other tasks, or synchronize their actions. Again, an RTOS has to make sure that this

communication and synchronization take place in a deterministic way. Chapter 4 gives more details.

Besides communication and synchronization with other tasks that run on the same computer, some tasks

also need to talk to other computers, or to peripheral hardware (such as analog input or output cards).

This involves some peripheral hardware, such as a serial line or a network, and special purpose device

drivers (Chapter 7).

1.1.4. Memory management

A fourth responsibility of the OS is memory management: the different tasks in the system all require

part of the available memory, often to be placed on specified hardware addresses (for memory-mapped

IO). The job of the OS then is (i) to give each task the memory it needs (memory allocation), (ii) to map

the real memory onto the address ranges used in the different tasks (memory mapping), and (iii) to take

the appropriate action when a task uses memory that it has not allocated. (Common causes are:

unconnected pointers and array indexing beyond the bounds of the array.) This is the so-called

memory protection feature of the OS. Of course, what exactly the “appropriate action” should be depends

on the application; often it boils down to the simplest solution: killing the task and notifying the user.

Chapter 6 gives more details.

1.2. Trade-offs

This Section discusses some of the trade-offs that (both, general purpose, and real-time and embedded)

operating system designers commonly make.

• Kernel space versus user space versus real-time space. Most modern processors allow programs to run in two different hardware protection levels. Linux calls these two levels kernel space and user space. Tasks in the latter have more protection against erroneous accesses to physical memory or I/O devices, but access most of the hardware with larger latencies than kernel space tasks. The real-time Linux variants add a third layer, the real-time space. This is in fact nothing else but a part of kernel space, used in a particular way.

• Monolithic kernel versus micro-kernel. A monolithic kernel has all OS services (including device drivers, network stacks, file systems, etc.) running within the privileged mode of the processor. (This doesn’t mean that the whole kernel is one single C file!) A micro-kernel, on the other hand, uses the privileged mode only for really core services (task management and scheduling, interprocess communication, interrupt handling, and memory management), and has most of the device drivers and OS services running as “normal” tasks.

The trade-off between both is as follows: a monolithic kernel is easier to make more efficient (because OS services can run completely without switches from privileged to non-privileged mode), but a micro-kernel is more difficult to crash (an error in a device driver that doesn’t run in privileged mode is less likely to cause a system halt than an error occurring in privileged mode).

UNIX, Linux and Microsoft NT have monolithic kernels; QNX, FIASCO, VxWorks, and GNU/Hurd

have micro-kernels. Linux, as well as some commercial UNIX systems, allows the number of services in the kernel to be changed dynamically or statically: extra functionality is added by loading a

module. But the loaded functionality becomes part of the monolithic kernel. A minimal Linux kernel

(which includes memory management, task switching and timer services) is some hundreds of

kilobytes big; this approaches the footprint for embedded systems. However, more and more

embedded systems have footprints of more than a megabyte, because they also require network stacks

and various communication functionalities.

• Pre-emptable kernel or not. Linux was originally a non-pre-emptable kernel: a kernel space task cannot be interrupted by other kernel space tasks, or by user space tasks. The kernel is “locked” as long as one kernel function is executing. This usage of locks (Section 4.6) makes the design of the kernel simpler, but introduces indeterministic latencies which are not tolerable in an RTOS.

In the 2.5 kernel series, Linux gets a more and more fine-grained kernel locking mechanism, and has become to a large extent pre-emptable. (See Section 2.8.) Linux still has one “Big Kernel Lock (BKL)”, called kernel_flag in the Linux source code, but now independent subsystems (networking, disk IO, etc.) get their own sets of locks.

• Scalability. Finer-grained locking is good for scalability, but usually an overhead for single-CPU systems. Solaris is an example of a very fine-grained and scalable operating system, which performs worse on “low-end” PCs. The Linux Scalability Effort (http://sourceforge.net/projects/lse/) project has more information about the ongoing activities in this area, as far as the Linux kernel is concerned.

Scalability is much less of an issue in real-time applications, because the goals are so different: the desire behind scalable systems is to divide a large work load transparently over a number of available CPUs, while the desire behind real-time systems is to have everything controlled in a strictly deterministic way.

• Memory management versus shared memory. Virtual memory and dynamic allocation and de-allocation of memory pages are amongst the most commonly used memory management services of a general purpose operating system. However, this memory management induces overhead, and some simpler processors have no support for it. On these processors (which power an enormous number of embedded systems!), all tasks share the same memory space, such that developers must take care of the proper use of that memory. Some real-time kernels (such as RTLinux) also have all their tasks share the same address space (even if the processor supports memory management), because this allows more efficient code.

• Dedicated versus general. For many applications, it is worthwhile not to use a commercially or freely available operating system, but to write one that is optimised for the task at hand. Examples are the operating systems for mobile phones, or Personal Digital Assistants: standard operating systems would be too big, and they don’t have the specific signal processing support (speech and handwriting recognition) that is typical for these applications. Some applications don’t even need an operating system at all (for example, a simple vending machine). The trade-offs here are: cost of development and decreased portability, against cost of smaller and cheaper embedded systems.

• Operating system versus language runtime. Application programs make use of “lower-level” primitives to build their functionality. This functionality can be offered by the operating system (via system calls), or by a programming language (via language primitives and libraries). Languages such as C++, Ada and Java offer lots of functionality this way: memory management, threading, task synchronization, exception handling, etc. This functionality is collected in a so-called runtime. The advantages of using a runtime are: its interface is portable over different operating systems, and it offers ready-to-use and/or safe solutions to common problems. The disadvantages are that a runtime is in general “heavy”, not deterministic in execution time, and not very configurable. These disadvantages are important in real-time and embedded contexts.

1.3. Time

Not surprisingly, “time” plays an important role in the design and use of a real-time operating system.

This Section introduces some relevant terminology and definitions.

1.3.1. Real time

Probably you’ll find as many interpretations of the meaning of real time as you find publications on this

topic. One simple definition is:

A real-time operating system is able to execute all of its tasks without violating specified timing constraints.

Another definition is:

Times at which tasks will execute can be predicted deterministically on the basis of knowledge about the

system’s hardware and software.


That means, if the hardware can do the job, the RTOS software will do the job deterministically. (This

determinism must be softened a bit, because of the “stochastic” nature of the inevitable scheduling “jitter”; see Section 1.3.2.)

One often makes a distinction between “soft real time” and “hard real time”. “Soft” indicates that not

meeting the specified timing constraints is not a disaster, while it is a disaster for a hard real-time system.

For example: playing an audio or video file is soft real time, because few people will notice when a

sample comes a fraction of a second too late. Steering a space probe, on the other hand, requires hard

real time, because the rocket moves with a velocity of several kilometers per second such that small

delays in the steering signals add up to significant disturbances in the orbit which can cause erroneous

atmosphere entry situations. Precision mills and high-accuracy radiation or surgical robots are other

examples that require hard real-time: moving the mill or the robot one tenth of a millimeter too far due to

timing errors can cause the rejection of produced parts, or the death of patients.

Practically speaking, the distinction between soft and hard real time is often (implicitly and mistakenly)

related to the time scales involved in the system: in this reasoning, soft real-time tasks must typically be

scheduled with (coarser than) milli-seconds accuracy, and hard real-time tasks with micro-seconds

accuracy. But this implicit assumption has many exceptions! For example, a one-dollar 4 bit processor

controlling a traffic light can be more hard real time (in the sense of “deterministic”) than a 5000 dollar

Athlon-based e-commerce server.

1.3.2. Latency

The latency (or tardiness) of a task is the difference between the instant of time on which the task should

have started (or finished) and the instant of time on which it actually did. (Or, in different contexts, the

time between the generation of an event, and its perception.) Latencies are due to several factors: (i) the

timing properties of processor, bus, memory (on-chip cache, off-chip RAM and ROM) and peripheral

devices, (ii) the scheduling properties of the OS, (iii) the pre-emptiveness of its kernel, (iv) the load on

the system (i.e., the number of tasks that want to be scheduled concurrently), and (v) the context switch

time. This latter is the time the processor needs to save the data of the currently running task (e.g.,

registers, stack, and instruction pointer), and to replace it with the local data of the newly scheduled task.

Few of these factors are constant over time, and the statistical distribution of the latencies in the

subsequent schedulings of tasks is called the jitter.

This is a far from exhaustive list of kernel activities that introduce indeterminism into the timing

behaviour of a (general purpose) operating system:

• Accessing the hard disk. Because the alignment of sectors, and the distance between the tracks needed by a given task are variable, that task cannot be sure about how long it will take to access the data it needs. In addition, hard disks are mechanical devices, whose time scales are much longer than those of purely electronic devices (such as RAM memory); and accesses to the hard disk are buffered in order to reduce average disk access time.

• Accessing a network. Especially with the TCP/IP protocol, which re-sends packets in case of transmission errors.


• Low-resolution timing. See Section 1.3.4.

• Another delay related to time keeping is the fact that programming the timer chip often generates unpredictable delays. This delay is of the order of microseconds, so only important for really high-accuracy timing.

• Non-real-time device drivers. Device drivers are often sloppy about their time budget: they use busy waiting or roughly estimated sleeping periods, instead of timer interrupts, or lock resources longer than strictly necessary, or run in user space with the corresponding timing unpredictability.

• Memory allocation and management. After a task has asked for more memory (e.g., through a malloc function call), the time that the memory allocation task needs to fulfill the request is unpredictable, especially when the allocated memory has become strongly fragmented and no contiguous block can be allocated. Moreover, a general purpose operating system swaps code and data out of the physical memory when the total memory requirements of all tasks are larger than the available physical memory.

• The proc file system. This is the very rich (non-graphical) user interface to what is going on inside the Linux kernel: all this information is offered to user tasks in the form of “files” in this (virtual!) file system. However, accessing information in this file system implies significant overhead in some cases, because the files are virtual: they are “created” only when needed.

The exact magnitude of all the above-mentioned time delays varies strongly between different hardware platforms. Hence, it is not just the operating system software that makes the difference.

applications, the context switch time is most important (e.g., for sampling audio signals at 44kHz), while

other applications require high computational performance, at lower scheduling frequencies (e.g., robot

motion control at 1kHz). But again, some tasks, such as speech processing, require both.

1.3.3. Timing constraints

Different applications have different timing constraints, which, ideally, the RTOS should be able to

satisfy. However, there still do not exist general and guaranteed scheduling algorithms (Chapter 2) that are able to satisfy all of the following classes of time constraints:

• Deadline: a task has to be completed before a given instant in time, but when exactly the task is performed during the time interval between now and the deadline is not important for the quality of the final result. For example: the processor must fill the buffer of a sound card before that buffer empties; the voltage on an output port must reach a given level before another peripheral device comes and reads that value.

• Zero execution time: the task must be performed in a time period that is zero in the ideal case. For example: digital control theory assumes that taking a measurement, calculating the control action, and sending it out to a peripheral device all take place instantaneously.

• Quality of Service (QoS): the task must get a fixed amount of “service” per time unit. (“Service” often means “CPU time”, but could also be “memory pages”, “network bandwidth” or “disk access bandwidth”.) This is important for applications such as multimedia (in order to read or write streaming audio or video data to the multimedia devices), or network servers (both to guarantee a minimum service and to avoid “denial of service” attacks).

The QoS is often specified by means of a small number of parameters: “s” seconds of service in each

time frame of “t” seconds. A specification of 5 micro-seconds per 20 micro-seconds is a much more

real-time QoS than a specification of 5 seconds per 20 seconds, although, on the average, both result in

the same amount of time allotted to the task.

The major problem is that the scheduler needs complete knowledge about how long each task is going to

take in the near future, and when it will become ready to run. This information is practically impossible

to get, and even when it is available, calculation of the optimal scheduling plan is a search problem with

high complexity, and hence high cost in time.

Different tasks compete for the same resources: processors, network, memory, disks, . . . Much more

than in the general purpose OS case, programmers of real-time systems have to take into account

worst-case scenarios: if various tasks could be needing a service, then sooner or later they will want it at

the same time.

1.3.4. Time data structures

The smallest time slice used in most general purpose operating systems is longer than 1 millisecond. Not because the processors are not fast enough to do significant amounts of work in that time slice, but because 32 bit machines have only 2^32 time slices before their timing counter wraps around. At 1000 ticks per second, this corresponds to less than 50 days, which is certainly insufficient for servers and embedded systems. Linux uses a scheduling time slice (“jiffie”) of 10 milliseconds on most processors. (1 millisecond on Alpha, which has 64 bit counters.)

The timing constraints of real-time tasks are often expressed with much higher resolutions than those of

the general purpose scheduler, i.e., (less than) microseconds instead of milliseconds. Hence, the data

structure in which the time is kept should be adapted to this higher rate, in order to avoid overflow. For example, the real-time Linux variants (Chapter 9) use a high-resolution time data structure that counts time in nanoseconds. A 64 bit integer should do the job in that case, but 32 bits could be too dangerous.

(A 32 bit counter overflows after about 4 seconds when counting at a 1 nanosecond rate!) Note that not

all compilers can deal with 64 bit integers, such that some assembly coding may be required in some

cases.

POSIX has standardized “clocks” and “timers”. The timespec is a data structure that keeps the time in two separate seconds and nanoseconds sub-structures (include/linux/time.h of the Linux source tree):

typedef long __kernel_time_t; // include/asm/posix_types.h
typedef __kernel_time_t time_t;

struct timespec {
        time_t tv_sec;  /* seconds */
        long   tv_nsec; /* nanoseconds */
};

The timespec data structure uses 64 bits, but the separation between seconds and nanoseconds is an inefficient way of representing time: there are only approximately 2^30 = 10^9 nanoseconds in one second, so a little more than two bits of the nanoseconds field are not used. This means that, at each and every addition of a time increment, the software has to check whether the boundary of 1 second has been crossed, such that the seconds field has to be updated. This is more complicated than just having a 64 bit counter that can keep on counting without having to check.

1.4. Embedded OS

The concepts introduced in the previous sections apply of course also to embedded operating systems

(“EOS”). Embedded operating systems, however, have some features that distinguish them from

real-time and general purpose operating systems. But the definition of an “embedded operating system”

is probably even more ambiguous than that of an RTOS, and they come in a zillion different forms. But

you’ll recognize one when you see one, although the boundary between general purpose operating

systems and embedded operating systems is not sharp, and is even becoming more blurred all the time.

Embedded systems are being installed in tremendous quantities (an order of magnitude more than

desktop PCs!): they control lots of functions in modern cars; they show up in household appliances and

toys; they control vital medical instrumentation; they make remote controls and GPS (Global Positioning System) receivers work; they make your portable phones work; etc.

The simplest classification between different kinds of embedded operating systems is as follows:

• High-end embedded systems. These systems are often down-sized derivatives of an existing general purpose OS, but with much of the “ballast” removed. Linux has given rise to a large set of such derivatives, because of its highly modular structure and the availability of source code. Examples are: routers, switches, personal digital assistants, set top boxes.

• Deeply embedded OS. These OSs must be really very small, and need only a handful of basic functions. Therefore, they are mostly designed from the ground up for a particular application. Two typical functions deeply embedded systems (used to) lack are high-performance graphical user interfacing or network communication. Examples are: automotive controls, digital cameras, portable phones. But also these systems get more graphics and networking capabilities. . .

The most important features that make an OS into an embedded OS are:

• Small footprint. Designers are continuously trying to put more computing power in smaller housings, using cheaper CPUs, with on-board digital and/or analog IO; and they want to integrate these CPUs in all kinds of small objects. A small embedded OS also often uses only a couple of kilobytes of RAM and ROM memory.


• The embedded system should run for years without manual intervention. This means that the hardware and the software should never fail. Hence, the system should preferably have no mechanical parts, such as floppy drives or hard disks. Not only because mechanical parts are more sensitive to failures, but they also take up more space, need more energy, take longer to communicate with, and have more complex drivers (e.g., due to motion control of the mechanical parts).

• Many embedded systems have to control devices that can be dangerous if they don’t work exactly as designed. Therefore, the status of these devices has to be checked regularly. The embedded computer system itself, however, is one of these critical devices, and has to be checked too! Hence, one often sees hardware watchdogs included in embedded systems. These watchdogs are usually retriggerable monostable timers attached to the processor’s reset input. The operating system checks within specified intervals whether everything is working as desired, for example by examining the contents of status registers, and then resets the watchdog. So, if the OS doesn’t succeed in resetting the timer in time, that means that the system is not functioning properly, and the timer goes off, forcing the processor to reset.

If something went wrong but the OS is still working (e.g., a memory protection error in one of the tasks), the OS can activate a software watchdog, which is nothing else but an interrupt that schedules a service routine to handle the error. One important job of the software watchdog could be to generate a core dump, to be used for analysis of what situations led to the crash.

• A long autonomy also implies using as little power as possible: embedded systems often have to live a long time on batteries (e.g., mobile phones), or are part of a larger system with very limited power resources (e.g., satellites).

• If the system does fail despite its designed robustness (e.g., caused by a memory protection fault, Section 1.1.4), there is usually no user around to take the appropriate actions. Hence, the system itself should reboot autonomously, in a “safe” state, and “instantly” if it is supposed to control other critical devices. Compare this to the booting of your desktop computer, which needs a minute or more before it can be used, and always comes up in the same default state. . .

• It should be as cheap as possible. Embedded systems are often produced in quantities of several thousands or even millions. Decreasing the unit price even a little bit boils down to enormous savings.

• Some embedded systems are not physically reachable anymore after they have been started (e.g., launched satellites), so software updates cannot be installed by hand. However, more and more of them can still be accessed remotely. Therefore, they should support dynamic linking: object code that did not exist at the time of start is uploaded to the system, and linked into the running OS without stopping it.

Some applications require all features of embedded and real-time operating systems. The best known

examples are mobile phones and (speech-operated) handheld computers (“PDA”s): they must be small,

consume little power, and yet be able to execute advanced signal processing algorithms, while taking up

as little space as possible.

The above-mentioned arguments led embedded OS developers to design systems with the absolute

minimum of software and hardware. Roughly speaking, developers of general purpose and real-time

operating systems approach their clients with a “Hey, look how much we can do!” marketing strategy;

while EOS developers say “Hey, look how little we need to do what you want!”. Hence, embedded

systems often come without a memory management unit (MMU), multi-tasking, a networking “stack”, or file systems. The extreme is one single monolithic program on the bare processor, thus completely eliminating the need for any operating system at all.

Taking out more and more features of a general purpose operating system makes its footprint smaller and

its predictability higher. On the other hand, adding more features to an EOS makes it look like a general

purpose OS. Most current RTOS and EOS operating systems are expanding their ranges of application,

and cover more of the full “feature spectrum.”

1.5. Operating system standards

Real-time and embedded systems are not a user product in themselves, but serve as platforms on which

to build applications. As for any other software platform, the availability of standards facilitates the job

of programmers enormously, because it makes it easier, cheaper and faster to develop new applications,

and to port an existing application to new hardware. In the world of real-time and embedded systems,

standardization is not a burning issue, because many projects in this area have unique requirements, need

unique extensions to already existing products, don’t need frequent updates by different people, and are

seldom visible to end-users. All these “features” do not really help in forcing developers to use

standards. . . (They do like standard tools though, which is one reason for the popularity of the Free

Software GNU tools.)

This Section lists some standardization efforts that exist in the real-time and embedded world.

1.5.1. POSIX

POSIX (“Portable Operating Systems Interface”, a name that Richard Stallman came up with) is a

standard for the function calls (the Application Programming Interface, API) of UNIX-like general

purpose operating systems. POSIX has some specifications on real-time primitives too. Its definition of

real time is quite loose:

The ability of the operating system to provide a required level of service in a bounded response time.

The standard is managed by the Portable Application Standards Committee (http://www.pasc.org/)

(PASC) of the Institute for Electrical and Electronic Engineers (http://www.ieee.org) (IEEE), and is not

freely available. There is an extensive Rationale document, that explains the reasons behind the choices

that the POSIX committees made, as well as lots of other interesting remarks. That document can be

found here (http://www.opengroup.org/onlinepubs/007904975/xrat/contents.html).

The POSIX components relevant to real-time are: 1003.1b (real-time), 1003.1d (additional real-time

extensions), 1003.1j (advanced real-time extensions). See this link

(http://www.opengroup.org/onlinepubs/007904975/idx/realtime.html) or here (IEEE Std 1003.1-2001)

(http://www.unix-systems.org/version3/ieee_std.html) for more details. These standards are often also

denoted as ANSI/IEEE Std. 1003.1b, etcetera.

POSIX also defines four so-called profiles for real-time systems:


• PSE51 (Minimal Realtime System Profile). This profile offers the basic set of functionality for a single process, deeply embedded system, such as for the unattended control of special I/O devices. Neither user interaction nor a file system (mass storage) is required. The system runs one single POSIX process, that can run multiple POSIX threads. These threads can use POSIX message passing. The process itself can use this message passing to communicate with other PSE5X-conformant systems (e.g., multiple CPUs on a common backplane, each running an independent PSE51 system). The hardware model for this profile assumes a single processor with its memory, but no memory management unit (MMU) or common I/O devices (serial line, ethernet card, etc.) are required.

• PSE52 (Realtime Controller System Profile). This profile is the PSE51 profile, plus support for a file system (possibly implemented as a RAM disk!) and asynchronous I/O.

• PSE53 (Dedicated Realtime System Profile). This profile is the PSE51 profile, plus support for multiple processes, but minus the file system support of the PSE52 profile. The hardware can have a memory management unit.

• PSE54 (Multi-Purpose Realtime System Profile). This is the superset of the other profiles, and essentially consists of the entire POSIX.1, POSIX.1b, POSIX.1c and/or POSIX.5b standards. Not all processes or threads must be real-time. Interactive user processes are allowed on a PSE54 system, so all of POSIX.2 and POSIX.2a are also included. The hardware model for this profile assumes one or more processors with memory management units, high-speed storage devices, special interfaces, network support, and display devices.

RTLinux claims to comply with the PSE51 profile; RTAI claims nothing.

Linux’s goal is POSIX compliance, but not blindly, and not at all costs. The /usr/include/unistd.h

header file gives information about which parts of the standard have been implemented already. For

example: the implementation of threads (see Chapter 2), and the scheduler modes (see Section 2.7).

Many of the real-time POSIX extensions have already been implemented in RTLinux and RTAI (see

Chapter 9).

1.5.2. Unix98

UNIX (UNIX98, Single UNIX Specification, Version 2) is the standardization of UNIX operating systems

driven by the Open Group (http://www.unix-systems.org/unix98.html). It incorporates a lot of the POSIX

standards.

1.5.3. EL/IX

EL/IX. The EL/IX (http://sources.redhat.com/elix/) API for embedded systems aims to be a standards-compliant subset of POSIX and ANSI C.

1.5.4. µITRON

µITRON. µITRON (http://www.itron.gr.jp/home-e.html) is a Japanese standard for embedded systems.

“TRON” stands for The Real-time Operating system Nucleus; the letter “I” stands for industrial, and the


“mu” for micro. (There are other TRONs too: BTRON for business, CTRON for communication, . . . )

1.5.5. OSEK

OSEK. OSEK (http://www.osek-vdx.org/) is a German standard for an open architecture for distributed

vehicle control units. The architecture is open, but no free software implementation is available.

1.5.6. Real-Time Specification for Java

The Real-Time Specification for Java (RTSJ) (Java Community Process (http://jcp.org/jsr/detail/1.jsp), rtj.org (http://www.rtj.org/)) is not really an operating system, but a runtime for a programming language. The distinction is not really fundamental for normal desktop use; it can be enormous for real-time use, because a runtime must make use of the services of the underlying operating system. That means that a runtime with real-time features is useless on a non real-time operating system.

This specification was released in 2001, and, similar to the POSIX specifications, it is not an implementation; some commercial implementations are already available. The basic prescriptions of the specification are:

• Implementations of the specification are allowed to introduce their own optimizations and extensions, such as, for example, scheduling algorithms or garbage collection.

• The minimum task management includes static priority-based preemptive scheduling, with at least 28 priority levels.

• Priority inversion “prevention” (Section 4.9) is mandatory.

• An implementation must include classes that provide an asynchronous event mechanism.

• Exceptions must be allowed to change the context to another thread.

• Classes must be provided to allow direct access to physical memory.

1.5.7. Ada 95

Ada 95 real-time specifications.

The MaRTE OS (http://marte.unican.es/) is an example of a free software real-time kernel for embedded

applications that complies with Minimal Real-Time POSIX.13. Most of its code is written in Ada with

some C and assembler parts. The Ada runtime from the GNU Ada Toolkit (GNAT)

(ftp://ftp.cs.nyu.edu/pub/gnat) has been adapted to run on the kernel. The Ada compiler comes under the

GPL, but the runtime has a modified GPL license that allows it to be used without constraints in

commercial systems.

OpenRavenscar (http://polaris.dit.upm.es/~ork/) is another free software real-time kernel in Ada.

1.5.8. Real-Time CORBA

The Object Management Group (OMG) has released a specification of a “real-time” component broker

interface, called Real-Time CORBA (“RT-CORBA”). This is not a piece of software, but a specification

interface. So, various implementations can satisfy the interface, with very different real-time behaviour.

The RT-CORBA specifications allow the component builder to specify some desired properties that are

common for real-time tasks, such as static priority levels or time-outs. These specifications have to be

mapped onto real (RT)OS primitives by the specific implementation(s) used in the application.

1.6. Linux for real-time and embedded

Linux is a general purpose operating system, with a non-pre-emptable kernel: it wants to give all tasks a

fair share of the resources (processor, memory, peripheral devices, . . . ), and it doesn’t interrupt kernel

activities. Linux’s basic user space scheduler is of the time slicing type: it gives more or less equal time

slices to different tasks. It is possible to change the priorities of user space tasks to some extent (using the nice command), but not enough to make the scheduling deterministic. Other reasons why Linux is a poor RTOS are the unpredictable delays caused by non-pre-emptable operations running in kernel space, and by the mere size of that kernel. Indeed, nobody can understand the kernel sufficiently well to be able to predict how long a certain operation is going to take.

All remarks above hold for all general purpose operating systems, such as Windows, AIX, IRIX,

HP-UX, Solaris, etc. It may sound strange at first, but “good old” DOS was much closer to being an

RTOS than Linux, because its scheduler was much less “fair” and advanced, and it had fewer system

services to look after. (However, this is an advantage only when there is a single real-time task!) Because

none of the desktop or server operating systems is a good candidate for real-time and/or embedded

applications, several companies have started to develop special purpose operating systems, often for

quite small markets. Many of them are UNIX-like, but they are not mutually compatible. The market is

very fragmented, with several dozens of RTOSs, none of which holds a majority of the market. At least,

this was the case before Linux appeared on the radar of real-time and embedded system companies.

Since about the year 2000, the market has seen lots of mergers and acquisitions, and substantial efforts

from the established RTOS companies to become as “Linux-compliant” as possible.

The fact that Microsoft tries to enter the market too (with its PocketPC/Windows CE product line) is only

accelerating this evolution. History has shown that the fragmented UNIX desktop and server markets

were easy targets for Microsoft. . . , even with inferior technology. So, hopefully the competitors have

learned from this experience.

While the Linux kernel people, headed by Linus Torvalds, are very keen on making the general support

and performance of Linux better, their interest in real time is very small, to say the least. . . No efforts to

make Linux into a real RTOS have to be expected from that side, but the kernel is evolving towards

higher pre-emptability (first of all, because this is necessary if one wants to scale Linux to more than,

say, two CPUs).

Torvalds has mentioned two reasons why he doesn’t want to make Linux into a real-time operating

system:

• Computers are getting faster all the time, such that a general-purpose operating system will satisfy the requirements of more and more “real-time” users. (That is, those that require a fast system, which is not the same as a deterministic system.)

• Offering hard real-time features in a general-purpose OS will quickly result in “bad behaviour” of application programmers (http://kernelnotes.org/lnxlists/linux-kernel/lk_0006_05/msg00180.html): they will all want their application to perform best, and program it with high priority. Experience has shown many times that this leads to incompatible timing constraints between different applications rather sooner than later.

However, there are no technical reasons why Linux would not be able to become (more of) an RTOS, and

much technology to make Linux more powerful on the high-end server systems is also useful for

real-time and embedded purposes: real multi-threading in the kernel, finer locks and scheduling points

needed for SMP systems, migration of processes over CPUs, “hot-swapable” devices, etc.

Anyway, quite a lot of Free Software efforts have started to contribute software in the area of real-time

and embedded systems. These contributions can be classified as follows:

• Eliminating functionalities from the standard Linux kernel.

This approach aims at reducing the memory footprint of the operating system, and is hence mainly

focused on embedded systems. uCLinux (http://www.uclinux.org) is an example. Other projects

develop small and simple C libraries, because the current versions of the GNU tools have become

quite large; for example, BusyBox (http://www.busybox.net) (a replacement for most of the utilities one usually finds in the GNU fileutils, shellutils, etc.), and µclibc (http://www.uclibc.org) (a small version of the general C library).

• Patches to the standard Linux kernel.

This approach replaces the standard scheduler of Linux with a more deterministic scheduling

algorithm, and adds scheduling points to the Linux source tree, in order to make the kernel more

responsive.

• Real-time patches underneath the Linux kernel.

This approach runs Linux as a low-priority process in a small real-time kernel. This kernel takes over

the real hardware from Linux, and replaces it with a software simulation.

The two major examples that follow this road are RTLinux (http://www.rtlinux.org/) (Section 9.2) and

RTAI (http://www.rtai.org/) (Section 9.3).

• Linux-independent operating systems.

These projects have been developed completely independently from Linux, and some of them are even

older. Some examples are RT-EMS (http://www.rtems.com/), and eCos

(http://sources.redhat.com/ecos/).

The Linux kernel that supports a typical desktop computer is several hundreds of kilobytes large. And

that does not include the memory taken up by the Linux tools and the users’ applications. Hence, the

Linux footprint is too large for many embedded systems. It also takes about a minute or so to boot a

PC-like computer, which is much too long for most embedded systems. And it expects a hard disk to

work with, and a power supply of more than 100 Watts for modern high-end CPUs, video cards and hard

disks.

However, one of the nicest things about Linux is its enormous configurability, of which you can get a

taste if you compile your own Linux kernel. That kernel can be constructed out of lots of more or less

independent modules, and you just leave out the modules that are not needed in your system. If your

application requires no ethernet card, leave out the network drivers; if you don’t need a screen, why

bother with installing the X Window System; etc. This means that many people have configured

Linux-derived systems that become so small that they fit on a single floppy.

The previous paragraphs may suggest that Linux proper has no chance at all at being used as an

embedded OS. However, it has some advantages that may turn out to be decisive in the not too distant

future (certainly because memory and CPUs become cheaper): its configurability, its ability to be

administered from a distance, and its many ways to add security features.

Chapter 2. Task management and scheduling

This Chapter explains what task management means, and how it can influence the real-time behaviour of

an operating system. Concrete examples come from the POSIX standard, but the concepts are identical

for other task management APIs. Scheduling of tasks is one of the responsibilities of the task

management with influence on the real-time behaviour of the system. Other responsibilities are: task

creation and deletion, linking tasks to interrupts and deferred interrupt servicing, and assignment and

change of scheduling priorities.

2.1. Processes and threads

We use “task” as the generic name for both processes and threads. A process is the normal “unit of

execution” in UNIX systems: if you compile a C program that has one single main(), then running this program requires one process. (That process can itself generate other processes too, of course.) The operating system must provide several services to each process: memory pages (in virtual memory and in physical RAM) for code, data, stack and heap, and for file and other descriptors; registers in the CPU; queues for scheduling; signals and IPC; etc.

A process can spawn new processes (“children”), either by starting up an independent process via a system call, or by fork()-ing itself. (The Linux kernel uses a somewhat different approach, with the clone() function, see Section 2.3.) The forked process is a copy of the parent process, but it gets its own memory, registers, file descriptors, and process identifier. Starting a new process is a relatively heavy task for the

operating system, because memory has to be allocated, and lots of data structures and code segments

must be copied.

A thread is a “lightweight” process, in the sense that different threads share the same address space. That

is, they share global and “static” variables, file descriptors, signal bookkeeping, code area, and heap, but they have their own thread status, program counter, registers, signal mask (in Linux but not in

UNIX), and stack. The interesting fact from an RTOS point of view is that threads have shorter creation

and context switch times, and faster IPC (see Chapter 4). A “context switch” is the saving of the state of

the currently running task (registers, stack pointer, instruction pointer, etc.), and the restoring of the state

of the new task. Other advantages for using multiple threads within a process are:

• The threads can be run on separate processors.

• The tasks can be prioritized, so that a less important computation can, in response to an external event, be suspended to process that event.

• Computation can occur in one thread, while waiting for an event, such as the completion of I/O, can be outsourced to another thread.

On the other hand, using threads requires functions to be made “thread-safe”: when a function func() is called in one thread, this thread can be pre-empted by another thread, which, in turn, can call the same function; hence, this function should not keep intermediate data in variables that are shared between the different threads.
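The difference can be made concrete with a small sketch (the function names are invented for illustration; only the second variant is safe to call from several threads at once):

```c
#include <stdio.h>
#include <string.h>

/* NOT thread-safe: the result lives in shared static storage, so a
   second thread calling this function overwrites the first thread's
   result before that thread has had a chance to use it. */
static char unsafe_buf[32];
char *format_id_unsafe(int id) {
    snprintf(unsafe_buf, sizeof(unsafe_buf), "task-%d", id);
    return unsafe_buf;
}

/* Thread-safe (reentrant) variant: the caller supplies the storage,
   so concurrent calls cannot interfere with each other. */
int format_id_r(int id, char *buf, size_t len) {
    return snprintf(buf, len, "task-%d", id);
}
```

This is the same convention that distinguishes, e.g., the standard C library's reentrant `_r` variants from their classic counterparts.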


Many modern CPUs offer functionality such as floating-point calculation, digital signal processing (e.g.,

“MMX”), or on-chip memory caches. These functions require extra registers and/or operations, so, when

this extra functionality can be avoided, real-time determinism is increased (because the context switch time is lower if fewer registers have to be saved and restored). For example, Linux doesn’t save the floating

point registers for kernel tasks and interrupt service routines.

2.2. POSIX thread management

The POSIX operating system standard (see Section 1.5) has an extensive threads API, which all

UNIX-like operating systems implement (albeit to varying degrees). The thread implementation in Linux

is not the most complete. The real-time operating systems discussed in later Chapters all have decent, but

not complete, POSIX thread functionality. The reasons why many operating systems don’t implement

the full POSIX standards are: (i) POSIX is not a single, rigid standard, but a large set of complementary

standards with different focus points; (ii) one doesn’t need the whole API to build functional and

efficient software systems; (iii) some parts of the standard require complicated implementations with

meager practical and not-unique advantages; (iv) some features made it into the standard for the sole

purpose of being backwards compatible with older existing UNIX systems.

The POSIX API provides the following function calls (and others!) for thread creation and deletion:

int pthread_create(
  pthread_t *thread,                // thread data structure
  pthread_attr_t *attr,             // attributes data structure
  void *(*start_routine) (void *),  // function to execute
  void *arg                         // argument to pass to function
);
void pthread_exit(void *retval);
int pthread_join(pthread_t thread, void **status);
int pthread_detach(pthread_t thread);
int pthread_cancel(pthread_t thread);

(The initial letter “p” indicates the POSIX heritage.) Thread creation involves some overhead, because

memory has to be allocated for the new thread; in a real-time setting, the memory also has to be locked

into RAM, in order to be sure that no time will ever be lost because the memory pages have to be

swapped in from disk when needed. Similarly, freeing memory at thread deletion is also an overhead. So,

a real-time application should do thread creation and deletion outside of the real-time activity.

Other overhead caused by task management is: satisfying requested changes in the timing or priority

properties of tasks, and the maintenance of the task queues at all priority levels when tasks are woken up,

put asleep, made running, obliged to wait for a blocking IPC call, etc.

In the pthread_create() call, the programmer can specify the run-time priority of the task, as well as the scheduling policy to use, through the pthread_setschedparam() function call.
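As a sketch of how these calls fit together (the worker and wrapper names are invented; setting a real-time policy such as SCHED_FIFO usually requires the appropriate privileges, so this sketch falls back to default attributes when the privileged creation fails):

```c
#include <pthread.h>
#include <sched.h>
#include <string.h>

/* Hypothetical worker, used only to show the mechanism. */
static void *worker(void *arg) {
    *(int *)arg = 1;   /* mark that the thread ran */
    return NULL;
}

/* Create a thread with an explicit SCHED_FIFO priority, set through
   the attribute object passed to pthread_create(). Without
   PTHREAD_EXPLICIT_SCHED, the new thread simply inherits the
   scheduling attributes of its creator and the values below are
   ignored. Returns 0 on success. */
int create_prioritized_thread(pthread_t *t, int prio, int *ran_flag) {
    pthread_attr_t attr;
    struct sched_param param;
    int err;

    pthread_attr_init(&attr);
    memset(&param, 0, sizeof(param));
    param.sched_priority = prio;
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &param);

    err = pthread_create(t, &attr, worker, ran_flag);
    if (err != 0)   /* e.g. EPERM: no privilege for SCHED_FIFO */
        err = pthread_create(t, NULL, worker, ran_flag);
    pthread_attr_destroy(&attr);
    return err;
}
```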


pthread_join(), pthread_detach(), and pthread_cancel() are different ways to end the execution of a task. A task should indeed not be deleted blindly, because it shares a lot of its components with other tasks, so its memory space and locks should not be released when its cousin tasks are still using them. Especially cancelling a thread from within another thread is dangerous: it is practically impossible to tell what resources the cancelled task is holding (including locks!). So, POSIX prescribes a procedure to cancel tasks. pthread_cancel() does not cancel the task immediately, but is only a request to the operating system to cancel the task. How the task is cancelled depends on how the task initialised its own cancellation policy, via:

• int pthread_setcancelstate(int state, int *oldstate): atomically sets the calling task’s cancellability state to the indicated state and returns the previous cancellability state in oldstate. Possible values for state are PTHREAD_CANCEL_ENABLE and PTHREAD_CANCEL_DISABLE.

• int pthread_setcanceltype(int type, int *oldtype): atomically sets the calling task’s cancellability type to the indicated type and returns the previous cancellability type in oldtype. Possible values for type are PTHREAD_CANCEL_DEFERRED and PTHREAD_CANCEL_ASYNCHRONOUS.

The default cancellation type and state are PTHREAD_CANCEL_DEFERRED and PTHREAD_CANCEL_ENABLE.

Cancellation happens immediately if the task has chosen the PTHREAD_CANCEL_ASYNCHRONOUS policy; so, this policy should only be chosen when the programmer is certain that the task can be killed at any time, without compromising the rest of the system. If the task has chosen the PTHREAD_CANCEL_DEFERRED policy, it is cancelled only when it reaches a so-called cancellation point. These OS-dependent points are function calls where the task tests whether it has received a cancellation request. (Or rather, the operating system does the test for it, as well as the cancellation handling, discussed below.) Cancellation points are typically function calls that might block for a long time, such that the OS need only check for pending cancellation requests when the operation is about to block indefinitely. This includes, but is not at all limited to, pthread_cond_wait(), pthread_cond_timedwait(), sem_wait(), and sigwait().

The task that one wants to cancel can postpone cancellation in order to perform application-specific cleanup processing. It does this by “pushing” cancellation cleanup handlers every time that it acquires some resource. As the task leaves the last cancellation point before releasing a resource, it needs to “pop” the cleanup handler it had pushed earlier for this resource. Pushing and popping are done by the pthread_cleanup_push() and pthread_cleanup_pop() function calls. Every cleanup handler that is still on the cleanup stack is invoked (in Last-In, First-Out order) when the task is cancelled, and its job is to cleanly release the resource. The task terminates when the last cleanup handler returns. The task exit status returned by pthread_join() on a cancelled task is PTHREAD_CANCELED. (This behaviour is quite standard in many software tasks; Section 15.4 gives the generic software design behind such behaviour.)
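A minimal sketch of this push/pop discipline (the worker and driver functions are invented for illustration): the worker locks a mutex, pushes a handler that unlocks it, and then blocks in a cancellation point; when the driver cancels it, the handler releases the mutex.

```c
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Cleanup handler: release the resource (here: a mutex) that the
   thread was holding when it got cancelled. */
static void unlock_handler(void *arg) {
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *cancellable_worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    pthread_cleanup_push(unlock_handler, &lock);
    /* sleep() is a cancellation point: a pending pthread_cancel()
       takes effect here, and the pushed handler unlocks the mutex. */
    for (;;)
        sleep(1);
    /* Never reached; the argument 0 means: pop without executing. */
    pthread_cleanup_pop(0);
    return NULL;
}

/* Driver: start the worker, cancel it, and verify that the cleanup
   handler released the mutex. Returns 0 on success. */
int run_cancellation_demo(void) {
    pthread_t t;
    void *status;
    if (pthread_create(&t, NULL, cancellable_worker, NULL) != 0)
        return -1;
    sleep(1);                     /* let the worker acquire the lock */
    pthread_cancel(t);
    pthread_join(t, &status);
    if (status != PTHREAD_CANCELED) return -1;
    if (pthread_mutex_trylock(&lock) != 0) return -1; /* handler ran */
    pthread_mutex_unlock(&lock);
    return 0;
}
```

Note that pthread_cleanup_push() and pthread_cleanup_pop() are macros that must appear as a matched pair in the same lexical scope.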

The cancellation procedures above might seem a bit involved, but that’s due to the complexity of the

problem one wants to solve: making sure that another task exits without blocking other tasks. Anyway,


this kind of cancellation should be avoided whenever possible. The clean solution is to let each task in your application react to a condition variable that indicates that it must shut down itself (Section 15.4.4).
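A minimal sketch of that pattern (names invented; shown with a plain mutex-protected flag rather than a full condition variable, to keep the sketch short — Section 15.4.4 gives the complete design): each thread checks the shared shutdown request at a well-defined point in its loop, and exits by itself.

```c
#include <pthread.h>
#include <stdbool.h>

/* Shared shutdown request, protected by a mutex. */
static pthread_mutex_t shutdown_mtx = PTHREAD_MUTEX_INITIALIZER;
static bool shutdown_requested = false;

static bool should_shut_down(void) {
    bool v;
    pthread_mutex_lock(&shutdown_mtx);
    v = shutdown_requested;
    pthread_mutex_unlock(&shutdown_mtx);
    return v;
}

void request_shutdown(void) {
    pthread_mutex_lock(&shutdown_mtx);
    shutdown_requested = true;
    pthread_mutex_unlock(&shutdown_mtx);
}

/* Worker loop: do one well-defined unit of work, then check the
   flag; the thread exits itself, in a known state, with all its
   resources already released. */
void *self_exiting_worker(void *arg) {
    int *work_done = (int *)arg;
    while (!should_shut_down()) {
        (*work_done)++;   /* one unit of (hypothetical) work */
    }
    return NULL;
}
```

Because the thread only ever stops between work units, no cleanup handlers or cancellation points are needed at all.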

An RTOS must also allow the programmer to specify the timing with which threads have to run. One typically uses two

timing modes:

• Periodic: the task must run at regular intervals.

• One-shot: the task must run only once, at a predefined instant in time.

One-shot timing sometimes requires a bit more overhead, because of a more involved hardware timer

programming. POSIX has no standardized function calls for periodic timing. The reasons are that: (i)

there are multiple ways in which the desired functionality can be programmed with already existing

POSIX primitives; and (ii) most applications have to break the periodic loop in one way or another

anyhow, depending on application-specific conditions. Because of the lack of a (POSIX) standard API

for periodic thread timing, different operating systems implemented the functions on their own, such that

application programs will most probably have portability problems in this area. For example, RTLinux

uses pthread_make_periodic_np() for both options (the _np suffix stands for “non-portable”), while RTAI has rt_set_periodic_mode() and rt_set_oneshot_mode().

As examples of alternatives for the periodic timing function, POSIX provides the usleep() and nanosleep() function calls. These put tasks asleep with a high timing resolution (microseconds, respectively nanoseconds). The achievable resolution depends of course on the type of CPU.

Some other often-used functionality that POSIX has not standardized is: to allow the use of floating point operations in a thread (for which, e.g., RTLinux has introduced pthread_setfp_np()); to suspend execution of another thread than the one that executes the function (“pthread_suspend_np(another_thread)”); and “pthread_wakeup_np(another_thread)” to resume execution of the other thread. Note again the “..._np” suffix.

The floating point selection option was considered too low level and hardware dependent to put into the

POSIX standard. Saving a couple of registers more or less is more of a matter of optimization, and such

things don’t belong in a standard. The Linux scheduler, for example, always saves floating point registers

of user space processes by default.

The pthread_suspend_np() and pthread_wakeup_np() functions are dangerous (see below), and the POSIX committee had very good reasons not to include them in the standard. However, many users think they are “user-friendly”, because they sometimes save them a lot of keystrokes. The danger of pthread_suspend_np() is that, while its use is convenient to stop a thread, it most probably leaves that thread in an undefined state, such that it’s hard to predict what the thread is going to do when pthread_wakeup_np() starts it again!

The proper way of suspending the execution of a thread is to let the thread do it itself, at a moment it is ready to do so, i.e., it is in a well-defined state, from which it can restart in a deterministic way. Chapter 17 gives some more detailed examples.


2.3. Linux tasks and tasklets

The above-mentioned distinction between “process” and “thread” is not what Linus Torvalds has in

mind. He thinks the really important concept is the Context of execution: that includes things like CPU

state (registers, etc.), memory management state (page mappings), permission state (user ID, group ID),

code to execute, and various “communication states” (open files, signal handlers, etc.). An email by

Torvalds in which he explains his (and hence Linux’s) point of view can be found here

(http://www.uwsg.iu.edu/hypermail/linux/kernel/9608/0191.html). POSIX threads are offered on Linux

as a library, and basically only because of compliance with the standard. Anyway, they are just one

single possible way to share context. And the Linux kernel offers a more flexible alternative: the clone() system call creates a new “task”, with a large choice in what parts of the context of execution one wants to share between the new task and the task that creates it. See the corresponding man page for more details.
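A minimal sketch of clone() usage (Linux-specific; function names invented): passing CLONE_VM makes the new task share the creating task's address space, so, as with a thread, a write by the child is visible to the parent.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared_counter = 0;

static int child_fn(void *arg) {
    (void)arg;
    shared_counter = 42;   /* visible to the parent because of CLONE_VM */
    return 0;
}

/* Spawn a task that shares our address space (thread-like), wait for
   it, and return the value it wrote. A malloc'ed block serves as the
   child's stack; on most architectures the stack grows down, so the
   *top* of the allocation is passed to clone(). */
int run_clone_demo(void) {
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) return -1;
    pid_t pid = clone(child_fn, stack + stack_size,
                      CLONE_VM | SIGCHLD, NULL);
    if (pid == -1) { free(stack); return -1; }
    waitpid(pid, NULL, 0);   /* SIGCHLD: reapable like a child process */
    free(stack);
    return shared_counter;
}
```

Adding or removing flags such as CLONE_FILES or CLONE_FS selects exactly which other parts of the context of execution are shared.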

Many operating systems provide another primitive besides threads or processes, that programmers can

use to execute functionality. Linux and RTAI call it tasklets (Section 2.6). A tasklet is a function whose execution can be asked for by any kernel task, and that the operating system will execute before it does its next scheduling. At that moment, the OS executes these functions one by one. So, the important features of tasklets are:

• They are a more “lightweight” primitive than tasks, to execute functions outside of, and prior to, the normal scheduling of tasks.

• They are not pre-empted by normal tasks.

But tasklets can be pre-empted by interrupts, because the kernel has enabled all hardware interrupts when it runs the tasklets. Tasklets are typically only executed once, but some operating systems (e.g., RTAI) offer periodic execution of tasklets, by registering them with a timer. The tasklet primitive is also very useful as a so-called Deferred Service Routine (DSR), Section 3.4.

2.4. Scheduling

Some texts make a distinction between scheduling and dispatching, with dispatching being the simplest

of the two operations:

• Scheduling: determining the order and the timing (i.e., the “schedule”) with which tasks should be run.

• Dispatching: the dispatcher starts and stops the tasks, i.e., it implements the schedule.

This text only uses the term “scheduling”.

A primary responsibility of an RTOS is to make sure that all tasks meet their timing constraints. Timing

constraints come in different flavours (deadline, zero execution time, QoS), and for every task the

constraints can change over time. For example, a motion generator for a mobile robot has much more

constraints to take into account when it navigates in an environment with many nearby obstacles, while

its job is much easier in open areas. Or, users of a multimedia server have different QoS requirements for

editing one video stream than for the editing and synchronization of several streams.

So, a “one-size-fits-all” scheduling algorithm does not exist. Although that is exactly what a general

purpose operating system hopes to offer. Hence, it should come as no surprise that there is a vast

literature on the theory of scheduling, accompanied by a large variety of (un)implemented scheduling

algorithms. A theoretically optimal schedule can only be reached in the unlikely situation of complete

knowledge about the processing, synchronization and communication requirements of each task, and the

processing and timing properties of the hardware. This state of complete knowledge is seldom reached in

real-world applications, especially when the requirements are dynamic (i.e., time varying). And even with

complete predictability, the general scheduling problem is NP-complete, which means that its complexity

increases exponentially with the number of tasks and constraints involved in the scheduling. And hence,

the scheduling algorithms don’t scale well under a growing load and/or hardware resources. This does

not imply, however, that the problem is infeasible for applications with only a few, well-defined tasks.

Each OS has a scheduler function (let’s call it schedule()), that implements the scheduling algorithm.

(Later sections discuss the most common scheduling algorithms.) This scheduler is not a task in itself: it

is a function call, that is called at various points in the kernel. These points are, not surprisingly, called

scheduling points. Typical scheduling points are: end of interrupt service routines (Section 3.3), the

moments when tasks want to go to sleep for one reason or another, or when they become ready to run.

Scheduling is pure overhead: all time spent on calculating which task to run next is lost for the really

productive tasks. And trying to use more optimal schedulers isn’t always a clever “solution”: advanced

schedulers consume (often unpredictably!) more time and resources, and their increased complexity

makes it more difficult for programmers to work with. Hence, the chance that those programmers make

the wrong design decisions increases. Simplicity is especially a key feature for real-time and embedded

systems; complex schedulers appear more in Operations Research applications, where the scheduling

problem and its algorithmic complexity are comparable to the operating system case, but where the

real-time constraints and the predictability of the cost of tasks are more manageable.

(TODO: explain (POSIX) cancellation points: why are they needed? what makes a point a valid

cancellation point? Warn against using cancellation, because it’s so error prone. Not only from the OS

point of view (that OS must make sure its thread and lock bookkeeping remains consistent, which is not

a simple job), but also from the application point of view (how do you make sure that there is no race

between one thread trying to cancel another thread, and a third thread that still wants to interact with that

to-be-cancelled thread? It’s way better to have each thread exit itself explicitly, and to have an explicit

exit condition for each thread. And to make thread interaction asynchronous.)

2.5. Priority-based scheduling

The simplest approach to the scheduling problem is to assign static priorities to all tasks. That means

that the priority is given to the task at the time it is created. The scheduler function schedule() is then very simple, because it looks at all wait queues at each priority level, and starts the task with the highest priority that is ready to run.

Using priorities implies using pre-emption: schedule() interrupts a lower priority task in order to run a higher priority task that requests it. Pre-emption means that the running task’s context is switched out,


and the new task’s context is switched in.

One classifies priorities into statically and dynamically assigned priorities. In the former case, a task is

given a priority by the programmer at design time (or by the operator at system initialization time), and it

keeps this priority during its whole lifetime. In the dynamic case, schedule() becomes more complex,

because it has to calculate the task’s priority on-line, based on a number of dynamically changing

parameters (time till next deadline; amount of work to process; etc.). As described before, the optimal

solution to a scheduling problem is usually impossible to find, so scheduling is often based on a set of

heuristics. This is the case for real-time as well as non-real-time schedulers. The heuristics in a general

purpose OS can be quite involved, but real-time and embedded operating systems mostly use simple

heuristics. Because “simple” means: faster and smaller and more predictable! Examples of such simple

dynamic scheduling algorithms, that are sometimes used to replace static priority scheduling, are:

• Rate monotonic (RM). A task gets a higher priority if it has to run more frequently. This is a common approach in the case that all tasks are periodic. So, a task that has to run every n milliseconds gets a higher priority than a task that runs every m milliseconds when n<m. Hence, changing the scheduling frequency of a task on-line also changes its priority. The scheduler needs to know the periods of all tasks it has to schedule.

• Earliest deadline first (EDF). At each instant in time, there are a number of tasks that need to be finished in the near future. A task with a closer deadline gets a higher scheduling priority. The scheduler needs not only to know the deadline time of all tasks it has to schedule, but also their duration.
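The core of the EDF rule fits in a few lines (again a toy sketch with invented names; deadlines are absolute times represented as plain integers):

```c
/* Earliest deadline first, reduced to its core: among the ready
 * tasks, pick the one whose absolute deadline is closest. */
#include <assert.h>

struct rt_task { int id; int ready; long deadline; /* absolute time */ };

static int edf_pick(const struct rt_task *tasks, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!tasks[i].ready)
            continue;
        if (best < 0 || tasks[i].deadline < tasks[best].deadline)
            best = i;
    }
    return best;   /* index of the task to run next, or -1 */
}
```

Note that this selection has to be redone whenever a task becomes ready or a deadline passes, which is exactly the on-line cost that static priorities avoid.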

If different tasks in the system request different scheduling policies, the operating system has to make

trade-offs in determining the relative “weight” to give to each of the scheduling algorithms. These

trade-offs will most probably be quite arbitrary, so porting your application between operating systems

could lead to different scheduling results.

Priority-based scheduling is simple to implement, because schedule() just has to look at the tasks in the highest priority queue that are ready to be scheduled, and to start the first one in this queue.

Priority-based scheduling, however, is difficult for the application programmers: they must try to map the

often complex (“high-dimensional”) synchronization interdependencies between the different threads in

their application onto the linear scale offered by priorities! One often-observed phenomenon in real-time

applications that grow over time, is that the programmers tend to raise the priorities of some threads,

every time they notice that the introduction of new functionality (and hence new threads) disturbs the

synchronization of the existing threads. Chapter 14 gives some more examples of the negative effects of

“coupling”, and Chapter 15 discusses time-proven approaches to take care of complex interdependencies.

So, the problem with priority-based scheduling is that it is an indirect way to specify how to cope with

timing and synchronization constraints: at run-time, schedule() doesn’t take these constraints themselves into account, but knows only about the priorities, which are the programmer’s indirect model of the constraints.

In practice, all RTOSs at least offer static priority-based scheduling. Many also implement other

algorithms. Not always because of the intrinsic added value of the algorithm, but rather because of

typical marketing drives: users tend to buy software products with the highest number of features, even if


they risk drowning in the complexity and “feature bloat” (whose implications they often don’t even

understand. . . ). One of the more serious feature bloat examples in priority-based scheduling is the

so-called priority inheritance “solution” to the priority inversion phenomenon (see Section 4.8), that

occurs when tasks share resources which they should not access concurrently.

2.6. Priority spaces

Many operating systems (especially RTOSs) let all tasks (system tasks as well as user tasks) live in the

same priority space: any task can be given any priority within this space. Others, such as Linux, UNIX,

or Microsoft NT, have separate priority spaces for different kinds of tasks. Linux has two: user space and

kernel space. Tasks running in user space can change their priorities (through the nice() function call),

but all of them are pre-empted by any task in kernel space. Kernel space itself has three priority levels:

1. Interrupts: the “task” that services a hardware interrupt (timer, network, keyboard, . . . ) has the

highest priority. Such a task is called an interrupt service routine (ISR). (Section 3.3.) It should be as

short as possible, because it runs with all other interrupts disabled. An ISR is not really a task, but

just a function call, and its execution is not determined by the scheduler: the ISR is executed

immediately at the occurrence of a hardware interrupt, by the hardware of the interrupt controller

and the CPU (Section 3.2). The operating system software is not involved at all.

2. Tasklet functions (Linux specific, Section 3.4) and Deferred Service Routines (terminology often

used outside of Linux) are functions (again, not tasks!) that run at the second highest priority. Only

a hardware interrupt can pre-empt them. A tasklet can be activated by any kernel task; a deferred

interrupt function (Section 3.4) is typically triggered by a hardware interrupt service routine, to

further process an interrupt after the ISR has finished. Both have the same properties, and are

executed after all hardware interrupt service routines have finished, and before the “normal” tasks are

scheduled; interrupts are enabled when they run. In contrast to the hardware interrupts, the operating

system software is involved in determining when they are executed.

3. All other kernel tasks run at the lowest priority level in the kernel. They pre-empt every user space

task.

There is no consensus about the relative merits of having separate user and kernel spaces: some consider

it to be a design advantage (“divide et impera”), while others experience it as an unnecessarily artificial

constraint on their flexibility.

2.7. Linux scheduler

The scheduler implemented in the file /usr/src/linux/kernel/sched.c of the Linux source tree works with three scheduling modes (which are defined in the POSIX standard): SCHED_RR, SCHED_FIFO and SCHED_OTHER. SCHED_OTHER is the default. The scheduling mode of a task is set by the sched_setscheduler() POSIX system call.

SCHED_RR is the round-robin time slicing algorithm. After a task finishes its time slice, it is moved to the tail of its priority queue, such that another task in the same priority level can start running. If there is no other task at this priority, the pre-empted task can continue.


SCHED_FIFO is a First-In, First-Out scheduling algorithm: the tasks in one priority level are scheduled in the order they get ready to run; once a task is scheduled, it keeps the processor until pre-empted by a higher priority task, until it releases the processor voluntarily, or until it has to wait to get access to some resource. This scheduler mode is often called “POSIX soft real-time” because it corresponds to the most common real-time scheduling approach with static priorities, but without the other necessary real-time components.
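For illustration, this is how a POSIX thread attribute object can be set up for SCHED_FIFO (a sketch with an invented helper name; actually creating a thread with these attributes typically requires root privileges):

```c
/* Prepare pthread attributes for static-priority FIFO scheduling.
 * make_fifo_attr() is an invented helper, not a standard call. */
#include <assert.h>
#include <pthread.h>
#include <sched.h>

int make_fifo_attr(pthread_attr_t *attr, int prio)
{
    struct sched_param param = { .sched_priority = prio };

    if (pthread_attr_init(attr)) return -1;
    /* don't inherit the creator's (usually SCHED_OTHER) policy */
    if (pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED)) return -1;
    if (pthread_attr_setschedpolicy(attr, SCHED_FIFO)) return -1;
    return pthread_attr_setschedparam(attr, &param) ? -1 : 0;
}
```

Setting the attributes always succeeds for an unprivileged task; it is only the actual thread creation that the kernel may refuse.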

The behaviour of the SCHED_OTHER scheduler function is not prescribed by the POSIX standard. It is meant to give freedom to the operating system programmers to implement their own scheduling algorithm. In Linux, as in all general-purpose operating systems, the SCHED_OTHER scheduler function tries to combine two conflicting performance measures: maximum throughput and good response to

interactive users. The Linux scheduler calculates a “goodness” value for each candidate task, based on a

number of heuristic rules. Recently, the scheduler function got a lot of attention from the Linux kernel

developers, since a new O(1) (“order one”) scheduling algorithm was introduced. O(1) means that the

function’s computational time does not increase with the number of tasks that must be scheduled. This

has led to a more responsive kernel, certainly in combination with the increased number of pre-emption

points (Section 2.8), which all lead to a call to the scheduler function.

Kernel tasks with the SCHED_OTHER scheduling policy receive the lowest priority, “0”, while the SCHED_RR and SCHED_FIFO policies can use priority levels from “1” to “99”. User space tasks are always scheduled with the SCHED_OTHER policy. The priority levels 0 to 99 are prescribed in the POSIX standard, and the portable POSIX way to find out about the minimum and maximum scheduling priorities is through the sched_get_priority_min() and sched_get_priority_max() system calls. Both take one of the priority policies as their argument.
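A small example of these calls (on Linux the real-time policies typically report the range 1 to 99, but portable code should only assume that the minimum does not exceed the maximum):

```c
/* Query and print the priority range of each POSIX scheduling policy. */
#include <assert.h>
#include <sched.h>
#include <stdio.h>

void print_prio_ranges(void)
{
    const int policies[] = { SCHED_OTHER, SCHED_FIFO, SCHED_RR };
    const char *names[]  = { "SCHED_OTHER", "SCHED_FIFO", "SCHED_RR" };

    for (int i = 0; i < 3; i++)
        printf("%-12s priorities: %d..%d\n", names[i],
               sched_get_priority_min(policies[i]),
               sched_get_priority_max(policies[i]));
}
```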

The scheduling for Symmetric Multi-Processor (SMP) systems is basically the same as for the

uni-processor case. There are some extra function calls to assign a task or an interrupt to a specific

processor, if the programmer desires so. This decision could lead to more efficient execution, because it

increases the chance that the task’s or ISR’s code can permanently be kept in the cache of that particular

processor.

2.8. Linux real-time scheduling

Linux will not become a full-fledged RTOS, for the simple reason that the requirements for a

general-purpose operating system are very different from those of an RTOS. However, soft real-time

additions to the standard Linux kernel have been developed in several places.

One active source of soft real-time efforts has been the audio and video community: in this area, Linux

and Microsoft NT perform poorly, in comparison to, for example, BeOS and Microsoft Windows 95. The

reason is that Linux and Microsoft NT can’t guarantee these multi-media tasks a deterministic share of

the resources (QoS). BeOS does offer QoS scheduling, while Microsoft Windows 95 simply has far fewer things to do than a “real” operating system. . .


Another reason for soft real-time work is the drive to make Linux scale better on multi-processor

systems. In this context, it is important to keep the locks on kernel functionality as small as possible,

because if one processor needs a lock, the other processors are also disturbed in their activity. The

expectation is that the scalability activity will make Linux into an operating system that can almost

guarantee milli-second deadlines (i.e., “soft real time”), without making it into a real RTOS.

Here is a (non-exhaustive) list of efforts to improve on latency problems in the Linux kernel:

• Montavista’s Hard Hat Linux (http://www.mvista.com/products/hhl.html) with its pre-emption (http://www.mvista.com/dswp/PreemptibleLinux.pdf) patches. These are currently maintained by

Robert Love (http://www.tech9.net/rml/linux/), and gradually introduced in the new 2.5.x kernels. The

idea is to see whether the scheduler could run at the moment that a kernel spinlock (Section 4.6.3)

is released, or an interrupt routine (Chapter 3) exits. Commands exist to disable or enable kernel

pre-emption.

• Ingo Molnar’s low-latency patches, now maintained by Andrew Morton (http://www.zipworld.com.au/~akpm/). They introduce more scheduling

points (http://www.linuxdevices.com/articles/AT8906594941.html) in the kernel code, such that the

time is reduced between the occurrence of an event that requires rescheduling and the actual

rescheduling. Probably, this work will be combined with the pre-emption work mentioned above.

• TimeSys Linux/RT (http://www.timesys.com) develops and commercializes the so-called “Resource Kernel” loadable module, that makes the standard Linux kernel pre-emptable, and that allows one to build

QoS scheduling for user tasks.

• KURT (http://www.ittc.ukans.edu/kurt/) (Kansas University Real-Time Linux). KURT Linux allows for explicit scheduling of any real-time events rather than just tasks. This provides a more generic

framework onto which normal real-time process scheduling is mapped. Since event scheduling is

handled by the system, addition of new events such as periodic sampling data acquisition cards (video,

lab equipment, etc.) is highly simplified. KURT introduces two modes of operation: the normal mode

and the real-time mode. In normal mode, the system acts as a generic Linux system. When the kernel

is running in real-time mode, it only executes real-time processes. While in real-time mode, the system

can no longer be used as a generic workstation, as all of its resources are dedicated to executing its

real-time responsibilities as accurately as possible.

• The LinuxBIOS (http://www.linuxbios.org) project allows one to get rid of the usual BIOS chips that manage part of the hardware, and that introduce significant delays when booting. A LinuxBIOS startup can take place in a few seconds, booting immediately into a ready-to-go kernel.

• Linux-SRT (http://www.uk.research.att.com/~dmi/linux-srt/) is a QoS scheduler.
• QLinux (http://www.cs.umass.edu/~lass/software/qlinux/) is a QoS scheduler.
• Fairsched (http://fairsched.sourceforge.net/) is a hierarchical QoS scheduler: tasks are divided into groups and each group receives guaranteed CPU time allocation proportional to its weight. The standard scheduler is used to schedule processes within a group.

• DWCS (http://www.cc.gatech.edu/~west/dwcs.html) (Dynamic Window-Constrained Scheduling) is a QoS scheduler, parameterizing the service in terms of a request period and a window constraint. The request period is the time interval over which a task must receive some share of the CPU; the window constraint is the value of the minimum share the task must receive during this “window.”

Chapter 3. Interrupts

This Chapter explains the basics of interrupt servicing in a computer system, with again an emphasis on

the real-time application. Interrupt hardware and software come in a great variety of implementations and

functionalities, so some of the concepts talked about in this Chapter may not be relevant to your system.

3.1. Introduction

Interrupts are indispensable in most computer systems with real-time ambitions. Interrupts have to be

processed by a so-called ISR (Interrupt Service Routine). The faster this ISR can do its job, the better the

real-time performance of the RTOS, because other tasks are delayed less. Timers are one example of

peripheral devices that generate interrupts; other such devices are the keyboard, DAQ (Data AcQuisition) cards, video cards, the serial and parallel ports, etc. Also the processor itself can generate

interrupts, e.g., to switch to the “protected mode” of the processor, when executing an illegal operation,

as part of a debugging session, or when an “exception” is raised by an application program.

3.2. Interrupt hardware

An interrupt-driven system (which many RTOSs and EOSs are) typically has one or more of the

following hardware components:

• Interrupt vector. Many systems have more than one hardware interrupt line (also called interrupt request (IRQ)), and the hardware manufacturer typically assembles all these interrupt lines in an “interrupt vector”. The INTEL 80x86 processors’ interrupt vector contains 256 entries, and is called the Interrupt Descriptor Table (IDT), [Hyde97]. (But most PC manufacturers make only 16 of these

interrupts available as hardware interrupts! See below.) The interrupt vector is an array of pointers to

the interrupt service routines (Section 3.3) that will be triggered when the corresponding interrupt

occurs. The vector also contains a bit for each interrupt line that signals whether there is an interrupt

pending on that line, i.e., a peripheral device has raised the interrupt, and is waiting to be serviced.

Some processors use non-vectored interrupt processing: when an interrupt occurs, control is transferred

to one single routine, that has to decide what to do with the interrupt. The same strategy is also used,

in software, in most operating systems to allow multiple devices to share the same interrupt.

• Synchronous or software interrupt. A synchronous interrupt (also called a software interrupt, or a trap) is an interrupt that is not caused by an (asynchronous) hardware event, but by a specific (synchronous) machine language operation code, such as, for example, the trap in the Motorola 68000, the swi in the ARM, or the int in the Intel 80x86, or by a divide by zero, a memory segmentation

fault, etc. Since this feature is supported in the hardware, one can expect a large number of different,

not standardized, names and functions for software interrupts. . .


Major differences between asynchronous/hardware interrupts and synchronous/software interrupts, on

most hardware, are that:

1. Further interrupts are disabled as soon as a hardware interrupt comes in, but not disabled in the

case of a software interrupt.

2. The handler of a software interrupt runs in the context of the interrupting task; the ISR of a hardware interrupt has no connected task context to run in. So, the OS provides a context (that

can be the context of the task that happened to be running at the time of the interrupt).

Hardware and software interrupts do share the same interrupt vector, but that vector then provides

separate ranges for hardware and software interrupts.

• Edge-triggered and level-triggered interrupts. From a hardware point of view, peripheral devices can transmit their interrupt signals in basically two different ways:

• Edge-triggered. An interrupt is sent when the interrupt line changes from low to high, or vice versa. That is an almost “zero time” event, which increases the chances for a hardware loss of interrupts by

the interrupt controller. Moreover, if multiple devices are connected to the same interrupt line, the

operating system must call all registered interrupt service routines (see Section 3.3), because

otherwise it could cause a software loss of an interrupt: even if it detected only one edge transition,

and its first ISR acknowledged the receipt of this interrupt, it could still be that it missed another

edge transition, so it can only be sure after it has given all ISRs the chance to work. But of course,

this is not an efficient situation.

• Level-triggered. An interrupt is signaled by the level on the hardware interrupt line: the line stays asserted as long as at least one device requests service. This not only lowers the chance of missing a transition, but it also allows a more efficient servicing of the

interrupts: each ISR that has serviced the interrupt will acknowledge its peripheral device, which

will take away its contribution to the interrupt line. So, the level will change again after the last

peripheral device has been serviced. And the operating system need not try all ISRs connected to the same hardware interrupt line.

• Interrupt controller. This is a piece of hardware that shields the operating system from the electronic details of the interrupt lines. Some controllers are able to queue interrupts, such that none of them gets

lost (up to a given hardware limit, of course). Some allow various ways of configuring priorities on the

different interrupts.

The 8259 (http://www.cast-inc.com/cores/c8259a/c8259a-x.pdf) Programmable Interrupt

Controller (PIC) is still a very common chip for this job on PC architectures, despite its age of more

than 25 years. PC builders usually use two PICs, since each one can cope with only eight interrupts.

But using more than one has to happen in a daisy chain, i.e., the interrupt output pin of the first PIC is

connected to an input pin of the second one; this introduces delays in the interrupt servicing. As

another disadvantage, PICs were not designed to be used in multiprocessor systems.

Higher-quality and/or SMP motherboards use the Advanced Programmable Interrupt

Controller (APIC). This is not just a single chip, but a small hardware system that manages

interrupts:

• Each CPU must have a “local APIC” with which it gets interrupts from the APIC system.


• The peripheral hardware connects its interrupt line to the I/O APIC. (There can be eight of them.) An I/O APIC then sends a signal to the local APIC of the CPU for which the interrupt is meant.

The APIC architecture is better than the PIC, because (i) it can have many more interrupts lines, hence

eliminating the need to share interrupts, (ii) it knows programmable interrupt priorities, (iii) it is faster

to program (only one machine instruction to the local APIC’s Task Priority Register (which is on the

CPU!), instead of two to the PIC, which in addition is not on the CPU) and (iv) it allows to work with

level-triggered interrupts instead of with edge-triggered interrupts. The PCI bus uses active low,

level-triggered interrupts, so it can work fine together with the APIC.

The PowerPC platforms have another interrupt hardware standard, the

OpenPIC (http://www.itis.mn.it/inform/materiali/evarchi/cyrix.dir/opnparc.htm), which also

guarantees a high hardware quality. OpenPIC also works with x86 architectures.

3.3. Interrupt software

From the software side, an interrupt-driven system must typically take into account one or more of the

following software issues:

• Interrupt Service Routine (ISR), often called interrupt handler tout court. This software routine is called when an interrupt occurs on the interrupt line for which the ISR has been registered in the

interrupt vector. Typically, this registration takes place through a system call to the operating system,

but it can also be done directly in a machine instruction, by a sufficiently privileged program. The

registration puts the address of the function to be called by the interrupt, in the address field provided

in the interrupt vector at the index of the corresponding interrupt number.

The operating system does not (or rather, cannot) intervene in the launching of the ISR, because

everything is done by the CPU. The context of the currently running task is saved on the stack of that

task: its address is in one of the CPU registers, and it is the only stack that the CPU has immediate and

automatic access to. This fact has influence on the software configuration of the system: each task

must get enough stack space to cope with ISR overhead. So, the worst-case amount of extra stack

space to be foreseen in a task’s memory budget can grow large, especially for systems in which

interrupts can be nested. More and more operating systems, however, provide a separate “context” for

interrupt servicing, shared by all ISRs; examples are Linux and VxWorks.

An ISR should be as short as possible, because it runs with interrupts disabled, which prevents other

interrupts from being serviced, and, hence, other tasks from proceeding. The ISR should service the

peripheral device it was triggered by, and then return. This servicing typically consists of reading or

writing some registers on the device, and buffer them in a place where some other task can process

them further, outside of the ISR and hence with interrupts enabled again. This further processing is the

goal of the DSR (Deferred Service Routine), Section 3.4. Getting the data from the ISR to the DSR

should be done in a non-blocking way; FIFOs (Section 5.2) or circular buffers (Section 5.4) are often

used for this purpose.
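Such a non-blocking, single-producer/single-consumer circular buffer between ISR and DSR could look as follows (illustration only; a real implementation would add the memory barriers appropriate for the platform):

```c
/* Lock-free circular buffer: the ISR is the only writer, the DSR the
 * only reader, so head and tail each have exactly one writer. */
#include <assert.h>

#define BUF_SIZE 16   /* power of two, so unsigned wraparound is safe */

static volatile unsigned head, tail;
static int buf[BUF_SIZE];

static int buf_put(int x)   /* called from the ISR */
{
    if (head - tail == BUF_SIZE) return 0;  /* full: drop the sample */
    buf[head % BUF_SIZE] = x;
    head++;                                 /* publish after the write */
    return 1;
}

static int buf_get(int *x)  /* called from the DSR */
{
    if (head == tail) return 0;             /* empty */
    *x = buf[tail % BUF_SIZE];
    tail++;
    return 1;
}
```

Neither side ever waits on the other: the ISR drops data when the buffer is full, and the DSR simply tries again later when it is empty, so no blocking call appears on the interrupt path.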


• Trap handler/service request. A synchronous interrupt is sometimes also called a trap (http://www.osdata.com/topic/language/asm/trapgen.htm) or a software interrupt. The software

interrupts are “called” by the processor itself, such as in the case of register overflow, page address

errors, etc. They work like hardware interrupts (saving state, switching to protected mode, jumping

to handler), but they run with the hardware interrupts enabled.

These software interrupts are very important, because they are the only means with which user space

tasks can execute “protected operations,” or “privileged instructions.” These privileged instructions

can only be executed when the processor is in its protected mode (also called privileged mode).

Privileged instructions are operations such as: to address physical IO directly; to work with the

memory management infrastructure such as the page lookup table; to disable and enable interrupts; or

to halt the machine. Privileged instructions are available to user space tasks via system calls, that are in

fact handlers of software interrupts: a system call puts some data in registers or on the stack, and then

executes a software interrupt, which makes the processor switch to protected mode and run the

interrupt handler. That handler can use the register or stack data for task specific execution. Recall that

the handler of a software interrupt runs in the context of the task that executes the system call, so it can

read the data that the task has put on the stack. But now the execution takes place in the protected

mode of the processor.

A system call is just one example of a software interrupt, or trap. An interrupt service routine of a trap

is often called a trap handler . Still another name for a software interrupt is a service request (SRQ).

Each type of CPU has part of its interrupt vector reserved for these trap handlers. Operating systems

typically have a default trap handler installed, which they attach to all possible software interrupts in

your system. Usually, you can replace any of these by your own. For example, RTAI has a rt_set_rtai_trap_handler() for this purpose. The OS also reserves a number of traps as system

signals. For example, RTAI reserves 32 signals, most of which correspond to what standard Linux uses.

Trap handlers are a major tool in debugging: compiling your code with the debug option turned on

results, among other things, in the introduction in (the compiled version of) your original code of

machine instructions that generate a trap after each line in your code. The ISR triggered by that trap

can then inform the debug task about which “breakpoint” in your program was reached, thanks to the register information that the trap has filled in (Section 3.3).

Another major application of the trap functionality, certainly in the context of this document, is their

use by RTAI to deliver user space hard real-time functionality (Section 11.5). Linux just uses one

single software interrupt, at address 0x80, for its user space system calls, leaving a lot of software

interrupts to applications such as RTAI.

• Interrupt latency. This is the time between the arrival of the hardware interrupt and the start of the execution of the corresponding ISR. The latency is not a crisp number, but rather a statistical quantity, because it is influenced by a large number of non-deterministic effects (Section 1.3.2). This becomes

more and more the case in modern processors with their multiple levels of caches and instruction

pipelines, that all might need to be reset before the ISR can start. This latter fact is at the origin of the



somewhat counter-intuitive phenomenon that some modern Gigahertz CPUs have longer interrupt

latencies than much older digital signal processors.

• Interrupt enable/disable. Each processor has atomic operations to enable or disable (“mask”) the interrupts. Common names for these functions are sti() (“set interrupt enable flag”, i.e., enable interrupts to come through to interrupt the CPU) and cli() (“clear interrupt enable flag”, i.e., don’t allow interrupts). In the same context, one finds functions like save_flags() and restore_flags(). These are a bit more fine-grained than sti() and cli(), in the sense that they save/restore a bit-sequence where each bit corresponds to a hardware interrupt line and indicates whether or not the interrupt on that particular line should be enabled. In other words, it saves the “state” of the interrupt vector. (restore_flags() in some cases does an implicit enabling of the interrupts too.) Note that cli() disables all interrupts on all processors in an SMP system. That is a costly approach to use, particularly so in an RTOS.

• Interrupt priorities. Some systems offer, as a hardware feature, (static) priorities to interrupts. That means that the OS blocks a new interrupt if an ISR of an interrupt with a higher priority is still

running. (Or rather, as long as it has not enabled the interrupts again.) Similarly, the ISR of a

lower-priority interrupt is pre-empted when a higher-priority interrupt comes in. Hence, ISRs must be

re-entrant. And, if the processor allows interrupt priorities, most opportunities/problems that are

known in task scheduling (see Section 2.4) show up in the interrupt handling too!

• Prioritized interrupt disable. Prioritized enabling/disabling of the interrupts is a software feature (that must have hardware support, of course) that allows the programmer to disable interrupts below a

specified priority level. Microsoft NT is an example of an OS kernel that extensively uses this feature.

• Interrupt nesting. If the processor and/or operating system allow interrupt nesting, then an ISR servicing one interrupt can itself be pre-empted by another interrupt (which could come from the same

peripheral device that is now being serviced!). Interrupt nesting increases code complexity, because

ISRs must use re-entrant code only, i.e., the ISR must be written in such a way that it is robust against

being pre-empted at any time.

• Interrupt sharing. Many systems allow different peripheral devices to be linked to the same hardware interrupt. The ISR servicing this interrupt must then be able to find out which device generated the

interrupt. It does this by (i) checking a status register on each of the devices that share the interrupt, or

(ii) calling in turn all ISRs that users have registered with this IRQ.

Interrupt sharing is implemented in most general purpose operating systems, hence also in the Linux

kernel. (See the file kernel/softirq.c.) Linux accepts multiple interrupt handlers on the same

interrupt number. The kernel hangs its own ISR on the hardware interrupt, and that kernel ISR invokes

one by one all the handler routines of the ISRs that have been registered by the application programs.

This means that they will be executed after the hardware ISR has finished, but before any other tasks,

and with interrupts enabled.
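This dispatching of a shared interrupt can be modelled as follows (a sketch with invented names, not the actual Linux code):

```c
/* Toy model of shared-IRQ dispatching: each registered handler checks
 * its own device's status and returns 1 if that device raised the
 * interrupt, 0 otherwise. All names are invented. */
#include <assert.h>
#include <stddef.h>

typedef int (*isr_t)(void *dev);

#define MAX_SHARED 8
static isr_t handlers[MAX_SHARED];
static void *devices[MAX_SHARED];
static int n_handlers;

static int register_shared_isr(isr_t h, void *dev)
{
    if (n_handlers == MAX_SHARED) return -1;
    handlers[n_handlers] = h;
    devices[n_handlers] = dev;
    n_handlers++;
    return 0;
}

/* The kernel's own ISR for the line: try every registered handler. */
static int dispatch_shared_irq(void)
{
    int serviced = 0;
    for (int i = 0; i < n_handlers; i++)
        serviced += handlers[i](devices[i]);
    return serviced;   /* how many devices were actually serviced */
}

/* Demo "device": its handler services it when the status flag is set. */
static int demo_flag;
static int demo_handler(void *dev)
{
    int *flag = dev;
    if (!*flag) return 0;
    *flag = 0;          /* acknowledge the device */
    return 1;
}
```

The loop over all handlers is exactly why sharing costs determinism: every registered handler runs, whether or not its device was the one interrupting.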

While the Linux kernel does interrupt sharing as mentioned in the previous paragraph, RTLinux and

RTAI don’t: they allow only one single ISR per IRQ, in order to be as deterministic as possible. (So,

be careful when putting interface cards in your computer, because all the ones for which you want to

install real-time drivers must be connected to different interrupt lines!) The real-time ISR that a user

program has registered is directly linked to the hardware interrupt, and hence runs with all interrupts

disabled. The other ISRs on that same IRQ are only executed when the non-real-time Linux kernel on

top of the RTOS gets the occasion to run, i.e., after all real-time activity, including non-ISR activity, is done.



3.4. ISR, DSR and ASR

An ISR should be as short as possible, in order to minimize the delay of interrupts to other ISRs, and the

scheduling of tasks. In general-purpose operating systems, only the ISR that the OS has attached to each

IRQ runs with interrupts disabled, but not the user-registered ISRs. A real real-time operating system, on

the other hand, allows only one ISR per IRQ; otherwise, the time determinism of the other ISRs is not

guaranteed! This makes the job for real-time programmers a bit easier, because they can design

non-re-entrant (and hence often simpler and faster) ISRs: the ISR can store local information without the

danger that it can be overwritten by another invocation of the same ISR code, and it has the guarantee of

atomicity (i.e., the ISR will run without being pre-empted). However, when the OS and the hardware allow interrupt priorities, the ISR at one IRQ level can be pre-empted by a higher-priority interrupt.

Typically, a hardware ISR just reads or writes the data involved in the communication with the peripheral

device or the trap that caused the interrupt, acknowledges the interrupt if the peripheral device requires it,

and then, if needed, wakes up another “task” to do any further processing. For example, most drivers for

network cards just transfer the raw packet data to or from the card in the ISR, and delegate all

interpretation of the data to another task. The skeleton of a typical ISR-DSR combination would look

like this:

dsr_thread()
{
    while (1) {
        wait_for_signal_from_isr();
        process_data_of_ISR();    // including all blocking stuff
    }
}

interrupt_handler()
{
    reset_hardware();
    do_isr_stuff();
    send_signal_to_wake_up_dsr();
    re_enable_interrupts();       // some RTOSs do this automatically
}

In the Linux kernel, this latter task used to be a bottom half, while the hardware interrupt-driven ISR was called the top half. (Note that some operating systems use opposite terminology.) The bottom half concept is more or less abandoned, and replaced by tasklets and softirqs (see the files include/linux/interrupt.h and kernel/softirq.c). The reason for abandoning the bottom halves is that Linux has a hard limit of maximum 32 bottom-half functions. Moreover, they run with locks over the whole system, which is not very good for multi-processor systems. The softirq was introduced in the 2.3.43 kernel, as a multi-processor-aware version of the bottom half; there are still only 32 of them, so application programmers should stay away from them, and use tasklets instead.

(Tasklets are a very appropriate primitive in the context of interrupt servicing, but their usefulness is in no way limited to only this context!)

“Tasklet” is a bit of an unfortunate name, because it has not much to do with schedulable tasks: a tasklet

is a function that the kernel calls when an ISR has requested a “follow-up” of its interrupt servicing.

Outside of the Linux world, this follow-up function is more often called DSR, Deferred Service

Routine, or (in Microsoft NT), Deferred Processing Call. In Linux, an unlimited number of tasklets is

allowed, and they have the same behaviour and functionality as the softirqs.

In Linux the ISR requests the execution of a tasklet/DSR via the tasklet_schedule(&tasklet) command. The tasklet has first to be initialized with a tasklet_init(&tasklet, tasklet_function, data); this call links a tasklet identifier with the function to be executed and a data structure in which the ISR can store information for processing by the tasklet. The tasklet (or DSR, or softirq) runs with interrupts enabled, but outside of the context of a particular task, just as the ISR that has requested it. This means that neither the ISR, nor the DSR can use variables that have been defined locally in the scope of the task(s) to which they logically are related. The execution of tasklets is implemented in the kernel/softirq.c file, and both tasklets and softirqs are treated as softirq tasks.

Linux (and many other general purpose operating systems) executes the DSRs in sequence (without mutual pre-emption, that is), at the end of hardware ISRs and before the kernel returns to user space. So, at the end of each kernel call, the scheduler checks whether some DSRs are ready to be executed; see the kernel/softirq.c file in the Linux source code.

RTAI also has tasklets, and their semantics is more or less like the Linux tasklets. However, RTAI added some extra features (see the files include/rtai_tasklets.h and tasklets/tasklets.c):

• There is a special class of tasklets, called timers. They can be used to let context-independent functions run with specified timings.

• RTAI also allows a user space function to be executed as a tasklet.

• The RTAI tasklets are executed by a dedicated task and/or ISR in the RTAI kernel.

An ISR is not allowed to use semaphores or any other potentially blocking system calls: an ISR that

blocks on a lock held by another task causes big trouble, because all interrupts are disabled when the ISR

runs, such that the condition to wake up the other task might never occur. The same holds for RTAI

tasklets: a blocking tasklet also blocks the timer ISR or task that executes all tasklets.

Avoiding blocking calls is sufficient for maximum determinism in a UP (“uni-processor”) system. In a multi-processor system, however, a race condition (see Section 4.2) can occur between the hardware ISR on one processor, and any other task on one of the other processors; e.g., because the ISR and the other task access shared data. (Remember that the ISR cannot use a lock!) The easiest solution is to not only mask the interrupts for one processor, but for all of them. This, however, prevents all processors from working. One way around this are spinlocks (see Section 4.6.3). The operating system also helps a bit, by guaranteeing that tasklets are serialized over all processors in the system; i.e., only one is executed at a time.

Some operating systems have one more level of interrupt servicing: besides the ISR and DSR functions,

they offer the possibility to use Asynchronous Service Routines (ASR). (This name is not as standardized

as ISR and DSR.) In Microsoft NT, it is called an Asynchronous Procedure Call; eCos calls it DSR;

Linux doesn’t have the concept. ASRs can run after all DSRs have finished, but before normal tasks get a

chance to be scheduled. Their goal is to execute that part of the reaction to an interrupt, that needs the

thread’s context; for example, to make the thread stop some of its activities, including itself.

The eCos operating system executes an ASR with interrupts enabled, with the scheduler disabled, and

always in the context of one specific thread. So, the ASR can call all system functions, which is not the

case for ISR and DSR, which are not bound to a deterministically defined context.

The RTAI operating system gives the possibility to add to each real-time task a user-defined function that runs in the task’s context and with interrupts disabled, every time that the task gets scheduled (hence, not just when an interrupt has occurred). This allows, for example, an interrupt servicing to indirectly change some task-specific attributes at each scheduling instant. This user-defined function is called “signal()” and is filled in by rt_task_init() (XXX ???) in the task data structure. However, it’s just a pointer to a function, so it could be filled in or changed on-line.

Chapter 4. IPC: synchronization

The decision about what code to run next is made by the operating system (i.e., its scheduler), or by the

hardware interrupts that force the processor to jump to an associated interrupt routine. To the scheduler

of the OS, all tasks are just “numbers” in scheduling queues; and interrupts “talk” to their own interrupt

service routine only. So, scheduler and interrupts would be sufficient organizational structure in a system

where all tasks just live next to each other, without need for cooperation. This, of course, is not sufficient

for many applications. For example, an interrupt service routine collects measurements from a peripheral

device, this data is processed by a dedicated control task, the results are sent out via another peripheral

device to an actuator, and displayed for the user by still another task.

Hence, the need exists for synchronization of different tasks (What is the correct sequence and timing to

execute the different tasks?), as well as for data exchange between them. Synchronization and data

exchange are complementary concepts, because the usefulness of exchanged data often depends on the

correct synchronization of all tasks involved in the exchange. Both concepts are collectively referred to

as Interprocess communication (“IPC”).

The role of the operating system in matters of IPC is to offer a sufficiently rich set of IPC-supporting

primitives. These should allow the tasks to engage in IPC without having to bother with the details of

their implementation and with hardware dependence. This is not a minor achievement of the operating

system developers, because making these IPC primitives safe and easy to use requires a lot of care and

insight. In any case, the current state-of-the-art in operating systems’ IPC support is such that they still don’t offer much more than just primitives. Hence, programmers have to know how to apply these primitives appropriately when building software systems consisting of multiple concurrent tasks; this often remains a difficult, because error-prone, design and implementation job. Not in the least because no one-size-fits-all solution can exist for all application needs.

4.1. IPC terminology

The general synchronization and data exchange problems involve (at least) two tasks, which we will call the “sender” and the “receiver”. (These tasks are often also called “writer” and “reader”, or “producer” and “consumer”.) For synchronization, “sender” and “receiver” want to make sure they are both in (or not in) specified parts of their code at the same time. For data exchange, “sender” and “receiver” want to make sure they can exchange data efficiently, without having to know too much of each other (“decoupling”, Chapter 14), and according to several different policies, such as blocking/non-blocking, or with/without data loss.

Data exchange has a natural direction of flow, and, hence, the terminology “sender” and “receiver” is appropriate. Synchronization is often without natural order or direction of flow, and, hence, the terminology “sender” and “receiver” is less appropriate in this context, and “(IPC) client” might be a more appropriate because symmetric terminology. Anyway, the exact terminology doesn’t matter too much. Unless we want to be more specific, we will use the generic system calls send() and receive() to indicate the IPC primitives used by sender and receiver, respectively.


4.1.1. Blocking/Non-blocking

IPC primitives can have different effects on task scheduling:

• Blocking. When executing the send() part of the IPC, the sender task is blocked (i.e., non-available for scheduling) until the receiver has accepted the IPC in a receive() call. And similarly the other way around. If both the sender and the receiver block until both of them are in their send() and receive() commands, the IPC is called synchronous. (Other names are: rendez-vous, or handshake.) Synchronous IPC is the easiest to design with, and is very similar to building hardware systems.

• Non-blocking (asynchronous). Sender and receiver are not blocked in their IPC commands. This means that there is incomplete synchronization: the sender doesn’t know when the receiver will get its message, and the receiver cannot be sure the sender is still in the same state as when it sent the message.

• Blocking with time out. The tasks wait in their IPC commands for at most a specified maximum amount of time.

• Conditional blocking. The tasks block in their IPC commands only if a certain condition is fulfilled.

Of course, blocking primitives should be used with care in real-time sections of a software system.

4.1.2. Coupling

IPC primitives can use different degrees of coupling:

• Named connection: sender and receiver know about each other, and can call each other by name. That means that the sender fills in the unique identifier of the receiver in its send() command, and vice versa. This can set up a connection between both tasks, in a way very similar to the telephone system, where one has to dial the number of the person one wants to talk to.

The connection can be one-to-one, or one-to-many (i.e., the single sender sends to more than one receiver, such as for broadcasting to a set of named correspondents), or many-to-one (for example, many tasks send logging commands to an activity logging task), or many-to-many (for example, video conferencing).

• Broadcast: the sender sends its message to all “listeners” (without explicitly calling them by name) on the (sub-branch of the) network to which it is connected. The listeners receive the message if they want, without the sender knowing exactly which tasks have really used its message.

• Blackboard: while a broadcast is a message on a network-like medium (i.e., the message is not stored in the network for later use), a blackboard IPC stores the messages from different senders. So, receivers can look at them at any later time.

• Object request broker (ORB): the previous types of IPC all imply a rather high level of coupling between sender and receiver, in the sense that they have to know explicitly the identity of their communication partner, of the network branch, or of the blackboard. The current trend towards more distributed and dynamically reconfigurable computer systems calls for more loosely-coupled forms of IPC. The ORB concept has been developed to cover these needs: a sender component registers its interface with the ORB; interested receivers can ask the broker to forward their requests to an appropriate sender (“server”) component, without the need to know its identity, nor its address.

4.1.3. Buffering

IPC primitives can use different degrees of buffering, ranging from the case where the operating system

stores and delivers all messages, to the case where the message is lost if the receiver is not ready to

receive it.

Not all of the above-mentioned forms of IPC are equally appropriate for real-time use, because some

imply too much and/or too indeterministic overhead for communication and resource allocation.

4.2. Race conditions and critical sections

Often, two or more tasks need access to the same data or device, for writing and/or reading. The origin of most problems with resource sharing (or resource allocation) in multi-tasking and multi-processor systems is the fact that operations on resources can usually not be performed atomically, i.e., as if they were executed as one single, non-interruptable instruction that takes zero time. Indeed, a task that interfaces with a resource can at any instant be pre-empted, and hence, when it gets re-scheduled again, it cannot just take for granted that the data it uses now is in the same state (or at least, a state that is consistent with the state) it was in before the pre-emption. Consider the following situation:

data number_1;
data number_2;

task A
{
    data A_number;
    A_number = read(number_1);
    A_number = A_number + 1;
    write(number_2, A_number);
}

task B
{
    if ( read(number_1) == read(number_2) )
        do_something();
    else
        do_something_else();
}

task B takes different actions based on the (non-)equality of number_1 and number_2. But task B can be pre-empted in its if statement by task A, exactly at the moment that task B has already read number_1, but not yet number_2. This means that it has read number_1 before the pre-emption, and number_2 after the pre-emption, which violates the validity of the test.

The if statement is one example of a so-called critical section: it is critical to the validity of the code that the access to the data used in that statement (i.e., number_1 and number_2) be executed atomically, i.e., un-interruptable by anything else. (Most) machine code instructions of a given processor execute atomically; but instructions in higher-level programming languages are usually translated into a sequence of many machine code instructions, such that atomicity cannot be guaranteed.

There are three generic types of critical sections:

• Access to the same data from different tasks, as illustrated by the example above.

• Access to a service. For example, allocation of a resource, or execution of a “transaction” on a database. The service typically has to process a sequence of queries, and these have to succeed as a whole, or fail as a whole.

• Access to procedure code. Application tasks often run exactly the same code (for example the control algorithm in each of the joints of a robot), but on other data, and some parts of that code should be executed by one task at a time only.

Of course, many applications involve combinations of different resource sharing needs.

The problem in all above-mentioned examples of access to shared resources is often called a race condition: two or more tasks compete (“race”) against each other to get access to the shared resources. Some of these race conditions have been given a special name:

• Deadlock. Task A has locked a resource and is blocked waiting for a resource that is locked by task B, while task B is blocked waiting for the resource that is locked by task A.

• Livelock. This situation is similar to the deadlock, with this difference: both tasks are not blocked but are actively trying to get the resource, in some form of busy waiting.

• Starvation. In this situation, some tasks never get the chance to allocate the resource they require, because other tasks always get priority.

The four conditions that have to be satisfied in order to (potentially!) give rise to a deadlock are:

1. Locks are only released voluntarily by tasks. So, a task that needs two locks might obtain the first lock, but block on the second one, so that it is not able anymore to voluntarily release the first lock.

2. Tasks can only get in a deadlock if they need more than one lock, and have to obtain them in a (non-atomic) sequential order.

3. The resources guarded by locks can only be allocated to one single task. (Or to a finite number of tasks.)

4. Tasks try to obtain locks that other tasks have already obtained, and these tasks form a circular list. For example, task A is waiting for task B to release a lock, task B is waiting for task C to release a lock, and task C is waiting for task A.


As soon as one of these four conditions is not satisfied, a deadlock cannot occur. Moreover, these conditions are not sufficient for deadlocks to occur: they just describe the conditions under which it is possible to have deadlocks.

The literature contains many examples of deadlock prevention and avoidance algorithms. Deadlock prevention makes sure that the four necessary conditions are never satisfied at the same time; deadlock avoidance allows the possibility for a deadlock to occur, but makes sure that this possibility is never realized. Both kinds of algorithms, however, often require some form of “global” knowledge about the states of all tasks in the system. Hence, they are too indeterministic for real-time execution, and not suitable for component-based design (because the requirement for global knowledge is in contradiction with the loose coupling strived for in component systems; see Chapter 14).

There are some guaranteed deadlock prevention algorithms that are reasonably simple to implement. For example, a deadlock cannot occur if all programs always take locks in the same order. This requires a globally known and ordered list of locks, and coding discipline from the programmers. Other prevention algorithms use some of the following approaches: only allow each task to hold one resource; pre-allocate resources; force release of a resource before a new request can be made; order all tasks and give them priority according to that order.

Race conditions occur on a single processor system because of its multi-tasking and interrupt

functionalities. But they show up even more on multi-processor systems: even if one CPU is preventing

the tasks that it runs from accessing a resource concurrently, a task on another CPU might interfere.

4.3. Signals

Signals are one of the IPC synchronization primitives used for asynchronous notification: one task fires a signal, which can cause other tasks to start doing things. The emphasis is on “asynchronous” and on “can”:

• Asynchronous: the tasks that react to signals are in a completely arbitrary state, unrelated with the signaling task. Their reaction to the signal also need not be instantaneous, or synchronized, with the signaling task. The task that sends the signal, and the tasks that use the signal, need not share any memory, as in the case of semaphores or mutexes. This makes signals about the only synchronization primitive that is straightforward to scale over a network.

• Can: the signaling task fires a signal, and continues with its job. Whether or not other tasks do something with its signal is not of its concern. The operating system takes care of the delivery of the signal, and if nobody wants it, it is just lost.

In most operating systems, signals

• are not queued. A task’s signal handler has no means to detect whether it has been signaled more than once.

• carry no data.

• have no deterministic delivery time. A task that gets signaled is not necessarily scheduled immediately.

• have no deterministic order. A task that gets signaled multiple times has no way to find out in which temporal order the signals were sent.

So, these are reasons to avoid signals for synchronization between two running tasks, [BrinchHansen73].

In other words: notification in itself is not sufficient for synchronization. Synchronization needs two tasks

that do something together, while taking notice of each other, and respecting each other’s activities. Later

sections of the text present IPC primitives that are better suited for synchronization than signals.

POSIX has standardized signals and their connection to threads. The OS offers a number of pre-defined signals (such as “kill”), and a task can ask the operating system to connect a handler (i.e., a function) to a particular signal on its behalf. The handler is “registered”, using the sigaction() system call. The task also asks the OS to receive or block a specific subset of all available signals; this is its “signal mask”. Whenever a signal is received by the operating system, it executes the registered handlers of all tasks that have this signal in their mask. The task can also issue a sigwait(signal), which makes it sleep until the signal is received; in this case, the signal handler is not executed. Anyway, signals are a bit difficult to work with, as illustrated by this quote from the sigwait man page:

    For sigwait to work reliably, the signals being waited for must be blocked in all threads, not only in the calling thread, since otherwise the POSIX semantics for signal delivery do not guarantee that it’s the thread doing the sigwait that will receive the signal. The best way to achieve this is block those signals before any threads are created, and never unblock them in the program other than by calling sigwait.

The masks are also set on a per-thread basis, but the signal handlers are shared between all threads in a process. Moreover, the implementation of signals tends to differ between operating systems, and the POSIX standard leaves room for interpretation of its specification. For example, it doesn’t say anything about the order in which blocked threads must be woken up by signals. So, these are reasons why many developers don’t use signals too much.

POSIX has a specification for so-called “real-time signals” too. Real-time signals are queued, they pass a 4-byte data value to their associated signal handler, and they are guaranteed to be delivered in numerical order, i.e., from lowest signal number to highest. For example, RTLinux implements POSIX real-time signals, and offers 32 different signal levels. (See the file include/rtl_sched.h in the RTLinux source tree.) And RTAI also offers a 32 bit unsigned integer for events, but in a little different way: the integer is used to allow signalling multiple events: each bit in the integer is an event, and a task can ask to be notified when a certain AND or OR combination of these bits becomes valid. (See the file include/rtai_bits.h in the RTAI source tree.)

4.4. Exceptions

Exceptions are signals that are sent (“raised”) synchronously, i.e., by the task that is currently running. (Recall that signals are asynchronous, in the sense that a task can receive a signal at any arbitrary moment in its lifetime.) Exceptions are, roughly speaking, a signal from a task to itself. As operating system primitive, an exception is a software interrupt (see Section 3.1) used to handle non-normal cases in the execution of a task: numerical errors; devices that are not reachable or deliver illegal messages; etc. The software interrupt gives rise to the execution of an exception handler, that the task (or the operating system, or another task) registered previously. In high-level programming languages, an exception need not be a software interrupt, but it is a function call to the language’s runtime support, that will take care of the exception handling.

4.5. Atomic operations

The concept of an atomic operation is very important in interprocess communication, because the operating system must guarantee that taking or releasing a lock is done without interruption. That can only be the case if the hardware offers some form of atomic operation on bits or bytes. Atomic operations come in various forms: in the hardware, in the operating system, in a language’s run-time, or in an application’s support library, but always, the hardware atomic operation is at the bottom of the atomic service. This Section focuses on the hardware support that is commonly available.

Most processors offer an atomic machine instruction to test a bit (or a byte or a word). In fact, the operation not just tests the bit, but also sets the bit if that bit has not already been set. Hence, the associated assembly instruction is often called test_and_set(), or something similar. Expressed in pseudo-code, the test_and_set() would look like this:

int test_and_set(int *lock)
{
    int temp = *lock;
    *lock = 1;
    return temp;
}

Another atomic instruction offered by (a fewer number of) processors is compare_and_swap(address,old,new): it compares a value at a given memory address with an “old” value given as parameter, and overwrites it with a “new” value if the compared values are the same; in this case, it returns “true”. If the values are not equal, nothing is written and “false” is returned. Examples of processors with a compare_and_swap() are the Alpha, ia32/ia64, SPARC and the M68000/PowerPC. (Look in the Linux source tree for the macro __HAVE_ARCH_CMPXCHG to find them.)

The compare_and_swap() operation is appropriate for the implementation of the synchronization needed in, for example, swinging pointers (see Section 4.10): in this case, the parameters of the compare_and_swap(address,old,new) are the address of the pointer and its old and new values. The following pseudo-implementation is simplest to understand the semantics of the compare_and_swap():

int compare_and_swap(address, old, new)
{
    get_lock();
    if (*address == old) {
        *address = new;
        release_lock();
        return (1);
    } else {
        release_lock();
        return (0);
    }
}

The compare_and_swap() can, however, be implemented without locks, using the following pair of atomic instructions: load_linked() and store_conditional(). Together, they implement an atomic read-modify-write cycle. The idea is that the load_linked() instruction marks a memory location as “reserved” (but does not lock it!), and if no processor has tried to change the contents of that memory location when the store_conditional() takes place, the store will succeed, otherwise it will fail. If it fails, the calling task must decide what to do next: retry, or do something else.

This pair of instructions can be used to implement compare_and_swap() in an obvious way, and without needing a lock:

int compare_and_swap(address, old, new)
{
    temp = load_linked(address);
    if (old == temp) return store_conditional(address, new);
    else return (0);
}

The old == temp test need not take place in a critical section, because both arguments are local to this single task.

There are some important caveats with the compare_and_swap() function:

• It only compares the values at a given memory location, but does not detect whether (or how many times) this value has changed! That is: a memory location can be changed twice and have its original value back. To overcome this problem, a more extensive atomic operation is needed, the double_word_compare_and_swap(), which also checks a tag attached to the pointer, and that increments the tag at each change of the value of the pointer. This operation is not very common in processors!

• It is not multi-processor safe: (TODO: why exactly?)

While the hardware support for locks is quite satisfactory, there is no support for transaction rollback. Transaction rollback means that the software can undo the effects of a sequence of actions, in such a way that the complete sequence takes place as a whole, or else is undone without leaving any trace. Transaction rollback is a quite advanced feature, and not supported by operating systems; it’s however a primary component of high-end database servers.


4.6. Semaphore, mutex, spinlock, read/write lock, barrier

Race conditions can occur because the access to a shared resource is not well synchronized between

different tasks. One solution is to allow tasks to get a lock on the resource. The simplest way to lock is to

disable all interrupts and disable the scheduler when the task wants the resource. This is certainly quite

effective for the running task, but also quite drastic and far from efficient for the activity of all other

tasks. Hence, programmers should not use these methods lightly if they want to maintain real

multi-tasking in the system. So, this text focuses on locking mechanisms that do not follow this drastic

approach. Basically, programmers can choose between two types of locking primitives (see later sections

for more details):

1. One based on busy waiting. This method has overhead due to wasting CPU cycles in the busy

waiting, but it avoids the overhead due to bookkeeping of queues in which tasks have to wait.

2. One based on the concept of a semaphore. This method has no overhead of wasting CPU cycles, but

it does have the overhead of task queue bookkeeping and context switches.

A generic program that uses locks would look like this:

data number_1;

data number_2;

lock lock_AB;

task A

{ data A_number;

get_lock(lock_AB);

A_number = read(number_1);

A_number = A_number + 1;

write(number_2,A_number);

release_lock(lock_AB);

}

task B

{ get_lock(lock_AB);

i = ( read(number_1) == read(number_2) );

release_lock(lock_AB);

if ( i ) do_something();

else do_something_else();

}

}

The get_lock() and release_lock() function calls do not belong to any specific programming language, library or standard. They have just been invented for the purpose of illustration of the idea. When either task A or task B reaches its so-called critical section, it requests the lock; it gets the lock if the lock is not taken by the other task, and can enter the critical section; otherwise, it waits (“blocks”, “sleeps”) till the other task releases the lock at the end of its critical section. A blocked task cannot be scheduled for execution, so locks are to be used with care in real-time applications: the application programmer should be sure about the maximum amount of time that a task can be delayed because of locks held by other tasks; and this maximum should be less than specified by the timing constraints of the system.

The get_lock() should be executed atomically, in order to avoid a race condition when both tasks try to get the lock at the same time. (Indeed, the lock is in this case an example of a shared resource, so locking is prone to all race conditions involved in allocation of shared resources.) The atomicity of getting a lock seems to be a vicious circle: one needs a lock to guarantee atomicity of the execution of the function that must give you a lock. Of course, (only) the use of an atomic machine instruction can break this circle. Operating systems implement the get_lock() function by means of an atomic test_and_set() machine instruction (see Section 4.5) on a variable associated with the lock.

Another effective (but not necessarily efficient!) implementation of a lock is as follows (borrowed from
the Linux kernel source code):

int flags;

save_flags(flags);     // save the state of the interrupt vector
cli();                 // disable interrupts

// ... critical section ...

restore_flags(flags);  // restore the interrupt vector to
                       // its original state
sti();                 // enable interrupts

(Note that, in various implementations, restore_flags() implicitly uses sti().)

The implementation described above is not always efficient because: (i) in SMP systems the cli() turns
off interrupts on all CPUs (see Section 3.1), and (ii) if a test_and_set() can do the job, one should
use it, because the disabling of the interrupts and the saving of the flags generate a lot of overhead.

The lock concept can easily lead to unpredictable latencies in the scheduling of a task: the task can sleep

while waiting for a lock to be released; it doesn’t have influence on how many locks other tasks are

using, how deep the locks are nested, or how well-behaved other tasks use locks. Both tasks involved in a

synchronization using a lock have (i) to agree about which lock they use to protect their common data (it

must be in their common address space!), (ii) to be disciplined enough to release the lock, and (iii) to

keep the critical section as short as possible. Hence, the locks-based solution to access or allocation

constraints is equally indirect and primitive as the priority-based solution to timing constraints: it doesn’t

protect the data directly, but synchronizes the code that accesses the data. As with scheduling priorities,

locks give disciplined(!) programmers a means to reach deterministic performance measures. But even

discipline is not sufficient to guarantee consistency in large-scale systems, where many developers work

more or less independently on different parts.

Locks are inevitable for task synchronization, but for some common data exchange problems there exist

lock-free solutions (see Section 4.10). The problem with using locks is that they make an application
vulnerable to the priority inversion problem (see Section 4.8). Another problem occurs when the CPU
on which the task holding the lock is running suddenly fails, or when that task enters a trap and/or
exception (see Section 3.1), because then the lock is not released, or, at best, its release is delayed.

4.6.1. Semaphore

The name “semaphore” has its origin in the railroad world, where it was the (hardware) signal used to

(dis)allow trains to access sections of the track: when the semaphore was lowered, a train could proceed

and enter the track; when entering, the semaphore was raised, preventing other trains from entering;

when the train in the critical section left that section, the semaphore was lowered again.

Edsger Dijkstra introduced the semaphore concept in the context of computing in 1965, [Dijkstra65]. A
semaphore is an integer number (initialized to a positive value), together with a set of function calls to
count up() and down(). The POSIX names for down() and up() are sem_wait() and sem_post();
this text uses the generic name sem_signal() for up(). POSIX also introduces the non-blocking
function sem_trywait() (same as sem_wait() but, instead of blocking, the state of the semaphore is
given in the function’s return value).

A task that executes a sem_wait() decrements the count, and blocks if the count is zero or negative.
The count is incremented when a task executes a sem_signal(); if this makes the semaphore value
non-negative again, the semaphore unblocks one of the tasks that were blocking on it.

So, the number of tasks that a semaphore allows to pass without blocking is equal to the positive number

with which it is initialized; the number of blocked tasks is indicated by the absolute value of a negative

value of the semaphore count.

The semaphore S must also be created (sem_init(S,initial_count)) and deleted
(sem_destroy(S)) somewhere. The initial_count is the number of allowed holders of the
semaphore lock. Usually, that number is equal to 1, and the semaphore is called a binary semaphore.
The general case is called a counting semaphore. Most operating systems offer both, because their
implementations differ only in the initialization of the semaphore’s count.

From an implementation point of view, the minimum data structure of a semaphore has two fields:

struct semaphore {

int count; // keeps the counter of the semaphore.

queue Q; // lists the tasks that are blocked on the semaphore.

}

And (non-atomic!) pseudo code for sem_wait() and sem_signal() (for a binary semaphore)
basically looks like this (see, for example, upscheduler/rtai_sched.c of the RTAI code tree for
more detailed code):

semaphore S;

sem_wait(S)
{ if (S.count > 0) then S.count = S.count - 1;
  else block the task in S.Q;
}

sem_signal(S)
{ if (S.Q is non-empty) then wakeup a task in S.Q;
  else S.count = S.count + 1;
}

So, at each instant in time, a negative S.count indicates the fact that at least one task is blocked on the
semaphore; the absolute value of S.count gives the number of blocked tasks. (Note that the simplified
pseudo code above never lets the count go negative; there, the number of blocked tasks is the length of
S.Q.)

The semantics of the semaphore as a lock around a critical section is exactly as in its historical railway
inspiration. However, a semaphore can also be used for different synchronization goals: if task A just
wants to synchronize with task B (irrespective of the fact whether or not it needs to exclude task B
from entering a shared piece of code), both tasks can use the sem_wait() and sem_signal() function
calls.

Here is a pseudo code example of two tasks task A and task B that synchronize their mutual job by
means of a semaphore:

semaphore S;

task A:                       task B:
main()                        main()
{ ...                         { ...
  do_first_part_of_job();       do_something_else_B();
  sem_signal(S);                sem_wait(S);
  do_something_else_A();        do_second_part_of_job();
  ...                           ...
}                             }

Finally, note that a semaphore is a lock for which the normal behaviour of the locking task is to go to

sleep. Hence, this involves the overhead of context switching, so don’t use semaphores for critical

sections that should take only a very short time; in these cases spinlocks are a more appropriate choice

(Section 4.6.3).

4.6.2. Mutex

A mutex (MUTual EXclusion) is often defined as a synonym for a binary semaphore. However, binary

semaphore and mutex have an important semantic distinction: a semaphore can be “signaled” and

“waited for” by any task, while only the task that has taken a mutex is allowed to release it. So, a mutex

has an owner, as soon as it has been taken. This semantics of a mutex corresponds nicely to its envisaged

use as a lock that gives only one task access to a critical section, excluding all others. That is, the task

entering the critical section takes the mutex, and releases it when it exits the critical section. When

another task tries to take the mutex when the first one still holds it, that other task will block. The
operating system unblocks one waiting task as soon as the first task releases the mutex. This mutually
exclusive access to a section of the code is often also called serialization.

A POSIX mutex, for example, is a (counting) semaphore with priority inheritance implied (see Section

4.9). The basic POSIX API for mutexes is:

pthread_mutex_t lock;

// Initialise mutex object:
int pthread_mutex_init(
      pthread_mutex_t *mutex,
      const pthread_mutexattr_t *mutex_attr
    );
// Destroy mutex object:
int pthread_mutex_destroy(pthread_mutex_t *mutex);
// Non-blocking mutex lock:
int pthread_mutex_trylock(pthread_mutex_t *mutex);
// Blocking mutex lock:
int pthread_mutex_lock(pthread_mutex_t *mutex);
// Mutex unlock:
int pthread_mutex_unlock(pthread_mutex_t *mutex);

A recursive mutex (or recursive semaphore) is a mutex that can be locked repeatedly by the owner.
Otherwise, the thread that holds a mutex and would try to take the mutex again would lock itself, hence
leading to a deadlock. This recursive property is useful for complex mutual exclusion situations, such as
in monitors (Section 15.2).

The POSIX API requires to indicate explicitly, via the mutex attribute object, that a mutex should be
recursive:

pthread_mutexattr_settype(&mutex_attr, PTHREAD_MUTEX_RECURSIVE);

Some operating systems (e.g., VxWorks) use the recursive mutex mode as the default. Some offer a
so-called fast mutex: such a mutex is locked and unlocked in the fastest manner possible on the given
operating system (i.e., it doesn’t perform any error checks). A fast mutex can only be locked one single
time by pthread_mutex_lock(), and all subsequent calls cause the calling thread to block until the
mutex is freed; also the thread that holds the mutex is locked, which causes a deadlock. So, be careful
with using fast mutexes.

Many programmers tend to think that a semaphore is necessarily a more primitive RTOS function than a

mutex. This is not necessarily so, because one can implement a counting semaphore with a mutex and a

condition variable (Section 4.7):

int sem_wait(sem_t *sem)
{ pthread_mutex_lock(&sem->mutex);
  while (sem->count == 0) pthread_cond_wait(&sem->cond, &sem->mutex);
  sem->count--;
  pthread_mutex_unlock(&sem->mutex);
  return(0);
}

4.6.3. Spinlocks

A “spinlock” is the appropriate lock mechanism for multi-processor systems, and for use in all kinds of
contexts (kernel call, interrupt service routine, etc.). They are phasing out the use of “hard” exclusion
methods such as cli() and sti(), because: (i) these are too “global”, in the sense that they don’t
specify the context in which the lock is needed; (ii) it is usually not necessary to disable interrupts in
order to protect two tasks from entering a critical section. However, you can not do all kinds of things
when running inside a critical section locked by a spinlock! For example, do nothing that can take a
“long” time, or that can sleep. Use semaphores or mutexes for this kind of locks.

The task that wants to get a spinlock tries to get a lock that is shared by all processors. If it doesn’t get
the lock, it keeps trying (“busy waiting”) till it succeeds:

int spinlock(spinlock_t l){
  while (test_and_set(l)) {};
};

So, it’s clear why you shouldn’t do things that take a long time within a spinlock context: another task
could be busy waiting for you all the time! An example of a spinlock in the Linux kernel is the “Big
Kernel Lock” (Section 1.2): the BKL is a recursive spinlock, i.e., it can be locked multiple times
recursively. That means that you (possibly in two nested function calls) can lock it twice in a row, but
you also have to release it twice after that.

Spinlocks come in three versions:

1. spin_lock and spin_unlock: the classical mutual exclusion version, allowing interrupts to occur
while in the critical section.

2. spin_lock_irq and spin_unlock_irq: as above, but with interrupts disabled.

3. spin_lock_irqsave and spin_unlock_irqrestore: as above, but saving the current state flag
of the processor.

All of them work on (the address of) variables of the type spinlock_t. One should call
spin_lock_init() before using the lock. The spinlock versions that disable interrupts do not disable
interrupts on the other CPUs than the one the calling task is running on, in order not to bring down the
throughput of the whole multi-processor system. An example (Linux specific!) of the usage (not the
implementation!) of a spinlock with local interrupt disabling is given here:

spinlock_t l = SPIN_LOCK_UNLOCKED;
unsigned long flags;


spin_lock_irqsave(&l, flags);
/* critical section ... */
spin_unlock_irqrestore(&l, flags);

So, both the concurrency and the multi-processor issues are dealt with. On a uni-processor system, this
should translate into:

unsigned long flags;

save_flags(flags);
cli();
/* critical section ... */
restore_flags(flags);

Note: the POSIX function pthread_spin_lock() has this semantics of disabling interrupts.

Spinlocks are a trade-off between (i) disabling all interrupts on all processors (costly, safe, but what you

don’t want to do on a multi-processor system or a pre-emptable kernel), and (ii) wasting time in busy

waiting (which is the only alternative that remains). So, spinlocks work if the programmer is disciplined

enough to use them with care, that is for guaranteed very short critical sections. In principle, the latency

induced by a spinlock is not deterministic, which is in contradiction to its use for real-time. But they

offer a good solution in the case that the scheduling and context switching times generated by the use of

locks, are larger than the time required to execute the critical section the spinlock is guarding.

There is a reason why atomic test-and-set operations are not optimal on multi-processor systems built

from typical PC architecture processors: the test-and-set performed by one processor can make parts of

the caches on the other processors invalid because part of the operation involves writing to memory. And

this cache invalidating lowers the benefits to be expected from caching. But the following

implementation can help a bit:

int spinlock(spinlock_type l){
  while (test_and_set(l)) { // enter wait state if l is 1
    while (l == 1) {}       // stay in wait state until l becomes 0
  };
};

The difference with the previous implementation is that the test_and_set() requires a read and a
write operation (which has to block memory access for other CPUs), while the l == 1 test requires only
a read, which can be done from cache.

4.6.4. Read/write locks

Often, data has only to be protected against concurrent writing, not concurrent reading. So, many tasks

can get a read lock at the same time for the same critical section, but only one single task can get a write

lock. Before this task gets the write lock, all read locks have to be released. Read locks are often useful to

access complex data structures like linked lists: most tasks only read through the lists to find the element
they are interested in; changes to the list are much less common. (See also Section 5.5.)


Linux has a reader/writer spinlock (see below), that is used similarly to the standard spinlock, with the
exception of separate reader/writer locking:

rwlock_t rwlock = RW_LOCK_UNLOCKED; // initialize

read_lock(&rwlock);
/* critical section (read only) ... */
read_unlock(&rwlock);

write_lock(&rwlock);
/* critical section (read and write) ... */
write_unlock(&rwlock);

Similarly, Linux has a read/write semaphore.

4.6.5. Barrier

Sometimes it is necessary to synchronize a lot of threads, i.e., they should wait until all of them have

reached a certain “barrier.” A typical implementation initializes the barrier with a counter equal to the
number of threads, and decrements the counter whenever one of the threads reaches the barrier (and
blocks); when the counter reaches zero, all blocked threads are woken up. Each decrement requires
synchronization, so the barrier cost scales linearly in the number of threads.

POSIX (1003.1-2001) has a pthread_barrier_wait() function, and a pthread_barrier_t
type. RTAI has something similar to a barrier but somewhat more flexible, which it calls “bits” (see file
bits/rtai_bits.c in the RTAI source tree), and what some other operating systems call flags or
events. The bits is a 32 bit value, that tasks can share to encode any kind of AND or OR combination of
binary flags. It can be used as a barrier for a set of tasks, by initializing the bits corresponding to each of
the tasks to “1” and letting each task that reaches the barrier reset its bit to “0”. This is similar to a
semaphore (or rather, an array of semaphores), but it is not “counting”.

4.7. Condition variable for synchronization within mutex

Condition variables have been introduced for two reasons (which amount basically to one single reason):

1. It allows a task to sleep until a certain application-defined logical criterion is satisfied.

2. It allows a task to sleep within a critical section. (Unlike a semaphore.) This is in fact the same
reason as above, because the critical section is needed to evaluate the application-defined logical
criterion atomically.

The solution to this problem is well known, and consists of the combination of three things:

1. A mutex lock (see Section 4.6.2).

2. A boolean expression, which represents the above-mentioned logical criterion.

3. A signal (see Section 4.3), that other tasks can fire to wake up the task blocked in the condition
variable, so that it can re-check its boolean expression.

The lock allows a task to check the boolean expression “atomically” in a critical section, and to wait for the

signal within that critical section. It’s the operating system’s responsibility to release the mutex behind

the back of the task, when it goes to sleep in the wait, and to take it again when the task is woken up by

the signal.

There exists a POSIX standard for condition variables. Here are some of the major prototypes for the
pthread_cond_wait() system call, used to make a task wait for its wake-up signal:

#include <pthread.h>

// Initialise condition attribute data structure:
int pthread_condattr_init(pthread_condattr_t *attr);

// Destroy condition attribute data structure:
int pthread_condattr_destroy(pthread_condattr_t *attr);

// Initialise condition variable:
int pthread_cond_init(
      pthread_cond_t *cond,
      const pthread_condattr_t *cond_attr
    );

// Destroy condition variable:
int pthread_cond_destroy(pthread_cond_t *cond);

// Wait for condition variable to be signaled:
int pthread_cond_wait(
      pthread_cond_t *cond,
      pthread_mutex_t *mutex
    );

// Wait for condition variable to be signaled, or timed-out:
int pthread_cond_timedwait(
      pthread_cond_t *cond,
      pthread_mutex_t *mutex,
      const struct timespec *abstime
    );

// Restart one specific waiting thread:
int pthread_cond_signal(pthread_cond_t *cond);

// Restart all waiting threads:
int pthread_cond_broadcast(pthread_cond_t *cond);

Other system calls that take the same arguments are: pthread_cond_init() (initialize the data
structure with which a condition variable is built), pthread_cond_signal() (signal the fact that a
condition variable has changed state), pthread_cond_broadcast() (signals the state change to all
tasks that are waiting for the signal, and wakes them all), and pthread_cond_timedwait() (wait for
the signal, or for a timer to expire, whichever comes first).

The sem_wait() of Section 4.6.2 shows a typical application of a condition variable. We repeat the
code here for convenience:

int sem_wait(sem_t *sem)
{ pthread_mutex_lock(&sem->mutex);
  while (sem->count == 0) pthread_cond_wait(&sem->cond, &sem->mutex);
  sem->count--;
  pthread_mutex_unlock(&sem->mutex);
  return(0);
}

The semaphore has a mutex sem->mutex, a condition signal sem->cond, and its particular boolean
expression, namely its count being zero or not. The checking of this condition, as well as the possible
decrement of the count, must be done in a critical section, in order to synchronize access to the
semaphore with other tasks. The pthread_cond_wait() function makes the calling task block on the
condition variable if the boolean expression evaluates to false. The operating system releases the mutex
when the task must block, so that other tasks can use the semaphore. When the condition is signaled (this
is done by the complementary function sem_signal(), which is not given here but that executes a
pthread_cond_broadcast()), the calling task is woken up and its mutex is activated (all in one
atomic operation) such that the woken-up task can safely access the critical section, i.e., check its
boolean expression again. The above-mentioned atomicity is guaranteed by the operating system, which
itself uses some more internal locks in its implementation of the pthread_cond_wait() call.

It is essential that tasks that wake up from waiting on a condition variable re-check the boolean
expression for which they were waiting, because nothing guarantees that it is still true at the time of
waking up. Indeed, a task can be scheduled a long time after it was signaled. So, it should also be
prepared to wait again. This leads to the almost inevitable while loop around a
pthread_cond_wait().

The pthread_cond_broadcast() should be the default way to signal the condition variable, and not
pthread_cond_signal(). The latter is only an optimization in the case that one knows for sure that
only one waiter must be woken up. However, this optimization violates the loose coupling principle of
good software design (Chapter 14): if the application is changed somewhat, the “optimization” of before
could well become a bottleneck, and solving the situation involves looking for the
pthread_cond_signal() calls that can be spread over various files in the application.

However, blindly using pthread_cond_broadcast() can also have a negative effect, called the
“thundering herd” problem: pthread_cond_broadcast() can wake up a large number of tasks, and
in the case that only one task is needed to process the broadcast, all other woken-up tasks will
immediately go to sleep again. That means the scheduler is hidden under a “herd” of unnecessary
wake-up and sleep calls. So, Linux and other operating systems introduced policies that programmers
can use to give some tasks the priority in wake-ups.

Both semaphores/mutexes and condition variables can be used for synchronization between tasks.
However, they have some basic differences:

1. Signaling a semaphore has always an effect on the semaphore’s internal count. Signaling a condition
variable can sometimes have no effect at all, i.e., when no task is waiting for it.

2. A condition variable can be used to check an arbitrary complex boolean expression.

3. According to the POSIX rationale, a condition variable can be used to make a task wait indefinitely
long, but spinlocks, semaphores and mutexes are meant for shorter waiting periods. The reason is
that pthread_mutex_lock() is not a cancellation point, while pthread_cond_wait() is.

4. A condition variable is nothing more than a notification to a task that the condition it was waiting for
might have changed. And the woken-up task should check that condition again before proceeding.
This check-on-wake-up policy is not part of the semaphore primitive.

4.8. Priority inversion

Priority scheduling and locks are, in fact, contradictory OS primitives: priority scheduling wants to run
the highest priority job first, while a mutex excludes every other job (so, also the highest priority job)
from running in a critical section that is already entered by another job. And these contradictory goals
lead to tricky trade-offs. For example, everybody coding multi-tasking systems using priority-based task
scheduling and locking primitives should know about the “priority inversion” danger: in some situations,
the use of a lock prevents a task from proceeding because it has to wait for a lower-priority task. The reason is
that a low-priority task (i) is in a critical section for which it holds the lock that blocks the high-priority
task, and (ii) it is itself pre-empted by a medium-priority task that has nothing to do with the critical
section in which the high- and low-priority tasks are involved. Hence, the name “priority inversion”: a
medium-priority job runs while a high-priority task is ready to proceed. The simplest case is depicted in
Figure 4-1. In that Figure, task H is the high-priority task, task M the medium-priority task, and
task L the low-priority task. At time instant T1, task L enters the critical section it shares with
task H. At time T2, task H blocks on the lock issued by task L. (Recall that it cannot pre-empt
task L because that task has the lock on their common critical section.) At time T3, task M pre-empts
the lower-priority task L, and at the same time also the higher-priority task H. At time T4, task M
stops, and task L gets the chance again to finish the critical section code at time T5 when, at last,
task H can run.

Figure 4-1. Priority inversion.

The best-known practical case of a priority inversion problem occurred during the Mars

Pathfinder mission in 1997. (More information about this story can be found at

http://www.kohala.com/start/papers.others/pathfinder.html or

http://research.microsoft.com/~mbj/Mars_Pathfinder/Mars_Pathfinder.html.)

4.9. Priority inheritance and priority ceiling

Operating system programmers have tried to “solve” (not prevent) the priority inversion problem, in two
different ways:

• Priority inheritance. A low-priority task that holds the lock requested by a high-priority task
temporarily “inherits” the priority of that high-priority task, from the moment the high-priority task
does the request. That way, the low-priority task will not be pre-empted by medium-level priority
tasks, and will be able to finish its critical section without holding up the high-priority task any longer
than needed. When it releases the lock, its priority drops to its original level, while the high-priority
task will now get the lock. The maximum predictable delay is the length of the critical section of the
low-priority task.

Priority inheritance generates run-time overhead, because the scheduler has to inspect the priorities of
all tasks that access a lock.

• Priority ceiling. Every lock gets a priority level corresponding to the priority of the highest-priority
task that can use the lock. This level is called the ceiling priority. Note that it is the lock that gets a
priority, which it gives to every task that tries the lock. So, when the low-priority task enters the
critical section, it immediately gets the ceiling priority from the lock, such that it will not be
pre-empted by any medium-level priority task. Therefore, another name of the priority ceiling protocol
is instant inheritance.

Priority ceiling generates compile-time overhead, because it can already at that moment check the
priorities of all tasks that will request a lock.

Priority ceiling has the pleasant property that it simplifies implementation and has small run-time
overhead (only the change in priority for the task entering a critical section): the lock never has to be
tested for being free or not, because any task that tries the lock runs at the highest priority to enter the
critical section: any other task that could test the lock would run at the same ceiling priority, and hence
would not have been interrupted in its critical section by the task that currently tests the lock. Indeed,
both tasks live in the same priority level and are scheduled with a SCHED_FIFO policy. Instant
inheritance also offers a solution to the “deadly embrace” (see Section 4.2) occurring when two tasks
lock nested critical sections in opposite order: the first task to enter the outermost lock will have the
appropriate priority to finish the complete set of nested critical sections.

A possible problem with priority ceiling is that it makes more processes run at higher priorities, for
longer times than necessary. Indeed, the priorities of tasks are changed, irrespective of the fact whether
another task will try to request the lock or not. This reduces the discriminating effects of using
priorities in the first place, and it gives rise to “hidden” priority inversion: while task L gets its priority
raised to the ceiling priority because it is involved in a lock with another task V that has a very high
priority, a third task H not involved in the lock could get pre-empted by L although its priority is
higher and V is dormant most of the time.

Priority ceiling and inheritance look great at first sight, and they are part of some OS standards: priority
ceiling is in the POSIX standard (POSIX_PRIO_PROTECT), the Real-Time Specification for Java
(RTSJ), OSEK, and the Ada 95 real-time specifications. Priority inheritance is also part of standards such
as POSIX (POSIX_PRIO_INHERIT), and the RTSJ. But priority ceiling and inheritance can still not
guarantee that no inversion or indeterministic delays will occur, [Yodaiken2002], [Locke2002]. Moreover,
the priority inheritance “feature” gives rise to code that is more complex to understand and certainly to
predict. Also, determining a priori the ceiling priority for a lock is not an easy matter (the compiler must
have access to all code that can possibly use a lock!), and can cause portability and extendability
headaches.

Priority inversion is always a result of a bad design, so it’s much better to prevent race conditions instead

of “solving” them. However, contrary to the deadlock prevention algorithm (Section 4.2), no similarly

simple and guaranteed algorithm for priority inversion is known. So, all an operating system could do to

help the programmers is signalling when priority inversion takes place, such that they can improve their

design.

Most RTOSes don’t apply priority inversion solutions for every case of sender-receiver synchronization.

For example Neutrino (from QNX) uses separate synchronization mechanisms for critical sections

(semaphores) and sender-receiver (which is synchronous IPC in QNX). It solves priority inversion only

so long as applications use a many-to-one IPC model. As soon as an application uses many-to-many

IPC (via a POSIX queue) there is no more prevention of priority inversion. Many-to-many is inherently

difficult because the kernel has no way to know which receiver might be ready next, so all it could do

would be to raise the priority of all potential listeners (and the processes upon which they are waiting).

And this would often result in a logjam as every process was raised to the same priority, invalidating

exactly the major reason why priorities were introduced in the first place.

4.10. Lock-free synchronization for data exchange

Some synchronization can also be done without locks, and hence this is much more efficient and
guaranteed to be deadlock-free, [Herlihy91], [Herlihy93]. Lock-free synchronization uses the
compare_and_swap(address,old,new) (see Section 4.5), or similar constructs. This functionality is
applicable to the manipulation of pointers, e.g., to interchange two buffers in one atomic operation, or to
do linked list, queue or stack operations.

The following code fragment shows the basic form of this pointer swinging:

ptr = ...
do {
  old = ptr;
  new = new_value_for_pointer;
} while ( !compare_and_swap(ptr,old,new) );

If the compare_and_swap() returns “false”, the swinging of the pointers should not be done, because
some other task has done something with the pointer in the meantime.

Recall the possible problem with compare_and_swap(): it only compares the values of the addresses,
not whether this value has been changed! This means that a double change of the pointer (back to its
original value) will not be detected. This occurs quite frequently, i.e., any time when memory space is
re-used, e.g., in a stack or a linked list.


Another problem of using compare_and_swap for lock-free synchronization is that it is not always the
most efficient method available, because it involves the copying of a complete data structure before that
data structure can be updated without a lock.

Chapter 5. IPC: Data exchange

The previous Chapter looked at the synchronization aspect of IPC; this Chapter deals with the

mechanisms and policies of data exchange. The emphasis is on data exchange for real-time systems.

The mechanism of all data exchange IPC is quite similar: the operating system has some memory space reserved for the data that has to be exchanged, and uses some synchronization IPC primitives for reading or writing to that memory space. There is some “object” responsible for the memory and the locks; we call this object the mediator (Section 14.3) or the channel. The mediator is really the heart of the data exchange: the IPC clients make function calls on it, but it’s the mediator that takes care of memory allocation, buffering, locking, signalling, etc. Although a mediator is more of an object-oriented design concept, it is there already in most of the old C code of operating systems. The bad news is that most operating system designers didn’t realize that, and hence, they didn’t reuse the mediator code when implementing the myriads of IPC forms they developed. . .

It’s especially in their policy (i.e., the choice of how the low-level mechanism is being used) that the

different forms of data exchange differ from each other. Below is a non-exhaustive list of policies for

data exchange. Almost every possible combination of options is feasible, so it should come as no surprise

that operating systems tend to have data exchange primitives with not quite the same API. . .

• (No) Data loss. Whether or not everything that the “sender” sends to the “receiver” (or rather, to the IPC mediator object) will indeed be received by the “receiver”.

• (Non)Blocking. (Also called (a)synchronous.) The “sender” and/or “receiver” block until the exchange is finished.

• One-to-many/many-to-one/many-to-many/one-to-one. There is one single “sender” and multiple “receivers”. Or any variation on this theme.

• Named/anonymous. The “sender” must explicitly give the identification of the “receivers”, or it must only name the mediator.

• (Non)Buffered. The “sender” sends some data to the “receiver”, but only indirectly: the data is stored in a buffer, from which the “receiver” reads at its own leisure.

• (Non)Prioritized. A message can get a priority, and the highest-priority message gets delivered first. Similarly, the tasks that wait for a data exchange (hence, which are blocked by the lock of the mediator) can be woken up according to their static priority.

There does exist some standardization in data exchange IPC: POSIX has its standard 1003.1b, in which it

specifies message queues, with 32 priorities and priority-based task queues on their locks.

5.1. Shared memory

Two (or more) tasks can exchange information by reading and writing the same area in memory. The

main advantage is that the data exchange can take place with zero copying, because the “buffer” is just

one deep. For the rest, any policy can be implemented on top of shared memory. One area where shared

memory is very popular is for the data exchange with peripheral hardware, if possible under Direct

Memory Access (DMA).

Available RAM memory is the only limit to the number of independent shared memory IPC “channels”.

The shared memory must be reserved from the operating system, and locked into RAM (see Section 6.1). If the tasks involved in the shared memory IPC want to know how “fresh” the data in the shared segment is, they have to implement their own handshake protocol, because the operating system gives no indication as to whether data in shared memory has already been accessed or not. One common approach is to put a counter in the shared memory data structure that indicates how many writes have already taken place. This counter could be a time stamp, which is, in addition, particularly useful for an asynchronous monitor task in user space: that task, for example, plots the data from the real-time task at its own rate, and the time stamps provide a way to keep the plotted data on a correct time line.

5.2. FIFOs

Shared memory has the properties of a so-called block device: programs can access arbitrary blocks on

the device, in any sequence. Character devices, on the other hand, can access the data only in a specified

linear sequence. A FIFO (First-in, First-Out) is such a character device IPC: its mediator policy is

loss-free, non-blocking (unless the FIFO is empty or full), in principle many-to-many but in practice

often 1-to-1 (i.e., only one sender and one receiver), and buffered (i.e., FIFOs put data in a pipeline,

where the sender adds data on one end, and the reader reads it at the other end).

Some FIFO implementations support blocking for synchronization at the reader’s end, so the reader gets

woken up as soon as new data has arrived. FIFOs that implement blocking have a “task queue” data

structure in which blocked tasks can wait.

FIFOs often also allow asynchronous data exchange: a task can register a FIFO handler that the

operating system executes after data has been put into the FIFO. This is done with exactly the same

ISR-DSR principle as for hardware and software interrupts (Section 3.4): the writing of data into the

FIFO also fires an event that activates the FIFO handler; this handler will be a tasklet (Section 3.4), that,

just as in the case of an interrupt, is executed by the operating system before it does its next scheduling.

The boundaries between successive data in the FIFO need not necessarily be sharp, because different

blocks might have different sizes. However, in that case, one speaks more often about mailboxes, see

Section 5.3.

The mediator implementing the FIFO uses a lock for mutual exclusion during read or write, and in order to keep the FIFO’s task queue data structures consistent. However, if the FIFO runs between a real-time task and a user space task, the locking problem is very much simplified: the real-time task can never be interrupted by the user task (because it runs in kernel space), so no lock is needed at the real-time side.


5.3. Messages and mailboxes

Messages and mailboxes allow the sender to send data in arbitrary chunks, only limited by available

memory in the buffer. The message contains some “meta-information” about the size and sender of the

message; or whatever data that the message protocol prescribes. This meta-information is, in fact, the

only practical difference with FIFOs. From an implementation point of view, FIFOs, messages and

mailboxes all look very similar, in the sense that there is the mediator object that takes care of the buffer,

the locks and the queues of waiting tasks. Moreover, many operating systems make no distinction

between messages and mailboxes. If they do make a distinction, it is the following:

• Message. The sender puts its message in a memory space it has allocated itself, and then sends the address of that memory to the OS, together with an identification of the receiver. The receiver asks the OS whether there are messages for it, and decides to read them or not. Reading a message is done in the same place as where it was written. If desired, a counter on the message data allows for 1-to-many IPC.

Be careful with this oversimplified description: passing the address of a data chunk is error-prone (multi-processor systems; virtual memory; . . . ).

• Mailbox. The sender notifies the OS that it has a message, and gives the identification of the receiver. The OS then copies the message to the mailbox of the receiver; this mailbox is a buffer managed by the operating system. The receiver task reads the messages in the order of arrival. (Of course, variations on this policy exist.)

A natural extension to messages or mailboxes is synchronous message passing, sometimes also called Send/Receive/Reply (because that’s what it is called in QNX): the “sender” sends a message, and waits until the “receiver” has acknowledged the reception of the message. Hence, two messages are exchanged in this form of IPC. The SIMPL (http://www.holoweb.net/~simpl/) (Synchronous Interprocess Messaging Project for Linux) project offers a free software implementation of this form of message passing.

POSIX has standardized an API for messages (with the semantics of what was called “mailboxes” above,

i.e., the message queues are managed by the operating system). Here are the basic data structures and

prototypes:

struct mq_attr {
   long mq_maxmsg;  // Maximum number of messages in queue
   long mq_msgsize; // Maximum size of a message (in bytes)
   long mq_flags;   // Blocking/non-blocking behaviour specifier;
                    // not used in mq_open(), only relevant
                    // for mq_getattr() and mq_setattr()
   long mq_curmsgs; // Number of messages currently in queue
};

// Create and/or open a message queue:
mqd_t mq_open(
   const char *mq_name,
   int oflags,
   mode_t permissions,
   struct mq_attr *mq_attr
);

// Receive a message:
ssize_t mq_receive(
   mqd_t mq,
   char *msg_buffer,
   size_t buflen,
   unsigned int *msgprio
);

// Send a message to a queue:
int mq_send(
   mqd_t mq,
   const char *msg,
   size_t msglen,
   unsigned int msgprio
);

// Close a message queue:
int mq_close(mqd_t mq);

// Set the attributes of a message queue
// (the previous attributes are returned in old_attrs):
int mq_setattr(mqd_t mq, const struct mq_attr *new_attrs,
               struct mq_attr *old_attrs);

// Register a request to be notified whenever a message
// arrives on an empty queue:
int mq_notify(mqd_t mq, const struct sigevent *notification);

// Destroy a message queue:
int mq_unlink(const char *mq_name);

5.4. Circular buffers

A circular (or ring) buffer has most of the properties of shared memory, except that its depth is larger than one (i.e., it can contain more than one of the data structures exchanged in the communication). The buffer is usually implemented as an array of communication data structures, and the positions of sender and receiver are indicated by pointers into this array. When one of these pointers reaches the end of the buffer, it wraps back to the start of the buffer and continues from there. So, data is lost when the sender pointer overtakes the reader pointer; data is read multiple times if the reader pointer overtakes the writer pointer. It’s straightforward to use a lock to avoid these situations. In that case, the lock makes the buffer blocking. A lock can also be set on each data item in the buffer, in order to avoid concurrent access of the same data.

Two common options for buffers (especially in real-time applications) are:

• Locking in memory. The memory used for the buffer should not be swapped out of the physical RAM.

• “Buffer half full” (high water/low water) interrupt. The sender and/or receiver tasks can raise an event if the buffer is more than half full or half empty. This event must wake up the other party of the IPC, such that it can take the appropriate actions to prevent the buffer from overflowing or running empty.

5.5. Swinging buffers

A swinging buffer (or “flip buffer”) is two things:

• An advanced circular buffer. Instead of using one single shared memory array, a swinging buffer uses two or more. The sender fills up one of the buffers, while the receiver empties another one. Every time one of the tasks reaches the end of its buffer, it starts operating on a buffer that the other task is not using.

• A deadlock-free “lock.” Both tasks operate on different data structures, hence no locks are used to access the data. Only when the tasks must decide which buffer to use do they use a lock on the buffer pointers, or the corresponding atomic pointer-switching operation (see Section 4.10). In the latter case, the “lock” is atomic in hardware, and hence cannot cause any of the problems generated by software locks.

So, a swinging buffer is non-blocking but loss-prone, because one task can fill or empty the same buffer of the swinging buffer pair multiple times before the other task is ready to switch buffers.

The swinging buffer approach is also known under the name read-copy-update (RCU (http://lse.sourceforge.net/locking/rcu/rcupdate_doc.html)). It can be used as an alternative to read-write locks (Section 4.6.4) in “frequent reads/infrequent writes” applications: the readers follow a pointer, and need no locks, while the (less frequent) writer swaps the pointers after having filled in the new data structure.

5.6. Remote Procedure Calls

The previous flavours of IPC can all be catalogued as “low-level”: they are implemented with very basic OS primitives, and are usually shielded from the users within system calls. One popular IPC mechanism at the user level is the Remote Procedure Call (RPC). With RPC, a user can invoke the execution of a task on a remote computer, as if that task ran on the processor of the calling task. RPC is implemented on top of messages, with a synchronizing hand-shake protocol. Obviously, RPC is not very real-time, but it could be useful for embedded systems.

On the other hand, RPC is the simplest form of what is also called distributed or embedded components: software objects (“agents”) that can live on any computer on a network, and that tasks can access transparently. There are three major standards in the area of distributed components:

• CORBA (http://www.corba.org) (Common Object Request Broker Architecture). This is a fully platform- and vendor-independent initiative.

• DCOM, controlled by Microsoft.

• RMI (Remote Method Invocation), from the Java world.

A real-time extension to CORBA was specified in 2001. It takes care of specifying the determinism required for real-time applications, but needs to map its functionality onto the primitives offered by the host operating system. Of course, the absolute time scales of real-time CORBA are longer than those of a stand-alone computer system.

(TODO: more details about CORBA, and real-time CORBA.)

Chapter 6. Memory management

This Chapter explains what memory management means, and how it influences the real-time behaviour

of an operating system. Non real-time aspects of memory management (virtual memory, swapping, dirty

pages management, etc.) are outside the scope of this document.

6.1. Terminology

All tasks need RAM memory to execute: not only for placing their data, but also for their code and for IPC with other tasks. A computer system offers a (most often) contiguous space of physical RAM, and the MMU (Memory Management Unit) of the hardware, together with the virtual memory software of the operating system, helps to give a task the impression that it is the only one that uses the memory, and that that memory is (i) larger than the physically available RAM; (ii) distributed (transparently to the task) over a number of physically non-contiguous memory pages of fixed size; and (iii) protected from access by other tasks.

But these general-purpose OS requirements are not those of real-time systems, or of embedded systems

on processors without MMU. Their concerns are:

• Fast and deterministic memory management. The fastest and most deterministic approach to memory management is no memory management at all. This means that the programmers have all physical RAM available as one contiguous block that they can use as they like. This approach is usually only an option for small embedded systems that run a fixed and small number of tasks. Other RTOSs and EOSs offer at least the basic memory management: memory allocation and deletion through system calls.

• Page locking. Demand paging is the common approach in general purpose operating systems to distribute the scarce physical RAM over all tasks: each task gets a number of pages in RAM, and the pages it hasn’t accessed recently are “swapped out” to make room for pages of other tasks. This swapping is a non-deterministic thing, because it needs access to disk, and most disk controllers have non-deterministic buffering for optimising the average throughput to or from the disk: when the task needs code or data from one of its pages that is currently swapped out, the page has to be retrieved from disk, and often another page in RAM has first to be swapped out to disk. Hence, the MMU of an RTOS must lock the pages of real-time tasks in the physical RAM, in order to avoid the paging overhead. POSIX provides the mlock() and mlockall() function calls to do this locking.

Page locking is a Quality of Service feature of the operating system: it guarantees that tasks have a specified amount of the memory resource at their disposal. In this respect, it is similar to the QoS extensions of the scheduler (see Section 1.3.3).

• Dynamic allocation. A task’s memory needs can change during its lifetime, such that it should be able to ask the operating system for more memory. The Linux system call for this purpose is vmalloc(). (In kernel space!) A real-time memory manager can only make this dynamic allocation of memory deterministic if the memory pages can be taken from a pool of free pages locked in the physical RAM. Anyway, dynamic allocation should be used very carefully in any real-time task, because there is no guarantee that the memory pool has enough free pages left to satisfy all requests. This implies, for example, that IPC approaches with unlimited dynamic memory allocation needs (such as mailboxes and messages, see Section 5.3) are to be avoided.

Nothing prevents an operating system from allocating memory in smaller blocks than one page. However, finer and variable-sized granularity implies more complex memory management, memory fragmentation, and hence less determinism.

• Memory mapping. Real-time and embedded systems typically have to access peripheral devices. The on-board registers in which these devices place their data have to be mapped somewhere into the address space of the corresponding device driver task. The POSIX system call to do this mapping is mmap(). Typically, this mapping is a configuration activity, and hence need not be done in real-time.

• Memory sharing. One of the most efficient ways for tasks to communicate is through shared memory (see Section 6.2). The operating system has two major responsibilities in this area: (i) (de)allocation of the shared memory, and (ii) synchronizing access to that memory by different tasks. The latter topic is discussed in Chapter 4; the former is illustrated later in this Chapter.

• RAM disks. In order to avoid the non-deterministic overhead of accessing hard disks (for real-time systems), or the extra cost, extra space, and reduced robustness of mechanical disk devices (for embedded systems), part of the available RAM can be used to emulate a hard disk. This means that that memory is organized and accessed as a file system, as if it resided on a hard disk.

When the RAM disk should be able to preserve data when the power is switched off, the embedded system designer implements it in the form of a flash disk. This is memory that can be “burned” many thousands of times, rather quickly, with very little power, and from within the system code itself. Reburning (“flashing”) is required either for reprogramming the device, or for temporary storage of “non-volatile” data.

Having the system code in a file system on flash gives the added bonus that the code need not be loaded into RAM, but may be executed in place. This results in shorter start-up times.

• Stripped libraries. RAM is a scarce resource in real-time and embedded systems, such that programmers try to use as little of it as possible. Hence, they often use “stripped down” versions of general utility libraries (C library, math library, GUI library, etc.). µlibc is such a low-footprint version of the C library.

6.2. Shared memory in Linux

(TODO: update state of affairs on shared memory! POSIX API for shared memory; sharing between real-time and user space; shared memory management through locks and/or monitor; copy_to_user(), copy_from_user().)

This Section discusses two complementary ways to allocate shared memory in Linux. There is nothing particularly real-time about using shared memory; allocating shared memory, however, is more controversial: RTLinux doesn’t allow on-line memory allocation, RTAI does.

The shared-memory pool is a block of physical memory set aside at boot time so that Linux does not use it for processes. To set up the pool, you first determine how much physical memory the system has and how much is to be used for shared memory. Normal Linux processes are required to map physical memory into their private address space to access it. To do this, a Linux process calls open() on the memory device /dev/mem. After the file descriptor is opened, the Linux process maps the shared memory into its address space using mmap(), which returns a pointer to the shared memory as mapped into the Linux process’s address space. Once the shared memory is mapped, it may be accessed by dereferencing the pointer. When the process terminates, you use munmap() to unmap the shared memory, by passing the pointer and the size of its object. Shared-memory access is easier in the kernel space of the real-time Linux variants, since the real-time code executes in kernel space and thus is not required to map physical addresses to virtual addresses.

6.2.1. Allocation at boot time

In this approach, a block of shared memory can be reserved at boot time, to prevent Linux from using it

for general purposes. The reservation is done using the parameter in LILO (or something

append=

similar for other bootloaders). Here is an example of a file that reserves 1 Megabyte

/etc/lilo.conf

for shared memory out of 16 available Megabytes:

image=/boot/zImage

label=rtlinux

root=/dev/hda1

append="mem=15m"

Linux will use only the first 15 Megabytes, and the last Megabyte can be used for shared memory

purposes. The base address of the shared memory in the above-mentioned example is:

#define BASE_ADDRESS (15 * 0x100000)

The real-time and user Linux tasks use different ways to access the memory.

A real-time task runs in kernel space, and hence can directly access the memory with its physical

address. For example, a data structure of type at the start of the shared memory is accessed as:

my_data

my_data *ptr;

ptr = (my_data *) BASE_ADDRESS;

ptr->... = ...

A user space tasks must use its virtual address. This mapping of physical memory into the virtual

address space consists of two steps: 65

Chapter 6. Memory management

The user space task must “open” the memory, by using the system call on the device

open()

• :

/dev/mem

#include <unistd.h> // POSIX defined open()

#include <fcntl.h> // O_RDWR for read and write access

// or O_RDONLY for read-only access, etc.

int fd; // file descriptor for the opened memory

if ((fd = open("/dev/mem", O_RDWR)) < 0 )) {

// handle possible error here

}

The system call then does the actual mapping.

mmap()

• my_data *ptr;

ptr = (my_data *) mmap (0, sizeof(my_data),

PROT_READ | PROT_WRITE,

MAP_SHARED,

fd, BASE_ADDRESS);

The parameters and are POSIX-defined and indicate read and write

PROT_READ PROT_WRITE

access; indicates that memory can be shared with any other task that maps it too. (See

MAP_SHARED

the man pages for more details.)

The shared memory is then accessed via the pointer, as in the kernel space task.

my_data *ptr;

ptr->... = ...

The task must use to un-map the shared memory used for the data structure:

my_data

munmap()

my_data *ptr;

munmap(ptr, sizeof(my_data));

6.2.2. Allocation in kernel space

The mbuff module implements the /dev/mbuff device. This device offers shared memory (allocated in the kernel using vmalloc()) in kernel as well as in user space. The shared memory does not need to be reserved at system startup, and its size is not limited by memory fragmentation. It is logically (but not physically) contiguous, and is locked into physical RAM. When you allocate a block, the kernel first grabs the free pages; if there are not enough of them, it starts freeing more, by reducing buffers and disk cache, and finally by swapping some user data and code out to disk. For sure this is not a real-time operation: it may take seconds to get something like 100 MB out of a 128 MB RAM machine.


6.2.3. Allocation in module

(TODO: latest kernel options for memory allocation; dmaBuffer module. Use this approach preferably at boot time, otherwise you might not be able to find all the requested memory as a contiguous area in RAM.)

(TODO: copy_from_user().)

Chapter 7. Real-time device drivers

An operating system must interface its peripheral devices to its kernel software as well as to the user application software. This should be done in a modular and systematic way, such that all hardware “looks the same” to software applications. The software that takes care of this hardware-independent interfacing consists of device drivers. For the Linux real-time variants, Comedi (Section 7.4) is a successful and steadily growing project for real-time and non-real-time device drivers for digital acquisition cards.

In the UNIX world, device drivers are visible through the “files” /dev/xyz (where xyz stands for a particular device, such as, for example, hda for the first hard disk, ttyS0 for the first serial line, etc.). The 2.4.X kernels have introduced the devfs and driverfs (“driver file system”) approaches, which give more structure to the information about the devices that have actually been loaded. But all these things are for user space, and hence not relevant for the real-time Linux variants that operate in kernel space. The bookkeeping aspects of registering a device driver, with major and minor numbers, as well as guidelines for writing device drivers, are explained in detail in the UNIX literature. For the Linux example, Rubini’s Linux Device Drivers book ([Rubini2001]) is the major reference.

7.1. Mechanism and policy

A major feature of a good device driver is that it “provides mechanism, not policy.” This means that it should faithfully mimic all the interfacing capabilities of the device (the “mechanism”), but nothing more. It should not try to interpret the exchanged data in any possible user context (the “policy”), because that is the job of the user application program itself. Indeed, once a device driver offers a software interface to the mechanism of the device, an application writer can use this mechanism interface to use the device in one particular way. That is, some of the data structures offered by the mechanism are interpreted in specific physical units, or some of them are taken together because this composition is relevant for the application. For example, an analog output card can be used to generate voltages that are the inputs for the electronic drivers of the motors of a robot; these voltages can be interpreted as setpoints for the desired velocity of these motors, and six of them are taken together to steer one particular robot with six degrees of freedom. Some of the other outputs of the same physical device can be used by another application program, for example to generate a sine wave that drives a vibration shaker. Or, the robot control program can use a force sensor that is interfaced through a serial line. The force sensor device driver “talks” to both the application program (i.e., the force control algorithm) and the serial line device driver (for which it is a “user application” itself!). It is obvious that the serial line driver should never implement function calls that are only useful in the force sensor driver context. Nevertheless, that’s exactly what happens in many projects with constrained scope, vision and time. . .

As for the other operating system responsibilities discussed in the previous Chapters, writing device

drivers for an RTOS or an EOS is not so much different from writing them for a general-purpose OS.

Basically, in an RTOS context, one should make sure that all timing delays in the drivers are both short

and deterministic, and every DSR should be an appropriately prioritized thread or handler that waits on

an event to become active.


7.2. Device drivers in UNIX

In the UNIX philosophy, all devices are considered as being “files”, and hence, their device drivers share the following functions: open(), close(), read(), write(), read_config(), set_config(). The function call names are operating system independent, and just for demonstration. However, open(), close(), read() and write() are POSIX compliant. The configuration function calls are, in UNIX, often taken together in the ioctl() function.

open() makes the device accessible for programs, while close() ends the accessibility. The device can be opened in different modes, such as, for example, O_RDONLY (“read-only”), O_WRONLY (“write-only”), O_RDWR (“read and write”), and O_NONBLOCK (“non-blocking”).

read() and write() interchange data between the peripheral device and the (kernel or application) software: a known data structure is copied from one place in memory to another. Of course, the exact contents of that data structure depend on the device and/or on the particular use of this device by the programmer.

read_config() reads out the device’s current configuration status, and set_config() programs that configuration. Configuration management often has less strict timing constraints than reading and writing. It also has less standardized function calls, because of the larger variety in possible settings of different hardware. Nevertheless, the POSIX standard prescribes the use of the ioctl() function call for all configuration actions that don’t fit cleanly in the classic UNIX stream I/O model of open(), close(), read(), and write():

int ioctl(int d, int request, ...);

The d parameter is the “file descriptor” with which the device has been opened; request is the particular configuration identifier; and ... are possible arguments that come with the request.

7.3. Complex device drivers

A simple device driver needs nothing more than writing and/or reading of some hardware registers on a peripheral device. Some devices interact with the software through hardware interrupts. Hence, their device drivers must include an ISR, and possibly also a DSR (see Section 3.4). Recall that only a subset of all kernel space functions is available in the run-time context of an ISR. And a real-time device driver is subjected to even more constraints.

Almost all devices can be interfaced in Programmed Input/Output (PIO) mode: the processor is responsible for accessing bus addresses allocated to the device, and for reading or writing data. Some devices also allow shared memory, or even Direct Memory Access (DMA): the device and the memory exchange data amongst each other directly, without needing the processor. DMA is a feature of the bus, not of the operating system; the operating system, however, must support its processes to use the feature, i.e., provide a system call to initialize DMA transfer, and a handler to react to the notification of the device


PAGINE

177

PESO

1.08 MB

AUTORE

Atreyu

PUBBLICATO

+1 anno fa


DESCRIZIONE DISPENSA

This Guide covers the fundamentals of (i) real-time and embedded operating systems (focusing mostly on the differences with general purpose operating systems such as Linux), and (ii) real-time programming. The emphasis is on Free Software and Open Source Software examples: RTAI, RTLinux, eCos, RT-EMS, uCLinux ..., with a more than proportional focus on RTAI. This text also talks about design issues, software patterns and frameworks for real-time applications. That is, the “high-level” aspects of these software projects. These higher levels are often poorly dealt with in publications on real-time programming, which leads to the unfortunate situation that still too many real-time programmers use only the powerful but dangerously unstructured API of their RTOS. Missing the chance to develop more structured, and, hence, more deterministic and more portable software systems. Both the low-level RTOS primitives, and the high-level design issues, are illustrated by the real-world example of a hard real-time core for feedback control and signal processing.


DETTAGLI
Corso di laurea: Corso di laurea magistrale in ingegneria delle telecomunicazioni
SSD:
Università: L'Aquila - Univaq
A.A.: 2011-2012

I contenuti di questa pagina costituiscono rielaborazioni personali del Publisher Atreyu di informazioni apprese con la frequenza delle lezioni di Sistemi embedded e studio autonomo di eventuali libri di riferimento in preparazione dell'esame finale o della tesi. Non devono intendersi come materiale ufficiale dell'università L'Aquila - Univaq o del prof Pomante Luigi.

Acquista con carta o conto PayPal

Scarica il file tutte le volte che vuoi

Paga con un conto PayPal per usufruire della garanzia Soddisfatto o rimborsato

Recensioni
Ti è piaciuto questo appunto? Valutalo!

Altri appunti di Sistemi embedded

Programmazione concorrente
Dispensa
Sistemi Embedded
Dispensa
SystemC
Dispensa
Sistemi Embedded
Dispensa