Monday, September 17, 2012

Linux Page Tables


Hardware-wise, we have a two level page table structure, where the first
level has 4096 entries, and the second level has 256 entries.  Each entry
is one 32-bit word.  Most of the bits in the second level entry are used
by hardware, and there aren't any "accessed" and "dirty" bits.

Linux on the other hand has a three level page table structure, which can
be wrapped to fit a two level page table structure easily - using the PGD
and PTE only.  However, Linux also expects one "PTE" table per page, and
at least a "dirty" bit.

Therefore, we tweak the implementation slightly - we tell Linux that we
have 2048 entries in the first level, each of which is 8 bytes (iow, two
hardware pointers to the second level.)  The second level contains two
hardware PTE tables arranged contiguously, preceded by Linux versions
which contain the state information Linux needs.  We, therefore, end up
with 512 entries in the "PTE" level.

This leads to the page tables having the following layout:

   pgd             pte
|        |
+--------+
|        |       +------------+ +0
+- - - - +       | Linux pt 0 |
|        |       +------------+ +1024
+--------+ +0    | Linux pt 1 |
|        |-----> +------------+ +2048
+- - - - + +4    |  h/w pt 0  |
|        |-----> +------------+ +3072
+--------+ +8    |  h/w pt 1  |
|        |       +------------+ +4096
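
The offsets in the diagram can be checked with a little arithmetic. The following is a minimal sketch, assuming 4 KB pages and the sizes stated above (2048 first-level entries of 8 bytes, 512 entries at the "PTE" level, 4-byte entries). The helper names are illustrative only and are not claimed to match the kernel's own macros.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes taken from the text: 2048 first-level entries (8 bytes each),
 * 512 entries at the "PTE" level, one 32-bit word per entry. */
#define PGDIR_SHIFT      21   /* 4 GB / 2048 entries = 2 MB per entry */
#define PAGE_SHIFT       12   /* assumed 4 KB pages */
#define PTRS_PER_PTE     512
#define PTE_ENTRY_SIZE   4
#define HW_TABLE_OFFSET  (PTRS_PER_PTE * PTE_ENTRY_SIZE)

static_assert(HW_TABLE_OFFSET == 2048, "hardware tables start at +2048");

/* Index of the 8-byte first-level entry covering a virtual address. */
static inline uint32_t pgd_index(uint32_t vaddr)
{
    return vaddr >> PGDIR_SHIFT;
}

/* Index of the entry within the 512-entry "PTE" level. */
static inline uint32_t pte_index(uint32_t vaddr)
{
    return (vaddr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
}

/* Byte offset of the Linux PTE inside the 4 KB page-table page. */
static inline uint32_t linux_pte_offset(uint32_t vaddr)
{
    return pte_index(vaddr) * PTE_ENTRY_SIZE;
}

/* Byte offset of the matching hardware PTE, 2048 bytes further on. */
static inline uint32_t hw_pte_offset(uint32_t vaddr)
{
    return linux_pte_offset(vaddr) + HW_TABLE_OFFSET;
}
```

For example, the last page covered by the first pgd entry (virtual address 0x001FF000) has its Linux PTE at offset 2044 and its hardware PTE at offset 4092, the final word of the page.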

See L_PTE_xxx below for definitions of bits in the "Linux pt", and
PTE_xxx for definitions of bits appearing in the "h/w pt".

PMD_xxx definitions refer to bits in the first level page table.

The "dirty" bit is emulated by only granting hardware write permission
iff the page is marked "writable" and "dirty" in the Linux PTE.  This
means that a write to a clean page will cause a permission fault, and
the Linux MM layer will mark the page dirty via handle_pte_fault().
For the hardware to notice the permission change, the TLB entry must
be flushed, and ptep_set_access_flags() does that for us.

The "accessed" or "young" bit is emulated by a similar method; we only
allow accesses to the page if the "young" bit is set.  Accesses to the
page will cause a fault, and handle_pte_fault() will set the young bit
for us as long as the page is marked present in the corresponding Linux
PTE entry.  Again, ptep_set_access_flags() will ensure that the TLB is
up to date.

However, when the "young" bit is cleared, we deny access to the page
by clearing the hardware PTE.  Currently Linux does not flush the TLB
for us in this case, which means the TLB will retain the translation
until either the TLB entry is evicted under pressure, or a context
switch which changes the user space mapping occurs.
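
The "dirty" and "young" emulation rules above boil down to two predicates over the software bits. This is only a sketch: the SW_xxx flag names and bit positions are hypothetical stand-ins for the L_PTE_xxx bits, not the kernel's real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software flag bits standing in for the L_PTE_xxx bits. */
#define SW_PRESENT  (1u << 0)
#define SW_YOUNG    (1u << 1)
#define SW_DIRTY    (1u << 2)
#define SW_WRITE    (1u << 3)

/* The hardware entry is populated only when the page is present and
 * young; otherwise any access faults and the fault path can set "young". */
static bool hw_entry_valid(uint32_t sw)
{
    return (sw & SW_PRESENT) && (sw & SW_YOUNG);
}

/* Hardware write permission is granted only when the page is both
 * writable and dirty, so the first write to a clean page faults. */
static bool hw_entry_writable(uint32_t sw)
{
    return hw_entry_valid(sw) && (sw & SW_WRITE) && (sw & SW_DIRTY);
}
```

A present, young, writable but clean page is therefore mapped read-only in hardware until the first write fault marks it dirty.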

Sunday, September 16, 2012

Saturday, September 15, 2012

kexec


Kexec is a patch to the Linux kernel that allows you to boot directly to a new kernel from the currently running one. In the boot sequence described above, kexec skips the entire bootloader stage (the first part) and directly jumps into the kernel that we want to boot to. There is no hardware reset, no firmware operation, and no bootloader involved. The weakest link in the boot sequence -- that is, the firmware -- is completely avoided. The big gain from this feature is that system reboots are now extremely fast. For enterprise-class systems, kexec drastically reduces reboot-related system downtime. For kernel and system software developers, kexec helps you quickly reboot your system during development or testing efforts without having to go through the costly firmware stage every time.
The kexec patch is the work of Eric Biederman and the project is under active development (see the Resources section for more details on the project and how to contribute to it).
Obviously, since this feature touches so many sensitive parts of the operating system, a great deal of care is needed to make it all work properly. The biggest challenge for kexec is that, in Linux, the new kernel that is to be rebooted to needs to sit in the same place in memory as the currently executing one. Replacing the existing kernel in memory with the new one, while still running in the context of the existing kernel, is a tough task. Another big issue is the state of the devices in the system. Firmware always initializes (or resets) the devices to a known "sane" state. The fact that kexec bypasses the firmware stage means that the state of the devices is unreliable.
Subsequent sections of this article will show you how to overcome these challenges, and how the direct booting to a new kernel is achieved. Note that kexec is currently available only on the x86 32-bit platform. Although work is underway to port kexec to other platforms, there is no working version of the code yet. Hence, all technical details in the subsequent sections are specific to the x86 platform.

Kexec has two components. The first is the userspace component known as "kexec-tools." The second is the actual kernel patch. The two parts achieve the two main operations of kexec: loading the new kernel into memory and rebooting to it. Getting a kexec-enabled kernel is simple. Just download the kexec-tools package and the kernel-specific patch (see the link in the Resources section), build the kexec-tools package to obtain the kexec tool, and apply the kernel-specific patch to the kernel tree and reboot to it. Of course, make sure you have selected the CONFIG_KEXEC option while building the kernel.
As mentioned above, using kexec consists of (1) loading the kernel to be rebooted to into memory, and (2) actually rebooting to it. To load a kernel, the syntax is as follows:
kexec -l <kernel-image> --append="<command-line-options>"
where <kernel-image> is the kernel file that you intend to reboot to and <command-line-options> contain the command-line parameters that need to be passed to the new kernel. Because the wrong command-line options can cause problems during the reboot, passing the contents of /proc/cmdline is the safest way to ensure that legal values are passed to the rebooting kernel.
For example, if the kernel image you want to reboot is /boot/bzImage, and the contents of /proc/cmdline are "root=/dev/hda1", the command to load the kernel would be:
kexec -l /boot/bzImage --append="root=/dev/hda1"
Then, to actually reboot to the loaded kernel, just type:
kexec -e
The system will reboot immediately. Unlike the normal reboot process, kexec does not perform a clean shutdown of the system before rebooting. It is left to you to kill all applications and unmount file systems before attempting a kexec reboot.
One of the biggest challenges in the development of kexec comes from the fact that the Linux kernel runs from a fixed address in memory. This means that the new kernel needs to sit at the same place that the current kernel is running from. On x86 systems, the kernel sits at the physical address 0x100000 (virtual address 0xc0000000, known as PAGE_OFFSET). The task of overwriting the old kernel with the new one is done in three stages:
  1. Copy the new kernel into memory.
  2. Move this kernel image into dynamic kernel memory.
  3. Copy this image into the real destination (overwriting the current kernel), and start the new kernel.
The first two stages are achieved during the "loading" of the kernel. The first task here is to interpret the contents of the kernel image file. Kexec-tools has been built so that, in principle, you could load and boot to any (even a non-Linux) kernel. Currently, it is possible to boot to any elf32-format kernel image. The file is parsed and the kernel "segments" are loaded into buffers. These segments are categorized based on the nature of the code. For example, in the case of the commonly used "bzImage" kernel file format, the typical segments are for 16-bit kernel code, 32-bit kernel code, and init ramdisk code. The structure used to track these segments is known as kexec_segment and is a fairly simple structure:

Listing 1. The kexec_segment structure
struct kexec_segment {
   void *buf;
   size_t bufsz;
   void *mem;
   size_t memsz;
};

The first two elements of the structure point to the userspace buffer and its size, while the next two elements indicate the final destination of the segment and its size.
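
To make the relationship between the two pairs of fields concrete, here is a user-space sketch of copying one segment to its destination. It assumes that the data size may be smaller than the destination size and that the remainder is zero-filled. The function name copy_segment is hypothetical, and mem is treated as an ordinary pointer here purely for illustration, whereas in kexec it describes the final location of the segment in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct kexec_segment {
    void *buf;      /* source buffer holding the segment data */
    size_t bufsz;   /* number of bytes of data in buf */
    void *mem;      /* destination of the segment */
    size_t memsz;   /* number of bytes reserved at the destination */
};

/* Copy one segment to its destination and zero the unused tail.
 * Returns 0 on success, -1 if the data does not fit. */
static int copy_segment(const struct kexec_segment *seg)
{
    assert(seg != NULL && seg->buf != NULL && seg->mem != NULL);

    if (seg->bufsz > seg->memsz)
        return -1;

    memcpy(seg->mem, seg->buf, seg->bufsz);
    memset((char *)seg->mem + seg->bufsz, 0, seg->memsz - seg->bufsz);
    return 0;
}
```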
Once the kernel-file format-specific module loads the image into user memory, the image is transferred to dynamic kernel memory through the use of the sys_kexec system call. This system call allocates dynamic kernel pages for each of the segments that have been passed from userspace and copies the segments onto these kernel pages.
Kexec also allocates a kernel page to store a small stub of assembly code, known as the reboot_code_buffer. This stub of code does the actual job of overwriting the current kernel with the to-be-rebooted kernel and jumps to it. The reboot_code_buffer is the only buffer that resides in its final resting place. In other words, it is executed from the same place that it is initially loaded to. In order to achieve this, on systems with MMU enabled, the page holding the code is identity mapped. Simply speaking, this involves creating a page table entry in init_mm (the kernel's page table structure) with the same physical and virtual address. This is necessary to be able to access this piece of code during the reboot operation, as discussed later.
Information about the reboot_code_buffer, the various segments, and other details is maintained through the use of the kimage structure:

Listing 2. The kimage structure
struct kimage {
        kimage_entry_t head;
        kimage_entry_t *entry;
        kimage_entry_t *last_entry;

        unsigned long destination;
        unsigned long offset;

        unsigned long start;
        struct page *reboot_code_pages;

        unsigned long nr_segments;
        struct kexec_segment segment[KEXEC_SEGMENT_MAX+1];

        struct list_head dest_pages;
        struct list_head unuseable_pages;
};

The most important parts of this structure are, of course, the segment[KEXEC_SEGMENT_MAX+1] elements, which point to the buffers in kernel memory containing the image, and the reboot_code_pages pointer to the assembly stub used during reboot.
Once the kernel image has been loaded, the system is ready to reboot into it. The actual operation of rebooting to the new kernel starts with the kexec -e command. This command essentially calls the kernel to perform a reboot using the sys_reboot system call, but with a special flag, LINUX_REBOOT_CMD_KEXEC.
The reboot system call, upon seeing the special flag, transfers control to the machine_kexec() function. The actions performed by machine_kexec() are extremely architecture-specific. In the current x86 implementation, the sequence of actions is as follows:
  1. To access the identity-mapped reboot_code_buffer, switches from the current process's mm struct to using the kernel's init_mm structure.
  2. Stops the apics and disables interrupts.
  3. Copies the assembly stub code into the reboot_code_buffer that you had allocated during the loading of the kernel image. The assembly code is found in the relocate_new_kernel routine.
  4. Loads all the segment registers with the kernel data segment (__KERNEL_DS) value, and invalidates the GDT and IDT.
  5. Jumps to the code in the reboot_code_buffer, and passes some vital information as parameters to the new kernel, such as the indirection page containing the source/destination addresses of the kernel image, the starting address of the new kernel, the address of the reboot_code_buffer page, and a flag indicating whether the system has physical address extension (PAE) enabled.
The assembly stub code performs the following operations:
  • Reads the arguments from the stack and stores them on registers, and disables interrupts.
  • Using the address of its own page, which has been passed to it as an argument, sets up a stack at the end of that page.
  • Stores the starting address of the new kernel image onto the stack so that a return from the stub code automatically takes the system to the new kernel image.
  • Disables paging by setting appropriate bits on the cr0 register.
  • Resets the page directory base register, cr3, to zero.
  • Flushes the Translation Lookaside Buffers (TLBs).
  • Copies all the kernel image pages onto their final destination pages.
  • Flushes the TLB once again.
  • Resets all the registers to zero, except the stack pointer register esp (as it is pointing to the stack containing the starting address of the new kernel).
  • "Returns" from the stub code. This automatically takes the system to the new kernel.
After this sequence completes, the new kernel takes control and the system is booted up normally.

Systems with high availability requirements and kernel developers who have to constantly reboot their systems will benefit most from kexec. Because kexec skips the most time-consuming parts of system reboot, namely the firmware stage, reboots are extremely quick and availability is increased.
Kexec also has interesting applications in crash dumping tools. The Linux Kernel Crash Dumps (LKCD) project  has used kexec to develop a different dumping mechanism. At a system panic or user dump initiation, the system memory image is compressed and stored in available free memory pages. Next, the system is rebooted to another kernel using kexec. This new kernel is told where the dump is stored, and prevents the use of those memory regions by anyone. Subsequently, the memory dump can be written out to either a disk partition or across the network to a different machine.
The key to this design is the fact that by avoiding the firmware stage during reboot, LKCD is able to prevent the physical memory contents from being erased by the firmware. In a crash situation, LKCD also does not have to depend on an unreliable disk or network device driver to write out the memory image to the destination. Once a reboot has been performed and the system is in a reliable state, the dump is written out to the destination using normal system device drivers.
Kexec is currently available on the x86 32-bit platform only. Having it on other architecture platforms such as PPC 64 and AMD 64 would be helpful. Also, better integration with the shutdown interface for graceful termination of processes, shutdown of devices, and unmounting of file systems would make it much more convenient for the average user.
You can contribute to the development of kexec. To get started, try out kexec on a test system. You can also join the "fastboot" mailing list, where all the technical discussions about the project take place.

Thursday, September 13, 2012

Kernel Timers


Interrupt IDs:
0-31   -> Internal
32-255 -> External Devices

0-15 -> Inter Processor Interrupts

27 -> Global Timer
29 -> Private Local Timer
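
The ID ranges above can be written down as a small classifier. This is a sketch that uses only the numbers from these notes. The enum and function names are made up for illustration.

```c
#include <assert.h>

enum irq_kind { IRQ_IPI, IRQ_INTERNAL, IRQ_EXTERNAL, IRQ_INVALID };

#define IRQ_GLOBAL_TIMER 27
#define IRQ_LOCAL_TIMER  29

/* Both timer IDs fall inside the internal range 0-31. */
static_assert(IRQ_GLOBAL_TIMER < 32 && IRQ_LOCAL_TIMER < 32,
              "timer interrupts are internal");

/* Classify an interrupt ID using the ranges listed in the notes. */
static enum irq_kind classify_irq(int id)
{
    if (id < 0 || id > 255)
        return IRQ_INVALID;
    if (id <= 15)
        return IRQ_IPI;        /* inter-processor interrupts */
    if (id <= 31)
        return IRQ_INTERNAL;   /* per-CPU sources such as the timers */
    return IRQ_EXTERNAL;       /* external devices */
}
```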

ARM Global Timer

Interrupts to all the cpus using ID27.


ARM Local Timer

Interrupts only the local cpu using ID29.

Local Timer not supported in OMAP4 ES2.1

Rescheduling Interrupts are due to IPI.

Monday, September 10, 2012

ARM TrustZone

ARM

7 execution modes

Secure/NonSecure State

Monitor mode

vector base register  ->


To provide the exception behavior described above, a TrustZone-enabled processor implements three sets of exception vector tables. One of these tables is for the Normal world, one is for the Secure world, and the other is for Monitor mode.

If high vectors are enabled (i.e., the V bit is set in the CP15 control register), exceptions are taken at 0xFFFF0000 regardless of the value of VBAR. This applies to the Secure and Non-Secure states. Monitor mode has its own vector base register, MVBAR.


ARM Features


Each core has the following features:
  • ARM v7 CPU at 600 MHz
  • 32 KB of L1 instruction CACHE with parity check
  • 32 KB of L1 data CACHE with parity check
  • Embedded FPU for single and double data precision scalar floating-point operations
  • Memory management unit (MMU)
  • ARM, Thumb2 and Thumb2-EE instruction set support
  • TrustZone© security extension
  • Program Trace Macrocell and CoreSight© component for software debug
  • JTAG interface
  • AMBA© 3 AXI 64-bit interface
  • 32-bit timer with 8-bit prescaler
  • Internal watchdog (working also as timer)


The dual core configuration is completed by a common set of components:
  • Snoop control unit (SCU) to manage inter-process communication, cache-2-cache and system memory transfer, cache coherency
  • Generic interrupt control (GIC) unit configured to support 128 independent interrupt sources with software configurable priority and routing between the two cores
  • 64-bit global timer with 8-bit prescaler
  • Asynchronous accelerator coherency port (ACP)
  • Parity support to detect internal memory failures during runtime
  • 512 KB of unified 8-way set associative L2 cache with support for parity check and ECC
  • L2 Cache controller based on PL310 IP released by ARM
  • Dual 64-bit AMBA 3 AXI interface with possible filtering on the second one to use a single port for DDR memory access


TEX[2:0]  C  B  Description
000       0  0  Strongly ordered
000       0  1  Shareable device
000       1  0  Outer and inner write-through, no write-allocate
000       1  1  Outer and inner write-back, no write-allocate
001       0  0  Outer and inner non-cacheable
001       0  1  Reserved
001       1  0  IMPLEMENTATION DEFINED
001       1  1  Outer and inner write-back, write-allocate
010       0  0  Non-shareable device
010       0  1  Reserved
010       1  -  Reserved
011       -  -  Reserved
1BB       A  A  Cacheable memory; outer = BB, inner = AA


AA/BB  Attribute
00     Non-cacheable
01     Write-back, write-allocate
10     Write-through, no write-allocate
11     Write-back, no write-allocate
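
For the TEX = 1BB row, the two 2-bit fields can be pulled out of the descriptor bits and looked up in the AA/BB table. The following is a minimal sketch of that lookup. The function names are illustrative, and the strings are copied from the table above.

```c
#include <assert.h>
#include <string.h>

/* Look up a 2-bit AA or BB field in the attribute table. */
static const char *cache_policy(unsigned bits)
{
    static const char *const names[4] = {
        "Non-cacheable",
        "Write-back, write-allocate",
        "Write-through, no write-allocate",
        "Write-back, no write-allocate",
    };
    return names[bits & 3u];
}

/* AA is formed from the C bit (high) and the B bit (low). */
static unsigned field_aa(unsigned c, unsigned b)
{
    assert(c <= 1u && b <= 1u);
    return (c << 1) | b;
}

/* BB is TEX[1:0]; only meaningful when TEX[2] is set. */
static unsigned field_bb(unsigned tex)
{
    assert((tex & 4u) != 0u);
    return tex & 3u;
}
```

For example, TEX = 101 with C = 1 and B = 1 gives BB = 01 (write-back, write-allocate) and AA = 11 (write-back, no write-allocate).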

Saturday, September 8, 2012

Build Errors during building for panda board


arch/arm/mach-omap2/omap-headsmp.S: Assembler messages: 
arch/arm/mach-omap2/omap-headsmp.S:36: Error: selected processor does not support ARM mode `smc #0'

If you get this error message in a .S file, add the line below to that .S file.
.arch_extension sec

Friday, September 7, 2012

context switch



The kernel can run on its own behalf (as a kernel thread). In that case it cannot access user space.
PTBR will point to swapper_pg_dir.

It can also run on behalf of a user process.
In this case the kernel can access the user space of that particular process.
PTBR has the pgd of that particular process.

A process has two stacks,
one in user space and one in kernel space.
When the process makes a system call, the kernel-space stack is used.

An address-space switch is not needed when prev == next,
i.e. no switch is needed when switching between 2 kernel threads.

kernel -> kernel ---> no switch
user1 -> user2   ---> full context switch
kernel -> user   ---> kernel->active_mm = NULL
user -> kernel   ---> kernel->active_mm = prev->active_mm

Basically, when switching from user to kernel, there is no need for a full switch.
The kernel thread uses the previous process's mm and enters lazy TLB mode.
While switching from kernel to user, the kernel thread's active_mm is set to NULL.
active_mm is always used by the arch-specific code for page table operations.
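
The four cases in the table can be sketched with simplified stand-in types. This is not the kernel's context_switch(). The struct names and the return value (1 when a different address space has to be installed, 0 when the switch is lazy) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct mm { int id; };

struct task {
    struct mm *mm;         /* NULL for a kernel thread */
    struct mm *active_mm;  /* address space actually in use */
};

/* Returns 1 if a different address space must be installed,
 * 0 if the switch can be lazy. */
static int switch_address_space(struct task *prev, struct task *next)
{
    assert(prev != NULL && next != NULL);

    struct mm *oldmm = prev->active_mm;
    int switched = 0;

    if (next->mm == NULL) {
        /* user/kernel -> kernel: borrow the previous address space */
        next->active_mm = oldmm;
    } else {
        /* -> user: use the task's own address space */
        next->active_mm = next->mm;
        if (next->mm != oldmm)
            switched = 1;
    }

    if (prev->mm == NULL)
        prev->active_mm = NULL;  /* kernel thread drops the borrowed mm */

    return switched;
}
```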

Code after switch_to is executed only when the process is next selected to run.
barrier() is present after switch_to to prevent the compiler from reordering code across it.

When switch_to returns, prev points to the real previous task.

Example: A->B , B->C , C->A

        before switch_to            | after switch_to (A running again)
        A->B     B->C     C->A      |
prev    a        b        c         | c
next    b        c        a         | a


Switching between tasks requires the following steps:
1. Save the active task context and place the task in a dormant state.
2. Flush the caches; possibly clean the D-cache if using a writeback policy.
3. Flush the TLB to remove translations for the retiring task.
4. Configure the MMU to use new page tables translating the virtual memory execution area to the awakening task's location in physical memory.
5. Restore the context of the awakening task.
6. Resume execution of the restored task.

ARM Address Translation


Page faults should not occur in the kernel address space,
i.e., pages should be allocated from the memory manager before they are used.

32 bit virtual address

Name            Level  Page table  Page sizes      No. of
                       size (KB)   supported (KB)  entries

Master/section  1      16          1024            4096
Fine            2      4           1, 4, or 64     1024
Coarse          2      1           4 or 64         256

Level 1 :

The page table base register (PTBR) holds the base address of the first level page table.
Bits 31-20 of the virtual address are used to index into the page table.
If the last two bits (bit 1, bit 0) of the corresponding first level entry are 10, a 1 MB section is used.
The 12 bit base is taken from the first level page table entry and combined with the 20 bit offset
from the virtual address to obtain the physical address.
bits 8-5 indicate Domain
bit 3 - C bit
bit 2 - B bit
bits 11-10 - AP

If the last two bits are 00, a fault is generated.

If they are 01 or 11, two level page tables are used.
bits 8-5 indicate Domain

If 01, a coarse page table is used.
In the first level page table entry, bits 31-10 are used as the base of the second level page table.
If 11, a fine page table is used, and bits 31-12 are the base of the second level page table.

Level 2:

Using the virtual address, the corresponding entry in the second level page table is taken.
In that entry, if bits 1,0 are
1. 01 -> large page (64k)
bits 31-16 are the base address of the physical page
bits 11-4 are the access permissions, one 2-bit field per 16 KB subpage.
3 - C Bit
2 - B bit
bits 19-16 of the virtual address select the page; bits 15-0 are the offset within it.
2. 10 -> small page (4k)
bits 31-12 are the base address of the physical page
bits 11-4 are the access permissions, one 2-bit field per 1 KB subpage.
3 - C Bit
2 - B bit
bits 19-12 of the virtual address are used to index into the page table.
3. 11 -> tiny page (1k)
bits 31-10 are the base address of the physical page
bits 5-4 are the access permissions for the whole page.
3 - C Bit
2 - B bit
bits 19-10 of the virtual address are used to index into the (fine) page table.
4. 00 -> page fault
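
The bit slicing described above can be sketched as plain arithmetic. The helpers below cover only the 1 MB section case and the 4 KB small page in a coarse table. The function names and the example descriptor values used in testing are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Low two bits of a descriptor select its type. */
static unsigned desc_type(uint32_t desc)
{
    return desc & 3u;
}

/* First-level index: virtual address bits 31-20 (4096 entries). */
static uint32_t l1_index(uint32_t va)
{
    return va >> 20;
}

/* 1 MB section: 12-bit base from the entry, 20-bit offset from the VA. */
static uint32_t section_pa(uint32_t l1_entry, uint32_t va)
{
    assert(desc_type(l1_entry) == 2u);   /* 10 -> section */
    return (l1_entry & 0xFFF00000u) | (va & 0x000FFFFFu);
}

/* Coarse table index: virtual address bits 19-12 (256 entries). */
static uint32_t coarse_index(uint32_t va)
{
    return (va >> 12) & 0xFFu;
}

/* 4 KB small page: 20-bit base from the entry, 12-bit offset from the VA. */
static uint32_t small_page_pa(uint32_t l2_entry, uint32_t va)
{
    assert(desc_type(l2_entry) == 2u);   /* 10 -> small page */
    return (l2_entry & 0xFFFFF000u) | (va & 0x00000FFFu);
}
```

With the virtual address 0xC0123456, the first-level index is 0xC01, the coarse table index is 0x23, and the page offset is 0x456.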

Page Tables Initialization


paging_init()

The page tables and paging infrastructure are initialized as follows:
  • paging_init() is called by setup_arch() after the meminfo structure has been initialized and the bootmem allocator is ready. It calls the following routines:
    • memblock_set_current_limit()
    • build_mem_type_table() - builds a table of memory types. This has the page protection flags that are available for different memory types, for the current ARM processor. Over the years, different ARM processors have changed which flags they use and where those flags are located in the page table entries. The 'mem_types' table encapsulates the settings for the running processor.
    • prepare_page_table() - this zeros out certain areas of the first-level page table (called pmd in this routine). For example, it zeros out the areas of the page table that will be covered by user-space (areas below the start of the kernel address space).
    • map_lowmem() - create the memory mappings (page table entries) for the lower portions of kernel memory. This is the "normal" memory that will be used by the kernel for static code and data, stack, and regular dynamic allocations.
    • devicemaps_init() - create the memory mappings for special CPU areas (e.g. cache flushing regions, and interrupt vectors) and reserved IO areas in the memory map.
    • kmap_init() - create the memory mapping for highmem ('pkmap')

Thursday, September 6, 2012

fork


kthread
create_kthread
kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
do_fork(flags|CLONE_VM|CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);

sys_fork(wrapper)
do_fork(SIGCHLD, regs->ARM_sp, regs, 0, NULL, NULL);

sys_clone(wrapper)
do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);

sys_vfork(wrapper)
do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->ARM_sp, regs, 0, NULL, NULL);

COW (copy-on-write) is used in fork.
Parent and child initially share the same memory pages.
Once a write occurs, the writer gets its own private copy of the affected page.

vfork -> parent and child share the address space.
The parent is suspended until the child exits or calls exec.
What happens after exec on vfork?
The exec gives the child a new address space of its own. The parent keeps its original address space and resumes as soon as the exec succeeds, so it does not wait for the child to exit.

 Because parent and child share the address space, you must not return from the function that called vfork(); doing so can corrupt the parent's stack.
 If your exec() call fails, you must call _exit(), and not exit(), because calling exit() would flush and close standard I/O stream buffers for the parent as well as the child.
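
Copy-on-write itself is not directly observable from user space, but its effect is: after fork(), a write in the child never changes what the parent sees. The following is a small sketch of that behaviour using only fork(), waitpid() and _exit(). The function name is made up for illustration.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's view of `value` after the child has overwritten
 * its own copy, or -1 if fork/wait failed. */
static int parent_value_after_child_write(void)
{
    volatile int value = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        value = 2;                    /* touches only the child's copy */
        _exit(value == 2 ? 0 : 1);    /* _exit(): skip stdio cleanup */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value;                     /* parent still sees its own value */
}
```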

scheduler

struct task_struct
{

prio -> the priority considered by the scheduler.
normal_prio -> priority derived from static_prio and the scheduling policy, ignoring any temporary boost.
static_prio -> assigned during fork, changed only by nice and sched_setscheduler.
rt_priority -> lowest rt priority 0, highest rt priority 99.

sched_class -> scheduler class (CFS, RT..)
se -> the scheduler schedules entities (a process is an entity)
run_list -> needed only for the RR scheduler

policy -> cfs (SCHED_NORMAL (normal process), SCHED_BATCH (won't preempt other processes), SCHED_IDLE)
rt (SCHED_RR, SCHED_FIFO)
cpus_allowed -> CPUs on which the process can execute
time_slice -> needed only for the RR scheduler
}




Priority
0 ----------------------------- 99 | 100 ------------ 139
               RT                  |       Normal
Nice                                 -20 ------------- 19




p->prio = effective_prio(task);

effective_prio() returns the normal priority.
If the task is RT or has been boosted to an RT priority, it returns the current (boosted) priority instead.

p->normal_prio = p->static_prio (normal tasks), or MAX_RT_PRIO-1-p->rt_priority (RT tasks)
p->prio = p->normal_prio, or the existing p->prio (if boosted previously, or if RT, the priority is kept unchanged)

When RT mutexes are used, a normal process can be boosted to a real-time priority.
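
The priority rules above can be sketched in a few lines. This assumes MAX_RT_PRIO is 100, matching the priority scale shown earlier, and uses a cut-down stand-in struct rather than the real task_struct. The field rt_policy is a simplification of checking the task's policy.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RT_PRIO 100   /* RT priorities occupy 0..99 */

/* Map a nice value (-20..19) onto the 100..139 range. */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* Simplified stand-in for the priority fields of task_struct. */
struct task_prio {
    bool rt_policy;    /* true for SCHED_FIFO / SCHED_RR */
    int static_prio;
    int rt_priority;   /* 0 (lowest) .. 99 (highest) */
    int prio;          /* current priority, possibly boosted */
    int normal_prio;
};

static bool rt_prio(int prio)
{
    return prio < MAX_RT_PRIO;
}

static int normal_prio(const struct task_prio *p)
{
    if (p->rt_policy)
        return MAX_RT_PRIO - 1 - p->rt_priority;
    return p->static_prio;
}

/* Keep the current priority if it is in the RT range (RT task or
 * boosted task); otherwise fall back to the normal priority. */
static int effective_prio(struct task_prio *p)
{
    p->normal_prio = normal_prio(p);
    if (!rt_prio(p->prio))
        return p->normal_prio;
    return p->prio;
}
```

Note that a higher rt_priority maps to a numerically lower prio value, so rt_priority 99 becomes prio 0.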

generic scheduler + scheduling methods provided by sched_class

__schedule
{
disable preemption
get processor id -> run queue
get curr process
check for hr timer , signals
update statistics (preempt count..)
deactivate prev task
pre_schedule of the scheduler class of the prev process
if no process in rq then do idle balance on rq
put prev task
pick next task
update clocks...
update context switch count
change the current process of the run queue to next
do the arch specific context switch
post_schedule of the scheduler class of the curr process
update statistics
TIF_NEED_RESCHED -> do scheduling again
enable preemption
}

Where is schedule() called from?

System Calls Related to Scheduling

System Call                 Description

nice()                      Change the priority of a conventional process.
getpriority()               Get the maximum priority of a group of conventional processes.
setpriority()               Set the priority of a group of conventional processes.
sched_getscheduler()        Get the scheduling policy of a process.
sched_setscheduler()        Set the scheduling policy and priority of a process.
sched_getparam()            Get the scheduling priority of a process.
sched_setparam()            Set the priority of a process.
sched_yield()               Relinquish the processor voluntarily without blocking.
sched_get_priority_min()    Get the minimum priority value for a policy.
sched_get_priority_max()    Get the maximum priority value for a policy.
sched_rr_get_interval()     Get the time quantum value for the Round Robin policy.

Runqueue :

A runqueue is the list of all processes with state = TASK_RUNNING.
CFS tasks are maintained in both a list and a red-black tree.

Each active process can be on only one runqueue.
However, threads from the same process can be on different runqueues.

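/* Abridged: only the fields discussed in these notes. */
struct rq
{
unsigned long nr_running;	/* number of runnable tasks on this queue */
#define CPU_LOAD_IDX_MAX 5
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
struct load_weight load;

struct cfs_rq cfs;		/* CFS sub-runqueue */
struct rt_rq rt;		/* real-time sub-runqueue */

struct task_struct *idle , *curr;	/* idle task and currently running task */
u64 clock;
};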

Scheduler Class :

enqueue_task
dequeue_task -> Enqueue/dequeue the task into/from the runqueue (cfs rq / rt rq).

Userspace does not interact with the scheduler class directly.
It just sets the SCHED_*** policy.
The kernel maps the policy to the scheduler class.

Wednesday, September 5, 2012

Interesting Questions

1. What is the use of the syscall() function?

    It performs an indirect system call: the system call is selected by number.

Sometimes the kernel adds system calls and it takes a while for the C library to support them.
Or maybe you are compiling on an old Linux distribution, but want to run on a newer one.
But in general, there is no advantage to using syscall if the C library in your compilation environment has what you need. (For one thing, it is even less portable than using a Linux-specific interface, since the system call numbers vary by CPU.)
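
As a small illustration, getpid can be invoked by number through syscall() and compared with the normal C library wrapper. This is a sketch for Linux with glibc. The wrapper name raw_getpid is made up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke getpid by system call number instead of the libc wrapper. */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both paths return the same process ID, which is why the libc wrapper is preferable whenever it exists.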

2. View the pipe details of a process.


1) ls -l /proc/pid/fd

This will list the pipes

lr-x------ 1 prabagaran prabagaran 64 Sep  5 23:01 14 -> pipe:[57729]
l-wx------ 1 prabagaran prabagaran 64 Sep  5 23:01 15 -> pipe:[57728]
lr-x------ 1 prabagaran prabagaran 64 Sep  5 23:01 16 -> pipe:[57731]
lr-x------ 1 prabagaran prabagaran 64 Sep  5 23:01 17 -> pipe:[57730]

2) lsof | grep 57731

wineserve 3641 prabagaran   76w     FIFO        0,8       0t0   57731 pipe
winedevic 3651 prabagaran   16r     FIFO        0,8       0t0   57731 pipe

These are the processes that have the pipe with the given inode number open.

Tuesday, September 4, 2012

Around Scheduler


Scheduler invoked by
Hardware Interrupts
Software Interrupts
SoftIrq     -> softirq daemon (priority 19).
Tasklets    -> Register the tasklet (i.e. add it to the tasklet list).
               A tasklet can execute on only one CPU at a time (using TASKLET_STATE_RUN).
               TASKLET_STATE_SCHED -> registered and ready to run.
Workqueues  -> Run in process context.
For each workqueue, the kernel creates a kernel daemon in whose context the deferred tasks are performed.
For each workqueue, a thread can be created on every CPU of the system.

Waiting for events to occur
Waitqueues  -> When processes are put to sleep using wait_event, you must ensure that there is a corresponding wake_up call at another point in the kernel.
Completion