Several clock sources let the system keep and manipulate time:
RTC (Real Time Clock): the battery-backed clock on the motherboard that keeps the date and time while the machine is powered off. You can get information about it in the /proc/driver/rtc file.
TSC (Time Stamp Counter): a CPU register that increments at the processor's clock frequency (and can therefore drift when that frequency varies). The kernel uses the TSC together with the RTC to compute the date and time.
PIT (Programmable Interval Timer): sometimes conflated with the PIC (Programmable Interrupt Controller), which delivers its interrupts; the PIT raises an interrupt to the kernel after a programmed interval has elapsed and is generally used to drive process scheduling.
APIC (Advanced Programmable Interrupt Controller): it also runs off the CPU clock, keeps track of the processes running on its processor and delivers local interrupts to it.
On a 2.6 kernel, the timer interrupt frequency is 1000 Hz, i.e. 1 tick per millisecond (a tick is also called a jiffy). This interval can be adjusted when compiling the kernel or, on some distributions, through boot parameters. A shorter tick gives better timer resolution; however, applications may run slightly slower because more time is spent servicing timer interrupts.
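You can check which tick frequency your kernel was built with, assuming the distribution ships the kernel build configuration under /boot (the usual location on Red Hat and Debian systems):

```shell
# Print the CONFIG_HZ settings of the running kernel; guarded because the
# config file location is distro-dependent.
grep 'CONFIG_HZ' "/boot/config-$(uname -r)" 2>/dev/null \
    || echo "no build config found for kernel $(uname -r)"
```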
# /etc/sysconfig/cpuspeed
#
# This configuration file controls the behavior of both the
# cpuspeed daemon and various cpufreq modules.
# For the vast majority of users, there shouldn't be any need to
# alter the contents of this file at all. By and large, frequency
# scaling should Just Work(tm) with the defaults.

### DRIVER ###
# Your CPUFreq driver module
# Note that many drivers are now built-in, rather than built as modules,
# so its usually best not to specify one.
# default value: empty (try to auto-detect/use built-in)
DRIVER=

### GOVERNOR ###
# Which scaling governor to use
# Details on scaling governors for your cpu(s) can be found in
# cpu-freq/governors.txt, part of the kernel-doc package
# NOTES:
# - The GOVERNOR parameter is only valid on centrino, powernow-k8 (amd64)
#   and acpi-cpufreq platforms, other platforms that support frequency
#   scaling always use the 'userspace' governor.
# - Using the 'userspace' governor will trigger the cpuspeed daemon to run,
#   which provides said user-space frequency scaling.
# default value: empty (defaults to ondemand on centrino, powernow-k8,
# and acpi-cpufreq systems, userspace on others)
GOVERNOR=

### FREQUENCIES ###
# NOTE: valid max/min frequencies for your cpu(s) can be found in
# /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies
# on systems that support frequency scaling (though only after the
# appropriate drivers have been loaded via the cpuspeed initscript).
# maximum speed to scale up to
# default value: empty (use cpu reported maximum)
MAX_SPEED=
# minimum speed to scale down to
# default value: empty (use cpu reported minimum)
MIN_SPEED=

### SCALING THRESHOLDS ###
# Busy percentage threshold over which to scale up to max frequency
# default value: empty (use governor default)
UP_THRESHOLD=
# Busy percentage threshold under which to scale frequency down
# default value: empty (use governor default)
DOWN_THRESHOLD=

### NICE PROCESS HANDLING ###
# Let background (nice) processes speed up the cpu
# default value: 0 (background process usage can speed up cpu)
# alternate value: 1 (background processes will be ignored)
IGNORE_NICE=0

#############################################################
### HISTORICAL CPUSPEED CONFIG BITS ###
#############################################################
VMAJOR=1
VMINOR=1
# Add your favorite options here
#OPTS="$OPTS -s 0 -i 10 -r"
# uncomment and modify this to check the state of the AC adapter
#OPTS="$OPTS -a /proc/acpi/ac_adapter/*/state"
# uncomment and modify this to check the system temperature
#OPTS="$OPTS -t /proc/acpi/thermal_zone/*/temperature 75"
You can get the current information like this:
cpuspeed --help 2>&1 | more
You can also display the values that can be assigned:
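For example, on a system where a cpufreq driver is loaded, sysfs exposes the assignable frequencies and governors (the reads are guarded because these files exist only when frequency scaling is supported):

```shell
# Values CPU 0 will accept; these sysfs files appear once a cpufreq
# driver is loaded.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies 2>/dev/null || true
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors 2>/dev/null || true
```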
And finally, the governor shows the scaling algorithm in use. Here, for example, it is “ondemand”, which changes the processor speed on the fly according to demand:
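A quick sketch of reading and changing the governor through sysfs (the write needs root, so it is left commented out):

```shell
# Current governor for CPU 0 (e.g. "ondemand"); the file exists only when
# a cpufreq driver is loaded, hence the guard.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null || true
# Switching is a simple write (root only):
# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```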
If you want the best performance, disable this daemon. The drawback is of course higher power consumption (think of the environment).
Note also that if you need very low latencies, disabling this daemon is strongly recommended.
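On Red Hat-style systems, disabling the daemon for the current session and at boot looks like this (both commands need root, and are guarded here for systems without these initscripts):

```shell
# Stop cpuspeed now and disable it at boot.
service cpuspeed stop 2>/dev/null || true
chkconfig cpuspeed off 2>/dev/null || true
```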
These options allow the kernel to preempt and reschedule certain tasks. The gain shows up in latency, especially network latency. For example, the kernel can handle disk I/O and simultaneously receive interrupts from the network card: the handler doing disk I/O can be preempted in favor of the network card's interrupt, which improves network latency.
It is nevertheless possible to disable IRQ balancing with a boot parameter (noapic disables the I/O APIC, so all interrupts go through the legacy PIC and land on CPU 0):
GRUB_CMDLINE_LINUX_DEFAULT="quiet noapic"
If IRQs are unevenly distributed across CPUs, performance can become inconsistent, because interrupt handlers preempt whatever processes happen to be running on those CPUs.
Pinning an IRQ to a CPU (IRQ affinity) lets you exploit CPU cache affinity and even out how much each CPU is interrupted. To bind an IRQ to a core, write that core's bitmap, in hexadecimal, to /proc/irq/<irq>/smp_affinity. For example:
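A minimal sketch of building the mask: bit N of the bitmap corresponds to CPU N. The IRQ number 19 below is only an example, and the write requires root:

```shell
# Compute the hex affinity mask for a given CPU number.
cpu=2
mask=$(printf '%x' $((1 << cpu)))
echo "mask for CPU $cpu: 0x$mask"
# Bind the example IRQ 19 to CPU 2 (root only):
# echo "$mask" > /proc/irq/19/smp_affinity
```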
This takes effect for the IRQ's next interrupts and keeps the other CPUs from servicing it. On Red Hat this can be configured permanently in /etc/sysconfig/irqbalance. If you prefer, you can also disable IRQ balancing entirely:
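On Red Hat-style systems, that means stopping the irqbalance daemon (root required; the commands are guarded for systems without these initscripts):

```shell
# Stop irqbalance now and keep it from starting at boot, so manually set
# smp_affinity values are not overwritten.
service irqbalance stop 2>/dev/null || true
chkconfig irqbalance off 2>/dev/null || true
```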
Each core has its own run queue. On HyperThreaded processors, the logical processors share the run queue of their physical core. By default there is a certain affinity: tasks tend to return to the CPU they last ran on, which is good because each CPU has its own cache. However, if one core is more loaded than another, the scheduler inspects the run queues every 100 ms (or every 1 ms if the core is idle) and may rebalance the load. If this rebalancing happens too often, the resulting cache misses can themselves add latency (it all depends on the application), so you have to decide which matters more to you. To see the list of processes and the core each one runs on:
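One way is ps, whose PSR column shows the processor each task last ran on:

```shell
# PID, last-used processor (PSR) and command name for every task; head
# keeps the output short.
ps -eo pid,psr,comm | head -15
```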
If you want to assign specific CPUs to certain processes, that's possible too! The first step is to know the CPU bitmaps. To get an idea of how to compute them:
> awk '/^processor/{printf("CPU %s mask : 0x%08x\n", $3, 2^$3)}' /proc/cpuinfo ; echo 'All CPUs : 0xFFFFFFFF'
CPU 0 mask : 0x00000001
CPU 1 mask : 0x00000002
CPU 2 mask : 0x00000004
CPU 3 mask : 0x00000008
All CPUs : 0xFFFFFFFF
Then we use the taskset command to bind a PID to a specific CPU (here the mask 0x00000001, i.e. CPU 0):
taskset -p 0x00000001 <PID>
You should know that on NUMA systems, RAM is mapped directly to CPUs (each node has local memory) to increase performance. Processors can still use memory that is not local to them, only with higher latency. Here is a small overview of NUMA:
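If the numactl package is installed, you can inspect the topology directly (the call is guarded since the tool is not always present):

```shell
# Nodes, their CPUs and the amount of local memory on each.
command -v numactl >/dev/null && numactl --hardware \
    || echo "numactl not installed"
```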
You can also specify parameters at the grub level to isolate CPUs (isolcpus):
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE: You have a /boot partition. This means that
#     all kernel and initrd paths are relative to /boot/, eg.
#     root (hd0,0)
#     kernel /vmlinuz-version ro root=/dev/mapper/vgos-root
#     initrd /initrd-[generic-]version.img
#boot=/dev/sda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.32-279.2.1.el6.x86_64)
    root (hd0,0)
    kernel /vmlinuz-2.6.32-279.2.1.el6.x86_64 ro root=/dev/mapper/vgos-root rd_NO_LUKS KEYBOARDTYPE=pc KEYTABLE=fr LANG=en_US.UTF-8 rd_LVM_LV=vgos/root rd_NO_MD rd_LVM_LV=vgos/swap SYSFONT=latarcyrheb-sun16 crashkernel=128M biosdevname=0 rd_NO_DM isolcpus=0
    initrd /initramfs-2.6.32-279.2.1.el6.x86_64.img
Since isolcpus=0 removes CPU 0 from the general scheduler, only tasks explicitly pinned to it (e.g. with taskset) will run there. We therefore get a smaller run queue and improved response times for the tasks assigned to this CPU.
cpuset is a more advanced version of taskset that provides a more elegant, flexible and scalable method for controlling runqueues and latency on tasks. A cpuset is a group of CPUs (scheduler domain/cgroups) on which we will be able to balance tasks:
The implementation of cpuset in the kernel is quite small and has no impact on the process scheduler. It uses a new VFS that introduces no new system calls, and this cpuset VFS can be mounted anywhere on the system; we will, for example, mount it in /mnt/cpusets. Creating folders inside it is enough to define sets and assign CPUs to them. A CPU can belong to multiple cpusets.
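A minimal sketch with example names: the set name "myset", CPUs 1-2 and memory node 0 are arbitrary choices, and every step needs root (hence the guards):

```shell
# Mount the cpuset pseudo-filesystem and create a set inside it.
mkdir -p /mnt/cpusets 2>/dev/null || true
mount -t cpuset none /mnt/cpusets 2>/dev/null || true
mkdir -p /mnt/cpusets/myset 2>/dev/null || true
# Give the set CPUs 1-2 and memory node 0, then move this shell into it.
echo 1-2 > /mnt/cpusets/myset/cpus 2>/dev/null || true
echo 0 > /mnt/cpusets/myset/mems 2>/dev/null || true
echo $$ > /mnt/cpusets/myset/tasks 2>/dev/null || true
```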
#
# Copyright IBM Corporation. 2007
#
# Authors: Balbir Singh <balbir@linux.vnet.ibm.com>
# This program is free software; you can redistribute it and/or modify it
# under the terms of version 2.1 of the GNU Lesser General Public License
# as published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
group tomcat_cgroup {
    perm {
        admin {
            uid = tomcat;
        }
        task {
            uid = tomcat;
        }
    }
    cpuset {
        cpuset.mems = 0;
        cpuset.cpus = "1,2";
        cpuset.cpu_exclusive = 1;
    }
}
This example is for the tomcat user, to whom I want to dedicate 2 CPUs. You then need to update the cgrules configuration:
tomcat          cpuset          tomcat_cgroup/
This maps the tomcat user's processes to the tomcat_cgroup group (/sys/fs/cgroup/tomcat_cgroup). Now restart the services:
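On Red Hat-style systems that means restarting the cgroup config and rules services (root required; guarded here for systems without these initscripts):

```shell
# Re-read /etc/cgconfig.conf and restart the rules engine daemon.
service cgconfig restart 2>/dev/null || true
service cgred restart 2>/dev/null || true
```

You can then check where a process landed with cat /proc/<pid>/cgroup, where <pid> is any of the user's processes.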