Queueing Theory
Introduction
Queueing theory is a branch of probability theory that studies optimal solutions for managing waiting lines, or queues. A queue is unavoidable, and will form whether or not it was anticipated, whenever supply is less than demand, even temporarily. It applies to many situations: managing aircraft taking off or landing, customers and users waiting at counters, or computer jobs held before processing. This field of research was born in 1917 from the work of the Danish engineer Erlang on the management of the Copenhagen telephone network between 1909 and 1920. It studies in particular how requests arrive in a queue, the priority given to each new arrival, and the statistical modeling of execution times. It was thanks to the contributions of the mathematicians Khintchine, Palm, Kendall, Pollaczek and Kolmogorov that the theory really developed.
Little’s Law
Little mathematically demonstrated what Erlang had said years earlier. Here is the law:
L = A x W
- L: the queue length, i.e. the average number of requests waiting in the system
- A: the rate at which requests enter the system (requests/second)
- W: the average time to satisfy a request
Queue Size
Requests waiting in a queue are held in memory, so L can correspond either to a tunable read/write cache or to a read-only cache used for measurements. There are algorithms to manage queue priorities:
- Two queues (1 for urgent requests, 1 for normal requests). Requests are processed according to their urgency.
- A single queue that will process urgent requests first. Normal requests will only be processed if there are no more urgent ones.
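As a minimal sketch of such a strict-priority policy (the queue, function and request names below are hypothetical, not an existing tool), in bash:

```bash
#!/bin/bash
# Hypothetical sketch: urgent requests are always dequeued before normal ones.
urgent=()
normal=()

enqueue() {         # enqueue <urgent|normal> <request>
  local class=$1; shift
  if [ "$class" = urgent ]; then urgent+=("$*"); else normal+=("$*"); fi
}

dequeue() {         # prints the next request to process, urgent first
  if [ ${#urgent[@]} -gt 0 ]; then
    echo "${urgent[0]}"; urgent=("${urgent[@]:1}")
  elif [ ${#normal[@]} -gt 0 ]; then
    echo "${normal[0]}"; normal=("${normal[@]:1}")
  fi
}

enqueue normal "generate report"
enqueue urgent "health check"
dequeue   # -> health check
dequeue   # -> generate report
```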
Queue Size and Waiting Time
L tends to vary directly with W. Simple example:
100 requests = 100 requests/second * 1s
200 requests = 100 requests/second * 2s
800 requests = 100 requests/second * 8s
We can therefore bound the waiting time by restricting the queue size. Conversely, we can restrict the waiting time in order to size the queue (and the memory it occupies).
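As a quick sketch of that sizing logic, reusing the arrival rate from the example above (100 requests/second):

```bash
# L = A x W: to keep W under 2 s at 100 requests/s, cap the queue at 200 entries
$ echo "100 * 2" | bc
200
# conversely, a queue capped at 800 entries gives a worst-case wait of 8 s
$ echo "800 / 100" | bc
8
```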
Waiting Time
The waiting time includes:
(W = Q + S)
- Queue time (Q): time spent waiting for a resource to become available
- Service time (S): time for the resource to execute the request
The goal is to reduce both queue time and service time. For example, the waiting time might correspond to a web page loading: its execution takes a certain amount of time, and that time can be multiplied by the number of parallel requests.
If we look from a more system-oriented perspective:
W = Q + (T(system) + T(user))
- System time: time for a kernel task
- User time: time for a user task
You can get more information in the man page of the time command or in top. The goal is therefore:
- Reduce system time (time spent in the kernel, during which user operations are blocked)
- Spend as much time as necessary on user tasks (but no more)
The service time corresponds to the time a resource takes to execute a task taken from the queue, until the end of its execution. It can also be calculated as utilization / throughput. The average service time is the amount of time a resource is occupied per request: S = occupancy time / number of completions
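A small worked example with assumed figures (a resource busy for 276 ms during a 1 s observation window, completing 120 requests in that window):

```bash
# S = occupancy time / number of completions
# (equivalently utilization / throughput: 27.6 % busy / 120 completions per second)
$ echo "0.276 / 120" | bc -l
.00230000000000000000
```

That is, an average service time of about 2.3 ms per request.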
User Time Algorithms
There are asymptotic complexity classes that help estimate the user time needed to execute n requests:
- Intractable: O(2^n)
- Polynomial: O(n^2)
- Linear: O(50n)
- Constant: O(1)
- Logarithmic: O(500 log n)
Note: O describes the order of growth of an algorithm's execution time or memory usage as its input grows.
By switching to an algorithm with a lower complexity, performance can be improved: the waiting time drops and the throughput (the number of actions completed per unit of time) increases.
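To get a feel for these growth rates (reusing the constants 50 and 500 from the list above), a quick awk comparison can be run:

```bash
$ awk 'BEGIN { for (n = 10; n <= 100000; n *= 10)
    printf "n=%d: n^2=%.0f, 50n=%d, 500*log2(n)=%.0f\n",
           n, n*n, 50*n, 500*log(n)/log(2) }'
n=10: n^2=100, 50n=500, 500*log2(n)=1661
n=100: n^2=10000, 50n=5000, 500*log2(n)=3322
n=1000: n^2=1000000, 50n=50000, 500*log2(n)=4983
n=10000: n^2=100000000, 50n=500000, 500*log2(n)=6644
n=100000: n^2=10000000000, 50n=5000000, 500*log2(n)=8305
```

For small inputs the constants dominate (500 log n is worse than 50n at n = 10), but as n grows the logarithmic algorithm pulls far ahead and the quadratic one becomes unusable.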
Profiling with the time Command
The time command lets you measure how long a command takes to execute.
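For example (the command and the timings below are purely illustrative):

```bash
$ time tar czf /tmp/backup.tgz /etc

real    0m2.483s
user    0m1.967s
sys     0m0.311s
```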
- real: the elapsed (wall-clock) time the command takes to execute
- user: the CPU time spent executing the command's own (user-space) instructions
- sys: the CPU time spent in the kernel on behalf of the command (system calls); time spent waiting for I/O is not counted here and only appears in real
To get the queue time: Q = W - (Tsys + Tuser), i.e. real - (user + sys)
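Continuing with the illustrative timings above:

```bash
# Q = real - (user + sys)
$ echo "2.483 - (1.967 + 0.311)" | bc -l
.205
```

About 0.2 s of the elapsed time was spent waiting rather than executing.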
Completion Rate
- Theoretical bandwidth: the amount of data that can be transferred in a given amount of time.
- Real bandwidth: the amount of data actually transferred in a given amount of time.
- Throughput: the amount of useful data transferred in a given amount of time.
- Overhead: the difference between the real bandwidth and the throughput, i.e. the extra data transferred on top of the useful data.
For data transferred at a given bandwidth, the throughput will always be less than the bandwidth.
The completion rate is the rate at which work is performed at a certain throughput or bandwidth:
X = work done / observation time
Bandwidth is generally predictable, but overhead can be reduced to speed up processing. For example, on a 100 Mb/s network we might measure 80 Mb/s of useful data at the application layer (OSI), which means there is 20 Mb/s of overhead (80 + 20 = 100). If the overhead is reduced to 10 Mb/s, the throughput can rise to 90 Mb/s.
B (bandwidth) = X (transfers) + O (overhead)
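A trivial check of those figures (the assumed 100 Mb/s link with 80 Mb/s measured at the application layer):

```bash
# O = B - X: overhead on the assumed 100 Mb/s link
$ echo "100 - 80" | bc
20
# with the overhead cut to 10 Mb/s, the usable throughput becomes
$ echo "100 - 10" | bc
90
```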
However, be aware that reducing overhead can cause undesirable behavior. For example, suppose we reduce overhead by choosing UDP instead of TCP for an application (UDP carries less overhead than TCP). If packets are lost on the network, the application may have to request retransmission itself, and as a result it can end up generating more overhead than it would have with TCP sessions. Depending on the application, lost data does not necessarily need to be retransmitted (online music streaming, for example); after a certain number of lost packets, the client can simply report a transmission error.
Arrival Rate and Completion Rate
The completion rate is the rate at which work is performed and is related to throughput, as well as bandwidth (depending on the context).
A (arrival rate) = C (number of completions) / T (observation time)
- T: the observation time, i.e. the window during which events are counted (e.g., in seconds)
- A: the rate of arrivals observed during the observation period (e.g., packets/s)
In a stable system, the requests that arrive during the observation time (T) end up being counted as completions (C), which is why the arrival rate can be measured from the completions. The occupancy time is the observation time multiplied by the utilization.
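A quick worked example with assumed figures (1,200 packets completed over a 10 s window, 27.6 % utilization):

```bash
# A = C / T: completion (and, in a stable system, arrival) rate
$ echo "1200 / 10" | bc
120
# occupancy time = observation time x utilization
$ echo "10 * 0.276" | bc -l
2.760
```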
Finding the Best Observation Time
It is important to use these kinds of tools while the system is under non-zero load. Observing over too short a time will not give satisfactory results. It is therefore advised to determine the observation time in the following way:
- Collect data as frequently as possible (ideally 1 second)
- Compare the measurement values according to Little’s Law
- Is the comparison consistent?
  - yes: the observation time is valid and L = A x W holds (with A measured as C / T)
  - no: redo the capture with a longer time interval
We will now calculate the queue length and compare it to the collected data.
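Here is an example of such a capture, taken with something like sar -d -p; the host, device and figures below are illustrative values, not a real measurement:

```bash
$ sar -d -p 1 1
Linux 3.2.0-4-amd64 (server)    07/26/2013      _x86_64_        (4 CPU)

14:32:10          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
14:32:11          sda    120.00    850.00   1250.00     17.50      0.85      7.10      2.30     27.60
Average:          sda    120.00    850.00   1250.00     17.50      0.85      7.10      2.30     27.60
```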
Based on the columns sar gives us, Little's Law can be applied directly: the arrival rate A is the number of I/O requests per second (tps) and the waiting time W is the average time per request (await, in milliseconds), which gives:
tps * await / 1000 = L, the average number of requests in the queue
This value can be compared with the avgqu-sz column reported by sar.
So if we take the information above, we can easily calculate like this (on sda here):
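Using the illustrative values from the capture above:

```bash
# L = A x W = tps x (await converted from ms to s)
$ echo "120.00 * 7.10 / 1000" | bc -l
.85200000000000000000
```

This matches the avgqu-sz value reported by sar for the same interval (0.85), which also confirms that the observation window is long enough.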
It is interesting to let these kinds of tools run for a while, for example to profile an application on a server, and to see how many requests are queued during periods of high load.
Predicting System Limits
It’s important to know that a saturated resource is the bottleneck for the entire system! You must therefore ensure that there is no funnel effect on your resources.
Reducing the number of accesses or improving bandwidth helps solve the problem, whether it’s for CPU, RAM, Network, or anything else.
It is therefore important to know both the arrival rate and the necessary bandwidth to properly configure your hardware.
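As a rough sketch of such a prediction, using the earlier relation S = utilization / throughput and a hypothetical service time of 5 ms per request, the resource saturates when the request rate approaches 1 / S:

```bash
# utilization reaches 100 % when the arrival rate hits 1 / S
$ echo "1 / 0.005" | bc
200
```

Beyond roughly 200 requests/s in this example, requests start to queue and the waiting time grows; that is the limit to plan the hardware around.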