===================
Block IO Controller
===================

Overview
========
The cgroup subsystem "blkio" implements the block IO controller. Various
kinds of IO control policies (such as proportional BW and max BW) are needed
both at leaf nodes and at intermediate nodes in a storage hierarchy. The plan
is to use the same cgroup-based management interface for the blkio controller
and to switch IO policies in the background based on user options.

One IO control policy is the throttling policy, which can be used to
specify upper IO rate limits on devices. This policy is implemented in the
generic block layer and can be used on leaf nodes as well as on higher-level
logical devices such as device mapper.

HOWTO
=====

Throttling/Upper Limit policy
-----------------------------
Enable Block IO controller::

	CONFIG_BLK_CGROUP=y

Enable throttling in block layer::

	CONFIG_BLK_DEV_THROTTLING=y

Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::

        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

Specify a bandwidth rate on a particular device for the root group. The
format for the policy is "<major>:<minor>  <bytes_per_second>"::

        echo "8:16  1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

This limits reads by the root group on the device with major/minor
number 8:16 to 1MB/second.

Run dd to read a file and check whether the rate is throttled to 1MB/s::

        # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
        1024+0 records in
        1024+0 records out
        4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s

Limits for writes can be set using the blkio.throttle.write_bps_device file.
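
As a rough sketch of how such a rule could be assembled in a script, the
helpers below build the "<major>:<minor>  <bytes_per_second>" string from a
rate given in MiB/s and estimate how long a throttled transfer should take.
The function names ``throttle_rule`` and ``throttled_seconds`` are made up for
this illustration and are not part of the kernel interface::

        #!/bin/sh
        # Hypothetical helper: print a throttle rule.
        # usage: throttle_rule <major> <minor> <rate_in_MiB_per_second>
        throttle_rule() {
                echo "$1:$2  $(( $3 * 1024 * 1024 ))"
        }

        # Hypothetical helper: whole seconds (rounded up) needed to move
        # <bytes> at <bytes_per_second>.
        throttled_seconds() {
                echo $(( ($1 + $2 - 1) / $2 ))
        }

        throttle_rule 8 16 1                    # prints "8:16  1048576"
        throttled_seconds 4194304 1048576       # prints "4"

The first line of output is the string that was written to
blkio.throttle.read_bps_device above. The second shows that reading 4194304
bytes at 1048576 bytes/second should take about 4 seconds, in line with the
dd result.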

Hierarchical Cgroups
====================

Throttling implements hierarchy support; however, it is enabled if and
only if "sane_behavior" is enabled on the cgroup side, which is currently
a development option and not publicly available.

Consider a hierarchy such as the following::

			root
			/  \
		     test1 test2
			|
		     test3

Throttling with "sane_behavior" handles this hierarchy correctly: all
limits apply to the whole subtree, while all statistics are local to the
IOs directly generated by tasks in that cgroup.

Without "sane_behavior" enabled on the cgroup side, throttling practically
treats all groups as being at the same level, as if the hierarchy looked
like the following::

				pivot
			     /  /   \  \
			root  test1 test2  test3
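
A minimal sketch of creating the example hierarchy with mkdir. The variable
``CGROOT`` is an assumption of this example: on a real system it would point
at the blkio mount point used in the HOWTO (/sys/fs/cgroup/blkio), while here
it defaults to a scratch directory so the commands can be tried without
touching live cgroups::

        #!/bin/sh
        # CGROOT is a placeholder; set it to the blkio mount point on a
        # real system. By default a throwaway directory is used instead.
        CGROOT="${CGROOT:-$(mktemp -d)}"

        # test1 and test2 are children of the root group;
        # test3 is a child of test1.
        mkdir -p "$CGROOT/test1/test3" "$CGROOT/test2"

        # List the resulting groups relative to the root.
        (cd "$CGROOT" && find . -mindepth 1 -type d | sort)

In this layout, a limit set in test1 would also cover IO from tasks in test3
when hierarchy support is active, as described above.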

Various user visible config options
===================================

  CONFIG_BLK_CGROUP
	  Block IO controller.

  CONFIG_BFQ_CGROUP_DEBUG
	  Debug help. Currently, some additional stats files show up in the
	  cgroup if this option is enabled.

  CONFIG_BLK_DEV_THROTTLING
	  Enable block device throttling support in block layer.

Details of cgroup files
=======================

Proportional weight policy files
--------------------------------

  blkio.bfq.weight
	  Specifies the per-cgroup weight. This is the default weight of the
	  group on all devices unless overridden by a per-device rule
	  (see `blkio.bfq.weight_device` below).

	  The currently allowed range of weights is from 1 to 1000. For more
	  details, see Documentation/block/bfq-iosched.rst.

  blkio.bfq.weight_device
	  Specifies per-cgroup, per-device weights, overriding the default
	  group weight. For more details, see
	  Documentation/block/bfq-iosched.rst.

	  The format is as follows::

	    # echo dev_maj:dev_minor weight > blkio.bfq.weight_device

	  Configure weight=300 on /dev/sdb (8:16) in this cgroup::

	    # echo 8:16 300 > blkio.bfq.weight_device
	    # cat blkio.bfq.weight_device
	    dev     weight
	    8:16    300

	  Configure weight=500 on /dev/sda (8:0) in this cgroup::

	    # echo 8:0 500 > blkio.bfq.weight_device
	    # cat blkio.bfq.weight_device
	    dev     weight
	    8:0     500
	    8:16    300

	  Remove specific weight for /dev/sda in this cgroup::

	    # echo 8:0 0 > blkio.bfq.weight_device
	    # cat blkio.bfq.weight_device
	    dev     weight
	    8:16    300

  blkio.time
	  Disk time allocated to cgroup per device in milliseconds. First
	  two fields specify the major and minor number of the device and
	  third field specifies the disk time allocated to group in
	  milliseconds.

  blkio.sectors
	  Number of sectors transferred to/from disk by the group. First
	  two fields specify the major and minor number of the device and
	  third field specifies the number of sectors transferred by the
	  group to/from the device.

  blkio.io_service_bytes
	  Number of bytes transferred to/from the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. First two fields specify the major and minor number of the
	  device, third field specifies the operation type and the fourth field
	  specifies the number of bytes.

  blkio.io_serviced
	  Number of IOs (bio) issued to the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. First two fields specify the major and minor number of the
	  device, third field specifies the operation type and the fourth field
	  specifies the number of IOs.

  blkio.io_service_time
	  Total amount of time between request dispatch and request completion
	  for the IOs done by this cgroup. This is in nanoseconds to make it
	  meaningful for flash devices too. For devices with queue depth of 1,
	  this time represents the actual service time. When queue_depth > 1,
	  that is no longer true as requests may be served out of order. This
	  may cause the service time for a given IO to include the service time
	  of multiple IOs when served out of order, which may result in total
	  io_service_time > actual time elapsed. This time is further divided by
	  the type of operation - read or write, sync or async. First two fields
	  specify the major and minor number of the device, third field
	  specifies the operation type and the fourth field specifies the
	  io_service_time in ns.

  blkio.io_wait_time
	  Total amount of time the IOs for this cgroup spent waiting in the
	  scheduler queues for service. This can be greater than the total time
	  elapsed since it is cumulative io_wait_time for all IOs. It is not a
	  measure of total time the cgroup spent waiting but rather a measure of
	  the wait_time for its individual IOs. For devices with queue_depth > 1,
	  this metric does not include the time spent waiting for service after
	  the IO has been dispatched to the device but before it is actually
	  serviced (there might be a time lag here due to re-ordering of
	  requests by the device). This is in nanoseconds to make it meaningful
	  for flash devices too. This time is further divided by the type of
	  operation - read or write, sync or async. First two fields specify the
	  major and minor number of the device, third field specifies the
	  operation type and the fourth field specifies the io_wait_time in ns.

  blkio.io_merged
	  Total number of bios/requests merged into requests belonging to this
	  cgroup. This is further divided by the type of operation - read or
	  write, sync or async.

  blkio.io_queued
	  Total number of requests queued up at any given instant for this
	  cgroup. This is further divided by the type of operation - read or
	  write, sync or async.

  blkio.avg_queue_size
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  The average queue size for this cgroup over the entire time of this
	  cgroup's existence. Queue size samples are taken each time one of the
	  queues of this cgroup gets a timeslice.

  blkio.group_wait_time
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  This is the amount of time the cgroup had to wait since it became busy
	  (i.e., went from 0 to 1 request queued) to get a timeslice for one of
	  its queues. This is different from the io_wait_time which is the
	  cumulative total of the amount of time spent by each IO in that cgroup
	  waiting in the scheduler queue. This is in nanoseconds. If this is
	  read when the cgroup is in a waiting (for timeslice) state, the stat
	  will only report the group_wait_time accumulated till the last time it
	  got a timeslice and will not include the current delta.

  blkio.empty_time
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  This is the amount of time a cgroup spends without any pending
	  requests when not being served, i.e., it does not include any time
	  spent idling for one of the queues of the cgroup. This is in
	  nanoseconds. If this is read when the cgroup is in an empty state,
	  the stat will only report the empty_time accumulated till the last
	  time it had a pending request and will not include the current delta.

  blkio.idle_time
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  This is the amount of time spent by the IO scheduler idling for a
	  given cgroup in anticipation of a better request than the existing ones
	  from other queues/cgroups. This is in nanoseconds. If this is read
	  when the cgroup is in an idling state, the stat will only report the
	  idle_time accumulated till the last idle period and will not include
	  the current delta.

  blkio.dequeue
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
	  gives statistics about how many times a group was dequeued from
	  the service tree of the device. First two fields specify the major
	  and minor number of the device and third field specifies the number
	  of times a group was dequeued from a particular device.

  blkio.*_recursive
	  Recursive version of various stats. These files show the
	  same information as their non-recursive counterparts but
	  include stats from all the descendant cgroups.
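
The per-operation statistics files described above are documented as carrying
the device number, the operation type and a value on each line. Assuming a
"<major>:<minor> <operation> <value>" layout, they can be summarised with a
short awk filter. The sketch below is illustrative only: the function name
``blkio_sum`` and the sample numbers are invented, and lines that do not have
exactly three whitespace-separated fields are skipped rather than interpreted::

        #!/bin/sh
        # Hypothetical helper: sum the values for one device and operation.
        # usage: blkio_sum <major:minor> <operation> < stats-file
        blkio_sum() {
                awk -v dev="$1" -v op="$2" '
                        NF == 3 && $1 == dev && $2 == op { sum += $3 }
                        END { print sum + 0 }
                '
        }

        # Invented sample in the assumed layout, not real measurements.
        printf '%s\n' '8:16 Read 4096' '8:16 Write 8192' '8:0 Read 512' |
                blkio_sum 8:16 Read             # prints 4096

On a mounted hierarchy the same function could read a real file from inside
a cgroup directory, for example ``blkio_sum 8:16 Read < blkio.io_service_bytes``.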

Throttling/Upper limit policy files
-----------------------------------
  blkio.throttle.read_bps_device
	  Specifies the upper limit on the READ rate from the device. The IO
	  rate is specified in bytes per second. Rules are per device. The
	  format is as follows::

	    echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

  blkio.throttle.write_bps_device
	  Specifies the upper limit on the WRITE rate to the device. The IO
	  rate is specified in bytes per second. Rules are per device. The
	  format is as follows::

	    echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device

  blkio.throttle.read_iops_device
	  Specifies the upper limit on the READ rate from the device. The IO
	  rate is specified in IO per second. Rules are per device. The
	  format is as follows::

	    echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device

  blkio.throttle.write_iops_device
	  Specifies the upper limit on the WRITE rate to the device. The IO
	  rate is specified in IO per second. Rules are per device. The
	  format is as follows::

	    echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device

	  Note: If both BW and IOPS rules are specified for a device, then IO
	  is subject to both constraints.

  blkio.throttle.io_serviced
	  Number of IOs (bio) issued to the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. First two fields specify the major and minor number of the
	  device, third field specifies the operation type and the fourth field
	  specifies the number of IOs.

  blkio.throttle.io_service_bytes
	  Number of bytes transferred to/from the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. First two fields specify the major and minor number of the
	  device, third field specifies the operation type and the fourth field
	  specifies the number of bytes.
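
To illustrate the note that BW and IOPS rules both apply when set together,
the arithmetic below estimates which limit becomes the bottleneck when every
IO has the same size. It is a back-of-the-envelope sketch, not a description
of the kernel's throttling algorithm, and the function name ``effective_bps``
is made up for this example::

        #!/bin/sh
        # Hypothetical helper: estimated byte rate when both limits are set.
        # usage: effective_bps <bps_limit> <iops_limit> <io_size_bytes>
        effective_bps() {
                by_iops=$(( $2 * $3 ))  # byte rate allowed by the IOPS rule
                if [ "$by_iops" -lt "$1" ]; then
                        echo "$by_iops" # the IOPS rule is the tighter one
                else
                        echo "$1"       # the bandwidth rule is the tighter one
                fi
        }

        effective_bps 1048576 100 4096          # prints 409600
        effective_bps 1048576 1000 4096         # prints 1048576

With 4096-byte IOs, a 100 IOPS rule caps throughput well below a 1048576
bytes/second rule, while a 1000 IOPS rule leaves the bandwidth rule as the
effective limit.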

Common files among various policies
-----------------------------------
  blkio.reset_stats
	  Writing an int to this file resets all the stats for that cgroup.
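
A minimal sketch of resetting the statistics of one group. ``CGDIR`` is a
placeholder introduced for this example: on a real system it would name a
cgroup directory under the blkio mount point, while here it defaults to a
scratch directory so the command is harmless to try::

        #!/bin/sh
        # CGDIR is a placeholder for a cgroup directory; by default a
        # throwaway directory stands in for it.
        CGDIR="${CGDIR:-$(mktemp -d)}"

        # Any integer is accepted according to the description above;
        # 1 is used here.
        echo 1 > "$CGDIR/blkio.reset_stats"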