History log of /linux-master/drivers/block/xen-blkback/common.h
Revision Date Author Comments
# 217759bb 23-Jan-2024 Christian Brauner <brauner@kernel.org>

xen: port block device access to file

Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-11-adbd023e19cc@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>


# e3d7581c 26-Dec-2023 liyouhong <liyouhong@kylinos.cn>

drivers/block/xen-blkback/common.h: Fix spelling typo in comment

Fix spelling typo in comment.

Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20231226095701.172080-1-liyouhong@kylinos.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 436d3705 27-Sep-2023 Jan Kara <jack@suse.cz>

xen/blkback: Convert to bdev_open_by_dev()

Convert xen/blkback to use bdev_open_by_dev() and pass the
handle around.

CC: xen-devel@lists.xenproject.org
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230927093442.25915-7-jack@suse.cz
Signed-off-by: Christian Brauner <brauner@kernel.org>


# cbfac770 16-Dec-2022 Juergen Gross <jgross@suse.com>

xen/blkback: move blkif_get_x86_*_req() into blkback.c

There is no need to have the functions blkif_get_x86_32_req() and
blkif_get_x86_64_req() in a header file, as they are used in one place
only.

So move them into the using source file and drop the inline qualifier.

While at it fix some style issues, and simplify the code by variable
reusing and using min() instead of open coding it.

Instead of using barrier() use READ_ONCE() for avoiding multiple reads
of nr_segments.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>


# 656f3c1d 12-Dec-2022 Juergen Gross <jgross@suse.com>

xen/blkback: remove stale prototype

There is no function xen_blkif_purge_persistent(), so remove its
prototype from common.h.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>


# 6935321e 12-Dec-2022 Juergen Gross <jgross@suse.com>

xen/blkback: fix white space code style issues

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>


# 06ba5d2e 31-Aug-2022 SeongJae Park <sj@kernel.org>

xen-blkback: Advertise feature-persistent as user requested

The advertisement of the persistent grants feature (writing
'feature-persistent' to xenbus) should mean not the decision for using
the feature but only the availability of the feature. However, commit
aac8a70db24b ("xen-blkback: add a parameter for disabling of persistent
grants") made a field of blkback, which was a place for saving only the
negotiation result, to be used for yet another purpose: caching of the
'feature_persistent' parameter value. As a result, the advertisement,
which should follow only the parameter value, becomes inconsistent.

This commit fixes the misuse of the semantic by making blkback saves the
parameter value in a separate place and advertises the support based on
only the saved value.

Fixes: aac8a70db24b ("xen-blkback: add a parameter for disabling of persistent grants")
Cc: <stable@vger.kernel.org> # 5.10.x
Suggested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lore.kernel.org/r/20220831165824.94815-2-sj@kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>


# d75e7f63 02-Feb-2021 Paul Durrant <pdurrant@amazon.com>

xen-blkback: fix compatibility bug with single page rings

Prior to commit 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid
inconsistent xenstore 'ring-page-order' set by malicious blkfront"), the
behaviour of xen-blkback when connecting to a frontend was:

- read 'ring-page-order'
- if not present then expect a single page ring specified by 'ring-ref'
- else expect a ring specified by 'ring-refX' where X is between 0 and
1 << ring-page-order

This was correct behaviour, but was broken by the afforementioned commit to
become:

- read 'ring-page-order'
- if not present then expect a single page ring (i.e. ring-page-order = 0)
- expect a ring specified by 'ring-refX' where X is between 0 and
1 << ring-page-order
- if that didn't work then see if there's a single page ring specified by
'ring-ref'

This incorrect behaviour works most of the time but fails when a frontend
that sets 'ring-page-order' is unloaded and replaced by one that does not
because, instead of reading 'ring-ref', xen-blkback will read the stale
'ring-ref0' left around by the previous frontend will try to map the wrong
grant reference.

This patch restores the original behaviour.

Fixes: 4a8c31a1c6f5 ("xen/blkback: rework connect_ring() to avoid inconsistent xenstore 'ring-page-order' set by malicious blkfront")
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: "Roger Pau Monné" <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20210202175659.18452-1-paul@xen.org
Signed-off-by: Juergen Gross <jgross@suse.com>


# ca33479c 07-Dec-2020 Juergen Gross <jgross@suse.com>

xen: add helpers for caching grant mapping pages

Instead of having similar helpers in multiple backend drivers use
common helpers for caching pages allocated via gnttab_alloc_pages().

Make use of those helpers in blkback and scsiback.

Cc: <stable@vger.kernel.org> # 5.9
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovksy@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>


# a782483c 26-Nov-2020 Christoph Hellwig <hch@lst.de>

block: remove the nr_sects field in struct hd_struct

Now that the hd_struct always has a block device attached to it, there is
no need for having two size field that just get out of sync.

Additionally the field in hd_struct did not use proper serialization,
possibly allowing for torn writes. By only using the block_device field
this problem also gets fixed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: Chao Yu <yuchao0@huawei.com> [f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ca15ca40 07-Aug-2020 Mike Rapoport <rppt@kernel.org>

mm: remove unneeded includes of <asm/pgalloc.h>

Patch series "mm: cleanup usage of <asm/pgalloc.h>"

Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table. These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.

In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>

In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.

This patch (of 8):

In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory. Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.

As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.

The process was somewhat automated using

sed -i -E '/[<"]asm\/pgalloc\.h/d' \
$(grep -L -w -f /tmp/xx \
$(git grep -E -l '[<"]asm/pgalloc\.h'))

where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.

[rppt@linux.ibm.com: fix powerpc warning]

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# cb9369bd 27-Jan-2020 SeongJae Park <sjpark@amazon.de>

xen/blkback: Squeeze page pools if a memory pressure is detected

Each `blkif` has a free pages pool for the grant mapping. The size of
the pool starts from zero and is increased on demand while processing
the I/O requests. If current I/O requests handling is finished or 100
milliseconds has passed since last I/O requests handling, it checks and
shrinks the pool to not exceed the size limit, `max_buffer_pages`.

Therefore, host administrators can cause memory pressure in blkback by
attaching a large number of block devices and inducing I/O. Such
problematic situations can be avoided by limiting the maximum number of
devices that can be attached, but finding the optimal limit is not so
easy. Improper set of the limit can results in memory pressure or a
resource underutilization. This commit avoids such problematic
situations by squeezing the pools (returns every free page in the pool
to the system) for a while (users can set this duration via a module
parameter) if memory pressure is detected.

Discussions
===========

The `blkback`'s original shrinking mechanism returns only pages in the
pool which are not currently be used by `blkback` to the system. In
other words, the pages that are not mapped with granted pages. Because
this commit is changing only the shrink limit but still uses the same
freeing mechanism it does not touch pages which are currently mapping
grants.

Once memory pressure is detected, this commit keeps the squeezing limit
for a user-specified time duration. The duration should be neither too
long nor too short. If it is too long, the squeezing incurring overhead
can reduce the I/O performance. If it is too short, `blkback` will not
free enough pages to reduce the memory pressure. This commit sets the
value as `10 milliseconds` by default because it is a short time in
terms of I/O while it is a long time in terms of memory operations.
Also, as the original shrinking mechanism works for at least every 100
milliseconds, this could be a somewhat reasonable choice. I also tested
other durations (refer to the below section for more details) and
confirmed that 10 milliseconds is the one that works best with the test.
That said, the proper duration depends on actual configurations and
workloads. That's why this commit allows users to set the duration as a
module parameter.

Memory Pressure Test
====================

To show how this commit fixes the memory pressure situation well, I
configured a test environment on a xen-running virtualization system.
On the `blkfront` running guest instances, I attach a large number of
network-backed volume devices and induce I/O to those. Meanwhile, I
measure the number of pages that swapped in (pswpin) and out (pswpout)
on the `blkback` running guest. The test ran twice, once for the
`blkback` before this commit and once for that after this commit. As
shown below, this commit has dramatically reduced the memory pressure:

pswpin pswpout
before 76,672 185,799
after 867 3,967

Optimal Aggressive Shrinking Duration
-------------------------------------

To find a best squeezing duration, I repeated the test with three
different durations (1ms, 10ms, and 100ms). The results are as below:

duration pswpin pswpout
1 707 5,095
10 867 3,967
100 362 3,348

As expected, the memory pressure decreases as the duration increases,
but the reduction become slow from the `10ms`. Based on this results, I
chose the default duration as 10ms.

Performance Overhead Test
=========================

This commit could incur I/O performance degradation under severe memory
pressure because the squeezing will require more page allocations per
I/O. To show the overhead, I artificially made a worst-case squeezing
situation and measured the I/O performance of a `blkfront` running
guest.

For the artificial squeezing, I set the `blkback.max_buffer_pages` using
the `/sys/module/xen_blkback/parameters/max_buffer_pages` file. In this
test, I set the value to `1024` and `0`. The `1024` is the default
value. Setting the value as `0` is same to a situation doing the
squeezing always (worst-case).

If the underlying block device is slow enough, the squeezing overhead
could be hidden. For the reason, I use a fast block device, namely the
rbd[1]:

# xl block-attach guest phy:/dev/ram0 xvdb w

For the I/O performance measurement, I run a simple `dd` command 5 times
directly to the device as below and collect the 'MB/s' results.

$ for i in {1..5}; do dd if=/dev/zero of=/dev/xvdb \
bs=4k count=$((256*512)); sync; done

The results are as below. 'max_pgs' represents the value of the
`blkback.max_buffer_pages` parameter.

max_pgs Min Max Median Avg Stddev
0 417 423 420 419.4 2.5099801
1024 414 425 416 417.8 4.4384682
No difference proven at 95.0% confidence

In short, even worst case squeezing on ramdisk based fast block device
makes no visible performance degradation. Please note that this is just
a very simple and minimal test. On systems using super-fast block
devices and a special I/O workload, the results might be different. If
you have any doubt, test on your machine with your workload to find the
optimal squeezing duration for you.

[1] https://www.kernel.org/doc/html/latest/admin-guide/blockdev/ramdisk.html

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>


# 14855954 02-Dec-2019 Paul Durrant <pdurrant@amazon.com>

xen-blkback: allow module to be cleanly unloaded

Add a module_exit() to perform the necessary clean-up.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: "Roger Pau Monné" <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>


# 6f2f39ad 13-Aug-2018 Juergen Gross <jgross@suse.com>

xen/blkback: remove unused pers_gnts_lock from struct xen_blkif_ring

pers_gnts_lock isn't being used anywhere. Remove it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# d77ff24e 13-Aug-2018 Juergen Gross <jgross@suse.com>

xen/blkback: move persistent grants flags to bool

The struct persistent_gnt flags member is meant to be a bitfield of
different flags. There is only PERSISTENT_GNT_ACTIVE flag left, so
convert it to a bool named "active".

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 973e5405 13-Aug-2018 Juergen Gross <jgross@suse.com>

xen/blkback: don't keep persistent grants too long

Persistent grants are allocated until a threshold per ring is being
reached. Those grants won't be freed until the ring is being destroyed
meaning there will be resources kept busy which might no longer be
used.

Instead of freeing only persistent grants until the threshold is
reached add a timestamp and remove all persistent grants not having
been in use for a minute.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 089bc014 13-Jun-2017 Jan Beulich <jbeulich@suse.com>

xen-blkback: don't leak stack data via response ring

Rather than constructing a local structure instance on the stack, fill
the fields directly on the shared ring, just like other backends do.
Build on the fact that all response structure flavors are actually
identical (the old code did make this assumption too).

This is XSA-216.

Cc: stable@vger.kernel.org

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 46464411 18-May-2017 Juergen Gross <jgross@suse.com>

xen/blkback: fix disconnect while I/Os in flight

Today disconnecting xen-blkback is broken in case there are still
I/Os in flight: xen_blkif_disconnect() will bail out early without
releasing all resources in the hope it will be called again when
the last request has terminated. This, however, won't happen as
xen_blkif_free() won't be called on termination of the last running
request: xen_blkif_put() won't decrement the blkif refcnt to 0 as
xen_blkif_disconnect() didn't finish before thus some xen_blkif_put()
calls in xen_blkif_disconnect() didn't happen.

To solve this deadlock xen_blkif_disconnect() and
xen_blkif_alloc_rings() shouldn't use xen_blkif_put() and
xen_blkif_get() but use some other way to do their accounting of
resources.

This at once fixes another error in xen_blkif_disconnect(): when it
returned early with -EBUSY for another ring than 0 it would call
xen_blkif_put() again for already handled rings on a subsequent call.
This will lead to inconsistencies in the refcnt handling.

Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Steven Haigh <netwiz@crc.id.au>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# db6fbc10 08-Dec-2015 Bob Liu <bob.liu@oracle.com>

xen/blkback: make st_ statistics per ring

Make st_* statistics per ring and the VBD sysfs would iterate over all the
rings.

Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all rings
are torn down, so it's safe.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Aligned the variables on the same column.


# d4bf0065 13-Nov-2015 Bob Liu <bob.liu@oracle.com>

xen/blkback: make pool of persistent grants and free pages per-queue

Make pool of persistent grants and free pages per-queue/ring instead of
per-device to get better scalability.

Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Results:
iops1: After patch "xen/blkfront: make persistent grants per-queue".
iops2: After this patch.

Queues: 1 4 8 16
Iops orig(k): 810 1064 780 700
Iops1(k): 810 1230(~20%) 1024(~20%) 850(~20%)
Iops2(k): 810 1410(~35%) 1354(~75%) 1440(~100%)

With 4 queues after this commit we can get ~75% increase in IOPS, and
performance won't drop if increasing queue numbers.

Please find the respective chart in this link:
https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# d62d8600 13-Nov-2015 Bob Liu <bob.liu@oracle.com>

xen/blkback: get the number of hardware queues/rings from blkfront

Backend advertises "multi-queue-max-queues" to front, also get the negotiated
number from "multi-queue-num-queues" written by blkfront.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 2fb1ef4f 10-Dec-2015 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: pseudo support for multi hardware queues/rings

Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in
"xen/blkback: get the number of hardware queues/rings from blkfront".

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align variables in the structures.


# 59795700 13-Nov-2015 Bob Liu <bob.liu@oracle.com>

xen/blkback: separate ring information out of struct xen_blkif

Split per ring information to an new structure "xen_blkif_ring", so that one vbd
device can be associated with one or more rings/hardware queues.

Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we
may have multi backend threads.

This patch is a preparation for supporting multi hardware queues/rings.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align the variables in the structure.


# 1f13d75c 03-Nov-2015 Roger Pau Monné <roger.pau@citrix.com>

xen-blkback: only read request operation from shared ring once

A compiler may load a switch statement value multiple times, which could
be bad when the value is in memory shared with the frontend.

When converting a non-native request to a native one, ensure that
src->operation is only loaded once by using READ_ONCE().

This is part of XSA155.

CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 67de5dfb 05-May-2015 Julien Grall <julien.grall@citrix.com>

block/xen-blkback: Make it running on 64KB page granularity

The PV block protocol is using 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity behaving as a
block backend on a non-modified Xen.

It's only necessary to adapt the ring size and the number of request per
indirect frames. The rest of the code is relying on the grant table
code.

Note that the grant table code is allocating a Linux page per grant
which will result to waste 6OKB for every grant when Linux is using 64KB
page granularity. This could be improved by sharing the page between
multiple grants.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: "Roger Pau Monné" <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>


# 6684fa1c 17-Jun-2015 Julien Grall <julien.grall@linaro.org>

block/xen-blkback: s/nr_pages/nr_segs/

Make the code less confusing to read now that Linux may not have the
same page size as Xen.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>


# 86839c56 02-Jun-2015 Bob Liu <bob.liu@oracle.com>

xen/block: add multi-page ring support

Extend xen/block to support multi-page ring, so that more requests can be
issued by using more than one pages as the request ring between blkfront
and backend.
As a result, the performance can get improved significantly.

We got some impressive improvements on our highend iscsi storage cluster
backend. If using 64 pages as the ring, the IOPS increased about 15 times
for the throughput testing and above doubled for the latency testing.

The reason was the limit on outstanding requests is 32 if use only one-page
ring, but in our case the iscsi lun was spread across about 100 physical
drives, 32 was really not enough to keep them busy.

Changes in v2:
- Rebased to 4.0-rc6.
- Document on how multi-page ring feature working to linux io/blkif.h.

Changes in v3:
- Remove changes to linux io/blkif.h and follow the protocol defined
in io/blkif.h of XEN tree.
- Rebased to 4.1-rc3

Changes in v4:
- Turn to use 'ring-page-order' and 'max-ring-page-order'.
- A few comments from Roger.

Changes in v5:
- Clarify with 4k granularity to comment
- Address more comments from Roger

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 69b91ede 02-Jun-2015 Bob Liu <bob.liu@oracle.com>

drivers: xen-blkback: delay pending_req allocation to connect_ring

This is a pre-patch for multi-page ring feature.
In connect_ring, we can know exactly how many pages are used for the shared
ring, delay pending_req allocation here so that we won't waste too much memory.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 77387b82 01-Apr-2015 Tao Chen <boby.chen@huawei.com>

xen-blkback: define pr_fmt macro to avoid the duplication of DRV_PFX

Define pr_fmt macro with {xen-blkback: } prefix, then remove all use
of DRV_PFX in the pr sentences. Replace all DPRINTK with pr sentences,
and get rid of DPRINTK macro. It will simplify the code.

And if the pr sentences miss a \n, add it in the end. If the DPRINTK
sentences have redundant \n, remove it. It will format the code.

These all make the readability of the code become better.

Signed-off-by: Tao Chen <boby.chen@huawei.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>


# b042a3ca 05-Feb-2015 David Vrabel <david.vrabel@citrix.com>

xen-blkback: default to X86_32 ABI on x86

Prior to the existance of 64-bit backends using the X86_64 ABI,
frontends used the X86_32 ABI. These old frontends do not specify the
ABI and when used with a 64-bit backend do not work.

On x86, default to the X86_32 ABI if one is not specified. Backends
on ARM continue to default to their NATIVE ABI.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>


# c43cf3ea 05-Jan-2015 Jennifer Herbert <jennifer.herbert@citrix.com>

xen-blkback: safely unmap grants in case they are still in use

Use gnttab_unmap_refs_async() to wait until the mapped pages are no
longer in use before unmapping them.

This allows blkback to use network storage which may retain refs to
pages in queued skbs after the block I/O has completed.

Signed-off-by: Jennifer Herbert <jennifer.herbert@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jens Axboe <axboe@kernel.de>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>


# 814d04e7 20-May-2014 Valentin Priescu <priescuv@amazon.com>

xen-blkback: defer freeing blkif to avoid blocking xenwatch

Currently xenwatch blocks in VBD disconnect, waiting for all pending I/O
requests to finish. If the VBD is attached to a hot-swappable disk, then
xenwatch can hang for a long period of time, stalling other watches.

INFO: task xenwatch:39 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ffff880057f01bd0 0000000000000246 ffff880057f01ac0 ffffffff810b0782
ffff880057f01ad0 00000000000131c0 0000000000000004 ffff880057edb040
ffff8800344c6080 0000000000000000 ffff880058c00ba0 ffff880057edb040
Call Trace:
[<ffffffff810b0782>] ? irq_to_desc+0x12/0x20
[<ffffffff8128f761>] ? list_del+0x11/0x40
[<ffffffff8147a080>] ? wait_for_common+0x60/0x160
[<ffffffff8147bcef>] ? _raw_spin_lock_irqsave+0x2f/0x50
[<ffffffff8147bd49>] ? _raw_spin_unlock_irqrestore+0x19/0x20
[<ffffffff8147a26a>] schedule+0x3a/0x60
[<ffffffffa018fe6a>] xen_blkif_disconnect+0x8a/0x100 [xen_blkback]
[<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
[<ffffffffa018ffce>] xen_blkbk_remove+0xae/0x1e0 [xen_blkback]
[<ffffffff8130b254>] xenbus_dev_remove+0x44/0x90
[<ffffffff81345cb7>] __device_release_driver+0x77/0xd0
[<ffffffff81346488>] device_release_driver+0x28/0x40
[<ffffffff813456e8>] bus_remove_device+0x78/0xe0
[<ffffffff81342c9f>] device_del+0x12f/0x1a0
[<ffffffff81342d2d>] device_unregister+0x1d/0x60
[<ffffffffa0190826>] frontend_changed+0xa6/0x4d0 [xen_blkback]
[<ffffffffa019c252>] ? frontend_changed+0x192/0x650 [xen_netback]
[<ffffffff8130ae50>] ? cmp_dev+0x60/0x60
[<ffffffff81344fe4>] ? bus_for_each_dev+0x94/0xa0
[<ffffffff8130b06e>] xenbus_otherend_changed+0xbe/0x120
[<ffffffff8130b4cb>] frontend_changed+0xb/0x10
[<ffffffff81309c82>] xenwatch_thread+0xf2/0x130
[<ffffffff81079f70>] ? wake_up_bit+0x40/0x40
[<ffffffff81309b90>] ? xenbus_directory+0x80/0x80
[<ffffffff810799d6>] kthread+0x96/0xa0
[<ffffffff81485934>] kernel_thread_helper+0x4/0x10
[<ffffffff814839f3>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff8147c17c>] ? retint_restore_args+0x5/0x6
[<ffffffff81485930>] ? gs_change+0x13/0x13

With this patch, when there is still pending I/O, the actual disconnect
is done by the last reference holder (last pending I/O request). In this
case, xenwatch doesn't block indefinitely.

Signed-off-by: Valentin Priescu <priescuv@amazon.com>
Reviewed-by: Steven Kady <stevkady@amazon.com>
Reviewed-by: Steven Noonan <snoonan@amazon.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# abb97b8c 11-Feb-2014 Roger Pau Monne <roger.pau@citrix.com>

xen-blkback: init persistent_purge_work work_struct

Initialize persistent_purge_work work_struct on xen_blkif_alloc (and
remove the previous initialization done in purge_persistent_gnt). This
prevents flush_work from complaining even if purge_persistent_gnt has
not been used.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 80bfa2f6 04-Feb-2014 Roger Pau Monne <roger.pau@citrix.com>

xen-blkif: drop struct blkif_request_segment_aligned

This was wrongly introduced in commit 402b27f9, the only difference
between blkif_request_segment_aligned and blkif_request_segment is
that the former has a named padding, while both share the same
memory layout.

Also correct a few minor glitches in the description, including for it
to no longer assume PAGE_SIZE == 4096.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[Description fix by Jan Beulich]
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reported-by: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# c05f3e3c 04-Feb-2014 Roger Pau Monne <roger.pau@citrix.com>

xen-blkback: fix shutdown race

Introduce a new variable to keep track of the number of in-flight
requests. We need to make sure that when xen_blkif_put is called the
request has already been freed and we can safely free xen_blkif, which
was not the case before.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Reviewed-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# ef753411 04-Feb-2014 Roger Pau Monne <roger.pau@citrix.com>

xen-blkback: fix memory leaks

I've at least identified two possible memory leaks in blkback, both
related to the shutdown path of a VBD:

- blkback doesn't wait for any pending purge work to finish before
cleaning the list of free_pages. The purge work will call
put_free_pages and thus we might end up with pages being added to
the free_pages list after we have emptied it. Fix this by making
sure there's no pending purge work before exiting
xen_blkif_schedule, and moving the free_page cleanup code to
xen_blkif_free.
- blkback doesn't wait for pending requests to end before cleaning
persistent grants and the list of free_pages. Again this can add
pages to the free_pages list or persistent grants to the
persistent_gnts red-black tree. Fixed by moving the persistent
grants and free_pages cleanup code to xen_blkif_free.

Also, add some checks in xen_blkif_free to make sure we are cleaning
everything.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Matt Rushton <mrushton@amazon.com>
Reviewed-by: Matt Rushton <mrushton@amazon.com>
Cc: Matt Wilson <msw@amazon.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 8e3f8755 23-Jan-2013 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Check for insane amounts of request on the ring (v6).

Check that the ring does not have an insane amount of requests
(more than there could fit on the ring).

If we detect this case we will stop processing the requests
and wait until the XenBus disconnects the ring.

The existing check RING_REQUEST_CONS_OVERFLOW which checks for how
many responses we have created in the past (rsp_prod_pvt) vs
requests consumed (req_cons) and whether said difference is greater or
equal to the size of the ring, does not catch this case.

Wha the condition does check if there is a need to process more
as we still have a backlog of responses to finish. Note that both
of those values (rsp_prod_pvt and req_cons) are not exposed on the
shared ring.

To understand this problem a mini crash course in ring protocol
response/request updates is in place.

There are four entries: req_prod and rsp_prod; req_event and rsp_event
to track the ring entries. We are only concerned about the first two -
which set the tone of this bug.

The req_prod is a value incremented by frontend for each request put
on the ring. Conversely the rsp_prod is a value incremented by the backend
for each response put on the ring (rsp_prod gets set by rsp_prod_pvt when
pushing the responses on the ring). Both values can
wrap and are modulo the size of the ring (in block case that is 32).
Please see RING_GET_REQUEST and RING_GET_RESPONSE for the more details.

The culprit here is that if the difference between the
req_prod and req_cons is greater than the ring size we have a problem.
Fortunately for us, the '__do_block_io_op' loop:

rc = blk_rings->common.req_cons;
rp = blk_rings->common.sring->req_prod;

while (rc != rp) {

..
blk_rings->common.req_cons = ++rc; /* before make_response() */

}

will loop up to the point when rc == rp. The macros inside of the
loop (RING_GET_REQUEST) is smart and is indexing based on the modulo
of the ring size. If the frontend has provided a bogus req_prod value
we will loop until the 'rc == rp' - which means we could be processing
already processed requests (or responses) often.

The reason the RING_REQUEST_CONS_OVERFLOW is not helping here is
b/c it only tracks how many responses we have internally produced
and whether we would should process more. The astute reader will
notice that the macro RING_REQUEST_CONS_OVERFLOW provides two
arguments - more on this later.

For example, if we were to enter this function with these values:

blk_rings->common.sring->req_prod = X+31415 (X is the value from
the last time __do_block_io_op was called).
blk_rings->common.req_cons = X
blk_rings->common.rsp_prod_pvt = X

The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons)
is doing:

req_cons - rsp_prod_pvt >= 32

Which is,
X - X >= 32 or 0 >= 32

And that is false, so we continue on looping (this bug).

If we re-use said macro RING_REQUEST_CONS_OVERFLOW and pass in the rp
instead (sring->req_prod) of rc, the this macro can do the check:

req_prod - rsp_prov_pvt >= 32

Which is,
X + 31415 - X >= 32 , or 31415 >= 32

which is true, so we can error out and break out of the function.

Unfortunatly the difference between rsp_prov_pvt and req_prod can be
at 32 (which would error out in the macro). This condition exists when
the backend is lagging behind with the responses and still has not finished
responding to all of them (so make_response has not been called), and
the rsp_prov_pvt + 32 == req_cons. This ends up with us not being able
to use said macro.

Hence introducing a new macro called RING_REQUEST_PROD_OVERFLOW which does
a simple check of:

req_prod - rsp_prod_pvt > RING_SIZE

And with the X values from above:

X + 31415 - X > 32

Returns true. Also not that if the ring is full (which is where
the RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the
same condition:

X + 32 - X > 32

Which is false.

Lets use that macro.
Note that in v5 of this patchset the macro was different - we used an
earlier version.

Cc: stable@vger.kernel.org
[v1: Move the check outside the loop]
[v2: Add a pr_warn as suggested by David]
[v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan]
[v4: Move wake_up after kthread_stop as suggested by Jan]
[v5: Use RING_REQUEST_PROD_OVERFLOW instead]
[v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

gadsa


# bb642e83 02-May-2013 Roger Pau Monne <roger.pau@citrix.com>

xen-blkback: allocate list of pending reqs in small chunks

Allocate pending requests in smaller chunks instead of allocating them
all at the same time.

This change also removes the global array of pending_reqs, it is no
longer necessay.

Variables related to the grant mapping have been grouped into a struct
called "grant_page", this allows to allocate them in smaller chunks,
and also improves memory locality.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 402b27f9 18-Apr-2013 Roger Pau Monne <roger.pau@citrix.com>

xen-block: implement indirect descriptors

Indirect descriptors introduce a new block operation
(BLKIF_OP_INDIRECT) that passes grant references instead of segments
in the request. This grant references are filled with arrays of
blkif_request_segment_aligned, this way we can send more segments in a
request.

The proposed implementation sets the maximum number of indirect grefs
(frames filled with blkif_request_segment_aligned) to 256 in the
backend and 32 in the frontend. The value in the frontend has been
chosen experimentally, and the backend value has been set to a sane
value that allows expanding the maximum number of indirect descriptors
in the frontend if needed.

The migration code has changed from the previous implementation, in
which we simply remapped the segments on the shared ring. Now the
maximum number of segments allowed in a request can change depending
on the backend, so we have to requeue all the requests in the ring and
in the queue and split the bios in them if they are bigger than the
new maximum number of segments.

[v2: Fixed minor comments by Konrad.
[v1: Added padding to make the indirect request 64bit aligned.
Added some BUGs, comments; fixed number of indirect pages in
blkif_get_x86_{32/64}_req. Added description about the indirect operation
in blkif.h]
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[v3: Fixed spaces and tabs mix ups]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# bf0720c4 17-Apr-2013 Roger Pau Monne <roger.pau@citrix.com>

xen-blkback: make the queue of free requests per backend

Remove the last dependency from blkbk by moving the list of free
requests to blkif. This change reduces the contention on the list of
available requests.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 3f3aad5e 17-Apr-2013 Roger Pau Monne <roger.pau@citrix.com>

xen-blkback: implement LRU mechanism for persistent grants

This mechanism allows blkback to change the number of grants
persistently mapped at run time.

The algorithm uses a simple LRU mechanism that removes (if needed) the
persistent grants that have not been used since the last LRU run, or
if all grants have been used it removes the first grants in the list
(that are not in use).

The algorithm allows the user to change the maximum number of
persistent grants, by changing max_persistent_grants in sysfs.

Since we are storing the persistent grants used inside the request
struct (to be able to mark them as "unused" when unmapping), we no
longer need the bitmap (unmap_seg).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# c6cc142d 17-Apr-2013 Roger Pau Monne <roger.pau@citrix.com>

xen-blkback: use balloon pages for all mappings

Using balloon pages for all granted pages allows us to simplify the
logic in blkback, especially in the xen_blkbk_map function, since now
we can decide if we want to map a grant persistently or not after we
have actually mapped it. This could not be done before because
persistent grants used ballooned pages, whereas non-persistent grants
used pages from the kernel.

This patch also introduces several changes, the first one is that the
list of free pages is no longer global, now each blkback instance has
it's own list of free pages that can be used to map grants. Also, a
run time parameter (max_buffer_pages) has been added in order to tune
the maximum number of free pages each blkback instance will keep in
it's buffer.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# ffb1dabd 18-Mar-2013 Roger Pau Monne <roger.pau@citrix.com>

xen-blkback: don't store dev_bus_addr

dev_bus_addr returned in the grant ref map operation is the mfn of the
passed page, there's no need to store it in the persistent grant
entry, since we can always get it provided that we have the page.

This reduces the memory overhead of persistent grants in blkback.

While at it, rename the 'seg[i].buf' to be 'seg[i].offset' as
it makes much more sense - as we use that value in bio_add_page
which as the fourth argument expects the offset.

We hadn't used the physical address as part of this at all.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
[v1: s/buf/offset/]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 986cacbd 11-Mar-2013 Zoltan Kiss <zoltan.kiss@citrix.com>

xen/blkback: Change statistics counter types to unsigned

These values shouldn't be negative, but after an overflow their value
can turn into negative, if they are signed. xentop can show bogus
values in this case.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Reported-by: Ichiro Ogino <ichiro.ogino@citrix.co.jp>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 0e367ae4 07-Mar-2013 David Vrabel <david.vrabel@citrix.com>

xen/blkback: correctly respond to unknown, non-native requests

If the frontend is using a non-native protocol (e.g., a 64-bit
frontend with a 32-bit backend) and it sent an unrecognized request,
the request was not translated and the response would have the
incorrect ID. This may cause the frontend driver to behave
incorrectly or crash.

Since the ID field in the request is always in the same place,
regardless of the request type we can get the correct ID and make a
valid response (which will report BLKIF_RSP_EOPNOTSUPP).

This bug affected 64-bit SLES 11 guests when using a 32-bit backend.
This guest does a BLKIF_OP_RESERVED_1 (BLKIF_OP_PACKET in the SLES
source) and would crash in blkif_int() as the ID in the response would
be invalid.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 0a8704a5 24-Oct-2012 Roger Pau Monne <roger.pau@citrix.com>

xen/blkback: Persistent grant maps for xen blk drivers

This patch implements persistent grants for the xen-blk{front,back}
mechanism. The effect of this change is to reduce the number of unmap
operations performed, since they cause a (costly) TLB shootdown. This
allows the I/O performance to scale better when a large number of VMs
are performing I/O.

Previously, the blkfront driver was supplied a bvec[] from the request
queue. This was granted to dom0; dom0 performed the I/O and wrote
directly into the grant-mapped memory and unmapped it; blkfront then
removed foreign access for that grant. The cost of unmapping scales
badly with the number of CPUs in Dom0. An experiment showed that when
Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
(at which point 650,000 IOPS are being performed in total). If more
than 5 guests are used, the performance declines. By 10 guests, only
400,000 IOPS are being performed.

This patch improves performance by only unmapping when the connection
between blkfront and back is broken.

On startup blkfront notifies blkback that it is using persistent
grants, and blkback will do the same. If blkback is not capable of
persistent mapping, blkfront will still use the same grants, since it
is compatible with the previous protocol, and simplifies the code
complexity in blkfront.

To perform a read, in persistent mode, blkfront uses a separate pool
of pages that it maps to dom0. When a request comes in, blkfront
transmutes the request so that blkback will write into one of these
free pages. Blkback keeps note of which grefs it has already
mapped. When a new ring request comes to blkback, it looks to see if
it has already mapped that page. If so, it will not map it again. If
the page hasn't been previously mapped, it is mapped now, and a record
is kept of this mapping. Blkback proceeds as usual. When blkfront is
notified that blkback has completed a request, it memcpy's from the
shared memory, into the bvec supplied. A record that the {gref, page}
tuple is mapped, and not inflight is kept.

Writes are similar, except that the memcpy is peformed from the
supplied bvecs, into the shared pages, before the request is put onto
the ring.

Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known apriori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page, for every gref we receive. This
operation takes O(log n) time in the worst case. In blkfront grants
are stored using a single linked list.

The maximum number of grants that blkback will persistenly map is
currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
prevent a malicios guest from attempting a DoS, by supplying fresh
grefs, causing the Dom0 kernel to map excessively. If a guest
is using persistent grants and exceeds the maximum number of grants to
map persistenly the newly passed grefs will be mapped and unmaped.
Using this approach, we can have requests that mix persistent and
non-persistent grants, and we need to handle them correctly.
This allows us to set the maximum number of persistent grants to a
lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
setting it will lead to unpredictable performance.

In writing this patch, the question arrises as to if the additional
cost of performing memcpys in the guest (to/from the pool of granted
pages) outweigh the gains of not performing TLB shootdowns. The answer
to that question is `no'. There appears to be very little, if any
additional cost to the guest of using persistent grants. There is
perhaps a small saving, from the reduced number of hypercalls
performed in granting, and ending foreign access.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Fixed up the misuse of bool as int]


# 1f999572 21-Sep-2012 Oliver Chick <oliver.chick@citrix.com>

xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool

Changing the type of bdev parameters to be unsigned int :1, rather than bool.
This is more consistent with the types of other features in the block drivers.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8c9ce606 25-May-2012 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Copy id field when doing BLKIF_DISCARD.

We weren't copying the id field so when we sent the response
back to the frontend (especially with a 64-bit host and 32-bit
guest), we ended up using a random value. This lead to the
frontend crashing as it would try to pass to __blk_end_request_all
a NULL 'struct request' (b/c it would use the 'id' to find the
proper 'struct request' in its shadow array) and end up crashing:

BUG: unable to handle kernel NULL pointer dereference at 000000e4
IP: [<c0646d4c>] __blk_end_request_all+0xc/0x40
.. snip..
EIP is at __blk_end_request_all+0xc/0x40
.. snip..
[<ed95db72>] blkif_interrupt+0x172/0x330 [xen_blkfront]

This fixes the bug by passing in the proper id for the response.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=824641

CC: stable@kernel.org
Tested-by: William Dauchy <wdauchy@gmail.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 4dae7670 13-Mar-2012 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Squash the discard support for 'file' and 'phy' type.

The only reason for the distinction was for the special case of
'file' (which is assumed to be loopback device), was to reach inside
the loopback device, find the underlaying file, and call fallocate on it.
Fortunately "xen-blkback: convert hole punching to discard request on
loop devices" removes that use-case and we now based the discard
support based on blk_queue_discard(q) and extract all appropriate
parameters from the 'struct request_queue'.

CC: Li Dongyang <lidongyang@novell.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
[v1: Dropping pointless initializer and keeping blank line]
[v2: Remove the kfree as it is not used anymore]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 5ea42986 12-Oct-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blk[front|back]: Enhance discard support with secure erasing support.

Part of the blkdev_issue_discard(xx) operation is that it can also
issue a secure discard operation that will permanantly remove the
sectors in question. We advertise that we can support that via the
'discard-secure' attribute and on the request, if the 'secure' bit
is set, we will attempt to pass in REQ_DISCARD | REQ_SECURE.

CC: Li Dongyang <lidongyang@novell.com>
[v1: Used 'flag' instead of 'secure:1' bit]
[v2: Use 'reserved' uint8_t instead of adding a new value]
[v3: Check for nseg when mapping instead of operation]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 97e36834 11-Oct-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together

In a union type structure to deal with the overlapping
attributes in a easier manner.

Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 2d073846 29-Sep-2011 David Vrabel <david.vrabel@citrix.com>

block: xen-blkback: use API provided by xenbus module to map rings

The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the ring pages granted by
the frontend.

Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 29bde093 09-Oct-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.

We emulate the barrier requests by draining the outstanding bio's
and then sending the WRITE_FLUSH command. To drain the I/Os
we use the refcnt that is used during disconnect to wait for all
the I/Os before disconnecting from the frontend. We latch on its
value and if it reaches either the threshold for disconnect or when
there are no more outstanding I/Os, then we have drained all I/Os.

Suggested-by: Christopher Hellwig <hch@infradead.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# c555aab9 14-Aug-2011 Joe Jin <joe.jin@oracle.com>

xen-blkback: fixed indentation and comments

This patch fixes belows:

1. Fix code style issue.
2. Fix incorrect functions name in comments.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# b3cb0d6a 01-Sep-2011 Li Dongyang <lidongyang@novell.com>

xen-blkback: Implement discard requests ('feature-discard')

..aka ATA TRIM/SCSI UNMAP command to be passed through the frontend
and used as appropiately by the backend. We also advertise
certain granulity parameters to the frontend so it can plug them in.
If the backend is a realy device - we just end up using
'blkdev_issue_discard' while for loopback devices - we just punch
a hole in the image file.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
[v1: Fixed up pr_debug and commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# e5de0630 01-Aug-2011 Jesper Juhl <jj@chaosbits.net>

Remove unneeded version.h includes from drivers/block/

It was pointed out by 'make versioncheck' that some includes of
linux/version.h are not needed in drivers/block/.
This patch removes them.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>


# 1bc05b0a 14-Aug-2011 Joe Jin <joe.jin@oracle.com>

xen-blkback: fixed indentation and comments

This patch fixes belows:

1. Fix code style issue.
2. Fix incorrect functions name in comments.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 5a577e38 12-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Add the prefix XEN in the common.h.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 3d814731 12-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Prefix 'vbd' with 'xen' in structs and functions.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 30fd1502 12-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Change structure name blkif_st to xen_blkif.

No need for that '_st' and xen_blkif is more apt.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 325a6486 12-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Remove the unused typedefs.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 452a6b2b 12-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Move include/xen/blkif.h into drivers/block/xen-blkback/common.h

Not point of the blkif.h file. It is not used by the frontend.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# b0f80127 12-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Fixing some more of the cleanpatch.pl warnings.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 68c88dd7 11-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Move blkif_get_x86_[32|64]_req to common.h in block/xen-blkback dir.

From the blkif.h header, which was exposed to the frontend.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 22b20f2d 12-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Use the DRV_PFX in the pr_.. macros.

To make it easier to read.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 1afbd730 11-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Make the DPRINTK uniform.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 01f37f2d 11-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Fixed up comments and converted spaces to tabs.

Suggested-by: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 24f567f9 04-May-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Add support for BLKIF_OP_FLUSH_DISKCACHE and drop BLKIF_OP_WRITE_BARRIER.

We drop the support for 'feature-barrier' and add in the support
for the 'feature-flush-cache' if the real backend storage supports
flushing.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 8b6bf747 20-Apr-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Prefix exposed functions with xen_

And also shorten the name if it has blkback to blkbk.

This results in the symbol table (if compiled in the kernel)
to be much shorter, prettier, and also easier to search for.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# 42c7841d 20-Apr-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkback: Inline some of the functions that were moved from vbd/interface.c

Shuffling code around.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>


# dfc07b13 18-Apr-2011 Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Move it from drivers/xen to drivers/block

.. and modify the Makefile and Kconfig files appropriately.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>