History log of /linux-master/drivers/md/dm-cache-target.c
Revision Date Author Comments
# 05bdb996 08-Jun-2023 Christoph Hellwig <hch@lst.de>

block: replace fmode_t with a block-specific type for block open flags

The only overlap between the block open flags mapped into the fmode_t and
other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new
blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and
->ioctl and stop abusing fmode_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# b362c733 18-Mar-2023 Yangtao Li <frank.li@vivo.com>

dm: push error reporting down to dm_register_target()

Simplifies each DM target's init method by making dm_register_target()
responsible for its error reporting (on behalf of targets).

Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 76227f6d 16-Feb-2023 Mike Snitzer <snitzer@kernel.org>

dm cache: add cond_resched() to various workqueue loops

Otherwise on resource constrained systems these workqueues may be too
greedy.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 774f13ac 07-Feb-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: declare variables static when sensible

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 23fda2ef 07-Feb-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: fix suspect indent whitespace

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 0ef0b471 01-Feb-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: add missing empty lines

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 8ca817c4 01-Feb-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: avoid spaces before function arguments or in favour of tabs

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# a4a82ce3 26-Jan-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: correct block comments format.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 86a3238c 25-Jan-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: change "unsigned" to "unsigned int"

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 3bd94003 25-Jan-2023 Heinz Mauelshagen <heinzm@redhat.com>

dm: add missing SPDX-License-Indentifiers

'GPL-2.0-only' is used instead of 'GPL-2.0' because SPDX has
deprecated its use.

Suggested-by: John Wiele <jwiele@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 6b997386 30-Nov-2022 Mike Snitzer <snitzer@kernel.org>

dm cache: set needs_check flag after aborting metadata

Otherwise the commit that will be aborted will be associated with the
metadata objects that will be torn down. Must write needs_check flag
to metadata with a reset block manager.

Found through code-inspection (and compared against dm-thin.c).

Cc: stable@vger.kernel.org
Fixes: 028ae9f76f29 ("dm cache: add fail io mode and needs_check flag")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 6a459d8e 28-Nov-2022 Luo Meng <luomeng12@huawei.com>

dm cache: Fix UAF in destroy()

Dm_cache also has the same UAF problem when dm_resume()
and dm_destroy() are concurrent.

Therefore, cancelling timer again in destroy().

Cc: stable@vger.kernel.org
Fixes: c6b4fcbad044e ("dm: add cache target")
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 5c29e784 21-Jun-2022 Steven Lung <1030steven@gmail.com>

dm cache: fix typo in 2 comment blocks

Replace neccessarily with necessarily.

Signed-off-by: Steven Lung <1030steven@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>


# 70200574 14-Apr-2022 Christoph Hellwig <hch@lst.de>

block: remove QUEUE_FLAG_DISCARD

Just use a non-zero max_discard_sectors as an indicator for discard
support, similar to what is done for write zeroes.

The only places where needs special attention is the RAID5 driver,
which must clear discard support for security reasons by default,
even if the default stacking rules would allow for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> [drbd]
Acked-by: Jan Höppner <hoeppner@linux.ibm.com> [s390]
Acked-by: Coly Li <colyli@suse.de> [bcache]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220415045258.199825-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 69596f55 08-Mar-2022 Mike Snitzer <snitzer@redhat.com>

dm cache: use dm_submit_bio_remap

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 385411ff 01-Mar-2022 Christoph Hellwig <hch@lst.de>

dm: stop using bdevname

Just use the %pg format specifier instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# abfc426d 02-Feb-2022 Christoph Hellwig <hch@lst.de>

block: pass a block_device to bio_clone_fast

Pass a block_device to bio_clone_fast and __bio_clone_fast and give
the functions more suitable names.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 3c4b455e 02-Feb-2022 Christoph Hellwig <hch@lst.de>

dm-cache: remove __remap_to_origin_clear_discard

Fold __remap_to_origin_clear_discard into the two callers to prepare
for bio cloning refactoring.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 6dcbb52c 17-Oct-2021 Christoph Hellwig <hch@lst.de>

dm: use bdev_nr_sectors and bdev_nr_bytes instead of open coding them

Use the proper helpers to read the block device size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20211018101130.1838532-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 8ec45662 12-Jul-2021 Tushar Sugandhi <tusharsu@linux.microsoft.com>

dm: update target status functions to support IMA measurement

For device mapper targets to take advantage of IMA's measurement
capabilities, the status functions for the individual targets need to be
updated to handle the status_type_t case for value STATUSTYPE_IMA.

Update status functions for the following target types, to log their
respective attributes to be measured using IMA.
01. cache
02. crypt
03. integrity
04. linear
05. mirror
06. multipath
07. raid
08. snapshot
09. striped
10. verity

For rest of the targets, handle the STATUSTYPE_IMA case by setting the
measurement buffer to NULL.

For IMA to measure the data on a given system, the IMA policy on the
system needs to be updated to have the following line, and the system
needs to be restarted for the measurements to take effect.

/etc/ima/ima-policy
measure func=CRITICAL_DATA label=device-mapper template=ima-buf

The measurements will be reflected in the IMA logs, which are located at:

/sys/kernel/security/integrity/ima/ascii_runtime_measurements
/sys/kernel/security/integrity/ima/binary_runtime_measurements

These IMA logs can later be consumed by various attestation clients
running on the system, and send them to external services for attesting
the system.

The DM target data measured by IMA subsystem can alternatively
be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with
DM_TABLE_STATUS_CMD.

Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# dc4fa29f 24-Jun-2021 Mike Snitzer <snitzer@redhat.com>

dm io tracker: factor out IO tracker

Allow other code to use dm_io_tracker.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 63508e38 19-Mar-2021 Xu Wang <vulab@iscas.ac.cn>

dm cache: remove needless request_queue NULL pointer checks

Since commit ff9ea323816d ("block, bdi: an active gendisk always has a
request_queue associated with it") the request_queue pointer returned
from bdev_get_queue() shall never be NULL.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# b7770923 10-Dec-2020 Zheng Yongjun <zhengyongjun3@huawei.com>

dm cache: simplify the return expression of load_mapping()

Simplify the return expression.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 35d2835d 10-Nov-2020 Nick Desaulniers <ndesaulniers@google.com>

Revert "dm cache: fix arm link errors with inline"

This reverts commit 43aeaa29573924df76f44eda2bbd94ca36e407b5.

Since commit 0bddd227f3dc ("Documentation: update for gcc 4.9 requirement")
the minimum supported version of GCC is gcc-4.9. It's now safe to remove
this code.

Link: https://github.com/ClangBuiltLinux/linux/issues/427
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d4a512ed 19-Sep-2020 Mike Snitzer <snitzer@redhat.com>

dm: use dm_table_get_device_name() where appropriate in targets

dm_table_get_device_name() avoids calling dm_table_get_md() followed by
dm_device_name() -- saves intermediate dm_table_get_md() call.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 21cf8661 01-Jul-2020 Christoph Hellwig <hch@lst.de>

writeback: remove bdi->congested_fn

Except for pktdvd, the only places setting congested bits are file
systems that allocate their own backing_dev_info structures. And
pktdvd is a deprecated driver that isn't useful in stack setup
either. So remove the dead congested_fn stacking infrastructure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
[axboe: fixup unused variables in bcache/request.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ed00aabd 01-Jul-2020 Christoph Hellwig <hch@lst.de>

block: rename generic_make_request to submit_bio_noacct

generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 636be424 27-Feb-2020 Mike Snitzer <snitzer@redhat.com>

dm: bump version of core and various targets

Changes made during the 5.6 cycle warrant bumping the version number
for DM core and the targets modified by this commit.

It should be noted that dm-thin, dm-crypt and dm-raid already had
their target version bumped during the 5.6 merge window.

Signed-off-by; Mike Snitzer <snitzer@redhat.com>


# 7cdf6a0a 19-Feb-2020 Mikulas Patocka <mpatocka@redhat.com>

dm cache: fix a crash due to incorrect work item cancelling

The crash can be reproduced by running the lvm2 testsuite test
lvconvert-thin-external-cache.sh for several minutes, e.g.:
while :; do make check T=shell/lvconvert-thin-external-cache.sh; done

The crash happens in this call chain:
do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
-> memset -> __memset -- which accesses an invalid pointer in the vmalloc
area.

The work entry on the workqueue is executed even after the bitmap was
freed. The problem is that cancel_delayed_work doesn't wait for the
running work item to finish, so the work item can continue running and
re-submitting itself even after cache_postsuspend. In order to make sure
that the work item won't be running, we must use cancel_delayed_work_sync.

Also, change flush_workqueue to drain_workqueue, so that if some work item
submits itself or another work item, we are properly waiting for both of
them.

Fixes: c6b4fcbad044 ("dm: add cache target")
Cc: stable@vger.kernel.org # v3.9
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 26b924b9 23-Oct-2019 Mikulas Patocka <mpatocka@redhat.com>

dm cache: replace spin_lock_irqsave with spin_lock_irq

If we are in a place where it is known that interrupts are enabled,
functions spin_lock_irq/spin_unlock_irq should be used instead of
spin_lock_irqsave/spin_unlock_irqrestore.

spin_lock_irq and spin_unlock_irq are faster because they don't need to
push and pop the flags register.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 13bd677a 16-Oct-2019 Mikulas Patocka <mpatocka@redhat.com>

dm cache: fix bugs when a GFP_NOWAIT allocation fails

GFP_NOWAIT allocation can fail anytime - it doesn't wait for memory being
available and it fails if the mempool is exhausted and there is not enough
memory.

If we go down this path:
map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT)
we can see that map_bio() doesn't check the return value of mg_start(),
and the bio is leaked.

If we go down this path:
map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell ->
dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) ->
mg_lock_writes -> mg_complete
the bio is ended with an error - it is unacceptable because it could
cause filesystem corruption if the machine ran out of memory
temporarily.

Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly
wait until memory becomes available. mempool_alloc with GFP_NOIO can't
fail, so remove the code paths that deal with allocation failure.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# de7180ff 25-Feb-2019 Mike Snitzer <snitzer@redhat.com>

dm cache: add support for discard passdown to the origin device

DM cache now defaults to passing discards down to the origin device.
User may disable this using the "no_discard_passdown" feature when
creating the cache device.

If the cache's underlying origin device doesn't support discards then
passdown is disabled (with warning). Similarly, if the underlying
origin device's max_discard_sectors is less than a cache block discard
passdown will be disabled (this is required because sizing of the cache
internal discard bitset depends on it).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 61697a6a 18-Jan-2019 Mike Snitzer <snitzer@redhat.com>

dm: eliminate 'split_discard_bios' flag from DM target interface

There is no need to have DM core split discards on behalf of a DM target
now that blk_queue_split() handles splitting discards based on the
queue_limits. A DM target just needs to set max_discard_sectors,
discard_granularity, etc, in queue_limits.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c7cd5550 07-Oct-2018 Shenghui Wang <shhuiw@foxmail.com>

dm cache: destroy migration_cache if cache target registration failed

Commit 7e6358d244e47 ("dm: fix various targets to dm_register_target
after module __init resources created") inadvertently introduced this
bug when it moved dm_register_target() after the call to KMEM_CACHE().

Fixes: 7e6358d244e47 ("dm: fix various targets to dm_register_target after module __init resources created")
Cc: stable@vger.kernel.org
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 5d07384a 25-Sep-2018 Mike Snitzer <snitzer@redhat.com>

dm cache: fix resize crash if user doesn't reload cache table

A reload of the cache's DM table is needed during resize because
otherwise a crash will occur when attempting to access smq policy
entries associated with the portion of the cache that was recently
extended.

The reason is cache-size based data structures in the policy will not be
resized, the only way to safely extend the cache is to allow for a
proper cache policy initialization that occurs when the cache table is
loaded. For example the smq policy's space_init(), init_allocator(),
calc_hotspot_params() must be sized based on the extended cache size.

The fix for this is to disallow cache resizes of this pattern:
1) suspend "cache" target's device
2) resize the fast device used for the cache
3) resume "cache" target's device

Instead, the last step must be a full reload of the cache's DM table.

Fixes: 66a636356 ("dm cache: add stochastic-multi-queue (smq) policy")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 7209049d 31-Jul-2018 Mike Snitzer <snitzer@redhat.com>

dm kcopyd: return void from dm_kcopyd_copy()

dm_kcopyd_copy() only ever returns 0 so there is no need for callers to
account for possible failure. Same goes for dm_kcopyd_zero().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# af9313c3 21-Jun-2018 John Pittman <jpittman@redhat.com>

dm cache: only allow a single io_mode cache feature to be requested

More than one io_mode feature can be requested when creating a dm cache
device (as is: last one wins). The io_mode selections are incompatible
with one another, we should force them to be selected exclusively. Add
a counter to check for more than one io_mode selection.

Fixes: 629d0a8a1a10 ("dm cache metadata: add "metadata2" feature")
Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 72d711c8 22-May-2018 Mike Snitzer <snitzer@redhat.com>

dm: adjust structure members to improve alignment

Eliminate most holes in DM data structures that were modified by
commit 6f1c819c21 ("dm: convert to bioset_init()/mempool_init()").
Also prevent structure members from unnecessarily spanning cache
lines.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 6f1c819c 20-May-2018 Kent Overstreet <kent.overstreet@gmail.com>

dm: convert to bioset_init()/mempool_init()

Convert dm to embedded bio sets.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 1eb5fa84 28-Feb-2018 Mike Snitzer <snitzer@redhat.com>

dm: allow targets to return output from messages they are sent

Could be useful for a target to return stats or other information.
If a target does DMEMIT() anything to @result from its .message method
then it must return 1 to the caller.

Signed-off-By: Mike Snitzer <snitzer@redhat.com>


# 7e6358d2 24-Nov-2017 monty_pavel@sina.com <monty_pavel@sina.com>

dm: fix various targets to dm_register_target after module __init resources created

A NULL pointer is seen if two concurrent "vgchange -ay -K <vg name>"
processes race to load the dm-thin-pool module:

PID: 25992 TASK: ffff883cd7d23500 CPU: 4 COMMAND: "vgchange"
#0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
0000001 [ffff883cd743d660] crash_kexec at ffffffff810c5992
0000002 [ffff883cd743d730] oops_end at ffffffff81515c90
0000003 [ffff883cd743d760] no_context at ffffffff81049f1b
0000004 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
0000005 [ffff883cd743d800] bad_area at ffffffff8104a2ce
0000006 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
0000007 [ffff883cd743d950] do_page_fault at ffffffff81517bae
0000008 [ffff883cd743d980] page_fault at ffffffff81514f95
[exception RIP: kmem_cache_alloc+108]
RIP: ffffffff8116ef3c RSP: ffff883cd743da38 RFLAGS: 00010046
RAX: 0000000000000004 RBX: ffffffff81121b90 RCX: ffff881bf1e78cc0
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 0000000000000000
RBP: ffff883cd743da68 R8: ffff881bf1a4eb00 R9: 0000000080042000
R10: 0000000000002000 R11: 0000000000000000 R12: 00000000000000d0
R13: 0000000000000000 R14: 00000000000000d0 R15: 0000000000000246
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
0000009 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
0000010 [ffff883cd743da80] mempool_create_node at ffffffff81122083
0000011 [ffff883cd743dad0] mempool_create at ffffffff811220f4
0000012 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
0000013 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
0000014 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
0000015 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]

The race results in a NULL pointer because:

Process A (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target not registered;
c. modprobe dm_thin_pool and wait until end.

Process B (vgchange -ay -K):
a. send DM_LIST_VERSIONS_CMD ioctl;
b. pool_target registered;
c. table_load->dm_table_add_target->pool_ctr;
d. _new_mapping_cache is NULL and panic.
Note:
1. process A and process B are two concurrent processes.
2. pool_target can be detected by process B but
_new_mapping_cache initialization has not ended.

To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
with the same problem, simply dm_register_target() after all resources
created during module init (as labelled with __init) are finished.

Cc: stable@vger.kernel.org
Signed-off-by: monty <monty_pavel@sina.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ef7afb36 09-Nov-2017 Mike Snitzer <snitzer@redhat.com>

dm cache: lift common migration preparation code to alloc_migration()

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ede6507d 09-Nov-2017 Joe Thornber <ejt@redhat.com>

dm cache: remove usused deferred_cells member from struct cache

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 693b960e 19-Oct-2017 Mike Snitzer <snitzer@redhat.com>

dm cache: simplify get_per_bio_data() by removing data_size argument

There is only one per_bio_data size now that writethrough-specific data
was removed from the per_bio_data structure.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 9958f1d9 19-Oct-2017 Mike Snitzer <snitzer@redhat.com>

dm cache: remove all obsolete writethrough-specific code

Now that the writethrough code is much simpler there is no need to track
so much state or cascade bio submission (as was done, via
writethrough_endio(), to issue origin then cache IO in series).

As such the obsolete writethrough list and workqueue is also removed.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2df3bae9 19-Oct-2017 Mike Snitzer <snitzer@redhat.com>

dm cache: submit writethrough writes in parallel to origin and cache

Discontinue issuing writethrough write IO in series to the origin and
then cache.

Use bio_clone_fast() to create a new origin clone bio that will be
mapped to the origin device and then bio_chain() it to the bio that gets
remapped to the cache device. The origin clone bio does _not_ have a
copy of the per_bio_data -- as such check_if_tick_bio_needed() will not
be called.

The cache bio (parent bio) will not complete until the origin bio has
completed -- this fulfills bio_clone_fast()'s requirements as well as
the requirement to not complete the original IO until the write IO has
completed to both the origin and cache device.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 8e3c3827 19-Oct-2017 Mike Snitzer <snitzer@redhat.com>

dm cache: pass cache structure to mode functions

No functional changes, just a bit cleaner than passing cache_features
structure.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d1260e2a 10-Nov-2017 Joe Thornber <ejt@redhat.com>

dm cache: fix race condition in the writeback mode overwrite_bio optimisation

When a DM cache in writeback mode moves data between the slow and fast
device it can often avoid a copy if the triggering bio either:

i) covers the whole block (no point copying if we're about to overwrite it)
ii) the migration is a promotion and the origin block is currently discarded

Prior to this fix there was a race with case (ii). The discard status
was checked with a shared lock held (rather than exclusive). This meant
another bio could run in parallel and write data to the origin, removing
the discard state. After the promotion the parallel write would have
been lost.

With this fix the discard status is re-checked once the exclusive lock
has been aquired. If the block is no longer discarded it falls back to
the slower full copy path.

Fixes: b29d4986d ("dm cache: significant rework to leverage dm-bio-prison-v2")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 5916a22b 22-Jun-2017 Eric Biggers <ebiggers@google.com>

dm: constify argument arrays

The arrays of 'struct dm_arg' are never modified by the device-mapper
core, so constify them so that they are placed in .rodata.

(Exception: the args array in dm-raid cannot be constified because it is
allocated on the stack and modified.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 74d46992 23-Aug-2017 Christoph Hellwig <hch@lst.de>

block: replace bi_bdev with a gendisk pointer and partitions index

This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 4e4cbee9 03-Jun-2017 Christoph Hellwig <hch@lst.de>

block: switch bios to blk_status_t

Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 1be56909 03-Jun-2017 Christoph Hellwig <hch@lst.de>

dm: change ->end_io calling convention

Turn the error paramter into a pointer so that target drivers can change
the value, and make sure only DM_ENDIO_* values are returned from the
methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 49b7f768 11-May-2017 Joe Thornber <ejt@redhat.com>

dm cache: simplify the IDLE vs BUSY state calculation

Drop the MODERATE state since it wasn't buying us much.

Also, in check_migrations(), prepare for the next commit ("dm cache
policy smq: don't do any writebacks unless IDLE") by deferring to the
policy to make the final decision on whether writebacks can be
serviced.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 701e03e4 11-May-2017 Joe Thornber <ejt@redhat.com>

dm cache: track all IO to the cache rather than just the origin device's IO

IO tracking used to throttle writebacks when the origin device is busy.

Even if all the IO is going to the fast device, writebacks can
significantly degrade performance. So track all IO to gauge whether the
cache is busy or not.

Otherwise, synthetic IO tests (e.g. fio) that might send all IO to the
fast device wouldn't cause writebacks to get throttled.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 072792dc 11-May-2017 Joe Thornber <ejt@redhat.com>

dm cache: fix incorrect 'idle_time' reset in IO tracker

Some bios have no payload (eg, a FLUSH), don't reset the idle_time when
these come in.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 48920ff2 05-Apr-2017 Christoph Hellwig <hch@lst.de>

block: remove the discard_zeroes_data flag

Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 449b668c 31-Mar-2017 Joe Thornber <ejt@redhat.com>

dm cache: set/clear the cache core's dirty_bitset when loading mappings

When loading metadata make sure to set/clear the dirty bits in the cache
core's dirty_bitset as well as the policy.

Otherwise the cache core is unaware that any blocks were dirty when the
cache was last shutdown. A very serious side-effect being that the
cleaner policy would therefore never be tasked with writing back dirty
data from a cache that was in writeback mode (e.g. when switching from
smq policy to cleaner policy when decommissioning a writeback cache).

This fixes a serious data corruption bug associated with writeback mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# b29d4986 15-Dec-2016 Joe Thornber <ejt@redhat.com>

dm cache: significant rework to leverage dm-bio-prison-v2

The cache policy interfaces have been updated to work well with the new
bio-prison v2 interface's ability to queue work immediately (promotion,
demotion, etc) -- overriding benefit being reduced latency on processing
IO through the cache. Previously such work would be left for the DM
cache core to queue on various lists and then process in batches later
-- this caused a serious delay in latency for IO driven by the cache.

The background tracker code was factored out so that all cache policies
can make use of it.

Also, the "cleaner" policy has been removed and is now a variant of the
smq policy that simply disallows migrations.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 742c8fdc 21-Oct-2016 Joe Thornber <ejt@redhat.com>

dm bio prison v2: new interface for the bio prison

The deferred set is gone and all methods have _v2 appended to the end of
their names to allow for continued use of the original bio prison in DM
thin-provisioning.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 629d0a8a 22-Sep-2016 Joe Thornber <ejt@redhat.com>

dm cache metadata: add "metadata2" feature

If "metadata2" is provided as a table argument when creating/loading a
cache target a more compact metadata format, with separate dirty bits,
is used. "metadata2" improves speed of shutting down a cache target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ca763d0a 09-Feb-2017 Joe Thornber <ejt@redhat.com>

dm cache: fix corruption seen when using cache > 2TB

A rounding bug due to compiler generated temporary being 32bit was found
in remap_to_cache(). A localized cast in remap_to_cache() fixes the
corruption but this preferred fix (changing from uint32_t to sector_t)
eliminates potential for future rounding errors elsewhere.

Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# dc3b17cc 02-Feb-2017 Jan Kara <jack@suse.cz>

block: Use pointer to backing_dev_info from request_queue

We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


# f73f44eb 27-Jan-2017 Christoph Hellwig <hch@lst.de>

block: add a op_is_flush helper

This centralizes the checks for bios that needs to be go into the flush
state machine.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 23cab26d 03-Oct-2016 Mike Snitzer <snitzer@redhat.com>

dm cache: add missing cache device name to DMERR in set_cache_mode()

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 1eff9d32 05-Aug-2016 Jens Axboe <axboe@fb.com>

block: rename bio bi_rw to bi_opf

Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 28a8f0d3 05-Jun-2016 Mike Christie <mchristi@redhat.com>

block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH

To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e6047149 05-Jun-2016 Mike Christie <mchristi@redhat.com>

dm: use bio op accessors

Separate the op from the rq_flag_bits and have dm
set/get the bio using bio_set_op_attrs/bio_op.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 843f0f2e 10-Mar-2016 Mike Snitzer <snitzer@redhat.com>

dm cache: bump the target version

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d14fcf3d 10-Mar-2016 Joe Thornber <ejt@redhat.com>

dm cache: make sure every metadata function checks fail_io

Otherwise operations may be attempted that will only ever go on to crash
(since the metadata device is either missing or unreliable if 'fail_io'
is set).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# 30187e1d 31-Jan-2016 Mike Snitzer <snitzer@redhat.com>

dm: rename target's per_bio_data_size to per_io_data_size

Request-based DM will also make use of per_bio_data_size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# fe3265b1 25-Nov-2015 Mikulas Patocka <mpatocka@redhat.com>

dm: don't save and restore bi_private

Device mapper used the field bi_private to point to dm_target_io. However,
since kernel 3.15, the bi_private field is unused, and so the targets do
not need to save and restore this field.

This patch removes code that saves and restores bi_private from dm-cache,
dm-snapshot and dm-verity.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 6f65985e 13-Sep-2015 Julia Lawall <Julia.Lawall@lip6.fr>

dm: drop NULL test before kmem_cache_destroy() and mempool_destroy()

Remove DM's unneeded NULL tests before calling these destroy functions,
now that they check for NULL, thanks to these v4.3 commits:
3942d2991 ("mm/slab_common: allow NULL cache pointer in kmem_cache_destroy()")
4e3ca3e03 ("mm/mempool: allow NULL `pool' pointer in mempool_destroy()")

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@ expression x; @@
-if (x != NULL)
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# cc7da0ba 01-Sep-2015 Joe Thornber <ejt@redhat.com>

dm cache: fix use after freeing migrations

Both free_io_migration() and issue_discard() dereference a migration
that was just freed. Fix those by saving off the migrations's cache
object before freeing the migration. Also cleanup needless mg->cache
dereferences now that the cache object is available directly.

Fixes: e44b6a5a3c ("dm cache: move wake_waker() from free_migrations() to where it is needed")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# dc9cee5d 31-Aug-2015 Mike Snitzer <snitzer@redhat.com>

dm cache: small cleanups related to deferred prison cell cleanup

Eliminate __cell_release() since it only had one caller that always
released the cell holder.

Switch cell_error_with_code() to using free_prison_cell() for the sake
of consistency.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 9153df74 31-Aug-2015 Joe Thornber <ejt@redhat.com>

dm cache: fix leaking of deferred bio prison cells

There were two cases where dm_cell_visit_release() was being called,
which removes the cell from the prison's rbtree, but the callers didn't
also return the cell to the mempool. Fix this by having them call
free_prison_cell().

This leak manifested as the 'kmalloc-96' slab growing until OOM.

Fixes: 651f5fa2a3 ("dm cache: defer whole cells")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.1+


# 8ae12666 28-Apr-2015 Kent Overstreet <kent.overstreet@gmail.com>

block: kill merge_bvec_fn() completely

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits)
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# e44b6a5a 30-Jul-2015 Joe Thornber <ejt@redhat.com>

dm cache: move wake_waker() from free_migrations() to where it is needed

This stops spurious wake ups from calls to prealloc_free_structs().

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 795e633a 29-Jul-2015 Mike Snitzer <snitzer@redhat.com>

dm cache: fix device destroy hang due to improper prealloc_used accounting

Commit 665022d72f9 ("dm cache: avoid calls to prealloc_free_structs() if
possible") introduced a regression that caused the removal of a DM cache
device to hang in cache_postsuspend()'s call to wait_for_migrations()
with the following stack trace:

[<ffffffff81651457>] schedule+0x37/0x80
[<ffffffffa041e21b>] cache_postsuspend+0xbb/0x470 [dm_cache]
[<ffffffff810ba970>] ? prepare_to_wait_event+0xf0/0xf0
[<ffffffffa0006f77>] dm_table_postsuspend_targets+0x47/0x60 [dm_mod]
[<ffffffffa0001eb5>] __dm_destroy+0x215/0x250 [dm_mod]
[<ffffffffa0004113>] dm_destroy+0x13/0x20 [dm_mod]
[<ffffffffa00098cd>] dev_remove+0x10d/0x170 [dm_mod]
[<ffffffffa00097c0>] ? dev_suspend+0x240/0x240 [dm_mod]
[<ffffffffa0009f85>] ctl_ioctl+0x255/0x4d0 [dm_mod]
[<ffffffff8127ac00>] ? SYSC_semtimedop+0x280/0xe10
[<ffffffffa000a213>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[<ffffffff811fd432>] do_vfs_ioctl+0x2d2/0x4b0
[<ffffffff81117d5f>] ? __audit_syscall_entry+0xaf/0x100
[<ffffffff81022636>] ? do_audit_syscall_entry+0x66/0x70
[<ffffffff811fd689>] SyS_ioctl+0x79/0x90
[<ffffffff81023e58>] ? syscall_trace_leave+0xb8/0x110
[<ffffffff81654f6e>] entry_SYSCALL_64_fastpath+0x12/0x71

Fix this by accounting for the call to prealloc_data_structs()
immediately _before_ the call as opposed to after. This is needed
because it is possible to break out of the control loop after the call
to prealloc_data_structs() but before prealloc_used was set to true.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 3508e659 29-Jul-2015 Mike Snitzer <snitzer@redhat.com>

Revert "dm cache: do not wake_worker() in free_migration()"

This reverts commit 386cb7cdeeef97e0bf082a8d6bbfc07a2ccce07b.

Taking the wake_worker() out of free_migration() will slow writeback
dramatically, and hence adaptability.

Say we have 10k blocks that need writing back, but are only able to
issue 5 concurrently due to the migration bandwidth: it's imperative
that we wake_worker() immediately after migration completion; waiting
for the next 1 second wake up (via do_waker) means it'll take a long
time to write that all back.

Reported-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4246a0b6 20-Jul-2015 Christoph Hellwig <hch@lst.de>

block: add a bi_error field to struct bio

Currently we have two different ways to signal an I/O error on a BIO:

(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 665022d7 16-Jul-2015 Mike Snitzer <snitzer@redhat.com>

dm cache: avoid calls to prealloc_free_structs() if possible

If no work was performed then prealloc_data_structs() wasn't ever called
so there isn't any need to call prealloc_free_structs().

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# e782eff5 16-Jul-2015 Mike Snitzer <snitzer@redhat.com>

dm cache: avoid preallocation if no work in writeback_some_dirty_blocks()

Refactor writeback_some_dirty_blocks() to avoid prealloc_data_structs()
if the policy doesn't have any dirty blocks ready for writeback.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 386cb7cd 16-Jul-2015 Mike Snitzer <snitzer@redhat.com>

dm cache: do not wake_worker() in free_migration()

All methods that queue work call wake_worker() as you'd expect.
E.g. cell_defer, defer_bio, quiesce_migration (which is called by
writeback, promote, demote_then_promote, invalidate, discard, etc).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 255eac20 15-Jul-2015 Mike Snitzer <snitzer@redhat.com>

dm cache: display 'needs_check' in status if it is set

There is currently no way to see that the needs_check flag has been set
in the metadata. Display 'needs_check' in the cache status if it is set
in the cache metadata.

Also, update cache documentation.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# fba10109 29-May-2015 Joe Thornber <ejt@redhat.com>

dm cache: age and write back cache entries even without active IO

The policy tick() method is normally called from interrupt context.
Both the mq and smq policies do some bottom half work for the tick
method in their map functions. However if no IO is going through the
cache, then that bottom half work doesn't occur. With these policies
this means recently hit entries do not age and do not get written
back as early as we'd like.

Fix this by introducing a new 'can_block' parameter to the tick()
method. When this is set the bottom half work occurs immediately.
'can_block' is set when the tick method is called every second by the
core target (not in interrupt context).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# b61d9509 22-Apr-2015 Mike Snitzer <snitzer@redhat.com>

dm cache: prefix all DMERR and DMINFO messages with cache device name

Having the DM device name associated with the ERR or INFO message is
very helpful.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 028ae9f7 22-Apr-2015 Joe Thornber <ejt@redhat.com>

dm cache: add fail io mode and needs_check flag

If a cache metadata operation fails (e.g. transaction commit) the
cache's metadata device will abort the current transaction, set a new
needs_check flag, and the cache will transition to "read-only" mode. If
aborting the transaction or setting the needs_check flag fails the cache
will transition to "fail-io" mode.

Once needs_check is set the cache device will not be allowed to
activate. Activation requires write access to metadata. Future work is
needed to add proper support for running the cache in read-only mode.

Once in fail-io mode the cache will report a status of "Fail".

Also, add commit() wrapper that will disallow commits if in read_only or
fail mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 88bf5184 27-May-2015 Joe Thornber <ejt@redhat.com>

dm cache: wake the worker thread every time we free a migration object

When the cache is idle, writeback work was only being issued every
second. With this change outstanding writebacks are streamed
constantly. This offers a writeback performance improvement.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 40775257 15-May-2015 Joe Thornber <ejt@redhat.com>

dm cache: boost promotion of blocks that will be overwritten

When considering whether to move a block to the cache we already give
preferential treatment to discarded blocks, since they are cheap to
promote (no read of the origin required since the data is junk).

The same is true of blocks that are about to be completely
overwritten, so we likewise boost their promotion chances.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 651f5fa2 15-May-2015 Joe Thornber <ejt@redhat.com>

dm cache: defer whole cells

Currently individual bios are deferred to the worker thread if they
cannot be processed immediately (eg, a block is in the process of
being moved to the fast device).

This patch passes whole cells across to the worker. This saves
reaquiring the cell, and also collects bios destined for the same block
together, which allows them to be mapped with a single look up to the
policy. This reduces the overhead of using dm-cache.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 451b9e00 15-May-2015 Joe Thornber <ejt@redhat.com>

dm cache: pull out some bitset utility functions for reuse

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 20f6814b 15-May-2015 Joe Thornber <ejt@redhat.com>

dm cache: pass a new 'critical' flag to the policies when requesting writeback work

We only allow non critical writeback if the origin is idle. It is up
to the policy to decide what writeback work is critical.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 066dbaa3 15-May-2015 Joe Thornber <ejt@redhat.com>

dm cache: track IO to the origin device using io_tracker

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 77289d32 15-May-2015 Joe Thornber <ejt@redhat.com>

dm cache: add io_tracker

A little class that keeps track of the volume of io that is in flight,
and the length of time that a device has been idle for.

FIXME: rather than jiffes, may be best to use ktime_t (to support faster
devices).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# fb4100ae 20-May-2015 Joe Thornber <ejt@redhat.com>

dm cache: fix race when issuing a POLICY_REPLACE operation

There is a race between a policy deciding to replace a cache entry,
the core target writing back any dirty data from this block, and other
IO threads doing IO to the same block.

This sort of problem is avoided most of the time by the core target
grabbing a bio prison cell before making the request to the policy.
But for a demotion the core target doesn't know which block will be
demoted, so can't do this in advance.

Fix this demotion race by introducing a callback to the policy interface
that allows the policy to grab the cell on behalf of the core target.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# 326e1dbb 22-May-2015 Mike Snitzer <snitzer@redhat.com>

block: remove management of bi_remaining when restoring original bi_end_io

Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


# c4cf5261 17-Apr-2015 Jens Axboe <axboe@fb.com>

bio: skip atomic inc/dec of ->bi_remaining for non-chains

Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.

Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.

For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.

Tested-by: Robert Elliott <elliott@hp.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>


# 0f30af98 22-May-2014 Manuel Schölling <manuel.schoelling@gmx.de>

dm: use time_in_range() and time_after()

To be future-proof and for better readability the time comparisons are modified
to use time_in_range() and time_after() instead of plain, error-prone math.

Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# a59db676 23-Jan-2015 Joe Thornber <ejt@redhat.com>

dm cache: fix problematic dual use of a single migration count variable

Introduce a new variable to count the number of allocated migration
structures. The existing variable cache->nr_migrations became
overloaded. It was used to:

i) track of the number of migrations in flight for the purposes of
quiescing during suspend.

ii) to estimate the amount of background IO occuring.

Recent discard changes meant that REQ_DISCARD bios are processed with
a migration. Discards are not background IO so nr_migrations was not
incremented. However this could cause quiescing to complete early.

(i) is now handled with a new variable cache->nr_allocated_migrations.
cache->nr_migrations has been renamed cache->nr_io_migrations.
cleanup_migration() is now called free_io_migration(), since it
decrements that variable.

Also, remove the unused cache->next_migration variable that got replaced
with with prealloc_structs a while ago.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# f824a2af 28-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: fix spurious cell_defer when dealing with partial block at end of device

We never bother caching a partial block that is at the back end of the
origin device. No cell ever gets locked, but the calling code was
assuming it was and trying to release it.

Now the code only releases if the cell has been set to a non NULL
value.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# 1e32134a 26-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: dirty flag was mistakenly being cleared when promoting via overwrite

If the incoming bio is a WRITE and completely covers a block then we
don't bother to do any copying for a promotion operation. Once this is
done the cache block and origin block will be different, so we need to
set it to 'dirty'.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# f29a3147 26-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: only use overwrite optimisation for promotion when in writeback mode

Overwrite causes the cache block and origin blocks to diverge, which
is only allowed in writeback mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# 2bb812df 26-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: discard block size must be a multiple of cache block size

Otherwise the cache blocks may span two discard blocks, which we don't
handle when doing the discard lookup.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 43c32bf2 25-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: fix a harmless race when working out if a block is discarded

It is more correct to hold the cell before checking the discard state.
These flags are only used as hints to the policy so this change will
have negligable effect.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 3e2e1c30 24-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: when reloading a discard bitset allow for a different discard block size

The discard block size can change if the origin changes size or if an
old DM cache is upgraded from using a discard block size that was equal
to cache block size.

To fix this an extent of discarded blocks is established for the purpose
of translating the old discard block size to the new in-core discard
block size and set bits. The old (potentially huge) discard bitset is
left ondisk until it is re-written using the new in-core information on
the next successful DM cache shutdown.

Fixes: 7ae34e777896 ("dm cache: improve discard support")
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2572629a 24-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: fix some issues with the new discard range support

Commit 7ae34e777 ("dm cache: improve discard support") needed to also:
- discontinue having DM core split the discard bios on cache block
boundaries
- calculate the cache's discard_nr_blocks relative to the determined
discard_block_size rather than using oblock_to_dblock()

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d1d9220c 11-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: emit a warning message if there are a lot of cache blocks

Loading and saving millions of block mappings takes time. We may as
well explain what's going on, and encourage people to use a larger
cache block size.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 7ae34e77 06-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: improve discard support

Safely allow the discard blocksize to be larger than the cache blocksize
by using the bio prison's range locking support. This also improves
discard performance considerly because larger discards are issued to the
dm-cache device. The discard blocksize was always intended to be
greater than the cache blocksize. But until now it wasn't implemented
safely.

Also, by safely restoring the ability to have discard blocksize larger
than cache blocksize we're able to significantly reduce the memory used
for the cache's discard bitset. Before, with a small discard blocksize,
the discard bitset could get quite large because its size is a function
of the discard blocksize and the origin device's size. For example,
previously, using a 32KB cache blocksize with a 40TB origin resulted in
1280MB of incore memory use for the discard bitset! Now, the discard
blocksize is scaled up accordingly to ensure the discard bitset is
capped at 2**14 bits, or 16KB.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 08b18451 06-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: revert "prevent corruption caused by discard_block_size > cache_block_size"

This reverts commit d132cc6d9e92424bb9d4fd35f5bd0e55d583f4be because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize. Further dm-cache discard changes will make this
possible.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 1bad9bc4 07-Nov-2014 Joe Thornber <ejt@redhat.com>

dm cache: revert "remove remainder of distinct discard block size"

This reverts commit 64ab346a360a4b15c28fb8531918d4a01f4eabd9 because we
actually do want to allow the discard blocksize to be larger than the
cache blocksize. Further dm-cache discard changes will make this
possible.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 5f274d88 17-Sep-2014 Joe Thornber <ejt@redhat.com>

dm bio prison: introduce support for locking ranges of blocks

Ranges will be placed in the same cell if they overlap.

Range locking is a prerequisite for more efficient multi-block discard
support in both the cache and thin-provisioning targets.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# a195db2d 06-Oct-2014 Joe Thornber <ejt@redhat.com>

dm bio prison: switch to using a red black tree

Previously it was using a fixed sized hash table. There are times
when very many concurrent cells are held (such as when processing a very
large discard). When this happens the hash table performance becomes
very poor.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 40aa978e 04-Sep-2014 Anssi Hannula <anssi.hannula@iki.fi>

dm cache: fix race causing dirty blocks to be marked as clean

When a writeback or a promotion of a block is completed, the cell of
that block is removed from the prison, the block is marked as clean, and
the clear_dirty() callback of the cache policy is called.

Unfortunately, performing those actions in this order allows an incoming
new write bio for that block to come in before clearing the dirty status
is completed and therefore possibly causing one of these two scenarios:

Scenario A:

Thread 1 Thread 2
cell_defer() .
- cell removed from prison .
- detained bios queued .
. incoming write bio
. remapped to cache
. set_dirty() called,
. but block already dirty
. => it does nothing
clear_dirty() .
- block marked clean .
- policy clear_dirty() called .

Result: Block is marked clean even though it is actually dirty. No
writeback will occur.

Scenario B:

Thread 1 Thread 2
cell_defer() .
- cell removed from prison .
- detained bios queued .
clear_dirty() .
- block marked clean .
. incoming write bio
. remapped to cache
. set_dirty() called
. - block marked dirty
. - policy set_dirty() called
- policy clear_dirty() called .

Result: Block is properly marked as dirty, but policy thinks it is clean
and therefore never asks us to writeback it.
This case is visible in "dmsetup status" dirty block count (which
normally decreases to 0 on a quiet device).

Fix these issues by calling clear_dirty() before calling cell_defer().
Incoming bios for that block will then be detained in the cell and
released only after clear_dirty() has completed, so the race will not
occur.

Found by inspecting the code after noticing spurious dirty counts
(scenario B).

Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# b0246530 19-Jul-2014 Mike Snitzer <snitzer@redhat.com>

dm cache: set minimum_io_size to cache's data block size

Before, if the block layer's limit stacking didn't establish an
optimal_io_size that was compatible with the cache's data block size
we'd set optimal_io_size to the data block size and minimum_io_size to 0
(which the block layer adjusts to be physical_block_size).

Update cache_io_hints() to set both minimum_io_size and optimal_io_size
to the cache's data block size. This fixes an issue where mkfs.xfs
would create more XFS Allocation Groups on cache volumes than on a
normal linear LV of comparable size.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 895b47d7 14-Jul-2014 Mike Snitzer <snitzer@redhat.com>

dm cache metadata: use dm-space-map-metadata.h defined size limits

Commit 7d48935e cleaned up the persistent-data's space-map-metadata
limits by elevating them to dm-space-map-metadata.h. Update
dm-cache-metadata to use these same limits.

The calculation for DM_CACHE_METADATA_MAX_SECTORS didn't account for the
sizeof the disk_bitmap_header. So the supported maximum metadata size
is a bit smaller (reduced from 33423360 to 33292800 sectors).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>


# 304affaa 24-Jun-2014 Joe Thornber <ejt@redhat.com>

dm cache: fail migrations in the do_worker error path

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 8c081b52 13-May-2014 Joe Thornber <ejt@redhat.com>

dm cache: simplify deferred set reference count increments

Factor out inc_and_issue and inc_ds helpers to simplify deferred set
reference count increments. Also cleanup cache_map to consistently call
cell_defer and inc_ds when the bio is DM_MAPIO_REMAPPED.

No functional change.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 44fa816b 01-Aug-2014 Anssi Hannula <anssi.hannula@iki.fi>

dm cache: fix race affecting dirty block count

nr_dirty is updated without locking, causing it to drift so that it is
non-zero (either a small positive integer, or a very large one when an
underflow occurs) even when there are no actual dirty blocks. This was
due to a race between the workqueue and map function accessing nr_dirty
in parallel without proper protection.

People were seeing under runs due to a race on increment/decrement of
nr_dirty, see: https://lkml.org/lkml/2014/6/3/648

Fix this by using an atomic_t for nr_dirty.

Reported-by: roma1390@gmail.com
Signed-off-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# f1daa838 23-May-2014 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: always split discards on cache block boundaries

The DM cache target cannot cope with discards that span multiple cache
blocks, so each discard bio that spans more than one cache block must
get split by the DM core.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.9+


# 131cd131 01-May-2014 Mike Snitzer <snitzer@redhat.com>

dm cache: fix writethrough mode quiescing in cache_map

Commit 2ee57d58735 ("dm cache: add passthrough mode") inadvertently
removed the deferred set reference that was taken in cache_map()'s
writethrough mode support. Restore taking this reference.

This issue was found with code inspection.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.13+


# 0596661f 03-Apr-2014 Joe Thornber <ejt@redhat.com>

dm cache: fix a lock-inversion

When suspending a cache the policy is walked and the individual policy
hints written to the metadata via sync_metadata(). This led to this
lock order:

policy->lock
cache_metadata->root_lock

When loading the cache target the policy is populated while the metadata
lock is held:

cache_metadata->root_lock
policy->lock

Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
ensuring the cache_metadata root_lock is held whilst all the hints are
written, rather than being repeatedly locked while policy->lock is held
(as was the case with each callout that policy_walk_mappings() made to
the old save_hint() method).

Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
locking correctness") build option. However, it is not clear how the
LOCKDEP reported paths can lead to a deadlock since the two paths,
suspending a target and loading a target, never occur at the same time.
But that doesn't mean the same lock-inversion couldn't have occurred
elsewhere.

Reported-by: Marian Csontos <mcsontos@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# 64ab346a 27-Mar-2014 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: remove remainder of distinct discard block size

Discard block size not being equal to cache block size causes data
corruption by erroneously avoiding migrations in issue_copy() because
the discard state is being cleared for a group of cache blocks when it
should not.

Completely remove all code that enabled a distinction between the
cache block size and discard block size.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# d132cc6d 20-Mar-2014 Mike Snitzer <snitzer@redhat.com>

dm cache: prevent corruption caused by discard_block_size > cache_block_size

If the discard block size is larger than the cache block size we will
not properly quiesce IO to a region that is about to be discarded. This
results in a race between a cache migration where no copy is needed, and
a write to an adjacent cache block that's within the same large discard
block.

Workaround this by limiting the discard_block_size to cache_block_size.
Also limit the max_discard_sectors to cache_block_size.

A more comprehensive fix that introduces range locking support in the
bio_prison and proper quiescing of a discard range that spans multiple
cache blocks is already in development.

Reported-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Acked-by: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org


# e893fba9 12-Mar-2014 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: fix access beyond end of origin device

In order to avoid wasting cache space a partial block at the end of the
origin device is not cached. Unfortunately, the check for such a
partial block at the end of the origin device was flawed.

Fix accesses beyond the end of the origin device that occured due to
attempted promotion of an undetected partial block by:

- initializing the per bio data struct to allow cache_end_io to work properly
- recognizing access to the partial block at the end of the origin device
- avoiding out of bounds access to the discard bitset

Otherwise, users can experience errors like the following:

attempt to access beyond end of device
dm-5: rw=0, want=20971520, limit=20971456
...
device-mapper: cache: promotion failed; couldn't copy block

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# 8b9d9666 11-Mar-2014 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: fix truncation bug when copying a block to/from >2TB fast device

During demotion or promotion to a cache's >2TB fast device we must not
truncate the cache block's associated sector to 32bits. The 32bit
temporary result of from_cblock() caused a 32bit multiplication when
calculating the sector of the fast device in issue_copy_real().

Use an intermediate 64bit type to store the 32bit from_cblock() to allow
for proper 64bit multiplication.

Here is an example of how this bug manifests on an ext4 filesystem:

EXT4-fs error (device dm-0): ext4_mb_generate_buddy:756: group 17136, 32768 clusters in bitmap, 30688 in gd; block bitmap corrupt.
JBD2: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# e0d849fa 27-Feb-2014 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: fix truncation bug when mapping I/O to >2TB fast device

When remapping a block to the cache's fast device that is larger than
2TB we must not truncate the destination sector to 32bits. The 32bit
temporary result of from_cblock() was being overflowed in
remap_to_cache() due to the logical left shift.

Use an intermediate 64bit type to store the 32bit from_cblock() result
to fix the overflow.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org


# 80ae49aa 31-Jan-2014 Mike Snitzer <snitzer@redhat.com>

dm cache: do not add migration to completed list before unhooking bio

When completing an overwrite bio, in overwrite_endio(), the associated
migration should not be added to the 'completed_migrations' until the
bio's fields are restored with dm_unhook_bio().

Otherwise, do_worker() can race to process 'completed_migrations' before
dm_unhook_bio() -- so the bio's bi_end_io is incorrect. This is
unlikely to cause any problems given the current code but should be
fixed on the basis of correctness.

Also, the cache's spinlock only needs to be held when manipulating the
'completed_migrations' list -- other changes don't need protection.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>


# c6eda5e8 31-Jan-2014 Mike Snitzer <snitzer@redhat.com>

dm cache: move hook_info into common portion of per_bio_data structure

Commit c9d28d5d ("dm cache: promotion optimisation for writes")
incorrectly placed the 'hook_info' member in the writethrough-only
portion of the per_bio_data structure.

Given that the overwrite optimization may be used for writeback the
'hook_info' member must be placed above the 'cache' member of the
per_bio_data structure. Any members above 'cache' are available from
both writeback and writethrough modes' per_bio_data structure.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org # 3.13+


# 2e68c4e6 15-Jan-2014 Mike Snitzer <snitzer@redhat.com>

dm cache: add policy name to status output

The cache's policy may have been established using the "default" alias,
which is currently the "mq" policy but the default policy may change in
the future. It is useful to know exactly which policy is being used.

Add a 'real' member to the dm_cache_policy_type structure and have the
"default" dm_cache_policy_type point to the real "mq"
dm_cache_policy_type. Update dm_cache_policy_get_name() to check if
real is set, if so report the name of the real policy (not the alias).

Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 6a388618 09-Jan-2014 Mike Snitzer <snitzer@redhat.com>

dm cache: add block sizes and total cache blocks to status output

Improve cache_status to emit:
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
...

Adding the block sizes allows for easier calculation of the overall size
of both the metadata and cache devices. Adding <#total cache blocks>
provides useful context for how much of the cache is used.

Unfortunately these additions to the status will require updates to
users' scripts that monitor the cache status. But these changes help
provide more comprehensive information about the cache device and will
simplify tools that are being developed to manage dm-cache devices --
because they won't need to issue 3 operations to cobble together the
information that we can easily provide via a single status ioctl.

While updating the status documentation in cache.txt spaces were
tabify'd.

Requested-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>


# 08844800 29-Nov-2013 Vincent Pelletier <plr.vincent@gmail.com>

dm cache: actually resize cache

Commit f494a9c6b1b6dd9a9f21bbb75d9210d478eeb498 ("dm cache: cache
shrinking support") broke cache resizing support.

dm_cache_resize() is called with cache->cache_size before it gets
updated to new_size, so it is a no-op. But the dm-cache superblock is
updated with the new_size even though the backing dm-array is not
resized. Fix this by passing the new_size to dm_cache_resize().

Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 8d307269 03-Dec-2013 Mike Snitzer <snitzer@redhat.com>

dm cache: increment bi_remaining when bi_end_io is restored

Move the bio->bi_remaining increment into dm_unhook_bio() so the
overwrite_endio() handler works as expected.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 196d38bc 23-Nov-2013 Kent Overstreet <kmo@daterainc.com>

block: Generic bio chaining

This adds a generic mechanism for chaining bio completions. This is
going to be used for a bio_split() replacement, and it turns out to be
very useful in a fair amount of driver code - a fair number of drivers
were implementing this in their own roundabout ways, often painfully.

Note that this means it's no longer to call bio_endio() more than once
on the same bio! This can cause problems for drivers that save/restore
bi_end_io. Arguably they shouldn't be saving/restoring bi_end_io at all
- in all but the simplest cases they'd be better off just cloning the
bio, and immutable biovecs is making bio cloning cheaper. But for now,
we add a bio_endio_nodec() for these cases.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>


# 4f024f37 11-Oct-2013 Kent Overstreet <kmo@daterainc.com>

block: Abstract out bvec iterator

Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6


# 7b6b2bc9 11-Nov-2013 Mike Snitzer <snitzer@redhat.com>

dm cache: resolve small nits and improve Documentation

Document passthrough mode, cache shrinking, and cache invalidation.
Also, use strcasecmp() and hlist_unhashed().

Reported-by: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 65790ff9 08-Nov-2013 Joe Thornber <ejt@redhat.com>

dm cache: add cache block invalidation support

Cache block invalidation is removing an entry from the cache without
writing it back. Cache blocks can be invalidated via the
'invalidate_cblocks' message, which takes an arbitrary number of cblock
ranges:
invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]*

E.g.
dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2ee57d58 24-Oct-2013 Joe Thornber <ejt@redhat.com>

dm cache: add passthrough mode

"Passthrough" is a dm-cache operating mode (like writethrough or
writeback) which is intended to be used when the cache contents are not
known to be coherent with the origin device. It behaves as follows:

* All reads are served from the origin device (all reads miss the cache)
* All writes are forwarded to the origin device; additionally, write
hits cause cache block invalidates

This mode decouples cache coherency checks from cache device creation,
largely to avoid having to perform coherency checks while booting. Boot
scripts can create cache devices in passthrough mode and put them into
service (mount cached filesystems, for example) without having to worry
about coherency. Coherency that exists is maintained, although the
cache will gradually cool as writes take place.

Later, applications can perform coherency checks, the nature of which
will depend on the type of the underlying storage. If coherency can be
verified, the cache device can be transitioned to writethrough or
writeback mode while still warm; otherwise, the cache contents can be
discarded prior to transitioning to the desired operating mode.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Morgan Mears <Morgan.Mears@netapp.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# f494a9c6 31-Oct-2013 Joe Thornber <ejt@redhat.com>

dm cache: cache shrinking support

Allow a cache to shrink if the blocks being removed from the cache are
not dirty.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c9d28d5d 31-Oct-2013 Joe Thornber <ejt@redhat.com>

dm cache: promotion optimisation for writes

If a write block triggers promotion and covers a whole block we can
avoid a copy.

Introduce dm_{hook,unhook}_bio to simplify saving and restoring bio
fields (bi_private is now used by overwrite). Switch writethrough
support over to using these helpers too.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# ffcbcb67 14-Oct-2013 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: optimize commit_if_needed

Check commit_requested flag _before_ calling
dm_cache_changed_this_transaction() superfluously.

Also, be sure to set last_commit_jiffies _after_ dm_cache_commit()
completes.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 2c2263c9 14-Oct-2013 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: log error message if dm_kcopyd_copy() fails

A migration failure should be logged (albeit limited).

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 80f659f3 14-Oct-2013 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: use cell_defer() boolean argument consistently

Fix a few cell_defer() calls that weren't passing a bool.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 4cb3e1db 01-Oct-2013 Mikulas Patocka <mpatocka@redhat.com>

dm cache: return -EINVAL if the user specifies unknown cache policy

Return -EINVAL when the specified cache policy is unknown rather than
returning -ENOMEM.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 238f8363 30-Oct-2013 Joe Thornber <ejt@redhat.com>

dm cache: improve efficiency of quiescing flag management

Make the quiescing flag an atomic_t and stop protecting it with a spin
lock.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# 66cb1910 30-Oct-2013 Joe Thornber <ejt@redhat.com>

dm cache: fix a race condition between queuing new migrations and quiescing for a shutdown

The code that was trying to do this was inadequate. The postsuspend
method (in ioctl context), needs to wait for the worker thread to
acknowledge the request to quiesce. Otherwise the migration count may
drop to zero temporarily before the worker thread realises we're
quiescing. In this case the target will be taken down, but the worker
thread may have issued a new migration, which will cause an oops when
it completes.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 3.9+


# f8e5f01a 20-Oct-2013 Joe Thornber <ejt@redhat.com>

dm cache: io destined for the cache device can now serve as tick bios

Previously only origin bios could trigger ticks, which meant if all
the io was destined for the cache no ticks were generated. If no ticks
are generated then multiple hits, and movements in general, are
attributed to the same tick.

Only a stop gap fix, we need a better solution.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>


# c9ec5d7c 16-Aug-2013 Mike Snitzer <snitzer@redhat.com>

dm cache: eliminate holes in cache structure

Reorder members in the cache structure to eliminate 6 out of 7 holes
(reclaiming 24 bytes). Also, the 'worker' and 'waker' members no longer
straddle cachelines.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>


# f6109372 20-Aug-2013 Mike Snitzer <snitzer@redhat.com>

dm cache: fix stacking of geometry limits

Do not blindly override the queue limits (specifically io_min and
io_opt). Allow traditional stacking of these limits if io_opt is a
factor of the cache's data block size.

Without this patch mkfs.xfs does not recognize the cache device's
provided limits as a useful geometry (e.g. raid) so these hints are
ignored. This was due to setting io_min to a useless value.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>


# 05473044 16-Aug-2013 Mike Snitzer <snitzer@redhat.com>

dm cache: add data block size limits to code and Documentation

Place upper bound on the cache's data block size (1GB).

Inform users that the data block size can't be any arbitrary number,
i.e. its value must be between 32KB and 1GB. Also, it should be a
multiple of 32KB.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>


# 43aeaa29 10-Jul-2013 Mikulas Patocka <mpatocka@redhat.com>

dm cache: fix arm link errors with inline

Use __always_inline to avoid a link failure with gcc 4.6 on ARM.
gcc 4.7 is OK.

It creates a function block_div.part.8, it references __udivdi3 and
__umoddi3 and it is never called. The references to __udivdi3 and
__umoddi3 cause a link failure.

Reported-by: Rob Herring <robherring2@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 2f14f4b5 10-May-2013 Joe Thornber <ejt@redhat.com>

dm cache: set config value

Share configuration option processing code between the dm cache
ctr and message functions.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 2c73c471 10-May-2013 Alasdair G Kergon <agk@redhat.com>

dm cache: move config fns

Move process_config_option() in dm-cache-target.c to make the
next patch more readable.

Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 8c5008fa 10-May-2013 Joe Thornber <ejt@redhat.com>

dm cache: replace memcpy with struct assignment

Use struct assignment rather than memcpy in dm cache.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# aeed1420 10-May-2013 Joe Thornber <ejt@redhat.com>

dm cache: fix typos in comments

Fix up some typos in dm-cache comments.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# f8350daf 10-May-2013 Joe Thornber <ejt@redhat.com>

dm cache: tune migration throttling

Tune the dm cache migration throttling.

i) Issue a tick every second, just in case there's no i/o going through.

ii) Drop the migration threshold right down to something suitable for
background work.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# fa4d683a 10-May-2013 Wei Yongjun <yongjun_wei@trendmicro.com.cn>

dm cache: fix error return code in cache_create

Return -ENOMEM if memory allocation fails in cache_create
instead of 0 (to avoid NULL pointer dereference).

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 19b0092e 05-Apr-2013 Mike Snitzer <snitzer@redhat.com>

dm cache: reduce bio front_pad size in writeback mode

A recent patch to fix the dm cache target's writethrough mode extended
the bio's front_pad to include a 1056-byte struct dm_bio_details.
Writeback mode doesn't need this, so this patch reduces the
per_bio_data_size to 16 bytes in this case instead of 1096.

The dm_bio_details structure was added in "dm cache: fix writes to
cache device in writethrough mode" which fixed commit e2e74d617e ("dm
cache: fix race in writethrough implementation"). In writeback mode
we avoid allocating the writethrough-specific members of the
per_bio_data structure (the dm_bio_details structure included).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# b844fe69 05-Apr-2013 Darrick J. Wong <darrick.wong@oracle.com>

dm cache: fix writes to cache device in writethrough mode

The dm-cache writethrough strategy introduced by commit e2e74d617eadc15
("dm cache: fix race in writethrough implementation") issues a bio to
the origin device, remaps and then issues the bio to the cache device.
This more conservative in-series approach was selected to favor
correctness over performance (of the previous parallel writethrough).
However, this in-series implementation that reuses the same bio to write
both the origin and cache device didn't take into account that the block
layer's req_bio_endio() modifies a completing bio's bi_sector and
bi_size. So the new writethrough strategy needs to preserve these bio
fields, and restore them before submission to the cache device,
otherwise nothing gets written to the cache (because bi_size is 0).

This patch adds a struct dm_bio_details field to struct per_bio_data,
and uses dm_bio_record() and dm_bio_restore() to ensure the bio is
restored before reissuing to the cache device. Adding such a large
structure to the per_bio_data is not ideal but we can improve this
later, for now correctness is the important thing.

This problem initially went unnoticed because the dm-cache test-suite
uses a linear DM device for the dm-cache device's origin device.
Writethrough worked as expected because DM submits a *clone* of the
original bio, so the original bio which was reused for the cache was
never touched.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# ea2dd8c1 20-Mar-2013 Mike Snitzer <snitzer@redhat.com>

dm cache: policy ignore hints if generated by different version

When reading the dm cache metadata from disk, ignore the policy hints
unless they were generated by the same major version number of the same
policy module.

The hints are considered to be private data belonging to the specific
module that generated them and there is no requirement for them to make
sense to different versions of the policy that generated them.
Policy modules are all required to work fine if no previous hints are
supplied (or if existing hints are lost).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# e2e74d61 20-Mar-2013 Joe Thornber <ejt@redhat.com>

dm cache: fix race in writethrough implementation

We have found a race in the optimisation used in the dm cache
writethrough implementation. Currently, dm core sends the cache target
two bios, one for the origin device and one for the cache device and
these are processed in parallel. This patch avoids the race by
changing the code back to a simpler (slower) implementation which
processes the two writes in series, one after the other, until we can
develop a complete fix for the problem.

When the cache is in writethrough mode it needs to send WRITE bios to
both the origin and cache devices.

Previously we've been implementing this by having dm core query the
cache target on every write to find out how many copies of the bio it
wants. The cache will ask for two bios if the block is in the cache,
and one otherwise.

Then main problem with this is it's racey. At the time this check is
made the bio hasn't yet been submitted and so isn't being taken into
account when quiescing a block for migration (promotion or demotion).
This means a single bio may be submitted when two were needed because
the block has since been promoted to the cache (catastrophic), or two
bios where only one is needed (harmless).

I really don't want to start entering bios into the quiescing system
(deferred_set) in the get_num_write_bios callback. Instead this patch
simplifies things; only one bio is submitted by the core, this is
first written to the origin and then the cache device in series.
Obviously this will have a latency impact.

deferred_writethrough_bios is introduced to record bios that must be
later issued to the cache device from the worker thread. This deferred
submission, after the origin bio completes, is required given that we're
in interrupt context (writethrough_endio).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# b978440b 20-Mar-2013 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: avoid calling policy destructor twice on error

If the cache policy's config values are not able to be set we must
set the policy to NULL after destroying it in create_cache_policy()
so we don't attempt to destroy it a second time later.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 617a0b89 20-Mar-2013 Heinz Mauelshagen <heinzm@redhat.com>

dm cache: detect cache_create failure

Return error if cache_create() fails.

A missing return check made cache_ctr continue even after an error in
cache_create() resulting in the cache object being destroyed. So a
simple failure like an odd number of cache policy config value arguments
would result in an oops.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# 414dd67d 20-Mar-2013 Joe Thornber <ejt@redhat.com>

dm cache: avoid 64 bit division on 32 bit

Squash various 32bit link errors.

>> on i386:
>> drivers/built-in.o: In function `is_discarded_oblock':
>> dm-cache-target.c:(.text+0x1ea28e): undefined reference to `__udivdi3'
...

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>


# c6b4fcba 01-Mar-2013 Joe Thornber <ejt@redhat.com>

dm: add cache target

Add a target that allows a fast device such as an SSD to be used as a
cache for a slower device such as a disk.

A plug-in architecture was chosen so that the decisions about which data
to migrate and when are delegated to interchangeable tunable policy
modules. The first general purpose module we have developed, called
"mq" (multiqueue), follows in the next patch. Other modules are
under development.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Heinz Mauelshagen <mauelshagen@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>