/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2012, 2018 by Delphix. All rights reserved.
 */

/*
 * Virtual Device Labels
 * ---------------------
 *
 * The vdev label serves several distinct purposes:
 *
 *	1. Uniquely identify this device as part of a ZFS pool and confirm its
 *	   identity within the pool.
 *
 *	2. Verify that all the devices given in a configuration are present
 *	   within the pool.
 *
 *	3. Determine the uberblock for the pool.
 *
 *	4. In case of an import operation, determine the configuration of the
 *	   top-level vdev of which it is a part.
 *
 *	5. If an import operation cannot find all the devices in the pool,
 *	   provide enough information to the administrator to determine which
 *	   devices are missing.
 *
 * It is important to note that while the kernel is responsible for writing
 * the label, it only consumes the information in the first three cases. The
 * latter information is only consumed in userland when determining the
 * configuration to import a pool.
 *
 *
 * Label Organization
 * ------------------
 *
 * Before describing the contents of the label, it's important to understand
 * how the labels are written and updated with respect to the uberblock.
 *
 * When the pool configuration is altered, either because it was newly created
 * or a device was added, we want to update all the labels such that we can
 * deal with fatal failure at any point. To this end, each disk has two labels
 * which are updated before and after the uberblock is synced. Assuming we
 * have labels and an uberblock with the following transaction groups:
 *
 *	   L1          UB          L2
 *	+------+    +------+    +------+
 *	|      |    |      |    |      |
 *	| t10  |    | t10  |    | t10  |
 *	|      |    |      |    |      |
 *	+------+    +------+    +------+
 *
 * In this stable state, the labels and the uberblock were all updated within
 * the same transaction group (10). Each label is mirrored and checksummed, so
 * that we can detect when we fail partway through writing the label.
 *
 * In order to identify which labels are valid, the labels are written in the
 * following manner:
 *
 *	1. For each vdev, update 'L1' to the new label
 *	2. Update the uberblock
 *	3. For each vdev, update 'L2' to the new label
 *
 * Given arbitrary failure, we can determine the correct label to use based on
 * the transaction group. If we fail after updating L1 but before updating the
 * UB, we will notice that L1's transaction group is greater than the
 * uberblock's, so L2 must be valid.
 * If we fail after writing the uberblock but before writing L2, we will
 * notice that L2's transaction group is less than L1's, and therefore L1 is
 * valid.
 *
 * An added complexity is that not every label is updated when the config
 * is synced. If we add a single device, we do not want to have to re-write
 * every label for every device in the pool. This means that both L1 and L2
 * may be older than the pool uberblock, because the necessary information is
 * stored on another vdev.
 *
 *
 * On-disk Format
 * --------------
 *
 * The vdev label consists of two distinct parts, and is wrapped within the
 * vdev_label_t structure. The label begins with 8k of padding to permit
 * legacy VTOC disk labels; this padding is otherwise ignored.
 *
 * The first half of the label is a packed nvlist which contains pool-wide
 * properties, per-vdev properties, and configuration information. It is
 * described in more detail below.
 *
 * The latter half of the label consists of a redundant array of uberblocks.
 * These uberblocks are updated whenever a transaction group is committed,
 * or when the configuration is updated. When a pool is loaded, we scan each
 * vdev for the 'best' uberblock.
 *
 *
 * Configuration Information
 * -------------------------
 *
 * The nvlist describing the pool and vdev contains the following elements:
 *
 *	version		ZFS on-disk version
 *	name		Pool name
 *	state		Pool state
 *	txg		Transaction group in which this label was written
 *	pool_guid	Unique identifier for this pool
 *	vdev_tree	An nvlist describing the vdev tree.
 *	features_for_read
 *			An nvlist of the features necessary for reading the MOS.
 *
 * Each leaf device label also contains the following:
 *
 *	top_guid	Unique ID for top-level vdev in which this is contained
 *	guid		Unique ID for the leaf vdev
 *
 * The 'vs' configuration follows the format described in 'spa_config.c'.
 */

#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/dmu.h>
#include <sys/zap.h>
#include <sys/vdev.h>
#include <sys/vdev_impl.h>
#include <sys/uberblock_impl.h>
#include <sys/metaslab.h>
#include <sys/metaslab_impl.h>
#include <sys/zio.h>
#include <sys/dsl_scan.h>
#include <sys/abd.h>
#include <sys/fs/zfs.h>
#include <sys/trim_map.h>

static boolean_t vdev_trim_on_init = B_TRUE;
SYSCTL_DECL(_vfs_zfs_vdev);
SYSCTL_INT(_vfs_zfs_vdev, OID_AUTO, trim_on_init, CTLFLAG_RWTUN,
    &vdev_trim_on_init, 0, "Enable/disable full vdev trim on initialisation");

/*
 * Basic routines to read and write from a vdev label.
 * Used throughout the rest of this file.
 */
uint64_t
vdev_label_offset(uint64_t psize, int l, uint64_t offset)
{
        ASSERT(offset < sizeof (vdev_label_t));
        ASSERT(P2PHASE_TYPED(psize, sizeof (vdev_label_t), uint64_t) == 0);

        return (offset + l * sizeof (vdev_label_t) + (l < VDEV_LABELS / 2 ?
            0 : psize - VDEV_LABELS * sizeof (vdev_label_t)));
}

/*
 * Returns the label number associated with the given offset.
 */
int
vdev_label_number(uint64_t psize, uint64_t offset)
{
        int l;

        if (offset >= psize - VDEV_LABEL_END_SIZE) {
                offset -= psize - VDEV_LABEL_END_SIZE;
                offset += (VDEV_LABELS / 2) * sizeof (vdev_label_t);
        }
        l = offset / sizeof (vdev_label_t);
        return (l < VDEV_LABELS ? l : -1);
}
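/*
 * Illustrative only (device size is hypothetical): with
 * sizeof (vdev_label_t) == 256K and VDEV_LABELS == 4, a 1 GiB device
 * (psize == 0x40000000) lays its labels out as:
 *
 *	vdev_label_offset(psize, 0, 0) == 0x00000000	(L0, front)
 *	vdev_label_offset(psize, 1, 0) == 0x00040000	(L1, front)
 *	vdev_label_offset(psize, 2, 0) == 0x3ff80000	(L2, back)
 *	vdev_label_offset(psize, 3, 0) == 0x3ffc0000	(L3, back)
 *
 * and vdev_label_number() inverts the mapping, e.g.
 * vdev_label_number(psize, 0x3ff80000) == 2.
 */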
static void
vdev_label_read(zio_t *zio, vdev_t *vd, int l, abd_t *buf, uint64_t offset,
    uint64_t size, zio_done_func_t *done, void *private, int flags)
{
        ASSERT(spa_config_held(zio->io_spa, SCL_STATE_ALL, RW_WRITER) ==
            SCL_STATE_ALL);
        ASSERT(flags & ZIO_FLAG_CONFIG_WRITER);

        zio_nowait(zio_read_phys(zio, vd,
            vdev_label_offset(vd->vdev_psize, l, offset),
            size, buf, ZIO_CHECKSUM_LABEL, done, private,
            ZIO_PRIORITY_SYNC_READ, flags, B_TRUE));
}

static void
vdev_label_write(zio_t *zio, vdev_t *vd, int l, abd_t *buf, uint64_t offset,
    uint64_t size, zio_done_func_t *done, void *private, int flags)
{
        ASSERT(spa_config_held(zio->io_spa, SCL_ALL, RW_WRITER) == SCL_ALL ||
            (spa_config_held(zio->io_spa, SCL_CONFIG | SCL_STATE, RW_READER) ==
            (SCL_CONFIG | SCL_STATE) &&
            dsl_pool_sync_context(spa_get_dsl(zio->io_spa))));
        ASSERT(flags & ZIO_FLAG_CONFIG_WRITER);

        zio_nowait(zio_write_phys(zio, vd,
            vdev_label_offset(vd->vdev_psize, l, offset),
            size, buf, ZIO_CHECKSUM_LABEL, done, private,
            ZIO_PRIORITY_SYNC_WRITE, flags, B_TRUE));
}

static void
root_vdev_actions_getprogress(vdev_t *vd, nvlist_t *nvl)
{
        spa_t *spa = vd->vdev_spa;

        if (vd != spa->spa_root_vdev)
                return;

        /* provide either current or previous scan information */
        pool_scan_stat_t ps;
        if (spa_scan_get_stats(spa, &ps) == 0) {
                fnvlist_add_uint64_array(nvl,
                    ZPOOL_CONFIG_SCAN_STATS, (uint64_t *)&ps,
                    sizeof (pool_scan_stat_t) / sizeof (uint64_t));
        }

        pool_removal_stat_t prs;
        if (spa_removal_get_stats(spa, &prs) == 0) {
                fnvlist_add_uint64_array(nvl,
                    ZPOOL_CONFIG_REMOVAL_STATS, (uint64_t *)&prs,
                    sizeof (prs) / sizeof (uint64_t));
        }

        pool_checkpoint_stat_t pcs;
        if (spa_checkpoint_get_stats(spa, &pcs) == 0) {
                fnvlist_add_uint64_array(nvl,
                    ZPOOL_CONFIG_CHECKPOINT_STATS, (uint64_t *)&pcs,
                    sizeof (pcs) / sizeof (uint64_t));
        }
}

/*
 * Generate the nvlist representing this vdev's config.
 */
nvlist_t *
vdev_config_generate(spa_t *spa, vdev_t *vd, boolean_t getstats,
    vdev_config_flag_t flags)
{
        nvlist_t *nv = NULL;
        vdev_indirect_config_t *vic = &vd->vdev_indirect_config;

        nv = fnvlist_alloc();

        fnvlist_add_string(nv, ZPOOL_CONFIG_TYPE, vd->vdev_ops->vdev_op_type);
        if (!(flags & (VDEV_CONFIG_SPARE | VDEV_CONFIG_L2CACHE)))
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_ID, vd->vdev_id);
        fnvlist_add_uint64(nv, ZPOOL_CONFIG_GUID, vd->vdev_guid);

        if (vd->vdev_path != NULL)
                fnvlist_add_string(nv, ZPOOL_CONFIG_PATH, vd->vdev_path);

        if (vd->vdev_devid != NULL)
                fnvlist_add_string(nv, ZPOOL_CONFIG_DEVID, vd->vdev_devid);

        if (vd->vdev_physpath != NULL)
                fnvlist_add_string(nv, ZPOOL_CONFIG_PHYS_PATH,
                    vd->vdev_physpath);

        if (vd->vdev_fru != NULL)
                fnvlist_add_string(nv, ZPOOL_CONFIG_FRU, vd->vdev_fru);

        if (vd->vdev_nparity != 0) {
                ASSERT(strcmp(vd->vdev_ops->vdev_op_type,
                    VDEV_TYPE_RAIDZ) == 0);

                /*
                 * Make sure someone hasn't managed to sneak a fancy new vdev
                 * into a crufty old storage pool.
                 */
                ASSERT(vd->vdev_nparity == 1 ||
                    (vd->vdev_nparity <= 2 &&
                    spa_version(spa) >= SPA_VERSION_RAIDZ2) ||
                    (vd->vdev_nparity <= 3 &&
                    spa_version(spa) >= SPA_VERSION_RAIDZ3));

                /*
                 * Note that we'll add the nparity tag even on storage pools
                 * that only support a single parity device -- older software
                 * will just ignore it.
                 */
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_NPARITY, vd->vdev_nparity);
        }
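        /*
         * A hypothetical userland-style sketch (not part of this file) of how
         * a consumer might pick fields back out of a config generated above,
         * using the standard nvlist lookup interfaces:
         *
         *	char *type;
         *	uint64_t guid, nparity = 1;
         *
         *	VERIFY0(nvlist_lookup_string(nv, ZPOOL_CONFIG_TYPE, &type));
         *	VERIFY0(nvlist_lookup_uint64(nv, ZPOOL_CONFIG_GUID, &guid));
         *	(void) nvlist_lookup_uint64(nv, ZPOOL_CONFIG_NPARITY, &nparity);
         *
         * Leaving nparity at 1 when the tag is absent matches the note above
         * about older software ignoring it.
         */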
        if (vd->vdev_wholedisk != -1ULL)
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_WHOLE_DISK,
                    vd->vdev_wholedisk);

        if (vd->vdev_not_present && !(flags & VDEV_CONFIG_MISSING))
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_NOT_PRESENT, 1);

        if (vd->vdev_isspare)
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_IS_SPARE, 1);

        if (!(flags & (VDEV_CONFIG_SPARE | VDEV_CONFIG_L2CACHE)) &&
            vd == vd->vdev_top) {
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_ARRAY,
                    vd->vdev_ms_array);
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_METASLAB_SHIFT,
                    vd->vdev_ms_shift);
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_ASHIFT, vd->vdev_ashift);
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_ASIZE,
                    vd->vdev_asize);
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_IS_LOG, vd->vdev_islog);
                if (vd->vdev_removing) {
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_REMOVING,
                            vd->vdev_removing);
                }
        }

        if (vd->vdev_dtl_sm != NULL) {
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_DTL,
                    space_map_object(vd->vdev_dtl_sm));
        }

        if (vic->vic_mapping_object != 0) {
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_OBJECT,
                    vic->vic_mapping_object);
        }

        if (vic->vic_births_object != 0) {
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_BIRTHS,
                    vic->vic_births_object);
        }

        if (vic->vic_prev_indirect_vdev != UINT64_MAX) {
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_PREV_INDIRECT_VDEV,
                    vic->vic_prev_indirect_vdev);
        }

        if (vd->vdev_crtxg)
                fnvlist_add_uint64(nv, ZPOOL_CONFIG_CREATE_TXG, vd->vdev_crtxg);

        if (flags & VDEV_CONFIG_MOS) {
                if (vd->vdev_leaf_zap != 0) {
                        ASSERT(vd->vdev_ops->vdev_op_leaf);
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_VDEV_LEAF_ZAP,
                            vd->vdev_leaf_zap);
                }

                if (vd->vdev_top_zap != 0) {
                        ASSERT(vd == vd->vdev_top);
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_VDEV_TOP_ZAP,
                            vd->vdev_top_zap);
                }
        }

        if (getstats) {
                vdev_stat_t vs;

                vdev_get_stats(vd, &vs);
                fnvlist_add_uint64_array(nv, ZPOOL_CONFIG_VDEV_STATS,
                    (uint64_t *)&vs, sizeof (vs) / sizeof (uint64_t));

                root_vdev_actions_getprogress(vd, nv);

                /*
                 * Note: this can be called from open context
                 * (spa_get_stats()), so we need the rwlock to prevent
                 * the mapping from being changed by condensing.
                 */
                rw_enter(&vd->vdev_indirect_rwlock, RW_READER);
                if (vd->vdev_indirect_mapping != NULL) {
                        ASSERT(vd->vdev_indirect_births != NULL);
                        vdev_indirect_mapping_t *vim =
                            vd->vdev_indirect_mapping;
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_SIZE,
                            vdev_indirect_mapping_size(vim));
                }
                rw_exit(&vd->vdev_indirect_rwlock);
                if (vd->vdev_mg != NULL &&
                    vd->vdev_mg->mg_fragmentation != ZFS_FRAG_INVALID) {
                        /*
                         * Compute approximately how much memory would be used
                         * for the indirect mapping if this device were to
                         * be removed.
                         *
                         * Note: If the frag metric is invalid, then not
                         * enough metaslabs have been converted to have
                         * histograms.
                         */
                        uint64_t seg_count = 0;
                        uint64_t to_alloc = vd->vdev_stat.vs_alloc;

                        /*
                         * There are the same number of allocated segments
                         * as free segments, so we will have at least one
                         * entry per free segment. However, small free
                         * segments (smaller than vdev_removal_max_span)
                         * will be combined with adjacent allocated segments
                         * as a single mapping.
                         */
                        for (int i = 0; i < RANGE_TREE_HISTOGRAM_SIZE; i++) {
                                if (1ULL << (i + 1) < vdev_removal_max_span) {
                                        to_alloc +=
                                            vd->vdev_mg->mg_histogram[i] <<
                                            (i + 1);
                                } else {
                                        seg_count +=
                                            vd->vdev_mg->mg_histogram[i];
                                }
                        }

                        /*
                         * The maximum length of a mapping is
                         * zfs_remove_max_segment, so we need at least one
                         * entry per zfs_remove_max_segment of allocated data.
                         */
                        seg_count += to_alloc / zfs_remove_max_segment;

                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_INDIRECT_SIZE,
                            seg_count *
                            sizeof (vdev_indirect_mapping_entry_phys_t));
                }
        }

        if (!vd->vdev_ops->vdev_op_leaf) {
                nvlist_t **child;
                int c, idx;

                ASSERT(!vd->vdev_ishole);

                child = kmem_alloc(vd->vdev_children * sizeof (nvlist_t *),
                    KM_SLEEP);

                for (c = 0, idx = 0; c < vd->vdev_children; c++) {
                        vdev_t *cvd = vd->vdev_child[c];

                        /*
                         * If we're generating an nvlist of removing
                         * vdevs then skip over any device which is
                         * not being removed.
                         */
                        if ((flags & VDEV_CONFIG_REMOVING) &&
                            !cvd->vdev_removing)
                                continue;

                        child[idx++] = vdev_config_generate(spa, cvd,
                            getstats, flags);
                }

                if (idx) {
                        fnvlist_add_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
                            child, idx);
                }

                for (c = 0; c < idx; c++)
                        nvlist_free(child[c]);

                kmem_free(child, vd->vdev_children * sizeof (nvlist_t *));

        } else {
                const char *aux = NULL;

                if (vd->vdev_offline && !vd->vdev_tmpoffline)
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_OFFLINE, B_TRUE);
                if (vd->vdev_resilver_txg != 0)
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_RESILVER_TXG,
                            vd->vdev_resilver_txg);
                if (vd->vdev_faulted)
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_FAULTED, B_TRUE);
                if (vd->vdev_degraded)
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_DEGRADED, B_TRUE);
                if (vd->vdev_removed)
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_REMOVED, B_TRUE);
                if (vd->vdev_unspare)
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_UNSPARE, B_TRUE);
                if (vd->vdev_ishole)
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_IS_HOLE, B_TRUE);

                switch (vd->vdev_stat.vs_aux) {
                case VDEV_AUX_ERR_EXCEEDED:
                        aux = "err_exceeded";
                        break;

                case VDEV_AUX_EXTERNAL:
                        aux = "external";
                        break;
                }

                if (aux != NULL)
                        fnvlist_add_string(nv, ZPOOL_CONFIG_AUX_STATE, aux);

                if (vd->vdev_splitting && vd->vdev_orig_guid != 0LL) {
                        fnvlist_add_uint64(nv, ZPOOL_CONFIG_ORIG_GUID,
                            vd->vdev_orig_guid);
                }
        }

        return (nv);
}
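/*
 * Worked example of the indirect-size estimate in the getstats path above
 * (all numbers hypothetical): with vdev_removal_max_span == 32K, histogram
 * buckets 0..13 (free segments smaller than 16K) are folded into to_alloc,
 * since each such free segment will be mapped together with its neighbors,
 * while every free segment in buckets 14 and up contributes one mapping
 * entry to seg_count. If to_alloc then came to 64G with
 * zfs_remove_max_segment at 16M, the final term would add
 * 64G / 16M == 4096 entries.
 */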
/*
 * Generate a view of the top-level vdevs. If we currently have holes
 * in the namespace, then generate an array which contains a list of holey
 * vdevs. Additionally, add the number of top-level children that currently
 * exist.
 */
void
vdev_top_config_generate(spa_t *spa, nvlist_t *config)
{
        vdev_t *rvd = spa->spa_root_vdev;
        uint64_t *array;
        uint_t c, idx;

        array = kmem_alloc(rvd->vdev_children * sizeof (uint64_t), KM_SLEEP);

        for (c = 0, idx = 0; c < rvd->vdev_children; c++) {
                vdev_t *tvd = rvd->vdev_child[c];

                if (tvd->vdev_ishole) {
                        array[idx++] = c;
                }
        }

        if (idx) {
                VERIFY(nvlist_add_uint64_array(config, ZPOOL_CONFIG_HOLE_ARRAY,
                    array, idx) == 0);
        }

        VERIFY(nvlist_add_uint64(config, ZPOOL_CONFIG_VDEV_CHILDREN,
            rvd->vdev_children) == 0);

        kmem_free(array, rvd->vdev_children * sizeof (uint64_t));
}

/*
 * Returns the configuration from the label of the given vdev. For vdevs
 * which don't have a txg value stored in their label (i.e. spares/cache)
 * or which have not been completely initialized (txg = 0), just return
 * the configuration from the first valid label we find. Otherwise,
 * find the most up-to-date label that does not exceed the specified
 * 'txg' value.
 */
nvlist_t *
vdev_label_read_config(vdev_t *vd, uint64_t txg)
{
        spa_t *spa = vd->vdev_spa;
        nvlist_t *config = NULL;
        vdev_phys_t *vp;
        abd_t *vp_abd;
        zio_t *zio;
        uint64_t best_txg = 0;
        uint64_t label_txg = 0;
        int error = 0;
        int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL |
            ZIO_FLAG_SPECULATIVE;

        ASSERT(spa_config_held(spa, SCL_STATE_ALL, RW_WRITER) == SCL_STATE_ALL);

        if (!vdev_readable(vd))
                return (NULL);

        vp_abd = abd_alloc_linear(sizeof (vdev_phys_t), B_TRUE);
        vp = abd_to_buf(vp_abd);

retry:
        for (int l = 0; l < VDEV_LABELS; l++) {
                nvlist_t *label = NULL;

                zio = zio_root(spa, NULL, NULL, flags);

                vdev_label_read(zio, vd, l, vp_abd,
                    offsetof(vdev_label_t, vl_vdev_phys),
                    sizeof (vdev_phys_t), NULL, NULL, flags);

                if (zio_wait(zio) == 0 &&
                    nvlist_unpack(vp->vp_nvlist, sizeof (vp->vp_nvlist),
                    &label, 0) == 0) {
                        /*
                         * Auxiliary vdevs won't have txg values in their
                         * labels and newly added vdevs may not have been
                         * completely initialized so just return the
                         * configuration from the first valid label we
                         * encounter.
                         */
                        error = nvlist_lookup_uint64(label,
                            ZPOOL_CONFIG_POOL_TXG, &label_txg);
                        if ((error || label_txg == 0) && !config) {
                                config = label;
                                break;
                        } else if (label_txg <= txg && label_txg > best_txg) {
                                best_txg = label_txg;
                                nvlist_free(config);
                                config = fnvlist_dup(label);
                        }
                }

                if (label != NULL) {
                        nvlist_free(label);
                        label = NULL;
                }
        }

        if (config == NULL && !(flags & ZIO_FLAG_TRYHARD)) {
                flags |= ZIO_FLAG_TRYHARD;
                goto retry;
        }

        /*
         * We found a valid label but it didn't pass txg restrictions.
         */
        if (config == NULL && label_txg != 0) {
                vdev_dbgmsg(vd, "label discarded as txg is too large "
                    "(%llu > %llu)", (u_longlong_t)label_txg,
                    (u_longlong_t)txg);
        }

        abd_free(vp_abd);

        return (config);
}
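/*
 * A sketch of the typical caller pattern (hypothetical caller; compare
 * vdev_inuse() below, which passes -1ULL to disable the txg ceiling):
 *
 *	nvlist_t *cfg = vdev_label_read_config(vd, UINT64_MAX);
 *	if (cfg != NULL) {
 *		uint64_t state;
 *
 *		if (nvlist_lookup_uint64(cfg, ZPOOL_CONFIG_POOL_STATE,
 *		    &state) == 0 && state == POOL_STATE_ACTIVE) {
 *			(...) the device claims membership in a live pool
 *		}
 *		nvlist_free(cfg);
 *	}
 */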
/*
 * Determine if a device is in use. The 'spare_guid' parameter will be filled
 * in with the device guid if this spare is active elsewhere on the system.
 */
static boolean_t
vdev_inuse(vdev_t *vd, uint64_t crtxg, vdev_labeltype_t reason,
    uint64_t *spare_guid, uint64_t *l2cache_guid)
{
        spa_t *spa = vd->vdev_spa;
        uint64_t state, pool_guid, device_guid, txg, spare_pool;
        uint64_t vdtxg = 0;
        nvlist_t *label;

        if (spare_guid)
                *spare_guid = 0ULL;
        if (l2cache_guid)
                *l2cache_guid = 0ULL;

        /*
         * Read the label, if any, and perform some basic sanity checks.
         */
        if ((label = vdev_label_read_config(vd, -1ULL)) == NULL)
                return (B_FALSE);

        (void) nvlist_lookup_uint64(label, ZPOOL_CONFIG_CREATE_TXG,
            &vdtxg);

        if (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_STATE,
            &state) != 0 ||
            nvlist_lookup_uint64(label, ZPOOL_CONFIG_GUID,
            &device_guid) != 0) {
                nvlist_free(label);
                return (B_FALSE);
        }

        if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
            (nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_GUID,
            &pool_guid) != 0 ||
            nvlist_lookup_uint64(label, ZPOOL_CONFIG_POOL_TXG,
            &txg) != 0)) {
                nvlist_free(label);
                return (B_FALSE);
        }

        nvlist_free(label);

        /*
         * Check to see if this device indeed belongs to the pool it claims to
         * be a part of. The only way this is allowed is if the device is a hot
         * spare (which we check for later on).
         */
        if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
            !spa_guid_exists(pool_guid, device_guid) &&
            !spa_spare_exists(device_guid, NULL, NULL) &&
            !spa_l2cache_exists(device_guid, NULL))
                return (B_FALSE);

        /*
         * If the transaction group is zero, then this is an initialized (but
         * unused) label. This is only an error if the create transaction
         * on-disk is the same as the one we're using now, in which case the
         * user has attempted to add the same vdev multiple times in the same
         * transaction.
         */
        if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
            txg == 0 && vdtxg == crtxg)
                return (B_TRUE);

        /*
         * Check to see if this is a spare device. We do an explicit check for
         * spa_has_spare() here because it may be on our pending list of spares
         * to add. We also check if it is an l2cache device.
         */
        if (spa_spare_exists(device_guid, &spare_pool, NULL) ||
            spa_has_spare(spa, device_guid)) {
                if (spare_guid)
                        *spare_guid = device_guid;

                switch (reason) {
                case VDEV_LABEL_CREATE:
                case VDEV_LABEL_L2CACHE:
                        return (B_TRUE);

                case VDEV_LABEL_REPLACE:
                        return (!spa_has_spare(spa, device_guid) ||
                            spare_pool != 0ULL);

                case VDEV_LABEL_SPARE:
                        return (spa_has_spare(spa, device_guid));
                }
        }

        /*
         * Check to see if this is an l2cache device.
         */
        if (spa_l2cache_exists(device_guid, NULL))
                return (B_TRUE);

        /*
         * We can't rely on a pool's state if it's been imported
         * read-only. Instead we look to see if the pool is marked
         * read-only in the namespace and set the state to active.
         */
        if (state != POOL_STATE_SPARE && state != POOL_STATE_L2CACHE &&
            (spa = spa_by_guid(pool_guid, device_guid)) != NULL &&
            spa_mode(spa) == FREAD)
                state = POOL_STATE_ACTIVE;

        /*
         * If the device is marked ACTIVE, then this device is in use by another
         * pool on the system.
         */
        return (state == POOL_STATE_ACTIVE);
}
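/*
 * Caller's-eye sketch (hypothetical device): a disk whose label decodes to
 * state == POOL_STATE_ACTIVE for a pool still known to the system makes
 * vdev_inuse() return B_TRUE, so vdev_label_init() below refuses it:
 *
 *	if (vdev_inuse(vd, crtxg, VDEV_LABEL_CREATE, &spare, &l2cache))
 *		return (SET_ERROR(EBUSY));
 *
 * whereas a blank label, or an initialized-but-unused label from an older
 * create transaction, falls through and may be overwritten.
 */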
/*
 * Initialize a vdev label. We check to make sure each leaf device is not in
 * use, and writable. We put down an initial label which we will later
 * overwrite with a complete label. Note that it's important to do this
 * sequentially, not in parallel, so that we catch cases of multiple use of the
 * same leaf vdev in the vdev we're creating -- e.g. mirroring a disk with
 * itself.
 */
int
vdev_label_init(vdev_t *vd, uint64_t crtxg, vdev_labeltype_t reason)
{
        spa_t *spa = vd->vdev_spa;
        nvlist_t *label;
        vdev_phys_t *vp;
        abd_t *vp_abd;
        abd_t *pad2;
        uberblock_t *ub;
        abd_t *ub_abd;
        zio_t *zio;
        char *buf;
        size_t buflen;
        int error;
        uint64_t spare_guid, l2cache_guid;
        int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL;

        ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);

        for (int c = 0; c < vd->vdev_children; c++)
                if ((error = vdev_label_init(vd->vdev_child[c],
                    crtxg, reason)) != 0)
                        return (error);

        /* Track the creation time for this vdev */
        vd->vdev_crtxg = crtxg;

        if (!vd->vdev_ops->vdev_op_leaf || !spa_writeable(spa))
                return (0);

        /*
         * Dead vdevs cannot be initialized.
         */
        if (vdev_is_dead(vd))
                return (SET_ERROR(EIO));

        /*
         * Determine if the vdev is in use.
         */
        if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_SPLIT &&
            vdev_inuse(vd, crtxg, reason, &spare_guid, &l2cache_guid))
                return (SET_ERROR(EBUSY));

        /*
         * If this is a request to add or replace a spare or l2cache device
         * that is in use elsewhere on the system, then we must update the
         * guid (which was initialized to a random value) to reflect the
         * actual GUID (which is shared between multiple pools).
         */
        if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_L2CACHE &&
            spare_guid != 0ULL) {
                uint64_t guid_delta = spare_guid - vd->vdev_guid;

                vd->vdev_guid += guid_delta;

                for (vdev_t *pvd = vd; pvd != NULL; pvd = pvd->vdev_parent)
                        pvd->vdev_guid_sum += guid_delta;

                /*
                 * If this is a replacement, then we want to fall through to
                 * the rest of the code. If we're adding a spare, then it's
                 * already labeled appropriately and we can just return.
                 */
                if (reason == VDEV_LABEL_SPARE)
                        return (0);
                ASSERT(reason == VDEV_LABEL_REPLACE ||
                    reason == VDEV_LABEL_SPLIT);
        }

        if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_SPARE &&
            l2cache_guid != 0ULL) {
                uint64_t guid_delta = l2cache_guid - vd->vdev_guid;

                vd->vdev_guid += guid_delta;

                for (vdev_t *pvd = vd; pvd != NULL; pvd = pvd->vdev_parent)
                        pvd->vdev_guid_sum += guid_delta;

                /*
                 * If this is a replacement, then we want to fall through to
                 * the rest of the code. If we're adding an l2cache, then it's
                 * already labeled appropriately and we can just return.
                 */
                if (reason == VDEV_LABEL_L2CACHE)
                        return (0);
                ASSERT(reason == VDEV_LABEL_REPLACE);
        }
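        /*
         * The guid_delta arithmetic above leans on unsigned wraparound: with
         * hypothetical values, if vd->vdev_guid were 0x9000 and the shared
         * guid 0x4000, then guid_delta == 0x4000 - 0x9000 (mod 2^64), so
         *
         *	vd->vdev_guid + guid_delta == 0x4000
         *
         * and adding the same delta to every ancestor's vdev_guid_sum keeps
         * the invariant that each guid_sum equals the sum of the guids
         * beneath it.
         */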
        /*
         * TRIM the whole thing, excluding the blank space and boot header
         * as specified by ZFS On-Disk Specification (section 1.3), so that
         * we start with a clean slate.
         * It's just an optimization, so we don't care if it fails.
         * Don't TRIM if removing so that we don't interfere with zpool
         * disaster recovery.
         */
        if (zfs_trim_enabled && vdev_trim_on_init && !vd->vdev_notrim &&
            (reason == VDEV_LABEL_CREATE || reason == VDEV_LABEL_SPARE ||
            reason == VDEV_LABEL_L2CACHE))
                zio_wait(zio_trim(NULL, spa, vd, VDEV_SKIP_SIZE,
                    vd->vdev_psize - VDEV_SKIP_SIZE));

        /*
         * Initialize its label.
         */
        vp_abd = abd_alloc_linear(sizeof (vdev_phys_t), B_TRUE);
        abd_zero(vp_abd, sizeof (vdev_phys_t));
        vp = abd_to_buf(vp_abd);

        /*
         * Generate a label describing the pool and our top-level vdev.
         * We mark it as being from txg 0 to indicate that it's not
         * really part of an active pool just yet. The labels will
         * be written again with a meaningful txg by spa_sync().
         */
        if (reason == VDEV_LABEL_SPARE ||
            (reason == VDEV_LABEL_REMOVE && vd->vdev_isspare)) {
                /*
                 * For inactive hot spares, we generate a special label that
                 * identifies as a mutually shared hot spare. We write the
                 * label if we are adding a hot spare, or if we are removing an
                 * active hot spare (in which case we want to revert the
                 * labels).
                 */
                VERIFY(nvlist_alloc(&label, NV_UNIQUE_NAME, KM_SLEEP) == 0);

                VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_VERSION,
                    spa_version(spa)) == 0);
                VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_POOL_STATE,
                    POOL_STATE_SPARE) == 0);
                VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_GUID,
                    vd->vdev_guid) == 0);
        } else if (reason == VDEV_LABEL_L2CACHE ||
            (reason == VDEV_LABEL_REMOVE && vd->vdev_isl2cache)) {
                /*
                 * For level 2 ARC devices, add a special label.
                 */
                VERIFY(nvlist_alloc(&label, NV_UNIQUE_NAME, KM_SLEEP) == 0);

                VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_VERSION,
                    spa_version(spa)) == 0);
                VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_POOL_STATE,
                    POOL_STATE_L2CACHE) == 0);
                VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_GUID,
                    vd->vdev_guid) == 0);
        } else {
                uint64_t txg = 0ULL;

                if (reason == VDEV_LABEL_SPLIT)
                        txg = spa->spa_uberblock.ub_txg;
                label = spa_config_generate(spa, vd, txg, B_FALSE);

                /*
                 * Add our creation time. This allows us to detect multiple
                 * vdev uses as described above, and automatically expires if
                 * we fail.
                 */
                VERIFY(nvlist_add_uint64(label, ZPOOL_CONFIG_CREATE_TXG,
                    crtxg) == 0);
        }

        buf = vp->vp_nvlist;
        buflen = sizeof (vp->vp_nvlist);

        error = nvlist_pack(label, &buf, &buflen, NV_ENCODE_XDR, KM_SLEEP);
        if (error != 0) {
                nvlist_free(label);
                abd_free(vp_abd);
                /* EFAULT means nvlist_pack ran out of room */
                return (error == EFAULT ? ENAMETOOLONG : EINVAL);
        }

        /*
         * Initialize uberblock template.
         */
        ub_abd = abd_alloc_linear(VDEV_UBERBLOCK_RING, B_TRUE);
        abd_zero(ub_abd, VDEV_UBERBLOCK_RING);
        abd_copy_from_buf(ub_abd, &spa->spa_uberblock, sizeof (uberblock_t));
        ub = abd_to_buf(ub_abd);
        ub->ub_txg = 0;

        /* Initialize the 2nd padding area. */
        pad2 = abd_alloc_for_io(VDEV_PAD_SIZE, B_TRUE);
        abd_zero(pad2, VDEV_PAD_SIZE);

        /*
         * Write everything in parallel.
         */
retry:
        zio = zio_root(spa, NULL, NULL, flags);

        for (int l = 0; l < VDEV_LABELS; l++) {

                vdev_label_write(zio, vd, l, vp_abd,
                    offsetof(vdev_label_t, vl_vdev_phys),
                    sizeof (vdev_phys_t), NULL, NULL, flags);

                /*
                 * Skip the 1st padding area.
                 * Zero out the 2nd padding area where it might have
                 * leftover data from a previous filesystem format.
                 */
                vdev_label_write(zio, vd, l, pad2,
                    offsetof(vdev_label_t, vl_pad2),
                    VDEV_PAD_SIZE, NULL, NULL, flags);

                vdev_label_write(zio, vd, l, ub_abd,
                    offsetof(vdev_label_t, vl_uberblock),
                    VDEV_UBERBLOCK_RING, NULL, NULL, flags);
        }

        error = zio_wait(zio);

        if (error != 0 && !(flags & ZIO_FLAG_TRYHARD)) {
                flags |= ZIO_FLAG_TRYHARD;
                goto retry;
        }

        nvlist_free(label);
        abd_free(pad2);
        abd_free(ub_abd);
        abd_free(vp_abd);

        /*
         * If this vdev hasn't been previously identified as a spare, then we
         * mark it as such only if a) we are labeling it as a spare, or b) it
         * exists as a spare elsewhere in the system. Do the same for
         * level 2 ARC devices.
         */
        if (error == 0 && !vd->vdev_isspare &&
            (reason == VDEV_LABEL_SPARE ||
            spa_spare_exists(vd->vdev_guid, NULL, NULL)))
                spa_spare_add(vd);

        if (error == 0 && !vd->vdev_isl2cache &&
            (reason == VDEV_LABEL_L2CACHE ||
            spa_l2cache_exists(vd->vdev_guid, NULL)))
                spa_l2cache_add(vd);

        return (error);
}

int
vdev_label_write_pad2(vdev_t *vd, const char *buf, size_t size)
{
        spa_t *spa = vd->vdev_spa;
        zio_t *zio;
        abd_t *pad2;
        int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL;
        int error;

        if (size > VDEV_PAD_SIZE)
                return (EINVAL);

        if (!vd->vdev_ops->vdev_op_leaf)
                return (ENODEV);
        if (vdev_is_dead(vd))
                return (ENXIO);

        ASSERT(spa_config_held(spa, SCL_ALL, RW_WRITER) == SCL_ALL);

        pad2 = abd_alloc_for_io(VDEV_PAD_SIZE, B_TRUE);
        abd_zero(pad2, VDEV_PAD_SIZE);
        abd_copy_from_buf(pad2, buf, size);

retry:
        zio = zio_root(spa, NULL, NULL, flags);
        vdev_label_write(zio, vd, 0, pad2,
            offsetof(vdev_label_t, vl_pad2),
            VDEV_PAD_SIZE, NULL, NULL, flags);
        error = zio_wait(zio);
        if (error != 0 && !(flags & ZIO_FLAG_TRYHARD)) {
                flags |= ZIO_FLAG_TRYHARD;
                goto retry;
        }

        abd_free(pad2);
        return (error);
}

/*
 * ==========================================================================
 * uberblock load/sync
 * ==========================================================================
 */

/*
 * Consider the following situation: txg is safely synced to disk. We've
 * written the first uberblock for txg + 1, and then we lose power. When we
 * come back up, we fail to see the uberblock for txg + 1 because, say,
 * it was on a mirrored device and the replica to which we wrote txg + 1
 * is now offline. If we then make some changes and sync txg + 1, and then
 * the missing replica comes back, then for a few seconds we'll have two
 * conflicting uberblocks on disk with the same txg. The solution is simple:
 * among uberblocks with equal txg, choose the one with the latest timestamp.
 */
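/*
 * For example (invented values): given ub1 = { .ub_txg = 100,
 * .ub_timestamp = 5000 } and ub2 = { .ub_txg = 100, .ub_timestamp = 5007 },
 * the comparator below returns -1, so ub2 -- the copy written most
 * recently -- is preferred.
 */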
static int
vdev_uberblock_compare(const uberblock_t *ub1, const uberblock_t *ub2)
{
        int cmp = AVL_CMP(ub1->ub_txg, ub2->ub_txg);
        if (likely(cmp))
                return (cmp);

        return (AVL_CMP(ub1->ub_timestamp, ub2->ub_timestamp));
}

struct ubl_cbdata {
        uberblock_t     *ubl_ubbest;    /* Best uberblock */
        vdev_t          *ubl_vd;        /* vdev associated with the above */
};

static void
vdev_uberblock_load_done(zio_t *zio)
{
        vdev_t *vd = zio->io_vd;
        spa_t *spa = zio->io_spa;
        zio_t *rio = zio->io_private;
        uberblock_t *ub = abd_to_buf(zio->io_abd);
        struct ubl_cbdata *cbp = rio->io_private;

        ASSERT3U(zio->io_size, ==, VDEV_UBERBLOCK_SIZE(vd));

        if (zio->io_error == 0 && uberblock_verify(ub) == 0) {
                mutex_enter(&rio->io_lock);
                if (ub->ub_txg <= spa->spa_load_max_txg &&
                    vdev_uberblock_compare(ub, cbp->ubl_ubbest) > 0) {
                        /*
                         * Keep track of the vdev in which this uberblock
                         * was found. We will use this information later
                         * to obtain the config nvlist associated with
                         * this uberblock.
                         */
                        *cbp->ubl_ubbest = *ub;
                        cbp->ubl_vd = vd;
                }
                mutex_exit(&rio->io_lock);
        }

        abd_free(zio->io_abd);
}

static void
vdev_uberblock_load_impl(zio_t *zio, vdev_t *vd, int flags,
    struct ubl_cbdata *cbp)
{
        for (int c = 0; c < vd->vdev_children; c++)
                vdev_uberblock_load_impl(zio, vd->vdev_child[c], flags, cbp);

        if (vd->vdev_ops->vdev_op_leaf && vdev_readable(vd)) {
                for (int l = 0; l < VDEV_LABELS; l++) {
                        for (int n = 0; n < VDEV_UBERBLOCK_COUNT(vd); n++) {
                                vdev_label_read(zio, vd, l,
                                    abd_alloc_linear(VDEV_UBERBLOCK_SIZE(vd),
                                    B_TRUE), VDEV_UBERBLOCK_OFFSET(vd, n),
                                    VDEV_UBERBLOCK_SIZE(vd),
                                    vdev_uberblock_load_done, zio, flags);
                        }
                }
        }
}

/*
 * Reads the 'best' uberblock from disk along with its associated
 * configuration. First, we read the uberblock array of each label of each
 * vdev, keeping track of the uberblock with the highest txg in each array.
 * Then, we read the configuration from the same vdev as the best uberblock.
 */
void
vdev_uberblock_load(vdev_t *rvd, uberblock_t *ub, nvlist_t **config)
{
        zio_t *zio;
        spa_t *spa = rvd->vdev_spa;
        struct ubl_cbdata cb;
        int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL |
            ZIO_FLAG_SPECULATIVE | ZIO_FLAG_TRYHARD;

        ASSERT(ub);
        ASSERT(config);

        bzero(ub, sizeof (uberblock_t));
        *config = NULL;

        cb.ubl_ubbest = ub;
        cb.ubl_vd = NULL;

        spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
        zio = zio_root(spa, NULL, &cb, flags);
        vdev_uberblock_load_impl(zio, rvd, flags, &cb);
        (void) zio_wait(zio);

        /*
         * It's possible that the best uberblock was discovered on a label
         * that has a configuration which was written in a future txg.
         * Search all labels on this vdev to find the configuration that
         * matches the txg for our uberblock.
         */
        if (cb.ubl_vd != NULL) {
                vdev_dbgmsg(cb.ubl_vd, "best uberblock found for spa %s. "
                    "txg %llu", spa->spa_name, (u_longlong_t)ub->ub_txg);

                *config = vdev_label_read_config(cb.ubl_vd, ub->ub_txg);
                if (*config == NULL && spa->spa_extreme_rewind) {
                        vdev_dbgmsg(cb.ubl_vd, "failed to read label config. "
" 1156 "Trying again without txg restrictions."); 1157 *config = vdev_label_read_config(cb.ubl_vd, UINT64_MAX); 1158 } 1159 if (*config == NULL) { 1160 vdev_dbgmsg(cb.ubl_vd, "failed to read label config"); 1161 } 1162 } 1163 spa_config_exit(spa, SCL_ALL, FTAG); 1164} 1165 1166/* 1167 * On success, increment root zio's count of good writes. 1168 * We only get credit for writes to known-visible vdevs; see spa_vdev_add(). 1169 */ 1170static void 1171vdev_uberblock_sync_done(zio_t *zio) 1172{ 1173 uint64_t *good_writes = zio->io_private; 1174 1175 if (zio->io_error == 0 && zio->io_vd->vdev_top->vdev_ms_array != 0) 1176 atomic_inc_64(good_writes); 1177} 1178 1179/* 1180 * Write the uberblock to all labels of all leaves of the specified vdev. 1181 */ 1182static void 1183vdev_uberblock_sync(zio_t *zio, uint64_t *good_writes, 1184 uberblock_t *ub, vdev_t *vd, int flags) 1185{ 1186 for (uint64_t c = 0; c < vd->vdev_children; c++) { 1187 vdev_uberblock_sync(zio, good_writes, 1188 ub, vd->vdev_child[c], flags); 1189 } 1190 1191 if (!vd->vdev_ops->vdev_op_leaf) 1192 return; 1193 1194 if (!vdev_writeable(vd)) 1195 return; 1196 1197 int n = ub->ub_txg & (VDEV_UBERBLOCK_COUNT(vd) - 1); 1198 1199 /* Copy the uberblock_t into the ABD */ 1200 abd_t *ub_abd = abd_alloc_for_io(VDEV_UBERBLOCK_SIZE(vd), B_TRUE); 1201 abd_zero(ub_abd, VDEV_UBERBLOCK_SIZE(vd)); 1202 abd_copy_from_buf(ub_abd, ub, sizeof (uberblock_t)); 1203 1204 for (int l = 0; l < VDEV_LABELS; l++) 1205 vdev_label_write(zio, vd, l, ub_abd, 1206 VDEV_UBERBLOCK_OFFSET(vd, n), VDEV_UBERBLOCK_SIZE(vd), 1207 vdev_uberblock_sync_done, good_writes, 1208 flags | ZIO_FLAG_DONT_PROPAGATE); 1209 1210 abd_free(ub_abd); 1211} 1212 1213/* Sync the uberblocks to all vdevs in svd[] */ 1214int 1215vdev_uberblock_sync_list(vdev_t **svd, int svdcount, uberblock_t *ub, int flags) 1216{ 1217 spa_t *spa = svd[0]->vdev_spa; 1218 zio_t *zio; 1219 uint64_t good_writes = 0; 1220 1221 zio = zio_root(spa, NULL, NULL, flags); 1222 1223 for (int v = 0; v < svdcount; v++) 1224 vdev_uberblock_sync(zio, &good_writes, ub, svd[v], flags); 1225 1226 (void) zio_wait(zio); 1227 1228 /* 1229 * Flush the uberblocks to disk. This ensures that the odd labels 1230 * are no longer needed (because the new uberblocks and the even 1231 * labels are safely on disk), so it is safe to overwrite them. 1232 */ 1233 zio = zio_root(spa, NULL, NULL, flags); 1234 1235 for (int v = 0; v < svdcount; v++) { 1236 if (vdev_writeable(svd[v])) { 1237 zio_flush(zio, svd[v]); 1238 } 1239 } 1240 1241 (void) zio_wait(zio); 1242 1243 return (good_writes >= 1 ? 0 : EIO); 1244} 1245 1246/* 1247 * On success, increment the count of good writes for our top-level vdev. 1248 */ 1249static void 1250vdev_label_sync_done(zio_t *zio) 1251{ 1252 uint64_t *good_writes = zio->io_private; 1253 1254 if (zio->io_error == 0) 1255 atomic_inc_64(good_writes); 1256} 1257 1258/* 1259 * If there weren't enough good writes, indicate failure to the parent. 1260 */ 1261static void 1262vdev_label_sync_top_done(zio_t *zio) 1263{ 1264 uint64_t *good_writes = zio->io_private; 1265 1266 if (*good_writes == 0) 1267 zio->io_error = SET_ERROR(EIO); 1268 1269 kmem_free(good_writes, sizeof (uint64_t)); 1270} 1271 1272/* 1273 * We ignore errors for log and cache devices, simply free the private data. 1274 */ 1275static void 1276vdev_label_sync_ignore_done(zio_t *zio) 1277{ 1278 kmem_free(zio->io_private, sizeof (uint64_t)); 1279} 1280 1281/* 1282 * Write all even or odd labels to all leaves of the specified vdev. 
/*
 * Write all even or odd labels to all leaves of the specified vdev.
 */
static void
vdev_label_sync(zio_t *zio, uint64_t *good_writes,
    vdev_t *vd, int l, uint64_t txg, int flags)
{
        nvlist_t *label;
        vdev_phys_t *vp;
        abd_t *vp_abd;
        char *buf;
        size_t buflen;

        for (int c = 0; c < vd->vdev_children; c++) {
                vdev_label_sync(zio, good_writes,
                    vd->vdev_child[c], l, txg, flags);
        }

        if (!vd->vdev_ops->vdev_op_leaf)
                return;

        if (!vdev_writeable(vd))
                return;

        /*
         * Generate a label describing the top-level config to which we belong.
         */
        label = spa_config_generate(vd->vdev_spa, vd, txg, B_FALSE);

        vp_abd = abd_alloc_linear(sizeof (vdev_phys_t), B_TRUE);
        abd_zero(vp_abd, sizeof (vdev_phys_t));
        vp = abd_to_buf(vp_abd);

        buf = vp->vp_nvlist;
        buflen = sizeof (vp->vp_nvlist);

        if (nvlist_pack(label, &buf, &buflen, NV_ENCODE_XDR, KM_SLEEP) == 0) {
                for (; l < VDEV_LABELS; l += 2) {
                        vdev_label_write(zio, vd, l, vp_abd,
                            offsetof(vdev_label_t, vl_vdev_phys),
                            sizeof (vdev_phys_t),
                            vdev_label_sync_done, good_writes,
                            flags | ZIO_FLAG_DONT_PROPAGATE);
                }
        }

        abd_free(vp_abd);
        nvlist_free(label);
}

int
vdev_label_sync_list(spa_t *spa, int l, uint64_t txg, int flags)
{
        list_t *dl = &spa->spa_config_dirty_list;
        vdev_t *vd;
        zio_t *zio;
        int error;

        /*
         * Write the new labels to disk.
         */
        zio = zio_root(spa, NULL, NULL, flags);

        for (vd = list_head(dl); vd != NULL; vd = list_next(dl, vd)) {
                uint64_t *good_writes = kmem_zalloc(sizeof (uint64_t),
                    KM_SLEEP);

                ASSERT(!vd->vdev_ishole);

                zio_t *vio = zio_null(zio, spa, NULL,
                    (vd->vdev_islog || vd->vdev_aux != NULL) ?
                    vdev_label_sync_ignore_done : vdev_label_sync_top_done,
                    good_writes, flags);
                vdev_label_sync(vio, good_writes, vd, l, txg, flags);
                zio_nowait(vio);
        }

        error = zio_wait(zio);

        /*
         * Flush the new labels to disk.
         */
        zio = zio_root(spa, NULL, NULL, flags);

        for (vd = list_head(dl); vd != NULL; vd = list_next(dl, vd))
                zio_flush(zio, vd);

        (void) zio_wait(zio);

        return (error);
}

/*
 * Sync the uberblock and any changes to the vdev configuration.
 *
 * The order of operations is carefully crafted to ensure that
 * if the system panics or loses power at any time, the state on disk
 * is still transactionally consistent. The in-line comments below
 * describe the failure semantics at each stage.
 *
 * Moreover, vdev_config_sync() is designed to be idempotent: if it fails
 * at any time, you can just call it again, and it will resume its work.
 */
int
vdev_config_sync(vdev_t **svd, int svdcount, uint64_t txg)
{
        spa_t *spa = svd[0]->vdev_spa;
        uberblock_t *ub = &spa->spa_uberblock;
        int error = 0;
        int flags = ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_CANFAIL;

        ASSERT(svdcount != 0);
retry:
        /*
         * Normally, we don't want to try too hard to write every label and
         * uberblock. If there is a flaky disk, we don't want the rest of the
         * sync process to block while we retry. But if we can't write a
         * single label out, we should retry with ZIO_FLAG_TRYHARD before
         * bailing out and declaring the pool faulted.
         */
        if (error != 0) {
                if ((flags & ZIO_FLAG_TRYHARD) != 0)
                        return (error);
                flags |= ZIO_FLAG_TRYHARD;
        }

        ASSERT(ub->ub_txg <= txg);

        /*
         * If this isn't a resync due to I/O errors,
         * and nothing changed in this transaction group,
         * and the vdev configuration hasn't changed,
         * then there's nothing to do.
         */
        if (ub->ub_txg < txg &&
            uberblock_update(ub, spa->spa_root_vdev, txg) == B_FALSE &&
            list_is_empty(&spa->spa_config_dirty_list))
                return (0);

        if (txg > spa_freeze_txg(spa))
                return (0);

        ASSERT(txg <= spa->spa_final_txg);

        /*
         * Flush the write cache of every disk that's been written to
         * in this transaction group. This ensures that all blocks
         * written in this txg will be committed to stable storage
         * before any uberblock that references them.
         */
        zio_t *zio = zio_root(spa, NULL, NULL, flags);

        for (vdev_t *vd =
            txg_list_head(&spa->spa_vdev_txg_list, TXG_CLEAN(txg)); vd != NULL;
            vd = txg_list_next(&spa->spa_vdev_txg_list, vd, TXG_CLEAN(txg)))
                zio_flush(zio, vd);

        (void) zio_wait(zio);

        /*
         * Sync out the even labels (L0, L2) for every dirty vdev. If the
         * system dies in the middle of this process, that's OK: all of the
         * even labels that made it to disk will be newer than any uberblock,
         * and will therefore be considered invalid. The odd labels (L1, L3),
         * which have not yet been touched, will still be valid. We flush
         * the new labels to disk to ensure that all even-label updates
         * are committed to stable storage before the uberblock update.
         */
        if ((error = vdev_label_sync_list(spa, 0, txg, flags)) != 0) {
                if ((flags & ZIO_FLAG_TRYHARD) != 0) {
                        zfs_dbgmsg("vdev_label_sync_list() returned error %d "
                            "for pool '%s' when syncing out the even labels "
                            "of dirty vdevs", error, spa_name(spa));
                }
                goto retry;
        }

        /*
         * Sync the uberblocks to all vdevs in svd[].
         * If the system dies in the middle of this step, there are two cases
         * to consider, and the on-disk state is consistent either way:
         *
         * (1)	If none of the new uberblocks made it to disk, then the
         *	previous uberblock will be the newest, and the odd labels
         *	(which had not yet been touched) will be valid with respect
         *	to that uberblock.
         *
         * (2)	If one or more new uberblocks made it to disk, then they
         *	will be the newest, and the even labels (which had all
         *	been successfully committed) will be valid with respect
         *	to the new uberblocks.
         */
        if ((error = vdev_uberblock_sync_list(svd, svdcount, ub, flags)) != 0) {
                if ((flags & ZIO_FLAG_TRYHARD) != 0) {
                        zfs_dbgmsg("vdev_uberblock_sync_list() returned error "
                            "%d for pool '%s'", error, spa_name(spa));
                }
                goto retry;
        }
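        /*
         * Concrete failure walk-through (hypothetical): suppose txg 101 is
         * syncing and power fails after one new uberblock reaches disk. On
         * reboot the best uberblock is txg 101 and the even labels (written
         * first, at txg 101) match it, so the pool opens at txg 101 even
         * though the odd labels still say 100 -- case (2) above. Had no
         * uberblock landed, the untouched odd labels (txg 100) would pair
         * with the old txg-100 uberblock -- case (1).
         */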
        /*
         * Sync out odd labels for every dirty vdev. If the system dies
         * in the middle of this process, the even labels and the new
         * uberblocks will suffice to open the pool. The next time
         * the pool is opened, the first thing we'll do -- before any
         * user data is modified -- is mark every vdev dirty so that
         * all labels will be brought up to date. We flush the new labels
         * to disk to ensure that all odd-label updates are committed to
         * stable storage before the next transaction group begins.
         */
        if ((error = vdev_label_sync_list(spa, 1, txg, flags)) != 0) {
                if ((flags & ZIO_FLAG_TRYHARD) != 0) {
                        zfs_dbgmsg("vdev_label_sync_list() returned error %d "
                            "for pool '%s' when syncing out the odd labels of "
                            "dirty vdevs", error, spa_name(spa));
                }
                goto retry;
        }

        trim_thread_wakeup(spa);

        return (0);
}
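/*
 * End-to-end, one pass of vdev_config_sync() for txg N thus issues, in
 * order:
 *
 *	vdev_label_sync_list(spa, 0, txg, flags)	even labels (L0, L2)
 *	vdev_uberblock_sync_list(svd, svdcount, ub, flags)	uberblock ring
 *	vdev_label_sync_list(spa, 1, txg, flags)	odd labels (L1, L3)
 *
 * with a cache flush after each stage, so any prefix of the sequence that
 * reaches stable storage leaves the pool consistently openable.
 */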