#
98e61df5 |
|
12-Jun-2023 |
Joel Stanley <joel@jms.id.au> |
powerpc/powernv/pci: Remove last IODA1 defines Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230613045202.294451-4-joel@jms.id.au
|
#
5ac129cd |
|
12-Jun-2023 |
Joel Stanley <joel@jms.id.au> |
powerpc/powernv/pci: Remove ioda1 support The final "VPL" Power7 boxes that were used for powernv bringup have been scrapped, meaning there are no machines with ioda1 left. This patch removes the obvious unused code. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230613045202.294451-2-joel@jms.id.au
|
#
cad32d9d |
|
05-May-2022 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
KVM: PPC: Book3s: Retire H_PUT_TCE/etc real mode handlers LoPAPR defines guest visible IOMMU with hypercalls to use it - H_PUT_TCE/etc. Implemented first on POWER7 where hypercalls would trap in the KVM in the real mode (with MMU off). The problem with the real mode is some memory is not available and some API usage crashed the host but enabling MMU was an expensive operation. The problems with the real mode handlers are: 1. Occasionally these cannot complete the request so the code is copied+modified to work in the virtual mode, very little is shared; 2. The real mode handlers have to be linked into vmlinux to work; 3. An exception in real mode immediately reboots the machine. If the small DMA window is used, the real mode handlers bring better performance. However since POWER8, there has always been a bigger DMA window which VMs use to map the entire VM memory to avoid calling H_PUT_TCE. Such 1:1 mapping happens once and uses H_PUT_TCE_INDIRECT (a bulk version of H_PUT_TCE) which virtual mode handler is even closer to its real mode version. On POWER9 hypercalls trap straight to the virtual mode so the real mode handlers never execute on POWER9 and later CPUs. So with the current use of the DMA windows and MMU improvements in POWER9 and later, there is no point in duplicating the code. The 32bit passed through devices may slow down but we do not have many of these in practice. For example, with this applied, a 1Gbit ethernet adapter still demostrates above 800Mbit/s of actual throughput. This removes the real mode handlers from KVM and related code from the powernv platform. This updates the list of implemented hcalls in KVM-HV as the realmode handlers are removed. This changes ABI - kvmppc_h_get_tce() moves to the KVM module and kvmppc_find_table() is static now. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220506053755.3820702-1-aik@ozlabs.ru
|
#
6d9ba612 |
|
01-Jul-2021 |
Cédric Le Goater <clg@kaod.org> |
powerpc/powernv/pci: Drop unused MSI code MSIs should be fully managed by the PCI and IRQ subsystems now. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210701132750.1475580-26-clg@kaod.org
|
#
562d1e20 |
|
26-Mar-2021 |
Christoph Hellwig <hch@lst.de> |
powerpc/powernv: remove the nvlink support This code was only used by the vfio-nvlink2 code, which itself had no proper use. Drop this huge chunk of code build into every powernv or generic build. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210326061311.1497642-3-hch@lst.de
|
#
24b4c6b1 |
|
01-Sep-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/pci: Drop pnv_phb->initialized The pnv_phb->initialized flag is an odd beast. It was added back in 2012 in commit db1266c85261 ("powerpc/powernv: Skip check on PE if necessary") to allow devices to be enabled even if the device had not yet been assigned to a PE. Allowing the device to be enabled before the PE is configured may cause spurious EEH events since none of the IOMMU context has been setup. I'm not entirely sure why this was ever necessary. My best guess is that it was an workaround for a bug or some other undesireable behaviour from the PCI core. Either way, it's unnecessary now since as of commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE configuration") we can guarantee that the PE will be configured before the PCI core will allow drivers to bind to the device. It's also worth pointing out that the ->initialized flag is only set in pnv_pci_ioda_create_dbgfs(). That function has its entire body wrapped in #ifdef CONFIG_DEBUG_FS. As a result, for kernels built without debugfs (i.e. petitboot) the other checks in pnv_pci_enable_device_hook() are bypassed entirely. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200902013657.1753830-1-oohall@gmail.com
|
#
86052e40 |
|
25-Jul-2020 |
Randy Dunlap <rdunlap@infradead.org> |
powerpc/powernv/pci.h: delete duplicated word Drop the repeated word "for". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200726003809.20454-10-rdunlap@infradead.org
|
#
84d8505e |
|
22-Jul-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/sriov: Remove vfs_expanded Previously iov->vfs_expanded was used for two purposes. 1) To work out how much we need to multiple the per-VF BAR size to figure out the total space required for the IOV BAR. 2) To indicate that IOV is not usable with this device (vfs_expanded == 0). We don't really need the field for either since the multiple in 1) is always the number PEs supported by the PHB. Similarly, we don't really need it in 2) either since the IOV data field will be NULL if we can't use IOV with the device. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-16-oohall@gmail.com
|
#
4c51f3e1 |
|
22-Jul-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/sriov: Make single PE mode a per-BAR setting Using single PE BARs to map an SR-IOV BAR is really a choice about what strategy to use when mapping a BAR. It doesn't make much sense for this to be a global setting since a device might have one large BAR which needs to be mapped with single PE windows and another smaller BAR that can be mapped with a regular segmented window. Make the segmented vs single decision a per-BAR setting and clean up the logic that decides which mode to use. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-15-oohall@gmail.com
|
#
052da31d |
|
22-Jul-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/sriov: De-indent setup and teardown Remove the IODA2 PHB checks. We already assume IODA2 in several places so there's not much point in wrapping most of the setup and teardown process in an if block. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-12-oohall@gmail.com
|
#
d29a2488 |
|
22-Jul-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/sriov: Drop iov->pe_num_map[] Currently the iov->pe_num_map[] does one of two things depending on whether single PE mode is being used or not. When it is, this contains an array which maps a vf_index to the corresponding PE number. When single PE mode is not being used this contains a scalar which is the base PE for the set of enabled VFs (for for VFn is base + n). The array was necessary because when calling pnv_ioda_alloc_pe() there is no guarantee that the allocated PEs would be contigious. We can now allocate contigious blocks of PEs so this is no longer an issue. This allows us to drop the if (single_mode) {} .. else {} block scattered through the SR-IOV code which is a nice clean up. This also fixes a bug in pnv_pci_sriov_disable() which is the non-atomic bitmap_clear() to manipulate the PE allocation map. Other users of the map assume it will be accessed with atomic ops. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-11-oohall@gmail.com
|
#
a4bc676e |
|
22-Jul-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/pci: Refactor pnv_ioda_alloc_pe() Rework the PE allocation logic to allow allocating blocks of PEs rather than individually. We'll use this to allocate contigious blocks of PEs for the SR-IOVs. This patch also adds code to pnv_ioda_alloc_pe() and pnv_ioda_reserve_pe() to use the existing, but unused, phb->pe_alloc_mutex. Currently these functions use atomic bit ops to release a currently allocated PE number. However, the pnv_ioda_alloc_pe() wants to have exclusive access to the bit map while scanning for hole large enough to accomodate the allocation size. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-10-oohall@gmail.com
|
#
ad9add52 |
|
22-Jul-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/sriov: Simplify used window tracking No need for the multi-dimensional arrays, just use a bitmap. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-8-oohall@gmail.com
|
#
37b59ef0 |
|
22-Jul-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/sriov: Move SR-IOV into a separate file pci-ioda.c is getting a bit unwieldly due to the amount of stuff jammed in there. The SR-IOV support can be extracted easily enough and is mostly standalone, so move it into a separate file. This patch also moves the PowerNV SR-IOV specific fields from pci_dn and moves them into a platform specific structure. I'm not sure how they ended up in there in the first place, but leaking platform specifics into common code has proven to be a terrible idea so far so lets stop doing that. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-5-oohall@gmail.com
|
#
01e12629 |
|
22-Jul-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/pci: Add explicit tracking of the DMA setup state There's an optimisation in the PE setup which skips performing DMA setup for a PE if we only have bridges in a PE. The assumption being that only "real" devices will DMA to system memory, which is probably fair. However, if we start off with only bridge devices in a PE then add a non-bridge device the new device won't be able to use DMA because we never configured it. Fix this (admittedly pretty weird) edge case by tracking whether we've done the DMA setup for the PE or not. If a non-bridge device is added to the PE (via rescan or hotplug, or whatever) we can set up DMA on demand. This also means the only remaining user of the old "DMA Weight" code is the IODA1 DMA setup code that it was originally added for, which is good. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-3-oohall@gmail.com
|
#
5609ffdd |
|
22-Jul-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/pci: Add pci_bus_to_pnvhb() helper Add a helper to go from a pci_bus structure to the pnv_phb that hosts that bus. There's a lot of instances of the following pattern: struct pci_controller *hose = pci_bus_to_host(pdev->bus); struct pnv_phb *phb = hose->private_data; Without any other uses of the pci_controller inside the function. This is hard to read since it requires you to memorise the contents of the private data fields and kind of error prone since it involves blindly assigning a void pointer. Add a helper to make it more concise and explicit. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200722065715.1432738-1-oohall@gmail.com
|
#
718d249a |
|
17-Apr-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/pci: Reserve the root bus PE during init Doing it once during boot rather than doing it on the fly and drop the janky populated logic. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200417073508.30356-4-oohall@gmail.com
|
#
a8d7d5fc |
|
17-Apr-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/pci: Add helper to find ioda_pe from BDFN For each PHB we maintain a reverse-map that can be used to find the PE that a BDFN is currently mapped to. Add a helper for doing this lookup so we can check if a PE has been configured without looking at pdn->pe_number. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200417073508.30356-2-oohall@gmail.com
|
#
9d0879a2 |
|
14-Apr-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/pci: Add an explaination for PNV_IODA_PE_BUS_ALL It's pretty obsecure and confused me for a long time so I figured it's worth documenting properly. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200414233502.758-1-oohall@gmail.com
|
#
03b7bf34 |
|
05-Apr-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/npu: Move IOMMU group setup into npu-dma.c The NVlink IOMMU group setup is only relevant to NVLink devices so move it into the NPU containment zone. This let us remove some prototypes in pci.h and staticfy some function definitions. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200406030745.24595-8-oohall@gmail.com
|
#
96e2006a |
|
05-Apr-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv/pci: Move tce size parsing to pci-ioda-tce.c Move it in with the rest of the TCE wrangling rather than carting around a static prototype in pci-ioda.c Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200406030745.24595-7-oohall@gmail.com
|
#
946743d0 |
|
10-Jan-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powernv/pci: Move pnv_pci_dma_bus_setup() to pci-ioda.c This is only used in pci-ioda.c so move it there and rename it to match. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200110070207.439-6-oohall@gmail.com
|
#
0a25d9c4 |
|
10-Jan-2020 |
Oliver O'Halloran <oohall@gmail.com> |
powernv/pci: Fold pnv_pci_dma_dev_setup() into the pci-ioda.c version pnv_pci_dma_dev_setup() does nothing but call the phb->dma_dev_setup() callback, if one exists. That callback is only set for normal PCIe PHBs so we can remove the layer of indirection and use the ioda version in the pci_controller_ops. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200110070207.439-5-oohall@gmail.com
|
#
c37c792d |
|
17-Jul-2019 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda2: Allocate TCE table levels on demand for default DMA window We allocate only the first level of multilevel TCE tables for KVM already (alloc_userspace_copy==true), and the rest is allocated on demand. This is not enabled though for bare metal. This removes the KVM limitation (implicit, via the alloc_userspace_copy parameter) and always allocates just the first level. The on-demand allocation of missing levels is already implemented. As from now on DMA map might happen with disabled interrupts, this allocates TCEs with GFP_ATOMIC; otherwise lockdep reports errors 1]. In practice just a single page is allocated there so chances for failure are quite low. To save time when creating a new clean table, this skips non-allocated indirect TCE entries in pnv_tce_free just like we already do in the VFIO IOMMU TCE driver. This changes the default level number from 1 to 2 to reduce the amount of memory required for the default 32bit DMA window at the boot time. The default window size is up to 2GB which requires 4MB of TCEs which is unlikely to be used entirely or at all as most devices these days are 64bit capable so by switching to 2 levels by default we save 4032KB of RAM per a device. While at this, add __GFP_NOWARN to alloc_pages_node() as the userspace can trigger this path via VFIO, see the failure and try creating a table again with different parameters which might succeed. [1]: === BUG: sleeping function called from invalid context at mm/page_alloc.c:4596 in_atomic(): 1, irqs_disabled(): 1, pid: 1038, name: scsi_eh_1 2 locks held by scsi_eh_1/1038: #0: 000000005efd659a (&host->eh_mutex){+.+.}, at: ata_eh_acquire+0x34/0x80 #1: 0000000006cf56a6 (&(&host->lock)->rlock){....}, at: ata_exec_internal_sg+0xb0/0x5c0 irq event stamp: 500 hardirqs last enabled at (499): [<c000000000cb8a74>] _raw_spin_unlock_irqrestore+0x94/0xd0 hardirqs last disabled at (500): [<c000000000cb85c4>] _raw_spin_lock_irqsave+0x44/0x120 softirqs last enabled at (0): [<c000000000101120>] copy_process.isra.4.part.5+0x640/0x1a80 softirqs last disabled at (0): [<0000000000000000>] 0x0 CPU: 73 PID: 1038 Comm: scsi_eh_1 Not tainted 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Call Trace: [c000003d064cef50] [c000000000c8e6c4] dump_stack+0xe8/0x164 (unreliable) [c000003d064cefa0] [c00000000014ed78] ___might_sleep+0x2f8/0x310 [c000003d064cf020] [c0000000003ca084] __alloc_pages_nodemask+0x2a4/0x1560 [c000003d064cf220] [c0000000000c2530] pnv_alloc_tce_level.isra.0+0x90/0x130 [c000003d064cf290] [c0000000000c2888] pnv_tce+0x128/0x3b0 [c000003d064cf360] [c0000000000c2c00] pnv_tce_build+0xb0/0xf0 [c000003d064cf3c0] [c0000000000bbd9c] pnv_ioda2_tce_build+0x3c/0xb0 [c000003d064cf400] [c00000000004cfe0] ppc_iommu_map_sg+0x210/0x550 [c000003d064cf510] [c00000000004b7a4] dma_iommu_map_sg+0x74/0xb0 [c000003d064cf530] [c000000000863944] ata_qc_issue+0x134/0x470 [c000003d064cf5b0] [c000000000863ec4] ata_exec_internal_sg+0x244/0x5c0 [c000003d064cf700] [c0000000008642d0] ata_exec_internal+0x90/0xe0 [c000003d064cf780] [c0000000008650ac] ata_dev_read_id+0x2ec/0x640 [c000003d064cf8d0] [c000000000878e28] ata_eh_recover+0x948/0x16d0 [c000003d064cfa10] [c00000000087d760] sata_pmp_error_handler+0x480/0xbf0 [c000003d064cfbc0] [c000000000884624] ahci_error_handler+0x74/0xe0 [c000003d064cfbf0] [c000000000879fa8] ata_scsi_port_error_handler+0x2d8/0x7c0 [c000003d064cfca0] [c00000000087a544] ata_scsi_error+0xb4/0x100 [c000003d064cfd00] [c000000000802450] scsi_error_handler+0x120/0x510 [c000003d064cfdb0] [c000000000140c48] kthread+0x1b8/0x1c0 [c000003d064cfe20] [c00000000000bd8c] ret_from_kernel_thread+0x5c/0x70 ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300) irq event stamp: 2305 ======================================================== hardirqs last enabled at (2305): [<c00000000000e4c8>] fast_exc_return_irq+0x28/0x34 hardirqs last disabled at (2303): [<c000000000cb9fd0>] __do_softirq+0x4a0/0x654 WARNING: possible irq lock inversion dependency detected 5.2.0-rc6-le_nv2_aikATfstn1-p1 #634 Tainted: G W softirqs last enabled at (2304): [<c000000000cba054>] __do_softirq+0x524/0x654 softirqs last disabled at (2297): [<c00000000010f278>] irq_exit+0x128/0x180 -------------------------------------------------------- swapper/0/0 just changed the state of lock: 0000000006cf56a6 (&(&host->lock)->rlock){-...}, at: ahci_single_level_irq_intr+0xac/0x120 but this lock took another, HARDIRQ-unsafe lock in the past: (fs_reclaim){+.+.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); local_irq_disable(); lock(&(&host->lock)->rlock); lock(fs_reclaim); <Interrupt> lock(&(&host->lock)->rlock); *** DEADLOCK *** no locks held by swapper/0/0. the shortest dependencies between 2nd lock and 1st lock: -> (fs_reclaim){+.+.} ops: 167579 { HARDIRQ-ON-W at: lock_acquire+0xf8/0x2a0 fs_reclaim_acquire.part.23+0x44/0x60 kmem_cache_alloc_node_trace+0x80/0x590 alloc_desc+0x64/0x270 __irq_alloc_descs+0x2e4/0x3a0 irq_domain_alloc_descs+0xb0/0x150 irq_create_mapping+0x168/0x2c0 xics_smp_probe+0x2c/0x98 pnv_smp_probe+0x40/0x9c smp_prepare_cpus+0x524/0x6c4 kernel_init_freeable+0x1b4/0x650 kernel_init+0x2c/0x148 ret_from_kernel_thread+0x5c/0x70 SOFTIRQ-ON-W at: lock_acquire+0xf8/0x2a0 fs_reclaim_acquire.part.23+0x44/0x60 kmem_cache_alloc_node_trace+0x80/0x590 alloc_desc+0x64/0x270 __irq_alloc_descs+0x2e4/0x3a0 irq_domain_alloc_descs+0xb0/0x150 irq_create_mapping+0x168/0x2c0 xics_smp_probe+0x2c/0x98 pnv_smp_probe+0x40/0x9c smp_prepare_cpus+0x524/0x6c4 kernel_init_freeable+0x1b4/0x650 kernel_init+0x2c/0x148 ret_from_kernel_thread+0x5c/0x70 INITIAL USE at: lock_acquire+0xf8/0x2a0 fs_reclaim_acquire.part.23+0x44/0x60 kmem_cache_alloc_node_trace+0x80/0x590 alloc_desc+0x64/0x270 __irq_alloc_descs+0x2e4/0x3a0 irq_domain_alloc_descs+0xb0/0x150 irq_create_mapping+0x168/0x2c0 xics_smp_probe+0x2c/0x98 pnv_smp_probe+0x40/0x9c smp_prepare_cpus+0x524/0x6c4 kernel_init_freeable+0x1b4/0x650 kernel_init+0x2c/0x148 ret_from_kernel_thread+0x5c/0x70 } === Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190718051139.74787-4-aik@ozlabs.ru
|
#
c498a4f9 |
|
25-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
powerpc/powernv: remove the unused tunneling exports These have been unused anywhere in the kernel tree ever since they've been added to the kernel. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
63982618 |
|
25-Jun-2019 |
Christoph Hellwig <hch@lst.de> |
powerpc/powernv: remove the unused pnv_pci_set_p2p function This function has never been used anywhere in the kernel tree since it was added to the tree. We also now have proper PCIe P2P APIs in the core kernel, and any new P2P support should be using those. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
1e496391 |
|
30-Mar-2017 |
Joe Perches <joe@perches.com> |
powerpc/powernv/ioda2: Add __printf format/argument verification Fix fallout too. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
0bd97167 |
|
19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/npu: Add compound IOMMU groups At the moment the powernv platform registers an IOMMU group for each PE. There is an exception though: an NVLink bridge which is attached to the corresponding GPU's IOMMU group making it a master. Now we have POWER9 systems with GPUs connected to each other directly bypassing PCI. At the moment we do not control state of these links so we have to put such interconnected GPUs to one IOMMU group which means that the old scheme with one GPU as a master won't work - there will be up to 3 GPUs in such group. This introduces a npu_comp struct which represents a compound IOMMU group made of multiple PEs - PCI PEs (for GPUs) and NPU PEs (for NVLink bridges). This converts the existing NVLink1 code to use the new scheme. >From now on, each PE must have a valid iommu_table_group_ops which will either be called directly (for a single PE group) or indirectly from a compound group handlers. This moves IOMMU group registration for NVLink-connected GPUs to npu-dma.c. For POWER8, this stores a new compound group pointer in the PE (so a GPU is still a master); for POWER9 the new group pointer is stored in an NPU (which is allocated per a PCI host controller). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [mpe: Initialise npdev to NULL in pnv_try_setup_npu_table_group()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
83fb8ccf |
|
19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops At the moment NPU IOMMU is manipulated directly from the IODA2 PCI PE code; PCI PE acts as a master to NPU PE. Soon we will have compound IOMMU groups with several PEs from several different PHB (such as interconnected GPUs and NPUs) so there will be no single master but a one big IOMMU group. This makes a first step and converts an NPU PE with a set of extern function to a table group. This should cause no behavioral change. Note that pnv_npu_release_ownership() has never been implemented. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
0e759bd7 |
|
19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/npu: Move OPAL calls away from context manipulation When introduced, the NPU context init/destroy helpers called OPAL which enabled/disabled PID (a userspace memory context ID) filtering in an NPU per a GPU; this was a requirement for P9 DD1.0. However newer chip revision added a PID wildcard support so there is no more need to call OPAL every time a new context is initialized. Also, since the PID wildcard support was added, skiboot does not clear wildcard entries in the NPU so these remain in the hardware till the system reboot. This moves LPID and wildcard programming to the PE setup code which executes once during the booting process so NPU2 context init/destroy won't need to do additional configuration. This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as this is the way to tell if the NPU support is present and configured. This moves pnv_npu2_init() declaration as pseries should be able to use it. This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to call that. This exports pnv_npu2_map_lpar_dev() as following patches will use it from the VFIO driver. While at it, replace redundant list_for_each_entry_safe() with a simpler list_for_each_entry(). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
46a1449d |
|
19-Dec-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv: Move npu struct from pnv_phb to pci_controller The powernv PCI code stores NPU data in the pnv_phb struct. The latter is referenced by pci_controller::private_data. We are going to have NPU2 support in the pseries platform as well but it does not store any private_data in in the pci_controller struct; and even if it did, it would be a different data structure. This makes npu a pointer and stores it one level higher in the pci_controller struct. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
5f639e5f |
|
13-Nov-2018 |
Oliver O'Halloran <oohall@gmail.com> |
powerpc/powernv: Remove PCI_MSI ifdef checks CONFIG_PCI_MSI was made mandatory by commit a311e738b6d8 ("powerpc/powernv: Make PCI non-optional") so the #ifdef checks around CONFIG_PCI_MSI here can be removed entirely. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
a25de7af |
|
15-Oct-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda: Reduce a number of hooks in pnv_phb fixup_phb() is never used, this removes it. pick_m64_pe() and reserve_m64_pe() are always defined for all powernv PHBs: they are initialized by pnv_ioda_parse_m64_window() which is called unconditionally from pnv_pci_init_ioda_phb() which initializes all known PHB types on powernv so we can open code them. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
a68bd126 |
|
04-Jul-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda: Allocate indirect TCE levels on demand At the moment we allocate the entire TCE table, twice (hardware part and userspace translation cache). This normally works as we normally have contigous memory and the guest will map entire RAM for 64bit DMA. However if we have sparse RAM (one example is a memory device), then we will allocate TCEs which will never be used as the guest only maps actual memory for DMA. If it is a single level TCE table, there is nothing we can really do but if it a multilevel table, we can skip allocating TCEs we know we won't need. This adds ability to allocate only first level, saving memory. This changes iommu_table::free() to avoid allocating of an extra level; iommu_table::set() will do this when needed. This adds @alloc parameter to iommu_table::exchange() to tell the callback if it can allocate an extra level; the flag is set to "false" for the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns H_TOO_HARD. This still requires the entire table to be counted in mm::locked_vm. To be conservative, this only does on-demand allocation when the usespace cache table is requested which is the case of VFIO. The example math for a system replicating a powernv setup with NVLink2 in a guest: 16GB RAM mapped at 0x0 128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000 the table to cover that all with 64K pages takes: (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB If we allocate only necessary TCE levels, we will only need: (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect levels). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
090bad39 |
|
04-Jul-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv: Add indirect levels to it_userspace We want to support sparse memory and therefore huge chunks of DMA windows do not need to be mapped. If a DMA window big enough to require 2 or more indirect levels, and a DMA window is used to map all RAM (which is a default case for 64bit window), we can actually save some memory by not allocation TCE for regions which we are not going to map anyway. The hardware tables alreary support indirect levels but we also keep host-physical-to-userspace translation array which is allocated by vmalloc() and is a flat array which might use quite some memory. This converts it_userspace from vmalloc'ed array to a multi level table. As the format becomes platform dependend, this replaces the direct access to it_usespace with a iommu_table_ops::useraddrptr hook which returns a pointer to the userspace copy of a TCE; future extension will return NULL if the level was not allocated. This should not change non-KVM handling of TCE tables and it_userspace will not be allocated for non-KVM tables. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
191c2287 |
|
04-Jul-2018 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv: Move TCE manupulation code to its own file Right now we have allocation code in pci-ioda.c and traversing code in pci.c, let's keep them toghether. However both files are big enough already so let's move this business to a new file. While we at it, move the code which links IOMMU table groups to IOMMU tables as it is not specific to any PNV PHB model. These puts exported symbols from the new file together. This fixes several warnings from checkpatch.pl like this: "WARNING: Prefer 'unsigned int' to bare use of 'unsigned'". As this is almost cut-n-paste, there should be no behavioral change. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
8bf6b91a |
|
27-Jun-2018 |
Alastair D'Silva <alastair@d-silva.org> |
Revert "powerpc/powernv: Add support for the cxl kernel api on the real phb" Remove abandonned capi support for the Mellanox CX4. This reverts commit 4361b03430d685610e5feea3ec7846e8b9ae795f. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
0cfd7335 |
|
27-Jun-2018 |
Alastair D'Silva <alastair@d-silva.org> |
Revert "cxl: Add support for interrupts on the Mellanox CX4" Remove abandonned capi support for the Mellanox CX4. This reverts commit a2f67d5ee8d950caaa7a6144cf0bfb256500b73e. Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
7f2c39e9 |
|
22-Jan-2018 |
Frederic Barrat <fbarrat@linux.vnet.ibm.com> |
powerpc/powernv: Introduce new PHB type for opencapi links The NPU was already abstracted by opal as a virtual PHB for nvlink, but it helps to be able to differentiate between a nvlink or opencapi PHB, as it's not completely transparent to linux. In particular, PE assignment differs and we'll also need the information in later patches. So rename existing PNV_PHB_NPU type to PNV_PHB_NPU_NVLINK and add a new type PNV_PHB_NPU_OCAPI. Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
1b2c2b12 |
|
05-Sep-2017 |
Alistair Popple <alistair@popple.id.au> |
powerpc/powernv/npu: Don't explicitly flush nmmu tlb The nest mmu required an explicit flush as a tlbi would not flush it in the same way as the core. However an alternate firmware fix exists which should eliminate the need for this flush, so instead add a device-tree property (ibm,nmmu-flush) on the NVLink2 PHB to enable it only if required. Signed-off-by: Alistair Popple <alistair@popple.id.au> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
b2441318 |
|
01-Nov-2017 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
b9fde58d |
|
07-Sep-2017 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv: Rework EEH initialization on powernv Remove the post_init callback which is only used by powernv, we can just call it explicitly from the powernv code. This partially kills the ability to "disable" eeh at runtime via debugfs as this was calling that same callback again, but this is both unused and broken in several ways. If we want to revive it, we need to create a dedicated enable/disable callback on the backend that does the right thing. Let the bulk of eeh initialize normally at core_initcall() like it does on pseries by removing the hack in eeh_init() that delays it. Instead we make sure our eeh->probe cleanly bails out of the PEs haven't been created yet and we force a re-probe where we used to call eeh_init() again. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
25529100 |
|
04-Aug-2017 |
Frederic Barrat <fbarrat@linux.vnet.ibm.com> |
powerpc/powernv: Enable PCI peer-to-peer P9 has support for PCI peer-to-peer, enabling a device to write in the MMIO space of another device directly, without interrupting the CPU. This patch adds support for it on powernv, by adding a new API to be called by drivers. The pnv_pci_set_p2p(...) call configures an 'initiator', i.e the device which will issue the MMIO operation, and a 'target', i.e. the device on the receiving side. P9 really only supports MMIO stores for the time being but that's expected to change in the future, so the API allows to define both load and store operations. /* PCI p2p descriptor */ #define OPAL_PCI_P2P_ENABLE 0x1 #define OPAL_PCI_P2P_LOAD 0x2 #define OPAL_PCI_P2P_STORE 0x4 int pnv_pci_set_p2p(struct pci_dev *initiator, struct pci_dev *target, u64 desc) It uses a new OPAL call, as the configuration magic is done on the PHBs by skiboot. Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Russell Currey <ruscur@russell.cc> [mpe: Drop unrelated OPAL calls, s/uint64_t/u64/, minor formatting] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
5cb1f8fd |
|
13-Jun-2017 |
Russell Currey <ruscur@russell.cc> |
powerpc/powernv/pci: Dynamically allocate PHB diag data Diagnostic data for PHBs currently works by allocated a fixed-sized buffer. This is simple, but either wastes memory (though only a few kilobytes) or in the case of PHB4 isn't enough to fit the whole data blob. For machines that don't describe the diagnostic data size in the device tree, use the hardcoded buffer size as before. For those that do, only allocate exactly what's needed. In the special case of P7IOC (which has two types of diag data), the larger should be specified in the device tree. Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
31bbd45a |
|
13-Jun-2017 |
Russell Currey <ruscur@russell.cc> |
powerpc/powernv/pci: Reduce spam when dumping PEST Dumping the PE State Tables (PEST) can be highly verbose if a number of PEs are affected, especially in the case where the whole PHB is frozen and 512 lines get printed. Check for duplicates when dumping the PEST to reduce useless output. For example: PE[0f8] A/B: 9700002600000000 80000080d00000f8 PE[0f9] A/B: 8000000000000000 0000000000000000 PE[..0fe] A/B: as above PE[0ff] A/B: 8440002b00000000 0000000000000000 instead of: PE[0f8] A/B: 9700002600000000 80000080d00000f8 PE[0f9] A/B: 8000000000000000 0000000000000000 PE[0fa] A/B: 8000000000000000 0000000000000000 PE[0fb] A/B: 8000000000000000 0000000000000000 PE[0fc] A/B: 8000000000000000 0000000000000000 PE[0fd] A/B: 8000000000000000 0000000000000000 PE[0fe] A/B: 8000000000000000 0000000000000000 PE[0ff] A/B: 8440002b00000000 0000000000000000 and you can imagine how much worse it can get for 512 PEs. Signed-off-by: Russell Currey <ruscur@russell.cc> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
6b3d12a9 |
|
02-May-2017 |
Alistair Popple <alistair@popple.id.au> |
powerpc/powernv: Fix TCE kill on NVLink2 Commit 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2") forced all TCE kills to go via the OPAL call for NVLink2. However the PHB3 implementation of TCE kill was still being called directly from some functions which in some circumstances caused a machine check. This patch adds an equivalent IODA2 version of the function which uses the correct invalidation method depending on PHB model and changes all external callers to use it instead. Fixes: 616badd2fb49 ("powerpc/powernv: Use OPAL call for TCE kill on NVLink2") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
1ab66d1f |
|
03-Apr-2017 |
Alistair Popple <alistair@popple.id.au> |
powerpc/powernv: Introduce address translation services for Nvlink2 Nvlink2 supports address translation services (ATS) allowing devices to request address translations from an mmu known as the nest MMU which is setup to walk the CPU page tables. To access this functionality certain firmware calls are required to setup and manage hardware context tables in the nvlink processing unit (NPU). The NPU also manages forwarding of TLB invalidates (known as address translation shootdowns/ATSDs) to attached devices. This patch exports several methods to allow device drivers to register a process id (PASID/PID) in the hardware tables and to receive notification of when a device should stop issuing address translation requests (ATRs). It also adds a fault handler to allow device drivers to demand fault pages in. Signed-off-by: Alistair Popple <alistair@popple.id.au> [mpe: Fix up comment formatting, use flush_tlb_mm()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
616badd2 |
|
09-Jan-2017 |
Alistair Popple <alistair@popple.id.au> |
powerpc/powernv: Use OPAL call for TCE kill on NVLink2 Add detection of NPU2 PHBs. NPU2/NVLink2 has a different register layout for the TCE kill register therefore TCE invalidation should be done via the OPAL call rather than using the register directly as it is for PHB3 and NVLink1. This changes TCE invalidation to use the OPAL call in the case of a NPU2 PHB model. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
00085f1e |
|
03-Aug-2016 |
Krzysztof Kozlowski <krzk@kernel.org> |
dma-mapping: use unsigned long for dma_attrs The dma-mapping core and the implementations do not change the DMA attributes passed by pointer. Thus the pointer can point to const data. However the attributes do not have to be a bitfield. Instead unsigned long will do fine: 1. This is just simpler. Both in terms of reading the code and setting attributes. Instead of initializing local attributes on the stack and passing pointer to it to dma_set_attr(), just set the bits. 2. It brings safeness and checking for const correctness because the attributes are passed by value. Semantic patches for this change (at least most of them): virtual patch virtual context @r@ identifier f, attrs; @@ f(..., - struct dma_attrs *attrs + unsigned long attrs , ...) { ... } @@ identifier r.f; @@ f(..., - NULL + 0 ) and // Options: --all-includes virtual patch virtual context @r@ identifier f, attrs; type t; @@ t f(..., struct dma_attrs *attrs); @@ identifier r.f; @@ f(..., - NULL + 0 ) Link: http://lkml.kernel.org/r/1468399300-5399-2-git-send-email-k.kozlowski@samsung.com Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Acked-by: Vineet Gupta <vgupta@synopsys.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> Acked-by: Mark Salter <msalter@redhat.com> [c6x] Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> [cris] Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> [drm] Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Joerg Roedel <jroedel@suse.de> [iommu] Acked-by: Fabien Dessenne <fabien.dessenne@st.com> [bdisp] Reviewed-by: Marek Szyprowski <m.szyprowski@samsung.com> [vb2-core] Acked-by: David Vrabel <david.vrabel@citrix.com> [xen] Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [xen swiotlb] Acked-by: Joerg Roedel <jroedel@suse.de> [iommu] Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon] Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org> Acked-by: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no> [avr32] Acked-by: Vineet Gupta <vgupta@synopsys.com> [arc] Acked-by: Robin Murphy <robin.murphy@arm.com> [arm64 and dma-iommu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
fd141d1a |
|
08-Jul-2016 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv/pci: Rework accessing the TCE invalidate register It's architected, always in a known place, so there is no need to keep a separate pointer to it, we use the existing "regs", and we complement it with a real mode variant. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> # Conflicts: # arch/powerpc/platforms/powernv/pci-ioda.c # arch/powerpc/platforms/powernv/pci.h Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
a34ab7c3 |
|
08-Jul-2016 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv/pci: Rename TCE invalidation calls The TCE invalidation functions are fairly implementation specific, and while the IODA specs more/less describe the register, in practice various implementation workarounds may be required. So name the functions after the target PHB. Note today and for the foreseeable future, there's a 1:1 relationship between an IODA version and a PHB implementation. There exist another variant of IODA1 (Torrent) but we never supported in with OPAL and never will. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
a2f67d5e |
|
13-Jul-2016 |
Ian Munsie <imunsie@au1.ibm.com> |
cxl: Add support for interrupts on the Mellanox CX4 The Mellanox CX4 in cxl mode uses a hybrid interrupt model, where interrupts are routed from the networking hardware to the XSL using the MSIX table, and from there will be transformed back into an MSIX interrupt using the cxl style interrupts (i.e. using IVTE entries and ranges to map a PE and AFU interrupt number to an MSIX address). We want to hide the implementation details of cxl interrupts as much as possible. To this end, we use a special version of the MSI setup & teardown routines in the PHB while in cxl mode to allocate the cxl interrupts and configure the IVTE entries in the process element. This function does not configure the MSIX table - the CX4 card uses a custom format in that table and it would not be appropriate to fill that out in generic code. The rest of the functionality is similar to the "Full MSI-X mode" described in the CAIA, and this could be easily extended to support other adapters that use that mode in the future. The interrupts will be associated with the default context. If the maximum number of interrupts per context has been limited (e.g. by the mlx5 driver), it will automatically allocate additional kernel contexts to associate extra interrupts as required. These contexts will be started using the same WED that was used to start the default context. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
4361b034 |
|
13-Jul-2016 |
Ian Munsie <imunsie@au1.ibm.com> |
powerpc/powernv: Add support for the cxl kernel api on the real phb This adds support for the peer model of the cxl kernel api to the PowerNV PHB, in which physical function 0 represents the cxl function on the card (an XSL in the case of the CX4), which other physical functions will use for memory access and interrupt services. It is referred to as the peer model as these functions are peers of one another, as opposed to the Virtual PHB model which forms a hierarchy. This patch exports APIs to enable the peer mode, check if a PCI device is attached to a PHB in this mode, and to set and get the peer AFU for this mode. The cxl driver will enable this mode for supported cards by calling pnv_cxl_enable_phb_kernel_api(). This will set a flag in the PHB to note that this mode is enabled, and switch out it's controller_ops for the cxl version. The cxl version of the controller_ops struct implements it's own versions of the enable_device_hook and release_device to handle refcounting on the peer AFU and to allocate a default context for the device. Once enabled, the cxl kernel API may not be disabled on a PHB. Currently there is no safe way to disable cxl mode short of a reboot, so until that changes there is no reason to support the disable path. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
f456834a |
|
13-Jul-2016 |
Ian Munsie <imunsie@au1.ibm.com> |
powerpc/powernv: Split cxl code out into a separate file The support for using the Mellanox CX4 in cxl mode will require additions to the PHB code. In preparation for this, move the existing cxl code out of pci-ioda.c into a separate pci-cxl.c file to keep things more organised. Signed-off-by: Ian Munsie <imunsie@au1.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
c5f7700b |
|
20-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Dynamically release PE This supports releasing PEs dynamically. A reference count is introduced to PE representing number of PCI devices associated with the PE. The reference count is increased when PCI device joins the PE and decreased when PCI device leaves the PE in pnv_pci_release_device(). When the count becomes zero, the PE and its consumed resources are released. Note that the count is accessed concurrently. So a counter with "int" type is enough here. In order to release the sources consumed by the PE, couple of helper functions are introduced as below: * pnv_pci_ioda1_unset_window() - Unset IODA1 DMA32 window * pnv_pci_ioda1_release_dma_pe() - Release IODA1 DMA32 segments * pnv_pci_ioda2_release_dma_pe() - Release IODA2 DMA resource * pnv_ioda_release_pe_seg() - Unmap IO/M32/M64 segments Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
63803c39 |
|
20-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Setup PE for root bus There is no parent bridge for root bus, meaning pcibios_setup_bridge() isn't invoked for root bus. The PE for root bus is the ancestor of other PEs in PELTV. It means we need PE for root bus populated before all others. This populates the PE for root bus in pcibios_setup_bridge() path if it's not populated yet. The PE number next to the reserved one is used as the PE# to avoid holes in continuous M64 space. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
c127562a |
|
20-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Increase PE# capacity Each PHB maintains an array helping to translate 2-bytes Request ID (RID) to PE# with the assumption that PE# takes one byte, meaning that we can't have more than 256 PEs. However, pci_dn->pe_number already had 4-bytes for the PE#. This extends the PE# capacity for every PHB. After that, the PE number is represented by 4-bytes value. Then we can reuse IODA_INVALID_PE to check the PE# in phb->pe_rmap[] is valid or not. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
b5cb9ab1 |
|
29-Apr-2016 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/npu: Enable NVLink pass through IBM POWER8 NVlink systems come with Tesla K40-ish GPUs each of which also has a couple of fast speed links (NVLink). The interface to links is exposed as an emulated PCI bridge which is included into the same IOMMU group as the corresponding GPU. In the kernel, NPUs get a separate PHB of the PNV_PHB_NPU type and a PE which behave pretty much as the standard IODA2 PHB except NPU PHB has just a single TVE in the hardware which means it can have either 32bit window or 64bit window or DMA bypass but never two of these. In order to make these links work when GPU is passed to the guest, these bridges need to be passed as well; otherwise performance will degrade. This implements and exports API to manage NPU state in regard to VFIO; it replicates iommu_table_group_ops. This defines a new pnv_pci_ioda2_npu_ops which is assigned to the IODA2 bridge if there are NPUs for a GPU on the bridge. The new callbacks call the default IODA2 callbacks plus new NPU API. This adds a gpe_table_group_to_npe() helper to find NPU PE for the IODA2 table_group, it is not expected to fail as the helper is only called from the pnv_pci_ioda2_npu_ops. This does not define NPU-specific .release_ownership() so after VFIO is finished, DMA on NPU is disabled which is ok as the nvidia driver sets DMA mask when probing which enable 32 or 64bit DMA on NPU. This adds a pnv_pci_npu_setup_iommu() helper which adds NPUs to the GPU group if any found. The helper uses helpers to look for the "ibm,gpu" property in the device tree which is a phandle of the corresponding GPU. This adds an additional loop over PEs in pnv_ioda_setup_dma() as the main loop skips NPU PEs as they do not have 32bit DMA segments. As pnv_npu_set_window() and pnv_npu_unset_window() are started being used by the new IODA2-NPU IOMMU group, this makes the helpers public and adds the DMA window number parameter. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-By: Alistair Popple <alistair@popple.id.au> [mpe: Add pnv_pci_ioda_setup_iommu_api() to fix build with IOMMU_API=n] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
85674868 |
|
29-Apr-2016 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/npu: Rework TCE Kill handling The pnv_ioda_pe struct keeps an array of peers. At the moment it is only used to link GPU and NPU for 2 purposes: 1. Access NPU quickly when configuring DMA for GPU - this was addressed in the previos patch by removing use of it as DMA setup is not what the kernel would constantly do. 2. Invalidate TCE cache for NPU when it is invalidated for GPU. GPU and NPU are in different PE. There is already a mechanism to attach multiple iommu_table_group to the same iommu_table (used for VFIO), we can reuse it here so does this patch. This gets rid of peers[] array and PNV_IODA_PE_PEER flag as they are not needed anymore. While we are here, add TCE cache invalidation after enabling bypass. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-By: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
7d623e42 |
|
29-Apr-2016 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda2: Export debug helper pe_level_printk() This exports debugging helper pe_level_printk() and corresponding macroses so they can be used in npu-dma.c. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-By: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
f9f83456 |
|
29-Apr-2016 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/npu: Simplify DMA setup NPU devices are emulated in firmware and mainly used for NPU NVLink training; one NPU device is per a hardware link. Their DMA/TCE setup must match the GPU which is connected via PCIe and NVLink so any changes to the DMA/TCE setup on the GPU PCIe device need to be propagated to the NVLink device as this is what device drivers expect and it doesn't make much sense to do anything else. This makes NPU DMA setup explicit. pnv_npu_ioda_controller_ops::pnv_npu_dma_set_mask is moved to pci-ioda, made static and prints warning as dma_set_mask() should never be called on this function as in any case it will not configure GPU; so we make this explicit. Instead of using PNV_IODA_PE_PEER and peers[] (which the next patch will remove), we test every PCI device if there are corresponding NVLink devices. If there are any, we propagate bypass mode to just found NPU devices by calling the setup helper directly (which takes @bypass) and avoid guessing (i.e. calculating from DMA mask) whether we need bypass or not on NPU devices. Since DMA setup happens in very rare occasion, this will not slow down booting or VFIO start/stop much. This renames pnv_npu_disable_bypass to pnv_npu_dma_set_32 to make it more clear what the function really does which is programming 32bit table address to the TVT ("disabling bypass" means writing zeroes to the TVT). This removes pnv_npu_dma_set_bypass() from pnv_npu_ioda_fixup() as the DMA configuration on NPU does not matter until dma_set_mask() is called on GPU and that will do the NPU DMA configuration. This removes phb->dma_dev_setup initialization for NPU as pnv_pci_ioda_dma_dev_setup is no-op for it anyway. This stops using npe->tce_bypass_base as it never changes and values other than zero are not supported. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
0bbcdb43 |
|
29-Apr-2016 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/npu: TCE Kill helpers cleanup NPU PHB TCE Kill register is exactly the same as in the rest of POWER8 so let's reuse the existing code for NPU. The only bit missing is a helper to reset the entire TCE cache so this moves such a helper from NPU code and renames it. Since pnv_npu_tce_invalidate() does really invalidate the entire cache, this uses pnv_pci_ioda2_tce_invalidate_entire() directly for NPU. This adds an explicit comment for workaround for invalidating NPU TCE cache. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
1e916772 |
|
02-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Use PE instead of number during setup and release In current implementation, the PEs that are allocated or picked from the reserved list are identified by PE number. The PE instance has to be picked according to the PE number eventually. We have same issue when PE is released. For pnv_ioda_pick_m64_pe() and pnv_ioda_alloc_pe(), this returns PE instance so that pnv_ioda_setup_bus_PE() can use the allocated or reserved PE instance directly. Also, pnv_ioda_setup_bus_PE() returns the reserved/allocated PE instance to be used in subsequent patches. On the other hand, pnv_ioda_free_pe() uses PE instance (not number) as its argument. No logical changes introduced. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
2b923ed1 |
|
04-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv/ioda1: Improve DMA32 segment track In current implementation, the DMA32 segments required by one specific PE isn't calculated with the information hold in the PE independently. It conflicts with the PCI hotplug design: PE centralized, meaning the PE's DMA32 segments should be calculated from the information hold in the PE independently. This introduces an array (@dma32_segmap) for every PHB to track the DMA32 segmeng usage. Besides, this moves the logic calculating PE's consumed DMA32 segments to pnv_pci_ioda1_setup_dma_pe() so that PE's DMA32 segments are calculated/allocated from the information hold in the PE (DMA32 weight). Also the logic is improved: we try to allocate as much DMA32 segments as we can. It's acceptable that number of DMA32 segments less than the expected number are allocated. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
801846d1 |
|
02-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Remove DMA32 PE list PEs are put into PHB DMA32 list (phb->ioda.pe_dma_list) according to their DMA32 weight. The PEs on the list are iterated to setup their TCE32 tables at system booting time. The list is used for once at boot time and no need to keep it. This moves the logic calculating DMA32 weight of PHB and PE to pnv_ioda_setup_dma() to drop PHB's DMA32 list. Also, every PE traces the consumed DMA32 segment by @tce32_seg and @tce32_segcount are useless and they're removed. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
93289d8c |
|
02-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Track M64 segment consumption When unplugging PCI devices, their parent PEs might be offline. The consumed M64 resource by the PEs should be released at that time. As we track M32 segment consumption, this introduces an array to the PHB to track the mapping between M64 segment and PE number. Note: M64 mapping isn't covered by pnv_ioda_setup_pe_seg() as IODA2 doesn't support the mapping explicitly while it's supported on IODA1. Until now, no M64 is supported on IODA1 in software. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
689ee8c9 |
|
02-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Data type unsigned int for PE number This changes the data type of PE number from "int" to "unsigned int" in order to match the fact PE number is never negative: * The number of PE to which the specified PCI device is attached. * The PE number map for SRIOV VFs. * The returned PE number from pnv_ioda_alloc_pe(). * The returned PE number from pnv_ioda2_pick_m64_pe(). Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-By: Alistair Popple <alistair@popple.id.au> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
92b8f137 |
|
02-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Rename PE# fields in struct pnv_phb This renames the fields related to PE number in "struct pnv_phb" for better reflecting of their usages as Alexey suggested. No logical changes introduced. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
13ce7598 |
|
02-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Reorder fields in struct pnv_phb This moves those fields in struct pnv_phb that are related to PE allocation around. No logical change. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
475d92c2 |
|
02-May-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Drop phb->bdfn_to_pe() The last usage of pnv_phb::bdfn_to_pe() was removed in ff57b454ddb9 ("powerpc/eeh: Do probe on pci_dn"), so drop it. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
1bc74f1c |
|
08-Feb-2016 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Fix stale PE primary bus When PCI bus is unplugged during full hotplug for EEH recovery, the platform PE instance (struct pnv_ioda_pe) isn't released and it dereferences the stale PCI bus that has been released. It leads to kernel crash when referring to the stale PCI bus. This fixes the issue by correcting the PE's primary bus when it's oneline at plugging time, in pnv_pci_dma_bus_setup() which is to be called by pcibios_fixup_bus(). Cc: stable@vger.kernel.org # v4.1+ Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reported-by: Pradipta Ghosh <pradghos@in.ibm.com> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
2de50e96 |
|
07-Feb-2016 |
Russell Currey <ruscur@russell.cc> |
powerpc/powernv: Remove support for p5ioc2 "p5ioc2 is used by approximately 2 machines in the world, and has never ever been a supported configuration." The code for p5ioc2 is essentially unused and complicates what is already a very complicated codebase. Its removal is essentially a "free win" in the effort to simplify the powernv PCI code. In addition, support for p5ioc2 has been dropped from skiboot. There's no reason to keep it around in the kernel. Signed-off-by: Russell Currey <ruscur@russell.cc> Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
5d2aa710 |
|
16-Dec-2015 |
Alistair Popple <alistair@popple.id.au> |
powerpc/powernv: Add support for Nvlink NPUs NVLink is a high speed interconnect that is used in conjunction with a PCI-E connection to create an interface between CPU and GPU that provides very high data bandwidth. A PCI-E connection to a GPU is used as the control path to initiate and report status of large data transfers sent via the NVLink. On IBM Power systems the NVLink processing unit (NPU) is similar to the existing PHB3. This patch adds support for a new NPU PHB type. DMA operations on the NPU are not supported as this patch sets the TCE translation tables to be the same as the related GPU PCIe device for each NVLink. Therefore all DMA operations are setup and controlled via the PCIe device. EEH is not presently supported for the NPU devices, although it may be added in future. Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
53522982 |
|
06-Aug-2015 |
Andrew Donnellan <andrew.donnellan@au1.ibm.com> |
powerpc/powernv: move dma_get_required_mask from pnv_phb to pci_controller_ops Simplify the dma_get_required_mask call chain by moving it from pnv_phb to pci_controller_ops, similar to commit 763d2d8df1ee ("powerpc/powernv: Move dma_set_mask from pnv_phb to pci_controller_ops"). Previous call chain: 0) call dma_get_required_mask() (kernel/dma.c) 1) call ppc_md.dma_get_required_mask, if it exists. On powernv, that points to pnv_dma_get_required_mask() (platforms/powernv/setup.c) 2) device is PCI, therefore call pnv_pci_dma_get_required_mask() (platforms/powernv/pci.c) 3) call phb->dma_get_required_mask if it exists 4) it only exists in the ioda case, where it points to pnv_pci_ioda_dma_get_required_mask() (platforms/powernv/pci-ioda.c) New call chain: 0) call dma_get_required_mask() (kernel/dma.c) 1) device is PCI, therefore call pci_controller_ops.dma_get_required_mask if it exists 2) in the ioda case, that points to pnv_pci_ioda_dma_get_required_mask() (platforms/powernv/pci-ioda.c) In the p5ioc2 case, the call chain remains the same - dma_get_required_mask() does not find either a ppc_md call or pci_controller_ops call, so it calls __dma_get_required_mask(). Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
26ba248d |
|
18-Jun-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Pick M64 PEs based on BARs On PHB3, PE might be reserved in advance to reflect the M64 segments consumed by the PE according to M64 BARs (exclude VF BARs) of the PCI devices included in the PE. The PE is picked based on M64 BARs instead of the bridge's M64 windows, which might include VF BARs. Otherwise, wrong PE could be picked. The patch calculates the used M64 segments and PE numbers according to the M64 BARs, excluding VF BARs, of PCI devices in one particular PE, instead of the bridge's M64 windows. Then the right PE number is picked. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
d1203852 |
|
18-Jun-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Boolean argument for pnv_ioda_setup_bus_PE() The patch changes the type of last argument of pnv_ioda_setup_bus_PE() and phb::pick_m64_pe() to boolean. No functional change. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
96a2f92b |
|
18-Jun-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Reserve M64 PEs based on BARs On PHB3, some PEs might be reserved in advance to reflect the M64 segments consumed by those PEs. We're reserving PEs based on the M64 window of root port, which might contain VF BAR. The PEs for VFs are allocated dynamically, not reserved based on the consumed M64 segments. So the M64 window of root port isn't reliable for the task. Instead, we go through M64 BARs (VF BARs excluded) of PCI devices under the specified root bus and reserve PEs accordingly, as the patch does. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
05c6cfb9 |
|
05-Jun-2015 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/iommu/powernv: Release replaced TCE At the moment writing new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However PAPR specification allows the guest to write new TCE value without clearing it first. Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races. This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns replaced TCE and DMA direction so the caller can release the pages afterwards. The exchange() receives a physical address unlike set() which receives linear mapping address; and returns a physical address as the clear() does. This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement for a platform to have exchange() implemented in order to support VFIO. This replaces iommu_tce_build() and iommu_clear_tce() with a single iommu_tce_xchg(). This makes sure that TCE permission bits are not set in TCE passed to IOMMU API as those are to be calculated by platform code from DMA direction. This moves SetPageDirty() to the IOMMU code to make it work for both VFIO ioctl interface in in-kernel TCE acceleration (when it becomes available later). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
5780fb04 |
|
05-Jun-2015 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv/ioda2: Move TCE kill register address to PE At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property which contains the TCE kill register address. Writing to this register invalidates TCE cache on IODA/IODA2 hub. This moves the register address from iommu_table to pnv_pnb as this register belongs to PHB and invalidates TCE cache for all tables of all attached PEs. This moves the property reading/remapping code to a helper which is called when DMA is being configured for PE and which does DMA setup for both IODA1 and IODA2. This adds a new pnv_pci_ioda2_tce_invalidate_entire() helper which invalidates cache for the entire table. It should be called after every call to opal_pci_map_pe_dma_window(). It was not required before because there was just a single TCE table and 64bit DMA was handled via bypass window (which has no table so no cache was used) but this is going to change with Dynamic DMA windows (DDW). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
0eaf4def |
|
05-Jun-2015 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group So far one TCE table could only be used by one IOMMU group. However IODA2 hardware allows programming the same TCE table address to multiple PE allowing sharing tables. This replaces a single pointer to a group in a iommu_table struct with a linked list of groups which provides the way of invalidating TCE cache for every PE when an actual TCE table is updated. This adds pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group() helpers to manage the list. However without VFIO, it is still going to be a single IOMMU group per iommu_table. This changes iommu_add_device() to add a device to a first group from the group list of a table as it is only called from the platform init code or PCI bus notifier and at these moments there is only one group per table. This does not change TCE invalidation code to loop through all attached groups in order to simplify this patch and because it is not really needed in most cases. IODA2 is fixed in a later patch. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
b348aa65 |
|
05-Jun-2015 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/spapr: vfio: Replace iommu_table with iommu_table_group Modern IBM POWERPC systems support multiple (currently two) TCE tables per IOMMU group (a.k.a. PE). This adds a iommu_table_group container for TCE tables. Right now just one table is supported. This defines iommu_table_group struct which stores pointers to iommu_group and iommu_table(s). This replaces iommu_table with iommu_table_group where iommu_table was used to identify a group: - iommu_register_group(); - iommudata of generic iommu_group; This removes @data from iommu_table as it_table_group provides same access to pnv_ioda_pe. For IODA, instead of embedding iommu_table, the new iommu_table_group keeps pointers to those. The iommu_table structs are allocated dynamically. For P5IOC2, both iommu_table_group and iommu_table are embedded into PE struct. As there is no EEH and SRIOV support for P5IOC2, iommu_free_table() should not be called on iommu_table struct pointers so we can keep it embedded in pnv_phb::p5ioc2. For pSeries, this replaces multiple calls of kzalloc_node() with a new iommu_pseries_alloc_group() helper and stores the table group struct pointer into the pci_dn struct. For release, a iommu_table_free_group() helper is added. This moves iommu_table struct allocation from SR-IOV code to the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized. This change is here because those lines had to be changed anyway. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw: for the vfio related changes] Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
da004c36 |
|
05-Jun-2015 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table This adds a iommu_table_ops struct and puts pointer to it into the iommu_table struct. This moves tce_build/tce_free/tce_get/tce_flush callbacks from ppc_md to the new struct where they really belong to. This adds the requirement for @it_ops to be initialized before calling iommu_init_table() to make sure that we do not leave any IOMMU table with iommu_table_ops uninitialized. This is not a parameter of iommu_init_table() though as there will be cases when iommu_init_table() will not be called on TCE tables, for example - VFIO. This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_" redundant prefixes. This removes tce_xxx_rm handlers from ppc_md but does not add them to iommu_table_ops as this will be done later if we decide to support TCE hypercalls in real mode. This removes _vm callbacks as only virtual mode is supported by now so this also removes @rm parameter. For pSeries, this always uses tce_buildmulti_pSeriesLP/ tce_buildmulti_pSeriesLP. This changes multi callback to fall back to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not present. The reason for this is we still have to support "multitce=off" boot parameter in disable_multitce() and we do not want to walk through all IOMMU tables in the system and replace "multi" callbacks with single ones. For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2. This makes the callbacks for them public. Later patches will extend callbacks for IODA1/2. No change in behaviour is expected. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
7a8e6bbf |
|
27-May-2015 |
Michael Neuling <mikey@neuling.org> |
powerpc/pci: Add shutdown hook to pci_controller_ops Currently pnv_pci_shutdown() calls the PHB shutdown code for all PHBs in the system. It dereferences the private_data assuming it's a powernv PHB, which won't be the case when we have different PHB in the systems (like when we add vPHBs for CXL). This moves the shutdown hook to the pci_controller_ops and fixes the call site to use that instead. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
763d2d8d |
|
27-Apr-2015 |
Daniel Axtens <dja@axtens.net> |
powerpc/powernv: Move dma_set_mask() from pnv_phb to pci_controller_ops Previously, dma_set_mask() on powernv was convoluted: 0) Call dma_set_mask() (a/p/kernel/dma.c) 1) In dma_set_mask(), ppc_md.dma_set_mask() exists, so call it. 2) On powernv, that function pointer is pnv_dma_set_mask(). In pnv_dma_set_mask(), the device is pci, so call pnv_pci_dma_set_mask(). 3) In pnv_pci_dma_set_mask(), call pnv_phb->set_dma_mask() if it exists. 4) It only exists in the ioda case, where it points to pnv_pci_ioda_dma_set_mask(), which is the final function. So the call chain is: dma_set_mask() -> pnv_dma_set_mask() -> pnv_pci_dma_set_mask() -> pnv_pci_ioda_dma_set_mask() Both ppc_md and pnv_phb function pointers are used. Rip out the ppc_md call, pnv_dma_set_mask() and pnv_pci_dma_set_mask(). Instead: 0) Call dma_set_mask() (a/p/kernel/dma.c) 1) In dma_set_mask(), the device is pci, and pci_controller_ops.dma_set_mask() exists, so call pci_controller_ops.dma_set_mask() 2) In the ioda case, that points to pnv_pci_ioda_dma_set_mask(). The new call chain is dma_set_mask() -> pnv_pci_ioda_dma_set_mask() Now only the pci_controller_ops function pointer is used. The fallback paths for p5ioc2 are the same. Previously, pnv_pci_dma_set_mask() would find no pnv_phb->set_dma_mask() function, to it would call __set_dma_mask(). Now, dma_set_mask() finds no ppc_md call or pci_controller_ops call, so it calls __set_dma_mask(). Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
92ae0353 |
|
27-Apr-2015 |
Daniel Axtens <dja@axtens.net> |
powerpc/powernv: Specialise pci_controller_ops for each controller type Remove powernv generic PCI controller operations. Replace it with controller ops for each of the two supported PHBs. As an added bonus, make the two new structs const, which will help guard against bugs such as the one introduced in 65ebf4b63 ("powerpc/powernv: Move controller ops from ppc_md to controller_ops") Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
781a868f |
|
25-Mar-2015 |
Wei Yang <weiyang@linux.vnet.ibm.com> |
powerpc/powernv: Shift VF resource with an offset On PowerNV platform, resource position in M64 BAR implies the PE# the resource belongs to. In some cases, adjustment of a resource is necessary to locate it to a correct position in M64 BAR . This patch adds pnv_pci_vf_resource_shift() to shift the 'real' PF IOV BAR address according to an offset. Note: After doing so, there would be a "hole" in the /proc/iomem when offset is a positive value. It looks like the device return some mmio back to the system, which actually no one could use it. [bhelgaas: rework loops, rework overlap check, index resource[] conventionally, remove pci_regs.h include, squashed with next patch] Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
9e8d4a19 |
|
25-Mar-2015 |
Wei Yang <weiyang@linux.vnet.ibm.com> |
powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically Previously the iommu_table had the same lifetime as a struct pnv_ioda_pe and was embedded in it. The pnv_ioda_pe was assigned to a PE on the bootup stage. Since PEs are based on the hardware layout which is static in the system, they will never get released. This means the iommu_table in the pnv_ioda_pe will never get released either. This no longer works for VF PE. VF PEs are created and released dynamically when VFs are created and released. So we need to assign pnv_ioda_pe to VF PEs respectively when VFs are enabled and clean up those resources for VF PE when VFs are disabled. And iommu_table is one of the resources we need to handle dynamically. Current iommu_table is a static field in pnv_ioda_pe, which will face a problem when freeing it. During the disabling of a VF, pnv_pci_ioda2_release_dma_pe will call iommu_free_table to release the iommu_table for this PE. A static iommu_table will fail in iommu_free_table. According to these requirement, this patch allocates iommu_table dynamically. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
3532a741 |
|
16-Mar-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Use pci_dn, not device_node, in PCI config accessor The PCI config accessors previously relied on device_node. Unfortunately, VFs don't have a corresponding device_node, so change the accessors to use pci_dn instead. [bhelgaas: changelog] Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
2f6cf794 |
|
15-Feb-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Remove unused file The patch removes unused file eeh-ioda.c and updates makefile accordingly. Besides, the definition of "struct pnv_eeh_ops" and the instances are all removed. Until now, the chip layer of EEH implementation for PowerNV platform is removed completely. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
cadf364d |
|
15-Feb-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Drop PHB operation reset() The patch drops PHB EEH operation reset() and merges its logic to eeh_ops::reset(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
2a485ad7 |
|
15-Feb-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Drop PHB operation next_error() The patch drops PHB EEH operation next_error() and merges its logic to eeh_ops::next_error(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
40ae5f69 |
|
15-Feb-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Drop PHB operation get_state() The patch drops PHB EEH operation get_state() and merges its logic to eeh_ops::get_state(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
7e3e4f8d |
|
15-Feb-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Drop PHB operation set_option() The patch drops PHB EEH operation set_option() and merges its logic to eeh_ops::set_option(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
bbe170ed |
|
15-Feb-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Drop PHB operation configure_bridge() The patch drops PHB EEH operation configure_bridge() and merges its logic to eeh_ops::configure_bridge(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
95edcdea |
|
15-Feb-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Drop PHB operation get_log() The patch drops PHB operation get_log() and merges its logic to eeh_ops::get_log(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
4cf17445 |
|
15-Feb-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Drop PHB operation post_init() The patch drops PHB EEH operation post_init() and merge its logic to eeh_ops::post_init(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
fa646c3c |
|
15-Feb-2015 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Drop PHB operation err_inject() The patch drops PHB EEH operation err_inject() and merge its logic to eeh_ops::err_inject(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
5ef73567 |
|
11-Nov-2014 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Rename alloc_m64_pe() to reserve_m64_pe() The patch renames alloc_m64_pe() to reserve_m64_pe() to reflect its real usage: We reserve PE numbers for M64 segments in advance and then pick up the reserved PE numbers when building the mapping between PE numbers and M64 segments. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
fe7e85c6 |
|
29-Sep-2014 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Override dma_get_required_mask() The dma_get_required_mask() function is used by some drivers to query the platform about what DMA mask is needed to cover all of memory. This is a bit of a strange semantic when we have to choose between IOMMU translation or bypass, but essentially what it means is "what DMA mask will give best performances". Currently, our IOMMU backend always returns a 32-bit mask here, we don't do anything special to it when we have bypass available. This causes some drivers to choose a 32-bit mask, thus losing the ability to use the bypass window, thinking this is more efficient. The problem was reported from the driver of following device: 0004:03:00.0 0107: 1000:0087 (rev 05) 0004:03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios \ Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05) This patch adds an override of that function in order to, instead, return a 64-bit mask whenever a bypass window is available in order for drivers to prefer this configuration. Reported-by: Murali N. Iyer <mniyer@us.ibm.com> Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
131c123a |
|
29-Sep-2014 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/eeh: Introduce eeh_ops::err_inject The patch introduces eeh_ops::err_inject(), which allows to inject specified errors to indicated PE for testing purpose. The functionality isn't support on pSeries platform. On PowerNV, the functionality relies on OPAL API opal_pci_err_inject(). Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
#
49dec922 |
|
20-Jul-2014 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Handle compound PE The patch introduces 3 PHB callbacks: compound PE state retrieval, force freezing and unfreezing compound PE. The PCI config accessors and PowerNV EEH backend can use them in subsequent patches. We don't export the capability of compound PE to EEH core, which helps avoiding more complexity to EEH core. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
262af557 |
|
20-Jul-2014 |
Guo Chao <yan@linux.vnet.ibm.com> |
powerpc/powernv: Enable M64 aperatus for PHB3 This patch enables M64 aperatus for PHB3. We already had platform hook (ppc_md.pcibios_window_alignment) to affect the PCI resource assignment done in PCI core so that each PE's M32 resource was built on basis of M32 segment size. Similarly, we're using that for M64 assignment on basis of M64 segment size. * We're using last M64 BAR to cover M64 aperatus, and it's shared by all 256 PEs. * We don't support P7IOC yet. However, some function callbacks are added to (struct pnv_phb) so that we can reuse them on P7IOC in future. * PE, corresponding to PCI bus with large M64 BAR device attached, might span multiple M64 segments. We introduce "compound" PE to cover the case. The compound PE is a list of PEs and the master PE is used as before. The slave PEs are just for MMIO isolation. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
8fa5d454 |
|
06-Jun-2014 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc/powernv: Add a page size parameter to pnv_pci_setup_iommu_table() Since a TCE page size can be other than 4K, make it configurable for P5IOC2 and IODA PHBs. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
361f2a2a |
|
24-Apr-2014 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powrpc/powernv: Reset PHB in kdump kernel In the kdump scenario, the first kerenl doesn't shutdown PCI devices and the kdump kerenl clean PHB IODA table at the early probe time. That means the kdump kerenl can't support PCI transactions piled by the first kerenl. Otherwise, lots of EEH errors and frozen PEs will be detected. In order to avoid the EEH errors, the PHB is resetted to drop all PCI transaction from the first kerenl. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
d92a208d |
|
24-Apr-2014 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/pci: Mask linkDown on resetting PCI bus The problem was initially reported by Wendy who tried pass through IPR adapter, which was connected to PHB root port directly, to KVM based guest. When doing that, pci_reset_bridge_secondary_bus() was called by VFIO driver and linkDown was detected by the root port. That caused all PEs to be frozen. The patch fixes the issue by routing the reset for the secondary bus of root port to underly firmware. For that, one more weak function pci_reset_secondary_bus() is introduced so that the individual platforms can override that and do specific reset for bridge's secondary bus. Reported-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
7f52a526 |
|
24-Apr-2014 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/eeh: Allow to disable EEH The patch introduces bootarg "eeh=off" to disable EEH functinality. Also, it creates /sys/kerenl/debug/powerpc/eeh_enable to disable or enable EEH functionality. By default, we have the functionality enabled. For PowerNV platform, we will restore to have the conventional mechanism of clearing frozen PE during PCI config access if we're going to disable EEH functionality. Conversely, we will rely on EEH for error recovery. The patch also fixes the issue that we missed to cover the case of disabled EEH functionality in function ioda_eeh_event(). Those events driven by interrupt should be cleared to avoid endless reporting. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
f5bc6b70 |
|
24-Apr-2014 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Move PNV_EEH_STATE_ENABLED around The flag PNV_EEH_STATE_ENABLED is put into pnv_phb::eeh_state, which is protected by CONFIG_EEH. We needn't that. Instead, we can have pnv_phb::flags and maintain all flags there, which is the purpose of the patch. The patch also renames PNV_EEH_STATE_ENABLED to PNV_PHB_FLAG_EEH. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
467f79a9 |
|
24-Apr-2014 |
Gavin Shan <gwshan@linux.vnet.ibm.com> |
powerpc/powernv: Remove PNV_EEH_STATE_REMOVED The PHB state PNV_EEH_STATE_REMOVED maintained in pnv_phb isn't so useful any more and it's duplicated to EEH_PE_ISOLATED. The patch replaces PNV_EEH_STATE_REMOVED with EEH_PE_ISOLATED. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
cd15b048 |
|
10-Feb-2014 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv: Add iommu DMA bypass support for IODA2 This patch adds the support for to create a direct iommu "bypass" window on IODA2 bridges (such as Power8) allowing to bypass iommu page translation completely for 64-bit DMA capable devices, thus significantly improving DMA performances. Additionally, this adds a hook to the struct iommu_table so that the IOMMU API / VFIO can disable the bypass when external ownership is requested, since in that case, the device will be used by an environment such as userspace or a KVM guest which must not be allowed to bypass translations. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
ca1de5de |
|
20-Dec-2013 |
Brian W Hart <hartb@linux.vnet.ibm.com> |
powernv/eeh: Add buffer for P7IOC hub error data Prevent ioda_eeh_hub_diag() from clobbering itself when called by supplying a per-PHB buffer for P7IOC hub diagnostic data. Take care to inform OPAL of the correct size for the buffer. [Small style change to the use of sizeof -- BenH] Signed-off-by: Brian W Hart <hartb@linux.vnet.ibm.com> Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
93aef2a7 |
|
22-Nov-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Move PHB-diag dump functions around Prior to the completion of PCI enumeration, we actively detects EEH errors on PCI config cycles and dump PHB diag-data if necessary. The EEH backend also dumps PHB diag-data in case of frozen PE or fenced PHB. However, we are using different functions to dump the PHB diag-data for those 2 cases. The patch merges the functions for dumping PHB diag-data to one so that we can avoid duplicate code. Also, we never dump PHB3 diag-data during PCI config cycles with frozen PE. The patch fixes it as well. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
36954dc7 |
|
04-Nov-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Reserve the correct PE number We're assigning PE numbers after the completion of PCI probe. During the PCI probe, we had PE#0 as the super container to encompass all PCI devices. However, that's inappropriate since PELTM has ascending order of priority on search on P7IOC. So we need PE#127 takes the role that PE#0 has previously. For PHB3, we still have PE#0 as the reserved PE. The patch supposes that the underly firmware has built the RID to PE# mapping after resetting IODA tables: all PELTM entries except last one has invalid mapping on P7IOC, but all RTEs have binding to PE#0. The reserved PE# is being exported by firmware by device tree. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
8e0a1611 |
|
28-Aug-2013 |
Alexey Kardashevskiy <aik@ozlabs.ru> |
powerpc: add real mode support for dma operations on powernv The existing TCE machine calls (tce_build and tce_free) only support virtual mode as they call __raw_writeq for TCE invalidation what fails in real mode. This introduces tce_build_rm and tce_free_rm real mode versions which do mostly the same but use "Store Doubleword Caching Inhibited Indexed" instruction for TCE invalidation. This new feature is going to be utilized by real mode support of VFIO. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
5c9d6d75 |
|
05-Sep-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Double size of log blob Each PHB instance (struct pnv_phb) has its corresponding log blob, which is used to hold the retrieved error log from firmware. The current size of that (4096) isn't enough for PHB3 case and the patch makes that double to 8192. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
5e4da530 |
|
22-Sep-2013 |
Anton Blanchard <anton@samba.org> |
powerpc/powernv: Fix some PCI sparse errors and one LE bug pnv_pci_setup_bml_iommu was missing a byteswap of a device tree property. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
9bf41be6 |
|
26-Jun-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Use dev-node in PCI config accessors Currently, we're using the combo (PCI bus + devfn) in the PCI config accessors and PCI config accessors in EEH depends on them. However, it's not safe to refer the PCI bus which might have been removed during hotplug. So we're using device node in the PCI config accessors and the corresponding backends just reuse them. The patch also fix one potential risk: We possiblly have frozen PE during the early PCI probe time, but we haven't setup the PE mapping yet. So the errors should be counted to PE#0. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
0b9e267d |
|
26-Jun-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Replace variables with flags We have 2 fields in "struct pnv_phb" to trace the states. The patch replace the fields with one and introduces flags for that. The patch doesn't impact the logic. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
37c367f2 |
|
20-Jun-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Debugfs directory for PHB The patch creates one debugfs directory ("powerpc/PCIxxxx") for each PHB so that we can hook EEH error injection debugfs entry there in proceeding patch. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
70f942db |
|
19-Jun-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/eeh: I/O chip next error The patch implements the backend for EEH core to retrieve next EEH error to handle. For the informational errors, we won't bother the EEH core. Otherwise, the EEH should take appropriate actions depending on the return value: 0 - No further errors detected 1 - Frozen PE 2 - Fenced PHB 3 - Dead PHB 4 - Dead IOC Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
8747f363 |
|
19-Jun-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/eeh: EEH backend for P7IOC For EEH on PowerNV platform, the overall architecture is different from that on pSeries platform. In order to support multiple I/O chips in future, we split EEH to 3 layers for PowerNV platform: EEH core, platform layer, I/O layer. It would give EEH implementation on PowerNV platform much more flexibility in future. The patch adds the EEH backend for P7IOC. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
73ed148a |
|
10-May-2013 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv: Improve kexec reliability We add a machine_shutdown hook that frees the OPAL interrupts (so they get masked at the source and don't fire while kexec'ing) and which triggers an IODA reset on all the PCIe host bridges which will have the effect of blocking all DMAs and subsequent PCIs interrupts. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
4cce9550 |
|
25-Apr-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: TCE invalidation for PHB3 The TCE should be invalidated while it's created or free'd. The approach to do that for IODA1 and IODA2 compliant PHBs are different. So the patch differentiate them with different functions called to do that for IODA1 and IODA2 compliant PHBs. It's notable that the PCI address is used to invalidate the corresponding TCE on IODA2 compliant PHB3. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
137436c9 |
|
25-Apr-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Patch MSI EOI handler on P8 The EOI handler of MSI/MSI-X interrupts for P8 (PHB3) need additional steps to handle the P/Q bits in IVE before EOIing the corresponding interrupt. The patch changes the EOI handler to cover that. we have individual IRQ chip in each PHB instance. During the MSI IRQ setup time, the IRQ chip is copied over from the original one for that IRQ, and the EOI handler is patched with the one that will handle the P/Q bits (As Ben suggested). Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
aa0c033f |
|
25-Apr-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Supports PHB3 The patch intends to initialize PHB3 during system boot stage. The flag "PNV_PHB_MODEL_PHB3" is introduced to differentiate IODA2 compatible PHB3 from other types of PHBs. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
fb1b55d6 |
|
05-Mar-2013 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Use MSI bitmap to manage IRQs As Michael Ellerman mentioned, arch/powerpc/sysdev/msi_bitmap.c already implemented bitmap to manage (alloc/free) MSI interrupts. The patch intends to use that mechanism to manage MSI interrupts for PowerNV platform. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
|
#
db1266c8 |
|
19-Aug-2012 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: Skip check on PE if necessary While the device driver or PCI core tries to enable PCI device, the platform dependent callback "ppc_md.pcibios_enable_device_hook" will be called to check if there has one associated PE for the PCI device. If we don't have the associated PE for the PCI device, it's not allowed to enable the PCI device. Unfortunately, there might have some cases we have to enable the PCI device (e.g. P2P bridge), but the PEs have not been created yet. The patch handles the unfortunate cases. Each PHB (struct pnv_phb) has one field "initialized" to trace if the PEs have been created and configured or not. When the PEs are not available, we won't check the associated PE for the PCI device to be enabled. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Ram Pai <linuxram@us.ibm.com> Reviewed-by: Richard Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
7ebdf956 |
|
19-Aug-2012 |
Gavin Shan <shangw@linux.vnet.ibm.com> |
powerpc/powernv: PE list based on creation order The resource (I/O and MMIO) will be assigned on basis of PE from top to bottom so that we can implement the trick here: the resource that has been assigned to parent PE could be taken by child PE if necessary. The current implementation already has PE list per PHB basis, but the list doesn't meet our requirment: tracing PE based on their cration time from top to bottom. So the patch does rename for the DMA based PE list and introduces the list to trace the PEs sequentially based on their creation time. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Ram Pai <linuxram@us.ibm.com> Reviewed-by: Richard Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
cee72d5b |
|
29-Nov-2011 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv: Display diag data on p7ioc EEH errors Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
184cd4a3 |
|
15-Nov-2011 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv: PCI support for p7IOC under OPAL v2 This adds support for p7IOC (and possibly other IODA v1 IO Hubs) using OPAL v2 interfaces. We completely take over resource assignment and assign them using an algorithm that hands out device BARs in a way that makes them fit in individual segments of the M32 window of the bridge, which enables us to assign individual PEs to devices and functions. The current implementation gives out a PE per functions on PCIe, and a PE for the entire bridge for PCIe to PCI-X bridges. This can be adjusted / fine tuned later. We also setup DMA resources (32-bit only for now) and MSIs (both 32-bit and 64-bit MSI are supported). The DMA allocation tries to divide the available 256M segments of the 32-bit DMA address space "fairly" among PEs. This is done using a "weight" heuristic which assigns less value to things like OHCI USB controllers than, for example SCSI RAID controllers. This algorithm will probably want some fine tuning for specific devices or device types. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
c1a2562a |
|
19-Sep-2011 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv: Implement MSI support for p5ioc2 PCIe This implements support for MSIs on p5ioc2 PHBs. We only support MSIs on the PCIe PHBs, not the PCI-X ones as the later hasn't been properly verified in HW. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
#
61305a96 |
|
19-Sep-2011 |
Benjamin Herrenschmidt <benh@kernel.crashing.org> |
powerpc/powernv: Add support for p5ioc2 PCI-X and PCIe This adds support for PCI-X and PCIe on the p5ioc2 IO hub using OPAL. This includes allocating & setting up TCE tables and config space access routines. This also supports fallbacks via RTAS when OPAL is absent, using legacy TCE format pre-allocated via the device-tree (BML style) Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|