Skip to content

Conversation

@jerzyjamroz
Copy link
Contributor

@jerzyjamroz jerzyjamroz commented Jun 5, 2025

@mdavidsaver , The feature was tested with:

  • EVM and mrfioc2 IOC – everything works fine. One can power down and up the EVM card, and the IOC does not require any reboot. It just works.
  • EVR and mrfioc2 IOC - it works almost ok except it reports errors for: Pll-Sts and the timestamping TimeValid-Sts. In order to recover the IOC fully, the IOC reboot is required (probably because the EVR settings are gone).
  • This implementation is the most minimalistic update I found. I tested more (like BAR remap, etc.), but it looks like it's not necessary (I might be wrong, I haven’t tested that extensively).
  • If the proposed code should be placed in a different location, then all advice is welcome.

@jerzyjamroz jerzyjamroz requested a review from mdavidsaver June 5, 2025 12:15
@AppVeyorBot
Copy link

Build devlib2 1.0.24 failed (commit 14a35d4855 by @jerzyjamroz)

@AppVeyorBot
Copy link

Build devlib2 1.0.25 failed (commit 992c159c72 by @jerzyjamroz)

@jerzyjamroz
Copy link
Contributor Author

Introduced global function pointer devPCIonHotSwap which can be set
to a user-defined callback. This hook is invoked after successful
reopening of a PCI device (e.g. after hot-swap or replug).

@AppVeyorBot
Copy link

Build devlib2 1.0.26 failed (commit 308ff99de5 by @jerzyjamroz)

@mdavidsaver
Copy link
Collaborator

@jerzyjamroz Do you understand what the UIO driver is doing about the mapped register ranges when a device is removed?

Since this PR is not (I think) calling mmap() after reopening the UIO device, I have a suspicion the VM mappings are not being removed or updated. Which would imply that a hot-swap only appears to work if PCI re-enumeration happens to allocate the same address ranges for the swapped device.

@mdavidsaver
Copy link
Collaborator

In terms of API design. Unless the new hot plug callback gets a new/re-enumerated PCI device handle, there is an assumption that the swapped device is actually the exact same card.

@mdavidsaver
Copy link
Collaborator

... and I think this new callback would need to be per-device.

eg. how could a global function pointer work in a situation where one IOC process handles two PCI devices?

@dirk-zimoch
Copy link
Contributor

Wouldn’t any access to the mmap‘ed device address space cause a bus error when the device is swapped out?

@mdavidsaver
Copy link
Collaborator

It has been a decade since I looked into the situation in detail. When I last checked, I came to the conclusion that it was not possible to handle PCI hot swap for a UIO device with complete safety. (even getting this right in kernel seemed difficult) I was left with many unanswered questions about the Linux kernel APIs. eg. is a UIO driver allowed to block the remove handler to wait for user space interaction? Can process VM mappings be modified directly without blocking (or at least without syscalls)?

Still, if "complete safety" remains out of reach, then maybe the only practical paths forward are do nothing, or depending on what are likely unintended behaviors and documenting with lots of caveats.

@mdavidsaver
Copy link
Collaborator

Wouldn’t any access to the mmap‘ed device address space cause a bus error when the device is swapped out?

I had this same question, but was not able to find a clear answer. I suspect that at least for classic PCI, the answer is platform (CPU and/or PCI bridge) specific. I would hope that PCIe made this more uniform, but that is only a hope.

@jerzyjamroz
Copy link
Contributor Author

@jerzyjamroz Do you understand what the UIO driver is doing about the mapped register ranges when a device is removed?

Since this PR is not (I think) calling mmap() after reopening the UIO device, I have a suspicion the VM mappings are not being removed or updated. Which would imply that a hot-swap only appears to work if PCI re-enumeration happens to allocate the same address ranges for the swapped device.

  1. Good point — I observed that the physical BAR address (/sys/class/uio/uioX/maps/map0/addr) remains the same before and after the device is removed and re-added, as shown below:
bd-evm:~$ cat /sys/class/uio/uio1/maps/map0/addr
0x00000000df000000

# after device removal
bd-evm:~$ cat /sys/class/uio/uio1/maps/map0/addr
cat: /sys/class/uio/uio1/maps/map0/addr: No such file or directory

# after re-insertion
bd-evm:~$ cat /sys/class/uio/uio1/maps/map0/addr
0x00000000df000000
  1. I tested that with the same card in the same slot.
  • The slot will never change, but the card serial number can change (not the type) in our maintenance schedule, that is why I have done only the relevant tests.

@jerzyjamroz
Copy link
Contributor Author

there is an assumption that the swapped device is actually the exact same card.

I hope it works for the same card type, I mean that I can replace e.g. EVR (not only power cycle it). I will do that test soon.

@jerzyjamroz
Copy link
Contributor Author

... and I think this new callback would need to be per-device.

eg. how could a global function pointer work in a situation where one IOC process handles two PCI devices?

Good point, I haven't noticed that and haven't tested this scenario. It can be solved by adding void (*onHotSwap)(struct myDevice* dev); into the osd structure.

@AppVeyorBot
Copy link

Build devlib2 1.0.27 failed (commit 62c570679a by @jerzyjamroz)

@jerzyjamroz
Copy link
Contributor Author

Still, if "complete safety" remains out of reach, then maybe the only practical paths forward are do nothing, or depending on what are likely unintended behaviors and documenting with lots of caveats.

Well, I wrote the proposal that works for the scenario written above. I realised now that for EVR (not EVM) it would require triggering autosave configuration once the EVR is recovered, as probably the EVR memory (config) is clean after the power cycle. Something to think about.
But in summary, let me know your opinion if we want this PR or not. It looks harmless, as in the previous case, the driver crushes anyway, and now it is trying to recover this limited (the same card type, the same slot) scenario.

@AppVeyorBot
Copy link

Build devlib2 1.0.28 failed (commit 123ea3acc5 by @jerzyjamroz)

@AppVeyorBot
Copy link

Build devlib2 1.0.29 failed (commit 97c6cb5c3b by @jerzyjamroz)

@AppVeyorBot
Copy link

Build devlib2 1.0.30 failed (commit 4b15c8bd54 by @jerzyjamroz)

@AppVeyorBot
Copy link

Build devlib2 1.0.31 failed (commit 19cff351ba by @jerzyjamroz)

@AppVeyorBot
Copy link

Build devlib2 1.0.32 failed (commit 363a29a0f2 by @jerzyjamroz)

@AppVeyorBot
Copy link

Build devlib2 1.0.33 failed (commit a307eb82ba by @jerzyjamroz)

@AppVeyorBot
Copy link

Build devlib2 1.0.34 failed (commit 564b3ec5a0 by @jerzyjamroz)

@jerzyjamroz
Copy link
Contributor Author

My strong recommendation would be for you to prototype the whole process, including changes to mrfioc2 (uio and epics). Use this experience to inform any API change proposed for devlib2.

I did the full test (see here), myCustomHandler (the hot swap handler) would need to be registered in mrmEvrSetupPCI with recovery of the reseted memory of EVR. It works straight away for EVM as fanout concentrator, so it requires only the EVR handler.

In summary:

  1. I tested the hot-swap and the callback handler for the same hardware type and the same slot, and it works (let's call it maintenance hot swap which means only the hot swap of the existing installation).
  2. EVR requires some work to evaluate the best implementation of that (let's call it evrHotSwapHandler).
  3. But I have realised about the complexity of that, so the next steps depend on your recommendations.

@mdavidsaver
Copy link
Collaborator

The idea is to use it in your driver e.g. mrfioc2 like that osd->onHotSwapHook = myCustomHotSwapHandler;

So the whole structure is exposed, and becomes part of a public API, to allow access to one data member?

@mdavidsaver
Copy link
Collaborator

autosave configuration.

Has autosaved learned some notion of "driver" or "device" for filtering of records? Or are you assuming one device per IOC?

@mdavidsaver
Copy link
Collaborator

I did the full test ...

Was this an IOC with only a single EVR device?

I think a more comprehensive testing scenario would have an IOC with at least two EVR/EVM devices, and some other driver as well. eg. some unrelated driver which also makes use of autosave.

@jerzyjamroz
Copy link
Contributor Author

jerzyjamroz commented Jun 26, 2025

So the whole structure is exposed, and becomes part of a public API, to allow access to one data member?

It allows the designer of the myCustomHotSwapHandler to use e.g. osd->dev.slot or other members which might be useful or not. So it has some pros and cons.

@jerzyjamroz
Copy link
Contributor Author

jerzyjamroz commented Jun 26, 2025

Has autosaved learned some notion of "driver" or "device" for filtering of records? Or are you assuming one device per IOC?

Good point, I was assuming one device IOC. And, it is a question of how to recover the configuration of EVR. The good thing is that one does not need to do that for the EVM.

@jerzyjamroz
Copy link
Contributor Author

jerzyjamroz commented Jun 26, 2025

Was this an IOC with only a single EVR device?

1 x EVM + 1 x EVR

I think a more comprehensive testing scenario would have an IOC with at least two EVR/EVM devices, and some other driver as well. eg. some unrelated driver which also makes use of autosave.

Yes, It would be also good if more people would test it, I think @dirk-zimoch plans to do tests as well.

@AppVeyorBot
Copy link

Build devlib2 1.0.42 failed (commit f05486a20b by @jerzyjamroz)

@jerzyjamroz
Copy link
Contributor Author

@mdavidsaver , @dirk-zimoch - do you have any comments on that? I do not see any drawbacks there. The module gains this minimalistic recovery capability (so this pseudo hot-swap).

@mdavidsaver
Copy link
Collaborator

I disagree with installing devLibPCIOSD.h, and making a very OS specific structure into a public API. This primary goal of devLib2 is OS independent abstraction.

@mdavidsaver
Copy link
Collaborator

... minimalistic recovery capability (so this pseudo hot-swap).

As previously stated, I think it is unwise to depend on this incompletely specified behavior.

@dirk-zimoch
Copy link
Contributor

As devlib2 is essentially Michael‘s work, I abstain from any opinion.

@mdavidsaver
Copy link
Collaborator

@jerzyjamroz I hope you are not discouraged by my earlier comments. Supporting PCI/PCIe hot plugging would certainly be a useful feature. At present this PR seems like a useful tool to study the problems which would be involved, but is far from a finished product.

If you have continued interested, I would encourage you to answer some of the following questions. What exactly happens when you pull the ejector tab on a uTCA card? Which Linux driver handles notification to effected PCI devices? In what context are these notifications delivered? (task?, ISR?) What can be safely done in such a callback? Can it block for further events? Interact with userspace processes? What happens to concurrent PCI operations?

@jerzyjamroz
Copy link
Contributor Author

@mdavidsaver , OK I will change this PR as draft for eventual future analysis.

@jerzyjamroz jerzyjamroz marked this pull request as draft August 21, 2025 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants