So, I conclude from your FPGA experiments, you've seen max transfer rates close to 2 phi2 clocks per bit using a 6526R4 in a C64 as input, when you first phase-align the external FPGA with the chip. You then also make CNT only high for half of a PHI2 cycle, followed by 1.5 cycles low. Is that correct?
No, a short CNT low phase followed by a 1.5 PHI2 cycle high phase.
This is what the "best case" would look like for a transmission of a "0" bit followed by a "1" bit:
             |<- 2 phi2 cycles ->|
PHI2 ‾‾|____|‾‾‾‾|____|‾‾‾‾|____|‾‾‾‾|____|‾‾‾‾|____|‾‾
CNT  ‾‾‾‾‾‾‾‾|__|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|__|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|_
SP   ‾‾‾‾‾‾‾‾|___________________|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|_
                 ^              ^
                 1              2
From what I (think I) understand, the 6526 registers the positive CNT edge at the negative PHI2 edge marked "1" and samples SP 1.5 cycles later, at the second positive PHI2 edge, marked "2". The whole CNT/SP transfer cycle is 2 PHI2 cycles. You could even go slightly faster for one bit by making the CNT "low" phase shorter, but that doesn't gain you anything: if the time for 1 bit isn't a multiple of a PHI2 cycle, you lose the optimal phase offset for the following bits.
And this is the "worst case":
             |<-    ~3 phi2 cycles     ->|
PHI2 |____|‾‾‾‾|____|‾‾‾‾|____|‾‾‾‾|____|‾‾‾‾|____|‾‾‾‾
CNT  ‾‾‾‾‾‾‾‾|__|‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾|__|‾‾‾‾‾‾‾‾‾‾
SP   ‾‾‾‾‾‾‾‾|___________________________|‾‾‾‾‾‾‾‾‾‾‾‾‾
                         ^              ^
                         1              2
As the negative PHI2 edge just misses the positive CNT edge, SP is sampled ~1 PHI2 cycle later than in the "best case". So in order to work with all possible phase offsets, you have to make SP valid for one more PHI2 cycle.
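To put numbers on the two cases, here is a minimal Python sketch of the black-box behaviour described above. It is only a model of my observations, not a claim about the chip's internals. The function names, the 0.3-cycle CNT low phase and the 0.1-cycle offsets are assumptions read off the diagrams, and the model pretends a CNT edge landing exactly on a PHI2 falling edge is still caught, which real silicon would not guarantee.

```python
from fractions import Fraction
import math

# PHI2 cycles between the falling edge that registers CNT and the SP sample
# (the "1" -> "2" distance in the diagrams).
SAMPLE_LATENCY = Fraction(3, 2)


def sample_delay(cnt_rise):
    """PHI2 cycles from the CNT rising edge until SP is sampled.

    Time is measured in PHI2 cycles, with PHI2 falling edges at integer
    times. Assumption: the CNT edge is registered at the first falling
    edge at or after it (zero setup time).
    """
    registered_at = math.ceil(cnt_rise)
    return registered_at + SAMPLE_LATENCY - cnt_rise


def sp_hold_needed(cnt_rise, cnt_low):
    """PHI2 cycles SP must stay valid if it changes on the CNT falling edge."""
    return cnt_low + sample_delay(cnt_rise)


CNT_LOW = Fraction(3, 10)  # assumed short CNT low phase, as drawn

# Best case: CNT rises 0.1 cycle before a PHI2 falling edge.
best = sp_hold_needed(Fraction(9, 10), CNT_LOW)
# Worst case: CNT rises 0.1 cycle after a PHI2 falling edge.
worst = sp_hold_needed(Fraction(1, 10), CNT_LOW)

print(best, worst)
```

With these assumed numbers SP has to be held for 1.9 PHI2 cycles in the best case, which fits into a 2-cycle bit, and for 2.7 cycles in the worst case, which is why a sender that is not phase-aligned needs roughly 3 cycles per bit.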
I actually don't know much about chip design, which is why I can't really follow all your explanations. I merely took a black-box approach and checked which signals lead to which bits being sampled in the SR. I can't really comment on what the actual design would have to look like to produce this behaviour.
Regarding the IRQ issue: I only brought it up because I thought you said you had found that the SR IRQ has an issue similar to the "timer b bug". As that apparently was a misunderstanding, I wouldn't read too much into it. I didn't do enough research to be confident that the IRQ really wasn't set for the SR. In the end it could well be something as simple as a bad connection on the IRQ probe.