One option would be to include something like sd2iec. That would be less complex than a 1541 emulation, but probably still too complex for the KFF.
With the Kung-Fu-Flash cartridge as-is, the main limitation is that all of the ARM's pins are already in use. There is not a single spare pin.
TL;DR: I think the KFF is a good balance as-is. sd2iec with somewhat different hardware might work, but I'm not sure it is worth it. Maybe it's better to take any SD2IEC-supported titles that are not yet available for the KFF (how many are there these days?) and convert them on the C64 side to "plain" loading.
I did try connecting to the IEC bus on the hardware side. I made a prototype version of Kim's KFF using a version of the ARM chip with more pins (a 100-pin version instead of the 64-pin version Kim uses). This is the exact same ARM silicon, just connected differently to the outside world. I then wired the additional pins to the IEC bus and the DMA pin, and I also connected a separate SPI color LCD screen, mounted on top of the KFF module, using a few more pins. There are still some pins left over for other fun hobby projects. The size and cost are roughly the same as with the 64-pin ARM chip (plus, of course, a few euros for the IEC connector, the LCD screen and anything else).
I had to move some of the pin connections of the original KFF pinout around to get it working, so this prototype is not directly software compatible with the original Kung-Fu Flash. It requires changing the GPIO port addresses, with some 8-bit shifts here and there, in Kim's ARM source code. Other than that, it is compatible with Kim's software.
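To give an idea of what "8-bit shifts here and there" can look like, here is a minimal sketch. It assumes, purely for illustration, that the 8-bit data bus sits on bits 0..7 of a GPIO port in one layout and on bits 8..15 in the other. The bit positions and the names `DATA_SHIFT_ORIGINAL`, `DATA_SHIFT_PROTOTYPE`, `data_from_port` and `data_to_port` are hypothetical and are not taken from Kim's source code or from the actual prototype pinout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only: the real KFF pinout
   and the prototype pinout are not given here. */
#define DATA_SHIFT_ORIGINAL  0u   /* data bus on port bits 0..7  */
#define DATA_SHIFT_PROTOTYPE 8u   /* data bus on port bits 8..15 */

/* Extract the 8-bit data bus value from a 16-bit port input value. */
static inline uint8_t data_from_port(uint32_t port_value, unsigned shift)
{
    assert(shift <= 8u);
    return (uint8_t)((port_value >> shift) & 0xFFu);
}

/* Return the port output value with the data bus byte replaced and
   all other port bits left untouched. */
static inline uint32_t data_to_port(uint32_t port_value, uint8_t data,
                                    unsigned shift)
{
    assert(shift <= 8u);
    uint32_t mask = (uint32_t)0xFFu << shift;
    return (port_value & ~mask) | ((uint32_t)data << shift);
}
```

Keeping the shift in one constant per signal group means a changed pinout only touches the definitions, not every place the bus is read or written.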
I think something like sd2iec might be possible with that prototype pinout, though I never found the time to go ahead and code that. There are also fun things possible with the DMA pin, which I never found the time to code.
When running a C64 game/app, while the ARM is emulating the C64 ROMs, the ARM gets a software interrupt on every C64 1MHz cycle. The ARM then has a small number of cycles to react to that interrupt and process it, toggling GPIOs at 196 MHz. The main challenge I've found is that the ARM interrupt latency (the number of cycles from the 1MHz interrupt hardware signal to the point where the ARM starts running the interrupt code) varies quite a lot, depending on what the ARM was doing just before the interrupt happened. To draw a parallel: the C64 typically has an interrupt uncertainty of 1-3 cycles, whereas the ARM can have anything from 12 to 50+? cycles of uncertainty. So all background ARM code written for this 1MHz mode (like the D64 handling) must make sure that not a single ARM instruction causes high interrupt latency (for example by accessing some slow ARM peripheral). Kim's code manages that latency very well. In practice, with the KFF code, the ARM interrupt latency is between 12 and 17 or so 196MHz cycles.
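To put rough numbers on that, here is a small back-of-the-envelope sketch. It assumes a nominal 1 MHz C64 clock (the real PAL/NTSC clocks differ slightly) and the 196 MHz ARM clock mentioned above. The helper names are made up for this example and are not part of the KFF firmware.

```c
#include <assert.h>
#include <stdint.h>

#define ARM_HZ 196000000u  /* ARM clock figure quoted above */
#define C64_HZ   1000000u  /* nominal 1 MHz; an assumption for round numbers */

/* How many ARM cycles fit into one C64 bus cycle. */
static inline uint32_t arm_cycles_per_c64_cycle(uint32_t arm_hz, uint32_t c64_hz)
{
    assert(c64_hz != 0u);
    return arm_hz / c64_hz;
}

/* ARM cycles left for real work once the interrupt latency is paid. */
static inline uint32_t cycles_left_after_latency(uint32_t budget, uint32_t latency)
{
    return (latency >= budget) ? 0u : budget - latency;
}

/* Convert a number of ARM cycles to whole nanoseconds (rounded down). */
static inline uint32_t cycles_to_ns(uint32_t cycles, uint32_t arm_hz)
{
    assert(arm_hz != 0u);
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / arm_hz);
}
```

Under these assumptions the ARM has 196 cycles per C64 cycle. A latency of 12 to 17 cycles is roughly 61 to 86 ns and leaves about 179 cycles in the worst case, while a 50-cycle latency is about 255 ns and leaves 146. This ignores where within the C64 cycle the bus actually has to be sampled or driven, so the usable window is smaller than these totals suggest.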
So any ARM sd2iec code would need to be written in a similar fashion.
In the version of Kim's software I worked on, when the C64 is only presenting the KFF menu (so directly after startup), it does not actually run that 1MHz interrupt. In that mode the ARM does have cycles to spare, and if the ARM happens to block for a lot of cycles the KFF menu still works just fine. I was able to show the box art of the game on the SPI LCD screen while the KFF menu is displayed.
@Kim thanks again for the fantastic Kung-Fu Flash, it really is a very fun project.