Yes, my understanding is that the CNT pin can be clocked faster than PHI2, though I can't remember whether I ever actually did it. I think I looked at using it in 64NET to speed up loading, by having the PC push data through the shift register instead of over the parallel interface, but my recollection is that PCs at the time couldn't write to the ISA-connected printer port fast enough for that to be worthwhile.
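A rough sketch of the shift-register transfer idea, assuming the usual description of the CIA serial port in input mode: SP is sampled on each rising CNT edge, bits arrive MSB first, and the byte is handed to the data register after eight edges. The class and function names are made up for illustration, and the model has no notion of timing, so it says nothing about how fast CNT can really be clocked relative to PHI2.

```python
class ShiftRegisterModel:
    """Toy model of a CIA-style serial port in input mode (hypothetical names)."""

    def __init__(self):
        self.shift = 0        # bits collected so far for the current byte
        self.count = 0        # rising CNT edges seen for the current byte
        self.sdr = None       # last complete byte (serial data register)
        self.irq = False      # set once a full byte has arrived
        self._last_cnt = 1    # previous CNT level; idle-high is an assumption

    def step(self, cnt, sp):
        """Feed one sample of the CNT and SP pin levels (each 0 or 1)."""
        rising = self._last_cnt == 0 and cnt == 1
        self._last_cnt = cnt
        if not rising:
            return
        # Shift the new bit in at the low end, so the first bit ends up as the MSB.
        self.shift = ((self.shift << 1) | (sp & 1)) & 0xFF
        self.count += 1
        if self.count == 8:
            self.sdr = self.shift
            self.irq = True
            self.count = 0


def send_byte(model, value):
    """Sender side (e.g. the PC): put each bit on SP, MSB first, then pulse CNT."""
    for bit in range(7, -1, -1):
        sp = (value >> bit) & 1
        model.step(0, sp)   # CNT low with the data bit set up on SP
        model.step(1, sp)   # rising CNT edge latches the bit


if __name__ == "__main__":
    cia = ShiftRegisterModel()
    send_byte(cia, 0xA5)
    print(hex(cia.sdr), cia.irq)
```

Each byte costs the sender sixteen pin writes in this model (two CNT levels per bit), which is the kind of overhead that matters if the sending port itself is slow.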
Now, as for making a CIA replacement chip, I'd use a cheap Lattice FPGA/CPLD part, some of which I think are already 5V tolerant, and add level converters where they are needed. I'd also be tempted to build the level converters for the IO lines from transistors rather than buffer chips, so that the full behaviour of the pins can be properly implemented. That does mean 2 transistors for every IO pin, but they are small and could be placed on both sides of the PCB.
Regards,
Paul.