It's been a while since I made an update, but I am making progress in fits and starts. I ran into some roadblocks with the pipelining when introducing exceptions and some of the other vagaries of a real design, and so went back and thought through some of my assumptions. It turns out that I had a major error in how I understood the Wishbone bus specification.
In short, I had struggled with how to handle latency in pipelined operation when combined with multiple masters and arbitration. If you allow bus preemption, it seems like you can lose data or have to replay requests, which doesn't make sense.
I ended up changing the design to eliminate preemption in the bus arbiters, and adjusting the logic to make sure that even the instruction fetch operation releases the bus every few cycles. This has simplified the bus flow a great deal. So I may still have Wishbone wrong, but it works for me.
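To make the no-preemption idea concrete, here is a rough behavioral model in Python rather than the actual HDL. It is only a sketch under my own assumptions: the names (`NonPreemptiveArbiter`, `simulate`, `max_burst`), the fixed-priority scheme, and the burst length of 4 are all made up for illustration, and a "request" stands in loosely for a master holding its cycle signal asserted. The point is that the arbiter only picks a new owner when the bus is idle, and the fetch master drops its request after a few cycles so others get a turn.

```python
class NonPreemptiveArbiter:
    """Behavioral model: a grant is held until the owner drops its request."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.owner = None  # index of the master that currently owns the bus

    def step(self, requests):
        """Advance one clock; requests[i] is True while master i wants the bus."""
        if self.owner is not None and not requests[self.owner]:
            self.owner = None  # the owner released the bus voluntarily
        if self.owner is None:
            # Fixed priority (lowest index wins), evaluated only when idle,
            # so a higher-priority master can never cut in mid-transfer.
            for i, req in enumerate(requests):
                if req:
                    self.owner = i
                    break
        return self.owner


def simulate(cycles, max_burst=4, data_start=2, data_len=2):
    """Master 0 = instruction fetch, master 1 = data access (both hypothetical)."""
    arb = NonPreemptiveArbiter(2)
    held = 0              # consecutive cycles the fetch master has owned the bus
    data_left = data_len  # data transfers still outstanding
    trace = []
    for t in range(cycles):
        fetch_req = held < max_burst  # fetch drops its request after max_burst cycles
        data_req = t >= data_start and data_left > 0
        owner = arb.step([fetch_req, data_req])
        held = held + 1 if owner == 0 else 0
        if owner == 1:
            data_left -= 1
        trace.append(owner)
    return trace


if __name__ == "__main__":
    print(simulate(10))
```

In this model the data master starts asking at cycle 2 but waits until the fetch master lets go at cycle 4, then keeps the bus for both of its transfers even though fetch (the higher-priority master) is requesting again. Nothing in flight ever has to be dropped or replayed, which is what made the pipelined case tractable for me.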
I've also moved to a Harvard architecture, which means that I can avoid (explicit) arbitration by using dual port memory. Once I have the kinks worked out and I feel that all of my memory access tests are behaving as expected, I can add arbitration for things like external memory and run regressions to validate.
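As a rough illustration of why dual-port memory sidesteps arbitration, here is a small Python model, again a sketch and not the real design. The class name `DualPortMemory`, the port roles, and the read-before-write behavior on the data port are my assumptions for the example; real block RAMs differ in how they handle a read and write to the same address in the same cycle.

```python
class DualPortMemory:
    """Toy model: one read-only instruction port, one read/write data port."""

    def __init__(self, words):
        self.mem = [0] * words

    def cycle(self, i_addr, d_addr, d_we=False, d_wdata=0):
        """One clock: both ports are serviced, neither waits on the other."""
        i_data = self.mem[i_addr]  # port A: instruction fetch (read-only)
        d_data = self.mem[d_addr]  # port B: data read (assumed read-before-write)
        if d_we:
            self.mem[d_addr] = d_wdata  # port B: the write lands at the end of the cycle
        return i_data, d_data


if __name__ == "__main__":
    ram = DualPortMemory(16)
    ram.cycle(i_addr=0, d_addr=5, d_we=True, d_wdata=42)  # fetch and store in the same cycle
    print(ram.cycle(i_addr=5, d_addr=5))
```

Because each port has its own address and data path, the fetch and the load/store never compete for a grant, so there is nothing to arbitrate until a shared resource like external memory comes into the picture.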