Square roots are somewhere between trivial and esoteric, between division and trigonometry. They are not used absolutely everywhere but are still fundamental. You need them for 3D, DSP, robotics, anywhere you deal with vectors. And there are several algorithms that are very interesting !
I remember being "somewhat focused" on integer square roots, from about 1994 to 1996 until other pet ideas took over. I saw some algorithms from naïve/crude to more established and the Newton method seemed nice. Except that it never converges fast enough and Newton-Raphson is too computationally intensive for small/weak processors. The critical aspect was now to find the best approximation with the fewest complexity and I didn't even have my Baccalauréat when I found a way, which I still haven't seen out there. The problem is that it requires a new opcode but this one would be very simple to compute.
It didn't give it a proper name that I remember, but today I would call this opcode COMB because it acts like a comb on binary data. It would keep one bit in each pair and dump the other. As an illustration, if you have 8 bits callled ABCDEFGH respectively, the result would be 0000ACEG. It drops one bit every 2 bits, and pads the MSB with 0.
Mathematically, it selects all the odd indices of bit, which are a way to rearrange the polynomial construction of the value. The algorithm would be :
COMB n to o
COMB (n>>1) to e
a = e + (o × 1.41... normalised to the binary representation)
Of course, if the opcode map is large enough, one could implement two opcode versions to select the odd and even bits to save one shift instruction.
The multiply with a constant is necessary to get enough precision for somewhat large values, though the mathematical definition is clearly still not fully respected (there are extra terms to compute), but I have tested this and seen that the approximation is "good enough" to speed up Newton to satisfaction.
A cruder version would use no multiply and provide a worse approximation. In fact I believe it was my first idea :
a = COMB ( n | n>>1 )
It's faster but lousy.
Anyway, COMBE and COMBO look like opcodes that can be very easily implemented in #F-CPU for example.