I decided to build an FP format with an 8-bit exponent and a 9-bit mantissa (and with no NaNs, infinities, denorms or rounding). The
sum of the bit lengths (plus one sign bit) means that the FP number
fits into an 18-bit M4K block word on the Cyclone II FPGA. The 9-bit
mantissa means that only one hardware multiplier (out of 70) is used for
the floating multiplier. The exponent is represented in 8-bit offset
binary form. For example, 2^{0}=128, 2^{2}=130, and 2^{-2}=126.
The mantissa is represented as a 9-bit fraction with a range of [0 to 1.0-2^{-9}].
A sign bit of zero implies positive. The Verilog representation is {sign,exp[7:0],mantissa[8:0]}.
Denormalized numbers are not permitted, so the high-order bit (binary
value 0.5) is always one, unless the value of the FP number underflows,
in which case it is zero. No error detection is performed and there is no
rounding. There are no NaNs, infinities, denorms, or other special cases
(which make little sense in a real-time system anyway). Some example
representations are shown below.
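To make the bit layout concrete, here is a minimal Python sketch that packs and unpacks the 18-bit word and converts it to an ordinary number. It is a software model only, not the Verilog; the helper names `pack`, `unpack` and `to_real` are hypothetical and chosen for illustration.

```python
BIAS = 128  # offset-binary exponent: 2^0 is stored as 128

def pack(sign, exp, mant):
    """Assemble {sign, exp[7:0], mantissa[8:0]} into one 18-bit word."""
    return ((sign & 1) << 17) | ((exp & 0xFF) << 9) | (mant & 0x1FF)

def unpack(word):
    """Split an 18-bit word into (sign, exp, mantissa)."""
    return (word >> 17) & 1, (word >> 9) & 0xFF, word & 0x1FF

def to_real(word):
    """Numeric value of a word; a cleared mantissa MSB is read as zero."""
    sign, exp, mant = unpack(word)
    if not (mant & 0x100):
        return 0.0
    value = (mant / 512.0) * 2.0 ** (exp - BIAS)  # mantissa is a fraction in [0.5, 1)
    return -value if sign else value

# 1.0 is stored as mantissa 0.5 (0x100) with exponent 129 (0.5 * 2^1).
print(hex(pack(0, 129, 0x100)), to_real(pack(0, 129, 0x100)))
```

With this layout the value 1.0 has mantissa 0x100 and exponent 129, because the mantissa is a pure fraction whose top bit is worth 0.5.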
Five operations are necessary for floating DSP. They are add,
negate, multiply, integer-to-float, and float-to-integer. Negate is
easy: just toggle the sign bit. The integer conversion algorithms are
necessary because the audio and video codecs are integer-based. Outlines
for the functions are below. Finally, the modules were tested by
building IIR filters. The SOS filters shown below validated the
performance of the floating point. An article written for Circuit Cellar Magazine describing this floating point is available.
Multiply algorithm:
1. If either input number has a high-order bit of zero, then that input is zero and the product is zero.
2. The output exponent is exp1+exp2-128 or exp1+exp2-129. If the sum of the input exponents is less than 129 then the exponent will underflow and the product is zero.
3. If both inputs are nonzero and the exponents don't underflow:
   a. If (mantissa1)x(mantissa2) has the high-order bit set, the top 9 bits of the product are the output mantissa and the output exponent is exp1+exp2-128.
   b. Otherwise the second bit of the product will be set, and the output mantissa is the top 9 bits of (product)<<1 and the output exponent is exp1+exp2-129.
   c. The sign of the product is (sign1)xor(sign2).
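The multiply outline can be sketched as a bit-level Python model. This is an illustration of the steps as described, not the Verilog module itself; the name `fpmult_model` is hypothetical, and exponent overflow simply wraps because no error detection is described.

```python
def fpmult_model(a, b):
    """Software model of the multiply outline; a and b are 18-bit words."""
    s1, e1, m1 = (a >> 17) & 1, (a >> 9) & 0xFF, a & 0x1FF
    s2, e2, m2 = (b >> 17) & 1, (b >> 9) & 0xFF, b & 0x1FF
    # A cleared high-order mantissa bit marks a zero input.
    if not (m1 & 0x100) or not (m2 & 0x100):
        return 0
    # Exponent sums below 129 underflow, so the product is zero.
    if e1 + e2 < 129:
        return 0
    product = m1 * m2                # 18-bit product of two 9-bit mantissas
    if product & (1 << 17):          # high-order bit set: take the top 9 bits
        mant = product >> 9
        exp = e1 + e2 - 128
    else:                            # second bit set: top 9 bits of product<<1
        mant = ((product << 1) >> 9) & 0x1FF
        exp = e1 + e2 - 129
    sign = s1 ^ s2
    return (sign << 17) | ((exp & 0xFF) << 9) | mant

# 1.5 is 0x10380 and -2.0 is 0x30500 in this format; the product is -3.0.
print(hex(fpmult_model(0x10380, 0x30500)))
```

Because both mantissas lie in [0.5, 1), their product lies in [0.25, 1), which is why at most one left shift is ever needed to renormalize.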
Add algorithm:
1. If both inputs are zero, the sum is zero.
2. Determine which input is bigger and which smaller (in absolute value) by first comparing the exponents, then the mantissas if necessary.
3. Determine the difference in the exponents and shift the smaller input mantissa right by the difference. But if the exponent difference is greater than 8 then just output the bigger input.
4. If the signs of the inputs are the same, add the bigger and (shifted) smaller mantissas. The result must be 0.5<=sum<2.0. If the result is one or greater, shift the mantissa sum right one bit and increment the exponent. The sign is the sign of either input.
5. If the signs of the inputs are different, subtract the (shifted) smaller mantissa from the bigger so that the result is always positive. The result must be 0.0<=difference<1.0. If it is below 0.5, shift the mantissa left until the high bit is set, while decrementing the exponent. The sign is the sign of the bigger input.
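A bit-level Python model of the add outline might look like the sketch below. It is not the Verilog module; the name `fpadd_model` is hypothetical. Two behaviors are assumptions that the outline does not state: adding zero to a nonzero input returns that input unchanged, and exact cancellation returns zero.

```python
def fpadd_model(a, b):
    """Software model of the add outline; a and b are 18-bit words."""
    def fields(w):
        return (w >> 17) & 1, (w >> 9) & 0xFF, w & 0x1FF

    def pack(sign, exp, mant):
        return (sign << 17) | ((exp & 0xFF) << 9) | (mant & 0x1FF)

    big, small = fields(a), fields(b)
    big_zero = not (big[2] & 0x100)
    small_zero = not (small[2] & 0x100)
    if big_zero and small_zero:
        return 0
    if big_zero:                      # assumption: 0 + x returns x unchanged
        return b
    if small_zero:
        return a
    # Order by magnitude: compare exponents first, then mantissas.
    if (big[1], big[2]) < (small[1], small[2]):
        big, small = small, big
    (s_big, e_big, m_big), (s_small, e_small, m_small) = big, small
    diff = e_big - e_small
    if diff > 8:                      # smaller input is shifted out entirely
        return pack(s_big, e_big, m_big)
    m_small >>= diff                  # align the smaller mantissa
    exp = e_big
    if s_big == s_small:
        total = m_big + m_small
        if total & 0x200:             # sum is 1.0 or more: shift right once
            total >>= 1
            exp += 1
    else:
        total = m_big - m_small
        if total == 0:                # assumption: exact cancellation gives zero
            return 0
        while not (total & 0x100):    # renormalize until the high bit is set
            total <<= 1
            exp -= 1
    return pack(s_big, exp, total)

# 4.0 (0x10700) + 1.0 (0x10300) gives 5.0, i.e. mantissa 0.625 with exponent 131.
print(hex(fpadd_model(0x10700, 0x10300)))
```

The right shift in the alignment step discards low-order bits without rounding, consistent with the statement that the format performs no rounding.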
The multiplier takes about 60 logic elements plus one hardware
multiplier on the Cyclone II FPGA, while the adder takes about 220 logic
elements. The timing analyser suggests that the purely combinatorial
multiplier should be able to run at 50 MHz and the adder at 30 MHz or
so.
The integer-to-FP and FP-to-integer conversion routines allow you to specify a signed scale. Going from integer to float, the resulting floating point number is (integer_input)*2^{scale_input}. This feature allows you to convert numbers less than one. Going from float back to integer, you choose the scale you want to bring the floating point number back into a small integer range. The signed integer inputs and outputs are 10-bit, 2's complement format.
Integer to FP:
I assumed 10-bit, 2’s complement integers since the mantissa is only 9 bits, but the process generalizes to more bits.
1. Save the sign bit of the input and take the absolute value of the input.
2. Shift the input left until the high-order bit is set and count the number of shifts required. This forms the floating mantissa.
3. Form the floating exponent by subtracting the number of shifts from step 2 from the constant 137, or (0x89-(# of shifts)).
4. Assemble the float from the sign, mantissa, and exponent.
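The conversion steps can be modeled in a few lines of Python. This is a sketch, not the Verilog: the name `int2fp_model` is hypothetical, the scale is applied by adding it to the exponent (which follows from the result being integer_input*2^scale), and returning zero for a zero input is an assumption since the outline does not cover that case.

```python
def int2fp_model(value, scale=0):
    """10-bit two's-complement integer -> 18-bit float word, times 2**scale."""
    sign = 1 if value < 0 else 0
    mag = abs(value) & 0x1FF      # 9-bit magnitude; -512 does not fit and comes out as 0 here
    if mag == 0:                  # assumption: zero input gives the all-zero word
        return 0
    shifts = 0
    while not (mag & 0x100):      # shift left until the high-order bit is set
        mag <<= 1
        shifts += 1
    exp = 137 - shifts + scale    # 0x89 minus the shift count, plus the scale
    return (sign << 17) | ((exp & 0xFF) << 9) | mag

# Integer 1 needs 8 shifts, so the exponent is 137-8 = 129 and the mantissa is 0x100.
print(hex(int2fp_model(1)))
```

The constant 137 is 128 (the exponent offset) plus 9 (the mantissa width), since a 9-bit integer with its top bit set is a fraction in [0.5, 1) times 2^9.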
FP to integer:
Converting back to integer is similarly simple, but no overflow is detected, so scale carefully.
1. If the float exponent is less than 0x81, then the output is zero because the input is less than one.
2. Otherwise shift the floating mantissa to the right by (0x89-(floating exponent)) to form the absolute value of the output integer.
3. Form the 2’s complement signed integer.
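A matching Python model of the float-to-integer steps is sketched below. The name `fp2int_model` is hypothetical. Two points are assumptions rather than statements from the outline: the scale is added to the exponent before conversion, and the model raises an error on overflow, whereas the hardware performs no such check.

```python
def fp2int_model(word, scale=0):
    """18-bit float word -> signed integer, after multiplying by 2**scale."""
    sign, exp, mant = (word >> 17) & 1, (word >> 9) & 0xFF, word & 0x1FF
    exp += scale                       # assumption: scale adjusts the exponent first
    if not (mant & 0x100) or exp < 0x81:
        return 0                       # zero input, or magnitude below one
    shift = 0x89 - exp
    if shift < 0:
        # The outline detects no overflow; this model simply refuses such inputs.
        raise OverflowError("value too large for a 10-bit signed integer")
    mag = mant >> shift                # truncates any fractional part
    return -mag if sign else mag       # use (result & 0x3FF) for the 10-bit pattern

# 0x10520 encodes 2.25, which truncates to 2.
print(fp2int_model(0x10520))
```

The right shift drops the fractional bits, so the conversion truncates toward zero rather than rounding.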
Testing the FP routines using IIR filtering by second-order sections (SOS)
SOS filters have the advantage (over straight multipole filters) of
smaller dynamic range on coefficients, so the numerical stability is
better. SOS filters are also more straightforward to do with floating
point.
The downside is a few more state variables and a few more multiplies for each filter. A MATLAB program and function convert filter specifications to Verilog with 18-bit floating point. The top-level module defines filters of order 2, 4 and 6. The project is archived here.
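For readers unfamiliar with the structure, here is a generic Python sketch of a cascade of second-order sections. It uses ordinary Python floats and the transposed direct form II update, which is an assumption: the source does not say which section structure the Verilog uses. The name `sos_filter` and the coefficients in the example are hypothetical.

```python
def sos_filter(sections, samples):
    """Run samples through cascaded second-order sections.

    Each section is (b0, b1, b2, a1, a2) with a0 normalized to 1.
    """
    state = [[0.0, 0.0] for _ in sections]   # two state variables per section
    out = []
    for x in samples:
        for (b0, b1, b2, a1, a2), s in zip(sections, state):
            y = b0 * x + s[0]
            s[0] = b1 * x - a1 * y + s[1]
            s[1] = b2 * x - a2 * y
            x = y                             # this section's output feeds the next
        out.append(x)
    return out

# Impulse response of a single section with one pole at 0.5.
print(sos_filter([(1.0, 0.0, 0.0, -0.5, 0.0)], [1.0, 0.0, 0.0]))
```

A filter of order 2N is built from N such sections, so each section's coefficients stay close to those of a simple second-order filter regardless of the overall order.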
Testing the FP routines using IIR filtering
The fpmult, fpadd, int2fp
and fp2int
routines were incorporated into the state machine filters described on the FPGA DSP
page, example 4. The routines worked, implying that the logic is
correct; however, a 9-bit mantissa is apparently not accurate enough to
implement high-order or narrow-bandwidth filters. Second order filters
work fine, but 4th and 6th order filters became inaccurate when the
filter bandwidth was low. Use the SOS versions above for most actual
filters. A MATLAB program (and associated function) was used to convert MATLAB-designed filter coefficients to floating point format. The top-level module defines three filters and connects them to the audio in/out. The entire project is zipped here.
FP reciprocal
The ability to take a reciprocal allows division to occur. Reciprocal
was implemented as Newton-Raphson iteration on an initial linear estimate
of the reciprocal. This design
just tests for static correctness of the method by displaying values on
the LEDs. The process is to take the input number, strip off the sign
and exponent, compute the reciprocal of the remaining number between 0.5
and 1.0, form the new exponent as 0x81+(0x81-input_exponent),
then merge together the input sign, new exponent and new mantissa from a
Newton iteration process. The module will run at 14 MHz and uses 3
floating point adders and 4 floating point multipliers.
The algorithm (from http://en.wikipedia.org/wiki/Division_%28digital%29) is as follows, with all operations being floating arithmetic:

1. Form in_reduced={1'b0, 8'h80, m1} where m1 is the mantissa of the normalized input float. This operation (with the sign set to + and the exponent set to 0x80) limits the range to 0.5 to 1.0.
2. x0 = 2.9142 - 2*in_reduced (0.5<=in_reduced<=1.0)
3. x1 = x0*(2-in_reduced*x0)
4. x2 = x1*(2-in_reduced*x1)
5. The reciprocal output mantissa is the mantissa of x2.
6. The reciprocal output exponent =(m1==9'b100000000)? 9'h102-e1 : 9'h101-e1 because an input value of exactly a power of 2 adds one to the exponent. e1 is the exponent of the normalized input float.
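The iteration and the exponent rule can be checked numerically with a short Python sketch. It uses ordinary double-precision floats rather than the 18-bit format, so it shows how quickly the iteration converges, not the accuracy of the hardware. The names `reciprocal_reduced` and `reciprocal_exponent` are hypothetical.

```python
def reciprocal_reduced(d):
    """Two Newton-Raphson steps toward 1/d, for 0.5 <= d <= 1.0."""
    x0 = 2.9142 - 2.0 * d          # initial linear estimate
    x1 = x0 * (2.0 - d * x0)       # each step roughly squares the relative error
    x2 = x1 * (2.0 - d * x1)
    return x2

def reciprocal_exponent(e1, m1):
    """Biased output exponent from the rule in the last step above."""
    return (0x102 - e1 if m1 == 0x100 else 0x101 - e1) & 0xFF

# Input 1.5 has e1 = 129 and m1 = 0x180; 1/1.5 is about 0.667, exponent 128.
print(reciprocal_reduced(0.75), reciprocal_exponent(129, 0x180))
```

The linear estimate is within about 9% of the true reciprocal across the reduced range, so two iterations bring the relative error to well under one part in 10,000, which is finer than the 9-bit mantissa can represent.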
FP reciprocal Square Root
A reciprocal square root function is useful when normalizing vectors (e.g.
computer graphics) and can be converted to a square root with just one
more multiply.
This design
just tests for static correctness of the method by displaying values on
the LEDs. The process is to take the input number, strip off the sign
and exponent, compute the reciprocal square root of the remaining number
between 0.25 and 1.0, form the new exponent, then merge together the
new exponent and new mantissa from a Newton iteration process.
The module will run at 11 MHz and uses 3 floating point adders and 6
floating point multipliers.
The algorithm (from http://en.wikipedia.org/wiki/Methods_of_computing_square_roots) is as follows, with all operations being floating arithmetic. The x0
estimate is based on a linear approximation which I thought up:
1. Form input_exp = (e1[0]==1)? 8'h7f : 8'h80 and reduced_input = {1'b0, input_exp, m1} where m1 is the mantissa of the normalized input float. This operation limits the range to 0.25 to 1.0.
2. x0 = 2.05 - reduced_input (0.25<=reduced_input<=1.0)
3. x1 = x0/2 * (3 - reduced_input*x0*x0)
4. x2 = x1/2 * (3 - reduced_input*x1*x1)
5. The reciprocal sqrt output mantissa is the mantissa of x2.
6. The reciprocal sqrt output exponent eout is given by eout = (m1==9'b100000000 && e1[0]==1)? 9'h82 + ((9'h80 - e1)>>1) : 9'h81 + ((9'h80 - e1)>>1) because an input value of exactly a power of 2 with odd exponent adds one to the exponent. e1 is the exponent of the normalized input float.
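As with the reciprocal, a short Python sketch can check the iteration and the exponent rule. It uses double-precision floats instead of the 18-bit format, so it illustrates convergence only. The names `rsqrt_reduced` and `rsqrt_exponent` are hypothetical, and Python's `>>` on a negative integer rounds toward minus infinity, which is assumed here to be the intended behavior of the shift.

```python
def rsqrt_reduced(r):
    """Two Newton steps toward 1/sqrt(r), for 0.25 <= r <= 1.0."""
    x0 = 2.05 - r                          # initial linear estimate
    x1 = x0 / 2 * (3 - r * x0 * x0)
    x2 = x1 / 2 * (3 - r * x1 * x1)
    return x2

def rsqrt_exponent(e1, m1):
    """Biased output exponent from the rule in the last step above."""
    half = (0x80 - e1) >> 1                # floors for negative differences
    base = 0x82 if (m1 == 0x100 and (e1 & 1)) else 0x81
    return (base + half) & 0xFF

# Input 4.0 has e1 = 131 and m1 = 0x100; 1/sqrt(4) = 0.5, stored with exponent 128.
print(rsqrt_reduced(0.25), rsqrt_exponent(131, 0x100))
```

Halving the exponent only works when the unbiased exponent left over after range reduction is even, which is why the reduced input uses exponent 0x7f for odd e1 and 0x80 for even e1.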