DSP4FPGAs using Xilinx ISE web pack
Converting designs from Altera Quartus or Max+Plus II to Xilinx ISE seems easy if we use standard HDL. Unfortunately, there are a couple of issues that need to be addressed. We assume that the ModelTech simulation environment and the web version of ISE (i.e., no core generation) are used. The first items address ISE/ModelTech design entry; we then look at other beneficial features of the ISE software and at the compilation results of all 30+ files for the Xilinx Virtex2 XC2V250-6cs144 device.
1) The Xilinx simulation with timing (“Post-Place & Route”) uses a bit-wise simulation model at the LUT level. Back annotation is only done for the I/O ports, which are ALL of type std_logic or std_logic_vector. To match the behavioral simulation and the simulation with timing we therefore need to:
Use only the std_logic or std_logic_vector data types for I/O. As a consequence, no integer, generic, or custom I/O data types (e.g., subtype byte, see CORDIC.VHD) can be used!
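A minimal sketch of this port style follows. The entity name add_slv, the 8-bit width, and the use of the ieee.numeric_std package are assumptions for illustration and are not taken from the book's design files. The ports are std_logic_vector only, and the conversion to an arithmetic type happens inside the architecture.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY add_slv IS                        -- hypothetical example entity
  PORT (clk   : IN  STD_LOGIC;
        x_in  : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
        y_out : OUT STD_LOGIC_VECTOR(7 DOWNTO 0));
END add_slv;

ARCHITECTURE fpga OF add_slv IS
  SIGNAL x : SIGNED(7 DOWNTO 0);
  SIGNAL y : SIGNED(7 DOWNTO 0) := (OTHERS => '0');
BEGIN
  x <= SIGNED(x_in);                     -- convert at the port boundary

  PROCESS (clk)
  BEGIN
    IF RISING_EDGE(clk) THEN
      y <= x + 1;                        -- internal arithmetic on SIGNED
    END IF;
  END PROCESS;

  y_out <= STD_LOGIC_VECTOR(y);          -- convert back for the output port
END fpga;
```

Because the entity interface contains no integers or generics, the same testbench should be able to drive both the behavioral model and the back-annotated timing model.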
2) The ISE software supports the development of testbenches with the “Test Bench Waveform.” Use “New Source” under the Project menu. This waveform gives you a quick way to generate a testbench that ModelTech can use for both behavioral simulation and simulation with timing. The testbencher has benefits and drawbacks. For instance, you cannot assign negative integers in the waveforms; you need to build the two’s complement, i.e., the equivalent unsigned number, by hand.
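As a worked example of the hand conversion: on an 8-bit bus, -3 is entered as 256 - 3 = 253, i.e., the bit pattern 11111101. The small simulation-only sketch below checks this value. The entity name twos_demo and the use of ieee.numeric_std are assumptions for illustration.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY twos_demo IS                      -- hypothetical, simulation only
END twos_demo;

ARCHITECTURE sim OF twos_demo IS
  -- -3 as an 8-bit two's complement pattern: 256 - 3 = 253 = "11111101"
  CONSTANT NEG3 : STD_LOGIC_VECTOR(7 DOWNTO 0)
                := STD_LOGIC_VECTOR(TO_SIGNED(-3, 8));
BEGIN
  -- Concurrent assertion: reports an error if the unsigned reading differs
  ASSERT TO_INTEGER(UNSIGNED(NEG3)) = 253
    REPORT "unexpected two's complement pattern" SEVERITY ERROR;
END sim;
```

In the waveform editor you would then enter 253 (or hex FD) wherever the stimulus should be -3.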
3) If your design has feedback, you need to initialize the registers to zero in your HDL code. You cannot do this with the testbencher. For instance, ModelTech initializes all integer signals to the smallest value of their range (i.e., -128 for an 8-bit number). If you add two such integers, the result is -128-128=-256<-128, and ModelTech stops and reports an overflow. Some designs, e.g., cordic, cic3r32, and cic3s32, only work correctly in behavioral simulation if all integers are changed to the std_logic_vector data type. Changing the I/O ports alone and using a conversion function does not always guarantee correct simulation results.
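The initialization can be sketched as follows for a simple accumulator, which is the smallest design with feedback. The entity name acc_init and the use of ieee.numeric_std are assumptions; whether a given tool version carries the initial value into the hardware should be checked in the synthesis report.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY acc_init IS                       -- hypothetical example entity
  PORT (clk   : IN  STD_LOGIC;
        x_in  : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
        y_out : OUT STD_LOGIC_VECTOR(7 DOWNTO 0));
END acc_init;

ARCHITECTURE fpga OF acc_init IS
  -- Explicit zero start value for the feedback register. An
  -- "INTEGER RANGE -128 TO 127" signal without ":= 0" would start
  -- at -128 in simulation, the leftmost value of its range.
  SIGNAL acc : SIGNED(7 DOWNTO 0) := (OTHERS => '0');
BEGIN
  PROCESS (clk)
  BEGIN
    IF RISING_EDGE(clk) THEN
      acc <= acc + SIGNED(x_in);         -- feedback path, wraps modulo 256
    END IF;
  END PROCESS;

  y_out <= STD_LOGIC_VECTOR(acc);
END fpga;
```

Using SIGNED instead of a constrained integer also avoids the range-overflow stop in the simulator, since the vector addition simply wraps around.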
4) The simulation with timing usually needs one clock cycle more than the behavioral code until all logic is “settled.” Your input stimuli should therefore be zero in the first clock cycle. If you want behavioral and timing simulation to match and the design uses a (small) FSM for control, you need to add a synchronous or asynchronous reset. You need to do this for the following designs:
dafsm, dadrom, dasign, db4latti, db4poly, div_aegp, div_res, iir_par, mul_ser, rader7.
Just add a reset control to the FSM part of these designs, like this:
-- IF RISING_EDGE(clk) THEN      -- synchronous reset
--   IF reset = '1' THEN
--     state <= s0;
--   ELSE
IF reset = '1' THEN              -- asynchronous reset
  state <= s0;
ELSIF RISING_EDGE(clk) THEN
  CASE state IS
    WHEN s0 =>                   -- Initialization step
Although at first glance this synchronous or asynchronous control seems to be “cheap” because the FSM is small, we need to keep in mind that while the reset is active, all signals that are assigned in the s0 state of the FSM must be held at their initial-state values. The following table shows the synthesis results for the three different reset styles for the design file dafsm.vhd (a small distributed arithmetic state machine):
| Reset style | Performance / ns | 4-input LUT | Gates |
|---|---|---|---|
| No (Maxplus code) | 3.542 | 20 | 339 |
| synchronous | 3.287 | 29 | 393 |
| asynchronous | 3.554 | 29 | 393 |
The designs with reset usually have a higher LUT and gate count. Depending on the design, a synchronous or asynchronous reset can have a (small) influence on the performance of the design.
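A complete, minimal version of the asynchronous variant might look like the sketch below. The entity name fsm_reset, the two-state machine, and the accumulator are hypothetical and much simpler than dafsm.vhd; ieee.numeric_std is assumed. Note that the reset branch assigns the same value to acc that state s0 assigns, which is where the extra LUTs come from.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY fsm_reset IS                      -- hypothetical example entity
  PORT (clk, reset : IN  STD_LOGIC;
        x_in       : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
        y_out      : OUT STD_LOGIC_VECTOR(7 DOWNTO 0));
END fsm_reset;

ARCHITECTURE fpga OF fsm_reset IS
  TYPE STATE_TYPE IS (s0, s1);
  SIGNAL state : STATE_TYPE;
  SIGNAL acc   : SIGNED(7 DOWNTO 0);
BEGIN
  PROCESS (clk, reset)
  BEGIN
    IF reset = '1' THEN                  -- asynchronous reset
      state <= s0;
      acc   <= (OTHERS => '0');          -- same value that s0 assigns
    ELSIF RISING_EDGE(clk) THEN
      CASE state IS
        WHEN s0 =>                       -- initialization step
          acc   <= (OTHERS => '0');
          state <= s1;
        WHEN s1 =>                       -- processing step
          acc   <= acc + SIGNED(x_in);
          state <= s1;
      END CASE;
    END IF;
  END PROCESS;

  y_out <= STD_LOGIC_VECTOR(acc);
END fpga;
```

For the synchronous style, the reset test moves inside the RISING_EDGE(clk) branch and reset is dropped from the sensitivity list.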
5) Back annotation is only done for I/O ports. If we want to monitor internal nets, we can try to find the appropriate net name in the *_timesim.vhd file, but that is quite complicated and the name may change in another compiler run. A better idea is to introduce additional test outputs (see, for instance, f1_out and f2_out in fir_lms.vhd). In the behavioral (but not in the timing) simulation, internal test signals and variables can be monitored. Modify the *.udo file and add, for instance for the fir_srg_tb.vhd testbench, the line “add wave /fir_srg_tb/uut/tap”.
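The test-output idea can be sketched as follows. The entity name dly_probe and the port name t_out are hypothetical; the pattern is the same as the f1_out and f2_out outputs mentioned above, with an internal signal copied to an extra port so that it survives back annotation.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY dly_probe IS                      -- hypothetical example entity
  PORT (clk   : IN  STD_LOGIC;
        x_in  : IN  STD_LOGIC_VECTOR(7 DOWNTO 0);
        y_out : OUT STD_LOGIC_VECTOR(7 DOWNTO 0);
        t_out : OUT STD_LOGIC_VECTOR(7 DOWNTO 0));  -- extra test output
END dly_probe;

ARCHITECTURE fpga OF dly_probe IS
  SIGNAL tap1, tap2 : STD_LOGIC_VECTOR(7 DOWNTO 0) := (OTHERS => '0');
BEGIN
  PROCESS (clk)
  BEGIN
    IF RISING_EDGE(clk) THEN
      tap1 <= x_in;                      -- first delay stage (internal)
      tap2 <= tap1;                      -- second delay stage
    END IF;
  END PROCESS;

  y_out <= tap2;                         -- regular design output
  t_out <= tap1;                         -- internal net made visible at a port
END fpga;
```

The extra port costs I/O pins and can affect placement, so it is usually removed again once the timing simulation matches.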
6) There are a couple of nice features in the Xilinx ISE package too: there is no need for special “lpm” blocks to use the internal resources for multipliers, shifters, RAMs, or ROMs.
a) ISE converts a shift register into a single CLB-based shift register. This can save you quite some resources.
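A plain HDL shift register of the kind the synthesis tool can pick up is sketched below. The entity name srg16 and the length of 16 are assumptions; on Virtex-II the target would typically be the LUT-based SRL16 primitive. The sketch deliberately has no reset and only a serial output, since to my understanding a reset or parallel access to all stages generally prevents this mapping.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY srg16 IS                          -- hypothetical example entity
  PORT (clk   : IN  STD_LOGIC;
        d_in  : IN  STD_LOGIC;
        d_out : OUT STD_LOGIC);
END srg16;

ARCHITECTURE fpga OF srg16 IS
  SIGNAL taps : STD_LOGIC_VECTOR(15 DOWNTO 0);
BEGIN
  PROCESS (clk)
  BEGIN
    IF RISING_EDGE(clk) THEN
      taps <= taps(14 DOWNTO 0) & d_in;  -- shift left by one, new bit at LSB
    END IF;
  END PROCESS;

  d_out <= taps(15);                     -- serial output after 16 clocks
END fpga;
```

The synthesis report lists inferred shift registers, which is the easiest place to confirm that the mapping actually happened.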
b) Multipliers can be implemented with LUTs only, with block multipliers (if available), or even as pipelined LUT multipliers, which relies on pipeline retiming! Just right-click on the “Synthesis – XST” menu and select HDL Options under Properties; the last entry is the multiplier style. But note that for a pipelined LUT design the additional registers must be placed at the OUTPUT of the multiplier. Pipeline retiming is not done if the additional registers are at the inputs. You need about log2(B) additional registers to get good timing (see Sec. 2.4 of the DSP with FPGAs book on pipelined multipliers).
This has an impact on the registered performance, LUT usage, and gate count, as the following table shows for the fir_gen.vhd example, i.e., a length-4 programmable FIR filter (from Ch. 3):
| Synthesis style | Performance / ns | 4-input LUT | 18x18 bit mul. blocks | Gates |
|---|---|---|---|---|
| Block multiplier | 9.838 | 57 | 4 | 17552 |
| LUT (no pipeline) | 15.341 | 433 | 0 | 6114 |
| LUT (3-stage pipeline) | 6.762 | 448 | 0 | 9748 |
For this multiplier size (9x9 bit) the pipelined LUT style seems attractive, in terms of both speed and gate count. If the number of LUTs is limited, the block multiplier provides the next best alternative.
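The register placement described in b) can be sketched as follows for a 9x9-bit signed multiplier. The entity name mul_pipe and the use of ieee.numeric_std are assumptions, and three output registers are chosen by reading B in log2(B) as the operand width (log2(9) is roughly 3), which is also an assumption. All registers follow the product, so the retiming step is free to move them into the multiplier.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY mul_pipe IS                       -- hypothetical example entity
  PORT (clk        : IN  STD_LOGIC;
        a_in, b_in : IN  STD_LOGIC_VECTOR(8 DOWNTO 0);
        p_out      : OUT STD_LOGIC_VECTOR(17 DOWNTO 0));
END mul_pipe;

ARCHITECTURE fpga OF mul_pipe IS
  SIGNAL p1, p2, p3 : SIGNED(17 DOWNTO 0);
BEGIN
  PROCESS (clk)
  BEGIN
    IF RISING_EDGE(clk) THEN
      p1 <= SIGNED(a_in) * SIGNED(b_in); -- product into first output register
      p2 <= p1;                          -- second output register
      p3 <= p2;                          -- third output register
    END IF;
  END PROCESS;

  p_out <= STD_LOGIC_VECTOR(p3);         -- result appears after 3 clocks
END fpga;
```

With the multiplier style set to pipelined LUT, the timing report should show whether the three stages were actually distributed through the multiplier logic.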
c) If you follow the recommended style, the Xilinx synthesis tool (see the XST manual and the ISE help topic “Inferring BlockRAM in VHDL”) maps your HDL code to block RAM (see fun_text.vhd). If the table is small, the ISE “Auto” option selects a LUT-based implementation for the ROM table (see darom.vhd). You can also initialize the table in the HDL code and use it as a ROM. Please see the XST manual, Chapter 3, “FPGA Optimization,” for details on ROM implementation.
There are some limitations that apply to the initialization of BlockRAMs (see the XST manual, Ch. 2).
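A table initialized in the HDL code can be sketched as a constant array with a registered read. The entity name rom16x8, the table size, and the table contents are hypothetical, and ieee.numeric_std is assumed; whether the tool picks block RAM or LUTs for a table this small depends on the synthesis options and should be checked in the report.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.numeric_std.ALL;

ENTITY rom16x8 IS                        -- hypothetical example entity
  PORT (clk  : IN  STD_LOGIC;
        addr : IN  STD_LOGIC_VECTOR(3 DOWNTO 0);
        data : OUT STD_LOGIC_VECTOR(7 DOWNTO 0));
END rom16x8;

ARCHITECTURE fpga OF rom16x8 IS
  TYPE ROM_TYPE IS ARRAY (0 TO 15) OF STD_LOGIC_VECTOR(7 DOWNTO 0);
  -- Table contents are placeholders for illustration only
  CONSTANT ROM : ROM_TYPE :=
    (X"00", X"01", X"03", X"07", X"0F", X"1F", X"3F", X"7F",
     X"FF", X"FE", X"FC", X"F8", X"F0", X"E0", X"C0", X"80");
BEGIN
  PROCESS (clk)
  BEGIN
    IF RISING_EDGE(clk) THEN
      data <= ROM(TO_INTEGER(UNSIGNED(addr)));  -- synchronous read
    END IF;
  END PROCESS;
END fpga;
```

The registered read matters because the block RAMs are synchronous; a purely combinational table lookup can only be built from LUTs.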
============================================================================
The last question we would like to answer is: how good are the synthesis results with ISE and Virtex2 when compared with Max+Plus II or Quartus? To evaluate this question, the 30+ design examples were compiled with ISE.
The table below shows the Xilinx ISE Web Edition Ver. 6.2 synthesis results for the examples from the book DSP with FPGAs by Dr. Uwe Meyer-Baese, compared with the Max+Plus II 10.2 results. The overall comparison in terms of LCs and registered performance is given under the “Total” entry. The designs have been synthesized for maximum speed, with the usual standard synthesis “Auto” options.
Max+Plus II 10.2 versus Xilinx ISE WebPACK version 6.2 (VHDL, Device = Virtex2 XC2V250-6cs144)

| Design | Gates | 18x18 bit mul. | 4-input LUT | Gain % (LCs) | Performance / ns | Performance / MHz | Gain % (reg. performance) |
|---|---|---|---|---|---|---|---|
| add_1p | 710 | 0 | 24 | 8.33 | 4.77 | 209.78 | 231.45 |
| add_2p | 1300 | 0 | 41 | 41.46 | 4.85 | 206.02 | 225.51 |
| add_3p | 2391 | 0 | 82 | 28.05 | 4.84 | 206.44 | 238.59 |
| ammod | 3147 | 0 | 99 | 181.82 | 5.92 | 168.92 | 492.91 |
| bfproc | 13474 | 3 | 71 | 647.89 | 14.70 | 68.02 | 401.64 |
| ccmul | 12548 | 3 | 42 | 1073.8 | 0.00 | 0.00 | 0.00 |
| cic3r32 | 4538 | 0 | 168 | 138.69 | 5.89 | 169.92 | 324.81 |
| cic3s32 | 2870 | 0 | 114 | 108.77 | 5.28 | 189.39 | 324.27 |
| cordic | 2720 | 0 | 226 | 7.96 | 6.39 | 156.49 | 294.39 |
| dafsm | 309 | 0 | 19 | 94.74 | 3.55 | 281.69 | 401.50 |
| dapara | 368 | 0 | 21 | 85.71 | 8.30 | 120.48 | 278.40 |
| darom | 339 | 0 | 20 | 70.00 | 3.54 | 282.33 | 908.31 |
| dasign | 690 | 0 | 58 | 12.07 | 7.74 | 129.23 | 286.46 |
| db4latti | 7409 | 0 | 634 | -47.79 | 4.29 | 233.37 | 415.85 |
| db4poly | 2885 | 0 | 175 | 18.86 | 3.47 | 288.60 | 266.52 |
| div_aegp | 8629 | 2 | 52 | 801.92 | 7.91 | 126.37 | 766.17 |
| div_res | 1012 | 0 | 77 | 7.00 | 4.54 | 220.51 | 486.61 |
| example | 383 | 0 | 35 | -28.57 | 1.00 | 2.92 | -97.66 |
| fir6dlms | 17262 | 4 | 50 | 1216.00 | 10.52 | 95.05 | 306.01 |
| fir_gen | 17552 | 4 | 57 | 1464.91 | 9.84 | 101.65 | 143.99 |
| fir_lms | 16894 | 4 | 48 | 1175.00 | 16.03 | 62.39 | 593.27 |
| fir_srg | 1648 | 0 | 137 | -29.20 | 23.29 | 42.93 | 146.04 |
| fun_text | 66301 | 0 | 33 | -3.03 | 3.74 | 267.74 | 390.00 |
| iir | 1075 | 0 | 74 | -58.11 | 9.24 | 108.20 | 152.16 |
| iir_par | 307 | 0 | 268 | -19.78 | 5.14 | 194.51 | 520.66 |
| iir_pipe | 2539 | 0 | 190 | -66.32 | 9.96 | 100.42 | 101.85 |
| lfsr | 57 | 0 | 1 | 500.00 | 2.81 | 356.38 | 684.11 |
| lfsr6s3 | 69 | 0 | 3 | 100.00 | 1.74 | 575.04 | 1211.39 |
| mul_ser | 1044 | 0 | 87 | 32.18 | 6.73 | 148.50 | 260.88 |
| rader7 | 7441 | 0 | 443 | 9.71 | 10.39 | 96.24 | 317.70 |
| Total | | | | 252.1 | | | 369.13 |