Seamless signal processing block implementation using the cubed-c design environment

doi:10.15406/iratj.2017.02.00029

eISSN: 2574-8092

International Robotics & Automation Journal

Research Article Volume 2 Issue 4

Seamless signal processing block implementation using the cubed-c design environment

Michael Dossis

Verify Captcha

Regret for the inconvenience: we are taking measures to prevent fraudulent form submissions by extractors and page crawlers. Please type the correct Captcha word to see email ID.

Department of Informatics Engineering, TEI of Western Macedonia, Greece

Correspondence: Michael Dossis, Department of Informatics Engineering, TEI of Western Macedonia, Greece

Received: October 31, 2016 | Published: June 21, 2017

Citation: Dossis M. Seamless signal processing block implementation using the cubed-c design environment. Int Rob Auto J. 2017;2(4):147-152. DOI: 10.15406/iratj.2017.02.00029

Download PDF

Abstract

Design environments and automated CAD systems are proliferated nowadays with various preferences and restrictions in their work environments. One serious problem of automated high-level synthesis tools is their inability of at least difficulty to use for low, bit level functions such as signal processing blocks. Here the Cubed-C environment is used for the rapid implementation of a number of low level functions and blocks such as UARTs without difficulty. Cubed-C is a full-strength high-level synthesis CAD system; nevertheless, its structure and properties make it particularly suitable for this type of fine-grained applications. The experiments in this paper prove that the Cubed-C synthesis tools are particularly suitable for both complex and for low, bit level signal coding functions.

Keywords: high-level synthesis, rapid prototyping, low level signal processing, serial communications

Introduction

The proliferation and complexity of current integrated circuits (ICs) both as FPGA and as Application-Specific Integrated Circuits (ASICs) and System-on-Chips (SoCs) determine that a high proportion of the custom, specialized logic on chip is designed and verified with High-level Synthesis/Verification techniques. However, low-level functions such as bit-level processing or signal coding are implemented manually, destroying the advantage of having a unified design flow for all parts of the chip. The main contribution of this work is to demonstrate that our Cubed-C High-level Synthesis (HLS) tool and method is able to deal with both complex and low-level blocks with ease and integrate the verification of both parts in one formal step, such as the cycle-accurate simulators which are produced at the output of the tool. The tools is flexible and can be used to design loop-based signal coding response blocks for both NRZ and RTZ algorithms, as shown further down in this paper. All of this of course are simulated and shown both at the RTL and the cycle-accurate simulator accuracies. Next session discusses existing work for HLS. Session III outlines the Cubed-C tools and methodology. Session IV discusses signal-coding algorithms. Session V presents experimental results. The last session concludes upon this work and draws future work extensions.

Related work

Improved methodologies and tools started appearing as early as the late 90s and continue with enhanced input programming code sets as well as scheduling and other optimization algorithms. Furthermore, system level synthesis matured in the last decade by using more (application-wise) specialized and platform-oriented methodologies. The CoWare hardware-software co-design environment¹ employs a data model that allows the user to specify, simulate and produce heterogeneous implementations from heterogeneous specification source models. The specific synchronous dataflow (SDF) type of DSP applications is implemented into hardware using languages such as SILAGE,² DFL,³ and LUSTRE,^4,5 The advantage of this type of designs is that they can be scheduled at compile time and the execution of the compiled code can be two orders of magnitude faster than event-driven VHDL (e.g. RTL) simulations. In contrast to this, dynamic dataflow (DDF) algorithms consume and produce tokens that are data-dependent, and thus they allow for complex if-then-else and while loop control constructs. This is the largest part of real applications and it is dealt with by the Cubed-C synthesizer. CAD systems that allow for specifying both SDF and DDF algorithms and perform as much as possible static scheduling are the DSP-station from Mentor Graphics,³ PTOLEMY,⁶ GRAPE-II,⁷ COSSAP from Synopsys and SPW from the Alta group.⁸

C programs that include dynamic memory allocation, pointers and the functions malloc and free are mapped onto specific hardware.⁹ The SpC tool⁹ takes a C function with complex data structures and generates a Verlog model. The different techniques and optimizations described above have been implemented using the SUIF compiler environment.¹⁰ The memory model consists of distinct location sets, and it is used to map memory locations onto variables and arrays in Verlog. A heuristic for scheduling behavioral code with complex conditional control flow is discussed.¹¹ This heuristic is based on a specific intermediate design representation which apart from established techniques such as chaining and multi cycling, it enables more advanced techniques, such as conditional resource sharing and speculative execution, which are suitable for scheduling conditional behaviors. The developed tool can generate VHDL or C code from “Hierarchical Control and Data Flow Graphs”, but no reports about translating a standard programming language into HCDG are known so far. The synthesis approach in¹² utilizes a coordinated set of coarse-grain and fine-grain parallelizing transformations on the input design model. These transformations are executed in order to deliver synthesis results that don’t suffer from the negative effects of complex control constructs in the specification code. The synthesis techniques were implemented in the SPARK HLS tool, which transforms specifications in a small subset of C into RTL VHDL hardware models. A resource-constrained scheduler is used in SPARK and it is essentially a priority-based global list scheduling heuristic. Nevertheless, there are serious restrictions on the subset of the C language that SPARK accepts as input, and limitations such as inability to accept design hierarchy modules (e.g. subprograms) and of “while” type of loops.

Typical HLS tasks such as scheduling, resource allocation, module binding, module selection, register binding and clock selection are executed simultaneously in¹³ so as to achieve better optimization in design energy, power and area. The scheduling algorithm utilized in¹³ applies concurrent loop optimization and multi cycling and it is driven by resource constraints. The tool generates RTL Verlog implementations. The developed HLS system is targeted at control-intensive applications but it is also applicable to dataflow dominated designs. An incremental floor planner is discussed in¹⁴ which combine an incremental behavioral and physical optimization into HLS. These techniques were integrated into an existing interconnect-aware HLS tool called ISCALP.¹⁴ The average improvements of IFP-HLS over ISCALP, for implementations with unity aspect ratio functional units, are 12% in area, 7% in power consumption, 100% in reduction in the number of merge operations, and for some benchmarks the IFP-HLS CPU run time was 6 times less than that of the ISCALP method^15,16 introduces a synthesis methodology which is suitable for the design of distributed logic and memory architectures. Beginning with a behavioral description of the system in C, the methodology starts with behavioral profiling in order to extract simulation statistics of computations and references of array data. This allows the generation of footprints which contain the accessed array locations and the frequency of their occurrence. This synthesis approach is implemented into an industrial tool called Cyber.¹⁷

Communicating processes which are part of a system specification are implemented.¹⁸ In contrast to the conventional HLS approach which synthesizes each concurrent process of the system individually, the impact of the operation scheduling is considered globally in,¹⁸ in the system critical path (as opposed to the individual process critical path). The authors in¹⁸ claim that their methodology allocates the resources where they are mostly needed in the system, which is in the critical paths, and in this way it improves the overall multi-process designed system performance. In Gal¹⁹ memory access management is integrated within a HLS design flow. It mainly targets digital signal processing (DSP) applications but also more general streaming systems can be included along with specific performance constraints. Mutually exclusive scheduling methods^19–21 are implemented with the “Extended Data-flow Graph”. This is achieved because EDFG allows for data and conditional semantics to be handled in the same way, and thus the exploitation of potential design parallelism can be maximized. The processed graph is then given to the GAUT HLS tool²² to perform operator selection and allocation, scheduling and binding. This methodology is rather more suitable for dataflow dominated systems such as video streaming and linear DSP algorithms.

A combined execution of decomposition and pattern-matching techniques is applied on HLS problems, in order to reduce the total circuit area in.²³ The data path area is reduced by decomposing multi cycle operations, so that they are executed on monocycle functional units (FUs that take one clock cycle to execute and deliver their results). A simple formal model that relies on a FSM-based formalism for describing and synthesizing on-chip communication protocols and protocol converters between different bus-based protocols is discussed in.²⁴ The work in²⁴ contributes towards three aspects of protocol converter synthesis: a formal, FSM-based model for protocol definition, a precise definition of protocol compatibility and a definition of converters and converter correctness (for a given pair of existing and known protocols). Protocol converter test cases that were used to evaluate the work in²⁴ included an ASB to APB converter and a set of converters between the Open Core Protocol (OCP) and the AMBA family of bus protocols. The existing synthesis framework is limited to protocols that can be defined by a single FSM, and in the future more than one FSM per protocol description capabilities are envisaged by the authors.

The methodology of System Co Designer²⁵ uses an actor-oriented approach so as to integrate HLS into electronic system level (ESL) design space exploration tools. Its main aim is to automate the design and building of correct-by-construction System on a chip (SoC) implementations from a behavioral model. The design starts with an executable System C model. Then, commercial synthesizers such as Forte’s Synthesizer are used in order to generate hardware implementations of actors from the behavioral model. Modules or processes are modeled in²⁵ as an actor’s which communicate with other actors via a number of communication channels. This is the starting point for modeling a system in.²⁵ The specification language of an actor is a subset of System C which is defined in System oC library.²⁵ The final FPGA bit stream is generated in²⁵ using the Xilinx EDK (Embedded Development Kit) tools. A motion-JPEG test was used to validate the proposed methodology in.²⁵ A formal approach is followed in²⁶ so as to prove that every HLS translation of a source code model produces a RTL model that is functionally-equivalent to the one in the behavioral input to the HLS tools. The validating system in²⁶ is called SURYA and it is using the Simplify theorem proved to implement the validation algorithms. SURYA was used to validate the SPARK HLS tool,¹¹ and consequently SURYA managed to find two bugs in the SPARK compilations, which were unknown before.

An assumption is made by the formal model of refinement in,²⁷ that the specification and the implementation are single entry and single exit programs. A transition diagram represents every process in these programs. This diagram uses generalized program locations and program transitions. A program location represents a point in the control flow of the program and it is either a node identifier, or a pair of two locations which refer to the state of two processes that are running in parallel. The replacement of flip-flop registers with latches is proposed in²⁸ in order to yield better timing in the implemented designs. The justification for this is that latches are inherently more tolerant to process variations than flip-flops. The latch replacement in²⁸ is executed not only during the register allocation task, but in all steps of HLS, including scheduling, allocation and control synthesis. This method assumes that the delay of the controller is negligible, as compared to the transparent and non-transparent phase times. Nevertheless, implementing registers with latches instead of edge-triggered flip-flops is generally considered to be cumbersome due to the complicated timing behavior of latches.

The cubed-c design environment

The Cubed-C HLS design environment consists of the frontend and the backend compilers. The frontend compilers process programs in ADA and C and produce an intermediate format named ITF. More on the description of this format can be found in.²⁹ The frontend compilers extract all the information from the source code which is required to transform the input subprograms into functionally-equivalent RTL modules in hardware. The backend compiler is built with Prolog predicates and therefore its HLS transformations are formal and the produced hardware implementations are provably-correct. Most of these transformations and optimizations are captured in an aggressive scheduler called PARCS. PARCS will always try to bring the best result (most compressed schedule) however obeying to the data and control dependencies as well as any existing module-locked or global resource constraints.

Figure 1 depicts the design and verification flow within the Cubed-C environment. The internal structure of the Cubed-C hardware compilation system is obscured in this figure. However, the main features of unified synthesis/verification strategy are apparent in Figure 1. PARCS is based on formal techniques and it is a resource-constrained scheduler. It can be driven with local (module-wise) or global resource constraints. The backend compiler can be driven by options about the architectural template, the HDL language, the location of large multi-dimensional data objects, the use of custom blocks, etc. Apart from the RTL code, the backend compiler uses the same internal formal model for the hardware FSMs to extract one cycle-accurate simulator in ANSI-C for each module in the input code hierarchy. By compiling and executing these simulators, the user can go through the FSM states and observe the changes in various storage elements inputs and outputs of the design. Thus, this verification is based on formal means of verifying the same model used by synthesis, and in all of our test cases the functionality of the generated hardware structure coincided with that of the input code, which was expected due to the formal nature of the hardware synthesis transformations.

Figure 1 The Cubed-C synthesis/verification flow.

Signal coding algorithms

The 0 and 1 value of digital signals are sometimes transmitted as are with low and high voltage corresponding values. Often, and in order to enable clock recovery from the signal waveform, as well as information compression, they are modulated or mapped onto signal voltage level change or stable value as well. In general there are four categories of signals:

Non-return to zero (NRZ) signals.
Return to zero (RZ) signals.
Phase Encoded (PE) or phase split signals.
Multiple level signals – Multi-level binary (MLB) signals.

NRZ-level signals map the 1 value to high voltage and 0 to low, or the other way round. It is the simplest type of NRZ signal. NRZ-mark signals map the change in voltage to level 1 and the absence of change to level 0. NRZ-space coding is the opposite of NRZ-mark. These types of signal coding don’t offer error correction or clock recovery and they have a constant component which makes it impossible to transmit it with capacitive or inductive links. Unipolar RZ is the simplest coding of including the clock information. Unipolar RZ is the logical AND between the signal and the clock, therefore when value 1 then there is a 1 pulse for the first half of the period, and 0 otherwise. In Unipolar PPM coding value 0 is coded as a short pulse (1) at the beginning of the clock period, and value 1 is coded as a short pulse (1) somewhere at the middle of the clock cycle. In Unipolar PDM value 0 is coded as a short pulse (1) at the beginning of the clock cycle and value 1 as a long pulse (1) at the beginning of the clock cycle. In this way this group of coding schemes carries in its data the clock information, which can be recovered at the receiver.

Another group of signal coding is the Phase coding group. The first coding is called the Polar Biphasic level, and it is otherwise known as Manchester code. In Manchester code value 0 is coded as 0-to-1 transition at the middle of the cycle and value 1 is coded as a 1-to-0 transition at the middle of the clock cycle. Polar Biphasic Mark and Polar Biphasic Space are essentially Frequency Key Shifting. In Biphasic M. value 0 is coded with stable and alternating (with continuous 0) levels, and value 1 is coded with double frequency clock. Polar Biphasic S. is the opposite coding of Biphasic M. In delay modulation which is also known as Miller code value 0 is coded with alternating level (at the end of the cycle) if it is followed by 0, or same level if it is followed by 1. In Miller code, value 1 is represented by a change of the level in the middle of the clock cycle. An interesting group of signal coding is referred to as multi-level coding. This type of coding avoids the constant component and offers excellent synchronization of the receiver. The Polar RZ represents values 1 and 0 with positive and negative pulse of duration half of the cycle, respectively. Bipolar or alternate mark inversion features half-duration alternating positive/negative pulse for value 1 and level 0 for value 0. Decode coding represents 0-to-1 change with a positive pulse and 1-to-0 change with a negative pulse, and stable level otherwise. In pair selected ternary pairs of bits are encoded depending on an encoding table. In Duo binary coding, when value changes there is a change in level but only to half height, e.g. from 1 to 0, from 0 to -1, from -1 to 0 and from 0 to 1, and no change when values remain the same. For this group of coding it is necessary to encode at least 3 levels of signal, e.g. 1, 0 and -1. For this we have selected encoding with two bits that will drive a DAC at the output (not shown in our experiments).

Design experiments

Although most of the signal coding algorithms that were reported above, were coded in high-level ADA and implemented in the Cubed-C framework, for the sake of economy here we describe two representative ones the Manchester code and the Polar RZ. The algorithms were coded and debugged in executable ADA programs according to the verification scheme of Figure 1. Then the code was ported to the input of the Cubed-C compiler that synthesized the RTL VHDL code. The RTL models of the code/decode processors were simulated and verified in a RTL simulator. Then the RTL output (schematics) of the Cubed-C backend compiler we simulated and the simulations were compared to the behavior of the input ADA code. In all cases the behavior matched that of the source code programs. Figure 2, Figure 3 depicts snapshots of the RTL simulations for the Manchester modulator/demodulator.

Figure 2 RTL simulation of the Manchester encoder.

Figure 3 RTL simulation of the Manchester decoder.

Moreover, the RTZ Polar RZ algorithm was implemented with the same flow and the automatically generated RTL code of the Polar RZ processors was simulated to verify the expected correctness of the synthesis process. Figure 4, Figure 5 contains snapshots of the Polar RZ modulator/demodulator respectively. As shown in the above figures, in all synthesis experiments with Cubed-C the behavior of the synthesized code matches the intended one of the ADA source code. It is worthy to mention that most of the experiments took less than an hour each, which reinforces our contribution towards a sizable increase of the designer’s productivity. The Polar RZ experiments utilized the custom blocks option of the Cubed-C compiler to design the custom Boolean functions required for the signal value translations.

Figure 4 RTL simulation of the Polar RZ encoder.

Figure 5 RTL simulator of the Polar RZ decoder.

A small UART serial communications processor

A UART design was verified and synthesized using the Cubed-C synthesizer. Table 1 shows the Xilinx Spartan 3 FPGA statistics of its implementation. The worse-case delay was around 8 ns, allowing an implementation speed of clock up to 100MHz. The UART was verified at the ADA level, at the cycle-accurate simulator level (produced by Cubed-C) and at the RTL VHDL simulation. The latter is showed in a snapshot in Figure 6. It must be noted that in all cases the RTL and test benches as well as the cycle-accurate C models were produced in an automatic and formal way directly from the internal formal FSM model that is kept in the memory of the Cubed-C compiler. Therefore, the verification flow is formal and automatic which save the engineer from a lot of week’s manual work trying to remove functional and timing bugs.

Object	SPARTAN-3 XILINX FPGA Stats
Object	Used	Total	Utilization
Slice Flip Flops	84	7168	1%
4 Input LUTs	122	7168	1%
Occupied Slices	93	3584	2%
Bonded IOBs	27	173	15%
BUFGMUXs	2	8	255

Table 1 UART Design Implementation Statistics

Figure 6 RTL VHDL simulation snapshot of the UART design.

Conclusion and Future work

Formal and rapid automated synthesis and automated formal verification methods flows are used in the Cubed-C tools. The major contribution of this work is that low level, bit-wise; detailed hardware signal coding algorithms are rapidly and formally implemented with the Cubed-C framework, although Cubed-C is a full-blown HLS system. In all cases, RTL/gate level verification (simulations) showed that the behavior of the generated hardware processors matches the behavior of the input specification code and model. Future work includes more low level modulation algorithm implementation, more experiments with Cubed-C and the embedded Cycle-accurate simulator.