Published: 12 August 2013

A C code example

We shall now look at the C code of a coprocessing system consisting of the three components mentioned earlier. First, a C source file with a very simple example of a synthesized function and its wrapper is shown. Next, a host program outlines how the host communicates with the logic. This host program is then improved for performance and reliability, demonstrating the recommended programming techniques.

Sample code for HLS synthesis

To clarify how HLS works with Xillybus, let’s consider a simple example, which demonstrates the calculation of a trigonometric sine and a simple integer operation, both covered in a custom function, mycalc(). This is a very simple function, but as Xilinx’ guide to Vivado HLS shows, the possibilities go way beyond this. mycalc() takes the role of the “synthesized function”.

This function is called by a wrapper function, xillybus_wrapper(), which is responsible for the interface with the host. It accepts an integer and a floating point number from the host through a data pipe, which is represented by the “in” argument. It returns the integer plus one and the (trigonometric) sine of the floating point number, using the “out” argument.

How the *in++ and *out++ operations transport data from and to the host application is explained below. A walkthrough of the code is given immediately after its listing here.

#include <math.h>
#include <stdint.h>
#include "xilly_debug.h"

extern float sinf(float);

int mycalc(int a, float *x2) {
  *x2 = sinf(*x2);
  return a + 1;
}

void xillybus_wrapper(int *in, int *out) {
#pragma AP interface ap_fifo port=in
#pragma AP interface ap_fifo port=out
#pragma AP interface ap_ctrl_none port=return

  uint32_t x1, tmp, y1;
  float x2, y2;

  xilly_puts("Hello, world\n");

  // Handle input data
  x1 = *in++;
  tmp = *in++;
  x2 = *((float *) &tmp); // Convert uint32_t to float

  // Debug output
  xilly_puts("x1=");
  xilly_decprint(x1, 1);
  xilly_puts("\n");

  // Run the calculations
  y1 = mycalc(x1, &x2);
  y2 = x2; // This helps HLS in the conversion below

  // Handle output data
  tmp = *((uint32_t *) &y2); // Convert float to uint32_t
  *out++ = y1;
  *out++ = tmp;
}

A brief explanation of the code above

This piece of code starts with #include statements. The “math.h” inclusion is necessary for the sine function. “xilly_debug.h” contains headers for debug functions.

The declaration of xillybus_wrapper(), as well as the pragma directives that follow it, relate to the interface with the Xillybus IP core, and must always appear as shown. In particular, neither the name of this function nor its arguments may be changed.

Next, we have a call to xilly_puts(), which produces a debug message that can be displayed easily on the host computer’s console, as elaborated in part VI.

After this, the input data is fetched. Each *in++ operation fetches a 32-bit word originating from the host. In the code shown, the first word is interpreted as an unsigned integer, and is put in x1. The second word is treated as a 32-bit float, and is stored in x2. The communication of data is explained further below.

This is followed by x1’s value written on the host computer’s console as a decimal number for debug purposes.

Next, mycalc(), the “synthesized function”, is called. This function returns one result as its return value, and the second piece of data goes back by changing x2. The wrapper function copies the updated value of x2 into a new variable, y2, which may appear to be a redundant operation.

Had this code been compiled for execution on a processor, the copying to y2 would have been redundant indeed. When using HLS, it is however necessary in order to make the compiler handle the conversion from float later on. This reflects a somewhat quirky behavior of the HLS compiler, and it is one of the delicate issues of using a pointer: The HLS compiler doesn’t really generate a memory array and a pointer to it. The use of the pointer is just a hint on what we want to accomplish, and sometimes these hints need to be pushed a bit.

Finally, the results are sent back to the host: Each *out++ sends a 32-bit word to the computer, with due conversion from float.
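
The pointer casts in the wrapper function reinterpret the 32 bits of a word; they don’t convert the numeric value. On the host side, the same reinterpretation can be sketched with memcpy(), which is a common idiom in plain C. The helper names float_to_word() and word_to_float() are made up for this illustration and aren’t part of Xillybus or Vivado HLS. The sketch assumes that float is a 32-bit IEEE 754 type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helpers, for illustration only: Copy the bit pattern
// of a float into a 32-bit word and back, without changing any bit.
// Assumes that float is a 32-bit IEEE 754 type.

uint32_t float_to_word(float f) {
  uint32_t w;

  assert(sizeof(w) == sizeof(f));
  memcpy(&w, &f, sizeof(w));
  return w;
}

float word_to_float(uint32_t w) {
  float f;

  assert(sizeof(f) == sizeof(w));
  memcpy(&f, &w, sizeof(f));
  return f;
}
```

Under the assumption above, float_to_word(1.0f) yields 0x3f800000, and word_to_float() applied to that word gives 1.0 back. This is different from the cast (uint32_t) 1.0f, which converts the value and yields the integer 1.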

Note that the *in++ and *out++ operators don't really move pointers, and there is no underlying memory array. Rather, these symbolize moving data from and to FIFOs (and eventually from and to Xillybus pipes). Hence, the only way the "in" and "out" variables may be used is *in++ and *out++.

The host program

The following program can be used to communicate with the logic. Most notable is that two device files, which behave like named pipes, are used for communication with the logic: /dev/xillybus_read_32 and /dev/xillybus_write_32. These two files are generated by Xillybus’ driver, as explained on this page.

As before, the listing is followed by comments.

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>

int main(int argc, char *argv[]) {

  int fdr, fdw;

  struct {
    uint32_t v1;
    float v2;
  } tologic, fromlogic;

  fdr = open("/dev/xillybus_read_32", O_RDONLY);
  fdw = open("/dev/xillybus_write_32", O_WRONLY);

  if ((fdr < 0) || (fdw < 0)) {
    perror("Failed to open Xillybus device file(s)");
    exit(1);
  }

  tologic.v1 = 123;
  tologic.v2 = 0.78539816; // ~ pi/4

  // Not checking return values of write() and read(). This must be done
  // in a real-life program to ensure reliability.

  write(fdw, (void *) &tologic, sizeof(tologic));
  read(fdr, (void *) &fromlogic, sizeof(fromlogic));

  printf("FPGA said: %d + 1 = %d and also "
	 "sin(%f) = %f\n",
	 tologic.v1, fromlogic.v1,
	 tologic.v2, fromlogic.v2);

  close(fdr);
  close(fdw);

  return 0;
}

The program begins with opening two files, /dev/xillybus_read_32 and /dev/xillybus_write_32. These two files are the operating system’s representation of the data pipes through which the host communicates with the logic.

The “tologic” structure is then filled with some values for transmission to the logic, after which it’s written directly from memory to xillybus_write_32. Effectively, this writes 8 bytes, or more precisely, two 32-bit words. The first is the integer 123 put in tologic.v1, and the second is the float in tologic.v2. The tologic structure was hence set up to match the data that the logic expects: One integer for the first *in++ operation, and one float for the second.

It is crucial to match the amount of data sent to /dev/xillybus_write_32 with the number of *in++ operations in the wrapper function. If there is too little data sent, the synthesized function may not execute at all. If there’s too much, the following execution will probably be faulty.

At this point, the function is executed in logic, and the result is returned as two 32-bit words by virtue of the *out++ operations at the end of the wrapper function. These two values are read from /dev/xillybus_read_32 by the read() call that follows write().

In this example, the same structure format was chosen for “tologic” and “fromlogic”, but there’s no need to stick to this. It’s just important that the data sent and received is in sync with the wrapper function’s number of *in++ and *out++ operations.
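
Since the host writes and reads the structures directly from memory, their layout must be exactly the sequence of 32-bit words that the wrapper function consumes and produces. The following sketch checks this for a structure like the one in the host program. The names struct packet_check and packet_layout_ok() are invented for this example.

```c
#include <assert.h> // For assert() in the calling code
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the structure used in the host program
struct packet_check {
  uint32_t v1; // Consumed by the first *in++
  float v2;    // Consumed by the second *in++
};

// Returns 1 if the structure is exactly two 32-bit words with no
// padding, so that write() sends precisely what the wrapper expects.
// Returns 0 otherwise.
int packet_layout_ok(void) {
  return (sizeof(struct packet_check) == 2 * sizeof(uint32_t)) &&
    (offsetof(struct packet_check, v1) == 0) &&
    (offsetof(struct packet_check, v2) == sizeof(uint32_t));
}
```

A host program could call assert(packet_layout_ok()) before opening the device files, so that a mismatch is caught on the host rather than showing up as faulty results from the logic.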

Finally, the input and output structures are printed out for review.

It’s important to note that the program above demonstrates a single execution of the synthesized function. This is not the way to measure the efficiency of coprocessing, as I/O latency and other delays will cause a poor outcome. Rather, it’s kept concise for the sake of illustration. A more realistic program is given below for reference.

This code was written for compilation on Linux. Windows users may need to make all or some of the following adjustments:

  • Change the file name string from “/dev/xillybus_read_32” to “\\\\.\\xillybus_read_32” (the actual file name on Windows is \\.\xillybus_read_32, but escaping is necessary). The second file name changes to “\\\\.\\xillybus_write_32”.
  • Replace the #include statement for unistd.h with io.h
  • Replace the calls to open(), read(), write() and close() with _open(), _read(), _write() and _close()
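
The adjustments listed above could be collected in a small set of preprocessor definitions, so that the rest of the program stays the same on both platforms. This is only a sketch: All names beginning with xilly_ and XILLY_ are made up for this example. The _O_BINARY flag on Windows is an assumption that isn’t part of the list above; it’s added here to keep the C runtime from translating line endings in binary data.

```c
#include <assert.h>
#include <fcntl.h>

#ifdef _WIN32
#include <io.h>
// Escaped forms of \\.\xillybus_read_32 and \\.\xillybus_write_32
#define XILLY_READ_32 "\\\\.\\xillybus_read_32"
#define XILLY_WRITE_32 "\\\\.\\xillybus_write_32"
// Assumption: _O_BINARY prevents line ending translation
#define xilly_open(name, flags) _open((name), (flags) | _O_BINARY)
#define xilly_read _read
#define xilly_write _write
#define xilly_close _close
#else
#include <unistd.h>
#define XILLY_READ_32 "/dev/xillybus_read_32"
#define XILLY_WRITE_32 "/dev/xillybus_write_32"
#define xilly_open(name, flags) open((name), (flags))
#define xilly_read read
#define xilly_write write
#define xilly_close close
#endif

// Hypothetical helper: Open both device files.
// Returns 0 on success, -1 if either file failed to open.
int xilly_open_both(int *fdr, int *fdw) {
  assert(fdr && fdw);

  *fdr = xilly_open(XILLY_READ_32, O_RDONLY);
  *fdw = xilly_open(XILLY_WRITE_32, O_WRONLY);

  if ((*fdr < 0) || (*fdw < 0)) {
    if (*fdr >= 0)
      xilly_close(*fdr); // Don't leave a single file open
    if (*fdw >= 0)
      xilly_close(*fdw);
    return -1;
  }

  return 0;
}
```

With these definitions, the host program would call xilly_read(), xilly_write() and xilly_close() instead of the plain functions.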

Running

The expected behavior of a test run is now shown. For this to work, Xillybus’ driver must have been loaded and detected its counterpart in the logic fabric. How this is set up is explained in part IV.

Before attempting a test run, it’s recommended to begin watching the debug output by typing “cat /dev/xillybus_read_8” at the shell prompt. In another terminal window, run the program, which should look like this:

$ ./hlsdemo
FPGA said: 123 + 1 = 124 and also sin(0.785398) = 0.707107

As a result of the execution, some debug output will be generated:

$ cat /dev/xillybus_read_8
Hello, world
x1=123
Hello, world

The origins of the first two lines are easily found in the wrapper function above. The third “Hello, world” line may come as a surprise, and may not appear in some cases. It’s a result of the HLS compiler’s attempt to promote data flow: The logic’s state machine always assumes that new input data is on its way, and attempts to move things forward as much as possible in order to save processing time once the data arrives.

Since no input data is needed for the second “Hello, world”, it’s sent out as soon as possible. In this case, it’s immediately after “x1=123”, which depends on input data. In theory, it could go on printing out the “x1=” part as well, but the compiler didn’t optimize things this far.

A practical host program

The code above outlines the way data is exchanged, but two changes are necessary in a practical system:

  • Sending a single set of data for processing is extremely inefficient, making I/O overhead a major delay component. It’s also wrong to wait for the outcome of a single execution before sending the next set.
  • The return values from read() and write() aren’t checked, so partial operation and UNIX signals aren’t handled properly. This is a negligible issue when a single chunk of 8 bytes is going back and forth, but may cause weird problems in real-life applications.

The program below shows a suggested practical Linux-style implementation of using the logic for coprocessing. This is a throughput-oriented implementation, focused on keeping the data flowing rather than completing rounds of requests and responses.

The following differences are most notable:

  • Rather than generating a single set of data for processing, an array of structures is allocated and sent. Likewise, an array of data is received from the logic. This reduces the I/O overhead, and the impact of software and hardware latencies.
  • The program forks into two processes, one for writing and one for reading data. Making these two tasks independent prevents the processing from stalling due to lack of data to process or output data waiting to be cleared up. This independence can also be achieved with threads (in particular on Windows) or with the select() call.
  • The read() and write() calls are made in while loops to ensure reliable I/O. These loops may appear cumbersome, but they are necessary to respond correctly to partial completions of these calls (not all bytes read or written), which is a frequent case under load. The EINTR error is also handled in order to react properly to POSIX signals, which may be sent to the running processes, possibly by unrelated software.
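
The handling of partial completions and EINTR described in the last item can also be factored into a pair of helper functions, as in the following sketch. The names write_all() and read_all() are made up here; the program’s listing keeps the loops inline instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: Write exactly len bytes to fd.
// Returns 0 on success, -1 on failure.
int write_all(int fd, const void *data, size_t len) {
  const char *buf = data;
  size_t done = 0;
  ssize_t rc;

  assert(data != NULL);

  while (done < len) {
    rc = write(fd, buf + done, len - done);

    if ((rc < 0) && (errno == EINTR))
      continue; // Interrupted by a signal, just try again

    if (rc <= 0)
      return -1;

    done += (size_t) rc; // Possibly a partial write, loop for the rest
  }

  return 0;
}

// Hypothetical helper: Read len bytes from fd, stopping early only
// on EOF. Returns the number of bytes read, or -1 on failure.
ssize_t read_all(int fd, void *data, size_t len) {
  char *buf = data;
  size_t done = 0;
  ssize_t rc;

  assert(data != NULL);

  while (done < len) {
    rc = read(fd, buf + done, len - done);

    if ((rc < 0) && (errno == EINTR))
      continue; // Interrupted by a signal, just try again

    if (rc < 0)
      return -1;

    if (rc == 0)
      break; // EOF

    done += (size_t) rc; // Possibly a partial read, loop for the rest
  }

  return (ssize_t) done;
}
```

A return value from read_all() that is smaller than the requested length means that EOF was reached, which isn’t expected on a Xillybus device file during normal operation.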

Note that for real use, the debug messages must be removed from the synthesized and wrapper functions, as they may slow down execution dramatically, in particular by forcing sequential execution where a speedup is possible by parallel execution.

The program’s listing follows.

#include <stdio.h>
#include <unistd.h>

#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>

#define N 1000

struct packet {
  uint32_t v1;
  float v2;
};

int main(int argc, char *argv[]) {

  int fdr, fdw, rc, donebytes;
  char *buf;
  pid_t pid;
  struct packet *tologic, *fromlogic;
  int i;
  float a, da;

  fdr = open("/dev/xillybus_read_32", O_RDONLY);
  fdw = open("/dev/xillybus_write_32", O_WRONLY);

  if ((fdr < 0) || (fdw < 0)) {
    perror("Failed to open Xillybus device file(s)");
    exit(1);
  }

  pid = fork();

  if (pid < 0) {
    perror("Failed to fork()");
    exit(1);
  }

  if (pid) {
    close(fdr);

    tologic = malloc(sizeof(struct packet) * N);
    if (!tologic) {
      fprintf(stderr, "Failed to allocate memory\n");
      exit(1);
    }

    // Fill array of structures with just some numbers
    da = 6.283185 / ((float) N);

    for (i=0, a=0.0; i<N; i++, a+=da) {
      tologic[i].v1 = i;
      tologic[i].v2 = a;
    }

    buf = (char *) tologic;
    donebytes = 0;

    while (donebytes < sizeof(struct packet) * N) {
      rc = write(fdw, buf + donebytes, sizeof(struct packet) * N - donebytes);

      if ((rc < 0) && (errno == EINTR))
	continue;

      if (rc <= 0) {
	perror("write() failed");
	exit(1);
      }

      donebytes += rc;
    }

    sleep(1); // Let debug output drain (if used)

    close(fdw);
    return 0;
  } else {
    close(fdw);

    fromlogic = malloc(sizeof(struct packet) * N);
    if (!fromlogic) {
      fprintf(stderr, "Failed to allocate memory\n");
      exit(1);
    }

    buf = (char *) fromlogic;
    donebytes = 0;

    while (donebytes < sizeof(struct packet) * N) {
      rc = read(fdr, buf + donebytes, sizeof(struct packet) * N - donebytes);

      if ((rc < 0) && (errno == EINTR))
	continue;

      if (rc < 0) {
	perror("read() failed");
	exit(1);
      }

      if (rc == 0) {
	fprintf(stderr, "Reached read EOF!? Should never happen.\n");
	exit(1); // Unexpected EOF is an error, so don't report success
      }

      donebytes += rc;
    }

    for (i=0; i<N; i++)
      printf("%d: %f\n", fromlogic[i].v1, fromlogic[i].v2);

    sleep(1); // Let debug output drain (if used)

    close(fdr);
    return 0;
  }
}

 
