FPGA large input data - verilog

I am trying to send a 4 kilobyte string to an FPGA. What is the easiest way to do this?
This is the link for the FPGA that I am using. I am using Verilog and Quartus.

The answer to your question depends a lot on what is feeding this data into the FPGA. Even if there isn't a specific protocol you need to adhere to (SPI, Ethernet, USB, etc.), there is the question of how fast you need to accept the data, and how far the data has to travel. If it's very slow, you can create a simple interface using regular IO pins with a parallel data bus and a clock. If it's much faster, you may need to explore using high speed serial interfaces and the special hard logic available on your chip to handle those speeds. Even if it's slower, but the data needs to travel over some distance, a serial interface may be a good idea to minimize cable costs.
One thing I would add to @gbuzogany's answer: you probably want to configure that block of memory in the FPGA as a FIFO, so you can handle the data input clock running at a different rate than the internal clock of your FPGA.

You can use your FPGA's blocks to create a memory inside the FPGA chip (you can do that from Quartus). The creation assistant allows you to initialise this memory with anything you want (e.g. a 4 KB string). The drawback is that in-FPGA memory uses up many of your FPGA's blocks, but for a board like this that should not be a problem.
Here is a video explaining how to do that on Quartus:
https://www.youtube.com/watch?v=1nhTDOpY5gU

You can use a string for memory initialization. It's easy to do in Verilog inside an 'initial begin ... end' block.

There are 2 ways:
1. You can create a memory block using Xilinx Core Generator, load the initial data into the memory, and then use that data in your code. Of course, you have to convert the string to binary data first.
2. You can write a module that contains a memory to store the string; it can be a first-in-first-out (FIFO) memory. Then you write a testbench that reads the string from a text file and writes the data into the FIFO. Your FPGA design can then read the string from the FIFO.
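The "convert the string to binary data" step in option 1 can be sketched as a small host-side script. This is only an illustration: it assumes an 8-bit-wide, 4096-deep memory and a plain text format with one hexadecimal byte per line, which is the layout Verilog's `$readmemh` reads. The function name and the default depth are made up for the example, and whether your synthesis tool accepts `$readmemh` for initialising on-chip RAM is something to check in its documentation.

```python
def string_to_readmemh(text: str, depth: int = 4096) -> str:
    """Return `text` as one two-digit hex byte per line, zero-padded to `depth` bytes.

    The 8-bit width and the 4096-byte depth are assumptions for this sketch.
    """
    data = text.encode("ascii")
    if len(data) > depth:
        raise ValueError(f"string is {len(data)} bytes, memory holds only {depth}")
    padded = data + bytes(depth - len(data))  # unused locations are filled with zeros
    return "\n".join(f"{b:02x}" for b in padded) + "\n"


# Small demo: a 4-byte memory holding the string "AB".
print(string_to_readmemh("AB", depth=4), end="")
```

Writing the returned text to a file gives something a Verilog `initial` block could load with `$readmemh`, assuming the memory is declared 8 bits wide with at least `depth` entries.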

Related

Writing data on memory at different clocks

I want to write data coming from different clock domains into a common memory. How can I do that?
I have one common memory block, and that memory block runs on a clock with frequency clk. Now I want to write data into that memory from different clock domains, i.e. clk1, clk2, clk3, clk4, etc. How do I do that?
I was thinking of using a FIFO for each clock domain, i.e. the 1st FIFO has input clock clk1 and output clock clk (same as the memory), the 2nd FIFO has input clock clk2 and output clock clk (same as the memory), and so on. But it seems that my design will grow too large if I use a large number of FIFOs. Kindly tell me the correct approach.
To pass data-units (bytes, words etc) safely between clock domains the asynchronous FIFO is the only safe solution. Note that the FIFO does not have to be deep, but in that case you may need flow control.
You may need flow control anyway as you have many sources all accessing the same memory.
"But it seems that my design will grow too large if I use a large number of FIFOs."
Then you have a design problem: your FPGA is not large enough to implement the solution you have chosen. So either move to a bigger FPGA or find a fundamentally different solution to your problem.
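To make the "shallow FIFOs plus flow control" idea concrete, here is a small behavioural model in Python (not HDL). It is a sketch under several assumptions: the clock-domain crossing itself is not modelled, each `SourceFifo` just stands in for one asynchronous FIFO, and the round-robin arbiter is one possible policy, not something the answer prescribes. All names are hypothetical.

```python
from collections import deque


class SourceFifo:
    """Shallow FIFO standing in for one asynchronous FIFO (clock crossing not modelled)."""

    def __init__(self, depth: int):
        self.depth = depth
        self.items = deque()

    def push(self, addr: int, value: int) -> bool:
        """Return False when full, so the source has to hold its data (flow control)."""
        if len(self.items) >= self.depth:
            return False
        self.items.append((addr, value))
        return True


def memory_tick(fifos, memory, start):
    """One memory-clock cycle: serve the first non-empty FIFO at or after `start`.

    Returns the index to start from on the next cycle (round-robin).
    """
    n = len(fifos)
    for offset in range(n):
        i = (start + offset) % n
        if fifos[i].items:
            addr, value = fifos[i].items.popleft()
            memory[addr] = value  # only one write per memory clock
            return (i + 1) % n
    return start
```

Because the memory accepts one write per cycle, a source that produces data faster than its share of the memory bandwidth will eventually see `push` return False. That is the point where flow control is needed, however deep the FIFOs are.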
A RAM can be written from clock domain A and read from clock domain B using different clocks (dual-ported RAM):
http://www.asic-world.com/examples/verilog/ram_dp_ar_aw.html
This RAM should be driven by some controller so that it behaves as an asynchronous FIFO.
Many FPGAs have dedicated RAM components, for example:
Altera UFM, Xilinx BRAM, Cypress Delta39K Cluster Memory Block, etc.
Change your device if large FIFOs are a problem.

What is the purpose of structure iov_iter in Linux?

What is the purpose of struct iov_iter? This structure is used in the Linux kernel instead of struct iovec. There is no good documentation for the iter interface. I found one document on LWN, but I am not able to understand it. Could anyone please help me understand the iter interface used in the Linux kernel?
One purpose of iovec, which the LWN article states up front, is to process data in multiple chunks.
If you have a number of discrete buffers, chained with pointers, and want to read/write them in one go, you could simply replace this with several read/write ops, but in some cases semantics are associated with read/write boundaries - so ops can't simply be split without changing the meaning. An alternative is to copy all the data in and out of a contiguous buffer, which is wasteful and we want to avoid at all costs.
Using the POSIX readv/writev or, in our case, the iov_iter API reduces the number of system calls, and hence the overhead involved. While in the kernel this doesn't translate to expensive ops like context switches, it is still a minor concern. Drivers also might handle larger chunks of data more efficiently than they would lots of smaller chunks when they have no way to know if there's more to come in the near future. This is especially true with network drivers, although I'm not aware of iov_iter being used there at the moment.
Another instance of the same situation is I/O to raw disk devices, which only allow I/O that starts and ends on block boundaries. A user might occasionally want to perform random access or overwrite a small piece of the buffer at, say, the start of a block and/or zero the rest.
Scenarios like that are exactly what iovec aimed to address; you can construct an iovec which enables you to do a whole-block operation spread over several discrete buffers, which might even include a "scratch" buffer for dumping the parts of a block you read and don't care about processing, and a pre-zeroed buffer for chaining at the end of writev to zero out the rest of a block. Again, I should point out you can use a contiguous buffer with associated copying and/or zeroing, but the iov_iter API provides an alternative abstraction with less overhead, and one that is perhaps easier to reason about when reading the code.
The term for operations like these in vector processing, or parallel computing, is "scatter/gather processing".
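The scratch-buffer and pre-zeroed-buffer tricks described above can be tried from user space with the POSIX readv/writev calls, which Python exposes as `os.readv` and `os.writev` on Unix-like systems. This is a sketch of the user-space analogue only, not of the in-kernel iov_iter API. The 16-byte block size, the function names, and the use of a pipe in place of a raw block device are all assumptions made to keep the demo small.

```python
import os

BLOCK = 16  # hypothetical block size, kept tiny for the demo


def write_block(fd: int, payload: bytes) -> int:
    """Gather: write payload plus a pre-zeroed tail as one full block in one call."""
    if len(payload) > BLOCK:
        raise ValueError("payload does not fit in one block")
    zero_tail = bytes(BLOCK - len(payload))
    return os.writev(fd, [payload, zero_tail])


def read_block_tail(fd: int, skip: int) -> bytes:
    """Scatter: read one block, dumping the first `skip` bytes into a scratch buffer."""
    scratch = bytearray(skip)          # data we read but do not care about
    wanted = bytearray(BLOCK - skip)   # data we actually want to process
    os.readv(fd, [scratch, wanted])
    return bytes(wanted)


# Demo over a pipe (stands in for a device that wants whole-block I/O).
r, w = os.pipe()
written = write_block(w, b"HDRpayload")
tail = read_block_tail(r, skip=3)
os.close(r)
os.close(w)
print(written, tail)
```

In both directions the data moves in a single system call, and no contiguous staging buffer has to be assembled or picked apart afterwards.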

For emulating the Gameboy, why does it matter that the memory is broken up into different areas?

So I'm writing a gameboy emulator, and I'm not 100% sure why other projects took the time to break up the memory into proper categories. I don't know if there is a major technical dilemma I'm missing (maybe handling illegal parameters in instructions?), but it seems like the only thing that matters is that the address given by a write instruction is retrievable by the proper read instruction. So for a sub question, if I'm working under the assumption that the assembly is perfectly legal (meaning nothing is trying to read/write where it can't), can I just make a big array and read and write to it?
Note that this is a conceptual question and that I am aware a big array would be a memory hog, I'm not necessarily looking for the best way to do it, simply trying to learn how it works and why other emulator developers did it the way they did.
You are going to have read only memory, read/write memory and memory mapped I/O (peripherals etc). So you need to decode the address to some extent to break it into the major categories, then for the peripherals you have to emulate all of those so you have to get very detailed in your address decoding.
For the peripherals you will need to detect each read or write to a given address, which you cannot do by simply landing the writes in an array. Two writes of the same value, for example, make a difference, so you can't just scan some array to look for changes; you have to trigger on reads and writes and perform the hardware action.
If you wish to be cycle accurate, you will also need to know the timings of the RAMs and ROMs in order to mimic them. Depending on how many banks of each there are, and on whether the timing differs between them, you will need to decode the address further.
Hardware decodes these addresses to the same level so if you are emulating hardware then you need to...emulate hardware...and do the same amount of address decoding.
I'm going to be gameboy specific here. Look at gameboy's address space map. The address space itself is divided, it's not that emulators do it. Hardware itself operates that way.
Here's some of the regions that can't be implemented as just an array:
0x0000-0x3FFF. First bank of the ROM. It's read-only, but not quite; see the next entry.
0x4000-0x7FFF. Switchable ROM bank. It's also not quite read-only. Cartridges that don't fit into the Gameboy's address space contain a memory bank controller. The ROM code writes to specific addresses in the read-only ROM regions to select which ROM bank is mapped into the 0x4000-0x7FFF address range. So you have to detect these writes and then redirect reads to the selected ROM bank.
0xA000-0xBFFF. Switchable RAM bank. Same thing as with switchable ROM banks, but now for RAM. Cartridges may contain additional RAM that is mapped into the Gameboy's address space. Which RAM bank is mapped is controlled, again, by writes to specific read-only regions.
0xFF00-0xFF4B. IO ports. Here you have hardware registers mapped into the address space. The Gameboy has several hardware components, each with its own registers and even memory (display controller, sound processor, timers, etc.). To control that hardware, the ROM code reads and writes the IO ports. You obviously have to detect these writes so you can emulate the hardware they correspond to. It's not just the CPU and memory you have to emulate. I would even say that's the smallest and easiest part of it. For example, it is much harder to get the display controller and sound channels right. They have complicated logic, bugs and very tricky behaviour that's not documented very well but is crucial for accurate emulation. The wave channel in particular gave me a hard time.
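A minimal sketch of what this address decoding looks like in an emulator is shown below. It is deliberately a toy: the bank-select range 0x2000-0x3FFF and the "bank 0 selects bank 1" rule are assumptions modelled loosely on an MBC1-style controller, cartridge RAM banking is left out, and the `Bus` class and its `io_log` list are hypothetical names. A real emulator would call into the display, sound or timer emulation where this sketch only records the write.

```python
class Bus:
    """Toy Game Boy bus: banked ROM, plain RAM, and I/O writes that are observed."""

    def __init__(self, rom: bytes):
        self.rom = rom
        self.rom_bank = 1               # bank currently visible at 0x4000-0x7FFF
        self.ram = bytearray(0x10000)   # backing store for everything above the ROM
        self.io_log = []                # every I/O write, in order

    def read(self, addr: int) -> int:
        if addr < 0x4000:
            return self.rom[addr]       # fixed first ROM bank
        if addr < 0x8000:
            return self.rom[self.rom_bank * 0x4000 + (addr - 0x4000)]
        return self.ram[addr]

    def write(self, addr: int, value: int) -> None:
        if addr < 0x8000:
            # Writes never change ROM contents; some ranges drive the bank controller.
            if 0x2000 <= addr < 0x4000:
                # Assumed MBC1-like behaviour: 5-bit bank number, 0 is treated as 1.
                self.rom_bank = (value & 0x1F) or 1
            return
        if 0xFF00 <= addr <= 0xFF4B:
            # A real emulator would trigger the hardware action here.
            self.io_log.append((addr, value))
        self.ram[addr] = value
```

With a flat array, a write to 0x2000 would corrupt the ROM image instead of switching banks, and two identical writes to an IO port would be indistinguishable from one. Here both are handled because every access goes through `read` and `write`.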

Replace dual-port RAM with two single port RAMs for J1 Forth CPU on Altera FPGA

The wonderful J1 Forth CPU (Verilog source code) is written to run on Xilinx FPGAs. I am trying to port it to an Altera Cyclone II FPGA.
I have difficulty getting the Altera dual-port RAM megafunction to work properly. Judging from the Verilog code, can I use two single port RAMs, instead of a dual port RAM?
The real question is, does J1 Forth modify its own code while running? If not, why not separate the dual-port RAM into the code RAM (addressed by {_pc}) and the data RAM (addressed by _st0[15:1])?
Right now the compiler that produces code for the J1 (you should have gotten some Forth code that runs under Gforth) assumes that both data and code are coming from the same RAM space. In order to separate code and data (which you would have to do in order to use separate RAM banks for each) you would have to modify the compiler to put anything done with CREATE into the data RAM.
You would also have to change the definitions of ! and @ to access the appropriate RAM bank. And forget implementing any sort of interactive prompt that allows you to define new words, because that would require either writing to the RAM bank that contains code, or being able to point the PC at the data RAM bank, which defeats the purpose of having two RAM banks in the first place.
Otherwise you would have to wire up some sort of logic to keep the two RAM banks in sync.
The J1 is pretty tightly bound to the idea of having a dual-port RAM bank.

Large Data Flow between User and Kernel

What is the best way (in terms of performance) to have a bi-directional data flow between user level and kernel level?
I understand that you can open a NETLINK socket and transfer the data through it. But we then have to adopt some other user-kernel interaction (system calls, ioctl) for sending control information across. Is this the most efficient way to transfer a large amount of data across the user-kernel boundary?
Passing large buffers of data into the kernel driver/thread/whatever is no problem: the kernel has the privilege to read them. For returning stuff, the usual way is to provide the kernel thingy with a sufficiently large user-space buffer, or buffer pool, for it to return data in. That's how it's done for the usual stuff, such as file/network read/write.
What is the problem, more exactly - do you need to transfer the data to/from kernel level on a different machine?
Rgds,
Martin
