How to write SDC timing constraints for an encrypted Verilog core?

I have received a verilog project, with a key component in it being encrypted.
The performance on an FPGA varies depending on build environment and configuration [Note 1], and I suspect this is caused by insufficient timing constraints [Note 2].
In the TimeQuest timing analyzer I can see the names of the pins, nets, registers, and ports within the encrypted core, but without looking at the actual code I don't know exactly what they mean.
So how should I start writing an SDC timing constraint in this situation?
Note 1:
The component is a MIPI CSI-2 TX. On the test RX side I constantly get SoT errors (SoT error and sync not achieved), and sometimes ECC errors.
For a while, to make it "work" on the FPGA, the code had to be built on a Windows machine. Then, after a few minor and unrelated changes to the code, it would only work if built on a Linux machine. Very recently a Linux build machine seems to perform better than a Windows one.
Changing optimization settings also seems to break the code very badly. Currently only "Balanced" mode works; "Performance" and "Aggressive Performance" modes, which should improve performance, cause a lot of errors in the received signals. The memory contents of an upstream signal-processing block also affect the MIPI CSI-2 TX in this way.
All of the above makes me think there are uncertainties that are not fully covered by the SDC timing constraints.
Note 2:
I'm not able to fully verify this theory because I don't have the equipment to properly probe the signals, nor can I run a gate-level simulation, because the encrypted code does not allow me to generate an EDA netlist.

So in general, there are two types of constraints I am usually concerned with up front. One is the I/O constraints: when signals go off chip, you want to control the relationship between the data and the clock used to transmit it, and of course between the receive clock and the receive data. The second type is the clock constraint: you want to validate that the clocks themselves are specified correctly. Depending on the tool, once you have done that there is generally a clock-derivation step to pick up any PLL-derived clocks; that part may be automatic, or, if it is Altera, you just need to call derive_clocks and you are on your way.
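For an Altera/TimeQuest flow, the skeleton of such an SDC file usually looks something like this (a sketch only; the port names, the 20 ns period and the delay numbers are placeholders to adapt to your board, and derive_pll_clocks / derive_clock_uncertainty are the Quartus commands I would reach for here):

create_clock -name ref_clk -period 20.000 [get_ports ref_clk]   ;# placeholder 50 MHz reference
derive_pll_clocks                                               ;# create the PLL output clocks
derive_clock_uncertainty                                        ;# add standard jitter/uncertainty
# Tie off-chip data to the clock that launches/captures it (values are placeholders):
set_output_delay -clock ref_clk -max  2.0 [get_ports {tx_data[*]}]
set_output_delay -clock ref_clk -min -1.0 [get_ports {tx_data[*]}]
set_input_delay  -clock ref_clk -max  3.0 [get_ports {rx_data[*]}]
set_input_delay  -clock ref_clk -min  1.0 [get_ports {rx_data[*]}]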
There are other kinds of constraints of course, but most of those are for specifying exceptions: "make this path multicycle", "set a false path on that", and so on. Those are probably not what you want here. Constrain to the clock, and that should be pretty tight.
If it is an input/output constraint issue, you can vary the relationship between the RX and TX clocks and associated data.
Beyond that, if this encrypted IP is provided by a vendor, they will usually supply any additional constraints that are required beyond the basic input clock constraint.
One strategy, if you really think it is a clock variation issue, is to spec the input clock as slightly faster than it really is. This makes timing tougher to meet, but you gain margin in the final image.
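For example, if the real reference clock is 50 MHz (20 ns), you might tighten it a little (the 19 ns figure is purely illustrative):

create_clock -name ref_clk -period 19.000 [get_ports ref_clk]   ;# pretend the 50 MHz input is ~52.6 MHz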
Another thing to think about: depending on the technology you are building for, you need to specify the device part number correctly, because of the speed grade. You might have a slow part but be compiling for a fast one, and that will wreak havoc. Some vendors guard against loading the wrong image onto the wrong part, but some will let you hang yourself.
Finally, check your I/O signaling standard and termination specs. If you think signal integrity is at play here, these will have to be considered. Make sure you are not using LVDS or CML where you should be using HSSL, check whether you need pull-ups on the signaling pins, etc.

Related

6502 cycle timing per instruction

I am writing my first NES emulator in C. The goal is to make it easily understandable and cycle accurate (it does not necessarily have to be code-efficient), in order to play games at normal 'hardware' speed. Digging into the technical references for the 6502, it seems the instructions take more than one CPU cycle, and the cycle counts also differ depending on conditions (such as branching). My plan is to create read and write functions, and to group opcodes by addressing mode using a switch.
The question is: When I have a multiple-cycle instruction, such as a BRK, do I need to emulate what is exactly happening in each cycle:
#Method 1
cycle - action
1 - read BRK opcode
2 - read padding byte (ignored)
3 - store high byte of PC
4 - store low byte of PC
5 - store status flags with B flag set
6 - low byte of target address
7 - high byte of target address
...or can I just execute all the required operations in one 'cycle' (one switch case) and do nothing in the remaining cycles?
#Method 2
1 - read BRK opcode,
read padding byte (ignored),
store high byte of PC,
store low byte of PC,
store status flags with B flag set,
low byte of target address,
high byte of target address
2 - do nothing
3 - do nothing
4 - do nothing
5 - do nothing
6 - do nothing
7 - do nothing
Since both methods consume the desired 7 cycles, will there be no difference between the two? (accuracy-wise)
Personally I think method 1 is the way-to-go solution, however I cannot think of a proper, easy way to implement it... (Please help!)
Do you 'need' to? It depends on the software. Imagine the simplest example:
STA ($56), Y
... which happens to hit a hardware register. If you don't do at least the write on the correct cycle then you've introduced a timing deficiency. The register you're writing to will be written to at the wrong time. What if it's something like a palette register, and the programmer is running a raster effect? Then you've just moved where the colour changes. You've changed the graphical output.
In practice, clever programmers do much smarter things than that — e.g. one might use a read-modify-write operation to read a hardware value at an exact cycle, modify it, then write it back at some other exact cycle.
So my answer is:
most software isn't written so that the difference between (1) and (2) will have any effect; but
some definitely is, because the author was very clever; and
some definitely is, just because the author experimented until they found a cool effect, regardless of whether they were cognisant of the cause; and
in any case, when you find something that doesn't work properly on your emulator, how much time do you want to spend considering all the permutations and combinations of potential causes? Every one you can factor out is one less to consider.
Most emulators used to use your method (2). What normally happens is that they work with 90% of software. Then there are a few cases that don't work, for which the emulator author puts in a special case here, a special case there. Those special cases usually end up interacting poorly, and the emulator spends the rest of its life oscillating between supporting different 95% subsets of the available software until somebody writes a better one.
So just go with method (1). It will cause some software that would otherwise be broken not to be so. Also it'll teach you more, and it'll definitely eliminate any potential motivation for special cases so it'll keep your code cleaner. It'll be marginally slower but I think your computer can probably handle it.
Other tips: the 6502 has only a few addressing modes, and the addressing mode entirely dictates the timing. This document is everything you need to know for perfect timing. If you want perfect cleanliness, your switch table can just pick an addressing mode and a central operation, then exit, and you can branch on the addressing mode to perform the main action.
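To make method (1) concrete, here is one way to structure a per-cycle core in C (a sketch only; cpu_t, bus_read and bus_write are placeholder names, not from the question, and only BRK is shown):

#include <stdint.h>

typedef struct {
    uint16_t pc;
    uint8_t  s, p;        /* stack pointer, status register */
    uint8_t  opcode;
    int      cycle;       /* which cycle of the current instruction we are on */
} cpu_t;

uint8_t bus_read(uint16_t addr);             /* every cycle is a read or a write */
void    bus_write(uint16_t addr, uint8_t v);

/* Advance the CPU by exactly one cycle. */
void cpu_tick(cpu_t *c)
{
    if (c->cycle == 0) {                     /* cycle 1: opcode fetch */
        c->opcode = bus_read(c->pc++);
        c->cycle = 1;
        return;
    }
    switch (c->opcode) {
    case 0x00:                               /* BRK, cycles 2..7 */
        switch (c->cycle++) {
        case 1: bus_read(c->pc++);                          break; /* padding byte (ignored) */
        case 2: bus_write(0x0100 + c->s--, c->pc >> 8);     break; /* push PCH */
        case 3: bus_write(0x0100 + c->s--, c->pc & 0xff);   break; /* push PCL */
        case 4: bus_write(0x0100 + c->s--, c->p | 0x10);    break; /* push P with B set */
        case 5: c->pc = (c->pc & 0xff00) | bus_read(0xfffe);
                c->p |= 0x04;                               break; /* vector low, set I */
        case 6: c->pc = (c->pc & 0x00ff) | ((uint16_t)bus_read(0xffff) << 8);
                c->cycle = 0;                               break; /* vector high; done */
        }
        break;
    /* ...other opcodes follow the same pattern, grouped by addressing mode... */
    }
}

An outer loop then just calls cpu_tick() once per emulated clock, so video and audio timing can be interleaved at cycle granularity.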
If you're going to use plain read and write methods, which is smart on a 6502 since every single cycle is either a read or a write (so that's almost all you need to express), just be careful about the method signatures. For example, the 6502 has a SYNC pin which allows an observer to distinguish an ordinary read from an opcode fetch. Check whether the NES exposes that to cartridges: on systems that do expose it, it's often used for implicit paging, and the main identifying characteristic of the NES is that there are hundreds of paging schemes.
EDIT: minor updates:
it's not actually completely true to say that a 6502 always reads or writes; it also has an RDY input. If RDY is asserted while the 6502 intends to read, it will instead halt while maintaining the intended read address. It's rarely used in practice because it's insufficient for common tasks like letting something else take possession of memory (the 6502 will write regardless of RDY; it's really meant to help with single-stepping), and it seemingly isn't included on the NES cartridge pinout, so you needn't implement it for that machine.
per the same pinout, the SYNC signal also doesn't seem to be exposed to cartridges on that system.

Is there a standard constant *nix benchmark, and if not, how to make a `bogobench`?

Ordinary single-threaded *nix programs can be benchmarked with utilities like time, e.g.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but the numbers returned are always system-dependent, so in a sense they also measure the user's hardware.
Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system (or at least a reasonably large subset of systems)? Just as grep -m1 bogo /proc/cpuinfo returns a rough but stable unit, such a benchmark should return a similarly stable unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPs):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image bootimage.iso, in a directory where bogo.sh would autorun and then promptly shut down the emulator, outputting some form of timing data along the way to parse into bogo-seconds.
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
Online we sometimes compare and contrast timing-based benchmarks made on machine X with one made on machine Y. Whereas I'd like both user X and Y to be able to do their benchmark on a virtual machine Z, with bonus points for emulating X or Y (like MAME) if need be, except with no consideration of X or Y's real run-time, (unlike MAME where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware, a time based benchmark can be unreliable, as often the user can't be sure some background process, (or bug, or hardware error like a bad sector, or virus), might not be degrading some aspect of performance. Whereas a more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because it's extremely complex and despite a lot of stuff being known about x86 microarchitecture internals (Agner Fog's stuff, Intel's and AMD's own optimization guides, and other stuff in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade-secrets that it's at best possible to simulate something similar. (E.g. branch prediction is definitely one of the most secret but highly important parts).
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.
For [fastest-code] challenges over on the Programming Puzzles & Code Golf site, @orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with stuff like print-to-stdout-by-storing-to-0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code is taking the same code paths every time it is run, the instructions retired count should be the same on all hardware¹.
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
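For instance, sticking with the nominal 1 GHz / 1.0-CPI machine, a rough sketch (./foo bar.data stands in for the command under test; the :u suffix restricts the count to user space, and error handling is omitted):

# instructions retired -> "bogoseconds" at a nominal 1 GHz, CPI = 1.0
perf stat -x, -e instructions:u -- ./foo bar.data 2>&1 >/dev/null |
  awk -F, '/instructions/ { printf "%.2f bogo-seconds\n", $1 / 1e9 }'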
To patch the holes then you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache-misses in a hardware invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache miss and branch miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
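Along those lines, a cachegrind-based run might look like this (the weights in the comment are illustrative guesses at "typical" penalties, not measured values):

# Simulate caches and (optionally) branch prediction machine-independently,
# then plug the counts into a weighted formula of your choosing, e.g.
#   bogo_cycles = Ir + 10*D1mr + 200*LLmr + 15*Bcm
valgrind --tool=cachegrind --branch-sim=yes ./foo bar.data
cg_annotate cachegrind.out.*       # per-function breakdown of the counters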
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart any of the existing x86² emulators and add your simple machine model right in there. You don't need to be cycle-accurate; just pick a nominal width and model it. Whatever underlying emulation cachegrind is doing would probably be a good match, and you get the cache and branch-prediction modelling for free already.
¹ Of course, this doesn't rule out bugs or inaccuracies in the instruction counting mechanism.
² You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.

request_irq succeeds but interrupt is never detected

I am running embedded Linux 3.2.6 on an ARM processor. I am using a modified version of Atmel's serial driver to control the 4 USART ports on my device. When I use the driver compiled into the kernel, everything works fine. But I want to run the driver as a kernel module instead. I made all of the necessary changes, disabled the internal driver, and everything seems fine: the 4 tty devices are registered successfully and I can see that all of my probe and initialization functions work correctly.
So here's the problem:
When I try to write to any of the devices, my "start transmit" function gets called but then waits for an interrupt from the USART which never occurs. So the write just hangs, and using a logic analyzer I can see that RTS gets asserted but no bytes show up on the TX line. I know that my call to request_irq succeeds, and yet I never see any of the IRQ entries in /proc/interrupts. In the driver, I have also tried using request_irq to register a separate interrupt handler for a GPIO line, and this works fine.
I know that this is a problem that is probably hard to diagnose, but I am looking for ANY possible suggestions that could lead me in the right direction to finding a solution. Let me know if you need any clarifications. Thank you
The symptoms read like a peripheral clock that has not been enabled (or has been turned off): the device can be initialized without errors and an I/O operation can be set up, but the device doesn't do anything; it plays dead. Since no I/O ever starts, you're never going to get an interrupt indicating completion!
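If that is the cause, the fix in the module's probe path amounts to something like this (a sketch against the clk API of that kernel generation; the "usart" clock name is a guess and depends on the platform's clock table):

#include <linux/clk.h>
#include <linux/err.h>
#include <linux/platform_device.h>

static int my_usart_probe(struct platform_device *pdev)
{
        struct clk *clk;

        /* Look up the USART's peripheral clock and gate it on before
         * touching any registers; without this the block plays dead. */
        clk = clk_get(&pdev->dev, "usart");
        if (IS_ERR(clk))
                return PTR_ERR(clk);
        clk_enable(clk);

        /* ... map registers, register the uart_port, request the IRQ ... */
        return 0;
}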
The other thing to check is the conditional compilation directives for the HW configuration structures in your arch/arm/mach-xxx/zzz_devices.c file.
Make sure that the serial port structures have something like:
#if defined(CONFIG_SERIAL_ATMEL) || defined(CONFIG_SERIAL_ATMEL_MODULE)
and not just
#if defined(CONFIG_SERIAL_ATMEL)
Addendum
I could be wrong but the clock shouldn't have any effect on the CTS pin causing an interrupt, right?
Not right.
These digital circuits are synchronous state machines: without a clock, a change of state on an input cannot be processed.
Also, SoCs and modern microcontrollers use the peripheral clocks as on/off switches for their integrated peripherals. There is often far more functionality, i.e. more peripherals, on the silicon than can actually be used, mostly because there are not enough pins on the package, so disabling the clocks of unused devices is employed to reduce power consumption.
You are far too focused on interrupts.
You do not have a solvable interrupt problem; those are secondary failures.
The lack of output when attempting to transmit is far more significant and revealing.
The root cause is probably a flawed configuration of the USART devices, since transmitting bits is an automatic operation for a configured & operational USART.
If the difference between not-working versus working is loadable module versus static linking, then the root cause is going to be something fundamental (and trivial) like my two suggestions.
Also your lack of acknowledgement regarding the #if defined(), e.g. you didn't respond with "Oh yeah, we already knew that", raises a gigantic red flag that says "Fix me first!"
Addendum 2
I'm tempted to delete this answer after discovering that the Atmel serial driver cannot be configured/built as a loadable module using make menuconfig (which is the premise for half of the answer). (Of course the Kconfig file could be hacked to make the config variable tristate instead of boolean to overcome the module restriction.) I've left a comment for the OP. But I also wanted to preserve the comment to Mr. Stratton pointing out how symbols in the .config file are (not) used.
So I did finally fix my problem. Thank you for the responses; none of them directly solved it, but they did prompt further examination of my code. After some trial and error I finally got it working. I had originally moved the platform_device structures for each USART from /mach-at91/xxx_devices.c into my loadable module. For some reason the structures weren't getting the correct data to map to the hardware, I suppose because the symbols weren't being linked correctly against the kernel (I never got an error message though), and so some of the registration functions weren't even getting called. I ended up moving the structures and the platform_device_register calls back into the devices file. I also decided to keep the console driver built-in, using the original atmel_serial.c driver. I had to change the console's platform_device name in both the devices file and the built-in atmel_serial.c so that it would not conflict with my USART-ports driver. I found that changing the platform_device and platform_driver name for the USARTs to anything but "atmel_usart" resulted in USART transmission failing. I really don't understand why, but I'm just leaving it as "atmel_usart" so it works.
Thanks again to everybody who responded to my problem.

Question about cycle counting accuracy when emulating a CPU

I am planning on creating a Sega Master System emulator over the next few months, as a hobby project in Java (I know it isn't the best language for this, but I find it very comfortable to work in, and as a frequent user of both Windows and Linux I thought a cross-platform application would be great). My question concerns cycle counting.
I've looked over the source code of another Z80 emulator, and of other emulators as well, and the execute loop in particular intrigues me: when it is called, an int is passed as an argument (let's say 1000 as an example). Now I get that each opcode takes a different number of cycles to execute, and that as these are executed, the number of cycles is decremented from the overall figure. Once the number of cycles remaining is <= 0, the execute loop finishes.
My question is that many of these emulators don't account for the fact that the last instruction executed can push the number of cycles to a negative value, meaning that between execution loops one may end up with, say, 1002 cycles executed instead of 1000. Is this significant? Some emulators compensate for this on the next execute loop and some don't; which approach is best? Allow me to illustrate my question, as I'm not particularly good at putting myself across:
public void execute(int numOfCycles)
{   // this is an execution loop method, called with 1000
    while (numOfCycles > 0)
    {
        instruction = readInstruction();
        switch (instruction)
        {
            case 0x40:
                // do whatever the opcode does, then subtract its cycle cost
                numOfCycles -= 5;
                break;
            // let's say, for argument's sake, this case is executed
            // when numOfCycles is 3
        }
    }
}
After the end of this particular looping example, numOfCycles would be at -2. This will only ever be a small inaccuracy, but does it matter overall, in people's experience? I'd appreciate anyone's insight on this one. I plan to interrupt the CPU after every frame, as this seems appropriate, so I know 1000 cycles is low; it's just an example though.
Many thanks,
Phil
Most emulators/simulators deal just with CPU clock ticks
That is fine for games etc. You have some timer (or whatever), run the CPU simulation until it has covered the duration of that timer period, and then sleep until the next timer interval occurs. This is very easy to implement. You can decrease the timing error with the approach you are asking about, but as said, for games this is usually unnecessary.
This approach has one significant drawback: your code runs during only a fraction of real time. If the timer interval (timing granularity) is big enough, this can be noticeable even in games; for example, if you hit a keyboard key while the emulation is sleeping, it is not detected (keys sometimes don't work). You can remedy this by using a smaller timing granularity, but on some platforms that is very hard. In that case the timing error can be more "visible" in software-generated sound (at least to those people who can hear such things and are not deaf-ish to them like me).
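Carrying the overshoot from the question into the next timer slice costs only a couple of lines; a C-style sketch (cpu_step() and CYCLES_PER_SLICE are placeholders, not from the question):

static int carry = 0;                  /* cycles overshot in the previous slice */

void run_slice(void)                   /* called once per timer interval */
{
    int budget = CYCLES_PER_SLICE - carry;
    while (budget > 0)
        budget -= cpu_step();          /* returns the cycles taken by one opcode */
    carry = -budget;                   /* 0, or the amount we ran over (e.g. 2) */
}

The long-run average then stays exact even though individual slices still overshoot slightly.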
If you need something more sophisticated
For example, if you want to connect real HW to your emulation/simulation, then you need to emulate/simulate buses. Also, things like a floating bus or system contention are very hard to add to approach #1 (doable, but with big pain).
If you port the timing and emulation to machine cycles, things get much, much easier, and suddenly things like contention, HW interrupts and floating buses almost solve themselves. I ported my ZX Spectrum Z80 emulator to this kind of timing and saw the light. Many things became obvious (like errors in Z80 opcode documentation, timings, etc.). Contention also became very simple (just a few lines of code instead of horrible decoding tables with an entry for almost every instruction type). HW emulation got pretty easy too: I added things like FDC controllers and AY chip emulation to the Z80 this way (no hacks, it really runs their original code... even floppy formatting :)), so no more tape-loading hacks that break on custom loaders like turbo ones.
To make this work I created my Z80 emulation/simulation so that it uses something like microcode for each instruction. As I very often corrected errors in the Z80 instruction set (there is no single 100% correct document out there that I know of, even if some claim to be bug-free and complete), I came up with a way to deal with this without painfully reprogramming the emulator.
Each instruction is represented by an entry in a table, with info about timing, operands, functionality, and so on. The whole instruction set is a table of these entries for all instructions. I then built a MySQL database of my instruction set, formed similar tables for each instruction set I found, and painfully compared them all, selecting/repairing what was wrong and what was correct. The result is exported to a single text file which is loaded at emulation startup. It sounds horrible, but in reality it simplifies things a lot and even speeds up the emulation, because instruction decoding is now just following pointers. An example instruction-set data file can be found in What's the proper implementation for hardware emulation.
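The shape of such a table entry is roughly this (a sketch; the field names and the dispatch loop are illustrative, not the author's actual file format):

#include <stdint.h>

typedef struct {                        /* one entry per opcode */
    const char *mnemonic;               /* e.g. "LD A,(HL)" */
    int         m_cycles;               /* machine cycles */
    int         t_states;               /* clock ticks */
    void      (*execute)(void);         /* the instruction's "microcode" */
} op_entry;

extern op_entry optable[256];           /* filled from the data file at startup */
uint8_t fetch_opcode(void);             /* placeholder fetch/clock routines */
void    advance_clock(int t_states);

void step(void)                         /* decoding is now just a table lookup */
{
    uint8_t op = fetch_opcode();
    optable[op].execute();
    advance_clock(optable[op].t_states);
}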
A few years back I also published a paper on this (sadly the institution that held that conference no longer exists, so the servers hosting those old papers are down for good; luckily I still have a copy). Here is an image from it that describes the problem:
a) Full throttle: no synchronization, just raw speed
b) Approach #1: big gaps, causing HW synchronization problems
c) Approach #2: needs to sleep a lot with very small granularity (which can be problematic and slow things down), but the instructions are executed very near their real time...
The red line is the host CPU processing speed (obviously whatever is above it takes a bit more time, so it should be cut and inserted before the next instruction, but that would be hard to draw properly).
The magenta line is the emulated/simulated CPU processing speed.
Alternating green/blue colors represent successive instructions.
Both axes are time.
[edit1] more precise image
The one above was hand-painted... This one is generated by a VCL/C++ program:
It was generated with these parameters:
const int iset[]={4,6,7,8,10,15,21,23}; // possible timings [T]
const int n=128,m=sizeof(iset)/sizeof(iset[0]); // number of instructions to emulate, size of iset[]
const int Tps_host=25; // max possible simulation speed [T/s]
const int Tps_want=10; // wanted simulation speed [T/s]
const int T_timer=500; // simulation timer period [T]
So the host can simulate at 250% of the wanted speed, and the simulation granularity is 500 T. The instructions were generated pseudo-randomly...
There was quite an interesting article on Ars Technica recently about console emulation; it also links to quite a few emulators that might make for good research:
Accuracy takes power: one man's 3GHz quest to build a perfect SNES emulator
The relevant bit is that the author mentions, and I am inclined to agree, that most games will appear to function pretty correctly even with timing deviations of +/-20%. The issue you mention looks likely to never really introduce more than a fraction of a percent timing error, which is probably imperceptible whilst playing the final game. The authors probably didn't consider it worth dealing with.
I guess that depends on how accurate you want your emulator to be. I do not think it has to be that accurate. Think of emulating the x86 platform: there are so many processor variants, and each has different execution latencies and issue rates.

Is forcing I2C communication safe?

For a project I'm working on I have to talk to a multi-function chip via I2C. I can do this from Linux user space via the /dev/i2c-1 interface.
However, it seems that a kernel driver is talking to the same chip at the same time. This results in my I2C_SLAVE accesses failing with an errno value of EBUSY. I can override this via the I2C_SLAVE_FORCE ioctl. I tried it, and it works: my commands reach the chip.
Question: is it safe to do this? I know for sure that the address ranges I write to are never accessed by any kernel driver. However, I am not sure whether forcing I2C communication this way might confuse some internal state machine or the like. (I'm not that into I2C, I just use it...)
For reference, the hardware facts:
OS: Linux
Architecture: TI OMAP3 3530
I2C-Chip: TWL4030 (does power, audio, usb and lots of other things..)
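The user-space access in question boils down to something like this (a minimal sketch with error checking omitted; the 0x48 slave address and the register/value pair are placeholders, not the actual TWL4030 registers involved):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/i2c-dev.h>

int main(void)
{
    int fd = open("/dev/i2c-1", O_RDWR);

    /* I2C_SLAVE fails with EBUSY because a kernel driver owns the address,
     * so bind to the (placeholder) address 0x48 by force: */
    ioctl(fd, I2C_SLAVE_FORCE, 0x48);

    /* Write register 0x12 = 0x34: register number first, then the value. */
    unsigned char buf[2] = { 0x12, 0x34 };
    write(fd, buf, sizeof buf);

    close(fd);
    return 0;
}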
I don't know that particular chip, but often you have commands that require a sequence of writes: first to one address to set a certain mode, then you read or write another address, where the function of the second address changes based on what you wrote to the first one. So if the driver is in the middle of one of those operations and you interrupt it (or vice versa), you have a race condition that will be difficult to debug. For a reliable solution, you had better communicate through the chip's driver...
I mostly agree with @Wim, but I would like to add that this can definitely cause irreversible problems, or even destruction, depending on the device.
I know of a gyroscope (the L3GD20) that requires that you not write to certain locations. The way the chip is set up, these locations contain the manufacturer's settings which determine how the device functions and performs.
This may seem like an easy problem to avoid, but if you think about how I2C works, all of the bytes are passed one bit at a time. If you interrupt in the middle of the transmission of another byte, results can not only be truly unpredictable, but they can also increase the risk of permanent damage exponentially. This is, of course, entirely up to the chip on how to handle the problem.
Since microcontrollers tend to operate at speeds much faster than the bus speeds allowed on I2C, and since the bus speeds themselves are dynamic based on the speeds at which devices process the information, the best bet is to insert pauses or loops between transmissions that wait for things to finish. If you have to, you can even insert a timeout. If these pauses aren't working, then something is wrong with the implementation.
