Variable substitution faster than in-line integer in Vic-20 basic? - basic

The following two (functionally equivalent) programs are taken from an old issue of Compute's Gazette. The primary difference is that program 1 puts the target base memory locations (7680 and 38400) in-line, whereas program 2 assigns them to a variable first.
Program 1 runs about 50% slower than program 2. Why? I would think that the extra variable retrieval would add time, not subtract it!
10 PRINT"[CLR]":A=0:TI$="000000"
20 POKE 7680+A,81:POKE 38400+A,6:IF A=505 THEN GOTO 40
30 A=A+1:GOTO 20
Program 1
10 PRINT "[CLR]":A=0:B=7680:C=38400:TI$="000000"
20 POKE B+A,81:POKE C+A,6:IF A=505 THEN GOTO 40
30 A=A+1:GOTO 20
Program 2

The reason is that BASIC is fully interpreted here, so the strings "7680" and "38400" need to be converted to binary numbers EVERY TIME line 20 is reached (506 times in this program). In program 2, they're converted once and stored in B and C. So as long as the search-for-and-fetch of B and C is faster than convert-string-to-binary, program 2 will be faster.
If you were to use a BASIC compiler (not sure if one exists for VIC-20, but it would be a cool retro-programming project), then the programs would likely be the same speed, or perhaps program 1 might be slightly faster, depending on what optimizations the compiler did.

It's from page 76 of this issue:
I used to love this magazine. It actually says a 30% improvement. Look at what's happening in program 2 and it becomes clear: because you are looping a lot, using variables means the program does all the memory allocation up front for calculating the memory addresses. With the slower approach, each iteration has to allocate memory for the constants in the statement below as part of calculating the memory address:
POKE 7680+A,81:POKE 38400+A
This is just the nature of the BASIC Interpreter on the VIC.

Accessing the first defined variable will be fast; the second will be a little slower, etc. Parsing multi-digit constants requires the interpreter to perform repeated multiplication by ten. I don't know what the exact tradeoffs are between variables and constants, but short variable names use less space than multi-digit constants. Incidentally, the constant zero may be parsed more quickly if written as a single decimal point (with no digits) than written as a digit zero.
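To make the "repeated multiplication by ten" point concrete, here is a rough sketch in Python (not VIC-20 BASIC). It only counts the multiply steps needed to parse the two constants on every pass through line 20. The function name parse_decimal and the counting scheme are my own assumptions for illustration; the real interpreter works in floating point and has other costs, such as the variable lookup, that this ignores.

```python
def parse_decimal(text):
    """Parse a string of decimal digits the way a simple interpreter might.

    Returns (value, multiplications): one multiply-by-ten per digit.
    """
    value = 0
    multiplications = 0
    for ch in text:
        value = value * 10 + (ord(ch) - ord("0"))
        multiplications += 1
    return value, multiplications


ITERATIONS = 506                 # line 20 runs for A = 0 .. 505
constants = ["7680", "38400"]    # the in-line constants from program 1

# Multiplications needed to parse both constants once.
per_pass = sum(parse_decimal(c)[1] for c in constants)

# Program 1 re-parses the constants on every pass; program 2 parses them once.
total_inline = per_pass * ITERATIONS
total_with_variables = per_pass

print(total_inline, total_with_variables)
```

Whether this outweighs the variable search on real hardware is exactly the tradeoff mentioned above, so treat the counts as an illustration and not a timing model.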


Time Complexity on Linux Shell

I created this simple algorithm to study its complexity. (Image: Algorithm 1)
I studied its complexity, O(n^3), and timed it using the "time" command in the Linux shell. (Image: timing the execution of Algorithm 1)
So I decided to increase the complexity to about O(n^5). (Image: Algorithm 2)
But when I use the time command, the times don't increase as I expected. (Image: timing the execution of Algorithm 2)
I think the time is not increasing because of the values of n.
Basically, when n is high, the variable named 'a' inside the 'f' function can't hold such a high number.
Using a debugger and n=1000000000, I got a seemingly random number as a result (-1486618624), because the maximum number that an int can hold in C++ is typically 2147483647. Maybe this is why you're not seeing any changes. You can try other types like long long, which will serve you better in this case.
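As a rough illustration of the wraparound, here is a small Python sketch that reduces a value to a 32-bit two's-complement int. The helper name wrap_int32 is made up for this example. Note that signed overflow is formally undefined behaviour in C++, so wrapping is only what typically happens in practice, not something the language promises.

```python
def wrap_int32(value):
    """Reduce an arbitrary integer to the range of a 32-bit signed int."""
    value &= 0xFFFFFFFF                 # keep only the low 32 bits
    if value >= 0x80000000:             # top bit set means negative
        value -= 0x100000000
    return value


n = 1000000000
print(wrap_int32(2147483647))      # largest value that still fits
print(wrap_int32(2147483647 + 1))  # one past the maximum wraps to negative
print(wrap_int32(n * n))           # a product that is far too large for 32 bits
```

The last line gives -1486618624, the same value reported above. The original code is not shown here, so that match is only suggestive of a product like n * n overflowing.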

What does the IS_ALIGNED macro in the linux kernel do?

I've been trying to read the implementation of a kernel module, and I'm stumbling on this piece of code.
unsigned long addr = (unsigned long) buf;
if (!IS_ALIGNED(addr, 1 << 9)) {
DMCRIT("#%s in %s is not sector-aligned. I/O buffer must be sector-aligned.", name, caller);
The IS_ALIGNED macro is defined in the kernel source as follows:
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
I understand that data has to be aligned along the size of a datatype to work, but I still don't understand what the code does.
It left-shifts 1 by 9, then subtracts 1, which gives 111111111 in binary. Then 111111111 is bitwise-ANDed with x.
Why does this code work? How is this checking for byte alignment?
In systems programming it is common to need a memory address to be aligned to a certain number of bytes -- that is, several lowest-order bits are zero.
Basically, IS_ALIGNED(addr, 1 << 9) checks whether addr is on a 512-byte (2^9) boundary (the last 9 bits are zero), and the leading ! makes the if branch run when it is not. This is a common requirement when erasing flash locations because flash memory is split into large blocks which must be erased or written as a single unit.
Another application of this I ran into: I was working with a certain DMA controller which has a modulo feature. Basically, that means you can allow it to change only the last several bits of an address (destination address in this case). This is useful for protecting memory from mistakes in the way you use a DMA controller. Problem is, I initially forgot to tell the compiler to align the DMA destination buffer to the modulo value. This caused some incredibly interesting bugs (random variables that have nothing to do with the thing using the DMA controller being overwritten... sometimes).
As far as "how does the macro code work?", if you subtract 1 from a number that ends with all zeroes, you will get a number that ends with all ones. For example, 0b00010000 - 0b1 = 0b00001111. This is a way of creating a binary mask from the integer number of required-alignment bytes. This mask has ones only in the bits we are interested in checking for zero-value. After we AND the address with the mask containing ones in the lowest-order bits we get a 0 if and only if the lowest 9 (in this case) bits are zero.
"Why does it need to be aligned?": This comes down to the internal makeup of flash memory. Erasing and writing flash is a much less straightforward process than reading it, and typically it requires higher-than-logic-level voltages to be supplied to the memory cells. The circuitry required to make write and erase operations possible with a one-byte granularity would waste a great deal of silicon real estate only to be used rarely. Basically, designing a flash chip is a statistics and tradeoff game (like anything else in engineering) and the statistics work out such that writing and erasing in groups gives the best bang for the buck.
At no extra charge, I will tell you that you will be seeing a lot of this type of thing if you are reading driver and kernel code. It may be helpful to familiarize yourself with the contents of this article (or at least keep it around as a reference).
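For anyone who wants to play with the mask arithmetic from this answer, here is a minimal Python sketch of the same check. The function is_aligned and the sample addresses are mine, not kernel code, and the trick only works when the alignment is a power of two.

```python
def is_aligned(x, a):
    """Return True if x is a multiple of a, where a must be a power of two."""
    return (x & (a - 1)) == 0


SECTOR = 1 << 9        # 512, as in the question
mask = SECTOR - 1      # 0b111111111: ones in the nine lowest bits

for addr in (0x1000, 0x1200, 0x1234):   # hypothetical buffer addresses
    print(hex(addr), bin(addr & mask), is_aligned(addr, SECTOR))
```

The first two addresses are multiples of 512, so the AND leaves nothing behind. The third has some of its low nine bits set, so the check fails.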

Why is string manipulation more expensive?

I've heard this so many times, that I have taken it for granted. But thinking back on it, can someone help me realize why string manipulation, say comparison etc, is more expensive than say an integer, or some other primitive?
8bit example:
1 bit can be 1 or 0. With 2 bits you can represent 0, 1, 2, and 3. And so on.
With a byte you have 2^8 possibilities, from 0 to 255.
In a string a single letter is stored in a byte, so "Hello world" is 11 bytes.
If I want to do 100 + 100, 100 is stored in 1 byte of memory, I need only two bytes to sum two numbers. The result will need again 1 byte.
Now let's try with strings, "100" + "100", this is 3 bytes plus 3 bytes and the result, "100100" needs 6 bytes to be stored.
This is over-simplified, but more or less it works in this way.
The int data type in C# was carefully selected to be a good match with processor design. The processor can store an int in a CPU register, a storage location that's an easy factor of 3 faster than memory, and it takes a single CPU instruction to compare values of type int. The CMP instruction runs in less than a single CPU cycle, a fraction of a nanosecond.
That doesn't work nearly as well for a string, it is a variable length data type and every single char in the string must be compared to test for equality. So it is automatically proportionally slower by the size of the string. Furthermore, string comparison is afflicted by culture dependent comparison rules. The kind that make "ss" and "ß" equal in German and "Aa" and "Å" equal in Danish. Nothing subtle to deal with, taken care of by highly optimized table-driven code inside the CLR. It can't beat CMP.
I've always thought it was because of the immutability of strings. That is, every time you make a change to the string, it requires allocating memory for a whole new string (rather than modifying the original in place).
Probably a woefully naive understanding but perhaps someone else can expound further.
There are several things to consider when looking at the "cost" of manipulating strings.
There is the cost in terms of memory usage, there is the cost in terms of CPU cycles used, and there is a cost associated with the complexity of the code involved.
Integer manipulation (Add, Subtract, Multiply, Divide, Compare) is most often done by the CPU at the hardware level, in few (or even 1) instructions. When the manipulation is done, the answer fits back in the same size chunk of memory.
Strings are stored in blocks of memory, which have to be manipulated a byte or word at a time. Comparing two 100 character long strings may require 100 separate comparison operations.
Any manipulation that makes a string longer will require, either moving the string to a bigger block of memory, or moving other stuff around in memory to allow growing the existing block.
Any manipulation that leaves the string the same, or smaller, could be done in place, if the language allows for it. If not, then again, a new block of memory has to be allocated and contents moved.
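To illustrate the point that comparing two 100 character strings may need 100 separate comparisons, here is a naive Python sketch that counts them. The function equal_with_count is purely illustrative. Real runtimes compare several bytes at a time and may apply culture rules, so the counts only show how the work grows with length.

```python
def equal_with_count(a, b):
    """Compare two strings one character at a time.

    Returns (equal, comparisons), where comparisons is the number of
    character comparisons performed before the answer was known.
    """
    if len(a) != len(b):
        return False, 0          # lengths differ: no characters examined
    comparisons = 0
    for x, y in zip(a, b):
        comparisons += 1
        if x != y:
            return False, comparisons
    return True, comparisons


same = "a" * 100
print(equal_with_count(same, "a" * 100))        # every character is checked
print(equal_with_count(same, "a" * 99 + "b"))   # differs only at the very end
print(equal_with_count("xbc", "abc"))           # differs at the first character
```

Comparing two ints, by contrast, is one operation no matter what the values are.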

Ada : Variant size in record type

I'm having some trouble with record types in Ada.
I'm using Sequential_IO to read a binary file. To do that I have to use a type whose size evenly divides the file's size. In my case I need a structure of 50 bytes, so I created a type like this ("Vecteur" is an array of 3 Float):
type Double_Byte is mod 2 ** 16;
for Double_Byte'Size use 16;
type Triangle is record
   Normal : Vecteur(1..3);
   P1 : Vecteur(1..3);
   P2 : Vecteur(1..3);
   P3 : Vecteur(1..3);
   Byte_count1 : Double_Byte;
end record;
When I use the type Triangle, the size is 52 bytes, but when I take the size of each component separately, they add up to 50 bytes. Because my file's size is not a multiple of 52, I get execution errors. I don't know how to fix this size. I ran some tests and I think it comes from Double_Byte, because when I removed it from the record I found a size of 48 bytes, and when I put it back it's 52 bytes again.
Thank you for your help.
Given Simon's latest comment, it may be impossible to do this portably using Sequential_IO; namely, reading the file on some machines (which don't support unaligned accesses) may leave half its contents unaligned and therefore liable to fail when you access them.
I can't help feeling that a better solution is to divorce the file format (which is fixed by compatibility with other systems) from the machine format (which is not). And therefore moving to Stream_IO and writing your own Read and Write primitives where necessary (e.g. to pack the odd sized Double_Byte component into 2 bytes, whatever its representation in memory) would be a more robust solution.
Then you can guarantee a file format compatible with other systems, and an internal memory format guaranteed to work.
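To show the idea of separating the file format from the in-memory format, here is a minimal sketch in Python, not Ada. It decodes one 50-byte record explicitly using the struct module. The little-endian byte order, the 32-bit floats, and the name read_triangle are all assumptions on my part, since the question does not state them.

```python
import struct

# Assumed on-disk layout: 12 little-endian 32-bit floats followed by one
# unsigned 16-bit integer, with no padding between or after the fields.
TRIANGLE = struct.Struct("<12fH")


def read_triangle(buf, offset=0):
    """Decode one record from buf into plain Python values."""
    fields = TRIANGLE.unpack_from(buf, offset)
    return {
        "normal": fields[0:3],
        "p1": fields[3:6],
        "p2": fields[6:9],
        "p3": fields[9:12],
        "byte_count": fields[12],
    }


# Round-trip a made-up record to check the layout.
record = TRIANGLE.pack(*[float(i) for i in range(12)], 7)
print(TRIANGLE.size, read_triangle(record))
```

The record on disk is always 50 bytes, whatever layout the program prefers for its own objects. Custom Read and Write primitives with Stream_IO would give you the same separation in Ada.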
The compiler is in no way obligated to use a specific size for Triangle unless you specify it. As you don't, it chooses whatever size it sees fit for fast access to the data. Even if you specify representation details for every component type of the record, the compiler might still choose to use more space for the record itself than necessary.
Considering the sizes you give, it seems obvious that one component of Vecteur has 4 bytes, which gives a total payload of 50 bytes for Triangle. The compiler now chooses to add 2 bytes padding, so that the record size is a multiple of the size of a 4-byte word. You can override this behavior with:
for Triangle'Size use 50 * 8;
This will force the compiler to use only 50 bytes for the record. As this is a tight fit, there is only one way to represent the record, and no further specification is necessary. If you do need to specify how exactly the record is represented, you can use a record representation clause.
The representation clause specifies the size for the type. However, each object of this type may still take up more space unless you additionally specify
pragma Pack (Triangle);
Edit 2:
After Simon's comment, I had a closer look at this and realized that there is a far better and cleaner solution. Instead of setting the 'Size and using pragma Pack, do this:
for Triangle use record at mod 2;
Normal at 0 range 0 .. 95;
P1 at 12 range 0 .. 95;
P2 at 24 range 0 .. 95;
P3 at 36 range 0 .. 95;
Byte_count1 at 48 range 0 .. 15;
end record;
The initial mod 2 defines that the record is to be aligned at a multiple of 2 bytes. This eliminates the padding at the end without the need of pragma Pack (which is not guaranteed to work the same way on every compiler).
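The arithmetic behind the 52 versus 50 bytes can be checked with a short Python sketch. It assumes the compiler simply rounds the record size up to a multiple of the record's alignment; the helper padded_size is hypothetical, and a real compiler is free to choose differently.

```python
def padded_size(payload_bytes, alignment):
    """Round payload_bytes up to the next multiple of alignment."""
    return -(-payload_bytes // alignment) * alignment


payload = 4 * 12 + 2    # four vectors of three 4-byte floats, plus 2 bytes

print(padded_size(payload, 4))   # record aligned on 4 bytes
print(padded_size(payload, 2))   # record aligned on 2 bytes ("at mod 2")
print(padded_size(48, 4))        # the record without Byte_count1
```

With 4-byte alignment the 50-byte payload grows to 52, with 2-byte alignment it stays at 50, and without Byte_count1 the 48 bytes need no padding. That matches the sizes reported in the question.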

Fast binary diff for 10 MB files

I have two 10 MB files, and I'd like to find the longest common subsequence with offsets, e.g. the result should look like:
42 bytes at offset 5 of the first file and offset 8 of the second file
85 bytes at offset 100 of the first file and offset 55 of the second file
This is a one-off-task, I have to run it only on a single pair of files.
I don't care about the programming language, but it must run on Linux.
I have tried command-line tools bsdiff and xdelta, but their output diff file format is too complicated to understand, and it lacks any documentation -- so I would have to understand complicated and undocumented C source code to get those results. It would take several hours, and I don't have that much time for this, so I'm giving up on that path.
I have tried Perl module String::LCSS_XS , but that's too slow (it has been running for an hour now), Perl module Algorithm::Diff::XS, but it needs too much memory, and Perl module Algorithm::LCSS, but that's too slow (implemented in Perl). I couldn't find anything useful in Python (the built-in difflib is too slow).
Is there a tool which runs quickly (i.e. less than a few hours) for 10 MB files, and I can convert its output to the format I want in less than an hour of work?
