Is there a way to change the value of an eBPF map incrementally each time the function is called? - linux

I'm currently using eBPF maps. Whenever I try to set the value associated with a key in a hash-table-type map to a variable that I increment at the end of the eBPF program, so that the value grows every time the function is run, the verifier throws an error:
invalid indirect read from stack R3 off -128+6 size 8
processed 188 insns (limit 1000000) max_states_per_insn 1 total_states 11 peak_states 11 mark_read 8
The main goal is to take the value directly and increment it every time the function is run.
I was under the impression that this would work:
bpf_spin_lock(&read_value->semaphore);
read_value->value += 1;
bpf_spin_unlock(&read_value->semaphore);
But this also throws the following error:
R1 type=inv expected=map_value
processed 222 insns (limit 1000000) max_states_per_insn 1 total_states 15 peak_states 15 mark_read 9
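For reference, the shape of the counter I'm after looks roughly like this (a minimal sketch with illustrative map and program names; here the increment uses an atomic add on the looked-up value instead of the spin lock shown above):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} counter_map SEC(".maps");          /* illustrative map name */

SEC("kprobe/do_sys_open")            /* illustrative attach point */
int count_calls(void *ctx)
{
    __u32 key = 0;
    __u64 init = 0;
    __u64 *val;

    val = bpf_map_lookup_elem(&counter_map, &key);
    if (!val) {
        /* First call: create the entry, then look it up again. */
        bpf_map_update_elem(&counter_map, &key, &init, BPF_NOEXIST);
        val = bpf_map_lookup_elem(&counter_map, &key);
        if (!val)
            return 0;
    }
    /* Increment the stored value in place each time the program runs. */
    __sync_fetch_and_add(val, 1);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";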

Related

Linux Fortran OpenMP - accessing global variables from a subroutine called from an OpenMP task

Is it legal/valid to access program global variables from an internal subroutine called from an OpenMP task?
ifort 2021.7.0 20220726 doesn't report an error, but appears to produce random results depending on compiler options. Example:
program test1
  implicit none
  integer :: i, j, g
  g = 42
  !$OMP PARALLEL DEFAULT(SHARED)
  !$OMP SINGLE
  i = 0
  j = 1
  do while (j < 60)
     i = i + 1
     !$OMP TASK DEFAULT(SHARED) FIRSTPRIVATE(i,j)
     call sub(i,j)
     !$OMP END TASK
     j = j + j
  end do
  !$OMP END SINGLE
  !$OMP END PARALLEL
  stop
contains
  subroutine sub(i,j)
    implicit none
    integer i,j
    !$OMP CRITICAL(unit6)
    write(6,*) i,j,g
    !$OMP END CRITICAL(unit6)
  end subroutine sub
end program test1
Compiled with: ifort -o test1 test1.f90 -qopenmp -warn all -check all
Expected result:
5 16 42
4 8 42
6 32 42
3 4 42
2 2 42
1 1 42
Obtained result:
2 2 -858993460
5 16 -858993460
4 8 -858993460
6 32 -858993460
1 1 -858993460
3 4 -858993460
Note: the order of output lines doesn't matter --- just the number in the third column should be 42.
Different unexpected results are obtained by changing compiler options. For example, with "ifort -o test1 test1.f90 -qopenmp -warn all -O0", the third column is 256 and with "ifort -o test1 test1.f90 -qopenmp -O0" it is -740818552.
Of course g could be passed to sub() as an argument, but the program I'm helping to work on has dozens of shared global variables (that don't change in the parallel part), and subroutine calls go several layers deep.
Thanks, Peter McGavin.
Please try the oneAPI compiler package 2022.2 or 2022.3.
/iusers/xtian/temp$ ifx -qopenmp jimtest.f90
/iusers/xtian/temp$ ./a.out
2 2 42
1 1 42
3 4 42
5 16 42
4 8 42
6 32 42

Need pseudo code for producer consumer problem stated below

I need an algorithm/pseudo-code for a producer-consumer problem in which we have one producer and two consumers sharing a bounded buffer (capacity N). Either of the two consumers can consume an item from the buffer if the buffer is not empty; however, the consumers have to wait if the buffer is empty. Moreover, the producer has to wait if the buffer is full. Implement the buffer as a FIFO circular queue.
Also, can someone answer the questions below?
Can all N slots in the buffer be full simultaneously?
If yes, how? If not, why not?
The buffer can be full if the producer adds values to the buffer faster than the consumers extract them. If the consumers extract values as fast as or faster than the producer adds them, the buffer will never fill.
Instead of pseudo-code I will present a general solution using the Ada programming language. This solution works for one or more producers and one or more consumers.
The Ada solution consists of three files. The first file is a package specification which defines the API for a producer-consumer problem.
-----------------------------------------------------------------------
-- Producer-consumer with bounded buffer
-----------------------------------------------------------------------
generic
   Capacity : Positive;
package Bounded_PC is
   task type Producer is
      entry Set_Id(Id : in Positive);
      entry Stop;
   end Producer;

   task type Consumer is
      entry Set_Id(Id : in Positive);
      entry Stop;
   end Consumer;
end Bounded_PC;
This solution requires the size of the shared buffer to be specified as a generic parameter when instantiating the package Bounded_PC.
The package defines two task types (tasks are Ada's name for threads). The first task type is named Producer. The Producer task type defines two entries. The entry Set_Id establishes the programmer-chosen Id for a particular instance of the task type. The entry Stop is used to stop the execution of an instance of the task type.
The task type Consumer has an identical interface, with an entry to set the task Id and an entry to stop the task.
This is the entire API for the producer and consumer tasks in this example.
The implementation of the tasks and their shared buffer is contained in the package body for the Bounded_PC package.
with Ada.Text_IO; use Ada.Text_IO;

package body Bounded_PC is
   subtype Index_T is Positive range 1 .. Capacity;
   type Buf_Array is array (Index_T) of Integer;

   ------------
   -- Buffer --
   ------------

   protected Buffer is
      entry Write(Item : in Integer);
      entry Read(Item : out Integer);
   private
      Buf         : Buf_Array;
      Write_Index : Index_T := 1;
      Read_Index  : Index_T := 1;
      Count       : Natural := 0;
   end Buffer;

   protected body Buffer is
      entry Write(Item : in Integer) when Count < Capacity is
      begin
         Buf(Write_Index) := Item;
         Write_Index := (Write_Index mod Capacity) + 1;
         Count := Count + 1;
      end Write;

      entry Read(Item : out Integer) when Count > 0 is
      begin
         Item := Buf(Read_Index);
         Read_Index := (Read_Index mod Capacity) + 1;
         Count := Count - 1;
      end Read;
   end Buffer;

   --------------
   -- Producer --
   --------------

   task body Producer is
      Value : Integer := 0;
      Me    : Positive;
   begin
      accept Set_Id(Id : in Positive) do
         Me := Id;
      end Set_Id;
      loop
         select
            accept Stop;
            exit;
         else
            select
               Buffer.Write(Value);
               Put_Line("Producer" & Me'Image & " wrote" & Value'Image);
               Value := Value + 1;
            or
               delay 0.001;
            end select;
         end select;
      end loop;
   end Producer;

   --------------
   -- Consumer --
   --------------

   task body Consumer is
      Value : Integer;
      Me    : Positive;
   begin
      accept Set_Id(Id : in Positive) do
         Me := Id;
      end Set_Id;
      loop
         select
            accept Stop;
            exit;
         else
            select
               Buffer.Read(Value);
               Put_Line("Consumer" & Me'Image & " read" & Value'Image);
            or
               delay 0.001;
            end select;
         end select;
      end loop;
   end Consumer;
end Bounded_PC;
Ada allows the programmer to specify the range of values for an array index. The subtype Index_T is a subtype of Integer with a minimum value of 1 and a maximum value of Capacity, which is the generic parameter used to make an instance of this package.
The array type used as the bounded shared buffer in this example is named Buf_Array. Buf_Array is indexed by Index_T and each element is an Integer.
The buffer itself is an Ada protected object. Ada protected objects are implicitly protected from race conditions. The protected object named Buffer has two entries. The Write entry will be called by instances of the Producer task type. The Read entry will be called by instances of the Consumer task type. The Write entry passes an Integer in to the buffer and the Read entry passes an Integer out of the buffer.
The private part of the protected Buffer specification contains the internal data elements of Buffer. Those elements are an instance of Buf_Array named Buf, an instance of Index_T named Write_Index, an instance of Index_T named Read_Index, and an instance of the subtype Natural (an Integer with a minimum value of 0) named Count.
The protected body of Buffer implements the two entries Write and Read.
The Write entry has a boundary condition allowing the Write entry to execute only when the buffer is not full, or as the logic states, when Count is less than Capacity. When the Write entry does execute it assigns the value in the parameter Item to the array element at Write_Index. It then increments the Write_Index accounting for modular wrap around logic. Finally it increments Count.
If the boundary condition is false, in this case when Count is not less than Capacity, the calling task is suspended in a queue automatically managed by the language. When the condition becomes true again, the next task in the queue, if any, is serviced and its value is written to the buffer.
The logic for the Read entry is very similar, but its boundary condition differs from the Write boundary condition: the Read entry only executes when the buffer is not empty, that is, when Count > 0. When the Read entry executes it copies the value in the Buf array at index Read_Index to the parameter Item, increments Read_Index using modular arithmetic, and decrements Count. The Read entry has its own entry queue where instances of Consumer are queued until a value is written to Buffer.
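For readers more familiar with C, a rough analogue of this protected object using POSIX mutexes and condition variables might look like the following (a sketch with illustrative names, separate from the Ada solution):

#include <pthread.h>

#define CAPACITY 10

static int buf[CAPACITY];
static int write_index = 0, read_index = 0, count = 0;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Corresponds to the Write entry: blocks while the buffer is full. */
void buffer_write(int item)
{
    pthread_mutex_lock(&lock);
    while (count == CAPACITY)               /* guard: Count < Capacity */
        pthread_cond_wait(&not_full, &lock);
    buf[write_index] = item;
    write_index = (write_index + 1) % CAPACITY;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Corresponds to the Read entry: blocks while the buffer is empty. */
int buffer_read(void)
{
    pthread_mutex_lock(&lock);
    while (count == 0)                      /* guard: Count > 0 */
        pthread_cond_wait(&not_empty, &lock);
    int item = buf[read_index];
    read_index = (read_index + 1) % CAPACITY;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return item;
}

The condition-variable loops play the role of the entry guards: a caller that finds the guard false is suspended until another thread changes count and signals it.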
The Producer task type is implemented by first declaring two local variables. Each instance of the task type has its own copies of these two variables. The first variable, named Value, is an Integer initialized to 0. The second variable, named Me, is of the subtype Positive, an Integer with a minimum value of 1. Me holds the task Id assigned to this instance of the task.
After the "begin" reserved word there is an accept statement which accepts the Set_Id task entry. The value passed to this instance of Producer is assigned to its local Me variable. The task waits at the accept statement until another task calls the entry and passes it an Id.
The real work of this task is to write numbers to Buffer until another task calls its Stop entry. All of this is done within the loop statement. The "select" statement forms the beginning of a conditional branch: either the task accepts a call from another task on the Stop entry, in which case it exits the loop, or it writes a value to the Buffer protected object. Upon successfully writing to Buffer, the producer instance outputs a message identifying the task Id and the value written to Buffer, and the number contained in the variable Value is incremented. If the Write entry does not execute within 0.001 seconds (1 millisecond), the loop repeats.
The Consumer task type is very similar to the Producer. It declares two local variables, Value and Me. It waits to accept its Set_Id entry and assigns the number passed to it to its local variable Me. The consumer then executes a loop similar to the producer loop. The only difference is that it reads a value from Buffer and prints that value.
The Bounded_PC package must be used within an actual program. The following procedure named main provides the program structure to test this package.
with Bounded_PC;

procedure Main is
   package Int_Pck is new Bounded_PC(10);
   use Int_Pck;
   P1 : Producer;
   P2 : Producer;
   C1 : Consumer;
   C2 : Consumer;
begin
   P1.Set_Id(1);
   P2.Set_Id(2);
   C1.Set_Id(1);
   C2.Set_Id(2);
   delay 0.01;
   P1.Stop;
   P2.Stop;
   delay 0.01;
   C1.Stop;
   C2.Stop;
end Main;
A concrete instance of the generic Bounded_PC package is declared, passing the number 10 as the capacity of the Buffer array.
Two producers named P1 and P2 are created, and two consumers named C1 and C2 are created.
Following the "begin" reserved word in the Main procedure, the Set_Id entries for tasks P1, P2, C1, and C2 are called, passing Id values to each task.
The Main procedure then delays (sleeps) for 0.01 seconds, or 10 milliseconds. The Stop entries for P1 and P2 are called, the Main procedure delays for another 10 milliseconds, and then it calls the Stop entries for C1 and C2.
The output of this program is:
Producer 2 wrote 0
Producer 2 wrote 1
Producer 2 wrote 2
Producer 2 wrote 3
Producer 2 wrote 4
Producer 2 wrote 5
Producer 2 wrote 6
Producer 2 wrote 7
Producer 2 wrote 8
Producer 2 wrote 9
Producer 2 wrote 10
Consumer 1 read 0
Consumer 1 read 1
Consumer 1 read 2
Consumer 1 read 3
Consumer 1 read 4
Consumer 1 read 5
Consumer 1 read 6
Consumer 1 read 7
Consumer 1 read 8
Consumer 1 read 9
Consumer 1 read 10
Consumer 1 read 11
Producer 2 wrote 11
Consumer 2 read 0
Consumer 1 read 12
Producer 2 wrote 12
Producer 1 wrote 0
Producer 2 wrote 13
Consumer 2 read 13
Producer 2 wrote 14
Producer 1 wrote 1
Producer 1 wrote 2
Producer 1 wrote 3
Producer 2 wrote 15
Producer 1 wrote 4
Producer 1 wrote 5
Producer 2 wrote 16
Producer 2 wrote 17
Producer 2 wrote 18
Producer 1 wrote 6
Producer 2 wrote 19
Consumer 2 read 14
Consumer 1 read 1
Consumer 2 read 15
Consumer 1 read 2
Producer 1 wrote 7
Producer 2 wrote 20
Producer 1 wrote 8
Producer 2 wrote 21
Consumer 2 read 3
Consumer 1 read 4
Consumer 2 read 16
Consumer 1 read 5
Consumer 2 read 17
Consumer 1 read 18
Consumer 2 read 6
Consumer 1 read 19
Consumer 1 read 20
Consumer 1 read 8
Consumer 1 read 21
Consumer 1 read 9
Consumer 2 read 7
Consumer 1 read 22
Producer 1 wrote 9
Producer 1 wrote 10
Producer 2 wrote 22
Producer 1 wrote 11
Consumer 2 read 10
Consumer 1 read 11
Consumer 2 read 23
Consumer 1 read 12
Producer 1 wrote 12
Producer 2 wrote 23
Consumer 2 read 13
Producer 1 wrote 13
Producer 2 wrote 24
Producer 1 wrote 14
Consumer 1 read 24
Producer 1 wrote 15
Producer 1 wrote 16
Producer 1 wrote 17
Producer 1 wrote 18
Producer 1 wrote 19
Producer 1 wrote 20
Producer 1 wrote 21
Consumer 1 read 25
Producer 1 wrote 22
Producer 1 wrote 23
Producer 1 wrote 24
Producer 1 wrote 25
Consumer 1 read 15
Producer 2 wrote 25
Producer 1 wrote 26
Consumer 1 read 16
Consumer 2 read 14
Consumer 1 read 17
Producer 2 wrote 26
Producer 2 wrote 27
Consumer 1 read 19
Consumer 2 read 18
Consumer 1 read 20
Consumer 1 read 22
Consumer 1 read 23
Producer 1 wrote 27
Consumer 2 read 21
Consumer 2 read 25
Consumer 2 read 26
Consumer 2 read 26
Consumer 2 read 27
Consumer 1 read 24
Consumer 2 read 27
Producer 1 wrote 28
Producer 1 wrote 29
Producer 1 wrote 30
Consumer 2 read 28
Consumer 2 read 29
Consumer 2 read 30
Consumer 2 read 31
Consumer 1 read 28
Producer 1 wrote 31
Producer 2 wrote 28
Producer 1 wrote 32
Producer 2 wrote 29
Producer 2 wrote 30
Producer 2 wrote 31
Producer 2 wrote 32
Producer 2 wrote 33
Producer 2 wrote 34
Producer 2 wrote 35
Producer 2 wrote 36
Producer 2 wrote 37
Producer 2 wrote 38
Consumer 2 read 32
Producer 1 wrote 33
Consumer 1 read 29
Consumer 2 read 33
Consumer 1 read 30
Consumer 1 read 32
Consumer 1 read 33
Consumer 1 read 34
Consumer 1 read 35
Consumer 1 read 36
Consumer 1 read 37
Producer 1 wrote 34
Consumer 1 read 38
Producer 1 wrote 35
Consumer 2 read 31
Consumer 1 read 39
Consumer 1 read 35
Consumer 1 read 36
Producer 1 wrote 36
Consumer 2 read 34
Consumer 1 read 37
Producer 1 wrote 37
Producer 1 wrote 38
Producer 2 wrote 39
Consumer 2 read 38
Consumer 1 read 39
Consumer 2 read 40
Producer 1 wrote 39
Producer 2 wrote 40
Consumer 1 read 40
Producer 1 wrote 40
Consumer 2 read 41
Consumer 1 read 41
Producer 1 wrote 41
Producer 2 wrote 41
Producer 1 wrote 42
Consumer 2 read 42
Producer 2 wrote 42
Producer 2 wrote 43
Producer 2 wrote 44
Producer 2 wrote 45
Producer 2 wrote 46
Producer 2 wrote 47
Producer 2 wrote 48
Producer 2 wrote 49
Producer 2 wrote 50
Producer 2 wrote 51
Consumer 1 read 42
Producer 1 wrote 43
Producer 2 wrote 52
Consumer 2 read 43
Consumer 1 read 43
Consumer 2 read 44
Consumer 2 read 46
Consumer 2 read 47
Consumer 1 read 45
Consumer 2 read 48
Consumer 2 read 50
Consumer 2 read 51
Consumer 2 read 52
Consumer 1 read 49
Consumer 1 read 44
Consumer 2 read 53
Producer 2 wrote 53
Producer 2 wrote 54
Producer 1 wrote 44
Consumer 1 read 54
Consumer 2 read 55
Consumer 1 read 45
Producer 2 wrote 55
Producer 1 wrote 45
Producer 2 wrote 56
Consumer 2 read 56
Producer 2 wrote 57
Consumer 1 read 57

My matrix multiplication program takes quadruple time when thread count doubles

I wrote this simple program that multiplies matrices. I can specify how
many OS threads are used to run it with the environment variable
OMP_NUM_THREADS. It slows down a lot when the thread count gets
larger than the number of hardware threads my CPU has.
Here's the program.
static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
              c[DIMENSION][DIMENSION];

#pragma omp parallel for schedule(static)
for (unsigned i = 0; i < DIMENSION; i++)
    for (unsigned j = 0; j < DIMENSION; j++)
        for (unsigned k = 0; k < DIMENSION; k++)
            c[i][k] += a[i][j] * b[j][k];
My CPU is i7-8750H. It has 12 threads. When the matrices are large
enough, the program is fastest on around 11 threads. It is 4 times as
slow when the thread count reaches 17. Then run time stays about the
same as I increase the thread count.
Here's the results. The top row is DIMENSION. The left column is the
thread count. Times are in seconds. The column with * is when
compiled with -fno-loop-unroll-and-jam.
1024 2048 4096 4096* 8192
1 0.2473 3.39 33.80 35.94 272.39
2 0.1253 2.22 18.35 18.88 141.23
3 0.0891 1.50 12.64 13.41 100.31
4 0.0733 1.13 10.34 10.70 82.73
5 0.0641 0.95 8.20 8.90 62.57
6 0.0581 0.81 6.97 8.05 53.73
7 0.0497 0.70 6.11 7.03 95.39
8 0.0426 0.63 5.28 6.79 81.27
9 0.0390 0.56 4.67 6.10 77.27
10 0.0368 0.52 4.49 5.13 55.49
11 0.0389 0.48 4.40 4.70 60.63
12 0.0406 0.49 6.25 5.94 68.75
13 0.0504 0.63 6.81 8.06 114.53
14 0.0521 0.63 9.17 10.89 170.46
15 0.0505 0.68 11.46 14.08 230.30
16 0.0488 0.70 13.03 20.06 241.15
17 0.0469 0.75 20.67 20.97 245.84
18 0.0462 0.79 21.82 22.86 247.29
19 0.0465 0.68 24.04 22.91 249.92
20 0.0467 0.74 23.65 23.34 247.39
21 0.0458 1.01 22.93 24.93 248.62
22 0.0453 0.80 23.11 25.71 251.22
23 0.0451 1.16 20.24 25.35 255.27
24 0.0443 1.16 25.58 26.32 253.47
25 0.0463 1.05 26.04 25.74 255.05
26 0.0470 1.31 27.76 26.87 253.86
27 0.0461 1.52 28.69 26.74 256.55
28 0.0454 1.15 28.47 26.75 256.23
29 0.0456 1.27 27.05 26.52 256.95
30 0.0452 1.46 28.86 26.45 258.95
Code inside the loop compiles to this on gcc 9.3.1 with
-O3 -march=native -fopenmp. rax starts from 0 and increases by 64
each iteration. rdx points to c[i]. rsi points to b[j]. rdi
points to b[j+1].
vmovapd (%rsi,%rax), %ymm1
vmovapd 32(%rsi,%rax), %ymm0
vfmadd213pd (%rdx,%rax), %ymm3, %ymm1
vfmadd213pd 32(%rdx,%rax), %ymm3, %ymm0
vfmadd231pd (%rdi,%rax), %ymm2, %ymm1
vfmadd231pd 32(%rdi,%rax), %ymm2, %ymm0
vmovapd %ymm1, (%rdx,%rax)
vmovapd %ymm0, 32(%rdx,%rax)
I wonder why the run time increases so much when the thread count
increases.
My estimate says this shouldn't be the case when DIMENSION is 4096.
Here is what I thought before I remembered that the compiler does 2 j loops at
a time: each iteration of the j loop needs rows c[i] and b[j].
They are 64KB in total. My CPU has a 32KB L1 data cache and a 256KB L2
cache per 2 threads. The four rows the two hardware threads are working
with don't fit in L1 but fit in L2. So when j advances, c[i] is
read from L2. When the program is run on 24 OS threads, the number of
involuntary context switches is around 29371. Each thread gets
interrupted before it has a chance to finish one iteration of the j
loop. Since 8 matrix rows can fit in the L2 cache, the other software
thread's 2 rows are probably still in L2 when it resumes. So the
execution time shouldn't be much different from the 12 thread case.
However measurements say it's 4 times as slow.
Now I have realized that 2 j loops are done at a time. This way each
j iteration works on 96KB of memory. So 4 of them can't fit in the
256KB L2 cache. To verify this is what slows the program down, I
compiled the program with -fno-loop-unroll-and-jam. I got
vmovapd ymm0, YMMWORD PTR [rcx+rax]
vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdx+rax]
vmovapd YMMWORD PTR [rdx+rax], ymm0
The results are in the table. They are similar to when 2 rows are done at a
time, which makes me wonder even more. When DIMENSION is 4096, 4
software threads' 8 rows fit in the L2 cache when each thread works on 1
row at a time, but 12 rows don't fit in the L2 cache when each thread
works on 2 rows at a time. Why are the run times similar?
I thought maybe it's because the CPU warmed up when running with fewer
threads and has to slow down. I ran the tests multiple times, both in
the order of increasing thread count and decreasing thread count. They
yield similar results. And dmesg doesn't contain anything related to
thermal or clock.
I tried separately changing 4096 columns to 4104 columns and setting
OMP_PROC_BIND=true OMP_PLACES=cores, and the results are similar.
This problem seems to come from either the CPU caches (due to the bad memory locality) or the OS scheduler (due to more threads than the hardware can simultaneously execute).
I cannot exactly reproduce the same effect on my i5-9600KF processor (with 6 cores and 6 threads) and with a matrix of size 4096x4096. However, similar effects occur.
Here are performance results (with GCC 9.3 using -O3 -march=native -fopenmp on Linux 5.6):
#threads | time (in seconds)
----------------------------
1 | 16.726885
2 | 9.062372
3 | 6.397651
4 | 5.494580
5 | 4.054391
6 | 5.724844 <-- maximum number of hardware threads
7 | 6.113844
8 | 7.351382
9 | 8.992128
10 | 10.789389
11 | 10.993626
12 | 11.099117
24 | 11.283873
48 | 11.412288
We can see that the computation time starts to significantly grow between 5 and 12 cores.
This problem is due to a lot more data being fetched from RAM. Indeed, 161.6 GiB are loaded from memory with 6 threads while 424.7 GiB are loaded with 12 threads! In both cases, 3.3 GiB are written to RAM. Because my memory throughput is roughly 40 GiB/s, the RAM accesses represent more than 96% of the overall execution time with 12 threads!
If we dig deeper, we can see that the number of L1 cache references and L1 cache misses are the same whatever the number of threads used. Meanwhile, there are a lot more L3 cache misses (as well as more references). Here are L3-cache statistics:
With 6 threads: 4.4 G loads
1.1 G load-misses (25% of all LL-cache hits)
With 12 threads: 6.1 G loads
4.5 G load-misses (74% of all LL-cache hits)
This means that the locality of the memory access is clearly worse with more threads. I guess this is because the compiler is not clever enough to do high-level cache-based optimizations that could reduce RAM pressure (especially when the number of threads is high). You have to do tiling yourself in order to improve the memory locality. You can find a good guide here.
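For illustration, a basic (untuned) tiled version of the original loop nest could look like this, assuming DIMENSION is a multiple of the block size:

#define BLOCK 64   /* tile size: pick it so a few BLOCK x BLOCK tiles fit in L1/L2 */

#pragma omp parallel for schedule(static)
for (unsigned ii = 0; ii < DIMENSION; ii += BLOCK)
    for (unsigned jj = 0; jj < DIMENSION; jj += BLOCK)
        for (unsigned kk = 0; kk < DIMENSION; kk += BLOCK)
            /* update one BLOCK x BLOCK tile of c using tiles of a and b */
            for (unsigned i = ii; i < ii + BLOCK; i++)
                for (unsigned j = jj; j < jj + BLOCK; j++)
                    for (unsigned k = kk; k < kk + BLOCK; k++)
                        c[i][k] += a[i][j] * b[j][k];

Each thread then works on small tiles of a, b and c that stay resident in cache, so far fewer cache lines have to be re-fetched from RAM.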
Finally, note that using more threads than the hardware can simultaneously execute is generally not efficient. One problem is that the OS scheduler often places threads on cores poorly and frequently moves them. The usual way to fix that is to bind software threads to hardware threads by setting OMP_PROC_BIND=TRUE and the OMP_PLACES environment variable. Another problem is that the threads are executed using preemptive multitasking with shared resources (e.g. caches).
PS: please note that BLAS libraries (e.g. OpenBLAS, BLIS, Intel MKL, etc.) are much more optimized than this code, as they already include clever optimizations such as manual vectorization for the target hardware, loop unrolling, multithreading, tiling, and fast matrix transposition when needed. For a 4096x4096 matrix, they are about 10 times faster.

DES: (Using sbox 2) to show that Two output bits from each S-box affect middle bits of the next round and the other two affect the end bits

Data Encryption Standard (DES) algorithm : (Using sbox 2) to show that Two output bits from each S-box affect middle bits of the next round and the other two affect the end bits.
The permutation table P is defined in the following table.
16 7 20 21 29 12 28 17 [END BITS]
1 15 23 26 5 18 31 10 [MIDDLE BITS]
2 8 24 14 32 27 3 9 [MIDDLE BITS]
19 13 30 6 22 11 4 25 [END BITS]
From the table above you can see that bits 7 and 6 refer to the end bits and 5 and 8 refer to the middle bits.
However, I am not sure whether this is correct because, if we consider the E table, bits 5 and 6 are end bits and bits 7 and 8 affect the middle bits. Which is correct?
I don't fully understand the question, but your first statement about bits 7, 6, 5 and 8 is true. Remember, though, that the "cascade effect" means all the changes made by the P-table go to the "right side" of the equation, and at the same time these will interact with the left side in the next round!
To fully understand the process check out this link: http://www.cronos.est.pr/DES.php
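If it helps, here is a small C sketch (using the P table from the question) that prints where each of S-box 2's four output bits, i.e. bits 5 through 8 of the 32-bit half-block, land after the permutation P; you can then compare those positions against the rows of P and the E table of the next round:

#include <stdio.h>

/* DES P permutation as given above: output bit i comes from input bit P[i-1]. */
static const int P[32] = {
    16,  7, 20, 21, 29, 12, 28, 17,
     1, 15, 23, 26,  5, 18, 31, 10,
     2,  8, 24, 14, 32, 27,  3,  9,
    19, 13, 30,  6, 22, 11,  4, 25
};

int main(void)
{
    /* S-box 2 produces bits 5..8 of the 32-bit block that feeds P. */
    for (int b = 5; b <= 8; b++)
        for (int i = 0; i < 32; i++)
            if (P[i] == b)
                printf("S-box 2 output bit %d -> position %d after P\n", b, i + 1);
    return 0;
}

Bits 7 and 6 land at positions 2 and 28 (the first and fourth rows of P, the END rows), while bits 5 and 8 land at positions 13 and 18 (the middle rows), which is where the statement about two end bits and two middle bits comes from.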

redefine length.character in R

Since length is a generic method, why can't I do
length.character <- nchar
? It seems that strings are treated special in R. Is there a reason for that? Would you discourage defining functions like head.character and tail.character?
If you look at the help page for InternalMethods (mentioned in the Details portion of the help page for length), it states that
"For efficiency, internal dispatch only occurs on objects, that is those for which ‘is.object’ returns true."
Vectors are not objects in the same sense as other objects are, so method dispatch is not done on any basic vectors (not just character). If you really want to use this type of dispatch you need a defined object, e.g.:
> tmp <- state.name
> class(tmp) <- 'mynewclass'
> length.mynewclass <- nchar
> length(tmp)
[1] 7 6 7 8 10 8 11 8 7 7 6 5 8 7 4 6 8 9 5 8 13 8 9 11 8
[26] 7 8 6 13 10 10 8 14 12 4 8 6 12 12 14 12 9 5 4 7 8 10 13 9 7
>
My 2c:
Strings are not treated specially in R. If length did the same thing as nchar, then you would get unexpected results if you tried to compute length(c("foo", "bazz")). Or to put it another way, would you expect the length of a numeric vector to return the number of digits in each element of the vector or the length of the vector itself?
Also, creating this method might have side effects on other functions that expect the normal string behavior.
Now I found a reason not to define head.character: it changes the way how head works. For example:
head.character <- function(s,n) if(n<0) substr(s,1,nchar(s)+n) else substr(s,1,n)
test <- c("abc", "bcd", "cde")
head("abc", 2) # works fine
head(test,2)
Without the definition of head.character, the last line would return c("abc", "bcd"). Now, with head.character defined, the function is applied to each element of the vector and returns c("ab", "bc", "cd").
But I have a strhead and a strtail function now.. :-)
