Kernel dies when using itertools.permutations - python-3.x

I have a list of strings, and I want to find all possible combinations of that list. I use itertools.permutations and it runs for a while, but then it crashes saying "Kernel died, restarting". I tried running the code from a terminal too, but it crashes there as well. Here is my code:
import itertools
sum_stats = ['pi', 'theta W','Tajima D','distVar','distSkew','distKurt','nDiplos',
'diplo_H1','diplo_H12','diplo_H2/H1','diplo_ZnS','diplo_Omega']
permuted_sum_stats = list(itertools.permutations(sum_stats))
Can someone show me an efficient way to create all possible combinations of this list?

Your list has 12 elements. To hold all possible permutations, your new list needs 12! = 479,001,600 elements, or about 500 million. Each of these elements is a tuple that takes about 150 bytes, excluding the space of the strings, which I assume is reused.
This leads to about 75 GB of data, which is probably more than the RAM of your machine.
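The arithmetic above can be checked directly, and the permutations can be consumed lazily instead of being stored. The following is a minimal sketch, assuming a 64-bit CPython build (the 8 bytes added per element stand for the slot in the outer list); the names n_perms and bytes_per_perm are only illustrative.

```python
import itertools
import math
import sys

sum_stats = ['pi', 'theta W', 'Tajima D', 'distVar', 'distSkew', 'distKurt', 'nDiplos',
             'diplo_H1', 'diplo_H12', 'diplo_H2/H1', 'diplo_ZnS', 'diplo_Omega']

# Number of permutations of 12 distinct items: 12!
n_perms = math.factorial(len(sum_stats))

# Measure one permutation tuple; add 8 bytes for its slot in the outer list
# (assumption: 64-bit pointers).
one_perm = next(itertools.permutations(sum_stats))
bytes_per_perm = sys.getsizeof(one_perm) + 8
print(f"{n_perms} permutations, roughly {n_perms * bytes_per_perm / 1e9:.0f} GB as a list")

# Lazy iteration: only one tuple is alive at a time, so memory stays flat.
# islice limits the demo to the first three permutations.
for perm in itertools.islice(itertools.permutations(sum_stats), 3):
    print(perm)
```

Dropping the list() call removes the memory problem, but looping over all of the permutations will still take a long time, so it is worth asking whether every ordering is really needed.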

Related

How can I use linux perf and interpret its output to understand CPU cache misses?

I am trying to measure the number of times memory references miss any CPU cache and need to fetch a cache line from memory. I have a very simple program that loads 100 million 4-byte integers into an array and then scans it or probes it randomly. I measure time, and then use perf to report various cache-related events: LLC-load, LLC-load-misses, LLC-store, LLC-store-misses. I am using Pop OS 18.10 (a variant of Ubuntu 18.10).
I run the program three ways:
1) Just load the array (100m integers).
2) Load the array and scan in physical order.
3) Load the array and read 100m random array locations.
#3 is 40x slower than #2, which is not surprising.
I am having some trouble both knowing what perf events to examine, and how to interpret the results:
I discovered the LLC-* events by googling, but they are not mentioned by "perf list".
I subtract the counts of events of the load-only run (#1) from the load-and-scan runs (#2, #3). The numbers are generally lower from the physical scan (#2) compared to the random access (#3). But from reading the documentation, and looking at the numbers, I don't really understand what the various events represent.
Does perf count events or does it sample them? If it's a true count, then I really can't make sense of the numbers I'm seeing. (E.g. the number of LLC-load-misses events doesn't match the number of cache line transfers that should be needed.)

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576))!=set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but cannot think of one.
Looking at a guide about memory buffers in python here I suspected that the comparator itself was the issue. In most non-typed languages memory copies are not very obvious despite being a killer for performance.
In this case, as Oded R. established, reading into a preallocated buffer and comparing the result with a previously prepared null-filled one is much more efficient.
size = 512
data = bytearray(size)
cmp = bytearray(size)
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)
Two things need to be taken into account:
The compared buffers should be the same size, but comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit).
The last read may not fill the buffer; reading the file into the prepared buffer will keep the trailing zeroes where we want them.
Here the comparison of the two buffers will be quick and there will be no attempts of casting the bytes to string (which we don't need) and since we reuse the same memory all the time, the garbage collector won't have much work either... :)
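Putting the pieces above together, a complete loop might look like the following. This is a sketch in Python 3 rather than the asker's Python 2, and the function name first_nonzero_offset and its parameters are made up for illustration. Instead of relying on leftover zeroes in the buffer, it compares only the bytes actually read on a short final read.

```python
def first_nonzero_offset(path, start=0, chunk_size=64 * 1024 * 1024):
    """Return the offset of the first chunk at or after `start` that holds a
    non-null byte, or None if the rest of the file is all null bytes."""
    data = bytearray(chunk_size)   # reused read buffer, no new allocation per chunk
    zeros = bytes(chunk_size)      # prepared null-filled buffer to compare against
    with open(path, 'rb') as f:
        f.seek(start)
        offset = start
        while True:
            n = f.readinto(data)   # number of bytes actually read
            if n == 0:
                return None        # reached end of file without finding data
            if n == chunk_size:
                all_zero = data == zeros
            else:
                # Short final read: compare only the bytes that were read.
                all_zero = data[:n] == bytes(n)
            if not all_zero:
                return offset
            offset += n
```

The result is the start of the chunk that contains data, not the exact byte; a smaller chunk_size gives a finer position at the cost of more reads.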

Python3 multiprocessing: Memory Allocation Error

I know that this question has been asked a lot of times, but the answers are not applicable.
This is the first answer on Stack Overflow for a parallelized loop using multiprocessing:
import multiprocessing as mp
def processInput(i):
    return i * i
if __name__ == '__main__':
    inputs = range(1000000)
    pool = mp.Pool(processes=4)
    results = pool.map(processInput, inputs)
    print(results)
This code works fine. But if I increase the range to 1000000000, my 16GB of RAM fills up completely and I get [Errno 12] Cannot allocate memory. It seems as if the map function starts as many processes as possible. How do I limit the number of parallel processes?
The pool uses 4 worker processes, as you instructed (processes=4 tells the pool how many processes it may use to run your logic).
There is, however, a different issue underlying this implementation.
The pool.map function returns a list of objects, in this case numbers.
Numbers in Python do not behave like ints in ANSI C: they carry per-object overhead and will not overflow (e.g. wrap around to -2^31 after passing 2^31-1 on 32-bit).
Also, Python lists are not arrays and do incur an overhead.
To be more specific, on python 3.6, running the following code will reveal some overhead:
>>> import sys
>>> t = [1,2,3,4]
>>> sys.getsizeof(t)
96
>>> t = [x for x in range(1000)]
>>> sys.getsizeof(t)
9024
So this means 24 bytes per number on small lists and ~9 bytes on large lists.
So for a list the size of 10^9 we get about 8.5GB
EDIT:
1. As tfb mentioned, this is not even the size of the underlying Number objects, just pointers and list overhead, meaning there is much more memory overhead I did not account for in the original answer.
2. The default Python installation on Windows is 32-bit (you can get a 64-bit installation, but you need to check the section with all available downloads on the Python website), so I assumed you are using the 32-bit installation.
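To put a rough number on the overhead that the list-size estimate above leaves out, the size of the int objects can be added to the size of the list slots. This is a back-of-the-envelope sketch; the function name and the 8-byte pointer default are assumptions (64-bit build), and the size of an int object is measured on whatever interpreter runs the code.

```python
import sys

def estimate_int_list_bytes(n, pointer_bytes=8, int_bytes=None):
    """Rough lower bound, in bytes, for a list holding n distinct int objects."""
    if int_bytes is None:
        # Size of one (non-cached) int object on the running interpreter.
        int_bytes = sys.getsizeof(10**6)
    # Each element costs one list slot (a pointer) plus the int object itself.
    return n * (pointer_bytes + int_bytes)

print(f"{estimate_int_list_bytes(10**9) / 1e9:.1f} GB for a list of 10^9 ints")
```

The estimate ignores list over-allocation and allocator overhead, so the real figure can only be higher.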
In Python 2, range(1000000000) creates a list of 10^9 ints. This is at least 8GB (8 bytes per list slot on a 64-bit system, before counting the int objects themselves). You are then trying to process this to create another list of 10^9 ints. A really, really smart implementation might be able to do this on a 16GB machine, but it's basically a lost cause.
In Python 2 you could try using xrange, which might or might not help. In Python 3, range is already lazy in the way xrange was, so the input itself costs very little memory, but pool.map still collects all 10^9 results into a single list.
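One way around the result list, not taken from the answers above, is to consume results as they arrive with Pool.imap and keep only an aggregate. The sketch below assumes a running total is all that is needed; the function names and the chunksize value are illustrative choices, not tuned settings.

```python
import multiprocessing as mp

def process_input(i):
    return i * i

def sum_of_squares(n, processes=4, chunksize=10_000):
    """Sum process_input(i) for i in range(n) without keeping every result."""
    total = 0
    with mp.Pool(processes=processes) as pool:
        # imap yields results in order as workers finish them, so no list
        # with n elements is ever built; chunksize cuts per-task overhead.
        for value in pool.imap(process_input, range(n), chunksize=chunksize):
            total += value
    return total

# Call this from under `if __name__ == '__main__':`, as in the question, e.g.
#     print(sum_of_squares(1_000_000))
```

If every individual result really is needed, writing them to disk in batches inside the loop keeps memory bounded in the same way.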

Is there a way to accelerate matrix plots?

ggpairs(), like its grandparent scatterplotMatrix(), is terribly slow as the number of pairs grows. That's fair; the number of pairs grows quadratically with the number of variables.
What isn't fair is that I have to watch the other cores on my machine sit idle while one cranks away at 100% load.
Is there a way to parallelize large matrix plots?
Here is some sample data for benchmarking.
num.vars <- 100
num.rows <- 50000
require(GGally)
require(data.table)
tmp <- data.table(replicate(num.vars, runif(num.rows)),
                  class = as.factor(sample(0:1,size=num.rows, replace=TRUE)))
system.time({
  tmp.plot <- ggpairs(data=tmp, diag=list(continuous="density"), columns=1:num.vars,
                      colour="class", axisLabels="show")
  print(tmp.plot)
})
Interestingly enough, my initial benchmarks excluding the print() statement ran at tolerable speeds (21 minutes for the above). The print statement, when added, caused what appear to be segfaults on my machine. (Hard to say at the moment because the R session is simply killed by the OS).
Is the problem in memory, or is this something that could be parallelized? (At least the plot generation part seems amenable to parallelization.)
Drawing ggpairs plots is single threaded because the bulk of the work inside GGally:::print.ggpairs happens inside two for loops (somewhere around line 50, depending upon how you count lines):
for (rowPos in 1:numCol) {
  for (columnPos in 1:numCol) {
It may be possible to replace these with calls to plyr::l_ply (or similar), which has a .parallel argument. I have no idea whether the graphics devices will cope with several cores trying to draw on them simultaneously, though. My gut feeling is that getting parallel plotting to work robustly may be non-trivial, but it could also be a fun project.

Variable substitution faster than in-line integer in Vic-20 basic?

The following two (functionally equivalent) programs are taken from an old issue of Compute's Gazette. The primary difference is that program 1 puts the target base memory locations (7680 and 38400) in-line, whereas program 2 assigns them to a variable first.
Program 1 runs about 50% slower than program 2. Why? I would think that the extra variable retrieval would add time, not subtract it!
10 PRINT"[CLR]":A=0:TI$="000000"
20 POKE 7680+A,81:POKE 38400+A,6:IF A=505 THEN GOTO 40
30 A=A+1:GOTO 20
40 PRINT TI/60:END
Program 1
10 PRINT "[CLR]":A=0:B=7680:C=38400:TI$="000000"
20 POKE B+A,81:POKE C+A,6:IF A=505 THEN GOTO 40
30 A=A+1:GOTO 20
40 PRINT TI/60:END
Program 2
The reason is that BASIC is fully interpreted here, so the strings "7680" and "38400" need to be converted to binary numbers EVERY TIME line 20 is reached (506 times in this program). In program 2, they're converted once and stored in B and C. So as long as the search-for-and-fetch of a variable is faster than convert-string-to-binary, program 2 will be faster.
If you were to use a BASIC compiler (not sure if one exists for VIC-20, but it would be a cool retro-programming project), then the programs would likely be the same speed, or perhaps 1 might be slightly faster, depending on what optimizations the compiler did.
It's from page 76 of this issue: http://www.scribd.com/doc/33728028/Compute-Gazette-Issue-01-1983-Jul
I used to love this magazine. It actually says a 30% improvement. Look at what's happening in program 2 and it becomes clear: because you are looping a lot using variables, the program does all the memory allocation up front to calculate the memory addresses. With the slower approach, each iteration has to allocate memory for the constants in the line below as part of calculating the memory address:
POKE 7680+A,81:POKE 38400+A
This is just the nature of the BASIC Interpreter on the VIC.
Accessing the first defined variable will be fast; the second will be a little slower, etc. Parsing multi-digit constants requires the interpreter to perform repeated multiplication by ten. I don't know what the exact tradeoffs are between variables and constants, but short variable names use less space than multi-digit constants. Incidentally, the constant zero may be parsed more quickly if written as a single decimal point (with no digits) than written as a digit zero.
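The repeated multiplication by ten can be pictured with a small Python sketch. It only illustrates the idea of re-parsing digit strings on every pass; it is not the VIC-20 ROM routine, which works on its own number format and does more per digit. The function name parse_decimal is made up for the example.

```python
def parse_decimal(text):
    """Convert a string of decimal digits to an int, counting multiply-by-ten steps."""
    value = 0
    multiplications = 0
    for ch in text:
        # One multiplication by ten per digit, then add the digit's value.
        value = value * 10 + (ord(ch) - ord('0'))
        multiplications += 1
    return value, multiplications

# Program 1 style: both constants are parsed again on each of the 506 passes
# through line 20.
total_steps = 0
for a in range(506):
    base1, steps1 = parse_decimal("7680")
    base2, steps2 = parse_decimal("38400")
    total_steps += steps1 + steps2

print(total_steps)
```

In the program 2 style, each constant is parsed once before the loop, so the per-pass cost drops to two variable lookups.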
