List comprehension with multiple expressions and multiple conditions

Earlier, this code was written to iterate over a file of IP networks using nested loops and the ipaddress module. Now I would like to optimize it using a list comprehension.
import ipaddress
tmp1=open('Path\\Test-ip.txt', 'r+')
tmp2=open('Path\\-ip.txt', 'w+')
tmp1=tmp1.readlines()
for i in tmp1:
    i="".join(i.split())
    i=ipaddress.ip_network(i,False)
    for j in tmp1:
        j="".join(j.split())
        j=ipaddress.ip_network(j,False)
        if j != i:
            if ipaddress.IPv4Network(j).supernet_of(i):
                tmp2.write(str(i))
                tmp2.write('\n')
#Using List Comprehension
tmp1=open('Path\\Test-ip.txt', 'r+')
tmp2=open('Path\\Result-ip.txt', 'w+')
tmp1=tmp1.readlines()
tmp3=[ (("").join(i.split())) (("").join(j.split())) (ipaddress.ip_network(i)) (ipaddress.ip_network(j)) (tmp2.(write(str(i)))) for i in tmp1 for j in tmp1 if i!=j if ipaddress.IPv4Network(j).supernet_of(i)]
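For reference, a working version might look like the sketch below. This is not from the original question: it assumes the goal is still to write every network that has a distinct supernet elsewhere in the same file, that the file holds networks of a single IP version, and that Python 3.7+ is available (for supernet_of). The file writing is kept outside the comprehension, which is generally cleaner than calling write() inside one.

import ipaddress

# Sketch: parse each non-empty line once, then keep every network that has
# a distinct supernet somewhere else in the file (same logic as the loops above).
with open('Path\\Test-ip.txt') as src:
    nets = [ipaddress.ip_network("".join(line.split()), False)
            for line in src if line.strip()]

subnets = [i for i in nets for j in nets if i != j and j.supernet_of(i)]

with open('Path\\Result-ip.txt', 'w') as dst:
    dst.writelines(str(i) + '\n' for i in subnets)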

While generally a comprehension is faster than a for-loop, that is valid only if you are not doing things that are slow in the loop.
And (again in general) file I/O is much slower than doing calculations.
Note that you should always measure before trying to optimize. The bottleneck might not be where you think it is.
Run your script with python -m cProfile -s tottime yourscript.py your arguments and observe the output. That will tell you where the program is spending its time.
On my website, I have a list of articles detailing profiling of real-world python programs. You can find the first in the list here. The others are referenced at the bottom of that page.
If file I/O turns out to be the bottleneck, using memory-mapped files (the mmap module in the standard library) will probably improve things.
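As an illustration of that suggestion, a minimal sketch of iterating over a file through a memory map with the standard-library mmap module (the file name is the one from the question; the per-line processing stays the same):

import mmap

# Sketch: map the input file into memory and iterate over its lines without
# first copying the whole file into a Python list with readlines().
with open('Path\\Test-ip.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for raw_line in iter(mm.readline, b''):
            line = raw_line.decode().strip()
            if line:
                pass  # parse with ipaddress.ip_network(line, False) as before

Whether this actually helps depends on the file size and access pattern, which is exactly what the profiler run above is meant to tell you.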

Related

How do I increase speed of XML retrieval and parsing in S3 Jupyter Notebook (SageMaker Studio)?

I am managing data for a computer vision project and am looking for a fast way to search and manipulate all the files in a given directory. I have a working solution but am only able to process maybe 10-20 files per second. I am new to Jupyter Notebooks so am looking for recommendations on increasing the efficiency of the attached code.
Current code is as follows:
car_count=0
label_dict={}
purge_list=[]
for each_src in source_keys:
    pages = paginator.paginate(Bucket=src_bucket, Prefix=each_src)
    for page in pages:
        for obj in page['Contents']:
            fpath = obj['Key']
            fname = fpath.split('/')[-1]
            if fname == '':
                continue
            copy_source = {
                'Bucket': src_bucket,
                'Key': fpath
            }
            if fname.endswith('.xml'):
                obj=s3.Object(src_bucket,fpath)
                data=obj.get()['Body'].read()
                root = ET.fromstring(data)
                for box in root.findall('object'):
                    name=box.find('name').text
                    if name in label_dict:
                        label_dict[name] +=1
                    else:
                        label_dict[name] = 1
                    if name not in allowed_labels:
                        purge_list.append(fpath)
            print(f'Labels: {label_dict}',end='\r')
print(f'\nTotal Images files:{i}, Total XML files:{j}',end='\r')
#print(f'\nLabels: {label_dict}')
print(f'\nPURGE LIST: ({len(purge_list)} files)')
Possible solutions:
Multithreading - I have done threading in normal Python 3.x; is it common to multithread within a notebook?
Read less of each file - currently I read the whole file; I'm not sure how big a bottleneck that is, but reading less might increase speed.
Jupyter usually has a lot of overhead, and your code has three levels of nested for loops. In the Python world, the fewer for loops the better; binary data formats are also almost always faster. So, a number of suggestions:
restructure your for loops, and use a specialized library from PyPI for the file/filesystem handling
change language? a bash script might do
multithreading is indeed a way (see the sketch after this list)
caching: use Redis or another fast data store to "read in" the data
Go is comparatively easy to jump to from Python and also has good multithreading support - my 2 cents: it's worth a try at least
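To illustrate the multithreading suggestion, here is a minimal sketch using concurrent.futures to fetch and parse the XML objects in parallel. The names s3, paginator, src_bucket, source_keys and allowed_labels are taken from the question's code; fetch_labels is a hypothetical helper and the worker count is arbitrary.

from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

def fetch_labels(fpath):
    # Hypothetical helper: download one XML object and return its label names.
    data = s3.Object(src_bucket, fpath).get()['Body'].read()
    root = ET.fromstring(data)
    return fpath, [box.find('name').text for box in root.findall('object')]

# Collect the .xml keys first, exactly as the question's paginator loop does.
xml_keys = [obj['Key']
            for each_src in source_keys
            for page in paginator.paginate(Bucket=src_bucket, Prefix=each_src)
            for obj in page['Contents']
            if obj['Key'].endswith('.xml')]

label_dict, purge_list = {}, []
with ThreadPoolExecutor(max_workers=16) as pool:
    # The downloads are I/O-bound, so threads overlap the network waits.
    for fpath, names in pool.map(fetch_labels, xml_keys):
        for name in names:
            label_dict[name] = label_dict.get(name, 0) + 1
            if name not in allowed_labels:
                purge_list.append(fpath)

Because the work is dominated by network I/O, threads (rather than processes) are usually enough, and this pattern works the same inside a notebook cell as in a plain script.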

Reading large number of files(20k+) in python memory Error

I am trying to read a large number of files (20k+) from my computer using Python, but I keep getting a MemoryError (details below), even though I have 16 GB of RAM, 8 GB or more of which is free at all times, and the combined size of all the files is just 270 MB. I have tried many different solutions, such as pandas read_csv(), reading in chunks with open(file_path).read(100), and many others, but I am unable to read the files. I have to create a corpus of words after reading the files into the list. Below is my code so far. Any help will be highly appreciated.
import os
import pandas as pd

collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"
listOfFilesInCollection = os.listdir(collectionPath)

def wordList(file):
    list_of_words_from_file = []
    for line in file:
        for word in line.split():
            list_of_words_from_file.append(word)
    return list_of_words_from_file

list_of_file_word_Lists = {}
file = []
for file_name in listOfFilesInCollection:
    filePath = collectionPath + "\\" + file_name
    with open(filePath) as f:
        for line in f:
            file.append(line)
    list_of_file_word_Lists[file_name] = wordList(file)
print(list_of_file_word_Lists)
The error that I get:
Traceback (most recent call last):
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 25, in <module>
    list_of_file_word_Lists[file_name]=wordList(file)
  File "C:/Users/Asghar Nazir/PycharmProjects/pythonProject/IRAssignment1/init.py", line 14, in wordList
    list_of_words_from_file.append(word)
MemoryError
You probably want to move the file=[] line to the beginning of the loop body, because you're currently adding the lines of each new file you open without first removing the lines of all the previous files.
Then, there are very likely more efficient approaches, depending on what you're trying to achieve. If the order of words doesn't matter, then maybe using a dict or a collections.Counter instead of a list can help to avoid duplication of identical strings. If neither the order nor the frequency of words matters, then using a set instead is going to be even better.
Finally, since it's likely you'll find most words in several files, try to store each of them only once in memory. That way, you'll be able to scale way higher than a mere 20k files: there's plenty of space in 16 GiB of RAM.
Keep in mind that Python has lots of fixed overheads and hidden costs: inefficient data structures can cost way more than you would expect.
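For instance, a minimal sketch of the Counter-based variant mentioned above (assuming the same directory as in the question; word order is lost, but counts are kept and each distinct word is stored only once per file):

import os
from collections import Counter

collectionPath = r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt"

# One Counter per file: maps each distinct word to its frequency in that file.
word_counts_per_file = {}
for file_name in os.listdir(collectionPath):
    with open(os.path.join(collectionPath, file_name)) as f:
        word_counts_per_file[file_name] = Counter(f.read().split())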
It is hard to tell why your memory problems arise without knowing the content of your files. Maybe it is enough to make your code more efficient. For example: the split() function can handle multiple lines itself, so you don't need a loop for that. And using a list comprehension is usually a good idea in Python.
The following code should return what you want, and I don't see a reason why you should run out of memory using it. Besides that, Arkanosis's hint about the importance of data types is very valid. It depends on what you want to achieve with all those words.
from pathlib import Path

def word_list_from_file(path):
    with open(path, 'rt') as f:
        list_words = f.read().split()
    return list_words

path_dir = Path(r"C:\Users\Asghar Nazir\OneDrive - Higher Education Commission\MSDS\S1\IR\assignment\ACL txt")
dict_file_content = {
    str(path.name): word_list_from_file(path)
    for path in path_dir.rglob("*.txt")
}
P.S.: I'm not sure how the pathlib module behaves on Windows, but from what I read, this code snippet is platform independent.

Julia - Parallelism for Reading a Large file

In Julia v1.1, assume that I have a very large text file (30 GB) and I want to use parallelism (multiple threads) to process eachline. How can I do that?
This code is an attempt to do this after checking Julia's documentation on multi-threading, but it's not working at all:
open("pathtofile", "r") do file
# Count number of lines in file
seekend(file)
fileSize = position(file)
seekstart(file)
# skip nseekchars first characters of file
seek(file, nseekchars)
# progress bar, because it's a HUGE file
p = Progress(fileSize, 1, "Reading file...", 40)
Threads.#threads for ln in eachline(file)
# do something on ln
u, v = map(x->parse(UInt32, x), split(ln))
.... # other interesting things
update!(p, position(file))
end
end
Note 1: you need using ProgressMeter (I want my code to show a progress bar while reading the file in parallel).
Note 2: nseekchars is an Int, the number of characters I want to skip at the beginning of my file.
Note 3: the code works, but without any parallelism, if the Threads.@threads macro next to the for loop is removed.
For maximum I/O performance:
Parallelize the hardware - that is, use disk arrays rather than a single drive. Try searching for RAID performance for many excellent explanations (or ask a separate question).
Use the Julia memory-mapping mechanism:
s = open("my_file.txt","r")
using Mmap
a = Mmap.mmap(s)
Once having the memory mapping, do the processing in parallel. Beware of false sharing for threads (depends on your actual scenario).

Python3 multiprocessing: Memory Allocation Error

I know that this question has been asked a lot of times, but the answers are not applicable.
This is one answer for a parallelized loop using multiprocessing, taken from Stack Overflow:
import multiprocessing as mp

def processInput(i):
    return i * i

if __name__ == '__main__':
    inputs = range(1000000)
    pool = mp.Pool(processes=4)
    results = pool.map(processInput, inputs)
    print(results)
This code works fine. But if I increase the range to 1000000000, my 16 GB of RAM get filled completely and I get [Errno 12] Cannot allocate memory. It seems as if the map function starts as many processes as possible. How do I limit the number of parallel processes?
The pool starts 4 processes, as you instructed it to (processes=4 tells the pool how many worker processes it may use to perform your logic).
There is, however, a different issue underlying this implementation.
The pool.map function will return a list of objects, in this case numbers.
Python numbers do not act like ints in ANSI C: they carry per-object overhead and will not overflow (e.g. wrap to -2^31 when passing 2^31 - 1, as a 32-bit int would).
Also, Python lists are not arrays and incur an overhead of their own.
To be more specific, on Python 3.6, running the following code reveals some of that overhead:
>>>import sys
>>>t = [1,2,3,4]
>>>sys.getsizeof(t)
96
>>>t = [x for x in range(1000)]
>>>sys.getsizeof(t)
9024
So this means 24 bytes per number on small lists and ~9 bytes on large lists.
So for a list the size of 10^9 we get about 8.5GB
EDIT: 1. As tfb mentioned, this is not even the size of the underlying number objects, just the pointers and list overhead, meaning there is much more memory use that I did not account for in the original answer.
2. The default Python installation on Windows is 32-bit (you can get a 64-bit installation, but you need to check the section of all available downloads on the Python website), so I assumed you are using the 32-bit installation.
In Python 3, range(1000000000) itself is lazy, but pool.map still builds a results list of 10^9 ints. That list alone is around 8 GB of pointers (8 bytes each on a 64-bit system), before even counting the int objects they refer to. You are then trying to hold all of that in memory at once. A really, really smart implementation might be able to do this on a 16 GB machine, but it's basically a lost cause.
In Python 2 you could try using xrange, which might or might not help; in Python 3, range already behaves like xrange, so the remaining problem is the materialized results list.
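As a sketch of one way around this in Python 3 (not from either answer above): consume the results lazily with imap_unordered and an explicit chunksize, and aggregate on the fly instead of storing a 10^9-element results list. The chunksize value here is arbitrary.

import multiprocessing as mp

def processInput(i):
    return i * i

if __name__ == '__main__':
    inputs = range(1000000000)
    with mp.Pool(processes=4) as pool:
        total = 0
        # imap_unordered yields results as they become available instead of
        # building one huge list; a large chunksize keeps IPC overhead low.
        for square in pool.imap_unordered(processInput, inputs, chunksize=10000):
            total += square   # aggregate instead of keeping everything
        print(total)

This only helps if you can process each result as it arrives; if you genuinely need all 10^9 results in memory at once, no pool configuration will save you.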

What standard commands can I use to print just the first few lines of sorted output on the command line efficiently?

I basically want the equivalent of
... | sort -arg1 -arg2 -... | head -n $k
but, my understanding is that sort will go O(n log n) over the whole input. In my case I'm dealing with lots of data, so runtime matters to me - and also I have a habit of overflowing my tmp/ folder with sort temporary files.
I'd rather have it go O(n log k) using e.g. a heap, which would presumably go faster, and which also reduces the working set memory to k as well.
Is there some combination of standard command-line tools that can do this efficiently, without me having to code something myself? Ideally it would support the full expressive sort power of the sort command. sort (on ubuntu at least) appears to have no man-page-documented switch to pull it off...
Based on the above, and some more poking, I'd say the official answer to my question is "there is no solution." You can use specialized tools, or you can use the tools you've got with their current performance, or you can write your own tool.
I'm debating tracking down the sort source code and offering a patch. In the meantime, in case this quick hack helps anybody doing something similar to what I was doing, here's what I wrote for myself. Not the best Python, and a very shady benchmark; I offer it to anybody else who cares to provide something more rigorous:
256 files, of about 1.6 Gigs total size, all sitting on an ssd, lines
separated by \n, lines of format [^\t]*\t[0-9]+
Ubuntu 10.4, 6 cores, 8 gigs of ram, /tmp on ssd as well.
$ time sort -t^v<tab> -k2,2n foo* | tail -10000
real 7m26.444s
user 7m19.790s
sys 0m17.530s
$ time python test.py 10000 foo*
real 1m29.935s
user 1m28.640s
sys 0m1.220s
Using diff to analyze the output, the two methods differ on tie-breaking, but otherwise the sort order is the same.
test.py:
#!/usr/bin/env python
# test.py
from sys import argv
import heapq
from itertools import chain

# parse N - the size of the heap, and confirm we can open all input files
N = int(argv[1])
streams = [open(f, "r") for f in argv[2:]]

def line_iterator_to_tuple_iterator(line_i):
    for line in line_i:
        s, c = line.split("\t")
        c = int(c)
        yield (c, s)

# use heap to process inputs
rez = heapq.nlargest(N,
                     line_iterator_to_tuple_iterator(chain(*streams)),
                     key=lambda x: x[0])

for r in rez:
    print "%s\t%s" % (r[1], r[0])

for s in streams:
    s.close()
UNIX/Linux provides a generalist toolset. For large datasets it does loads of I/O. It will do everything you could want, but slowly. If we had an idea of the input data it would help immensely.
IMO, you have some choices, none of which you will really like.
Do a multipart "radix" pre-sort - for example, have awk write all of the lines whose keys start with 'A' to one file, 'B' to another, etc. (this creates 26 files named A, B, ... Z). Or, if you only want 'P', 'D' and 'Q', have awk pull out just those, then do a full sort on that small subset:
awk '{print $0 > substr($0,1,1)}' bigfile; sort [options here] P D Q > result
Spend $$: (example) buy CoSort from iri.com or any other sort software. These sorts use all kinds of optimizations, but they are not free like bash. You could also buy an SSD, which speeds up sorting on disk by several orders of magnitude (5000 IOPS versus 75000 IOPS). Use the TMPDIR variable to put your tmp files on the SSD, and read from and write to the SSD only. But use your existing UNIX toolset.
Use some software like R or Stata, or preferably a database; all of these are meant for large datasets.
Do what you are doing now, but watch YouTube while the UNIX sort runs.
IMO, you are using the wrong tools for large datasets when you want quick results.
Here's a crude partial solution:
#!/usr/bin/perl

use strict;
use warnings;

my @lines = ();

while (<>) {
    push @lines, $_;
    @lines = sort @lines;
    if (scalar @lines > 10) {
        pop @lines;
    }
}

print @lines;
It reads the input data only once, continuously maintaining a sorted array of the top 10 lines.
Sorting the whole array every time is inefficient, of course, but I'll guess that for a gigabyte input it will still be substantially faster than sort huge-file | head.
Adding an option to vary the number of lines printed would be easy enough. Adding options to control how the sorting is done would be a bit more difficult, though I wouldn't be surprised if there's something in CPAN that would help with that.
More abstractly, one approach to getting just the first N sorted elements from a large array is to use a partial Quicksort, where you don't bother sorting the right partition unless you need to. That requires holding the entire array in memory, which is probably impractical in your case.
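As an illustration of that idea (not from the answer above, just a sketch): a quickselect-style partial sort in Python that fully sorts the left partition but only recurses into the right partition while it still overlaps the first N positions.

import random

def partial_sort(a, n):
    # Rearrange list a in place so that a[:n] holds the n smallest elements
    # in sorted order; the tail is left in arbitrary order.
    def go(lo, hi):
        if hi - lo <= 1:
            return
        pivot = a[random.randrange(lo, hi)]
        lt, i, gt = lo, lo, hi            # three-way partition: <, ==, >
        while i < gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                gt -= 1
                a[gt], a[i] = a[i], a[gt]
            else:
                i += 1
        go(lo, lt)                        # the left partition always matters
        if gt < n:                        # the right partition only matters if
            go(gt, hi)                    # it overlaps the first n slots
    go(0, len(a))
    return a[:n]

print(partial_sort([9, 3, 7, 1, 8, 2, 6], 3))   # -> [1, 2, 3]

Note that this still needs the whole array in memory, which, as the paragraph above says, is probably impractical here; the heap approach in test.py avoids that.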
You could split the input into medium-sized chunks, apply some clever algorithm to get the top N lines of each chunk, concatenate the chunks together, then apply the same algorithm to the result. Depending on the sizes of the chunks, sort ... | head might be sufficiently clever. It shouldn't be difficult to throw together a shell script using split -l ... to do this.
(Insert more hand-waving as needed.)
Disclaimer: I just tried this on a much smaller file than what you're working with (about 1.7 million lines), and my method was slower than sort ... | head.
