python multiprocessing fastq function - python-3.x

I am a new user of the multiprocessing module in Python 3.
I have two fastq files (forward and reverse) and I want to process each forward/reverse pair of reads: for each forward read, I look up the corresponding reverse read and apply a function with multiple arguments to the pair. So far I have done this sequentially on one thread, which is quite slow for huge files. Now I would like to speed things up by parallelising the function application, so I split the forward file into chunks and apply the function to each chunk using multiprocessing. Here is the code:
def chunk_itr(iterator, chunk_size):
    """
    Split a fastq iterator into chunks of records for faster processing.
    From Biopython solutions.
    """
    entry = True
    while entry:
        chunk = []
        while len(chunk) < chunk_size:
            try:
                entry = next(iterator)
            except StopIteration:
                entry = None
            if entry is None:
                break
            chunk.append(entry)
        if chunk:
            yield chunk
def chunk_fastq(f_fastq, chunkSize, path2out):
    rec_itr = SeqIO.parse(open(f_fastq), "fastq")
    os.mkdir(os.path.join(path2out, "chunk_files"))
    dir_out = os.path.join(path2out, "chunk_files")
    base = os.path.basename(f_fastq)
    fname = os.path.splitext(base)[0]
    for i, chunk in enumerate(chunk_itr(rec_itr, chunkSize)):
        out_chunk_name = os.path.join(dir_out, "{0}_chunk{1}.fastq".format(fname, i))
        with open(out_chunk_name, "w") as handle:
            SeqIO.write(chunk, handle, "fastq")
def testmulti(fwd_chunk, rev_idx):
    fwd_idx = SeqIO.index(fwd_chunk, "fastq")
    for i in fwd_idx:
        print(i, rev_idx[i])
pathfwd = "path/to/forward_file"
f_rev = "path/to/rev_fastq"
def main():
rev_idx = SeqIO.index(f_rev, "fastq")
chunk_fastq(pathfwd, 1000, path2chunk)
files = [f for f in os.listdir(path2chunk)]
# sequential
for i in files:
testmulti(i, rev_idx)
# parallel process
processes = []
for i in files:
proc = mp.Process(target=testmulti, args=(i, rev_idx,))
processes.append(proc)
proc.start()
for p in processes:
p.join()
The sequential approach works fine, but the parallel one crashes with the following error:
Process Process-2:
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "test.py", line 28, in testmulti
print(i, rev_idx[i])
File "test.py", line 28, in testmulti
print(i, rev_idx[i])
File "/home/user/.local/lib/python3.6/site-packages/Bio/File.py", line 417, in __getitem__
record = self._proxy.get(self._offsets[key])
File "/home/user/.local/lib/python3.6/site-packages/Bio/File.py", line 417, in __getitem__
record = self._proxy.get(self._offsets[key])
File "/home/user/.local/lib/python3.6/site-packages/Bio/SeqIO/_index.py", line 69, in get
return self._parse(StringIO(_bytes_to_string(self.get_raw(offset))))
File "/home/user/.local/lib/python3.6/site-packages/Bio/SeqIO/_index.py", line 69, in get
return self._parse(StringIO(_bytes_to_string(self.get_raw(offset))))
File "/home/user/.local/lib/python3.6/site-packages/Bio/SeqIO/_index.py", line 664, in get_raw
raise ValueError("Problem with quality section")
File "/home/user/.local/lib/python3.6/site-packages/Bio/SeqIO/_index.py", line 642, in get_raw
raise ValueError("Premature end of file in seq section")
ValueError: Problem with quality section
ValueError: Premature end of file in seq section
From the Index class description in Biopython, it seems there is an issue with the file format/structure.
I double-checked the input files and they contain no errors (and they work with the sequential approach).
My guesses so far:
- using Process like this is not a good option (I also tried pool.starmap, but without success)
- since f_rev is indexed once and each process then tries to use that index in parallel, there is a conflict
Any help would be appreciated.
Thank you!

OK, so I am still not 100% sure about the cause of the error, but after increasing the size of my fastq files I was able to replicate it.
It definitely has to do with the reverse index object created with SeqIO.index; however, I'm struggling to fully grasp what that object looks like from the source code, as there is a lot of inheritance going on. I suspect it has something to do with passing an open file-handle object to the child processes, but again I'm not well-versed enough in that side of things to guarantee it.
However, I can successfully prevent the error. The solution is to move the creation of the reverse index into the child process. I don't see any good reason not to, either: the whole point of SeqIO.index is that it creates a low-memory index rather than reading the whole file into memory, so creating one index per child process shouldn't be excessively expensive.
def testmulti(fwd_chunk, rev):
    rev_idx = SeqIO.index(rev, "fastq")
    fwd_idx = SeqIO.index(fwd_chunk, "fastq")
    for i in fwd_idx:
        print(i, rev_idx[i])

pathfwd = "path/to/forward_file"
f_rev = "path/to/rev_fastq"

def main():
    chunk_fastq(pathfwd, 1000, path2chunk)
    files = [f for f in os.listdir(path2chunk)]
    # sequential
    for i in files:
        testmulti(i, f_rev)
    # parallel process
    processes = []
    for i in files:
        proc = mp.Process(target=testmulti, args=(i, f_rev,))
        processes.append(proc)
        proc.start()
    for p in processes:
        p.join()
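The question also mentions trying pool.starmap without success; once the index objects are no longer passed between processes, a Pool-based version should work too and has the advantage of capping the number of concurrent workers. Below is a minimal sketch of that variant (not part of the original answer): the chunk directory name, the worker count of 4, and joining the chunk paths are illustrative assumptions.

import os
import multiprocessing as mp
from functools import partial
from Bio import SeqIO

def testmulti(fwd_chunk, rev):
    # each worker builds its own low-memory indexes from file paths
    fwd_idx = SeqIO.index(fwd_chunk, "fastq")
    rev_idx = SeqIO.index(rev, "fastq")
    for i in fwd_idx:
        print(i, rev_idx[i])

def main():
    path2chunk = "path/to/chunk_files"  # assumed location of the chunked forward files
    f_rev = "path/to/rev_fastq"
    files = [os.path.join(path2chunk, f) for f in os.listdir(path2chunk)]
    with mp.Pool(processes=4) as pool:  # 4 workers is an arbitrary choice
        pool.map(partial(testmulti, rev=f_rev), files)

if __name__ == "__main__":
    main()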

Related

select row in heavy csv

I am looking for a way to select the rows that contain a certain word, so I use this script:
import pandas
import datetime

df = pandas.read_csv(
    r"C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv",
    sep=",",
)

communes = ["PERPIGNAN"]
print()
df = df[~df["libelleCommuneEtablissement"].isin(communes)]
print()
My script works well with a normal CSV, but with a heavy CSV (4 GB) the script says:
Traceback (most recent call last):
File "C:lafinessedufiness.py", line 5, in <module>
df = pandas.read_csv(r'C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv',
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
return parser.read(nrows)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas\_libs\parsers.pyx", line 883, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1072, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1172, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1731, in pandas._libs.parsers._try_int64
MemoryError: Unable to allocate 128. KiB for an array with shape (16384,) and data type int64
Do you know how I can fix this error, please?
The pd.read_csv() function has an option to read the file in chunks, rather than loading it all at once. Use iterator=True and specify a reasonable chunk size (rows per chunk).
import pandas as pd

path = r'C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv'
it = pd.read_csv(path, sep=',', iterator=True, chunksize=10_000)

communes = ['PERPIGNAN']
filtered_chunks = []
for chunk_df in it:
    # local variables are referenced with @ inside DataFrame.query()
    chunk_df = chunk_df.query('libelleCommuneEtablissement not in @communes')
    filtered_chunks.append(chunk_df)

df = pd.concat(filtered_chunks)
As you can see, you don't have enough memory available for Pandas to load that file entirely into memory.
One reason is that, based on Python38-32 in the traceback, you're running a 32-bit version of Python, for which roughly 4 gigabytes (or 3, depending on configuration) is the limit for memory allocations anyway. If your system is 64-bit, you should switch to the 64-bit version of Python; that's one obstacle removed.
If that doesn't help, you'll simply need more memory. You could configure Windows's virtual memory, or buy more actual memory and install it in your system.
If none of that helps, you'll have to come up with a better approach than loading the big CSV entirely into memory.
For one, if you really only care about rows containing the string PERPIGNAN (in any column; you can still filter precisely in your code afterwards), you could run grep PERPIGNAN data.csv > data_perpignan.csv and work with that (assuming you have grep; the same filtering can be done with a short Python script).
Since read_csv() accepts any iterable of lines, you can also just do something like

def lines_from_file_including_strings(file, strings):
    for i, line in enumerate(file):
        if i == 0 or any(string in line for string in strings):
            yield line

communes = ["PERPIGNAN", "PARIS"]
with open("StockEtablissement_utf8.csv") as f:
    df = pd.read_csv(lines_from_file_including_strings(f, communes), sep=",")

for an initial filter.

How to share the files (files object) between the various processes in python?

I'm using multiprocessing: there are a lot of JSON files, and I want several processes to read and write those files. I don't want any race conditions, so the processes also need to be synchronised.
I'm trying the following dummy code, but I don't know why it is not working. I'm using a multiprocessing Queue to share the open file object. Could you suggest what I'm doing wrong? I'm getting an error, and I'm new to multiprocessing.
Below is my code:
from multiprocessing import Queue, Process, Lock

def writeTofile(q, lock, i):
    print(f'some work by {i}')
    text = f" Process {i} -- "
    ans = ""
    for i in range(10000):
        ans += text
    # critical section
    lock.acquire()
    file = q.get()
    q.put(file)
    file.write(ans)
    lock.release()
    print(f'updated by process {i}')

def main():
    q = Queue()
    lock = Lock()
    jobs = []
    with open("test.txt", mode='a') as file:
        q.put(file)
        for i in range(4):
            process = Process(target=writeTofile, args=(q, lock, i))
            jobs.append(process)
            process.start()
        for j in jobs:
            j.join()
        print('completed')

if __name__ == "__main__":
    main()
This is the error I'm getting below:
Traceback (most recent call last):
File "/Users/akshaysingh/Desktop/ipnb/multi-processing.py", line 42, in <module>
main()
File "/Users/akshaysingh/Desktop/ipnb/multi-processing.py", line 27, in main
q.put(file)
File "<string>", line 2, in put
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/managers.py", line 808, in _callmethod
conn.send((self._id, methodname, args, kwds))
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_io.TextIOWrapper' object
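The traceback points at the root cause: an open file object (an _io.TextIOWrapper) cannot be pickled, so it cannot be sent through a multiprocessing Queue. One common workaround is to pass the file path plus a Lock and let each process open the file itself. A minimal sketch of that idea, assuming the test.txt file and the four processes from the question (everything else is illustrative), could look like this:

from multiprocessing import Process, Lock

def write_to_file(path, lock, i):
    text = f" Process {i} -- " * 10000
    # critical section: each process opens the file itself, under the shared lock
    with lock:
        with open(path, mode="a") as f:
            f.write(text)
    print(f"updated by process {i}")

def main():
    lock = Lock()
    jobs = []
    for i in range(4):
        p = Process(target=write_to_file, args=("test.txt", lock, i))
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
    print("completed")

if __name__ == "__main__":
    main()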

Why am I getting an ValueError: too many file descriptors in select()?

I load my proxies into the proxies variable and try to do async requests to get the IP. It's simple:
async def get_ip(proxy):
    timeout = aiohttp.ClientTimeout(connect=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get('https://api.ipify.org?format=json', proxy=proxy, timeout=timeout) as response:
                json_response = await response.json()
                print(json_response)
        except:
            pass

if __name__ == "__main__":
    proxies = []
    start_time = time.time()
    loop = asyncio.get_event_loop()
    tasks = [asyncio.ensure_future(get_ip(proxy)) for proxy in proxies]
    loop.run_until_complete(asyncio.wait(tasks))
    print('time spent to work: {} sec --------------'.format(time.time() - start_time))
This code works fine when I do 100-200-300-400 requests, but when the count is more than 500 I always get this error:
Traceback (most recent call last):
File "async_get_ip.py", line 60, in <module>
loop.run_until_complete(asyncio.wait(tasks))
File "C:\Python37\lib\asyncio\base_events.py", line 571, in run_until_complete
self.run_forever()
File "C:\Python37\lib\asyncio\base_events.py", line 539, in run_forever
self._run_once()
File "C:\Python37\lib\asyncio\base_events.py", line 1739, in _run_once
event_list = self._selector.select(timeout)
File "C:\Python37\lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
File "C:\Python37\lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
ValueError: too many file descriptors in select()
I was looking for a solution, but all I found was an OS-level limitation. Can I somehow get around this problem without using additional libraries?
It's not a good idea to start an unlimited number of requests simultaneously. Each started request consumes some resources, from CPU/RAM to the OS's select() capacity, which, as in your case, sooner or later leads to problems.
To avoid this you should use asyncio.Semaphore, which lets you limit the maximum number of simultaneous connections.
Only a few changes should be needed in your code:
sem = asyncio.Semaphore(50)

async def get_ip(proxy):
    async with sem:
        # ...
Here's a fuller example of how to use a semaphore in general.
P.S.
except:
    pass
You should never do such a thing; it will just break the code sooner or later.
At the very least, use except Exception.
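Putting both suggestions together, here is a sketch (not part of the original answer) of the question's code with the semaphore applied and the bare except replaced by specific exceptions; the limit of 50 concurrent requests is an arbitrary choice:

import asyncio
import time
import aiohttp

sem = asyncio.Semaphore(50)  # at most 50 requests in flight at once (arbitrary limit)

async def get_ip(proxy):
    timeout = aiohttp.ClientTimeout(connect=5)
    async with sem:  # wait here until one of the 50 slots is free
        async with aiohttp.ClientSession(timeout=timeout) as session:
            try:
                async with session.get('https://api.ipify.org?format=json',
                                       proxy=proxy, timeout=timeout) as response:
                    print(await response.json())
            except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
                print('request via {} failed: {!r}'.format(proxy, exc))

if __name__ == "__main__":
    proxies = []  # fill with proxy URLs, as in the question
    start_time = time.time()
    loop = asyncio.get_event_loop()
    tasks = [asyncio.ensure_future(get_ip(proxy)) for proxy in proxies]
    loop.run_until_complete(asyncio.wait(tasks))
    print('time spent to work: {} sec'.format(time.time() - start_time))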

Python Multiprocessing( TypeError: cannot serialize '_io.BufferedReader' object )

I'm trying to run a dictionary attack on a zip file, using Pool to increase speed.
But I get the following error in Python 3.6, while it works in Python 2.7:
Traceback (most recent call last):
File "zip_crack.py", line 42, in <module>
main()
File "zip_crack.py", line 28, in main
for result in results:
File "/usr/lib/python3.6/multiprocessing/pool.py", line 761, in next
raise value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 450, in _ handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot serialize '_io.BufferedReader' object
I tried to search for the same error but couldn't find an answer that helps here.
The code looks like this:
def crack(pwd, f):
    try:
        key = pwd.strip()
        f.extractall(pwd=key)
        return True
    except:
        pass

z_file = zipfile.ZipFile("../folder.zip")

with open('words.dic', 'r') as passes:
    start = time.time()
    lines = passes.readlines()
    pool = Pool(50)
    results = pool.imap_unordered(partial(crack, f=z_file), lines)
    pool.close()
    for result in results:
        if result:
            pool.terminate()
            break
    pool.join()
I also tried another approach using map:

with contextlib.closing(Pool(50)) as pool:
    pool.map(partial(crack, f=z_file), lines)

which worked great and found passwords quickly in Python 2.7, but it throws the same exception in Python 3.6.
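The error here is again about pickling: the ZipFile object bound into the partial wraps an open _io.BufferedReader, which Python 3's multiprocessing cannot serialize when dispatching tasks to the pool workers. One common workaround is to pass the archive path and open the ZipFile inside each call. A minimal sketch under that assumption (the paths and pool size are taken from the question; the rest is illustrative):

import time
import zipfile
from functools import partial
from multiprocessing import Pool

def crack(pwd, zip_path):
    # each call opens its own ZipFile handle, so nothing unpicklable crosses processes
    try:
        with zipfile.ZipFile(zip_path) as z_file:
            z_file.extractall(pwd=pwd.strip().encode())
        return pwd.strip()
    except Exception:
        return None

if __name__ == "__main__":
    with open('words.dic', 'r') as passes:
        lines = passes.readlines()
    start = time.time()
    with Pool(50) as pool:
        for result in pool.imap_unordered(partial(crack, zip_path="../folder.zip"), lines):
            if result:
                print("found password:", result, "in", time.time() - start, "sec")
                pool.terminate()
                break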

Contradicting Errors?

So I'm trying to edit a CSV file by writing to a temporary file and eventually replacing the original with the temp file. I'm going to have to edit the CSV file multiple times, so I need to be able to reference it. I've never used NamedTemporaryFile before and I'm running into a lot of difficulties. The most persistent problem I'm having is writing out the edited lines.
This part goes through and writes out each row, unless specific values are in a specific column, in which case it just skips the row.
I have this:
office = 3
temp = tempfile.NamedTemporaryFile(delete=False)

with open(inFile, "rb") as oi, temp:
    r = csv.reader(oi)
    w = csv.writer(temp)
    for row in r:
        if row[office] == "R00" or row[office] == "ALC" or row[office] == "RMS":
            pass
        else:
            w.writerow(row)
and I get this error:
Traceback (most recent call last):
File "H:\jcatoe\Practice Python\pract.py", line 86, in <module>
cleanOfficeCol()
File "H:\jcatoe\Practice Python\pract.py", line 63, in cleanOfficeCol
for row in r:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
So I searched for that error, and the general consensus was that "rb" needs to be "rt", so I tried that and got this error:
Traceback (most recent call last):
File "H:\jcatoe\Practice Python\pract.py", line 86, in <module>
cleanOfficeCol()
File "H:\jcatoe\Practice Python\pract.py", line 67, in cleanOfficeCol
w.writerow(row)
File "C:\Users\jcatoe\AppData\Local\Programs\Python\Python35-32\lib\tempfile.py", line 483, in func_wrapper
return func(*args, **kwargs)
TypeError: a bytes-like object is required, not 'str'
I'm confused because the errors seem to be telling me to do opposite things.
If you read the tempfile docs you'll see that NamedTemporaryFile opens the file in 'w+b' (binary) mode by default. If you take a closer look at your errors, you'll see that you get one on read and one on write. What you need to do is make sure you open your input and output files in the same (text) mode.
You can do it like this:
import csv
import tempfile

office = 3

with open(inFile, 'r') as oi, tempfile.NamedTemporaryFile(delete=False, mode='w') as temp:
    reader = csv.reader(oi)
    writer = csv.writer(temp)
    for row in reader:
        if row[office] == "R00" or row[office] == "ALC" or row[office] == "RMS":
            pass
        else:
            writer.writerow(row)
