I am trying to process the output of a system('./foo') command. If I redirect the output directly to a file with system('./foo > output') and read that file into MATLAB with dlmread, it works fine, but I'm trying to avoid writing a huge ASCII file (about 1e7 lines) to the hard disk every time I do this.
So I want to deal with the output directly by reading it into a huge string and splitting the string. It works fine for small files:
[a,b] = system('./foo');
b = strsplit(b);
b = cellfun(@str2num, b, 'UniformOutput', 0);
b = cell2mat(b);
Unfortunately, the strsplit step alone already consumes far too much memory, so MATLAB gets killed by the OOM killer.
I found the alternative:
b=textscan(b,'%s','delimiter',' ','multipleDelimsAsOne',1);
But it also consumes way too much memory.
Can somebody suggest a better way to split that string of numbers and read it into a matrix, or, more generally, a way to avoid writing the command's output to a file on the hard disk?
Edit: (I'm writing here because there is not enough space in the comments...)
@MZimmerman6 I have now tried a dlmread version with and without pre-allocation, as well as your proposal as I understood it:
In fact the loop is much slower than the dlmread.
clear all
close all

% 1) plain dlmread
tic
ttags1 = dlmread('tmp.txt', ' ', 1, 3);
toc

clear all

% 2) dlmread with pre-allocation
tic
[~, result] = system('perl -e ''while(<>){};print$.,"\n"'' tmp.txt');
numLines1 = str2double(result);
ttags = zeros(numLines1, 1);
ttags = dlmread('tmp.txt', ' ', 1, 3);
toc

clear all

% 3) line-by-line parsing with fgetl/textscan
tic
fid = fopen('tmp.txt');
count = 1;
[~, result] = system('perl -e ''while(<>){};print$.,"\n"'' tmp.txt');
numLines1 = str2double(result);
temp = cell(numLines1, 1);
for i = 1:numLines1
    tline = fgetl(fid);
    if ischar(tline)
        vals = textscan(tline, '%f', 'delimiter', ',');
        temp{i} = transpose(vals{1});
    end
end
fclose(fid);
temp = cell2mat(temp);
toc
The results are:
Elapsed time is 19.762470 seconds.
Elapsed time is 21.546079 seconds.
Elapsed time is 796.755343 seconds.
Am I doing something wrong?
Thank you & Best Regards
You should not try to read the entire file into memory, as this is extremely memory-heavy. I would recommend reading the file line by line, processing each line individually, and storing the results in a cell array. Once the parsing is done, you can convert that cell array into a normal matrix.
The first thing I would do is create a small Perl script to count the number of lines in the file you are reading, so that you can pre-allocate memory for the data. Call this file countlines.pl. Information gathered from here
Perl - countlines.pl
while (<>) {};
print $.,"\n";
This script is only two lines, but it will quickly count the total number of lines in the file.
You can then use its result to pre-allocate and then do your line-by-line parsing. In my testing I used a simple comma-separated file, so you can adjust textscan to handle your format as needed.
MATLAB Script
% get number of lines in data file
numLines = str2double(perl('countlines.pl','text.txt'));
fid = fopen('text.txt');
count = 1;
temp = cell(numLines,1);
for i = 1:numLines
    tline = fgetl(fid);
    if ischar(tline)
        vals = textscan(tline,'%f','delimiter',',');
        temp{i} = transpose(vals{1});
    end
end
fclose(fid);
temp = cell2mat(temp);
This should run relatively quickly, depending on your file size, and do what you want. Of course you can edit how the parsing is done inside the loop, but this should be a good starting point.
A note for the future: do not try to read large amounts of data into memory unless it is absolutely necessary.
Related
I am running a query against a Neo4j server which I expect to return >100M rows (but just a few columns) and then write the results into a CSV file. This works well for queries that return up to 10-20M rows, but becomes tricky as the number of resulting rows climbs into the 10^8 range.
I thought writing the results row by row (ideally buffered) would be the solution, but the csv writer appears to write to disk only once the whole script has executed (i.e. at the end of the iteration), rather than in chunks as expected. In the example below I tried explicitly flushing the file (which did not work). I also do not get any output on stdout, which suggests that the iteration is not proceeding as intended.
The memory usage of the process is growing rapidly, however: over 12 GB last I checked. That makes me think the cursor is trying to fetch all the data before starting the iteration, which it should not do, unless I have misunderstood something.
Any ideas?
from py2neo import Graph
import csv

# g = Graph(...) and query are defined earlier (omitted here)
cursor = g.run(query)

with open('bigfile.csv', 'w') as csvfile:
    fieldnames = cursor.keys()
    writer = csv.writer(csvfile)
    # writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # writer.writeheader()
    i = 0
    j = 1
    for rec in cursor:
        # writer.writerow(dict(rec))
        writer.writerow(rec.values())
        i += 1
        if i == 50000:
            print(str(i * j) + '...')
            csvfile.flush()
            i = 0
            j += 1
Isn't the main problem the size of the query rather than the method of writing the results to the CSV file? If you're chunking the writing process, perhaps you should chunk the querying process as well, since the results are held in memory while the file writing takes place.
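For illustration, here is a minimal sketch of that idea using Cypher's SKIP/LIMIT pagination. It assumes the query can be given a deterministic ORDER BY; the connection details, label, and properties below are hypothetical placeholders, not taken from the question:

from py2neo import Graph
import csv

PAGE_SIZE = 50000
graph = Graph()  # hypothetical connection; adjust to your server
base_query = "MATCH (n:Thing) RETURN n.id, n.name ORDER BY n.id"  # hypothetical query

with open('bigfile.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    skip = 0
    while True:
        # fetch one page of results at a time instead of the whole result set
        cursor = graph.run(f"{base_query} SKIP {skip} LIMIT {PAGE_SIZE}")
        rows = [rec.values() for rec in cursor]
        if not rows:
            break
        writer.writerows(rows)
        skip += PAGE_SIZE

Whether this is faster depends on how expensive the query is to re-run per page, but it caps the number of rows held in memory at any one time.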
I know this question has been asked several times, but those solutions really don't help me here. I have a really big file (almost 5 GB) to read, get the data from, and feed to my neural network. I have to read it line by line. At first I loaded the entire file into memory using the .readlines() function, but that obviously resulted in an out-of-memory issue. Next, instead of loading the entire file into memory, I read it line by line, but that still hasn't worked. So now I am thinking of splitting my file into smaller files and then reading each of those. The file format is such that each sequence has a header starting with '>' followed by the sequence itself, for example:
>seq1
acgtccgttagggtjhtttttttttt
tttsggggggtattttttttt
>seq2
accggattttttstttttttttaasftttttttt
stttttttttttttttttttttttsttattattat
tttttttttttttttt
>seq3
aa
.
.
.
>seqN
bbbbaatatattatatatatattatatat
tatatattatatatattatatatattatat
tatattatatattatatatattatatatatta
tatatatatattatatatatatatattatatat
tatatatattatatattatattatatatattata
tatatattatatattatatatattatatatatta
So now I want to split my file, which has 12,700,000 sequences, into smaller files such that each '>' header ends up in the same file as its corresponding sequence. How can I achieve this in Python without running into memory issues? Insights would be appreciated.
I was able to do this with 12,700,000 randomized lines of 1-20 random characters each, though the size of my file was far less than 5 GB (roughly 300 MB), likely due to formatting. All of that said, you can try this:
x = 0         # number of headers seen so far
y = 1         # index used to build the next file name
string = ""   # buffer holding the current chunk of sequences
cycle = "Seq1"

with open(f"{FILEPATH}/main.txt", "r") as file:
    for line in file:
        if line[0] == ">":
            # every 5000 headers, flush the buffer to a new file
            if x % 5000 == 0 and x != 0:
                with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
                    newfile.writelines(string)
                cycle = f"Seq{y*5000+1}"
                y += 1
                string = ""
            string += line
            x += 1
        if line[0] != ">":
            string += line

# write whatever is left in the buffer after the loop finishes
with open(f"{FILEPATH}/Sequence Files/Starting{cycle}.txt", "a") as newfile:
    newfile.writelines(string)
This reads the file line by line, appends 5000 sequences at a time to a string, writes that string to a new file, and repeats for the rest of the original file. It also names each file after the first sequence it contains.
The check if x % 5000 == 0 defines the number of sequences within each file, and the line cycle = f"Seq{y*5000+1}" builds the next file name. You can adjust the 5000 in both places if you change your mind about how many sequences per file (with 5000 per file you end up creating 2,540 new files).
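As a quick sanity check on that file count (a throwaway snippet, not part of the script above):

# 12,700,000 sequences split into chunks of 5000
num_sequences = 12_700_000
per_file = 5000
print(-(-num_sequences // per_file))  # ceiling division -> 2540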
I am trying to read each line in proc.stdout.readline and send the lines over the network, for example:
data = b''
for line in iter(proc.stdout.readline, ''):
    data += line
clientsocket.send(data)
When I run this code I seem to get stuck in an infinite loop, unable to reach the line:
clientsocket.send(data)
Is there a more efficient way to read the data? I've also tried a while loop that breaks if not line:
data = b''
while True:
    line += proc.stdout.readline()
    data += line
    if not line:
        break
clientsocket.send(data)
This seems to also produce the same results. Is there a more efficient way to read all of the data from proc.stdout.readline?
I've encountered this very same problem. The strange thing is that in Python 2.7 it had no trouble terminating and actually stopped iterating.
While debugging (in Python 3.5) I noticed that all real lines were returned with a trailing '\n' character, whereas the line that wasn't supposed to arrive came back as an empty string, i.e. ''. So I just added an if clause that checks against '' and breaks the loop if it matches.
My final version looks as follows:
lines = []
for _line in iter(process.stdout.readline, b''):
    if _line == '':
        break
    lines.append(_line)
One thing that might be worth mentioning is that I used the universal_newlines=True argument in the subprocess.Popen(...) call.
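For context, a minimal sketch of how such a call might look; the './foo' command is just a placeholder, and note that with universal_newlines=True stdout yields str lines, so the natural EOF sentinel for iter() is '' rather than b'':

import subprocess

# hypothetical command; text mode means readline() returns str and '' at EOF
process = subprocess.Popen(['./foo'],
                           stdout=subprocess.PIPE,
                           universal_newlines=True)

lines = []
for _line in iter(process.stdout.readline, ''):
    lines.append(_line.rstrip('\n'))
process.wait()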
The statement iter(proc.stdout.readline, "") will do a blocking read until it receives an EOF.
If you want to read all the lines, then you can just do:
data = b"".join(proc.stdout.readlines())
There is no other solution than for the proc to produce lines faster.
If you want, you can read lines with a timeout (i.e. wait for a certain number of characters to arrive, and time out if they are not read in time).
Those answers can be found here:
https://stackoverflow.com/a/10759061/6400614 .
https://stackoverflow.com/a/5413588/6400614
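As an illustration of the timeout idea, here is a minimal sketch using the standard library's select module. It is POSIX-only (select does not work on pipes on Windows), and the './foo' command and 5-second timeout are placeholders:

import select
import subprocess

proc = subprocess.Popen(['./foo'], stdout=subprocess.PIPE)

data = b''
while True:
    # wait up to 5 seconds for the pipe to become readable
    ready, _, _ = select.select([proc.stdout], [], [], 5.0)
    if not ready:
        break  # timed out: no data arrived in time
    line = proc.stdout.readline()
    if not line:
        break  # EOF: the process closed its stdout
    data += line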
I am reading a file using with open in Python and then doing all the other operations inside the with block. When calling the function, only the first operation inside the block produces output, while the others are empty. I can get this to work with another approach such as readlines, but I haven't found out why this version does not work. I thought the reason might be that the file gets closed, but with open takes care of that. Could anyone please tell me what's wrong?
def read_datafile(filename):
    with open(filename, 'r') as f:
        a = [lines.split("\n")[0] for number, lines in enumerate(f) if number == 2]
        b = [lines.split("\n")[0] for number, lines in enumerate(f) if number == 3]
        c = [lines.split("\n")[0] for number, lines in enumerate(f) if number == 2]
    return a, b, c

read_datafile('data_file_name')
I only get values for a, and all the others are empty. When a is commented out, I get values for b and the others are empty.
Updates
The file looks like this:
-0.6908270760153553 -0.4493128078936575 0.5090918714784820
0.6908270760153551 -0.2172871921063448 0.5090918714784820
-0.0000000000000000 0.6666999999999987 0.4597549674638203
0.3097856229862140 -0.1259623621214220 0.5475896447896115
0.6902143770137859 0.4593623621214192 0.5475896447896115
The construct
with open(filename) as handle:
    a = [line for line in handle if condition]
    b = [line for line in handle]
will always return an empty b because the iterator in a already consumed all the data from the open filehandle. Once you reach the end of a stream, additional attempts to read anything will simply return nothing.
If the input is seekable, you can rewind it and read all the same lines again; or you can close it (explicitly, or implicitly by leaving the with block) and open it again - but a much more efficient solution is to read it just once, and pick the lines you actually want from memory. Remember that reading a byte off a disk can easily take several orders of magnitude more time than reading a byte from memory. And keep in mind that the data you read could come from a source which is not seekable, such as standard output from another process, or a client on the other side of a network connection.
def read_datafile(filename):
    with open(filename, 'r') as f:
        lines = [line for line in f]
    a = lines[2]
    b = lines[3]
    c = lines[2]
    return a, b, c
If the file could be too large to fit into memory at once, you end up with a different set of problems. Perhaps in this scenario, where you only seem to want a few lines from the beginning, only read that many lines into memory in the first place.
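For example, a minimal sketch of that idea using itertools.islice, assuming (as in the code above) that only the first four lines are ever needed:

from itertools import islice

def read_datafile(filename):
    with open(filename, 'r') as f:
        # read only the first four lines instead of the whole file
        first_lines = list(islice(f, 4))
    a = first_lines[2]
    b = first_lines[3]
    c = first_lines[2]
    return a, b, c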
What exactly are you trying to do with this script? The lines variable here may not contain what you want: it holds a single line on each iteration, because the file is enumerated line by line.
I have a HUGE plaintext file with 1 billion strings, with an average string length of around 10-12, potential duplicates, and each string on its own line. My task is: given a query string, find the line number of the first match if the string exists in my file, or return "not found".
A natural solution is to run grep -m1 -n '^querystring$' every time, which takes around 15-20 seconds, and this does not require extra storage and is not memory-intensive. Is this a good solution, or is there something much better?
(N.B. As a rough guide, my storage requirement: <10GB, and memory requirement: <16GB)
You can use a simple Python script for that:
file = 'file.txt'
queryLine = 1
with open(file, 'r') as f:
    for line in f:
        if <YOUR QUERY> in line:
            print(queryLine)  # line number of the first match
            break
        queryLine += 1
This way you break as soon as a match is found, instead of using grep and going over the whole file every time.
Here's a Python solution:
When you open a file you get an iterator that yields one line at a time, which is very memory efficient. My suggestion is to enumerate the file and get the first line meeting your criterion, like this:
def first_occurrence(filename, query):
    with open(filename) as f:
        filtered = (i for i, line in enumerate(f, 1) if query in line)
        return next(filtered, 'not found')
If there's no such line, the default value 'not found' is returned. filtered = (...) builds a generator by employing a generator expression. Generators are iterators, so this part is memory efficient as well.
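A possible call might then look like this (the file name and query string here are just placeholder examples):

# hypothetical file name and query string
print(first_occurrence('strings.txt', 'querystring'))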