I am creating a program that extracts the relevant information from a text file with 500k lines.
What I've managed so far is to read the text file and turn it into a list, with each element being a line.
The relevant text is formatted like this:
*A title that informs that the following section will have the data I'm trying to extract *
*Valuable info in random amount of lines*
*-------------------*
and in between each relevant section there is other information, formatted in the same way but starting with a different title, i.e.:
*A title that shows that this is data I don't want *
*Non-valuable info in random amount of lines *
*------------------- *
I've managed to list the indexes of the starting points with the following code:
start = [i for i, x in enumerate(lines) if x[0:4] == searchObject1 and x[5:8] == searchObject2]
But I'm struggling to find the stopping points. I can't use the same method as for the starting points, because the stopping line also appears after the non-important info.
I'm quite the newbie to both Python and programming, so the solution might be obvious.
A simple solution is to loop over the input file line by line, and keep only valuable lines. To know whether a line is valuable, we use a boolean variable that is:
set to true ("keep the lines") whenever we encounter a title marking the beginning of a section of interesting data,
set to false ("discard the lines") whenever we encounter an end-of-section mark. The variable is also set to false at the end of a useless section, which simply leaves its state unchanged.
Here is the code (lines is the list of strings containing the data to parse):
keep = False                      # False means "discard the lines"
data = []
for line in lines:
    if line == useful_title:      # adapt: title of a useful section
        keep = True
    elif line == end_of_section:  # adapt: the end-of-section mark
        keep = False
    elif keep:
        data.append(line)
If none of the cases matched, the line was one of two things:
a line of data in a useless section
the title of a useless section
So it can be discarded.
Note that the titles and end of section lines are not saved.
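For illustration, here is a self-contained version of that loop with made-up marker strings (the useful_title value is an assumption; adapt both markers to your actual titles):

useful_title = "BEGIN USEFUL"             # assumed marker, adapt
end_of_section = "*-------------------*"  # from the question's format

lines = [
    "BEGIN USEFUL", "value 1", "value 2", "*-------------------*",
    "BEGIN USELESS", "noise", "*-------------------*",
]

keep = False
data = []
for line in lines:
    if line == useful_title:
        keep = True
    elif line == end_of_section:
        keep = False
    elif keep:
        data.append(line)

print(data)  # prints ['value 1', 'value 2']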
Related
I am reading a file using with open in Python and then doing all other operations in a loop inside the with block. When calling the function, only the first operation inside the loop prints anything; the others are empty. It works if I use another approach such as readlines, but I did not find out why this does not work. I thought the reason might be the file being closed, but with open takes care of that. Could anyone please suggest what's wrong?
def read_datafile(filename):
    with open(filename, 'r') as f:
        a = [lines.split("\n")[0] for number, lines in enumerate(f) if number == 2]
        b = [lines.split("\n")[0] for number, lines in enumerate(f) if number == 3]
        c = [lines.split("\n")[0] for number, lines in enumerate(f) if number == 2]
    return a, b, c

read_datafile('data_file_name')
I only get values for a and all the others are empty. When a is commented out, I get a value for b and the others are empty.
Updates
The file looks like this:
-0.6908270760153553 -0.4493128078936575 0.5090918714784820
0.6908270760153551 -0.2172871921063448 0.5090918714784820
-0.0000000000000000 0.6666999999999987 0.4597549674638203
0.3097856229862140 -0.1259623621214220 0.5475896447896115
0.6902143770137859 0.4593623621214192 0.5475896447896115
The construct
with open(filename) as handle:
    a = [line for line in handle if condition]
    b = [line for line in handle]
will always return an empty b because the iterator in a already consumed all the data from the open filehandle. Once you reach the end of a stream, additional attempts to read anything will simply return nothing.
If the input is seekable, you can rewind it and read all the same lines again; or you can close it (explicitly, or implicitly by leaving the with block) and open it again - but a much more efficient solution is to read it just once, and pick the lines you actually want from memory. Remember that reading a byte off a disk can easily take several orders of magnitude more time than reading a byte from memory. And keep in mind that the data you read could come from a source which is not seekable, such as standard output from another process, or a client on the other side of a network connection.
def read_datafile(filename):
    with open(filename, 'r') as f:
        lines = [line for line in f]
    a = lines[2]
    b = lines[3]
    c = lines[2]
    return a, b, c
If the file could be too large to fit into memory at once, you end up with a different set of problems. In this scenario, where you only seem to want a few lines from the beginning, read only that many lines into memory in the first place.
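For instance, itertools.islice can cap how much of the file is ever read; a sketch, assuming only the first four lines are needed:

from itertools import islice

def read_datafile(filename):
    with open(filename, 'r') as f:
        lines = list(islice(f, 4))  # stop reading after four lines
    a = lines[2]
    b = lines[3]
    c = lines[2]
    return a, b, c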
What exactly are you trying to do with this script? The lines variable may not contain what you want: despite its name, it holds a single line at a time, because the file gets enumerated line by line.
I used 3 lines of code which worked well. Then I tried to contract them into one line, which I believe can be done by combining the two variables. But for some reason, the contracted code only returned 0 instead of the actual sum that was computed before. What has gone wrong in the contracted code?
import re

hand = open('xxxxxx.txt')
# This is a text file that contains many numbers in random positions

num = re.findall('[0-9]+', hand.read())
# A regular expression extracts all numbers from the file into a list of strings

numi = [int(i) for i in num]
# Convert every number from string form to integer form

print(sum(numi))
# Successfully prints the sum of all integers

print(sum([int(i) for i in re.findall('[0-9]+', hand.read())]))
# Here is the problem: contracting 'num' and 'numi' into one line only prints 0
If you execute all the code exactly as shown above, it is normal to get 0, because you didn't re-open the file after consuming it the first time. Either re-open the file hand, or keep just the final line you want to use and delete the three lines before it.
This code works fine for me -
hand = open('xxxxx.txt')
import re
print(sum([int(i) for i in re.findall('[0-9]+', hand.read())]))
You have to close the file and reopen it before running the last line.
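If you'd rather keep a single open() call, a seekable file handle can also be rewound; a sketch using the question's placeholder filename:

import re

hand = open('xxxxxx.txt')
num = re.findall('[0-9]+', hand.read())
numi = [int(i) for i in num]
print(sum(numi))

hand.seek(0)  # rewind so read() returns the whole content again
print(sum([int(i) for i in re.findall('[0-9]+', hand.read())]))
hand.close()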
I have two .csv datasets from the same source. I am attempting to check whether any of the items from the first dataset are still present in the second.
#!/usr/bin/python
import csv
import json
import click

@click.group()
def cli(*args, **kwargs):
    """Command line tool to compare and generate a report of items that still persist from one report to the next."""
    pass

@click.command(help='Compare the keysets and return a list of old keys still active in the new keyset.')
@click.option('--inone', '-i', default='keys.csv', help='Specify the file of the old keyset')
@click.option('--intwo', '-i2', default='keys2.csv', help='Specify the file of the new keyset')
@click.option('--output', '-o', default='results.json', help='Sets the name of the output.')
def compare(inone, intwo, output):
    csvfile = open(inone, 'r')
    csvfile2 = open(intwo, 'r')
    jsonfile = open(output, 'w')
    reader = csv.DictReader(csvfile)
    comparator = csv.DictReader(csvfile2)
    for line in comparator:
        for row in reader:
            if row == line:
                print('#', end='')
                json.dump(row, jsonfile)
                jsonfile.write('\n')
            print('|', end='')
        print('-', end='')

cli.add_command(compare)

if __name__ == '__main__':
    cli()
Say each csv file has 20 items in it. The code currently iterates 40 times and ends, when I was expecting it to iterate 400 times and create a report of the items remaining.
Everything but the iteration seems to be working. Does anyone have thoughts on a better approach?
Iterating 40 times sounds just about right - when you iterate through your DictReader, you're essentially iterating through the wrapped file lines, and once you're done iterating it doesn't magically reset to the beginning - the iterator is done.
That means that your code will iterate over the first item in the comparator (1), then iterate over all items in the reader (20), then get the next line from the comparator (1); by then it won't have anything left to iterate over in the reader, so it will simply step through the remaining comparator lines (18), resulting in a total of 40 loops.
If you really want to iterate over all of the lines (and memory is not an issue), you can store them as lists and then you get a new iterator whenever you start a for..in loop, so:
reader = list(csv.DictReader(csvfile))
comparator = list(csv.DictReader(csvfile2))
Should give you an instant fix. Alternatively, you can reset your reader stream after the loop with csvfile.seek(0).
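For example (a sketch; the DictReader has to be rebuilt after rewinding, otherwise it would treat the header row as data on the second pass):

csvfile.seek(0)                   # rewind the underlying file
reader = csv.DictReader(csvfile)  # recreate the reader so the header is re-read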
That being said, if you're going to compare lines only, and you expect that not many lines will differ, you can read the first line with csv.reader() to get the header and then forgo csv.DictReader altogether by comparing the lines directly. Then, when a line is of interest, you can feed it into csv.reader() to get it properly parsed and map it to the header to recover the field names.
That should be significantly faster on large data sets, plus seeking through the file can give you the benefit of never needing to store more data in memory than the current I/O buffer.
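A sketch of that idea, assuming the default filenames from the options above; raw lines are compared as plain strings (this version does hold the old keyset in memory as a set, trading the low-memory property for speed):

import csv
import json

with open('keys.csv') as old, open('keys2.csv') as new:
    header = next(csv.reader([old.readline()]))  # parse the header once
    new.readline()                               # skip the new file's header
    old_lines = set(old)                         # raw lines of the old keyset
    with open('results.json', 'w') as out:
        for line in new:
            if line in old_lines:                # plain string comparison
                row = next(csv.reader([line]))   # parse only on a match
                json.dump(dict(zip(header, row)), out)
                out.write('\n')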
So... I need to read a file and add the line number at the beginning of each line, just as the title says. How do you do it?
For example, if the content of the file was:
This
is
a
simple
test
file
I should turn these 6 lines into:
1. This
2. is
3. a
4. simple
5. test
6. file
Keep the original content, just adding the line number at the beginning.
My code looks like this so far:
def add_numbers(filename):
    f = open(filename, "w+")
    line_number = 1
    for line in f.readlines():
        number_added = str(line_number) + '. ' + f.readline(line)
        line_number += 1
    return number_added
But it doesn't really show anything as a result. I have no clue how to do it. Any help?
A few problems I see in your code:
Your indentation is not correct. Everything below the def add_numbers(): should be indented one level.
It is good practice to close a file handle at the end of your method.
A similar question to yours was asked here. Looking at the various solutions posted there, using fileinput seems like your best bet because it allows you to edit your file in-place.
import fileinput

def add_numbers(filename):
    line_number = 1
    for line in fileinput.input(filename, inplace=True):
        # each line keeps its trailing newline, so suppress print's own
        print("{}. {}".format(line_number, line), end="")
        line_number += 1
Also note that I use format to combine two strings instead of adding them together, because this handles different variable types more easily. A good explanation of the use of format can be found here.
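The counter can also be folded into the loop with enumerate, a small variant of the same idea (end='' stops print from doubling the newline each line already carries):

import fileinput

def add_numbers(filename):
    # start=1 makes the numbering begin at 1 instead of 0
    for line_number, line in enumerate(fileinput.input(filename, inplace=True), start=1):
        print("{}. {}".format(line_number, line), end="")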
The underlying purpose is for me to become a Python expert (long run), and the immediate goal is as follows...
I want to merge two massive lists into one. The lists are very large text files comprising millions of lines that look like the following:
bigfile1                         bigfile2
(10,'red','blue','orange')       (10,'31','false','true')
(11,'black','blue','green')      (11,'88','true','true')
(12,'blue','blue','green')       random junk once in a while
(13,'red','blue','yellow')       (12,'3','false','false')
(14,'brown','red','red')         (15,'6','true','true')
Using Python, I would like to:
merge the lines from each list and write them to a new list if the "usernumbers" before the first comma are the same.
Have the program complete before our sun runs out of hydrogen, which rules out iterating every line against every line.
I then learned about for a, b in zip(...), but then I had a new problem. I want to merge lines from the lists only when the first number on each line is the same, and within the lists some numbers are skipped in one list, there are duplicates, garbled trash once in a while, etc., so I can't just go line by line, and I can't figure out whether there's a way to advance only a or only b when using zip. I realize these are files that should be in a database and queried from there, but it's an exercise for me to learn more Python.
I'm using Python 3.4 on Windows. If anyone has suggestions on completing the following, or starting from scratch, I would greatly appreciate it!
I want the lines with the same usernumbers to be merged together in a new file. My current code follows:
list1 = open('bigfile1.txt', 'r', errors='ignore')
list2 = open('bigfile2.txt', 'r', errors='ignore')

for a, b in zip(list1, list2):
    c = (''.join(a.split("(")[1:])).rstrip()   # strip the leading "("
    d = ''.join(c.split(",")[:1])              # usernumber from file 1
    e = (''.join(b.split("(")[1:])).rstrip()
    f = ''.join(e.split(",")[:1])              # usernumber from file 2
    if d == f:
        #FILE.write()
        print(a, b)
    elif d != f:
        ##### I'm STUCK!! #####
        pass

#FILE.close()
list1.close()
list2.close()
Note: adding a wrapper to an iterator will reduce performance. Sorry.
You can skip lines of the file by wrapping the file iterator; this just means writing your own generator with a conditional:
from itertools import chain

def ignore_junk(file_iter):
    # yield only lines that look like data rows, i.e. "(<digits>,..."
    for line in file_iter:
        if line[0] == "(" and line[1:3].isdigit():
            yield line

pair_rows_iter = zip(ignore_junk(list1), ignore_junk(list2))
all_lines_iter = chain.from_iterable(pair_rows_iter)
new_file.writelines(all_lines_iter)
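zip still only pairs lines that happen to line up, though. To merge on the usernumber itself, one option is to index one file by key first; a sketch, reusing ignore_junk from above (the merged.txt output name is an assumption, and on duplicate usernumbers the last line wins):

def key_of(line):
    # "(10,'red','blue','orange')" -> "10"
    return line.split("(", 1)[1].split(",", 1)[0]

with open('bigfile1.txt', errors='ignore') as f1:
    # index the first file by usernumber
    table = {key_of(line): line.rstrip() for line in ignore_junk(f1)}

with open('bigfile2.txt', errors='ignore') as f2, open('merged.txt', 'w') as out:
    for line in ignore_junk(f2):
        k = key_of(line)
        if k in table:                # merge only matching usernumbers
            out.write(table[k] + " " + line)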