Read some specific lines from a big file in Python

I want to read some specific lines from a large text file, where the line numbers are given in a list, for example:
list_Of_line = [3991, 3992, ...]. I want to check whether the string "this city" appears in line 3991, 3992, ..., and I want to access those lines directly. How can I do this in Python?
Text_File is like below
Line_No
......................
3990 It is a big city.
3991 I live in this city.
3992 I love this city.
.......................

There is no way to "directly access" a specific line number in a file, since lines can start at any position and can be of any length. The only way to know where each line begins is therefore to read every character of the file and locate each newline character.
Knowing that, you can read through a file object once to build a list of the file positions at which each line starts (via the file object's tell method), so that you can then "directly access" any line number you want by jumping to its start with the seek method and reading that specific line:
list_of_lines = [3991, 3992]
with open('file.txt') as file:
    # Record the offset at which each line starts.  (In Python 3 text
    # mode, file.tell() is disabled while iterating with "for line in
    # file", so collect the positions with readline() instead.)
    positions = [0]
    while file.readline():
        positions.append(file.tell())
    for line_number in list_of_lines:
        file.seek(positions[line_number - 1])
        if 'this city' in file.readline():
            print(f"'this city' found in line #{line_number}")
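If building the position index by hand feels heavy, the standard library's linecache module does the same bookkeeping for you. A sketch (note the trade-off: linecache reads and caches the entire file in memory on first access):

```python
import linecache

list_of_lines = [3991, 3992]
for n in list_of_lines:
    # getline is 1-based and returns '' for a missing file or line
    line = linecache.getline('file.txt', n)
    if 'this city' in line:
        print(f"'this city' found in line #{n}")
```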

How to work with headers in log file? Python

I have a log file (.txt) which I want to open, read line by line, convert the data, and store with Pandas. However, it has a header with some useful information I want to grab. What is best practice when working with header sections? For example, I need to grab the "CAN-bus address", which is stored on the next row. The "CAN-bus address" part will be the same for another file, but the "460 (Machine)" will change. How do I achieve that effectively? If I run my code I get the error "TypeError: '_io.TextIOWrapper' object is not subscriptable".
Any guidance would be appreciated! I could write a nasty bit of code to get this data on the next pass through the loop with the help of a few if statements and Booleans, but there must be a better way to do this.
Also, what is a good way to detect when the header is over and the data is starting? Just compare every line with "DateTime"?
log file:
Developer
Raw data extractor
Date Range to extract
From
12/18/2022
Until
02/01/2023
CAN-bus address
460 (Machine)
DateTime GPStime CAN-bus data
19 December 2022 07:20:53 1671430853 0162c0c1cafe0000
19 December 2022 07:20:53 1671430853 05000000003e3c00
...
Code:
with open(filePath) as openfileobject:           # Open the file
    for row, line in enumerate(openfileobject):  # Read file line by line
        if line.lower() == 'CAN-bus address'.lower():  # identify the CAN message ID
            print(line)
            print(openfileobject[row+1])         # this raises the TypeError
I have tried consecutive if statments and Boolean variables to keep track on if we have found the correct row or not. It gets messy.
I hope it may help.
skip = 0
can_addr = []
# open the text file
with open("temp.txt", "r") as f:
    for line in f:
        # strip surrounding whitespace and the trailing newline
        line = line.strip()
        if "Developer" in line:
            # start skipping from the "Developer" line
            skip = 1
            continue
        if "DateTime" in line:
            # skip until the line that has "DateTime", and skip that line too
            skip = 0
            continue
        if skip == 1:
            continue
        # At this stage, print(line) gives the data rows under the header;
        # add the column of interest into the array
        can_addr.append(line[28:38])
print(can_addr)
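Another way to handle the label-on-one-line, value-on-the-next layout is to walk the file with a single iterator and call next() when a label is found; the "DateTime" line then marks where the header ends and the data begins. A sketch (the sample from the question is written to temp.txt here only to keep the snippet self-contained):

```python
sample = """Developer
Raw data extractor
Date Range to extract
From
12/18/2022
Until
02/01/2023
CAN-bus address
460 (Machine)
DateTime GPStime CAN-bus data
19 December 2022 07:20:53 1671430853 0162c0c1cafe0000
19 December 2022 07:20:53 1671430853 05000000003e3c00
"""
with open('temp.txt', 'w') as f:
    f.write(sample)

can_address = None
with open('temp.txt') as f:
    lines = iter(f)
    for line in lines:
        stripped = line.strip()
        if stripped == 'CAN-bus address':
            can_address = next(lines).strip()  # the value sits on the next row
        elif stripped.startswith('DateTime'):
            break                              # header is over, data starts here
    data_rows = [row.strip() for row in lines if row.strip()]

print(can_address)     # 460 (Machine)
print(len(data_rows))  # 2
```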

How to add to the beginning of each line of a large file (>100GB) the index of that line with Python?

some_file.txt: (before)
one
two
three
four
five
...
How can I effectively modify large file in Python?
with open("some_file.txt", "r+") as file:
    for idx, line in enumerate(file.readlines()):
        file.writeline(f'{idx} {line}') # something like this
some_file.txt: (after)
1 one
2 two
3 three
4 four
5 five
...
Don't try to load your entire file in memory, because the file may be too large for that. Instead, read line by line:
with open('input.txt') as inp, open('output.txt', 'w') as out:
    idx = 1
    for line in inp:
        out.write(f'{idx} {line}')
        idx += 1
You can't insert into the middle of a file without re-writing it. This is an operating system thing, not a Python thing.
Use pathlib for path manipulation. Rename the original file. Then copy it to a new file, adding the line numbers as you go. Keep the old file until you verify the new file is correct.
Open files are iterable, so you can use enumerate() on them directly without having to use readlines() first. The second argument to enumerate() is the number to start the count with. So the loop below will number the lines starting with 1.
from pathlib import Path

target = Path("some_file.txt")
# rename the original file with an ".old" suffix
original = target.rename(target.with_suffix(".old"))
with original.open("r") as source, target.open("w") as sink:
    for line_no, line in enumerate(source, 1):
        sink.write(f'{line_no} {line}')

Count the number of characters in a file

The question:
Write a function file_size(filename) that returns a count of the number of characters in the file whose name is given as a parameter. You may assume that when being tested in this CodeRunner question your function will never be called with a non-existent filename.
For example, if data.txt is a file containing just the following line: Hi there!
A call to file_size('data.txt') should return the value 10. This includes the newline character that will be added to the line when you're creating the file (be sure to hit the 'Enter' key at the end of each line).
What I have tried:
def file_size(data):
    """Count the number of characters in a file"""
    infile = open('data.txt')
    data = infile.read()
    infile.close()
    return len(data)

print(file_size('data.txt'))
# data.txt contains 'Hi there!' followed by a newline character.
I get the correct answer for this file; however, I fail a test that uses a larger file, which should have a character count of 81, but I still get 10. I am trying to get the code to count the correct size of any file.
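The count stays at 10 because the function ignores its parameter and always opens the hard-coded 'data.txt'. A minimal fix is to open the name that was passed in (using a with block so the file is closed automatically):

```python
def file_size(filename):
    """Count the number of characters in a file."""
    # open the file named by the parameter, not a hard-coded name
    with open(filename) as infile:
        return len(infile.read())
```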

Best way to fix inconsistent csv file in python

I have a csv file which is not consistent. It looks like this, where some rows have a middle name and some do not. I don't know the best way to fix this. The middle name will always be in the second position if it exists; if a middle name doesn't exist, the last name is in the second position.
john,doe,52,florida
jane,mary,doe,55,texas
fred,johnson,23,maine
wally,mark,david,44,florida
Let's say that you have ① wrong.csv and want to produce ② fixed.csv.
You want to read a line from ①, fix it and write the fixed line to ②, this can be done like this
with open('wrong.csv') as input, open('fixed.csv', 'w') as output:
    for line in input:
        line = fix(line)
        output.write(line)
Now we want to define the fix function...
Each line has either 4 or 5 comma-separated fields, so what we want to do is split the line on commas, return the line unmodified if the number of fields is 4 (no middle name), and otherwise join field 0 and field 1 (Python counts from zero...), reassemble the output line, and return it to the caller.
def fix(line):
    items = line.split(',')  # items is a list of strings
    if len(items) == 4:      # no middle name: the line is OK as it stands
        return line
    # join first and middle name with a space
    first_middle = ' '.join((items[0], items[1]))
    # we want to return a "fixed" line,
    # i.e., a string, not a list of strings,
    # so we join the combined name with the remaining fields
    return ','.join([first_middle] + items[2:])
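For completeness, the same fix can be written with the standard csv module, which also copes with quoted fields; this sketch parses the sample rows from an in-memory string rather than a file:

```python
import csv
import io

raw = """john,doe,52,florida
jane,mary,doe,55,texas
fred,johnson,23,maine
wally,mark,david,44,florida
"""

fixed = []
for row in csv.reader(io.StringIO(raw)):
    if len(row) == 5:                           # middle name present
        row = [f'{row[0]} {row[1]}'] + row[2:]  # merge first and middle name
    fixed.append(row)

print(fixed[1])  # ['jane mary', 'doe', '55', 'texas']
```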

Different behaviour shown when running the same code for a file and for a list

I have observed this unusual behaviour when I do string slicing on the words in a file and on the words in a list. The two results are quite different.
For example I have a file 'words.txt' which contains the following content
POPE
POPS
ROPE
POKE
COPE
PAPE
NOPE
POLE
When I write the below piece of code, I expect to get a list of words with last letter omitted.
with open("words.txt", "r") as fo:
    for l in fo:
        print(l[:-1])
But instead I get the result below. No slicing seems to take place and the words look the same as before.
POPE
POPS
ROPE
POKE
COPE
PAPE
NOPE
POLE
But if I write the below code, I get what I want
lis = ["POPE", "POPS", "ROPE", "POKE", "COPE", "PAPE", "NOPE", "POLE"]
for i in lis:
    print(i[:-1])
I am able to delete the last letter of each of the words as expected.
POP
POP
ROP
POK
COP
PAP
NOP
POL
So why do I see two different results for the same operation [: -1] ?
Lines in a file end with \n, whereas the strings in your list have no line endings.
Your actual file contents are as follows
POPE\n
POPS\n
ROPE\n
POKE\n
COPE\n
PAPE\n
NOPE\n
POLE\n
hence print(l[:-1]) is actually trimming the line ending, i.e. the \n.
To verify this, declare an empty list before the loop, add each line to that list, and print it. You will find that every line contains a trailing \n:
stuff = []
with open("words.txt", "r") as fo:
    for line in fo:
        stuff.append(line)
print(stuff)
this will print ['POPE\n', 'POPS\n', 'ROPE\n', 'POKE\n', ...]
If I am not wrong, you want to carry out the slicing operation on the file contents. I think you should look into the strip() (or rstrip()) method.
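For example, stripping the newline first makes [:-1] remove the last letter instead of the invisible \n:

```python
words = ['POPE\n', 'POPS\n', 'ROPE\n']
# rstrip('\n') removes only the trailing newline; [:-1] then drops the last letter
trimmed = [w.rstrip('\n')[:-1] for w in words]
print(trimmed)  # ['POP', 'POP', 'ROP']
```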
