Erasing part of a text file in Python - python-3.x

I have a really big text file on my hard disk. It contains around 8 million JSON objects separated by commas, and I want to remove the last one; however, because the file is so big I cannot do it with regular editors (Notepad++, Sublime, Visual Studio Code, ...). So I decided to use Python, but I have no clue how to erase part of an existing file with it. Any kind of help would be appreciated.
P.S.: My file has the following structure:
json1, json2, json3, ...
where each JSON object looks like {"a":"something", "b":"something", "c":"something"}

The easiest way would be to make the file content valid JSON by enclosing it in [ and ] so that it becomes a list of dicts. After removing the last item from the list, you can dump it back into a string and then strip its first and last characters, which are the [ and ] that your original file format does not have:
import json
with open('file.txt', 'r') as r, open('newfile.txt', 'w') as w:
    w.write(json.dumps(json.loads('[%s]' % r.read())[:-1])[1:-1])
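For intuition, here is the same trick applied to a small in-memory string (the sample objects are made up):
import json
s = '{"a": 1}, {"b": 2}, {"c": 3}'
# wrap in brackets, parse, drop the last element, dump, and strip the brackets again
trimmed = json.dumps(json.loads('[%s]' % s)[:-1])[1:-1]
print(trimmed)  # {"a": 1}, {"b": 2}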

Since you only want the last JSON object removed, a much more efficient method is to identify the first valid JSON object counting backwards from the end of the file, and truncate the file at the position of that object's preceding comma.
This can be accomplished by seeking and reading backwards from the end of the file, one relatively small chunk at a time; splitting each chunk by { (since it marks the beginning of a JSON object); and prepending the fragments one at a time to a buffer until the buffer is parsable as a JSON object (this lets the code handle nested dict structures). At that point, find the preceding comma in the preceding fragment and prepend it to the buffer, so that finally you can seek the file to where the buffer starts and truncate the file:
import json
chunk_size = 1024
with open('file.txt', 'rb+') as f:
    f.seek(-chunk_size, 2)
    buffer = ''
    while True:
        fragments = f.read(chunk_size).decode().split('{')
        f.seek(-chunk_size * 2, 1)
        i = len(fragments)
        for fragment in fragments[:0:-1]:
            i -= 1
            buffer = '{%s%s' % (fragment, buffer)
            try:
                json.loads(buffer)
                break
            except ValueError:
                pass
        else:
            buffer = fragments[0] + buffer
            continue
        break
    next_fragment = fragments[i - 1]
    # if we don't have a comma in the preceding fragment and it is already the first
    # fragment, we need to read backwards a little more; the file position currently
    # sits one chunk before the chunk we just read, so seek to the 2 bytes that
    # immediately precede that chunk
    if i == 1 and ',' not in fragments[0]:
        f.seek(chunk_size - 2, 1)
        next_fragment = f.read(2).decode() + next_fragment
    buffer = next_fragment[next_fragment.rindex(','):] + buffer
    f.seek(-len(buffer.encode()), 2)
    f.truncate()
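To sanity-check the approach on something small, you can write a tiny file in the same comma-separated format and run the script above against it with a reduced chunk_size (the sample objects below are made up; chunk_size must not exceed the file size because of the initial seek from the end):
sample = '{"a": 1}, {"b": {"c": 2}}, {"d": 3}'
with open('file.txt', 'w') as f:
    f.write(sample)
# ... run the truncation script above with chunk_size = 8 ...
with open('file.txt') as f:
    print(f.read())  # expected: {"a": 1}, {"b": {"c": 2}}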

Related

Reading a list of tuples from a text file in python

I am reading a text file containing a list of tuples, and I want to add another tuple to the list in my program and write the appended list back to the text file.
Example in the file
[('john', 'abc')]
Want to write back to the file as
[('john', 'abc'), ('jack', 'def')]
However, whenever I write back to the file, the appended list ends up wrapped in double quotes along with the square brackets. I just want it to appear as above.
You can write a reusable function which takes two parameters: file_path (the file you want to write the tuple to) and tup (the tuple you want to append), and put your logic inside it. Later you can supply the proper data to this function and it will do the job for you.
Note: the comments in the code below document each step.
tuples.txt (Before writing)
[('john', 'abc')]
Code
def add_tuple_to_file(file_path, tup):
    with open(file_path, 'r+') as f:
        content = f.read().strip()  # read the content and strip surrounding whitespace
        tuples = eval(content)  # convert the string to the original list of tuples (not possible with json.loads())
        tuples.append(tup)  # append the new tuple `tup` to the old list
        f.seek(0)  # reading moved the file pointer to the end, so put it back at the beginning
        f.truncate()  # truncate the file (erase the old content)
        f.write(str(tuples))  # write back the updated list
# Try
add_tuple_to_file("./tuples.txt", ('jack', 'def'))
tuples.txt (After writing back)
[('john', 'abc'), ('jack', 'def')]
References
https://www.geeksforgeeks.org/python-ways-to-convert-string-to-json-object/
How to open a file for both reading and writing?
You can use ast.literal_eval to get the list object from the string.
import ast
s = "[('john', 'abc')]"
o = ast.literal_eval(s)
print(repr(o) == s)  # True
o.append(('jack', 'def'))
newstr = repr(o)
print(newstr)  # [('john', 'abc'), ('jack', 'def')]
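Putting the two answers together, here is a minimal sketch of the file round-trip, swapping eval for the safer ast.literal_eval (same tuples.txt format as above):
import ast
def add_tuple_to_file(file_path, tup):
    # read, parse safely, append, and rewrite the file in place
    with open(file_path, 'r+') as f:
        tuples = ast.literal_eval(f.read().strip())
        tuples.append(tup)
        f.seek(0)      # rewind before overwriting
        f.truncate()   # erase the old content
        f.write(repr(tuples))
add_tuple_to_file("./tuples.txt", ('jack', 'def'))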

How to convert Hex to original file format?

I have a .tgz file that was formatted as shell code, it looks like this (Hex):
"\x1F\x8B\x08\x00\x44\x7A\x91\x4F\x00\x03\xED\x59\xED\x72.."
It was generated this way (python3):
import os
def main():
    dump_src = "MyPlugin.tgz"
    fc = ""
    try:
        with open(dump_src, 'rb') as fd:
            fcr = fd.read()
            for byte in bytearray(fcr):
                fc += "\\x{:02x}".format(byte)
    except:
        fcr = dump_src
        for byte in bytearray(fcr):
            fc += "\\x{:02x}".format(byte)
    print(fc)
    # failed attempt:
    fcback = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
    print(fcback)
if __name__ == "__main__":
    main()
How can I convert this back to the original tgz archive?
Edit: failed attempt in the last section outputs this:
b'\x8b\x00\x10]\x03\x93o0\x85%\xe2!\xa4H\xf1Fi\xa7\x15\xf61&\x13N\xd9[\xfag\x11V\x97\xd3\xfb%\xf7\xe3\\\xae\xc2\xff\xa4>\xaf\x11\xcc\x93\xf1\x0c\x93\xa4\x1b\xefxj\xc3?\xf9\xc1\xe8\xd1\xd9\x01\x97qB"\x1a\x08\x9cO\x7f\xe9\x19\xe3\x9c\x05\xf2\x04a\xaa\x00A,\x15"RN-\xb6\x18K\x85\xa1\x11\x83\xac/\xffR\x8a\xa19\xde\x10\x0b\x08\x85\x93\xfc]\x8a^\xd2-T\x92\x9a\xcc-W\xc7|\xba\x9c\xb3\xa6V0V H1\x98\xde\x03#\x14\'\n 1Y\xf7R\x14\xe2#\xbe*:\xe0\xc8\xbb\xc9\x0bo\x8bm\xed.\xfd\xae\xef\x9fT&\xa1\xf4\xcf\xa7F\xf4\xef\xbb"8"\xb5\xab,\x9c\xbb\xfc3\x8b\xf5\x88\xf4A\x0ek%5eO\xf4:f\x0b\xd6\x1bi\xb6\xf3\xbf\xf7\xf9\xad\xb5[\xdba7\xb8\xf9\xcd\xba\xdd,;c\x0b\xaaT"\xd4\x96\x17\xda\x07\x87& \xceH\xd6\xbf\xd2\xeb\xb4\xaf\xbd\xc2\xee\xfc\'3zU\x17>\xde\x06u\xe3G\x7f\x1e\xf3\xdf\xb6\x04\x10A\x04\x10A\x04\x10A\x04\x10A\xff\x9f\xab\xe8(\x00'
And when I output it to a file (e.g. via python3 main.py > MyFile.tgz) the file is corrupted.
Since you know the format of the data (each byte is encoded as a string of 4 characters in the format "\xAB") it's easy to revert the conversion and get the original bytes again. It'll only take one line of Python code:
data = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
This uses:
range(start, stop, step) with step 4 to iterate in groups of 4 characters through your string
slicing to get each group of 2 hexadecimal digits
int(x, base) to convert the hexadecimal string to an integer
a generator expression to immediately pass the converted elements to:
bytes() to create a bytes object with the data
The variable data is now of type bytes and you could directly write it to a file (to decompress with an external zip program), or pass it to zlib.decompress() (to further process it in Python).
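For example, decoding just the four-byte gzip magic-number prefix (a made-up sample, written in the lowercase form the code generates):
fc = "\\x1f\\x8b\\x08\\x00"
# each group of 4 characters is "\xAB"; characters 2-3 are the hex digits
data = bytes(int(fc[i+2:i+4], 16) for i in range(0, len(fc), 4))
print(data)  # b'\x1f\x8b\x08\x00'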
UPDATE (follow-up on the comments and updated question):
Firstly, I have tested the above code and it does result in the same bytes as the input. Are you really sure that the example output in your question is the actual result of the code in your question? Please try to be careful when copying code and/or output. A few remarks:
Your code is not properly formatted, so I cannot run it without making modifications. And when I have made modifications to the code, I might run different code than you do, yielding different results. So next time please copy-paste your exact (working, tested) code without modifications.
The format string in your code uses lowercase hexadecimal format, and your first example output uses uppercase. So that output cannot be from this code.
I don't have access to your file "MyPlugin.tgz", but when I test your code with another .tgz file (after fixing the IndentationErrors), my output is correct. It starts with \x1f\x8b as expected (this is the magic number in the gzip header). I can't explain why your output is different...
Secondly, it seems like you don't fully understand how bytes and string representations work. When you write print(fcback), a string representation of the Python object fcback (in this case a bytes object) is printed. The string representation of a bytes object is not the same as the binary data! When printing a bytes object, each byte that corresponds to a printable ASCII character is replaced by that character, other bytes are escaped (similar to the formatted string that your code generates). Also, it starts with b' and ends with '.
You cannot print binary data to your terminal and then pipe the output to a file. This will result in a different file. The correct way to write the data to a file is using file.write(data) in your Python code.
Here's a fully working example:
def binary_to_text(data):
    """Convert a bytes object to a formatted text string."""
    text = ""
    for byte in data:
        text += "\\x{:02x}".format(byte)
    return text
def text_to_binary(text):
    """Convert a formatted text string to a bytes object."""
    return bytes(int(text[i+2:i+4], 16) for i in range(0, len(text), 4))
def main():
    # Read the binary data from the input file:
    with open('MyPlugin.tgz', 'rb') as input_file:
        input_data = input_file.read()
    # Convert binary to text (based on your original code):
    text = binary_to_text(input_data)
    print(text[0:100])
    # Convert the text back to binary:
    output_data = text_to_binary(text)
    print(output_data[0:100])
    # Write the binary data back to a file:
    with open('MyPlugin-restored.tgz', 'wb') as output_file:
        output_file.write(output_data)
if __name__ == '__main__':
    main()
Note that I only print the first 100 elements to keep the output short. Also notice that the second print-statement prints a much longer text. This is because the first print gets 100 characters (which are printed "as is"), while the second print gets 100 bytes (of which most bytes are escaped, causing the output to be longer).

Writing to files in ASCII with Python3, not UTF8

I have a program that I created with two sections.
The first one copies a text file with an integer in the middle of the file name, in this format:
file = "Filename" + str(int) + ".txt"
The user can create as many copies of the file as they would like.
The second part of the program is what I am having the problem with. There is an integer at the very bottom of the file that is supposed to correspond with the integer in the file name. After the first part is done, I open each file one at a time in "r+" read/write mode, so I can file.seek(1000) to roughly where the integer is in the file.
Now, in my opinion, the next part should be easy: I should simply have to write str(int) into the file right there. But it wasn't that easy. Doing it like that worked just fine on Linux at home, but at work on Windows it proved difficult. What I ended up having to do after file.seek(1000) is write to the file as Unicode UTF-8. I accomplished this with the code snippet below, taken from the rest of the program; it is commented so that it is clear what is going on. Instead of having to write this as Unicode, I would love to be able to write it in good old regular ASCII characters. Eventually this program will be expanded to include a lot more data at the bottom of each file, and having to write the data as Unicode is going to make things extremely difficult. If I just write the data without converting it, this is the result: the string is supposed to say #2 =1534, but instead it says #2 =ㄠ㌵433.
If someone can show me what I am doing wrong that would be great. I would love to just use something like file.write('1534') to write the data to the file instead of having to do it in Unicode UTF-8.
while a1 < d1:
    file = "file" + str(a1) + ".par"
    f = open(file, "r+")
    f.seek(1011)
    data = f.read()  # read the data from that point in the file into a variable
    numList = list(str(a1))  # "a1" is the integer in the file name; turn it into a list of digit characters
    replaceData = '\x00' + numList[0] + '\x00' + numList[1] + '\x00' + numList[2] + '\x00' + numList[3] + '\x00'  # interleave the digits with null bytes (I am by no means a Unicode expert)
    currentData = data  # probably didn't need to be done, now that I'm looking at this
    data = data.replace(currentData, replaceData)  # replace the old string in "data" with the new string in "replaceData"
    f.seek(1011)  # return to where I need to be in the file to write the data
    f.write(data)  # write the new data to the file
    f.close()  # close the file
    f.close()  # make sure the file is closed (sometimes this seems to fail on Windows)
    a1 += 1  # advance the integer, then return to the top of the loop
This is an example of writing to a file in ASCII. You need to open the file in byte mode, and using the .encode method for strings is a convenient way to get the end result you want.
s = '12345'
ascii_bytes = s.encode('ascii')  # encode the string to ASCII bytes
with open('somefile', 'wb') as f:
    f.write(ascii_bytes)
You can obviously also open in rb+ (read and write byte mode) in your case if the file already exists.
with open('somefile', 'rb+') as f:
    existing = f.read()
    f.write(b'ascii without encoding!')
You can also just pass string literals with the b prefix, and they will be encoded with ascii as shown in the second example.
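Applied to the scenario in the question, a minimal sketch might look like this (the 1011 offset and the file name pattern are taken from the question; whether they match your files is an assumption):
a1 = 1534
with open('file1534.par', 'rb+') as f:
    f.seek(1011)                      # jump to where the integer lives
    f.write(str(a1).encode('ascii'))  # write plain ASCII digits, no null padding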

Merging multiple text files into one and related problems

I'm using Windows 7 and Python 3.4.
I have several multi-line text files (all in Persian) and I want to merge them into one under one condition: each line of the output file must contain the whole text of each input file. It means if there are nine text files, the output text file must have only nine lines, each line containing the text of a single file. I wrote this:
import os
os.chdir('C:\Dir')
with open('test.txt', 'w', encoding='UTF8') as OutFile:
    with open('news01.txt', 'r', encoding='UTF8') as InFile:
        while True:
            _Line = InFile.readline()
            if len(_Line) == 0:
                break
            else:
                _LineString = str(_Line)
                OutFile.write(_LineString)
It worked for that one file, but it looks like the result takes more than one line in the output file, and the output also contains disturbing character sequences like &amp, &nbsp and the like, even though the source files don't contain any of them.
Also, I've got some other texts: news02.txt, news03.txt, news04.txt ... news09.txt.
Considering all these:
How can I correct my code so that it reads all the files one after another, putting each on only one line?
How can I clean out these unfamiliar, strange characters, or prevent them from appearing in my final text?
Here is an example that will do the merging portion of your question:
def merge_file(infile, outfile, separator=""):
    print(separator.join(line.strip("\n") for line in infile), file=outfile)
def merge_files(paths, outpath, separator=""):
    with open(outpath, 'w') as outfile:
        for path in paths:
            with open(path) as infile:
                merge_file(infile, outfile, separator)
Example use:
merge_files(["C:\file1.txt", "C:\file2.txt"], "C:\output.txt")
Note this makes the rather large assumption that the contents of infile can fit into memory. That's reasonable for most text files, but possibly quite unreasonable otherwise. If your text files will be very large, you can use this alternate merge_file implementation:
def merge_file(infile, outfile, separator=""):
    for line in infile:
        outfile.write(line.strip("\n") + separator)
    outfile.write("\n")
It's slower, but shouldn't run into memory problems.
Answering question 1:
You were right about the UTF-8 part.
You probably want to create a function which takes multiple files, as a tuple of file objects or of path strings (or as *args). It then reads all the input files and replaces every "\n" (newline) with a delimiter (default ""). out_file can be among in_files, but this makes the assumption that the contents of the files can be loaded into memory. Both out_file and in_files can be file objects or path strings.
def write_from_files(out_file, in_files, delimiter="", dir="C:\\Dir"):
    import io
    import os
    import html.parser  # see part 2 of the answer
    os.chdir(dir)
    output = []
    for file in in_files:
        file_ = file
        if not isinstance(file_, io.TextIOWrapper):
            file_ = open(file_, "r", -1, "UTF-8")  # if it isn't a file object, open it as one
        file_.seek(0, 0)
        output.append(file_.read().replace("\n", delimiter))  # replace all newlines with the delimiter
        file_.close()  # close the file to prevent IO errors
    if not isinstance(out_file, io.TextIOWrapper):
        out_file = open(out_file, "w", -1, "UTF-8")
    joined = html.parser.HTMLParser().unescape("\n".join(output))  # unescape HTML entities (see part 2)
    out_file.write(joined)
    out_file.close()
    return joined  # returning is optional
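Hypothetical usage for the nine files in the question (assuming they all live in C:\Dir):
write_from_files("test.txt", ["news0%d.txt" % i for i in range(1, 10)])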
Answering question 2:
I think you may have copied the text from a webpage; this does not happen for me. &amp and &nbsp are HTML entities for the ampersand (&) and the non-breaking space ( ). You may need to replace them with their corresponding characters. I would use html.parser. As you can see above, it turns HTML escape sequences into Unicode literals. E.g.:
>>> html.parser.HTMLParser().unescape("Alpha &lt β")
'Alpha < β'
This exact code will not work in Python 2.x, because the module was renamed to html.parser in 3.x. On Python 2.x, replace the relevant lines with:
import HTMLParser
HTMLParser.HTMLParser().unescape("\n".join(output))

python3 opening files and reading lines

Can you explain what is going on in this code? I don't seem to understand
how you can open the file and read it line by line, instead of reading all of the sentences at the same time, in a for loop. Thanks
Let's say I have these sentences in a document file:
cat:dog:mice
cat1:dog1:mice1
cat2:dog2:mice2
cat3:dog3:mice3
Here is the code:
from sys import argv
filename = input("Please enter the name of a file: ")
f = open(filename, 'r')
d1ct = dict()
print("Number of times each animal visited each station:")
print("Animal Id Station 1 Station 2")
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
    (AnimalId, Timestamp, StationId,) = line.split(':')
    key = (AnimalId, StationId,)
    if key not in d1ct:
        d1ct[key] = 0
    d1ct[key] += 1
The magic is at:
for line in f:
    if '\n' == line[-1]:
        line = line[:-1]
Python file objects are special in that they can be iterated over in a for loop; on each iteration, the loop retrieves the next line of the file. Because that line keeps its final character, which may be a newline, it's often useful to check for it and remove it.
As Moshe wrote, open file objects can be iterated. Note, though, that they are not of the file type in Python 3.x (as they were in Python 2.x). If the file object is opened in text mode, the unit of iteration is one text line, including the \n.
You can use line = line.rstrip() to remove the \n plus any trailing whitespace.
If you want to read the content of the file at once (into a multiline string), you can use content = f.read().
There is a minor bug in the code: the open file should always be closed. That means calling f.close() after the for loop. Alternatively, you can wrap the open in the newer with construct, which closes the file for you -- I suggest getting used to the latter approach.
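A minimal sketch of the recommended with construct (the filename here is assumed):
with open('animals.txt', 'r') as f:
    for line in f:
        line = line.rstrip()  # strips the '\n' plus trailing whitespace
        print(line)
# the file is closed automatically here, even if an exception occurred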
