After looking all over the Internet, I've come to this.
Let's say I have already made a text file that reads:
Hello World
Well, I want to remove the very last character (in this case d) from this text file.
So now the text file should look like this: Hello Worl
But I have no idea how to do this.
All I want, more or less, is a single backspace function for text files on my HDD.
This needs to work on Linux as that's what I'm using.
Use fileobject.seek() to seek 1 position from the end, then use file.truncate() to remove the remainder of the file:
import os
with open(filename, 'rb+') as filehandle:
filehandle.seek(-1, os.SEEK_END)
filehandle.truncate()
This works fine for single-byte encodings. If you have a multi-byte encoding (such as UTF-16 or UTF-32) you need to seek back enough bytes from the end to account for a single codepoint.
For variable-byte encodings, it depends on the codec if you can use this technique at all. For UTF-8, you need to find the first byte (from the end) where bytevalue & 0xC0 != 0x80 is true, and truncate from that point on. That ensures you don't truncate in the middle of a multi-byte UTF-8 codepoint:
with open(filename, 'rb+') as filehandle:
# move to end, then scan forward until a non-continuation byte is found
filehandle.seek(-1, os.SEEK_END)
while filehandle.read(1) & 0xC0 == 0x80:
# we just read 1 byte, which moved the file position forward,
# skip back 2 bytes to move to the byte before the current.
filehandle.seek(-2, os.SEEK_CUR)
# last read byte is our truncation point, move back to it.
filehandle.seek(-1, os.SEEK_CUR)
filehandle.truncate()
Note that UTF-8 is a superset of ASCII, so the above works for ASCII-encoded files too.
Accepted answer of Martijn is simple and kind of works, but does not account for text files with:
UTF-8 encoding containing non-English characters (which is the default encoding for text files in Python 3)
one newline character at the end of the file (which is the default in Linux editors like vim or gedit)
If the text file contains non-English characters, neither of the answers provided so far would work.
What follows is an example, that solves both problems, which also allows removing more than one character from the end of the file:
import os
def truncate_utf8_chars(filename, count, ignore_newlines=True):
"""
Truncates last `count` characters of a text file encoded in UTF-8.
:param filename: The path to the text file to read
:param count: Number of UTF-8 characters to remove from the end of the file
:param ignore_newlines: Set to true, if the newline character at the end of the file should be ignored
"""
with open(filename, 'rb+') as f:
last_char = None
size = os.fstat(f.fileno()).st_size
offset = 1
chars = 0
while offset <= size:
f.seek(-offset, os.SEEK_END)
b = ord(f.read(1))
if ignore_newlines:
if b == 0x0D or b == 0x0A:
offset += 1
continue
if b & 0b10000000 == 0 or b & 0b11000000 == 0b11000000:
# This is the first byte of a UTF8 character
chars += 1
if chars == count:
# When `count` number of characters have been found, move current position back
# with one byte (to include the byte just checked) and truncate the file
f.seek(-1, os.SEEK_CUR)
f.truncate()
return
offset += 1
How it works:
Reads only the last few bytes of a UTF-8 encoded text file in binary mode
Iterates the bytes backwards, looking for the start of a UTF-8 character
Once a character (different from a newline) is found, return that as the last character in the text file
Sample text file - bg.txt:
Здравей свят
How to use:
filename = 'bg.txt'
print('Before truncate:', open(filename).read())
truncate_utf8_chars(filename, 1)
print('After truncate:', open(filename).read())
Outputs:
Before truncate: Здравей свят
After truncate: Здравей свя
This works with both UTF-8 and ASCII encoded files.
In case you are not reading the file in binary mode, where you have only 'w' permissions, I can suggest the following.
f.seek(f.tell() - 1, os.SEEK_SET)
f.write('')
In this code above, f.seek() will only accept f.tell() b/c you do not have 'b' access. then you can set the cursor to the starting of the last element. Then you can delete the last element by an empty string.
with open(urfile, 'rb+') as f:
f.seek(0,2) # end of file
size=f.tell() # the size...
f.truncate(size-1) # truncate at that size - how ever many characters
Be sure to use binary mode on windows since Unix file line ending many return an illegal or incorrect character count.
with open('file.txt', 'w') as f:
f.seek(0, 2) # seek to end of file; f.seek(0, os.SEEK_END) is legal
f.seek(f.tell() - 2, 0) # seek to the second last char of file; f.seek(f.tell()-2, os.SEEK_SET) is legal
f.truncate()
subject to what last character of the file is, could be newline (\n) or anything else.
This may not be optimal, but if the above approaches don't work out, you could do:
with open('myfile.txt', 'r') as file:
data = file.read()[:-1]
with open('myfile.txt', 'w') as file:
file.write(data)
The code first opens the file, and then copies its content (with the exception of the last character) to the string data. Afterwards, the file is truncated to zero length (i.e. emptied), and the content of data is saved to the file, with the same name.
This is basically the same as vins ms's answer, except that it doesn't use the os package, and that is used the safer 'with open' syntax. This may not be recommended if the text file is huge. (I wrote this since none of the above approaches worked out too well for me in python 3.8).
here is a dirty way (erase & recreate)...
i don't advice to use this, but, it's possible to do like this ..
x = open("file").read()
os.remove("file")
open("file").write(x[:-1])
On a Linux system or (Cygwin under Windows). You can use the standard truncate command. You can reduce or increase the size of your file with this command.
In order to reduce a file by 1G the command would be truncate -s 1G filename. In the following example I reduce a file called update.iso by 1G.
Note that this operation took less than five seconds.
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 30802968576 Blocks: 30081024 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:39:00.572940600 -0400
Modify: 2020-06-12 07:39:00.572940600 -0400
Change: 2020-06-12 07:39:00.572940600 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
chris#SR-ENG-P18 /cygdrive/c/Projects
$ truncate -s -1G update.iso
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 29729226752 Blocks: 29032448 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:42:38.335782800 -0400
Modify: 2020-06-12 07:42:38.335782800 -0400
Change: 2020-06-12 07:42:38.335782800 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
The stat command tells you lots of info about a file including its size.
I am writing a piece of code to extract exif data from images using Python. I downloaded the Pillow module using pip3 and am using some code I found online:
from PIL import Image
from PIL.ExifTags import TAGS
imagename = "path to file"
image = Image.open(imagename)
exifdata = image.getexif()
for tagid in exifdata:
tagname = TAGS.get(tagid, tagid)
data = exifdata.get(tagid)
if isinstance(data, bytes):
data = data.decode()
print(f"{tagname:25}: {data}")
On some images this code works. However, for images I took on my Olympus camera I get the following error:
GPSInfo : 734
Traceback (most recent call last):
File "_pathname redacted_", line 14, in <module>
data = data.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 30: invalid continuation byte
When I remove the data = data.decode() part, I get the following:
GPSInfo : 734
PrintImageMatching : b"PrintIM\x000300\x00\x00%\x00\x01\x00\x14\x00\x14\x00\x02\x00\x01\x00\x00\x00\x03\x00\xf0\x00\x00\x00\x07\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x0b\x008\x01\x00\x00\x0c\x00\x00\x00\x00\x00\r\x00\x00\x00\x00\x00\x0e\x00P\x01\x00\x00\x10\x00`\x01\x00\x00 \x00\xb4\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x01\xff\x00\x00\x00\x02\x01\x83\x00\x00\x00\x03\x01\x83\x00\x00\x00\x04\x01\x83\x00\x00\x00\x05\x01\x83\x00\x00\x00\x06\x01\x83\x00\x00\x00\x07\x01\x80\x80\x80\x00\x10\x01\x83\x00\x00\x00\x00\x02\x00\x00\x00\x00\x07\x02\x00\x00\x00\x00\x08\x02\x00\x00\x00\x00\t\x02\x00\x00\x00\x00\n\x02\x00\x00\x00\x00\x0b\x02\xf8\x01\x00\x00\r\x02\x00\x00\x00\x00 \x02\xd6\x01\x00\x00\x00\x03\x03\x00\x00\x00\x01\x03\xff\x00\x00\x00\x02\x03\x83\x00\x00\x00\x03\x03\x83\x00\x00\x00\x06\x03\x83\x00\x00\x00\x10\x03\x83\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\t\x11\x00\x00\x10'\x00\x00\x0b\x0f\x00\x00\x10'\x00\x00\x97\x05\x00\x00\x10'\x00\x00\xb0\x08\x00\x00\x10'\x00\x00\x01\x1c\x00\x00\x10'\x00\x00^\x02\x00\x00\x10'\x00\x00\x8b\x00\x00\x00\x10'\x00\x00\xcb\x03\x00\x00\x10'\x00\x00\xe5\x1b\x00\x00\x10'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
ResolutionUnit : 2
ExifOffset : 230
ImageDescription : OLYMPUS DIGITAL CAMERA
Make : OLYMPUS CORPORATION
Model : E-M10MarkII
Software : Version 1.2
Orientation : 1
DateTime : 2020:02:13 15:02:57
YCbCrPositioning : 2
YResolution : 350.0
Copyright :
XResolution : 350.0
Artist :
How should I fix this problem? Should I use a different Python module?
I did some digging and figured out the answer to the problem I posted about. I originally postulated that the rest of the metadata was in the byte data:
b"PrintIM\x000300\x00\x00%\x00\x01\x00\x14\x00\x14\x00\x02\x00\x01\x00\x00\x00\x03\x00\xf0\x00\x00\x00\x07\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\n\x00\x00\x00\x00\x00\x0b\x008\x01\x00\x00\x0c\x00\x00\x00\x00\x00\r\x00\x00\x00\x00\x00\x0e\x00P\x01\x00\x00\x10\x00`\x01\x00\x00 \x00\xb4\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x01\xff\x00\x00\x00\x02\x01\x83\x00\x00\x00\x03\x01\x83\x00\x00\x00\x04\x01\x83\x00\x00\x00\x05\x01\x83\x00\x00\x00\x06\x01\x83\x00\x00\x00\x07\x01\x80\x80\x80\x00\x10\x01\x83\x00\x00\x00\x00\x02\x00\x00\x00\x00\x07\x02\x00\x00\x00\x00\x08\x02\x00\x00\x00\x00\t\x02\x00\x00\x00\x00\n\x02\x00\x00\x00\x00\x0b\x02\xf8\x01\x00\x00\r\x02\x00\x00\x00\x00 \x02\xd6\x01\x00\x00\x00\x03\x03\x00\x00\x00\x01\x03\xff\x00\x00\x00\x02\x03\x83\x00\x00\x00\x03\x03\x83\x00\x00\x00\x06\x03\x83\x00\x00\x00\x10\x03\x83\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\t\x11\x00\x00\x10'\x00\x00\x0b\x0f\x00\x00\x10'\x00\x00\x97\x05\x00\x00\x10'\x00\x00\xb0\x08\x00\x00\x10'\x00\x00\x01\x1c\x00\x00\x10'\x00\x00^\x02\x00\x00\x10'\x00\x00\x8b\x00\x00\x00\x10'\x00\x00\xcb\x03\x00\x00\x10'\x00\x00\xe5\x1b\x00\x00\x10'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x05\x05\x05\x00\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00##\x80\x80\xc0\xc0\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
That assumption wasn't correct. Although the above is metadata, it simply isn't the metadata I am looking for (in my case the FocalLength attribute). Rather it appears to be Olympus specific metadata. The answer to my solution was to find all the metadata. I found a piece of code that worked very well in Stack Overflow: In Python, how do I read the exif data for an image?.
I used the following code by Nicolas Gervais:
import os,sys
from PIL import Image
from PIL.ExifTags import TAGS
for (k,v) in Image.open(sys.argv[1])._getexif().items():
print('%s = %s' % (TAGS.get(k), v))
I replaced sys.argv[1] with the path name to the image file.
Alternate Solution
As MattDMo mentioned, there are also specific libraries for reading EXIF data in Python. One that I found that look promising is ExifRead which can be download by typing the following in the terminal:
pip install ExifRead
I am trying to write Chinese characters to a CSV file based on their Unicode code points found in a text file in unicode.org/Public/zipped/13.0.0/Unihan.zip. For instance, one example character is U+9109.
In the example below I can get the correct output by hard coding the value (line 8), but keep getting it wrong with every permutation I've tried at generating the bytes from the code point (lines 14-16).
I'm running this in Python 3.8.3 on a Debian-based Linux distro.
Minimal working (broken) example:
1 #!/usr/bin/env python3
2
3 def main():
4
5 output = open("test.csv", "wb")
6
7 # Hardcoded values work just fine
8 output.write('\u9109'.encode("utf-8"))
9
10 # Comma separation
11 output.write(','.encode("utf-8"))
12
13 # Problem is here
14 codepoint = '9109'
15 u_str = '\\' + 'u' + codepoint
16 output.write(u_str.encode("utf-8"))
17
18 # End with newline
19 output.write('\n'.encode("utf-8"))
20
21 output.close()
22
23 if __name__ == "__main__":
24 main()
Executing and viewing results:
example $
example $./test.py
example $
example $cat test.csv
鄉,\u9109
example $
The expected output would look like this (Chinese character occurring on both sides of the comma):
example $
example $./test.py
example $cat test.csv
鄉,鄉
example $
chr is used to convert integers to code points in Python 3. Your code could use:
output.write(chr(0x9109).encode("utf-8"))
But if you specify the encoding in the open instead of using binary mode you don't have to manually encode everything. print to a file handles newlines for you as well.
with open("test.txt",'w',encoding='utf-8') as output:
for i in range(0x4e00,0x4e10):
print(f'U+{i:04X} {chr(i)}',file=output)
Output:
U+4E00 一
U+4E01 丁
U+4E02 丂
U+4E03 七
U+4E04 丄
U+4E05 丅
U+4E06 丆
U+4E07 万
U+4E08 丈
U+4E09 三
U+4E0A 上
U+4E0B 下
U+4E0C 丌
U+4E0D 不
U+4E0E 与
U+4E0F 丏
I am trying to extract location from the text using the geography3 library in python.
import geograpy
address = 'Jersey City New Jersey 07306'
places = geograpy.get_place_context(text = address)
To which i get the below error UnicodeDecodeError:
~\Anaconda\lib\site-packages\geograpy\places.py in populate_db(self)
28 with open(cur_dir + "/data/GeoLite2-City-Locations.csv") as info:
29 reader = csv.reader(info)
---> 30 for row in reader:
31 print(row)
32 cur.execute("INSERT INTO cities VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?);", row)
~\Anaconda\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return
codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 276: character maps to <undefined>
After some investigation, i tried to modify the places.py file and added encoding = "utf-8" in the line -----> 30
with open(cur_dir + "/data/GeoLite2-City-Locations.csv", encoding="utf-8") as info:
But it still gives me the same error.
I also tried to save the GeoLite2-City-Locations.csv on my Desktop and then tried to read it using the same code.
with open("GeoLite2-City-Locations.csv", encoding="utf-8") as info:
reader = csv.reader(info)
for row in reader:
print(row)
which works absolutely fine and prints all the rows of the GeoLite2-City-Locations.csv.
I fail to understand the problem!
As a committer of geograpy3 to reproduce your issue i added a test to the most recent geograpy3 https://github.com/somnathrakshit/geograpy3/blob/master/tests/test_extractor.py:
with the result:
['Jersey', 'City'
so you might simply switch to the latest version.
def testStackoverflow54077973(self):
'''
see https://stackoverflow.com/questions/54077973/geograpy3-library-for-extracting-the-locations-in-the-text-gives-unicodedecodee
'''
address = 'Jersey City New Jersey 07306'
e=Extractor(text=address)
e.find_entities()
self.check(e.places,['Jersey','City'])
you should specify encoding encoding='utf-8' like you did, although in correct_country_mispelling(self, s) method in places.py (49 row)
After some investigation, this is a Windows vs Linux error in some cases. Even using the
with open(cur_dir + "/data/GeoLite2-City-Locations.csv", encoding="utf-8") as info:
I could not resolve the error on my Windows computer. However, the exact same code ran fine on a Linux computer I use as well. I looked in the the City-Locations.csv file on Linux, and it appeared LibreOffice automatically encoded and/or resolved all the characters. Where as looking at the same file in Excel, I would still have all the funky characters causing the error. Excel for some reason insists on keeping the odd characters.
After receiving the bytes from Server, it needs to convert into string. When I try below code, not works per expected.
a
Out[140]: b'NC\x00\x00\x00'
a.decode()
Out[141]: 'NC\x00\x00\x00'
a.decode('ascii')
Out[142]: 'NC\x00\x00\x00'
a.decode('ascii').strip()
Out[143]: 'NC\x00\x00\x00'
a.decode('utf-8').strip()
Out[147]: 'NC\x00\x00\x00'
# I need the Output as 'NC'
This is not an encoding issue, as the trailing bytes are all NUL bytes. Looks like your server is padding with Null bytes. To remove them just use
a.strip(b'\x00')