After looking all over the Internet, I've come to this.
Let's say I have already made a text file that reads:
Hello World
Well, I want to remove the very last character (in this case d) from this text file.
So now the text file should look like this: Hello Worl
But I have no idea how to do this.
All I want, more or less, is a single backspace function for text files on my HDD.
This needs to work on Linux as that's what I'm using.
Use fileobject.seek() to seek 1 position from the end, then use fileobject.truncate() to remove the remainder of the file:
import os
with open(filename, 'rb+') as filehandle:
    filehandle.seek(-1, os.SEEK_END)
    filehandle.truncate()
This works fine for single-byte encodings. If you have a multi-byte encoding (such as UTF-16 or UTF-32) you need to seek back enough bytes from the end to account for a single codepoint.
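For fixed-width UTF-16, a minimal sketch (assuming the last character is a 2-byte BMP code unit; characters outside the BMP are stored as a 4-byte surrogate pair and would need -4 instead):

import os

with open(filename, 'rb+') as filehandle:
    filehandle.seek(-2, os.SEEK_END)   # one UTF-16 code unit
    filehandle.truncate()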
For variable-width encodings, whether you can use this technique at all depends on the codec. For UTF-8, you need to find the first byte (scanning from the end) for which bytevalue & 0xC0 != 0x80 holds, and truncate from that point on. That ensures you don't truncate in the middle of a multi-byte UTF-8 codepoint:
import os
with open(filename, 'rb+') as filehandle:
    # move to end, then scan backwards until a non-continuation byte is found
    filehandle.seek(-1, os.SEEK_END)
    while ord(filehandle.read(1)) & 0xC0 == 0x80:
        # we just read 1 byte, which moved the file position forward,
        # skip back 2 bytes to move to the byte before the current.
        filehandle.seek(-2, os.SEEK_CUR)
    # last read byte is our truncation point, move back to it.
    filehandle.seek(-1, os.SEEK_CUR)
    filehandle.truncate()
Note that UTF-8 is a superset of ASCII, so the above works for ASCII-encoded files too.
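A quick usage sketch (the file name and contents here are just an illustration):

import os

filename = 'hello.txt'
with open(filename, 'w') as f:
    f.write('Hello World')

with open(filename, 'rb+') as filehandle:
    filehandle.seek(-1, os.SEEK_END)
    filehandle.truncate()

print(open(filename).read())  # Hello Worl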
Martijn's accepted answer is simple and kind of works, but it does not account for text files with:
UTF-8 encoding containing non-English characters (UTF-8 is the default encoding for text files in Python 3)
one newline character at the end of the file (which editors on Linux such as vim or gedit add by default)
If the text file contains non-English characters, neither of the answers provided so far works.
What follows is an example that solves both problems and also allows removing more than one character from the end of the file:
import os

def truncate_utf8_chars(filename, count, ignore_newlines=True):
    """
    Truncates last `count` characters of a text file encoded in UTF-8.
    :param filename: The path to the text file to read
    :param count: Number of UTF-8 characters to remove from the end of the file
    :param ignore_newlines: Set to true, if the newline character at the end of the file should be ignored
    """
    with open(filename, 'rb+') as f:
        last_char = None
        size = os.fstat(f.fileno()).st_size

        offset = 1
        chars = 0
        while offset <= size:
            f.seek(-offset, os.SEEK_END)
            b = ord(f.read(1))

            if ignore_newlines:
                if b == 0x0D or b == 0x0A:
                    offset += 1
                    continue

            if b & 0b10000000 == 0 or b & 0b11000000 == 0b11000000:
                # This is the first byte of a UTF8 character
                chars += 1
                if chars == count:
                    # When `count` number of characters have been found, move current position back
                    # with one byte (to include the byte just checked) and truncate the file
                    f.seek(-1, os.SEEK_CUR)
                    f.truncate()
                    return
            offset += 1
How it works:
Reads only the last few bytes of a UTF-8 encoded text file in binary mode
Iterates over the bytes backwards, looking for the first byte of each UTF-8 character
Once `count` characters (skipping newlines at the end, if requested) have been found, truncates the file just before the first byte of that character, removing the last `count` characters
Sample text file - bg.txt:
Здравей свят
How to use:
filename = 'bg.txt'
print('Before truncate:', open(filename).read())
truncate_utf8_chars(filename, 1)
print('After truncate:', open(filename).read())
Outputs:
Before truncate: Здравей свят
After truncate: Здравей свя
This works with both UTF-8 and ASCII encoded files.
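Since the function also takes a count, a quick additional sketch (assuming bg.txt again contains the original Здравей свят) for removing several characters at once:

truncate_utf8_chars('bg.txt', 4)
print('After truncate:', open('bg.txt').read())  # Здравей  (the last four characters are gone)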
In case you are not reading the file in binary mode and only have a writable text-mode handle, I can suggest the following.
import os
f.seek(f.tell() - 1, os.SEEK_SET)  # f is assumed to be a writable text-mode handle positioned at the end of the file
f.truncate()                       # cut the file at that position, dropping the last character
In the code above, f.seek() only accepts an absolute position derived from f.tell(), because without 'b' access seeks relative to the end or the current position must use an offset of zero. This sets the cursor to the start of the last character, and truncating there removes it.
with open(urfile, 'rb+') as f:
    f.seek(0, 2)            # end of file
    size = f.tell()         # the size...
    f.truncate(size - 1)    # truncate at that size - however many characters
Be sure to use binary mode on Windows, since in text mode the translation of Unix-style line endings may yield an illegal or incorrect character count.
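A short illustration of why (not part of the original answer): on Windows, a text-mode handle translates '\r\n' to '\n' while reading, so lengths computed from the text no longer match byte positions on disk:

# suppose file.txt on a Windows machine contains b'Hello World\r\n' (13 bytes on disk)
with open('file.txt', 'r') as f:
    print(len(f.read()))   # 12 -- the '\r' was translated away in text mode
with open('file.txt', 'rb') as f:
    print(len(f.read()))   # 13 -- binary mode matches the real size on disk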
with open('file.txt', 'r+') as f:   # 'r+', not 'w': opening with 'w' would empty the file immediately
    f.seek(0, 2)                    # seek to end of file; f.seek(0, os.SEEK_END) is also legal (with import os)
    f.seek(f.tell() - 2, 0)         # seek to the second last char of file; f.seek(f.tell() - 2, os.SEEK_SET) is also legal
    f.truncate()
Adjust the offset to what the last character of the file is; it could be a newline (\n) or anything else.
This may not be optimal, but if the above approaches don't work out, you could do:
with open('myfile.txt', 'r') as file:
    data = file.read()[:-1]
with open('myfile.txt', 'w') as file:
    file.write(data)
The code first opens the file and then copies its content (except for the last character) into the string data. Afterwards, the file is truncated to zero length (i.e. emptied), and the content of data is written back to a file of the same name.
This is basically the same as vins ms's answer, except that it doesn't use the os package and uses the safer 'with open' syntax. It may not be recommended if the text file is huge. (I wrote this since none of the above approaches worked out too well for me in Python 3.8.)
Here is a dirty way (erase & recreate)... I don't advise using it, but it is possible to do it like this:
import os
x = open("file").read()
os.remove("file")
open("file", "w").write(x[:-1])
On a Linux system (or with Cygwin under Windows) you can use the standard truncate command, which lets you reduce or increase the size of a file.
To reduce a file by 1G the command would be truncate -s -1G filename (the leading minus makes the change relative to the current size); to drop a single byte, and therefore a single character in a single-byte encoding, it would be truncate -s -1 filename. In the following example I reduce a file called update.iso by 1G.
Note that this operation took less than five seconds.
chris@SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 30802968576 Blocks: 30081024 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:39:00.572940600 -0400
Modify: 2020-06-12 07:39:00.572940600 -0400
Change: 2020-06-12 07:39:00.572940600 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
chris@SR-ENG-P18 /cygdrive/c/Projects
$ truncate -s -1G update.iso
chris@SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 29729226752 Blocks: 29032448 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:42:38.335782800 -0400
Modify: 2020-06-12 07:42:38.335782800 -0400
Change: 2020-06-12 07:42:38.335782800 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
The stat command tells you lots of info about a file including its size.
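For what it's worth, the same idea is available from Python itself via os.truncate; a brief sketch, assuming a single-byte encoding so that one character equals one byte:

import os

size = os.path.getsize('file.txt')   # hypothetical file
os.truncate('file.txt', size - 1)    # drop the last byte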
I want to use a split function in Lua 5.1 to split a string of emoji characters that has no spaces between them and insert a space between each one, but I can't get it right. This is what I do, but it's wrong:
#!/usr/bin/env lua
local text = "👊🏽👊🏽🥊😂🎹😂👩👩👧👦👨👦👦⌚↔"
for emoji in string.gmatch(text, "[%z\1-\127\194-\244][\128-\191]*") do
    io.write(emoji .. " ")
end
View this in a browser such as Firefox 65!
MY WRONG RESULT: 👊 🏽 👊 🏽 🥊 😂 🎹 😂 👩 👩 👧 👦 👨 👦 👦 ⌚ ↔
EXPECTED RESULT: 👊🏽 👊🏽 🥊 😂 🎹 😂 👩👩👧👦 👨👦👦 ⌚ ↔
local text = "👊🏽👊🏽🥊😂🎹😂👩👩👧👦👨👦👦⌚↔"
for emoji in text
    -- insert a \0 marker before every multi-byte UTF-8 lead byte
    :gsub("(.)([\194-\244])", "%1\0%2")
    -- remove the marker before skin-tone modifiers (U+1F3FB..U+1F3FF), keeping them attached
    :gsub("%z(\240\159\143[\187-\191])", "%1")
    -- remove the marker before variation selectors (U+FE00..U+FE0F)
    :gsub("%z(\239\184[\128-\143])", "%1")
    -- remove the markers around zero-width joiners (U+200D), so ZWJ sequences stay together
    :gsub("%z(\226\128\141)%z", "%1")
    -- every remaining run of non-\0 bytes is one displayed emoji cluster
    :gmatch"%Z+"
do
    print(emoji)
end
One part of my AutoHotKey script should recognize if __ is typed.
Following the AutoHotKey documentation, I've tried:
~__::
tooltip,hi world
return
and got this error:
Line Text: ~__::
Error: Invalid hotkey.
this shows no errors, but works only for one underscore:
~_::
tooltip,hi world
return
this shows no errors, but it just clears the __:
:*:__::
tooltip,hi world
return
this shows the error Error: Invalid hotkey.:
~:*:__::
tooltip,hi world
return
this shows no errors, but does nothing (docs: Execute hotstring, the X option):
:X:~__::
tooltip,hi world
return
Here are 4 potential solutions. I have left one active; comment out/uncomment hotkey labels by adding/removing leading semicolons as appropriate.
The 2 blocks of code are functionally equivalent. Within each block the 2 alternatives differ in that b0 prevents automatic backspacing, i.e. the underscores that you typed are not deleted.
;:*?:__:: ;deletes the underscores
:b0*?:__:: ;does not delete the underscores
SoundBeep
return
;note: the X option requires AHK v1.1.28+
;:X*?:__::SoundBeep ;deletes the underscores
;:Xb0*?:__::SoundBeep ;does not delete the underscores
This AutoHotkey script recognizes when __ is typed:
countUnderscore := 0

~_::
    countUnderscore++
    if (countUnderscore == 2) {
        tooltip, %countUnderscore% = countUnderscore
        countUnderscore := 0
    }
return
I am trying to use lzma to compress and decompress some data in memory. I know that the following approach works:
import lzma
s = 'Lorem ipsum dolor'
bytes_in = s.encode('utf-8')
print(s)
print(bytes_in)
# Compress
bytes_out = lzma.compress(data=bytes_in, format=lzma.FORMAT_XZ)
print(bytes_out)
# Decompress
bytes_decomp = lzma.decompress(data=bytes_out, format=lzma.FORMAT_XZ)
print(bytes_decomp)
The output is:
Lorem ipsum dolor
b'Lorem ipsum dolor'
b'\xfd7zXZ\x00\x00\x04\xe6\xd6\xb4F\x02\x00!\x01\x16\x00\x00\x00t/\xe5\xa3\x01\x00\x10Lorem ipsum dolor\x00\x00\x00\x00\xddq\x8e\x1d\x82\xc8\xef\xad\x00\x01)\x112\np\x0e\x1f\xb6\xf3}\x01\x00\x00\x00\x00\x04YZ'
b'Lorem ipsum dolor'
However, I notice that using lzma.LZMACompressor gives different results. With the following code:
import lzma
s = 'Lorem ipsum dolor'
bytes_in = s.encode('utf-8')
print(s)
print(bytes_in)
# Compress
lzc = lzma.LZMACompressor(format=lzma.FORMAT_XZ)
lzc.compress(bytes_in)
bytes_out = lzc.flush()
print(bytes_out)
# Decompress
bytes_decomp = lzma.decompress(data=bytes_out, format=lzma.FORMAT_XZ)
print(bytes_decomp)
I get this output:
Lorem ipsum dolor
b'Lorem ipsum dolor'
b'\x01\x00\x10Lorem ipsum dolor\x00\x00\x00\x00\xddq\x8e\x1d\x82\xc8\xef\xad\x00\x01)\x112\np\x0e\x1f\xb6\xf3}\x01\x00\x00\x00\x00\x04YZ'
The program then fails at the lzma.decompress call with _lzma.LZMAError: Input format not supported by decoder.
I have 3 questions here:
How come the output for lzma.compress is so much longer than lzma.LZMACompressor.compress even though it seemingly does the same thing?
In the second example, why does the decompressor complain about invalid format?
How can I get the second example to decompress correctly?
In your second example you're dropping part of the compressed stream; bytes_out only gets the flush part. This, on the other hand, works:
lzc = lzma.LZMACompressor(format=lzma.FORMAT_XZ)
bytes_out = lzc.compress(bytes_in) + lzc.flush()
print(bytes_out)
Note that the first example is really equivalent, since the source of lzma.compress is:
def compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None):
    """Compress a block of data.

    Refer to LZMACompressor's docstring for a description of the
    optional arguments *format*, *check*, *preset* and *filters*.

    For incremental compression, use an LZMACompressor instead.
    """
    comp = LZMACompressor(format, check, preset, filters)
    return comp.compress(data) + comp.flush()
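As an additional sketch (not from the original answer): the point of LZMACompressor is incremental compression, where every chunk returned by compress() is part of the stream and must be kept, not only the final flush():

import lzma

lzc = lzma.LZMACompressor(format=lzma.FORMAT_XZ)
chunks = [b'Lorem ipsum ', b'dolor']   # hypothetical data arriving in pieces
bytes_out = b''.join(lzc.compress(chunk) for chunk in chunks) + lzc.flush()

assert lzma.decompress(bytes_out, format=lzma.FORMAT_XZ) == b'Lorem ipsum dolor'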
When I write '你' in Notepad and save it as test-unicode.txt in Unicode mode, then open it with xxd g:\\test-unicode.txt, I get:
0000000: fffe 604f ..`O
1. fffe stands for little endian
2. the Unicode code point of 你 is 4F60 (bytes \x4f\x60)
I want to write 你 as 604f or 4f60 in the file.
output=open("g://test-unicode.txt","wb")
str1="你"
output.write(str1)
output.close()
error:
TypeError: 'str' does not support the buffer interface
When I change it to the following, there is no error.
output=open("g://test-unicode.txt","wb")
str1="你"
output.write(str1.encode())
output.close()
When I open it with xxd g:\\test-unicode.txt, I get:
0000000: e4bd a0 ...
How can I write 604f or 4f60 into my file the same way Microsoft Notepad does (save as Unicode format)?
"Unicode" as an encoding is actually UTF-16LE.
with open("g:/test-unicode.txt", "w", encoding="utf-16le") as output:
output.write(str1)
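As a small follow-up sketch (an addition, not part of the original answer): if you also want the leading fffe byte order mark that the editor writes, the plain utf-16 codec adds it for you:

with open("g:/test-unicode.txt", "w", encoding="utf-16") as output:
    output.write("你")

with open("g:/test-unicode.txt", "rb") as check:
    print(check.read().hex())   # fffe604f on a little-endian machine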