How to write Chinese characters to file based on unicode code point in Python3 - python-3.x

I am trying to write Chinese characters to a CSV file based on their Unicode code points found in a text file in unicode.org/Public/zipped/13.0.0/Unihan.zip. For instance, one example character is U+9109.
In the example below I can get the correct output by hard coding the value (line 8), but keep getting it wrong with every permutation I've tried at generating the bytes from the code point (lines 14-16).
I'm running this in Python 3.8.3 on a Debian-based Linux distro.
Minimal working (broken) example:
1 #!/usr/bin/env python3
2
3 def main():
4
5 output = open("test.csv", "wb")
6
7 # Hardcoded values work just fine
8 output.write('\u9109'.encode("utf-8"))
9
10 # Comma separation
11 output.write(','.encode("utf-8"))
12
13 # Problem is here
14 codepoint = '9109'
15 u_str = '\\' + 'u' + codepoint
16 output.write(u_str.encode("utf-8"))
17
18 # End with newline
19 output.write('\n'.encode("utf-8"))
20
21 output.close()
22
23 if __name__ == "__main__":
24 main()
Executing and viewing results:
example $
example $./test.py
example $
example $cat test.csv
鄉,\u9109
example $
The expected output would look like this (Chinese character occurring on both sides of the comma):
example $
example $./test.py
example $cat test.csv
鄉,鄉
example $

chr is used to convert integers to code points in Python 3. Your code could use:
output.write(chr(0x9109).encode("utf-8"))
But if you specify the encoding in the open instead of using binary mode you don't have to manually encode everything. print to a file handles newlines for you as well.
with open("test.txt",'w',encoding='utf-8') as output:
for i in range(0x4e00,0x4e10):
print(f'U+{i:04X} {chr(i)}',file=output)
Output:
U+4E00 一
U+4E01 丁
U+4E02 丂
U+4E03 七
U+4E04 丄
U+4E05 丅
U+4E06 丆
U+4E07 万
U+4E08 丈
U+4E09 三
U+4E0A 上
U+4E0B 下
U+4E0C 丌
U+4E0D 不
U+4E0E 与
U+4E0F 丏

Related

SED style Multi address in Python?

I have an app that parses multiple Cisco show tech files. These files contain the output of multiple router commands in a structured way, let me show you an snippet of a show tech output:
`show clock`
20:20:50.771 UTC Wed Sep 07 2022
Time source is NTP
`show callhome`
callhome disabled
Callhome Information:
<SNIPET>
`show module`
Mod Ports Module-Type Model Status
--- ----- ------------------------------------- --------------------- ---------
1 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
2 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
3 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
4 52 16x10G + 32x10/25G + 4x100G Module N9K-X96136YC-R ok
21 0 Fabric Module N9K-C9504-FM-R ok
22 0 Fabric Module N9K-C9504-FM-R ok
23 0 Fabric Module N9K-C9504-FM-R ok
<SNIPET>
My app currently uses both SED and Python scripts to parse these files. I use SED to parse the show tech file looking for a specific command output, once I find it, I stop SED. This way I don't need to read all the file (these can get to be very big files). This is a snipet of my SED script:
sed -E -n '/`show running-config`|`show running`|`show running config`/{
p
:loop
n
p
/`show/q
b loop
}' $1/$file
As you can see I am using a multi address range in SED. My question specifically is, how can I achieve something similar in python? I have tried multiple combinations of flags: DOTALL and MULTILINE but I can't get the result I'm expecting, for example, I can get a match for the command I'm looking for, but python regex wont stop until the end of the file after the first match.
I am looking for something like this
sed -n '/`show clock`/,/`show/p'
I would like the regex match to stop parsing the file and print the results, immediately after seeing `show again , hope that makes sense and thank you all for reading me and for your help
You can use nested loops.
import re
def process_file(filename):
with open(filename) as f:
for line in f:
if re.search(r'`show running-config`|`show running`|`show running config`', line):
print(line)
for line1 in f:
print(line1)
if re.search(r'`show', line1):
return
The inner for loop will start from the next line after the one processed by the outer loop.
You can also do it with a single loop using a flag variable.
import re
def process_file(filename):
in_show = False
with open(filename) as f:
for line in f:
if re.search(r'`show running-config`|`show running`|`show running config`', line):
in_show = True
if in_show
print(line)
if re.search(r'`show', line1):
return

How to delete last n characters of .txt file without having to re-write all the other characters [duplicate]

After looking all over the Internet, I've come to this.
Let's say I have already made a text file that reads:
Hello World
Well, I want to remove the very last character (in this case d) from this text file.
So now the text file should look like this: Hello Worl
But I have no idea how to do this.
All I want, more or less, is a single backspace function for text files on my HDD.
This needs to work on Linux as that's what I'm using.
Use fileobject.seek() to seek 1 position from the end, then use file.truncate() to remove the remainder of the file:
import os
with open(filename, 'rb+') as filehandle:
filehandle.seek(-1, os.SEEK_END)
filehandle.truncate()
This works fine for single-byte encodings. If you have a multi-byte encoding (such as UTF-16 or UTF-32) you need to seek back enough bytes from the end to account for a single codepoint.
For variable-byte encodings, it depends on the codec if you can use this technique at all. For UTF-8, you need to find the first byte (from the end) where bytevalue & 0xC0 != 0x80 is true, and truncate from that point on. That ensures you don't truncate in the middle of a multi-byte UTF-8 codepoint:
with open(filename, 'rb+') as filehandle:
# move to end, then scan forward until a non-continuation byte is found
filehandle.seek(-1, os.SEEK_END)
while filehandle.read(1) & 0xC0 == 0x80:
# we just read 1 byte, which moved the file position forward,
# skip back 2 bytes to move to the byte before the current.
filehandle.seek(-2, os.SEEK_CUR)
# last read byte is our truncation point, move back to it.
filehandle.seek(-1, os.SEEK_CUR)
filehandle.truncate()
Note that UTF-8 is a superset of ASCII, so the above works for ASCII-encoded files too.
Accepted answer of Martijn is simple and kind of works, but does not account for text files with:
UTF-8 encoding containing non-English characters (which is the default encoding for text files in Python 3)
one newline character at the end of the file (which is the default in Linux editors like vim or gedit)
If the text file contains non-English characters, neither of the answers provided so far would work.
What follows is an example, that solves both problems, which also allows removing more than one character from the end of the file:
import os
def truncate_utf8_chars(filename, count, ignore_newlines=True):
"""
Truncates last `count` characters of a text file encoded in UTF-8.
:param filename: The path to the text file to read
:param count: Number of UTF-8 characters to remove from the end of the file
:param ignore_newlines: Set to true, if the newline character at the end of the file should be ignored
"""
with open(filename, 'rb+') as f:
last_char = None
size = os.fstat(f.fileno()).st_size
offset = 1
chars = 0
while offset <= size:
f.seek(-offset, os.SEEK_END)
b = ord(f.read(1))
if ignore_newlines:
if b == 0x0D or b == 0x0A:
offset += 1
continue
if b & 0b10000000 == 0 or b & 0b11000000 == 0b11000000:
# This is the first byte of a UTF8 character
chars += 1
if chars == count:
# When `count` number of characters have been found, move current position back
# with one byte (to include the byte just checked) and truncate the file
f.seek(-1, os.SEEK_CUR)
f.truncate()
return
offset += 1
How it works:
Reads only the last few bytes of a UTF-8 encoded text file in binary mode
Iterates the bytes backwards, looking for the start of a UTF-8 character
Once a character (different from a newline) is found, return that as the last character in the text file
Sample text file - bg.txt:
Здравей свят
How to use:
filename = 'bg.txt'
print('Before truncate:', open(filename).read())
truncate_utf8_chars(filename, 1)
print('After truncate:', open(filename).read())
Outputs:
Before truncate: Здравей свят
After truncate: Здравей свя
This works with both UTF-8 and ASCII encoded files.
In case you are not reading the file in binary mode, where you have only 'w' permissions, I can suggest the following.
f.seek(f.tell() - 1, os.SEEK_SET)
f.write('')
In this code above, f.seek() will only accept f.tell() b/c you do not have 'b' access. then you can set the cursor to the starting of the last element. Then you can delete the last element by an empty string.
with open(urfile, 'rb+') as f:
f.seek(0,2) # end of file
size=f.tell() # the size...
f.truncate(size-1) # truncate at that size - how ever many characters
Be sure to use binary mode on windows since Unix file line ending many return an illegal or incorrect character count.
with open('file.txt', 'w') as f:
f.seek(0, 2) # seek to end of file; f.seek(0, os.SEEK_END) is legal
f.seek(f.tell() - 2, 0) # seek to the second last char of file; f.seek(f.tell()-2, os.SEEK_SET) is legal
f.truncate()
subject to what last character of the file is, could be newline (\n) or anything else.
This may not be optimal, but if the above approaches don't work out, you could do:
with open('myfile.txt', 'r') as file:
data = file.read()[:-1]
with open('myfile.txt', 'w') as file:
file.write(data)
The code first opens the file, and then copies its content (with the exception of the last character) to the string data. Afterwards, the file is truncated to zero length (i.e. emptied), and the content of data is saved to the file, with the same name.
This is basically the same as vins ms's answer, except that it doesn't use the os package, and that is used the safer 'with open' syntax. This may not be recommended if the text file is huge. (I wrote this since none of the above approaches worked out too well for me in python 3.8).
here is a dirty way (erase & recreate)...
i don't advice to use this, but, it's possible to do like this ..
x = open("file").read()
os.remove("file")
open("file").write(x[:-1])
On a Linux system or (Cygwin under Windows). You can use the standard truncate command. You can reduce or increase the size of your file with this command.
In order to reduce a file by 1G the command would be truncate -s 1G filename. In the following example I reduce a file called update.iso by 1G.
Note that this operation took less than five seconds.
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 30802968576 Blocks: 30081024 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:39:00.572940600 -0400
Modify: 2020-06-12 07:39:00.572940600 -0400
Change: 2020-06-12 07:39:00.572940600 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
chris#SR-ENG-P18 /cygdrive/c/Projects
$ truncate -s -1G update.iso
chris#SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
File: update.iso
Size: 29729226752 Blocks: 29032448 IO Block: 65536 regular file
Device: ee6ddbceh/4000177102d Inode: 19421773395035112 Links: 1
Access: (0664/-rw-rw-r--) Uid: (1052727/ chris) Gid: (1049089/Domain Users)
Access: 2020-06-12 07:42:38.335782800 -0400
Modify: 2020-06-12 07:42:38.335782800 -0400
Change: 2020-06-12 07:42:38.335782800 -0400
Birth: 2020-06-11 13:31:21.170568000 -0400
The stat command tells you lots of info about a file including its size.

Mimicking bash wc functionalities using python

I have written a very simple python programme, called wc.py, which mimics "bash wc" behaviour to count the number of words, lines and bytes in a file. My programme is as follow:
import sys
path = sys.argv[1]
w = 0
l = 0
b = 0
for currentLine in file:
wordsInLine = currentLine.strip().split(' ')
wordsInLine = [word for word in wordsInLine if word != '']
w += len(wordsInLine)
b += len(currentLine.encode('utf-8'))
l += 1
#output
print(str(l) + ' ' + str(w) + ' ' + str(b))
In order to execute my programme you should execute the following command:
python3 wc.py [a file to read the data from]
As the result it shows
[The number of lines in the file] [The number of words in the file] [The number of bytes in the file] [the file directory path]
The files I used to test my code is as follow:
file.txt which contains the following data:
1
2
3
4
Executing "wc file.txt" returns
4 4 8
Executing "python3 wc.py file.txt" returns 4 4 8
Download "Annual enterprise survey: 2020 financial year (provisional) – CSV" from CSV file download
Executing "wc [fileName].csv" returns
37081 500273 5881081
Executing "python3 wc.py [fileName].csv" returns
37081 500273 5844000
and a [something].pdf file
Executing "wc [something].pdf" works.
Executing "python3 code.py" throws the following errors:
Traceback (most recent call last):
File "code.py", line 10, in <module>
for currentLine in file:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbf in position 10: invalid start byte
As you can see, the output of python3 code.py [something].pdf and python3 code.py [something].csv is not the same as what wc returns. Could you help me to find the reason of this erroneous behaviour in my code?
Regarding the CSV file, if you look at the difference between your result and that of wc:
5881081 - 5844000 = 37081 which is exactly the number of lines.
That is, every line has one additional character in the original file. That character is the carriage return \r which got lost in Python because you iterate over lines and don't specify the linebreaks. If you want a byte-correct result, you have to first identify the type of linebreaks used in the file (and watch out for inconsistencies throughout the document).

How to read binary data in pyspark

I'm reading binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b
using pyspark.
import array
from io import StringIO
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)
def mapper(features):
a = array.array('f')
a.frombytes(features)
return a.tolist()
def byte_mapper(bytes):
return str(bytes)
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
When just product_id is selected from the rdd using
decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
The output for product_id is
["b'1582480311'", "b'\\x00\\x00\\x00\\x00\\x88c-?\\xeb\\xe2'", "b'7#\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\xec/\\x0b?\\x00\\x00\\x00\\x00K\\xea'", "b'\\x00\\x00c\\x7f\\xd9?\\x00\\x00\\x00\\x00'", "b'L\\xa6\\n>\\x00\\x00\\x00\\x00\\xfe\\xd4'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\xe5\\xd0\\xa2='", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'", "b'\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00'"]
The file is hosted on s3.
The file in each row has first 10 bytes for product_id next 4096 bytes as image_features
I'm able to extract all the 4096 image features but facing issue when reading the first 10 bytes and converting it into proper readable format.
EDIT:
Finally, the problem comes from the recordLength. It's not 4096 + 10 but 4096*4 + 10. Chaging to :
img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 16394)
Should work.
Actually you can find this in the provided code from the web site you downloaded the binary file:
for i in range(4096):
feature.append(struct.unpack('f', f.read(4))) # <-- so 4096 * 4
Old answer:
I think the issue comes from your byte_mapper function.
That's not the correct way to convert bytes to string. You should be using decode:
bytes = b'1582480311'
print(str(bytes))
# output: "b'1582480311'"
print(bytes.decode("utf-8"))
# output: '1582480311'
If you're getting the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 4: invalid start byte
That means product_id string contains non-utf8 characters. If you don't know the input encoding, it's difficult to convert into strings.
However, you may want to ignore those characters by adding option ignore to decode function:
bytes.decode("utf-8", "ignore")

Export a matrix to Excel

I made a matrix and I want to export it to Excel. The matrix looks like this:
1 2 3 4 5 6 7
2 0.4069264
3 0.5142857 0.2948718
4 0.3939394 0.4098639 0.3772894
5 0.3476190 0.3717949 0.3194444 0.5824176
6 0.2809524 0.3974359 0.2222222 0.3388278 0.3974359
7 0.2809524 0.5987654 0.3933333 0.4188713 0.4711538 0.3429487
8 0.4675325 0.4855072 0.4523810 0.4917184 0.3409091 0.4318182 0.4128788
9 0.3896104 0.5189594 0.4404762 0.2667549 0.5471429 0.3604762 0.3081502
10 0.4242424 0.4068878 0.3484432 0.2708333 0.4766484 0.3740842 0.4528219
11 0.3476190 0.3942308 0.2881944 0.3228022 0.4711538 0.2147436 0.3653846
12 0.6060606 0.3949830 0.2971612 0.3541667 0.5022894 0.3484432 0.4466490
13 0.4675325 0.5972222 0.6060606 0.3670635 0.4393939 0.3939394 0.3695652
14 0.4978355 0.4951499 0.4480952 0.4713404 0.3814286 0.3147619 0.4629121
15 0.4632035 0.4033883 0.4508929 0.3081502 0.4728571 0.3528571 0.4828571
16 0.3766234 0.5173993 0.4771825 0.4734432 0.5114286 0.3514286 0.4214286
17 0.3939394 0.5289116 0.3260073 0.3333333 0.5663919 0.2330586 0.3015873
18 0.3939394 0.3708791 0.2837302 0.4102564 0.3392857 0.2559524 0.4123810
19 0.3160173 0.5727041 0.4885531 0.3056973 0.4725275 0.3827839 0.3346561
20 0.3333333 0.5793651 0.4257143 0.4876543 0.4390476 0.2390476 0.3131868
21 0.5281385 0.3762755 0.4052198 0.2997449 0.4180403 0.2898352 0.4951499
22 0.3593074 0.3784014 0.4075092 0.2423469 0.4908425 0.3113553 0.3430335
23 0.5281385 0.5875850 0.4404762 0.4634354 0.6071429 0.3763736 0.3747795
24 0.3549784 0.6252381 0.5957341 0.4328571 0.4429563 0.4429563 0.3422619
25 0.4242424 0.4931973 0.5054945 0.2142857 0.4670330 0.4285714 0.4312169
26 0.3852814 0.5671769 0.4954212 0.4073129 0.3736264 0.4890110 0.4523810
27 0.5238095 0.3269558 0.5187729 0.4051871 0.5412088 0.5155678 0.5859788
28 0.3160173 0.1904762 0.3205128 0.3384354 0.3429487 0.3173077 0.5123457
29 0.2380952 0.4468537 0.5196886 0.4536565 0.4491758 0.4491758 0.4634039
30 0.4545455 0.4295635 0.4080087 0.4791667 0.3474026 0.3019481 0.4627329
31 0.2857143 0.3988095 0.3397436 0.3443878 0.4294872 0.2756410 0.3456790
32 0.3636364 0.3027211 0.3772894 0.3452381 0.4413919 0.3388278 0.3818342
33 0.3333333 0.4482402 0.4080087 0.4275362 0.2888199 0.4047619 0.4301242
34 0.5411255 0.4825680 0.4043040 0.4417517 0.4748168 0.3850733 0.3708113
35 0.3160173 0.5476190 0.4230769 0.3979592 0.3653846 0.3397436 0.2283951
36 0.4603175 0.4653209 0.4778912 0.5170807 0.3928571 0.4508282 0.4254658
37 0.3939394 0.1955782 0.2490842 0.4047619 0.2490842 0.3516484 0.4559083
38 0.3463203 0.4660494 0.4300000 0.4157848 0.3833333 0.2233333 0.2788462
39 0.5844156 0.4668367 0.3809524 0.3843537 0.4803114 0.3008242 0.5026455
40 0.5454545 0.4902211 0.3740842 0.2946429 0.5279304 0.2971612 0.3293651
41 0.5800866 0.3758503 0.5073260 0.5136054 0.3598901 0.5393773 0.4823633
42 0.4458874 0.3937390 0.3785714 0.4686949 0.3768315 0.3127289 0.4954212
43 0.6536797 0.5740741 0.5533333 0.4453263 0.4866667 0.5400000 0.4358974
44 0.5887446 0.5548469 0.4308608 0.3949830 0.5462454 0.3411172 0.5136684
45 0.4069264 0.4357993 0.4308608 0.3830782 0.4308608 0.3795788 0.4025573
46 0.5974026 0.3826531 0.3672161 0.3954082 0.4441392 0.3159341 0.5141093
47 0.2554113 0.4196429 0.4262821 0.4961735 0.2788462 0.3301282 0.3055556
I tried the command:
WriteXLS("my matrix after i converted it to data.frame", "test.xls")
but I got this error:
The Perl script 'WriteXLS.pl' failed to run successfully.
I googled it but I couldn't find a solution.
Thanks in advance.
Any reason why you can't just use write.csv?
write.csv(mymatrix, "test.csv")
Import it in Excel and you're set!
PS: I assume you're not putting quotes around your variable name in the WriteXLS call, right?
One other option on Windows (which seems a reasonable assumption given that you are using Excel):
You can write a matrix (or data frame) to the clipboard using a command like:
write.table(mymat, 'clipboard', sep='\t')
Then just go into Excel, click in the cell that you want to be the top left cell, then do a paste and your matrix is there (the sep='\t' is important for Excel to interpret it correctly).
This is similar to other answers, but you don't need an intermediate file on disk.
You could also check xlsx if you do not mind the Excel 2007 format, as xlsx does not depend on Perl (though depends on rJava).
After loading the packge via library(xlsx) just try the following:
write.xlsx(USArrests, "/usarrests.xlsx")
It's hard to see what is going on here exactly. Might be several things.
I think the easiest way to write a matrix to excell is by using write.table() and importing the data in excell. It takes an extra step but it also keeps your data in a nice format.
If foo is your matrix:
write.table(foo,"foo.txt")
If you get an error maybe trie coercing the object to a matrix:
write.table(as.matrix(foo),"foo.txt")
Does the matrix contain values in the upper triangle as well? Perhaps making a full matrix works:
foo<-foo+t(foo)
write.table(as.matrix(foo),"foo.txt")
But these are all just random shots in the dark since I don't have a matrix to work with.
EDIT: In response to the other answer, you can remove the column and rownames with col.names=FALSE and row.names=FALSE in both write.table() and write.csv() (which are the same function with different default values).
I have met the same problem, after reinstalling strawberry perl : after debugging the WriteXLS function in R, I found out the the perl module Text::CSV_XS was missing from my fresh new install. I installed this module from the DOS command line :
perl -MCPAN -e shell
install Text::CSV_XS
After this, WriteXLS was working fine.
upper # matrix name
write.xlsx2(upper,file = "File.xlsx", sheetName="Sheetname",col.names=TRUE, row.names=TRUE, append=TRUE, showNA=TRUE)

Resources