Python 3 decode removes whitespace that should be kept

I'm reading a binary file that contains code for an STM32. I deliberately placed 2 const strings in the code, which allow me to read the SW version and description from a given file.
When you open the binary file in a hex editor, or even in Python 3, you can see the correct form. But when I run text = data.decode('utf-8', errors='ignore'), it removes the zeros from the file! I don't want this, as I keep the EOL characters to properly split and extract the strings that interest me.
(preview of the end of the data variable)
Svc\x00..\Src\adc.c\x00..\Src\can.c\x00defaultTask\x00Task_CANbus_receive\x00Task_LED_Controller\x00Task_LED1_TX\x00Task_LED2_RX\x00Task_PWM_Controller\x00**SW_VER:GN_1.01\x00\x00\x00\x00\x00\x00MODULE_DESC:generic_module\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00**Task_SuperVisor_Controller\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02\x03\x04\x06\x07\x08\t\x00\x00\x00\x00\x01\x02\x03\x04..\Src\tim.c\x005!\x00\x08\x11!\x00\x08\x01\x00\x00\x00\xaa\xaa\xaa\xaa\x01\x01\nd\x00\x02\x04\nd\x00\x00\x00\x00\xa2J\x04'
(preview of text, i.e. what I receive after decode)
r g # IDLE TmrQ Tmr Svc ..\Src\adc.c ..\Src\can.c
defaultTask Task_CANbus_receive Task_LED_Controller Task_LED1_TX
Task_LED2_RX Task_PWM_Controller SW_VER:GN_1.01
MODULE_DESC:generic_module
Task_SuperVisor_Controller ..\Src\tim.c 5! !
d d J
with open(path_to_file, "rb") as binary_file:
    # Read the whole file at once
    data = binary_file.read()
text = data.decode('utf-8', errors='ignore')
# get index of the "SW_VER:" string in the file
sw_ver_index = text.rfind("SW_VER:")
if sw_ver_index != -1:
    # SW_VER found: retrieve the value, e.g. for "SW_VER:WB_2.01"
    # the value starts at offset 7 and finishes at offset 14
    sw_ver_value = text[sw_ver_index + 7:sw_ver_index + 14]
    module.append(('DESC:', sw_ver_value))
else:
    # SW_VER not found
    module.append(('DESC:', 'N/A'))
# get index of the "MODULE_DESC:" string in the file
module_desc_index = text.rfind("MODULE_DESC:")
if module_desc_index != -1:
    # MODULE_DESC found
    module_desc_substring = text[module_desc_index + 12:]
    module_desc_value = module_desc_substring.split()
    module.append(('DESC:', module_desc_value[0]))
    print(module_desc_value[0])
As you can see, my white characters are gone, while they should be present.
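
For what it's worth, NUL (\x00) is valid UTF-8, so errors='ignore' should not strip it; one way to sidestep the text detour entirely is to split the raw bytes on the NUL separators and decode only the field you need. A minimal sketch, reusing path_to_file and the offsets from the code above:

with open(path_to_file, "rb") as binary_file:
    data = binary_file.read()

sw_ver_index = data.rfind(b"SW_VER:")
if sw_ver_index != -1:
    # take the bytes up to the next NUL separator, then decode just that field
    field = data[sw_ver_index:].split(b'\x00', 1)[0]
    sw_ver_value = field[7:].decode('utf-8', errors='ignore')
    print(sw_ver_value)   # e.g. GN_1.01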

Related

Reading images from a PDF and extracting text from them

Problem Statement: I have a pdf which contains n pages, and each page has 1 image whose text I need to read and perform some operation on.
What I tried: I have to do this in Python, and the only library I found with the best results is pytesseract.
I am pasting the sample code which I tried:
fn = kw['fn'] = self.env.context.get('wfg_pg', kw['fn'])
zoom, zoom_config = self.get_zoom_for_doc(index), ' -c tessedit_do_invert=0'
if 3.3 < zoom < 3.5:
    zoom_config += ' --oem 3 --psm 4'
elif 0 != page_number_list[0]:
    zoom_config += ' --psm 6'
full_text, page_length = '', kw['doc'].pageCount
if recursion and index >= 10:
    return fn.get('most_correct') or fn.get(page_number_list[0])
mat = fitz.Matrix(zoom, zoom)  # increase resolution
for page_no in page_number_list:
    page = kw['doc'].loadPage(page_no)  # load page by number
    pix = page.getPixmap(matrix=mat)
    with Image.open(io.BytesIO(pix.getImageData())) as img:
        text_of_each_page = str(pytesseract.image_to_string(img, config='%s' % zoom_config)).strip()
    fn[page_no] = text_of_each_page
    full_text = '\n'.join((full_text, text_of_each_page, '\n'))
_logger.critical(f"full text in load immage {full_text}")
args = (full_text, page_number_list)
load = recursion and self.run_recursion_to_load_new_image_to_text(*args, **kw)
if recursion and load:
    return self.load_image
return full_text
The issue: My pdf has dates like 1/13, 1/7; the library is reading them as 143, 1n, and in some places it is reading 17 as 1). Also, after the text it is giving some symbols like { & . , = randomly, whereas in the pdf these things are not even there.
For accuracy:
1. I tried converting the image to .tiff format, but it didn't work for me.
2. Tried adjusting the resolution of the image.
You can use the pdftoppm tool for converting your images really fast, as it provides a multi-threading feature by just passing thread_count=(no of threads).
You can refer to this link for more info on this tool. Also, better images can increase the accuracy of tesseract.
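
If that pdftoppm advice refers to the pdf2image wrapper (an assumption on my part; the answer does not name the Python package), a minimal sketch could look like this:

from pdf2image import convert_from_path
import pytesseract

# render every page via pdftoppm, 4 pages in parallel;
# a higher dpi than the default 200 usually helps tesseract accuracy
pages = convert_from_path('input.pdf', dpi=300, thread_count=4)
for page in pages:
    print(pytesseract.image_to_string(page, config='--oem 3 --psm 6'))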

Tkinter - converting a result

I am converting base64 to a string (complete), and I want to be able to edit the converted string entry and have this show up as the amended base64, i.e. input the #commented string below, press Enter, then set "alg" to None in the second box and have this amendment appear in the top / bottom box.
Note: I added the third box to compare the b64 results - it would be easier to change the result in the top box to reflect any amendments.
import base64
import binascii
from tkinter import *

#eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9

root = Tk()
root.title("Converter")
root.geometry('290x200+250+60')

def b64decode():
    s = strentry.get()
    try:
        return res1.set(base64.b64decode(s).decode())
    except (binascii.Error, UnicodeDecodeError):
        return res.set("Error")

def string_to_b64():
    data = res1.get()
    encodedBytes = base64.b64encode(data.encode("utf-8"))
    encodedStr = str(encodedBytes, "utf-8")
    return res3.set(encodedStr)

strentry = StringVar()
res = StringVar()
res1 = StringVar()
res3 = StringVar()

Enter_Value_Entry = Entry(root, text="", textvariable=strentry)
Enter_Value_Entry.pack()
Enter_Value_Entry2 = Entry(root, textvariable=res1)
Enter_Value_Entry2.pack()
Enter_Value_Entry3 = Entry(root, textvariable=res3)
Enter_Value_Entry3.pack()
Error = Label(root, textvariable=res)
Error.pack()

# run both conversions on Enter
root.bind("<Return>", lambda event: (b64decode(), string_to_b64()))
root.mainloop()
It is not advisable to try to edit Base64 unless you handle every byte in groups of 8 and 6 bits.
You say you want to just target a chunk, so here is a layout of what the needed change must consider.
text as txt {"alg" :"HS25 6","ty p":"JW T"}
text as b64 eyJhbGci OiJIUzI1 NiIsInR5 cCI6IkpX VCJ9
change the bytes of {"alg" to {"nul": no problem, only the first block is changed
text as txt {"nul" :"HS25 6","ty p":"JW T"}
text as b64 eyJudWwi OiJIUzI1 NiIsInR5 cCI6IkpX VCJ9
change the bytes of {"alg" to {None: BIG PROBLEM, everything needs changing, as 1 bit was altered
text as txt {None: "HS256 ","typ ":"JWT "}
text as b64 e05vbmU6 IkhTMjU2 IiwidHlw IjoiSldU In0=
You MUST only change a block of 6x8-bit chars between boundary divisions for a new block that maps to 8x6-bit chars between the same boundaries.
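
The block rule above can be checked with a few lines of Python; a quick sketch using the same JWT header from the question:

import base64

original = b'{"alg":"HS256","typ":"JWT"}'
patched = b'{"nul"' + original[6:]   # swap only the first 6-byte group

print(base64.b64encode(original))    # b'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9'
print(base64.b64encode(patched))     # b'eyJudWwiOiJIUzI1NiIsInR5cCI6IkpXVCJ9'
# only the first 8 Base64 chars differ, because 6 bytes (6x8 bits)
# map exactly onto 8 Base64 characters (8x6 bits)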
To make changes, it is best done in the related application: thus for images, decode the base64 to binary, use an image editor, then re-encode back to base64. If you don't have the fast native base64 app as used on mac or linux, then for the poor Windows man's drag-and-drop (or command line: add a filename) you can adapt this base64.cmd:-
IF /I %~x1==.b64 goto Decode
rem todo IF /I %~x1==.hex goto HexDecode
FOR %%X in (.jpg .pdf .png) do IF /I %~x1==%%X goto Encode

:Help
echo Format unexpected, accepts .b64 .jpg .pdf OR .png
pause&exit /b

:Encode an accepted file to Base64 and trim off both -----BEGIN / END CERTIFICATE-----
Rem NOTE:- WILL overwrite; also the previous .ext will be changed to lower case and .b64 appended
certutil -encode "%~pdn1%~x1" %temp%\tmp.b64 && findstr /v /c:-- %temp%\tmp.b64 > "%~pdn1%~x1.b64" && del %temp%\tmp.b64
Rem EXIT since we must not / cannot over-write the source
pause&exit /b

:Decode a .Base64 file to binary, removing the appended .b64. NOTE:- will NOT allow overwrite, thus error.
certutil -decode "%~pdn1.b64" "%~pdn1"
pause&exit /b

Python 3.7.3 inserting into bytearray = "object cannot be re-sized"

I'm working with a bytearray from file data. I'm opening the file as 'r+b', so I can change it as binary.
In the Python 3.7 docs, it explains that a regex's finditer() can use m.start() and m.end() to identify the start and end of a match.
In the question Insert bytearray into bytearray Python, the answer says an insert can be made into a bytearray by using slicing. But when this is attempted, the following error is given: BufferError: Existing exports of data: object cannot be re-sized.
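
For reference, slice assignment on a free-standing bytearray really does resize it; an illustrative interpreter session:

>>> b = bytearray(b'0 n')
>>> b[0:1] = b'0123456789'   # replacing 1 byte with 10 grows the bytearray
>>> b
bytearray(b'0123456789 n')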
Here is an example of the failing code:
pat = re.compile(rb'0.?\d* [nN]')        # regex, binary "0[.*] n"
with open(file, mode='r+b') as f:        # updateable, binary
    d = bytearray(f.read())              # read file data as d [as bytes]
    it = pat.finditer(d)                 # find pattern in data as iterable
    for match in it:                     # for each match,
        m = match.group()                # bytes of the match string to binary m
        ...
        val = b'0123456789 n'
        ...
        d[match.start():match.end()] = bytearray(val)
In the file, the match is 0 n, and I'm attempting to replace it with 0123456789 n, so 9 bytes would be inserted. The file can be changed successfully with this code, just not increased in size. What am I doing wrong? Here is output showing all non-size-increasing operations working, but failing on inserting digits:
*** Changing b'0.0032 n' to b'0.0640 n'
len(d): 10435, match.start(): 607, match.end(): 615, len(bytearray(val)): 8
*** Found: "0.0126 n"; set to [0.252] or custom:
*** Changing b'0.0126 n' to b'0.2520 n'
len(d): 10435, match.start(): 758, match.end(): 766, len(bytearray(val)): 8
*** Found: "0 n"; set to [0.1] or custom:
*** Changing b'0 n' to b'0.1 n'
len(d): 10435, match.start(): 806, match.end(): 809, len(bytearray(val)): 5
Traceback (most recent call last):
  File "fixV1.py", line 190, in <module>
    main(sys.argv)
  File "fixV1.py", line 136, in main
    nchanges += search(midfile)  # perform search, returning count
  File "fixV1.py", line 71, in search
    d[match.start():match.end()] = bytearray(val)
BufferError: Existing exports of data: object cannot be re-sized
This is a simple case, much like modifying an iterable during iteration:
it = pat.finditer(d) creates a buffer from the bytearray object. This in turn "locks" the bytearray object against being changed in size.
d[match.start():match.end()] = bytearray(val) attempts to modify the size of the "locked" bytearray object.
Just like attempting to change a list's size while iterating over it will fail, an attempt to change a bytearray's size while iterating over its buffer will also fail.
You can give a copy of the object to finditer().
For more information about buffers and how Python works under the hood, see the Python docs.
Also, do keep in mind that you're not actually modifying the file. You'll need to either write the data back to the file, or use memory-mapped files. I suggest the latter if you're looking for efficiency.
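
Putting the suggestion together, a minimal sketch (path name assumed) that searches a copy, then applies the replacements right-to-left so earlier offsets stay valid even when the length changes:

import re

pat = re.compile(rb'0.?\d* [nN]')
val = b'0123456789 n'

with open(path, mode='r+b') as f:
    d = bytearray(f.read())
    # finditer() runs over an immutable copy, so no buffer export
    # ever locks the bytearray we want to resize
    matches = list(pat.finditer(bytes(d)))
    for match in reversed(matches):
        d[match.start():match.end()] = val
    # write the modified data back to the file
    f.seek(0)
    f.write(d)
    f.truncate()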

How to handle blank lines, junk lines and \n while converting an input file to a csv file

Below is the sample data in the input file. I need to process this file and turn it into a csv file. With some help, I was able to convert it to a csv file. However, it is not fully converted to csv, since I am not able to handle \n, the junk line (2nd line) and the blank line (4th line). Also, I need help to filter transaction_type, i.e., avoid the "rewrite" transaction_type.
{"transaction_type": "new", "policynum": 4994949}
44uu094u4
{"transaction_type": "renewal", "policynum": 3848848,"reason": "Impressed with \n the Service"}
{"transaction_type": "cancel", "policynum": 49494949, "cancel_table":[{"cancel_cd": "AU"}, {"cancel_cd": "AA"}]}
{"transaction_type": "rewrite", "policynum": 5634549}
Below is the code:
import ast
import csv

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w') as out_f:
    data = in_f.readlines()
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n',
        extrasaction='ignore')
    writer.writeheader()
    for row in data:
        dict_row = ast.literal_eval(row)
        if 'cancel_table' in dict_row:
            cancel_table = dict_row['cancel_table']
            cancel_cd = []
            for cancel_row in cancel_table:
                cancel_cd.append(cancel_row['cancel_cd'])
            dict_row['cancel_cd'] = ','.join(cancel_cd)
        writer.writerow(dict_row)
Below is my output, not yet handling the junk line, the blank line and the transaction type "rewrite".
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with
the Service"
cancel,49494949,"AU,AA",
Expected output
transaction_type,policynum,cancel_cd,reason
new,4994949,,
renewal,3848848,,"Impressed with the Service"
cancel,49494949,"AU,AA",
Hmm, I tried to fix it, but I do not know how CSV files work; my small knowledge suggests you run this code before converting the file.
txt = {"transaction_type": "renewal",
"policynum": 3848848,
"reason": "Impressed with \n the Service"}
newTxt = {}
for i,j in txt.items():
# local var (temporar)
lastX = ""
correctJ = ""
# check if in J is ascii white space "\n" and get it out
if "\n" in f"b'{j}'":
j = j.replace("\n", "")
# for grammar purpose check if
# J have at least one space
if " " in str(j):
# if yes check it closer (one by one)
for x in ([j[y:y+1] for y in range(0, len(j), 1)]):
# if 2 spaces are consecutive pass the last one
if x == " " and lastX == " ":
pass
# if not update correctJ with new values
else:
correctJ += x
# remember what was the last value checked
lastX = x
# at the end make J to be the correctJ (just in case J has not grammar errors)
j = correctJ
# add the corrections to a new dictionary
newTxt[i]=j
# show the resoult
print(f"txt = {txt}\nnewTxt = {newTxt}")
Terminal:
txt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with \n the Service'}
newTxt = {'transaction_type': 'renewal', 'policynum': 3848848, 'reason': 'Impressed with the Service'}
Process finished with exit code 0
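
The snippet above only cleans a single dictionary. A hedged sketch of the full conversion, assuming the junk lines never parse as Python literals, that also skips blank lines and drops the "rewrite" rows:

import ast
import csv

with open('test_policy', 'r') as in_f, open('test_policy.csv', 'w') as out_f:
    writer = csv.DictWriter(
        out_f,
        fieldnames=['transaction_type', 'policynum', 'cancel_cd', 'reason'],
        lineterminator='\n',
        extrasaction='ignore')
    writer.writeheader()
    for row in in_f:
        row = row.strip()
        if not row:                        # skip the blank line
            continue
        try:
            dict_row = ast.literal_eval(row)
        except (ValueError, SyntaxError):  # skip junk such as 44uu094u4
            continue
        if dict_row.get('transaction_type') == 'rewrite':
            continue                       # filter out "rewrite" transactions
        if 'reason' in dict_row:
            # collapse the embedded newline and doubled spaces
            dict_row['reason'] = ' '.join(dict_row['reason'].split())
        if 'cancel_table' in dict_row:
            dict_row['cancel_cd'] = ','.join(
                c['cancel_cd'] for c in dict_row['cancel_table'])
        writer.writerow(dict_row)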

load .npy file from google cloud storage with tensorflow

I'm trying to load .npy files from my Google Cloud Storage to my model. I followed this example here: Load numpy array in google-cloud-ml job
But I get this error:
'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
Can you help me please?
Here is a sample from the code.
Here I read the file:
with file_io.FileIO(metadata_filename, 'r') as f:
    self._metadata = [line.strip().split('|') for line in f]
and here I start processing it:
if self._offset >= len(self._metadata):
    self._offset = 0
    random.shuffle(self._metadata)
meta = self._metadata[self._offset]
self._offset += 1

text = meta[3]
if self._cmudict and random.random() < _p_cmudict:
    text = ' '.join([self._maybe_get_arpabet(word) for word in text.split(' ')])

input_data = np.asarray(text_to_sequence(text, self._cleaner_names), dtype=np.int32)
f = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path', meta[0])))
linear_target = tf.Variable(initial_value=np.load(f), name='linear_target')
s = StringIO(file_io.read_file_to_string(
    os.path.join('gs://path', meta[1])))
mel_target = tf.Variable(initial_value=np.load(s), name='mel_target')
return (input_data, mel_target, linear_target, len(linear_target))
and this is a sample of the data.
This is likely because your file doesn't contain utf-8 encoded text.
It's possible you may need to initialize the file_io.FileIO instance as a binary file using mode='rb', or set binary_mode=True in the call to read_file_to_string.
This will cause the data that is read to be returned as a sequence of bytes, rather than a string.
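
Following that advice, a minimal sketch of the loading code (bucket path assumed) that reads the .npy file as bytes and feeds them to np.load via BytesIO instead of StringIO:

from io import BytesIO
import numpy as np
from tensorflow.python.lib.io import file_io

# binary_mode=True returns bytes, which is what the .npy format needs
raw = file_io.read_file_to_string('gs://path/to/file.npy', binary_mode=True)
linear_target = np.load(BytesIO(raw))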
