PIL Drawing text and breaking lines at \n - python-3.x

Hi, I'm having some trouble getting longer texts, which should be line-broken at specific points, onto an image. It always just prints the \n along with the text without breaking the line, and I can't find any info online on whether this is even possible, or whether PIL just sees the raw string to put on the image without checking for line breaks. The text is just some random stuff from a CSV.
from PIL import ImageDraw, ImageFont

def place_text(self, text, x, y):
    # text comes straight from a CSV cell
    font = ImageFont.truetype('arial.ttf', 35)  # font, e.g. arial.ttf
    w_txt, h_txt = font.getsize(text)
    print("Now we are in the second case")
    draw_text = ImageDraw.Draw(self.card[self.cardCount])
    draw_text.text((x, y), text, fill="black", font=font, align="left")
Yeah, I know this code is kinda all over the place, but for putting the text on the image that shouldn't cause any issues, should it?
Writing stuff on an image with this results in just one line of continuous text, with the \n's still in there and no line breaks.

Found the answer: the string pulled from the CSV had to be decoded again before being placed:
text = bytes(text, 'utf-8').decode("unicode_escape")
did the trick
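For context, here is a minimal sketch of why this works (cards.csv and its layout are made up for illustration): if a CSV cell contains the two literal characters backslash and n rather than a real newline, csv.reader hands that sequence through unchanged and PIL draws it verbatim. Round-tripping through unicode_escape turns the escape sequence into a real newline, which ImageDraw.text does honor.

import csv
from PIL import Image, ImageDraw, ImageFont

img = Image.new("RGB", (400, 200), "white")
draw = ImageDraw.Draw(img)
font = ImageFont.truetype("arial.ttf", 35)

with open("cards.csv", newline="", encoding="utf-8") as f:  # hypothetical file
    row = next(csv.reader(f))

text = row[0]  # e.g. the literal string 'line one\nline two'
text = bytes(text, "utf-8").decode("unicode_escape")  # turns '\n' into a real newline
draw.text((10, 10), text, fill="black", font=font, align="left")
img.save("out.png")

One caveat: unicode_escape treats the bytes as latin-1, so this round trip can mangle non-ASCII text; if only newlines need unescaping, text.replace('\\n', '\n') is a safer alternative.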

Related

tkinter - Hebrew text changes order mid-text

So I got this weird issue where I simply want to display a Hebrew text (with special characters, accents, diacritics, etc.) on a Label.
from tkinter import Tk, Label

with open("hebrew.txt", encoding="utf-8") as f:
    lines = f.read().split("\n")

window = Tk()
lbl = Label(window, text=lines[11], font=("Narkisim", 14))
lbl.grid(column=0, row=0)
window.geometry('850x200')
window.mainloop()
Hebrew is an RTL language. The thing is, up to a certain line length it displays correctly. But once the line is longer than that, the next words appear to the right of the text (in English terms, think of it as if words started appearing to the left of the text instead of each word being the right-most). This only happens with lines over a certain length (e.g. if I do text=lines[11][:108] it's still good; the "good" length varies a bit from line to line).
This only happens with nikkud/diacritics/accents. If I use regular Hebrew text without these special characters, it's all good. Any ideas what the issue could be? It's driving me crazy.

How to remove duplicate sentences from paragraph using NLTK?

I have a huge document with many repeated sentences (footer text, hyperlinks with alphanumeric characters), and I need to get rid of those repeated hyperlinks or footer text. I have tried the code below but unfortunately couldn't get it to work. Please review and help.
corpus = "We use file handling methods in python to remove duplicate lines in python text file or function. The text file or function has to be in the same directory as the python program file. Following code is one way of removing duplicates in a text file bar.txt and the output is stored in foo.txt. These files should be in the same directory as the python script file, else it won’t work.Now, we should crop our big image to extract small images with amounts.In terms of topic modelling, the composites are documents and the parts are words and/or phrases (phrases n words in length are referred to as n-grams).We use file handling methods in python to remove duplicate lines in python text file or function.As an example I will use some image of a bill, saved in the pdf format. From this bill I want to extract some amounts.All our wrappers, except of textract, can’t work with the pdf format, so we should transform our pdf file to the image (jpg). We will use wand for this.Now, we should crop our big image to extract small images with amounts."
from nltk.tokenize import sent_tokenize

sentences_with_dups = []
for sentence in corpus:
    words = sentence.sent_tokenize(corpus)
    if len(set(words)) != len(words):
        sentences_with_dups.append(sentence)
        print(sentences_with_dups)
    else:
        print('No duplicates found')
Error message for the above code:
AttributeError: 'str' object has no attribute 'sent_tokenize'
Desired Output:
Duplicates = ['We use file handling methods in python to remove duplicate lines in python text file or function.','Now, we should crop our big image to extract small images with amounts.']
Cleaned_corpus = {removed duplicates from corpus}
First of all, the example you provided is messed up: there are missing spaces between the final period of one sentence and the start of the next, so I cleaned that up.
Then you can do:
from nltk.tokenize import sent_tokenize

corpus = "......"
sentences = sent_tokenize(corpus)
duplicates = list(set([s for s in sentences if sentences.count(s) > 1]))
cleaned = list(set(sentences))
The above will mess up the order. If you care about the order, you can do the following to preserve it:
duplicates = []
cleaned = []
for s in sentences:
    if s in cleaned:
        if s in duplicates:
            continue
        else:
            duplicates.append(s)
    else:
        cleaned.append(s)
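For what it's worth, a more compact order-preserving variant is possible with just the standard library. This is a sketch of the same idea (it assumes Python 3.7+, where dicts preserve insertion order):

from collections import Counter
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(corpus)
counts = Counter(sentences)

# Sentences that occur more than once, in first-seen order.
duplicates = [s for s, n in counts.items() if n > 1]
# Unique sentences, in first-seen order.
cleaned = list(dict.fromkeys(sentences))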

How can I decode a .bin into a .pdf

I extracted an embedded object from an Excel spreadsheet that was a PDF, but the Excel zip file saves embedded objects as binary files.
I am trying to read the binary file and return it to its original format as a PDF. I took some code from another question with a similar issue, but when I try opening the PDF, Adobe gives the error "can't open because file is damaged...not decoded correctly..".
Does anyone know of a way to do this?
import base64
import os

with open('oleObject1.bin', 'rb') as f:
    binaryData = f.read()
print(binaryData)

with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(base64.decodebytes(binaryData))
Link to the object file on GitHub
Thanks Ryan, I was able to see what you were talking about. Here is the solution for future reference.
import os

str1 = b'%PDF-'  # Begin PDF
str2 = b'%%EOF'  # End PDF

with open('oleObject1.bin', 'rb') as f:
    binary_data = f.read()
print(binary_data)

# Convert bytes to a mutable bytearray
binary_byte_array = bytearray(binary_data)

# Find where the PDF begins
result1 = binary_byte_array.find(str1)
print(result1)

# Remove all characters before the PDF begins
del binary_byte_array[:result1]
print(binary_byte_array)

# Find where the PDF ends (the last %%EOF, as described below)
result2 = binary_byte_array.rfind(str2)
print(result2)

# Delete everything after the end of the %%EOF marker
# (result2 + len(str2) accounts for the five %%EOF characters)
print(len(binary_byte_array))
del binary_byte_array[result2 + len(str2):]
print(binary_byte_array)

with open(os.path.expanduser('test1.pdf'), 'wb') as fout:
    fout.write(binary_byte_array)
The bin file contains a valid PDF. There is no decoding required. The bin file, though, does have bytes before and after the PDF that need to be trimmed.
To find the first byte, look for the first occurrence of the string %PDF-.
To find the final byte, look for the last %%EOF.
Note: I do not know what "format" the leading/trailing bytes added by Excel are in. The solution above obviously would not work if either of the ASCII strings above could also occur in the leading/trailing data.
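Wrapped up as a reusable function, the trimming approach described above might look like the sketch below (the file names are just the ones from this thread; adjust as needed):

def extract_pdf(bin_path, pdf_path):
    # Trim the leading/trailing bytes around an embedded PDF and save it.
    with open(bin_path, 'rb') as f:
        data = f.read()

    start = data.find(b'%PDF-')   # first occurrence of the PDF header
    end = data.rfind(b'%%EOF')    # last occurrence of the PDF trailer
    if start == -1 or end == -1:
        raise ValueError('no embedded PDF found in ' + bin_path)

    with open(pdf_path, 'wb') as out:
        out.write(data[start:end + len(b'%%EOF')])

extract_pdf('oleObject1.bin', 'test1.pdf')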
You should try using a Python library that allows you to write PDF files, like reportlab or pyPDF.

Unicode manipulation and garbage '[]' characters

I have a 4GB text file which I can't even load to view, so I'm trying to split it up, but I need to manipulate the data a bit at a time.
The problem is that I'm getting these garbage white vertical rectangular characters, and I can't search for what they are in a search engine because they won't paste, nor can I get rid of them.
They look like square brackets '[]' but without that small amount of space in the middle.
Their Unicode values differ, so I can't just select one value and get rid of it.
I want to get rid of all of these rectangles.
Two more questions:
1) Why are there any Unicode escapes here (in the image below) at all? I decoded them. What am I missing? Note: later on I get string output that looks like a normal string, such as 'code1234' etc., but those Unicode exceptions are there as well.
2) Can you see why larger end values would raise the exception "list index out of range"? This only happens towards the end of the range, and it isn't constant; i.e. if end is 100 then maybe the last 5 throw that exception, but if end is 1000 then ONLY the LAST, say, 10 throw it.
Some code:
from itertools import islice

def read_from_file(file, start, end):
    with open(file, 'rb') as f:
        for line in islice(f, start, end):
            data.append(line.strip().decode("utf-8"))
    for i in range(len(data) - 1):
        try:
            if '#' in data[i]:
                a = data.pop(i)
                mail.append(a)
            else:
                print(data[i], data[i].encode())
        except Exception as e:
            print(str(e))

data = []
mail = []
read_from_file('breachcompilationuniq.txt', 0, 10)
Some Output:
Image link here as it won't let me format after pasting.
There's also this stuff later on, I don't know what these are either.
It appears that you have a text file which is not in the default encoding assumed by Python (UTF-8), but which nevertheless uses byte values in the range 128-255. Try:
with open(file, encoding='latin_1') as f:
    content = f.read()
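If you want to see exactly which bytes fail to decode as UTF-8 rather than reinterpret them as latin-1, one option (a sketch, not part of the original answer) is to decode with errors='replace' and look for the U+FFFD replacement character, which many fonts render as the hollow rectangle described in the question:

with open('breachcompilationuniq.txt', 'rb') as f:
    for lineno, raw in enumerate(f, 1):
        text = raw.decode('utf-8', errors='replace')
        if '\ufffd' in text:
            # U+FFFD marks every byte sequence that was not valid UTF-8;
            # print the raw bytes to see what is actually in the file.
            print(lineno, raw)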

How to extract a PDF's text using pdfrw

Can pdfrw extract the text out of a document?
I was thinking something along the lines of
from pdfrw import PdfReader

doc = PdfReader(pdf_path)
page_texts = []
for page_nr in range(doc.numPages):
    page_texts.append(doc.getPage(page_nr).parse_page())  # ..or something
In the docs they explain how to extract the text. However, it's just a bytestream. You could iterate over the pages and decode them individually.
from pdfrw import PdfReader

doc = PdfReader(pdf_path)
for page in doc.pages:
    bytestream = page.Contents.stream  # This is a string with bytes, not a bytestring
    string = ...  # somehow decode bytestream, maybe using zlib.decompress
    # do something with that text
Edit:
It may be worth noting that pdfrw does not yet support text decompression due to its complexity, according to the author.
It depends on which filters are applied to page.Contents.stream. If it is only FlateDecode, you can use pdfrw.uncompress.uncompress([page.Contents]) to decode it.
Note: pass the whole Contents object to the function inside a list.
Note: this is not the same as pdfrw.PdfReader.uncompress().
Then you have to parse the string to find your text. It will be in blocks of lines between BT (begin text) and ET (end text) markers, on lines ending in either 'TJ' or 'Tj', inside round brackets.
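Putting those pieces together, a rough sketch (assuming every page's stream is FlateDecode-only, which is often not the case, and that a naive regex is good enough, which it frequently isn't) might look like:

import re
from pdfrw import PdfReader
from pdfrw.uncompress import uncompress

doc = PdfReader("some.pdf")  # hypothetical path
for page in doc.pages:
    uncompress([page.Contents])  # in-place FlateDecode of the stream
    stream = page.Contents.stream
    # Pull strings in round brackets on Tj/TJ lines inside BT ... ET
    # blocks; a crude approximation that misses escaped brackets,
    # TJ arrays, and non-trivial font encodings.
    for block in re.findall(r'BT(.*?)ET', stream, re.DOTALL):
        for text in re.findall(r'\((.*?)\)\s*T[jJ]', block):
            print(text)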
Here's an example that may be useful:
# Note: this example uses PyPDF2-style reader/writer objects
# (getPage/extractText/addPage), not pdfrw.
for pg_num in range(number_of_pages):
    pg_obj = pdfreader.getPage(pg_num)
    print(pg_num)
    if re.search(r'CSE', pg_obj.extractText()):
        cse_count += 1
        pdfwriter.addPage(pg_obj)
Here, extractText() extracts each page's text; pages containing the keyword CSE are counted and added to the writer.
