Insert Arabic text into an image using Matlab?

I'm using this Matlab script to write Arabic text into an image:
I = imread('test.jpg');
text_str = cell(3,1);
conf_val = [85.212 98.76 78.342];
str = char(['م','ا','ل','س']);
encoded_str = unicode2native(str, 'UTF-8');
position = [23 23];
RGB = insertText(I,position,str);
figure
imshow(RGB)
It shows '?' in the image instead of the Arabic letters.

This is because the default character set encoding for m-files is ANSI, so the Arabic literals typed in the script are not preserved.
Instead, you can use the corresponding numeric character codes (entities) for the required letters. These can be generated by running the following in the command window:
uint16(['م','ا','ل','س']); %Thanks to horchler
So, you can use:
I = imread('office_2.jpg'); %Using a built-in demo image
position = [23 23]; %As given in the question
str = char([1587 1604 1575 1605]); %Converted into the corresponding entities
rgb = insertText(I,position,str);
figure;
imshow(rgb);
Edit: The problem that you mentioned in the comment is reproducible in MATLAB R2015a. You can use AddTextToImage from the File Exchange. Download and add that to your path.
And then change rgb = insertText(I,position,str); to rgb = AddTextToImage(I,str,position); in the above code.
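The numeric codes used in the answer are the Unicode code points of the letters, and they can be cross-checked outside MATLAB. The following is a minimal Python sketch, not part of the original answer; the letters are written as escapes so the result does not depend on the encoding of the source file.

```python
# The four Arabic letters from the answer, written as Unicode escapes
# (seen, lam, alef, meem) so the file encoding cannot corrupt them.
word = "\u0633\u0644\u0627\u0645"

# ord() returns the code point of each character, which is the same
# number that uint16(...) reports in MATLAB.
code_points = [ord(ch) for ch in word]
print(code_points)  # [1587, 1604, 1575, 1605]

# Round trip: rebuilding the string from the numbers gives the same text,
# which mirrors char([1587 1604 1575 1605]) in the MATLAB code.
rebuilt = "".join(chr(cp) for cp in code_points)
print(rebuilt == word)  # True
```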


pdfminer: extract only text according to font size

I only want to extract text that has font size 9.800000000000068 and 10.000000000000057 from my pdf files.
The code below returns a list of the font size of each text block and its characters for one pdf file.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

Extract_Data=[]
for page_layout in extract_pages(path):
    print(page_layout)
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        Font_size=character.size
            Extract_Data.append([Font_size,(element.get_text())])
gives me an Extract_Data list with the various font sizes
[[9.800000000000068, 'aaa\n'], [11.0, 'dffg\n'], [10.000000000000057, 'bbb\n'], [10.0, 'hs\n'], [8.0, '2\n']]
example: font size 10.000000000000057
Extract_Data=[]
for page_layout in extract_pages(path):
    print(page_layout)
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        if character.size == '10.000000000000057':
                            element.get_text()
            Extract_Data.append(element.get_text())
Data = ''.join(map(str, Extract_Data))
This gives me a Data list with all of the text. How can I make it extract only the characters with font size 10.000000000000057?
['aaa\ndffg\nbbb\nhs\n2\n']
I also want to integrate this into a function that does the same for multiple files, resulting in a pandas df that has one row for each pdf.
Desired output: [['aaa\n bbb\n']]. Converting pixels to points (int(character.size) * 72 / 96), as suggested elsewhere, did not help. Maybe this has something to do with https://github.com/pdfminer/pdfminer.six/issues/202?
This is the function it would later be integrated into:
directory = 'C:/Users/Sample/'
resource_manager = PDFResourceManager()
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    fake_file_handle = io.StringIO()
    manager = PDFResourceManager()
    device = PDFPageAggregator(manager, laparams=params)
    interpreter = PDFPageInterpreter(manager, device)
    device = TextConverter(interpreter, fake_file_handle, laparams=LAParams())
    params = LAParams(detect_vertical=True, all_texts=True)
    elements = []
    with open(os.path.join(directory, file), 'rb') as fh:
        parser = PDFParser(fh)
        document = PDFDocument(parser, '')
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        for page in enumerate (PDFPage.create_pages(document)):
            for element in page:
Pdfminer is the wrong tool for that.
Use pdfplumber (https://github.com/jsvine/pdfplumber) instead. It uses pdfminer under the hood, but it has utility functions for filtering out objects (e.g. based on font size, as you're trying to do), whereas pdfminer is primarily for getting all text.
import pdfplumber

def get_filtered_text(file_to_parse: str) -> str:
    with pdfplumber.open(file_to_parse) as pdf:
        page = pdf.pages[0]  # first page only
        # Drop every char whose font size is not 9; keep all non-char objects
        clean_page = page.filter(lambda obj: not (obj["object_type"] == "char" and obj["size"] != 9))
        return clean_page.extract_text()

print(get_filtered_text("./my_pdf.pdf"))
The example above is simpler than your case because it only checks for font size 9.0, whereas you have
9.800000000000068 and 10.000000000000057,
so the obj["size"] condition will be more complex in your case.
obj["size"] has the datatype Decimal (from decimal import Decimal), so you will probably have to do something like obj["size"].compare(Decimal(9.80000000068)) == 0
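As a side note on the question's own code: in Python a float never compares equal to a string, so the condition character.size == '10.000000000000057' is always False and nothing is filtered. Below is a minimal sketch of the comparison with float literals, run on the sample rows from the question rather than on a real PDF. The helper name keep_target_sizes is hypothetical, and exact equality is used on purpose: a tolerance-based comparison would also match the 10.0 row, which the desired output excludes.

```python
# Sizes to keep, written as float literals (not strings).
TARGET_SIZES = {9.800000000000068, 10.000000000000057}

def keep_target_sizes(rows, targets=TARGET_SIZES):
    """Return the text of every [size, text] row whose size is in targets.

    Hypothetical helper for illustration; it is not part of pdfminer.
    """
    return [text for size, text in rows if size in targets]

# Sample rows copied from the Extract_Data output shown in the question.
extract_data = [
    [9.800000000000068, 'aaa\n'],
    [11.0, 'dffg\n'],
    [10.000000000000057, 'bbb\n'],
    [10.0, 'hs\n'],
    [8.0, '2\n'],
]

kept = keep_target_sizes(extract_data)
print(kept)  # ['aaa\n', 'bbb\n']
```

The same membership test could be applied to character.size inside the pdfminer loop, assuming the sizes reported for a given PDF are stable from run to run.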

PIL Drawing text and breaking lines at \n

Hi, I'm having trouble drawing longer texts onto an image when they should break at specific points: the \n is always printed literally instead of breaking the line. I can't find any info online on whether this is even possible, or whether it just puts the raw string on the image without checking for line breaks. The text is just some random stuff from a CSV.
def place_text(self,text,x,y):
    temp = self.csv_input[int(c)][count]
    font = ImageFont.truetype('arial.ttf', 35) # font, e.g. arial.ttf
    w_txt, h_txt = font.getsize(text)
    print("Jetzt sind wie in der zweiten möglichkeit")
    draw_text = ImageDraw.Draw(self.card[self.cardCount])
    draw_text.text((x, y), temp, fill="black", font=font, align="left")
Yeah, I know this code is kind of all over the place, but that shouldn't cause any issues for putting the text on the image, should it?
Writing text onto an image with this results in just one continuous line of text with the \n's still in it and no line breaks.
Found the answer: the string pulled from the CSV had to be decoded again before being placed.
text = bytes(text, 'utf-8').decode("unicode_escape")
did the trick
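One caveat with the unicode_escape approach is that it reads the bytes as Latin-1, so non-ASCII characters such as ö come out garbled. The sketch below shows a narrower alternative that only converts the literal backslash-n sequence, assuming that is the only escape present in the CSV text. The helper name unescape_newlines is hypothetical.

```python
def unescape_newlines(text: str) -> str:
    """Turn the two-character sequence backslash + n into a real newline."""
    return text.replace("\\n", "\n")

# What a CSV cell holding a literal \n looks like once read into Python.
raw = "first line\\nsecond line"
fixed = unescape_newlines(raw)

print(len(raw.splitlines()))    # 1
print(len(fixed.splitlines()))  # 2

# For comparison: unicode_escape changes non-ASCII characters.
garbled = bytes("möglich", "utf-8").decode("unicode_escape")
print(garbled == "möglich")  # False
```

The resulting string can then be passed to draw_text.text() as in the question.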

Highlight the text in python and save it in word file

I am trying to take the text from a Word file, highlight the required text, and then save the text into a new Word file.
I am able to highlight the text using ANSI escape sequences, but I am unable to add it back to the Word file.
from docx import Document
doc = Document('t.docx')
##string present in t.docx '''gnjdkgdf helloworld dnvjk dsfgdzfh jsdfKSf klasdfdf sdfvgzjcv'''
if 'helloworld' in doc.paragraphs[0].text:
    high=doc.paragraphs[0].text.replace('helloworld', '\033[43m{}\033[m'.format('helloworld'))
    doc.add_paragraph(high)
doc.save('t1.docx')
I am getting this error:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
Instead of using ANSI escape sequences, you could use python-docx's built-in Font highlight color:
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
doc = Document('t.docx')
##string present in t.docx '''gnjdkgdf helloworld dnvjk dsfgdzfh jsdfKSf klasdfdf sdfvgzjcv'''
# Get the first paragraph's text
p1_text = doc.paragraphs[0].text
# Create a new paragraph with "helloworld" highlighted
p2 = doc.add_paragraph()
substrings = p1_text.split('helloworld')
for substring in substrings[:-1]:
    p2.add_run(substring)
    font = p2.add_run('helloworld').font
    font.highlight_color = WD_COLOR_INDEX.YELLOW
p2.add_run(substrings[-1])
# Save document under new name
doc.save('t1.docx')
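The ValueError in the question is consistent with the ESC character (U+001B) that starts each ANSI sequence: XML 1.0 allows no control characters below U+0020 except tab, line feed and carriage return. The sketch below lists such characters in a string. The helper xml_incompatible_chars is hypothetical and is not part of python-docx.

```python
def xml_incompatible_chars(text: str) -> list:
    """List control characters below U+0020 that XML 1.0 does not allow.

    Tab, line feed and carriage return are the permitted exceptions.
    """
    allowed = {"\t", "\n", "\r"}
    return [ch for ch in text if ord(ch) < 0x20 and ch not in allowed]

# The ANSI-highlighted string built in the question.
ansi = '\033[43m{}\033[m'.format('helloworld')

bad = [hex(ord(ch)) for ch in xml_incompatible_chars(ansi)]
print(bad)  # ['0x1b', '0x1b']

# The plain word contains nothing problematic.
print(xml_incompatible_chars('helloworld'))  # []
```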

How to extract a PDF's text using pdfrw

Can pdfrw extract the text out of a document?
I was thinking something along the lines of
from pdfrw import PdfReader
doc = PdfReader(pdf_path)
page_texts = []
for page_nr in doc.numPages:
    page_texts.append(doc.getPage(page_nr).parse_page()) # ..or something
The docs explain how to extract the text. However, it's just a bytestream. You could iterate over the pages and decode them individually.
from pdfrw import PdfReader
doc = PdfReader(pdf_path)
for page in doc.pages:
    bytestream = page.Contents.stream # This is a string with bytes, not a bytestring
    # string = ...  somehow decode bytestream, maybe using zlib.decompress
    # do something with that text
Edit:
It may be worth noting that, according to the author, pdfrw does not yet support text decompression due to its complexity.
It depends on which filters are applied to page.Contents.stream. If it is only FlateDecode, you can use pdfrw.uncompress.uncompress([page.Contents]) to decode it.
Note: Pass the whole Contents object to the function, in a list.
Note: This is not the same as pdfrw.PdfReader.uncompress().
You then have to parse the string to find your text. It will be in blocks of lines between BT (begin text) and ET (end text) markers, inside round brackets on lines ending in either 'TJ' or 'Tj'.
Here's an example that may be useful:
for pg_num in range(number_of_pages):
    pg_obj = pdfreader.getPage(pg_num)
    print(pg_num)
    if re.search(r'CSE', pg_obj.extractText()):
        cse_count+= 1
        pdfwriter.addPage(pg_obj)
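Here extractText() extracts the text of each page, and pages containing the keyword CSE are counted and added to the writer. Note that getPage(), extractText() and addPage() look like the PyPDF2 API rather than pdfrw.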
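The parsing step described in the edit above (text in round brackets on Tj/TJ lines between BT and ET) can be sketched roughly as follows. This is a simplified illustration on a hand-written content stream, not pdfrw functionality: the function name extract_text_lines and the sample stream are made up, and the sketch ignores nested brackets, hex strings, font encodings and operators that share a line.

```python
import re

# One BT ... ET text object (non-greedy, spanning lines).
TEXT_OBJECT = re.compile(r"BT\b(.*?)\bET\b", re.DOTALL)
# A literal string in round brackets, allowing backslash escapes inside.
LITERAL = re.compile(r"\((?:\\.|[^\\()])*\)")

def extract_text_lines(content: str) -> list:
    """Collect the bracketed text from lines ending in Tj or TJ."""
    lines = []
    for block in TEXT_OBJECT.findall(content):
        for line in block.splitlines():
            line = line.strip()
            if line.endswith(("Tj", "TJ")):
                # Strip the surrounding brackets and join the pieces.
                parts = [m.group(0)[1:-1] for m in LITERAL.finditer(line)]
                lines.append("".join(parts))
    return lines

# Hand-written, already-decompressed content stream used only as sample input.
sample = """BT
/F1 12 Tf
72 712 Td
(Hello World) Tj
[(Hel) -20 (lo)] TJ
ET
"""

result = extract_text_lines(sample)
print(result)  # ['Hello World', 'Hello']
```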

Matlab sprintf incorrect result using random strings from list

I want to create a string variable using sprintf and a random name from a list (in order to save an image with such a name). A draft of the code is the following:
Names = [{'C'} {'CL'} {'SCL'} {'A'}];
nameroulette = ceil(rand(1)*4)
filename = sprintf('DG_%d.png', Names{1,nameroulette});
But when I check filename, what I get is the text I typed followed not by one of the strings, but by a number that I have no idea where it comes from. For example, if my nameroulette = 1 then filename is DG_67.png, and if nameroulette = 4, filename = 'DG_65.png' . Where does this number come from and how can I fix this problem?
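The number is the character code of the string: %d formats the char 'C' as its numeric value 67 (and 'A' as 65). To insert the string itself, use %s instead. You just need to change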
filename = sprintf('DG_%d.png', Names{1,nameroulette});
to
filename = sprintf('DG_%s.png', Names{1,nameroulette});
By the way, you may want to have a look at the randi function for drawing random integers, e.g. randi(numel(Names)).
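For readers more familiar with Python, the origin of the numbers can be illustrated with a rough analogue. This is only a sketch: Python needs an explicit ord() call to get the character code, whereas MATLAB treats a char as a numeric value when it is formatted with %d.

```python
name = "C"

# Formatting the character code reproduces the unexpected file name.
as_number = "DG_%d.png" % ord(name)
print(as_number)  # DG_67.png

# Formatting the string itself gives the intended file name.
as_text = "DG_%s.png" % name
print(as_text)  # DG_C.png

# 'A' has code 65, matching the DG_65.png case in the question.
print(ord("A"))  # 65
```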
