Highlight the text in python and save it in word file - python-3.x

I am trying to read the text from a Word file, highlight the required text, and save the result to a new Word file.
I am able to highlight the text using ANSI escape sequences, but I am unable to add it back to the Word file.
from docx import Document
doc = Document('t.docx')
##string present in t.docx '''gnjdkgdf helloworld dnvjk dsfgdzfh jsdfKSf klasdfdf sdfvgzjcv'''
if 'helloworld' in doc.paragraphs[0].text:
    high = doc.paragraphs[0].text.replace('helloworld', '\033[43m{}\033[m'.format('helloworld'))
    doc.add_paragraph(high)
doc.save('t1.docx')
I get this error:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
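The error arises because the ANSI sequence starts with the escape character `\033` (U+001B). This is a control character that XML 1.0 does not allow, and a .docx file stores its text as XML. A minimal sketch of that check, where `xml_incompatible_chars` is a hypothetical helper name and not part of python-docx:

```python
def xml_incompatible_chars(text):
    """Return the characters in text that fall outside the XML 1.0 character range."""
    def allowed(ch):
        cp = ord(ch)
        return (
            cp in (0x09, 0x0A, 0x0D)          # tab, line feed, carriage return
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )
    return [ch for ch in text if not allowed(ch)]

highlighted = '\033[43m{}\033[m'.format('helloworld')
print(xml_incompatible_chars(highlighted))    # the two escape characters
print(xml_incompatible_chars('helloworld'))   # nothing to report
```

ANSI sequences only carry meaning for a terminal, so removing the escape characters would make the string storable but would not highlight anything in Word.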

Instead of using ANSI escape sequences, you could use python-docx's built-in Font highlight color:
from docx import Document
from docx.enum.text import WD_COLOR_INDEX
doc = Document('t.docx')
##string present in t.docx '''gnjdkgdf helloworld dnvjk dsfgdzfh jsdfKSf klasdfdf sdfvgzjcv'''
# Get the first paragraph's text
p1_text = doc.paragraphs[0].text
# Create a new paragraph with "helloworld" highlighted
p2 = doc.add_paragraph()
substrings = p1_text.split('helloworld')
for substring in substrings[:-1]:
    p2.add_run(substring)
    font = p2.add_run('helloworld').font
    font.highlight_color = WD_COLOR_INDEX.YELLOW
p2.add_run(substrings[-1])
# Save document under new name
doc.save('t1.docx')

Related

PIL Drawing text and breaking lines at \n

Hi, I'm having trouble drawing longer texts onto an image when they should break at specific points. It always just prints the \n as part of the text without breaking the line, and I can't find any information online on whether this is even possible or whether the raw string is simply placed on the image without checking for line breaks. The text is just some random content from a CSV.
def place_text(self,text,x,y):
    temp = self.csv_input[int(c)][count]
    font = ImageFont.truetype('arial.ttf', 35) # font, e.g. arial.ttf
    w_txt, h_txt = font.getsize(text)
    print("Now we are in the second option")
    draw_text = ImageDraw.Draw(self.card[self.cardCount])
    draw_text.text((x, y), temp, fill="black", font=font, align="left")
Yeah, I know this code is kind of all over the place, but that shouldn't cause any issues for putting the text on the image, should it?
Writing text on an image this way results in just one line of continuous text with the \n's still in it and no line breaks.
Found the answer: the string pulled from the CSV had to be decoded again before being placed.
text = bytes(text, 'utf-8').decode("unicode_escape")
That did the trick.
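To make the fix concrete, here is a small sketch of what that decoding step does to a string holding a literal backslash followed by n. The `decode_escapes` name and the sample text are made up for illustration:

```python
def decode_escapes(text):
    """Interpret literal backslash escapes in text as the characters they stand for."""
    return bytes(text, 'utf-8').decode('unicode_escape')

# As it might arrive from the CSV: a backslash and the letter n, not a real newline.
raw = 'First line\\nSecond line'
decoded = decode_escapes(raw)

print(len(raw.splitlines()))      # one line before decoding
print(decoded.splitlines())       # two lines after decoding
```

One caveat: `unicode_escape` treats the bytes as Latin-1, so non-ASCII characters such as "ö" come out garbled. If the text contains such characters, a narrower `text.replace('\\n', '\n')` may be the safer choice.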

How to extract text from a PDF file using Python? I have never done this and can't get the DOM of the PDF file

This is my PDF file: "https://drive.google.com/open?id=1M9k1AO17ZSwT6HTrTrB-uz85ps3WL1wS"
Could someone help me extract its text? Searching on SO, I found hints about extracting text using these libraries and names: PyPDF2, PyPDF2.pdf, PageObject, u_, ContentStream, b_, TextStringObject, but I don't understand how to use them.
Please help me extract the text, with some explanation so I can understand the code, and tell me how to read the DOM of a PDF file.
You need to install some libraries:
pip install PyPDF2
pip install textract
pip install nltk
This will download the libraries you require to parse PDF documents and extract keywords. Make sure your PDF file is stored in the folder where you're writing your script.
Startup your favourite editor and type:
Note: All lines starting with # are comments.
Step 1: Import all libraries:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
Step 2: Read PDF File
# Write a for-loop to open many files -- leave a comment if you'd like to learn how
filename = 'enter the name of the file here'
# open() allows you to read the file
pdfFileObj = open(filename, 'rb')
# The pdfReader variable is a readable object that will be parsed
# (PdfFileReader, numPages, getPage and extractText are the older PyPDF2 names;
# PyPDF2 3.0 replaced them with PdfReader, len(reader.pages), reader.pages[i] and extract_text())
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
# The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# This check exists to see whether PyPDF2 returned any words. It's done because PyPDF2 cannot read scanned files.
# If no text came back, we run the OCR library textract to convert scanned/image-based PDF files into text
if text == "":
    # textract.process() returns bytes, so decode them to get a str
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
# Now we have a text variable which contains all the text derived from our PDF file. Type print(text) to see what it contains. It likely contains a lot of spaces, possibly junk such as '\n' etc.
# Now, we will clean our text variable, and return it as a list of keywords.
Step 3: Convert text into keywords
# The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)
# We'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',']
# We initialize the stopwords variable, which is a list of words like "The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
# We create a list comprehension which only returns a list of words that are NOT IN stop_words and NOT IN punctuations.
# The stop word list is lowercase, so each word is lowercased for the comparison.
keywords = [word for word in tokens if word.lower() not in stop_words and word not in punctuations]
Now you have keywords for your file stored as a list. You can do whatever you want with it. Store it in a spreadsheet if you want to make the PDF searchable, or parse a lot of files and conduct a cluster analysis. You can also use it to create a recommender system for resumes for jobs ;)
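The filtering in Step 3 can be tried without installing anything. The sketch below is a dependency-free stand-in, not the nltk code itself: the regex tokenizer is a rough substitute for word_tokenize(), and `STOP_WORDS` is a tiny hand-written set standing in for stopwords.words('english'). The `extract_keywords` name and the sample sentence are made up for illustration.

```python
import re

# Tiny hand-written stop word set, standing in for nltk's English stop word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "i"}
# Same punctuation list as in Step 3 (note that '.' is not in it).
PUNCTUATIONS = {"(", ")", ";", ":", "[", "]", ","}

def extract_keywords(text):
    # Split into runs of word characters or single punctuation marks,
    # a rough substitute for nltk's word_tokenize().
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and t not in PUNCTUATIONS]

print(extract_keywords("The invoice (draft) is due in March, and it is unpaid."))
```

The final "." survives because the punctuation list above does not include it. Extending that list is the simplest way to drop more symbols.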

Insert arabic text into an image using Matlab?

I'm using this Matlab script to write Arabic text into an image:
I = imread('test.jpg');
text_str = cell(3,1);
conf_val = [85.212 98.76 78.342];
str = char(['م','ا','ل','س']);
encoded_str = unicode2native(str, 'UTF-8');
position = [23 23];
RGB = insertText(I,position,str);
figure
imshow(RGB)
It shows '?' in the image instead of the Arabic letters.
This is because the default character set encoding for m-files is ANSI.
So you can use the corresponding entities (Unicode code points) for the required letters. They can be generated by running the following in the command window:
uint16(['م','ا','ل','س']); %Thanks to horchler
So, you can use:
I = imread('office_2.jpg'); %Using a built-in demo image
position = [23 23]; %As given in the question
str = char([1587 1604 1575 1605]); %Converted into the corresponding entities
rgb = insertText(I,position,str);
figure;
imshow(rgb);
Edit: The problem that you mentioned in the comment is reproducible in MATLAB R2015a. You can use AddTextToImage from the File Exchange. Download and add that to your path.
And then change rgb = insertText(I,position,str); to rgb = AddTextToImage(I,str,position); in the above code.
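The numbers passed to char() are the Unicode code points of the four letters. As a quick cross-check outside MATLAB, the following Python sketch (written with escapes so the source stays plain ASCII) reproduces the same values and converts them back:

```python
# The four Arabic letters from the answer, written as Unicode escapes.
word = "\u0633\u0644\u0627\u0645"

code_points = [ord(ch) for ch in word]
print(code_points)

# Converting the numbers back yields the original string.
restored = "".join(chr(cp) for cp in code_points)
print(restored == word)
```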

Hexadecimal byte string conversion to unicode in python

I read a string that contains many special characters from a text file in 'rb' mode and store it in a variable. But when I insert it into a text box in tkinter, some characters go missing.
inp = tut.encrypt_text(tut.getKey(password.get()),text.get("1.0", END))
The code passes a string to the encrypting function, which encrypts the text and returns the encrypted text as a hex byte string.
outfile.write(encryptor.encrypt(chunk))
This code writes the encrypted text to a file; the same data is then read back from the file and returned to the call made above.
with open(outputFile, 'rb') as out:
    output = out.read()
return output
The encrypted text in file is:
0000000000000011P”=êäS£¾1:–ø{pâRÆA<,ã É˜“Uê
Some characters are already missing here. When I read it in 'rb' mode, store it in the variable output, and return it to the call made above, the variable has the value
b'0000000000000011P\x94=\xea\xe4S\x0f\xa3\xbe1\x18:\x96\xf8{p\xe2R\xc6A<,\xe3\xa0\xc9\x8d\x98\x93\x04\x08U\xea'
But when I insert it in the textbox using the code
text.delete('1.0', END)
text.insert(END, inp)
The text that is being printed in the textbox is
0000000000000011P=êäS£¾1:ø{pâRÆA<,ã ÉUê.
This text is different from the one which was in the file. How can I get the same string which is in the text file in the textbox?
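A likely reason for the difference is that some of the bytes are control codes (below 0x20, or in the range 0x7F to 0x9F), which have no visible glyph in a text widget. The sketch below uses a shortened sample of the bytes shown above. It illustrates two lossless representations, a hex string and a Latin-1 decode, and lists the bytes that would not be visible. It is an illustration under that assumption, not a confirmed diagnosis of the original program.

```python
# Shortened sample of the encrypted bytes from the question.
data = b'0011P\x94=\xea\xe4S\x0f\xa3\xbe1\x18:'

# Option 1: a hex string shows every byte and converts back exactly.
as_hex = data.hex()
round_trip_hex = bytes.fromhex(as_hex)

# Option 2: Latin-1 maps each byte to one code point, so it also round-trips,
# but control codes still will not be visible on screen.
as_text = data.decode('latin-1')
round_trip_text = as_text.encode('latin-1')

# Bytes that have no printable glyph.
invisible = [b for b in data if b < 0x20 or 0x7F <= b <= 0x9F]

print(as_hex)
print([hex(b) for b in invisible])
```

Showing `as_hex` in the text box keeps every byte visible, and `bytes.fromhex()` recovers the original data when it is needed again.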

How to extract a PDF's text using pdfrw

Can pdfrw extract the text out of a document?
I was thinking something along the lines of
from pdfrw import PdfReader
doc = PdfReader(pdf_path)
page_texts = []
for page_nr in doc.numPages:
    page_texts.append(doc.getPage(page_nr).parse_page()) # ..or something
The docs explain how to extract the text. However, it's just a bytestream. You could iterate over the pages and decode them individually.
from pdfrw import PdfReader
doc = PdfReader(pdf_path)
for page in doc.pages:
    bytestream = page.Contents.stream # This is a string with bytes, not a bytestring
    # somehow decode bytestream here, maybe using zlib.decompress
    # then do something with that text
Edit:
It may be worth noting that, according to the author, pdfrw does not yet support text decompression due to its complexity.
It depends on which filters are applied to page.Contents.stream. If it is only FlateDecode, you can use pdfrw.uncompress.uncompress([page.Contents]) to decode it.
Note: Pass the whole Contents object to the function, in a list.
Note: This is not the same as pdfrw.PdfReader.uncompress().
You then have to parse the string to find your text. It will be in blocks of lines between BT (begin text) and ET (end text) markers, inside round brackets on lines ending in either 'TJ' or 'Tj'.
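The parsing step described above could be sketched as follows, working on an already decompressed content stream held in a plain string. The `extract_shown_text` name and the sample stream are made up for illustration. The sketch assumes one operator per line, does not handle nested unescaped parentheses, and leaves escape sequences undecoded. Fonts with custom encodings may also yield glyph codes rather than readable text.

```python
import re

# A PDF literal string such as (Hello) or (a \( b); the group captures its content.
LITERAL = re.compile(r"\(((?:\\.|[^\\()])*)\)")

def extract_shown_text(content):
    """Collect the bracketed strings on Tj/TJ lines inside BT ... ET blocks."""
    shown = []
    for block in re.findall(r"\bBT\b(.*?)\bET\b", content, flags=re.DOTALL):
        for line in block.splitlines():
            line = line.strip()
            if line.endswith(("Tj", "TJ")):
                # A TJ array holds several strings separated by kerning numbers.
                shown.append("".join(LITERAL.findall(line)))
    return shown

sample = """BT
/F1 12 Tf
72 720 Td
(Hello, world) Tj
0 -14 Td
[(Hel) -20 (lo) 15 ( again)] TJ
ET"""

print(extract_shown_text(sample))
```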
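Here's an example that may be useful:
import re
# pdfreader, pdfwriter, number_of_pages and cse_count are assumed to be set up earlier.
# getPage(), extractText() and addPage() are PyPDF2 method names, not pdfrw ones.
for pg_num in range(number_of_pages):
    pg_obj = pdfreader.getPage(pg_num)
    print(pg_num)
    if re.search(r'CSE', pg_obj.extractText()):
        cse_count += 1
        pdfwriter.addPage(pg_obj)
Here extractText() extracts the text of each page, and the pages containing the keyword CSE are counted and added to the writer.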
