Extract text from a pdf using coordinates in python - python-3.x

I have a PDF file containing text and tables. I want to extract text from a region of interest (ROI).
I used pdfplumber to get the desired starting and ending coordinates. Then I tried to crop the PDF between these coordinates and extract the text, but it did not work: although the cropped PDF shows only the text from the ROI, the page's content stream apparently still holds everything on that page. As a result, extraction returns the text of the whole page of the original PDF. I don't want to convert the cropped PDF into an image and run OCR on it because of the risk of inaccuracy. Any help on how to extract text using those coordinates is very much appreciated. Thanks in advance.
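One possible approach, sketched here under assumptions: pdfplumber can extract text from a cropped page object directly, so there is no need to write a cropped PDF first. The function names `clamp_bbox` and `extract_text_in_bbox` are hypothetical helpers, and the bounding box is assumed to use pdfplumber's `(x0, top, x1, bottom)` order. The clamping step is there because `page.crop` rejects, by default, a box that extends outside the page.

```python
def clamp_bbox(bbox, page_bbox):
    """Clip (x0, top, x1, bottom) so that it lies inside the page's own bbox."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    return (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))


def extract_text_in_bbox(pdf_path, page_index, bbox):
    """Return only the text that falls inside bbox on one page (hypothetical helper)."""
    import pdfplumber  # third-party; imported here so clamp_bbox works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # crop() returns a page-like object restricted to the given region
        region = page.crop(clamp_bbox(bbox, page.bbox))
        # extract_text() may return None on some versions when nothing is found
        return region.extract_text() or ""
```

A call such as `extract_text_in_bbox("input.pdf", 0, (50, 100, 400, 300))` would then return the text of that region on the first page; the file name and coordinates here are placeholders.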

Related

get colored text in pdf file

I have a PDF that contains some MCQ questions, with the right answer colored and underlined.
I want to extract all the answers from the PDF and put them on the last page.
I use PyPDF2 to copy the text from the PDF into a text.txt file.
Now I want to know how to get the colored or underlined words from the PDF with Python,
so that I can put them in a list and do what I want with them.
What can I use to do that?
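A minimal sketch of one way to do this, using pdfplumber rather than PyPDF2 (a swapped-in library, not what the question used). pdfplumber exposes a `non_stroking_color` value on each character, which is the fill color of the glyph. The helper names and the set of values treated as black are assumptions, and color values can differ between PDFs (grayscale, RGB, or CMYK). Underlines are usually drawn as separate line or rectangle objects and are not handled here.

```python
# Fill colors treated as plain black: gray 0, RGB (0, 0, 0), CMYK (0, 0, 0, 1).
BLACK_VALUES = {(0,), (0, 0, 0), (0, 0, 0, 1)}


def is_colored(color):
    """True if a pdfplumber color value is something other than black."""
    if color is None:
        return False
    if isinstance(color, (int, float)):
        color = (color,)
    return tuple(color) not in BLACK_VALUES


def colored_runs(chars):
    """Join consecutive colored characters into strings (hypothetical helper)."""
    runs, current = [], []
    for ch in chars:
        if is_colored(ch.get("non_stroking_color")):
            current.append(ch["text"])
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


def colored_text_on_page(pdf_path, page_index):
    """Collect the colored runs of one page; requires pdfplumber."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        return colored_runs(pdf.pages[page_index].chars)
```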

extract specific text from image by using easyocr

I use EasyOCR to extract text from images, and it works well for me. However, I need to remove shaded numbers from the extracted result: if text is shaded, it should be dropped.
Any help? For example, in the following picture, I need to delete "2586" and extract only "2574".

Reportlab - Truncating text after specific width while writing on canvas

I am working on a project that has a PDF editor built in Angular. Users can drag and drop dynamic (user-defined) fields onto the PDF. When these fields are filled in from a user-defined form, the corresponding PDF (the one the user has edited) is generated in the backend. We are using Python 3.7 and Reportlab to edit the PDF and write dynamic data to it.
The length of a dynamic field's value is not fixed, and it can be longer than the field allows. In the PDF editor, the user decides the max width of the generated dynamic text. We want to write the text only up to the specified width and truncate the rest. For example, first_name is the variable dropped on the PDF editor, and its max width is set to 10px. If the value of the variable is "Wolfeschlegelsteinhausenbergerdorff", the text should be written up to 10 px and truncated after that.
So far, we have managed to write the full text (irrespective of the specified width). The following is the code that we are using.
........
paragraph_style.textColor = HexColor(pdf_element["font_color"])
paragraph_style.fontSize = pdf_element["font_size"]
paragraph = Paragraph(str(output_text_value), style=paragraph_style)
paragraph.drawOn(can,location_x, location_y)
........
The above code writes the full text on the PDF. However, we need a way to truncate the text after the specified width.
Any help is greatly appreciated.
Instead of drawing the paragraph directly on the canvas, wrap it in a KeepInFrame. Add the paragraph to the frame, apply truncate mode to it, and then draw the frame on the canvas. Your code should look like the following:
frame = KeepInFrame(min_width, min_height, [paragraph], mode='truncate')
frame.width = float(min_.replace("px",""))
frame.drawOn(can, location_x, location_y)
Hope it helps. Thanks
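As a different technique from the KeepInFrame answer above, here is a minimal sketch that cuts the string itself before drawing, by measuring it. `truncate_to_width` and `truncate_for_font` are hypothetical helpers; `stringWidth` is Reportlab's font-metrics function and returns a width in points, so a limit given in px is assumed to have been converted to points already.

```python
def truncate_to_width(text, max_width, measure):
    """Return the longest prefix of text whose measured width fits in max_width.

    measure is any callable that maps a string to its rendered width.
    """
    for end in range(len(text), 0, -1):
        if measure(text[:end]) <= max_width:
            return text[:end]
    return ""


def truncate_for_font(text, max_width, font_name, font_size):
    """Truncate using Reportlab's metrics for a registered font (requires reportlab)."""
    from reportlab.pdfbase.pdfmetrics import stringWidth  # imported lazily

    return truncate_to_width(
        text, max_width, lambda s: stringWidth(s, font_name, font_size)
    )
```

The truncated string can then be passed to the Paragraph in place of the full value, for example `truncate_for_font(str(output_text_value), 60, "Helvetica", 10)`; the width, font name, and size here are placeholders.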

How to remove rectangle shapes from image, keeping text, in Python3?

I am trying to extract the text from flowcharts and decision trees. If I use the image with original boxes/shapes, the text region detection is poor. Is there any way to remove these shapes (keeping the text)?
You could use OpenCV's connectedComponentsWithStats(): the chart lines will form a single connected component, so you can simply remove that component from the image.
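A rough sketch of that idea, assuming a grayscale image with dark ink on a light background. The helper names and the size threshold are assumptions: any component whose bounding box is larger than a typical character is treated as a shape and painted white. Text that touches a box belongs to the same component and would be removed with it.

```python
def is_shape_component(width, height, max_char_side=40):
    """Heuristic: a component bigger than one character is a box or connector."""
    return width > max_char_side or height > max_char_side


def remove_shapes(gray, max_char_side=40):
    """Return a copy of a grayscale image with large components painted white.

    gray is an 8-bit single-channel array, e.g. from
    cv2.imread(path, cv2.IMREAD_GRAYSCALE).
    """
    import cv2  # third-party (opencv-python); imported lazily

    # Invert while thresholding so that ink becomes foreground (255).
    _, ink = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    count, labels, stats, _ = cv2.connectedComponentsWithStats(ink, connectivity=8)

    cleaned = gray.copy()
    for label in range(1, count):  # label 0 is the background
        w = stats[label, cv2.CC_STAT_WIDTH]
        h = stats[label, cv2.CC_STAT_HEIGHT]
        if is_shape_component(w, h, max_char_side):
            cleaned[labels == label] = 255  # erase this component
    return cleaned
```

The threshold of 40 pixels is only a starting point and would need tuning to the resolution and font size of the actual images.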

round text in svg in pdf

I am trying to send an SVG file in a PDF. The PDF has some rounded/transformed text (the company name).
The PDF gets created, but the rounded text appears straight in the output PDF instead of rounded. I have tried both DomPDF and TCPDF to create the PDF.
Here is my SVG
