I have a PDF containing multiple-choice questions in which the correct answer is colored and underlined.
I want to extract all the answers from the PDF and put them on the last page.
I use PyPDF2 to transfer the text from the PDF to a text.txt file.
Now I want to know how to get the colored or underlined words from the PDF with Python,
so that I can put them in a list and do what I want with them.
What can I use to do that?
I use EasyOCR to extract text from images, and it works well for me. However, I need to remove shaded numbers from the extracted text: if a piece of text is shaded, it should be erased.
Any help? For example, in the following picture, I need to delete "2586" and extract only "2574".
I am working on a project that has a PDF editor we have built in Angular. Users can drag and drop dynamic (user-defined) fields onto the PDF. When these fields are filled in from a user-defined form, the corresponding PDF (the one the user edited) is generated in the backend. We are using Python 3.7 and ReportLab to edit and write dynamic data to the PDF.
The length of a dynamic field's value is not fixed and can exceed the length of the field variable. In the PDF editor, the user decides the maximum width of the generated dynamic text. We want to write the text only up to the specified width and truncate the rest. For example, first_name is a variable dropped onto the PDF editor, and its max width is set to 10px. If the value of the variable is "Wolfeschlegelsteinhausenbergerdorff", the text should be written up to 10px and truncated after that.
So far, we have managed to write the full text (irrespective of the specified width). The following is the code that we are using.
........
paragraph_style.textColor = HexColor(pdf_element["font_color"])
paragraph_style.fontSize = pdf_element["font_size"]
paragraph = Paragraph(str(output_text_value), style=paragraph_style)
paragraph.drawOn(can,location_x, location_y)
........
The above code writes the full text to the PDF. However, we need a way to truncate the text after the specified width.
Any help is greatly appreciated.
Instead of drawing the paragraph directly on the canvas, wrap it in a KeepInFrame flowable, set its mode to 'truncate', and then draw that flowable on the canvas. Your code should look like the following:
from reportlab.platypus import KeepInFrame
frame = KeepInFrame(min_width, min_height, [paragraph], mode='truncate')
frame.width = float(min_.replace("px",""))
frame.drawOn(can, location_x, location_y)
Hope it helps. Thanks
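As a different approach from KeepInFrame, the string can also be cut down before it is drawn. The sketch below is only an illustration: truncate_to_width and fixed_width_measure are hypothetical names, and the fixed 6-units-per-character measure is a stand-in so the example runs without any library. It assumes the measured width never shrinks as the prefix grows.

```python
from typing import Callable


def truncate_to_width(text: str, max_width: float,
                      measure: Callable[[str], float]) -> str:
    """Return the longest prefix of `text` whose measured width fits in `max_width`.

    `measure` is a caller-supplied function mapping a string to its rendered
    width. Binary search is valid only if that width is non-decreasing as the
    prefix gets longer (an assumption of this sketch).
    """
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2          # try the longer half first
        if measure(text[:mid]) <= max_width:
            lo = mid                      # prefix of length mid still fits
        else:
            hi = mid - 1                  # too wide, shrink
    return text[:lo]


def fixed_width_measure(s: str) -> float:
    """Hypothetical stand-in: every character is 6 units wide."""
    return 6.0 * len(s)


truncated = truncate_to_width(
    "Wolfeschlegelsteinhausenbergerdorff", 60.0, fixed_width_measure
)
print(truncated)
```

With ReportLab, `measure` could presumably wrap `reportlab.pdfbase.pdfmetrics.stringWidth(text, font_name, font_size)`, which returns the width in points, so a width given in px by the editor would need converting first.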
I am trying to extract the text from flowcharts and decision trees. If I use the image with the original boxes/shapes, text region detection is poor. Is there any way to remove these shapes while keeping the text?
You could use OpenCV's connectedComponentsWithStats(): the chart lines will form a single connected component, so you can just remove that component from the image.
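To illustrate the idea without depending on OpenCV, here is a minimal pure-Python sketch on a tiny binary grid. The function names are hypothetical, and it assumes that the chart lines are the largest connected component and that the text does not touch them; on a real image, connectedComponentsWithStats() would supply the labels and per-component areas instead.

```python
from collections import deque


def label_components(grid):
    """Label 4-connected foreground (1) regions; return (labels, areas)."""
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]   # 0 means background / unlabelled
    areas = {}                             # label -> pixel count
    next_label = 0
    for y in range(h):
        for x in range(w):
            if grid[y][x] and not labels[y][x]:
                next_label += 1
                labels[y][x] = next_label
                queue = deque([(y, x)])
                area = 0
                while queue:               # breadth-first flood fill
                    cy, cx = queue.popleft()
                    area += 1
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and grid[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                areas[next_label] = area
    return labels, areas


def remove_largest_component(grid):
    """Return a copy of `grid` with its largest connected component erased."""
    labels, areas = label_components(grid)
    if not areas:
        return [row[:] for row in grid]
    biggest = max(areas, key=areas.get)    # assumed to be the chart lines
    return [[1 if grid[y][x] and labels[y][x] != biggest else 0
             for x in range(len(grid[0]))]
            for y in range(len(grid))]


# A box outline (the "shape") with two text pixels inside it.
image = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
text_only = remove_largest_component(image)
for row in text_only:
    print(row)
```

If the text touches a box edge it ends up in the same component as the lines and is removed with them, so some preprocessing may be needed in that case.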
I am trying to embed an SVG file in a PDF. The PDF has some rounded/transformed text (the company name).
The problem is that the PDF gets created, but the rounded text appears straight in the output PDF instead of rounded. I have tried both DomPDF and TCPDF to create the PDF.
Here is my SVG