Improve resolution with ImageMagick

I'm converting PDFs into JPGs, but the quality decreases so much that I can't extract text out of these images. I know there is a resolution setting, but where do I need to place it in this code?
def pdf_to_image(pdf_files):
    for pdf_file in pdf_files:
        pdf = wi(filename=pdf_file)
        converted = pdf.convert("jpg")
        base_file_name, _ = os.path.splitext(pdf_file)
        i = 1
        for img in converted.sequence:
            page = wi(image=img)
            page_file_name = f"{base_file_name}_{i}.jpg"
            page.save(filename=page_file_name)
            i = i + 1
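A minimal sketch of where the resolution goes, assuming `wi` is `wand.image.Image` (the usual alias): the DPI must be passed to the constructor, i.e. when the PDF is first opened, because that is when the pages are rasterized.

```python
import os

def pdf_to_image(pdf_files, dpi=300):
    # Assumption: ImageMagick and the Wand package are installed.
    from wand.image import Image as wi
    for pdf_file in pdf_files:
        # Wand rasterizes the PDF at 72 DPI by default; 300 DPI is
        # usually enough for OCR, so set it at open time.
        pdf = wi(filename=pdf_file, resolution=dpi)
        converted = pdf.convert("jpg")
        base_name, _ = os.path.splitext(pdf_file)
        for i, img in enumerate(converted.sequence, start=1):
            page = wi(image=img)
            page.save(filename=f"{base_name}_{i}.jpg")
```

Setting `resolution` on an already-opened image does not help, since the low-resolution raster has already been produced at that point.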


pdfminer: extract only text according to font size

I only want to extract text that has font size 9.800000000000068 and 10.000000000000057 from my pdf files.
The code below returns a list of the font size of each text block and its characters for one pdf file.
Extract_Data = []
for page_layout in extract_pages(path):
    print(page_layout)
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        Font_size = character.size
                        Extract_Data.append([Font_size, element.get_text()])
This gives me an Extract_Data list with the various font sizes:
[[9.800000000000068, 'aaa\n'], [11.0, 'dffg\n'], [10.000000000000057, 'bbb\n'], [10.0, 'hs\n'], [8.0, '2\n']]
example: font size 10.000000000000057
Extract_Data = []
for page_layout in extract_pages(path):
    print(page_layout)
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        if character.size == '10.000000000000057':
                            Extract_Data.append(element.get_text())
Data = ''.join(map(str, Extract_Data))
This gives me a Data list with all of the text. How can I make it extract only characters with font size 10.000000000000057?
['aaa\ndffg\nbbb\nhs\n2\n']
I also want to integrate this into a function that does this for multiple files, resulting in a pandas DataFrame with one row for each PDF.
Desired output: [['aaa\n bbb\n']]. Converting pixels to points (int(character.size) * 72 / 96) as suggested elsewhere did not help. Maybe this has something to do with https://github.com/pdfminer/pdfminer.six/issues/202 ?
This is the function it would be integrated into later on:
directory = 'C:/Users/Sample/'
resource_manager = PDFResourceManager()
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    fake_file_handle = io.StringIO()
    manager = PDFResourceManager()
    params = LAParams(detect_vertical=True, all_texts=True)
    device = PDFPageAggregator(manager, laparams=params)
    interpreter = PDFPageInterpreter(manager, device)
    elements = []
    with open(os.path.join(directory, file), 'rb') as fh:
        parser = PDFParser(fh)
        document = PDFDocument(parser, '')
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        for page in PDFPage.create_pages(document):
            for element in page:
Pdfminer is the wrong tool for that.
Use pdfplumber (which uses pdfminer under the hood) instead (https://github.com/jsvine/pdfplumber), because it has utility functions for filtering out objects (e.g. based on font size, as you're trying to do), whereas pdfminer is primarily for getting all text.
import pdfplumber

def get_filtered_text(file_to_parse: str) -> str:
    with pdfplumber.open(file_to_parse) as pdf:
        text = pdf.pages[0]
        clean_text = text.filter(lambda obj: not (obj["object_type"] == "char" and obj["size"] != 9))
        print(clean_text.extract_text())

get_filtered_text("./my_pdf.pdf")
The example I've shown above is easier than yours because it just checks for font size 9.0, while you have 9.800000000000068 and 10.000000000000057, so the obj["size"] condition will be more complex in your case.
obj["size"] has the datatype Decimal (from decimal import Decimal), so you will probably have to do something like obj["size"].compare(Decimal(9.800000000000068)) == 0
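Since sizes like 9.800000000000068 are just float noise around 9.8, comparing with a tolerance is more robust than exact equality. A hypothetical sketch (the `keep_char` helper, the target sizes, and the path are assumptions, not part of pdfplumber's API):

```python
import math

# Assumed target sizes, matched with a small absolute tolerance.
TARGET_SIZES = (9.8, 10.0)

def keep_char(obj, targets=TARGET_SIZES, tol=0.01):
    """Keep every non-char object; keep chars whose size is near a target."""
    if obj.get("object_type") != "char":
        return True
    return any(math.isclose(float(obj["size"]), t, abs_tol=tol) for t in targets)

def get_filtered_text(path):
    import pdfplumber  # assumption: pdfplumber installed
    with pdfplumber.open(path) as pdf:
        return "\n".join(p.filter(keep_char).extract_text() or "" for p in pdf.pages)
```

Wrapping `get_filtered_text` in a loop over `os.listdir` and collecting one string per file would then give the rows for the desired one-row-per-PDF DataFrame.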

How can I save multiple images into separate cells in a CSV?

I have an issue with saving my data, which are converted to NumPy arrays.
I want to save each image in one cell of a CSV. With the writerows function it says the argument should be iterable, and when I use the writerow function it saves all of the images without any separation.
for each_image in raw_data:
    image_file = Image.open(each_image)
    each_file_path = image_file.filename
    image = cv2.imread(each_file_path)
    grey_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    image_from_array = Image.fromarray(grey_image)
    width, height = image_from_array.size
    format = image_from_array.format
    mode = image_from_array.mode
    img_grey = image_from_array.convert('L')
    value = np.asarray(img_grey.getdata(), dtype=int).reshape((img_grey.size[1], img_grey.size[0]))
    value = value.flatten()
    print(value)
This is my code; any help with saving each image separately in the CSV would be appreciated!
Just use newline='' when you open your CSV file.
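For the separation itself: writerow() writes exactly one row per call, while writerows() expects an iterable of rows. So calling writerow() once per flattened image puts each image on its own CSV row. A sketch under that assumption (save_images_to_csv is a hypothetical helper):

```python
import csv
import numpy as np

def save_images_to_csv(flattened_images, csv_path):
    # newline="" prevents the extra blank lines csv.writer otherwise
    # produces on Windows.
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for value in flattened_images:
            # One writerow() call per image -> one CSV row per image.
            writer.writerow(value.tolist())

images = [np.zeros((2, 2), dtype=int).flatten(),
          np.ones((2, 2), dtype=int).flatten()]
save_images_to_csv(images, "images.csv")
```

If one row per image is too wide for your tooling, saving each array with np.save and storing only the file path in the CSV is a common alternative.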

Unpack binary file contents; modify value; then pack contents to new binary file

I have a binary file and limited knowledge of its structure. I'd like to unpack the contents of the file, change one value, and then re-pack the modified contents into a new binary file. If I can complete the unpacking successfully, I can certainly modify one of the values, and I believe I will then be able to handle the re-packing to create a new binary file. However, I am having trouble completing the unpacking. This is what I have so far:
image = None
one = two = three = four = five = 0
with open(my_file, 'rb') as fil:
    one = struct.unpack('i', fil.read(4))[0]
    two = struct.unpack('i', fil.read(4))[0]
    three = struct.unpack('d', fil.read(8))[0]
    four = struct.unpack('d', fil.read(8))[0]
    five = struct.unpack('iiii', fil.read(16))
    image = fil.read(920)
When I set a breakpoint below the section of code displayed above, I can see that the type of the image variable above is <class 'bytes'>. The type of fil is <class 'io.BufferedReader'>. How can I unpack the data in this image variable?
The recommendation from @Stanislav led me directly to the solution to this problem. Ultimately, I did not need struct unpack/pack to reach my goal. The code below roughly illustrates the solution.
with open(my_file, 'rb') as fil:
    data = bytearray(fil.read())
mylist = list(data)
mylist[8] = mylist[8] + 2  # modify some fields
mylist[9] = mylist[9] + 2
mylist[16] = mylist[16] + 3
data = bytearray(mylist)
another_file = open("other_file.bin", "wb")
another_file.write(data)
another_file.close()
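The bytes → list → bytearray round-trip above is unnecessary: a bytearray already supports in-place item assignment. A sketch of the same idea (patch_file and the demo file names are hypothetical; each resulting byte must stay in 0..255, so the deltas here wrap modulo 256):

```python
def patch_file(src, dst, edits):
    """edits: {byte_offset: delta}; applies deltas and writes a new file."""
    with open(src, "rb") as f:
        data = bytearray(f.read())
    for offset, delta in edits.items():
        # bytearray items are ints in 0..255; wrap instead of raising
        # ValueError on overflow.
        data[offset] = (data[offset] + delta) % 256
    with open(dst, "wb") as f:
        f.write(data)

# Demo: create a 32-byte file where byte i has value i, then bump
# bytes 8, 9 and 16 as in the answer above.
with open("demo.bin", "wb") as f:
    f.write(bytes(range(32)))
patch_file("demo.bin", "patched.bin", {8: 2, 9: 2, 16: 3})
```

For the `image` variable itself: it is already a bytes object, so slicing it (or wrapping it in a bytearray) is enough; struct is only needed when the bytes encode multi-byte numbers you want as Python ints or floats.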

Excluding the Header and Footer Contents of a page of a PDF file while extracting text?

Is it possible to exclude the contents of footers and headers of a page from a PDF file while extracting text from it? These contents are the least important and almost redundant.
Note: For extracting the text from the .pdf file, I am using the PyPDF2 package on Python version 3.7.
How can I exclude the contents of the footers and headers in PyPDF2? Any help is appreciated.
The code snippet is as follows:
import PyPDF2

def Read(startPage, endPage):
    global text
    text = []
    cleanText = " "
    pdfFileObj = open('C:\\Users\\Rocky\\Desktop\\req\\req\\0000 - gamma j.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    print(num_pages)
    while startPage <= endPage:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.strip().split()
    print(text)

Read(1, 1)
As there are no features provided by PyPDF2 officially, I've written a function of my own to exclude the headers and footers on a PDF page, and it is working fine for my use case. You can add your own regex patterns to the page_format_pattern variable. Here I'm checking only the first and last elements of my text list.
You can run this function for each page.
def remove_header_footer(self, pdf_extracted_text):
    page_format_pattern = r'([page]+[\d]+)'
    pdf_extracted_text = pdf_extracted_text.lower().split("\n")
    header = pdf_extracted_text[0].strip()
    footer = pdf_extracted_text[-1].strip()
    if re.search(page_format_pattern, header) or header.isnumeric():
        pdf_extracted_text = pdf_extracted_text[1:]
    if re.search(page_format_pattern, footer) or footer.isnumeric():
        pdf_extracted_text = pdf_extracted_text[:-1]
    pdf_extracted_text = "\n".join(pdf_extracted_text)
    return pdf_extracted_text
Hope you find this helpful.
At the moment, PyPDF2 does not offer this. It's also unclear how to do it well, as headers and footers are not semantically represented within the PDF.
As a heuristic, you could search for duplicates at the top/bottom of the extracted text of pages. That would likely work well for long documents and not at all for one-page documents.
You need to consider that the first few pages might have no header or a different header than the rest. Also, there can be differences between chapters and even/odd pages.
(Side note: I'm the maintainer of PyPDF2, and I think this would be awesome to have.)
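The duplicate heuristic described above can be sketched as follows. This is a hypothetical helper, not part of PyPDF2: it takes already-extracted page texts and drops a first or last line that repeats across enough pages.

```python
from collections import Counter

def strip_repeated_lines(pages, min_ratio=0.5):
    """pages: list of extracted page texts; drops first/last lines that
    recur on at least min_ratio of the pages (minimum 2 pages)."""
    tops = Counter(p.splitlines()[0].strip() for p in pages if p.splitlines())
    bottoms = Counter(p.splitlines()[-1].strip() for p in pages if p.splitlines())
    cutoff = max(2, int(len(pages) * min_ratio))
    cleaned = []
    for p in pages:
        lines = p.splitlines()
        if lines and tops[lines[0].strip()] >= cutoff:
            lines = lines[1:]   # repeated header line
        if lines and bottoms[lines[-1].strip()] >= cutoff:
            lines = lines[:-1]  # repeated footer line
        cleaned.append("\n".join(lines))
    return cleaned
```

As noted above, this only works when there are enough pages to establish what "repeated" means, and exact string matching will miss footers that vary (e.g. "Page 1", "Page 2"); normalizing digits away before counting is one possible refinement.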

No space between words while reading and extracting the text from a pdf file in python?

Hello Community Members,
I want to extract all the text from an e-book with .pdf as the file extension. I came to know that Python has a package, PyPDF2, to do the necessary action. I have tried it and am able to extract text, but it results in inappropriate spacing between the extracted words; sometimes the result is 2-3 merged words.
Further, I want to extract the text from page 3 onward, as the initial pages deal with the cover page and preface. I also don't want to include the last 5 pages, as they contain the glossary and index.
Does there exist any other way to read a .pdf binary file with NO ENCRYPTION?
The code snippet, whatever I have tried up to now, is as follows.
import PyPDF2

def Read():
    pdfFileObj = open('book1.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    # discerning the number of pages will allow us to parse through all the pages
    num_pages = pdfReader.numPages
    count = 0
    global text
    text = []
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count += 1
        text += pageObj.extractText().split()
    print(text)

Read()
This is a possible solution:
import PyPDF2

def Read(startPage, endPage):
    global text
    text = []
    cleanText = ""
    pdfFileObj = open('myTest2.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    while startPage <= endPage:
        pageObj = pdfReader.getPage(startPage)
        text += pageObj.extractText()
        startPage += 1
    pdfFileObj.close()
    for myWord in text:
        if myWord != '\n':
            cleanText += myWord
    text = cleanText.split()
    print(text)

Read(0, 0)
Read() parameters: Read(first page to read, last page to read).
Note: page numbering starts from 0, not 1 (as in an array).
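If PyPDF2's extractText() keeps merging words, pdfminer.six (already used elsewhere on this page) usually preserves word spacing better. A hedged sketch using its high-level API; read_range and the page bounds are assumptions for illustration:

```python
def read_range(path, first_page, last_page):
    # Assumption: pdfminer.six is installed.
    from pdfminer.high_level import extract_text
    # page_numbers is 0-based, like PyPDF2's getPage(); the range is
    # inclusive of last_page here.
    return extract_text(path, page_numbers=range(first_page, last_page + 1))
```

For the original goal of skipping the cover pages and the trailing glossary, calling read_range(path, 2, num_pages - 6) would cover page 3 up to the fifth-from-last page.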
