Could not append multiple image files in a PDF - python-3.x

I know there are answers regarding this question, but hear me out.
I am currently trying to make a PDF out of .jpg files using img2pdf in Python, but instead of appending the files to the PDF, it overwrites the pages that already exist in the PDF.
Here's the code:
    import os, img2pdf

    os.chdir("/home/aditya/Desktop")  # images are inside Desktop
    # files contains the list of all names of all .jpg files which I want to convert into PDF
    root, dir, files = list(os.walk(os.getcwd()))[0]
    with open("pdf_file.pdf", "ab") as f:  # PDF file is set to append
        for img_file in files:
            with open(img_file, "rb") as im_file:  # read bytes from the image files
                # this line overwrites the existing pages in the PDF
                # despite the fact that I have set it to append
                f.write(img2pdf.convert(im_file))
Any reason for this? Is there special attribute I need to pass?
Any help is appreciated. Thanks

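img2pdf.convert does not add pages image by image. Each call returns the bytes of a complete, standalone PDF, so it either does a single image or all of them at once. Writing several of those results into one file in append mode only stacks separate PDF documents back to back, and a viewer will typically show just one of them. Pass all the images in a single call instead:
    img2pdf.convert(list_of_images)
    with open("pdf_file.pdf", "wb") as f:  # open the PDF for writing ("wb", since the result is already a complete PDF)
        f.write(img2pdf.convert(files))  # convert all the images and write the bytes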
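As a minimal sketch of the all-at-once approach, assuming the pages should appear in filename order and that only the .jpg/.jpeg files in the folder belong in the PDF (both are assumptions, and `collect_jpgs` and `images_to_pdf` are hypothetical helper names):

```python
import os


def collect_jpgs(folder):
    """Return the full paths of the .jpg/.jpeg files in `folder`, sorted by name."""
    names = [
        name for name in os.listdir(folder)
        if name.lower().endswith((".jpg", ".jpeg"))
        and os.path.isfile(os.path.join(folder, name))
    ]
    return [os.path.join(folder, name) for name in sorted(names)]


def images_to_pdf(folder, pdf_path):
    """Write one PDF with one page per image found in `folder`."""
    import img2pdf  # third-party: pip install img2pdf

    image_paths = collect_jpgs(folder)
    if not image_paths:
        raise ValueError("no .jpg files found in " + folder)
    # One convert() call with every path yields a single multi-page PDF.
    with open(pdf_path, "wb") as f:
        f.write(img2pdf.convert(image_paths))
```

Calling `images_to_pdf("/home/aditya/Desktop", "pdf_file.pdf")` would then build the PDF in one pass. Filtering by extension also keeps other files on the Desktop, including pdf_file.pdf itself, out of the list handed to img2pdf.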

Related

Using PDF file in Keras OCR or PyTesseract - Python, is it possible?

I am using Keras OCR and PyTesseract and was wondering if it is possible to use PDF files as the image input.
If not, does anyone have a suggestion as to how to convert a very massive PDF file into PNG or another acceptable format?
Thank you!
No, as far as I know PyTesseract works only with images. You'll need to convert your pdf to images first.
By "very massive PDF" I'm assuming you mean a pdf with lots of pages. This is not an issue. You can use pdf2image library (see the docs here). The method convert_from_path has an output_folder argument that lets you specify the folder where all your generated images will be saved:
Output directory for the generated files, should be seen more as a
“working directory” than an output folder. The converted images will
be written there to save system memory.
You can later use them one by one instead of your pdf to work with PyTesseract. If you don't assign the returned list of images from convert_from_path you don't risk filling up your memory.
Otherwise, if you are willing to keep everything in memory you can use the returned pages directly, like so:
    from pdf2image import convert_from_path
    pages = convert_from_path(pdf_path)
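For a PDF with a lot of pages, another option is to convert and OCR a few pages at a time, so that only one batch of images is held in memory. This is a rough sketch rather than a tested pipeline: `page_batches` and `ocr_pdf_in_batches` are hypothetical helper names, the batch size and DPI are arbitrary, and it assumes pdf2image, pytesseract, Poppler and Tesseract are all installed.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, batch_size=10, dpi=300, lang="eng"):
    """Return the OCR text of each page, converting batch_size pages at a time."""
    # Third-party imports are kept local so page_batches works without them.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    texts = []
    for first, last in page_batches(total_pages, batch_size):
        # Only the pages in this batch are rendered to PIL images.
        pages = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for page in pages:
            texts.append(pytesseract.image_to_string(page, lang=lang))
    return texts
```

The first_page and last_page arguments of convert_from_path do the actual limiting here; the batching helper only decides which page ranges to ask for.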
For example, here is my code (Python 3.9, macOS Big Sur):
    import cv2
    import pytesseract
    from PIL import Image
    from fonctions_images import *  # the poster's own helper module (del_lines, and presumably custom_config)
    from pdf2image import convert_from_path

    path = '/Users/yves/documents_1/'
    fichier = path + 'TOUTOU.pdf'
    images = convert_from_path(fichier, 500, transparent=True, grayscale=True, poppler_path='/usr/local/Cellar/poppler/21.12.0/bin')
    pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"
    for image in images:
        image.save(path + "image.png", format="png")
        test = path + "image.png"
        img = cv2.imread(test)  # load the saved page image into memory
        img = del_lines(path, img)  # remove the lines
        img = cv2.imread(path + "img_final_bin_1.png")  # reload the cleaned image (presumably written by del_lines)
        # OCR a cropped region of the page and get the result as a data frame
        d = pytesseract.image_to_data(img[3820:4050, 2340:4000], lang='fra', config=custom_config, output_type='data.frame')

How to convert a scanned PDF file to Editable PDF file with python?

I just need to know whether a scanned PDF file can be converted to an editable PDF file using Python. I know of a couple of libraries out there, like pytesseract and pyocr. Guidance in this regard will be highly appreciated. Thanks
A scanned PDF is a number of page images combined into one document and saved in PDF format. In such a file you cannot select a single letter of the text, or even search for it.
I faced the same problem and handled it with three lines of code.
They convert a scanned PDF file into a PDF document with selectable, searchable text.
Hope it works for you!
    import ocrmypdf

    def scannedPdfConverter(file_path, save_path):
        ocrmypdf.ocr(file_path, save_path, skip_text=True)
        print('File converted successfully!')
    import os
    import subprocess

    for top, dirs, files in os.walk('/my/pdf/folder'):
        for filename in files:
            if filename.endswith('.pdf'):
                abspath = os.path.join(top, filename)
                subprocess.call('lowriter --invisible --convert-to doc "{}"'
                                .format(abspath), shell=True)
Referenced from here
You should look before you ask.

Extracting a particular file from a zipfile using Python

I have a list of 3 million html files in a zipfile. I would like to extract ~4000 html files from the entire list of files. Is there a way to extract a specific file without unzipping the entire zipfile using Python?
Any leads would be appreciated! Thanks in advance.
Edit: My bad, I should have elaborated on the question. I have a list of all the html filenames that need to be extracted, but they are spread out over 12 zipfiles. How do I iterate through each zipfile, extract the matching html files, and get the final list of extracted html files?
Let's say you have the list of all the file names to be extracted; then you can try this out. If you wish to extract all the html files instead, this will require a slight modification.
    from zipfile import ZipFile

    listOfZipFiles = ['sample1.zip', 'sample2.zip', 'sample3.zip', ..., 'sample12.zip']
    fileNamesToBeExtracted = ['file1.html', 'file2.html', ..., 'filen.html']
    # Create a ZipFile object for each zip file in turn
    for zipFileName in listOfZipFiles:
        with ZipFile(zipFileName, 'r') as zipObj:
            # Get a list of all archived file names from the zip
            listOfFileNames = zipObj.namelist()
            # Iterate over the file names
            for fileName in listOfFileNames:
                # Check whether this file is one of the files to be extracted
                if fileName in fileNamesToBeExtracted:
                    # Extract a single file from the zip
                    zipObj.extract(fileName)
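With millions of archive members and thousands of wanted names, it may help to keep the wanted names in a set and to collect the extracted paths along the way, since the question also asks for the final list. A minimal sketch, assuming the wanted names match the member names exactly as stored in the archives (including any folder prefix); `extract_selected` and `dest_dir` are hypothetical names:

```python
import zipfile


def extract_selected(zip_paths, wanted_names, dest_dir):
    """Extract only the members named in wanted_names; return the extracted paths."""
    wanted = set(wanted_names)  # set membership tests are O(1) on average
    extracted = []
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            # Member names come from the archive's central directory,
            # so nothing is decompressed until extract() is called.
            for name in archive.namelist():
                if name in wanted:
                    extracted.append(archive.extract(name, path=dest_dir))
    return extracted
```

`extract_selected(listOfZipFiles, fileNamesToBeExtracted, "extracted")` would return the paths of the files that were found; comparing that list with the wanted names shows which ones were missing from every archive.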

How to add dicom tags for a series of dicom images?

I want to add DICOM tags to a series of DICOM images and save the modified batch.
I have written a simple Python script using pydicom which can edit and add DICOM tags in a single DICOM image, but I want to do the same for a complete image set (say 20 or 30 images).
Can anybody suggest a way to do this using pydicom or Python?
Just collect your filenames in a list and process each filename (read the file, edit contents, save as new or maybe use the same name).
Have a look at Python's os module. For instance, os.listdir('path') returns a list of the filenames found in the given path. If that path points to a directory that contains only DICOM images, you now have a list of DICOM filenames. Next, use os.path.join('path', filename) to build a full path that you can use as input for reading a DICOM file with pydicom.
Also you might want to use a for loop.
Let's suppose you have a list of dicom image file paths in an array named dicom_paths. Then:
    import pydicom

    dicom_paths = []  # list of image paths here
    # dcmread is the current name of the reader; older pydicom versions also offered read_file
    dicom_data = [pydicom.dcmread(s) for s in dicom_paths]
    for dicom_data_item in dicom_data:
        pass  # do what you want here
Hope it helps
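Putting the two answers together, a sketch of the whole batch could look like the following. It assumes the folder contains only DICOM files, that the keys of `tags` are standard DICOM keywords such as PatientID, and that the modified copies should go to a separate folder; `apply_tags` and `tag_series` are hypothetical names, not part of pydicom.

```python
import os


def apply_tags(dataset, tags):
    """Set each DICOM keyword in `tags` (e.g. "PatientID") on `dataset`."""
    for keyword, value in tags.items():
        setattr(dataset, keyword, value)
    return dataset


def tag_series(src_dir, dst_dir, tags):
    """Apply the same tags to every file in src_dir and save copies in dst_dir."""
    import pydicom  # third-party: pip install pydicom

    os.makedirs(dst_dir, exist_ok=True)
    saved = []
    for filename in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, filename)
        if not os.path.isfile(src_path):
            continue
        ds = pydicom.dcmread(src_path)
        apply_tags(ds, tags)
        dst_path = os.path.join(dst_dir, filename)
        ds.save_as(dst_path)  # write the modified copy; the original file is untouched
        saved.append(dst_path)
    return saved
```

For example, `tag_series("series_in", "series_out", {"StudyDescription": "Phantom study"})` would write a tagged copy of every image; the folder names and the tag value here are made up for illustration.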

Magento: "Image does not exist"

I'm importing a CSV file in Magento (version 1.9).
I receive the error: 'Image does not exist'.
I've tried to do everything I could find on the internet.
The template I'm using for upload is the default template taken from my export folder.
I've added the / before the image name and I've also saved the file as UTF-8 format.
Any advice would help.
Use advanced profiler
System > Import/Export > Dataflow – Profiles
You only need to include the attributes that are required, which is just the SKU, plus the appropriate image attributes, plus labels if you want to go all out.
When you are creating your new profile, enter the following settings:
Now you can hit save! With our Profile now complete, we just need to create the folder media/import. This is where you will be storing all your images awaiting import.
When uploading images, they need to be within a folder called media/import. Once saved to that folder you can then reference them relatively. By that I mean if your image is in media/import/test.jpg in your csv reference it as /test.jpg. It’s as easy as that.
Please check this link for more information
Import products using csv
In the default import:
First move all the images into the media/import folder, then use '/imagename' in the CSV, and then import.
Also give the import folder 777 permissions.
Let me know if you have any questions.
Check three points before uploading a CSV file in Magento:
1. Create the media > import folder and place all images inside the import folder.
2. The import folder should have 777 permissions.
3. The path of each image should look like /desert-002.jpg.
The issue may be with the image path in the CSV. If an image path in the CSV is abg/test.jpg, then its path in the directory is ..media/import/abg/test.jpg. Also check the letter case of the image extension: if the file's extension is JPG and you write jpg in the CSV, Magento reports that the image does not exist.
Your file template must look like this:
sku,image
product-001,/product_image.jpg
This file must exist: yourdocroot/media/import/product_image.jpg
For more detail, please read this method:
Mage_Catalog_Model_Convert_Adapter_Product::saveImageDataRow
You will see these lines:
$imageFile = trim($importData['_media_image']);
$imageFile = ltrim($imageFile, DS);
$imageFilePath = Mage::getBaseDir('media') . DS . 'import' . DS . $imageFile;
I hope this helps!
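Before running the import, it can save time to check the CSV against the media/import folder outside Magento. The sketch below is written in Python rather than PHP and is not Magento code: it only imitates the path handling quoted above (strip leading slashes, then look under media/import) and compares names with exact letter case. The function name `missing_images` and the `image` column default are assumptions taken from the sample template.

```python
import csv
import os


def missing_images(csv_path, media_import_dir, column="image"):
    """Return (sku, image value) pairs whose file is absent from media_import_dir."""
    # Collect the relative paths that really exist, with their exact letter case.
    existing = set()
    for top, _dirs, files in os.walk(media_import_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(top, name), media_import_dir)
            existing.add(rel.replace(os.sep, "/"))

    missing = []
    # utf-8-sig also accepts files saved with a byte order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row in csv.DictReader(f):
            value = (row.get(column) or "").strip()
            if not value:
                continue
            # Drop leading slashes, then look for the file under media/import.
            if value.lstrip("/") not in existing:
                missing.append((row.get("sku"), value))
    return missing
```

Running `missing_images("products.csv", "yourdocroot/media/import")` lists the rows that would trigger the error, including the case where the CSV says .jpg and the file on disk ends in .JPG.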
