Convert html to png using gmplot in Python

I am using gmplot in Python 3.8 to generate a layer of polygons on top of a satellite google maps layer. The map is saved to .html format but I would like to be able to convert the .html file to .png format to embed it in a pdf created in Python at a later stage (that will contain other elements, such as text and other images).
I generate the map using standard code as described in the gmplot tutorial:
import gmplot
latitude_list = [ 17.4567417, 17.5587901, 17.6245545]
longitude_list = [ 78.2913637, 78.007699, 77.9266135 ]
gmap = gmplot.GoogleMapPlotter(17.438139, 78.3936413, 11)
gmap.polygon(latitude_list, longitude_list, color = 'cornflowerblue')
gmap.draw("path_to_html")
I have checked several posts looking for a solution. From one of them, I got the following snippet of code:
import time
from selenium import webdriver
import chromedriver_binary # adds chromedriver binary to path
driver = webdriver.Chrome()
driver.get("local_url_of_html_file")
time.sleep(3)
driver.save_screenshot('map.png')
driver.quit()
This code takes a screenshot of the rendered html, but I was wondering whether gmplot has a built-in function to do this more directly, or whether another package such as bokeh can do it.
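For what it's worth, the screenshot route above can be wrapped in a small helper. This is only a sketch, assuming Selenium 4 with Chrome and a matching chromedriver available (as in the snippet above). The names html_file_url and screenshot_html, the window size and the wait time are hypothetical choices, not part of gmplot or Selenium.

```python
import time
from pathlib import Path


def html_file_url(html_path):
    """Turn a local file path into a file:// URL that a browser can open."""
    return Path(html_path).resolve().as_uri()


def screenshot_html(html_path, png_path, width=1200, height=800, wait_seconds=3):
    """Render a local html file in headless Chrome and save a png screenshot.

    Assumes Selenium 4 and Chrome are installed; selenium is imported here so
    that html_file_url stays usable without it.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible browser window
    options.add_argument(f"--window-size={width},{height}")  # controls the png size
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(html_file_url(html_path))
        time.sleep(wait_seconds)  # crude wait for the map tiles to load
        driver.save_screenshot(png_path)
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling screenshot_html("map.html", "map.png") after gmap.draw("map.html") would then produce the image to embed in the pdf. The fixed sleep is the weak point: a slow connection may need a longer wait before the tiles are drawn.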

Related

I am trying to use Fitz to extract data from a pdf that contains text in a very unstructured format, but it returns None at the first step

Here's the code I have been trying with the output:
import fitz
import pandas as pd
doc = fitz.open('xyz.pdf')
page1 = doc[0]
words = page1.get_text("words")
first_annots=[]
rec=page1.first_annot.rect
rec
Output:
The output I am expecting is for all text rectangles to be identified and accessible separately.
Here's where I found the code that I am implementing: https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/
Independent of your overall intention (to parse unstructured text):
Accessing the page's annotations via page.first_annot makes no sense at all.
Your exception is caused by the fact that the page has no annotations, so page.first_annot is None, and None has no rect attribute.
Again: whether or not there are annotations has nothing to do with the text of the page. Simply do not access page.first_annot.
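To get at the text rectangles instead, the word tuples that get_text("words") already returns are enough. Below is a minimal sketch, assuming the PyMuPDF word tuple layout (x0, y0, x1, y1, text, block_no, line_no, word_no). The helper names group_words_by_line and words_on_first_page are hypothetical.

```python
from collections import defaultdict


def group_words_by_line(words):
    """Group PyMuPDF word tuples into lines of text with their rectangles.

    Each word tuple is (x0, y0, x1, y1, text, block_no, line_no, word_no).
    Returns a list of (line_text, [word_rectangles]) in reading order.
    """
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, text, (x0, y0, x1, y1)))

    result = []
    for key in sorted(lines):
        ordered = sorted(lines[key])  # sort by word_no within the line
        line_text = " ".join(text for _, text, _ in ordered)
        rects = [rect for _, _, rect in ordered]
        result.append((line_text, rects))
    return result


def words_on_first_page(pdf_path):
    """Read the word tuples of the first page (requires PyMuPDF)."""
    import fitz  # imported here so the grouping helper works without it

    doc = fitz.open(pdf_path)
    try:
        return doc[0].get_text("words")
    finally:
        doc.close()
```

With this, group_words_by_line(words_on_first_page('xyz.pdf')) gives each line together with the rectangle of every word in it, and no annotation is touched.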

Using PDF file in Keras OCR or PyTesseract - Python, is it possible?

I am using Keras OCR and PyTesseract and was wondering if it is possible to use PDF files as the image input.
If not, does anyone have a suggestion as to how to convert a very massive PDF file into PNG or another acceptable format?
Thank you!
No, as far as I know PyTesseract works only with images. You'll need to convert your pdf to images first.
By "very massive PDF" I'm assuming you mean a PDF with lots of pages. This is not an issue. You can use the pdf2image library (see the docs here). Its convert_from_path function has an output_folder argument that lets you specify the folder where all your generated images will be saved:
Output directory for the generated files, should be seen more as a
“working directory” than an output folder. The converted images will
be written there to save system memory.
You can later use them one by one instead of your pdf to work with PyTesseract. If you don't assign the returned list of images from convert_from_path you don't risk filling up your memory.
Otherwise, if you are willing to keep everything in memory you can use the returned pages directly, like so:
pages = convert_from_path(pdf_path)
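Another way to keep memory bounded, different from the output_folder approach above, is to convert only a few pages at a time with the first_page and last_page arguments of convert_from_path. The following is a sketch under assumptions: poppler and tesseract are installed, and page_batches, ocr_pdf_in_batches, the batch size and the dpi are hypothetical names and values.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, batch_size=10, dpi=300):
    """OCR a large PDF a few pages at a time.

    Requires pdf2image (with poppler) and pytesseract; both are imported here
    so that page_batches can be used and tested on its own.
    """
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    texts = []
    for first, last in page_batches(total_pages, batch_size):
        # Only this batch of page images is held in memory at once.
        pages = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for page in pages:
            texts.append(pytesseract.image_to_string(page))
    return texts
```

A smaller batch size lowers peak memory at the cost of calling poppler more often.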
For example, here is my code (Python 3.9, macOS Big Sur):
import cv2
import pytesseract
from PIL import Image
from fonctions_images import *  # my own helper module (provides del_lines)
from pdf2image import convert_from_path

path = '/Users/yves/documents_1/'
fichier = path + 'TOUTOU.pdf'
pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"
# Render every page of the PDF at 500 dpi, in grayscale
images = convert_from_path(fichier, 500, transparent=True, grayscale=True, poppler_path='/usr/local/Cellar/poppler/21.12.0/bin')
for image in images:
    image.save(path + "image.png", format="png")
    img = cv2.imread(path + "image.png")  # load the page image into memory
    img = del_lines(path, img)  # remove the ruled lines
    img = cv2.imread(path + "img_final_bin_1.png")
    # custom_config is not defined in this snippet; it has to be set beforehand
    d = pytesseract.image_to_data(img[3820:4050, 2340:4000], lang='fra', config=custom_config, output_type='data.frame')

Raw Images from rawpy darker than their thumbnails

I want to convert '.NEF' to '.png' using the rawpy, imageio and opencv libraries in Python. I've tried a variety of flags in rawpy to produce the same image that I see when I just open the NEF, but all of the output images are extremely dark. What am I doing wrong?
My current version of the code is:
import rawpy
import imageio
from os.path import *
import os
import cv2
def nef2png(inputNEFPath):
    parent, filename = split(inputNEFPath)
    name, _ = splitext(filename)
    pngName = str(name+'.png')
    tempFileName = str('temp%s.tiff' % (name))
    with rawpy.imread(inputNEFPath) as raw:
        rgb = raw.postprocess(gamma=(2.222, 4.5),
                              no_auto_bright=True,
                              output_bps=16)
    imageio.imsave(join(parent, tempFileName), rgb)
    image = cv2.imread(join(parent, tempFileName), cv2.IMREAD_UNCHANGED)
    cv2.imwrite(join(parent, pngName), image)
    os.remove(join(parent, tempFileName))
I'm hoping to get this result:
https://imgur.com/Q8qWfwN
But I keep getting dark outputs like this:
https://imgur.com/0jIuqpQ
I uploaded the actual NEF files to my Google Drive if you want to experiment with them: https://drive.google.com/drive/folders/1DVSPXk2Mbj8jpAU2EeZfK8d2HZM9taiH?usp=sharing
You're not doing anything wrong, it's just that the thumbnail was generated by Nikon's proprietary in-camera image processing pipeline. It's going to be hard to get the exact same visual output from an open source tool with an entirely different set of algorithms.
You can make the image brighter by setting no_auto_bright=False. If you're not happy with the default brightening, you can play with the auto_bright_thr parameter (see documentation).
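As a rough illustration of that suggestion, here is a sketch that turns auto-brightening back on and optionally overrides auto_bright_thr, which as far as I understand the rawpy docs is the fraction of pixels allowed to clip. The helper names postprocess_options and nef_to_png are hypothetical, and 8-bit output is an assumption made to keep the png writing simple.

```python
def postprocess_options(auto_bright_thr=None):
    """Build keyword arguments for rawpy's postprocess with auto-brightening on."""
    options = {
        "no_auto_bright": False,  # let LibRaw brighten the image automatically
        "output_bps": 8,          # assumption: 8-bit output, simplest to save as png
    }
    if auto_bright_thr is not None:
        # Only override the library default when a value is given.
        options["auto_bright_thr"] = auto_bright_thr
    return options


def nef_to_png(nef_path, png_path, auto_bright_thr=None):
    """Convert a raw file to png (requires rawpy and imageio)."""
    import imageio
    import rawpy

    with rawpy.imread(nef_path) as raw:
        rgb = raw.postprocess(**postprocess_options(auto_bright_thr))
    imageio.imwrite(png_path, rgb)
```

Trying a few values, for example nef_to_png('photo.NEF', 'photo.png', auto_bright_thr=0.05), is a quick way to see how much brightening looks acceptable. It will still not match the Nikon thumbnail exactly.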

Extracting title from pdf using pypdf2 not working

I'm trying to extract the title of PDF files using pyPDF2. The output is either none or a wrong title. I tried using PDFminer as well, still the same result. I tried using 3 different pdf files. Is there a better way to extract the title with better accuracy?
This is the code I used:
from PyPDF2 import PdfFileReader

def get_pdf_title(pdf_file_path):
    # Open the file in a with block so the handle is closed afterwards
    with open(pdf_file_path, "rb") as pdf_file:
        pdf_reader = PdfFileReader(pdf_file)
        return pdf_reader.getDocumentInfo().title

title = get_pdf_title('C:/PythonPrograms/Test.pdf')
print(title)
Your code is working, at least for me on Python 3.5.2. Check in the PDF properties that the file indeed has a title.
A PDF's title is part of its metadata, which has to be set explicitly. It is not mandatory, and it is related neither to the document's content (other than by the will of the person writing it) nor to its filename.
If you use your snippet on a file with no title, its output will be None (or an empty string if the title field exists but is blank).
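If the metadata title is missing, one possible fallback is to guess the title from the first non-empty line of the first page. This is a heuristic sketch, not a reliable method: it assumes the legacy PyPDF2 API used above (before version 3.0), and choose_title and get_pdf_title_with_fallback are hypothetical names.

```python
def choose_title(metadata_title, first_page_text):
    """Prefer the metadata title; otherwise use the first non-empty text line."""
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    for line in (first_page_text or "").splitlines():
        if line.strip():
            return line.strip()
    return None  # nothing usable found


def get_pdf_title_with_fallback(pdf_file_path):
    """Read the metadata title, falling back to the first page's text.

    Uses the legacy PyPDF2 API (PdfFileReader), imported here so that
    choose_title can be used without PyPDF2 installed.
    """
    from PyPDF2 import PdfFileReader

    with open(pdf_file_path, "rb") as pdf_file:
        reader = PdfFileReader(pdf_file)
        info = reader.getDocumentInfo()  # may be None if there is no metadata
        metadata_title = info.title if info is not None else None
        first_page_text = reader.getPage(0).extractText()
    return choose_title(metadata_title, first_page_text)
```

The first text line is often a journal header or a page number rather than the real title, so treat the result as a guess.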

pytesseract image_to_string not pulling strings, but no Error

I am using the image_to_string function in the pytesseract package to convert multiple parts of a single picture file to string. All parts are working except for this image:
Here is the script that I am using to convert it:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
im = Image.open('image.png')
text = pytesseract.image_to_string(im)
print(text)
Which gives the output:
—\—\—\N—\—\—\—\—\N
I have tried breaking up the image into smaller parts as well as processing the image as a jpg and as png. What can I do to have it output the values in the image?
Using a different page segmentation instead of the default one seems to work.
text = pytesseract.image_to_string(im, config='--psm 6')
According to the tesseract wiki, option 6 assumes a single uniform block of text. I tried with other options but only this one worked.
To check for other page segmentation methods read the tesseract wiki on how to improve quality of an image.
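Since which mode works depends on the image, it can help to try several modes in one go and compare the results. A minimal sketch follows, assuming Tesseract 4 or later, where page segmentation modes run from 0 to 13. The helper names psm_config and ocr_with_psm_modes and the default list of modes are hypothetical.

```python
def psm_config(mode):
    """Build the config string that selects a page segmentation mode."""
    if not 0 <= mode <= 13:
        raise ValueError(f"unsupported page segmentation mode: {mode}")
    return f"--psm {mode}"


def ocr_with_psm_modes(image_path, modes=(6, 7, 11)):
    """Run OCR once per mode and return {mode: text}.

    Mode 6 is a single uniform block of text, 7 a single text line,
    11 sparse text. Requires Pillow and pytesseract, imported here so
    that psm_config can be used without them.
    """
    from PIL import Image
    import pytesseract

    results = {}
    with Image.open(image_path) as im:
        for mode in modes:
            results[mode] = pytesseract.image_to_string(im, config=psm_config(mode))
    return results
```

Printing the dictionary returned by ocr_with_psm_modes('image.png') shows at a glance which mode, if any, reads the values correctly.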
