Save Images inline (Base64) for Markdown export in Jupyter notebook

How can images from Matplotlib, produced for example by:
plt.tight_layout()
plt.savefig('Image.png', facecolor='w', edgecolor='w', transparent=False, bbox_inches='tight', pad_inches=0.1)
plt.show()
be saved inline (embedded as base64), so that no external files are needed
when the notebook is downloaded as a Markdown file?

My workaround solution:
Install embed-images: https://github.com/freeman-lab/embed-images
and nbconvert: https://nbconvert.readthedocs.io/en/latest/install.html
Add to the first cell:
%matplotlib inline
import os
import ipyparams
and to the last cell:
# notebook_name ends in ".ipynb"; [:-6] strips that extension
os.system('jupyter nbconvert --to markdown ' + ipyparams.notebook_name)
os.system('embed-images ' + ipyparams.notebook_name[:-6] + '.md > ' + ipyparams.notebook_name[:-6] + '_emb.md')
nbconvert converts notebook.ipynb to notebook.md, with all images written to the folder notebook_files.
embed-images converts these images to base64, inserts them into the file, and saves the result as notebook_emb.md.
Both calls return 0 when the conversion succeeds.
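If installing a separate tool is not an option, the embedding step can also be sketched with the standard library only. This is a minimal sketch, not the embed-images implementation: the function name embed_images and its base_dir parameter are hypothetical, and it assumes the Markdown uses plain ![alt](path) image links such as those nbconvert writes.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches Markdown image links of the form ![alt](target)
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def embed_images(markdown_text, base_dir="."):
    """Return markdown_text with local image links replaced by base64 data URIs."""
    base = Path(base_dir)

    def replace(match):
        alt, target = match.group(1), match.group(2)
        path = base / target
        # Leave remote URLs, existing data URIs, and missing files untouched
        if "://" in target or target.startswith("data:") or not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"![{alt}](data:{mime};base64,{payload})"

    return IMAGE_LINK.sub(replace, markdown_text)
```

After the nbconvert call, the last cell could read notebook.md, pass its text through embed_images, and write the result to notebook_emb.md. Whether a given Markdown viewer renders data URIs is something to check for your target.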

Related

PDF parser in pdfs with multiple images and formats with python and tabula (open to other options)

So first off, what I'm trying to do: create a PDF parser that takes ONLY the tables out of any given PDF. I currently have some PDFs of parts manuals, each containing an image of the part followed by a table of part details, and I want to scrape and parse the table data from the PDF into a CSV or similar spreadsheet file (csv, xls, etc.).
What I've tried: I am currently using Python 3 and tabula (I have no strong preference for either and am open to other options). I have a Python program that can scrape all the data from any PDF or directory of PDFs; however, it takes EVERYTHING, including the image data, which comes out as a bunch of 0, 1 and NaN values (adding examples at the bottom). I was thinking of writing a filter function that removes these, but that feels like overkill, and I was hoping there is a way to filter out the images with tabula or another library. (Side note: I've also attempted camelot, but the module does not import correctly even though it is in my pip freeze. This happens on both my Mac M1 and Mac M2, so I'm assuming there is no ARM support.)
If anyone could help me or point me toward a library or method for iterating through all pages in a PDF and grabbing JUST the tables for export to CSV, that would be AMAZING!
Current main file:
from tabula.io import read_pdf
import pandas as pd
from tabulate import tabulate
import os

def parser(fileName, count):
    print("\nFile Number: ", count, "\nNow parsing file: ", fileName)
    # read_pdf returns a list of DataFrames, one per detected table
    df = read_pdf(fileName, pages="all")
    os.makedirs("./output", exist_ok=True)
    for i in range(len(df)):
        # include the file number so later files do not overwrite earlier output
        df[i].to_excel("./output/test" + str(count) + "_" + str(i) + ".xlsx")
        # tabulate takes one table at a time, not the whole list
        print(tabulate(df[i], headers="keys"))

def reader(type):
    filecount = 1
    if type == 'f':
        file = input("\nFile(f) type selected\nplease enter full file name with path (ex. Users/Name/directory1/filename.pdf): ")
        parser(file, filecount)
    elif type == 'd':
        # directory selected
        location = input("\nPlease enter directory path, if in the same folder just enter a period(.)")
        print("Opening directory: ", location)
        # loop through and parse directory
        for filename in os.listdir(location):
            f = os.path.join(location, filename)
            # checking if it is a file
            if os.path.isfile(f):
                parser(f, filecount)
                filecount += 1  # the original "filecount + 1" discarded the result
            else:
                print('\n\n ERROR, path given does not contain a file or is not a directory type..')
    else:
        print("Error: please select directory(d) or file(f)")

fileType = input("\n-----> Hello!\n----> Would you like to parse a directory(d) or file(f)?").lower()
reader(fileType)
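The filter function mentioned in the question could be sketched roughly as follows. This is an assumed heuristic, not a tabula feature: a table is treated as image residue when nearly all of its cells are empty, NaN, 0 or 1. The names is_noise_cell, looks_like_image_noise and keep_real_tables, and the 0.9 threshold, are hypothetical and would need tuning on real manuals. It works on plain lists of rows so it can be tried without a PDF.

```python
import math


def is_noise_cell(value):
    """True for cells that carry no table text: None, NaN, empty strings, 0 or 1."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        return value.strip() in ("", "0", "1", "NaN", "nan")
    if isinstance(value, (int, float)):
        return value in (0, 1)
    return False


def looks_like_image_noise(rows, threshold=0.9):
    """Heuristic: a table is noise when at least `threshold` of its cells are noise."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return True  # an empty table has nothing worth exporting
    noisy = sum(1 for cell in cells if is_noise_cell(cell))
    return noisy / len(cells) >= threshold


def keep_real_tables(tables, threshold=0.9):
    """Return only the tables (lists of rows) that do not look like image residue."""
    return [rows for rows in tables if not looks_like_image_noise(rows, threshold)]
```

With tabula, each DataFrame returned by read_pdf could be converted with df.values.tolist() before the check, and only the tables that pass would be exported.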

How to save all figures in pdf file in python created from seaborn style & dataframe?

This code displays grid as a styled table with a background gradient (image 1).
def plot(grid):
    cmap = sns.light_palette("red", as_cmap=True)
    figure = pd.DataFrame(grid)
    figure = figure.style.background_gradient(cmap=cmap, axis=None)
    display(figure)
I want to store multiple outputs like image 1, generated by the function plot, in a single PDF file. With matplotlib,
from matplotlib.backends.backend_pdf import PdfFile, PdfPages
pdfFile = PdfPages("name.pdf")
pdfFile.savefig(plot)
pdfFile.close()
can do this, but in this case I am facing issues because the output is a styled DataFrame (I am using a seaborn palette with background_gradient) rather than a matplotlib figure.
Could you please suggest how to store the output above in a single PDF file, or as PNG or JPG?
Here is my code to save all open figures to a PDF; it saves each plot to a separate page.
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
pp = PdfPages(r'C:\path\filename.pdf') # path to where you want to save the pdf (raw string, so \p and \f are not read as escapes)
figNum = plt.get_fignums() # creates a list of all figure numbers
for i in range(len(figNum)): # loop to add each figure to pdf
    pp.savefig(figNum[i]) # uses the figure number in the list to save it to the pdf
pp.close() # closes the opened file in memory
We can create a folder named 'image' and store each output of the code there in PNG format. We will have to use dataframe_image for that.
import os
import dataframe_image as dfi
import pandas as pd
import seaborn as sns
from PIL import Image
def plot(grid):
    cmap = sns.light_palette("red", as_cmap=True)
    figure = pd.DataFrame(grid)
    figure = figure.style.background_gradient(cmap=cmap, axis=None)
    os.makedirs('image', exist_ok=True) # make sure the output folder exists
    dfi.export(figure, 'image/df_styled.png', max_cols=-1)
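Once the PNG files exist, they could be merged into the single PDF the question asks for. The following is a minimal sketch using Pillow, assuming the images are already on disk; the function name images_to_pdf is hypothetical and is not part of dataframe_image.

```python
from pathlib import Path

from PIL import Image


def images_to_pdf(image_paths, pdf_path):
    """Write the given images into one PDF, one image per page."""
    # PDF output does not support an alpha channel, so convert each page to RGB
    pages = [Image.open(p).convert("RGB") for p in image_paths]
    if not pages:
        raise ValueError("no images to write")
    first, rest = pages[0], pages[1:]
    # save_all with append_images writes every image as its own page
    first.save(pdf_path, "PDF", save_all=True, append_images=rest)
    return Path(pdf_path)
```

For the folder used above, a call such as images_to_pdf(sorted(Path('image').glob('*.png')), 'styled_tables.pdf') would collect every exported table. Note that plot as written always exports to the same file name, so each call would need its own name for this to produce more than one page.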

Python 3: write Russian text to PDF file

The problem was to write Russian text to a PDF file. I tried several encodings, but this didn't solve the problem. You can find the solution I came up with in the answer section. Please note that the write_to_file function writes text on only one page; it was not tested for larger files.
Here is a solution. I am using reportlab version 3.5.42.
from reportlab.lib.units import cm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph, Frame
from reportlab.graphics.shapes import Drawing, Line
from reportlab.pdfgen.canvas import Canvas
def write_to_file(filename, story):
    """
    (str, list) -> None
    Write text from list of strings story to filename. filename should be in format name.pdf.
    Russian text is supported by font DejaVuSerif. DejaVuSerif.ttf should be saved in the working directory.
    filename is stored in the same working directory.
    """
    canvas = Canvas(filename)
    pdfmetrics.registerFont(TTFont('DejaVuSerif', 'DejaVuSerif.ttf'))
    # Various styles option are available, consult reportlab User Guide
    style = ParagraphStyle('russian_text')
    style.fontName = 'DejaVuSerif'
    style.leading = 0.5*cm
    # Using XML format for new line character
    for i, part in enumerate(story):
        story[i] = Paragraph(part.replace('\n', '<br></br>'), style)
    # Create a frame to make the text fit to the page, A4 format is used by default
    frame = Frame(0, 0, 21*cm, 29.7*cm, leftPadding=cm, bottomPadding=cm, rightPadding=cm, topPadding=cm,)
    # Add different parts of the story
    frame.addFromList(story, canvas)
    canvas.save()

how to solve the problem: " expected <class 'openpyxl.styles.fills.Fill'>" when saving

Every time I save the file I get the error:
raise TypeError('expected ' + str(expected_type))
TypeError: expected <class 'openpyxl.styles.fills.Fill'>
and after that I can no longer open the saved file manually.
I open and load the file successfully with the openpyxl library:
book = openpyxl.load_workbook(r'C:\Users\shoshana\PycharmProjects\pandas\doh_golmi.xlsx')
and save it with:
book.save(r'C:\Users\shoshana\PycharmProjects\pandas\doh_golmi_2.xlsx')
In outline, the script is:
import openpyxl
......
book = openpyxl.load_workbook(r'C:\Users\shoshana\PycharmProjects\pandas\doh_golmi.xlsx')
....
book.save(r'C:\Users\shoshana\PycharmProjects\pandas\doh_golmi_2.xlsx')
I expected, of course, to be able to save the changes I make to the file and to open it manually afterwards.
I was getting the same error because I was using cell.fill incorrectly. In other words, I was doing something like this:
import openpyxl
book = openpyxl.load_workbook(r'C:\some\file.xlsx')
...
yellow = openpyxl.styles.colors.Color(rgb='FFFF00')
cell.fill = yellow
...
book.save(r'C:\some\file.xlsx')
The above is incorrect: cell.fill expects a Fill object, not a Color, so you have to wrap the color in a PatternFill (see below):
import openpyxl
book = openpyxl.load_workbook(r'C:\some\file.xlsx')
...
yellow = openpyxl.styles.colors.Color(rgb='FFFF00')
filling = openpyxl.styles.fills.PatternFill(patternType='solid', fgColor=yellow)
cell.fill = filling
...
book.save(r'C:\some\file.xlsx')
I had a similar problem, but I couldn't even load the file.
The only way was to open it manually, save it, and then load it.
My workaround is to convert the file using LibreOffice.
I ran this command line in my Jupyter notebook:
!libreoffice --convert-to xls 'my_file.xlsx'
This creates a new .xls (no x) file named my_file.xls, which can now be opened with pandas:
import pandas as pd
df = pd.read_excel('my_file.xls')

processing multiple images in sequence in opencv python

I am trying to build a program in Python that needs to process at least 50 images. How should I read the images one by one and process them? Is it possible with a loop, and do I need to create a separate database for this, or will saving all the images in a separate folder do?
I have written some code that may satisfy your requirement.
import glob
import os, sys
import cv2
## Get all the png images in PATH_TO_IMAGES
imgnames = sorted(glob.glob("/PATH_TO_IMAGES/*.png"))
for imgname in imgnames:
    ## Your core processing code (process is a placeholder for your own function;
    ## it takes the file name and returns the result image)
    res = process(imgname)
    ## rename and write back to the disk
    #name, ext = os.path.splitext(imgname)
    #imgname2 = name+"_res"+ext
    imgname2 = "_res".join(os.path.splitext(imgname))
    cv2.imwrite(imgname2, res)
The task consists of the following steps:
1. Have the images in a directory, e.g. foo/
2. Get the list of all images in the foo/ directory
3. Loop over the list of images
3.1. img = cv2.imread(images[i], 0)
3.2. ProcessImage(img) # run an arbitrary function on the image
3.3. filename = 'test' + str(i) + '.png'
3.4. cv2.imwrite(filename, img)
4. End of the loop
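The directory-listing and naming part of these steps can be sketched with pathlib alone, so no database is needed. This is only an illustration: the function name plan_outputs and the "_res" suffix convention are assumptions borrowed from the snippet in the other answer.

```python
from pathlib import Path


def plan_outputs(src_dir, pattern="*.png", suffix="_res"):
    """Return sorted (input_path, output_path) pairs for every matching image."""
    pairs = []
    for src in sorted(Path(src_dir).glob(pattern)):
        # Skip files written by an earlier run so results are not processed again
        if src.stem.endswith(suffix):
            continue
        dst = src.with_name(src.stem + suffix + src.suffix)
        pairs.append((src, dst))
    return pairs
```

Inside a loop such as for src, dst in plan_outputs('foo'), the OpenCV calls from the steps above would be cv2.imread(str(src), 0), your processing function, and cv2.imwrite(str(dst), result).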
