Python 3: write Russian text to PDF file - python-3.x

The problem was to write Russian text to a PDF file. I tried several encodings, but that didn't solve the problem. You can find the solution I came up with in the answer section. Please note that the write_to_file function writes text onto a single page only; it has not been tested with larger files.

Here is a solution. I am using reportlab version 3.5.42.
from reportlab.lib.units import cm
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.lib.styles import ParagraphStyle
from reportlab.platypus import Paragraph, Frame
from reportlab.pdfgen.canvas import Canvas

def write_to_file(filename, story):
    """
    (str, list) -> None

    Write the text from the list of strings story to filename.
    filename should be in the format name.pdf and is stored in the working directory.
    Russian text is supported by the DejaVuSerif font; DejaVuSerif.ttf must be
    saved in the working directory.
    """
    canvas = Canvas(filename)
    pdfmetrics.registerFont(TTFont('DejaVuSerif', 'DejaVuSerif.ttf'))
    # Various style options are available, consult the reportlab User Guide
    style = ParagraphStyle('russian_text')
    style.fontName = 'DejaVuSerif'
    style.leading = 0.5*cm
    # Use the XML tag for the new-line character
    for i, part in enumerate(story):
        story[i] = Paragraph(part.replace('\n', '<br></br>'), style)
    # Create a frame to make the text fit the page, A4 format is used by default
    frame = Frame(0, 0, 21*cm, 29.7*cm,
                  leftPadding=cm, bottomPadding=cm, rightPadding=cm, topPadding=cm)
    # Add the different parts of the story
    frame.addFromList(story, canvas)
    canvas.save()
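For example, a call could look like this (a minimal usage sketch; the Russian strings and the output file name are just placeholders):
# Hypothetical usage of write_to_file; the strings and the file name are placeholders.
story = [
    'Первый абзац русского текста.\nВторая строка первого абзаца.',
    'Второй абзац.',
]
write_to_file('russian_text.pdf', story)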

Related

How to save all figures created from a seaborn-styled dataframe into a single PDF file in Python?

This code gives me an output of the grid as a styled table with a background gradient.
def plot(grid):
    cmap = sns.light_palette("red", as_cmap=True)
    figure = pd.DataFrame(grid)
    figure = figure.style.background_gradient(cmap=cmap, axis=None)
    display(figure)
I want to store multiple images like this in a single PDF file generated by the function 'plot'. In the case of matplotlib,
from matplotlib.backends.backend_pdf import PdfFile,PdfPages
pdfFile = PdfPages("name.pdf")
pdfFile.savefig(plot)
pdfFile.close()
can do this, but in this case I am facing issues because it is a DataFrame, or because I am using seaborn's background_gradient style.
Could you please suggest how to store the output above in a single PDF file, or as PNG or JPG?
Here is my code to save all open figures to a PDF; it saves each plot to a separate page in the PDF.
from matplotlib.backends.backend_pdf import PdfPages

pp = PdfPages(r'C:\path\filename.pdf')  # path to where you want to save the pdf (raw string, so the backslashes are not escaped)
figNums = plt.get_fignums()  # creates a list of all figure numbers
for num in figNums:  # loop to add each figure to the pdf
    pp.savefig(num)  # uses the figure number to save that figure to the pdf
pp.close()  # closes the opened file in memory
We can create a folder named 'image' and store all images of the code output there in PNG format. We will have to use the dataframe_image package for that.
import dataframe_image as dfi
from PIL import Image

def plot(grid):
    cmap = sns.light_palette("red", as_cmap=True)
    figure = pd.DataFrame(grid)
    figure = figure.style.background_gradient(cmap=cmap, axis=None)
    dfi.export(figure, 'image/df_styled.png', max_cols=-1)
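To then combine the exported images into a single PDF, one option (a small sketch, assuming the PNGs produced above all live in the 'image' folder; the output file name is a placeholder) is to open them with PIL and save them as one multi-page PDF:
# Sketch: merge all PNGs in the 'image' folder into one multi-page PDF.
# The output file name 'styled_tables.pdf' is just a placeholder.
import glob
from PIL import Image

pages = [Image.open(p).convert('RGB') for p in sorted(glob.glob('image/*.png'))]
pages[0].save('styled_tables.pdf', save_all=True, append_images=pages[1:])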

Is there any way to get the title of a PDF file directly by iterating over the PDF's paragraphs, rather than from the metadata?

I want to get the title as output when I pass a PDF file as input using Python code.
I have used the pdfreader, PyPDF2, and pdfminer libraries, but they all fetch the title from the metadata.
Is there any way to get the title directly from the PDF's paragraph information?
Thank you for your help.
I found a solution using the pdfminer library.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

path = r'/path/to/pdf'
Extract_Data = []
for page_layout in extract_pages(path):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        Font_size = character.size
            # Store the font size found in this element together with its text
            Extract_Data.append([Font_size, element.get_text()])
# max() compares the font sizes first, so this picks the text written in the
# largest font, which is usually the title
title = max(Extract_Data)
print(title[1])

How to extract text from a PDF file using Python? I have never done this before and cannot get at the DOM of the PDF file.

This is my PDF file: "https://drive.google.com/open?id=1M9k1AO17ZSwT6HTrTrB-uz85ps3WL1wS"
Can someone help me extract this? Searching on SO, I found some clues about extracting text using these libraries: PyPDF2, PyPDF2.pdf, PageObject, u_, ContentStream, b_, TextStringObject, but I don't understand how to use them.
Please help me extract this with some explanation, so I can understand the code, and tell me how to read the DOM of a PDF file.
You need to install some libraries:
pip install PyPDF2
pip install textract
pip install nltk
This will download the libraries you require to parse PDF documents and extract keywords. In order to do this, make sure your PDF file is stored within the folder where you’re writing your script.
Startup your favourite editor and type:
Note: All lines starting with # are comments.
Step 1: Import all libraries:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
Step 2: Read PDF File
# write a for-loop to open many files -- leave a comment if you'd like to learn how
filename = 'enter the name of the file here'
# open allows you to read the file
pdfFileObj = open(filename, 'rb')
# The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
# The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# This if statement exists to check if the above library returned words.
# It's done because PyPDF2 cannot read scanned files.
if text != "":
    text = text
# If the above returns as False, we run the OCR library textract to
# convert scanned/image-based PDF files into text
else:
    text = textract.process(filename, method='tesseract', language='eng')
# Now we have a text variable which contains all the text derived from our PDF file.
# Type print(text) to see what it contains. It likely contains a lot of spaces,
# possibly junk such as '\n' etc.
# Now we will clean our text variable and return it as a list of keywords.
Step 3: Convert text into keywords
# The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)
# we'll create a new list which contains punctuation we wish to clean
punctuations = ['(', ')', ';', ':', '[', ']', ',']
# We initialize the stopwords variable, which is a list of words like
# "The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
# We create a list comprehension which only returns a list of words
# that are NOT IN stop_words and NOT IN punctuations
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
Now you have keywords for your file stored as a list. You can do whatever you want with it. Store it in a spreadsheet if you want to make the PDF searchable, or parse a lot of files and conduct a cluster analysis. You can also use it to create a recommender system for resumes for jobs ;)
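For example, a quick way to see which keywords dominate the document is to count them (a small sketch that only uses the keywords list built above; the limit of 10 is arbitrary):
# Count how often each keyword occurs; the most frequent ones are good search-term candidates
from collections import Counter

keyword_counts = Counter(keywords)
print(keyword_counts.most_common(10))  # the 10 most frequent keywords (10 is an arbitrary choice)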

How to fix 'ValueError("input must have more than one sentence")' Error

I'm writing a script that takes a website URL and downloads it using Beautiful Soup. It then uses gensim.summarization to summarize the text, but I keep getting ValueError("input must have more than one sentence") even though the text has more than one sentence. The first section of the script, which downloads the text, works, but I can't get the second part to summarize the text.
import bs4 as bs
import urllib.request
from gensim.summarization import summarize
from gensim.summarization.textcleaner import split_sentences
#===========================================
print("(Insert URL)")
url = input()
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
#===========================================
print(soup.title.string)
with open(soup.title.string + '.txt', 'wb') as file:
    for paragraph in soup.find_all('p'):
        text = paragraph.text.replace('.', '.\n')
        text = split_sentences(text)
        text = summarize(str(text))
        text = text.encode('utf-8', 'ignore')
        #===========================================
        file.write(text + '\n\n'.encode('utf-8'))
After the script is run, it should create a .txt file with the summarized text in it, in whatever folder the .py file is located.
You should not use split_sentences() before passing the text to summarize() since summarize() takes a string (with multiple sentences) as input.
In your code you are first turning your text into a list of sentences (using split_sentences()) and then converting that back to a string (with str()). The result of this is a string like "['First sentence', 'Second sentence']". It doesn't make sense to pass this on to summarize().
Instead you should simply pass your raw text as input:
text = summarize(text)
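Applied to the loop above, that would look roughly like this (a sketch of the corrected loop, keeping the same file-writing setup as in the question; note that a paragraph containing only one sentence would still raise the error, so short paragraphs are skipped here):
with open(soup.title.string + '.txt', 'wb') as file:
    for paragraph in soup.find_all('p'):
        text = paragraph.text
        # summarize() needs a plain string with more than one sentence
        if len(split_sentences(text)) > 1:
            summary = summarize(text)
            file.write(summary.encode('utf-8', 'ignore') + '\n\n'.encode('utf-8'))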

How can I write text to a PDF file?

I'm using Python 3, and I have a long text file; I would like to create a new PDF and write the text into it.
I tried using reportlab, but it writes only one line.
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf")
c.drawString(100,750, text)
c.save()
I know that I can tell it on which line to write what. But is there a library where I can just give it the text and the margins, and it will write it into the PDF file?
Thanks
EDIT:
Or, instead of that, could I also use a library that easily converts a txt file to a PDF file?
Simply drawing your string on the canvas won't do the job.
If it's just raw text and you don't need to apply modifications such as headings to it, then you can simply put your text into a Flowable, i.e. a Paragraph, and your Flowables can be appended to your story[].
You can adjust the margins according to your use.
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch
from reportlab.lib.pagesizes import letter

styles = getSampleStyleSheet()
styleN = styles['Normal']
styleH = styles['Heading1']
story = []
pdf_name = 'your_pdf_file.pdf'
doc = SimpleDocTemplate(
    pdf_name,
    pagesize=letter,
    bottomMargin=.4 * inch,
    topMargin=.6 * inch,
    rightMargin=.8 * inch,
    leftMargin=.8 * inch)

with open("your_text_file.txt", "r") as txt_file:
    text_content = txt_file.read()

P = Paragraph(text_content, styleN)
story.append(P)
doc.build(
    story,
)
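If the text file uses blank lines to separate paragraphs and you want those breaks to survive in the PDF (a single Paragraph treats newlines as plain spaces), one possible variation, assuming the same text_content, styleN and doc as above, is to append one Paragraph per block instead of one Paragraph for the whole file:
# Sketch: one Paragraph per blank-line-separated block of the text file
story = []
for block in text_content.split('\n\n'):
    story.append(Paragraph(block, styleN))
doc.build(story)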
For more information on Flowables, see the reportlab user guide.
