Problem with PyPDF2 in Python 3.11: getting unreadable Russian text - python-3.x

I'm trying to parse a PDF with Russian text, but can't make it work:
from PyPDF2 import PdfReader
reader = PdfReader("./mpp/1.pdf")
page = reader.pages[0]
m = page.extract_text()
print(m)
File pdf: https://disk.yandex.ru/i/qSJfFZJFDuLDIA
I tried using .encode, but it doesn't help.
In the end, I need this:
route1 = {
"bus": "Yutong г/н 499",
"stations": "12"
}
stations = ['1; ул. Арсеньева, м-н Пчелка; 7:40','2; ул. Агеева, ДК "Юность"; 7:44', '3; ул. Горького, минирынок Исток; 7:55', '....etc up to 12']
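Once readable text comes out of the PDF, entries in the `stations` format above could be split into fields. This is only a sketch of that post-processing step: `parse_station` is a hypothetical helper, and it assumes every entry follows the `number; name; time` layout shown in the question. It does not address the extraction problem itself.

```python
def parse_station(entry):
    """Split one 'number; name; time' entry into a dict (hypothetical helper)."""
    # maxsplit=2 yields exactly three parts for the layout assumed above
    number, name, time = (part.strip() for part in entry.split(";", 2))
    return {"number": int(number), "name": name, "time": time}


stations = [
    '1; ул. Арсеньева, м-н Пчелка; 7:40',
    '2; ул. Агеева, ДК "Юность"; 7:44',
]
parsed = [parse_station(s) for s in stations]
print(parsed[0]["name"])
```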

Related

How to fix unidentified character problem while passing data from TKinter to Photoshop via Python script?

I made a GUI Application which looks like this:
The ones marked red are Tkinter Text widgets and the ones marked yellow are Tkinter Entry widgets
After taking user input, the data is added to a PSD file and then rendered as an image. But let's say it takes the following data as input:
It renders the following Photoshop file:
How do I fix this? It does not recognize "\n" properly, so the rendered document is useless.
Here is the code that converts the user input into strings, adds it to the Photoshop template, and renders it:
def DataAdder2CSV():
    global edate, eSNO, eage, egender, ename, ePID, econtact, ecomp, eallergy, ehistory, eR
    # Single-line fields come from Entry widgets
    e=edate.get()
    a=eSNO.get()
    d=eage.get()
    f=egender.get()
    b=ename.get()
    c=ePID.get()
    g=econtact.get()
    # Multi-line fields come from Text widgets
    h=ecomp.get(1.0,END)
    i=eallergy.get(1.0,END)
    j=ehistory.get(1.0,END)
    k=eR.get(1.0,END)
    # Append the record to the CSV database
    data=[a,b,c,d,e,f,g,h,i,j,k]
    file=open("Patient_Data.csv","a", newline="")
    writer=csv.writer(file, delimiter=",")
    writer.writerow(data)
    file.close()
    messagebox.showinfo("Prescription Generator", "Data has been saved to the database successfully!")
    import win32com.client, os
    objShell = win32com.client.Dispatch("WScript.Shell")
    UserDocs = objShell.SpecialFolders("MyDocuments")
    from tkinter import filedialog
    ExpDir=filedialog.askdirectory(initialdir=UserDocs, title="Choose Destination Folder")
    psApp = win32com.client.Dispatch("Photoshop.Application")
    # Raw string so the backslashes in the Windows path are not treated as escapes
    psApp.Open(r"D:\Coding\Python Scripts\Dr Nikhil Prescription App\Prescription Generator\Presc_Template.psd")
    doc = psApp.Application.ActiveDocument
    # Fill each text layer of the template
    lf1 = doc.ArtLayers["name"]
    tol1 = lf1.TextItem
    tol1.contents = b
    lf2 = doc.ArtLayers["age"]
    tol2 = lf2.TextItem
    tol2.contents = d
    lf3 = doc.ArtLayers["gender"]
    tol3 = lf3.TextItem
    tol3.contents = f
    lf4 = doc.ArtLayers["pid"]
    tol4 = lf4.TextItem
    tol4.contents = c
    lf4 = doc.ArtLayers["date"]
    tol4 = lf4.TextItem
    tol4.contents = e
    lf5 = doc.ArtLayers["contact"]
    tol5 = lf5.TextItem
    tol5.contents = g
    lf6 = doc.ArtLayers["complaint"]
    tol6 = lf6.TextItem
    varH=" "+h.rstrip("\n")
    tol6.contents =varH
    lf7 = doc.ArtLayers["allergy"]
    tol7 = lf7.TextItem
    tol7.contents = i.rstrip("\n")
    lf8 = doc.ArtLayers["history"]
    tol8 = lf8.TextItem
    varJ=" "+j.rstrip("\n")
    tol8.contents =varJ
    lf9 = doc.ArtLayers["R"]
    tol9 = lf9.TextItem
    tol9.contents = k.rstrip("\n")
    # Export the filled template as a PNG
    options = win32com.client.Dispatch('Photoshop.ExportOptionsSaveForWeb')
    options.Format = 13
    options.PNG8 = False
    pngfile =ExpDir+f"/{c}-{b}_({e}).png"
    doc.Export(ExportIn=pngfile, ExportAs=2, Options=options)
    messagebox.showinfo("Prescription Generator", "Prescription has been saved in the desired location successfully!")
There are three conventions for newline characters:
Classic Mac OS (before OS X) uses \r
Linux and modern macOS use \n
Windows uses \r\n
Python and tkinter use \n, but it looks like Photoshop (psApp.Application) expects \r instead. That is why the document isn't rendered properly; replacing \n with \r before assigning the text should fix it.
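A minimal sketch of that conversion, assuming (as the answer suggests) that Photoshop text layers want a bare \r as the line separator. The helper name `to_photoshop_newlines` is hypothetical; it would be applied to each Text widget value before assigning it to `TextItem.contents`.

```python
def to_photoshop_newlines(text):
    """Convert \\n and \\r\\n line endings to a bare carriage return."""
    # Collapse Windows-style \r\n first so it does not turn into \r\r
    return text.replace("\r\n", "\n").replace("\n", "\r")


# Text.get(1.0, END) always appends a trailing newline, so strip it first
raw = "Fever\nHeadache\n"
converted = to_photoshop_newlines(raw.rstrip("\n"))
print(repr(converted))
```

In the question's code this would look like `tol7.contents = to_photoshop_newlines(i.rstrip("\n"))`.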

Getting a neighbor's element in an XML file with Python ElementTree

I have a big problem managing data in XML files in Python. I need the value in the ValorConta1 tag, but I only have the value of NumeroConta, which is a child of PlanoConta.
<InfoFinaDFin>
<NumeroIdentificadorInfoFinaDFin>15501</NumeroIdentificadorInfoFinaDFin>
...
<PlanoConta>
<NumeroConta>2.02.01</NumeroConta>
</PlanoConta>
...
<ValorConta1>300</ValorConta1>
The code I wrote:
import xml.etree.ElementTree as ET
InfoDin = ET.parse('arquivos_xml/InfoFinaDFin.xml')
target_element_value = '2.01.01'
passivo = InfoDin.findall('.//PlanoConta[NumeroConta="' + target_element_value +'"]/../ValorConta1')
Try this.
from simplified_scrapy import SimplifiedDoc
html = '''
<InfoFinaDFin>
<NumeroIdentificadorInfoFinaDFin>15501</NumeroIdentificadorInfoFinaDFin>
...
<PlanoConta>
<NumeroConta>2.02.01</NumeroConta>
</PlanoConta>
...
<ValorConta1>300</ValorConta1>
</InfoFinaDFin>
'''
doc = SimplifiedDoc(html)
# print (doc.select('PlanoConta>NumeroConta>text()'))
# print (doc.select('ValorConta1>text()'))
ele = doc.NumeroConta.parent.getNext('ValorConta1')
# or
ele = doc.getElementByTag('ValorConta1',start='</NumeroConta>')
print (ele.text)
Result:
300
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
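For comparison, the same lookup can be sketched with only the standard library's ElementTree. This assumes that ValorConta1 and PlanoConta share the same parent, as in the sample; `find_valor_conta` is a hypothetical helper, and the elided `...` parts of the sample are left out. Note also that the question's code searches for 2.01.01, while the sample contains 2.02.01, which by itself would give an empty result.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<InfoFinaDFin>
  <NumeroIdentificadorInfoFinaDFin>15501</NumeroIdentificadorInfoFinaDFin>
  <PlanoConta>
    <NumeroConta>2.02.01</NumeroConta>
  </PlanoConta>
  <ValorConta1>300</ValorConta1>
</InfoFinaDFin>"""


def find_valor_conta(root, numero_conta):
    """Return the ValorConta1 text of the element whose PlanoConta matches."""
    # root.iter() visits the root itself and every descendant
    for parent in root.iter():
        # Only the first PlanoConta child of each element is checked
        plano = parent.find("PlanoConta")
        if plano is not None and plano.findtext("NumeroConta") == numero_conta:
            return parent.findtext("ValorConta1")
    return None


root = ET.fromstring(SAMPLE)
print(find_valor_conta(root, "2.02.01"))
```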

How can I visualize coordinates from a csv file using folium?

I have data in a CSV file:
ID, Name, Address, Latitude, Longitude.
FOO BO 34 Zako, Kost str.55 49.2955102 19.95274595
FOO B1 55 Vara, Dost str 44 49.4814 20.0303
ZOO B2 56 XXXX, YYYY str 99 49.5551 21.6766
I would like to visualize this data on the map using folium in python3.
The example code for coordinates is:
import folium
logo_url = 'https://upload.wikimedia.org/wikipedia/en/c/c6/Logo_link.png'
start_lat = 52.2138
start_lng = 20.9795
csv_data_lat = 49.2955102
csv_data_lng = 19.95274595
map_1 = folium.Map(location=[start_lat, start_lng], zoom_start=12,control_scale = True)
icon = folium.features.CustomIcon(logo_url, icon_size=(50, 50))
folium.Marker([csv_data_lat, csv_data_lng], popup='PKOBP', icon=icon).add_to(map_1)
map_1.save('map1.html')
How can I do this for data from a CSV file?
I guess you could do something like this instead of the folium.Marker line:
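import pandas as pd

def add_marker(row, map, icon):
    # Column names must match the CSV header exactly (e.g. "Latitude" rather than "latitude")
    folium.Marker([row["latitude"], row["longitude"]], popup='PKOBP', icon=icon).add_to(map)

df = pd.read_csv('your_csv.csv')
df.apply(add_marker, axis=1, args=(map_1, icon,))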
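If pandas is not available, the coordinates could also be read with the standard csv module. The sketch below assumes a comma-delimited file with the header row given in the question. The sample row reuses coordinates from the question, but its ID and address values are made up for illustration, and `read_points` is a hypothetical helper. Each returned tuple could then be passed to folium.Marker as in the question's code.

```python
import csv
import io


def read_points(csv_file):
    """Return (name, latitude, longitude) tuples from an open CSV file."""
    # skipinitialspace tolerates a header written as "ID, Name, Address, ..."
    reader = csv.DictReader(csv_file, skipinitialspace=True)
    points = []
    for row in reader:
        points.append((row["Name"], float(row["Latitude"]), float(row["Longitude"])))
    return points


# Illustrative in-memory file; a real script would use open('your_csv.csv', newline='')
sample = io.StringIO(
    "ID, Name, Address, Latitude, Longitude\n"
    "1, PKOBP, Example str 1, 49.2955102, 19.95274595\n"
)
points = read_points(sample)
print(points)
```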

AvroTypeException: When writing in python3

My avsc file is as follows:
{"type":"record",
"namespace":"testing.avro",
"name":"product",
"aliases":["items","services","plans","deliverables"],
"fields":
[
{"name":"id", "type":"string" ,"aliases":["productid","itemid","item","product"]},
{"name":"brand", "type":"string","doc":"The brand associated", "default":"-1"},
{"name":"category","type":{"type":"map","values":"string"},"doc":"the list of categoryId, categoryName associated, send Id as key, name as value" },
{"name":"keywords", "type":{"type":"array","items":"string"},"doc":"this helps in long run in long run analysis, send the search keywords used for product"},
{"name":"groupid", "type":["string","null"],"doc":"Use this to represent or flag value of group to which it belong, e.g. it may be variation of same product"},
{"name":"price", "type":"double","aliases":["cost","unitprice"]},
{"name":"unit", "type":"string", "default":"Each"},
{"name":"unittype", "type":"string","aliases":["UOM"], "default":"Each"},
{"name":"url", "type":["string","null"],"doc":"URL of the product to return for more details on product, this will be used for event analysis. Provide full url"},
{"name":"imageurl","type":["string","null"],"doc":"Image url to display for return values"},
{"name":"updatedtime", "type":"string"},
{"name":"currency","type":"string", "default":"INR"},
{"name":"image", "type":["bytes","null"] , "doc":"fallback in case we cant provide the image url, use this judiciously and limit size"},
{"name":"features","type":{"type":"map","values":"string"},"doc":"Pass your classification attributes as features in key-value pair"}
]}
I am able to parse this, but when I try to write data with it as follows, I keep getting an AvroTypeException. What am I missing? This is in Python 3. I verified that the schema is well-formed JSON, too.
from avro import schema as sc
from avro import datafile as df
from avro import io as avio
import os
_prodschema = 'product.avsc'
_namespace = 'testing.avro'
dirname = os.path.dirname(__file__)
avroschemaname = os.path.join( os.path.dirname(__file__),_prodschema)
sch = {}
with open(avroschemaname,'r') as f:
    sch= f.read().encode(encoding='utf-8')
    f.close()
proschema = sc.Parse(sch)
print("Schema processed")
writer = df.DataFileWriter(open(os.path.join(dirname,"products.json"),'wb'),
avio.DatumWriter(),proschema)
print("Just about to append the json")
writer.append({ "id":"23232",
"brand":"Relaxo",
"category":[{"123":"shoe","122":"accessories"}],
"keywords":["relaxo","shoe"],
"groupid":"",
"price":"799.99",
"unit":"Each",
"unittype":"Each",
"url":"",
"imageurl":"",
"updatedtime": "03/23/2017",
"currency":"INR",
"image":"",
"features":[{"color":"black","size":"10","style":"contemperory"}]
})
writer.close()
What am I missing here ?
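One likely cause, judging only from the schema and the datum shown above (this has not been run against the avro package): several values do not have the Python type their schema field calls for. A "map" field expects a dict rather than a list of dicts, a "double" expects a number rather than a string, and a ["bytes","null"] union expects bytes or None rather than a str. A sketch of a datum whose types line up with the schema:

```python
# Same values as in the question, with types adjusted to match the schema
record = {
    "id": "23232",
    "brand": "Relaxo",
    "category": {"123": "shoe", "122": "accessories"},  # map -> dict
    "keywords": ["relaxo", "shoe"],                      # array -> list
    "groupid": "",
    "price": 799.99,                                     # double -> float
    "unit": "Each",
    "unittype": "Each",
    "url": "",
    "imageurl": "",
    "updatedtime": "03/23/2017",
    "currency": "INR",
    "image": None,                                       # ["bytes","null"] -> bytes or None
    "features": {"color": "black", "size": "10", "style": "contemperory"},
}
print(type(record["category"]).__name__, type(record["price"]).__name__)
```

Passing `record` to `writer.append(...)` in place of the original dict would be the change to try.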

How to create PDF containing Persian(Farsi) text with reportlab, rtl and bidi in python

I've been trying to create a PDF file from content that can be English, Persian, digits, or a combination of them.
There are some problems with Persian text such as "این یک متن فارسی است":
1- The text must be written from right to left.
2- Characters have different forms in different positions within a word (that is, characters change shape according to their surrounding characters).
3- Because the sentence is read from right to left, the normal textwrap doesn't work correctly.
I used reportlab to create the PDF, but unfortunately reportlab doesn't support the Arabic and Persian alphabets, so I used the 'rtl' library by Vahid Mardani and the 'pybidi' library by Meir Kriheli to make the text look right in the resulting PDF.
First we need to add a font that supports Persian to reportlab.
In Ubuntu 14.04, copy Bahij-Nazanin-Regular.ttf into the
/usr/local/lib/python3.4/dist-packages/reportlab/fonts folder.
Add the font and styles to reportlab:
from reportlab.lib.enums import TA_RIGHT
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
pdfmetrics.registerFont(TTFont('Persian', 'Bahij-Nazanin-Regular.ttf'))
styles = getSampleStyleSheet()
styles.add(ParagraphStyle(name='Right', alignment=TA_RIGHT, fontName='Persian', fontSize=10))
In the next step we need to reshape the Persian letters into their correct forms and make the direction of each word right to left:
from bidi.algorithm import get_display
from rtl import reshaper
import textwrap
def get_farsi_text(text):
    if reshaper.has_arabic_letters(text):
        words = text.split()
        reshaped_words = []
        for word in words:
            if reshaper.has_arabic_letters(word):
                # for reshaping and joining the letters of a word
                reshaped_text = reshaper.reshape(word)
                # for right to left
                bidi_text = get_display(reshaped_text)
                reshaped_words.append(bidi_text)
            else:
                reshaped_words.append(word)
        # reverse the word order so the sentence reads right to left
        reshaped_words.reverse()
        return ' '.join(reshaped_words)
    return text
For adding a bullet or wrapping the text, we could use the following function:
def get_farsi_bulleted_text(text, wrap_length=None):
    farsi_text = get_farsi_text(text)
    if wrap_length:
        line_list = textwrap.wrap(farsi_text, wrap_length)
        # the text is in visual order, so the wrapped lines come out last-first
        line_list.reverse()
        line_list[0] = '{} •'.format(line_list[0])
        farsi_text = '<br/>'.join(line_list)
        return '<font>%s</font>' % farsi_text
    return '<font>%s •</font>' % farsi_text
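Why the wrapped lines have to be reversed can be seen without any Persian text. The sketch below uses placeholder tokens w1..w6 (made up for illustration) to stand for the words of a sentence in reading order, and the hypothetical helper `wrap_visual` to mimic the wrap-then-reverse step of get_farsi_bulleted_text.

```python
import textwrap


def wrap_visual(logical_words, width):
    """Wrap a sentence stored in visual (word-reversed) order."""
    # Visual order: the last word in reading order comes first in the string
    visual = " ".join(reversed(logical_words))
    lines = textwrap.wrap(visual, width)
    # textwrap fills lines from the start of the string, i.e. from the END of
    # the sentence, so the line order is flipped to put the beginning on top
    lines.reverse()
    return lines


lines = wrap_visual(["w1", "w2", "w3", "w4", "w5", "w6"], width=8)
print(lines)  # ['w3 w2 w1', 'w6 w5 w4']
```

Read right to left, the top line gives w1 w2 w3 and the second gives w4 w5 w6. One side effect of wrapping this way is that any short leftover line ends up at the top rather than at the bottom of the paragraph.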
To test the code, we can write:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
doc = SimpleDocTemplate("farsi_wrap.pdf", pagesize=letter, rightMargin=72, leftMargin=72, topMargin=72,
bottomMargin=18)
Story = []
text = 'شاید هنوز اندروید نوقا برای تمام گوشی‌های اندرویدی عرضه نشده باشد، ولی اگر صاحب یکی از گوشی‌های نکسوس یا پیک' \
'سل باشید احتمالا تا الان زمان نسبتا زیادی را با آخرین نسخه‌ی اندروید سپری کرده‌اید. اگر در کار با اندروید نوقا' \
' دچار مشکل شده‌اید، با دیجی‌کالا مگ همراه باشید تا با هم برخی از رایج‌ترین مشکلات گزارش شده و راه حل آن‌ها را' \
' بررسی کنیم. البته از بسیاری از این روش‌ها در سایر نسخه‌های اندروید هم می‌توانید استفاده کنید. اندروید برخلاف iOS ' \
'روی گستره‌ی وسیعی از گوشی‌ها با پوسته‌ها و اپلیکیشن‌های اضافی متنوع نصب می‌شود. بنابراین تجویز یک نسخه‌ی مشترک برا' \
'ی حل مشکلات آن کار چندان ساده‌ای نیست. با این حال برخی روش‌های عمومی وجود دارد که بهتر است پیش از هر چیز آن‌ها را' \
' بیازمایید.'
tw = get_farsi_bulleted_text(text, wrap_length=120)
p = Paragraph(tw, styles['Right'])
Story.append(p)
doc.build(Story)
After working with Reportlab for a while, we had some problems with organizing and formatting the output. It took a lot of time and was rather complicated.
So we decided to work with pdfkit and jinja2. This way we can format and organize the document in HTML and CSS, and we don't need to reshape the Persian text either.
First we can design an HTML template file like the one below:
<!DOCTYPE html>
<html>
<head lang="fa-IR">
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
    <p dir="rtl">سوابق کاری</p>
    <ul dir="rtl">
        {% for experience in experiences %}
        <li><a href="{{ experience.url }}">{{ experience.title }}</a></li>
        {% endfor %}
    </ul>
</body>
</html>
and then we use jinja2 library to render our data into Template, and then use pdfkit to create a pdf from render result:
from jinja2 import Template
from pdfkit import pdfkit
sample_data = [{'url': 'http://www.google.com/', 'title': 'گوگل'},
{'url': 'http://www.yahoo.com/fa/', 'title': 'یاهو'},
{'url': 'http://www.amazon.com/', 'title': 'آمازون'}]
# read the template as UTF-8 so the Persian text is decoded correctly
with open('template.html', 'r', encoding='utf-8') as template_file:
    template_str = template_file.read()
template = Template(template_str)
resume_str = template.render({'experiences': sample_data})
options = {'encoding': "UTF-8", 'quiet': ''}
bytes_array = pdfkit.PDFKit(resume_str, 'string', options=options).to_pdf()
with open('result.pdf', 'wb') as output:
    output.write(bytes_array)
In case anyone wants to generate pdfs from html templates using Django, this is how it can be done:
template = get_template("app_name/template.html")
context = Context({'something':some_variable})
html = template.render(context)
pdf = pdfkit.from_string(html, False)
response = HttpResponse(pdf, content_type='application/pdf')
response['Content-Disposition'] = 'attachment; filename=output.pdf'
return response
Pass the multibyte (Farsi, Arabic) string as a parameter to the TypeScript function below, and give the returned string to pdfmake or any other PDF generator:
farsiNew(farsistr){
    // because pdfmake displays it mirrored by default
    var allText = '';
    var point = 19;
    var words = farsistr.split("\n");
    var newword;
    for(var i=0; i<=words.length-1; i++){
        newword = words[i].split( ' ');
        if (newword.length <point) {
            allText = allText + newword.reverse().join(' ') + "\n";
        }else{
            for(var q =0; q<= Math.ceil (newword.length / point); q++) {
                var s , t;
                if (q === 0) {
                    s = 0; t = point;
                }
                else {
                    s = q * point + q;
                    t = s + point;
                }
                for (var v = t; v >= s; v--) {
                    if(!newword[v])
                        continue;
                    allText = allText + ' ' +newword[v]
                }
                allText = allText + '\n';
            }
        }
    }
    return allText;
}
