How to decode encoded text from a web page

My school's site has an API page that I want to parse, but the response looks like the text below, and I don't know how to decode it with Python.
(The encoded letters are Cyrillic.)
The data from the page (it looks like this even in the browser):
\u0421\u0434\u0430\u0442\u044c \u043f\u043e\u0441\u043b\u0435 \u043a\u0430\u043d\u0438\u043a\u0443\u043b, 15 \u0430\u043f\u0440\u0435\u043b\u044f. <br />\r\n\u0423\u0431\u0435\u0434\u0438\u0442\u0435\u043b\u044c\u043d\u0430\u044f \u043f\u0440\u043e\u0441\u044c\u0431\u0430 \u043e\u0444\u043e\u0440\u043c\u043b\u044f\u0442\u044c \u0440\u0435\u0448\u0435\u043d\u0438\u0435 "\u043a\u0430\u043a \u043f\u043e\u043b\u043e\u0436\u0435\u043d\u043e" \u0432 \u0441\u043e\u043e\u0442\u0432\u0435\u0442\u0441\u0442\u0432\u0438\u0438 \u0441 \u0442\u0435\u043c "\u043a\u0430\u043a \u0443\u0447\u0438\u043b\u0438", \u0430 \u043d\u0435 \u0442\u0430\u043a, \u0431\u0443\u0434\u0442\u043e \u0431\u044b \u0432\u044b \u0435\u0433\u043e \u043d\u0430 \u043a\u043e\u043b\u0435\u043d\u043a\u0435 \u0437\u0430 5 \u043c\u0438\u043d\u0443\u0442 \u043f\u0435\u0440\u0435\u0434 \u0441\u0434\u0430\u0447\u0435\u0439 \u0434\u0435\u043b\u0430\u043b\u0438. \u041f\u0438\u0441\u0430\u0442\u044c \u0440\u0430\u0437\u0431\u043e\u0440\u0447\u0438\u0432\u043e \u0438 \u0430\u043a\u043a\u0443\u0440\u0430\u0442\u043d\u043e.
The data that I want to get:
Сдать после каникул, 15 апреля. <br />\r\nУбедительная просьба оформлять решение "как положено" в соответствии с тем "как учили", а не так, будто бы вы его на коленке за 5 минут перед сдачей делали. Писать разборчиво и аккуратно.

Prepared by #APIuz team
#For python3
import json
import io
text = '\u0421\u0434\u0430\u0442\u044c \u043f\u043e\u0441\u043b\u0435 \u043a\u0430\u043d\u0438\u043a\u0443\u043b, 15 \u0430\u043f\u0440\u0435\u043b\u044f. <br />\r\n\u0423\u0431\u0435\u0434\u0438\u0442\u0435\u043b\u044c\u043d\u0430\u044f \u043f\u0440\u043e\u0441\u044c\u0431\u0430 \u043e\u0444\u043e\u0440\u043c\u043b\u044f\u0442\u044c \u0440\u0435\u0448\u0435\u043d\u0438\u0435 "\u043a\u0430\u043a \u043f\u043e\u043b\u043e\u0436\u0435\u043d\u043e" \u0432 \u0441\u043e\u043e\u0442\u0432\u0435\u0442\u0441\u0442\u0432\u0438\u0438 \u0441 \u0442\u0435\u043c "\u043a\u0430\u043a \u0443\u0447\u0438\u043b\u0438", \u0430 \u043d\u0435 \u0442\u0430\u043a, \u0431\u0443\u0434\u0442\u043e \u0431\u044b \u0432\u044b \u0435\u0433\u043e \u043d\u0430 \u043a\u043e\u043b\u0435\u043d\u043a\u0435 \u0437\u0430 5 \u043c\u0438\u043d\u0443\u0442 \u043f\u0435\u0440\u0435\u0434 \u0441\u0434\u0430\u0447\u0435\u0439 \u0434\u0435\u043b\u0430\u043b\u0438. \u041f\u0438\u0441\u0430\u0442\u044c \u0440\u0430\u0437\u0431\u043e\u0440\u0447\u0438\u0432\u043e \u0438 \u0430\u043a\u043a\u0443\u0440\u0430\u0442\u043d\u043e.'
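# Write and read with an explicit encoding so the result does not depend on the platform default
with io.open("APIuz.txt", "w", encoding="utf-8") as f:
    f.write(json.dumps(text, ensure_ascii=False))
with io.open("APIuz.txt", "r", encoding="utf-8") as f:
    print(f.read())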
#For python2
print(repr(u'\u0421\u0434\u0430\u0442\u044c \u043f\u043e\u0441\u043b\u0435 \u043a\u0430\u043d\u0438\u043a\u0443\u043b, 15 \u0430\u043f\u0440\u0435\u043b\u044f. <br />\r\n\u0423\u0431\u0435\u0434\u0438\u0442\u0435\u043b\u044c\u043d\u0430\u044f \u043f\u0440\u043e\u0441\u044c\u0431\u0430 \u043e\u0444\u043e\u0440\u043c\u043b\u044f\u0442\u044c \u0440\u0435\u0448\u0435\u043d\u0438\u0435 "\u043a\u0430\u043a \u043f\u043e\u043b\u043e\u0436\u0435\u043d\u043e" \u0432 \u0441\u043e\u043e\u0442\u0432\u0435\u0442\u0441\u0442\u0432\u0438\u0438 \u0441 \u0442\u0435\u043c "\u043a\u0430\u043a \u0443\u0447\u0438\u043b\u0438", \u0430 \u043d\u0435 \u0442\u0430\u043a, \u0431\u0443\u0434\u0442\u043e \u0431\u044b \u0432\u044b \u0435\u0433\u043e \u043d\u0430 \u043a\u043e\u043b\u0435\u043d\u043a\u0435 \u0437\u0430 5 \u043c\u0438\u043d\u0443\u0442 \u043f\u0435\u0440\u0435\u0434 \u0441\u0434\u0430\u0447\u0435\u0439 \u0434\u0435\u043b\u0430\u043b\u0438. \u041f\u0438\u0441\u0430\u0442\u044c \u0440\u0430\u0437\u0431\u043e\u0440\u0447\u0438\u0432\u043e \u0438 \u0430\u043a\u043a\u0443\u0440\u0430\u0442\u043d\u043e.').decode('unicode-escape'))
#The code is just what you think
#>>> u'Сдать после каникул, 15 апреля. <br /> Убедительная просьба оформлять решение "как положено" в соответствии с тем "как учили", а не так, будто бы вы его на коленке за 5 минут перед сдачей делали. Писать разборчиво и аккуратно.'

Simplest solution in my case:
some_string.encode('utf-8').decode()
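
Note that encode('utf-8').decode() hands back a str unchanged, so it only appears to work when the text has already been decoded. If the response body really contains literal backslash-u sequences, as in the question, it is most likely JSON, and a JSON parser will decode the escapes. A minimal sketch under that assumption (the "task" key and the shortened payload are made up for illustration; the real API's field names are unknown):

```python
import json

# Raw body as it arrives over the wire: the backslashes are literal characters,
# which is why the browser shows \u0421 instead of the Cyrillic letter.
# The "task" key is hypothetical.
raw = r'{"task": "\u0421\u0434\u0430\u0442\u044c \u043f\u043e\u0441\u043b\u0435 \u043a\u0430\u043d\u0438\u043a\u0443\u043b"}'

data = json.loads(raw)          # the JSON parser turns \uXXXX into real characters
decoded = data["task"]
print(decoded)

# Fallback for a bare ASCII fragment that is not a complete JSON document
fragment = r'\u0421\u0434\u0430\u0442\u044c'
fragment_decoded = fragment.encode('ascii').decode('unicode_escape')
print(fragment_decoded)
```

If the page is fetched with requests, response.json() should perform the same json.loads step on the response body.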

Related

XML parsing Issue in Python

I am trying to parse the following XML subtitle file (only a chunk is shown here):
<?xml version="1.0" encoding="utf-8"?>
<document id="6736625">
<s id="1">
<time id="T1S" value="00:02:54,941" />
- Le requin t'a eue.
</s>
<s id="2">
- Tu es sérieuse ?
</s>
<s id="3">
Regarde ce que tu as fait.
<time id="T1E" value="00:02:58.251" />
</s>
<s id="4">
<time id="T2S" value="00:02:58,351" />
Je vais t'en chercher un autre.
</s>
<s id="5">
On peut faire quelque chose, je m'ennuie....
<time id="T2E" value="00:03:01,249" />
</s>
...
with the following Python code:
import xml.etree.ElementTree as ET

tree = ET.parse('data/6736625.xml')
root = tree.getroot()
myPhrasesArray = [""]
for q in root:
    try:
        a = q.text
        b = a
        myPhrasesArray.append(b)
    except:
        print(" arh ")
print(myPhrasesArray)
but it returns :
['', '', '- Tu es sérieuse ?', 'Regarde ce que tu as fait.',
'', "On peut faire quelque chose, je m'ennuie....", '', '',
"J'ai promis à Stuart de l'appeler.", '', '- A tes ordres.',
.....
I can't seem to find a way to get the text of an <s> element when a <time id=... value=... /> line comes before the actual text.
Any help?

Try it using lxml:
import lxml.html
subt = [your html above]
doc = lxml.html.fromstring(subt)
dialog = doc.xpath('//*/text()')
myPhrasesArray = []
for d in dialog:
    if len(d.strip())>0:
        myPhrasesArray.append(d.strip())
myPhrasesArray
Output:
["- Le requin t'a eue.",
'- Tu es sérieuse ?',
'Regarde ce que tu as fait.',
"Je vais t'en chercher un autre.",
"On peut faire quelque chose, je m'ennuie...."]

Rather than doing for q in root, you want to only iterate over s tags.
You can use ElementTree.iter() or ElementTree.findall(). The former looks in everything no matter how deep. The latter only looks at direct children. For the example you've given findall() would make more sense.
myPhrasesArray = []  # just start with it empty
for s in root.findall('s'):
    myPhrasesArray.append(s.text)
Given this is pretty simple, you could even do it in one line:
myPhrasesArray = [s.text for s in root.findall('s')]
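
The difference between the two lookups can be seen on a tiny document. This is only an illustrative sketch: the <group> wrapper and the ids are invented so that one <s> sits deeper than the other.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <s> is a direct child, the other is nested in <group>
xml_doc = """<document id="demo">
  <s id="1">first</s>
  <group>
    <s id="2">nested</s>
  </group>
</document>"""

root = ET.fromstring(xml_doc)

# findall('s') only matches direct children of root
direct = [s.get("id") for s in root.findall("s")]
# iter('s') walks the whole subtree, however deep
everywhere = [s.get("id") for s in root.iter("s")]

print(direct)      # ['1']
print(everywhere)  # ['1', '2']
```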

Solved it this way
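import xml.etree.ElementTree as ET

mySentences = []
parsedXml = ET.parse("data/tst/1914/" + str(filename))
root = parsedXml.getroot()
for child in root:
    if child.tag == "s":
        # itertext() also yields the text that follows the <time /> children
        a = ''.join(child.itertext()).strip().lower()
        if a.startswith("-"):
            a = a.lstrip("-")
        mySentences.append(a)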
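
Why itertext() helps: in ElementTree, an element's .text only holds the text before its first child, and anything after a child such as <time /> is stored as that child's tail. A self-contained sketch on a shortened copy of the question's sample (parsed from a string here rather than from a file, purely for illustration):

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample from the question
xml_doc = """<document id="6736625">
<s id="1">
<time id="T1S" value="00:02:54,941" />
- Le requin t'a eue.
</s>
<s id="2">
- Tu es sérieuse ?
</s>
</document>"""

root = ET.fromstring(xml_doc)
first = root.find("s")

# .text stops at the first child, so only the newline before <time /> is seen
text_only = (first.text or "").strip()
# itertext() concatenates the element's text and the tails of its children
full_text = "".join(first.itertext()).strip()

phrases = ["".join(s.itertext()).strip() for s in root.findall("s")]
print(repr(text_only))
print(full_text)
print(phrases)
```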

GitBook: Why do `1.` prefixes appear and how can I delete them?

This is my SUMMARY.md file:
# ОГЛАВЛЕНИЕ
1. [Введение](README.md)
1. [Назначение](chapters/001-introduction/README.md)
1. [Соглашения, принятые в документах](chapters/001-introduction/agreements.md)
1. [Границы проекта](chapters/001-introduction/project-borders.md)
1. [Ссылки](chapters/001-introduction/references.md)
1. [Общее описание](chapters/002-common-description/README.md)
1. [Общий взгляд на продукт](chapters/002-common-description/overview.md)
1. [Классы и характеристики пользователей](chapters/002-common-description/users.md)
1. [Операционная среда](chapters/002-common-description/system-requirements.md)
1. [Ограничения дизайна и реализации](chapters/002-common-description/restrictions.md)
1. [Предположения и зависимости](chapters/002-common-description/assumptions-and-dependencies.md)
This is the PDF publishing result:
Why do the 1. prefixes appear, and how can I delete them?

Beautiful Soup: help with defining the search criteria

I am trying to scrape a website that lists terms with their English translation and explanation. With Beautiful Soup and Requests I have managed to get the tr or td entries, but could someone suggest how to refine my query to extract only the term and the explanation? I suspect it is impossible to separate the English translation from the rest, though. The 'table' on the actual website is a combination of 19 tables (http://www.nodimarinari.it/Protopage10.html). The following is one entry:
<tr>
<td valign="top" width="124">
<div align="center"><b><font color="#000066">tabella delle maree</font></b></div>
</td>
<td align="left" width="616"><font color="#000066"> (s.f.) (Maree) tide
table Tabella che riporta, per una determinata zona, l'andamento della
marea, cioe' giorno ed ora delle massime/minime e le relative variazioni
dei livelli delle acque. </font></td>
</tr>
<tr>
<td valign="top" width="124">
<div align="center"><font color="#000066"><b></b></font></div>
</td>
<td align="left" width="616"><font color="#000066"></font></td>
</tr>
Ideally I would like to get 'tabella delle maree','tide table', 'Tabella che riporta... '

Here are my results, based on my understanding of your question. Let me know if this is what you're trying to achieve.
Code:
import requests
from bs4 import BeautifulSoup
url = requests.get("http://www.nodimarinari.it/Protopage10.html")
soup = BeautifulSoup(url.text, 'html.parser')
words = soup.find_all("font", color="#000066")
num = 0
term = []
explanation = []
for word in words:
    if len(word.text) > 1:
        if num % 2 == 0:
            term.append(word.text)
        elif num % 2 == 1:
            explanation.append(str(word.text).replace("\n", ""))
        num += 1
for i in range(0, len(term)):
    print(term[i] + ": " + explanation[i] + "\n")
Output:
abbandonare: (v.) (Emergenze) to abandon L'attodi lasciare la imbarcazione pericolante da parte del capitano e dell'equipaggio,dopo aver esaurito tutti i tentativi suggeriti dall'arte nautica.
abbisciare: (v.) (Cavi e nodi) to range, tojag Preparare su uno spazio piano un cavo od una catena, ad ampie spire,in modo che il cavo o la catena possano svolgersi liberamente e scorreresenza impedimenti.
abbittare: (v.) (Cavi e nodi) to bitt Legareun cavo o una catena ad una bitta.
abbonacciare: (v.) (Vento e Mare) to becalm, tofall calm Il calmarsi del vento e del mare
abbordaggio: (s.m.) [A]. (Abbordi) collisionUrto o collisione accidentale tra imbarcazioni. La legislazione marittimaprevede un'ampio corredo di "norme per evitare l'abbordo in mare" checostituiscono la regolamentazione di base della navigazione [B]. a. intenzionaleboarding, running foul Investire intenzionalmente un'altra imbarcazionecon l'obiettivo di danneggiarla o affondarla. Anticamente si chiamavaanche "arrembaggio"
abbordare: (v.) (A bordo) to collide, to board,to run into vedi "abbordaggio".
abbordo: (s.m.) (Abbordi) collision Equivalentemoderno del termine "abbordaggio".
abbozzare: (v.) (Cavi e nodi) to stop Trattenerecon una legatura provvisoria, detta bozza, un cavo od una catena tesa,per evitarne lo scorrimento durante il tempo necessario per legarla definitivamente.
...
...
...
vogare: (v.) (Remi) to row Sinonimodi "remare".
volta: (s.f.) [A]. (Cavi enodi) bitter, wrap Indica comunemente un giro di un cavo o di una catenaattorno ad una bitta o ad un altro attrezzo atto a trattenere la cimastessa. [B]. dar v. (Manovre) to cleat, to belay, to make fast Legareuna cima od una catena attorno ad una bitta o una galloccia, con o senzanodi.
zattera di salvataggio: (s.f.) (A bordo) liferaftVedi "autogonfiabile".
zavorra: (s.f.) (Terminologia)ballast Pesi che vengono disposti a bordo dell'imbarcazione, normalmenteil piu' in basso possibile nella chiglia, per equilibrare la spinta lateraledel vento ed il conseguente movimento di rollio e di sbandata. Normalmentetali pesi sono in piombo e nei velieri moderni e' la lama di deriva stessa,costruita con metalli pesanti, ad essere zavorrata. In alcune particolarivelieri da regata la zavorra e' costituita da acqua di mare che puo' esserepompata in apposite casse (dal lato sopravvento) quando necessario.
zenit: (s.m.) (Geografia) zenitPunto della sfera celeste che si trova esattamente sulla verticale aldi sopra del luogo di osservazione.
zinco: (s.m.) (Varie) zinc,zinc plate Metallo dotato di particolari caratteristiche elettrolitichecol quale si realizzano piastre ed elementi (chiamati anodi) che vengonoutilizzati per evitare la corrosione di elementi metallici immersi adopera delle correnti galvaniche.Tali elementi vengono posizionati nellaparte immersa dello scafo, in corrispondenza di parti metalliche, in particolaredell'elica e della zavorra, e nel tempo si consumano evitando la corrosionedelle parti metalliche contigue. Per questo vengono anche chiamati zinchisacrificali.
Explanation:
I used BeautifulSoup to parse all the terms and their meanings. To do that, I had to extract the text of every <font color="#000066"> element in the HTML.
The words list contains the data in the following format:
[<font color="#000066">A</font>, <font color="#000066"> </font>, <font color="#000066"><b></b></font>, <font color="#000066"></font>, <font color="#000066">abbandonare</font>, <font color="#000066"> (v.) (Emergenze) to abandon L'atto
di lasciare la imbarcazione pericolante da parte del capitano e dell'equipaggio,
dopo aver esaurito tutti i tentativi suggeriti dall'arte nautica. </font>, <font color="#000066"><b></b></font>, <font color="#000066"></font>, <font color="#000066">abbisciare</font>, <font color="#000066"> (v.) (Cavi e nodi) to range, to
jag Preparare su uno spazio piano un cavo od una catena, ad ampie spire, ETC.]
The condition if len(word.text) > 1: skips the single-letter entries (e.g. ['A', 'B', ..., 'Z']), so only the terms and their meanings are processed.
The condition if num % 2 == 0: copies every term from the words list into a list called term, which is meant to contain only terms.
The condition elif num % 2 == 1: copies every explanation from the words list into a list called explanation, which is meant to contain only explanations.
The final for loop is for printing only.
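
One side effect of replace("\n", "") is visible in the output above: words that were separated only by a line break in the HTML end up fused ("L'attodi", "scorreresenza"). A possible alternative, shown here as a sketch on a shortened sample string rather than on the live page, is to collapse every run of whitespace into a single space:

```python
# Shortened sample of one explanation as it appears in the page source
raw = "(v.) (Emergenze) to abandon L'atto\ndi lasciare la imbarcazione"

# Dropping the newline glues the neighbouring words together
fused = raw.replace("\n", "")

# split() with no arguments splits on any whitespace run; join puts single spaces back
normalized = " ".join(raw.split())

print(fused)
print(normalized)
```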

BeautifulSoup get element inside li tag

I'm having trouble parsing an HTML element inside an li tag.
This is my code:
from bs4 import BeautifulSoup
import requests
sess = requests.Session()
url = 'http://example.com'
page = sess.get(url)
page = BeautifulSoup(page.text)
soap = page.select('li.item')
print(soap.find('h3').text)
This is html code:
...
<li class="item">
<strong class="item-type">design</strong>
<h3 class="item-title">Item title</h3>
<p class="item-description">
Lorem ipsum dolor sit amet, dicam partem praesent vix ei, ne nec quem omnium cotidieque, omnes deseruisse efficiendi sit te. Mei putant postulant id. Cibo doctus eligendi at vix. Eos nisl exerci mediocrem cu, nullam pertinax petentium sea et. Vim affert feugait an.
</p>
</li>
...
There are more than 10 li tags; I have pasted just one of them.
Output error:
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    print(soap.find('h3').text)
AttributeError: 'list' object has no attribute 'find'

Thanks to @DaveJ, this method worked:
[s.find('h3').text for s in soap]
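
The error comes from select() returning a list of matches rather than a single tag, so each match has to be searched individually. A small offline sketch (the HTML is a cut-down stand-in for the page in the question, and the titles are invented); select_one() is an alternative when only the first match is needed:

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the real page; the titles are made up
html = """
<ul>
  <li class="item"><h3 class="item-title">First title</h3></li>
  <li class="item"><h3 class="item-title">Second title</h3></li>
</ul>
"""

page = BeautifulSoup(html, "html.parser")

# select() returns every matching <li>, so loop over the results
items = page.select("li.item")
titles = [li.find("h3").text for li in items]

# select_one() returns just the first matching element
first_title = page.select_one("li.item h3").text

print(titles)
print(first_title)
```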

Cyrillic Alphabet to English or Latin

I hope you are all well. I have an Excel sheet that contains Cyrillic characters, and I would like to convert it to English/Latin. Is there an easy way to do this in Excel? Thank you in advance for the help.

Your screenshot shows no Cyrillic letters: those would look like e.g. the Ukrainian alphabet letters: А а Б б В в Г г Ґ ґ Д д Е е Є є Ж ж З з И и І і Ї ї Й й К к Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш Щ щ Ь ь Ю ю Я я
What you have is a clear case of mojibake, as the next example shows. The file 39142772.txt contains some accented characters (all Central European Latin). It is based on lines 1, 10 and 23 of your data, retyped as valid Czech and Hungarian names, and saved with UTF-8 encoding:
==> chcp 65001
Active code page: 65001
==> type D:\test\39142772.txt
1 STÁTNÍ ÚSTAV PRO KONTROLU LÉČIV
10 Pikó, Béla
23 Móricz, István
==> chcp 1252
Active code page: 1252
==> type D:\test\39142772.txt
1 STÃTNÃ ÃšSTAV PRO KONTROLU LÃ‰ÄŒIV
10 PikÃ³, BÃ©la
23 MÃ³ricz, IstvÃ¡n
==>
Explanation: the chcp command changes the active console code page.
With chcp 65001 (UTF-8), the file is displayed properly.
With chcp 1252 (West European Latin), the accented characters in the file are turned into mojibake, exactly as shown in your screenshot.
The same mojibake transformation happens if you import a .txt or .csv file into Excel using the wrong encoding.
Solution: import the .txt or .csv file into Excel using the proper encoding. The procedure is described here: Is it possible to force Excel recognize UTF-8 CSV files automatically?.
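
The same transformation can be reproduced outside the console. The sketch below uses Python purely as an illustration (the answer itself only uses chcp and Excel): it encodes one of the names as UTF-8, misreads the bytes as code page 1252, and then reverses the mistake.

```python
original = "Pikó, Béla"

# UTF-8 bytes misread as Windows-1252: each accented letter becomes two characters
garbled = original.encode("utf-8").decode("cp1252")

# Undoing the wrong decoding step recovers the original text
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```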
