XML parsing Issue in Python - python-3.x

I am trying to parse the following chunk of an XML file (subtitles):
<?xml version="1.0" encoding="utf-8"?>
<document id="6736625">
<s id="1">
<time id="T1S" value="00:02:54,941" />
- Le requin t'a eue.
</s>
<s id="2">
- Tu es sérieuse ?
</s>
<s id="3">
Regarde ce que tu as fait.
<time id="T1E" value="00:02:58.251" />
</s>
<s id="4">
<time id="T2S" value="00:02:58,351" />
Je vais t'en chercher un autre.
</s>
<s id="5">
On peut faire quelque chose, je m'ennuie....
<time id="T2E" value="00:03:01,249" />
</s>
...
with the following Python code:
import xml.etree.ElementTree as ET
tree = ET.parse('data/6736625.xml')
root = tree.getroot()
myPhrasesArray = [""]
for q in root:
    try:
        a = q.text
        b = a
        myPhrasesArray.append(b)
    except :
        print(" arh ")
print(myPhrasesArray)
but it returns :
['', '', '- Tu es sérieuse ?', 'Regarde ce que tu as fait.',
'', "On peut faire quelque chose, je m'ennuie....", '', '',
"J'ai promis à Stuart de l'appeler.", '', '- A tes ordres.',
.....
I can't seem to find a way to get the text of an <s> element when a <time id=... value=... /> line comes before the actual text.
Any help?

Try it using lxml:
import lxml.html
subt = [your html above]
doc = lxml.html.fromstring(subt)
dialog = doc.xpath('//*/text()')
myPhrasesArray = []
for d in dialog:
    if len(d.strip())>0:
        myPhrasesArray.append(d.strip())
myPhrasesArray
Output:
["- Le requin t'a eue.",
'- Tu es sérieuse ?',
'Regarde ce que tu as fait.',
"Je vais t'en chercher un autre.",
"On peut faire quelque chose, je m'ennuie...."]

Rather than doing for q in root, you want to only iterate over s tags.
You can use ElementTree.iter() or ElementTree.findall(). The former looks in everything no matter how deep. The latter only looks at direct children. For the example you've given findall() would make more sense.
myPhrasesArray = [] # just start with it empty
for s in root.findall('s'):
    myPhrasesArray.append(s.text)
Given this is pretty simple, you could even do it in one line:
myPhrasesArray = [s.text for s in root.findall('s')]
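One caveat, shown here as a minimal sketch on a trimmed copy of the sample from the question: in ElementTree, s.text only holds the text that comes before the first child element. Text that follows a child such as <time ... /> is stored in that child's tail, so s.text alone still comes back blank for those entries. Joining itertext() picks up both. The variable names below are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the subtitle sample from the question
xml_data = """<document id="6736625">
<s id="1">
<time id="T1S" value="00:02:54,941" />
- Le requin t'a eue.
</s>
<s id="2">
- Tu es sérieuse ?
</s>
</document>"""

root = ET.fromstring(xml_data)
first = root.find('s')

# .text stops at the first child element, so here it is only whitespace
print(repr(first.text))
# the sentence itself is stored as the tail of the <time> child
print(repr(first.find('time').tail))

# itertext() yields text and tails in document order
phrases = [''.join(s.itertext()).strip() for s in root.findall('s')]
print(phrases)
```

With this approach, entries with and without a leading <time> element are handled the same way.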

Solved it this way
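import xml.etree.ElementTree as ET
mySentences = []
# filename comes from the surrounding code (not shown here)
parsedXml = ET.parse("data/tst/1914/"+ str(filename))
root = parsedXml.getroot()
for child in root:
    if child.tag == "s":
        # itertext() also returns the text that follows a <time> child
        a = ''.join(child.itertext()).strip().lower()
        if a.startswith("-"):
            a = a.lstrip("-")
        mySentences.append(a)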

Related

Cannot import SongLyrics from lyrics_extractor

I wrote this code:
from lyrics_extractor import SongLyrics
apiKey = 'AIzaSyBDPVEi1OtzB3Nm6i9fd8HTkMCjsselIpM'
engineID = '35df92fbe0cad839c'
extract_lyrics = SongLyrics(apikey, engineID)
lyrics = extract_lyrics.get_lyrics("Reyes de la noche")
print(lyrics)
It is not working, I obtained this import error:
ImportError: cannot import name 'Songlyrics' from 'lyrics_extractor' (C:\Users\mica\AppData\Local\Programs\Python\Python39\lib\site-packages\lyrics_extractor\__init__.py)
What could be wrong?
I found two problems:
1. You need to install the library with pip:
   pip install lyrics-extractor
2. apiKey has the wrong capitalization in line 4: the variable is defined as apiKey but passed as apikey.
A further suggestion: the returned value is a dict, which looks nicer when printed like this:
from lyrics_extractor import SongLyrics
apiKey = 'AIzaSyBDPVEi1OtzB3Nm6i9fd8HTkMCjsselIpM'
engineID = '35df92fbe0cad839c'
extract_lyrics = SongLyrics(apiKey, engineID)
lyrics = extract_lyrics.get_lyrics("Reyes de la noche")
print(lyrics['title'])
print(lyrics['lyrics'])
Output:
Guasones – Reyes De La Noche Lyrics
Fuimos mucho mas que nada
Fuimos la mentira
Fuimos lo peor
Fuimos los soldados a la madrugada
Con esta ambición
Y ahora estoy en libertad
Y ahora que puedo pensar
En no volver hacer ese
El mismo de antes
Y que tristeza hay en la ciudad, amor
Sábado soleado
Y en el centro de la estatua del dolor
Me sentí parado
Fuimos muchos más que todos
Reyes de la noche
De esta tempestad
Si te vendí, si te robe, te traicione
Fui por uno más
Fuimos perros de la noche
Oxidados en tristeza
Y querer lo que querer
Sin tener que lastimar
Recordando que tu amor
Se robo la dignidad
Ahora olvidemos los dos
No volvamos a empezar
¿Para que?...
PS: I was manually writing out lyrics this morning for a song called "Brothers and Sisters" by Les Nubians. When I tried to use this library to get the lyrics for that song, like this:
extract_lyrics.get_lyrics("Brothers and Sisters")
... I got lyrics for a more popular song of the same title by another musical group. However, after adding the name of the group I wanted, I got the correct song's lyrics:
extract_lyrics.get_lyrics("Brothers and Sisters Nubians")
Installation
pip install lyrics-extractor
For more details go to this

can't get the right text when webscraping a webpage (with Python 3)

This is an example of what 'row' contains when I execute my Python code:
<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland
When I execute my code, I get the text 'Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'.
What I really want is the part after 'title=', so this text: 'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
I'm new to this and find it very difficult to understand. Can someone point me in the right direction?
Here is my code at the moment:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
#print(soup)
content=''
rows = soup.find_all('a')
tel=0
for row in rows:
    if tel != 0:
        #'tel' is only used to skip the first returned result. The first resultline is something that i don't need. I only need all the text after every 'title ='
        print(row)
        extract=row.get_text()
        print('')
        print(extract)
        content=content+extract+'\n'
    if tel == 10:
        #a loop of max 10 times gives me enough information, i only need the first 10 articles
        break
    else:
        tel=tel+1
#show me the result-text of the crawling
print('')
print('RESULT TEXT OF THE FIRST 10 ARTICLES:')
print(content)
You want the title attribute. Based on your description, the following nth-child selector gets the first 10 descriptions. I also needed the 'lxml' parser; run pip3 install lxml if it is not installed.
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
print([i['title'] for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)')])
CSS:
li:nth-child(-n+10) > a:nth-child(2)
This asks for the first 10 li elements with at least two a tag children, and selects for the second a tag in each case. The > is a child combinator specifying that what is on the right must be a child of what is on the left.
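To see why -n+10 selects the first ten items: :nth-child(an+b) matches a 1-based position i when i = a*n + b for some integer n >= 0. The following is a small sketch of that arithmetic only. The helper name matches_nth_child is hypothetical and is not part of Beautiful Soup or any CSS library.

```python
def matches_nth_child(a, b, index):
    """True if 1-based position `index` equals a*n + b for some integer n >= 0."""
    if a == 0:
        return index == b
    n, remainder = divmod(index - b, a)
    return remainder == 0 and n >= 0

# :nth-child(-n+10) corresponds to a = -1, b = 10
first_ten = [i for i in range(1, 16) if matches_nth_child(-1, 10, i)]
print(first_ten)

# :nth-child(2) corresponds to a = 0, b = 2
second_only = [i for i in range(1, 16) if matches_nth_child(0, 2, i)]
print(second_only)
```

Out of 15 sibling positions, the first expression keeps positions 1 through 10 and the second keeps only position 2.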
Read about:
:nth-child ranges
Child combinators, pseudo class selectors and type selectors
Following the additional request, here is a for loop that also prints the published date and time:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)'):
    print(i.parent['title'])
    print(i.parent.a.text.strip())
    print(i['title'])
    print()
If I understand your question correctly, you should try to modify your code as follows:
...
for row in rows:
    if tel != 0:
        print(row)
        extract=row["title"]
        content=content+extract.replace("Meer informatie", "")+'\n'
...
From a quick look at what you're doing: simply replace get_text() with .get('title') and it should work:
>>> data = '''<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(data)
>>> soup.find('a').get('title')
'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
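Not every link on a page necessarily carries a title attribute, so it can help to skip those that lack one. The same idea can be sketched without third-party packages using the standard library's html.parser. This is an alternative illustration, not the Beautiful Soup approach from the answers, and the markup and the TitleCollector class below are hypothetical.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; .get returns None if missing
        title = dict(attrs).get("title")
        if title:
            self.titles.append(title)

# Hypothetical markup: the second link has no title attribute
html = (
    '<li><a href="/detail/1.html" title="First description">First headline</a></li>'
    '<li><a href="/detail/2.html">Second headline</a></li>'
    '<li><a href="/detail/3.html" title="Third description">Third headline</a></li>'
)

collector = TitleCollector()
collector.feed(html)
titles = collector.titles
print(titles)
```

Only the two links that have a title contribute to the result; the link without one is skipped instead of raising an error.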

Beautiful Soup: help with defining the search criteria

I am trying to scrape a website with terms and their English translations and explanations. Using Beautiful Soup and Requests I have managed to get the tr or td entries, but could someone suggest how to refine my request to extract only the term and the explanation? I suspect it is impossible to separate the English translation from the rest, though. The 'table' on the actual website is a combination of 19 tables (http://www.nodimarinari.it/Protopage10.html). The following is one entry:
<tr>
<td valign="top" width="124">
<div align="center"><b><font color="#000066">tabella delle maree</font></b></div>
</td>
<td align="left" width="616"><font color="#000066"> (s.f.) (Maree) tide
table Tabella che riporta, per una determinata zona, l'andamento della
marea, cioe' giorno ed ora delle massime/minime e le relative variazioni
dei livelli delle acque. </font></td>
</tr>
<tr>
<td valign="top" width="124">
<div align="center"><font color="#000066"><b></b></font></div>
</td>
<td align="left" width="616"><font color="#000066"></font></td>
</tr>
Ideally I would like to get 'tabella delle maree','tide table', 'Tabella che riporta... '
Here are my results, based on my understanding of your question. Let me know if this is what you're trying to achieve.
Code:
import requests
from bs4 import BeautifulSoup
url = requests.get("http://www.nodimarinari.it/Protopage10.html")
soup = BeautifulSoup(url.text, 'html.parser')
words = soup.find_all("font", color="#000066")
num = 0
term = []
explanation = []
for word in words:
    if len(word.text) > 1:
        if num % 2 == 0:
            term.append(word.text)
        elif num % 2 == 1:
            explanation.append(str(word.text).replace("\n", ""))
        num += 1
for i in range(0, len(term)):
    print(term[i] + ": " + explanation[i] + "\n")
Output:
abbandonare: (v.) (Emergenze) to abandon L'attodi lasciare la imbarcazione pericolante da parte del capitano e dell'equipaggio,dopo aver esaurito tutti i tentativi suggeriti dall'arte nautica.
abbisciare: (v.) (Cavi e nodi) to range, tojag Preparare su uno spazio piano un cavo od una catena, ad ampie spire,in modo che il cavo o la catena possano svolgersi liberamente e scorreresenza impedimenti.
abbittare: (v.) (Cavi e nodi) to bitt Legareun cavo o una catena ad una bitta.
abbonacciare: (v.) (Vento e Mare) to becalm, tofall calm Il calmarsi del vento e del mare
abbordaggio: (s.m.) [A]. (Abbordi) collisionUrto o collisione accidentale tra imbarcazioni. La legislazione marittimaprevede un'ampio corredo di "norme per evitare l'abbordo in mare" checostituiscono la regolamentazione di base della navigazione [B]. a. intenzionaleboarding, running foul Investire intenzionalmente un'altra imbarcazionecon l'obiettivo di danneggiarla o affondarla. Anticamente si chiamavaanche "arrembaggio"
abbordare: (v.) (A bordo) to collide, to board,to run into vedi "abbordaggio".
abbordo: (s.m.) (Abbordi) collision Equivalentemoderno del termine "abbordaggio".
abbozzare: (v.) (Cavi e nodi) to stop Trattenerecon una legatura provvisoria, detta bozza, un cavo od una catena tesa,per evitarne lo scorrimento durante il tempo necessario per legarla definitivamente.
...
...
...
vogare: (v.) (Remi) to row Sinonimodi "remare".
volta: (s.f.) [A]. (Cavi enodi) bitter, wrap Indica comunemente un giro di un cavo o di una catenaattorno ad una bitta o ad un altro attrezzo atto a trattenere la cimastessa. [B]. dar v. (Manovre) to cleat, to belay, to make fast Legareuna cima od una catena attorno ad una bitta o una galloccia, con o senzanodi.
zattera di salvataggio: (s.f.) (A bordo) liferaftVedi "autogonfiabile".
zavorra: (s.f.) (Terminologia)ballast Pesi che vengono disposti a bordo dell'imbarcazione, normalmenteil piu' in basso possibile nella chiglia, per equilibrare la spinta lateraledel vento ed il conseguente movimento di rollio e di sbandata. Normalmentetali pesi sono in piombo e nei velieri moderni e' la lama di deriva stessa,costruita con metalli pesanti, ad essere zavorrata. In alcune particolarivelieri da regata la zavorra e' costituita da acqua di mare che puo' esserepompata in apposite casse (dal lato sopravvento) quando necessario.
zenit: (s.m.) (Geografia) zenitPunto della sfera celeste che si trova esattamente sulla verticale aldi sopra del luogo di osservazione.
zinco: (s.m.) (Varie) zinc,zinc plate Metallo dotato di particolari caratteristiche elettrolitichecol quale si realizzano piastre ed elementi (chiamati anodi) che vengonoutilizzati per evitare la corrosione di elementi metallici immersi adopera delle correnti galvaniche.Tali elementi vengono posizionati nellaparte immersa dello scafo, in corrispondenza di parti metalliche, in particolaredell'elica e della zavorra, e nel tempo si consumano evitando la corrosionedelle parti metalliche contigue. Per questo vengono anche chiamati zinchisacrificali.
Explanation:
So I used BeautifulSoup to parse all the terms and their meanings. To do that I had to parse all the text inside all the <font color="#000066"> elements inside the html code.
The words list contains the data in the following format:
[<font color="#000066">A</font>, <font color="#000066"> </font>, <font color="#000066"><b></b></font>, <font color="#000066"></font>, <font color="#000066">abbandonare</font>, <font color="#000066"> (v.) (Emergenze) to abandon L'atto
di lasciare la imbarcazione pericolante da parte del capitano e dell'equipaggio,
dopo aver esaurito tutti i tentativi suggeriti dall'arte nautica. </font>, <font color="#000066"><b></b></font>, <font color="#000066"></font>, <font color="#000066">abbisciare</font>, <font color="#000066"> (v.) (Cavi e nodi) to range, to
jag Preparare su uno spazio piano un cavo od una catena, ad ampie spire, ETC.]
The condition if len(word.text) > 1: is used to ignore the single letters parsed (e.g. ['A', 'B', ..., 'Z']), so that we only deal with the terms and the meanings.
The next condition, if num % 2 == 0:, moves every term from the words list into a list called term, which is meant to contain only terms.
The last condition, elif num % 2 == 1:, moves every explanation from the words list into a list called explanation, which is meant to contain only explanations.
The last for loop is for printing purposes only.
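The alternating term/explanation logic can also be sketched with slicing. The sample cells below are hypothetical: they are modelled on two entries from the output above, with line breaks assumed where the words appear fused. Replacing "\n" with an empty string is what glues words together ("Legareun"), so this sketch collapses whitespace into single spaces instead.

```python
# Hypothetical text of each <font> cell, in page order
cells = [
    "A", " ", "", "",
    "abbittare",
    " (v.) (Cavi e nodi) to bitt Legare\nun cavo o una catena ad una bitta. ",
    "", "",
    "abbordo",
    ' (s.m.) (Abbordi) collision Equivalente\nmoderno del termine "abbordaggio". ',
]

# Keep only cells longer than one character, as in the answer
kept = [c for c in cells if len(c) > 1]

# Even positions are terms, odd positions are explanations
terms = kept[0::2]
# " ".join(text.split()) turns newlines and repeated spaces into single spaces
explanations = [" ".join(c.split()) for c in kept[1::2]]

glossary = dict(zip(terms, explanations))
for term, explanation in glossary.items():
    print(term + ": " + explanation)
```

Like the original loop, this assumes terms and explanations strictly alternate once the short cells are filtered out.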

use of Scrapy's regular expressions

I'm practicing scraping newspaper articles with Scrapy. I have some problems extracting substrings from web pages. With the built-in re and re_first functions I can set where the search starts, but I can't find how to set where it ends.
Here follows the code:
import scrapy
from spider.items import Articles
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
class QuotesSpider(scrapy.Spider):
    name = "lastampa"
    allowed_domains = ['lastampa.it']
    def start_requests(self):
        urls = [
            'http://www.lastampa.it/2017/10/26/economia/lavoro/datalogic-cerca-laureati-e-laureandi-per-la-ricerca-e-sviluppo-rPsS8gVM5ZX7gEZcugklwJ/pagina.html'
        ]
        # issue one request per start URL
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        items = []
        item = Articles()
        # XPath attribute tests are written with @, e.g. @class
        item['date'] = response.xpath('//div[contains(@class, "ls-articoloDataPubblicazione")]').re_first(r'content=\s*(.*)')
        item['author'] = response.xpath('//div[contains(@class, "ls-articoloAutore")]').re_first(r'">\s*(.*)')
        item['title'] = response.xpath('//div[contains(@class, "ls-articoloTitolo")]').re_first(r'<h3>\s*(.*)')
        item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]').re_first(r'">\s*(.*)')
        item['text'] = response.xpath('//div[contains(@class, "ls-articoloTesto")]').re_first(r'<p>\s*(.*)')
        items.append(item)
With this code I get the text I need, but also all the tags that follow it up to the end of the match.
e.g.
'subtitle': 'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>'
How can I exclude the trailing </div> (or any other characters after a defined point)?
Can someone shed some light on this? Thanks.
You don't have to mess with regular expressions like this if you only need to extract the text from elements. Just use text() in your XPath expressions. For example for subtitle field, extract it using:
item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]/text()').extract_first().strip()
which, in your example, will produce:
u'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam'
Article text is a bit more complicated, as it's spread over child elements. However, you can extract it from there, post-process it and store as a single string like this:
item['text'] = ''.join([x.strip() for x in response.xpath('//div[contains(@class, "ls-articoloTesto")]//text()').extract()])
which will produce:
u'Datalogic, leader mondiale nelle tecnologie di identificazione automatica dei dati e dei processi di automazione industriale, ha lanciato la selezione di giovani talenti laureati o laureandi in ingegneria e/o in altre materie scientifiche che abbiano voglia di confrontarsi con una realt\xe0 internazionale. Il primo step delle selezioni si \xe8 tenuto in occasione dell\u2019Open Day: i candidati hanno sostenuto un primo colloquio con senior manager nazionali e internazionali, arrivati per l\u2019occasione anche da sedi estere. Durante l\u2019Open Day \xe8 stata presentata l\u2019azienda insieme alle 80 posizioni aperte di cui una cinquantina in Italia. La ricerca dei candidati \xe8 rivolta a laureandi e neolaureati di tutta Italia interessati a far parte di una realt\xe0 dinamica e internazionale, provenienti da Facolt\xe0 tecnico-scientifiche quali Ingegneria, Fisica, Informatica.Ai giovani verr\xe0 proposto un interessante percorso di carriera che permetter\xe0 di approfondire e sviluppare tecnologie all\u2019avanguardia nei 10 centri di Ricerca dell\u2019azienda dislocati tra Italia, Stati Uniti, Cina e Vietnam, dove si ha l\u2019opportunit\xe0 di approfondire tutte le tematiche relative all\u2019elettronica e al software design: dallo sviluppo di sistemi operativi mobili, di sensori laser e tecnologie ottiche, allo studio di sistemi di intelligenza artificiale. Informazioni sul sito www.datalogic.com.Alcuni diritti riservati.'
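If a regular expression is still preferred, the literal question of where to end the match can be answered with a lazy quantifier followed by the closing tag. Below is a minimal sketch using Python's re module on the subtitle markup reconstructed from the question. The input string is an assumption, and the same pattern should in principle also work when passed to re_first.

```python
import re

# Markup reconstructed from the question's example output (assumed)
html = ('<div class="ls-articoloCatenaccio">'
        'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>')

# Greedy: (.*) runs to the end of the line and swallows the closing tag
greedy = re.search(r'">\s*(.*)', html).group(1)

# Lazy: (.*?) stops as soon as the rest of the pattern, </div>, can match
lazy = re.search(r'">\s*(.*?)\s*</div>', html).group(1)

print(greedy)
print(lazy)
```

The greedy version reproduces the trailing </div> from the question, while the lazy version returns only the subtitle text.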

Accentuation in PHP, showing Latin symbols

I'm writing a program to parse HTML data from a Portuguese website. The thing is, when I echo the data I read, I get these weird symbols:
Meu PC estragou e tenho um netbook que u
so para assuntos acadΩmicos. Que raiva, nπo roda nem CS aqui. Aff que raiva! Com
o pode? Jß coloquei todas as mais baixar configuraτ⌡es e nao roda!
the original text is:
Meu PC estragou e tenho um netbook que u
so para assuntos acadêmicos. Que raiva, não roda nem CS aqui. Aff que raiva! Com
o pode? Já coloquei todas as mais baixar configurações e nao roda!
Notice the accented characters:
acadêmicos -> acadΩmicos
Já -> Jß
How do I fix this? I already tried:
echo utf8_decode($assunto);
but it didn't work. Help!
Set the encoding in the header.
header( 'Content-Type: text/html; charset=UTF-8' );
http://php.net/manual/en/function.header.php
Then, to make sure the browser uses UTF-8, add this meta tag to the HTML head:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
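A note on diagnosis, sketched in Python since the other examples on this page use it: the substitutions in the question (ê becomes Ω, á becomes ß) match what ISO-8859-1 bytes look like when displayed in code page 437, which many Windows consoles use by default. That is an inference from the sample text, not something the asker states. If it holds, the data itself is fine and the output is only being displayed in the wrong encoding, which may be why utf8_decode, which converts UTF-8 to ISO-8859-1, made no difference.

```python
# Words taken from the question's original text
samples = ["acadêmicos", "não", "Já", "configurações"]

for word in samples:
    raw = word.encode("latin-1")       # the bytes as ISO-8859-1
    garbled = raw.decode("cp437")      # the same bytes read as code page 437
    print(word, "->", garbled)

# The garbling is reversible because no bytes were lost
garbled = "acadΩmicos"
repaired = garbled.encode("cp437").decode("latin-1")
print(repaired)
```

If the output goes to a browser rather than a console, the header and meta tag above remain the relevant fix, provided the declared charset matches the bytes actually sent.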
