Problems with xpath on scrapy - python-3.x

I'm using Scrapy to create a crawler.
I want to extract only the title of the links that I find.
This is the part of the markup that matters to me:
<a class="cor-produto busca-titulo" title="Melhorar a saúde, economia de tempo e dinheiro: Veja os benefícios do uso da bicicleta" href="//g1.globo.com/busca/click?q=economia&p=0&r=1472008380299&u=http%3A%2F%2Fg1.globo.com%2Fma%2Fmaranhao%2Fjmtv-2edicao%2Fvideos%2Fv%2Fmelhorar-a-saude-economia-de-tempo-e-dinheiro-veja-os-beneficios-do-uso-da-bicicleta%2F5256064%2F&t=informacional&d=false&f=false&ss=8bcd843f636c6982&o=&cat=a">Melhorar a saúde, economia de tempo e dinheiro: Veja os benefíc...</a>
I want to extract only the title, and I need to use XPath to do this. Does anyone have a suggestion?
Thank you! :)

The XPath would be:
//a/@title
With sel being your Selector instance:
sel.xpath('//a/@title').extract()
Or maybe just from the response object:
response.xpath('//a/@title').extract()
Output:
Melhorar a saúde, economia de tempo e dinheiro: Veja os benefícios do uso da bicicleta
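Scrapy itself isn't needed to try the expression out: lxml, which Scrapy's selectors are built on, evaluates the same XPath. A minimal sketch against a trimmed version of the anchor from the question:

```python
from lxml import html

# Trimmed version of the <a> element from the question (href shortened)
snippet = ('<a class="cor-produto busca-titulo" '
           'title="Melhorar a saúde, economia de tempo e dinheiro: '
           'Veja os benefícios do uso da bicicleta" href="#">'
           'Melhorar a saúde, economia de tempo e dinheiro: Veja os benefíc...</a>')

tree = html.fromstring(snippet)
# //a/@title returns the title attribute values of all <a> elements as a list
titles = tree.xpath('//a/@title')
print(titles[0])
```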

Related

can't get the right text when webscraping a webpage (with Python 3)

This is an example of the variable 'extract' when I execute my Python code:
<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland
When I execute my code, I currently get the text 'Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'.
What i really wanted is the part after 'title=', so this text: 'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
I'm new to this area and I find it very difficult to understand. Can someone point me in a good direction?
This is my code at the moment:
import requests
from bs4 import BeautifulSoup

url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
#print(soup)
content = ''
rows = soup.find_all('a')
tel = 0
for row in rows:
    if tel != 0:
        # 'tel' is only used to skip the first returned result. The first
        # result line is something that I don't need. I only need the text
        # after every 'title='.
        print(row)
        extract = row.get_text()
        print('')
        print(extract)
        content = content + extract + '\n'
    if tel == 10:
        # a loop of max 10 iterations gives me enough information,
        # I only need the first 10 articles
        break
    else:
        tel = tel + 1
# show me the result text of the crawling
print('')
print('RESULT TEXT OF THE FIRST 10 ARTICLES:')
print(content)
You want the title attribute. Based on the description, the following use of :nth-child will get you the first 10 descriptions. I also needed the 'lxml' parser; run pip3 install lxml if it's not installed.
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
print([i['title'] for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)')])
CSS:
li:nth-child(-n+10) > a:nth-child(2)
This asks for the first 10 li elements and, within each, selects the a tag that is the second child. The > is a child combinator, specifying that the element on the right must be a direct child of the element on the left.
Read about:
:nth-child ranges
Child combinators, pseudo class selectors and type selectors
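To see the selector in isolation, here is a minimal sketch against made-up markup; the real page is assumed to follow the li > a pattern described above, with the description carried by the second a tag in each li:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure described above:
# each <li> has a first <a> and a second <a> carrying the description.
html_doc = """
<ul>
  <li><a href="#">link</a><a title="desc 1" href="#">A</a></li>
  <li><a href="#">link</a><a title="desc 2" href="#">B</a></li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# first 10 <li> elements; in each, the <a> that is the second child
titles = [a["title"] for a in soup.select("li:nth-child(-n+10) > a:nth-child(2)")]
print(titles)  # ['desc 1', 'desc 2']
```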
Additional request: a for loop that also includes the published date/time info:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)'):
    print(i.parent['title'])
    print(i.parent.a.text.strip())
    print(i['title'])
    print()
If I understand your question correctly, you should try to modify your code as follows:
...
for row in rows:
    if tel != 0:
        print(row)
        extract = row["title"]
        content = content + extract.replace("Meer informatie", "") + '\n'
...
From a quick look at what you're doing: simply replace get_text() with .get('title'), and that should work:
>>> data = '''<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(data, 'html.parser')
>>> soup.find('a').get('title')
'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'

Reading a semicolon (';') separated raw text in Python

I have managed to find some solutions online but often without an explanation. I am new to Python and normally choose to rework my data in Excel. I would, however, like to learn how to deal with a problem like the following:
I was given data for Brazil in this form. The website says to "save the file as a csv". It looks like a mess...
Espírito Santo;
Dados;
Ano;População total;Homens;Mulheres;Nascimentos;Óbitos;Taxa de Crescimento Geométrico;Taxa Bruta de Natalidade;Taxa Bruta de Mortalidade;Esperança de Vida ao Nascer;Esperança de Vida ao Nascer - Homens;Esperança de Vida ao Nascer - Mulheres;Taxa de Mortalidade Infantil;Taxa de Mortalidade Infantil - Homens;Taxa de Mortalidade Infantil - Mulheres;Taxa de Fecundidade Total;Razão de Dependência - Jovens 0 a 14 anos;Razão de Dependência - Idosos 65 ou mais anos;Razão de Dependência;Índice de Envelhecimento;
2010;3596057;1772936;1823121;54018;19734;x;15.02;5.49;75.93;71.9;80.19;11.97;13.59;10.28;1.73;34.49;10.17;44.67;29.41;
2011;3642595;1795501;1847094;55387;19923;1.29;15.21;5.47;76.36;72.35;80.59;11.3;12.87;9.66;1.77;33.72;10.41;44.13;30.77;
2012;3689347;1818188;1871159;55207;20142;1.28;14.96;5.46;76.76;72.78;80.96;10.69;12.2;9.1;1.75;32.98;10.68;43.65;32.17;
2013;3736386;1841035;1895351;56785;20396;1.27;15.2;5.46;77.14;73.19;81.31;10.14;11.6;8.6;1.8;32.29;10.97;43.26;34.22;
2014;3784361;1864376;1919985;57964;20676;1.28;15.32;5.46;77.51;73.58;81.64;9.64;11.06;8.15;1.83;31.73;11.31;43.04;35.59;
2015;3832826;1887984;1944842;58703;20979;1.28;15.32;5.47;77.85;73.95;81.95;9.19;10.56;7.74;1.85;31.29;11.69;42.98;37.44;
2016;3879376;1910629;1968747;55091;21282;1.21;14.2;5.49;78.18;74.31;82.24;8.78;10.11;7.38;1.73;30.84;12.13;42.97;39.35;
2017;3925341;1932993;1992348;58530;21624;1.18;14.91;5.51;78.49;74.65;82.5;8.42;9.71;7.06;1.84;30.52;12.61;43.13;41.31;
2018;3972388;1955930;2016458;58342;22016;1.2;14.69;5.54;78.79;74.97;82.76;8.09;9.34;6.77;1.83;30.31;13.14;43.45;43.6;
2019;4018650;1978483;2040167;58106;22419;1.16;14.46;5.58;79.06;75.27;82.99;7.79;9;6.52;1.83;30.12;13.71;43.83;45.45;
I used MS Word to replace the ";" with "," and Excel's import-from-text to try to get a more familiar data frame.
How would you approach data in this form using Python? Save it as a .csv and then import it again with pandas? I am hoping for a better solution, such as keeping it as a triple-quoted (""") string.
You can tell the csv parser what the delimiter is; in this case it is ';':
import csv

with open('filepath.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=';')
    for row in reader:
        print(row)
How do you get this data? As a file? As a string?
If you have a file, you can use pandas to read the csv, which gives you a pandas DataFrame:
pandas.read_csv('filepath.csv', sep=';')
You can find more info here: https://realpython.com/python-csv/
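A self-contained sketch using only the standard library, on a trimmed version of the data pasted above; the two metadata lines at the top are skipped before parsing:

```python
import csv

# Trimmed version of the pasted data: two metadata lines, then header + rows
raw = """Espírito Santo;
Dados;
Ano;População total;Homens;Mulheres
2010;3596057;1772936;1823121
2011;3642595;1795501;1847094
"""
lines = raw.splitlines()[2:]               # drop the two metadata lines
reader = csv.reader(lines, delimiter=';')  # csv.reader accepts any iterable of strings
header = next(reader)
rows = list(reader)
print(header[:2])   # ['Ano', 'População total']
print(rows[0][:2])  # ['2010', '3596057']
```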

use of Scrapy's regular expressions

I'm practicing scraping newspaper articles with Scrapy. I have some problems extracting substrings of text from web pages. With the built-in re and re_first methods I can set where the search starts, but I can't find how to set where it ends.
Here follows the code:
import scrapy
from spider.items import Articles
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class QuotesSpider(scrapy.Spider):
    name = "lastampa"
    allowed_domains = ['lastampa.it']

    def start_requests(self):
        urls = [
            'http://www.lastampa.it/2017/10/26/economia/lavoro/datalogic-cerca-laureati-e-laureandi-per-la-ricerca-e-sviluppo-rPsS8gVM5ZX7gEZcugklwJ/pagina.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = []
        item = Articles()
        item['date'] = response.xpath('//div[contains(@class, "ls-articoloDataPubblicazione")]').re_first(r'content=\s*(.*)')
        item['author'] = response.xpath('//div[contains(@class, "ls-articoloAutore")]').re_first(r'">\s*(.*)')
        item['title'] = response.xpath('//div[contains(@class, "ls-articoloTitolo")]').re_first(r'<h3>\s*(.*)')
        item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]').re_first(r'">\s*(.*)')
        item['text'] = response.xpath('//div[contains(@class, "ls-articoloTesto")]').re_first(r'<p>\s*(.*)')
        items.append(item)
Well, with this code I can get the text I need, but it also includes all the following tags up to the end of the path.
e.g.
'subtitle': 'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>'
How can I strip the trailing </div> (or any other characters after a defined point)?
Can someone shed some light on this? Thanks.
You don't have to mess with regular expressions like this if you only need to extract the text from elements. Just use text() in your XPath expressions. For example, for the subtitle field, extract it using:
item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]/text()').extract_first().strip()
which, in your example, will produce:
u'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam'
Article text is a bit more complicated, as it's spread over child elements. However, you can extract it from there, post-process it and store as a single string like this:
item['text'] = ''.join([x.strip() for x in response.xpath('//div[contains(#class, "ls-articoloTesto")]//text()').extract()])
which will produce:
u'Datalogic, leader mondiale nelle tecnologie di identificazione automatica dei dati e dei processi di automazione industriale, ha lanciato la selezione di giovani talenti laureati o laureandi in ingegneria e/o in altre materie scientifiche che abbiano voglia di confrontarsi con una realt\xe0 internazionale. Il primo step delle selezioni si \xe8 tenuto in occasione dell\u2019Open Day: i candidati hanno sostenuto un primo colloquio con senior manager nazionali e internazionali, arrivati per l\u2019occasione anche da sedi estere. Durante l\u2019Open Day \xe8 stata presentata l\u2019azienda insieme alle 80 posizioni aperte di cui una cinquantina in Italia. La ricerca dei candidati \xe8 rivolta a laureandi e neolaureati di tutta Italia interessati a far parte di una realt\xe0 dinamica e internazionale, provenienti da Facolt\xe0 tecnico-scientifiche quali Ingegneria, Fisica, Informatica.Ai giovani verr\xe0 proposto un interessante percorso di carriera che permetter\xe0 di approfondire e sviluppare tecnologie all\u2019avanguardia nei 10 centri di Ricerca dell\u2019azienda dislocati tra Italia, Stati Uniti, Cina e Vietnam, dove si ha l\u2019opportunit\xe0 di approfondire tutte le tematiche relative all\u2019elettronica e al software design: dallo sviluppo di sistemi operativi mobili, di sensori laser e tecnologie ottiche, allo studio di sistemi di intelligenza artificiale. Informazioni sul sito www.datalogic.com.Alcuni diritti riservati.'
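The same //text() idea can be tried outside Scrapy: lxml, which Scrapy's selectors wrap, evaluates the identical XPath. A sketch against hypothetical markup mimicking the article body structure:

```python
from lxml import html

# Hypothetical markup standing in for the real article body
snippet = ('<div class="ls-articoloTesto">'
           '<p>First paragraph. </p><p>Second paragraph.</p></div>')
tree = html.fromstring(snippet)
# //text() collects every text node under the div, across child elements
parts = tree.xpath('//div[contains(@class, "ls-articoloTesto")]//text()')
text = ''.join(x.strip() for x in parts)
print(text)  # First paragraph.Second paragraph.
```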

python 3.5 imaplib read gmails as plain text

mail = imaplib.IMAP4_SSL('imap.gmail.com', 993)
username = 'MyGmail@gmail.com'
password = 'MyPasswordHere'
mail.login(username, password)
mail.select('INBOX')
typ, data = mail.search(None, 'ALL')
for num in data[0].split():
    typ, data = mail.fetch(num, '(RFC822)')
    print(data)
    exit()
This is a part of the whole output: (not very readable)
[(b'1 (BODY[1] {1115}', b'\r\nHej Bjango For at sikre, at Bjangos side hj=C3=A6lper dig med at n=C3=A5 =\r\ndine m=C3=A5l, giver vi dig her nogle hurtige og nemme forslag til, hvad =\r\ndu kan g=C3=B8re: Opdater dit profilbillede og dit coverbillede =\r\nOverf=C3=B8r=C2=A0billede Tilf=C3=B8j en beskrivelse af din side =\r\nTilf=C3=B8j=C2=A0en=C2=A0beskrivelse Medtag et link til dit website =\r\nTilf=C3=B8j=C2=A0et=C2=A0link Sl=C3=A5 en opdatering eller et billede op =\r\np=C3=A5 din side Opret=C2=A0et=C2=A0opslag Inviter dine venner til at =\r\nsynes godt om din side Inviter=C2=A0dine=C2=A0venner\r\n\r\nHilsen Facebook-teamet\r\n\r\n\r\n\r\n=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=\r\n=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D\r\nDenne besked blev sendt til infobjango#gmail.com. Hvis du ikke =C3=B8nsker =\r\nat modtage disse e-mails fra Facebook fremover, skal du f=C3=B8lge =\r\nnedenst=C3=A5ende link for at afmelde dem.\r\nhttps://www.facebook.com/o.php?k=3DAS38jnuCT_H5AdZt&u=3D100015358233656&mi=\r\nd=3D548339b6125b3G5af6a3e64c38G0G37b\r\nFacebook, Inc., Attention: Community Support, 1 Hacker Way, Menlo Park, CA =\r\n94025\r\n\r\n'), b')']
So here is my question
How do I make the main part of the mail I received readable?
An example of what I mean by readable:
Dear bjango
This is a mail, which is totally readable without any "<<td style=3D"font-size: 16px; =\r\npadding-bottom: 26px; text-ali>". That's funny I just made a part of this mail unreadable that was supposed to be readable - I hope you get the point.
Best Regards Bjango
This approach does not completely solve my issue with readability:
import email
msg = email.message_from_bytes(data[0][1])
print(msg.get_payload(decode=False))
Output will be as follows:
Hej Bjango For at sikre, at Bjangos side hj=C3=A6lper dig med at n=C3=A5 =
dine m=C3=A5l, giver vi dig her nogle hurtige og nemme forslag til, hvad =
du kan g=C3=B8re: Opdater dit profilbillede og dit coverbillede =
Overf=C3=B8r=C2=A0billede Tilf=C3=B8j en beskrivelse af din side =
Tilf=C3=B8j=C2=A0en=C2=A0beskrivelse Medtag et link til dit website =
Tilf=C3=B8j=C2=A0et=C2=A0link Sl=C3=A5 en opdatering eller et billede op =
p=C3=A5 din side Opret=C2=A0et=C2=A0opslag Inviter dine venner til at =
synes godt om din side Inviter=C2=A0dine=C2=A0venner
Hilsen Facebook-teamet
But nonetheless, it's a massive improvement.
This is the email I intend to get in the output section:
Hej Bjango For at sikre, at Bjangos side hjælper dig med at nå
dine mål, giver vi dig her nogle hurtige og nemme forslag til, hvad
du kan gøre: Opdater dit profilbillede og dit coverbillede
Overfører billeder Tilføj en beskrivelse af din side
Tilføj en beskrivelse Medtag et link til dit website
Tilføj et link Slå en opdatering eller et billede op
på din side Opret et opslag Inviter dine venner til at
synes godt om din side Inviter dine venner
Hilsen Facebook-teamet
Python 3's email parser should work here:
import email
msg = email.message_from_bytes(data[0][1])
payload = msg.get_payload(decode=True)
Assuming your email is encoded as MIME quoted-printable data, you can decode it using the quopri module:
import quopri
message = quopri.decodestring(payload).decode('utf-8')
print(message)
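The quopri step can be exercised offline with a small quoted-printable fragment taken from the output shown above (standard library only):

```python
import quopri

# A fragment of the quoted-printable payload shown above:
# '=C3=A6' is the UTF-8 encoding of 'æ', and '=\r\n' is a soft line break.
payload = b'Bjangos side hj=C3=A6lper dig med at n=C3=A5 =\r\ndine m=C3=A5l'
message = quopri.decodestring(payload).decode('utf-8')
print(message)  # Bjangos side hjælper dig med at nå dine mål
```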

Accentuation in PHP, showing strange symbols

I'm writing a program to parse HTML data from a Portuguese website. The thing is, when I echo the data I read, I get these weird symbols:
Meu PC estragou e tenho um netbook que u
so para assuntos acadΩmicos. Que raiva, nπo roda nem CS aqui. Aff que raiva! Com
o pode? Jß coloquei todas as mais baixar configuraτ⌡es e nao roda!
the original text is:
Meu PC estragou e tenho um netbook que u
so para assuntos acadêmicos. Que raiva, não roda nem CS aqui. Aff que raiva! Com
o pode? Já coloquei todas as mais baixar configurações e nao roda!
notice the accentuation:
acadêmicos -> acadΩmicos
Já -> Jß
how do I fix this? I already tried:
echo utf8_decode($assunto);
but it didn't work! Help!
Set the encoding in the header.
header( 'Content-Type: text/html; charset=UTF-8' );
http://php.net/manual/en/function.header.php
Then, to make sure the browser uses UTF-8, add a meta tag in the HTML head:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
