python 3.5 imaplib read gmails as plain text - python-3.x

import imaplib
mail = imaplib.IMAP4_SSL('imap.gmail.com', 993)
username = 'MyGmail@gmail.com'
password = 'MyPasswordHere'
mail.login(username, password)
mail.select('INBOX')
typ, data = mail.search(None, 'ALL')
for num in data[0].split():
    typ, data = mail.fetch(num, '(RFC822)')
    print(data)
    exit()
This is a part of the whole output: (not very readable)
[(b'1 (BODY[1] {1115}', b'\r\nHej Bjango For at sikre, at Bjangos side hj=C3=A6lper dig med at n=C3=A5 =\r\ndine m=C3=A5l, giver vi dig her nogle hurtige og nemme forslag til, hvad =\r\ndu kan g=C3=B8re: Opdater dit profilbillede og dit coverbillede =\r\nOverf=C3=B8r=C2=A0billede Tilf=C3=B8j en beskrivelse af din side =\r\nTilf=C3=B8j=C2=A0en=C2=A0beskrivelse Medtag et link til dit website =\r\nTilf=C3=B8j=C2=A0et=C2=A0link Sl=C3=A5 en opdatering eller et billede op =\r\np=C3=A5 din side Opret=C2=A0et=C2=A0opslag Inviter dine venner til at =\r\nsynes godt om din side Inviter=C2=A0dine=C2=A0venner\r\n\r\nHilsen Facebook-teamet\r\n\r\n\r\n\r\n=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=\r\n=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D\r\nDenne besked blev sendt til infobjango#gmail.com. Hvis du ikke =C3=B8nsker =\r\nat modtage disse e-mails fra Facebook fremover, skal du f=C3=B8lge =\r\nnedenst=C3=A5ende link for at afmelde dem.\r\nhttps://www.facebook.com/o.php?k=3DAS38jnuCT_H5AdZt&u=3D100015358233656&mi=\r\nd=3D548339b6125b3G5af6a3e64c38G0G37b\r\nFacebook, Inc., Attention: Community Support, 1 Hacker Way, Menlo Park, CA =\r\n94025\r\n\r\n'), b')']
So here is my question
How do I make the main part of the mail I received readable?
An example of what I mean by readable:
Dear bjango
This is a mail, which is totally readable without any "<<td style=3D"font-size: 16px; =\r\npadding-bottom: 26px; text-ali>". That's funny I just made a part of this mail unreadable that was supposed to be readable - I hope you get the point.
Best Regards Bjango
This approach does not completely solve my issue with readability:
import email
msg = email.message_from_bytes(data[0][1])
print(msg.get_payload(decode=False))
The output is as follows:
Hej Bjango For at sikre, at Bjangos side hj=C3=A6lper dig med at n=C3=A5 =
dine m=C3=A5l, giver vi dig her nogle hurtige og nemme forslag til, hvad =
du kan g=C3=B8re: Opdater dit profilbillede og dit coverbillede =
Overf=C3=B8r=C2=A0billede Tilf=C3=B8j en beskrivelse af din side =
Tilf=C3=B8j=C2=A0en=C2=A0beskrivelse Medtag et link til dit website =
Tilf=C3=B8j=C2=A0et=C2=A0link Sl=C3=A5 en opdatering eller et billede op =
p=C3=A5 din side Opret=C2=A0et=C2=A0opslag Inviter dine venner til at =
synes godt om din side Inviter=C2=A0dine=C2=A0venner
Hilsen Facebook-teamet
Nonetheless, it's a massive improvement.
This is the output I actually want to get:
Hej Bjango For at sikre, at Bjangos side hjælper dig med at nå
dine mål, giver vi dig her nogle hurtige og nemme forslag til, hvad
du kan gøre: Opdater dit profilbillede og dit coverbillede
Overfører billeder Tilføj en beskrivelse af din side
Tilføj en beskrivelse Medtag et link til dit website
Tilføj et link Slå en opdatering eller et billede op
på din side Opret et opslag Inviter dine venner til at
synes godt om din side Inviter dine venner
Hilsen Facebook-teamet

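The email parser in the Python 3 standard library should work here:
import email
msg = email.message_from_bytes(data[0][1])
payload = msg.get_payload(decode=True)
get_payload(decode=True) only undoes the transfer encoding if the parsed message carries a Content-Transfer-Encoding header. If you fetched just the body part (as in the BODY[1] output above), that header is missing and the payload comes back still quoted-printable encoded. In that case you can decode it with the quopri module:
import quopri
message = quopri.decodestring(payload).decode('utf-8')
print(message)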
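When the full message is fetched (headers included), the parser can handle both the transfer encoding and the character set by itself. The following is a minimal sketch under that assumption; the helper name plain_text_body and the sample message are made up for illustration and are not part of the original post.

```python
import email

def plain_text_body(raw_bytes):
    """Return the first text/plain part of a raw message as a str, or None."""
    msg = email.message_from_bytes(raw_bytes)
    # walk() yields the message itself and, for multipart mail, every sub-part.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes quoted-printable/base64 based on the headers.
            payload = part.get_payload(decode=True)
            # Fall back to UTF-8 when the part does not declare a charset.
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None

# Hypothetical sample message, with headers, in the style of the mail above.
raw = (
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Hej Bjango, hj=C3=A6lper dig med at n=C3=A5 =\r\ndine m=C3=A5l\r\n"
)
print(plain_text_body(raw))
```

With imaplib, raw would be data[0][1] from a fetch that includes the headers, such as '(RFC822)'.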

Related

Cannot import SongLyrics from lyrics_extractor

I wrote this code:
from lyrics_extractor import SongLyrics
apiKey = 'AIzaSyBDPVEi1OtzB3Nm6i9fd8HTkMCjsselIpM'
engineID = '35df92fbe0cad839c'
extract_lyrics = SongLyrics(apikey, engineID)
lyrics = extract_lyrics.get_lyrics("Reyes de la noche")
print(lyrics)
It is not working, I obtained this import error:
ImportError: cannot import name 'Songlyrics' from 'lyrics_extractor' (C:\Users\mica\AppData\Local\Programs\Python\Python39\lib\site-packages\lyrics_extractor\__init__.py)
What could be wrong?
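I found two problems:
You need to install the library with pip:
pip install lyrics-extractor
apiKey has the wrong capitalization in line 4: the variable is defined as apiKey but passed to SongLyrics as apikey. Python names are case-sensitive, so this raises a NameError.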
A further suggestion: the returned value is a dict, which looks nicer when printed like this:
from lyrics_extractor import SongLyrics
apiKey = 'AIzaSyBDPVEi1OtzB3Nm6i9fd8HTkMCjsselIpM'
engineID = '35df92fbe0cad839c'
extract_lyrics = SongLyrics(apiKey, engineID)
lyrics = extract_lyrics.get_lyrics("Reyes de la noche")
print(lyrics['title'])
print(lyrics['lyrics'])
Output:
Guasones – Reyes De La Noche Lyrics
Fuimos mucho mas que nada
Fuimos la mentira
Fuimos lo peor
Fuimos los soldados a la madrugada
Con esta ambición
Y ahora estoy en libertad
Y ahora que puedo pensar
En no volver hacer ese
El mismo de antes
Y que tristeza hay en la ciudad, amor
Sábado soleado
Y en el centro de la estatua del dolor
Me sentí parado
Fuimos muchos más que todos
Reyes de la noche
De esta tempestad
Si te vendí, si te robe, te traicione
Fui por uno más
Fuimos perros de la noche
Oxidados en tristeza
Y querer lo que querer
Sin tener que lastimar
Recordando que tu amor
Se robo la dignidad
Ahora olvidemos los dos
No volvamos a empezar
¿Para que?...
PS: I was manually writing out lyrics this morning for a song called "Brothers and Sisters" by Les Nubians. When I tried to use this library to get the lyrics for that song, like this:
extract_lyrics.get_lyrics("Brothers and Sisters")
... I got lyrics for a more popular song of the same title by another musical group. However, after adding the name of the group I wanted, I got the correct song's lyrics:
extract_lyrics.get_lyrics("Brothers and Sisters Nubians")
Installation
pip install lyrics-extractor
For more details go to this

can't get the right text when webscraping a webpage (with Python 3)

This is an example of the variable 'extract' when I execute my Python code:
<a href="https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland
When I execute my code, I get the text 'Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'
What I really want is the value of the title attribute, i.e. this text: 'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
I'm new to this area and find it very difficult to understand. Can someone point me in the right direction?
Here is my code so far:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
#print(soup)
content=''
rows = soup.find_all('a')
tel=0
for row in rows:
    if tel != 0:
        #'tel' is only used to skip the first returned result. The first resultline is something that i don't need. I only need all the text after every 'title ='
        print(row)
        extract=row.get_text()
        print('')
        print(extract)
        content=content+extract+'\n'
    if tel == 10:
        #a loop of max 10 times gives me enough information, i only need the first 10 articles
        break
    else:
        tel=tel+1
#show me the result-text of the crawling
print('')
print('RESULT TEXT OF THE FIRST 10 ARTICLES:')
print(content)
You want the title attribute. From the description the following use of nth-child will get you the first 10 descriptions. I also needed 'lxml' parser. pip3 install lxml if not installed.
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
print([i['title'] for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)')])
CSS:
li:nth-child(-n+10) > a:nth-child(2)
This matches the li elements that are among the first 10 children of their parent and, within each of them, selects the a element that is the li's second child. The > is a child combinator: whatever is on the right must be a direct child of what is on the left.
Read about:
:nth-child ranges
Child combinators, pseudo class selectors and type selectors
Follow-up request: a for loop that also prints the published date and time:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)'):
    print(i.parent['title'])
    print(i.parent.a.text.strip())
    print(i['title'])
    print()
If I understand your question correctly, you should try to modify your code as follows:
...
for row in rows:
    if tel != 0:
        print(row)
        extract=row["title"]
        content=content+extract.replace("Meer informatie", "")+'\n'
...
From a quick look at your code: simply replace get_text() with .get('title') and it should work:
>>> data = '''<a href="https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(data, "html.parser")
>>> soup.find('a').get('title')
'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
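The same idea (read the title attribute instead of the link text) can also be sketched with the standard library alone, in case BeautifulSoup is not available. This swaps in html.parser.HTMLParser for bs4; the class name TitleCollector, the limit parameter and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the title attribute of <a> tags, up to a maximum count."""

    def __init__(self, limit=10):
        super().__init__()
        self.limit = limit
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a" and len(self.titles) < self.limit:
            title = dict(attrs).get("title")
            if title:  # skip links that have no title attribute
                self.titles.append(title)

# Hypothetical page fragment; the real site structure may differ.
html = """
<ul>
<li><a href="/detail/1.html" title="First description.">First headline</a></li>
<li><a href="/detail/2.html">Link without a title</a></li>
<li><a href="/detail/3.html" title="Third description.">Third headline</a></li>
</ul>
"""
collector = TitleCollector()
collector.feed(html)
print(collector.titles)
```

Links without a title attribute are skipped here, which is the same reason .get('title') is safer than row['title'] in BeautifulSoup: the latter raises a KeyError when the attribute is missing.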

interpret communication from /dev files

I want to intercept the data emitted by the keyboard.
I know that each device has a file linked to it.
My question is how to intercept a specific USB port, and where to find the documentation needed to interpret the data.
Sorry for the english (google translator)
Based on: https://sergioprado.org/implementando-um-teclado-virtual-no-linux/
Event capture from input devices such as mice, keyboards, joysticks and touch screens is exposed through the files under /dev/input.
You can use the "evtest" tool to check which device is related to a given file, for example: evtest /dev/input/event4
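To interpret the data yourself, each read from an event file returns fixed-size records laid out as the kernel's struct input_event (a timestamp, then type, code and value). The sketch below assumes a 64-bit Linux system, where the timestamp is two native longs; the function names are made up for illustration, and reading the device usually requires root or membership in the input group.

```python
import struct

# Assumed layout of struct input_event on 64-bit Linux:
# long tv_sec, long tv_usec, unsigned short type, unsigned short code, int value
EVENT_FORMAT = "llHHi"
EVENT_SIZE = struct.calcsize(EVENT_FORMAT)
EV_KEY = 0x01  # event type used for key presses and releases

def parse_event(chunk):
    """Unpack one raw input_event record into a dict."""
    sec, usec, ev_type, code, value = struct.unpack(EVENT_FORMAT, chunk)
    return {"time": sec + usec / 1e6, "type": ev_type, "code": code, "value": value}

def read_key_events(device_path):
    """Yield key events from an event file such as /dev/input/event4."""
    with open(device_path, "rb") as dev:
        while True:
            chunk = dev.read(EVENT_SIZE)
            if len(chunk) < EVENT_SIZE:
                break
            event = parse_event(chunk)
            if event["type"] == EV_KEY:
                yield event

# Self-contained demonstration on a hand-built record: key code 30, value 1.
sample = struct.pack(EVENT_FORMAT, 1, 500000, EV_KEY, 30, 1)
print(parse_event(sample))
```

For key events, value 1 means press, 0 means release and 2 means auto-repeat; the key codes are listed in the kernel header linux/input-event-codes.h (30 is KEY_A there).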

use of Scrapy's regular expressions

I'm practicing scraping newspaper articles with Scrapy. I have some problems extracting substrings from web pages. With the built-in re and re_first functions I can set where the search starts, but I can't find how to set where it ends.
Here follows the code:
import scrapy
from spider.items import Articles
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
class QuotesSpider(scrapy.Spider):
    name = "lastampa"
    allowed_domains = ['lastampa.it']
    def start_requests(self):
        urls = [
            'http://www.lastampa.it/2017/10/26/economia/lavoro/datalogic-cerca-laureati-e-laureandi-per-la-ricerca-e-sviluppo-rPsS8gVM5ZX7gEZcugklwJ/pagina.html'
        ]
        # Schedule one request per URL in the list
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        items = []
        item = Articles()
        item['date'] = response.xpath('//div[contains(@class, "ls-articoloDataPubblicazione")]').re_first(r'content=\s*(.*)')
        item['author'] = response.xpath('//div[contains(@class, "ls-articoloAutore")]').re_first(r'">\s*(.*)')
        item['title'] = response.xpath('//div[contains(@class, "ls-articoloTitolo")]').re_first(r'<h3>\s*(.*)')
        item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]').re_first(r'">\s*(.*)')
        item['text'] = response.xpath('//div[contains(@class, "ls-articoloTesto")]').re_first(r'<p>\s*(.*)')
        items.append(item)
        # Hand the collected items back to Scrapy
        return items
Well, with this code I can get the needed text, but also all the following tags till the end of paths.
e.g.
'subtitle': 'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>'
How can I exclude the trailing </div> (or anything else after a defined point)?
Can someone shed some light on this? Thanks.
You don't have to mess with regular expressions like this if you only need to extract the text from elements. Just use text() in your XPath expressions. For example for subtitle field, extract it using:
item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]/text()').extract_first().strip()
which, in your example, will produce:
u'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam'
Article text is a bit more complicated, as it's spread over child elements. However, you can extract it from there, post-process it and store as a single string like this:
item['text'] = ''.join([x.strip() for x in response.xpath('//div[contains(@class, "ls-articoloTesto")]//text()').extract()])
which will produce:
u'Datalogic, leader mondiale nelle tecnologie di identificazione automatica dei dati e dei processi di automazione industriale, ha lanciato la selezione di giovani talenti laureati o laureandi in ingegneria e/o in altre materie scientifiche che abbiano voglia di confrontarsi con una realt\xe0 internazionale. Il primo step delle selezioni si \xe8 tenuto in occasione dell\u2019Open Day: i candidati hanno sostenuto un primo colloquio con senior manager nazionali e internazionali, arrivati per l\u2019occasione anche da sedi estere. Durante l\u2019Open Day \xe8 stata presentata l\u2019azienda insieme alle 80 posizioni aperte di cui una cinquantina in Italia. La ricerca dei candidati \xe8 rivolta a laureandi e neolaureati di tutta Italia interessati a far parte di una realt\xe0 dinamica e internazionale, provenienti da Facolt\xe0 tecnico-scientifiche quali Ingegneria, Fisica, Informatica.Ai giovani verr\xe0 proposto un interessante percorso di carriera che permetter\xe0 di approfondire e sviluppare tecnologie all\u2019avanguardia nei 10 centri di Ricerca dell\u2019azienda dislocati tra Italia, Stati Uniti, Cina e Vietnam, dove si ha l\u2019opportunit\xe0 di approfondire tutte le tematiche relative all\u2019elettronica e al software design: dallo sviluppo di sistemi operativi mobili, di sensori laser e tecnologie ottiche, allo studio di sistemi di intelligenza artificiale. Informazioni sul sito www.datalogic.com.Alcuni diritti riservati.'
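If you do want to stay with regular expressions, the usual way to say where a match should end is to exclude the terminating character from the captured group, for instance [^<]* instead of .* so that the capture stops at the next tag. The sketch below shows this with the standard re module on a hand-written sample string; the same pattern should be usable with re_first, but that is an assumption that has not been tried against the real page.

```python
import re

# Sample markup modelled on the subtitle element from the question.
html = ('<div class="ls-articoloCatenaccio">'
        'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>')

# Greedy .* runs to the end of the line, so the closing tag is captured too.
greedy = re.search(r'">\s*(.*)', html).group(1)

# [^<]* matches anything except '<', so the capture ends before </div>.
bounded = re.search(r'">\s*([^<]*)', html).group(1)

print(greedy)
print(bounded)
```

This only works while the wanted text contains no nested tags, which is why text() in XPath remains the more robust choice for the article body.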

Accentuation in PHP, showing Latin symbols

I'm writing a program to parse HTML data from a Portuguese website. The thing is, when I echo the data I read, I get these weird symbols:
Meu PC estragou e tenho um netbook que u
so para assuntos acadΩmicos. Que raiva, nπo roda nem CS aqui. Aff que raiva! Com
o pode? Jß coloquei todas as mais baixar configuraτ⌡es e nao roda!
the original text is:
Meu PC estragou e tenho um netbook que u
so para assuntos acadêmicos. Que raiva, não roda nem CS aqui. Aff que raiva! Com
o pode? Já coloquei todas as mais baixar configurações e nao roda!
Notice the accented characters:
acadêmicos -> acadΩmicos
Já -> Jß
How do I fix this? I already tried:
echo utf8_decode($assunto);
but it didn't work. Help!
Set the encoding in the header.
header( 'Content-Type: text/html; charset=UTF-8' );
http://php.net/manual/en/function.header.php
Then, to make sure the browser uses UTF-8, add a meta tag in the HTML head:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
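Before changing headers it can help to work out which two encodings are being confused. The substitutions in the question (ê shown as Ω, á as ß, ã as π) are what you get when ISO-8859-1 bytes are displayed using code page 437, as an old Windows console might do. That reading is an assumption based only on the sample output. The short Python sketch below reproduces the effect and reverses it.

```python
# Text as it appears on the original website.
original = "acadêmicos, não, Já, configurações"

# Encode as ISO-8859-1 (Latin-1), then misread the bytes as code page 437.
latin1_bytes = original.encode("latin-1")
garbled = latin1_bytes.decode("cp437")
print(garbled)

# Reversing the two steps recovers the original text.
repaired = garbled.encode("cp437").decode("latin-1")
print(repaired)
```

If the garbled line matches what you see, the data itself is intact and only the display encoding is wrong, so the fix belongs on the output side (console code page or declared charset) and not in the parsing code.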
