Reading a semicolon (';') separated raw text in Python - python-3.x

I have managed to find some solutions online but often without an explanation. I am new to Python and normally choose to rework my data in Excel. I would, however, like to learn how to deal with a problem like the following:
I was given data for Brazil in this form. The website says to "save the file as a csv". It looks like a mess...
Espírito Santo;
Dados;
Ano;População total;Homens;Mulheres;Nascimentos;Óbitos;Taxa de Crescimento Geométrico;Taxa Bruta de Natalidade;Taxa Bruta de Mortalidade;Esperança de Vida ao Nascer;Esperança de Vida ao Nascer - Homens;Esperança de Vida ao Nascer - Mulheres;Taxa de Mortalidade Infantil;Taxa de Mortalidade Infantil - Homens;Taxa de Mortalidade Infantil - Mulheres;Taxa de Fecundidade Total;Razão de Dependência - Jovens 0 a 14 anos;Razão de Dependência - Idosos 65 ou mais anos;Razão de Dependência;Índice de Envelhecimento;
2010;3596057;1772936;1823121;54018;19734;x;15.02;5.49;75.93;71.9;80.19;11.97;13.59;10.28;1.73;34.49;10.17;44.67;29.41;
2011;3642595;1795501;1847094;55387;19923;1.29;15.21;5.47;76.36;72.35;80.59;11.3;12.87;9.66;1.77;33.72;10.41;44.13;30.77;
2012;3689347;1818188;1871159;55207;20142;1.28;14.96;5.46;76.76;72.78;80.96;10.69;12.2;9.1;1.75;32.98;10.68;43.65;32.17;
2013;3736386;1841035;1895351;56785;20396;1.27;15.2;5.46;77.14;73.19;81.31;10.14;11.6;8.6;1.8;32.29;10.97;43.26;34.22;
2014;3784361;1864376;1919985;57964;20676;1.28;15.32;5.46;77.51;73.58;81.64;9.64;11.06;8.15;1.83;31.73;11.31;43.04;35.59;
2015;3832826;1887984;1944842;58703;20979;1.28;15.32;5.47;77.85;73.95;81.95;9.19;10.56;7.74;1.85;31.29;11.69;42.98;37.44;
2016;3879376;1910629;1968747;55091;21282;1.21;14.2;5.49;78.18;74.31;82.24;8.78;10.11;7.38;1.73;30.84;12.13;42.97;39.35;
2017;3925341;1932993;1992348;58530;21624;1.18;14.91;5.51;78.49;74.65;82.5;8.42;9.71;7.06;1.84;30.52;12.61;43.13;41.31;
2018;3972388;1955930;2016458;58342;22016;1.2;14.69;5.54;78.79;74.97;82.76;8.09;9.34;6.77;1.83;30.31;13.14;43.45;43.6;
2019;4018650;1978483;2040167;58106;22419;1.16;14.46;5.58;79.06;75.27;82.99;7.79;9;6.52;1.83;30.12;13.71;43.83;45.45;
I used MS Word to replace the ";" with "," and Excel's import-from-text to try to get a more familiar data frame.
How would you approach data in this form using Python? Save it as a .csv and then import it again with pandas? I am hoping for a better solution, perhaps by keeping it as a triple-quoted ("""-enclosed) string.

You can tell the csv parser what the delimiter is; in this case it is ';':
import csv

with open('filepath.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=';')
    for row in reader:
        print(row)
How do you get this data? As a file? As a string?
If you have a file, you can use pandas to read the csv, which will give you a pandas DataFrame:
pandas.read_csv('filepath.csv', sep=';')
You can find more info here: https://realpython.com/python-csv/
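Applied to the paste above, a minimal sketch (assuming it was saved as espirito_santo.csv, a hypothetical name): the first two lines are metadata, every row ends with a trailing ';' that creates an empty extra column, and 'x' marks a missing value. If you would rather keep the data as a triple-quoted string, io.StringIO lets read_csv consume it directly.
import pandas as pd

# Sketch: 'espirito_santo.csv' is a hypothetical name for the paste above.
df = pd.read_csv(
    'espirito_santo.csv',
    sep=';',
    skiprows=2,      # skip the 'Espírito Santo;' and 'Dados;' metadata lines
    na_values='x',   # the 2010 growth rate is given as 'x'
)

# Every line ends with ';', which produces an empty, all-NaN last column.
df = df.dropna(axis=1, how='all')

# To keep the data as a triple-quoted string instead of a file:
#   import io
#   df = pd.read_csv(io.StringIO(raw_text), sep=';', skiprows=2, na_values='x')

print(df[['Ano', 'População total']].head())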

Related

can't get the right text when webscraping a webpage (with Python 3)

This is an example of the variable 'extract' when I execute my Python code:
<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland
When I execute my code, I now get the text 'Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'.
What I really want is the part after 'title=', so this text: 'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
I'm new to this, and I find it very difficult to understand. Can someone point me in the right direction?
Here is my code at the moment:
import requests
from bs4 import BeautifulSoup

url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
#print(soup)

content = ''
rows = soup.find_all('a')
tel = 0
for row in rows:
    if tel != 0:
        # 'tel' is only used to skip the first returned result. The first result
        # line is something I don't need. I only need the text after every 'title='.
        print(row)
        extract = row.get_text()
        print('')
        print(extract)
        content = content + extract + '\n'
    if tel == 10:
        # a loop of max 10 times gives me enough information; I only need the first 10 articles
        break
    else:
        tel = tel + 1

# show me the result text of the crawling
print('')
print('RESULT TEXT OF THE FIRST 10 ARTICLES:')
print(content)
You want the title attribute. From the description, the following use of :nth-child will get you the first 10 descriptions. I also needed the 'lxml' parser; pip3 install lxml if it is not installed.
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
print([i['title'] for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)')])
CSS:
li:nth-child(-n+10) > a:nth-child(2)
This matches the first 10 li elements and, for each, selects the a tag that is its second child element (nth-child counts all child elements, not just a tags). The > is a child combinator, specifying that what is on the right must be a child of what is on the left. A small runnable check follows the reading list below.
Read about:
:nth-child ranges
Child combinators, pseudo class selectors and type selectors
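As a quick, self-contained way to see the selector logic in action (on hypothetical markup, not the live page):
from bs4 import BeautifulSoup

html = """
<ul>
  <li><span>meta</span><a title="first">1</a></li>
  <li><span>meta</span><a title="second">2</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
# Selects the <a> that is the 2nd child of each of the first 10 <li> elements.
print([a['title'] for a in soup.select('li:nth-child(-n+10) > a:nth-child(2)')])
# ['first', 'second']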
In response to the additional request, a for loop with published date/time info:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)'):
    print(i.parent['title'])
    print(i.parent.a.text.strip())
    print(i['title'])
    print()
If I understand your question correctly, you should try to modify your code as follows:
...
for row in rows:
    if tel != 0:
        print(row)
        extract = row["title"]
        content = content + extract.replace("Meer informatie", "") + '\n'
...
From a quick look at what you're doing: simply replace get_text() with .get('title'), and that should work:
>>> data = '''<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(data, 'html.parser')
>>> soup.find('a').get('title')
'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'

Use of Scrapy's regular expressions

I'm practicing scraping newspaper articles with Scrapy. I have some problems sub-stringing text from web pages. With the built-in re and re_first functions I can set where to start the search, but I can't find how to set where to end.
Here is the code:
import scrapy
from spider.items import Articles
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class QuotesSpider(scrapy.Spider):
    name = "lastampa"
    allowed_domains = ['lastampa.it']

    def start_requests(self):
        urls = [
            'http://www.lastampa.it/2017/10/26/economia/lavoro/datalogic-cerca-laureati-e-laureandi-per-la-ricerca-e-sviluppo-rPsS8gVM5ZX7gEZcugklwJ/pagina.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = []
        item = Articles()
        item['date'] = response.xpath('//div[contains(@class, "ls-articoloDataPubblicazione")]').re_first(r'content=\s*(.*)')
        item['author'] = response.xpath('//div[contains(@class, "ls-articoloAutore")]').re_first(r'">\s*(.*)')
        item['title'] = response.xpath('//div[contains(@class, "ls-articoloTitolo")]').re_first(r'<h3>\s*(.*)')
        item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]').re_first(r'">\s*(.*)')
        item['text'] = response.xpath('//div[contains(@class, "ls-articoloTesto")]').re_first(r'<p>\s*(.*)')
        items.append(item)
Well, with this code I can get the text I need, but I also get all the following tags up to the end of the path.
e.g.
'subtitle': 'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>'
How can I exclude the trailing </div> (or anything else after a defined point)?
Can someone shed some light on this? Thanks.
You don't have to mess with regular expressions like this if you only need to extract the text from elements. Just use text() in your XPath expressions. For example, for the subtitle field, extract it using:
item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]/text()').extract_first().strip()
which, in your example, will produce:
u'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam'
Article text is a bit more complicated, as it's spread over child elements. However, you can extract it from there, post-process it and store as a single string like this:
item['text'] = ''.join([x.strip() for x in response.xpath('//div[contains(@class, "ls-articoloTesto")]//text()').extract()])
which will produce:
u'Datalogic, leader mondiale nelle tecnologie di identificazione automatica dei dati e dei processi di automazione industriale, ha lanciato la selezione di giovani talenti laureati o laureandi in ingegneria e/o in altre materie scientifiche che abbiano voglia di confrontarsi con una realt\xe0 internazionale. Il primo step delle selezioni si \xe8 tenuto in occasione dell\u2019Open Day: i candidati hanno sostenuto un primo colloquio con senior manager nazionali e internazionali, arrivati per l\u2019occasione anche da sedi estere. Durante l\u2019Open Day \xe8 stata presentata l\u2019azienda insieme alle 80 posizioni aperte di cui una cinquantina in Italia. La ricerca dei candidati \xe8 rivolta a laureandi e neolaureati di tutta Italia interessati a far parte di una realt\xe0 dinamica e internazionale, provenienti da Facolt\xe0 tecnico-scientifiche quali Ingegneria, Fisica, Informatica.Ai giovani verr\xe0 proposto un interessante percorso di carriera che permetter\xe0 di approfondire e sviluppare tecnologie all\u2019avanguardia nei 10 centri di Ricerca dell\u2019azienda dislocati tra Italia, Stati Uniti, Cina e Vietnam, dove si ha l\u2019opportunit\xe0 di approfondire tutte le tematiche relative all\u2019elettronica e al software design: dallo sviluppo di sistemi operativi mobili, di sensori laser e tecnologie ottiche, allo studio di sistemi di intelligenza artificiale. Informazioni sul sito www.datalogic.com.Alcuni diritti riservati.'

Handling data compression with non-ASCII values while reading and writing files

I am trying to learn lossless compression algorithms using Python 3, and so far I have implemented Huffman coding, the Burrows-Wheeler transform, and move-to-front, which can handle up to 256 unique characters based on their ASCII values. So basically I am trying to read a UTF-8 text file, convert its characters to a single string, and then transform that string to compress it. All the algorithms work perfectly, but the problem lies in reading files with non-ASCII characters: if I read the file without encoding it, the ordinal value of some special characters goes up to 8221, and the move-to-front algorithm gives this error:
ValueError: 8221 is not in list
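For context, a quick check of where 8221 comes from: it is the Unicode code point of the right double quotation mark, which cannot fit in a 256-entry byte table, while its UTF-8 encoding is three ordinary bytes:
ch = '\u201d'               # right double quotation mark, ”
print(ord(ch))              # 8221 -- larger than any single byte value
print(ch.encode('utf-8'))   # b'\xe2\x80\x9d' -- three bytes, each < 256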
To read the file I tried:
with open('test.txt','r',encoding='utf-8') as f:
    data = f.readlines()

charData = ''.join(str(x.encode('utf-8'))[2:-1] for x in data)
huffmanEncode(mtfEncoding(bwt_suffixArray(charData)))
This encodes each character and slices off the b'' bytes-representation wrapper,
which converts this -> 'you’ll have to check'
to this -> 'you\xe2\x80\x99ll have to check'
Now I input this string, compress it, and then decompress it. Decompression works perfectly and I get back the string that represents the Unicode. My question is how to get the original content of the file back. I tried:
print(bytes(decompressedStr).decode('utf-8'))
#Gives:
>>>TypeError: string argument without an encoding
and:
print(codecs.encode(str,decompressedStr).decode('utf-8'))
#Gives same exact string back:
>>>you\xe2\x80\x99ll have to check
Is there a more efficient way to do this? If not how to convert Unicode representing string to UTF-8 string?
Compression algorithms work on bytes, which is what an encoded file contains. Open your original file in binary mode:
with open('test.txt','rb') as f:
    data = f.read()
Don't decode it to Unicode characters, whose ordinal values can be much larger than a byte. Compress the bytes, decompress the bytes, then decode the result to Unicode.
Full example:
#!python3
#coding:utf8
import lzma

text = '''Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!'''

# Create a file containing non-ASCII characters:
with open('test.txt','w',encoding='utf8') as f:
    f.write(text)

# Read the raw bytes data.
with open('test.txt','rb') as f:
    data = f.read()

# Note: The file write/read can be skipped by encoding the original Unicode text
# to bytes manually:
#
#     data = text.encode('utf8')

# Using a built-in Python compression/decompression algorithm.
compressed_data = lzma.compress(data)
decompressed_data = lzma.decompress(compressed_data)

print('original length =', len(data))
print('compressed length =', len(compressed_data))
print('decompressed length =', len(decompressed_data))
assert data == decompressed_data

# Now decode the byte data back to Unicode.
print(decompressed_data.decode('utf8'))
Output:
original length = 455
compressed length = 372
decompressed length = 455
Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!

Py3 Find common words in multiple strings

I have a problem and I can't find a good solution.
Here is the thing: I have a list which contains addresses:
addresses = ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']
And after processing I want to get:
to_upper_words = ['paris', 'voltaire']
Because 'Paris' and 'Voltaire' are in all strings.
The goal is to uppercase all common words, giving results such as:
'PARIS, 343 BOULEVARD SAINT-GERMAIN'
'PARIS, 458 BOULEVARD SAINT-GERMAIN'
'Paris', 'Boulevard', 'Saint-Germain' are in upper case because all strings (two in this case) have these words in common.
I found multiple solutions, but only for two strings, and they are ugly to extend to N strings.
If anyone can help...
Thank you in advance!
Remove the commas so that the words can be picked out simply by splitting on whitespace. Then find the words common to all of the addresses using set intersection. (There is only one such word for my collection, namely Montréal.) Now loop through the original collection of addresses, replacing each occurrence of a common word with its uppercase equivalent. The most likely mistake here would be to try to assign to the individual value from the loop; those values are, in effect, copies, which is why I assign the modified values back to addresses[i]. A compact function version of the same idea follows the transcript below.
>>> addresses = [ 'Montréal, 3415 Chemin de la Côte-des-Neiges', 'Montréal, 3655 Avenue du Musée', 'Montréal, 3475 Rue de la Montagne', 'Montréal, 3450 Rue Drummond' ]
>>> sans_virgules = [address.replace(',', ' ') for address in addresses]
>>> listes_de_mots = [sans_virgule.split() for sans_virgule in sans_virgules]
>>> mots_en_commun = set(listes_de_mots[0])
>>> for liste in listes_de_mots[1:]:
...     mots_en_commun = mots_en_commun.intersection(set(liste))
...
>>> mots_en_commun
{'Montréal'}
>>> for i, address in enumerate(addresses):
...     for mot in list(mots_en_commun):
...         addresses[i] = address.replace(mot, mot.upper())
...
>>> addresses
['MONTRÉAL, 3415 Chemin de la Côte-des-Neiges', 'MONTRÉAL, 3655 Avenue du Musée', 'MONTRÉAL, 3475 Rue de la Montagne', 'MONTRÉAL, 3450 Rue Drummond']
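A more compact sketch of the same approach for any number of addresses. It matches words case-sensitively, just like the transcript above, so 'VoltaIre' and 'Voltaire' count as different words:
def upper_common_words(addresses):
    # One set of words per address; commas are removed so splitting on whitespace works.
    word_sets = [set(a.replace(',', ' ').split()) for a in addresses]
    common = set.intersection(*word_sets)  # words present in every address
    result = []
    for a in addresses:
        for w in common:
            a = a.replace(w, w.upper())
        result.append(a)
    return result

print(upper_common_words(['Paris, 343 boulevard Saint-Germain',
                          'Paris, 458 boulevard Saint-Germain']))
# ['PARIS, 343 BOULEVARD SAINT-GERMAIN', 'PARIS, 458 BOULEVARD SAINT-GERMAIN']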

Problems with XPath in Scrapy

I'm using Scrapy to create a crawler.
I want to extract only the title of the links that I find.
This is the part of the page that is important to me:
<a class="cor-produto busca-titulo" title="Melhorar a saúde, economia de tempo e dinheiro: Veja os benefícios do uso da bicicleta" href="//g1.globo.com/busca/click?q=economia&p=0&r=1472008380299&u=http%3A%2F%2Fg1.globo.com%2Fma%2Fmaranhao%2Fjmtv-2edicao%2Fvideos%2Fv%2Fmelhorar-a-saude-economia-de-tempo-e-dinheiro-veja-os-beneficios-do-uso-da-bicicleta%2F5256064%2F&t=informacional&d=false&f=false&ss=8bcd843f636c6982&o=&cat=a">Melhorar a saúde, economia de tempo e dinheiro: Veja os benefíc...</a>
I want to extract only the title, and I need to use XPath to do this. Does anyone have a suggestion?
Thank you! :)
The XPath would be:
//a/@title
Being sel your Selector instance:
sel.xpath('//a/@title').extract()
Or maybe just from the response object:
response.xpath('//a/@title').extract()
Output:
Melhorar a saúde, economia de tempo e dinheiro: Veja os benefícios do uso da bicicleta
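If the page contains many such links and you want just this one, you can also narrow the match by its class attribute. A sketch assuming the class names shown in the snippet above:
# Take only the first matching title instead of every <a> on the page.
title = response.xpath('//a[contains(@class, "busca-titulo")]/@title').extract_first()
print(title)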
