Handling data compression with non-ASCII values while reading and writing a file

I am trying to learn lossless compression algorithms using Python 3, and so far I have implemented Huffman coding, the Burrows-Wheeler transform, and move-to-front, which can handle up to 256 unique characters based on their ASCII values. So basically I am trying to read a UTF-8 text file, convert its characters to a single string, then transform that string to compress it. All the algorithms work perfectly, but the problem lies in reading a file with non-ASCII characters: if I read the file without encoding it, the ordinal value of some special characters goes up to 8221, and the move-to-front algorithm gives this error:
ValueError: 8221 is not in list
To read the file I tried:
with open('test.txt','r',encoding='utf-8') as f:
    data = f.readlines()
charData = ''.join(str(x.encode('utf-8'))[2:-1] for x in data)
huffmanEncode(mtfEncoding(bwt_suffixArray(charData)))
This encodes each line individually and slices the b'' bytes-literal wrapper off the string representation,
which converts this-> 'you’ll have to check'
to this-> 'you\xe2\x80\x99ll have to check'
Now I take this string, compress it, and decompress it. Decompression works perfectly and I get my string back in this escaped form. My question is how to get the original content of the file back. I tried:
print(bytes(decompressedStr).decode('utf-8'))
#Gives:
>>>TypeError: string argument without an encoding
and:
print(codecs.encode(str,decompressedStr).decode('utf-8'))
#Gives same exact string back:
>>>you\xe2\x80\x99ll have to check
Is there a more efficient way to do this? If not, how do I convert this escaped representation back to the original UTF-8 text?

Compression algorithms work on bytes, which is what an encoded file contains. Open your original file in binary mode:
with open('test.txt','rb') as f:
    data = f.read()
Don't decode it to Unicode characters, whose ordinal values can be much larger than a byte. Compress the bytes, decompress the bytes, then decode the result to Unicode.
Full example:
#!python3
#coding:utf8
import lzma
text = '''Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!'''
# Create a file containing non-ASCII characters:
with open('test.txt','w',encoding='utf8') as f:
    f.write(text)
# Read the raw bytes data.
with open('test.txt','rb') as f:
    data = f.read()
# Note: The file write/read can be skipped by encoding the original Unicode text
# to bytes manually.
#
# data = text.encode('utf8')
# Using a built-in Python compression/decompression algorithm.
compressed_data = lzma.compress(data)
decompressed_data = lzma.decompress(compressed_data)
print('original length =',len(data))
print('compressed length =',len(compressed_data))
print('decompressed length =',len(decompressed_data))
assert data == decompressed_data
# Now decode the byte data back to Unicode.
print(decompressed_data.decode('utf8'))
Output:
original length = 455
compressed length = 372
decompressed length = 455
Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!
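To make the asker's own pipeline work the same way, feed it byte values (0-255) instead of decoded characters; every UTF-8 byte then fits a 256-symbol alphabet. Below is a minimal sketch of a move-to-front pass over raw bytes; mtf_encode_bytes is a hypothetical stand-in for the asker's mtfEncoding, whose code isn't shown:
def mtf_encode_bytes(data: bytes) -> list:
    # The alphabet is every possible byte value, so any UTF-8 input fits.
    symbols = list(range(256))
    output = []
    for b in data:                             # iterating over bytes yields ints 0-255
        index = symbols.index(b)
        output.append(index)
        symbols.insert(0, symbols.pop(index))  # move the just-seen symbol to the front
    return output

with open('test.txt', 'rb') as f:
    raw = f.read()
indices = mtf_encode_bytes(raw)
assert max(indices) < 256  # code points like 8221 never appear; only byte values do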

Related

Pandas string encoding when retrieving cell value

I have the following Series:
s = pd.Series(['ANO DE LOS BÃEZ MH EE 3 201'])
When I print the series I get:
0 ANO DE LOS BÃEZ MH EE 3 201
But when I get the cell element I get an hexadecimal value in the string:
>>> s.iloc[0]
'ANO DE LOS BÃ\x81EZ MH EE 3 201'
Why does this happen, and how can I retrieve the cell value to get the string 'ANO DE LOS BÃEZ MH EE 3 201'?
Even though I am not really sure where the issue arose, I could solve it by using the unidecode package:
from unidecode import unidecode

output_string = unidecode(s.iloc[0])
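For the record, the Ã\x81 pair looks like mojibake: the UTF-8 bytes for 'Á' decoded as Latin-1 somewhere upstream. If that is indeed the cause, a Latin-1/UTF-8 round trip recovers the accented text instead of stripping accents the way unidecode does (a sketch, assuming the data really was mis-decoded this way):
s = 'ANO DE LOS BÃ\x81EZ MH EE 3 201'
# 'Ã' + '\x81' are the Latin-1 reading of the UTF-8 byte pair 0xC3 0x81, i.e. 'Á'.
fixed = s.encode('latin-1').decode('utf-8')
print(fixed)  # ANO DE LOS BÁEZ MH EE 3 201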

Reading a semicolon (';') separated raw text in Python

I have managed to find some solutions online but often without an explanation. I am new to Python and normally choose to rework my data in Excel. I would, however, like to learn how to deal with a problem like the following:
I was given data for Brazil in this form. The website says to "save the file as a csv". It looks like a mess...
Espírito Santo;
Dados;
Ano;População total;Homens;Mulheres;Nascimentos;Óbitos;Taxa de Crescimento Geométrico;Taxa Bruta de Natalidade;Taxa Bruta de Mortalidade;Esperança de Vida ao Nascer;Esperança de Vida ao Nascer - Homens;Esperança de Vida ao Nascer - Mulheres;Taxa de Mortalidade Infantil;Taxa de Mortalidade Infantil - Homens;Taxa de Mortalidade Infantil - Mulheres;Taxa de Fecundidade Total;Razão de Dependência - Jovens 0 a 14 anos;Razão de Dependência - Idosos 65 ou mais anos;Razão de Dependência;Índice de Envelhecimento;
2010;3596057;1772936;1823121;54018;19734;x;15.02;5.49;75.93;71.9;80.19;11.97;13.59;10.28;1.73;34.49;10.17;44.67;29.41;
2011;3642595;1795501;1847094;55387;19923;1.29;15.21;5.47;76.36;72.35;80.59;11.3;12.87;9.66;1.77;33.72;10.41;44.13;30.77;
2012;3689347;1818188;1871159;55207;20142;1.28;14.96;5.46;76.76;72.78;80.96;10.69;12.2;9.1;1.75;32.98;10.68;43.65;32.17;
2013;3736386;1841035;1895351;56785;20396;1.27;15.2;5.46;77.14;73.19;81.31;10.14;11.6;8.6;1.8;32.29;10.97;43.26;34.22;
2014;3784361;1864376;1919985;57964;20676;1.28;15.32;5.46;77.51;73.58;81.64;9.64;11.06;8.15;1.83;31.73;11.31;43.04;35.59;
2015;3832826;1887984;1944842;58703;20979;1.28;15.32;5.47;77.85;73.95;81.95;9.19;10.56;7.74;1.85;31.29;11.69;42.98;37.44;
2016;3879376;1910629;1968747;55091;21282;1.21;14.2;5.49;78.18;74.31;82.24;8.78;10.11;7.38;1.73;30.84;12.13;42.97;39.35;
2017;3925341;1932993;1992348;58530;21624;1.18;14.91;5.51;78.49;74.65;82.5;8.42;9.71;7.06;1.84;30.52;12.61;43.13;41.31;
2018;3972388;1955930;2016458;58342;22016;1.2;14.69;5.54;78.79;74.97;82.76;8.09;9.34;6.77;1.83;30.31;13.14;43.45;43.6;
2019;4018650;1978483;2040167;58106;22419;1.16;14.46;5.58;79.06;75.27;82.99;7.79;9;6.52;1.83;30.12;13.71;43.83;45.45;
I used MS Word to replace the ";" with "," and Excel's import-from-text to try to get a more familiar data frame.
How would you approach data in this form using Python? Save it as a .csv and then import it again with pandas? I am hoping for a better solution, perhaps by keeping it as a triple-quoted string.
You can tell the csv parser what the delimiter is; in this case it is ';':
import csv

with open('filepath.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=';')
How do you get this data? As a file? As a string?
If you have a file, you can use pandas to read the csv, which gives you a pandas DataFrame:
pandas.read_csv('filepath.csv', delimiter=';')
You can find more info here: https://realpython.com/python-csv/
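Against the exact sample shown above, something like this should work; skiprows=2 drops the "Espírito Santo;" and "Dados;" preamble lines, and the trailing semicolon on every row produces an empty last column that can be dropped (a sketch, untested on the real download):
import pandas as pd

df = pd.read_csv('filepath.csv', sep=';', skiprows=2)
# Each row ends with ';', which pandas parses as one extra, all-empty column.
df = df.dropna(axis=1, how='all')
print(df.head())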

How to stack Sentinel bands?

I work with many Sentinel-2 images whose 12 bands I would like to stack into a single file. My images are in ENVI format (.img and .hdr).
I tried to do the concatenation with the rsgislib module, applying the following code:
import rsgislib
from rsgislib import imageutils
imagePath = "/run/media/afavro/Elements/Acquisitions_Copernicus2/Sentinel-2/THEIA/2A/resampling/subset_20181005_944_J_resampled.data/"
nom = 'ROI_Resize__Layer__Band_1_SENTINEL2A_20181005_104840_944_L2A_T31TDJ_D_V1_9'
imageList = [nom + '_ATB_R1.hdr',
nom + '_ATB_R2.hdr',
nom +'_SRE_B2.hdr',
nom +'_SRE_B3.hdr',
nom +'_SRE_B4.hdr',
nom +'_SRE_B5.hdr',
nom +'_SRE_B6.hdr',
nom +'_SRE_B7.hdr',
nom +'_SRE_B8.hdr',
nom +'_SRE_B8a.hdr',
nom +'_SRE_B9.hdr',
nom +'_SRE_B10.hdr',
nom +'_SRE_B11.hdr',
nom +'_SRE_B12.hdr',
nom +'_FRE_B2.hdr',
nom +'_FRE_B3.hdr',
nom +'_FRE_B4.hdr',
nom +'_FRE_B5.hdr',
nom +'_FRE_B6.hdr',
nom +'_FRE_B7.hdr',
nom +'_FRE_B8.hdr',
nom +'_FRE_B8a.hdr',
nom +'_FRE_B9.hdr',
nom +'_FRE_B10.hdr',
nom +'_FRE_B11.hdr',
nom +'_FRE_B12.hdr']
bandNamesList = ['_ATB_R1',
'_ATB_R2',
'_SRE_B2',
'_SRE_B3',
'_SRE_B4',
'_SRE_B5',
'_SRE_B6',
'_SRE_B7',
'_SRE_B8',
'_SRE_B8a',
'_SRE_B9',
'_SRE_B10',
'_SRE_B11',
'_SRE_B12',
'_FRE_B2',
'_FRE_B3',
'_FRE_B4',
'_FRE_B5',
'_FRE_B6',
'_FRE_B7',
'_FRE_B8',
'_FRE_B8a',
'_FRE_B9',
'_FRE_B10',
'_FRE_B11',
'_FRE_B12']
# Output image
outputImage = 'SENTINEL2A_20181005-104840-944_L2A_T31TDH_D_V1-9_stack.envi'
# Format and type
gdalFormat = 'ENVI'
dataType = rsgislib.TYPE_16UINT
# Stack
imageutils.stackImageBands(imageList, bandNamesList, outputImage, None, 0, gdalFormat, dataType)
But whatever parameters I change, I always end up with the same error message :
"There are 26 images to stack
ROI_Resize__Layer__Band_1_SENTINEL2A_20181005_104840_944_L2A_T31TDJ_D_V1_9_ATB_R1.hdr
ERROR 4: ROI_Resize__Layer__Band_1_SENTINEL2A_20181005_104840_944_L2A_T31TDJ_D_V1_9_ATB_R1.hdr: no such files or folders
Traceback (most recent call last):
File "<ipython-input-4-9b52a8bb11b5>", line 70, in <module>
imageutils.stackImageBands(imageList, bandNamesList, outputImage, None, 0, gdalFormat, dataType)
error: Could not open image ROI_Resize__Layer__Band_1_SENTINEL2A_20181005_104840_944_L2A_T31TDJ_D_V1_9_ATB_R1.hdr"
Do you have any advice for me?
You are passing the header files. I believe you should be passing the image files themselves (e.g., .img if that is the extension you are using). Note also that imagePath is never joined onto the entries of imageList, which is why the error message shows only the bare file name.
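A sketch of both fixes together, assuming the ENVI binary files sit next to the headers and use the .img extension (adjust if yours differ):
import os

# Reuse the suffixes from bandNamesList to build full paths to the binary files.
imageList = [os.path.join(imagePath, nom + suffix + '.img')
             for suffix in bandNamesList]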

Use of Scrapy's regular expressions

I'm practicing scraping newspaper articles with Scrapy. I have some problems sub-stringing text from web pages. With the built-in re and re_first functions I can set where the search starts, but I can't find how to set where it ends.
Here follows the code:
import scrapy
from spider.items import Articles
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class QuotesSpider(scrapy.Spider):
    name = "lastampa"
    allowed_domains = ['lastampa.it']

    def start_requests(self):
        urls = [
            'http://www.lastampa.it/2017/10/26/economia/lavoro/datalogic-cerca-laureati-e-laureandi-per-la-ricerca-e-sviluppo-rPsS8gVM5ZX7gEZcugklwJ/pagina.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = []
        item = Articles()
        item['date'] = response.xpath('//div[contains(@class, "ls-articoloDataPubblicazione")]').re_first(r'content=\s*(.*)')
        item['author'] = response.xpath('//div[contains(@class, "ls-articoloAutore")]').re_first(r'">\s*(.*)')
        item['title'] = response.xpath('//div[contains(@class, "ls-articoloTitolo")]').re_first(r'<h3>\s*(.*)')
        item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]').re_first(r'">\s*(.*)')
        item['text'] = response.xpath('//div[contains(@class, "ls-articoloTesto")]').re_first(r'<p>\s*(.*)')
        items.append(item)
Well, with this code I can get the needed text, but I also get all the trailing tags up to the end of the element,
e.g.
'subtitle': 'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>'
How can I exclude the trailing </div> (or any other characters after a defined point)?
Can someone shed some light on this? Thanks.
You don't have to mess with regular expressions like this if you only need to extract the text from elements. Just use text() in your XPath expressions. For example, for the subtitle field, extract it using:
item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]/text()').extract_first().strip()
which, in your example, will produce:
u'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam'
Article text is a bit more complicated, as it's spread over child elements. However, you can extract it from there, post-process it and store as a single string like this:
item['text'] = ''.join([x.strip() for x in response.xpath('//div[contains(@class, "ls-articoloTesto")]//text()').extract()])
which will produce:
u'Datalogic, leader mondiale nelle tecnologie di identificazione automatica dei dati e dei processi di automazione industriale, ha lanciato la selezione di giovani talenti laureati o laureandi in ingegneria e/o in altre materie scientifiche che abbiano voglia di confrontarsi con una realt\xe0 internazionale. Il primo step delle selezioni si \xe8 tenuto in occasione dell\u2019Open Day: i candidati hanno sostenuto un primo colloquio con senior manager nazionali e internazionali, arrivati per l\u2019occasione anche da sedi estere. Durante l\u2019Open Day \xe8 stata presentata l\u2019azienda insieme alle 80 posizioni aperte di cui una cinquantina in Italia. La ricerca dei candidati \xe8 rivolta a laureandi e neolaureati di tutta Italia interessati a far parte di una realt\xe0 dinamica e internazionale, provenienti da Facolt\xe0 tecnico-scientifiche quali Ingegneria, Fisica, Informatica.Ai giovani verr\xe0 proposto un interessante percorso di carriera che permetter\xe0 di approfondire e sviluppare tecnologie all\u2019avanguardia nei 10 centri di Ricerca dell\u2019azienda dislocati tra Italia, Stati Uniti, Cina e Vietnam, dove si ha l\u2019opportunit\xe0 di approfondire tutte le tematiche relative all\u2019elettronica e al software design: dallo sviluppo di sistemi operativi mobili, di sensori laser e tecnologie ottiche, allo studio di sistemi di intelligenza artificiale. Informazioni sul sito www.datalogic.com.Alcuni diritti riservati.'
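If you do want to stay with regular expressions, making the capture non-greedy and anchoring it on the closing tag stops the match where you need it, e.g. for the subtitle:
item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]').re_first(r'">\s*(.*?)</div>')
The lazy (.*?) matches as little as possible, so the capture ends at the first </div> instead of running to the end of the element.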

Py3 Find common words in multiple strings

I have a problem and I can't find a good, clean solution.
Here is the thing: I have a list which contains addresses:
addresses = ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']
And after processing I want to get:
to_upper_words = ['paris', 'voltaire']
Because 'Paris' and 'Voltaire' are in all strings.
The goal is to uppercase all common words, giving results like:
'PARIS, 343 BOULEVARD SAINT-GERMAIN'
'PARIS, 458 BOULEVARD SAINT-GERMAIN'
'Paris', 'Boulevard', 'Saint-Germain' are in upper case because all strings (two in this case) have these words in common.
I found multiple solutions, but only for two strings, and they are ugly to generalize to N strings.
If anyone can help...
Thank you in advance!
Remove the commas so that the words can be picked out simply by splitting on blanks. Then find the words that are common to all of the addresses using set intersection. (There is only one such word for my collection, namely Montréal.) Now look through the original collection of addresses, replacing each occurrence of a common word with its uppercase equivalent. The most likely mistake in doing this would be to try to assign to the loop variable itself; the values it yields are, in effect, copies, which is why I arranged to assign the modified values to addresses[i].
>>> addresses = [ 'Montréal, 3415 Chemin de la Côte-des-Neiges', 'Montréal, 3655 Avenue du Musée', 'Montréal, 3475 Rue de la Montagne', 'Montréal, 3450 Rue Drummond' ]
>>> sans_virgules = [address.replace(',', ' ') for address in addresses]
>>> listes_de_mots = [sans_virgule.split() for sans_virgule in sans_virgules]
>>> mots_en_commun = set(listes_de_mots[0])
>>> for liste in listes_de_mots[1:]:
... mots_en_commun = mots_en_commun.intersection(set(liste))
...
>>> mots_en_commun
{'Montréal'}
>>> for i, address in enumerate(addresses):
... for mot in list(mots_en_commun):
... addresses[i] = address.replace(mot, mot.upper())
...
>>> addresses
['MONTRÉAL, 3415 Chemin de la Côte-des-Neiges', 'MONTRÉAL, 3655 Avenue du Musée', 'MONTRÉAL, 3475 Rue de la Montagne', 'MONTRÉAL, 3450 Rue Drummond']
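The same logic condenses into a small function that works for any number of addresses. A sketch; like the answer above, the word matching is case-sensitive, so 'VoltaIre' and 'Voltaire' count as different words:
def uppercase_common_words(addresses):
    # One set of words per address, with commas stripped first.
    word_sets = [set(a.replace(',', ' ').split()) for a in addresses]
    common = set.intersection(*word_sets)
    result = []
    for address in addresses:
        for word in common:
            address = address.replace(word, word.upper())
        result.append(address)
    return result

addresses = ['Paris, 9 rue Voltaire', 'Paris, 42 quai Voltaire', 'Paris, 87 rue Voltaire']
print(uppercase_common_words(addresses))
# ['PARIS, 9 rue VOLTAIRE', 'PARIS, 42 quai VOLTAIRE', 'PARIS, 87 rue VOLTAIRE']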
