Find common words in multiple strings (Python 3)

I have a problem and I can't find a clean solution.
Here is the thing: I have a list containing addresses:
addresses = ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']
After processing, I want to get:
to_upper_words = ['paris', 'voltaire']
Because 'Paris' and 'Voltaire' are in all strings.
The goal is to uppercase all the common words, producing for example:
'PARIS, 343 BOULEVARD SAINT-GERMAIN'
'PARIS, 458 BOULEVARD SAINT-GERMAIN'
'Paris', 'Boulevard' and 'Saint-Germain' are uppercased because all the strings (two in this case) have these words in common.
I found several solutions, but only for two strings, and they are ugly to extend to N strings.
If anyone can help...
Thank you in advance!

Remove the commas so that the words can be picked out simply by splitting on blanks. Then find the words that are common to all of the addresses using set intersection. (There is only one such word for my collection, namely Montréal.) Now loop through the original collection of addresses, replacing each occurrence of a common word with its uppercase equivalent. The most likely mistake here would be to try to assign to the loop variable; it is, in effect, a copy, which is why I arranged to assign the modified values to addresses[i].
>>> addresses = ['Montréal, 3415 Chemin de la Côte-des-Neiges', 'Montréal, 3655 Avenue du Musée', 'Montréal, 3475 Rue de la Montagne', 'Montréal, 3450 Rue Drummond']
>>> sans_virgules = [address.replace(',', ' ') for address in addresses]
>>> listes_de_mots = [sans_virgule.split() for sans_virgule in sans_virgules]
>>> mots_en_commun = set(listes_de_mots[0])
>>> for liste in listes_de_mots[1:]:
...     mots_en_commun = mots_en_commun.intersection(set(liste))
...
>>> mots_en_commun
{'Montréal'}
>>> for i, address in enumerate(addresses):
...     for mot in mots_en_commun:
...         addresses[i] = addresses[i].replace(mot, mot.upper())
...
>>> addresses
['MONTRÉAL, 3415 Chemin de la Côte-des-Neiges', 'MONTRÉAL, 3655 Avenue du Musée', 'MONTRÉAL, 3475 Rue de la Montagne', 'MONTRÉAL, 3450 Rue Drummond']
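The question's own example mixes letter case ('VoltaIre', 'Voltaire', 'VoLtAiRe'), so for N strings it can help to case-fold before intersecting. A compact sketch of the same set-intersection idea, with lower-casing added to match the asker's expected output (it is not needed in the Montréal session, where the spelling is consistent):

```python
addresses = ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']

# Lower-case before splitting so 'VoltaIre' and 'Voltaire' compare equal,
# matching the asker's expected to_upper_words = ['paris', 'voltaire'].
word_sets = [set(a.lower().replace(',', ' ').split()) for a in addresses]
common = set.intersection(*word_sets)
print(sorted(common))  # ['paris', 'voltaire']
```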


Pandas string encoding when retrieving cell value

I have the following Series:
s = pd.Series(['ANO DE LOS BÃEZ MH EE 3 201'])
When I print the series I get:
0 ANO DE LOS BÃEZ MH EE 3 201
But when I access the cell element, I get a hexadecimal escape in the string:
>>> s.iloc[0]
'ANO DE LOS BÃ\x81EZ MH EE 3 201'
Why does this happen, and how can I retrieve the cell value as the string 'ANO DE LOS BÃEZ MH EE 3 201'?
Although I am not really sure where the issue arose, I could work around it with the unidecode package (which strips the accents, leaving plain ASCII):
from unidecode import unidecode
output_string = unidecode(s.iloc[0])
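The `Ã` followed by `\x81` looks like classic mojibake: the UTF-8 bytes of `Á` decoded as Latin-1. If that is indeed what happened upstream, a Latin-1/UTF-8 round trip recovers the accented text instead of stripping it; a sketch under that assumption:

```python
raw = 'ANO DE LOS BÃ\x81EZ MH EE 3 201'

# Latin-1 maps code points 0-255 back to the identical bytes, so encoding
# reconstructs the original byte stream, which then decodes cleanly as UTF-8.
fixed = raw.encode('latin-1').decode('utf-8')
print(fixed)  # ANO DE LOS BÁEZ MH EE 3 201
```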

use of Scrapy's regular expressions

I'm practicing scraping newspaper articles with Scrapy. I have some problems sub-stringing text from web pages. With the built-in re and re_first functions I can set where the search starts, but I can't find how to set where it ends.
Here follows the code:
import scrapy
from spider.items import Articles
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class QuotesSpider(scrapy.Spider):
    name = "lastampa"
    allowed_domains = ['lastampa.it']

    def start_requests(self):
        urls = [
            'http://www.lastampa.it/2017/10/26/economia/lavoro/datalogic-cerca-laureati-e-laureandi-per-la-ricerca-e-sviluppo-rPsS8gVM5ZX7gEZcugklwJ/pagina.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = []
        item = Articles()
        item['date'] = response.xpath('//div[contains(@class, "ls-articoloDataPubblicazione")]').re_first(r'content=\s*(.*)')
        item['author'] = response.xpath('//div[contains(@class, "ls-articoloAutore")]').re_first(r'">\s*(.*)')
        item['title'] = response.xpath('//div[contains(@class, "ls-articoloTitolo")]').re_first(r'<h3>\s*(.*)')
        item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]').re_first(r'">\s*(.*)')
        item['text'] = response.xpath('//div[contains(@class, "ls-articoloTesto")]').re_first(r'<p>\s*(.*)')
        items.append(item)
Well, with this code I can get the needed text, but I also get all the trailing tags up to the end of the path, e.g.
'subtitle': 'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>'
How can I drop the trailing </div> (or anything else after a defined point)?
Can someone shed some light on this? Thanks.
You don't have to mess with regular expressions like this if you only need to extract the text from elements. Just use text() in your XPath expressions. For example, extract the subtitle field using:
item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]/text()').extract_first().strip()
which, in your example, will produce:
u'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam'
Article text is a bit more complicated, as it's spread over child elements. However, you can extract it from there, post-process it and store as a single string like this:
item['text'] = ''.join([x.strip() for x in response.xpath('//div[contains(@class, "ls-articoloTesto")]//text()').extract()])
which will produce:
u'Datalogic, leader mondiale nelle tecnologie di identificazione automatica dei dati e dei processi di automazione industriale, ha lanciato la selezione di giovani talenti laureati o laureandi in ingegneria e/o in altre materie scientifiche che abbiano voglia di confrontarsi con una realt\xe0 internazionale. Il primo step delle selezioni si \xe8 tenuto in occasione dell\u2019Open Day: i candidati hanno sostenuto un primo colloquio con senior manager nazionali e internazionali, arrivati per l\u2019occasione anche da sedi estere. Durante l\u2019Open Day \xe8 stata presentata l\u2019azienda insieme alle 80 posizioni aperte di cui una cinquantina in Italia. La ricerca dei candidati \xe8 rivolta a laureandi e neolaureati di tutta Italia interessati a far parte di una realt\xe0 dinamica e internazionale, provenienti da Facolt\xe0 tecnico-scientifiche quali Ingegneria, Fisica, Informatica.Ai giovani verr\xe0 proposto un interessante percorso di carriera che permetter\xe0 di approfondire e sviluppare tecnologie all\u2019avanguardia nei 10 centri di Ricerca dell\u2019azienda dislocati tra Italia, Stati Uniti, Cina e Vietnam, dove si ha l\u2019opportunit\xe0 di approfondire tutte le tematiche relative all\u2019elettronica e al software design: dallo sviluppo di sistemi operativi mobili, di sensori laser e tecnologie ottiche, allo studio di sistemi di intelligenza artificiale. Informazioni sul sito www.datalogic.com.Alcuni diritti riservati.'

Handling data compression with non-ASCII values while reading and writing file

I am trying to learn lossless compression algorithms using Python 3. So far I have implemented Huffman coding, the Burrows-Wheeler transform and move-to-front, which can handle up to 256 unique characters based on their ASCII values. Basically, I read a UTF-8 text file, convert its characters to a single string, and then transform that string to compress it. All the algorithms work perfectly, but the problem lies in reading a file with non-ASCII characters: if I read the file without encoding it, the value of some special characters goes up to 8221, and the move-to-front algorithm gives this error:
ValueError: 8221 is not in list
To read the file I tried:
with open('test.txt', 'r', encoding='utf-8') as f:
    data = f.readlines()
charData = ''.join(str(x.encode('utf-8'))[2:-1] for x in data)
huffmanEncode(mtfEncoding(bwt_suffixArray(charData)))
This encodes each line to bytes and slices off the b'' of the bytes literal representation,
which converts this-> 'you’ll have to check'
to this-> 'you\xe2\x80\x99ll have to check'
Now I input this string, compress it, then decompress it. Decompression works perfectly and I get back the string representing the Unicode. My question is how to get the original content of the file back. I tried:
print(bytes(decompressedStr).decode('utf-8'))
#Gives:
>>>TypeError: string argument without an encoding
and:
print(codecs.encode(str,decompressedStr).decode('utf-8'))
#Gives same exact string back:
>>>you\xe2\x80\x99ll have to check
Is there a more efficient way to do this? If not how to convert Unicode representing string to UTF-8 string?
Compression algorithms work on bytes, which is what an encoded file contains. Open your original file in binary mode:
with open('test.txt', 'rb') as f:
    data = f.read()
Don't decode it to Unicode characters, whose ordinal values can be much larger than a byte. Compress the bytes, decompress the bytes, then decode the result to Unicode.
Full example:
#!python3
#coding:utf8
import lzma

text = '''Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!'''

# Create a file containing non-ASCII characters:
with open('test.txt', 'w', encoding='utf8') as f:
    f.write(text)

# Read the raw bytes data.
with open('test.txt', 'rb') as f:
    data = f.read()

# Note: The file write/read can be skipped by encoding the original Unicode text
# to bytes manually.
#
# data = text.encode('utf8')

# Using a built-in Python compression/decompression algorithm.
compressed_data = lzma.compress(data)
decompressed_data = lzma.decompress(compressed_data)
print('original length =', len(data))
print('compressed length =', len(compressed_data))
print('decompressed length =', len(decompressed_data))
assert data == decompressed_data

# Now decode the byte data back to Unicode.
print(decompressed_data.decode('utf8'))

Output:
original length = 455
compressed length = 372
decompressed length = 455
Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!
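Staying in the byte domain also removes the original ValueError directly: iterating over a bytes object yields integers in 0-255, so a 256-entry move-to-front alphabet always suffices. A minimal generic sketch of the idea (not the poster's mtfEncoding implementation):

```python
def mtf_encode(data: bytes) -> list:
    """Move-to-front encode a byte string over a fixed 256-symbol alphabet."""
    alphabet = list(range(256))
    out = []
    for b in data:  # iterating bytes yields ints 0..255, never 8221
        i = alphabet.index(b)
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))
    return out

encoded = mtf_encode('you’ll have to check'.encode('utf-8'))
print(max(encoded) < 256)  # True
```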

count how many words from a word list appear in a cell

I'm trying to count the number of times a word occurs in a string in Excel, but I'm not sure how to do it. I found the formula =SUMPRODUCT(ISNUMBER(SEARCH(" "&Z$2:Z$20&","," "&A1&","))+0) in another post, but I can't seem to make it work for me.
data example (strings):
text
1335 Nu standaard 3 dagen #Full #Membership bij #LACH , #dating voor #50plus http://t.co/7urme1Huhq http://t.co/ZzGWnKNdgx
1334 De 50plus cliché-poll vul hem in en we zien direct of de clichés waar zijn,
1282 #Koopkracht staat centraal in de campagne van #50Plus. Volgens mij maken veel ouderen zich zorgen over heel andere zaken. #PS2015
1278 #ps15 #50Plus: Dramatische gevolgen afschaffen ouderentoeslag http://t.co/nJW8kQRTtu Zie http://t.co/bFGtRDhF7e
1276 Campagnebussen in beeld. Mijn favoriet: de Fiat Babyboom van 50Plus. http://t.co/Wji8fL3gDp
1275 50Plus staat alleen in stelling over sociaal beleid http://t.co/79TF5PWN2j
1266 .#Struijlaard #50PLUS Zuid-Holland trekt met een camper de hele provincie door. Vandaag tot 17.00 uur in Leerdam. #PS15
1262 Stem #50PLUS op 18 maart !
1261 Boek van 98-jarige zangeres http://t.co/xxFjkTxYDY #vrouwen #50plus #muziek #Gog #Dings #Liessel
1257 #senfkarin kwam je op 50plus?
Word list:
achterlijk
bewogen
bovenkomen
brutaliteit
dramatisch
gehoorzamen
groen
illuminatie
incongruent
kwaadspreken
ontmoeting
ontredderd
recidiveren
roestig
ruw
secuur
steenkoud
tippelen
treffen
verbleken
verkijken
verlevendigen
vertillen
vorig
wezen
zorg
zwier
I think the commas are messing it up; it should be:
=SUMPRODUCT(ISNUMBER(SEARCH(" "&Z$2:Z$20&" "," "&A1&" "))+0)
So it matches the whole word plus the spaces at each end, even when it is the first or last word in the cell.
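The same whole-word, space-padded matching can be cross-checked outside Excel. A Python sketch of the formula's logic (the word list and cell below are samples; like Excel's SEARCH, the comparison is made case-insensitive by lower-casing):

```python
def count_words(cell: str, words: list) -> int:
    """Count how many listed words occur as whole words in the cell."""
    padded = ' ' + cell.lower() + ' '
    return sum(1 for w in words if ' ' + w.lower() + ' ' in padded)

# 'zorg' matches as a whole word; 'dramatisch' does not match 'Dramatische'.
print(count_words('De zorg staat centraal', ['zorg', 'dramatisch', 'groen']))  # 1
```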

Isolate section of long character string in R

I have locality information in a vector that contains [lat.];[long.];[town name];[governorate name];[country name].
I want to create a new vector that includes just the town name for each observation.
Below you can see the contents of the vector for the first four observations.
[1] 36.7416894818782;10.227453200912464;Ben Arous;Gouvernorat de Ben Arous;TN;
[2] 37.17652020713887;9.784534661865223;Tinjah;Gouvernorat de Bizerte;TN;
[3] 34.7313;10.763400000000047;Sfax;Sfax;TN;
[4] 34.829474860751915;9.791573033378995;Regueb;Gouvernorat de Sidi Bouzid;TN;
I want to output a vector that looks like:
[1] Ben Arous
[2] Tinjah
[3] Sfax
[4] Regueb
You could use read.table with sep=";":
d <- read.table(textConnection("
36.7416894818782;10.227453200912464;Ben Arous;Gouvernorat de Ben Arous;TN;
37.17652020713887;9.784534661865223;Tinjah;Gouvernorat de Bizerte;TN;
34.7313;10.763400000000047;Sfax;Sfax;TN;
34.829474860751915;9.791573033378995;Regueb;Gouvernorat de Sidi Bouzid;TN;"),
sep=";", stringsAsFactors=FALSE)
d[, 3]
# [1] "Ben Arous" "Tinjah" "Sfax" "Regueb"
