count how many words from a word list appear in a cell

I'm trying to count the number of times a word occurs in a string in Excel, but I'm not sure how to do it. I found the formula =SUMPRODUCT(ISNUMBER(SEARCH(" "&Z$2:Z$20&","," "&A1&","))+0) in another post, but I can't seem to make it work for me.
data example (strings):
text
1335 Nu standaard 3 dagen #Full #Membership bij #LACH , #dating voor #50plus http://t.co/7urme1Huhq http://t.co/ZzGWnKNdgx
1334 De 50plus cliché-poll vul hem in en we zien direct of de clichés waar zijn,
1282 #Koopkracht staat centraal in de campagne van #50Plus. Volgens mij maken veel ouderen zich zorgen over heel andere zaken. #PS2015
1278 #ps15 #50Plus: Dramatische gevolgen afschaffen ouderentoeslag http://t.co/nJW8kQRTtu Zie http://t.co/bFGtRDhF7e
1276 Campagnebussen in beeld. Mijn favoriet: de Fiat Babyboom van 50Plus. http://t.co/Wji8fL3gDp
1275 50Plus staat alleen in stelling over sociaal beleid http://t.co/79TF5PWN2j
1266 .#Struijlaard #50PLUS Zuid-Holland trekt met een camper de hele provincie door. Vandaag tot 17.00 uur in Leerdam. #PS15
1262 Stem #50PLUS op 18 maart !
1261 Boek van 98-jarige zangeres http://t.co/xxFjkTxYDY #vrouwen #50plus #muziek #Gog #Dings #Liessel
1257 #senfkarin kwam je op 50plus?
Word list:
achterlijk
bewogen
bovenkomen
brutaliteit
dramatisch
gehoorzamen
groen
illuminatie
incongruent
kwaadspreken
ontmoeting
ontredderd
recidiveren
roestig
ruw
secuur
steenkoud
tippelen
treffen
verbleken
verkijken
verlevendigen
vertillen
vorig
wezen
zorg
zwier

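I think the commas are messing it up: as written, the formula only finds a word when it is followed by a comma. It should be
=SUMPRODUCT(ISNUMBER(SEARCH(" "&Z$2:Z$20&" "," "&A1&" "))+0)
Padding both the word and the cell text with a space at each end makes it match whole words only, even when the word is the first or last one in the cell.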
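To check the padding logic outside Excel, here is a minimal Python sketch of the same idea. The function name count_listed_words and the sample inputs are made up for illustration. Like SEARCH, the comparison ignores case; unlike SEARCH, no wildcards are interpreted. A word followed directly by punctuation (for example "groen,") is not matched, which is also true of the formula.

```python
def count_listed_words(cell_text, word_list):
    """Count how many words from word_list occur as whole words in cell_text."""
    # Pad the text with a space at each end so the first and last words
    # are also surrounded by spaces, as " "&A1&" " does in the formula.
    padded_text = " " + cell_text.lower() + " "
    count = 0
    for word in word_list:
        if not word:
            continue  # assumption: blank entries in the list should not count
        if " " + word.lower() + " " in padded_text:
            count += 1
    return count


words = ["zorg", "groen", "ruw"]
print(count_listed_words("De zorg is groen", words))  # 2
print(count_listed_words("veel zorgen", words))       # 0 ("zorgen" is not "zorg")
```

Each listed word is counted at most once per cell, which mirrors what ISNUMBER(SEARCH(...)) does for each entry of the range.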

Related

can't get the right text when webscraping a webpage (with Python 3)

This is an example of the variable 'extract' when I execute my Python code:
<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland
When I execute my code, I now get the text 'Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'
What I really want is the part after 'title=', so this text: 'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
I'm new to this and find it very difficult to understand. Can someone point me in the right direction?
Here is my code at the moment.
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
#print(soup)
content=''
rows = soup.find_all('a')
tel=0
for row in rows:
    if tel != 0:
        # 'tel' is only used to skip the first returned result. The first result line is something that I don't need. I only need all the text after every 'title='
        print(row)
        extract=row.get_text()
        print('')
        print(extract)
        content=content+extract+'\n'
    if tel == 10:
        # a loop of max 10 times gives me enough information, I only need the first 10 articles
        break
    else:
        tel=tel+1
# show me the result text of the crawling
print('')
print('RESULT TEXT OF THE FIRST 10 ARTICLES:')
print(content)

You want the title attribute. Going by your description, the following use of nth-child will get you the first 10 descriptions. I also needed the 'lxml' parser (pip3 install lxml if it is not installed).
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
print([i['title'] for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)')])
CSS:
li:nth-child(-n+10) > a:nth-child(2)
This matches li elements that are among the first 10 children of their parent and, within each, selects the a tag that is the second child of that li. The > is a child combinator specifying that what is on the right must be a direct child of what is on the left.
Read about:
:nth-child ranges
Child combinators, pseudo class selectors and type selectors
In response to an additional request, a for loop that also prints the published date and time:
import requests
from bs4 import BeautifulSoup
url = 'https://www.dekrantenkoppen.be/full/de_redactie'
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for i in soup.select('li:nth-child(-n+10) > a:nth-child(2)'):
    print(i.parent['title'])
    print(i.parent.a.text.strip())
    print(i['title'])
    print()

If I understand your question correctly, you should try to modify your code as follows:
...
for row in rows:
    if tel != 0:
        print(row)
        extract=row["title"]
        content=content+extract.replace("Meer informatie", "")+'\n'
...

From a quick look at what you're doing: simply replace get_text() with .get('title'), and that should work:
>>> data = '''<a href "https://www.dekrantenkoppen.be/detail/1520347/Vulkaanuitbarsting-en-aardbevingen-dichtbij-de-hoofdstad-van-IJsland.html" rel="dofollow" title="In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.">Vulkaanuitbarsting en aardbevingen dichtbij de hoofdstad van IJsland'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(data)
>>> soup.find('a').get('title')
'In het zuidwesten van IJsland is de vulkaan Fagradalsfjall uitgebarsten. Dat heeft de meteorologische dienst van het land laten weten. De uitbarsting is vooralsnog beperkt, maar er zijn wel twee grote lavastromen. De voorbije weken was IJsland al opgeschrikt door tienduizenden aardbevingen. In de loop van de dag is de kracht van de uitbarsting afgenomen.'
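The point of the answers above is that the description lives in the title attribute of the a tag, not in its text. As a dependency-free illustration of that distinction, here is a small sketch that uses the standard library's html.parser instead of BeautifulSoup. The class name TitleCollector and the sample markup are invented for this example and do not reflect the real page structure.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of the first `limit` <a> tags that have one."""

    def __init__(self, limit=10):
        super().__init__()
        self.limit = limit
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag != "a" or len(self.titles) >= self.limit:
            return
        title = dict(attrs).get("title")
        if title:  # anchors without a title attribute are skipped
            self.titles.append(title)


# Hypothetical markup: two anchors carry a title, one does not.
sample = (
    '<ul>'
    '<li><a href="/x" title="Long description one.">Headline one</a></li>'
    '<li><a href="/y">No title here</a></li>'
    '<li><a href="/z" title="Long description two.">Headline two</a></li>'
    '</ul>'
)

collector = TitleCollector()
collector.feed(sample)
print(collector.titles)
```

Skipping anchors without a title plays the same role as using .get('title') rather than ['title'] in BeautifulSoup, where the latter raises a KeyError when the attribute is missing.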

Pandas string encoding when retrieving cell value

I have the following Series:
s = pd.Series(['ANO DE LOS BÃEZ MH EE 3 201'])
When I print the series I get:
0 ANO DE LOS BÃEZ MH EE 3 201
But when I retrieve the cell element, I get a hexadecimal escape in the string:
>>> s.iloc[0]
'ANO DE LOS BÃ\x81EZ MH EE 3 201'
Why does this happen, and how can I retrieve the cell value and get the string 'ANO DE LOS BÃEZ MH EE 3 201'?

Even though I am not really sure where the issue arose, I could solve it by using the unidecode package.
from unidecode import unidecode
output_string = unidecode(s.iloc[0])
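One possible explanation, offered as an assumption rather than a diagnosis: the text looks like UTF-8 that was decoded as Latin-1 somewhere upstream. The letter 'Á' is the two bytes C3 81 in UTF-8; read as Latin-1 they become 'Ã' followed by the invisible control character U+0081. Printing the Series emits that character silently, while the repr of the single string shows it as \x81. If that is what happened, reversing the wrong decoding recovers the original text rather than stripping the accent. A minimal sketch with built-in string methods only:

```python
# The value exactly as repr(s.iloc[0]) shows it in the question.
raw = 'ANO DE LOS BÃ\x81EZ MH EE 3 201'

# Assumption: the bytes were UTF-8 but were decoded as Latin-1.
# Encoding back to Latin-1 restores the original bytes, which are
# then decoded with the encoding they were actually written in.
repaired = raw.encode('latin-1').decode('utf-8')

print(repr(raw))
print(repaired)  # ANO DE LOS BÁEZ MH EE 3 201
```

On a whole Series the same round trip could be written as s.str.encode('latin-1').str.decode('utf-8'). If the data comes from a file, passing the correct encoding when reading it would avoid the problem at the source.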

Reading a semicolon (';') separated raw text in Python

I have managed to find some solutions online but often without an explanation. I am new to Python and normally choose to rework my data in Excel. I would, however, like to learn how to deal with a problem like the following:
I was given data for Brazil in this form. The website says to "save the file as a csv". It looks like a mess...
Espírito Santo;
Dados;
Ano;População total;Homens;Mulheres;Nascimentos;Óbitos;Taxa de Crescimento Geométrico;Taxa Bruta de Natalidade;Taxa Bruta de Mortalidade;Esperança de Vida ao Nascer;Esperança de Vida ao Nascer - Homens;Esperança de Vida ao Nascer - Mulheres;Taxa de Mortalidade Infantil;Taxa de Mortalidade Infantil - Homens;Taxa de Mortalidade Infantil - Mulheres;Taxa de Fecundidade Total;Razão de Dependência - Jovens 0 a 14 anos;Razão de Dependência - Idosos 65 ou mais anos;Razão de Dependência;Índice de Envelhecimento;
2010;3596057;1772936;1823121;54018;19734;x;15.02;5.49;75.93;71.9;80.19;11.97;13.59;10.28;1.73;34.49;10.17;44.67;29.41;
2011;3642595;1795501;1847094;55387;19923;1.29;15.21;5.47;76.36;72.35;80.59;11.3;12.87;9.66;1.77;33.72;10.41;44.13;30.77;
2012;3689347;1818188;1871159;55207;20142;1.28;14.96;5.46;76.76;72.78;80.96;10.69;12.2;9.1;1.75;32.98;10.68;43.65;32.17;
2013;3736386;1841035;1895351;56785;20396;1.27;15.2;5.46;77.14;73.19;81.31;10.14;11.6;8.6;1.8;32.29;10.97;43.26;34.22;
2014;3784361;1864376;1919985;57964;20676;1.28;15.32;5.46;77.51;73.58;81.64;9.64;11.06;8.15;1.83;31.73;11.31;43.04;35.59;
2015;3832826;1887984;1944842;58703;20979;1.28;15.32;5.47;77.85;73.95;81.95;9.19;10.56;7.74;1.85;31.29;11.69;42.98;37.44;
2016;3879376;1910629;1968747;55091;21282;1.21;14.2;5.49;78.18;74.31;82.24;8.78;10.11;7.38;1.73;30.84;12.13;42.97;39.35;
2017;3925341;1932993;1992348;58530;21624;1.18;14.91;5.51;78.49;74.65;82.5;8.42;9.71;7.06;1.84;30.52;12.61;43.13;41.31;
2018;3972388;1955930;2016458;58342;22016;1.2;14.69;5.54;78.79;74.97;82.76;8.09;9.34;6.77;1.83;30.31;13.14;43.45;43.6;
2019;4018650;1978483;2040167;58106;22419;1.16;14.46;5.58;79.06;75.27;82.99;7.79;9;6.52;1.83;30.12;13.71;43.83;45.45;
I used MS Word to replace the ";" with "," and Excel's import from text to try and get a more familiar data frame.
How would you approach data in this form using Python? Save it as a .csv and then import it again with pandas? I am hoping for a better solution that keeps it as a triple-quoted (""") string.

You can tell the csv parser what the delimiter is; in this case it is ';':
import csv
with open('filepath.csv', newline='') as csv_file:
    rows = list(csv.reader(csv_file, delimiter=';'))
How do you get this data? As a file? As a string?
If you have a file, you can use pandas to read the csv, which gives you a pandas DataFrame (skiprows=2 skips the two title lines above the header):
import pandas
pandas.read_csv('filepath.csv', delimiter=';', skiprows=2)
You can find more info here: https://realpython.com/python-csv/
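Since the question asks about keeping the data as a triple-quoted string, here is a minimal sketch of that route using only the standard csv module. It relies on csv.reader accepting any iterable of lines, so no file is needed. The sample is a shortened excerpt of the data in the question (first four columns only), and the variable names are chosen for illustration.

```python
import csv

# Shortened excerpt of the data from the question.
raw = """Espírito Santo;
Dados;
Ano;População total;Homens;Mulheres;
2010;3596057;1772936;1823121;
2011;3642595;1795501;1847094;
"""

# Skip the two title lines ("Espírito Santo;" and "Dados;") above the header.
data_lines = raw.splitlines()[2:]

rows = []
for row in csv.reader(data_lines, delimiter=';'):
    # Every line ends with ';', which produces one empty trailing field.
    if row and row[-1] == '':
        row = row[:-1]
    rows.append(row)

header, records = rows[0], rows[1:]
table = [dict(zip(header, record)) for record in records]
print(table[0]['Homens'])  # 1772936
```

All values stay strings here; the full data set contains an 'x' placeholder in one column, so any numeric conversion would have to handle that case. With pandas, the equivalent would be wrapping the string in io.StringIO and calling read_csv with sep=';' and skiprows=2.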

Why Vim is assembling short lines when it should just break the long ones?

I am trying to edit a file using Vim, an editor I have only just started to use.
This is the text that I want to fix (it is in Portuguese, but that is irrelevant to my question):
---
ENUM Questão 1
AREA ETHICS
Janaína é procuradora do município de Oceanópolis e atua, fora da carga horária demandada pela função, como advogada na sociedade de advogados Alfa, especializada em Direito Tributário. A profissional já foi professora na universidade estadual Beta, situada na localidade, tendo deixado o magistério há um ano, quando tomou posse como procuradora municipal.
As you can see, the sentence starting with "Janaína é..." is too long. I am trying to make everything fit within 80 columns.
Hence, I did:
:set textwidth=80
And, in visual mode with all the text selected, I did:
gq
This is the final output:
--- ENUM Questão 1
AREA ETHICS
Janaína é procuradora do município de Oceanópolis e atua, fora da carga horária
demandada pela função, como advogada na sociedade de advogados Alfa,
especializada em Direito Tributário. A profissional já foi professora na
universidade estadual Beta, situada na localidade, tendo deixado o magistério há
um ano, quando tomou posse como procuradora municipal.
The final result is close to what I want. The only problem is the change from
---
ENUM Questão 1
to
--- ENUM Questão 1
I thought that :set textwidth=80 and :set columns=80 were meant to break lines that are too long. But, for some reason, formatting joins the short --- line with the ENUM Questão [num] line.
Why is this happening?
How can I solve this?
Thanks.

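gq can do a lot of things, depending on the formatexpr, formatprg or, most likely, the formatoptions setting; see :h gq. It re-flows text rather than only breaking long lines, which is why short lines can end up joined.
I would use the following regex, which inserts a line break after every 80 characters (note that it can split in the middle of a word):
:%s/.\{80}/&\r/g
Note: Vim also has the wrap option (:set wrap), which may help if you only need long lines to be displayed wrapped.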
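If scripting outside Vim is an option, the behaviour the question asks for (break lines that are too long, never join short ones) can be sketched in Python with the standard textwrap module. This is an alternative to the Vim commands above, not an explanation of them; the function name wrap_long_lines and the sample text are made up for illustration.

```python
import textwrap


def wrap_long_lines(text, width=80):
    """Wrap each line on its own so that short lines are never joined."""
    wrapped = []
    for line in text.splitlines():
        if len(line) <= width:
            wrapped.append(line)  # short (or empty) lines are kept untouched
        else:
            # Break at whitespace so that no word is split in the middle.
            wrapped.append(textwrap.fill(line, width=width))
    return "\n".join(wrapped)


sample = "---\nENUM Questão 1\nAREA ETHICS\n" + "palavra " * 30
result = wrap_long_lines(sample)
print(result)
```

Because every input line is handled separately, the --- line and the ENUM line stay on their own lines, and only the long paragraph is broken at word boundaries.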

Py3 Find common words in multiple strings

I have a problem and I can't find a good solution.
Here is the thing: I have a list which contains addresses:
addresses = ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']
And after processing I want to get:
to_upper_words = ['paris', 'voltaire']
Because 'Paris' and 'Voltaire' are in all strings.
The goal is to convert all common words to upper case, which gives for example:
'PARIS, 343 BOULEVARD SAINT-GERMAIN'
'PARIS, 458 BOULEVARD SAINT-GERMAIN'
'Paris', 'Boulevard', 'Saint-Germain' are in upper case because all strings (two in this case) have these words in common.
I found multiple solutions, but only for two strings, and they become ugly when applied to N strings.
If anyone can help...
Thank you in advance !

Remove the commas so that the words can be picked out simply by splitting on blanks. Then find the words that are common to all of the addresses using set intersection. (There is only one such word for my collection, namely Montréal.) Now loop through the original collection of addresses, replacing each occurrence of a common word with its uppercase equivalent. The most likely mistake here is to assign the result only to the loop variable: strings are immutable and the loop variable is just a name for the list item, so the list itself would not change. That is why I arranged to assign the modified values to addresses[i].
>>> addresses = [ 'Montréal, 3415 Chemin de la Côte-des-Neiges', 'Montréal, 3655 Avenue du Musée', 'Montréal, 3475 Rue de la Montagne', 'Montréal, 3450 Rue Drummond' ]
>>> sans_virgules = [address.replace(',', ' ') for address in addresses]
>>> listes_de_mots = [sans_virgule.split() for sans_virgule in sans_virgules]
>>> mots_en_commun = set(listes_de_mots[0])
>>> for liste in listes_de_mots[1:]:
...     mots_en_commun = mots_en_commun.intersection(set(liste))
...
>>> mots_en_commun
{'Montréal'}
>>> for i, address in enumerate(addresses):
...     for mot in mots_en_commun:
...         address = address.replace(mot, mot.upper())  # keep every replacement
...     addresses[i] = address
...
>>> addresses
['MONTRÉAL, 3415 Chemin de la Côte-des-Neiges', 'MONTRÉAL, 3655 Avenue du Musée', 'MONTRÉAL, 3475 Rue de la Montagne', 'MONTRÉAL, 3450 Rue Drummond']
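The session above compares words case-sensitively, so it would not treat 'VoltaIre' and 'VoLtAiRe' from the question as the same word. A possible variant that ignores case is sketched below. The function name upper_common_words and the regular expression are choices made for this illustration: a word is assumed to be a run of letters, optionally joined by hyphens, so house numbers are ignored and 'Saint-Germain' counts as one word.

```python
import re

# Assumption: a "word" is letters only, with optional inner hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def upper_common_words(addresses):
    """Uppercase every word that occurs (ignoring case) in all addresses."""
    word_sets = [{w.lower() for w in WORD.findall(a)} for a in addresses]
    common = set.intersection(*word_sets) if word_sets else set()

    def replace(match):
        word = match.group(0)
        return word.upper() if word.lower() in common else word

    # Build a new list instead of modifying the input in place.
    return [WORD.sub(replace, a) for a in addresses]


addresses = ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']
print(upper_common_words(addresses))
# ['PARIS, 9 rue VOLTAIRE', 'PARIS, 42 quai VOLTAIRE', 'PARIS, 87 rue VOLTAIRE']
```

Using a regular expression substitution rather than str.replace also avoids uppercasing a common word when it merely appears inside a longer word.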
