Isolate a section of a long character string in R

I have locality information in a vector that contains [lat.];[long.];[town name];[governorate name];[country name].
I want to create a new vector that includes just the town name for each observation.
Below you can see the contents of the vector for the first four observations.
[1] 36.7416894818782;10.227453200912464;Ben Arous;Gouvernorat de Ben Arous;TN;
[2] 37.17652020713887;9.784534661865223;Tinjah;Gouvernorat de Bizerte;TN;
[3] 34.7313;10.763400000000047;Sfax;Sfax;TN;
[4] 34.829474860751915;9.791573033378995;Regueb;Gouvernorat de Sidi Bouzid;TN;
I want to output a vector that looks like:
[1] Ben Arous
[2] Tinjah
[3] Sfax
[4] Regueb

You could use read.table with sep=";":
d <- read.table(textConnection("
36.7416894818782;10.227453200912464;Ben Arous;Gouvernorat de Ben Arous;TN;
37.17652020713887;9.784534661865223;Tinjah;Gouvernorat de Bizerte;TN;
34.7313;10.763400000000047;Sfax;Sfax;TN;
34.829474860751915;9.791573033378995;Regueb;Gouvernorat de Sidi Bouzid;TN;"),
sep=";", stringsAsFactors=FALSE)
d[, 3]
# [1] "Ben Arous" "Tinjah" "Sfax" "Regueb"


How to get all combinations in Python with repetition

Code example
from itertools import *
from collections import Counter
from tqdm import *

# for i in tqdm(iterable):
for i in combinations_with_replacement(['1','2','3','4','5','6','7','8'], 8):
    b = ''.join(i)
    if b == '72637721':
        print(b)
When I try product, I get:
for i in product(['1','2','3','4','5','7','6','8'], 8):
TypeError: 'int' object is not iterable
How can I get all combinations? (I believed it worked before testing it, so now I see what I did wrong.)
I read that combinations_with_replacement returns all of them, but as far as I can see that is not true.
I use Python 3.8.
Output of the code above (the beginning):
11111111 11111112 11111113 11111114 11111115 11111116 11111117
11111118 11111122 11111123 11111124 11111125 11111126 11111127
11111128 11111133 11111134 11111135 11111136 11111137 11111138
11111144 11111145 11111146 11111147 11111148 11111155 11111156
11111157 11111158 11111166 11111167 11111168 11111177 11111178
11111188 11111222 11111223 11111224 11111225 11111226 11111227
11111228 11111233 11111234 11111235 11111236 11111237 11111238
11111244 11111245 11111246 11111247 11111248 11111255 11111256
11111257 11111258 11111266 11111267 11111268 11111277 11111278
11111288
...and what it gives at the end:
56666888 56668888 56688888 56888888 58888888 77777777 77777776
77777778 77777766 77777768 77777788 77777666 77777668 77777688
77777888 77776666 77776668 77776688 77776888 77778888 77766666
77766668 77766688 77766888 77768888 77788888 77666666 77666668
77666688 77666888 77668888 77688888 77888888 76666666 76666668
76666688 76666888 76668888 76688888 76888888 78888888 66666666
66666668 66666688 66666888 66668888 66688888 66888888 68888888
88888888
To put it more clearly: think of it as counting from 11111111 to 88888888, but with characters, which is why I tried to do it with permutations/combinations with repetition.
The output above misses some possible combinations of those symbols.
For example, what I am trying to do is generate all possible arrangements of hex digits, from 0 to F, but not only for them: it should work for any characters.
['1','2','3','4','5','6','7','8'] is only an example;
it could just as well be ['a','b','x','c','d','g','r','8'], etc.
The solution is to use itertools.product instead of combinations_with_replacement:
from itertools import *

for i in product(['1','2','3','4','5','6','7','8'], repeat=8):
    b = ''.join(i)
    if b == '72637721':
        print(b)
For comparison:
itertools.product('ABCD', repeat=2) -> AA AB AC AD BA BB BC BD CA CB CC CD DA DB DC DD # full Cartesian product: duplicates and mirrored pairs included
itertools.permutations('ABCD', 2) -> AB AC AD BA BC BD CA CB CD DA DB DC # no repeated letters, mirrored pairs included
itertools.combinations_with_replacement('ABCD', 2) -> AA AB AC AD BB BC BD CC CD DD # repeated letters included, no mirrored pairs
itertools.combinations('ABCD', 2) -> AB AC AD BC BD CD # no repeated letters, no mirrored pairs
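A quick, runnable way to see the difference (a sketch, not part of the original answers): with n symbols and length k, product yields all n**k ordered strings, while combinations_with_replacement yields only the non-decreasing ones, C(n+k-1, k) in total.

```python
from itertools import combinations_with_replacement, product

symbols = ['1', '2', '3', '4', '5', '6', '7', '8']
k = 3  # a short length keeps the full listing small

# Every ordered string of length k over the symbols: 8**3 = 512 of them.
prod = [''.join(p) for p in product(symbols, repeat=k)]

# Only strings whose characters are in non-decreasing order: C(10, 3) = 120.
cwr = [''.join(c) for c in combinations_with_replacement(symbols, k)]

print(len(prod), len(cwr))   # 512 120
print('726' in prod)         # True: product keeps every ordering
print('726' in cwr)          # False: only the sorted form appears
print('267' in cwr)          # True
```

This is exactly why searching for '72637721' fails with combinations_with_replacement: the digits of that target are not in sorted order.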
Here's updated code that will print all the combinations. It does not matter whether your list contains strings or numbers.
To make sure the combination length always matches the number of elements, I recommend that you do:
comb_list = [1, 2, 3, 'a']
comb_len = len(comb_list)
and replace the call with:
comb = combinations_with_replacement(comb_list, comb_len)
from itertools import combinations_with_replacement

comb = combinations_with_replacement([1, 2, 3, 'a'], 4)
for i in comb:
    print(''.join(str(j) for j in i))
This will result as follows:
1111
1112
1113
111a
1122
1123
112a
1133
113a
11aa
1222
1223
122a
1233
123a
12aa
1333
133a
13aa
1aaa
2222
2223
222a
2233
223a
22aa
2333
233a
23aa
2aaa
3333
333a
33aa
3aaa
aaaa
I don't know what you are trying to do. Here's an attempt to start a dialogue to get to the final answer:
samples = [1, 2, 3, 4, 5, 'a', 'b']
len_samples = len(samples)
for elem in samples:
    print(str(elem) * len_samples)
The output of this will be as follows:
1111111
2222222
3333333
4444444
5555555
aaaaaaa
bbbbbbb
Is this what you want? If not, explain in your question what you expect as output.

Pandas string encoding when retrieving cell value

I have the following Series:
s = pd.Series(['ANO DE LOS BÃEZ MH EE 3 201'])
When I print the series I get:
0 ANO DE LOS BÃEZ MH EE 3 201
But when I get the cell element I get an hexadecimal value in the string:
>>> s.iloc[0]
'ANO DE LOS BÃ\x81EZ MH EE 3 201'
Why does this happen, and how can I retrieve the cell value and get the string 'ANO DE LOS BÃEZ MH EE 3 201'?
Even though I am not really sure where the issue arose, I could solve it by using the unidecode package:
from unidecode import unidecode

output_string = unidecode(s.iloc[0])
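If the underlying cause is mojibake (UTF-8 bytes that were decoded as Latin-1, which the 'Ã\x81' pair suggests: the bytes 0xC3 0x81 are UTF-8 for 'Á'), an alternative to stripping accents with unidecode is to round-trip the string through the encodings. A sketch under that assumption:

```python
import pandas as pd

s = pd.Series(['ANO DE LOS BÃ\x81EZ MH EE 3 201'])

# Re-encode the mis-decoded text as Latin-1 to recover the original bytes,
# then decode those bytes as the UTF-8 they presumably were.
fixed = s.iloc[0].encode('latin-1').decode('utf-8')
print(fixed)  # 'ANO DE LOS BÁEZ MH EE 3 201'
```

Unlike unidecode, this keeps the accented character instead of replacing it with an ASCII approximation; which behavior you want depends on the downstream use.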

how to separate html_text result using rvest?

I am trying to scrape information from google scholar web page:
https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science
library(rvest)
htmlfile<-"https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science"
g_interest<- read_html(htmlfile) %>% html_nodes("div.gsc_oai_int") %>% html_text()
I got the following result:
[1] "Quantum Chemistry Electronic Structure Condensed Matter Physics Materials Science Nanotechnology "
[2] "density functional theory first principles calculations many body theory condensed matter physics materials science "
[3] "chemistry materials science physics nanotechnology "
[4] "Materials Science Nanotechnology Chemistry Physics "
[5] "Physics Theoretical Physics Condensed Matter Theory Materials Science Nanoscience "
[6] "Materials Science Quantum Chemistry Fiber Optic Sensors Geophysics "
[7] "Chemical Physics Condensed Matter Materials Science Magnetic Properties NMR "
[8] "Materials Science "
[9] "Materials Science Physics "
[10] "Physics Materials Science Theoretical Physics Nanoscience "
However, I would like to get the results like:
[1] "Quantum Chemistry; Electronic Structure; Condensed Matter Physics; Materials Science; Nanotechnology"
......
Any suggestions how to separate the results with ";"?
You can make use of the purrr and stringr packages: extract all parent nodes first, then concatenate the individual tags within each one.
library(rvest)
library(purrr)
library(stringr)
htmlfile<-"https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:materials_science"
content_nodes<- read_html(htmlfile) %>% html_nodes("div.gsc_oai_int")
map_chr(content_nodes, ~ .x %>%
  html_nodes(".gsc_oai_one_int") %>%
  html_text() %>%
  str_c(collapse = ";"))
Result:
[1] "Quantum Chemistry;Electronic Structure;Condensed Matter Physics;Materials Science;Nanotechnology"
[2] "density functional theory;first principles calculations;many body theory;condensed matter physics;materials science"
[3] "chemistry;materials science;physics;nanotechnology"
[4] "Materials Science;Nanotechnology;Chemistry;Physics"
[5] "Physics;Theoretical Physics;Condensed Matter Theory;Materials Science;Nanoscience"
[6] "Materials Science;Quantum Chemistry;Fiber Optic Sensors;Geophysics"
[7] "Chemical Physics;Condensed Matter;Materials Science;Magnetic Properties;NMR"
[8] "Materials Science"
[9] "Materials Science;Physics"
[10] "Physics;Materials Science;Theoretical Physics;Nanoscience"

Use of Scrapy's regular expressions

I'm practicing scraping newspaper articles with Scrapy. I have some problems sub-stringing text from web pages. With the built-in re and re_first functions I can set where the search starts, but I can't find how to set where it ends.
Here follows the code:
import scrapy
from spider.items import Articles
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

class QuotesSpider(scrapy.Spider):
    name = "lastampa"
    allowed_domains = ['lastampa.it']

    def start_requests(self):
        urls = [
            'http://www.lastampa.it/2017/10/26/economia/lavoro/datalogic-cerca-laureati-e-laureandi-per-la-ricerca-e-sviluppo-rPsS8gVM5ZX7gEZcugklwJ/pagina.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        items = []
        item = Articles()
        item['date'] = response.xpath('//div[contains(@class, "ls-articoloDataPubblicazione")]').re_first(r'content=\s*(.*)')
        item['author'] = response.xpath('//div[contains(@class, "ls-articoloAutore")]').re_first(r'">\s*(.*)')
        item['title'] = response.xpath('//div[contains(@class, "ls-articoloTitolo")]').re_first(r'<h3>\s*(.*)')
        item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]').re_first(r'">\s*(.*)')
        item['text'] = response.xpath('//div[contains(@class, "ls-articoloTesto")]').re_first(r'<p>\s*(.*)')
        items.append(item)
Well, with this code I can get the needed text, but also all the following tags up to the end of the path.
e.g.
'subtitle': 'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam</div>'
How can I strip the trailing </div> (or anything else after a defined point)?
Can someone turn on the light on this? Thanks.
You don't have to mess with regular expressions like this if you only need to extract the text from elements. Just use text() in your XPath expressions. For example, extract the subtitle field using:
item['subtitle'] = response.xpath('//div[contains(@class, "ls-articoloCatenaccio")]/text()').extract_first().strip()
which, in your example, will produce:
u'Gli inserimenti saranno in Italia, Stati Uniti, Cina, Vietnam'
Article text is a bit more complicated, as it's spread over child elements. However, you can extract it from there, post-process it and store it as a single string like this:
item['text'] = ''.join([x.strip() for x in response.xpath('//div[contains(@class, "ls-articoloTesto")]//text()').extract()])
which will produce:
u'Datalogic, leader mondiale nelle tecnologie di identificazione automatica dei dati e dei processi di automazione industriale, ha lanciato la selezione di giovani talenti laureati o laureandi in ingegneria e/o in altre materie scientifiche che abbiano voglia di confrontarsi con una realt\xe0 internazionale. Il primo step delle selezioni si \xe8 tenuto in occasione dell\u2019Open Day: i candidati hanno sostenuto un primo colloquio con senior manager nazionali e internazionali, arrivati per l\u2019occasione anche da sedi estere. Durante l\u2019Open Day \xe8 stata presentata l\u2019azienda insieme alle 80 posizioni aperte di cui una cinquantina in Italia. La ricerca dei candidati \xe8 rivolta a laureandi e neolaureati di tutta Italia interessati a far parte di una realt\xe0 dinamica e internazionale, provenienti da Facolt\xe0 tecnico-scientifiche quali Ingegneria, Fisica, Informatica.Ai giovani verr\xe0 proposto un interessante percorso di carriera che permetter\xe0 di approfondire e sviluppare tecnologie all\u2019avanguardia nei 10 centri di Ricerca dell\u2019azienda dislocati tra Italia, Stati Uniti, Cina e Vietnam, dove si ha l\u2019opportunit\xe0 di approfondire tutte le tematiche relative all\u2019elettronica e al software design: dallo sviluppo di sistemi operativi mobili, di sensori laser e tecnologie ottiche, allo studio di sistemi di intelligenza artificiale. Informazioni sul sito www.datalogic.com.Alcuni diritti riservati.'
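One caveat with the post-processing step above: stripping each fragment and joining on the empty string can glue adjacent words together when text is split across inline child elements. A small sketch with hypothetical fragments (not taken from the page):

```python
# Hypothetical fragments, as extract() might return them from child text nodes.
fragments = ['  Datalogic, leader mondiale ', '\n', 'nelle tecnologie.  ']

glued = ''.join(x.strip() for x in fragments)
print(glued)   # 'Datalogic, leader mondialenelle tecnologie.'

# Joining the non-empty stripped fragments with a space avoids that:
spaced = ' '.join(x.strip() for x in fragments if x.strip())
print(spaced)  # 'Datalogic, leader mondiale nelle tecnologie.'
```

Whether this matters depends on the page's markup; if each paragraph is a single text node, the plain ''.join is fine.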

Py3 Find common words in multiple strings

I have a problem and I can't find a proper, clean solution.
Here is the thing, i have my list, which contains addresses:
addresses = ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']
And I want to get after treatment:
to_upper_words = ['paris', 'voltaire']
Because 'Paris' and 'Voltaire' are in all strings.
The goal is to uppercase all common words, giving results like:
'PARIS, 343 BOULEVARD SAINT-GERMAIN'
'PARIS, 458 BOULEVARD SAINT-GERMAIN'
'Paris', 'Boulevard' and 'Saint-Germain' are in upper case because all strings (two in this case) have these words in common.
I found multiple solutions, but only for two strings, and the result is ugly to apply to N strings.
If anyone can help...
Thank you in advance!
Remove the commas so that the words can be picked out simply by splitting on blanks. Then find the words that are common to all of the addresses using set intersection. (There is only one such word for my collection, namely Montréal.) Now loop through the original collection of addresses, replacing each occurrence of a common word with its uppercase equivalent. The most likely mistake here would be to try to assign to the loop variable itself; the loop variables are, in effect, copies, which is why I arranged to assign the modified values to addresses[i].
>>> addresses = [ 'Montréal, 3415 Chemin de la Côte-des-Neiges', 'Montréal, 3655 Avenue du Musée', 'Montréal, 3475 Rue de la Montagne', 'Montréal, 3450 Rue Drummond' ]
>>> sans_virgules = [address.replace(',', ' ') for address in addresses]
>>> listes_de_mots = [sans_virgule.split() for sans_virgule in sans_virgules]
>>> mots_en_commun = set(listes_de_mots[0])
>>> for liste in listes_de_mots[1:]:
...     mots_en_commun = mots_en_commun.intersection(set(liste))
...
>>> mots_en_commun
{'Montréal'}
>>> for i, address in enumerate(addresses):
...     for mot in mots_en_commun:
...         addresses[i] = addresses[i].replace(mot, mot.upper())
...
>>> addresses
['MONTRÉAL, 3415 Chemin de la Côte-des-Neiges', 'MONTRÉAL, 3655 Avenue du Musée', 'MONTRÉAL, 3475 Rue de la Montagne', 'MONTRÉAL, 3450 Rue Drummond']
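The same idea can be condensed with set.intersection over all the word lists at once. A sketch using the addresses from the question; note that it compares case-insensitively, since the question mixes 'VoltaIre', 'Voltaire' and 'VoLtAiRe' (the case-sensitive REPL session above would miss those):

```python
addresses = ['Paris, 9 rue VoltaIre', 'Paris, 42 quai Voltaire', 'Paris, 87 rue VoLtAiRe']

# Lowercased word sets per address, commas ignored.
word_sets = [set(w.lower() for w in a.replace(',', ' ').split()) for a in addresses]

# Words present in every address.
common = set.intersection(*word_sets)
print(sorted(common))  # ['paris', 'voltaire']

# Uppercase every occurrence of a common word, keeping punctuation attached.
result = [' '.join(w.upper() if w.strip(',').lower() in common else w
                   for w in a.split())
          for a in addresses]
print(result[0])  # 'PARIS, 9 rue VOLTAIRE'
```

This also generalizes to N strings with no extra work, which was the asker's concern.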
