How can I subset filtering a row per categories with a for loop - python-3.x

How can i subset these lines of code with a for loop
I'm trying to subset these lines of code but I couldn't, I think that it could be done with a group by and a dictionary but I'm couldn't
df_belgium = df_sales[df_sales["Country"]=="Belgium"]
df_norway = df_sales[df_sales["Country"]=="Norway"]
df_portugal = df_sales[df_sales["Country"]=="portugal"]

The most straightforward way would be to loop through ["Belgium","Norway","portugal"], but trying to create objects with variable variable names like df_{country_name} is highly discouraged (see here), so I would recommend creating a dictionary to store your subset dataframes with the country names as keys.
You can use a dict comprehension:
df_sales_by_country = {country_name: df_sales[df_sales["Country"]==country_name] for country_name in ["Belgium","Norway","portugal"]}

The ideal is to use groupby and to store the sub-DataFrames in a dictionary:
d = dict(df.groupby('Country'))
Then access d['Belgium'] for example.
If you need to filter a subset of the countries:
# use a set for efficiency
keep = {'Belgium', 'Norway', 'Portugal'}
d = {key: g for key, g in df.groupby('Country') if country in keep}
or:
keep = ['Belgium', 'Norway', 'Portugal']
d = dict(df[df['Country'].isin(keep)].groupby('Country'))

Related

Read indices stored in a list

I have a dataset which is stored as a list. I want to be able to retrieve different pieces of the data and alter them. The indices of pieces I need are stored in a different list.
For example:
data_list = [[[1,2],[3,4]],[5,6]]
indices = [[0,0,1],[1,0]]
In this case I might want to retrieve data_list[0][0][1] and data_list[1][0] and change them to value 6, but I cannot simply do data_list[indices[0]] = 6. Is there a good way to do this?
You can try to loop over all the keys/sub-keys until you get the data you need.
What you can do is set a variable to a reference to the data_list and loop over the indexes and shift the reference until it's pointing to the lowest nested list.
Then you can set the value in that lowest list to whatever value you need.
data_list = [[[1,2],[3,4]],[5,6]]
indices = [[0,0,1],[1,0]]
for *path, final in indices:
val = data_list
for i in path:
val = val[i]
val[final] = 6
print(data_list)

Remove values from dictionary

I have a large dictionary and I am trying to remove values from keys if they start with certain values. Below is a small example of the dictionary.
a_data = {'78567908': {'26.01.17', '02.03.24', '26.01.12', '04.03.03', '01.01.13', '02.03.01', '01.01.10', '26.01.21'}, '85789070': {'26.01.02', '09.01.04', '02.05.04', '02.03.17', '02.05.01'}, '87140110': {'03.15.25', '03.15.24', '03.15.19'}, '87142218': {'26.17.13', '02.11.01', '02.03.22'}, '87006826': {'28.01.03'}}
After I read in the dictionary, I want to remove values from all keys that start with '26.' or '02.' It is possible that would leave a key with no values (an empty set).
I do have code that works:
exclude = ('26.', '02.')
f_a_data = {}
for k, v in a_data.items():
f_a_data.setdefault(k,[])
for code in v:
print (k, code, not code.startswith(exclude))
if not code.startswith(exclude):
f_a_data[k].append(code)
print('Filtered dict:')
print(f_a_data)
This returns a filtered dict:
Filtered dict:
{'78567908': ['04.03.03', '01.01.13', '01.01.10'], '85789070': ['09.01.04'], '87140110': ['03.15.25', '03.15.24', '03.15.19'], '87142218': [], '87006826': ['28.01.03']}
Question 1: Is this the best way to filter a dictionary?
Question 2: How could i modify the above snippet to return values in a set like the original dict?
Your code is quite all right in complexity terms but can be "pythonized" a little and still remain readable.
My proposal: you can rebuild a dictionary using nested comprehensions and all to test if you should include the values:
a_data = {'78567908': {'26.01.17', '02.03.24', '26.01.12', '04.03.03', '01.01.13', '02.03.01', '01.01.10', '26.01.21'}, '85789070': {'26.01.02', '09.01.04', '02.05.04', '02.03.17', '02.05.01'}, '87140110': {'03.15.25', '03.15.24', '03.15.19'}, '87142218': {'26.17.13', '02.11.01', '02.03.22'}, '87006826': {'28.01.03'}}
exclude = ('26.', '02.')
new_data = {k:{x for x in v if all(s not in x for s in exclude)} for k,v in a_data.items()}
result:
>>> new_data
{'78567908': {'01.01.10', '01.01.13', '04.03.03'},
'85789070': {'09.01.04'},
'87006826': {'28.01.03'},
'87140110': {'03.15.19', '03.15.25', '03.15.24'},
'87142218': set()}
(here using a dictionary comprehension embedding a set comprehension (since you need a set) using a generator comprehension in all)

Making a dictionary of from a list and a dictionary

I am trying to create a dictionary of codes that I can use for queries and selections. Let's say I have a dictionary of state names and corresponding FIPS codes:
statedict ={'Alabama': '01', 'Alaska':'02', 'Arizona': '04',... 'Wyoming': '56'}
And then I have a list of FIPS codes that I have pulled in from a Map Server request:
fipslist = ['02121', '01034', '56139', '04187', '02003', '04023', '02118']
I want to sort of combine the key from the dictionary (based on the first 2 characters of the value of that key) with the list items (also, based on the first 2 characters of the value of that key. Ex. all codes beginning with 01 = 'Alabama', etc...). My end goal is something like this:
fipsdict ={'Alabama': ['01034'], 'Alaska':['02121', '02003','02118'], 'Arizona': ['04187', '04023'],... 'Wyoming': ['56139']}
I would try to set it up similar to this, but it's not working quite correctly. Any suggestions?
fipsdict = {}
tempList = []
for items in fipslist:
for k, v in statedict:
if item[:2] == v in statedict:
fipsdict[k] = statedict[v]
fipsdict[v] = tempList.extend(item)
A one liner with nested comprehensions:
>>> {k:[n for n in fipslist if n[:2]==v] for k,v in statedict.items()}
{'Alabama': ['01034'],
'Alaska': ['02121', '02003', '02118'],
'Arizona': ['04187', '04023'],
'Wyoming': ['56139']}
You will have to create a new list to hold matching fips codes for each state. Below is the code that should work for your case.
for state,two_digit_fips in statedict.items():
matching_fips = []
for fips in fipslist:
if fips[:2] == two_digit_fips:
matching_fips.append(fips)
state_to_matching_fips_map[state] = matching_fips
>>> print(state_to_matching_fips_map)
{'Alabama': ['01034'], 'Arizona': ['04187', '04023'], 'Alaska': ['02121', '02003', '02118'], 'Wyoming': ['56139']}
For both proposed solutions I need a reversed state dictionary (I assume that each state has exactly one 2-digit code):
reverse_state_dict = {v: k for k,v in statedict.items()}
An approach based on defaultdict:
from collections import defaultdict
fipsdict = defaultdict(list)
for f in fipslist:
fipsdict[reverse_state_dict[f[:2]]].append(f)
An approach based on groupby and dictionary comprehension:
from itertools import groupby
{reverse_state_dict[k]: list(v) for k,v
in (groupby(sorted(fipslist), key=lambda x:x[:2]))}

looping over function with multiple argument

Beginning python user here. This seems like a simple question but I haven't been able to find an answer or at least recognize an answer.
I have a function as follows:
def standardize(columns, dictionary):
for x in columns:
df.iloc[:,x] = df.iloc[:,x].map(dictionary)
The function takes a list of columns and recodes all the values in that column according to the associated dictionary.
Rather than calling the function a dozen times for each list of columns and its associated dictionary:
standardize([15,19,27], dict1)
standardize([47,65,108], dict2)
standardize([49,53,55,90], dict3)
ideally I'd like it to loop over. a list of all the column lists and a list of all the dictionaries. something like:
for column_list in [[list1], [list2], [list3]]:
standardize(column_list, associated_dictionary)
How would I go about this?
No, there's no need to loop when considering single replacement. You can use applymap instead:
col_list = [...]
df.iloc[:, columns] = df.iloc[:, columns].applymap(dictionary.get)
Now, for multiple sets, you can zip your column lists and dictionaries and iterate:
column_lists = [col_list1, col_list2, ...]
dictionaries = [dict1, dict2, ...]
for c, d in zip(column_lists, dictionaries):
df.iloc[:, c] = df.iloc[:, c].applymap(d.get)

List, tuples or dictionary, differences and usage, How can I store info in python

I'm very new in python (I usually write in php). I want to understand how to store information in an associative array, and if you can explain me whats the difference of "tuples", "arrays", "dictionary" and "list" will be wonderful (I tried to read different source but I still not caching it).
So This is my code:
#!/usr/bin/python3.4
import csv
import string
nidless_keys = dict()
nidless_keys = ['test_string1','test_string2'] #this contain the string to
# be searched in linesreader
data = {'type':[],'id':[]} #here I want to store my information
with open('path/to/csv/file.csv',newline="") as csvfile:
linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
for row in linesreader: #every line in this csv have a url like
#www.test.com/?test_string1&id=123456
current_row_string = str(row)
for needle in nidless_keys:
current_needle = str(needle)
if current_needle in current_row_string:
data[current_needle[current_row_string[-8:]]) += 1 # also I
#need to count per every id how much rows there are.
In conclusion:
my_data_stored = [current_needle][current_row_string[-8]]
current_row_string[-8] is a url which the last 8 digit of the url is an ID.
So the array should looks like this at the end of the script:
test_string1 = 123456 = 20
= 256468 = 15
test_string2 = 123155 = 10
Edit 1:
Which type I need here to store the information?
Can you tell me how to resolve this script?
It seems you want to count how many times an ID in combination with a test string occurs.
There can be multiple ID/count combinations associated with every test string.
This suggests that you should use a dictionary indexed by the test strings to store the results. In that dictionary I would suggest to store collections.Counter objects.
This way, you would have to add a special case when a key in the results dictionary isn't found to add an empty Counter. This is a common problem, so there is a specialized form of dictionary in the collections module called defaultdict.
import collections
import csv
# Using a tuple for the keys so it cannot be accidentally modified
keys = ('test_string1', 'test_string2')
result = collections.defaultdict(collections.Counter)
with open('path/to/csv/file.csv',newline="") as csvfile:
linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
for row in linesreader:
for key in keys:
if key in row:
id = row[-6:] # ID's are six digits in your example.
# The first index is into the dict, the second into the Counter.
result[key][id] += 1
There is an even easier way, by using regular expressions.
Since you seem to treat every row in a CSV file as a string, there is little need to use the CSV reader, so I'll just read the whole file as text.
import re
with open('path/to/csv/file.csv') as datafile:
text = datafile.read()
pattern = r'\?(.*)&id=(\d+)'
The pattern is a regular expression. This is a large topic in and of itself, so I'll only cover briefly what it does. (You might also want to check out the relevant HOWTO) At first glance it looks like complete gibberish, but it is actually a complete language.
In looks for two things in a line. Anything between ? and &id=, and a sequence of digits after &id=.
I'll be using IPython to give an example.
(If you don't know it, check out IPython. It is great for trying things and see if they work.)
In [1]: import re
In [2]: pattern = r'\?(.*)&id=(\d+)'
In [3]: text = """www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=234567
....: www.test.com/?foo&id=234567
....: www.test.com/?foo&id=123456
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234"""
The text variable points to the string which is a mock-up for the contents of your CSV file.
I am assuming that:
every URL is on its own line
ID's are a sequence of digits.
If these assumptions are wrong, this won't work.
Using findall to extract every match of the pattern from the text.
In [4]: re.findall(pattern, test)
Out[4]:
[('test_string1', '123456'),
('test_string1', '123456'),
('test_string1', '234567'),
('foo', '234567'),
('foo', '123456'),
('foo', '1234'),
('foo', '1234'),
('foo', '1234')]
The findall function returns a list of 2-tuples (that is key, ID pairs). Now we just need to count those.
In [5]: import collections
In [6]: result = collections.defaultdict(collections.Counter)
In [7]: intermediate = re.findall(pattern, test)
Now we fill the result dict from the list of matches that is the intermediate result.
In [8]: for key, id in intermediate:
....: result[key][id] += 1
....:
In [9]: print(result)
defaultdict(<class 'collections.Counter'>, {'foo': Counter({'1234': 3, '123456': 1, '234567': 1}), 'test_string1': Counter({'123456': 2, '234567': 1})})
So the complete code would be:
import collections
import re
with open('path/to/csv/file.csv') as datafile:
text = datafile.read()
result = collections.defaultdict(collections.Counter)
pattern = r'\?(.*)&id=(\d+)'
intermediate = re.findall(pattern, test)
for key, id in intermediate:
result[key][id] += 1
This approach has two advantages.
You don't have to know the keys in advance.
ID's are not limited to six digits.
A brief summary of the python data types you mentioned:
A dictionary is an associative array, aka hashtable.
A list is a sequence of values.
An array is essentially the same as a list, but limited to basic datatypes. My impression is that they only exists for performance reasons, don't think I've ever used one. If performance is that critical to you, you probably don't want to use python in the first place.
A tuple is a fixed-length sequence of values (whereas lists and arrays can grow).
Lets take them one by one.
Lists:
List is a very naive kind of data structure similar to arrays in other languages in terms of the way we write them like:
['a','b','c']
This is a list in python , but seems very similar to array structure.
However there is a very large difference in the way lists are used in python and the usual arrays.
Lists are heterogenous in nature. This means that we can store any kind of data simultaneously inside it like:
ls = [1,2,'a','g',True]
As you can see, we have various kinds of data within a list and is a valid list.
However, one important thing about them is that we can access the list items using zero based indices. So we can write:
print ls[0],ls[3]
output: 1 g
Dictionary:
This datastructure is similar to a hash map data structure. It contains a (key,Value) pair. An empty dictionary looks like:
dc = {}
Now, to store a key,value pair, e.g., ('potato',3),(tomato,5), we can do as:
dc['potato'] = 3
dc['tomato'] = 5
and we saved the data in the dictionary dc.
The important thing is that we can even store another data structure element like a list within a dictionary like:
dc['list1'] = ls , where ls is the list defined above.
This shows the power of using dictionary.
In your case, you have difined a dictionary like this:
data = {'type':[],'id':[]}
This means that your dictionary will consist of only two keys and each key corresponds to a list, which are empty for now.
Talking a bit about your script, the expression :
current_row_string[-8:]
doesn't make a sense. The index should have been -6 instead of -8 that would give you the id part of the current row.
This part is the id and should have been stored in a variable say :
id = current_row_string[-6:]
Further action can be performed as seen the answer given by Roland.

Resources