Good day!
I'm trying to find the minimum and maximum values in a given dataset:
foo,1,1
foo,2,5
foo,3,0
bar,1,5
bar,2,0
bar,3,0
foo,1,1
foo,2,2
foo,3,4
bar,1,4
bar,2,0
bar,3,1
foo,1,4
foo,2,2
foo,3,3
bar,1,1
bar,2,3
bar,3,0
I organize my data using the 1st and 2nd columns as the ID and the 3rd column as the value:
from collections import defaultdict

data = defaultdict(list)
with open("file1.txt", 'r') as infile:
    for line in infile:
        line = line.strip().split(',')
        meta = line[0]
        id_ = line[1]
        value = line[2]
        try:
            value = int(line[2])
            data[meta + id_].append(value)
        except ValueError:
            print('nope')
The output of my code is:
defaultdict(list,
{'foo1': ['1', '1', '4'],
'foo2': ['5', '2', '2'],
'foo3': ['0', '4', '3'],
'bar1': ['5', '4', '1'],
'bar2': ['0', '0', '3'],
'bar3': ['0', '1', '0']})
Please advise: how can I get the minimum and maximum values for each ID?
I need output something like this:
defaultdict(list,
{'foo1': ['1', '4'],
'foo2': ['2', '5'],
'foo3': ['0', '4'],
'bar1': ['1', '5'],
'bar2': ['0', '3'],
'bar3': ['0', '1']})
Update:
With #AndiFB's help I added sorting to my lists:
def sorting_func(string):
    return int(string)

from collections import defaultdict

data = defaultdict(list)
with open("file1.txt", 'r') as infile:
    for line in infile:
        line = line.strip().split(',')
        meta = line[0]
        id_ = line[1]
        value = line[2]
        try:
            if value != "-":
                value = int(line[2])
                data[meta + id_].append(value)
                data[meta + id_].sort(key=sorting_func)
                print("max:", *data[meta + id_][-1:], 'min:', *data[meta + id_][:1])
        except ValueError:
            print('nope')

data
Output:
max: 1 min: 1
max: 5 min: 5
max: 0 min: 0
max: 5 min: 5
max: 0 min: 0
max: 0 min: 0
max: 1 min: 1
max: 5 min: 2
max: 4 min: 0
max: 5 min: 4
max: 0 min: 0
max: 1 min: 0
max: 4 min: 1
max: 5 min: 2
max: 4 min: 0
max: 5 min: 1
max: 3 min: 0
max: 1 min: 0
defaultdict(list,
{'foo1': [1, 1, 4],
'foo2': [2, 2, 5],
'foo3': [0, 3, 4],
'bar1': [1, 4, 5],
'bar2': [0, 0, 3],
'bar3': [0, 0, 1]})
Please advise: how can I keep only the min and max (the first and the last) values in each list,
to get something like this?
defaultdict(list,
{'foo1': ['1', '4'],
'foo2': ['2', '5'],
'foo3': ['0', '4'],
'bar1': ['1', '5'],
'bar2': ['0', '3'],
'bar3': ['0', '1']})
from collections import defaultdict

def sorting_func(string):
    return int(string)

d = defaultdict(list)
d['python'].append('10')
d['python'].append('2')
d['python'].append('5')
print("d['python'].__contains__('10'): {}".format(d['python'].__contains__('10')))
print(str(d['python']))
d['python'].sort(key=sorting_func)
print('d["python"]: ' + str(d['python']))
print('d["python"][0]: ' + d['python'][0])
print('d["python"][2]: ' + d['python'][2])
print(str(len(d['python'])))
Resulting in the following output
d['python'].__contains__('10'): True
['10', '2', '5']
d["python"]: ['2', '5', '10']
d["python"][0]: 2
d["python"][2]: 10
3
You can sort the list, leaving the minimum value in the first position and the maximum value in the last one.
Be aware that if a string contained in the dict cannot be cast to int, an exception will be raised; the sorting function expects a number to compare. For example, another sorting function could be:
def sorting_func(string):
    return len(string)
This one sorts by the length of the string.
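To get down to just the min and max per ID, as the question asks, one option is to reduce each list once the file has been read; a minimal sketch, assuming the data defaultdict built in the Update (with int values) and an illustrative name minmax:
# Minimal sketch: keep only the smallest and largest value per ID.
# `data` is assumed to be the defaultdict built in the Update above.
minmax = {key: [min(values), max(values)] for key, values in data.items()}
# e.g. {'foo1': [1, 4], 'foo2': [2, 5], 'foo3': [0, 4], 'bar1': [1, 5], ...}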
Since you are working on a dataset, an easy way to achieve this would be to use pandas, do a groupby on id, and aggregate on value to get the min and max for each id.
# data from the question
s ="""foo,1,1
foo,2,5
foo,3,0
bar,1,5
bar,2,0
bar,3,0
foo,1,1
foo,2,2
foo,3,4
bar,1,4
bar,2,0
bar,3,1
foo,1,4
foo,2,2
foo,3,3
bar,1,1
bar,2,3
bar,3,0"""
# splitting on newlines
t = s.split('\n')
# creating a dataframe by splitting each row on commas
import pandas as pd
df = pd.DataFrame([i.split(',') for i in t])
Output:
>>> df
0 1 2
0 foo 1 1
1 foo 2 5
2 foo 3 0
3 bar 1 5
4 bar 2 0
5 bar 3 0
6 foo 1 1
7 foo 2 2
8 foo 3 4
9 bar 1 4
10 bar 2 0
11 bar 3 1
12 foo 1 4
13 foo 2 2
14 foo 3 3
15 bar 1 1
16 bar 2 3
17 bar 3 0
# creating the id column by concatenating columns 0 and 1, renaming column 2 to 'value', and dropping columns 0 and 1
df['id']=df[0]+df[1]
df = df.rename(columns={df.columns[2]: 'value'})
df = df.drop([0,1], axis = 1)
Output:
>>> df
value id
0 1 foo1
1 5 foo2
2 0 foo3
3 5 bar1
4 0 bar2
5 0 bar3
6 1 foo1
7 2 foo2
8 4 foo3
9 4 bar1
10 0 bar2
11 1 bar3
12 4 foo1
13 2 foo2
14 3 foo3
15 1 bar1
16 3 bar2
17 0 bar3
# doing a groupby and aggregating to get min and max for each id
df.groupby('id').value.agg([min,max])
Output:
min max
id
bar1 1 5
bar2 0 3
bar3 0 1
foo1 1 4
foo2 2 5
foo3 0 4
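If the dict-of-[min, max] shape from the question is wanted, the aggregated frame can be converted; a small sketch, assuming the value column is cast to int first (agg_df and minmax are illustrative names):
# Illustrative follow-up: cast values to int, aggregate, then build a dict per id.
df['value'] = df['value'].astype(int)
agg_df = df.groupby('id')['value'].agg(['min', 'max'])
minmax = agg_df.apply(list, axis=1).to_dict()
# e.g. {'bar1': [1, 5], 'bar2': [0, 3], ..., 'foo3': [0, 4]}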
Related
Here is the Dataframe I am working with:
You can create it using the snippet:
import pandas as pd

my_dict = {'id': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 3, 3],
           'category': ['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
           'value': [1, 12, 34, 12, 12, 34, 12, 35, 34, 45, 65, 55, 34, 25]}
x = pd.DataFrame(my_dict)
x
I want to filter IDs based on the condition: for category a, the count of values should be 2 and for category b, the count of values should be 3. Therefore, I would remove id 1 from category a and id 3 from category b from my original dataset x.
I can write the code for individual categories and remove ids manually using:
x.query('category == "a"').groupby('id').value.count().loc[lambda x: x != 2]
x.query('category == "b"').groupby('id').value.count().loc[lambda x: x != 3]
But I don't want to do it manually since there are multiple categories. Is there a better way of doing it that considers all the categories at once and removes ids based on conditions listed in a list/dictionary?
If you need to filter the MultiIndex Series s by the dictionary, use Index.get_level_values with Series.map and keep the groups with matching counts via boolean indexing:
s = x.groupby(['category','id']).value.count()
d = {'a': 2, 'b': 3}
print (s[s.eq(s.index.get_level_values(0).map(d))])
category id
a 2 2
3 2
b 1 3
2 3
Name: value, dtype: int64
If you need to filter the original DataFrame:
s = x.groupby(['category','id'])['value'].transform('count')
print (s)
0 3
1 2
2 3
3 3
4 3
5 3
6 3
7 2
8 3
9 3
10 1
11 3
12 2
13 2
Name: value, dtype: int64
d = {'a': 2, 'b': 3}
print (x[s.eq(x['category'].map(d))])
id category value
1 2 a 12
2 1 b 34
3 2 b 12
4 1 b 12
5 2 b 34
7 2 a 35
8 1 b 34
9 2 b 45
12 3 a 34
13 3 a 25
MAPPER DATAFRAME
import pandas as pd

col_data = {'p0_tsize_qbin_': [1, 2, 3, 4, 5],
            'p0_tsize_min': [0.0, 7.0499999999999545, 16.149999999999977, 32.65000000000009, 76.79999999999973],
            'p0_tsize_max': [7.0, 16.100000000000023, 32.64999999999998, 76.75, 6759.850000000006]}
map_df = pd.DataFrame(col_data, columns=['p0_tsize_qbin_', 'p0_tsize_min', 'p0_tsize_max'])
map_df
In the data frame above, map_df, columns 2 and 3 give the range and column 1 is the value to map onto the new data frame.
MAIN DATAFRAME
raw_data = {
'id': ['1', '2', '2', '3', '3','1', '2', '2', '3', '3','1', '2', '2', '3', '3'],
'val' : [3, 56, 78, 11, 5000,37, 756, 78, 49, 21,9, 4, 14, 75, 31,]}
df = pd.DataFrame(raw_data, columns = ['id', 'val','p0_tsize_qbin_mapped'])
df
EXPECTED OUTPUT
Look up val from the df data frame in map_df: find the row whose p0_tsize_min to p0_tsize_max range contains it, and take that row's p0_tsize_qbin_ value.
For example: in df, val = 3 lies in the p0_tsize_min to p0_tsize_max range where p0_tsize_qbin_ == 1, so 1 is returned.
Try using pd.cut()
bins = map_df['p0_tsize_min'].tolist() + [map_df['p0_tsize_max'].max()]
labels = map_df['p0_tsize_qbin_'].tolist()
df.assign(p0_tsize_qbin_mapped = pd.cut(df['val'],bins = bins,labels = labels))
or
bins = pd.IntervalIndex.from_arrays(map_df['p0_tsize_min'],map_df['p0_tsize_max'])
map_df.loc[bins.get_indexer(df['val'].tolist()),'p0_tsize_qbin_'].to_numpy()
Output:
id val p0_tsize_qbin_mapped
0 1 3 1
1 2 56 4
2 2 78 5
3 3 11 2
4 3 5000 5
5 1 37 4
6 2 756 5
7 2 78 5
8 3 49 4
9 3 21 3
10 1 9 2
11 2 4 1
12 2 14 2
13 3 75 4
14 3 31 3
I wonder how to replace values by type in a data frame. In this sample I want to replace all strings with 0 or NaN. Here is my simple df, and I try to do:
df.replace(str, 0, inplace=True)
or
df.replace({str: 0}, inplace=True)
but the above solutions do not work.
0 1 2
0 NaN 1 'b'
1 2 3 'c'
2 4 'd' 5
3 10 20 30
Check this code: it will visit every cell in the data frame, and if the cell is NaN or a string it will replace it with 0.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [0, 1, 2, 3, np.nan],
'B': [np.nan, 6, 7, 8, 9],
'C': ['a', 10, 500, 'd', 'e']})
print("before >>> \n",df)
def replace_nan_and_strings(cell_value):
    # Replace NaN or string cells with 0, keep everything else.
    if pd.isnull(cell_value) or isinstance(cell_value, str):
        return 0
    else:
        return cell_value

new_df = df.applymap(replace_nan_and_strings)
print("after >>> \n",new_df)
Try this:
df = df.replace('[a-zA-Z]', 0, regex=True)
This is how I tested it:
'''
0 1 2
0 NaN 1 'b'
1 2 3 'c'
2 4 'd' 5
3 10 20 30
'''
import pandas as pd
df = pd.read_clipboard()
df = df.replace('[a-zA-Z]', 0, regex=True)
print(df)
Output:
0 1 2
0 NaN 1 0
1 2.0 3 0
2 4.0 0 5
3 10.0 20 30
New scenario, as requested in the comments:
Input:
'''
0 '1' 2
0 NaN 1 'b'
1 2 3 'c'
2 '4' 'd' 5
3 10 20 30
'''
Output:
0 '1' 2
0 NaN 1 0
1 2 3 0
2 '4' 0 5
3 10 20 30
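If the NaN cells should also become 0 (the behaviour of the applymap answer above), the regex replacement can be combined with fillna; a small sketch:
# Replace alphabetic strings with 0, then fill the remaining NaN cells with 0.
df = df.replace('[a-zA-Z]', 0, regex=True).fillna(0)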
I have a df like this one:
import pandas as pd
cols = ['id', 'factor_var']
values = [
[1, 'a'],
[2, 'a'],
[3, 'a'],
[4, 'b'],
[5, 'b'],
[6, 'c'],
[7, 'c'],
[8, 'c'],
[9, 'c'],
[10, 'c'],
[11, 'd'],
]
df = pd.DataFrame(values, columns=cols)
My target df has the following columns:
target_columns = ['id', 'factor_var_a', 'factor_var_b', 'factor_var_other']
The column factor_var_other should cover all categories in factor_var that are not a or b, regardless of how often each category appears.
Any ideas will be much appreciated.
You can replace values not in the list with Series.where, assign the result back with DataFrame.assign, and finally call get_dummies:
s = df['factor_var'].where(df['factor_var'].isin(['a','b']), 'other')
#alternative
#s = df['factor_var'].map({'a':'a','b':'b'}).fillna('other')
df = pd.get_dummies(df.assign(factor_var=s), columns=['factor_var'])
print (df)
id factor_var_a factor_var_b factor_var_other
0 1 1 0 0
1 2 1 0 0
2 3 1 0 0
3 4 0 1 0
4 5 0 1 0
5 6 0 0 1
6 7 0 0 1
7 8 0 0 1
8 9 0 0 1
9 10 0 0 1
10 11 0 0 1
I have the dataframe presented below, and I tried the following solution, but I am not sure it is a good one.
import numpy as np
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable var to take the value of column 'var-A', 'var-B', or 'var-C' depending on the region given in column 'Region'.
The result must be:
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try it with lookup:
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
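Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on newer versions a rough equivalent, sketched with the same renamed A/B/C columns as above, is:
import numpy as np

# Equivalent of the removed lookup: for each row, pick the value from the
# column named by that row's Region.
idx, cols = pd.factorize(df['Region'])
df['var'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]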