Python - unable to count occurrences of values in defined ranges in dataframe - python-3.x

I'm trying to write code that analyses the values in a dataframe: if a value falls in a class, the total number of such values is added to a key in a dictionary. I'm trying to create logarithmic classes and count the total number of values that fall in each one, but the code is not working for me.
def bins(df):
    """Returns new df with values assigned to bins"""
    bins_dict = {500: 0, 5000: 0, 50000: 0, 500000: 0}
    for i in df:
        if 100 < i and i <= 1000:
            bins_dict[500] += 1
        elif 1000 < i and i <= 10000:
            bins_dict[5000] += 1
    print(bins_dict)
However, this is returning the original dictionary.
I've also tried modifying the dataframe using
def transform(df, range):
    for i in df:
        for j in range:
            b = 10**j
            while j == 1:
                while i > 100:
                    if i >= b:
                        j += 1
                    elif i < b:
                        b = b/2
                        print(i = b*(int(i/b)))
This code is returning the original dataframe.
My dataframe consists of only one column with values ranging between 100 and 10000000
Data Sample:
Area
0 1815
1 907
2 1815
3 907
4 907
Expected output
dict={500:3, 5000:2, 50000:0}
If I can get a dataframe output directly, that would be helpful too.
PS: I am very new to programming and I only know Python.

You can do this with pandas' pd.cut:
import pandas as pd
df = pd.DataFrame()
df['Area'] = [1815, 907, 1815, 907, 907]
# create new column to categorize your data
df['bins'] = pd.cut(df['Area'], [0,1000,10000,100000], labels=['500', '5000', '50000'])
# converting into dictionary
dic = dict(df['bins'].value_counts())
print(dic)
Output:
{'500': 3, '5000': 2, '50000': 0}
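If you would rather end up with a DataFrame of counts instead of a dictionary (as asked in the question), one possible follow-up on the same df is to reset the index of the value counts; the column names bin and count below are just illustrative choices:
# turn the per-bin counts into a two-column DataFrame
counts = (df['bins'].value_counts()
          .rename_axis('bin')          # give the index a name so it becomes a column
          .reset_index(name='count'))  # move the index into a regular column
print(counts)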

Related

split dataframe into a dictionary of dictionaries

I have a dataframe containing 4 columns. I want to use 2 of the columns as keys for a dictionary of dictionaries, where the values inside are the remaining 2 columns (so a dataframe).
birdies = pd.DataFrame({'Habitat': ['Captive', 'Wild', 'Captive', 'Wild'],
                        'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                        'Max Speed': [380., 370., 24., 26.],
                        'Color': ["white", "grey", "green", "blue"]})
# this should output speed and color
birdies_dict["Falcon"]["Wild"]
# this should contain a dictionary whose keys are 'Captive', 'Wild'
birdies_dict["Falcon"]
I have found a way to generate a dictionary of dataframes with a single column as a key, but not with 2 columns as keys:
birdies_dict = {k:table for k,table in birdies.groupby("Animal")}
I suggest using defaultdict for this; a solution for the two-column problem (keyed animal-first so that d['Falcon']['Wild'] works) is:
from collections import defaultdict

d = defaultdict(dict)
for (ani, hab), _df in birdies.groupby(['Animal', 'Habitat']):
    d[ani][hab] = _df
This works for two key columns; if you want a deeper nesting, you can just define a recursive defaultdict:
from collections import defaultdict
recursive_dict = lambda: defaultdict(recursive_dict)
dct = recursive_dict()
dct[1][2][3] = ...
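With the birdies DataFrame from the question, a quick check of the nested lookup built above (assuming the animal-then-habitat key order) could look like this:
print(d['Falcon'].keys())    # dict_keys(['Captive', 'Wild'])
print(d['Falcon']['Wild'])   # the one-row DataFrame for the wild falcon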
Call to_dict on the inner dataframes:
birdies_dict = {k:d.to_dict() for k,d in birdies.groupby('Animal')}
birdies_dict['Falcon']['Habitat']
Output:
{0: 'Captive', 1: 'Wild'}
Or do you mean:
out = birdies.set_index(['Animal','Habitat'])
out.loc[('Falcon','Captive')]
which gives:
Max Speed 380
Color white
Name: (Falcon, Captive), dtype: object
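If a plain dictionary per row is what you are after, a small follow-up (illustrative, not part of the original answer) is:
row = out.loc[('Falcon', 'Captive')]
print(row.to_dict())   # {'Max Speed': 380.0, 'Color': 'white'}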
IIUC:
birdies_dict = {k: {habitat: table.loc[table['Habitat'] == habitat, ['Max Speed', 'Color']].to_numpy()
                    for habitat in table['Habitat'].to_numpy()}
                for k, table in birdies.groupby("Animal")}
OR
birdies_dict = {k: {habitat: table.loc[table['Habitat'] == habitat, ['Max Speed', 'Color']]
                    for habitat in table['Habitat'].to_numpy()}
                for k, table in birdies.groupby("Animal")}
# in this case the inner key maps to a DataFrame
OR
birdies_dict = {k: {inner_key: inner_table for inner_key, inner_table in table.groupby('Habitat')}
                for k, table in birdies.groupby("Animal")}

Python Multivariate Apply/Map/Applymap

I'm trying to use apply/map with multiple column values; it works fine for one column. I made an example below of applying a function to a single column, and I need help making the commented-out part work with multiple columns as inputs.
I need it to take 2 values on the same row but from different columns as inputs to the function, perform a calculation, and then place the result in the new column. If there is an efficient/optimized way to do this with apply, map, or both, please let me know. Thanks!
import numpy as np
import pandas as pd

# gen some data to work with
df = pd.DataFrame({'Col_1': [2, 4, 6, 8],
                   'Col_2': [11, 22, 33, 44],
                   'Col_3': [2, 1, 2, 1]})

# make new empty col to write to
df[4] = ""

# some function of one variable
def func(a):
    return a**2

# applies function itemwise to Col_1 & puts result in new column correctly
df[4] = df['Col_1'].map(func)

"""
# Want multivariate version: apply an itemwise function (where each item is
# from a different column), compute, then add to the new column
def func(a, b):
    return a**2 + b**2

# Take Col_1 value and Col_2 value; run function of multiple variables,
# display result in new column...???
df[4] = df['Col_1']df['Col_2'].map(func(a,b))
"""
You can pass each row of the dataframe as a Series using the apply function with axis=1, and then use the row's values inside the function as needed.
def func(row):
    return row['Col_1']**2 + row['Col_2']**2

df[4] = df.apply(func, axis=1)
Refer to the pandas apply documentation to explore further.
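As a side note (not from the original answer), for simple arithmetic like this a vectorized expression avoids the per-row Python call entirely and is usually much faster:
# operate on whole columns at once instead of row by row
df[4] = df['Col_1']**2 + df['Col_2']**2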

Concatenation of pd.Series results in pandas.core.indexes.base.InvalidIndexError

I don't get why my code doesn't work.
If I execute the code below I get the following error:
pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
The created series all have a timestamp index, where some entries are the same. I don't understand how this differs from the example in the pandas manual, because the dfs there also have identical indices.
ds = pd.Series([dat[1] for dat in dic_data[name]], index=[dat[0] for dat in dic_data[name]], name=name)  # create the series
ds_list.append(ds)
df = pd.concat(ds_list, axis=1, sort=False)  # I copied this line from the example in the pandas manual
The resulting dataframe should look like the one from the example.
In the pandas example, a pandas.Series can have the same or similar index values as another Series you intend to concatenate with it; however, the index values within each individual pandas.Series have to be unique,
i.e. len(ds.index) - len(ds.index.unique()) should equal 0.
Your index in ds has multiple values that are the same.
If you want to concatenate based on the index, it is unclear how to combine these duplicated indices with another pandas.Series, so you will receive an error.
For example, your code will work if each ds has an index which is unique:
import numpy as np
import pandas as pd

ds_list = []
dates = pd.date_range(start='1/1/2019', periods=7)
values = np.arange(7)
dic_data = {'Col1': [(dates[i], values[i]) for i in range(7)], 'Col2': [(dates[i], values[i]*2) for i in range(7)]}
for name in dic_data.keys():
    ds = pd.Series([dat[1] for dat in dic_data[name]], index=[dat[0] for dat in dic_data[name]], name=name)
    ds_list.append(ds)
df = pd.concat(ds_list, axis=1, sort=False)
In [43]: df
Out[43]:
Col1 Col2
2019-01-01 0 0
2019-01-02 1 2
2019-01-03 2 4
2019-01-04 3 6
2019-01-05 4 8
2019-01-06 5 10
2019-01-07 6 12
However, if your ds has duplicated index values it will not:
dic_data = {'Col1': [(dates[i % 7], values[i % 7]) for i in range(12)], 'Col2': [(dates[i], values[i]*2) for i in range(7)]}
ds_list = []
for name in dic_data.keys():
    ds = pd.Series([dat[1] for dat in dic_data[name]], index=[dat[0] for dat in dic_data[name]], name=name)
    ds_list.append(ds)
df = pd.concat(ds_list, axis=1, sort=False)  # raises the error below
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
As a fix, I would aggregate on the timestamps, for example with a groupby (or however else you want to aggregate the duplicates):
dic_data = {'Col1': [(dates[i % 7], values[i % 7]) for i in range(12)], 'Col2': [(dates[i], values[i]*2) for i in range(7)]}
ds_list = []
for name in dic_data.keys():
    ds = pd.Series([dat[1] for dat in dic_data[name]], index=[dat[0] for dat in dic_data[name]], name=name)
    ds = ds.groupby(level=0).sum()   # collapse duplicate timestamps by summing
    ds_list.append(ds)
df = pd.concat(ds_list, axis=1, sort=False)
Or you could just drop the duplicated entries instead of aggregating them.
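Note that Series.drop_duplicates removes duplicate values; if the goal is to drop entries whose timestamp index repeats, a common idiom (a sketch, not from the original answer) is to mask on Index.duplicated:
# keep only the first occurrence of each timestamp in the index
ds = ds[~ds.index.duplicated(keep='first')]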

Using non-zero values from columns in function - pandas

I have the dataframe below and would like to calculate, inside a function, the difference between columns 'animal1' and 'animal2' over their sum, while only taking into consideration the values that are bigger than 0 in each of the columns 'animal1' and 'animal2'.
How could I do this?
import pandas as pd

animal1 = pd.Series({'Cat': 4, 'Dog': 0, 'Mouse': 2, 'Cow': 0, 'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3, 'Mouse': 0, 'Cow': 1, 'Chicken': 2})
data = pd.DataFrame({'animal1': animal1, 'animal2': animal2})

def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + ['animal2'])
    return data['anim_diff'].abs().idxmax()

print(data)
I believe you need to check that all values in a row are greater than 0 with DataFrame.gt, test them with DataFrame.all, and filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
# alternative for not equal 0
# df = data[data.ne(0).all(axis=1)].copy()
print(df)
animal1 animal2
Cat 4 2
Chicken 3 2
print(animals(df))
Cat

Subtract a single value from columns in pandas

I have two data frames, df and df_test. I am trying to create a new dataframe for each df_test row that will include the difference between the x coordinates and the y coordinates. I would also like to create a new column that gives the magnitude of this distance between objects. Below is my code.
import pandas as pd
import numpy as np

# Create Dataframe
index_numbers = np.linspace(0, 10, 11, dtype=int)
index_ = ['OP_%s' % number for number in index_numbers]
header = ['X', 'Y', 'D']
# print(index_)
data = np.round_(np.random.uniform(low=0, high=10, size=(len(index_), 3)), decimals=0)
# print(data)
df = pd.DataFrame(data=data, index=index_, columns=header)
df_test = df.sample(3)
# print(df)
# print(df_test)

for index, row in df_test.iterrows():
    print(index)
    print(row)
    df_(index) = df
    df_(index)['X'] = df['X'] - df_test['X'][row]
    df_(index)['Y'] = df['Y'] - df_test['Y'][row]
    df_(index)['Dist'] = np.sqrt(df_(index)['X']**2 + df_(index)['Y']**2)
    print(df_(index))
Better For Loop
for index, row in df_test.iterrows():
    # print(index)
    # print(row)
    # print("df_{0}".format(index))
    df_temp = df.copy()
    df_temp['X'] = df_temp['X'] - df_test['X'][index]
    df_temp['Y'] = df_temp['Y'] - df_test['Y'][index]
    df_temp['Dist'] = np.sqrt(df_temp['X']**2 + df_temp['Y']**2)
    print(df_temp)
I have written a for loop to run through each row of the df_test dataframe and "try" to create the columns. The (index) in each loop was meant to become the name of a new dataframe, based on the test row used. Once a dataframe is created with the modified and new columns, I need to save the dataframes to a dictionary. The improved loop above produces each of the new dataframes I need, but what is the best way to save each new dataframe? Any help in creating these columns would be greatly appreciated.
Please comment with any questions so that I can make it easier to understand, if need be.
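One straightforward way to keep every result (a sketch of one possible approach, with frames as an illustrative name) is to store the per-row dataframes in a dictionary keyed by the df_test index label:
# collect one shifted copy of df per row of df_test, keyed by that row's label
frames = {}
for index, row in df_test.iterrows():
    df_temp = df.copy()
    df_temp['X'] = df_temp['X'] - row['X']
    df_temp['Y'] = df_temp['Y'] - row['Y']
    df_temp['Dist'] = np.sqrt(df_temp['X']**2 + df_temp['Y']**2)
    frames[index] = df_temp

# e.g. frames[df_test.index[0]] is the dataframe built from the first sampled row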
