I have created a list of dataframes and want to loop through the list to run some manipulation on some of them. Note that although I created this list manually, these dataframes exist in my code.
df_list = [df_1, df_2, df_3, df_4, df_5, ...]
list_df_matching = []
list_non_matching = []
Most of these dataframes are blank, but two of them will have some records. I want to find the names of those dataframes and collect them in a new list, list_non_matching.
for df_name in df_list:
    q = df_name.count()
    if q > 0:
        list_non_matching.append(df_name)
    else:
        list_df_matching.append(df_name)
My goal is to get a list of dataframe names like [df_4, df_10], but I am getting the following:
[DataFrame[id: string, nbr: string, name: string, code1: string, code2: string],
DataFrame[id: string, nbr: string, name: string, code3: string, code4: string]]
Is the list approach incorrect? Is there a better way of doing it?
Here is an example to illustrate one way to do it with the help of the empty property and the Python built-in function globals:
import pandas as pd
df1 = pd.DataFrame()
df2 = pd.DataFrame({"col1": [2, 4], "col2": [5, 9]})
df3 = pd.DataFrame(columns = ["col1", "col2"])
df4 = pd.DataFrame({"col1": [3, 8], "col2": [2, 0]})
df5 = pd.DataFrame({"col1": [], "col2": []})
df_list = [df1, df2, df3, df4, df5]
list_non_matching = [
    name
    for df in df_list
    for name in globals()
    if not df.empty and globals()[name] is df
]
print(list_non_matching)
# Output
['df2', 'df4']
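As an aside, recovering names through globals() is fragile: it only sees module-level names and breaks inside functions. Keeping the names next to the frames in a dict avoids that entirely. A minimal pandas sketch (the frames here are made-up examples):

```python
import pandas as pd

# Keep the names alongside the frames instead of recovering them via globals()
named_dfs = {
    "df1": pd.DataFrame(),
    "df2": pd.DataFrame({"col1": [2, 4], "col2": [5, 9]}),
    "df3": pd.DataFrame(columns=["col1", "col2"]),
}

# A frame "matches" here when it is empty; collect the names of the rest
list_non_matching = [name for name, df in named_dfs.items() if not df.empty]
print(list_non_matching)  # ['df2']
```

The same pattern carries over to PySpark by swapping the emptiness test for `df.count() > 0`.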
I have only been able to create a two column data frame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
                         columns=['id', 'value'])
What I would like to be able to do is using this basic format also initiate the dataframe with three columns: 'id', 'id2' and 'value'. I have a separate defined dict that contains the necessary look up info, called id_lookup.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible encounters. For my purposes, simply putting it all together and placing 'N/A' or something similar for those errors will be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)
# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'
# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)
# Print the updated DataFrame
print(df)
which gives:
   id  value  id2
0   1     10    A
1   2     20    B
2   3     30    C
3   4     40  N/A
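For what it's worth, the asker's original three-column construction can also be made robust directly with dict.get, which returns a default instead of raising KeyError. A sketch with made-up stand-ins for output and id_lookup:

```python
import pandas as pd

# Hypothetical stand-ins for the asker's `output` defaultdict and lookup dict
output = {1: 10, 2: 20, 4: 40}
id_lookup = {1: 'A', 2: 'B'}

# dict.get supplies 'N/A' for ids missing from the lookup instead of raising
df_mydata = pd.DataFrame(
    [(k, id_lookup.get(k, 'N/A'), v) for k, v in output.items()],
    columns=['id', 'id2', 'value'])
print(df_mydata)
```

This keeps the one-liner format from the question and avoids the explicit try/except loop.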
I have the following Python pandas dataframe:
There are more EventName's than shown on this date.
Each will have Race_Number = 'Race 1', 'Race 2', etc.
After a while the date increments.
I'm trying to create a dataframe that looks like this:
Each race has different numbers of runners.
Is there a way to do this in pandas?
Thanks
I assumed output would be another DataFrame.
import pandas as pd
import numpy as np
from nltk import flatten
import copy
df = pd.DataFrame({'EventName': ['sydney', 'sydney', 'sydney', 'sydney', 'sydney', 'sydney'],
                   'Date': ['2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01', '2019-01.01'],
                   'Race_Number': ['Race1', 'Race1', 'Race1', 'Race2', 'Race2', 'Race3'],
                   'Number': [4, 7, 2, 9, 5, 10]
                   })
print(df)
dic={}
for rows in df.itertuples():
    if rows.Race_Number in dic:
        dic[rows.Race_Number] = flatten([dic[rows.Race_Number], rows.Number])
    else:
        dic[rows.Race_Number] = rows.Number
copy_dic = copy.deepcopy(dic)
seq = np.arange(0,len(dic.keys()))
for key, n_key in zip(copy_dic, seq):
    dic[n_key] = dic.pop(key)
df = pd.DataFrame([dic])
print(df)
I have a dataframe in the following format:
df = pd.DataFrame({'a':['1-Jul', '2-Jul', '3-Jul', '1-Jul', '2-Jul', '3-Jul'], 'b':[1,1,1,2,2,2], 'c':[3,1,2,4,3,2]})
I need the following dataframe:
df_new = pd.DataFrame({'a':['1-Jul', '2-Jul', '3-Jul'], 1:[3, 1, 2], 2:[4,3,2]})
I have tried the following:
df = df.pivot_table(index = ['a'], columns = ['b'], values = ['c'])
df_new = df.reset_index()
but it doesn't give me the required result. I have tried variations of this to no avail. Any help will be greatly appreciated.
try this one:
df2 = df.groupby('a')['c'].agg(['first','last']).reset_index()
cols_ = df['b'].unique().tolist()
cols_.insert(0,df.columns[0])
df2.columns = cols_
df2
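For what it's worth, the original pivot_table attempt was very close: passing values=['c'] as a list is what produces MultiIndex columns. Passing plain scalars keeps the columns flat. A sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1-Jul', '2-Jul', '3-Jul', '1-Jul', '2-Jul', '3-Jul'],
                   'b': [1, 1, 1, 2, 2, 2],
                   'c': [3, 1, 2, 4, 3, 2]})

# Scalars (not lists) for index/columns/values keep the result single-level
df_new = df.pivot_table(index='a', columns='b', values='c').reset_index()
df_new.columns.name = None  # drop the leftover 'b' label on the columns axis
print(df_new)
```

Unlike the first/last trick, this also generalizes to any number of distinct values in column b.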
Objective: to create a function that can match given dtypes to a predfined data type scenario.
Description: I want to be able to classify given datasets based on their attribution into predefined scenario types.
Below are two example datasets (df_a and df_b). df_a has only dtypes that are equal to 'object' while df_b has both 'object' and 'int64':
# scenario_a
data_a = [['tom', 'blue'], ['nick', 'green'], ['julia', 'red']]
df_a = pd.DataFrame(data_a, columns = ['Name','Color'])
df_a['Color'] = df_a['Color'].astype('object')
# scenario_b
data_b = [['tom', 10], ['nick', 15], ['julia', 14]]
df_b = pd.DataFrame(data_b, columns = ['Name', 'Age'])
I want to be able to determine automatically which scenario it is based on a function:
import pandas as pd
import numpy as np
def scenario(data):
    if data.dtypes.str.contains('object'):
        return scenario_a
    if data.dtypes.str.contains('object', 'int64'):
        return scenario_b
Above is what I have so far, but isn't getting the results I was hoping for.
When using the function scenario(df_a) I am looking for the result to be scenario_a and when I pass df_b I am looking for the function to be able to determine, correctly, what scenario it should be.
Any help would be appreciated.
Here is one approach. Create a dict scenarios, with the keys a sorted tuple of predefined dtypes, and the value being what you would want returned by the function.
Using your example, something like:
# scenario a
data_a = [['tom', 'blue'], ['nick', 'green'], ['julia', 'red']]
df_a = pd.DataFrame(data_a, columns = ['Name','Color'])
df_a['Color'] = df_a['Color'].astype('object')
# scenario_b
data_b = [['tom', 10], ['nick', 15], ['julia', 14]]
df_b = pd.DataFrame(data_b, columns = ['Name', 'Age'])
scenario_a = tuple(sorted(df_a.dtypes.unique()))
scenario_b = tuple(sorted(df_b.dtypes.unique()))
scenarios = {
    scenario_a: 'scenario_a',
    scenario_b: 'scenario_b'
}
print(scenarios)
# scenarios:
# {(dtype('O'),): 'scenario_a', (dtype('int64'), dtype('O')): 'scenario_b'}
def scenario(data):
    dtypes = tuple(sorted(data.dtypes.unique()))
    return scenarios.get(dtypes, None)
scenario(df_a)
# 'scenario_a'
scenario(df_b)
# 'scenario_b'
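One caveat: sorting numpy dtype objects directly relies on numpy's casting-based comparisons, which are only partially ordered. If a dataset ever mixes dtypes that numpy cannot order relative to each other, sorting by the string form gives a deterministic key instead. A hedged variant of the same idea:

```python
import pandas as pd

df_a = pd.DataFrame({'Name': ['tom', 'nick'], 'Color': ['blue', 'green']})
df_b = pd.DataFrame({'Name': ['tom', 'nick'], 'Age': [10, 15]})

def dtype_key(df):
    # Sort by the string form to avoid numpy's partial dtype ordering
    return tuple(sorted(df.dtypes.unique(), key=str))

scenarios = {dtype_key(df_a): 'scenario_a', dtype_key(df_b): 'scenario_b'}

def scenario(data):
    return scenarios.get(dtype_key(data))

print(scenario(df_a), scenario(df_b))
```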
# import Pandas library
import pandas as pd
idx = pd.MultiIndex.from_product([['A001', 'B001', 'C001'],
                                  ['0', '1', '2']],
                                 names=['ID', 'Entries'])
col = ['A', 'B']
df = pd.DataFrame('-', idx, col)
df.loc['A001', 'A'] = [10,10,10]
df.loc['A001', 'B'] = [90,84,70]
df.loc['B001', 'A'] = [10,20,10]
df.loc['B001', 'B'] = [70,86,67]
df.loc['C001', 'A'] = [20,20,20]
df.loc['C001', 'B'] = [98,81,72]
#df is a dataframe
df
Following is the problem: how can I return the ID that has more than one unique value in column 'A'? In the above dataset, it should return B001.
I would appreciate if anyone could help me out with performing operations in multi-index pandas dataframes.
Use GroupBy.transform with nunique, filter with boolean indexing, and then take the values of the first level of the MultiIndex with get_level_values plus unique:
a = df[df.groupby(level=0)['A'].transform('nunique') > 1].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
Or use duplicated, but first move the MultiIndex into columns with reset_index:
m = df.reset_index().duplicated(subset=['ID','A'], keep=False).values
a = df[~m].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
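A third option in the same spirit: aggregate nunique per ID once and index into the small result, which avoids building a boolean mask the length of the whole frame. A minimal sketch on equivalent data:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['A001', 'B001', 'C001'], ['0', '1', '2']],
                                 names=['ID', 'Entries'])
df = pd.DataFrame({'A': [10, 10, 10, 10, 20, 10, 20, 20, 20],
                   'B': [90, 84, 70, 70, 86, 67, 98, 81, 72]}, index=idx)

# One nunique per ID, then keep the IDs with more than one distinct A
nun = df.groupby(level='ID')['A'].nunique()
result = nun[nun > 1].index
print(result)  # Index(['B001'], dtype='object', name='ID')
```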