How to create a dataframe from records that has Timestamps as index? - python-3.x

I am trying to create a dataframe in which I have Timestamps as the index, but it throws an error. I can use the same methodology to create a dataframe when the index is not a Timestamp. The following code is a bare minimum example:
Works fine:
pd.DataFrame.from_dict({'1':{'a':1,'b':2,'c':3},'2':{'a':1,'c':4},'3':{'b':6}})
output:
   1    2    3
a  1  1.0  NaN
b  2  NaN  6.0
c  3  4.0  NaN
Breaks:
o=np.arange(np.datetime64('2017-11-01 00:00:00'),np.datetime64('2017-11-01 00:00:00')+np.timedelta64(3,'D'),np.timedelta64(1,'D'))
pd.DataFrame.from_records({o[0]:{'a':1,'b':2,'c':3},o[1]:{'a':1,'c':4},o[2]:{'b':6}})
output:
KeyError Traceback (most recent call last)
<ipython-input-627-f9a075f611c0> in <module>
1 o=np.arange(np.datetime64('2017-11-01 00:00:00'),np.datetime64('2017-11-01 00:00:00')+np.timedelta64(3,'D'),np.timedelta64(1,'D'))
2
----> 3 pd.DataFrame.from_records({o[0]:{'a':1,'b':2,'c':3},o[1]:{'a':1,'c':4},o[2]:{'b':6}})
~/anaconda3/envs/dfs/lib/python3.6/site-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1617 if columns is None:
1618 columns = arr_columns = ensure_index(sorted(data))
-> 1619 arrays = [data[k] for k in columns]
1620 else:
1621 arrays = []
~/anaconda3/envs/dfs/lib/python3.6/site-packages/pandas/core/frame.py in <listcomp>(.0)
1617 if columns is None:
1618 columns = arr_columns = ensure_index(sorted(data))
-> 1619 arrays = [data[k] for k in columns]
1620 else:
1621 arrays = []
KeyError: Timestamp('2017-11-01 00:00:00')
Please help me understand the behavior and what I am missing. Also, how do I go about creating a dataframe from records that has Timestamps as indices?

Change from_records to from_dict (just like in your working example)
and everything executes fine.
Another, optional hint: since you are creating a pandas DataFrame, use
the pandas-native way to create the datetime values used as column names:
o = pd.date_range(start='2017-11-01', periods=3)
Edit
I noticed that if you create the o object the way I proposed (as a
date_range), you can even use from_records.
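For example, a minimal sketch of that combination (assuming pandas is imported as pd; the values mirror the question's example):
import pandas as pd

o = pd.date_range(start='2017-11-01', periods=3)

# with pandas Timestamps as keys, from_dict works just like the string-keyed example
df = pd.DataFrame.from_dict({o[0]: {'a': 1, 'b': 2, 'c': 3},
                             o[1]: {'a': 1, 'c': 4},
                             o[2]: {'b': 6}})
print(df)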
Edit 2
You wrote that you want datetime objects as the index, whereas
your code attempts to set them as column names.
If you want datetime objects as the index, run something like:
df = pd.DataFrame.from_records({'1':{o[0]:1, o[1]:2, o[2]:3},
'2':{o[0]:1, o[2]:4}, '3':{o[1]:6}})
The result is:
            1    2    3
2017-11-01  1  1.0  NaN
2017-11-02  2  NaN  6.0
2017-11-03  3  4.0  NaN
Another way to create the above result is:
df = pd.DataFrame.from_records([{'1':1, '2':1}, {'1':2, '3':6}, {'1':3, '2':4}], index=o)

Related

How to find the correspondence of unique values between 2 tables?

I am fairly new to Python and I am trying to create a new function for my project.
The function aims to detect which unique values are present in the corresponding column of another table.
First, the function keeps only the unique values of the two tables, then merges them into a new dataframe.
It is the rest that gets complicated, because I would like to return which row is missing and from which table.
If you have any other leads or approaches, I am also interested.
Here is my code:
def correspondance_cle(df1, df2, col):
    df11 = pd.DataFrame(df1[col].unique())
    df11.columns = [col]
    df11['test1'] = 1
    df21 = pd.DataFrame(df2[col].unique())
    df21.columns = [col]
    df21['test2'] = 1
    df3 = pd.merge(df11, df21, on=col, how='outer')
    df3 = df3.loc[(df3['test1'].isna() == True) | (df3['test2'].isna() == True), :]
    df3.info()
    for row in df3[col]:
        if df3['test1'].isna() == True:
            print(row, "is not in df1")
        else:
            print(row, 'is not in df2')
Thanks to everyone who took the time to read the post.
First use an outer join, removing duplicates with Series.drop_duplicates and calling Series.reset_index to avoid losing the original indices:
df1 = pd.DataFrame({'a':[1,2,5,5]})
df2 = pd.DataFrame({'a':[2,20,5,8]})
col = 'a'
df = (df1[col].drop_duplicates().reset_index()
          .merge(df2[col].drop_duplicates().reset_index(),
                 indicator=True,
                 how='outer',
                 on=col))
print (df)
   index_x   a  index_y      _merge
0      0.0   1      NaN   left_only
1      1.0   2      0.0        both
2      2.0   5      2.0        both
3      NaN  20      1.0  right_only
4      NaN   8      3.0  right_only
Then filter rows by helper column _merge:
print (df[df['_merge'].eq('left_only')])
   index_x  a  index_y     _merge
0      0.0  1      NaN  left_only
print (df[df['_merge'].eq('right_only')])
   index_x   a  index_y      _merge
3      NaN  20      1.0  right_only
4      NaN   8      3.0  right_only
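If you want to fold this back into the shape of the function from the question, a rough sketch could look like this (the function name and the 'is not in df1'/'is not in df2' messages are taken from the question; 'left_only' means the value exists only in df1):
def correspondance_cle(df1, df2, col):
    df = (df1[col].drop_duplicates().reset_index()
              .merge(df2[col].drop_duplicates().reset_index(),
                     indicator=True, how='outer', on=col))
    # keep only the values that are missing from one of the two tables
    missing = df[df['_merge'] != 'both']
    for _, row in missing.iterrows():
        if row['_merge'] == 'right_only':
            print(row[col], 'is not in df1')   # value found only in df2
        else:
            print(row[col], 'is not in df2')   # value found only in df1
    return missing

correspondance_cle(df1, df2, 'a')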

Keeping columns of pandas dataframe whose substring is in the list

I have a dataframe with many columns. I only want to retain those columns that contain a substring from the list. For example, the lst and dataframe are:
lst = ['col93','col71']
sample_id. col9381.3 col8371.8 col71937.9 col19993.1
1
2
3
4
Based on the substrings, the resulting dataframe will look like:
sample_id. col9381.3 col71937.9
1
2
3
4
I have code that goes through the list and filters out the columns for which I have a substring in the list, but I don't know how to create a single dataframe from it. The code so far:
for i in lst:
    df2 = df1.filter(regex=i)
    if df2.shape[1] > 0:
        print(df2)
The above code is able to filter out the columns but I don't know how combine all of these into one dataframe. Insights will be appreciated.
Try with startswith which accepts a tuple of options:
df.loc[:, df.columns.str.startswith(('sample_id.',)+tuple(lst))]
Or filter which accepts a regex as you were trying:
df.filter(regex='|'.join(['sample_id']+lst))
Output:
   sample_id.  col9381.3  col71937.9
0           1        NaN         NaN
1           2        NaN         NaN
2           3        NaN         NaN
3           4        NaN         NaN
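If you prefer to keep your original loop, you can also collect each filtered piece and concatenate them along the columns at the end. A sketch under the same assumptions (df1 is your dataframe and 'sample_id.' is the literal name of the identifier column):
import pandas as pd

pieces = [df1.filter(regex=i) for i in lst]               # one filtered frame per substring
df2 = pd.concat([df1[['sample_id.']]] + pieces, axis=1)   # keep the id column, then the matches
print(df2)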

Creating a single file from multiple files (Python 3.x)

I can't figure out a great way to do this, but I have 2 files with a standard date and value format.
File 1           File 2
Date  Value      Date  Value
4     7.0        1     9.0
5     5.5        .     .
6     4.0        7     2.0
I want to combine files 1 and 2 to get the following:
Combined Files
Date  Value1  Value2  Avg
1     NaN     9.0     9.0
2     NaN     9.0     9.0
3     NaN     8.5     8.5
4     7.0     7.5     7.25
5     5.5     5.0     5.25
6     4.0     3.5     3.75
7     NaN     2.0     2.0
How would I attempt this? I figured I should make a masked array with the date going from 1 to 7 and then just append the files together, but I don't know how I would do that with file 1. Any pointers on where to look would be appreciated.
Using Python 3.x
EDIT:
I solved my own problem!
I am sure there is a better way to streamline this. My solution doesn't use the example above; I just threw in my own code.
from glob import glob
from datetime import datetime

import numpy as np
import matplotlib.dates as mdates

def extractFiles(Dir, newDir, newDir2):
    fnames = glob(Dir)
    farray = np.array(fnames)
    ## Dates range from 723911 to 737030
    dateArray = np.arange(723911, 737030)  # Store the dates
    dataArray = []  # Store the data. This needs to be a list, not an np.array!
    for f in farray:
        ## Extract the data
        CH4 = np.genfromtxt(f, comments='#', delimiter=None, dtype=float).T
        myData = np.full(dateArray.shape, np.nan)  # Create an array filled with NaN
        myDate = np.array([])
        ## Convert the given datetime into something more usable
        for x, y in zip(*CH4[1:2], *CH4[2:3]):
            myDate = np.append(myDate,
                               (mdates.date2num(datetime.strptime('{}-{}'.format(int(x), int(y)), '%Y-%m'))))
        ## Find where the dates are the same and place the appropriate concentration value
        for i in range(len(CH4[3])):
            idx = np.where(dateArray == myDate[i])
            myData[idx] = CH4[3, i]
        ## Store all values in the list
        dataArray.append(myData)
    ## Convert the list to a numpy array and save it in a txt file
    dataArray = np.vstack((dateArray, dataArray))
    np.savetxt(newDir, dataArray.T, fmt='%1.2f', delimiter=',')
    ## Find the average of the data to plot
    avg = np.nanmean(dataArray[1:].T, 1)
    avg = np.vstack((dateArray, avg))
    np.savetxt(newDir2, avg.T, fmt='%1.2f', delimiter=',')
    return avg
Here is my answer based on the information you gave me:
import pandas as pd
import os
# I stored two Excel files in a subfolder of this sample code
# Code
# ----Files
# -------- File1.xlsx
# -------- File2.xlsx
# Here I am saving the path to a variable
file_path = os.path.join(*[os.getcwd(), 'Files', ''])
# I define an empty DataFrame that we then fill we the files information
final_df = pd.DataFrame()
# file_number will be used to increment the Value column based number of files that we load.
# First file will be Value1, second will lead to Value2
file_number = 1
# os.listdir now "has a look" into the "Files" folder and returns a list of the files contained in there,
# ['File1.xlsx', 'File2.xlsx'] in our case
for file in os.listdir(file_path):
    # we load the Excel file with the pandas function "read_excel"
    df = pd.read_excel(file_path + file)
    # Rename the column "Value" to "Value" + the "file_number"
    df = df.rename(columns={'Value': 'Value' + str(file_number)})
    # Check if the DataFrame already contains values
    if not final_df.empty:
        # If there are values already, then we merge them together with the new values
        final_df = final_df.merge(df, how='outer', on='Date')
    else:
        # Otherwise we "initialize" our final_df with the first Excel file that we loaded
        final_df = df
    # at the end we increment the file number by one to continue with the next file
    file_number += 1
# get all column names that have "Value" in it
value_columns = [w for w in final_df.columns if 'Value' in w]
# Create a new column for the average and build the average on all columns that we found for value columns
final_df['Avg'] = final_df.apply(lambda x: x[value_columns].mean(), axis=1)
# Sort the dataframe based on the Date
sorted_df = final_df.sort_values('Date')
print(sorted_df)
The print will output this:
   Date  Value1  Value2   Avg
3     1     NaN     9.0  9.00
4     2     NaN     9.0  9.00
5     3     NaN     8.5  8.50
0     4     7.0     7.5  7.25
1     5     5.5     5.0  5.25
2     6     4.0     3.5  3.75
6     7     NaN     2.0  2.00
Please be aware that this does not pay attention to the file names and just loads one file after another in alphabetical order.
But this has the advantage that you can put as many files in there as you want.
If you need to load them in a specific order, I can probably help you with that as well.
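For example, one way to control the order is to sort the file names before the loop; the numeric key below is only an assumption about a FileN.xlsx naming scheme:
import re

def file_order(name):
    # pull the first run of digits out of the file name, e.g. 'File2.xlsx' -> 2
    match = re.search(r'\d+', name)
    return int(match.group()) if match else 0

for file in sorted(os.listdir(file_path), key=file_order):
    df = pd.read_excel(file_path + file)
    # ... same processing as in the loop above ...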

How to combine several csv in one with identical rows?

I have several csv files with approximately the following structure:
name,title,status,1,2,3
name,title,status,4,5,6
name,title,status,7,8,9
The name column is mostly the same in all files; only the columns 1,2,3,4... are different.
I need to add the new columns in turn to existing and new rows, as well as update the remaining rows each time.
For example, I have 2 tables:
name,title,status,1,2,3
Foo,Bla-bla-bla,10,45.6,12.3,45.2
Bar,Too-too,13,13.4,22.6,75.1
name,title,status,4,5,6
Foo,Bla-bla-bla,14,25.3,125.3,5.2
Fobo,Dom-dom,20,53.4,2.9,11.3
And at the output I expect a table:
name,title,status,1,2,3,4,5,6
Foo,Bla-bla-bla,14,45.6,12.3,45.2,25.3,125.3,5.2
Bar,Too-too,13,13.4,22.6,75.1,,,
Fobo,Dom-dom,20,,,,53.4,2.9,11.3
I did not find anything similar; can anyone tell me how I can do this?
It looks like you want to keep just one version of ['name', 'title', 'status'] and from your example, you prefer to keep the last 'status' encountered.
I'd use pd.concat and follow that up with a groupby to filter out duplicate status.
df = pd.concat([
    pd.read_csv(fp, index_col=['name', 'title', 'status'])
    for fp in ['data1.csv', 'data2.csv']
], axis=1).reset_index('status').groupby(level=['name', 'title']).last()
df
                  status     1     2     3     4      5     6
name  title
Bar   Too-too         13  13.4  22.6  75.1   NaN    NaN   NaN
Fobo  Dom-dom         20   NaN   NaN   NaN  53.4    2.9  11.3
Foo   Bla-bla-bla     14  45.6  12.3  45.2  25.3  125.3   5.2
Then df.to_csv() produces
name,title,status,1,2,3,4,5,6
Bar,Too-too,13,13.4,22.6,75.1,,,
Fobo,Dom-dom,20,,,,53.4,2.9,11.3
Foo,Bla-bla-bla,14,45.6,12.3,45.2,25.3,125.3,5.2
Keep merging them:
df = None
for path in ['data1.csv', 'data2.csv']:
    sub_df = pd.read_csv(path)
    if df is None:
        df = sub_df
    else:
        df = df.merge(sub_df, on=['name', 'title', 'status'], how='outer')
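If there are many files, the same "keep merging" idea can be written a bit more compactly with functools.reduce (a sketch, assuming the same file list and key columns as above):
from functools import reduce
import pandas as pd

frames = [pd.read_csv(path) for path in ['data1.csv', 'data2.csv']]
df = reduce(lambda left, right: left.merge(right, on=['name', 'title', 'status'], how='outer'),
            frames)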

Create of multiple subsets from existing pandas dataframe [duplicate]

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).
I would like to split the dataframe into 60 dataframes (a dataframe for each participant).
In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.
I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):
import pandas as pd
def splitframe(data, name='name'):
    n = data[name][0]
    df = pd.DataFrame(columns=data.columns)
    datalist = []
    for i in range(len(data)):
        if data[name][i] == n:
            df = df.append(data.iloc[i])
        else:
            datalist.append(df)
            df = pd.DataFrame(columns=data.columns)
            n = data[name][i]
            df = df.append(data.iloc[i])
    return datalist
I do not get an error message, the script just seems to run forever!
Is there a smart way to do it?
Can I ask why not just do it by slicing the data frame? Something like:
import numpy as np
import pandas as pd

# create some data with a Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16), 'Ob2': np.random.rand(16)})
# create a unique list of names
UniqueNames = data.Names.unique()
# create a data frame dictionary to store your data frames
DataFrameDict = {elem: pd.DataFrame() for elem in UniqueNames}
for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names == key]
Hey presto you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter
DataFrameDict['Joe']
Firstly, your approach is inefficient because appending to the list row by row will be slow: the list has to be grown periodically when there is insufficient space for the new entry. List comprehensions are better in this respect, as the size is determined up front and allocated once.
However, I think your approach is fundamentally a little wasteful: you already have a dataframe, so why create a new one for each of these users?
I would sort the dataframe by column 'name', set the index to be this and, if required, not drop the column.
Then generate a list of all the unique entries, and then you can perform a lookup using these entries; crucially, if you are only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:
# sort the dataframe
df.sort_values(by='name', inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False, inplace=True)
# get a list of names
names = df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
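Building on that, a small sketch of how the names list can be used to materialise every per-participant view in one pass (datalist mirrors the list the question asked for):
# one view per participant, keyed by name
frames = {name: df.loc[df.name == name] for name in names}
datalist = list(frames.values())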
You can convert the groupby object to tuples and then to a dict:
df = pd.DataFrame({'Name': list('aabbef'),
                   'A': [4, 5, 4, 5, 5, 4],
                   'B': [7, 8, 9, 4, 2, 3],
                   'C': [1, 3, 5, 7, 1, 0]}, columns=['Name', 'A', 'B', 'C'])
print (df)
Name A B C
0 a 4 7 1
1 a 5 8 3
2 b 4 9 5
3 b 5 4 7
4 e 5 2 1
5 f 4 3 0
d = dict(tuple(df.groupby('Name')))
print (d)
{'b': Name A B C
2 b 4 9 5
3 b 5 4 7, 'e': Name A B C
4 e 5 2 1, 'a': Name A B C
0 a 4 7 1
1 a 5 8 3, 'f': Name A B C
5 f 4 3 0}
print (d['a'])
Name A B C
0 a 4 7 1
1 a 5 8 3
It is not recommended, but it is possible to create DataFrames by groups:
for i, g in df.groupby('Name'):
    globals()['df_' + str(i)] = g
print (df_a)
Name A B C
0 a 4 7 1
1 a 5 8 3
Easy:
[v for k, v in df.groupby('name')]
Groupby can help you:
grouped = data.groupby(['name'])
Then you can work with each group as if it were a dataframe for each participant. DataFrameGroupBy object methods such as apply, transform, aggregate, head, first and last return a DataFrame object.
Or you can make a list from grouped and get all the DataFrames by index:
l_grouped = list(grouped)
l_grouped[0][1] is the DataFrame for the first group, i.e. the first name.
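A minimal sketch of that, with a toy 'name' column standing in for the participant codes:
import pandas as pd

data = pd.DataFrame({'name': ['p1', 'p1', 'p2'], 'value': [1, 2, 3]})
grouped = data.groupby('name')

l_grouped = list(grouped)
first_name, first_df = l_grouped[0]   # 'p1' and the rows where name == 'p1'
print(first_df)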
In addition to Gusev Slava's answer, you might want to use groupby's groups:
{key: df.loc[value] for key, value in df.groupby("name").groups.items()}
This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.
The method in the OP works, but isn't efficient. It may have seemed to run forever, because the dataset was long.
Use .groupby on the 'method' column, and create a dict of DataFrames with unique 'method' values as the keys, with a dict-comprehension.
.groupby returns a groupby object, that contains information about the groups, where g is the unique value in 'method' for each group, and d is the DataFrame for that group.
The value of each key in df_dict, will be a DataFrame, which can be accessed in the standard way, df_dict['key'].
The original question wanted a list of DataFrames, which can be done with a list-comprehension
df_list = [d for _, d in df.groupby('method')]
import pandas as pd
import seaborn as sns # for test dataset
# load data for example
df = sns.load_dataset('planets')
# display(df.head())
            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
# Using a dict-comprehension, the unique 'method' value will be the key
df_dict = {g: d for g, d in df.groupby('method')}
print(df_dict.keys())
[out]:
dict_keys(['Astrometry', 'Eclipse Timing Variations', 'Imaging', 'Microlensing', 'Orbital Brightness Modulation', 'Pulsar Timing', 'Pulsation Timing Variations', 'Radial Velocity', 'Transit', 'Transit Timing Variations'])
# or a specific name for the key, using enumerate (e.g. df1, df2, etc.)
df_dict = {f'df{i}': d for i, (g, d) in enumerate(df.groupby('method'))}
print(df_dict.keys())
[out]:
dict_keys(['df0', 'df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9'])
df_dict['df0'].head(3) or df_dict['Astrometry'].head(3)
There are only 2 in this group
         method  number  orbital_period  mass  distance  year
113  Astrometry       1          246.36   NaN     20.77  2013
537  Astrometry       1         1016.00   NaN     14.98  2010
df_dict['df1'].head(3) or df_dict['Eclipse Timing Variations'].head(3)
                       method  number  orbital_period  mass  distance  year
32  Eclipse Timing Variations       1         10220.0  6.05       NaN  2009
37  Eclipse Timing Variations       2          5767.0   NaN    130.72  2008
38  Eclipse Timing Variations       2          3321.0   NaN    130.72  2008
df_dict['df2'].head(3) or df_dict['Imaging'].head(3)
     method  number  orbital_period  mass  distance  year
29  Imaging       1             NaN   NaN     45.52  2005
30  Imaging       1             NaN   NaN    165.00  2007
31  Imaging       1             NaN   NaN    140.00  2004
For more information about the seaborn datasets
NASA Exoplanets
Alternatively
This is a manual method to create separate DataFrames using pandas: Boolean Indexing
This is similar to the accepted answer, but .loc is not required.
This is an acceptable method for creating a couple extra DataFrames.
The pythonic way to create multiple objects, is by placing them in a container (e.g. dict, list, generator, etc.), as shown above.
df1 = df[df.method == 'Astrometry']
df2 = df[df.method == 'Eclipse Timing Variations']
In [28]: df = DataFrame(np.random.randn(1000000,10))
In [29]: df
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0 1000000 non-null values
1 1000000 non-null values
2 1000000 non-null values
3 1000000 non-null values
4 1000000 non-null values
5 1000000 non-null values
6 1000000 non-null values
7 1000000 non-null values
8 1000000 non-null values
9 1000000 non-null values
dtypes: float64(10)
In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop
In [32]: len(frames)
Out[32]: 16667
Here's a groupby way (and you could do an arbitrary apply rather than sum)
In [9]: g = df.groupby(lambda x: x/60)
In [8]: g.sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0 16667 non-null values
1 16667 non-null values
2 16667 non-null values
3 16667 non-null values
4 16667 non-null values
5 16667 non-null values
6 16667 non-null values
7 16667 non-null values
8 16667 non-null values
9 16667 non-null values
dtypes: float64(10)
Sum is cythonized, which is why this is so fast:
In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop
In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
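Since the question is tagged python-3.x, here is the same slicing idea in Python 3 syntax (range instead of xrange, and explicit integer division for the groupby key); this is just a sketch of the approach above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 10))

chunk = 60
frames = [df.iloc[i * chunk:(i + 1) * chunk] for i in range(int(np.ceil(len(df) / chunk)))]

g = df.groupby(lambda x: x // 60)   # integer division replaces Python 2's x/60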
This method, based on a list comprehension and groupby, stores all the split dataframes in a list variable that can be accessed using the index.
Example:
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
You can use the groupby command, if you already have some labels for your data.
out_list = [group[1] for group in in_series.groupby(label_series.values)]
Here's a detailed example:
Let's say we want to partition a pd.Series into a list of chunks using some labels.
For example, in_series is:
2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 5, dtype: float64
And its corresponding label_series is:
2019-07-01 08:00:00 1
2019-07-01 08:02:00 1
2019-07-01 08:04:00 2
2019-07-01 08:06:00 2
2019-07-01 08:08:00 2
Length: 5, dtype: float64
Run
out_list = [group[1] for group in in_series.groupby(label_series.values)]
which returns out_list a list of two pd.Series:
[2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 3, dtype: float64]
Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day
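For example, a one-line sketch of that variant (assuming in_series has the DatetimeIndex shown above):
daily_chunks = [group for _, group in in_series.groupby(in_series.index.day)]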
Here's a small function which might help some (efficiency probably not perfect, but compact and more or less easy to understand):
def get_splited_df_dict(df: 'pd.DataFrame', split_column: 'str'):
    """
    splits a pandas.DataFrame on split_column and returns it as a dict
    """
    df_dict = {value: df[df[split_column] == value].drop(split_column, axis=1)
               for value in df[split_column].unique()}
    return df_dict
It converts a DataFrame into multiple DataFrames by selecting each unique value in the given column and putting all those entries into a separate DataFrame.
The .drop(split_column, axis=1) is just for removing the column which was used to split the DataFrame. The removal is not necessary, but it can help a little to cut down on memory usage after the operation.
The result of get_splited_df_dict is a dict, meaning one can access each DataFrame like this:
splitted = get_splited_df_dict(some_df, some_column)
# accessing the DataFrame with 'some_column_value'
splitted[some_column_value]
The existing answers cover all the good cases and explain fairly well how the groupby object is like a dictionary with keys and values that can be accessed via .groups. Yet more methods to do the same job as the existing answers are:
Create a list by unpacking the groupby object and casting it to a dictionary:
dict([*df.groupby('Name')]) # same as dict(list(df.groupby('Name')))
Create a tuple + dict (this is the same as #jezrael's answer):
dict((*df.groupby('Name'),))
If we only want the DataFrames, we could get the values of the dictionary (created above):
[*dict([*df.groupby('Name')]).values()]
I had a similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe into 500 dataframes (10 stores * 50 items) to apply machine learning models to each of them, and I couldn't do it manually.
This is the head of the dataframe:
I have created two lists:
one for the names of the dataframes,
and one for the pairs [item_number, store_number].
list_names = []
for i in range(1, len(items) * len(stores) + 1):
    list_names.append('df' + str(i))

list_couple_s_i = []
for item in items:
    for store in stores:
        list_couple_s_i.append([item, store])
And once the two lists are ready you can loop on them to create the dataframes you want:
for name, it_st in zip(list_names, list_couple_s_i):
    globals()[name] = df.where((df['item'] == it_st[0]) &
                               (df['store'] == it_st[1]))
    globals()[name].dropna(inplace=True)
In this way I have created 500 dataframes.
Hope this will be helpful!
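As a side note, a similar result can usually be obtained without globals() by keying a dictionary on the (item, store) pair via groupby; a sketch, assuming columns named 'item' and 'store' as above:
# one DataFrame per (item, store) combination, e.g. frames[(item_number, store_number)]
frames = {key: group for key, group in df.groupby(['item', 'store'])}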
