Python Pandas Create Multiple dataframes by slicing data at certain locations - python-3.x

I am new to Python and programmatic data analysis. I have a long csv file and I would like to create DataFrames dynamically and plot them later on. Here is an example DataFrame similar to the data in my csv file:
df = pd.DataFrame(
    {"a": [4, 5, 6, 'a', 1, 2, 'a', 4, 5, 'a'],
     "b": [7, 8, 9, 'b', 0.1, 0.2, 'b', 0.3, 0.4, 'b'],
     "c": [10, 11, 12, 'c', 10, 20, 'c', 30, 40, 'c']})
As seen, there are elements that repeat in each column. So I would first need to find the indices of the repetitions and then use them to make subsets. Here is how I did this:
find_Repeat = df.groupby(['a'], group_keys=False).apply(
    lambda df: df if df.shape[0] > 1 else None)
repeat_idxs = find_Repeat.index[find_Repeat['a'] == 'a'].tolist()
If I print repeat_idxs, I would get
[3, 6, 9]
And this is an example of what I want to achieve in the end:
dfa_1 = df['a'][repeat_idxs[0]:repeat_idxs[1]]
dfa_2 = df['a'][repeat_idxs[1]:repeat_idxs[2]]
dfb_1 = df['b'][repeat_idxs[0]:repeat_idxs[1]]
dfb_2 = df['b'][repeat_idxs[1]:repeat_idxs[2]]
But this is neither efficient nor convenient, as I need to create many DataFrames like these for plotting later on. So I tried the following method:
dfNames = ['dfa_' + str(i) for i in range(len(repeat_idxs))]
dfs = dict()
for i, row in enumerate(repeat_idxs):
    dfName = dfNames[i]
    slices = df['a'].loc[row:row+1]
    dfs[dfName] = slices
If I print dfs, this is exactly what I want.
{'df_0': 3 a
4 1
Name: a, dtype: object, 'df_1': 6 a
7 4
Name: a, dtype: object, 'df_2': 9 a
Name: a, dtype: object}
However, when I read my csv and apply the above, I don't get the desired result. I can find the repeated indices from the csv file, but I am not able to slice the data properly. I am presuming that I am not reading the csv file correctly. I attached the csv file for further clarification.

Two options:
Loop over and slice
Detect the repeat row indices and then loop over to slice contiguous chunks of the dataframe, ignoring the repeat rows:
# detect rows for which all values are equal to the column names
repeat_idxs = df.index[(df == df.columns.values).all(axis=1)]
slices = []
start = 0
for i in repeat_idxs:
    slices.append(df.loc[start:i - 1])
    start = i + 1
The result is a list of dataframes, slices, containing the chunks of your data in order.
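As a small usage sketch of what you might do with that list afterwards (the column name 'a', the float cast, and the use of matplotlib are assumptions about your data and plotting setup):
import matplotlib.pyplot as plt

for n, chunk in enumerate(slices):
    # values read from a csv are strings, so convert before plotting
    chunk['a'].astype(float).reset_index(drop=True).plot(label=f'chunk {n}')
plt.legend()
plt.show()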
Use pandas groupby
You could also do this in one line using pandas groupby if you prefer:
grouped = df[~(df == df.columns.values).all(axis=1)].groupby((df == df.columns.values).all(axis=1).cumsum())
And you can now iterate over the groups like so:
for i, group_df in grouped:
    # do something with group_df
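Since the remaining difficulty seems to be the read step, here is a minimal end-to-end sketch combining read_csv with the groupby approach; the filename 'data.csv', the column name 'a', and the float cast are assumptions:
import pandas as pd

df = pd.read_csv('data.csv', dtype=str)             # read everything as strings first
is_marker = (df == df.columns.values).all(axis=1)   # rows that just repeat the header
grouped = df[~is_marker].groupby(is_marker.cumsum())

# one numeric Series per chunk, keyed the same way as dfa_0, dfa_1, ...
dfs = {f'dfa_{i}': g['a'].astype(float) for i, g in grouped}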

Related

column comprehension robust to missing values

I have only been able to create a two-column data frame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
                         columns=['id', 'value'])
What I would like to do, using this basic format, is initialize the dataframe with three columns: 'id', 'id2' and 'value'. I have a separately defined dict, called id_lookup, that contains the necessary lookup info.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible encounters. For my purposes, simply putting it all together and placing 'N/A' or something similar for those errors will be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}

new_column = defaultdict(str)

# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'

# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)

# Print the updated DataFrame
print(df)
which gives:
   id  value  id2
0   1     10    A
1   2     20    B
2   3     30    C
3   4     40  N/A
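If a defaultdict isn't specifically required, a shorter alternative (just a sketch, assuming the same output and id_lookup objects as in the question) is to fall back with dict.get inside the original comprehension, or to use Series.map with fillna:
# fall back to 'N/A' directly inside the comprehension
df_mydata = pd.DataFrame([(k, id_lookup.get(k, 'N/A'), v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])

# or, on a frame that already has an 'id' column
df['id2'] = df['id'].map(id_lookup).fillna('N/A')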

Python Pandas How to get rid of groupings with only 1 row?

In my dataset, I am trying to get the margin between two values. The code below would run perfectly if the fourth race were not included. After grouping on a column, it seems that sometimes there is only one value, and therefore no other value to compute a margin from. I want to ignore such groupings. Here is my current code:
import pandas as pd

data = {'Name': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'RaceNumber': [1, 1, 2, 2, 3, 3, 4],
        'PlaceWon': ['First', 'Second', 'First', 'Second', 'First', 'Second', 'First'],
        'TimeRanInSec': [100, 98, 66, 60, 75, 70, 75]}
df = pd.DataFrame(data)
print(df)

def winning_margin(times):
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner

winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin)
winning_margins.columns = ['margin']

winners = df.loc[df.PlaceWon == 'First', :]
winners = winners.join(winning_margins, on='RaceNumber')

avg_margins = winners[['Name', 'margin']].groupby('Name').mean()
avg_margins
How about returning a NaN if times does not have enough elements:
import numpy as np

def winning_margin(times):
    if len(times) <= 1:  # New code
        return np.NaN    # New code
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner
Your code runs with this change and seems to produce sensible results. You can also drop the NaNs later if you want, e.g. in this line:
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin).dropna()  # note the addition of .dropna()
You could get the winner and margin in one step:
def get_margin(x):
    if len(x) < 2:
        return np.NaN
    i = x['TimeRanInSec'].idxmin()
    nl = x['TimeRanInSec'].nsmallest(2)
    margin = nl.max() - nl.min()
    return [x['Name'].loc[i], margin]
Then:
df.groupby('RaceNumber').apply(get_margin).dropna()
RaceNumber
1 [B, 2]
2 [C, 6]
3 [C, 5]
(Note that in the sample data, the 'First' indicator corresponds to the slower time.)
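If the aim is to drop the single-entry races before aggregating at all (rather than letting them produce NaN), one possible sketch uses groupby's filter method with the same sample data:
# keep only races with at least two runners, then aggregate as before
df_multi = df.groupby('RaceNumber').filter(lambda g: len(g) > 1)

winning_margins = df_multi[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin)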

Convert string to set array when loading csv file in pandas DataFrame

I'm trying to convert a pandas column from string to set so I can perform set operations (-) and methods (.union) between the set_array columns of two dataframes. The frames are imported from two csv files, each with a set_array column. However, once I run pd.read_csv, the column type becomes str, which prevents me from using set operations and methods.
csv1:
set_array
0 {985,784}
1 {887}
2 set()
3 {123,469,789}
4 set()
After loading csv1 into a DataFrame using df = pd.read_csv(csv1), the data type becomes str, and when I access the first element using df['set_array'].values[0], I get the following:
'{985, 784}'
However, if I create my own DataFrame with a set column using df1 = pd.DataFrame({'set_array': [{985, 784}, {887}, {}, {123, 469, 789}, {}]}), and access the first element again using df1['set_array'].values[0], I get the following (desired output):
{985, 784} <-without the ''
Here is what I tried so far:
1) df.replace('set()', '') <-removes the set() portion from df
2) df['set_array'] = df['set_array'].apply(set) <-does not work
3) df['set_array'] = df['set_array'].apply(lambda x: {x}) <-does not work
4) df['set_array'].astype(int) <-convert to int first then convert to set, does not work
5) df['set_array'].astype(set) <-does not work
6) df['set_array'].to_numpy() <-convert to array, does not work
I'm also thinking of converting the column to set at the pd.read_csv stage as a potential solution.
Is there any way to load csv using pandas and keep the set data type, or just simply convert the column from str to set so it looks like the desired output above?
Thanks!!
I agree with Cainã that dealing with the root cause in the input data would be the best approach here. But if that's not possible, then something like this would be a lot more predictable than using eval, especially in a production environment:
import re

def parse_set_string(s):
    if s == 'set()':
        return None  # or return set() if you prefer
    else:
        string_nums_only = re.sub('[^0-9,]', '', s)
        split_nums = string_nums_only.split(',')
        return set(map(int, split_nums))

df.set_array.apply(parse_set_string)
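After that, the parsed column has to be assigned back before the set methods become usable; a small usage sketch, assuming parse_set_string returns set() instead of None and that a second frame df2 was parsed the same way:
df['set_array'] = df['set_array'].apply(parse_set_string)

# element-wise set difference and union between the two frames
diff = df['set_array'].combine(df2['set_array'], lambda a, b: a - b)
union = df['set_array'].combine(df2['set_array'], lambda a, b: a.union(b))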
We've seen this problem before when columns originally contained lists or numpy arrays. csv is a 2d format - rows and columns - so to_csv can only save these embedded objects as strings. What does the file actually look like?
read_csv by default just loads the strings. To confuse things further, the pandas display does not quote strings. So the str of a set looks the same as the set itself.
With lists, it's enough to do an eval (or ast.literal_eval). With an ndarray the string has to be edited first.
Make a dataframe and fill it with some objects:
In [107]: df = pandas.DataFrame([None,None,None])
In [108]: df
Out[108]:
0
0 None
1 None
2 None
In [109]: df[0][0]
In [110]: df[0][0]=[1,2,3]
In [111]: df[0][1]=np.array([1,2,3])
In [112]: df[0][2]={1,2,3}
In [113]: df
Out[113]:
0
0 [1, 2, 3]
1 [1, 2, 3]
2 {1, 2, 3}
The numpy equivalent:
In [114]: df.to_numpy()
Out[114]:
array([[list([1, 2, 3])],
       [array([1, 2, 3])],
       [{1, 2, 3}]], dtype=object)
Write it to a file:
In [115]: df.to_csv('test.pd')
In [116]: cat test.pd
,0
0,"[1, 2, 3]"
1,[1 2 3]
2,"{1, 2, 3}"
Read it
In [117]: df1 = pandas.read_csv('test.pd')
In [118]: df1
Out[118]:
Unnamed: 0 0
0 0 [1, 2, 3]
1 1 [1 2 3]
2 2 {1, 2, 3}
Ignoring the indexing that I should have suppressed, it looks a lot like the original df. But it contains strings, not lists, arrays, or sets.
In [119]: df1.to_numpy()
Out[119]:
array([[0, '[1, 2, 3]'],
       [1, '[1 2 3]'],
       [2, '{1, 2, 3}']], dtype=object)
Changing the frame to contain sets of differing sizes:
In [120]: df[0][1]=set()
In [122]: df[0][0]=set([1])
In [123]: df
Out[123]:
0
0 {1}
1 {}
2 {1, 2, 3}
In [124]: df.to_csv('test.pd')
In [125]: cat test.pd
,0
0,{1}
1,set()
2,"{1, 2, 3}"
In [136]: df2 =pandas.read_csv('test.pd',index_col=0)
In [137]: df2
Out[137]:
0
0 {1}
1 set()
2 {1, 2, 3}
Looks like eval can convert the empty set as well as the others:
In [138]: df3 =df2['0'].apply(eval)
In [139]: df3
Out[139]:
0 {1}
1 {}
2 {1, 2, 3}
Name: 0, dtype: object
In [140]: df2.to_numpy()
Out[140]:
array([['{1}'],
       ['set()'],
       ['{1, 2, 3}']], dtype=object)
In [141]: df3.to_numpy()
Out[141]: array([{1}, set(), {1, 2, 3}], dtype=object)
The problem with your DataFrame is that the set_array column contains the text representation of two different things:
set literals, e.g. '{985,784}', and
Python code, e.g. 'set()'.
To cope with this case:
import ast.
Define the following conversion function:
def mySetConv(txt):
    return set() if txt == 'set()' else ast.literal_eval(txt)
Apply it:
df.set_array = df.set_array.apply(mySetConv)
To check the result, you can run:
for it in df.set_array:
    print(it, type(it))
getting:
{784, 985} <class 'set'>
{887} <class 'set'>
set() <class 'set'>
{789, 123, 469} <class 'set'>
set() <class 'set'>
If your source file contained {} instead of set(), you could run:
df.set_array = df.set_array.apply(ast.literal_eval)
Just a single line of code (though note that ast.literal_eval('{}') yields an empty dict rather than an empty set, so the empty entries would still need converting).
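If you would rather handle this at the pd.read_csv stage, as mentioned in the question, a minimal sketch uses the converters argument (the filename is a placeholder and mySetConv is the function defined above):
import pandas as pd

# parse the column while reading, so it never arrives as plain strings
df = pd.read_csv('csv1.csv', converters={'set_array': mySetConv})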

Creating different dataframe and outputting it to different csv based on list of indexes

I have a list of indexes like the one below, based on a value N. Here is the code I used to create the list of indexes:
df = pd.DataFrame(np.arange(100).reshape((-1, 5)))
N = 4
ix = [[i, i+N] for i in range(0,len(df),N)]
ix
# [[0, 4], [4, 8], [8, 12], [12, 16], [16, 20]]
I want to create a function which:
1) creates N dataframes (df_1, df_2, df_3, df_4, df_5), where the rows in each dataframe are based on each pair of indexes. For example, df_1 will have all the rows between index 0 and 4 of the main dataframe df, and similarly df_2 will have all the rows between index 4 and 8.
2) outputs each dataframe to csv as df_1.csv, df_2.csv, ...
Below is the code I tried, but the "df_i = df.ix[i]" step only gets the rows at the listed positions, not the range between them:
def write(df, ix):
    for i in ix:
        try:
            df_i = df.ix[i]
            df_i.to_csv("a.csv", index=False)
        except:
            pass
You can use iloc
def write(df, ix):
    c = 1
    for i in ix:
        try:
            df_i = df.iloc[i[0]:i[1]]  # use iloc
            df_i.to_csv(f"df_{str(c)}.csv", index=False)  # f-string to name the file
            c += 1  # update your counter
        except:
            pass
df = pd.DataFrame(np.arange(100).reshape((-1, 5)))
N = 5
ix = [(i, i+N) for i in range(0,len(df),N)]
write(df, ix)
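If the chunks are always N consecutive rows, a slightly simpler sketch skips building the index list entirely and slices with iloc directly (the file naming follows the answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100).reshape((-1, 5)))
N = 4

for c, start in enumerate(range(0, len(df), N), start=1):
    df.iloc[start:start + N].to_csv(f"df_{c}.csv", index=False)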

compare index and column in data frame with dictionary

I have a dictionary:
d = {'A-A': 1, 'A-B': 2, 'A-C': 3, 'B-A': 5, 'B-B': 1, 'B-C': 5,
     'C-A': 3, 'C-B': 4, 'C-C': 9}
and a list:
L = ['A', 'B', 'C']
I have a DataFrame:
df = pd.DataFrame(columns=L, index=L)
I would like to fill each row in df with values from the dictionary, based on the dictionary keys. For example:
   A  B  C
A  1  2  3
B  5  1  5
C  3  4  9
I tried doing that by:
df.loc[L[0]]=[1,2,3]
df.loc[L[1]]=[5,1,5]
df.loc[L[2]] =[3,4,9]
Is there another way to do that, especially when there is a huge amount of data?
Thank you for the help.
Here is another way that I can think of:
import numpy as np
import pandas as pd
# given
d = {'A-A': 1, 'A-B': 2, 'A-C': 3, 'B-A': 5, 'B-B': 1, 'B-C': 5,
     'C-A': 3, 'C-B': 4, 'C-C': 9}
L = ['A', 'B', 'C']

# copy the dictionary values into a numpy array
z = np.asarray(list(d.values()))

# reshape the array according to your DataFrame
z_new = np.reshape(z, (3, 3))

# copy it into your DataFrame
df = pd.DataFrame(z_new, columns=L, index=L)
This should do the trick, though it's probably not the best way:
for index in L:
    prefix = index + "-"
    df.loc[index] = [d.get(prefix + column, 0) for column in L]
Calculating the prefix separately beforehand is probably slower for a small list and probably faster for a large list.
Explanation
for index in L:
This iterates through all of the row names.
prefix = index + "-"
All of the keys for each row start with index + "-", e.g. "A-", "B-", etc.
df.loc[index] =
Set the contents of the entire row.
[ for column in L]
The same as your literal list ([1, 2, 3]), just for an arbitrary number of items. This is called a "list comprehension".
d.get( , 0)
This is the same as d[ ] but returns 0 if it can't find anything.
prefix + column
Sticks the column on the end, e.g. "A-" gives "A-A", "A-B"…
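For larger data, another hedged option is to let pandas reshape the dictionary itself by splitting the keys into a MultiIndex and unstacking; this sketch assumes every key follows the 'row-column' pattern used above:
import pandas as pd

s = pd.Series(d)
s.index = pd.MultiIndex.from_tuples([tuple(k.split('-')) for k in s.index])
df = s.unstack()  # first half of each key becomes the row, second half the column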
