Convert string to set array when loading csv file in pandas DataFrame - python-3.x

I'm trying to convert a pandas column from string to a set so I can perform set operations (-) and methods (.union) between two DataFrames on their set_array columns. The data comes from two csv files, each with a set_array column. However, once I run pd.read_csv, the column's type becomes str, which prevents me from doing set operations and methods.
csv1:
set_array
0 {985,784}
1 {887}
2 set()
3 {123,469,789}
4 set()
After loading csv1 into a DataFrame using df = pd.read_csv(csv1), the data type becomes str, and when I try to call the first index using df['set_array'].values[0], I get the following:
'{985, 784}'
However, if I were to create my own DataFrame with a set column using df1 = pd.DataFrame({'set_array':[{985, 784},{887},{},{123, 469, 789},{}]}), and call the first index again using df1['set_array'].values[0], I get the following (desired output):
{985, 784} <-without the ''
Here is what I tried so far:
1) df.replace('set()', '') <-removes the set() portion from df
2) df['set_array'] = df['set_array'].apply(set) <-does not work
3) df['set_array'] = df['set_array'].apply(lambda x: {x}) <-does not work
4) df['set_array'].astype(int) <-convert to int first then convert to set, does not work
5) df['set_array'].astype(set) <-does not work
6) df['set_array'].to_numpy() <-convert to array, does not work
I'm also thinking of changing the column to set at the pd.read_csv stage as a potential solution.
Is there any way to load csv using pandas and keep the set data type, or just simply convert the column from str to set so it looks like the desired output above?
Thanks!!

I agree with Cainã that dealing with the root cause in the input data would be the best approach here. But, if that's not possible, then something like this would be a lot more predictable than using eval if this is for some kind of production environment:
import re

def parse_set_string(s):
    if s == 'set()':
        return None  # or return set() if you prefer
    else:
        string_nums_only = re.sub('[^0-9,]', '', s)
        split_nums = string_nums_only.split(',')
        return set(map(int, split_nums))

df.set_array.apply(parse_set_string)
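If you'd rather handle this at the pd.read_csv stage, as the question suggests, the same parser can be plugged in through read_csv's converters argument. A minimal sketch, reusing parse_set_string from above (the file name csv1.csv is just a placeholder):
import pandas as pd

# converters runs the parser on each raw cell of 'set_array' while reading,
# so the column holds real set objects (or None for empty ones) from the start
df = pd.read_csv('csv1.csv', converters={'set_array': parse_set_string})
print(df['set_array'].values[0])  # a real set such as {985, 784}, not a string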

We've seen this problem before when columns originally contained lists or numpy arrays. CSV is a 2d format - rows and columns - so to_csv can only save these embedded objects as strings. What does the file itself look like?
read_csv by default just loads the strings. To confuse things further, the pandas display does not quote strings, so the str of a set looks the same as the set itself.
With lists, it's enough to do an eval (or ast.literal_eval). With an ndarray the string has to be edited first.
Make a dataframe and fill it with some objects:
In [107]: df = pandas.DataFrame([None,None,None])
In [108]: df
Out[108]:
      0
0  None
1  None
2  None
In [109]: df[0][0]
In [110]: df[0][0]=[1,2,3]
In [111]: df[0][1]=np.array([1,2,3])
In [112]: df[0][2]={1,2,3}
In [113]: df
Out[113]:
           0
0  [1, 2, 3]
1  [1, 2, 3]
2  {1, 2, 3}
The numpy equivalent:
In [114]: df.to_numpy()
Out[114]:
array([[list([1, 2, 3])],
       [array([1, 2, 3])],
       [{1, 2, 3}]], dtype=object)
Write it to a file:
In [115]: df.to_csv('test.pd')
In [116]: cat test.pd
,0
0,"[1, 2, 3]"
1,[1 2 3]
2,"{1, 2, 3}"
Read it
In [117]: df1 = pandas.read_csv('test.pd')
In [118]: df1
Out[118]:
   Unnamed: 0          0
0           0  [1, 2, 3]
1           1    [1 2 3]
2           2  {1, 2, 3}
Ignoring the indexing that I should have suppressed, it looks a lot like the original df. But it contains strings, not lists, arrays, or sets.
In [119]: df1.to_numpy()
Out[119]:
array([[0, '[1, 2, 3]'],
       [1, '[1 2 3]'],
       [2, '{1, 2, 3}']], dtype=object)
Changing the frame to contain sets of differing sizes:
In [120]: df[0][1]=set()
In [122]: df[0][0]=set([1])
In [123]: df
Out[123]:
           0
0        {1}
1         {}
2  {1, 2, 3}
In [124]: df.to_csv('test.pd')
In [125]: cat test.pd
,0
0,{1}
1,set()
2,"{1, 2, 3}"
In [136]: df2 = pandas.read_csv('test.pd', index_col=0)
In [137]: df2
Out[137]:
           0
0        {1}
1      set()
2  {1, 2, 3}
Looks like eval can convert the empty set as well as the others:
In [138]: df3 = df2['0'].apply(eval)
In [139]: df3
Out[139]:
0          {1}
1           {}
2    {1, 2, 3}
Name: 0, dtype: object
In [140]: df2.to_numpy()
Out[140]:
array([['{1}'],
       ['set()'],
       ['{1, 2, 3}']], dtype=object)
In [141]: df3.to_numpy()
Out[141]: array([{1}, set(), {1, 2, 3}], dtype=object)

The problem with your DataFrame is that set_array contains the text representation of both:
set literals (e.g. {985, 784}),
Python code (set(), since an empty set has no literal form).
To cope with this case:
Import ast.
Define the following conversion function:
def mySetConv(txt):
    return set() if txt == 'set()' else ast.literal_eval(txt)
Apply it:
df.set_array = df.set_array.apply(mySetConv)
To check the result, you can run:
for it in df.set_array:
    print(it, type(it))
getting:
{784, 985} <class 'set'>
{887} <class 'set'>
set() <class 'set'>
{789, 123, 469} <class 'set'>
set() <class 'set'>
Note that if your source file had {} instead of set(), you could not simply run
df.set_array = df.set_array.apply(ast.literal_eval)
because ast.literal_eval('{}') evaluates to an empty dict, not an empty set, so the empty entries would still need special handling (for example, a check for '{}' in mySetConv).

Related

column comprehension robust to missing values

I have only been able to create a two-column DataFrame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
                         columns=['id', 'value'])
What I would like to do, using this basic format, is also initialize the DataFrame with three columns: 'id', 'id2' and 'value'. I have a separately defined dict, called id_lookup, that contains the necessary lookup info.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible encounters. For my purposes, simply putting it all together and placing 'N/A' or something for those errors will be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)

# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'

# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)

# Print the updated DataFrame
print(df)
which gives:
   id  value  id2
0   1     10    A
1   2     20    B
2   3     30    C
3   4     40  N/A
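If you prefer to keep the original one-liner construction from the question, the same robustness can be had with dict.get and a default value. A minimal sketch (the output dict below is a stand-in for the question's defaultdict):
import pandas as pd

output = {1: 10, 2: 20, 3: 30, 4: 40}   # stand-in for the question's defaultdict
id_lookup = {1: 'A', 2: 'B', 3: 'C'}    # lookup dict that may be missing some keys

# dict.get supplies 'N/A' whenever a key is absent from id_lookup, avoiding KeyError
df_mydata = pd.DataFrame(
    [(k, id_lookup.get(k, 'N/A'), v) for k, v in output.items()],
    columns=['id', 'id2', 'value'])
print(df_mydata)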

Creating different dataframes and outputting them to different csvs based on a list of indexes

I have a list of index pairs like below, based on a value N. Here is the code I used to create the list of indexes:
df = pd.DataFrame(np.arange(100).reshape((-1, 5)))
N = 4
ix = [[i, i+N] for i in range(0,len(df),N)]
ix
# [[0, 4], [4, 8], [8, 12], [12, 16], [16, 20]]
I want to create a function which:
1) creates N dataframes (df_1, df_2, df_3, df_4, df_5), where the rows in each dataframe are based on one of the index pairs. For example, df_1 will have all the rows between index 0 and 4 from the main dataframe df, and similarly df_2 will have all the rows between index 4 and 8.
2) outputs each dataframe to csv as df_1.csv, df_2.csv, ...
Below is the code I tried, but the df_i = df.ix[i] step only gets the two rows listed in each pair, not the range between them:
def write(df, ix):
    for i in ix:
        try:
            df_i = df.ix[i]
            df_i.to_csv("a.csv", index = false)
        except:
            pass
You can use iloc
def write(df, ix):
    c = 1
    for i in ix:
        try:
            df_i = df.iloc[i[0]:i[1]]  # use iloc
            df_i.to_csv(f"df_{str(c)}.csv", index=False)  # f-strings to name file
            c += 1  # update your counter
        except:
            pass
df = pd.DataFrame(np.arange(100).reshape((-1, 5)))
N = 5
ix = [(i, i+N) for i in range(0,len(df),N)]
write(df, ix)
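As a side note (not part of the original answer), the manual counter can be dropped by letting enumerate number the output files:
def write(df, ix):
    # enumerate numbers the chunks starting at 1, replacing the manual counter
    for c, (start, stop) in enumerate(ix, start=1):
        df.iloc[start:stop].to_csv(f"df_{c}.csv", index=False)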

Using non-zero values from columns in function - pandas

I have the dataframe below and would like to calculate, within a function, the difference between columns 'animal1' and 'animal2' over their sum, while only taking into consideration the rows where the values in both 'animal1' and 'animal2' are bigger than 0.
How could I do this?
import pandas as pd

animal1 = pd.Series({'Cat': 4, 'Dog': 0, 'Mouse': 2, 'Cow': 0, 'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3, 'Mouse': 0, 'Cow': 1, 'Chicken': 2})
data = pd.DataFrame({'animal1': animal1, 'animal2': animal2})

def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + ['animal2'])
    return data['anim_diff'].abs().idxmax()

print(data)
I believe you need to check that all values in a row are greater than 0 with DataFrame.gt, test with DataFrame.all, and filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
# alternative for not equal 0
# df = data[data.ne(0).all(axis=1)].copy()
print (df)
         animal1  animal2
Cat            4        2
Chicken        3        2
print(animals(df))
Cat
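The same result can also be reached without the helper function, by filtering first and chaining the arithmetic. A small sketch using the data DataFrame defined in the question (not part of the original answer):
# keep only rows where both columns are greater than 0, then compute the ratio
valid = data[data.gt(0).all(axis=1)]
anim_diff = (valid['animal1'] - valid['animal2']) / (valid['animal1'] + valid['animal2'])
print(anim_diff.abs().idxmax())  # Cat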

Python Pandas Create Multiple dataframes by slicing data at certain locations

I am new to Python and data analysis using programming. I have a long csv and I would like to create DataFrames dynamically and plot them later on. Here is an example of a DataFrame similar to the data in my csv file:
df = pd.DataFrame(
    {"a": [4, 5, 6, 'a', 1, 2, 'a', 4, 5, 'a'],
     "b": [7, 8, 9, 'b', 0.1, 0.2, 'b', 0.3, 0.4, 'b'],
     "c": [10, 11, 12, 'c', 10, 20, 'c', 30, 40, 'c']})
As seen, there are elements which repeat in each column. So I would first need to find the indexes of the repetitions and then use them for making subsets. Here is the way I did this:
find_Repeat = df.groupby(['a'], group_keys=False).apply(
    lambda df: df if df.shape[0] > 1 else None)
repeat_idxs = find_Repeat.index[find_Repeat['a'] == 'a'].tolist()
If I print repeat_idxs, I would get
[3, 6, 9]
And this is an example of what I want to achieve in the end:
dfa_1 = df['a'][Index_Identifier[0]:Index_Identifier[1]]
dfa_2 = df['a'][Index_Identifier[1]:Index_Identifier[2]]
dfb_1 = df['b'][Index_Identifier[0]:Index_Identifier[1]]
dfb_2 = df['b'][Index_Identifier[1]:Index_Identifier[2]]
But this is not efficient or convenient, as I need to create many DataFrames like these for plotting later on. So I tried the following method:
dfNames = ['dfa_' + str(i) for i in range(len(repeat_idxs))]
dfs = dict()
for i, row in enumerate(repeat_idxs):
    dfName = dfNames[i]
    slices = df['a'].loc[row:row+1]
    dfs[dfName] = slices
If I print dfs, this is exactly what I want.
{'df_0': 3 a
4 1
Name: a, dtype: object, 'df_1': 6 a
7 4
Name: a, dtype: object, 'df_2': 9 a
Name: a, dtype: object}
However, when I read my csv and apply the above, I am not getting what's desired. I can find the repeated indices from the csv file but I am not able to slice the data properly. I am presuming that I am not reading the csv file correctly. I attached the csv file for further clarification: csv file
Two options:
Loop over and slice
Detect the repeat row indices and then loop over to slice contiguous chunks of the dataframe, ignoring the repeat rows:
# detect rows for which all values are equal to the column names
repeat_idxs = df.index[(df == df.columns.values).all(axis=1)]
slices = []
start = 0
for i in repeat_idxs:
    slices.append(df.loc[start:i - 1])
    start = i + 1
The result is a list, slices, of dataframes, which are the chunks of your data in order.
Use pandas groupby
You could also do this in one line using pandas groupby if you prefer:
grouped = df[~(df == df.columns.values).all(axis=1)].groupby((df == df.columns.values).all(axis=1).cumsum())
And you can now iterate over the groups like so:
for i, group_df in grouped:
    # do something with group_df
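Either route yields the same chunks. For completeness, a small self-contained sketch (names such as is_repeat and key are just illustrative) showing the groupby version applied to the example df and, optionally, writing each chunk to its own csv:
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6, 'a', 1, 2, 'a', 4, 5, 'a'],
     "b": [7, 8, 9, 'b', 0.1, 0.2, 'b', 0.3, 0.4, 'b'],
     "c": [10, 11, 12, 'c', 10, 20, 'c', 30, 40, 'c']})

# rows whose values equal the column names act as separators
is_repeat = (df == df.columns.values).all(axis=1)

# number each data row by how many separators precede it, then group
key = is_repeat.cumsum()[~is_repeat]
grouped = df[~is_repeat].groupby(key)

for i, group_df in grouped:
    print(f"chunk {i}:")
    print(group_df)
    # group_df.to_csv(f"chunk_{i}.csv", index=False)  # optional: one csv per chunk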

Loading dictionary stored as .npz fails

I have a dictionary that was stored all at once as one object using np.savez. When I open it with np.load as follows, I get the following:
my_dic=np.load('/home/values.npz')
my_dic.files
['scores']
However, when I try:
my_dic['scores']  # len(my_dic['scores'])=1 but it contains 3000 keys and 3000 values
it outputs all the keys and values as one object.
Is there any way to access the keys and values?
Something like:
for k, values in my_dic['scores'].items():
    # do something
Thank you
Sounds like you did:
In [80]: np.savez('test.npz', score={'a':1, 'b':2})
In [81]: d = np.load('test.npz')
In [83]: d.files
Out[83]: ['score']
In [84]: d['score']
Out[84]: array({'a': 1, 'b': 2}, dtype=object)
This is a 1-item array with object dtype. Extract that item with item():
In [85]: d['score'].item()
Out[85]: {'a': 1, 'b': 2}
If instead I save the dictionary with kwargs syntax:
In [86]: np.savez('test.npz', **{'a':1, 'b':2})
In [87]: d = np.load('test.npz')
In [88]: d.files
Out[88]: ['a', 'b']
Now each dictionary key is a file in the archive:
In [89]: d['a']
Out[89]: array(1)
In [90]: d['b']
Out[90]: array(2)
Following the indications given by @hpaulj,
I did the following to solve the problem:
x = list(my_dic['scores'].item())  # allows me to get the keys
keys = []
values = []
for i in np.arange(len(x)):
    value = my_dic['scores'].item()[x[i]]
    values.append(value)
    keys.append(x[i])
final_dic = dict(zip(keys, values))
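As a side note (not from the original posts), since .item() already returns the stored dictionary itself, the loop above collapses to a single assignment:
final_dic = my_dic['scores'].item()   # the dict saved into the archive, keys and values intact
for k, values in final_dic.items():
    pass  # do something with each key/value pair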
