Iterating the keys of a nested dictionary - python-3.x

I am iterating over a nested dictionary using the dict.keys() method. The code works well if the dictionary is nested, but if the dictionary is not nested it throws an error. For example, for the dictionary
{"a":{1:'i'}}
the code works fine, but for the following dictionary it fails:
{"a":1}
In my iteration, I do not want an error to be thrown when the dictionary has no further keys; per the requirement, we may pass nested or non-nested dictionaries.
Following is the sample code:
import pandas as pd
import numpy as np

n = 0
df = pd.DataFrame(index=np.arange(10), columns=['column0'])

def iterate_dict(dict):
    global n
    for j in dict.keys():
        df[n] = j
        n = n + 1
    return dict

# function call
iterate_dict({"a": 1})
Error Message:
AttributeError: 'str' object has no attribute 'keys'
Thanks for the help.

You must have called iterate_dict with a string argument, e.g. iterate_dict("{a:1}"), which raises AttributeError: 'str' object has no attribute 'keys'.
Try calling it with a dictionary instead:
returned_dict = iterate_dict({"a": 1})
It should work.
Adding working code here:
import pandas as pd
import numpy as np

n = 0
df = pd.DataFrame(index=np.arange(10), columns=['column0'])

def iterate_dict(dict):
    global n
    for j in dict.keys():
        df[n] = j
        n = n + 1
    return dict

# function call
iterate_dict({"a": 1})
print(df.head())
OUTPUT
  column0  0
0     NaN  a
1     NaN  a
2     NaN  a
3     NaN  a
4     NaN  a
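If the underlying goal is to walk nested dictionaries without failing on leaf values, a type check before recursing avoids the AttributeError. A minimal sketch (the recursive call is an assumption about the fuller, original code, which is not shown):

def walk_dict(d):
    for key, value in d.items():
        print(key)
        if isinstance(value, dict):  # only recurse into real dictionaries
            walk_dict(value)

walk_dict({"a": {1: 'i'}})  # nested: prints a, then 1
walk_dict({"a": 1})         # flat: prints a, no error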

Related

Add two columns to a pandas DataFrame based on condition

I am trying to add two new columns with different values based on two conditions.
Source sample data for the left and right DataFrames:
Left:
      id rec_type  end_date
0  13759        U  20210113
1  23806        N       NaN
2  21347        U  20210113
3  36904        N       NaN
Right:
      id
0  23806
1  21347
Expected output:
      id rec_type  end_date     _merge  error_code                           error_description
0  13759        U  20210113  left_only         601  update record not available in right table
1  23806        N       NaN       both           0                                           0
2  21347        U  20210113       both           0                                           0
3  36904        N       NaN  left_only         602     New record not available in right table
I am using numpy's np.select to achieve this, as in the code below, but I am getting an error.
import pandas as pd
import numpy as np

merged_df = pd.merge(left_df, right_df,
                     how='outer',
                     on=['id'],
                     indicator=True)
merged_df = merged_df.query('_merge != "right_only"')

conditions = [((merged_df['_merge'] == "left_only") &
               (merged_df['rec_type'] == "U") &
               (merged_df['end_date'].notnull())),
              ((merged_df['_merge'] == "left_only") &
               (merged_df['rec_type'] == "N") &
               (merged_df['end_date'].isnull()))]

error_codes = dict()
error_codes['error_code'] = [601, 602]
error_codes['error_description'] = ['update record not available in right table',
                                    'New record not available in right table']

merged_df['error_code'] = np.select(conditions, error_codes['error_code'])
merged_df['error_description'] = np.select(conditions, error_codes['error_description'])
I am getting the warning below; please share suggestions to resolve it.
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  validate_df['error_code'] = np.select(conditions, error_codes['error_code'])

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  validate_df['error_description'] = np.select(conditions, error_codes['error_description'])
Thanks,
Raghunath.
Note: the code works fine with the sample data, but with more data I get the warning above.
I was able to resolve the issue by changing
merged_df = merged_df.query('_merge != "right_only"')
to the following:
merged_df = merged_df[merged_df._merge != "right_only"]
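For reference, the warning can also be silenced by taking an explicit copy after the filtering step, so the later column assignments operate on an independent DataFrame rather than a possible view of the merge result; a minimal sketch:

merged_df = merged_df.query('_merge != "right_only"').copy()
merged_df['error_code'] = np.select(conditions, error_codes['error_code'])
merged_df['error_description'] = np.select(conditions, error_codes['error_description'])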

'Series' object has no attribute 'columns' in Dask

I have a function para_func that takes a DataFrame as input and returns a DataFrame. I would like to group the rows of df and apply para_func to each group, but I get the error below:
distributed.worker - WARNING - Compute Failed
Function: execute_task
args: ((subgraph_callable, 'root', None, <function para_func at 0x000002F044CAC1F0>, 'drop_by_shallow_copy-b7fa028b7c197981e52eea9f870f36c8', (<function _concat at 0x000002F030567430>, [ id parent_id root _partitions
0 cqug90j cqug2sr cqug2sr 0
1 cqug90k 34fvry 34fvry 0
2 cqug90z cqu80zb cqu80zb 0
3 cqug91c cqtdj4m cqtdj4m 0
4 cqug91e cquc4rc cquc4rc 0
... ... ... ... ...
99995 cqv8wz8 cqv1gg9 34hylq 0
99996 cqv8wzj 34i1r5 34i1r5 0
99997 cqv8wzv 34jasa 34jasa 0
99998 cqv8wzx cqv8k2k 34jasa 0
99999 cqv8x08 cquywos 34hywb 0
[100000 rows x 4 columns]], False), ['_partitions'], 'simple-shuffle-ead8404542740024e1572ac449733a42'))
kwargs: {}
Exception: AttributeError("'Series' object has no attribute 'columns'")
Could you please elaborate on how to solve the error?
import pandas as pd
import networkx as nx
from dask.distributed import Client
import dask.dataframe as dd

client = Client(n_workers=4, threads_per_worker=2, processes=False, memory_limit='20GB')

def para_func(tmp_df):
    siblings = pd.DataFrame({'id': tmp_df['id'],
                             'num_siblings': tmp_df.groupby('parent_id')['parent_id'].transform('count') - 1})
    children = (tmp_df.groupby(by='id').size()
                      .reindex(tmp_df['id'], fill_value=0)
                      .to_frame().reset_index(level=0)
                      .rename(columns={0: 'num_children'}))
    att_df = siblings.merge(children, how='left', on='id')
    return att_df

path = r'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/sample_df2.csv'
df = dd.read_csv(path, header=0)
result = df.groupby('root').apply(para_func, meta=object)
computed_result = result.compute()
You need to pass a DataFrame object that describes the output of para_func as the meta parameter of apply. It should be an empty DataFrame with the same structure (column names and column dtypes) as the DataFrame returned by para_func.
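A minimal sketch of such a meta DataFrame, with the column names taken from para_func above; the dtypes here are assumptions and should match what para_func actually returns:

import pandas as pd

# empty frame describing para_func's output: column names and dtypes only
meta = pd.DataFrame({'id': pd.Series(dtype='object'),
                     'num_siblings': pd.Series(dtype='int64'),
                     'num_children': pd.Series(dtype='int64')})

result = df.groupby('root').apply(para_func, meta=meta)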

NumPy np.chararray to nd.array

I have the following np.chararray
array = np.chararray(shape=(5))
initialize_array(array)
# Gives out the following array of type S1
# [b'1' b'0' b'1' b'0' b'1']
How can I cast this array to an ndarray of integers? I want an array like
[ 1 0 1 0 1 ] # dtype = int
Is this possible via some function that I am not aware of? Or should I do it by "hand"?
Using astype like:
new_ndarray = array.astype("int")
Raises a ValueError:
ValueError: Can only create a chararray from string data.
MCVE
#!/usr/bin/python3
import numpy as np
char_array = np.chararray(shape=(5))
char_array[:] = [b'1',b'0',b'1',b'0',b'1']
nd_array = char_array.astype("int")
You can do:
import numpy as np
array = np.chararray(shape=(5))
array[:] = [b'1', b'0', b'1', b'0', b'1']
array_int = np.array(array, dtype=np.int32)
print(array_int)
# [1 0 1 0 1]
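As an aside, np.chararray is a legacy class kept for backwards compatibility; if a plain bytes array is acceptable, astype works on it directly (a sketch, not specific to the original code):

import numpy as np

plain = np.array([b'1', b'0', b'1', b'0', b'1'])  # dtype 'S1'
print(plain.astype(int))
# [1 0 1 0 1]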

Issue in passing an array to an index in a Series object (TypeError: len() of unsized object)

I have the following data:
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
Then I tried:
univals = set(a)
serObj = pd.Series()
for ele in univals:
    indexfound = np.where(a == ele)
    Xpointsfromindex = np.take(b, indexfound)
    serobj1 = pd.Series(Xpointsfromindex[0], index=ele)  ## error happening here
    serObj.apend(serobj1)
print(serObj)
I expect output to be like
0 ['x1','x3']
1 ['x2','x4']
2 ['x5','x6']
But it gives me the error "TypeError: len() of unsized object".
Where am I going wrong?
I believe it is possible to create a DataFrame here (given the lists have the same length) and then build the lists with groupby:
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']

df = pd.DataFrame({'a': a, 'b': b})
print(df)
   a   b
0  0  x1
1  1  x2
2  0  x3
3  1  x4
4  2  x5
5  2  x6

serObj = df.groupby('a')['b'].apply(list)
print(serObj)
a
0    [x1, x3]
1    [x2, x4]
2    [x5, x6]
Name: b, dtype: object
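For comparison, the same grouping can be built without pandas at all, using a defaultdict over the sample data (a minimal sketch):

from collections import defaultdict

a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']

grouped = defaultdict(list)
for key, val in zip(a, b):
    grouped[key].append(val)
print(dict(grouped))
# {0: ['x1', 'x3'], 1: ['x2', 'x4'], 2: ['x5', 'x6']}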
Just to stick to what the OP was doing, here is the full code that works:
import pandas as pd
import numpy as np

a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']

univals = set(a)
serObj = pd.Series()
for ele in univals:
    indexfound = np.where([i == ele for i in a])
    Xpointsfromindex = np.take(b, indexfound)
    print(Xpointsfromindex)
    serobj1 = pd.Series(Xpointsfromindex[0],
                        index=[ele for _ in range(np.shape(indexfound)[1])])
    serObj = serObj.append(serobj1)  # append returns a new Series; reassign it
print(serObj)
Output
[['x1' 'x3']]
[['x2' 'x4']]
[['x5' 'x6']]
Explanation
indexfound = np.where(a == ele) always comes back empty because a == ele compares a list with a scalar, which evaluates to a single False. Changing it to a list comprehension fetches the indices.
The next change is using a list comprehension for the index parameter of pd.Series, so the index has the same length as the data.
This should set you on your way to what you want to achieve.
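Alternatively, converting a and b to NumPy arrays up front makes the original element-wise comparison valid, so np.where can be kept exactly as written; a minimal sketch:

import numpy as np

a = np.array([0, 1, 0, 1, 2, 2])
b = np.array(['x1', 'x2', 'x3', 'x4', 'x5', 'x6'])

for ele in np.unique(a):
    indexfound = np.where(a == ele)  # element-wise comparison now works
    print(ele, np.take(b, indexfound)[0])
# 0 ['x1' 'x3']
# 1 ['x2' 'x4']
# 2 ['x5' 'x6']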

Define function to convert string to integer in Python

This is likely a very simple question but I would appreciate help!
As part of a larger script, I have a dataframe (imported from a csv file) with two columns, 'file_name' and 'value'. I have a short example below:
file_name value
0 201623800811s.fits True
1 201623802491s.fits True
2 201623802451s.fits False
I would like to define a function that reads the values within column 'value', and returns 0 for 'False' and 1 for 'True'. I would then like to append the results to a third column in the dataframe, and finally export the updated dataframe to the csv.
I have defined a function that appears to me to work. However, when I run the script it does not execute, and I receive the following message in the console:
<function convert_string at 0x000000000DE35588>
My function is below. Any help or advice will be welcomed.
def convert_string(explosions):
    for i in range(0, len(explosions)):
        if i == 'True':
            return 1
        elif i == 'False':
            return 0
        else:
            return 2

print convert_string
If you are using an explicit for loop when working with a dataframe, you are most probably "doing it wrong". Also, what is the point of having a for loop if you return on the very first iteration?
Consider these:
import numpy as np
df['third_column'] = np.where(df['value'], 1, 0)
If you insist on defining a function:
def foo(x):
    return int(x)

df['third_column'] = df['value'].apply(foo)
or simply
df['third_column'] = df['value'].apply(lambda x: int(x))
Full example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [True, False]})
print(df)
#    value
# 0   True
# 1  False

df['third_column'] = np.where(df['value'], 1, 0)
print(df)
#    value  third_column
# 0   True             1
# 1  False             0
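A note on the design choice: since True and False already cast to 1 and 0, the same column can be produced with a plain dtype conversion, without np.where or apply:

df['third_column'] = df['value'].astype(int)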
You're not calling the function. Your print statement should be print convert_string(<value>), where <value> is the sequence of values you want to convert.
