'Series' object has no attribute 'columns' in Dask - python-3.x

I have a function para_func that takes a dataframe as input and returns a dataframe. I would like to group the rows of df and apply para_func to each group, but I get the following error:
distributed.worker - WARNING - Compute Failed
Function: execute_task
args: ((subgraph_callable, 'root', None, <function para_func at 0x000002F044CAC1F0>, 'drop_by_shallow_copy-b7fa028b7c197981e52eea9f870f36c8', (<function _concat at 0x000002F030567430>, [ id parent_id root _partitions
0 cqug90j cqug2sr cqug2sr 0
1 cqug90k 34fvry 34fvry 0
2 cqug90z cqu80zb cqu80zb 0
3 cqug91c cqtdj4m cqtdj4m 0
4 cqug91e cquc4rc cquc4rc 0
... ... ... ... ...
99995 cqv8wz8 cqv1gg9 34hylq 0
99996 cqv8wzj 34i1r5 34i1r5 0
99997 cqv8wzv 34jasa 34jasa 0
99998 cqv8wzx cqv8k2k 34jasa 0
99999 cqv8x08 cquywos 34hywb 0
[100000 rows x 4 columns]], False), ['_partitions'], 'simple-shuffle-ead8404542740024e1572ac449733a42'))
kwargs: {}
Exception: AttributeError("'Series' object has no attribute 'columns'")
Could you please elaborate on how to solve the error?
import pandas as pd
import networkx as nx
from dask.distributed import Client
import dask.dataframe as dd

client = Client(n_workers=4, threads_per_worker=2, processes=False, memory_limit='20GB')

def para_func(tmp_df):
    siblings = pd.DataFrame({'id': tmp_df['id'], 'num_siblings': tmp_df.groupby('parent_id')['parent_id'].transform('count') - 1})
    children = tmp_df.groupby(by='id').size().reindex(tmp_df['id'], fill_value=0).to_frame().reset_index(level=0).rename(columns={0: 'num_children'})
    att_df = siblings.merge(children, how='left', on='id')
    return att_df

path = r'https://raw.githubusercontent.com/leanhdung1994/WebMining/main/sample_df2.csv'
df = dd.read_csv(path, header=0)
result = df.groupby('root').apply(para_func, meta=object)
computed_result = result.compute()

You need to pass a DataFrame that describes the output of the para_func function as the meta parameter of apply. It should be an empty DataFrame with the same structure (column names and column dtypes) as the DataFrame returned by para_func.
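For example, assuming para_func keeps returning the columns id, num_siblings, and num_children (the dtypes below are a guess and must match what para_func actually produces), the meta could look like this:

import pandas as pd

# Empty DataFrame describing para_func's output; the dtypes are an
# assumption and must match what para_func really returns.
meta = pd.DataFrame({
    'id': pd.Series(dtype='object'),
    'num_siblings': pd.Series(dtype='int64'),
    'num_children': pd.Series(dtype='int64'),
})

result = df.groupby('root').apply(para_func, meta=meta)
computed_result = result.compute()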

Related

Python Pandas apply function not being applied to every row when using variables from a DataFrame

I have this weird Pandas problem: when I use the apply function with values pulled from a data frame, it only gets applied to the first row:
import pandas as pd
# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [10, 20]]
dfData = pd.DataFrame(valuesData, columns = headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)
#variable data frame - pull values from this to edit main data frame
headerVariables = [['varA', 'varB']]
valuesVariables = [[2, 10]]
dfVariables = pd.DataFrame(valuesVariables, columns = headerVariables)
dfVariables.to_csv('Variables.csv', index=False)
readVariablesCSV = pd.read_csv('Variables.csv')
readVarA = readVariablesCSV['varA']
readVarB = readVariablesCSV['varB']
def formula(x):
    return (x / readVarA) * readVarB
dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
dataA dataB
0 50.0 100.0
1 NaN NaN
But when I just use regular variables (not being called from a data frame), it functions just fine:
import pandas as pd
# main data frame - to be edited
headerData = [['dataA', 'dataB']]
valuesData = [[10, 20], [20, 40]]
dfData = pd.DataFrame(valuesData, columns = headerData)
dfData.to_csv('MainData.csv', index=False)
readMainDataCSV = pd.read_csv('MainData.csv')
print(readMainDataCSV)
# variables
readVarA = 2
readVarB = 10
def formula(x):
    return (x / readVarA) * readVarB
dfFormulaApplied = readMainDataCSV.apply(lambda x: formula(x))
print('\n', dfFormulaApplied)
Output:
dataA dataB
0 50.0 100.0
1 100.0 200.0
Help please I'm pulling my hair out.
When you take readVarA and readVarB from the dataframe by selecting a column, each is a pandas Series with its own index. That breaks the calculation: dividing one Series by another aligns on index labels, so rows whose labels don't match come out as NaN (which is why only the first row looks computed).
You can take the first value from each Series like this:
def formula(x):
    return (x / readVarA[0]) * readVarB[0]
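Alternatively, a minimal sketch of the same idea using Series.item() (this assumes each variable column in Variables.csv holds exactly one value):

import pandas as pd

readVariablesCSV = pd.read_csv('Variables.csv')

# .item() extracts the single scalar from a one-element Series, so no
# index alignment happens in the division below.
readVarA = readVariablesCSV['varA'].item()
readVarB = readVariablesCSV['varB'].item()

readMainDataCSV = pd.read_csv('MainData.csv')
dfFormulaApplied = readMainDataCSV.apply(lambda x: (x / readVarA) * readVarB)
print(dfFormulaApplied)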

Python nested dictionary - keys

I am iterating over a nested dictionary using the dict.keys() method. The code works well if the dictionary is nested, but if the dictionary is not nested it throws an error, i.e.
{"a":{1:'i'}}
For the above dictionary the code works fine, but it fails for the following dictionary:
{"a":1}
In my iteration, I do not want an error to be thrown if the dictionary has no further keys; per the requirement, we may pass nested or non-nested dictionaries.
Following is the sample code:
import pandas as pd
import numpy as np

global n
n = 0
df = pd.DataFrame(index=np.arange(10), columns=['column0'])

def iterate_dict(dict):
    global n
    for j in dict.keys():
        df[n] = j
        n = n + 1
    return dict

# function call
iterate_dict({"a":1})
Error Message:
AttributeError: 'str' object has no attribute 'keys'
Thanks for the help.
You must have called iterate_dict with a string argument, like iterate_dict("{a:1}"), which gives the error AttributeError: 'str' object has no attribute 'keys'.
Try using:
returned_dict = iterate_dict({"a": 1})
It should work.
Adding working code here:
import pandas as pd
import numpy as np
global n
n = 0
df = pd.DataFrame(index=np.arange(10), columns=['column0'])
def iterate_dict(dict):
    global n
    for j in dict.keys():
        df[n] = j
        n = n + 1
    return dict
# function call
iterate_dict({"a": 1})
print(df.head())
OUTPUT
column0 0
0 NaN a
1 NaN a
2 NaN a
3 NaN a
4 NaN a
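As a side note, if the requirement is to walk dictionaries that may or may not be nested without raising an error, a recursive sketch along these lines could work (visit_keys is a made-up name, not the original function):

def visit_keys(d):
    # Recurse only when a value is itself a dict; plain values are left
    # alone, so nested and non-nested dictionaries both work.
    for key, value in d.items():
        print(key)
        if isinstance(value, dict):
            visit_keys(value)

visit_keys({"a": {1: 'i'}})  # prints a, then 1
visit_keys({"a": 1})         # prints a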

NumPy np.chararray to nd.array

I have the following np.chararray
array = np.chararray(shape=(5))
initialize_array(array)
# Gives out the following array of type S1
# [b'1' b'0' b'1' b'0' b'1']
How can I cast this array to an ndarray? I wish for an array like
[ 1 0 1 0 1 ] # dtype = int
Is this possible via some function that I am not aware of? Or should I do it by "hand"?
Using astype like:
new_ndarray = array.astype("int")
Raises a ValueError:
ValueError: Can only create a chararray from string data.
MCVE
#!/usr/bin/python3
import numpy as np
char_array = np.chararray(shape=(5))
char_array[:] = [b'1',b'0',b'1',b'0',b'1']
nd_array = char_array.astype("int")
You can do:
import numpy as np
array = np.chararray(shape=(5))
array[:] = [b'1', b'0', b'1', b'0', b'1']
array_int = np.array(array, dtype=np.int32)
print(array_int)
# [1 0 1 0 1]
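As a side note (an alternative sketch, not part of the original answer): the ValueError appears specific to np.chararray, whose astype insists on string data. A plain ndarray of byte strings casts directly:

import numpy as np

# A regular bytes ndarray (dtype 'S1') supports astype(int) directly.
plain = np.array([b'1', b'0', b'1', b'0', b'1'])
print(plain.astype(int))  # [1 0 1 0 1]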

Problems with mapping user ids back to their respective cluster class in pandas

I did clustering and now want to map the cluster class back to each 'userid' row in my original dataframe. However, the last part of my code for the mapping does NOT return the dataframe I am expecting.
df=
userid recency frequency
1233 33232.0 5.715858
3344 23403.0 3.615858
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# convert df to array
data = df.values
X = data

# Scale
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.25, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

# get dataframe with cluster_class and its data size
df = pd.DataFrame(pd.Series(labels).value_counts())
df.index.names = ['Cluster_Class']
df.rename(columns={df.columns[0]: "Users"}, inplace=True)
df =
               Users
Cluster_Class
 0              2096
-1                30
 2                13
 1                11
# MAP each cluster class to all userids. NOT WORKING!!!
N_CLUSTERS = len(df.index.names) - 1
clusters = [X[db == i] for i in range(N_CLUSTERS)]
for i, c in enumerate(clusters):
    print('Cluster {} has {} members: {}...'.format(i, len(c), c[0]))
You need to compare the labels, but you are comparing the clustering object itself (of type DBSCAN) to an integer: the expression db == i should be labels == i.
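A minimal sketch of the fix, reusing labels and n_clusters_ from the code above (the name user_df for the original userid dataframe is an assumption, since the code overwrote df with the value counts):

# Compare the label array, not the DBSCAN object.
clusters = [X[labels == i] for i in range(n_clusters_)]
for i, c in enumerate(clusters):
    print('Cluster {} has {} members: {}...'.format(i, len(c), c[0]))

# To map each userid to its cluster class, attach the labels to the
# original dataframe (user_df is assumed to hold the original rows):
user_df['cluster_class'] = labels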

Define function to convert string to integer in Python

This is likely a very simple question but I would appreciate help!
As part of a larger script, I have a dataframe (imported from a csv file) with two columns, 'file_name' and 'value'. I have a short example below:
file_name value
0 201623800811s.fits True
1 201623802491s.fits True
2 201623802451s.fits False
I would like to define a function that reads the values within column 'value', and returns 0 for 'False' and 1 for 'True'. I would then like to append the results to a third column in the dataframe, and finally export the updated dataframe to the csv.
I have defined a function that appears to me to work. However, when I run the script the function does not execute, and I receive the following message in the console:
<function convert_string at 0x000000000DE35588>
My function is below. Any help or advice will be welcomed.
def convert_string(explosions):
    for i in range(0, len(explosions)):
        if i == 'True':
            return 1
        elif i == 'False':
            return 0
        else:
            return 2

print convert_string
If you are using an explicit for loop when working with a dataframe, you are most probably "doing it wrong". Also, what is the point of having a for loop if you return on the very first iteration?
Consider these:
import numpy as np
df['third_column'] = np.where(df['value'], 1, 0)
If you insist on defining a function:
def foo(x):
    return int(x)
df['third_column'] = df['value'].apply(foo)
or simply
df['third_column'] = df['value'].apply(lambda x: int(x))
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [True, False]})
print(df)
# value
# 0 True
# 1 False
df['third_column'] = np.where(df['value'], 1, 0)
print(df)
# value third_column
# 0 True 1
# 1 False 0
You're not calling the function. Your print statement should be print convert_string(<value>), where <value> is the value you want to convert.
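For completeness, a minimal sketch of a per-value version of the function and a correct call (this rewrites the original loop, which compared the loop index i to 'True' rather than the value itself):

def convert_string(value):
    # Compare the value itself, not a loop index.
    if value == 'True':
        return 1
    elif value == 'False':
        return 0
    return 2

print(convert_string('True'))   # 1
print(convert_string('False'))  # 0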
