Applying a function on a pandas groupby object returns a variable type - python-3.x

When applying a user-defined function f that takes a pd.DataFrame as input and returns a pd.Series on a pd.DataFrame.groupby object, the type of the returned object seems to depend on the number of unique values present in the field used to perform the grouping operation.
I am trying to understand why the API behaves this way, and I am looking for a neat way to have a pd.Series returned regardless of the number of unique values in the grouping field.
I went through the split-apply-combine section of the pandas documentation, but the special treatment of the case where the grouping field has a single unique value does not make sense to me.
import pandas as pd
from typing import Union

def f(df: pd.DataFrame) -> pd.Series:
    """
    User-defined function
    """
    return df['B'] / df['B'].max()

# Should only output a pd.Series
def perform_apply(df: pd.DataFrame) -> Union[pd.Series, pd.DataFrame]:
    return df.groupby('A').apply(f)

# Some dummy dataframe with multiple values in field 'A'
df1 = pd.DataFrame({'A': 'a a b'.split(),
                    'B': [1, 2, 3],
                    'C': [4, 6, 5]})
# Subset of dataframe with a single value in field 'A'
df2 = df1[df1['A'] == 'a'].copy()

res1 = perform_apply(df1)
res2 = perform_apply(df2)
print(type(res1), type(res2))
# --------------------------------
# -> <class 'pandas.core.series.Series'> <class 'pandas.core.frame.DataFrame'>
pandas : 1.4.2
python : 3.9.0
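A minimal sketch of one possible workaround, assuming the goal is simply B scaled by its per-group maximum: groupby(...).transform always returns an object aligned with the input column, so it yields a pd.Series for both the multi-group and the single-group frame.
import pandas as pd

df1 = pd.DataFrame({'A': 'a a b'.split(),
                    'B': [1, 2, 3],
                    'C': [4, 6, 5]})
df2 = df1[df1['A'] == 'a'].copy()

def g(df: pd.DataFrame) -> pd.Series:
    # transform returns a like-indexed object, so the result is a
    # pd.Series no matter how many groups 'A' contains
    return df['B'] / df.groupby('A')['B'].transform('max')

print(type(g(df1)), type(g(df2)))
# <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>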

Related

concat_ws and coalesce in pyspark

In PySpark, I want to combine concat_ws and coalesce whilst using the list method. For example, I know this works:
from pyspark.sql.functions import concat_ws, coalesce, col, lit
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
# display(df)
df = df.withColumn("concat_ws2", concat_ws(':', coalesce('Type', lit("")), coalesce('Segment', lit(""))))
display(df)
But I want to be able to utilise the *[list] method so I don't have to list out all the columns within that bit of code, i.e. something like this instead:
from pyspark.sql.functions import concat_ws, col
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
list = ["Type", "Segment"]
df = df.withColumn("almost_desired_output", concat_ws(':', *list))
display(df)
However, as you can see, I want to be able to coalesce NULL with a blank. Is that possible using the *[list] method, or do I really have to list out all the columns?
This would work:
Iterate over the list of column names:
df = df.withColumn("almost_desired_output", concat_ws(':', *[coalesce(name, lit('')).alias(name) for name in df.schema.names]))
Output: the new column contains A:B, C: and :D for the three rows.
Or use na.fill - it will fill all the null values across all columns of the DataFrame (but this changes the actual columns, which may break some use-cases):
df.na.fill("").withColumn("almost_desired_output", concat_ws(':', *list))
Or use selectExpr (again, this changes the actual columns, which may break some use-cases):
list = ["Type", "Segment"] # or just use df.schema.names
list2 = ["coalesce(type,' ') as Type", "coalesce(Segment,' ') as Segment"]
df=df.selectExpr(list2).withColumn("almost_desired_output", concat_ws(':', *list))

Changing column data type in Pandas DF using for command

I want to change the data type of a few columns in a DataFrame from Int/Float to Object.
I am trying to do this using a for loop, but the data type is not changing.
for i in convert_to_object:
    i = i.astype('object', inplace=True)

# convert_to_object is a list of column names in data frame df
convert_to_object = [df_app1.REGION_RATING_CLIENT,
                     df_app1.REGION_RATING_CLIENT_W_CITY,
                     df_app1.REG_REGION_NOT_LIVE_REGION,
                     df_app1.REG_REGION_NOT_WORK_REGION,
                     df_app1.LIVE_REGION_NOT_WORK_REGION,
                     df_app1.REG_CITY_NOT_LIVE_CITY,
                     df_app1.REG_CITY_NOT_WORK_CITY,
                     df_app1.LIVE_CITY_NOT_WORK_CITY]

for i in convert_to_object:
    i = i.astype('object', inplace=True)

df.info()
The data type of the given columns should change to Object.
As far as I know, DataFrame.astype does not support an inplace parameter. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html How about:
cols = [
    "REGION_RATING_CLIENT",
    "REGION_RATING_CLIENT_W_CITY",
    "REG_REGION_NOT_LIVE_REGION",
    "REG_REGION_NOT_WORK_REGION",
    "LIVE_REGION_NOT_WORK_REGION",
    "REG_CITY_NOT_LIVE_CITY",
    "REG_CITY_NOT_WORK_CITY",
    "LIVE_CITY_NOT_WORK_CITY"
]
df[cols] = df[cols].astype(object)
Consider this example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [1.2, 3.4]
})
print(df.dtypes)
df[['A', 'B']] = df[['A', 'B']].astype(object)
print(df.dtypes)
Returns:
A int64
B float64
dtype: object
A object
B object
dtype: object
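If you would rather keep a loop over column names, as in the question, a minimal sketch that assigns each converted column back to the frame (instead of relying on an inplace flag) would be:
cols = ["REGION_RATING_CLIENT", "REGION_RATING_CLIENT_W_CITY"]  # etc.
for col in cols:
    # astype returns a new Series; assigning it back is what actually
    # changes the dtype stored in the DataFrame
    df_app1[col] = df_app1[col].astype(object)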

How to write a Pandas Dataframe into a HDF5 dataset

I'm trying to write data from a Pandas dataframe into a nested HDF5 file, with multiple groups and datasets within each group. I'd like to keep it as a single file which will grow on a daily basis. I've had a go with the following code, which shows the structure of what I'd like to achieve:
import h5py
import numpy as np
import pandas as pd

file = h5py.File('database.h5', 'w')

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

groups = ['A', 'B', 'C']
for m in groups:
    group = file.create_group(m)
    dataset = ['1', '2', '3']
    for n in dataset:
        data = df
        ds = group.create_dataset(m + n, data.shape)
        print("Dataset dataspace is", ds.shape)
        print("Dataset Numpy datatype is", ds.dtype)
        print("Dataset name is", ds.name)
        print("Dataset is a member of the group", ds.parent)
        print("Dataset was created in the file", ds.file)
        print("Writing data...")
        ds[...] = data
        print("Reading data back...")
        data_read = ds[...]
        print("Printing data...")
        print(data_read)

file.close()
This way the nested structure is created, but it loses the index and columns. I've tried
df.to_hdf('database.h5', ds, table=True, mode='a')
but it didn't work; I get this error:
AttributeError: 'Dataset' object has no attribute 'split'
Can anyone shed some light, please? Many thanks.
df.to_hdf() expects a string as a key parameter (second parameter):
key : string
identifier for the group in the store
so try this:
df.to_hdf('database.h5', ds.name, table=True, mode='a')
where ds.name should return you a string (key name):
In [26]: ds.name
Out[26]: '/A1'
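For reference, a minimal sketch of writing and reading one group with pandas alone (assuming the file is not simultaneously held open in write mode by h5py; format='table' is used here):
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# the key is a plain string naming the group inside the HDF5 file
df.to_hdf('database.h5', 'A1', format='table', mode='a')

# index and columns come back intact
print(pd.read_hdf('database.h5', 'A1'))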
I thought I'd have a go with pandas/PyTables and the HDFStore class instead of h5py, so I tried the following:
import numpy as np
import pandas as pd

db = pd.HDFStore('Database.h5')

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['Col1', 'Col2', 'Col3'])

groups = ['A', 'B', 'C']
i = 1
for m in groups:
    subgroups = ['d', 'e', 'f']
    for n in subgroups:
        db.put(m + '/' + n, df, format='table', data_columns=True)
It works: 9 groups (groups in PyTables rather than datasets as in h5py?) are created, from A/d to C/f. Columns and indexes are preserved and I can do the dataframe operations I need. I am still wondering, though, whether this is an efficient way to retrieve data from a specific group which will become huge in the future, i.e. operations like
db['A/d'].Col1[4:]
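On the efficiency question: because the frames were stored with format='table' and data_columns=True, HDFStore.select can pull back just a row range or a filtered subset of one group instead of loading the whole table into memory, e.g. (a sketch using the db store from above):
# read only rows 4 onwards of group A/d, then take Col1
db.select('A/d', start=4)['Col1']

# or push a condition down to PyTables so only matching rows are read
db.select('A/d', where='Col1 > 0')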

Assign a pandas dataframe to an object as a static class variable - memory use (Python)

I have a Python object called DNA. I want to create 100 instances of DNA. Each of the instances contains a pandas dataframe that is identical for all instances. To avoid duplication, I want to incorporate this dataframe as a static/class attribute.
import pandas as pd

some_df = pd.DataFrame()

class DNA(object):
    df = some_variable  # Do I declare here?

    def __init__(self, df=pd.DataFrame(), name='1'):
        self.name = name
        self.instance_df = instance_df  # I want to avoid this
        DNA.some_df = df  # Does this duplicate the data for every instance?
What is the correct way to do this?
Can I use the init function to create the class variable? Or will it create a separate class variable for every instance of the class?
Do I need to declare the class variable between the 'class ...' line and 'def __init__(...)'?
Some other way?
I want to be able to change the dataframe that I use as a class variable but once the class is loaded, it needs to reference the same value (i.e. the same memory) in all instances.
I've answered your question in the comments:
import pandas as pd

some_df = pd.DataFrame()

class DNA(object):
    df = some_variable  # You assign here. I would use `some_df`

    def __init__(self, df=pd.DataFrame(), name='1'):
        self.name = name
        self.instance_df = instance_df  # Yes, avoid this
        DNA.some_df = df  # This does not duplicate; assignment **never copies** in Python. However, I advise against this
So, using
DNA.some_df = df
inside __init__ does work. Since default arguments are evaluated only once at function definition time, that df is always the same df, unless you explicitly pass a new df to __init__, but that smacks of bad design to me. Rather, you probably want something like:
class DNA(object):
    def __init__(self, df=pd.DataFrame(), name='1'):
        self.name = name

# <some work to construct a dataframe>
df = final_processing_function()
DNA.df = df
Suppose you then want to change it; at any point you can use:
DNA.df = new_df
Note:
In [5]: class A:
   ...:     pass
   ...:
In [6]: a1 = A()
In [7]: a2 = A()
In [8]: a3 = A()
In [9]: A.class_member = 42
In [10]: a1.class_member
Out[10]: 42
In [11]: a2.class_member
Out[11]: 42
In [12]: a3.class_member
Out[12]: 42
Be careful, though: when you assign to an instance, Python takes you at your word:
In [14]: a2.class_member = 'foo' # this shadows the class variable with an instance variable in this instance...
In [15]: a1.class_member
Out[15]: 42
In [16]: a2.class_member # really an instance variable now!
Out[16]: 'foo'
And that is reflected by examining the namespace of the instances and the class object itself:
In [17]: a1.__dict__
Out[17]: {}
In [18]: a2.__dict__
Out[18]: {'class_member': 'foo'}
In [19]: A.__dict__
Out[19]:
mappingproxy({'__dict__': <attribute '__dict__' of 'A' objects>,
              '__doc__': None,
              '__module__': '__main__',
              '__weakref__': <attribute '__weakref__' of 'A' objects>,
              'class_member': 42})
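A quick check (a minimal sketch) that a class-level DataFrame really is shared rather than copied per instance:
import pandas as pd

class DNA:
    # one DataFrame object, stored on the class, shared by every instance
    df = pd.DataFrame({'B': range(3)})

a, b = DNA(), DNA()
print(a.df is b.df)    # True: both attribute lookups find the same object
print(a.df is DNA.df)  # True: no per-instance copy is made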

PySpark - Add a new nested column or change the value of existing nested columns

Suppose I have a JSON file with lines in the following structure:
{
    "a": 1,
    "b": {
        "bb1": 1,
        "bb2": 2
    }
}
I want to change the value of the key bb1 or add a new key, like bb3.
Currently, I use spark.read.json to load the JSON file into Spark as a DataFrame and df.rdd.map to map each row of the RDD to a dict. Then I change the nested key value or add a nested key and convert the dict back to a Row. Finally, I convert the RDD back to a DataFrame.
The workflow works as follows:
def map_func(row):
    dictionary = row.asDict(True)
    # add a new key or change a key value here
    return as_row(dictionary)  # as_row converts the dict back to a Row recursively

df = spark.read.json("json_file")
df.rdd.map(map_func).toDF().write.json("new_json_file")
This works for me, but I am concerned that converting DataFrame -> RDD (Row -> dict -> Row) -> DataFrame would kill the efficiency.
Is there any other method that meets this requirement without sacrificing efficiency?
The final solution I used is withColumn with a dynamically built schema for b.
First, we can get b_schema from the df schema:
b_schema = next(field['type'] for field in df.schema.jsonValue()['fields'] if field['name'] == 'b')
After that, b_schema is a dict and we can add a new field to it:
b_schema['fields'].append({"metadata": {}, "type": "string", "name": "bb3", "nullable": True})
We can then convert it back to a StructType:
new_b = StructType.fromJson(b_schema)
In map_func, we convert the Row to a dict and populate the new field:
def map_func(row):
    data = row.asDict(True)
    data['bb3'] = data['bb1'] + data['bb2']
    return data
map_udf = udf(map_func, new_b)
df.withColumn('b', map_udf('b')).collect()
Thanks @Mariusz
You can use map_func as a UDF and therefore omit converting DF -> RDD -> DF, while still having the flexibility of Python to implement business logic. All you need is to create a schema object:
>>> from pyspark.sql.types import *
>>> new_b = StructType([StructField('bb1', LongType()), StructField('bb2', LongType()), StructField('bb3', LongType())])
Then you define map_func and udf:
>>> from pyspark.sql.functions import *
>>> def map_func(data):
...     return {'bb1': 4, 'bb2': 5, 'bb3': 6}
...
>>> map_udf = udf(map_func, new_b)
Finally apply this UDF to dataframe:
>>> df = spark.read.json('sample.json')
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=4, bb2=5, bb3=6))
EDIT:
According to the comment, you can add a field to an existing StructType in an easier way, for example:
>>> df = spark.read.json('sample.json')
>>> new_b = df.schema['b'].dataType.add(StructField('bb3', LongType()))
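An alternative sketch that avoids the Python UDF entirely rebuilds the struct with the built-in struct function; for the sample document above this should give Row(a=1, b=Row(bb1=1, bb2=2, bb3=3)):
>>> from pyspark.sql.functions import col, struct
>>> df = spark.read.json('sample.json')
>>> # rebuild 'b' from its existing fields plus a computed bb3
>>> df.withColumn('b', struct(col('b.bb1'), col('b.bb2'),
...                           (col('b.bb1') + col('b.bb2')).alias('bb3'))).first()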
