Assign a pandas dataframe to an object as a static class variable - memory use (Python) - python-3.x

I have a Python class called DNA. I want to create 100 instances of DNA. Each instance contains a pandas dataframe that is identical across all instances. To avoid duplicating the data, I want to store this dataframe as a static/class attribute.
import pandas as pd

some_df = pd.DataFrame()

class DNA(object):
    df = some_variable  # Do I declare here?

    def __init__(self, df=pd.DataFrame(), name='1'):
        self.name = name
        self.instance_df = df  # I want to avoid this per-instance attribute
        DNA.some_df = df  # Does this duplicate the data for every instance?
What is the correct way to do this?
Can I use the __init__ function to create the class variable? Or will that create a separate class variable for every instance of the class?
Do I need to declare the class variable between the `class ...` line and `def __init__(...)`?
Some other way?
I want to be able to change the dataframe that I use as a class variable, but once the class is loaded it needs to reference the same value (i.e. the same memory) in all instances.

I've answered your question in the comments:
import pandas as pd

some_df = pd.DataFrame()

class DNA(object):
    df = some_variable  # You assign here. I would use `some_df`

    def __init__(self, df=pd.DataFrame(), name='1'):
        self.name = name
        self.instance_df = df  # Yes, avoid this
        DNA.some_df = df  # This does not duplicate; assignment **never copies in Python**. However, I advise against this.
So, using
DNA.some_df = df
inside __init__ does work. Since default arguments are evaluated only once at function definition time, that df is always the same df, unless you explicitly pass a new df to __init__, but that smacks of bad design to me. Rather, you probably want something like:
class DNA(object):
    def __init__(self, df=pd.DataFrame(), name='1'):
        self.name = name
        # <some work to construct a dataframe>
        df = final_processing_function()
        DNA.df = df
Suppose you then want to change it; at any point you can use:
DNA.df = new_df
Note:
In [5]: class A:
   ...:     pass
   ...:
In [6]: a1 = A()
In [7]: a2 = A()
In [8]: a3 = A()
In [9]: A.class_member = 42
In [10]: a1.class_member
Out[10]: 42
In [11]: a2.class_member
Out[11]: 42
In [12]: a3.class_member
Out[12]: 42
Be careful, though, when you assign to an instance Python takes you at your word:
In [14]: a2.class_member = 'foo' # this shadows the class variable with an instance variable in this instance...
In [15]: a1.class_member
Out[15]: 42
In [16]: a2.class_member # really an instance variable now!
Out[16]: 'foo'
And that is reflected by examining the namespace of the instances and the class object itself:
In [17]: a1.__dict__
Out[17]: {}
In [18]: a2.__dict__
Out[18]: {'class_member': 'foo'}
In [19]: A.__dict__
Out[19]:
mappingproxy({'__dict__': <attribute '__dict__' of 'A' objects>,
              '__doc__': None,
              '__module__': '__main__',
              '__weakref__': <attribute '__weakref__' of 'A' objects>,
              'class_member': 42})
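Putting it together for the DNA case, here is a minimal sketch (build_reference_df is a hypothetical stand-in for however you construct the shared dataframe) showing that every instance refers to the same class-level dataframe object:
import pandas as pd

def build_reference_df():
    # hypothetical helper standing in for however the shared dataframe is built
    return pd.DataFrame({'base': list('ACGT')})

class DNA(object):
    df = build_reference_df()  # created once, stored on the class, shared by all instances

    def __init__(self, name='1'):
        self.name = name  # per-instance data only

d1, d2 = DNA('1'), DNA('2')
print(d1.df is d2.df is DNA.df)  # True: a single dataframe object in memory

DNA.df = build_reference_df()    # rebinding on the class is visible to every instance
print(d1.df is d2.df)            # True again, now referring to the new dataframe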

Related

Applying a function on pandas groupby object return variable type

When applying a user-defined function f that takes a pd.DataFrame as input and returns a pd.Series on a pd.DataFrame.groupby object, the type of the returned object seems to depend on the number of unique values present in the field used to perform the grouping operation.
I am trying to understand why the api is behaving this way, and for a neat way to have a pd.Series returned regardless of the number of unique values in the grouping field.
I went through the split-apply-combine section of pandas, and it seems like the single-valued dataframe is treated as a pd.Series which does not make sense to me.
import pandas as pd
from typing import Union

def f(df: pd.DataFrame) -> pd.Series:
    """
    User-defined function
    """
    return df['B'] / df['B'].max()

# Should only output a pd.Series
def perform_apply(df: pd.DataFrame) -> Union[pd.Series, pd.DataFrame]:
    return df.groupby('A').apply(f)

# Some dummy dataframe with multiple values in field 'A'
df1 = pd.DataFrame({'A': 'a a b'.split(),
                    'B': [1, 2, 3],
                    'C': [4, 6, 5]})

# Subset of the dataframe with a single value in field 'A'
df2 = df1[df1['A'] == 'a'].copy()

res1 = perform_apply(df1)
res2 = perform_apply(df2)
print(type(res1), type(res2))
# --------------------------------
# -> <class 'pandas.core.series.Series'> <class 'pandas.core.frame.DataFrame'>
pandas : 1.4.2
python : 3.9.0
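(Not part of the original post, just an illustrative workaround sketch: if all you need is a consistent return type, you could normalize the result after the apply, stacking the single-group DataFrame back into a Series with the same (group, index) shape as the multi-group result.)
def perform_apply_series(df: pd.DataFrame) -> pd.Series:
    res = df.groupby('A').apply(f)
    # In the single-group case apply may assemble the per-group Series into a
    # DataFrame (one row per group); stack() restores the MultiIndex Series shape.
    if isinstance(res, pd.DataFrame):
        res = res.stack()
    return res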

choosing a function without an if statement in python

Imagine having 200 functions, which represent 200 ways to solve a problem or calculate something like
def A():
...
def B():
...
.
.
.
and the method will be chosen as an input argument, meaning the user decides which method to use by passing its name as an argument when running the program, e.g. "A" for function/method A.
How do I choose that function without if-checking the name of every single function in Python?
You can use a dictionary to access the function you need directly, in O(1) time. For example:
def A(x):
    pass

def B(x):
    pass

func_map = {"A": A, "B": B}
Say that you store the user input in a variable chosen_func, then to select and run the right function, do the following:
func_map[chosen_func](x)
Example:
In [1]: def A(x):
   ...:     return x + x
In [2]: def B(x):
   ...:     return x * x
In [3]: func_map = {"A": A, "B": B}
In [4]: func_map["A"](10)
Out[4]: 20
In [5]: func_map["B"](10)
Out[5]: 100
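If the name really comes from user input, it may be worth handling unknown keys explicitly. A minimal sketch (the prompt text and the chosen_name variable are just illustrative):
def A(x):
    return x + x

def B(x):
    return x * x

func_map = {"A": A, "B": B}

chosen_name = input("Method name: ").strip()   # e.g. the user types "A"
try:
    result = func_map[chosen_name](10)
except KeyError:
    raise SystemExit(f"Unknown method {chosen_name!r}; choose one of {sorted(func_map)}")

print(result)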

python deepcopy not deepcopying user classes?

I will get straight to the example that made me ask such a question:
Python 3.6.6 (default, Jul 19 2018, 14:25:17)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from copy import deepcopy
In [2]: class Container:
   ...:     def __init__(self, x):
   ...:         self.x = x
   ...:
In [3]: anobj = Container("something")
In [4]: outobj = Container(anobj)
In [5]: copy = deepcopy(outobj)
In [6]: id(copy) == id(outobj)
Out[6]: False
In [7]: id(copy.x) == id(outobj.x)
Out[7]: False
In [8]: id(copy.x.x) == id(outobj.x.x)
Out[8]: True
As per the documentation of deepcopy, I was expecting the last line to have False as a response, i.e. that deepcopy would also clone the string.
Why it is not the case?
How can I obtain the desired behaviour? My original code has various levels of nesting custom classes with "final" attributes that are predefined types.
Thanks in advance
At least in CPython, the ID is the object's address in memory. Because Python strings are immutable, deepcopy does not create a new string object, so the ID stays the same. There's really no need to create a different string in memory to hold the exact same data.
The same happens for tuples that only hold immutable objects, for example:
>>> from copy import deepcopy
>>> a = (1, -1)
>>> b = deepcopy(a)
>>> id(a) == id(b)
True
If your tuple holds mutable objects, that won't happen:
>>> a = (1, [])
>>> b = deepcopy(a)
>>> id(a) == id(b)
False
So in the end the answer is: deepcopy is working just fine for your classes, you just found a gotcha about copying immutable objects.
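To convince yourself that the user class is being deep-copied correctly, give it a mutable attribute and check that mutating the copy leaves the original alone, e.g. with a sketch like this:
from copy import deepcopy

class Container:
    def __init__(self, x):
        self.x = x

orig = Container(["something"])   # a mutable attribute this time
clone = deepcopy(orig)

print(clone.x is orig.x)          # False: the list really was copied
clone.x.append("else")
print(orig.x)                     # ['something'], the original is untouched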

PySpark - Add a new nested column or change the value of existing nested columns

Supposing, I have a json file with lines in follow structure:
{
    "a": 1,
    "b": {
        "bb1": 1,
        "bb2": 2
    }
}
I want to change the value of key bb1 or add a new key, like: bb3.
Currently, I use spark.read.json to load the json file into Spark as a DataFrame and df.rdd.map to map each row of the RDD to a dict. Then I change the nested key value or add a nested key, and convert the dict back to a Row. Finally, I convert the RDD back to a DataFrame.
The workflow looks as follows:
def map_func(row):
    dictionary = row.asDict(True)
    # add a new key or change a key's value here
    return as_row(dictionary)  # as_row converts a dict back to a Row recursively

df = spark.read.json("json_file")
df.rdd.map(map_func).toDF().write.json("new_json_file")
This works for me. But I am concerned that converting DataFrame -> RDD (Row -> dict -> Row) -> DataFrame will hurt efficiency.
Is there any other method that meets this need without sacrificing efficiency?
The final solution that I used is withColumn with a dynamically built schema for b.
Firstly, we can get b_schema from the df schema by:
b_schema = next(field['type'] for field in df.schema.jsonValue()['fields'] if field['name'] == 'b')
After that, b_schema is a dict and we can add a new field to it by:
b_schema['fields'].append({"metadata":{},"type":"string","name":"bb3","nullable":True})
And then, we could convert it to StructType by:
new_b = StructType.fromJson(b_schema)
In the map_func, we could convert Row to dict and populate the new field:
def map_func(row):
    data = row.asDict(True)
    data['bb3'] = data['bb1'] + data['bb2']
    return data
map_udf = udf(map_func, new_b)
df.withColumn('b', map_udf('b')).collect()
Thanks @Mariusz
You can use map_func as a udf and therefore omit converting DF -> RDD -> DF, while still having the flexibility of Python to implement the business logic. All you need is to create the schema object:
>>> from pyspark.sql.types import *
>>> new_b = StructType([StructField('bb1', LongType()), StructField('bb2', LongType()), StructField('bb3', LongType())])
Then you define map_func and udf:
>>> from pyspark.sql.functions import *
>>> def map_func(data):
...     return {'bb1': 4, 'bb2': 5, 'bb3': 6}
...
>>> map_udf = udf(map_func, new_b)
Finally apply this UDF to dataframe:
>>> df = spark.read.json('sample.json')
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=4, bb2=5, bb3=6))
EDIT:
According to the comment: you can add a field to an existing StructType in an easier way, for example:
>>> df = spark.read.json('sample.json')
>>> new_b = df.schema['b'].dataType.add(StructField('bb3', LongType()))
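For completeness, and only as a sketch that is not from the original answers: when the new value can be expressed with built-in column functions, you can also rebuild the nested column with struct and skip the Python UDF entirely, which Spark can usually optimize better:
from pyspark.sql.functions import col, struct

df = spark.read.json('sample.json')
new_df = df.withColumn(
    'b',
    struct(
        col('b.bb1').alias('bb1'),
        col('b.bb2').alias('bb2'),
        (col('b.bb1') + col('b.bb2')).alias('bb3'),  # the new nested field
    ),
)
new_df.first()  # Row(a=1, b=Row(bb1=1, bb2=2, bb3=3))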

How do I create a default dictionary of dictionaries

I am trying to write some code that involves creating a default dictionary of dictionaries. However, I have no idea how to initialise/create such a thing. My current attempt looks something like this:
from collections import defaultdict
inner_dict = {}
dict_of_dicts = defaultdict(inner_dict(int))
I want to use this default dict of dictionaries as follows: for each pair of words that I produce from a file I open (e.g. [['M UH M', 'm oo m']]), set each space-delimited segment of the first word as a key in the outer dictionary, and then, for each space-delimited segment of the second word, count the frequency of that segment.
For example
[['M UH M', 'm oo m']]
(<class 'dict'>, {'M': {'m': 2}, 'UH': {'oo': 1}})
Having just run this, it doesn't seem to output any errors, but I was wondering whether something like this will actually produce a default dictionary of dictionaries.
Apologies if this is a duplicate, however previous answers to these questions have been confusing and in a different context.
To initialise a defaultdict that creates dictionaries as its default value you would use:
d = defaultdict(dict)
For this particular problem, a collections.Counter would be more suitable
>>> from collections import defaultdict, Counter
>>> d = defaultdict(Counter)
>>> for a, b in zip(*[x.split() for x in ['M UH M', 'm oo m']]):
...     d[a][b] += 1
>>> print(d)
defaultdict(collections.Counter,
            {'M': Counter({'m': 2}), 'UH': Counter({'oo': 1})})
Edit
You expressed interest in a comment about the equivalent without a Counter. Here is the equivalent using a plain dict
>>> from collections import defaultdict
>>> d = defaultdict(dict)
>>> for a, b in zip(*[x.split() for x in ['M UH M', 'm oo m']]):
...     d[a][b] = d[a].get(b, 0) + 1
>>> print(d)
defaultdict(dict, {'M': {'m': 2}, 'UH': {'oo': 1}})
You could also use a normal dictionary and its setdefault method.
my_dict.setdefault(key, default) will look up my_dict[key] and ...
... if the key already exists, return its current value without modifying it, or ...
... assign the default value (my_dict[key] = default) and then return that.
So, instead of the normal my_dict[key], you can call my_dict.setdefault(key, {}) whenever you want to get a value from your outer dictionary: it returns either the value already assigned to that key, if it's present, or a new empty dictionary as the default value, which is automatically stored in the outer dictionary as well.
Example:
outer_dict = {"M": {"m": 2}}
inner_dict = outer_dict.setdefault("UH", {})
# outer_dict = {"M": {"m": 2}, "UH": {}}
# inner_dict = {}
inner_dict["oo"] = 1
# outer_dict = {"M": {"m": 2}, "UH": {"oo": 1}}
# inner_dict = {"oo": 1}
inner_dict = outer_dict.setdefault("UH", {})
# outer_dict = {"M": {"m": 2}, "UH": {"oo": 1}}
# inner_dict = {"oo": 1}
inner_dict["xy"] = 3
# outer_dict = {"M": {"m": 2}, "UH": {"oo": 1, "xy": 3}}
# inner_dict = {"oo": 1, "xy": 3}
This way you always get a valid inner_dict, either an empty default one or the one that's already present for the given key. As dictionaries are mutable data types, modifying the returned inner_dict will also modify the dictionary inside outer_dict.
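Applied to the counting problem from the question, that pattern would look roughly like this (using the example pair from the question):
outer_dict = {}
for a, b in zip(*[x.split() for x in ['M UH M', 'm oo m']]):
    inner_dict = outer_dict.setdefault(a, {})   # reuse or create the inner dict
    inner_dict[b] = inner_dict.get(b, 0) + 1    # count this segment pairing

print(outer_dict)  # {'M': {'m': 2}, 'UH': {'oo': 1}}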
The other answers propose alternative solutions or show you can make a default dictionary of dictionaries using d = defaultdict(dict),
but the question asked how to make a default dictionary of default dictionaries. My naive first attempt was this:
from collections import defaultdict
my_dict = defaultdict(defaultdict(list))
however this throws an error: *** TypeError: first argument must be callable or None
so my second attempt, which works, is to make a callable using the lambda keyword to create an anonymous function:
from collections import defaultdict
my_dict = defaultdict(lambda: defaultdict(list))
which is more concise than the alternative method using a regular function:
from collections import defaultdict
def default_dict_maker():
    return defaultdict(list)
my_dict = defaultdict(default_dict_maker)
you can check it works by assigning:
my_dict[2][3] = 5
my_dict[2][3]
>>> 5
or by trying to return a value:
my_dict[0][0]
>>> []
my_dict[5]
>>> defaultdict(<class 'list'>, {})
tl;dr
This is your one-line answer: my_dict = defaultdict(lambda: defaultdict(list))
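Tying this back to the original question, a nested-defaultdict version of the segment counter could look like this (using defaultdict(int) for the inner counts instead of list, since the goal here is counting):
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(*[x.split() for x in ['M UH M', 'm oo m']]):
    counts[a][b] += 1

print({k: dict(v) for k, v in counts.items()})  # {'M': {'m': 2}, 'UH': {'oo': 1}}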
