Dask DataFrame groupby and aggregate for column

I had a pd.DataFrame that I converted to Dask.DataFrame for faster computations.
My requirement is that I have to find out the 'Total Views' of a channel.
In pandas it would be df.groupby(['ChannelTitle'])['VideoViewCount'].sum(), but in dask the column dtypes are object, so groupby treats the values as strings rather than ints (see image 2).
To handle this, I added two columns separating the figure (115) and the multiplier (6 for M, 3 for K) of the views, hoping to do an operation like ddf['new_views_f'] * (10**ddf['new_views_m']), but now I cannot find a way to multiply two columns in dask.
Either I am missing something or I am overcomplicating the requirement.

It does sound like you are complicating the requirement. For column multiplication, the regular pandas syntax will work (df['c'] = df['a'] * df['b']). In your case, it's possible to use pd.eval to get the actual numeric value for views:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import random
# Toy data: views stored as strings like '7.31K views' or '3.05M views'
df = pd.DataFrame(15*np.random.rand(15), columns=['views'])
df['views'] = df['views'].round(2).astype('str') + [random.choice(['K views', 'M views']) for _ in range(len(df))]
df['group'] = [random.choice([1,2,3]) for _ in range(len(df))]
ddf = dd.from_pandas(df, npartitions=2)
# Turn 'K views'/'M views' into '*1e3'/'*1e6' and let pd.eval evaluate each string to a number
ddf['views_digits'] = ddf['views'].replace({'K views': '*1e3', 'M views': '*1e6'}, regex=True).map(pd.eval, meta=ddf['group'])
# The groupby aggregation now works on the numeric column
aggregate_df = ddf.groupby(['group']).agg({'views_digits': 'sum'}).compute()
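As an aside, the two-column route described in the question should also work, because multiplying dask columns uses the same syntax as pandas. A minimal sketch, assuming the question's helper columns 'new_views_f' and 'new_views_m' already hold numeric values:
# Sketch only: 'new_views_f', 'new_views_m' and 'ChannelTitle' are the column
# names described in the question and are assumed to be numeric and present.
ddf['total_views'] = ddf['new_views_f'] * (10 ** ddf['new_views_m'])
total_views_per_channel = ddf.groupby('ChannelTitle')['total_views'].sum().compute()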

Related

Plotting Pandas DF with Numpy Arrays

I have a Pandas df with multiple columns, and each cell contains a NumPy array with a varying number of elements. I would like to plot all the elements of the array for every cell within a column.
I have tried
plt.plot(df['column'])
plt.plot(df['column'][0:])
Both give a ValueError: setting an array element with a sequence.
It is very important that these values get plotted against their corresponding index, as the index represents linear time in this dataframe. I would really appreciate it if someone showed me how to do this properly. Perhaps there is a package other than matplotlib.pyplot that is better suited for this?
Thank you
plt.plot needs a list of x-coordinates together with an equally long list of y-coordinates. As you seem to want to use the index of the dataframe for the x-coordinate and each cell's contents for the y-coordinates, you need to repeat each x-value as many times as there are y-coordinates.
Note that this format doesn't suit a line plot, as connecting subsequent points would create some strange vertical lines. plt.plot accepts a marker as its third parameter, for example '.' to draw a simple dot at each position.
A code example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
N = 30
df = pd.DataFrame({f'column{c}':
                   [np.random.normal(np.random.uniform(10, 100), 1, np.random.randint(3, 11)) for _ in range(N)]
                   for c in range(1, 6)})
legend_handles = []
colors = plt.cm.Set1.colors
desired_columns = df.columns
for column, color in zip(desired_columns, colors):
    for ind, cell in df[column].items():
        if len(cell) > 0:
            plotted, = plt.plot([ind] * len(cell), cell, '.', color=color)
    legend_handles.append(plotted)
plt.legend(legend_handles, desired_columns)
plt.show()
Note that pandas really isn't meant to store complete arrays inside cells. The preferred way is to create a dataframe in "long" form, with each value in a separate row (and the "index" repeated). Most functions of pandas and seaborn don't know how to handle arrays inside cells.
Here's a way to create the long form, which can then be plotted with Seaborn:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
N = 30
df = pd.DataFrame({f'column{c}':
                   [np.random.normal(np.random.uniform(10, 100), 1, np.random.randint(3, 11)) for _ in range(N)]
                   for c in range(1, 6)})
desired_columns = df.columns
df_long_data = []
for column in desired_columns:
    for ind, cell in df[column].items():
        for val in cell:
            row = {'timestamp': ind, 'column_name': column, 'value': val}
            df_long_data.append(row)
df_long = pd.DataFrame(df_long_data)
sns.scatterplot(x='timestamp', y='value', hue='column_name', data=df_long)
plt.show()
In your case you have NumPy arrays in each cell that you want to plot. You need to pass each cell to plt.plot() individually: passing the column as a whole, as you did, hands plt.plot() a sequence of arrays, which it cannot handle, whereas a single NumPy array is accepted.
This might help:
for column in df.columns:
    for cell in df[column]:
        plt.plot(cell)
plt.show()

Join two pyspark dataframes to select all the columns from the first df and some columns from the second df

I tried importing two functions as shown below but I get an error
from pyspark.sql.functions import regexp_replace, col
df1 = sales.alias('a').join(customer.alias('b'),col('b.ID') == col('a.ID'))\
.select([col('a.'+xx) for xx in sales.columns] + col('b.others')
TypeError: 'str' object is not callable
I really don't understand what's wrong with that line of code. Thanks.
PySpark's select function also accepts plain string column names, so there is no need to build a list of column objects here. You can simply do this instead:
from pyspark.sql.functions import regexp_replace, col
df1 = sales.alias('a').join(customer.alias('b'),col('b.ID') == col('a.ID'))\
.select(sales.columns + ['others'])
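If you would rather keep the alias-qualified col() expressions, select() also accepts Column objects. A sketch of that form (note that col('b.others') is wrapped in a list so the two lists can be concatenated):
from pyspark.sql.functions import col
# All columns from the 'a' (sales) side plus a selected column from 'b'
df1 = (sales.alias('a')
       .join(customer.alias('b'), col('b.ID') == col('a.ID'))
       .select([col('a.' + c) for c in sales.columns] + [col('b.others')]))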

Computing dask delayed objects stored in dataframe

I am looking for the best way to compute many dask delayed objects stored in a dataframe. I am unsure if the pandas dataframe should be converted to a dask dataframe with delayed objects within, or if the compute call should be called on all values of the pandas dataframe.
I would appreciate any suggestions in general, as I am having trouble with the logic of passing delayed object across nested for loops.
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute
steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]
enr_df = pd.DataFrame()
for N in sample:
    enr = []
    for i in range(20):
        k = np.random.randint(1, 200)
        enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
    enr_df[N] = enr
I cannot call compute on this dataframe without applying the function across all cells like so: enr_df.applymap(compute) (which I believe calls compute on each value individually).
However, if I convert to a dask dataframe, the delayed objects I want to compute become nested inside the dask dataframe structure:
enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()
The computation I expect does not take place.
You can pass a list of delayed objects into dask.compute
results = dask.compute(*list_of_delayed_objects)
So you need to get a flat list of the delayed objects out of your pandas dataframe first. This is something you can do with normal Python code.
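A minimal sketch of that, assuming enr_df is built exactly as above: flatten the cells into one list, compute them in a single call, and reshape the results back into the original layout.
import dask
# Flatten the object-dtype dataframe of Delayed objects into a plain Python list
delayed_list = enr_df.values.ravel().tolist()
# A single call computes all delayed objects in parallel
results = dask.compute(*delayed_list)
# Put the numeric results back into the original shape
enr_computed = pd.DataFrame(np.array(results).reshape(enr_df.shape),
                            index=enr_df.index, columns=enr_df.columns)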

How do I map df column values to hex color in one go?

I have a pandas dataframe with two columns. One column's values need to be mapped to colors in hex. Another graphing process takes over from there.
This is what I have tried so far. Part of the toy code is taken from here.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mapper.to_rgba(x))
df
This outputs RGBA tuples in 'some_value_color' rather than hex strings.
How do I convert the 'some_value' column values to hex in one go?
Ideally using the sns.cubehelix_palette(light=1)
I am not opposed to using something other than matplotlib
Thanks in advance.
You may use matplotlib.colors.to_hex() to convert a color to hexadecimal representation.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
# Create dataframe
df = pd.DataFrame(np.random.randint(0,21,size=(7, 2)), columns=['some_value', 'another_value'])
# Add a nan to handle realworld
df.iloc[-1] = np.nan
# Try to map values to colors in hex
# # Taken from here
norm = matplotlib.colors.Normalize(vmin=0, vmax=21, clip=True)
mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
df
Efficiency
The above method is easy to use, but may not be very efficient. In the following, let's compare some alternatives.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
def create_df(n=10):
    # Create dataframe
    df = pd.DataFrame(np.random.randint(0,21,size=(n, 2)),
                      columns=['some_value', 'another_value'])
    # Add a nan to handle realworld
    df.iloc[-1] = np.nan
    return df
The following is the solution from above. It applies the conversion to the dataframe row by row. This is quite inefficient.
def apply1(df):
    # map values to colors in hex via
    # matplotlib to_hex by pandas apply
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    df['some_value_color'] = df['some_value'].apply(lambda x: mcolors.to_hex(mapper.to_rgba(x)))
    return df
That's why we might choose to calculate the values into a numpy array first and just assign this array as the newly created column.
def apply2(df):
    # map values to colors in hex via
    # matplotlib to_hex by assigning numpy array as column
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    mapper = plt.cm.ScalarMappable(norm=norm, cmap=plt.cm.viridis)
    a = mapper.to_rgba(df['some_value'])
    df['some_value_color'] = np.apply_along_axis(mcolors.to_hex, 1, a)
    return df
Finally we may use a look-up table (LUT) which is created from the matplotlib colormap, and index the LUT with the normalized data. Because this solution needs to create the LUT first, it is rather inefficient for dataframes with fewer entries than the LUT has colors, but it pays off for large dataframes.
def apply3(df):
    # map values to colors in hex via
    # a hex look-up table indexed by the normalized data
    norm = mcolors.Normalize(vmin=np.nanmin(df['some_value'].values),
                             vmax=np.nanmax(df['some_value'].values), clip=True)
    lut = plt.cm.viridis(np.linspace(0,1,256))
    lut = np.apply_along_axis(mcolors.to_hex, 1, lut)
    a = (norm(df['some_value'].values)*255).astype(np.int16)
    df['some_value_color'] = lut[a]
    return df
Compare the timings
Let's take a dataframe with 10000 rows.
df = create_df(10000)
Original solution (apply1)
%timeit apply1(df)
2.66 s per loop
Array solution (apply2)
%timeit apply2(df)
240 ms per loop
LUT solution (apply3)
%timeit apply3(df)
7.64 ms per loop
In this case the LUT solution gives an improvement of roughly a factor of 350.

Pyspark: applying kmeans on different groups of a dataframe

Using Pyspark I would like to apply kmeans separately to groups of a dataframe and not to the whole dataframe at once. For the moment I use a for loop which iterates over each group, applies kmeans and appends the result to another table. But having a lot of groups makes this time consuming. Could anyone help me please?
Thanks a lot!
for customer in customer_list:
    temp_df = togroup.filter(col("customer_id") == customer)
    df = assembler.transform(temp_df)
    k = 1
    while (k < 5 and mtrc < width):
        k += 1
        kmeans = KMeans(k=k, seed=5, maxIter=20, initSteps=5)
        model = kmeans.fit(df)
        mtrc = 1 - model.computeCost(df)/ttvar
    a = model.transform(df).select(cols)
    allcustomers = allcustomers.union(a)
I came up with a solution using pandas_udf. A pure Spark or Scala solution would be preferable but has yet to be offered.
Assume my data is
import pandas as pd
df_pd = pd.DataFrame([['cat1',10.],['cat1',20.],['cat1',11.],['cat1',21.],['cat1',22.],['cat1',9.],['cat2',101.],['cat2',201.],['cat2',111.],['cat2',214.],['cat2',224.],['cat2',99.]],columns=['cat','val'])
df_spark = spark.createDataFrame(df_pd)
First solve the problem in pandas:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)

def skmean(kmeans, x):
    X = np.array(x)
    kmeans.fit(X)
    return kmeans.predict(X)
You can apply skmean() to a pandas dataframe (to make sure it works properly):
df_pd.groupby('cat').apply(lambda x:skmean(kmeans,x)).reset_index()
To apply the function to pyspark data frame, we use pandas_udf. But first define a schema for the output data frame:
from pyspark.sql.types import *
schema = StructType(
    [StructField('cat', StringType(), True),
     StructField('clusters', ArrayType(IntegerType()))])
Convert the function above to a pandas_udf:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def skmean_udf(df):
    result = pd.DataFrame(
        df.groupby('cat').apply(lambda x: skmean(kmeans, x)))
    result.reset_index(inplace=True, drop=False)
    return result
You can use the function as follows:
df_spark.groupby('cat').apply(skmean_udf).show()
I came up with a second solution which I think is slightly better than the first one. The idea is to use groupby() together with collect_list() and write a udf that takes a list as input and generates the clusters. Continuing with df_spark from the other solution we write:
df_flat = df_spark.groupby('cat').agg(F.collect_list('val').alias('val_list'))
Now we write the udf function:
import numpy as np
import pyspark.sql.functions as F
from sklearn.cluster import KMeans
from pyspark.sql.types import *

def skmean(x):
    kmeans = KMeans(n_clusters=2, random_state=0)
    X = np.array(x).reshape(-1, 1)
    kmeans.fit(X)
    clusters = kmeans.predict(X).tolist()
    return clusters

clustering_udf = F.udf(lambda arr: skmean(arr), ArrayType(IntegerType()))
Then apply the udf to the flattened dataframe:
df = df_flat.withColumn('clusters', clustering_udf(F.col('val_list')))
Then you can use F.explode() to convert the list to a column.
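A hedged sketch of that last step, assuming Spark 3.0+ (so the struct fields produced by F.arrays_zip keep the original column names): zip the values and cluster labels together, explode the zipped array, and pull the fields back out as columns.
# Pair each value with its cluster label, then expand back to one row per value
df_exploded = (df
               .withColumn('zipped', F.arrays_zip('val_list', 'clusters'))
               .withColumn('zipped', F.explode('zipped'))
               .select('cat',
                       F.col('zipped.val_list').alias('val'),
                       F.col('zipped.clusters').alias('cluster')))
df_exploded.show()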
