PySpark: How to fillna values in dataframe for specific columns? - apache-spark

I have the following sample DataFrame:
a    | b    | c    |
1    | 2    | 4    |
0    | null | null |
null | 3    | 4    |
And I want to replace null values only in the first 2 columns - Column "a" and "b":
a    | b    | c    |
1    | 2    | 4    |
0    | 0    | null |
0    | 3    | 4    |
Here is the code to create the sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)

df.fillna(0, subset=['a', 'b'])
There is a parameter named subset for choosing the columns, provided your Spark version is not lower than 1.3.1.
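For completeness, a minimal runnable sketch of this answer applied to the question's sample data (assuming a modern SparkSession named spark rather than the sqlContext used in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df2 = spark.createDataFrame(
    [(1, 2, 4), (0, None, None), (None, 3, 4)], ["a", "b", "c"]
)

# Fill nulls only in columns "a" and "b"; column "c" keeps its null.
df2.fillna(0, subset=["a", "b"]).show()
# +---+---+----+
# |  a|  b|   c|
# +---+---+----+
# |  1|  2|   4|
# |  0|  0|null|
# |  0|  3|   4|
# +---+---+----+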

Use a dictionary to fill values of certain columns:
df.fillna( { 'a':0, 'b':0 } )
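The dictionary form gives the same result for this example and also lets you choose a different fill value per column, e.g. on the hypothetical df2 from the sketch above:
# Equivalent to fillna(0, subset=["a", "b"]); per-column values may differ.
df2.fillna({"a": 0, "b": 0}).show()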

Related

Fill dataframe cells entry using dataframe column names and index

I am trying to fill a dataframe using the following approach:
I generate an m x n dataframe.
The column names (A to N) are read from a list passed to the method.
I define the index for the dataframe.
I fill the dataframe entries with column name + _ + index.
import numpy as np
import pandas as pd
from tabulate import tabulate
def generate_data(N_rows, N_cols, names_df=[]):
    if N_rows == 4:
        d16 = ['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6']
        df = pd.DataFrame(np.zeros((N_rows, N_cols)), index=d16, columns=names_df)
    else:
        print("The Elevation for each domain is defined by 4, you defined elevation: ", N_rows)
        df = None
    # df.loc[[],'Z'] = 3
    return tabulate(df, headers='keys', tablefmt='psql')
a = generate_data(4,2, ['A', 'B'])
print(a)
Out:
+---------+-----+-----+
| | A | B |
|---------+-----+-----|
| RU19-24 | 0 | 0 |
| RU13-18 | 0 | 0 |
| RU7-12 | 0 | 0 |
| RU1-6 | 0 | 0 |
+---------+-----+-----+
Is it possible to take the index and concatenate it with the column names to get the following output?
+---------+-------------+-------------+
| | A | B |
|---------+-------------+-------------|
| RU19-24 | A_RU19-24 | B_RU19-24 |
| RU13-18 | A_RU13-18 | B_RU13-18 |
| RU7-12 | A_RU7-12 | B_RU7-12 |
| RU1-6 | A_RU1-6 | B_RU1-6 |
+---------+-------------+-------------+
IIUC, you can use apply, which takes each column of the dataframe as a pd.Series with an index (the dataframe index) and a name (the dataframe column header):
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df.apply(lambda x: x.name+'_'+x.index)
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
Or use np.add.outer:
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df_out = pd.DataFrame(np.add.outer(df.columns+'_',df.index).T, index=df.index, columns=df.columns)
df_out
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6

Using groupby in pandas to filter a dataframe using count and column value

I am trying to clean my dataframe using the groupby function. My columns are ID and event_type. I want a new dataframe in which, if an ID occurs in only one row, that row is kept only when its event_type is "a"; otherwise the row is deleted.
Data looks like this: The event_type can be "a" or "b"
+-----+------------+
| ID | event_type |
+-----+------------+
| xyz | a |
| pqr | b |
| xyz | b |
| rst | a |
+-----+------------+
Output:
Since the ID "pqr" occurs only once (the count) and does not have "a" (the column value) as its event_type, the dataframe should convert to the following:
+-----+------------+
| ID | event_type |
+-----+------------+
| xyz | a |
| xyz | b |
| rst | a |
+-----+------------+
You can use your logic within a groupby:
import pandas as pd
df = pd.DataFrame({"ID": ['xyz', 'pqr', 'xyz', 'rst'],
                   "event_type": ['a', 'b', 'b', 'a']})
What you are asking for is this:
df.groupby("ID")\
  .apply(lambda x: not (len(x) == 1 and
                        "a" not in x["event_type"].values))
as you can check by printing it. Finally, to use this as a filter you just run:
df = df.groupby("ID")\
       .filter(lambda x: not (len(x) == 1 and
                              "a" not in x["event_type"].values))\
       .reset_index(drop=True)
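As a hedged aside (not part of the original answer), the same filter can be written without groupby.apply, using a per-ID size together with a boolean mask; this is usually faster on large frames:
# Keep rows whose ID occurs more than once, or whose event_type is "a".
keep = (df.groupby("ID")["ID"].transform("size") > 1) | (df["event_type"] == "a")
df = df[keep].reset_index(drop=True)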

How can I overwrite in a Spark DataFrame null entries with other valid entries from the same dataframe?

I have a Spark DataFrame with data like this
| id | value1 | value2 |
------------------------
|  1 | null   | 1      |
|  1 | 2      | null   |
and I want to transform it into:
| id | value1 | value2 |
------------------------
|  1 | 2      | 1      |
That is, I need to get the rows with the same id and merge their values in a single row.
Could you explain what the most scalable way to do this is?
df.groupBy("id").agg(collect_set("value1").alias("value1"), collect_set("value2").alias("value2"))

// more elegant way of doing it for dynamic columns
df.groupBy("id").agg(df.columns.tail.map((_ -> "collect_set")).toMap).show

// 1.5
val df1 = df.rdd.map(i => (i(0).toString, i(1).toString)).groupByKey.mapValues(_.toSet.toList.filter(_ != "null")).toDF()
val df2 = df.rdd.map(i => (i(0).toString, i(2).toString)).groupByKey.mapValues(_.toSet.toList.filter(_ != "null")).toDF()
df1.join(df2, df1("_1") === df2("_1"), "inner").drop(df2("_1")).show
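For reference, a hedged PySpark sketch of the same merge using first with ignorenulls=True; it assumes each id has at most one non-null value per column, as in the example:
from pyspark.sql import functions as F

merged = df.groupBy("id").agg(
    F.first("value1", ignorenulls=True).alias("value1"),
    F.first("value2", ignorenulls=True).alias("value2"),
)
merged.show()
# +---+------+------+
# | id|value1|value2|
# +---+------+------+
# |  1|     2|     1|
# +---+------+------+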

How to clone column values in spark with their original order

I would like to clone the values of a column n times as they are in their original order.
For example, if I want to replicate the column below 2 times:
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
+---+
What I am looking for:
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
| 1 |
| 2 |
| 3 |
+---+
Using explode or flatMap I can only get:
+---+
| v |
+---+
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
+---+
Code:
%spark
val ds = spark.range(1, 4)
val cloneCount = 2
val clonedDs = ds.flatMap(r => Seq.fill(cloneCount)(r))
clonedDs.show()
I can probably do a self union of the dataset ds, but if the cloneCount is huge, e.g. cloneCount = 200000, is unioning in a loop that many times a preferred solution?
You can try this:
// If the column values are expected to be in an increasing/decreasing sequence,
// then we add that to the orderBy: clone_index and col_value,
// to get the values in the order they were in initially
val clonedDs = ds.flatMap(col_value => Range(0, cloneCount)
  .map(clone_index => (clone_index, col_value)))
clonedDs.orderBy("_1", "_2").map(_._2).show()

// If the column values are not expected to follow a sequence,
// then we add another rank column and use that in the orderBy along with clone_index
// to get the col_values in the desired order
val clonedDs = ds.withColumn("rank", monotonically_increasing_id())
  .flatMap(row => Range(0, cloneCount).map(
    clone_index => (clone_index, row.getLong(1), row.getLong(0))
  ))
clonedDs.orderBy("_1", "_2").map(_._3).show()
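A hedged PySpark sketch of the same approach, assuming a single-column DataFrame ds with column v as in the question's example (the Scala snippet uses spark.range, whose column is named id); the rank column plays the same role as in the Scala version:
from pyspark.sql import functions as F

clone_count = 2

cloned = (
    ds.withColumn("rank", F.monotonically_increasing_id())
      .withColumn("clone_index", F.explode(F.array([F.lit(i) for i in range(clone_count)])))
      .orderBy("clone_index", "rank")
      .select("v")
)
cloned.show()  # 1, 2, 3, 1, 2, 3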

Pandas DataFrame, Iterate through groups is very slow

I have a dataframe df with ~300,000 rows and plenty of columns:
 IDX  | COL_A | ... | COL_B | COL_C |
------+-------+-----+-------+-------+
'AAA' | 'A1'  | ... | 'B1'  |     0 |
'AAB' | 'A1'  | ... | 'B2'  |     2 |
'AAC' | 'A1'  | ... | 'B3'  |     1 |
'AAD' | 'A2'  | ... | 'B3'  |     0 |
I need to group by COL_A, and from each row of each group I need the value of IDX (e.g. 'AAA') and COL_B (e.g. 'B1'), in the order given by COL_C.
For A1 I thus need: [['AAA', 'B1'], ['AAC', 'B3'], ['AAB', 'B2']]
This is what I do.
grouped_by_A = self.df.groupby(COL_A)
for col_A, group in grouped_by_A:
    group = group.sort_values(by=[COL_C], ascending=True)
    ...
It works, but it is horribly slow (Core i7, 16 GB RAM). It already takes ~5 minutes even when I do nothing with the values. Do you know a faster way?
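No answer is included above, so here is a hedged sketch of one common speed-up: sort the whole frame once by [COL_A, COL_C] and then iterate the groups with sort=False, which removes the per-group sort_values call. The column names below are the literal names from the example table; adjust them to the real ones:
import pandas as pd

df = pd.DataFrame(
    {"COL_A": ["A1", "A1", "A1", "A2"],
     "COL_B": ["B1", "B2", "B3", "B3"],
     "COL_C": [0, 2, 1, 0]},
    index=["AAA", "AAB", "AAC", "AAD"],
)

# Sort the whole frame once; each group then arrives already ordered by COL_C.
df_sorted = df.sort_values(["COL_A", "COL_C"])
for col_a, group in df_sorted.groupby("COL_A", sort=False):
    pairs = list(zip(group.index, group["COL_B"]))
    # For 'A1' this yields [('AAA', 'B1'), ('AAC', 'B3'), ('AAB', 'B2')]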
