PySpark: How to fillna values in dataframe for specific columns? - apache-spark

I have the following sample DataFrame:
a    | b    | c    |
1    | 2    | 4    |
0    | null | null |
null | 3    | 4    |
And I want to replace null values only in the first 2 columns - Column "a" and "b":
a    | b    | c    |
1    | 2    | 4    |
0    | 0    | null |
0    | 3    | 4    |
Here is the code to create the sample dataframe:
rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])
I know how to replace all null values using:
df2 = df2.fillna(0)
And when I try this, I lose the third column:
df2 = df2.select(df2.columns[0:1]).fillna(0)

df.fillna(0, subset=['a', 'b'])
There is a parameter named subset for choosing the columns, provided your Spark version is not lower than 1.3.1.
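For completeness, a minimal runnable sketch of this answer applied to the question's sample data (assuming a modern SparkSession named spark rather than the sqlContext used in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df2 = spark.createDataFrame(
    [(1, 2, 4), (0, None, None), (None, 3, 4)], ["a", "b", "c"]
)

# Fill nulls only in columns "a" and "b"; column "c" keeps its null.
df2.fillna(0, subset=["a", "b"]).show()
# +---+---+----+
# |  a|  b|   c|
# +---+---+----+
# |  1|  2|   4|
# |  0|  0|null|
# |  0|  3|   4|
# +---+---+----+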

Use a dictionary to fill values of certain columns:
df.fillna( { 'a':0, 'b':0 } )
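The dictionary form gives the same result for this example and also lets you choose a different fill value per column, e.g. on the hypothetical df2 from the sketch above:
# Equivalent to fillna(0, subset=["a", "b"]); per-column values may differ.
df2.fillna({"a": 0, "b": 0}).show()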

Related

Fill dataframe cells entry using dataframe column names and index

I am trying to fill a dataframe using the following approach:
I generate an m x n dataframe.
The column names (A to N) are read from a list passed to the method.
I define the index for the dataframe.
I fill the dataframe entries with column name + _ + index.
import numpy as np
import pandas as pd
from tabulate import tabulate
def generate_data(N_rows, N_cols, names_df=[]):
    if N_rows == 4:
        d16 = ['RU19-24', 'RU13-18', 'RU7-12', 'RU1-6']
        df = pd.DataFrame(np.zeros((N_rows, N_cols)), index=d16, columns=names_df)
    else:
        print("The Elevation for each domain is defined by 4, you defined elevation: ", N_rows)
        df = None
    # df.loc[[],'Z'] = 3
    return tabulate(df, headers='keys', tablefmt='psql')
a = generate_data(4,2, ['A', 'B'])
print(a)
Out:
+---------+-----+-----+
| | A | B |
|---------+-----+-----|
| RU19-24 | 0 | 0 |
| RU13-18 | 0 | 0 |
| RU7-12 | 0 | 0 |
| RU1-6 | 0 | 0 |
+---------+-----+-----+
Is it possible to take the index and concatenate it with the column names to get the following output?
+---------+-------------+-------------+
| | A | B |
|---------+-------------+-------------|
| RU19-24 | A_RU19-24 | B_RU19-24 |
| RU13-18 | A_RU13-18 | B_RU13-18 |
| RU7-12 | A_RU7-12 | B_RU7-12 |
| RU1-6 | A_RU1-6 | B_RU1-6 |
+---------+-------------+-------------+
IIUC, you can use apply, which takes each column of the dataframe as a pd.Series with an index (the dataframe index) and a name (the dataframe column header):
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df.apply(lambda x: x.name+'_'+x.index)
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6
Or use np.add.outer:
df = pd.DataFrame(index=['RU19-24','RU13-18','RU7-12','RU1-6'], columns = ['A','B'])
df_out = pd.DataFrame(np.add.outer(df.columns+'_',df.index).T, index=df.index, columns=df.columns)
df_out
Output:
A B
RU19-24 A_RU19-24 B_RU19-24
RU13-18 A_RU13-18 B_RU13-18
RU7-12 A_RU7-12 B_RU7-12
RU1-6 A_RU1-6 B_RU1-6

Using groupby in pandas to filter a dataframe using count and column value

I am trying to clean my dataframe using the groupby function. My columns are ID and event_type. I want a new dataframe in which, if an ID occurs in only one row, that row is kept only when its event_type is "a"; otherwise the row is deleted.
Data looks like this: The event_type can be "a" or "b"
+-----+------------+
| ID | event_type |
+-----+------------+
| xyz | a |
| pqr | b |
| xyz | b |
| rst | a |
+-----+------------+
Output:
Since the ID "pqr" occurs only once (the count) and does not have "a" (the column value) as its event_type, the dataframe should convert to the following:
+-----+------------+
| ID | event_type |
+-----+------------+
| xyz | a |
| xyz | b |
| rst | a |
+-----+------------+
You can use your logic within a groupby:
import pandas as pd
df = pd.DataFrame({"ID": ['xyz', 'pqr', 'xyz', 'rst'],
                   "event_type": ['a', 'b', 'b', 'a']})
What you are asking for is this:
df.groupby("ID")\
  .apply(lambda x: not (len(x) == 1 and
                        "a" not in x["event_type"].values))
as you can check by printing it. Finally, to use this as a filter you just run:
df = df.groupby("ID")\
       .filter(lambda x: not (len(x) == 1 and
                              "a" not in x["event_type"].values))\
       .reset_index(drop=True)
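As a hedged aside (not part of the original answer), the same filter can be written without groupby.apply, using a per-ID size together with a boolean mask; this is usually faster on large frames:
# Keep rows whose ID occurs more than once, or whose event_type is "a".
keep = (df.groupby("ID")["ID"].transform("size") > 1) | (df["event_type"] == "a")
df = df[keep].reset_index(drop=True)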

How can I overwrite in a Spark DataFrame null entries with other valid entries from the same dataframe?

I have a Spark DataFrame with data like this
| id | value1 | value2 |
------------------------
|  1 | null   | 1      |
|  1 | 2      | null   |
and I want to transform it into:
| id | value1 | value2 |
------------------------
|  1 | 2      | 1      |
That is, I need to get the rows with the same id and merge their values in a single row.
Could you explain what the most scalable way to do this is?
df.groupBy("id").agg(collect_set("value1").alias("value1"), collect_set("value2").alias("value2"))

// more elegant way of doing it for dynamic columns
df.groupBy("id").agg(df.columns.tail.map((_ -> "collect_set")).toMap).show

// 1.5
val df1 = df.rdd.map(i => (i(0).toString, i(1).toString)).groupByKey.mapValues(_.toSet.toList.filter(_ != "null")).toDF()
val df2 = df.rdd.map(i => (i(0).toString, i(2).toString)).groupByKey.mapValues(_.toSet.toList.filter(_ != "null")).toDF()
df1.join(df2, df1("_1") === df2("_1"), "inner").drop(df2("_1")).show
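For reference, a hedged PySpark sketch of the same merge using first with ignorenulls=True; it assumes each id has at most one non-null value per column, as in the example:
from pyspark.sql import functions as F

merged = df.groupBy("id").agg(
    F.first("value1", ignorenulls=True).alias("value1"),
    F.first("value2", ignorenulls=True).alias("value2"),
)
merged.show()
# +---+------+------+
# | id|value1|value2|
# +---+------+------+
# |  1|     2|     1|
# +---+------+------+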

How to clone column values in spark with their original order

I would like to clone the values of a column n times as they are in their original order.
For example, if I want to replicate the column below 2 times:
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
+---+
What I am looking for:
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
| 1 |
| 2 |
| 3 |
+---+
Using explode or flatMap I can only get:
+---+
| v |
+---+
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
+---+
Code:
%spark
val ds = spark.range(1, 4)
val cloneCount = 2
val clonedDs = ds.flatMap(r => Seq.fill(cloneCount)(r))
clonedDs.show()
I can probably do a self union of the dataset ds, but if the cloneCount is huge, e.g. cloneCount = 200000, is unioning in a loop that many times a preferred solution?
You can try this:
// If the column values are expected to be in an increasing/decreasing sequence,
// then we add that to the orderBy: clone_index and col_value,
// to get the values in the order they were in initially
val clonedDs = ds.flatMap(col_value => Range(0, cloneCount)
  .map(clone_index => (clone_index, col_value)))
clonedDs.orderBy("_1", "_2").map(_._2).show()

// If the column values are not expected to follow a sequence,
// then we add another rank column and use that in the orderBy along with clone_index
// to get the col_values in the desired order
val clonedDs = ds.withColumn("rank", monotonically_increasing_id())
  .flatMap(row => Range(0, cloneCount).map(
    clone_index => (clone_index, row.getLong(1), row.getLong(0))
  ))
clonedDs.orderBy("_1", "_2").map(_._3).show()
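A hedged PySpark sketch of the same approach, assuming a single-column DataFrame ds with column v as in the question's example (the Scala snippet uses spark.range, whose column is named id); the rank column plays the same role as in the Scala version:
from pyspark.sql import functions as F

clone_count = 2

cloned = (
    ds.withColumn("rank", F.monotonically_increasing_id())
      .withColumn("clone_index", F.explode(F.array([F.lit(i) for i in range(clone_count)])))
      .orderBy("clone_index", "rank")
      .select("v")
)
cloned.show()  # 1, 2, 3, 1, 2, 3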

Pandas DataFrame, Iterate through groups is very slow

I have a dataframe df with ~300,000 rows and plenty of columns:
 IDX  | COL_A | ... | COL_B | COL_C |
------+-------+-----+-------+-------+
'AAA' | 'A1'  | ... | 'B1'  |     0 |
'AAB' | 'A1'  | ... | 'B2'  |     2 |
'AAC' | 'A1'  | ... | 'B3'  |     1 |
'AAD' | 'A2'  | ... | 'B3'  |     0 |
I need to group by COL_A, and from each row of each group I need the value of IDX (e.g. 'AAA') and COL_B (e.g. 'B1'), in the order given by COL_C.
For A1 I thus need: [['AAA', 'B1'], ['AAC', 'B3'], ['AAB', 'B2']]
This is what I do.
grouped_by_A = self.df.groupby(COL_A)
for col_A, group in grouped_by_A:
    group = group.sort_values(by=[COL_C], ascending=True)
    ...
It works, but it is horribly slow (Core i7, 16 GB RAM). It already takes ~5 minutes even when I do nothing with the values. Do you know a faster way?
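No answer is included above, so here is a hedged sketch of one common speed-up: sort the whole frame once by [COL_A, COL_C] and then iterate the groups with sort=False, which removes the per-group sort_values call. The column names below are the literal names from the example table; adjust them to the real ones:
import pandas as pd

df = pd.DataFrame(
    {"COL_A": ["A1", "A1", "A1", "A2"],
     "COL_B": ["B1", "B2", "B3", "B3"],
     "COL_C": [0, 2, 1, 0]},
    index=["AAA", "AAB", "AAC", "AAD"],
)

# Sort the whole frame once; each group then arrives already ordered by COL_C.
df_sorted = df.sort_values(["COL_A", "COL_C"])
for col_a, group in df_sorted.groupby("COL_A", sort=False):
    pairs = list(zip(group.index, group["COL_B"]))
    # For 'A1' this yields [('AAA', 'B1'), ('AAC', 'B3'), ('AAB', 'B2')]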
