Spark fillNa not replacing the null value - apache-spark

I have the following dataset and it contains some null values; I need to replace the nulls using fillna in Spark.
DataFrame:
df = spark.read.format("com.databricks.spark.csv").option("header","true").load("/sample.csv")
>>> df.printSchema();
root
|-- Age: string (nullable = true)
|-- Height: string (nullable = true)
|-- Name: string (nullable = true)
>>> df.show()
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10| 80|Alice|
| 5| null| Bob|
| 50| null| Tom|
| 50| null| null|
+---+------+-----+
>>> df.na.fill(10).show()
When I fill the na values, nothing is changed; the same DataFrame appears again.
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10| 80|Alice|
| 5| null| Bob|
| 50| null| Tom|
| 50| null| null|
+---+------+-----+
I also tried creating a new DataFrame to store the filled values, but the result is still unchanged.
>>> df2 = df.na.fill(10)
How can I replace the null values? Please show me the possible ways using fillna.
Thanks in advance.

It seems that your Height column is not numeric. When you call df.na.fill(10), Spark replaces nulls only in columns whose type matches the type of the value 10, i.e. numeric columns.
If the Height column needs to stay a string, you can try df.na.fill('10').show(); otherwise casting it to IntegerType() is necessary.

You can also provide a specific default value for each column if you prefer.
df.na.fill({'Height': '10', 'Name': 'Bob'})
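For example, with the sample DataFrame above (where Height is still a string column), the per-column fill should behave roughly like this; the output below is what I would expect, not something copied from your data:
df.na.fill({'Height': '10', 'Name': 'Bob'}).show()
#+---+------+-----+
#|Age|Height| Name|
#+---+------+-----+
#| 10|    80|Alice|
#|  5|    10|  Bob|
#| 50|    10|  Tom|
#| 50|    10|  Bob|
#+---+------+-----+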

To add to Mariusz's answer, here is the exact code to cast the column and fill the NA values:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col
df = df.withColumn("Height", col("Height").cast(IntegerType()))
df2 = df.na.fill(value=10, subset=["Height"])
Maybe the simpler solution would have been to give a string value, if you don't care about the column type:
df2 = df.na.fill(value="10", subset=["Height"])

Related

Adding values of two columns whose datatypes are string in pyspark

The log files are in JSON format; I extracted the data into a PySpark dataframe.
There are two columns whose values are ints, but the datatype of the columns is string.
cola|colb
45|10
10|20
Expected Output
newcol
55
30
but I am getting output like
4510
1020
The code I have used is:
import pyspark.sql.functions as F
df.select(F.concat("cola", "colb").alias("newcol")).show()
Kindly help me get the correct output.
>>> from pyspark.sql.functions import col
>>> df.show()
+----+----+
|cola|colb|
+----+----+
| 45| 10|
| 10| 20|
+----+----+
>>> df.printSchema()
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
>>> df.withColumn("newcol", col("cola") + col("colb")).show()
+----+----+------+
|cola|colb|newcol|
+----+----+------+
| 45| 10| 55.0|
| 10| 20| 30.0|
+----+----+------+
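If you would rather get an integer result than the double shown above, one option (just a sketch, using explicit casts instead of the implicit one) is:
>>> df.withColumn("newcol", col("cola").cast("int") + col("colb").cast("int")).show()
#+----+----+------+
#|cola|colb|newcol|
#+----+----+------+
#|  45|  10|    55|
#|  10|  20|    30|
#+----+----+------+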

Parquet bytes dataframe to UTF-8 in Spark

I am trying to read a dataframe from a parquet file with Spark in Python, but my dataframe is byte encoded, so when I use spark.read.parquet and then df.show() it looks like the following:
+---+----------+----+
| C1| C2| C3|
+---+----------+----+
| 1|[20 2D 2D]| 0|
| 2|[32 30 31]| 0|
| 3|[43 6F 6D]| 0|
+---+----------+----+
As you can see, the values are shown as hexadecimal... I've read the entire documentation of Spark DataFrames but I did not find anything. Is it possible to convert them to UTF-8?
The df.printSchema() output:
|-- C1: long (nullable = true)
|-- C2: binary (nullable = true)
|-- C3: long (nullable = true)
The Spark version is 2.4.4
Thank you!
You have a binary type column, which is like a bytearray in python. You just need to cast to string:
df = df.withColumn("C2", df["C2"].cast("string"))
df.show()
#+---+---+---+
#| C1| C2| C3|
#+---+---+---+
#| 1| --| 0|
#| 2|201| 0|
#| 3|Com| 0|
#+---+---+---+
Likewise in Python:
bytearray([0x20, 0x2D, 0x2D]).decode("utf-8")
#' --'
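Alternatively, if you prefer to name the character set explicitly, pyspark.sql.functions.decode does the same conversion (a sketch, equivalent to the cast above for UTF-8 data):
from pyspark.sql.functions import decode
df = df.withColumn("C2", decode("C2", "UTF-8"))
df.show()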

spark drop multiple duplicated columns after join

I am getting many duplicated columns after joining two dataframes;
now I want to drop the columns which come last. Below is my printSchema:
root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)
now I want to drop the last two columns
|-- test: string (nullable = true)
|-- value: string (nullable = true)
I tried df.dropDuplicates(), but it did not help.
How do I drop the duplicated columns which come last?
You have to use vararg syntax to pass the column names from an array and drop them.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) { (acc: DataFrame, colName: String) => acc.drop(col(colName)) }
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3:
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) { (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")}
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
df.dropDuplicates() works only for rows.
You can use df1.drop(df2.col("value")).
You can also specify the columns you want to select, for example with df.select(<a Seq of columns>).
Suppose you have two dataframes DF1 and DF2.
You can use either of these ways to join on particular columns:
1. DF1.join(DF2,Seq("column1","column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))
So to drop the duplicate columns you can use:
1. DF1.join(DF2,Seq("column1","column2")) -- joining with a Seq of column names already keeps only one copy of the join columns, so there is nothing to drop.
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")).drop(DF1("column1")).drop(DF1("column2"))
In either case you can use drop("columnname") to drop whatever columns you need; it doesn't matter which dataframe they come from, as they are equal in this case.
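For reference, a minimal PySpark sketch of option 2 (assuming two DataFrames df1 and df2 that both have column1):
# Join on equality of the shared column, then drop the copy coming from df2.
joined = df1.join(df2, df1["column1"] == df2["column1"], "inner").drop(df2["column1"])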
I wasn't completely satisfied with the answers here. For the most part, especially @stack0114106's answers, they hint at the right way and at the complexity of doing it cleanly, but they seem incomplete. To me a clean, automated way of doing this is to use the df.columns functionality to get the columns as a list of strings, and then use sets to find either the common columns to drop or the unique columns to keep, depending on your use case. However, if you use select you will have to alias the dataframes so it knows which of the non-unique columns to keep. Anyway, here it is in pseudocode, because I can't be bothered to write the Scala code properly.
common_cols = df_b.columns.toSet().intersection(df_a.columns.toSet())
df_a.join(df_b.drop(*common_cols))
The select version of this looks similar but you have to add in the aliasing.
unique_b_cols = df_b.columns.toSet().difference(df_a.columns.toSet()).toList
a_cols_aliased = df_a.columns.map(cols => "a." + cols)
keep_columns = a_cols_aliased.toList + unique_b_cols.toList
df_a.alias("a")
.join(df_b.alias("b"))
.select(*keep_columns)
I prefer the drop way, but having written a bunch of Spark code, I find a select statement can often lead to cleaner code.
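Here is a minimal, runnable PySpark sketch of the drop variant described above (the pseudocode is Scala-flavoured, but the idea is the same); the sample frames and the join key "A" are made up for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample frames; "A" is the assumed join key.
df_a = spark.createDataFrame([(1, 2, 3, 4), (5, 4, 3, 1)], ["A", "B", "C", "D"])
df_b = spark.createDataFrame([(1, 9, 9), (5, 8, 8)], ["A", "B", "C"])

# Columns present in both frames, except the join key.
common_cols = (set(df_a.columns) & set(df_b.columns)) - {"A"}

# Drop the overlapping columns from one side before joining,
# so the result has no duplicate column names.
df_joined = df_a.join(df_b.drop(*common_cols), on="A", how="inner")
df_joined.show()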

weekofyear() returning seemingly incorrect results for January 1

I'm not quite sure why my code gives 52 as the answer for weekofyear("01/JAN/2017").
Does anyone have a possible explanation for this? Is there a better way to do this?
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('weekOfYear').getOrCreate()
from pyspark.sql.functions import to_date
df = spark.createDataFrame(
[(1, "01/JAN/2017"), (2, "15/FEB/2017")], ("id", "date"))
df.show()
+---+-----------+
| id| date|
+---+-----------+
| 1|01/JAN/2017|
| 2|15/FEB/2017|
+---+-----------+
Calculate the week of the year
df=df.withColumn("weekofyear", functions.weekofyear(to_date(df["date"],"dd/MMM/yyyy")))
df.printSchema()
root
|-- id: long (nullable = true)
|-- date: string (nullable = true)
|-- weekofyear: integer (nullable = true)
df.show()
The 'error' is visible below:
+---+-----------+----------+
| id| date|weekofyear|
+---+-----------+----------+
| 1|01/JAN/2017| 52|
| 2|15/FEB/2017| 7|
+---+-----------+----------+
It seems like weekofyear() will only return 1 for January 1st if the day of the week is Monday through Thursday.
To confirm, I created a DataFrame with all "01/JAN/YYYY" from 1900 to 2018:
df = sqlCtx.createDataFrame(
[(1, "01/JAN/{y}".format(y=year),) for year in range(1900,2019)],
["id", "date"]
)
Now let's convert it to a date, get the day of the week, and count the values for weekofyear():
import pyspark.sql.functions as f
df.withColumn("d", f.to_date(f.from_unixtime(f.unix_timestamp('date', "dd/MMM/yyyy"))))\
.withColumn("weekofyear", f.weekofyear("d"))\
.withColumn("dayofweek", f.date_format("d", "E"))\
.groupBy("dayofweek", "weekofyear")\
.count()\
.show()
#+---------+----------+-----+
#|dayofweek|weekofyear|count|
#+---------+----------+-----+
#| Sun| 52| 17|
#| Mon| 1| 18|
#| Tue| 1| 17|
#| Wed| 1| 17|
#| Thu| 1| 17|
#| Fri| 53| 17|
#| Sat| 53| 4|
#| Sat| 52| 12|
#+---------+----------+-----+
Note, I am using Spark v 2.1 where to_date() does not accept a format argument, so I had to use the method described in this answer to convert the string to a date.
Similarly, weekofyear() only returns 1 for:
January 2nd, if the day of the week is Monday through Friday.
January 3rd, if the day of the week is Monday through Saturday.
Update
This behavior is consistent with the ISO 8601 definition: week 1 is the week containing the first Thursday of the year (equivalently, January 4th), so January 1st belongs to week 52 or 53 of the previous year whenever it falls on a Friday, Saturday, or Sunday.
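To sanity-check this outside of Spark, Python's standard library applies the same ISO 8601 rule (a small sketch, independent of the Spark code above):
import datetime

# January 1, 2017 was a Sunday, so under ISO 8601 it falls in
# week 52 of 2016 rather than week 1 of 2017.
iso_year, iso_week, iso_weekday = datetime.date(2017, 1, 1).isocalendar()
print(iso_year, iso_week, iso_weekday)   # 2016 52 7

iso_year, iso_week, iso_weekday = datetime.date(2017, 2, 15).isocalendar()
print(iso_year, iso_week, iso_weekday)   # 2017 7 3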

Spark DataFrame making column null value to empty

I have joined two data frames with a left outer join. The resulting data frame has null values. How do I make them empty instead of null?
+---+--------+
| id|quantity|
+---+--------+
|  1|    null|
|  2|    null|
|  3|    0.04|
+---+--------+
And here is the schema
root
|-- id: integer (nullable = false)
|-- quantity: double (nullable = true)
Expected output:
+---+--------+
| id|quantity|
+---+--------+
|  1|        |
|  2|        |
|  3|    0.04|
+---+--------+
You cannot make them "empty", since they are double values and empty string "" is a String. The best you can do is leave them as nulls or set them to 0 using fill function:
val df2 = df.na.fill(0.0, Seq("quantity"))
Otherwise, if you really want to have empty quantities, you should consider changing quantity column type to String.
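If you really do want empty strings, a minimal PySpark sketch of that approach (the answer above uses Scala, but the same idea applies there), assuming a DataFrame df with the schema shown above:
from pyspark.sql.functions import col

# Cast quantity to string (nulls stay null), then replace the nulls
# with an empty string. The column is no longer numeric afterwards.
df2 = df.withColumn("quantity", col("quantity").cast("string")).na.fill("", subset=["quantity"])
df2.show()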
