I have joined two data frames with a left outer join. The resulting data frame has null values. How do I make them empty instead of null?
+---+--------+
| id|quantity|
+---+--------+
|  1|    null|
|  2|    null|
|  3|    0.04|
+---+--------+
And here is the schema
root
|-- id: integer (nullable = false)
|-- quantity: double (nullable = true)
Expected output:
+---+--------+
| id|quantity|
+---+--------+
|  1|        |
|  2|        |
|  3|    0.04|
+---+--------+
You cannot make them "empty", since they are double values and an empty string "" is a String. The best you can do is leave them as nulls or set them to 0 using the fill function:
val df2 = df.na.fill(0.0, Seq("quantity"))
Otherwise, if you really want to have empty quantities, you should consider changing the quantity column type to String.
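If you go that route, here is a minimal sketch of the idea, shown in PySpark (the Scala API is analogous), under the assumption that an empty string is really what you want: cast quantity to a string column, then fill the remaining nulls with "".
from pyspark.sql.functions import col

# Cast the double column to string, then replace nulls with an empty string
df2 = (df
       .withColumn("quantity", col("quantity").cast("string"))
       .na.fill("", ["quantity"]))
df2.show()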
I have a PySpark dataframe
simpleData = [("person0",10, 10), \
("person1",1, 1), \
("person2",1, 0), \
("person3",5, 1), \
]
columns= ["persons_name","A", 'B']
exp = spark.createDataFrame(data = simpleData, schema = columns)
exp.printSchema()
exp.show()
It looks like
root
|-- persons_name: string (nullable = true)
|-- A: long (nullable = true)
|-- B: long (nullable = true)
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 10| 10|
| person1| 1| 1|
| person2| 1| 0|
| person3| 5| 1|
+------------+---+---+
Now I want to apply a threshold of 2 to the values of columns A and B, such that any value less than the threshold becomes 0 and any value greater than the threshold becomes 1.
The final result should look something like:
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 1| 1|
| person1| 0| 0|
| person2| 0| 0|
| person3| 1| 0|
+------------+---+---+
How can I achieve this?
import pyspark.sql.functions as F

threshold = 2
exp.select(
    "persons_name",
    *[(F.col(c) > F.lit(threshold)).cast("int").alias(c) for c in ["A", "B"]]
).show()
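If you prefer spelling the rule out explicitly (or later need something more involved than a boolean cast), here is an equivalent sketch of the same thresholding using when/otherwise:
import pyspark.sql.functions as F

threshold = 2
exp.select(
    "persons_name",
    *[F.when(F.col(c) > threshold, 1).otherwise(0).alias(c) for c in ["A", "B"]]
).show()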
The log files are in JSON format; I extracted the data into a PySpark dataframe.
There are two columns whose values are integers, but the datatype of the columns is string.
cola|colb
45|10
10|20
Expected Output
newcol
55
30
but I am getting output like
4510
1020
The code I used is:
df.select(F.concat("cola", "colb").alias("newcol")).show()
Kindly help me: how can I get the correct output?
>>> from pyspark.sql.functions import col
>>> df.show()
+----+----+
|cola|colb|
+----+----+
| 45| 10|
| 10| 20|
+----+----+
>>> df.printSchema()
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
>>> df.withColumn("newcol", col("cola") + col("colb")).show()
+----+----+------+
|cola|colb|newcol|
+----+----+------+
| 45| 10| 55.0|
| 10| 20| 30.0|
+----+----+------+
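If you need newcol to stay an integer (55 and 30 rather than 55.0 and 30.0), you can cast the string columns to int before adding them; a small sketch, assuming both columns always hold valid integers:
>>> df.withColumn("newcol", col("cola").cast("int") + col("colb").cast("int")).show()
+----+----+------+
|cola|colb|newcol|
+----+----+------+
|  45|  10|    55|
|  10|  20|    30|
+----+----+------+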
I am working with PySpark. I have a DataFrame loaded from csv that contains the following schema:
root
|-- id: string (nullable = true)
|-- date: date (nullable = true)
|-- users: string (nullable = true)
If I show the first two rows it looks like:
+---+----------+---------------------------------------------------+
| id| date|users |
+---+----------+---------------------------------------------------+
| 1|2017-12-03|{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]} |
| 2|2017-12-04|{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]} |
+---+----------+---------------------------------------------------+
I would like to create a new DataFrame that contains the 'users' string broken out by each element. I would like something similar to
id user_id user_product
1 1 xxx
1 1 yyy
1 1 zzz
1 2 aaa
1 2 bbb
1 3 <null>
2 1 uuu
etc...
I have tried many approaches but can't seem to get it working.
The closest I can get is defining a schema such as the following and creating a new df by applying it with from_json:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

userSchema = StructType([
    StructField("user_id", StringType()),
    StructField("product_list", StructType([
        StructField("product", StringType())
    ]))
])

user_df = in_csv.select('id', from_json(in_csv.users, userSchema).alias("test"))
This returns the correct schema:
root
|-- id: string (nullable = true)
|-- test: struct (nullable = true)
| |-- user_id: string (nullable = true)
| |-- product_list: struct (nullable = true)
| | |-- product: string (nullable = true)
but when I show any part of the 'test' struct it returns nulls instead of values e.g.
user_df.select('test.user_id').show()
returns test.user_id :
+-------+
|user_id|
+-------+
| null|
| null|
+-------+
Maybe I shouldn't be using from_json, as the users string is not pure JSON. Any help as to what approach I could take?
The schema should conform to the shape of the data. Unfortunately from_json supports only StructType(...) or ArrayType(StructType(...)), which won't be useful here unless you can guarantee that all records have the same set of keys.
Instead, you can use a UserDefinedFunction:
import json
from pyspark.sql.functions import explode, udf
df = spark.createDataFrame([
(1, "2017-12-03", """{"1":["xxx","yyy","zzz"],"2":["aaa","bbb"],"3":[]}"""),
(2, "2017-12-04", """{"1":["uuu","yyy","zzz"],"2":["aaa"],"3":[]}""")],
("id", "date", "users")
)
@udf("map<string, array<string>>")
def parse(s):
    # Parse the JSON string into a map<string, array<string>>; return null on bad input
    try:
        return json.loads(s)
    except:
        pass
(df
 .select("id", "date",
         explode(parse("users")).alias("user_id", "user_product"))
 .withColumn("user_product", explode("user_product"))
 .show())
# +---+----------+-------+------------+
# | id| date|user_id|user_product|
# +---+----------+-------+------------+
# | 1|2017-12-03| 1| xxx|
# | 1|2017-12-03| 1| yyy|
# | 1|2017-12-03| 1| zzz|
# | 1|2017-12-03| 2| aaa|
# | 1|2017-12-03| 2| bbb|
# | 2|2017-12-04| 1| uuu|
# | 2|2017-12-04| 1| yyy|
# | 2|2017-12-04| 1| zzz|
# | 2|2017-12-04| 2| aaa|
# +---+----------+-------+------------+
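As a side note, newer Spark releases relax the from_json restriction mentioned above; assuming roughly Spark 2.3+ (where from_json accepts a map schema given as a DDL string in Python), a hedged sketch without a UDF might look like:
from pyspark.sql.functions import explode, from_json

(df
 .withColumn("users_map", from_json("users", "map<string, array<string>>"))
 .select("id", "date", explode("users_map").alias("user_id", "user_product"))
 .withColumn("user_product", explode("user_product"))
 .show())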
You don't need to use from_json. You have to explode twice: once for user_id and once more for users.
import pyspark.sql.functions as F
df = spark.createDataFrame([
    (1, '2017-12-03', {"1": ["xxx","yyy","zzz"], "2": ["aaa","bbb"], "3": []}),
    (2, '2017-12-04', {"1": ["uuu","yyy","zzz"], "2": ["aaa"], "3": []})],
    ['id', 'date', 'users']
)
df = df.select('id', 'date', F.explode('users').alias('user_id', 'users'))\
       .select('id', 'date', 'user_id', F.explode('users').alias('users'))
df.show()
+---+----------+-------+-----+
| id| date|user_id|users|
+---+----------+-------+-----+
| 1|2017-12-03| 1| xxx|
| 1|2017-12-03| 1| yyy|
| 1|2017-12-03| 1| zzz|
| 1|2017-12-03| 2| aaa|
| 1|2017-12-03| 2| bbb|
| 2|2017-12-04| 1| uuu|
| 2|2017-12-04| 1| yyy|
| 2|2017-12-04| 1| zzz|
| 2|2017-12-04| 2| aaa|
+---+----------+-------+-----+
How to get the MAX in the below dataframe?
val df_n = df.filter($"READ" === "" && $"ACT" =!= "").select($"ID")
I have to find the MAX of ID, and in case ID is NULL, I have to replace it with 0.
What about the following?
Test Dataset
scala> val df = Seq("0", null, "5", null, null, "-8").toDF("id")
df: org.apache.spark.sql.DataFrame = [id: string]
scala> df.printSchema
root
|-- id: string (nullable = true)
scala> df.withColumn("idAsLong", $"id" cast "long").printSchema
root
|-- id: string (nullable = true)
|-- idAsLong: long (nullable = true)
scala> val testDF = df.withColumn("idAsLong", $"id" cast "long")
testDF: org.apache.spark.sql.DataFrame = [id: string, idAsLong: bigint]
scala> testDF.show
+----+--------+
| id|idAsLong|
+----+--------+
| 0| 0|
|null| null|
| 5| 5|
|null| null|
|null| null|
| -8| -8|
+----+--------+
Solution
scala> testDF.agg(max("idAsLong")).show
+-------------+
|max(idAsLong)|
+-------------+
| 5|
+-------------+
Using na Operator
What if you had only negative values and nulls, so that a null (replaced with 0) should be the maximum value? Use the na operator on the Dataset.
val withNulls = Seq("-1", "-5", null, null, "-333", null)
.toDF("id")
.withColumn("asInt", $"id" cast "int") // <-- column of type int with nulls
scala> withNulls.na.fill(Map("asInt" -> 0)).agg(max("asInt")).show
+----------+
|max(asInt)|
+----------+
| 0|
+----------+
Without na replacing the nulls it simply won't work; max ignores them and you get -1:
scala> withNulls.agg(max("asInt")).show
+----------+
|max(asInt)|
+----------+
| -1|
+----------+
See na: DataFrameNaFunctions.
If you want to find out the maximum ID in this dataframe, you just have to add
.agg(max($"ID"))
However, I do not understand why you would want to replace the maximum ID with 0 without further grouping. Anyway, if you feel more comfortable with SQL, you can always use the SQL interface:
df.createOrReplaceTempView("DF")
spark.sql("select max(id) from DF").show
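If the concern is that max(id) itself can come back as null (for example when every ID in the filtered dataframe is null), you can wrap it in coalesce; a small sketch, assuming ID is numeric (the same SQL works from the Scala or Python API):
spark.sql("select coalesce(max(id), 0) as max_id from DF").show()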
I have the following dataset and it contains some null values; I need to replace the nulls using fillna in Spark.
DataFrame:
df = spark.read.format("com.databricks.spark.csv").option("header","true").load("/sample.csv")
>>> df.printSchema();
root
|-- Age: string (nullable = true)
|-- Height: string (nullable = true)
|-- Name: string (nullable = true)
>>> df.show()
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10| 80|Alice|
| 5| null| Bob|
| 50| null| Tom|
| 50| null| null|
+---+------+-----+
>>> df.na.fill(10).show()
When I give the na value, it doesn't change anything; the same dataframe appears again.
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10| 80|Alice|
| 5| null| Bob|
| 50| null| Tom|
| 50| null| null|
+---+------+-----+
I tried creating a new dataframe and storing the filled values in it, but the result still shows up unchanged.
>>> df2 = df.na.fill(10)
How can I replace the null values? Please give me the possible ways using fillna.
Thanks in advance.
It seems that your Height column is not numeric. When you call df.na.fill(10), Spark replaces nulls only in columns whose type matches the type of 10, i.e. numeric columns.
If the Height column needs to stay a string, you can try df.na.fill('10').show(); otherwise, casting it to IntegerType() is necessary.
You can also provide a specific default value for each column if you prefer.
df.na.fill({'Height': '10', 'Name': 'Bob'})
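On the sample data above (where both Height and Name are strings), showing the result of that per-column fill should give roughly:
>>> df.na.fill({'Height': '10', 'Name': 'Bob'}).show()
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10|    80|Alice|
|  5|    10|  Bob|
| 50|    10|  Tom|
| 50|    10|  Bob|
+---+------+-----+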
To add to @Mariusz's answer, here is the exact code to cast and fill the NA values:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col
df = df.withColumn("Height", col("Height").cast(IntegerType()))
df2 = df.na.fill(value=10, subset=["Height"])
Maybe the simpler solution would have been to give a string value, if you don't care about the column type:
df2 = df.na.fill(value="10", subset=["Height"])