Parquet bytes dataframe to UTF-8 in Spark - python-3.x

I am trying to read a DataFrame from a Parquet file with Spark in Python, but one of the columns is byte encoded, so when I use spark.read.parquet and then df.show() it looks like the following:
+---+----------+----+
| C1| C2| C3|
+---+----------+----+
| 1|[20 2D 2D]| 0|
| 2|[32 30 31]| 0|
| 3|[43 6F 6D]| 0|
+---+----------+----+
As you can see, the values are displayed as hexadecimal bytes... I've read the entire documentation on Spark DataFrames but did not find anything. Is it possible to convert the column to UTF-8?
The df.printSchema() output:
|-- C1: long (nullable = true)
|-- C2: binary (nullable = true)
|-- C3: long (nullable = true)
The Spark version is 2.4.4
Thank you!

You have a binary type column, which is like a bytearray in Python. You just need to cast it to string:
df = df.withColumn("C2", df["C2"].cast("string"))
df.show()
#+---+---+---+
#| C1| C2| C3|
#+---+---+---+
#| 1| --| 0|
#| 2|201| 0|
#| 3|Com| 0|
#+---+---+---+
Likewise in Python 3, decoding the bytes gives the same result:
bytearray([0x20, 0x2D, 0x2D]).decode("utf-8")
#' --'
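If you need an explicit character set rather than relying on the default cast, pyspark.sql.functions.decode also works; a minimal sketch, assuming the same df with the binary column C2:
from pyspark.sql import functions as F

# Decode the binary column into a string using an explicit charset.
df = df.withColumn("C2", F.decode(F.col("C2"), "UTF-8"))
df.show()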

Related

Apply a threshold on column values in a PySpark dataframe and convert the values to binary 0 or 1

I have a PySpark dataframe
simpleData = [("person0", 10, 10),
              ("person1", 1, 1),
              ("person2", 1, 0),
              ("person3", 5, 1)]
columns = ["persons_name", "A", "B"]
exp = spark.createDataFrame(data=simpleData, schema=columns)
exp.printSchema()
exp.show()
It looks like:
root
|-- persons_name: string (nullable = true)
|-- A: long (nullable = true)
|-- B: long (nullable = true)
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 10| 10|
| person1| 1| 1|
| person2| 1| 0|
| person3| 5| 1|
+------------+---+---+
Now I want to apply a threshold of 2 to the values of columns A and B, such that any value less than the threshold becomes 0 and any value greater than the threshold becomes 1.
The final result should look something like:
+------------+---+---+
|persons_name| A| B|
+------------+---+---+
| person0| 1| 1|
| person1| 0| 0|
| person2| 0| 0|
| person3| 1| 0|
+------------+---+---+
How can I achieve this?
from pyspark.sql import functions as F

threshold = 2
exp.select(
    [F.col("persons_name")] +
    [(F.col(c) > F.lit(threshold)).cast("int").alias(c) for c in ["A", "B"]]
).show()
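An equivalent sketch using when/otherwise, in case you prefer withColumn over a list comprehension (same exp dataframe and threshold assumed):
from pyspark.sql import functions as F

threshold = 2
result = exp
for c in ["A", "B"]:
    # 1 if the value exceeds the threshold, 0 otherwise.
    result = result.withColumn(c, F.when(F.col(c) > threshold, 1).otherwise(0))
result.show()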

Adding the values of two columns whose datatype is string in PySpark

The log files are in JSON format, and I extracted the data into a PySpark dataframe.
There are two columns whose values are integers, but the datatype of the columns is string.
cola|colb
45|10
10|20
Expected Output
newcol
55
30
but I am getting output like
4510
1020
The code I have used is:
df.select(F.concat("cola", "colb").alias("newcol")).show()
Kindly help me get the correct output.
>>> from pyspark.sql.functions import col
>>> df.show()
+----+----+
|cola|colb|
+----+----+
| 45| 10|
| 10| 20|
+----+----+
>>> df.printSchema()
root
|-- cola: string (nullable = true)
|-- colb: string (nullable = true)
>>> df.withColumn("newcol", col("cola") + col("colb")).show()
+----+----+------+
|cola|colb|newcol|
+----+----+------+
| 45| 10| 55.0|
| 10| 20| 30.0|
+----+----+------+
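If you want integer results instead of doubles, a small variation is to cast the string columns to int before adding them (column names as in the question):
from pyspark.sql.functions import col

df.withColumn("newcol", col("cola").cast("int") + col("colb").cast("int")).show()
#+----+----+------+
#|cola|colb|newcol|
#+----+----+------+
#|  45|  10|    55|
#|  10|  20|    30|
#+----+----+------+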

Using partitions (with partitionBy) when writing a delta lake has no effect

When I initially write a Delta Lake table, using partitions (with partitionBy) or not makes no difference.
Using repartition on the same column before writing only changes the number of Parquet files.
Making the partition column explicitly 'not nullable' does not change anything either.
Versions:
Spark 2.4 (actually 2.4.0.0-mapr-620)
Scala 2.11.12
Delta Lake 0.5.0 (io.delta:delta-core_2.11:jar:0.5.0)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val tmp = spark.createDataFrame(
  spark.sparkContext.parallelize((1 to 10).map(n => Row(n, n % 3))),
  StructType(Seq(StructField("CONTENT", IntegerType), StructField("PARTITION", IntegerType))))
/*
tmp.show
+-------+---------+
|CONTENT|PARTITION|
+-------+---------+
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
| 7| 1|
| 8| 2|
| 9| 0|
| 10| 1|
+-------+---------+
tmp.printSchema
root
|-- CONTENT: integer (nullable = true)
|-- PARTITION: integer (nullable = true)
*/
tmp.write.format("delta").partitionBy("PARTITION").save("PARTITIONED_DELTA_LAKE")
The resulting delta-lake directory is as follows:
ls -1 PARTITIONED_DELTA_LAKE
_delta_log
00000000000000000000.json
part-00000-a3015965-b101-4f63-87de-1d06a7662312-c000.snappy.parquet
part-00007-3155dde1-9f41-49b5-908e-08ce6fc077af-c000.snappy.parquet
part-00014-047f6a28-3001-4686-9742-4e4dbac05c53-c000.snappy.parquet
part-00021-e0d7f861-79e9-41c9-afcd-dbe688720492-c000.snappy.parquet
part-00028-fe3da69d-660a-445b-a99c-0e7ad2f92bf0-c000.snappy.parquet
part-00035-d69cfb9d-d320-4d9f-9b92-5d80c88d1a77-c000.snappy.parquet
part-00043-edd049a2-c952-4f7b-8ca7-8c0319932e2d-c000.snappy.parquet
part-00050-38eb3348-9e0d-49af-9ca8-a323e58b3712-c000.snappy.parquet
part-00057-906312ad-8556-4696-84ba-248b01664688-c000.snappy.parquet
part-00064-31f5d03d-2c63-40e7-8fe5-a8374eff9894-c000.snappy.parquet
part-00071-e1afc2b9-aa5b-4e7c-b94a-0c176523e9f1-c000.snappy.parquet
cat PARTITIONED_DELTA_LAKE/_delta_log/00000000000000000000.json
{"commitInfo":{"timestamp":1579073383370,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isBlindAppend":true}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"2cdd6fbd-bffa-415e-9c06-94ffc2048cbe","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"CONTENT\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"PARTITION\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1579073381183}}
{"add":{"path":"part-00000-a3015965-b101-4f63-87de-1d06a7662312-c000.snappy.parquet","partitionValues":{},"size":363,"modificationTime":1579073382329,"dataChange":true}}
{"add":{"path":"part-00007-3155dde1-9f41-49b5-908e-08ce6fc077af-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382545,"dataChange":true}}
{"add":{"path":"part-00014-047f6a28-3001-4686-9742-4e4dbac05c53-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382237,"dataChange":true}}
{"add":{"path":"part-00021-e0d7f861-79e9-41c9-afcd-dbe688720492-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382583,"dataChange":true}}
{"add":{"path":"part-00028-fe3da69d-660a-445b-a99c-0e7ad2f92bf0-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382893,"dataChange":true}}
{"add":{"path":"part-00035-d69cfb9d-d320-4d9f-9b92-5d80c88d1a77-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382488,"dataChange":true}}
{"add":{"path":"part-00043-edd049a2-c952-4f7b-8ca7-8c0319932e2d-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073383262,"dataChange":true}}
{"add":{"path":"part-00050-38eb3348-9e0d-49af-9ca8-a323e58b3712-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382683,"dataChange":true}}
{"add":{"path":"part-00057-906312ad-8556-4696-84ba-248b01664688-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382416,"dataChange":true}}
{"add":{"path":"part-00064-31f5d03d-2c63-40e7-8fe5-a8374eff9894-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382549,"dataChange":true}}
{"add":{"path":"part-00071-e1afc2b9-aa5b-4e7c-b94a-0c176523e9f1-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382511,"dataChange":true}}
I would expect something like
ls -1 PARTITIONED_DELTA_LAKE
_delta_log
00000000000000000000.json
PARTITION=0
part-00000-a3015965-b101-4f63-87de-1d06a7662312-c000.snappy.parquet
...
cat PARTITIONED_DELTA_LAKE/_delta_log/00000000000000000000.json
..."partitionBy":"[PARTITION]"...
..."partitionColumns":[PARTITION]...
..."partitionValues":{0}...
As Jacek commented, the Spark version used is too old. I have tried the above code with the following Spark versions:
2.4.0
2.4.1
2.4.2
Only with 2.4.2 does partitioning work as expected. Within this release, this bugfix might be the reason the issue is fixed:
..
Users can specify columns in partitionBy and our internal data sources will use this information. Unfortunately, for external systems, this data is silently dropped with no feedback given to the user
..
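For reference, a minimal PySpark sketch of the same partitioned write, assuming Spark 2.4.2 or later with the delta-core package on the classpath and a hypothetical output path:
# Build a small dataframe and write it partitioned by the PARTITION column.
tmp = spark.createDataFrame([(n, n % 3) for n in range(1, 11)], ["CONTENT", "PARTITION"])
tmp.write.format("delta").partitionBy("PARTITION").save("/tmp/PARTITIONED_DELTA_LAKE")
# On a fixed Spark version the output directory should contain PARTITION=0/, PARTITION=1/ and PARTITION=2/ subdirectories.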

spark drop multiple duplicated columns after join

I am getting many duplicated columns after joining two dataframes.
Now I want to drop the columns that come last. Below is my printSchema:
root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)
now I want to drop the last two columns
|-- test: string (nullable = true)
|-- value: string (nullable = true)
I tried df.dropDuplicates(), but it drops everything.
How can I drop the duplicated columns that come last?
You have to use varargs syntax to drop the column names held in an array.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) { (acc: DataFrame, colName: String) => acc.drop(col(colName)) }
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) { (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")}
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
df.dropDuplicates() works only for rows.
You can use df1.drop(df2.col("value")).
You can also specify the columns you want to keep, for example with df.select(<Seq of columns>).
Suppose you have two dataframes, DF1 and DF2. You can join them on particular columns in either of these ways:
1. DF1.join(DF2, Seq("column1","column2"))
2. DF1.join(DF2, DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))
So to drop the duplicate columns you can use:
1. DF1.join(DF2, Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column2"))
2. DF1.join(DF2, DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")).drop(DF1("column1")).drop(DF1("column2"))
In either case you can use drop("columnname") to drop whichever columns you need; it does not matter which dataframe they come from, as they are equal in this case.
I wasn't completely satisfied with the answers here. For the most part, especially #stack0114106 's answers, they hint at the right way and at the complexity of doing it cleanly, but they seem to be incomplete. To me, a clean automated way of doing this is to use the df.columns functionality to get the columns as a list of strings, and then use sets to find the common columns to drop, or the unique columns to keep, depending on your use case. However, if you use select you will have to alias the dataframes so it knows which of the non-unique columns to keep. Anyway, here is pseudocode, because I can't be bothered to write the Scala code properly.
common_cols = df_b.columns.toSet().intersection(df_a.columns.toSet())
df_a.join(df_b.drop(*common_cols))
The select version of this looks similar but you have to add in the aliasing.
unique_b_cols = df_b.columns.toSet().difference(df_a.columns.toSet()).toList
a_cols_aliased = df_a.columns.foreach(cols => "a." + cols)
keep_columns = a_cols_aliased.toList + unique_b_cols.toList
df_a.alias("a")
.join(df_b.alias("b"))
.select(*keep_columns)
I prefer the drop way, but having written a bunch of Spark code, I find that a select statement can often lead to cleaner code.
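A runnable PySpark sketch of the same set-based idea, assuming two hypothetical dataframes df_a and df_b that share a join key "id" plus some overlapping columns:
# Columns present in both dataframes, apart from the join key.
common_cols = (set(df_a.columns) & set(df_b.columns)) - {"id"}

# Drop the overlapping columns from df_b before joining, so no duplicates appear.
joined = df_a.join(df_b.drop(*common_cols), on="id")
joined.show()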

Spark fillNa not replacing the null value

I have the following dataset and it contains some null values; I need to replace the nulls using fillna in Spark.
DataFrame:
df = spark.read.format("com.databricks.spark.csv").option("header","true").load("/sample.csv")
>>> df.printSchema();
root
|-- Age: string (nullable = true)
|-- Height: string (nullable = true)
|-- Name: string (nullable = true)
>>> df.show()
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10| 80|Alice|
| 5| null| Bob|
| 50| null| Tom|
| 50| null| null|
+---+------+-----+
>>> df.na.fill(10).show()
When I give the fill value, nothing changes; the same dataframe appears again.
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10| 80|Alice|
| 5| null| Bob|
| 50| null| Tom|
| 50| null| null|
+---+------+-----+
I tried creating a new dataframe and storing the filled values in it, but the result is still unchanged.
>>> df2 = df.na.fill(10)
How can I replace the null values? Please show me the possible ways using fillna.
Thanks in advance.
It seems that your Height column is not numeric. When you call df.na.fill(10), Spark replaces nulls only in columns whose type matches that of 10, i.e. numeric columns.
If the Height column needs to stay a string, you can try df.na.fill('10').show(); otherwise casting it to IntegerType() is necessary.
You can also provide a specific default value for each column if you prefer.
df.na.fill({'Height': '10', 'Name': 'Bob'})
To add to #Mariusz's answer, here is the exact code to cast and fill the NA values:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col
df = df.withColumn("Height", col("Height").cast(IntegerType()))
df2 = df.na.fill(value=10, subset=["Height"])
Maybe the simpler solution would have been to give a string value, if you don't care about the column type:
df2 = df.na.fill(value="10", subset=["Height"])
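Alternatively, a minimal sketch that avoids the string columns in the first place by letting Spark infer the schema when reading the CSV (same path as in the question; inferSchema working here is an assumption about the file's contents):
df = spark.read.csv("/sample.csv", header=True, inferSchema=True)
# With Age and Height inferred as numeric, a plain numeric fill now applies to them.
df.na.fill(10).show()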
