Using partitions (with partitionBy) when writing a delta lake has no effect - apache-spark

When I initially write a delta lake, using partitions (with partitionBy) or not makes no difference.
Using repartition on the same column before writing only changes the number of parquet files.
Making the partition column explicitly 'not nullable' does not change the effect either.
Versions:
Spark 2.4 (actually 2.4.0.0-mapr-620)
Scala 2.11.12
Delta Lake 0.5.0 (io.delta:delta-core_2.11:jar:0.5.0)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val tmp = spark.createDataFrame(
  spark.sparkContext.parallelize((1 to 10).map(n => Row(n, n % 3))),
  StructType(Seq(StructField("CONTENT", IntegerType), StructField("PARTITION", IntegerType))))
/*
tmp.show
+-------+---------+
|CONTENT|PARTITION|
+-------+---------+
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
| 7| 1|
| 8| 2|
| 9| 0|
| 10| 1|
+-------+---------+
tmp.printSchema
root
|-- CONTENT: integer (nullable = true)
|-- PARTITION: integer (nullable = true)
*/
tmp.write.format("delta").partitionBy("PARTITION").save("PARTITIONED_DELTA_LAKE")
The resulting delta-lake directory is as follows:
ls -1 PARTITIONED_DELTA_LAKE
_delta_log
00000000000000000000.json
part-00000-a3015965-b101-4f63-87de-1d06a7662312-c000.snappy.parquet
part-00007-3155dde1-9f41-49b5-908e-08ce6fc077af-c000.snappy.parquet
part-00014-047f6a28-3001-4686-9742-4e4dbac05c53-c000.snappy.parquet
part-00021-e0d7f861-79e9-41c9-afcd-dbe688720492-c000.snappy.parquet
part-00028-fe3da69d-660a-445b-a99c-0e7ad2f92bf0-c000.snappy.parquet
part-00035-d69cfb9d-d320-4d9f-9b92-5d80c88d1a77-c000.snappy.parquet
part-00043-edd049a2-c952-4f7b-8ca7-8c0319932e2d-c000.snappy.parquet
part-00050-38eb3348-9e0d-49af-9ca8-a323e58b3712-c000.snappy.parquet
part-00057-906312ad-8556-4696-84ba-248b01664688-c000.snappy.parquet
part-00064-31f5d03d-2c63-40e7-8fe5-a8374eff9894-c000.snappy.parquet
part-00071-e1afc2b9-aa5b-4e7c-b94a-0c176523e9f1-c000.snappy.parquet
cat PARTITIONED_DELTA_LAKE/_delta_log/00000000000000000000.json
{"commitInfo":{"timestamp":1579073383370,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isBlindAppend":true}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"2cdd6fbd-bffa-415e-9c06-94ffc2048cbe","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"CONTENT\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"PARTITION\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1579073381183}}
{"add":{"path":"part-00000-a3015965-b101-4f63-87de-1d06a7662312-c000.snappy.parquet","partitionValues":{},"size":363,"modificationTime":1579073382329,"dataChange":true}}
{"add":{"path":"part-00007-3155dde1-9f41-49b5-908e-08ce6fc077af-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382545,"dataChange":true}}
{"add":{"path":"part-00014-047f6a28-3001-4686-9742-4e4dbac05c53-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382237,"dataChange":true}}
{"add":{"path":"part-00021-e0d7f861-79e9-41c9-afcd-dbe688720492-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382583,"dataChange":true}}
{"add":{"path":"part-00028-fe3da69d-660a-445b-a99c-0e7ad2f92bf0-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382893,"dataChange":true}}
{"add":{"path":"part-00035-d69cfb9d-d320-4d9f-9b92-5d80c88d1a77-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382488,"dataChange":true}}
{"add":{"path":"part-00043-edd049a2-c952-4f7b-8ca7-8c0319932e2d-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073383262,"dataChange":true}}
{"add":{"path":"part-00050-38eb3348-9e0d-49af-9ca8-a323e58b3712-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382683,"dataChange":true}}
{"add":{"path":"part-00057-906312ad-8556-4696-84ba-248b01664688-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382416,"dataChange":true}}
{"add":{"path":"part-00064-31f5d03d-2c63-40e7-8fe5-a8374eff9894-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382549,"dataChange":true}}
{"add":{"path":"part-00071-e1afc2b9-aa5b-4e7c-b94a-0c176523e9f1-c000.snappy.parquet","partitionValues":{},"size":625,"modificationTime":1579073382511,"dataChange":true}}
I would expect something like
ls -1 PARTITIONED_DELTA_LAKE
_delta_log
00000000000000000000.json
PARTITION=0
part-00000-a3015965-b101-4f63-87de-1d06a7662312-c000.snappy.parquet
...
cat PARTITIONED_DELTA_LAKE/_delta_log/00000000000000000000.json
..."partitionBy":"[PARTITION]"...
..."partitionColumns":[PARTITION]...
..."partitionValues":{0}...

As Jacek commented, the Spark version used is too old. I have tried the above code with the following Spark versions:
2.4.0
2.4.1
2.4.2
Only with 2.4.2 does partitioning work as expected. This bugfix, included in that release, might be the reason the issue is fixed:
..
Users can specify columns in partitionBy and our internal data sources will use this information. Unfortunately, for external systems, this data is silently dropped with no feedback given to the user
..
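A quick way to confirm the fix is to inspect the first Delta commit directly: every line of the transaction log is an independent JSON record, so it can be read back with spark.read.json. A minimal verification sketch, assuming the write above is re-run on Spark 2.4.2 against the same PARTITIONED_DELTA_LAKE path:
// Each line of the Delta log is a standalone JSON record, so spark.read.json can parse it.
val log = spark.read.json("PARTITIONED_DELTA_LAKE/_delta_log/00000000000000000000.json")

// The metaData entry should now list PARTITION as a partition column ...
log.where("metaData is not null").select("metaData.partitionColumns").show(false)

// ... and each add entry should carry the partition value of its file.
log.where("add is not null").select("add.path", "add.partitionValues").show(false)
On 2.4.0 the same inspection shows "partitionColumns":[] and empty partitionValues, matching the log dump in the question.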

Related

Parquet bytes dataframe to UTF-8 in Spark

I am trying to read a dataframe from a parquet file with Spark in Python, but my dataframe is byte encoded, so when I use spark.read.parquet and then df.show() it looks like the following:
+---+----------+----+
| C1| C2| C3|
+---+----------+----+
| 1|[20 2D 2D]| 0|
| 2|[32 30 31]| 0|
| 3|[43 6F 6D]| 0|
+---+----------+----+
As you can see, the values are shown as hexadecimal byte values... I've read the entire documentation of Spark dataframes but did not find anything. Is it possible to convert to UTF-8?
The df.printSchema() output:
|-- C1: long (nullable = true)
|-- C2: binary (nullable = true)
|-- C3: long (nullable = true)
The Spark version is 2.4.4
Thank you!
You have a binary type column, which is like a bytearray in Python. You just need to cast it to string:
df = df.withColumn("C2", df["C2"].cast("string"))
df.show()
#+---+---+---+
#| C1| C2| C3|
#+---+---+---+
#| 1| --| 0|
#| 2|201| 0|
#| 3|Com| 0|
#+---+---+---+
Likewise in plain Python, decoding the bytes yields the text:
bytearray([0x20, 0x2D, 0x2D]).decode("utf-8")
#' --'
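If the binary data is in some character set other than UTF-8, Spark's built-in decode function lets you name the charset explicitly (it is also available in pyspark.sql.functions). A minimal Scala sketch, assuming the same df with a binary column C2:
import org.apache.spark.sql.functions.{col, decode}

// decode(column, charset) converts a binary column to a string using the given charset.
val decoded = df.withColumn("C2", decode(col("C2"), "UTF-8"))
decoded.show()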

How to get a string representation of DataFrame (as does Dataset.show)?

I need a useful string representation of a Spark dataframe. The one I get with df.show is great -- but I can't get that output as a string because the internal showString method called by show is private. Is there some way I can get a similar output without writing a method to duplicate this same functionality?
showString is simply private[sql], which means that the code accessing it has to be in the same package, i.e. org.apache.spark.sql.
The trick is to create a helper object that does belong to the org.apache.spark.sql package, but the single method we're about to create is not private (at any level).
I usually mimic what an instance method does with the very first input parameter as the target and the input parameters to match the target method.
package org.apache.spark.sql
object AccessShowString {
  def showString[T](df: Dataset[T],
      _numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
    df.showString(_numRows, truncate, vertical)
  }
}
TIP Use :paste -raw to copy and paste the code in spark-shell.
Let's use showString then.
import org.apache.spark.sql.AccessShowString.showString
val df = spark.range(10)
scala> println(showString(df, 10))
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
If you are really set on reusing existing code, you can access showString by reflection:
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> val showString = classOf[org.apache.spark.sql.DataFrame].getDeclaredMethod("showString", classOf[Int], classOf[Int], classOf[Boolean])
showString: java.lang.reflect.Method = public java.lang.String org.apache.spark.sql.Dataset.showString(int,int,boolean)
scala> showString.setAccessible(true)
scala> showString.invoke(df, 10.asInstanceOf[Object], 20.asInstanceOf[Object], false.asInstanceOf[Object]).asInstanceOf[String]
res2: String =
"+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
"

How to rename duplicated columns after join? [duplicate]

This question already has answers here:
How to avoid duplicate columns after join?
(10 answers)
Closed 4 years ago.
I want to join 3 dataframes, but some columns are not needed or have duplicate names across the dataframes, so I want to drop some columns like below:
result_df = (aa_df.join(bb_df, 'id', 'left')
.join(cc_df, 'id', 'left')
.withColumnRenamed(bb_df.status, 'user_status'))
Please note that the status column exists in two dataframes, i.e. aa_df and bb_df.
The above doesn't work. I also tried to use withColumn, but the new column is created and the old column still exists.
If you are trying to rename the status column of the bb_df dataframe, then you can do so while joining:
result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')
I want to join 3 dataframes, but some columns are not needed or have duplicate names across the dataframes
That's a fine use case for aliasing a Dataset using alias or as operators.
alias(alias: String): Dataset[T] or alias(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set. Same as as.
as(alias: String): Dataset[T] or as(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set.
(And honestly, I only now noticed the Symbol-based variants.)
NOTE There are two as operators, as for aliasing and as for type mapping. Consult the Dataset API.
After you've aliased a Dataset, you can reference columns using the [alias].[columnName] format. This is particularly handy with joins and star column dereferencing using *.
val ds1 = spark.range(5)
scala> ds1.as('one).select($"one.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
val ds2 = spark.range(10)
// Using joins with aliased datasets
// where clause is in a longer form to demo how to reference columns by alias
scala> ds1.as('one).join(ds2.as('two)).where($"one.id" === $"two.id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
so I want to drop some columns like below
My general recommendation is not to drop columns, but to select what you want to include in the result. That makes life more predictable, as you know what you get (not what you don't). I was told that our brains work with positives, which could also be a point in favor of select.
So, as you asked and as I showed in the above example, the result has two columns of the same name id. The question is how to have only one.
There are at least two answers using the variant of the join operator with the join columns or condition included (as you showed in your question), but that would not answer your real question about "dropping unwanted columns", would it?
Given I prefer select (over drop), I'd do the following to have a single id column:
val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
  .select("one.*") // <-- select columns from "one" dataset
scala> q.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Regardless of the reasons why you asked the question (which could also be answered with the points I raised above), let me answer the (burning) question of how to use withColumnRenamed when there are two matching columns (after a join).
Let's assume you ended up with the following query, and so you've got two id columns (one per join side).
val q = ds1.as('one)
  .join(ds2.as('two))
  .where($"one.id" === $"two.id")
scala> q.show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
withColumnRenamed won't work for this use case since it does not accept aliased column names.
scala> q.withColumnRenamed("one.id", "one_id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
You could select the columns you're interested in as follows:
scala> q.select("one.id").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
scala> q.select("two.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Please see the docs: withColumnRenamed()
You need to pass the name of the existing column and the new name to the function. Both of these should be strings.
result_df = aa_df.join(bb_df,'id', 'left').join(cc_df, 'id', 'left').withColumnRenamed('status', 'user_status')
If you have a 'status' column in both dataframes, you can include it in the join as aa_df.join(bb_df, ['id','status'], 'left'), assuming aa_df and bb_df share that column. This way you will not end up with two 'status' columns.
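For completeness, the same idea in Scala: passing the shared column names as a Seq makes Spark keep a single copy of the join columns in the result. A minimal sketch with hypothetical aa_df and bb_df (requires spark.implicits._, as in spark-shell):
// Hypothetical inputs; both carry "id" and "status".
val aa_df = Seq((1L, "active", "alice"), (2L, "blocked", "bob")).toDF("id", "status", "name")
val bb_df = Seq((1L, "active", 42), (2L, "blocked", 7)).toDF("id", "status", "score")

// Joining on a Seq of column names deduplicates those columns in the output.
val result_df = aa_df.join(bb_df, Seq("id", "status"), "left")
result_df.show()
// The result has one "id" and one "status" column, plus "name" and "score".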

DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, CentOS 7.2. We wrote some new code using the Spark DataFrame/Dataset APIs but are noticing incorrect results on a join after writing and then reading data to Windows Azure Storage Blobs (the default HDFS location). I've been able to duplicate the issue with the following snippet of code running on the cluster.
case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)
val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS
dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs
+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345| 0| 1.0|
+-----+---------+-----+
+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
| 0| 1| 1.0|
| 1| 0| 1.0|
| 2| 2| 1.0|
+---------+-------+-----+
+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345| 0| 1.0| 0| 1| 1.0|
+-----+---------+-----+---------+-------+-----+
which is correct. However, after writing and reading the data, we see this:
dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")
val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]
dims2.show
cent2.show
dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
outputs
+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345| 0| 1.0|
+-----+---------+-----+
+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
| 0| 1| 1.0|
| 1| 0| 1.0|
| 2| 2| 1.0|
+---------+-------+-----+
+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345| 0| 1.0| null| null| null|
+-----+---------+-----+---------+-------+-----+
However, using the RDD API produces the correct result
dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => (row.dimension, row) ) ).take(5)
res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0))))
We've tried changing the output format to ORC instead of parquet, but we see the same results. Running Spark 2.0 locally, not on a cluster, does not have this issue. Running Spark in local mode on the master node of the Hadoop cluster also works. Only when running on top of YARN do we see this issue.
This also seems very similar to this issue: https://issues.apache.org/jira/browse/SPARK-10896
This issue was fixed by the pull request submitted for https://issues.apache.org/jira/browse/SPARK-17806

PySpark: Randomize rows in dataframe

I have a dataframe and I want to randomize rows in the dataframe. I tried sampling the data by giving a fraction of 1, which didn't work (interestingly this works in Pandas).
It works in Pandas because taking a sample on a local system is typically solved by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. This means that sampling in Spark only randomizes the members of the sample, not their order.
You can order DataFrame by a column of random numbers:
from pyspark.sql.functions import rand
df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])
df.orderBy(rand()).show(3)
## +---+
## | x|
## +---+
## | 2|
## | 7|
## | 14|
## +---+
## only showing top 3 rows
but it is:
expensive - because it requires a full shuffle, which is something you typically want to avoid.
suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing, it is relatively useless without collecting.
This code works for me without any RDD operations:
import pyspark.sql.functions as F
df = df.select("*").orderBy(F.rand())
Here is a more elaborate example:
import pandas as pd
import pyspark.sql.functions as F
# Create a pandas DataFrame and convert it to a Spark DataFrame
pandas_df = pd.DataFrame(([1,2],[3,1],[4,2],[7,2],[32,7],[123,3]), columns=["id","col1"])
df = sqlContext.createDataFrame(pandas_df)
df = df.select("*").orderBy(F.rand())
df.show()
+---+----+
| id|col1|
+---+----+
| 1| 2|
| 3| 1|
| 4| 2|
| 7| 2|
| 32| 7|
|123| 3|
+---+----+
df.select("*").orderBy(F.rand()).show()
+---+----+
| id|col1|
+---+----+
| 7| 2|
|123| 3|
| 3| 1|
| 4| 2|
| 32| 7|
| 1| 2|
+---+----+
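The same approach works from Scala, and rand accepts an optional seed if you want the shuffled order to be repeatable for the same input partitioning. A minimal sketch, assuming a spark-shell session:
import org.apache.spark.sql.functions.rand

val df = spark.range(20).toDF("x")

// Random order on every run
df.orderBy(rand()).show(3)

// With a fixed seed the ordering is repeatable (for the same partitioning of the input)
df.orderBy(rand(42)).show(3)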
