How to pass multiple columns to the partitionBy method in Spark - apache-spark

I am a newbie in Spark. I want to write dataframe data into a Hive table. The Hive table is partitioned on multiple columns. Through the Hive metastore client I am getting the partition columns and passing them as a variable to the partitionBy clause in the dataframe's write method.
var1="country","state" (getting the partition column names of the Hive table)
dataframe1.write.partitionBy(s"$var1").mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
When I execute the above code, it gives me the error: partition "country","state" does not exist.
I think it is taking "country","state" as a single string.
Can you please help me out?

The partitionBy function takes varargs, not a list. You can use it as:
dataframe1.write.partitionBy("country","state").mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
Or, in Scala, you can expand a sequence into varargs like:
val columns = Seq("country","state")
dataframe1.write.partitionBy(columns:_*).mode("overwrite").save(s"$hive_warehouse/$dbname.db/$temp_table/")
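If the metastore client hands you the partition columns as a single comma-separated string (which is what var1 in the question appears to be), a minimal sketch of the conversion, reusing the question's variable names, could look like this:
// Sketch: split the comma-separated partition-column string into a sequence,
// then expand it as varargs for partitionBy. var1's exact format is an
// assumption based on the question.
val var1 = "country,state"
val partitionCols = var1.split(",").map(_.trim)
dataframe1.write
  .partitionBy(partitionCols: _*)
  .mode("overwrite")
  .save(s"$hive_warehouse/$dbname.db/$temp_table/")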

Related

Read partitioned Hive table in PySpark instead of a parquet

I have a partitioned parquet dataset. It is partitioned by date like:
/server/my_dataset/dt=2021-08-02
/server/my_dataset/dt=2021-08-01
/server/my_dataset/dt=2021-07-31
...
The size is huge, so I do not want to read it all at once, and I need only the August part, therefore I use:
spark.read.parquet("/server/my_dataset/dt=2021-08*")
It works just fine. However, I am forced to move from reading the parquet directly to reading from the corresponding Hive table, something like:
spark.read.table("schema.my_dataset")
However I want to keep the same logic of reading only certain partitions of the data. Is there a way to do so?
Try with filter and the like operator.
Example:
spark.read.table("schema.my_dataset").filter(col("dt").like("2021-08%"))
UPDATE:
You can get all the August partition values into a variable and then use a filter with an isin expression.
Example:
// get the distinct partition values into a variable and keep only the required ones
val lst = df.select("dt").distinct.collect().map(x => x(0).toString)
// then use the isin function to filter only the required partitions
df.filter(col("dt").isin(lst:_*)).show()
Python sample code:
lst=[1,2]
df.filter(col("dt").isin(*lst)).show()

Concatenate a static dataframe with a Structured Streaming dataframe

How could I union or concatenate a static dataframe with only one row to a streaming dataframe with around 500 rows in Spark? It would somehow put a stamp or mark (add a row to each table) on each streaming dataframe.
To add more: my streaming data has no timestamp, and I'm wondering whether I can use foreach() or foreachBatch() or not.
I really appreciate it if you can help me.
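A direct union between a streaming and a static dataframe is not supported in Structured Streaming, but inside foreachBatch each micro-batch is a plain dataframe, so the union can be done there. A minimal sketch, assuming the two schemas match; streamingDf, staticDf, and the sink path are illustrative names, not from the question:
// Sketch: append the one-row static dataframe to every micro-batch inside
// foreachBatch (available since Spark 2.4). Names and the sink path are illustrative.
import org.apache.spark.sql.DataFrame
import spark.implicits._

val staticDf = Seq(("marker", 0)).toDF("label", "value")   // the one-row "stamp"

streamingDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.union(staticDf)                  // schemas must match column for column
      .write.mode("append").parquet("/tmp/stamped_output")
  }
  .start()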

Spark SQL limit in IN clause

I have a query in spark-sql with a lot of values in the IN clause:
select * from table where x in (<long list of values>)
When I run this query I get a TransportException from the MetastoreClient in Spark.
Column x is the partition column of the table. The hive metastore is on Oracle.
Is there a hard limit on how many values can be in the IN clause?
Or can I maybe set the timeout value higher to give the metastore more time to answer?
Yes, you can pass up to 1,000 values inside an IN clause (an Oracle limit).
However, you can use the OR operator to combine multiple IN clauses, slicing the list of values into windows of 1,000 values each.
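A rough sketch of that slicing, using the question's placeholder table and column names and a made-up value list:
// Sketch: build "x in (...) OR x in (...)" with at most 1000 values per IN list.
val values = (1 to 2500).map(_.toString)
val predicate = values
  .grouped(1000)
  .map(chunk => s"x in (${chunk.mkString(",")})")
  .mkString(" OR ")
val result = spark.sql(s"select * from table where $predicate")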

Avoid duplicate partitions in a data lake

When I write a parquet file I'm passing one of the column values to partitionBy, but when the dataframe is empty it doesn't create the partition (which is expected) and does nothing. To overcome this, if I pass
df.partitionOf("department=One").write(df)
and when the dataframe is NOT empty it creates two levels of partitions
location/department=One/department=One
Is there any way to skip one if the partition already exists to avoid duplicates?
What is the path you are passing while writing the dataframe? I didn't find a partitionOf function for Spark dataframes.
I think this should work for your case
df.write.mode("append").partitionBy("department").parquet("location/")
If you don't want to append data for the partitions which are already there, find the partition keys from the existing parquet, drop the data with those partition keys, and write the remaining data in append mode.
Scala code:
val dfi = spark.read.parquet(pathPrefix + finalFile).select(col("department"))
val finalDf = df.join(dfi, df.col("department") === dfi.col("department"), "left_outer")
  .where(dfi.col("department").isNull)
  .select(df.columns.map(df.col): _*)
finalDf.write.mode("append").partitionBy("department").parquet("location/")
You can optimize the first step (creating dfi) by finding the partition keys from your dataframe and keeping only those partition keys for which the path already exists.
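A minimal sketch of that optimization, assuming the same location/ layout as above and using the Hadoop FileSystem API to test which partition directories already exist:
// Sketch: keep only the partition keys present in df whose directory already
// exists under the output location, and build dfi from those instead of
// scanning the existing parquet.
import org.apache.hadoop.fs.{FileSystem, Path}
import spark.implicits._

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val keysInDf = df.select("department").distinct.collect().map(_.getString(0))
val existingKeys = keysInDf.filter(k => fs.exists(new Path(s"location/department=$k")))
val dfi = existingKeys.toSeq.toDF("department")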

Spark-Hive partitioning

The Hive table was created using 4 buckets.
CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets
The following lines in the spark code insert data into this table
hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")
and in the spark-defaults.conf, the number of parallel processes is 128
spark.default.parallelism=128
The problem is that when the inserts happen in the hive table, it has 128 partitions instead of 4 buckets.
The defaultParallelism cannot be reduced to 4 as that leads to a very slow system. Also, I have tried the DataFrame.coalesce method, but that makes the inserts too slow.
Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?
As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables using Spark SQL. While creating the bucketed table, there should be a CLUSTERED BY clause on one of the columns from the table schema. I don't see that in the specified CREATE TABLE statement. Assuming that it does exist and you know the clustering column, you could add the
.bucketBy(numBuckets, colName)
call while using the DataFrameWriter API.
More details for Spark 2.0+: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
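For illustration only, a hedged sketch of that call: bucketBy takes the bucket count plus the clustering column and is only honored through saveAsTable (not insertInto), the clustering column below ("cells") is one of the table's columns chosen here only as an assumed clustering column, and, as noted above, the resulting layout is Spark's bucketing rather than Hive's:
// Sketch: write a partitioned, bucketed table via the DataFrameWriter API.
// "cells" is an assumed clustering column, not declared as such in the question's DDL.
hourlies.write
  .partitionBy("traffic_date_hour")
  .bucketBy(4, "cells")
  .saveAsTable("hourly_suspect")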
