spark to hive data types - apache-spark

is there a way, how to convert an input string field into orc table with column specified as varchar(xx) in sparksql select query? Or I have to use some workaroud? I'm using Spark 1.6.
I found on Cloudera forum, Spark does not care about length, it saves the value as string with no size limit.
The table is inserted into Hive OK, but I'm little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_dable.day.cast('string'))
I would like to see something like that :)))
df = temp_table.select(temp_dable.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
Table I'm inserting into is saved as an ORC file (the line above probably should have .format('orc')).
I found here, that If I specify a column as a varchar(xx) type, than the input string will be cutoff to the xx length.
Thx

Related

Read partioned hive table in pyspark instead of a parquet

I have a partionned parquet. It is partioned by date like:
/server/my_dataset/dt=2021-08-02
/server/my_dataset/dt=2021-08-01
/server/my_dataset/dt=2021-07-31
...
The size is huge, so I do not want to read it at the time and I need only august part, therefore I use:
spark.read.parquet("/server/my_dataset/dt=2021-08*")
It works just fine. However I am forced to move from reading parquet directly to reading from the corresponding hive table. Something like:
spark.read.table("schema.my_dataset")
However I want to keep the same logic of reading only certain partitions of the data. Is there a way to do so?
Try with filter and like operator.
Example:
spark.read.table("schema.my_dataset").filter(col("dt").like("2021-08%"))
UPDATE:
You can get all the august partition values into a variable then use filter query with in statement.
Example:
#get the partition values into variable and filter required
val lst=df.select("dt").distinct.collect().map(x => x(0).toString)
#then use isin function to filter only required partitions
df.filter(col("dt").isin(lst:_*)).show()
For python sample code:
lst=[1,2]
df.filter(col("dt").isin(*lst)).show()

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date which is the partition column is of date type.
I want only read the data from the latest date partition but as a consumer I don't know what is the latest value.
I could use a simple group by something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work it's a very inefficient way since it involves a groupby.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
Doing the method suggested by #pasha701 would involve loading the entire spark data frame with all the batch_date partitions and then finding max of that. I think the author is asking for a way to directly find the max partition date and load only that.
One way is to use hdfs or s3fs, and load the contents of the s3 path as a list and then finding the max partition and then loading only that. That would be more efficient.
Assuming you are using AWS s3 format, something like this:
import sys
import s3fs
datelist=[]
inpath="s3:bucket_path/data/"
fs = s3fs.S3FileSystem(anon=False)
Dirs = fs.ls(inpath)
for paths in Dirs:
date=paths.split('=')[1]
datelist.append(date)
maxpart=max(datelist)
df=spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This would do all the work in lists without loading anything into memory until it finds the one you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
Using Show partitions to get all partition of table
show partitions TABLENAME
Output will be like
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
we can get data form specific partition using below query
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Or additional filter or group by can be applied on it.
This worked for me in Pyspark v2.4.3. First extract partitions (this is for a dataframe with a single partition on a date column, haven't tried it when a table has >1 partitions):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter contains the maximum date from the partition and can be used in a where clause pulling from the same table.

Spark SQL ignoring dynamic partition filter value

Running into an issue on Spark 2.4 on EMR 5.20 in AWS.
I have a string column as a partition, which has date values. My goal is to have the max value of this column be referenced as a filter. The values look like this 2019-01-01 for January 1st, 2019.
In this query, I am trying to filter to a certain date value (which is a string data type), and Spark ends up reading all directories, not just the resulting max(value).
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= (select max(mypartitioncolumn) from myothertable) group by 1,2,3 ").show
However, in this instance, If I hardcode the value, it only reads the proper directory.
spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= '2019-01-01' group by 1,2,3 ").show
Why is Spark not recognizing both methods in the same way? I made sure that if I run the select max(mypartitioncolumn) from myothertable query, it shows the exact same value as my hardcoded method (as well as the same datatype).
I can't find anything in the documentation that differentiates partition querying other than data type differences. I checked to make sure that my schema in both the source table as well as value are string types, and also tried to cast my value as a string as well cast( (select max(mypartitioncolumn) from myothertable) as string), it doesn't make any difference.
Workaround by changing configuration
sql("set spark.sql.hive.convertMetastoreParquet = false")
Spark docs
"When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default."

Spark: read from parquet an int column as long

I have a parquet file that is read by spark as an external table.
One of the columns is defined as int both in the parquet schema and in the spark table.
Recently, I've discovered int is too small for my needs, so I changed the column type to long in new parquet files.
I changed also the type in the spark table to bigint.
However, when I try to read an old parquet file (with int) by spark as external table (with bigint), I get the following error:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
One possible solution is altering the column type in the old parquet to long, which I asked about here: How can I change parquet column type from int to long?, but it is very expensive since I have a lot of data.
Another possible solution is to read each parquet file according to its schema to a different spark table and create a union view of the old and new tables, which is very ugly.
Is there another way to read from parquet an int column as long in spark?
using pyspark couldn't you just do
df = spark.read.parquet('path to parquet files')
the just change the cast the column type in the dataframe
new_df = (df
.withColumn('col_name', col('col_name').cast(LongType()))
)
and then just save the new dataframe to same location with overwrite mode

Writing Hive table from Spark specifying CSV as the format

I'm having an issue writing a Hive table from Spark. The following code works just fine; I can write the table (which defaults to the Parquet format) and read it back in Hive:
df.write.mode('overwrite').saveAsTable("db.table")
hive> describe table;
OK
val string
Time taken: 0.021 seconds, Fetched: 1 row(s)
However, if I specify the format should be csv:
df.write.mode('overwrite').format('csv').saveAsTable("db.table")
then I can save the table, but Hive doesn't recognize the schema:
hive> describe table;
OK
col array<string> from deserializer
Time taken: 0.02 seconds, Fetched: 1 row(s)
It's also worth noting that I can create a Hive table manually and then insertInto it:
spark.sql("create table db.table(val string)")
df.select('val').write.mode("overwrite").insertInto("db.table")
Doing so, Hive seems to recognize the schema. But that's clunky and I can't figure a way to automate the schema string anyway.
That is because Hive SerDe do not support csv by default.
If you insist on using csv format, creating table as below:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
And insert data through df.write.insertInto
For more info:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
You are creating a table with text format and trying to insert CSV data into it, which may run in to problems. So as suggested in the answer by Zhang Tong, create the hive table using hive OpenCSVSerde.
After that, if you are more comfortable with Hive query language than dataframes, you can try this.
df.registerTempTable("temp")
spark.sql("insert overwrite db.table select * from temp")
This happens because HiveSerde is different for csv than what is used by Spark. Hive by default use TEXTFORMAT and the delimiter has to be specified while creating the table.
One Option is to use the insertInto API instead of saveAsTable while writing from spark. While using insertInto, Spark writes the contents of the Dataframe to the specified table. But it requires the schema of the dataframe to be same as the schema of the table. Position of the columns is important here as it ignores the column names.
Seq((5, 6)).toDF("a", "b").write.insertInto("t1")

Resources