Spark: read from parquet an int column as long - apache-spark

I have a parquet file that is read by spark as an external table.
One of the columns is defined as int both in the parquet schema and in the spark table.
Recently, I've discovered int is too small for my needs, so I changed the column type to long in new parquet files.
I also changed the type in the Spark table to bigint.
However, when I try to read an old parquet file (with int) by spark as external table (with bigint), I get the following error:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
One possible solution is altering the column type in the old parquet to long, which I asked about here: How can I change parquet column type from int to long?, but it is very expensive since I have a lot of data.
Another possible solution is to read each parquet file according to its schema to a different spark table and create a union view of the old and new tables, which is very ugly.
Is there another way to read from parquet an int column as long in spark?

Using PySpark, couldn't you just do

from pyspark.sql.functions import col
from pyspark.sql.types import LongType

df = spark.read.parquet('path to parquet files')

then cast the column type in the DataFrame:

new_df = df.withColumn('col_name', col('col_name').cast(LongType()))

and then save the new DataFrame to the same location with overwrite mode?

Related

Writing Hive table from Spark specifying CSV as the format

I'm having an issue writing a Hive table from Spark. The following code works just fine; I can write the table (which defaults to the Parquet format) and read it back in Hive:
df.write.mode('overwrite').saveAsTable("db.table")
hive> describe table;
OK
val string
Time taken: 0.021 seconds, Fetched: 1 row(s)
However, if I specify the format should be csv:
df.write.mode('overwrite').format('csv').saveAsTable("db.table")
then I can save the table, but Hive doesn't recognize the schema:
hive> describe table;
OK
col array<string> from deserializer
Time taken: 0.02 seconds, Fetched: 1 row(s)
It's also worth noting that I can create a Hive table manually and then insertInto it:
spark.sql("create table db.table(val string)")
df.select('val').write.mode("overwrite").insertInto("db.table")
Doing so, Hive seems to recognize the schema. But that's clunky, and I can't figure out a way to automate the schema string anyway.
That is because the Hive SerDe does not support CSV by default.
If you insist on using the CSV format, create the table as below:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Then insert the data through df.write.insertInto.
For more info:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
You are creating a table with text format and trying to insert CSV data into it, which may run into problems. So, as suggested in the answer by Zhang Tong, create the Hive table using the Hive OpenCSVSerde.
After that, if you are more comfortable with Hive query language than dataframes, you can try this.
df.registerTempTable("temp")
spark.sql("insert overwrite table db.table select * from temp")
This happens because the Hive SerDe for CSV is different from what Spark uses. Hive by default uses the TEXTFILE format, and the delimiter has to be specified while creating the table.
One option is to use the insertInto API instead of saveAsTable when writing from Spark. With insertInto, Spark writes the contents of the DataFrame to the specified table. But it requires the schema of the DataFrame to be the same as the schema of the table. The position of the columns is important here, as insertInto ignores the column names.
Seq((5, 6)).toDF("a", "b").write.insertInto("t1")

spark to hive data types

Is there a way to convert an input string field into an ORC table column specified as varchar(xx) in a Spark SQL select query? Or do I have to use some workaround? I'm using Spark 1.6.
I found on a Cloudera forum that Spark does not care about the length; it saves the value as a string with no size limit.
The table is inserted into Hive OK, but I'm a little bit worried about data quality.
temp_table = sqlContext.table(ext)
df = temp_table.select(temp_table.day.cast('string'))
I would like to see something like that :)))
df = temp_table.select(temp_table.day.cast('varchar(100)'))
Edit:
df.write.partitionBy(part).mode('overwrite').insertInto(int)
Table I'm inserting into is saved as an ORC file (the line above probably should have .format('orc')).
I found here that if I specify a column as a varchar(xx) type, then the input string will be cut off to length xx.
Thx
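As a toy illustration of those truncation semantics (plain Python, not Hive itself; hive_varchar_store is a hypothetical helper name):

```python
def hive_varchar_store(value, max_len):
    """Mimic Hive's varchar(n) behavior on write: values longer than
    n characters are silently truncated to n; shorter values are kept as-is."""
    return value[:max_len]

print(len(hive_varchar_store("x" * 150, 100)))  # only 100 chars survive
```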

Enum equivalent in Spark Dataframe/Parquet

I have a table with hundreds of millions of rows, that I want to store in a dataframe in Spark and persist to disk as a parquet file.
The size of my Parquet file(s) is now in excess of 2TB and I want to make sure I have optimized this.
A large proportion of these columns are string values that can be lengthy but also often have very few distinct values. For example, I have a column with only two distinct values (a 20-character and a 30-character string), and I have another column with a string that is on average 400 characters long but only has about 400 distinct values across all entries.
In a relational database I would usually normalize those values out into a different table with references, or at least define my table with some sort of enum type.
I cannot see anything that matches that pattern in DF or parquet files. Is the columnar storage handling this efficiently? Or should I look into something to optimize this further?
Parquet doesn't have a mechanism for automatically generating enum-like types, but you can use the page dictionary. The page dictionary stores a list of values per parquet page to allow the rows to just reference back to the dictionary instead of rewriting the data. To enable the dictionary for the parquet writer in spark:
spark.conf.set("parquet.enable.dictionary", "true")
spark.conf.set("parquet.dictionary.page.size", 2 * 1024 * 1024)
Note that you have to write the file with these options enabled, or it won't be used.
To enable filtering for existence using the dictionary, you can enable
spark.conf.set("parquet.filter.dictionary.enabled", "true")
Source: Parquet performance tuning: The missing guide
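As a rough back-of-the-envelope illustration of what the page dictionary buys you for the low-cardinality columns described in the question (plain Python; real Parquet bit-packs and run-length-encodes the indices, so the actual saving is even larger):

```python
# 2000 rows but only 2 distinct 400-character values, as in the question.
values = ["a" * 400, "b" * 400] * 1000

# Plain encoding: every row repeats the full string.
plain_size = sum(len(v) for v in values)

# Dictionary encoding: store each distinct value once, plus a small
# integer index per row pointing back into the dictionary.
dictionary = sorted(set(values))
index_of = {v: i for i, v in enumerate(dictionary)}
indices = [index_of[v] for v in values]
dict_size = sum(len(v) for v in dictionary) + 4 * len(indices)  # ~4 bytes/index

print(plain_size, dict_size)  # 800000 vs 8800
```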

Impala table from spark partitioned parquet files

I have generated some partitioned parquet data using Spark, and I'm wondering how to map it to an Impala table... Sadly, I haven't found any solution yet.
The schema of parquet is like :
{ key: long,
value: string,
date: long }
and I partitioned it with key and date, that gives me this kind of directories on my hdfs :
/data/key=1/date=20170101/files.parquet
/data/key=1/date=20170102/files.parquet
/data/key=2/date=20170101/files.parquet
/data/key=2/date=20170102/files.parquet
...
Do you know how I could tell Impala to create a table from this dataset with the corresponding partitions (and without having to loop over each partition, as I have read one might)? Is it possible?
Thank you in advance
Assuming that by "schema of parquet" you meant the schema of the dataset, and that you then used key and date to partition: since those are the partition columns, only the value column remains in the actual files.parquet files. Now you can proceed as follows.
The solution is to use an Impala external table.
create external table mytable (value STRING) partitioned by (key BIGINT,
date BIGINT) stored as parquet location '....../data/'
Note that in the above statement, you have to give the path up to the data folder.
alter table mytable recover partitions;
refresh mytable;
The above 2 commands will automatically detect the partitions based on the schema of the table and discover the parquet files present in the subdirectories.
Now , you can start querying the data .
Hope it helps

SPARK-HIVE-key differences between Hive and Parquet from the perspective of table schema processing

I am new to Spark and Hive. I do not understand the statement
"Hive considers all columns nullable, while nullability in Parquet is significant"
If anyone could explain the statement with an example, it would be better for me. Thank you.
In standard SQL syntax, when you create a table, you can state that a specific column is "nullable" (i.e. may contain a Null value) or not (i.e. trying to insert/update a Null value will throw an error). Nullable is the default.
Parquet schema syntax supports the same concept, although when using AVRO serialization, not-nullable is the default.
Caveat -- when you use Spark to read multiple Parquet files, these files may have different schemas. Imagine that the schema definition has changed over time, and newer files have 2 more Nullable columns at the end. Then you have to request "schema merging" so that Spark reads the schema from all files (not just one at random) to make sure that all these schemas are compatible, then at read-time the "undefined" columns are defaulted to Null for older files.
Hive HQL syntax does not support the standard SQL feature; every column is, and must be, nullable -- simply because Hive does not have total control on its data files!
Imagine a Hive partitioned table with 2 partitions...
- one partition uses TextFile format and contains CSV dumps from different sources, some showing all expected columns, some missing the last 2 columns because they use an older definition
- the second partition uses Parquet format for history, created by Hive INSERT-SELECT queries, but older Parquet files are missing the last 2 columns also, because they were created using the older table definition
For the Parquet-based partition, Hive does "schema merging", but instead of merging the file schemas together (like Spark), it merges each file schema with the table schema -- ignoring columns that are not defined in the table, and defaulting to Null all table columns that are not in the file.
Note that for the CSV-based partition, it's much more brutal, because the CSV files don't have a "schema" -- they just have a list of values that are mapped to the table columns, in order. On reaching EOL all missing columns are set to Null; on reaching the value for the last column, any extra value on the line is ignored.
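A toy model (plain Python, not Hive; the helper names are hypothetical) of the two reconciliation behaviors just described, one name-based for Parquet and one purely positional for CSV:

```python
def hive_read_parquet_row(table_cols, file_cols, file_row):
    """Name-based reconciliation: table columns missing from the file
    default to None (Null); file columns not in the table are ignored."""
    by_name = dict(zip(file_cols, file_row))
    return {c: by_name.get(c) for c in table_cols}

def hive_read_csv_row(table_cols, csv_values):
    """Positional mapping: values fill the table columns in order; missing
    trailing columns become None; extra values on the line are ignored."""
    padded = list(csv_values) + [None] * (len(table_cols) - len(csv_values))
    return dict(zip(table_cols, padded))

print(hive_read_parquet_row(["a", "b", "c"], ["a", "b"], [1, 2]))
print(hive_read_csv_row(["a", "b", "c"], [1, 2]))
```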
