How to save Spark Dataframe to Hana Vora table? - apache-spark

We have a file that we want to split into 3 and that we need to perform some data cleanup on before it can be imported into Hana Vora - otherwise everything has to be typed as String, which is not ideal.
We can import and prepare the DataFrames in Spark just fine, but when I then try to write either to the HDFS filesystem or, better, to save as a table in the "com.sap.spark.vora" datasource, I get errors.
Can anyone advise on a reliable way to import the Spark-prepared datasets into Hana Vora? Thanks!

Vora currently only officially supports appending data to an existing table (using the APPEND statement). For details, see the SAP HANA Vora Developer Guide -> chapter "3.5 Appending Data to Existing Tables".
This means you would have to create an intermediate file. Vora supports reading from CSV, ORC and Parquet files. A DataFrame can be saved to ORC or Parquet files directly from Spark (see https://spark.apache.org/docs/1.6.1/sql-programming-guide.html). To write CSV files from Spark, see https://github.com/databricks/spark-csv
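For illustration, here is a rough sketch of that staging step from the spark-shell (Scala). It assumes Spark 1.6 with the spark-csv package on the classpath; the DataFrame name df and the HDFS staging paths are just placeholders:

// Stage the prepared DataFrame as ORC or Parquet (natively supported by Spark):
df.write.mode("overwrite").orc("hdfs:///staging/my_dataset_orc")        // ORC needs a HiveContext in Spark 1.x
df.write.mode("overwrite").parquet("hdfs:///staging/my_dataset_parquet")

// Or as CSV via the databricks spark-csv package:
df.write.
  format("com.databricks.spark.csv").
  option("header", "true").
  save("hdfs:///staging/my_dataset_csv")

The staged files can then be appended to an existing Vora table with the APPEND statement described in the Developer Guide chapter mentioned above.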

Related

Table created with "stored as Parquet" option using PySpark SQL or Hive does not actually store data files in Parquet format

I create a table on a Hadoop cluster using PySpark SQL: spark.sql("CREATE TABLE my_table (...) PARTITIONED BY (...) STORED AS Parquet") and load some data with: spark.sql("INSERT INTO my_table SELECT * FROM my_other_table"). However, the resulting files do not seem to be Parquet files; they're missing the ".snappy.parquet" extension.
The same problem occurs when repeating those steps in Hive.
But surprisingly, when I create the table using a PySpark DataFrame: df.write.partitionBy("my_column").saveAsTable(name="my_table", format="Parquet")
everything works just fine.
So, my question is: what's wrong with the SQL way of creating and populating Parquet table?
Spark version 2.4.5, Hive version 3.1.2.
Update (27 Dec 2022 after #mazaneicha answer)
Unfortunately, there is no parquet-tools on the cluster I'm working with, so the best I could do was check the content of the files with hdfs dfs -tail (and -head). In all cases there is "PAR1" both at the beginning and at the end of the file. And even more, the metadata shows the Parquet version (implementation):
Method                 | # of files | Total size | Parquet version           | File name
Hive Insert            | 8          | 34.7 G     | parquet-mr version 1.10.0 | xxxxxx_x
PySpark SQL Insert     | 8          | 10.4 G     | parquet-mr version 1.6.0  | part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF insertInto  | 8          | 10.9 G     | parquet-mr version 1.6.0  | part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.c000
PySpark DF saveAsTable | 8          | 11.5 G     | parquet-mr version 1.10.1 | part-xxxxx-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-c000.snappy.parquet
(To create the same number of files I used "repartition" with df, and "distribute by" with SQL).
So, considering the above, it's still not clear:
Why is there no file extension in 3 out of 4 cases?
Why are the files created with Hive so big? (No compression, I suppose.)
Why do the PySpark SQL and PySpark DataFrame versions/implementations of Parquet differ, and how can they be set explicitly?
File format is not defined by the extension, but rather by the contents. You can quickly check whether the format is Parquet by looking for the magic bytes PAR1 at the very beginning and the very end of a file.
For in-depth format, metadata and consistency checking, try opening a file with parquet-tools.
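If parquet-tools is not available, the same PAR1 check can be done programmatically; here is a small sketch from the spark-shell (Scala), where the file path is hypothetical:

import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("hdfs:///warehouse/my_table/part-00000")   // hypothetical data file
val fs   = FileSystem.get(sc.hadoopConfiguration)
val len  = fs.getFileStatus(path).getLen
val in   = fs.open(path)
val head = new Array[Byte](4)
val tail = new Array[Byte](4)
in.readFully(0, head)        // first 4 bytes of the file
in.readFully(len - 4, tail)  // last 4 bytes of the file
in.close()

val magic = "PAR1".getBytes("ASCII")
println(s"Parquet magic present: ${head.sameElements(magic) && tail.sameElements(magic)}")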
Update:
As mentioned in the online docs, Parquet is supported by Spark as one of the many data sources via its common DataSource framework, so that it doesn't have to rely on Hive:
"When reading from Hive metastore Parquet tables and writing to non-partitioned Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance..."
You can find and review this implementation in the Spark git repo (it's open source! :))
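The switch behind that behaviour is a regular SQL conf; here is a hedged sketch for Spark 2.x showing how it can be toggled (not a full answer to the version question, just the knob the quoted docs refer to):

// spark.sql.hive.convertMetastoreParquet decides whether Spark SQL uses its own
// Parquet reader/writer for Hive metastore Parquet tables or the Hive SerDe.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")   // Spark's native Parquet path (default)
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")  // force the Hive SerDe path

// Per the quoted docs, writes to *partitioned* Hive metastore Parquet tables go
// through the Hive SerDe regardless, which likely explains the older parquet-mr
// version (1.6.0) seen for the INSERT-based writes above.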

Does presto require a hive metastore to read parquet files from S3?

I am trying to generate Parquet files in S3 using Spark, with the goal that Presto can be used later to query the Parquet. Basically, this is how it looks:
Kafka-->Spark-->Parquet<--Presto
I am able to generate Parquet in S3 using Spark and it's working fine. Now I am looking at Presto, and what I think I found is that it needs the Hive metastore to query Parquet. I could not make Presto read my Parquet files even though Parquet saves the schema. So, does that mean that at the time of creating the Parquet files, the Spark job also has to store metadata in the Hive metastore?
If that is the case, can someone help me find an example of how it's done? To add to the problem, my data schema is changing, so to handle it I am creating a programmatic schema in the Spark job and applying it while creating the Parquet files. And if I am creating the schema in the Hive metastore, it needs to be done with this in mind.
Or could you shed some light on whether there is a better alternative way?
You keep the Parquet files on S3. Presto's S3 capability is a subcomponent of the Hive connector. As you said, you can let Spark register the tables, or you can use Presto for that, e.g.
create table hive.default.xxx (<columns>)
with (format = 'parquet', external_location = 's3://s3-bucket/path/to/table/dir');
(Depending on Hive metastore version and its configuration, you might need to use s3a instead of s3.)
Technically, it should be possible to create a connector that infers tables' schemata from Parquet headers, but I'm not aware of an existing one.
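If you would rather register the table from the Spark side while writing, here is a sketch under the assumption that Spark is built with Hive metastore support; the bucket/path and table name are placeholders:

// Option A: write the Parquet files and register the table in one step;
// the "path" option makes the table point at the S3 location.
df.write.
  mode("append").
  option("path", "s3a://s3-bucket/path/to/table/dir").
  saveAsTable("default.xxx")

// Option B: create the table definition over files that already exist.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS default.xxx (<columns>)
  STORED AS PARQUET
  LOCATION 's3a://s3-bucket/path/to/table/dir'
""")

Since your schema is evolving, the metastore definition would need to be updated (or recreated) whenever the programmatic schema in the Spark job changes.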

Any benefit for my case when using Hive as datawarehouse?

Currently, I am trying to adopt big data tooling to replace my current data analysis platform. My current platform is pretty simple: the system gets a lot of structured CSV feed files from various upstream systems, and then we load them as Java objects (i.e. in memory) for aggregation.
I am looking at using Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS / the filesystem, so Hive as a data warehouse seems not to be a must. However, I could still load my CSV files into Hive first and then use Spark to load the data from Hive.
My question here is: in my situation, what are the pros/benefits of introducing a Hive layer rather than directly loading the CSV files into a Spark DataFrame?
Thanks.
You can always inspect and explore the data using the tables.
Adhoc queries/aggregation can be performed using HiveQL.
When accessing that data through Spark, you need not mention the schema of the data separately.
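To make the last point concrete, here is a small sketch (Spark 2.x Scala, with hypothetical table and file names):

// From a Hive table, the schema comes from the metastore:
val fromHive = spark.table("feeds_db.daily_feed")

// From the raw CSV, the schema has to be declared (or inferred) every time:
import org.apache.spark.sql.types._
val feedSchema = StructType(Seq(
  StructField("feed_id", StringType),
  StructField("amount", DoubleType)
))
val fromCsv = spark.read.
  option("header", "true").
  schema(feedSchema).
  csv("hdfs:///feeds/daily_feed.csv")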

Using spark dataFrame to load data from HDFS

Can we use a DataFrame while reading data from HDFS?
I have tab-separated data in HDFS.
I googled, but only saw that it can be used with NoSQL data.
DataFrame is certainly not limited to NoSQL data sources. Parquet, ORC and JSON support is natively provided in 1.4 to 1.6.1; delimited text files are supported using the spark-csv package.
If you have your tsv file in HDFS at /demo/data then the following code will read the file into a DataFrame:
sqlContext.read.
format("com.databricks.spark.csv").
option("delimiter","\t").
option("header","true").
load("hdfs:///demo/data/tsvtest.tsv").show
To run the code from spark-shell, start the shell with the following option:
--packages com.databricks:spark-csv_2.10:1.4.0
In Spark 2.0, CSV is natively supported, so you should be able to do something like this:
spark.read.
option("delimiter","\t").
option("header","true").
csv("hdfs:///demo/data/tsvtest.tsv").show
If I am understanding correctly, you essentially want to read data from HDFS and you want this data to be automatically converted to a DataFrame.
If that is the case, I would recommend the spark-csv library. Check it out; it has very good documentation.

External Table not getting updated from parquet files written by spark streaming

I am using Spark Streaming to write the aggregated output as Parquet files to HDFS using SaveMode.Append. I have an external table created like:
CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
path "hdfs:////"
);
I had the impression that, in the case of an external table, queries should also fetch the data from newly added Parquet files. But it seems the newly written files are not being picked up.
Dropping and recreating the table every time works fine, but that is not a solution.
Please suggest how my table can also pick up the data from newer files.
Are you reading those tables with Spark?
If so, Spark caches Parquet table metadata (since schema discovery can be expensive).
To overcome this, you have 2 options:
Set the config spark.sql.parquet.cacheMetadata to false
Refresh the table before the query: sqlContext.refreshTable("my_table")
See here for more details: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
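For reference, a minimal sketch of both options, assuming the sqlContext and the rolluptable table from the question:

// Option 1: stop caching Parquet metadata so new files are picked up on every
// query (freshness at the cost of repeated schema discovery).
sqlContext.setConf("spark.sql.parquet.cacheMetadata", "false")

// Option 2: keep the cache but invalidate it explicitly before querying.
sqlContext.refreshTable("rolluptable")
sqlContext.sql("SELECT count(*) FROM rolluptable").show()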
