I am using Spark 2.0. The requirement is to create a new table from values selected into a dataframe. While writing out the df as
df.write.saveAsTable(hive_table_name, format='parquet', mode='overwrite')
the error is:
client cannot authenticate via:[TOKEN, KERBEROS]
Host Details : local host is: "some-ip" and destination host is:"some-other-ip"
Also, if the table does not exist in Hive, will df.write.saveAsTable create a new table in Hive and automatically infer the schema?
I am not sure about the error you are receiving, but yes, Spark will infer the schema automatically if you try to create a table that doesn't exist.
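On the second question, a minimal sketch of what that might look like (the table name is hypothetical; assumes a Spark 2.0 SparkSession created with Hive support and a reachable metastore), not a definitive recipe:

from pyspark.sql import SparkSession

# Assumes Hive support is enabled and the metastore is reachable.
spark = SparkSession.builder.appName("saveAsTable-example").enableHiveSupport().getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# If the table does not exist yet, saveAsTable creates it in the metastore
# using the dataframe's schema.
df.write.saveAsTable("my_new_table", format="parquet", mode="overwrite")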
Hope that helps!
Subhash
OK, some learnings from the past few weeks:
saveAsTable saves the table data to the HDFS file system. Without a schema explicitly created on Hive to consume the parquet files, the schema Spark infers while creating the dataframe is not used by Hive to reflect the existing columns of the table.
Schema inference applies only to sources such as JSON and CSV, not to .dat files or compressed text files. Those files have to be parsed with the right delimiter and the dataframe columns renamed using the first row as column titles before being saved to disk; a sketch follows.
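For example, one hedged way to handle a delimited .dat file (the path, delimiter, and table name below are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the delimited file with an explicit separator; treat the first row as
# column titles and skip schema inference (all columns stay strings).
raw = spark.read.csv("/data/input/sample.dat", sep="|", header=True, inferSchema=False)

raw.write.saveAsTable("my_dat_table", format="parquet", mode="overwrite")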
Related
I am trying to understand the Spark DataFrame API method called saveAsTable.
I have the following questions:
If I simply write a dataframe using the saveAsTable API, e.g.
df7.write.saveAsTable("t1") (assuming t1 did not exist earlier), will the newly created table be a Hive table which can be read outside Spark using HiveQL?
Does Spark also create some non-Hive tables (created using the saveAsTable API but not readable outside Spark using HiveQL)?
How can I check whether a table is a Hive table or a non-Hive table?
(I am new to big data processing, so pardon me if question is not phrased properly)
Yes. The newly created table will be a Hive table and can be queried from the Hive CLI (but only if the DataFrame is created from a non-partitioned HadoopFsRelation with a single input HDFS path).
Below is the documentation comment from the DataFrameWriter.scala class. Documentation link
When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
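On the question of how to check, one possible approach (a sketch, assuming a Spark 2.x session with Hive support; t1 is the table from the question) is to inspect the table metadata, where a Hive-compatible table shows a Hive SerDe and storage information rather than only a Spark-specific provider:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Print the detailed table metadata, including SerDe and storage information.
spark.sql("DESCRIBE FORMATTED t1").show(100, truncate=False)

The same DESCRIBE FORMATTED t1 statement can also be run from the Hive CLI.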
Yes, you can. Your table can be partitioned by a column, but it cannot use bucketing (that is a known incompatibility between Spark and Hive); see the sketch below.
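A minimal sketch of a partitioned saveAsTable (the partition column and table name are hypothetical; df7 follows the question's naming):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical data; "event_date" stands in for whatever column you partition by.
df7 = spark.createDataFrame([(1, "2017-01-01"), (2, "2017-01-02")], ["id", "event_date"])

df7.write.format("parquet").partitionBy("event_date").saveAsTable("t1_partitioned")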
I am having an issue with the pyspark sql module. I created a partitioned table and saved it as parquet files into a Hive table by running a Spark job after multiple transformations.
The data loads successfully into Hive and I am also able to query it. But when I try to query the same data from Spark, it says the file path doesn't exist:
java.io.FileNotFoundException: File hdfs://localhost:8020/data/path/of/partition partition=15f244ee8f48a2f98539d9d319d49d9c does not exist
The partition mentioned in the above error contains old partition-column data which doesn't even exist anymore.
I have since run the Spark job, which populates a new partition value.
I searched for solutions, but all I can find is people saying there was no such issue in Spark 1.4 and that there is an issue in 1.6.
Can someone please suggest a solution for this problem?
I am trying to generate parquet files in S3 using Spark, with the goal that Presto can later be used to query the parquet. Basically, this is how it looks:
Kafka-->Spark-->Parquet<--Presto
I am able to generate parquet in S3 using Spark and it is working fine. Now I am looking at Presto, and what I think I found is that it needs the Hive metastore in order to query parquet. I could not make Presto read my parquet files even though parquet saves the schema. So does that mean that, at the time of creating the parquet files, the Spark job also has to store metadata in the Hive metastore?
If that is the case, can someone help me find an example of how it's done? To add to the problem, my data schema keeps changing, so to handle it I am creating a programmatic schema in the Spark job and applying it while creating the parquet files. If I am creating the schema in the Hive metastore, it needs to be done with this in mind.
Or could you shed light on it if there is any better alternative way?
You keep the Parquet files on S3; Presto's S3 capability is a subcomponent of the Hive connector. As you said, you can let Spark define the tables, or you can use Presto for that, e.g.
create table hive.default.xxx (<columns>)
with (format = 'parquet', external_location = 's3://s3-bucket/path/to/table/dir');
(Depending on Hive metastore version and its configuration, you might need to use s3a instead of s3.)
Technically, it should be possible to create a connector that infers tables' schemata from Parquet headers, but I'm not aware of an existing one.
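If you go the Spark route instead, a minimal sketch of registering the table in the metastore while writing to S3 (the bucket, path, and table name are hypothetical; assumes Hive support and s3a credentials are configured):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-s3").enableHiveSupport().getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Writing with an explicit path creates a table whose data lives under the
# given S3 location, with its metadata in the Hive metastore.
df.write.mode("overwrite").format("parquet") \
    .option("path", "s3a://s3-bucket/path/to/table/dir") \
    .saveAsTable("default.xxx")

Presto, pointed at the same metastore, should then be able to query default.xxx.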
I am using Spark Streaming to write the aggregated output as parquet files to HDFS using SaveMode.Append. I have an external table created like:
CREATE TABLE if not exists rolluptable
USING org.apache.spark.sql.parquet
OPTIONS (
path "hdfs:////"
);
I was under the impression that, for an external table, queries should also fetch the data from the newly added parquet files. But it seems the newly written files are not being picked up.
Dropping and recreating the table every time works fine, but that is not a solution.
Please suggest how my table can also pick up the data from the newer files.
Are you reading those tables with Spark?
If so, Spark caches parquet table metadata (since schema discovery can be expensive).
To overcome this, you have 2 options:
Set the config spark.sql.parquet.cacheMetadata to false
Refresh the table before the query: sqlContext.refreshTable("my_table")
See here for more details: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-metastore-parquet-table-conversion
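A short sketch of both options in PySpark (using the rolluptable name from the question; a 1.x-style HiveContext is assumed since the table lives in the metastore):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="refresh-example")
sqlContext = HiveContext(sc)

# Option 1: disable caching of parquet metadata for this session.
sqlContext.setConf("spark.sql.parquet.cacheMetadata", "false")

# Option 2: refresh the cached metadata for the table before querying it.
sqlContext.refreshTable("rolluptable")
sqlContext.sql("SELECT COUNT(*) FROM rolluptable").show()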
I am using SparkSQL in Python. I have created a partitioned table (a few hundred partitions) and stored it as a Hive internal table using the hiveContext. The Hive warehouse is located in S3.
When I simply do df = hiveContext.table("mytable"), it takes over a minute to go through all the partitions the first time. I thought the metastore stored all the metadata. Why would Spark still need to go through each partition? Is it possible to avoid this step so my startup can be faster?
The key here is that it takes this long to load the file metadata only on the first query. The reason is that SparkSQL doesn't store the partition metadata in the Hive metastore. For Hive partitioned tables, the partition information needs to be stored in the metastore. How the table was created dictates how this behaves; from the information provided, it sounds like you created a SparkSQL table.
SparkSQL stores the table schema (which includes partition information) and the root directory of your table, but still discovers each partition directory on S3 dynamically when the query is run. My understanding is that this is a tradeoff so you don't need to manually add new partitions whenever the table is updated.
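A minimal sketch of the access pattern described above (the table and partition column names are hypothetical; assumes a 1.x HiveContext as in the question):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="partition-discovery-example")
hiveContext = HiveContext(sc)

# First access: Spark lists every partition directory under the table's root
# path on S3, which is why it can take a minute or more with hundreds of partitions.
df = hiveContext.table("mytable")

# Later queries in the same session reuse the already-discovered file metadata.
df.where("partition_col = 'some_value'").count()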