This may be a dumb question since I lack some fundamental knowledge of Spark, but I tried this:
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").enableHiveSupport().getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("foo");
This creates a table under the 'default' database in Hive, and of course, I can fetch data from the table anytime I want.
I updated the above code to get rid of "enableHiveSupport":
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("bar");
The code runs fine without any error, but when I try "select * from bar", Spark says:
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'bar' not found in database 'default';
So I have 2 questions here:
1) Is it possible to create a 'raw' Spark table, not a Hive table? I know Hive maintains the metadata in a database like MySQL; does Spark have a similar mechanism?
2) In the 2nd code snippet, what does Spark actually create when calling saveAsTable?
Many thanks.
Check answers below:
If you want to create a raw table only in Spark, createOrReplaceTempView could help you. For the second part, check the next answer.
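A minimal sketch of that approach, assuming a plain local session (the names here are just illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TempViewExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("temp view example").master("local").getOrCreate();
        Dataset<Row> df = spark.range(10).toDF();
        // Registers a session-scoped view: no Hive metastore, no files on disk,
        // and the view disappears when the session ends.
        df.createOrReplaceTempView("foo_view");
        spark.sql("select * from foo_view").show();
        spark.stop();
    }
}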
By default, if you call saveAsTable on your DataFrame, it will persist the table into the Hive metastore if you use enableHiveSupport. If you don't enableHiveSupport, the table will be managed by Spark and the data will go under the spark-warehouse location. You will lose these tables after restarting the Spark session.
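To illustrate the second point, a rough sketch (the warehouse path is only an example) of saving a Spark-managed table without Hive support:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkWarehouseExample {
    public static void main(String[] args) {
        // Without enableHiveSupport(), Spark uses its own catalog and writes
        // managed tables under spark.sql.warehouse.dir (example path below).
        SparkSession spark = SparkSession.builder()
                .appName("spark warehouse example")
                .master("local")
                .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")
                .getOrCreate();
        Dataset<Row> df = spark.range(10).toDF();
        df.write().saveAsTable("bar");
        // Works within this session; the files sit under /tmp/spark-warehouse/bar,
        // but no Hive metastore entry is created.
        spark.sql("select * from bar").show();
        spark.stop();
    }
}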
Related
I am trying to run a Spark job written in Java on the Spark cluster to load records as a DataFrame into a Hive table I created.
df.write().mode("overwrite").insertInto("dbname.tablename");
Although the table and database exist in Hive, it throws the below error:
org.apache.spark.sql.AnalysisException: Table or view not found: dbname.tablename, the database dbname doesn't exist.;
I also tried reading from an existing Hive table different from the above one, thinking there might have been an issue with my table creation.
I also checked whether my user has permission on the HDFS folder where Hive is storing the data.
It all looks fine; I am not sure what the issue could be.
Please suggest.
Thanks
I think it is searching for that table in Spark's own catalog instead of Hive.
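If that is the cause, a sketch of the usual fix is to build the session with Hive support enabled and make sure the cluster's hive-site.xml is on the classpath, so Spark resolves dbname.tablename against the same metastore (source_table below is just a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InsertIntoHiveTable {
    public static void main(String[] args) {
        // Assumes hive-site.xml (e.g. under $SPARK_HOME/conf) points at the
        // Hive metastore that already contains dbname.tablename.
        SparkSession spark = SparkSession.builder()
                .appName("insert into hive table")
                .enableHiveSupport()   // without this, Spark only sees its own catalog
                .getOrCreate();
        Dataset<Row> df = spark.table("dbname.source_table"); // placeholder source
        df.write().mode("overwrite").insertInto("dbname.tablename");
        spark.stop();
    }
}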
I have a table which has some missing partitions. When I query it in Hive it works fine:
SELECT *
FROM my_table
but when I call it from PySpark (v. 2.3.0) it fails with the message Input path does not exist: hdfs://path/to/partition. The Spark code I am running is just naive:
spark = (SparkSession
         .builder
         .appName("prueba1")
         .master("yarn")
         .config("spark.sql.hive.verifyPartitionPath", "false")
         .enableHiveSupport()
         .getOrCreate())
spark.table('some_schema.my_table').show(10)
The config("spark.sql.hive.verifyPartitionPath", "false") setting has been proposed in this question, but it does not seem to work for me.
Is there any way I can configure the SparkSession so I can get rid of these errors? I am afraid that more partitions will go missing in the future, so a hardcoded solution is not possible.
This error occurs when partitioned data has been dropped from HDFS directly, i.e. without using Hive commands to drop the partition.
If the data is dropped from HDFS directly, Hive doesn't know about the dropped partition; when we query the Hive table it still looks for the directory, and since the directory doesn't exist in HDFS it results in a file-not-found exception.
To fix this issue we need to drop the partition associated with that directory from the Hive table as well, by using:
alter table <db_name>.<table_name> drop partition(<partition_col_name>=<partition_value>);
Then Hive drops the partition from the metadata; this is the only way to remove the metadata from the Hive table if we dropped the partition directory from HDFS.
msck repair table doesn't drop partitions; it only adds new partitions if new partition directories were added to HDFS.
The correct way to avoid this kind of issue in the future is to drop partitions using the Hive drop partition command.
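If you are driving this from Spark anyway, the same cleanup can be issued through spark.sql; a sketch with placeholder names (the partition column dt and its value are assumptions):

import org.apache.spark.sql.SparkSession;

public class DropMissingPartition {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("drop missing partition")
                .enableHiveSupport()
                .getOrCreate();
        // Drop the metastore entry for a partition whose directory was
        // already removed from HDFS (placeholder partition spec).
        spark.sql("ALTER TABLE some_schema.my_table DROP IF EXISTS PARTITION (dt='2018-01-01')");
        spark.stop();
    }
}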
Does the other way around, .config("spark.sql.hive.verifyPartitionPath", "true"), work for you? I have just managed to load data using spark-sql with this setting while one of the partition paths from Hive was empty and the partition still existed in the Hive metastore. Though there are caveats: it seems to take significantly more time to load data compared to when this setting is set to false.
I have a Spark Structured Streaming application (listening to Kafka) that is also reading from a persistent table in S3. I am trying to have each microbatch check for updates to the table. I have tried
var myTable = spark.table("myTable!")
and
spark.sql("select * from parquet.`s3n://myFolder/`")
Neither works in a streaming context. The issue is that the parquet file changes with each update, and Spark doesn't run any of the normal commands to refresh, such as:
spark.catalog.refreshTable("myTable!")
spark.sqlContext.clearCache()
I have also tried:
spark.sqlContext.setConf("spark.sql.parquet.cacheMetadata","false")
spark.conf.set("spark.sql.parquet.cacheMetadata",false)
to no avail. There has to be a way to do this. Would it be smarter to use a JDBC connection to a database instead?
Assuming I'm reading you right, I believe the issue is that because DataFrames are immutable, you cannot see changes to your parquet table unless you restart the streaming query and create a new DataFrame. This question has come up on the Spark mailing list before. The definitive answer appears to be that the only way to capture these updates is to restart the streaming query. If your application cannot tolerate 10-second hiccups, you might want to check out this blog post, which summarizes the above conversation and discusses how SnappyData enables mutations on Spark DataFrames.
Disclaimer: I work for SnappyData
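For what it's worth, a very rough sketch of the restart approach in Java (buildQuery, the paths, the join key "id", and the 10-minute interval are all assumptions, not details from the original question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class RestartingStream {
    // Hypothetical helper: rebuilds the whole query so the static side of the
    // join (the parquet table on S3) is re-read on every restart.
    static StreamingQuery buildQuery(SparkSession spark) throws Exception {
        Dataset<Row> staticTable = spark.read().parquet("s3n://myFolder/");
        Dataset<Row> stream = spark.readStream()
                .schema(staticTable.schema())       // placeholder: reuse the table schema
                .parquet("s3n://myStreamFolder/");  // placeholder streaming source
        return stream.join(staticTable, "id")       // "id" is an assumed join key
                .writeStream()
                .format("console")                  // placeholder sink
                .start();
    }

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("restarting stream").getOrCreate();
        while (true) {
            StreamingQuery query = buildQuery(spark);
            query.awaitTermination(10 * 60 * 1000); // arbitrary refresh interval
            query.stop();                           // restarting picks up the updated table
        }
    }
}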
This will accomplish what I'm looking for.
val df1Schema = spark.read.option("header", "true").csv("test1.csv").schema  // infer the schema from a sample file
val streamingDf = spark.readStream.schema(df1Schema).option("header", "true").csv("/1")
streamingDf.writeStream.format("memory").outputMode("append").queryName("df1").start()  // expose the stream as the in-memory table "df1"
var df1 = spark.sql("select * from df1")  // re-query the in-memory table to see newly arrived rows
The downside is that it's appending. One way of getting around that issue is to remove duplicates based on ID, keeping the newest date:
val dfOrder = df1.orderBy(col("id"), col("updateTableTimestamp").desc)
val dfMax = dfOrder.groupBy(col("id")).agg(first("name").as("name"),first("updateTableTimestamp").as("updateTableTimestamp"))
I am using Spark 2.1.0 and the Java SparkSession to run my Spark SQL.
I am trying to save a Dataset<Row> named 'ds' into a Hive table named schema_name.tbl_name using overwrite mode.
But when I run the below statement,
ds.write().mode(SaveMode.Overwrite)
.option("header","true")
.option("truncate", "true")
.saveAsTable(ConfigurationUtils.getProperty(ConfigurationUtils.HIVE_TABLE_NAME));
the table gets dropped after the first run.
When I rerun it, the table gets created with the data loaded.
Even using the truncate option didn't resolve my issue. Does saveAsTable consider truncating the data instead of dropping/recreating the table? If so, what is the correct way to do it in Java?
This is the Apache JIRA reference for my question. It seems to be unresolved so far.
https://issues.apache.org/jira/browse/SPARK-21036
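Until that JIRA is resolved, one workaround that is often suggested (a sketch, assuming the target table already exists and the column order of 'ds' matches it) is to use insertInto with overwrite instead of saveAsTable, since INSERT OVERWRITE replaces the data without dropping the table definition:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class OverwriteWithoutDrop {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("overwrite without drop")
                .enableHiveSupport()
                .getOrCreate();
        Dataset<Row> ds = spark.table("schema_name.source_tbl"); // placeholder source
        // insertInto matches columns by position, so the order must line up
        // with the existing table; the table itself is never dropped.
        ds.write().mode(SaveMode.Overwrite).insertInto("schema_name.tbl_name");
        spark.stop();
    }
}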
I can't seem to find much documentation on it, but when I pull data from Hive in Spark SQL, how is it retrieving the schema? Is it automatically looking in the Hive metastore? Also, is Hive telling Spark to look at the file location to pull the data into a DataFrame? And how does it handle a view, or can it not handle views yet?
Yes, it looks up the Hive metastore.
Spark delegates Hive queries to Hive, captures the output, and turns it into a DataFrame of rows.
From docs:
When working with Hive one must construct a HiveContext, which inherits from SQLContext, and adds support for finding tables in the MetaStore and writing queries using HiveQL
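A small sketch of what that looks like in practice (the table name is just a placeholder): with Hive support enabled, Spark resolves the table through the metastore, can show the schema without scanning the data files, and uses the location stored there to read the rows.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadHiveTable {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read hive table")
                .enableHiveSupport()   // connect to the Hive metastore
                .getOrCreate();
        // The schema comes from the metastore; the file location stored there
        // tells Spark where to read the data from.
        Dataset<Row> df = spark.table("default.some_hive_table"); // placeholder table
        df.printSchema();
        df.show(10);
        spark.stop();
    }
}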