I'm using Spark 2.4.4 with the AWS Glue Data Catalog.
In my Spark job, I need to create a database in Glue if it doesn't exist. I'm using the following Spark SQL statement to do so:
spark.sql("CREATE DATABASE IF NOT EXISTS %s".format(hiveDatabase));
It works as expected in spark-shell: the database gets created in Glue.
But when I run the same code with spark-submit, the database is not created. Is there a commit/flush that I need to do when using spark-submit?
EDIT
I'm getting different results for show databases in spark-shell (first output) and spark-submit (second output):
+---------------------+
|databaseName |
+---------------------+
|all |
|default |
|hive-db |
|navi-database-account|
|navi-par |
|testdb |
+---------------------+
+------------+
|databaseName|
+------------+
|default |
+------------+
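It looks like spark-submit is creating the database somewhere, but not in Glue.
I needed to add the following config to the Spark session. Without it, a session started through spark-submit uses Spark's default in-memory catalog instead of the Hive-compatible (Glue) one: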
("spark.sql.catalogImplementation", "hive")
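As a rough illustration of where that setting can go, here is a minimal Python sketch. The helper names (hive_catalog_conf, spark_submit_args, build_session) and the job file name are made up for this example; only the config key and value come from the post above. It shows two ways to pass the setting: as a --conf argument to spark-submit, or on the session builder.

```python
def hive_catalog_conf():
    """Settings that switch Spark from the in-memory catalog to the Hive one."""
    return {"spark.sql.catalogImplementation": "hive"}


def spark_submit_args(app_path, conf):
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", "{}={}".format(key, value)]
    args.append(app_path)
    return args


def build_session(app_name, conf):
    """Create a SparkSession with the same settings (needs pyspark installed)."""
    # Imported lazily so the helpers above can be used without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # "my_job.py" is a placeholder for the real application file.
    print(" ".join(spark_submit_args("my_job.py", hive_catalog_conf())))
```

Calling enableHiveSupport() on the session builder should have the same effect as setting this key by hand.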
Related
I have read another question and I am confused about the options. I want to read an Athena view in Spark on EMR. From searching Google and Stack Overflow, I understood that these views are somehow stored in S3, so I first tried to find the external location of the view with:
Describe mydb.Myview
It returns the schema but not an external location, from which I assumed that I cannot read the view as a DataFrame directly from S3.
What I have considered so far for reading an Athena view in Spark
I have considered the following options:
Make a new table out of this Athena view using a CREATE TABLE ... WITH statement, with PARQUET as the format and an external location:
CREATE TABLE Temporary_tbl_from_view WITH ( format = 'PARQUET', external_location = 's3://my-bucket/views_to_parquet/' ) AS SELECT * FROM "mydb"."myview";
Another option is based on this answer, which suggests:
When you start an EMR cluster (v5.8.0 and later) you can instruct it
to connect to your Glue Data Catalog. This is a checkbox in the
'create cluster' dialog. When you check this option your Spark
SqlContext will connect to the Glue Data Catalog, and you'll be able
to see the tables in Athena.
But I am not sure how to query this view (not a table) in PySpark. If Athena tables and views are available in the Spark context through the Glue catalog, will a simple statement like this work?
sqlContext.sql("SELECT * from mydb.myview")
Question: what is the more efficient way to read this view in Spark? Does recreating a table with the WITH statement (external location) mean that I am storing the data in the Glue catalog or S3 twice? If yes, how can I read it directly through S3 or the Glue catalog?
Just to share the solution I followed: I created my cluster with the following option enabled:
Use AWS Glue Data Catalog for table metadata
Afterwards, I could see the database name from AWS Glue and the desired view listed under tableName, as below:
spark.sql("use my_db_name")
spark.sql("show tables").show(truncate=False)
+------------+---------------------------+-----------+
|database |tableName |isTemporary|
+------------+---------------------------+-----------+
| my_db_name|tabel1 |false |
| my_db_name|desired_table |false |
| my_db_name|tabel3 |false |
+------------+---------------------------+-----------+
I have a simple Spark (2.3.0) standalone cluster with 1 master and 2 workers (node-1 and node-2). I saved my DataFrame as a persistent table in the Hive metastore using the saveAsTable command with pyspark on node-1:
>>> df.write.saveAsTable("test")
It works fine. I can restart pyspark on that node (node-1) and can see that the table is still there:
>>> spark.sql('show tables').show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default| test| false|
+--------+---------+-----------+
But when I go to the other node (node-2), I get the following:
>>> spark.sql('show tables').show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
Can anyone help me with how to use the table on node-2?
This is because the metastore data is saved locally in an embedded Apache Derby database, so each node sees only its own metastore. I solved this by using MySQL instead of Derby, as a metastore shared by all nodes.
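For illustration only, the following Python sketch renders a hive-site.xml that points the metastore at MySQL. The host, database name, user and password are placeholders, and the driver class shown is the one used by older MySQL Connector/J releases (newer ones use com.mysql.cj.jdbc.Driver). The javax.jdo.option.* keys are the standard Hive metastore connection properties. The original answer does not show its configuration, so treat this as one possible setup rather than the one actually used.

```python
import xml.etree.ElementTree as ET


def metastore_properties(host, database, user, password):
    """Hive metastore JDBC settings for a MySQL backend (placeholder values)."""
    url = "jdbc:mysql://{}:3306/{}?createDatabaseIfNotExist=true".format(
        host, database
    )
    return {
        "javax.jdo.option.ConnectionURL": url,
        # Assumption: Connector/J 5.x driver class name.
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }


def render_hive_site(properties):
    """Serialize the properties in the hive-site.xml <configuration> layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    props = metastore_properties("metastore-host", "hive_metastore", "hive", "secret")
    print(render_hive_site(props))
```

The generated file would go into Spark's conf directory on every node, with the MySQL JDBC driver jar on the classpath, so that all nodes talk to the same metastore.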
I am having issues with the schema for Hive tables being out of sync between Spark and Hive on a MapR cluster with Spark 2.1.0 and Hive 2.1.1.
I need to try to resolve this problem specifically for managed tables, but the issue can be reproduced with unmanaged/external tables.
Overview of Steps
1. Use saveAsTable to save a dataframe to a given table.
2. Use mode("overwrite").parquet("path/to/table") to overwrite the data for the previously saved table. I am actually modifying the data through a process external to Spark and Hive, but this reproduces the same issue.
3. Use spark.catalog.refreshTable(...) to refresh metadata.
4. Query the table with spark.table(...).show(). Any columns that were the same between the original dataframe and the overwriting one will show the new data correctly, but any columns that were only in the new table will not be displayed.
Example
db_name = "test_39d3ec9"
table_name = "overwrite_existing"
table_location = "<spark.sql.warehouse.dir>/{}.db/{}".format(db_name, table_name)
qualified_table = "{}.{}".format(db_name, table_name)
spark.sql("CREATE DATABASE IF NOT EXISTS {}".format(db_name))
Save as a managed table
existing_df = spark.createDataFrame([(1, 2)])
existing_df.write.mode("overwrite").saveAsTable(qualified_table)
Note that saving as an unmanaged table with the following will produce the same issue:
existing_df.write.mode("overwrite") \
    .option("path", table_location) \
    .saveAsTable(qualified_table)
View the contents of the table
spark.table(qualified_table).show()
+---+---+
| _1| _2|
+---+---+
| 1| 2|
+---+---+
Overwrite the parquet files directly
new_df = spark.createDataFrame([(3, 4, 5, 6)], ["_4", "_3", "_2", "_1"])
new_df.write.mode("overwrite").parquet(table_location)
View the contents with the parquet reader; they show correctly
spark.read.parquet(table_location).show()
+---+---+---+---+
| _4| _3| _2| _1|
+---+---+---+---+
| 3| 4| 5| 6|
+---+---+---+---+
Refresh spark's metadata for the table and read in again as a table. The data will be updated for the columns that were the same, but the additional columns do not display.
spark.catalog.refreshTable(qualified_table)
spark.table(qualified_table).show()
+---+---+
| _1| _2|
+---+---+
| 6| 5|
+---+---+
I have also tried updating the schema in hive before calling spark.catalog.refreshTable with the below command in the hive shell:
ALTER TABLE test_39d3ec9.overwrite_existing REPLACE COLUMNS (`_1` bigint, `_2` bigint, `_3` bigint, `_4` bigint);
After running the ALTER command I then run describe and it shows correctly in hive
DESCRIBE test_39d3ec9.overwrite_existing
OK
_1 bigint
_2 bigint
_3 bigint
_4 bigint
Before running the alter command it only shows the original columns as expected
DESCRIBE test_39d3ec9.overwrite_existing
OK
_1 bigint
_2 bigint
I then ran spark.catalog.refreshTable, but it didn't affect Spark's view of the data.
Additional Notes
From the Spark side, I did most of my testing with PySpark, but I also tested in a spark-shell (Scala) and a spark-sql shell. While in the spark-shell I also tried using a HiveContext, but that didn't work either.
import org.apache.spark.sql.hive.HiveContext
import spark.sqlContext.implicits._
val hiveObj = new HiveContext(sc)
hiveObj.refreshTable("test_39d3ec9.overwrite_existing")
After performing the ALTER command in the hive shell, I verified in Hue that the schema also changed there.
I also tried running the ALTER command with spark.sql("ALTER ...") but the version of Spark we are on (2.1.0) does not allow it, and looks like it won't be available until Spark 2.2.0 based on this issue: https://issues.apache.org/jira/browse/SPARK-19261
I have also read through the spark docs again, specifically this section: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#hive-metastore-parquet-table-conversion
Based on those docs, spark.catalog.refreshTable should work. The configuration for spark.sql.hive.convertMetastoreParquet is typically false, but I switched it to true for testing and it didn't seem to affect anything.
Any help would be appreciated, thank you!
I faced a similar issue while using Spark 2.2.0 in the CDH 5.11.x package.
After df.write.mode("overwrite").saveAsTable(), calling spark.read.table().show() displayed no data.
On checking, I found it was a known issue with the CDH Spark 2.2.0 version. The workaround is to run the commands below after the saveAsTable command has executed.
spark.sql("ALTER TABLE qualified_table set SERDEPROPERTIES ('path'='hdfs://{hdfs_host_name}/{table_path}')")
spark.catalog.refreshTable("qualified_table")
For example, if your table LOCATION is hdfs://hdfsHA/user/warehouse/example.db/qualified_table,
then set 'path'='hdfs://hdfsHA/user/warehouse/example.db/qualified_table'.
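The two commands above could be wrapped in a small helper, sketched here in Python. The function names are invented for this example, and the spark argument is assumed to be an existing SparkSession. The statement text itself follows the workaround quoted above.

```python
def serde_path_statement(qualified_table, table_location):
    """Build the ALTER TABLE statement that sets the SerDe 'path' property."""
    return "ALTER TABLE {} SET SERDEPROPERTIES ('path'='{}')".format(
        qualified_table, table_location
    )


def apply_path_workaround(spark, qualified_table, table_location):
    """Run the statement, then refresh Spark's cached metadata for the table."""
    spark.sql(serde_path_statement(qualified_table, table_location))
    spark.catalog.refreshTable(qualified_table)


if __name__ == "__main__":
    # Table name and location taken from the example in the answer.
    print(
        serde_path_statement(
            "example.qualified_table",
            "hdfs://hdfsHA/user/warehouse/example.db/qualified_table",
        )
    )
```

The location passed in should be exactly the LOCATION reported for the table, for instance by DESCRIBE FORMATTED.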
This worked for me. I assume your issue has been resolved by now; if not, give this method a try.
workaround source: https://www.cloudera.com/documentation/spark2/2-2-x/topics/spark2_known_issues.html
I am using a DataStax cluster with DSE 5.0.5.
[cqlsh 5.0.1 | Cassandra 3.0.11.1485 | DSE 5.0.5 | CQL spec 3.4.0 | Native proto
I am using spark-cassandra-connector 1.6.8.
I tried to implement the code below, but the import is not working.
val rdd: RDD[SomeType] = ... // create some RDD to save
import com.datastax.bdp.spark.writer.BulkTableWriter._
rdd.bulkSaveToCassandra(keyspace, table)
Can someone suggest how to implement this code? Are there any dependencies required for this?
The Spark Cassandra Connector has a saveToCassandra method that can be used like this (taken from the documentation):
val collection = sc.parallelize(Seq(("cat", 30), ("fox", 40)))
collection.saveToCassandra("test", "words", SomeColumns("word", "count"))
There is also saveAsCassandraTableEx, which lets you control schema creation and other things; it is described in the same documentation.
To use them you need to import com.datastax.spark.connector._, as described in the "Connecting to Cassandra" document.
You also need to add the corresponding dependency, but how you do that depends on which build system you use.
The bulkSaveToCassandra method is available only when you're using DSE's connector. You need to add the corresponding dependencies; see the documentation for more details. But even the primary developer of the Spark connector says that it's better to use saveToCassandra instead.
I am running a CDH distribution (version 5.6.0) with Impala (version 2.4.0).
I have some Parquet files stored in HDFS. Next, I have loaded these files into an Impala external table with the following query:
create external table parquetTable
like parquet 'hdfs://cloudera-impala-mn0.eastus.cloudapp.azure.com:8020/user/root/big_data/part-r-00015-66cf01ca-ffee-4a62-b2c3-c09177ec4bd7.gz.parquet'
stored as parquet location 'hdfs://cloudera-impala-mn0.eastus.cloudapp.azure.com:8020/user/root/big_data/';
Upon executing the following query all the files are successfully listed:
[cloudera-impala-dn0.eastus.cloudapp.azure.com:21000] > show files in parquettable;
Also, the metadata is correct (checked by executing describe parquettable).
The stats of the table are:
[cloudera-impala-dn0.eastus.cloudapp.azure.com:21000] > show table stats parquettable;
Rows | Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats | Location
-1 | 838 | 249.64GB | NOT CACHED | NOT CACHED | PARQUET | false | hdfs://cloudera-impala-mn0.eastus.cloudapp.azure.com:8020/user/root/big_data
Executing the following query:
[cloudera-impala-dn0.eastus.cloudapp.azure.com:21000] > select count(*) from parquettable;
results in the following WARNING, but without any output result or error:
File 'hdfs://cloudera-impala-mn0.eastus.cloudapp.azure.com:8020/user/root/big_data/part-r-00001-7c29b85c-bd1f-420e-8834-96300076a92d.gz.parquet' has an invalid version number: ▒.F/
This could be due to stale metadata. Try running "refresh default.parquettable".
Running refresh default.parquettable did not have any effect.
Any help will be appreciated!
Your steps look good. The error complains about part-r-00001-7c29b85c-bd1f-420e-8834-96300076a92d.gz.parquet, while you used part-r-00015-66cf01ca-ffee-4a62-b2c3-c09177ec4bd7.gz.parquet when creating the table. So it looks like the problem is in part-r-00001-7c29b85c-bd1f-420e-8834-96300076a92d.gz.parquet itself. Can you remove all files from the big_data directory except part-r-00015-66cf01ca-ffee-4a62-b2c3-c09177ec4bd7.gz.parquet and check whether the query then works?
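One way to narrow down which files are damaged, sketched in Python under the assumption that the files have first been copied to a local directory (for example with hdfs dfs -get): a Parquet file starts and ends with the 4-byte magic PAR1, so a file whose first or last four bytes differ is a likely candidate for the "invalid version number" warning. The function names are made up for this example, and the check covers only the magic bytes, not the footer metadata.

```python
import os

PARQUET_MAGIC = b"PAR1"
# Smallest plausible file: magic + 4-byte footer length + magic.
MIN_SIZE = 12


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < MIN_SIZE:
        return False
    with open(path, "rb") as handle:
        head = handle.read(len(PARQUET_MAGIC))
        handle.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = handle.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_suspect_files(directory):
    """List .parquet files in the directory that fail the magic-byte check."""
    suspects = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".parquet"):
            if not has_parquet_magic(os.path.join(directory, name)):
                suspects.append(name)
    return suspects


if __name__ == "__main__":
    # "." is a placeholder for the local copy of the big_data directory.
    for name in find_suspect_files("."):
        print("suspect:", name)
```

Files reported here could then be moved out of the table location before running refresh again.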