I am trying to use Spark Structured Streaming - writeStream API to write to an External Partitioned Hive table.
CREATE EXTERNAL TABLE `XX`(
  `a` string,
  `b` string,
  `c` string,
  `happened` timestamp,
  `processed` timestamp,
  `d` string,
  `e` string,
  `f` string)
PARTITIONED BY (
  `year` int, `month` int, `day` int)
CLUSTERED BY (d)
INTO 6 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.compression.strategy'='SPEED',
  'orc.create.index'='true',
  'orc.encoding.strategy'='SPEED');
and in Spark code,
val hiveOrcWriter: DataStreamWriter[Row] = event_stream
  .writeStream
  .outputMode("append")
  .format("orc")
  .partitionBy("year", "month", "day")
  //.option("compression", "zlib")
  .option("path", _table_loc)
  .option("checkpointLocation", _table_checkpoint)
I see that with a non-partitioned table, records are inserted into Hive. However, when using the partitioned table, the Spark job does not fail or raise exceptions, yet no records are inserted into the Hive table.
Appreciate comments from anyone who has dealt with similar problems.
Edit:
Just discovered that the .orc files are indeed written to HDFS, with the correct partition directory structure: e.g. /_table_loc/_table_name/year/month/day/part-0000-0123123.c000.snappy.orc
However
select * from XX limit 1; (or with a where year=2018 filter)
returns no rows.
The InputFormat and OutputFormat for the Table 'XX' are org.apache.hadoop.hive.ql.io.orc.OrcInputFormat and
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat respectively.
This feature isn't provided out of the box in Structured Streaming. In normal (batch) processing you would use dataset.write.saveAsTable(table_name), but that method isn't available here.
After processing and saving the data to HDFS, you can update the partitions manually (or with a script that does this on a schedule):
If you use Hive
MSCK REPAIR TABLE table_name
If you use Impala
ALTER TABLE table_name RECOVER PARTITIONS
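For example, a minimal sketch (assuming the SparkSession spark was created with enableHiveSupport() and the writer is the hiveOrcWriter from the question) that starts the stream and refreshes the metastore on a simple schedule:
val query = hiveOrcWriter.start()     // start the DataStreamWriter defined in the question

while (query.isActive) {
  // re-register any new partition directories so Hive queries can see them
  spark.sql("MSCK REPAIR TABLE XX")   // or: spark.sql("ALTER TABLE XX RECOVER PARTITIONS")
  Thread.sleep(5 * 60 * 1000)         // every 5 minutes; adjust to taste
}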
Related
I'm trying to read parquet files structured as:
filename/year=2020/month=12/day=1
The files are under a mounted Azure Storage path, following this layout: /mnt/silver/root_folder/folder_A/parquet/year=2020/month=01/day=1
I'm trying to create a table using this syntax:
CREATE TABLE tablename
(
FIELD1 string,
...
,FIELDn Date
,Year INT
,Month INT
,Day INT
)
USING org.apache.spark.sql.parquet
LOCATION '/mnt/silver/root_folder/folder_A/parquet/'
OPTIONS( 'compression'='snappy')
PARTITIONED BY (Year, Month, Day)
But every option I tried for LOCATION returns no results.
I already tried:
/mnt/silver/folder/folder/parquet/* and also many variations of it.
Any suggestion please?
You need to execute MSCK REPAIR TABLE <table_name> or ALTER TABLE <table_name> RECOVER PARTITIONS - either of them forces re-discovery of the data in the partitions.
From the documentation:
When creating a table using PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. However, if the partitioned table is created from existing data, partitions are not registered automatically in the Hive metastore
P.S. when you use Delta, that's done automatically, so that's one of the good reasons for using it :-)
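For example, a minimal sketch (Scala, using the table name from the question) of registering the partitions once the table exists, then checking what the metastore now sees:
spark.sql("MSCK REPAIR TABLE tablename")                  // or: ALTER TABLE tablename RECOVER PARTITIONS
spark.sql("SHOW PARTITIONS tablename").show(100, false)   // the year/month/day partitions should now be listed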
I'm running a 1-node cluster of Kafka, Spark and Cassandra. All locally on the same machine.
From a simple Python script I'm streaming some dummy data every 5 seconds into a Kafka topic. Then using Spark structured streaming, I'm reading this data stream (one row at a time) into a PySpark DataFrame with startingOffset = latest. Finally, I'm trying to append this row to an already existing Cassandra table.
I've been following (How to write streaming Dataset to Cassandra?) and (Cassandra Sink for PySpark Structured Streaming from Kafka topic).
One row of data is being successfully written into the Cassandra table but my problem is it's being overwritten every time rather than appended to the end of the table. What might I be doing wrong?
Here's my code:
CQL DDL for creating kafkaspark keyspace followed by randintstream table in Cassandra:
DESCRIBE keyspaces;
CREATE KEYSPACE kafkaspark
WITH REPLICATION = {
'class' : 'SimpleStrategy',
'replication_factor' : 1
};
USE kafkaspark;
CREATE TABLE randIntStream (
    key int,
    value int,
    topic text,
    partition int,
    offset bigint,
    timestamp timestamp,
    timestampType int,
    PRIMARY KEY (partition, topic)
);
Launch PySpark shell
./bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=127.0.0.1 --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
Read latest message from Kafka topic into streaming DataFrame:
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("startingOffsets","latest").option("subscribe","topic1").load()
Some transformations and checking schema:
df2 = df.withColumn("key", df["key"].cast("string")).withColumn("value", df["value"].cast("string"))
df3 = df2.withColumn("key", df2["key"].cast("integer")).withColumn("value", df2["value"].cast("integer"))
df4 = df3.withColumnRenamed("timestampType","timestamptype")
df4.printSchema()
Function for writing to Cassandra:
def writeToCassandra(writeDF, epochId):
    writeDF.write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="randintstream", keyspace="kafkaspark") \
        .mode("append") \
        .save()
Finally, query to write to Cassandra from Spark:
query = df4.writeStream \
    .trigger(processingTime="5 seconds") \
    .outputMode("update") \
    .foreachBatch(writeToCassandra) \
    .start()
SELECT * on table in Cassandra:
If the row is always rewritten in Cassandra, then you may have an incorrect primary key in the table - you need to make sure that every row has a unique primary key. If you're creating the Cassandra table from Spark, then by default it just takes the first column as the partition key, and that alone may not be unique.
Update after schema was provided:
Yes, that's the case I was referring to - you have a primary key of (partition, topic), but every row you read from a given partition of that topic will have the same primary key value, so it will overwrite the previous versions. You need to make your primary key unique - for example, add the offset or timestamp columns to the primary key (although timestamp may not be unique if data is produced within the same millisecond).
P.S. Also, with connector 3.0.0 you don't need foreachBatch; you can write the streaming DataFrame directly (note that a checkpointLocation option is still required for this sink):
df4.writeStream \
    .trigger(processingTime="5 seconds") \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="randintstream", keyspace="kafkaspark") \
    .outputMode("update") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start()
P.P.S. If you just want to move data from Kafka into Cassandra, you may consider using the DataStax Kafka Connector, which can be much more lightweight than Spark.
I am trying to create a Spark DataFrame on an existing HBase table (HBase is secured via Kerberos). I need to perform some Spark SQL operations on this table.
I have tried creating an RDD on the HBase table but am unable to convert it into a DataFrame.
You can create a Hive external table with the HBase storage handler and then use that table to run your Spark SQL queries.
Creating the Hive external table:
CREATE EXTERNAL TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');
Spark-sql:
val df=spark.sql("SELECT * FROM foo WHERE …")
Note: Here spark is a SparkSession
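For completeness, a hedged sketch of the SparkSession this relies on (names are illustrative; hive-site.xml and hbase-site.xml are assumed to be on the classpath), since Hive support must be enabled for the HBase-backed table to be visible:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hbase-over-hive")   // illustrative name
  .enableHiveSupport()          // required so spark.sql can resolve the Hive table foo
  .getOrCreate()

val df = spark.sql("SELECT rowkey, a, b FROM foo")
df.show()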
I am using the Spark DataFrame writer to write data to internal Hive tables in Parquet format in IBM Cloud Object Storage.
So, my Hive metastore is in an HDP cluster and I am running the Spark job from that HDP cluster. This Spark job writes the data to IBM COS in Parquet format.
This is how I am starting the Spark session:
SparkSession session = SparkSession.builder().appName("ParquetReadWrite")
        .config("hive.metastore.uris", "<thrift_url>")
        .config("spark.sql.sources.bucketing.enabled", true)
        .enableHiveSupport()
        .master("yarn").getOrCreate();
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.api.key",credentials.get(ConnectionConstants.COS_APIKEY));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.service.id",credentials.get(ConnectionConstants.COS_SERVICE_ID));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.endpoint",credentials.get(ConnectionConstants.COS_ENDPOINT));
The issue that I am facing is that when I partition the data and store it (via partitionBy), I am unable to access the data directly from Spark SQL:
spark.sql("select * from partitioned_table").show
To fetch the data from the partitioned table, I have to load the DataFrame, register it as a temp table, and then query it.
The above issue does not occur when the table is not partitioned.
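Roughly, the temp-table workaround looks like this (Scala shown for brevity; session is the SparkSession from above, PARQUET_PATH and tableName are the names used in the write snippet below):
val df = session.read.parquet(PARQUET_PATH + tableName)   // load by path instead of by table name
df.createOrReplaceTempView("tmp_partitioned_table")       // hypothetical view name
session.sql("select * from tmp_partitioned_table").show()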
The code to write the data is this
dfWithSchema.orderBy(sortKey).write()
        .partitionBy("somekey")
        .mode("append")
        .format("parquet")
        .option("path", PARQUET_PATH + tableName)
        .saveAsTable(tableName);
Any idea why the direct query approach is not working for the partitioned tables in COS/Parquet?
To read a partitioned table (created by Spark), you need to give the absolute path of the table, as below.
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
To filter it further, try the approach below.
selected_Data.where(col("column_name")=='col_value').show()
This issue occurs when the property hive.metastore.try.direct.sql is set to true in the Hive metastore configuration and the Spark SQL query is run over a non-STRING type partition column.
For Spark, it is recommended to create tables with partition columns of STRING type.
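For example, a hedged sketch (hypothetical table, column, and location names) of creating the table with a STRING partition column from Spark SQL:
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS partitioned_table (
    id BIGINT,
    payload STRING
  )
  PARTITIONED BY (somekey STRING)   -- STRING, not VARCHAR or INT
  STORED AS PARQUET
  LOCATION 'cos://mybucket.mpcos/warehouse/partitioned_table'   -- hypothetical COS path
""")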
If you are getting the below error message while filtering a Hive partitioned table in Spark:
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
recreate your Hive partitioned table with the partition column datatype as string; then you will be able to access the data directly from Spark SQL.
Otherwise, you have to specify the absolute path of your HDFS location to get the data, in case your partition column has been defined as varchar:
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
However, I was not able to understand why it differentiates between the varchar and string datatypes for the partition column.
I create an external partitioned table in Hive.
In the logs it shows numInputRows, which means the query is working and sending data. But when I connect to Hive using Beeline and query (select * or count(*)), it's always empty.
def hiveOrcSetWriter[T](event_stream: Dataset[T])(implicit spark: SparkSession): DataStreamWriter[T] = {
  import spark.implicits._
  val hiveOrcSetWriter: DataStreamWriter[T] = event_stream
    .writeStream
    .partitionBy("year", "month", "day")
    .format("orc")
    .outputMode("append")
    .option("compression", "zlib")
    .option("path", _table_loc)
    .option("checkpointLocation", _table_checkpoint)
  hiveOrcSetWriter
}
What can be the issue? I'm unable to understand.
msck repair table tablename
It goes and checks the location of the table and adds partitions if new ones exist.
Add this step to your Spark process in order to query from Hive.
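For example, roughly (using the table name from the command above):
spark.sql("msck repair table tablename")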
Your streaming job is writing new partitions to the table location, but the Hive metastore is not aware of them.
When you run a select query on the table, Hive checks the metastore to get the list of table partitions. Since the information in the metastore is outdated, the data doesn't show up in the result.
You need to run -
ALTER TABLE <TABLE_NAME> RECOVER PARTITIONS
command from Hive/Spark to update the metastore with new partition info.
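If you prefer to keep this inside the streaming job itself, one option (a sketch; my_table is a placeholder, and the SparkSession is passed explicitly to the helper above) is to refresh the metastore after every micro-batch with a StreamingQueryListener:
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // pick up any partition directories written by the last micro-batch
    spark.sql("ALTER TABLE my_table RECOVER PARTITIONS")   // my_table is a placeholder
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})

hiveOrcSetWriter(event_stream)(spark).start()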