Presto hive connector reads .zst file - apache-spark

For the following query, both Hive and Spark SQL work fine,
but the result returned by Presto (Hive connector) comes back with wrong encoding/decoding.
How should I configure the Hive connector, or does Presto simply not support reading zstd?
hive table:
CREATE TABLE mydb.testtb (
mid varchar COMMENT 'mid',
day varchar
)
WITH (
external_location = 'hdfs://userx/mydb/testtb/',
format = 'TEXTFILE',
partitioned_by = ARRAY['day']
)
The files on HDFS are written with zstd compression, e.g.
.../testtb/day=20221113/part-00020-63c1xxxxx000.zst
SQL
select * from mydb.testtb where day=20221113 limit 5
The result returned by Presto shows garbled characters (screenshot omitted).
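As a sanity check (not from the original post), the same files can be read directly with Spark, which picks the decompression codec from the .zst extension via Hadoop's codec factory; if the rows print as readable UTF-8 here, the corruption is happening on the Presto side rather than in the files. A minimal sketch using the path from the table definition:

import org.apache.spark.sql.SparkSession

// Diagnostic sketch only: read the zstd-compressed text files directly.
// Hadoop maps the .zst extension to its zstd codec, so readable output
// here means the files themselves decode fine.
val spark = SparkSession.builder().appName("zstd-check").getOrCreate()
val raw = spark.read.text("hdfs://userx/mydb/testtb/day=20221113/")
raw.show(5, truncate = false)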

Related

Saving partitioned table with BigQuery Spark connector

I wanted to create a table from PySpark with the two options below (partition by and require partition filter), but I can't see a way to do this with the BigQuery connector.
This is how I would do it in BigQuery:
CREATE TABLE dataset.table
PARTITION BY DATE_TRUNC(collection_date, DAY)
OPTIONS (require_partition_filter = TRUE)
AS SELECT XXXX
This is what I normally do
dataframe
.write
.format("bigquery")
.mode(mode)
.save(f"{dataset}.{table_name}")
You can use the partitionField, datePartition and partitionType options.
For clustering, use clusteredFields.
See more options:
https://github.com/GoogleCloudDataproc/spark-bigquery-connector#properties
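For reference, here is a sketch of how those options might be attached to the write shown above (Scala here; partitionField, partitionType and clusteredFields are the option names cited in the answer, collection_date comes from the question, and the remaining dataset/table/column names are placeholders):

import org.apache.spark.sql.SparkSession

// Sketch only: wire the partitioning/clustering options named above into the write.
val spark = SparkSession.builder().appName("bq-write").getOrCreate()
val df = spark.table("source_table")               // placeholder source table
df.write
  .format("bigquery")
  .option("partitionField", "collection_date")     // column to partition by
  .option("partitionType", "DAY")                  // partition granularity
  .option("clusteredFields", "col_a,col_b")        // comma-separated clustering columns
  .mode("overwrite")
  .save("my_dataset.my_table")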

Create table stored as parquet and compressed with snappy does not work

I have tried to save data to HDFS as Parquet with Snappy compression:
spark.sql("drop table if exists onehands.parquet_snappy_not_work")
spark.sql(""" CREATE TABLE onehands.parquet_snappy_not_work (`trans_id` INT) PARTITIONED by ( `year` INT) STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY") """)
spark.sql("""insert into onehands.parquet_snappy_not_work values (20,2021)""")
spark.sql("drop table if exists onehands.parquet_snappy_works_well")
val df = spark.createDataFrame(Seq(
(20, 2021)
)).toDF("trans_id", "year")
df.show()
df.write.format("parquet").partitionBy("year").mode("append").option("compression","snappy").saveAsTable("onehands.parquet_snappy_works_well")
but it's not working with the pre-created table:
for onehands.parquet_snappy_not_work the output file does not end with .snappy.parquet,
while onehands.parquet_snappy_works_well looks to be working very well.
[***]$ hadoop fs -ls /data/spark/warehouse/onehands.db/parquet_snappy_works_well/year=2021
/data/spark/warehouse/onehands.db/parquet_snappy_works_well/year=2021/part-00000-f5ec0f2d-525f-41c9-afee-ce5589ddfe27.c000.snappy.parquet
[****]$ hadoop fs -ls /data/spark/warehouse/onehands.db/parquet_snappy_not_work/year=2021
/data/spark/warehouse/onehands.db/parquet_snappy_not_work/year=2021/part-00000-85e2a7a5-c281-4960-9786-4c0ea88faf15.c000
even though I have tried adding some properties:
SET hive.exec.compress.output=true;
SET mapred.compress.map.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec;
it still does not work.
By the way, the SQL I get from "show create table onehands.parquet_snappy_works_well", i.e.
CREATE TABLE `onehands`.`parquet_snappy_works_well` (`trans_id` INT, `year` INT) USING parquet OPTIONS ( `compression` 'snappy', `serialization.format` '1' ) PARTITIONED BY (year)
cannot be run with spark-sql.
Spark version: 2.3.1
Hadoop version: 2.9.2
What's the problem with my code? Thanks for your help.
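No answer is shown in the thread, but one thing worth checking (an assumption on my part, not a confirmed fix) is the Spark-side codec setting: when Spark writes into a Parquet table it uses its own Parquet writer, whose codec is controlled by spark.sql.parquet.compression.codec rather than by the mapred/hive properties listed above. A minimal sketch:

// Sketch only: set Spark's own Parquet codec, then re-run the insert
// into the pre-created table and check the resulting file name/footer.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.sql("insert into onehands.parquet_snappy_not_work values (20, 2021)")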

spark cassandra connector problem using catalogs

I am following the instructions found here to connect my Spark program to read data from Cassandra. Here is how I have configured Spark:
val configBuilder = SparkSession.builder
.config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
.config("spark.cassandra.connection.host", cassandraUrl)
.config("spark.cassandra.connection.port", 9042)
.config("spark.sql.catalog.myCatalogName", "com.datastax.spark.connector.datasource.CassandraCatalog")
According to the documentation, once this is done I should be able to query Cassandra like this:
spark.sql("select * from myCatalogName.myKeyspace.myTable where myPartitionKey = something")
however when I do so I get the following error message:
mismatched input '.' expecting <EOF>(line 1, pos 43)
== SQL ==
select * from myCatalog.myKeyspace.myTable where myPartitionKey = something
----------------------------------^^^
When I try in the following format I am successful at retrieving entries from Cassandra:
val frame = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "myKeyspace", "table" -> "myTable"))
.load()
.filter(col("timestamp") > startDate && col("timestamp") < endDate)
However, this query requires a full table scan. The table contains a few million entries, and I would prefer to use the predicate pushdown functionality, which it seems is only available via the SQL API.
I am using spark-core_2.11:2.4.3, spark-cassandra-connector_2.11:2.5.0 and Cassandra 3.11.6
Thanks!
The Catalogs API is available only in SCC version 3.0, which has not been released yet. It will come out together with the Spark 3.0 release, so it isn't available in SCC 2.5.0. For 2.5.0 you need to register your table explicitly, with create or replace temporary view..., as described in the docs:
spark.sql("""CREATE TEMPORARY VIEW myTable
USING org.apache.spark.sql.cassandra
OPTIONS (
table "myTable",
keyspace "myKeyspace",
pushdown "true")""")
Regarding the pushdowns (they work the same for all DataFrame APIs: SQL, Scala, Python, ...): such filtering will happen when your timestamp is the first clustering column. Even in that case, the typical problem is that you may specify startDate and endDate as strings, not timestamps. You can check by executing frame.explain and verifying that the predicate is pushed down; it should have a * marker next to the predicate name.
For example,
val data = spark.read.cassandraFormat("sdtest", "test").load()
val filtered = data.filter("ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp) AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)")
val not_filtered = data.filter("ts >= '2019-03-10T14:41:34.373+0000' AND ts <= '2019-03-10T19:01:56.316+0000'")
The first filter expression will push the predicate down, while the second (not_filtered) will require a full scan.
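A quick way to see the difference in practice (the exact plan text varies by Spark and connector version) is to compare the two physical plans:

// The pushed-down predicate appears on the Cassandra scan node (marked with *),
// while the string comparison shows up only as a separate Filter step.
filtered.explain()
not_filtered.explain()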

Query Redshift from Hive not pushing down predicates

I launched an AWS EMR cluster with EMR 5.28.0, Spark and Hive.
I was used to using Spark SQL with the spark-redshift connector, which let me read from and write to Redshift by creating external tables like this:
CREATE TABLE `test`.`redshift_table` (`id` INT, `object_id` STRING)
USING com.databricks.spark.redshift
OPTIONS (
`tempdir` 's3a://my_bucket/table/',
`url` 'jdbc:redshift://xxxxxx:5439/database?user=user&password=password',
`forward_spark_s3_credentials` 'true',
`serialization.format` '1',
`dbtable` 'my.table'
)
Now I am looking for the equivalent thing in Hive:
at least to be able to read a Redshift table from Hive (so I can join Redshift data with other tables from the datalake)
and if possible to write to Redshift from Hive too (so I can create ETLs in the data lake writing some results to Redshift)
I've been looking around, but I'm not sure what the format of the CREATE TABLE would be and whether I need to install anything else on the cluster first.
Thanks
Update:
I have been able to do it with EMR 5.28.0 now using those jars:
https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc-handler/3.1.2
https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.37.1061/RedshiftJDBC42-no-awssdk-1.2.37.1061.jar
and then creating the table in Hive with:
CREATE EXTERNAL TABLE test.table(
id INTEGER,
name STRING
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "POSTGRES",
"hive.sql.jdbc.driver" = "com.amazon.redshift.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:redshift://host:5439/database",
"hive.sql.dbcp.username" = "user",
"hive.sql.dbcp.password" = "password",
"hive.sql.table" = "schema.name",
"hive.sql.dbcp.maxActive" = "1"
);
The issue I have now is that it does not push down predicates to Redshift. For example, "SELECT * FROM test.table WHERE id = 1;" first executes a Redshift query that reads the whole table. Any idea how to change this behavior, please?
I checked the Hive settings and I have:
hive.optimize.ppd=true
hive.optimize.ppd.storage=true

Spark SQL returns null for a column in HIVE table while HIVE query returns non null values

I have a Hive table created on top of S3 data in Parquet format, partitioned by one column named eventdate.
1) When using a Hive query, it returns data for a column named "headertime", which is in the schema of BOTH the table and the file.
select headertime from dbName.test_bug where eventdate=20180510 limit 10
2) From a Scala notebook, directly loading a file from a particular partition also works:
val session = org.apache.spark.sql.SparkSession.builder
.appName("searchRequests")
.enableHiveSupport()
.getOrCreate;
val searchRequest = session.sqlContext.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510")
searchRequest.createOrReplaceTempView("SearchRequest")
val exploreDF = session.sql("select headertime from SearchRequest where SearchRequestHeaderDate='2018-05-10' limit 100")
exploreDF.show(20)
This also displays the values for the column "headertime".
3) But when using Spark SQL to query the Hive table directly, as below,
val exploreDF = session.sql("select headertime from tier3_vsreenivasan.test_bug where eventdate=20180510 limit 100")
exploreDF.show(20)
it always returns null.
I opened the Parquet file and can see that the column headertime is present with values, but I'm not sure why Spark SQL is not able to read the values for that column.
It would be helpful if someone could point out where Spark SQL gets its schema from; I was expecting it to behave similarly to the Hive query.
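One diagnostic worth running (an assumption, not from the thread) is to compare the schema Spark derives for the Hive table with the schema read straight from the Parquet files, since a column-name case mismatch between the lowercased metastore schema and the case-sensitive Parquet schema is a common cause of all-null columns:

// Diagnostic sketch: print both schemas and compare the spelling/casing of headertime.
session.table("tier3_vsreenivasan.test_bug").printSchema()
session.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510").printSchema()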
