I wanted to create a table using from pyspark with the below two options (partion by and require filter) but I can't see an option to do this with the bigquery connector
This is how I would do it in BigQuery
CREATE dataset.table AS SELECT XXXX
PARTITION BY
DATE_TRUNC(collection_date, DAY) OPTIONS ( require_partition_filter = TRUE)
This is what I normally do
dataframe
.write
.format("bigquery")
.mode(mode)
.save(f"{dataset}.{table_name}")
You can use partitionField, datePartition, partitionType
For Clustering use - clusteredFields
See more options:
https://github.com/GoogleCloudDataproc/spark-bigquery-connector#properties
Related
Is there any way to get updated/inserted rows after upsert using merge to Delta table in spark streaming job?
val df = spark.readStream(...)
val deltaTable = DeltaTable.forName("...")
def upsertToDelta(events: DataFrame, batchId: Long) {
deltaTable.as("table")
.merge(
events.as("event"),
"event.entityId == table.entityId")
.whenMatched()
.updateExpr(...))
.whenNotMatched()
.insertAll()
.execute()
}
df
.writeStream
.format("delta")
.foreachBatch(upsertToDelta _)
.outputMode("update")
.start()
I know I can create another job to read updates from the delta table. But is it possible to do the same job? From what I can see, execute() returns Unit.
You can enable Change Data Feed on the table, and then have another stream or batch job to fetch the changes, so you'll able to receive information on what rows has changed/deleted/inserted. It could be enabled with:
ALTER TABLE table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
if thable isn't registered, you can use path instead of table name:
ALTER TABLE delta.`path` SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
The changes will be available if you add the .option("readChangeFeed", "true") option when reading stream from a table:
spark.readStream.format("delta") \
.option("readChangeFeed", "true") \
.table("table_name")
and it will add three columns to table describing the change - the most important is _change_type (please note that there are two different types for update operation).
If you're worried about having another stream - it's not a problem, as you can run multiple streams inside the same job - you just don't need to use .awaitTermination, but something like spark.streams.awaitAnyTermination() to wait on multiple streams.
P.S. But maybe this answer will change if you explain why you need to get changes inside the same job?
Hi I have 2 table in my hive in which from first table i m selecting data creating dataframe and saving that dataframe into another table in orc format.I have created both the tables in same database.
when I am saving this dataframe into 2nd table I'm getting table not found in database issue.and if i m not using any databasename then it always creating and saving my df in hive default database.can someone please guide me why its not taking userdefined database and always taking as default database?below is code which I m using,and also i m using HDP.
//creating hive session
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(sparksession).build()
hive.setDatabase("dbname")
var a= "SELECT 'all columns' from dbname.tablename"
val a1=hive.executeQuery(a)
a1.write
.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
.option("database", "dbname")
.option("table", "table_name")
.mode("Append")
.insertInto("dbname.table_name")
instead of insertInto(dbname.table_name) if I'm using insertInto(table_name) then its is saving dataframe in default database. But if I'm giving dbname.tablename then its showing table not found in database.
I also tried same using dbSession using.
val dbSession = HiveWarehouseSession.session(sparksession).build()
dbSession.setDatabase("dbname")
Note: My second table(target table where I'm writing data) is a partitioned and bucketed table.
// 2. partitionBy(...)
{ a1.write
.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
.option("database", "dbname")
.option("table", "table_name")
.mode("Append")
.insertInto("dbname.table_name")
// My second table(target table where I'm writing data) is a partitioned and bucketed table. add .partitionBy(<list cols>)
}
I save Spark dataframe to Apache Ignite table with this code:
df.write\
.format("ignite")\
.option("table","REPORT")\
.option("primaryKeyFields", ', '.join(map(str, df.schema.names[:-1])))\
.option("config",configFile)\
.option("compression", "gzip")\
.mode("overwrite")\
.save()
But, I cannot find how create index on field with this owerwrite-saving.
I need this, but on .save() operation:
CREATE INDEX REPORT_FIELD_IDX ON PUBLIC.REPORT (FIELD)
It's pretty simple to do using the syntax like next:
CREATE INDEX IF NOT EXISTS AGE_IDX ON "PUBLIC".Person (AGE)
In case if a new table wasn't created then IF NOT EXISTS will work and nothing will be done. Otherwise, the index will be created.
It can be run using any SQL tool that can be used with Ignite (webconsole, visor, sqlline, jdbc, odbc, etc) but I guess that you are going to do it from Spark job. So you can try to use IgniteSparkSession or IgniteRDD to run SQL over Ignite:
IgniteSparkSession igniteSession = IgniteSparkSession.builder()
.appName("Spark Ignite example")
.igniteConfig(configPath)
.getOrCreate();
igniteSession.sqlContext().sql("CREATE INDEX IF NOT EXISTS AGE_IDX ON \"PUBLIC\".Person (AGE)");
or
val cacheRdd = igniteContext.fromCache("partitioned")
val result = cacheRdd.sql(
"CREATE INDEX IF NOT EXISTS AGE_IDX ON \"PUBLIC\".Person (AGE)")
No, you can't do that when saving DataFrame with Spark. Creating a table and creating an index are 2 different operations.
Here are all the options for DataFrame saving into Ignite, and as you can see, there is no option for index creation.
I launched an AWS EMR cluster with EMR 5.28.0, Spark and Hive.
I was used to Spark SQL with spark-redshift connector that made me able to read/write in Redshift creating external tables like that:
CREATE TABLE `test`.`redshift_table` (`id` INT, `object_id` STRING)
USING com.databricks.spark.redshift
OPTIONS (
`tempdir` 's3a://my_bucket/table/',
`url` 'jdbc:redshift://xxxxxx:5439/database?user=user&password=password',
`forward_spark_s3_credentials` 'true',
`serialization.format` '1',
`dbtable` 'my.table'
)
Now I am looking for the equivalent thing in Hive:
at least to be able to read a Redshift table from Hive (so I can join Redshift data with other tables from the datalake)
and if possible to write to Redshift from Hive too (so I can create ETLs in the data lake writing some results to Redshift)
I've been looking around but I'm not sure what would be the format of the CREATE TABLE and if I need to install something else on the cluster before.
Thanks
Update:
I have been able to do it with EMR 5.28.0 now using those jars:
https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc-handler/3.1.2
https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.37.1061/RedshiftJDBC42-no-awssdk-1.2.37.1061.jar
and then creating the table in Hive with:
CREATE EXTERNAL TABLE test.table(
id INTEGER,
name STRING
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "POSTGRES",
"hive.sql.jdbc.driver" = "com.amazon.redshift.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:redshift://host:5439/database",
"hive.sql.dbcp.username" = "user",
"hive.sql.dbcp.password" = "password",
"hive.sql.table" = "schema.name",
"hive.sql.dbcp.maxActive" = "1"
);
The issue I have now is that it does not push down predicates to Redshift. For example "SELECT * FROM test.table where id = 1;" first executes a Redshift query reading the whole table, any idea how to change this behavior please?
I checked the Hive settings and I have:
hive.optimize.ppd=true
hive.optimize.ppd.storage=true
I am trying to create a datapipeline where the incomng data is stored into parquet and i create and external hive table and users can query the hive table and retrieve data .I am able to save the parquet data and retrieve it directly but when i query the hive table its not returning any rows. I did the following test setup
--CREATE EXTERNAL HIVE TABLE
create external table emp (
id double,
hire_dt timestamp,
user string
)
stored as parquet
location '/test/emp';
Now created dataframe on some data and saved to parquet .
---Create dataframe and insert DATA
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val schema = List(("id", "double"), ("hire_dt", "date"), ("user", "string"))
val newCols= schema.map ( x => col(x._1).cast(x._2))
val newDf = employeeDf.select(newCols:_*)
newDf.write.mode("append").parquet("/test/emp")
newDf.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+-------+----+
|id |hire_dt|user|
+---+-------+----+
+---+-------+----+
As shown above i am able to see the data if i read from parquet directly but not from hive .The question is what i am doing wrong here ? What i am i doing wrong that the hive isnt getting the data. I thought msck repair may be a reason but i get error if i try to do msck repair table saying table not partitioned.
Based on your create table statement, you have used location as /test/emp but while writing data, you are writing at /tenants/gwm/idr/emp. So you will not have data at /test/emp.
CREATE EXTERNAL HIVE TABLE create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/test/emp';
Please re-create external table as
CREATE EXTERNAL HIVE TABLE create external table emp ( id double, hire_dt timestamp, user string ) stored as parquet location '/tenants/gwm/idr/emp';
Apart from the answer given by Ramdev below, you also need to be cautious of using the correct datatype around date/timestamp; as 'date' type is not supported by parquet when creating a hive table.
For that you can change the 'date' type for column 'hire_dt' to 'timestamp'.
Otherwise there will be a mismatch in data you persisting through spark and trying to read in hive (or hive SQL). Keeping it to 'timestamp' at both places will resolve the issue. I hope it helps.
Do you have enableHiveSupport() in your sparkSession builder() statement. Are you able to connect to hive metastore? Try doing show tables/databases in your code to see if you can display tables present at your hive location?
i got this working with below chgn.
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
.withColumn("hire_dt", employeeDf.col("hire_dt".cast(TimestampType))
So basically the issue was datatype mismatch and some how the original code the cast doesn't seem to work. So i did an explicit cast and then write it goes fine and able to query back as well.Logically both are doing the same not sure why the original code not working.
val employeeDf = Seq(("1", "2018-01-01","John"),("2","2018-12-01", "Adam")).toDF("id","hire_dt","user")
val dfTransformed = employeeDf.withColumn("id", employeeDf.col("id").cast(DoubleType))
.withColumn("hire_dt", employeeDf.col("hire_dt".cast(TimestampType))
dfTransformed.write.mode("append").parquet("/test/emp")
dfTransformed.show
--read the contents directly from parquet
val sqlcontext=new org.apache.spark.sql.SQLContext(sc)
sqlcontext.read.parquet("/test/emp").show
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+
--read from the external hive table
spark.sql("select id,hire_dt,user from emp").show(false)
+---+----------+----+
| id| hire_dt|user|
+---+----------+----+
|1.0|2018-01-01|John|
|2.0|2018-12-01|Adam|
+---+----------+----+