CassandraBeamIO conversion into PCollection of Rows - cassandra

I am trying to read data from a Cassandra DB using Apache Beam's CassandraIO. My requirement is creating a PCollection of Rows from the Cassandra DB. Currently my code looks like this:
PTransform<PBegin, PCollection<Row>> transform =
    CassandraIO.<Row>read()
        .withHosts(Collections.singletonList("127.0.0.1"))
        .withPort(9042)
        .withKeyspace("\"testDb\"")
        .withMapperFactoryFn(new CassandraRowMapper())
        .withQuery(q)
        .withTable("student")
        .withEntity(Row.class)
        .withCoder(SerializableCoder.of(Row.class));
Any help will be appreciated.

Related

zeppelin spark read parquet mysql write sql exception

After reading a Parquet file in Apache Spark, I specified the fields with Spark SQL.
If it is executed after setting the fields to match the created database table (types as well), an SQLException occurs in the JDBC write step.
The questionable part is: use near '"URL" TEXT , "GOD_NO" TEXT , "BRND_NM" TEXT , "CTGRY_NM" TEXT , "EVSP_ID" TEX...' at line 1.
It seems a CREATE TABLE statement with double-quoted column names is being generated. I'd like to know what I did wrong, thanks.
(screenshots: spark result, zeppelin interpreter)

update table from Pyspark using JDBC

I have a small log dataframe which holds metadata about the ETL performed within a given notebook; the notebook is part of a bigger ETL pipeline managed in Azure Data Factory.
Unfortunately, it seems that Databricks cannot invoke stored procedures, so I'm manually appending a row with the correct data to my log table.
However, I cannot figure out the correct syntax to update a table given a set of conditions.
The statement I use to append a single row is as follows:
spark_log.write.jdbc(sql_url, 'internal.Job', mode='append')
This works swimmingly. However, as my Data Factory is invoking a stored procedure, I need to work in a query like:
query = f"""
UPDATE [internal].[Job] SET
  [MaxIngestionDate] = '{date}'
, [DataLakeMetadataRaw] = NULL
, [DataLakeMetadataCurated] = NULL
WHERE [IsRunning] = 1
AND [FinishDateTime] IS NULL"""
Is this possible? If so, can someone show me how?
Looking at the documentation, the query parameter only seems to support SELECT statements:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
The target database is an Azure SQL Database.
Just to add, this is a tiny operation, so performance is a non-issue.
You can't do single-record updates using JDBC in Spark with dataframes; you can only append or replace the entire table.
You can do updates using pyodbc, which requires installing the MSSQL ODBC driver (How to install PYODBC in Databricks), or you can use JDBC via JayDeBeApi (https://pypi.org/project/JayDeBeApi/).
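For the pyodbc route, a minimal sketch of the UPDATE from the question. The table and column names come from the question; the connection string is a placeholder you would fill with your Azure SQL server and credentials. Using a ? placeholder keeps the date out of the SQL string (safer than the f-string):

```python
from datetime import date

def build_job_update(ingestion_date):
    """Build a parameterised UPDATE for the [internal].[Job] log table
    from the question, plus its parameter tuple."""
    sql = (
        "UPDATE [internal].[Job] "
        "SET [MaxIngestionDate] = ?, "
        "[DataLakeMetadataRaw] = NULL, "
        "[DataLakeMetadataCurated] = NULL "
        "WHERE [IsRunning] = 1 AND [FinishDateTime] IS NULL"
    )
    return sql, (ingestion_date,)

# With pyodbc (needs the MSSQL ODBC driver on the cluster), execution
# would look like this -- connection details are placeholders:
#   import pyodbc
#   conn = pyodbc.connect(connection_string)
#   sql, params = build_job_update(date.today())
#   cursor = conn.cursor()
#   cursor.execute(sql, params)
#   conn.commit()
```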

Spark returns Empty DataFrame but Populated in Hive

I have a table in hive
db.table_name
When I run the following in hive I get results back
SELECT * FROM db.table_name;
When I run the following in a spark-shell
spark.read.table("db.table_name").show
It shows nothing. Similarly
sql("SELECT * FROM db.table_name").show
Also shows nothing. Selecting arbitrary columns out before the show also displays nothing. Performing a count states the table has 0 rows.
Running the same queries works against other tables in the same database.
Spark Version: 2.2.0.cloudera1
The table is created using
table.write.mode(SaveMode.Overwrite).saveAsTable("db.table_name")
And if I read the file using the parquet files directly it works.
spark.read.parquet(<path-to-files>).show
EDIT:
I'm currently using a workaround: describing the table, getting its location, and using spark.read.parquet on that path.
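That workaround can be sketched as follows. The helper just pulls the Location row out of a DESCRIBE FORMATTED result; spark and db.table_name are assumed from the question:

```python
def table_location(describe_rows):
    """Extract the storage path from `DESCRIBE FORMATTED` output,
    where each row has col_name / data_type fields."""
    for row in describe_rows:
        if row["col_name"].strip() == "Location":
            return row["data_type"].strip()
    raise ValueError("no Location row found in DESCRIBE FORMATTED output")

# With a live SparkSession (hypothetical usage):
#   rows = [r.asDict()
#           for r in spark.sql("DESCRIBE FORMATTED db.table_name").collect()]
#   df = spark.read.parquet(table_location(rows))
```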
Have you refreshed the table metadata? You may need to refresh the table to see new data:
spark.catalog.refreshTable("my_table")
I solved the problem by using
query_result.write.mode(SaveMode.Overwrite).format("hive").saveAsTable("table")
which stores the results as text files.
There is probably some incompatibility with Hive's Parquet support.
I also found a Cloudera report about it (CDH Release Notes): they recommend creating the Hive table manually and then loading the data from a temporary table or by query.

Unable to query HIVE Parquet based EXTERNAL table from spark-sql

We have an external Hive table which is stored as Parquet. I am not the owner of the schema this Hive/Parquet table lives in, so I don't have much info about it.
The problem: when I try to query that table from the spark-sql> shell prompt (not via Scala like spark.read.parquet("path")), I get 0 records and the message "Unable to infer schema". But when I created a managed table using CTAS in my personal schema just for testing, I was able to query it from the spark-sql> shell prompt.
When I try it from spark-shell> via spark.read.parquet("../../00000_0").show(10), I am able to see the data.
So this makes it clear that something is wrong between:
External Hive table - Parquet - Spark SQL (shell)
If locating the schema were the issue, it should behave the same when accessed through the Spark session (spark.read.parquet("")).
I am using MapR 5.2, Spark version 2.1.0.
Please suggest what the issue could be.

PySpark: GET row from HBase using row key

I have a use case to read from HBase inside a PySpark job, and I am currently doing a scan on the HBase table like this:
conf = {"hbase.zookeeper.quorum": host,
        "hbase.cluster.distributed": "true",
        "hbase.mapreduce.inputtable": "table_name",
        "hbase.mapreduce.scan.row.start": start,
        "hbase.mapreduce.scan.row.stop": stop}
rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                         "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                         "org.apache.hadoop.hbase.client.Result",
                         keyConverter=keyConv, valueConverter=valueConv,
                         conf=conf)
I am unable to find a conf to do a GET on the HBase table. Can someone help me? I found that filters are not supported with PySpark, but is it not possible to do a simple GET?
Thanks!
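One approach, as a sketch: TableInputFormat does not expose a direct GET, but a single-row read can be emulated by narrowing the scan range to one key. The scan stop row is exclusive, so appending a zero byte to the key makes the range cover exactly that row. The table name, key, and the keyConv/valueConv converters are assumptions carried over from the question:

```python
def single_row_conf(table, row_key, quorum):
    """Build a TableInputFormat conf whose scan covers exactly one row.
    The stop row is exclusive, so appending a zero byte yields the
    smallest key strictly greater than row_key."""
    return {
        "hbase.zookeeper.quorum": quorum,
        "hbase.cluster.distributed": "true",
        "hbase.mapreduce.inputtable": table,
        "hbase.mapreduce.scan.row.start": row_key,
        "hbase.mapreduce.scan.row.stop": row_key + "\x00",
    }

# Used with the same newAPIHadoopRDD call as in the question:
#   conf = single_row_conf("table_name", "my-row-key", host)
#   rdd = sc.newAPIHadoopRDD(
#       "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
#       "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
#       "org.apache.hadoop.hbase.client.Result",
#       keyConverter=keyConv, valueConverter=valueConv, conf=conf)
```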
