Hi guys,
I need to perform JDBC operations using an Apache Spark DataFrame.
Basically I have a historical JDBC table called Measures on which I have to do two operations:
1. Set the endTime validity attribute of the old measure record to the current time
2. Insert a new measure record, setting endTime to 9999-12-31
Can someone tell me how to perform (if possible) an UPDATE statement for the first operation and an INSERT for the second?
I tried to use this statement for the first operation:
val dfWriter = df.write.mode(SaveMode.Overwrite)
dfWriter.jdbc("jdbc:postgresql:postgres", tableName, prop)
But it doesn't work because there is a duplicate key violation. If we can do an UPDATE, how can we do a DELETE statement?
Thanks in advance.
I don't think it's supported out of the box by Spark yet. What you can do is iterate over the DataFrame/RDD with foreachPartition() and manually update/delete the table using the JDBC API.
Here is a link to a similar question:
Spark Dataframes UPSERT to Postgres Table
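For completeness, a rough sketch of that manual approach in Scala could look like the following. It assumes a Postgres table Measures keyed by a hypothetical measureId column with a value column; those names are placeholders, so adapt them to your schema:

import java.sql.DriverManager

// Per partition: open one JDBC connection, expire the current record, then insert the new one.
// Table and column names (Measures, measureId, value) are placeholders, not your actual schema.
df.rdd.foreachPartition { rows =>
  val conn = DriverManager.getConnection("jdbc:postgresql:postgres", prop)
  try {
    val expire = conn.prepareStatement(
      "UPDATE Measures SET endTime = now() WHERE measureId = ? AND endTime = '9999-12-31'")
    val insert = conn.prepareStatement(
      "INSERT INTO Measures (measureId, value, endTime) VALUES (?, ?, '9999-12-31')")
    rows.foreach { row =>
      expire.setLong(1, row.getAs[Long]("measureId"))
      expire.executeUpdate()
      insert.setLong(1, row.getAs[Long]("measureId"))
      insert.setDouble(2, row.getAs[Double]("value"))
      insert.executeUpdate()
    }
  } finally {
    conn.close()
  }
}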
I am working with graphframes, pyspark, and Hive to process graph data. As I process data I will be building a graph, and will eventually persist it into a Hive table, where I will not update it ever again.
Subsequent runs may have relationships to nodes from previous runs, so I will want to ensure I don't duplicate data.
For example, run #1 might find nodes: A, B, C. Run #2 might re-find node A, and also find new nodes X, Y, Z. I do not want A to appear twice in my table.
I am looking for the best way to handle this and would like to address the following issues:
I will need to track the status of the node as I process metadata associated with it. I will only want to persist the node's data to Hive after I have finished this processing.
I want to ensure that I don't create duplicate data when I encounter the same node (e.g. when I re-find node A above, I don't want to insert another row into Hive)
I am currently tinkering with the best way to do this. I know Hive supports ACID transactions now, but it does not appear as though pyspark currently supports CRUD-type operations. So here is what I'm planning to do:
On each run, create a dataframe to store the nodes I have found.
When a new node is found: check if the node already exists in Hive (e.g. sqlContext.sql("SELECT * FROM existingTable WHERE name = '<NAME>'")). If it does not exist, update the dataframe with x = vertices.withColumn("name", F.when(F.col("id") == "a", "<THE-NEW-NAME>").otherwise(F.col("name"))) to add it to our DataFrame.
Once all the nodes have finished processing, create a temporary view: x.createOrReplaceTempView("myTmpView")
Finally, insert data from my temporary view into an existing table with sqlContext.sql("INSERT INTO TABLE existingTable SELECT * FROM myTmpView")
I think this will work, but it seems extremely hacky. I'm not sure if this is a function of my lack of understanding of Hive/Spark, or if this is just the nature of the tech stack. Is there a better way to do this? Is there a performance cost to handling it in this way?
In the Delta Lake API, upserts (MERGE) are supported using Scala and also Python, which is exactly what you are trying to implement.
https://docs.delta.io/latest/delta-update.html#merge-examples
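For reference, a minimal MERGE sketch with the Scala API (the Python API is analogous) might look like this; the table name nodes, the source DataFrame newNodes, and the join column id are assumptions about your schema:

import io.delta.tables._

// Upsert the nodes found in this run into an existing Delta table.
// "nodes", "newNodes" and the join column "id" are placeholder names.
val existing = DeltaTable.forName(spark, "nodes")
existing.as("t")
  .merge(newNodes.as("s"), "t.id = s.id")
  .whenMatched().updateAll()     // refresh rows for nodes that were re-found
  .whenNotMatched().insertAll()  // add rows for brand-new nodes
  .execute()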
Here is an alternate solution (see the sketch after these steps):
Have an updated_time timestamp column in your table
Union prev_run_results and current_run_results
Group by node and keep the row with the latest timestamp
Save the results
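A hedged sketch of those steps in Scala Spark, assuming the node identifier column is called name and the timestamp column is updated_time (both names are assumptions):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Keep only the most recent row per node across the previous and current runs.
// Column names (name, updated_time) and DataFrame names are assumptions.
val combined = prevRunResults.unionByName(currentRunResults)
val latestFirst = Window.partitionBy("name").orderBy(col("updated_time").desc)
val deduped = combined
  .withColumn("rn", row_number().over(latestFirst))
  .filter(col("rn") === 1)
  .drop("rn")
deduped.write.mode("overwrite").saveAsTable("existingTable")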
I want to insert a huge volume of data from Spark into Cassandra. The data has a timestamp column which determines the TTL, but this differs for each row. My question is: how can I handle the TTL while inserting data in bulk from Spark?
My current implementation:
raw_data_final.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Overwrite)
  .options(Map("table" -> offerTable,
               "keyspace" -> keySpace,
               "spark.cassandra.output.ttl" -> ttl_seconds))
  .save()
Here raw_data_final has around a million records, with each record yielding a different TTL. So, is there a way to do a bulk insert and somehow specify the TTL from a column within raw_data?
Thanks.
This is supported by setting the WriteConf parameter with the TTLOption.perRow option. The official documentation has the following example for RDDs:
import com.datastax.spark.connector.writer._
...
rdd.saveToCassandra("test", "tab", writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))
In your case you need to replace "ttl" with the name of your column holding the TTL.
I'm not sure that you can set this directly on a DataFrame, but you can always get an RDD from the DataFrame and use saveToCassandra with WriteConf...
Update, September 2020: support for writetime and ttl in DataFrames was added in Spark Cassandra Connector 2.5.0.
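As I read the 2.5.0 release notes, the DataFrame writer accepts a ttl option naming the column that holds the per-row TTL; treat that option name and the ttl_column name below as assumptions and verify them against the connector documentation. A sketch:

// Assumes Spark Cassandra Connector >= 2.5.0 and a DataFrame column ttl_column with per-row TTL seconds.
// The "ttl" writer option (naming that column) is taken from the 2.5.0 release notes; verify it.
raw_data_final.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table"    -> offerTable,
               "keyspace" -> keySpace,
               "ttl"      -> "ttl_column"))
  .mode("append")
  .save()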
As per the DataStax documentation, a read before a write in Cassandra is an anti-pattern.
Whenever we use UPDATE, either in cqlsh or via the DataStax drivers, to set a few columns (with IFs and collection updates), does it not do a read before the write first? Is that not an anti-pattern? Am I missing something?
P.S. I am not talking about mere UPSERTs but UPDATEs on specific columns.
TIA!
No, UPDATE is not an anti-pattern.
In Cassandra, UPDATE is an upsert operation, similar to INSERT:
UPDATE writes one or more column values to a row in a Cassandra table. Like INSERT, UPDATE is an upsert operation: if the specified row does not exist, the command creates it. All UPDATEs within the same partition key are applied atomically and in isolation.
But lightweight transactions are read-before-write operations, at the cost of four round trips.
Examples of lightweight transactions:
-- Lightweight transaction INSERT
INSERT INTO customer_account (customerID, customer_email)
VALUES ('LauraS', 'lauras#gmail.com')
IF NOT EXISTS;
-- Lightweight transaction UPDATE
UPDATE customer_account
SET customer_email = 'laurass#gmail.com'
WHERE customerID = 'LauraS'
IF EXISTS;
Both of the above statements are lightweight transactions.
Source: http://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlUpdate.html#cqlUpdate__description
I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation, it seems it's possible to do server-side filtering on the partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use them.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also, I can assure you that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the token function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster, with 3 servers and a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to do is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over this CassandraRDD.
So your code (in Scala) should look something like this:
import com.datastax.spark.connector._
import java.text.SimpleDateFormat

val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val start = fmt.parse("2013-01-01T00:00:00.000Z")
val end = fmt.parse("2013-12-31T00:00:00.000Z")
// Load the whole table, then filter client-side on the timestamp column
val cassRDD = sc.cassandraTable("keyspace name", "table name")
  .filter(row => !row.getDate("timestamp").before(start) && row.getDate("timestamp").before(end))
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will be increased but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the direct Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know if it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
I am trying to do some analytics on time series data stored in Cassandra, using Spark and the new connector published by DataStax.
In my schema the partition key is the meter ID, and I want to run Spark operations only on specific series, therefore I need to filter by meter ID.
I would then like to run a query like: SELECT * FROM timeseries WHERE series_id = X
I have tried to achieve this by doing:
JavaRDD<CassandraRow> rdd = sc.cassandraTable("test", "timeseries").select(columns).where("series_id = ?",ids).toJavaRDD();
When executing this code the resulting query is:
SELECT "series_id", "timestamp", "value" FROM "timeseries" WHERE token("series_id") > 1059678427073559546 AND token("series_id") <= 1337476147328479245 AND series_id = ? ALLOW FILTERING
A clause is automatically added on my partition key (token("series_id") > X AND token("series_id") <=Y) and then mine is appended after that. This obviously does not work and I get an error saying: "series_id cannot be restricted by more than one relation if it includes an Equal".
Is there a way to get rid of the clause added automatically? Am I missing something?
Thanks in advance
The driver automatically determines the partition key using table metadata it fetches from the cluster itself. It then uses this to append the token ranges to your CQL so that it can read a chunk of data from the specific node it's trying to query. In other words, Cassandra thinks series_id is your partition key and not meter_id. If you run a describe command on your table, I bet you'll be surprised.