I am using Spark-Cassandra connector 1.1.0 with Cassandra 2.0.12.
I write RDDs to Cassandra via the saveToCassandra() Java API method.
Is there a way to set the TTL property of the persisted records with the connector?
Thanks,
Shai
Unfortunately there doesn't seem to be a way to do this (that I know of) with version 1.1.0 of the connector. There is a way in 1.2.0-alpha3, however.
saveToCassandra() is a wrapper over WriterBuilder, which has a withTTL method. Instead of using saveToCassandra you could use writerBuilder(keyspace, table, rowWriter).withTTL(seconds).saveToCassandra().
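For illustration, here is a minimal sketch of setting a per-write TTL through the connector's WriteConf, which SCC 1.2.x also exposes. This is an alternative route to the writerBuilder one above; the keyspace, table, column names and the one-hour TTL are placeholders, and sc is assumed to be an existing SparkContext:

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TTLOption, WriteConf}

// Hypothetical data; my_keyspace.my_table must already exist in Cassandra.
val rdd = sc.parallelize(Seq(("key1", 1), ("key2", 2)))

// Write every row with a constant TTL of one hour (the value is in seconds).
rdd.saveToCassandra("my_keyspace", "my_table",
  SomeColumns("id", "value"),
  writeConf = WriteConf(ttl = TTLOption.constant(3600)))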
Yes, you can. Just set the Spark config key "spark.cassandra.output.ttl" when creating the SparkConf object.
Note: the value should be in seconds.
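For example, a minimal sketch (the application name and the 3600-second TTL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// Every row written through the connector will get this TTL (in seconds).
val conf = new SparkConf()
  .setAppName("ttl-example")
  .set("spark.cassandra.output.ttl", "3600")
val sc = new SparkContext(conf)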
In my Spark job, I try to overwrite a table in each micro-batch of Structured Streaming:
batchDF.write.mode(SaveMode.Overwrite).saveAsTable("mytable")
It generated the following error.
Can not create the managed table('`mytable`'). The associated location('file:/home/ec2-user/environment/spark/spark-local/spark-warehouse/mytable') already exists.;
In Spark 2.x, the way to solve this issue was to add the following option:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
This works well in Spark 2.x. However, the option was removed in Spark 3.0.0, so how should we solve this issue there?
Thanks!
It looks like you run your test data generation and your actual test in the same process - can you just replace the saveAsTable call with createOrReplaceTempView, so that the data lands in Spark's in-memory catalog instead of in a Hive catalog?
Something like: batchDF.createOrReplaceTempView("mytable")
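Inside a foreachBatch sink, that could look like the following minimal sketch (the rate source, the count query, and the view name "mytable" are placeholders):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("tempview-example").getOrCreate()

val stream = spark.readStream.format("rate").load() // placeholder streaming source

// Defining the handler as a val avoids the Scala/Java foreachBatch overload ambiguity.
val processBatch = (batchDF: DataFrame, batchId: Long) => {
  // Register (or replace) this micro-batch in the in-memory catalog;
  // nothing is written to the warehouse, so no location conflict occurs.
  batchDF.createOrReplaceTempView("mytable")
  spark.sql("SELECT count(*) FROM mytable").show()
}

stream.writeStream
  .foreachBatch(processBatch)
  .start()
  .awaitTermination()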
I am working on a Spring Java project, integrating Apache Spark and Cassandra using the Datastax connector.
I have autowired sparkSession, and the below lines of code seem to work.
Map<String, String> configMap = new HashMap<>();
configMap.put("keyspace", "key1");
configMap.put("table", tableName.toLowerCase());
Dataset<Row> ds = sparkSession.sqlContext().read()
        .format("org.apache.spark.sql.cassandra")
        .options(configMap)
        .load();
ds.show();
But this always gives me 20 records, and I want to select all the records in the table. Can someone tell me how to do this?
Thanks in advance.
show always outputs 20 records by default, although you can pass an argument to specify how many rows you need. But show is usually used just to briefly examine the data, especially when working interactively.
In your case, everything really depends on what you want to do with the data - you already successfully loaded it using the load function, so after that you can just start to use normal Spark functions: select, filter, groupBy, etc.
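For instance, a minimal sketch (in Scala for brevity; the Java Dataset API is analogous, ds is the Dataset loaded above, and some_column is a hypothetical column name):

ds.show(100)                        // print the first 100 rows instead of the default 20
println(ds.count())                 // total number of rows in the table
val subset = ds.select("some_column").filter("some_column > 10")
val allRows = ds.collect()          // pulls ALL rows to the driver; only safe for small tables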
P.S. You can find more examples of using the Spark Cassandra Connector (SCC) from Java here, although it's more cumbersome than using Scala... And I recommend making sure that you're using SCC 2.5.0 or higher, because of the many new features there.
How can I create a custom write format for a Spark DataFrame, so that I can use it like df.write.format("com.mycompany.mydb").save()? I've tried reading through the Datastax Cassandra connector code but still couldn't figure it out.
Spark 3.0 completely changes the API. Some new interfaces, e.g. TableProvider and SupportsWrite, have been added.
You might find this guide helpful.
Use Spark's DataSourceV2 API.
If you are using a Spark version < 2.3, then you can use the Spark Data Source API V1.
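For Spark 3.0's DataSourceV2, a minimal batch-write skeleton might look like the sketch below. This is a hypothetical outline, not the Datastax connector's actual code: every class and package name (com.mycompany.mydb, MyDbTable, etc.) is a placeholder, and a real connector would add schema inference, option handling and error handling:

package com.mycompany.mydb

import java.util
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsWrite, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.write._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Spark resolves format("com.mycompany.mydb") to com.mycompany.mydb.DefaultSource.
class DefaultSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType() // only used for reads; a real connector would infer the schema

  // Accept the schema of the DataFrame being written.
  override def supportsExternalMetadata(): Boolean = true

  override def getTable(schema: StructType, partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table =
    new MyDbTable(schema)
}

class MyDbTable(tableSchema: StructType) extends Table with SupportsWrite {
  override def name(): String = "mydb"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_WRITE)

  override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder =
    new WriteBuilder {
      override def buildForBatch(): BatchWrite = new MyDbBatchWrite
    }
}

class MyDbBatchWrite extends BatchWrite {
  override def createBatchWriterFactory(info: PhysicalWriteInfo): DataWriterFactory =
    new MyDbWriterFactory
  override def commit(messages: Array[WriterCommitMessage]): Unit = {} // finalize the job
  override def abort(messages: Array[WriterCommitMessage]): Unit = {}  // clean up on failure
}

class MyDbWriterFactory extends DataWriterFactory {
  override def createWriter(partitionId: Int, taskId: Long): DataWriter[InternalRow] =
    new MyDbDataWriter
}

class MyDbDataWriter extends DataWriter[InternalRow] {
  override def write(record: InternalRow): Unit = {
    // send one row to the external system here
  }
  override def commit(): WriterCommitMessage = new WriterCommitMessage {}
  override def abort(): Unit = {}
  override def close(): Unit = {}
}

With such a class on the classpath, df.write.format("com.mycompany.mydb").mode("append").save() would route each partition's rows through MyDbDataWriter.write.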
During a migration from PySpark to Spark with Scala, I encountered a problem caused by the fact that SqlContext's registerDataFrameAsTable method is private, which made me think that my approach might be incorrect. In PySpark I do the following: load each table with df = sqlContext.load(source, url, dbtable), then register each one with sqlContext.registerDataFrameAsTable(df, dbtable), and finally run my queries with the sqlContext.sql method (which is basically all I need).
Is this the right way to do it? How can I achieve it in Scala?
In Scala, the registerTempTable and saveAsTable (which is experimental) methods are available directly on the DataFrame object, and those are what you should use.
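For example, a minimal sketch of that flow in Scala (Spark 1.x-era API; the JDBC URL and the table name my_table are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sql-example").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Load the table through a data source (the counterpart of sqlContext.load in PySpark).
val df = sqlContext.read.format("jdbc")
  .options(Map("url" -> "jdbc:postgresql://host/db", "dbtable" -> "my_table"))
  .load()

// Public equivalent of the private registerDataFrameAsTable.
df.registerTempTable("my_table")

// Plain SQL now works against the registered name.
sqlContext.sql("SELECT count(*) FROM my_table").show()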
I am trying to use CDC in Cassandra. I tried using incremental backups as mentioned in this link, but the format of the SSTables is very weird for composite keys. Is there any way to implement CDC in Cassandra?
Any pointers would be very useful.
It is available natively as of Cassandra 3.8:
https://issues.apache.org/jira/browse/CASSANDRA-8844
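For reference, enabling it takes two steps (a sketch assuming Cassandra 3.8+; my_keyspace.my_table is a placeholder): turn CDC on in cassandra.yaml, then mark the table:

# cassandra.yaml
cdc_enabled: true

-- CQL: enable CDC on an existing table
ALTER TABLE my_keyspace.my_table WITH cdc = true;

Commit log segments containing data for CDC-enabled tables are then retained in the cdc_raw directory for a consumer to process.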