I used Delta Lake 1.2 to write some tables. I assume I can now still use 2.0 to read and write those tables. Is this correct?
tl;dr Yup! Those tables will still be compatible.
Different versions of Delta Lake add new features (like OPTIMIZE ZORDER or data skipping to speed up queries) and performance improvements, but they do not "break" existing tables. Only protocol upgrades do that.
The Delta Lake library version is independent of the table protocol version, which defines the version of the Delta protocol that readers and writers must support. A protocol upgrade is only needed when certain new features, like column mapping, require protocol changes that make a table incompatible with older readers and writers.
See https://github.com/delta-io/delta/blob/master/PROTOCOL.md#protocol-evolution
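If you want to check a particular table, DESCRIBE DETAIL reports the protocol versions it currently requires (the path below is just a placeholder):
// The table's minReaderVersion / minWriterVersion, not the library version,
// determine compatibility; "/tmp/events" is a placeholder path.
val detail = spark.sql("DESCRIBE DETAIL delta.`/tmp/events`")
detail.select("minReaderVersion", "minWriterVersion").show()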
I'm very new to the ETL world and I wish to implement incremental data loading with Cassandra 3.7 and Spark. I'm aware that later versions of Cassandra do support CDC, but I can only use Cassandra 3.7. Is there a method through which I can track only the changed records and use Spark to load them, thereby performing incremental data loading?
If it can't be done on the Cassandra end, any other suggestions are also welcome on the Spark side :)
It's quite a broad topic, and an efficient solution will depend on the amount of data in your tables, the table structure, how data is inserted/updated, etc. The specific solution may also depend on the version of Spark available. One downside of a Spark-only method is that you can't easily detect deletes without keeping a complete copy of the previous state, so you can generate a diff between the two states.
In all cases you'll need to perform a full table scan to find changed entries, but if your table is organized specifically for this task, you can avoid reading all of the data. For example, if you have a table with the following structure:
create table test.tbl (
pk int,
ts timestamp,
v1 ...,
v2 ...,
primary key(pk, ts));
then if you run the following query:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "test").load()
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
then the Spark Cassandra Connector will push this query down to Cassandra and read only the data where ts is in the given time range. You can check this by executing filtered.explain and verifying that both time filters are marked with the * symbol.
Another way to detect changes is to retrieve the write time from Cassandra and filter the changes based on that information. Fetching the writetime is supported in the RDD API for all recent versions of the SCC, and in the DataFrame API since the release of SCC 2.5.0 (which requires at least Spark 2.4, although it may work with 2.3 as well); a sketch of this follows the list below. After fetching this information you can apply filters on the data and extract the changes. But you need to keep in mind several things:
there is no way to detect deletes using this method
write time information exists only for regular & static columns, not for primary key columns
each column may have its own write time value if there was a partial update of the row after insertion
in most versions of Cassandra, calling the writetime function on a collection column (list/map/set) generates an error, and it may return null for a column with a user-defined type
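For illustration, here is a minimal sketch of the DataFrame variant against the test.tbl table above. It assumes SCC 2.5.0+ and the writeTime helper that its org.apache.spark.sql.cassandra package is documented to expose; the watermark value is just a placeholder.
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.functions.col

// Attach the write time (microseconds since epoch) of a regular column, v1.
// writeTime(...) is assumed here to come from SCC 2.5.0's cassandra package.
val withWt = spark.read.cassandraFormat("tbl", "test").load()
  .select(col("pk"), col("ts"), col("v1"), writeTime("v1").as("v1_writetime"))

// Keep only rows whose v1 was written after the previous load.
val lastLoadMicros = 1552233694373000L  // placeholder watermark from the last run
val changed = withWt.filter(col("v1_writetime") > lastLoadMicros)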
P.S. Even if you had CDC enabled, it's not a trivial task to use it correctly:
you need to de-duplicate changes - you have RF copies of every change
some changes could be lost, for example when a node was down and the data was propagated later via hints or repairs
TTL isn't easy to handle
...
For CDC you may look for presentations from the 2019 DataStax Accelerate conference - there were several talks on that topic.
I'm implementing a Spark data source (v2) and I didn't find a way to ensure data locality.
In data source v1 the getPreferredLocations method can be implemented; what is the equivalent in data source v2?
In Spark data source v2 you should use SupportsReportPartitioning.
I see someone discussed some limitations in this issue: SPARK-15689 - Data source API v2
So SupportsReportPartitioning is not powerful enough to support custom hash functions yet. There are two major operators that may introduce shuffle: join and aggregate. Aggregate only needs to have the data clustered, but doesn't care how, so data source v2 can support it if your implementation satisfies ClusteredDistribution. Join needs the data of the 2 children clustered by Spark's shuffle hash function, which is not supported by data source v2 currently.
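To make that concrete, here is a rough sketch against the Spark 2.4-era DSv2 interfaces (org.apache.spark.sql.sources.v2.reader); the class names, the pk column, and the partition count are all made up for illustration, and the comment on preferredLocations marks where per-partition locality hints go in this API.
import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader, SupportsReportPartitioning}
import org.apache.spark.sql.sources.v2.reader.partitioning.{ClusteredDistribution, Distribution, Partitioning}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

// A toy reader that tells Spark its output is already clustered by "pk",
// so an aggregate grouped on "pk" does not need an extra shuffle.
class ToyReader extends DataSourceReader with SupportsReportPartitioning {

  override def readSchema(): StructType =
    StructType(Seq(StructField("pk", IntegerType), StructField("v", StringType)))

  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Seq.tabulate(4)(i => new ToyPartition(i): InputPartition[InternalRow]).asJava

  override def outputPartitioning(): Partitioning = new Partitioning {
    override def numPartitions(): Int = 4
    // Spark asks whether the reported layout satisfies the distribution it needs.
    override def satisfy(distribution: Distribution): Boolean = distribution match {
      case c: ClusteredDistribution => c.clusteredColumns.sameElements(Array("pk"))
      case _ => false
    }
  }
}

class ToyPartition(id: Int) extends InputPartition[InternalRow] {
  // The DSv2 counterpart of getPreferredLocations: per-partition locality hints.
  override def preferredLocations(): Array[String] = Array(s"host-$id")

  override def createPartitionReader(): InputPartitionReader[InternalRow] =
    new InputPartitionReader[InternalRow] {
      private var emitted = false
      override def next(): Boolean = { val more = !emitted; emitted = true; more }
      override def get(): InternalRow = InternalRow(id, UTF8String.fromString(s"row-$id"))
      override def close(): Unit = ()
    }
}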
I would like to know how I can use CDC in Cassandra. I found that it has already been implemented starting from version 3.8 (https://issues.apache.org/jira/browse/CASSANDRA-8844). Are there any examples of usage?
1. Enable CDC in cassandra.yaml
cdc_enabled (default: false)
Enable or disable CDC operations node-wide.
2. Enabling CDC on a table
CREATE TABLE foo (a int, b text, PRIMARY KEY(a)) WITH cdc=true;
// or
ALTER TABLE foo WITH cdc=true;
3. After the memtable is flushed to disk, you can access the CDC data in $CASSANDRA_HOME/data/cdc_raw
Cassandra stores CommitLogSegments in this folder. You can check this link: Read CommitLogSegments
Read more: https://github.com/apache/cassandra/blob/8b3a60b9a7dbefeecc06bace617279612ec7092d/doc/source/operating/cdc.rst
You can write your own implementation of CommitLogReader, or use this sample implementation.
However, please note that CDC logs are not very reliable (because of duplicate events and the time taken to flush data to CDC), and the format is subject to change in future releases.
I work at ScyllaDB, which is Cassandra compatible and has CDC support as well - that is simpler to use.
You can specify whether you wish to get only the delta, the pre-image, or the post-image. The data is stored in a system-generated table and can be accessed and read via CQL (see the sketch below).
As such:
there is no need to write and deploy code on the cassandra nodes to consume commitlogs (nor is there a need to flush to get them)
deduplication is inherent to the solution.
you can read more in https://docs.scylladb.com/using-scylla/cdc/
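To illustrate the "read via CQL" point above: the change log is a regular table (by Scylla convention named <base_table>_scylla_cdc_log), so it can be read with any CQL client, including the Spark Cassandra Connector. The keyspace and table names below are placeholders.
import org.apache.spark.sql.cassandra._

// Read the auto-generated CDC log table for test.tbl; no commitlog handling or flushes needed.
val changes = spark.read.cassandraFormat("tbl_scylla_cdc_log", "test").load()
changes.show(false)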
Can GraphX store, process, query and update large distributed graphs?
Does GraphX support these features, or must the data be extracted from a graph database source and then processed by GraphX?
I want to avoid costs related to network communication and data movement.
It can actually be done, albeit with pretty complicated measures. MLnick from GraphFlow posted on the Titan mailing list here that he managed to use Spark 0.8 on a Titan/Cassandra graph using FaunusVertex and TitanCassandraInputFormat, and that there was a problem with Groovy 1.8.9 and a newer Kryo version.
In his GraphFlow presentation at Spark Summit, he seemed to have made Titan/HBase over Spark 0.7.x work.
Or if you're savvy enough to implement the TitanInputFormat/TitanOutputFormat from Titan 0.5, perhaps you could keep us in the loop. The Titan developers have said they do want to support Spark but haven't got the time/resources to do so.
Using Spark on a Titan database is pretty much the only option I can think of for your question.
Spark doesn't really have built-in support for long-term storage yet, other than through HDFS (technically it doesn't need to run on HDFS, but it is heavily integrated with it). So you could just store all the edges and vertices in files, but this is clearly not the most efficient way (see the sketch below). Another option is to use a graph database like Neo4j.
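A rough sketch of the "store edges and vertices in files" option mentioned above, using GraphX (the paths and the edge-list format are hypothetical):
import org.apache.spark.graphx.{Edge, Graph, GraphLoader, VertexId}

// Build a graph from a plain edge-list file ("srcId dstId" per line).
val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, "hdfs:///graphs/edges.txt")

// Persist vertices and edges as files...
graph.vertices.saveAsObjectFile("hdfs:///graphs/vertices.obj")
graph.edges.saveAsObjectFile("hdfs:///graphs/edges.obj")

// ...and rebuild the graph later from those files.
val vertices = sc.objectFile[(VertexId, Int)]("hdfs:///graphs/vertices.obj")
val edges = sc.objectFile[Edge[Int]]("hdfs:///graphs/edges.obj")
val reloaded: Graph[Int, Int] = Graph(vertices, edges)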
I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
1) If you have all the data to be loaded in place, you can try the sstableloader utility (only for Cassandra 0.8.x onwards) to bulk load the data. For more details see: cassandra bulk loader
2) Cassandra has introduced BulkOutputFormat for bulk loading data into Cassandra with a Hadoop job in the latest versions, i.e. cassandra-1.1.x onwards.
For more details see: Bulkloading to Cassandra with Hadoop
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for Cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic CSV -> Cassandra mapping may not be appropriate since secondary indexes and denormalisation are commonly needed.
For Cassandra 1.1.3 and higher, there is the CQL COPY command, which is available for importing (or exporting) data to (or from) a table. According to the documentation, if you are importing roughly less than 2 million rows, this is a good option. It is much easier to use than the sstableloader and less error prone. The sstableloader requires you to create strictly formatted .db files, whereas the CQL COPY command accepts a delimited text file. Documentation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
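For example (keyspace, table, column list, and file name are only illustrative):
COPY test.tbl (pk, ts, v1, v2) FROM 'data.csv';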
For larger data sets, you should use the sstableloader: http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here: http://www.datastax.com/dev/blog/bulk-loading.