I'm implementing a Spark data source (v2) and I haven't found a way to ensure data locality.
In data source v1 the getPreferredLocations method can be implemented; what is the equivalent in data source v2?
In Spark data source v2 you should use SupportsReportPartitioning.
I've seen some discussion of its limitations in SPARK-15689 - Data source API v2:
So SupportsReportPartitioning is not powerful enough to support custom hash functions yet. There are two major operators that may introduce a shuffle: join and aggregate. Aggregate only needs the data to be clustered, but doesn't care how, so data source v2 can support it if your implementation can satisfy ClusteredDistribution. Join needs the data of the two children clustered by the Spark shuffle hash function, which data source v2 does not currently support.
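To illustrate the aggregate case: a reader can advertise its existing clustering through SupportsReportPartitioning. Below is a minimal sketch against the Spark 2.4 Data Source V2 interfaces; the class name, the partition count and the userId clustering column are illustrative assumptions, not taken from any real connector.

    import java.util.Collections;
    import java.util.List;

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.reader.InputPartition;
    import org.apache.spark.sql.sources.v2.reader.SupportsReportPartitioning;
    import org.apache.spark.sql.sources.v2.reader.partitioning.ClusteredDistribution;
    import org.apache.spark.sql.sources.v2.reader.partitioning.Distribution;
    import org.apache.spark.sql.sources.v2.reader.partitioning.Partitioning;
    import org.apache.spark.sql.types.StructType;

    // Hypothetical reader whose partitions are already clustered by a "userId" column.
    public class ClusteredReader implements SupportsReportPartitioning {

        @Override
        public StructType readSchema() {
            return new StructType().add("userId", "string").add("value", "long");
        }

        @Override
        public List<InputPartition<InternalRow>> planInputPartitions() {
            return Collections.emptyList(); // a real source returns one partition per cluster of data
        }

        @Override
        public Partitioning outputPartitioning() {
            return new Partitioning() {
                @Override
                public int numPartitions() {
                    return 8; // must match the number of partitions planned above
                }

                @Override
                public boolean satisfy(Distribution distribution) {
                    // Aggregations only ask for a ClusteredDistribution; reporting that the data
                    // is already clustered by "userId" lets Spark skip the shuffle for them.
                    if (distribution instanceof ClusteredDistribution) {
                        String[] cols = ((ClusteredDistribution) distribution).clusteredColumns;
                        return java.util.Arrays.asList(cols).contains("userId");
                    }
                    return false;
                }
            };
        }
    }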
Is there a difference between CREATE STREAMING LIVE TABLE and CREATE INCREMENTAL LIVE TABLE? The documentation is inconsistent: for instance, STREAMING is used here, while INCREMENTAL is used here. I have tested both and so far I have not noticed any difference.
There are two aspects here:
Conceptual - incremental means that only the minimal data changes are applied to the destination table; we don't recompute the full data set when new data arrives. This is how it is explained in the Getting Started book.
Syntax - CREATE INCREMENTAL LIVE TABLE was the original syntax for pipelines that process streaming data. It was deprecated in favor of CREATE STREAMING LIVE TABLE, but the old syntax is still supported for compatibility reasons.
I wanted to know which data sources can be called 'smart' in Spark. As per the book "Mastering Apache Spark 2.x", a data source can be called smart if Spark can have the data processed at the data source side, for example JDBC sources.
I want to know whether MongoDB, Cassandra and Parquet could be considered smart data sources as well.
I believe those can be smart data sources as well. Slides 41 to 42 of the Databricks presentation at https://www.slideshare.net/databricks/bdtc2 mention smart data sources and show logos that include those sources (the MongoDB logo isn't there, but I believe it supports the same thing; see the section "Leverage the Power of MongoDB" at https://www.mongodb.com/products/spark-connector).
I was also able to find some information supporting that MongoDB is a smart data source, since it's used as an example in the "Mastering Apache Spark 2.x" book:
"Predicate push-down on smart data sources Smart data sources are those that support data processing directly in their own engine-where the data resides--by preventing unnecessary data to be sent to Apache Spark.
On example is a relational SQL database with a smart data source. Consider a table with three columns: column1, column2, and column3, where the third column contains a timestamp. In addition, consider an ApacheSparkSQL query using this JDBC data source but only accessing a subset of columns and rows based using projection and selection. The following SQL query is an example of such a task:
select column2,column3 from tab where column3>1418812500
Running on a smart data source, data locality is made use of by letting the SQL database do the filtering of rows based on the timestamp and the removal of column1. Let's have a look at a practical example of how this is implemented in the Apache Spark MongoDB connector."
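To reproduce that scenario from the DataFrame side with a plain JDBC source (the connection details below are made up), the projection and the timestamp filter show up as pushed filters on the JDBC scan node in the physical plan instead of being applied inside Spark:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;

    public class PushdownDemo {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("pushdown-demo").getOrCreate();

            // Made-up MySQL connection details.
            Dataset<Row> tab = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:mysql://dbhost:3306/mydb")
                    .option("dbtable", "tab")
                    .option("user", "user")
                    .option("password", "secret")
                    .load();

            // Only column2/column3 are requested from the database, and the timestamp
            // predicate appears under PushedFilters in the JDBC scan of the plan.
            tab.select(col("column2"), col("column3"))
               .where(col("column3").gt(1418812500L))
               .explain();
        }
    }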
I want to do simple analysis on a live stream of tweets.
How do you use a Twitter stream source in Hazelcast Jet without needing a DAG?
Details
The encapsulation of the Twitter API in StreamTwitterP.java is pretty good.
However, the caller uses it as part of a DAG:
Vertex twitterSource =
    dag.newVertex("twitter", StreamTwitterP.streamTwitterP(properties, terms));
My use case doesn't need the power of a DAG, so I'd rather avoid that extra complexity.
To avoid a DAG, I'm looking to use SourceBuilder to define a new data source for a live stream of tweets.
I assume the code would be similar to StreamTwitterP.java mentioned above, but it's not clear to me how it fits into the Hazelcast Jet API.
I was referring to the SourceBuilder example from the docs.
You can convert a processor to a pipeline source:
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<String>streamFromProcessor("twitter",
        streamTwitterP(properties, terms)))
...
There's also a twitterSource version that uses SourceBuilder here.
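If you want to skip the processor entirely, a SourceBuilder-based source could look roughly like the sketch below. The TwitterClient class is only a stand-in for the client logic inside StreamTwitterP (connect, push incoming tweet JSON onto a queue, shut down), so treat the names and the queue-draining details as assumptions rather than actual connector code.

    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    import com.hazelcast.jet.pipeline.SourceBuilder;
    import com.hazelcast.jet.pipeline.StreamSource;

    public final class TwitterSources {

        // Stand-in for the Twitter client wrapped by StreamTwitterP: it connects,
        // pushes incoming tweet JSON onto `queue`, and can be shut down.
        static final class TwitterClient {
            final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
            TwitterClient(Properties properties, List<String> terms) { /* connect and start feeding queue */ }
            void shutdown() { /* close the streaming connection */ }
        }

        public static StreamSource<String> twitterSource(Properties properties, List<String> terms) {
            return SourceBuilder
                    .stream("twitter", ctx -> new TwitterClient(properties, terms))
                    .<String>fillBufferFn((client, buf) -> {
                        // Drain whatever arrived since the last call; never block here.
                        for (String tweet; (tweet = client.queue.poll()) != null; ) {
                            buf.add(tweet);
                        }
                    })
                    .destroyFn(TwitterClient::shutdown)
                    .build();
        }
    }

The pipeline then becomes p.drawFrom(twitterSource(properties, terms)), with no DAG involved.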
I have been reading about the Data Source V2 API and filter pushdown (and presumably partition pruning). The examples talk about pushdown to, say, MySQL.
I am not clear on this. I see discussion of the Data Source V2 API here and there (e.g. in Exploring Spark DataSource V2 - Part 4: In-Memory DataSource with Partitioning). All well and good, but I can already get pushdown working for MySQL, as the answer states. The discussions somehow imply the opposite, so I am clearly missing a point somewhere along the line and would like to know what it is.
My question/observation is that I can already do filter pushdown for a JDBC source such as MySQL, e.g. by passing a subquery as the table:
sql = "(select * from mytab where day = 2016-11-25 and hour = 10) t1"
This ensures that not all data is brought back to Spark.
So, what am I missing?
This ensures that not all data is brought back to Spark.
Yes, it does, but
val df = spark.read.jdbc(url, "mytab", ...)
df.where($"day" === "2016-11-25" and $"hour" === 10)
should as well, as long as there is no casting required, no matter the version (1.4 forward).
Filter Pushdown in Data Source V2 API
In the Data Source V2 API, only data sources whose DataSourceReader implements the SupportsPushDownFilters interface support the filter pushdown performance optimization.
Whether a data source supports filter pushdown in Data Source V2 API is just a matter of checking out the underlying DataSourceReader.
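As a reference point, here is roughly what such a reader can look like, sketched against the Spark 2.4 Data Source V2 interfaces; the class name, the schema and the choice to accept only GreaterThan filters are illustrative assumptions.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.Filter;
    import org.apache.spark.sql.sources.GreaterThan;
    import org.apache.spark.sql.sources.v2.reader.InputPartition;
    import org.apache.spark.sql.sources.v2.reader.SupportsPushDownFilters;
    import org.apache.spark.sql.types.StructType;

    // Hypothetical reader that evaluates GreaterThan filters itself and leaves the rest to Spark.
    public class FilteringReader implements SupportsPushDownFilters {

        private Filter[] pushed = new Filter[0];

        @Override
        public StructType readSchema() {
            return new StructType().add("column2", "string").add("column3", "long");
        }

        @Override
        public List<InputPartition<InternalRow>> planInputPartitions() {
            return Collections.emptyList(); // a real source would apply `pushed` while scanning
        }

        @Override
        public Filter[] pushFilters(Filter[] filters) {
            List<Filter> accepted = new ArrayList<>();
            List<Filter> rejected = new ArrayList<>();
            for (Filter f : filters) {
                if (f instanceof GreaterThan) {
                    accepted.add(f);   // evaluated at the source
                } else {
                    rejected.add(f);   // Spark evaluates these after the scan
                }
            }
            pushed = accepted.toArray(new Filter[0]);
            return rejected.toArray(new Filter[0]); // filters Spark still has to apply
        }

        @Override
        public Filter[] pushedFilters() {
            return pushed;
        }
    }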
For MySQL that would be the JDBC data source, which is represented by JdbcRelationProvider and does not seem to support the Data Source V2 API (via ReadSupport). In other words, I doubt that MySQL is backed by a Data Source V2 data source, and so no filter pushdown via the new Data Source V2 API is to be expected.
Filter Pushdown in Data Source V1 API
That does not preclude the filter pushdown optimization from being used via other, non-Data Source V2 APIs, i.e. the Data Source V1 API.
In the case of the JDBC data source, filter pushdown is indeed supported via the older PrunedFilteredScan contract (which, nota bene, is used by JDBCRelation only). That, however, is the Data Source V1 API.
I am trying to sync my Spark database on S3 with an older Oracle database via a daily Spark ETL job. I am trying to understand just what Spark does when it connects to an RDS like Oracle to fetch data.
Does it only grab the data present at the time of Spark's request to the DB (i.e. if it fetches data from an Oracle DB at 2/2 17:00:00, will it only grab data UP to that point in time)? In other words, will new data or updates arriving at 2/2 17:00:01 not be included in the fetch?
Well, it depends. In general you have to assume that this behavior is non-deterministic, unless explicitly ensured by your application and database design.
By default Spark will fetch data every time you execute an action on the corresponding Spark Dataset. This means that every execution might see a different state of your database.
This behavior can be affected by multiple factors:
Explicit caching and possible cache evictions.
Implicit caching with shuffle files.
The exact set of parameters you use with the JDBC data source.
In the first two cases Spark can reuse already fetched data without going back to the original data source. The third one is much more interesting. By default Spark fetches data using a single transaction, but there are methods which enable parallel reads based on column ranges or predicates. If one of these is used, Spark will fetch data using multiple transactions, and each one can observe a different state of your database.
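To make that concrete, both parallel-read variants are sketched below against a made-up Oracle endpoint; every partition Spark plans turns into its own query, and therefore its own transaction, on the database side.

    import java.util.Properties;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ParallelJdbcRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("parallel-jdbc").getOrCreate();

            Properties props = new Properties();
            props.setProperty("user", "etl_user");
            props.setProperty("password", "secret");
            String url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"; // made-up connection string

            // Column-range partitioning: Spark issues 8 range queries over ID,
            // each executed independently with its own connection and transaction.
            Dataset<Row> byRange = spark.read()
                    .jdbc(url, "MYTAB", "ID", 1L, 1_000_000L, 8, props);

            // Predicate-based partitioning: one query (and one transaction) per predicate.
            String[] predicates = {
                    "REGION = 'EMEA'",
                    "REGION = 'APAC'",
                    "REGION = 'AMER'"
            };
            Dataset<Row> byPredicate = spark.read().jdbc(url, "MYTAB", predicates, props);
        }
    }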
If consistent point-in-time semantics is required you have basically two options:
Use immutable, append-only and timestamped records in your database and issue timestamp-dependent queries from Spark (see the sketch after this list).
Perform consistent database dumps and use these as a direct input to your Spark jobs.
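Here is a sketch of the first option, assuming a hypothetical append-only table with an immutable LOAD_TS column: each run reads an explicitly bounded time window, so re-running the job returns exactly the same rows regardless of what arrives later.

    import java.util.Properties;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class TimestampedIncrement {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("timestamped-increment").getOrCreate();

            Properties props = new Properties();
            props.setProperty("user", "etl_user");
            props.setProperty("password", "secret");

            // Each run reads a fixed, explicitly bounded window of the append-only table,
            // so the result is reproducible regardless of rows inserted later.
            String window =
                    "(select * from MYTAB" +
                    " where LOAD_TS >  TIMESTAMP '2018-02-01 17:00:00'" +
                    "   and LOAD_TS <= TIMESTAMP '2018-02-02 17:00:00') t";

            Dataset<Row> increment = spark.read()
                    .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", window, props);
        }
    }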
While the first approach is much more powerful, it is much harder to implement if you're working with a pre-existing architecture.