Is Presto is a data store for storing data? - presto

I am new to work on Presto. I have some doubts regarding Presto.
Whether Presto is a data store(database)?
If it is a query engine ? Whether there is any common query syntax for accessing Hive, SQL, Cassandra data using connectors or it will accept all data source queries based on connectors ?
Where the query execution will takes place in Presto or in connected data source end?

It is a query engine. However it accesses data from many different data sources.
Yes. It is ANSI SQL. When accessing data from underlying data source then it's specific interface is used (thrift, hdfs, jdbc etc), but this is hidden from the user.
In both places. Presto is capable to push down some data filtering down to underlying data source (projection, where clauses). There is current effort to also push more parts of SQL query (see https://github.com/prestosql/presto/issues/18). Rest is evaluated in Presto.

Related

Writing Data to External Databases Through PySpark

I want to write the data from a PySpark DataFrame to external databases, say an Azure MySQL database. So far, I have managed to do this using .write.jdbc(),
spark_df.write.jdbc(url=mysql_url, table=mysql_table, mode="append", properties={"user":mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver" })
Here, if I am not mistaken, the only options available for mode are append and overwrite, however, I want to have more control over how the data is written. For example, I want to be able to perform update and delete operations.
How can I do this? Is it possible to say, write SQL queries to write data to the external databases? If so, please give me an example.
First I suggest you use the specific Azure SQL connector. https://learn.microsoft.com/en-us/azure/azure-sql/database/spark-connector.
Then I recommend you use bulk mode as row by row mode is slow, and can incur unexpected charges if you have log analytics turned on.
Lastly, for any kind of data transformation, you should use an ELT pattern:
Load raw data into an empty staging table
Run SQL code, or even better, a stored procedure which performs required logic (for example merging into a final table) run DML such as a stored proc

What are smart data sources in spark?

I wanted to know what data sources can be called 'smart' in spark. As per book "Mastering Apache Spark 2.x", any data source can be called smart if spark can process data at data source side. Example JDBC sources.
I want to know if MongoDB, Cassandra and parquet could be considered as smart data sources as well?
I believe smart data sources can be those as well. At least according to slides 41 to 42 you can see mention of smart data sources and logos including those sources (note that mongodb logo isn't there but I believe it supports the same thing https://www.mongodb.com/products/spark-connector, see section "Leverage the Power of MongoDB") from the Databricks presentation here: https://www.slideshare.net/databricks/bdtc2
I was also able to find some information supporting that MongoDB is a smart data source, since it's used as an example in the "Mastering Apache Spark 2.x" book:
"Predicate push-down on smart data sources Smart data sources are those that support data processing directly in their own engine-where the data resides--by preventing unnecessary data to be sent to Apache Spark.
On example is a relational SQL database with a smart data source. Consider a table with three columns: column1, column2, and column3, where the third column contains a timestamp. In addition, consider an ApacheSparkSQL query using this JDBC data source but only accessing a subset of columns and rows based using projection and selection. The following SQL query is an example of such a task:
select column2,column3 from tab where column3>1418812500
Running on a smart data source, data locality is made use of by letting the SQL database do the filtering of rows based on timestamp and removal of column1. Let's have a look at a practical example on how this is implemented in the Apache Spark MongoDB connector"

Columnar storage: Cassandra vs Redshift

How is columnar storage in the context of a NoSQL database like Cassandra different from that in Redshift. If Cassandra is also a columnar storage then why isn't it used for OLAP applications like Redshift?
The storage engines of Cassandra and Redshift are very different, and are created for different cases.
Cassandra's storage not really "columnar" in wide known meaning of this type of databases, like Redshift, Vertica etc, it is much more closer to key-value family in NoSQL world. The SQL syntax used in Cassandra is not any ANSI SQL, and it has very limited set of queries that can be ran there. Cassandra's engine built for fast writing and reading of records, based on key, while Redshift's engine is built for fast aggregations (MPP), and has wide support for analytical queries, and stores,encodes and compresses data on column level.
It can be easily understood with following example:
Suppose we have a table with user id and many metrics (for example weight, height, blood pressure etc...).
I we will run aggregate the query in Redshift, like average weight, it will do the following (in best scenario):
Master will send query to nodes.
Only the data for this specific column will be fetched from storage.
The query will be executed in parallel on all nodes.
Final result will be fetched to master.
Running same query in Cassandra, will result in scan of all "rows", and each "row" can have several versions, and only the latest should be used in aggregation. If you familiar with any key-value store (Redis, Riak, DynamoDB etc..) it is less effective than scanning all keys there.
Cassandra many times used for analytical workflows with Spark, acting as a storage layer, while Spark acting as actual query engine, and basically shouldn't be used for analytical queries by its own. With each version released more and more aggregation capabilities are added, but it is very far from being real analytical database.
I encountered the same question today, and found that this resource on AWS: https://aws.amazon.com/nosql/columnar/

How does Spark deal with JDBC data in relation to time?

I am trying to sync my Spark database on S3 with an older Oracle database via daily ETL Spark job. I am trying to understand just what Spark does when it connects to a RDS like Oracle to fetch data.
Does it only grab the data that at the time of Spark's request to the DB (i.e. if it fetches data from an Oracle DB at 2/2 17:00:00, it will only grab data UP to that point in time)? Essentially saying that any new data or updates at 2/2 17:00:01 will not be obtained from the data fetch?
Well, it depends. In general you have to assume that this behavior is non-deterministic, unless explicitly ensured by your application and database design.
By default Spark will fetch data every time you execute an action on the corresponding Spark dataset. It means that every execution might see different state of your database.
This behavior can be affected by multiple factors:
Explicit caching and possible cache evictions.
Implicit caching with shuffle files.
Exacted set of parameters you use with JDBC data source.
In the first two cases Spark can reuse already fetched data without going back to the original data source. The third one is much more interesting. By default Spark fetches data using a single transaction but there methods which enable parallel reads based on column ranges or predicates. If one of these is used Spark will fetch data using multiple transactions, and each one can observe different state of your database.
If consistent point-in-time semantics is required you have basically two options:
Use immutable, append-only and timestamped records in your database and issue timestamp dependent queries from Spark.
Perform consistent database dumps and use these as a direct input to your Spark jobs.
While the first approach is much more powerful it is much harder to implement if you're working with per-existing architecture.

Does registerTempTable cause the table to get cached?

I have a sql statement query which is doing a group by on many fields. The tables that it uses is also big (4TB in size). I'm registering the table as a temp table. However I don't know whether the table gets cached or not when I'm registering it as a temp table? I also don't know whether it is more performant if I convert my query into Scala function (e.g. df.groupby().aggr()...) rather than having it as a sql statement. Any help on that?
SQL is most likely going to be the fastest by far Databricks blog
Did you try to partition/repartition your dataframe as well to see whether it improves the performance?
Regarding registerTempTable: it only registers the table within a spark context. You can check with the UI.
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test")
test.show()
Storage is blank
vs
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test").cache()
test.show()
by the way registerTempTable is deprecated in Spark 2.0 and has been replaced by
createOrReplaceTempView
I have a sql statement query which is doing a group by on many fields. The tables that it uses is also big (4TB in size). I'm registering the table as a temp table. However I don't know whether the table gets cached or not when I'm registering it as a temp table?
The registerTempTabele or createOrReplaceTempView doesn't cache the data into memory or disc itself unless you use cache() function.
I also don't know whether it is more performant if I convert my query into Scala function (e.g. df.groupby().aggr()...) rather than having it as a sql statement. Any help on that?
Keep in mind the sql terms in sql query ultimately call the function inside. so whether you use sql query terms or functions available in code it doesn't matter. that is same thing.

Resources