I can select data from a database in Spark like this:
var df = spark.read.
format("jdbc").
option("url", "jdbc:db://<DB server>:<DB port>/<dbname>").
option("user", "<username>").
option("password", "<password>").
option("dbtable", "<your table>").
load()
But after this, how can I close the DB connection? Is it closed automatically?
Spark opens and closes the JDBC connections as needed: to extract or validate metadata when building the query execution plan, to save dataframe partitions to a database, or to compute the dataframe when a scan is triggered by a Spark action. See the JdbcRelationProvider, JdbcUtils, and JDBCRDD sources for where and how exactly it's done.
Keep in mind that Spark is a distributed system. Each executor will require its own connection(s) to the database (e.g. when doing partitioned reads). There is simply no way you could close all open connections manually.
All of this is taken care of automatically by Spark, and it is nothing you have to worry about.
Assume I am loading data using:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("path/to/csv")
Then why do I need to run: df.createTempView("temp_df")? If it's because I need to be able to run SQL queries, then can't I do that using: df.selectExpr("sql exp...")?
I am aware that df.createTempView("temp_df") will create a table which I can run sql expressions against, in a Spark session. Having said that, does that mean that the created table is distributed on the worker nodes?
The reason I'm asking is because I would like to know if the csv file is being loaded to the driver program and not the worker nodes immediately.
Second, can I have a dataframe on the worker nodes? If yes, then will spark.catalog.listTables() give me any information about this dataframe?
Finally, how can I check where my data is stored at any point in time, whether it is a dataframe or a SQL table?
Any help is much appreciated!!
I have tried connecting Spark over JDBC to fetch data from MySQL / Teradata or similar RDBMSs, and was able to analyse the data.
Can Spark be used to store the data to HDFS?
Is there any possibility of Spark outperforming Sqoop at these tasks?
Looking forward to your valuable answers and explanations.
There are two main things to consider about Sqoop and Spark. The main difference is that Sqoop will read the data from your RDBMS regardless of what it is, and you don't need to worry much about how your table is configured.
With Spark, loading data over a JDBC connection works a little differently. If your table doesn't have a column such as a numeric ID or a timestamp to partition on, Spark will load ALL the data into one single partition and then try to process and save it. If you do have a column to use for partitioning, then Spark can sometimes be even faster than Sqoop.
I would recommend taking a look at this doc.
The conclusion is: if you are going to do a simple export that needs to run daily with no transformation, I would recommend Sqoop, since it is simple to use and will not impact your database that much. Spark will work well IF your table is ready for it; otherwise, go with Sqoop.
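To make the partitioning point concrete, here is a minimal sketch of the idea behind a partitioned JDBC read: a numeric column and a pair of bounds are turned into one WHERE clause per partition, so several connections can each pull a slice. The function name partition_predicates and the example column "id" are hypothetical, and this only approximates what Spark does with its partitionColumn, lowerBound, upperBound and numPartitions options; it is not Spark's actual code.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Build one WHERE clause per partition for a numeric column.

    Illustrative only: the bounds decide the stride, they do not filter rows,
    so the first and last partitions are left open-ended.
    """
    span = upper - lower
    # Never create more partitions than there are distinct integer values.
    num_partitions = min(num_partitions, span)
    if num_partitions <= 1:
        # No usable split: everything is read through a single partition.
        return ["1=1"]

    stride = span // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower + i * stride
        end = start + stride
        if i == 0:
            # First slice also picks up NULLs and values below the lower bound.
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last slice picks up everything above the final boundary.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


predicates = partition_predicates("id", 0, 100, 4)
for p in predicates:
    print(p)
```

Without a suitable column there is nothing to build these clauses from, which is why the whole table ends up in one partition.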
I have a stream from HDFS and I need to join it with my metadata, which is also in HDFS; both are Parquet.
My metadata sometimes gets updated and I need to join with the freshest, most recent version, which ideally means reading the metadata from HDFS on every stream micro-batch.
I tried to test this, but unfortunately Spark reads the metadata once and then caches the files (supposedly), even when I tried with spark.sql.parquet.cacheMetadata=false.
Is there a way to read it on every micro-batch? ForeachWriter is not what I'm looking for.
Here are code examples:
spark.sql("SET spark.sql.streaming.schemaInference=true")
spark.sql("SET spark.sql.parquet.cacheMetadata=false")
val stream = spark.readStream.parquet("/tmp/streaming/")
val metadata = spark.read.parquet("/tmp/metadata/")
val joinedStream = stream.join(metadata, Seq("id"))
joinedStream.writeStream.option("checkpointLocation", "/tmp/streaming-test/checkpoint").format("console").start()
/tmp/metadata/ is updated using Spark's append mode.
As far as I understand, when the metadata is accessed through the JDBC source with Spark Structured Streaming, Spark will query it on each micro-batch.
As far as I found, there are two options:
Create temp view and refresh it using interval:
metadata.createOrReplaceTempView("metadata")
and trigger refresh in separate thread:
spark.catalog.refreshTable("metadata")
NOTE: in this case Spark will only re-read the same path; it does not work if you need to read metadata from different folders on HDFS, e.g. ones named with timestamps.
Restart the stream at an interval, as Tathagata Das suggested
This way is not suitable for me, since my metadata might be refreshed several times per hour.
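The "refresh in a separate thread" option can be sketched as follows. This is a minimal, Spark-free sketch: start_periodic_refresh is a hypothetical helper, and refresh_fn stands in for whatever call performs the refresh (in the setup above that would be something like lambda: spark.catalog.refreshTable("metadata")).

```python
import threading


def start_periodic_refresh(refresh_fn, interval_seconds):
    """Call refresh_fn every interval_seconds on a background thread.

    Returns (stop_event, thread); set the event to stop the loop.
    """
    stop_event = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is requested,
        # so the loop sleeps between refreshes and exits promptly on stop.
        while not stop_event.wait(interval_seconds):
            refresh_fn()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop_event, thread


# Hypothetical usage with Spark (assumes a live `spark` session):
#   stop, worker = start_periodic_refresh(
#       lambda: spark.catalog.refreshTable("metadata"), 60)
#   ...
#   stop.set(); worker.join()
```

Running the refresh on a daemon thread keeps it from blocking shutdown of the driver, at the cost of the refresh not being synchronized with micro-batch boundaries.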
I recently started using Spark SQL.
I read the Data Source API and I'm still confused about what role Spark SQL plays.
When I run SQL on whatever I need, will Spark load all the data first and perform the SQL in memory? That would mean Spark SQL is only an in-memory DB that works on data already loaded. Or does it scan locally every time?
I'd really appreciate any answers.
Best Regards.
I read the Data Source API and I'm still confused about what role Spark SQL plays.
Spark SQL is not a database. It is just an interface that allows you to execute SQL-like queries over the data that you store in Spark-specific row-based structures called DataFrames.
To run a SQL query via Spark, the first requirement is that the table you are trying to query is present either in the Hive Metastore (i.e. the table should be present in Hive) or as a temporary view that is part of the current SQLContext/HiveContext.
So, if you have a dataframe df and you want to run SQL queries over it, you can use:
df.createOrReplaceTempView("temp_table") // or registerTempTable
and then you can use the SQLContext/HiveContext or the SparkSession to run queries over it.
spark.sql("SELECT * FROM temp_table")
Here's eliasah's answer that explains how createOrReplaceTempView works internally
When I run SQL on whatever I need, will Spark load all the data first and perform the SQL in memory?
The data will be stored in memory or on disk depending on the persistence strategy that you use. If you choose to cache the table, the data will be stored in memory and operations will be considerably faster than when the data is fetched from disk. That part is configurable and up to the user. You can basically tell Spark how you want it to store the data.
Spark SQL will only cache the rows that are pulled by the action, which means it will cache only as many partitions as it has to read during the action. This makes your first call much faster than your second call.
I've been facing a problem with Spark Streaming regarding the insertion of output DStreams into a permanent SQL table. I'd like to insert every output DStream (coming from a single batch that Spark processes) into a single table. I've been using Python with Spark version 1.6.2.
At this point in my code I have a DStream made of one or more RDDs that I'd like to permanently insert/store into a SQL table without losing any result for each processed batch.
rr = feature_and_label.join(result_zipped)\
.map(lambda x: (x[1][0][0], x[1][1]) )
Each DStream here is represented, for instance, by a tuple like this: (4.0, 0).
I can't use Spark SQL because of the way Spark treats the 'table', that is, as a temporary table, therefore losing the result at every batch.
This is an example of output:
Time: 2016-09-23 00:57:00
(0.0, 2)
Time: 2016-09-23 00:57:01
(4.0, 0)
Time: 2016-09-23 00:57:02
(4.0, 0)
...
As shown above, each batch is made up of only one DStream. As I said before, I'd like to permanently store these results in a table saved somewhere, and possibly query it at a later time. So my question is:
is there a way to do it?
I'd appreciate it if somebody could help me out with this, and especially tell me whether it is possible or not.
Thank you.
Vanilla Spark does not provide a way to persist data unless you've downloaded the version packaged with HDFS (although they appear to be playing with the idea in Spark 2.0). One way to store the results to a permanent table and query those results later is to use one of the various databases in the Spark Database Ecosystem. There are pros and cons to each and your use case matters. I'll provide something close to a master list. These are segmented by:
Type of data management, form data is stored in, connection to Spark
Database, SQL, Integrated:
  SnappyData
Database, SQL, Connector:
  MemSQL
  Hana
  Kudu
  FiloDB
  DB2
  SQLServer (JDBC)
  Oracle (JDBC)
  MySQL (JDBC)
Database, NoSQL, Connector:
  Cassandra
  HBase
  Druid
  Ampool
  Riak
  Aerospike
  Cloudant
Database, Document, Connector:
  MongoDB
  Couchbase
Database, Graph, Connector:
  Neo4j
  OrientDB
Search, Document, Connector:
  Elasticsearch
  Solr
Data grid, SQL, Connector:
  Ignite
Data grid, NoSQL, Connector:
  Infinispan
  Hazelcast
  Redis
File System, Files, Integrated:
  HDFS
File System, Files, Connector:
  S3
  Alluxio
Datawarehouse, SQL, Connector:
  Redshift
  Snowflake
  BigQuery
  Aster
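Whichever store is chosen, the pattern is the same: append each batch's rows to a permanent table, keyed by batch time, so nothing is overwritten between batches. Below is a minimal sketch using Python's built-in sqlite3 purely as a stand-in for one of the SQL databases above. The save_batch helper, the batch_results table and its column names v1/v2 are all hypothetical, and the sample rows are taken from the question's output.

```python
import sqlite3


def save_batch(conn, batch_time, rows):
    """Append one batch of (v1, v2) tuples to a permanent table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS batch_results "
        "(batch_time TEXT, v1 REAL, v2 INTEGER)"
    )
    # INSERT (not replace) so earlier batches are kept.
    conn.executemany(
        "INSERT INTO batch_results VALUES (?, ?, ?)",
        [(batch_time, v1, v2) for v1, v2 in rows],
    )
    conn.commit()


# ":memory:" keeps the demo self-contained; a file path would make it persistent.
conn = sqlite3.connect(":memory:")
save_batch(conn, "2016-09-23 00:57:00", [(0.0, 2)])
save_batch(conn, "2016-09-23 00:57:01", [(4.0, 0)])

stored = conn.execute(
    "SELECT batch_time, v1, v2 FROM batch_results ORDER BY batch_time"
).fetchall()
print(stored)

# In the streaming job this could be driven from the DStream, assuming each
# batch is small enough to collect to the driver:
#   rr.foreachRDD(lambda time, rdd: save_batch(conn, str(time), rdd.collect()))
```

Collecting to the driver is only reasonable for tiny batches like the ones shown in the question; for larger volumes the writes would need to happen on the executors, with a connection opened per partition.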
Instead of using external connectors, it is better to go for Spark Structured Streaming.