I want to write around 10 GB of data every day to an Azure SQL Server database using PySpark. Currently I am using the JDBC driver, which takes hours because it issues insert statements one by one.
I am planning to use the azure-sqldb-spark connector, which claims to "turbo boost" the write using bulk insert.
I went through the official doc: https://github.com/Azure/azure-sqldb-spark.
The library is written in Scala and basically requires the use of two Scala classes:
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val bulkCopyConfig = Config(Map(
"url" -> "mysqlserver.database.windows.net",
"databaseName" -> "MyDatabase",
"user" -> "username",
"password" -> "*********",
"databaseName" -> "MyDatabase",
"dbTable" -> "dbo.Clients",
"bulkCopyBatchSize" -> "2500",
"bulkCopyTableLock" -> "true",
"bulkCopyTimeout" -> "600"
))
df.bulkCopyToSqlDB(bulkCopyConfig)
Can it be used in PySpark like this (using sc._jvm):
Config = sc._jvm.com.microsoft.azure.sqldb.spark.config.Config
connect= sc._jvm.com.microsoft.azure.sqldb.spark.connect._
# all config
df.connect.bulkCopyToSqlDB(bulkCopyConfig)
I am not an expert in Python. Can anybody help me with a complete snippet to get this done?
The Spark connector currently (as of March 2019) only supports the Scala API (as documented here).
So if you are working in a notebook, you could do all the preprocessing in Python and finally register the DataFrame as a temp table, e.g.:
df.createOrReplaceTempView('testbulk')
and then do the final step in Scala:
%scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
// build bulkCopyConfig exactly as in the Scala snippet above
spark.table("testbulk").bulkCopyToSqlDB(bulkCopyConfig)
I need to transfer data from one cluster to an other.
The table structure is the same on both clusters, what I need to do is select data from Table A, Clustering Key A1 on Cluster 1 and copy it to Table B, Clustering Key A1 on Cluster 2.
There is a high number of entries for that clustering key, I suppose more than 50,000,000.
I do not want and I cannot copy the whole table, because data between clusters in this table is different.
One option would be to write a script and loop through the data, writing to Cluster 2. This would work, but it sounds inefficient and I would need to address problems like "what to do if this script crashes in the middle of the operation?"
What is the best approach for that?
Based on my experience, Spark provides the best mechanism for such activities. You can do it using both the RDD and DataFrame APIs. Below is a code snippet based on the reference links:
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import org.apache.spark.SparkContext
import sqlContext.implicits._ // needed for the $"column" syntax
sqlContext.setConf("ClusterOne/spark.cassandra.connection.host", "127.0.0.1")
sqlContext.setConf("ClusterTwo/spark.cassandra.connection.host", "127.0.0.2")
// Read from ClusterOne
val dfFromClusterOne = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "cluster" -> "ClusterOne",
    "keyspace" -> "ks",
    "table" -> "A"
  ))
  .load
  .filter($"id" === "A1")
// Write to ClusterTwo
dfFromClusterOne
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "cluster" -> "ClusterTwo",
    "keyspace" -> "ks",
    "table" -> "B"
  ))
  .mode("append") // the target table already exists on ClusterTwo
  .save
Reference links:
http://www.russellspitzer.com/2016/02/16/Multiple-Clusters-SparkSql-Cassandra/
Transferring data from one cluster to another in Cassandra
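If you are working from PySpark instead of Scala, a rough equivalent of the DataFrame approach above (assuming the same spark-cassandra-connector is on the classpath; the keyspace, tables and key value are the same placeholders as above) would be:
spark.conf.set("ClusterOne/spark.cassandra.connection.host", "127.0.0.1")
spark.conf.set("ClusterTwo/spark.cassandra.connection.host", "127.0.0.2")
# Read the rows for the clustering key from Cluster 1
df_from_cluster_one = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(cluster="ClusterOne", keyspace="ks", table="A")
    .load()
    .filter("id = 'A1'"))
# Append them into the existing table on Cluster 2
(df_from_cluster_one.write
    .format("org.apache.spark.sql.cassandra")
    .options(cluster="ClusterTwo", keyspace="ks", table="B")
    .mode("append")
    .save())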
For bulk data copy, you should consider sstableloader. It is a good tool for copying data from one cluster and loading it into another. Please refer to the documentation below:
https://cassandra.apache.org/doc/latest/tools/sstable/sstableloader.html?highlight=sstableloader
I am trying to translate the Spark implementation to Pyspark, which is discussed in this blog:
https://dorianbg.wordpress.com/2017/11/11/building-the-speed-layer-of-lambda-architecture-using-structured-spark-streaming/
However, I am having a lot of problems because some of the methods available on a Scala DataFrame/Dataset aren't available in PySpark or need some conversion to make them work. I am specifically having trouble with this part:
var data_stream_cleaned = data_stream
.selectExpr("CAST(value AS STRING) as string_value")
.as[String]
.map(x => (x.split(";"))) //wrapped array
.map(x => tweet(x(0), x(1), x(2), x(3), x(4), x(5)))
.selectExpr( "cast(id as long) id", "CAST(created_at as timestamp) created_at", "cast(followers_count as int) followers_count", "location", "cast(favorite_count as int) favorite_count", "cast(retweet_count as int) retweet_count")
.toDF()
.filter(col("created_at").gt(current_date())) // kafka will retain data for last 24 hours, this is needed because we are using complete mode as output
.groupBy("location")
.agg(count("id"), sum("followers_count"), sum("favorite_count"), sum("retweet_count"))
How would you go about making this work? I have successfully connected to a Kafka stream. I'm just trying to aggregate the data so that I can load it to Redshift.
This is what I have so far:
ds = data_stream.selectExpr("CAST(value AS STRING) as string_value").rdd.map(lambda x: x.split(";"))
I get an error saying
Queries with streaming sources must be executed with writeStream.start()
What could be wrong? I'm not trying to query the data, just transform it. Any help would be greatly appreciated!
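Calling .rdd on a streaming DataFrame forces the query to execute, which is most likely what triggers the "Queries with streaming sources must be executed with writeStream.start()" error. One way around it is to stay entirely in the DataFrame API: split the value column and cast the pieces. Below is a minimal PySpark sketch, assuming the semicolon-separated fields arrive in the order used by the Scala tweet(...) call above:
from pyspark.sql.functions import col, count, current_date, split
from pyspark.sql.functions import sum as sum_
# Field order is an assumption taken from the tweet(...) constructor in the blog post
parts = split(col("value").cast("string"), ";")
data_stream_cleaned = (data_stream
    .select(
        parts.getItem(0).cast("long").alias("id"),
        parts.getItem(1).cast("timestamp").alias("created_at"),
        parts.getItem(2).cast("int").alias("followers_count"),
        parts.getItem(3).alias("location"),
        parts.getItem(4).cast("int").alias("favorite_count"),
        parts.getItem(5).cast("int").alias("retweet_count"))
    # Kafka retains only the last 24 hours; needed because the output mode is complete
    .filter(col("created_at") > current_date())
    .groupBy("location")
    .agg(count("id"), sum_("followers_count"),
         sum_("favorite_count"), sum_("retweet_count")))
# Nothing runs until a sink is started, e.g.:
# data_stream_cleaned.writeStream.outputMode("complete").format("console").start()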
I have a MongoDB collection with 26,000 records which I am reading into a DataFrame. It has a column, column_1, with the String value column_1_value across all the records. I am trying to filter the DataFrame and get the count as follows:
val readConfig = ReadConfig(Map("collection" -> collectionName,"spark.mongodb.input.readPreference.name" -> "primaryPreferred", "spark.mongodb.input.database" -> dataBaseName, "spark.mongodb.input.uri" -> hostName))
val dataFrame = MongoSpark.load(spark, readConfig)
df.filter(df.col("column_1") === "column_1_value").count()
Where spark is the instance of SparkSession.
The MongoDB record structure looks something like this.
{
"_id" : ObjectId("SOME_ID"),
"column_1" : "column_1_value",
"column_2" : SOME_VALUE,
"column_3" : SOME_OTHER_VALUE
}
There is no nested structure, and all the records have the same set of fields. I am not accessing the DB at any time while Spark is running.
As all the records have the same value of column_1, I expected to get the size of the DataFrame itself as the output, but instead I am getting a lower value. Not only that, I am getting different outcomes each time I run the above. The outcome normally varies between 15,000 and 24,000.
But the same approach seems to work when the collection is smaller, around 5,000 records.
I have tried the following approaches, with no success:
Used equalTo instead of ===
Used $column_1
Used isin
Used df.where instead of df.filter
Used createOrReplaceTempView and ran SQL query
The only things that seem to work are df.cache() and df.persist(), neither of which I think will be good for overall performance when working with large data.
What could be the reason for this behavior, and how can it be resolved?
My Spark version is 2.2.0 and my MongoDB version is 3.4. I am running Spark in local mode, with 16 GB of RAM and an 8-core processor.
Edit 1
I tried changing the MongoDB Spark connector partitioning policy as follows, but with no success.
val readConfig = ReadConfig(Map("collection" -> collectionName, "spark.mongodb.input.readPreference.name" -> "primaryPreferred", "spark.mongodb.input.database" -> dataBaseName, "spark.mongodb.input.uri" -> hostName, "spark.mongodb.input.partitionerOptions" -> "MongoPaginateBySizePartitioner"))
val df = MongoSpark.load(spark, readConfig)
df.count()
df.filter(df.col("column_1") === "column_1_value").count()
The first count returns the correct value, even with the default partitioning strategy, which leads me to believe that the Mongo Connector is working fine.
My RDD's type is RDD[Map], and the map format is like:
{"date": "2015-01-01", "topic": "sports", "content": "foo,bar"}
...
Now I would like to obtain a sequence like
{"date": "2015-01-01", "topic":"sports", "count":22}
that is, the count of every topic for each day.
How to group and count it in Spark?
Here is code using Spark SQL on Spark 1.3.0. It is well tested, and if you are familiar with SQL you can write simple queries to process your JSON data. Please note that the syntax is a little different in later versions of Spark (e.g. 1.5):
Save the file to HDFS (e.g. /user/cloudera/data.json):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql("set spark.sql.shuffle.partitions=10");
// You can change the number of partitions to whatever you want; by default it is 200
import sqlContext.implicits._
val jsonData = sqlContext.jsonFile("/user/cloudera/data.json")
jsonData.registerTempTable("jsonData")
val tableData = sqlContext.sql("select `date`, topic, count(1) from jsonData group by `date`, topic")
tableData.collect().foreach(println)
If Map is an object having the fields you have shown, you can simply do this:
import org.apache.spark.SparkContext._
val resultRDD = yourRDD.map(x => ((x.date, x.topic), 1)).reduceByKey(_ + _)
resultRDD.map (
x =>
// here you have to create the JSON you want as output
// knowing that x._1._1 contains the date, x._1._2 contains the topic
// and x._2 contains the count
)
The code I have written is in Scala, but I'm sure it'll be easy for you to adapt it to your language if you're using Java or Python.
Also, pay attention to the import at the top, since it's needed for the implicit conversion between an RDD and a PairRDD.
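If your language is Python, here is a minimal PySpark sketch of the same approach, assuming the RDD (called your_rdd here) holds plain Python dicts with the keys shown in the question:
result = (your_rdd
    .map(lambda m: ((m["date"], m["topic"]), 1))
    .reduceByKey(lambda a, b: a + b)
    .map(lambda kv: {"date": kv[0][0], "topic": kv[0][1], "count": kv[1]}))
# result.collect() then yields dicts like {"date": "2015-01-01", "topic": "sports", "count": 22}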
How do I set an HBase timestamp using Phoenix-Spark, similar to the HBase API:
Put(rowkey, timestamp.getMillis)
This is my code:
val rdd = processedRdd.map(r => Row.fromSeq(r))
val dataframe = sqlContext.createDataFrame(rdd, schema)
dataframe.save("org.apache.phoenix.spark", SaveMode.Overwrite,
Map("table" -> HTABLE, "zkUrl" -> zkQuorum))
You can use the following:
dataframe.write.format("org.apache.phoenix.spark").mode("overwrite").option("table", "HTable").option("zkUrl", "xxxx:0000").save()
This feature is currently not supported in Phoenix. You may need to use the HBase API instead of Phoenix.