I want to bring a huge table from an Oracle DB into Delta Lake, day by day.
Each day has a volume of about 3 GB.
I want to land it in Delta Lake with the layout below, one folder per date:
Tablename/2020-12-10
from delta.tables import *
from datetime import datetime, timedelta

ip = '1.1.1.1'
port = '1111'
database = 'Test'
user = 'test'
password = 'test'
drivertype = 'oracle.jdbc.driver.OracleDriver'

start_date = "20210101"
stop_date = "20210103"
start = datetime.strptime(start_date, "%Y%m%d")
stop = datetime.strptime(stop_date, "%Y%m%d")

# Each day's read is split into 4 JDBC partitions on the hour-of-day column.
partitionColumn = 'time_section'
lowerBound = 0
upperBound = 24
partitions = 4

while start < stop:
    # Push the per-day filter down to Oracle so only one day is fetched per iteration.
    SQLCommand = """(select /*+parallel(a,4)*/ a.* from lbi_app.mytble a where
        date = %s)""" % start.strftime('%Y%m%d')
    TempDFName = 'mytble'
    df = spark.read.format("jdbc")\
        .option("url", f"jdbc:oracle:thin:@//{ip}:{port}/{database}")\
        .option("dbtable", SQLCommand)\
        .option("fetchsize", 500000)\
        .option("user", user)\
        .option("numPartitions", partitions)\
        .option("lowerBound", lowerBound)\
        .option("upperBound", upperBound)\
        .option("partitionColumn", partitionColumn)\
        .option("oracle.jdbc.timezoneAsRegion", "false")\
        .option("oracle.jdbc.mapDateToTimestamp", "true")\
        .option("encoding", "UTF-8")\
        .option("characterEncoding", "UTF-8")\
        .option("useUnicode", "true")\
        .option("password", password)\
        .option("driver", drivertype)\
        .load()

    df.write.format("delta").partitionBy("date")\
        .option("overwriteSchema", "true")\
        .mode("overwrite")\
        .save("/delta/layer1/" + str(TempDFName))

    start = start + timedelta(days=1)
But when I run this code, all records are saved into a single folder: myTable/date=20210101.
I want each date to have its own folder, like below:
myTable/date=20210101, myTable/date=20210102, myTable/date=20210103
What is the solution to this problem?
I think that you need to use
.option("mergeSchema", "true").mode("append")
instead of
.option("overwriteSchema","true").mode("overwrite")
otherwise you'll overwrite the whole table on each iteration.
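For reference, a minimal sketch of the write side with those options swapped in (same variables and path as the code above):
# Append each day's data so earlier dates are kept; partitionBy("date") then creates
# one subfolder per distinct date value under /delta/layer1/mytble.
df.write.format("delta") \
    .partitionBy("date") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/delta/layer1/" + str(TempDFName))
If a day ever has to be re-loaded, Delta's replaceWhere option can overwrite just that date's partition instead of the whole table.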
I want to use Spark to read data from a table in Azure SQL. However, I don't want the entire table, so I have used the "query" option to filter down to what is needed. However, I can't find a way to pass a binary (SQL Server's RowVersion) parameter into the query. How can this be done?
df = spark.read.format("jdbc") \
.option("url", "jdbc:sqlserver://serverName.database.windows.net;databaseName=databaseName") \
.option("query", "SELECT * FROM dbo.tableName WHERE RowVersion > ?") \
.option("accesstoken", access_token) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
I reproduced the same scenario in my environment and created a sample table, states, with a rowversion column named RV in Azure SQL.
Now I can read the row version data in Azure Databricks; please follow the code below:
Hostname = "<server_name>.database.windows.net"
Database = "<data_base_name>"
Port = "1433"
username = "username"
password = "pass"
Url = "jdbc:sqlserver://{0}:{1};database={2}".format(Hostname, Port, Database)

connProp = {
    "user": username,
    "password": password,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

query = "(select rv from states) states"
df = spark.read.jdbc(url=Url, table=query, properties=connProp)
display(df)
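As for passing a specific binary value: the JDBC source options do not support parameter binding, so one workaround (a sketch, assuming the last seen rowversion is available as Python bytes, for example from a previous load) is to format it as a T-SQL binary literal and embed it in the query text:
# Hypothetical watermark from a previous run: 8 bytes of rowversion.
last_rv = bytes.fromhex("00000000000007d1")

# rowversion compares as binary, so a 0x... literal works directly in the WHERE clause.
rv_literal = "0x" + last_rv.hex()
row_version_query = f"SELECT * FROM dbo.tableName WHERE RowVersion > {rv_literal}"

df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://serverName.database.windows.net;databaseName=databaseName") \
    .option("query", row_version_query) \
    .option("accesstoken", access_token) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()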
My main question concerns performance. Consider the code below:
query = """
SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
"""
df = spark.read.format("jdbc") \
.option("url", "connectionString") \
.option("user", user) \
.option("password", password) \
.option("numPartitions", 10) \
.option("partitionColumn", "Id") \
.option("lowerBound", lowerBound) \
.option("upperBound", upperBound) \
.option("dbtable", query) \
.load()
As far as I understand, this query will be sent to the database, which processes it and returns the results to Spark.
Now considering the code below:
df = spark.read.jdbc(url = mssqlconnection,
table = "dbo.Customers",
properties = mssql_prop
).select(
f.col("Id"),
f.col("Name")
).where("Id <> 1").orderBy(f.col("Id"))
I know that spark will load the entire table into memory and then execute the filters on the dataframe.
Finally, the last code snippet:
df = spark.read.jdbc(url = mssqlconnection,
                     table = "dbo.Customers",
                     properties = mssql_prop
)
# Register the full table as a temp view so it can be queried with Spark SQL.
df.createOrReplaceTempView("Customers")
final_df = spark.sql("""
    SELECT Name, Id FROM Customers WHERE Id <> 1 ORDER BY Id
""")
I have 3 questions:
Among the three snippets, which one is the most correct? I always use the second approach; is that right?
What is the difference between using spark.sql and applying the operations directly on the DataFrame, as in the second snippet?
What is the minimum number of rows for it to be worth using Spark? Is it worth using for queries that return fewer than 1 million rows?
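One way to check what is actually pushed down to the database in each variant is to inspect the physical plan; a minimal sketch, assuming the DataFrame from the second snippet:
# The JDBC scan node lists PushedFilters; if the Id filter shows up there, it runs in the
# database rather than in Spark after the whole table has been loaded.
df.explain(True)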
I would appreciate it if anybody could point me to a code snippet for converting a Spark SQL result that has an Oracle BLOB column into a Java byte[]. Here is what I have, but I am getting an error.
Dataset<tableX> dataset = sparkSession.read()
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@(DESCRIPTION = (ADDRESS = (PROTOCOL = TCP)(HOST = xx)(PORT = 1234))(CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = xy)))")
    .option("dbtable", "(select lob_id, blob_data from tableX) test1")
    .option("user", "user1")
    .option("password", "pass1")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load();

//dataset.show();
dataset.foreach((ForeachFunction<tableX>) row -> {
    byte blobData[] = row.getAs("blob_data");
});
Thank you.
DataFrameReader.load returns an untyped dataframe (aka Dataset<Row>), so calling foreach with a ForeachFunction<tableX> produces a compiler error.
Option 1: stick to the untyped dataframe and use ForeachFunction<Row>:
Dataset<Row> dataframe = sparkSession.read() ... .load();
dataframe.foreach((ForeachFunction<Row>) row -> {
byte blobData[] = (byte[])row.getAs("blob_data");
System.out.println(Arrays.toString(blobData));
});
Option 2: after reading the dataframe, transform it into a typed dataset and use ForeachFunction<tableX>. A class tableX should exist and contain the fields of the original query as members:
Dataset<tableX> dataset = dataframe.as(Encoders.bean(tableX.class));
dataset.foreach((ForeachFunction<tableX>) row -> {
byte blobData[] = row.getBlob_data();
System.out.println(Arrays.toString(blobData));
});
Option 2 assumes that the class tableX has a getter for the field blob_data named getBlob_data().
I have a Kafka stream through which I am getting JSON-based IoT device logs. I'm using PySpark to process the stream, analyze the logs, and create a transformed output.
My device json looks like this:
{"messageid":"1209a714-811d-4ad6-82b7-5797511d159f",
"mdsversion":"1.0",
"timestamp":"2020-01-20 19:04:32 +0530",
"sensor_id":"CAM_009",
"location":"General Assembly Area",
"detection_class":"10"}
{"messageid":"4d119126-2d12-412c-99c2-c159381bee5c",
"mdsversion":"1.0",
"timestamp":"2020-01-20 19:04:32 +0530",
"sensor_id":"CAM_009",
"location":"General Assembly Area",
"detection_class":"10"}
I'm trying to transform the logs so that I get the count of records for each device, grouped by timestamp and sensor id. The result JSON would look like this:
{
  "sensor_id": "CAM_009",
  "timestamp": "2020-01-20 19:04:32 +0530",
  "location": "General Assembly Area",
  "count": 2
}
Full code that I'm trying - pyspark-kafka.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_timestamp
from pyspark.sql.types import StructType, StringType

spark = SparkSession.builder.appName('analytics').getOrCreate()
spark.sparkContext.setLogLevel('ERROR')

brokers = 'kafka-mybroker-url-host:9092'
readTopic = 'DetectionEntry'
outTopic = 'DetectionResults'

# Read the raw device logs from Kafka.
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", brokers) \
    .option("subscribe", readTopic) \
    .load()
transaction_detail_df1 = df.selectExpr("CAST(value AS STRING)", "timestamp")

alert_schema = StructType() \
    .add("message_id", StringType()) \
    .add("mdsversion", StringType()) \
    .add("timestamp", StringType()) \
    .add("sensor_id", StringType()) \
    .add("location", StringType()) \
    .add("detection_class", StringType())

# Parse the JSON payload and flatten it into columns.
transaction_detail_df2 = transaction_detail_df1 \
    .select(from_json(col("value"), alert_schema).alias("alerts"))
transaction_detail_df3 = transaction_detail_df2.select("alerts.*")
transaction_detail_df3 = transaction_detail_df3 \
    .withColumn("timestamp", to_timestamp(col("timestamp"), "YYYY-MM-DD HH:mm:ss SSSS")) \
    .withWatermark("timestamp", "500 milliseconds")

# Aggregate per sensor, timestamp and location.
transaction_detail_df3.createOrReplaceTempView("alertsview")
results = spark.sql("select sensor_id, timestamp, location, count(*) as count from alertsview group by sensor_id, timestamp, location")
results.printSchema()

results_kakfa_output = results
results_kakfa_output.writeStream \
    .format("console") \
    .outputMode("append") \
    .trigger(processingTime='3 seconds') \
    .start() \
    .awaitTermination()
When I run this code, I get no results in the console output. The overall objective is to process the device logs on a 3-second interval and find the count for each timestamp entry of a device within that interval. I have tried the same SQL query on a MySQL database with the same schema and it works fine, but here the stream produces nothing to process further. I'm unable to figure out what I am missing.
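A minimal sketch of one thing to try (an assumption, not a verified fix): with a streaming aggregation, "append" output mode only emits a group after the watermark has passed its event time, so the running counts are easier to see in "update" mode.
# Sketch only: emit updated counts on every trigger instead of waiting for the watermark
# to close each group, which is what "append" mode requires for aggregations.
# (The to_timestamp pattern may also need to be something like "yyyy-MM-dd HH:mm:ss Z"
# so that values such as "2020-01-20 19:04:32 +0530" parse instead of becoming null.)
results.writeStream \
    .format("console") \
    .outputMode("update") \
    .trigger(processingTime='3 seconds') \
    .start() \
    .awaitTermination()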
I am trying to load data from an RDBMS into a Hive table on HDFS. I am reading the RDBMS table in the following way:
val mydata = spark.read
.format("jdbc")
.option("url", connection)
.option("dbtable", "select * from dev.userlocations")
.option("user", usrname)
.option("password", pwd)
.option("numPartitions",20)
.load()
I can see in the executor logs that the option("numPartitions", 20) is not being honored and the entire dataset is dumped onto a single executor.
Now there are options to provide the partition column, lower bound & upper bound as below:
val mydata = spark.read
.format("jdbc")
.option("url", connection)
.option("dbtable", "select * from dev.userlocations")
.option("user", usrname)
.option("password", pwd)
.option("partitionColumn","columnName")
.option("lowerbound","x")
.option("upperbound","y")
.option("numPartitions",20).load()
The above only works if the partition column has a numeric datatype. The table I am reading is partitioned on a column, location; it is about 5 GB overall and has 20 partitions, one for each of the 20 distinct locations in the table. Is there any way I can read the table in parallel based on the table's partition column, location?
Could anyone let me know if this can be implemented at all?
You can use the predicates option for this. It takes an array of strings, and each item in the array is a condition used to partition the source table; the total number of Spark partitions is determined by the number of those conditions.
val preds = Array[String]("location = 'LOC1'", "location = 'LOC2' OR location = 'LOC3'")
val df = spark.read.jdbc(
url = databaseUrl,
table = tableName,
predicates = preds,
connectionProperties = properties
)
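If the distinct locations are not known up front, the predicate list can be built dynamically. A PySpark sketch, assuming the same databaseUrl, table name and connection properties as above:
# Fetch the distinct locations first, then create one predicate per location so each
# Spark partition reads exactly one table partition.
locations_df = spark.read.jdbc(
    url=databaseUrl,
    table="(select distinct location from dev.userlocations) t",
    properties=properties,
)
preds = ["location = '{}'".format(r["location"]) for r in locations_df.collect()]

df = spark.read.jdbc(
    url=databaseUrl,
    table="dev.userlocations",
    predicates=preds,
    properties=properties,
)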