I am using the spark-redshift connector in order to launch a query from Spark.
val results = spark.sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", url_connection)
.option("query", query)
.option("aws_iam_role", iam_role)
.option("tempdir", base_path_temp)
.load()
I would like to increase the slot count to speed up the query, because it is disk-based. But I don't know how to run the following statement through the connector:
set wlm_query_slot_count to 3;
I don't see how to do this, since the connector's read command doesn't provide preactions and postactions like the write command does.
Thanks
DataFrameLoadedFromLeftDatabase is data loaded using DataFrameReader from a first database, say LeftDB.
I need to
iterate through each row in this dataframe,
connect to a second database say RightDB,
find some matching record from RightDB,
and do some business logic
This is an iterative operation, so it cannot be done simply with a JOIN between LeftDB and RightDB. The goal is to derive some new fields, create a new DataFrame targetDF, and write it into a third database, say ThirdDB, using DataFrameWriter.
I know that I can use
val targetDF = DataFrameLoadedFromLeftDatabase.mapPartitions(
partition => {
val rightDBconnection = new DbConnection // establish a connection to RightDB
val result = partition.map(record => {
readMatchingFromRightDBandDoBusinessLogicTransformationAndReturnAList(record, rightDBconnection)
}).toList
rightDBconnection.close()
result.iterator
}
).toDF()
targetDF.write
.format("jdbc")
.option("url", "jdbc:postgresql:dbserver")
.option("dbtable", "table3")
.option("user", "username")
.option("password", "password")
.save()
I am wondering whether Apache Spark is suitable for this type of chatty data processing application.
I am wondering whether iterating through each record and querying RightDB will be too chatty in this approach.
I am looking for advice on improving this design to make use of Spark's capabilities. I also want to make sure the processing does not cause too many shuffle operations, for performance reasons.
Ref: Related SO Post
In this kind of situation we prefer spark.sql. Basically, define two different DataFrames, join them with a query, and then apply your business logic afterwards.
For example;
import org.apache.spark.sql.{DataFrame, SparkSession}
// Add your columns here
case class MyResult(ID: String, NAME: String)
// Create a SparkSession
val spark = SparkSession.builder()
.appName("Join Tables and Add Prefix to ID Column")
.config("spark.master", "local[*]")
.getOrCreate()
// Read the first table from DB1
val firstTable: DataFrame = spark.read
.format("jdbc")
.option("url", "jdbc:postgresql://localhost/DB1")
.option("dbtable", "FIRST_TABLE")
.option("user", "your_username")
.option("password", "your_password")
.load()
firstTable.createOrReplaceTempView("firstTable")
// Read the second table from DB2
val secondTable: DataFrame = spark.read
.format("jdbc")
.option("url", "jdbc:postgresql://localhost/DB2")
.option("dbtable", "SECOND_TABLE")
.option("user", "your_username")
.option("password", "your_password")
.load()
secondTable.createOrReplaceTempView("secondTable")
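// Apply your filtering here. Select explicit columns: f.* and s.* would both
// contain ID, which makes the conversion to MyResult ambiguous.
// NAME is assumed to come from the second table; adjust to your own columns.
val result: DataFrame = spark.sql("SELECT f.ID, s.NAME FROM firstTable AS f LEFT JOIN secondTable AS s ON f.ID = s.ID")
import spark.implicits._ // encoders required by .as[MyResult] and .map
val finalData = result.as[MyResult]
  .map { record =>
    // Do your business logic; businessLogic is a placeholder for your own function
    businessLogic(record)
  }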
// Write the result to the third table in DB3
finalData.write
.format("jdbc")
.option("url", "jdbc:postgresql://localhost/DB3")
.option("dbtable", "THIRD_TABLE")
.option("user", "your_username")
.option("password", "your_password")
.save()
If your tables are big, you can execute a query and read its results directly. This lets you reduce the input size by filtering, for example by date:
val myQuery = """
(select * from table
where 1 = 1 -- replace with your own filter
) foo
"""
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost/DB")
  .option("user", "your_username")
  .option("password", "your_password")
  .option("dbtable", myQuery)
  .load()
Other than this, it is hard to do record-specific operations directly in Spark. You have to maintain your client connections etc. as custom logic. Spark is designed to read and write huge amounts of data, and it builds pipelines for that purpose; simple per-record operations are an overhead for it. If you do need API calls (or single DB calls), make them inside your map functions. A cache layer there can be a lifesaver in terms of performance. Always try to use a connection pool in your custom database calls; otherwise Spark will try to execute the mapping operations with many separate connections, which may put pressure on your database and cause connection failures.
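To make the advice about connections and caching concrete, here is a minimal sketch in Python of a per-partition lookup. It assumes a hypothetical connect factory (for example one that borrows from a connection pool) whose connections expose hypothetical lookup and close methods, and it assumes the rows carry an ID field; none of these names come from a real library.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple


def enrich_partition(
    rows: Iterable[Any],
    connect: Callable[[], Any],
) -> Iterator[Tuple[Any, Any]]:
    """Pair each row with its match, using one connection and one cache per partition."""
    conn = connect()  # hypothetical factory, e.g. borrowing from a connection pool
    cache: Dict[Any, Any] = {}  # avoids repeating the same lookup within a partition
    try:
        for row in rows:
            key = row["ID"]  # assumed key column
            if key not in cache:
                cache[key] = conn.lookup(key)  # hypothetical single-record query
            yield row, cache[key]
    finally:
        # Runs once the partition has been fully consumed (or the iterator is discarded).
        conn.close()
```

With a PySpark DataFrame this could be applied as df.rdd.mapPartitions(lambda rows: enrich_partition(rows, connect)). Because the function is a generator, rows are streamed rather than collected into a list, and the connection is only closed after the last row has been handed on.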
I can think of a lot of improvements, but in general all of them depend on having the data pre-distributed in HDFS, HBase, a Hive database, MongoDB, ...
I mean: you are thinking "relational data with a distributed processing mindset" ... I thought we were already beyond that XD
I am trying to read data in Delta format from ADLS. I want to read only a portion of that data, with a filter applied at read time. The same approach worked for me when reading through JDBC:
query = f"""
select * from {table_name}
where
createdate < to_date('{createdate}','YYYY-MM-DD HH24:MI:SS') or
modifieddate < to_date('{modifieddate}','YYYY-MM-DD HH24:MI:SS')
"""
return spark.read \
.format("jdbc") \
.option("url", url) \
.option("query", query) \
.option("user", username) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.load()
So I tried to read Delta in a similar way, using a query, but it reads the whole table.
return spark.read \
.format("delta") \
.option("query", query) \
.load(path)
How can I solve this without reading the full DataFrame and then filtering it?
Thanks in advance!
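Spark uses a feature called predicate pushdown to optimize queries.
In the first case, the filters can be passed on to the Oracle database, which executes them and returns only the matching rows.
Delta does not work that way: there is no database engine behind it, and the Delta reader has no query option, which is why the whole table was read. Apply the condition with filter (or where) on the DataFrame returned by load instead. Because Spark evaluates lazily, the filter becomes part of the scan rather than a pass over a fully loaded DataFrame, and partition pruning, data skipping and Z-ordering can reduce the number of files read. Since you are essentially querying Parquet files, though, Spark itself still has to read every file that may contain matching rows.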
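As a rough sketch of what that looks like for the question's dates, the Python below builds the condition as a Spark SQL predicate and attaches it to the Delta read. The function names build_date_filter and read_delta_filtered are made up for illustration, the date strings are assumed to be trusted values in yyyy-MM-dd HH:mm:ss form, and how many files are actually skipped depends on how the table is partitioned and what statistics it has.

```python
def build_date_filter(createdate: str, modifieddate: str) -> str:
    """Return a Spark SQL predicate mirroring the WHERE clause of the JDBC query."""
    # Values are interpolated directly, so they must come from a trusted source.
    return (
        f"createdate < to_timestamp('{createdate}') "
        f"OR modifieddate < to_timestamp('{modifieddate}')"
    )


def read_delta_filtered(spark, path: str, createdate: str, modifieddate: str):
    """Load a Delta table lazily and attach the filter before any action runs."""
    # `spark` is an existing SparkSession with Delta Lake available (assumption).
    return (
        spark.read
        .format("delta")
        .load(path)
        .filter(build_date_filter(createdate, modifieddate))
    )
```

Nothing is read when read_delta_filtered returns; the scan only happens when an action such as count or write is triggered, at which point the filter is already part of the plan.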
I'm new to Stack and Spark, so please forgive my simplicity and mistakes!
I have a problem with ClickHouse and Spark (2.4.7). I work in a Jupyter notebook.
Basically, I want to insert a DataFrame with an Array column into a ClickHouse table with an Array(String) column. Using the yandex driver this is impossible, because JDBC doesn't support Arrays, right? ;)
So I wanted to run the Spark session with the housepower jar, clickhouse-native-jdbc-shaded-2.6.4.jar, because I read that they added handling of Arrays. Correct me if I'm wrong.
And I want to run a query against ClickHouse via JDBC.
spark = SparkSession\
.builder\
.enableHiveSupport()\
.appName('custom-events-test')\
.config("spark.jars", "drivers/clickhouse-native-jdbc-shaded-2.6.4.jar")\
.getOrCreate()
My query:
query = """
select date as date,
partnerID as partnerID,
sessionID as sessionID,
toString(mapKeys(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as keys,
toString(mapValues(cast((JSONExtractKeysAndValues(ec.custom, 'String')), 'Map(String, String)'))) as values
from audience.uber_all
array join eventContents as ec
PREWHERE date = ('2022-07-22')
WHERE partnerID = 'XXX'
and ec.custom != '{}'
order by date, partnerID
"""
and my code:
df_tab = spark.read \
.format("jdbc") \
.option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
.option("url", f"jdbc:clickhouse://{ch_host}:9000/{ch_db}") \
.option("query", query) \
.option("user", ch_user) \
.option("password", ch_pass) \
.load()
But there I get an error:
[image: housepower_error]
BUT when I run the above query with the yandex driver, ru.yandex.clickhouse.ClickHouseDriver, everything works fine (even with the housepower jar loaded).
This error also appears when I select a column like this:
JSONExtractKeysAndValues(ec.custom, 'String')
or
toString(JSONExtractKeysAndValues(ec.custom, 'String'))
What am I doing wrong?
And how can I insert a DataFrame with an Array column into a ClickHouse table that also has an Array(String) column, using Spark JDBC? I looked everywhere but couldn't find a solution...
Thank you in advance!
An Oracle database table has 3 million records. I need to read it into a DataFrame, convert it to JSON, and send it to Event Hubs for downstream systems.
Below is my PySpark code to connect to the Oracle DB and read the table as a DataFrame:
df = spark.read \
.format("jdbc") \
.option("url", databaseurl) \
.option("query","select * from tablename") \
.option("user", loginusername) \
.option("password", password) \
.option("driver", "oracle.jdbc.driver.OracleDriver") \
.option("oracle.jdbc.timezoneAsRegion", "false") \
.load()
Then I convert the column names and values of each row into JSON (placed in a new column named body) and send it to Event Hubs.
I have defined ehconf and the Event Hubs connection string. Below is my code for writing to Event Hubs:
df.select("body") \
.write\
.format("eventhubs") \
.options(**ehconf) \
.save()
My PySpark code takes 8 hours to send 3 million records to Event Hubs.
Could you please suggest how to write a PySpark DataFrame to Event Hubs faster?
My event hub is created under an Event Hubs cluster with a capacity of 1 CU.
Databricks cluster config :
mode: Standard
runtime: 10.3
worker type: Standard_D16as_v4 64GB Memory,16 cores (min workers :1, max workers:5)
driver type: Standard_D16as_v4 64GB Memory,16 cores
The problem is that the JDBC connector uses just one connection to the database by default, so most of your workers are probably idle. You can confirm that in Cluster Settings > Metrics > Ganglia UI.
To actually make use of all the workers, the JDBC connector needs to know how to parallelize retrieving your data. For this you need a column whose values are evenly distributed. For example, if you have a date field in your data and every date has a similar number of records, you can use it to split up the data:
df = spark.read \
.format("jdbc") \
.option("url", jdbcUrl) \
.option("dbtable", tableName) \
.option("user", jdbcUsername) \
.option("password", jdbcPassword) \
.option("numPartitions", 64) \
.option("partitionColumn", "<dateField>") \
.option("lowerBound", "2019-01-01") \
.option("upperBound", "2022-04-07") \
.load()
You have to define the column name and its minimum and maximum values so that the JDBC connector can split the work evenly between the workers. The bounds only determine the partition stride; they do not filter rows. numPartitions is the maximum number of connections opened in parallel, and the best value depends on the number of workers in your cluster and how many connections your data source can handle.
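To see why the bounds should match the real minimum and maximum of the column, the following Python sketch imitates how a numeric range might be cut into one WHERE clause per partition. It is a simplified illustration under the assumption of an integer column, not Spark's exact implementation, and partition_predicates is a made-up helper name.

```python
from typing import List


def partition_predicates(column: str, lower: int, upper: int, num_partitions: int) -> List[str]:
    """Split [lower, upper) into one WHERE clause per partition (simplified)."""
    if num_partitions < 1 or upper - lower < num_partitions:
        raise ValueError("need at least one value per partition")
    if num_partitions == 1:
        return ["1 = 1"]  # a single partition reads everything
    stride = (upper - lower) // num_partitions
    predicates = []
    for i in range(num_partitions):
        start = lower + i * stride
        end = start + stride
        if i == 0:
            # First partition is open-ended below, so values under `lower` (and NULLs) land here.
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended above, so values over `upper` land here.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


for predicate in partition_predicates("id", 0, 100, 4):
    print(predicate)
```

The first and last clauses are open-ended, so rows outside the bounds are still read; they simply pile up in the edge partitions. Bounds that are far from the real data range therefore lead to a few overloaded connections and many nearly empty ones.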
Spark 2.x here. My code:
val query = "SELECT * FROM some_big_table WHERE something > 1"
val df : DataFrame = spark.read
.option("url",
s"""jdbc:postgresql://${redshiftInfo.hostnameAndPort}/${redshiftInfo.database}?currentSchema=${redshiftInfo.schema}"""
)
.option("user", redshiftInfo.username)
.option("password", redshiftInfo.password)
.option("dbtable", query)
.load()
Produces:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:183)
at scala.Option.getOrElse(Option.scala:121)
I'm not reading anything from a Parquet file, I'm reading from a Redshift (RDBMS) table. So why am I getting this error?
If you use the generic load function, you should include the format as well:
// Query has to be subquery
val query = "(SELECT * FROM some_big_table WHERE something > 1) as tmp"
...
.format("jdbc")
.option("dbtable", query)
.load()
Otherwise Spark assumes that you are using the default format, which, absent any specific configuration, is Parquet.
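For PySpark users, the two points above (explicit format, query wrapped as an aliased subquery) can be sketched as a pair of small helpers. The names as_subquery and jdbc_read_options are invented for this example, and the AS alias syntax assumes a PostgreSQL-style database such as Redshift.

```python
def as_subquery(sql: str, alias: str = "tmp") -> str:
    """Wrap a SELECT statement so it can be passed as the dbtable option."""
    body = sql.strip().rstrip(";").strip()  # a trailing semicolon would break the subquery
    return f"({body}) AS {alias}"


def jdbc_read_options(url: str, sql: str, user: str, password: str) -> dict:
    """Collect the options for a JDBC read of an arbitrary query."""
    return {
        "url": url,
        "dbtable": as_subquery(sql),
        "user": user,
        "password": password,
    }
```

The options would then be used as spark.read.format("jdbc").options(**jdbc_read_options(url, sql, user, password)).load(), where the explicit format("jdbc") call is what prevents Spark from falling back to Parquet.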
Also nothing forces you to use dbtable.
spark.read.jdbc(
s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
query,
props
)
This variant is also valid.
And of course, with such a simple query none of that is needed:
spark.read.jdbc(
  s"jdbc:postgresql://${hostnameAndPort}/${database}?currentSchema=${schema}",
  "some_big_table",
  props
).where("something > 1")
will work the same way. If you want to improve performance, you should consider parallel queries:
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?
Spark 2.1 Hangs while reading a huge datasets
Partitioning in spark while reading from RDBMS via JDBC
Or, even better, try the Redshift connector.