How to read data from Greenplum via spark using filters - apache-spark

I am trying to filter data from Greenplum using a where clause with an OR condition. I am using the “Greenplum” connector in Spark.
Snippet -
Df1 = Df.filter(col('id')=='1' & (col('Name')=='abc' | col('Name').isNull()))
The connector translates this into a sql query internally and it looks like this -
select * from df where
id='1' and Name='abc' or Name is null;
This query is incorrect because I want to fetch all the records where id is 1 and name is either abc or null. With this query, the result also contains records where id is not equal to 1 but name is null.
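One thing worth checking, as a sketch rather than a confirmed fix for the Greenplum connector: in Python, & and | bind tighter than ==, so each comparison needs its own parentheses. Written this way the intended grouping, id = '1' AND (Name = 'abc' OR Name IS NULL), is at least unambiguous on the Spark side before the connector generates its SQL:

from pyspark.sql.functions import col

# Wrap every comparison in parentheses so Python's operator precedence
# does not regroup the condition before Spark sees it.
Df1 = Df.filter(
    (col('id') == '1') & ((col('Name') == 'abc') | col('Name').isNull())
)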

Related

How to load data from database to Spark before filtering

I'm trying to run a PySpark application like this:
from pyspark.sql import SparkSession

with SparkSession.builder.appName("Spark App").getOrCreate() as spark:
    dataframe_mysql = spark.read.format('jdbc').options(
        url="jdbc:mysql://.../...",
        driver='com.mysql.cj.jdbc.Driver',
        dbtable='my_table',
        user=...,
        password=...,
        partitionColumn='id',
        lowerBound=0,
        upperBound=10000000,
        numPartitions=11,
        fetchsize=1000000,
        isolationLevel='NONE'
    ).load()
    dataframe_mysql = dataframe_mysql.filter("date > '2022-01-01'")
    dataframe_mysql.write.parquet('...')
And I found that Spark didn't load data from MySQL until executing the write, which means that Spark lets the database take care of filtering the data, and the SQL the database receives may look like:
select * from my_table where id > ... and id < ... and date > '2022-01-01'
My table is too big and there's no index on the date column, so the database couldn't handle the filtering. How can I load the data into Spark's memory before filtering? I'd like the query sent to the database to be:
select * from my_table where id > ... and id < ...
According to samkart's comment, setting pushDownPredicate to False solves this problem.
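A minimal sketch of that change, reusing the reader options from the question. pushDownPredicate is a standard Spark JDBC option (available since Spark 2.4); passing it as the string "false" keeps the date filter out of the generated SQL, while the id-range predicates used for partitioning still reach MySQL:

dataframe_mysql = spark.read.format('jdbc').options(
    url="jdbc:mysql://.../...",
    driver='com.mysql.cj.jdbc.Driver',
    dbtable='my_table',
    partitionColumn='id',
    lowerBound=0,
    upperBound=10000000,
    numPartitions=11,
    pushDownPredicate="false"  # do not push DataFrame filters into the SQL
).load()

# The date filter now runs inside Spark, after the partitioned reads.
dataframe_mysql = dataframe_mysql.filter("date > '2022-01-01'")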

Spark 3.0 JdbcRDD Java - Issue in specifying lowerBound and upperBound for views with no ID column

I am using Spark 3.0. In my Java program I am querying data from views in an Oracle DB, using the JdbcRDD Java API to query the views.
The problem I have is that the view doesn't contain any ID or timestamp columns. So, I am unable to construct my SQL query with lowerBound and upperBound values.
Please find below the example query I need to run in Spark. Here exp_stg.usr and exp_stg.prtcpnt are the two views exposed to me.
"SELECT a.participant,
a.desc,
b.firstname,
b.lastname,
b.dept,
b.telno,
b.emailaddr
FROM usr_stg.prtcpnt a
LEFT OUTER JOIN usr_stg.usr b
ON a.participant = b.participant
WHERE a.class = 'SpSession' "
I tried registering the views as temp tables in Spark and joining them there, but the query performance is bad as there are around 13,000,000 rows in each view. Hence I tried to run the join query in the Oracle DB itself.
I was able to overcome the constraint by using ROWNUM in the query. Using ROWNUM for the lowerBound and upperBound, I am now able to get the data using JdbcRDD.
SELECT ROWNUM as id, a.participant,
a.desc,
b.firstname,
b.lastname,
b.dept,
b.telno,
b.emailaddr
FROM usr_stg.prtcpnt a
LEFT OUTER JOIN usr_stg.usr b
ON a.participant = b.participant
WHERE a.class = 'SpSession' AND ? <= ROWNUM AND ROWNUM <= ?
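The same idea can also be expressed through the DataFrame JDBC reader, shown here as a rough Python sketch: the join is wrapped in a subquery that exposes ROWNUM as a synthetic id column, and Spark partitions on that column. The connection details, bounds and partition count below are placeholders rather than values from the question, and note that ROWNUM is only stable across the per-partition queries if the subquery's ordering is deterministic:

query = """(SELECT ROWNUM AS id, a.participant, a.desc,
                   b.firstname, b.lastname, b.dept, b.telno, b.emailaddr
            FROM usr_stg.prtcpnt a
            LEFT OUTER JOIN usr_stg.usr b ON a.participant = b.participant
            WHERE a.class = 'SpSession') t"""

df = spark.read.jdbc(
    url="jdbc:oracle:thin:@//host:1521/service",   # placeholder URL
    table=query,
    column="id",             # partition on the ROWNUM alias
    lowerBound=1,
    upperBound=13000000,     # roughly the row count mentioned above
    numPartitions=10,
    properties={"user": "...", "password": "...",
                "driver": "oracle.jdbc.OracleDriver"},
)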

Does spark load the entire table in memory if select condition is based on RDD transformation?

Dataset<Row> a = spark.read().format("com.memsql.spark.connector").option("query", "select * from a").load();
a = a.filter((FilterFunction<Row>) row -> row.getAs("x").equals(row.getAs("y")));
String xstring = "...select all values of x from a and make comma separated string...";
Dataset<Row> b = spark.read().format("com.memsql.spark.connector").option("query", "select * from b where x in " + xstring).load();
b.show();
In this case, will Spark load the entire table b into memory and then filter out the xstring rows, or will it actually create that xstring first and then load only the matching subset of table b into memory when we call show?
When MemSQL is queried using option("query", "select * from ......."), the entire result (not the table) will be read from MemSQL into the executors. The MemSQL Spark Connector 2.0 supports column and filter pushdown, for which the filter and join conditions need to be part of the SQL rather than applied on the DataFrame. In your example predicate pushdown will be used: the entire table 'a' will be read because there is no filter condition in its SQL, xstring will be built, and then only the part of table 'b' that matches the x in (...) condition will be read.
Here is memsql documentation explaining this.
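To make that distinction concrete, a rough sketch (the values in the IN list are made up; only the connector name and the column x come from the question):

# Filter applied on the DataFrame: per the answer above, the "query" option
# reads the whole result of "select * from b" and Spark filters afterwards.
b1 = (spark.read.format("com.memsql.spark.connector")
      .option("query", "select * from b").load()
      .filter("x in (1, 2, 3)"))

# Filter written into the SQL itself: only the matching rows leave MemSQL.
b2 = (spark.read.format("com.memsql.spark.connector")
      .option("query", "select * from b where x in (1, 2, 3)").load())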

PySpark Pushing down timestamp filter

I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.
df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)
One column is a timestamp column and I want to filter it like this:
df_new_data = df.where(df.ts > last_datetime )
This way the filter is pushed down as a SQL query but the datetime format
is not right. So I tried this approach
df_new_data = df.where(df.ts > F.date_format( F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS") )
but then the filter is not pushed down anymore.
Can someone clarify why this is the case ?
While loading the data from a database table, if you want to push the query down to the database and get back only a few result rows, instead of providing the 'table' you can provide a 'query' and return just its result as a DataFrame. This way we can leverage the database engine to process the query and return only the results to Spark.
The table parameter identifies the JDBC table to read. You can use anything that is valid in the FROM clause of a SQL query. Note that an alias is mandatory in the query.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
df.show()
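Applied to the timestamp case from the question, the same pattern could look roughly like this (data_base_url, properties, tablename, ts and last_datetime are the names used in the question; the literal format is an assumption and has to match the Postgres column type):

last_datetime = "2019-01-01 00:00:00"   # assumed timestamp literal format

pushdown_query = "(select * from tablename where ts > '{}') t".format(last_datetime)
df_new_data = spark.read.jdbc(url=data_base_url, table=pushdown_query, properties=properties)
df_new_data.show()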

Apache Spark Query with HiveContext doesn't work

I use Spark 1.6.1. In my Spark Java program I connect to a Postgres database and register every table as a temporary table via JDBC. For example:
Map<String, String> optionsTable = new HashMap<String, String>();
optionsTable.put("url", "jdbc:postgresql://localhost/database?user=postgres&password=passwd");
optionsTable.put("dbtable", "table");
optionsTable.put("driver", "org.postgresql.Driver");
DataFrame table = sqlContext.read().format("jdbc").options(optionsTable).load();
table.registerTempTable("table");
This works without problems:
hiveContext.sql("select * from table").show();
Also this works:
DataFrame tmp = hiveContext.sql("select * from table where value=key");
tmp.registerTempTable("table");
And then I can see the contents of the table with:
hiveContext.sql("select * from table").show();
But now I have a problem. When I execute this:
hiveContext.sql("SELECT distinct id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left and tble.timestamp <= w.right").show();
Spark does nothing, but on the original Postgres database the query works fine. So I decided to modify the query a little bit:
hiveContext.sql("SELECT id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left").show();
This query works and gives me results. But the other query does not. Where is the difference, and why does the first query not work while the second one does?
And the database is not very big; for testing it has a size of 4 MB.
Since you're trying to select a distinct ID, you need to select the timestamp as part of an aggregate function and then group by ID. Otherwise, Spark doesn't know which timestamp to pair with the ID.
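A minimal sketch of that suggestion, giving the measure table an m alias (which the original query was missing) and taking the earliest timestamp per id; the choice of aggregate is an assumption, not something stated in the question:

hiveContext.sql(
    "SELECT m.id, min(m.timestamp) AS timestamp "
    "FROM measure m, measure_range w "
    "WHERE m.timestamp >= w.left AND m.timestamp <= w.right "
    "GROUP BY m.id"
).show()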
