PySpark Pushing down timestamp filter - apache-spark

I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.
df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)
One column is a timestamp column and I want to filter it like this:
df_new_data = df.where(df.ts > last_datetime )
This way the filter is pushed down as a SQL query but the datetime format
is not right. So I tried this approach
df_new_data = df.where(df.ts > F.date_format( F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS") )
but then the filter is no pushed down anymore.
Can someone clarify why this is the case ?

While loading the data from a Database table, if you want to push down queries to database and get few result rows, instead of providing the 'table', you can provide the 'Query' and return just the result as a DataFrame. This way, we can leverage database engine to process the query and return only the results to Spark.
The table parameter identifies the JDBC table to read. You can use anything that is valid in a SQL query FROM clause. Note that alias is mandatory to be provided in query.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
df.show()

Related

How to load data from database to Spark before filtering

I'm trying to run such a PySpark application:
with SparkSession.builder.appName(f"Spark App").getOrCreate() as spark:
dataframe_mysql = spark.read.format('jdbc').options(
url="jdbc:mysql://.../...",
driver='com.mysql.cj.jdbc.Driver',
dbtable='my_table',
user=...,
password=...,
partitionColumn='id',
lowerBound=0,
upperBound=10000000,
numPartitions=11,
fetchsize=1000000,
isolationLevel='NONE'
).load()
dataframe_mysql = dataframe_mysql.filter("date > '2022-01-01'")
dataframe_mysql.write.parquet('...')
And I found that Spark didn't load data from Mysql until executing the write, this means that Spark let the database take care of filtering the data, and the SQL that database received may like:
select * from my_table where id > ... and id< ... and date > '2022-01-01'
My table was too big and there's no index on date column, the database couldn't handle the filtering. How can I load data into Spark's memory before filtering, I hope the query sent to the databse could be:
select * from my_table where id > ... and id< ...
According to #samkart's comment, set pushDownPredicate to False could solve this problem

How to read data from Greenplum via spark using filters

I am trying to filter data using a where clause having OR condition from Greenplum. I am using “Greenplum” connector in spark.
Snippet -
Df1 = Df.filter(col(‘id’)==‘1’ & (col(‘Name’)==‘abc’ | col(‘Name’).isNull()))
The connector translates this into a sql query internally and it looks like this -
Select * from df where
id=‘1’ and Name=‘abc’ or Name is null;
This is an incorrect query because I want to fetch all the records where id is 1 and name is abc or null. With this query the data fetched has records where id not equal to 1 but name is null.

BigQuery Storage API: Best Practice for Using Client from Spark Pandas UDF?

I have a spark script that needs to make 60 api calls for every row. Currently I am using BigQuery as a data warehouse. I was wondering if there was a way I can use either the BigQuery API or BigQuery Storage API to query the database from my udf? Maybe a way to perform batch queries? Would pandas-gbq be a better solution? Each query that I need to make per row is a select count(*) from dataset.table where {...} query.
Currently I am using the big query client as shown in the code snippet below, but I am not sure if this is the best way to utilize my resources. Apologies if the code is not done properly for this use case, I am new to spark and BigQuery.
def clients():
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/hadoop/credentials.json'
credentials, your_project_id = google.auth.default(
scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
# Make clients.
bqclient = bigquery.Client(
credentials=credentials,
project=your_project_id,
)
bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient(
credentials=credentials
)
return bqclient, bqstorageclient
def query_cache(query):
bqclient, bqstorageclient = clients()
dataframe = (
bqclient.query(query)
.result()
.to_dataframe(bqstorage_client=bqstorageclient)
)
return dataframe['f0_'][0]
#pandas_udf(schema(), PandasUDFType.GROUPED_MAP)
def calc_counts(df):
query = "select count(*) from dataset.table where ...{some column filters}..."
df['count'] = df.apply(query_cache, args=(query), axis=1)
The simpler option is to use the spark-bigquery-connector, which let you query BigQuery directly and get the result as a Spark dataframe. Converting this dataframe into pandas is then simple:
spark_df = spark.read.format('bigquery').option('table', table).load()
pandas_df = spark_df.toPandas()

How can you update values in a dataset?

So as far as I know Apache Spark doesn't has a functionality that imitates the update SQL command. Like, I can change a single value in a column given a certain condition. The only way around that is to use the following command I was instructed to use (here in Stackoverflow): withColumn(columnName, where('condition', value));
However, the condition should be of column type, meaning I have to use the built in column filtering functions apache has (equalTo, isin, lt, gt, etc). Is there a way I can instead use an SQL statement instead of those built in functions?
The problem is I'm given a text file with SQL statements, like WHERE ID > 5 or WHERE AGE != 50, etc. Then I have to label values based on those conditions, and I thought of following the withColumn() approach but I can't plug-in an SQL statement in that function. Any idea of how I can go around this?
I found a way to go around this:
You want to split your dataset into two sets: the values you want to update and the values you don't want to update
Dataset<Row> valuesToUpdate = dataset.filter('conditionToFilterValues');
Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);
valueToUpdate = valueToUpdate.withColumn('updatedColumn', lit('updateValue'));
Dataset<Row> updatedDataset = valuesNotToUpdate.union(valueToUpdate);
This, however, doesn't keep the same order of records as the original dataset, so if order is of importance to you, this won't suffice your needs.
In PySpark you have to use .subtract instead of .except
If you are using DataFrame, you can register that dataframe as temp table,
using df.registerTempTable("events")
Then you can query like,
sqlContext.sql("SELECT * FROM events "+)
when clause translates into case clause which you can relate to SQL case clause.
Example
scala> val condition_1 = when(col("col_1").isNull,"NA").otherwise("AVAILABLE")
condition_1: org.apache.spark.sql.Column = CASE WHEN (col_1 IS NULL) THEN NA ELSE AVAILABLE END
or you can chain when clause as well
scala> val condition_2 = when(col("col_1") === col("col_2"),"EQUAL").when(col("col_1") > col("col_2"),"GREATER").
| otherwise("LESS")
condition_2: org.apache.spark.sql.Column = CASE WHEN (col_1 = col_2) THEN EQUAL WHEN (col_1 > col_2) THEN GREATER ELSE LESS END
scala> val new_df = df.withColumn("condition_1",condition_1).withColumn("condition_2",condition_2)
Still if you want to use table, then you can register your dataframe / dataset as temperory table and perform sql queries
df.createOrReplaceTempView("tempTable")//spark 2.1 +
df.registerTempTable("tempTable")//spark 1.6
Now, you can perform sql queries
spark.sql("your queries goes here with case clause and where condition!!!")//spark 2.1
sqlContest.sql("your queries goes here with case clause and where condition!!!")//spark 1.6
If you are using java dataset
you can update dataset by below.
here is the code
Dataset ratesFinal1 = ratesFinal.filter(" on_behalf_of_comp_id != 'COMM_DERIVS' ");
ratesFinal1 = ratesFinal1.filter(" status != 'Hit/Lift' ");
Dataset ratesFinalSwap = ratesFinal1.filter (" on_behalf_of_comp_id in ('SAPPHIRE','BOND') and cash_derivative != 'cash'");
ratesFinalSwap = ratesFinalSwap.withColumn("ins_type_str",functions.lit("SWAP"));
adding new column with value from existing column
ratesFinalSTW = ratesFinalSTW.withColumn("action", ratesFinalSTW.col("status"));

Apache Spark Query with HiveContext doesn't work

I use Spark 1.6.1. In my Spark Java Programm I connect to a Postgres Database and register every table as a temporary table via JDBC. For example:
Map<String, String> optionsTable = new HashMap<String, String>();
optionsTable.put("url", "jdbc:postgresql://localhost/database?user=postgres&password=passwd");
optionsTable.put("dbtable", "table");
optionsTable.put("driver", "org.postgresql.Driver");
DataFrame table = sqlContext.read().format("jdbc").options(optionsTable).load();
table.registerTempTable("table");
This works without problems:
hiveContext.sql("select * from table").show();
Also this works:
DataFrame tmp = hiveContext.sql("select * from table where value=key");
tmp.registerTempTable("table");
And then I can see the contents of the table with:
hiveContext.sql("select * from table").show();
But now I have a Problem. When I execute this:
hiveContext.sql("SELECT distinct id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left and tble.timestamp <= w.right").show();
Spark does nothing, but at the origin databse on Postgres it works very good. So I decided to modify the query a little bit to this:
hiveContext.sql("SELECT id, timestamp FROM measure, measure_range w WHERE tble.timestamp >= w.left").show();
This Query is working and gives me results. But the other query is not working. Where is the difference and why is the first query not working, but the second is working good?
And the database is not very Big. For testing it has a size of 4 MB.
Since you're trying to select a distinct ID, you need to select timestamp as a part of an aggregate function and then group by ID. Otherwise, it doesn't know which time stamp to pair with the ID.

Resources