I have a parquet file with a column platform with the schema
StructField(platform,StringType,true)
Now I have a query
spark.read.parquet("...").filter(col("platform")==="android").count
What's weird is that in the SQL viewer Spark is loading all the rows from that file. Here is the distribution of the platform column in the parquet file:
+---------+------+
| platform| count|
+---------+------+
| windows|862963|
| android|278207|
| ios| 88930|
+---------+------+
Total rows = 1,230,100
Here is the explain
== Physical Plan ==
*(1) Filter (isnotnull(platform#309337) AND (platform#309337 = android))
+- *(1) ColumnarToRow
+- FileScan parquet [platform#309337,id#309338,os_version#309339,device_model#309340] Batched: true, DataFilters: [isnotnull(platform#309337), (platform#309337 = android)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(platform), EqualTo(platform,android)], ReadSchema: struct<platform:string,id:string,os_version:string,device_model:string,carrier:string,app_version...
Yet the SQL viewer on the Scan Parquet stage shows
rows output = 1,230,100
I expected the pushdown filter to output only 278,207 rows (the android count).
The viewer shows the pushed filter being applied as well.
Does anybody know what is going on?
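One way to probe this (a sketch, not a definitive answer): as far as I understand, a pushed Parquet filter typically only skips whole row groups whose min/max statistics rule out the value, and Spark re-applies the Filter on whatever rows the scan returns, so the scan node can legitimately report all 1,230,100 rows. Rewriting the file sorted by platform makes the row-group statistics selective; the paths below are placeholders.

import org.apache.spark.sql.functions.col

// Pushdown is on by default; set explicitly only to make the experiment clear.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

// Hypothetical experiment: rewrite the data sorted by platform so that each
// row group's min/max statistics cover a narrow range of platform values.
spark.read.parquet("/path/to/original")                   // placeholder path
  .sort("platform")
  .write.mode("overwrite").parquet("/path/to/sorted")     // placeholder path

// Re-run the same query; if row-group statistics were the limiting factor,
// the Scan Parquet node's "rows output" should now be far below 1,230,100.
spark.read.parquet("/path/to/sorted")
  .filter(col("platform") === "android")
  .count()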
My problem is the following:
I have to write a Scala program for Spark Streaming; this program has to read data from a Kafka topic and save it in a Cassandra table.
My data is of the form:
transaction_id,customer_id,merchant_id,status,timestamp,invoice_no,invoice_amount
202207280000001,1966,319,SUCCESS,07/28/22,835-975-389,86562
The need is to aggregate the data by day and to count transactions by amount bucket (between 0 and 500, between 500 and 1000, between 1000 and 1500, and over 2000).
Can someone help me?
Thanks
OK, let me try to be clearer.
A Spark Streaming application written in Scala, which has to read data from a Kafka topic. The data is structured in the following way:
transaction_id,customer_id,merchant_id,status,timestamp,invoice_no,invoice_amount
Example (fictitious data):
202207280000001,1966,319,SUCCESS,07/28/22,835-975-389,86562
202207280000002,1970,320,SUCCESS,07/28/22,835-980-395,50000
202207280000003,1966,319,SUCCESS,07/28/22,835-975-399,200
202207280000004,658,400,SUCCESS,07/25/22,835-975-200,800
202207280000005,1966,319,SUCCESS,07/25/22,835-975-387,300
From these data I have to calculate, in real time, the count of transactions falling in the price buckets 0-500, 500-1000, 1000-1500, and over 2000.
The displayed result would be, for example:
07/28/22 | Below500 | 1
07/28/22 | Below1000 | 0
07/28/22 | Below1500 | 0
07/28/22 | Above2000 | 2
07/25/22 | Below500 | 1
07/25/22 | Below1000 | 1
07/25/22 | Below1500 | 0
07/25/22 | Above2000 | 0
My program reads the topic fine and displays the data coming from Kafka, but I can't process the data to produce the result above.
Thanks
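For what it's worth, here is a minimal Structured Streaming sketch of the bucketing logic in Scala (assuming the newer structured API rather than DStreams). Each Kafka record value is taken to hold one CSV line in the format above, the broker address and topic name are placeholders, and a console sink stands in for the Cassandra write.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("TxnBuckets").getOrCreate()
import spark.implicits._

// Read the raw CSV lines from Kafka; broker and topic are placeholders.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "transactions")
  .load()
  .selectExpr("CAST(value AS STRING) AS line")

// Keep only the fields needed here: the day (field 5) and the amount (field 7).
val txns = raw.select(
  split($"line", ",").getItem(4).as("txn_day"),                      // e.g. 07/28/22
  split($"line", ",").getItem(6).cast("double").as("invoice_amount")
)

// Label each transaction with its price bucket. The 1500-2000 range is not
// specified in the question, so it is left unlabelled here.
val bucketed = txns.withColumn("bucket",
  when($"invoice_amount" < 500, "Below500")
    .when($"invoice_amount" < 1000, "Below1000")
    .when($"invoice_amount" < 1500, "Below1500")
    .when($"invoice_amount" > 2000, "Above2000"))

// Running count per day and bucket.
val counts = bucketed.groupBy($"txn_day", $"bucket").count()

// Console sink for checking the output; replace with the Cassandra write.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()

query.awaitTermination()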
I am trying to understand Spark jobs and stages through the web UI. I ran simple word count code. The file has 44 lines in total.
strings = spark.read.text("word.txt")
filtered = strings.filter(strings.value.contains("The"))
filtered.count()
I see there are no wide transformations and there is only 1 action, so the application should be a 1-stage job. However, in the web UI I see a shuffle operation after the read, and it shows a 2-stage job. I am not sure why that is. Can anyone please help me here?
Edit : Adding SQL Plan
== Parsed Logical Plan ==
Aggregate [count(1) AS count#5L]
+- AnalysisBarrier
+- Filter Contains(value#0, The)
+- Relation[value#0] text
== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#5L]
+- Filter Contains(value#0, The)
+- Relation[value#0] text
== Optimized Logical Plan ==
Aggregate [count(1) AS count#5L]
+- Project
+- Filter (isnotnull(value#0) && Contains(value#0, The))
+- Relation[value#0] text
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#5L])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#8L])
+- *(1) Project
+- *(1) Filter (isnotnull(value#0) && Contains(value#0, The))
+- *(1) FileScan text [value#0] Batched: false, Format: Text, Location: InMemoryFileIndex[hdfs://dev/user/rk/word.txt], PartitionFilters: [], PushedFilters: [IsNotNull(value), StringContains(value,The)], ReadSchema: struct<value:string>
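For what it's worth, the second stage comes from count() rather than from filter(): a DataFrame count() is planned as a partial aggregation per input partition, an Exchange (shuffle) down to a single partition, and a final aggregation, which is exactly the *(1) / Exchange SinglePartition / *(2) split visible in the physical plan above. A small spark-shell sketch (Scala here; the plans are the same from PySpark) to separate the two parts:

import org.apache.spark.sql.functions.col

val strings  = spark.read.text("word.txt")
val filtered = strings.filter(col("value").contains("The"))

// Narrow transformations only: this plan has no Exchange, so on its own it
// would run as a single stage.
filtered.explain()

// count() adds HashAggregate(partial_count) -> Exchange SinglePartition ->
// HashAggregate(count); the Exchange is the shuffle boundary that makes the
// job show up as two stages in the web UI.
filtered.count()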
I have built my search engine using embeddings from documents with ELK... I want to query using mathematical expressions and their respective outcomes.
For example: my query is "Give me water pH 8 +- 1".
The outcome should be: every document with a pH value from 7 to 9.
Example 2: "Density 1080" would search density 1080 +- 10%, meaning 1080 +- 108; or "pH 7" would search pH 7 +- 10%, meaning 7 +- 0.7.
How can I do this?
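Not a full answer to the embeddings part, but the tolerance arithmetic itself maps naturally onto an Elasticsearch range query once the property name and value have been extracted from the query text. A rough Scala sketch, assuming numeric fields such as ph and density exist in the index (both field names are assumptions):

// Turn a value plus an optional tolerance into a numeric range. If no explicit
// tolerance is given (e.g. "pH 7"), fall back to +-10% of the value, as in the
// examples: 1080 -> 972..1188, 7 -> 6.3..7.7.
def toleranceRange(value: Double, tolerance: Option[Double] = None): (Double, Double) = {
  val delta = tolerance.getOrElse(value * 0.10)
  (value - delta, value + delta)
}

// Build the corresponding Elasticsearch range clause as a JSON string.
def rangeQueryJson(field: String, value: Double, tolerance: Option[Double] = None): String = {
  val (lo, hi) = toleranceRange(value, tolerance)
  s"""{ "query": { "range": { "$field": { "gte": $lo, "lte": $hi } } } }"""
}

// "Give me water pH 8 +- 1"  -> range 7.0 to 9.0
println(rangeQueryJson("ph", 8, Some(1.0)))
// "Density 1080"             -> range 972.0 to 1188.0 (implicit +-10%)
println(rangeQueryJson("density", 1080))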
I have a table with a column named 'date1' (timestamp (nullable = true)), formatted like this:
scala> sql("select date1 from tablename").show(20);
+-------------------+
| date1 |
+-------------------+
|2016-08-20 00:00:00|
|2016-08-31 00:00:00|
|2016-08-31 00:00:00|
|2016-09-09 00:00:00|
|2016-09-08 00:00:00|
While reading through the complete Hive table, I am getting the following error:
WARN TaskSetManager: Lost task 2633.0 in stage 4.0 (TID 7206, ip-10-0-0-241.ec2.internal, executor 11): TaskKilled (stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 80 in stage 4.0 failed 4 times, most recent failure: Lost task 80.3 in stage 4.0 (TID 8944, ip-10-0-0-241.ec2.internal, executor 42): java.time.format.DateTimeParseException: Text '0000-12-30T00:00:00' could not be parsed, unparsed text found at index 10
.....
.....
Caused by: java.time.format.DateTimeParseException: Text '0000-12-30T00:00:00' could not be parsed, unparsed text found at index 10
at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1952)
How can I ignore or convert those records so that I am able to read the table?
SparkVersion: 2.2.1
This is a source data issue. Try to read the whole data for this column alone and write it out:
scala> sql("select date1 from tablename").write.mode("overwrite").parquet("path/to/file.parquet")
If the issue is with this column, then you will get the error.
Try querying the source data for '0000-12-30T00:00:00'.
This is clearly a data issue that you need to identify and remove.
You can try the query below to ignore those rows:
sql("select date1 from tablename where date1 <> '0000-12-30T00:00:00'").count
You can try the following:
spark.sql("""select cast(regexp_replace(date1,'[T,Z]',' ') as timestamp) from tablename""").show()
This will replace T/Z with a ' ' (space) wherever it finds them; otherwise it does nothing.
Hope this helps!
I have a Pyspark dataframe containing logs, with each row corresponding to the state of the system at the time it is logged, and a group number. I would like to find the lengths of the time periods for which each group is in an unhealthy state.
For example, if this were my table:
TIMESTAMP | STATUS_CODE | GROUP_NUMBER
--------------------------------------
02:03:11 | healthy | 000001
02:03:04 | healthy | 000001
02:03:03 | unhealthy | 000001
02:03:00 | unhealthy | 000001
02:02:58 | healthy | 000008
02:02:57 | healthy | 000008
02:02:55 | unhealthy | 000001
02:02:54 | healthy | 000001
02:02:50 | healthy | 000007
02:02:48 | healthy | 000004
I would want to return Group 000001 having an unhealthy time period of 9 seconds (from 02:02:55 to 02:03:04).
Other groups could also have unhealthy time periods, and I would want to return those as well.
Due to the possibility of consecutive rows with the same status, and since rows of different groups are interspersed, I am struggling to find a way to do this efficiently.
I cannot convert the Pyspark dataframe to a Pandas dataframe, as it is much too large.
How can I efficiently determine the lengths of these time periods?
Thanks so much!
The PySpark with Spark SQL solution would look like this.
First we create the sample dataset. In addition to the dataset, we generate a row_number field partitioned on the group and ordered by the timestamp. Then we register the generated dataframe as a table, say table1.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([
    ('2017-01-01 02:03:11','healthy','000001'),
    ('2017-01-01 02:03:04','healthy','000001'),
    ('2017-01-01 02:03:03','unhealthy','000001'),
    ('2017-01-01 02:03:00','unhealthy','000001'),
    ('2017-01-01 02:02:58','healthy','000008'),
    ('2017-01-01 02:02:57','healthy','000008'),
    ('2017-01-01 02:02:55','unhealthy','000001'),
    ('2017-01-01 02:02:54','healthy','000001'),
    ('2017-01-01 02:02:50','healthy','000007'),
    ('2017-01-01 02:02:48','healthy','000004')
], ['timestamp','state','group_id'])
df = df.withColumn('rownum', row_number().over(Window.partitionBy(df.group_id).orderBy(unix_timestamp(df.timestamp))))
df.registerTempTable("table1")
Once the dataframe is registered as a table (table1), the required data can be computed as below using Spark SQL:
>>> spark.sql("""
... SELECT t1.group_id,sum((t2.timestamp_value - t1.timestamp_value)) as duration
... FROM
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1 WHERE state = 'unhealthy') t1
... LEFT JOIN
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1) t2
... ON t1.group_id = t2.group_id
... AND t1.rownum = t2.rownum - 1
... group by t1.group_id
... """).show()
+--------+--------+
|group_id|duration|
+--------+--------+
| 000001| 9|
+--------+--------+
The sample dataset had unhealthy data for group_id 000001 only, but this solution also works for cases where other group_ids have an unhealthy state.
One straightforward way (maybe not optimal) is:
Map to [K,V] pairs with GROUP_NUMBER as the key K.
Use repartitionAndSortWithinPartitions, so you will have all the data for every single group in the same partition and have it sorted by TIMESTAMP. A detailed explanation of how this works is in this answer: Pyspark: Using repartitionAndSortWithinPartitions with multiple sort Critiria
And finally, use mapPartitions to get an iterator over the sorted data in a single partition, so you can easily find the answer you need (explanation of mapPartitions: How does the pyspark mapPartitions function work?).
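A rough sketch of that three-step recipe, written here against the Scala RDD API (the PySpark calls have the same names). It assumes a logsRdd of (group_id, epoch_seconds, state) tuples has already been derived from the dataframe, and it measures each unhealthy period, as in the question, from the first unhealthy row to the next healthy row of the same group.

import org.apache.spark.Partitioner

// Composite key: partition on the group only, but sort by (group, timestamp)
// so each group's rows arrive contiguously and time-ordered within a partition.
case class GroupTsKey(group: String, ts: Long)

class GroupPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case GroupTsKey(group, _) => (group.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

implicit val keyOrdering: Ordering[GroupTsKey] = Ordering.by((k: GroupTsKey) => (k.group, k.ts))

// logsRdd: RDD[(String, Long, String)] = (group_id, epoch_seconds, state) -- assumed to exist.
val keyed  = logsRdd.map { case (group, ts, state) => (GroupTsKey(group, ts), state) }
val sorted = keyed.repartitionAndSortWithinPartitions(new GroupPartitioner(8))

// One pass per partition: open an unhealthy window at the first unhealthy row
// and close it (adding its length) at the next healthy row of the same group.
val unhealthySeconds = sorted.mapPartitions { rows =>
  val totals = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
  var currentGroup: String = null
  var windowStart: Option[Long] = None
  rows.foreach { case (GroupTsKey(group, ts), state) =>
    if (group != currentGroup) { currentGroup = group; windowStart = None }
    if (state == "unhealthy" && windowStart.isEmpty) windowStart = Some(ts)
    else if (state == "healthy" && windowStart.isDefined) {
      totals(group) += ts - windowStart.get
      windowStart = None
    }
  }
  totals.iterator
}

// Each group lives in exactly one partition, so this yields one total per group.
unhealthySeconds.collect().foreach { case (group, secs) => println(s"$group -> $secs seconds") }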