I have a table with a column named 'date1' (timestamp, nullable = true), formatted like this:
scala> sql("select date1 from tablename).show(20);
+-------------------+
|              date1|
+-------------------+
|2016-08-20 00:00:00|
|2016-08-31 00:00:00|
|2016-08-31 00:00:00|
|2016-09-09 00:00:00|
|2016-09-08 00:00:00|
While reading through the complete Hive table, I am getting the following error:
WARN TaskSetManager: Lost task 2633.0 in stage 4.0 (TID 7206, ip-10-0-0-241.ec2.internal, executor 11): TaskKilled (stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 80 in stage 4.0 failed 4 times, most recent failure: Lost task 80.3 in stage 4.0 (TID 8944, ip-10-0-0-241.ec2.internal, executor 42): java.time.format.DateTimeParseException: Text '0000-12-30T00:00:00' could not be parsed, unparsed text found at index 10
.....
.....
Caused by: java.time.format.DateTimeParseException: Text '0000-12-30T00:00:00' could not be parsed, unparsed text found at index 10
at java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1952)
How can I ignore or convert these records so that I am able to read the table?
Spark version: 2.2.1
This is a source data issue. Try reading the whole data for this column alone and writing it out:
scala> sql("select date1 from tablename).write.mode("overwrite").parquet("path/to/file.parquet")
If the issue is with this column, this write will fail with the same error.
Try querying the source data for the value '0000-12-30T00:00:00'.
This is clearly a data issue that you need to identify and remove.
You can try the query below to ignore those rows:
sql("select date1 from tablename where date1 <> '0000-12-30T00:00:00'").count
You can try the following:
spark.sql("""select cast(regexp_replace(date1,'[T,Z]',' ') as timestamp) from tablename""").show()
This replaces 'T' or 'Z' with a ' ' (space) wherever they are found; otherwise it leaves the value unchanged.
Hope this helps!
Related
I want to read a list of queries stored in a text file (CSV or any delimiter-separated format) and execute them one by one in PySpark. I am very new to Spark and wanted to know if there is any related Spark API which I can use for doing this.
sample data
C1 | C2 | C3
1 | 2 | 3
0 | 0 | 0
sample queries text file
select * from sample_data_table where C1 = 0,
select * from sample_data_table where C1 != 0
output
df1 ==> C1 | C2 | C3
0 | 0 | 0
df2 ==> C1 | C2 | C3
1 | 2 | 3
You can get the desired result by reading the file as a dataframe and passing each query to the spark.sql() method.
from pyspark.sql import *
from pyspark.sql.functions import *
spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.read.text("file.txt")
# Split on the delimiter so that there is a single query per row
rows = df.select(explode(split('value', ','))).collect()
'''
Using collect() is not recommended since it brings all the data to the driver;
if the driver memory is not enough we get OOM errors.
We could do something like df.foreach(...), but in this case we need the
SparkSession inside the loop. Code inside foreach() is sent to the executors,
which cannot access the SparkSession, so it would fail.
That's why I used collect(). Some alternatives are worth checking here.
'''
for sql in rows:
    spark.sql(sql[0]).show()
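One possible alternative to collect(), hinted at in the comment above but not part of the original answer, is DataFrame.toLocalIterator(), which streams rows to the driver one partition at a time instead of all at once. A minimal sketch:
# Sketch of an alternative to collect(): toLocalIterator() pulls one partition
# at a time to the driver, keeping peak driver memory lower. The queries are
# still submitted from the driver loop, exactly as with collect().
for row in df.select(explode(split('value', ',')).alias('query')).toLocalIterator():
    spark.sql(row['query']).show()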
I am new to Spark; please help me arrive at a solution for this problem. I receive an input file that has information about events that occurred, and each record carries a timestamp value. Event_Id is the primary column for this input. Refer to the sample input below (the actual file has many other columns).
Event_Id | Event_Timestamp
1 | 2018-10-11 12:23:01
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
When we get the above input, we need to keep the latest record per event id based on the timestamp, and the expected output would be
Event_Id | Event_Timestamp
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
Hereafter, whenever I receive event information with a timestamp lower than the value already stored, I need to ignore it. For example, consider the second input:
Event_Id | Event_Timestamp
2 | 2018-10-11 10:25:01
1 | 2018-10-11 08:23:01
3 | 2018-10-11 21:12:01
Now I need to ignore event_ids 1 and 2 since they have older timestamps than the state we currently hold. Only event 3 would be passed through, and the expected output here is
3 | 2018-10-11 21:12:01
Assume we have a huge number of unique event ids (say 10 billion): how would this state be stored in Spark memory, and is there anything that needs to be taken care of?
Thanks in advance
We can take the max timestamp per event id and use the persist() method with the DISK_ONLY or DISK_ONLY_2 storage levels... In that case, we can achieve this, I think...
Since it's streaming data, we can try the MEMORY_ONLY or MEMORY_ONLY_2 storage levels too...
Please try and update..
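To make that idea concrete, here is a minimal PySpark sketch (my own illustration, not from the answer above) that keeps only the max timestamp per Event_Id, persists that state to disk, and filters an incoming batch against it; the table names event_state and incoming_events are hypothetical:
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: `state` holds the latest known timestamp per event id,
# `batch` is the newly arrived data. Both have Event_Id and Event_Timestamp.
state = spark.table("event_state")
batch = spark.table("incoming_events")

# Keep only events that are newer than the stored state (or unseen event ids).
passed = (batch.alias("b")
          .join(state.alias("s"), "Event_Id", "left")
          .where(F.col("s.Event_Timestamp").isNull() |
                 (F.col("b.Event_Timestamp") > F.col("s.Event_Timestamp")))
          .select("Event_Id", F.col("b.Event_Timestamp").alias("Event_Timestamp")))

# Update the state: max timestamp per event id, persisted to disk to limit memory use.
new_state = (state.unionByName(passed)
             .groupBy("Event_Id")
             .agg(F.max("Event_Timestamp").alias("Event_Timestamp"))
             .persist(StorageLevel.DISK_ONLY))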
Suppose I have a DataFrame of events with the time difference between each row. The main rule is that a visit is counted only if the event occurred within 5 minutes of the previous or next event:
+--------+-------------------+--------+
|userid |eventtime |timeDiff|
+--------+-------------------+--------+
|37397e29|2017-06-04 03:00:00|60 |
|37397e29|2017-06-04 03:01:00|60 |
|37397e29|2017-06-04 03:02:00|60 |
|37397e29|2017-06-04 03:03:00|180 |
|37397e29|2017-06-04 03:06:00|60 |
|37397e29|2017-06-04 03:07:00|420 |
|37397e29|2017-06-04 03:14:00|60 |
|37397e29|2017-06-04 03:15:00|1140 |
|37397e29|2017-06-04 03:34:00|540 |
|37397e29|2017-06-04 03:53:00|540 |
+--------+-------------------+--------+
The challenge is to group events by start_time and end_time, where end_time is the latest eventtime that still satisfies the condition of being within 5 minutes. The output should look like this table:
+--------+-------------------+--------------------+-----------+
|userid |start_time |end_time |events |
+--------+-------------------+--------------------+-----------+
|37397e29|2017-06-04 03:00:00|2017-06-04 03:07:00 |6 |
|37397e29|2017-06-04 03:14:00|2017-06-04 03:15:00 |2 |
+--------+-------------------+--------------------+-----------+
So far I have used window lag functions and some conditions; however, I do not know where to go from here:
%spark.pyspark
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
windowSpec = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"])
windowSpecDesc = W.partitionBy(result_poi["userid"], result_poi["unique_reference_number"]).orderBy(result_poi["eventtime"].desc())
# nextEventTime: eventtime of the following row, e.g. 3:00pm -> 3:01pm
nextEventTime = F.lag(col("eventtime"), -1).over(windowSpec)
# previousEventTime: eventtime of the preceding row
previousEventTime = F.lag(col("eventtime"), 1).over(windowSpec)
diffEventTime = nextEventTime - col("eventtime")
nextTimeDiff = F.coalesce((F.unix_timestamp(nextEventTime)
                           - F.unix_timestamp('eventtime')), F.lit(0))
previousTimeDiff = F.coalesce((F.unix_timestamp('eventtime') - F.unix_timestamp(previousEventTime)), F.lit(0))
# Check if the next POI is equal to the current POI and has a time difference of less than 5 minutes.
validation = F.coalesce(( (nextTimeDiff < 300) | (previousTimeDiff < 300) ), F.lit(False))
# Change True to 1
visitCheck = F.coalesce((validation == True).cast("int"), F.lit(1))
result_poi.withColumn("visit_check", visitCheck).withColumn("nextTimeDiff", nextTimeDiff).select("userid", "eventtime", "nextTimeDiff", "visit_check").orderBy("eventtime")
My questions: Is this a viable approach, and if so, how can I "go forward" and look at the maximum eventtime that fulfills the 5-minute condition? As far as I know, is it even possible to iterate through the values of a Spark SQL Column, and wouldn't that be too expensive? Is there another way to achieve this result?
Result of Solution suggested by #Aku:
+--------+--------+---------------------+---------------------+------+
|userid |subgroup|start_time |end_time |events|
+--------+--------+---------------------+---------------------+------+
|37397e29|0 |2017-06-04 03:00:00.0|2017-06-04 03:06:00.0|5 |
|37397e29|1 |2017-06-04 03:07:00.0|2017-06-04 03:14:00.0|2 |
|37397e29|2 |2017-06-04 03:15:00.0|2017-06-04 03:15:00.0|1 |
|37397e29|3 |2017-06-04 03:34:00.0|2017-06-04 03:43:00.0|2 |
+--------+--------+---------------------+---------------------+------+
It doesn't give the expected result: 3:07-3:14 and 03:34-03:43 are being counted as ranges within 5 minutes, which shouldn't happen. Also, 3:07 should be the end_time of the first row, as it is within 5 minutes of the previous row at 3:06.
You'll need one extra window function and a groupby to achieve this.
What we want is for every line with timeDiff greater than 300 to be the end of a group and the start of a new one. Aku's solution should work; the only issue is that the indicators mark the start of a group instead of the end. To change this, you'll have to do a cumulative sum up to n-1 instead of n (n being your current line):
w = Window.partitionBy("userid").orderBy("eventtime")
DF = DF.withColumn("indicator", (DF.timeDiff > 300).cast("int"))
DF = DF.withColumn("subgroup", func.sum("indicator").over(w) - func.col("indicator"))
DF = DF.groupBy("subgroup").agg(
func.min("eventtime").alias("start_time"),
func.max("eventtime").alias("end_time"),
func.count("*").alias("events")
)
+--------+-------------------+-------------------+------+
|subgroup| start_time| end_time|events|
+--------+-------------------+-------------------+------+
| 0|2017-06-04 03:00:00|2017-06-04 03:07:00| 6|
| 1|2017-06-04 03:14:00|2017-06-04 03:15:00| 2|
| 2|2017-06-04 03:34:00|2017-06-04 03:34:00| 1|
| 3|2017-06-04 03:53:00|2017-06-04 03:53:00| 1|
+--------+-------------------+-------------------+------+
It seems that you also want to filter out groups with only one event, hence:
DF = DF.filter("events != 1")
+--------+-------------------+-------------------+------+
|subgroup| start_time| end_time|events|
+--------+-------------------+-------------------+------+
| 0|2017-06-04 03:00:00|2017-06-04 03:07:00| 6|
| 1|2017-06-04 03:14:00|2017-06-04 03:15:00| 2|
+--------+-------------------+-------------------+------+
So if I understand this correctly, you essentially want to end each group when timeDiff > 300? This seems relatively straightforward with rolling window functions:
First some imports
from pyspark.sql.window import Window
import pyspark.sql.functions as func
Then set up the window; I assumed you would partition by userid:
w = Window.partitionBy("userid").orderBy("eventtime")
Then figure out which subgroup each observation falls into, by first marking the first member of each group and then summing that column:
indicator = (func.col("timeDiff") > 300).cast("integer")
subgroup = func.sum(indicator).over(w).alias("subgroup")
Then apply some aggregation functions and you should be done:
DF = DF.select("*", subgroup)\
.groupBy("subgroup")\
.agg(
func.min("eventtime").alias("start_time"),
func.max("eventtime").alias("end_time"),
func.count(func.lit(1)).alias("events")
)
Another approach can be to group the dataframe based on your timeline criteria.
You can create a dataframe with the rows that break the 5-minute timeline.
Those rows are the criteria for grouping the records, and
those rows will set the start_time and end_time for each group.
Then find the count and the max timestamp (end_time) for each group, as in the sketch below.
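A minimal PySpark sketch of that grouping idea (my own illustration; the events table name and the use of timeDiff > 300 as the break condition are assumptions based on the question):
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
# Assumed input with the columns userid, eventtime and timeDiff from the question.
events = spark.table("events")

w = Window.partitionBy("userid").orderBy("eventtime")

grouped = (events
    # Rows breaking the 5-minute timeline become group boundaries.
    .withColumn("is_break", (F.col("timeDiff") > 300).cast("int"))
    # A cumulative sum of the boundary flag assigns a group id; subtracting the
    # flag keeps the breaking row inside the group it closes.
    .withColumn("grp", F.sum("is_break").over(w) - F.col("is_break"))
    .groupBy("userid", "grp")
    .agg(F.min("eventtime").alias("start_time"),
         F.max("eventtime").alias("end_time"),
         F.count("*").alias("events")))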
I have a Pyspark dataframe containing logs, with each row corresponding to the state of the system at the time it is logged, and a group number. I would like to find the lengths of the time periods for which each group is in an unhealthy state.
For example, if this were my table:
TIMESTAMP | STATUS_CODE | GROUP_NUMBER
--------------------------------------
02:03:11 | healthy | 000001
02:03:04 | healthy | 000001
02:03:03 | unhealthy | 000001
02:03:00 | unhealthy | 000001
02:02:58 | healthy | 000008
02:02:57 | healthy | 000008
02:02:55 | unhealthy | 000001
02:02:54 | healthy | 000001
02:02:50 | healthy | 000007
02:02:48 | healthy | 000004
I would want to return Group 000001 having an unhealthy time period of 9 seconds (from 02:02:55 to 02:03:04).
Other groups could also have unhealthy time periods, and I would want to return those as well.
Due to the possibility of consecutive rows with the same status, and since rows of different groups are interspersed, I am struggling to find a way to do this efficiently.
I cannot convert the Pyspark dataframe to a Pandas dataframe, as it is much too large.
How can I efficiently determine the lengths of these time periods?
Thanks so much!
The PySpark with Spark SQL solution would look like this.
First we create the sample dataset. In addition to the dataset, we generate a row_number field, partitioned by group and ordered by the timestamp. Then we register the generated dataframe as a table, say table1.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([
('2017-01-01 02:03:11','healthy','000001'),
('2017-01-01 02:03:04','healthy','000001'),
('2017-01-01 02:03:03','unhealthy','000001'),
('2017-01-01 02:03:00','unhealthy','000001'),
('2017-01-01 02:02:58','healthy','000008'),
('2017-01-01 02:02:57','healthy','000008'),
('2017-01-01 02:02:55','unhealthy','000001'),
('2017-01-01 02:02:54','healthy','000001'),
('2017-01-01 02:02:50','healthy','000007'),
('2017-01-01 02:02:48','healthy','000004')
],['timestamp','state','group_id'])
df = df.withColumn('rownum', row_number().over(Window.partitionBy(df.group_id).orderBy(unix_timestamp(df.timestamp))))
df.registerTempTable("table1")
Once the dataframe is registered as a table (table1), the required data can be computed as below using Spark SQL:
>>> spark.sql("""
... SELECT t1.group_id,sum((t2.timestamp_value - t1.timestamp_value)) as duration
... FROM
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1 WHERE state = 'unhealthy') t1
... LEFT JOIN
... (SELECT unix_timestamp(timestamp) as timestamp_value,group_id,rownum FROM table1) t2
... ON t1.group_id = t2.group_id
... AND t1.rownum = t2.rownum - 1
... group by t1.group_id
... """).show()
+--------+--------+
|group_id|duration|
+--------+--------+
| 000001| 9|
+--------+--------+
The sample dataset had unhealthy data for group_id 000001 only, but this solution also works for cases where other group_ids have an unhealthy state.
One straightforward way (maybe not optimal) is:
Map to [K,V] with GROUP_NUMBER as the Key K
Use repartitionAndSortWithinPartitions, so you will have all data for every single group in the same partition and have them sorted by TIMESTAMP. Detailed explanation how it works is in this answer: Pyspark: Using repartitionAndSortWithinPartitions with multiple sort Critiria
And finally use mapPartitions to get an iterator over the sorted data in a single partition, so you can easily find the answer you need. (Explanation for mapPartitions: How does the pyspark mapPartitions function work?) A rough sketch of this approach follows.
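A minimal RDD sketch of that idea (my own illustration, assuming timestamps are plain HH:MM:SS strings as in the question; the inline sample rows are just for demonstration):
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.rdd import portable_hash

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rows = [("02:03:11", "healthy", "000001"), ("02:03:04", "healthy", "000001"),
        ("02:03:03", "unhealthy", "000001"), ("02:03:00", "unhealthy", "000001"),
        ("02:02:55", "unhealthy", "000001"), ("02:02:54", "healthy", "000001")]

num_parts = 4
# Key by (group, timestamp): partition on the group only, and sort within each
# partition by the full key, so every group's rows end up contiguous and time-ordered.
keyed = sc.parallelize(rows).map(lambda r: ((r[2], r[0]), r[1]))
sorted_rdd = keyed.repartitionAndSortWithinPartitions(
    numPartitions=num_parts, partitionFunc=lambda k: portable_hash(k[0]))

def unhealthy_seconds(partition):
    # Walk each group's time-ordered rows; whenever the previous row of the same
    # group was 'unhealthy', add the time gap to that group's total.
    totals, prev_group, prev_ts, prev_status = {}, None, None, None
    for (group, ts_str), status in partition:
        ts = datetime.strptime(ts_str, "%H:%M:%S")
        if group == prev_group and prev_status == "unhealthy":
            totals[group] = totals.get(group, 0) + int((ts - prev_ts).total_seconds())
        prev_group, prev_ts, prev_status = group, ts, status
    return iter(totals.items())

print(sorted_rdd.mapPartitions(unhealthy_seconds).collect())   # e.g. [('000001', 9)]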
I have data in the following format in a Hive table.
user | purchase | time_of_purchase
I want to get data in
user | list of purchases ordered by time
How do I do this in pyspark or hiveQL?
I have tried using collect_list in hive but it does not retain the order correctly by timestamp.
Edit :
Adding sample data as asked by KartikKannapur.
Here is a sample data
94438fef-c503-4326-9562-230e78796f16 | Bread | Jul 7 20:48
94438fef-c503-4326-9562-230e78796f16 | Shaving Cream | July 10 14:20
a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | Milk | July 7 3:48
a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | Bread | July 7 3:49
a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | Lotion | July 7 15:30
The output I want is
94438fef-c503-4326-9562-230e78796f16 | Bread , Shaving Cream
a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad | Milk , Bread , Lotion
One way of doing this is
First create a HiveContext and read the table into an RDD.
from pyspark.sql import HiveContext
purchaseList = HiveContext(sc).sql('from purchaseList select *')
Then process the RDD
from datetime import datetime as dt
purchaseList = purchaseList.map(lambda x:(x[0],[x[1],dt.strptime(x[2],"%b %d %H:%M")]))
purchaseByUser = purchaseList.groupByKey()
purchaseByUser = purchaseByUser.map(lambda x:(x[0],[y[0] for y in sorted(x[1], key=lambda z:z[1])]))
print(purchaseByUser.take(2))
Output
[('94438fef-c503-4326-9562-230e78796f16', ['Bread', 'Shaving Cream']), ('a0dcbb3b-d1dd-43aa-91d7-e92f48cee0ad', ['Milk', 'Bread', 'Lotion'])]
Save the RDD as a new Hive table
schema_rdd = HiveContext(sc).inferSchema(purchaseByUser)
schema_rdd.saveAsTable('purchaseByUser')
For reading and writing Hive tables, see this Stack Overflow question and the Spark docs.