Spark filter dataframe returns empty result - apache-spark

I'm working on a project with Scala and Spark, processing files stored in HDFS. Those files land in HDFS every morning. I have a job that reads the file from HDFS each day, processes it, and writes the result back to HDFS. After converting the file into a DataFrame, the job applies a filter to keep only the rows whose timestamp is higher than the highest timestamp processed in the previous file. This filter misbehaves only on some days: some days it works as expected, and on other days, even though the new file contains rows that match the filter, the filter result is empty. It happens every time for one particular file when the job runs in the TEST environment, but locally the same file works as expected using the same HDFS connection.
I've tried filtering in different ways, but none of them works in that environment for some specific files, while all of them work fine locally:
1) Spark SQL
val diff = fp.spark.sql("select * from curr " +
  s"where TO_DATE(CAST(UNIX_TIMESTAMP(substring(${updtDtCol},${substrStart},${substrEnd}),'${dateFormat}') as TIMESTAMP))" +
  s" > TO_DATE(CAST(UNIX_TIMESTAMP('${prevDate.substring(0,10)}','${dateFormat}') as TIMESTAMP))")
2) Spark filter functions
val diff = df.filter(
  date_format(unix_timestamp(substring(col(updtDtCol), 0, 10), dateFormat).cast("timestamp"), dateFormat)
    .gt(date_format(unix_timestamp(substring(col("PrevDate"), 0, 10), dateFormat).cast("timestamp"), dateFormat)))
3) Adding an extra column with the result of the comparison and then filtering by this new column
val test2 = df.withColumn("PrevDate", lit(prevDate.substring(0,10)))
  .withColumn("DatePre", date_format(unix_timestamp(substring(col("PrevDate"), 0, 10), dateFormat).cast("timestamp"), dateFormat))
  .withColumn("Result", date_format(unix_timestamp(substring(col(updtDtCol), 0, 10), dateFormat).cast("timestamp"), dateFormat)
    .gt(date_format(unix_timestamp(substring(col("PrevDate"), 0, 10), dateFormat).cast("timestamp"), dateFormat)))
  .withColumn("x", when(date_format(unix_timestamp(substring(col(updtDtCol), 0, 10), dateFormat).cast("timestamp"), dateFormat)
    .gt(date_format(unix_timestamp(substring(col("PrevDate"), 0, 10), dateFormat).cast("timestamp"), dateFormat)), lit(1)).otherwise(lit(0)))
val diff = test2.filter("x == 1")
I don't think the issue is caused by the filter itself, or even by the file, but I would like feedback on what I should check, or to hear whether anybody has faced this before.
Please let me know what information could be useful to post here in order to receive some feedback.
Part of an example file looks like the following:
|TIMESTAMP |Result|x|
|2017-11-30-06.46.41.288395|true |1|
|2017-11-28-08.29.36.188395|false |0|
The TIMESTAMP values are compared with the previous date (for instance, 2017-11-29). I create a column called 'Result' with the result of that comparison, which always works correctly in both environments, and another column called 'x' with the same result.
As I mentioned before, if I use the date comparison itself, or the value in column 'Result' or 'x', to filter the DataFrame, the result is sometimes an empty DataFrame, whereas locally, using the same HDFS and the same file, the result contains data.

I suspect this is a data/date format issue. Did you get a chance to verify whether the converted dates are as expected?
If the date strings for both columns include a timezone, the behavior is predictable.
If only one of them includes a timezone, the results will differ between local and remote execution; it depends entirely on the timezone of the cluster.
For debugging the issue, I would suggest adding columns to capture the unix_timestamp(..)/millis of the respective date strings, plus an additional column to capture the difference between the two. The diff column should help you find out where and why the conversions went wrong. Hope this helps.
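For instance, a sketch of such debugging columns (shown here in PySpark for brevity; the equivalent Column functions exist in Scala, and the column and variable values below are illustrative):

from pyspark.sql.functions import col, lit, substring, unix_timestamp

dateFormat = "yyyy-MM-dd"   # assumed format of the first 10 characters
updtDtCol = "TIMESTAMP"     # the update-date column from the question
prevDate = "2017-11-29"     # previous run's date, as in the example

debug = (df
    .withColumn("curr_epoch", unix_timestamp(substring(col(updtDtCol), 1, 10), dateFormat))
    .withColumn("prev_epoch", unix_timestamp(lit(prevDate), dateFormat))
    .withColumn("epoch_diff", col("curr_epoch") - col("prev_epoch")))

# If epoch_diff differs between environments, the conversion (not the data) is the problem.
debug.select(updtDtCol, "curr_epoch", "prev_epoch", "epoch_diff").show(truncate=False)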

In case anybody wants to know what happened with this issue and how I finally found the cause of the error, here is the explanation. It was caused by the different timezones of the machines where the job was executed (my LOCAL machine and the TEST server). The unix_timestamp function returned the correct value for the timezone of each server. In the end I didn't need to use the unix_timestamp function at all, and I didn't need to use the full content of the date field. Next time I will double-check this first.
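One way to make such jobs behave the same everywhere (assuming Spark 2.2+, where this option exists) is to pin the session time zone explicitly, so that unix_timestamp and timestamp casts no longer depend on the JVM default time zone of whichever host runs the job:

# spark is an existing SparkSession; this setting applies to the whole session
spark.conf.set("spark.sql.session.timeZone", "UTC")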

Related

Cassandra update query didn't change data and didn't throw an error

Hello, I am facing a problem when trying to execute a really simple update query in cqlsh:
update "Table" set "token"='111-222-333-444' where uid='123' and "key"='somekey';
It didn't throw any error, but the value of the token is still the same. However, if I try the same query for some other field it works just fine:
update "Table" set "otherfield"='value' where uid='123' and "key"='somekey';
Any ideas why Cassandra can prevent updates for some fields?
Most probably the entry was inserted by a client with an incorrect clock, or something similar. Data in Cassandra is "versioned" by write time, which could even be in the future (depending on the use case). When reading, Cassandra compares the write times of all versions of the specified column (there could be multiple versions in the data files on disk) and selects the one with the highest write time.
You need to check the write time of that column value (use the writetime function) and compare it with the current time:
select writetime("token") from "Table" where uid='123' and "key"='somekey';
The resulting value is in microseconds. You can remove the last 3 digits and use something like an online epoch converter to turn it into a human-readable time.
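For example, a small Python sketch of that conversion (the writetime value below is illustrative):

from datetime import datetime, timedelta, timezone

write_time_us = 1512024401288395  # microseconds since the Unix epoch, as returned by writetime()

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
write_time = epoch + timedelta(microseconds=write_time_us)
print(write_time)  # 2017-11-30 06:46:41.288395+00:00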

Spark find max of date partitioned column

I have a parquet partitioned in the following way:
data
/batch_date=2020-01-20
/batch_date=2020-01-21
/batch_date=2020-01-22
/batch_date=2020-01-23
/batch_date=2020-01-24
Here batch_date, which is the partition column, is of date type.
I want to read only the data from the latest date partition, but as a consumer I don't know what the latest value is.
I could use a simple group by something like
df.groupby().agg(max(col('batch_date'))).first()
While this would work, it's a very inefficient way since it involves a groupBy.
I want to know if we can query the latest partition in a more efficient way.
Thanks.
The method suggested by #pasha701 would involve loading the entire Spark DataFrame with all the batch_date partitions and then finding the max of that. I think the author is asking for a way to directly find the max partition date and load only that.
One way is to use hdfs or s3fs to list the contents of the S3 path, find the max partition, and then load only that. That would be more efficient.
Assuming you are using AWS S3, something like this:
import s3fs

datelist = []
inpath = "s3://bucket_path/data/"

fs = s3fs.S3FileSystem(anon=False)
dirs = fs.ls(inpath)

# Each entry looks like "bucket_path/data/batch_date=2020-01-24"
for path in dirs:
    date = path.split('=')[1]
    datelist.append(date)

maxpart = max(datelist)

df = spark.read.parquet("s3://bucket_path/data/batch_date=" + maxpart)
This does all the work with plain Python lists, without loading any data into Spark until it finds the partition you want to load.
Function "max" can be used without "groupBy":
df.select(max("batch_date"))
Use SHOW PARTITIONS to get all the partitions of the table:
show partitions TABLENAME
Output will be like
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
We can get the data for a specific partition using the query below:
select * from TABLENAME where pt='2012.07.28.10' and is_complete='1' limit 1;
Additional filters or a group by can also be applied on top of it.
This worked for me in PySpark v2.4.3. First, extract the partitions (this is for a table partitioned on a single date column; I haven't tried it when a table has more than one partition column):
df_partitions = spark.sql("show partitions database.dataframe")
"show partitions" returns dataframe with single column called 'partition' with values like partitioned_col=2022-10-31. Now we create a 'value' column extracting just the date part as string. This is then converted to date and the max is taken:
date_filter = df_partitions.withColumn('value', to_date(split('partition', '=')[1], 'yyyy-MM-dd')).agg({"value":"max"}).first()[0]
date_filter contains the maximum date from the partition and can be used in a where clause pulling from the same table.
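Putting it together, a sketch of the full approach with its imports, reusing the hypothetical database.dataframe table and partitioned_col names from above, and then using the result to pull only the latest partition from the same table:

from pyspark.sql.functions import col, split, to_date

df_partitions = spark.sql("show partitions database.dataframe")

# Extract the date part of each 'partition' string, take the max, then filter on it
date_filter = (df_partitions
    .withColumn("value", to_date(split("partition", "=")[1], "yyyy-MM-dd"))
    .agg({"value": "max"})
    .first()[0])

latest = spark.table("database.dataframe").where(col("partitioned_col") == date_filter)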

Create XML request from each record of a dataframe

I have tried many options including withColumn, udf, lambda, foreach, and map, but I am not getting the expected output. At most, I am able to transform only the first record. The inputfile.json will keep growing, and the expected output should give the XML in the desired structure. I will later produce the expected output on Kafka.
Spark 2.3, Python 2.7. This needs to be done in PySpark.
Edit 1:
I am able to add a column to the main DataFrame that contains the required XML. I used withColumn and functions.format_string and was able to add strings (the XML structures) to columns of the DataFrame.
Now my next target is to produce just the value of that new column to Kafka. I am using df.foreachPartition(send_to_kafka) and have created a function as below:
from kafka import SimpleClient, SimpleProducer  # kafka-python

def send_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        producer.send_messages('test', str(row.asDict()))
But unfortunately it does two things:
a. It produces the record on Kafka as {'newColumn':u'myXMLPayload'}. I do not want that; I want only myXMLPayload to be produced on Kafka.
b. It adds u' to the value because the value is a unicode string.
If I can get rid of these two parts, I will be good to go.
Any help would be appreciated.
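Not a full answer, but a sketch of one way to address both points, reusing the SimpleProducer setup from the question and assuming the XML string lives in the newColumn column mentioned above: send the column value itself rather than str(row.asDict()), and encode it to bytes so the u'' prefix never reaches Kafka.

from kafka import SimpleClient, SimpleProducer  # kafka-python

def send_xml_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        # Produce only the XML payload, as bytes, instead of the whole row dict
        producer.send_messages('test', row['newColumn'].encode('utf-8'))

# df is the DataFrame that already has the XML column, as described in Edit 1
df.select('newColumn').foreachPartition(send_xml_to_kafka)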

spark dataset: how to get count of occurrences of unique values from a column

I'm trying the Spark Dataset APIs, reading a CSV file and counting the occurrences of unique values in a particular field. One approach which I think should work is not behaving as expected. Let me know what I am overlooking. I have posted both the working and the buggy approach below.
// get all records from a column
val professionColumn = data.select("profession")
// breakdown by professions in descending order
// ***** DOES NOT WORK ***** //
val breakdownByProfession = professionColumn.groupBy().count().collect()
// ***** WORKS ***** //
val breakdownByProfessiond = data.groupBy("profession").count().sort("count") // WORKS
println ( s"\n\nbreakdown by profession \n")
breakdownByProfession.show()
Also, please let me know which approach is more efficient. My guess would be the first one (the reason I attempted it in the first place).
Also, what is the best way to save the output of such an operation to a text file using the Dataset APIs?
In the first case, since there are no grouping columns specified, the entire dataset is considered as one group -- this behavior holds even though there is only one column present in the dataset. So, you should always pass the list of columns to groupBy().
Now the two options would be: data.select("profession").groupBy("profession").count vs. data.groupBy("profession").count. In most cases, the performance of these two alternatives will be exactly the same since Spark tries to push projections (i.e., column selection) down the operators as much as possible. So, even in the case of data.groupBy("profession").count, Spark first selects the profession column before it does the grouping. You can verify this by looking at the execution plan -- org.apache.spark.sql.Dataset.explain()
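For example (shown here in PySpark for brevity; the Scala calls are identical), you can compare the two physical plans and check that both read only the profession column (the file name below is a hypothetical stand-in for the CSV from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = spark.read.option("header", "true").csv("people.csv")  # hypothetical CSV with a 'profession' column

# Both plans should show a scan that reads only `profession` before the aggregation
data.select("profession").groupBy("profession").count().explain()
data.groupBy("profession").count().explain()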
In the groupBy transformation you need to provide the column name, as below:
val breakdownByProfession = professionColumn.groupBy("profession").count().collect()

SQL dataframe first and last not returning "real" first and last values

I tried using the Apache Spark SQL dataframe's aggregate functions "first" and "last" on a large file with a spark master and 2 workers. When I do the "first" and "last" operations I am expecting back the last column from the file; but it looks like Spark is returning the "first" or "last" from the worker partitions.
Is there any way to get the "real" first and last values in aggregate functions?
Thanks,
Yes, it is possible, depending on what you mean by "real" first and last values. For example, if you are dealing with timestamped data and the "real" first value refers to the oldest record, just orderBy the data according to time and take the first value.
When you say "When I do the 'first' and 'last' operations I am expecting back the last column from the file", I understand that you are in fact referring to the first/last row of data from the file. Please correct me if I have misunderstood.
Thanks.
Edit:
You can read the file into a single partition (by setting numPartitions = 1), then zipWithIndex, and finally parallelize the resulting collection. This way you get a column to order on, and you don't change the source file either.
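A minimal PySpark sketch of that idea (using coalesce(1) to force a single partition; the HDFS path is a hypothetical placeholder):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the file as a single partition so the original line order is preserved,
# then attach a stable index with zipWithIndex: each element becomes (line, index).
rdd = (spark.sparkContext
    .textFile("hdfs:///path/to/file.csv")
    .coalesce(1)
    .zipWithIndex())

df = rdd.map(lambda t: Row(line=t[0], idx=t[1])).toDF()

# "Real" first and last rows in file order
first_row = df.orderBy("idx").first()
last_row = df.orderBy(df.idx.desc()).first()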
