I read data from Kafka in Spark Structured Streaming 2.3.0. The data contains information about teachers: there are teacherId, teacherName and teacherGroupsIds columns. teacherGroupsIds is an array column which contains the ids of the groups. In my task I have to map the column with group ids to a column containing the group names ([1,2,3] => [Suns,Books,Flowers]). The names and ids are stored in HBase and can change every day. Later I have to send the data to another Kafka topic.
So I read data from two sources: Kafka and HBase. I read from HBase using the shc library.
First I explode the array column (group ids), and then I join with the data from HBase.
In the next step I would like to aggregate the data by teacherId, but this operation is not supported in Append mode, which I use.
I have tried watermarking, but at the moment it doesn't work. I added a new column with a timestamp and group by this column:
Dataset<Row> inputDataset = ...;    // reading from Kafka
Dataset<Row> explodedDataset = ...; // explode applied and joined with HBase
Dataset<Row> outputDataset = explodedDataset
    .withColumn("eventTime", lit(current_timestamp()))
    .withWatermark("eventTime", "2 minutes")
    .groupBy(window(col("eventTime"), "5 seconds"), col("teacherId"))
    .agg(collect_list(col("groupname")));
The actual result is an empty dataframe at the output; there are no rows at all.
The problem is current_timestamp().
current_timestamp() returns the timestamp at the moment it is evaluated. So if you create a dataframe with this column and print the result, you print the current timestamp; but if you process the df later and print the same column, you print a new timestamp.
This solution works locally, but in a distributed system it sometimes fails because by the time the workers receive the order to process the data, that data is already outside the timestamp range covered by the watermark.
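A common workaround is to derive the event time from something that is fixed per record instead of being evaluated at processing time. The Kafka source in Structured Streaming exposes a timestamp column for each message, so a hedged PySpark sketch (broker/topic names are placeholders, and the explode/HBase-join steps are elided) could look like this; the Java API is analogous:
from pyspark.sql.functions import col, window, collect_list

# Placeholders for the broker and topic names
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "teacher-topic")
       .load())

# ... parse the value, explode teacherGroupsIds and join with HBase here,
# keeping the Kafka-provided 'timestamp' column on the resulting explodedDf ...

outputDf = (explodedDf
            .withColumn("eventTime", col("timestamp"))   # per-record event time
            .withWatermark("eventTime", "2 minutes")
            .groupBy(window(col("eventTime"), "5 seconds"), col("teacherId"))
            .agg(collect_list(col("groupname"))))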
I have parquet data stored on S3 and an Athena table partitioned by id and date.
The parquet files are stored in
s3://bucket_name/table_name/id=x/date=y/
The parquet files themselves contain the partition columns (id, date), which is why I am not able to read them using AWS Glue.
I would like to read the data from only a few partitions, and hence I am making use of a partition predicate as follows:
from datetime import date, timedelta

today = date.today()
yesterday = today - timedelta(days=1)
predicate = "date = date '" + str(yesterday) + "'"
df = glueContext.create_dynamic_frame_from_catalog(database_name, table_name,
                                                   push_down_predicate=predicate)
However, since the files already contain the partition columns, I am getting the below error:
AnalysisException: Found duplicate column(s) in the data schema and
the partition schema: id, date
Is there a way I can read data from only a few partitions like this? Can I somehow read the data by ignoring id and date columns?
Any sort of help is appreciated :)
Concerning your first question, 'Is there a way I can read data from only a few partitions like this?':
You don't need the predicate, in my opinion: the beauty of partitioned parquet files is that Spark pushes any filter applied on those partition columns down to the file-scanning phase. That means Spark can skip entire partitions and row groups just by reading the parquet metadata.
Have a look at the physical execution plan once you execute a df = spark.read.parquet(...) followed by df.filter(col("date") == '2022-07-19').
You should find something along the lines of
+- FileScan parquet [idxx, ... PushedFilters: [IsNotNull(date), EqualTo(date, 2022-07-19)..
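For instance, a minimal PySpark sketch using the bucket layout from the question (the exact path is a placeholder):
from pyspark.sql.functions import col

# Read the table root and filter on the partition column, then inspect the plan.
df = spark.read.parquet("s3://bucket_name/table_name/")
df.filter(col("date") == "2022-07-19").explain()
# The date condition should appear under PartitionFilters/PushedFilters in the
# FileScan node, confirming that only the matching files are scanned.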
Concerning whether you can read the data by ignoring the id and date columns: you can potentially pass multiple parquet paths to the read function at the leaf level, which ignores the date/id columns altogether (I don't know why you would do that, though, if you need to filter on them):
df = spark.read.parquet(
    "file:///your/path/date=2022-07-19/id=55/",
    "file:///your/path/date=2022-07-19/id=40/")

# Shorter solution:
df = spark.read.parquet(
    "file:///your/path/date=2022-07-19/id={55,40}/*")
I am new to Azure Databricks. I am trying to write a dataframe output to a Delta table which contains a TIMESTAMP column, but strangely the TIMESTAMP pattern changes after writing to the Delta table.
My DataFrame output column holds the value in this format: 2022-05-13 17:52:09.771
But after writing it to the table, the column value is populated as
2022-05-13T17:52:09.771+0000
I am using the function below to generate this dataframe output:
val pretsUTCText = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
val tsUTCText: String = pretsUTCText.format(ts)
val tsUTCCol: Column = lit(tsUTCText)
val df = df2.withColumn("eventTs", // column name missing in the original snippet
  to_timestamp(timestampConverter.tsUTCCol, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
The dataframe output returns 2022-05-13 17:52:09.771 as the TIMESTAMP pattern, but after writing it to the Delta table I see the same value populated as 2022-05-13T17:52:09.771+0000.
I could not find any solution. Thanks in advance.
I have just found the same behaviour on Databricks as you, and it behaves differently from the Databricks documentation. It seems that after some version Databricks shows the timezone by default, which is why you see the additional +0000. I think you can use the date_format function when you populate the data if you don't want that. Also, you don't need 'Z' in the format text, as it stands for the timezone.
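A hedged PySpark sketch of what the date_format approach could look like (the column name event_ts and the output pattern are assumptions):
from pyspark.sql.functions import date_format, col

# Render the timestamp as plain text in the desired pattern; the TIMESTAMP value
# itself is unchanged, only the string representation differs.
formatted_df = df.withColumn(
    "event_ts_text",
    date_format(col("event_ts"), "yyyy-MM-dd HH:mm:ss.SSS"))  # names assumed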
I have a partitioned Delta table stored in ADLS (partitioned on a date column).
How do I read only the data from the past year, i.e. rows where the date looks like 2020-**-**?
You just need to read the table and filter out the data you need - Spark will perform predicate pushdown and will read only the data from the matching partitions. It's as simple as (assuming that the column is called date):
df = spark.read.format("delta").load("your-table-root-path") \
.filter("date >= '2020-01-01' and date <= '2020-12-31'")
Load in the partitioned raw table based on date:
raw_df = spark.read.format('delta').load(raw_data_path + 'tablename/partitionname=specifieddate')
This is the code I usually use to retrieve a certain partition from ADLS.
I have a huge parquet table, partitioned on the registration_ts column, named stored.
I'd like to filter this table based on data obtained from a small table named stream.
In the SQL world the query would look like:
spark.sql("select * from stored where exists (select 1 from stream where stream.registration_ts = stored.registration_ts)")
In the DataFrame world:
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi")
This all works, but the performance suffers because partition pruning is not applied: Spark full-scans the stored table, which is too expensive.
For example, this runs for 2 minutes:
stream.count
res45: Long = 3
//takes 2 minutes
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
[Stage 181:> (0 + 1) / 373]
This runs in 3 seconds:
val stream = stream.where("registration_ts in (20190516204l, 20190515143l,20190510125l, 20190503151l)")
stream.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(stream), Seq("registration_ts"), "leftsemi").collect
The reason is that in the second example the partition filter is propagated to the joined stream table.
I'd like to achieve partition filtering on a dynamic set of partitions.
The only solution I was able to come up with:
val partitions = stream.select('registration_ts).distinct.collect.map(_.getLong(0))
stored.where('registration_ts.isin(partitions:_*))
This collects the partitions to the driver and runs a second query. It works fine only for a small number of partitions; when I tried this solution with 500k distinct partitions, the delay was significant.
But there must be a better way ...
Here's one way you can do it in PySpark; I've verified in Zeppelin that it uses the set of values to prune the partitions.
# collect_set returns a distinct list of values, and collect() returns a list
# of rows; [0][0] takes the first column of the first row, i.e. the distinct
# list of filter values
from pyspark.sql.functions import col, collect_set

filter_list = (spark.read.orc(HDFS_PATH)
    .agg(collect_set(COLUMN_WITH_FILTER_VALUES))
    .collect()[0][0])

# use filter_list with the isin function to prune the partitions
df = (spark.read.orc(HDFS_PATH)
    .filter(col(PARTITION_COLUMN).isin(filter_list)))
df.show(5)

# you may want to check filter_list to make sure the first read actually
# returned a valid list of values before doing the second read and pruning
# the partitions
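To double-check that the pruning really happens, the physical plan of the filtered read can be inspected (a sketch using the same placeholder names as above):
from pyspark.sql.functions import col

# The FileScan node of the plan should list the isin values under
# PartitionFilters when PARTITION_COLUMN is a partition column.
(spark.read.orc(HDFS_PATH)
    .filter(col(PARTITION_COLUMN).isin(filter_list))
    .explain())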
I have a very large CSV file, so I used Spark and loaded it into a Spark dataframe.
I need to extract the latitude and longitude from each row of the CSV in order to create a folium map.
With pandas I can solve my problem with a loop:
for index, row in locations.iterrows():
    folium.CircleMarker(location=(row["Pickup_latitude"], row["Pickup_longitude"]),
                        radius=20, color="#0A8A9F", fill=True).add_to(marker_cluster)
I found that, unlike a pandas dataframe, a Spark dataframe can't be processed with a loop => how to loop through each row of dataFrame in pyspark.
So I thought I could re-engineer the problem and cut the big data into Hive tables, then iterate over them.
Is it possible to cut the huge Spark dataframe into Hive tables and then iterate over the rows with a loop?
Generally you don't need to iterate over a DataFrame or RDD. You only create transformations (like map) that will be applied to each record, and then call some action to trigger that processing.
You need something like:
dataframe.withColumn("latitude", <how to extract latitude>)
  .withColumn("longitude", <how to extract longitude>)
  .select("latitude", "longitude")
  .rdd
  .map(row => <extract values from Row type>)
  .collect() // this will move the data to a local collection
If you can't do it with SQL, you need to do it using the RDD API:
dataframe
.rdd
.map(row => <create new row with latitude and longitude>)
.collect()
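For this specific folium use case, a hedged PySpark sketch of the same idea (column names are taken from the question; df, the map centre and the cluster setup are assumptions):
import folium
from folium.plugins import MarkerCluster

# Bring only the two coordinate columns to the driver, then loop locally.
coords = (df.select("Pickup_latitude", "Pickup_longitude")
            .dropna()
            .collect())

m = folium.Map(location=[coords[0]["Pickup_latitude"],
                         coords[0]["Pickup_longitude"]], zoom_start=12)
marker_cluster = MarkerCluster().add_to(m)

for row in coords:
    folium.CircleMarker(location=(row["Pickup_latitude"], row["Pickup_longitude"]),
                        radius=20, color="#0A8A9F", fill=True).add_to(marker_cluster)
This only makes sense if the number of points is small enough to collect to the driver; otherwise sample or aggregate the coordinates first.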