Iterate Spark data-frame with Hive tables - apache-spark

I have a very large CSV file, so I used Spark and loaded it into a Spark DataFrame.
I need to extract the latitude and longitude from each row of the CSV in order to create a folium map.
With pandas I can solve my problem with a loop:
for index, row in locations.iterrows():
    folium.CircleMarker(location=(row["Pickup_latitude"],
                                  row["Pickup_longitude"]),
                        radius=20,
                        color="#0A8A9F", fill=True).add_to(marker_cluster)
I found that, unlike a pandas DataFrame, a Spark DataFrame can't be processed with a plain loop (see: how to loop through each row of dataFrame in pyspark).
So I thought I could re-engineer the problem: cut the big data into Hive tables and then iterate over them.
Is it possible to split the huge Spark DataFrame into Hive tables and then iterate over the rows with a loop?

Generally you don't need to iterate over a DataFrame or RDD. You only define transformations (like map) that will be applied to each record, and then call an action to trigger that processing.
You need something like:
dataframe.withColumn("latitude", <how to extract latitude>)
  .withColumn("longitude", <how to extract longitude>)
  .select("latitude", "longitude")
  .rdd
  .map(row => <extract values from Row type>)
  .collect() // this will move data to local collection
If you can't do it with SQL, you need to do it using the RDD API:
dataframe
  .rdd
  .map(row => <create new row with latitude and longitude>)
  .collect()
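In PySpark, the same idea might look like the sketch below. It assumes the DataFrame already has the Pickup_latitude and Pickup_longitude columns from the question; the helper names iter_coordinates and coordinates_from_spark are made up for illustration. It keeps only the two needed columns and uses toLocalIterator() instead of collect(), so rows arrive on the driver one partition at a time.

```python
def iter_coordinates(rows, lat_col="Pickup_latitude", lon_col="Pickup_longitude"):
    """Yield (latitude, longitude) float pairs from row-like objects.

    Works with anything indexable by column name, such as pyspark Row
    objects or plain dicts. Rows with a missing coordinate are skipped.
    """
    for row in rows:
        lat, lon = row[lat_col], row[lon_col]
        if lat is None or lon is None:
            continue
        yield float(lat), float(lon)


def coordinates_from_spark(df):
    """Hypothetical helper: df is assumed to be a pyspark.sql.DataFrame."""
    slim = df.select("Pickup_latitude", "Pickup_longitude")
    # toLocalIterator() pulls one partition at a time to the driver.
    return iter_coordinates(slim.toLocalIterator())


# Usage with folium (marker_cluster as in the question):
# for lat, lon in coordinates_from_spark(df):
#     folium.CircleMarker(location=(lat, lon), radius=20,
#                         color="#0A8A9F", fill=True).add_to(marker_cluster)
```

Even so, a map with millions of markers is unlikely to be usable, so filtering or sampling in Spark before bringing rows to the driver is probably worth considering.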

Related

How to use map to select columns in RDD

I have an RDD dataset of flights and I have to select specific columns from it.
I have to select column numbers 9,4,5,8,17 and then create an sql dataframe with the results. The data is an RDD.
I tried the following but I get an error in the map.
q9 = data.map(lambda x: [x[i] for i in [9,4,5,8,17]])
sqlContext.createDataFrame(q9_1, ['Flight Num', 'DepTime', 'CRSDepTime', 'UniqueCarrier', 'Dest']).show(n=20)
What would you do? thanks!
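One thing that stands out in the snippet: the mapped RDD is named q9, but createDataFrame is given q9_1. Whether that is the error being reported is only a guess. A minimal sketch, assuming each record is already a list of fields rather than a raw text line; pick_columns and the two constants are made-up names:

```python
FLIGHT_COLUMNS = [9, 4, 5, 8, 17]
FLIGHT_NAMES = ["Flight Num", "DepTime", "CRSDepTime", "UniqueCarrier", "Dest"]


def pick_columns(record, indexes=FLIGHT_COLUMNS):
    """Return the fields at the given positions as a tuple."""
    return tuple(record[i] for i in indexes)


# With Spark (data and sqlContext as in the question):
# q9 = data.map(pick_columns)
# sqlContext.createDataFrame(q9, FLIGHT_NAMES).show(n=20)
```

If data holds raw CSV lines, each line would need splitting first (for example with line.split(",")), otherwise x[i] picks single characters instead of fields.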

Watermarking in Spark Structured Streaming 2.3.0

I read data from Kafka in Spark Structured Streaming 2.3.0. The data contains information about some teachers: teacherId, teacherName and teacherGroupsIds. teacherGroupsIds is an array column that contains the ids of the groups. In my task I have to map the column with group ids to a column containing the group names ([1,2,3] => [Suns,Books,Flowers]). The names and ids are stored in HBase and can change every day. Later I have to send the data to another Kafka topic.
So, I read data from two sources: Kafka and HBase. I read the data from HBase using the shc library.
First, I explode the array column (group ids), then I join with the data from HBase.
In the next step I would like to aggregate the data by teacherId, but this operation is not supported in Append mode, which I use.
I have tried watermarking, but at the moment it doesn't work. I added a new column with a timestamp and I group by this column.
Dataset<Row> inputDataset = // reading from Kafka
Dataset<Row> explodedDataset = // explode function applied and join with HBase
Dataset<Row> outputDataset = explodedDataset
    .withColumn("eventTime", lit(current_timestamp()))
    .withWatermark("eventTime", "2 minutes")
    .groupBy(window(col("eventTime"), "5 seconds"), col("teacherId"))
    .agg(collect_list(col("groupname")));
The actual result is an empty DataFrame at the output: there are no rows at all.
The problem is current_timestamp().
current_timestamp() returns the timestamp at the moment it is evaluated. So if you create a DataFrame with this column and print the result, you see the current timestamp, but if you process the DataFrame again and print the same column, you see a new timestamp.
This approach works locally, but in a distributed system it sometimes fails, because by the time the workers receive the order to print the data, that data is already outside the timestamp range.

Group by in Dataframe using numeric elements Using Spark Scala

I have a Hive query which I need to convert into a DataFrame. The query is as below:
select sum(col1),max(col2) from table
group by 3,4,5,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24;
I don't know how to do that with a DataFrame. Generally we use
df.groupBy(columnName).agg()
But how can I convert the above query to a Spark DataFrame?
You can simply select the column names from the array of columns (df.columns) using the indexes, then use those selected column names in groupBy and apply the aggregation functions.
So the complete translation would be
import org.apache.spark.sql.functions._
val groupingIndexes = Seq(3,4,5,1,2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
df.groupBy(groupingIndexes.map(x => df.columns(x)).map(col): _*).agg(sum("col1"),max("col2"))
I hope the answer is helpful
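For PySpark users, a rough counterpart of the Scala snippet could look like this. It is a sketch that assumes the numbers are meant as 0-based positions into df.columns, as in the Scala code; positional references in SQL are 1-based, so the indexes may need shifting by one if that is what the query intends. columns_at is a made-up helper name.

```python
def columns_at(columns, indexes):
    """Map 0-based positions to column names, failing early on bad positions."""
    bad = [i for i in indexes if not 0 <= i < len(columns)]
    if bad:
        raise IndexError(
            f"positions out of range for {len(columns)} columns: {bad}"
        )
    return [columns[i] for i in indexes]


# With PySpark (df assumed to be a pyspark.sql.DataFrame):
# from pyspark.sql import functions as F
# grouping = columns_at(df.columns, [3, 4, 5, 1, 2])
# df.groupBy(*grouping).agg(F.sum("col1"), F.max("col2"))
```

Checking the positions up front gives a clearer message than an out-of-range error raised in the middle of building the query.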
val df = spark.table("tablename")
df.groupBy(lit(1), lit(2), lit(5),... ,lit(24)).agg(sum(col("col1")).as("sumval"), max(col("col2")).as("maxval")).select("maxval","sumval")
Thanks
Ravi

Split, operate and union dataframe in spark

How can we split a dataframe, operate on each split individually, and union all the individual results back together?
Let's say I have a dataframe with the columns below. I need to split the dataframe based on channel and operate on the individual splits, which adds a new column called bucket. Then I need to union the results back.
account,channel,number_of_views
groupBy only allows simple aggregate operations. On each split dataframe I need to do feature extraction.
Currently all feature transformers in spark-mllib support only a single dataframe.
You can split randomly like this:
val Array(training_data, validat_data, test_data) = raw_data_rating_before_spilt.randomSplit(Array(0.6,0.2,0.2))
This will create 3 dataframes. Do what you want with each of them, then join or union them:
val finalDF = df1.join(df2, df1.col("col_name")===df2.col("col_name"))
You can also join multiple dataframes at the same time.
Is this what you want, or something else?

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark dataframe column based on values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2. I suppose I could try to join the tables but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically just want to take the values from df and remove any that are found in df_re and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join, and then filter for where the right-hand side of the join is empty. Something like:
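// Assumes a SparkSession named `spark`, as in spark-shell.
import org.apache.spark.sql.functions.col
import spark.implicits._ // needed for toDF and the $"..." syntax

val df1 = Seq((1,2),(2,123),(3,101)).toDF("uniq_id", "payload")
val df2 = Seq((2,432)).toDF("uniq_id", "other_data")
df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)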
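Since the question uses PySpark, here is a hedged sketch of the same idea in Python. On Spark versions that support the "left_anti" join type, the outer join plus null filter can be written as a single join. The function names are made up for illustration, and anti_join is only a plain-Python model of the result (rows as dicts), not something Spark provides.

```python
def anti_join(rows, other_rows, key):
    """Keep the rows whose key value does not appear in other_rows.

    Plain-Python model of what a left anti join returns. None keys never
    match, mirroring how null keys behave in a SQL join condition.
    """
    seen = {other[key] for other in other_rows if other[key] is not None}
    return [row for row in rows if row[key] not in seen]


def drop_matching(df, df_re, key="UNIQUE_ID"):
    """Hypothetical helper: both arguments are assumed to be pyspark DataFrames."""
    # "left_anti" keeps only left-side rows with no match on the right.
    return df.join(df_re.select(key), on=key, how="left_anti")
```

With the sample data from the Scala snippet, the model keeps the rows with uniq_id 1 and 3 and drops the row with uniq_id 2.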
