Split, operate and union a dataframe in Spark - apache-spark

How can we split a dataframe, operate on the individual splits, and union all of the individual results back together?
Let's say I have a dataframe with the columns below. I need to split the dataframe based on channel, operate on each individual split to add a new column called bucket, and then union the results back together.
account,channel,number_of_views
groupBy only allows simple aggregate operations. On each split dataframe I need to do feature extraction.
Currently, all Feature Transformers in spark-mllib support only a single dataframe.

You can randomly split like this:
val Array(training_data, validation_data, test_data) = raw_data_rating_before_split.randomSplit(Array(0.6, 0.2, 0.2))
This will create three dataframes. Do what you want with each, then you can join or union them back:
val finalDF = df1.join(df2, df1.col("col_name") === df2.col("col_name"))
You can also join multiple dataframes at the same time.
Is this what you want, or something else?
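The question, though, asks about splitting on a column value rather than randomly. A minimal PySpark sketch of that split/apply/union pattern, where add_bucket is a hypothetical stand-in for the real per-channel feature extraction:

from functools import reduce
from pyspark.sql import DataFrame, functions as F

# Hypothetical feature-extraction step applied to each split;
# replace the bucket logic with your real transformer.
def add_bucket(split_df):
    return split_df.withColumn("bucket", (F.col("number_of_views") / 100).cast("int"))

channels = [r["channel"] for r in df.select("channel").distinct().collect()]
splits = [add_bucket(df.filter(F.col("channel") == c)) for c in channels]
result = reduce(DataFrame.union, splits)  # union the per-channel results back together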

Related

Spark Union vs adding columns using lit in spark

This is a Spark-related question. I have to add static data to various types of records, each type of record being processed as a different dataframe (say df1, df2, .. df6).
The static data that I intend to add has to be repeated for all 6 dataframes.
What would be the more performant way:
For each of the 6 dataframes, use:
.withColumn("testA", lit("somethingA"))
.withColumn("testB", lit("somethingB"))
.withColumn("testC", lit("somethingC"))
or
Create a new DF, say staticDF, which has all the columns that I intend to append to each of the 6 dataframes, and use a union?
or
Any other option that I have not considered?
The first way is correct. The second way wouldn't work, because union adds rows to a dataframe, not columns.
Another way is to use select to add all the new columns at the same time:
from pyspark.sql.functions import lit

df2 = df.select(
    '*',
    lit('somethingA').alias('testA'),
    lit('somethingB').alias('testB'),
    lit('somethingC').alias('testC')
)
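As a side note, each chained withColumn call introduces its own projection in the query plan, so when appending many literal columns a single select like the one above is generally the cheaper option.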

Combining (not a SQL join) 2 Spark dataframes

I have two dataframes that are large; here are small samples:
first
firstnames|lastnames|age
tom|form|24
bob|lip|36
....
second
firstnames|lastnames|age
mary|gu|24
jane|lip|36
...
I would like to take both dataframes and combine them into one that looks like:
firstnames|lastnames|age
tom|form|24
bob|lip|36
mary|gu|24
jane|lip|36
...
Now, I could write them both out and then read them back in together, but that's a huge waste.
If both dataframes are identical in structure, then it's straightforward: union()
df1.union(df2)
If either dataframe is missing a column, you have to add a dummy column to that dataframe at that specific column position, otherwise union will throw a column-mismatch exception. In the example below, column 'c3' is missing from df1, so I add a dummy column to df1 in the last position.
from pyspark.sql.functions import lit
df1.select('c1', 'c2', lit('dummy').alias('c3')).union(df2.select('c1', 'c2', 'c3'))
It's as simple as union, as shown here: https://docs.databricks.com/spark/latest/faq/append-a-row-to-rdd-or-dataframe.html
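On Spark 3.1 and later, unionByName can fill the missing column for you, which avoids the manual dummy column:

df1.unionByName(df2, allowMissingColumns=True)  # missing columns are filled with null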

Iterate Spark data-frame with Hive tables

I have a very large CSV file, so I used Spark to load it into a Spark dataframe.
I need to extract the latitude and longitude from each row of the CSV in order to create a folium map.
With pandas I can solve my problem with a loop:
for index, row in locations.iterrows():
    folium.CircleMarker(location=(row["Pickup_latitude"],
                                  row["Pickup_longitude"]),
                        radius=20,
                        color="#0A8A9F", fill=True).add_to(marker_cluster)
I found that, unlike a pandas dataframe, a Spark dataframe can't be processed by a loop => how to loop through each row of dataFrame in pyspark.
So I thought that I could re-engineer the problem and cut the big data into Hive tables, then iterate over them.
Is it possible to cut the huge Spark dataframe into Hive tables and then iterate over the rows with a loop?
Generally you don't need to iterate over a DataFrame or RDD. You only create transformations (like map) that will be applied to each record, and then call some action to trigger that processing.
You need something like:
dataframe
  .withColumn("latitude", <how to extract latitude>)
  .withColumn("longitude", <how to extract longitude>)
  .select("latitude", "longitude")
  .rdd
  .map(row => <extract values from Row type>)
  .collect() // this will move the data to a local collection
If you can't do it with SQL, you need to do it using the RDD API:
dataframe
  .rdd
  .map(row => <create new row with latitude and longitude>)
  .collect()
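For the folium use case above, a minimal PySpark sketch, assuming the column names from the question and that the collected points fit in driver memory:

import folium
from folium.plugins import MarkerCluster

# Collect only the two needed columns to the driver;
# everything else stays distributed.
points = locations.select("Pickup_latitude", "Pickup_longitude").collect()

m = folium.Map()
marker_cluster = MarkerCluster().add_to(m)
for row in points:
    folium.CircleMarker(location=(row["Pickup_latitude"], row["Pickup_longitude"]),
                        radius=20,
                        color="#0A8A9F", fill=True).add_to(marker_cluster)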

Spark Deduplicate column in dataframe based on column in other dataframe

I am trying to deduplicate values in a Spark dataframe column based on the values in another dataframe column. It seems that withColumn() only works within a single dataframe, and subqueries won't be fully available until version 2.0. I suppose I could try to join the tables, but that seems a bit messy. Here is the general idea:
df.take(1)
[Row(TIMESTAMP='20160531 23:03:33', CLIENT ID=233347, ROI NAME='my_roi', ROI VALUE=1, UNIQUE_ID='173888')]
df_re.take(1)
[Row(UNIQUE_ID='6866144:ST64PSIMT5MB:1')]
Basically, I just want to take the values from df, remove any that are found in df_re, and then return the whole dataframe with the rows containing those duplicates removed. I'm sure I could iterate over each one, but I am wondering if there is a better way.
Any ideas?
The way to do this is to do a left_outer join and then filter for where the right-hand side of the join is empty. Something like:
val df1 = Seq((1, 2), (2, 123), (3, 101)).toDF("uniq_id", "payload")
val df2 = Seq((2, 432)).toDF("uniq_id", "other_data")

df1.as("df1").join(
  df2.as("df2"),
  col("df1.uniq_id") === col("df2.uniq_id"),
  "left_outer"
).filter($"df2.uniq_id".isNull)
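On Spark 2.0 and later, a dedicated anti-join expresses this in one step; a PySpark sketch using the UNIQUE_ID column from the question:

deduped = df.join(df_re, on="UNIQUE_ID", how="left_anti")  # keeps rows of df whose UNIQUE_ID is absent from df_re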

Computing difference between Spark DataFrames

I have two DataFrames, df1 and df2. I want to compute a third DataFrame df3 such that df3 = (df1 - df2), i.e. all elements present in df1 but not in df2. Is there any built-in library function to achieve that, something like df1.subtract(df2)?
You are probably searching for the except function: http://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.sql.DataFrame
From the description:
def except(other: DataFrame): DataFrame
Returns a new DataFrame containing rows in this frame but not in
another frame. This is equivalent to EXCEPT in SQL.
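In PySpark the same operation is exposed as subtract, so the exact call from the question works as written:

df3 = df1.subtract(df2)  # rows in df1 that do not appear in df2 (SQL EXCEPT)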
