personalized monotonically_increasing_id spark [duplicate] - apache-spark

This question already has answers here:
Concatenate columns in Apache Spark DataFrame
(18 answers)
How to add a constant column in a Spark DataFrame?
(3 answers)
Closed 4 years ago.
I have several dataframes and I want to uniquely identify each row in each dataframe. Hence I want to use personalized IDs.
I am using the monotonically_increasing_id() built-in function in spark as follows:
import org.apache.spark.sql.functions._
val dfWithId = trzuCom.withColumn("UniqueID", monotonically_increasing_id)
The problem is when I try to personalize it as follows:
val dfWithId = trzuCom.withColumn("UniqueID", "TB1_" + monotonically_increasing_id)
I get errors.
Actually I want to have TB1_ID for dataframe 1, TB2_ID for dataframe 2, and so on. Any idea how to do this, please?
Best Regards
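One way to build such a prefixed ID (a minimal sketch, assuming concat and lit from org.apache.spark.sql.functions; the "TB1_" prefix is only an example) is to concatenate a literal with the generated id cast to a string:
import org.apache.spark.sql.functions.{concat, lit, monotonically_increasing_id}

// Prefix the generated id with a table-specific tag ("TB1_" here is just an example)
val dfWithId = trzuCom.withColumn(
  "UniqueID",
  concat(lit("TB1_"), monotonically_increasing_id().cast("string"))
)
The same pattern with lit("TB2_"), lit("TB3_") and so on gives each dataframe its own prefix.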

Related

drop all rows that contain even one alphabet in pyspark [duplicate]

This question already has answers here:
How to check if a string column in pyspark dataframe is all numeric
(7 answers)
Closed last year.
pyspark 2.3.1
The values in col1 should only contain integers. I am trying to filter out any row that has even one alphabetic character. How can I do this in pyspark?
I've tried
df.select('col1').filter(df.col1.rlike('^[a-zA-Z]'))
however, rows that contain alphabetic characters also contain integers, so they are not filtered out.
How can I do this?
You can try selecting only the purely numeric rows:
df = df.filter('col1 rlike "^[0-9]+$"')
df.show(truncate=False)

Combine ‘n’ data files to make a single Spark Dataframe [duplicate]

This question already has answers here:
How to perform union on two DataFrames with different amounts of columns in Spark?
(22 answers)
Closed 4 years ago.
I have ‘n’ delimited data sets, maybe CSVs, but one of them might have a few extra columns. I am trying to read all of them as dataframes and put them into one. How can I merge them with a unionAll and make them a single dataframe?
P.S.: I can do this when I know what ‘n’ is. And it’s a simple unionAll when the column counts are equal.
There is another approach, besides the solutions mentioned in the first two comments, sketched in the code below.
Read all CSV files into a single RDD, producing RDD[String].
Map to create RDD[Row] with the appropriate length, filling missing values with null or any suitable values.
Create the DataFrame schema.
Create the DataFrame from the RDD[Row] using the created schema.
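A rough sketch of these steps (the file paths, delimiter, and column names below are only placeholders for illustration; spark is an existing SparkSession):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumed inputs: the list of CSV paths and the superset of column names
val paths = Seq("data/part1.csv", "data/part2.csv")
val allColumns = Seq("c1", "c2", "c3", "c4")

// 1. Read all files into a single RDD[String] (header handling is skipped here)
val lines = spark.sparkContext.textFile(paths.mkString(","))

// 2. Split each line and pad missing trailing columns with null
val rows = lines.map { line =>
  val fields = line.split(",", -1)
  val padded = fields ++ Array.fill[String](allColumns.length - fields.length)(null)
  Row.fromSeq(padded.toSeq)
}

// 3. Build the schema (all columns as nullable strings for simplicity)
val schema = StructType(allColumns.map(c => StructField(c, StringType, nullable = true)))

// 4. Create the DataFrame from RDD[Row] and the schema
val df = spark.createDataFrame(rows, schema)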
This may not be a good approach if the CSVs have a large number of columns.
Hope this helps

How to overwrite multiple partitions in HIVE [duplicate]

This question already has answers here:
Overwrite specific partitions in spark dataframe write method
(14 answers)
Overwrite only some partitions in a partitioned spark Dataset
(3 answers)
Closed 4 years ago.
I have a large table in which I would like to overwrite certain top-level partitions. For example, I have a table which is partitioned by year and month, and I would like to overwrite the partitions from, say, year 2000 to 2018.
How can I do that?
Note: I would not like to delete the previous table and overwrite the entire table with new data.
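One common way to do this from Spark (a sketch assuming Spark 2.3 or later and that the new data is written from a DataFrame; the table and DataFrame names are placeholders) is dynamic partition overwrite, which replaces only the partitions present in the incoming data:
// Replace only the partitions contained in the DataFrame being written (Spark 2.3+)
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// newData holds only the rows for the years/months that should be overwritten
newData.write.mode("overwrite").insertInto("my_db.my_partitioned_table")
With this setting, partitions outside the written range are left untouched, so the rest of the table does not need to be dropped or rewritten.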

How to use Spark dataset GroupBy() [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I have a Hive table with the schema:
id bigint
name string
updated_dt bigint
There are many records having the same id, but different name and updated_dt. For each id, I want to return the record (whole row) with the largest updated_dt.
My current approach is:
After reading data from Hive, I can use a case class to convert the data to an RDD, then use groupBy() to group all the records with the same id together, and later pick the one with the largest updated_dt. Something like:
dataRdd.groupBy(_.id).map(x => x._2.toSeq.maxBy(_.updated_dt))
However, since I use Spark 2.1, the data is first converted to a Dataset using the case class, and then the above approach converts it to an RDD in order to use groupBy(). There may be some overhead in converting the Dataset to an RDD. So I was wondering if I can achieve this at the Dataset level without converting to an RDD?
Thanks a lot
Here is how you can do it using Dataset:
import org.apache.spark.sql.functions.max
data.groupBy($"id").agg(max($"updated_dt") as "Max")
There is not much overhead if you convert it to an RDD. If you choose to use the RDD, it can be further optimized by using .reduceByKey() instead of .groupBy():
dataRdd.keyBy(_.id).reduceByKey((a,b) => if(a.updated_dt > b.updated_dt) a else b).values
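If you want the whole row while staying at the Dataset level, one option (a sketch, assuming a case class matching the Hive schema and spark.implicits._ already imported) is groupByKey followed by reduceGroups:
case class Record(id: Long, name: String, updated_dt: Long)

// For each id, keep the full row with the largest updated_dt, without leaving the Dataset API
val latest = data.as[Record]
  .groupByKey(_.id)
  .reduceGroups((a, b) => if (a.updated_dt > b.updated_dt) a else b)
  .map(_._2)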

merge set type columns using spark sql [duplicate]

This question already has answers here:
Array Intersection in Spark SQL
(2 answers)
Closed 5 years ago.
I have two datasets with columns that have the type of a set (for example, a column generated by the collect_set function)
I want to merge them in some join ... i.e. something like:
SELECT
...
SOME_MERGE_FUNCTION(x.x_set, y.y_set) as unioned_set
FROM x LEFT OUTER JOIN y ON ...
Is there a function like SOME_MERGE_FUNCTION in Spark SQL which will basically create the union of x_set and y_set?
First and foremost, there is no such thing as a set column. collect_set, like collect_list, returns an ArrayType column.
Also, there is no built-in function for set union or intersection. The best you can do is to use a UserDefinedFunction, for example like the one shown in Array Intersection in Spark SQL.
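A sketch of such a UDF for the union case (the element type String and the name mergeSets are assumptions; null inputs from the outer join are treated as empty sets):
import org.apache.spark.sql.functions.udf

// Union two array columns and drop duplicates; nulls become empty sequences
val mergeSets = udf((a: Seq[String], b: Seq[String]) =>
  (Option(a).getOrElse(Seq.empty[String]) ++ Option(b).getOrElse(Seq.empty[String])).distinct)

// Register it so it can be used as SOME_MERGE_FUNCTION in the SQL above
spark.udf.register("SOME_MERGE_FUNCTION", mergeSets)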
