I have a dataframe where I have to generate a unique ID in one of the columns. This ID has to be generated with an offset, because I need to persist this dataframe with the autogenerated ID, and when new data comes in the autogenerated IDs must not collide with the existing ones.
I checked monotonically_increasing_id(), but it does not accept an offset.
This is what I tried :
df = df.coalesce(1);
df = df.withColumn(inputCol, functions.monotonically_increasing_id());
But is there a way to make the monotonically_increasing_id() start from a starting offset ?
You can simply add an offset to it to provide a minimum value for the ID. Note that the values are not guaranteed to start exactly at that minimum value.
.withColumn("id", monotonically_increasing_id() + 123)
Explanation: the + operator is overloaded for Column: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L642
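The same idea as a rough PySpark sketch (df is assumed to be the dataframe from the question; 123 is just an illustrative offset):
from pyspark.sql import functions as F

# assumes df already exists; 123 is an arbitrary starting offset
df = df.withColumn("id", F.monotonically_increasing_id() + 123)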
Or, if you don't want to restrict your program to a single partition with df.coalesce(1), you can use zipWithIndex, which starts at index = 0, as follows:
lines = [["a1", "a2", "a3"],
["b1", "b2", "b3"],
["c1", "c2", "c3"]]
cols = ["c1", "c2", "c3"]
df = spark.createDataFrame(lines, cols)
start_indx = 10
# each element is a (Row, index) pair; Python 3 lambdas cannot unpack tuples
df = df.rdd.zipWithIndex() \
    .map(lambda row: (row[1] + start_indx, row[0][0], row[0][1], row[0][2])) \
    .toDF(["id", "c1", "c2", "c3"])
df.show(10, False)
In this case I set start_indx = 10, and this will be the output:
+---+---+---+---+
|id |c1 |c2 |c3 |
+---+---+---+---+
|10 |a1 |a2 |a3 |
|11 |b1 |b2 |b3 |
|12 |c1 |c2 |c3 |
+---+---+---+---+
You could add a row number to your columns and then add that to the maximum existing identity value, or to your offset. Once the id is set, drop the row-number column.
from pyspark.sql import functions as sf
from pyspark.sql.window import Window
# Could also grab the existing max ID value
seed_value = 123
df = df.withColumn("row_number", sf.row_number().over(Window.partitionBy(sf.col("natural_key")).orderBy(sf.col("anything"))))
df = df.withColumn("id", sf.col("row_number") + seed_value)
Remember to drop the row_number attribute.
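If you would rather seed from the data you have already persisted instead of a hard-coded offset, here is a sketch (existing_df is a hypothetical dataframe holding the previously saved rows and their id column):
# existing_df is assumed to be the previously persisted dataframe
max_existing_id = existing_df.agg(sf.max("id")).collect()[0][0] or 0
df = df.withColumn("id", sf.col("row_number") + sf.lit(max_existing_id))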
I have two different dataframes in PySpark, both of String type. The first dataframe contains single words, while the second contains strings of words, i.e. sentences. I have to check whether each word from the first dataframe's column exists in the second dataframe's column. For example,
df2
+---+------+-------+-----------------+
|age|height|   name|        Sentences|
+---+------+-------+-----------------+
| 10|    80|  Alice|   'Grace, Sarah'|
| 15|  null|    Bob|          'Sarah'|
| 12|  null|    Tom|'Amy, Sarah, Bob'|
| 13|  null| Rachel|       'Tom, Bob'|
+---+------+-------+-----------------+
And the dataframe of single words:
df1
+-------+
| token |
+-------+
| 'Ali' |
|'Sarah'|
| 'Bob' |
| 'Bob' |
+-------+
So, how can I search for each token of df1 in the df2 Sentences column? I need a count for each word, added as a new column in df1.
I have tried the solution below, but it only works for a single word, not for a whole dataframe column.
Considering the dataframes from the previous answer:
from pyspark.sql.functions import explode,explode_outer,split, length,trim
df3 = df2.select('Sentences',explode(split('Sentences',',')).alias('friends'))
df3 = df3.withColumn("friends", trim("friends")).withColumn("length_of_friends", length("friends"))
display(df3)
df3 = df3.join(df1, df1.token == df3.friends,how='inner').groupby('friends').count()
display(df3)
You could use a PySpark UDF to create the new column in df1.
The problem is that you cannot access a second dataframe inside a UDF (see here).
As advised in the referenced question, you can collect the sentences into a broadcast variable.
Here is a working example:
from pyspark.sql.types import *
from pyspark.sql.functions import udf
# Instantiate df2
cols = ["age", "height", "name", "Sentences"]
data = [
(10, 80, "Alice", "Grace, Sarah"),
(15, None, "Bob", "Sarah"),
(12, None, "Tom", "Amy, Sarah, Bob"),
(13, None, "Rachel", "Tom, Bob")
]
df2 = spark.createDataFrame(data).toDF(*cols)
# Instantiate df1
cols = ["token"]
data = [
("Ali",),
("Sarah",),
("Bob",),
("Bob",)
]
df1 = spark.createDataFrame(data).toDF(*cols)
# Creating broadcast variable for Sentences column of df2
lstSentences = [data[0] for data in df2.select('Sentences').collect()]
sentences = spark.sparkContext.broadcast(lstSentences)
def countWordInSentence(word):
    # Count the sentences (read from the broadcast variable) that contain the word
    return sum(1 for item in sentences.value if word in item)
func_udf = udf(countWordInSentence, IntegerType())
df1 = df1.withColumn("COUNT", func_udf(df1["token"]))
df1.show()
I have a dataframe with a string column and I want to create multiple columns out of it.
Here is my input data; pagename is the string column.
The format of the string is always the same - col1:value1 col2:value2 col3:value3 ... colN:valueN. In the output I need separate columns, col1 to colN, with the corresponding values as rows for each column.
How can I do this in Spark? Scala or Python is fine for me. The code below creates the input dataframe -
scala> val df = spark.sql(s"""select 1 as id, "a:100 b:500 c:200" as pagename union select 2 as id, "a:101 b:501 c:201" as pagename """)
df: org.apache.spark.sql.DataFrame = [id: int, pagename: string]
scala> df.show(false)
+---+-----------------+
|id |pagename |
+---+-----------------+
|2 |a:101 b:501 c:201|
|1 |a:100 b:500 c:200|
+---+-----------------+
scala> df.printSchema
root
|-- id: integer (nullable = false)
|-- pagename: string (nullable = false)
Note - The example shows only 3 columns here but in general I have more than 100 columns that I expect to deal with.
You can use str_to_map, explode the resulting map and pivot:
val df2 = df.select(
col("id"),
expr("explode(str_to_map(pagename, ' ', ':'))")
).groupBy("id").pivot("key").agg(first("value"))
df2.show
+---+---+---+---+
| id| a| b| c|
+---+---+---+---+
| 1|100|500|200|
| 2|101|501|201|
+---+---+---+---+
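A roughly equivalent PySpark sketch of the same approach (assuming df is the input dataframe from the question):
from pyspark.sql import functions as F

df2 = (df
    .select("id", F.explode(F.expr("str_to_map(pagename, ' ', ':')")))
    .groupBy("id").pivot("key").agg(F.first("value")))
df2.show()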
So two options immediately come to mind
Delimiters
You've got some obvious delimiters that you can split on. For this use the split function
from pyspark.sql import functions as F
delimiter = ":"
df = df.withColumn(
"split_column",
F.split(F.col("pagename"), delimiter)
)
# "split_column" is now an array, so we need to pull items out the array
df = df.withColumn(
"a",
F.col("split_column").getItem(0)
)
Not ideal, as you'll still need to do some string manipulation to remove the whitespace and then do the int conversion - but this is easily applied to multiple columns.
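For example, a rough sketch of applying that clean-up to several columns at once (splitting on the space first and then on the colon; the column names a, b, c are assumed from the question's data):
# parts is an array column like ["a:100", "b:500", "c:200"]
parts = F.split(F.col("pagename"), " ")
for i, name in enumerate(["a", "b", "c"]):
    df = df.withColumn(
        name,
        F.split(parts.getItem(i), ":").getItem(1).cast("int")
    )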
Regex
As the format is pretty fixed, you can do the same thing with a regex.
import re
regex_pattern = r"a\:(\d+) b\:(\d+) c\:(\d+)"
match_groups = ["a", "b", "c"]
for i in range(re.compile(regex_pattern).groups):
df = df.withColumn(
match_groups[i],
F.regexp_extract(F.col("pagename"), regex_pattern, i + 1),
)
CAVEAT: Check that Regex before you try and run anything (as I don't have an editor handy)
I'd like to convert a linear list to a dataframe.
i.e. given the following list,
a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]
Expected result is,
+--------------------+
| col1 | col2 | col3 |
+--------------------+
| a1 | a2 | a3 |
| b1 | b2 | b3 |
| c1 | c2 | c3 |
+--------------------+
I tried the following but got an error.
from pyspark.sql.types import *
a = ["a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"]
rdd = sc.parallelize(a)
schema = StructType([
StructField("a", StringType(), True),
StructField("b", StringType(), True),
StructField("c", StringType(), True)
])
df = sqlContext.createDataFrame(rdd, schema)
df.show()
The last show() statement gets an error "Job aborted due to stage failure".
Could someone tell me the solution?
Thanks.
Based on your comment, I presume that you start with the rdd and not the list.
I further assume that you are determining order based on the index of the rdd. If these assumptions are correct, you can use zipWithIndex() to add a row number to each record.
Then divide the row number by 3 (use integer division) to group every 3 consecutive records. Next use groupByKey() to aggregate the records with the same key into a tuple.
Finally, drop the key and call toDF()
rdd.zipWithIndex()\
.map(lambda row: (row[1]//3, row[0]))\
.groupByKey()\
.map(lambda row: tuple(row[1]))\
.toDF(["a", "b", "c"])\
.show()
#+---+---+---+
#| a| b| c|
#+---+---+---+
#| a1| a2| a3|
#| c1| c2| c3|
#| b1| b2| b3|
#+---+---+---+
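Note that groupByKey does not guarantee ordering - the b and c rows come out swapped above, and strictly speaking the order of values within a group is not guaranteed either. A sketch of a more deterministic variant that carries the index along and sorts on it (same rdd as above):
# key = group of 3 consecutive items, value = (position within group, item)
rdd.zipWithIndex()\
    .map(lambda row: (row[1] // 3, (row[1] % 3, row[0])))\
    .groupByKey()\
    .map(lambda row: tuple(v for _, v in sorted(row[1])))\
    .sortBy(lambda row: row[0])\
    .toDF(["a", "b", "c"])\
    .show()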
Here is a way that should hopefully meet your criteria
from pyspark.sql import functions as f

# First get a 1-column DF
df = spark.createDataFrame(sc.parallelize(a).map(lambda x: [x]), schema=['col'])
# split each value into a number and letter e.g. 'a1' --> ['a','1'])
df = df.withColumn('letter', f.split('col', '').getItem(0))
df = df.withColumn('number', f.split('col', '').getItem(1))
# Now pivot to get what you want (dropping extraneous columns and ordering
# to get the exact output)
output = (df.groupBy('letter')
.pivot('number')
.agg(f.first('col'))
.select([f.col(column).alias('col%s'%(column)) for column in ['1','2','3']])
.orderBy('col1')
.drop('letter'))
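As a quick check (the result should match the expected col1/col2/col3 table from the question):
output.show()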
I have a SQL table containing 40 columns: ID, Product, Product_ID, Date etc. and would like to iterate over all columns to get distinct values.
Customer table (sample):
ID Product
1 gadget
2 VR
2 AR
3 hi-fi
I have tried using dropDuplicates within a function that loops over all columns but the resultant output is only spitting out one distinct value per column instead of all possible distinct values.
Expected Result:
Column Value
ID 1
ID 2
ID 3
Product gadget
Product VR
Product AR
Product hi-fi
Actual Result:
Column Value
ID 1
Product gadget
The idea is to use collect_set() to fetch the distinct elements of each column and then explode the dataframe.
#All columns which need to be aggregated should be added here in col_list.
col_list = ['ID','Product']
exprs = [collect_set(x) for x in col_list]
Let's start aggregating.
from pyspark.sql.functions import lit , collect_set, explode, array, struct, col, substring, length, expr
df = spark.createDataFrame([(1,'gadget'),(2,'VR'),(2,'AR'),(3,'hi-fi')], schema = ['ID','Product'])
df = df.withColumn('Dummy',lit('Dummy'))
#While exploding later, the datatypes must be the same, so we have to cast ID as a String.
df = df.withColumn('ID',col('ID').cast('string'))
#Creating the list of distinct values.
df = df.groupby("Dummy").agg(*exprs)
df.show(truncate=False)
+-----+---------------+-----------------------+
|Dummy|collect_set(ID)|collect_set(Product) |
+-----+---------------+-----------------------+
|Dummy|[3, 1, 2] |[AR, VR, hi-fi, gadget]|
+-----+---------------+-----------------------+
def to_transpose(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"
    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
        struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")
    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df = to_transpose(df, ['Dummy']).drop('Dummy')
df.show()
+--------------------+--------------------+
| key| val|
+--------------------+--------------------+
| collect_set(ID)| [3, 1, 2]|
|collect_set(Product)|[AR, VR, hi-fi, g...|
+--------------------+--------------------+
df = df.withColumn('val', explode(col('val')))
df = df.withColumnRenamed('key', 'Column').withColumnRenamed('val', 'Value')
df = df.withColumn('Column', expr("substring(Column,13,length(Column)-13)"))
df.show()
+-------+------+
| Column| Value|
+-------+------+
| ID| 3|
| ID| 1|
| ID| 2|
|Product| AR|
|Product| VR|
|Product| hi-fi|
|Product|gadget|
+-------+------+
Note: all columns which are not strings should be converted to String, e.g. df = df.withColumn('ID', col('ID').cast('string')). Otherwise, you will get an error.
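A small sketch of applying that cast to every non-string column up front (before the groupby step), assuming the same df and imports as above:
# cast every non-string column to string so the collected sets have matching types
for c, dtype in df.dtypes:
    if dtype != 'string':
        df = df.withColumn(c, col(c).cast('string'))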
I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would expect the common columns to be dropped. Is there something additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this:
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you can access these using the parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// For now, aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or a single string:
val usingColumn: String = ???
a.join(b, usingColumn)
which keeps only one copy of the columns used in the join condition.
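For reference, the PySpark equivalent of this shortcut takes a list of column names; a quick sketch applied to the question's df1 and df2:
# keeps a single copy of ts and id
df_combined = df1.join(df2, ["ts", "id"], "inner")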
I have been stuck with this for a while, and only recently I came up with a solution that is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the value in dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do a fullouter join, then the result looks like this:
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
Try this:
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is normal SQL behavior. What I am doing for this:
Drop or Rename source columns
Do the join
Drop renamed column if any
Here I am replacing the "fullname" column:
Some code in Java:
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
.drop("fullname")
.registerTempTable("data_original");
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
.registerTempTable("data_v2");
this
.sqlContext
.sql(etlQuery)
.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.parquet(outputPath);
Where the query is:
SELECT
d.*,
concat_ws('_', product_name, product_module, name) AS fullname
FROM
{table_source} d
LEFT OUTER JOIN
{table_updates} u ON u.id = d.id
Dropping a column from the dataframe like this before the join is something I believe you can only do with Spark - very helpful!
Inner join is the default join in Spark. Below is the simple syntax for it.
leftDF.join(rightDF, "commonColName")
For other joins you can follow the syntax below (the join type is a separate argument, not part of the Seq):
leftDF.join(rightDF, Seq("commonCol1", "commonCol2"), "joinType")
If the column names are not common, then:
leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "joinType")
Best practice is to make the column names different in both DataFrames before joining them, and then drop the extra column afterwards.
df1.columns = [id, age, income]
df2.columns = [id, age_group]
df1.join(df2, on=df1.id == df2.id, how='inner').write.saveAsTable('table_name')
will return an error for the duplicate columns.
Try this instead:
df2_id_renamed = df2.withColumnRenamed('id', 'id_2')
df1.join(df2_id_renamed, on=df1.id == df2_id_renamed.id_2, how='inner').drop('id_2')
If anyone is using Spark SQL and wants to achieve the same thing, you can use the USING clause in the join query.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.write.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
    newcols = []
    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break
    return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id2', 'Date', 'Description2'].
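If you would rather drop the duplicated columns instead of renaming them, here is a sketch reusing the same join (dropping the right-hand table's copies by column reference):
# drop the Dates copies of the duplicated columns after the join
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner").drop(Dates.Id).drop(Dates.Description)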
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.