update a dataframe column with new values

update a dataframe column with new values - apache-spark

df1 has fields id and json; df2 has fields idand json
df1.count() => 1200; df2.count() => 20
df1 has all the rows. df2 has an incremental update with just 20 rows.
My goal is to update df1 with the values from df2. All the ids of df2 are in df1. But df2 has updated values(in the json field) for those same ids.
Resulting df should have all the values from df1 and updated values from df2.
What is the best way to do this? - With the least number of joins and filters.
Thanks!

You can achieve this using one left join.
Create Example DataFrames
Using the sample data provided by #Shankar Koirala in his answer.
data1 = [
(1, "a"),
(2, "b"),
(3, "c")
]
df1 = sqlCtx.createDataFrame(data1, ["id", "value"])
data2 = [
(1, "x"),
(2, "y")
]
df2 = sqlCtx.createDataFrame(data2, ["id", "value"])
Do a left join
Join the two DataFrames using a left join on the id column. This will keep all of the rows in the left DataFrame. For the rows in the right DataFrame that don't have a matching id, the value will be null.
import pyspark.sql.functions as f
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
.select(
'id',
f.col('l.value').alias('left_value'),
f.col('r.value').alias('right_value')
)\
.show()
#+---+----------+-----------+
#| id|left_value|right_value|
#+---+----------+-----------+
#| 1| a| x|
#| 3| c| null|
#| 2| b| y|
#+---+----------+-----------+
Select the desired data
We will use the fact that the unmatched ids have a null to select the final columns. Use pyspark.sql.functions.when() to use the right value if it is not null, otherwise keep the left value.
df1.alias('l').join(df2.alias('r'), on='id', how='left')\
.select(
'id',
f.when(
~f.isnull(f.col('r.value')),
f.col('r.value')
).otherwise(f.col('l.value')).alias('value')
)\
.show()
#+---+-----+
#| id|value|
#+---+-----+
#| 1| x|
#| 3| c|
#| 2| y|
#+---+-----+
You can sort this output if you want the ids in order.
Using pyspark-sql
You can do the same thing using a pyspark-sql query:
df1.registerTempTable('df1')
df2.registerTempTable('df2')
query = """SELECT l.id,
CASE WHEN r.value IS NOT NULL THEN r.value ELSE l.value END AS value
FROM df1 l LEFT JOIN df2 r ON l.id = r.id"""
sqlCtx.sql(query.replace("\n", "")).show()
#+---+-----+
#| id|value|
#+---+-----+
#| 1| x|
#| 3| c|
#| 2| y|
#+---+-----+

I would like to provide a slightly more general solution. What happens if the input data has 100 columns instead of 2? We would spend too much time making a coalesce of those 100 columns to keep the values on the right side of the left join.
Another way to solve this problem would be to "delete" the updated rows from the original df and finally make a union with the updated rows.
data_orginal = spark.createDataFrame([
(1, "a"),
(2, "b"),
(3, "c")
], ("id", "value"))
data_updated = spark.createDataFrame([
(1, "x"),
(2, "y")
], ("id", "value"))
data_orginal.show()
+---+-----+
| id|value|
+---+-----+
| 1| a|
| 2| b|
| 3| c|
+---+-----+
data_updated.show()
+---+-----+
| id|value|
+---+-----+
| 1| x|
| 2| y|
+---+-----+
data_orginal.createOrReplaceTempView("data_orginal")
data_updated.createOrReplaceTempView("data_updated")
src_data_except_updated = spark.sql(f"SELECT * FROM data_orginal WHERE id not in (1,2)")
result_data = src_data_except_updated.union(data_updated)
result_data.show()
+---+-----+
| id|value|
+---+-----+
| 3| c|
| 1| x|
| 2| y|
+---+-----+
Notice that the query
SELECT * FROM data_orginal WHERE id not in (1,2)
could be generated automatically:
ids_collect = spark.sql(f"SELECT id FROM data_updated").collect()
ids_list = [f"{x.id}" for x in ids_collect]
ids_str = ",".join(ids_list)
query_get_all_except = f"SELECT * FROM data_original WHERE id not in ({ids_str})"

Related

delete duplicate records based on other column pyspark

I have a data frame in pyspark like below.
df.show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 1| N|
| 2| Y|
| 3| N|
+---+----+
I want to delete a record when there is a duplicate id and the test is N
Now when I query the new_df
new_df.show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 2| Y|
| 3| N|
+---+----+
I am unable to figure out the use case.
I have done groupby on the id count but it gives only the id column and count.
I have done like below.
grouped_df = new_df.groupBy("id").count()
How can I achieve my desired result
edit
I have a data frame like below.
+-------------+--------------------+--------------------+
| sn| device| attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A| Android Phone| N|
|4MY16A5W02DE8| Android Phone| N|
|4MY16A5W02DE8| Android Phone| Y|
|4VT1735J00337| TV| N|
|4VT1735J00337| TV| Y|
|4VT47B52003EE| Router| N|
|4VT47C5N00A10| Other| N|
+-------------+--------------------+--------------------+
When I done like below
new_df = df.groupBy("sn").agg(max("attribute").alias("attribute"))
I am getting str has no attribute alias error
The expected result should be like below
+-------------+--------------------+--------------------+
| sn| device| attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A| Android Phone| N|
|4MY16A5W02DE8| Android Phone| Y|
|4VT1735J00337| TV| Y|
|4VT47B52003EE| Router| N|
|4VT47C5N00A10| Other| N|
+-------------+--------------------+--------------------+

Not the most generic solution but should fit here nicely:
from pyspark.sql.functions import max
df = spark.createDataFrame(
[(1, "Y"), (1, "N"), (2, "Y"), (3, "N")], ("id", "test")
)
df.groupBy("id").agg(max("test").alias("test")).show()
# +---+----+
# | id|test|
# +---+----+
# | 1| Y|
# | 3| N|
# | 2| Y|
# +---+----+
More generic one:
from pyspark.sql.functions import col, count, when
test = when(count(when(col("test") == "Y", "Y")) > 0, "Y").otherwise("N")
df.groupBy("id").agg(test.alias("test")).show()
# +---+----+
# | id|test|
# +---+----+
# | 1| Y|
# | 3| N|
# | 2| Y|
# +---+----+
which can be generalized to accommodate more classes and non-trivial ordering, for example if you had three classes Y, ?, N evaluated in this order, you could:
(when(count(when(col("test") == "Y", True)) > 0, "Y")
.when(count(when(col("test") == "?", True)) > 0, "?")
.otherwise("N"))
If there are other columns you need to preserve these methods won't work, and you'll need something like shown in Find maximum row per group in Spark DataFrame

Another option using row_number:
df.selectExpr(
'*',
'row_number() over (partition by id order by test desc) as rn'
).filter('rn=1 or test="Y"').drop('rn').show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 3| N|
| 2| Y|
+---+----+
This method doesn't aggregate but only remove duplicated ids when test is N

Using Spark SQL temporary tables, I used Databricks Notebook
case class T(id:Int,test:String)
val df=spark.createDataset(Seq(T(1, "Y"), T(1, "N"), T(2, "Y"), T(3, "N")))
df.createOrReplaceTempView("df")
%sql select id, max(test) from df group by id

You can use the below code:
#register as temp table
df.registerTempTable("df")
#create single rows
newDF = sqlc.sql(WITH dfCte AS
(
select *,row_number() over (partition by id order by test desc) as RowNumber
from df
)
select * from dfCte where RowNumber =1)
#drop row numbers and show the newdf
newDF.drop('RowNumber').show()

Adding a group count column to a PySpark dataframe

I am coming from R and the tidyverse to PySpark due to its superior Spark handling, and I am struggling to map certain concepts from one context to the other.
In particular, suppose that I had a dataset like the following
x | y
--+--
a | 5
a | 8
a | 7
b | 1
and I wanted to add a column containing the number of rows for each x value, like so:
x | y | n
--+---+---
a | 5 | 3
a | 8 | 3
a | 7 | 3
b | 1 | 1
In dplyr, I would just say:
import(tidyverse)
df <- read_csv("...")
df %>%
group_by(x) %>%
mutate(n = n()) %>%
ungroup()
and that would be that. I can do something almost as simple in PySpark if I'm looking to summarize by number of rows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
spark.read.csv("...") \
.groupBy(col("x")) \
.count() \
.show()
And I thought I understood that withColumn was equivalent to dplyr's mutate. However, when I do the following, PySpark tells me that withColumn is not defined for groupBy data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count
spark = SparkSession.builder.getOrCreate()
spark.read.csv("...") \
.groupBy(col("x")) \
.withColumn("n", count("x")) \
.show()
In the short run, I can simply create a second dataframe containing the counts and join it to the original dataframe. However, it seems like this could become inefficient in the case of large tables. What is the canonical way to accomplish this?

When you do a groupBy(), you have to specify the aggregation before you can display the results. For example:
import pyspark.sql.functions as f
data = [
('a', 5),
('a', 8),
('a', 7),
('b', 1),
]
df = sqlCtx.createDataFrame(data, ["x", "y"])
df.groupBy('x').count().select('x', f.col('count').alias('n')).show()
#+---+---+
#| x| n|
#+---+---+
#| b| 1|
#| a| 3|
#+---+---+
Here I used alias() to rename the column. But this only returns one row per group. If you want all rows with the count appended, you can do this with a Window:
from pyspark.sql import Window
w = Window.partitionBy('x')
df.select('x', 'y', f.count('x').over(w).alias('n')).sort('x', 'y').show()
#+---+---+---+
#| x| y| n|
#+---+---+---+
#| a| 5| 3|
#| a| 7| 3|
#| a| 8| 3|
#| b| 1| 1|
#+---+---+---+
Or if you're more comfortable with SQL, you can register the dataframe as a temporary table and take advantage of pyspark-sql to do the same thing:
df.registerTempTable('table')
sqlCtx.sql(
'SELECT x, y, COUNT(x) OVER (PARTITION BY x) AS n FROM table ORDER BY x, y'
).show()
#+---+---+---+
#| x| y| n|
#+---+---+---+
#| a| 5| 3|
#| a| 7| 3|
#| a| 8| 3|
#| b| 1| 1|
#+---+---+---+

as #pault appendix
import pyspark.sql.functions as F
...
(df
.groupBy(F.col('x'))
.agg(F.count('x').alias('n'))
.show())
#+---+---+
#| x| n|
#+---+---+
#| b| 1|
#| a| 3|
#+---+---+
enjoy

I found we can get even more close to the tidyverse example:
from pyspark.sql import Window
w = Window.partitionBy('x')
df.withColumn('n', f.count('x').over(w)).sort('x', 'y').show()

Explode array data into rows in spark [duplicate]

This question already has answers here:
Dividing complex rows of dataframe to simple rows in Pyspark
(3 answers)
Closed 5 years ago.
I have a dataset in the following way:
FieldA FieldB ArrayField
1 A {1,2,3}
2 B {3,5}
I would like to explode the data on ArrayField so the output will look in the following way:
FieldA FieldB ExplodedField
1 A 1
1 A 2
1 A 3
2 B 3
2 B 5
I mean I want to generate an output line for each item in the array the in ArrayField while keeping the values of the other fields.
How would you implement it in Spark.
Notice that the input dataset is very large.

The explode function should get that done.
pyspark version:
>>> df = spark.createDataFrame([(1, "A", [1,2,3]), (2, "B", [3,5])],["col1", "col2", "col3"])
>>> from pyspark.sql.functions import explode
>>> df.withColumn("col3", explode(df.col3)).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| A| 1|
| 1| A| 2|
| 1| A| 3|
| 2| B| 3|
| 2| B| 5|
+----+----+----+
Scala version
scala> val df = Seq((1, "A", Seq(1,2,3)), (2, "B", Seq(3,5))).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: string ... 1 more field]
scala> df.withColumn("col3", explode($"col3")).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| A| 1|
| 1| A| 2|
| 1| A| 3|
| 2| B| 3|
| 2| B| 5|
+----+----+----+

You can use explode function
Below is the simple example for your case
import org.apache.spark.sql.functions._
import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
(1, "A", List(1,2,3)),
(2, "B", List(3, 5))
)).toDF("FieldA", "FieldB", "FieldC")
data.withColumn("ExplodedField", explode($"FieldC")).drop("FieldC")
Hope this helps!

explode does exactly what you want. Docs:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
Also, here is an example from a different question using it:
https://stackoverflow.com/a/44418598/1461187

How to get the min of each row in PySpark DataFrame [duplicate]

I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n) and my task is choose the column with max values in it.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Ouput :
col_4 = max(col1, col_2, col_3) = [3,2,5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark or should I change convert my PySpark df to Pandas df and then perform the operations?

You can reduce using SQL expressions over a list of columns:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce
def row_max(*cols):
return reduce(
lambda x, y: when(x > y, x).otherwise(y),
[col(c) if isinstance(c, str) else c for c in cols]
)
df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
.toDF(["a", "b", "c"]))
df.select(row_max("a", "b", "c").alias("max")))
Spark 1.5+ also provides least, greatest
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
If you want to keep name of the max you can use `structs:
from pyspark.sql.functions import struct, lit
def row_max_with_name(*cols):
cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))
maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally you can use above to find select "top" column:
from pyspark.sql.functions import max
((_, c), ) = (maxs
.groupBy(col("maxs")["col"].alias("col"))
.count()
.agg(max(struct(col("count"), col("col"))))
.first())
df.select(c)

We can use greatest
Creating DataFrame
df = spark.createDataFrame(
[[1,2,3], [2,1,2], [3,4,5]],
['col_1','col_2','col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| 1| 2| 3|
| 2| 1| 2|
| 3| 4| 5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
| 1| 2| 3| 3|
| 2| 1| 2| 2|
| 3| 4| 5| 5|
+-----+-----+-----+-----------+

You can also use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))

Another simple way of doing it. Let us say that the below df is your dataframe
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10| 1|
|200| 2| 20|
| 3| 30|300|
|400| 40| 4|
+---+---+---+
You can process the above df as below to get the desited results
from pyspark.sql.functions import lit, min
df.select( lit('c1').alias('cn1'), min(df.c1).alias('c1'),
lit('c2').alias('cn2'), min(df.c2).alias('c2'),
lit('c3').alias('cn3'), min(df.c3).alias('c3')
)\
.rdd.flatMap(lambda r: [ (r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
.toDF(['Columnn', 'Min']).show()
+-------+---+
|Columnn|Min|
+-------+---+
| c1| 3|
| c2| 2|
| c3| 1|
+-------+---+

Scala solution:
df = sc.parallelize(Seq((10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3"))
df.rdd.map(row=>List[String](row(0).toString,row(1).toString,row(2).toString)).map(x=>(x(0),x(1),x(2),x.min)).toDF("c1","c2","c3","min").show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10| 1| 1|
|200| 2| 20| 2|
| 3| 30|300| 3|
|400| 40| 4| 4|
+---+---+---+---+

Split Spark dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very simple:
rdd.map(lambda row: row + [row.my_str_col.split('-')])
which takes something looking like:
col1 | my_str_col
-----+-----------
18 | 856-yygrm
201 | 777-psgdg
and converts it to this:
col1 | my_str_col | _col3 | _col4
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am aware of pyspark.sql.functions.split(), but it results in a nested array column instead of two top-level columns like I want.
Ideally, I want these new columns to be named as well.

pyspark.sql.functions.split() is the right approach here - you simply need to flatten the nested ArrayType column into multiple top-level columns. In this case, where each array only contains 2 items, it's very easy. You simply use Column.getItem() to retrieve each part of the array as a column itself:
split_col = pyspark.sql.functions.split(df['my_str_col'], '-')
df = df.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('NAME2', split_col.getItem(1))
The result will be:
col1 | my_str_col | NAME1 | NAME2
-----+------------+-------+------
18 | 856-yygrm | 856 | yygrm
201 | 777-psgdg | 777 | psgdg
I am not sure how I would solve this in a general case where the nested arrays were not the same size from Row to Row.

Here's a solution to the general case that doesn't involve needing to know the length of the array ahead of time, using collect, or using udfs. Unfortunately this only works for spark version 2.1 and above, because it requires the posexplode function.
Suppose you had the following DataFrame:
df = spark.createDataFrame(
[
[1, 'A, B, C, D'],
[2, 'E, F, G'],
[3, 'H, I'],
[4, 'J']
]
, ["num", "letters"]
)
df.show()
#+---+----------+
#|num| letters|
#+---+----------+
#| 1|A, B, C, D|
#| 2| E, F, G|
#| 3| H, I|
#| 4| J|
#+---+----------+
Split the letters column and then use posexplode to explode the resultant array along with the position in the array. Next use pyspark.sql.functions.expr to grab the element at index pos in this array.
import pyspark.sql.functions as f
df.select(
"num",
f.split("letters", ", ").alias("letters"),
f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.show()
#+---+------------+---+---+
#|num| letters|pos|val|
#+---+------------+---+---+
#| 1|[A, B, C, D]| 0| A|
#| 1|[A, B, C, D]| 1| B|
#| 1|[A, B, C, D]| 2| C|
#| 1|[A, B, C, D]| 3| D|
#| 2| [E, F, G]| 0| E|
#| 2| [E, F, G]| 1| F|
#| 2| [E, F, G]| 2| G|
#| 3| [H, I]| 0| H|
#| 3| [H, I]| 1| I|
#| 4| [J]| 0| J|
#+---+------------+---+---+
Now we create two new columns from this result. First one is the name of our new column, which will be a concatenation of letter and the index in the array. The second column will be the value at the corresponding index in the array. We get the latter by exploiting the functionality of pyspark.sql.functions.expr which allows us use column values as parameters.
df.select(
"num",
f.split("letters", ", ").alias("letters"),
f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.drop("val")\
.select(
"num",
f.concat(f.lit("letter"),f.col("pos").cast("string")).alias("name"),
f.expr("letters[pos]").alias("val")
)\
.show()
#+---+-------+---+
#|num| name|val|
#+---+-------+---+
#| 1|letter0| A|
#| 1|letter1| B|
#| 1|letter2| C|
#| 1|letter3| D|
#| 2|letter0| E|
#| 2|letter1| F|
#| 2|letter2| G|
#| 3|letter0| H|
#| 3|letter1| I|
#| 4|letter0| J|
#+---+-------+---+
Now we can just groupBy the num and pivot the DataFrame. Putting that all together, we get:
df.select(
"num",
f.split("letters", ", ").alias("letters"),
f.posexplode(f.split("letters", ", ")).alias("pos", "val")
)\
.drop("val")\
.select(
"num",
f.concat(f.lit("letter"),f.col("pos").cast("string")).alias("name"),
f.expr("letters[pos]").alias("val")
)\
.groupBy("num").pivot("name").agg(f.first("val"))\
.show()
#+---+-------+-------+-------+-------+
#|num|letter0|letter1|letter2|letter3|
#+---+-------+-------+-------+-------+
#| 1| A| B| C| D|
#| 3| H| I| null| null|
#| 2| E| F| G| null|
#| 4| J| null| null| null|
#+---+-------+-------+-------+-------+

Here's another approach, in case you want split a string with a delimiter.
import pyspark.sql.functions as f
df = spark.createDataFrame([("1:a:2001",),("2:b:2002",),("3:c:2003",)],["value"])
df.show()
+--------+
| value|
+--------+
|1:a:2001|
|2:b:2002|
|3:c:2003|
+--------+
df_split = df.select(f.split(df.value,":")).rdd.flatMap(
lambda x: x).toDF(schema=["col1","col2","col3"])
df_split.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
I don't think this transition back and forth to RDDs is going to slow you down...
Also don't worry about last schema specification: it's optional, you can avoid it generalizing the solution to data with unknown column size.

I understand your pain. Using split() can work, but can also lead to breaks.
Let's take your df and make a slight change to it:
df = spark.createDataFrame([('1:"a:3":2001',),('2:"b":2002',),('3:"c":2003',)],["value"])
df.show()
+------------+
| value|
+------------+
|1:"a:3":2001|
| 2:"b":2002|
| 3:"c":2003|
+------------+
If you try to apply split() to this as outlined above:
df_split = df.select(split(df.value,":")).rdd.flatMap(
lambda x: x).toDF(schema=["col1","col2","col3"]).show()
you will get
IllegalStateException: Input row doesn't have expected number of values required by the schema. 4 fields are required while 3 values are provided.
So, is there a more elegant way of addressing this? I was so happy to have it pointed out to me. pyspark.sql.functions.from_csv() is your friend.
Taking my above example df:
from pyspark.sql.functions import from_csv
# Define a column schema to apply with from_csv()
col_schema = ["col1 INTEGER","col2 STRING","col3 INTEGER"]
schema_str = ",".join(col_schema)
# define the separator because it isn't a ','
options = {'sep': ":"}
# create a df from the value column using schema and options
df_csv = df.select(from_csv(df.value, schema_str, options).alias("value_parsed"))
df_csv.show()
+--------------+
| value_parsed|
+--------------+
|[1, a:3, 2001]|
| [2, b, 2002]|
| [3, c, 2003]|
+--------------+
Then we can easily flatten the df to put the values in columns:
df2 = df_csv.select("value_parsed.*").toDF("col1","col2","col3")
df2.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a:3|2001|
| 2| b|2002|
| 3| c|2003|
+----+----+----+
No breaks. Data correctly parsed. Life is good. Have a beer.

Instead of Column.getItem(i) we can use Column[i].
Also, enumerate is useful in big dataframes.
from pyspark.sql import functions as F
Keep parent column:
for i, c in enumerate(['new_1', 'new_2']):
df = df.withColumn(c, F.split('my_str_col', '-')[i])
or
new_cols = ['new_1', 'new_2']
df = df.select('*', *[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)])
Replace parent column:
for i, c in enumerate(['new_1', 'new_2']):
df = df.withColumn(c, F.split('my_str_col', '-')[i])
df = df.drop('my_str_col')
or
new_cols = ['new_1', 'new_2']
df = df.select(
*[c for c in df.columns if c != 'my_str_col'],
*[F.split('my_str_col', '-')[i].alias(c) for i, c in enumerate(new_cols)]
)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

update a dataframe column with new values - apache-spark

Related

delete duplicate records based on other column pyspark

Adding a group count column to a PySpark dataframe

Explode array data into rows in spark [duplicate]

How to get the min of each row in PySpark DataFrame [duplicate]

Split Spark dataframe string column into multiple columns

Categories

Resources