I have a dataframe that consists of 5 columns. I need to add a new column at the 3rd position. How can I achieve this in Spark?
df.show()
+---------+--------+---+----------+--------+
|last_name|position|age|salary_inc| segment|
+---------+--------+---+----------+--------+
| george| IT| 10| 2313| one|
| jhon| non-it| 21| 34344| null|
| mark| IT| 11| 16161| third|
| spencer| it| 31| 2322| null|
| spencer| non-it| 41| 2322|Valuable|
+---------+--------+---+----------+--------+
Add new_column at position 3
+---------+--------+-----------+---+----------+--------+
|last_name|position|new_column |age|salary_inc| segment|
+---------+--------+-----------+---+----------+--------+
Can you please help me with this?
(
    df.withColumn("new_column", ...)
      .select("last_name",
              "position",
              "new_column",
              ...)
      .show()
)
The first ellipsis indicates whatever you're creating in your new column called "new_column"; for example, lit(1) would give you the literal (constant) 1 of type IntegerType. The second ellipsis indicates the remaining columns, in the order you wish to select them.
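A minimal sketch against the example dataframe above, using lit(1) purely as a placeholder for whatever you actually want in new_column:
from pyspark.sql.functions import lit

(
    df.withColumn("new_column", lit(1))
      .select("last_name",
              "position",
              "new_column",
              "age",
              "salary_inc",
              "segment")
      .show()
)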
Suppose I have the dataframe below:
+----+-------+------+-----+-------+------+-----+----------------+
|s.id|s.first|s.last|s.age|d.first|d.last|d.age|  UPDATED_FIELDS|
+----+-------+------+-----+-------+------+-----+----------------+
|   1|    AAA|   BBB|   10|  AAA__|   BBB|   10|       ["first"]|
|   2|    CCC|   DDD|   20|  CCC__|   DDD|   21|["first", "age"]|
+----+-------+------+-----+-------+------+-----+----------------+
I want to transform it to the format below, so for each of the UPDATED_FIELDS in the first dataframe, I want to create a new row in my second dataframe.
+---+-----+-------+-------+
| id|field|s_value|d_value|
+---+-----+-------+-------+
|  1|first|    AAA|  AAA__|
|  2|first|    CCC|  CCC__|
|  2|  age|     20|     21|
+---+-----+-------+-------+
I feel like I need to create a new dataframe, but I couldn't get it working.
This isn't straightforward because after you explode "UPDATED_FIELDS" to create the "field" column, you have the following dataframe:
+----+-------+------+-----+-------+------+-----+--------------+-----+
|s.id|s.first|s.last|s.age|d.first|d.last|d.age|UPDATED_FIELDS|field|
+----+-------+------+-----+-------+------+-----+--------------+-----+
| 1| AAA| BBB| 10| AAA__| BBB| 10| [first]|first|
| 2| CCC| DDD| 20| CCC__| DDD| 21| [first, age]|first|
| 2| CCC| DDD| 20| CCC__| DDD| 21| [first, age]| age|
+----+-------+------+-----+-------+------+-----+--------------+-----+
The new columns s_value and d_value depend on the string value in the exploded "field" column to tell us which column's value to use. It would be nice to do something like:
df.withColumn(
    "field", F.explode("UPDATED_FIELDS")
).withColumn(
    "s_value", F.col(f"s.{F.col('field')}")
)
... but this will result in an error, because the f-string is evaluated by Python before Spark ever sees it: F.col('field') is a Column object, not the per-row string value, so it cannot be interpolated into a column name.
Update: based on this helpful answer, you can in fact use the string stored in a column to look up another column's value. You first need to be able to reference df['field'], so perform the explode as a separate step, then use F.coalesce over a list of F.when(...) expressions to pick the value of whichever column is named by the string inside df['field']:
import pyspark.sql.functions as F

df = df.withColumn(
    "field", F.explode("UPDATED_FIELDS")
)
## obtain list of all unique strings in "field"
base_cols = df.select(F.collect_set("field").alias("column")).first()["column"]
df.withColumn(
    "s_value", F.coalesce(*[F.when(df['field'] == c, df[f"`s.{c}`"]) for c in base_cols])
).withColumn(
    "d_value", F.coalesce(*[F.when(df['field'] == c, df[f"`d.{c}`"]) for c in base_cols])
).select(
    F.col("`s.id`").alias("id"), "field", "s_value", "d_value"
).show()
+---+-----+-------+-------+
| id|field|s_value|d_value|
+---+-----+-------+-------+
| 1|first| AAA| AAA__|
| 2|first| CCC| CCC__|
| 2| age| 20| 21|
+---+-----+-------+-------+
I want to update the values of a row (having index numberInt) in a given dataset dFIdx using the values of another row from another dataset dFInitIdx (that row having a different index j). I tried the following in Java:
for (String colName : dFInitIdx.columns())
    dFIdx = dFIdx.where(col("id").equalTo(numberInt))
                 .withColumn(colName, dFInitIdx.where(col("id").equalTo(j)).col(colName));
But I am getting this error:
Attribute(s) with the same name appear in the operation: id. Please
check if the right attribute(s) are used
How can I achieve that update of one row in Java (preferably a one-liner)?
Thanks
Since both of your Datasets seem to have the same columns, you can join() them on your numberInt and j conditions, and then select() (at least) the id column value from the first Dataset dFIdx and all the other columns from the second Dataset dFInitIdx.
dFIdx data sample:
+---+--------+---------+
| id|hundreds|thousands|
+---+--------+---------+
| 1| 100| 1000|
| 2| 200| 2000|
| 3| 300| 3000|
+---+--------+---------+
dFInitIdx data sample:
+---+--------+---------+
| id|hundreds|thousands|
+---+--------+---------+
| 1| 101| 1001|
| 2| 201| 2001|
| 3| 301| 3001|
+---+--------+---------+
Let's say that for the given data samples numberInt and j are (hardcoded) set as:
numberInt == 1
j == 2
The solution will look like this:
dFIdx.join(dFInitIdx, dFIdx.col("id").equalTo(numberInt).and(dFInitIdx.col("id").equalTo(j)))
     .select(dFIdx.col("id"), dFInitIdx.col("hundreds"), dFInitIdx.col("thousands"))
And we can see the result of the query with show() as seen below:
+---+--------+---------+
| id|hundreds|thousands|
+---+--------+---------+
| 1| 201| 2001|
+---+--------+---------+
Hi folks, I'm augmenting my DF and was wondering if you can give a helping hand.
df = df.withColumn(('COUNTRY'), when(col("COUNTRY").startsWith("US"), "US").otherwise("null"))
What I am trying to achieve is resetting the column so that where a value starts with US, such as US_Rules_Forever, it is rewritten as simply US. All other values should be set to null.
ID  COUNTRY
1   US_RULES
2   US_SANDWICH
3   USA_CLICKING
4   GLOBAL_CHICKEN_SANDWICH

Desired output:

ID  COUNTRY
1   US
2   US
3   US
4   null
According to the docs, it should be startswith, not startsWith; the w should not be capitalized. Also note that when() without an otherwise() already returns a real null for the non-matching rows, which is what your desired output shows, whereas .otherwise("null") would store the literal string "null".
df2 = df.withColumn('COUNTRY', when(col("COUNTRY").startswith("US"), "US"))
df2.show()
+---+-------+
| ID|COUNTRY|
+---+-------+
| 1| US|
| 2| US|
| 3| US|
| 4| null|
+---+-------+
mck was right - it was a syntax issue. Posting this for fellow devs:
df = df.withColumn('COUNTRY', when(col("COUNTRY").startswith("US"), "US").otherwise(None))
I want to perform a subtract between 2 dataframes in PySpark. The challenge is that I have to ignore some columns while subtracting the dataframes, but the end dataframe should still have all the columns, including the ignored ones.
Here is an example:
from pyspark.sql import Row

userLeft = sc.parallelize([
Row(id=u'1',
first_name=u'Steve',
last_name=u'Kent',
email=u's.kent#email.com',
date1=u'2017-02-08'),
Row(id=u'2',
first_name=u'Margaret',
last_name=u'Peace',
email=u'marge.peace#email.com',
date1=u'2017-02-09'),
Row(id=u'3',
first_name=None,
last_name=u'hh',
email=u'marge.hh#email.com',
date1=u'2017-02-10')
]).toDF()
userRight = sc.parallelize([
Row(id=u'2',
first_name=u'Margaret',
last_name=u'Peace',
email=u'marge.peace#email.com',
date1=u'2017-02-11'),
Row(id=u'3',
first_name=None,
last_name=u'hh',
email=u'marge.hh#email.com',
date1=u'2017-02-12')
]).toDF()
Expected:
ActiveDF = userLeft.subtract(userRight)   # but ignore the "date1" column while subtracting
The end result should look something like this, including the "date1" column:
+----------+--------------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+--------------------+----------+---+---------+
|2017-02-08| s.kent#email.com| Steve| 1| Kent|
+----------+--------------------+----------+---+---------+
It seems you need an anti-join:
userLeft.join(userRight, ["id"], "leftanti").show()
+----------+----------------+----------+---+---------+
| date1| email|first_name| id|last_name|
+----------+----------------+----------+---+---------+
|2017-02-08|s.kent#email.com| Steve| 1| Kent|
+----------+----------------+----------+---+---------+
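If "id" alone is not a reliable key, the same anti-join can also be written over every column except the ignored one. A sketch of that variant; note that nulls in the join columns never match each other, which is exactly the caveat about default values discussed further down:
join_cols = [c for c in userLeft.columns if c != "date1"]

# Rows with a null in any join column (id 3 here) never find a match,
# so they also come back from the anti-join unless the nulls are handled first.
userLeft.join(userRight, join_cols, "leftanti").show()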
You can also use a full join and only keep null values:
import pyspark.sql.functions as psf

userLeft.join(
userRight,
[c for c in userLeft.columns if c != "date1"],
"full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()
+------------------+----------+---+---------+----------+----------+
| email|first_name| id|last_name| date1| date1|
+------------------+----------+---+---------+----------+----------+
|marge.hh#email.com| null| 3| hh|2017-02-10| null|
|marge.hh#email.com| null| 3| hh| null|2017-02-12|
| s.kent#email.com| Steve| 1| Kent|2017-02-08| null|
+------------------+----------+---+---------+----------+----------+
If you want to use joins, whether leftanti or full, you'll need to find default values for the nulls in the joining columns (I think we discussed this in a previous thread).
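For instance, a rough sketch of that idea; the "<missing>" sentinel is an assumption and has to be a value that never occurs in the real data (it will also appear in the output for surviving rows that had nulls):
join_cols = [c for c in userLeft.columns if c != "date1"]

# Replace nulls in the (string) join columns with a sentinel so that rows
# with missing values can still match each other across the two dataframes.
left_filled = userLeft.fillna("<missing>", subset=join_cols)
right_filled = userRight.fillna("<missing>", subset=join_cols)

left_filled.join(right_filled, join_cols, "leftanti").show()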
You can also just drop the column that bothers you, subtract, and then join:
df = userLeft.drop("date1").subtract(userRight.drop("date1"))
userLeft.join(df, df.columns).show()
+----------------+----------+---+---------+----------+
| email|first_name| id|last_name| date1|
+----------------+----------+---+---------+----------+
|s.kent#email.com| Steve| 1| Kent|2017-02-08|
+----------------+----------+---+---------+----------+
I have a Pyspark Dataframe with this structure:
+----+----+----+----+----+
|user| A/B|   C| A/B|   C|
+----+----+----+----+----+
|   1|   0|   1|   1|   2|
|   2|   0|   2|   4|   0|
+----+----+----+----+----+
I originally had two dataframes, but I outer-joined them using user as the key, so there can also be null values. I can't find a way to sum the columns with equal names in order to get a dataframe like this:
+----+----+----+
|user| A/B|   C|
+----+----+----+
|   1|   1|   3|
|   2|   4|   2|
+----+----+----+
Also note that there could be many duplicated columns, so selecting each column explicitly is not an option. In pandas this was possible by using "user" as the index and then adding both dataframes. How can I do this in Spark?
I have a workaround for this:
val dataFrameOneColumns = df1.columns.map(a => if (a.equals("user")) a else a + "_1")
val updatedDF = df1.toDF(dataFrameOneColumns: _*)
Now do the join; the output will contain the values under different names.
Then build the list of tuples of column names to be combined:
val newlist = df1.columns.filterNot(_.equals("user")).zip(dataFrameOneColumns.filterNot(_.equals("user")))
And then combine the values of the columns within each tuple to get the desired output!
PS: I am guessing you can write the logic for combining, so I am not spoon-feeding!
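For what it's worth, here is a rough PySpark sketch of the same renaming-then-summing idea, since the question is in PySpark; df1 and df2 stand for the two original dataframes before the join, and treating the outer join's nulls as 0 is an assumption:
import pyspark.sql.functions as F

# Rename every column of df1 except "user" by appending "_1"
# (the same idea as dataFrameOneColumns above, just in Python).
renamed = df1.select(
    [F.col(c).alias(c if c == "user" else c + "_1") for c in df1.columns]
)

joined = df2.join(renamed, on="user", how="outer")

# Sum each pair of same-named columns, counting nulls from the outer join as 0.
summed = joined.select(
    "user",
    *[
        (F.coalesce(F.col(c), F.lit(0)) + F.coalesce(F.col(c + "_1"), F.lit(0))).alias(c)
        for c in df1.columns if c != "user"
    ]
)
summed.show()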