How to concatenate null columns in spark dataframe in java? - apache-spark

I am working with Spark in Java and I want to create a column which is a concatenation of all the other column values, separated by commas. I have tried a few ways to do this but couldn't find a solution.
For example-
col1 | col2 | col3 | result
1 | john | 2.3 | 1,john,2.3
The problem I'm facing is that if a column value is null, nothing should appear in its place in the result, but so far I have not achieved this.
For example-
col1 | col2 | col3 | result
1 | null | 2.3 | 1,,2.3 -> this is what I'm trying to achieve
What I have tried-
1.) I tried the concat function of Spark SQL, but if any column value is null then the entire result of concat is null, which is not what I want.
2.) The concat_ws function simply ignores null values, so it can't be used. In my 2nd example the result would be something like 1,2.3.
3.) Using coalesce requires matching data types, so if I do something like coalesce(col_name, '') inside the concat function and col_name is a numeric data type, it throws a data type mismatch error.
4.) Using a CASE WHEN statement also requires the same data type in the THEN and ELSE branches. So CASE WHEN column_1 IS NULL THEN '' ELSE column_1 END throws an error if column_1 is numeric, because '' is an empty string.
One way it can be achieved is by writing a map function. But is there a way to do it in a Spark SQL way?

You can cast all your fields to String so that you can run NVL on them and set them to empty string '' if they're null.
To generate this expression for all the columns in your dataframe automatically, you can use the map function:
df.show()
//+----+----+----+
//|col1|col2|col3|
//+----+----+----+
//| 1|John| 2.3|
//| 2|null| 2.3|
//+----+----+----+
// Imports needed: java.util.Arrays, java.util.List,
// static java.util.stream.Collectors.joining, static org.apache.spark.sql.functions.expr
List<String> columns = Arrays.asList(df.columns()); // df.columns() returns String[] in Java
// ["col1", "col2", "col3"]
String nvlExpr = columns.stream()
        .map(i -> "nvl(cast(" + i + " as string), '')")
        .collect(joining(", ", "concat_ws(',', ", ")"));
// concat_ws(',', nvl(cast(col1 as string), ''), nvl(cast(col2 as string), ''), nvl(cast(col3 as string), ''))
df.withColumn("result", expr(nvlExpr)).show();
//+----+----+----+----------+
//|col1|col2|col3| result|
//+----+----+----+----------+
//| 1|John| 2.3|1,John,2.3|
//| 2|null| 2.3| 2,,2.3|
//+----+----+----+----------+

You can also concatenate this way, handling the null column with a CASE WHEN in a subquery:
scala> spark.sql("select *, concat(cast(col1 as string),',' ,col2,',' ,cast(col3 as string)) as result from (select col1, col3, case when col2 is null then '' else col2 end as col2 from final)").show()
+----+----+----+--------+
|col1|col3|col2| result|
+----+----+----+--------+
| 1| 2.3| JK|1,JK,2.3|
| 1| 2.3| | 1,,2.3|
+----+----+----+--------+
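As a follow-up sketch (not part of the original answer): the asker's coalesce attempt only failed because of the type mismatch, so casting inside coalesce also works in a single SQL statement. Shown here in PySpark, assuming the same temp view name final used in the query above.
# Sketch: coalesce(cast(... as string), '') avoids both the null propagation
# of concat and the null skipping of concat_ws.
spark.sql("""
    SELECT col1, col2, col3,
           concat(coalesce(cast(col1 as string), ''), ',',
                  coalesce(cast(col2 as string), ''), ',',
                  coalesce(cast(col3 as string), '')) AS result
    FROM final
""").show()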

Related

How to find if a spark column contains a certain value?

I have the following spark dataframe -
+----+----+
|col1|col2|
+----+----+
| a| 1|
| b|null|
| c| 3|
+----+----+
Is there a way in spark API to detect if col2 contains, say, 3? Please note that the answer should be just one indicator value - yes/no - and not the set of records that have 3 in col2.
The PySpark-recommended way of finding whether a DataFrame contains a particular value is to use the pyspark.sql.Column.contains API. You can wrap the filtered result in bool() to get a True/False value.
For your example:
bool(df.filter(df.col2.contains(3)).collect())
#Output
>>>True
bool(df.filter(df.col2.contains(100)).collect())
#Output
>>>False
Source : https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.Column.contains.html
By counting the number of values in col2 that are equal to 3:
import pyspark.sql.functions as f
df.agg(f.expr('sum(case when col2 = 3 then 1 else 0 end)')).first()[0] > 0
You can use when as a conditional statement:
from pyspark.sql.functions import when, col

df.select(
    (when(col("col2") == '3', 'yes')
     .otherwise('no')
    ).alias('col3')
)

How to convert single String column to multiple columns based on delimiter in Apache Spark

I have a data frame with a string column and I want to create multiple columns out of it.
Here is my input data and pagename is my string column
I want to create multiple columns from it. The format of the string is the same - col1:value1 col2:value2 col3:value3 ... colN:valueN . In the output, I need multiple columns - col1 to colN with values as rows for each column. Here is the output -
How can I do this in Spark? Scala or Python are both fine for me. The code below creates the input dataframe -
scala> val df = spark.sql(s"""select 1 as id, "a:100 b:500 c:200" as pagename union select 2 as id, "a:101 b:501 c:201" as pagename """)
df: org.apache.spark.sql.DataFrame = [id: int, pagename: string]
scala> df.show(false)
+---+-----------------+
|id |pagename |
+---+-----------------+
|2 |a:101 b:501 c:201|
|1 |a:100 b:500 c:200|
+---+-----------------+
scala> df.printSchema
root
|-- id: integer (nullable = false)
|-- pagename: string (nullable = false)
Note - The example shows only 3 columns here but in general I have more than 100 columns that I expect to deal with.
You can use str_to_map, explode the resulting map, and pivot:
val df2 = df.select(
  col("id"),
  expr("explode(str_to_map(pagename, ' ', ':'))")
).groupBy("id").pivot("key").agg(first("value"))
df2.show
+---+---+---+---+
| id| a| b| c|
+---+---+---+---+
| 1|100|500|200|
| 2|101|501|201|
+---+---+---+---+
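One usage note, not from the original answer: with the 100+ keys mentioned in the question, pivot makes an extra pass over the data to collect the distinct key values unless you pass them explicitly. A PySpark sketch of the same idea, assuming the key names are known up front:
from pyspark.sql import functions as F

# Passing the expected keys to pivot() skips the extra distinct-value scan.
known_keys = ["a", "b", "c"]  # in practice, your ~100 known key names
df2 = (df.select("id", F.explode(F.expr("str_to_map(pagename, ' ', ':')")))
         .groupBy("id")
         .pivot("key", known_keys)
         .agg(F.first("value")))
df2.show()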
So two options immediately come to mind
Delimiters
You've got some obvious delimiters that you can split on. For this, use the split function:
from pyspark.sql import functions as F

delimiter = ":"
df = df.withColumn(
    "split_column",
    F.split(F.col("pagename"), delimiter)
)
# "split_column" is now an array, so we need to pull items out of the array
df = df.withColumn(
    "a",
    F.col("split_column").getItem(0)
)
Not ideal, as you'll still need to do some string manipulation to remove the whitespace and then do the int conversion - but this is easily applied to multiple columns.
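As a rough sketch of the multi-column case (not from the original answer), splitting on the space first and then on ':' avoids the leftover whitespace; the column names a, b, c and the int cast are illustrative assumptions:
from pyspark.sql import functions as F

# Split the row into "key:value" pieces first, then take the value after ':' and cast it.
pairs = F.split(F.col("pagename"), " ")
for idx, name in enumerate(["a", "b", "c"]):  # hypothetical target columns, in order
    df = df.withColumn(name, F.split(pairs.getItem(idx), ":").getItem(1).cast("int"))
df.show()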
Regex
As the format is pretty fixed, you can do the same thing with a regex.
import re

# Hypothetical pattern: capture the digits after each key (adjust if the values aren't purely numeric).
regex_pattern = r"a\:(\d+) b\:(\d+) c\:(\d+)"
match_groups = ["a", "b", "c"]

for i in range(re.compile(regex_pattern).groups):
    df = df.withColumn(
        match_groups[i],
        F.regexp_extract(F.col("pagename"), regex_pattern, i + 1),
    )
CAVEAT: Check that Regex before you try and run anything (as I don't have an editor handy)

replace null values in string type column with zero PySpark

I need to replace null values in string type columns to be 0.
Data looks like this:
df.groupBy('content').count().show()
+---------------+------+
| content| count|
+---------------+------+
| videos| 754|
| food-news| 76151|
| null| 39|
| uk| 23879|
I have tried this:
df.na.fill(0).show()
But this piece of code only takes care of int type columns. How can I replace it for string type columns?
Thank you.
Fill with a string '0' too:
df = df.na.fill(0).na.fill('0')
In the code below, all null int values will be replaced by 0 and null string values by ' ' (blank).
df = df.na.fill(0).na.fill(' ')
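If you only want to touch a specific column (or only the string ones), na.fill also accepts a subset argument; a small sketch assuming the column from the question is named content:
# Replace nulls only in the 'content' column; other columns are left untouched.
df = df.na.fill('0', subset=['content'])
df.groupBy('content').count().show()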

truncating all strings in a dataframe column after a specific character using pyspark

I have a dataframe df that contains a list of strings like so:
+---------+
| Products|
+---------+
| Z9L57.W3|
| H9L23.05|
| PRL57.AF|
+---------+
I would like to truncate each string after the '.' character so that it looks like:
+--------------+
|Products_trunc|
+--------------+
|         Z9L57|
|         H9L23|
|         PRL57|
+--------------+
I tried using the split function, but it only works for a single string and not lists.
I also tried
df['Products_trunc'] = df['Products'].str.split('.').str[0]
but I am getting the following error:
TypeError: 'Column' object is not callable
Does anyone have any insights into this?
Thank You
Your code looks like you are used to pandas. Truncating in PySpark is a bit different. Have a look below:
from pyspark.sql import functions as F

l = [
    ('Z9L57.W3',),
    ('H9L23.05',),
    ('PRL57.AF',)
]
columns = ['Products']
df = spark.createDataFrame(l, columns)
The withColumn function allows you to modify existing columns or create new ones. It takes 2 parameters: a column name and a column expression. You modify a column when the column name already exists.
df = df.withColumn('Products', F.split(df.Products, r'\.').getItem(0))
df.show()
Output:
+--------+
|Products|
+--------+
| Z9L57|
| H9L23|
| PRL57|
+--------+
You create a new column when you choose a column name that does not already exist.
df = df.withColumn('Products_trunc', F.split(df.Products, r'\.').getItem(0))
df.show()
Output:
+--------+--------------+
|Products|Products_trunc|
+--------+--------------+
|Z9L57.W3| Z9L57|
|H9L23.05| H9L23|
|PRL57.AF| PRL57|
+--------+--------------+
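As an alternative sketch (not part of the original answer), substring_index keeps everything before the first '.' in a single call and avoids the escaped-dot split pattern:
from pyspark.sql import functions as F

# Keep the part of the string before the first '.'.
df = df.withColumn('Products_trunc', F.substring_index(df.Products, '.', 1))
df.show()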

Trim in a Pyspark Dataframe

I have a PySpark dataframe (original dataframe) with the below data (all columns have string datatype). In my use case I am not sure which columns are present in this input dataframe. The user just passes me the dataframe and asks me to trim all of its columns. Data in a typical dataframe looks like this:
id  Value    Value1
1   "Text "  "Avb"
2   1504     " Test"
3   1        2
Is there any way I can do this without depending on which columns are present in the dataframe, and get all the columns trimmed? Data after trimming all the columns should look like:
id  Value   Value1
1   "Text"  "Avb"
2   1504    "Test"
3   1       2
Can someone help me out? How can I achieve it using a PySpark dataframe? Any help will be appreciated.
input:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1|Text | Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Code:
import pyspark.sql.functions as func

for col in df.columns:
    df = df.withColumn(col, func.ltrim(func.rtrim(df[col])))
Output:
df.show()
+---+-----+------+
| id|Value|Value1|
+---+-----+------+
| 1| Text| Avb|
| 2| 1504| Test|
| 3| 1| 2|
+---+-----+------+
Using the trim() function from #osbon123's answer:
from pyspark.sql.functions import trim, col

for c_name in df.columns:
    df = df.withColumn(c_name, trim(col(c_name)))
You should avoid calling withColumn in a loop, because each call creates a new DataFrame and grows the query plan, which is time-consuming for very large dataframes. I created the following function based on this solution, but now it works with any dataframe, even one with both string and non-string columns.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def trim_string_columns(of_data: DataFrame) -> DataFrame:
    # Trim string columns; pass non-string columns through unchanged.
    data_trimmed = of_data.select([
        (F.trim(c.name).alias(c.name) if isinstance(c.dataType, StringType) else c.name)
        for c in of_data.schema
    ])
    return data_trimmed
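A quick usage sketch of the function above, assuming the imports shown:
df = trim_string_columns(df)
df.show()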
This is the cleanest (and most computationally efficient) way I've seen to remove all spaces from all column names. If you want underscores to replace spaces, simply replace "" with "_".
# Standardize Column names no spaces to underscore
new_column_name_list = list(map(lambda x: x.replace(" ", ""), df.columns))
df = df.toDF(*new_column_name_list)
You can use the dtypes function in the DataFrame API to get the list of column names along with their datatypes, and then for all string columns use the trim function to trim the values.
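A minimal sketch of that idea (not spelled out in the original answer), using dtypes to pick out the string columns and trim only those:
from pyspark.sql import functions as F

# dtypes returns (column_name, type_name) pairs; trim only the string-typed columns.
string_cols = [name for name, dtype in df.dtypes if dtype == 'string']
for name in string_cols:
    df = df.withColumn(name, F.trim(F.col(name)))
df.show()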
Regards,
Neeraj
