Concatenate two columns of spark dataframe with null values - apache-spark

I have two columns in my spark dataframe
First_name Last_name
Shiva Kumar
Karthik kumar
Shiva Null
Null Shiva
My requirement is to add a new column to the dataframe by concatenating the above two columns with a comma, while also handling null values.
I have tried using concat and coalesce, but I only get the comma-delimited output when both columns have a value.
Expected output
Full_name
Shiva,kumar
Karthik,kumar
Shiva
Shiva

concat_ws concatenates the columns and handles null values for you:
import pyspark.sql.functions as F

df = df.withColumn('Full_Name', F.concat_ws(',', F.col('First_name'), F.col('Last_name')))
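For example, on a small dataframe with real nulls (a minimal sketch, assuming the missing values are actual nulls rather than the literal string 'Null'):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [('Shiva', 'Kumar'), ('Karthik', 'kumar'), ('Shiva', None), (None, 'Shiva')],
    ['First_name', 'Last_name'])

# concat_ws skips nulls, so no stray commas are left behind
df.withColumn('Full_Name', F.concat_ws(',', 'First_name', 'Last_name')).show()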

You can use lit:
import pyspark.sql.functions as F
f = df.withColumn('Full_Name', F.concat(F.col('First_name'), F.lit(','), F.col('Last_name'))).select('Full_Name')
# fix null values
f = f.withColumn('Full_Name', F.regexp_replace(F.col('Full_Name'), '(,Null)|(Null,)', ''))
f.show()
+-------------+
| Full_Name|
+-------------+
| Shiva,Kumar|
|Karthik,kumar|
| Shiva|
| Shiva|
+-------------+
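Note that F.concat returns null as soon as any of its inputs is null, so the regexp_replace approach above only works if the missing values are the literal string 'Null'. For true nulls you would need to wrap the columns in coalesce (or simply use concat_ws as in the first answer); a rough sketch of that variant:
import pyspark.sql.functions as F

# Hypothetical variant for real nulls: replace null with '' before concatenating,
# then strip any leading or trailing comma left behind.
f = df.withColumn(
    'Full_Name',
    F.concat(F.coalesce(F.col('First_name'), F.lit('')),
             F.lit(','),
             F.coalesce(F.col('Last_name'), F.lit(''))))
f = f.withColumn('Full_Name', F.regexp_replace('Full_Name', '^,|,$', ''))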

Related

Spark incorrectly interprets data type from csv as Double when string ends with 'd'

There is a CSV with an ID column (format: 8 digits with a "D" at the end).
When reading the CSV with .option("inferSchema", "true"), Spark infers the data type as double and trims the "D".
ACADEMIC_YEAR_SEM   ID
2013/1              12345678D
2013/1              22345678D
2013/2              32345678D
Image: https://i.stack.imgur.com/18Nu6.png
Is there any way (apart from inferSchema=False) to get the correct result? Thanks for the help!
You can specify the schema with .schema and pass a string with column names and their types separated by commas:
df2 = spark.read.format('csv').option("header", "true").schema("ACADEMIC_YEAR_SEM string, ID string") \
    .load("pyspark_sample_data.csv")
df2.show()
+-----------------+--------+
|ACADEMIC_YEAR_SEM| ID|
+-----------------+--------+
| 2013/1|1234567D|
| 2013/1|2234567D|
| 2013/2|3234567D|
+-----------------+--------+
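If you prefer to build the schema programmatically rather than as a DDL string, the equivalent with the standard StructType API would look roughly like this (a sketch, same column names assumed):
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("ACADEMIC_YEAR_SEM", StringType(), True),
    StructField("ID", StringType(), True),
])

df2 = spark.read.format('csv').option("header", "true").schema(schema) \
    .load("pyspark_sample_data.csv")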

pivoting a single row dataframe where groupBy can not be applied

I have a dataframe like this:
inputRecordSetCount  inputRecordCount  suspenseRecordCount
166                  1216              10
I am trying to make it look like
operation            value
inputRecordSetCount  166
inputRecordCount     1216
suspenseRecordCount  10
I tried pivot, but it needs a groupBy field and I don't have one. I found some references to stack in Scala, but I'm not sure how to use it in PySpark. Any help would be appreciated. Thank you.
You can use the stack() operation as mentioned in this tutorial.
Since there are 3 columns, pass the count followed by pairs of label and column reference:
stack(3, "inputRecordSetCount", inputRecordSetCount, "inputRecordCount", inputRecordCount, "suspenseRecordCount", suspenseRecordCount) as (operation, value)
Full example:
df = spark.createDataFrame(data=[[166,1216,10]], schema=['inputRecordSetCount','inputRecordCount','suspenseRecordCount'])
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (operation, value)"
df = df.selectExpr(exprs)
df.show()
+-------------------+-----+
| operation|value|
+-------------------+-----+
|inputRecordSetCount| 166|
| inputRecordCount| 1216|
|suspenseRecordCount| 10|
+-------------------+-----+
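If you are on Spark 3.4 or later, the same result can also be obtained with the built-in DataFrame.unpivot (alias melt) instead of hand-building the stack expression; a rough sketch under that version assumption:
# Assumes Spark 3.4+; with no id columns to keep, ids is left empty.
df_long = df.unpivot([], df.columns, 'operation', 'value')
df_long.show()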

Split pyspark dataframe column and limit the splits

I have the below spark dataframe.
Column_1
Physics=99;name=Xxxx;age=15
Physics=97;chemistry=85;name=yyyy;age=14
Physics=97;chemistry=85;maths=65;name=zzzz;age=14
I have to split the above dataframe column into multiple columns like below.
column_1 name age
Physics=99 Xxxx 15
Physics=97;chemistry=85 yyyy 14
Physics=97;chemistry=85;maths=65 zzzz 14
I tried split with delimiter ; and a limit, but it splits the subjects into different columns too, and name and age end up clubbed together in a single column. I need all subjects in one column and only name and age in separate columns.
Is it possible to achieve this in PySpark?
You can use a replace trick to split the column:
import pyspark.sql.functions as f

df = spark.createDataFrame(['Physics=99;name=Xxxx;age=15',
                            'Physics=97;chemistry=85;name=yyyy;age=14',
                            'Physics=97;chemistry=85;maths=65;name=zzzz;age=14'], 'string').toDF('c1')

df.withColumn('c1', f.regexp_replace('c1', ';name', ',name')) \
  .withColumn('c1', f.regexp_replace('c1', ';age', ',age')) \
  .withColumn('c1', f.split('c1', ',')) \
  .select(
      f.col('c1')[0].alias('stat'),
      f.col('c1')[1].alias('name'),
      f.col('c1')[2].alias('age')) \
  .show(truncate=False)
+--------------------------------+---------+------+
|stat |name |age |
+--------------------------------+---------+------+
|Physics=99 |name=Xxxx|age=15|
|Physics=97;chemistry=85 |name=yyyy|age=14|
|Physics=97;chemistry=85;maths=65|name=zzzz|age=14|
+--------------------------------+---------+------+
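The name and age columns above still carry the name= / age= prefixes. If you want only the values, as in the expected output, one way (a sketch extending the same approach, not part of the original answer) is to split those pairs on '=' and keep the second part:
import pyspark.sql.functions as f

df.withColumn('c1', f.regexp_replace('c1', ';name', ',name')) \
  .withColumn('c1', f.regexp_replace('c1', ';age', ',age')) \
  .withColumn('c1', f.split('c1', ',')) \
  .select(
      f.col('c1')[0].alias('column_1'),
      f.split(f.col('c1')[1], '=')[1].alias('name'),
      f.split(f.col('c1')[2], '=')[1].alias('age')) \
  .show(truncate=False)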
You can extract the name with a regex like this:
import pyspark.sql.functions as F

df = spark.createDataFrame([("Physics=99;name=Xxxx;age=15",),
                            ("Physics=97;chemistry=85;name=yyyy;age=14",),
                            ("Physics=97;chemistry=85;maths=65;name=zzzz;age=14",)], ["Column1"])
new_df = df.withColumn("name", F.regexp_extract('Column1', r'name=(\w+)', 1))
new_df.show()
Output:
+--------------------+----+
| Column1|name|
+--------------------+----+
|Physics=99;name=X...|Xxxx|
|Physics=97;chemis...|yyyy|
|Physics=97;chemis...|zzzz|
+--------------------+----+
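The same regex idea extends to the age field, and the extracted pairs can be removed from Column1 to match the asker's expected layout; a hedged sketch building on the snippet above:
# Hypothetical extension of the answer above: also extract age and drop both
# key=value pairs from Column1.
new_df = (df
          .withColumn("name", F.regexp_extract("Column1", r"name=(\w+)", 1))
          .withColumn("age", F.regexp_extract("Column1", r"age=(\w+)", 1))
          .withColumn("Column1", F.regexp_replace("Column1", r";name=\w+|;age=\w+", "")))
new_df.show(truncate=False)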

Extract values from a complex column in PySpark

I have a PySpark dataframe which has a complex column, refer to below value:
ID value
1 [{"label":"animal","value":"cat"},{"label":null,"value":"George"}]
I want to add a new column to the PySpark dataframe that basically converts it into a list of strings. If label is null, the string should contain just "value", and if label is not null, the string should be "label:value". So for the above example dataframe, the output should look like this:
ID new_column
1 ["animal:cat", "George"]
You can use transform to turn each array element into a string built with concat_ws:
df2 = df.selectExpr(
    'id',
    "transform(value, x -> concat_ws(':', x['label'], x['value'])) as new_column"
)
df2.show()
+---+--------------------+
| id| new_column|
+---+--------------------+
| 1|[animal:cat, George]|
+---+--------------------+
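If you prefer the Python column API over a SQL expression, the same logic can be written with F.transform, available in pyspark.sql.functions from Spark 3.1 onwards; a sketch under that version assumption:
import pyspark.sql.functions as F

# Equivalent to the selectExpr version above: concat_ws skips null labels,
# so a null label yields just the value (e.g. "George").
df2 = df.select(
    'id',
    F.transform('value', lambda x: F.concat_ws(':', x['label'], x['value'])).alias('new_column'))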

Split one column into multiple columns in Spark DataFrame using comma separator

I want to create multiple columns from one column of a DataFrame using a comma separator in Java Spark.
I have comma-separated values in one column of the DataFrame and want to split them into multiple columns. I have the following code:
Dataset<Row> dfreq1 = spark.read().format("json").option("inferSchema", "true")
        .load("new.json");
dfreq1.show(5, 300);
dfreq1.createOrReplaceTempView("tempdata");
Dataset<Row> dfreq2 = dfreq1.sqlContext().sql("select split(names, '|') from tempdata");
dfreq2.show(5, 300);
Input
+-----------------------------+
|                         name|
+-----------------------------+
|ABC1,XYZ1,GDH1,KLN1,JUL1,HAI1|
|ABC2,XYZ2,GDH2,KLN2,JUL2,HAI2|
+-----------------------------+
Output
+----+----+----+----+----+----+
| Cl1| Cl2| Cl3| Cl4| Cl5| Cl6|
+----+----+----+----+----+----+
|ABC1|XYZ1|GDH1|KLN1|JUL1|HAI1|
|ABC2|XYZ2|GDH2|KLN2|JUL2|HAI2|
+----+----+----+----+----+----+
You can try this:
scala> var dfd =Seq(("ABC1,XYZ1,GDH1,KLN1,JUL1,HAI1"),("ABC2,XYZ2,GDH2,KLN2,JUL2,HAI2")).toDF("name")
scala> dfd.withColumn("temp", split(col("name"), ",")).select((0 until 6).map(i => col("temp").getItem(i).as(s"col$i")): _* ).show
+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|
+----+----+----+----+----+----+
|ABC1|XYZ1|GDH1|KLN1|JUL1|HAI1|
|ABC2|XYZ2|GDH2|KLN2|JUL2|HAI2|
+----+----+----+----+----+----+
Hope this helps you.
Alternatively, in Java you can split the column and add one new column per field:
// Split the value column on the delimiter and add one column per field.
// (Adjust the delimiter to match your data, e.g. "," instead of "##".)
List<String> schemaList = Arrays.asList("name", "gender", "sale_amount", "event", "age", "shop_time");
Column column = functions.col("value");
Column linesSplit = functions.split(column, "##");
for (int i = 0; i < schemaList.size(); i++) {
    lines = lines.withColumn(schemaList.get(i), linesSplit.getItem(i));
}
You can also read the CSV content of this column back through the CSV reader into a Dataset:
// Map the single string column to a Dataset<String>, then parse it as CSV.
Dataset<String> lines = originalDF.map(
        (MapFunction<Row, String>) x -> x.getString(0), Encoders.STRING());
Dataset<Row> df = spark.read()
        .option("header", false)
        .option("inferSchema", true)
        .option("delimiter", ",")
        .csv(lines);
