Avoid writing NULL fields present in a pyspark dataframe - python-3.x

I have a Spark dataframe with the following entries:
column1 | column2
"a" | "b"
"x" | "c"
null | "a"
null | "b"
"x" | null
When I convert it to a Glue DynamicFrame and write it to an S3 bucket in JSON format, the null values are also written.
I don't want to convert the null fields to an empty string, a number, etc. Basically, if a field value is null it should not be written at all. How can I avoid writing the null fields?

You can use .na.fill('') to default your values to an empty string:
df = spark.createDataFrame([("a",), ("b",), ("c",), (None,)], ['col'])
df.show()
+----+
| col|
+----+
| a|
| b|
| c|
|null|
+----+
df.na.fill('').show()
+---+
|col|
+---+
| a|
| b|
| c|
| |
+---+
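If the goal is specifically that null fields should not appear in the JSON output at all (rather than being filled), note that Spark's JSON writer drops null fields by default; on Spark 3.x this is controlled by the ignoreNullFields option. A minimal sketch, assuming a plain Spark write rather than a Glue DynamicFrame write, and with a placeholder S3 path:

# Sketch: write JSON directly with Spark so that null fields are omitted.
# Assumes Spark 3.x, where the JSON writer exposes the ignoreNullFields option
# (null fields are already omitted by default). The bucket path is hypothetical.
df = spark.createDataFrame(
    [("a", "b"), ("x", "c"), (None, "a"), (None, "b"), ("x", None)],
    ["column1", "column2"])
df.write.option("ignoreNullFields", "true").json("s3://your-bucket/output/")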

Related

Search for 'Proper case' and mark it invalid using Pyspark

I have a big data set with multiple columns in it. A DataFrame example is below. Here the column 'first' holds names which I want to check for proper case, e.g. aamir should be Aamir and Aamir malik should be Aamir Malik.
I want something like below.
I used PySpark with the code below, where I get the right answer, but I want to detect the improperly cased names first and then make the changes.
Here I have added a new column 'correct' and applied the function:
name_check_1 = name_check.withColumn("correct", initcap(col("first")))
Then I compare the columns correct and first, which gives me the names that are not proper case:
name_check_2 = name_check_1.filter('correct != first')
I need a way to first detect the names that are not in proper case and then correct them.
My solution is below.
Logic: Slice out the first letter of the string and compare it with its upper-cased version; if they are equal the name is valid, otherwise invalid. Then uppercase the first letter of the firstname and the lastname, lowercase the rest, and concatenate them. Finally, select only the relevant columns.
from pyspark.sql.functions import *
from pyspark.sql.types import *
values = [
    (1, "aamir"),
    (2, "Aamir"),
    (3, "atif"),
    (4, "Atif"),
    (5, "tahir"),
    (6, "sameer"),
    (7, "ifzaan"),
    (8, "Ifzaan"),
    (9, "Saquib"),
    (10, "aamir malik"),
    (11, "adcA")
]
rdd = sc.parallelize(values)
schema = StructType([
    StructField("IDs", IntegerType(), True),
    StructField("first", StringType(), True)
])
#create dataframe
data = spark.createDataFrame(rdd, schema)
#split first column into firstname and lastname
data = data.withColumn("firstname", split(data["first"]," ")[0]).withColumn("lastname", split(data["first"]," ")[1])
data = data \
    .withColumn("flag", when(
        (trim(substring(data["firstname"], 0, 1)) == upper(trim(substring(data["firstname"], 0, 1)))) |
        (trim(substring(data["lastname"], 0, 1)) == upper(trim(substring(data["lastname"], 0, 1)))),
        lit("valid")).otherwise(lit("invalid"))) \
    .withColumn("correct", concat(
        concat(upper(trim(substring(data["firstname"], 0, 1))), trim(lower(substring(data["firstname"], 2, 1000)))),
        lit(" "),
        when(data["lastname"].isNull(), lit(""))
            .otherwise(concat(upper(trim(substring(data["lastname"], 0, 1))), trim(lower(substring(data["lastname"], 2, 1000))))))) \
    .select("IDs", "first", "flag", "correct")
data.show()
#Result
+---+-----------+-------+-----------+
|IDs| first| flag| correct|
+---+-----------+-------+-----------+
| 1| aamir|invalid| Aamir |
| 2| Aamir| valid| Aamir |
| 3| atif|invalid| Atif |
| 4| Atif| valid| Atif |
| 5| tahir|invalid| Tahir |
| 6| sameer|invalid| Sameer |
| 7| ifzaan|invalid| Ifzaan |
| 8| Ifzaan| valid| Ifzaan |
| 9| Saquib| valid| Saquib |
| 10|aamir malik|invalid|Aamir Malik|
| 11| adcA|invalid| Adca |
+---+-----------+-------+-----------+
You know how to use initcap, so just create a new column correct and compare it to the column first to check whether it's already valid or not:
df.withColumn("correct", initcap(lower(col("first")))) \
.withColumn("flag", when(col("correct") != col("first"), lit("invalid")).otherwise("valid")) \
.show()
Gives:
+---+-----------+-----------+-------+
| id| first| correct| flag|
+---+-----------+-----------+-------+
| 1| aamir| Aamir|invalid|
| 2| Aamir| Aamir| valid|
| 3| atif| Atif|invalid|
| 4| Atif| Atif| valid|
| 5| tahir| Tahir|invalid|
| 6| sameer| Sameer|invalid|
| 7| ifzaan| Ifzaan|invalid|
| 8|Ifzaan Abcd|Ifzaan Abcd| valid|
| 9|Saquib abcd|Saquib Abcd|invalid|
+---+-----------+-----------+-------+

sanitize column values in pyspark dataframe

Given a CSV file, I converted it to a DataFrame using code like the following:
raw_df = spark.read.csv(input_data, header=True)
That creates dataframe looks something like this:
| Name |
========
| 23 |
| hi2 |
| me3 |
| do |
I want to convert this column to contain only numbers. The final result should be as below, where hi and me are removed:
| Name |
========
| 23 |
| 2 |
| 3 |
| do |
I want to sanitize the values and make sure the column only contains numbers, but I'm not sure if that's possible in Spark.
Yes, it's possible. You can use regexp_replace from pyspark.sql.functions.
Please check this:
import pyspark.sql.functions as f
df = spark.sparkContext.parallelize([('12',), ('hi2',), ('me3',)]).toDF(["name"])
df.show()
+----+
|name|
+----+
| 12|
| hi2|
| me3|
+----+
final_df = df.withColumn('sanitize', f.regexp_replace('name', '[a-zA-Z]', ''))
final_df.show()
+----+--------+
|name|sanitize|
+----+--------+
| 12| 12|
| hi2| 2|
| me3| 3|
+----+--------+
final_df.withColumn('len', f.length('sanitize')).show()
+----+--------+---+
|name|sanitize|len|
+----+--------+---+
| 12| 12| 2|
| hi2| 2| 1|
| me3| 3| 1|
+----+--------+---+
You can adjust the regex as needed.
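For example, a slightly stricter adjustment (a sketch, assuming you want to drop every character that is not a digit, not just ASCII letters), reusing the df and the f alias from above:

# Sketch: keep only digit characters (drops anything that is not 0-9).
final_df = df.withColumn('sanitize', f.regexp_replace('name', '[^0-9]', ''))
final_df.show()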
Here is another way of doing the same thing. It's just an alternative; it's better to use Spark's built-in functions when available, as shown above.
from pyspark.sql.functions import udf
import re

# Grab the first run of digits; assumes every value contains at least one digit.
user_func = udf(lambda x: re.findall(r"\d+", x)[0])
newdf = df.withColumn('new_column', user_func(df.Name))
>>> newdf.show()
+----+----------+
|Name|new_column|
+----+----------+
| 23| 23|
| hi2| 2|
| me3| 3|
+----+----------+
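Since the point above is to prefer built-in functions, regexp_extract can also pull out the first run of digits without a UDF. A sketch against the small df built earlier; it returns an empty string when a value contains no digits:

# Sketch: built-in regexp_extract instead of a UDF; returns '' if no digits.
final_df = df.withColumn('new_column', f.regexp_extract('name', r'\d+', 0))
final_df.show()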

Compare two datasets and get which fields have changed

I am working with Spark using Java, where I download data from an API and compare it with MongoDB data; the downloaded JSON has 15-20 fields but the database has 300 fields.
Now my task is to compare the downloaded JSONs to the MongoDB data and get whichever fields have changed compared to the past data.
Sample data set
Downloaded data from API
StudentId,Name,Phone,Email
1,tony,123,a#g.com
2,stark,456,b#g.com
3,spidy,789,c#g.com
Mongodb data
StudentId,Name,Phone,Email,State,City
1,tony,1234,a#g.com,NY,Nowhere
2,stark,456,bg#g.com,NY,Nowhere
3,spidy,789,c#g.com,OH,Nowhere
I can't use except, because the two have different numbers of columns.
Expected output
StudentId,Name,Phone,Email,Past_Phone,Past_Email
1,tony,1234,a#g.com,1234, //phone number only changed
2,stark,456,b#g.com,,bg#g.com //Email only changed
3,spidy,789,c#g.com,,
Assuming your data is in two dataframes, we can create temporary views for them as shown below:
api_df.createOrReplaceTempView("api_data")
mongo_df.createOrReplaceTempView("mongo_data")
Next, we can use Spark SQL. Here, we join the two views on the StudentId column and then use a case statement on top of them to compute the past phone number and email.
spark.sql("""
select a.*
, case when a.Phone = b.Phone then '' else b.Phone end as Past_phone
, case when a.Email = b.Email then '' else b.Email end as Past_Email
from api_data a
join mongo_data b
on a.StudentId = b.StudentId
order by a.StudentId""").show()
Output:
+---------+-----+-----+-------+----------+----------+
|StudentId| Name|Phone| Email|Past_phone|Past_Email|
+---------+-----+-----+-------+----------+----------+
| 1| tony| 123|a#g.com| 1234| |
| 2|stark| 456|b#g.com| | bg#g.com|
| 3|spidy| 789|c#g.com| | |
+---------+-----+-----+-------+----------+----------+
Please find the sample source code below. Here I am taking only the phone number condition as an example.
import spark.implicits._

val list = List((1, "tony", 123, "a#g.com"), (2, "stark", 456, "b#g.com"),
  (3, "spidy", 789, "c#g.com"))
val df1 = list.toDF("StudentId", "Name", "Phone", "Email")
  .select('StudentId as "StudentId_1", 'Name as "Name_1", 'Phone as "Phone_1",
    'Email as "Email_1")
df1.show()

val list1 = List((1, "tony", 1234, "a#g.com", "NY", "Nowhere"),
  (2, "stark", 456, "bg#g.com", "NY", "Nowhere"),
  (3, "spidy", 789, "c#g.com", "OH", "Nowhere"))
val df2 = list1.toDF("StudentId", "Name", "Phone", "Email", "State", "City")
  .select('StudentId as "StudentId_2", 'Name as "Name_2", 'Phone as "Phone_2",
    'Email as "Email_2", 'State as "State_2", 'City as "City_2")
df2.show()

val df3 = df1.join(df2, df1("StudentId_1") === df2("StudentId_2"))
  .where(df1("Phone_1") =!= df2("Phone_2"))
df3.withColumnRenamed("Phone_1", "Past_Phone").show()
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a#g.com|
| 2| stark| 456|b#g.com|
| 3| spidy| 789|c#g.com|
+-----------+------+-------+-------+
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a#g.com| NY|Nowhere|
| 2| stark| 456|bg#g.com| NY|Nowhere|
| 3| spidy| 789| c#g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
|StudentId_1|Name_1|Past_Phone|Email_1|StudentId_2|Name_2|Phone_2|Email_2|State_2| City_2|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
| 1| tony| 123|a#g.com| 1| tony| 1234|a#g.com| NY|Nowhere|
+-----------+------+----------+-------+-----------+------+-------+-------+-------+-------+
We have :
df1.show
+-----------+------+-------+-------+
|StudentId_1|Name_1|Phone_1|Email_1|
+-----------+------+-------+-------+
| 1| tony| 123|a#g.com|
| 2| stark| 456|b#g.com|
| 3| spidy| 789|c#g.com|
+-----------+------+-------+-------+
df2.show
+-----------+------+-------+--------+-------+-------+
|StudentId_2|Name_2|Phone_2| Email_2|State_2| City_2|
+-----------+------+-------+--------+-------+-------+
| 1| tony| 1234| a#g.com| NY|Nowhere|
| 2| stark| 456|bg#g.com| NY|Nowhere|
| 3| spidy| 789| c#g.com| OH|Nowhere|
+-----------+------+-------+--------+-------+-------+
After Join :
var jn = df2.join(df1,df1("StudentId_1")===df2("StudentId_2"))
Then
var ans = jn
  .withColumn("Past_Phone", when(jn("Phone_2").notEqual(jn("Phone_1")), jn("Phone_1")).otherwise(""))
  .withColumn("Past_Email", when(jn("Email_2").notEqual(jn("Email_1")), jn("Email_1")).otherwise(""))
Reference : Spark: Add column to dataframe conditionally
Next :
ans.select(ans("StudentId_2") as "StudentId",ans("Name_2") as "Name",ans("Phone_2") as "Phone",ans("Email_2") as "Email",ans("Past_Email"),ans("Past_Phone")).show
+---------+-----+-----+--------+----------+----------+
|StudentId| Name|Phone| Email|Past_Email|Past_Phone|
+---------+-----+-----+--------+----------+----------+
| 1| tony| 1234| a#g.com| | 123|
| 2|stark| 456|bg#g.com| b#g.com| |
| 3|spidy| 789| c#g.com| | |
+---------+-----+-----+--------+----------+----------+
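For the case in the question, where the two sources share many more columns than just Phone and Email, the same per-column comparison can be generated instead of hand-written. A PySpark sketch, assuming the two dataframes are named api_df and mongo_df as above and are keyed by StudentId:

# Sketch: build a Past_<col> column for every column the two dataframes share,
# instead of hard-coding Phone and Email. Assumes api_df / mongo_df as above.
from pyspark.sql import functions as F

key = "StudentId"
shared_cols = [c for c in api_df.columns if c in mongo_df.columns and c != key]

joined = api_df.alias("a").join(
    mongo_df.alias("b"), F.col("a." + key) == F.col("b." + key))

result = joined.select(
    [F.col("a." + key)]
    + [F.col("a." + c) for c in shared_cols]
    + [F.when(F.col("a." + c) == F.col("b." + c), F.lit(""))
        .otherwise(F.col("b." + c))
        .alias("Past_" + c)
       for c in shared_cols])
result.show()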

How to rename duplicated columns after join? [duplicate]

This question already has answers here:
How to avoid duplicate columns after join?
(10 answers)
Closed 4 years ago.
I want to join 3 dataframes, but there are some columns we don't need or that have duplicate names across the dataframes, so I want to drop some columns, like below:
result_df = (aa_df.join(bb_df, 'id', 'left')
.join(cc_df, 'id', 'left')
.withColumnRenamed(bb_df.status, 'user_status'))
Please note that the status column is in two dataframes, i.e. aa_df and bb_df.
The above doesn't work. I also tried to use withColumn, but a new column is created and the old column still exists.
If you are trying to rename the status column of the bb_df dataframe, then you can do so while joining:
result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')
I want to join 3 dataframes, but there are some columns we don't need or that have duplicate names across the dataframes
That's a fine use case for aliasing a Dataset using alias or as operators.
alias(alias: String): Dataset[T] or alias(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set. Same as as.
as(alias: String): Dataset[T] or as(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set.
(And honestly, I only now noticed the Symbol-based variants.)
NOTE There are two as operators, as for aliasing and as for type mapping. Consult the Dataset API.
After you've aliased a Dataset, you can reference columns using the [alias].[columnName] format. This is particularly handy with joins and star column dereferencing using *.
val ds1 = spark.range(5)
scala> ds1.as('one).select($"one.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
val ds2 = spark.range(10)
// Using joins with aliased datasets
// where clause is in a longer form to demo how to reference columns by alias
scala> ds1.as('one).join(ds2.as('two)).where($"one.id" === $"two.id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
so I want to drop some columns, like below
My general recommendation is not to drop columns, but to select what you want to include in the result. That makes life more predictable, as you know what you get (not what you don't). I was told that our brains work in positives, which could also make a case for select.
So, as you asked and I showed in the above example, the result has two columns of the same name id. The question is how to have only one.
There are at least two answers using the variant of the join operator with the join columns or condition included (as you showed in your question), but that would not answer your real question about "dropping unwanted columns", would it?
Given I prefer select (over drop), I'd do the following to have a single id column:
val q = ds1.as('one)
.join(ds2.as('two))
.where($"one.id" === $"two.id")
.select("one.*") // <-- select columns from "one" dataset
scala> q.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Regardless of the reasons why you asked the question (which could also be answered with the points I raised above), let me answer the (burning) question how to use withColumnRenamed when there are two matching columns (after join).
Let's assume you ended up with the following query and so you've got two id columns (per join side).
val q = ds1.as('one)
.join(ds2.as('two))
.where($"one.id" === $"two.id")
scala> q.show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
withColumnRenamed won't work for this use case since it does not accept aliased column names.
scala> q.withColumnRenamed("one.id", "one_id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
You could select the columns you're interested in as follows:
scala> q.select("one.id").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
scala> q.select("two.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
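Since the question itself is written in PySpark, here is a rough PySpark equivalent of the alias-plus-select approach (a sketch, assuming the aa_df and bb_df names from the question and an id join key, not the exact code from the question):

# Sketch: PySpark version of alias + select, assuming aa_df / bb_df from the
# question and an 'id' join key; keeps all of aa_df and renames bb_df.status.
from pyspark.sql import functions as F

result_df = (aa_df.alias("aa")
    .join(bb_df.alias("bb"), F.col("aa.id") == F.col("bb.id"), "left")
    .select("aa.*", F.col("bb.status").alias("user_status")))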
Please see the docs: withColumnRenamed().
You need to pass the name of the existing column and the new name to the function. Both of these should be strings.
result_df = aa_df.join(bb_df,'id', 'left').join(cc_df, 'id', 'left').withColumnRenamed('status', 'user_status')
If you have 'status' columns in two dataframes, you can include that column in the join as aa_df.join(bb_df, ['id', 'status'], 'left'), assuming aa_df and bb_df share the common column. This way you will not end up with two 'status' columns.

pyspark - attempting to create new column based on the difference of two ArrayType columns

I have a table like so:
+-----+----+-------+-------+
|name | id | msg_a | msg_b |
+-----+----+-------+-------+
| a| 3|[a,b,c]|[c] |
| b| 5|[x,y,z]|[h,x,z]|
| c| 7|[a,x,y]|[j,x,y]|
+-----+----+-------+-------+
I want to add a column so that anything in msg_b but not in msg_a is surfaced.
E.g.
+-----+----+-------+-------+------------+
|name | id | msg_a | msg_b | difference |
+-----+----+-------+-------+------------+
| a| 3|[a,b,c]|[c] |NA |
| b| 5|[x,y,z]|[h,x,z]|[h] |
| c| 7|[a,x,y]|[j,x,y]|[j] |
+-----+----+-------+-------+------------+
Referring to a previous post, I've tried
df.select('msg_b').subtract(df.select('msg_a')).show()
which works, but I need the information as a table, with name and id
Doing this:
df.withColumn("difference", F.col('msg_b').subtract(F.col('msg_a'))).show(5)
yields a TypeError: 'Column' object is not callable
Not sure if there is a separate function for performing this operation, if I'm missing something glaringly obvious, etc.
You have to use a UDF:
from pyspark.sql.functions import *
from pyspark.sql.types import *
@udf(ArrayType(StringType()))
def subtract(xs, ys):
    # Elements present in xs but not in ys (order not preserved).
    return list(set(xs) - set(ys))
Example
df = sc.parallelize([
    (["a", "b", "c"], ["c"]), (["x", "y", "z"], ["h", "x", "z"])
]).toDF(["msg_a", "msg_b"])

df.select(subtract('msg_b', 'msg_a')).show()
+----------------------+
|subtract(msg_b, msg_a)|
+----------------------+
| []|
| [h]|
+----------------------+
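If you are on Spark 2.4 or later, there is also a built-in array_except function that does this without a UDF (note it returns an empty array rather than NA when nothing differs). A minimal sketch using the table from the question:

# Sketch: built-in array_except (Spark 2.4+) keeps elements of msg_b not in msg_a.
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [("a", 3, ["a", "b", "c"], ["c"]),
     ("b", 5, ["x", "y", "z"], ["h", "x", "z"]),
     ("c", 7, ["a", "x", "y"], ["j", "x", "y"])],
    ["name", "id", "msg_a", "msg_b"])

df.withColumn("difference", F.array_except("msg_b", "msg_a")).show()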
