Split Complex String in PySpark Dataframe Column

I have a PySpark dataframe column that contains multiple addresses. The format is as below:
id addresses
1 [{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]
I want to transform it as below:
id | city   | state | street                  | postalCode | country
---|--------|-------|-------------------------|------------|--------
1  | null   | null  | 123, ABC St, ABC Square | 11111      | USA
1  | Dallas | TX    | 456, DEF Plaza, Test St | 99999      | USA
Any inputs on how to achieve this using PySpark? The dataset is huge (several TBs), so I want to do this efficiently.
I tried splitting the address string on commas, but since there are commas within the addresses themselves, the output is not as expected. I guess I need a regular expression pattern involving the braces, but I'm not sure how. Also, how do I go about denormalizing the data?

# Data
from pyspark.sql.functions import col, from_json

df = spark.createDataFrame(
    [(1, '{"city":"New York","state":"NY","street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}')],
    ('id', 'addresses'))
df.show(truncate=False)

# Pass the string column to an RDD to infer the schema
rdd = df.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
newschema = spark.read.json(rdd).schema

# Apply the inferred schema to the string column using from_json
df3 = df.select("*", from_json("addresses", newschema).alias("test_col"))
df3.select('id', 'test_col.*').show(truncate=False)
+---+--------+-------+----------+-----+------------------------+
|id |city |country|postalCode|state|street |
+---+--------+-------+----------+-----+------------------------+
|1 |New York|USA |11111 |NY |123, ABC St, ABC Square|
+---+--------+-------+----------+-----+------------------------+
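The addresses column in the question is actually a JSON array of objects, so the same idea extends to it: parse the string with from_json using an ArrayType schema and explode the result into one row per address. A minimal sketch, with the schema written out explicitly rather than inferred (field names taken from the question):
from pyspark.sql.functions import from_json, explode
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# schema of one address object, wrapped in ArrayType because the column holds a JSON array
addr_schema = ArrayType(StructType([
    StructField("city", StringType()),
    StructField("state", StringType()),
    StructField("street", StringType()),
    StructField("postalCode", StringType()),
    StructField("country", StringType()),
]))

df_arr = spark.createDataFrame(
    [(1, '[{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]')],
    ('id', 'addresses'))

# one output row per address, as in the desired result
(df_arr
 .withColumn("address", explode(from_json("addresses", addr_schema)))
 .select("id", "address.*")
 .show(truncate=False))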

Related

using regex with group by in Apache spark

I have two tables: the first one contains the car brand, and the second one contains specific makes/models of that brand.
For example, this is a part of the first table:
Car Brand:
Abarth
Alfa Romeo
Aston Martin
Audi
Bentley
BMW
and the second table:
Make/Model:
Chevrolet Pickup (Full Size)
Ford Pickup (Full Size)
Toyota Camry
Nissan Altima
Chevrolet Impala
Honda Accord
GMC Pickup (Full Size)
I need to join these two tables using a regex, so that I can take the first part (the brand) from the second table and join the two tables on it.
For example:
> Honda Accord joins with Honda in the first table
I did something like this:
Dataset<Row> updatedCars = carsTable.join(carsTheftsTable, expr("Car Brand rlike Make/Model")).cache();
but it's not working; in Java Spark, expr is not found.
Any help?
You could try a cross join and check whether each car model matches each car brand.
For example, in PySpark:
import pyspark.sql.functions as F # noqa
from pyspark.sql.types import StructType, StructField, StringType
df1 = spark.createDataFrame(data=[
    ["Honda"],
    ["Alfa Romeo"],
    ["Aston Martin"],
    ["Toyota"],
    ["Nissan"],
    ["BMW"]
], schema=StructType([StructField("car_brand", StringType())])).cache()
df2 = spark.createDataFrame(data=[
    ["Chevrolet Pickup (Full Size)"],
    ["Ford Pickup (Full Size)"],
    ["Toyota Camry"],
    ["Nissan Altima"],
    ["Chevrolet Impala"],
    ["Honda Accord"],
    ["GMC Pickup (Full Size)"]
], schema=StructType([StructField("car_model", StringType())])).cache()
df1.crossJoin(df2).filter(F.col("car_model").contains(F.col("car_brand"))).show(100, False)
+---------+-------------+
|car_brand|car_model |
+---------+-------------+
|Honda |Honda Accord |
|Toyota |Toyota Camry |
|Nissan |Nissan Altima|
+---------+-------------+
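If preferred, the same condition can also be passed directly to join on df1 and df2 from above instead of an explicit crossJoin plus filter; Spark still plans a cross/nested-loop join underneath, and older Spark versions may need spark.sql.crossJoin.enabled=true:
df1.join(df2, F.col("car_model").contains(F.col("car_brand"))).show(100, False)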

Python Pandas: move NaN or null values to a new DataFrame

I know I can drop NaN rows from a DataFrame with df.dropna(). But what if I want to move those NaN rows to a new DataFrame?
Dataframe looks like
FNAME, LNAME, ADDRESS, latitude, longitude, altitude
BOB, JONES, 555 Seaseme Street, 38.00,-91.00,0.0
JOHN, GREEN, 111 Maple Street, 34.00,-75.00,0.0
TOM, SMITH, 100 A Street, 20.00,-80.00,0.0
BETTY, CROCKER, 5 Elm Street, NaN,NaN,NaN
I know I can group and move to a new DataFrame like this
grouped = df.groupby(df.FNAME)
df1 = grouped.get_group("BOB")
and it will give me a new DataFrame with FNAME of BOB but when I try
grouped = df.groupby(df.altitude)
df1 = grouped.get_group("NaN")
I get a KeyError: 'NaN'. So how can I group by NaN or null values?
Assuming you're satisfied that all NaN values in a column are to be grouped together, what you can do is use DataFrame.fillna() to convert the NaN into something else that can be grouped:
df.fillna(value={'altitude': 'null_altitudes'})
This fills every null in the altitude column with the string 'null_altitudes'. If you do a groupby now, all 'null_altitudes' rows will be together. You can fill multiple columns at once using multiple key-value pairs: values = {'col_1': 'val_1', 'col_2': 'val_2', ...}
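Putting it together on data shaped like the question's (a rough sketch; the sample values are assumed from the table above), the NaN rows can then be pulled into their own DataFrame with get_group:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FNAME': ['BOB', 'JOHN', 'TOM', 'BETTY'],
    'LNAME': ['JONES', 'GREEN', 'SMITH', 'CROCKER'],
    'altitude': [0.0, 0.0, 0.0, np.nan],
})

filled = df.fillna(value={'altitude': 'null_altitudes'})
# sort=False avoids comparing the string sentinel with the float group keys
nan_rows = filled.groupby('altitude', sort=False).get_group('null_altitudes')
print(nan_rows)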
You can use isna with any on rows:
# to get rows with NA in a new df
df1 = df[df.isna().any(axis=1)]
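If the remaining rows are also needed as their own DataFrame, the complementary mask works the same way:
# rows with no NA in any column
df2 = df[df.notna().all(axis=1)]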

Concatenate two columns of spark dataframe with null values

I have two columns in my spark dataframe
First_name Last_name
Shiva Kumar
Karthik kumar
Shiva Null
Null Shiva
My requirement is to add a new column to the dataframe by concatenating the above two columns with a comma, while handling null values too.
I have tried using concat and coalesce, but I can't get the output to use the comma delimiter only when both columns are available.
Expected output
Full_name
Shiva,kumar
Karthik,kumar
Shiva
Shiva
concat_ws concatenates the columns and handles null values for you.
df.withColumn('Full_Name', F.concat_ws(',', F.col('First_name'), F.col('Last_name')))
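For example, on sample data like the question's (a quick sketch, assuming the missing names are real nulls rather than the literal string 'Null'):
import pyspark.sql.functions as F

names = spark.createDataFrame(
    [('Shiva', 'Kumar'), ('Karthik', 'kumar'), ('Shiva', None), (None, 'Shiva')],
    ('First_name', 'Last_name'))

# concat_ws skips nulls, so rows with a missing name get no stray comma
names.withColumn('Full_Name', F.concat_ws(',', 'First_name', 'Last_name')).show()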
You can use lit:
import pyspark.sql.functions as F
f = df.withColumn('Full_Name', F.concat(F.col('First_name'), F.lit(','), F.col('Last_name'))).select('Full_Name')
# fix null values
f = f.withColumn('Full_Name', F.regexp_replace(F.col('Full_Name'), '(,Null)|(Null,)', ''))
f.show()
+-------------+
| Full_Name|
+-------------+
| Shiva,Kumar|
|Karthik,kumar|
| Shiva|
| Shiva|
+-------------+

Transform data into rdd and analyze

I am new to Spark and have the below data in CSV format, which I want to convert into a proper format.
CSV file with no header:
Student_name=abc, student_grades=A, Student_gender=female
Student_name=Xyz, student_grades=B, Student_gender=male
Now I want to put it in an RDD with a header created:
Student_name   student_grades   Student_gender
abc            A                female
Xyz            B                male
I also want to get the list of students with grades A, B and C.
What you could do is infer the schema from the first line of the file, and then transform the dataframe accordingly, that is:
Remove the column name from the row values.
Rename the columns.
Here is how you could do it. First, let's read your data from a file and display it.
// the options are here to get rid of potential spaces around the ",".
val df = spark.read
.option("ignoreTrailingWhiteSpace", true)
.option("ignoreLeadingWhiteSpace", true)
.csv("path/your_file.csv")
df.show(false)
+----------------+----------------+---------------------+
|_c0 |_c1 |_c2 |
+----------------+----------------+---------------------+
|Student_name=abc|student_grades=A|Student_gender=female|
|Student_name=Xyz|student_grades=B|Student_gender=male |
+----------------+----------------+---------------------+
Then, we extract a mapping between the default names and the new names using the first row of the dataframe.
val row0 = df.head
val cols = df
.columns
.map(c => c -> row0.getAs[String](c).split("=").head )
Finally we get rid of the name of the columns with a split on "=" and rename the columns using our mapping:
val new_df = df
.select(cols.map{ case (old_name, new_name) =>
split(col(old_name), "=")(1) as new_name
} : _*)
new_df.show(false)
+------------+--------------+--------------+
|Student_name|student_grades|Student_gender|
+------------+--------------+--------------+
|abc |A |female |
|Xyz |B |male |
+------------+--------------+--------------+
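If you want the same thing in PySpark (plus the list of students with grades A, B or C that the question also asks for), here is a rough sketch of the same approach; as above, path/your_file.csv is a placeholder:
import pyspark.sql.functions as F

df = (spark.read
      .option("ignoreTrailingWhiteSpace", True)
      .option("ignoreLeadingWhiteSpace", True)
      .csv("path/your_file.csv"))

# map each default column name (_c0, _c1, ...) to the key found in the first row
row0 = df.head()
mapping = [(c, row0[c].split("=")[0]) for c in df.columns]

# keep only the value part of "key=value" and rename the columns
clean = df.select([F.split(F.col(old), "=").getItem(1).alias(new)
                   for old, new in mapping])

# students whose grade is A, B or C
clean.filter(F.col("student_grades").isin("A", "B", "C")).show()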

How to handle different types of dates in a column in pandas

I'm trying to identify the different data types in a column of a pandas DataFrame and record them in a separate column for some computation. I have tried a regex with the mask function to identify data types like string and integer, as shown below:
df[data_types] = df[i].mask(df[i].astype(str).str.contains('^[-+]?[0-9]+$', case=False, regex=True), "Integer").mask(df[i].astype(str).str.contains('^[a-zA-Z ]+$', case=False, regex=True), "string")
The problem here is that I want to handle different date formats and identify them all as a single data type, "date". The column may contain any type of data, as below:
column_1
----------
18/01/18
01/18/18
17/01/2018
12/21/2018
jan-02-18
Nan
02-jan-18
2018/01/13
hello
2345
EDIT:
I have used mask on the same line because I want to handle every data type in the column and label each of them, to get a final result like below:
column_1   | data_types
-----------|-----------
18/01/18   | date
01/18/18   | date
17/01/2018 | date
12/21/2018 | date
jan-02-18  | date
Nan        | null
02-jan-18  | date
2018/01/13 | date
hello      | string
2345       | Integer
and this gives exactly what I need:
df[data_types] = df[i].mask(df[i].astype(str).str.contains('^[-+]?[0-9]+$', case=False, regex=True), "Integer").mask(df[i].astype(str).str.contains('^[a-zA-Z ]+$', case=False, regex=True), "string").mask(pd.to_datetime(df[i], errors='coerce').notnull(), "date").mask(df[i].astype(str).str.contains('nan', case=False, regex=True), "null")
Any help provided is highly appreciated
Use numpy.select to create the new column from multiple conditions, and for datetimes use to_datetime with errors='coerce' to return NaN for values that cannot be parsed, then check the result with notna:
import numpy as np
import pandas as pd

m1 = df[i].astype(str).str.contains('^[-+]?[0-9]+$', case=False, regex=True)
m2 = df[i].astype(str).str.contains('^[a-zA-Z ]+$', case=False, regex=True)
m3 = pd.to_datetime(df[i], errors='coerce').notna()
# older pandas versions:
# m3 = pd.to_datetime(df[i], errors='coerce').notnull()
df[data_types] = np.select([m1, m2, m3], ["Integer", "string", "date"], default='not_matched')
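If missing values should also get their own label (the 'null' row in the expected output), a null mask can be added as the first condition, since np.select picks the first condition that matches. A small sketch on top of the masks above, assuming the missing entries are real NaN rather than the literal string 'Nan':
m0 = df[i].isna()
df[data_types] = np.select([m0, m1, m2, m3],
                           ['null', 'Integer', 'string', 'date'],
                           default='not_matched')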
