Using a regex to join two tables in Apache Spark (Java)

I have two tables: the first contains car brands, and the second contains specific models of those brands.
For example, this is part of the first table:
Car Brand:
Abarth
Alfa Romeo
Aston Martin
Audi
Bentley
BMW
and this is part of the second table:
Make/Model:
Chevrolet Pickup (Full Size)
Ford Pickup (Full Size)
Toyota Camry
Nissan Altima
Chevrolet Impala
Honda Accord
GMC Pickup (Full Size)
I need to join these two tables. I want to use a regex so I can take the first part of Make/Model in the second table and join the two tables on it.
For example:
> Honda Accord joins with Honda in the first table
I tried something like this:
Dataset<Row> updatedCars = carsTable.join(carsTheftsTable, expr("Car Brand rlike Make/Model")).cache();
but it's not working in Java Spark; expr is not found.
Any help?

You could try a cross join and check whether each model matches each car brand.
For example, in PySpark:
import pyspark.sql.functions as F # noqa
from pyspark.sql.types import StructType, StructField, StringType
df1 = spark.createDataFrame(data=[
    ["Honda"],
    ["Alfa Romeo"],
    ["Aston Martin"],
    ["Toyota"],
    ["Nissan"],
    ["BMW"]
], schema=StructType([StructField("car_brand", StringType())])).cache()

df2 = spark.createDataFrame(data=[
    ["Chevrolet Pickup (Full Size)"],
    ["Ford Pickup (Full Size)"],
    ["Toyota Camry"],
    ["Nissan Altima"],
    ["Chevrolet Impala"],
    ["Honda Accord"],
    ["GMC Pickup (Full Size)"]
], schema=StructType([StructField("car_model", StringType())])).cache()
df1.crossJoin(df2).filter(F.col("car_model").contains(F.col("car_brand"))).show(100, False)
+---------+-------------+
|car_brand|car_model    |
+---------+-------------+
|Honda    |Honda Accord |
|Toyota   |Toyota Camry |
|Nissan   |Nissan Altima|
+---------+-------------+
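Regarding the original attempt: in Java, expr is a static method on org.apache.spark.sql.functions, so it needs import static org.apache.spark.sql.functions.expr, and column names containing spaces or slashes such as Car Brand and Make/Model must be wrapped in backticks inside the SQL expression. As a sketch reusing df1/df2 above, the same cross join can also be filtered with an rlike expression instead of contains (safe here because the brand names contain no regex metacharacters):
# same cross join as above, filtered with a SQL expression that mirrors the
# question's rlike idea (reuses df1 and df2 defined above)
df1.crossJoin(df2).filter(F.expr("car_model rlike car_brand")).show(100, False)
# with the original column names the expression would need backticks,
# e.g. expr("`Make/Model` rlike `Car Brand`")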

Related

Pivoting streaming dataframes without aggregation in pyspark

ID  type   value
A   car    camry
A   price  20000
B   car    tesla
B   price  40000
Example dataframe that is being streamed.
I need the output to look like this. Anyone have suggestions?
ID  car    price
A   camry  20000
B   tesla  40000
What's a good way to transform this? I have been researching pivoting, but it requires an aggregation, which is not something I need.
You could filter the frame (df) twice and join the results:
(
    df.filter(df.type == "car").withColumnRenamed("value", "car")
    .join(
        df.filter(df.type == "price").withColumnRenamed("value", "price"),
        on="ID"
    )
    .select("ID", "car", "price")
)
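For a self-contained illustration, here is a sketch of the same filter-and-join pattern on a static frame built from the sample data above (assumes an active spark session; on an actual streaming frame this becomes a stream-stream join, which comes with its own restrictions around state and watermarks):
import pyspark.sql.functions as F

# static frame mirroring the question's sample data
df = spark.createDataFrame(
    [("A", "car", "camry"), ("A", "price", "20000"),
     ("B", "car", "tesla"), ("B", "price", "40000")],
    ["ID", "type", "value"])

result = (
    df.filter(F.col("type") == "car").withColumnRenamed("value", "car")
    .join(df.filter(F.col("type") == "price").withColumnRenamed("value", "price"), on="ID")
    .select("ID", "car", "price")
)
result.show()
# expected output (row order may vary):
# +---+-----+-----+
# | ID|  car|price|
# +---+-----+-----+
# |  A|camry|20000|
# |  B|tesla|40000|
# +---+-----+-----+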

Split Complex String in PySpark Dataframe Column

I have a PySpark dataframe column that contains multiple addresses. The format is as below:
id addresses
1 [{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]
I want to transform it as below:
id  city    state  street                    postalCode  country
1   null    null   123, ABC St, ABC Square   11111       USA
1   Dallas  TX     456, DEF Plaza, Test St   99999       USA
Any inputs on how to achieve this using PySpark? The dataset is huge (several TBs), so I want to do this in an efficient way.
I tried splitting the address string on commas, but since there are commas within the addresses as well, the output is not as expected. I guess I need to use a regular expression pattern with the braces but am not sure how. Moreover, how do I go about denormalizing the data?
# Data
from pyspark.sql.functions import *

df = spark.createDataFrame([(1, '{"city":"New York","state":"NY","street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}')],
                           ('id', 'addresses'))
df.show(truncate=False)

# pass the string column to an RDD to extract the schema
rdd = df.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
newschema = spark.read.json(rdd).schema

# apply the schema to the string column using from_json, selecting it alongside the original columns
df3 = df.select("*", from_json("addresses", newschema).alias("test_col"))
df3.select('id', 'test_col.*').show(truncate=False)
+---+--------+-------+----------+-----+------------------------+
|id |city    |country|postalCode|state|street                  |
+---+--------+-------+----------+-----+------------------------+
|1  |New York|USA    |11111     |NY   |123, ABC St, ABC Square |
+---+--------+-------+----------+-----+------------------------+
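Note that in the question the addresses column actually holds a JSON array of address objects (note the square brackets), so to denormalize into one row per address you could parse it with an explicit ArrayType schema and explode. A minimal sketch under that assumption, assuming an active spark session:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# schema for a single address, wrapped in ArrayType because the column is a JSON array
addr_schema = ArrayType(StructType([
    StructField("city", StringType()),
    StructField("state", StringType()),
    StructField("street", StringType()),
    StructField("postalCode", StringType()),
    StructField("country", StringType()),
]))

df = spark.createDataFrame(
    [(1, '[{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]')],
    ('id', 'addresses'))

# one output row per address: id, city, state, street, postalCode, country
df.withColumn("addr", F.explode(F.from_json("addresses", addr_schema))) \
  .select("id", "addr.*") \
  .show(truncate=False)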

Can I specify column names when creating a DataFrame?

My data is in a CSV file. The file doesn't have a header row:
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
If I read it, Spark creates names for the columns automatically.
scala> val data = spark.read.csv("./data/flight-data/csv/2015-summary.csv")
data: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]
Is it possible to provide my own names for the columns when reading the file if I don't want to use _c0, _c1? For example, I want Spark to use DEST, ORIG, and count as the column names. I don't want to add a header row to the CSV to do this.
Yes you can. You can use the toDF function of the DataFrame:
val data = spark.read.csv("./data/flight-data/csv/2015-summary.csv").toDF("DEST", "ORIG", "count")
It's better to define a schema (StructType) first, then load the CSV data using that schema.
Here is how to define the schema:
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("DEST", StringType, true),
  StructField("ORIG", StringType, true),
  StructField("count", IntegerType, true)
))
Load the dataframe:
val df = spark.read.schema(schema).csv("./data/flight-data/csv/2015-summary.csv")
Hopefully it'll help you.

Using DataFrames instead of Spark SQL for data analysis

Below is the sample Spark SQL I wrote to get the count of males and females enrolled in an agency. I used SQL to generate the output.
Is there a way to do a similar thing using only the DataFrame API, not SQL?
val districtWiseGenderCountDF = hiveContext.sql("""
| SELECT District,
| count(CASE WHEN Gender='M' THEN 1 END) as male_count,
| count(CASE WHEN Gender='F' THEN 1 END) as FEMALE_count
| FROM agency_enrollment
| GROUP BY District
| ORDER BY male_count DESC, FEMALE_count DESC
| LIMIT 10""".stripMargin)
Starting with Spark 1.6, you can use pivot + group by to achieve what you'd like.
Without sample data (and without access to Spark > 1.5 myself), here's a solution that should work (not tested):
val df = hiveContext.table("agency_enrollment")
df.groupBy("District").pivot("Gender").count()
See How to pivot DataFrame? for a generic example.

Join two files in PySpark without using Spark SQL/DataFrames

I have two files, customer and sales, like below.
Customer :
cu_id name region city state
1 Rahul ME Vizag AP
2 Raghu SE HYD TS
3 Rohith ME BNLR KA
Sales:
sa_id sales country
2 100000 IND
3 230000 USA
4 240000 UK
Both the files are \t delimited.
I want to join both files based on cu_id from customer and sa_id from sales, using PySpark, without using Spark SQL/DataFrames.
Your help is very much appreciated.
You can definitely use the join methods that Spark offers for working with RDDs.
You can do something like:
customerRDD = sc.textFile("customers.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
salesRDD = sc.textFile("sales.tsv").map(lambda row: (row.split('\t')[0], "\t".join(row.split('\t')[1:])))
joinedRDD = customerRDD.join(salesRDD)
And you will get a new RDD that contains only the joined records from both the customer and sales files.
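As a small usage sketch (reusing the joinedRDD above), each joined element is a pair of the form (cu_id, (customer_rest, sales_rest)); you can flatten it back into tab-separated lines if needed, assuming the header lines are removed or simply never match:
# each element looks like ('2', ('Raghu\tSE\tHYD\tTS', '100000\tIND'))
flat = joinedRDD.map(lambda kv: "\t".join([kv[0], kv[1][0], kv[1][1]]))
print(flat.take(2))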
