Join two files in Pyspark without using sparksql/dataframes - apache-spark

I have two files customer and sales like below
Customer :
cu_id name region city state
1 Rahul ME Vizag AP
2 Raghu SE HYD TS
3 Rohith ME BNLR KA
Sales:
sa_id sales country
2 100000 IND
3 230000 USA
4 240000 UK
Both files are tab (\t) delimited.
I want to join the two files on cu_id from customer and sa_id from sales, using PySpark without Spark SQL/DataFrames.
Your help is very much appreciated.

You can definitely use the join methods that Spark offers for working with RDDs.
You can do something like:
# Key each line by its first tab-separated field (cu_id / sa_id),
# keeping the remaining fields as a single tab-joined string.
def key_by_id(row):
    fields = row.split('\t')
    return (fields[0], '\t'.join(fields[1:]))

customerRDD = sc.textFile("customers.tsv").map(key_by_id)
salesRDD = sc.textFile("sales.tsv").map(key_by_id)
joinedRDD = customerRDD.join(salesRDD)
And you will get a new RDD that contains only the joined records, i.e. the rows whose key appears in both the customer and sales files.
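For reference, each element of joinedRDD is a pair of the form (id, (customer_fields, sales_fields)). A minimal sketch of flattening that back into a single tab-separated line and saving it (the output path is just an illustration):
# Example element: ('2', ('Raghu\tSE\tHYD\tTS', '100000\tIND'))
flattened = joinedRDD.map(lambda kv: '\t'.join([kv[0], kv[1][0], kv[1][1]]))
flattened.saveAsTextFile("joined_output")  # hypothetical output directory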

Related

Replacing Null Values with Mean Value of the Column in Grid DB

I was working with the GridDB Node.js Connector, and I know the query that finds the rows with null values:
SELECT * FROM employees where employee_salary = NaN;
But I want to replace the null values in that column with the column's mean value, in order to maintain data consistency for analysis. How do I do that in GridDB?
The Employee table looks like the following:
employee_id employee_salary first_name department
---------------+---------------+--------------+--------------
0 John Sales
1 60000 Lisa Development
2 45000 Richard Sales
3 50000 Lina Marketing
4 55000 Anderson Development

In Spark, how to repartition large data frames/RDDs such that shuffling while joining is reduced as much as possible?

I have one table called A which has a column called id. I have another table called B which has a column called a_id that refers to table A's id column (a foreign key). I want to join these two tables on this column, but both tables are quite large (table A is ~1 TB and table B is ~4.2 TB). So my plan is to reduce shuffling as much as possible to increase efficiency, and to do that I am trying to partition the data frames so that rows with the same value in the two join columns reside in the same partition, as in the following example.
Example tables:
A:
id category
1 tennis
2 cricket
B:
name a_id
Roger 1
Novak 1
Lara 2
Ricky 2
What I want to achieve after partition:
Partition 1:
A:
id category
1 tennis
B:
name a_id
Roger 1
Novak 1
Partition 2:
A:
id category
2 cricket
B:
name a_id
Lara 2
Ricky 2
I was trying to achieve this by repartitioning A on column id and repartitioning B on column a_id after loading the tables. I checked this on a small test dataset and it seemed to produce what I wanted.
But on the full dataset the Spark job does a lot of shuffling during the join and gets stuck. So, what is the most efficient way of doing this?
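For reference, a minimal PySpark sketch of the repartition-then-join approach described above (the file paths, format, and partition count are illustrative assumptions, not part of the original question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative inputs; the real tables are ~1 TB and ~4.2 TB.
A = spark.read.parquet("/data/table_a")   # columns: id, category
B = spark.read.parquet("/data/table_b")   # columns: name, a_id

num_partitions = 2000  # tune to the cluster and data volume
A_part = A.repartition(num_partitions, "id")
B_part = B.repartition(num_partitions, "a_id")

joined = A_part.join(B_part, A_part["id"] == B_part["a_id"])
Note that repartition is itself a shuffle, and the subsequent join may still shuffle unless Spark can reuse the existing partitioning; pre-writing both tables bucketed by the join key (DataFrameWriter.bucketBy) is a commonly used way to avoid the shuffle at join time.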

How do I select only a specific column from a Dataset after sorting it

I have the following table:
DEST_COUNTRY_NAME ORIGIN_COUNTRY_NAME count
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I want to sort the table by the count column and see only the count column. I have done it, but in 2 steps:
1- I first sort to get a sorted DS - dataDS.sort(col("count").desc)
2- then select on that DS - (dataDS.sort(col("count").desc)).select(col("count")).show();
The above feels like an embedded SQL query to me. In SQL, however, I can do the same query without using an embedded query:
select * from flight_data_2015 ORDER BY count ASC
Is there a better way for me to both sort and select without creating a new Dataset?
There is nothing wrong with:
(dataDS.sort(col("count").desc)).select(col("count")).show();
It is the right thing to do and has no negative performance implications, other than the intrinsic cost of the sort itself.
Use it freely and don't worry about it anymore.

Incremental updates in Spark SQL

What is the most efficient way to append incremental updates in Spark SQL in Scala?
I have an employee dataframe E1 which is archived with primary key empId.
I also have a latest employee dataframe, and I want to write only the updated, new, and deleted rows back to the archival dataframe.
For example:
Employee archived:
EmpId, EmpName
1 Tom
2 Harry
Employee recent:
EmpId, EmpName
2 Harry Lewis
3 Hermoine
Difference should return:
EmpId, EmpName, deleted
1 Tom yes
2 Harry Lewis no
3 Hermoine no
If you only wanted to find updated or new rows, you could use except; however, since the deleted rows should also be present, it is a bit more complicated. Assuming E1 is the archived employee dataframe and E2 is the recent one, you can use a full join in Scala as follows:
import org.apache.spark.sql.functions.{coalesce, when}
import spark.implicits._ // for the $"colName" column syntax

E1.withColumnRenamed("EmpName", "EmpNameOld")
  .join(E2, Seq("EmpId"), "fullouter") // keep rows that exist on either side
  .where($"EmpName".isNull || $"EmpNameOld".isNull || $"EmpName" =!= $"EmpNameOld")
  .withColumn("deleted", when($"EmpName".isNull, "yes").otherwise("no"))
  .withColumn("EmpName", coalesce($"EmpName", $"EmpNameOld"))
  .drop("EmpNameOld")
This will give you the wanted result, containing updated rows, new rows and deleted rows.

PySpark Job to analyse sales data

I am trying to write a Spark job in Python. I have two CSV files containing the following information:
File-1) product_prices.csv
product1 10
product2 20
product3 30
File-2) Sales_information.csv
id buyer transaction_date seller sales_data
1 buyer1 2015-1-01 seller1 {"product1":12,"product2":44}
2 buyer2 2015-1-01 seller3 {"product2":12}
3 buyer1 2015-1-01 seller3 {"product3":60,"product1":42}
4 buyer3 2015-1-01 seller2 {"product2":9,"product3":2}
5 buyer3 2015-1-01 seller1 {"product2":8}
Now, using the above two files, I want to run a Spark job to compute two things and output the results to CSV files:
1) Total sales for each seller, written to a total_sellers_sales.csv file as
`seller_id total_sales`
`seller1 1160`
2) Output the buyers list for each seller to sellers_buyers_list.csv as follows:
seller_id buyers
seller1 buyer1, buyer3
So can anyone tell me the correct way to write this Spark job?
Note: I need the code in Python.
Here is my PySpark code in Zeppelin 0.7.2.
First I created your sample dataframes manually:
from pyspark.sql.functions import *
from pyspark.sql import functions as F
products = [ ("product1", 10), ("product2",20), ("product3",30)]
dfProducts = sqlContext.createDataFrame(products, ['product', 'price'])
sales = [(1, "buyer1", "seller1", "product1", 12), (1, "buyer1", "seller1", "product2", 44),
         (2, "buyer2", "seller3", "product2", 12), (3, "buyer1", "seller3", "product3", 60),
         (3, "buyer1", "seller3", "product1", 42), (4, "buyer3", "seller2", "product2", 9),
         (4, "buyer3", "seller2", "product3", 2), (5, "buyer3", "seller1", "product2", 8)]
dfSales= sqlContext.createDataFrame(sales, ['id', 'buyer', 'seller','product','countVal'])
Total sales for each seller:
(dfProducts.alias('p')
    .join(dfSales.alias('s'), col('p.product') == col('s.product'))
    .groupBy('s.seller')
    .agg(F.sum(dfSales.countVal * dfProducts.price))
    .show())
Output:
Buyers list for each seller:
dfSales.groupBy("seller").agg(F.collect_set("buyer")).show()
Output:
You can save the results as CSV using the df.write.csv('filename.csv') method.
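For example, a minimal sketch of writing both results out with named columns (the output paths and coalesce(1) are just illustrative choices):
totalSales = dfSales.alias('s') \
    .join(dfProducts.alias('p'), col('s.product') == col('p.product')) \
    .groupBy('s.seller') \
    .agg(F.sum(col('s.countVal') * col('p.price')).alias('total_sales'))
totalSales.coalesce(1).write.csv('total_sellers_sales.csv', header=True)

buyersPerSeller = dfSales.groupBy('seller') \
    .agg(F.concat_ws(', ', F.collect_set('buyer')).alias('buyers'))
buyersPerSeller.coalesce(1).write.csv('sellers_buyers_list.csv', header=True)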
Hope this helps.
