spark program to find the city with maximum population [duplicate] - apache-spark

This question already has answers here:
Find maximum row per group in Spark DataFrame
(2 answers)
Closed 1 year ago.
Input files contains rows like below (state,city,population):
west bengal,kolkata,150000
karnataka,bangalore,200000
karnataka,mangalore,80000
west bengal,bongaon,50000
delhi,new delhi,100000
delhi,gurgaon,200000
I have to write a Spark (Apache Spark) program in both Python and Scala to find the city with the maximum population in each state. The output should look like this:
west bengal,kolkata,150000
karnataka,bangalore,200000
delhi,new delhi,100000
So I need a three-column output for each state. It's easy for me to get output like this:
west bengal,150000
karnataka,200000
delhi,100000
But getting the city with the maximum population is proving difficult.

In vanilla PySpark, map your data to a pair RDD where the state is the key and the value is the pair (city, population). Then reduceByKey to keep the largest city. Beware: in the case of cities with the same population it will keep the first one it encounters.
rdd.map(lambda reg: (reg[0], [reg[1], reg[2]])) \
   .reduceByKey(lambda v1, v2: v1 if v1[1] >= v2[1] else v2)
The results with your data look like this:
[('delhi', ['gurgaon', 200000]),
('west bengal', ['kolkata', 150000]),
('karnataka', ['bangalore', 200000])]
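For completeness, here is a minimal end-to-end sketch of the same idea, assuming the input is a hypothetical text file cities.csv whose lines follow the state,city,population layout shown above (the file name and the int cast on population are assumptions, not part of the original answer):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# cities.csv is a hypothetical path; each line is "state,city,population".
result = (sc.textFile("cities.csv")
            .map(lambda line: line.split(","))
            .map(lambda r: (r[0], (r[1], int(r[2]))))            # (state, (city, population))
            .reduceByKey(lambda a, b: a if a[1] >= b[1] else b)  # keep the larger city per state
            .map(lambda kv: "%s,%s,%d" % (kv[0], kv[1][0], kv[1][1])))

for row in result.collect():
    print(row)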

This should do the trick:
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize([
['west bengal','kolkata',150000],
['karnataka','bangalore',200000],
['karnataka','mangalore',80000],
['west bengal','bongaon',50000],
['delhi','new delhi',100000],
['delhi','gurgaon',200000],
])
>>> df = rdd.toDF(['state','city','population'])
>>> df.show()
+-----------+---------+----------+
| state| city|population|
+-----------+---------+----------+
|west bengal| kolkata| 150000|
| karnataka|bangalore| 200000|
| karnataka|mangalore| 80000|
|west bengal| bongaon| 50000|
| delhi|new delhi| 100000|
| delhi| gurgaon| 200000|
+-----------+---------+----------+
>>> df.groupBy('city').max('population').show()
+---------+---------------+
| city|max(population)|
+---------+---------------+
|bangalore| 200000|
| kolkata| 150000|
| gurgaon| 200000|
|mangalore| 80000|
|new delhi| 100000|
| bongaon| 50000|
+---------+---------------+
>>> df.groupBy('state').max('population').show()
+-----------+---------------+
| state|max(population)|
+-----------+---------------+
| delhi| 200000|
|west bengal| 150000|
| karnataka| 200000|
+-----------+---------------+
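The groupBy above only returns the maximum population per state and drops the city. To keep the whole row, one common approach (a sketch in the DataFrame API, in the spirit of the linked duplicate rather than something shown in this answer) is to rank cities within each state with a window function:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("state").orderBy(F.col("population").desc())

(df.withColumn("rank", F.row_number().over(w))
   .filter(F.col("rank") == 1)   # keep only the most populous city per state
   .drop("rank")
   .show())

On the sample data this keeps gurgaon for delhi, bangalore for karnataka and kolkata for west bengal.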

Related

How to create a combined data frame from each column?

I am trying to concatenate the values of matching columns from two data frames into a single data frame.
For example:
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
Since both have the same column names, I renamed the columns of df1:
new_col = [c + '_r' for c in df1.columns]
df1 = df1.toDF(*new_col)
joined_df = df1.join(df2, df1.id_r == df2.id, "inner")
+------+------------+-------+----+------+-----+----------+-----+---+----+
|name_r|department_r|state_r|id_r|hash_r|name |department|state|id |hash|
+------+------------+-------+----+------+-----+----------+-----+---+----+
|James |Sales       |NY     |101 |c123  |James|Sales1    |null |101|4df2|
|Maria |Finance     |CA     |102 |d234  |Maria|Finance   |     |102|5rfg|
|Jen   |Marketing   |NY     |103 |df34  |Jen  |          |NY2  |103|2f34|
+------+------------+-------+----+------+-----+----------+-----+---+----+
So now I am trying to concatenate values of the same columns and create a single data frame:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col, concat, lit

combined_df = spark.createDataFrame([], StructType([]))
for col1 in df1.columns:
    for col2 in df2.columns:
        if col1[:-2] == col2:
            joindf = joined_df.select(concat(lit('['), col(col1), lit(','), col(col2), lit(']')).alias("arraycol" + col2))
            col_to_select = "arraycol" + col2
            filtered_df = joindf.select(col_to_select)
            renamed_df = filtered_df.withColumnRenamed(col_to_select, col2)
            renamed_df.show()
            if combined_df.count() < 0:
                combined_df = renamed_df
            else:
                combined_df = combined_df.rdd.zip(renamed_df.rdd).map(lambda x: x[0] + x[1])
new_combined_df = spark.createDataFrame(combined_df, df2.schema)
new_combined_df.show()
but it returns an error:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. Can only zip RDDs with same number of elements in each partition
Inside the loop, renamed_df.show() produces the expected column with values, e.g.:
renamed_df.show()
+-----------------+
|name             |
+-----------------+
|['James','James']|
|['Maria','Maria']|
|['Jen','Jen']    |
+-----------------+
but I am expecting to create a combined df as shown below:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales']    |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['102','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
Any solution to this?
You actually want to use collect_list to do this. Gather all the data into one data frame, then group it so we can use collect_list:
from pyspark.sql import functions as f
from pyspark.sql.functions import col

union_all = df1.unionByName(df2, allowMissingColumns=True)
myArray = []
for myCol in union_all.columns:
    myArray += [f.collect_list(myCol)]

(union_all.withColumn("temp_name", col("id"))   # extra copy of id, used only for grouping
          .groupBy("temp_name")
          .agg(*myArray)
          .drop("temp_name"))                   # clean up the extra grouping column
If you only want unique values you can use collect_set instead.
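A self-contained version of the same idea, using hypothetical miniature versions of df1 and df2 that share an id column (the sample rows are assumptions made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(101, "James", "Sales"), (102, "Maria", "Finance")],
                            ["id", "name", "department"])
df2 = spark.createDataFrame([(101, "James", "Sales1"), (102, "Maria", "Finance")],
                            ["id", "name", "department"])

union_all = df1.unionByName(df2, allowMissingColumns=True)
aggs = [f.collect_list(c).alias(c) for c in union_all.columns]

combined = (union_all.withColumn("temp_name", f.col("id"))   # grouping key, so id itself can be collected too
                     .groupBy("temp_name")
                     .agg(*aggs)
                     .drop("temp_name"))
combined.show(truncate=False)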

Show first occurrence(s) of a column

I want to use PySpark to create a new dataframe from the input that keeps only the first occurrence of each distinct value in the "value" column. Would row_number() or a window function work, or would Spark SQL be the better approach? Basically, the second table below is the output I want: just the first occurrence of each value from the input. If a value is repeated, only the first one seen should be shown.
+--------+--------+--------+
| VALUE| DAY | Color
+--------+--------+--------+
|20 |MON | BLUE|
|20 |TUES | BLUE|
|30 |WED | BLUE|
+--------+--------+--------+
+--------+--------+--------+
| VALUE| DAY | Color
+--------+--------+--------+
|20 |MON | BLUE|
|30 |WED | BLUE|
+--------+--------+--------+
Here's how I'd do this without using a window. It will likely perform better on large data sets, as it can use more of the cluster to do the work. In your case you would use 'VALUE' in place of Department and 'DAY' in place of Salary.
from pyspark.sql import SparkSession, Row
from pyspark.sql import functions as f

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
        ("Robert","Sales",4100),("Maria","Finance",3000),
        ("Raman","Finance",3000),("Scott","Finance",3300),
        ("Jen","Finance",3900),("Jeff","Marketing",3000),
        ("Kumar","Marketing",2000)]
df = spark.createDataFrame(data, ["Name","Department","Salary"])

unGroupedDf = df.select(
    df["Department"],
    f.struct(                              # make a struct with all the record elements
        df["Salary"].alias("Salary"),      # will be sorted on Salary first
        df["Department"].alias("Dept"),
        df["Name"].alias("Name")
    ).alias("record"))

(unGroupedDf.groupBy("Department")         # group
    .agg(f.collect_list("record")          # gather all the elements in a group
          .alias("record"))
    .select(
        f.reverse(                         # make the sort descending
            f.array_sort(                  # sort the array ascending
                f.col("record")            # the struct
            )
        )[0].alias("record"))              # grab the "max" element in the array
    .select(f.col("record.*"))             # use the struct fields as columns
    .show())
+---------+------+-------+
| Dept|Salary| Name|
+---------+------+-------+
| Sales| 4600|Michael|
| Finance| 3900| Jen|
|Marketing| 3000| Jeff|
+---------+------+-------+
It appears to me that you want to drop duplicated items by VALUE. If so, use dropDuplicates:
df.dropDuplicates(['VALUE']).show()
+-----+---+-----+
|VALUE|DAY|Color|
+-----+---+-----+
| 20|MON| BLUE|
| 30|WED| BLUE|
+-----+---+-----+
Here's how to do it with a window. This example uses salary; in your case I think you'd use 'DAY' for orderBy and 'VALUE' for partitionBy.
from pyspark.sql import SparkSession,Row
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James","Sales",3000),("Michael","Sales",4600),
("Robert","Sales",4100),("Maria","Finance",3000),
("Raman","Finance",3000),("Scott","Finance",3300),
("Jen","Finance",3900),("Jeff","Marketing",3000),
("Kumar","Marketing",2000)]
df = spark.createDataFrame(data,["Name","Department","Salary"])
df.show()
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
w2 = Window.partitionBy("department").orderBy(col("salary"))
df.withColumn("row",row_number().over(w2)) \
.filter(col("row") == 1).drop("row") \
.show()
+-----+----------+------+
| Name|Department|Salary|
+-----+----------+------+
|James|     Sales|  3000|
|Maria|   Finance|  3000|
|Kumar| Marketing|  2000|
+-----+----------+------+
Yes, you'd need to develop a way of ordering days, but I think you can see that it's possible and that you picked the right tool. I always like to warn people: this uses a window, and a window pulls all the data for a partition onto one executor to complete the work, which is not particularly efficient. On small datasets this is likely fine; on larger data sets it may take far too long to complete. A sketch of one way to make DAY orderable follows below.
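One simple option (an assumption on my part, not from the answer above) is to map the day names to an index with a when chain and use that expression in the window's orderBy:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample matching the question's table; column names are assumed.
df = spark.createDataFrame(
    [(20, "MON", "BLUE"), (20, "TUES", "BLUE"), (30, "WED", "BLUE")],
    ["VALUE", "DAY", "Color"])

# Give each day label a sortable index (only a few labels covered here for illustration).
day_idx = (F.when(F.col("DAY") == "MON", 1)
            .when(F.col("DAY") == "TUES", 2)
            .when(F.col("DAY") == "WED", 3)
            .when(F.col("DAY") == "THURS", 4)
            .when(F.col("DAY") == "FRI", 5)
            .otherwise(99))

w = Window.partitionBy("VALUE").orderBy(day_idx)

(df.withColumn("row", F.row_number().over(w))
   .filter(F.col("row") == 1)   # first occurrence of each VALUE, by day order
   .drop("row")
   .show())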

Join multiple columns from one data frame to a single column from another without multiple join operations, in pyspark

I am trying to match multiple columns from one data frame (df) against a multi-language dictionary (df_label) and extract the corresponding labels for each column.
Note: This is not a duplicate of Join multiple columns from one table to single column from another table.
The following is an example of df and df_label dataframes and the desired output
df df_label output
+---+---+ +---+-----+----+ +---+---+------+------+------+
| s| o| | e| name|lang| | s| o|s_name|o_name| lang|
+---+---+ +---+-----+----+ +---+---+------+------+------+
| s1| o1| | s1|s1_en| en| | s2| o1| s2_fr| o1_fr| fr|
| s1| o3| | s1|s1_fr| fr| | s1| o1| s1_fr| o1_fr| fr|
| s2| o1| | s2|s2_fr| fr| | s1| o1| s1_en| o1_en| en|
| s2| o2| | o1|o1_fr| fr| | s2| o2| s2_fr| o2_fr| fr|
+---+---+ | o1|o1_en| en| +---+---+------+------+------+
| o2|o2_fr| fr|
+---+-----+----+
In other words, I want to match both columns [s,o] from df with column e from df_label and find their corresponding labels in the different languages, as shown above.
The multi-language dictionary (df_label) is huge and the columns [s,o] have many duplicates, so two join operations would be highly inefficient.
Is there any way this could be achieved without multiple joins?
FYI, this is what I did using multiple joins, but I really don't like it:
from pyspark.sql.functions import col
df = spark.createDataFrame([('s1','o1'),('s1','o3'),('s2','o1'),('s2','o2')]).toDF('s','o')
df_label = spark.createDataFrame([('s1','s1_en','en'),('s1','s1_fr','fr'),('s2','s2_fr','fr'),('o1','o1_fr','fr'),('o1','o1_en','en'),('o2','o2_fr','fr')]).toDF('e','name','lang')
df = df.join(df_label, col('s') == col('e')).drop('e').withColumnRenamed('name', 's_name').withColumnRenamed('lang', 's_lang')
df = (df.join(df_label, col('o') == col('e')).drop('e').withColumnRenamed('name', 'o_name')
        .select('s', 'o', 's_name', 'o_name', 's_lang', 'o', 'o_name', 'lang')
        .withColumnRenamed('lang', 'o_lang')
        .filter(col('o_lang') == col('s_lang')).drop('s_lang'))
Building on what gaw suggested, this is my proposed solution.
The approach is to use only one join, and then a conditional collect_list aggregate to check whether the match was for the s column or the o column.
from pyspark.sql.functions import col, when, collect_list, explode
df = spark.createDataFrame([('s1','o1'),('s1','o3'),('s2','o1'),('s2','o2')]).toDF('s','o')
df_label = spark.createDataFrame([('s1','s1_en','en'),('s1','s1_fr','fr'),('s2','s2_fr','fr'),('o1','o1_fr','fr'),('o1','o1_en','en'),('o2','o2_fr','fr')]).toDF('e','name','lang')
(df.join(df_label, (col('e') == col('s')) | (col('e') == col('o')))
   .groupBy(['s', 'o', 'lang'])
   .agg(collect_list(when(col('e') == col('s'), col('name'))).alias('s_name'),
        collect_list(when(col('e') == col('o'), col('name'))).alias('o_name'))
   .withColumn('s_name', explode('s_name'))
   .withColumn('o_name', explode('o_name'))
   .show())
+---+---+----+------+------+
| s| o|lang|s_name|o_name|
+---+---+----+------+------+
| s2| o2| fr| s2_fr| o2_fr|
| s1| o1| en| s1_en| o1_en|
| s1| o1| fr| s1_fr| o1_fr|
| s2| o1| fr| s2_fr| o1_fr|
+---+---+----+------+------+
I created a way which works with only one join, but since it uses additional (expensive) operations like explode, I am not sure if it is faster. But if you like, you can give it a try.
The following code produces the desired output:
from pyspark.sql.functions import col, when, collect_list, explode
df = spark.createDataFrame([('s1','o1'),('s1','o3'),('s2','o1'),('s2','o2')]).toDF('s','o')
df_label = spark.createDataFrame([('s1','s1_en','en'),('s1','s1_fr','fr'),('s2','s2_fr','fr'),('o1','o1_fr','fr'),('o1','o1_en','en'),('o2','o2_fr','fr')]).toDF('e','name','lang')
df = (df.join(df_label, (col('s') == col('e')) | (col('o') == col('e'))).drop('e')             # combine the two join conditions
        .withColumn("o_name", when(col("name").startswith("o"), col("name")).otherwise(None))  # create the o_name ...
        .withColumn("s_name", when(col("name").startswith("s"), col("name")).otherwise(None))  # ... and s_name columns
        .groupBy("s", "o")                                                                     # group to aggregate the required values
        .agg(collect_list("o_name").alias("o_name"), collect_list("s_name").alias("s_name"))
        .select("s", "o", explode("o_name").alias("o_name"), "s_name")                         # explode the lists to attach them to the correct pairs of o and s
        .select("s", "o", explode("s_name").alias("s_name"), "o_name")
        .withColumn("o_lang", col("o_name").substr(-2, 2))                                     # manually create the o_lang and lang columns
        .withColumn("lang", col("s_name").substr(-2, 2))
        .filter(col("o_lang") == col("lang")).drop("o_lang"))
Result:
+---+---+------+------+----+
|s |o |s_name|o_name|lang|
+---+---+------+------+----+
|s2 |o2 |s2_fr |o2_fr |fr |
|s2 |o1 |s2_fr |o1_fr |fr |
|s1 |o1 |s1_fr |o1_fr |fr |
|s1 |o1 |s1_en |o1_en |en |
+---+---+------+------+----+

How to find the avg length of each column in pyspark? [duplicate]

This question already has an answer here:
Apply a transformation to multiple columns pyspark dataframe
(1 answer)
Closed 4 years ago.
I have created a data frame like below:
from pyspark.sql import Row
l = [('Ankit','25','Ankit','Ankit'),('Jalfaizy','2.2','Jalfaizy',"aa"),('saurabh','230','saurabh',"bb"),('Bala','26',"aa","bb")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], ages=x[1],lname=x[2],mname=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()
+----+--------+-----+--------+
|ages| lname|mname| name|
+----+--------+-----+--------+
| 25| Ankit|Ankit| Ankit|
| 2.2|Jalfaizy| aa|Jalfaizy|
| 230| saurabh| bb| saurabh|
| 26| aa| bb| Bala|
+----+--------+-----+--------+
I want to find the average length of each column, i.e. the total number of characters in a column divided by the number of rows. Below is my expected output:
+----+-----+-----+----+
|ages|lname|mname|name|
+----+-----+-----+----+
| 2.5|  5.5| 2.75|   6|
+----+-----+-----+----+
This is actually pretty straightforward. We will be using a projection for the column length and an aggregation for avg:
from pyspark.sql.functions import length, col, avg
selection = ['lname','mname','name']
schemaPeople \
.select(*(length(col(c)).alias(c) for c in selection)) \
.agg(*(avg(col(c)).alias(c) for c in selection)).show()
# +-----+-----+----+
# |lname|mname|name|
# +-----+-----+----+
# | 5.5| 2.75| 6.0|
# +-----+-----+----+
This way, you'll be able to pass the names of the columns dynamically.
What we are doing here is actually unpacking the argument list (selection).
Reference: Control Flow Tools - Unpacking Argument Lists.
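For clarity, the starred generator above expands into individual column arguments; spelled out by hand (a sketch using the same three columns), it is equivalent to:

from pyspark.sql.functions import length, col, avg

schemaPeople.select(
    length(col('lname')).alias('lname'),
    length(col('mname')).alias('mname'),
    length(col('name')).alias('name')
).agg(
    avg(col('lname')).alias('lname'),
    avg(col('mname')).alias('mname'),
    avg(col('name')).alias('name')
).show()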
I think you can just create new columns for the individual lengths and then aggregate the dataframe. You would end up with something like:
from pyspark.sql.functions import length, col, avg
df_new = spark.createDataFrame([
    ("25","Ankit","Ankit","Ankit"), ("2.2","Jalfaizy","aa","Jalfaizy"),
    ("230","saurabh","bb","saurabh"), ("26","aa","bb","Bala")
], ("age", "lname", "mname", "name"))
df_new.withColumn("len_age",length(col("age"))).withColumn("len_lname",length(col("lname")))\
.withColumn("len_mname",length(col("mname"))).withColumn("len_name",length(col("name")))\
.groupBy().agg(avg("len_age"),avg("len_lname"),avg("len_mname"),avg("len_name")).show()
Result:
+------------+--------------+--------------+-------------+
|avg(len_age)|avg(len_lname)|avg(len_mname)|avg(len_name)|
+------------+--------------+--------------+-------------+
| 2.5| 5.5| 2.75| 6.0|
+------------+--------------+--------------+-------------+
In Scala it can be done this way; I guess the author can convert it to Python:
import org.apache.spark.sql.functions.{avg, col, length}
val averageColumnList = List("age", "lname", "mname", "name")
val columns = averageColumnList.map(name => avg(length(col(name))))
val result = df.select(columns: _*)
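A direct Python conversion of that idea (a sketch, applied to the question's schemaPeople, so 'ages' replaces the Scala snippet's 'age'):

from pyspark.sql.functions import avg, col, length

average_column_list = ["ages", "lname", "mname", "name"]
columns = [avg(length(col(name))) for name in average_column_list]
result = schemaPeople.select(*columns)
result.show()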

Filter columns using column from another table

I have two dataframes: a small one with IDs and a large one (6 billion rows, with userid and trx_id). I want all the transactions from the large table whose customer ID appears in the small one. For example:
df1:
+------+
|userid|
+------+
| 348|
| 567|
| 595|
+------+
df2:
+------+----------+
|userid| trx_id |
+------+----------+
| 348| 287|
| 595| 288|
| 348| 311|
| 276| 094|
| 595| 288|
| 148| 512|
| 122| 514|
| 595| 679|
| 567| 870|
| 595| 889|
+------+----------+
Result I want:
+------+----------+
|userid| trx_id |
+------+----------+
| 348| 287|
| 595| 288|
| 348| 311|
| 595| 288|
| 595| 679|
| 567| 870|
| 595| 889|
+------+----------+
Should I use a join or filter? If so which command?
If the small dataframe can fit into memory, you can do a broadcast join. That means that the small dataframe will be broadcasted to all executor nodes and the subsequent join can be done efficiently without any shuffle.
You can hint that a dataframe should be broadcast using broadcast:
from pyspark.sql.functions import broadcast
df2.join(broadcast(df1), "userid", "inner")
Note that the join method is called on the larger dataframe.
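If you want to check that Spark actually chose a broadcast join, you can inspect the physical plan (a quick, optional sanity check; output not shown here):

from pyspark.sql.functions import broadcast

joined = df2.join(broadcast(df1), "userid", "inner")
joined.explain()   # the plan should contain a BroadcastHashJoin rather than a SortMergeJoin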
If the first dataframe is even smaller (~100 rows or less), a filter would be a viable, faster option. The idea is to collect the small dataframe into a list and then use isin to filter the large dataframe. This should be faster as long as the collected data is small enough.
import spark.implicits._
val userids = df1.as[Int].collect()
df2.filter($"userid".isin(userids:_*))
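The same idea in PySpark, to match the rest of this thread (a sketch; the collect-then-isin pattern is mine, not from the answer above):

from pyspark.sql import functions as F

userids = [row["userid"] for row in df1.select("userid").collect()]
df2.filter(F.col("userid").isin(userids)).show()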
