I am trying to concatenate the values of matching columns from two data frames into a single data frame.
For example:
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen  |           |NY2    |103|2f34
Since both data frames have the same column names, I renamed the columns of df1:
new_col = [c + '_r' for c in df1.columns]
df1 = df1.toDF(*new_col)
joined_df = df1.join(df2, df1.id_r == df2.id, "inner")
+------+------------+-------+----+------+-----+----------+-----+---+----+
|name_r|department_r|state_r|id_r|hash_r|name |department|state|id |hash|
+------+------------+-------+----+------+-----+----------+-----+---+----+
|James |Sales       |NY     |101 |c123  |James|Sales1    |null |101|4df2|
|Maria |Finance     |CA     |102 |d234  |Maria|Finance   |     |102|5rfg|
|Jen   |Marketing   |NY     |103 |df34  |Jen  |          |NY2  |103|2f34|
+------+------------+-------+----+------+-----+----------+-----+---+----+
So now I am trying to concatenate the values of the matching columns and create a single data frame:
from pyspark.sql.functions import col, concat, lit
from pyspark.sql.types import StructType

combined_df = spark.createDataFrame([], StructType([]))
for col1 in df1.columns:
    for col2 in df2.columns:
        if col1[:-2] == col2:
            joindf = joined_df.select(concat(lit('['), col(col1), lit(','), col(col2), lit(']')).alias("arraycol" + col2))
            col_to_select = "arraycol" + col2
            filtered_df = joindf.select(col_to_select)
            renamed_df = filtered_df.withColumnRenamed(col_to_select, col2)
            renamed_df.show()
            if combined_df.count() < 0:
                combined_df = renamed_df
            else:
                combined_df = combined_df.rdd.zip(renamed_df.rdd).map(lambda x: x[0] + x[1])
new_combined_df = spark.createDataFrame(combined_df, df2.schema)
new_combined_df.show()
but it returns an error:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob: Can only zip RDDs with same number of elements in each partition
Inside the loop I can see that renamed_df.show() produces the expected column with values, e.g.:
renamed_df.show()
+-----------------+
|name             |
+-----------------+
|['James','James']|
|['Maria','Maria']|
|['Jen','Jen']    |
+-----------------+
But I am expecting to create a combined df as shown below:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
Any solution to this?
You actually want to use collect_list to do this. Gather all the data in one data frame, then group it so that collect_list can be applied:
from pyspark.sql import functions as f
from pyspark.sql.functions import col

union_all = df1.unionByName(df2, allowMissingColumns=True)

myArray = []
for myCol in union_all.columns:
    myArray += [f.collect_list(myCol)]

# "temp_name" is only used for grouping and is dropped again afterwards
union_all.withColumn("temp_name", col("id"))\
    .groupBy("temp_name")\
    .agg(*myArray)\
    .drop("temp_name")
If you only want unique values you can use collect_set instead.
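For reference, here is a minimal end-to-end sketch of this approach using the sample rows from the question (an illustration of mine, not the original poster's code: it builds the two data frames by hand and groups on id directly instead of the temp_name copy; note that collect_list silently drops nulls):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("James", "Sales", "NY", "101", "c123"),
     ("Maria", "Finance", "CA", "102", "d234"),
     ("Jen", "Marketing", "NY", "103", "df34")],
    ["name", "department", "state", "id", "hash"])
df2 = spark.createDataFrame(
    [("James", "Sales1", None, "101", "4df2"),
     ("Maria", "Finance", "", "102", "5rfg"),
     ("Jen", "", "NY2", "103", "2f34")],
    ["name", "department", "state", "id", "hash"])

union_all = df1.unionByName(df2, allowMissingColumns=True)

# collect every non-id column into a list per id
combined = union_all.groupBy("id").agg(
    *[f.collect_list(c).alias(c) for c in union_all.columns if c != "id"])
combined.show(truncate=False)
# each column now holds a list such as [James, James] or [Sales, Sales1];
# because collect_list skips nulls, James' state comes back as [NY] rather than [NY, null]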
Related
I have the below data frame, in which I am trying to create a new column by concatenating the columns named in a list.
df=
----------------------------------
| name| department| state| id| hash
------+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
from pyspark.sql.functions import concat

key_list = ['name', 'state', 'id']
df.withColumn('prim_key', concat(*key_list))
df.show()
But the above returns the same result:
----------------------------------
| name| department| state| id| hash
------+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
I suspected it might be due to spaces in the column names of the DF, so I used trim to remove all spaces from the column names, but no luck; it returns the same result.
Any solution to this?
I found it... the issue was that I wasn't assigning the result back to a new or existing data frame:
df = df.withColumn('prim_key', concat(*key_list))
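One caveat worth adding here (my observation, not part of the original answer): concat returns NULL as soon as any of its inputs is NULL, and the state column in this data frame does contain nulls, so concat_ws (which skips nulls) may be the safer choice for building a key:

from pyspark.sql.functions import concat_ws

# concat_ws skips NULL inputs instead of turning the whole key into NULL
df = df.withColumn('prim_key', concat_ws('_', *key_list))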
I have the below joined data frame and I want to convert each Row's values into two dictionaries.
df=
+--------+----------+-----+----+-----+-----------+-------+---+
|name |department|state|id |name | department| state | id|
+--------+----------+-----+----+-----+-----------+-------+---+
|James |Sales |NY |101 |James| Sales1 |null |101|
|Maria |Finance |CA |102 |Maria| Finance | |102|
When I convert it to rows:
df.collect()
[Row(name='James', department='Sales', state='NY', id=101, name='James', department='Sales1', state=None, id=101),
 Row(name='Maria', department='Finance', state='CA', id=102, name='Maria', department='Finance', state='', id=102)]
I need to create two dictionaries from each Row. Each Row has repeated keys ('name', 'department', 'state', 'id') but different values, so I want two dictionaries per Row so that I can compare the differences between them.
#Expected:
[({'name': 'James', 'department': 'Sales', 'state': 'NY', 'id': 101}, {'name': 'James', 'department': 'Sales1', 'state': None, 'id': 101}),
 ({'name': 'Maria', 'department': 'Finance', 'state': 'CA', 'id': 102}, {'name': 'Maria', 'department': 'Finance', 'state': '', 'id': 102})]
Is there any other solution to this?
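One possible sketch (not from the original thread): because the joined Row objects carry duplicate field names, accessing the fields by name is ambiguous, but you can split each Row by position, assuming the first half of df.columns comes from the left side of the join:

cols = df.columns        # e.g. ['name', 'department', 'state', 'id', 'name', 'department', 'state', 'id']
half = len(cols) // 2

pairs = []
for row in df.collect():
    values = list(row)   # positional access sidesteps the duplicate names
    left = dict(zip(cols[:half], values[:half]))
    right = dict(zip(cols[half:], values[half:]))
    pairs.append((left, right))
# pairs now holds one (left_dict, right_dict) tuple per row, ready to be compared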
My df contains product names and corresponding information. Relevant here are the product name and the countries it is sold to:
+--------------------+-------------------------+
| Product_name|collect_set(Countries_en)|
+--------------------+-------------------------+
| null| [Belgium,United K...|
| #5 pecan/almond| [Belgium]|
| #8 mango/strawberry| [Belgium]|
|& Sully A Mild Th...| [Belgium,France]|
|"70CL Liqueu...| [Belgium,France]|
|"Gingembre&q...| [Belgium]|
|"Les Schtrou...| [Belgium,France]|
|"Sho-key&quo...| [Belgium]|
|"mini Chupa ...| [Belgium,France]|
| 'S Lands beste| [Belgium]|
|'T vlierbos confi...| [Belgium]|
|(H)eat me - Spagh...| [Belgium]|
| -cheese flips| [Belgium]|
| .soupe cerfeuil| [Belgium]|
|1 1/2 Minutes Bas...| [Belgium,Luxembourg]|
| 1/2 Reblochon AOP| [Belgium]|
| 1/2 nous de jambon| [Belgium]|
|1/2 tarte cerise ...| [Belgium]|
|10 Original Knack...| [Belgium,France,S...|
| 10 pains au lait| [Belgium,France]|
+--------------------+-------------------------+
sample input data:
[Row(code=2038002038.0, Product_name='Formula 2 men multi vitaminic', Countries_en='France,Ireland,Italy,Mexico,United States,Argentina-espanol,Armenia-pyсский,Aruba-espanol,Asia-pacific,Australia-english,Austria-deutsch,Azerbaijan-русский,Belarus-pyсский,Belgium-francais,Belgium-nederlands,Bolivia-espanol,Bosnia-i-hercegovina-bosnian,Botswana-english,Brazil-portugues,Bulgaria-български,Cambodia-english,Cambodia-ភាសាខ្មែរ,Canada-english,Canada-francais,Chile-espanol,China-中文,Colombia-espanol,Costa-rica-espanol,Croatia-hrvatski,Cyprus-ελληνικά,Czech-republic-čeština,Denmark-dansk,Ecuador-espanol,El-salvador-espanol,Estonia-eesti,Europe,Finland-suomi,France-francais,Georgia-ქართული,Germany-deutsch,Ghana-english,Greece-ελληνικά,Guatemala-espanol,Honduras-espanol,Hong-kong-粵語,Hungary-magyar,Iceland-islenska,India-english,Indonesia-bahasa-indonesia,Ireland-english,Israel-עברית,Italy-italiano,Jamaica-english,Japan-日本語,Kazakhstan-pyсский,Korea-한국어,Kyrgyzstan-русский,Latvia-latviešu,Lebanon-english,Lesotho-english,Lithuania-lietuvių,Macau-中文,Malaysia-bahasa-melayu,Malaysia-english,Malaysia-中文,Mexico-espanol,Middle-east-africa,Moldova-roman,Mongolia-монгол-хэл,Namibia-english,Netherlands-nederlands,New-zealand-english,Nicaragua-espanol,North-macedonia-македонски-јазик,Norway-norsk,Panama-espanol,Paraguay-espanol,Peru-espanol,Philippines-english,Poland-polski,Portugal-portugues,Puerto-rico-espanol,Republica-dominicana-espanol,Romania-romană,Russia-русский,Serbia-srpski,Singapore-english,Slovak-republic-slovenčina,Slovenia-slovene,South-africa-english,Spain-espanol,Swaziland-english,Sweden-svenska,Switzerland-deutsch,Switzerland-francais,Taiwan-中文,Thailand-ไทย,Trinidad-tobago-english,Turkey-turkce,Ukraine-yкраї́нська,United-kingdom-english,United-states-english,United-states-espanol,Uruguay-espanol,Venezuela-espanol,Vietnam-tiếng-việt,Zambia-english', Traces_en=None, Additives_tags=None, Main_category_en='Vitamins', Image_url='https://static.openfoodfacts.org/images/products/203/800/203/8/front_en.12.400.jpg', Quantity='60 compresse', Packaging_tags='barattolo,tablet', )]
Since I want to explore which countries the products are sold to besides Belgium, I split the country column to show every country individually using the code below.
#create df with grouped products
countriesDF = productsDF\
    .select("Product_name", "Countries_en")\
    .groupBy("Product_name")\
    .agg(F.collect_set("Countries_en").cast("string").alias("Countries"))\
    .orderBy("Product_name")

#split df to show the countries the product is sold to in a separate column
countriesDF = countriesDF\
    .where(col("Countries") != "null")\
    .select("Product_name",
            F.split("Countries", ",").alias("Countries"),
            F.posexplode(F.split("Countries", ",")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "Product_name",
        F.concat(F.lit("Countries"), F.col("pos").cast("string")).alias("name"),
        F.expr("Countries[pos]").alias("val")
    )\
    .groupBy("Product_name").pivot("name").agg(F.first("val"))\
    .show()
However, this table now has over 400 columns for countries alone, which is not presentable. So my questions are:
am I doing the splitting / exploding correctly?
can I split the df so that I get the countries as column names (e.g. 'France' instead of 'countries1' etc.), counting the number of times the product is sold in that country?
Some sample data :
val sampledf = Seq(("p1","BELGIUM,GERMANY"),("p1","BELGIUM,ITALY"),("p1","GERMANY"),("p2","BELGIUM")).toDF("Product_name","Countries_en")
Transform to the required df:
val df = sampledf
  .withColumn("country_list", split(col("Countries_en"), ","))
  .select(col("Product_name"), explode(col("country_list")).as("country"))
+------------+-------+
|Product_name|country|
+------------+-------+
| p1|BELGIUM|
| p1|GERMANY|
| p1|BELGIUM|
| p1| ITALY|
| p1|GERMANY|
| p2|BELGIUM|
+------------+-------+
If you only need counts per country:
val countDF = df.groupBy("Product_name", "country").count()
countDF.show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
|          p1|BELGIUM|    2|
|          p1|GERMANY|    2|
|          p1|  ITALY|    1|
|          p2|BELGIUM|    1|
+------------+-------+-----+
Except Belgium:
countDF.filter(col("country") =!= "BELGIUM").show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
|          p1|GERMANY|    2|
|          p1|  ITALY|    1|
+------------+-------+-----+
And if you really want countries as columns:
countDF.groupBy("Product_name").pivot("country").agg(first("count")).show()
+------------+-------+-------+-----+
|Product_name|BELGIUM|GERMANY|ITALY|
+------------+-------+-------+-----+
|          p2|      1|   null| null|
|          p1|      2|      2|    1|
+------------+-------+-------+-----+
And you can .drop("BELGIUM") to remove that column.
Final code used:
#create df where countries are split off
df = productsDF\
    .withColumn("country_list", split(col("Countries_en"), ","))\
    .select(col("Product_name"), explode(col("country_list")).alias("Country"))

#create the counts and filter out Belgium; the product name can be changed as needed
countDF = df.groupBy("Product_name", "Country").count()
countDF.filter(col("Country") != "Belgium")\
    .filter(col("Product_name") == 'Café')\
    .show()
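For completeness, a rough Python sketch of the pivot step from the answer above (my addition; it assumes pyspark.sql.functions is imported as F as in the earlier snippets, and pivotDF is just an illustrative name). It turns each country into its own column of per-product counts, so Belgium can simply be dropped:

pivotDF = countDF.groupBy("Product_name")\
    .pivot("Country")\
    .agg(F.first("count"))\
    .drop("Belgium")
pivotDF.show()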
I am trying to match multiple columns from one data frame (df) to a multi-language dictionary (df_label) and extract the corresponding label for each column.
Note: This is not a duplicate of Join multiple columns from one table to single column from another table
The following is an example of df and df_label dataframes and the desired output
df df_label output
+---+---+ +---+-----+----+ +---+---+------+------+------+
| s| o| | e| name|lang| | s| o|s_name|o_name| lang|
+---+---+ +---+-----+----+ +---+---+------+------+------+
| s1| o1| | s1|s1_en| en| | s2| o1| s2_fr| o1_fr| fr|
| s1| o3| | s1|s1_fr| fr| | s1| o1| s1_fr| o1_fr| fr|
| s2| o1| | s2|s2_fr| fr| | s1| o1| s1_en| o1_en| en|
| s2| o2| | o1|o1_fr| fr| | s2| o2| s2_fr| o2_fr| fr|
+---+---+ | o1|o1_en| en| +---+---+------+------+------+
| o2|o2_fr| fr|
+---+-----+----+
In other words, I want to match both columns [s, o] from df with column e from df_label and find their corresponding labels in different languages, as shown above.
The multi-language dictionary (df_label) is huge and columns [s, o] have many duplicates, so two join operations would be highly inefficient.
Is there any way this could be achieved without multiple joins?
FYI, this is what I did using multiple joins, but I really don't like it:
from pyspark.sql.functions import col

df = spark.createDataFrame([('s1','o1'),('s1','o3'),('s2','o1'),('s2','o2')]).toDF('s','o')
df_label = spark.createDataFrame([('s1','s1_en','en'),('s1','s1_fr','fr'),('s2','s2_fr','fr'),('o1','o1_fr','fr'),('o1','o1_en','en'),('o2','o2_fr','fr')]).toDF('e','name','lang')
df = df.join(df_label, col('s') == col('e')).drop('e').withColumnRenamed('name', 's_name').withColumnRenamed('lang', 's_lang')
df = df.join(df_label, col('o') == col('e')).drop('e').withColumnRenamed('name', 'o_name').filter(col('lang') == col('s_lang')).drop('s_lang').select('s', 'o', 's_name', 'o_name', 'lang')
Building on what gaw suggested, this is my proposed solution.
The approach is to use only one join, and then a conditional collect_list aggregation to check whether the match was for the s column or the o column.
from pyspark.sql.functions import col, when, collect_list, explode

df = spark.createDataFrame([('s1','o1'),('s1','o3'),('s2','o1'),('s2','o2')]).toDF('s','o')
df_label = spark.createDataFrame([('s1','s1_en','en'),('s1','s1_fr','fr'),('s2','s2_fr','fr'),('o1','o1_fr','fr'),('o1','o1_en','en'),('o2','o2_fr','fr')]).toDF('e','name','lang')

df.join(df_label, (col('e') == col('s')) | (col('e') == col('o'))) \
    .groupBy(['s', 'o', 'lang']) \
    .agg(collect_list(when(col('e') == col('s'), col('name'))).alias('s_name'),
         collect_list(when(col('e') == col('o'), col('name'))).alias('o_name')) \
    .withColumn('s_name', explode('s_name')) \
    .withColumn('o_name', explode('o_name')).show()
+---+---+----+------+------+
| s| o|lang|s_name|o_name|
+---+---+----+------+------+
| s2| o2| fr| s2_fr| o2_fr|
| s1| o1| en| s1_en| o1_en|
| s1| o1| fr| s1_fr| o1_fr|
| s2| o1| fr| s2_fr| o1_fr|
+---+---+----+------+------+
I created a way which works with only one join, but since it uses additional (expensive) operations like explode etc. I am not sure if it is faster.
But if you like you could give it a try.
The following code produces the desired output:
from pyspark.sql.functions import col, when, collect_list, explode

df = spark.createDataFrame([('s1','o1'),('s1','o3'),('s2','o1'),('s2','o2')]).toDF('s','o')
df_label = spark.createDataFrame([('s1','s1_en','en'),('s1','s1_fr','fr'),('s2','s2_fr','fr'),('o1','o1_fr','fr'),('o1','o1_en','en'),('o2','o2_fr','fr')]).toDF('e','name','lang')

df = (
    df.join(df_label, (col('s') == col('e')) | (col('o') == col('e'))).drop('e')
    # combine the two join conditions
    .withColumn("o_name", when(col("name").startswith("o"), col("name")).otherwise(None))
    .withColumn("s_name", when(col("name").startswith("s"), col("name")).otherwise(None))
    # create the o_name and s_name cols
    .groupBy("s", "o").agg(collect_list("o_name").alias("o_name"),
                           collect_list("s_name").alias("s_name"))
    # perform a group to aggregate the required values
    .select("s", "o", explode("o_name").alias("o_name"), "s_name")
    # explode the lists from the group to attach them to the correct pairs of o and s
    .select("s", "o", explode("s_name").alias("s_name"), "o_name")
    .withColumn("o_lang", col("o_name").substr(-2, 2))
    .withColumn("lang", col("s_name").substr(-2, 2))
    # manually create the o_lang and lang columns
    .filter(col("o_lang") == col("lang")).drop("o_lang")
)
Result:
+---+---+------+------+----+
|s |o |s_name|o_name|lang|
+---+---+------+------+----+
|s2 |o2 |s2_fr |o2_fr |fr |
|s2 |o1 |s2_fr |o1_fr |fr |
|s1 |o1 |s1_fr |o1_fr |fr |
|s1 |o1 |s1_en |o1_en |en |
+---+---+------+------+----+
I have two dataframes: a small one with IDs and a large one (6 billion rows, with id and trx_id). I want all the transactions from the large table whose customer ID appears in the small one. For example:
df1:
+------+
|userid|
+------+
| 348|
| 567|
| 595|
+------+
df2:
+------+----------+
|userid| trx_id |
+------+----------+
| 348| 287|
| 595| 288|
| 348| 311|
| 276| 094|
| 595| 288|
| 148| 512|
| 122| 514|
| 595| 679|
| 567| 870|
| 595| 889|
+------+----------+
Result I want:
+------+----------+
|userid| trx_id |
+------+----------+
| 348| 287|
| 595| 288|
| 348| 311|
| 595| 288|
| 595| 679|
| 567| 870|
| 595| 889|
+------+----------+
Should I use a join or a filter? If so, which command?
If the small dataframe can fit into memory, you can do a broadcast join. That means that the small dataframe will be broadcasted to all executor nodes and the subsequent join can be done efficiently without any shuffle.
You can hint that a dataframe should be broadcasted using broadcast:
df2.join(broadcast(df1), "userid", "inner")
Note that the join method is called on the larger dataframe.
If the first dataframe is even smaller (~100 rows or less), filter would be a viable, faster option. The idea is to collect the dataframe and convert it to a list, then use isin to filter the large dataframe. This should be faster as long as the data is small enough.
val userids = df1.as[Int].collect()
df2.filter($"userid".isin(userids:_*))