I am trying to concatenate the values of matching columns from two data frames into a single data frame.
For example:
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen  |           |NY2    |103|2f34
Since both data frames have the same column names, I renamed the columns of df1:
new_col = [c + '_r' for c in df1.columns]
df1 = df1.toDF(*new_col)
joined_df = df1.join(df2, df1.id_r == df2.id, "inner")
+------+------------+-------+----+------+-----+----------+-----+---+----+
|name_r|department_r|state_r|id_r|hash_r|name |department|state|id |hash|
+------+------------+-------+----+------+-----+----------+-----+---+----+
|James |Sales       |NY     |101 |c123  |James|Sales1    |null |101|4df2|
|Maria |Finance     |CA     |102 |d234  |Maria|Finance   |     |102|5rfg|
|Jen   |Marketing   |NY     |103 |df34  |Jen  |          |NY2  |103|2f34|
+------+------------+-------+----+------+-----+----------+-----+---+----+
So now I am trying to concatenate the values of the matching columns and create a single data frame:
from pyspark.sql.functions import col, concat, lit
from pyspark.sql.types import StructType

combined_df = spark.createDataFrame([], StructType([]))
for col1 in df1.columns:
    for col2 in df2.columns:
        if col1[:-2] == col2:
            joindf = joined_df.select(concat(lit('['), col(col1), lit(','), col(col2), lit(']')).alias("arraycol" + col2))
            col_to_select = "arraycol" + col2
            filtered_df = joindf.select(col_to_select)
            renamed_df = filtered_df.withColumnRenamed(col_to_select, col2)
            renamed_df.show()
            if combined_df.count() < 0:
                combined_df = renamed_df
            else:
                combined_df = combined_df.rdd.zip(renamed_df.rdd).map(lambda x: x[0] + x[1])
new_combined_df = spark.createDataFrame(combined_df, df2.schema)
new_combined_df.show()
but it returns an error:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob: Can only zip RDDs with same number of elements in each partition
Inside the loop I can see that renamed_df.show() produces the expected column with values, e.g.:
renamed_df.show()
+-----------------+
|name             |
+-----------------+
|['James','James']|
|['Maria','Maria']|
|['Jen','Jen']    |
+-----------------+
But I am expecting to create a combined df as shown below:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
Any solution to this?
You actually want to use collect_list to do this. Gather all the data in one data frame, then group it so that collect_list can be applied:
from pyspark.sql import functions as f
from pyspark.sql.functions import col

union_all = df1.unionByName(df2, allowMissingColumns=True)

myArray = []
for myCol in union_all.columns:
    myArray += [f.collect_list(myCol)]

# "temp_name" is only used for grouping and is dropped again afterwards
union_all.withColumn("temp_name", col("id"))\
    .groupBy("temp_name")\
    .agg(*myArray)\
    .drop("temp_name")
If you only want unique values you can use collect_set instead.
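For reference, here is a minimal end-to-end sketch of this approach using the sample rows from the question (an illustration of mine, not the original poster's code: it builds the two data frames by hand and groups on id directly instead of the temp_name copy; note that collect_list silently drops nulls):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [("James", "Sales", "NY", "101", "c123"),
     ("Maria", "Finance", "CA", "102", "d234"),
     ("Jen", "Marketing", "NY", "103", "df34")],
    ["name", "department", "state", "id", "hash"])
df2 = spark.createDataFrame(
    [("James", "Sales1", None, "101", "4df2"),
     ("Maria", "Finance", "", "102", "5rfg"),
     ("Jen", "", "NY2", "103", "2f34")],
    ["name", "department", "state", "id", "hash"])

union_all = df1.unionByName(df2, allowMissingColumns=True)

# collect every non-id column into a list per id
combined = union_all.groupBy("id").agg(
    *[f.collect_list(c).alias(c) for c in union_all.columns if c != "id"])
combined.show(truncate=False)
# each column now holds a list such as [James, James] or [Sales, Sales1];
# because collect_list skips nulls, James' state comes back as [NY] rather than [NY, null]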
Related
I have the below data frame, in which I am trying to create a new column by concatenating the columns named in a list.
df=
----------------------------------
| name| department| state| id| hash
------+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
from pyspark.sql.functions import concat

key_list = ['name', 'state', 'id']
df.withColumn('prim_key', concat(*key_list))
df.show()
But the above returns the same result:
----------------------------------
| name| department| state| id| hash
------+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
I suspected it might be due to spaces in the column names of the DF, so I used trim to remove all spaces from the column names, but no luck; it returns the same result.
Any solution to this?
I found it... the issue was that I wasn't assigning the result back to a new or existing data frame:
df = df.withColumn('prim_key', concat(*key_list))
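One caveat worth adding here (my observation, not part of the original answer): concat returns NULL as soon as any of its inputs is NULL, and the state column in this data frame does contain nulls, so concat_ws (which skips nulls) may be the safer choice for building a key:

from pyspark.sql.functions import concat_ws

# concat_ws skips NULL inputs instead of turning the whole key into NULL
df = df.withColumn('prim_key', concat_ws('_', *key_list))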
I have the below joined data frame and I want to convert each Row's values into two dictionaries.
df=
+--------+----------+-----+----+-----+-----------+-------+---+
|name |department|state|id |name | department| state | id|
+--------+----------+-----+----+-----+-----------+-------+---+
|James |Sales |NY |101 |James| Sales1 |null |101|
|Maria |Finance |CA |102 |Maria| Finance | |102|
When I convert it to rows:
df.collect()
[Row(name='James', department='Sales', state='NY', id=101, name='James', department='Sales1', state=None, id=101),
 Row(name='Maria', department='Finance', state='CA', id=102, name='Maria', department='Finance', state='', id=102)]
I need to create two dictionaries from each Row. Each Row has repeated keys ('name', 'department', 'state', 'id') but different values, so I want two dictionaries per Row so that I can compare the differences between them.
#Expected:
[({'name': 'James', 'department': 'Sales', 'state': 'NY', 'id': 101}, {'name': 'James', 'department': 'Sales1', 'state': None, 'id': 101}),
 ({'name': 'Maria', 'department': 'Finance', 'state': 'CA', 'id': 102}, {'name': 'Maria', 'department': 'Finance', 'state': '', 'id': 102})]
Is there any other solution to this?
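One possible sketch (not from the original thread): because the joined Row objects carry duplicate field names, accessing the fields by name is ambiguous, but you can split each Row by position, assuming the first half of df.columns comes from the left side of the join:

cols = df.columns        # e.g. ['name', 'department', 'state', 'id', 'name', 'department', 'state', 'id']
half = len(cols) // 2

pairs = []
for row in df.collect():
    values = list(row)   # positional access sidesteps the duplicate names
    left = dict(zip(cols[:half], values[:half]))
    right = dict(zip(cols[half:], values[half:]))
    pairs.append((left, right))
# pairs now holds one (left_dict, right_dict) tuple per row, ready to be compared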
My df contains product names and corresponding information. Relevant here are the product name and the countries it is sold to:
+--------------------+-------------------------+
| Product_name|collect_set(Countries_en)|
+--------------------+-------------------------+
| null| [Belgium,United K...|
| #5 pecan/almond| [Belgium]|
| #8 mango/strawberry| [Belgium]|
|& Sully A Mild Th...| [Belgium,France]|
|"70CL Liqueu...| [Belgium,France]|
|"Gingembre&q...| [Belgium]|
|"Les Schtrou...| [Belgium,France]|
|"Sho-key&quo...| [Belgium]|
|"mini Chupa ...| [Belgium,France]|
| 'S Lands beste| [Belgium]|
|'T vlierbos confi...| [Belgium]|
|(H)eat me - Spagh...| [Belgium]|
| -cheese flips| [Belgium]|
| .soupe cerfeuil| [Belgium]|
|1 1/2 Minutes Bas...| [Belgium,Luxembourg]|
| 1/2 Reblochon AOP| [Belgium]|
| 1/2 nous de jambon| [Belgium]|
|1/2 tarte cerise ...| [Belgium]|
|10 Original Knack...| [Belgium,France,S...|
| 10 pains au lait| [Belgium,France]|
+--------------------+-------------------------+
sample input data:
[Row(code=2038002038.0, Product_name='Formula 2 men multi vitaminic', Countries_en='France,Ireland,Italy,Mexico,United States,Argentina-espanol,Armenia-pyсский,Aruba-espanol,Asia-pacific,Australia-english,Austria-deutsch,Azerbaijan-русский,Belarus-pyсский,Belgium-francais,Belgium-nederlands,Bolivia-espanol,Bosnia-i-hercegovina-bosnian,Botswana-english,Brazil-portugues,Bulgaria-български,Cambodia-english,Cambodia-ភាសាខ្មែរ,Canada-english,Canada-francais,Chile-espanol,China-中文,Colombia-espanol,Costa-rica-espanol,Croatia-hrvatski,Cyprus-ελληνικά,Czech-republic-čeština,Denmark-dansk,Ecuador-espanol,El-salvador-espanol,Estonia-eesti,Europe,Finland-suomi,France-francais,Georgia-ქართული,Germany-deutsch,Ghana-english,Greece-ελληνικά,Guatemala-espanol,Honduras-espanol,Hong-kong-粵語,Hungary-magyar,Iceland-islenska,India-english,Indonesia-bahasa-indonesia,Ireland-english,Israel-עברית,Italy-italiano,Jamaica-english,Japan-日本語,Kazakhstan-pyсский,Korea-한국어,Kyrgyzstan-русский,Latvia-latviešu,Lebanon-english,Lesotho-english,Lithuania-lietuvių,Macau-中文,Malaysia-bahasa-melayu,Malaysia-english,Malaysia-中文,Mexico-espanol,Middle-east-africa,Moldova-roman,Mongolia-монгол-хэл,Namibia-english,Netherlands-nederlands,New-zealand-english,Nicaragua-espanol,North-macedonia-македонски-јазик,Norway-norsk,Panama-espanol,Paraguay-espanol,Peru-espanol,Philippines-english,Poland-polski,Portugal-portugues,Puerto-rico-espanol,Republica-dominicana-espanol,Romania-romană,Russia-русский,Serbia-srpski,Singapore-english,Slovak-republic-slovenčina,Slovenia-slovene,South-africa-english,Spain-espanol,Swaziland-english,Sweden-svenska,Switzerland-deutsch,Switzerland-francais,Taiwan-中文,Thailand-ไทย,Trinidad-tobago-english,Turkey-turkce,Ukraine-yкраї́нська,United-kingdom-english,United-states-english,United-states-espanol,Uruguay-espanol,Venezuela-espanol,Vietnam-tiếng-việt,Zambia-english', Traces_en=None, Additives_tags=None, Main_category_en='Vitamins', Image_url='https://static.openfoodfacts.org/images/products/203/800/203/8/front_en.12.400.jpg', Quantity='60 compresse', Packaging_tags='barattolo,tablet', )]
Since I want to explore which countries the products are sold to besides Belgium, I split the country column to show every country individually using the code below.
#create df with grouped products
countriesDF = productsDF\
    .select("Product_name", "Countries_en")\
    .groupBy("Product_name")\
    .agg(F.collect_set("Countries_en").cast("string").alias("Countries"))\
    .orderBy("Product_name")

#split df to show the countries the product is sold to in a separate column
countriesDF = countriesDF\
    .where(col("Countries") != "null")\
    .select("Product_name",
            F.split("Countries", ",").alias("Countries"),
            F.posexplode(F.split("Countries", ",")).alias("pos", "val")
    )\
    .drop("val")\
    .select(
        "Product_name",
        F.concat(F.lit("Countries"), F.col("pos").cast("string")).alias("name"),
        F.expr("Countries[pos]").alias("val")
    )\
    .groupBy("Product_name").pivot("name").agg(F.first("val"))\
    .show()
However, this table now has over 400 columns for countries alone, which is not presentable. So my questions are:
am I doing the splitting / exploding correctly?
can I split the df so that I get the countries as column names (e.g. 'France' instead of 'countries1' etc.), counting the number of times the product is sold in that country?
Some sample data :
val sampledf = Seq(("p1","BELGIUM,GERMANY"),("p1","BELGIUM,ITALY"),("p1","GERMANY"),("p2","BELGIUM")).toDF("Product_name","Countries_en")
Transform to the required df:
val df = sampledf
  .withColumn("country_list", split(col("Countries_en"), ","))
  .select(col("Product_name"), explode(col("country_list")).as("country"))
+------------+-------+
|Product_name|country|
+------------+-------+
| p1|BELGIUM|
| p1|GERMANY|
| p1|BELGIUM|
| p1| ITALY|
| p1|GERMANY|
| p2|BELGIUM|
+------------+-------+
If you only need counts per country:
val countDF = df.groupBy("Product_name", "country").count()
countDF.show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
|          p1|BELGIUM|    2|
|          p1|GERMANY|    2|
|          p1|  ITALY|    1|
|          p2|BELGIUM|    1|
+------------+-------+-----+
Except Belgium:
countDF.filter(col("country") =!= "BELGIUM").show()
+------------+-------+-----+
|Product_name|country|count|
+------------+-------+-----+
|          p1|GERMANY|    2|
|          p1|  ITALY|    1|
+------------+-------+-----+
And if you really want countries as columns:
countDF.groupBy("Product_name").pivot("country").agg(first("count")).show()
+------------+-------+-------+-----+
|Product_name|BELGIUM|GERMANY|ITALY|
+------------+-------+-------+-----+
|          p2|      1|   null| null|
|          p1|      2|      2|    1|
+------------+-------+-------+-----+
And you can .drop("BELGIUM") to remove that column.
Final code used:
#create df where countries are split off
df = productsDF\
    .withColumn("country_list", split(col("Countries_en"), ","))\
    .select(col("Product_name"), explode(col("country_list")).alias("Country"))

#create the counts and filter out Belgium; the product name can be changed as needed
countDF = df.groupBy("Product_name", "Country").count()
countDF.filter(col("Country") != "Belgium")\
    .filter(col("Product_name") == 'Café')\
    .show()
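For completeness, a rough Python sketch of the pivot step from the answer above (my addition; it assumes pyspark.sql.functions is imported as F as in the earlier snippets, and pivotDF is just an illustrative name). It turns each country into its own column of per-product counts, so Belgium can simply be dropped:

pivotDF = countDF.groupBy("Product_name")\
    .pivot("Country")\
    .agg(F.first("count"))\
    .drop("Belgium")
pivotDF.show()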
I am trying to match multiple columns from one data frame (df) to a multi-language dictionary (df_label) and extract the corresponding label for each column.
Note: This is not a duplicate of Join multiple columns from one table to single column from another table
The following is an example of df and df_label dataframes and the desired output
df df_label output
+---+---+ +---+-----+----+ +---+---+------+------+------+
| s| o| | e| name|lang| | s| o|s_name|o_name| lang|
+---+---+ +---+-----+----+ +---+---+------+------+------+
| s1| o1| | s1|s1_en| en| | s2| o1| s2_fr| o1_fr| fr|
| s1| o3| | s1|s1_fr| fr| | s1| o1| s1_fr| o1_fr| fr|
| s2| o1| | s2|s2_fr| fr| | s1| o1| s1_en| o1_en| en|
| s2| o2| | o1|o1_fr| fr| | s2| o2| s2_fr| o2_fr| fr|
+---+---+ | o1|o1_en| en| +---+---+------+------+------+
| o2|o2_fr| fr|
+---+-----+----+
In other words, I want to match both columns [s, o] from df with column e from df_label and find their corresponding labels in different languages, as shown above.
The multi-language dictionary (df_label) is huge and columns [s, o] have many duplicates, so two join operations would be highly inefficient.
Is there any way this could be achieved without multiple joins?
FYI, this is what I did using multiple joins, but I really don't like it:
from pyspark.sql.functions import col

df = spark.createDataFrame([('s1','o1'),('s1','o3'),('s2','o1'),('s2','o2')]).toDF('s','o')
df_label = spark.createDataFrame([('s1','s1_en','en'),('s1','s1_fr','fr'),('s2','s2_fr','fr'),('o1','o1_fr','fr'),('o1','o1_en','en'),('o2','o2_fr','fr')]).toDF('e','name','lang')
df = df.join(df_label, col('s') == col('e')).drop('e').withColumnRenamed('name', 's_name').withColumnRenamed('lang', 's_lang')
df = df.join(df_label, col('o') == col('e')).drop('e').withColumnRenamed('name', 'o_name').filter(col('lang') == col('s_lang')).drop('s_lang').select('s', 'o', 's_name', 'o_name', 'lang')
Building on what gaw suggested, this is my proposed solution.
The approach is to use only one join, and then a conditional collect_list aggregation to check whether the match was for the s column or the o column.
from pyspark.sql.functions import col, when, collect_list, explode

df = spark.createDataFrame([('s1','o1'),('s1','o3'),('s2','o1'),('s2','o2')]).toDF('s','o')
df_label = spark.createDataFrame([('s1','s1_en','en'),('s1','s1_fr','fr'),('s2','s2_fr','fr'),('o1','o1_fr','fr'),('o1','o1_en','en'),('o2','o2_fr','fr')]).toDF('e','name','lang')

df.join(df_label, (col('e') == col('s')) | (col('e') == col('o'))) \
    .groupBy(['s', 'o', 'lang']) \
    .agg(collect_list(when(col('e') == col('s'), col('name'))).alias('s_name'),
         collect_list(when(col('e') == col('o'), col('name'))).alias('o_name')) \
    .withColumn('s_name', explode('s_name')) \
    .withColumn('o_name', explode('o_name')).show()
+---+---+----+------+------+
| s| o|lang|s_name|o_name|
+---+---+----+------+------+
| s2| o2| fr| s2_fr| o2_fr|
| s1| o1| en| s1_en| o1_en|
| s1| o1| fr| s1_fr| o1_fr|
| s2| o1| fr| s2_fr| o1_fr|
+---+---+----+------+------+
I created a way which works with only one join, but since it uses additional (expensive) operations like explode etc. I am not sure if it is faster.
But if you like you could give it a try.
The following code produces the desired output:
from pyspark.sql.functions import col, when, collect_list, explode

df = spark.createDataFrame([('s1','o1'),('s1','o3'),('s2','o1'),('s2','o2')]).toDF('s','o')
df_label = spark.createDataFrame([('s1','s1_en','en'),('s1','s1_fr','fr'),('s2','s2_fr','fr'),('o1','o1_fr','fr'),('o1','o1_en','en'),('o2','o2_fr','fr')]).toDF('e','name','lang')

df = (
    df.join(df_label, (col('s') == col('e')) | (col('o') == col('e'))).drop('e')
    # combine the two join conditions
    .withColumn("o_name", when(col("name").startswith("o"), col("name")).otherwise(None))
    .withColumn("s_name", when(col("name").startswith("s"), col("name")).otherwise(None))
    # create the o_name and s_name cols
    .groupBy("s", "o").agg(collect_list("o_name").alias("o_name"),
                           collect_list("s_name").alias("s_name"))
    # perform a group to aggregate the required values
    .select("s", "o", explode("o_name").alias("o_name"), "s_name")
    # explode the lists from the group to attach them to the correct pairs of o and s
    .select("s", "o", explode("s_name").alias("s_name"), "o_name")
    .withColumn("o_lang", col("o_name").substr(-2, 2))
    .withColumn("lang", col("s_name").substr(-2, 2))
    # manually create the o_lang and lang columns
    .filter(col("o_lang") == col("lang")).drop("o_lang")
)
Result:
+---+---+------+------+----+
|s |o |s_name|o_name|lang|
+---+---+------+------+----+
|s2 |o2 |s2_fr |o2_fr |fr |
|s2 |o1 |s2_fr |o1_fr |fr |
|s1 |o1 |s1_fr |o1_fr |fr |
|s1 |o1 |s1_en |o1_en |en |
+---+---+------+------+----+
I have two dataframes: a small one with IDs and a large one (6 billion rows, with id and trx_id). I want all the transactions from the large table whose customer ID appears in the small one. For example:
df1:
+------+
|userid|
+------+
| 348|
| 567|
| 595|
+------+
df2:
+------+----------+
|userid| trx_id |
+------+----------+
| 348| 287|
| 595| 288|
| 348| 311|
| 276| 094|
| 595| 288|
| 148| 512|
| 122| 514|
| 595| 679|
| 567| 870|
| 595| 889|
+------+----------+
Result I want:
+------+----------+
|userid| trx_id |
+------+----------+
| 348| 287|
| 595| 288|
| 348| 311|
| 595| 288|
| 595| 679|
| 567| 870|
| 595| 889|
+------+----------+
Should I use a join or a filter? If so, which command?
If the small dataframe can fit into memory, you can do a broadcast join. That means that the small dataframe will be broadcasted to all executor nodes and the subsequent join can be done efficiently without any shuffle.
You can hint that a dataframe should be broadcasted using broadcast:
df2.join(broadcast(df1), "userid", "inner")
Note that the join method is called on the larger dataframe.
If the first dataframe is even smaller (~100 rows or less), filter would be a viable, faster option. The idea is to collect the dataframe and convert it to a list, then use isin to filter the large dataframe. This should be faster as long as the data is small enough.
val userids = df1.as[Int].collect()
df2.filter($"userid".isin(userids:_*))