I have the below data frame, which I need to load into a CSV file with customized rows and values.
common_df.show()
+--------+----------+-----+----+-----+-----------+-------+---+
|name    |department|state|id  |name |department |state  |id |
+--------+----------+-----+----+-----+-----------+-------+---+
|James   |Sales     |NY   |101 |James|Sales1     |null   |101|
|Maria   |Finance   |CA   |102 |Maria|Finance    |       |102|
|Jen     |Marketing |NY   |103 |Jen  |           |NY2    |103|
+--------+----------+-----+----+-----+-----------+-------+---+
Currently I am following the below approach to convert the data frame to CSV:
pandasdf = common_df.toPandas()
pandasdf.to_csv("s3://mylocation/result.csv")
The above writes the CSV with the same structure as the data frame. However, I need to restructure it from the above format into something like the one below. I think the solution would be to split each row into two, keeping the id on the left, within the data frame, but I don't see any example or solution for this directly in Spark.
    |name  |dept      |state|id  |
------------------------------------
101 |James |Sales     |NY   |101 |
    |James |null      |NY   |101 |
------------------------------------
102 |Maria |Finance   |     |102 |
    |Maria |Finance   |CA   |102 |
------------------------------------
103 |Jen   |Marketing |NY   |103 |
    |Jen   |          |NY2  |103 |
------------------------------------
Any solution to this?
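One way to implement the row-splitting idea described above: a rough sketch, assuming the original df1 and df2 that were joined into common_df are still available and use the column names shown.
from pyspark.sql import functions as F, Window

cols = ["name", "department", "state", "id"]
stacked = (
    df1.select(*cols).withColumn("src", F.lit(1))
       .unionByName(df2.select(*cols).withColumn("src", F.lit(2)))
)

# Optionally show the id in a leading label column only on the first row of each pair.
w = Window.partitionBy("id").orderBy("src")
stacked = (
    stacked
    .withColumn("row_label", F.when(F.row_number().over(w) == 1, F.col("id")))
    .orderBy("id", "src")
    .drop("src")
)

# Spark's native CSV writer avoids collecting everything to the driver the way
# toPandas() does; it writes a directory of part files.
stacked.write.mode("overwrite").option("header", True).csv("s3://mylocation/result_dir/")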
Related
I have data in the following format:
|cust_id |card_num |balance|payment |due |card_type|
|:-------|:--------|:------|:-------|:----|:------- |
|c1 |1234 |567 |344 |33 |A |
|c1 |2345 |57 |44 |3 |B |
|c2 |123 |561 |34 |39 |A |
|c3 |345 |517 |914 |23 |C |
|c3 |127 |56 |34 |32 |B |
|c3 |347 |67 |344 |332 |B |
I want it to be converted into the following ArrayType format:
|cust_id|card_num |balance |payment |due | card_type|
|:------|:-------- |:------ |:------- |:---- |:---- |
|c1 |[1234,2345] |[567,57] |[344,44] |[33,3] |[A,B] |
|c2 |[123] |[561] |[34] |[39] |[A] |
|c3 |[345,127,347]|[517,56,67]|[914,34,344]|[23,32,332]|[C,B,B] |
How to write a generic code in pyspark to do this transformation and save it in csv format?
You just need to group by the cust_id column and use the collect_list function to get array-type aggregated columns.
from pyspark.sql.functions import collect_list

df = ...  # input DataFrame

df.groupBy("cust_id").agg(
    collect_list("card_num").alias("card_num"),
    collect_list("balance").alias("balance"),
    collect_list("payment").alias("payment"),
    collect_list("due").alias("due"),
    collect_list("card_type").alias("card_type"))
I have the 2 data frames below, and I would like to apply the condition below and return the values in PySpark data frames.
df1.show()
+---+-------+--------+
|id |tr_type|nominal |
+---+-------+--------+
|1 |K |2.0 |
|2 |ZW |7.0 |
|3 |V |12.5 |
|4 |VW |9.0 |
|5 |CI |5.0 |
+---+-------+--------+
One-dimensional mapping:
*abcefgh
+-------+------------+------------+-----------+
|odm_id |return_value|odm_relation|input_value|
+-------+------------+------------+-----------+
|abcefgh|B |EQ |K |
|abcefgh|B |EQ |ZW |
|abcefgh|S |EQ |V |
|abcefgh|S |EQ |VW |
|abcefgh|I |EQ |CI |
+-------+------------+------------+-----------+
I need to apply the below condition: the nominal volume is negated when there is a sell transaction.
IF (tr_type, $abcefgh.) == 'S' THEN ;
nominal = -nominal ;
The expected output:
+---+-------+-------+-----------+
|id |tr_type|nominal|nominal_new|
+---+-------+-------+-----------+
|1 |K |2.0 |2.0 |
|2 |ZW |7.0 |7.0 |
|3 |V |12.5 |-12.5 |
|4 |VW |9.0 |-9.0 |
|5 |CI |5.0 |5.0 |
+---+-------+-------+-----------+
You could join the 2 dataframes on tr_type == input_value and use a when().otherwise() to create the new column.
See the example below using your samples:
from pyspark.sql import functions as func

data_sdf. \
    join(odm_sdf.selectExpr('return_value', 'input_value as tr_type').
         dropDuplicates(),
         ['tr_type'],
         'left'
         ). \
    withColumn('nominal_new',
               func.when(func.col('return_value') == 'S', func.col('nominal') * -1).
               otherwise(func.col('nominal'))
               ). \
    drop('return_value'). \
    show()
# +-------+---+-------+-----------+
# |tr_type| id|nominal|nominal_new|
# +-------+---+-------+-----------+
# | K| 1| 2.0| 2.0|
# | CI| 5| 5.0| 5.0|
# | V| 3| 12.5| -12.5|
# | VW| 4| 9.0| -9.0|
# | ZW| 2| 7.0| 7.0|
# +-------+---+-------+-----------+
I am trying to figure out how to solve this use case using a Spark dataframe.
In the Google Sheet below, I have the source data where the survey questions answered by people are stored. The question columns will number more than roughly 1,000, and they are dynamic rather than fixed.
There is also a metadata table, which describes each question, its description, and the choices it can contain.
The output table should be like the one mentioned in the sheet. Any suggestions or ideas on how this can be achieved?
https://docs.google.com/spreadsheets/d/1BAY8XWaio1DbzcQeQgru6PuNfT9A7Uhf650x_-PAjqo/edit#gid=0
Let's assume your main table is called df:
+---------+-----------+-----------+------+------+------+
|survey_id|response_id|person_name|Q1D102|Q1D103|Q1D105|
+---------+-----------+-----------+------+------+------+
|xyz |xyz |john |1 |2 |1 |
|abc |abc |foo |3 |1 |1 |
|def |def |bar |2 |2 |2 |
+---------+-----------+-----------+------+------+------+
and the mapping table is called df2:
+-----------+-------------+-------------------+---------+-----------+
|question_id|question_name|question_text |choice_id|choice_desc|
+-----------+-------------+-------------------+---------+-----------+
|Q1D102 |Gender |What is your gender|1 |Male |
|Q1D102 |Gender |What is your gender|2 |Female |
|Q1D102 |Gender |What is your gender|3 |Diverse |
|Q1D103 |Age |What is your age |1 |20 - 50 |
|Q1D103 |Age |What is your age |2 |50 > |
|Q1D105 |work_status |Do you work |1 |Yes |
|Q1D105 |work_status |Do you work |2 |No |
+-----------+-------------+-------------------+---------+-----------+
We can construct a dynamic unpivot expression as below:
val columns = df.columns.filter(c => c.startsWith("Q1D"))
val data = columns.map(c => s"'$c', $c").mkString(",")
val finalExpr = s"stack(${columns.length}, $data) as (question_id, choice_id)"
With 3 questions, we get the following expression (Q1D102, Q1D103 and Q1D105): stack(3, 'Q1D102', Q1D102,'Q1D103', Q1D103,'Q1D105', Q1D105) as (question_id, choice_id)
Finally, we use the constructed variable:
df = df
.selectExpr("survey_id", "response_id", "person_name", finalExpr)
.join(df2, Seq("question_id", "choice_id"), "left")
You get this result:
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|question_id|choice_id|survey_id|response_id|person_name|question_name|question_text |choice_desc|
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
|Q1D102 |1 |xyz |xyz |john |Gender |What is your gender|Male |
|Q1D102 |2 |def |def |bar |Gender |What is your gender|Female |
|Q1D102 |3 |abc |abc |foo |Gender |What is your gender|Diverse |
|Q1D103 |1 |abc |abc |foo |Age |What is your age |20 - 50 |
|Q1D103 |2 |xyz |xyz |john |Age |What is your age |50 > |
|Q1D103 |2 |def |def |bar |Age |What is your age |50 > |
|Q1D105 |1 |xyz |xyz |john |work_status |Do you work |Yes |
|Q1D105 |1 |abc |abc |foo |work_status |Do you work |Yes |
|Q1D105 |2 |def |def |bar |work_status |Do you work |No |
+-----------+---------+---------+-----------+-----------+-------------+-------------------+-----------+
Which I think is what you need (just unordered), good luck!
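If you are working in PySpark rather than Scala, roughly the same approach can be sketched as below (same stack() idea; it assumes the df and df2 frames above and that question columns start with "Q1D"):
question_cols = [c for c in df.columns if c.startswith("Q1D")]
stack_args = ", ".join(f"'{c}', {c}" for c in question_cols)
final_expr = f"stack({len(question_cols)}, {stack_args}) as (question_id, choice_id)"

result = (
    df.selectExpr("survey_id", "response_id", "person_name", final_expr)
      .join(df2, ["question_id", "choice_id"], "left")
)
result.show(truncate=False)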
I am trying to concatenate the values of the same columns from two data frames into a single data frame.
For example:
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
Since both have the same column names, I renamed the columns of df1:
new_cols = [c + '_r' for c in df1.columns]
df1 = df1.toDF(*new_cols)
joined_df = df1.join(df2, df1.id_r == df2.id, "inner")
+------+------------+-------+----+------+-----+----------+-----+---+----+
|name_r|department_r|state_r|id_r|hash_r|name |department|state|id |hash|
+------+------------+-------+----+------+-----+----------+-----+---+----+
|James |Sales       |NY     |101 |c123  |James|Sales1    |null |101|4df2|
|Maria |Finance     |CA     |102 |d234  |Maria|Finance   |     |102|5rfg|
|Jen   |Marketing   |NY     |103 |df34  |Jen  |          |NY2  |103|2f34|
+------+------------+-------+----+------+-----+----------+-----+---+----+
So now I am trying to concatenate the values of the same columns and create a single data frame:
from pyspark.sql.functions import col, concat, lit
from pyspark.sql.types import StructType

combined_df = spark.createDataFrame([], StructType([]))
for col1 in df1.columns:
    for col2 in df2.columns:
        if col1[:-2] == col2:
            joindf = joined_df.select(concat(lit('['), col(col1), lit(','), col(col2), lit(']')).alias("arraycol" + col2))
            col_to_select = "arraycol" + col2
            filtered_df = joindf.select(col_to_select)
            renamed_df = filtered_df.withColumnRenamed(col_to_select, col2)
            renamed_df.show()
            if combined_df.count() < 0:
                combined_df = renamed_df
            else:
                combined_df = combined_df.rdd.zip(renamed_df.rdd).map(lambda x: x[0] + x[1])
new_combined_df = spark.createDataFrame(combined_df, df2.schema)
new_combined_df.show()
But it returns an error saying:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob: Can only zip RDDs with same number of elements in each partition
I can see that inside the loop, renamed_df.show() produces the expected column with values, e.g.:
renamed_df.show()
+-----------------+
|name             |
+-----------------+
|['James','James']|
|['Maria','Maria']|
|['Jen','Jen']    |
+-----------------+
But I am expecting to create a combined data frame as seen below:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales']    |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['102','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
Any solution to this?
You actually want to use collect_list to do this. Gather all the data in one data frame, then group it so that collect_list can be applied:
from pyspark.sql import functions as f

union_all = df1.unionByName(df2, allowMissingColumns=True)
myArray = []
for myCol in union_all.columns:
    myArray += [f.collect_list(myCol).alias(myCol)]

(union_all
    .withColumn("temp_name", f.col("id"))  # extra copy of id, used only for grouping
    .groupBy("temp_name")
    .agg(*myArray)
    .drop("temp_name"))  # clean up the extra column used for grouping
If you only want unique values you can use collect_set instead.
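If the combined frame ultimately needs to go into a CSV (which cannot hold array columns), the arrays can be rendered as the bracketed strings shown in the expected output first. A small sketch, assuming the grouped result above is stored in a variable named combined_df (an assumed name):
from pyspark.sql import functions as f

# The cast to array<string> guards against non-string element types.
stringified = combined_df.select(
    *[f.concat(f.lit("["), f.concat_ws(",", f.col(c).cast("array<string>")), f.lit("]")).alias(c)
      for c in combined_df.columns]
)
stringified.show(truncate=False)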
I have a PySpark dataframe df that looks like this:
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id |gender|salary|
+---------+----------+--------+-----+------+------+
|James | |Smith |36636|M |3000 |
|Michael |Rose | |40288|M |4000 |
|Robert | |Williams|42114|M |4000 |
|Maria |Anne |Jones |39192|F |4000 |
|Jen |Mary |Brown |30001|F |2000 |
+---------+----------+--------+-----+------+------+
I need to apply a filter of id > 40000 only to gender = M, and preserve all the gender = F rows. Therefore, the final dataframe should look like this:
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id |gender|salary|
+---------+----------+--------+-----+------+------+
|Michael |Rose | |40288|M |4000 |
|Robert | |Williams|42114|M |4000 |
|Maria |Anne |Jones |39192|F |4000 |
|Jen |Mary |Brown |30001|F |2000 |
+---------+----------+--------+-----+------+------+
The only way I can think of doing this is:
df_temp1 = df.filter(df.gender == 'F')
df_temp2 = df.where(df.gender == 'M').filter(df.id > 40000)
df = df_temp1.union(df_temp2)
Is this the most efficient way to do this? I'm new to Spark so any help is appreciated!
This should do the trick. where is an alias for filter.
>>> df.show()
+-------+------+-----+
| name|gender| id|
+-------+------+-----+
| James| M|36636|
|Michael| M|40288|
| Robert| F|42114|
| Maria| F|39192|
| Jen| F|30001|
+-------+------+-----+
>>> df.where(''' (gender == 'M' and id > 40000) OR gender == 'F' ''').show()
+-------+------+-----+
| name|gender| id|
+-------+------+-----+
|Michael| M|40288|
| Robert| F|42114|
| Maria| F|39192|
| Jen| F|30001|
+-------+------+-----+
Use both conditions combined with OR:
from pyspark.sql import functions as F

df = spark.createDataFrame([(36636, "M"), (40288, "M"), (42114, "M"), (39192, "F"), (30001, "F")], ["id", "gender"])
df = df.filter(((F.col("id") > 40000) & (F.col("gender") == F.lit("M"))) | (F.col("gender") == F.lit("F")))
df.show()
Output
+-----+------+
| id|gender|
+-----+------+
|40288| M|
|42114| M|
|39192| F|
|30001| F|
+-----+------+