PySpark query one column name with the value present in another column - python-3.x

Input_pyspark_dataframe:
id   name  collection  student.1.price  student.2.price  student.3.price
111  aaa   1           100              999              232
222  bbb   2           200              888              656
333  ccc   1           300              777              454
444  ddd   3           400              666              787

output_pyspark_dataframe:

id   name  collection  price
111  aaa   1           100
222  bbb   2           888
333  ccc   1           300
444  ddd   3           787
We can find the correct price for each id by using the value present in the collection column.
Question
Using PySpark, how can I find the correct price for each id by dynamically framing the column name student.{collection}.price?
Please let me know.

A bit complicated, but you can do it this way.
fields gives you the field names of the struct column student (you could also list them manually); here it evaluates to 1, 2, 3.
The first withColumn then makes an array of the columns student.{i}.price for i in 1..3. Similarly, the second makes an array of the literals '1', '2', '3'.
Now, zip these two arrays into a single array such as
[('1', col('student.1.price')), ...]
and explode that array, so each row holds one pair:
('1', col('student.1.price'))
('2', col('student.2.price'))
('3', col('student.3.price'))
Since arrays_zip gives you an array of structs, each exploded value is a struct. Get each value by using the struct key as the column, i.e. index and array (the latter aliased to price).
Finally, compare collection with index (which is really the field name of the student struct column).
import pyspark.sql.functions as f

# Field names of the student struct column: ['1', '2', '3'].
fields = [field.name for field in
          next(field for field in df.schema.fields if field.name == 'student').dataType.fields]

(df
 .withColumn('array', f.array(*['student.' + x + '.price' for x in fields]))  # the price columns
 .withColumn('index', f.array(*[f.lit(x) for x in fields]))                   # ['1', '2', '3']
 .withColumn('zip', f.arrays_zip('index', 'array'))  # one array of (index, price) structs
 .withColumn('zip', f.explode('zip'))                # one row per (index, price) struct
 .withColumn('index', f.col('zip.index'))
 .withColumn('price', f.col('zip.array'))
 .filter('collection = index')                       # keep the row whose index matches collection
 .select('id', 'name', 'collection', 'price')
 .show(10, False))
+---+----+----------+-----+
|id |name|collection|price|
+---+----+----------+-----+
|111|aaa |1 |100 |
|222|bbb |2 |888 |
|333|ccc |1 |300 |
|444|ddd |3 |787 |
+---+----+----------+-----+
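An alternative sketch, not from the original answer: since fields is known up front, you can build the lookup as a single when/coalesce expression and avoid the explode entirely (same df and fields as above):
import pyspark.sql.functions as f

# pick student.{i}.price where the field name i matches the value in collection
price = f.coalesce(*[
    f.when(f.col('collection').cast('string') == i, f.col('student.' + i + '.price'))
    for i in fields
])

df.select('id', 'name', 'collection', price.alias('price')).show(10, False)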

Related

Replace values of several columns with a value mapping in another dataframe (PySpark)

I need to replace values of several columns (many more than those in the example, so I would like to avoid doing multiple left joins) of a dataframe with values from another dataframe (mapping).
Example:
df1 EXAM:

id  question1  question2  question3
1   12         12         5
2   12         13         6
3   3          7          5

df2 VOTE MAPPING:

id  description
3   bad
5   insufficient
6   sufficient
12  very good
13  excellent

Output:

id  question1  question2  question3
1   very good  very good  insufficient
2   very good  excellent  sufficient
3   bad        null       insufficient
Edit 1: Corrected id for excellent in vote map
First of all, you can create a reference dataframe holding the whole mapping in a single map column:
from pyspark.sql import functions as func
from pyspark.sql.types import MapType, StringType

df3 = df2.select(
    func.create_map(func.col('id'), func.col('description')).alias('ref')
).groupBy().agg(
    func.collect_list('ref').alias('ref')
).withColumn(
    'ref',
    func.udf(lambda lst: {k: v for element in lst for k, v in element.items()},
             returnType=MapType(StringType(), StringType()))(func.col('ref'))
)
+--------------------------------------------------------------------------------+
|ref                                                                             |
+--------------------------------------------------------------------------------+
|{3 -> bad, 12 -> very good, 5 -> insufficient, 13 -> excellent, 6 -> sufficient}|
+--------------------------------------------------------------------------------+
Then you can replace the values in the question columns by looking them up in the reference map, using a single crossJoin:
df4 = df1.crossJoin(df3) \
    .select(
        'id',
        *[func.col('ref').getItem(func.col(col)).alias(col) for col in df1.columns[1:]]
    )
df4.show(10, False)
+--+---------+---------+------------+
|id|question1|question2|question3   |
+--+---------+---------+------------+
|1 |very good|very good|insufficient|
|2 |very good|excellent|sufficient  |
|3 |bad      |null     |insufficient|
+--+---------+---------+------------+
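As a variation, not from the original answer: when the mapping table is small you can skip the UDF and the crossJoin by collecting df2 to the driver and building one literal map column (same df1/df2 names as above):
from pyspark.sql import functions as func

# build a literal map column from the collected mapping rows
rows = df2.collect()
ref = func.create_map(*[func.lit(x) for row in rows for x in (row['id'], row['description'])])

df4 = df1.select(
    'id',
    *[ref.getItem(func.col(c)).alias(c) for c in df1.columns[1:]]
)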

Pandas, combine unique values from two columns into one column while preserving order

I have data in four columns as shown below. Some values from column 1 are duplicated in column 3. I would like to append column 3 to column 1 while removing those duplicates from column 3, preserving the order of the columns. Column 1 is associated with column 2 and column 3 with column 4, so each item should keep its price during the merge. Any help will be appreciated.
Input table:

Item   Price  Item   Price
Car    105    Truck  54822
Chair  20     Pen    1
Cup    2      Car    105
              Glass  1

Output table:

Item   Price
Car    105
Chair  20
Cup    2
Truck  54822
Pen    1
Glass  1
Thank you in advance.
After separating the input table into its left and right halves, we can concatenate the left-hand items with the unduplicated right-hand items quite simply with boolean indexing:
import pandas as pd

# this initial section only recreates your sample input table
from io import StringIO
input = pd.read_table(StringIO("""| Item  | Price | Item  | Price |
|-------|-------|-------|-------|
| Car   | 105   | Truck | 54822 |
| Chair | 20    | Pen   | 1     |
| Cup   | 2     | Car   | 105   |
|       |       | Glass | 1     |
"""), sep=r' *\| *', engine='python', usecols=[1, 2, 3, 4], skiprows=[1], keep_default_na=False)
input.columns = list(input.columns[:2]) * 2

# now separate the input table into the left and right part
left = input.iloc[:, :2].replace("", pd.NA).dropna().set_index('Item')
right = input.iloc[:, 2:].set_index('Item')

# finally construct the output table by concatenating without duplicates
output = pd.concat([left, right[~right.index.isin(left.index)]])
       Price
Item
Car      105
Chair     20
Cup        2
Truck  54822
Pen        1
Glass      1
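A condensed sketch of the same idea, assuming the two halves are already separate dataframes (df_left and df_right are hypothetical names):
import pandas as pd

df_left = pd.DataFrame({'Item': ['Car', 'Chair', 'Cup'], 'Price': [105, 20, 2]})
df_right = pd.DataFrame({'Item': ['Truck', 'Pen', 'Car', 'Glass'], 'Price': [54822, 1, 105, 1]})

# keep every left-hand item, then append the right-hand items not already present
out = pd.concat(
    [df_left, df_right[~df_right['Item'].isin(df_left['Item'])]],
    ignore_index=True,
)
print(out)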

Show One to Many in One Row

I'd like to reformat a cross-reference table before merging it into my data.
Certain parts have a one-to-many relationship, and I want to collapse those cases into a single row so that I capture all the info when I later merge/vlookup this table against my data. Most of the data is a one-to-one relationship, so the solution has to be selective.
Currently:
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
What I want:
Marketing Number SKU
0 XXX 111; 222; 333
Use groupby to get the SKU values into a list, then join the list values.
Since the values in the list are ints, they must be converted to strings before joining.
import pandas as pd
# data and dataframe
data = {'Marketing Number': ['XXX', 'XXX', 'XXX', 'y', 'z', 'a'],
'SKU': [111, 222, 333, 444, 555, 666]}
df = pd.DataFrame(data)
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
3 y 444
4 z 555
5 a 666
# groupby with agg list
dfg = df.groupby('Marketing Number', as_index=False).agg({'SKU': list})
# join into string
dfg.SKU = dfg.SKU.apply(lambda x: '; '.join([str(i) for i in x]))
Marketing Number SKU
0 XXX 111; 222; 333
1 a 666
2 y 444
3 z 555
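The two steps can also be combined into a single agg call (a minor variation on the same groupby idea):
# group, convert to strings, and join in one step
dfg = df.groupby('Marketing Number', as_index=False)['SKU'].agg(
    lambda s: '; '.join(s.astype(str))
)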

Pandas: With array of col names in a desired column order, select those that exist, NULL those that don't

I have an array of column names I want in my output table, in that order, e.g. ["A", "B", "C"].
I have an input table that USUALLY contains all of the values in the array but NOT ALWAYS (the raw data is a JSON API response).
I want to select all available columns from the input table, and if a column does not exist, I want it filled with NULLs or NA or whatever, it doesn't really matter.
Let's say my input DataFrame (call it input_table) looks like this:
+-----+--------------+
| A | C |
+-----+--------------+
| 123 | test |
| 456 | another_test |
+-----+--------------+
I want an output dataframe that has columns A, B, C in that order to produce
+-----+------+--------------+
| A | B | C |
+-----+------+--------------+
| 123 | NULL | test |
| 456 | NULL | another_test |
+-----+------+--------------+
I get a KeyError when I do input_table[["A","B","C"]].
I get None returned when I do input_table.get(["A","B","C"]).
I was able to achieve what I want via:
for i in desired_columns_array:
    if i not in input_dataframe:
        output_dataframe[i] = ""
    else:
        output_dataframe[i] = input_dataframe[i]
But I'm wondering if there's something less verbose?
How do I get a desired output schema to match an input array when one or more columns in the input dataframe may not be present?
Transpose and reindex
df = pd.DataFrame([[123,'test'], [456, 'another test']], columns=list('AC'))
l = list('ACB')
df1 = df.T.reindex(l).T[sorted(l)]
A B C
0 123 NaN test
1 456 NaN another test
DataFrame.reindex over the column axis:
cols = ['A', 'B', 'C']
df.reindex(cols, axis='columns')
A B C
0 123 NaN test
1 456 NaN another_test
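reindex also accepts a fill_value if you want something other than NaN in the missing columns (a small variation on the answer above, with the same sample data):
import pandas as pd

df = pd.DataFrame([[123, 'test'], [456, 'another_test']], columns=['A', 'C'])
cols = ['A', 'B', 'C']

# fill missing columns with empty strings instead of NaN
out = df.reindex(cols, axis='columns', fill_value='')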

Get last unique row in Excel

I'm having a table looking like this:
id1 | id2 | dateUpdate
==========================================
aaa | 111 | 2016-01-01
aaa | 111 | 2016-01-02
aaa | 222 | 2016-01-05
aaa | 222 | 2016-01-15
bbb | 333 | 2016-01-05
bbb | 444 | 2016-01-01
ccc | 111 | 2016-01-02
I'd like to get only the latest row for each id1/id2 couple:
id1 | id2 | dateUpdate
==========================================
aaa | 111 | 2016-01-02
aaa | 222 | 2016-01-15
bbb | 333 | 2016-01-05
bbb | 444 | 2016-01-01
ccc | 111 | 2016-01-02
Thanks for your help!
Use the AGGREGATE function:
=AGGREGATE(14, 6, 1/($A$2:$A$99=E2)*($B$2:$B$99=F2)*($C$2:$C$99), 1)
Put the unique combinations of columns A and B in columns E and F, starting at E2 and F2 and filling down.
You have five unique combinations:
COLUMN A  COLUMN B
aaa       111
aaa       222
bbb       333
bbb       444
ccc       111
The AGGREGATE formula goes in G2 and is filled down:
=AGGREGATE(14, 6, 1/($A$2:$A$99=E2)*($B$2:$B$99=F2)*($C$2:$C$99), 1)
Regarding the syntax: 14 asks AGGREGATE for the k-th LARGEST value; for example, for the first combination aaa/111 the largest date is 2016-01-02.
The next argument, 6, tells AGGREGATE to ignore errors.
You can quickly get the pseudo-last value using the AGGREGATE function.
This is a standard non-array formula that does NOT require Ctrl+Shift+Enter. AGGREGATE was introduced with Excel 2010.
($A$2:$A$99=E2) checks which values of column A match the first unique value in column E.
This produces an array of TRUE/FALSE values; dividing 1 by these logical values gives either 1 or a #DIV/0! error, and AGGREGATE ignores the errors. A similar test is applied to the column B values listed in column F. I have limited the data to 99 rows; if you have more, change 99 to a higher figure as required.
HTH
EDIT
You can also go for the pivot table solution suggested by @Tim Biegeleisen in the comments:
id1 and id2 are taken as filter fields and the Maximum of dateUpdate is chosen in the value field.
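For readers doing this outside Excel, the same "latest row per id1/id2 couple" logic is a one-line groupby in pandas (an illustration only, not part of the original answer):
import pandas as pd

df = pd.DataFrame({
    'id1': ['aaa', 'aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'ccc'],
    'id2': [111, 111, 222, 222, 333, 444, 111],
    'dateUpdate': pd.to_datetime([
        '2016-01-01', '2016-01-02', '2016-01-05', '2016-01-15',
        '2016-01-05', '2016-01-01', '2016-01-02',
    ]),
})

# latest dateUpdate for each id1/id2 couple
latest = df.groupby(['id1', 'id2'], as_index=False)['dateUpdate'].max()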
