How to create a group / sub-group with a (pre-defined) cyclic order by considering identical consecutive groupings in a Pandas DataFrame column? - python-3.x

Task 1: I am looking for a solution to create a group by considering the identical consecutive groupings in one of the columns of my Pandas DataFrame (consider the column values below as a list):
import pandas as pd
from itertools import groupby
test_list = ['AA', 'AA', 'BB', 'CC', 'DD', 'DD', 'DD', 'AA', 'BB', 'EE', 'CC']
data = pd.DataFrame(test_list)
data['batches'] = ['1','1','2','3','4','4','4','5','6','7','8'] # this is the goal to reach
print(data)
result = [list(y) for x, y in groupby(test_list)]
print(result)
[['AA', 'AA'], ['BB'], ['CC'], ['DD', 'DD', 'DD'], ['AA'], ['BB'], ['EE'], ['CC']]
So, I have a DataFrame with two columns: the first holds the elements that must be kept in order and grouped into batches (identical consecutive grouping); the second is the batches column where the result should be assigned.
I couldn't find a solution or a workaround. As you can see, I've created a list using the itertools groupby function by grouping the same consecutive items, but this isn't the final result I'd like to see. I know that itertools groupby allows me to utilize a lambda function with the 'key=' parameter to perhaps get to my solution.
I was thinking of merging the above into a dictionary by looping, with the keys being the batch numbers obtained by iterating over the list using enumerate, and the values being the list elements:
{1:['AA', 'AA'], 2:['BB'], 3:['CC'], 4: ['DD', 'DD', 'DD']...}
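A minimal sketch of that dictionary-building idea, using enumerate over the groupby result (numbering starts at 1):
from itertools import groupby
test_list = ['AA', 'AA', 'BB', 'CC', 'DD', 'DD', 'DD', 'AA', 'BB', 'EE', 'CC']
# number the consecutive groups, starting at 1
batches = {i: list(g) for i, (_, g) in enumerate(groupby(test_list), start=1)}
print(batches)
# {1: ['AA', 'AA'], 2: ['BB'], 3: ['CC'], 4: ['DD', 'DD', 'DD'], 5: ['AA'], 6: ['BB'], 7: ['EE'], 8: ['CC']}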
After that, I'll convert the dictionary (or any other solution/workaround) to a pandas Series and add it to my batches column:
In this exercise, I just want to return the keys of my 'dictionary' (the batch numbers) to the batches column.
| list | batches |
| -------- | ------- |
| AA | 1 |
| AA | 1 |
| BB | 2 |
| CC | 3 |
| DD | 4 |
| DD | 4 |
| DD | 4 |
| AA | 5 |
| BB | 6 |
| EE | 7 |
| CC | 8 |
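For what it's worth, here is a minimal sketch that builds the batches column directly in pandas, using the common shift/cumsum idiom (a new batch starts wherever a value differs from the previous row):
import pandas as pd
test_list = ['AA', 'AA', 'BB', 'CC', 'DD', 'DD', 'DD', 'AA', 'BB', 'EE', 'CC']
data = pd.DataFrame({'list': test_list})
# a change w.r.t. the previous row marks the start of a new batch
data['batches'] = (data['list'] != data['list'].shift()).cumsum()
print(data)
Since this is fully vectorized, it should also behave well on large datasets.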
EDITED:
Task 2: The added query for a similar task:
In this scenario, my initial list has a (pre-defined) cyclic order to follow: for example, AA -- AB -- AC belong to one main group and DA -- DB belong to another group.
The question is how to calculate the sub-group column so that I can have sub-group listings under my main group, so to speak, capturing repeated cycles within the main group.
| list | sub | main gr |
| ---- | --- | ------- |
| AA | 1 | 1 |
| AB | 1 | 1 |
| AC | 1 | 1 |
| AA | 2 | 1 |
| AB | 2 | 1 |
| AC | 2 | 1 |
| DA | 1 | 2 |
| DB | 1 | 2 |
I found a solution whose logic was based on @Shubham's comment. My solution uses the .cumcount() function as follows: df['sub'] = df.groupby(['main gr', 'list']).cumcount() + 1 (the + 1 makes the sub-order count/index start at 1 instead of 0).
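A runnable sketch of that solution, assuming the 'main gr' column is already available (here it is typed in by hand to match the table above):
import pandas as pd
df = pd.DataFrame({
    'list': ['AA', 'AB', 'AC', 'AA', 'AB', 'AC', 'DA', 'DB'],
    'main gr': [1, 1, 1, 1, 1, 1, 2, 2],
})
# count how often each value has already occurred within its main group
df['sub'] = df.groupby(['main gr', 'list']).cumcount() + 1
print(df)
Because cumcount is a vectorized groupby operation, it should scale to datasets with millions of rows.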
(I'm not looking for the best solution, I am looking for a solution. Nevertheless, I would like to use this code for large datasets containing millions of entries).
I would highly appreciate any comments or supporting feedback.

Related

Replace values of several columns with values mapping in other dataframe PySpark

I need to replace values of several columns (many more than those in the example, so I would like to avoid doing multiple left joins) of a dataframe with values from another dataframe (mapping).
Example:
df1 EXAM:
| id | question1 | question2 | question3 |
| -- | --------- | --------- | --------- |
| 1 | 12 | 12 | 5 |
| 2 | 12 | 13 | 6 |
| 3 | 3 | 7 | 5 |
df2 VOTE MAPPING:
| id | description |
| -- | ----------- |
| 3 | bad |
| 5 | insufficient |
| 6 | sufficient |
| 12 | very good |
| 13 | excellent |
Output:
| id | question1 | question2 | question3 |
| -- | --------- | --------- | --------- |
| 1 | very good | very good | insufficient |
| 2 | very good | excellent | sufficient |
| 3 | bad | null | insufficient |
Edit 1: Corrected id for excellent in vote map
First of all, you can create a reference dataframe:
from pyspark.sql import functions as func
from pyspark.sql.types import MapType, StringType

df3 = df2.select(
    func.create_map(func.col('id'), func.col('description')).alias('ref')
).groupBy().agg(
    func.collect_list('ref').alias('ref')
).withColumn(
    'ref',
    func.udf(
        lambda lst: {k: v for element in lst for k, v in element.items()},
        returnType=MapType(StringType(), StringType())
    )(func.col('ref'))
)
+--------------------------------------------------------------------------------+
|ref                                                                             |
+--------------------------------------------------------------------------------+
|{3 -> bad, 12 -> very good, 5 -> insufficient, 13 -> excellent, 6 -> sufficient}|
+--------------------------------------------------------------------------------+
Then you can replace the values in the question columns by looking them up in the reference map, using a single crossJoin:
df4 = df1.crossJoin(df3)\
    .select(
        'id',
        *[func.col('ref').getItem(func.col(col)).alias(col) for col in df1.columns[1:]]
    )
df4.show(10, False)
+---+---------+---------+------------+
|id |question1|question2|question3   |
+---+---------+---------+------------+
|1  |very good|very good|insufficient|
|2  |very good|excellent|sufficient  |
|3  |bad      |null     |insufficient|
+---+---------+---------+------------+
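As a side note, when the vote mapping is small enough to collect to the driver, a common alternative (a sketch under that assumption; df5 is a hypothetical name) is to build a literal map expression, avoiding both the UDF and the crossJoin:
from itertools import chain
from pyspark.sql import functions as func

# collect the small mapping table to the driver
mapping = {row['id']: row['description'] for row in df2.collect()}
# build one literal map column: key1, value1, key2, value2, ...
mapping_expr = func.create_map(*[func.lit(x) for x in chain(*mapping.items())])
df5 = df1.select(
    'id',
    *[mapping_expr[func.col(c)].alias(c) for c in df1.columns[1:]]
)
df5.show(10, False)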

Pandas, combine unique value from two column into one column while preserving order

I have data in four columns as shown below. Some values present in column 1 are duplicated in column 3. I would like to combine column 1 with column 3 while removing the duplicates from column 3, and I would also like to preserve the column order. Column 1 is associated with column 2 and column 3 with column 4, so it would be nice if column 1 items could move with column 2 and column 3 items with column 4 during the merge. Any help will be appreciated.
Input table:
| Item | Price | Item | Price |
| ----- | ----- | ----- | ----- |
| Car | 105 | Truck | 54822 |
| Chair | 20 | Pen | 1 |
| Cup | 2 | Car | 105 |
| | | Glass | 1 |
Output table:
| Item | Price |
| ----- | ----- |
| Car | 105 |
| Chair | 20 |
| Cup | 2 |
| Truck | 54822 |
| Pen | 1 |
| Glass | 1 |
Thank you in advance.
After separating the input table into the left and right part, we can concatenate the left hand items with the unduplicated right hand items quite simply with boolean indexing:
import pandas as pd
# this initial section only recreates your sample input table
from io import StringIO
input = pd.read_table(StringIO("""| Item | Price | Item | Price |
|-------|-------|------|-------|
| Car | 105 | Truck| 54822 |
| Chair | 20 | Pen | 1 |
| Cup | 2 | Car | 105 |
| | | Glass| 1 |
"""), sep=r' *\| *', engine='python', usecols=[1,2,3,4], skiprows=[1], keep_default_na=False)
input.columns = list(input.columns[:2])*2
# now separate the input table into the left and right part
left = input.iloc[:,:2].replace("", pd.NA).dropna().set_index('Item')
right = input.iloc[:, 2:].set_index('Item')
# finally construct the output table by concatenating without duplicates
output = pd.concat([left, right[~right.index.isin(left.index)]])
Price
Item
Car 105
Chair 20
Cup 2
Truck 54822
Pen 1
Glass 1

Pandas: With array of col names in a desired column order, select those that exist, NULL those that don't

I have an array of column names I want as my output table in that order e.g. ["A", "B", "C"]
I have an input table that USUALLY contains all of the values in the array but NOT ALWAYS (the raw data is a JSON API response).
I want to select all available columns from the input table, and if a column does not exist, I want it filled with NULLs or NA or whatever, it doesn't really matter.
Let's say my input DataFrame (call it input_table) looks like this:
+-----+--------------+
| A | C |
+-----+--------------+
| 123 | test |
| 456 | another_test |
+-----+--------------+
I want an output dataframe that has columns A, B, C in that order to produce
+-----+------+--------------+
| A | B | C |
+-----+------+--------------+
| 123 | NULL | test |
| 456 | NULL | another_test |
+-----+------+--------------+
I get a KeyError when I do input_table[["A","B","C"]]
I get a NoneType returned when I do input_table.get(["A","B","C"])
I was able to achieve what I want via:
for i in desired_columns_array:
    if i not in input_dataframe:
        output_dataframe[i] = ""
    else:
        output_dataframe[i] = input_dataframe[i]
But I'm wondering if there's something less verbose?
How do I get a desired output schema to match an input array when one or more columns in the input dataframe may not be present?
Transpose and reindex
df = pd.DataFrame([[123,'test'], [456, 'another test']], columns=list('AC'))
l = list('ACB')
df1 = df.T.reindex(l).T[sorted(l)]
A B C
0 123 NaN test
1 456 NaN another test
DataFrame.reindex over the column axis:
cols = ['A', 'B', 'C']
df.reindex(cols, axis='columns')
A B C
0 123 NaN test
1 456 NaN another_test

How to label encode a DataFrame column which contains both numbers and strings?

I have this DataFrame column
df:
+-------+---------+
| Index | Ticket  |
+-------+---------+
| 0     | 254326  |
| 1     | CA345   |
| 3     | SA12    |
| 4     | 267891  |
| ...   | ...     |
| 700   | CA356   |
+-------+---------+
It contains two kinds of values: some are pure numbers and others are strings containing letters and numbers. Many rows share the same letters (CA345, CA675, etc.). I would like to group the rows with the same letters and label them with the same number. E.g. all rows containing "CA" are labelled 0, all rows containing "SA" are labelled 1. The remaining rows all hold six-digit numbers (no letters in them); I would like to label all such rows with one common number (say 2).
1st Approach
Define a custom function that checks whether the row value is a string (isinstance(val, str)) and contains "CA" or "SA":
def label_ticket(row):
    if isinstance(row['Ticket'], str) and 'CA' in row['Ticket']:
        return 0
    if isinstance(row['Ticket'], str) and 'SA' in row['Ticket']:
        return 1
    return 2
Apply the custom function to create the new column df['Label']:
df['Label'] = df.apply(label_ticket, axis=1)
print(df)
Ticket Label
0 254326 2
1 CA345 0
2 SA12 1
3 267891 2
700 CA356 0
2nd Approach
Further understanding the situation, it seems you have no idea which patterns will come up in df['Ticket']. In this case you can use re.split() to extract the letter pattern of each value and classify the values into categories accordingly.
import pandas as pd
import re
df = pd.DataFrame(columns=['Ticket'],
                  data=[[254326],
                        ['CA345'],
                        ['SA12'],
                        [267891],
                        ['CA356']])
df['Pattern'] = df['Ticket'].apply(lambda x: ''.join(re.split("[^a-zA-Z]*", str(x))))
df_label = pd.DataFrame(df['Pattern'].unique(), columns=['Pattern']).reset_index(level=0).rename(columns={'index': 'Label'})
df = df.merge(df_label, how='left')
print(df)
Ticket Pattern Label
0 254326 0
1 CA345 CA 1
2 SA12 SA 2
3 267891 0
4 CA356 CA 1
I don't have enough knowledge of Python, but you may try pandas.Series.str.extract with a regular expression, like:
import pandas as pd
ptrn = r'(?P<CA>(CA[\d]+))|(?P<SA>(SA[\d]+))|(?P<DIGIT>[\d]{6})'
ls = {'tk': ['254326', 'CA345', 'SA12', '267891', 'CA356']}
df = pd.DataFrame(ls)
s = df['tk'].str.extract(ptrn, expand=False)
newDf = {0: [x for x in s['CA'] if pd.notnull(x)],
         1: [x for x in s['SA'] if pd.notnull(x)],
         2: [x for x in s['DIGIT'] if pd.notnull(x)]}
print(newDf)
Output:
{0: ['CA345', 'CA356'], 1: ['SA12'], 2: ['254326', '267891']}

tabulate frequency counts including zeros

To illustrate the problem, consider the following data: 1,2,3,5,3,2. Enter this in a spreadsheet column and make a pivot table displaying the counts. Making use of the information in this pivot table, I want to create a new table, with counts for every value between 1 and 5.
1,1
2,2
3,2
4,0
5,1
What is a good way to do this? My first thought was to use VLOOKUP, trapping any lookup error. But GETPIVOTDATA is apparently preferred for pivot tables. In any case, I failed with both approaches.
To be a bit more specific, assume my pivot table of counts is "PivotTable1" and that I have already created a one column table holding all the needed lookup keys (i.e., the numbers from 1 to 5). What formula should I put in the second column of this new table?
So starting with this:
To illustrate the problem, consider the following data: 1,2,3,5,3,2. Enter this in a spreadsheet column and make a pivot table displaying the counts.
I then created the table like this:
X | Freq
- | ---------------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
2 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
3 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
4 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
5 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
Or, in A1 mode:
X | Freq
- | -----------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F3),0)
2 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F4),0)
3 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F5),0)
4 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F6),0)
5 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F7),0)
The column X in my summary table is in column F.
Or as a table formula:
X | Freq
- | -------------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
2 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
3 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
4 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
5 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[#X]),0)
That gave me this result:
X | Freq
- | ----
1 | 1
2 | 2
3 | 2
4 | 0
5 | 1
If performance is not a major concern, you can bypass the pivot table and use the COUNTIF() function.
Create a list of all consecutive numbers that you want the counts for and use COUNTIF() for each of them with the first parameter being the range of your input numbers and the second being the number of the ordered result list:
    A   B   C   D
1   1       1   =COUNTIF(A:A,C1)
2   2       2   =COUNTIF(A:A,C2)
3   3       3   =COUNTIF(A:A,C3)
4   5       4   =COUNTIF(A:A,C4)
5   3       5   =COUNTIF(A:A,C5)
6   2
