Pandas: With array of col names in a desired column order, select those that exist, NULL those that don't - python-3.x

I have an array of column names that I want in my output table, in that order, e.g. ["A", "B", "C"].
I have an input table that USUALLY contains all of the values in the array but NOT ALWAYS (the raw data is a JSON API response).
I want to select all available columns from the input table, and if a column does not exist, I want it filled with NULLs or NA or whatever; it doesn't really matter.
Let's say my input DataFrame (call it input_table) looks like this:
+-----+--------------+
| A   | C            |
+-----+--------------+
| 123 | test         |
| 456 | another_test |
+-----+--------------+
I want an output dataframe that has columns A, B, C in that order to produce
+-----+------+--------------+
| A   | B    | C            |
+-----+------+--------------+
| 123 | NULL | test         |
| 456 | NULL | another_test |
+-----+------+--------------+
I get a KeyError when I do input_table[["A","B","C"]]
I get None returned when I do input_table.get(["A","B","C"])
I was able to achieve what I want via:
for i in desired_columns_array:
    if i not in input_dataframe:
        output_dataframe[i] = ""
    else:
        output_dataframe[i] = input_dataframe[i]
But I'm wondering if there's something less verbose?
How do I get a desired output schema to match an input array when one or more columns in the input dataframe may not be present?

Transpose and reindex
import pandas as pd

df = pd.DataFrame([[123, 'test'], [456, 'another test']], columns=list('AC'))
l = list('ACB')
df1 = df.T.reindex(l).T[sorted(l)]

     A    B             C
0  123  NaN          test
1  456  NaN  another test

DataFrame.reindex over the column axis:
cols = ['A', 'B', 'C']
df.reindex(cols, axis='columns')

     A    B             C
0  123  NaN          test
1  456  NaN  another_test
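If you want the gaps filled with something other than NaN, reindex also accepts fill_value; a small sketch of the same approach:

import pandas as pd

df = pd.DataFrame({'A': [123, 456], 'C': ['test', 'another_test']})
cols = ['A', 'B', 'C']

# the missing column B is created and filled with the given value instead of NaN
out = df.reindex(cols, axis='columns', fill_value='')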

Related

how to change values in a df specifying by index contain in multiple lists, and each list for one column

I have a list with all the indices of values to be replaced. I have to change them in 8 different columns with 8 different lists. The replacement could be a simple string.
How can I do it?
I have more than 20 different columns in this df.
Eg:
list1 = [0,1,2]
list2 = [2,4]
list8 = ...
sustitution = 'no data'

| Column A | Column B |
| -------- | -------- |
| marcos   | peter    |
| Julila   | mike     |
| Fran     | Ramon    |
| Pedri    | Gavi     |
| Olmo     | Torres   |
OUTPUT:
| Column A | Column B |
| -------- | -------- |
| no data  | peter    |
| no data  | mike     |
| no data  | no data  |
| Pedri    | Gavi     |
| Olmo     | no data  |
Use DataFrame.loc with zipped lists and columns names:
list1 = [0,1,2]
list2 = [2,4]
L = [list1, list2]
cols = ['Column A', 'Column B']
sustitution = 'no data'

for c, i in zip(cols, L):
    df.loc[i, c] = sustitution

print(df)
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
You can use the underlying numpy array:
import numpy as np

list1 = [0,1,2]
list2 = [2,4]
lists = [list1, list2]

# repeat each column's position once per row index in its list
col = np.repeat(np.arange(len(lists)), list(map(len, lists)))
# array([0, 0, 0, 1, 1])
row = np.concatenate(lists)
# array([0, 1, 2, 2, 4])

df.values[row, col] = 'no data'
Output:
Column A Column B
0 no data peter
1 no data mike
2 no data no data
3 Pedri Gavi
4 Olmo no data
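One caveat worth noting with the array approach (my addition, not part of the original answer): with mixed dtypes, df.values returns a copy, so the assignment can be silently lost. A variant that avoids the ambiguity writes into an explicit array and rebuilds the frame:

arr = df.to_numpy()        # materialize the data as a plain array
arr[row, col] = 'no data'  # fancy indexing with the paired row/col arrays
df = pd.DataFrame(arr, columns=df.columns, index=df.index)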

How to create a group | sub-group (pre-defined) cyclic order by considering the identical consecutive groupings (in Pandas DataFrame) columns?

Task 1: I am looking for a solution that creates a group by considering identical consecutive groupings in one of the columns of my Pandas DataFrame (treating the column as the values of a list):
from itertools import groupby
import pandas as pd

test_list = ['AA', 'AA', 'BB', 'CC', 'DD', 'DD', 'DD', 'AA', 'BB', 'EE', 'CC']
data = pd.DataFrame(test_list)
data['batches'] = ['1','1','2','3','4','4','4','5','6','7','8']  # this is the goal to reach
print(data)
result = [list(y) for x, y in groupby(test_list)]
print(result)
[['AA', 'AA'], ['BB'], ['CC'], ['DD', 'DD', 'DD'], ['AA'], ['BB'], ['EE'], ['CC']]
So, I have a DataFrame with two columns: the first holds the elements, which must be kept in order and grouped into batches by identical consecutive grouping; the batches column is where the result should be assigned.
I couldn't find a solution or a workaround. As you can see, I've created a list using itertools.groupby by grouping the same consecutive items, but this isn't the final result I'd like to see. I know that itertools.groupby accepts a lambda via its key= parameter, which might get me to my solution.
I was thinking of turning the above into a dictionary, with the keys being the batch numbers obtained by enumerating the list and the values being the list elements:
{1:['AA', 'AA'], 2:['BB'], 3:['CC'], 4: ['DD', 'DD', 'DD']...}
After that, I'd convert the dictionary (or any other solution/workaround) to a Series and add it to my batches column.
In this exercise, I just want to return the keys of my 'dictionary' (the batch numbers) to the batches column.
| list | batches |
| -------- | ------- |
| AA | 1 |
| AA | 1 |
| BB | 2 |
| CC | 3 |
| DD | 4 |
| DD | 4 |
| DD | 4 |
| AA | 5 |
| BB | 6 |
| EE | 7 |
| CC | 8 |
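For reference, a common pandas idiom for this consecutive-grouping task (a sketch added here, not from the original post) compares each value with the previous row and cumulatively sums the change points:

s = data[0]
# a new batch starts whenever the value differs from the row above
data['batches'] = s.ne(s.shift()).cumsum()

On the test_list above this yields exactly 1, 1, 2, 3, 4, 4, 4, 5, 6, 7, 8.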
EDITED:
Task 2: The added query for a similar task:
In this scenario, my initial list has a (pre-defined) cyclic order to follow, e.g. AA -- AB -- AC belongs to one main group and DA -- DB belongs to another.
The question is how to calculate the sub-group column so that I can list sub-groups under my main group, so to say, capturing repeated cycles within the main group.
| list | sub | main gr |
| ---- | --- | ------- |
| AA   | 1   | 1       |
| AB   | 1   | 1       |
| AC   | 1   | 1       |
| AA   | 2   | 1       |
| AB   | 2   | 1       |
| AC   | 2   | 1       |
| DA   | 1   | 2       |
| DB   | 1   | 2       |
I found a solution whose logic was based on @Shubham's comment: use the .cumcount() function, as in df['sub'] = df.groupby(['main gr', 'list']).cumcount() + 1, where the + 1 makes the sub-order count start at 1 instead of 0.
(I'm not looking for the best solution, just a solution. Nevertheless, I would like to use this code on large datasets containing millions of entries.)
I would highly appreciate any comment or supporting feedback.
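A minimal runnable sketch of that cumcount approach (the prefix-to-main-group mapping is my assumption; the post only says the cyclic order is pre-defined):

import pandas as pd

df = pd.DataFrame({'list': ['AA', 'AB', 'AC', 'AA', 'AB', 'AC', 'DA', 'DB']})
# hypothetical rule: the first letter determines the pre-defined main group
df['main gr'] = df['list'].str[0].map({'A': 1, 'D': 2})
# number each repeat of a value within its main group, starting at 1
df['sub'] = df.groupby(['main gr', 'list']).cumcount() + 1
print(df)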

Adding a new column whose values are based on another column in either dataframe or excel

I want to add a new column "X" whose values should be either 0 or 1, such that if there exists a value (particularly a date, in my case) in column "A", it should give 1 (or any text), and 0 otherwise.
example:
A      | X
-------+---
*date* | 1
null   | 0
*date* | 1
*date* | 1
*date* | 1
null   | 0
Is there any way to do this in pandas or Excel/Office?
Here is an example in Excel (the worksheet function for testing empty cells is ISBLANK):
=IF(ISBLANK(A2);0;1)
or
=IF(A2>0;1;0)
(dates are always greater than zero).
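For the pandas half of the question, a one-line sketch of the same idea (assuming missing dates are stored as NaN/NaT):

# notna() is True where column A holds a date; astype(int) turns True/False into 1/0
df['X'] = df['A'].notna().astype(int)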

How to label encode a DataFrame column which contains both numbers and strings?

I have this DataFrame column
Index   Ticket
0       254326
1       CA345
3       SA12
4       267891
...
700     CA356
It contains two kinds of values. Some are pure numbers and others are strings having letters and numbers.
Many rows have the same letters (CA345, CA675, etc.). I would like to group the rows with the same letters and give them the same label.
E.g. all rows containing "CA" are labelled 0, all rows containing "SA" are labelled 1.
The remaining rows all contain six-digit numbers (no letters in them), and I would like to label all such rows with the same number (say 2, for example).
1st Approach
Define a custom function that checks whether the value is a string (isinstance(val, str)) and contains "CA" or "SA":
def label_ticket(row):
    if isinstance(row['Ticket'], str) and 'CA' in row['Ticket']:
        return 0
    if isinstance(row['Ticket'], str) and 'SA' in row['Ticket']:
        return 1
    return 2
Apply the custom function to the new column df['Label'].
df['Label'] = df.apply(label_ticket, axis=1)
print(df)
     Ticket  Label
0    254326      2
1     CA345      0
2      SA12      1
3    267891      2
700   CA356      0
2nd Approach
Further understanding the situation, it seems you have no idea which patterns will come up in df['Ticket']. In this case you can use re.split() to extract the letter pattern of each value and classify it accordingly.
import pandas as pd
import re
df = pd.DataFrame(columns=['Ticket'],
                  data=[[254326],
                        ['CA345'],
                        ['SA12'],
                        [267891],
                        ['CA356']])
df['Pattern'] = df['Ticket'].apply(lambda x: ''.join(re.split("[^a-zA-Z]*", str(x))))
df_label = pd.DataFrame(df['Pattern'].unique(), columns=['Pattern']).reset_index(level=0).rename(columns={'index': 'Label'})
df = df.merge(df_label, how='left')
print(df)
   Ticket Pattern  Label
0  254326              0
1   CA345      CA      1
2    SA12      SA      2
3  267891              0
4   CA356      CA      1
I don't have enough knowledge of Python, but you may try pandas.Series.str.extract with a regular expression, like:
import pandas as pd

ptrn = r'(?P<CA>(CA[\d]+))|(?P<SA>(SA[\d]+))|(?P<DIGIT>[\d]{6})'
ls = {'tk': ['254326', 'CA345', 'SA12', '267891', 'CA356']}
df = pd.DataFrame(ls)
s = df['tk'].str.extract(ptrn, expand=False)
newDf = {0: [x for x in s['CA'] if pd.isnull(x) == False],
         1: [x for x in s['SA'] if pd.isnull(x) == False],
         2: [x for x in s['DIGIT'] if pd.isnull(x) == False]}
print(newDf)
Output:
{0: ['CA345', 'CA356'], 1: ['SA12'], 2: ['254326', '267891']}
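If you then need a Label column like the other answers produce, one possible follow-up (my addition, not part of the original answer):

# take the name of the first non-null named group per row and map it to a code
df['Label'] = (s[['CA', 'SA', 'DIGIT']].notna()
               .idxmax(axis=1)
               .map({'CA': 0, 'SA': 1, 'DIGIT': 2}))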

Calculate mean per few columns in Pandas Dataframe

I have a Pandas dataframe, Data:
ID  | A1 | A2 | B1 | B2
ID1 | 2  | 1  | 3  | 7
ID2 | 4  | 6  | 5  | 3
I want to calculate the mean of columns (A1 and A2) and of (B1 and B2) separately, row-wise. My desired output:
ID  | A1A2 mean | B1B2 mean
ID1 | 1.5       | 5
ID2 | 5         | 4
I can take the mean of all columns together, but cannot find any function that produces my desired output.
Is there any built-in method in Python?
Use DataFrame.groupby with a lambda function to get the first letter of the column names for the mean; also, if the first column is not the index, use DataFrame.set_index first:
df = (df.set_index('ID')
        .groupby(lambda x: x[0], axis=1)
        .mean()
        .add_suffix('_mean')
        .reset_index())
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
Another solution is to extract the first letter of the column names by indexing with str[0]:
df = df.set_index('ID')
print (df.columns.str[0])
Index(['A', 'A', 'B', 'B'], dtype='object')
df = df.groupby(df.columns.str[0], axis=1).mean().add_suffix('_mean').reset_index()
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
Or:
df = (df.set_index('ID')
        .groupby(df.columns[1:].str[0], axis=1)
        .mean()
        .add_suffix('_mean')
        .reset_index())
Verify solution:
a = df.filter(like='A').mean(axis=1)
b = df.filter(like='B').mean(axis=1)
df = df[['ID']].assign(A_mean=a, B_mean=b)
print (df)
ID A_mean B_mean
0 ID1 1.5 5.0
1 ID2 5.0 4.0
EDIT:
If you have different column names and need to specify them in lists:
a = df[['A1','A2']].mean(axis=1)
b = df[['B1','B2']].mean(axis=1)
df = df[['ID']].assign(A_mean=a, B_mean=b)
print (df)
