Pandas Dataframe array entries as rows [duplicate] - python-3.x

Duplicate of: How to unnest (explode) a column in a pandas DataFrame, into multiple rows. Use the pd.DataFrame.explode() method.
I have a pandas.DataFrame whose entries are lists, and I would like to disaggregate each entry into long format.
Below is code to reproduce what I am looking for.
import pandas as pd
import numpy as np

date = '08-30-2022'
ids = ['s1', 's2']
g1 = ['b1', 'b2']
g2 = ['b1', 'b3', 'b4']
g_ls = [g1, g2]
v1 = [2.0, 2.5]
v2 = [3.2, np.nan, 3.7]
v_ls = [v1, v2]

# input frame: one row per id, with list-valued 'group' and 'values' entries
data = {
    'date': [date] * len(ids),
    'ids': ids,
    'group': g_ls,
    'values': v_ls,
}
df_in = pd.DataFrame.from_dict(data)

# the desired long-format frame
data_out = {
    'date': [date] * 5,
    'ids': ['s1', 's1', 's2', 's2', 's2'],
    'group': ['b1', 'b2', 'b1', 'b3', 'b4'],
    'values': [2.0, 2.5, 3.2, np.nan, 3.7],
}
desired_df = pd.DataFrame.from_dict(data_out)
Have:
date ids group values
0 08-30-2022 s1 [b1, b2] [2.0, 2.5]
1 08-30-2022 s2 [b1, b3, b4] [3.2, nan, 3.7]
Want:
date ids group values
0 08-30-2022 s1 b1 2.0
1 08-30-2022 s1 b2 2.5
2 08-30-2022 s2 b1 3.2
3 08-30-2022 s2 b3 NaN
4 08-30-2022 s2 b4 3.7

Try with
df = df_in.explode(['group','values'])
Out[173]:
date ids group values
0 08-30-2022 s1 b1 2.0
0 08-30-2022 s1 b2 2.5
1 08-30-2022 s2 b1 3.2
1 08-30-2022 s2 b3 NaN
1 08-30-2022 s2 b4 3.7
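The repeated 0/1 index values carry over from the original rows. If you also want the fresh 0..4 index shown in the desired output, explode takes an ignore_index flag (exploding multiple columns at once requires pandas 1.3 or later):
df = df_in.explode(['group', 'values'], ignore_index=True)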

Related

Pandas - Conditional drop duplicates based on number of NaN

I have a pandas 0.24.2 DataFrame in Python 3.7.x, as below. I want to drop_duplicates() rows with the same Name based on conditional logic. A similar question can be found here: Pandas - Conditional drop duplicates, but my case is more complicated.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6],
    'Name': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Value1': [1, np.nan, 0, np.nan, 1, np.nan],
    'Value2': [np.nan, 0, np.nan, 1, np.nan, 0],
    'Value3': [np.nan, 0, np.nan, 1, np.nan, np.nan]
})
How is it possible to:
Drop duplicates among records with the same 'Name', keeping the one that has fewer NaNs?
If they have the same number of NaNs, keep the one that does NOT have a NaN in 'Value1'?
The desired output would be:
Id Name Value1 Value2 Value3
2 2 B NaN 0 0
3 3 C 0 NaN NaN
4 4 A NaN 1 1
The idea is to create helper columns for both conditions, sort, and remove duplicates:
df1 = df.assign(count=df.isna().sum(axis=1),
                count_val1=df['Value1'].isna().astype('i1'))
df2 = (df1.sort_values(['count', 'count_val1'])[df.columns]
          .drop_duplicates('Name')
          .sort_index())
print(df2)
Id Name Value1 Value2 Value3
1 2 B NaN 0.0 0.0
2 3 C 0.0 NaN NaN
3 4 A NaN 1.0 1.0
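Booleans sort with False before True, so the integer cast is optional; a minimal equivalent sketch of the same idea:
# rank rows by total NaNs, then by whether Value1 is NaN, and keep the best per Name
helper = df.assign(n_nan=df.isna().sum(axis=1), v1_nan=df['Value1'].isna())
df2 = (helper.sort_values(['n_nan', 'v1_nan'])[df.columns]
             .drop_duplicates('Name')
             .sort_index())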
Here is a different solution. The goal is to create two helper columns used to sort the rows so that, within each Name, the duplicates we want to delete come last.
First, we create the columns.
df['count_nan'] = df.isnull().sum(axis=1)

# 1 if Value1 is NaN, 0 otherwise (a NaN fails every comparison)
Value1_nan = []
for row in df['Value1']:
    if row >= 0:
        Value1_nan.append(0)
    else:
        Value1_nan.append(1)
df['Value1_nan'] = Value1_nan

We then sort the rows so that, within each Name, the row with the fewest NaNs comes first, breaking ties in favor of a non-NaN Value1.
df.sort_values(by=['Name', 'count_nan', 'Value1_nan'], inplace=True,
               ascending=[True, True, True])
Finally, we drop all but the "first" duplicate row; that is, for each Name we keep the row with the fewest NaNs and, among those, the one whose Value1 is not NaN.
df = df.drop_duplicates(subset=['Name'], keep='first')
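Note that this leaves the helper columns in df; to restore the original shape, drop them afterwards:
df = df.drop(columns=['count_nan', 'Value1_nan'])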

How to add a column to a dataframe and assign conditional values?

I have a simple dataframe where I want to add a new column (col3) whose values are determined by the values in 'col1'. If the value in 'col1' starts with A, I want to put 'A' in col3, and similarly for values that start with B.
import pandas as pd

d = {"col1": ["A1", "A2", "B1", "B2"], "col2": [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
df
import numpy as np
import pandas as pd

d = {"col1": ["A1", "A2", "B1", "B2"], "col2": [1, 2, 3, 4]}
df = pd.DataFrame(data=d)
df['col3'] = np.where(df.col1.str.startswith('A'), 'A', df.col1)
df['col3'] = np.where(df.col1.str.startswith('B'), 'B', df['col3'])
df
Output
col1 col2 col3
0 A1 1 A
1 A2 2 A
2 B1 3 B
3 B2 4 B
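If more prefixes follow, chained np.where calls get unwieldy; np.select takes a list of conditions and matching choices instead. A minimal sketch on the same data ('other' is a placeholder default for rows matching neither condition):
conditions = [df.col1.str.startswith('A'), df.col1.str.startswith('B')]
df['col3'] = np.select(conditions, ['A', 'B'], default='other')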
Try this:
import pandas as pd

d = {"col1": ["A1", "A2", "B1", "B2", "C1", "C2"], "col2": [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data=d)
df['col3'] = df["col1"].str[0]
print(df)
This results in:
col1 col2 col3
0 A1 1 A
1 A2 2 A
2 B1 3 B
3 B2 4 B
4 C1 5 C
5 C2 6 C
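If the prefix can be longer than a single character, str.extract with a regular expression is a common generalization; a sketch, assuming the prefix is the leading run of letters:
df['col3'] = df['col1'].str.extract(r'^([A-Za-z]+)', expand=False)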

How to replace Specific values of a particular column in Pandas Dataframe based on a certain condition?

I have a pandas DataFrame which contains students and the percentages of marks they obtained. Some students' marks are shown as greater than 100%. These values are obviously incorrect, and I would like to replace all percentage values greater than 100% with NaN.
I have tried some code but am not quite able to get exactly what I want.
import numpy as np
import pandas as pd

new_DF = pd.DataFrame({'Student': ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages': [85, 70, 101, 55, 120]})
#   Percentages Student
# 0          85      S1
# 1          70      S2
# 2         101      S3
# 3          55      S4
# 4         120      S5
new_DF[(new_DF.iloc[:, 0] > 100)] = np.nan
# Percentages Student
#0 85.0 S1
#1 70.0 S2
#2 NaN NaN
#3 55.0 S4
#4 NaN NaN
As you can see, the code kind of works, but it replaces all the values in any row where Percentages is greater than 100 with NaN. I would like to replace only the value in the Percentages column with NaN where it is greater than 100. Is there any way to do that?
Try using np.where:
new_DF.Percentages = np.where(new_DF.Percentages.gt(100), np.nan, new_DF.Percentages)
or
new_DF.loc[new_DF.Percentages.gt(100), 'Percentages'] = np.nan
print(new_DF)
Student Percentages
0 S1 85.0
1 S2 70.0
2 S3 NaN
3 S4 55.0
4 S5 NaN
Also,
new_DF.Percentages = new_DF.Percentages.apply(lambda x: np.nan if x > 100 else x)
or,
new_DF.Percentages = new_DF.Percentages.where(new_DF.Percentages <= 100, np.nan)
You can use .loc:
new_DF.loc[new_DF['Percentages'] > 100, 'Percentages'] = np.nan
Output:
Student Percentages
0 S1 85.0
1 S2 70.0
2 S3 NaN
3 S4 55.0
4 S5 NaN
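Series.mask is the mirror image of where and reads naturally here, since NaN is its default replacement value; a minimal sketch:
new_DF['Percentages'] = new_DF['Percentages'].mask(new_DF['Percentages'] > 100)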
import numpy as np
import pandas as pd

new_DF = pd.DataFrame({'Student': ['S1', 'S2', 'S3', 'S4', 'S5'],
                       'Percentages': [85, 70, 101, 55, 120]})
# cast to float so the column can hold NaN
new_DF['Percentages'] = new_DF['Percentages'].astype(float)
# replace out-of-range values by label instead of chained indexing
for index, value in enumerate(new_DF['Percentages']):
    if value > 100:
        new_DF.loc[index, 'Percentages'] = np.nan
print(new_DF)

Manipulate values in pandas DataFrame columns based on matching IDs from another DataFrame

I have two dataframes like the following examples:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})
print(df)
a b c
0 20 1.0 NaN
1 50 NaN 1.0
2 100 1.0 1.0
print(df_id)
b c
0 50 123
1 4954 100
2 93920 6
3 20 NaN
For each identifier in df['a'], I want to null the value in df['b'] if there is no matching identifier in any row in df_id['b']. I want to do the same for column df['c'].
My desired result is as follows:
result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
                       'c': [np.nan, np.nan, 1]})
print(result)
a b c
0 20 1.0 NaN
1 50 NaN NaN # df_id['c'] did not contain '50'
2 100 NaN 1.0 # df_id['b'] did not contain '100'
My attempt to do this is here:
for i, letter in enumerate(['b', 'c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
                           .isin(df_id[letter].tolist()) else np.nan, axis=1))
The error I get:
AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')
This is in Python 3.5.2, pandas version 0.20.1.
You can solve your problem using this instead:
for letter in ['b', 'c']:  # dropped enumerate, since the index is not needed here
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in df_id[letter].tolist()
                          else np.nan, axis=1)
Just replace isin with in.
The problem is that when you use apply on df with axis=1, x represents a row, so x['a'] selects a single element, here a string.
However, isin is applicable to Series or other list-like structures, which is why the call raises the error; instead we use the in operator to check whether that element is in the list.
Hope that was helpful. If you have any questions, please ask.
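For reference, a vectorized sketch of the same logic that avoids apply entirely, using isin on the whole column and where to null the non-matches:
for letter in ['b', 'c']:
    df[letter] = df[letter].where(df['a'].isin(df_id[letter]))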
Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:
for letter in ['b', 'c']:
    mask = df['a'].isin(df_id[letter])
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    df.loc[mask, name] = df.loc[mask, letter]
    df[letter] = df[name]
    del df[name]
This isn't pretty, but seems to work.
If you have a bigger DataFrame and performance is important to you, you can first build a mask DataFrame and then apply it to your original frame.
First create the mask:
mask = df_id.apply(lambda x: df['a'].isin(x))
b c
0 True False
1 True False
2 False True
This can be applied to the original dataframe:
df.iloc[:, 1:] = df.iloc[:, 1:].mask(~mask, np.nan)
a b c
0 20 1.0 NaN
1 50 NaN NaN
2 100 NaN 1.0

How to sum columns in Python based on columns with non-empty strings

import numpy as np
import pandas as pd

# 'Sum over columns' shows the desired result
df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3],
    'Sum over columns': [1, 10, 8, 5, 10]})
Hi everybody, could you please help me with the following issue?
I'm trying to sum over columns to get the sum of data1 and data2: include data1 in the sum only if key1 is not NaN, and include data2 only if key2 is not NaN. The result I want is shown in the 'Sum over columns' column. Thank you for your help!
Try using the .apply method of df with axis=1 and NumPy's array multiplication to get your desired output:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key1': [np.nan, 'a', 'b', 'b', 'a'],
    'data1': [2, 5, 8, 5, 7],
    'key2': ['ab', 'aa', np.nan, np.nan, 'one'],
    'data2': [1, 5, 9, 6, 3]})

# select by label rather than position, and multiply raw arrays so the
# differing labels of the data and key columns are not aligned away
df['Sum over columns'] = df.apply(
    lambda x: np.multiply(x[['data1', 'data2']].to_numpy(),
                          x[['key1', 'key2']].notna().to_numpy()).sum(),
    axis=1)
Or:
df['Sum over columns'] = np.multiply(
    df[['data1', 'data2']].to_numpy(),
    df[['key1', 'key2']].notna().to_numpy()).sum(axis=1)
Either one of them should yield:
#   key1  data1 key2  data2  Sum over columns
# 0  NaN      2   ab      1                 1
# 1    a      5   aa      5                10
# 2    b      8  NaN      9                 8
# 3    b      5  NaN      6                 5
# 4    a      7  one      3                10
I hope this helps.
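A variant of the same idea using DataFrame.where: mask each data column by whether the matching key column is non-null, then sum (passing a raw boolean array is an assumption made here to pair the columns positionally rather than by label):
keys_present = df[['key1', 'key2']].notna().to_numpy()
df['Sum over columns'] = df[['data1', 'data2']].where(keys_present).sum(axis=1)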
