Convert string "nan" to numpy NaN in pandas Dataframe that contains Strings [duplicate] - python-3.x

I am working on a large dataset with many columns of different types. There is a mix of numeric values and strings, with some NULL values. I need to change the NULL values to blank or 0 depending on the type.
1 John 2 Doe 3 Mike 4 Orange 5 Stuff
9 NULL NULL NULL 8 NULL NULL Lemon 12 NULL
I want it to look like this,
1 John 2 Doe 3 Mike 4 Orange 5 Stuff
9 0 8 0 Lemon 12
I can do this for each column individually, but since I am going to be pulling several extremely large datasets with hundreds of columns, I'd like to do this some other way.
Edit:
Types from Smaller Dataset,
Field1 object
Field2 object
Field3 object
Field4 object
Field5 object
Field6 object
Field7 object
Field8 object
Field9 object
Field10 float64
Field11 float64
Field12 float64
Field13 float64
Field14 float64
Field15 object
Field16 float64
Field17 object
Field18 object
Field19 float64
Field20 float64
Field21 int64

Use DataFrame.select_dtypes to pick out the numeric columns, fill their missing values with 0, then replace the missing values in all other columns with empty strings:
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 John 2.0 Doe 3 Mike 4.0 Orange 5 Stuff
1 9 NaN NaN NaN 8 NaN NaN Lemon 12 NaN
print (df.dtypes)
0 int64
1 object
2 float64
3 object
4 int64
5 object
6 float64
7 object
8 int64
9 object
dtype: object
import numpy as np

c = df.select_dtypes(np.number).columns
df[c] = df[c].fillna(0)
df = df.fillna("")
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 John 2.0 Doe 3 Mike 4.0 Orange 5 Stuff
1 9 0.0 8 0.0 Lemon 12
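Putting the snippet together with its imports and the sample frame (a self-contained sketch, reconstructing the data from the printed output above):

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the printed output above
df = pd.DataFrame([[1, "John", 2.0, "Doe", 3, "Mike", 4.0, "Orange", 5, "Stuff"],
                   [9, np.nan, np.nan, np.nan, 8, np.nan, np.nan, "Lemon", 12, np.nan]])

# Fill numeric NaNs with 0, then everything remaining with empty strings
c = df.select_dtypes(np.number).columns
df[c] = df[c].fillna(0)
df = df.fillna("")
print(df)
```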
Another solution is to create a dictionary for the replacement:
num_cols = df.select_dtypes(np.number).columns
d1 = dict.fromkeys(num_cols, 0)
d2 = dict.fromkeys(df.columns.difference(num_cols), "")
d = {**d1, **d2}
print (d)
{0: 0, 2: 0, 4: 0, 6: 0, 8: 0, 1: '', 3: '', 5: '', 7: '', 9: ''}
df = df.fillna(d)
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 John 2.0 Doe 3 Mike 4.0 Orange 5 Stuff
1 9 0.0 8 0.0 Lemon 12

You could try this to substitute a different value for each different column (A to C are numeric, while D is a string):
import pandas as pd
import numpy as np
df_pd = pd.DataFrame([[np.nan, 2, np.nan, '0'],
[3, 4, np.nan, '1'],
[np.nan, np.nan, np.nan, '5'],
[np.nan, 3, np.nan, np.nan]],
columns=list('ABCD'))
df_pd = df_pd.fillna(value={'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': ''})

Related

How to add value to specific index that is out of bounds

I have a list array
list = [[0, 1, 2, 3, 4, 5],[0],[1],[2],[3],[4],[5]]
Say I add [6, 7, 8] to the first row as the header for my three new columns, what's the best way to add values in these new columns, without getting index out of bounds? I've tried first filling all three columns with "" but when I add a value, it then pushes the "" out to the right and increases my list size.
Would it be any easier to use a Pandas dataframe? Are you allowed "gaps" in a Pandas dataframe?
According to OP's comment, I think a pandas DataFrame is the more appropriate solution. You cannot have 'gaps', but you can have NaN values, like this:
import numpy as np
import pandas as pd
# create sample data
a = np.arange(1, 6)
df = pd.DataFrame(zip(*[a]*5))
print(df)
output:
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
for adding empty columns:
# add new columns, not empty but filled w/ nan
df[5] = df[6] = df[7] = float('nan')
# fill a single value in column 7, row index 4
df.loc[4, 7] = 123
print(df)
output:
0 1 2 3 4 5 6 7
0 1 1 1 1 1 NaN NaN NaN
1 2 2 2 2 2 NaN NaN NaN
2 3 3 3 3 3 NaN NaN NaN
3 4 4 4 4 4 NaN NaN NaN
4 5 5 5 5 5 NaN NaN 123.0
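As a side note, when many empty columns are needed at once, DataFrame.reindex can add them in a single call; a small sketch (not from the answer above):

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2], 1: [3, 4]})

# reindex adds the missing column labels 2..4, filled with NaN by default
df = df.reindex(columns=range(5))
print(df)
```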

Populate text in empty rows above in Pandas dataframe

Below is an example dataframe using Pandas. Sorry about the table format.
I would like to take the text 'RF-543' in column 'Test Case' and populate it in the two empty rows above it.
Likewise for text 'RT-483' in column 'Test Case', I would like to be populate it in the three empty rows above it.
I have over 900 rows of data with varying empty rows to be populated with the next non-empty row that follows. Any suggestions are appreciated.
Req#
Test Case
745
AB-003
348
AA-006
932
335
675
RF-543
348
988
444
124
RT-483
Regards,
Gus
If they are NaN, something like below works:
Creating dataframe test
>>> df = pd.DataFrame({"A": ["apple",np.nan, np.nan, "banana", np.nan, "grape"]})
>>> df
A
0 apple
1 NaN
2 NaN
3 banana
4 NaN
5 grape
Use backward fill (bfill):
>>> df["A"] = df["A"].bfill()
>>> df["A"]
0 apple
1 banana
2 banana
3 banana
4 grape
5 grape
Name: A, dtype: object
If the fields are empty:
>>> df = pd.DataFrame({"A": ["apple", "", "", "banana", "", "grape"]})
>>> df
A
0 apple
1
2
3 banana
4
5 grape
>>> df["A"] = df["A"].replace("", np.nan)
>>> df["A"]
0 apple
1 NaN
2 NaN
3 banana
4 NaN
5 grape
Name: A, dtype: object
Then, backfill as above.
Try something like this:
df.loc[df["Test Case"] == '', "Test Case"] = np.nan
df = df.bfill()
Output
Req# Test Case
0 745 AB-003
1 348 AA-006
2 932 RF-543
3 335 RF-543
4 675 RF-543
5 348 RT-483
6 988 RT-483
7 444 RT-483
8 124 RT-483
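The two steps can also be combined on a small reconstruction of the table (a sketch; replace converts the empty strings to NaN before backfilling):

```python
import numpy as np
import pandas as pd

# Cut-down version of the question's table
df = pd.DataFrame({"Req#": [745, 348, 932, 335, 675],
                   "Test Case": ["AB-003", "AA-006", "", "", "RF-543"]})

# Empty strings are not NaN, so convert them first, then backward-fill
df["Test Case"] = df["Test Case"].replace("", np.nan).bfill()
print(df)
```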

Display row with False values in validated pandas dataframe column [duplicate]

This question already has answers here:
Display rows with one or more NaN values in pandas dataframe
(5 answers)
Closed 2 years ago.
I was validating 'Price' column in my dataframe. Sample:
ArticleId SiteId ZoneId Date Quantity Price CostPrice
53 194516 9 2 2018-11-26 11.0 40.64 27.73
164 200838 9 2 2018-11-13 5.0 99.75 87.24
373 200838 9 2 2018-11-27 1.0 99.75 87.34
pd.to_numeric(df_sales['Price'], errors='coerce').notna().value_counts()
True 17984
False 13
Name: Price, dtype: int64
And I'd love to display the rows with False values so I know what's wrong with them. How do I do that?
Thank you.
You could print the rows where Price isnull():
print(df_sales[df_sales['Price'].isnull()])
ArticleId SiteId ZoneId Date Quantity Price CostPrice
1 200838 9 2 2018-11-13 5 NaN 87.240
pd.to_numeric(df['Price'], errors='coerce').isna() returns a Boolean Series, which can be used to select the rows that cause errors. This includes NaN values as well as rows containing strings.
import pandas as pd
# test data
df = pd.DataFrame({'Price': ['40.64', '99.75', '99.75', pd.NA, 'test', '99. 0', '98 0']})
Price
0 40.64
1 99.75
2 99.75
3 <NA>
4 test
5 99. 0
6 98 0
# find the value of the rows that are causing issues
problem_rows = df[pd.to_numeric(df['Price'], errors='coerce').isna()]
# display(problem_rows)
Price
3 <NA>
4 test
5 99. 0
6 98 0
Alternative
Create an extra column and then use it to select the problem rows
df['Price_Updated'] = pd.to_numeric(df['Price'], errors='coerce')
Price Price_Updated
0 40.64 40.64
1 99.75 99.75
2 99.75 99.75
3 <NA> NaN
4 test NaN
5 99. 0 NaN
6 98 0 NaN
# find the problem rows
problem_rows = df.Price[df.Price_Updated.isna()]
Explanation
Updating the column in place with .to_numeric() and then checking for NaNs will not tell you which original values had to be coerced:
# update the Price row
df.Price = pd.to_numeric(df['Price'], errors='coerce')
# check for NaN
problem_rows = df.Price[df.Price.isnull()]
# display(problem_rows)
3 NaN
4 NaN
5 NaN
6 NaN
Name: Price, dtype: float64
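The coerced Series can also be computed once and reused, both to inspect the problem rows and to clean the column; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Price': ['40.64', '99.75', pd.NA, 'test']})

# Coerce once; NaNs in the result mark the rows that failed conversion
numeric = pd.to_numeric(df['Price'], errors='coerce')
bad = numeric.isna()

problem_rows = df.loc[bad, 'Price']  # original (string) values for inspection
df['Price'] = numeric                # clean numeric column
print(problem_rows)
```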

pandas groupby and widen dataframe with ordered columns

I have a long form dataframe that contains multiple samples and time points for each subject. The number of samples and timepoint can vary, and the days between time points can also vary:
import pandas as pd

test_df = pd.DataFrame({"subject_id":[1,1,1,2,2,3],
"sample":["A", "B", "C", "D", "E", "F"],
"timepoint":[19,11,8,6,2,12],
"time_order":[3,2,1,2,1,1]
})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex in columns, flatten it, and finally convert subject_id back into a column:
df = (test_df.set_index(['subject_id', 'time_order'])
.unstack()
.sort_index(level=[1,0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
Another option, which relies on the rows within each subject already being in descending time order:
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN
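If a time_order column is not available, one can be derived by sorting on timepoint and numbering rows per subject with groupby().cumcount(); a sketch under that assumption:

```python
import pandas as pd

test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12]})

# Number the samples per subject in ascending timepoint order
df = test_df.sort_values(["subject_id", "timepoint"])
df["order"] = df.groupby("subject_id").cumcount() + 1

# Then pivot as in the answer above
wide = df.set_index(["subject_id", "order"]).unstack()
wide = wide.sort_index(level=[1, 0], axis=1)
wide.columns = [f"{a}{b}" for a, b in wide.columns]
wide = wide.reset_index()
print(wide)
```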

Pandas: Group dataframe by condition "last value in column defines the group"

I've got a sorted dataframe (sorted by "customer_id" and "point_in_time") which looks like this:
import pandas as pd
import numpy as np
testing = pd.DataFrame({"customer_id": (1,1,1,2,2,2,2,2,3,3,3,3,4,4),
"point_in_time": (4,5,6,1,2,3,7,9,5,6,8,10,2,5),
"x": ("d", "a", "c", "ba", "cd", "d", "o", "a", "g", "f", "h", "d", "df", "b"),
"revenue": (np.nan, np.nan, 40, np.nan, np.nan, 23, np.nan, 10, np.nan, np.nan, np.nan, 40, np.nan, 100)})
testing
Now I want to group the dataframe by "customer_id" and "revenue". But with regard to "revenue", a group should start after the last existing revenue and end with the next occurring revenue.
So the groups should look like this:
If I had those groups I could easily do a
testing.groupby(["customer_id", "groups"])
I first tried to create those groups by first grouping by "customer_id" and applying a function to it in which I fill the missing values of "revenue":
def my_func(sub_df):
sub_df["groups"] = sub_df["revenue"].fillna(method="bfill")
sub_df.groupby("groups").apply(next_function)
testing.groupby(["customer_id"]).apply(my_func)
Unfortunately, this does not work if one customer has two revenues which are exactly the same. In this case after using fillna the group column of this customer will consist of only one value which does not allow additional grouping.
So how can this be done and what is the most efficient way to accomplish this task?
Thank you in advance!
Use Series.shift with Series.notna and Series.cumsum; finally, if necessary, add 1:
testing["groups"] = testing['revenue'].shift().notna().cumsum() + 1
print (testing)
customer_id point_in_time x revenue groups
0 1 4 d NaN 1
1 1 5 a NaN 1
2 1 6 c 40.0 1
3 2 1 ba NaN 2
4 2 2 cd NaN 2
5 2 3 d 23.0 2
6 2 7 o NaN 3
7 2 9 a 10.0 3
8 3 5 g NaN 4
9 3 6 f NaN 4
10 3 8 h NaN 4
11 3 10 d 40.0 4
12 4 2 df NaN 5
13 4 5 b 100.0 5
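With the groups column in place, the grouping the question asked for works directly; a minimal sketch with a trimmed-down frame:

```python
import numpy as np
import pandas as pd

testing = pd.DataFrame({"customer_id": (1, 1, 1, 2, 2),
                        "revenue": (np.nan, np.nan, 40, np.nan, 23)})

# Each group ends at a row with revenue; shifting keeps that row in its group
testing["groups"] = testing["revenue"].shift().notna().cumsum() + 1

# e.g. the closing revenue of each (customer, group) pair
out = testing.groupby(["customer_id", "groups"])["revenue"].last()
print(out)
```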
