Replacing values in specific columns in a Pandas DataFrame when the number of columns is unknown - python-3.x

I am brand new to Python and Stack Exchange. I have been trying to replace invalid values (x < -3 and x > 12) with np.nan in specific columns.
I don't know how many columns I will have to deal with, and thus will have to write general code that takes this into account. I do however know that the first two columns are ids and names respectively. I have searched Google and Stack Exchange but haven't been able to find a solution that addresses my specific objective.
My question is: how would one replace values in the third column and onwards?
My dataframe looks like this:
[screenshot: Data]
I tried this line:
Data[Data > 12.0] = np.nan
This replaced the first two columns with NaN.
[screenshot: 1st attempt]
I tried this line:
Data[(Data.iloc[(range(2,Columns))] >=12) & (Data.iloc[(range(2,Columns))]<=-3)] = np.nan
where,
Columns = len(Data.columns)
This is clearly wrong, replacing all values in rows 2 to 6 (here Columns = 7).
[screenshot: 2nd attempt]
Any thoughts would be greatly appreciated.
Python 3.6.1 64bits, Qt 5.6.2, PyQt5 5.6 on Darwin

You're looking for the applymap() method.
import pandas as pd
import numpy as np
# get the columns after the second one
cols = Data.columns[2:]
# apply mask to those columns
new_df = Data[cols].applymap(lambda x: np.nan if x > 12 or x < -3 else x)
Documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html
This approach assumes your columns after the second contain float or int values.
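Note that new_df above contains only the sliced columns. If you want the masked values written back into the original dataframe instead, a minimal sketch (assuming Data is your dataframe and the invalid range is x < -3 or x > 12) could be:
import numpy as np
# get the columns after the second one
cols = Data.columns[2:]
# overwrite only those columns; the id and name columns stay untouched
Data[cols] = Data[cols].applymap(lambda x: np.nan if (x > 12 or x < -3) else x)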

You can set values in specific columns of a dataframe by using iloc to slice the columns you need, and then set the values using where.
A short example using some random data
df = pd.DataFrame(np.random.randint(0,10,(4,10)))
0 1 2 3 4 5 6 7 8 9
0 7 7 9 4 2 6 6 1 7 9
1 0 1 2 4 5 5 3 9 0 7
2 0 1 4 4 3 8 7 0 6 1
3 1 4 0 2 5 7 2 7 9 9
Now we select the region we want to update using iloc, slicing from column index 2 to the last column.
df.iloc[:,2:] = df.iloc[:,2:].where((df < 7) & (df > 2))
Which sets the values outside that range to NaN while leaving the first two columns untouched.
0 1 2 3 4 5 6 7 8 9
0 7 7 NaN 4.0 NaN 6.0 6.0 NaN NaN NaN
1 0 1 NaN 4.0 5.0 5.0 3.0 NaN NaN NaN
2 0 1 4.0 4.0 3.0 NaN NaN NaN 6.0 NaN
3 1 4 NaN NaN 5.0 NaN NaN NaN NaN NaN
For your data the code would be this
Data.iloc[:,2:] = Data.iloc[:,2:].where((Data <= 12) & (Data >= -3))
Operator clarification
The condition passed to where describes the values we want to keep:
-3 <= Data <= 12, i.e. everything between those numbers, which is why the two comparisons are combined with &.
If we instead try to describe the invalid values with the & operator, we get
Data <= -3 and Data >= 12, but a number cannot be less than -3 and greater than 12 at the same time, so that condition never matches anything.
To describe the invalid values we therefore use the or operator |, and because where keeps values where the condition is True, the outlier condition has to be paired with mask (the inverse of where) instead:
Data.iloc[:,2:] = Data.iloc[:,2:].mask((Data > 12) | (Data < -3))
So the data is set to NaN wherever
Data < -3 or Data > 12
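As a quick sanity check, here is a minimal sketch with made-up data showing that where with the keep-condition and mask with the outlier-condition produce the same result:
import numpy as np
import pandas as pd
df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b'], 'v': [20.0, 5.0]})
kept = df['v'].where((df['v'] <= 12) & (df['v'] >= -3))   # keep valid values
dropped = df['v'].mask((df['v'] > 12) | (df['v'] < -3))   # drop invalid values
print(kept.equals(dropped))  # True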

Related

How to add value to specific index that is out of bounds

I have a list array
list = [[0, 1, 2, 3, 4, 5],[0],[1],[2],[3],[4],[5]]
Say I add [6, 7, 8] to the first row as the header for my three new columns. What's the best way to add values in these new columns without getting an index-out-of-bounds error? I've tried filling all three columns with "" first, but when I add a value it pushes the "" to the right and increases my list size.
Would it be any easier to use a Pandas dataframe? Are you allowed "gaps" in a Pandas dataframe?
According to the OP's comment, I think a pandas DataFrame is the more appropriate solution. You cannot have 'gaps', but you can have NaN values, like this:
import numpy as np
import pandas as pd
# create sample data
a = np.arange(1, 6)
df = pd.DataFrame(zip(*[a]*5))
print(df)
output:
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
For adding empty columns:
# add new columns, not empty but filled w/ nan
df[5] = df[6] = df[7] = float('nan')
# fill a single value in column 7 at index 4 (loc avoids chained-assignment issues)
df.loc[4, 7] = 123
print(df)
output:
0 1 2 3 4 5 6 7
0 1 1 1 1 1 NaN NaN NaN
1 2 2 2 2 2 NaN NaN NaN
2 3 3 3 3 3 NaN NaN NaN
3 4 4 4 4 4 NaN NaN NaN
4 5 5 5 5 5 NaN NaN 123.0
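If you need several new NaN columns at once, reindex can add them in one call; a minimal sketch, assuming the same df as above:
# columns 5, 6 and 7 are created in one step and filled with NaN
df = df.reindex(columns=range(8))
df.loc[4, 7] = 123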

Multiplying 2 pandas dataframes generates nan

I have 2 dataframes as below
import pandas as pd
dat = pd.DataFrame({'val1' : [1,2,1,2,4], 'val2' : [1,2,1,2,4]})
dat1 = pd.DataFrame({'val3' : [1,2,1,2,4]})
Now I want to multiply each column of dat by dat1, so I did the following:
dat * dat1
However, this generates NaN values for all elements.
Could you please help with the correct approach? I could run a for loop over each column of dat, but I wonder if there is a better method to do the same thing.
Thanks for your pointers.
When doing multiplication (or any arithmetic operation), pandas performs index alignment, on both the index and the columns in the case of dataframes. Where the labels match, it multiplies; otherwise it puts NaN, and the result has the union of the indices and columns of the operands.
So, to "avoid" this alignment, make dat1 a label-unaware data structure, e.g., a NumPy array:
In [116]: dat * dat1.to_numpy()
Out[116]:
val1 val2
0 1 1
1 4 4
2 1 1
3 4 4
4 16 16
To see what's "really" being multiplied, you can align yourself:
In [117]: dat.align(dat1)
Out[117]:
( val1 val2 val3
0 1 1 NaN
1 2 2 NaN
2 1 1 NaN
3 2 2 NaN
4 4 4 NaN,
val1 val2 val3
0 NaN NaN 1
1 NaN NaN 2
2 NaN NaN 1
3 NaN NaN 2
4 NaN NaN 4)
(Extra: dat and dat1 happen to have the same index here; change the index of one of them and align again to see the union behaviour.)
You need to change two things:
use mul with axis=0
use a Series instead of dat1 (otherwise the multiplication will try to align the columns, and there are no common ones between your two dataframes)
out = dat.mul(dat1['val3'], axis=0)
output:
val1 val2
0 1 1
1 4 4
2 1 1
3 4 4
4 16 16
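If you don't know the column name in dat1 ahead of time, squeeze turns the single-column dataframe into a Series, so the same call works generically; a small sketch:
out = dat.mul(dat1.squeeze(), axis=0)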

pyspark - assign non-null columns to new columns

I have a dataframe of the following scheme in pyspark:
user_id datadate page_1.A page_1.B page_1.C page_2.A page_2.B \
0 111 20220203 NaN NaN NaN NaN NaN
1 222 20220203 5 5 5 5.0 5.0
2 333 20220203 3 3 3 3.0 3.0
page_2.C page_3.A page_3.B page_3.C
0 NaN 1.0 1.0 2.0
1 5.0 NaN NaN NaN
2 4.0 NaN NaN NaN
So it contains columns like user_id and datadate, plus a few columns for each page (3 pages here), which are the result of 2 joins. In this example I have page_1, page_2 and page_3, and each has 3 columns: A, B, C. Additionally, for each row, the columns of a given page are either all null or all filled, as in my example.
I don't care which page the values come from; I just want to get, for each row, the [A, B, C] values that are not null.
example for a wanted result table:
user_id datadate A B C
0 111 20220203 1 1 2
1 222 20220203 5 5 5
2 333 20220203 3 3 3
so the logic will be something like:
df[A] = page_1.A or page_2.A or page_3.A, whichever is not null
df[B] = page_1.B or page_2.B or page_3.B, whichever is not null
df[C] = page_1.C or page_2.C or page_3.C, whichever is not null
for all of the rows..
and of course, I would like to do it in an efficient way.
Thanks a lot.
You can use the SQL function greatest to extract the greatest value from a list of columns. Since at most one of the page columns is non-null per row and greatest skips nulls, the greatest value is exactly the non-null one you want.
You can find the documentation here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.greatest.html
from pyspark.sql import functions as F
(df.withColumn('A', F.greatest(F.col('page_1.A'), F.col('page_2.A'), F.col('page_3.A')))
   .withColumn('B', F.greatest(F.col('page_1.B'), F.col('page_2.B'), F.col('page_3.B')))
   .withColumn('C', F.greatest(F.col('page_1.C'), F.col('page_2.C'), F.col('page_3.C')))
   .select('user_id', 'datadate', 'A', 'B', 'C'))
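Since exactly one page group is non-null per row, coalesce (which returns the first non-null value) expresses the intent even more directly than greatest. A hedged sketch, assuming the dots in the column names are literal (in which case Spark needs them wrapped in backticks):
from pyspark.sql import functions as F
pages = ['page_1', 'page_2', 'page_3']
cols = ['A', 'B', 'C']
# for each of A, B, C, take the first non-null value across the three pages
result = df.select(
    'user_id',
    'datadate',
    *[F.coalesce(*[F.col(f'`{p}.{c}`') for p in pages]).alias(c) for c in cols]
)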

How to write a code for the lambda method with pandas and random?

So I am working on some Python code where I have imported an Excel file from my desktop. Currently I am having a problem executing my lambda function. I need to use a lambda function inside the apply() method so that, for the data column(s) with outliers (i.e., z-scores > 3 or < -3), the outlier values are set to null (i.e., np.nan) while the original values are otherwise retained.
I have typed the line of code in 2 different ways with the same outcome. However, when I check whether my overall code actually used the lambda function, it did not, and I don't know how to make the code use the function instead of the lambda having no effect on the rest of the code.
This is the individual array for each data type after using the unique() method (I have already replaced the strings in the array).
The image shows the use of the sort_values() method in my code, as well as finding possible outliers from the z-scores of the variable before adding the lambda function to the code.
data.SalaryZScores.apply(lambda SalaryZScores: np.nan if SalaryZScores > 3 or SalaryZScores < -3 else SalaryZScores)
Or
data['Salary'] = data['Salary'].apply(lambda Salary: random.randrange(data['Salary'].min(), data['Salary'].max()) if pd.isnull(Salary) else Salary)
It's difficult to understand what your snippets are doing, as you haven't provided the code that produces any of the objects in your question. If I understand what you're trying to do, the following approaches might solve your problem:
Using .apply() and lambda
import pandas as pd
import numpy as np
# Create some data
data = np.arange(20).reshape(10, 2)
df = pd.DataFrame(data, columns=list('AB'))
df
[Out]
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
5 10 11
6 12 13
7 14 15
8 16 17
9 18 19
df['A'] = df['A'].apply(lambda x : np.nan if (x < 4) or (x > 10) else x)
df
[Out]
A B
0 NaN 1
1 NaN 3
2 4.0 5
3 6.0 7
4 8.0 9
5 10.0 11
6 NaN 13
7 NaN 15
8 NaN 17
9 NaN 19
Using Series.between:
df['A'].between(4, 10) # Creates a boolean mask between 4 and 10
[Out]
A
0 False
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 False
9 False
df.loc[~df['A'].between(4, 10), 'A'] = np.nan # use ~ to invert the boolean mask and set those values to np.nan (loc avoids chained assignment)
df
[Out]
A B
0 NaN 1
1 NaN 3
2 4.0 5
3 6.0 7
4 8.0 9
5 10.0 11
6 NaN 13
7 NaN 15
8 NaN 17
9 NaN 19
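Applied to the z-score setup in the question, note that the first snippet there computes a new series but never assigns it back, which is why it appears to have no effect. A minimal sketch, assuming a SalaryZScores column as shown in the question:
import numpy as np
# assign the result back, otherwise apply() has no visible effect on the dataframe
data['SalaryZScores'] = data['SalaryZScores'].apply(
    lambda z: np.nan if (z > 3 or z < -3) else z
)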

python-3: how to create a new pandas column as subtraction of two consecutive rows of another column?

I have a pandas dataframe
x
1
3
4
7
10
I want to create a new column y as y[i] = x[i] - x[i-1] (and y[0] = x[0]).
So the above data frame will become:
x y
1 1
3 2
4 1
7 3
10 3
How to do that with python-3? Many thanks
Using .shift() and fillna():
df['y'] = (df['x'] - df['x'].shift(1)).fillna(df['x'])
To explain what this is doing, if we print(df['x'].shift(1)) we get the following series:
0 NaN
1 1.0
2 3.0
3 4.0
4 7.0
Which is your 'x' values shifted down one row. The first row gets NaN because there is no value above it to shift down. So, when we do:
print(df['x'] - df['x'].shift(1))
We get:
0 NaN
1 2.0
2 1.0
3 3.0
4 3.0
Those are your subtracted values, but the first row is NaN again. To clear this, we use .fillna(), telling it to just take the value from df['x'] whenever a null value is encountered.
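An equivalent and slightly more idiomatic form uses diff(), which computes exactly this row-to-row difference:
df['y'] = df['x'].diff().fillna(df['x'])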
