Pandas - df['A'].isnull() vs df['A']=='' difference - python-3.x

As the title says, I'm a bit confused about the difference between isnull() and ==''. Sometimes, when an empty column is added to a dataframe, isnull() does not work.
FDF = pd.DataFrame()
FDF['A'] = ''
print (FDF.loc[FDF['A'].isnull()])
But in the same case the following works.
print (FDF.loc[FDF['A']==''])
Is it because of the way I added a blank column to the dataframe? If so, what is the correct way to add an empty column?

In pandas '' is not equal to np.nan
''==np.nan
Out[51]: False
That is why the isnull check returns False for an empty string.
Also, when you assign '' to a column of an empty dataframe, you get an empty series, because the dataframe has no index yet:
FDF.A
Out[54]: Series([], Name: A, dtype: object)
The correct way to assign the value is:
FDF['A'] = ['']
FDF
Out[59]:
A
0
All of the above is due to assigning to an empty dataframe. Once the dataframe has a non-empty index, we can do:
FDF['A'] = ['']
FDF['B'] = ''
FDF
Out[64]:
A B
0
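Putting the pieces together, a minimal sketch (the extra column 'B' is only for illustration and is not part of the question):
import numpy as np
import pandas as pd

FDF = pd.DataFrame()
FDF['A'] = ['']            # a list gives the frame an index, so the row is kept
FDF['B'] = np.nan          # with an existing index, a scalar broadcasts to every row

print(FDF.loc[FDF['A'].isnull()])   # empty -- '' is not NaN
print(FDF.loc[FDF['A'] == ''])      # returns the row
print(FDF.loc[FDF['B'].isnull()])   # returns the row -- NaN is detected by isnull()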

Related

split row value by separator and create new columns

I have a PySpark dataset with a column “channels” that looks like this:
channels
name1,name2,name3,name4
happy1,happy2
entity1,entity2,entity3,entity4,entity5
I want to create 5 new columns, i.e. “channel1, channel2, channel3, channel4, channel5”.
Then, I want to split the contents of the “channels” column using the comma separator. After splitting values from each row, I want to put each separated value in a different column.
For example for the first row, the columns should look like this:
channel1 channel2 channel3 channel4 channel5
name1 name2 name3 name4 ~
When an element is not found, I want to use ~ as the column value. For example, in the first row there were only 4 values instead of 5, so for the channel5 column I used ~.
I only want to use ~, not None or NULL.
How can I achieve this result in pyspark?
I tried this:
df = df.withColumn("channels_split", split(df["channels"], ","))
df = df.withColumn("channel1", coalesce(df["channels_split"][0], "~"))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], "~"))
df = df.withColumn("channel3", coalesce(df["channels_split"][2], "~"))
df = df.withColumn("channel4", coalesce(df["channels_split"][3], "~"))
df = df.withColumn("channel5", coalesce(df["channels_split"][4], "~"))
df = df.drop("channels_split")
but it gives me an error that:
`~` is missing
You're referencing the column `~`, but it is missing from the schema. Please check your code.
Note that I am using pyspark within Foundry
coalesce expects columns as arguments and you are providing a string; I think you should use lit("~").
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.coalesce.html
This is more of a spark problem than a Foundry problem though.
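For illustration, a sketch of the question's code with that fix applied (assuming the same df and “channels” column):
from pyspark.sql.functions import coalesce, lit, split

df = df.withColumn("channels_split", split(df["channels"], ","))
for i in range(5):
    # coalesce needs Column arguments, so the fallback "~" is wrapped in lit()
    df = df.withColumn(f"channel{i + 1}", coalesce(df["channels_split"][i], lit("~")))
df = df.drop("channels_split")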
What @M_S said is correct: as the error message states, you need a column and should use the lit function.
But be careful: if spark.sql.ansi.enabled is set to True, this code will throw an ArrayIndexOutOfBoundsException when there are fewer than 5 items in your array.
Another way to do this would be to ensure that your array has at least 5 elements on each row by appending some ~ values and then taking the first 5 items, or by checking the length of the array with a when condition (see the sketch after the code block below).
By the way, you don't need to repeat df = every time:
from pyspark.sql.functions import array_repeat, col, concat, lit, split

df = (
    df
    .withColumn("channels_split", split(col("channels"), ","))
    .withColumn("default_values", array_repeat(lit("~"), 5))
    # pad the split array with five "~" so the first five positions always exist
    .withColumn("channels_split", concat(col("channels_split"), col("default_values")))
    .withColumn("channel1", col("channels_split")[0])
    .withColumn("channel2", col("channels_split")[1])
    .withColumn("channel3", col("channels_split")[2])
    .withColumn("channel4", col("channels_split")[3])
    .withColumn("channel5", col("channels_split")[4])
    .drop("channels_split", "default_values")
)
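And a minimal sketch of the when-based variant mentioned above (same assumed “channels” column):
from pyspark.sql.functions import col, lit, size, split, when

df = df.withColumn("channels_split", split(col("channels"), ","))
for i in range(5):
    # only index into the array when it is long enough, otherwise fall back to "~"
    df = df.withColumn(
        f"channel{i + 1}",
        when(size(col("channels_split")) > i, col("channels_split")[i]).otherwise(lit("~")),
    )
df = df.drop("channels_split")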

find index of a dataframe that is of a specific datatype

I read in an Excel file that has time/date based entries:
df = pd.read_excel('test.xlsx')
Then I set the index to the column that holds the date:
df.set_index('Date',inplace=True)
Unfortunately, some entries in this column are not dates.
I did not find a satisfying solution for how to find them.
I did it the straightforward way, but I thought there must be a more pandas-like way:
for i in df.index:
    if not isinstance(i, pd.Timestamp):
        df[df.index == i]
        # here would be the place to do something
To check the datatype of the objects in all columns, just type
df.dtypes
For a specific column, type
df[column_name].dtype
To test each object in the series, use
df['data type'] = df['Date'].apply(lambda x: 'date' if type(x)==pd.Timestamp else 'not date')
if the date is not in the index
or
df['data type'] = df.index.map(lambda x: 'date' if type(x)==pd.Timestamp else 'not date')
if the date is in the index (note that an Index has .map rather than .apply)
Without setting the index, you can create a boolean mask to check whether each row in the 'Date' column is actually a date, and then maybe split the dataframe into two:
from datetime import datetime
mask = df['Date'].apply(lambda x: isinstance(x, (pd.Timestamp, datetime)))
date_df = df[mask]
nondate_df = df[~mask]
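For a more pandas-like check, an alternative sketch (not from the answers above) is to let pd.to_datetime attempt the conversion and flag whatever it cannot parse; note that this also accepts date-like strings, which may or may not be what you want:
parsed = pd.to_datetime(df['Date'], errors='coerce')  # unparseable entries become NaT
date_df = df[parsed.notna()]
nondate_df = df[parsed.isna()]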

How to slice out column names based on column into row of new dataframe?

I have a df that looks like this
data.answers.1542213647002.subItemType data.answers.1542213647002.value.1542213647003
thank you for the response TRUE
How do I slice out the column names, but only for columns that contain the string .value. and hold the value TRUE, into a new df like so:
new_df
old_column_names
data.answers.1542213647002.value.1542213647003
I have roughly 100 more columns with .value. in them, but not all of them have TRUE as a value.
assume this sample df:
df = pd.DataFrame({'col': [1, 2] * 5,
                   'col2.value.something': [True, False] * 5,
                   'col3.value.something': [5] * 10,
                   'col4': [True] * 10})
then
# boolean indexing with stack (regex=False so the dots are matched literally)
new = pd.DataFrame(list(df[(df == True) & df.columns.str.contains('.value.', regex=False)].stack().index))
# keep only the column-name level and drop duplicates
new = new.drop(columns=0).drop_duplicates()
1
0 col2.value.something
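A more direct sketch for the same result (an alternative to the stack approach, assuming boolean True values as in the sample df):
old_column_names = [c for c in df.columns
                    if '.value.' in c and df[c].eq(True).any()]
new_df = pd.DataFrame({'old_column_names': old_column_names})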

Replace Null values of a column with its average in a Spark DataFrame

Is there any function in Spark which can calculate the mean of a column in a DataFrame by ignoring null/NaN? Like in R, we can pass an option such as na.rm=TRUE.
When I apply avg() on a column with a NaN, I get NaN only.
You can do the following:
df.na.drop(Seq("c_name")).select(avg(col("c_name")))
Create a dataframe without the null values in any of the columns, so that the column means can be calculated in the next step:
removeAllDF = df.na.drop()
Create a list of columns in which the null values have to be replaced with column means and call the list "columns_with_nas"
Now iterate through the list "columns_with_nas" and replace all the null values with the calculated mean values:
from pyspark.sql.functions import avg

for x in columns_with_nas:
    meanValue = removeAllDF.agg(avg(x)).first()[0]
    print(x, meanValue)
    df = df.na.fill(meanValue, [x])
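As a sketch of a one-pass variant (same assumed columns_with_nas list), all the means can be computed in a single aggregation and passed to na.fill as a dict:
from pyspark.sql.functions import avg

means = removeAllDF.agg(*[avg(c).alias(c) for c in columns_with_nas]).first().asDict()
df = df.na.fill(means)  # na.fill also accepts a {column: value} mapping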
This seems to work for me in Spark 2.1.0:
In [16]: mydesc=[{'name':'Fela', 'age':46},
{'name':'Menelik','age':None},
{'name':'Zara','age':39}]
In [17]: mydf = sc.parallelize(mydesc).toDF()
In [18]: from pyspark.sql.functions import avg
In [20]: mydf.select(avg('age')).collect()[0][0]
Out[20]: 42.5
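To then actually replace the nulls with that mean, a small follow-on sketch (not part of the original answer):
from pyspark.sql.functions import col

mean_age = mydf.select(avg('age')).collect()[0][0]   # 42.5
# cast to double first so the float mean matches the column type used by na.fill
mydf = mydf.withColumn('age', col('age').cast('double')).na.fill(mean_age, ['age'])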

pandas - How to update a new pandas column row by row

I am trying to add a new column in a pandas data frame, then update the value of the column row by row:
my_df['col_A'] = 0
for index, row in my_df.iterrows():
    my_df.loc[index]['col_A'] = 100  # value here changes in real case
    print(my_df.loc[index]['col_A'])
my_df
However, in the printout, all values in col_A are still 0. Why is that? What did I miss? Thanks!
You are assigning to a copy in the line my_df.loc[index]['col_A'] = 100: the chained indexing my_df.loc[index]['col_A'] returns a temporary object, so the assignment never reaches the original dataframe.
Instead, do:
my_df['col_A'] = 0
for index, row in my_df.iterrows():
    my_df.loc[index, 'col_A'] = 100  # value here changes in real case
    print(my_df.loc[index]['col_A'])
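As a side note (a small sketch, not part of the original answer), .at is the idiomatic accessor for assigning a single cell inside a loop and is faster than .loc for scalar access:
for index, row in my_df.iterrows():
    my_df.at[index, 'col_A'] = 100  # .at reads/writes one cell at a time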
