Replace all Trues in a boolean dataframe with index value - python-3.x

How can I replace every True cell in a boolean DataFrame (True/False) with the index label of that cell's row? For example:
df = pd.DataFrame(
    [
        [False, True],
        [True, False],
    ],
    index=["abc", "def"],
    columns=list("ab")
)
should come out as:
df = pd.DataFrame(
    [
        [False, "abc"],
        ["def", False],
    ],
    index=["abc", "def"],
    columns=list("ab")
)

Use df.mask, which per its documentation will "Replace values where the condition is True":
df.mask(df, df.index)
         a      b
abc  False    abc
def    def  False
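For comparison, a small alternative sketch (my own, not from the answer above) that builds the same result with NumPy broadcasting; it assumes the frame is purely boolean:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[False, True], [True, False]],
    index=["abc", "def"],
    columns=list("ab"),
)

# Broadcast the index labels down the rows; keep plain False where the mask is False.
# dtype=object keeps the untouched cells as booleans instead of the string 'False'.
out = pd.DataFrame(
    np.where(df, np.array(df.index, dtype=object)[:, None], False),
    index=df.index,
    columns=df.columns,
)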

Related

Getting indices using conditional list comprehension

I have the following np.array:
my_array=np.array([False, False, False, True, True, True, False, True, False, False, False, False, True])
How can I make a list, using a list comprehension, of the indices corresponding to the True elements? In this case the output I'm looking for would be [3, 4, 5, 7, 12].
I've tried the following:
cols = [index if feature_condition==True for index, feature_condition in enumerate(my_array)]
But it is not working.
why specifically a list comprehension?
>>> np.where(my_array==True)
(array([ 3, 4, 5, 7, 12]),)
this does the job and is quicker. The list solution would be:
>>> [index for index, feature_condition in enumerate(my_array) if feature_condition == True]
[3, 4, 5, 7, 12]
The accepted answer of this question explains the confusion about the ordering: if/else in a list comprehension
I was curious about the differences:
import timeit
import numpy as np

def np_time(array):
    np.where(array == True)

def list_time(array):
    [index for index, feature_condition in enumerate(array) if feature_condition == True]

timeit.timeit(lambda: list_time(my_array), number=1000)
0.007574789000500459
timeit.timeit(lambda: np_time(my_array), number=1000)
0.0010812399996211752
The ordering of the if is not correct; it should come last -
$more numpy_1.py
import numpy as np
my_array=np.array([False, False, False, True, True, True, False, True, False, False, False, False, True])
print (my_array)
cols = [index for index, feature_condition in enumerate(my_array) if feature_condition]
print (cols)
$python numpy_1.py
[False False False True True True False True False False False False
True]
[3, 4, 5, 7, 12]
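As a side note (my own sketch, not part of the answers above), np.flatnonzero returns the indices of the True entries directly, and .tolist() turns them into a plain Python list:

import numpy as np

my_array = np.array([False, False, False, True, True, True, False,
                     True, False, False, False, False, True])

# flatnonzero gives the indices of the non-zero (True) elements as an array
cols = np.flatnonzero(my_array).tolist()
print(cols)  # [3, 4, 5, 7, 12]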

Python3 Boolean assignment to multidimensional list is wrong?

Trying to do the below:
visited = [[False] * 3]* 3
print(visited)
visited[0][0] = True
print(visited)
Why does it print like this:
[[False, False, False], [False, False, False], [False, False, False]]
[[True, False, False], [True, False, False], [True, False, False]]
shouldn't it be:
[[False, False, False], [False, False, False], [False, False, False]]
[[True, False, False], [False, False, False], [False, False, False]]
When you create a 2D list using
arr = [[something] * m] * n, all the subarrays refer to the same list object. If you modify even one subarray, all the others appear modified as well.
The correct way to initialise the 2D matrix is
arr = [[something for i in range(m)] for j in range(n)]
to create an n x m matrix.
I am going to give an example of how this works. sys.getsizeof reports an object's size in bytes (below it is used to check the size of a bool object in Python):
from sys import getsizeof
getsizeof(bool())
24
So a bool object is 24 bytes in CPython, but more importantly a Python list does not store the booleans themselves; it stores references to objects. Think of the inner list [False] * 3 as living at some reference, say 1000.
In your example visited = [[False] * 3] * 3 you are actually multiplying that reference: the outer list is [1000, 1000, 1000] in terms of the illustration above. So when you do visited[0][0] = True you are changing the list at reference 1000 and setting its 0th element to True. Since all three outer entries point to that same list, the value becomes [[True, False, False], [True, False, False], [True, False, False]].
You should initialise the 2D matrix like below:
array = [[False for i in range(3)] for j in range(3)]
array[0][0] = True
array
[[True, False, False], [False, False, False], [False, False, False]]
You will have noticed that in the above example we are creating separate inner lists as we go (with different references, of course).
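A small sketch of my own to make the aliasing visible: id() shows that the rows created with * are the same object, while the list-comprehension version creates distinct rows.

# Rows created with * all share one inner list ...
aliased = [[False] * 3] * 3
print(len({id(row) for row in aliased}))    # 1 -> one shared inner list

# ... while rows created with a list comprehension are independent objects.
separate = [[False for _ in range(3)] for _ in range(3)]
print(len({id(row) for row in separate}))   # 3 -> three distinct inner lists

separate[0][0] = True
print(separate)  # [[True, False, False], [False, False, False], [False, False, False]]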

python all returns false when comparing identical dataframes

Below is sample code:
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]])
df2 = pd.DataFrame([[1, 2], [3, 4]])
print(df1.equals(df2))
print(df1==df2)
if all(df1==df2)==True:
    print("same")
else:
    print("diff")
and the output is:
True
0 1
0 True True
1 True True
diff
My question is why all is returning false when comparing identical dataframes?
Since you're asking why, this actually shows a slightly unusual behavior of the builtin all when used with pandas. Even though all of the values of the frames are equal, producing a boolean mask of all True, you have a falsy value among the DataFrame's column labels, which means that all will return False.
From the docs for all
Return True if all elements of the iterable are true (or if the iterable is empty). Equivalent to:
def all(iterable):
    for element in iterable:
        if not element:
            return False
    return True
When this code is called on your DataFrame, it doesn't look at the values at all; instead it simply iterates your column labels (since that is what for element in df1 == df2 yields), which in this case are 0 and 1, and since 0 is falsy, it returns False.
We can validate this by changing the columns to all Truthy values.
In [29]: df1 == df2
Out[29]:
0 1
0 True True
1 True True
In [30]: all(df1 == df2)
Out[30]: False
In [31]: u = df1 == df2
In [32]: u.columns = [1, 2] # all truthy
In [33]: u
Out[33]:
1 2
0 True True
1 True True
In [35]: all(u)
Out[35]: True
The moral of the story is not to use builtin Python functions for this type of equality check when pandas provides the functionality for you with pd.DataFrame.equals, which handles edge cases like index alignment that you don't want to have to account for manually.
If you need to compare only the values, you can use all twice: first to compare per column, and then on the values of the resulting Series:
if (df1==df2).all().all():
    print("same")
else:
    print("diff")
Details:
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['a','b'])
df2 = pd.DataFrame([[1, 2], [3, 4]], columns=['a','b'])
print(df1==df2)
a b
0 True True
1 True True
print((df1==df2).all())
a True
b True
dtype: bool
print((df1==df2).all().all())
True
Or use numpy.all:
import numpy as np

if np.all(df1==df2):
    print("same")
else:
    print("diff")
But DataFrame.equals compares not only the values, but also the index and column names of both DataFrames.
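One more difference worth noting (a small sketch of my own, not from the answers above): NaN compares unequal to itself, so the element-wise == check and DataFrame.equals can disagree when both frames contain NaN in the same position.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1.0, np.nan]})
df2 = pd.DataFrame({'a': [1.0, np.nan]})

print(df1.equals(df2))             # True  -> NaNs in the same location count as equal
print((df1 == df2).all().all())    # False -> NaN == NaN is False element-wise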

Subset one array column with another (boolean) array column

I have a DataFrame like this (in PySpark 2.3.1):
from pyspark.sql import Row
my_data = spark.createDataFrame([
    Row(a=[9, 3, 4], b=['a', 'b', 'c'], mask=[True, False, False]),
    Row(a=[7, 2, 6, 4], b=['w', 'x', 'y', 'z'], mask=[True, False, True, False])
])
my_data.show(truncate=False)
#+------------+------------+--------------------------+
#|a |b |mask |
#+------------+------------+--------------------------+
#|[9, 3, 4] |[a, b, c] |[true, false, false] |
#|[7, 2, 6, 4]|[w, x, y, z]|[true, false, true, false]|
#+------------+------------+--------------------------+
Now I'd like to use the mask column in order to subset the a and b columns:
my_desired_output = spark.createDataFrame([
    Row(a=[9], b=['a']),
    Row(a=[7, 6], b=['w', 'y'])
])
my_desired_output.show(truncate=False)
#+------+------+
#|a |b |
#+------+------+
#|[9] |[a] |
#|[7, 6]|[w, y]|
#+------+------+
What's the "idiomatic" way to achieve this? The current solution I have involves mapping over the underlying RDD and subsetting with NumPy, which seems inelegant:
import numpy as np
def subset_with_mask(row):
    mask = np.asarray(row.mask)
    a_masked = np.asarray(row.a)[mask].tolist()
    b_masked = np.asarray(row.b)[mask].tolist()
    return Row(a=a_masked, b=b_masked)
my_desired_output = spark.createDataFrame(my_data.rdd.map(subset_with_mask))
Is this the best way to go, or is there something better (less verbose and/or more efficient) I can do using Spark SQL tools?
One option is to use a UDF, which you can optionally specialize by the data type in the array:
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
def _mask_list(lst, mask):
    return np.asarray(lst)[mask].tolist()
mask_array_int = F.udf(_mask_list, T.ArrayType(T.IntegerType()))
mask_array_str = F.udf(_mask_list, T.ArrayType(T.StringType()))
my_desired_output = my_data
my_desired_output = my_desired_output.withColumn(
    'a', mask_array_int(F.col('a'), F.col('mask'))
)
my_desired_output = my_desired_output.withColumn(
    'b', mask_array_str(F.col('b'), F.col('mask'))
)
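Assuming the two UDFs behave as sketched, selecting just a and b should reproduce the desired output from the question (a usage note of my own, not part of the original answer):

my_desired_output.select('a', 'b').show(truncate=False)
#+------+------+
#|a     |b     |
#+------+------+
#|[9]   |[a]   |
#|[7, 6]|[w, y]|
#+------+------+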
The UDF approach mentioned in the previous answer is probably the way to go prior to the array functions added in Spark 2.4. For the sake of completeness, here is a "pure SQL" implementation for versions before 2.4.
from pyspark.sql.functions import *
df = my_data.withColumn("row", monotonically_increasing_id())
df1 = df.select("row", posexplode("a").alias("pos", "a"))
df2 = df.select("row", posexplode("b").alias("pos", "b"))
df3 = df.select("row", posexplode("mask").alias("pos", "mask"))
df1\
    .join(df2, ["row", "pos"])\
    .join(df3, ["row", "pos"])\
    .filter("mask")\
    .groupBy("row")\
    .agg(collect_list("a").alias("a"), collect_list("b").alias("b"))\
    .select("a", "b")\
    .show()
Output:
+------+------+
| a| b|
+------+------+
|[7, 6]|[w, y]|
| [9]| [a]|
+------+------+
A better way to do this is to use pyspark.sql.functions.expr, filter, and transform:
import pandas as pd
from pyspark.sql import (
    functions as F,
    SparkSession
)

spark = SparkSession.builder.master('local[4]').getOrCreate()

bool_df = pd.DataFrame([
    ['a', [0, 1, 2, 3, 4], [True]*4 + [False]],
    ['b', [5, 6, 7, 8, 9], [False, True, False, True, False]]
], columns=['id', 'int_arr', 'bool_arr'])

bool_sdf = spark.createDataFrame(bool_df)
def filter_with_mask(in_col, mask_col, out_name="masked_arr"):
    filt_input = f'arrays_zip({in_col}, {mask_col})'
    filt_func = f'x -> x.{mask_col}'
    trans_func = f'x -> x.{in_col}'
    result = F.expr(f'''transform(
        filter({filt_input}, {filt_func}), {trans_func}
    )''').alias(out_name)
    return result
Using the function:
bool_sdf.select(
    '*', filter_with_mask('int_arr', 'bool_arr')
).toPandas()
Results in:
id int_arr bool_arr masked_arr
a [0, 1, 2, 3, 4] [True, True, True, True, False] [0, 1, 2, 3]
b [5, 6, 7, 8, 9] [False, True, False, True, False] [6, 8]
This should be possible with pyspark >= 2.4.0 and python >= 3.6.
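For reference, a sketch of my own applying the same helper to the original my_data frame from the question (again assuming Spark >= 2.4, where arrays_zip, filter and transform are available):

result = my_data.select(
    filter_with_mask('a', 'mask', out_name='a'),
    filter_with_mask('b', 'mask', out_name='b'),
)
result.show(truncate=False)
# Expected to match my_desired_output:
#+------+------+
#|a     |b     |
#+------+------+
#|[9]   |[a]   |
#|[7, 6]|[w, y]|
#+------+------+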

checking whether a tuple is in a list in python

Suppose I have a list of variables as follows:
v = [('d',0),('i',0),('g',0)]
What I want is to obtain a vector of boolean values that gives, for each variable, whether it is present in another list.
So, if I have another list, say
g = [('g',0)]
The output of that should be
op(v,g) = [False, False, True]
P.S.
I have tried using np.in1d but it gives the following:
array([False, True, False, True, True, True], dtype=bool)
In Python you can use a list comprehension like the following:
>>> v=[('d', 0), ('i', 0), ('g', 0)]
>>> g=[('t', 0), ('g', 0),('d',0)]
>>> [i in g for i in v]
[True, False, True]
You can convert those lists to numpy arrays and then use np.in1d like so -
import numpy as np
# Convert to numpy arrays
v_arr = np.array(v)
g_arr = np.array(g)
# Slice the first & second columns to get string & numeric parts.
# Use in1d to get matches between first columns of those two arrays;
# repeat for the second columns.
string_part = np.in1d(v_arr[:,0],g_arr[:,0])
numeric_part = np.in1d(v_arr[:,1],g_arr[:,1])
# Perform boolean AND to get the final boolean output
out = string_part & numeric_part
Sample run -
In [157]: v_arr
Out[157]:
array([['d', '0'],
['i', '0'],
['g', '0']],
dtype='<U1')
In [158]: g_arr
Out[158]:
array([['g', '1']],
dtype='<U1')
In [159]: string_part = np.in1d(v_arr[:,0],g_arr[:,0])
In [160]: string_part
Out[160]: array([False, False, True], dtype=bool)
In [161]: numeric_part = np.in1d(v_arr[:,1],g_arr[:,1])
In [162]: numeric_part
Out[162]: array([False, False, False], dtype=bool)
In [163]: string_part & numeric_part
Out[163]: array([False, False, False], dtype=bool)
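A small alternative sketch of my own: building a set of the target tuples gives O(1) membership tests, which matters when g is large. The names v and g follow the question.

v = [('d', 0), ('i', 0), ('g', 0)]
g = [('g', 0)]

g_set = set(g)                       # tuples are hashable, so they can go in a set
out = [item in g_set for item in v]
print(out)                           # [False, False, True]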
