Selecting pandas dataframe column by row-specific list - python-3.x

For each row in a dataframe, I'm trying to select the column specified in a list. The list has the same length as the dataframe has rows.
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [3, 4, 5, 6, 7],
                   "c": [9, 10, 11, 12, 13]})
lst = ["a", "a", "c", "b", "a"]
The result would look like this:
[1,2,11,6,5]

Just lookup would be fine:
df.lookup(df.index,lst)
#array([ 1, 2, 11, 6, 5], dtype=int64)

lookup should be the way to go, but here is something different using stack and reindex:
df.stack().reindex(pd.MultiIndex.from_arrays([df.index, lst])).values
array([ 1, 2, 11, 6, 5])
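Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On newer versions, a NumPy-based equivalent works (a sketch, reusing df and lst from above):
import numpy as np

# get_indexer maps each label in lst to its column position; indexing the
# underlying array by (row position, column position) picks one value per row.
result = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(lst)]
# array([ 1,  2, 11,  6,  5])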

Related

Pyspark: Convert list to list of lists

In a PySpark dataframe, I have a column whose values are lists, for example: [1, 2, 3, 4, 5, 6, 7, 8].
I would like to convert the above into [[1, 2, 3, 4], [5, 6, 7, 8]], i.e. chunks of 4, for every column value.
Please let me know how I can achieve this. Thanks for your help in advance.
You can use the transform function as shown below:
df = spark.createDataFrame([([1, 2, 3, 4, 5, 6, 7, 8],)], ['values'])
df.selectExpr("transform(sequence(1, size(values), 4), v -> slice(values, v, 4)) as values") \
    .show(truncate=False)
+----------------------------+
|values                      |
+----------------------------+
|[[1, 2, 3, 4], [5, 6, 7, 8]]|
+----------------------------+
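For reference, the same chunking can be written with the DataFrame API instead of selectExpr (a sketch; F.transform with a Python lambda needs Spark 3.1+, and the chunk size 4 is hardcoded to match the example):
from pyspark.sql import functions as F

# sequence generates the start positions 1, 5, ...; slice takes 4 elements
# starting at each of them.
df.select(
    F.transform(
        F.sequence(F.lit(1), F.size('values'), F.lit(4)),
        lambda v: F.slice('values', v, 4)
    ).alias('values')
).show(truncate=False)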

pyspark: for loop calculations over the columns

Does anyone know how I can do these calculations in PySpark?
import pandas as pd

data = {
    'Name': ['Tom', 'nick', 'krish', 'jack'],
    'Age': [20, 21, 19, 18],
    'CSP': [2, 6, 8, 7],
    'coef': [2, 2, 3, 3]
}
# Create DataFrame
df = pd.DataFrame(data)

colsToRecalculate = ['Age', 'CSP']
for i in range(len(colsToRecalculate)):
    df[colsToRecalculate[i]] = df[colsToRecalculate[i]] / df["coef"]
You can use select() on a Spark dataframe and include multiple columns (with different calculations) as parameters. In your case:
from pyspark.sql import functions as F

df2 = spark.createDataFrame(pd.DataFrame(data))
df2.select(*[(F.col(c) / F.col('coef')).alias(c) for c in colsToRecalculate], 'coef').show()
A slight variation on bzu's answer, which selects the non-listed columns manually within the select. We can iterate over dataframe.columns and check each column against the colsToRecalculate list: if the column is in the list, do the calculation; otherwise keep the column as is.
from pyspark.sql import functions as func

data_sdf. \
    select(*[(func.col(k) / func.col('coef')).alias(k) if k in colsToRecalculate else k
             for k in data_sdf.columns])
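If you prefer to keep the loop structure of the original pandas code, withColumn can overwrite each listed column in place (a sketch; chaining many withColumn calls can be slower than a single select on wide dataframes):
from pyspark.sql import functions as F

df2 = spark.createDataFrame(pd.DataFrame(data))
for c in colsToRecalculate:
    # Replace the column with its value divided by coef.
    df2 = df2.withColumn(c, F.col(c) / F.col('coef'))
df2.show()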

Using multiple filter on multiple columns of numpy array - more efficient way?

I have the following 2 arrays:
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [7, 5, 6, 3],
                [2, 4, 8, 9]])
ids = np.array([6, 5, 7, 8])
Each row of arr describes a 4-digit id, and there are no redundant ids - neither in their values nor in their combination. So if [1, 2, 3, 4] exists, no other permutation of these 4 digits can exist. This will be important in a second.
The array ids contains a 4-digit id, but its order might not match. I need to go through each row of arr and check whether this id exists. In this example, ids matches the 2nd row of arr, i.e. arr[1, :].
My current solution creates a filter per column to check whether the values of ids exist in any of the 4 columns, and then applies these filters to arr. This seems way too complicated.
So I pretty much do this:
filter_1 = np.in1d(arr[:, 0], ids)
filter_2 = np.in1d(arr[:, 1], ids)
filter_3 = np.in1d(arr[:, 2], ids)
filter_4 = np.in1d(arr[:, 3], ids)
result = arr[filter_1 & filter_2 & filter_3 & filter_4]
Does anyone know a simpler solution? Maybe using generators?
Use np.isin across all of arr, then reduce with all along each row to get the result:
In [15]: arr[np.isin(arr, ids).all(1)]
Out[15]: array([[5, 6, 7, 8]])
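Since the digits within each id are unique here, isin plus all is sufficient. If a row could repeat values from ids, a stricter check is to compare each sorted row against the sorted id (a sketch):
# Sorting both sides turns "same digits in any order" into an element-wise
# equality test per row.
mask = (np.sort(arr, axis=1) == np.sort(ids)).all(axis=1)
arr[mask]
# array([[5, 6, 7, 8]])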

I'm trying to add lists in lists by column. Is there a way to sum them with missing variables in a list?

I followed the book and can sum lists in lists by column, but one of the test cases is missing values in a list, and I'm unable to move forward because I keep getting an IndexError.
The first list_initial works as it should, giving [3, 6, 9]. The second one should give [3, 4, 6, 4]:
list_initial = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
list_initial = [[1, 2, 3], [1], [1, 2, 3, 4]]
def column_sums(list_initial):
    column = 0
    list_new = []
    while column < len(list_initial):
        total = sum(row[column] for row in list_initial)
        list_new.append(total)
        column = column + 1
    print(list_new)

column_sums(list_initial)
You can effectively "transpose" your data so that rows become columns, and then use itertools.zip_longest with a fillvalue of 0 to sum across them, e.g.:
from itertools import zip_longest
list_initial = [[1, 2, 3], [1], [1, 2, 3, 4]]
summed = [sum(col) for col in zip_longest(*list_initial, fillvalue=0)]
# [3, 4, 6, 4]
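If pandas is already a dependency, an equivalent sketch: the DataFrame constructor pads the ragged rows with NaN, which sum() skips by default.
import pandas as pd

summed = pd.DataFrame(list_initial).sum().astype(int).tolist()
# [3, 4, 6, 4]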

Calculate the duplicate in multidimensional numpy array

I am using Python 3.x and I would like to count the number of duplicate rows in a numpy array. For example:
import numpy as np

my_array = np.array([[2, 3, 5],
                     [2, 3, 5],  # duplicate of row 0 (this will count as 1)
                     [2, 3, 5],  # duplicate of row 0 (this will count as 2)
                     [1, 0, 9],
                     [3, 6, 6],
                     [3, 6, 6],  # duplicate of row 4 (this will count as 3)
                     [1, 0, 9]])
What I would like to get as output is the number of duplicates in this array:
the number of duplicates is 3
Most methods return values, such as collections.Counter or np.unique with return_counts, and they don't return what I want, if I am using them correctly.
Any advice would be much appreciated.
You can get the duplicate count by taking the length of the array minus the number of its unique rows:
number_of_duplicates = len(my_array) - len(np.unique(my_array, axis=0))
And the result for your example is 4 ([1, 0, 9] is a duplicate too).
Here's a slight variation on Anh Ngoc's answer, for older versions of numpy where np.unique does not support the axis argument:
number_of_duplicates = len(my_array) - len(set(map(tuple, my_array)))
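For completeness, return_counts (mentioned in the question) can also produce this number, and additionally tells you which rows repeat (a sketch):
# counts[i] is how often unique row i appears; every appearance beyond the
# first is a duplicate.
uniq, counts = np.unique(my_array, axis=0, return_counts=True)
number_of_duplicates = int((counts - 1).sum())
# 4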
