Numpy specific reshape order for 2D Array - python-3.x

I've got a 4x3 matrix and I need to reshape it to a 2x6 matrix with a specific order:
Initial :
|1 5 9|
|2 6 10|
|3 7 11|
|4 8 12|
Wanted :
|1 2 5 6 9 10|
|3 4 7 8 11 12|
I tried to use numpy, but the order was not what I expected:
import numpy as np
a = np.array([[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]])
b=a.reshape(2,6)
print(a)
print(b)

Reshape, permute and reshape -
In [51]: n = 2 # no. of rows to select from input to form each row of output
In [52]: a.reshape(-1,n,a.shape[1]).swapaxes(1,2).reshape(-1,n*a.shape[1])
Out[52]:
array([[ 1,  2,  5,  6,  9, 10],
       [ 3,  4,  7,  8, 11, 12]])
Sort of an explanation:
Cut along the first axis to end up with a 3D array, such that we select n input rows along the new second axis. Swap that second axis with the last (third) axis, so that the selected rows are pushed to the end.
Reshape back to 2D to merge the last two axes. That's our output!
For more in-depth info, please refer to the linked Q&A.
If we are given the number of rows in output -
In [54]: nrows = 2 # number of rows in output
In [55]: a.reshape(nrows,-1,a.shape[1]).swapaxes(1,2).reshape(nrows,-1)
Out[55]:
array([[ 1,  2,  5,  6,  9, 10],
       [ 3,  4,  7,  8, 11, 12]])
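To make the explanation above concrete, here is a small illustrative sketch of the intermediate arrays, using the matrix from the question:
import numpy as np

a = np.array([[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]])
step1 = a.reshape(-1, 2, a.shape[1])     # shape (2, 2, 3): groups of 2 input rows
step2 = step1.swapaxes(1, 2)             # shape (2, 3, 2): the group axis pushed to the end
out = step2.reshape(-1, 2 * a.shape[1])  # shape (2, 6): merge the last two axes
print(out)
# [[ 1  2  5  6  9 10]
#  [ 3  4  7  8 11 12]]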

Related

Get unique elements for every array-based row

I have a dataset which looks somewhat like this:
idx | attributes
--------------------------
101 | ['a','b','c']
102 | ['a','b','d']
103 | ['b','c']
104 | ['c','e','f']
105 | ['a','b','c']
106 | ['c','g','h']
107 | ['b','d']
108 | ['d','g','i']
I wish to transform the above dataframe into something like this:
idx | attributes
--------------------------
101 | [0,1,2]
102 | [0,1,3]
103 | [1,2]
104 | [2,4,5]
105 | [0,1,2]
106 | [2,6,7]
107 | [1,3]
108 | [3,6,8]
Here, 'a' is replaced by 0, 'b' is replaced by 1, and so on. Essentially, I want to find all unique elements and assign them numbers so that integer operations can be performed on them. My current approach uses RDDs to maintain a single set and loop across rows, but it is highly memory- and time-intensive. Is there any other method for this in PySpark?
Thanks in advance
Annotated code
from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F
# Explode the dataframe by `attributes`
df1 = df.selectExpr('idx', "explode(attributes) as attributes")
# Create a StringIndexer to encode the labels
idx = StringIndexer(inputCol='attributes', outputCol='encoded', stringOrderType='alphabetAsc')
df1 = idx.fit(df1).transform(df1)
# group the encoded column by idx and aggregate using `collect_list`
df1 = df1.groupBy('idx').agg(F.collect_list(F.col('encoded').cast('int')).alias('attributes'))
Result
df1.show()
+---+----------+
|idx|attributes|
+---+----------+
|101| [0, 1, 2]|
|102| [0, 1, 3]|
|103| [1, 2]|
|104| [2, 4, 5]|
|105| [0, 1, 2]|
|106| [2, 6, 7]|
|107| [1, 3]|
|108| [3, 6, 8]|
+---+----------+
This can be done in Spark 2.4 as a one-liner.
In Spark 3.0 this can be done without expr.
from pyspark.sql.functions import expr

df = spark.createDataFrame(data=[(101,['a','b','c']),
(102,['a','b','d']),
(103,['b','c']),
(104,['c','e','f']),
(105,['a','b','c']),
(106,['c','g','h']),
(107,['b','d']),
(108,['d','g','i']),],schema = ["idx","attributes"])
df.select(df.idx, expr("transform( attributes, x -> ascii(x)-96)").alias("attributes") ).show()
+---+----------+
|idx|attributes|
+---+----------+
|101| [1, 2, 3]|
|102| [1, 2, 4]|
|103| [2, 3]|
|104| [3, 5, 6]|
|105| [1, 2, 3]|
|106| [3, 7, 8]|
|107| [2, 4]|
|108| [4, 7, 9]|
+---+----------+
The tricky bit: expr("transform( attributes, x -> ascii(x)-96)")
expr says that this is a SQL expression.
transform takes a column (that is an array) and applies a function to each element of the array; x is the lambda parameter bound to each element, -> starts the lambda body, and the closing ) ends the call.
ascii(x) - 96 converts each character's ASCII code into an integer, so 'a' becomes 1, 'b' becomes 2, and so on.
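As noted above, newer Spark releases can express this without expr; a minimal sketch assuming PySpark 3.1+, where transform is exposed as a DataFrame function:
from pyspark.sql import functions as F

# Same logic without expr: F.transform applies a lambda that builds a Column
# expression for each array element (requires PySpark 3.1+).
df.select(
    df.idx,
    F.transform("attributes", lambda x: F.ascii(x) - 96).alias("attributes")
).show()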
If you are considering performance, you may compare the explain plan for my answer vs the other one provided so far:
df1.groupBy('idx').agg(F.collect_list(F.col('encoded').cast('int')).alias('attributes')).explain()
== Physical Plan ==
ObjectHashAggregate(keys=[idx#24L], functions=[collect_list(cast(encoded#140 as int), 0, 0)])
+- Exchange hashpartitioning(idx#24L, 200)
   +- ObjectHashAggregate(keys=[idx#24L], functions=[partial_collect_list(cast(encoded#140 as int), 0, 0)])
      +- *(1) Project [idx#24L, UDF(attributes#132) AS encoded#140]
         +- Generate explode(attributes#25), [idx#24L], false, [attributes#132]
            +- Scan ExistingRDD[idx#24L,attributes#25]
my answer:
df.select(df.idx, expr("transform( attributes, x -> ascii(x)-96)").alias("attributes") ).explain()
== Physical Plan ==
Project [idx#24L, transform(attributes#25, lambdafunction((ascii(lambda x#128) - 96), lambda x#128, false)) AS attributes#127]

How to zip/concat value and list in pyspark

I am working on a function that takes 4 inputs.
To do so, I would like to get a list summarizing these 4 elements. However, two of my columns hold a single value and the other two hold lists. I can zip the two list columns with arrays_zip, but I can't get an array combining the 4 elements:
+----+----+---------+---------+
| l1 | l2 | l3 | l4 |
+----+----+---------+---------+
| 1 | 5 | [1,2,3] | [2,2,2] |
| 2 | 9 | [8,2,7] | [1,7,7] |
| 3 | 3 | [8,4,9] | [5,1,3] |
| 4 | 1 | [5,5,3] | [8,4,3] |
What I want to get :
+----+----+---------+---------+------------------------------------------+
| l1 | l2 | l3 | l4 | l5 |
+----+----+---------+---------+------------------------------------------+
| 1 | 5 | [1,2,3] | [2,2,2] | [[1, 5, 1, 2],[1, 5, 2, 2],[1, 5, 3, 2]] |
| 2 | 9 | [8,2,7] | [1,7,7] | [[2, 9, 8, 1],[2, 9, 2, 7],[2, 9, 7, 7]] |
| 3 | 3 | [8,4,9] | [5,1,3] | [[3, 3, 8, 5],[3 ,3, 4, 1],[3, 3, 9, 3]] |
| 4 | 1 | [5,5,3] | [8,4,3] | [[4, 1, 5, 8],[4, 1, 5, 4],[4, 1, 3, 3]] |
My idea was to turn l1 and l2 into lists of the same size as l3 and then apply arrays_zip, but I didn't find a consistent way to create those lists.
Once I obtain this list of lists, I would apply a function as follows:
import pyspark.sql.functions as f
import pyspark.sql.types as T

def is_good(data):
    a, b, c, d = data
    return a + b + c + d

is_good_udf = f.udf(lambda x: is_good(x), T.ArrayType(T.FloatType()))
spark.udf.register("is_good_udf", is_good, T.FloatType())
My guess would be to build something like this (thanks to @kafels), where for each row and each inner list it applies the function:
df.withColumn("tot", f.expr("transform(l5, y -> is_good_udf(y))"))
in order to obtain a list of results such as [9, 10, 11] for the first row.
You can use the expr function and apply TRANSFORM:
import pyspark.sql.functions as f
df = df.withColumn('l5', f.expr("""TRANSFORM(arrays_zip(l3, l4), el -> array(l1, l2, el.l3, el.l4))"""))
# +---+---+---------+---------+------------------------------------------+
# |l1 |l2 |l3 |l4 |l5 |
# +---+---+---------+---------+------------------------------------------+
# |1 |5 |[1, 2, 3]|[2, 2, 2]|[[1, 5, 1, 2], [1, 5, 2, 2], [1, 5, 3, 2]]|
# |2 |9 |[8, 2, 7]|[1, 7, 7]|[[2, 9, 8, 1], [2, 9, 2, 7], [2, 9, 7, 7]]|
# |3 |3 |[8, 4, 9]|[5, 1, 3]|[[3, 3, 8, 5], [3, 3, 4, 1], [3, 3, 9, 3]]|
# |4 |1 |[5, 5, 3]|[8, 4, 3]|[[4, 1, 5, 8], [4, 1, 5, 4], [4, 1, 3, 3]]|
# +---+---+---------+---------+------------------------------------------+
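As a follow-up to the question's goal of applying a per-sublist function such as is_good: if the aim is just the element-wise sums (e.g. [9, 10, 11] for the first row), a Python UDF is not needed. A minimal sketch using the built-in aggregate higher-order function, assuming Spark 2.4+ and integer-valued arrays:
import pyspark.sql.functions as f

# Sum each inner list of l5; 0L makes the accumulator a bigint so that it
# matches integer array elements (adjust the start value's type to your data).
df = df.withColumn(
    "tot",
    f.expr("TRANSFORM(l5, y -> aggregate(y, 0L, (acc, v) -> acc + v))")
)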

Add different arrays from numpy to each row of dataframe

I have a SparkSQL dataframe and a 2D numpy matrix with the same number of rows. I intend to add each row of the numpy matrix as a new column to the existing PySpark dataframe, so that the list added to each row is different.
For example, the PySpark dataframe is like this
| Id | Name |
| ------ | ------ |
| 1 | Bob |
| 2 | Alice |
| 3 | Mike |
And the numpy matrix is like this
[[2, 3, 5]
[5, 2, 6]
[1, 4, 7]]
The resulting expected dataframe should be like this
| Id | Name | customized_list
| ------ | ------ | ---------------
| 1 | Bob | [2, 3, 5]
| 2 | Alice | [5, 2, 6]
| 3 | Mike | [1, 4, 7]
The Id column corresponds to the order of the entries in the numpy matrix.
Is there an efficient way to implement this?
Create a DataFrame from your numpy matrix and add an Id column to indicate the row number. Then you can join to your original PySpark DataFrame on the Id column.
import numpy as np
a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])
list_df = spark.createDataFrame(enumerate(a.tolist(), start=1), ["Id", "customized_list"])
list_df.show()
#+---+---------------+
#| Id|customized_list|
#+---+---------------+
#| 1| [2, 3, 5]|
#| 2| [5, 2, 6]|
#| 3| [1, 4, 7]|
#+---+---------------+
Here I used enumerate(..., start=1) to add the row number.
Now just do an inner join:
df.join(list_df, on="Id", how="inner").show()
#+---+-----+---------------+
#| Id| Name|customized_list|
#+---+-----+---------------+
#| 1| Bob| [2, 3, 5]|
#| 3| Mike| [1, 4, 7]|
#| 2|Alice| [5, 2, 6]|
#+---+-----+---------------+
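Note that the join does not preserve the original row order (the output above comes back as 1, 3, 2); if order matters, add an explicit sort:
df.join(list_df, on="Id", how="inner").orderBy("Id").show()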

Bucketing a spark dataframe - pyspark

I have a Spark dataframe with an age column. I need to write a PySpark script to bucket the dataframe into 10-year age ranges (for example ages 11-20, 21-30, ...) and find the count of entries in each age span. I need guidance on how to get through this.
For example, I have the following dataframe:
+-----+
|age |
+-----+
| 21|
| 23|
| 35|
| 39|
+-----+
after bucketing (expected)
+-----+------+
|age | count|
+-----+------+
|21-30| 2 |
|31-40| 2 |
+-----+------+
An easy way to run such a calculation would be to compute the histogram on the underlying RDD.
Given known age ranges (fortunately, this is easy to put together - here, using 1, 11, 21, etc.), it's fairly easy to produce the histogram:
hist = df.rdd \
    .map(lambda l: l['age']) \
    .histogram([1, 11, 21, 31, 41, 51, 61, 71, 81, 91])
This will return a tuple with the bin edges ("age ranges") and their respective observation counts, for example:
([1, 11, 21, 31, 41, 51, 61, 71, 81, 91],
[10, 10, 10, 10, 10, 10, 10, 10, 11])
Then you can convert that back to a data frame using:
from pyspark.sql import Row

# Use zip to link the age ranges to their counts
countTuples = zip(hist[0], hist[1])
# Make a list of Rows from that
ageList = list(map(lambda l: Row(age_range=l[0], count=l[1]), countTuples))
sc.parallelize(ageList).toDF()
For more info, check the histogram function's documentation in the RDD API
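A hypothetical alternative that stays in the DataFrame API, sketched under the assumption that ages are positive integers: derive a "21-30"-style label from each age and count rows per label.
from pyspark.sql import functions as F

# Map each age to a 10-year bucket (1-10, 11-20, 21-30, ...), build a label,
# and count the rows per label.
bucketed = (
    df.withColumn("bucket", F.floor((F.col("age") - 1) / 10))
      .withColumn("age_range",
                  F.format_string("%d-%d", F.col("bucket") * 10 + 1, (F.col("bucket") + 1) * 10))
      .groupBy("age_range")
      .count()
)
bucketed.show()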

Redistribute elements of an array into multiple columns in a dataframe

I intend to create a dataframe from an array, with the elements of the array sequentially distributed across multiple columns.
For example:
var A = Array(1,2,4,21,2,4,34,2,24,2,4,24,5,8,4,2,1,1)
var B = sc.parallelize(A.grouped(3).toList).map(Tuple1(_)).toDF("values")
The above results in:
+-----------+
|     values|
+-----------+
| [1, 2, 4]|
| [21, 2, 4]|
|[34, 2, 24]|
| [2, 4, 24]|
| [5, 8, 4]|
| [2, 1, 1]|
+-----------+
But I need these 3 elements in 3 different columns.
Please suggest a solution that doesn't hard code for 3 elements.
The basic problem is that you are creating a Tuple1, which has a single element. Had you used x => Tuple3(x(0), x(1), x(2)), that would have solved it for the case of 3.
If you do not want to hardcode the number of elements, you can do something like this:
def addColumns(num: Int, origDF: DataFrame): DataFrame = {
  var df = origDF
  for (x <- 0 until num) {
    df = df.withColumn(s"col_$x", udf((y: Seq[Int]) => y(x))($"values"))
  }
  df
}
This will extract the relevant elements into separate columns (you might also want to drop the original values column).
