Spark: Create dataframe from arrays in a column - apache-spark

I have a Spark dataframe (using Scala) with a column arrays that contains Array[Array[Int]], i.e.
var data = Seq(
  Array(Array(1, 2, 3), Array(3, 4, 5), Array(6, 7, 8)),
  Array(Array(1, 5, 7), Array(3, 4, 5), Array(6, 3, 0)),
  ...
).toDF("arrays")
I want to create a new dataframe in which each row contains one Array[Int] and there should be no repetitions. For example, the dataframe above would become:
+-----------+
| array     |
+-----------+
| (1, 2, 3) |
| (3, 4, 5) |
| (6, 7, 8) |
| (1, 5, 7) |
| (6, 3, 0) |
+-----------+
where (3, 4, 5) appears only once.

Try:
import org.apache.spark.sql.functions.{col, explode}

df.select(explode(col("arrays")).as("array")).dropDuplicates()
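For reference, an equivalent one-liner in PySpark (a sketch, assuming the same column name "arrays"):

from pyspark.sql.functions import explode

df.select(explode("arrays").alias("array")).dropDuplicates()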

Related

Count rows of a dataset based on a column in PySpark

I'm working with PySpark. I have a dataset like this:
I want to count the rows of my dataset based on my "Column3" column.
For example, here I want to get this dataset:
pyspark.sql.functions.count(col):
Aggregate function: returns the number of items in a group.
from pyspark.sql.functions import count

temp = spark.createDataFrame([
    (0, 11, 'A'),
    (1, 12, 'B'),
    (2, 13, 'B'),
    (0, 14, 'A'),
    (1, 15, 'c'),
    (2, 16, 'A'),
], ['column1', 'column2', 'column3'])
temp.groupBy('column3').agg(count('*').alias('count')).sort('column3').show(10, False)
# +-------+-----+
# |column3|count|
# +-------+-----+
# |A      |3    |
# |B      |2    |
# |c      |1    |
# +-------+-----+
df.groupBy('column_3').count()
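The count() shortcut on grouped data is equivalent to agg(count('*')) and also names the output column count; reusing the example DataFrame temp from above, the following sketch produces the same table:

temp.groupBy('column3').count().sort('column3').show()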

How to zip/concat value and list in pyspark

I am working on a function that takes 4 inputs.
To do so, I would like to build a list summarizing these 4 elements. However, two of my variables hold a single value and two hold lists. I can zip the two lists with arrays_zip, but I can't get an array combining all 4 elements:
+----+----+---------+---------+
| l1 | l2 | l3      | l4      |
+----+----+---------+---------+
| 1  | 5  | [1,2,3] | [2,2,2] |
| 2  | 9  | [8,2,7] | [1,7,7] |
| 3  | 3  | [8,4,9] | [5,1,3] |
| 4  | 1  | [5,5,3] | [8,4,3] |
+----+----+---------+---------+
What I want to get:
+----+----+---------+---------+------------------------------------------+
| l1 | l2 | l3      | l4      | l5                                       |
+----+----+---------+---------+------------------------------------------+
| 1  | 5  | [1,2,3] | [2,2,2] | [[1, 5, 1, 2],[1, 5, 2, 2],[1, 5, 3, 2]] |
| 2  | 9  | [8,2,7] | [1,7,7] | [[2, 9, 8, 1],[2, 9, 2, 7],[2, 9, 7, 7]] |
| 3  | 3  | [8,4,9] | [5,1,3] | [[3, 3, 8, 5],[3, 3, 4, 1],[3, 3, 9, 3]] |
| 4  | 1  | [5,5,3] | [8,4,3] | [[4, 1, 5, 8],[4, 1, 5, 4],[4, 1, 3, 3]] |
+----+----+---------+---------+------------------------------------------+
My idea was to transform l1 and l2 into lists of the same size as l3, and then apply arrays_zip. I didn't find a clean way to create such lists.
Once I had this list of lists, I would apply a function like the following:
def is_good(data):
    a, b, c, d = data
    return a + b + c + d

is_good_udf = f.udf(lambda x: is_good(x), ArrayType(FloatType()))
spark.udf.register("is_good_udf", is_good, T.FloatType())
My guess would be to build something like this (thanks to @kafels), where for each row and each inner list it applies the function:
df.withColumn("tot", f.expr("transform(l5, y -> is_good_udf(y))"))
In order to obtain a list of results such as [9, 10, 11] for the first row.
You can use the expr function and apply TRANSFORM:
import pyspark.sql.functions as f
df = df.withColumn('l5', f.expr("""TRANSFORM(arrays_zip(l3, l4), el -> array(l1, l2, el.l3, el.l4))"""))
# +---+---+---------+---------+------------------------------------------+
# |l1 |l2 |l3 |l4 |l5 |
# +---+---+---------+---------+------------------------------------------+
# |1 |5 |[1, 2, 3]|[2, 2, 2]|[[1, 5, 1, 2], [1, 5, 2, 2], [1, 5, 3, 2]]|
# |2 |9 |[8, 2, 7]|[1, 7, 7]|[[2, 9, 8, 1], [2, 9, 2, 7], [2, 9, 7, 7]]|
# |3 |3 |[8, 4, 9]|[5, 1, 3]|[[3, 3, 8, 5], [3, 3, 4, 1], [3, 3, 9, 3]]|
# |4 |1 |[5, 5, 3]|[8, 4, 3]|[[4, 1, 5, 8], [4, 1, 5, 4], [4, 1, 3, 3]]|
# +---+---+---------+---------+------------------------------------------+
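To then reduce each inner list to a single number, as in the expected [9, 10, 11] for the first row, one option is to stay with Spark SQL higher-order functions instead of a Python UDF; a sketch using aggregate (available since Spark 2.4):

df = df.withColumn('tot', f.expr("TRANSFORM(l5, a -> aggregate(a, 0, (acc, x) -> acc + x))"))
# first row: tot = [9, 10, 11]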

PySpark - Create Age Column Over a Window starting based on the first Non Zero Value

I have the following dataframe:
[Row(ID=123, MONTH_END=datetime.date(2017, 12, 31), Total=0.0),
Row(ID=123, MONTH_END=datetime.date(2018, 1, 31), Total=4006),
Row(ID=123, MONTH_END=datetime.date(2018, 2, 28), Total=2389),
Row(ID=123, MONTH_END=datetime.date(2018, 3, 31), Total=0),
Row(ID=123, MONTH_END=datetime.date(2018, 4, 30), Total=3547),
Row(ID=123, MONTH_END=datetime.date(2018, 5, 31), Total=4322)
......]
What I want to do is create a new column "age" based on the "Total" column. The "age" column needs to be a row number starting from the first non-zero value in "Total". The output needs to be:
[Row(ID=123, MONTH_END=datetime.date(2017, 12, 31), Total=0.0, age=None),
 Row(ID=123, MONTH_END=datetime.date(2018, 1, 31), Total=4006, age=1),
 Row(ID=123, MONTH_END=datetime.date(2018, 2, 28), Total=2389, age=2),
 Row(ID=123, MONTH_END=datetime.date(2018, 3, 31), Total=0, age=3),
 Row(ID=123, MONTH_END=datetime.date(2018, 4, 30), Total=3547, age=4),
 Row(ID=123, MONTH_END=datetime.date(2018, 5, 31), Total=4322, age=5)]
I started off with this, given that I have many IDs in the dataframe:
sample \
    .withColumn("age", F.row_number().over(Window.partitionBy("ID").orderBy("MONTH_END"))) \
    .take(10)
But this does not take into account the first non-zero value in the Total column.
You can utilise the first aggregation, which has an ignorenulls option, together with a couple of auxiliary columns that can be dropped later:
- rnum: the row number within the window
- delta: the rnum of the first row with Total != 0
import datetime
from pyspark.sql import Row
from pyspark.sql.functions import col, first, row_number, when
from pyspark.sql.window import Window

df = spark_session.createDataFrame([
    Row(ID=123, MONTH_END=datetime.date(2017, 12, 31), Total=0.0),
    Row(ID=123, MONTH_END=datetime.date(2018, 1, 31), Total=4006.0),
    Row(ID=123, MONTH_END=datetime.date(2018, 2, 28), Total=2389.0),
    Row(ID=123, MONTH_END=datetime.date(2018, 3, 31), Total=0.0),
    Row(ID=123, MONTH_END=datetime.date(2018, 4, 30), Total=3547.0),
    Row(ID=123, MONTH_END=datetime.date(2018, 5, 31), Total=4322.0),
    Row(ID=124, MONTH_END=datetime.date(2018, 5, 31), Total=0.0),
    Row(ID=125, MONTH_END=datetime.date(2018, 5, 31), Total=4322.0)
])
win = Window.partitionBy("ID").orderBy("MONTH_END")
df.withColumn("rnum", row_number().over(win)) \
    .withColumn("delta", first(when(col("Total") == 0, None).otherwise(col("rnum")), ignorenulls=True).over(win)) \
    .withColumn("age", when(col("delta").isNull(), None).otherwise(col("rnum") - col("delta") + 1)) \
    .show()
Output:
+---+----------+------+----+-----+----+
| ID| MONTH_END| Total|rnum|delta| age|
+---+----------+------+----+-----+----+
|124|2018-05-31| 0.0| 1| null|null|
|123|2017-12-31| 0.0| 1| null|null|
|123|2018-01-31|4006.0| 2| 2| 1|
|123|2018-02-28|2389.0| 3| 2| 2|
|123|2018-03-31| 0.0| 4| 2| 3|
|123|2018-04-30|3547.0| 5| 2| 4|
|123|2018-05-31|4322.0| 6| 2| 5|
|125|2018-05-31|4322.0| 1| 1| 1|
+---+----------+------+----+-----+----+
The columns rnum and delta are left in for demonstration purposes.
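If only the final result is needed, the same chain can be assigned to a variable and end with a drop of the helper columns, e.g.:

result = df.withColumn("rnum", row_number().over(win)) \
    .withColumn("delta", first(when(col("Total") == 0, None).otherwise(col("rnum")), ignorenulls=True).over(win)) \
    .withColumn("age", when(col("delta").isNull(), None).otherwise(col("rnum") - col("delta") + 1)) \
    .drop("rnum", "delta")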

When I append different lists to a list of lists, I get only the last list. Why?

I'm trying to keep a log of past list combinations using the Python standard list functions append/extend etc.:
log = [[(3, 23), (5, 7), (6, 14), (3, 0), (11, 24), (3, 5), (20, 19), (16, 9), (19, 5)]]
counter = 0
switch = False
while switch == False:
    counter += 1
    if counter > 5:
        switch = True
    lenth = len(log)
    netlist = log[0]
    index = netlist.index((11, 24))
    tmp = netlist[index - 1]
    netlist[index - 1] = netlist[index]
    netlist[index] = tmp
    print(netlist)
    log.append(netlist)
print()
for lists in log:
    print(lists)
The result should be a log variable containing all the lists used after each swap. So, starting with the list already in log, after each swap it should append the new list to the log, and when the loop ends all the used lists should be printed.
The result I get is a log with all the same lists (the last one used).
The first listing shows how it should be; the last shows how it is now.
There are a couple of misunderstandings which I will attempt to clear up.
Misunderstanding 1 - Lists are immutable
The result I get is a log with all the same lists (the last one used).
Correct understanding - Lists are mutable
When you interchange the values in the list using the following code -
tmp = netlist[index - 1]
netlist[index - 1] = netlist[index]
netlist[index] = tmp
You are not creating a new list, you are just modifying the same list.
Read about Lists and mutability here.
The problem with this is as follows -
Your log list will end up containing multiple references to the same list object.
So the situation will be something like this.
__________________________________________________
| netlist |
| |
| value - |
| [(3, 23), (5, 7), (6, 14), (3, 0), |
| (11, 24), (3, 5), (20, 19), (16, 9), (19, 5)] |
| |
| |
| |
|_____^________________^______________^____________|
______|________________|______________|_________________
| | | | |
| ___|_____ ____|_____ ____|_____ |
| | | | | | | |
| | 0 | | 1 | | 2 | |
| |_________| |__________| |__________| ........ |
| |
| log |
|________________________________________________________|
Here, as you can see, the 0th index refers to the same object as the 1st index, the 2nd index, and so on.
What you probably wanted was a log to contain multiple different lists.
Something like so -
_____________ ____________ _____________
| | | | | |
| [(3, 23), | | another | | yet another |
| (5, 7)...]| | list | | list |
| | | | | |
| | | | | |
|_____^_______| |____^_______| |____^________| .......
______|________________|________________|_______________
| | | | |
| ___|_____ ____|_____ ____|_____ |
| | | | | | | |
| | 0 | | 1 | | 2 | |
| |_________| |__________| |__________| ....... |
| |
| log |
|________________________________________________________|
So when you try to print log, what you get is the same list printed again and again.
The solution to this is to copy the list using slicing ([:]).
Read how to copy a list here.
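A minimal illustration of the difference between aliasing a list and copying it:

a = [1, 2, 3]
alias = a    # both names refer to the same list object
copy = a[:]  # slicing creates a new, independent list
a[0] = 99
print(alias)  # [99, 2, 3] - the alias sees the change
print(copy)   # [1, 2, 3] - the copy does not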
Misunderstanding 2 - Using index 0 to select the last element.
Correct understanding - Use index -1 to get the last element.
This piece of code -
netlist = log[0]
suggests you are trying to select the last element of the list, but you are using index 0.
To get the last element of a list, use index -1.
Read about slice notation here.
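For example:

nums = [10, 20, 30]
print(nums[0])   # 10 - first element
print(nums[-1])  # 30 - last element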
Final Code -
log = [[(3, 23), (5, 7), (6, 14), (3, 0), (11, 24), (3, 5), (20, 19), (16, 9), (19, 5)]]
counter = 0
switch = False
while switch == False:
    counter += 1
    if counter > 5:
        switch = True
    # lenth = len(log)  # commented out as it's not needed
    netlist = log[-1][:]  # copy the most recent list instead of referencing it
    index = netlist.index((11, 24))
    tmp = netlist[index - 1]
    netlist[index - 1] = netlist[index]
    netlist[index] = tmp
    # print(netlist)
    log.append(netlist)

# Remove the 1st element of log if you don't need it.
log = log[1:]

print()
for lists in log:
    print(lists)
This prints the following output, which is what you expected.
[(3, 23), (5, 7), (6, 14), (11, 24), (3, 0), (3, 5), (20, 19), (16, 9), (19, 5)]
[(3, 23), (5, 7), (11, 24), (6, 14), (3, 0), (3, 5), (20, 19), (16, 9), (19, 5)]
[(3, 23), (11, 24), (5, 7), (6, 14), (3, 0), (3, 5), (20, 19), (16, 9), (19, 5)]
[(11, 24), (3, 23), (5, 7), (6, 14), (3, 0), (3, 5), (20, 19), (16, 9), (19, 5)]
[(19, 5), (3, 23), (5, 7), (6, 14), (3, 0), (3, 5), (20, 19), (16, 9), (11, 24)]
[(19, 5), (3, 23), (5, 7), (6, 14), (3, 0), (3, 5), (20, 19), (11, 24), (16, 9)]

How to deal with hundreds of columns of data from a textfile when training a model using Spark ML

I have a textfile with hundreds of columns, but the columns don't have column names.
The first column is the label and the others are features. I've read some examples that require specifying column names for the training data, but it is quite troublesome to specify all of them since there are so many columns.
How can I deal with this situation?
You can use VectorAssembler in combination with a list comprehension to structure your data for model training. Consider this example data with two feature columns (x1 and x2) and a response variable y.
df = sc.parallelize([(5, 1, 6),
                     (6, 9, 4),
                     (5, 3, 3),
                     (4, 4, 2),
                     (4, 5, 1),
                     (2, 2, 2),
                     (1, 7, 3)]).toDF(["y", "x1", "x2"])
First, we create a list of column names that are not "y":
colsList = [x for x in df.columns if x != 'y']
Now, we can use VectorAssembler:
from pyspark.ml.feature import VectorAssembler
vectorizer = VectorAssembler()
vectorizer.setInputCols(colsList)
vectorizer.setOutputCol("features")
output = vectorizer.transform(df)
output.select("features", "y").show()
+---------+---+
| features| y|
+---------+---+
|[1.0,6.0]| 5|
|[9.0,4.0]| 6|
|[3.0,3.0]| 5|
|[4.0,2.0]| 4|
|[5.0,1.0]| 4|
|[2.0,2.0]| 2|
|[7.0,3.0]| 1|
+---------+---+
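To connect this back to the question, the file itself does not need hand-written column names either: the auto-generated names can be renamed and fed to the assembler. A sketch, assuming a comma-separated text file at a hypothetical path data.txt with the label in the first column:

df = spark.read.csv("data.txt", inferSchema=True)  # columns are auto-named _c0, _c1, ...
df = df.withColumnRenamed("_c0", "y")              # first column is the label
colsList = [c for c in df.columns if c != "y"]     # everything else is a feature
vectorizer = VectorAssembler(inputCols=colsList, outputCol="features")
output = vectorizer.transform(df)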
