Below is my data in CSV, which I read into a dataframe.
id,pid,pname,ppid
1, 1, 5, -1
2, 1, 7, -1
3, 2, 9, 1
4, 2, 11, 1
5, 3, 5, 1
6, 4, 7, 2
7, 1, 9, 3
I am reading that data into a dataframe data_df and trying to do a self-join on different columns, but the resulting dataframes are empty. I have tried multiple options.
Below is my code; only the last join, joined4, produces a result.
val joined = data_df.as("first").join(data_df.as("second")).where( col("first.ppid") === col("second.pid"))
joined.show(50, truncate = false)
val joined2 = data_df.as("first").join(data_df.as("second"), col("first.ppid") === col("second.pid"), "inner")
joined2.show(50, truncate = false)
val df1 = data_df.as("df1")
val df2 = data_df.as("df2")
val joined3 = df1.join(df2, $"df1.ppid" === $"df2.id")
joined3.show(50, truncate = false)
val joined4 = data_df.as("df1").join(data_df.as("df2"), Seq("id"))
joined4.show(50, truncate = false)
Below are the outputs of joined, joined2, joined3 and joined4, respectively:
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
|id |pid|pname|ppid|id |pid|pname|ppid|
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+---+-----+----+
+---+---+-----+----+---+-----+----+
|id |pid|pname|ppid|pid|pname|ppid|
+---+---+-----+----+---+-----+----+
| 1 | 1| 5| -1| 1| 5| -1|
| 2 | 1| 7| -1| 1| 7| -1|
| 3 | 2| 9| 1| 2| 9| 1|
| 4 | 2| 11| 1| 2| 11| 1|
| 5 | 3| 5| 1| 3| 5| 1|
| 6 | 4| 7| 2| 4| 7| 2|
| 7 | 1| 9| 3| 1| 9| 3|
+---+---+-----+----+---+-----+----+
Sorry, I later figured out that the spaces in the CSV were causing the issue. If I create a correctly formatted CSV from the initial data, the problem disappears.
The correct CSV format is as follows.
id,pid,pname,ppid
1,1,5,-1
2,1,7,-1
3,2,9,1
4,2,11,1
5,3,5,1
6,4,7,2
7,1,9,3
Ideally, I can also use the option to ignore leading whitespace, as shown in the following answer:
val data_df = spark.read
.schema(dataSchema)
.option("mode", "FAILFAST")
.option("header", "true")
.option("ignoreLeadingWhiteSpace", "true")
.csv(dataSourceName)
pySpark (v2.4) DataFrameReader adds leading whitespace to column names
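For reference, the PySpark DataFrameReader accepts the same options, so a roughly equivalent read (a sketch reusing the dataSchema and dataSourceName placeholders from above) would be:
# PySpark sketch; same reader options as the Scala version above
data_df = (spark.read
    .schema(dataSchema)
    .option("mode", "FAILFAST")
    .option("header", "true")
    .option("ignoreLeadingWhiteSpace", "true")
    .csv(dataSourceName))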
Given the following DataFrame, I want to apply the corr function over it using a window:
val sampleColumns = Seq("group", "id", "count1", "count2", "orderCount")
val sampleSet = Seq(
("group1", "id1", 1, 1, 6),
("group1", "id2", 2, 2, 5),
("group1", "id3", 3, 3, 4),
("group2", "id4", 4, 4, 3),
("group2", "id5", 5, 5, 2),
("group2", "id6", 6, 6, 1)
)
val initialSet = sparkSession
.createDataFrame(sampleSet)
.toDF(sampleColumns: _*)
Calling .show() on initialSet gives:
+------+---+------+------+----------+
| group| id|count1|count2|orderCount|
+------+---+------+------+----------+
|group1|id1| 1| 1| 6|
|group1|id2| 2| 2| 5|
|group1|id3| 3| 3| 4|
|group2|id4| 4| 4| 3|
|group2|id5| 5| 5| 2|
|group2|id6| 6| 6| 1|
+------+---+------+------+----------+
val initialSetWindow = Window
.partitionBy("group")
.orderBy("orderCountSum")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val groupedSet = initialSet
.groupBy(
"group"
).agg(
sum("count1").as("count1Sum"),
sum("count2").as("count2Sum"),
sum("orderCount").as("orderCountSum")
)
.withColumn("cf", corr("count1Sum", "count2Sum").over(initialSetWindow))
Calling .show() on groupedSet gives:
+------+---------+---------+-------------+----+
| group|count1Sum|count2Sum|orderCountSum| cf|
+------+---------+---------+-------------+----+
|group1| 6| 6| 15|null|
|group2| 15| 15| 6|null|
+------+---------+---------+-------------+----+
When applying the corr function, the resulting values in cf are null for some reason.
The question is: how can I apply corr to the rows within each subgroup (window)? I would like to obtain the corr value per row and subgroup (group1 and group2).
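As an aside, after the groupBy each group is reduced to a single row, so the window over groupedSet only ever contains one (count1Sum, count2Sum) pair, and a correlation cannot be computed from a single pair, which is why cf comes back null. One way to get a per-group correlation is to aggregate corr over the raw rows instead of the sums; a minimal sketch, written in PySpark here and assuming an analogous initialSet DataFrame:
from pyspark.sql import functions as F

# corr as a grouped aggregate: one Pearson correlation per group,
# computed over the raw count1/count2 values rather than their sums
per_group = initialSet.groupBy("group").agg(
    F.corr("count1", "count2").alias("cf"),
    F.sum("count1").alias("count1Sum"),
    F.sum("count2").alias("count2Sum"),
    F.sum("orderCount").alias("orderCountSum"),
)
per_group.show()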
In pandas, I can do something like this:
data = {"col1" : [np.random.randint(10) for x in range(1,10)],
"col2" : [np.random.randint(100) for x in range(1,10)]}
mypd = pd.DataFrame(data)
mypd
and it gives me a dataframe with the two columns.
Is there a similar way to create a Spark dataframe in PySpark?
The answer shared by Steven is brilliant. Additionally, if you are comfortable with pandas, you can directly supply your pandas dataframe to the createDataFrame function.
Spark >= 2.x
import numpy as np
import pandas as pd

data = {
    "col1": [np.random.randint(10) for x in range(1, 10)],
    "col2": [np.random.randint(100) for x in range(1, 10)],
}
mypd = pd.DataFrame(data)

# sql is your SparkSession (or SQLContext on older versions)
sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
+----+----+
|col1|col2|
+----+----+
| 6| 4|
| 1| 39|
| 7| 4|
| 7| 95|
| 6| 3|
| 7| 28|
| 2| 26|
| 0| 4|
| 4| 32|
+----+----+
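As a side note, on Spark 2.3+ you can optionally enable Arrow to speed up the pandas conversion. This is purely optional; the config key below is the Spark 2.x name, and spark is assumed to be your SparkSession:
# Optional: use Apache Arrow for the pandas <-> Spark conversion (Spark 2.3+).
# In Spark 3.x the key is spark.sql.execution.arrow.pyspark.enabled.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

sparkDF = spark.createDataFrame(mypd)  # same conversion as above, now Arrow-backed
sparkDF.show()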
I have a dataframe in PySpark:
id | value
 1 |     0
 1 |     1
 1 |     0
 2 |     1
 2 |     0
 3 |     0
 3 |     0
 3 |     1
I want to extract, within each id group, all the rows starting from the first occurrence of 1 in the value column. I have created a Window partitioned by id but do not know how to select the rows from that first 1 onwards.
I'm expecting the result to be:
id | value
 1 |     1
 1 |     0
 2 |     1
 2 |     0
 3 |     1
The solution below may be relevant here (it works perfectly for small data but may cause problems on big data if an id is spread across multiple partitions).
df = sqlContext.createDataFrame(
    [
        [1, 0],
        [1, 1],
        [1, 0],
        [2, 1],
        [2, 0],
        [3, 0],
        [3, 0],
        [3, 1],
    ],
    ['id', 'Value']
)
df.show()
+---+-----+
| id|Value|
+---+-----+
| 1| 0|
| 1| 1|
| 1| 0|
| 2| 1|
| 2| 0|
| 3| 0|
| 3| 0|
| 3| 1|
+---+-----+
# Import the functions and window utilities
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys

# Generate a cumulative sum of value within each id
# (-sys.maxsize works as a lower bound; W.unboundedPreceding is the more idiomatic choice)
df.withColumn(
    "sum",
    F.sum("value").over(W.partitionBy("id").rowsBetween(-sys.maxsize, 0))
).show()
+---+-----+-----+
| id|Value|sum |
+---+-----+-----+
| 1| 0| 0|
| 1| 1| 1|
| 1| 0| 1|
| 3| 0| 0|
| 3| 0| 0|
| 3| 1| 1|
| 2| 1| 1|
| 2| 0| 1|
+---+-----+-----+
# Keep only the rows where the cumulative sum is greater than 0
df.withColumn(
    "sum",
    F.sum("value").over(W.partitionBy("id").rowsBetween(-sys.maxsize, 0))
).where("sum > 0").show()
+---+-----+-----+
| id|Value|sum |
+---+-----+-----+
| 1| 1| 1|
| 1| 0| 1|
| 3| 1| 1|
| 2| 1| 1|
| 2| 0| 1|
+---+-----+-----+
Before running this, you must be sure that the data for each id is co-located, i.e. that no id is spread across two partitions.
Ideally, you would need to:
Create a window partitioned by id and ordered the same way the dataframe already is
Keep only the rows for which there is a "one" before them in the window
AFAIK, there is no lookup function within windows in Spark. Yet, you could follow this idea and work something out. Let's first create the data and import functions and windows.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
l = [(1, 0), (1, 1), (1, 0), (2, 1), (2, 0), (3, 0), (3, 0), (3, 1)]
df = spark.createDataFrame(l, ['id', 'value'])
Then, let's add an index on the dataframe (it's free) to be able to order the windows.
indexedDf = df.withColumn("index", F.monotonically_increasing_id())
Then we create a window that only looks at the values before the current row, ordered by that index and partitioned by id.
w = Window.partitionBy("id").orderBy("index").rowsBetween(Window.unboundedPreceding, 0)
Finally, we use that window to collect the set of preceding values of each row, and filter out the ones that do not contain 1. Optionally, we order back by index because the windowing does not preserve the order by id column.
indexedDf\
.withColumn('set', F.collect_set(F.col('value')).over(w))\
.where(F.array_contains(F.col('set'), 1))\
.orderBy("index")\
.select("id", "value").show()
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 0|
| 2| 1|
| 2| 0|
| 3| 1|
+---+-----+
I have the following DataFrame ordered by group, n1, n2
+-----+--+--+------+------+
|group|n1|n2|n1_ptr|n2_ptr|
+-----+--+--+------+------+
| 1| 0| 0| 1| 1|
| 1| 1| 1| 2| 2|
| 1| 1| 5| 2| 6|
| 1| 2| 2| 3| 3|
| 1| 2| 6| 3| 7|
| 1| 3| 3| 4| 4|
| 1| 3| 7| null| null|
| 1| 4| 4| 5| 5|
| 1| 5| 1| null| null|
| 1| 5| 5| null| null|
+-----+--+--+------+------+
Each row's n1_ptr and n2_ptr values refer to the n1 and n2 values of some other row in the group that comes later in the ordering. In other words, n1_ptr and n2_ptr are effectively pointers to another row. I want to use these pointers to identify chains of (n1, n2) pairs. For example, the chains in the given data would be: (0,0) -> (1,1) -> (2,2) -> (3,3) -> (4,4) -> (5,5); (1,5) -> (2,6) -> (3,7); and (5,1).
The ultimate goal is to consolidate each chain into a single row in a DataFrame describing the min and max n1 and n2 values in each chain. Continuing the example, this would yield
+-----+------+------+------+------+
|group|n1_min|n2_min|n1_max|n2_max|
+-----+------+------+------+------+
| 1| 0| 0| 5| 5|
| 1| 1| 5| 3| 7|
| 1| 5| 1| 5| 1|
+-----+------+------+------+------+
It seems like a udf might do the trick, but I am concerned about performance. Is there a more sensible/performant way to go about this?
A good solution would be to use graphframes: https://graphframes.github.io/quick-start.html.
First let's change the structure of your initial dataframe:
import pyspark.sql.functions as psf
df = sc.parallelize([[1, 0, 0, 1, 1],[1, 1, 1, 2, 2],[1, 1, 5, 2, 6],
[1, 2, 2, 3, 3],[1, 2, 6, 3, 7],[1, 3, 3, 4, 4],
[1, 3, 7, None, None],[1, 4, 4, 5, 5],[1, 5, 1, None, None],
[1, 5, 5, None, None]]).toDF(["group","n1","n2","n1_ptr","n2_ptr"]).filter("n1_ptr IS NOT NULL")
df = df.select(
"group",
psf.struct("n1", "n2").alias("src"),
psf.struct(df.n1_ptr.alias("n1"), df.n2_ptr.alias("n2")).alias("dst"))
From df we'll build a vertex and an edge dataframe:
v = df.select(
"group",
psf.explode(psf.array("src", "dst")).alias("id"))
e = df.drop("group")
The next step is to find all connected components using graphframes:
from graphframes import *
g = GraphFrame(v, e)
res = g.connectedComponents()
+-----+-----+------------+
|group| id| component|
+-----+-----+------------+
| 1|[0,0]|309237645312|
| 1|[1,1]|309237645312|
| 1|[1,1]|309237645312|
| 1|[2,2]|309237645312|
| 1|[1,5]| 85899345920|
| 1|[2,6]| 85899345920|
| 1|[2,2]|309237645312|
| 1|[3,3]|309237645312|
| 1|[2,6]| 85899345920|
| 1|[3,7]| 85899345920|
| 1|[3,3]|309237645312|
| 1|[4,4]|309237645312|
| 1|[3,7]| 85899345920|
| 1|[4,4]|309237645312|
| 1|[5,5]|309237645312|
| 1|[5,1]|292057776128|
| 1|[5,5]|309237645312|
+-----+-----+------------+
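One practical note: depending on your graphframes version, connectedComponents may require a Spark checkpoint directory to be set beforehand; the path below is just an example location:
# Newer graphframes releases need a checkpoint directory for connectedComponents;
# the directory used here is an arbitrary example.
sc.setCheckpointDir("/tmp/graphframes-checkpoint")
res = g.connectedComponents()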
Now since the relation in your graph edges implies that the node numbers n1 and n2 are monotonically increasing, we can simply aggregate by component and compute the min and max:
res.groupBy("group", "component").agg(
psf.min("id").alias("min_id"),
psf.max("id").alias("max_id")
)
+-----+------------+------+------+
|group| component|min_id|max_id|
+-----+------------+------+------+
| 1|309237645312| [0,0]| [5,5]|
| 1| 85899345920| [1,5]| [3,7]|
| 1|292057776128| [5,1]| [5,1]|
+-----+------------+------+------+
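To get the exact shape asked for in the question (separate min and max columns for n1 and n2), the struct columns can be unpacked after the aggregation; a sketch building on res from above:
# Unpack the struct fields of the min/max into flat columns (sketch)
chains = (res
    .groupBy("group", "component")
    .agg(psf.min("id").alias("min_id"), psf.max("id").alias("max_id"))
    .select(
        "group",
        psf.col("min_id.n1").alias("n1_min"),
        psf.col("min_id.n2").alias("n2_min"),
        psf.col("max_id.n1").alias("n1_max"),
        psf.col("max_id.n2").alias("n2_max")))
chains.show()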
I have data in a file in the following format:
1,32
1,33
1,44
2,21
2,56
1,23
The code I am executing is the following:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import spark.implicits._
import sqlContext.implicits._
case class Person(a: Int, b: Int)
val ppl = sc.textFile("newfile.txt").map(_.split(","))
.map(p=> Person(p(0).trim.toInt, p(1).trim.toInt))
.toDF()
ppl.registerTempTable("people")
val result = ppl.select("a","b").groupBy('a).agg()
result.show
Expected Output is:
a = 1: 32, 33, 44, 23
a = 2: 21, 56
Instead of aggregating by sum, count, mean etc., I want every element of the group in the row.
Try the collect_set function inside agg():
import org.apache.spark.sql.functions.{collect_list, collect_set}

val df = sc.parallelize(Seq(
  (1, 3), (1, 6), (1, 5), (2, 1), (2, 4),
  (2, 1))).toDF("a", "b")
df.show()
+---+---+
| a| b|
+---+---+
| 1| 3|
| 1| 6|
| 1| 5|
| 2| 1|
| 2| 4|
| 2| 1|
+---+---+
val df2 = df.groupBy("a").agg(collect_set("b"))
df2.show()
+---+--------------+
| a|collect_set(b)|
+---+--------------+
| 1| [3, 6, 5]|
| 2| [1, 4]|
+---+--------------+
And if you want to keep duplicate entries, you can use collect_list:
val df3 = df.groupBy("a").agg(collect_list("b"))
df3.show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 1| [3, 6, 5]|
| 2| [1, 4, 1]|
+---+---------------+