GraphFrames has a nice example of stateful motifs (http://graphframes.github.io/user-guide.html).
How can I explicitly return the counts? As you can see, the output only contains the vertices and edges, but not the counts.
How can I modify it so that it has access not only to the edges but also to the attributes of the vertices?
when(relationship === "friend", cnt + 1).otherwise(cnt)
I.e., how could I enhance the count to compute
the number of friends of each vertex with age > 30, and
the percentage friendsGreater30 / allFriends?
val g = examples.Graphs.friends // get example graph
// Find chains of 4 vertices.
val chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
// Query on sequence, with state (cnt)
// (a) Define method for updating state given the next element of the motif.
def sumFriends(cnt: Column, relationship: Column): Column = {
when(relationship === "friend", cnt + 1).otherwise(cnt)
}
// (b) Use sequence operation to apply method to sequence of elements in motif.
// In this case, the elements are the 3 edges.
val condition = Seq("ab", "bc", "cd").
foldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)("relationship")))
// (c) Apply filter to DataFrame.
val chainWith2Friends2 = chain4.where(condition >= 2)
chainWith2Friends2.show()
Which will output
+-------------+------------+-------------+------------+-------------+------------+--------------+
| a| ab| b| bc| c| cd| d|
+-------------+------------+-------------+------------+-------------+------------+--------------+
|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]| [e,Esther,32]|
|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,b,friend]| [b,Bob,36]|
| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,d,friend]| [d,David,29]|
| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,f,follow]| [f,Fanny,36]|
| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,b,friend]| [b,Bob,36]|[b,c,follow]|[c,Charlie,30]|
| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|
+-------------+------------+-------------+------------+-------------+------------+--------------+
Note that sumFriends returns a Column, so condition is a Column as well. This is why you can use it in a where clause without quotes. So all you have to do is add that column to your DataFrame. After running the above code, I can run
chain4.withColumn("condition",condition).select("condition").show
+---------+
|condition|
+---------+
| 1|
| 0|
| 0|
| 0|
| 0|
| 3|
| 3|
| 3|
| 2|
| 2|
| 3|
| 1|
+---------+
You could also use chain4.select(condition).
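Regarding the follow-up about vertex attributes: inside a motif result the vertex columns (a, b, c, d) are structs, so their fields can be read just like the edge fields. Below is a minimal sketch, not from the GraphFrames docs, assuming the same chain4 and condition as above and the age column of the example graph. It takes one reading of the question: counting, per chain, the friend edges whose destination vertex is older than 30, and deriving the percentage:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}

// Like sumFriends, but only count "friend" edges whose destination vertex is older than 30.
def sumFriendsOver30(cnt: Column, relationship: Column, dstAge: Column): Column =
  when((relationship === "friend") && (dstAge > 30), cnt + 1).otherwise(cnt)

// Pair each edge of the motif with its destination vertex.
val friendsOver30 = Seq(("ab", "b"), ("bc", "c"), ("cd", "d"))
  .foldLeft(lit(0)) { case (cnt, (e, v)) =>
    sumFriendsOver30(cnt, col(e)("relationship"), col(v)("age"))
  }

// allFriends is the original condition; the percentage is null when a chain has no friend edges.
val pctOver30 = when(condition > 0, friendsOver30.cast("double") / condition)

chain4
  .withColumn("allFriends", condition)
  .withColumn("friendsOver30", friendsOver30)
  .withColumn("pctOver30", pctOver30)
  .select("a", "allFriends", "friendsOver30", "pctOver30")
  .show()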
Hope this helps
Using the example in this question, how do I create rows with a count of 0 when aggregating over all possible combinations? When using cube, rows with a count of 0 are not produced.
This is the code and output:
df.cube($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+
But this is the desired output (added row 4).
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 1| 0| <- count of records where x = bar AND y = 1
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+
Is there another function that could do that?
I agree that crossJoin is the correct approach here, but afterwards I think it is a bit more versatile to use a join instead of a union plus groupBy, especially if there is more than one aggregation besides the count.
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('foo', 1),
     ('foo', 2),
     ('bar', 2),
     ('bar', 2)],
    ['x', 'y'])
df_cartesian = df.select('x').distinct().crossJoin(df.select("y").distinct())
df_cubed = df.cube('x', 'y').count()
df_cubed.join(df_cartesian, ['x', 'y'], 'full').fillna(0, ['count']).show()
# +----+----+-----+
# | x| y|count|
# +----+----+-----+
# |null|null| 4|
# |null| 1| 1|
# |null| 2| 3|
# | bar|null| 2|
# | bar| 1| 0|
# | bar| 2| 2|
# | foo|null| 2|
# | foo| 1| 1|
# | foo| 2| 1|
# +----+----+-----+
First, let's see why you do not get combinations that do not appear in your dataset.
def cube(col1: String, cols: String*): RelationalGroupedDataset
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
As the doc states, cube is just a fancy group by. You can also verify this by running explain on your result: you will see that cube is basically an Expand (to obtain the nulls) followed by a group by. Therefore it cannot show you combinations that are not in your dataset. A join is needed for that, so that values that never occur in the same record can "meet".
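A quick way to see this for yourself (a small check, assuming the df from the question is in scope):
// The physical plan shows an Expand followed by an aggregate, i.e. no join,
// so combinations that never co-occur in the data cannot appear.
df.cube("x", "y").count.explain()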
So let's construct a solution:
// Imports needed for lit, sum and the 'count symbol syntax
import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._

// A dataset in which (2, 1) does not exist
val df = Seq((1, 1), (1, 2), (2, 2)).toDF("x", "y")

// This contains one line per possible combination, even those that are not
// in the dataset. Note that we set the count to 0.
val cartesian = df
  .select("x").distinct
  .crossJoin(df.select("y").distinct)
  .withColumn("count", lit(0))

// Let's now union the cube with the Cartesian product (CP) and
// re-run the group by.
// Since the counts were set to zero in the CP, this does not impact the
// counts of the cube. It simply adds the "missing" combinations with a count of 0.
df.cube("x", "y").count
  .union(cartesian)
  .groupBy("x", "y")
  .agg(sum('count) as "count")
  .show
.show
which yields:
+----+----+-----+
| x| y|count|
+----+----+-----+
| 2| 2| 1|
| 1| 2| 1|
| 1| 1| 1|
| 2| 1| 0|
|null|null| 3|
| 1|null| 2|
|null| 1| 1|
|null| 2| 2|
| 2|null| 1|
+----+----+-----+
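As a follow-up to the two answers above, here is a hedged Scala sketch that combines the join idea from the first answer with the df defined in the Scala example; sum_y is just an illustrative second aggregate (not part of the original question), added to show why the join variant scales more naturally to several aggregations:
import org.apache.spark.sql.functions.{count, lit, sum}

// All possible (x, y) combinations, used to fill in the missing ones.
val cartesianKeys = df.select("x").distinct.crossJoin(df.select("y").distinct)

df.cube("x", "y")
  .agg(count(lit(1)) as "count", sum("y") as "sum_y")
  .join(cartesianKeys, Seq("x", "y"), "full")
  .na.fill(0, Seq("count", "sum_y"))
  .show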
Consider this dataframe:
+----------+------+--+
| person| style| n|
+----------+------+--+
| P1| A| 1|
| P2| A| 1|
| P2| B| 2|
| P3| A| 1|
| P3| B| 2|
| P3| C| 2|
| P4| A| 2|
| P4| B| 2|
+----------+------+--+
The goal is to determine the preferred style for each person. Preference has tricky rules, however!
If a person has a single style (e.g. P1), that style is their preference regardless of the number of observations n.
If a person has 2 or more styles where one clearly has the greatest n (e.g. P2 style B) then that is the preference.
Now it gets harder. If a person has 3 or more styles where at least 2 n are the highest value and are the same, then all styles with the highest n are considered preferences (e.g. P3 styles B and C)
If a person has 2 or more styles where n is the same for all, NO preference is set (e.g. P4). Note that if P3 had one more n for style A (to make it 2) then it would also fall into this category.
Before experimenting with Spark I would simply GROUP BY and ORDER BY person and iterate over the results, sniffing for max(n) and in general handling it programmatically. But I am new to Spark and I understand that one should avoid iterating and collecting and such. I think my target output frame is:
+----------+------+--+
| person| style| n|
+----------+------+--+
| P1| A| 1|
| P2| B| 2|
| P3| B| 2|
| P3| C| 2|
+----------+------+--+
There are several good examples of finding the highest value in a column in a GROUP BY but that doesn't satisfy rule #3 or #4 above.
I am guessing some sort of a self-join where (count(max(n)) <> count(*)) or count(max(n)) = 1 but I am no expert in SQL.
Not to make this a contest but as a comparison, in MongoDB I would get styles and counts into an array for each person and then use $reduce to walk the array and apply logic to see if n was the same in each style, then "marking" the document (not the array) with a code indicating the state.
You can mark the rows with condition 4 and remove them after you do the usual row_number trick (actually rank here, because you want to keep ties according to condition 3):
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'cond',
    (F.count('*').over(Window.partitionBy('person')) > 1) &
    (F.min('n').over(Window.partitionBy('person')) ==
     F.max('n').over(Window.partitionBy('person')))
).withColumn(
    'rn',
    F.rank().over(Window.partitionBy('person').orderBy(F.desc('n')))
).filter('rn = 1 and not cond').drop('cond', 'rn')
df2.show()
+------+-----+---+
|person|style| n|
+------+-----+---+
| P2| B| 2|
| P3| B| 2|
| P3| C| 2|
| P1| A| 1|
+------+-----+---+
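For completeness, an equivalent sketch of the same approach in Scala (assuming the input DataFrame is called df and has the columns person, style and n):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, desc, lit, max, min, rank}

val byPerson = Window.partitionBy("person")

val df2 = df
  .withColumn("cond",                       // rule 4: several styles, all with the same n
    (count(lit(1)).over(byPerson) > 1) &&
    (min("n").over(byPerson) === max("n").over(byPerson)))
  .withColumn("rn", rank().over(byPerson.orderBy(desc("n"))))  // rank keeps ties (rule 3)
  .filter(col("rn") === 1 && !col("cond"))
  .drop("cond", "rn")

df2.show()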
I'm working on Apache Spark 2.3.0 (cloudera4) and I have an issue processing a DataFrame.
I've got this input dataframe:
+---+---+----+
| id| d1| d2 |
+---+---+----+
| 1| | 2.0|
| 2| |-4.0|
| 3| | 6.0|
| 4|3.0| |
+---+---+----+
And I need this output:
+---+---+----+----+
| id| d1| d2 | r |
+---+---+----+----+
| 1| | 2.0| 7.0|
| 2| |-4.0| 5.0|
| 3| | 6.0| 9.0|
| 4|3.0| | 3.0|
+---+---+----+----+
That is, viewed iteratively: take the row with the biggest id (4) and put its d1 value in the r column, then take the next row (3) and put r[4] + d2[3] in its r column, and so on.
Is it possible to do something like that in Spark? Because I need a value computed for one row in order to calculate the value of another row.
How about this? The important bit is sum($"r1").over(Window.orderBy($"id".desc)), which calculates a cumulative sum of the column. Other than that, I'm creating a couple of helper columns to get the max id and to get the ordering right.
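If you want to reproduce this locally, here is a minimal setup sketch; it assumes the blank cells in the input are nulls and that a SparkSession named spark is available (as in spark-shell). The imports also cover the snippet below:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Recreate the example input; the blanks above are assumed to be nulls.
val df = Seq[(Int, Option[Double], Option[Double])](
  (1, None, Some(2.0)),
  (2, None, Some(-4.0)),
  (3, None, Some(6.0)),
  (4, Some(3.0), None)
).toDF("id", "d1", "d2")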
val result = df
  .withColumn("max_id", max($"id").over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
  .withColumn("r1", when($"id" === $"max_id", $"d1").otherwise($"d2"))
  .withColumn("r", sum($"r1").over(Window.orderBy($"id".desc)))
  .drop($"max_id").drop($"r1")
  .orderBy($"id")
result.show
+---+----+----+---+
| id| d1| d2| r|
+---+----+----+---+
| 1|null| 2.0|7.0|
| 2|null|-4.0|5.0|
| 3|null| 6.0|9.0|
| 4| 3.0|null|3.0|
+---+----+----+---+
I would like to get the first and last row of each partition in Spark (I'm using PySpark). How do I go about this?
In my code I repartition my dataset based on a key column using:
mydf.repartition(keyColumn).sortWithinPartitions(sortKey)
Is there a way to get the first row and last row for each partition?
Thanks
I would highly advise against working with partitions directly. Spark does a lot of DAG optimisation, so when you try executing specific functionality on each partition, all your assumptions about the partitions and their distribution might be completely false.
You do, however, seem to have a keyColumn and a sortKey, so I'd suggest simply doing the following:
import pyspark
import pyspark.sql.functions as f
w_asc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.asc(sortKey))
w_desc = pyspark.sql.Window.partitionBy(keyColumn).orderBy(f.desc(sortKey))
res_df = mydf. \
    withColumn("rn_asc", f.row_number().over(w_asc)). \
    withColumn("rn_desc", f.row_number().over(w_desc)). \
    where("rn_asc = 1 or rn_desc = 1")
The resulting dataframe will have 2 additional columns, where rn_asc=1 indicates the first row and rn_desc=1 indicates the last row.
Scala: I think repartition does not take just a key column; it also requires an integer for how many partitions you want to set. In any case, I found a way to select the first and last row by using Spark's Window functions.
First, this is my test data.
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
| 2| 3|
| 3| 1|
| 3| 3|
| 3| 5|
+---+-----+
Then I use the Window function twice, because I cannot easily know which row is the last, whereas the reverse order makes it easy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank, when}

val a = Window.partitionBy("id").orderBy("value")
val d = Window.partitionBy("id").orderBy(col("value").desc)
val df = spark.read.option("header", "true").csv("test.csv")

df.withColumn("marker", when(rank().over(a) === 1, "Y").otherwise("N"))
  .withColumn("marker", when(rank().over(d) === 1, "Y").otherwise(col("marker")))
  .filter(col("marker") === "Y")
  .drop("marker").show
The final result is then,
+---+-----+
| id|value|
+---+-----+
| 3| 5|
| 3| 1|
| 1| 4|
| 1| 1|
| 2| 3|
| 2| 1|
+---+-----+
Here is another approach, using mapPartitions from the RDD API. We iterate over the elements of each partition until we reach the end. I would expect this to be fast, since we keep nothing but the two edge elements of each partition. Here is the code:
df = spark.createDataFrame([
    ["Tom", "a"],
    ["Dick", "b"],
    ["Harry", "c"],
    ["Elvis", "d"],
    ["Elton", "e"],
    ["Sandra", "f"]
], ["name", "toy"])
def get_first_last(it):
    first = last = next(it)
    for last in it:
        pass
    # Attention: if first equals last by reference, return only one!
    if first is last:
        return [first]
    return [first, last]
# coalesce here is just for demonstration
first_last_rdd = df.coalesce(2).rdd.mapPartitions(get_first_last)
spark.createDataFrame(first_last_rdd, ["name", "toy"]).show()
# +------+---+
# | name|toy|
# +------+---+
# | Tom| a|
# | Harry| c|
# | Elvis| d|
# |Sandra| f|
# +------+---+
PS: Odd positions will contain the first element of a partition and even positions the last one. Also note that the number of results will be (numPartitions * 2) - numPartitionsWithOneItem, which I expect to be relatively small, so you shouldn't worry about the cost of the extra createDataFrame call.
I have a very wide DataFrame in Spark. It has 80 columns, and I want to set one column to 0 and the rest to 1.
So, for the one I want to set to 0, I tried
df = df.withColumn("set_zero_column", lit(0))
and it worked.
Now I want to set the rest of the columns to 1. How do I do that without specifying all 79 names?
Any help is appreciated
Use select with a list comprehension:
from pyspark.sql.functions import lit
set_one_columns = [lit(1).alias(c) for c in df.columns if c != "set_zero_column"]
df = df.select(lit(0).alias("set_zero_column"), *set_one_columns)
If you needed to maintain the original column order, you could do:
cols = [lit(0).alias(c) if c == "set_zero_column" else lit(1).alias(c) for c in df.columns]
df = df.select(*cols)
Let me try to answer in Scala:
Example:
Method 1:
//sample dataframe
val df=Seq(("a",1)).toDF("id","id1")
//filter req columns and add literal value
val cls=df.columns.map(x => if (x != "id1") (x,lit("1")) else (x,lit("0")))
//use foldLeft and add columns dynamically
val df2=cls.foldLeft(df){(df,cls) => df.withColumn(cls._1,cls._2)}
Result:
df2.show()
+---+---+
| id|id1|
+---+---+
| 1| 0|
+---+---+
Method 2: Pault's approach :)
val cls=df.columns.map( x => if (x !="id1") lit(1).alias(s"${x}") else lit(0).alias(s"${x}"))
Result:
df.select(cls:_*).show()
+---+---+
| id|id1|
+---+---+
| 1| 0|
+---+---+
I am still new to Spark SQL, and this may not be the most efficient way to handle this scenario, but I will be glad if it helps or can be improved further.
This is how I was able to do it in Java.
Step 1:
Create a SparkSession and load your file into a DataFrame.
code:
public void process() throws AnalysisException {
    SparkSession session = new SparkSession.Builder()
            .appName("Untyped Aggregation on data frame")
            .master("local")
            .getOrCreate();

    // Load the file that you need to compute.
    Dataset<Row> peopledf = session.read()
            .option("header", "true")
            .option("delimiter", " ")
            .csv("src/main/resources/Person.txt");
Output:
+--------+---+--------+
| name|age|property|
+--------+---+--------+
| Gaurav| 27| 1|
| Dheeraj| 30| 1|
| Saloni| 26| 1|
| Deepak| 30| 1|
| Db| 25| 1|
|Praneeth| 24| 1|
| jyoti| 26| 1|
+--------+---+--------+
Step 2 (optional):
In case you need to provide a constant value for any one column.
code:
// In case you need to change the value for a single column.
Dataset<Row> peopledf1 = peopledf.withColumn("property",lit("0"));
peopledf1.show();
output:
+--------+---+--------+
| name|age|property|
+--------+---+--------+
| Gaurav| 27| 0|
| Dheeraj| 30| 0|
| Saloni| 26| 0|
| Deepak| 30| 0|
| Db| 25| 0|
|Praneeth| 24| 0|
| jyoti| 26| 0|
+--------+---+--------+
Step 3:
Get the String array of all the column names in your DataFrame.
code:
// Get the list of all the columns.
String[] myStringArray = peopledf1.columns();
Step 4:
Filter out the column you don't want to set to a constant, and build a List of the remaining column names together with a matching List of lit("0") constants for withColumns.
code:
// Create two lists: one with the names of the columns you need to compute,
// the other (of the same size) with the lit("0") constants.
// Filter out the column that you don't want to apply the constant to.
List<String> myList = new ArrayList<String>();
List<Column> myList1 = new ArrayList<Column>();
for (String element : myStringArray) {
    if (!(element.contains("name"))) {
        myList.add(element);
        myList1.add(lit("0"));
    }
}
Step 5:
Convert the Lists to Scala Seqs, since the withColumns method requires its arguments in that form.
code:
// Convert both lists into a Scala Seq<Column> and Seq<String> respectively.
// This is needed because withColumns requires its arguments as Seqs.
// Check the Scala docs for withColumns.
Seq<Column> mySeq1 = convertListToSeq(myList1);
Seq<String> mySeq = convertListToSeq1(myList);
code for convertListToSeq using JavaConverters:
// Use JavaConverters to convert a List to a Scala Seq with the methods below.
public Seq<String> convertListToSeq1(List<String> inputList) {
    return JavaConverters.asScalaIteratorConverter(inputList.iterator())
            .asScala().toSeq();
}

public Seq<Column> convertListToSeq(List<Column> inputList) {
    return JavaConverters.asScalaIteratorConverter(inputList.iterator())
            .asScala().toSeq();
}
Step 6:
Print the output to the console.
code:
//Display the required output on console.
peopledf1.withColumns(mySeq,mySeq1).show();
output:
+--------+---+--------+
| name|age|property|
+--------+---+--------+
| Gaurav| 0| 0|
| Dheeraj| 0| 0|
| Saloni| 0| 0|
| Deepak| 0| 0|
| Db| 0| 0|
|Praneeth| 0| 0|
| jyoti| 0| 0|
+--------+---+--------+
Please do comment if code can be improved further.
Happy Learning,
Gaurav