I have two tables with the same column names, the same data, and the same number of rows, but the ordering of rows may differ. Now I select column A from table_1 and column A from table_2 and compare the values. How can I achieve this using PySpark SQL? Can I do a sha2/md5 checksum and compare?
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.types import *
from pyspark.sql import Row
import pyspark.sql.functions as f
app_name="test"
table1="DB1.department"
table2="DB2.department"
conf = SparkConf().setAppName(app_name)
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
query1="select * from %s" %(table1)
df1 = sqlContext.sql(query1)
query2="select * from %s" %(table2)
df2 = sqlContext.sql(query2)
df3 = sqlContext.sql("""SELECT a.departmentid FROM DB1.department a FULL JOIN
    DB2.department b ON a.departmentid = b.departmentid WHERE a.departmentid
    IS NULL OR b.departmentid IS NULL""")
df5=sqlContext.sql("select md5(departmentid) from department1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 580, in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
813, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 51, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'md5(departmentid)'
due to data type mismatch: argument 1 requires binary type, however,
'departmentid' is of bigint type.; line 1 pos 11"
When I tried the md5 checksum it complained that it expects a binary type, but departmentid is bigint.
Table1:
departmentid departmentname departmentaddress
1 A Newyork
2 B Newjersey
3 C SanJose
4 D WashingtonDC
5 E Mexico
6 F Delhi
7 G Pune
8 H chennai
Table2:
departmentid departmentname departmentaddress
7 G Pune
8 H chennai
1 A Newyork
2 B Newjersey
3 C SanJose
4 D WashingtonDC
5 E Mexico
6 F Delhi
Here in table two the order of rows has changed, but the data remains the same, so technically these two tables are identical. Unless a new row gets added or values are modified, the two tables stay identical. (The tables are just for example and explanation; in reality we deal with big data.)
The simplest solution is:
def is_identical(x, y):
    # same number of rows and no rows of x missing from y
    return (x.count() == y.count()) and (x.subtract(y).count() == 0)
Example data:
df1 = spark.createDataFrame(
    [(1, "A", "Newyork"), (2, "B", "Newjersey"),
     (3, "C", "SanJose"), (4, "D", "WashingtonDC"), (5, "E", "Mexico"), (6, "F", "Delhi"),
     (7, "G", "Pune"), (8, "H", "chennai")],
    ("departmentid", "departmentname", "departmentadd"))
df2 = spark.createDataFrame(
    [(7, "G", "Pune"), (8, "H", "chennai"), (1, "A", "Newyork"), (2, "B", "Newjersey"),
     (3, "C", "SanJose"), (4, "D", "WashingtonDC"), (5, "E", "Mexico"), (6, "F", "Delhi")],
    ("departmentid", "departmentname", "departmentadd"))
df3 = spark.createDataFrame(
    [(1, "A", "New York"), (2, "B", "New Jersey"),
     (3, "C", "SanJose"), (4, "D", "WashingtonDC"), (5, "E", "Mexico"), (6, "F", "Delhi"),
     (7, "G", "Pune"), (8, "H", "chennai")],
    ("departmentid", "departmentname", "departmentadd"))
df4 = spark.createDataFrame(
    [(3, "C", "SanJose"), (4, "D", "WashingtonDC"), (5, "E", "Mexico"), (6, "F", "Delhi")],
    ("departmentid", "departmentname", "departmentadd"))
Checks:
is_identical(df1, df2)
# True
is_identical(df1, df3)
# False
is_identical(df1, df4)
# False
is_identical(df4, df4)
# True
With an outer join:
from pyspark.sql.functions import col, coalesce, lit
from functools import reduce
from operator import and_
def is_identical_(x, y, keys=("departmentid", )):
    def both_null(c):
        return (col("x.{}".format(c)).isNull() &
                col("y.{}".format(c)).isNull())

    def both_equal(c):
        return coalesce((col("x.{}".format(c)) ==
                         col("y.{}".format(c))), lit(False))

    p = reduce(and_, [both_null(c) | both_equal(c) for c in x.columns if c not in keys])

    return (x.alias("x").join(y.alias("y"), list(keys), "full_outer")
            .where(~p).count() == 0)
You'd get the same result:
is_identical_(df1, df2)
# True
is_identical_(df1, df3)
# False
is_identical_(df1, df4)
# False
is_identical_(df4, df4)
# True
md5 is of no use to you here, because it is not an aggregate function; it computes a checksum for a single value.
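If you still want a checksum-style comparison, the original error goes away once you cast the bigint column, e.g. md5(cast(departmentid as string)). Below is a minimal sketch of that route, hashing each whole row and then comparing the two sets of row hashes (assuming df1 and df2 are the DataFrames loaded above). Note you still end up comparing sets of hashes, so it buys little over subtracting the rows directly:
from pyspark.sql.functions import md5, concat_ws, col

# md5/sha2 expect string or binary input, so cast every column to string first
h1 = df1.select(md5(concat_ws("|", *[col(c).cast("string") for c in df1.columns])).alias("row_hash"))
h2 = df2.select(md5(concat_ws("|", *[col(c).cast("string") for c in df2.columns])).alias("row_hash"))

# identical content (ignoring row order) if neither side has hashes the other lacks
h1.subtract(h2).count() == 0 and h2.subtract(h1).count() == 0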
Related
I am new to pyspark and am still trying to understand how map and reduce work.
I have a dataset read in as an RDD; its contents are shown below under learn.txt.
Based on the second value (the numeric one), I want to see which two letters share the same value and how many times.
My current code's output:
[(('b', 'c'), 1),
('c', 1),
('d', 1),
('a', 1),
(('a', 'b'), 2),
((('a', 'b'), 'c'), 1)]
What I want as the output:
[(('b','a'),3),
(('a','b'),3),
(('b','c'),2),
(('c','b'),2),
(('a','c'),1),
(('c','a'),1)]
That is, pairs only: every ordered permutation of two letters that share at least one value, together with how many values they share.
I don't believe my code will be too helpful but this is what I have got:
from pyspark import RDD, SparkContext
from pyspark.sql import DataFrame, SparkSession
sc = SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()
df = sc.textFile("learn.txt")
mapped = df.map(lambda x: [a for a in x.split(',')])
remapped = mapped.map(lambda x: (x[1], x[0]))
reduced = remapped.reduceByKey(lambda x,y: (x,y))
threemapped = reduced.map(lambda x: (x[1], 1))
output = threemapped.reduceByKey(lambda x, y: x+y)
output.collect()
Where learn.txt:
a,1
a,2
a,3
a,4
b,2
b,3
b,4
b,6
c,2
c,5
c,6
d,7
With .reduceByKey(lambda x, y: (x, y)), you create nested tuples of tuples of tuples... You are not going to be able to do anything useful with that.
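For example, key '2' is shared by 'a', 'b' and 'c', so the reduction collapses them into something like the nested tuple below (the exact nesting depends on reduction order), which is exactly where the ((('a', 'b'), 'c'), 1) entry of your output comes from:
# key '2' after reduceByKey(lambda x, y: (x, y))
('2', (('a', 'b'), 'c'))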
Since you are looking for pairs of values that share a key, we can use a join like this:
# same code as you
vals = df\
    .map(lambda x: [a for a in x.split(',')])\
    .map(lambda x: (x[1], x[0]))
# but then you can join vals with itself and use reduceByKey to count occurrences
result = vals.join(vals)\
    .filter(lambda x: x[1][0] != x[1][1])\
    .map(lambda x: ((x[1][1], x[1][0]), 1))\
    .reduceByKey(lambda a, b: a+b)\
    .collect()
which yields:
[(('b', 'a'), 3), (('c', 'a'), 1), (('c', 'b'), 2),
(('b', 'c'), 2), (('a', 'b'), 3), (('a', 'c'), 1)]
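If you later move to the DataFrame API, roughly the same idea can be expressed with a self-join and a groupBy. This is only a sketch, assuming an active SparkSession named spark and the same learn.txt file:
from pyspark.sql import functions as F

pairs_df = spark.read.csv("learn.txt").toDF("letter", "value")
result_df = (pairs_df.alias("l")
             .join(pairs_df.alias("r"), on="value")           # letters that share a value
             .where(F.col("l.letter") != F.col("r.letter"))   # drop self-pairs
             .groupBy(F.col("l.letter").alias("first"),
                      F.col("r.letter").alias("second"))
             .count())
result_df.show()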
This is a common question, but I have an extra condition: how do I remove matches based on a unique ID? In other words, how do I prevent a record from matching against itself?
Given a dataframe:
df = pd.DataFrame({'id': [1, 2, 3],
                   'name': ['pizza', 'pizza toast', 'ramen']})
I used solutions like this one to create a multi-index dataframe:
Fuzzy match strings in one column and create new dataframe using fuzzywuzzy
from fuzzywuzzy import fuzz

df_copy = df.copy()
compare = pd.MultiIndex.from_product([df['name'], df_copy['name']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)
So that's great but how can I use the unique ID to prevent matching against itself?
If there's a case of ID/name = 1/pizza and 10/pizza, obviously I want to keep those. But I need to remove the same ID in both indexes.
I suggest a slightly different approach for the same result, using the Python standard library's difflib module, which provides helpers for computing deltas.
So, with the following dataframe in which pizza has two different ids (and thus should be checked against one another later on):
import pandas as pd
df = pd.DataFrame(
    {"id": [1, 2, 3, 4], "name": ["pizza", "pizza toast", "ramen", "pizza"]}
)
Here is how you can find similarities between different id/name combinations, but avoid checking an id/name combination against itself:
from difflib import SequenceMatcher
# Define a simple helper function
def ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()
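As a quick sanity check of the helper (these values line up with the 0-100 ratios in the result table further down, before the *100 scaling):
ratio("pizza", "pizza")        # 1.0
ratio("pizza", "pizza toast")  # 0.625
ratio("pizza", "ramen")        # 0.2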
And then, with the following steps:
# Create a column of unique identifiers: (id, name)
df["id_and_name"] = list(zip(df["id"], df["name"]))
# Calculate ratio only for different id_and_names
df = df.assign(
    match=df["id_and_name"].map(
        lambda x: {
            value: ratio(x[1], value[1])
            for value in df["id_and_name"]
            if x[0] != value[0] or ratio(x[1], value[1]) != 1
        }
    )
)
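# For illustration, this is the "match" dict computed for the first row, (1, "pizza"):
# (1, "pizza") itself is excluded by the condition above, while (4, "pizza") is kept.
print(df.loc[0, "match"])
# {(2, 'pizza toast'): 0.625, (3, 'ramen'): 0.2, (4, 'pizza'): 1.0}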
# Format results in a readable fashion
df = (
    pd.DataFrame(df["match"].to_list(), index=df["id_and_name"])
    .reset_index(drop=False)
    .melt("id_and_name", var_name="other_id_and_name", value_name="ratio")
    .dropna()
    .sort_values(by=["id_and_name", "ratio"], ascending=[True, False])
    .reset_index(drop=True)
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"] * 100))
    .pipe(lambda df_: df_.assign(ratio=df_["ratio"].astype(int)))
)
You get the expected result:
print(df)
# Output
id_and_name other_id_and_name ratio
0 (1, pizza) (4, pizza) 100
1 (1, pizza) (2, pizza toast) 62
2 (1, pizza) (3, ramen) 20
3 (2, pizza toast) (4, pizza) 62
4 (2, pizza toast) (1, pizza) 62
5 (2, pizza toast) (3, ramen) 12
6 (3, ramen) (4, pizza) 20
7 (3, ramen) (1, pizza) 20
8 (3, ramen) (2, pizza toast) 12
9 (4, pizza) (1, pizza) 100
10 (4, pizza) (2, pizza toast) 62
11 (4, pizza) (3, ramen) 20
I am trying to follow the example in
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html
However, when I change some of the criteria, the result is not as expected.
Please see the steps below:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *
vertices = sqlContext.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)], ["id", "name", "age"])
edges = sqlContext.createDataFrame([
("a", "b", "follow"),
("b", "c", "follow"),
("c", "b", "follow"),
("f", "c", "follow"),
("e", "f", "follow"),
("e", "d", "follow"),
("d", "a", "follow"),
("a", "e", "follow")
], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
Now, one change I have made is in the "relationship" column: all values are "follow" instead of "friend".
The query below runs fine:
g.bfs(fromExpr ="name = 'Alice'",toExpr = "age < 32", edgeFilter ="relationship != 'friend'" , maxPathLength = 10).show()
+--------------+--------------+---------------+--------------+----------------+
| from| e0| v1| e1| to|
+--------------+--------------+---------------+--------------+----------------+
|[a, Alice, 34]|[a, e, follow]|[e, Esther, 32]|[e, d, follow]| [d, David, 29]|
|[a, Alice, 34]|[a, b, follow]| [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|
+--------------+--------------+---------------+--------------+----------------+
but if I change the filter criterion from 32 to 35, a wrong result is fetched:
>>> g.bfs(fromExpr ="name = 'Alice'",toExpr = "age < 35", edgeFilter ="relationship != 'friend'" , maxPathLength = 10).show()
+--------------+--------------+
| from| to|
+--------------+--------------+
|[a, Alice, 34]|[a, Alice, 34]|
+--------------+--------------+
Ideally it should fetch a result similar to the first query, because the filter condition is still satisfied for those rows.
Is there any explanation for this?
bfs() searches for the shortest paths that satisfy your predicates. Alice's age is 34, so she already satisfies the toExpr = "age < 35" predicate, and you get a zero-length path that starts and ends at Alice. Change toExpr to something more specific, for example toExpr = "name = 'David' or name = 'Charlie'", which should give you exactly the same result as the first query.
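For reference, a sketch of that suggested query against the same g defined above; since neither David nor Charlie is Alice herself, the zero-length path no longer matches and you should get the two two-hop paths from the first query:
g.bfs(
    fromExpr="name = 'Alice'",
    toExpr="name = 'David' or name = 'Charlie'",
    edgeFilter="relationship != 'friend'",
    maxPathLength=10).show()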
This question already has answers here:
Who can give a clear explanation for `combineByKey` in Spark?
(3 answers)
Closed 3 years ago.
I got a question from homework.
We have sample data like this:
data = [ ("B", 2), ("A", 1), ("A", 4), ("B", 2), ("B", 3) ]
The combineByKey code is like this:
>>> rdd = sc.parallelize( data )
>>> rdd2 = rdd.combineByKey(lambda value: (value, value+2, 1),
... lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1),
... lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
I got a result like this:
>>> myoutput = rdd2.collect()
>>> myoutput
[('B', (7, 17, 3)), ('A', (5, 9, 2))]
We are supposed to manually write out the answer instead of just running the code to get the result.
After the first lambda, is it correct that I get this result: ('B', (2,4,1)), ('A', (1,3,1)), ('A', (4,6,1)), ('B', (2,4,1)), ('B', (3,5,1))? But I don't quite understand the "x[1] + value*value" part of the second lambda. How do I get the middle values of 17 and 9 for B and A?
Can anyone help to explain to me? Thank you!
As explained in the link by cricket_007, when using combineByKey, values are merged into one value per key within each partition, and then the per-partition values are merged into a single value per key.
Let's first look at the number of partitions and what each partition contains after we parallelize the data.
>>> data = [ ("B", 2), ("A", 1), ("A", 4), ("B", 2), ("B", 3) ]
>>> rdd = sc.parallelize( data )
>>> rdd.collect()
[('B', 2), ('A', 1), ('A', 4), ('B', 2), ('B', 3)]
Number of partitions (by default):
>>> num_partitions = rdd.getNumPartitions()
>>> print(num_partitions)
4
Contents of each partition:
>>> partitions = rdd.glom().collect()
>>> for num, partition in enumerate(partitions):
...     print(f'Partitions {num} -> {partition}')
Partitions 0 -> [('B', 2)]
Partitions 1 -> [('A', 1)]
Partitions 2 -> [('A', 4)]
Partitions 3 -> [('B', 2), ('B', 3)]
combineByKey is defined as
combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
The three functions that combineByKey takes as arguments are (the annotated call after this list shows which lambda plays which role):
createCombiner : lambda value: (value, value+2, 1)
This is called the first time a key is seen in a partition.
mergeValue : lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1)
This is called when the key has already been seen in that partition.
mergeCombiners : lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2])
This is called to merge the results for the same key coming from different partitions.
partitioner : Beyond the scope of this answer.
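For reference, here is the same call again with each argument labelled by its role:
rdd2 = rdd.combineByKey(
    lambda value: (value, value + 2, 1),                               # createCombiner
    lambda x, value: (x[0] + value, x[1] + value * value, x[2] + 1),   # mergeValue
    lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))              # mergeCombiners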
Now let's work out what happens:
Partition 0: [('B', 2)]
createCombiner
('B', 2) -> Unseen Key -> ('B', (2, 2+2, 1))
-> ('B', (2,4,1))
# Same createCombiner for partition 1,2,3
Partition 1: [('A',1)]
createCombiner
('A',1) -> Unseen Key -> ('A', (1,3,1))
Partition 2: [('A',4)]
createCombiner
('A',4) -> Unseen Key -> ('A', (4,6,1))
Partition 3: [('B',2), ('B',3)]
createCombiner
('B',2) -> Unseen Key -> ('B',(2,4,1))
('B',3) -> Seen Key -> mergeValue ('B',(2,4,1)) with ('B',3)
-> ('B', (2 + 3, 4 + (3*3), 1 + 1))
-> ('B', (5,13,2))
Partition 0 and Partition 3:
mergeCombiners ('B', (2,4,1)) and ('B', (5,13,2))
-> ('B', (2+5,4+13,1+2))
-> ('B', (7,17,3))
Partition 1 and 2:
mergeCombiners ('A', (1,3,1)) and ('A', (4,6,1))
-> ('A', (1+4, 3+6, 1+1))
-> ('A', (5,9,2))
So the final answer that we get is:
>>> rdd2 = rdd.combineByKey(lambda value: (value, value+2, 1),
... lambda x, value: (x[0] + value, x[1] + value*value, x[2] + 1),
... lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
>>> rdd2.collect()
[('B', (7, 17, 3)), ('A', (5, 9, 2))]
I hope this explains what's going on.
Additional clarification, as asked in the comments:
How does Spark set the number of partitions?
From the docs: Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
How does Spark partition the data?
A partition (aka split) is a logical chunk of a large distributed data set.
Spark has three different partitioning schemes, namely
hashPartitioner : The default. Keys with the same hash (modulo the number of partitions) end up in the same partition.
customPartitioner : Example below.
rangePartitioner : Elements with keys in the same range appear on the same node.
Quoting Learning Spark by Karau et al., p. 61: Spark does not give you explicit control over which key goes to which partition, but it ensures that a set of keys will appear together on some node. If you want all records with the same key to land in the same partition, you can use a custom partitioner like so.
>>> def customPartitioner(key):
...     if key == 'A':
...         return 0
...     if key == 'B':
...         return 1
>>> num_partitions = 2
>>> rdd = sc.parallelize( data ).partitionBy(num_partitions,customPartitioner)
>>> partitions = rdd.glom().collect()
>>> for num, partition in enumerate(partitions):
...     print(f'Partition {num} -> {partition}')
Partition 0 -> [('A', 1), ('A', 4)]
Partition 1 -> [('B', 2), ('B', 2), ('B', 3)]
I encourage you to read the book to learn more.
I have a data frame like this:
id x y
1 a 1 P
2 a 2 S
3 b 3 P
4 b 4 S
I want to keep rows where the 'lead' value of y is 'S' let us say, so that my resulting data frame will be:
id x y
1 a 1 P
2 b 3 P
I am able to do it as follows with PySpark:
getLeadPoint = udf(lambda y: 'S' if (y == 'S') else 'NOTS', StringType())
windowSpec = Window.partitionBy(df['id'])
df = df.withColumn('lead_point', getLeadPoint(lead(df.y).over(windowSpec)))
dfNew = df.filter(df.lead_point == 'S')
But, here, I am mutating an unnecessary column and then filtering.
What I want to do instead is something like this where I filter using lead(), but can't get it to work:
dfNew = df.filter(lead(df.y).over(windowSpec) == 'S')
Any ideas on how I can achieve the result with direct filter using windowing?
R equivalent is:
library(dplyr)
df %>% group_by(id) %>% filter(lead(y) == 'S')
Assuming your data looks like this:
df = sc.parallelize([
("a", 1, 1, "P"), ("a", 2, 2, "S"),
("b", 4, 2, "S"), ("b", 3, 1, "P"), ("b", 2, 3, "P"), ("b", 3, 3, "S")
]).toDF(["id", "x", "timestamp", "y"])
and window spec is equivalent to
from pyspark.sql.functions import lead, col
from pyspark.sql import Window
w = Window.partitionBy("id").orderBy("timestamp")
you can simply add a column and use it for filtering:
(df
.withColumn("lead_y", lead("y").over(w))
.where(col("lead_y") == "S").drop("lead_y"))
It is not pretty, but it will be far more efficient than a UDF call.
Not efficient, but you can zip with an index, then make a new RDD where you add 1 to the index, join on the index, and then it turns into a simple filter operation.
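A rough sketch of that idea (illustrative only: it assumes the rows are already in the desired order and ignores the per-id partitioning, so it is a starting point rather than a drop-in replacement for the window version):
indexed = df.rdd.zipWithIndex().map(lambda x: (x[1], x[0]))   # (i, row)
shifted = indexed.map(lambda x: (x[0] + 1, x[1]))             # row i re-keyed to i + 1
# joining gives, for key k, (row_k, row_{k-1}); keep row_{k-1} whenever its lead row_k has y == 'S'
result = (indexed.join(shifted)
          .filter(lambda kv: kv[1][0].y == 'S')
          .map(lambda kv: kv[1][1]))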