PySpark Count Distinct By Group In An RDD - apache-spark

I have an RDD of (datetime, hostname) tuples and I want to count the unique hostnames per date.
RDD:
X = [(datetime.datetime(1995, 8, 1, 0, 0, 1), u'in24.inetnebr.com'),
(datetime.datetime(1995, 8, 1, 0, 0, 7), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 1, 0, 0, 8), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 2, 0, 0, 8), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 2, 0, 0, 8), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 2, 0, 0, 9), u'ix-esc-ca2-07.ix.netcom.com'),
(datetime.datetime(1995, 8, 3, 0, 0, 10), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 3, 0, 0, 10), u'slppp6.intermind.net'),
(datetime.datetime(1995, 8, 4, 0, 0, 10), u'piweba4y.prodigy.com'),
(datetime.datetime(1995, 8, 5, 0, 0, 11), u'slppp6.intermind.net')]
DESIRED OUTPUT:
[(datetime.datetime(1995, 8, 1, 0, 0, 1), 2),
(datetime.datetime(1995, 8, 2, 0, 0, 8), 2),
(datetime.datetime(1995, 8, 3, 0, 0, 10), 2),
(datetime.datetime(1995, 8, 4, 0, 0, 10), 1),
(datetime.datetime(1995, 8, 5, 0, 0, 11), 1)]
MY ATTEMPT:
dayGroupedHosts = X.groupBy(lambda x: x[0]).distinct()
dayHostCount = dayGroupedHosts.count()
I am getting an error when performing the count operation. I am new to Spark, and I would like to know the correct and efficient transformations to achieve tasks like this.
Thanks a lot in advance.

You need to first convert the keys into dates. Then group by the key, and count the distinct values:
X.map(lambda x: (x[0].date(), x[1]))\
 .groupByKey()\
 .mapValues(lambda vals: len(set(vals)))\
 .sortByKey()\
 .collect()
#[(datetime.date(1995, 8, 1), 2),
# (datetime.date(1995, 8, 2), 2),
# (datetime.date(1995, 8, 3), 2),
# (datetime.date(1995, 8, 4), 1),
# (datetime.date(1995, 8, 5), 1)]
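If some days could have a very large number of hostnames, a variant worth sketching (my addition, not part of the original answer) avoids materializing each day's values into a single set: deduplicate the (date, hostname) pairs first, then count with reduceByKey:
# Deduplicate (date, hostname) pairs, then count the distinct hostnames per date.
counts = (X.map(lambda x: (x[0].date(), x[1]))
           .distinct()                       # keep one record per (date, hostname)
           .map(lambda kv: (kv[0], 1))
           .reduceByKey(lambda a, b: a + b)  # distinct hostnames per date
           .sortByKey()
           .collect())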

Or convert to a DataFrame and use the countDistinct function:
import pyspark.sql.functions as f
df = spark.createDataFrame(X, ["dt", "hostname"])
df.show()
+-------------------+--------------------+
| dt| hostname|
+-------------------+--------------------+
|1995-08-01 00:00:01| in24.inetnebr.com|
|1995-08-01 00:00:07| uplherc.upl.com|
|1995-08-01 00:00:08| uplherc.upl.com|
|1995-08-02 00:00:08| uplherc.upl.com|
|1995-08-02 00:00:08| uplherc.upl.com|
|1995-08-02 00:00:09|ix-esc-ca2-07.ix....|
|1995-08-03 00:00:10| uplherc.upl.com|
|1995-08-03 00:00:10|slppp6.intermind.net|
|1995-08-04 00:00:10|piweba4y.prodigy.com|
|1995-08-05 00:00:11|slppp6.intermind.net|
+-------------------+--------------------+
df.groupBy(f.to_date('dt').alias('date')).agg(
    f.countDistinct('hostname').alias('hostname')
).show()
+----------+--------+
| date|hostname|
+----------+--------+
|1995-08-02| 2|
|1995-08-03| 2|
|1995-08-01| 2|
|1995-08-04| 1|
|1995-08-05| 1|
+----------+--------+
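Note that the DataFrame result above is not sorted by date; if you want it ordered like the RDD version, an orderBy can be appended (a small addition on my part):
df.groupBy(f.to_date('dt').alias('date')).agg(
    f.countDistinct('hostname').alias('hostname')
).orderBy('date').show()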

Related

incorrect result showing in K nearest neighbour approach

I am rearranging 2D coordinates into an aligned order; the coordinates were shuffled (not aligned) before.
I have the input coordinates below:
X = [2, 2, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 6, 5, 4, 3, 5, 5, 5]
Y = [2, 3, 3, 3, 4, 5, 6, 6, 6, 5, 4, 3, 2, 2, 2, 2, 3, 4, 5]
I have to make them aligned. Therefore, I first applied the sorted function to these coordinates and got the output below.
merged_list1 = sorted(zip(X, Y))
Output:
X1_coordinate_reformed = [2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6]
Y1_coordinate_reformed = [2, 3, 2, 3, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 4, 5, 6]
Still, it is not aligned properly. I want two consecutive nodes to be placed next to each other, so I am taking the approach of finding the nearest coordinate to the origin to get the very first node, then finding the nearest coordinate to that node, and so on. For that, I have applied the code below.
First, I wrote a function which calculates the distance and returns the index of the nearest coordinate in the list.
def solve(pts, pt):
    x, y = pt
    idx = -1
    smallest = float("inf")
    for p in pts:
        if p[0] == x or p[1] == y:
            dist = abs(x - p[0]) + abs(y - p[1])
            if dist < smallest:
                idx = pts.index(p)
                smallest = dist
            elif dist == smallest:
                if pts.index(p) < idx:
                    idx = pts.index(p)
                    smallest = dist
    return idx
coor2 = list(zip(X1_coordinate_reformed, Y1_coordinate_reformed))  # make a list which contains tuples of X and Y coordinates
pts2 = coor2.copy()
origin1 = (0, 0)
new_coor1 = []
for i in range(len(pts2)):
    pt = origin1
    index_num1 = solve(pts2, pt)
    print('index is', index_num1)
    origin1 = pts2[index_num1]
    new_coor1.append(pts2[index_num1])
    del pts2[index_num1]
After running the code, I got the output below:
[(6, 6), (5, 6), (4, 6), (4, 5), (4, 4), (4, 3), (3, 3), (2, 3), (2, 2), (3, 2), (4, 2), (5, 2), (5, 3), (5, 4), (5, 5), (6, 5), (6, 4), (6, 3), (6, 2)]
This is not correct, because it can clearly be seen that, with
coor2 = [(2, 2), (2, 3), (3, 2), (3, 3), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)]
origin = (0, 0)
if we compute the distance from this origin (0, 0) to every coordinate in the coor2 list above, we get (2, 2) as the nearest coordinate. Then how come my code gives (6, 6) as the nearest coordinate?
The interesting thing is that if I apply the same procedure (sorting followed by finding the nearest coordinate) to the coordinates below,
X2_coordinate = [2, 4, 4, 2, 3, 2, 4, 3, 1, 3, 4, 3, 1, 2, 0, 3, 4, 2, 0]
Y2_coordinate = [3, 4, 2, 1, 3, 2, 1, 0, 0, 2, 3, 4, 1, 4, 0, 1, 0, 0, 1]
After applying the sorted function:
X2_coordinate_reformed = [0, 0, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]
Y2_coordinate_reformed = [0, 1, 0, 1, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
After applying the nearest-coordinate search described above, the result I got is:
[(0, 0), (0, 1), (1, 1), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (3, 4), (3, 3), (3, 2), (3, 1), (3, 0), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4)]
Kindly suggest where I am going wrong and what I should change.
It is better to use scipy for finding the closest coordinate. The code given below works:
from scipy import spatial
import numpy as np

pts = merged_list1.copy()
origin = np.array((0, 0))
new_coordi = []
for i in range(len(pts)):
    # query the KD-tree for the point closest to the current origin
    distance, index = spatial.KDTree(pts).query(origin)
    new_coordi.append(pts[index])
    # the found point becomes the origin for the next iteration
    origin = np.array(pts[index])
    del pts[index]
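As a quick check (assuming merged_list1 is the sorted list of (x, y) tuples built earlier), the walk now starts from the point nearest the origin:
print(new_coordi[0])    # expected: (2, 2) for the first dataset
print(new_coordi[:5])   # each following point is the nearest remaining neighbour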

Get all combinations of N items

I have a list of items:
[0,1,10,20,5,6,7]
Is there a brief, Pythonic way to get all groupings of n variables? Groups containing the same items in a different order are considered duplicates.
For n = 3:
(0,1,10)
(0,1,20)
(0,2,5)
...
For n = 4:
(0,1,10,20)
(0,1,10,5)
(0,1,10,6)
...
Maybe you are looking for "powerset" from the itertools recipes:
from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

l = [0,1,10,20,5,6,7]
list(powerset(l))
Output:
[(),
(0,),
(1,),
(10,),
(20,),
(5,),
(6,),
(7,),
(0, 1),
(0, 10),
(0, 20),
(0, 5),
(0, 6),
(0, 7),
(1, 10),
(1, 20),
(1, 5),
(1, 6),
(1, 7),
(10, 20),
(10, 5),
(10, 6),
(10, 7),
(20, 5),
(20, 6),
(20, 7),
(5, 6),
(5, 7),
(6, 7),
(0, 1, 10),
(0, 1, 20),
(0, 1, 5),
(0, 1, 6),
(0, 1, 7),
(0, 10, 20),
(0, 10, 5),
(0, 10, 6),
(0, 10, 7),
(0, 20, 5),
(0, 20, 6),
(0, 20, 7),
(0, 5, 6),
(0, 5, 7),
(0, 6, 7),
(1, 10, 20),
(1, 10, 5),
(1, 10, 6),
(1, 10, 7),
(1, 20, 5),
(1, 20, 6),
(1, 20, 7),
(1, 5, 6),
(1, 5, 7),
(1, 6, 7),
(10, 20, 5),
(10, 20, 6),
(10, 20, 7),
(10, 5, 6),
(10, 5, 7),
(10, 6, 7),
(20, 5, 6),
(20, 5, 7),
(20, 6, 7),
(5, 6, 7),
(0, 1, 10, 20),
(0, 1, 10, 5),
(0, 1, 10, 6),
(0, 1, 10, 7),
(0, 1, 20, 5),
(0, 1, 20, 6),
(0, 1, 20, 7),
(0, 1, 5, 6),
(0, 1, 5, 7),
(0, 1, 6, 7),
(0, 10, 20, 5),
(0, 10, 20, 6),
(0, 10, 20, 7),
(0, 10, 5, 6),
(0, 10, 5, 7),
(0, 10, 6, 7),
(0, 20, 5, 6),
(0, 20, 5, 7),
(0, 20, 6, 7),
(0, 5, 6, 7),
(1, 10, 20, 5),
(1, 10, 20, 6),
(1, 10, 20, 7),
(1, 10, 5, 6),
(1, 10, 5, 7),
(1, 10, 6, 7),
(1, 20, 5, 6),
(1, 20, 5, 7),
(1, 20, 6, 7),
(1, 5, 6, 7),
(10, 20, 5, 6),
(10, 20, 5, 7),
(10, 20, 6, 7),
(10, 5, 6, 7),
(20, 5, 6, 7),
(0, 1, 10, 20, 5),
(0, 1, 10, 20, 6),
(0, 1, 10, 20, 7),
(0, 1, 10, 5, 6),
(0, 1, 10, 5, 7),
(0, 1, 10, 6, 7),
(0, 1, 20, 5, 6),
(0, 1, 20, 5, 7),
(0, 1, 20, 6, 7),
(0, 1, 5, 6, 7),
(0, 10, 20, 5, 6),
(0, 10, 20, 5, 7),
(0, 10, 20, 6, 7),
(0, 10, 5, 6, 7),
(0, 20, 5, 6, 7),
(1, 10, 20, 5, 6),
(1, 10, 20, 5, 7),
(1, 10, 20, 6, 7),
(1, 10, 5, 6, 7),
(1, 20, 5, 6, 7),
(10, 20, 5, 6, 7),
(0, 1, 10, 20, 5, 6),
(0, 1, 10, 20, 5, 7),
(0, 1, 10, 20, 6, 7),
(0, 1, 10, 5, 6, 7),
(0, 1, 20, 5, 6, 7),
(0, 10, 20, 5, 6, 7),
(1, 10, 20, 5, 6, 7),
(0, 1, 10, 20, 5, 6, 7)]
If you only need the groupings of one particular size, itertools.combinations can be used directly, e.g. for n = 3:
from itertools import combinations
list(combinations([0,1,10,20,5,6,7], 3))
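If you want the groupings for several sizes at once (e.g. n = 3 and n = 4, as in the question), a small sketch:
from itertools import combinations

items = [0, 1, 10, 20, 5, 6, 7]

# One list of tuples per requested size; reorderings of the same items are
# never generated, so there are no duplicate groups.
by_size = {n: list(combinations(items, n)) for n in (3, 4)}
print(len(by_size[3]), len(by_size[4]))   # 35 35 for a 7-item list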

Compare two lists of lists and fill in blank values

I'm reading data from an API and have a list of lists like this:
listData = [[datetime.datetime(2018, 1, 1, 5, 0), -6.78125],
[datetime.datetime(2018, 1, 1, 7, 0), -6.125],
[datetime.datetime(2018, 1, 1, 8, 0), -5.90625]]
I need to create a complete list filling in the missing values. I've created a destination, like this:
listDest = [[datetime.datetime(2018, 1, 1, 5, 0), None],
[datetime.datetime(2018, 1, 1, 6, 0), None],
[datetime.datetime(2018, 1, 1, 7, 0), None],
[datetime.datetime(2018, 1, 1, 8, 0), None]]
The end result should look like this:
[[datetime.datetime(2018, 1, 1, 5, 0), -6.78125],
[datetime.datetime(2018, 1, 1, 6, 0), None],
[datetime.datetime(2018, 1, 1, 7, 0), -6.125],
[datetime.datetime(2018, 1, 1, 8, 0), -5.90625]]
Here is the code I've tried:
for blankTime, blankValue in listDest:
    for dataTime, dataValue in listData:
        if blankTime == dataTime:
            blankIndex = listDest.index(blankTime)
            dataIndex = listData.index(dataTime)
            listDest[blankIndex] = tempRm7[dataIndex]
This returns the following error, which is confusing since I know that value is in both lists.
ValueError: datetime.datetime(2018, 1, 1, 5, 0) is not in list
I attempted to adapt the methods in this answer but that's for a 1D list and I couldn't figure out how to make it work for my 2D list.
If both lists are sorted, you can merge them and then group them (using heapq.merge/itertools.groupby):
import datetime
from heapq import merge
from itertools import groupby
listData = [[datetime.datetime(2018, 1, 1, 5, 0), -6.78125],
[datetime.datetime(2018, 1, 1, 7, 0), -6.125],
[datetime.datetime(2018, 1, 1, 8, 0), -5.90625]]
listDest = [[datetime.datetime(2018, 1, 1, 5, 0), None],
[datetime.datetime(2018, 1, 1, 6, 0), None],
[datetime.datetime(2018, 1, 1, 7, 0), None],
[datetime.datetime(2018, 1, 1, 8, 0), None]]
out = [next(g) for _, g in groupby(merge(listData, listDest, key=lambda k: k[0]), lambda k: k[0])]
# pretty print to screen:
from pprint import pprint
pprint(out)
Prints:
[[datetime.datetime(2018, 1, 1, 5, 0), -6.78125],
[datetime.datetime(2018, 1, 1, 6, 0), None],
[datetime.datetime(2018, 1, 1, 7, 0), -6.125],
[datetime.datetime(2018, 1, 1, 8, 0), -5.90625]]
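An alternative that does not require both lists to be sorted (a sketch on my part, assuming each timestamp appears at most once in listData) is a plain dictionary lookup:
# Map each received timestamp to its value, then fill the destination list;
# timestamps missing from listData simply keep None.
lookup = dict(listData)
out = [[dt, lookup.get(dt)] for dt, _ in listDest]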

How to convert "2011-2012" to datetime object in mongodb in python?

Hi, I am saving the year range as the string "2011-2012" in the db. I want to convert it to datetime objects and save it as something like "datetime.datetime(2011, 1, 1, 0, 0)-datetime.datetime(2012, 1, 1, 0, 0)".
Can you please help me?
You could use rrule from the dateutil package:
from datetime import datetime
from dateutil.rrule import rrule, MONTHLY
# input string:
years = '2011-2012'
# map to integer:
years = tuple(map(int, years.split('-')))
# obtain a datetime object for the starting year and the total years:
start_year = datetime(years[0], 1, 1)
total_years = years[1]-years[0]
# use rrule to generate a monthly range:
dt_range = list(rrule(freq=MONTHLY, count=total_years*12+1, dtstart=start_year))
# [datetime.datetime(2011, 1, 1, 0, 0),
# datetime.datetime(2011, 2, 1, 0, 0),
# datetime.datetime(2011, 3, 1, 0, 0),
# datetime.datetime(2011, 4, 1, 0, 0),
# datetime.datetime(2011, 5, 1, 0, 0),
# datetime.datetime(2011, 6, 1, 0, 0),
# datetime.datetime(2011, 7, 1, 0, 0),
# datetime.datetime(2011, 8, 1, 0, 0),
# datetime.datetime(2011, 9, 1, 0, 0),
# datetime.datetime(2011, 10, 1, 0, 0),
# datetime.datetime(2011, 11, 1, 0, 0),
# datetime.datetime(2011, 12, 1, 0, 0),
# datetime.datetime(2012, 1, 1, 0, 0)]
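If you only need the start and end of the range as two datetime objects (which is what the question literally asks for), a minimal sketch without dateutil:
from datetime import datetime

years = "2011-2012"
start_str, end_str = years.split("-")
start = datetime(int(start_str), 1, 1)   # datetime.datetime(2011, 1, 1, 0, 0)
end = datetime(int(end_str), 1, 1)       # datetime.datetime(2012, 1, 1, 0, 0)
# The two datetime objects can then be stored as separate fields in MongoDB.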

Strange Behaviour when Updating Cassandra row

I am using pyspark and pyspark-cassandra.
I have noticed this behaviour on multiple versions of Cassandra (3.0.x and 3.6.x) using COPY, sstableloader, and now saveToCassandra in pyspark.
I have the following schema
CREATE TABLE test (
    id int,
    time timestamp,
    a int,
    b int,
    c int,
    PRIMARY KEY ((id), time)
) WITH CLUSTERING ORDER BY (time DESC);
and the following data
(1, datetime.datetime(2015, 3, 1, 0, 18, 18, tzinfo=<UTC>), 1, 0, 0)
(1, datetime.datetime(2015, 3, 1, 0, 19, 12, tzinfo=<UTC>), 0, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 22, 59, tzinfo=<UTC>), 1, 0, 0)
(1, datetime.datetime(2015, 3, 1, 0, 23, 52, tzinfo=<UTC>), 0, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 32, 2, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 32, 8, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 43, 30, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 44, 12, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 48, 49, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 49, 7, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 50, 5, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 50, 53, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 51, 53, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 51, 59, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 54, 35, tzinfo=<UTC>), 1, 1, 0)
(1, datetime.datetime(2015, 3, 1, 0, 55, 28, tzinfo=<UTC>), 0, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 55, 55, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 0, 56, 24, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 11, 14, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 11, 17, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 12, 8, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 12, 10, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 17, 43, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 17, 49, tzinfo=<UTC>), 0, 3, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 12, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 1, 2, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 24, tzinfo=<UTC>), 2, 1, 0)
Towards the end of the data, there are two rows which have the same timestamp.
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 2, 1, 0)
(1, datetime.datetime(2015, 3, 1, 1, 24, 18, tzinfo=<UTC>), 1, 2, 0)
It is my understanding that when I save to Cassandra, one of these will "win" - there will only be one row.
After writing to Cassandra using
rdd.saveToCassandra(keyspace, table, ['id', 'time', 'a', 'b', 'c'])
Neither row appears to have won. Rather, the rows seem to have "merged".
1 | 2015-03-01 01:17:43+0000 | 1 | 2 | 0
1 | 2015-03-01 01:17:49+0000 | 0 | 3 | 0
1 | 2015-03-01 01:24:12+0000 | 1 | 2 | 0
1 | 2015-03-01 01:24:18+0000 | 2 | 2 | 0
1 | 2015-03-01 01:24:24+0000 | 2 | 1 | 0
Rather than the 2015-03-01 01:24:18+0000 row containing (1, 2, 0) or (2, 1, 0), it contains (2, 2, 0).
What is happening here? I can't for the life of me figure out what is causing this behaviour.
This is a little-known effect that comes from batching data together. Batching writes assigns the same timestamp to all inserts in the batch. If two writes then carry the exact same timestamp, a special merge rule applies, because there is no "last" write. The Spark Cassandra Connector uses intra-partition batches by default, so this kind of clobbering of values is very likely to happen.
The behavior with two identical write timestamps is a merge based on the greater value.
Given table (key, a, b):

Batch
    Insert "foo", 2, 1
    Insert "foo", 1, 2
End batch
The batch gives both mutations the same timestamp. Cassandra cannot choose a "last written" value, since both writes happened at the same time; instead, for each column it simply chooses the greater of the two values. The merged result will be
"foo", 2, 2
