Interpolate seconds to milliseconds in dataset? - python-3.x

I have a dataset sorted by timestamps in seconds. However, I need to somehow convert it to millisecond accuracy.
Example
dataset = [
    # UNIX timestamps with reading data
    (0, 0.48499),
    (2, 0.48475),
    (3, 0.48475),
    (3, 0.48473),
    (3, 0.48433),
    (3, 0.48403),
    (3, 0.48403),
    (3, 0.48403),
    (3, 0.48403),
    (3, 0.48403),
    (5, 0.48396),
    (12, 0.48353),
]
Expected output (roughly)
interpolated = [
    # Timestamps with millisecond accuracy
    (0.0, 0.48499),
    (2.0, 0.48475),
    (3.0, 0.48475),
    (3.14, 0.48473),
    (3.28, 0.48433),
    (3.42, 0.48403),
    (3.57, 0.48403),
    (3.71, 0.48403),
    (3.85, 0.48403),
    (3.99, 0.48403),
    (5.0, 0.48396),
    (12.0, 0.48353),
]
I don't have much experience with pandas. I've looked at interpolate and drop_duplicates but couldn't figure out how to go about this.
I would think this is a common problem, so any help is appreciated. Ideally I want to spread the numbers out evenly.

You can use the groupby and apply methods. I couldn't come up with a single built-in method like interpolate for this case, but there might be a more Pythonic way.
Code:
import numpy as np
import pandas as pd
# Create a sample dataframe
dataset = [(0, 0.48499), (2, 0.48475), (3, 0.48475), (3, 0.48473), (3, 0.48433), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (5, 0.48396), (12, 0.48353)]
df = pd.DataFrame(dataset, columns=['t', 'value'])
# Spread each run of duplicated timestamps evenly across its one-second interval
df.t = df.groupby('t', group_keys=False).apply(lambda g: g.t + np.linspace(0, 1, len(g)))
Output:
t         value
0         0.48499
2         0.48475
3         0.48475
3.14286   0.48473
3.28571   0.48433
3.42857   0.48403
3.57143   0.48403
3.71429   0.48403
3.85714   0.48403
4         0.48403
5         0.48396
12        0.48353
(Input:)
t    value
0    0.48499
2    0.48475
3    0.48475
3    0.48473
3    0.48433
3    0.48403
3    0.48403
3    0.48403
3    0.48403
3    0.48403
5    0.48396
12   0.48353
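If you want the spread-out timestamps to stay strictly below the next whole second, as in the question's expected output, one option (a small variation on the answer above, not part of the original answer) is to pass endpoint=False to np.linspace so the last duplicate never lands exactly on t + 1:
import numpy as np
import pandas as pd

dataset = [(0, 0.48499), (2, 0.48475), (3, 0.48475), (3, 0.48473), (3, 0.48433), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (5, 0.48396), (12, 0.48353)]
df = pd.DataFrame(dataset, columns=['t', 'value'])
# Spread each run of duplicated timestamps over [t, t + 1) instead of [t, t + 1]
df.t = df.groupby('t', group_keys=False).apply(
    lambda g: g.t + np.linspace(0, 1, len(g), endpoint=False)
)
print(df)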

Related

pandas get max threshold values from tuples in list

I am working with a pandas dataframe. One of the columns has a list of tuples in each row, each tuple carrying a score. I am trying to get the scores higher than 0.20. How do I apply a threshold instead of taking the max? I tried itemgetter and a lambda with if/else, but it didn't work as I thought. What am I doing wrong?
from operator import itemgetter
import pandas as pd
# sample data
l1 = ['1','2','3']
l2 = ['test1','test2','test3']
l3 = [[(1,0.95),(5,0.05)],[(7,0.10),(1,0.20),(6,0.70)],[(7,0.30),(1,0.70)]]
df = pd.DataFrame({'id':l1,'text':l2,'score':l3})
print(df)
# Preview from print statement above
id text score
1 test1 [(1, 0.95), (5, 0.05)]
2 test2 [(7, 0.1), (1, 0.2), (6, 0.7)]
3 test3 [(7, 0.3), (1, 0.7)]
# Try #1:
print(df['score'].apply(lambda x: max(x,key=itemgetter(0))))
# Preview from print statement above
(5, 0.05)
(7, 0.1)
(7, 0.3)
# Try #2: Gives `TypeError`
df['score'].apply(lambda x: ((x,itemgetter(0)) if x >= 0.20 else ''))
What I am trying to get for output:
id  text   probability                     output needed
1   test1  [(1, 0.95), (5, 0.05)]          [(1, 0.95)]
2   test2  [(7, 0.1), (1, 0.2), (6, 0.7)]  [(1, 0.2), (6, 0.7)]
3   test3  [(7, 0.3), (1, 0.7)]            [(7, 0.3), (1, 0.7)]
You can use a pretty straightforward list comprehension to get the desired output. I'm not sure how you would use itemgetter for this:
# min(y) works here only because the score happens to be the smaller element of each tuple
df['score'] = df['score'].apply(lambda x: [y for y in x if min(y) >= .2])
df
id text score
0 1 test1 [(1, 0.95)]
1 2 test2 [(1, 0.2), (6, 0.7)]
2 3 test3 [(7, 0.3), (1, 0.7)]
If you wanted an alternative result (like an empty tuple), you can use:
df['score'] = df['score'].apply(lambda x: ([y if min(y) >= .2 else () for y in x ]))
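Since the question asked specifically about itemgetter: the score is the second element of each tuple, so itemgetter(1) can extract it directly instead of relying on min(). A hedged sketch reusing the question's sample data:
from operator import itemgetter
import pandas as pd

l1 = ['1', '2', '3']
l2 = ['test1', 'test2', 'test3']
l3 = [[(1, 0.95), (5, 0.05)], [(7, 0.10), (1, 0.20), (6, 0.70)], [(7, 0.30), (1, 0.70)]]
df = pd.DataFrame({'id': l1, 'text': l2, 'score': l3})

# itemgetter(1) returns the second element of each tuple, i.e. the score
get_score = itemgetter(1)
df['score'] = df['score'].apply(lambda x: [y for y in x if get_score(y) >= 0.20])
print(df)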

How to divide the content of an RDD

I have an RDD whose contents I would like to divide element-wise, returning a list of tuples.
rdd_to_divide = [('Nate', (1.2, 1.2)), ('Mike', (5, 10)), ('Ben', (3, 7)), ('Chad', (12, 20))]
result_rdd = [('Nate', 1.2/1.2), ('Mike', 5/10), ('Ben', 3/7), ('Chad', 12/20)]
Thanks in advance
Use a lambda function to map over the RDD as below:
>>> rdd_to_divide = sc.parallelize([('Nate', (1.2, 1.2)), ('Mike', (5, 10)), ('Ben', (3, 7)), ('Chad', (12, 20))])
>>> result_rdd = rdd_to_divide.map(lambda x: (x[0], x[1][0]/x[1][1]))
>>> result_rdd.take(5)
[('Nate', 1.0), ('Mike', 0.5), ('Ben', 0.42857142857142855), ('Chad', 0.6)]
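As a possible alternative (not from the original answer): for a pair RDD like this, mapValues keeps the key untouched and only transforms the value, which may read a little more clearly:
>>> result_rdd = rdd_to_divide.mapValues(lambda v: v[0] / v[1])
>>> result_rdd.take(5)
[('Nate', 1.0), ('Mike', 0.5), ('Ben', 0.42857142857142855), ('Chad', 0.6)]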

How to avoid in-place removal/modification of a NetworkX graph

I have a NetworkX graph, and I'm trying to remove edges from it using remove_edge.
I want to remove each edge of the original graph in turn and post-process the resulting graph H to get further stats, like the edges connected to the edge that was removed.
import networkx as nx
import matplotlib.pyplot as plt
# fig 1
n=10
G = nx.gnm_random_graph(n=10, m=10, seed=1)
nx.draw(G, with_labels=True)
plt.show()
for e in [[5, 0], [3, 6]]:
    H = G.remove_edge(e[0], e[1])
    nx.draw(G, with_labels=True)
    plt.show()
In the above, the edge is removed inplace in G. So for the second iteration, the original graph is no
longer present. How can this be avoided? I want to retain the original graph for every iteration and instead store the graph that results after edge removal in another copy, H.
Any suggestions will be highly appreciated.
EDIT: Based on what's suggested below
n=10
G = nx.gnm_random_graph(n=10, m=10, seed=1)
nx.draw(G, with_labels=True)
plt.show()
G_copy = G.copy()
for e in [[5, 0], [3, 6]]:
    print(G_copy.edges())
    H = G_copy.remove_edge(e[0], e[1])
    nx.draw(G_copy, with_labels=True)
    plt.show()
    print(G_copy.edges())
Obtained output:
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
[(0, 6), (0, 7), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
Expected:
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
Make a copy of the original graph and modify the copy:
H = G.copy()
...
H.remove_edge(e[0], e[1])
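Note that for the loop in the question, the copy has to be made inside the loop; otherwise every removal accumulates on the same copy. A minimal sketch along those lines (variable names follow the question's code):
import networkx as nx
import matplotlib.pyplot as plt

G = nx.gnm_random_graph(n=10, m=10, seed=1)

for e in [[5, 0], [3, 6]]:
    H = G.copy()                  # fresh copy each iteration, G is never touched
    H.remove_edge(e[0], e[1])     # remove_edge mutates H in place and returns None
    print(G.edges())              # original edge list, identical every iteration
    print(H.edges())              # edge list with just this edge removed
    nx.draw(H, with_labels=True)
    plt.show()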

How to name a slice of a two-dimensional array in Python

import numpy as np
import matplotlib.pyplot as plt

points = [(1, 2), (3, 4), (5, 6), (7, 8)]
points = np.array(points)
plt.plot(points[:, 0], points[:, 1], 'ro')
How can I give names to these slices ([:, 0] and [:, 1])?
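One way to give those column slices names (a hedged sketch, not an answer from the original thread) is to unpack them into ordinary variables, or to build a NumPy record array so the columns carry real field names:
import numpy as np
import matplotlib.pyplot as plt

points = np.array([(1, 2), (3, 4), (5, 6), (7, 8)])

# Option 1: unpack the column slices into named variables
x, y = points[:, 0], points[:, 1]
plt.plot(x, y, 'ro')

# Option 2: a record array gives the columns actual field names
named = np.rec.fromarrays(points.T, names='x,y')
plt.plot(named.x, named.y, 'ro')
plt.show()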

Repartitioning partitioned data

I'm working on a skewed-data problem: my smallest partitions are below 64 MB and my largest partitions can be greater than 1 GB. I've been contemplating a strategy that maps a few small partitions to the same partition key, thus creating a partition composed of several original partitions. This is all in the hope of reducing the variance in task size as well as the number of files stored on disk.
At one point in my Spark application, I need to operate on the (non-grouped) original partitions and to do so, will need to repartition by the original key. This brings me to my question:
Suppose I have two data sets as seen below. Each row is a tuple of the form (partition_key, (original_key, data)). In data0, you can see that original_key = 0 is on its own node, whereas original_key = 4 and original_key = 5 are together on the node containing partition_key = 3. In data1, things are not as organized.
If data0 is partitioned by partition_key, and then partitioned by original_key, will a shuffle occur? In other words, does it matter during the second partitionBy call that data0 is more organized than data1?
data0 = [
    (0, (0, 'a')),
    (0, (0, 'b')),
    (0, (0, 'c')),
    (1, (1, 'd')),
    (1, (1, 'e')),
    (1, (2, 'f')),
    (1, (2, 'g')),
    (2, (3, 'h')),
    (2, (3, 'i')),
    (2, (3, 'j')),
    (3, (4, 'k')),
    (3, (4, 'l')),
    (3, (5, 'm')),
    (3, (5, 'n')),
    (3, (5, 'o')),
]
data1 = [
    (0, (0, 'a')),
    (1, (0, 'b')),
    (0, (0, 'c')),
    (1, (1, 'd')),
    (2, (1, 'e')),
    (1, (2, 'f')),
    (3, (2, 'g')),
    (2, (3, 'h')),
    (0, (3, 'i')),
    (3, (3, 'j')),
    (3, (4, 'k')),
    (3, (4, 'l')),
    (1, (5, 'm')),
    (2, (5, 'n')),
    (3, (5, 'o')),
]
rdd0 = sc.parallelize(data0, 3).cache()
partitioned0 = rdd0.partitionBy(4)
partitioned0.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
rdd1 = sc.parallelize(data1, 3).cache()
partitioned1 = rdd1.partitionBy(4)
partitioned1.map(lambda row: (row[1][0], row[1])).partitionBy(6).collect()
When you call partitionBy, a shuffle kicks in.
How much of the data actually gets shuffled depends on the layout of the original RDD.
As a side note: when you do sc.parallelize(data0, 3), the 3 is merely a guideline. If the default parallelism is <= 3, your rdd0 will have 3 partitions. If your data0 lives on more HDFS blocks, providing the partition number has no effect.
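One way to check empirically whether the second partitionBy moves rows around (a sketch, not from the original answer; it assumes the sc and data0 defined in the question) is to print the contents of each partition with glom():
# Continuing from the question's snippet: sc and data0 are assumed to exist
rdd0 = sc.parallelize(data0, 3)
partitioned0 = rdd0.partitionBy(4)
# glom() collects the elements of every partition into a list, so the output
# shows exactly which rows sit on which partition before and after the second
# partitionBy
print(partitioned0.glom().collect())
print(partitioned0.map(lambda row: (row[1][0], row[1])).partitionBy(6).glom().collect())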
