Adding a column to empty DataFrames via a loop - python-3.x

I have the following code:
for key in temp_dict:
    temp_dict[key][0][0] = temp_dict[key][0][0].insert(0, "Date", None)
where temp_dict is:
{'0.5SingFuel': [[Empty DataFrame
Columns: [Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]], 'Sing180': [[Empty DataFrame
Columns: [Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]], 'Sing380': [[Empty DataFrame
Columns: [Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]]}
What I would like to have is:
{'0.5SingFuel': [[Empty DataFrame
Columns: [Date, Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]], 'Sing180': [[Empty DataFrame
Columns: [Date, Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]], 'Sing380': [[Empty DataFrame
Columns: [Date, Month, Trades, -0.25, -0.2, -0.15, -0.1, -0.05, 0.0, 0.05, 0.1, 0.15, 0.2, 0.25, Total]
Index: []]]}
My code produces the following error:
ValueError: cannot insert Date, already exists
I would have thought I was looping from one dict key to the next, but stepping through the debugger it looks like the code:
1. does what it is supposed to on the first key,
2. moves on to the next key, and the previous key's entry becomes empty,
3. finds "Date" already in the new key's columns and then tries to add it, which of course it can't.
This probably makes no sense, hence why I need some help - I am confused. I think I am mis-assigning the variables, but am not completely sure how.

One problem is that insert is an in-place operation that returns None, so you must not reassign its result - doing so replaces the stored DataFrame with None. The second problem is that, as you saw, insert raises an error if the column already exists, so you need to check whether it is in the columns first, and possibly reorder to put this column in first position.
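To see the first point concretely: insert mutates the DataFrame and returns None, so assigning its return value back is exactly what empties your dict entry. A minimal demonstration:
import pandas as pd

df = pd.DataFrame(columns=['a'])
result = df.insert(0, 'Date', None)  # mutates df in place
print(result)      # None - assigning this back replaces the DataFrame
print(df.columns)  # Index(['Date', 'a'], dtype='object')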
Putting both fixes together:
import pandas as pd

# dummy dictionary, same structure
d = {0: [[pd.DataFrame(columns=['a', 'b'])]],
     1: [[pd.DataFrame(columns=['a', 'c'])]]}

# name of the column to insert
col = 'c'

for key in d.keys():
    df_ = d[key][0][0]  # easier to define a variable
    if col not in df_.columns:
        df_.insert(0, col, None)  # in place, no reassignment
    else:  # reorder and reassign in this case; remove the else if you don't need it
        d[key][0][0] = df_[[col] + df_.columns.difference([col]).tolist()]

print(d)
# {0: [[Empty DataFrame
# Columns: [c, a, b]   <- c added as first column
# Index: []]], 1: [[Empty DataFrame
# Columns: [c, a]      <- c moved to first position
# Index: []]]}
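Applied to your temp_dict, a sketch of the loop would be (assuming "Date" may already be present on some entries):
for key in temp_dict:
    df_ = temp_dict[key][0][0]
    if "Date" not in df_.columns:
        df_.insert(0, "Date", None)  # in place, no reassignment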

Related

I want to create a simple list [0, 0.05, 0.10, 0.15, ..., 1.00] using a while loop in python

I want to create a while loop in Python which gives as output the list [0.00, 0.05, 0.10, 0.15, ..., 1.00].
I tried the following:
alpha = 0
alphalist = list()
while alpha <= 1:
    alphalist.append(alpha)
    alpha += 0.05
print(alphalist)
I got the output as [0, 0.05, 0.1, 0.15000000000000002, 0.2, 0.25, 0.3, 0.35, 0.39999999999999997, 0.44999999999999996, 0.49999999999999994, 0.5499999999999999, 0.6, 0.65, 0.7000000000000001, 0.7500000000000001, 0.8000000000000002, 0.8500000000000002, 0.9000000000000002, 0.9500000000000003]
But what I want is this: [0.00, 0.05, 0.10, 0.15, ..., 1.00]
This is the result of floating-point error. 0.05 isn't really the rational number 1/20 to begin with, so any arithmetic involving it may differ from what you expect.
Dividing two integers, rather than starting with a floating-point value, helps mitigate the problem.
>>> [x/100 for x in range(0, 101, 5)]
[0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]
There are some numbers that cause imprecision in the floating-point system computers use; you're just seeing an example of that.
What I would do, if you want to keep using the while loop this way, is add another line that rounds alpha:
alpha = round(alpha, 2)
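Putting it together, a minimal sketch of the loop with that rounding applied each iteration:
alpha = 0
alphalist = []
while alpha <= 1:
    alphalist.append(alpha)
    alpha = round(alpha + 0.05, 2)  # round away the accumulated error
print(alphalist)
# [0, 0.05, 0.1, 0.15, ..., 0.95, 1.0]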

Matplotlib Quiver plot matching key label color with arrow color

Using matplotlib, python 3.6. I am trying to create some quiverkeys for a quiver plot but am having a hard time getting the label colors to match certain arrows. Below is a simplified version of the code that shows the issue. When I use the same color (0.3, 0.1, 0.2, 1.0) for a vector at (1,1) and as the 'labelcolor' of a quiverkey, I see 2 different colors.
q = plt.quiver([1, 2], [1, 1],
               [[49], [49]],
               [0],
               [[(0.6, 0.8, 0.5, 1.0)],
                [(0.3, 0.1, 0.2, 1.0)]],
               angles=[[45], [90]])
plt.quiverkey(q, .5, .5, 7, r'vector2', labelcolor=(0.3, 0.1, .2, 1),
              labelpos='S', coordinates='figure')
Presumably you meant to use the color argument of quiver to set the actual arrow colors:
import matplotlib.pyplot as plt

q = plt.quiver([1, 2], [1, 1], [5, 0], [5, 5],
               color=[(0.6, 0.8, 0.5, 1.0), (0.3, 0.1, 0.2, 1.0)])
plt.quiverkey(q, .5, .5, 7, r'vector2', labelcolor=(0.3, 0.1, .2, 1),
              labelpos='S', coordinates='figure')
plt.show()
Otherwise, the C argument is interpreted as the values to map to colors according to the default colormap. Since you only have two arrows, only the first two of the 8 numbers in the array given as the C argument are taken into account. But the colormap normalization uses all of those values, so it ranges between 0.1 and 1.0. The call
q = plt.quiver([1, 2], [1, 1], [5, 0], [5, 5],
               [(0.6, 0.8, 0.5, 1.0), (0.3, 0.1, 0.2, 1.0)])
is hence equivalent to
q = plt.quiver([1, 2], [1, 1], [5, 0], [5, 5],
               [0.6, 0.8], norm=plt.Normalize(vmin=0.1, vmax=1))
resulting in the first arrow's color being the value 0.6 in the viridis colormap normalized between 0.1 and 1.0, and the second arrow's being 0.8 in that colormap.
This becomes apparent if we add plt.colorbar(q, orientation="horizontal").
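A runnable sketch of that demonstration (reconstructed here; the original answer showed the resulting figure):
import matplotlib.pyplot as plt

# the first two C values color the two arrows; the colorbar shows the
# normalization ranging from 0.1 to 1.0
q = plt.quiver([1, 2], [1, 1], [5, 0], [5, 5],
               [0.6, 0.8], norm=plt.Normalize(vmin=0.1, vmax=1))
plt.colorbar(q, orientation="horizontal")
plt.show()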

How to efficiently deal with nested data in PySpark?

I ran into a situation here and found that collect_list in Spark is not efficient when the items being collected are themselves lists.
Basically, I am trying to calculate the mean of a nested list (the size of each list is guaranteed to be the same). When the data set grows to, say, 10 M rows, it can produce out-of-memory errors. Originally I thought it had something to do with the UDF (which calculates the mean), but actually I found that the aggregation part (the collect_list of lists) is the real problem.
What I am doing now is to divide the 10 M rows into multiple blocks (by 'user'), aggregate each block individually, and then union them at the end (see the sketch after the example below). Any better suggestions for efficiently dealing with nested data?
For example, the toy example is like this:
import pyspark.sql.functions as f

data = [('user1', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.5, 0.4], [0.0, 0.4, 0.3]),
        ('user1', 'place2', ['place1', 'place2', 'place3'], [0.7, 0.0, 0.4], [0.6, 0.0, 0.3]),
        ('user2', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.4, 0.3], [0.0, 0.3, 0.4]),
        ('user2', 'place3', ['place1', 'place2', 'place3'], [0.1, 0.2, 0.0], [0.3, 0.1, 0.0]),
        ('user3', 'place2', ['place1', 'place2', 'place3'], [0.3, 0.0, 0.4], [0.2, 0.0, 0.4]),
        ]
data_df = sparkApp.sparkSession.createDataFrame(data, ['user', 'place', 'places', 'data1', 'data2'])
data_agg = data_df.groupBy('user') \
    .agg(f.collect_list('place').alias('place_list'),
         f.first('places').alias('places'),
         f.collect_list('data1').alias('data1'),
         f.collect_list('data2').alias('data2'),
         )
import numpy as np
from pyspark.sql.types import ArrayType, DoubleType

def average_values(sim_vectors):
    # element-wise mean of a list of equal-length vectors
    if len(sim_vectors) == 1:
        return sim_vectors[0]
    mat = np.array(sim_vectors)
    mean_vector = np.mean(mat, axis=0)
    return np.round(mean_vector, 3).tolist()

avg_vectors_udf = f.udf(average_values, ArrayType(DoubleType()))
data_agg_ave = data_agg.withColumn('data1', avg_vectors_udf('data1')) \
                       .withColumn('data2', avg_vectors_udf('data2'))
The result would be:
+-----+----------------+--------------------+-----------------+----------------+
| user|      place_list|              places|            data1|           data2|
+-----+----------------+--------------------+-----------------+----------------+
|user1|[place1, place2]|[place1, place2, ...|[0.35, 0.25, 0.4]| [0.3, 0.2, 0.3]|
|user3|        [place2]|[place1, place2, ...|  [0.3, 0.0, 0.4]| [0.2, 0.0, 0.4]|
|user2|[place1, place3]|[place1, place2, ...|[0.05, 0.3, 0.15]|[0.15, 0.2, 0.2]|
+-----+----------------+--------------------+-----------------+----------------+
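For reference, a minimal sketch of the block-wise workaround described in the question, assuming the data_df defined above and a hypothetical block_size; this is an illustration of the described approach, not code from the original post:
from functools import reduce

def aggregate_block(df):
    # same aggregation as above, applied to one block of users
    return df.groupBy('user').agg(
        f.collect_list('place').alias('place_list'),
        f.first('places').alias('places'),
        f.collect_list('data1').alias('data1'),
        f.collect_list('data2').alias('data2'),
    )

block_size = 2  # hypothetical chunk size; tune to your memory budget
users = [r['user'] for r in data_df.select('user').distinct().collect()]
chunks = [users[i:i + block_size] for i in range(0, len(users), block_size)]

# aggregate each block of users separately, then union the partial results
parts = [aggregate_block(data_df.filter(f.col('user').isin(chunk)))
         for chunk in chunks]
data_agg = reduce(lambda a, b: a.union(b), parts)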

How to construct a numpy array with each element being the minimum of its possible values?

I want to construct a 1d numpy array a, and I know each a[i] has several possible values. Of course, the numbers of possible values for any two elements of a can differ. For each a[i], I want to set it to the minimum of all its possible values.
For example, I have two arrays:
idx = np.array([0, 1, 0, 2, 3, 3, 3])
val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
The array I want to construct is the following:
a = np.array([0.1, 0.5, 0.6, 0.1])
Does there exist a function in numpy that can do this?
Here's one approach -
import numpy as np

def groupby_minimum(idx, val):
    sidx = idx.argsort()
    sorted_idx = idx[sidx]
    # positions where a new group starts in the sorted index array
    cut_idx = np.r_[0, np.flatnonzero(sorted_idx[1:] != sorted_idx[:-1]) + 1]
    # minimum of each group of values
    return np.minimum.reduceat(val[sidx], cut_idx)
Sample run -
In [36]: idx = np.array([0, 1, 0, 2, 3, 3, 3])
...: val = np.array([0.1, 0.5, 0.2, 0.6, 0.2, 0.1, 0.3])
...:
In [37]: groupby_minimum(idx, val)
Out[37]: array([ 0.1, 0.5, 0.6, 0.1])
Here's another using pandas -
import pandas as pd
def pandas_groupby_minimum(idx, val):
    df = pd.DataFrame({'ID': idx, 'val': val})
    return df.groupby('ID')['val'].min().values
Sample run -
In [66]: pandas_groupby_minimum(idx, val)
Out[66]: array([ 0.1, 0.5, 0.6, 0.1])
You can also use binned_statistic:
from scipy.stats import binned_statistic

# bin edges: one bin per unique index value
idx_list = np.append(np.unique(idx), np.max(idx) + 1)
stats = binned_statistic(idx, val, statistic='min', bins=idx_list)
a = stats.statistic
I think in older scipy versions statistic='min' was not implemented, but you can use statistic=np.min instead. Intervals are half-open in binned_statistic, so this implementation is safe.
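For those older versions, the call would look like this (a minimal sketch, using the same idx, val and idx_list as above):
# statistic accepts any callable, so np.min works where 'min' is unavailable
stats = binned_statistic(idx, val, statistic=np.min, bins=idx_list)
a = stats.statistic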

Search in sublists and match common elements with another sublist

I am searching for an answer but didn't find anything about my problem.
x = [['100', 220, 0.5, 0.25, 0.1], ['105', 400, 0.12, 0.56, 0.9], ['600', 340, 0.4, 0.7, 0.45]]
y = [['1', '100', '105', '601'], ['2', '104', '105', '600'], ['3', '100', '105', '604']]
I want as a result:
z = [['1', '100', 0.5, 0.25, 0.1, '105', 0.12, 0.56, 0.9], ['2', '105', 0.12, 0.56, 0.9, '600', 0.4, 0.7, 0.45], ['3', '100', 0.5, 0.25, 0.1, '105', 0.12, 0.56, 0.9]]
I want to search in list y and match list x against it, producing a new list z that contains the common sublists.
This is just an example; normally lists x and y contain 10000 sublists.
I take from y, e.g., ['1', '100', '105', '601'] and search for '100', '105' and '601' in list x (example: ['100', 220, 0.5, 0.25, 0.1]). If I find a match, I add it to a new list z.
Can someone help me?
Answer edited in light of the comments.
You said in the comments:
search the second, third and fourth number in each y. and compare that with the number on place one in list x
and
then i would like to add (from list x) the numbers on place 1,3,4,5
Then try something like this:
x = [
    ['100', 220, 0.5, 0.25, 0.1],
    ['105', 400, 0.12, 0.56, 0.9],
    ['600', 340, 0.4, 0.7, 0.45]
]
y = [
    ['1', '100', '105', '601'],
    ['2', '104', '105', '600'],
    ['3', '100', '105', '604']
]
z = []
xx = dict((k, v) for k, _, *v in x)  # map first element to the values at places 3, 4, 5
for first, *yy in y:
    zz = [first]
    for n in yy:
        numbers = xx.get(n)
        if numbers:
            zz.append(n)
            zz.extend(numbers)
    z.append(zz)
print(z)
z should now be:
[['1', '100', 0.5, 0.25, 0.1, '105', 0.12, 0.56, 0.9],
['2', '105', 0.12, 0.56, 0.9, '600', 0.4, 0.7, 0.45],
['3', '100', 0.5, 0.25, 0.1, '105', 0.12, 0.56, 0.9]]
First, I convert x into a dictionary for easy lookup.
The iteration pattern used here was introduced with PEP 3132 and works like this:
>>> head, *tail = range(5)
>>> head
0
>>> tail
[1, 2, 3, 4]
