Converting a pandas DataFrame into a dictionary? - python-3.x

I have a pandas DataFrame news_dataset where the column ID is an article ID and the column Content is the article content (large text). Given as:
ID Content
17283 WASHINGTON — Congressional Republicans have...
17284 After the bullet shells get counted, the blood...
17285 When Walt Disney’s “Bambi” opened in 1942, cri...
17286 Death may be the great equalizer, but it isn’t...
17287 SEOUL, South Korea — North Korea’s leader, ...
Now all I want is to convert this DataFrame into a dictionary where ID is the key and Content is the value. My first attempt was something like:
dd = {}
for i in news_dataset['ID']:
    for j in news_dataset['Content']:
        dd[j] = i
This piece of code is pathetic and takes far too long (> 4 minutes) to run. After checking Stack Overflow for better approaches, what I finally did is:
id_array = []
content_array = []
for id_num in news_dataset['ID']:
    id_array.append(id_num)
for content in news_dataset['Content']:
    content_array.append(content)
news_dict = dict(zip(id_array, content_array))
This code takes nearly 15 seconds to execute.
What I want to ask is:
i) What is wrong with the first code, and why does it take so much time to process?
ii) Is using a for loop inside another for loop the wrong way to iterate when it comes to large text data?
iii) What is the right way to create a dictionary using a loop in a single query?

I think loops in pandas should generally be avoided whenever a non-loop, vectorized alternative exists. To answer (i): the first snippet is quadratic, because for every ID it iterates over every Content value (len(ID) * len(Content) assignments in total), and it also builds the mapping backwards, since dd[j] = i keys by content rather than by ID.
You can create an index from column ID and call Series.to_dict:
news_dict = news_dataset.set_index('ID')['Content'].to_dict()
Or use zip:
news_dict = dict(zip(news_dataset['ID'], news_dataset['Content']))
# alternative
# news_dict = dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
Performance:
import numpy as np
import pandas as pd

np.random.seed(1425)
# 1000-row sample
news_dataset = pd.DataFrame({'ID': np.arange(1000),
                             'Content': np.random.choice(list('abcdef'), size=1000)})
#print (news_dataset)
In [98]: %%timeit
    ...: dd = {}
    ...: for i in news_dataset['ID']:
    ...:     for j in news_dataset['Content']:
    ...:         dd[j] = i
    ...:
61.7 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [99]: %%timeit
    ...: id_array = []
    ...: content_array = []
    ...: for id_num in news_dataset['ID']:
    ...:     id_array.append(id_num)
    ...: for content in news_dataset['Content']:
    ...:     content_array.append(content)
    ...: news_dict = dict(zip(id_array, content_array))
    ...:
251 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [100]: %%timeit
...: news_dict=news_dataset.set_index('ID')['Content'].to_dict()
...:
584 µs ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [101]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'],news_dataset['Content']))
...:
106 µs ± 3.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [102]: %%timeit
...: news_dict=dict(zip(news_dataset['ID'].values, news_dataset['Content'].values))
...:
122 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Related

How can I reduce Execution time of Python code

In this code I'm calculating the difference between the square of the sum of n numbers and the sum of the squares of n numbers.
Example: n=3, (1+2+3)^2 - (1^2+2^2+3^2) = 36 - 14 = 22
def sum_square_diff(num):
    sum1 = 0
    sum2 = 0
    for i in range(1, num+1):
        sum1 += i**2
        sum2 += i
    sum2 = sum2**2
    diff = sum2 - sum1
    return diff

if __name__ == "__main__":
    n = int(input())
    for i in range(n):
        num = int(input())
        result = sum_square_diff(num)
        print(result)
This code is correct, but it takes too long to execute.
In the first place, the formula that you want to compute has a closed-form representation. There is no need for any loops:
n*n*(n+1)*(n+1)/4 - n*(n+1)*(2*n+1)/6
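As a quick sketch (the function name is mine), the closed form translates directly to a constant-time function; integer division is safe here because both terms are always integral:
def sum_square_diff_closed(n):
    # square of the sum minus sum of the squares, computed in O(1)
    return n * n * (n + 1) * (n + 1) // 4 - n * (n + 1) * (2 * n + 1) // 6

print(sum_square_diff_closed(3))  # 22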
But if you insist, you can get a >3x speedup by using NumPy instead of raw Python:
import numpy as np

def sum_square_diff1(num):
    x = np.arange(1, num+1)
    return x.sum()**2 - (x**2).sum()
In [7]: %timeit sum_square_diff(100)
19.6 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit sum_square_diff1(100)
5.61 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cupy indexing is slow

I am trying to perform operations on a large cupy array of size 16000. I find mathematical operations such as addition to be quite fast, but indexing using boolean masks to be relatively slow. For example, the following code:
import cupy as cp
arr = cp.random.normal(0, 1, 16000)
%timeit arr * 5
%timeit arr > 0.4
%timeit arr[arr > 0.4] = 0
gives me the output:
28 µs ± 950 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26.5 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
104 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Any reason why the final indexing is at least twice as slow? I assumed that multiplication should be slower than setting array elements.
Update: This is not true for numpy indexing. Changing the cupy array to numpy, I get:
6.71 µs ± 373 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.42 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.39 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
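For reference, a sketch of the NumPy side of this comparison, assuming the same three operations were timed on a plain NumPy array:
import numpy as np
arr = np.random.normal(0, 1, 16000)
%timeit arr * 5
%timeit arr > 0.4
%timeit arr[arr > 0.4] = 0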
In the 3rd case, cupy is composing the result via a sequence of operations: cupy_greater, cupy_copy, inclusive_scan_kernel, inclusive_scan_kernel, add_scan_blocked_sum_kernel, CUDA memcpy DtoH (perhaps to provide the number of elements that need to be set to zero), CUDA memset (perhaps to set an array to zero), and finally cupy_scatter_update_mask (to scatter the zeros to their correct locations, perhaps).
This is a considerably more complex sequence than arr*5, which seems to run a single cupy_multiply under the hood. You can probably do better with a cupy user-defined kernel:
import cupy as cp

clamp_generic = cp.ElementwiseKernel(
    'T x, T c',
    'T y',
    'y = (y > x) ? c : y',
    'clamp_generic')

arr = cp.random.normal(0, 1, 16000)
clamp_generic(0.4, 0, arr)
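As a quick sanity check (my own sketch, assuming the in-place call above), the kernel should reproduce the boolean-mask assignment from the question:
arr = cp.random.normal(0, 1, 16000)
ref = arr.copy()
ref[ref > 0.4] = 0           # the masked assignment from the question
clamp_generic(0.4, 0, arr)   # kernel version, zeroes the same elements in place
assert bool(cp.allclose(arr, ref))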

Parsing a pandas DataFrame column of dictionary-like strings into new columns for each dictionary key

In Python 3 with pandas, imagine there is a DataFrame df with a column x:
df = pd.DataFrame(
    [
        {'x': '{"a":"1","b":"2","c":"3"}'},
        {'x': '{"a":"2","b":"3","c":"4"}'}
    ]
)
The column x holds data that looks like a dictionary. How can I parse it into a new DataFrame, so that each key becomes a new column?
The desired output DataFrame looks like:
x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4
None of the solutions in this post seem to work in this case:
parsing a dictionary in a pandas dataframe cell into new row cells (new columns)
df1=pd.DataFrame(df.loc[:,'x'].values.tolist())
print(df1)
This results in the same DataFrame; it didn't separate the column into one column per key.
Any 2 cents?
Thanks!
You can also map json.loads and convert to a DataFrame like:
import json
df1 = pd.DataFrame(df['x'].map(json.loads).tolist(),index=df.index)
print(df1)
a b c
0 1 2 3
1 2 3 4
This tests faster than evaluating via ast; below is the benchmark for 40K rows:
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
import json
df1 = pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
#256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
#1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
#1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Because the values are string representations of dictionaries, it is necessary to convert them to actual dictionaries first:
import ast, json
#performance for repeated sample data, in real data should be different
m = pd.concat([df]*20000,ignore_index=True)
In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#anky_91 solution
In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(df1)
a b c
0 1 2 3
1 2 3 4
Last, to append to the original:
df = df.join(df1)
print(df)
x a b c
0 {"a":"1","b":"2","c":"3"} 1 2 3
1 {"a":"2","b":"3","c":"4"} 2 3 4

Optimizing pandas operation for column intersection

I have a DataFrame with 2 columns (event and events). The event column contains a particular event ID, and the events column contains a list of event IDs.
Example:
df
event events
'a' ['x','y','abc','a']
'b' ['x','y','c','a']
'c' ['a','c']
'd' ['b']
I want to create another column (eventoccured) indicating whether event is in events:
eventoccured
1
0
1
0
I am currently using
df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
which gives the desired result but is slow. I want a faster solution for this.
Thanks
One idea is to use a list comprehension:
#40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [217]: %timeit df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
1.15 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [218]: %timeit df['eventoccured1'] = [x in y for x, y in zip(df['event'], df['events'])]
15.2 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

How to perform sum pooling in PyTorch

How do I perform sum pooling in PyTorch? Specifically, if we have input (N, C, W_in, H_in) and want output (N, C, W_out, H_out) using a particular kernel_size and stride, just like nn.MaxPool2d?
You could use torch.nn.AvgPool1d (or torch.nn.AvgPool2d, torch.nn.AvgPool3d), which perform mean pooling, proportional to sum pooling. If you really want the summed values, you can multiply the averaged output by the pooling surface.
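A minimal sketch of that suggestion, using the functional API and multiplying by the kernel area:
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 8, 8)                          # (N, C, H_in, W_in)
kh, kw = 2, 2
sum_pooled = F.avg_pool2d(x, (kh, kw)) * (kh * kw)  # mean * window area == sum
print(sum_pooled.shape)                             # torch.Size([1, 3, 4, 4])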
In the docs at https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html#torch.nn.AvgPool2d, find divisor_override.
Set divisor_override=1 and you'll get a sum pool:
import torch

input = torch.tensor([[[1, 2, 3], [3, 2, 1], [3, 4, 5]]])
sumpool = torch.nn.AvgPool2d(2, stride=1, divisor_override=1)
sumpool(input)
you'll get
tensor([[[ 8,  8],
         [12, 12]]])
To expand on benjaminplanche's answer:
I need sum pooling as well and it doesn't seem to directly exist, but it is equivalent to running a conv2d with a weights parameter made of ones. I thought it would be faster to run AvgPool2d and multiply by the kernel size product. Turns out, not exactly.
Bottom line up front:
Use torch.nn.functional.avg_pool2d and its related functions and multiply by the kernel size.
Testing in Jupyter I find:
(Overhead)
%%timeit
x = torch.rand([1,1,1000,1000])
>>> 3.49 ms ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
_=F.avg_pool2d(torch.rand([1,1,1000,1000]), [10,10])*10*10
>>> 4.99 ms ± 74.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 1.50 ms ± 79.0 µs) (I found the *10*10 only adds around 20 µs to the graph)
avePool = nn.AvgPool2d([10, 10], 1, 0)
%%timeit
_=avePool(torch.rand([1,1,1000,1000]))*10*10
>>> 80.9 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(So 77.4 ms ± 1.58 ms)
y = torch.ones([1,1,10,10])
%%timeit
_=F.conv2d(torch.rand([1,1,1000,1000]), y)
>>> 14.4 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 10.9 ms ± 426 µs)
sumPool = nn.Conv2d(1, 1, 10, 1, 0, 1, 1, False)
sumPool.weight = torch.nn.Parameter(y)
%%timeit
_=sumPool(torch.rand([1,1,1000,1000]))
>>> 7.24 ms ± 63.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(So 3.75 ms ± 68.3 µs)
And as a sanity check.
abs_err = torch.max(torch.abs(avePool(x)*10*10 - sumPool(x)))
magnitude = torch.max(torch.max(avePool(x)*10*10, torch.max(sumPool(x))))
relative_err = abs_err/magnitude
abs_err.item(), magnitude.item(), relative_err.item()
>>> (3.814697265625e-06, 62.89910125732422, 6.064788493631568e-08)
That's probably a reasonable rounding-related error.
I do not know why the functional version is faster than making a dedicated kernel, but it looks like, if you want to make a dedicated kernel, you should prefer the Conv2d version and make the weights untrainable, either with sumPool.weight.requires_grad = False or by wrapping the creation of the kernel parameters in with torch.no_grad():. These results may change with kernel size, so test for your own application if you need to speed up this part. Let me know if I missed something...
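For instance, a sketch of that frozen-kernel setup (variable names are mine):
import torch
import torch.nn as nn

# A fixed 10x10 single-channel sum-pooling layer: Conv2d with all-ones, frozen weights
sum_pool = nn.Conv2d(1, 1, kernel_size=10, stride=1, padding=0, bias=False)
with torch.no_grad():
    sum_pool.weight.fill_(1.0)
sum_pool.weight.requires_grad = False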
