numpy broadcasting on pandas dataframe gives memory error - python-3.x

I have two data frames. Dataframe A is of shape (1269345,5) and dataframe B is of shape (18583586, 3).
Dataframe A looks like:
Name.   gender  start_coordinate  end_coordinate  ID
Peter   M       30                150             1
Hugo    M       4500              6000            2
Jennie  F       300               700             3
Dataframe B looks like:
ID_sim.  position  string
1        89        aa
4        568       bb
5        938437    cc
I want to extract rows and build two data frames in which the position column in dataframe B falls within the interval specified by the start_coordinate and end_coordinate columns in dataframe A. So the resulting dataframes would look like:
###Final dataframe A
Name.   gender  start_coordinate  end_coordinate  ID
Peter   M       30                150             1
Jennie  F       300               700             3
###Final dataframe B
ID_sim.  position  string
1        89        aa
4        568       bb
I tried using numpy broadcasting like this:
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]    # column vector of positions
dfB[((p >= s) & (p <= e)).any(1)]          # compares every position against every interval at once
But this gave me the following error:
MemoryError: Unable to allocate 2.72 TiB for an array with shape (18583586, 160711) and data type bool
I think it's because the intermediate numpy array becomes huge when I broadcast. How can I achieve my task without numpy broadcasting, considering that my dataframes are very large? Insights will be appreciated.

This is likely due to your system's overcommit handling mode.
By default it is 0, which means:
Heuristic overcommit handling. Obvious overcommits of address space
are refused. Used for a typical system. It ensures a seriously wild
allocation fails while allowing overcommit to reduce swap usage. The
root is allowed to allocate slightly more memory in this mode. This is
the default.
Run the command below to check your current overcommit mode:
$ cat /proc/sys/vm/overcommit_memory
0
In this case, you're allocating
> 156816 * 36 * 53806 / 1024.0**3
282.8939827680588
~282 GB, and the kernel is saying: well, obviously there's no way I'm going to be able to commit that many physical pages to this, so it refuses the allocation.
If (as root) you run:
$ echo 1 > /proc/sys/vm/overcommit_memory
This will enable the "always overcommit" mode, and you'll find that indeed the system will allow you to make the allocation no matter how large it is (within 64-bit memory addressing at least).
I tested this myself on a machine with 32 GB of RAM. With overcommit mode 0 I also got a MemoryError, but after changing it to 1 it works:
>>> import numpy as np
>>> a = np.zeros((156816, 36, 53806), dtype='uint8')
>>> a.nbytes
303755101056
You can then go ahead and write to any location within the array, and the system will only allocate physical pages when you explicitly write to that page. So you can use this, with care, for sparse arrays.
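If the goal is instead to avoid the terabyte-sized intermediate array altogether, the same interval test can be evaluated over dfB in chunks, so that only a small boolean block exists at any one time. A minimal sketch, assuming the column names from the question (the chunk size is an arbitrary choice):
import numpy as np

def filter_by_intervals(dfA, dfB, chunk_size=100_000):
    """Keep dfB rows whose position falls inside any [start, end] interval of dfA."""
    s = dfA['start_coordinate'].to_numpy()
    e = dfA['end_coordinate'].to_numpy()
    pos = dfB['position'].to_numpy()
    keep = np.zeros(len(pos), dtype=bool)
    for i in range(0, len(pos), chunk_size):
        p = pos[i:i + chunk_size, None]                        # (chunk, 1); broadcasts against s and e
        keep[i:i + chunk_size] = ((p >= s) & (p <= e)).any(axis=1)
    # the matching dfA rows can be obtained the same way with the roles swapped
    return dfB[keep]
The peak temporary allocation is then chunk_size * len(dfA) booleans rather than len(dfB) * len(dfA).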

Related

Avoid number truncation in pandas rows [duplicate]

I have data in the below format in a text file which I am trying to read into a pandas dataframe.
895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|0.8224291972|0.8652525793|0.6829942860|0.5139162227|
As you can see, there are 10 digits after the decimal point in the input file.
df = pd.read_csv('mockup.txt',header=None,delimiter='|')
When I try to read it into a dataframe, I am not getting the last 4 digits:
df[5].head()
0 0.467798
1 0.258165
2 0.860384
3 0.803388
4 0.249820
Name: 5, dtype: float64
How can I get the complete precision as present in the input file? I have some matrix operations that need to be performed, so I cannot cast it as a string.
I figured out that I have to do something about dtype but I am not sure where I should use it.
It is only a display problem, see the docs:
# temporarily set display precision
with pd.option_context('display.precision', 10):
    print(df)
0 1 2 3 4 5 6 7 \
0 895 2015-4-23 19 10000 LA 0.4677978806 0.477346934 0.4089938425
8 9 10 11 12
0 0.8224291972 0.8652525793 0.682994286 0.5139162227 NaN
EDIT: (Thank you Mark Dickinson):
Pandas uses a dedicated decimal-to-binary converter that sacrifices perfect accuracy for the sake of speed. Passing float_precision='round_trip' to read_csv fixes this. See the documentation for more.
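Put together, a minimal sketch of both fixes, assuming the same mockup.txt layout as in the question:
import pandas as pd

# parse with the slower but exact decimal-to-binary converter
df = pd.read_csv('mockup.txt', header=None, delimiter='|',
                 float_precision='round_trip')

# the stored floats now round-trip; widen the display to see every digit
with pd.option_context('display.precision', 10):
    print(df[5].head())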

Get Poisson expectation of preceding values of a time series in Python

I have some time series data (in a Pandas dataframe), d(t):
time    1    2    3    4   ...   99  100
d(t)    5    3   17    6   ...   23   78
I would like to get a time-shifted version of the data, e.g. d(t-1):
time      1    2    3    4   ...   99  100
d(t)      5    3   17    6   ...   23   78
d(t-1)  NaN    5    3   17    6  ...   23
But with a complication. Instead of simply time-shifting the data, I need to take the expected value based on a Poisson-distributed shift. So instead of d(t-i), I need E(d(t-j)), where j ~ Poisson(i).
Is there an efficient way to do this in Python?
Ideally, I would be able to dynamically generate the result with i as a parameter (that I can use in an optimization).
numpy's Poisson functions seem to be about generating draws from a Poisson rather than giving a PMF that could be used to calculate expected value. If I could generate a PMF, I could do something like:
for idx in len(d(t)):
Ed(t-i) = np.multiply(d(t)[:idx:-1], PMF(Poisson, i)).sum()
But I have no idea what actual functions to use for this, or if there is an easier way than iterating over indices. This approach also won't easily let me optimize over i.
You can use scipy.stats.poisson to get the PMF.
Here's a sample:
from scipy.stats import poisson
mu = 10
# Declare 'rv' to be a poisson random variable with λ=mu
rv = poisson(mu)
# poisson.pmf(k) = (e⁻ᵐᵘ * muᵏ) / k!
print(rv.pmf(4))
For more information about scipy.stats.poisson check this doc.
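Building on that, here is one rough way to turn the PMF into the expectation the question asks for, assuming d is a pandas Series indexed by time and that the Poisson tail can be truncated once its remaining mass is negligible (both are assumptions, not part of the original answer):
import numpy as np
import pandas as pd
from scipy.stats import poisson

def poisson_lag_expectation(d, i):
    """Approximate E[d(t - j)] with j ~ Poisson(i) as a PMF-weighted sum of lags."""
    max_shift = int(poisson.ppf(0.9999, i))            # cut off the negligible tail
    weights = poisson.pmf(np.arange(max_shift + 1), i)
    weights /= weights.sum()                           # renormalise after truncation
    # d.shift(j) introduces NaNs at the start, and they propagate into the result
    return sum(w * d.shift(j) for j, w in enumerate(weights))

d = pd.Series([5, 3, 17, 6, 23, 78], index=range(1, 7))
print(poisson_lag_expectation(d, i=1))
Because i only enters through the PMF weights, the same function can be re-evaluated cheaply inside an optimisation over i.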

How to sort electronics values with suffixes (k, m, g, uF, H, etc.)?

How to sort rows when a column has standard electronics "suffixes"?
I see many questions here that are close, but most go the other way, like
Format numbers in thousands (K) in Excel
Anyone in electronics will immediately appreciate this problem. I have lots of parts lists and am pasting values into Excel/GSheets. The values use standard suffixes, but are clearly not purely numeric. Here is a representative sample:
A      B     C     D
RA367  0603  2.2   5% 1/10w MF-LF
RA770  0201  5.1k  1% 1/20w MF
RA775  0201  5.1k  1% 1/20w MF
RB600  0402  0     5% 1/16w MF-LF
RB604  0201  0     5% 1/20w MF
Only column C needs to be sorted. The suffixes vary with the type of component, but are not mixed when sorting. In other words, you would never sort a column of 'mixed' components such as:
2.5k
1.0pF
10m
20uF
2 kOhms
[...]
The multiplier portion of the suffixes would always be the same: R, k, m are typically resistors; pF, F, and uF are capacitors; H, uH, etc. are inductors (for Henries), and so on. So it is best if the "conversion" for sorting considers only the first character (u, p, k, m, R), which is always the multiplier, and if there is no multiplier character (as with the 0 in the first example), just sort as a number.
1.1 = 1.1
1.1 k = 1100
1.1k = 1100
1.1kOhms = 1100
1.1k Ohms = 1100
[...]
This is because lots of parts listings will omit the type of value (resistor, capacitor, etc.) and only give the base number (1, 2, 40, 1m, 2.2k, ...); again, this is because values of different components are never mixed.
Here is a real-world snippet from a large distributor, from a downloaded CSV:
[...]
0 Ohms
100 kOhms
100 kOhms
100 kOhms
1 MOhms
1 MOhms
1 MOhms
100 Ohms
100 Ohms
100 Ohms
49.9 Ohms
[...]
Here you can see how default sorting on the first and second characters fails, and that there can even be a space between the base and the multiplier. A solution should not have to worry about a finite list of component types; it should ignore the Ohms, R, H, F, etc. once the value is determined by the base and optional multiplier.
These are the only two ways you will see components listed: with or without that space. I am wondering whether there is a single, elegant function to apply to a range, or whether multiple ones are needed because of the space introduced in the second example.
This may seem like an obscure problem, but large suppliers offer CSV downloads of their products, and when you need to order, and are combining lists in different formats, it becomes most cumbersome.
Something like this should work for resistors and capacitors, assuming m meaning milli- isn't used:
=sort(A:A,REGEXEXTRACT(A:A,"[0-9.]+")*1000^(search(iferror(regexextract(A:A,"[0-9.]+\s*([pukmKM])")," "),"pux km")-4),1)
(I know you wouldn't mix them, but this is just to demonstrate)
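For anyone doing the same cleanup in Python/pandas rather than in a sheet, here is a rough sketch of the same idea: only the first letter after the number is treated as the multiplier, anything after it is ignored, and m is read as mega rather than milli, matching the assumption above.
import re

# scale implied by the first letter after the number
MULTIPLIER = {'p': 1e-12, 'n': 1e-9, 'u': 1e-6,
              'k': 1e3, 'K': 1e3, 'm': 1e6, 'M': 1e6, 'g': 1e9, 'G': 1e9}

def component_value(text):
    """Turn '5.1k', '1.1k Ohms', '100 kOhms', '49.9 Ohms' or '0' into a number."""
    m = re.match(r'\s*([0-9]+\.?[0-9]*)\s*([A-Za-z]?)', str(text))
    if not m:
        return float('inf')    # push unparseable cells to the end
    base, suffix = m.groups()
    return float(base) * MULTIPLIER.get(suffix, 1)

values = ['0 Ohms', '100 kOhms', '1 MOhms', '100 Ohms', '49.9 Ohms', '5.1k']
print(sorted(values, key=component_value))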

What is the need to divide a list's sys.getsizeof() by 8 or 4 (depending upon the machine) after subtracting 64 or 36?

I am trying to find the capacity of a list with a function. One step involves subtracting 64 (on my machine) from the list's size, and the result also has to be divided by 8 to get the capacity. What does this capacity value mean?
I tried reading the Python docs for the sys.getsizeof() method, but they still couldn't answer my doubts.
import sys

def disp(l1):
    print("Capacity", (sys.getsizeof(l1) - 64) // 8)  # What does this line mean, especially the //8 part?
    print("Length", len(l1))

mariya_list = []
mariya_list.append("Sugarisverysweetand it can be used for cooking sweets and also used in beverages ")
mariya_list.append("Choco")
mariya_list.append("bike")
disp(mariya_list)
print(mariya_list)
mariya_list.append("lemon")
print(mariya_list)
disp(mariya_list)
mariya_list.insert(1, "leomon Tea")
print(mariya_list)
disp(mariya_list)
Output:
Capacity 4
Length 1
['Choco']
['Choco', 'lemon']
Capacity 4
Length 2
['Choco', 'leomon Tea', 'lemon']
Capacity 4
Length 3
This is the output. I am unable to understand what Capacity 4 means here. Why does it keep showing the same value, 4, even after the subsequent addition of elements?
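For what it's worth, that formula is estimating how many pointer slots CPython has currently allocated for the list: sys.getsizeof reports the fixed list-object header (64 bytes on the asker's 64-bit build; the 36 and 4 in the title are the 32-bit counterparts) plus 8 bytes per allocated slot, so subtracting the header and dividing by 8 recovers the slot count. Lists over-allocate in steps, which is why the capacity stays at 4 until a fifth slot is needed. A small sketch that shows length and estimated capacity diverging (the header size is read from an empty list rather than hard-coded, since it varies across Python versions):
import sys

SLOT = 8                      # bytes per slot on a 64-bit build (4 on 32-bit)
HEADER = sys.getsizeof([])    # fixed list-object overhead; 64 on the asker's machine

def capacity(lst):
    """Estimated number of slots CPython has allocated for lst, not its length."""
    return (sys.getsizeof(lst) - HEADER) // SLOT

lst = []
for n in range(10):
    print("length", len(lst), "capacity", capacity(lst))
    lst.append(n)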

identifying decrease in values in spark (outliers)

I have a large data set with millions of records which looks something like this:
Movie  Likes  Comments  Shares  Views
A        100        10      20     30
A        102        11      22     35
A        104        12      25     45
A      *103*        13    *24*     50
B        200        10      20     30
B        205       *9*      21     35
B      *203*        12      29     42
B        210        13    *23*   *39*
Likes, comments, etc. are rolling totals and they are supposed to increase. If there is a drop in any of these for a movie, then it is bad data that needs to be identified.
My initial thought was to group by movie and then sort within each group. I am using dataframes in Spark 1.6 for processing, and this does not seem achievable, as there is no sorting within the grouped data in a dataframe.
Building something for outlier detection could be another approach, but because of time constraints I have not explored it yet.
Is there any way I can achieve this?
Thanks!!
You can use the lag window function to bring the previous values into scope:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy('Movie).orderBy('maybesometemporalfield)
dataset.withColumn("lag_likes", lag('Likes, 1) over windowSpec)
.withColumn("lag_comments", lag('Comments, 1) over windowSpec)
.show
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-functions.html#lag
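For completeness, here is a rough PySpark version of the same lag idea, adding a flag for rows where a rolling total went down (the ordering column is a placeholder, just as in the Scala snippet):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# order within each movie by whatever sequence/time column the data has
w = Window.partitionBy('Movie').orderBy('maybesometemporalfield')

flagged = (dataset
           .withColumn('lag_likes', F.lag('Likes', 1).over(w))
           .withColumn('lag_comments', F.lag('Comments', 1).over(w))
           # rolling totals should never decrease; flag the rows where they do
           .withColumn('bad_row',
                       (F.col('Likes') < F.col('lag_likes')) |
                       (F.col('Comments') < F.col('lag_comments'))))

flagged.filter(F.col('bad_row')).show()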
Another approach would be to assign a row number (if there isn't one already), lag that column, then join each row to its previous row, to allow you to do the comparison.
HTH
