I have files of the below format in a text file which I am trying to read into a pandas dataframe.
895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|0.8224291972|0.8652525793|0.6829942860|0.5139162227|
As you can see, there are 10 digits after the decimal point in the input file.
df = pd.read_csv('mockup.txt',header=None,delimiter='|')
When I try to read it into a dataframe, I am not getting the last 4 digits:
df[5].head()
0 0.467798
1 0.258165
2 0.860384
3 0.803388
4 0.249820
Name: 5, dtype: float64
How can I get the complete precision as present in the input file? I have some matrix operations that need to be performed, so I cannot cast it as a string.
I figured out that I have to do something about dtype but I am not sure where I should use it.
It is only a display problem; see the docs:
# temporarily set display precision
with pd.option_context('display.precision', 10):
    print(df)
0 1 2 3 4 5 6 7 \
0 895 2015-4-23 19 10000 LA 0.4677978806 0.477346934 0.4089938425
8 9 10 11 12
0 0.8224291972 0.8652525793 0.682994286 0.5139162227 NaN
EDIT: (Thank you Mark Dickinson):
Pandas uses a dedicated decimal-to-binary converter that sacrifices perfect accuracy for the sake of speed. Passing float_precision='round_trip' to read_csv fixes this. See the documentation for more.
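For completeness, here is a minimal sketch that combines the two fixes (the exact parser plus the display option), assuming the same mockup.txt layout as above:
import pandas as pd

# float_precision='round_trip' makes read_csv use the slower but exact
# decimal-to-binary converter, so all 10 digits survive the parse.
df = pd.read_csv('mockup.txt', header=None, delimiter='|',
                 float_precision='round_trip')

# Temporarily widen the display precision so the full values are printed.
with pd.option_context('display.precision', 10):
    print(df[5].head())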
I have some time series data (in a Pandas dataframe), d(t):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
I would like to get a time-shifted version of the data, e.g. d(t-1):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
d(t-1) NaN 5 3 17 6 ... 23
But with a complication. Instead of simply time-shifting the data, I need to take the expected value based on a Poisson-distributed shift. So instead of d(t-i), I need E(d(t-j)), where j ~ Poisson(i).
Is there an efficient way to do this in Python?
Ideally, I would be able to dynamically generate the result with i as a parameter (that I can use in an optimization).
numpy's Poisson functions seem to be about generating draws from a Poisson rather than giving a PMF that could be used to calculate expected value. If I could generate a PMF, I could do something like:
for idx in len(d(t)):
Ed(t-i) = np.multiply(d(t)[:idx:-1], PMF(Poisson, i)).sum()
But I have no idea what actual functions to use for this, or if there is an easier way than iterating over indices. This approach also won't easily let me optimize over i.
You can use scipy.stats.poisson to get the PMF.
Here's a sample:
from scipy.stats import poisson
mu = 10
# Declare 'rv' to be a poisson random variable with λ=mu
rv = poisson(mu)
# poisson.pmf(k) = (e⁻ᵐᵘ * muᵏ) / k!
print(rv.pmf(4))
For more information about scipy.stats.poisson check this doc.
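Building on that, one way to get E(d(t-j)) for the whole series is a PMF-weighted sum of shifted copies of d. A rough sketch, assuming d is a pandas Series and that truncating the PMF at a high quantile is acceptable (poisson_shift is a hypothetical helper, not a library function):
import numpy as np
import pandas as pd
from scipy.stats import poisson

def poisson_shift(d, i, max_j=None):
    """Return E[d(t - j)] where j ~ Poisson(i), aligned with d's index."""
    if max_j is None:
        max_j = int(poisson.ppf(0.9999, i))   # ignore the negligible tail
    js = np.arange(max_j + 1)
    w = poisson.pmf(js, i)
    w = w / w.sum()                           # renormalise the truncated PMF
    # E[d(t - j)] = sum_j PMF(j; i) * d(t - j), i.e. a weighted sum of lags
    return sum(wj * d.shift(int(j)) for j, wj in zip(js, w))

# Hypothetical example data
d = pd.Series([5, 3, 17, 6, 23, 78], index=range(1, 7))
print(poisson_shift(d, i=1))
The leading entries come out as NaN, just like the plain d(t-1) shift in the question, because there is no data before t=1. Since i only enters through poisson.pmf, the function can be re-evaluated cheaply inside an optimization loop.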
How to sort rows when a column has standard electronics "suffixes"?
I see many questions here that are close, but most go the other way, like
Format numbers in thousands (K) in Excel
Anyone in electronics will immediately appreciate this problem. I have lots of parts lists, and am pasting values into Excel/GSheets. They are standard suffixes, but clearly not solely numbers. Here is a representative sample:
A B C D E F
RA367 0603 2.2 5% 1/10w MF-LF
RA770 0201 5.1k 1% 1/20w MF
RA775 0201 5.1k 1% 1/20w MF
RB600 0402 0 5% 1/16w MF-LF
RB604 0201 0 5% 1/20w MF
Only column C is needed to sort. The suffixes vary with the type of component, but are never mixed when sorting. In other words, you would never sort a column of 'mixed' components such as:
2.5k
1.0pF
10m
20uF
2 kOhms
[...]
The multiplier portion of the suffixes would always be the same: R, k, m, etc. are typically resistors; pF, F, and uF are capacitors; H, uH, etc. are inductors (Henries); and so on. So it is best if the "conversion" for sorting considers only the first character (u, p, k, m, R), which is always the multiplier, and if there is no multiplier character (as in the 0 in the first example) just sorts it as a number.
1.1 = 1.1
1.1 k = 1100
1.1k = 1100
1.1kOhms = 1100
1.1k Ohms = 1100
[...]
This is because lots of parts listings will omit the type of value (resistor, capacitor, etc.) and only give the base number (1, 2, 40, 1m, 2.2k, ...). This works because, again, values of different components are never mixed.
Here is a real-world snippet from a large distributor, from a downloaded CSV:
[...]
0 Ohms
100 kOhms
100 kOhms
100 kOhms
1 MOhms
1 MOhms
1 MOhms
100 Ohms
100 Ohms
100 Ohms
49.9 Ohms
[...]
Here you can see how default sorting on the first and second characters fails, and that there is even a space between the base and the multiplier. A solution should not have to worry about a finite list of component types; it should ignore the Ohms, R, H, F, etc. once the value has been determined by the base and optional multiplier.
These are the only two ways you will see components listed: with or without that space. I am wondering if there is a single, elegant function to apply to a range, or if multiple ones are needed because of the space introduced in the second example.
This may seem like an obscure problem, but large suppliers offer CSV downloads of their products, and when you need to order, and are combining lists in different formats, it becomes most cumbersome.
Something like this should work for resistors and capacitors, assuming m meaning milli- isn't used:
=sort(A:A,REGEXEXTRACT(A:A,"[0-9.]+")*1000^(search(iferror(regexextract(A:A,"[0-9.]+\s*([pukmKM])")," "),"pux km")-4),1)
(I know you wouldn't mix them, but this is just to demonstrate)
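Outside of Sheets, the same idea can be written as a small sort-key helper. A rough Python sketch (sort_key and the multiplier table are illustrative assumptions; like the formula above, it treats m/M as mega-, so it assumes milli- is not used):
import re

# Multiplier implied by the first character after the number.
MULTIPLIERS = {'p': 1e-12, 'u': 1e-6, 'k': 1e3, 'K': 1e3, 'm': 1e6, 'M': 1e6}

def sort_key(value):
    """Turn '5.1k', '1.1k Ohms', '49.9 Ohms', '0' into a comparable number."""
    m = re.match(r"\s*([0-9.]+)\s*([pukKmM]?)", value)
    return float(m.group(1)) * MULTIPLIERS.get(m.group(2), 1)

parts = ['0 Ohms', '100 kOhms', '1 MOhms', '100 Ohms', '49.9 Ohms', '5.1k']
print(sorted(parts, key=sort_key))
# ['0 Ohms', '49.9 Ohms', '100 Ohms', '5.1k', '100 kOhms', '1 MOhms']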
I am trying to find the capacity of a list with a function. One step involves subtracting 64 from the list's size in bytes (on my machine) and then dividing by 8 to get the capacity. What does this capacity value mean?
I tried reading the Python docs for the sys.getsizeof() method, but they still couldn't answer my doubts.
import sys
def disp(l1):
    print("Capacity", (sys.getsizeof(l1) - 64) // 8)  # What does this line mean, especially the //8 part?
    print("Length", len(l1))

mariya_list = []
mariya_list.append("Sugarisverysweetand it can be used for cooking sweets and also used in beverages ")
mariya_list.append("Choco")
mariya_list.append("bike")
disp(mariya_list)
print(mariya_list)
mariya_list.append("lemon")
print(mariya_list)
disp(mariya_list)
mariya_list.insert(1,"leomon Tea")
print(mariya_list)
disp(mariya_list)
Output:
Capacity 4
Length 1
['Choco']
['Choco', 'lemon']
Capacity 4
Length 2
['Choco', 'leomon Tea', 'lemon']
Capacity 4
Length 3
This is the output. Here I am unable to understand what Capacity 4 means. Why does it repeat the same value 4 even after subsequent additions of elements?
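For reference, here is a small sketch (assuming 64-bit CPython, where sys.getsizeof of a list is a fixed header plus 8 bytes per allocated pointer slot) of what that expression is estimating, namely how many slots the list has pre-allocated:
import sys

lst = []
base = sys.getsizeof([])   # size of an empty list, i.e. the fixed header

for item in range(10):
    allocated = (sys.getsizeof(lst) - base) // 8   # 8-byte pointers on 64-bit
    print("length", len(lst), "allocated slots", allocated)
    lst.append(item)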
I have a large data set with millions of records which looks something like this:
Movie Likes Comments Shares Views
A 100 10 20 30
A 102 11 22 35
A 104 12 25 45
A *103* 13 *24* 50
B 200 10 20 30
B 205 *9* 21 35
B *203* 12 29 42
B 210 13 *23* *39*
Likes, comments, etc. are rolling totals and they are supposed to increase. If there is a drop in any of these for a movie, then it is bad data and needs to be identified.
My initial thought was to group by movie and then sort within the group. I am using DataFrames in Spark 1.6 for processing, and it does not seem achievable because there is no sorting within grouped data in a DataFrame.
Building something for outlier detection could be another approach, but because of time constraints I have not explored it yet.
Is there any way I can achieve this?
Thanks !!
You can use the lag window function to bring the previous values into scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val windowSpec = Window.partitionBy('Movie).orderBy('maybesometemporalfield)

dataset.withColumn("lag_likes", lag('Likes, 1) over windowSpec)
  .withColumn("lag_comments", lag('Comments, 1) over windowSpec)
  .show
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-functions.html#lag
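The same pattern in PySpark, with the comparison added so that rows where a rolling total decreased are flagged. The sample data, column names and the row_num ordering column are assumptions, and SparkSession requires Spark 2+; on 1.6 you would create the DataFrame from a HiveContext instead:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample; row_num stands in for whatever temporal ordering you have
data = [("A", 1, 100), ("A", 2, 102), ("A", 3, 104), ("A", 4, 103),
        ("B", 1, 200), ("B", 2, 205), ("B", 3, 203), ("B", 4, 210)]
df = spark.createDataFrame(data, ["Movie", "row_num", "Likes"])

w = Window.partitionBy("Movie").orderBy("row_num")

# A row is bad if its rolling total is lower than the previous row's value
bad = (df
       .withColumn("lag_likes", F.lag("Likes", 1).over(w))
       .filter(F.col("Likes") < F.col("lag_likes")))
bad.show()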
Another approach would be to assign a row number (if there isn't one already), lag that column, then join the row to its previous row, to allow you to do the comparison.
HTH