Avoid number truncation in pandas rows [duplicate] - python-3.x

I have data in the format below in a text file, which I am trying to read into a pandas DataFrame.
895|2015-4-23|19|10000|LA|0.4677978806|0.4773469340|0.4089938425|0.8224291972|0.8652525793|0.6829942860|0.5139162227|
As you can see, there are 10 digits after the decimal point in the input file.
df = pd.read_csv('mockup.txt',header=None,delimiter='|')
When I try to read it into a dataframe, I am not getting the last 4 digits:
df[5].head()
0 0.467798
1 0.258165
2 0.860384
3 0.803388
4 0.249820
Name: 5, dtype: float64
How can I get the complete precision as present in the input file? I have some matrix operations that need to be performed, so I cannot cast it as a string.
I figured out that I have to do something about dtype but I am not sure where I should use it.

It is only a display problem; see the docs:
# temporarily set display precision
with pd.option_context('display.precision', 10):
    print(df)
0 1 2 3 4 5 6 7 \
0 895 2015-4-23 19 10000 LA 0.4677978806 0.477346934 0.4089938425
8 9 10 11 12
0 0.8224291972 0.8652525793 0.682994286 0.5139162227 NaN
EDIT: (Thank you Mark Dickinson):
Pandas uses a dedicated decimal-to-binary converter that sacrifices perfect accuracy for the sake of speed. Passing float_precision='round_trip' to read_csv fixes this. See the documentation for more.
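For example, a minimal sketch based on the read_csv call from the question (same file name and delimiter assumed):
import pandas as pd

# 'round_trip' uses the slower but exact decimal-to-binary converter
df = pd.read_csv('mockup.txt', header=None, delimiter='|',
                 float_precision='round_trip')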

Related

How to Solve "IndexError: single positional indexer is out-of-bounds" With DataFrames of Varying Shapes

I have checked the other posts about IndexError: single positional indexer is out-of-bounds but could not find solutions that explain my problem.
I have a DataFrame that looks like:
Date Balance
0 2020-01-07 168.51
1 2020-02-07 179.46
2 2020-03-07 212.15
3 2020-04-07 221.68
4 2020-05-07 292.23
5 2020-06-07 321.61
6 2020-07-07 332.27
7 2020-08-07 351.63
8 2020-09-07 372.26
My problem is I want to run a script that takes in a DataFrame like the one above and returns the balance of each row using something like df.iloc[2][1]. However, the DataFrame can be anywhere from 1 to 12 rows in length. So if I call df.iloc[8][1] and the DataFrame is less than 9 rows in length then I get the IndexError.
If I want to return the balance for every row using df.iloc[]... how can I handle the index errors without using 12 different try and except statements?
Also, the problem is simplified here and the DataFrame can get rather large, so I want to stay away from looping if possible.
Thanks!!
My solution was to loop over the length of the DataFrame and append each balance into a list, then pad the list to a length of 12 with NaN values.
import numpy as np

num_months = len(df)
N = 12
list_balance_months = []
for month in range(num_months):
    list_balance_months.append(df.iloc[month][1])
list_balance_months += [np.nan] * (N - len(list_balance_months))
balance_month_1, balance_month_2, balance_month_3, balance_month_4, balance_month_5, balance_month_6, balance_month_7, balance_month_8, balance_month_9, balance_month_10, balance_month_11, balance_month_12 = list_balance_months
With this solution, if balance_month_11 is requested and the DataFrame only has 4 months of data, it gives np.nan (NaN) instead of an IndexError.
Please let me know if you can think of a simpler solution!
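A possibly simpler sketch, assuming the default RangeIndex shown in the question: reindex the Balance column to a fixed 12-row index so that missing months become NaN automatically.
# rows beyond the frame's length become NaN instead of raising IndexError
balances = df['Balance'].reindex(range(12))
balance_month_11 = balances[10]  # NaN if the frame has fewer than 11 rows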

Dask apply with custom function

I am experimenting with Dask, but I encountered a problem while using apply after grouping.
I have a Dask DataFrame with a large number of rows. Let's consider, for example, the following:
N=10000
df = pd.DataFrame({'col_1':np.random.random(N), 'col_2': np.random.random(N) })
ddf = dd.from_pandas(df, npartitions=8)
I want to bin the values of col_1, so I follow the solution from here:
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1',bins,labels)
where
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
and this works as I expect it to.
Now I want to take the median value in each bin (taken from here)
median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()
Having 10 bins, I expect median to have 10 rows, but it actually has 80. The dataframe has 8 partitions so I guess that somehow the apply is working on each one individually.
However, if I want the mean and use mean()
median = ddf2.groupby('bin_num')['col_1'].mean().compute()
it works and the output has 10 rows.
The question is then: what am I doing wrong that is preventing apply from operating as mean?
Maybe this warning is the key (Dask docs: SeriesGroupBy.apply):
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
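For reference, the example the Dask docs give for Aggregation is a custom mean built from per-partition pieces; an exact median cannot be decomposed this way, which is why apply behaves differently. A sketch along those lines (with the categorical bin_num you may still need the cast described in the next answer):
import dask.dataframe as dd

# chunk runs on each partition-group, agg combines the partition results,
# finalize computes the final value per group
custom_mean = dd.Aggregation(
    name='custom_mean',
    chunk=lambda s: (s.count(), s.sum()),
    agg=lambda count, total: (count.sum(), total.sum()),
    finalize=lambda count, total: total / count,
)
ddf2.groupby('bin_num')['col_1'].agg(custom_mean).compute()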
You are right! I was able to reproduce your problem on Dask 2.11.0. The good news is that there's a solution! It appears that the Dask groupby problem is specifically with the category type (pandas.core.dtypes.dtypes.CategoricalDtype). If you cast the category column to another column type (float, int, str), the groupby will work correctly.
Here's your code that I copied:
import dask.dataframe as dd
import pandas as pd
import numpy as np
def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)
bins = np.linspace(0,1,11)
labels = list(range(len(bins)-1))
ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)
print(ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which prints out the problem you mentioned
bin_num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
5 0.550844
6 0.651036
7 0.751220
8 NaN
9 NaN
Name: col_1, Length: 80, dtype: float64
Here's my solution:
ddf3 = ddf2.copy()
ddf3["bin_num"] = ddf3["bin_num"].astype("int")
print(ddf3.groupby('bin_num')['col_1'].apply(pd.Series.median).compute())
which printed:
bin_num
9 0.951369
2 0.249150
1 0.149563
0 0.049897
3 0.347906
8 0.847819
4 0.449029
5 0.550608
6 0.652778
7 0.749922
Name: col_1, dtype: float64
@MRocklin or @TomAugspurger
Would you be able to create a fix for this in a new release? I think there is sufficient reproducible code here. Thanks for all your hard work. I love Dask and use it every day ;)

Numbers appearing in scientific notation after imputing missing values with mean in a dataframe

I have imputed missing values with the mean for my dataset, but after this process the amount values are shown in scientific format, even though the data type is still float64. I have used the following code:
mean_value1 = df1['amount'].mean()
df1['amount'] = df1['amount'].fillna(mean_value1)
mean_value2 = df1['start_balance'].mean()
df1['start_balance'] = df1['start_balance'].fillna(mean_value2)
mean_value3 = df1['end_balance'].mean()
df1['end_balance'] = df1['end_balance'].fillna(mean_value3)
df1 = df1.fillna(df1.mode().iloc[0])
df1.head()
Missing values are treated correctly, but the values for start_balance and end_balance are coming out in scientific notation. How can I prevent this from happening?
The output looks like the following:
amount booking_date booking_text date_end_balance date_start_balance end_balance month start_balance tx_code
-60790.332082 2017-06-30 SEPA-Gutschrift 2017-06-30 2017-06-01 2.693179e+07 June-2017 2.652441e+07 166.0
-10.000000 2016-03-22 GEBUEHREN 2016-03-22 2016-02-22 3.589838e+06 March-2016 3.590838e+06 808.0
If you don't want to round the numbers, you can change how they are displayed in the output this way:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random(5)*10000000000, columns=['random'])
pd.set_option('display.float_format', lambda x: '%.0f' % x)
df
which gives this output
random
0 7591769472
1 78148991059
2 19880680453
3 1965830619
4 39390983843
instead of this output
random
0 7.591769e+09
1 7.814899e+10
2 1.988068e+10
3 1.965831e+09
4 3.939098e+10
Change %.0f to however many decimal places you want to see: %.2f for two, %.3f for three, and so on.
You can also format per row with df.apply(lambda x: '%.0f' % x, axis=1), though note that this returns strings rather than floats.
df1['amount'] = df1['amount'].astype('int64')
df1['start_balance'] = df1['start_balance'].astype('int64')
This worked well for me! I used it in a different step, but it still worked.
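One caveat with the integer route: astype('int64') truncates the fractional part, so you may want to round first. A small sketch:
# round to the nearest integer before casting, instead of truncating
df1['amount'] = df1['amount'].round().astype('int64')
df1['start_balance'] = df1['start_balance'].round().astype('int64')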

Using pandas style to give colors to some rows with a specific condition

This is the output of pandas in excel format:
Id comments number
1 so bad 1
1 so far 2
2 always 3
2 very good 4
3 very bad 5
3 very nice 6
3 so far 7
4 very far 8
4 very close 9
4 busy 10
I want to use pandas to give a color (for example, gray) to the rows whose value in the Id column is even. For example, rows 3 and 4 have even Id numbers, but rows 5, 6 and 7 have odd Id numbers. Is there any way to do this with pandas?
As explained in the documentation http://pandas.pydata.org/pandas-docs/stable/style.html, what you basically want to do is write a style function and apply it to the Styler object.
def _color_if_even(s):
    return ['background-color: grey' if val % 2 == 0 else '' for val in s]
and call it on the Styler object, i.e.,
df.style.apply(_color_if_even, subset=['Id'])
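If you want the whole row shaded rather than just the Id cell, one possible sketch (column name 'Id' as in the question) is a row-wise style function applied with axis=1:
def _color_row_if_even(row):
    # repeat the style for every cell in the row when Id is even
    style = 'background-color: grey' if row['Id'] % 2 == 0 else ''
    return [style] * len(row)

df.style.apply(_color_row_if_even, axis=1)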

Efficiently concatenate a large number of columns

I am trying to concatenate a large number of columns containing integers into one string.
Basically, starting from:
df = pd.DataFrame({'id':[1,2,3,4],'a':[0,1,2,3], 'b':[4,5,6,7], 'c':[8,9,0,1]})
To obtain:
id join
0 1 481
1 2 592
2 3 603
3 4 714
I found several methods to do this (here and here):
Method 1:
conc['glued'] = ''
i = 1
while i < len(df.columns):
    conc['glued'] = conc['glued'] + df[df.columns[i]].values.astype(str)
    i = i + 1
This method works, but it is a bit slow (45 min on my "test" case of 18,000 rows x 40,000 columns). I am concerned by the loop over the columns, as this program will eventually be applied to tables of 600,000 columns and I am afraid it will be too slow.
Method 2a
conc['join']=[''.join(row) for row in df[df.columns[1:]].values.astype(str)]
Method 2b
conc['apply'] = df[df.columns[1:]].apply(lambda x: ''.join(x.astype(str)), axis=1)
Both of these methods are about 10 times faster than the previous one; they iterate over rows, which is good, and they work perfectly on my "debug" table df. But when I apply them to my "test" table of 18k x 40k, they lead to a MemoryError (I already have 60% of my 32 GB of RAM occupied after reading the corresponding csv file).
I can copy my DataFrame without exceeding memory, but curiously, applying this method makes the code crash.
Do you see how I can fix and improve this code to use efficient row-based iteration? Thank you!
Appendix:
Here is the code I use on my test case:
geno_reader = pd.read_csv(genotype_file,header=0,compression='gzip', usecols=geno_columns_names)
fimpute_geno = pd.DataFrame({'SampID': geno_reader['SampID']})
I should probably use the chunksize option to read this file, but I haven't yet really understood how to use the chunks after reading.
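A minimal sketch of what chunked reading could look like here, assuming the same genotype_file and geno_columns_names as above, that the first column is SampID, and an arbitrary chunk size of 1000 rows:
calls_parts = []
for chunk in pd.read_csv(genotype_file, header=0, compression='gzip',
                         usecols=geno_columns_names, chunksize=1000):
    # build the concatenated 'Calls' string for this chunk only
    joined = chunk[chunk.columns[1:]].astype(int).astype(str).apply(''.join, axis=1)
    calls_parts.append(pd.DataFrame({'SampID': chunk['SampID'], 'Calls': joined}))
fimpute_geno = pd.concat(calls_parts, ignore_index=True)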
Method 1:
fimpute_geno['Calls'] = ''
for i in range(1, len(geno_reader.columns)):
    fimpute_geno['Calls'] = fimpute_geno['Calls'] \
        + geno_reader[geno_reader.columns[i]].values.astype(int).astype(str)
This runs in 45 min.
There is a rather ugly piece of code in there, the .astype(int).astype(str). I don't know why Python doesn't recognize my integers and treats them as floats.
Method 2:
fimpute_geno['Calls'] = geno_reader[geno_reader.columns[1:]] \
    .apply(lambda x: ''.join(x.astype(int).astype(str)), axis=1)
This leads to a MemoryError.
Here's something to try. It would require that you convert your columns to strings, though. Your sample frame:
b c id
0 4 8 1
1 5 9 2
2 6 0 3
3 7 1 4
then
# you could also use conc[['b','c','id']] for the next two lines
conc.loc[:, 'b':'id'] = conc.loc[:, 'b':'id'].astype('str')
conc['join'] = np.sum(conc.loc[:, 'b':'id'], axis=1)
Would give
a b c id join
0 0 4 8 1 481
1 1 5 9 2 592
2 2 6 0 3 603
3 3 7 1 4 714
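As a hedged follow-up for the memory issue: the same row-wise join, but converting and joining the columns in blocks so that only a slice of the frame is held as strings at any time (the block size is an arbitrary assumption, untested on the 18k x 40k case):
block = 1000                       # number of columns converted per block (assumption)
cols = df.columns[1:]              # same column selection as the methods above
df['join'] = ''
for start in range(0, len(cols), block):
    part = df[cols[start:start + block]].astype(str)
    df['join'] = df['join'] + part.apply(''.join, axis=1)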
