Pandas Dataframe: Dropping Selected rows with 0.0 float type values - python-3.x

I have a dataset that contains an amount column of float type. Some rows contain values of 0.00, and because they skew the dataset I need to drop them. I have temporarily set the amount column as the index and sorted the values as well.
Afterwards, I attempted to drop the rows after subsetting with .loc, but I keep getting an error of the form ValueError: Buffer has wrong number of dimensions (expected 1, got 3):
mortgage = mortgage.set_index('Gross Loan Amount').sort_values('Gross Loan Amount')
mortgage.drop([mortgage.loc[0.0]])
I also tried this:
mortgage.drop(mortgage.loc[0.0])
which raised an error of the form KeyError: "[Column_names] not found in axis".
How else can I accomplish this task?

You could build a boolean frame and then use any:
df = df[~(df == 0).any(axis=1)]
This keeps only the rows in which no column contains a zero; any row with at least one zero is removed.
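As a minimal runnable sketch (the sample data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 0.0, 3.0], 'B': [4.0, 5.0, 0.0]})
# mark rows containing any zero, then invert the mask to keep the rest
df = df[~(df == 0).any(axis=1)]
print(df)  # only the first row survives
```

Note that this drops a row if a zero appears in any column; to restrict the check to specific columns, apply the same mask to a column subset such as df[['A']].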

Let me see if I understand your problem correctly. I created this sample dataset:
df = pd.DataFrame({'Values': [200.04,100.00,0.00,150.15,69.98,0.10,2.90,34.6,12.6,0.00,0.00]})
df
Values
0 200.04
1 100.00
2 0.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
9 0.00
10 0.00
Now, in order to get rid of the 0.00 values, you just have to do this:
df = df[df['Values'] != 0.00]
Output:
df
Values
0 200.04
1 100.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60

Related

How to get the column name of a dataframe from values in a numpy array

I have a df with 15 columns:
df.columns:
0 class
1 name
2 location
3 income
4 edu_level
--
14 marital_status
after some transformations I got a numpy.ndarray with shape (15, 3) named loads:
0.52 0.33 0.09
0.20 0.53 0.23
0.60 0.28 0.23
0.13 0.45 0.41
0.49 0.9
... (and so on)
So, 3 columns with 15 values.
What I need to do:
I want to get the df column names for the values in the first column of loads that are greater than 0.50.
For this example, the columns of df related to the first column of loads with values higher than 0.5 should return:
0 Class
2 Location
Same for the second column of loads, should return:
1 name
3 income
4 edu_level
and the same logic to the 3rd column of loads.
I managed to get the numpy array loads the way I need it, but I am having a hard time with this last part. I know I could simply pick the columns manually, but that becomes impractical when df has more than 15 features.
Can anyone help me, please?
Given your threshold, you can create a boolean array to filter df.columns:
threshold = .5
for j in range(loads.shape[1]):
    print(df.columns[loads[:, j] > threshold])
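A self-contained sketch of that loop, collecting the matches into a dict instead of printing them (the df column names and loads values below are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['class', 'name', 'location', 'income'])
loads = np.array([[0.52, 0.33],
                  [0.20, 0.53],
                  [0.60, 0.28],
                  [0.13, 0.45]])

threshold = 0.5
# for each column of loads, keep the df column names whose loading exceeds the threshold
result = {j: list(df.columns[loads[:, j] > threshold])
          for j in range(loads.shape[1])}
print(result)  # {0: ['class', 'location'], 1: ['name']}
```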

How to reformat time series to fill in missing entries with NaNs?

I have a problem that involves converting time series from one
representation to another. Each item in the time series has
attributes "time", "id", and "value" (think of it as a measurement
at "time" for sensor "id"). I'm storing all the items in a
Pandas dataframe with columns named by the attributes.
The set of "time"s is a small set of integers (say, 32),
but some of the "id"s are missing "time"s/"value"s. What I want to
construct is an output dataframe with the form:
id time0 time1 ... timeN
val0 val1 ... valN
where the missing "value"s are represented by NaNs.
For example, suppose the input looks like the following:
time id value
0 0 13
2 0 15
3 0 20
2 1 10
3 1 12
Then, assuming the set of possible times is 0, 2, and 3, the
desired output is:
id time0 time1 time2 time3
0 13 NaN 15 20
1 NaN NaN 10 12
I'm looking for a Pythonic way to do this since there are several
million rows in the input and around 1/4 million groups.
You can transform your table with a pivot. If you need to handle duplicate values for index/column pairs, use the more general pivot_table.
For your example, the simple pivot is sufficient:
>>> df = df.pivot(index="id", columns="time", values="value")
time 0 2 3
id
0 13.0 15.0 20.0
1 NaN 10.0 12.0
To get the exact result from your question, you could reindex the columns to fill in the empty values, and rename the column index like this:
# add missing time columns, fill with NaNs
df = df.reindex(range(df.columns.max() + 1), axis=1)
# name them "time#"
df.columns = "time" + df.columns.astype(str)
# remove the column index name "time"
df = df.rename_axis(None, axis=1)
Final df:
time0 time1 time2 time3
id
0 13.0 NaN 15.0 20.0
1 NaN NaN 10.0 12.0
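As noted above, pivot raises a ValueError when an index/column pair is duplicated; pivot_table handles that case by aggregating the duplicates (mean by default). A small sketch with made-up duplicate readings:

```python
import pandas as pd

df = pd.DataFrame({'time': [0, 0, 2],
                   'id': [0, 0, 0],
                   'value': [10, 14, 20]})
# the (id=0, time=0) pair appears twice, so pivot would raise;
# pivot_table aggregates the duplicates, taking the mean by default
out = df.pivot_table(index='id', columns='time', values='value')
print(out)
```

Pass aggfunc='first' (or 'sum', etc.) to pivot_table if averaging is not the right way to resolve duplicates for your data.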

Pandas shift with DatetimeIndex takes too long to run

I have a running-time issue with shifting a large dataframe that has a datetime index.
Example using created dummy data:
df = pd.DataFrame({'col1':[0,1,2,3,4,5,6,7,8,9,10,11,12,13]*10**5,'col3':list(np.random.randint(0,100000,14*10**5)),'col2':list(pd.date_range('2020-01-01','2020-08-01',freq='M'))*2*10**5})
df.col3=df.col3.astype(str)
df.drop_duplicates(subset=['col3','col2'],keep='first',inplace=True)
If I shift without using the datetime index, it only takes about 12 s:
%%time
tmp=df.groupby('col3')['col1'].shift(2,fill_value=0)
Wall time: 12.5 s
But when I use the datetime index, as my situation requires, it takes about 40 minutes:
%%time
tmp=df.set_index('col2').groupby('col3')['col1'].shift(2,freq='M',fill_value=0)
Wall time: 40min 25s
In my situation, I need the data from shift(1) through shift(6), merged with the original data on col2 and col3, so I loop and merge.
Is there any solution for this? Thanks in advance; I will appreciate any response.
Ben's answer solves it:
%%time
tmp=df1[['col1','col3', 'col2']].assign(col2 = lambda x: x['col2'] + MonthEnd(2)).set_index(['col3', 'col2']).add_suffix(f'_{2}').fillna(0).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 5.94 s
and applied to the loop:
%%time
res=(pd.concat([df1.assign(col2 = lambda x: x['col2'] + MonthEnd(i)).set_index(['col3', 'col2']).add_suffix(f'_{i}') for i in range(0,7)],axis=1).fillna(0)).reindex(pd.MultiIndex.from_frame(df1[['col3','col2']])).reset_index()
Wall time: 1min 44s
Actually, my real data already uses MonthEnd(0), so I just loop over range(1, 7). I also apply this to multiple columns, so I skip astype and use reindex because I need a left merge.
The two operations are slightly different, and the results are not the same, because your data (at least the dummy data here) is not ordered, especially if some col3 values have missing dates. That said, the time difference is enormous, so I think you should approach it differently.
One way is to add MonthEnd(X) to col2 for X from 0 to 6, concat all of the results after setting col3 and col2 as the index, and use add_suffix to keep track of the "shift" value. Then fillna and convert the dtype back to the original one. The rest is mostly cosmetic, depending on your needs.
from pandas.tseries.offsets import MonthEnd
res = (
pd.concat([
df.assign(col2 = lambda x: x['col2'] + MonthEnd(i))
.set_index(['col3', 'col2'])
.add_suffix(f'_{i}')
for i in range(0,7)],
axis=1)
.fillna(0)
# depends on your original data
.astype(df['col1'].dtype)
# if you want a left merge ordered like original df
#.reindex(pd.MultiIndex.from_frame(df[['col3','col2']]))
# if you want col2 and col3 back as columns
# .reset_index()
)
Note that concat does an outer join by default, so you end up with months that were not in your original data; col1_0 is the original data (with my random numbers).
print(res.head(10))
col1_0 col1_1 col1_2 col1_3 col1_4 col1_5 col1_6
col3 col2
0 2020-01-31 7 0 0 0 0 0 0
2020-02-29 8 7 0 0 0 0 0
2020-03-31 2 8 7 0 0 0 0
2020-04-30 3 2 8 7 0 0 0
2020-05-31 4 3 2 8 7 0 0
2020-06-30 12 4 3 2 8 7 0
2020-07-31 13 12 4 3 2 8 7
2020-08-31 0 13 12 4 3 2 8
2020-09-30 0 0 13 12 4 3 2
2020-10-31 0 0 0 13 12 4 3
This is an issue with groupby + shift. The problem is that if you specify an axis other than 0 or a frequency, it falls back to a very slow loop over the groups. If neither is specified, it can use a much faster path, which is why you see an order-of-magnitude difference in performance.
The relevant code for DataFrameGroupBy.shift is:
def shift(self, periods=1, freq=None, axis=0, fill_value=None):
    """..."""
    if freq is not None or axis != 0:
        return self.apply(lambda x: x.shift(periods, freq, axis, fill_value))
Previously, this issue also extended to specifying a fill_value.
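The semantic difference between the two variants can be seen on a tiny series (sample values made up): passing freq moves the index rather than the values, and that is also what forces groupby onto the slow per-group apply path.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

s = pd.Series([1, 2, 3],
              index=pd.date_range('2020-01-31', periods=3, freq=MonthEnd()))

# without freq: values slide along the fixed index, introducing NaNs
shifted_values = s.shift(2)
# with freq: the index itself is shifted forward and all values are kept
shifted_index = s.shift(2, freq=MonthEnd())
```

This is why the workaround above (adding MonthEnd offsets to the column directly and aligning with concat/reindex) can reproduce the freq-based shift without ever entering the slow path.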

Unable to make dtype conversion from object to str

loan_amnt funded_amnt funded_amnt_inv term
0 5000.0 5000.0 4975.0 36 months
1 2500.0 2500.0 2500.0 60 months
My dataframe looks like the above.
I need to change '36 months' under term to '3' (months to years), but I'm unable to do so because the dtype of 'term' is object.
loan.term.replace('36months','3',inplace=True) --> no change in the dataframe
I tried the code below for type conversion, but it still reports the dtype as object:
loan['term']=loan.term.astype(str)
Expected output:
loan_amnt funded_amnt funded_amnt_inv term
0 5000.0 5000.0 4975.0 3
1 2500.0 2500.0 2500.0 5
Any help would be dearly appreciated. Thank you for your time.
With your data in a variable like data, you can strip the " months" suffix and convert to years in one vectorized step:
data['term'] = data['term'].str.split().str[0].astype(int) // 12
print(data)
and you'll get your output! (As an aside, your replace call made no change because its pattern '36months' is missing the space present in '36 months'.)

reading data with varying length header

I want to read, in Python, a file which contains a varying-length header and then extract into a dataframe/series the variables which come after the header.
The data looks like :
....................................................................
Data coverage and measurement duty cycle:
When the instrument duty cycle is not in measure mode (i.e. in-flight
calibrations) the data is not given here (error flag = 2).
The measurements have been found to exhibit a strong sensitivity to cabin
pressure.
Consequently the instrument requires calibrated at each new cabin
pressure/altitude.
Data taken at cabin pressures for which no calibration was performed is
not given here (error flag = 2).
Measurement sensivity to large roll angles was also observed.
Data corresponding to roll angles greater than 10 degrees is not given
here (error flag = 2)
......................................................................
High Std: TBD ppb
Target Std: TBD ppb
Zero Std: 0 ppb
Mole fraction error flag description :
0 : Valid data
2 : Missing data
31636 0.69 0
31637 0.66 0
31638 0.62 0
31639 0.64 0
31640 0.71 0
.....
.....
So what I want is to extract the data as :
Time C2H6 Flag
0 31636 0.69 0 NaN
1 31637 0.66 0 NaN
2 31638 0.62 0 NaN
3 31639 0.64 0 NaN
4 31640 0.71 0 NaN
5 31641 0.79 0 NaN
6 31642 0.85 0 NaN
7 31643 0.81 0 NaN
8 31644 0.79 0 NaN
9 31645 0.85 0 NaN
I can do that with
infile="/nfs/potts.jasmin-north/scratch/earic/AEOG/data/mantildas_faam_20180911_r1_c118.na"
flightdata = pd.read_fwf(infile, skiprows=53, header=None, names=['Time', 'C2H6', 'Flag'],)
but I'm skipping approximately 53 rows because I counted how many to skip. I have a bunch of these files, and some don't have exactly 53 header rows, so I was wondering what the best way to deal with this is: what criterion would let Python always read just the three columns of data wherever it finds them? For example, I could have Python start reading the data right after it encounters
Mole fraction error flag description :
0 : Valid data
2 : Missing data
What should I do? Is there another criterion that would work better?
You can split on the header delimiter, like so:
import io
import pandas as pd

with open(filename, 'r') as f:
    myfile = f.read()
# keep only the text after the delimiter line
infile = myfile.split('Mole fraction error flag description :')[-1]
# drop the remaining flag-description lines
infile = infile.split('\n')
# ' : ' marks a description line, not a data line -- you know the data better
infile = '\n'.join([line for line in infile if ' : ' not in line])
# read_fwf expects a path or file-like object, so wrap the cleaned string
flightdata = pd.read_fwf(io.StringIO(infile), header=None, names=['Time', 'C2H6', 'Flag'])
