Error converting columns of a pandas DataFrame in Python 3 - python-3.x

I have a big problem with pandas. I have a large DataFrame containing
Ref_id PRICE YEAR MONTH BRAND
100000 '5000' '2012' '4' 'FORD'
100001 '10000' '2015' '5' 'MERCEDES'
...
I want to convert my PRICE, YEAR and MONTH columns, but when I use .astype(int) or .apply(lambda x: int(x)) on a column I receive a ValueError. My DataFrame has 1.8 million rows.
ValueError: invalid literal for int() with base 10: 'PRICE'
So I don't understand why pandas is trying to convert the name of the column.
Could you explain why?
Best,
C.

Try this:
In [59]: cols = 'PRICE YEAR MONTH'.split()
In [60]: cols
Out[60]: ['PRICE', 'YEAR', 'MONTH']
In [61]: for c in cols:
...: df[c] = pd.to_numeric(df[c], errors='coerce')
...:
In [62]: df
Out[62]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000.0 2012 4 FORD
1 100001 10000.0 2015 5 MERCEDES
2 100002 NaN 2016 6 AUDI
Reproducing your error:
In [65]: df
Out[65]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
2 100002 PRICE 2016 6 AUDI # note the `PRICE` value!
In [66]: df['PRICE'].astype(int)
...
skipped
...
ValueError: invalid literal for int() with base 10: 'PRICE'
As @jezrael noted in a comment, you most probably have "bad" (unexpected) values in your data set.
You can use one of the following techniques to clean it up:
In [155]: df
Out[155]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
2 Ref_id PRICE YEAR MONTH BRAND
3 100002 15000 2016 5 AUDI
In [156]: df.dtypes
Out[156]:
Ref_id object
PRICE object
YEAR object
MONTH object
BRAND object
dtype: object
In [157]: df = df.drop(df.loc[df.PRICE == 'PRICE'].index)
In [158]: df
Out[158]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
3 100002 15000 2016 5 AUDI
In [159]: for c in cols:
...: df[c] = pd.to_numeric(df[c], errors='coerce')
...:
In [160]: df
Out[160]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000 2012 4 FORD
1 100001 10000 2015 5 MERCEDES
3 100002 15000 2016 5 AUDI
In [161]: df.dtypes
Out[161]:
Ref_id object
PRICE int64
YEAR int64
MONTH int64
BRAND object
dtype: object
or simply:
In [159]: for c in cols:
...: df[c] = pd.to_numeric(df[c], errors='coerce')
...:
In [165]: df
Out[165]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000.0 2012.0 4.0 FORD
1 100001 10000.0 2015.0 5.0 MERCEDES
2 Ref_id NaN NaN NaN BRAND
3 100002 15000.0 2016.0 5.0 AUDI
and then .dropna(how='any') if you know there were no NaNs in your original data set:
In [166]: df = df.dropna(how='any')
In [167]: df
Out[167]:
Ref_id PRICE YEAR MONTH BRAND
0 100000 5000.0 2012.0 4.0 FORD
1 100001 10000.0 2015.0 5.0 MERCEDES
3 100002 15000.0 2016.0 5.0 AUDI
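If the stray rows are repeated CSV headers (as in the example above, where a whole header line shows up as a data row), a more general cleanup is to drop every row whose values equal the column names. A minimal sketch, assuming the header repeats across all columns:
import pandas as pd
# hypothetical frame with a repeated header row, as in the example above
df = pd.DataFrame({'Ref_id': ['100000', '100001', 'Ref_id', '100002'],
                   'PRICE': ['5000', '10000', 'PRICE', '15000'],
                   'YEAR': ['2012', '2015', 'YEAR', '2016'],
                   'MONTH': ['4', '5', 'MONTH', '5'],
                   'BRAND': ['FORD', 'MERCEDES', 'BRAND', 'AUDI']})
# keep only the rows whose values differ from the column names
df = df[~df.eq(df.columns).all(axis=1)]
# the numeric conversion now succeeds
for c in ['PRICE', 'YEAR', 'MONTH']:
    df[c] = pd.to_numeric(df[c], errors='coerce')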

Related

Display row with False values in validated pandas dataframe column [duplicate]

This question was closed as a duplicate of: Display rows with one or more NaN values in pandas dataframe
I was validating the 'Price' column in my dataframe. Sample:
ArticleId SiteId ZoneId Date Quantity Price CostPrice
53 194516 9 2 2018-11-26 11.0 40.64 27.73
164 200838 9 2 2018-11-13 5.0 99.75 87.24
373 200838 9 2 2018-11-27 1.0 99.75 87.34
pd.to_numeric(df_sales['Price'], errors='coerce').notna().value_counts()
True 17984
False 13
Name: Price, dtype: int64
I'd love to display the rows with False values so I know what's wrong with them. How do I do that?
Thank you.
You could print the rows where Price isnull():
print(df_sales[df_sales['Price'].isnull()])
ArticleId SiteId ZoneId Date Quantity Price CostPrice
1 200838 9 2 2018-11-13 5 NaN 87.240
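Note that isnull() only catches values that are already NaN; it will not flag unparseable strings like 'test'. A small sketch (assuming df_sales as above) that shows only the rows failing numeric conversion while skipping the ones that were already NaN:
# rows that fail conversion but were not NaN to begin with
bad = pd.to_numeric(df_sales['Price'], errors='coerce').isna() & df_sales['Price'].notna()
print(df_sales[bad])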
pd.to_numeric(df['Price'], errors='coerce').isna() returns a Boolean mask that can be used to select the rows causing errors.
This includes rows that are already NaN as well as rows containing strings that cannot be parsed.
import pandas as pd
# test data
df = pd.DataFrame({'Price': ['40.64', '99.75', '99.75', pd.NA, 'test', '99. 0', '98 0']})
Price
0 40.64
1 99.75
2 99.75
3 <NA>
4 test
5 99. 0
6 98 0
# find the value of the rows that are causing issues
problem_rows = df[pd.to_numeric(df['Price'], errors='coerce').isna()]
# display(problem_rows)
Price
3 <NA>
4 test
5 99. 0
6 98 0
Alternative
Create an extra column and then use it to select the problem rows
df['Price_Updated'] = pd.to_numeric(df['Price'], errors='coerce')
Price Price_Updated
0 40.64 40.64
1 99.75 99.75
2 99.75 99.75
3 <NA> NaN
4 test NaN
5 99. 0 NaN
6 98 0 NaN
# find the problem rows
problem_rows = df.Price[df.Price_Updated.isna()]
Explanation
If you instead overwrite the column with .to_numeric() and then check for NaNs, the original values are gone, so you can no longer see why those rows had to be coerced.
# update the Price column in place
df.Price = pd.to_numeric(df['Price'], errors='coerce')
# check for NaN
problem_rows = df.Price[df.Price.isnull()]
# display(problem_rows)
3 NaN
4 NaN
5 NaN
6 NaN
Name: Price, dtype: float64
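Once the problem rows are visible, you may be able to repair them rather than drop them. A minimal sketch, assuming (as in the test data above) that stray spaces are the only defect worth fixing; whether '98 0' should really become 980 is a data question, not a pandas one:
# strip stray spaces such as '99. 0' before converting; anything that
# still fails to parse ('test', <NA>) becomes NaN
df['Price'] = pd.to_numeric(df['Price'].str.replace(' ', '', regex=False),
                            errors='coerce')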

Groupby dates quarterly in a pandas dataframe and find the count of their occurrences

My Dataframe looks like
"dataframe_time"
INSERTED_UTC
0 2018-05-29
1 2018-05-22
2 2018-02-10
3 2018-04-30
4 2018-03-02
5 2018-11-26
6 2018-03-07
7 2018-05-12
8 2019-02-03
9 2018-08-03
10 2018-04-27
print(type(dataframe_time['INSERTED_UTC'].iloc[1]))
<class 'datetime.date'>
I am trying to group the dates together and count their occurrences quarterly. Desired output -
Quarter Count
2018-03-31 3
2018-06-30 5
2018-09-30 1
2018-12-31 1
2019-03-31 1
2019-06-30 0
I am running the following command to group them together
dataframe_time['INSERTED_UTC'].groupby(pd.Grouper(freq='Q'))
but I get this error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Int64Index'
First, convert the dates to datetimes, then use DataFrame.resample with the on parameter to select the datetime column:
dataframe_time.INSERTED_UTC = pd.to_datetime(dataframe_time.INSERTED_UTC)
df = dataframe_time.resample('Q', on='INSERTED_UTC').size().reset_index(name='Count')
Or change your solution to:
df = (dataframe_time.groupby(pd.Grouper(freq='Q', key='INSERTED_UTC'))
        .size()
        .reset_index(name='Count'))
print (df)
INSERTED_UTC Count
0 2018-03-31 3
1 2018-06-30 5
2 2018-09-30 1
3 2018-12-31 1
4 2019-03-31 1
You can convert the dates to quarters by to_period('Q') and group by those:
df.INSERTED_UTC = pd.to_datetime(df.INSERTED_UTC)
df.groupby(df.INSERTED_UTC.dt.to_period('Q')).size()
You can also use value_counts:
df.INSERTED_UTC.dt.to_period('Q').value_counts()
Output:
INSERTED_UTC
2018Q1 3
2018Q2 5
2018Q3 1
2018Q4 1
2019Q1 1
Freq: Q-DEC, dtype: int64
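The desired output in the question also lists 2019-06-30 with a count of 0, which none of the groupers produce because that quarter lies beyond the last date in the data. A sketch of one way to extend the result, assuming you know the full quarter range you want and that INSERTED_UTC was already converted with pd.to_datetime as above:
counts = dataframe_time.resample('Q', on='INSERTED_UTC').size()
# extend through 2019Q2 so trailing empty quarters appear as 0
full_range = pd.date_range('2018-03-31', '2019-06-30', freq='Q')
counts = counts.reindex(full_range, fill_value=0)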

How to merge Month, Day and Year columns into a Date column?

The date is in separate columns
Month Day Year
8 12 1993
8 12 1993
8 12 1993
I want to merge it in one column
Date
8/12/1993
8/12/1993
8/12/1993
I tried
df_date = df.Timestamp((df_filtered.Year*10000+df_filtered.Month*100+df_filtered.Day).apply(str),format='%Y%m%d')
I get this error
AttributeError: 'DataFrame' object has no attribute 'Timestamp'
Using pd.to_datetime with astype(str) (each component is zero-padded so the concatenated string parses unambiguously):
1. as string type:
df['Date'] = pd.to_datetime(df['Month'].astype(str).str.zfill(2) + df['Day'].astype(str).str.zfill(2) + df['Year'].astype(str), format='%m%d%Y').dt.strftime('%m/%d/%Y')
Month Day Year Date
0 8 12 1993 08/12/1993
1 8 12 1993 08/12/1993
2 8 12 1993 08/12/1993
2. as datetime type:
df['Date'] = pd.to_datetime(df['Month'].astype(str).str.zfill(2) + df['Day'].astype(str).str.zfill(2) + df['Year'].astype(str), format='%m%d%Y')
Month Day Year Date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
Here is the solution:
df = pd.DataFrame({'Month': [8, 8, 8], 'Day': [12, 12, 12], 'Year': [1993, 1993, 1993]})
# This way dates will be a DataFrame
dates = df.apply(lambda row:
                 pd.Series(pd.Timestamp(row['Year'], row['Month'], row['Day']), index=['Date']),
                 axis=1)
# And this way dates will be a Series:
# dates = df.apply(lambda row:
#                  pd.Timestamp(row['Year'], row['Month'], row['Day']),
#                  axis=1)
The apply method generates a new Series or DataFrame by applying the provided function (a lambda in this case) to each row and joining the results.
You can read about the apply method in the official documentation.
And here is an explanation of lambda expressions.
EDIT:
@JohnClements suggested a better solution, using the pd.to_datetime method:
dates = pd.to_datetime(df).to_frame('Date')
Also, if you want your output to be a string, you can use
dates = df.apply(lambda row: f"{row['Month']}/{row['Day']}/{row['Year']}",
                 axis=1)
You can try:
df = pd.DataFrame({'Month': [8,8,8], 'Day': [12,12,12], 'Year': [1993, 1993, 1993]})
df['date'] = pd.to_datetime(df)
Result:
Month Day Year date
0 8 12 1993 1993-08-12
1 8 12 1993 1993-08-12
2 8 12 1993 1993-08-12
Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
Month 3 non-null int64
Day 3 non-null int64
Year 3 non-null int64
date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 176.0 bytes
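As a side note, pd.to_datetime(df) works here because pandas can assemble a datetime from a DataFrame whose columns map to year/month/day components (the match appears to be case-insensitive, which is why Month/Day/Year is accepted). If the frame carries extra columns, the same idea works by passing only the three components:
# select just the component columns before assembling the datetime
df['date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])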

Cannot convert object to date after groupby

I was able to convert dates while working with a different dataset a couple of days ago. However, I cannot apply the same technique to my current dataset. The dataset looks like:
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()
Datez Volumez
0 2016-09-19 6.300000e+07
1 2016-09-20 3.382694e+07
2 2016-09-26 4.000000e+05
3 2016-09-27 4.900000e+09
4 2016-09-28 5.324995e+08
totalHist.dtypes
Datez object
Volumez float64
dtype: object
This used to do the trick:
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'], format='%d-%m-%Y')
totalHist.dtypes
which is now giving me:
KeyError: 'Datez'
During handling of the above exception, another exception occurred:
How can I fix this? I am doing this groupby before trying:
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
totalHist.head()
totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
totalHist.head()
You can just use .rename() to rename your columns
Generate some data (in same format as OP)
d = ['1/1/2018','1/2/2018','1/3/2018',
'1/3/2018','1/4/2018','1/2/2018','1/1/2018','1/5/2018']
df = pd.DataFrame(d, columns=['Date'])
df['Trading_Value'] = [1000,1005,1001,1001,1002,1009,1010,1002]
print(df)
Date Trading_Value
0 1/1/2018 1000
1 1/2/2018 1005
2 1/3/2018 1001
3 1/3/2018 1001
4 1/4/2018 1002
5 1/2/2018 1009
6 1/1/2018 1010
7 1/5/2018 1002
GROUP BY
totalHist = df.groupby('Date', as_index = False).agg({"Trading_Value": "sum"})
print(totalHist.head())
Date Trading_Value
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Rename columns
totalHist.rename(columns={'Date':'Datez','Trading_Value':'Volumez'}, inplace=True)
print(totalHist)
Datez Volumez
0 1/1/2018 2010
1 1/2/2018 2014
2 1/3/2018 2002
3 1/4/2018 1002
4 1/5/2018 1002
Finally, convert to datetime
totalHist['Datez'] = pd.to_datetime(totalHist['Datez'])
print(totalHist.dtypes)
Datez datetime64[ns]
Volumez int64
dtype: object
This was done with Python 3.6.7 and pandas 0.23.4.
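For what it's worth, the KeyError in the question most likely comes from writing into columns.values: that mutates the NumPy array behind the Index but can leave the Index's cached lookup table out of sync, so totalHist['Datez'] no longer resolves. A sketch of safer alternatives:
# fragile: writes into the array backing the Index
# totalHist.columns.values[[0, 1]] = ['Datez', 'Volumez']
# safer: replace the whole set of labels...
totalHist.columns = ['Datez', 'Volumez']
# ...or rename by the original names
totalHist = totalHist.rename(columns={'Date': 'Datez', 'Trading_Value': 'Volumez'})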

grouping by weekly days pandas

I have a dataframe, df, containing
Index Date & Time eventName eventCount
0 2017-08-09 ABC 24
1 2017-08-09 CDE 140
2 2017-08-10 CDE 150
3 2017-08-11 DEF 200
4 2017-08-11 ABC 20
5 2017-08-16 CDE 10
6 2017-08-16 ABC 15
7 2017-08-17 CDE 10
8 2017-08-17 DEF 50
9 2017-08-18 DEF 80
...
I want to sum the eventCount over the occurrences of each weekday and plot the totals per weekday (Monday to Sunday). For example, summing the eventCount values of:
2017-08-09 and 2017-08-16 (Wednesdays) = 189
2017-08-10 and 2017-08-17 (Thursdays) = 210
2017-08-11 and 2017-08-18 (Fridays) = 300
I have tried
dailyOccurenceSum=df['eventCount'].groupby(lambda x: x.weekday).sum()
and I get this error: AttributeError: 'int' object has no attribute 'weekday'
Starting with df -
df
Index Date & Time eventName eventCount
0 0 2017-08-09 ABC 24
1 1 2017-08-09 CDE 140
2 2 2017-08-10 CDE 150
3 3 2017-08-11 DEF 200
4 4 2017-08-11 ABC 20
5 5 2017-08-16 CDE 10
6 6 2017-08-16 ABC 15
7 7 2017-08-17 CDE 10
8 8 2017-08-17 DEF 50
9 9 2017-08-18 DEF 80
First, convert Date & Time to a datetime column -
df['Date & Time'] = pd.to_datetime(df['Date & Time'])
Next, call groupby + sum on the weekday name.
df = df.groupby(df['Date & Time'].dt.weekday_name)['eventCount'].sum()
df
Date & Time
Friday 300
Thursday 210
Wednesday 189
Name: eventCount, dtype: int64
If you want to sort by weekday, convert the index to categorical and call sort_index -
cat = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday']
df.index = pd.Categorical(df.index, categories=cat, ordered=True)
df = df.sort_index()
df
Wednesday 189
Thursday 210
Friday 300
Name: eventCount, dtype: int64
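A side note for newer pandas: Series.dt.weekday_name was deprecated and later removed; Series.dt.day_name() is the replacement, so on a current version the groupby above would read:
# equivalent groupby, starting from the original df, on pandas
# versions where weekday_name is gone
df.groupby(df['Date & Time'].dt.day_name())['eventCount'].sum()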
