I got an error like NameError: name 'df_clean' is not defined - python-3.5

# Counting Handset Manufacturers
top3 = df_clean.info()['Handset Manufacturer'].value_counts().head(3)
top3

There are two problems here. The first is that you need to define what data the df_clean variable should point to, for example:
import pandas as pd
df_clean = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_clean.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
dtypes: int64(2)
memory usage: 176.0 bytes
The other problem is that the pandas.DataFrame.info method returns None, so you cannot subscript its return value (as in the df_clean.info()['Handset Manufacturer'] from your code). Remove the .info() part, like:
top3 = df_clean['A'].value_counts().head(3)
print(top3)
Output:
1    1
2    1
3    1
Name: A, dtype: int64
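Putting both fixes together for the original snippet (the CSV file name below is only a placeholder, since the question doesn't show how df_clean is loaded):
import pandas as pd

# Hypothetical source file; replace with however df_clean is actually built
df_clean = pd.read_csv('telecom_data.csv')

# Counting Handset Manufacturers (no .info() in the chain)
top3 = df_clean['Handset Manufacturer'].value_counts().head(3)
print(top3)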

Related

How to change column dtypes via list comprehension?

How can I use a list comprehension to set the dtype of the pandas columns which are currently set as 'object' (i.e. the first two columns, alpha and hl)? I tried the list comprehension below, but it generated a syntax error.
import pandas as pd
alpha = {'alpha': ['0.0001', '0.001', '0.01', '0.1']}
hl = {'hl': [(16, ), (16,16), (64, 64,), (128, 128)]}
score = {'score': [0.65, 0.75, 0.85, 0.95]}
data = {}
for i in [alpha, hl, score]:
    data.update(i)
df = pd.DataFrame(data)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   alpha   4 non-null      object
 1   hl      4 non-null      object
 2   score   4 non-null      float64
[df[c].astype('string') if df[c].dtype == 'O' for c in df.columns]
I get the following error:
[df[c].astype('string') if df[c].dtype == 'O' for c in df.columns]
^
SyntaxError: invalid syntax
I am not sure whether this is what you are looking for, and it is not quite clear to me why you would want to do it with a list comprehension. However, this fixes the syntax error:
import pandas as pd
[df[col].astype({col : str}) for col in df.columns if df[col].dtype == 'object']
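Note that a list comprehension only returns a list of converted Series; it does not change df itself. If the goal is to actually update the DataFrame, one option (an added sketch, not from the original answer) is to pass a dict comprehension to DataFrame.astype, using str as in the answer above:
# Build a {column: dtype} mapping for every object column and cast them in one call
df = df.astype({col: str for col in df.columns if df[col].dtype == 'object'})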

Pandas csv reader - how to force a column to be a specific data type (and replace NaN with null)

I am just getting started with Pandas and I am reading a csv file using the read_csv() method. The difficulty I am having is needing to set a column to a specific data type.
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False)
my df looks like this:
RangeIndex: 4 entries, 0 to 3
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   product_code     4 non-null      object
 1   store_code       4 non-null      int64
 2   cost1            4 non-null      float64
 3   start_date       4 non-null      int64
 4   end_date         4 non-null      int64
 5   quote_reference  0 non-null      float64
 6   min1             4 non-null      int64
 7   cost2            2 non-null      float64
 8   min2             2 non-null      float64
 9   cost3            1 non-null      float64
 10  min3             1 non-null      float64
dtypes: float64(6), int64(4), object(1)
memory usage: 480.0+ bytes
You can see that I have multiple 'min' columns: min1, min2, min3.
min1 is correctly detected as int64, but min2 and min3 are float64.
This is because min1 is fully populated, whereas min2 and min3 are sparsely populated.
As you can see in my df, min2 has 2 NaN values.
Trying to change the data type using
df['min2'] = df['min2'].astype('int')
I get this error:
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
Ideally I want to change the data type to Int, and have NaN replaced by a NULL (ie don't want a 0).
I have tried a variety of methods (e.g. fillna), but can't crack this.
All help greatly appreciated.
Since pandas 1.0, you can use a generic pandas.NA to replace numpy.nan. This is useful to serve as an integer NA.
To perform the conversion, use the "Int64" dtype (note the capital I).
df['min2'] = df['min2'].astype('Int64')
Example:
s = pd.Series([1, 2, None, 3])
s.astype('Int64')
Or:
pd.Series([1, 2, None, 3], dtype='Int64')
Output:
0       1
1       2
2    <NA>
3       3
dtype: Int64
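If you prefer to handle this at read time, read_csv also accepts a dtype mapping, so the nullable integer type can be requested up front. A minimal sketch using the file and column names from the question:
import pandas as pd

# Ask for nullable integers directly, so sparsely populated columns
# keep <NA> instead of being upcast to float64
df = pd.read_csv('test_data.csv', delimiter=',', index_col=False,
                 dtype={'min2': 'Int64', 'min3': 'Int64'})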

Loop over columns with df.shift in Python

Let's say you have a DataFrame like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})
df
   A  B
0  3  5
1  1  6
2  2  7
3  3  8
Now I want to shift (skew) each column and calculate on it. I put the amounts I want the values shifted by in the index:
range_span = range(4)
result = pd.DataFrame(index=range_span)
Then I try to populate result with the following:
for c in df.columns:
    for i in range_span:
        result.iloc[i][c] = df[c].shift(i).max()
result
This only returns the index. I expected something like this:
You've got 3 critical issues:
Issue #1
This line:
result.iloc[i][c] = df[c].shift(i).max()
raises a warning that helps explain why result stays empty:
...\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
According to the documentation:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
Because iloc[i] returns a slice (i.e. a copy) of that row, the assignment never reaches the original result DataFrame. This is also why iloc didn't raise an error when given a str column index; that is explained in issue #2.
Instead of chained indexing, pass both indexers to iloc at once (or to loc when using str labels), like this:
>>> df
   A   B    C
0  1  10  100
1  2  20  200
2  3  30  300
>>> df.iloc[1, 2]
200
>>> df.iloc[[1, 2], [1, 2]]
    B    C
1  20  200
2  30  300
>>> df.iloc[1:3, 1:3]
    B    C
1  20  200
2  30  300
>>> df.iloc[:, 1:3]
    B    C
0  10  100
1  20  200
2  30  300
# ...and so on
Issue #2
If you fix issue #1, you'll then see the following error:
result.iloc[[i][c]] = df[c].shift(i).max()
TypeError: list indices must be integers or slices, not str
Also from the documentation:
property DataFrame.iloc: Purely integer-location based indexing for selection by position.
In for c in df.columns: you're passing the column names A and B, which are str, not int. Use loc instead for str column labels.
This didn't raise a TypeError earlier because of issue #1: c was passed as the argument to __setitem__() on the copied row.
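For example (using the same small df from above; this snippet is an added illustration, not part of the original answer), the two single-cell assignment styles look like this:
# Label-based assignment: row label 1, column label 'B'
df.loc[1, 'B'] = 99

# Position-based equivalent: translate the column name to its position first
df.iloc[1, df.columns.get_loc('B')] = 99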
Issue #3
Normally a DataFrame cannot be enlarged by plain indexing; iloc in particular refuses to enlarge its target object:
# using same df from #1
>>> df.iloc[1, 3] = 300
Traceback (most recent call last):
File "~\pandas\core\indexing.py", line 1394, in _has_valid_setitem_indexer
raise IndexError("iloc cannot enlarge its target object")
IndexError: iloc cannot enlarge its target object
An easier fix is to build the results in a dict and convert it to a DataFrame once the manipulation is complete, or simply to create a DataFrame of matching (or larger) size up front:
>>> df2 = pd.DataFrame(index=range(4), columns=range(3))
>>> df2
     0    1    2
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
Combining all of this, the corrected code would be:
import pandas as pd

df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8]})

result = pd.DataFrame(index=df.index, columns=df.columns)
for col in df.columns:
    for index in df.index:
        result.loc[index, col] = df[col].shift(index).max()

print(result)
Output:
   A  B
0  3  8
1  3  7
2  3  6
3  3  5
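As a side note (this is an added sketch, not part of the original answer), once the indexing is fixed you can also skip the pre-allocated result and build the frame in one go. The values come out as floats here, since shift introduces NaN before max is taken:
# Build the shifted-max table with a dict comprehension instead of
# assigning cell by cell; assumes the same df as above
result = pd.DataFrame(
    {col: [df[col].shift(i).max() for i in df.index] for col in df.columns},
    index=df.index,
)
print(result)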

How to prevent information loss when downcasting floats and integers using pandas.to_numeric() in python

In order to save memory, I started looking into downcasting numeric column types in pandas.
In the quest of saving memory, I would like to convert object columns to e.g. float32 or float16 instead of the automatic standard float64, or int32, int16, or int8 instead of (the automatic integer standard format) int64 etc.
However, this means that high numbers cannot be displayed or saved correctly when certain values within the column/series exceed specific limits. More details on this can be seen in the data type docs.
For instance int16 stands for Integer (-32768 to 32767).
While playing around with extremely large numbers, I figured that pd.to_numeric() doesn't have any means to prevent such very high numbers from being coerced to a placeholder called inf which can also be produced manually via float("inf").
In the following example, I demonstrate that one specific value in the first column, namely 100**100, is only represented correctly in the float64 format, not in float32. My concern in particular is that pd.to_numeric(downcast="float") doesn't tell the user that it converts such high numbers to inf behind the scenes, which leads to a silent loss of information. That is clearly undesirable, even if memory can be saved this way.
In[45]:
# Construct an example dataframe
df = pd.DataFrame({"Numbers": [100**100, 6, 8], "Strings": ["8.0", "6", "7"]})
# Print out user info
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Numbers  3 non-null      object
 1   Strings  3 non-null      object
dtypes: object(2)
memory usage: 176.0+ bytes
None
# Undesired result obtained by downcasting
pd.to_numeric(df["Numbers"], errors="raise", downcast="float")
Out[46]:
0 inf
1 6.0
2 8.0
Name: Numbers, dtype: float32
# Correct result without downcasting
pd.to_numeric(df["Numbers"], errors="raise")
Out[47]:
0 1.000000e+200
1 6.000000e+00
2 8.000000e+00
Name: Numbers, dtype: float64
I would strongly prefer that pd.to_numeric() avoid automatically coercing values to inf, since this signifies a loss of information. It seems like its priority is simply to save memory no matter what.
There should be a built-in method to avoid this coercion producing information loss.
Of course, I could test it afterwards and convert it to the highest precision as a corrective measure, like so:
In[61]:
# Save to a temporary "dummy" series; otherwise the infinity values would
# overwrite the real values and the information would already be lost
dummy_series = pd.to_numeric(df["Numbers"], errors="raise", downcast="float")

## Check for the presence of undesired inf-values ##
# i) inf-values present: avoid downcasting
if float("inf") in dummy_series.values:
    print("\nInfinity values are present!\nTry again without downcasting.\n")
    df["Numbers"] = pd.to_numeric(df["Numbers"], errors="raise")
# ii) If there is no inf-value, adopt the downcasted series as is
else:
    df["Numbers"] = dummy_series
# Check result
print(df["Numbers"])
Out[62]:
Infinity values are present!
Try again without downcasting.
0 1.000000e+200
1 6.000000e+00
2 8.000000e+00
Name: Numbers, dtype: float64
This doesn't seem very pythonic to me, though, and I bet there must be a better built-in solution, either in pandas or in numpy directly.
For float16, float32, and float64, the maximum representable values are known. So you can just look at the maximum value in the data and decide the dtype based on that:
import numpy as np
import pandas as pd

cases = [[1e100, 6, 8],
         [10**100, 6, 8],
         [1e36, 6, 8],
         [-32760, 6, 8],
         [10**500, 6, 8],
         ]
maxfloats = [(65504, np.float16), (3.402e38, np.float32), (1.797e308, np.float64)]

for input_list in cases:
    input_s = pd.Series(np.array(input_list, dtype=object))
    maxval = np.abs(input_s).max()
    for dtype_max, dtype in maxfloats:
        if maxval < dtype_max:
            break
    else:
        dtype = object
    out_array = np.array(input_s, dtype=dtype)
    out_s = pd.Series(out_array)
    print(f'Input:\n{input_s}\nOutput:\n{out_s}\n----')
Result:
Input:
0 1e+100
1 6
2 8
dtype: object
Output:
0 1.000000e+100
1 6.000000e+00
2 8.000000e+00
dtype: float64
----
Input:
0 1000000000000000000000000000000000000000000000...
1 6
2 8
dtype: object
Output:
0 1.000000e+100
1 6.000000e+00
2 8.000000e+00
dtype: float64
----
Input:
0 1e+36
1 6
2 8
dtype: object
Output:
0 1.000000e+36
1 6.000000e+00
2 8.000000e+00
dtype: float32
----
Input:
0 -32760
1 6
2 8
dtype: object
Output:
0 -32768.0
1 6.0
2 8.0
dtype: float16
----
Input:
0 1000000000000000000000000000000000000000000000...
1 6
2 8
dtype: object
Output:
0 1000000000000000000000000000000000000000000000...
1 6
2 8
dtype: object
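If you want that check as a reusable step, a minimal sketch of the same idea (this helper is my addition, not part of the original answer) could look like this:
import pandas as pd

def smallest_safe_float(s):
    # Cast s to the narrowest float dtype that can hold its largest absolute
    # value; leave it as object if even float64 would overflow to inf.
    maxval = max(abs(x) for x in s)  # plain Python max avoids any overflow
    for dtype_max, dtype in [(65504, 'float16'),
                             (3.402e38, 'float32'),
                             (1.797e308, 'float64')]:
        if maxval < dtype_max:
            return s.astype(dtype)
    return s  # too large even for float64

df["Numbers"] = smallest_safe_float(df["Numbers"])
Keep in mind that very small dtypes such as float16 can still round values near their limit, as the -32760 example above shows.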

my pandas dataframe is not filterable by a column condition

I am trying to show only the rows where the values in column A are greater than 0. I applied the following code, but I am not getting the right DataFrame back. Why?
In: df.info()
Out:
A    non-null int64
B    non-null int64

In: df['A'] > 0
Out:
   A  B
   5  1
   0  0
Obviously, the second row should NOT show. What is going on here?
The condition as you wrote it is actually a filter (also called a mask or predicate): on its own it evaluates to a boolean Series, not to the filtered rows. You can take that filter and apply it to the DataFrame to get the actual rows:
In [1]: from pandas import DataFrame
In [2]: df = DataFrame({'A': range(5), 'B': ['a', 'b', 'c', 'd', 'e']})
In [3]: df
Out[3]:
   A  B
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e
In [4]: df['A'] > 2
Out[4]:
0    False
1    False
2    False
3     True
4     True
Name: A, dtype: bool
In [5]: df[df['A'] > 2]
Out[5]:
   A  B
3  3  d
4  4  e
Another way to do the same thing is to use query():
In [6]: df.query('A > 2')
Out[6]:
   A  B
3  3  d
4  4  e
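The same mask also works with loc, which is handy when you only need certain columns (a small added example, not part of the original answer):
In [7]: df.loc[df['A'] > 2, 'B']
Out[7]:
3    d
4    e
Name: B, dtype: object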
