I need to do data validation for a range: check whether the column values are within a given range, and if a value is greater or less than the bounds, an error should occur and display the row number or index where the error occurred.
my data is as follows:
Draft_Fore
12
14
87
16
90
It should produce an error for the values 87 and 90, as I have taken the valid range of the column to be greater than 5 and less than 20.
The code which I have tried is as follows:
def validate_rating(Draft_Fore):
    Draft_Fore = int(Draft_Fore)
    if Draft_Fore > 5 and Draft_Fore <= 20:
        return True
    return False
import pandas as pd

df = pd.read_csv("/home/anu/Desktop/dr.csv")
for i, Draft_Fore in enumerate(df):
    try:
        validate_rating(Draft_Fore)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, Draft_Fore))
        print(e)
I want to print the location in the row where the error occurred.
A little explanation to clarify my comment. Assuming your dataframe looks like
df = pd.DataFrame({'col1': [12, 14, 87, 16, 90]})
you could do
def check_in_range(v, lower_lim, upper_lim):
    if lower_lim < v <= upper_lim:
        return True
    return False
lower_lim, upper_lim = 5, 20
for i, v in enumerate(df['col1']):
    if not check_in_range(v, lower_lim, upper_lim):
        print(f"value {v} at index {i} is out of range!")
# --> gives you
value 87 at index 2 is out of range!
value 90 at index 4 is out of range!
So your check function is basically fine. However, if you call enumerate on a df, the values you get will be the column names. What you need is to enumerate the specific column.
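To see the difference, here is a quick check on the toy frame above:
list(enumerate(df))
# [(0, 'col1')]  # iterating a DataFrame yields its column names
list(enumerate(df['col1']))
# [(0, 12), (1, 14), (2, 87), (3, 16), (4, 90)]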
Concerning your idea to raise an exception, I'd suggest having a look at raise and assert.
So you could e.g. use raise:
for i, v in enumerate(df['col1']):
    if not check_in_range(v, lower_lim, upper_lim):
        raise ValueError(f"value {v} at index {i} is out of range")
# --> gives you
ValueError: value 87 at index 2 is out of range
or assert:
for i, v in enumerate(df['col1']):
    assert v > lower_lim and v <= upper_lim, f"value {v} at index {i} is out of range"
# --> gives you
AssertionError: value 87 at index 2 is out of range
Note: If you have a df, why not use its features for convenience? To get the in-range values of the column, you could just do
df[(df['col1'] > lower_lim) & (df['col1'] <= upper_lim)]
# --> gives you
col1
0 12
1 14
3 16
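And to report the out-of-range rows with their indices, you can invert the same mask (a small sketch on the toy frame above):
mask = (df['col1'] > lower_lim) & (df['col1'] <= upper_lim)
for i, v in df.loc[~mask, 'col1'].items():
    print(f"value {v} at index {i} is out of range!")
# value 87 at index 2 is out of range!
# value 90 at index 4 is out of range!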
Related
I am trying to create a dataframe with Python's pandas library, using data obtained from a requests response. The problem is that when an item is not available on the API, it raises a KeyError and crashes the program.
The source dataframe is iterated over by product name. For each row, the code takes the product name, finds how many different SKUs exist, creates a row in a new dataframe for each SKU, and adds some quantities and other needed information to the new dataframe. The idea is to have the row from the first dataframe, with ALL the same information, repeated once per SKU, updated with the quantity and package ID for that SKU.
If the length of the response returned is 0, I still want to append the row from the first dataframe.
def create_additional_rows_needed(comb_data):
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logging.debug("test")

    new_combined_data = pd.DataFrame(columns=comb_data.columns)
    COVA_DATA_LEN = 2993
    row = 0
    current_item = ''
    while row < len(comb_data):
        number_of_skus = 0
        current_item = comb_data.iloc[row, 1]
        if (len(current_item)) is not None:
            number_of_skus = len(find_gb_product(current_item))
        else:
            number_of_skus = 0
        current_quantity = find_gb_product(current_item).iloc[number_of_skus - 1, find_gb_product(current_item).columns.get_loc('quantity')]
        logger.info('Current Quantity: {}'.format(current_quantity))
        current_package = find_gb_product(current_item)['lot_number'][number_of_skus - 1]
        if number_of_skus == 0:
            pass
        while number_of_skus > 0:
            logger.info('Current Item: {}'.format(current_item))
            logger.info('Number of Skus: {}'.format(number_of_skus))
            logger.info('Appending: {}'.format(comb_data.iloc[row, 1]))
            new_combined_data = new_combined_data.append([comb_data.iloc[row, :]])
            new_combined_data.iloc[-1, new_combined_data.columns.get_loc('TotalOnHand')] = current_quantity
            new_combined_data.iloc[-1, new_combined_data.columns.get_loc('PackageId')] = current_package
            number_of_skus = number_of_skus - 1
        logger.info('Finished index {}'.format(row))
        row = row + 1
        logger.info('Moving to index {}'.format(row))
    return new_combined_data
It goes well for every item with the exception of a few. Here is the error I get.
KeyError
2889 return self._engine.get_loc(casted_key)
2890 except KeyError as err:
-> 2891 raise KeyError(key) from err
2892
2893 if tolerance is not None:
KeyError: 'quantity'
This has taken up my entire weekend and all my sleep and is due Monday Morning at 10am MST with only two days notice. Please help me.
Catching the error and continuing should work. Something along the lines of:
while row < len(comb_data):
    ....
    try:
        current_quantity = find_gb_product(current_item).iloc[number_of_skus - 1, find_gb_product(current_item).columns.get_loc('quantity')]
    except KeyError:
        # skip this item, but advance row first or the loop will never progress
        row = row + 1
        continue
    ....
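An alternative sketch is to check for the column up front (assuming find_gb_product returns a DataFrame, as in the question; the fallback value here is a hypothetical default):
sku_df = find_gb_product(current_item)
if 'quantity' in sku_df.columns:
    current_quantity = sku_df.iloc[number_of_skus - 1, sku_df.columns.get_loc('quantity')]
else:
    current_quantity = 0  # hypothetical default when the response has no quantity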
I'm trying to learn machine learning and I need to fill in the missing values for the cleaning stage of the workflow. I have 13 columns and need to impute the values for 8 of them. One column is called Dependents and I want to fill in the blanks with the word missing, and change the cells that do contain data as follows: 1 to one, 2 to two, 3 to three and 3+ to threePlus.
I'm running the program in Anaconda and the name of the dataframe is train.
train.columns
this gives me
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
dtype='object')
next
print("Dependents")
print(train['Dependents'].unique())
this gives me
Dependents
['0' '1' '2' '3+' nan]
now i try imputing values as stated
def impute_dependent():
    my_dict = {'1':'one','2':'two','3':'three','3+':'threePlus'}
    return train.Dependents.map(my_dict).fillna('missing')

def convert_data(dataset):
    temp_data = dataset.copy()
    temp_data['Dependents'] = temp_data[['Dependents']].apply(impute_dependent, axis=1)
    return temp_data
this gives the error
TypeError Traceback (most recent call last)
<ipython-input-46-ccb1a5ea7edd> in <module>()
4 return temp_data
5
----> 6 train_dataset = convert_data(train)
7 #test_dataset = convert_data(test)
<ipython-input-46-ccb1a5ea7edd> in convert_data(dataset)
1 def convert_data(dataset):
2 temp_data = dataset.copy()
----> 3 temp_data['Dependents'] = temp_data[['Dependents']].apply(impute_dependent,axis=1)
4 return temp_data
5
D:\Anaconda2\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6002 args=args,
6003 kwds=kwds)
-> 6004 return op.get_result()
6005
6006 def applymap(self, func):
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
TypeError: ('impute_dependent() takes 0 positional arguments but 1 was given', 'occurred at index 0')
I expected one, two, three and threePlus to replace the existing values, and missing to fill in the blanks.
Would this do?
my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus', np.nan: 'missing'}

def convert_data(dataset):
    temp_data = dataset.copy()
    temp_data.Dependents = temp_data.Dependents.map(my_dict)
    return temp_data
As a side note, part of your problem might be the use of apply: essentially, apply passes data through a function and puts in what comes out. I might be wrong, but I think your function needs to take the input given by apply, e.g.:
def impute_dependent(dep):
    my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus', np.nan: 'missing'}
    return my_dict[dep]

df.dependents = df.dependents.apply(impute_dependent)
This way, for every value in df.dependents, apply will take that value and give it to impute_dependent as an argument, then take the returned value as output. As is, when I trial your code I get an error because impute_dependent takes no arguments.
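A quick end-to-end check of the map-and-fillna approach on toy data (values taken from the question's unique() output; note '0' is not in the dict, so it also becomes 'missing' unless you add it):
import numpy as np
import pandas as pd

train = pd.DataFrame({'Dependents': ['0', '1', '2', '3+', np.nan]})
my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus'}
train['Dependents'] = train['Dependents'].map(my_dict).fillna('missing')
print(train['Dependents'].tolist())
# ['missing', 'one', 'two', 'threePlus', 'missing']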
I've just started coding in python, and my general coding skills are fairly rusty :( so please be a bit patient
I have a pandas dataframe. It has around 3m rows. There are 3 kinds of age_units: Y, D, W for years, days and weeks. Any individual over 1 year old has an age unit of Y, and the first grouping I want is <2y old, so all I have to test for in Age_units is Y...
I want to create a new column AgeRange and populate with the following ranges:
<2
2 - 18
18 - 35
35 - 65
65+
so I wrote a function
def agerange(values):
    for i in values:
        if complete.Age_units == 'Y':
            if complete.Age > 1 AND < 18 return '2-18'
            elif complete.Age > 17 AND < 35 return '18-35'
            elif complete.Age > 34 AND < 65 return '35-65'
            elif complete.Age > 64 return '65+'
        else return '< 2'
I thought if I passed in the dataframe as a whole I would get back what I needed and then could create the column I wanted something like this:
agedetails['age_range'] = ageRange(agedetails)
BUT when I try to run the first code to create the function I get:
File "<ipython-input-124-cf39c7ce66d9>", line 4
if complete.Age > 1 AND complete.Age < 18 return '2-18'
^
SyntaxError: invalid syntax
Clearly it is not accepting the AND - but I thought I heard in class I could use AND like this? I must be mistaken but then what would be the right way to do this?
So after getting that error, I'm not even sure the method of passing in a dataframe will throw an error either. I am guessing probably yes. In which case - how would I make that work as well?
I am looking to learn the best method, but part of the best method for me is keeping it simple even if that means doing things in a couple of steps...
With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.
Pandas: pd.cut
As #JonClements suggests, you can use pd.cut for this, the benefit here being that your new column becomes a Categorical.
You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.
bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']
df['AgeRange'] = pd.cut(df['Age'], bins, labels=names)
print(df.dtypes)
# Age int64
# Age_units object
# AgeRange category
# dtype: object
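For a self-contained run of the pd.cut approach (a minimal sketch; the toy ages and column names are assumed from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [1, 10, 25, 50, 70], 'Age_units': ['Y'] * 5})
bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']
df['AgeRange'] = pd.cut(df['Age'], bins, labels=names)
print(df)
#    Age Age_units AgeRange
# 0    1         Y       <2
# 1   10         Y     2-18
# 2   25         Y    18-35
# 3   50         Y    35-65
# 4   70         Y      65+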
NumPy: np.digitize
np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map your category names.
Note that for boundary cases the lower bound is used for mapping to a bin.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],
                   'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})
bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']
d = dict(enumerate(names, 1))
df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))
Result
Age Age_units AgeRange
0 99 Y 65+
1 53 Y 35-65
2 71 Y 65+
3 84 Y 65+
4 84 Y 65+
I need to iterate over the column 'movies_rated', check each value against the conditions, and write a value into a newly created column 'expert_level'. When I test on a subset of the data, it works. But when I run it against my whole dataset, the column only gets filled with the value 1.
for num in df_merge['movies_rated']:
    if num in range(20,31):
        df_merge['expert_level'] = 1
    elif num in range(31,53):
        df_merge['expert_level'] = 2
    elif num in range(53,99):
        df_merge['expert_level'] = 3
    elif num in range(99,202):
        df_merge['expert_level'] = 4
    else:
        df_merge['expert_level'] = 5
here's a sample dataframe.
movies = [88,20,35,55,1203,99,2222,847]
name = ['angie','chris','pine','benedict','alice','spock','tony','xena']
df = pd.DataFrame(movies,name,columns=['movies_rated'])
Surely there's a less verbose way of doing this?
You could build an IntervalIndex and then apply pd.cut. I'm sure this is a duplicate, but I can't find one right now which uses both closed='left' and .codes, though I'm sure it exists.
bins = pd.IntervalIndex.from_breaks([0, 20, 31, 53, 99, 202, np.inf], closed='left')
df["expert_level"] = pd.cut(movies, bins).codes
which gives me
In [242]: bins
Out[242]:
IntervalIndex([[0.0, 20.0), [20.0, 31.0), [31.0, 53.0), [53.0, 99.0), [99.0, 202.0), [202.0, inf)]
closed='left',
dtype='interval[float64]')
and
In [243]: df
Out[243]:
movies_rated expert_level
angie 88 3
chris 20 1
pine 35 2
benedict 55 3
alice 1203 5
spock 99 4
tony 2222 5
xena 847 5
Note that I've set this up so that scores below 20 get a 0 value, so they can be distinguished from really high rankings. If you really want everything outside the bins to get 5, it'd be straightforward to remap 0 to 5, or just pass breaks of [20, 31, 53, 99, 202] and then map anything with a code of -1 (which means 'not binned') to 5.
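For instance, that remap is a one-liner on the resulting codes (a sketch, reusing the assignment from the snippet above):
df["expert_level"] = df["expert_level"].replace(0, 5)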
I think np.select with the pandas function between is a good choice for you:
conds = [df.movies_rated.between(20,30), df.movies_rated.between(31,52),
df.movies_rated.between(53,98), df.movies_rated.between(99,202)]
choices = [1,2,3,4]
df['expert_level'] = np.select(conds,choices, 5)
>>> df
movies_rated expert_level
angie 88 3
chris 20 1
pine 35 2
benedict 55 3
alice 1203 5
spock 99 4
tony 2222 5
xena 847 5
You could do it with apply and a function:
def expert_level_check(num):
    if 20 <= num < 31:
        return 1
    elif 31 <= num < 53:
        return 2
    elif 53 <= num < 99:
        return 3
    elif 99 <= num < 202:
        return 4
    else:
        return 5
df['expert_level'] = df['movies_rated'].apply(expert_level_check)
It is slower to manually iterate over a df; I recommend reading this.
I am new to Python and I am getting an error while executing the below piece of code. Would really appreciate it if anybody could help me understand it.
About the data: the dataframe is stored in "train" and the column name is "neighborhood". Values in "neighborhood" look like "#Queens#jackson heights" or "#Manhattan#uppereast side". So I am trying to split on the hashtags and then consider only the first word in each row (i.e. Queens, Manhattan, etc.).
It does print the expected output but with this error:
IndexError Traceback (most recent call last)
<ipython-input-89-b199ce84fe1c> in <module>()
5 for row in train['neighborhood'].str.split('#'):
6 # if more than a value,
----> 7 if len(row[1]) == 5 :
8 # Append a num grade
9 grades.append('1')
IndexError: list index out of range
train = pd.DataFrame(train, columns = ['id','listing_type','floor','latitude','longitude','price','beds','baths','total_rooms','square_feet','pet_details','neighborhood'])
# Create a list to store the data
grades = []

# For each row in the column,
for row in train['neighborhood'].str.split('#'):
    # if more than a value,
    if row[1] == 'Queens':
        # Append a num grade
        grades.append('1')
    # else, if more than a value,
    elif row[1] == 'Manhattan':
        # Append a letter grade
        grades.append('2')
    # else, if more than a value,
    elif row[1] == 'Bronx':
        # Append a letter grade
        grades.append('3')
    # else, if more than a value,
    elif row[1] == 'Brooklyn':
        # Append a letter grade
        grades.append('4')
    # else,
    else:
        # Append a failing grade
        grades.append('5')
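The IndexError suggests some rows have no '#', so the split yields a single-element list and row[1] does not exist. A minimal sketch of a guard before indexing (same grades as above; toy logic, not a definitive fix):
grades = []
for row in train['neighborhood'].str.split('#'):
    # '#Queens#jackson heights' -> ['', 'Queens', 'jackson heights']
    # 'no hashtag'              -> ['no hashtag']  (len 1, so row[1] would fail)
    borough = row[1] if len(row) > 1 else None
    if borough == 'Queens':
        grades.append('1')
    elif borough == 'Manhattan':
        grades.append('2')
    elif borough == 'Bronx':
        grades.append('3')
    elif borough == 'Brooklyn':
        grades.append('4')
    else:
        grades.append('5')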