Delete rows of a pandas data frame having string values in Python 3.4.1

I have read a CSV file with pandas read_csv; it has 8 columns, and each column may contain int/string/float values. I want to remove the rows that contain string values and return a data frame with only numeric values in it. A sample of the CSV is attached below.
I have tried to run this following code:
import pandas as pd
import numpy as np
df = pd.read_csv('new200_with_errors.csv',
                 dtype={'Geo_Level_1': int, 'Geo_Level_2': int, 'Geo_Level_3': int,
                        'Product_Level_1': int, 'Product_Level_2': int,
                        'Product_Level_3': int, 'Total_Sale': float})
print(df)
but I get the following error:
TypeError: unorderable types: NoneType() > int()
I am running with python 3.4.1.
Here is the sample csv.
Geo_L_1,Geo_L_2,Geo_L_3,Pro_L_1,Pro_L_2,Pro_L_3,Date,Sale
1, 2, 3, 129, 1, 5193316745, 1/1/2012, 9
1 ,2, 3, 129, 1, 5193316745, 1/1/2013,
1, 2, 3, 129, 1, 5193316745, , 8
1, 2, 3, 129, NA, 5193316745, 1/10/2012, 10
1, 2, 3, 129, 1, 5193316745, 1/10/2013, 4
1, 2, 3, ghj, 1, 5193316745, 1/10/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/11/2012, 4
1, 2, 3, 129, 1, ghgj, 1/11/2013, 2
1, 2, 3, 129, 1, 5193316745, 1/11/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/12/2012, ghgj
1, 2, 3, 129, 1, 5193316745, 1/12/2013, 5

The way I would approach this is to try to convert the columns to int with a small user function, using try/except to handle values that cannot be coerced; those get set to NaN. Then drop the row with the empty date value. (For some reason the empty value actually had a length of 1 when I tested this with your data; len 0 may work for you.)
In [42]:
# simple function that tries to convert the type; returns NaN if the value cannot be coerced
def func(x):
    try:
        return int(x)
    except ValueError:
        return np.nan

# assign multiple columns
df['Pro_L_1'], df['Pro_L_3'], df['Sale'] = df['Pro_L_1'].apply(func), df['Pro_L_3'].apply(func), df['Sale'].apply(func)
# drop the 'empty' date row; take a copy() so we don't get a warning
df = df.loc[df['Date'].str.len() > 1].copy()
# convert the string to a datetime; if we hadn't dropped the row, the empty value would become today's date
df['Date'] = pd.to_datetime(df['Date'])
# now convert all the object columns that hold numbers to a numeric dtype
# (convert_objects was later deprecated and removed; see the to_numeric sketch below)
df = df.convert_objects(convert_numeric=True)
# check the dtypes
df.dtypes
Out[42]:
Geo_L_1 int64
Geo_L_2 int64
Geo_L_3 int64
Pro_L_1 float64
Pro_L_2 float64
Pro_L_3 float64
Date datetime64[ns]
Sale float64
dtype: object
In [43]:
# display the current situation
df
Out[43]:
Geo_L_1 Geo_L_2 Geo_L_3 Pro_L_1 Pro_L_2 Pro_L_3 Date Sale
0 1 2 3 129 1 5193316745 2012-01-01 9
1 1 2 3 129 1 5193316745 2013-01-01 NaN
3 1 2 3 129 NaN 5193316745 2012-01-10 10
4 1 2 3 129 1 5193316745 2013-01-10 4
5 1 2 3 NaN 1 5193316745 2014-01-10 6
6 1 2 3 129 1 5193316745 2012-01-11 4
7 1 2 3 129 1 NaN 2013-01-11 2
8 1 2 3 129 1 5193316745 2014-01-11 6
9 1 2 3 129 1 5193316745 2012-01-12 NaN
10 1 2 3 129 1 5193316745 2013-01-12 5
In [44]:
# drop the rows
df.dropna()
Out[44]:
Geo_L_1 Geo_L_2 Geo_L_3 Pro_L_1 Pro_L_2 Pro_L_3 Date Sale
0 1 2 3 129 1 5193316745 2012-01-01 9
4 1 2 3 129 1 5193316745 2013-01-10 4
6 1 2 3 129 1 5193316745 2012-01-11 4
8 1 2 3 129 1 5193316745 2014-01-11 6
10 1 2 3 129 1 5193316745 2013-01-12 5
For the last line, assign the result back: df = df.dropna()
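On current pandas, convert_objects no longer exists, and pd.to_numeric with errors='coerce' turns anything non-numeric into NaN in one pass. A minimal sketch of that shorter route, assuming the same sample CSV:
import pandas as pd

# skipinitialspace strips the blanks after the commas in the sample
df = pd.read_csv('new200_with_errors.csv', skipinitialspace=True)
# coerce every non-date column to numeric; strings like 'ghj' become NaN
num_cols = [c for c in df.columns if c != 'Date']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
# invalid or empty dates become NaT
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
# keep only the fully numeric rows
df = df.dropna()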

Related

Truncate and re-number a column that corresponds to a specific id/group by using Python

I have a dataset given as such in Python:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3],
        'runs': [6, 6, 6, 6, 6, 6, 7, 8, 9, 10, 3, 3, 3, 4, 5, 6, 5, 5, 5, 5, 5, 6, 7, 8],
        'Children': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No',
                     'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                     'No', 'Yes', 'Yes', 'No'],
        'Days': [123, 128, 66, 120, 141, 123, 128, 66, 120, 141, 52, 96, 120, 141, 52, 96,
                 120, 141, 123, 15, 85, 36, 58, 89]}
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
Here, for every 'id', I wish to drop the rows where 'runs' is repeated and make the numbering continuous within that id. For example:
For id=1, truncate the 'runs' at 6 and renumber the remaining rows starting from 1.
For id=2, truncate the 'runs' at 3 and renumber the remaining rows starting from 1.
For id=3, truncate the 'runs' at 5 and renumber the remaining rows starting from 1.
Can somebody please let me know how to achieve this task in Python?
Filter out the duplicates with duplicated (keep=False flags every member of a duplicated group, so all repeated (id, runs) pairs are removed), then renumber with groupby.cumcount:
out = (df[~df.duplicated(subset=['id', 'runs'], keep=False)]
         .assign(runs=lambda d: d.groupby(['id']).cumcount().add(1))
      )
Output:
id runs Children Days
6 1 1 Yes 128
7 1 2 Yes 66
8 1 3 Yes 120
9 1 4 No 141
13 2 1 Yes 141
14 2 2 Yes 52
15 2 3 Yes 96
21 3 1 Yes 36
22 3 2 Yes 58
23 3 3 No 89
You can loop over each id and its run cutoff; on each iteration, select the matching segment of the original dataframe by its id and runs values, renumber it, and append it to the final dataframe:
df_truncated = pd.DataFrame(columns=df.columns)
for id_, run_cutoff in zip([1, 2, 3], [6, 3, 5]):
    # keep only the rows past the cutoff, then renumber from 1
    df_chunk = df[(df['id'] == id_) & (df['runs'] > run_cutoff)].copy()
    df_chunk['runs'] = range(1, len(df_chunk) + 1)
    df_truncated = pd.concat([df_truncated, df_chunk])
Result:
id runs Children Days
6 1 1 Yes 128
7 1 2 Yes 66
8 1 3 Yes 120
9 1 4 No 141
13 2 1 Yes 141
14 2 2 Yes 52
15 2 3 Yes 96
21 3 1 Yes 36
22 3 2 Yes 58
23 3 3 No 89
def function1(dd: pd.DataFrame):
    # drop every run that appears more than once, then renumber by rank
    dd1 = dd.drop_duplicates(subset='runs', keep=False)
    return dd1.assign(runs=dd1.runs.rank().astype(int))

df.groupby('id').apply(function1).reset_index(drop=True)
out:
id runs Children Days
0 1 1 Yes 128
1 1 2 Yes 66
2 1 3 Yes 120
3 1 4 No 141
4 2 1 Yes 141
5 2 2 Yes 52
6 2 3 Yes 96
7 3 1 Yes 36
8 3 2 Yes 58
9 3 3 No 89

How to change rows to columns in Python

I want to convert my dataframe rows to columns and take the last value of the last column.
Here is my dataframe:
df=pd.DataFrame({'flag_1':[1,2,3,1,2,500],'dd':[1,1,1,7,7,8],'x':[1,1,1,7,7,8]})
print(df)
flag_1 dd x
0 1 1 1
1 2 1 1
2 3 1 1
3 1 7 7
4 2 7 7
5 500 8 8
df_out:
1 2 3 1 2 500 1 1 1 7 7 8 8
Assuming you want a list as output, you can mask the initial values of the last column and stack:
import numpy as np
out = (df
       .assign(**{df.columns[-1]: np.r_[[pd.NA]*(len(df)-1), [df.iloc[-1, -1]]]})
       .T.stack().to_list()
      )
Output:
[1, 2, 3, 1, 2, 500, 1, 1, 1, 7, 7, 8, 8]
For a wide dataframe with a single row, use .to_frame().T in place of to_list() (here with a MultiIndex):
  flag_1                  dd                x
       0  1  2  3  4    5   0  1  2  3  4  5  5
0      1  2  3  1  2  500   1  1  1  7  7  8  8
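If the goal is just that flat list, a plainer route (a sketch, assuming the same three-column df as above) is to concatenate the column values directly:
# all of flag_1 and dd, plus only the last value of x
out = df['flag_1'].tolist() + df['dd'].tolist() + [df['x'].iloc[-1]]
# [1, 2, 3, 1, 2, 500, 1, 1, 1, 7, 7, 8, 8]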

Count positive, negative, or zero values for multiple columns in Python

Given a dataset as follows:
[{'id': 1, 'ltp': 2, 'change': nan},
{'id': 2, 'ltp': 5, 'change': 1.5},
{'id': 3, 'ltp': 3, 'change': -0.4},
{'id': 4, 'ltp': 0, 'change': 2.0},
{'id': 5, 'ltp': 5, 'change': -0.444444},
{'id': 6, 'ltp': 16, 'change': 2.2}]
Or
id ltp change
0 1 2 NaN
1 2 5 1.500000
2 3 3 -0.400000
3 4 0 2.000000
4 5 5 -0.444444
5 6 16 2.200000
I would like to count the number of positive, negative, and zero values in the columns ltp and change; the result might look like this:
columns positive negative zero
0 ltp 5 0 1
1 change 3 2 0
How could I do that with Pandas or Numpy? Thanks.
Update: what if I need to group by type and count following the same logic as above?
id ltp change type
0 1 2 NaN a
1 2 5 1.500000 a
2 3 3 -0.400000 a
3 4 0 2.000000 b
4 5 5 -0.444444 b
5 6 16 2.200000 b
The expected output:
type columns positive negative zero
0 a ltp 3 0 0
1 a change 1 1 0
2 b ltp 2 0 1
3 b change 2 1 0
Use np.sign on the selected columns first, then count the values with value_counts, transpose, fill the missing values, and finally rename the column names via a dictionary and convert the index to a 'columns' column:
d= {-1:'negative', 1:'positive', 0:'zero'}
df = (np.sign(df[['ltp','change']])
.apply(pd.value_counts)
.T
.fillna(0)
.astype(int)
.rename(columns=d)
.rename_axis('columns')
.reset_index())
print (df)
columns negative zero positive
0 ltp 0 1 5
1 change 2 0 3
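One note: top-level pd.value_counts is deprecated in recent pandas (2.1+). The same chain, sketched here with a per-column lambda instead, avoids the warning:
df_counts = (np.sign(df[['ltp', 'change']])
             .apply(lambda s: s.value_counts())
             .T
             .fillna(0)
             .astype(int)
             .rename(columns=d)
             .rename_axis('columns')
             .reset_index())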
EDIT: Another solution, for the added type column, with DataFrame.melt, mapping the signs from np.sign and counting the values with crosstab:
d= {-1:'negative', 1:'positive', 0:'zero'}
df1 = df.melt(id_vars='type', value_vars=['ltp','change'], var_name='columns')
df1['value'] = np.sign(df1['value']).map(d)
df1 = (pd.crosstab([df1['type'],df1['columns']], df1['value'])
.rename_axis(columns=None)
.reset_index())
print (df1)
type columns negative positive zero
0 a change 1 1 0
1 a ltp 0 3 0
2 b change 1 2 0
3 b ltp 0 2 1
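An equivalent without np.sign, sketched against the same df: count the signs directly with gt/lt/eq comparisons, relying on the fact that NaN compares False against everything and so drops out automatically:
cols = df[['ltp', 'change']]
counts = (pd.DataFrame({'positive': cols.gt(0).sum(),
                        'negative': cols.lt(0).sum(),
                        'zero': cols.eq(0).sum()})
          .rename_axis('columns')
          .reset_index())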

Pandas Min and Max Across Rows

I have a dataframe that looks like the one below. I want to get the min and max value per city, along with information about which products were ordered min and max for that city. Please help.
(dataframe screenshot; the data is reconstructed in the last answer below)
db.min(axis=0) - min value for each column
db.min(axis=1) - min value for each row
Use DataFrame.min and DataFrame.max:
DataFrame.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
matrix = [(22, 16, 23),
(33, 50, 11),
(44, 34, 11),
(55, 35, 60),
(66, 36, 13)
]
dfObj = pd.DataFrame(matrix, index=list('abcde'), columns=list('xyz'))
    x   y   z
a  22  16  23
b  33  50  11
c  44  34  11
d  55  35  60
e  66  36  13
Get a series containing the minimum value of each row
minValuesObj = dfObj.min(axis=1)
print('minimum value in each row : ')
print(minValuesObj)
output
minimum value in each row :
a    16
b    11
c    11
d    35
e    13
dtype: int64
MMT Marathi, based on the answers provided by Danil and Sutharp777, you should be able to get to your answer. However, I see you have follow-up questions for them; I'm not sure whether you are looking to create columns holding the min/max value of each row. Here's the full dataframe with the solution. I am merely compiling the answers they have already given:
import pandas as pd
d = [['20in Monitor',2,2,1,2,2,2,2,2,2],
['27in 4k Gaming Monitor',2,1,2,2,1,2,2,2,2],
['27in FHD Monitor',2,2,2,2,2,2,2,2,2],
['34in Ultrawide Monitor',2,1,2,2,2,2,2,2,2],
['AA Batteries (4-pack)',5,5,6,7,6,6,6,6,5],
['AAA Batteries (4-pack)',7,7,8,8,9,7,8,9,7],
['Apple Airpods Headphones',2,2,3,2,2,2,2,2,2],
['Bose SoundSport Headphones',2,2,2,2,3,2,2,3,2],
['Flatscreen TV',2,1,2,2,2,2,2,2,2]]
c = ['Product','Atlanta','Austin','Boston','Dallas','Los Angeles',
'New York City','Portland','San Francisco','Seattle']
df = pd.DataFrame(d,columns=c)
# numeric_only skips the Product strings (required on current pandas, where
# mixed str/int comparisons across a row raise a TypeError)
df['min_value'] = df.min(axis=1, numeric_only=True)
df['max_value'] = df.max(axis=1, numeric_only=True)
print (df)
The output of this will be:
Product Atlanta Austin ... Seattle min_value max_value
0 20in Monitor 2 2 ... 2 1 2
1 27in 4k Gaming Monitor 2 1 ... 2 1 2
2 27in FHD Monitor 2 2 ... 2 2 2
3 34in Ultrawide Monitor 2 1 ... 2 1 2
4 AA Batteries (4-pack) 5 5 ... 5 5 7
5 AAA Batteries (4-pack) 7 7 ... 7 7 9
6 Apple Airpods Headphones 2 2 ... 2 2 3
7 Bose SoundSport Headphones 2 2 ... 2 2 3
8 Flatscreen TV 2 1 ... 2 1 2
If you want the min and max of each column, then you can do this:
print ('min of each column :', df.min(axis=0).to_list()[1:])
print ('max of each column :', df.max(axis=0).to_list()[1:])
This will give you:
min of each column : [2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2]
max of each column : [7, 7, 8, 8, 9, 7, 8, 9, 7, 7, 9]
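The original question also asks which product hits the min and max in each city, which none of the snippets above return. A minimal sketch, assuming the df and column list c built above, using idxmin/idxmax on a Product-indexed frame (ties go to the first product):
city_cols = c[1:]  # the nine city columns
by_product = df.set_index('Product')[city_cols]
per_city = pd.DataFrame({'min_product': by_product.idxmin(),
                         'min_value': by_product.min(),
                         'max_product': by_product.idxmax(),
                         'max_value': by_product.max()})
print(per_city)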

pd.Series(pred).value_counts(): how to get the first column in a dataframe?

I apply pd.Series(pred).value_counts() and get this output:
0 2084
-1 15
1 13
3 10
4 7
6 4
11 3
8 3
2 3
9 2
7 2
5 2
10 2
dtype: int64
When I create a list, I get only the second column (the counts):
c_list = list(pd.Series(pred).value_counts())
Out:
[2084, 15, 13, 10, 7, 4, 3, 3, 3, 2, 2, 2, 2]
How do I ultimately get a dataframe that looks like this, including a new column for each class's size as a % of the total size?
df =
class   size  relative_size
    0   2084  x%
   -1     15  y%
    1     13  etc.
    3     10
    4      7
    6      4
   11      3
    8      3
    2      3
    9      2
    7      2
    5      2
   10      2
You are very nearly there. Typing this blind, as you didn't provide a sample input:
df = pd.Series(pred).value_counts().to_frame().reset_index()
df.columns = ['class', 'size']
df['relative_size'] = df['size'] / df['size'].sum()
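That last line leaves relative_size as a fraction; to display it as a percentage string, as in the sketch above, one option:
df['relative_size'] = (100 * df['size'] / df['size'].sum()).round(2).astype(str) + '%'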
