How to use the “na_values='?'” option in the pd.read_csv() function? - python-3.x

I am trying to understand how the na_values='?' option of the pd.read_csv() function works,
so that I can find the rows containing a "?" value and then remove them.

Sample:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv')
df = pd.read_csv(StringIO(temp))
print (df)
id col1 col2 col3
0 1 13? 15 14
1 1 13 15 ?
2 1 12 15 13
3 2 ? 15 ?
4 2 18 15 13
5 2 18? 15 13
If you want to remove rows where ? appears either on its own or as a substring, build a mask with str.contains and then check whether at least one value per row is True with DataFrame.any:
print (df.astype(str).apply(lambda x: x.str.contains('?', regex=False)))
id col1 col2 col3
0 False True False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False True False False
m = ~df.astype(str).apply(lambda x: x.str.contains('?', regex=False)).any(axis=1)
print (m)
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
2 1 12 15 13
4 2 18 15 13
If you want to remove only rows where a cell is exactly ?, simply compare the values:
print (df.astype(str) == '?')
id col1 col2 col3
0 False False False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False False False False
m = ~(df.astype(str) == '?').any(axis=1)
print (m)
0 True
1 False
2 True
3 False
4 True
5 True
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
0 1 13? 15 14
2 1 12 15 13
4 2 18 15 13
5 2 18? 15 13
To replace every standalone ? with NaN, the na_values parameter is necessary; then use dropna if you want to remove all rows with NaNs:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv', na_values='?')
df = pd.read_csv(StringIO(temp), na_values='?')
print (df)
id col1 col2 col3
0 1 13? 15 14.0
1 1 13 15 NaN
2 1 12 15 13.0
3 2 NaN 15 NaN
4 2 18 15 13.0
5 2 18? 15 13.0
df = df.dropna()
print (df)
id col1 col2 col3
0 1 13? 15 14.0
2 1 12 15 13.0
4 2 18 15 13.0
5 2 18? 15 13.0
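Note that na_values='?' converts only cells that are exactly ?, so 13? and 18? survive above. A minimal sketch combining it with the str.contains mask from the first approach, if substring matches should be dropped too:
df = pd.read_csv(StringIO(temp), na_values='?')
#mask rows where ? still appears as a substring (e.g. 13?, 18?)
m = ~df.astype(str).apply(lambda x: x.str.contains('?', regex=False)).any(axis=1)
#keep the masked rows, then drop the rows that got NaN from na_values
df = df[m].dropna()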

na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('some-data.csv', na_values=na_values)
Create a list of the values that should be treated as missing and pass it via na_values when reading the file.

"??" or "####" type of junk values can be converted into missing value, since in python all the blank values can be replaced with nan. Hence you can also replace these type of junk value to missing value by passing them as as list to the parameter
'na_values'.
data_csv = pd.read_csv('test.csv',na_values = ["??"])
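To confirm what was converted, a quick sketch (assuming the same test.csv) counts the resulting missing values per column:
data_csv = pd.read_csv('test.csv', na_values=["??", "####"])
#number of values treated as missing in each column
print (data_csv.isna().sum())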

If you want to remove the rows which contain "?" in a pandas DataFrame, you can try the following.
Suppose you have df:
import pandas as pd
df = pd.read_csv('test.csv')
df:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
3 test?dsfsa 9/15/2016
Check whether column A contains "?" to generate a new df1:
df1 = df[df.A.str.contains(r"\?") == False]
df1 will be:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
which will give you the new df1 which doesn't contain "?".
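Since ? is a regex metacharacter and needs escaping, an equivalent sketch passes regex=False instead:
df1 = df[~df.A.str.contains("?", regex=False)]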

Related

pandas drop rows based on condition on groupby

I have a DataFrame like below.
I am trying to group by the cell column and drop the "NA" values where the group size > 1.
Required output:
How do I get my expected output? How do I filter on a condition and drop rows in a groupby statement?
From your DataFrame, first we group by cell to get the size of each group:
>>> df_grouped = df.groupby(['cell'], as_index=False).size()
>>> df_grouped
cell size
0 A 3
1 B 1
2 D 3
Then we merge the result with the original DataFrame like so:
>>> df_merged = pd.merge(df, df_grouped, on='cell', how='left')
>>> df_merged
cell value kpi size
0 A 5.0 thpt 3
1 A 6.0 ret 3
2 A NaN thpt 3
3 B NaN acc 1
4 D 8.0 int 3
5 D NaN ps 3
6 D NaN yret 3
To finish, we filter the DataFrame to get the expected result:
>>> df_filtered = df_merged[~((df_merged['value'].isna()) & (df_merged['size'] > 1))]
>>> df_filtered[['cell', 'value', 'kpi']]
cell value kpi
0 A 5.0 thpt
1 A 6.0 ret
3 B NaN acc
4 D 8.0 int
Use a boolean mask:
>>> df[df.groupby('cell').cumcount().eq(0) | df['value'].notna()]
cell value kpi
0 A crud thpt
1 A 6 ret
3 B NaN acc
4 D hi int
Details:
m1 = df.groupby('cell').cumcount().eq(0)
m2 = df['value'].notna()
df.assign(keep_at_least_one=m1, keep_notna=m2, keep_rows=m1|m2)
# Output:
cell value kpi keep_at_least_one keep_notna keep_rows
0 A crud thpt True True True
1 A 6 ret False True True
2 A NaN thpt False False False
3 B NaN acc True False True
4 D hi int True True True
5 D NaN ps False False False
6 D NaN yret False False False
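The merge step can also be skipped; a sketch of the same filtering logic using groupby.transform to broadcast each group's size (assuming the same df as above):
#attach each group's size to its rows, then filter as in the first answer
size = df.groupby('cell')['value'].transform('size')
df_filtered = df[~(df['value'].isna() & size.gt(1))]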

Search value in Next Month Record Pandas

Given that i have a df like this:
ID Date Amount
0 a 2014-06-13 12:03:56 13
1 b 2014-06-15 08:11:10 14
2 a 2014-07-02 13:00:01 15
3 b 2014-07-19 16:18:41 22
4 b 2014-08-06 09:39:14 17
5 c 2014-08-22 11:20:56 55
...
129 a 2016-11-06 09:39:14 12
130 c 2016-11-22 11:20:56 35
131 b 2016-11-27 09:39:14 42
132 a 2016-12-11 11:20:56 18
I need to create a column df['Checking'] to show whether the ID appears in the next month or not, and I tried the code below:
df['Checking'] = df.apply(lambda x: check_nextmonth(x.Date, x.ID), axis=1)
where
def check_nextmonth(date, id):
    # True if the same ID has a record dated one month after this row
    x = id in df['ID'][df['Date'].dt.to_period('M') == (date + relativedelta(months=1)).to_period('M')].values
    return x
but it takes too long to process a single row.
How can I improve this code, or is there another way to achieve what I want?
Using pd.to_datetime with timestamp tricks:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df['tmp'] = (df['Date'] - pd.DateOffset(months=1)).dt.month
s = df.groupby('ID').apply(lambda x:x['Date'].dt.month.isin(x['tmp']))
df['Checking'] = s.reset_index(level=0)['Date']
Output:
ID Date Amount tmp Checking
0 a 2014-06-13 12:03:56 13 5 True
1 b 2014-06-15 08:11:10 14 5 True
2 a 2014-07-02 13:00:01 15 6 False
3 b 2014-07-19 16:18:41 16 6 True
4 b 2014-08-06 09:39:14 17 7 False
5 c 2014-08-22 11:20:56 18 7 False
Here's one method: check whether the grouped ID's next month equals the current month + 1, then assign the result back after sorting by ID.
check = df.groupby('ID').apply(lambda x : x['Date'].dt.month.shift(-1) == x['Date'].dt.month+1).stack().values
df = df.sort_values('ID').assign( checking = check).sort_index()
ID Date Amount checking
0 a 2014-06-13 12:03:56 13 True
1 b 2014-06-15 08:11:10 14 True
2 a 2014-07-02 13:00:01 15 False
3 b 2014-07-19 16:18:41 16 True
4 b 2014-08-06 09:39:14 17 False
5 c 2014-08-22 11:20:56 18 False
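Both answers compare bare month numbers, so a December record followed by a January record across a year boundary would not match. A hedged sketch using monthly periods, which handles that edge case:
#True if the same ID has any record in the calendar month right after this row's month
df['Date'] = pd.to_datetime(df['Date'])
df['Checking'] = (
    df.groupby('ID')['Date']
      .apply(lambda s: (s.dt.to_period('M') + 1).isin(s.dt.to_period('M')))
      .reset_index(level=0, drop=True)
)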

pandas dataframe groups check the number of unique values of a column is one but exclude empty strings

I have the following df,
id invoice_no
1 6636
1 6637
2 6639
2 6639
3
3
4 6635
4 6635
4 6635
The invoice_no values for id 3 are all empty strings or spaces; I want to
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
but I also want spaces and empty-string invoice_no values in each group to give same_invoice_no = False; I am wondering how to do that. The result should look like:
id invoice_no same_invoice_no
1 6636 False
1 6637 False
2 6639 True
2 6639 True
3 False
3 False
4 6635 True
4 6635 True
4 6635 True
Empty strings count toward nunique (so id 3 would wrongly equate to True), but NaNs don't. Replace the empty strings with NumPy NaN:
import numpy as np
df.replace('', np.nan, inplace = True)
df['same_invoice_no'] = df.groupby("id")["invoice_no"].transform('nunique') == 1
id invoice_no same_invoice_no
0 1 6636.0 False
1 1 6637.0 False
2 2 6639.0 True
3 2 6639.0 True
4 3 NaN False
5 3 NaN False
6 4 6635.0 True
7 4 6635.0 True
8 4 6635.0 True
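The question also mentions cells containing only spaces, which replace('', np.nan) will not catch; a sketch using a regex replace that treats whitespace-only strings as missing too, without mutating df:
import numpy as np
#empty or whitespace-only invoice_no becomes NaN before counting uniques
s = df['invoice_no'].replace(r'^\s*$', np.nan, regex=True)
df['same_invoice_no'] = s.groupby(df['id']).transform('nunique') == 1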

Combine rows based on index or column

I have three dataframes: df1, df2, df3. I am trying to add a list of ART_UNIT to df1.
df1 is 260846 rows x 4 columns:
Index SYMBOL level not-allocatable additional-only
0 A 2 True False
1 A01 4 True False
2 A01B 5 True False
3 A01B1/00 7 False False
4 A01B1/02 8 False False
5 A01B1/022 9 False False
6 A01B1/024 9 False False
7 A01B1/026 9 False False
df2 is 941516 rows x 2 columns:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
0 A44C27/00 3715
1 A44C27/001 2015
2 A44C27/001 3715
3 A44C27/001 2615
4 A44C27/005 2815
5 A44C27/006 3725
6 A44C27/007 3215
7 A44C27/008 3715
8 F41A33/00 3715
9 F41A33/02 3715
10 F41A33/04 3715
11 F41A33/06 3715
12 G07C13/00 3715
13 G07C13/005 3715
14 G07C13/02 3716
And df3 is the same format as df2, but has 673023 rows x 2 columns
The 'CLASSIFICATION_SYMBOL_CD' in df2 and df3 are not unique.
For each 'CLASSIFICATION_SYMBOL_CD' in df2 and df3, I want to find the same string in df1 'SYMBOL' and add a new column to df1 'ART_UNIT' that contains all of the 'ART_UNIT' from df2 and df3.
For example, in df2, 'CLASSIFICATION_SYMBOL_CD' A44C27/001 has ART_UNIT 2015, 3715, and 2615.
I want to write those ART_UNIT to the correct row in df1 so that it reads:
Index SYMBOL level not-allocatable additional-only ART_UNIT
211 A44C27/001 2 True False [2015, 3715, 2615]
So far, I've tried to group df2/df3 by 'CLASSIFICATION_SYMBOL_CD'
gp = df2.groupby(['CLASSIFICATION_SYMBOL_CD'])
for x in df2['CLASSIFICATION_SYMBOL_CD'].unique():
    df2_g = gp.get_group(x)
Which gives me:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
1354 A61N1/3714 3762
117752 A61N1/3714 3766
347573 A61N1/3714 3736
548026 A61N1/3714 3762
560771 A61N1/3714 3762
566120 A61N1/3714 3766
566178 A61N1/3714 3762
799486 A61N1/3714 3736
802408 A61N1/3714 3736
Since df2 and df3 have the same format, concatenate them first.
import pandas as pd
df = pd.concat([df2, df3])
Then to get the lists of all art units, groupby and apply list.
df = df.groupby('CLASSIFICATION_SYMBOL_CD').ART_UNIT.apply(list).reset_index()
# CLASSIFICATION_SYMBOL_CD ART_UNIT
#0 A44C27/00 [3715]
#1 A44C27/001 [2015, 3715, 2615]
#2 A44C27/005 [2815]
#3 A44C27/006 [3725]
#...
Finally, bring this information to df1 with a merge (you could map or something else too). Rename the column first to have less to clean up after the merge.
df = df.rename(columns={'CLASSIFICATION_SYMBOL_CD': 'SYMBOL'})
df1 = df1.merge(df, on='SYMBOL', how='left')
Output:
Index SYMBOL level not-allocatable additional-only ART_UNIT
0 0 A 2 True False NaN
1 1 A01 4 True False NaN
2 2 A01B 5 True False NaN
3 3 A01B1/00 7 False False NaN
4 4 A01B1/02 8 False False NaN
5 5 A01B1/022 9 False False NaN
6 6 A01B1/024 9 False False NaN
7 7 A01B1/026 9 False False NaN
Sadly, you didn't provide any overlapping SYMBOLs in df1, so nothing merged. But this will work with your full data.
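For reference, a sketch of the map alternative mentioned above, applied before the rename:
#build a SYMBOL -> list-of-art-units lookup and map it onto df1
mapping = df.set_index('CLASSIFICATION_SYMBOL_CD')['ART_UNIT']
df1['ART_UNIT'] = df1['SYMBOL'].map(mapping)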

Skipping every nth row in pandas

I am trying to slice my dataframe by skipping every 4th row. The best way I could get it done is by getting the index of every 4th row and then selecting all the other rows, like below:
df[~df.index.isin(df[::4].index)]
I was wondering if there is a simpler and/or more pythonic way of getting this done.
One possible solution is to create a mask by modulo and filter by boolean indexing:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a':range(10, 30)}, index=range(20))
#print (df)
b = df[np.mod(np.arange(df.index.size),4)!=0]
print (b)
a
1 11
2 12
3 13
5 15
6 16
7 17
9 19
10 20
11 21
13 23
14 24
15 25
17 27
18 28
19 29
Details:
print (np.mod(np.arange(df.index.size),4))
[0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3]
print (np.mod(np.arange(df.index.size),4)!=0)
[False True True True False True True True False True True True
False True True True False True True True]
If the index values are unique, use a slightly changed version of @jpp's solution from the comments:
b = df.drop(df.index[::4])
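An equivalent positional sketch with iloc, which also works when index labels repeat:
b = df.iloc[np.arange(len(df)) % 4 != 0]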
