Filtering columns in a pandas dataframe - python-3.x

I have a dataframe with the following column.
A
55B
<lhggkkk>
66c
dggfhhjjjj
I need to filter the records that start with a number (such as 55B and 66c) separately from the others. Can anyone please help?

Try:
import pandas as pd

df = pd.DataFrame()
df['A'] = ['55B', '<lhggkkk>', '66c', 'dggfhhjjjj']
# True if the first character of each value is a digit
df['B'] = df['A'].apply(lambda x: x[0].isdigit())
print(df)
A B
0 55B True
1 <lhggkkk> False
2 66c True
3 dggfhhjjjj False

Check whether the first character is a digit, then use boolean indexing, i.e.
mask = df['A'].str[0].str.isdigit()
one = df[mask]
two = df[~mask]
print(one,'\n',two)
A
0 55B
2 66c
A
1 <lhggkkk>
3 dggfhhjjjj
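One caveat, in case column A can contain empty strings or missing values: .str[0].str.isdigit() yields NaN for those rows, which breaks boolean indexing in recent pandas. A minimal guard, assuming such rows should fall into the second group:
mask = df['A'].str[0].str.isdigit().fillna(False).astype(bool)
one = df[mask]    # values that start with a digit
two = df[~mask]   # everything else, including empty or missing values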

To check whether the first character is a digit:
df['A'].str[0].str.isdigit()
So:
import pandas as pd
import numpy as np
df:
-----------------
| A
-----------------
0 | 55B
1 | <lhggkkk>
2 | 66c
3 | dggfhhjjjj
df['Result'] = np.where(df['A'].str[0].str.isdigit(), 'Numbers', 'Others')
df:
----------------------------
| A | Result
----------------------------
0 | 55B | Numbers
1 | <lhggkkk> | Others
2 | 66c | Numbers
3 | dggfhhjjjj | Others

Related

How to solve the ValueError: Unstacked DataFrame is too big, causing int32 overflow in python?

I have a dataframe in dynamic format for each ID
df:
ID |Start Date|End date |claim_no|claim_type|Admission_date|Discharge_date|Claim_amt|Approved_amt
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351
10 |01-Apr-20 |31-Mar-21| 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964
10 |01-Apr-20 |31-Mar-21| 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
11 |12-Dec-20 |11-Dec-21| 1503 |CSHLESS | 12-Jan-2021 | 15-Jan-2021 | 76137 | 50286
11 |12-Dec-20 |11-Dec-21| 1505 |CSHLESS | 05-Jan-2021 | 07-Jan-2021 | 30000 | 0
Based on the ID column, I am trying to convert all the dynamic variables into a static format so that I have a single row for each ID.
Columns such as ID, Start Date and End date are static, while the rest of the columns are dynamic for each ID.
In order to achieve the output below:
ID |Start Date|End date |claim_no_1|claim_type_1|Admission_date_1|Discharge_date_1|Claim_amt_1|Approved_amt_1|claim_no_2|claim_type_2|Admission_date_2|Discharge_date_2|Claim_amt_2|Approved_amt_2|claim_no_3|claim_type_3|Admission_date_3|Discharge_date_3|Claim_amt_3|Approved_amt_3
10 |01-Apr-20 |31-Mar-21| 1123 |CSHLESS | 23-Aug-2020 | 25-Aug-2020 | 25406 | 19351 | 1212 |POSTHOSP | 30-Aug-2020 | 01-Sep-2020 | 4209 | 3964 | 1680 |CSHLESS | 18-Mar-2021 | 23-Mar-2021 | 18002 | 0
I am using the code below:
# Index columns
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
cols = df.groupby(idx).cumcount() + 1
# Reshape using stack and unstack
df_out = df.set_index([*idx, cols]).stack().unstack([-2, -1])
# Flatten the multiindex columns
df_out.columns = df_out.columns.map('{0[1]}_{0[0]}'.format)
but it throws a ValueError: Unstacked DataFrame is too big, causing int32 overflow
Try this:
# Index columns (very similar to your code)
idx = ['ID', 'Start Date', 'End date']
# Sequential counter to identify unique rows per index columns
df['nrow'] = df.groupby(idx)['claim_no'].transform('rank')
df['nrow'] = df['nrow'].astype(int).astype(str)
Then use melt and pivot instead of stack and unstack; these functions give you better control over the columns:
df1 = pd.melt(
    df,
    id_vars=['nrow', *idx],
    value_vars=['claim_no', 'claim_type', 'Admission_date',
                'Discharge_date', 'Claim_amt', 'Approved_amt'],
    value_name='var',
)
df2 = df1.pivot(index=[*idx], columns=['variable', 'nrow'], values='var')
df2.columns = ['_'.join(col).rstrip('_') for col in df2.columns.values]
print(df2)
claim_no_1 claim_no_2 claim_no_3 claim_type_1 claim_type_2 claim_type_3 Admission_date_1 Admission_date_2 Admission_date_3 Discharge_date_1 Discharge_date_2 Discharge_date_3 Claim_amt_1 Claim_amt_2 Claim_amt_3 Approved_amt_1 Approved_amt_2 Approved_amt_3
ID Start Date End date
10 01-Apr-20 31-Mar-21 1123 1212 1680 CSHLESS POSTHOSP CSHLESS 23-Aug-2020 30-Aug-2020 18-Mar-2021 25-Aug-2020 01-Sep-2020 23-Mar-2021 25406 4209 18002 19351 3964 0
11 12-Dec-20 11-Dec-21 1503 1505 NaN CSHLESS CSHLESS NaN 12-Jan-2021 05-Jan-2021 NaN 15-Jan-2021 07-Jan-2021 NaN 76137 30000 NaN 50286 0 NaN
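To match the question's single-row-per-ID layout, the grouping columns can be moved back out of the index; a small follow-up sketch:
# Turn the (ID, Start Date, End date) index back into ordinary columns
df2 = df2.reset_index()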

Filter DataFrame to delete duplicate values in pyspark

I have the following dataframe
date | value | ID
--------------------------------------
2021-12-06 15:00:00 25 1
2021-12-06 15:15:00 35 1
2021-11-30 00:00:00 20 2
2021-11-25 00:00:00 10 2
I want to join this DF with another one like this:
idUser | Name | Gender
-------------------
1 John M
2 Anne F
My expected output is:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
What I need is to get only the most recent value from the first dataframe and join only that value with my second dataframe. However, my Spark script joins both values:
My code:
df = df1.select(
    col("date"),
    col("value"),
    col("ID"),
).orderBy(
    col("ID").asc(),
    col("date").desc(),
).groupBy(
    col("ID"), col("date").cast(StringType()).substr(0, 10).alias("date")
).agg(
    max(col("value")).alias("value")
)
final_df = df2.join(
    df,
    (col("idUser") == col("ID")),
    how="left"
)
When I perform this join (column formatting is omitted in this post), I get the following output:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
2 Anne F 10
I use substr to remove hours and minutes so I can filter only by date. But when the same ID appears on different days, my output df has both values instead of only the most recent one. How can I fix this?
Note: I'm using only PySpark functions for this (I do not want to use spark.sql(...)).
You can use a window and the row_number function in PySpark:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Rank rows per ID by descending date so the most recent row gets row_number 1
windowSpec = Window.partitionBy("ID").orderBy(col("date").desc())
df1_latest_val = df1.withColumn("row_number", row_number().over(windowSpec)).filter(
    col("row_number") == 1
)
The output of df1_latest_val will look something like this:
date | value | ID | row_number |
-----------------------------------------------------
2021-12-06 15:15:00 35 1 1
2021-11-30 00:00:00 20 2 1
Now df1_latest_val contains only the latest value per ID, and you can join it directly with the other table.
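A minimal join sketch under the question's column names (final column selection and renaming left out):
# Keep only the columns needed for the join, then left-join onto df2
latest = df1_latest_val.select("ID", "value")
final_df = df2.join(latest, df2["idUser"] == latest["ID"], how="left")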

How to label encode a DataFrame column which contains both numbers and strings?

I have this DataFrame column:
df:
Index    Ticket
0        254326
1        CA345
3        SA12
4        267891
...
700      CA356
It contains two kinds of values: some are pure numbers and others are strings containing letters and numbers.
Many rows share the same letters (CA345, CA675, etc.). I would like to give all rows with the same letters the same label.
E.g. all rows containing "CA" are labelled 0, all rows containing "SA" are labelled 1.
The remaining rows all contain six-digit numbers (no letters). I would like to label all such rows with the same number (say 2, for example).
1st Approach
Define a custom function that checks whether the value is a string (isinstance(val, str)) and contains "CA" or "SA":
def label_ticket(row):
    if isinstance(row['Ticket'], str) and 'CA' in row['Ticket']:
        return 0
    if isinstance(row['Ticket'], str) and 'SA' in row['Ticket']:
        return 1
    return 2
Apply the custom function to create the new column df['Label']:
df['Label'] = df.apply(label_ticket, axis=1)
print(df)
Ticket Label
0 254326 2
1 CA345 0
2 SA12 1
3 267891 2
700 CA356 0
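For larger frames, a vectorized variant of the same logic (a sketch, assuming the column can safely be cast to str) avoids the row-wise apply:
import numpy as np

s = df['Ticket'].astype(str)
# Same labelling rules as label_ticket, evaluated column-wise
df['Label'] = np.select(
    [s.str.contains('CA'), s.str.contains('SA')],
    [0, 1],
    default=2,
)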
2nd Approach
On further consideration, it seems you do not know in advance which patterns will come up in df['Ticket']. In that case you can use re.split() to extract the letter pattern from each value and classify the rows accordingly.
import pandas as pd
import re

df = pd.DataFrame(columns=['Ticket'],
                  data=[[254326],
                        ['CA345'],
                        ['SA12'],
                        [267891],
                        ['CA356']])
df['Pattern'] = df['Ticket'].apply(lambda x: ''.join(re.split("[^a-zA-Z]*", str(x))))
df_label = pd.DataFrame(df['Pattern'].unique(), columns=['Pattern']).reset_index(level=0).rename(columns={'index': 'Label'})
df = df.merge(df_label, how='left')
print(df)
Ticket Pattern Label
0 254326 0
1 CA345 CA 1
2 SA12 SA 2
3 267891 0
4 CA356 CA 1
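As a compact variant of the same prefix-extraction idea (a sketch, not part of the original answer), pandas' str.extract and factorize can assign the integer labels directly:
import pandas as pd

df = pd.DataFrame({'Ticket': ['254326', 'CA345', 'SA12', '267891', 'CA356']})
# Leading letters of each ticket ('' for pure numbers) ...
prefix = df['Ticket'].astype(str).str.extract(r'^([A-Za-z]*)', expand=False)
# ... factorized into integer labels in order of first appearance
df['Label'] = pd.factorize(prefix)[0]
print(df)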
I don't have much Python knowledge, but you could try pandas.Series.str.extract with a regular expression, like:
import pandas as pd
import numpy as np

ptrn = r'(?P<CA>(CA[\d]+))|(?P<SA>(SA[\d]+))|(?P<DIGIT>[\d]{6})'
ls = {'tk': ['254326', 'CA345', 'SA12', '267891', 'CA356']}
df = pd.DataFrame(ls)
s = df['tk'].str.extract(ptrn, expand=False)
newDf = {0: [x for x in s['CA'] if pd.notnull(x)],
         1: [x for x in s['SA'] if pd.notnull(x)],
         2: [x for x in s['DIGIT'] if pd.notnull(x)]}
print(newDf)
print(newDf)
Output:
{0: ['CA345', 'CA356'], 1: ['SA12'], 2: ['254326', '267891']}

Populating a pandas dataframe from an odd dictionary

I have a dictionary as follows:
{'header_1': ['body_1', 'body_3', 'body_2'],
'header_2': ['body_6', 'body_4', 'body_5'],
'header_4': ['body_7', 'body_8'],
'header_3': ['body_9'],
'header_9': ['body_10'],
'header_10': []}
I would like to come up with a dataframe like this:
+----+----------+--------+
| ID | header | body |
+----+----------+--------+
| 1 | header_1 | body_1 |
+----+----------+--------+
| 2 | header_1 | body_3 |
+----+----------+--------+
| 3 | header_1 | body_2 |
+----+----------+--------+
| 4 | header_2 | body_6 |
+----+----------+--------+
| 5 | header_2 | body_4 |
+----+----------+--------+
| 6 | header_2 | body_5 |
+----+----------+--------+
| 7 | header_4 | body_7 |
+----+----------+--------+
Blank items (such as the key header_10 in the dict above) would receive a value of None. I have tried a number of variations on df.loc, such as:
for header_name, body_list in all_unique.items():
    for body_name in body_list:
        metadata.loc[metadata.index[-1]] = [header_name, body_name]
To no avail. Surely there must be a quick way in pandas to append rows and auto-increment the index? Something similar to the SQL INSERT INTO statement, but using Pythonic code?
Use a dict comprehension to add None for empty lists, then flatten to a list of tuples:
d = {'header_1': ['body_1', 'body_3', 'body_2'],
'header_2': ['body_6', 'body_4', 'body_5'],
'header_4': ['body_7', 'body_8'],
'header_3': ['body_9'],
'header_9': ['body_10'],
'header_10': []}
d = {k: v if bool(v) else [None] for k, v in d.items()}
data = [(k, y) for k, v in d.items() for y in v]
df = pd.DataFrame(data, columns= ['a','b'])
print (df)
a b
0 header_1 body_1
1 header_1 body_3
2 header_1 body_2
3 header_2 body_6
4 header_2 body_4
5 header_2 body_5
6 header_4 body_7
7 header_4 body_8
8 header_3 body_9
9 header_9 body_10
10 header_10 None
Another solution:
data = []
for k, v in d.items():
    if bool(v):
        for y in v:
            data.append((k, y))
    else:
        data.append((k, None))

df = pd.DataFrame(data, columns=['a', 'b'])
print(df)
a b
0 header_1 body_1
1 header_1 body_3
2 header_1 body_2
3 header_2 body_6
4 header_2 body_4
5 header_2 body_5
6 header_4 body_7
7 header_4 body_8
8 header_3 body_9
9 header_9 body_10
10 header_10 None
If the dataset is very big, this solution will be slow, but it should still work.
for key in data.keys():
    vals = data[key]
    # Create a temp df with the data from a single key
    t_df = pd.DataFrame({'header': [key] * len(vals), 'body': vals})
    # Append it to your full dataframe
    df = df.append(t_df)
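Note that DataFrame.append was removed in pandas 2.0; a minimal equivalent sketch with pd.concat (assuming data is the dict from the question):
import pandas as pd

frames = [pd.DataFrame({'header': [key] * len(vals), 'body': vals})
          for key, vals in data.items()]
df = pd.concat(frames, ignore_index=True)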
This is another unnesting problem.
Borrowing Jez's setup for your d:
d = {k: v if bool(v) else [None] for k, v in d.items()}
First, convert your dict into a DataFrame:
df=pd.Series(d).reset_index()
df.columns
Out[204]: Index(['index', 0], dtype='object')
Then use the unnesting helper function defined below:
yourdf=unnesting(df,[0])
yourdf
Out[208]:
0 index
0 body_1 header_1
0 body_3 header_1
0 body_2 header_1
1 body_6 header_2
1 body_4 header_2
1 body_5 header_2
2 body_7 header_4
2 body_8 header_4
3 body_9 header_3
4 body_10 header_9
5 None header_10
import numpy as np

def unnesting(df, explode):
    # Repeat each index entry once per element of the exploded column
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
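For what it's worth, newer pandas versions (0.25+) ship DataFrame.explode, which handles this directly; a minimal sketch against the same d:
# Series of lists -> one row per (header, body) pair; [None] lists stay as a single None row
df = (pd.Series(d, name='body')
        .rename_axis('header')
        .reset_index()
        .explode('body')
        .reset_index(drop=True))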

pandas - restructuring data in a data frame

I have a data frame with data in the following format:
time | name | value
01/01/1970 | A | 1
02/01/1970 | A | 2
03/01/1970 | A | 1
01/01/1970 | B | 5
02/01/1970 | B | 3
I want to change this data to something like:
time | A | B
01/01/1970 | 1 | 5
02/01/1970 | 2 | 3
03/01/1970 | 1 | NA
How can I achieve this in pandas? I have tried groupby on the dataframe and then joining, but it's not coming out right.
Thanks in advance.
Use DataFrame.pivot (see the pandas docs):
import pandas as pd

df = pd.DataFrame(
    {'name': ['A', 'A', 'A', 'B', 'B'],
     'time': ['01/01/1970', '02/01/1970', '03/01/1970', '01/01/1970', '02/01/1970'],
     'value': [1, 2, 1, 5, 3]})
print(df.pivot(index='time', columns='name', values='value'))
yields
A B
time
01/01/1970 1 5
02/01/1970 2 3
03/01/1970 1 NaN
Note that time is now the index. If you wish to make it a column, call reset_index():
df.pivot(index='time', columns='name', values='value').reset_index()
# name time A B
# 0 01/01/1970 1 5
# 1 02/01/1970 2 3
# 2 03/01/1970 1 NaN
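One caveat worth adding: DataFrame.pivot raises a ValueError if any (time, name) pair occurs more than once. If duplicates are possible, pivot_table with an aggregation function is a common workaround (a sketch, not from the original answer):
# Collapse duplicate (time, name) pairs by keeping the first value
df.pivot_table(index='time', columns='name', values='value', aggfunc='first')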
Use the .pivot function:
df = pd.DataFrame({'time': [0, 1, 2, 3],
                   'name': ['A', 'A', 'B', 'B'],
                   'value': [10, 20, 30, 40]})
df.pivot(index='time', columns='name', values='value')
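For that toy frame, the pivot should yield roughly one column per name, with NaN where a (time, name) pair is missing:
# name     A     B
# time
# 0     10.0   NaN
# 1     20.0   NaN
# 2      NaN  30.0
# 3      NaN  40.0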
