I have a CSV whose data is laid out in an irregular manner. The data set is given below:
import pandas as pd

data = [[12, 'abc#xyz.com', 'NaN', 'NaN'], [12, 'abc#xyz.com', 'NaN', 'NaN'],
        ['NaN', 'NaN', 'x', 'y'], ['NaN', 'NaN', 'a', 'b'],
        ['13', 'qwer#123.com', 'NaN', 'NaN'], ['NaN', 'NaN', 'x', 'r']]
df = pd.DataFrame(data, columns=['id', 'email', 'notes_key', 'notes_value'])
df
Ideally the third and fourth columns should carry the same id as the first column.
The columns notes_key and notes_value represent a key:value pair, i.e. the key is in notes_key and its corresponding value is in notes_value.
I have to manipulate the dataframe so that the output turns out as:
data = [[12, 'abc#xyz.com', 'x', 'y'], [12, 'abc#xyz.com', 'a', 'b']]
df = pd.DataFrame(data, columns=['id', 'email', 'notes_key', 'notes_value'])
I tried dropping the null values.
You can forward fill the missing values in id and email, then remove rows that have missing values in both notes_key and notes_value:
import numpy as np

# if necessary - the sample data stores missing values as the string 'NaN'
df = df.replace('NaN', np.nan)
df[['id', 'email']] = df[['id', 'email']].ffill()
df = df.dropna(subset=['notes_key', 'notes_value'], how='all')
print(df)
   id         email notes_key notes_value
2  12   abc#xyz.com         x           y
3  12   abc#xyz.com         a           b
5  13  qwer#123.com         x           r
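If id needs to be numeric again after the forward fill (the sample mixes ints and strings), it could be cast back afterwards, e.g. a small sketch:
df['id'] = pd.to_numeric(df['id'])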
I have two data frames, df1 and df2, each with more than 280 columns. I have to compare them on a unique key, so I am merging both first:
Compare = pd.merge(df1, df2, how='outer', on='unique_key', suffixes=('_X', '_Y'))
Now I want to compare the corresponding suffixed columns, like
Compare['compare_1'] = Compare['<col>_X'] == Compare['<col>_Y']
But since there are more than 280 columns, I can't create a compare column for each pair individually; I am looking for a way to compare these column pairs without writing them out one by one.
I tried something like this:
col = df.columns
for x, i in enumerate(col):
    for y, j in enumerate(col):
        if y - x == 1 and i != j:
            bina = df[i] - df[j]
            df['MOM_' + str(j) + '_' + str(i)] = bina
But it is not working well: my data frames are huge (more than 100k records) and the nested loops make it slow.
IIUC use:
df1 = Compare.filter(like='_X')
df1.columns = df1.columns.str.replace('_X$', '', regex=True)
df2 = Compare.filter(like='_Y')
df2.columns = df2.columns.str.replace('_Y$', '', regex=True)
out = Compare.join(df1.sub(df2).add_prefix('MOM_'))
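If what is actually needed is the equality comparison from the question rather than the difference, the same pattern should work with eq instead of sub (a sketch, assuming the _X and _Y columns line up once the suffixes are stripped):
out = Compare.join(df1.eq(df2).add_prefix('compare_'))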
I have a sample dataframe as given below.
import pandas as pd
import numpy as np

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance', 'Music'], ['Dance', 'Sports'], ['Hiking', 'Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns holds list values. I want to remove those lists while preserving the datatypes of the items inside them, for all columns.
The final output should look something like the one shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex (passing a list of columns to explode requires pandas 1.3 or newer), then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
  ID Age     Sex         Interest
0  A  20    Male     Dance, Music
1  B  21    Male    Dance, Sports
2  C  19  Female  Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists with possible missing values as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item, unwrap it:
    if (df[c].str.len() == 1).all():
        df[c] = df[c].str[0]
    # Otherwise join the items, leaving non-list values (e.g. np.nan) untouched:
    else:
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
I have a dataframe with the data below:
srl_no created_on completed_on prev_completed_on time_from_last Dense_Rank
XXXXXX1 2020-10-09T08:52:25 2020-10-09T08:57:45 null null 1
XXXXXX1 2020-10-09T09:04:32 2020-10-09T09:06:37 2020-10-09T08:57:45 407 2
XXXXXX1 2020-10-09T09:10:10 2020-10-09T09:12:17 2020-10-09T09:06:37 213 3
XXXXXX1 2020-10-09T09:10:10 2020-10-09T09:12:17 2020-10-09T09:12:17 -127 3
I want to subtract prev_completed_on from created_on to get time_from_last; however, as the last two rows have the same created_on and completed_on, I am getting a negative time. In this scenario I need to subtract the value from the preceding distinct row, i.e. subtract based on the dense_rank column.
So in the above scenario I need to subtract the completed_on value of the 2nd row from the created_on value of the 4th row.
Code for the above
from pyspark.sql import functions as f
from pyspark.sql.types import TimestampType, LongType
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [
        ('XXXXXX1', '2020-10-09T08:52:25', '2020-10-09T08:57:45'),  # create your data here, be consistent in the types.
        ('XXXXXX1', '2020-10-09T09:04:32', '2020-10-09T09:06:37'),
        ('XXXXXX1', '2020-10-09T09:10:10', '2020-10-09T09:12:17'),
        ('XXXXXX1', '2020-10-09T09:10:10', '2020-10-09T09:12:17'),
    ],
    ['srl_no', 'created_on', 'completed_on']  # add your columns label here
)
df = df.withColumn('created_on', f.col('created_on').cast(TimestampType()))
df = df.withColumn('completed_on', f.col('completed_on').cast(TimestampType()))

partition_cols = ["srl_no"]
window_clause = Window.partitionBy(partition_cols).orderBy(f.col('completed_on').asc())

# previous completed_on and dense rank within each srl_no
df1 = df.withColumn('prev_completed_on', f.lag(f.col("completed_on"))
                    .over(window_clause).cast(TimestampType()))
df1 = df1.withColumn('dense_rank', f.dense_rank().over(window_clause))
df1 = df1.withColumn("time_from_last",
                     f.col("created_on").cast(LongType()) - f.col("prev_completed_on").cast(LongType()))
expected output
srl_no created_on completed_on prev_completed_on time_from_last Dense_Rank
XXXXXX1 2020-10-09T08:52:25 2020-10-09T08:57:45 null null 1
XXXXXX1 2020-10-09T09:04:32 2020-10-09T09:06:37 2020-10-09T08:57:45 407 2
XXXXXX1 2020-10-09T09:10:10 2020-10-09T09:12:17 2020-10-09T09:06:37 213 3
XXXXXX1 2020-10-09T09:10:10 2020-10-09T09:12:17 2020-10-09T09:12:17 **213** 3
The trick here is to use a groupby to get the minimum prev_completed_on per srl_no and dense_rank. Joining that back to the prepared data frame gives the required result.
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [
        ('XXXXXX1', '2020-10-09T08:52:25', '2020-10-09T08:57:45'),  # create your data here, be consistent in the types.
        ('XXXXXX1', '2020-10-09T09:04:32', '2020-10-09T09:06:37'),
        ('XXXXXX1', '2020-10-09T09:10:10', '2020-10-09T09:12:17'),
        ('XXXXXX1', '2020-10-09T09:10:10', '2020-10-09T09:12:17'),
    ],
    ['srl_no', 'created_on', 'completed_on']  # add your columns label here
)
df = df.withColumn('created_on', F.col('created_on').cast(T.TimestampType()))
df = df.withColumn('completed_on', F.col('completed_on').cast(T.TimestampType()))

partition_cols = ["srl_no"]
window_clause = Window.partitionBy(partition_cols).orderBy(F.col('completed_on').asc())

# previous completed_on and dense rank within each srl_no
df_with_rank = df.withColumn('prev_completed_on', F.lag(F.col("completed_on"))
                             .over(window_clause).cast(T.TimestampType()))
df_with_rank = df_with_rank.withColumn('dense_rank', F.dense_rank().over(window_clause))

# minimum prev_completed_on per (srl_no, dense_rank), joined back to replace the lagged value
dense_rank = df_with_rank.groupby("srl_no", "dense_rank") \
    .agg(F.min('prev_completed_on').alias('prev_completed_on'))
df_with_rank = df_with_rank.drop('prev_completed_on')
df_with_rank = df_with_rank.join(dense_rank, ["srl_no", "dense_rank"], 'left')
df_with_rank.show()
Output:
+-------+----------+-------------------+-------------------+-------------------+
| srl_no|dense_rank|         created_on|       completed_on|  prev_completed_on|
+-------+----------+-------------------+-------------------+-------------------+
|XXXXXX1|         1|2020-10-09 08:52:25|2020-10-09 08:57:45|               null|
|XXXXXX1|         2|2020-10-09 09:04:32|2020-10-09 09:06:37|2020-10-09 08:57:45|
|XXXXXX1|         3|2020-10-09 09:10:10|2020-10-09 09:12:17|2020-10-09 09:06:37|
|XXXXXX1|         3|2020-10-09 09:10:10|2020-10-09 09:12:17|2020-10-09 09:06:37|
+-------+----------+-------------------+-------------------+-------------------+
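From there, the question's time_from_last column can be recomputed against the re-joined prev_completed_on, reusing the cast-and-subtract approach from the question; a sketch (the last row then comes out as 213 instead of -127):
df_with_rank = df_with_rank.withColumn(
    'time_from_last',
    F.col('created_on').cast(T.LongType()) - F.col('prev_completed_on').cast(T.LongType()))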
In my dataframe, I have multiple columns whose values I would like to collapse into one column. For instance, I would like the NaN values in the MEDICATIONS: column to be replaced by a value if one exists in any column other than MEDICATIONS:
Input:
Expected Output:
df['MEDICATIONS'].combine_first(df["Rest of the columns besides MEDICATIONS:"])
Link of the dataset:
https://drive.google.com/file/d/1cyZ_OWrGNvJyc8ZPNFVe543UAI9snHDT/view?usp=sharing
Something like this?
import pandas as pd
df = pd.read_csv('data - data.csv')
del df['Unnamed: 0']
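# row-wise concatenation of the string form of every column; missing values become the literal text 'nan'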
df['Combined_Meds'] = df.astype(str).values.sum(axis=1)
df['Combined_Meds'] = df['Combined_Meds'].str.replace('nan', '', regex=False)
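# move the combined column to the front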
cols = list(df.columns)
cols = [cols[-1]] + cols[:-1]
df = df[cols]
df.sample(10)
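If the goal is strictly to fill the gaps in the MEDICATIONS column from whichever other column has a value (rather than building one concatenated string), a row-wise fallback is another option. A sketch, assuming the column is literally named 'MEDICATIONS' and every other column is a candidate source:
other_cols = [c for c in df.columns if c != 'MEDICATIONS']
# first non-null value across the other columns, row by row
fallback = df[other_cols].bfill(axis=1).iloc[:, 0]
df['MEDICATIONS'] = df['MEDICATIONS'].fillna(fallback)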
I have a dataframe with repeated rows for the same id number, but I want to split the repeated rows into columns.
data = [[10450015, 4.4], [16690019, 4.1], [16690019, 4.0], [16510069, 3.7]]
df = pd.DataFrame(data, columns=['id', 'k'])
print(df)
The resulting dataframe would have columns n_k (n = the occurrence number of the id). Each occurrence of a repeated id gets an individual column, and when an id has no repeat, it gets a 0 in the new column.
data_merged = {'id': [10450015, 16690019, 16510069], '1_k': [4.4, 4.1, 3.7], '2_k': [0, 4.0, 0]}
print(data_merged)
Try assigning an occurrence index with DataFrame.assign and GroupBy.cumcount, then reshape with DataFrame.pivot_table. Finally, use a list comprehension to rename the columns:
df_new = (df.assign(col=df.groupby('id').cumcount().add(1))
.pivot_table(index='id', columns='col', values='k', fill_value=0))
df_new.columns = [f"{x}_k" for x in df_new.columns]
print(df_new)
          1_k  2_k
id
10450015  4.4    0
16510069  3.7    0
16690019  4.1    4
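If id should come back as a regular column, matching the data_merged dict in the question, reset_index can be chained on:
df_new = df_new.reset_index()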