Python Pandas: Extend the list in a column if a condition matches - python-3.x

I have two different dataframes, i.e.,
firstDF = pd.DataFrame([{'mac':1,'location':['kitchen']}])
predictedDF = pd.DataFrame([{'mac':1,'location':['lab']}])
If a mac value of predictedDF is present in the mac column of firstDF, then the location value of predictedDF should extend the location list in firstDF, and the result for firstDF should be:
firstDF
mac location
0 1 ['kitchen','lab']
I have tried with,
firstDF.loc[firstDF['mac'] == predictedDF['mac'], 'mac'] = firstDF.loc[firstDF['location'].extend(predictedDF['location']), 'location']
This returns:
AttributeError: 'Series' object has no attribute 'extend'

Since the location columns contain lists, first use DataFrame.merge to align the two DataFrames, then join the lists with +, using DataFrame.pop to extract each intermediate column and drop it in one step:
df = firstDF.merge(predictedDF, on='mac', how='left')
df['location'] = df.pop('location_x') + df.pop('location_y')
print (df)
mac location
0 1 [kitchen, lab]
Test with more values - if the merge produces missing values, replace them with []:
firstDF = pd.DataFrame({'mac':[1, 2],'location':[['kitchen'],['kitchen']]})
predictedDF = pd.DataFrame([{'mac':1,'location':['lab']}])
df = firstDF.merge(predictedDF, on='mac', how='left').applymap(lambda x: x if x == x else [])
df['location'] = df.pop('location_x') + df.pop('location_y')
print (df)
mac location
0 1 [kitchen, lab]
1 2 [kitchen]
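A side note beyond the original answer: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map, so on a recent pandas the same fill could be written as follows (a minimal sketch under that version assumption):
df = firstDF.merge(predictedDF, on='mac', how='left')
# NaN != NaN, so this replaces only the missing cells with empty lists
df = df.map(lambda x: x if x == x else [])
df['location'] = df.pop('location_x') + df.pop('location_y')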

Related

Pandas: Remove characters from cell in a FOR loop

I have a dataframe with a column labeled Amount which is a dollar amount. For some reason, some of the cells in this column are enclosed in quotation marks (ex: "$47.25").
I'm running this for loop and was wondering what is the best approach to remove the quotes.
for f in files:
    print(f)
    df = pd.read_csv(f, header=None, nrows=1)
    print(df)
    je = df.iloc[0, 1]
    df2 = pd.read_csv(f, header=6, dtype={'Amount': float})
    df2.to_excel(w, sheet_name=je, index=False)
I have attempted to strip the " from the value using a for loop:
for cell in df2['Amount']:
    cell = cell.strip('"')
df2['Amount'] = pd.to_numeric(df2['Amount'])
But I am getting:
ValueError: Unable to parse string "$-167.97" at position 0
Thank you in advance!
Given this toy dataframe:
import pandas as pd
df = pd.DataFrame(
{"Transaction": ["t1", "t2"], "Amount": ["$47.25", "'$-167.97'"]}
)
print(df)
# Outputs
Transaction Amount
0 t1 $47.25
1 t2 '$-167.97'
Instead of using a for loop, which should generally be avoided with dataframes, you could simply remove the quotation marks from the Amount column like this:
df["Amount"] = df["Amount"].str.replace("\'", "")
print(df)
# Outputs
Transaction Amount
0 t1 $47.25
1 t2 $-167.97
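To go on and make the column numeric, the dollar sign also has to go before conversion; a minimal sketch building on the toy frame above (the character class is just what this example needs):
df["Amount"] = (
    df["Amount"]
    .str.replace("[\"'$]", "", regex=True)  # drop quotes and the dollar sign
    .astype(float)
)
print(df)
# Outputs
Transaction Amount
0 t1 47.25
1 t2 -167.97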

Adding new columns in a dataframe gives length mismatch error

From a csv file (initial.csv):
"Id","Name"
1,"CLO"
2,"FEV"
2,"GEN"
3,"HYP"
4,"DIA"
1,"COL"
1,"EOS"
4,"GAS"
1,"AEK"
I am grouping by the Id column and aggregating the Name column values so that each unique Id has all its Name values appended on the same row (new.csv):
"Id","Name"
1,"CLO","COL","EOS","AEK"
2,"FEV","GEN"
3,"HYP"
4,"DIA","GAS"
Now some rows have extra Name values, for which I want to append corresponding columns according to the maximum count of Name values that exists in any row, i.e.
"Id","Name","Name2","Name3","Name4"
1,"CLO","COL","EOS","AEK"
2,"FEV","GEN"
3,"HYP"
4,"DIA","GAS"
I do not understand how I can add new columns on dataframe to match the data.
Below is my code:
import pandas as pd
df = pd.read_csv('initial.csv', delimiter=',')
# find the largest number of Name values held by any single Id
max_names_count = 0
unique_ids_list = df['Id'].unique()
for id in unique_ids_list:
    mask = df['Id'] == id
    names_count = len(df[mask])
    if names_count > max_names_count:
        max_names_count = names_count
group_by_id = df.groupby(["Id"]).agg({"Name": ','.join})
# Create new columns 'Id', 'Name', 'Name2', 'Name3', 'Name4'
new_column_names = ["Id", "Name"] + ['Name' + str(i) for i in range(2, max_names_count + 1)]
group_by_id.columns = new_column_names  # <-- ValueError: Length mismatch: Expected axis has 1 elements, new values have 5 elements
group_by_id.to_csv('new.csv', encoding='utf-8')
Try:
df = pd.read_csv("initial.csv")
df_out = (
df.groupby("Id")["Name"]
.agg(list)
.to_frame()["Name"]
.apply(pd.Series)
.rename(columns=lambda x: "Name" if x == 0 else "Name{}".format(x + 1))
.reset_index()
)
df_out.to_csv("out.csv", index=False)
Creates out.csv:
Id,Name,Name2,Name3,Name4
1,CLO,COL,EOS,AEK
2,FEV,GEN,,
3,HYP,,,
4,DIA,GAS,,
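An equivalent construction, offered here as a sketch rather than part of the original answer, builds the wide frame from the lists in one go, which avoids the row-wise apply(pd.Series):
df = pd.read_csv("initial.csv")
s = df.groupby("Id")["Name"].agg(list)
# pd.DataFrame pads the shorter lists with NaN automatically
df_out = pd.DataFrame(s.tolist(), index=s.index)
df_out.columns = ["Name"] + ["Name{}".format(i + 1) for i in range(1, df_out.shape[1])]
df_out.reset_index().to_csv("out.csv", index=False)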

Pandas filter through dataframe and compute stats

I'm trying to access certain categories of data and do stat computation.
A B C Type
0 1.539708 -1.166480 0.533026 foo
1 1.302092 -0.505754 0.533026 foo
2 -0.371983 1.104803 -0.651520 bar
3 -1.309622 1.118697 -1.161657 bar
4 -1.924296 0.396437 0.812436 baz
Expected output (I've left the data blank below, however the actual program will have correct output.):
user_input = input('Select type: ') <-----user input foo
Mean 25% Median
A
B
C
So far I'm able to create a function to calculate the mean, 25% quantile and median for the whole dataframe using the below,
def stat(df):
    mean = df[['A','B','C']].mean()
    quantile = df[['A','B','C']].quantile(0.25)
    median = df[['A','B','C']].median()
    df1 = mean.rename('Mean').to_frame()
    df2 = quantile.rename('25%').to_frame()
    df3 = median.rename('Median').to_frame()
    df = df1.join([df2, df3])
    return df
What I'm lacking is the option to select a particular type in the Type column while still producing the same outcome as the stat function. Can anyone give a hint?
You just need to do some boolean indexing with .loc for the Type column:
user_input = input('Select type: ')

def stat(df, Type):
    mean = df.loc[(df['Type'] == Type), ['A','B','C']].mean()
    quantile = df.loc[(df['Type'] == Type), ['A','B','C']].quantile(0.25)
    median = df.loc[(df['Type'] == Type), ['A','B','C']].median()
    df1 = mean.rename('Mean').to_frame()
    df2 = quantile.rename('25%').to_frame()
    df3 = median.rename('Median').to_frame()
    df = df1.join([df2, df3])
    return df
For example, this is how it would look if you filter row-wise when user_input is foo:
stat(df, user_input)
Out[1]:
Mean 25% Median
A 1.420900 1.361496 1.420900
B -0.836117 -1.001298 -0.836117
C 0.533026 0.533026 0.533026
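As an aside, not part of the original answer: if you want the same three statistics for every Type at once, groupby plus describe already covers them, since 50% is the median:
# mean, 25% and 50% (the median) for each Type in one pass
stats = df.groupby('Type')[['A', 'B', 'C']].describe()
print(stats.loc[:, (slice(None), ['mean', '25%', '50%'])])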

How to merge two dataframes and return data from another column in new column only if there is match?

I have a two df that look like this:
df1:
id
1
2
df2:
id value
2 a
3 b
How do I merge these two dataframes and only return the data from the value column in a new column if there is a match?
new_merged_df
id value new_value
1
2 a a
3 b
You can try this using @JJFord3's setup:
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
#Use isin to create new_value
df2['new_value'] = df2['value'].where(df2.index.isin(df1.index))
#Use reindex with union to rebuild dataframe with both indexes
df2.reindex(df1.index.union(df2.index))
Output:
value new_value
1 NaN NaN
2 a a
3 b NaN
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
new_merged_df_outer = df1.merge(df2,how='outer',left_index=True,right_index=True)
new_merged_df_inner = df1.merge(df2,how='inner',left_index=True,right_index=True)
new_merged_df_inner = new_merged_df_inner.rename(columns={'value':'new_value'})
new_merged_df = new_merged_df_outer.merge(new_merged_df_inner,how='left',left_index=True,right_index=True)
First, create an outer merge to keep all indexes.
Then create an inner merge to only get the overlap.
Then merge the inner merge back to the outer merge to get the desired column setup.
You can use a full outer join (this answer uses Spark Datasets in Scala).
Let's model your data with case classes:
case class MyClass1(id: String)
case class MyClass2(id: String, value: String)
// this one for the result type
case class MyClass3(id: String, value: Option[String] = None, value2: Option[String] = None)
Creating some inputs:
val input1: Dataset[MyClass1] = ...
val input2: Dataset[MyClass2] = ...
Joining your data:
import spark.implicits._
val joined = input1.as("1").joinWith(input2.as("2"), $"1.id" === $"2.id", "full_outer")
joined map {
  case (left, null) if left != null => MyClass3(left.id)
  case (null, right) if right != null => MyClass3(right.id, Some(right.value))
  case (left, right) => MyClass3(left.id, Some(right.value), Some(right.value))
}
DataFrame.merge has an indicator parameter:
If True, adds a column to output DataFrame called "_merge" with information on the source of each row.
This can be used to check whether there is a match:
import pandas as pd
df1 = pd.DataFrame(index=[1,2])
df2 = pd.DataFrame({'value' : ['a','b']},index=[2,3])
# creates a new column `_merge` with values `right_only`, `left_only` or `both`
merged = df1.merge(df2, how='outer', right_index=True, left_index=True, indicator=True)
merged['new_value'] = merged.loc[(merged['_merge'] == 'both'), 'value']
merged = merged.drop('_merge', axis=1)
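For the toy frames above, printing merged then gives:
print(merged)
  value new_value
1   NaN       NaN
2     a         a
3     b       NaN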
Use merge and isin:
df = df1.merge(df2,on='id',how='outer')
id_value = df2.loc[df2['id'].isin(df1.id.tolist()),'id'].unique()
mask = df['id'].isin(id_value)
df.loc[mask,'new_value'] = df.loc[mask,'value']
# alternative df['new_value'] = np.where(mask, df['value'], np.nan)
print(df)
id value new_value
0 1 NaN NaN
1 2 a a
2 3 b NaN
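A more compact variant of the same idea, offered as a sketch rather than part of the original answer, keeps value only where the id exists in df1:
df = df1.merge(df2, on='id', how='outer')
# Series.where keeps matched values and fills the rest with NaN
df['new_value'] = df['value'].where(df['id'].isin(df1['id']))
print(df)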

How to check NULL values while comparing 2 text files using spark data frames

The below code fails to capture records with 'null' values. In df1 below, the row with NO 5 has a null in the NAME field.
As per my required OutputDF below, the NO 5 record should come out as shown, but after executing the code this record does not make it into the final output; records with 'null' values are simply dropped. Everything else works fine.
df1
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 600
3 HR GOPI 1500
5 HW 700
df2
NO DEPT NAME SAL
1 IT RAM 1000
2 IT SRI 900
4 MT SUMP 1200
5 HW MAHI 700
OutputDF
NO DEPT NAME SAL FLAG
1 IT RAM 1000 SAME
2 IT SRI 900 UPDATE
4 MT SUMP 1200 INSERT
3 HR GOPI 1500 DELETE
5 HW MAHI 700 UPDATE
from pyspark.shell import spark
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
sc = spark.sparkContext
filedf1 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\files\\file1.csv")
filedf2 = spark.read.option("header","true").option("delimiter", ",").csv("C:\\files\\file2.csv")
filedf1.createOrReplaceTempView("table1")
filedf2.createOrReplaceTempView("table2")
df1 = spark.sql( "select * from table1" )
df2 = spark.sql( "select * from table2" )
#DELETE
df_d = df1.join(df2, df1.NO == df2.NO, 'left').filter(F.isnull(df2.NO)).select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('DELETE').alias('FLAG'))
print("df_d left:",df_d.show())
#INSERT
df_i = df1.join(df2, df1.NO == df2.NO, 'right').filter(F.isnull(df1.NO)).select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('INSERT').alias('FLAG'))
print("df_i right:",df_i.show())
#SAME
df_s = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) == F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).select(df1.NO,df1.DEPT,df1.NAME,df1.SAL, F.lit('SAME').alias('FLAG'))
print("df_s inner:",df_s.show())
#UPDATE
df_u = df1.join(df2, df1.NO == df2.NO, 'inner').filter(F.concat(df2.NO,df2.DEPT,df2.NAME,df2.SAL) != F.concat(df1.NO,df1.DEPT,df1.NAME,df1.SAL)).select(df2.NO,df2.DEPT,df2.NAME,df2.SAL, F.lit('UPDATE').alias('FLAG'))
print("df_u inner:",df_u.show())
df = df_d.union(df_i).union(df_s).union(df_u)
df.show()
Here I'm comparing df1 and df2: new records found in df2 are flagged INSERT; records identical in both dfs are flagged SAME; records in df1 but not in df2 are flagged DELETE; and records that exist in both dfs but with different values take the df2 values and are flagged UPDATE.
There are two issues with the code:
1. F.concat returns null if any of its inputs is null, so this part of the code filters out the row with NO 5:
.filter(F.concat(df2.NO, df2.NAME, df2.SAL) != F.concat(df1.NO, df1.NAME, df1.SAL))
2. You are only selecting from df2. That's fine in the example above, but if df2 contains a null, the resulting dataframe will contain that null.
You can try concatenating with a udf like the one below:
from pyspark.sql.functions import udf, struct, coalesce
from pyspark.sql.types import StringType

def concat_cols(row):
    # join the string form of every non-null column in the row
    concat_row = ''.join([str(col) for col in row if col is not None])
    return concat_row

udf_concat_cols = udf(concat_cols, StringType())
The body of concat_cols can be broken down into two parts:
''.join(mylist) is a string method: it joins everything in the list with the given delimiter, in this case an empty string.
[str(col) for col in row if col is not None] is a list comprehension, and it does what it reads: for each column in the row, if the column is not None, append str(col) to the list.
The list comprehension is just a more pythonic way of doing this:
mylist = []
for col in row:
    if col is not None:
        mylist.append(str(col))
You can replace your update code as:
df_u = (df1
.join(df2, df1.NO == df2.NO, 'inner')
.filter(udf_concat_cols(struct(df1.NO, df1.NAME, df1.SAL)) != udf_concat_cols(struct(df2.NO, df2.NAME, df2.SAL)))
.select(coalesce(df1.NO, df2.NO),
coalesce(df1.NAME, df2.NAME),
coalesce(df1.SAL, df2.SAL),
F.lit('UPDATE').alias('FLAG')))
You should do something similar for your #SAME flag and break the line for readability.
Update:
If df2 always has the correct (updated) values, there is no need to coalesce.
The code for this instance would be:
df_u = (df1
.join(df2, df1.NO == df2.NO, 'inner')
.filter(udf_concat_cols(struct(df1.NO, df1.NAME, df1.SAL)) != udf_concat_cols(struct(df2.NO, df2.NAME, df2.SAL)))
.select(df2.NO,
df2.NAME,
df2.SAL,
F.lit('UPDATE').alias('FLAG')))
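A built-in alternative worth mentioning, as a sketch beyond the original answer: F.concat_ws skips null columns (unlike F.concat), so the null-safe comparison can also be written without a UDF:
import pyspark.sql.functions as F

# concat_ws ignores NULLs, so the NO 5 row survives the comparison;
# the separator avoids false matches like ('ab','c') vs ('a','bc')
df_u = (df1
    .join(df2, df1.NO == df2.NO, 'inner')
    .filter(F.concat_ws('|', df1.NO, df1.DEPT, df1.NAME, df1.SAL) !=
            F.concat_ws('|', df2.NO, df2.DEPT, df2.NAME, df2.SAL))
    .select(df2.NO, df2.DEPT, df2.NAME, df2.SAL, F.lit('UPDATE').alias('FLAG')))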
