I have two dumps of data, the old dump and the new dump.
The old dump looks like this:
The new dump contains the changed data or the newly added data.
My goal is to find the newly added data.
A sample of the new dump looks like this:
For the actual implementation I am querying the old dump from the database, and the new dump comes from the weekly files received for any new adds or updates.
I am using the code below to find the new adds, but something is not right: when I go to find the duplicate ids it gives an empty dataframe even though the duplicates are present.
Below is my code:
import pandas as pd

cols_to_show = ["id", "fname", "lname", "title", "yearjoined"]

prev_batch = pd.read_csv("sample1.csv", encoding='cp1252', sep='\t',
                         error_bad_lines=False, warn_bad_lines=True)     # old dump
current_batch = pd.read_csv("sample2", encoding='cp1252', sep='\t',
                            error_bad_lines=False, warn_bad_lines=True)  # new dump

# tag each batch so rows can be traced back after concatenation
prev_batch['version'] = 'v0.0'
current_batch['version'] = 'v1.0'
full_set = pd.concat([prev_batch, current_batch], ignore_index=True)

# keep the last row for each unique combination of the shown columns
changes = full_set.drop_duplicates(subset=cols_to_show, keep='last')

# ids appearing more than once, i.e. rows that changed between dumps
dupe = changes[changes.duplicated('id', keep=False)]  # gives empty dataframe

change_new = dupe[dupe["version"] == 'v1.0']
change_old = dupe[dupe["version"] == 'v0.0']
changes['duplicate'] = changes["id"].isin(dupe["id"])

new_added = full_set.drop_duplicates(subset=cols_to_show, keep='first')
new_added['duplicate'] = new_added["id"].isin(dupe["id"])
added_new = new_added[(new_added["duplicate"] == False) & (new_added["version"] == "v1.0")]
I am not able to understand why dupe gives me an empty dataframe.
I am not sure if you are looking for an answer only to the code that you have provided, but another approach could be to use the pandas compare API to get the difference between the dataframes. I am showing an example below taken from their website.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    columns=["col1", "col2", "col3"],
)
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
df2
diff_df = df.compare(df2)
After this you can use the xs API to pull out the 'other' side of the comparison, i.e. the changed values coming from the second dataframe.
diff_df.xs('other', level=1, axis=1)
For this case the result will contain only the rows and columns where df2 differs from df.
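Applied to the question's data, a minimal sketch could look like the following. It assumes both dumps can be indexed by id and end up with exactly the same set of columns (compare only works on identically labelled frames); the variable names simply follow the question's code.

import pandas as pd

# load both dumps and index them by id so rows line up
prev_batch = pd.read_csv("sample1.csv", encoding='cp1252', sep='\t').set_index("id")
current_batch = pd.read_csv("sample2", encoding='cp1252', sep='\t').set_index("id")

# ids only present in the new dump are the newly added records
new_ids = current_batch.index.difference(prev_batch.index)
added_new = current_batch.loc[new_ids]

# for ids present in both dumps, compare() shows what changed
common = prev_batch.index.intersection(current_batch.index)
diff_df = prev_batch.loc[common].compare(current_batch.loc[common])
changed_values = diff_df.xs('other', level=1, axis=1)  # the new values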
In PySpark, I want to combine concat_ws and coalesce whilst using the list method. For example, I know this works:
from pyspark.sql.functions import concat_ws, coalesce, col, lit
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
#display(df)
df = df.withColumn("concat_ws2", concat_ws(':', coalesce('Type', lit("")), coalesce('Segment', lit(""))))
display(df)
But I want to be able to utilise the *[list] method so I don't have to list out all the columns within that bit of code, i.e. something like this instead:
from pyspark.sql.functions import concat_ws, col
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
list = ["Type", "Segment"]
df = df.withColumn("almost_desired_output", concat_ws(':', *list))
display(df)
However, as you can see, I want to be able to coalesce NULL with a blank, but I'm not sure if that's possible using the *[list] method, or do I really have to list out all the columns?
This would work:
Iterate over the list of column names:
df=df.withColumn("almost_desired_output", concat_ws(':', *[coalesce(name, lit('')).alias(name) for name in df.schema.names]))
Output:
Or, use fill - it'll fill all the null values across all columns of the DataFrame (but this changes the actual columns, which may break some use cases):
df.na.fill("").withColumn("almost_desired_output", concat_ws(':', *list))
Or, use selectExpr (again this changes the actual columns, which may break some use cases):
list = ["Type", "Segment"] # or just use df.schema.names
list2 = ["coalesce(type,' ') as Type", "coalesce(Segment,' ') as Segment"]
df=df.selectExpr(list2).withColumn("almost_desired_output", concat_ws(':', *list))
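If you also want to avoid typing out the expressions in list2 by hand, they could be generated from df.schema.names as well. A small sketch under the same assumptions as above (exprs is just an illustrative name):

from pyspark.sql.functions import concat_ws

cols = df.schema.names  # ["Type", "Segment"]

# build one "coalesce(col, '') as col" expression per column
exprs = ["coalesce({0}, '') as {0}".format(c) for c in cols]

df = df.selectExpr(*exprs).withColumn("almost_desired_output", concat_ws(':', *cols))
display(df)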
I have a set of Spark dataframe transforms which gives an out of memory error and has a messed up SQL query plan, while a different implementation runs successfully.
%python
import pandas as pd

diction = {
    'key': [1,2,3,4,5,6],
    'f1' : [1,0,1,0,1,0],
    'f2' : [0,1,0,1,0,1],
    'f3' : [1,0,1,0,1,0],
    'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)

# successful logic
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.show()

# failed logic
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.show()
Logically there should not be such a computational difference (the failed version takes more than double the time and memory). Can anyone help me understand this?
DAG of successful logic:
DAG of failed logic:
I'm not sure what your use case is for this code; however, the two pieces of code are not logically the same. In the second version you are joining the result of the previous iteration to itself three times. In the first version you are joining a 'copy' of the original df three times. If your key column is not unique, the second piece of code will 'explode' your dataframe more than the first.
To make this clearer we can use a simple example with a non-unique key value. Taking your second example:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 257
And your first piece of code:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 17
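To see how quickly the second version grows, you could count after each iteration. A small sketch of the same toy example (the counts in the comment are what the join sizes above suggest):

df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf, on=['key'], how='left')
    # each key-1 row matches every key-1 row in tempdf, so the
    # row count grows roughly quadratically: 5, 17, 257
    print(i, df.count())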
I have a very weird problem with my code. After saving the DataFrame data in another variable and changing the temporary one, the first one also gets updated. Example:
import pandas as pd
df = pd.DataFrame()
df['Test'] = [1, 2, 3]
temp = pd.DataFrame()
temp = df
temp['New Column'] = [2, 3, 4]
print(df)
Results:
   Test  New Column
0     1           2
1     2           3
2     3           4
Am I missing something here?
Thank you very much in advance
This:
temp = df
doesn't create a copy of df, but instead makes temp just another name for df. And because dataframes are mutable, changes made through one name are reflected in the other.
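If the intent is an independent copy, DataFrame.copy() gives you one; a minimal sketch based on the question's example:

import pandas as pd

df = pd.DataFrame()
df['Test'] = [1, 2, 3]

temp = df.copy()               # an independent copy, not another name for df
temp['New Column'] = [2, 3, 4]

print(df)    # df still has only the 'Test' column
print(temp)  # temp has both 'Test' and 'New Column'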
When I try to use the df.at function without a loop it works fine and changes the data for a particular column, but it gives an error when used inside a loop.
The code is here:
import pandas as pd

data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Height': [5.1, 6.2, 5.1, 5.2]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Height': [4.1, 3.4, 7.1, 9.2]}
df2 = pd.DataFrame(data2)

df3 = pd.concat([df1, df2], axis=1)

for i in range(int(len(df1))):
    for j in range(int(len(df2))):
        if df1['Name'][i] != df2['Name'][j]:
            continue
        else:
            out = (df1['Height'][i] - df2['Height'][j])
            df3.at[i, 'Height_Comparison'] = out
            break

print(df3)
The issue was occurring because of the duplicate column names ('Name', 'Height') in DataFrame df3 created by the concat operation. concat makes double entries with the same column names ('Name', 'Height') in df3, which is what causes this problem.
Once I changed the column names to Name1, Height1 in df1 and Name2, Height2 in df2, the issue got resolved.
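A variant that avoids the duplicate column names (and the nested loop) would be to merge on Name and subtract the two height columns; a minimal sketch, assuming each name appears exactly once in both frames (the suffixes and the Height_Comparison name are illustrative):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                    'Height': [5.1, 6.2, 5.1, 5.2]})
df2 = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                    'Height': [4.1, 3.4, 7.1, 9.2]})

# merging on Name keeps a single Name column and disambiguates Height
df3 = df1.merge(df2, on='Name', suffixes=('_1', '_2'))
df3['Height_Comparison'] = df3['Height_1'] - df3['Height_2']
print(df3)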
I have 6 data sets. Their names are: e10_all, e11_all, e12_all, e13_all, e14_all, and e19_all.
All have different numbers of columns and rows, but with some common columns. I need to append the rows of these columns together. First, I want to determine the columns that are common to all of the data sets, so I know which columns to select in my SQL query.
In R, I am able to do this using:
# Create list of dts
list_df = list(e10_all, e11_all, e12_all, e13_all, e14_all, e19_all)
col_common = colnames(list_df[[1]])
# Write for loop
for (i in 2:length(list_df)){
  col_common = intersect(col_common, colnames(list_df[[i]]))
}
# View the common columns
col_common
# Get as a comma-separated list
cat(noquote(paste(col_common, collapse = ',')))
I want to do the same thing, but in Python. Does anyone happen to know a way?
Thank you
It's not that different in pandas. Making some dummy dataframes:
>>> import pandas as pd
>>> e10_all = pd.DataFrame({"A": [1,2], "B": [2,3], "C": [2,3]})
>>> e11_all = pd.DataFrame({"B": [4,5], "C": [5,6]})
>>> e12_all = pd.DataFrame({"B": [1,2], "C": [3,4], "M": [8,9]})
Then your code would translate to something like
>>> list_df = [e10_all, e11_all, e12_all]
>>> col_common = set.intersection(*(set(df.columns) for df in list_df))
>>> col_common
{'C', 'B'}
>>> ','.join(sorted(col_common))
'B,C'
That second line turns each frame's columns into a set and then takes the intersection of all of them. A more literal translation of your code would work too, although in Python we tend to avoid explicit index-based loops and instead loop over the elements directly (for df in list_df[1:]:). Still,
col_common = set(list_df[0].columns)
for i in range(1, len(list_df)):
    col_common = col_common.intersection(list_df[i].columns)
would get the job done.
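Since the stated goal is to append the rows of the common columns, once col_common is known you could stack the frames on just those columns; a small sketch (sorting the column names is an arbitrary choice to fix their order):

common = sorted(col_common)  # e.g. ['B', 'C']

# select only the shared columns from each frame, then stack the rows
combined = pd.concat([df[common] for df in list_df], ignore_index=True)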