Join different DataFrames using a loop in PySpark

I have 5 CSV files in a folder and want to join them into one data frame in PySpark. I use the code below:
name_file = ['A', 'B', 'C', 'D', 'V']
for n in name_file:
    n = spark.read.csv(fullpath + n + '.csv',
                       header=False,
                       inferSchema=True)
    full_data = full_data.join(n, ["id"])
Error: I got an unexpected result; the last dataframe is joined only with itself.
Expected result: there should be 6 columns. Each CSV has 2 columns, one of which is shared with the others; the join should be on that column. As a result, the final data frame should have the common column plus 5 distinct columns, one from each CSV file.

There seem to be several things wrong with the code, or perhaps you have not provided the complete code.
Have you defined fullpath?
You have set header=False, so how will Spark know that there is an "id" column?
Your indentation looks wrong under the for loop.
full_data has not been defined yet, so how are you using it on the right side of the assignment within the for loop? I suspect you have initialized it to the first CSV file and are then attempting to join it with the first CSV again.
I ran a small test with the code below, which worked for me and addresses the questions I've raised above. You can adjust it to your needs.
fullpath = '/content/sample_data/'
full_data = spark.read.csv(fullpath + 'Book1.csv',
                           header=True,
                           inferSchema=True)
name_file = ['Book2', 'Book3']
for n in name_file:
    n = spark.read.csv(fullpath + n + '.csv',
                       header=True,
                       inferSchema=True)
    full_data = full_data.join(n, ["id"])
full_data.show(5)
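For reference, the same join can be written without tracking the loop variable by folding the list of DataFrames with functools.reduce. A minimal sketch, assuming the same fullpath, SparkSession spark, and file names as in the question (and header=True so the "id" column is picked up):
from functools import reduce

names = ['A', 'B', 'C', 'D', 'V']
# read each file into its own DataFrame
dfs = [spark.read.csv(fullpath + n + '.csv', header=True, inferSchema=True)
       for n in names]

# fold the list into one DataFrame by joining successive frames on "id"
full_data = reduce(lambda left, right: left.join(right, ["id"]), dfs)
full_data.show(5)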

Related

Why am I getting a "column object not callable" error in PySpark?

I am reading simple parquet files and running a query to find the unmatched rows from the left table. Please see the code snippet below.
argTestData = '<path to parquet file>'
tst_DF = spark.read.option('header', True).parquet(argTestData)
argrefData = '<path to parquet file>'
refDF = spark.read.option('header', True).parquet(argrefData)
cond = ["col1", "col2", "col3"]
fi = tst_DF.join(refDF, cond , "left_anti")
So far things are working. However, as a requirement, I need to get the list of elements if the above gives a count > 0, i.e. if fi.count() > 0, then I need the element names. So I tried the code below, but it is throwing an error.
if fi.filter(col("col1").count() > 0).collect():
    fi.show()
Error:
TypeError: 'Column' object is not callable
Note:
I have 3 columns as the joining condition, which are in a list assigned to the variable cond, and I need to get the unmatched records for those 3 columns, so the if condition has to accommodate them. Of course there are many other columns due to the join.
Please suggest where I am making mistakes.
Thank you
If I understand correctly, that's simply:
fi.select(cond).collect()
The left_anti join already gets the records which do not match (those that exist in tst_DF but not in refDF).
You can add a distinct before the collect to remove duplicates.
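Putting the count check and the column selection together, a minimal sketch reusing fi and cond from the question could look like:
if fi.count() > 0:
    # rows of tst_DF with no match in refDF, limited to the three join columns
    unmatched = fi.select(cond).distinct().collect()
    print(unmatched)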
Did you import the column function?
from pyspark.sql import functions as F
...
if fi.filter(F.col("col1").count() > 0).collect():
    fi.show()

Data frame data type conflict... conversion

I am receiving the error below upon running a python file.
invalid literal for int() with base 10: 'data missing'
It looks as though some of the data in my dataframe is not of a type compatible with an arithmetic operation I would like to perform.
Can someone advise on how I might locate the position of the data that is giving the error, and/or bypass the error entirely with a preprocessing step that allows the normalization step to run?
I am confused because missing data should have been dropped by df1.dropna and therefore should not still be there.
The original line throwing the error was the line used to normalize the data (the last line below).
I've tried to convert the dataframe with
df1 = df1.astype(int)
df1 = pd.concat([df2, df3], axis=1, join_axes=[df2.index])
df1 = df1.fillna(method='bfill')
df1 = df1.dropna(axis=0)
df1 = df1.astype(int)
df1 = (df1 - df1.min()) / (df1.max() - df1.min())
I think you should try df1.dtypes to check the data type of each column first.
Here is the documentation for that:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
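Beyond checking the dtypes, one way to locate the offending cells is to coerce each column to numeric and look at the values that fail the conversion. A minimal sketch, assuming df1 is the concatenated frame from the question:
import pandas as pd

# object columns are the usual suspects for strings like 'data missing'
print(df1.dtypes)

# for every column, list the rows whose values cannot be parsed as numbers
for col in df1.columns:
    coerced = pd.to_numeric(df1[col], errors='coerce')
    bad = df1[coerced.isna() & df1[col].notna()]
    if not bad.empty:
        print(col, bad.index.tolist())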

How to split pandas dataframe into multiple dataframes based on unique string value without aggregating

I have a df with multiple country codes in a column (US, CA, MX, AU...) and want to split this one df into multiple ones based on these country code values, but without aggregating it.
I've tried a for loop but was only able to get one df and it was aggregated with groupby().
I gave up trying to figure it out, so I split them based on str.match and wrote one line for each country code. Is there a nice for loop that could achieve the same as the code below? If it would write a CSV file for each new df, that would be fantastic.
us = df[df['country_code'].str.match("US")]
mx = df[df['country_code'].str.match("MX")]
ca = df[df['country_code'].str.match("CA")]
au = df[df['country_code'].str.match("AU")]
.
.
.
We can write a for loop which takes each code and uses query to get the correct part of the data. Then we write it to CSV with to_csv, also using an f-string:
codes = ['US', 'MX', 'CA', 'AU']
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    temp.to_csv(f'df_{code}.csv')
Note: f-strings only work in Python >= 3.6.
To keep the dataframes:
codes = ['US', 'MX', 'CA', 'AU']
dfs = []
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    dfs.append(temp)
    temp.to_csv(f'df_{code}.csv')
Then you can access them by index, for example: print(dfs[0]) or print(dfs[1]).
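An alternative that avoids query and the hard-coded list of codes is to iterate over groupby, which splits the frame by country code without aggregating anything. A minimal sketch:
dfs = {}
for code, group in df.groupby('country_code'):
    dfs[code] = group                           # keep each country's rows unchanged
    group.to_csv(f'df_{code}.csv', index=False)

# access an individual frame by its code, e.g. dfs['US']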

How do I give a text key to a dataframe stored as a value in a dictionary?

So I have 3 dataframes - df1, df2, df3. I'm trying to loop through each dataframe so that I can run some preprocessing - set the datetime, extract the hour to a separate column, etc. However, I'm running into some issues:
If I store the df in a dict as in df_dict = {'df1' : df1, 'df2' : df2, 'df3' : df3} and then loop through it as in
for k, v in df_dict.items():
    if k == 'df1':
        v['Col1']....
    else:
        v['Coln']....
I get a NameError: name 'df1' is not defined
What am I doing wrong? I initially thought I was not reading in the df1..3 data, but that seems to operate OK (as in it doesn't fail, and it's clearly reading the data in given the time lag; they are big files). The code preceding it (for the load) is:
DF_DATA = {'df1': 'df1.csv', 'df2': 'df2.csv', 'df3': 'df3.csv'}
for k, v in DF_DATA.items():
    print(k, v)         # this works to print out both key and value
    k = pd.read_csv(v)  # this does not
I am thinking this may be the cause but am not sure. I'm expecting the load loop to create the 3 dataframes and put them into memory. Then, for the loop at the top of the page, I want to reference the string key in my if-block condition so that each df can get a slightly different preprocessing treatment.
Thanks very much in advance for your assistance.
You didn't create df_dict correctly. Try this:
DF_DATA = {'df1': 'df1.csv', 'df2': 'df2.csv', 'df3': 'df3.csv'}
df_dict = {k: pd.read_csv(v) for k, v in DF_DATA.items()}
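The reason for the NameError is that k = pd.read_csv(v) only rebinds the loop variable k; it never creates variables named df1, df2, df3, so the later df_dict = {'df1': df1, ...} fails. With the dictionary comprehension above, the preprocessing loop from the question can run directly on the dictionary values. A minimal sketch (the datetime conversion is only a placeholder for whatever preprocessing each frame actually needs):
for k, v in df_dict.items():
    if k == 'df1':
        v['Col1'] = pd.to_datetime(v['Col1'])   # placeholder: df1-specific preprocessing
    else:
        v['Coln'] = pd.to_datetime(v['Coln'])   # placeholder: preprocessing for the other frames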

Difference between elements when reading from multiple files

I am trying to get the difference between each element after reading multiple CSV files. Each CSV file has 13 rows and 128 columns. I am trying to get the column-wise difference.
I read the files using
data = [pd.read_csv(f, index_col=None, header=None) for f in _temp]
I get a list of all samples.
According to this, I have to use .diff() to get the difference, which goes something like this:
data.diff()
This works, but instead of getting the difference between rows within the same sample, I get the difference between rows of one sample and another sample.
Is there a way to separate this and let the difference happen within each sample?
Edit
OK, I am able to get the difference between the data elements by doing this:
_local = pd.DataFrame(data)
_list = []
_a = _local.index
for _aa in _a:
    _list.append(_local[0][_aa].diff())
flow = pd.DataFrame(_list, index=_a)
I am creating too many DataFrames; is there a better way to do this?
Here is a relatively efficient way to read your dataframes one at a time and calculate their differences, which are stored in a list df_diff.
df_diff = []
df_old = pd.read_csv(_temp[0], index_col=None)
for f in _temp[1:]:
    df = pd.read_csv(f, index_col=None)
    df_diff.append(df_old - df)
    df_old = df
Since your code works, you should really post on https://codereview.stackexchange.com/
(PS: the leading "_" is not really Pythonic; please avoid it. It makes your code harder to read.)
_local = pd.DataFrame(data)
_list = [_local[0][_aa].diff() for _aa in _local.index]
flow = pd.DataFrame(_list, index=_local.index)
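If the goal is simply the row-to-row difference within each file, it may be enough to call .diff() on every DataFrame in the original list. A minimal sketch, reusing data (the list of 13x128 frames) from the question:
# one differenced DataFrame per CSV file; differences are taken within each sample
diffs = [df.diff() for df in data]

# optionally stack them with a sample index
flow = pd.concat(diffs, keys=range(len(diffs)))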
