The robust way to combine multiple DataFrames considering different input scenarios - python-3.x

There are three data frames, df_1, df_2 and df_3. I combined them as follows:
result1 = df_1.append(df_2, ignore_index=True)
result2 = result1.append(df_3, ignore_index=True)
Then result2 is the combined dataframe. This code segment currently works fine if none of the three input data frames is empty.
However, in practice, any of these three input data frames can be empty. What is the most efficient approach to handle these different scenarios without implementing complex if-else logic to evaluate each one, e.g., df_1 is empty, or both df_1 and df_3 are empty, etc.?

IIUC, use concat with a list of DataFrames; it works whether all, some, or none of the DataFrames are empty:
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
df_3 = pd.DataFrame()
df = pd.concat([df_1, df_2, df_3], ignore_index=True)
print(df)
Empty DataFrame
Columns: []
Index: []
df_1 = pd.DataFrame()
df_2 = pd.DataFrame({'a':[1,2]})
df_3 = pd.DataFrame({'a':[10,20]})
df = pd.concat([df_1, df_2, df_3], ignore_index=True)
print(df)
    a
0   1
1   2
2  10
3  20
df_1 = pd.DataFrame()
df_2 = pd.DataFrame({'a':[1,2]})
df_3 = pd.DataFrame()
df = pd.concat([df_1, df_2, df_3], ignore_index=True)
print(df)
   a
0  1
1  2
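If the list of inputs grows, one defensive variant (my own sketch, not part of the original answer; the helper name combine_frames is hypothetical) drops empty frames before concatenating, since newer pandas versions may warn that empty inputs can affect dtype inference:

import pandas as pd

def combine_frames(frames):
    # Keep only the non-empty inputs; pd.concat raises ValueError
    # on an empty list, so fall back to an empty DataFrame there.
    non_empty = [f for f in frames if not f.empty]
    if not non_empty:
        return pd.DataFrame()
    return pd.concat(non_empty, ignore_index=True)

result = combine_frames([df_1, df_2, df_3])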

Related

Writing to a tsv file from multiple lists in Python

I have two lists
ids = [1,2,3]
values = [10,20,30]
I need to create a tsv file with two columns - id and result - and put the ids and values in there. The output should look like this:
id result
1 10
2 20
3 30
I wrote the code below:
import csv

output_columns = ['id', 'result']
data = zip(output_columns, ids, values)
with open('output.tsv', 'w', newline='') as f_output:
    tsv_output = csv.writer(f_output, delimiter='\t')
    tsv_output.writerow(data)
But this gives me output like the below, which is wrong:
('id', '1', '10')	('result', '2', '20')
I understand that this wrong output is because of the way I used zip to create a row of data, but I am not sure how to solve it.
Please suggest.
import csv

output_columns = ['id', 'result']
data = zip(ids, values)
with open('output.tsv', 'w', newline='') as f_output:
    tsv_output = csv.writer(f_output, delimiter='\t')
    tsv_output.writerow(output_columns)
    for id, val in data:
        tsv_output.writerow([id, val])
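As a small variant (my addition, not in the original answer), csv.writer.writerows can replace the explicit loop and write every zipped pair in one call:

import csv

ids = [1, 2, 3]
values = [10, 20, 30]

with open('output.tsv', 'w', newline='') as f_output:
    tsv_output = csv.writer(f_output, delimiter='\t')
    tsv_output.writerow(['id', 'result'])   # header row
    tsv_output.writerows(zip(ids, values))  # all data rows at once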
It's easier using pandas
In [8]: df = pd.DataFrame({"ids":[1,2,3], "values":[10,20,30]})
In [9]: df
Out[9]:
   ids  values
0    1      10
1    2      20
2    3      30
In [10]: df.to_csv("data.tsv", sep="\t", index=False)
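Note that the frame above uses the column names ids/values; to produce exactly the id/result header the question asks for, name the columns accordingly (a minor adjustment of mine, not in the original answer):

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "result": [10, 20, 30]})
df.to_csv("output.tsv", sep="\t", index=False)  # writes the id/result header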

looping through list of pandas dataframes and make it empty dataframe

I have multiple pandas dataframes. I want to empty each dataframe like below:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
Instead of doing it individually, is there any way to do it in one line of code?
If I understood correctly, this will work:
df_list = []
for i in range(0, 10):
    df = pd.DataFrame()
    df_list.append(df)

print(df_list[0].head())
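If it literally has to be one line, a list comprehension does the same thing, and tuple assignment rebinds a few existing names at once (both are my sketches, assuming ten frames as above):

import pandas as pd

# Ten independent empty DataFrames in a single expression.
df_list = [pd.DataFrame() for _ in range(10)]

# Or rebind existing names in one line.
df1, df2 = pd.DataFrame(), pd.DataFrame()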

Filter dataframe based on groupby sum()

I want to filter my dataframe based on a groupby sum(). I am looking for lines where the amounts for a specific date sum to zero.
I have solved this by creating a for loop, but I suspect this will reduce performance if the dataframe is large. It also seems clunky.
newdf = pd.DataFrame()
newdf['name'] = ('leon', 'eurika', 'monica', 'wian')
newdf['surname'] = ('swart', 'swart', 'swart', 'swart')
newdf['birthdate'] = ('14051981', '198001', '20081012', '20100621')
newdf['tdate'] = ('13/05/2015', '14/05/2015', '15/05/2015', '13/05/2015')
newdf['tamount'] = (100.10, 111.11, 123.45, -100.10)

df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
df2 = df.loc[df["tamount"] == 0, "tdate"]
df3 = pd.DataFrame()
for i in df2:
    df3 = df3.append(newdf.loc[newdf["tdate"] == i])
print(df3)
This code outputs the two lines that sum to zero on tamount when combined:
   name surname   birthdate       tdate  tamount
0  leon   swart  1981-05-14  13/05/2015    100.1
3  wian   swart  2010-06-21  13/05/2015   -100.1
Just use basic numpy :)
import numpy as np

df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
dates = df['tdate'][np.where(df['tamount'] == 0)[0]]
newdf[np.isin(newdf['tdate'], dates)]  # boolean mask; comparing it == True is redundant
Hope this helps; let me know if you have any questions.
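A pandas-only alternative (my addition, not part of the answer above) broadcasts each date's sum back onto its rows with transform, so no intermediate groupby frame or loop is needed:

# Rows whose tdate group sums to zero, in one expression.
mask = newdf.groupby('tdate')['tamount'].transform('sum') == 0
print(newdf[mask])

# With real-world floats, an exact == 0 test can be fragile;
# np.isclose(..., 0) is a safer comparison in general.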

Which properties has Spark DataFrame after join of two DataFrames with same partitioning?

Say, I have two Spark DataFrames with column some_col
df_1 = df_1.repartition(50, 'some_col')
df_2 = df_2.repartition(50, 'some_col')
df_3 = df_1.join(df_2, on='some_col')
I thought that df_3 should also be partitioned by some_col and have 50 partitions, but my experiments show that at least the last condition is not true. Why does this happen?
Which time-consuming operations (re-partitioning or data relocation) can happen after
df_3 = df_3.repartition(50, 'some_col')
The condition "df_3 should also be partitioned by some_col and have 50 partitions" will only hold if df_1 and df_2 have partitions with the same values of some_col. For example, if df_1 has 2 partitions, [(1,2)] and [(3,1),(3,7)] (so its some_col values are 1 and 3), then df_2 also needs to have its partitions keyed on the some_col values 1 and 3. If that is the case, then joining df_1 and df_2 produces a df_3 with the same number of partitions as df_1 or df_2.
In all other cases Spark will create the default 200 partitions and shuffle the whole join operation.
For clarity, you can try the following example:
rdd1 = sc.parallelize([(1,2), (1,9), (2, 3), (3,4)])
df1 = rdd1.toDF(['a', 'b'])
df1 = df1.repartition(3, 'a')
df1.rdd.glom().collect() #outputs like:
>> [[Row(a=2,b=3)], [Row(a=3,b=4)], [Row(a=1,b=2), Row(a=1,b=9)]]
df1.rdd.getNumPartitions()
>>3
rdd2 = sc.parallelize([(1,21), (1,91), (2, 31), (3,41)])
df2 = rdd2.toDF(['a', 'b'])
df2 = df2.repartition(3, 'a')
df2.rdd.glom().collect() #outputs like:
>> [[Row(a=2,b=31)], [Row(a=3,b=41)], [Row(a=1,b=21), Row(a=1,b=91)]]
df2.rdd.getNumPartitions()
>>3
df3 = df1.join(df2, on='a')
df3.rdd.glom().collect() #outputs like:
>> [[Row(a=2,b=3,b=31)], [Row(a=3,b=4,b=41)], [Row(a=1,b=2,b=21), Row(a=1,b=9,b=91)]]
df3.rdd.getNumPartitions()
>>3
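When the co-partitioning condition does not hold, the join falls back to spark.sql.shuffle.partitions (200 by default). A minimal sketch of inspecting and adjusting that setting (assuming a standard SparkSession named spark):

# Inspect the current shuffle-partition setting (defaults to 200).
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Lower it so shuffled joins yield 50 partitions instead.
spark.conf.set("spark.sql.shuffle.partitions", "50")

df3 = df1.join(df2, on='a')
print(df3.rdd.getNumPartitions())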

Calling a data frame via a string

I have a list of countries such as:
country = ["Brazil", "Chile", "Colombia", "Mexico", "Panama", "Peru", "Venezuela"]
I created data frames using the names from the country list:
for c in country:
    c = pd.read_excel(str(c + ".xls"), skiprows=1)
    c = pd.to_datetime(c.Date, infer_datetime_format=True)
    c = c[["Date", "spreads"]]
Now I want to be able to merge all the countries' data frames using the Date column as the key. The idea is to create a loop like the following:
df = Brazil  # this is the first dataframe, which also corresponds to the first element of the country list
for i in range(len(country) - 1):
    df = df.merge(country[i+1], on="Date", how="inner")
df.set_index("Date", inplace=True)
I got the error ValueError: can not merge DataFrame with instance of type <class 'str'>. It seems Python is not resolving the data frames whose names are in the country list. How can I reach those data frames starting from the country list?
Thanks masters!
Your loop doesn't modify the contents of the country list, so country is still a list of strings.
Consider building a new list of dataframes and looping over that:
country_dfs = []
for c in country:
    df = pd.read_excel(c + ".xls", skiprows=1)
    # assign the parsed dates back to the column; reassigning df itself
    # would replace the whole frame with a Series
    df["Date"] = pd.to_datetime(df["Date"], infer_datetime_format=True)
    df = df[["Date", "spreads"]]
    # add the new dataframe to our list of dataframes
    country_dfs.append(df)
then to merge,
merged_df = country_dfs[0]
for df in country_dfs[1:]:
    merged_df = merged_df.merge(df, on='Date', how='inner')
merged_df.set_index('Date', inplace=True)
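The same merge loop can be collapsed into a single expression with functools.reduce (my own variant, not from the original answer):

from functools import reduce

# Fold the list of frames into one inner-joined DataFrame keyed on "Date".
merged_df = reduce(
    lambda left, right: left.merge(right, on='Date', how='inner'),
    country_dfs,
)
merged_df.set_index('Date', inplace=True)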
