Append Two Dataframes Together (Pandas, Python3)

I am trying to append/join(?) two different dataframes together that don't share any overlapping data.
DF1 looks like
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
....
Brown 6
and DF2 looks like
Area Miles
2 3
1 2
....
7 12
I am trying to append these together using
bigdata = df1.append(df2,ignore_index = True).reset_index()
but I get this
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
Area Miles
2 3
1 2
How do I get something like this?
Teams Points Area Miles
Red 2 2 3
Green 1 1 2
Orange 3
Yellow 4
EDIT: regarding EdChum's answer, I have tried merge and join, but each creates a somewhat strange table. Instead of what I am looking for (as listed above), they return something like this:
Teams Points Area Miles
Red 2 2 3
Green 1
Orange 3 1 2
Yellow 4

Use concat and pass param axis=1:
In [4]:
pd.concat([df1,df2], axis=1)
Out[4]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
join also works:
In [8]:
df1.join(df2)
Out[8]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
As does merge:
In [11]:
df1.merge(df2,left_index=True, right_index=True, how='left')
Out[11]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
EDIT
If the indices do not align, for example when the first df has index [0,1,2,3] and the second df has index [0,2], the operations above align on the first df's index, so the row at index 1 gets NaN in the columns that come from the second df. To fix this, reindex the second df, either by calling reset_index(drop=True) or by assigning the index directly: df2.index = [0, 1].
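The misalignment described in the EDIT can be reproduced with a small sketch. The data below is an assumption based on the first rows of the question's tables, and df2 is deliberately given the index [0, 2].

```python
import pandas as pd

df1 = pd.DataFrame({"Teams": ["Red", "Green", "Orange", "Yellow"],
                    "Points": [2, 1, 3, 4]})
# df2 carries a non-contiguous index on purpose (assumed for illustration)
df2 = pd.DataFrame({"Area": [2, 1], "Miles": [3, 2]}, index=[0, 2])

# Aligning on the index places the second df2 row next to "Orange"
misaligned = pd.concat([df1, df2], axis=1)

# Resetting df2's index to 0..n-1 makes the rows line up by position
aligned = pd.concat([df1, df2.reset_index(drop=True)], axis=1)
print(aligned)
```

Passing drop=True keeps the old index from being added as an extra column.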

Related

How to add value to specific index that is out of bounds

I have a list array
list = [[0, 1, 2, 3, 4, 5],[0],[1],[2],[3],[4],[5]]
Say I add [6, 7, 8] to the first row as the header for my three new columns, what's the best way to add values in these new columns, without getting index out of bounds? I've tried first filling all three columns with "" but when I add a value, it then pushes the "" out to the right and increases my list size.
Would it be any easier to use a Pandas dataframe? Are you allowed "gaps" in a Pandas dataframe?
Based on the OP's comment, I think a pandas DataFrame is the more appropriate solution. You cannot have 'gaps', but you can have NaN values, like this:
import numpy as np
import pandas as pd
# create sample data
a = np.arange(1, 6)
df = pd.DataFrame(zip(*[a]*5))
print(df)
output:
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
for adding empty columns:
# add new columns, not empty but filled w/ nan
df[5] = df[6] = df[7] = float('nan')
# fill a single value in column 7, row 4 (df.loc avoids chained assignment)
df.loc[4, 7] = 123
print(df)
output:
0 1 2 3 4 5 6 7
0 1 1 1 1 1 NaN NaN NaN
1 2 2 2 2 2 NaN NaN NaN
2 3 3 3 3 3 NaN NaN NaN
3 4 4 4 4 4 NaN NaN NaN
4 5 5 5 5 5 NaN NaN 123.0

How can I transform this dataset in pandas so that it easy to filter and compare?

I have the following DataFrame:
Segments Airline_pct_tesco Airline_pct_asda food_pct_tesco food_pct_asda Airline_diff food_diff
A 1 2 4 2 -1 2
B 2 2 4 4 0 0
c 10 5 12 10 5 2
I want to convert it to this format:
Segments Category Asda% Tesco% Diff%
A Airline 2 1 -1
b Food 4 4 0
c Airline 5 10 5
A Food 2 4 2
(only partially showing). Note
category is the col name without the '_pct_tesco' or '_diff' or '_pct_asda'
I am unsure how to go about this - I have tried transform but I just don't know how I can get it in a way which is easy for any user to use. I am doing this in pandas and am not sure how to even begin! The Asda% are related to '_pct_asda' columns and same for diff and tesco columns respectively..
Use set_index to move Segments out of the columns, then build the new columns with MultiIndex.from_frame, using str.extract to split each column name into the part before a known suffix and the suffix itself. Finally, stack the first level to get long form.
new_df = df.set_index('Segments')
# Define allowed suffixes here
suffixes = ['_pct_asda', '_pct_tesco', '_diff']
# Extract Values
new_df.columns = (
pd.MultiIndex.from_frame(
new_df.columns.str.extract(rf'(.*?)({"|".join(suffixes)})'),
names=['Category', None]
)
)
new_df = new_df.stack(0)
new_df:
_diff _pct_asda _pct_tesco
Segments Category
A Airline -1 2 1
food 2 2 4
B Airline 0 2 2
food 0 4 4
c Airline 5 5 10
food 2 10 12
To get cleaner output add reset_index + rename to fix column names and index and also re-order columns.
new_df = new_df.reset_index().rename(columns={
'_pct_asda': 'Asda%',
'_pct_tesco': 'Tesco%',
'_diff': 'Diff%'
})[['Segments', 'Category', 'Asda%', 'Tesco%', 'Diff%']]
new_df:
Segments Category Asda% Tesco% Diff%
0 A Airline 2 1 -1
1 A food 2 4 2
2 B Airline 2 2 0
3 B food 4 4 0
4 c Airline 5 10 5
5 c food 10 12 2
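As an alternative to the MultiIndex approach above, the same reshaping can be sketched with melt followed by pivot. This is a different technique from the one in the answer. The names long_df, parts and wide and the regular expression are assumptions for illustration, and passing a list to pivot's index needs pandas 1.1 or newer.

```python
import pandas as pd

df = pd.DataFrame({
    "Segments": ["A", "B", "c"],
    "Airline_pct_tesco": [1, 2, 10],
    "Airline_pct_asda": [2, 2, 5],
    "food_pct_tesco": [4, 4, 12],
    "food_pct_asda": [2, 4, 10],
    "Airline_diff": [-1, 0, 5],
    "food_diff": [2, 0, 2],
})

# One row per (Segments, original column name)
long_df = df.melt(id_vars="Segments", var_name="column", value_name="value")

# Split each column name into a category and one of the known suffixes
parts = long_df["column"].str.extract(
    r"^(?P<Category>.*?)_(?P<Measure>pct_asda|pct_tesco|diff)$"
)
long_df = pd.concat([long_df[["Segments", "value"]], parts], axis=1)

# Spread the three measures back into columns and tidy the names
wide = (
    long_df.pivot(index=["Segments", "Category"], columns="Measure", values="value")
    .rename(columns={"pct_asda": "Asda%", "pct_tesco": "Tesco%", "diff": "Diff%"})
    .reset_index()[["Segments", "Category", "Asda%", "Tesco%", "Diff%"]]
)
print(wide)
```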

Filter rows based on the count of unique values

I need to count how often each value occurs in column A and keep only the rows whose value occurs at least, say, 2 times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the counts but cannot figure out how to filter on them with df.value_counts().
I want to keep the rows whose value in column A occurs at least 2 times. Expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple 2
Mango 1
Orange 2
dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
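Alternatively, because the threshold here is 2, use duplicated with the parameter keep=False, which flags every row whose value in A occurs more than once: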
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
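For thresholds other than 2, one possible sketch is to attach each row's group size with groupby and transform and then filter on it. The name min_count is a hypothetical parameter, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({"A": ["Apple", "Orange", "Apple", "Mango", "Orange"],
                   "C": [4, 5, 3, 5, 1]})

min_count = 2  # hypothetical threshold; adjust as needed

# For every row, the number of rows sharing its value in column A
group_sizes = df.groupby("A")["A"].transform("size")

result = df[group_sizes >= min_count]
print(result)
```

This keeps the original row order and index, like the isin approach.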

how to change a value of a cell that contains nan to another specific value?

I have a dataframe that contains NaN values in a particular column. While iterating through the rows, whenever I come across a NaN (detected with the isnan() method) I need to change it to some other value, depending on some conditions. I tried replace() and fillna() with the limit parameter, but they modify the whole column as soon as they hit the first NaN. Is there a method that lets me assign a value to one specific NaN rather than changing all the values of the column?
Example: the dataframe looks like it:
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 NaN
2 x3 3 'cat' 1 2 3 1 1 NaN
3 x4 6 'lion' 8 4 3 7 1 NaN
4 x5 4 'lion' 1 1 3 1 1 NaN
5 x6 8 'cat' 10 10 9 7 1 0.0
and I have a list like
a = [1.0, 0.0]
and I expect the result to look like
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
I wanted to change the target_class values based on some conditions and assign values of the above list.
I believe you need to replace the NaN values with 1 only for the index labels listed in idx (the remaining NaNs become 0):
mask = df['target_class'].isnull()
idx = [1,2,3]
df.loc[mask, 'target_class'] = df[mask].index.isin(idx).astype(int)
print (df)
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
Or:
idx = [1,2,3]
s = pd.Series(df.index.isin(idx).astype(int), index=df.index)
df['target_class'] = df['target_class'].fillna(s)
EDIT:
From the comments, the solution is to assign values by index label and column name with DataFrame.loc:
df2.loc['x2', 'target_class'] = list1[0]
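A minimal sketch of that cell-by-cell assignment follows. It assumes a cut-down version of the question's frame with only two columns, and the rule inside the loop is a made-up placeholder for the real conditions.

```python
import numpy as np
import pandas as pd

# Cut-down version of the question's frame (assumed for illustration)
df = pd.DataFrame(
    {"points": ["x2", "x3", "x4", "x5", "x6"],
     "target_class": [np.nan, np.nan, np.nan, np.nan, 0.0]},
    index=[1, 2, 3, 4, 5],
)
a = [1.0, 0.0]

# Assign to one specific cell by row label and column name
df.loc[1, "target_class"] = a[0]

# Visit only the rows that are still NaN and decide per row
for label in df.index[df["target_class"].isna()]:
    # hypothetical condition; replace with the real rule
    df.loc[label, "target_class"] = a[0] if label in (2, 3) else a[1]

print(df)
```

Each df.loc[label, column] assignment touches a single cell, so the other NaN values stay untouched until their own turn.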
I assume your conditions for imputing the NaN values do not depend on how many of them a column contains. In the code below, all the imputation rules live in one function that receives the entire row (the one containing the NaN) and the name of the column you are examining. If the rules also need the whole dataframe, just pass it to the replace_nan function as well. In this example I impute the col element with the mean of the other columns in the row.
import pandas as pd
import numpy as np
def replace_nan(row, col):
    row[col] = row.drop(col).mean()
    return row
df = pd.DataFrame(np.random.rand(5,3), columns = ['col1', 'col2', 'col3'])
col_to_impute = 'col1'
df.loc[[1, 3], col_to_impute] = np.nan
df = df.apply(lambda x: replace_nan(x, col_to_impute) if np.isnan(x[col_to_impute]) else x, axis=1)
The only thing you need to do is make the assignment correctly, that is, assign only to the rows that contain nulls.
Example dataset:
,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
Note: the last two rows contain nulls in the 'label' column. Next, we load the dataset:
df = pd.read_csv('dataset.csv')
Now we build the appropriate condition:
cond = df['label'].isnull()
Then we assign to exactly those rows (I don't know your assignment logic, so I simply assign 1 to the NaNs):
df.loc[cond,'label'] = 1
There are other, more precise approaches; the fillna() method could also be used. Please describe your assignment logic so that we can help further.
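As a sketch of the fillna() route mentioned above, the example dataset can be filled from a Series that is aligned on the index. The rule used here (1 for "click" events, 0 otherwise) is purely hypothetical, and the data is read from an in-memory string instead of dataset.csv.

```python
import io
import pandas as pd

csv_text = """,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Hypothetical rule: a missing label becomes 1 for "click" events, else 0
rule = (df["type"] == "click").astype(float)

# fillna with a Series only fills the NaN positions, matched by index
df["label"] = df["label"].fillna(rule)
print(df)
```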

How can you identify the best companies for each variable and copy the cases?

I want to compare the means of subgroups. The cases of the subgroups with the lowest and the highest mean should be copied and appended to the end of the dataset:
Input
df.head(10)
Outcome
Company Satisfaction Image Forecast Contact
0 Blue 2 3 3 1
1 Blue 2 1 3 2
2 Yellow 4 3 3 3
3 Yellow 3 4 3 2
4 Yellow 4 2 1 5
5 Blue 1 5 1 2
6 Blue 4 2 4 3
7 Yellow 5 4 1 5
8 Red 3 1 2 2
9 Red 1 1 1 2
I have around 100 cases in my sample. Now I look at the means for each company.
Input
df.groupby(['Company']).mean()
Outcome
Satisfaction Image Forecast Contact
Company
Blue 2.666667 2.583333 2.916667 2.750000
Green 3.095238 3.095238 3.476190 3.142857
Orange 3.125000 2.916667 3.416667 2.625000
Red 3.066667 2.800000 2.866667 3.066667
Yellow 3.857143 3.142857 3.000000 2.714286
So for Satisfaction, Yellow has the best value and Blue the worst. I want to copy the cases of Yellow and Blue and add them to the dataset again, but with the new labels "Best" and "Worst". I don't want to rename the existing cases, and I want to iterate over the dataset and do this for the other columns too (for example Image). Is there a solution for this? After adding the cases I want output like this:
Input
df.groupby(['Company']).mean()
Expected Outcome
Satisfaction Image Forecast Contact
Company
Blue 2.666667 2.583333 2.916667 2.750000
Green 3.095238 3.095238 3.476190 3.142857
Orange 3.125000 2.916667 3.416667 2.625000
Red 3.066667 2.800000 2.866667 3.066667
Yellow 3.857143 3.142857 3.000000 2.714286
Best 3.857143 3.142857 3.000000 3.142857
Worst 2.666667 2.583333 2.866667 2.625000
But as I said, it is really important that the cases of the companies with the best and worst values for each column are added again and not just renamed, because I want to do further data processing with other software.
************************UPDATE****************************
I found out how to copy the correct cases:
Input
df2 = df.loc[df['Company'] == 'Yellow']
df2 = df2.replace('Yellow','Best')
df2 = df2[['Company','Satisfaction']]
new = [df,df2]
result = pd.concat(new)
result
Output
Company Contact Forecast Image Satisfaction
0 Blue 1.0 3.0 3.0 2
1 Blue 2.0 3.0 1.0 2
2 Yellow 3.0 3.0 3.0 4
3 Yellow 2.0 3.0 4.0 3
..........................................
87 Best NaN NaN NaN 3
90 Best NaN NaN NaN 4
99 Best NaN NaN NaN 1
111 Best NaN NaN NaN 2
Now I want to copy the cases of the company with the best value for the other variables too. But so far I have to identify manually which company is best for each category. Isn't there a more convenient solution?
I have a solution. First I create a list of the variables for which I want to create dummy "Best" and "Worst" companies:
variables = ['Contact','Forecast','Satisfaction','Image']
Then I loop over these columns and add the cases again with the new label "Best" or "Worst":
for Start in variables:
    # mean of the current variable per company
    neu = df.groupby(['Company'], as_index=False)[Start].mean()
    # companies with the highest and lowest mean
    Best = neu['Company'].loc[neu[Start].idxmax()]
    Worst = neu['Company'].loc[neu[Start].idxmin()]
    # copy the cases of those companies and relabel them
    dfBest = df.loc[df['Company'] == Best]
    dfWorst = df.loc[df['Company'] == Worst]
    dfBest = dfBest.replace(Best,'Best')
    dfWorst = dfWorst.replace(Worst,'Worst')
    # keep only the company label and the current variable
    dfBest = dfBest[['Company',Start]]
    dfWorst = dfWorst[['Company',Start]]
    # append the copies to the dataset
    new = [df,dfBest,dfWorst]
    df = pd.concat(new)
Thanks guys :)
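A more compact variant of the loop above can be sketched with idxmax and idxmin on the table of group means, which return the best and worst company for every column at once. The data is the ten rows shown by df.head(10) in the question; the names means, best, worst and pieces are assumptions. Unlike the loop, this sketch always computes the means from the original df.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Blue", "Blue", "Yellow", "Yellow", "Yellow",
                "Blue", "Blue", "Yellow", "Red", "Red"],
    "Satisfaction": [2, 2, 4, 3, 4, 1, 4, 5, 3, 1],
    "Image": [3, 1, 3, 4, 2, 5, 2, 4, 1, 1],
    "Forecast": [3, 3, 3, 3, 1, 1, 4, 1, 2, 2],
    "Contact": [1, 2, 3, 2, 5, 2, 3, 5, 2, 2],
})

# Mean of every variable per company
means = df.groupby("Company").mean()

# For each variable, the company with the highest / lowest mean
best = means.idxmax()
worst = means.idxmin()

pieces = [df]
for var in means.columns:
    for label, company in (("Best", best[var]), ("Worst", worst[var])):
        # Copy that company's cases, keeping only the current variable
        cases = df.loc[df["Company"] == company, ["Company", var]].copy()
        cases["Company"] = label
        pieces.append(cases)

result = pd.concat(pieces, ignore_index=True)
print(result.groupby("Company").mean())
```

When two companies tie for the lowest or highest mean, idxmin and idxmax return the first one in index order, so ties may need separate handling.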

Resources