Plot from CSV with pandas grouping - python-3.x

I have a CSV file with 4 columns and no header.
How can I average the values of one column within each group defined by the first column, using pandas? Do I have to loop over every value of the first column? I am not sure about the implementation.
For example, if I have a csv file:
a,1,2,4
a,2,2,5
a,3,2,6
a,4,2,5
b,1,3,2
b,2,3,3
b,3,3,4
and I want a plot with a, average(3rd column) and b, average(3rd column).
I think I have to do something like:
df = pd.read_csv(...)
x = df.groupby("values of the 1st column").mean()
I would also like to plot a KDE over the 2nd column, which has ten rows for every group of the first column.
In particular, I don't understand how to group data from a CSV file without a header.
Thank you for the help.

Assume your dataframe looks like
print(df)
0 1 2 3
0 a 1 2 4
1 a 2 2 5
2 a 3 2 6
3 a 4 2 5
4 b 1 3 2
5 b 2 3 3
6 b 3 3 4
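If you want a bar plot of the per-group average for a and b, you can do the following. Note that this uses the column labelled 3, which is the last column; use 2 instead if you mean the third column counting from one.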
import pandas as pd
import matplotlib.pyplot as plt
df.groupby(0).mean()[3].plot.bar(rot=0)
plt.show()
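Since the question was specifically about a CSV without a header, here is a minimal sketch of the loading step. It assumes the sample rows from the question; `csv_text` and `means` are illustrative names, and `io.StringIO` stands in for the real file path. Passing `header=None` makes pandas label the columns 0, 1, 2, 3, which is why the grouping key is `0`.

```python
import io

import pandas as pd

# Sample data from the question; in practice pass the file path instead.
csv_text = """a,1,2,4
a,2,2,5
a,3,2,6
a,4,2,5
b,1,3,2
b,2,3,3
b,3,3,4
"""

# header=None tells pandas the first row is data, not column names,
# so the columns are labelled with the integers 0..3.
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Group by the first column (label 0) and average columns 2 and 3.
means = df.groupby(0)[[2, 3]].mean()
print(means)
```

From here, `means[3].plot.bar(rot=0)` gives the bar plot above. For the KDE over the 2nd column, something like `df.groupby(0)[1].plot.kde()` should draw one density curve per group, assuming matplotlib and SciPy are installed.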

Related

Csv file split comma separated values into separate rows and dividing the corresponding dollar amount by the number of comma separated values in panda

beginner here!
I have a CSV file in which some cells contain comma-separated values. I want to split each comma-separated value into its own row in pandas. However, the corresponding dollar amount should be divided by the number of comma-separated values in that cell, and the result should be exported to a different CSV file.
the csv table and the desired output table
I have used df.explode('IDs') but couldn't figure out how to divide the Dollar_Amount by the number of IDs in the corresponding cells.
import pandas as pd
df = pd.read_csv('inputCSV.csv')
new_csv = df.explode('IDs')
new_csv.to_csv('outputCSV.csv')
You can divide the dollar amount by the number of ids in each row before using explode. This can be done as follows:
# Preprocessing
df['Dollar_Amount'] = df['Dollar_Amount'].str[1:].str.replace(',', '').astype(float)
df['IDs'] = df['IDs'].str.split(",")
# Compute the new dollar amount and explode
df['Dollar_Amount'] = df['Dollar_Amount'] / df['IDs'].str.len()
df = df.explode('IDs')
# Postprocessing
df['Dollar_Amount'] = df['Dollar_Amount'].round(2).apply(lambda x: '${0:,.2f}'.format(x))
With an example input:
IDs Dollar_Amount A
0 1,2,3,4 $100,000.00 4
1 5,6,7 $50,000.00 3
2 9 $20,000.00 1
3 10,11 $20,000.00 2
The result is as follows:
IDs Dollar_Amount A
0 1 $25,000.00 4
0 2 $25,000.00 4
0 3 $25,000.00 4
0 4 $25,000.00 4
1 5 $16,666.67 3
1 6 $16,666.67 3
1 7 $16,666.67 3
2 9 $20,000.00 1
3 10 $10,000.00 2
3 11 $10,000.00 2
There will be a one-line way to do this with a lambda function (if you are new, read up on lambda functions!), but as a slightly less new beginner, I think it's easier to think about this as two separate operations.
Operation 1: get the count of IDs. Operation 2: do the division.
If you take a look here https://towardsdatascience.com/count-occurrences-of-a-value-pandas-e5dad02303e9 you'll get a good lesson on how to do the group by you need to get the count of IDs and join it back to your data frame. I'd read that because it's a much more detailed explainer, but if you want a simple line of code, consider this question: "Pandas, how to count the occurance within grouped dataframe and create new column?"
Once you have the count, the division is as simple as df['new_col'] = df['col1'] / df['col2']
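As a rough sketch of those two operations on the example data, assuming Dollar_Amount has already been converted to a float: here the count comes from counting commas in the IDs string rather than from a group by, and `id_count` and `per_id_amount` are made-up column names.

```python
import pandas as pd

df = pd.DataFrame({
    "IDs": ["1,2,3,4", "5,6,7", "9", "10,11"],
    "Dollar_Amount": [100000.0, 50000.0, 20000.0, 20000.0],
})

# Operation 1: number of IDs per row = number of commas + 1.
df["id_count"] = df["IDs"].str.count(",") + 1

# Operation 2: divide the amount by that count.
df["per_id_amount"] = df["Dollar_Amount"] / df["id_count"]

print(df)
```

Keeping the count in its own column makes it easy to check before the rows are exploded.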

How to organise different datasets on Excel into the same layout/order (using pandas)

I have multiple Excel spreadsheets containing the same types of data, but not in the same order. For example, file 1 has the results of measurements A, B, C and D from River X in columns 1, 2, 3 and 4, respectively, while file 2 has the same measurements taken for a different river, River Y, in columns 6, 7, 8 and 9, respectively. Is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe (i.e. make it so that Sheet2 has the measurements for River Y in columns 1, 2, 3 and 4)? Sometimes the data is presented horizontally, not vertically as described above, too. If I have the same measurements for, say, 400 different rivers on 400 separate sheets, but the layout of the data is erratic from file to file, it would be useful to be able to impose a single order on every spreadsheet without having to manually shift columns in Excel.
Is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe?
You can get a list of columns from one of your dataframes and then sort that. Next you can use the sorted order to reorder your remaining dataframes. I've created an example below:
import pandas as pd
import numpy as np
# Create an example of your problem
root = 'River'
suffix = list('123')
cols_1 = [root + '_' + each_suffix for each_suffix in suffix]
cols_2 = [root + '_' + each_suffix for each_suffix in suffix[::-1]]  # reversed column order
data = np.arange(9).reshape(3,3)
df_1 = pd.DataFrame(columns=cols_1, data=data)
df_2 = pd.DataFrame(columns=cols_2, data=data)
df_1
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
df_2
[out] River_3 River_2 River_1
0 0 1 2
1 3 4 5
2 6 7 8
col_list = df_1.columns.to_list()  # Get a list of column names; col_list.sort() would sort it in place
sorted_col_list = sorted(col_list, reverse=False)  # Use reverse=True to invert the order
def rearrange_df_cols(df, target_order):
    df = df[target_order]
    print(df)
    return df
rearrange_df_cols(df_1, sorted_col_list)
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
rearrange_df_cols(df_2, sorted_col_list)
[out] River_1 River_2 River_3
0 2 1 0
1 5 4 3
2 8 7 6
You can write a function based on what's above and apply it to all of your files/sheets, provided that all column names exist in each of them (NB they must be written identically).
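One possible shape for such a function is sketched below. It takes the column order from a reference dataframe instead of sorting, and fails loudly if a column is missing. The name `match_layout` and the toy river data are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

def match_layout(df, reference):
    """Return df with its columns in the same order as reference."""
    missing = [c for c in reference.columns if c not in df.columns]
    if missing:
        # Column names must match exactly (including case and spaces).
        raise KeyError(f"columns missing from frame: {missing}")
    return df[list(reference.columns)]

# Toy data: same measurements, different column order.
river_x = pd.DataFrame({"A": [1.0], "B": [2.0], "C": [3.0], "D": [4.0]})
river_y = pd.DataFrame({"D": [40.0], "B": [20.0], "A": [10.0], "C": [30.0]})

aligned = match_layout(river_y, river_x)
print(aligned)
```

Any extra columns in the second frame are dropped by the selection, which may or may not be what you want for your sheets.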
Sometimes the data is presented horizontally, not vertically as described above, too.
This would be better as a separate question. In principle, you should check the dimensions of your data, e.g. with df.shape. Based on the shape, you can either call df.transpose() first and then use your function to reorder the column names, or use your function directly.

I want to merge 4 rows to form 1 row with 4 sub-rows in pandas Dataframe

This is my dataframe
I have tried this but it didn't work:
df1['quarter'].str.contains('/^[-+](20)$/', re.IGNORECASE).groupby(df1['quarter'])
Thanks in advance
Hi and welcome to the forum! If I understood your question correctly, you want to form groups per year?
Of course, you can simply do a group by per year as you already have the column.
Assuming you didn't have the year column, you can simply group by the whole string except the last 2 characters of the quarter column. Like this (I created a toy dataset for the answer):
import pandas as pd
d = {'quarter' : pd.Series(['1947q1', '1947q2', '1947q3', '1947q4','1948q1']),
'some_value' : pd.Series([1,3,2,4,5])}
df = pd.DataFrame(d)
df
This is our toy dataframe:
quarter some_value
0 1947q1 1
1 1947q2 3
2 1947q3 2
3 1947q4 4
4 1948q1 5
Now we simply group by the year, which we get by stripping the last 2 characters from the quarter string:
grouped = df.groupby(df.quarter.str[:-2])
for name, group in grouped:
    print(name)
    print(group, '\n')
Output:
1947
quarter some_value
0 1947q1 1
1 1947q2 3
2 1947q3 2
3 1947q4 4
1948
quarter some_value
4 1948q1 5
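If the goal is literally one row per year with the quarters nested underneath as sub-rows, a hierarchical index may be closer to what was asked. This is only a sketch under that assumption, using the same toy data; `year`, `nested` and `yearly_total` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "quarter": ["1947q1", "1947q2", "1947q3", "1947q4", "1948q1"],
    "some_value": [1, 3, 2, 4, 5],
})

# Derive the year from the quarter string, then use (year, quarter)
# as a two-level index so each year shows its quarters as sub-rows.
nested = df.assign(year=df["quarter"].str[:-2]).set_index(["year", "quarter"])
print(nested)

# Aggregating per year works on the same key.
yearly_total = nested.groupby(level="year")["some_value"].sum()
print(yearly_total)
```

Selecting `nested.loc['1947']` then returns just the quarters of that year.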
Additional comment: I used an operation that you can always apply to strings. Check this, for example:
s = 'Hi there, Dhruv!'
#Prints the first 2 characters of the string
print(s[:2])
#Output: "Hi"
#Prints everything after the third character
print(s[3:])
#Output: "there, Dhruv!"
#Prints the text between the 10th and the 15th character
print(s[10:15])
#Output: "Dhruv"

Subtract a subset of columns from a key column in Pandas Pivot

I have a pivot table with multiple columns of data in a time series:
A B C D
11/1/2018 1 5 5 7
11/2/2018 2 6 6 8
11/3/2018 3 7 7 9
The values in the data columns are not important for this example. I would like to subtract the value in the "key" column (column A in this case) from a subset of columns: B & C in this case. I would then like to drop any columns not in the subset or the key column. Result would be:
A B C
11/1/2018 1 4 4
11/2/2018 2 4 4
11/3/2018 3 4 4
I have subtracted columns in the past via code like this:
df['dif'] = df['B'] - df['A']
But this adds a new "dif" column. I would like to replace column B with the B-A values. Also, instead of passing the instructions one at a time (B-A, C-A), I would like to pass a list, something like "if column in list, subtract key column, else drop column."
Thanks
pandas.DataFrame.sub with axis=0
When subtracting a Series from a DataFrame Pandas will align the columns of the DataFrame with the index of the Series by default. This is what happens when you use the - operator. However, when you use the pandas.DataFrame.sub method, you can override that default and specify that the DataFrame should align its index with the index of the Series.
def f(d, key, subset):
    return d[[key]].join(d[subset].sub(d[key], axis=0))
f(df, 'A', ['B', 'C'])
A B C
11/1/2018 1 4 4
11/2/2018 2 4 4
11/3/2018 3 4 4
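To see the alignment difference concretely, here is a self-contained sketch that rebuilds the example table (the index is kept as plain strings for simplicity) and compares the `-` operator with `sub(..., axis=0)`. The variable names `naive` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [5, 6, 7], "C": [5, 6, 7], "D": [7, 8, 9]},
    index=["11/1/2018", "11/2/2018", "11/3/2018"],
)

# The - operator matches the Series index (the dates) against the
# DataFrame's column labels (B, C); nothing lines up, so all is NaN.
naive = df[["B", "C"]] - df["A"]

# sub(..., axis=0) matches the Series index against the row index.
result = df[["A"]].join(df[["B", "C"]].sub(df["A"], axis=0))
print(result)
```

Column D is dropped simply because it is never selected.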
You can use apply to subtract A from the subset of columns you choose, and then join the result back to A.
df['A'].to_frame().join(df[['B','C']].apply(lambda x: x - df['A']))
A B C
11/1/2018 1 4 4
11/2/2018 2 4 4
11/3/2018 3 4 4

Using pandas style to give colors to some rows with a specific condition

This is the output of pandas in excel format:
Id comments number
1 so bad 1
1 so far 2
2 always 3
2 very good 4
3 very bad 5
3 very nice 6
3 so far 7
4 very far 8
4 very close 9
4 busy 10
I want to use pandas to give a color (for example, gray) to the rows whose Id value is even. For example, rows 3 and 4 have even Id numbers, but rows 5, 6 and 7 have odd Id numbers. Is there a way to do this with pandas?
As explained in the documentation http://pandas.pydata.org/pandas-docs/stable/style.html what you basically want to do is write a style function and apply it to the Styler object.
def _color_if_even(s):
    return ['background-color: grey' if val % 2 == 0 else '' for val in s]
Then call it on the Styler object. Note that the column is named Id (capital I), and that with subset=['Id'] only the cells of the Id column are colored:
df.style.apply(_color_if_even, subset=['Id'])
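Since the question asks for whole rows to be colored, a row-wise variant may be closer to the goal. The sketch below assumes the same column names as the question; `color_row_if_even_id` and `highlight_even_ids` are hypothetical helper names. With `axis=1`, the style function receives one row at a time and must return one CSS string per cell.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 1, 2, 2, 3],
    "comments": ["so bad", "so far", "always", "very good", "very bad"],
    "number": [1, 2, 3, 4, 5],
})

def color_row_if_even_id(row):
    # One CSS string per cell in the row; empty string means no styling.
    style = "background-color: grey" if row["Id"] % 2 == 0 else ""
    return [style] * len(row)

def highlight_even_ids(frame):
    # axis=1 passes each row (as a Series) to the style function.
    return frame.style.apply(color_row_if_even_id, axis=1)
```

Calling `highlight_even_ids(df)` returns a Styler, which needs jinja2 installed. Its `to_excel` method should carry the background color into the spreadsheet, assuming an Excel writer such as openpyxl is available.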

Resources