How to add multiple columns in reset_index from a pandas Series - python-3.x

I have a pandas Series and I need to show its elements in columnar fashion, embedded in an email body as HTML. Hence I tried the snippet below. It worked for me initially with two columns, but now I have three columns to display. Here is the code:
a = max_compare.rename_axis('Metric').reset_index(name='Yesterday').to_html()
Here max_compare is a Series with the output below:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Metric</th>
<th>Yesterday</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>GSV</td>
<td>4424180.0</td>
</tr>
<tr>
<th>1</th>
<td>Units</td>
<td>7463.0</td>
</tr>
</tbody>
</table>
Now I have another element to be shown as a column. I tried the code snippet below, but got an exception:
a = max_compare.rename_axis('Metric').reset_index(name=['Yesterday'], [L30 average]).to_html()
(The current HTML output and the expected output were shown as screenshots.)

Since you did not provide a data sample or an expected result, I do not know exactly what you want to do. However, this is how to reset a MultiIndex.
level : int, str, tuple, or list, default None
Only remove the given levels from the index. Removes all levels by default
Link to the documentation
import pandas as pd
import io
# Paste the text using triple quotes to span a string literal over multiple lines
zz = """index grade class numgrade numyear
0 A Senior 12 4
1 A Junior 12 3
2 A Junior 12 3
3 A Senior 12 4
4 A Junior 12 3
"""
df = pd.read_table(io.StringIO(zz), delim_whitespace=True)
df.set_index(['index', 'grade'], inplace=True)
# Reset the two levels
df.reset_index(level=(0,1))
Edit
The same method can be used with a Series. But resetting a MultiIndex on a Series will convert it back to a DataFrame, since a Series, by definition, has only one column of values and an index. Consequently, I think you should use a DataFrame if you want to deal with more than one column.
# Paste the text using triple quotes to span a string literal over multiple lines
zz = """index grade class
0 A Senior
1 A Junior
2 A Junior
3 A Senior
4 A Junior
"""
df = pd.read_table(io.StringIO(zz), delim_whitespace=True)
df.set_index(['index', 'grade'], inplace=True)
# Creating a series with a multiIndex
se = df.iloc[:,0]
print(type(se))
# Resetting the index also works for a Series.
# In the case of a MultiIndex, the Series will be converted back
# to a DataFrame since a Series, by definition, has
# only one column and an index
se.reset_index(level=(0,1))
# <class 'pandas.core.series.Series'>
# index grade class
# 0 0 A Senior
# 1 1 A Junior
# 2 2 A Junior
# 3 3 A Senior
# 4 4 A Junior
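Coming back to the original question: reset_index(name=...) accepts only a single column name, so one way is to combine the Series into a DataFrame first and call to_html on that. This is only a sketch, assuming max_compare and a second, hypothetical Series l30_average share the same index (the metric names):
import pandas as pd
# Sketch: build a two-column frame from the two Series (l30_average is assumed
# to exist and to be aligned with max_compare's index).
df = pd.DataFrame({'Yesterday': max_compare, 'L30 average': l30_average})
a = df.rename_axis('Metric').reset_index().to_html()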

Related

Extracting specific values from a pandas columns and storing it in new columns

I have a pandas column which stores data in the form of a list, in the following format:
text
[['Mark','PERSON'],['Data Scientist','TITLE'], ['Berlin','LOC'], ['Python','SKILLS'], ['Tableau,','SKILLS'], ['SQL','SKILLS'], ['AWS','SKILLS']]
[['John','PERSON'],['Data Engineer','TITLE'], ['London','LOC'], ['Python','SKILLS'], ['DB2,','SKILLS'], ['SQL','SKILLS']]
[['Pearson','PERSON'],['Intern','TITLE'], ['Barcelona','LOC'], ['Python','SKILLS'], ['Excel,','SKILLS'], ['SQL','SKILLS']]
[['Broody','PERSON'],['Manager','TITLE'], ['Barcelona','LOC'], ['Team Management','SKILLS'], ['Excel,','SKILLS'], ['Good Communications','SKILLS']]
[['Rita','PERSON'],['Software Developer','TITLE'], ['London','LOC'], ['Dot Net','SKILLS'], ['SQl Server,','SKILLS'], ['VS Code','SKILLS']]
What I want to see as an output is :
PERSON TITLE LOC SKILLS
Mark Data Scientist Berlin Python, Tableau, SQL, AWS
John Data Engineer London Python, DB2,SQL
..... and so on for the rest of the input rows as well
So essentially, splitting each entry on "," and using the part after the "," (e.g. PERSON) as the column header and the part before the "," as the value.
How can I achieve this?
If you have a data frame like this (call it "df"):
index text
0 1 [[Mark, PERSON], [Data Scientist, TITLE], [Ber...
1 2 [[John, PERSON], [Data Engineer, TITLE], [Lond...
2 3 [[Pearson, PERSON], [Intern, TITLE], [Barcelon...
3 4 [[Broody, PERSON], [Manager, TITLE], [Barcelon...
4 5 [[Rita, PERSON], [Software Developer, TITLE], ...
You can try something like this:
person = []
skills = []
title = []
loc = []
temp = []
for i in range(len(df['text'])):
    for j in range(len(df['text'][i])):
        if df['text'][i][j][1] == 'PERSON':
            person.append(df['text'][i][j][0])
        elif df['text'][i][j][1] == 'TITLE':
            title.append(df['text'][i][j][0])
        elif df['text'][i][j][1] == 'LOC':
            loc.append(df['text'][i][j][0])
        elif df['text'][i][j][1] == 'SKILLS':
            temp.append(df['text'][i][j][0].replace(",", ""))
    # One row processed: join its skills and reset the temporary list
    skills.append(",".join(temp))
    temp = []
# Combine the collected lists into the final DataFrame shown below
result = pd.DataFrame({'PERSON': person, 'TITLE': title, 'LOC': loc, 'SKILLS': skills})
Output
PERSON TITLE LOC SKILLS
0 Mark Data Scientist Berlin Python,Tableau,SQL,AWS
1 John Data Engineer London Python,DB2,SQL
2 Pearson Intern Barcelona Python,Excel,SQL
3 Broody Manager Barcelona Team Management,Excel,Good Communications
4 Rita Software Developer London Dot Net,SQl Server,VS Code
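A slightly more compact sketch of the same idea, shown here only as an alternative, is to build one dict per row and collect the SKILLS entries into a list (this assumes the same df['text'] structure as above):
import pandas as pd
rows = []
for pairs in df['text']:
    row = {}
    row_skills = []
    for value, label in pairs:
        value = value.rstrip(',')  # drop trailing commas such as 'Tableau,'
        if label == 'SKILLS':
            row_skills.append(value)
        else:
            row[label] = value
    row['SKILLS'] = ','.join(row_skills)
    rows.append(row)
result = pd.DataFrame(rows, columns=['PERSON', 'TITLE', 'LOC', 'SKILLS'])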

How to organise different datasets on Excel into the same layout/order (using pandas)

I have multiple Excel spreadsheets containing the same types of data, but they are not in the same order. For example, if file 1 has the results of measurements A, B, C and D from River X printed in columns 1, 2, 3 and 4 respectively, but file 2 has the same measurements taken for a different river, River Y, printed in columns 6, 7, 8 and 9 respectively, is there a way to use pandas to reorganise one dataframe to match the layout of another (i.e. make it so that Sheet2 has the measurements for River Y printed in columns 1, 2, 3 and 4)? Sometimes the data is presented horizontally rather than vertically as described above, too. If I have the same measurements for, say, 400 different rivers on 400 separate sheets, but the presentation/layout of the data is erratic across the individual files, it would be useful to be able to impose a single order on every spreadsheet without having to manually shift columns in Excel.
Is there a way to use pandas to reorganise one dataframe to match the layout of another dataframe?
You can get a list of columns from one of your dataframes and then sort that. Next you can use the sorted order to reorder your remaining dataframes. I've created an example below:
import pandas as pd
import numpy as np
# Create an example of your problem
root = 'River'
suffix = list('123')
cols_1 = [root + '_' + each_suffix for each_suffix in suffix]
cols_2 = [root + '_' + each_suffix for each_suffix in suffix[::-1]]  # reversed order, as shown in df_2 below
data = np.arange(9).reshape(3,3)
df_1 = pd.DataFrame(columns=cols_1, data=data)
df_2 = pd.DataFrame(columns=cols_2, data=data)
df_1
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
df_2
[out] River_3 River_2 River_1
0 0 1 2
1 3 4 5
2 6 7 8
col_list = df_1.columns.to_list()  # Get a list of column names (use .sort() to sort in place)
sorted_col_list = sorted(col_list, reverse=False)  # Use reverse=True to invert the order
def rearrange_df_cols(df, target_order):
    df = df[target_order]
    print(df)
    return df
rearrange_df_cols(df_1, sorted_col_list)
[out] River_1 River_2 River_3
0 0 1 2
1 3 4 5
2 6 7 8
rearrange_df_cols(df_2, sorted_col_list)
[out] River_1 River_2 River_3
0 2 1 0
1 5 4 3
2 8 7 6
You can write a function based on what's above and apply it to all of your files/sheets, provided that all the column names exist (NB they must be written identically).
Sometimes the data is presented horizontally, not vertically as described above, too.
This would be better as a separate question. In principle you should check the dimensions of your data, e.g. with df.shape, and based on the shape you can either use df.transpose() and then your function to reorder the column names, or use your function directly.
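As a small sketch of that idea (an illustration only, and it checks the column names rather than df.shape): transpose the frame when the expected measurement columns are not found, then reuse rearrange_df_cols from above.
def normalise_orientation(df, expected_cols):
    # If the expected columns are already present, keep the frame as-is;
    # otherwise assume the sheet was laid out horizontally and transpose it.
    if set(expected_cols).issubset(df.columns):
        return df
    return df.transpose()

df_fixed = rearrange_df_cols(normalise_orientation(df_2, sorted_col_list), sorted_col_list)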

I want to merge 4 rows to form 1 row with 4 sub-rows in pandas Dataframe

This is my dataframe (shown as an image in the original post).
I have tried this but it didn't work:
df1['quarter'].str.contains('/^[-+](20)$/', re.IGNORECASE).groupby(df1['quarter'])
Thanks in advance
Hi and welcome to the forum! If I understood your question correctly, you want to form groups per year?
Of course, you can simply do a group by per year as you already have the column.
Assuming you didn't have the year column, you can simply group by the whole string except the last 2 characters of the quarter column. Like this (I created a toy dataset for the answer):
import pandas as pd
d = {'quarter' : pd.Series(['1947q1', '1947q2', '1947q3', '1947q4','1948q1']),
'some_value' : pd.Series([1,3,2,4,5])}
df = pd.DataFrame(d)
df
This is our toy dataframe:
quarter some_value
0 1947q1 1
1 1947q2 3
2 1947q3 2
3 1947q4 4
4 1948q1 5
Now we simply group by the year, i.e. by the quarter string with the last 2 characters stripped off:
grouped = df.groupby(df.quarter.str[:-2])
for name, group in grouped:
    print(name)
    print(group, '\n')
Output:
1947
quarter some_value
0 1947q1 1
1 1947q2 3
2 1947q3 2
3 1947q4 4
1948
quarter some_value
4 1948q1 5
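If the end goal is one row per year with the quarterly values side by side as separate columns, a hedged sketch (using the same toy df) is to pivot on the quarter suffix:
# Sketch only: derive 'year' and 'q' columns from the quarter string, then pivot.
wide = (df.assign(year=df['quarter'].str[:-2], q=df['quarter'].str[-2:])
          .pivot(index='year', columns='q', values='some_value'))
# `wide` now has one row per year and one column per quarter (q1..q4),
# with NaN where a quarter is missing (e.g. 1948q2-q4 in the toy data).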
Additional comment: the str[:-2] above is just string slicing, an operation that you can apply to any string. Check this, for example:
s = 'Hi there, Dhruv!'
#Prints the first 2 characters of the string
print(s[:2])
#Output: "Hi"
#Prints everything after the third character
print(s[3:])
#Output: "there, Dhruv!"
#Prints the text between the 10th and the 15th character
print(s[10:15])
#Output: "Dhruv"

How to write content of a list into an Excel sheet using openpyxl

I have the following list:
d_list = ["No., Start Name, Destination, Distance (miles)",
"1,ALBANY,NY CRAFT,28",
"2,GRACO,PIONEER,39",
"3,FONDA,ROME,41",
"4,NICCE,MARRINERS,132",
"5,TOUCAN,SUBVERSIVE,100",
"6,POLL,CONVERGENCE,28",
"7,STONE HOUSE,HUDSON VALLEY,9",
"8,GLOUCESTER GRAIN,BLACK MUDD POND,75",
"9,ARMY LEAGUE,MUMURA,190",
"10,MURRAY,FARMINGDALE,123"]
So, basically, the list consists of thousands of elements (just a sample of 10 is shown here), each a string of comma-separated values. I'd like to write this into a new worksheet in a workbook.
Note: the workbook already exists and contains other sheets, I'm just adding a new sheet with this data.
My code:
import openpyxl
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.create_sheet(title='distance')
for i in range(len(d_list)):
    sheet.append(list(d_list[i]))
I'm expecting (in this example) 11 rows of data, each with 4 columns. However, I'm getting 11 rows alright, but with each character of each string written to its own cell! I think I am almost there ... what am I missing? (Note: I've read through all the available posts related to this topic, but couldn't find any that answers this specific type of question, hence I'm asking.)
Many thanks!
You can use pandas to solve this:
1.) Convert your list into a dataframe:
In [231]: l
Out[231]:
['No., Start Name, Destination, Distance (miles)',
'1,ALBANY,NY CRAFT,28',
'2,GRACO,PIONEER,39',
'3,FONDA,ROME,41',
'4,NICCE,MARRINERS,132',
'5,TOUCAN,SUBVERSIVE,100',
'6,POLL,CONVERGENCE,28',
'7,STONE HOUSE,HUDSON VALLEY,9',
'8,GLOUCESTER GRAIN,BLACK MUDD POND,75',
'9,ARMY LEAGUE,MUMURA,190',
'10,MURRAY,FARMINGDALE,123']
In [228]: df = pd.DataFrame([i.split(",") for i in l])
In [229]: df
Out[229]:
0 1 2 3
0 No. Start Name Destination Distance (miles)
1 1 ALBANY NY CRAFT 28
2 2 GRACO PIONEER 39
3 3 FONDA ROME 41
4 4 NICCE MARRINERS 132
5 5 TOUCAN SUBVERSIVE 100
6 6 POLL CONVERGENCE 28
7 7 STONE HOUSE HUDSON VALLEY 9
8 8 GLOUCESTER GRAIN BLACK MUDD POND 75
9 9 ARMY LEAGUE MUMURA 190
10 10 MURRAY FARMINGDALE 123
2.) Write the above DataFrame to Excel in a new sheet with 4 columns:
import pandas as pd
from openpyxl import load_workbook
path = "data.xlsx"
book = load_workbook(path)
writer = pd.ExcelWriter(path, engine = 'openpyxl')
writer.book = book
df.to_excel(writer, sheet_name = 'distance')
writer.save()
writer.close()
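Alternatively, a minimal sketch that stays within openpyxl (assuming the same d_list and workbook as in the question): split each comma-separated string into a list before appending, so each field lands in its own cell.
import openpyxl
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.create_sheet(title='distance')
for row in d_list:
    # str.split gives a list of the four fields, which append() writes across columns
    sheet.append(row.split(','))
wb.save('data.xlsx')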

COUNTIF statement in Excel

I want to be able to check whether the data in a cell in column A equals x, and if so, count the data in the adjacent cell in column B.
To explain in more detail, I am creating a statistics chart where commissioners need to be able to compare data for region 1 to regions 2 and 3.
So column A will contain region keys such as 1, 1, 1, 2, 3, 3, and column B will contain ages. So if column A contains a 1, collect the data in the cell adjacent to the one being counted.
I hope this makes sense.
So I want the formula to only count cells whose adjacent cells contain District 1, for example:
A           B
District 1  12
District 2  12
District 1  12
District 2  12
You just need to use COUNTIFS() to add extra criteria for when to count something.
Using the following data:
Region Age
2 12
2 16
2 41
1 62
3 26
3 50
2 12
1 65
3 64
With Region in column A and Age in column B (headers in row 1), we can use =COUNTIFS(A2:A10,2,B2:B10,12) to count the instances where Region = 2 and Age = 12. This returns 2.
Obviously we can expand this out to all manner of criteria, or keep adding extra criteria. With data that looks like:
Region Age Ethnicity
2 12 Black
2 16 Caucasian
2 41 Asian
1 62 Black
3 26 Caucasian
3 50 Asian
2 12 Caucasian
1 65 Caucasian
3 64 Asian
We can now count Asian people equal to or over the age of 50 in region 3 with =COUNTIFS(A2:A10,3,B2:B10,">=50",C2:C10,"Asian")
And if you want to bucket people using upper and lower bounds, it's simply a case of creating the bounds as separate range/criteria pairs. For instance, to find all people in region 2 aged between 10 and 20 (not inclusive of the upper bound) we would use =COUNTIFS(A2:A10,2,B2:B10,">=10",B2:B10,"<20")
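For readers who would rather do the same counts in pandas, a hedged sketch with boolean masks (assuming a DataFrame df with Region, Age and Ethnicity columns matching the table above):
import pandas as pd
# Equivalent of =COUNTIFS(A2:A10,2,B2:B10,12)
count_region2_age12 = ((df['Region'] == 2) & (df['Age'] == 12)).sum()
# Equivalent of =COUNTIFS(A2:A10,3,B2:B10,">=50",C2:C10,"Asian")
count_asian_50plus_region3 = ((df['Region'] == 3)
                              & (df['Age'] >= 50)
                              & (df['Ethnicity'] == 'Asian')).sum()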
