I have a dataset given as such:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'team': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'Run_time': [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No'],
'Self_Employed': ['No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No'],
'LoanAmount': [123, 128, 66, 120, 141, 52,96,15,85,36,58,89],
}
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
Here, I wish to add an additional column 'Last_entry' which will contain 0's and 1's.
The column should be filled as follows: for team-A, the last run-time is 5, so that row has Last_entry=1 and all other run-times for team-A should be 0.
For team-B, the last run-time is 3. So that row has Last_entry=1 and all other run-times for team-B should be 0.
For team-C, the last run-time is 4. So that row has Last_entry=1 and all other run-times for team-C should be 0.
The net result needs to look as such:
[Image: new dataframe with the additional column]
Can somebody please let me know how to achieve this task in python?
I wish to add an additional column in an existing dataset by using python
You can use groupby and tail to get the last entry for each team. Then make a new column of zeroes, and set the resulting rows to one:
# Determine indices of the last entry for each team
last_entry_idx = df.groupby('team').tail(1).index
# Create the new column, defaulting to 0, then flag the last entries with 1
df['Last_entry'] = 0
df.loc[last_entry_idx, 'Last_entry'] = 1
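The same flag can also be built in a single expression. The following is a minimal sketch, assuming "last" means the last row of each team in the existing row order (as in the sample data); it swaps in `duplicated` instead of `groupby`/`tail` and uses a trimmed copy of the data with only the columns it needs.

```python
import pandas as pd

# Trimmed copy of the question's data (only the columns used here)
df = pd.DataFrame({
    'team': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Run_time': [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4],
})

# duplicated(keep='last') is True for every row except the last one of each
# team, so negating it flags exactly the last row per team.
df['Last_entry'] = (~df.duplicated('team', keep='last')).astype(int)

print(df)
```

If "last" should instead mean the largest Run_time regardless of row order, comparing `df['Run_time']` with `df.groupby('team')['Run_time'].transform('max')` would be one way to express that.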
Related
I have a dataset given as such:
#Load the required libraries
import pandas as pd
#Create dataset
data = {'team': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
'Run_time': [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No'],
'Self_Employed': ['No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No'],
'LoanAmount': [123, 128, 66, 120, 141, 52,96,15,85,36,58,89],
}
#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)
The dataset looks as such:
Here, in the 'Run_time' column, the numbering starts at different index values.
I wish to ensure that the 'Run_time' column starts from 1 only.
The dataset needs to look as such:
Can somebody please let me know how to modify this column in Python such that the numbering is continuous?
import pandas as pd
#Create dataset
data = {'team': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
'Run_time': [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4],
'Married': ['No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No'],
'Self_Employed': ['No', 'No', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'No'],
'LoanAmount': [123, 128, 66, 120, 141, 52,96,15,85,36,58,89],
}
#Convert to dataframe
df = pd.DataFrame(data)
# print("df = \n", df)
# Renumber Run_time from 1 using the default 0-based index
df['Run_time'] = df.index + 1
df
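Using `df.index + 1` relies on the dataframe still having its default 0, 1, 2, ... index. A minimal sketch that numbers rows by position instead, so it does not depend on the index, is shown here; the `per_team` variant is an assumption about a possible follow-up need (restarting the numbering for each team), not something the question asks for.

```python
import pandas as pd

# Trimmed copy of the question's data (only the columns used here)
df = pd.DataFrame({
    'team': ['A'] * 12,
    'Run_time': [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4],
})

# Number the rows 1..n by position, whatever the index holds
df['Run_time'] = range(1, len(df) + 1)

# Hypothetical variant: restart the numbering at 1 within each team
per_team = df.groupby('team').cumcount() + 1

print(df)
```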
I am trying to transform a dataframe using pivot. Since the column contains duplicate entries, I tried to add a count column, following what's suggested here (Question 10 posted in this answer).
import pandas as pd
from pprint import pprint
if __name__ == '__main__':
    d = {
        't': [0, 1, 2, 0, 1, 2, 0, 2, 0, 1],
        'input': [2, 2, 2, 2, 2, 2, 4, 4, 4, 4],
        'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
        'value': [0.1, 0.2, 0.3, 1, 2, 3, 1, 2, 1, 1],
    }
    df = pd.DataFrame(d)
    df = df.drop('t', axis=1)
    df.insert(0, 'count', df.groupby('input').cumcount())
    pd.pivot(df, index='count', columns='type', values='value')
But I still get the same error:
ValueError: Index contains duplicate entries, cannot reshape
Could someone please suggest how to resolve this error?
Since you have more than one value associated with each of 'A' and 'B', you have to aggregate the values somehow.
So, if I've understood your issue correctly, a possible solution is the following:
#pip install pandas
import pandas as pd
d = {
't': [0, 1, 2, 0, 1, 2, 0, 2, 0, 1],
'input': [2, 2, 2, 2, 2, 2, 4, 4, 4, 4],
'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
'value': [0.1, 0.2, 0.3, 1, 2, 3, 1, 2, 1, 1],
}
df = pd.DataFrame(d)
df
# I've used aggfunc='sum' argument for example, the default value is 'mean'
pd.pivot_table(df, index='t', columns='type', values='value', aggfunc='sum')
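This returns a table with one row per value of t and one column per type, where each cell holds the summed value.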
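If aggregating is not wanted, the original count-column idea can be made to work. Counting within 'input' alone still produces the pair (count=0, type='A') twice, once for input 2 and once for input 4, which is what triggers the error. The sketch below counts within each (input, type) pair and keeps 'input' in the index; it assumes pandas 1.1 or later, where `pivot` accepts a list for `index`, and the name `wide` is only illustrative.

```python
import pandas as pd

d = {
    't': [0, 1, 2, 0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4, 4, 4, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
    'value': [0.1, 0.2, 0.3, 1, 2, 3, 1, 2, 1, 1],
}
df = pd.DataFrame(d).drop('t', axis=1)

# Number the rows within each (input, type) pair, so that the combination
# (input, count, type) identifies exactly one value.
df['count'] = df.groupby(['input', 'type']).cumcount()

# With a unique (index, columns) combination, pivot no longer raises.
wide = df.pivot(index=['input', 'count'], columns='type', values='value')
print(wide)
```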
I have a data frame like the one below. I want to get a dictionary consisting of lists. My expected output is shown below. Can you please help me get it?
You can use the handy groupby function in Pandas:
import pandas as pd

df = pd.DataFrame({
    'Department': ['y1', 'y1', 'y1', 'y2', 'y2', 'y2'],
    'Section': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Cost': [10, 20, 30, 40, 50, 60]
})
# Map each department to the list of its costs
output = {dept: group['Cost'].tolist() for dept, group in df.groupby('Department')}
gives
{'y1': [10, 20, 30], 'y2': [40, 50, 60]}
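The same dictionary can be produced without an explicit loop over the groups. This is a minimal sketch of one alternative, assuming the same sample frame as in the answer; it aggregates each group's 'Cost' values into a list and converts the result to a dict.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['y1', 'y1', 'y1', 'y2', 'y2', 'y2'],
    'Section': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Cost': [10, 20, 30, 40, 50, 60],
})

# Collect the Cost values of each department into a list, then
# turn the resulting Series (indexed by Department) into a dict.
output = df.groupby('Department')['Cost'].agg(list).to_dict()
print(output)
```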
I have a Pandas data frame, which looks like the following:
df =
col1
['a', 'b']
['d', 'c', 'a']
['b', 'f', 'a']
col1 is a list column containing strings. I want to calculate the value counts of each element, which may occur in any of the lists in any row. The expected output is a dictionary of counts for each value.
Expected Output
df_dict = {'a': 3, 'b': 2, 'c': 1, 'd': 1, 'f': 1}
How can I do this efficiently, preferably in one line to keep the code clean? Sorry if this has been answered before.
With explode and value_counts:
df['col1'].explode().value_counts().to_dict()
Output:
{'a': 3, 'b': 2, 'd': 1, 'f': 1, 'c': 1}
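For comparison, the counting can also be done with the standard library instead of `explode`. The following is a minimal sketch, assuming every cell of col1 is a list of strings as in the example; the order of keys in the resulting dictionary may differ from the `value_counts` output.

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({'col1': [['a', 'b'], ['d', 'c', 'a'], ['b', 'f', 'a']]})

# Flatten all the lists into one stream of strings and count them
df_dict = dict(Counter(chain.from_iterable(df['col1'])))
print(df_dict)
```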
I have data with two columns, Product and Category, plus a Value for each row. See below for an example of the data:
import pandas as pd
df = pd.DataFrame({'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'Category': ['Text', 'Text2', 'Text3', 'Text4', 'Text', 'Text2', 'Text3', 'Text4'],
'Value': [80, 10, 5, 5, 5, 3, 2, 0]})
I would like to visualize this data in a diagram:
Here the "Total" is the total value of the entire data frame, "A" & "B" boxes are the total value for each product, and then the values for each product & category are in the right-most box.
I'm not very familiar with the viz packages available in Python. Is there a package that does these types of visualizations?
You can use graphviz, but you need to extract your own blocks/nodes from the data frame.
Example:
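from graphviz import Graph
g = Graph()
# Lay the graph out horizontally (right to left)
g.attr(rankdir='RL')
# Node 1: grand total of the whole data frame
T = df['Value'].sum()
g.node('1', 'Total = ' + str(T), shape='square')
# Node 2: category/value table for product A
A = df.loc[df.Product == 'A', ['Category', 'Value']].to_string(index=False)
g.node('2', A, shape='square')
# Node 3: category/value table for product B
B = df.loc[df.Product == 'B', ['Category', 'Value']].to_string(index=False)
g.node('3', B, shape='square')
# Connect each product node to the total ('21' means node 2 to node 1)
g.edges(['21', '31'])
g.render(view=True)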
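The "extract your own blocks/nodes" step can be sketched with pandas alone, including the per-product totals that the question's diagram shows in the middle boxes. The node names (`'Total'`, `'A_detail'`, and so on) and the `nodes`/`edges` containers are hypothetical choices for illustration; the sketch only prepares the labels and connections and does not draw anything itself.

```python
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Category': ['Text', 'Text2', 'Text3', 'Text4',
                                'Text', 'Text2', 'Text3', 'Text4'],
                   'Value': [80, 10, 5, 5, 5, 3, 2, 0]})

# Grand total and one subtotal per product
total = int(df['Value'].sum())
per_product = df.groupby('Product')['Value'].sum()

# Node name -> label, plus (tail, head) pairs: Total -> product -> detail
nodes = {'Total': f'Total = {total}'}
edges = []
for product, subtotal in per_product.items():
    nodes[product] = f'{product} = {subtotal}'
    edges.append(('Total', product))
    detail = df.loc[df['Product'] == product, ['Category', 'Value']]
    nodes[f'{product}_detail'] = detail.to_string(index=False)
    edges.append((product, f'{product}_detail'))

print(nodes['Total'], edges)
```

Each entry of `nodes` could then be passed to `g.node(name, label)` and each pair in `edges` to `g.edge(tail, head)` to get a three-level diagram instead of the two-level one above.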