Python 3 - Pandas - Missing Data and Binning Values

I have the code below, which pulls data from a CSV file, and I want to perform some analysis on one of the variables in the data set. The variable is an object data type and it has a lot of missing values, so I am using pd.to_numeric to convert those to NaN. Then I create some bins (1,2,3,4,5) to categorise the values of the variable. However, I would like to include NaN as a category or bin in the new variable ("variable_q"), for example as a "No Data" category. Is that possible? How can I do that? Am I missing anything?
import pandas as pd
import numpy as np
data = pd.read_csv("dataset.csv")
# Coerce the non-numeric (missing) entries to NaN
data["variable"] = pd.to_numeric(data["variable"], errors="coerce")
# bins=[1,2,3,4,5] defines four intervals, one per label
data["variable_q"] = pd.cut(x=data["variable"], bins=[1, 2, 3, 4, 5], labels=["Low", "Moderate", "High", "Extremely High"])
Thank you very much!

Well, I am doing a frequency distribution on the variable and I just realised that adding dropna=False as an argument in the code below solves the problem.
A NaN category has been created! So there is no need to add a "No Data" category in the code above.
data["variable_q"].value_counts(sort=False,dropna=False)
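That said, if you would rather see an explicit "No Data" label than a bare NaN, a minimal sketch (building on the variable_q column created above) is to register the extra category and then fill the missing values:
# Hypothetical follow-up: give NaN its own "No Data" label
data["variable_q"] = data["variable_q"].cat.add_categories("No Data")
data["variable_q"] = data["variable_q"].fillna("No Data")
data["variable_q"].value_counts(sort=False)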

Related

Countif function in Python looping in every cell

Hey everyone, I am used to working in Excel, but after recently getting a dataset of about 500k rows that need to be worked on in the same worksheet, I have huge capacity issues, and I was advised to try to move any functions to a Python environment. This Excel function, "=IF(COUNTIF($J$2:J3,J3)>1,0,1)" (J is the column of the Asset ID), goes to each cell and returns 0 if the value has been encountered in the cells above, and 1 if it is unique.
How would that be possible in a Python environment if I load my table as a DataFrame?
Thanks in advance
You can use pandas to achieve this very easily:
import pandas as pd # import pandas
df = pd.read_excel('your_file.xlsx') # use appropriate function according to your file type
df['Unique'] = (~df['Asset_Id'].duplicated()).astype(int) # 1 on the first occurrence of each Asset ID, 0 on repeats
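Note the parentheses: without them, .astype(int) would run before ~, and the bitwise NOT of 0/1 integers gives -1/-2 rather than a 0/1 flag. For example, on a small made-up frame (hypothetical values, just to illustrate the behaviour):
import pandas as pd
df = pd.DataFrame({'Asset_Id': ['A1', 'A2', 'A1', 'A3', 'A2']})
df['Unique'] = (~df['Asset_Id'].duplicated()).astype(int)
print(df)
#   Asset_Id  Unique
# 0       A1       1
# 1       A2       1
# 2       A1       0
# 3       A3       1
# 4       A2       0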

How to perform group by on multiple columns in a pandas dataframe and predict future values using fbProphet in python?

My dataframe looks like the following. I am trying to aggregate (sum) my Amount column based on the Date and Group columns of the pandas dataframe. I was able to successfully aggregate the column; however, I am not sure how to pass the result to fbprophet to predict future values based on the grouped Date and Group. Below is the code for the aggregation.
Note: I am a beginner in Python; please provide an explanation with the code.
Data Frame
import pandas as pd
data = {'Date': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
                 '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01', '2017-02-01'],
        'Group': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
        'Amount': ['12.1', '13.2', '15.1', '10.7', '12.9', '9.0', '5.6', '6.7', '4.3', '2.3', '4.0', '5.6', '7.8', '2.3', '5.6', '8.9']}
df = pd.DataFrame(data)
Code tried so far:
grouped = df.groupby(['Group','Date'])[['Amount']].sum()
You're suffering from a few problems.
numeric
The 'Amount' values in your data initialization should be passed in as float rather than str.
Elide the quotes.
Or, this will fix it:
'Amount':[float(n) for n in ['12.1','13.2','15.1','10.7','12.9','9.0','5.6','6.7','4.3','2.3','4.0','5.6','7.8','2.3','5.6','8.9']]}
We do this because you really don't want .sum() to put together 12.1 and 13.2
and come up with '12.113.2'.
You'd much prefer 25.3.
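Another option (a sketch, assuming the frame has already been built with the string values) is to convert the column after construction:
df['Amount'] = pd.to_numeric(df['Amount'])  # str -> float64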
index
The grouped object you computed looks good on the face of it,
but Group and Date have moved into its index, and if you inspect
the .dtypes attribute you'll see it's only offering the Amount
column to Facebook Prophet.
To remedy that, use .reset_index():
>>> grouped.reset_index(inplace=True)
>>> grouped.dtypes
Group object
Date object
Amount float64
dtype: object
But now we see one last fly in the ointment.
dates
Having opaque categories of 'A' or 'B' is fine,
but for Date we probably want to know that February or March
comes a certain number of days after January,
rather than leaving opaque str labels in that column.
We might have done the type conversion back when we built the input data,
but it's fine to clean it up at this stage, too:
import datetime as dt
def to_timestamp(day: str):
    return dt.datetime.strptime(day, '%Y-%m-%d')
grouped['Date'] = grouped.Date.apply(to_timestamp)
Having successfully wrangled the shape and types of your data,
you should now be in a good position to let libraries further analyze it.
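From here, a minimal sketch of the forecasting step might look like the following. It assumes the fbprophet package is installed and that you want, say, three months of predictions per group; Prophet expects a two-column frame with columns named ds (date) and y (value):
from fbprophet import Prophet
forecasts = {}
for name, g in grouped.groupby('Group'):
    # Rename to the ds/y columns Prophet requires
    ts = g.rename(columns={'Date': 'ds', 'Amount': 'y'})[['ds', 'y']]
    m = Prophet()
    m.fit(ts)
    future = m.make_future_dataframe(periods=3, freq='MS')  # 3 months ahead
    forecasts[name] = m.predict(future)[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
Note that with only two months of history per group, as in the sample data, Prophet has very little to learn from; this sketch shows the wiring, not a guarantee of forecast quality.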

I need to change the value of a specific column of a dataframe using a condition while importing multiple Excel files

import pandas as pd
batch=pd.read_excel('batch.xlsx')
stock_report=pd.read_excel('Stock_Report.xlsx')
Result_stock=pd.merge(stock_report,batch[['Batch','Cost price']], on='Batch').fillna(0)
Result_stock2=pd.merge(Result_stock,batch[['Item number',' Batch MRP']], on='Item number').fillna(0)
Result_stock2['Total']=Result_stock2['Posted quantity']*Result_stock2['Cost price']
I need to change the value of the Total column of Result_stock2, wherever it is 0, by multiplying two other column values.
You need to work on your formatting. Please format your code so we can read it.
If I understood what you mean, and your script is working fine so far, you should simply add:
Result_stock2.loc[Result_stock2['Total']==0,'Total']=(****OPERATION YOU NEED****)
For example, the operation could be:
Result_stock2.loc[Result_stock2['Total']==0,'Posted quantity']*(Result_stock2.loc[Result_stock2['Total']==0,'Cost price']-5)
It's not beautiful code, but it will do what you need.
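Putting the two lines together, a slightly tidier sketch (assuming those column names and the example operation above) computes the mask once:
mask = Result_stock2['Total'] == 0
Result_stock2.loc[mask, 'Total'] = (
    Result_stock2.loc[mask, 'Posted quantity']
    * (Result_stock2.loc[mask, 'Cost price'] - 5)
)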

How to Label encode multiple non-contiguous dataframe columns

I have a pandas dataframe with multiple columns (some of them non-contiguous) that need to be label encoded. From my understanding of the LabelEncoder class, I would need to use a different LabelEncoder object for each column. I am using the code below (list_of_string_cols is a list of all the columns which need to be label encoded):
for col in list_of_string_cols:
    labelenc = LabelEncoder()
    train_X[col] = labelenc.fit_transform(train_X[col])
    test_X[col] = labelenc.transform(test_X[col])
Is this the correct way?
Yes, that's correct.
LabelEncoder was primarily made to deal with labels, not features, so it only handles a single column at a time.
Up until the current version of scikit-learn (0.19.2), what you are using is the correct way of encoding multiple columns. See this question, which does the same thing:
Label encoding across multiple columns in scikit-learn
From next version onwards (0.20), OrdinalEncoder can be used to encode all categorical feature columns at once.
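A minimal sketch of that newer approach (assuming scikit-learn >= 0.20 and the same train_X/test_X frames as above) might look like:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
# The encoder accepts a 2-D selection, so all columns are handled in one call
train_X[list_of_string_cols] = enc.fit_transform(train_X[list_of_string_cols])
test_X[list_of_string_cols] = enc.transform(test_X[list_of_string_cols])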

NetworkXError: Input is not a correct Pandas DataFrame

I'm trying to create a network map in Python using NetworkX, with labels, to find potential connections between people. Currently I have a 1,316 x 1,316 matrix of 1's and 0's that represents whether or not there is a connection. I have been able to import this data into Python as a dataframe (see the screenshot of the dataframe). The column and row names are numbers, but at the end of the list they turn into actual names. If I remove the names and make a grid of just 1's and 0's, NetworkX will let me turn this into a connection graph, but it is almost useless without the labels to know who is connected to whom. When I include the labels in the dataframe and try to run the code listed below, I receive an error.
Error
NetworkXError: Input is not a correct Pandas DataFrame.
Code
Network = pd.DataFrame.from_csv('H:\\Network.csv')
G1 = nx.to_networkx_graph(Network)
I will admit to being quite new at this so any and all help will be appreciated. If you have any advice on why I'm getting this error or a better way to go about this, I am open to suggestions.
Thanks for your help.
I had the same problem. What helped me solve it was simply transforming the dataframe to a numpy array. (This only works if shape[0] == shape[1].)
Here is what I did:
Network = pd.DataFrame.from_csv('H:\\Network.csv')
G1 = nx.to_networkx_graph(Network.to_numpy())
This should work for you.
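If you want to keep the row and column names as node labels instead of falling back to positional indices, NetworkX can also read a labelled adjacency matrix directly. A sketch, assuming networkx 2.x and that the frame's index and columns carry the same names:
import networkx as nx
import pandas as pd
Network = pd.read_csv('H:\\Network.csv', index_col=0)  # first column as the row labels
G1 = nx.from_pandas_adjacency(Network)  # node names come from the index/columns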
