Countif function in Python looping in every cell - python-3.x

Hey everyone, I'm used to working in Excel, but after recently getting a dataset of about 500k rows that all need to be worked on in the same worksheet, I've been running into serious capacity issues and was advised to move any such function into a Python environment. This Excel formula, =IF(COUNTIF($J$2:J3,J3)>1,0,1) (J is the Asset ID column), goes through each cell and returns 0 if the value has already been encountered in the cells above it, and 1 if it is unique.
How would that be possible in a Python environment if I load my table as a DataFrame?
Thanks in advance

You can use pandas to achieve this very easily:
import pandas as pd # import pandas
df = pd.read_excel('your_file.xlsx') # use appropriate function according to your file type
df['Unique'] = (~df['Asset ID'].duplicated()).astype(int) # places 1 where the value has not been encountered before, 0 elsewhere ('Asset ID' should match your column name)
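For example, on a small made-up frame (the column name 'Asset ID' and the sample values are assumptions; use whatever your column is actually called), the result looks like this:
import pandas as pd
# Hypothetical sample data for illustration only.
df = pd.DataFrame({'Asset ID': ['A1', 'A2', 'A1', 'A3', 'A2']})
df['Unique'] = (~df['Asset ID'].duplicated()).astype(int)
print(df)
#   Asset ID  Unique
# 0       A1       1
# 1       A2       1
# 2       A1       0
# 3       A3       1
# 4       A2       0
This mirrors the COUNTIF formula: the first occurrence of each Asset ID gets 1, and every later occurrence gets 0.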

Related

Extracting specific text strings from a cell

Please take a look at the below text:
Mr. Manimaran R MMTFD NL20824448862,Room no-206
I have around half a million cells like this and I want to extract the ID number from each one using a formula. The problem I am facing is that some IDs start with NL, some with NH and some with IN/IWA, and the length of the text string containing the ID also varies (13/14/15 characters, etc.).
Is there any formula through which I can extract these IDs?
Currently I am using this formula: =MID(text,FIND("NH",text,1),15).
But I need to change it every time for NL/IN, and also for 14/15 characters, etc.
Use of external tools is the way to go with that much variance in what you need to extract.
How about the following, assuming the text is in cell C3 (basically pivoting off the comma and looking back):
=RIGHT(MID(C3,FIND(",",C3)-16,16),LEN(MID(C3,FIND(",",C3)-16,16))-FIND(" ",MID(C3,FIND(",",C3)-16,16)))
You can use pandas with Python 3 to do such a task easily and quickly.
In a Python 3 shell:
import pandas as pd
df = pd.read_excel('./your_excel_file.xlsx') # import from Excel
df['ID'] = df['Data_header_name'].str.split(' ', expand=True)[4].str.split(',', expand=True)[0] # adjust the filters here to pull out what you need
df.to_excel('output_excel.xlsx') # save to Excel
You also need to install openpyxl, xlrd, numpy, pandas using pip
pip install openpyxl xlrd numpy pandas
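If the IDs do not always sit in the same word position, a regular-expression extraction may be more robust than splitting on spaces. A minimal sketch, assuming the IDs always start with NL, NH, IN or IWA followed by digits, and reusing the column name from the snippet above:
import pandas as pd
df = pd.read_excel('./your_excel_file.xlsx')
# The prefixes and the column name are assumptions; adjust them to your data.
df['ID'] = df['Data_header_name'].str.extract(r'\b((?:NL|NH|IN|IWA)\d+)', expand=False)
df.to_excel('output_excel.xlsx', index=False)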

I need to change the value of a specific column of a dataframe using a condition while importing multiple Excel files

import pandas as pd
batch=pd.read_excel('batch.xlsx')
stock_report=pd.read_excel('Stock_Report.xlsx')
Result_stock=pd.merge(stock_report,batch[['Batch','Cost price']], on='Batch').fillna(0)
Result_stock2=pd.merge(Result_stock,batch[['Item number',' Batch MRP']], on='Item number').fillna(0)
Result_stock2['Total']=Result_stock2['Posted quantity']*Result_stock2['Cost price']
I need to change the value of the 'Total' column of Result_stock2, wherever it is 0, by multiplying two other column values together.
You need to learn some formatting. Please format your code so we can read it.
If I understood what you mean and your script is working fine so far, you should just simply add:
Result_stock2.loc[Result_stock2['Total']==0,'Total']=(****OPERATION YOU NEED****)
For example, the 'OPERATION' could be:
Result_stock2.loc[Result_stock2['Total']==0,'Posted quantity']*(Result_stock2.loc[Result_stock2['Total']==0,'Cost price']-5)
It's not beautiful code, but it will do what you need.
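If you prefer a single expression, the same conditional update can be written with numpy.where, reusing the example operation above (a sketch, not a required change):
import numpy as np
# Where 'Total' is 0, recompute it from the other columns; otherwise keep the existing value.
Result_stock2['Total'] = np.where(
    Result_stock2['Total'] == 0,
    Result_stock2['Posted quantity'] * (Result_stock2['Cost price'] - 5),
    Result_stock2['Total'],
)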

Parsing data from an unstructured Excel file using python

My objective is to parse the file and then find 2 columns within certain rows having certain headings.
I am trying to parse an Excel file that contains blocks of strings/numbers separated from each other by blanks, both in rows and in columns.
I am using pandas in Python and I am still not succeeding in resolving the problem of the empty or error values between the two blocks.
Using
import os
import pandas as pd
..
my_df = pd.read_excel(my_file, error_bad_lines=False)
did not resolve the problem and stops as soon as it gets to the first empty zone after the first block.
Most of the tutorials I watched assume that the Excel files to be parsed are neatly filled from top to bottom, with at most some NaN cells in the middle.
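One common workaround for sheets like this is to read everything with header=None so nothing is skipped, and then split the frame into blocks on the fully blank rows yourself. A minimal sketch, assuming the blocks are separated by completely empty rows (the file name is a placeholder):
import pandas as pd
raw = pd.read_excel('my_file.xlsx', header=None)  # read the raw grid, no header inference
blocks = []
current = []
for _, row in raw.iterrows():
    if row.isna().all():  # a completely blank row ends the current block
        if current:
            blocks.append(pd.DataFrame(current))
            current = []
    else:
        current.append(row)
if current:
    blocks.append(pd.DataFrame(current))
# Each element of `blocks` is now a contiguous block that can be searched
# for the headings and the two columns of interest.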

NetworkXError: Input is not a correct Pandas DataFrame

I'm trying to create a network map in Python using NetworkX, with labels, to find potential connections between people. Currently I have a 1,316 x 1,316 matrix of 1's and 0's that represent whether there is a connection or not. I have been able to import this data into Python as a dataframe; the column and row names are numbers, but at the end of the list they turn into actual names. If I remove the names and make a grid of just 1's and 0's, NetworkX will let me turn this into a connection graph, but it is almost useless without the labels to know who is connected to whom. When I include the labels in the dataframe and try to run the code listed below, I receive an error.
Error
NetworkXError: Input is not a correct Pandas DataFrame.
Code
Network = pd.DataFrame.from_csv('H:\\Network.csv')
G1 = nx.to_networkx_graph(Network)
I will admit to being quite new at this so any and all help will be appreciated. If you have any advice on why I'm getting this error or a better way to go about this, I am open to suggestions.
Thanks for your help.
I had the same problem. What helped me solve it was simply transforming the dataframe into a numpy array. (This only works if shape[0] == shape[1].)
Here is what I did:
Network = pd.read_csv('H:\\Network.csv', index_col=0) # read_csv with index_col=0 replaces the removed DataFrame.from_csv
G1 = nx.to_networkx_graph(Network.to_numpy())
This should work for you.
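If you want to keep the row/column names as node labels rather than dropping them, NetworkX can also build the graph directly from a labelled adjacency DataFrame, provided the matrix is square and the row and column labels match:
import networkx as nx
import pandas as pd
Network = pd.read_csv('H:\\Network.csv', index_col=0)
# from_pandas_adjacency keeps the DataFrame's index/column labels as node names,
# so the resulting graph shows who is connected to whom.
G1 = nx.from_pandas_adjacency(Network)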

Error with openpyxl after converting a script to Python 3.x that utilizes pandas and numpy

About a year ago I wrote a script that took a single column of datetime values and ran a window through the series to determine the greatest "lumping" of values based on an adjustable dimension of time. For example, given a million datetime values, what is the maximum number of entries that fall within 1 second, or 1 minute, or 1 hour of each other?
The problem is that I had a machine blow up on me and lost some of the documentation, specifically the versions of packages that I was working with. I think I've updated the code to execute within 3.x but am now getting errors that seem to suggest that pandas no longer supports the packages I'm trying to use. I've tried just installing a few random versions, updating pip, etc., but am not having much luck.
The exact error states, 'UserWarning: Installed openpyxl is not supported at this time. Use >=1.61 and <2.0.0' -- I'm not seeing a version history in their repository. Might just try installing older versions of Python and trying to bash this into place.
Here is the code:
import numpy as np
import pandas as pd
# Your original code was correct here. I assumed there will be a data column along with the timestamps.
df = pd.read_csv("ET.txt", parse_dates=["dt"])
# Construct a univariate `timeseries` instead of a single column dataframe as output by `read_csv`.
# You can think of a dataframe as a matrix with labelled columns and rows. A timeseries is more like
# an associative array, or labelled vector. Since we don't need a labelled column, we can use a simpler
# representation.
data = pd.Series(0, df.dt)
print(data)
window_size = 1
buckets_sec = data.resample("1S", how="count").fillna(0)
# We have to shift the data back by the same number of samples as the window size. This is because `rolling_apply`
# uses the timestamp of the end of the period instead of the beginning. I assume you want to know when the most
# active period started, not when it ended. Finally, `dropna` will remove any NaN entries appearing in the warmup
# period of the sliding window (ie. it will output NaN for the first window_size-1 observations).
rolling_count = pd.rolling_apply(buckets_sec, window=window_size, func=np.nansum).shift(-window_size).dropna()
print(rolling_count.describe())
# Some interesting data massaging
# E.g. See how the maximum hit count over the specified sliding window evolves on an hourly
# basis:
seconds_max_hits = rolling_count.resample("S", how="max").dropna()
# Plot the frequency of various hit counts. This gives you an idea how frequently various
# hit counts occur.
seconds_max_hits.hist()
# Same on a daily basis
daily_max_hits = rolling_count.resample("D", how="max").dropna()
Screen cap of the error: http://i.imgur.com/uSv29I5.png
I'm not sure why you're seeing an openpyxl-related error, but if you are, it seems like you should update your version of pandas. There were some significant changes in openpyxl that affected exporting to Excel from pandas, but these have since been resolved.
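For what it's worth, the deprecated calls in the script map onto the current pandas API roughly as follows (a sketch only, reusing the `data` and `window_size` names from the script above; re-check the resample/rolling semantics against the original intent):
buckets_sec = data.resample("1S").count()                      # was: data.resample("1S", how="count")
rolling_count = (
    buckets_sec.rolling(window=window_size).sum()              # was: pd.rolling_apply(..., func=np.nansum)
    .shift(-window_size)
    .dropna()
)
seconds_max_hits = rolling_count.resample("S").max().dropna()  # was: resample("S", how="max")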
