I have written a program that converts a raw credit card transaction file from my bank account into a cleansed file with some new columns.
I'm replacing a column's values based on a dictionary. The dictionary has 5 entries, whereas the DataFrame has a variable number of rows; the goal is to group the data further into types.
I'm also filtering the data, so masking is used as well.
The replace code:
t_type = df2['Transaction'].replace(mappingforcc.load_dictionary.dictionary, inplace=True)  # note: with inplace=True, replace() returns None
While debugging, when I make the number of rows in the dictionary equal to the number of rows in the DataFrame, the code runs smoothly without any issue. But when there is a mismatch between them, I get the following error:
ValueError: cannot assign mismatch length to masked array
I even made a function so that I don't have two DataFrames in my original code, since I'm creating the dictionary from an Excel file.
Despite several searches, I have been unable to resolve it.
Thanks in advance for the help.
Edit: I have found the issue. The problem is that I'm creating the dictionary with the following code:
load_dictionary.dictionary2 = df_dict.groupby(['Transactions'])['Type'].apply(list).to_dict()
Because there are multiple rows for the same transaction in the sheet, I'm getting output like this in the dictionary:
{'Adv Tax FCY Tran 1%-F': ['Recoverable - Adv. Tax', 'Recoverable - Adv. Tax', 'Recoverable - Adv. Tax', 'Recoverable - Adv. Tax']}
As a result, when another 'Adv Tax FCY Tran 1%-F' transaction appears, Python cannot interpret it, because the mapped value is a whole list rather than a single value.
I need help avoiding this issue.
I solved it: I removed the duplicates from the dictionary Excel file.
My file had multiple rows for each entry, so the groupby put all of them into the dictionary as lists, and when the counts didn't match I got the length-mismatch error.
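If cleaning the source sheet isn't an option, another way around it is to deduplicate in pandas before building the dictionary, so that each key maps to a single string instead of a list. A minimal sketch, assuming the same 'Transactions' and 'Type' columns used in the groupby line above:

# keep the first 'Type' seen for each transaction, then build a flat dict
dictionary = (df_dict.drop_duplicates(subset='Transactions')
                     .set_index('Transactions')['Type']
                     .to_dict())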
I also wrote a function to build the dictionary myself as a workaround, but I didn't end up using it, since everything worked like a charm after I removed the duplicates.
I'm a beginner at programming, but I'm still sharing the function (converted to plain code here so it's easier for everyone to use):
import ast
import openpyxl

file_path = r"filepath.xlsx"
wb = openpyxl.load_workbook(file_path)
ws = wb['Sheet1']  # enter your sheet name, or automate it with openpyxl's sheet functions

# Build the dictionary as a text literal, then parse it with ast.literal_eval.
# Cell values are assumed to be plain text without quote characters.
dictionary = "{"
for i in range(2, ws.max_row + 1):  # row 1 is the header, so start from row 2; max_row is inclusive
    # I have used columns A and B; change the letters accordingly.
    dictionary = dictionary + '"' + ws['A' + str(i)].value + '"' + ':' + '"' + ws['B' + str(i)].value + '"'
    if i != ws.max_row:
        dictionary = dictionary + ','
dictionary = dictionary + '}'
dictionary = ast.literal_eval(dictionary)
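The resulting dictionary then plugs straight into the replace call from the question, for example (same column name as above):

df2['Transaction'] = df2['Transaction'].replace(dictionary)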
This issue wasn't a waste of time at all; I learnt a lot of things along the way.
I have a .tsv file that's storing genomic data. 13884 rows, 2 columns. Strain ID : Barcode.
I'm attempting to write all of these values into a dictionary using csv.DictReader:
import csv

strains = {}
with open("mutation_barcodes") as file:
    reader = csv.DictReader(file, delimiter="\t")
    for row in reader:
        key = row['strain']
        value = row['barcode']
        strains[key] = value
Maybe my logic is wrong here, but my assumption was that I would end up with an equal number of dictionary entries. Indeed, up until row 11918 the dictionary entries do match. After that line, though, there start to be discrepancies, and I end up with about a thousand fewer dictionary entries than rows.
I've tried removing all lines from 11918 onwards; works as expected.
I've removed all lines before 11918, effectively running this on lines 11919-13884; also runs as expected.
I googled briefly to check whether there is a maximum size for a dictionary, but that doesn't seem to be the issue. Any ideas? I imagine I'm implementing something incorrectly here.
** I believe my issue is that some of the strain IDs are repeated but with different barcode values, so each later row overwrites the previous value stored under that strain ID key.
I think the simplest solution is to use a defaultdict:
import csv
from collections import defaultdict

strains = defaultdict(list)
with open("mutation_barcodes") as file:
    reader = csv.DictReader(file, delimiter="\t")
    for row in reader:
        key = row['strain']
        value = row['barcode']
        strains[key].append(value)
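This way a strain ID that appears more than once keeps every barcode in a list instead of silently overwriting the earlier one; for example, strains['strain_x'] (a hypothetical key) would end up holding all of that strain's barcodes.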
I was going through some ML Python code just to try to understand what it does and how it works. A YouTube video took me to this code: random-forests-tutorials. The code uses a hard-coded array/list, but if I point it at a file as input instead, it throws:
IndexError: list index out of range
in the print_tree function.
Could someone please help me resolve this? I have not yet changed anything else in the program besides pointing it to the file as input instead of the hard-coded array.
I created this function to read the CSV data from the HEADER and TRAINING files. To read the TESTING data file I have a similar function that does not read row[5], since that column does not exist: the testing data file is one column short.
import csv

def getBackData(filename):
    with open(filename, newline='') as csvfile:
        rawReader = csv.reader(csvfile, delimiter=',', quotechar='"')
        if "_training" in filename:
            # training rows have six columns; convert the three counts to int
            parsed = ((row[0],
                       int(row[1]),
                       int(row[2]),
                       int(row[3]),
                       row[4],
                       row[5])
                      for row in rawReader)
        else:
            parsed = rawReader
        theData = list(parsed)
    return theData
So in the code I am using the variables as:
training_data = fs.getBackData(fileToUse + "_training.dat")
header = fs.getBackData(fileToUse + "_header.dat")
testing_data = fs.getBackData(fileToUse + "_testing.dat")
Sample data for the header is:
header = ["CYCLE", "PASSES", "FAILURES", "IGNORED", "STATUS", "ACCEPTANCE"]
Sample training data is:
"cycle1",10,5,1,"fail","discard"
"cycle2",7,9,0,"fail","discard"
"cycle3",14,2,0,"pass","accept"
Sample testing data is:
"cycle71",11,4,1,"failed"
"cycle72",16,0,0,"passed"
I can't believe myself. I was wondering why it was so difficult to use a CSV file when everything else is so easy in Python. My bad, I am new to it. I finally figured out what was causing the index-out-of-range error.
The function getBackData should be used for the training data and testing data only; the header needs to be read separately. Even though the header has the same number of columns, all of its values are strings.
Actually, I was using getBackData for the header as well, and it was returning the CSV (containing the headers) as a 2D list, which is what it normally does. That was causing the issue.
The headers were supposed to be read as header[index], but the code was treating the result as header[row][col]. That's what I missed. I had assumed Python would be intelligent enough to return a 1-D list if the CSV contained only one row.
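A minimal fix along those lines (my own sketch, not the tutorial's code) is to keep using getBackData and simply take the first row when reading the header:

header = fs.getBackData(fileToUse + "_header.dat")[0]  # row 0 of the 2D list is the single header line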
Deserves a smiley :-)
Low-level Python skills here (I learned programming with SAS).
I am trying to apply a series of fuzzy string matching formulas (from the fuzzywuzzy library) to pairs of strings stored in a base DataFrame. I'm conflicted about the way to go about it.
Should I write a loop that creates a specific DataFrame for each formula and then appends all these sub-DataFrames into a single one? The trouble with this approach seems to be that, since I cannot dynamically name the sub-DataFrame, the resulting values get overwritten on each turn of the loop.
Or should I build one DataFrame in a single loop, taking my formula names and functions from a dict? This gives me the same problem as above.
Here is my formulas dict:
# ratios dict: all ratios names and functions
ratios = {"ratio": fuzz.ratio,
          "partial ratio": fuzz.partial_ratio,
          "token sort ratio": fuzz.token_sort_ratio,
          "partial token sort ratio": fuzz.partial_token_sort_ratio,
          "token set ratio": fuzz.token_set_ratio,
          "partial token set ratio": fuzz.partial_token_set_ratio
          }
And here is the loop I am currently sweating over:
# for loop iterating over ratios
for r, rn in ratios.items():
    # fuzzing function definition
    def do_the_fuzz(row):
        return rn(row[base_column], row[target_column])

    # new base df containing ratio data and calculations for current loop turn
    df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)
It gives me the same problem, namely that the 'mesure' column gets overwritten, and I end up with a column full of the last value (here: 'partial token set').
My overall problem is that I cannot work out whether and how I can dynamically name DataFrames, columns, or values in a Python loop (or whether I'm even supposed to).
I've been trying to come up with a solution myself for too long and I just can't figure it out. Any insight would be very much appreciated! Many thanks in advance!
I would create a dataframe that is updated at each loop iteration:
final_df = pd.DataFrame()

for r, rn in ratios.items():
    ...
    df_out1 = pd.DataFrame(data = df_out, columns = [base_column, target_column, 'mesure', 'valeur', 'drop'])
    df_out1['mesure'] = r
    df_out1['valeur'] = df_out.apply(do_the_fuzz, axis = 1)
    final_df = pd.concat([final_df, df_out1], axis=0)
I hope this can help you.
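A small variant of the same idea (same variable names assumed as above) is to collect the per-ratio frames in a list and concatenate once after the loop, which avoids re-copying final_df on every iteration:

pieces = []
for r, rn in ratios.items():
    ...
    pieces.append(df_out1)
final_df = pd.concat(pieces, axis=0, ignore_index=True)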
I have a dataset that uses '?' instead of 'NaN' for missing values. I could have gone through each column using replace, but the problem is that I have 22 columns. I am trying to write a loop to do it efficiently, but I am getting it wrong. Here is what I am doing:
for col in adult.columns:
    if adult[col]=='?':
        adult[col]=adult[col].str.replace('?', 'NaN')
The plan is to use the 'NaN' values with the fillna function, or to drop them with dropna. The second problem is that not all the columns are categorical, so the .str accessor is also wrong. How can I deal with this situation easily?
If you're reading the data from a .csv or .xlsx file you can use the na_values parameter:
adult = pd.read_csv('path/to/file.csv', na_values=['?'])
Otherwise do what #MasonCaiby said and use adult.replace('?', float('nan'))
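A minimal sketch of that approach, assuming the adult DataFrame from the question; it handles all 22 columns at once, whatever their dtype:

import numpy as np

adult = adult.replace('?', np.nan)   # turn every '?' into a real NaN in one pass
adult = adult.dropna()               # then drop the incomplete rows ...
# adult = adult.fillna('missing')    # ... or fill them instead, as preferred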
I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique entries like those in the list shown above.
To save the overhead of appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user.shape[0]):  # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis = 1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows,
                      columns = ['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
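For completeness, a plain pandas groupby along the lines I was originally hoping for might look like the sketch below; it assumes each 'Spclty' value is an iterable of (code, date) pairs, and I have not benchmarked it on the full 0.7M rows:

# collapse to one row per UserNbr; dict() de-duplicates repeated codes,
# keeping the last date seen for each one
df_out = (df_tmp.groupby('UserNbr')['Spclty']
                .apply(lambda col: dict(pair for spclty in col for pair in spclty))
                .reset_index())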