Get count of data from particular Excel cell using python - python-3.x

I am reading an Excel file as below using pandas and writing the results to a dataframe.
I want to get the count of rows present in the "Expected Result" column for each test case. I used the len function, but it throws "TypeError: object of type 'numpy.int64' has no len()". Is there a way to capture the row count from Excel for each test in Python?
Here is my code:
import pandas as pd

df = pd.read_excel("input_test_2.xlsx")
testcases = df['Test'].values
expected_result = df['Expected Result'].values
for i in range(len(df)):
    testcase_nm = testcases[i]
    _expected = expected_result[i]
    print("Count of Expected Result:", len(_expected))
This is the output I am looking for:
Testcase-1 , Count of Expected Result: 1
Testcase-2 , Count of Expected Result: 3

Without seeing the dataframe data, it's tough to say whether this will work, as it's not clear how pandas handles merged Excel cells.
In general, though:
df_counts = df.groupby('Test').count().reset_index() # won't give you the text, but a new dataframe
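A small sketch of that groupby approach, using made-up data shaped like the question's sheet (one row per expected result, so the counts match the desired output):

```python
import pandas as pd

# Hypothetical data: Testcase-1 has one expected-result row, Testcase-2 has three
df = pd.DataFrame({'Test': ['Testcase-1', 'Testcase-2', 'Testcase-2', 'Testcase-2'],
                   'Expected Result': ['res-a', 'res-b', 'res-c', 'res-d']})

# Count rows per test case, then print in the question's desired format
df_counts = df.groupby('Test')['Expected Result'].count().reset_index()
for _, row in df_counts.iterrows():
    print(f"{row['Test']} , Count of Expected Result: {row['Expected Result']}")
```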

Related

Why am I getting a 'Column' object is not callable error in PySpark?

I am doing a simple parquet file read and running a query to find the unmatched rows from the left table. Please see the code snippet below.
argTestData = '<path to parquet file>'
tst_DF = spark.read.option('header', True).parquet(argTestData)
argrefData = '<path to parquet file>'
refDF = spark.read.option('header', True).parquet(argrefData)
cond = ["col1", "col2", "col3"]
fi = tst_DF.join(refDF, cond , "left_anti")
So far things are working. However, as a requirement, I need to get the list of elements if the above gives count > 0, i.e. if the value of fi.count() > 0, then I need the element names. So I tried the code below, but it throws an error.
if fi.filter(col("col1").count() > 0).collect():
    fi.show()
The error:
TypeError: 'Column' object is not callable
Note: I have 3 columns as the joining condition, held in a list assigned to the variable cond, and I need to get the unmatched records for those 3 columns, so the if condition has to accommodate them. Of course there are many other columns due to the join.
Please suggest where I am making mistakes.
Thank you
If I understand correctly, that's simply:
fi.select(cond).collect()
The left_anti join already returns the records which do not match (exist in tst_DF but not in refDF).
You can add a distinct before the collect to remove duplicates.
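For readers without a Spark session handy, the same left-anti idea can be sketched in pandas with an indicator merge (an illustration of the concept, not the PySpark API; the frames and values are made up):

```python
import pandas as pd

# Made-up stand-ins for the two parquet tables in the question
tst_df = pd.DataFrame({'col1': [1, 2], 'col2': ['a', 'b'], 'col3': [10, 20]})
ref_df = pd.DataFrame({'col1': [1], 'col2': ['a'], 'col3': [10]})
cond = ['col1', 'col2', 'col3']

# Left-anti equivalent: keep left rows that have no match in the right frame
merged = tst_df.merge(ref_df[cond], on=cond, how='left', indicator=True)
fi = merged.loc[merged['_merge'] == 'left_only', cond].drop_duplicates()

if len(fi) > 0:  # the count > 0 check, then list the unmatched elements
    print(fi.to_dict('records'))
```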
Did you import the column function?
from pyspark.sql import functions as F
...
if fi.filter(F.col("col1").count() > 0).collect():
    fi.show()

Converting Pandas DataFrame OrderedSet column into list

I have a Pandas DataFrame where one column is an OrderedSet, like this:
df
OrderedSetCol
0 OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])
This is:
from ordered_set import OrderedSet
I am just trying to convert this column into a list:
df['OrderedSetCol_list'] = df['OrderedSetCol'].apply(lambda x: ast.literal_eval(str("\'" + x.replace('OrderedSet(','').replace(')','') + "\'")))
The code executes successfully, but the column type is still str and not list:
type(df.loc[0]['OrderedSetCol_list'])
str
What am I doing wrong?
EDIT: My OrderedSetCol is also a string column, as I am reading a file from disk which was originally saved from an OrderedSet column.
Expected Output:
[1721754, 3622558, 2550234, 2344034, 8550040]
You can apply list, just as you would call it on the OrderedSet itself:
df = pd.DataFrame({'OrderedSetCol':[OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])]})
df.OrderedSetCol.apply(list)
Output:
[1721754, 3622558, 2550234, 2344034, 8550040]
If your column holds strings instead:
df.OrderedSetCol.str.findall(r'\d+')
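Putting the string case together (a sketch; the sample value mirrors the question, and the int cast is added so the result is a list of numbers rather than digit strings):

```python
import pandas as pd

# Hypothetical case: the column was saved to disk, so it now holds strings
df = pd.DataFrame({'OrderedSetCol':
                   ['OrderedSet([1721754, 3622558, 2550234, 2344034, 8550040])']})

# findall pulls out every run of digits; cast them back to int
df['OrderedSetCol_list'] = (df['OrderedSetCol']
                            .str.findall(r'\d+')
                            .apply(lambda xs: [int(x) for x in xs]))
result = df.loc[0, 'OrderedSetCol_list']
```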

ValueError: could not convert string to float: 'Pregnancies'

import csv

def loadCsv(filename):
    lines = csv.reader(open(filename))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
Hello, I'm trying to implement Naive Bayes, but it gives me this error even though I've manually changed the type of each column to float.
It still gives me the error.
Above is the function that does the conversion.
The ValueError is because the code is trying to cast (convert) the items in the CSV header row, which are strings, to floats. You could just skip the first row of the CSV file, for example:
for i in range(1, len(dataset)):  # specifying 1 here will skip the first row
    dataset[i] = [float(x) for x in dataset[i]]
Note: that would leave the first item in dataset as the headers (str).
Personally, I'd use pandas, which has a read_csv() method, which will load the data directly into a dataframe.
For example:
import pandas as pd
dataset = pd.read_csv('diabetes.csv')
This will give you a dataframe though, not a list of lists. If you really want a list of lists, you could use dataset.values.tolist().
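A sketch of that pandas approach, using io.StringIO to stand in for diabetes.csv (the column names and values here are illustrative):

```python
import io
import pandas as pd

# io.StringIO stands in for diabetes.csv: a header row plus numeric rows
csv_text = "Pregnancies,Glucose\n6,148\n1,85\n"

dataset = pd.read_csv(io.StringIO(csv_text))  # header row handled automatically
rows = dataset.values.tolist()  # plain list of lists, header excluded
```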

Python conditional vlookup?

I'm using pandas dataframes to work with 2 CSV files.
I need a vlookup, but I want to apply another vlookup if the result is a null string. Any idea?
one dataframe file is called data
the other is called data2
This vlookup (working code) finds rows where data["ID A"] == data2['Person_ID'] and brings in data2['Status_job'] from that row:
Code:
data['STATUS X'] = data['ID A'].map(data2[['Person_ID', 'Status_job']].set_index('Person_ID')['Status_job'].to_dict())
BUT, I want another vlookup in case ['Status_job'] returns a null string (same code, but with Program_ID instead of Person_ID).
Working code2:
data['STATUS X'] = data['ID A'].map(data2[['Program_ID', 'Status_job']].set_index('Program_ID')['Status_job'].to_dict())
How can I merge these 2 snippets into 1 conditional lookup? I tried .loc and lambda x, but I'm not sure how to make it work without errors; I will appreciate any help.
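One possible way to chain the two lookups (a sketch with made-up frames mirroring the question's columns; "null string" is assumed to mean an empty string here): run the Person_ID map first, treat empty strings as missing, and fill the gaps from the Program_ID map.

```python
import pandas as pd

# Hypothetical frames following the question's column names
data = pd.DataFrame({'ID A': ['p1', 'x8']})
data2 = pd.DataFrame({'Person_ID': ['p1', 'p2'],
                      'Program_ID': ['x9', 'x8'],
                      'Status_job': ['active', 'closed']})

by_person = data['ID A'].map(data2.set_index('Person_ID')['Status_job'].to_dict())
by_program = data['ID A'].map(data2.set_index('Program_ID')['Status_job'].to_dict())

# Treat empty strings as missing, then fall back to the Program_ID lookup
data['STATUS X'] = by_person.replace('', pd.NA).fillna(by_program)
```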

How to split pandas dataframe into multiple dataframes based on unique string value without aggregating

I have a df with multiple country codes in a column (US, CA, MX, AU, ...) and want to split this one df into multiple ones based on these country code values, but without aggregating it.
I've tried a for loop but was only able to get one df, and it was aggregated with groupby().
I gave up trying to figure it out, so I split them based on str.match and wrote one line for each country code. Is there a nice for loop that could achieve the same as the code below? If it would write a CSV file for each new df, that would be fantastic.
us = df[df['country_code'].str.match("US")]
mx = df[df['country_code'].str.match("MX")]
ca = df[df['country_code'].str.match("CA")]
au = df[df['country_code'].str.match("AU")]
...
We can write a for loop which takes each code and uses query to get the correct part of the data. Then we write it to CSV with to_csv, also using an f-string for the filename:
codes = ['US', 'MX', 'CA', 'AU']
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    temp.to_csv(f'df_{code}.csv')
Note: f-strings only work on Python >= 3.6.
To keep the dataframes:
codes = ['US', 'MX', 'CA', 'AU']
dfs = []
for code in codes:
    temp = df.query(f'country_code.str.match("{code}")')
    dfs.append(temp)
    temp.to_csv(f'df_{code}.csv')
Then you can access them by index, for example: print(dfs[0]) or print(dfs[1]).
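As a variation, groupby itself can split without aggregating if you iterate over the groups instead of calling an aggregation method (a sketch with made-up data; this also avoids hard-coding the list of codes):

```python
import pandas as pd

# Made-up data with a country_code column
df = pd.DataFrame({'country_code': ['US', 'MX', 'US', 'AU'],
                   'value': [1, 2, 3, 4]})

# Iterating over groupby yields (code, sub-dataframe) pairs, no aggregation
dfs = {code: grp for code, grp in df.groupby('country_code')}
```

Each value in dfs is a full sub-dataframe holding only that country's rows, so dfs['US'] could be written out with to_csv the same way as above.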
