Comparing columns with pandas_schema

I am using Python 3.8 and pandas_schema to run integrity checks on data. I have a requirement that workflow_next_step should never be the same as workflow_entry_step. I'm trying to write a CustomSeriesValidation that compares both columns, because I don't see a stock validation that does this.
Is there a way to compare two cell values in the same row using pandas_schema? In this example, pandas_schema should return an error for Mary, because she was moved from In Progress to In Progress.
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import (CanConvertValidation, CustomSeriesValidation,
    InListValidation, LeadingWhitespaceValidation, TrailingWhitespaceValidation)

df = pd.DataFrame({
    'prospect': ['Bob', 'Jill', 'Steve', 'Mary'],
    'value': [10000, 15000, 500, 50000],
    'workflow_entry_step': ['New', 'In Progress', 'Closed', 'In Progress'],
    'workflow_next_step': ['In Progress', 'Closed', None, 'In Progress']})

schema = Schema([
    Column('prospect', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('value', [CanConvertValidation(int, message="Doesn't convert to integer.")]),
    Column('workflow_entry_step', [InListValidation([None, 'New', 'In Progress', 'Closed'])]),
    Column('workflow_next_step', [
        CustomSeriesValidation(lambda x: x != Column('workflow_entry_step'),
                               'Steps cannot be the same.'),
        InListValidation([None, 'New', 'In Progress', 'Closed'])])])

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import (CanConvertValidation, CustomSeriesValidation,
    InListValidation, LeadingWhitespaceValidation, TrailingWhitespaceValidation)

df = pd.DataFrame({
    'prospect': ['Bob', 'Jill', 'Steve', 'Mary'],
    'value': [10000, 15000, 500, 50000],
    'workflow_entry_step': ['New', 'In Progress', 'Closed', 'In Progress'],
    'workflow_next_step': ['In Progress', 'Closed', None, 'In Progress']})

schema = Schema([
    Column('prospect', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('value', [CanConvertValidation(float)]),
    Column('workflow_entry_step', [InListValidation([None, 'New', 'In Progress', 'Closed'])]),
    Column('workflow_next_step', [
        # Compare against the actual series from the dataframe,
        # not a Column() schema object.
        CustomSeriesValidation(lambda x: x != df['workflow_entry_step'],
                               'Steps cannot be the same.'),
        InListValidation([None, 'New', 'In Progress', 'Closed'])])])

errors = schema.validate(df)
for error in errors:
    print(error)
Output:
{row: 3, column: "workflow_next_step"}: "In Progress" Steps cannot be the same.
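One caveat worth adding, as an editorial note rather than part of the original answer: the lambda closes over the outer df, so the schema only validates correctly against that one dataframe. A minimal sketch of a reusable variant, assuming you rebuild the schema for each frame you validate:

def build_schema(frame):
    # Rebuild the schema per frame so the cross-column lambda captures
    # the right 'workflow_entry_step' series.
    steps = [None, 'New', 'In Progress', 'Closed']
    return Schema([
        Column('prospect', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
        Column('value', [CanConvertValidation(float)]),
        Column('workflow_entry_step', [InListValidation(steps)]),
        Column('workflow_next_step', [
            CustomSeriesValidation(lambda s: s != frame['workflow_entry_step'],
                                   'Steps cannot be the same.'),
            InListValidation(steps)])])

errors = build_schema(df).validate(df)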

Related

Error when adding a new array-typed column using pyspark.sql.functions.array

I get an error when I try to add a new array-typed column using pyspark.sql.functions.array.
from pyspark.sql import functions as F

column = ['id', 'fname', 'age', 'avg_wh']
data = [('1', 'user_1', '40', 8.5),
        ('2', 'user_2', '6', 1.5),
        ('3', 'user_3', '4', 5.5),
        ('10', 'user_10', '4', 2.5)]

df = spark.createDataFrame(data, column)
df.withColumn("lsitColumn", F.array(["1", "2", "3"]))
df.show()
The error:
raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: cannot resolve '1' given input columns: [age, avg_wh, fname, id];;
'Project [id#0, fname#1, age#2, avg_wh#3, array('1, '2, '3) AS lsitColumn#8]
+- LogicalRDD [id#0, fname#1, age#2, avg_wh#3], false
Could you please assist: what is the root cause of this error? I managed to create the column using a UDF, but I don't understand why this basic approach failed.
The UDF:
from pyspark.sql.types import ArrayType, StringType

extract = F.udf(lambda x: ["1", "2", "3"], ArrayType(StringType()))
percentielDF = df.withColumn("lsitColumn", extract("id"))
I expected to get a new DataFrame with a list-typed column, but I get the error above instead.
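The likely root cause, added here as an editorial note since the thread carries no answer: F.array() interprets bare strings as column names, so "1", "2" and "3" are resolved as nonexistent columns (hence "cannot resolve '1'"). A minimal sketch of a fix using F.lit(), which avoids the UDF entirely:

from pyspark.sql import functions as F

# lit() turns each string into a literal Column, so array() no longer
# tries to resolve "1", "2", "3" as column names.
df = df.withColumn("lsitColumn", F.array(F.lit("1"), F.lit("2"), F.lit("3")))
df.show()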

Unable to populate a dataframe column using a value from a dict that sits inside a list in one of the columns

I have one column in a dataframe with values like this, a list with a dictionary inside:
[{'symbol': '$', 'value': 5.2, 'currency': 'USD', 'raw': '$5.20', 'name': '$5.20$5.20 ($1.30/Fl Oz)', 'asin': 'B07N31VZP8'}]
I want only 'value' from this dictionary.
for i in df["prices"]:
    try:
        # print(type(i[0]["value"]))
        df["prices"] = df["prices"].apply(lambda i: i[0]["value"])
    except Exception as e:
        print("", e)
But I'm getting the following error; even though I was able to get the value, I can't populate it into the dataframe column:
'float' object is not subscriptable
How can I overcome this issue?
I think the problem is that you apply the transformation multiple times. apply() already runs the transformation for every row in your dataframe, so after the first pass the column holds plain floats, and the next iteration of your loop tries i[0]["value"] on a float, which raises 'float' object is not subscriptable. If you remove the loop it should work:
import pandas as pd

# Create your data
df = pd.DataFrame([
    ['test', [{'symbol': '$', 'value': 5.2, 'currency': 'USD', 'raw': '$5.20',
               'name': '$5.20$5.20 ($1.30/Fl Oz)', 'asin': 'B07N31VZP8'}]],
    ['test', [{'symbol': '$', 'value': 6.8, 'currency': 'USD', 'raw': '$5.20',
               'name': '$5.20$5.20 ($1.30/Fl Oz)', 'asin': 'B07N31VZP8'}]],
], columns=['test', 'prices'])
print(df)

df["prices"] = df["prices"].apply(lambda i: i[0]["value"])
print(df)
Input dataframe:
test prices
0 test [{'symbol': '$', 'value': 5.2, 'currency': 'US...
1 test [{'symbol': '$', 'value': 6.8, 'currency': 'US...
Output dataframe:
test prices
0 test 5.2
1 test 6.8
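A slightly more defensive variant, added here as a sketch rather than part of the original answer, assuming some rows might hold an empty list or a missing value:

def first_price(cell):
    # Return the 'value' of the first dict, or None for empty/missing cells.
    if isinstance(cell, list) and cell:
        return cell[0].get("value")
    return None

df["prices"] = df["prices"].apply(first_price)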

Filter dataframe columns as you iterate through rows and create a dictionary

I have the following table of data in a spreadsheet:
Name  Description  Value
foo   foobar       5
baz   foobaz       4
bar   foofoo       8
I'm reading the spreadsheet and passing the data as a dataframe.
I need to transform this table of data to JSON following a specific schema.
I have the following script:
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row.to_dict())
which returns:
{'Name': 'bar', 'Description': 'foofoo', 'Value': '8'}
I want to be able to filter out a specific column. For example, to return this:
{'Name': 'bar', 'Description': 'foofoo'}
I know that I can print only the columns I want with print(row['Name'], row['Description']); however, this only returns the values, when I also want the keys.
How can I do this?
I wrote this entire thing only to realize that #anky_91 had already suggested it. Oh well...
import pandas as pd

data = {
    "name": ["foo", "abc", "baz", "bar"],
    "description": ["foobar", "foofoo", "foobaz", "foofoo"],
    "value": [5, 3, 4, 8],
}
df = pd.DataFrame(data=data)
print(df, end='\n\n')

rec_dicts = df.loc[df["description"] == "foofoo", ["name", "description"]].to_dict("records")
print(rec_dicts)
Output:
name description value
0 foo foobar 5
1 abc foofoo 3
2 baz foobaz 4
3 bar foofoo 8
[{'name': 'abc', 'description': 'foofoo'}, {'name': 'bar', 'description': 'foofoo'}]
After converting the row to a dictionary you can delete the key you don't need with:
d = row.to_dict()
del d['Value']
Now the dictionary will have only Name and Description.
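Spelled out in the question's own loop, a quick sketch of that suggestion:

for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        d = row.to_dict()
        del d['Value']  # drop the key you don't want
        print(d)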
You can try this:
import io
import pandas as pd

s = """Name,Description,Value
foo,foobar,5
baz,foobaz,4
bar,foofoo,8
"""
df = pd.read_csv(io.StringIO(s))

for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row[['Name', 'Description']].to_dict())
Result:
{'Name': 'bar', 'Description': 'foofoo'}

For every element in a list a, how to count how many times it appears in one specific column of another dataframe

For every element in a dict a, I need to count how many times the element in the 'age' column appears in one specific column of another dataframe in pandas.
For example, I have the dict below:
a={'age':[22,38,26],'no':[1,2,3]}
and I have another dataframe with a few columns
TableB = {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen', 'John', 'Jane', 'Doe'],
          'age': [22, 38, 26, 35, 41, 22, 38],
          'fare': [7.25, 71.83, 0, 8.05, 7, 6.05, 6],
          'survived?': [False, True, True, False, True, False, True]}
I would like to know how many times every element in dict a appears in the column 'age' in TableB. The result I expect is c={'age':[22,38,26],'count':[2,2,1]}
I have tried the apply function but it does not work; it comes with a syntax error. I'm new to pandas, could anyone please help with that? Thank you!
def myfunction(y):
    seriesObj = TableB.apply(lambda x: True if y in list(x) else False, axis=1)
    numOfRows = len(seriesObj[seriesObj == True].index)
    return numofRows

c['age'] = a['age']
c['count'] = a['age'].apply(myfunction)
Use the pd.Series.value_counts method and then to_dict:
import pandas as pd

(pd.Series(TableB['age'])
 .value_counts()
 .loc[a['age']]
 .rename('count')
 .rename_axis('age')
 .reset_index()
 .to_dict(orient='list'))
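With the question's a and TableB this should produce:
{'age': [22, 38, 26], 'count': [2, 2, 1]}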
You can use pandas.Series.value_counts() on the age column and select the results you're interested in. The following solution also takes into account values in a that may be missing from TableB:
import pandas as pd

a = [22, 38, 26, 99]
TableB = {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen', 'John', 'Jane', 'Doe'],
          'age': [22, 38, 26, 35, 41, 22, 38],
          'fare': [7.25, 71.83, 0, 8.05, 7, 6.05, 6],
          'survived?': [False, True, True, False, True, False, True]}

tableB_df = pd.DataFrame(TableB)
counts_series = tableB_df['age'].value_counts()
# Keep only the ages that actually appear in a (99 is silently dropped).
counts_series_intersection = counts_series.loc[counts_series.index.intersection(a)]
counts_df = pd.DataFrame({'age': counts_series_intersection.index,
                          'count': counts_series_intersection.values})
Have a look at the following resources for more info:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-list-with-missing-labels-is-deprecated
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html
You can merge the data frames to filter out the values that don't appear in a, and then just count the values.
import pandas as pd

a = {'age': [22, 38, 26], 'no': [1, 2, 3]}
TableB = {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen', 'Jones', 'Davis', 'Smith'],
          'age': [22, 38, 26, 35, 41, 22, 38],
          'fare': [7.25, 71.83, 0, 8.05, 7, 6.05, 6],
          'survived?': [False, True, True, False, True, False, True]}

df_a = pd.DataFrame(a)
df_tb = pd.DataFrame(TableB)

(pd.merge(df_tb, df_a, on='age')['age']
 .value_counts()
 .rename('count')
 .rename_axis('age')
 .reset_index()
 .to_dict(orient='list'))
{'age': [22, 38, 26], 'count': [2, 2, 1]}

How to substring a column name in Python

I have a column named 'comment1abc'.
I am writing a piece of code to check whether a column contains a certain string 'abc':
df['col1'].str.contains('abc') == True
Now, instead of hard-coding 'abc', I want to use a substring-like operation on the column name 'comment1abc' (to be precise, the column name, not the column values) so that I can get the 'abc' part out of it. For example, the code below does a similar job:
x = 'comment1abc'
x[8:11]
But how do I implement that for a column name? I tried the code below but it's not working.
for col in ['comment1abc']:
    df['col123'].str.contains('col.names[8:11]')
Any suggestion will be helpful.
Sample dataframe:
f = {'name': ['john', 'tom', None, 'rock', 'dick'],
     'DoB': [None, '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'],
     'location': ['NY', 'NJ', 'PA', 'NY', None],
     'code': ['abc1xtr', '778abc4', 'a2bcx98', None, 'ab786c3'],
     'comment1abc': ['99', '99', '99', '99', '99'],
     'comment2abc': ['99', '99', '99', '99', '99']}
df1 = pd.DataFrame(data=f)
and sample code:
for col in ['comment1abc', 'comment2abc']:
    df1[col][df1['code'].str.contains('col.names[8:11]') == True] = '1'
I think the answer is as simple as this:
for col in ['comment1abc', 'comment2abc']:
    x = col[8:11]
    # Pass the sliced string itself, and use .loc instead of chained indexing.
    df1.loc[df1['code'].str.contains(x, na=False), col] = '1'
Putting the expression inside quotes made .str.contains() search for that literal text; pass the sliced string variable x instead.
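A slightly more general variant, added as a sketch and assuming the target substring is whatever follows the 'commentN' prefix rather than always characters 8 through 10:

import re

for col in ['comment1abc', 'comment2abc']:
    match = re.match(r'comment\d+(.*)', col)
    if match:
        x = match.group(1)  # 'abc'
        df1.loc[df1['code'].str.contains(x, na=False), col] = '1'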
