TypeError: No matching signature found, when trying to pivot dataframe - python-3.x

I have a dataframe,
df
   ID  UD    MD   TD
1   1   3  13.0  115
2   1   4  14.0  142
3   2   3  13.0  156
4   2   8  13.5  585
I get the error TypeError: No matching signature found when I try df_p = df.pivot('ID','UD','MD').
What could be wrong?

Here is the data:
df = pd.DataFrame({'ID': [1,1,2,2], 'UD': [3,4,3,8], 'MD':[13,14,13,13.5], 'TD':[115,142,156,585]})
By default (here under Python 3.9) the dtypes are as follows:
ID int64
UD int64
MD float64
TD int64
dtype: object
and you can run df.pivot('ID','UD','MD'):
UD     3     4     8
ID
1   13.0  14.0   NaN
2   13.0   NaN  13.5
But if you happen to have read the data as float16 and run
df.astype({'MD':np.float16}).pivot('ID','UD','MD')
you would get TypeError: No matching signature found.
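Since float64 pivots cleanly (as shown above), one way out is to cast the half-precision column back to a wider float before pivoting. A minimal sketch, assuming the same df as above and that MD was read as float16:

import numpy as np
import pandas as pd

df16 = df.astype({'MD': np.float16})                    # the problematic dtype
df_p = df16.astype({'MD': 'float64'}).pivot('ID', 'UD', 'MD')  # cast back, then pivot works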

Related

pyspark - assign non-null columns to new columns

I have a dataframe with the following schema in pyspark:
   user_id  datadate  page_1.A  page_1.B  page_1.C  page_2.A  page_2.B  \
0      111  20220203       NaN       NaN       NaN       NaN       NaN
1      222  20220203         5         5         5       5.0       5.0
2      333  20220203         3         3         3       3.0       3.0

   page_2.C  page_3.A  page_3.B  page_3.C
0       NaN       1.0       1.0       2.0
1       5.0       NaN       NaN       NaN
2       4.0       NaN       NaN       NaN
So it contains columns like user_id, datadate, and a few columns for each page (there are 3 pages), which are the result of 2 joins. In this example I have page_1, page_2, page_3, and each has 3 columns: A, B, C. Additionally, for each page's columns, a given row will either have them all null or all populated, as in my example.
I don't care which page each value comes from; I just want to get, for each row, the [A, B, C] values that are not null.
example for a wanted result table:
   user_id  datadate  A  B  C
0      111  20220203  1  1  2
1      222  20220203  5  5  5
2      333  20220203  3  3  3
so the logic will be something like:
df[A] = page_1.A or page_2.A or page_3.A, whichever is not null
df[B] = page_1.B or page_2.B or page_3.B, whichever is not null
df[C] = page_1.C or page_2.C or page_3.C, whichever is not null
for all of the rows.
And of course, I would like to do it in an efficient way.
Thanks a lot.
You can use the SQL function greatest to extract the greatest value from a list of columns.
You can find the documentation here: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.greatest.html
from pyspark.sql import functions as F

(df.withColumn('A', F.greatest(F.col('page_1.A'), F.col('page_2.A'), F.col('page_3.A')))
   .withColumn('B', F.greatest(F.col('page_1.B'), F.col('page_2.B'), F.col('page_3.B')))
   .withColumn('C', F.greatest(F.col('page_1.C'), F.col('page_2.C'), F.col('page_3.C')))
   .select('user_id', 'datadate', 'A', 'B', 'C'))
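Since the requirement is "whichever is not null" rather than the largest value, F.coalesce (which returns the first non-null argument) may express the intent more directly. A minimal sketch, assuming the same column names as above; this is an alternative, not part of the original answer:

from pyspark.sql import functions as F

result = df.select(
    'user_id', 'datadate',
    F.coalesce(F.col('page_1.A'), F.col('page_2.A'), F.col('page_3.A')).alias('A'),
    F.coalesce(F.col('page_1.B'), F.col('page_2.B'), F.col('page_3.B')).alias('B'),
    F.coalesce(F.col('page_1.C'), F.col('page_2.C'), F.col('page_3.C')).alias('C'),
)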

Display row with False values in validated pandas dataframe column [duplicate]

This question already has answers here:
Display rows with one or more NaN values in pandas dataframe
(5 answers)
Closed 2 years ago.
I was validating the 'Price' column in my dataframe. Sample:
     ArticleId  SiteId  ZoneId        Date  Quantity  Price  CostPrice
53      194516       9       2  2018-11-26      11.0  40.64      27.73
164     200838       9       2  2018-11-13       5.0  99.75      87.24
373     200838       9       2  2018-11-27       1.0  99.75      87.34
pd.to_numeric(df_sales['Price'], errors='coerce').notna().value_counts()
True     17984
False       13
Name: Price, dtype: int64
And I'd love to display those rows with False values so I know what's wrong with them. How do I do that?
Thank you.
You could print the rows where Price isnull():
print(df_sales[df_sales['Price'].isnull()])
   ArticleId  SiteId  ZoneId        Date  Quantity  Price  CostPrice
1     200838       9       2  2018-11-13         5    NaN     87.240
pd.to_numeric(df['Price'], errors='coerce').isna() returns a Boolean Series, which can be used to select the rows that cause errors.
This includes NaN as well as rows containing strings that can't be parsed as numbers.
import pandas as pd
# test data
df = pd.DataFrame({'Price': ['40.64', '99.75', '99.75', pd.NA, 'test', '99. 0', '98 0']})
   Price
0  40.64
1  99.75
2  99.75
3   <NA>
4   test
5  99. 0
6   98 0
# find the value of the rows that are causing issues
problem_rows = df[pd.to_numeric(df['Price'], errors='coerce').isna()]
# display(problem_rows)
   Price
3   <NA>
4   test
5  99. 0
6   98 0
Alternative
Create an extra column and then use it to select the problem rows
df['Price_Updated'] = pd.to_numeric(df['Price'], errors='coerce')
   Price  Price_Updated
0  40.64          40.64
1  99.75          99.75
2  99.75          99.75
3   <NA>            NaN
4   test            NaN
5  99. 0            NaN
6   98 0            NaN
# find the problem rows
problem_rows = df.Price[df.Price_Updated.isna()]
Explanation
Updating the column with .to_numeric(), and then checking for NaNs will not tell you why the rows had to be coerced.
# update the Price row
df.Price = pd.to_numeric(df['Price'], errors='coerce')
# check for NaN
problem_rows = df.Price[df.Price.isnull()]
# display(problem_rows)
3   NaN
4   NaN
5   NaN
6   NaN
Name: Price, dtype: float64
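Applied back to the asker's frame (a sketch reusing the df_sales name from the question), the same coercion mask pulls out the 13 rows counted as False directly:

import pandas as pd

# boolean mask: True where Price could not be parsed as a number
bad = pd.to_numeric(df_sales['Price'], errors='coerce').isna()
print(df_sales[bad])                         # the offending rows
print(df_sales.loc[bad, 'Price'].unique())   # the raw values that failed to parse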

pandas groupby and widen dataframe with ordered columns

I have a long-form dataframe that contains multiple samples and time points for each subject. The number of samples and time points can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})
   subject_id sample  timepoint  time_order
0           1      A         19           3
1           1      B         11           2
2           1      C          8           1
3           2      D          6           2
4           2      E          2           1
5           3      F         12           1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
   subject_id sample1  timepoint1 sample2  timepoint2 sample3  timepoint3
0           1       C           8       B          11       A          19
1           2       E           2       D           6    null        null
5           3       F          12    null        null    null        null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
                        columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back to a column:
df = (test_df.set_index(['subject_id', 'time_order'])
             .unstack()
             .sort_index(level=[1, 0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
   subject_id sample1  timepoint1 sample2  timepoint2 sample3  timepoint3
0           1       C         8.0       B        11.0       A        19.0
1           2       E         2.0       D         6.0     NaN         NaN
2           3       F        12.0     NaN         NaN     NaN         NaN
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
           sample1  timepoint1 sample2  timepoint2 sample3  timepoint3
subject_id
1                C           8       B        11.0       A        19.0
2                E           2       D         6.0     NaN         NaN
3                F          12     NaN         NaN     NaN         NaN
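Since the asker was already close with pivot, a sketch that completes that attempt along the same lines as the accepted set_index/unstack answer: pass both sample and timepoint as values, then sort and flatten the resulting MultiIndex columns (assumes a pandas version where pivot accepts a list for values):

wide = test_df.pivot(index='subject_id', columns='time_order',
                     values=['sample', 'timepoint'])
# order columns by time_order first, then by column name, and flatten the MultiIndex
wide = wide.sort_index(level=[1, 0], axis=1)
wide.columns = [f'{name}{order}' for name, order in wide.columns]
wide = wide.reset_index()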

Convert string "nan" to numpy NaN in pandas Dataframe that contains Strings [duplicate]

I am working on a large dataset with many columns of different types. There is a mix of numeric values and strings, with some NULL values. I need to change the NULL values to blank or 0 depending on the column type.
1  John  2     Doe   3  Mike  4     Orange  5   Stuff
9  NULL  NULL  NULL  8  NULL  NULL  Lemon   12  NULL
I want it to look like this,
1  John  2  Doe  3  Mike  4  Orange  5   Stuff
9        0       8        0  Lemon   12
I can do this for each individual column, but since I am going to be pulling several extremely large datasets with hundreds of columns, I'd like to do this some other way.
Edit:
Types from a smaller dataset:
Field1 object
Field2 object
Field3 object
Field4 object
Field5 object
Field6 object
Field7 object
Field8 object
Field9 object
Field10 float64
Field11 float64
Field12 float64
Field13 float64
Field14 float64
Field15 object
Field16 float64
Field17 object
Field18 object
Field19 float64
Field20 float64
Field21 int64
Use DataFrame.select_dtypes to get the numeric columns, fill their missing values with 0, then replace the missing values in all other columns with an empty string:
print (df)
   0     1    2    3  4     5    6       7   8      9
0  1  John  2.0  Doe  3  Mike  4.0  Orange   5  Stuff
1  9   NaN  NaN  NaN  8   NaN  NaN   Lemon  12    NaN
print (df.dtypes)
0 int64
1 object
2 float64
3 object
4 int64
5 object
6 float64
7 object
8 int64
9 object
dtype: object
c = df.select_dtypes(np.number).columns
df[c] = df[c].fillna(0)
df = df.fillna("")
print (df)
   0     1    2    3  4     5    6       7   8      9
0  1  John  2.0  Doe  3  Mike  4.0  Orange   5  Stuff
1  9        0.0       8        0.0   Lemon  12
Another solution is to create a dictionary for the replacement values:
num_cols = df.select_dtypes(np.number).columns
d1 = dict.fromkeys(num_cols, 0)
d2 = dict.fromkeys(df.columns.difference(num_cols), "")
d = {**d1, **d2}
print (d)
{0: 0, 2: 0, 4: 0, 6: 0, 8: 0, 1: '', 3: '', 5: '', 7: '', 9: ''}
df = df.fillna(d)
print (df)
   0     1    2    3  4     5    6       7   8      9
0  1  John  2.0  Doe  3  Mike  4.0  Orange   5  Stuff
1  9        0.0       8        0.0   Lemon  12
You could try this to substitute a different value for each different column (A to C are numeric, while D is a string):
import pandas as pd
import numpy as np
df_pd = pd.DataFrame([[np.nan, 2, np.nan, '0'],
                      [3, 4, np.nan, '1'],
                      [np.nan, np.nan, np.nan, '5'],
                      [np.nan, 3, np.nan, np.nan]],
                     columns=list('ABCD'))
df_pd.fillna(value={'A':0.0,'B':0.0,'C':0.0,'D':''})
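If, as the title suggests, the NULLs arrive as literal strings ("NULL" or "nan") rather than real missing values, they need converting to np.nan first so that fillna and select_dtypes behave as above. A minimal sketch, assuming the placeholders are exactly the strings "NULL" and "nan":

import numpy as np
import pandas as pd

df = df.replace(['NULL', 'nan'], np.nan)       # turn placeholder strings into real NaN
df = df.apply(pd.to_numeric, errors='ignore')  # re-infer numeric dtypes; non-numeric columns are left as-is
# ...then apply the select_dtypes / fillna recipe shown above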

Pandas DataFrame Apply Efficiency

I have a dataframe to which I want to add a status column indicating whether there is a matching value in another dataframe. I have the following code, which works:
df1['NewColumn'] = df1['ComparisonColumn'].apply(lambda x: 'Match' if any(df2.ComparisonColumn == x) else ('' if x is None else 'Missing'))
I know the line is ugly, but I get the impression that it's inefficient. Can you suggest a better way to make this comparison?
You can use np.where, isin, and isnull:
Create some dummy data:
np.random.seed(123)
df = pd.DataFrame({'ComparisonColumn':np.random.randint(10,20,20)})
df.iloc[4] = np.nan #Create missing data
df2 = pd.DataFrame({'ComparisonColumn':np.random.randint(15,30,20)})
Do matching with np.where:
df['NewColumn'] = np.where(df.ComparisonColumn.isin(df2.ComparisonColumn), 'Matched',
                           np.where(df.ComparisonColumn.isnull(), 'Missing', ''))
Output:
    ComparisonColumn NewColumn
0               12.0
1               12.0
2               16.0   Matched
3               11.0
4                NaN   Missing
5               19.0   Matched
6               16.0   Matched
7               11.0
8               10.0
9               11.0
10              19.0   Matched
11              10.0
12              10.0
13              19.0   Matched
14              13.0
15              14.0
16              10.0
17              10.0
18              14.0
19              11.0
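If more statuses are ever needed, np.select keeps the branching flat instead of nesting np.where calls. A sketch using the same dummy frames (an alternative, not part of the original answer):

import numpy as np

conditions = [
    df.ComparisonColumn.isin(df2.ComparisonColumn),  # value exists in df2
    df.ComparisonColumn.isnull(),                    # value is missing entirely
]
choices = ['Matched', 'Missing']
df['NewColumn'] = np.select(conditions, choices, default='')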
