How to read a particular row from a groupby dataframe? - python-3.x

I am preprocessing my data about car sales, where many used cars have a price of 0. I want to replace the 0 values with the mean price of similar kinds of cars.
Here, I have found the mean price for each kind of car with the groupby function:
df2 = df1.groupby(['car','body','year','engV'])['price'].mean()
This is my dataframe, extracted from the actual data, where the price is zero:
rep_price=df[df['price']==0]
I want to assign the mean price value from df2['Land Rover'] to rep_price['Land Rover'], where it is 0.

Since I'm not sure about the columns you have in those dataframes, I'm going to take a wild guess to give you a head start. Let's say 'Land Rover' is a value in a column called 'Car_Type' in df1, and you grouped your data like this:
df2 = df1.groupby(['Car_Type'])['price'].mean()
In that case, something like this should cover your need:
df1.loc[df1['Car_Type']=='Land Rover','price'] = df2['Land Rover']
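For the multi-column grouping in the original question (car, body, year, engV), here is a minimal sketch of the same idea using groupby + transform, assuming those column names and that a zero price should be treated as missing so it doesn't drag the group means down:
import numpy as np
# treat 0 as missing so zero prices don't lower the group means
df1['price'] = df1['price'].replace(0, np.nan)
# mean price of comparable cars, aligned back to every row of df1
group_mean = df1.groupby(['car', 'body', 'year', 'engV'])['price'].transform('mean')
# fill the formerly zero prices with the mean of their group
df1['price'] = df1['price'].fillna(group_mean)
The advantage of transform over a plain groupby().mean() is that the result is already aligned with df1, so no per-group lookup is needed.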

Related

How to drop entire record if more than 90% of features have missing value in pandas

I have a pandas dataframe called df with 500 columns and 2 million records.
I am able to drop columns that contain more than 90% of missing values.
But how can I drop an entire record in pandas if 90% or more of its columns have missing values?
I have seen a similar post for "R" but I am coding in python at the moment.
You can use df.dropna() and set the thresh parameter to the value that corresponds to 10% of your columns (the minimum number of non-NA values).
df.dropna(axis=0, thresh=50, inplace=True)
You could use isna + mean on axis=1 to find the percentage of NaN values for each row. Then select the rows where it's less than 0.9 (i.e. 90%) using loc:
out = df.loc[df.isna().mean(axis=1)<0.9]
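If you don't want to hard-code thresh=50, here is a small sketch that derives the threshold from the actual number of columns (same 10%-non-missing rule as above; the variable name min_non_na is just illustrative):
import math
# minimum number of non-NA values a row must have to be kept
min_non_na = math.ceil(df.shape[1] * 0.1)  # 50 for a 500-column frame
df = df.dropna(axis=0, thresh=min_non_na)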

Making PANDAS identify a PATTERN inside a value on a DF

I'm using Python 3.9 with Pandas and Numpy.
Every day I receive a df with orders from the company I work for. Each day, this df comes from a different country whose language I don't know, and these dataframes don't have a consistent layout. In this case, I don't know the column names or the index.
I just know that the orders follow a pattern: 3 numbers + 2 letters, like 000AA, 149KL, 555EE, etc.
I saw that this is possible with plain strings, but with pandas I have only found commands that need the name of the column.
df.column_name.str.contains(pat=r'\d\d\d\w\w', regex=True)
If I can find the column whose values all follow this pattern, I know which column holds the orders.
I started with a synthetic data set:
import pandas
df = pandas.DataFrame([{'a': 3, 'b': 4, 'c': '222BB', 'd': '2asf'},
                       {'a': 2, 'b': 1, 'c': '111AA', 'd': '942'}])
I then cycle through each column. If the datatype is object, then I test whether all the elements in the Series match the regex
for column_id in df.columns:
    if df[column_id].dtype == 'object':
        if all(df[column_id].str.contains(pat=r'\d\d\d\w\w', regex=True)):
            print("matching column:", column_id)

How can I groupby rows by the columns in which they actually possess a data point?

I don't even know if groupby is the correct function to use for this. It's a bit hard to understand, so I'll include a screenshot of my dataframe: screenshot
Basically, this dataframe has way too many columns because each column is specific to only one or a few rows. You can see in the screenshot that the first few columns are specific to the first row and the last few columns are specific to the last row. I want to make it so that each row only has the columns that actually pertain to it. I've tried several methods using groupby('equipment name') and several methods using dropna, but none works the way I need it to. I'm also open to separating it into multiple dataframes.
Any method is acceptable; this has been driving me crazy. It took me a while to get to this point because this started out as an unintelligible 10,000-line JSON. I'm pretty new to programming as well.
This is a very cool answer that could be one option - and it does use groupby, so sorry for dismissing it! This will group your data into DataFrames where each DataFrame has a unique set of columns, and any row which only contains values for those columns will be in that DataFrame. If your data are such that there are multiple groups of rows which share the exact same columns, this solution is ideal, I think.
Just to note, though, if your null values are more randomly spread out throughout the dataset, or if one row in a group of rows is missing a single entry (compared to related rows), you will end up with more combinations of unique non-null columns, and then more output DataFrames.
There are also (in my opinion) nice ways to search a DataFrame, even if it is very sparse. You can check the non-null values for a row:
df.loc[index_name].dropna()
Or for an index number:
df.iloc[index_number].dropna()
You could further store these values, say in a dictionary (this is a dictionary of Series, but each could be converted to a DataFrame):
row_dict = {row : df.loc[row].dropna() for row in df.index}
I could imagine some scenarios where something based off these options is more helpful for searching. But that linked answer is slick, I would try that.
EDIT: Expanding on the answer above based on comments with OP.
The dictionary created in the linked post contains the DataFrames. Basically, you can use this dictionary to do comparisons with the original source data. My only issue with that answer was that it may be hard to search the dictionary if the column names are janky (as they look to be in your data), so here's a slight modification:
d = {}  # dictionary of DataFrames, keyed 'df0', 'df1', ...
for i, (name, sub_df) in enumerate(df.groupby(df.isnull().dot(df.columns))):
    d['df' + str(i)] = sub_df.dropna(axis=1)
Now the dictionary keys are "df#", and the values are the DataFrames. So if you wanted to inspect the contents of one DataFrame, you can call:
d['df1'].head()
#OR
print(d['df0'])
If you wanted to look at all the DataFrames, you could call:
for df in d.values():
    print(df.head())  # you can also pass an integer to head to show more rows than 5
Or if you wanted to save each DataFrame you could call:
for name in sorted(d.keys()):
    d[name].to_csv('path/to/file/' + name + '.csv')
The point is, you've gotten to a data structure where you can look at the original data, separated into DataFrames without missing data. Joining these back into a single DataFrame would be redundant, as it would create a single DataFrame (equal to the original) or multiple with some amount of missing data.
I think it comes down to what you are looking for and how you need to search the data. You could rename the dictionary keys / output .CSV files based on the types of machinery inside, for example.
I thought your last comment might mean that objects of similar type might not share the same columns; say, for example, if not all "Exhaust Fans" have the same columns, they will end up in different DataFrames in the dictionary. This may be the type of case where it is easier to just look at individual rows, rather than grouping them into weird categories:
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
You could again save these DataFrames as CSV files or look at them one by one (or, e.g., search for Exhaust Fans by seeing if "Exhaust" is in the key). You could also print them all at once:
import pandas as pd
import numpy as np
import natsort
#making some randomly sparse data
columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(100)]
df = pd.DataFrame(np.random.rand(100,10), columns=columns,index=index)
df[df<.7] = np.nan
#creating the dictionary where each key is a row name
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
#printing all the output
for key in natsort.natsorted(df_dict.keys())[:5]: #using [:5] to limit output
    print(df_dict[key], '\n')
Out[1]:
Column 1 Column 4 Column 7 Column 9 Column 10
Row 1 0.790282 0.710857 0.949141 0.82537 0.998411
Column 5 Column 8 Column 10
Row 2 0.941822 0.722561 0.796324
Column 2 Column 4 Column 5 Column 6
Row 3 0.8187 0.894869 0.997043 0.987833
Column 1 Column 7
Row 4 0.832628 0.8349
Column 1 Column 4 Column 6
Row 5 0.863212 0.811487 0.924363
Instead of printing, you could write the output to a text file; maybe that's the type of document that you could look at (and search) to compare to the input tables. But note that even though the printed data are tabular, they can't be made into a single DataFrame without accepting that there will be missing data for rows which don't have entries for all columns.
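For the write-it-to-a-text-file idea, here is a minimal sketch reusing df_dict and the natsort import from the snippet above (the file name is just a placeholder):
# write every per-row DataFrame into one searchable text file
with open('rows_report.txt', 'w') as f:
    for key in natsort.natsorted(df_dict.keys()):
        f.write(df_dict[key].to_string() + '\n\n')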

Splitting DataFrame based on index

I have a DataFrame which I want to split into three DataFrames based on the string properties of the index. The index consists of IDs with the first two letters indicating the country, e.g.
DE1
UK4
US5
DE2
UK1
US3
I want three DataFrames with indexes being:
DE1
DE2
UK1
UK4
US3
US5
This seems promising:
df.groupby(df.index.str[:2]).groups
But I don't know how to use it to solve my problem...
Here is one way to do it:
split = []
for value in df.index.str[:2].unique().values:
    split.append(df[df.index.str[:2] == value])
We first compute the unique country codes from the first two letters of the index. Then we loop over them and index into the data frame with each code. Here, I just append each resulting DataFrame to a list.
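Since the question already found df.groupby(df.index.str[:2]), a hedged alternative is to build a dictionary of DataFrames keyed by country code straight from that groupby (country_frames is just an illustrative name):
# one DataFrame per two-letter prefix, e.g. country_frames['DE'], country_frames['UK']
country_frames = {code: sub_df for code, sub_df in df.groupby(df.index.str[:2])}
This scans the index once and gives you named access by country code instead of positional access into a list.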

How to aggregate a boolean column in Pandas Dataframe in a Pivot Table as percentage

I have a Pandas Dataframe. Among its many columns are ID, which is Boolean, Quarter, which gives year and quarter (e.g. 2016Q1) and State (e.g. TX, CA), so it would look something like:
id Quarter State
True 15Q1 AZ
False 17Q1 WY
True 14Q2 NH
False 15Q1 AZ
I'm trying to build a pivot table with ID as the value, State as the index, and quarter as the columns. I'd like to use np.mean as the agg_func but I get DataError: No numeric types to aggregate
It displays right when I use count as the aggregate function instead. And when I aggregate the total with np.mean(df['id']), I get .64, which is exactly the type of output I'm looking for, except more aggregated instead of granular. So why does np.mean work there, but not when I use it as the aggregating function in a pivot table? How do I get it to work?
I think I could convert the True's and False's to 1's and 0's, but I'd prefer not to, since I actually have a lot of 'id' columns I'm looking to aggregate this way.
EDIT: So it's an issue that only pops up with my full dataset, and not the toy dataset I used as an example. I played around some more and the ValueError: No objects to concatenate still pops up if I do a groupby with mean as the aggregating function on 'Year' or 'State'. It even pops up when I try df['id'].describe()
Has anyone ever encountered an issue like this before?
Your desired output is not very clear, but this is what I think you need:
pd.pivot_table(df, index='State', columns='Quarter', values = 'id', aggfunc='mean')
You get
Quarter 14Q2 15Q1 17Q1
State
AZ NaN 0.5 NaN
NH 1.0 NaN NaN
WY NaN NaN 0.0
You can pass the parameter fill_value=0 to pivot_table to replace the NaNs with 0.
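Even though you said you'd prefer not to convert, if the full dataset raises the DataError because the id column is stored as object dtype rather than a real bool, one possible workaround (a sketch, assuming the column actually holds Python booleans) is to cast it to integers before pivoting; the mean of 1s and 0s is exactly the fraction of True values:
# True -> 1, False -> 0, so 'mean' gives the share of True values per cell
df['id'] = df['id'].astype(bool).astype(int)
pd.pivot_table(df, index='State', columns='Quarter', values='id',
               aggfunc='mean', fill_value=0)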
