Splitting dataframe based on multiple column values - python-3.x

I have a dataframe with 1M+ rows. A sample of the dataframe is shown below:
df
ID Type File
0 123 Phone 1
1 122 Computer 2
2 126 Computer 1
I want to split this dataframe based on Type and File. If the total count of Type is 2 (Phone and Computer), total number of files is 2 (1,2), then the total number of splits will be 4.
In short, total splits is as given below:
total_splits=len(set(df['Type']))*len(set(df['File']))
In this example, total_splits=4. Now, I want to split the dataframe df in 4 based on Type and File.
So the new dataframes should be:
df1 (having data of type=Phone and File=1)
df2 (having data of type=Computer and File=1)
df3 (having data of type=Phone and File=2)
df4 (having data of type=Computer and File=2)
The splitting should be done inside a loop.
I know we can split a dataframe based on one condition (shown below), but how do you split it based on two ?
My Code:
import pandas as pd

data = {'ID': ['123', '122', '126'], 'Type': ['Phone', 'Computer', 'Computer'], 'File': [1, 2, 1]}
df = pd.DataFrame(data)
types = list(set(df['Type']))
total_splits = len(set(df['Type'])) * len(set(df['File']))
cnt = 1
for i in range(0, total_splits):
    for j in types:
        locals()["df" + str(cnt)] = df[df['Type'] == j]
        cnt += 1
The result of the above code gives 2 dataframes, df1 and df2. df1 will have data of Type='Phone' and df2 will have data of Type='Computer'.
But this is just half of what I want to do. Is there a way we can make 4 dataframes here based on 2 conditions ?
Note: I know I can first split on 'Type' and then split the resulting dataframe based on 'File' to get the output. However, I want to know of a more efficient way of performing the split instead of having to create multiple dataframes to get the job done.
EDIT
This is not a duplicate question as I want to split the dataframe based on multiple column values, not just one!

You can make do with groupby:
dfs = {}
for k, d in df.groupby(['Type', 'File']):
    type_, file_ = k
    # do whatever you want here;
    # d is the dataframe corresponding to (type_, file_)
    dfs[k] = d
You can also create a mask:
df['mask'] = df['File'].eq(1) * 2 + df['Type'].eq('Phone')
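Then, for example:
df[df['mask'].eq(3)]
gives you the first dataframe you want, i.e. Type==Phone and File==1 (the mask adds 2 when File==1 and 1 when Type==Phone), and so on: 2 is Computer with File==1, 1 is Phone with any other File, and 0 is Computer with any other File. Note that this encoding only yields the four groups when each column has exactly two distinct values.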
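As a minimal sketch of the groupby approach on the sample data from the question: groupby only yields the (Type, File) pairs that actually occur, so the second dictionary below (the names splits, all_keys and splits_full are made up for illustration) fills the missing combinations with empty frames in case all four splits are needed.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["123", "122", "126"],
    "Type": ["Phone", "Computer", "Computer"],
    "File": [1, 2, 1],
})

# One sub-DataFrame per (Type, File) pair that actually occurs in the data
splits = {key: part for key, part in df.groupby(["Type", "File"])}

# Every possible combination of the distinct Type and File values
all_keys = [(t, f) for t in df["Type"].unique() for f in df["File"].unique()]

# Combinations without rows get an empty frame with the same columns
splits_full = {key: splits.get(key, df.iloc[0:0]) for key in all_keys}

for key, part in splits_full.items():
    print(key, len(part))
```

On the sample data, three of the four combinations contain rows; there is no Phone row with File 2, so that entry is empty.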

Related

How to find certain value similar to vlookup in excel? And how to create dataframe with for loop?

I have two dataframes both have 2 columns (2 variables).
I was thinking that I could use something similar to vlookup in excel?
And maybe I could create a new dataset using for loop and put the quotients in this dataset I don't know how exactly I could do that.
(I ALSO NEED TO PUT THE VALUES IN A DATASET so the suggested post does not answer my question completely)
example:
dataframe1
number amount
1 2
2 3
3 4
dataframe2
number amount
1 5
2 6
4 2
Assuming that you imported dataframe1 as Dataframe1 and dataframe2 as Dataframe2, that both are data.frames, and that they have the columns number and amount shown in the example:
library(tidyverse)
Dataframe1 %>%
  inner_join(Dataframe2 %>% rename(amount2 = amount),
             by = "number") -> Dataframe
At this point you can perform your operation
Dataframe %>%
  mutate(result = amount / amount2) -> Dataframe
and check if the column result is what you were looking for.
To find the highest ratio:
Dataframe$result %>% max(na.rm = TRUE)
But there are many other ways to record this value; this is the most straightforward.

How can I groupby rows by the columns in which they actually possess a data point?

I don't even know if groupby is the correct function to use for this. It's a bit hard to understand so I'll include a screenshot of my dataframe: screenshot
Basically, this dataframe has way too many columns because each column is specific to only one or a few rows. You can see in the screenshot that the first few columns are specific towards the first row and the last few columns are specific to the last row. I want to make it so that each row only has the columns that actually pertain to it. I've tried several methods of using groupby('equipment name') and several methods using dropna but none work in the way I need it to. I'm also open to separating it into multiple dataframes.
Any method is acceptable, this bug has been driving me crazy. It took me a while to get to this point because this started out as an unintelligible 10,000 line json. I'm pretty new to programming as well.
This is a very cool answer that could be one option - and it does use groupby so sorry for dismissing!!! This will group your data into DataFrames where each DataFrame has a unique group of columns, and any row which only contains values for those columns will be in that DataFrame. If your data are such that there are multiple groups of rows which share the exact same columns, this solution is ideal I think.
Just to note, though, if your null values are more randomly spread out throughout the dataset, or if one row in a group of rows is missing a single entry (compared to related rows), you will end up with more combinations of unique non-null columns, and then more output DataFrames.
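As a rough sketch of the grouping described above (not the linked answer's exact code): build one key per row from the names of its non-null columns, group on that key, and drop the columns that are empty within each group. The equipment names and column names here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse data: fans and pumps use disjoint sets of columns
df = pd.DataFrame(
    {
        "fan_speed": [1200.0, 1500.0, np.nan, np.nan],
        "fan_power": [0.5, 0.7, np.nan, np.nan],
        "pump_flow": [np.nan, np.nan, 30.0, 45.0],
    },
    index=["Exhaust Fan 1", "Exhaust Fan 2", "Pump 1", "Pump 2"],
)

# One key per row: the names of the columns that hold a value in that row
pattern = df.notna().apply(
    lambda row: ", ".join(df.columns[row.to_numpy()]), axis=1
)

# One DataFrame per distinct pattern, keeping only the columns it uses
groups = {
    key: part.dropna(axis=1, how="all")
    for key, part in df.groupby(pattern)
}

for key, part in groups.items():
    print(key)
    print(part, "\n")
```

If a single row were missing one value that its neighbours have, it would get its own key and therefore its own DataFrame, which is the caveat mentioned above.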
There are also (in my opinion) nice ways to search a DataFrame, even if it is very sparse. You can check the non-null values for a row:
df.loc[index_name].dropna()
Or for an index number:
df.iloc[index_number].dropna()
You could further store these values, say in a dictionary (this is a dictionary of Series, but could be converted to a DataFrame):
row_dict = {row : df.loc[row].dropna() for row in df.index}
I could imagine some scenarios where something based off these options is more helpful for searching. But that linked answer is slick, I would try that.
EDIT: Expanding on the answer above based on comments with OP.
The dictionary created in the linked post contains the DataFrames. Basically, you can use this dictionary to do comparisons with the original source data. My only issue with that answer was that it may be hard to search the dictionary if the column names are janky (as it looks like in your data), so here's a slight modification:
d = {}
for i, (name, group) in enumerate(df.groupby(df.isnull().dot(df.columns))):
    d['df' + str(i)] = group.dropna(axis=1)
Now the dictionary keys are "df#", and the values are the DataFrames. So if you wanted to inspect the content of one DataFrame, you can call:
d['df1'].head()
#OR
print(d['df0'])
If you wanted to look at all the DataFrames, you could call
for frame in d.values():
    print(frame.head()) #you can also pass an integer to head to show more rows than 5
Or if you wanted to save each DataFrame you could call:
for name in sorted(d.keys()):
    d[name].to_csv('path/to/file/' + name + '.csv')
The point is, you've gotten to a data structure where you can look at the original data, separated into DataFrames without missing data. Joining these back into a single DataFrame would be redundant, as it would create a single DataFrame (equal to the original) or multiple with some amount of missing data.
I think it comes down to what you are looking for and how you need to search the data. You could rename the dictionary keys / output .CSV files based on the types of machinery inside, for example.
I thought your last comment might mean that objects of similar type might not share the same columns; say, for example, if not all "Exhaust Fans" have the same columns, they will end up in different DataFrames in the dictionary. This may be the type of case where it is easier to just look at individual rows, rather than grouping them into weird categories:
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
You could then again save these DataFrames as CSV files or look at them one by one (or e.g. search for Exhaust Fans by seeing if "Exhaust" is in the key). You could also print them all at once:
import pandas as pd
import numpy as np
import natsort
#making some randomly sparse data
columns = ['Column ' + str(i+1) for i in range(10)]
index = ['Row ' + str(i+1) for i in range(100)]
df = pd.DataFrame(np.random.rand(100,10), columns=columns,index=index)
df[df<.7] = np.nan
#creating the dictionary where each key is a row name
df_dict = {row : pd.DataFrame(df.loc[row].dropna()).transpose() for row in df.index}
#printing all the output
for key in natsort.natsorted(df_dict.keys())[:5]: #using [:5] to limit output
    print(df_dict[key], '\n')
Out[1]:
Column 1 Column 4 Column 7 Column 9 Column 10
Row 1 0.790282 0.710857 0.949141 0.82537 0.998411
Column 5 Column 8 Column 10
Row 2 0.941822 0.722561 0.796324
Column 2 Column 4 Column 5 Column 6
Row 3 0.8187 0.894869 0.997043 0.987833
Column 1 Column 7
Row 4 0.832628 0.8349
Column 1 Column 4 Column 6
Row 5 0.863212 0.811487 0.924363
Instead of printing, you could write the output to a text file; maybe that's the type of document that you could look at (and search) to compare to the input tables. But note that even though the printed data are tabular, they can't be made into a DataFrame without accepting that there will be missing data for rows which don't have entries for all columns.

Build Pandas DataFrame with String Entries using 2 Separate DataFrames

Suppose you have two separate pandas DataFrames with the same row and column indices (in my case, the column indices were constructed by .unstack()'ing a MultiIndex built using df.groupby([col1,col2]))
df1 = pd.DataFrame({'a':[.01,.02,.03],'b':[.04,.05,.06]})
df2 = pd.DataFrame({'a':[.04,.05,.06],'b':[.01,.02,.03]})
Now suppose I would like to create a 3rd DataFrame, df3, where each entry of df3 is a string which uses the corresponding element-wise entries of df1 and df2. For example,
df3.iloc[0,0] = '{:.0%}'.format(df1.iloc[0,0]) + '\n' + '{:.0%}'.format(df2.iloc[0,0])
I recognize this is probably easy enough to do by looping over all entries in df1 and df2 and creating a new entry in df3 based on these values (which can be slow for large DataFrames), or even by joining the two DataFrames together (which may require renaming columns), but I am wondering if there is a more pythonic / pandorable way of accomplishing this, possibly using applymap or some other built-in pandas function?
The question is similar to Combine two columns of text in dataframe in pandas/python but the previous question does not consider combining multiple DataFrames into a single.
IIUC, you just need to add df1 and df2 with '\n' in between:
df3 = df1.astype(str) + '\n' + df2.astype(str)
Out[535]:
a b
0 0.01\n0.04 0.04\n0.01
1 0.02\n0.05 0.05\n0.02
2 0.03\n0.06 0.06\n0.03
You can make use of the vectorized operations of Pandas (given that the dataframes share row and column index)
(df1 * 100).astype(str) + '%\n' + (df2 * 100).astype(str) + '%'
You get
a b
0 1.0%\n4.0% 4.0%\n1.0%
1 2.0%\n5.0% 5.0%\n2.0%
2 3.0%\n6.0% 6.0%\n3.0%
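If the exact '{:.0%}' formatting from the question is wanted, one possible sketch is to format each frame column by column with Series.map and then concatenate the resulting string frames. The helper name as_percent is made up for illustration, and this assumes, as in the question, that both frames share the same row and column labels.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [.01, .02, .03], "b": [.04, .05, .06]})
df2 = pd.DataFrame({"a": [.04, .05, .06], "b": [.01, .02, .03]})


def as_percent(frame):
    """Format every value as a whole-number percentage string."""
    # Series.map applies the formatter to each value, one column at a time
    return frame.apply(lambda column: column.map("{:.0%}".format))


# Element-wise string concatenation, aligned on index and columns
df3 = as_percent(df1) + "\n" + as_percent(df2)

print(repr(df3.iloc[0, 0]))
```

The formatting step still calls a Python function per cell, so it is not faster than the astype(str) approach; it only gives control over the number format.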

PySpark: Update column values for a given number of rows of a DataFrame

I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
vals = [
Row(ID=1,VAL=None),
Row(ID=2,VAL=None),
Row(ID=3,VAL=None),
Row(ID=4,VAL=None),
Row(ID=5,VAL=None),
Row(ID=6,VAL=None),
Row(ID=7,VAL=None),
Row(ID=8,VAL=None),
Row(ID=9,VAL=None),
Row(ID=10,VAL=None)
]
df = spark.createDataFrame(vals)
Now let's say I want to update the VAL column for 3 rows with the value "lets", 3 rows with the value "bucket" and 4 rows with the value "this".
Is there a straightforward way of doing this in PySpark?
Note: ID values are not necessarily consecutive, and the bucket distribution is not necessarily even.
I'll try to explain the idea with some pseudo-code that you can map to your solution.
Using a window function over a single partition, we can generate a sequential row_number() for each row in the dataframe and store it in, say, a column row_num.
Next, your "rules" can be represented as another small dataframe: [min_row_num, max_row_num, label].
All you need is to join those two datasets on the row number, adding the new column (here df1 is the numbered dataframe and df2 holds the rules):
from pyspark.sql.functions import col

(df1.alias('df1')
    .join(df2.alias('df2'),
          on=col('df1.row_num').between(col('df2.min_row_num'), col('df2.max_row_num')))
    .select('df1.*', 'df2.label'))

How to t-test by group in a pandas dataframe?

I have quite a huge pandas dataframe with many columns. The dataframe contains two groups. It is basically setup as follows:
import pandas as pd
csv = [{"air" : 0.47,"co2" : 0.43 , "Group" : 1}, {"air" : 0.77,"co2" : 0.13 , "Group" : 1}, {"air" : 0.17,"co2" : 0.93 , "Group" : 2} ]
df = pd.DataFrame(csv)
I want to perform a paired t-test on air and co2, thereby comparing the two groups Group = 1 and Group = 2.
I have many more columns than just air and co2; hence, I would like to find a procedure that works for all columns in the dataframe. I believe I could use scipy.stats.ttest_rel together with pd.groupby or apply. How would that work? Thanks in advance /R
I would use pandas dataframe.where method.
group1_air = df.where(df.Group== 1).dropna()['air']
group2_air = df.where(df.Group== 2).dropna()['air']
This bit of code stores in group1_air all the values of the air column where the Group column is 1, and in group2_air all the values of air where Group is 2.
The dropna() is required because the .where method returns NaN for every row in which the specified condition is not met. So all rows where Group is 2 come back with NaN values when you use df.where(df.Group== 1).
Whether you need to use scipy.stats.ttest_rel or scipy.stats.ttest_ind depends on your groups. If your samples are from independent groups you should use ttest_ind; if your samples are from related groups you should use ttest_rel.
So if your samples are independent of one another, your final piece of required code is:
scipy.stats.ttest_ind(group1_air,group2_air)
else you need to use
scipy.stats.ttest_rel(group1_air,group2_air)
When you want to also test co2 you simply need to change air for co2 in the given example.
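To make the difference between the two tests concrete, here is a small self-contained sketch. The data are invented (three rows per group instead of the sample above) so that both tests are defined, and boolean indexing with .loc is used as an alternative to where/dropna.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: three measurements per group so both tests are defined
df = pd.DataFrame({
    "air": [0.47, 0.77, 0.52, 0.17, 0.21, 0.12],
    "Group": [1, 1, 1, 2, 2, 2],
})

# Boolean indexing selects the rows of each group directly
group1_air = df.loc[df["Group"] == 1, "air"]
group2_air = df.loc[df["Group"] == 2, "air"]

# Independent samples: the two groups may differ in size
independent = stats.ttest_ind(group1_air, group2_air)

# Paired samples: observation i of group 1 is matched with observation i
# of group 2, so both groups must have the same length
paired = stats.ttest_rel(group1_air.to_numpy(), group2_air.to_numpy())

print("independent:", independent.statistic, independent.pvalue)
print("paired:", paired.statistic, paired.pvalue)
```

Because the paired test matches observations one to one, it cannot be applied to the three-row sample from the question, where group 1 has two rows and group 2 has one.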
Edit:
This is a rough sketch of the code you should run to execute ttests over every column in your dataframe except for the group column. You may need to tamper a bit with the column_list to get it completely compliant with your needs (you may not want to loop over every column for example).
import scipy.stats

# get a list of all columns in the dataframe without the Group column
column_list = [x for x in df.columns if x != 'Group']
# create an empty dictionary
t_test_results = {}
# loop over column_list and execute code explained above
for column in column_list:
    group1 = df.where(df.Group == 1).dropna()[column]
    group2 = df.where(df.Group == 2).dropna()[column]
    # add the output to the dictionary
    t_test_results[column] = scipy.stats.ttest_ind(group1, group2)
results_df = pd.DataFrame.from_dict(t_test_results, orient='index')
results_df.columns = ['statistic', 'pvalue']
At the end of this code you have a dataframe with the output of the t-test for every column you looped over.
