How to combine two rows of same dataset side by side in python? - python-3.x

I have a dataset and I want to combine its first two rows into a single row. The original dataset is very big, but here is a small example.
df
one two three
0 T H A
1 N K S
2 F O R
3 H L P
After combining the first two rows it should look like this:
df
one two three one two three
0 T H A N K S
I'm very new to StackOverflow and started my career recently in python. If my question is not formatted correctly please suggest edits. Thanks.

You can use df.iloc to get two slices of the dataframe, one for the even rows and another for the odd rows, then pd.concat(..., axis=1) to put them back together side by side.
Note that pd.concat will try to align the input dataframes on their index (i.e. 0, 1, 2, 3); if one of them has no data for a particular index, it fills with null values. So we need reset_index to get the desired output.
import pandas as pd

df = pd.concat(
    [
        df.iloc[::2].reset_index(drop=True),   # even rows
        df.iloc[1::2].reset_index(drop=True),  # odd rows
    ],
    axis=1,
)
Output
one two three one two three
0 T H A N K S
1 F O R H L P
You can read more about pd.concat in this answer and, of course, in the user guide.
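To see why reset_index is needed, here is a minimal self-contained sketch (the sample frame is rebuilt from the question); without it, concat aligns the slices on their surviving index labels and pads with NaN:
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({'one': list('TNFH'), 'two': list('HKOL'), 'three': list('ASRP')})

# Without reset_index the even slice keeps labels 0, 2 and the odd slice
# keeps 1, 3; concat aligns on those labels and fills the rest with NaN.
print(pd.concat([df.iloc[::2], df.iloc[1::2]], axis=1))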

Related

Filter Dataframe by comparing one column to list of other columns

I have a dataframe with numerous float columns. I want to filter the dataframe, leaving only the values that fall between the High and Low columns of the same dataframe.
I know how to do this when the conditions are one column compared to another column. But there are 102 columns, so I cannot write a condition for each column. And all my research just illustrates how to compare two columns and not one column against all others (or I am not typing the right search terms).
I tried df= df[ (df['High'] <= df[DFColRBs]) & (df['Low'] >= df[DFColRBs])].copy() But it erases everything.
and I tried booleanselction = df[ (df[DFColRBs].between(df['High'],df['Low'])]
and I tried: df= df[(df[DFColRBs].ge(df['Low'])) & (df[DFColRBs].le(df['Low']))].copy()
and I tried:
BoolMatrix = (df[DFColRBs].ge(DF_copy['Low'], axis=0)) & (df[DFColRBs].le(DF_copy['Low'], axis=0))
df= df[BoolMatrix].copy()
But it erases everything in the dataframe, even the 3 columns that are not included in the list.
I appreciate the guidance.
Example Dataframe:
High Low Close _1m_21 _1m_34 _1m_55 _1m_89 _1m_144 _1m_233 _5m_21 _5m_34 _5m_55
0 1.23491 1.23456 1.23456 1.23401 1.23397 1.23391 1.2339 1.2337 1.2335 1.23392 1.23363 1.23343
1 1.23492 1.23472 1.23472 1.23422 1.23409 1.234 1.23392 1.23375 1.23353 1.23396 1.23366 1.23347
2 1.23495 1.23479 1.23488 1.23454 1.23422 1.23428 1.23416 1.23404 1.23372 1.23415 1.234 1.23367
3 1.23494 1.23472 1.23473 1.23457 1.23425 1.23428 1.23417 1.23405 1.23373 1.23415 1.234 1.23367
Based on what you've said in the comments, it's best to split the df into the pieces you want to operate on and the ones you don't, then use matrix operations.
tmp_df = DF_copy.iloc[:, 3:].copy()
# or tmp_df = DF_copy[DFColRBs].copy()
# mask by comparing test columns with the high and low columns
m = tmp_df.le(DF_copy['High'], axis=0) & tmp_df.ge(DF_copy['Low'], axis=0)
# combine the masked df with the original cols
DF_copy2 = pd.concat([DF_copy.iloc[:, :3], tmp_df.where(m)], axis=1)
# or replace DF_copy.iloc[:, :3] with DF_copy.drop(columns=DFColRBs)
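For reference, a minimal runnable sketch of the same approach with made-up numbers; DFColRBs is assumed to be the list of test-column names, as in the question:
import pandas as pd

# Hypothetical mini-version of the data.
DF_copy = pd.DataFrame({
    'High':   [1.23491, 1.23492],
    'Low':    [1.23456, 1.23472],
    'Close':  [1.23456, 1.23472],
    '_1m_21': [1.23401, 1.23480],
    '_1m_34': [1.23470, 1.23397],
})
DFColRBs = ['_1m_21', '_1m_34']

tmp_df = DF_copy[DFColRBs].copy()
m = tmp_df.le(DF_copy['High'], axis=0) & tmp_df.ge(DF_copy['Low'], axis=0)
DF_copy2 = pd.concat([DF_copy.drop(columns=DFColRBs), tmp_df.where(m)], axis=1)
print(DF_copy2)  # out-of-range values become NaN instead of whole rows vanishing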

Combine time series data for boxplot from multiple csv files

I have multiple time series data in csv files from NetLogo model runs. I would like to join those series into one dataframe so that I can do a boxplot to see variations across different simulation runs. The x values in each csv are the time iterations (integers); the y values are the values of a particular measure in the model, e.g., population count.
I can join the csvs with concat, but there are repeated column names for the y variables. My thought is to combine columns with the same name into one column holding a list of numbers (the y values). Then I can pass that x, y to boxplot to plot that variable across time with its variations (median, etc.). The data is of the form:
x population groups color
0 0 0.00 0.00 0.00
1 1 74.47 42.48 40.96
2 2 74.46 42.48 40.96
would become
x population groups color
0 0 [0.00, 1.2] [0.00, 5] [0.00, 4]
1 1 [74.47, 3.2] [42.48, 55] [40.96, 55]
2 2 [74.46, NaN] [42.48, NaN] [40.96, NaN]
There are multiples of this dataframe from different csv files (thousands). The x axis value can have a different maximum time value for different runs / csvs.
How do I combine the dataframes such that I get one dataframe with a list of y values for a given y column at each x value? There will be NaNs for some y values for runs that ended early. Note that there are multiple y columns, and that each column is a separate boxplot (overlaid on the same plot).
I have tried concat, join, and merge, but have not been able to convert multiple columns with the same or different names into one column with a list of values rather than a single value.
Or, is there even a better way to do what I want to do with the data?
The answer ended up being simpler than I expected. Insight into how to do this came from this answer.
Make a list of the time series dataframes: dl = [d1, d2, d3, ...]
Concatenate the dataframes: dn = pd.concat(dl, axis=1)
Create a new column with the list of values:
dn['new'] = dn['data column name'].values.tolist()
This generates the new column with the list of values that I can now use to make a box plot.
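For illustration, a minimal sketch of those three steps with made-up runs (d1, d2, d3 and the column name 'population' are hypothetical):
import pandas as pd

# Three hypothetical runs; d2 ended early, so it is shorter.
d1 = pd.DataFrame({'population': [0.00, 74.47, 74.46]})
d2 = pd.DataFrame({'population': [1.2, 3.2]})
d3 = pd.DataFrame({'population': [0.5, 2.0, 4.1]})

dl = [d1, d2, d3]
dn = pd.concat(dl, axis=1)   # aligns on the index (x); short runs get NaN

# dn['population'] selects all duplicate columns at once, so values.tolist()
# yields one list of y values per row (per x value).
dn['population_list'] = dn['population'].values.tolist()
print(dn['population_list'])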

I want to remove rows where a specific value doesn't increase. Is there a faster/more elegant way?

I have a dataframe with 30 columns, 1,000,000 rows, and about 150 MB in size. One column is categorical with 7 different elements and another column (Depth) contains mostly increasing numbers. The graph for each of the elements looks more or less like this.
I tried to save the column Depth as a series and iterate through it while dropping rows that don't match the criteria. This was really slow.
Afterwards I added a boolean column to the dataframe to indicate whether each row should be dropped, so I could drop the rows at the end in a single step. Still slow. My last try (the code for it is in this post) was to build a boolean list recording whether each row passes the criteria. Still really slow (about 5 hours).
dropList = [True] * len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element'] == element]['Depth'].index.min()
    maxIdx = df.loc[df['Element'] == element]['Depth'].index.max()
    for x in range(minIdx, maxIdx):
        if df.loc[df['Element'] == element]['Depth'][x] < currentMax:
            dropList[x] = False
        else:
            currentMax = df.loc[df['Element'] == element]['Depth'][x]
df: The main dataframe
elements: a list with the 7 different elements (same as in the categorical column in df)
Within each element, all rows whose Depth value isn't bigger than every previous one should be dropped. With the next element it should start from 0 again.
Example:
Input: 'Depth' = [0 1 2 3 4 2 3 5 6]
'AnyOtherColumn' = [a b c d e f g h i]
Output: 'Depth' = [0 1 2 3 4 5 6]
'AnyOtherColumn' = [a b c d e h i]
This should apply to whole rows in the dataframe of course.
Is there a way to get this faster?
EDIT:
The whole rows of the input dataframe should stay as they are. Just the ones where the 'Depth' does not increase should be dropped.
EDIT2:
The remaining rows should stay in their initial order.
How about a 2-step approach? First use a fast sorting algorithm (for example quicksort), and then get rid of all the duplicates.
Okay, I found a way that's faster. Here is the code:
from tqdm import tqdm

dropList = [True] * len(df.index)
for element in elements:
    currentMax = 0
    # 'Tiefe' is the Depth column ('Tiefe' is German for depth)
    minIdx = df.loc[df['Element'] == element]['Tiefe'].index.min()
    # maxIdx = df.loc[df['Element'] == element]['Tiefe'].index.max()
    elementList = df.loc[df['Element'] == element]['Tiefe'].to_list()
    for x in tqdm(range(len(elementList))):
        if elementList[x] < currentMax:
            dropList[x + minIdx] = False
        else:
            currentMax = elementList[x]
I took the column and saved it as a list. To preserve the dataframe's index, I saved the lowest index and add it back inside the loop.
Overall it seems the problem was the loc function. From an initial runtime of 5 hours, it's now about 10 seconds.
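As an aside, a fully vectorized sketch (not from the original post) that mirrors the same logic with groupby and cummax, keeping ties just like the loop does:
import pandas as pd

# Hypothetical single-element frame reusing the question's example values.
df = pd.DataFrame({
    'Element': ['A'] * 9,
    'Depth':   [0, 1, 2, 3, 4, 2, 3, 5, 6],
    'Other':   list('abcdefghi'),
})

# Keep a row only if its Depth matches the running per-element maximum.
keep = df['Depth'] >= df.groupby('Element')['Depth'].cummax()
print(df[keep])  # drops the rows f and g, as in the question's example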

pandas - grouping values by pair of columns and pivoting

Been struggling to think what to do here; pivoting, melting and whatnot don't seem to be working out. I was trying to join the names of the to/from destinations together and then re-order the combined names, but it was a total mess.
My data concerns flows from one location to another, it's in the format:
pd.DataFrame(columns=['from_location', 'to_location', 'flow'], data=[['a', 'b', 1], ['b', 'a', 3]])
from_location to_location flow
0 a b 1
1 b a 3
but my output needs to be the format:
pd.DataFrame(columns=['connection', 'flow', 'back flow', 'net'], data=[['a -> b', 1, 3, 2]])
connection flow back flow net
0 a -> b 1 3 2
Any nice built-in functions that can rearrange things like this? I'm not even sure what keywords to search for.
Use:
import numpy as np

# df = df.sort_values(['from_location','to_location'])
df1 = pd.DataFrame(np.sort(df[['from_location','to_location']], axis=1),
                   columns=list('ab'), index=df.index)
s = df1['a'] + ' -> ' + df1['b']
df2 = (df.groupby(s)['flow']
         .agg(['first', 'last'])
         .assign(net=lambda x: x['last'] - x['first']))
print(df2)
first last net
a -> b 1 3 2
Explanation:
If necessary, first sort_values, in case some paired rows are swapped.
Sort the columns within each row with numpy.sort and join them with a separator.
Then group by the joined values and aggregate with agg using first and last.
Finally, if you need the difference, add a new column with assign.
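If you also want the connection label and the flow / back flow column names from the question, a small follow-up sketch (my assumption of the desired final shape):
out = (df2.rename(columns={'first': 'flow', 'last': 'back flow'})
          .rename_axis('connection')
          .reset_index())
print(out)
#   connection  flow  back flow  net
# 0     a -> b     1          3    2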

Pandas sort not maintaining sort

What is the right way to multiply two sorted pandas Series?
When I run the following
import pandas as pd
x = pd.Series([1,3,2])
x = x.sort_values()  # Series.sort() was removed in modern pandas; sort_values is the equivalent
print(x)
w = [1]*3
print(w*x)
I get what I would expect: [1, 2, 3]
However, when I change it to a Series:
w = pd.Series(w)
print(w*x)
It appears to multiply based on the index of the two series, so it returns [1,3,2]
Your results are essentially the same, just sorted differently.
>>> w*x
0 1
2 2
1 3
>>> pd.Series(w)*x
0 1
1 3
2 2
>>> (w*x).sort_index()
0 1
1 3
2 2
The rule is basically this: Anytime you multiply a dataframe or series by a dataframe or series, it will be done by index. That's what makes it pandas and not numpy. As a result, any pre-sorting is necessarily ignored.
But if you multiply a dataframe or series by a list or numpy array of a conforming shape/size, then the list or array will be treated as having the exact same index as the dataframe or series. The pre-sorting of the series or dataframe can be preserved in this case because there can not be any conflict with the list or array (which don't have an index at all).
Both of these types of behavior can be very desirable depending on what you are trying do. That's why you will often see answers here that do something like df1 * df2.values when the second type of behavior is desired.
In this example, it doesn't really matter because your list is [1, 1, 1] and gives the same answer either way, but if it were [1, 2, 3] you would get different answers, not just differently sorted answers.
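A small sketch of both behaviours side by side, using [1, 2, 3] as the example weights (my values, chosen so the difference shows):
import pandas as pd

x = pd.Series([1, 3, 2]).sort_values()   # values [1, 2, 3], index [0, 2, 1]
w = pd.Series([1, 2, 3])

print(w * x)          # index-aligned: 1*1, 2*3, 3*2 -> 1, 6, 6 in index order
print(x * w.values)   # positional: 1*1, 2*2, 3*3 -> keeps x's sorted order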
