Why is order of data items reversed while creating a pandas series? - python-3.x

I am new to python and pandas so please bear with me. I tried searching for the answer everywhere but couldn't find it. Here's my question:
This is my input code:
list = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], list)
The output is:
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
Now, my question is: why does "list" come before the first list specified while creating the Series? I tried running the same code multiple times to check whether Series creation is orderless. Any help would be highly appreciated.
Python Version:
Python 3.6.0
Pandas Version:
'0.19.2'

I think you omitted the index argument, which specifies the first column (called the index) - so the Series construction is:
# don't use list as a variable name, because it shadows the built-in list in python
L = [1, 2, 3, 1, 2, 3]
s = pd.Series(data=[1, 2, 3, 10, 20, 30], index=L)
print (s)
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
You can also check Series documentation.
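To see the difference, here is a minimal sketch (using the same data as the question) showing that the second positional argument of pd.Series is the index, and that omitting it gives the default 0..n-1 RangeIndex:
import pandas as pd

# second positional argument is the index - same as the code in the question
L = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], L)

# passing only the data gives the default integer index 0..5
s_default = pd.Series([1, 2, 3, 10, 20, 30])
print(s_default)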

Related

Edit multiple values with df.at()

Why does
>>> offset = 2
>>> data = {'Value': [7, 9, 21, 22, 23, 100]}
>>> df = pd.DataFrame(data=data)
>>> df.at[:offset, "Value"] = 99
>>> df
Value
0 99
1 99
2 99
3 22
4 23
5 100
change values at indices [0, 1, 2]? I would expect them only to be changed at [0, 1], to be consistent with regular slicing.
Like when I do
>>> arr = [0, 1, 2, 3, 4]
>>> arr[0:2]
[0, 1]
.at behaves like .loc, in that it selects rows/columns by label. Label slicing in pandas is inclusive. Note that .iloc, which performs slicing on the integer positions, behaves like you would expect. See this good answer for a motivation.
Also note that the pandas documentation suggests using .at only when selecting/setting single values; for slices, use .loc instead.
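To illustrate the inclusive/exclusive difference, a minimal sketch using the data from the question:
import pandas as pd

df = pd.DataFrame({'Value': [7, 9, 21, 22, 23, 100]})
# label-based slicing is inclusive of the right endpoint: rows 0, 1 and 2 change
df.loc[:2, 'Value'] = 99

df2 = pd.DataFrame({'Value': [7, 9, 21, 22, 23, 100]})
# position-based slicing is exclusive of the right endpoint: only rows 0 and 1 change
df2.iloc[:2, 0] = 99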
In the line df.at[:offset, "Value"] = 99, :2 means all rows from 0 to 2, i.e. 0:2. If you want to change only the 3rd row, you should change it to 2:2.

Filter dataframe by minimum number of values in groups

I have the following dataframe structure:
#----------------------------------------------------------#
# Generate dataframe mock example.
import numpy as np
import pandas as pd

# define categorical column.
grps = pd.DataFrame(['a', 'a', 'a', 'b', 'b', 'b'])
# generate dataframe 1.
df1 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# introduce nan into dataframe 1.
for col in df1.columns:
    df1.loc[df1.sample(frac=0.1).index, col] = np.nan
# generate dataframe 2.
df2 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# concatenate categorical column and dataframes.
df = pd.concat([grps, df1, df2], axis = 1)
# Assign column headers.
df.columns = ['Groups', 1, 2, 3, 4, 5, 6]
# Set index as group column.
df = df.set_index('Groups')
# Generate stacked dataframe structure.
test_stack_df = df.stack(dropna = False).reset_index()
# Change column names.
test_stack_df = test_stack_df.rename(columns = {'level_1': 'IDs',
                                                0: 'Values'})
#----------------------------------------------------------#
Original dataframe - 'df' before stacking:
Groups 1 2 3 4 5 6
a 3 5 5 3 5 5
a nan nan 3 4 7 3
a 6 2 nan 6 2 4
b 8 8 8 8 8 8
b 10 9 4 10 9 4
b 4 6 6 4 6 6
I would like to filter the columns such that there are at least 3 valid (non-NaN) values in each group - 'a' & 'b'. The final output should contain only columns 4, 5, 6.
I am currently using the following method:
# Function to define boolean series.
def filter_vals(test_stack_df, orig_df):
    # Reset index.
    df_idx_reset = orig_df.reset_index()
    # Generate list with size of each 'Group'.
    grp_num = pd.value_counts(df_idx_reset['Groups']).to_list()
    # Data series for each 'Group'.
    expt_class_1 = test_stack_df.head(grp_num[0])
    expt_class_2 = test_stack_df.tail(grp_num[1])
    # Check if both 'Groups' contain at least 3 values per 'ID'.
    valid_IDs = len(expt_class_1['Values'].value_counts()) >= 3 & \
                len(expt_class_2['Values'].value_counts()) >= 3
    # Return 'true' or 'false'
    return(valid_IDs)

# Apply function to dataframe to generate boolean series.
bool_series = test_stack_df.groupby('IDs').apply(filter_vals, df)
# Transpose original dataframe.
df_T = df.transpose()
# Filter by boolean series & transpose again.
df_filtered = df_T[bool_series].transpose()
I could achieve this with minimal fuss by applying the pandas.DataFrame.dropna() method with a threshold of 6. However, this won't account for different sized groups or allow me to specify the minimum number of values, which the current code does.
For larger dataframes, i.e. 4000+ columns, the code is a little slow, taking ~20 secs to complete the filtering process. I have tried alternate methods that access the original dataframe directly using groupby & transform but can't get anything to work.
Is there a simpler and faster method? Thanks for your time!
EDIT: 03/05/2020 (15:58) - just spotted something that wasn't clear in the function above. Still works but have clarified variable names. Sorry for the confusion!
This will do the trick for you:
df.notna().groupby(level='Groups').sum(axis=0).ge(3).all(axis=0)
Outputs:
1 False
2 False
3 False
4 True
5 True
6 True
dtype: bool
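The resulting boolean Series is indexed by column label, so it can be used directly as a column mask (a minimal sketch, assuming df is the original frame built in the question):
mask = df.notna().groupby(level='Groups').sum().ge(3).all(axis=0)
df_filtered = df.loc[:, mask]   # keeps only columns 4, 5 and 6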

How to create a separate df after applying groupby?

I have a df as follows:
Product Step
1 1
1 3
1 6
1 6
1 8
1 1
1 4
2 2
2 4
2 8
2 8
2 3
2 1
3 1
3 3
3 6
3 6
3 8
3 1
3 4
What I would like to do is to:
For each Product, every Step must be grabbed and the order must not be changed; that is, if we look at Product 1, after Step 8 there is a 1 coming, and that 1 must stay after the 8. So the expected output for Products 1 and 3 should be in the order 1, 3, 6, 8, 1, 4; for Product 2 it must be 2, 4, 8, 3, 1.
Update:
Here I only want one value of 6 for Products 1 and 3, since in the main df the two 6's are next to each other, but both values of 1 must be present since they are not next to each other.
Once the first step is done, the products with the same Steps must be grouped together into a new df (in the example below, Products 1 and 3 have the same Steps, so they must be grouped together).
What I have done:
import pandas as pd
sid = pd.DataFrame(data.groupby('Product').apply(lambda x: x['Step'].unique())).reset_index()
But it is yielding a result like:
Product 0
0 1 [1 3 6 8 4]
1 2 [2 4 8 3 1]
2 3 [1 3 6 8 4]
which is not the result I want. I would like the value for the first and third product to be [1 3 6 8 1 4].
IIUC, create a Newkey column using cumsum and diff, so that consecutive duplicate Steps share the same key:
df['Newkey']=df.groupby('Product').Step.apply(lambda x : x.diff().ne(0).cumsum())
df.drop_duplicates(['Product','Newkey'],inplace=True)
s=df.groupby('Product').Step.apply(tuple)
s.reset_index().groupby('Step').Product.apply(list)
Step
(1, 3, 6, 8, 1, 4) [1, 3]
(2, 4, 8, 3, 1) [2]
Name: Product, dtype: object
groupby preserves the order of rows within a group, so there isn't much need to worry about the rows shifting.
A straightforward, but not greatly performant, solution is to apply(tuple), since tuples are hashable, allowing you to group on them to see which Products are identical. form_seq makes it so that consecutive values appear only once in the list of steps before the tuple is formed.
def form_seq(x):
    x = x[x != x.shift()]
    return tuple(x)
s = df.groupby('Product').Step.apply(form_seq)
s.groupby(s).groups
#{(1, 3, 6, 8, 1, 4): Int64Index([1, 3], dtype='int64', name='Product'),
# (2, 4, 8, 3, 1): Int64Index([2], dtype='int64', name='Product')}
Or if you'd like a DataFrame:
s.reset_index().groupby('Step').Product.apply(list)
#Step
#(1, 3, 6, 8, 1, 4) [1, 3]
#(2, 4, 8, 3, 1) [2]
#Name: Product, dtype: object
The values of that dictionary are the groupings of products that share the step sequence (given by the dictionary keys). Products 1 and 3 are grouped together by the step sequence 1, 3, 6, 8, 1, 4.
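If you then need a separate DataFrame per shared step sequence, one option (a minimal sketch building on s from above; the names groups and sub_dfs are mine) is to collect the products per sequence and slice the original frame:
groups = s.reset_index().groupby('Step').Product.apply(list)
# one sub-DataFrame per distinct step sequence
sub_dfs = {seq: df[df['Product'].isin(products)] for seq, products in groups.items()}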
Another very similar way:
df_no_dups=df[df.shift()!=df].dropna(how='all').ffill()
df_no_dups_grouped=df_no_dups.groupby('Product')['Step'].apply(list)

Filter simultaneously by different values of rows Pandas

I have a huge dataframe with product_id and their property_id's. Note that each property sits in its own row, so a product spans several rows. I need to filter simultaneously by several property_id values for each product_id. Is there any way to do it fast?
out_df
product_id property_id
0 3588 1
1 3588 2
2 3588 5
3 3589 1
4 3589 3
5 3589 5
6 3590 1
7 3590 2
8 3590 5
For example, I want to filter each product_id by two properties that are assigned in different rows, something like out_df.loc[(out_df['property_id'] == 1) & (out_df['property_id'] == 2)] - except that this returns nothing, because the two conditions can never be true in the same row.
I need something like that, but working across all rows of each product_id at the same time.
I know that it can be done via groupby into lists
3587 [2, 1, 5]
3588 [1, 3, 5]
3590 [1, 2, 5]
and finding intersections inside lists.
gp_df.apply(lambda r: {1, 2} < (set(r['property_id'])), axis=1)
But it takes time, whereas Pandas' ordinary filtering is greatly optimized for speed (presumably using clever inverted indexes internally, as search engines like ElasticSearch, Sphinx etc. do).
Expected output: only the products that have both 1 and 2.
3587 [2, 1, 5]
3590 [1, 2, 5]
Since this is just as much a performance as a functional question, I would go with an intersection approach like this:
from time import time

import pandas as pd

df = pd.DataFrame({'product_id': [3588, 3588, 3588, 3589, 3589, 3589, 3590, 3590, 3590],
                   'property_id': [1, 2, 5, 1, 3, 5, 1, 2, 5]})
df = df.set_index(['property_id'])
print("The full DataFrame:")
print(df)

start = time()
for i in range(1000):
    s1 = df.loc[(1), 'product_id']
    s2 = df.loc[(2), 'product_id']
    s_done = pd.Series(list(set(s1).intersection(set(s2))))
print("Overlapping product_id's")
print(time() - start)
Iterating the lookup 1000 times takes 0.93 seconds on my ThinkPad T450s. I took the liberty of testing #jezrael's two suggestions and they come in at 2.11 and 2.00 seconds; the groupby approach is, software-engineering-wise, more elegant though.
Depending on the size of your data set and the importance of performance, you can also switch to simpler datatypes, like classic dictionaries, and gain further speed.
Jupyter Notebook can be found here: pandas_fast_lookup_using_intersection.ipynb
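A sketch of that dictionary-based variant (assuming out_df from the question, where product_id and property_id are ordinary columns; lookup and matching are names I made up):
from collections import defaultdict

# build property_id -> set of product_ids once
lookup = defaultdict(set)
for prod, prop in zip(out_df['product_id'], out_df['property_id']):
    lookup[prop].add(prod)

# product_ids that have both property 1 and property 2
matching = lookup[1] & lookup[2]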
Do you mean something like this?
result = out_df.loc[out_df['property_id'].isin([1,2]), :]
If you want you can then drop duplicates based on product_id...
The simplest is to use GroupBy.transform and compare sets:
s = {1, 2}
a = df[df.groupby('product_id')['property_id'].transform(lambda r: s < set(r))]
print (a)
product_id property_id
0 3588 1
1 3588 2
2 3588 5
6 3590 1
7 3590 2
8 3590 5
Another solution is to filter only the values of the set, removing duplicates first:
df1 = df[df['property_id'].isin(s) & ~df.duplicated(['product_id', 'property_id'])]
Then it is necessary to check whether the size of each group matches the size of the set, with this solution:
import numpy as np

f, u = df1['product_id'].factorize()
ids = df1.loc[np.bincount(f)[f] == len(s), 'product_id'].unique()
Last, filter all rows by product_id with this condition:
a = df[df['product_id'].isin(ids)]
print (a)
product_id property_id
0 3588 1
1 3588 2
2 3588 5
6 3590 1
7 3590 2
8 3590 5

Find all combinations by columns

I have an n-rows by m-columns matrix and want to find all combinations, taking one value from each column. For example:
2 5 6 9
5 2 8 3
1 1 9 4
2 5 3 9
my program will print
2-5-6-9
2-5-6-3
2-5-6-4
2-5-6-9
2-5-8-9
2-5-8-3...
I can't define m nested for loops. How can I do that?
Use recursion. It is enough to specify, for each position, which values can appear there (the columns), and write a recursive function that takes as a parameter the list of numbers chosen for the positions handled so far. In each recursive call, iterate through the possibilities for the next position.
Python implementation:
def C(choose_numbers, possibilities):
    if len(choose_numbers) >= len(possibilities):
        print('-'.join(map(str, choose_numbers)))  # format output
    else:
        for i in possibilities[len(choose_numbers)]:
            C(choose_numbers + [i], possibilities)

c = [[2, 5, 1, 2], [5, 2, 1, 5], [6, 8, 9, 3], [9, 3, 4, 9]]
C([], c)
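For reference, the same output can be produced without explicit recursion by using itertools.product over the column lists (a minimal sketch, assuming the same c as above):
from itertools import product

c = [[2, 5, 1, 2], [5, 2, 1, 5], [6, 8, 9, 3], [9, 3, 4, 9]]
for combo in product(*c):
    print('-'.join(map(str, combo)))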
