How do you combine data from some columns but keep others separate? - excel-formula

In Google Sheets or Excel I have a date, a product, and the qty of that product. One date has multiple columns of products and the associated qty. I want to combine the product columns and the qty columns. Thanks for the help.
I've tried a TRANSPOSE(SPLIT(JOIN(...))) formula but can't get the date column to work. I've tried INDEX/MATCH, but that only works with one column.
Now:
Columns with data are:
Date / Product / Qty / Product / Qty / Product / Qty
1/1/2019 / bananas / 10 / apples / 5 / oranges / 2
1/2/2019 / apples / 5 / oranges / 3 / bananas / 20
I want:
Date / Product / Qty
1/1/2019 / bananas / 10
1/1/2019 / apples / 5
1/1/2019 / oranges / 2
1/2/2019 / apples / 5
1/2/2019 / oranges / 3
1/2/2019 / bananas / 20
So I want to stack the data from the three Product columns and the three Qty columns, repeating the date for each product/qty pair.

In Google Sheets you could write an Apps Script (JavaScript) function that iterates over your cells and builds a new two-dimensional array, along the following lines:
var array = [['Date', 'Product', 'Qty', 'Product', 'Qty', 'Product', 'Qty'],
             ['1/1/2019', 'bananas', 10, 'apples', 5, 'oranges', 2],
             ['1/2/2019', 'apples', 5, 'oranges', 3, 'bananas', 20]];
var newArray = [['Date', 'Product', 'Qty']];
// Walk each data row; odd column indices hold a product, the next column its qty.
for (var j = 1; j < array.length; j++) {
  for (var i = 1; i < array[0].length; i += 2) {
    newArray.push([array[j][0], array[j][i], array[j][i + 1]]);
  }
}
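If the table can be loaded into Python instead, the same wide-to-long reshape can be sketched with pandas (a sketch only; the column layout is assumed to be exactly as shown in the question):

```python
import pandas as pd

# The wide table from the question: Date, then repeated (Product, Qty) pairs.
wide = pd.DataFrame(
    [['1/1/2019', 'bananas', 10, 'apples', 5, 'oranges', 2],
     ['1/2/2019', 'apples', 5, 'oranges', 3, 'bananas', 20]],
    columns=['Date', 'Product', 'Qty', 'Product', 'Qty', 'Product', 'Qty'])

# Slice out each (Date, Product, Qty) triple by position, then stack them.
pairs = [wide.iloc[:, [0, i, i + 1]].set_axis(['Date', 'Product', 'Qty'], axis=1)
         for i in range(1, wide.shape[1], 2)]

# A stable sort on the original row index keeps each date's products together.
long = pd.concat(pairs).sort_index(kind='mergesort').reset_index(drop=True)
```

The duplicate column names are harmless here because the slicing is purely positional (`iloc`).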

Related

Given a column value, check if another column value is present in preceding or next 'n' rows in a Pandas data frame

I have the following data
jsonDict = {'Fruit': ['apple', 'orange', 'apple', 'banana', 'orange', 'apple','banana'], 'price': [1, 2, 1, 3, 2, 1, 3]}
Fruit price
0 apple 1
1 orange 2
2 apple 1
3 banana 3
4 orange 2
5 apple 1
6 banana 3
What I want to do is check if Fruit == 'banana', and if yes, scan the preceding as well as the next n rows from the index position of the 'banana' row for an instance where Fruit == 'apple'. An example of the expected output, taking n=2, is shown below.
Fruit price
2 apple 1
5 apple 1
I have tried doing
position = df[df['Fruit'] == 'banana'].index
resultdf= df.loc[((df.index).isin(position)) & (((df['Fruit'].index+2).isin(['apple']))|((df['Fruit'].index-2).isin(['apple'])))]
# Output is an empty dataframe
Empty DataFrame
Columns: [Fruit, price]
Index: []
Preference will be given to vectorized approaches.
IIUC, you can use 2 masks and boolean indexing:
# df = pd.DataFrame(jsonDict)
n = 2
m1 = df['Fruit'].eq('banana')
# is the row ±n of a banana?
m2 = m1.rolling(2*n+1, min_periods=1, center=True).max().eq(1)
# is the row an apple?
m3 = df['Fruit'].eq('apple')
out = df[m2 & m3]
output:
Fruit price
2 apple 1
5 apple 1

Perform unique row operation after a groupby

I have been stuck on a problem: I have done all the groupby operations and got the resultant dataframe shown below, but the problem comes in the last step, calculating one additional column.
Current dataframe:
code industry category count duration
2 Retail Mobile 4 7
3 Retail Tab 2 33
3 Health Mobile 5 103
2 Food TV 1 88
The question: I want an additional column, operation, which for each code calculates the ratio of the 'Retail' count to the total count for that code.
For example: code 2 has two industry entries, Retail and Food, so the operation column should have the value 4/(4+1) = 0.8, and similarly for code 3, as shown below.
O/P:
code industry category count duration operation
2 Retail Mobile 4 7 0.8
3 Retail Tab 2 33 -
3 Health Mobile 5 103 2/7 = 0.285
2 Food TV 1 88 -
Some help here as well: if I just do a groupby I will miss the category and duration information. What would be a better way to represent the output df? There can be multiple industries, and operation is limited to just Retail.
I can't think of a single operation, but the way via a dictionary should work. And, in advance for the other answerers, here is the code to create the example dataframe:
st_l = [[2, 'Retail', 'Mobile', 4, 7],
        [3, 'Retail', 'Tab', 2, 33],
        [3, 'Health', 'Mobile', 5, 103],
        [2, 'Food', 'TV', 1, 88]]
df = pd.DataFrame(st_l, columns=['code', 'industry', 'category', 'count', 'duration'])
And now my attempt:
sums = df[['code', 'count']].groupby('code').sum().to_dict()['count']
df['operation'] = df.apply(lambda x: x['count']/sums[x['code']], axis=1)
You can create a new column with the total count of each code using groupby.transform(), and then use loc to find only the rows that have 'Retail' as their industry and perform your division:
df['total_per_code'] = df.groupby(['code'])['count'].transform('sum')
df.loc[df.industry.eq('Retail'), 'operation'] = df['count'].div(df.total_per_code)
df.drop('total_per_code',axis=1,inplace=True)
prints back:
code industry category count duration operation
0 2 Retail Mobile 4 7 0.800000
1 3 Retail Tab 2 33 0.285714
2 3 Health Mobile 5 103 NaN
3 2 Food TV 1 88 NaN

Join two dataframes based on closest combination that sums up to a target value

I'm trying to join the two dataframes below based on the combination of rows from df2's Sales column whose sum is closest to the target value in df1's Total Sales column. The Name and Date columns in both dataframes should match when joining (as shown in the expected output).
For example: row 0 in df1 should be matched only with df2 rows 0 and 1, since Name and Date are the same (Name: John, Date: 2021-10-01).
df1 :
df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
"Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})
Name Date Total Sales
0 John 2021-10-01 15500
1 John 2021-11-01 5500
2 Jack 2021-10-10 17600
3 Nancy 2021-10-12 20700
4 Ahmed 2021-10-30 12000
df2 :
df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
"7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
"Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
"8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
"Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
"8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
"Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
"7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})
ID Name Date Sales
0 JO1 John 2021-10-01 10000
1 JO2 John 2021-10-01 5000
2 JO3 John 2021-11-01 1000
3 JO4 John 2021-11-01 5500
4 JA1 Jack 2021-10-10 10000
5 JA2 Jack 2021-10-10 7000
6 NA1 Nancy 2021-10-12 20000
7 NA2 Nancy 2021-10-12 100
8 NA3 Nancy 2021-10-12 500
9 NA4 Nancy 2021-10-12 100
10 AH1 Ahmed 2021-10-30 5000
11 AH2 Ahmed 2021-10-30 7000
12 AH3 Ahmed 2021-10-30 10000
13 AH3 Ahmed 2021-10-29 12000
Expected Output :
Name Date Total Sales Comb IDs Comb Total
0 John 2021-10-01 15500 JO1, JO2 15000.0
1 John 2021-11-01 5500 JO4 5500.0
2 Jack 2021-10-10 17600 JA1, JA2 17000.0
3 Nancy 2021-10-12 20700 NA1, NA2, NA3, NA4 20700.0
4 Ahmed 2021-10-30 12000 AH1, AH2 12000.0
What I have tried below works for only one row at a time, but I'm not sure how to apply it to the pandas dataframes to get the expected output.
The numbers variable in the script below represents the Sales column in df2, and the target variable represents the Total Sales column in df1.
import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000

best_combination = (None,)
best_result = math.inf
best_sum = 0

for L in range(0, len(numbers) + 1):
    for combination in itertools.combinations(numbers, L):
        total = sum(combination)
        result = target - total
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = total

print("\nbest sum{} = {}".format(best_combination, best_sum))
[Out] best sum(1000, 5000) = 6000
Take the code you wrote which finds the best sum and turn it into a function (let's call it opt) with parameters for the target and a dataframe (which will be a subset of df2). It needs to return a list of IDs which correspond to the optimal combination.
Write another function (let's call it calc) which takes three arguments: name, date and target. This function will filter df2 based on name and date, pass the result along with the target to the opt function, and return that function's result. Finally, iterate through the rows of df1 and call calc with the row values (or alternatively use pandas.DataFrame.apply).
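The two functions described above could be sketched like this (opt and calc are the placeholder names suggested in the answer; the brute-force search from the question is reused, so the runtime is exponential in the number of candidate rows):

```python
import itertools
import math

import pandas as pd

def opt(target, sub_df):
    # Try every combination of rows and keep the IDs whose Sales sum
    # is closest to the target.
    rows = list(zip(sub_df['ID'], sub_df['Sales']))
    best_ids, best_diff = [], math.inf
    for r in range(len(rows) + 1):
        for comb in itertools.combinations(rows, r):
            diff = abs(target - sum(sales for _, sales in comb))
            if diff < best_diff:
                best_diff = diff
                best_ids = [row_id for row_id, _ in comb]
    return best_ids

def calc(name, date, target, df2):
    # Restrict df2 to the matching Name/Date group, then optimise within it.
    sub = df2[(df2['Name'] == name) & (df2['Date'] == date)]
    return opt(target, sub)
```

df1 could then be processed row by row, e.g. df1.apply(lambda r: calc(r['Name'], r['Date'], r['Total Sales'], df2), axis=1).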

taking top 3 in a groupby, and lumping rest into 'other category'

I am currently doing a groupby in pandas like this:
df.groupby(['grade'])['students'].nunique()
and the result I get is this:
grade
grade 1 12
grade 2 8
grade 3 30
grade 4 2
grade 5 600
grade 6 90
Is there a way to get the output such that I see the groups of the top 3, and everything else is classified under other?
this is what I am looking for
grade
grade 3 30
grade 5 600
grade 6 90
other (3 other grades) 22
I think you can add a helper column in the df and call it something like "Grouping":
name the top 3 rows with their original names, name the remaining rows "other", and then just group by the "Grouping" column.
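A minimal sketch of that idea (the column name Grouping is the suggestion above; the counts are re-created from the question's output):

```python
import pandas as pd

# The result of the question's groupby, rebuilt as a plain dataframe.
df = pd.DataFrame({'grade': ['grade 1', 'grade 2', 'grade 3',
                             'grade 4', 'grade 5', 'grade 6'],
                   'students': [12, 8, 30, 2, 600, 90]})

# Helper column: keep the top-3 grades by count, lump the rest as 'other'.
top3 = df.nlargest(3, 'students')['grade']
df['Grouping'] = df['grade'].where(df['grade'].isin(top3), 'other')

out = df.groupby('Grouping')['students'].sum()
```

This yields 30, 600 and 90 for the top three grades and 22 for the remaining three, matching the desired output.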
Can't do much without the actual input data, but if this is your starting dataframe (df) after your groupby -
grade unique
0 grade_1 12
1 grade_2 8
2 grade_3 30
3 grade_4 2
4 grade_5 600
5 grade_6 90
You can do a few more steps to get to your table -
ddf = df.nlargest(3, 'unique')
other_row = pd.DataFrame([{'grade': 'Other', 'unique': df['unique'].sum() - ddf['unique'].sum()}])
ddf = pd.concat([ddf, other_row], ignore_index=True)
grade unique
0 grade_5 600
1 grade_6 90
2 grade_3 30
3 Other 22

Oracle - Identifying Dominant Records

I have a table that contains two STRING values (all single words) along with a corresponding COUNT for each occurrence of the STRING, e.g.:
ID STR_1 COUNT_1 STR_2 COUNT_2
1 ORANGES 2 APPLES 10
2 APPLES 10 ORANGES 2
3 ORANGES 2 BANANAS 1
4 BANANAS 1 APPLES 10
5 BANANAS 1 ORANGES 2
N.B. STR_1 is considered the ‘master’ value. Also, the COUNT for each individual STRING value will be consistent between STR_1 and STR_2 and between rows (e.g. ORANGES will always have a COUNT of 2)
What I’m trying to achieve is to remove records whereby an ‘enantiomer’ exists, for example; in the above data, ID 2 would be considered an ‘enantiomer’ of ID 1 (ID 1.STR_1 = ID.2 STR_2 and ID 1.STR_2 = ID.2 STR_1), however, ID 2 would be considered the dominant record with ID 1 being discarded (because the COUNT for APPLES is greater than the COUNT for ORANGES) – therefore the desired output would be;
ID STR_1 COUNT_1 STR_2 COUNT_2
2 APPLES 10 ORANGES 2
3 ORANGES 2 BANANAS 1
4 BANANAS 1 APPLES 10
IF a scenario exists whereby the COUNT values between different STRINGS match, the longest STRING would be considered the dominant record and retained e.g.;
ID STR_1 COUNT_1 STR_2 COUNT_2
1 ORANGES 10 APPLES 10
2 APPLES 10 ORANGES 10
3 ORANGES 10 BANANAS 1
4 BANANAS 1 APPLES 10
5 BANANAS 1 ORANGES 10
With the desired output being;
ID STR_1 COUNT_1 STR_2 COUNT_2
1 ORANGES 10 APPLES 10
3 ORANGES 10 BANANAS 1
4 BANANAS 1 APPLES 10
Test Data;
WITH
TEST_DATA AS
(
SELECT 1 ID, 'ORANGES' STR_1, 2 COUNT_1, 'APPLES' STR_2, 10 COUNT_2 FROM DUAL
UNION
SELECT 2 ID, 'APPLES' STR_1, 10 COUNT_1, 'ORANGES' STR_2, 2 COUNT_2 FROM DUAL
UNION
SELECT 3 ID, 'ORANGES' STR_1, 2 COUNT_1, 'BANANAS' STR_2, 1 COUNT_2 FROM DUAL
UNION
SELECT 4 ID, 'BANANAS' STR_1, 1 COUNT_1, 'APPLES' STR_2, 10 COUNT_2 FROM DUAL
UNION
SELECT 5 ID, 'BANANAS' STR_1, 1 COUNT_1, 'ORANGES' STR_2, 2 COUNT_2 FROM DUAL
)
Any help finding a solution to the above would be much appreciated.
Many thanks in advance.
Use anti join (not exists operator):
select *
from test_data t
where not exists (
    select 1
    from test_data t1
    where t.str_1 = t1.str_2
      and t.str_2 = t1.str_1
      and (
            t.count_1 < t1.count_1
            or (t.count_1 = t1.count_1
                and length( t.str_1 ) < length( t1.str_1 ))
          )
)
order by id
In the case when, for a given pair of rows, both the counts and the lengths are equal, the query keeps both rows.