So I've looked at some other posts, but they didn't quite help. I'm not new to Python, but I'm relatively new to pandas, and this has me stumped: I can't see how to accomplish it in any way that isn't horribly inefficient. The data sets I've got are a little large and have some extraneous columns I don't need. I've got them loaded as dataframes, and they basically look like this:
+---------+---------+--------+-------+
| Subject | Week | Test | Value |
+---------+---------+--------+-------+
| 1 | Week 4 | Test 1 | 4 |
| 1 | Week 8 | Test 1 | 7 |
| 1 | Week 12 | Test 1 | 3 |
| 1 | Week 4 | Test 2 | 6 |
| 1 | Week 8 | Test 2 | 3 |
| 1 | Week 12 | Test 2 | 9 |
| 2 | Week 4 | Test 1 | 1 |
| 2 | Week 8 | Test 1 | 4 |
| 2 | Week 12 | Test 1 | 2 |
| 2 | Week 4 | Test 2 | 8 |
| 2 | Week 8 | Test 2 | 1 |
| 2 | Week 12 | Test 2 | 3 |
+---------+---------+--------+-------+
I want to rearrange the dataframes so that they look like this:
+---------+---------+--------+--------+
| Subject | Week | Test 1 | Test 2 |
+---------+---------+--------+--------+
| 1 | Week 4 | 4 | 6 |
| 1 | Week 8 | 7 | 3 |
| 1 | Week 12 | 3 | 9 |
| 2 | Week 4 | 1 | 8 |
| 2 | Week 8 | 4 | 1 |
| 2 | Week 12 | 2 | 3 |
+---------+---------+--------+--------+
If anyone has any ideas on how I can make this happen, I'd greatly appreciate it, and thank you in advance for your time!
Edit: After trying the solution provided by @HarvIpan, this is the output I'm getting:
   Subject     Week  Test_Test 1  Test_Test 2
0        1  Week 12            5            0
1        1   Week 4            5            0
2        1   Week 8           11            0
3        2  Week 12            0           12
4        2   Week 4            0           14
5        2   Week 8            0            4
Try using df.pivot_table.
You should be able to get the desired outcome with:
df.pivot_table(index=['Subject','Week'], columns='Test', values='Value')
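For instance, a minimal sketch (assuming the frame is named df as in the question) that also flattens the pivoted result back into ordinary columns:
import pandas as pd

# long -> wide: one column per Test, cell values taken from Value
wide = df.pivot_table(index=['Subject', 'Week'], columns='Test', values='Value')

# turn Subject/Week back into regular columns and drop the leftover 'Test' axis label
wide = wide.reset_index()
wide.columns.name = None
Note that pivot_table aggregates duplicates (mean by default); if each Subject/Week/Test combination is unique, df.pivot does the same reshape and raises on duplicates instead of silently averaging them.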
You need to get dummy variables for the Test column with pd.get_dummies(df[['Test']], 'Test'), multiply them by their Value with .mul(df['Value'], 0), and concatenate the result back onto your original df. Then group by Subject and Week and sum.
pd.concat([df.drop(columns=['Test', 'Value']), pd.get_dummies(df[['Test']], 'Test').mul(df['Value'], 0)], axis=1).groupby(['Subject', 'Week']).sum().reset_index()
Output:
   Subject     Week  Test_Test 1  Test_Test 2
0        1  Week 12            3            9
1        1   Week 4            4            6
2        1   Week 8            7            3
3        2  Week 12            2            3
4        2   Week 4            1            8
5        2   Week 8            4            1
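If the Test_ prefix on the new columns is unwanted, it can be stripped afterwards; a small sketch, assuming the chained expression above was assigned to a name such as out:
out.columns = [c.removeprefix('Test_') for c in out.columns]  # str.removeprefix needs Python 3.9+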
I need help splitting a big file (1.6M records) into multiple files based on the maximum number of lines allowed per sub-file, with the caveat that an order should not spill across files and appear in more than one file.
Quick overview about the file:
The file has order information about transactions at a retail store. Each order can have multiple items. Below is a small example of a sample file.
sample_file:
order_nu item_nu Sale
1 1 10
1 2 20
1 3 30
2 1 10
2 2 20
3 1 10
3 2 10
4 1 20
4 2 24
4 3 34
4 4 10
4 5 20
5 1 30
5 2 20
5 3 40
Is it possible to write a Linux script that can help me split a file based on the number of lines, with the caveat that an order should not spill across files and appear in more than one file?
For example, for the above file, I need it split so that each sub-file has no more than 5 records, and an order does not appear in more than one file (the assumption is that an order will not have more than 5 items). Below is the expected output:
sub_file1 :
| order_nu | item_nu | Sale |
| -------- | --------|-------|
| 1 | 1 | 10 |
| 1 | 2 | 20 |
| 1 | 3 | 30 |
| 2 | 1 | 10 |
| 2 | 2 | 20 |
sub_file2:
| order_nu | item_nu | Sale |
| -------- | --------|-------|
| 3 | 1 | 10 |
| 3 | 2 | 10 |
sub_file3:
| order_nu | item_nu | Sale |
| -------- | --------|-------|
| 4 | 1 | 20 |
| 4 | 2 | 24 |
| 4 | 3 | 34 |
| 4 | 4 | 10 |
| 4 | 5 | 20 |
sub_file4:
| order_nu | item_nu | Sale |
| -------- | --------|-------|
| 5 | 1 | 30 |
| 5 | 2 | 20 |
| 5 | 3 | 40 |
Please let me know if there are any questions
Thank you!
Try something like this. It buffers one order at a time and starts a new sub-file whenever the buffered order would push the current file past the limit:
max_lines=5
counter=1; count=0; buf=""; n=0; prev=""
flush() {
    # start a new file if the buffered order would not fit in the current one
    if [ $((count + n)) -gt "$max_lines" ] && [ "$count" -gt 0 ]; then
        counter=$((counter + 1)); count=0
    fi
    printf '%s' "$buf" >> "sub_file$counter.txt"
    count=$((count + n)); buf=""; n=0
}
while IFS= read -r line; do
    order=${line%%[[:space:]]*}                   # first field = order number
    [ -n "$prev" ] && [ "$order" != "$prev" ] && flush
    buf="$buf$line"$'\n'; n=$((n + 1)); prev=$order
done < <(tail -n +2 sample_file.txt)              # tail skips the header row
flush                                             # write the last buffered order
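Since the rest of this thread is pandas anyway, here is an equivalent sketch in Python; the file name, delimiter, and header row are assumptions matching the sample above:
import pandas as pd

df = pd.read_csv('sample_file.txt', sep=r'\s+')

max_lines, counter, count = 5, 1, 0
chunks = {}
# orders stay whole: an order moves to a new file when it would not fit
for _, order in df.groupby('order_nu', sort=False):
    if count > 0 and count + len(order) > max_lines:
        counter, count = counter + 1, 0
    chunks.setdefault(counter, []).append(order)
    count += len(order)

for i, parts in chunks.items():
    pd.concat(parts).to_csv(f'sub_file{i}.txt', sep='\t', index=False)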
My DataFrame looks something like this:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| value 6 | 3 |
| value 7 | 4 |
| value 8 | 5 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
+----------------------------------+---------+
What I am trying to achieve: if the substrings Start and End are present, I want the rows between them (inclusive).
Expected Result is:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
| Start B | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End B | 6 |
| Start C | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End C | 6 |
+----------------------------------+---------+
I tried the code from this question: How to filter dataframe columns between two rows that contain specific string in column?
m = df['Col1'].isin(['Start A', 'End A']).cumsum().eq(1)
df[m|m.shift()]
But this only returns the first Start/End pair, and it expects the exact strings.
output:
+----------------------------------+---------+
| Col1 | Col2 |
+----------------------------------+---------+
| Start A | 1 |
| value 1 | 2 |
| value 2 | 3 |
| value 3 | 4 |
| value 5 | 5 |
| End A | 6 |
+----------------------------------+---------+
The answer you linked to was designed to work with a single pair of Start/End.
A more generic variant of it would be to check for the parity of the group (assuming strictly alternating Start/End):
m1 = df['Col1'].str.match(r'Start|End').cumsum().mod(2).eq(1)
# boolean indexing
out = df[m1|m1.shift()]
Alternatively, use each Start as a flag to keep the following rows and each End as a flag to drop them. Unlike the nice answer from @Quang, however, this doesn't check the A/B/C letter after Start/End:
# extract Start/End
s = df['Col1'].str.extract(r'^(Start|End)', expand=False)
# set flags and ffill
m1 = s.replace({'Start': True, 'End': False}).ffill()
# boolean slicing
out = df[m1|m1.shift()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
Let's try:
# extract the label after `Start/End`
groups = df['Col1'].str.extract(r'^(?:Start|End) (.*)', expand=False)
# keep rows with equal forward fill and backward fill
df[groups.bfill() == groups.ffill()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
One option is with an interval index:
Get the positions of the starts and ends:
starts = df.Col1.str.startswith("Start").to_numpy().nonzero()[0]
ends = df.Col1.str.startswith("End").to_numpy().nonzero()[0]
Build an interval index, and get matches where the index lies between Start and End:
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')
intervals = intervals.get_indexer(df.index)
Filter the original dataframe, keeping rows whose interval position is not -1, i.e. rows that fall inside some Start/End pair:
df.loc[intervals >= 0]
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
I am trying to build a dataset from an online questionnaire. In this questionnaire, participants were asked to name 6 items, represented by the numbers 1 to 6 (order of mention does not matter). Afterwards, participants were asked to rank those items from most important to least important (order here matters). Right now I have three columns: "Order of named items", "Items ranked" and "Rank". The last column holds the position at which each named item was ranked. So the idea is to take the number in the first column, find its position in the second column, and return that position to the corresponding row of the third column.
Since the numbers go from 1 to 6, the process has to start over every six rows (on the 7th row). I have a total of 186 participants, which means there are 1116 items in total. What would be the most efficient way of doing this while preventing human error?
Here is an example of how the sheet looks when done manually:
+----------------------+-----------------------------+------+
| Order of named items | Items ranked (# = Identity) | Rank |
+----------------------+-----------------------------+------+
| 1 | 2 | 4 |
| 2 | 5 | 1 |
| 3 | 6 | 6 |
| 4 | 1 | 5 |
| 5 | 4 | 2 |
| 6 | 3 | 3 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 5 |
| 6 | 6 | 6 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 5 |
| 6 | 6 | 6 |
| 1 | 5 | 3 |
| 2 | 6 | 4 |
| 3 | 1 | 5 |
| 4 | 2 | 6 |
| 5 | 3 | 1 |
| 6 | 4 | 2 |
| 1 | 2 | 2 |
| 2 | 1 | 1 |
| 3 | 6 | 4 |
| 4 | 3 | 5 |
| 5 | 4 | 6 |
| 6 | 5 | 3 |
+----------------------+-----------------------------+------+
You can use this non-volatile function:
=MATCH(A2,INDEX(B:B,INT((ROW(1:1)-1)/6)*6+2):INDEX(B:B,INT((ROW(1:1)-1)/6)*6+7),0)
Assuming the 1st column starts at A2 and the second column at B2, use this formula in C2, copied down:
=MATCH(A2,OFFSET(B$2,6*INT((ROWS(C$2:C2)-1)/6),0,6),0)
OFFSET returns the 6-cell range required, and MATCH finds the position of the relevant item within it.
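For completeness, since most of this thread is pandas: a scripted sketch of the same per-block lookup. The frame and column names below are assumptions mirroring the first block of the sheet above:
import numpy as np
import pandas as pd

# hypothetical frame mirroring the first participant's block of 6 rows
df = pd.DataFrame({'named': [1, 2, 3, 4, 5, 6],
                   'ranked': [2, 5, 6, 1, 4, 3]})

df['block'] = np.arange(len(df)) // 6    # one block of 6 rows per participant
# within each block, map each named item to its 1-based position in the ranked list
df['Rank'] = df.groupby('block', group_keys=False)[['named', 'ranked']].apply(
    lambda g: g['named'].map({item: pos + 1 for pos, item in enumerate(g['ranked'])}))
# Rank comes out as [4, 1, 6, 5, 2, 3], matching the first block of the sheet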
I have this data below in Excel. What I want is to return the number of inactive months and the inactive months themselves.
ACTIVITY MONTH
+---------+-------+-------+-------+-------+-----------+---------------------+-----------------+
| User ID | Jan17 | Feb17 | Mar17 | Apr17 | Reg Month | No. Inactive months | Months Inactive |
+---------+-------+-------+-------+-------+-----------+---------------------+-----------------+
| 1       | 5     | 38    | 0     | 60    | Jan17     |                     |                 |
| 2       | 0     | 242   | 203   | 20    | Feb17     |                     |                 |
| 3       | 30    | 0     | 0     | 30    | Jan17     |                     |                 |
| 4       | 0     | 0     | 0     | 40    | Apr17     |                     |                 |
| 5       | 0     | 0     | 16    | 0     | Mar17     |                     |                 |
+---------+-------+-------+-------+-------+-----------+---------------------+-----------------+
To count the inactive months you can use the following.
+---+------+--------+--------+--------+--------+--+-----------------+
| | A | B | C | D | E | F| G |
+---+------+--------+--------+--------+--------+--+-----------------+
| 1 | User | Jan-17 | Feb-17 | Mar-17 | Apr-17 | | Inactive months |
| 2 | 1 | 5 | 38 | 0 | 60 | | 1 |
| 3 | 2 | 0 | 242 | 203 | 20 | | 1 |
| 4 | 3 | 30 | 0 | 0 | 30 | | 2 |
| 5 | 4 | 0 | 0 | 0 | 40 | | 3 |
| 6 | 5 | 0 | 0 | 16 | 0 | | 3 |
+---+------+--------+--------+--------+--------+--+-----------------+
where cell G2 contains the formula =COUNTIF(B2:E2,0)
Showing the list of inactive months is a little bit harder.
The point is that you have to decide how you want to see these results.
The easiest way is to use conditional formatting and color the cells with zero (but this is not very useful). Another way could be to transpose the table and filter the columns with zero. Yet another could be a VBA macro.
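Alternatively, since most of this thread uses pandas, here is a small sketch that computes both columns at once; the frame below is an assumption mirroring the sheet above:
import pandas as pd

# hypothetical frame mirroring the activity table above
df = pd.DataFrame(
    {'Jan17': [5, 0, 30, 0, 0],
     'Feb17': [38, 242, 0, 0, 0],
     'Mar17': [0, 203, 0, 0, 16],
     'Apr17': [60, 20, 30, 40, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name='User ID'))

zero = df.eq(0)
df['No. Inactive months'] = zero.sum(axis=1)   # 1, 1, 2, 3, 3 as in column G above
df['Months Inactive'] = zero.apply(lambda r: ', '.join(r.index[r]), axis=1)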
I am looking to create a YTD total; however, the year-end date needs to change depending on the value in another column. In other words, the fiscal year for group 1 runs from 11-1 (Nov-1) to 10-31 (Oct-31), while the fiscal year for group 2 runs from 7-1 (Jul-1) to 6-30 (Jun-30). When calculating the fiscal year, the calculation has to differ depending on which group the line item is in. So 2015 for group 1 would be 2014-11-01 to 2015-10-31, while 2015 for group 2 would be 2014-07-01 to 2015-06-30. Please see an example table here (note that I do have a date table related to this one in order to create date functions):
**Table 1**
-------------------------
Group | Date | Qty
1 | 2014-10-01 | 1
1 | 2014-11-01 | 1
1 | 2015-01-01 | 2
1 | 2015-05-01 | 1
1 | 2015-10-31 | 2
1 | 2015-11-01 | 1
2 | 2014-06-01 | 1
2 | 2014-07-01 | 1
2 | 2014-12-25 | 2
2 | 2015-01-01 | 1
2 | 2015-06-30 | 2
2 | 2015-07-01 | 1
With this information in mind, I need to create a TOTALYTD measure that dynamically changes the year_end_date parameter depending on which group the line item is in. I thought of using an IF statement, but realized that it wouldn't work in a measure. Something like this:
Total $ Sold YTD = TOTALYTD([TOTAL $ Sold], directSQLDateTable[date], ALL(directSQLDateTable[date]), IF([Group] = "A", "10/31", "6/30"))
In the end, I would like to create an output similar to this (the "Group A YTD" and "Group B YTD" columns really aren't needed; I just added them to illustrate the example):
Year-Month | Total_Qty | Group A YTD | Group B YTD
--------------------------------------------------
2014-07 | 1 | 0 | 1
2014-08 | 1 | 0 | 1
2014-09 | 1 | 0 | 1
2014-10 | 2 | 1 | 1
2014-11 | 2 | 1 | 1
2014-12 | 4 | 1 | 3
2015-01 | 7 | 3 | 4
2015-02 | 7 | 3 | 4
2015-03 | 7 | 3 | 4
2015-04 | 7 | 3 | 4
2015-05 | 8 | 4 | 4
2015-06 | 10 | 4 | 6
2015-07 | 5 | 4 | 1
2015-08 | 5 | 4 | 1
2015-09 | 5 | 4 | 1
2015-10 | 7 | 6 | 1
2015-11 | 2 | 1 | 1
2015-12 | 2 | 1 | 1
Please let me know if you have any questions. My apologies ahead of time if I didn't do that great of a job explaining this or if I have left out a piece of info.
Thanks for any advice/help in advance! You guys on here are the best!
TOTALYTD() includes everything you need for this.
TotalQty:= SUM(Table1[Qty])
QtyYTDGroup1:=
TOTALYTD(
[TotalQty]
,DimDate[Date]
,Table1[Group] = 1
,"10/31"
)
QtyYTDGroup2:=
TOTALYTD(
[TotalQty]
,DimDate[Date]
,Table1[Group] = 2
,"6/30"
)
TotalQtyYTD:= [QtyYTDGroup1] + [QtyYTDGroup2]
I can provide a detailed explanation if you want, but I think the function definition pretty much covers it.