script for splitting a file based on number of lines [duplicate] - linux

I need help splitting a big file (1.6 M records) into multiple files, with a maximum number of lines allowed per sub-file and the caveat that an order must not spill across files and appear in more than one of them.
Quick overview of the file:
The file has order information about transactions at a retail store. Each order can have multiple items. Below is a small sample of the file.
sample_file:
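| order_nu | item_nu | Sale |
| -------- | --------|-------|
| 1 | 1 | 10 |
| 1 | 2 | 20 |
| 1 | 3 | 30 |
| 2 | 1 | 10 |
| 2 | 2 | 20 |
| 3 | 1 | 10 |
| 3 | 2 | 10 |
| 4 | 1 | 20 |
| 4 | 2 | 24 |
| 4 | 3 | 34 |
| 4 | 4 | 10 |
| 4 | 5 | 20 |
| 5 | 1 | 30 |
| 5 | 2 | 20 |
| 5 | 3 | 40 |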
Is it possible to write a Linux script that splits a file based on the number of lines, with the caveat that an order must not spill across files and appear in more than one file?
For example, the file above should be split so that no sub-file has more than 5 records and no order appears in more than one file (the assumption is that an order will not have more than 5 items). Below is the expected output:
sub_file1 :
| order_nu | item_nu | Sale |
| -------- | --------|-------|
| 1 | 1 | 10 |
| 1 | 2 | 20 |
| 1 | 3 | 30 |
| 2 | 1 | 10 |
| 2 | 2 | 20 |
sub_file2:
| order_nu | item_nu | Sale |
| -------- | --------|-------|
| 3 | 1 | 10 |
| 3 | 2 | 10 |
sub_file3:
| order_nu | item_nu | Sale |
| -------- | --------|-------|
| 4 | 1 | 20 |
| 4 | 2 | 24 |
| 4 | 3 | 34 |
| 4 | 4 | 10 |
| 4 | 5 | 20 |
sub_file4:
| order_nu | item_nu | Sale |
| -------- | --------|-------|
| 5 | 1 | 30 |
| 5 | 2 | 20 |
| 5 | 3 | 40 |
Please let me know if there are any questions
Thank you!

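Try something like this. It buffers the records of one order at a time, so an order is never split across files (it assumes the order number is the first field on each line, the file is sorted by order, and the first line is a header):
max_lines=5   # maximum number of records per sub-file
counter=1     # index of the current sub-file
count=0       # records already written to the current sub-file
prev=""; buf=""; n=0   # current order number, its buffered records, their count
flush() {     # write the buffered order, moving to a new file if it would not fit
    [ "$n" -eq 0 ] && return
    if [ "$count" -gt 0 ] && [ $((count + n)) -gt "$max_lines" ]; then
        counter=$((counter+1)); count=0
    fi
    printf '%s' "$buf" >> "sub_file$counter.txt"
    count=$((count + n)); buf=""; n=0
}
while IFS= read -r line
do
    order=${line%%[!0-9]*}   # leading digits of the line = order number
    [ "$order" != "$prev" ] && flush
    prev=$order; buf+="$line"$'\n'; n=$((n+1))
done < <(tail -n +2 sample_file.txt)   # tail skips the header line
flush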
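
The requirement in the question (at most N records per sub-file, and no order split across sub-files) can also be sketched in Python. This is a minimal illustration, not a drop-in tool: the function name split_by_order and the tuple layout are assumptions, the rows are taken to be sorted by order number already, and no single order is expected to exceed the limit.

```python
from itertools import groupby

def split_by_order(rows, max_lines):
    """Group rows into chunks of at most max_lines rows without splitting an order.

    rows: iterable of (order_nu, item_nu, sale) tuples, already sorted by order.
    Assumes no single order has more than max_lines rows.
    """
    chunks = [[]]
    for _, items in groupby(rows, key=lambda r: r[0]):
        order_rows = list(items)
        # Start a new chunk when this order would not fit in the current one.
        if chunks[-1] and len(chunks[-1]) + len(order_rows) > max_lines:
            chunks.append([])
        chunks[-1].extend(order_rows)
    return chunks

# The sample data from the question.
rows = [(1, 1, 10), (1, 2, 20), (1, 3, 30), (2, 1, 10), (2, 2, 20),
        (3, 1, 10), (3, 2, 10), (4, 1, 20), (4, 2, 24), (4, 3, 34),
        (4, 4, 10), (4, 5, 20), (5, 1, 30), (5, 2, 20), (5, 3, 40)]
chunks = split_by_order(rows, max_lines=5)
print([len(c) for c in chunks])  # [5, 2, 5, 3]
```

Each chunk can then be written to its own file. The chunk sizes match sub_file1 to sub_file4 in the expected output.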

Related

Auto Incrementing Number Values in Excel

I have to renumber over 30,000 rows in Excel and am looking for a way to do this with a built-in Excel function.
I have two columns, the original BuildingCount and the Test column. The BuildingCount column has an inconsistent count that needs to be consecutive (1, 2, 3, ...) in order to run a macro. However, the numbers are not always consecutive. I have been writing different variations of Excel functions. Below is the output for =IF(A2>1),A2+1,1)
+----+---------------+------------+
| | A | B |
+----+---------------+------------+
| 1 | BuildingCount | TestColumn |
| 2 | 1 | #VALUE! |
| 3 | 2 | 1 |
| 4 | 3 | 3 |
| 5 | 5 | 4 |
| 6 | 6 | 6 |
| 7 | 9 | 7 |
| 8 | 1 | 10 |
| 9 | 2 | 1 |
| 10 | 3 | 3 |
| 11 | 4 | 4 |
| 12 | 5 | 5 |
+----+---------------+------------+
Ideally, the output would be the following:
+----+---------------+------------+
| | A | B |
+----+---------------+------------+
| 1 | BuildingCount | TestColumn |
| 2 | 1 | 1 |
| 3 | 2 | 2 |
| 4 | 3 | 3 |
| 5 | 5 | 4 |
| 6 | 6 | 5 |
| 7 | 7 | 6 |
| 8 | 1 | 1 |
| 9 | 2 | 2 |
| 10 | 3 | 3 |
| 11 | 4 | 4 |
| 12 | 5 | 5 |
+----+---------------+------------+
Any ideas would be very welcome.
Put this formula in B2:
=IF(ROW()=2,1,IF(A2>A1,B1+1,1))
and drag it down. The count continues while column A keeps increasing and restarts at 1 when it drops.
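
The logic of that formula can be checked outside Excel with a short Python sketch. The function name renumber is hypothetical, and the input is column A from the first table in the question.

```python
def renumber(counts):
    """Mirror =IF(ROW()=2,1,IF(A2>A1,B1+1,1)) over a list of values."""
    result = []
    for i, value in enumerate(counts):
        if i > 0 and value > counts[i - 1]:
            # Still increasing: continue the consecutive count.
            result.append(result[-1] + 1)
        else:
            # First row, or the value dropped: restart at 1.
            result.append(1)
    return result

building_count = [1, 2, 3, 5, 6, 9, 1, 2, 3, 4, 5]
print(renumber(building_count))  # [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5]
```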

Python selecting different number of rows for each group of a multilevel index

I have a data frame with a multilevel index. I would like to sort this data frame based on a specific column and extract the first n rows for each group of the first index, but n is different for each group.
For example:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| | 10 | 1 | 2 |
| 2 | 20 | 2 | 1 |
| | 50 | 1 | 1 |
the result should look like this:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| 2 | 20 | 2 | 1 |
I got this far:
df.groupby(level=[0,1]).sum().sort_values(['Index1','Sort_In_descending_order'],ascending=False).groupby('Index1').head(2)
However, .head(2) picks 2 elements from each group regardless of the number in the column "How_manyRows_toChoose".
A piece of code would be great!
Thank you!
Use a lambda function in GroupBy.apply with head, and add the parameter group_keys=False to avoid duplicated index values:
# original code (with level=[0,1] passed as a keyword)
df = (df.groupby(level=[0,1])
        .sum()
        .sort_values(['Index1','Sort_In_descending_order'],ascending=False))
# take the first n rows of each group, where n comes from the group's own column
df = (df.groupby('Index1', group_keys=False)
        .apply(lambda x: x.head(x['How_manyRows_toChoose'].iat[0])))
print (df)
               Sort_In_descending_order  How_manyRows_toChoose
Index1 Index2
1      20                             3                      2
       40                             2                      2
2      20                             2                      1
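
For readers who want to see the selection rule without pandas, here is a plain-Python sketch of the same idea: sort within each group in descending order, then keep as many rows as the group's own count column says. The function name top_n_per_group and the tuple layout are assumptions made for illustration.

```python
from itertools import groupby

def top_n_per_group(rows):
    """rows: (index1, index2, sort_value, n_rows) tuples.

    Returns, for each index1 group, its first n_rows rows after sorting
    by sort_value in descending order.
    """
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    result = []
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        n = group[0][3]  # n is constant within a group, so read it from the first row
        result.extend(group[:n])
    return result

rows = [(1, 20, 3, 2), (1, 40, 2, 2), (1, 10, 1, 2),
        (2, 20, 2, 1), (2, 50, 1, 1)]
print(top_n_per_group(rows))
# [(1, 20, 3, 2), (1, 40, 2, 2), (2, 20, 2, 1)]
```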

pandas rearranging experiment data

So I've looked at some other posts, but they didn't quite help. I'm not new to python, but I'm relatively new to pandas and this has me stumped as to how to accomplish it in any manner that's not horribly inefficient. The data sets I've got are a little bit large and have some extraneous columns of data that I don't need, I've got them loaded as dataframes but they basically look like this:
+---------+---------+--------+-------+
| Subject | Week | Test | Value |
+---------+---------+--------+-------+
| 1 | Week 4 | Test 1 | 4 |
| 1 | Week 8 | Test 1 | 7 |
| 1 | Week 12 | Test 1 | 3 |
| 1 | Week 4 | Test 2 | 6 |
| 1 | Week 8 | Test 2 | 3 |
| 1 | Week 12 | Test 2 | 9 |
| 2 | Week 4 | Test 1 | 1 |
| 2 | Week 8 | Test 1 | 4 |
| 2 | Week 12 | Test 1 | 2 |
| 2 | Week 4 | Test 2 | 8 |
| 2 | Week 8 | Test 2 | 1 |
| 2 | Week 12 | Test 2 | 3 |
+---------+---------+--------+-------+
I want to rearrange the dataframes so that they look like this:
+---------+---------+--------+--------+
| Subject | Week | Test 1 | Test 2 |
+---------+---------+--------+--------+
| 1 | Week 4 | 4 | 6 |
| 1 | Week 8 | 7 | 3 |
| 1 | Week 12 | 3 | 9 |
| 2 | Week 4 | 1 | 8 |
| 2 | Week 8 | 4 | 1 |
| 2 | Week 12 | 2 | 3 |
+---------+---------+--------+--------+
If anyone has any ideas on how I can make this happen, I'd greatly appreciate it, and thank you in advance for your time!
Edit: After trying the solution provided by @HarvIpan, this is the output I'm getting:
+-----------------------------------------------+
| Subject Week Test_Test 1 Test_Test 2 |
+-----------------------------------------------+
| 0 1 Week 12 5 0 |
| 1 1 Week 4 5 0 |
| 2 1 Week 8 11 0 |
| 3 2 Week 12 0 12 |
| 4 2 Week 4 0 14 |
| 5 2 Week 8 0 4 |
+-----------------------------------------------+
Try using df.pivot_table.
You should be able to get the desired outcome with:
df.pivot_table(index=['Subject','Week'], columns='Test', values='Value')
Alternatively, create dummy variables for the Test column with pd.get_dummies(df[['Test']], prefix='Test') and multiply them by Value with .mul(df['Value'], axis=0) before concatenating them back onto the original df. Then group by Subject and Week and sum.
pd.concat([df.drop(['Test', 'Value'], axis=1), pd.get_dummies(df[['Test']], prefix='Test').mul(df['Value'], axis=0)], axis=1).groupby(['Subject', 'Week']).sum().reset_index()
Output:
Subject Week Test_ Test 1 Test_ Test 2
0 1 Week 12 3 9
1 1 Week 4 4 6
2 1 Week 8 7 3
3 2 Week 12 2 3
4 2 Week 4 1 8
5 2 Week 8 4 1
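
What both approaches compute (one row per Subject and Week, one column per Test) can be sketched in plain Python. This is only an illustration of the long-to-wide reshaping, with a hypothetical function name pivot and the data from the question. It assumes each (Subject, Week, Test) combination occurs once.

```python
def pivot(records):
    """Reshape (subject, week, test, value) records from long to wide.

    Returns a dict keyed by (subject, week) whose values map test -> value.
    """
    wide = {}
    for subject, week, test, value in records:
        wide.setdefault((subject, week), {})[test] = value
    return wide

records = [
    (1, "Week 4", "Test 1", 4), (1, "Week 8", "Test 1", 7), (1, "Week 12", "Test 1", 3),
    (1, "Week 4", "Test 2", 6), (1, "Week 8", "Test 2", 3), (1, "Week 12", "Test 2", 9),
    (2, "Week 4", "Test 1", 1), (2, "Week 8", "Test 1", 4), (2, "Week 12", "Test 1", 2),
    (2, "Week 4", "Test 2", 8), (2, "Week 8", "Test 2", 1), (2, "Week 12", "Test 2", 3),
]
wide = pivot(records)
print(wide[(1, "Week 4")])  # {'Test 1': 4, 'Test 2': 6}
```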

Assigning ranks to items that vary in order

I am trying to build a dataset from an online questionnaire. In this questionnaire, participants were asked to name 6 items. These items are represented with numbers from 1 to 6 (order of mention does not matter). Afterwards, participants were asked to rank those items from most important to least important (order here matters). Right now I have three columns: "Named items", "Items ranked" and "Rank". The last column represents the position at which each item was ranked. Thus, the idea is to take the number in the first column, "Named items", search for its position in the second column, "Items ranked", and return that position in the corresponding row of the third column.
Since the numbers go from 1 to 6, the process has to start again every six rows, beginning on the 7th row. I have a total of 186 participants, which means there is a total of 1116 items. What would be the most efficient way of doing this while preventing human error?
Here is an example of what the sheet looks like when done manually:
+----------------------+-----------------------------+------+
| Order of named items | Items ranked (# = Identity) | Rank |
+----------------------+-----------------------------+------+
| 1 | 2 | 4 |
| 2 | 5 | 1 |
| 3 | 6 | 6 |
| 4 | 1 | 5 |
| 5 | 4 | 2 |
| 6 | 3 | 3 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 5 |
| 6 | 6 | 6 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 5 |
| 6 | 6 | 6 |
| 1 | 5 | 3 |
| 2 | 6 | 4 |
| 3 | 1 | 5 |
| 4 | 2 | 6 |
| 5 | 3 | 1 |
| 6 | 4 | 2 |
| 1 | 2 | 2 |
| 2 | 1 | 1 |
| 3 | 6 | 4 |
| 4 | 3 | 5 |
| 5 | 4 | 6 |
| 6 | 5 | 3 |
+----------------------+-----------------------------+------+
You can use this non-volatile formula:
=MATCH(A2,INDEX(B:B,INT((ROW(1:1)-1)/6)*6+2):INDEX(B:B,INT((ROW(1:1)-1)/6)*6+7),0)
Assuming the first column starts at A2 and the second column at B2, use this formula in C2 and copy it down:
=MATCH(A2,OFFSET(B$2,6*INT((ROWS(C$2:C2)-1)/6),0,6),0)
OFFSET returns the required 6-cell range, and MATCH finds the position of the relevant item within it.
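
The block-wise lookup that both formulas perform can be sketched in Python to verify the expected ranks. The function name ranks and the block parameter are assumptions. The data is the first and fourth participant from the example sheet.

```python
def ranks(named, ranked, block=6):
    """For each named item, return its 1-based position within the
    matching block of the ranked column (what MATCH returns)."""
    result = []
    for start in range(0, len(named), block):
        window = ranked[start:start + block]  # the 6 ranked cells for this participant
        for item in named[start:start + block]:
            result.append(window.index(item) + 1)
    return result

named = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
ranked = [2, 5, 6, 1, 4, 3, 5, 6, 1, 2, 3, 4]
print(ranks(named, ranked))  # [4, 1, 6, 5, 2, 3, 3, 4, 5, 6, 1, 2]
```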

Excel: Give scores based on range, where max = 1 and min = 10

I have the following problem:
I want to assign scores from 1 to 10 to a range of numbers, for example:
| | A | B |
|---|------|----|
| 1 | 1209 | 1 |
| 2 | 401 | 7 |
| 3 | 123 | 9 |
| 4 | 49 | 10 |
| 5 | 30 | 10 |
(Not sure if B is 100% correct but roughly)
I got the B values with
=ABS(CEILING(A1;MAX($A$1:$A$32)/10)*10/MAX($A$1:$A$32)-11)
It seems to work, but if I take numbers like the following, for example, I get:
| | A | B |
|---|------|----|
| 1 | 100 | 1 |
| 2 | 90 | 2 |
| 3 | 80 | 3 |
| 4 | 70 | 4 |
| 5 | 50 | 6 |
Here I want 50 to be scored 10.
I would like it to be scalable, so that it works with a 1-10, 1-100, 5-27 or any other scale, with any number of values in the list and any values to score.
Thanks!
Use this formula:
=$E$1 + ROUND((MIN($A:$A)-A1)/((MAX($A:$A)-MIN($A:$A))/($E$1-$E$2)),0)
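It is scalable. Put the upper and lower bounds of the scale in E1 and E2 (for example 10 and 1): the smallest value in column A gets the score in E1 and the largest gets the score in E2.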
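
To see what the formula does with the numbers from the question, here is a small Python sketch of the same arithmetic. The names scores, top and bottom are assumptions (top and bottom stand for E1 and E2), and excel_round is a helper that rounds halves away from zero, as Excel's ROUND does, because Python's built-in round uses banker's rounding.

```python
import math

def excel_round(x):
    """Round to the nearest integer, halves away from zero (like Excel's ROUND)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def scores(values, top=10, bottom=1):
    """Mirror =$E$1 + ROUND((MIN-A1)/((MAX-MIN)/($E$1-$E$2)),0) with E1=top, E2=bottom.

    Raises ZeroDivisionError if all values are equal (Excel would show #DIV/0!).
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / (top - bottom)  # width of one score step in value units
    return [top + excel_round((lo - v) / step) for v in values]

print(scores([100, 90, 80, 70, 50]))     # [1, 3, 5, 6, 10]
print(scores([1209, 401, 123, 49, 30]))  # [1, 7, 9, 10, 10]
```

The largest value maps to the bottom of the scale and the smallest to the top, so 50 is scored 10 as requested.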
