How to select n numbers around the mean? - subset

I'm wondering if there is a way to select, say, 5 values around the mean of a series of numbers.
Below is an example with a series of numbers from 1:100, showing that I'm able to get the top or bottom 5 numbers from the series.
x = 1:100
y = 1:100
df = data.frame(x,y)
mean(df$x)
library(dplyr)
df %>% top_n(5, x)
df %>% top_n(-5, x)
In the end, I want 5 values that are situated around the mean of 50.5. For example, these numbers could be 48, 49, 50, 51, and 52.

library(dplyr)
df$z <- abs(df$x - mean(df$x))
arrange(df, z)$x[1:5]
This would get you the desired result!

Related

Sum of two square matrices code keeps failing

I have this task: 'Write a program that adds two square matrices. The program will read the dimension of the matrix, N, and will then read N*N numbers representing the first matrix, row by row. It will then read another N*N numbers representing the second matrix. The program will output the resulting matrix, one row per line.' I wrote the code below for it. However, the platform I am doing the task on keeps saying that 1 of 2 tests failed... It works just fine for me. Maybe the problem is on their side?
from operator import add
# Enter a digit for your matrix, e.g. if you want it to be 2x2 enter 2
n = int(input())
# Input the numbers for both matrices' rows, one at a time
matrix1_r1 = [int(input()) for x in range(n)]
matrix1_r2 = [int(input()) for x in range(n)]
matrix2_r1 = [int(input()) for x in range(n)]
matrix2_r2 = [int(input()) for x in range(n)]
final1 = list(map(add, matrix1_r1, matrix2_r1))
final2 = list(map(add, matrix1_r2, matrix2_r2))
print(final1)
print(final2)
Their sample input is:
2
1
2
3
4
5
6
7
8
Their sample output is:
[6, 8]
[10, 12]
Your code works for the example, and for any input that is 2 by 2. It will fail for any other matrix size, because your code only reads two rows for each matrix. Rather than hard-coding something so fundamental, you should use nested loops and a list of lists to get the right number of rows (a nested-loop sketch follows the snippet below). Or, if you want to be a little fancy, list comprehensions can do it all really neatly:
n = int(input())
matrix1 = [[int(input()) for col in range(n)] for row in range(n)]
matrix2 = [[int(input()) for col in range(n)] for row in range(n)]
matrix_sum = [[a + b for a, b in zip(row1, row2)] for row1, row2 in zip(matrix1, matrix2)]
print(*matrix_sum, sep='\n')
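If the comprehension version feels too dense, the nested-loop variant mentioned above would look roughly like this. This is only a sketch of the same idea, not the grader's reference solution, and the read_matrix helper is just for illustration:
n = int(input())

def read_matrix(n):
    # Read n rows of n numbers, one number per input line
    matrix = []
    for _ in range(n):
        row = []
        for _ in range(n):
            row.append(int(input()))
        matrix.append(row)
    return matrix

matrix1 = read_matrix(n)
matrix2 = read_matrix(n)

# Add the two matrices element by element
result = []
for i in range(n):
    row = []
    for j in range(n):
        row.append(matrix1[i][j] + matrix2[i][j])
    result.append(row)

for row in result:
    print(row)
For the sample input above this prints [6, 8] and [10, 12], one row per line, and it keeps working for any N.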

Handle an unknown number of columns while reading a CSV with pandas

I get the data from stdin like:
x
x
x y
x
x y z
...
and I want to create pandas DataFrame based on that input
df = pd.read_csv(sys.stdin, sep='\t', header=None)
, but the problem here is that the data on the third line has more values than the first line, and I get
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
The question is: how can I handle this error when I don't know the length of the longest chain of elements (separated by \t)?
The whole task can be performed in a single instruction, without any counting of elements in each row.
I prepared such an example reading from a string, using io.StringIO:
import io
import pandas as pd

df = pd.DataFrame([ln.rstrip().split('\t') for ln in
                   io.StringIO(txt).readlines()]).fillna('')
The list comprehension converts each source line into a list of fragments (between tabs).
This list is then passed as the data parameter to pd.DataFrame; note that such a list of rows can contain rows of different lengths.
I also added fillna('') to convert each NaN into an empty string (you are free to delete it if you wish).
To run the test, I defined the source variable as:
txt = '''x
x
x y
x
x y z
x
x y z v'''
and executed the above code, getting:
   0  1  2  3
0  x
1  x
2  x  y
3  x
4  x  y  z
5  x
6  x  y  z  v
In the target version, replace reading from a string with reading from stdin.
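For the stdin case, a minimal sketch of the same idea (assuming tab-separated input, as in the question) could be:
import sys
import pandas as pd

# Each stdin line is split on tabs; rows may have different lengths,
# and pandas pads the shorter ones with NaN, which fillna('') turns into ''.
rows = [line.rstrip('\n').split('\t') for line in sys.stdin]
df = pd.DataFrame(rows).fillna('')
print(df)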

DataFrame: how can I groupby Z and calculate the mean X in Y range

I have a data frame which includes 3 columns: Test, X and Y. I want to add a new column Xmean which contains the mean value of X, with a condition on Y, for each Test.
For example, Xmean contains the mean value of X where Y >= 5, for each Test.
import pandas as pd
df = pd.read_csv(r'Downloads\test.txt', delimiter=',', encoding='utf-8')
df_sort = df.sort_values("test")
df_filter = df_sort[df_sort['y'] >= 5]
# apply the aggregate function to find the mean
df_agg = df_filter.groupby('test')['x'].mean().rename('xmean').reset_index()
# join the two dataframes to get the desired output
df_final = pd.merge(df_sort[['test', 'x', 'y']], df_agg, on='test')
print(df_final)
Output attached
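Since the attached output isn't included here, below is a self-contained sketch with made-up sample data (the values are hypothetical, standing in for Downloads\test.txt) showing what the filter, groupby and merge produce:
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'test': ['A', 'A', 'A', 'B', 'B'],
    'x':    [1, 2, 3, 10, 20],
    'y':    [4, 6, 7, 5, 2],
})

df_filter = df[df['y'] >= 5]
df_agg = df_filter.groupby('test')['x'].mean().rename('xmean').reset_index()
df_final = pd.merge(df, df_agg, on='test')
print(df_final)
# Every row of test A gets xmean = 2.5 (mean of x where y >= 5: 2 and 3),
# every row of test B gets xmean = 10.0.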

I want to remove rows where a specific value doesn't increase. Is there a faster/more elegant way?

I have a dataframe with 30 columns, 1,000,000 rows and about 150 MB in size. One column is categorical with 7 different elements and another column (Depth) contains mostly increasing numbers. The graph for each of the elements looks more or less like this (figure not included).
I tried to save the column Depth as a series and iterate through it while dropping rows that don't match the criteria. This was reeeeeaaaally slow.
Afterwards I added a boolean column to the dataframe which indicates whether each row will be dropped, so I could drop the rows at the end in a single step. Still slow. My last try (the code for it is in this post) was to create a boolean list to record whether each row passes the criteria. Still really slow (about 5 hours).
dropList = [True]*len(df.index)
for element in elements:
    currentMax = 0
    minIdx = df.loc[df['Element']==element]['Depth'].index.min()
    maxIdx = df.loc[df['Element']==element]['Depth'].index.max()
    for x in range(minIdx, maxIdx):
        if df.loc[df['Element']==element]['Depth'][x] < currentMax:
            dropList[x] = False
        else:
            currentMax = df.loc[df['Element']==element]['Depth'][x]
df: The main dataframe
elements: a list with the 7 different elements (same as in the categorical column in df)
All rows in an element where the value of Depth isn't bigger than all previous ones should be dropped. With the next element it should start at 0 again.
Example:
Input: 'Depth' = [0 1 2 3 4 2 3 5 6]
'AnyOtherColumn' = [a b c d e f g h i]
Output: 'Depth' = [0 1 2 3 4 5 6]
'AnyOtherColumn' = [a b c d e h i]
This should apply to whole rows in the dataframe of course.
Is there a way to get this faster?
EDIT:
The whole rows of the input dataframe should stay as they are. Just the ones where the 'Depth' does not increase should be dropped.
EDIT2:
The remaining rows should stay in their initial order.
How about you take a 2-step approach: first use a fast sorting algorithm (for example Quicksort), and then get rid of all the duplicates?
Okay, I found a way that's faster. Here is the code:
from tqdm import tqdm

dropList = [True]*len(df.index)
for element in elements:
    currentMax = 0
    # 'Tiefe' is the Depth column
    minIdx = df.loc[df['Element']==element]['Tiefe'].index.min()
    # maxIdx = df.loc[df['Element']==element]['Tiefe'].index.max()
    elementList = df.loc[df['Element']==element]['Tiefe'].to_list()
    for x in tqdm(range(len(elementList))):
        if elementList[x] < currentMax:
            dropList[x+minIdx] = False
        else:
            currentMax = elementList[x]
I took the column and saved it as a list. To preserve the index of the dataframe, I saved the lowest one, and within the loop it gets added back on.
Overall it seems the problem was the loc function. From initially 5 hours of runtime, it's now about 10 seconds.
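For comparison, the same "drop rows that fall below the running maximum within their element" rule can also be written without any Python-level loop. This is only a sketch on a made-up miniature dataframe, and it mirrors the comparison in the loop above, so rows equal to the current maximum are kept:
import pandas as pd

# Hypothetical miniature version of the real dataframe
df = pd.DataFrame({
    'Element': ['A'] * 9,
    'Tiefe':   [0, 1, 2, 3, 4, 2, 3, 5, 6],
    'Other':   list('abcdefghi'),
})

# Running maximum of 'Tiefe' within each element, in the original row order
running_max = df.groupby('Element')['Tiefe'].cummax()

# A row survives only if its depth is the new running maximum
result = df[df['Tiefe'] == running_max]
print(result)
# keeps the rows with Tiefe 0 1 2 3 4 5 6, in their initial order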

Plotting Number of Events that Occur in an Interval Histogram

In Excel, I have two fields per row, a start date/time and an end date/time. I am looking to plot a histogram that shows how many of the rows' intervals contain the time on the x axis.
For example, some start and end times could be: [1,3], [3,4], [7,9], and [7,8]
And I want an output similar to:
    x       x x
x x x x     x x x
1.2.3.4.5.6.7.8.9
How can this be done?
One way is to split your tuples (say with Text to Columns) into, say, Columns A:B, starting in Row 3 (say down to Row 6), and series-fill D1:L9 (or to suit) with D1 = 1 and an integer increase.
Then in D3 copied across to L3 and D3:L3 copied down to Row6:
=IF(AND(D$1>=$A3,D$1<=$B3),"X","")
and in D2 copied across to L2:
=COUNTIF(D3:D6,"X")
Then make a column chart (INSERT > Charts) from D1:L2.
However, I may have misunderstood because there is no particular time significance to the above - the data is just treated like integers.
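If it helps to see the counting logic outside of Excel, here is a small Python sketch of the same idea, treating the times as plain integers just like the worksheet above (the intervals are the ones from the question):
from collections import Counter

intervals = [(1, 3), (3, 4), (7, 9), (7, 8)]

# Count how many intervals contain each integer time point
counts = Counter()
for start, end in intervals:
    for t in range(start, end + 1):
        counts[t] += 1

# One line per time point, with one 'x' per interval covering it
for t in range(1, 10):
    print(t, 'x' * counts[t])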
