Filter and select random rows - excel

I am not familiar with the pandas and openpyxl packages in Python, but I was wondering whether it is possible to select random, filtered rows from an Excel file and save the output in a separate Excel file.
For example, the data below represents what is in my Excel file:
A  B  C   D  E  F  G
1  x  -8  x  x  x  10/12/2019
3  x  +9  x  x  x  28/12/2019
4  x  +2  x  x  x  12/12/2019
1  x  -2  x  x  x  15/12/2019
4  x  -9  x  x  x  19/12/2019
5  x  -1  x  x  x  25/12/2019
What I would like to do is:
Filter to only negative values in column C ("-ve")
Filter to only items (column G) dated from 15/12/2019 onwards
Randomly select samples (for example, 2) from this filtered list
Save the 2 randomly selected samples in a new Excel file, with the headers (A, B, C, D, E, F, G)
Is there any way I could do this?
Potentially, I will be using the script well over 20 times.
Thank you!

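You can use the following (assuming df already holds the sheet, e.g. loaded with pd.read_excel):
import pandas as pd
from numpy.random import choice
# ensure that the date-like column is of type datetime (the dates are day-first)
df['G'] = pd.to_datetime(df['G'], dayfirst=True)
# check the 1st and 2nd conditions: negative values in C, dates from 15/12/2019 onwards
out = df.loc[df['C'].lt(0) & df['G'].ge(pd.to_datetime('15/12/2019', dayfirst=True))]
# pick 2 distinct random rows from the filtered rows (replace=False avoids picking the same row twice) and save them to Excel
out.loc[choice(out.index, 2, replace=False)].to_excel('filename.xlsx', index=False)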
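Since the script is meant to be rerun many times, the same steps can be wrapped in a function. This is only a minimal sketch, not the answer's exact method: it swaps numpy's choice for DataFrame.sample, which draws without replacement by default. The function name filter_and_sample, its seed parameter, and the in-memory sample data (standing in for pd.read_excel) are assumptions made for illustration.

```python
import pandas as pd

def filter_and_sample(df, n=2, start="15/12/2019", seed=None):
    """Return n distinct random rows with negative C and G on or after start."""
    df = df.copy()
    # The dates are written day-first (dd/mm/yyyy), so parse them explicitly.
    df["G"] = pd.to_datetime(df["G"], format="%d/%m/%Y")
    cutoff = pd.to_datetime(start, format="%d/%m/%Y")
    filtered = df[(df["C"] < 0) & (df["G"] >= cutoff)]
    # sample() draws without replacement by default and raises ValueError
    # if fewer than n rows survive the filters.
    return filtered.sample(n=n, random_state=seed)

# Sample data mirroring the question's table (hypothetical stand-in for the Excel file).
data = pd.DataFrame({
    "A": [1, 3, 4, 1, 4, 5],
    "B": ["x"] * 6,
    "C": [-8, 9, 2, -2, -9, -1],
    "D": ["x"] * 6,
    "E": ["x"] * 6,
    "F": ["x"] * 6,
    "G": ["10/12/2019", "28/12/2019", "12/12/2019",
          "15/12/2019", "19/12/2019", "25/12/2019"],
})

picked = filter_and_sample(data, n=2, seed=0)
# picked.to_excel("filename.xlsx", index=False)  # requires openpyxl to be installed
```

Passing a seed makes a run repeatable; leaving it as None gives a different pair of rows each time.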

Related

Handle unknown number of columns while pandas csv read

I get data from stdin like this:
x
x
x y
x
x y z
...
and I want to create a pandas DataFrame based on that input:
df = pd.read_csv(sys.stdin, sep='\t', header=None)
The problem is that the third line of my data has more values than the first one, so I get:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
The question is: how can I handle this error when I don't know the length of the longest chain of elements (separated by \t)?
The whole task can be performed in a single instruction,
without any counting of elements in each row.
I prepared such an example reading from a string, using io.StringIO:
import io
import pandas as pd
df = pd.DataFrame([ ln.rstrip().split('\t') for ln in
    io.StringIO(txt).readlines() ]).fillna('')
The list comprehension converts each source line into a list
of fragments (between Tabs).
The resulting list is then passed as the data parameter to pd.DataFrame;
note that such a list can contain rows of different lengths.
I also added fillna('') to convert each missing value into an empty string
(you are free to delete it if you wish).
To run the test, I defined the source variable as follows (each \t is a Tab separator):
txt = '''x
x
x\ty
x
x\ty\tz
x
x\ty\tz\tv'''
and executed the above code, getting:
   0  1  2  3
0  x
1  x
2  x  y
3  x
4  x  y  z
5  x
6  x  y  z  v
In the target version, replace reading from a string with reading from
stdin.
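As a rough sketch of that last step, the same list comprehension can be wrapped in a function that accepts any line-based stream, so sys.stdin and io.StringIO are interchangeable. The function name read_ragged is an assumption for illustration. Unlike the snippet above, it strips only the line ending, so trailing empty fields are kept.

```python
import io
import pandas as pd

def read_ragged(stream):
    """Build a DataFrame from tab-separated lines of varying length."""
    # Strip only the line ending, then split each line on Tab characters.
    rows = [line.rstrip("\r\n").split("\t") for line in stream]
    # pd.DataFrame pads shorter rows with missing values; replace them with ''.
    return pd.DataFrame(rows).fillna("")

# In the real script: import sys; df = read_ragged(sys.stdin)
# Here an in-memory stream stands in for stdin.
demo = io.StringIO("x\nx\nx\ty\nx\nx\ty\tz\n")
df = read_ragged(demo)
```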

Pandas, DataFrame unique values from few columns [duplicate]

I am trying to count the unique values that appear across a few columns. My data frame looks like this:
Name Name.1 Name.2 Name.3
x z c y
y p q x
q p a y
The output should look like this:
x 2
z 1
c 1
y 3
q 2
p 2
a 1
I tried groupby and value_counts but couldn't get the correct output. Any ideas? Thanks, all!
It seems you want to count values regardless of their row or column location. In that case you should flatten the dataframe and just use Counter.
from collections import Counter
import numpy as np
arr = np.array(df)
# flatten the 2-D array to 1-D so that every cell is counted
count = Counter(arr.reshape(arr.size))
Another (pandas-based) approach is to apply Series.value_counts to each column and then sum the counts across columns (axis=1):
df2 = df.apply(pd.Series.value_counts)
print(df2.sum(axis=1).astype(int))
a 1
c 1
p 2
q 2
x 2
y 3
z 1
dtype: int32
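To check both ideas against the question's data, here is a small self-contained sketch. The DataFrame is rebuilt from the table in the question. The second approach uses stack().value_counts(), a variant that is not part of the answer above, to collapse all columns into one Series before counting.

```python
from collections import Counter
import pandas as pd

# The question's data frame, rebuilt by hand.
df = pd.DataFrame({
    "Name":   ["x", "y", "q"],
    "Name.1": ["z", "p", "p"],
    "Name.2": ["c", "q", "a"],
    "Name.3": ["y", "x", "y"],
})

# Approach 1: flatten every cell into a 1-D array and count with Counter.
counter_counts = Counter(df.to_numpy().ravel())

# Approach 2 (variant): stack all columns into one Series, then count values.
stacked_counts = df.stack().value_counts()
```

Both give the same totals; stack() also drops missing cells, which matters if the columns contain NaN.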

add numeric prefix to pandas dataframe column names

How would I add a variable numeric prefix to DataFrame column names?
If I have a DataFrame df:
  colA colB
0    A    X
1    B    Y
2    C    Z
How would I rename the columns according to their column number? Something like this:
  1_colA 2_colB
0      A      X
1      B      Y
2      C      Z
The actual number of columns is too large for them to be renamed manually.
Thanks for the help.
Use enumerate (starting from 1) to number the columns, with f-strings and a list comprehension:
#python 3.6+
df.columns = [f'{i}_{x}' for i, x in enumerate(df.columns, 1)]
#python below 3.6
#df.columns = ['{}_{}'.format(i, x) for i, x in enumerate(df.columns, 1)]
print(df)
1_colA 2_colB
0 A X
1 B Y
2 C Z

Order rows chronologically by using Time Value provided in a column - Do this for many dates

I have a sheet with dates in the first column and times in the second column. The other columns hold more details (which are not part of the problem).
The problem is that a given date has n rows (each with the same date in the first column), but the times are not in chronological order.
Say, on 7th Jan there are 4 rows of data with times such as:
7-jan-2016 14:25:33 x y z
7-jan-2016 10:43:51 v t s
7-jan-2016 13:01:02 h m p
7-jan-2016 12:48:15 l p l
9-jan-2016 problem same as above
I need to rearrange the rows chronologically for each date, so that the above looks like this:
7-jan-2016 10:43:51 v t s
7-jan-2016 12:48:15 l p l
7-jan-2016 13:01:02 h m p
7-jan-2016 14:25:33 x y z
9-jan-2016 no more problems.. and as above..
How can I achieve this without manually cutting and pasting thousands of rows?
In the Sort dialog (Data > Sort), sort by the first column, then select Add Level and sort by the second column.
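If the sheet is processed in pandas rather than in Excel, the same two-level sort can be sketched as below. This is an illustration under assumptions, not part of the answer: the column names date, time and details are made up, and the 9-jan row is hypothetical sample data added so that more than one date is present.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["9-jan-2016", "7-jan-2016", "7-jan-2016", "7-jan-2016", "7-jan-2016"],
    "time": ["08:00:00", "14:25:33", "10:43:51", "13:01:02", "12:48:15"],
    "details": ["a b c", "x y z", "v t s", "h m p", "l p l"],  # first row is hypothetical
})

# Parse the date text into real dates so that days sort chronologically.
df["date_key"] = pd.to_datetime(df["date"], format="%d-%b-%Y")

# First level: the date; second level: the time.
# Zero-padded HH:MM:SS strings already sort in chronological order.
out = df.sort_values(["date_key", "time"]).drop(columns="date_key")
```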

Excel: if any field in a row is not null, set the cell in the first column to an incrementing row number?

I'm trying to create a formula that will have the effect of automatically adding an incremental row number to the first column of my spreadsheet when any data is entered in any column of the row.
So for example, if my table looks like this:
a b c d e f
1 # x x x x x
2 # x x x x x
3 # x x
4 # x
5 # x
I would like the 'a' column to read:
0
1
2
3
4
And if I added more data in any column, e.g.:
a b c d e f
1 # x x x x x
2 # x x x x x
3 # x x x
4 # x x
5 # x x
6 # x
7 # x
It would then automatically update the index column 'a' to read:
0
1
2
3
4
5
6
I hope this makes sense! Any help, or even just a tutorial to point me in the right direction would be so appreciated!
Here is the answer:
=if(counta(B2:F2)>0,A1+1,"")
The above works! Thanks to Alan Whitelaw for the answer; I am only posting this for others because it has the corrected syntax.
If I understand correctly, and the sheet will always be filled in "in order" down the rows, this should work:
=if(counta(B2:D2)>0,A1+1,"")
Pop this in A2, where B2:D2 is the rest of the row to test.
Excel's counta() counts non-blank cells.
