Filter multiple columns between multiple ranges - python-3.x

I have a very big dataset containing mostly numerical values. I want to filter multiple columns so that each one falls between a different range. The problem is that the columns and ranges are selected by the user, which means the filtered columns and ranges can change each time.
e.g. 0<df[a]<5 & 0<df[b]<10. It could also be "a", "b" and "c"; it depends entirely on the input.
I want to see how many rows fall in each range, e.g. for each column: col.a between "0" and "1", "1" and "2", etc. up to 5, and the same for col.b or any other column up to, say, "10".
Because my code is very long, I have tried to explain the attached part inside the docstring below:
# -*- coding: utf-8 -*-
"""
excel_file: dataframe read from the excel file
entered_parameters: (list) columns to be filtered, typed by the user
parameters: columns read from excel_file
limits: (list) upper limits entered by the user for each entered parameter
ranges: range/increment list for each entered parameter
boolean_frame: boolean dataframe returned for filtering each entered parameter (column) up to its limit in each cycle
total_boolean_frame: appended boolean_frame (shows ranges up to limits for each parameter)
total_frame: concat of total_boolean_frame (shows all filtered boolean values by range for all parameters)
"""
import pandas as pd

total_frame = pd.DataFrame()
parameters = [i for i in excel_file.columns if type(i) == str]
totalrownumberlist = []
for i, v in enumerate(limits):
    if i == 0:
        totalrownumberlist.append(len(excel_file) * v)
    else:
        totalrownumberlist.append(totalrownumberlist[i - 1] * v)
totalrownumber = totalrownumberlist[-1]
for i, param in enumerate(entered_parameters):
    total_boolean_frame = pd.DataFrame()
    appended_row_num = totalrownumberlist[i]
    if param in parameters:
        while appended_row_num <= totalrownumber:
            boolean_frame = pd.DataFrame()
            initial = 0
            while initial < limits[i]:
                boolean_frame[param] = (excel_file[param] >= initial) & (excel_file[param] <= initial + ranges[i])
                boolean_frame["aralik-%s" % param] = "%s-%s" % (initial, initial + ranges[i])
                initial = initial + ranges[i]
                total_boolean_frame = total_boolean_frame.append(boolean_frame, sort=False, ignore_index=True)
            appended_row_num = appended_row_num + totalrownumberlist[i]
    total_frame = pd.concat([total_frame, total_boolean_frame], axis=1)
Edit: the output should look like this: count(range[0-1] of col.a and range[0-1] of col.b) = 2, together with the average over axis=1 of the rows where all cells are True, i.e. the row averages of excel_file[total_frame.all(axis=1)]; count(range[1-2] of col.a and range[0-1] of col.b) = 3 with the average again; count(range[2-3] of col.a and range[0-1] of col.b) = 6 with its average; and so on.
Thanks
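For reference, a minimal sketch of one possible way to get those counts and averages, assuming the user-supplied entered_parameters, limits and ranges from the question; the column names "a" and "b" and the sample data are illustrative only. Each selected column is binned with pd.cut and the rows are counted (and averaged) per combination of bins:
import numpy as np
import pandas as pd

# Illustrative data; in the question this would be excel_file.
excel_file = pd.DataFrame({"a": np.random.uniform(0, 5, 100),
                           "b": np.random.uniform(0, 10, 100)})

entered_parameters = ["a", "b"]   # columns chosen by the user
limits = [5, 10]                  # upper limit per column
ranges = [1, 2]                   # bin width per column

bins = []
for param, limit, step in zip(entered_parameters, limits, ranges):
    edges = np.arange(0, limit + step, step)
    bins.append(pd.cut(excel_file[param], bins=edges).rename("aralik-%s" % param))

# Row count and per-column average for every combination of ranges.
grouped = excel_file.groupby(bins, observed=False)
print(grouped.size())
print(grouped[entered_parameters].mean())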

Related

How do I drop complete rows (including all values in it) that contain a certain value in my Pandas dataframe?

I'm trying to write a python script that finds unique values (names) and reports the frequency of their occurrence, making use of the Pandas library. There's a total of around 90 unique names, which I've anonymised in the head of the dataframe pasted below.
,1,2,3,4,5
0,monday09-01-2022,tuesday10-01-2022,wednesday11-01-2022,thursday12-01-2022,friday13-01-2022
1,Anonymous 1,Anonymous 1,Anonymous 1,Anonymous 1,
2,Anonymous 2,Anonymous 4,Anonymous 5,Anonymous 5,Anonymous 5
3,Anonymous 3,Anonymous 3,,Anonymous 6,Anonymous 3
4,,,,,
I'm trying to drop any row (the full row) that contains the regex "^monday.*", meaning the word "monday" followed by any number of other characters. I want to drop/deselect every cell/value within that row.
To achieve this goal, I've tried using the line of code below (and many other approaches I found on SO).
df = df[df[1].str.contains("^monday.*", case = True, regex=True) == False]
To clarify, I'm trying to search the values of column "1" for the value "^monday.*" and then deselect the rows, and all values in those rows, that match the regex. I've successfully removed "monday09-01-2022", "tuesday10-01-2022", etc., but I'm also losing random names that are not in the matching rows.
Any help would be very much appreciated! Thank you!
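One likely cause, offered here as an assumption since the full data isn't shown: str.contains returns NaN for empty cells, and NaN == False evaluates to False, so rows with an empty column "1" get dropped along with the weekday rows. A minimal sketch that keeps those rows by passing na=False and negating the mask:
import pandas as pd

# Illustrative frame mirroring the pasted head of the data.
df = pd.DataFrame({1: ["monday09-01-2022", "Anonymous 1", "Anonymous 2", None],
                   2: ["tuesday10-01-2022", "Anonymous 1", "Anonymous 4", "Anonymous 5"]})

# na=False treats empty cells as "no match", so the negation keeps them.
mask = df[1].str.contains("^monday", case=True, regex=True, na=False)
df = df[~mask]
print(df)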

{Python} - [Pandas] - How to sum columns by a "less than" condition on the column name

First, explaining the dataframe: the values of the columns '0-156', '156-234', '234-546' ... '>76830' are the percentage distribution for each range of distances in meters, totalling 100%.
The column 'Cell Name' refers to the data element of the other columns, and the column 'Distance' is the one that triggers the desired sum.
I need to sum the values of the columns '0-156', '156-234', '234-546' ... '>76830' whose range is less than the value of the 'Distance' (meters) column.
Below is the creation code for testing.
import pandas as pd
# initialize list of lists
data = [['Test1',0.36516562,19.065996,49.15094,24.344206,0.49186087,1.24217,5.2812457,0.05841639,0,0,0,0,158.4122868],
['Test2',0.20406325,10.664485,48.70978,14.885571,0.46103176,8.75815,14.200708,2.1162114,0,0,0,0,192.553074],
['Test3',0.13483211,0.6521175,6.124511,41.61725,45.0036,5.405257,1.0494527,0.012979688,0,0,0,0,1759.480042]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Cell Name','0-156','156-234','234-546','546-1014','1014-1950','1950-3510','3510-6630','6630-14430','14430-30030','30030-53430','53430-76830','>76830','Distance'])
Example of what should be done:
The value of column 'Distance' is 158.412286772863, so we would have to sum the values of the columns whose ranges fall below it, '0-156' and '156-234', totalling 19.43116162%.
Thanks so much!
As I understand it, you want to sum up all the percentage values in a row where the lower value of the column description (in the case of '0-156' it would be 0, in the case of '156-234' it would be 156, and so on) is smaller than the value in the Distance column.
First I would suggest that you transform your string-like column names into values, for example:
lowerlimit=df.columns[2]
>>'156-234'
Then read the string only up to the '-' and convert it to a number:
int(lowerlimit[:lowerlimit.find('-')])
>> 156
You can loop this over all your columns and build a new row of lower limits.
For a bit more simplicity I left out the first column from your example and added a first row with the lower limits of each column, which you could generate as described above. Then this code works:
import numpy as np
import pandas as pd

# First row holds the lower limits of each range column (note 14430, matching '14430-30030')
data = [[0, 156, 234, 546, 1014, 1950, 3510, 6630, 14430, 30030, 53430, 76830, 1e-23],
        [0.36516562, 19.065996, 49.15094, 24.344206, 0.49186087, 1.24217, 5.2812457, 0.05841639, 0, 0, 0, 0, 158.4122868],
        [0.20406325, 10.664485, 48.70978, 14.885571, 0.46103176, 8.75815, 14.200708, 2.1162114, 0, 0, 0, 0, 192.553074],
        [0.13483211, 0.6521175, 6.124511, 41.61725, 45.0036, 5.405257, 1.0494527, 0.012979688, 0, 0, 0, 0, 1759.480042]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['0-156', '156-234', '234-546', '546-1014', '1014-1950', '1950-3510', '3510-6630', '6630-14430', '14430-30030', '30030-53430', '53430-76830', '76830-', 'Distance'])
df['lastindex'] = None
df['sum'] = None
After basically recreating your dataframe, I add two columns, 'lastindex' and 'sum'.
Then I search, in every row, for the last index whose lower limit lies below the distance given in that row (df.iloc[i,-3]); afterwards I sum up the respective columns in that row.
for i in np.arange(1, len(df)):
    df.at[i, 'lastindex'] = np.where(df.iloc[0, :-3] < df.iloc[i, -3])[0][-1]
    df.at[i, 'sum'] = sum(df.iloc[i][0:df.at[i, 'lastindex'] + 1])
I hope this is helpful. Best, lepakk
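As an aside, a more vectorized sketch of the same idea, offered as an assumption rather than part of the original answer: it works on the dataframe exactly as created in the question (with 'Cell Name' and '>76830') and parses the lower bound straight out of each column name, so no extra limits row is needed.
import pandas as pd

# df is the dataframe from the question's creation code.
range_cols = [c for c in df.columns if c not in ('Cell Name', 'Distance')]

def lower_bound(name):
    # '0-156' -> 0.0, '156-234' -> 156.0, '>76830' -> 76830.0
    return float(name.lstrip('>').split('-')[0])

lower = pd.Series({c: lower_bound(c) for c in range_cols})

# For each row, sum the range columns whose lower bound is below that row's Distance.
mask = lower.values[None, :] < df['Distance'].values[:, None]
df['sum'] = df[range_cols].where(mask, 0).sum(axis=1)
print(df[['Cell Name', 'Distance', 'sum']])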

Compare two columns and output new column based on order of reference column

I'm trying to compare two columns (lists) with the same IDs, just in a different order. I want to take the first column's order as the reference, compare it to the second column, and then rewrite the second column's order in a new column (or list) so that it matches the first column's order. From there I can pull the corresponding columns (price, demographic, etc.) so that they match the order of the first column.
Input:
First column (reference column):
12321
12323
324214
32313452
1232132
fs2421
sfasrfas
asfasd
Second column (re-order necessary):
12321
sfasrfas
12323
324214
1232132
fs2421
asfasd
32313452
I have tried writing a for loop in Python with two separate lists, one for each column of IDs, as well as INDEX/MATCH in Excel, but can only seem to output 'matching' IDs.
Excel
=INDEX($A$2:$A$589,MATCH(C2,$A$2:$A$589,0),2)
Python
## set up empty lists and extract only matched values from both lists made above ##
matched_IDs = []
unique_IDs = []
for Part_No in updated_2_list:
    if Part_No in updated_1_list:
        matched_IDs.append(Part_No)
    elif Part_No not in updated_1_list:
        unique_IDs.append(Part_No)
print(matched_IDs)
len(matched_IDs)
I expect to match the order of first column in new column (or list).
Output:
Third column (new column after second column was re-ordered)
12321
12323
324214
32313452
1232132
fs2421
sfasrfas
asfasd
You mean like this:
=INDEX(C:C,MATCH(A1,C:C,0))
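On the Python side, a minimal sketch, assuming the IDs live in the two lists named in the question (updated_1_list and updated_2_list) and that the goal is to reorder the second list, together with any columns attached to it, to follow the first list's order; the price column is purely hypothetical:
import pandas as pd

# Hypothetical data mirroring the question.
updated_1_list = ["12321", "12323", "324214", "32313452", "1232132", "fs2421", "sfasrfas", "asfasd"]
updated_2_list = ["12321", "sfasrfas", "12323", "324214", "1232132", "fs2421", "asfasd", "32313452"]
prices = [10, 20, 30, 40, 50, 60, 70, 80]  # illustrative column tied to the second list

# Index by the second column's IDs, then reindex into the first column's order.
df2 = pd.DataFrame({"ID": updated_2_list, "price": prices}).set_index("ID")
reordered = df2.reindex(updated_1_list).reset_index()
print(reordered)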

Excel array formula anomaly

I have an array formula in Excel that works fine in all cells of the array except when there is a change in the conditional tests, and I'm not sure why.
The array formula is:
{=TEXT(VALUE(Header!$A$2)+VALUE(ReadingID)
*(IF(EventID="2", 1,IF(EventID="4", 1,0))*(VALUE(Header!$N$2)/86400)
+IF(EventID="2", 0, IF(EventID="4", 0, 1))*(VALUE(Header!$M$2)/86400))
, "#.000000")}
Typical data for the cells used by the formula:
Header!$A$2 = '43432.40434' # An excel serial date/time number as text.
ReadingID = #incremental numbers as text e.g. '1000', '1001' etc.
EventID = # Values 1 or 2 or 3 or 4 as text.
Header!$M$2 = 60 # as text.
Header!$N$2 = 10 # as text.
The ReadingID and EventID columns are the same size as the array formula column.
Typical results when EventID changes from, say, "2" to "3", are as follows:
ReadingID EventID Result Diff
'1540 '2 43432.582581 0.000116
'1541 '2 43432.582696 0.000115
'1542 '3 43433.475173 0.892477
'1543 '3 43433.475868 0.000695
'1544 '3 43433.476562 0.000694
The Diff column is simply to show the increment from row to row and is consistent either side of the transition in EventID value (e.g. from "2" to "3"). The same anomaly occurs at all points where the EventID value changes (i.e. "1" to "2"; "3" to "4").
The array formula spans several thousand cells and returns the expected result in all other rows, except when EventID changes.
I originally tried an OR function to perform the incremental sum, but that didn't work, hence the nested IF statements.
Can anyone suggest if there is something wrong with the array formula, or how to avoid this rogue result?
NOTE: The data is in text format as it is being imported from elsewhere in CSV format and I would like to preserve the raw import.
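As a side note, and purely as an inference from the quoted values rather than anything stated in the question: the (N or M)/86400 factor multiplies the whole ReadingID, not just the increment, so when EventID switches from "2" to "3" the new factor is applied to the entire accumulated ReadingID at once. A minimal Python sketch reproducing the quoted rows under that assumption:
# Values taken from the question: Header!$A$2, Header!$M$2, Header!$N$2.
base, m, n = 43432.40434, 60, 10

def result(reading_id, event_id):
    # EventID "2" or "4" uses N seconds per reading, otherwise M seconds.
    per_reading = n if event_id in ("2", "4") else m
    return base + reading_id * per_reading / 86400

for rid, eid in [(1540, "2"), (1541, "2"), (1542, "3"), (1543, "3")]:
    print(rid, eid, round(result(rid, eid), 6))
# Most of the 0.892477 jump between 1541 and 1542 is 1542 * (60 - 10) / 86400,
# i.e. the changed factor applied to the full ReadingID, on top of the normal increment.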

Generating test data in Excel for an EAV table

This is a pretty complicated question so be prepared! I want to generate some test data in excel for my EAV table. The columns I have are:
user_id, attribute, value
Each user_id will repeat a random number of times between 1 and 4, and for each entry I want to pick a random attribute from a list, and then a random value it can take on. Lastly, I want the attributes for each id to be unique, i.e. I do not want more than one entry with the same id and attribute. Below is an example of what I mean:
user_id attribute value
100001 gender male
100001 religion jewish
100001 university imperial
100002 gender female
100002 course physics
Possible values:
attribute     value
gender        male, female
course        maths, physics, chemistry
university    imperial, cambridge, oxford, ucl
religion      jewish, hindu, christian, muslim
How can I do this? In the past I have generated random data using a random number generator and a VLOOKUP but this is a bit out of my league.
My approach is to create a table with all four attributes for each ID and then filter that table randomly to get between one and four filtered rows per ID. I assigned a random value to each attribute. The basic setup looks like this:
To the left is the randomized EAV table and to the right is the lookup table used for the randomized values. Here are the formulas; enter them and copy down:
Column A - Establishes a random number every four digits. This determines the attribute that must be selected:
=IF(COUNTIF(C$2:C2,C2)=1,RANDBETWEEN(1,4),A1)
Column B - Uses the formula in A to determine if row is included:
=IF(COUNTIF(C$2:C2,C2)=A2,TRUE,RANDBETWEEN(0,1)=1)
Column C - Creates the IDs, starting with 100,001:
=(INT((ROW()-2)/4)+100000)+1
Column D - Repeats the four attributes:
=CHOOSE(MOD(ROW()-2,4)+1,"gender","course","university","religion")
Column E - Finds the first occurrence of the Column D attribute in the lookup table and selects a randomly offset value:
=INDEX($H$2:$H$14,(MATCH(D2,$G$2:$G$14,0))+RANDBETWEEN(0,COUNTIF($G$2:$G$14,D2)-1))
When you filter on the TRUEs in Column B you'll get your list of one to four attributes per ID. Disconcertingly, the filtering forces a recalculation, so the filtered list will no longer say TRUE for every cell in column B.
If this were mine I'd automate it a little more, perhaps by putting the "magic number" 4 (the count of attributes) in its own cell.
There are a number of ways to do this. You could use either Perl or Python; both have modules for working with spreadsheets. In this case, I used Python and the openpyxl module.
# File: datagen.py
# Usage: datagen.py <excel (.xlsx) filename to store data>
# Example: datagen.py myfile.xlsx
import sys
import random
from openpyxl import Workbook

# verify that the user specified an argument
if len(sys.argv) < 2:
    print("Specify an excel filename to save the data, e.g. myfile.xlsx")
    exit(-1)

# get the excel workbook and worksheet objects
wb = Workbook()
ws = wb.active

# Modify this line to specify the range of user ids
ids = range(100001, 100100)

# data structure for the attributes and values
data = {'gender': ['male', 'female'],
        'course': ['maths', 'physics', 'chemistry'],
        'university': ['imperial', 'cambridge', 'oxford', 'ucla'],
        'religion': ['jewish', 'hindu', 'christian', 'muslim']}

# Write column headers in the spreadsheet
ws['A1'] = 'user_id'
ws['B1'] = 'attribute'
ws['C1'] = 'value'

row = 1
# Loop through each user id
for user_id in ids:
    # randomly select how many attributes to use
    attr_cnt = random.randint(1, 4)
    attributes = list(data.keys())
    for idx in range(attr_cnt):
        # randomly select an attribute
        attr = random.choice(attributes)
        # remove the selected attribute from further selection for this user id
        attributes.remove(attr)
        # randomly select a value for the attribute
        value = random.choice(data[attr])
        row = row + 1
        # write the values for the current row in the spreadsheet
        ws.cell(row=row, column=1).value = user_id
        ws.cell(row=row, column=2).value = attr
        ws.cell(row=row, column=3).value = value

# save the spreadsheet using the filename specified on the cmd line
wb.save(filename=sys.argv[1])
print("Done!")
