Pandas: check for a sequence or pattern - python-3.x

I need to check whether a particular pattern occurs in the columns; it is easier to see with some data.
As you can see, there are two punch-ins next to each other.
I need a way to detect this pattern. Normally you need to clock out before you clock in again,
so this is a mistake from my system, and I need a way to detect it in pandas.
I was thinking of using .apply(function, axis=1).
Thank you in advance.
Best,

Using pandas.DataFrame.shift(), this code compares each row with the previous row and creates a column 'flag' that is True when they are exactly the same:
comparison = df == df.shift()
df['flag'] = comparison['Date'] & comparison['Name'] & comparison['Activity']
With your data, the output is:
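The original sample data and output are not reproduced here. As a minimal sketch of the same technique, assuming made-up rows with the Date, Name and Activity columns used in the answer, the flag marks the second of two identical consecutive rows:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer above.
df = pd.DataFrame({
    "Date": ["2020-01-01"] * 4,
    "Name": ["Ann"] * 4,
    "Activity": ["punch in", "punch out", "punch in", "punch in"],
})

# shift() moves every row down by one, so each row is compared with the previous one.
comparison = df == df.shift()
df["flag"] = comparison["Date"] & comparison["Name"] & comparison["Activity"]

print(df)
```

The first row has nothing before it, so its comparison is False. If the data covers several people, it may be worth sorting by name and time first so that consecutive rows belong to the same person.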

Related

Alternatives to interpolate three dimensional data

I have a table that shows me a chemical concentration value based on temperature, pH and ammonia. The way I measure these variables, the ammonia level is always one of these six values (on top of the table), so it works as a categorical variable.
I need a way to interpolate on this table, based on these 3 variables. I tried using a combination of INDEX and MATCH, but I was not able to achieve what I wanted. Then I thought of "dividing" the table in intervals to "reduce" one variable and use an IF function to select which interval to interpolate based on the third variable (I was thinking pH or Ammonia), but I can't figure out a way to change intervals dynamically like this.
Can anyone think of an alternative to accomplish what I'm trying to do? If possible I would like to avoid using VBA, but if there is no other way I have no problem using it.
Thank you for the help!
I'm attaching an example of the table below.
Assuming that pH is in column A:
=INDEX(A:H;MATCH(6,8;A:A;0)+MATCH(25;B:B;0)-2;MATCH(2;2:2;0))
Here the -2 needs to be changed to the number of rows BEFORE the first 22 in Temp.
This also assumes that the pattern of 22;25;28 in Temp is the same for every pH.

How do I get additional column name information in a pandas group by / nlargest calculation?

I am comparing pairs of strings using six fuzzywuzzy ratios, and I need to output the top three scores for each pair.
This line does the job:
final2_df = final_df[['nameHiringOrganization', 'mesure', 'name', 'valeur']].groupby(['nameHiringOrganization', 'name'])['valeur'].nlargest(3)
However, the Excel output table lacks the 'mesure' column, which contains the ratio's name. This is annoying, because then I'm not able to identify which of the six ratios works best for any given pair.
I thought selecting columns at the beginning might work (final_df[['columns', ...]]), but it doesn't seem to.
Any thoughts on how I might add that info?
Many thanks in advance!
I think another solution is possible here: sort by the 3 columns with DataFrame.sort_values and then use GroupBy.head, which keeps all of the original columns:
final2_df = (final_df.sort_values(['nameHiringOrganization', 'name', 'valeur'],
                                  ascending=[True, True, False])
                     .groupby(['nameHiringOrganization', 'name'])
                     .head(3))
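To illustrate why this keeps the 'mesure' column, here is a small sketch on made-up data. The column names come from the question; the organization, name and score values are hypothetical:

```python
import pandas as pd

# Hypothetical scores for one (organization, name) pair.
final_df = pd.DataFrame({
    "nameHiringOrganization": ["A"] * 4,
    "name": ["x"] * 4,
    "mesure": ["ratio", "partial_ratio", "token_sort_ratio", "token_set_ratio"],
    "valeur": [70, 95, 80, 60],
})

# Sort so the highest 'valeur' comes first within each pair, then keep
# the first three rows of each group. head() returns whole rows, so
# 'mesure' is still there.
final2_df = (
    final_df.sort_values(
        ["nameHiringOrganization", "name", "valeur"],
        ascending=[True, True, False],
    )
    .groupby(["nameHiringOrganization", "name"])
    .head(3)
)

print(final2_df)
```

Unlike nlargest on a single selected column, head(3) operates on the full rows, which is what preserves the ratio names.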

Create a column based on index that represents Calendar-day difference between two consecutive trading days

I have trading data that has a date as its index.
I want to create a new column that holds the difference between consecutive dates, as follows.
I wrote the following code to do so:
df2=df1.reset_index(drop=False).copy()
df2['Date_lag']=df2.Date.shift()
df2['NT_diff'] = (df2['Date'] - df2['Date_lag']).dt.days
df2=df2.loc[:, df2.columns != 'Date_lag']
df2=df2.set_index('Date')
df2.head()
However, I am sure there is a simpler way to do this. Could you please advise?
Thank you so much in advance.
I think the best way is to use NumPy's built-in timedelta64, as follows:
import numpy as np
# Calendar-day difference between two consecutive trading days
df1['date'] = df1.index
df1['NT_diff'] = np.abs((df1['date'] - df1['date'].shift(1)) / np.timedelta64(1, 'D'))
df1 = df1.loc[:, df1.columns != 'date']
df1.head()
This should give you the result you are looking for.
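As a possible shorter alternative (not part of the answer above), the helper column can be avoided by taking the difference directly on the index. This is a sketch that assumes the index is a DatetimeIndex; the 'Close' column and the dates are made up:

```python
import pandas as pd

# Hypothetical trading data indexed by date (a Friday, a Monday, a Tuesday).
df1 = pd.DataFrame(
    {"Close": [10.0, 10.5, 10.2]},
    index=pd.DatetimeIndex(["2020-01-03", "2020-01-06", "2020-01-07"], name="Date"),
)

# to_series() keeps the dates as both index and values, so diff() yields
# the gap to the previous row and the result aligns with df1.
df1["NT_diff"] = df1.index.to_series().diff().dt.days

print(df1)
```

The first row has no previous date, so its NT_diff is NaN.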

Separating values that are combined in one string

I would like to solve this either in Excel or in SPSS:
I have categorical data (each number representing a medical diagnosis) that are combined into single cells. In other words, a row (patient) has multiple diagnoses. However, I would like to know the frequencies of each diagnosis. What is the best way to go about this? (See picture for reference)
For SPSS:
First, create some sample data to demonstrate on:
data list free/e_cerv_dis_state (a20).
begin data
"{1/2/3/6}" "{1/2/4}" "{2/4/5}" "{1/5/6}" "{4}" "{4/5/6}" "{1/2/3/4/5/6}"
end data.
Now the following code will create a separate variable for each possible diagnosis, and will put a 1 in it if the diagnosis exists in the original variable.
do repeat vr=diag1 to diag9/vl=1 to 9.
compute vr=char.index(e_cerv_dis_state, string(vl, f1) ) > 0.
end repeat.
freq diag1 to diag6.
Note that this will only work for up to 9 diagnoses. If you have more than that, the solution will have to be adapted to multiple digits.
Assuming that the number of columns is fairly regular, I would suggest using Text to Columns and then using COUNTIF on the cells to count the value you want. However, there is a more robust and reproducible solution that involves SQL. You can download the free version of SQL Server Express here: https://www.microsoft.com/en-gb/sql-server/sql-server-downloads
Then you can import your table of data; here's how to do that: How to import an Excel file into SQL Server?
Then you could use the friendlier SQL database to get the answers you want. For example, you can use a select statement like this (your_table is a placeholder for whatever name you gave the imported table):
SELECT count(e_cerv_dis_state)
FROM your_table
WHERE e_cerv_dis_state = '6'
It would also be possible to use a CASE WHEN statement to add in the names of the diagnoses.

Sort a spread sheet via gspread

I have a Google spreadsheet full of names, dates, and some other numbers.
I made a desktop application that provides a nice UI for said info.
After using the application a bit, I became slightly annoyed with the order in which the data was being displayed.
I have been researching all day and I cannot seem to find anything on the topic of sorting the spreadsheet from a Python script.
All I need is some function that I can call every time someone adds something to it to re-sort the sheet.
I would very much appreciate it if someone could help me out.
Thanks in advance.
gspread has a .sort() method that sorts a worksheet using the given sort orders. Here's how you can use it (source: gspread docs):
Parameters:
specs (list) – The sort order per column. Each sort order represented by a tuple where the first element is a column index and the second element is the order itself: ‘asc’ or ‘des’.
range (str) – The range to sort in A1 notation. By default sorts the whole sheet excluding frozen rows.
Example:
# Sort sheet A -> Z by column 'B'
wks.sort((2, 'asc'))
# Sort range A2:G8 basing on column 'G' A -> Z
# and column 'B' Z -> A
wks.sort((7, 'asc'), (2, 'des'), range='A2:G8')
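For the "call it every time someone adds something" part of the question, the sort call can be wrapped in a small helper. This is only a sketch: resort_sheet and its parameters are hypothetical names, and the worksheet argument is assumed to be a gspread Worksheet obtained elsewhere in the application:

```python
def resort_sheet(worksheet, column_index=1, order="asc", cell_range=None):
    """Re-sort a worksheet by one column (hypothetical helper).

    worksheet    -- assumed to be a gspread Worksheet, or anything with
                    the same sort(*specs, range=...) method
    column_index -- 1-based column index, as in the documented specs
    order        -- 'asc' or 'des', the two values documented above
    cell_range   -- optional A1 range; None keeps the default range
    """
    if order not in ("asc", "des"):
        raise ValueError("order must be 'asc' or 'des'")
    worksheet.sort((column_index, order), range=cell_range)
```

The application would then call something like resort_sheet(wks, 2) right after each row is appended.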
You can use the pygsheets library. It uses Sheets API v4 under the hood.
The my_worksheet.sort_range() function will help, but it has some quirks:
Numbering in the start and end cells starts at 1, but basecolumnindex starts at 0.
You can pass a cell's address in two ways: as a text index (like "A3") or as a tuple with 2 elements (like (1, 3)). The second way doesn't work for me.
The range bounded by the start and end cells should contain the column passed in basecolumnindex.
