So, I'm trying to divide values across two columns of a .csv file, one of which comprises integers ('counts') while the other is made up of floats ('Surface').
df = pd.read_csv(r'G:\file_path\file1.csv')
df['f'] = df['counts']/df['Surface']
Doing so returns the 'TypeError: string indices must be integers' error message.
I have tried to find information online on how to divide floats, but can only find endless resources on how to use the one-slash (/) or two-slash (//) operators to output floats or integers, as opposed to anything about actually dividing floats themselves.
Any ideas on how I resolve this? Surely it can't be all that complicated.
Cheers,
R
I suspect one of the columns is dtype object.
Please try
Data
df = pd.DataFrame({'counts': [49, 47, 44, 43],
                   'Surface': [1.878914, 1.854631, 1.854631, 1.660323]})
print(df)
counts Surface
0 49 1.878914
1 47 1.854631
2 44 1.854631
3 43 1.660323
df['f'] = df['counts'].astype(int)/df['Surface'].astype(float)
counts Surface f
0 49 1.878914 26.078895
1 47 1.854631 25.341968
2 44 1.854631 23.724396
3 43 1.660323 25.898575
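If one of the columns really did come in as dtype object (for instance because the CSV contains stray text in a numeric column), pd.to_numeric with errors='coerce' is a more forgiving conversion than astype, since unparseable entries become NaN instead of raising. A minimal sketch, using made-up data where 'Surface' was read in as strings:

```python
import pandas as pd

# Hypothetical frame where 'Surface' was read in as strings (dtype object)
df = pd.DataFrame({'counts': [49, 47],
                   'Surface': ['1.878914', '1.854631']})
print(df.dtypes)  # 'Surface' shows up as object here

# Coerce to numeric; anything unparseable becomes NaN instead of raising
df['Surface'] = pd.to_numeric(df['Surface'], errors='coerce')
df['f'] = df['counts'] / df['Surface']
print(df)
```

Checking df.dtypes right after read_csv is usually the quickest way to see which column is the culprit.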
I want to extract, from the sequences in data frame 'A', the subsequences indicated by the first and last locations in data frame 'B'.
The algorithm that I came up with is:
Identify the rows of B that fall in the locations of A
Find the relative position of the locations (i.e. shift the locations to make them start from 0)
Start a for loop using the relative position as a range to extract the subsequences.
The issue with the above algorithm is runtime. I'm looking for an alternative approach that runs faster than the existing one.
Desired output:
first last sequences
3 5 ACA
8 12 CGGAG
105 111 ACCCCAA
115 117 TGT
Used data frames:
import pandas as pd
A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3,8,105,115], 'last':[5,12,111,117]})
One solution could be as follows:
out = pd.merge_asof(B, A, on=['last'], direction='forward',
suffixes=('','_y'))
out.loc[:,['first','last']] = \
out.loc[:,['first','last']].sub(out.first_y, axis=0)
out = out.assign(sequences=out.apply(lambda row:
row['first.sequence'][row['first']:row['last']+1],
axis=1)).drop(['first.sequence','first_y'], axis=1)
out.update(B)
print(out)
first last sequences
0 3 5 ACA
1 8 12 CGGAG
2 105 111 ACCCCAA
3 115 117 TGT
Explanation
First, use pd.merge_asof to match rows from B with rows from A (merging on last). I.e. B's rows with first values 3 and 8 will match A's row with first value 1, and those with 105 and 115 will match the row with first value 100. Now we know which string (sequence) needs slicing, and we also know where that string starts, e.g. at index 1 or 100 instead of the usual 0.
We use this last bit of information to find out where the string slice should start and end. So, we do out.loc[:,['first','last']].sub(out.first_y, axis=0). E.g. we "reset" 3 to 2 (minus 1) and 105 to 5 (minus 100).
Now, we can use df.apply to get the string slices for each sequence, essentially looping over each row. (If your slices all started and ended at the same indices, we could have used Series.str.slice instead.)
Finally, we assign the result to out (as col sequences), drop the cols we no longer need, and we use df.update to "reset" the columns first and last.
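To see what the merge_asof step does in isolation, here is a minimal sketch with the same two frames, pairing each row of B with the first row of A whose last is greater than or equal to it (direction='forward'):

```python
import pandas as pd

A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG', 'ACCACACCCCAAATGTGT'],
                  'first': [1, 100], 'last': [12, 117]})
B = pd.DataFrame({'first': [3, 8, 105, 115], 'last': [5, 12, 111, 117]})

# Each B row is paired with the nearest A row whose 'last' >= B's 'last';
# 'first_y' is A's starting location, i.e. the offset of the full sequence
m = pd.merge_asof(B, A, on='last', direction='forward', suffixes=('', '_y'))
print(m[['first', 'last', 'first_y']])
```

Note that merge_asof requires the key column to be sorted in both frames, which happens to hold for this data.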
I'm trying to figure out a way to determine if a dataset of 'X' values in a column contains 5 or more consecutive negative numbers. If so, how many times does that happen (5 negatives in consecutive cells)?
Example data:
-110
-164
-101
-144
-117
0
0
10
15
22
-56
-60
-54
2
10
0
In this example I would expect the result to be "1", since a run of 5 consecutive negative integers occurs only once in the dataset.
I built a nasty nested IF statement run to figure this out but I know there's an easier way.
I cannot seem to get a shorter formula to work.
Appreciate anyone's input
You can try the FREQUENCY function:
=SUM(--(FREQUENCY(IF(A:A<0,ROW(A:A)),IF(A:A>=0,ROW(A:A)))>=5))
For older Excel versions it must be entered as an array formula with Ctrl+Shift+Enter.
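For comparison, if the same column of values lived in a pandas Series (as in the other threads here), one way to sketch the run count is to label each run of consecutive negatives and count the runs of length 5 or more (the variable names are my own):

```python
import pandas as pd

# The example column from the question
s = pd.Series([-110, -164, -101, -144, -117, 0, 0, 10, 15, 22,
               -56, -60, -54, 2, 10, 0])

neg = s < 0
# Start a new run id every time the sign flips, then number the runs
run_id = (neg != neg.shift()).cumsum()
# For each run, record whether it is negative and how long it is
runs = neg.groupby(run_id).agg(['first', 'size'])
count = (runs['first'] & (runs['size'] >= 5)).sum()
print(count)  # 1 for this data: only the first block of negatives reaches 5
```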
Given a random integer, for example, 19357982357627685397198. How can I compress these numbers into a string of text that has fewer characters?
The string of text must only contain numbers or alphabetical characters, both uppercase and lowercase.
I've tried Base64 and Huffman coding, which claim to compress, but neither makes the string shorter when typed out on a keyboard.
I also tried to make some kind of algorithm that tries to divide the integer by the numbers "2,3,...,10" and check if the last number in the result is the number it was divided by (looks for 0 in case of division by 10). So, when decrypting, you would just multiply the number by the last number in the integer. But that does not work because in some cases you can't divide by anything and the number would stay the same, and when it would be decrypted, it would just multiply it into a larger number than you started with.
I also tried to divide the integer into blocks of 2 numbers starting from left and giving a letter to them (a=1, b=2, o=15), and when it would get to z it would just roll back to a. This did not work because when it was decrypted, it would not know how many times the number rolled over z and therefore be a much smaller number than in the start.
I also tried some other common encryption strategies. For example Base32, Ascii85, Bifid Cipher, Baudot Code, and some others I can not remember.
It seems like an unsolvable problem. But because it starts with an integer, each digit can take only 10 different values, while an alphabetical character can take 26 (52 counting both cases). This means you can store more data in 5 alphabetical characters than in a 5-digit integer. So it is mathematically possible to store more data in a string of characters than in a decimal integer; I just can't find anyone who has ever done it.
You switch from base 10 to e.g. base 62 by repeatedly dividing by 62 and recording the remainder from each step, like this:
Converting 6846532136 to base62:
Operation Result Remainder
6846532136 / 62 110427937 42
110427937 / 62 1781095 47
1781095 / 62 28727 21
28727 / 62 463 21
463 / 62 7 29
7 / 62 0 7
Then you use the remainders as indices into a base62 alphabet of your choice, e.g.:
0 1 2 3 4 5 6
01234567890123456789012345678901234567890123456789012345678901
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
Giving: H (7) d (29) V (21) V (21) v (47) q (42) = HdVVvq
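The division loop above can be sketched in Python as follows (the alphabet matches the table above; the function names are my own):

```python
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'

def to_base62(n: int) -> str:
    """Repeatedly divide by 62, mapping each remainder to a character."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    # Remainders come out least-significant first, so reverse them
    return ''.join(reversed(digits))

def from_base62(s: str) -> int:
    """Invert the encoding by accumulating digit values in base 62."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

print(to_base62(6846532136))  # HdVVvq, matching the worked example
```

A 23-digit decimal number like the one in the question comes out around 13 base62 characters, since each character carries log2(62) ≈ 5.95 bits versus log2(10) ≈ 3.32 bits per decimal digit.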
------
It's called base10 to base62 conversion; there are a bunch of solutions and code examples on the internet.
Here is my favorite version: Base 62 conversion
I have some time series data (in a Pandas dataframe), d(t):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
I would like to get a time-shifted version of the data, e.g. d(t-1):
time 1 2 3 4 ... 99 100
d(t) 5 3 17 6 ... 23 78
d(t-1) NaN 5 3 17 6 ... 23
But with a complication. Instead of simply time-shifting the data, I need to take the expected value based on a Poisson-distributed shift. So instead of d(t-i), I need E(d(t-j)), where j ~ Poisson(i).
Is there an efficient way to do this in Python?
Ideally, I would be able to dynamically generate the result with i as a parameter (that I can use in an optimization).
numpy's Poisson functions seem to be about generating draws from a Poisson rather than giving a PMF that could be used to calculate expected value. If I could generate a PMF, I could do something like:
for idx in len(d(t)):
Ed(t-i) = np.multiply(d(t)[:idx:-1], PMF(Poisson, i)).sum()
But I have no idea what actual functions to use for this, or if there is an easier way than iterating over indices. This approach also won't easily let me optimize over i.
You can use scipy.stats.poisson to get the PMF.
Here's a sample:
from scipy.stats import poisson
mu = 10
# Declare 'rv' to be a poisson random variable with λ=mu
rv = poisson(mu)
# poisson.pmf(k) = (e⁻ᵐᵘ * muᵏ) / k!
print(rv.pmf(4))
For more information about scipy.stats.poisson check this doc.
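Putting the PMF together with the original question, here is a hedged sketch of E[d(t-j)] with j ~ Poisson(i). The function name and the boundary handling (dropping lags that fall before the start of the series and renormalising the remaining PMF weights) are my own choices, not the only possible ones:

```python
import numpy as np
from scipy.stats import poisson

d = np.array([5.0, 3.0, 17.0, 6.0, 23.0, 78.0])  # toy stand-in for d(t)

def expected_shift(d, i):
    """Return E[d(t-j)], j ~ Poisson(i), for each t.

    Infeasible lags (before the series start) are dropped and the
    remaining PMF weights renormalised -- one possible boundary choice.
    """
    n = len(d)
    pmf = poisson.pmf(np.arange(n), mu=i)
    out = np.empty(n)
    for t in range(n):
        js = np.arange(t + 1)          # feasible lags at time t
        w = pmf[js] / pmf[js].sum()    # renormalised truncated PMF
        out[t] = (w * d[t - js]).sum()
    return out

print(expected_shift(d, i=1))
```

Since i enters only through poisson.pmf(..., mu=i), the function can be called with i as a free parameter inside an optimization loop, as the question asks.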
In the following data frame in Pandas, I want to extract the rows whose dates fall between '03/01' and '06/01'. I don't want to use the index at all, as my input would be a start and an end date. How could I do so?
A B
0 01/01 56
1 02/01 54
2 03/01 66
3 04/01 77
4 05/01 66
5 06/01 72
6 07/01 132
7 08/01 127
First create a list of the dates you need using pd.date_range. I'm adding the year 2000 since you need to supply a year for this to work; I'm then cutting it off to get the desired strings. In real life you might want to pay attention to the actual year due to things like leap days.
date_start = '03/01'
date_end = '06/01'
dates = [x.strftime('%m/%d') for x in pd.date_range('2000/{}'.format(date_start),
'2000/{}'.format(date_end), freq='D')]
dates is now equal to:
['03/01',
'03/02',
'03/03',
'03/04',
.....
'05/29',
'05/30',
'05/31',
'06/01']
Then simply use the isin method and you are done:
df = df.loc[df.A.isin(dates)]
df
If your column is a datetime column, I guess you can skip the strftime part in the list comprehension to get the right result.
You are welcome to use boolean masking, i.e.:
df[(df.A >= start_date) & (df.A <= end_date)]
Inside the brackets is a boolean array of True and False values. Only rows that fulfill your given condition (evaluate to True) will be returned. This is a great tool to have, and it works well with pandas and numpy.
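With the sample frame from the question, a runnable version of the mask might look like this (note the single &, pandas' element-wise "and", and the parentheses around each condition):

```python
import pandas as pd

df = pd.DataFrame({'A': ['01/01', '02/01', '03/01', '04/01',
                         '05/01', '06/01', '07/01', '08/01'],
                   'B': [56, 54, 66, 77, 66, 72, 132, 127]})

start_date, end_date = '03/01', '06/01'
# String comparison happens to work here only because the dates are
# zero-padded mm/dd strings, so lexicographic order matches date order
result = df[(df.A >= start_date) & (df.A <= end_date)]
print(result)
```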