I have a matrix like this:
id = (123, 979, 234)
matrix:
      123  979  234
123     0   30   45
979    30    0   60
234    15   45    0
My problem is, I want to access a matrix in a fast and easy way. Something like this:
matrix[id][id]
example:
print(matrix[123][979])
output 30
For now I'm using a list of lists, so I can only access the data by position. This is not very comfortable, because I don't know the position, I only know the id. For now I am using a function which looks up the right number. This is very slow, and I need it for a calculation with many iterations.
Does anybody have an idea how to solve this in a fast way?
The function to calculate the matrix for now is this below, but it is just zero or 30*60 seconds. I want to create a new matrix with individual times, but before coding it, I want to figure out, in which way I can store the data to have fast and easy access.
def get_matrix(permutation):
    criteria = [django_model1.objects.filter(id=id).get().django_model2.format for id in permutation]
    # and to speed up: an ugly combination of 2 list comprehensions and a lambda function.
    return [[(lambda c1, c2: timedelta(seconds=0) if c1 == c2 else timedelta(seconds=30*60))(c1, c2) for c2 in criteria] for c1 in criteria]
Using pandas:
import pandas as pd
data = [[0, 30, 45], [30, 0, 60], [15, 45, 0]]
ids = [123, 979, 234]
df = pd.DataFrame(data, columns = ids, index = ids)
data can be constructed in a lot of ways, depending on how you build your matrix. Refer to the docs for more info.
Now, refer by id:
>>> df[979][123]
30
Note: The order of ids is reversed since pandas takes the column id as the first index.
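If pandas is more than you need, a plain dict of dicts gives the matrix[id][id] syntax from the question with constant-time lookups. This is only a minimal sketch, assuming the ids are hashable and the rows of data are in the same order as ids; the name matrix is reused from the question for illustration.

```python
ids = [123, 979, 234]
data = [[0, 30, 45], [30, 0, 60], [15, 45, 0]]

# Outer key is the row id, inner key is the column id.
matrix = {row_id: dict(zip(ids, row)) for row_id, row in zip(ids, data)}

print(matrix[123][979])
print(matrix[234][123])
```

Here the first index is the row id, so the order is not reversed as it is with df[col][row].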
Given a dataframe
data = [['Bob','25'],['Alice','46'],['Alice','47'],['Charlie','19'],
['Charlie','19'],['Charlie','19'],['Doug','23'],['Doug','35'],['Doug','35.5']]
df = pd.DataFrame(data, columns = ['Customer','Sequence'])
Calculate the following:
First Sequence in each group is assigned a GroupID of 1.
Compare first Sequence to subsequent Sequence values in each group.
If difference is greater than .5, increment GroupID.
If GroupID was incremented, instead of comparing subsequent values to the first, use the current Sequence.
In the desired results table below...
Bob only has 1 record so the GroupID is 1.
Alice has 2 records and the difference between the two Sequence values (46 & 47) is greater than .5 so the GroupID is incremented.
Charlie's Sequence values are all the same, so all records get GroupID 1.
For Doug, the difference between the first two Sequence values (23 & 35) is greater than .5, so the GroupID for the second Sequence becomes 2. Now, since the GroupID was incremented, I want to compare the next value of 35.5 to 35, not 23, which means the last two rows share the same GroupID.
Desired results:
CustomerID  Sequence  GroupID
Bob         25        1
Alice       46        1
Alice       47        2
Charlie     19        1
Charlie     19        1
Charlie     19        1
Doug        23        1
Doug        35        2
Doug        35.5      2
My implementation:
# generate unique ID based on each customers Sequence
df['EventID'] = df.groupby('Customer')[
'Sequence'].transform(lambda x: pd.factorize(x)[0]) + 1
# impute first Sequence for each customer for comparison
df['FirstSeq'] = np.where(
df['EventID'] == 1, df['Sequence'], np.nan
)
# groupby and fill first Sequence forward
df['FirstSeq'] = df.groupby('Customer')[
'FirstSeq'].transform(lambda v: v.ffill())
# get difference of first Sequence and all others
df['FirstSeqDiff'] = abs(df['FirstSeq'] - df['Sequence'])
# create unique GroupID based on Sequence difference from first Sequence
df["GroupID"] = np.cumsum(df.FirstSeqDiff > 0.5) + 1
The above works for cases like Bob, Alice and Charlie but not Doug because it is always comparing to the first Sequence. How can I modify the code to change the compared Sequence value if the GroupID is incremented?
EDIT:
The dataframe will always be sorted by Customer and Sequence. I guess a better way to explain my goal is: assign a unique ID to all Sequence values whose differences are .5 or less, grouping by Customer.
The code has errors: Sequence is stored as strings, and adding df = df.astype({'Customer':str,'Sequence':np.float64}) fixes that. But you still cannot get what you want with this design. Try defining your own function myfunc (a regular function rather than a lambda), which solves your problem directly:
data = [['Bob','25'],['Alice','46'],['Alice','47'],['Charlie','19'],
['Charlie','19'],['Charlie','19'],['Doug','23'],['Doug','35'],['Doug','35.5']]
df = pd.DataFrame(data, columns = ['Customer','Sequence'])
df = df.astype({'Customer':str,'Sequence':np.float64})
def myfunc(series):
    ret = []
    series = series.sort_values().values
    for i, val in enumerate(series):
        if i == 0:
            ret.append(1)
        else:
            ret.append(ret[-1] + (series[i] - series[i-1] > 0.5))
    return ret
df['EventID'] = df.groupby('Customer')[
'Sequence'].transform(lambda x: myfunc(x))
print (df)
Happy coding my friend.
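The same consecutive-difference rule can also be sketched without an explicit Python loop, using groupby().diff() and a per-group cumulative sum. This is a hedged alternative, not the answerer's code; it assumes the frame is already sorted by Customer and Sequence, as the question's edit states.

```python
import pandas as pd

data = [['Bob', '25'], ['Alice', '46'], ['Alice', '47'], ['Charlie', '19'],
        ['Charlie', '19'], ['Charlie', '19'], ['Doug', '23'], ['Doug', '35'], ['Doug', '35.5']]
df = pd.DataFrame(data, columns=['Customer', 'Sequence']).astype({'Sequence': float})

# 1 where the gap to the previous row of the same customer exceeds 0.5, else 0.
# The first row of each customer has a NaN diff, which compares as False.
jump = df.groupby('Customer')['Sequence'].diff().gt(0.5).astype(int)

# Running count of jumps within each customer, starting the ids at 1.
df['GroupID'] = jump.groupby(df['Customer']).cumsum() + 1
print(df)
```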
I am building a dataframe and need to assign values from a defined list to a new column in the dataframe. I have found an answer which gives a method to assign elements from a list randomly to a new column in a dataframe here (How to assign random values from a list to a column in a pandas dataframe?).
But I want to be able to control the distribution of the elements in my list within the new dataframe by either assigning a frequency of occurrences or some other method to control how many times each list element appears in the dataframe.
For example, if I have a list my_list = [50, 40, 30, 20, 10] how can I say that for a dataframe (df) with n number of rows assign 50 to 10% of the rows, 40 to 20%, 30 to 30%, 20 to 35% and 10 to 5% of the rows.
Any other method to control for the distribution of list elements is welcome, the above is a simple explanation to illustrate how one way to be able to control frequency may look.
You can use the choice function from numpy.random and pass the desired probability distribution via the p parameter.
>>> a = np.random.choice([50, 40, 30, 20, 10], size=100, p=[0.1, 0.2, 0.3, 0.35, 0.05])
>>> pd.Series(a).value_counts().sort_index(ascending=False)
50 9
40 25
30 19
20 38
10 9
dtype: int64
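Just pass the desired length (the number of rows in the dataframe) as the size parameter. Note that the proportions are only matched on average, as the counts above show.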
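If the proportions must be hit exactly rather than on average, one option is to repeat each value the required number of times and shuffle the result. This is a sketch under assumptions: the 200-row frame, the column names x and value, and the rule of giving any rounding remainder to the most frequent value are all illustrative choices.

```python
import numpy as np
import pandas as pd

values = [50, 40, 30, 20, 10]
fractions = np.array([0.10, 0.20, 0.30, 0.35, 0.05])

df = pd.DataFrame({'x': range(200)})  # stand-in for the real dataframe
n = len(df)

# Convert fractions to whole-row counts; any rounding remainder goes to
# the most frequent value so that the counts add up to n.
counts = np.round(fractions * n).astype(int)
counts[np.argmax(fractions)] += n - counts.sum()

# Repeat each value count times, then shuffle the order of the rows.
rng = np.random.default_rng(0)
df['value'] = rng.permutation(np.repeat(values, counts))
print(df['value'].value_counts().sort_index(ascending=False))
```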
I am trying to implement a formula to create a new column in Dataframe using existing column but that column is a summation from 0 to a number present in some other column.
I was trying something like this:
dataset['B']=sum([1/i for i in range(dataset['A'])])
I know something like this would work
dataset['B']=sum([1/i for i in range(10)])
but I want to make this 10 dynamic based on some different column.
I keep on getting this error.
TypeError: 'Series' object cannot be interpreted as an integer
First of all, I should admit that I could not understand your question completely. What I understood is that you want to iterate over the rows of a DataFrame and make a new column by doing some operation(s) on each value.
If that is so, then I would recommend the following link
Regarding TypeError: 'Series' object cannot be interpreted as an integer:
range() takes integers as input, i.e. [i for i in range(10)] gives [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]. However, dataset['A'] is a whole Series rather than a single integer, so passing it to range() raises the error you are seeing; a float value would fail in a similar way. Moreover, the first value produced by range() is zero, so 1/i would raise a different error (ZeroDivisionError). As a result, you have to apply the calculation to each row and start the range at 1, i.e. [1/i for i in range(1, row_value_of_dataset['A'])]
It would be highly appreciated if you could give an example of what your DataFrame looks like and what your desired output is. Then it would be easier to post a solution.
BTW forgot to post what I understood from your question:
# assume the data:
>>> import pandas as pd
>>> data = pd.DataFrame({'A': (1, 2, 3, 4)})
# the data
>>> data
   A
0  1
1  2
2  3
3  4
# doing the operation on each of the rows
>>> data['B'] = data.apply(lambda row: sum([1/i for i in range(1, row.A)]), axis=1)
# column B is the newly added data
>>> data
   A         B
0  1  0.000000
1  2  1.000000
2  3  1.500000
3  4  1.833333
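For larger frames, the per-row apply can be replaced by a lookup into a precomputed table of partial sums. This is only a sketch, assuming column A holds integers of at least 1 and that the sum should run from 1 to A - 1, as in the apply version above; the name harmonic is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4]})

# harmonic[k] holds 1/1 + 1/2 + ... + 1/k, with harmonic[0] == 0.
max_a = int(data['A'].max())
harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, max_a + 1))))

# range(1, A) stops at A - 1, so look up position A - 1 for every row.
data['B'] = harmonic[data['A'].to_numpy() - 1]
print(data)
```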
Perhaps explicitly use cumsum, or even apply?
Anyway, you are trying to move an array/list item directly into a dataframe, and it seems to be treated as a dictionary. Try something like this (I've not tested it):
array_x = [(x, 1/x) for x in dataset.A.tolist()]
df = pd.DataFrame(data=np.asarray(array_x))
df.columns = ['A', 'B']
Here the idea is to break the Series apart into a list and feed the list into a dataframe. The same thing can be done without going Series -> list -> dataframe, a round trip that is not very efficient.
I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the row with the highest value in column B.
So this:
A B
1 10
1 20
2 30
2 40
3 10
Should turn into this:
A B
1 20
2 40
3 10
I'm guessing there's probably an easy way to do this—maybe as easy as sorting the DataFrame before dropping duplicates—but I don't know groupby's internal logic well enough to figure it out. Any suggestions?
This takes the last. Not the maximum though:
In [10]: df.drop_duplicates(subset='A', keep="last")
Out[10]:
A B
1 1 20
3 2 40
4 3 10
You can do also something like:
In [12]: df.groupby('A', group_keys=False).apply(lambda x: x.loc[x.B.idxmax()])
Out[12]:
A B
A
1 1 20
2 2 40
3 3 10
The top answer is doing too much work and looks to be very slow for larger data sets. apply is slow and should be avoided if possible. ix is deprecated and should be avoided as well.
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()
A B
1 1 20
3 2 40
4 3 10
Or simply group by all the other columns and take the max of the column you need. df.groupby('A', as_index=False).max()
Simplest solution:
To drop duplicates based on one column:
df = df.drop_duplicates('column_name', keep='last')
To drop duplicates based on multiple columns:
df = df.drop_duplicates(['col_name1','col_name2','col_name3'], keep='last')
I would sort the dataframe first with Column B descending, then drop duplicates for Column A and keep first
df = df.sort_values(by='B', ascending=False)
df = df.drop_duplicates(subset='A', keep="first")
without any groupby
Try this:
df.groupby(['A']).max()
I was brought here by a link from a duplicate question.
For just two columns, wouldn't it be simpler to do:
df.groupby('A')['B'].max().reset_index()
And to retain a full row (when there are more columns, which is what the "duplicate question" that brought me here was asking):
df.loc[df.groupby(...)[column].idxmax()]
For example, to retain the full row where 'C' takes its max, for each group of ['A', 'B'], we would do:
out = df.loc[df.groupby(['A', 'B'])['C'].idxmax()]
When there are relatively few groups (i.e., lots of duplicates), this is faster than the drop_duplicates() solution (less sorting):
Setup:
n = 1_000_000
df = pd.DataFrame({
'A': np.random.randint(0, 20, n),
'B': np.random.randint(0, 20, n),
'C': np.random.uniform(size=n),
'D': np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), size=n),
})
(Adding sort_index() to ensure equal solution):
%timeit df.loc[df.groupby(['A', 'B'])['C'].idxmax()].sort_index()
# 101 ms ± 98.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.sort_values(['C', 'A', 'B'], ascending=False).drop_duplicates(['A', 'B']).sort_index()
# 667 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
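As a small sanity check that the two approaches select the same rows, here they are on the five-row example from the question. This is a sketch only; timings will differ by machine and data, so none are claimed here.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3], 'B': [10, 20, 30, 40, 10]})

# Approach 1: pick the row label of the max B within each A group.
by_idxmax = df.loc[df.groupby('A')['B'].idxmax()]

# Approach 2: sort by B descending, keep the first row seen for each A.
by_sort = df.sort_values('B', ascending=False).drop_duplicates('A').sort_index()

print(by_idxmax)
print(by_idxmax.equals(by_sort))
```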
I think in your case you don't really need a groupby. I would sort your B column in descending order, then drop duplicates in column A, and if you want you can also get a nice, clean new index like this:
df.sort_values('B', ascending=False).drop_duplicates('A').sort_index().reset_index(drop=True)
Easiest way to do this:
# First you need to sort this DF as Column A as ascending and column B as descending
# Then you can drop the duplicate values in A column
# Optional - you can reset the index and get the nice data frame again
# I'm going to show you all in one step.
d = {'A': [1,1,2,3,1,2,3,1], 'B': [30, 40,50,42,38,30,25,32]}
df = pd.DataFrame(data=d)
df
A B
0 1 30
1 1 40
2 2 50
3 3 42
4 1 38
5 2 30
6 3 25
7 1 32
df = df.sort_values(['A','B'], ascending =[True,False]).drop_duplicates(['A']).reset_index(drop=True)
df
A B
0 1 40
1 2 50
2 3 42
You can try this as well
df.drop_duplicates(subset='A', keep='last')
I referred this from https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Here's a variation I had to solve that's worth sharing: for each unique string in columnA I wanted to find the most common associated string in columnB.
df.groupby('columnA').agg({'columnB': lambda x: x.mode().any()}).reset_index()
The .any() picks one if there's a tie for the mode. (Note that using .any() on a Series of ints returns a boolean rather than picking one of them.)
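Because the behaviour of .any() depends on the dtype, a more explicit way to pick one mode is to take the first element of mode(), which pandas returns in sorted order. This is a hedged sketch with made-up data; the frame and the name most_common are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'columnA': ['x', 'x', 'x', 'y', 'y'],
    'columnB': ['p', 'q', 'q', 'r', 's'],
})

# mode() returns every tied value in sorted order; iloc[0] takes the first.
most_common = (
    df.groupby('columnA')['columnB']
      .agg(lambda s: s.mode().iloc[0])
      .reset_index()
)
print(most_common)
```

This also works for integer columns, where .any() would return a boolean.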
For the original question, the corresponding approach simplifies to
df.groupby('columnA').columnB.agg('max').reset_index().
While the posts already given answer the question, I made a small change by naming the column on which max() is applied, for better code readability.
df.groupby('A', as_index=False)['B'].max()
This is very similar to the selected answer, but sorting the data frame by multiple columns might be an easier way to code it.
First, sort the data frame by both the "A" and "B" columns; ascending=False ensures it is ranked from highest value to lowest:
df.sort_values(["A", "B"], ascending=False, inplace=True)
Then drop duplicates in column "A" and keep only the first item, which is already the one with the highest value:
df.drop_duplicates(subset="A", inplace=True)
this also works:
a = pd.DataFrame({'A': a.groupby('A')['B'].max().index, 'B': a.groupby('A')['B'].max().values})
I am not going to give you the whole answer (I don't think you're looking for the parsing and writing to file part anyway), but a pivotal hint should suffice: use python's set() function, and then sorted() or .sort() coupled with .reverse():
>>> a=sorted(set([10,60,30,10,50,20,60,50,60,10,30]))
>>> a
[10, 20, 30, 50, 60]
>>> a.reverse()
>>> a
[60, 50, 30, 20, 10]
Right now, my code takes scraped web data from a file (BigramCounter.txt), and then finds all the bigrams within that file so that the data looks like this:
Counter({('the', 'first'): 45, ('on', 'purchases'): 42, ('cash', 'back'): 39})
After this, I try to feed it into a pandas DataFrame where it spits this df out:
the on cash
first purchases back
0 45 42 39
This is very close to what I need, but not quite. First off, the DF does not pick up my attempt to name the columns. Furthermore, I was hoping for something formatted more like this, where there are two columns and the words are not split between cells:
Words Frequency
the first 45
on purchases 42
cash back 39
For reference, here is my code. I think I may need to reorder an axis somewhere but I'm not sure how? Any ideas?
import re
from collections import Counter
main_c = Counter()
words = re.findall('\w+', open('BigramCounter.txt', encoding='utf-8').read())
bigrams = Counter(zip(words,words[1:]))
main_c.update(bigrams) #at this point it looks like Counter({('the', 'first'): 45, etc...})
comm = [[k,v] for k,v in main_c]
frame = pd.DataFrame(comm)
frame.columns = ['Word', 'Frequency']
frame2 = frame.unstack()
frame2.to_csv('text.csv')
I think I see what you're going for, and there are many ways to get there. You were really close. My first inclination would be to use a series, especially since you'd (presumably) just be getting rid of the df index when you write to csv, but it doesn't make a huge difference.
frequencies = [[" ".join(k), v] for k,v in main_c.items()]
pd.DataFrame(frequencies, columns=['Word', 'Frequency'])
Word Frequency
0 the first 45
1 cash back 39
2 on purchases 42
If, as I suspect, you want word to be the index, add frame.set_index('Word')
Word Frequency
the first 45
cash back 39
on purchases 42
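To also get the rows ordered by frequency, as in the desired output, Counter.most_common() can supply the pairs already sorted. This is a minimal sketch using the three bigrams from the question as stand-in data rather than reading BigramCounter.txt.

```python
from collections import Counter

import pandas as pd

# Stand-in for the Counter built from the file in the question.
main_c = Counter({('the', 'first'): 45, ('on', 'purchases'): 42, ('cash', 'back'): 39})

# most_common() yields (bigram, count) pairs from most to least frequent.
rows = [(" ".join(bigram), count) for bigram, count in main_c.most_common()]
frame = pd.DataFrame(rows, columns=['Words', 'Frequency'])
print(frame)
```

Writing it out with frame.to_csv('text.csv', index=False) would drop the numeric index from the file.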