Find count of a column in pandas dataframe based on condition - python-3.x

I am using the method below to find the count of rows in a pandas dataframe with 55k rows. It runs inside a for loop over a site list (4000 sites). The loop takes many minutes to complete when this line is included.
for i in g_sitelist:
    x = len(dfreglist[(dfreglist['site'] == i) & (dfreglist['isactive'] == 1)])
Is there a better way to do this so that the loop can complete within a second?

You can use value_counts():
site_counts = dfreglist[dfreglist['isactive'].eq(1)]['site'].value_counts()
This gives a Series indexed by site with the count of active rows for each, which you can then iterate or look up.
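For example, a minimal sketch (assuming g_sitelist and dfreglist as in the question) that looks up each site's count from that Series, with 0 for sites that have no active rows:
site_counts = dfreglist[dfreglist['isactive'].eq(1)]['site'].value_counts()

for i in g_sitelist:
    # Series.get returns the default (0 here) when the site is absent
    x = site_counts.get(i, 0)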

Use numpy - convert each column to array and call np.sum:
m = (dfreglist['isactive'].values == 1)
for i in g_sitelist:
    x = np.sum((dfreglist['site'].values == i) & m)
Faster solution:
df = dfreglist[dfreglist['site'].isin(g_sitelist) & (dfreglist['isactive'].values == 1)]
out = df['site'].value_counts()
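If you also need zeros for sites that have no active rows, one possible follow-up (an assumption, not part of the original answer) is to reindex the result against the site list:
out = out.reindex(g_sitelist, fill_value=0)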

Is this the valid "if" expression for not printing the names of less than four characters [duplicate]

I would like to filter out data whose string length is not equal to 10. To filter out any row whose column A's or B's string length is not equal to 10, I tried this:
import pandas as pd
import numpy as np

df = pd.read_csv('filex.csv')
df.A = df.A.apply(lambda x: x if len(x) == 10 else np.nan)
df.B = df.B.apply(lambda x: x if len(x) == 10 else np.nan)
df = df.dropna(subset=['A', 'B'], how='any')
This is slow, but it works. However, it sometimes produces an error when the data in A is not a string but a number (interpreted as a number when read_csv reads the input file):
File "<stdin>", line 1, in <lambda>
TypeError: object of type 'float' has no len()
I believe there should be a more efficient and elegant way to do this. Based on the answers and comments below, the simplest solutions I found are:
df = df[df.A.apply(lambda x: len(str(x)) == 10)]
df = df[df.B.apply(lambda x: len(str(x)) == 10)]
or
df = df[(df.A.apply(lambda x: len(str(x)) == 10)) & (df.B.apply(lambda x: len(str(x)) == 10))]
or
df=df[(df.A.astype(str).str.len()==10) & (df.B.astype(str).str.len()==10)]
import pandas as pd
df = pd.read_csv('filex.csv')
df['A'] = df['A'].astype('str')
df['B'] = df['B'].astype('str')
mask = (df['A'].str.len() == 10) & (df['B'].str.len() == 10)
df = df.loc[mask]
print(df)
Applied to filex.csv:
A,B
123,abc
1234,abcd
1234567890,abcdefghij
the code above prints
A B
2 1234567890 abcdefghij
A more Pythonic way of filtering out rows based on given conditions of other columns and their values:
Assuming a df of:
data = {
    "names": ["Alice", "Zac", "Anna", "O"],
    "cars": ["Civic", "BMW", "Mitsubishi", "Benz"],
    "age": ["1", "4", "2", "0"],
}
df = pd.DataFrame(data)
df:
age cars names
0 1 Civic Alice
1 4 BMW Zac
2 2 Mitsubishi Anna
3 0 Benz O
Then:
df[
    df["names"].apply(lambda x: len(x) > 1)
    & df["cars"].apply(lambda x: "i" in x)
    & df["age"].apply(lambda x: int(x) < 2)
]
We will have:
age cars names
0 1 Civic Alice
In the conditions above, we first look at the length of the strings, then check whether the letter "i" exists in the strings, and finally check the integer values of the age column.
I personally found this way to be the easiest:
df = df[df['column_name'].str.len() == 10]
You can also use query:
df.query('A.str.len() == 10 & B.str.len() == 10')
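A minimal sketch of the same idea; the dtype=str and engine='python' arguments are assumptions (reading every column as strings avoids the float problem, and some pandas versions cannot evaluate .str methods with the default numexpr engine):
import pandas as pd

df = pd.read_csv('filex.csv', dtype=str)  # read every column as strings
out = df.query('A.str.len() == 10 and B.str.len() == 10', engine='python')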
If you have numbers in the rows, they will be read in as floats. Convert all the columns to strings after importing from the csv. For better performance you could split those lambdas across multiple threads.
You can use df['A'].apply(len); it will give you the length of each value in that column.
For string operations such as this, vanilla Python using built-in methods (without lambda) is much faster than apply() or str.len().
Building a boolean mask by mapping len to each string inside a list comprehension is approx. 40-70% faster than apply() and str.len() respectively.
For multiple columns, zip() lets you evaluate values from different columns together.
col_A_len = map(len, df['A'].astype(str))
col_B_len = map(len, df['B'].astype(str))
m = [a==3 and b==3 for a,b in zip(col_A_len, col_B_len)]
df1 = df[m]
For a single column, drop zip() and loop over the column and check if the length is equal to 3:
df2 = df[[a==3 for a in map(len, df['A'].astype(str))]]
This code can be written a little concisely using the Series.map() method (but a little slower than list comprehension due to pandas overhead):
df2 = df[df['A'].astype(str).map(len)==3]
Filter out values other than length of 10 from columns A and B; here I pass a lambda expression to the map() function. map() always applies to a Series object.
df = df[df['A'].map(lambda x: len(str(x)) == 10)]
df = df[df['B'].map(lambda x: len(str(x)) == 10)]
You could use applymap to filter all columns you want at once, followed by the .all() method to filter only the rows where both columns are True.
#The *mask* variable is a dataframe of booleans, giving you True or False for the selected condition
mask = df[['A','B']].applymap(lambda x: len(str(x)) == 10)
#Here you can just use the mask to filter your rows, using the method *.all()* to filter only rows that are all True, but you could also use the *.any()* method for other needs
df = df[mask.all(axis=1)]

Dask: masking a dataframe based on multiple conditions to perform selective calculations

I'm looking to replace values on rows where multiple conditions are met when using dask. The pre-set value with which I'll perform the replacement is present in one column, and if the condition is met, then I'll replace the target value with the pre-set value.
I'd like to stay in dask rather than performing this action with another library if possible because of memory constraints when shifting dataframes around.
At the moment, I'm attempting to use the .mask method.
Where GrassDeadFMC >= 12 and Windspeed <= 10 then make GrassFMCoefficient equal to the value in GFMG12L10.
ddf['GrassFMCoefficient'] = ddf['GFMG12L10'].mask(ddf['GrassDeadFMC'] >= 12 & ddf['WindSpeed'] <= 10)
The error I'm receiving is:
ValueError: Metadata inference failed in `and_`.
Original error is below:
------------------------
TypeError('cannot compare a dtyped [float32] array with a scalar of type [bool]')
A minimal executable script, which gives a slightly different error but probably suffers from the same issue:
import dask.dataframe as dd
import pandas as pd
from random import randint
df = pd.DataFrame({'GrassFMCoefficient': [0 for x in range(10)],
                   'GFMG12L10': [randint(1, 50) for x in range(10)],
                   'GrassDeadFMC': [randint(1, 50) for x in range(10)],
                   'WindSpeed': [randint(1, 30) for x in range(10)]})
ddf = dd.from_pandas(df,npartitions=1)
ddf['GrassFMCoefficient'] = ddf['GFMG12L10'].mask(ddf['GrassDeadFMC'] >= 12 & ddf['WindSpeed'] <= 10)
print(ddf.head(10))
Any help on this would be appreciated.
Do you want a result like this?
You have to isolate each condition with parentheses '()', e.g. (condition1) & (condition2), because & binds more tightly than the comparison operators; with the parentheses, & compares a Boolean with a Boolean.
ddf['GrassFMCoefficient'] = ddf['GFMG12L10'].mask((ddf['GrassDeadFMC'] >= 12) & (ddf['WindSpeed'] <= 10))
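A minimal sketch of the corrected reproducer from the question (the only change is the added parentheses):
import dask.dataframe as dd
import pandas as pd
from random import randint

df = pd.DataFrame({'GrassFMCoefficient': [0 for x in range(10)],
                   'GFMG12L10': [randint(1, 50) for x in range(10)],
                   'GrassDeadFMC': [randint(1, 50) for x in range(10)],
                   'WindSpeed': [randint(1, 30) for x in range(10)]})
ddf = dd.from_pandas(df, npartitions=1)

# each comparison is now evaluated before & combines the two boolean series
ddf['GrassFMCoefficient'] = ddf['GFMG12L10'].mask((ddf['GrassDeadFMC'] >= 12) & (ddf['WindSpeed'] <= 10))
print(ddf.head(10))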

Looking for a specific combination algorithm to solve a problem

Let's say I have a purchase total and a CSV file full of purchases, where some of them make up that total and some don't. Is there a way to search the CSV to find the combination or combinations of purchases that make up that total? Let's say the purchase total is $155 and my CSV file has the purchases [$5.00, $40.00, $7.25, $100.00, $10.00]. Is there an algorithm that will tell me the combinations of the purchases that make up the total?
Edit: I am still having trouble with the solution you provided. When I feed this spreadsheet with pandas into the code snippet you provided, it only shows one solution equal to $110.04 when there are three. It is as if it is stopping early without finding the final solutions. This is the output I see in the terminal: [57.25, 15.87, 13.67, 23.25]. The output should be [10.24, 37.49, 58.21, 4.1], [64.8, 45.24] and [57.25, 15.87, 13.67, 23.25].
from collections import namedtuple
import pandas

df = pandas.read_csv('purchases.csv', parse_dates=["Date"])
values = df["Purchase"].to_list()
S = 110.04

Candidate = namedtuple('Candidate', ['sum', 'lastIndex', 'path'])
tuples = [Candidate(0, -1, [])]
while len(tuples):
    next = []
    for (sum, i, path) in tuples:
        # you may range from i + 1 if you don't want repetitions of the same purchase
        for j in range(i + 1, len(values)):
            v = values[j]
            # you may check for strict equality if no purchase is free (0$)
            if v + sum <= S:
                next.append(Candidate(sum = v + sum, lastIndex = j, path = path + [v]))
            if v + sum == S:
                print(path + [v])
    tuples = next
A dp solution:
Let S be your goal sum
Build all 1-combinations. Keep those whose sum is less than or equal to S. Whenever one equals S, output it.
Build all 2-combinations reusing the previous ones.
Repeat
from collections import namedtuple

values = [57.25, 15.87, 13.67, 23.25, 64.8, 45.24, 10.24, 37.49, 58.21, 4.1]
S = 110.04

Candidate = namedtuple('Candidate', ['sum', 'lastIndex', 'path'])
tuples = [Candidate(0, -1, [])]
while len(tuples):
    next = []
    for (sum, i, path) in tuples:
        # you may range from i + 1 if you don't want repetitions of the same purchase
        for j in range(i + 1, len(values)):
            v = values[j]
            # you may check for strict equality if no purchase is free (0$)
            if v + sum <= S:
                next.append(Candidate(sum = v + sum, lastIndex = j, path = path + [v]))
            if abs(v + sum - S) <= 1e-2:
                print(path + [v])
    tuples = next
More detail about the tuple structure:
What we want to do is to augment a tuple with a new value.
Assume we start with some tuple with only one value, say the tuple associated to 40.
its sum is trivially 40
the last index added is 1 (it is the number 40 itself)
the used values is [40], since it is the sole value.
Now to generate the next tuples, we will iterate from the last index (1), to the end of the array.
So candidates are 7.25, 100.00, 10.00
The new tuple associated to 7.25 is:
sum: 40 + 7.25
last index: 2 (7.25 has index 2 in array)
used values: values of tuple union 7.25, so [40, 7.25]
The purpose of using the last index is to avoid considering both [40, 7.25] and [7.25, 40]; they would be the same combination.
So, to generate tuples from an old one, only consider values occurring 'after' the old one's last index in the array.
At every step we thus have tuples of the same size; each of them aggregates the values taken, the sum they amount to, and the next values to consider to augment it to a bigger size.
edit: to handle floats, you may replace the equality test (v + sum) == S by abs(v + sum - S) <= 1e-2 to say a solution is reached when you are very close to the target (here the distance is arbitrarily set to 0.01).
edit2: the same code is at https://repl.it/repls/DrearyWindingHypertalk, which does give:
[64.8, 45.24]
[57.25, 15.87, 13.67, 23.25]
[10.24, 37.49, 58.21, 4.1]
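An alternative sketch, not part of the original answer: convert the amounts to integer cents up front so exact equality can be used instead of a tolerance (find_combinations is a hypothetical helper that implements the same enumeration recursively):
values = [57.25, 15.87, 13.67, 23.25, 64.8, 45.24, 10.24, 37.49, 58.21, 4.1]
S = 110.04

# working in integer cents avoids float comparison issues entirely
values_cents = [round(v * 100) for v in values]
S_cents = round(S * 100)

def find_combinations(vals, target):
    # enumerate index-increasing combinations, pruning branches that exceed the target
    results = []
    def extend(start, total, path):
        for j in range(start, len(vals)):
            t = total + vals[j]
            if t == target:
                results.append(path + [vals[j]])
            elif t < target:
                extend(j + 1, t, path + [vals[j]])
    extend(0, 0, [])
    return results

for combo in find_combinations(values_cents, S_cents):
    print([c / 100 for c in combo])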

Find same values in two huge datasets

I have a list with roughly 2,000 rows [UnixTimestamp, Value01, Value02] (it comes as JSON) and another list with a few million rows [UnixTimestamp, Value01, Value02] (it comes as a .csv). I want to figure out whether each element in the smaller list has an element in the second list with the same values.
Both the lists are sorted by the timestamp
The simplest way is obviously something like this:
for x in small_list:
    if x in big_list:
        return True
return False
But does that make sense or is there a more efficient way?
Thanks
If they are just lists, you can try something like this:
set(small_list) & set(big_list)
Converting to a set removes duplicate values, and the & operator gives you the intersection, i.e. the values that appear in both sets.
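A minimal sketch of that idea, assuming each row is converted to a tuple first (lists are not hashable, so they cannot be put into a set directly):
small_set = {tuple(row) for row in small_list}
big_set = {tuple(row) for row in big_list}

common = small_set & big_set        # rows present in both lists
missing = small_set - big_set       # rows of the small list with no match
all_matched = not missing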
Both are already sorted by timestamp, so use that to your advantage:
big_list_index = 0
for x in small_list:
    # advance through the big list until its timestamp catches up with x
    while big_list_index < len(big_list) and big_list[big_list_index].timestamp < x.timestamp:
        big_list_index += 1
    # scan the big-list entries that share x's timestamp
    while big_list_index < len(big_list) and big_list[big_list_index].timestamp == x.timestamp:
        y = big_list[big_list_index]
        if y.value01 == x.value01 and y.value02 == x.value02:
            return True
        big_list_index += 1
If timestamps are unique, the complexity is O(len(big_list) + len(small_list)).
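A complete sketch of the same two-pointer idea, wrapped in a function that reports whether every row of the small list has a match (the namedtuple rows and field names are assumptions for illustration):
from collections import namedtuple

Row = namedtuple('Row', ['timestamp', 'value01', 'value02'])

def all_have_match(small_list, big_list):
    # both lists must be sorted by timestamp
    i = 0
    for x in small_list:
        # skip big-list rows older than x
        while i < len(big_list) and big_list[i].timestamp < x.timestamp:
            i += 1
        # look through the rows that share x's timestamp
        j = i
        found = False
        while j < len(big_list) and big_list[j].timestamp == x.timestamp:
            if big_list[j].value01 == x.value01 and big_list[j].value02 == x.value02:
                found = True
                break
            j += 1
        if not found:
            return False
    return True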

Splitting the output obtained by Counter in Python and pushing it to Excel

I am using the Counter function to count every word in the descriptions of 20,000 products and see how many times each word repeats, e.g. 'pipette' repeats 1282 times. To do this I have split column A into several columns P, Q, R, S, T, U and V:
df["P"] = df["A"].str.split(n=10).str[0]
df["Q"] = df["A"].str.split(n=10).str[1]
df["R"] = df["A"].str.split(n=10).str[2]
df["S"] = df["A"].str.split(n=10).str[3]
df["T"] = df["A"].str.split(n=10).str[4]
df["U"] = df["A"].str.split(n=10).str[5]
df["V"] = df["A"].str.split(n=10).str[6]
This shows the split products.
Then I count each of the columns individually and add the Counters together to get the total word counts:
d = Counter(df['P'])
e = Counter(df['Q'])
f = Counter(df['R'])
g = Counter(df['S'])
h = Counter(df['T'])
i = Counter(df['U'])
j = Counter(df['V'])
m = d+e+f+g+h+i+j
print(m)
This is the output I obtained using Counter.
Now I want to transfer the output into an Excel sheet with the keys in one column and the values in another.
Am I using the right method to do so? If yes, how shall I push them into different columns?
Note: the length of each key is different.
Also, I want to make all the items of column 'A' lower case so that the counter does not count the same word twice. How shall I go about it?
I've been learning Python for just a couple of months but I'll give it a shot. I'm sure there are better ways to perform the same action; maybe we can both learn something from this question. Let me know how this turns out. Good luck!
import pandas as pd
num = len(m.keys())
df = pd.DataFrame(columns=['Key', 'Value'])
for i, j, k in zip(range(num), m.keys(), m.values()):
    df.loc[i] = [j, k]
df.to_csv('Your_Project.csv')
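A shorter sketch of the whole pipeline (assuming df still holds the original data with the descriptions in column 'A'; the output file name is an assumption): lower-case the descriptions, count all words in one pass, then write the key/value pairs to an Excel file:
import pandas as pd
from collections import Counter

# lower-case column A so 'Pipette' and 'pipette' are counted as one word
words = df['A'].astype(str).str.lower().str.split()
m = Counter(w for row in words for w in row)

out = pd.DataFrame(list(m.items()), columns=['Key', 'Value'])
out.to_excel('word_counts.xlsx', index=False)  # needs openpyxl or xlsxwriter installed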
