Trying to use pandas to group all data in Column B [duplicate] - python-3.x

I have a large (about 12M rows) DataFrame df:
df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.

I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
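As a rough illustration of the suggestion above, here is a minimal sketch on a tiny made-up frame (the data and the `Occurrences` column name are assumptions, not from the question). It compares value_counts with groupby + size and builds a two-column result like the one the question was after.

```python
import pandas as pd

# Tiny stand-in for the 12M-row frame described in the question (made-up data)
df = pd.DataFrame({
    "word": ["a", "b", "a", "c", "a", "b"],
    "documents": [1, 1, 2, 2, 3, 3],
    "frequency": [5, 2, 7, 1, 3, 4],
})

# value_counts: one count per distinct word, sorted by count (descending)
counts = df["word"].value_counts()

# groupby + size: the same counts, ordered by group key instead
sizes = df.groupby("word").size()

# Turn the counts into a two-column frame; 'Occurrences' is a hypothetical name
occurrences = counts.rename_axis("word").reset_index(name="Occurrences")

print(occurrences)
```

Both routes give the same numbers; only the ordering of the result differs.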

When you want to count the frequency of categorical data in a column of a pandas DataFrame, use: df['Column_Name'].value_counts()

Just an addition to the previous answers. Let's not forget that when dealing with real data there might be null values, so it's useful to also include those in the counting by using the option dropna=False (default is True)
An example:
>>> df['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2

Other possible approaches to count occurrences could be to use (i) Counter from collections module, (ii) unique from numpy library and (iii) groupby + size in pandas.
To use collections.Counter:
from collections import Counter
out = pd.Series(Counter(df['word']))
To use numpy.unique:
import numpy as np
i, c = np.unique(df['word'], return_counts = True)
out = pd.Series(c, index = i)
To use groupby + size:
out = pd.Series(df.index, index=df['word']).groupby(level=0).size()
One very nice feature of value_counts that's missing in the above methods is that it sorts the counts. If having the counts sorted is absolutely necessary, then value_counts is the best method given its simplicity and performance (even though it still gets marginally outperformed by other methods especially for very large Series).
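If sorted counts are wanted from one of the other methods, the result can be sorted afterwards. A small sketch on made-up data, using the Counter approach:

```python
from collections import Counter

import pandas as pd

# Made-up sample; any Series of hashable values would do
words = pd.Series(["b", "a", "c", "a", "b", "a"])

# Counter yields counts in first-seen order, so sort explicitly
out = pd.Series(Counter(words)).sort_values(ascending=False)

# Counter can also return (value, count) pairs sorted by count directly
pairs = Counter(words).most_common()

print(out)
```

The extra sort adds to the runtime, which is part of why value_counts stays attractive when sorted output is required.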
Benchmarks
(if having the counts sorted is not important):
If we look at runtimes, it depends on the data stored in the DataFrame columns/Series.
If the Series is dtype object, then the fastest method for very large Series is collections.Counter, but in general value_counts is very competitive.
However, if it is dtype int, then the fastest method is numpy.unique:
Code used to produce the plots:
import perfplot
import numpy as np
import pandas as pd
from collections import Counter
def creator(n, dt='obj'):
    s = pd.Series(np.random.randint(2*n, size=n))
    return s.astype(str) if dt=='obj' else s
def plot_perfplot(datatype):
    perfplot.show(
        setup = lambda n: creator(n, datatype),
        kernels = [lambda s: s.value_counts(),
                   lambda s: pd.Series(Counter(s)),
                   lambda s: pd.Series((ic := np.unique(s, return_counts=True))[1], index = ic[0]),
                   lambda s: pd.Series(s.index, index=s).groupby(level=0).size()
                  ],
        labels = ['value_counts', 'Counter', 'np_unique', 'groupby_size'],
        n_range = [2 ** k for k in range(5, 25)],
        equality_check = lambda *x: (d:= pd.concat(x, axis=1)).eq(d[0], axis=0).all().all(),
        xlabel = '~len(s)',
        title = f'dtype {datatype}'
    )
plot_perfplot('obj')
plot_perfplot('int')

Related

fast date based replacement of rows in Pandas

I am looking for the fastest way to replace rows based on the index in pandas.
I want to set all rows in a given date range to np.nan, selecting them by index (a DatetimeIndex).
I tested various types of selection, but the bottleneck is clearly setting the rows to a value (np.nan in my case).
Naively, I want to do this:
df['2017-01-01':'2018-01-01'] = np.nan
I tried and tested the performance of various other methods, such as
df.loc['2017-01-01':'2018-01-01'] = np.nan
And also creating a mask with NumPy to speed it up
df['DateTime'] = df.index
st = pd.to_datetime('2017-01-01', format='%Y-%m-%d').to_datetime64()
en = pd.to_datetime('2018-01-01', format='%Y-%m-%d').to_datetime64()
ge_start = df['DateTime'] >= st
le_end = df['DateTime'] <= en
mask = (ge_start & le_end )
and then
df[mask] = np.nan
#or
df.where(~mask)
But with no big success. I have a DataFrame (that I unfortunately cannot share) of shape roughly (200, 1500000), so fairly big, and the operation takes on the order of seconds of CPU time, which is way too much in my opinion.
Would appreciate any ideas!
edit: after going through "Modifying a subset of rows in a pandas dataframe" and "Why dataframe.values is very slow", and unifying the datatypes for the operation, the problem is solved with a roughly 20x speedup.
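The linked posts are not reproduced here, so the following is only a guess at what "unifying datatypes" could look like: convert everything to one float64 array, blank the rows with a boolean mask in NumPy, and rebuild the frame. The sample data and date range are made up.

```python
import numpy as np
import pandas as pd

# Made-up frame with mixed int and float columns on a DatetimeIndex
idx = pd.date_range("2016-12-30", periods=6, freq="D")
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                   "b": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]}, index=idx)

start, end = pd.Timestamp("2017-01-01"), pd.Timestamp("2017-01-02")

# Boolean row mask computed directly on the index
mask = (df.index >= start) & (df.index <= end)

# One homogeneous float64 block: NaN is representable in every column
values = df.to_numpy(dtype="float64", copy=True)
values[mask] = np.nan

# Rebuild the frame around the modified array
result = pd.DataFrame(values, index=df.index, columns=df.columns)
print(result)
```

Whether this reproduces the reported speedup depends on the real data; it is meant only to show the shape of the approach.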

Is there any efficient function for extracting the probability vector after running pyspark ML algorithm other than rdd.extract in Pyspark?

I want to extract the probabilities from the vector and turn them into rows. For that I have used rdd.map(extract):
def extract(row):
    return (row.prediction,) + tuple(row.probability.toArray().tolist())
I have 96 probabilities within that vector. After extracting them into rows, I sorted them and selected the top 10 probabilities. This works well for a small dataset of about 1,000 records (i.e. 96 * 1000 = 96,000 rows), but for 100k records the function takes much longer. Is there any other way to extract those probabilities and turn them into rows?
One thing that might be improved in your code is to calculate the top-N directly in the extract() function instead of retrieving all 96 probabilities and then post-processing them to find the top-N. For example, using np.partition on the NumPy ndarray returned from the toArray() method:
from numpy import partition, arange
N = 10
extract = lambda row: (row.prediction,) + tuple(-partition(-row.probability.toArray(),arange(N))[:N])
my_model.summary.predictions.rdd.map(extract).take(20)
Note: if order is not important for the top-N probabilities, then adjust arange(N) to N.
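The np.partition trick can be checked on a plain NumPy array without Spark. This sketch uses a made-up probability vector and assumes N is smaller than the array length:

```python
import numpy as np

N = 3
# Made-up probability vector standing in for row.probability.toArray()
probs = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])

# kth=arange(N): the first N slots end up in fully sorted position,
# so negating twice gives the top-N in descending order
top_sorted = -np.partition(-probs, np.arange(N))[:N]

# kth=N: the first N slots hold the N largest values, in no guaranteed order
top_unordered = -np.partition(-probs, N)[:N]

print(top_sorted)
```

Both variants avoid sorting the whole vector, which is the point of using partition here.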
EDIT: Since Vector is not natively supported by Spark SQL (as of Spark 2.4.4), to use the DataFrame API and its optimizations you'll first have to use a UDF to convert the Vector in this question into an ArrayType.
For Spark 2.4+, use udf + sort_array + slice, which can use Java's optimization engine to sort, but without partial sorting:
from pyspark.sql.functions import udf, sort_array, slice
udf_extract_1 = udf(lambda v: v.toArray().tolist(), 'array<double>')
(my_model.summary.predictions
    .select('prediction', udf_extract_1('probability').alias('probability'))
    .withColumn('probability', slice(sort_array('probability', False),1,N))
    .show(truncate=False))
Or use udf + Python's partial-sorting function:
from pyspark.sql.functions import udf
from numpy import partition, arange
udf_extract_2 = udf(lambda v: (-partition(-v.toArray(), arange(N))[:N]).tolist(), 'array<double>')
(my_model.summary.predictions
    .select('prediction', udf_extract_2('probability').alias('probability'))
    .show(truncate=False))

Using loops to call multiple pandas dataframe columns

I am new to python3 and trying to do chi-squared tests on columns in a pandas dataframe. My columns are in pairs: observed_count_column_1, expected_count_column_1, observed_count_column_2, expected_count_column_2 and so on. I would like to make a loop to get all column pairs done at once.
I succeed doing this if I specify column index integers or column names manually.
This works
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
chisquare(df.iloc[:,[0]], df.iloc[:,[1]])
This, trying with a loop, does not:
from scipy.stats import chisquare
import pandas as pd
df = pd.read_csv (r'count.csv')
for n in [0,2,4,6,8,10]:
    chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
The loop code does not seem to run at all and I get no error but no output either.
I was wondering why this is happening and how can I actually approach this?
Thank you,
Dan
Consider building a data frame of chi-square results from list of tuples, then assign column names as indicators for observed and expected frequencies (subsetting even/odd columns by indexed notation):
# CREATE DATA FRAME FROM LIST OF TUPLES
# THEN ASSIGN COLUMN NAMES
chi_square_df = (pd.DataFrame([chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
                               for n in range(0,11,2)],
                              columns = ['chi_sq_stat', 'p_value'])
                   .assign(obs_freq = df.columns[::2],
                           exp_freq = df.columns[1::2])
                )
The chisquare() function returns two values, so you can try this:
for n in range(0, 11, 2):
    chisq, p = chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]])
    print('Chisq: {}, p-value: {}'.format(chisq, p))
You can find what it returns in the docs here https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
Thank you for the suggestions. Using the information from Parfait's comment that results are not printed automatically inside a loop, I managed to find a solution, although not as elegant as their own solution above.
for n in range(0, 11, 2):
    print(chisquare(df.iloc[:,[n]], df.iloc[:,[n+1]]))
This gives the expected results.
Dan
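An alternative to stepping through integer positions is to pair the columns directly. The sketch below uses made-up counts and hypothetical column names in place of count.csv. It passes 1-D columns, so each statistic comes back as a scalar rather than a length-1 array.

```python
import pandas as pd
from scipy.stats import chisquare

# Made-up observed/expected pairs; column names are hypothetical
df = pd.DataFrame({
    "observed_1": [18, 22, 20],
    "expected_1": [20, 20, 20],
    "observed_2": [30, 10, 20],
    "expected_2": [20, 20, 20],
})

rows = []
# Even-positioned columns are observed counts, odd-positioned are expected
for obs_col, exp_col in zip(df.columns[::2], df.columns[1::2]):
    stat, p = chisquare(df[obs_col].to_numpy(), f_exp=df[exp_col].to_numpy())
    rows.append((obs_col, exp_col, stat, p))

summary = pd.DataFrame(rows, columns=["obs_freq", "exp_freq", "chi_sq_stat", "p_value"])
print(summary)
```

Note that chisquare expects the observed and expected totals to agree, which the sample data here is constructed to satisfy.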

pandas group by in parallel

I'm having some trouble splitting the aggregation step of a group-by operation across multiple cores. I have the following working code, and would like to apply it over several processors:
import pandas as pd
import numpy as np
from multiprocessing import Pool, cpu_count
mydf = pd.DataFrame({'v1':[1,2,3,4]*6,'v2':['a','b','c']*8,'v3':np.arange(20,44)})
to which I can then apply the following GroupBy operation
(the step I wish to do in parallel):
mydf.groupby(by=['v1','v2']).apply(lambda x: np.percentile(x['v3'],[20,30]))
yielding the series:
1 a [22.4, 23.6]
b [26.4, 27.6]
c [30.4, 31.6]
2 a [31.4, 32.6]
b [23.4, 24.6]
c [27.4, 28.6]
I tried the following, with reference to "parallel groupby":
def applyParallel(dfGrouped, func):
    with Pool(1) as p:
        ret_list = p.map(func, [group for name, group in dfGrouped])
    return pd.concat(ret_list)
def myfunc(df):
    df['pct1'] = df.loc[:,['v3']].apply(np.percentile,args=([20],))
    df['pct2'] = df.loc[:,['v3']].apply(np.percentile,args=([80],))
    return(df)
grouped = mydf.groupby(by=['v1','v2'])
applyParallel(grouped,myfunc)
But I'm losing the index structure and getting duplicates. I could probably solve this step with a further group by operation, but I think it shouldn't be too difficult to avoid it entirely. Any suggestions?
Not that I'm still looking for an answer, but it would probably be better to use a library that handles parallel manipulation of pandas DataFrames, rather than trying to do so manually.
Dask is one option, intended to scale pandas operations with little code modification.
Another option (though maybe a little more difficult to set up) is PySpark.
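For completeness, one way to keep the group index when mapping over groups by hand is to carry the group keys along and rebuild a MultiIndex from them. In this sketch a thread pool from concurrent.futures is swapped in for multiprocessing.Pool so it runs anywhere without pickling concerns; the function names and the `pct20`/`pct30` column names are made up. The same key-carrying pattern should work with a process pool.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

mydf = pd.DataFrame({'v1': [1, 2, 3, 4] * 6,
                     'v2': ['a', 'b', 'c'] * 8,
                     'v3': np.arange(20, 44)})

def percentiles(group):
    # Return plain floats so the results assemble into a regular table
    p20, p30 = np.percentile(group['v3'], [20, 30])
    return float(p20), float(p30)

def apply_parallel(grouped, func, key_names, max_workers=2):
    # Split (key, group) pairs so the keys can be used to rebuild the index
    keys, groups = zip(*grouped)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(func, groups))  # map preserves input order
    index = pd.MultiIndex.from_tuples(list(keys), names=key_names)
    return pd.DataFrame(results, index=index, columns=['pct20', 'pct30'])

out = apply_parallel(mydf.groupby(['v1', 'v2']), percentiles, ['v1', 'v2'])
print(out.head())
```

Because the function returns one row per group instead of modifying the group frame, there are no duplicated rows to clean up afterwards.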

Use data in Spark Dataframe column as condition or input in another column expression

I have an operation that I want to perform within PySpark 2.0 that would be easy to perform as a df.rdd.map, but since I would prefer to stay inside the Dataframe execution engine for performance reasons, I want to find a way to do this using Dataframe operations only.
The operation, in RDD-style, is something like this:
def precision_formatter(row):
    formatter = "%.{}f".format(row.precision)
    return tuple(row) + (formatter % (row.amount_raw / 10 ** row.precision),)
df = df.rdd.map(precision_formatter)
Basically, I have a column that tells me, for each row, what the precision for my string formatting operation should be, and I want to selectively format the 'amount_raw' column as a string depending on that precision.
I don't know of a way to use the contents of one or more columns as input to another Column operation. The closest I can come is suggesting the use of Column.when with an externally-defined set of boolean operations that correspond to the set of possible boolean conditions/cases within the column or columns.
In this specific case, for instance, if you can obtain (or better yet, already have) all possible values of row.precision, then you can iterate over that set and apply a Column.when operation for each value in the set. I believe this set can be obtained with df.select('precision').distinct().collect().
Because the pyspark.sql.functions.when and Column.when operations themselves return a Column object, you can iterate over the items in the set (however it was obtained) and keep 'appending' when operations to each other programmatically until you have exhausted the set:
import pyspark.sql.functions as PSF
from pyspark.sql.types import StringType
def format_amounts_with_precision(df, all_precisions_set):
    amt_col = PSF.when(df['precision'] == 0, df['amount_raw'].cast(StringType()))
    for precision in all_precisions_set:
        if precision != 0: # this is a messy way of having a base case above
            fmt_str = '%.{}f'.format(precision)
            amt_col = amt_col.when(df['precision'] == precision,
                                   PSF.format_string(fmt_str, df['amount_raw'] / 10 ** precision))
    return df.withColumn('amount', amt_col)
You can do it with a Python UDF. A UDF can take as many input values as needed (values from columns of a Row) and return a single output value. It would look something like this:
from pyspark.sql import types as T, functions as F
from pyspark.sql.functions import udf, col
# Create example data frame
schema = T.StructType([
    T.StructField('precision', T.IntegerType(), False),
    T.StructField('value', T.FloatType(), False)
])
data = [
    (1, 0.123456),
    (2, 0.123456),
    (3, 0.123456)
]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)
# Define UDF and apply it
def format_func(precision, value):
    format_str = "{:." + str(precision) + "f}"
    return format_str.format(value)
format_udf = F.udf(format_func, T.StringType())
new_df = df.withColumn('formatted', format_udf('precision', 'value'))
new_df.show()
Also, if instead of the column precision value you wanted to use a global one, you could use the lit(..) function when you call it like this:
new_df = df.withColumn('formatted', format_udf(F.lit(2), 'value'))
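The formatting function itself is plain Python, so its behaviour can be checked without a Spark session. The sketch below is an equivalent variant that uses a nested replacement field instead of string concatenation, applied to the sample value from the example data:

```python
def format_func(precision, value):
    # Nested replacement field: {p} is substituted into the format spec first
    return "{:.{p}f}".format(value, p=precision)

# Same sample value as the example data, with precisions 1 to 3
formatted = [format_func(p, 0.123456) for p in (1, 2, 3)]
print(formatted)  # ['0.1', '0.12', '0.123']
```

Wrapping this function with F.udf as shown earlier should behave the same way inside Spark, one row at a time.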
