How to calculate the number of rows of a dataframe efficiently? [duplicate] - apache-spark

This question already has answers here:
Count on Spark Dataframe is extremely slow
(2 answers)
Getting the count of records in a data frame quickly
(2 answers)
Closed 3 years ago.
I have a very large pyspark dataframe and I would like to calculate the number of rows, but the count() method is too slow. Is there any other, faster method?

If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling by your sampling factor:
>>> df = spark.range(10)
>>> df.sample(0.5).count()
4
In this case, you would scale the count() result by 2 (or 1/0.5). Obviously, there is a statistical error with this approach.
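Continuing that example, a minimal sketch of scaling the sampled count back up by 1/fraction (the sampled count of 4 above would scale to an estimate of 8 rows):
>>> fraction = 0.5
>>> sampled = df.sample(fraction).count()  # e.g. 4, as in the run above
>>> int(sampled / fraction)                # approximate total row count
8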

Related

Turbocharge a lambda function comparing values in two dataframes

I've got an OHLCV financial dataset I'm working with that's approx 1.5M rows. It's for a single security back to 2000 at a 1min resolution. The dataset had all the zero-volume timestamps removed, so it has gaps in the trading day which I don't want for my purposes.
Raw data (df1) looks like:
Timestamp, Open, High, Low, Close, Volume
To fill in all the zero-volume timestamps I created an empty trading calendar (df2) using pandas_market_calendars (which is an amazing time saver), and I then used df3 = pd.merge_asof(df2, df1, on='Timestamp', direction='nearest') to fill all the timestamps. This is the behaviour I wanted for the price data (OHLC) but not for volume. I need all the 'filled' timestamps to show zero volume, so I figured a lambda function would suit (as below) to check whether each timestamp was in the original dataframe (df1) or not.
tss = df1.Timestamp.to_numpy()
df2['Adj_Volume'] = df2.apply(lambda x: x['Volume'] if x['Timestamp'] in tss else 0, axis=1)
I ran this for an hour, then two hours, then five hours, and it still hadn't finished. To try and work out what's going on, I then used tqdm (progress_apply) and it estimates that it'll take 100 hours to complete! I'm running the conda dist of Jupyter Notebooks on a 2014 MacBook Air (1.7GHz, 8GB RAM), which isn't a supercomputer, but 100 hours seems wacky.
If I cut down tss and df3 to a single year (~50k rows), it runs in ~5 mins. However, this doesn't scale linearly to the full dataset: 100 hours vs ~100 mins (5 mins x 20 years, 2000-2019). Slicing the dataframe up into years in a Python-level loop and then joining them again afterwards feels clunky, but I can't think of another way.
Is there a smarter way to do this which takes advantage of vectorised operations that can be run on the entire dataset in a single operation?
Cheers
Maybe you could try the np.where function together with the .isin method?
import numpy as np
df2['Volume'] = np.where(df2['Timestamp'].isin(tss), df2['Volume'], 0)
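For reference, a small self-contained sketch of this vectorised approach on a few hypothetical 1-minute timestamps (toy data, not the asker's real dataset):
import numpy as np
import pandas as pd

# Original data (df1) and the merge_asof-filled calendar (df2), toy values.
df1 = pd.DataFrame({'Timestamp': pd.to_datetime(['2020-01-01 09:30', '2020-01-01 09:32']),
                    'Volume': [100, 150]})
df2 = pd.DataFrame({'Timestamp': pd.to_datetime(['2020-01-01 09:30', '2020-01-01 09:31', '2020-01-01 09:32']),
                    'Volume': [100, 100, 150]})

tss = df1['Timestamp'].to_numpy()
# Keep the volume only where the timestamp existed in the original data,
# otherwise set it to zero, in a single vectorised pass over the whole frame.
df2['Adj_Volume'] = np.where(df2['Timestamp'].isin(tss), df2['Volume'], 0)
print(df2)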

How to get descriptive statistics of all columns in python [duplicate]

This question already has answers here:
How do I expand the output display to see more columns of a Pandas DataFrame?
(22 answers)
Closed 3 years ago.
I have a dataset with 200000 rows and 201 columns. I want to have descriptive statistics of all the variables.
I tried:
train.describe()
But this only gives output for the first and last 8 variables. Is there any method I can use to get the statistics for all of the columns?
Probably some of your columns are of a type other than numeric. Try train.apply(pd.to_numeric) and then train.describe().
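A minimal sketch of that suggestion on a hypothetical toy frame standing in for train (errors='coerce' turns unparseable values into NaN so the conversion doesn't fail on genuinely non-numeric columns):
import pandas as pd

# Toy stand-in for `train`; one numeric column is stored as strings.
train = pd.DataFrame({'a': [1, 2, 3], 'b': ['4', '5', '6']})

# Coerce every column to numeric where possible, then describe all of them.
train = train.apply(pd.to_numeric, errors='coerce')
print(train.describe())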

Using median instead of mean as aggregation function in Spark [duplicate]

This question already has answers here:
How to find median and quantiles using Spark
(8 answers)
Closed 5 years ago.
Say I have a dataframe that contains cars, their brand and their price. I would like to replace the avg below with the median (or another percentile):
df.groupby('carBrand').agg(F.avg('carPrice').alias('avgPrice'))
However, it seems that there is no built-in aggregation function in Spark that computes this.
You can try the approxQuantile function (see http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)
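For reference, a rough sketch under the question's assumed schema (a DataFrame df with carBrand and carPrice columns). approxQuantile operates on the whole DataFrame; for a per-group median, one option is the percentile_approx SQL function via expr():
from pyspark.sql import functions as F

# Approximate median of the whole carPrice column (0.5 quantile, 1% relative error).
median_all = df.approxQuantile('carPrice', [0.5], 0.01)[0]

# Approximate median per brand, analogous to the avg() aggregation above.
df.groupBy('carBrand') \
  .agg(F.expr('percentile_approx(carPrice, 0.5)').alias('medianPrice')) \
  .show()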

Python pandas: Best way to normalize data? [duplicate]

This question already has answers here:
Normalise between 0 and 1 ignoring NaN
(3 answers)
Closed 6 years ago.
I have a large pandas dataframe with about 80 columns. Each of the 80 columns in the dataframe report daily traffic statistics for websites (the columns are the websites).
As I don't want to work with the raw traffic statistics, I would rather normalize all of my columns (except for the first, which is the date), either from 0 to 1 or (even better) from 0 to 100.
Date A B ...
10/10/2010 100.0 402.0 ...
11/10/2010 250.0 800.0 ...
12/10/2010 800.0 2000.0 ...
13/10/2010 400.0 1800.0 ...
That being said, I wonder which normalization to apply: min-max scaling or z-score normalization (standardization)? Some of my columns have strong outliers. It would be great to have an example. I am sorry for not being able to provide the full data.
First, turn your Date column into an index.
dates = df.pop('Date')
df.index = dates
Then either use z-score normalizing:
df1 = (df - df.mean())/df.std()
or min-max scaling:
df2 = (df-df.min())/(df.max()-df.min())
I would probably advise z-score normalization, because min-max scaling is highly susceptible to outliers.
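A small self-contained sketch using the four sample rows from the question, including a 0-100 rescale of the min-max result:
import pandas as pd

# Toy frame built from the sample rows in the question.
df = pd.DataFrame({
    'Date': ['10/10/2010', '11/10/2010', '12/10/2010', '13/10/2010'],
    'A': [100.0, 250.0, 800.0, 400.0],
    'B': [402.0, 800.0, 2000.0, 1800.0],
})
df.index = df.pop('Date')

# z-score normalization (mean 0, standard deviation 1 per column).
z = (df - df.mean()) / df.std()

# min-max scaling, rescaled to 0-100 as the question asks.
mm = (df - df.min()) / (df.max() - df.min()) * 100
print(z)
print(mm)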

Handling categorical variables in StreamingLogisticRegressionWithSGD [duplicate]

This question already has answers here:
How to encode categorical features in Apache Spark
(3 answers)
Closed 6 years ago.
I am trying to use StreamingLogisticRegressionWithSGD to build a CTR prediction model.
The documentation (linked here) mentions that numFeatures should be constant.
The problem that I am facing is :
Since most of my variables are categorical, numFeatures should be the size of the final feature set after encoding and parsing the categorical variables into labeled-point format.
Suppose that for a categorical variable x1 I have 10 distinct values in the current window.
But in the next window some new values/items get added to x1 and the number of distinct values increases. How should I handle the numFeatures variable in this case, since it will now change?
Basically, my question is how should I handle the new values of the categorical variables in streaming model.
Thanks,
Kundan
You should fill the missing columns with zero values and discard any newly encountered values in each window, to make sure the number of features remains the same as when the model was trained.
Let's consider a column city having the values [NewYork, Paris, Tokyo] in the training set. Encoding this would result in three columns.
If during prediction you find the values [NewYork, Paris, Chicago, RioDeJaneiro], you should discard Chicago and RioDeJaneiro, and fill a zero value for the column corresponding to Tokyo, so that the result still has three columns (one for each of [NewYork, Paris, Tokyo]).
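As an illustration (plain Python rather than Spark-specific code, with a hypothetical one_hot helper), freezing the vocabulary at training time keeps the feature-vector length constant across windows:
# Vocabulary fixed when the model was trained.
TRAIN_VOCAB = ['NewYork', 'Paris', 'Tokyo']

def one_hot(value, vocab=TRAIN_VOCAB):
    # Unseen values (e.g. Chicago, RioDeJaneiro) are discarded; a value that
    # never appears in a window simply leaves its column at zero.
    vec = [0] * len(vocab)
    if value in vocab:
        vec[vocab.index(value)] = 1
    return vec

print(one_hot('Paris'))    # [0, 1, 0]
print(one_hot('Chicago'))  # [0, 0, 0]  -> still three columns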
