Time Optimization for pandas dataframe reconstruction (random to fixed sampling) - python-3.x

I am very new to python and pandas and my limited experience has led me to come up with an inefficient solution making my code too slow.
I have some data corresponding to stock market prices.
Sampling is random at nanosecond level.
What I am trying to achieve is transform to a new data-set with fixed sampling rate.
I am transforming my data-set as follows:
I'm setting a time_delta as a static time step of 0.5 seconds
I'm dropping records corresponding to the same nanosecond
I'm generating timestamps from my start_time to the calculated end_time
I'm iterating through my original dataframe copying (and duplicating when needed) the last known record in my time_delta for each step to a new dataframe.
I believe that my issue is probably that I am appending the records one-by-one to my new dataframe, however I haven't been able to figure out a way utilizing pandas built-ins to optimize my code.
Runtime is currently ~4min for a day's data (turning ~30K samples to 57600) when executing on Google Colab.
I've also tested locally without any improvement.
# ====================================================================
# Rate Re-Definition
# ====================================================================
SAMPLES_PER_SECOND = 2
dt = 1000000000 / SAMPLES_PER_SECOND # Time delta in nanoseconds
SECONDS_IN_WORK_DAY = 28800 # 60 seconds * 60 minutes * 8 hours
TOTAL_SAMPLES = SECONDS_IN_WORK_DAY * SAMPLES_PER_SECOND
SAMPLING_PERIOD = dt * TOTAL_SAMPLES
start_of_day_timestamp = ceil_to_minute(df['TimeStamp'].iloc[0])
end_of_day_timestamp = start_of_day_timestamp + SAMPLING_PERIOD
fixed_timestamps = np.arange(start_of_day_timestamp,
                             end_of_day_timestamp,
                             dt,
                             dtype=np.uint64)
# ====================================================================
# Drop records corresponding to the same timestamps
# ====================================================================
df1 = df.drop_duplicates(subset='TimeStamp', keep="last")
# ====================================================================
# Construct new dataframe
# ====================================================================
df2 = df1.iloc[0:1]
index_bounds_limit = df1.shape[0] - 1
index = 0
for i in tqdm(range(1, TOTAL_SAMPLES), desc="Constructing fixed sampling rate records... "):
    while index < index_bounds_limit and df1['TimeStamp'].iloc[index] < fixed_timestamps[i]:
        index += 1
    df2 = df2.append(df1.iloc[index], ignore_index=True)
df2['TimeStamp'] = fixed_timestamps
I need to reduce the time as much as possible (while maintaining readability/maintainability, no need to use "hacks").
I would appreciate any help and pointers towards the right direction.
Thanks in advance
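One way to avoid the row-by-row append is to compute all row positions at once. The following is a minimal sketch rather than a tested drop-in replacement: it assumes the TimeStamp column holds integer nanoseconds, and resample_fixed is a hypothetical helper name. np.searchsorted finds, for every fixed timestamp, the first record whose timestamp is not smaller, which is what the while loop above does one step at a time.

```python
import numpy as np
import pandas as pd


def resample_fixed(df, fixed_timestamps, ts_col="TimeStamp"):
    """Pick one source row per fixed timestamp without a Python-level loop.

    Hypothetical helper mirroring the loop in the question: for each fixed
    timestamp it selects the first record whose timestamp is >= that value,
    clamped to the last record.
    """
    # Same de-duplication as in the question; sorting makes searchsorted valid.
    df1 = df.drop_duplicates(subset=ts_col, keep="last").sort_values(ts_col)
    ts = df1[ts_col].to_numpy()

    # Use one integer dtype on both sides so nanosecond values are not
    # silently converted to floats during the comparison.
    fixed = np.asarray(fixed_timestamps).astype(ts.dtype)

    idx = np.searchsorted(ts, fixed, side="left")
    idx = np.minimum(idx, len(ts) - 1)  # never run past the last record
    idx[0] = 0                          # the loop seeds the output with row 0

    out = df1.iloc[idx].reset_index(drop=True)
    out[ts_col] = fixed
    return out
```

If what you actually want is the last record at or before each tick (a forward fill), pd.merge_asof with its default direction="backward" is the pandas built-in for that kind of lookup.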

Related

fast date based replacement of rows in Pandas

I am on a quest of finding the fastest replacement method based on index in Pandas.
I want to fill np.nans to all rows based on index (DateTimeIndex).
I tested various types of selection, but obviously, the bottleneck is setting the rows equal to a value (np.nan in my case).
Naively, I want to do this:
df['2017-01-01':'2018-01-01'] = np.nan
I tried and tested a performance of various other methods, such as
df.loc['2017-01-01':'2018-01-01'] = np.nan
And also creating a mask with NumPy to speed it up
df['DateTime'] = df.index
st = pd.to_datetime('2017-01-01', format='%Y-%m-%d').to_datetime64()
en = pd.to_datetime('2018-01-01', format='%Y-%m-%d').to_datetime64()
ge_start = df['DateTime'] >= st
le_end = df['DateTime'] <= en
mask = (ge_start & le_end )
and then
df[mask] = np.nan
#or
df.where(~mask)
But with no big success. My DataFrame (which I unfortunately cannot share) has a size of roughly (200, 1500000), so it is fairly big, and the operation takes on the order of seconds of CPU time, which is way too much in my opinion.
Would appreciate any ideas!
Edit: after going through "Modifying a subset of rows in a pandas dataframe" and "Why dataframe.values is very slow", and unifying the datatypes for the operation, the problem is solved with a roughly 20x speedup.
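For readers wondering what "unifying datatypes" could look like, here is a small sketch on a toy frame. It is one reading of the edit above, not the author's actual code: it assumes every column can be represented as float64, and the column names are made up. With a single dtype pandas can hold the frame in one homogeneous block, which is presumably what made the assignment cheaper.

```python
import numpy as np
import pandas as pd

# Toy frame with mixed dtypes (an int column and a float column).
index = pd.date_range("2016-01-01", "2018-12-31", freq="D")
df = pd.DataFrame(
    {"a": np.arange(len(index)), "b": np.random.rand(len(index))},
    index=index,
)

# Unify the dtypes first (assumption: every column fits in float64).
df = df.astype("float64")

# Label-based slice on the DatetimeIndex; both endpoints are included.
df.loc["2017-01-01":"2018-01-01"] = np.nan
```

Timing the assignment before and after the astype call on your own data is the only reliable way to tell how much it helps.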

PySpark apply function on 2 dataframes and write to csv for billions of rows on small hardware

I am trying to apply a levenshtein function for each string in dfs against each string in dfc and write the resulting dataframe to csv. The issue is that I'm creating so many rows by using the cross join and then applying the function, that my machine is struggling to write anything (taking forever to execute).
Trying to improve write performance:
I'm filtering the result of the cross join, keeping only the rows where the LevenshteinDistance is less than 15% of the target word's length.
Using bucketing on the first letter of each target word i.e. a, b, c, etc. still no luck (i.e. job runs for hours and doesn't generate any results).
from datetime import datetime
from config import config
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
def fuzzy_match(dfs, dfc, path_summary):
    """
    Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
    Keeps only the matches where the resulting LS distance is less than 15% of the SF name length.
    """
    # Apply Levenshtein and Soundex functions
    dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey"))
    df = dfc\
        .crossJoin(dfs)\
        .withColumn( "LevenshteinDistance", F.levenshtein( F.lower("OrganisationNameKey") , F.lower("CompanyNameKey") ) )\
        .withColumn( "HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey") )\
        .where("LevenshteinDistance < OrganisationNameKeyLen * 0.15")\
        .orderBy("OrganisationName", "CompanyName")
    # main() uses the return value, so the DataFrame has to be returned
    return df
def fuzzy_match_approve(df, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary):
    """
    Filters fuzzy matching DataFrame results on approved/rejected based on set of conditions:
    - If there is only 1 match against the SF name
    - If more than 1 match then take that with LS distance of 1
    - If more than 1 match and more multiple LS distances of 1, then take the one where Soundex codes are the same
    Writes results and summary to CSV.
    """
    def write_with_bucket(df, bucket_col, path):
        df.write\
            .mode("overwrite")\
            .bucketBy(26, bucket_col)\
            .option("path", path)\
            .option("header", True)\
            .saveAsTable("bucket", format="csv")

    # Add window function columns:
    # OrganisationNameMatchCount: Count AccountID per OrganisationName
    # LevenshteinDistance1Count: Count AccountID per OrganisationName where LevenshteinDistance = 1
    windowSpec = Window.partitionBy("OrganisationName")
    df = df\
        .select("AccountID", "OrganisationName", "OrganisationNameKey", "CompanyNumber", "CompanyName", "LevenshteinDistance", "HasSameSoundex")\
        .withColumn("OrganisationNameMatchCount", F.count("AccountID").over(windowSpec))\
        .withColumn("LevenshteinDistance1Count", F.count(F.when(F.col("LevenshteinDistance")==1, F.col("AccountID"))).over(windowSpec))
    # Add bucket key column
    df = df.withColumn( "OrganisationNameBucketKey", F.substring( col("OrganisationNameKey"),0,1) )
    # Define fuzzy match approved condition
    is_approved_1 = ( F.col("OrganisationNameMatchCount") == 1 )
    is_approved_2 = ( (F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") == 1) & (F.col("LevenshteinDistance") == 1) )
    is_approved_3 = ( (F.col("OrganisationNameMatchCount") > 1) & (F.col("LevenshteinDistance1Count") > 1) & (F.col("HasSameSoundex") == 'true') )
    is_approved = is_approved_1 | is_approved_2 | is_approved_3
    # Split fuzzy match results into approved and rejected
    df_approved = df.filter(is_approved)
    df_rejected = df.filter(~is_approved)
    # Export results
    # df_approved.write.csv(path_fuzzy_match_approved, mode="overwrite", header=True, quoteAll=True)
    # df_rejected.write.csv(path_fuzzy_match_rejected, mode="overwrite", header=True, quoteAll=True)
    write_with_bucket(df_approved, "OrganisationNameBucketKey", path_fuzzy_match_approved)
    write_with_bucket(df_rejected, "OrganisationNameBucketKey", path_fuzzy_match_rejected)
def main():
    spark = SparkSession...
    # Apply fuzzy match
    dfs = spark.read...
    dfc = spark.read...
    path_summary = ...
    df_fuzzy_match = fuzzy_match(dfs, dfc, path_summary)
    # Export results
    path_fuzzy_match_approved = ...
    path_fuzzy_match_rejected = ...
    fuzzy_match_approve(df_fuzzy_match, path_fuzzy_match_approved, path_fuzzy_match_rejected, path_summary)

main()
Other info:
df.rdd.getNumPartitions() is 2
dfs.count() is 12,515
dfc.count() is 5,110,430
How can I improve performance here and get the results into a CSV successfully?
There are a couple of things you can do to improve your computation:
Improve parallelism
As Nithish mentioned in the comments, you don't have enough partitions in your input data frames to make use of all your CPU cores. You're not using all your CPU capability and this will slow you down.
To increase your parallelism, repartition dfc to at least your number of cores:
dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)
You need to do this because your crossJoin is run as a BroadcastNestedLoopJoin which doesn't reshuffle your large input dataframe.
Separate your computation stages
A Spark dataframe/RDD is conceptually just a directed acyclic graph (DAG) of operations to run on your input data, but it does not hold data. One consequence of this behavior is that, by default, you'll rerun your computations as many times as you reuse your dataframe.
In your fuzzy_match_approve function, you run 2 separate filters on your df, which means you rerun the whole cross-join operation twice. You really don't want this!
One easy way to avoid this is to use cache() on your fuzzy_match result which should be fairly small given your inputs and matching criteria.
def fuzzy_match_running(dfs, dfc, path_summary):
    """
    Implements the Levenshtein and Soundex algorithms and returns a fuzzy matched DataFrame.
    Filters out those where resulting LS distance is less than 15% of SF name length.
    """
    # Apply Levenshtein and Soundex functions
    dfs = dfs.withColumn("OrganisationNameKeyLen", F.length("OrganisationNameKey")).cache()
    dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism).cache()
    df = dfc.crossJoin(dfs) \
        .withColumn( "LevenshteinDistance", F.levenshtein( F.lower("OrganisationNameKey") , F.lower("CompanyNameKey") ) ) \
        .withColumn( "HasSameSoundex", F.soundex("OrganisationNameKey") == F.soundex("CompanyNameKey") ) \
        .where("LevenshteinDistance < OrganisationNameKeyLen * 0.15") \
        .orderBy("OrganisationName", "CompanyName") \
        .cache()
    return df
If I run my fuzzy_match_running on some example data frames on my 8 core/16 threads I9-9980HK laptop (spark in local[*] mode with 8GB driver memory):
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 679.5572581291199 seconds
Matches/core/sec: 933436.210726889
The job takes about 12 min doing 572494*17728 ~ 10 billion row comparisons at 933k comparisons/second/core. Since your job does about 64 billion row comparisons, I would expect it to take about 80 min on my laptop.
You should run a similar experiment on your computer with a smaller sample to get an idea of your actual computing speed.
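The back-of-envelope estimate can be reproduced with a few lines of plain Python. This is only a sketch of the scaling argument: it assumes throughput stays constant as the input grows, which will not hold exactly. With the figures quoted above it lands at roughly 70 minutes, in the same ballpark as the estimate given.

```python
# Figures quoted above: the benchmark run and the row counts from the question.
bench_comparisons = 572_494 * 17_728   # dfc rows * dfs rows in the benchmark
bench_seconds = 679.5572581291199      # measured duration of the benchmark
comparisons_per_second = bench_comparisons / bench_seconds  # all cores together

job_comparisons = 12_515 * 5_110_430   # dfs.count() * dfc.count() in the question
estimated_minutes = job_comparisons / comparisons_per_second / 60

print(f"benchmark: {bench_comparisons / 1e9:.1f} billion comparisons")
print(f"full job:  {job_comparisons / 1e9:.1f} billion comparisons")
print(f"estimated duration: {estimated_minutes:.0f} min")
```

Swapping in the duration of your own small-sample run gives an estimate for your machine.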
Going further: maximizing matches/sec
To go faster, we need to adjust the computation and increase the number of comparisons that can be done per second.
A few things stand out in the function:
You filter your output by comparing the Levenshtein distance, an integer, to a decimal calculation. This means Spark will cast your integer to a decimal and operate on decimals. Comparing decimals is much slower than comparing integers, and it's unnecessary here: you can cast the bound to an int beforehand.
Your levenshtein call operates on the lowercased versions of your keys. This means that, for each row comparison, Spark will convert the column values to lower case again and again, wasting CPU cycles on redundant work. You can preprocess this before your join.
I update the function like this:
from pyspark.sql import DataFrame  # needed for the type hints below

def fuzzy_match(dfs: DataFrame, dfc: DataFrame, path_summary: str) -> DataFrame:
    # Precompute the lowercased key and the integer tolerance once per row of dfs
    dfs = dfs.withColumn("OrganisationNameKeyLower", F.lower("OrganisationNameKey"))\
        .withColumn("MatchingTolerance", F.ceil(F.length("OrganisationNameKey") * 0.15).cast("int"))\
        .cache()
    # Spread the large side over all cores and lowercase its key once
    dfc = dfc.repartition(dfc.sql_ctx.sparkContext.defaultParallelism)\
        .withColumn("CompanyNameKeyLower", F.lower("CompanyNameKey"))\
        .cache()
    df = dfc.crossJoin(dfs)\
        .withColumn("LevenshteinDistance", F.levenshtein(F.col("OrganisationNameKeyLower"), F.col("CompanyNameKeyLower")).cast("int")) \
        .where("LevenshteinDistance < MatchingTolerance")\
        .drop("MatchingTolerance")\
        .cache()
    # clean unnecessary caches before returning
    dfs.unpersist()
    dfc.unpersist()
    return df
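A note on why the integer bound keeps the same rows: for an integer distance d and any real bound x, d < x holds exactly when d < ceil(x). The following plain-Python check (no Spark involved) illustrates this for the 15% threshold. The function names are made up for the illustration, and Decimal is used so the check is not affected by binary floating-point rounding.

```python
import math
from decimal import Decimal

TOLERANCE = Decimal("0.15")  # the 15% threshold from the question


def decimal_filter(distance: int, name_length: int) -> bool:
    """Original style: compare the integer distance to a decimal bound."""
    return distance < name_length * TOLERANCE


def integer_filter(distance: int, name_length: int) -> bool:
    """Updated style: round the bound up once, then compare integers."""
    matching_tolerance = math.ceil(name_length * TOLERANCE)
    return distance < matching_tolerance


# Collect every (distance, length) pair on which the two filters disagree.
mismatches = [
    (d, n)
    for n in range(0, 201)
    for d in range(0, 41)
    if decimal_filter(d, n) != integer_filter(d, n)
]
print("pairs where the filters disagree:", len(mismatches))
```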
When running the updated version on the same inputs as before and on the same computer I get nearly twice the performance as the first implementation
dfc rowcount : 572494
dfs rowcount : 17728
fuzzy_match rowcount: 7228499
Duration: 356.23311281204224 seconds
Matches/core/sec: 1780641.1846241967
If that is still too slow for your needs, you'll need to find conditions on your data that you can use as a join condition but that's highly data and use case specific.

Why reading an small subset of the rows with Parquet Dataset take the same time than reading the whole file?

I'm developing a program to analyze historical prices of some assets. The data is structured and analyzed as a pandas dataframe. The columns are the dates and the rows are the assets. Previously I was using the transpose of this, but this format gave me better reading time. I saved this data in a parquet file and now I want to read an interval of dates, from A to B for example, for a small set of assets, analyze it, and then repeat the same process with the same assets but for the interval from B + 1 to C.
The problem is that even if I select a single row, the parquet read takes the same time as reading the whole file. Is there a way to improve this behaviour? It would be good if, once the rows are filtered, it remembered where the blocks are in memory to speed up the next reads. Do I have to write a new file with the filtered assets?
I tried writing the parquet file with a small number of row groups and a smaller data page size to avoid the complete read, but this didn't give me good results in terms of time.
Another question I have is the following: why does reading the complete parquet file using a ParquetDataset with use_legacy_dataset=False take more time than reading the same parquet dataset with use_legacy_dataset=True?
Code example:
import pandas as pd
import numpy as np
import time
import pyarrow.parquet as pq
# generating the small data for the example, the file weight like 150MB for this example, the real data
# has 2 GB
dates = pd.bdate_range('2019-01-01', '2020-03-01')
assets = list(range(1000, 50000))
historical_prices = pd.DataFrame(np.random.rand(len(assets), len(dates)), assets, dates)
historical_prices.columns = historical_prices.columns.strftime('%Y-%m-%d')
# name of the index
historical_prices.index.name = 'assets'
# writing the parquet file using the latest version; the commented-out options are the things that I tested
historical_prices.to_parquet(
    'historical_prices.parquet',
    version='2.0',
    data_page_version='2.0',
    writer_engine_version='2.0',
    # row_group_size=100,
    # compression=None
    # use_dictionary=False,
    # data_page_size=1000,
    # use_byte_stream_split=True,
    # flavor='spark',
)
# reading the complete parquet dataset
start_time = time.time()
historical_prices_dataset = pq.ParquetDataset(
'historical_prices.parquet',
use_legacy_dataset=False
)
historical_prices_dataset.read_pandas().to_pandas()
print(time.time() - start_time)
# Reading only one asset of the parquet dataset
start_time = time.time()
filters = [('assets', '=', assets[0])]
historical_prices_dataset = pq.ParquetDataset(
'historical_prices.parquet',
filters=filters,
use_legacy_dataset=False
)
historical_prices_dataset.read_pandas().to_pandas()
print(time.time() - start_time)
# this is what I want to do, read by intervals.
num_intervals = 5
for i in range(num_intervals):
    start = int(i * len(dates) / num_intervals)
    end = int((i + 1) * len(dates) / num_intervals)
    interval = list(dates[start:end].strftime('%Y-%m-%d'))
    historical_prices_dataset.read_pandas(columns=interval).to_pandas()
    # Here goes some analysis that can't be done in parallel, because the results of every interval
    # are used in the next interval
print(time.time() - start_time)
"I was using the transpose of this, but this format gave me better reading time."
Parquet supports individual column reads. So if you have 10 columns of 10k rows and you want 5 columns then you'll read 50k cells. If you have 10k columns of 10 rows and you want 5 columns then you'll read 50 cells. So presumably this is why the transpose gave you better reading time. I don't think I have enough details here. Parquet also supports reading individual row groups, more on that later.
You have roughly 49,000 assets and 300 dates. I'd expect you to get better performance with assets as columns but 49,000 is a lot of columns to have. It's possible that either you are having to read too much column metadata or you are dealing with CPU overhead from keeping track of so many columns.
It is a bit odd to have date values or asset ids as columns. A far more typical layout would be to have three columns: "date", "asset id", & "price".
"The problem is that even if I use a unique row, the parquet read take the same time that if I read the whole file"
Yes, if you have a single row group. Parquet does not support partial row group reads. I believe this is due to the fact that the columns are compressed. However, I do not get the same results you are getting. The middle time in your example (the single asset read) is typically ~60-70% of the time of the first read. So it is faster. Possibly just because there is less conversion to do to get to pandas or maybe there is some optimization I'm not aware of.
"The problem is that even if I use a unique row, the parquet read take the same time that if I read the whole file. Is there a way to improve this behaviour?, It would be good that, once it filter the rows, it saves where the blocks in memory are to speed up the nexts reads. Do I have to write a new file with the assets filtered?."
Row groups may be your answer. See the next section.
"I tried writing the parquet file with a small number of row groups and smaller data page size to avoid the complete reading, but this doesn't gave me a good results in terms of time."
This is probably what you are after (or you can use multiple files). Parquet supports reading just one row group out of a whole file. However, 100 is too small of a number for row_group_size. Each row group creates some amount of metadata in the file and has some overhead for processing. If I change that to 10,000 for example then the middle read is twice as fast (and now only 30-40% of the full table read).
"Other question that I have is the follwing. Why if we read the complete parquet file using a Parquet Dataset and use_legacy_dataset = False, it takes more time than reading the same parquet dataset with use_legacy_dataset = True?"
The new datasets API is pretty new (new as of 1.0.0, which was released in July). It's possible there is just a bit more overhead. You are not doing anything that would take advantage of the new datasets API (e.g. using scan or non-parquet datasets or new filesystems). So while use_legacy_dataset shouldn't be faster, it should not be any slower either. They should take roughly the same amount of time.
It sounds like you have many assets (tens of thousands) and you want to read a few of them. You also want to chunk the read into smaller reads (which you are using the date for).
First, instead of using the date at all, I would recommend using dataset.scan (https://arrow.apache.org/docs/python/dataset.html). This will allow you to process your data one row group at a time.
Second, is there any way you can group your asset ids? If each asset ID has only a single row you can ignore this. However, if you have (for example) 500 rows for each asset ID (or 1 row for each asset ID/date pair) can you write your file so that it looks something like this...
asset_id  date  price
A         1     ?
A         2     ?
A         3     ?
B         1     ?
B         2     ?
B         3     ?
If you do this AND you set the row group size to something reasonable (try 10k or 100k and then refine from there) then you should be able to get it so that you are only reading 1 or 2 row groups per asset ID.
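As a rough illustration of that layout, the wide frame from the question (assets as rows, dates as columns) can be reshaped with pandas alone. This is a sketch under assumptions: to_long is a hypothetical helper name, and the column names follow the small table above rather than anything in the original code.

```python
import pandas as pd


def to_long(wide: pd.DataFrame) -> pd.DataFrame:
    """Reshape an assets-by-dates frame into (asset_id, date, price) rows.

    Hypothetical helper: rows are sorted by asset_id and then date, so all
    rows of one asset end up next to each other.
    """
    return (
        wide.rename_axis(index="asset_id")
        .reset_index()
        .melt(id_vars="asset_id", var_name="date", value_name="price")
        .sort_values(["asset_id", "date"], kind="stable")
        .reset_index(drop=True)
    )
```

Writing the result with to_parquet and a row_group_size in the range suggested above then keeps each asset inside one or two row groups, which is what lets a filter on asset_id skip most of the file.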
I found another approach that gives me better times for my specific case; of course, it is not a very general solution. It uses some functions that are not part of pyarrow, but it does what I thought the pyarrow filters did when the same rows are read multiple times. When the number of row groups to read grows, the parquet dataset gives better performance.
import pandas as pd
import numpy as np
import time
import pyarrow.parquet as pq
from typing import Dict, Any, List
class PriceGroupReader:
    def __init__(self, filename: str, assets: List[int]):
        self.price_file = pq.ParquetFile(filename)
        self.assets = assets
        self.valid_groups = self._get_valid_row_groups()

    def _get_valid_row_groups(self):
        """
        I didn't find a parquet function to do this row group search, so I did this manual search.
        Note: The assets index is sorted, so probably this can be improved a lot.
        """
        start_time = time.time()
        assets = pd.Index(self.assets)
        valid_row_groups = []
        index_position = self.price_file.schema.names.index("assets")
        for i in range(self.price_file.num_row_groups):
            row_group = self.price_file.metadata.row_group(i)
            statistics = row_group.column(index_position).statistics
            if np.any((statistics.min <= assets) & (assets <= statistics.max)):
                valid_row_groups.append(i)
        print("getting the row groups: {}".format(time.time() - start_time))
        return valid_row_groups

    def read_valid_row_groups(self, dates: List[str]):
        row_groups = []
        for row_group_pos in self.valid_groups:
            df = self.price_file.read_row_group(row_group_pos, columns=dates, use_pandas_metadata=True).to_pandas()
            df = df.loc[df.index.isin(self.assets)]
            row_groups.append(df)
        df = pd.concat(row_groups)
        """
        # This is another way to read the groups but I think it can consume more memory, probably is faster.
        df = self.price_file.read_row_groups(self.valid_groups, columns=dates, use_pandas_metadata=True).to_pandas()
        df = df.loc[df.index.isin(self.assets)]
        """
        return df
def write_prices(assets: List[int], dates: List[str]):
    historical_prices = pd.DataFrame(np.random.rand(len(assets), len(dates)), assets, dates)
    # name of the index
    historical_prices.index.name = 'assets'
    # writing the parquet file using the latest version; the commented-out options are the things that I tested
    historical_prices.to_parquet(
        'historical_prices.parquet',
        version='2.0',
        data_page_version='2.0',
        writer_engine_version='2.0',
        row_group_size=4000,
        # compression=None
        # use_dictionary=False,
        # data_page_size=1000,
        # use_byte_stream_split=True,
        # flavor='spark',
    )
# generating the small data for the example, the file weight like 150MB, the real data weight 2 GB
total_dates = list(pd.bdate_range('2019-01-01', '2020-03-01').strftime('%Y-%m-%d'))
total_assets = list(range(1000, 50000))
write_prices(total_assets, total_dates)
# selecting a subset of the whole assets
valid_assets = total_assets[:3000]
# read the price file for the example
price_group_reader = PriceGroupReader('historical_prices.parquet', valid_assets)
# reading all the dates, only as an example
start_time = time.time()
price_group_reader.read_valid_row_groups(total_dates)
print("complete reading: {}".format(time.time() - start_time))
# this is what I want to do, read by intervals.
num_intervals = 5
start_time = time.time()
for i in range(num_intervals):
    start = int(i * len(total_dates) / num_intervals)
    end = int((i + 1) * len(total_dates) / num_intervals)
    interval = list(total_dates[start:end])
    df = price_group_reader.read_valid_row_groups(interval)
    # print(df)
print("interval reading: {}".format(time.time() - start_time))
filters = [('assets', 'in', valid_assets)]
price_dataset = pq.ParquetDataset(
'historical_prices.parquet',
filters=filters,
use_legacy_dataset=False
)
start_time = time.time()
price_dataset.read_pandas(columns=total_dates).to_pandas()
print("complete reading with parquet dataset: {}".format(time.time() - start_time))
start_time = time.time()
for i in range(num_intervals):
    start = int(i * len(total_dates) / num_intervals)
    end = int((i + 1) * len(total_dates) / num_intervals)
    interval = list(total_dates[start:end])
    df = price_dataset.read_pandas(columns=interval).to_pandas()
print("interval reading with parquet dataset: {}".format(time.time() - start_time))

(KNN ) row compute use outer DataFrame on pyspark

question
my data structure is like this:
train_info:(over 30000 rows)
----------
odt:string (unique)
holiday_type:string
od_label:string
array:array<double> (with variable length depend on different odt and holiday_type )
useful_index:array<int> (length same as vectors)
...(other not important cols)
label_data:(over 40000 rows)
----------
holiday_type:string
od_label: string
l_origin_array:array<double> (with variable length)
...(other not important cols)
my expected result is like this(length same with train_info):
--------------
odt:string
holiday_label:string
od_label:string
prediction:int
my solution is like this:
# UDFuncs and KNN are my own helpers (not shown); f is pyspark.sql.functions
def knn_for_loop(spark, predict_list, schema, label_data):
    result = list()
    for i in predict_list:
        # turn this Row into a DataFrame and join it with the label data,
        # then pick the label array values that match this row
        predict_df = spark.sparkContext.parallelize([i]).toDF(schema) \
            .join(label_data, on=['holiday_type', "od_label"], how='left') \
            .withColumn("l_array",
                        UDFuncs.value_from_array_by_index(f.col('l_origin_array'), f.col("useful_index"))) \
            .toPandas()
        # pandas execute
        train_x = predict_df.l_array.values
        train_y = predict_df.label.values
        test_x = predict_df.array.values[0]
        test_y = KNN(train_x, train_y, test_x)
        result.append((i['odt'], i['holiday_type'], i['od_label'], test_y))
    return result

if __name__ == '__main__':
    loop_item = train_info.collect()
    result = knn_for_loop(spark, loop_item, train_info.schema, label_data)
    # ----- do something -------
It works, but it is really slow: I estimate each row needs about 18 s.
In R I can do this easily using the do function:
train_info%>%group_by(odt)%>%do(.,knn_loop,label_data)
What I have tried:
I tried to join the two DataFrames beforehand and query the result during the computation, but the data is too large to run (the joined DataFrame has 400 million rows, takes up 180 GB of disk space on Hive, and is very slow to query).
I tried to use pandas_udf, but it only accepts one pandas DataFrame parameter and is slow.
I tried to use a UDF, but a UDF can't receive a DataFrame object.
I tried to use the spark-knn package, but it fails with an error; maybe my offline installation is wrong.
Thanks for your help.

Dataframe sample in Apache spark | Scala

I'm trying to take out samples from two dataframes wherein I need the ratio of count maintained. eg
df1.count() = 10
df2.count() = 1000
noOfSamples = 10
I want to sample the data in such a way that i get 10 samples of size 101 each( 1 from df1 and 100 from df2)
Now while doing so,
var newSample = df1.sample(true, df1.count() / noOfSamples)
println(newSample.count())
What does the fraction here imply? can it be greater than 1? I checked this and this but wasn't able to comprehend it fully.
Also is there anyway we can specify the number of rows to be sampled?
The fraction parameter represents the approximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, about 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:
val newSample = df1.sample(true, 1D*noOfSamples/df1.count)
However, you may notice that newSample.count will return a different number each time you run it. That's because the fraction is used as a threshold for a randomly generated value (as you can see here), so the resulting dataset size can vary. A workaround can be:
val newSample = df1.sample(true, 2D*noOfSamples/df1.count).limit(df1.count/noOfSamples)
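To see why the count varies, here is a minimal sketch in plain Python (not Spark) of fraction-based sampling without replacement as described above: every row is kept independently when a random draw falls below the fraction. bernoulli_sample and exact_sample are hypothetical names used only for this illustration, and Spark's own implementation is more involved.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def exact_sample(rows, n, seed):
    """Draw exactly n distinct rows (needs the whole list in memory)."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


rows = list(range(1000))

# The sample size depends on the random draws, so it changes with the seed.
sizes = [len(bernoulli_sample(rows, 0.1, seed)) for seed in range(5)]
print("fraction-based sample sizes:", sizes)

# Asking for a count instead of a fraction always returns that many rows.
print("exact sample size:", len(exact_sample(rows, 100, seed=0)))
```

The sizes hover around fraction times the row count, which is why oversampling and then applying limit is a common way to pin the count down.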
Some scalability observations
You may note that doing a df1.count might be expensive as it evaluates the whole DataFrame, and you'll lose one of the benefits of sampling in the first place.
Therefore depending on the context of your application, you may want to use an already known number of total samples, or an approximation.
val newSample = df1.sample(true, 1D*noOfSamples/knownNoOfSamples)
Or assuming the size of your DataFrame as huge, I would still use a fraction and use limit to force the number of samples.
val guessedFraction = 0.1
val newSample = df1.sample(true, guessedFraction).limit(noOfSamples)
As for your questions:
"can it be greater than 1?"
No. It represents a fraction between 0 and 1. If you set it to 1 it will bring 100% of the rows, so it wouldn't make sense to set it to a number larger than 1.
"Also is there anyway we can specify the number of rows to be sampled?"
You can specify a larger fraction than the number of rows you want and then use limit, as I show in the second example. Maybe there is another way, but this is the approach I use.
To answer your question, is there anyway we can specify the number of rows to be sampled?
I recently needed to sample a certain number of rows from a spark data frame. I followed the below process,
Convert the spark data frame to rdd.
Example: df_test.rdd
RDD has a functionality called takeSample which allows you to give the number of samples you need with a seed number.
Example: df_test.rdd.takeSample(withReplacement, Number of Samples, Seed)
Convert RDD back to spark data frame using sqlContext.createDataFrame()
Above process combined to single step:
The data frame (or population) I needed to sample from, df_grp_1, has around 8,000 records:
test1 = sqlContext.createDataFrame(df_grp_1.rdd.takeSample(False,125,seed=115))
test1 data frame will have 125 sampled records.
To answer if the fraction can be greater than 1. Yes, it can be if we have replace as yes. If a value greater than 1 is provided with replace false, then following exception will occur:
java.lang.IllegalArgumentException: requirement failed: Upper bound (2.0) must be <= 1.0.
I too find lack of sample by count functionality disturbing. If you are not picky about creating a temp view I find the code below useful (df is your dataframe, count is sample size):
val tableName = s"table_to_sample_${System.currentTimeMillis}"
df.createOrReplaceTempView(tableName)
val sampled = sqlContext.sql(s"select *, rand() as random from ${tableName} order by random limit ${count}")
sqlContext.dropTempTable(tableName)
sampled.drop("random")
It returns an exact count as long as your current row count is as large as your sample size.
The below code works if you want to do a random split of 70% & 30% of a data frame df,
val Array(trainingDF, testDF) = df.randomSplit(Array(0.7, 0.3), seed = 12345)
I use this function for random sampling when exact number of records are desirable:
def row_count_sample(df, row_count, with_replacement=False, random_seed=113170):
    # Random-sample slightly more than needed, as dataframe.sample() is not guaranteed
    # to give an exact record count: df.sample() can return more or fewer records.
    ratio = 1.08 * float(row_count) / df.count()
    if ratio > 1.0:
        ratio = 1.0
    result_df = (df
        .sample(with_replacement, ratio, random_seed)
        .limit(row_count)  # since we oversampled, make exact row count here
    )
    return result_df
Maybe you want to try the code below:
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

Resources