I'm trying to join the GHCN weather dataset and another dataset:
Weather dataset (called "weather" in the code):

station_id | date | PRCP | SNOW | SNWD | TMAX | TMIN | Latitude | Longitude | Elevation | State | date_final
CA001010235 | 19730707 | null | null | 0.0 | 0.0 | null | 48.4 | -123.4833 | 17.0 | BC | 1973-07-07
CA001010235 | 19780337 | 14 | 8 | 0.0 | 0.0 | null | 48.4 | -123.4833 | 17.0 | BC | 1978-03-30
CA001010595 | 19690607 | null | null | 0.0 | 0.0 | null | 48.5833 | -123.5167 | 17.0 | BC | 1969-06-07
Species dataset (called "ebird" in the code, where "ebird_id" is unique for each row):

speciesCode | comName | sciName | locId | locName | ObsDt | howMany | lat | lng | obsValid | obsReview | locationPrivate | subId | ebird_id
nswowl | Northern Saw-whet Owl | Aegolius acadicus | L787133 | Kumdis Slough | 2017-03-20 23:15 | 1 | 53.7392187 | -132.1612358 | TRUE | FALSE | TRUE | S35611913 | eff-178121-fff
wilsni1 | Wilson's Snipe | Gallinago delicata | L1166559 | Hornby Island--Ford Cove | 2017-03-20 21:44 | 1 | 49.4973435 | -124.6768427 | TRUE | FALSE | FALSE | S35323282 | abc-1920192-fff
cacgoo1 | Cackling Goose | Branta hutchinsii | L833055 | Central Saanich--ȾIKEL (Maber Flats) | 2017-03-20 19:24 | 5 | 48.5724686 | -123.4305167 | TRUE | FALSE | FALSE | S35322116 | yhj-9102910-fff
Result expected: I need to join these tables by finding, for each row in the species dataset, the closest weather station on the same date. So in this example, ebird_id "eff-178121-fff" is closest to weather station "CA001010235", and the distance is around 20 km.
speciesCode | comName | sciName | locId | locName | ObsDt | howMany | lat | lng | obsValid | obsReview | locationPrivate | subId | ebird_id | station_id | date | PRCP | SNOW | SNWD | TMAX | TMIN | Latitude | Longitude | Elevation | State | date_final | distance(kms)
nswowl | Northern Saw-whet Owl | Aegolius acadicus | L787133 | Kumdis Slough | 2017-03-20 23:15 | 1 | 53.7392187 | -132.1612358 | TRUE | FALSE | TRUE | S35611913 | eff-178121-fff | CA001010235 | 20170320 | null | null | 0.0 | 0.0 | null | 48.4 | -123.4833 | 17.0 | BC | 2017-03-20 | 20
wilsni1 | Wilson's Snipe | Gallinago delicata | L1166559 | Hornby Island--Ford Cove | 2017-03-20 21:44 | 1 | 49.4973435 | -124.6768427 | TRUE | FALSE | FALSE | S35323282 | abc-1920192-fff | CA001010595 | 20170320 | null | null | 0.0 | 0.0 | null | 48.5833 | -123.5167 | 17.0 | BC | 2017-03-20 |
What I have tried so far: I referred to this link, and it works for a sample of the datasets, but when I ran the code below on the entire weather and species datasets, the cross join worked while the partitionBy and window function line was taking too long. I also tried replacing the partitionBy and window function with PySpark SQL queries in case that would be faster, but it still takes too long. Is there an optimized way to do this?
from pyspark.sql import Window
from pyspark.sql.functions import asin, col, cos, min, radians, sin, sqrt

join_df = (ebird.crossJoin(weather)
           .withColumn("dist_longit", radians(weather["Longitude"]) - radians(ebird["lng"]))
           .withColumn("dist_latit", radians(weather["Latitude"]) - radians(ebird["lat"])))
join_df = join_df.withColumn("haversine_distance_kms", asin(sqrt(
        sin(join_df["dist_latit"] / 2) ** 2 + cos(radians(join_df["lat"]))
        * cos(radians(join_df["Latitude"])) * sin(join_df["dist_longit"] / 2) ** 2
    )) * 2 * 6371).drop("dist_longit", "dist_latit")

W = Window.partitionBy("ebird_id")
result = join_df.withColumn("min_dist", min(join_df["haversine_distance_kms"]).over(W))\
    .filter(col("min_dist") == col("haversine_distance_kms"))
result.show(1)
Edit:
Size of the datasets:
print(weather.count()) #output: 8211812
print(ebird.count()) #output: 1564574
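One possible optimization (a sketch only, not benchmarked at this scale): since the nearest station does not depend on the date, cross join the observations against the distinct stations only, instead of against all 8M daily weather records, keep the closest station per ebird_id, and only then join the daily readings on (station_id, date). The intermediate names (stations, pairs, nearest) and the ObsDt/date_final formats assumed below are mine, not part of the original code:

from pyspark.sql import Window
from pyspark.sql import functions as F

# 1. Distinct stations: far fewer rows than the 8M daily weather records,
#    small enough to broadcast.
stations = weather.select("station_id", "Latitude", "Longitude").distinct()

# 2. Haversine distance from every observation to every station.
pairs = (ebird.crossJoin(F.broadcast(stations))
         .withColumn("dist_longit", F.radians("Longitude") - F.radians("lng"))
         .withColumn("dist_latit", F.radians("Latitude") - F.radians("lat"))
         .withColumn("distance_kms", 2 * 6371 * F.asin(F.sqrt(
             F.sin(F.col("dist_latit") / 2) ** 2
             + F.cos(F.radians("lat")) * F.cos(F.radians("Latitude"))
             * F.sin(F.col("dist_longit") / 2) ** 2)))
         .drop("dist_longit", "dist_latit", "Latitude", "Longitude"))

# 3. Keep only the closest station per observation.
w = Window.partitionBy("ebird_id").orderBy("distance_kms")
nearest = (pairs.withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# 4. Attach that station's readings for the observation date. Assumes ObsDt
#    strings look like "2017-03-20 23:15" and that date_final is a date column.
nearest = nearest.withColumn("date_final", F.to_date("ObsDt", "yyyy-MM-dd HH:mm"))
result = nearest.join(weather, on=["station_id", "date_final"], how="left")

The window in step 3 still shuffles, but only over observations x stations pairs rather than observations x all weather rows, which should be orders of magnitude fewer rows.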
Related
I have panel data and I am trying to run cronbach.alpha; however, the result is NA. Does anybody know why?
data = P_df
cronbach.alpha(p_df, na.rm = TRUE)
Cronbach's alpha for the 'p_df' data-set
Items: 169
Sample units: 1284
alpha: NA
Problem: I am getting the updated value in the same first row each time, instead of a new row.
Details: I am trying to get live data from an API. I have been trying for several days, but none of my code gives proper DataFrame results.
I have a list of around 100-150 stocks read from an .xlsx file, and I am trying to get the live data into a DataFrame and then write it to an .xlsx file (1st stock in the 1st sheet, 2nd stock in the 2nd sheet, and so on) in the format below:
Date Time Symbol ltp ltq Volume
2021-10-01 11:00:00 A 103.45 50430 110350470
2021-10-01 11:00:01 A 104.29 99500 110400900
2021-10-01 11:00:02 A 105.14 70570 110500400
2021-10-01 11:00:03 A 105.99 90640 110570970
2021-10-01 11:00:04 A 106.84 65710 110661610
2021-10-01 11:00:05 A 107.69 98780 110727320
2021-10-01 11:00:06 A 108.54 84850 110826100
2021-10-01 11:00:07 A 109.39 77920 110910950
2021-10-01 11:00:08 A 110.24 61990 110988870
2021-10-01 11:00:09 A 111.09 53060 111050860
2021-10-01 11:00:10 A 111.94 74130 111103920
and one main Dashboard sheet with a hyperlink to check each stock, e.g.:
Sr StockSheet Name ltp high open low previous close Today's volume
1 A 001_A
2 B 002_B
3 C 003_C
4 D 004_D
5 E 005_E
6 F 006_F
7 G 007_G
8 H 008_H
9 I 009_I
10 J 010_J
After setting up the API credentials and access token, below is the main code:
import datetime

import pandas as pd
import xlwings as xw

socket_opened = False

def event_handler_quote_update(tick):
    tick_symbol = tick['instrument'].symbol
    tick_ltp = tick['ltp']
    tick_volume = tick['volume']
    tick_ltq = tick['ltq']
    tick_ltt = datetime.datetime.fromtimestamp(tick['ltt'])
    tick_timestamp = datetime.datetime.fromtimestamp(tick['exchange_time_stamp'])
    d = {'symbol': [tick_symbol], 'ltp': [tick_ltp], 'ltq': [tick_ltq], 'volume': [tick_volume], 'ltt': [tick_ltt], 'timestamp': [tick_timestamp]}
    df = pd.DataFrame(data=d)
    xw.Book('test1.xlsx').sheets['Sheet1'].range('A1').value = df
    print(df)

def open_callback():
    global socket_opened
    socket_opened = True

alice.start_websocket(event_handler_quote_update, open_callback, run_in_background=True)
call = ()
for symbol in fno_list:
    callable(alice.get_instrument_by_symbol('NSE', symbol))
while socket_opened == False:
    pass
alice.subscribe(alice.get_instrument_by_symbol('NSE', 'RELIANCE'), LiveFeedType.MARKET_DATA)
Code Ended.
Help: As I am new to Python and still learning, I'm looking forward to any improvements, suggestions, or better ways to do this.
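For what it's worth, the ticks keep landing in the same place because the handler always writes to range('A1'). A minimal sketch of one way to give each tick its own row instead, assuming xlwings and the same tick fields as above (the row counter and column order are mine, purely illustrative):

import datetime

import xlwings as xw

sheet = xw.Book('test1.xlsx').sheets['Sheet1']
next_row = 2  # row 1 is left for headers; this counter is purely illustrative

def event_handler_quote_update(tick):
    global next_row
    row = [
        datetime.datetime.fromtimestamp(tick['exchange_time_stamp']).date(),
        datetime.datetime.fromtimestamp(tick['ltt']).time().strftime('%H:%M:%S'),
        tick['instrument'].symbol,
        tick['ltp'],
        tick['ltq'],
        tick['volume'],
    ]
    # Writing a list to a single cell spills it across that row,
    # so each tick lands on its own line instead of overwriting A1.
    sheet.range((next_row, 1)).value = row
    next_row += 1

Batching ticks in memory and writing them to the sheet periodically (for example via a pandas DataFrame) would also reduce the per-tick Excel overhead.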
I've been trying to figure this out for the past day, but have not been successful.
Problem I am facing
I'm reading a Parquet file that is about 2 GB. The initial read is 14 partitions, which eventually gets split into 200 partitions. I run a seemingly simple SQL query that takes 25+ minutes, and about 22 minutes of that is spent on a single stage. Looking at the Spark UI, I see that all computation is eventually pushed to about 2 to 4 executors, with lots of shuffling. I don't know what is going on. I would appreciate any help.
Setup
Spark environment - Databricks
Cluster mode - Standard
Databricks Runtime Version - 6.4 ML (includes Apache Spark 2.4.5, Scala 2.11)
Cloud - Azure
Worker Type - 56 GB, 16 cores per machine. Minimum 2 machines
Driver Type - 112 GB, 16 cores
Notebook
Cell 1: Helper functions
load_data = function(path, type) {
input_df = read.df(path, type)
input_df = withColumn(input_df, "dummy_col", 1L)
createOrReplaceTempView(input_df, "__current_exp_data")
## Helper function to run query, then save as table
transformation_helper = function(sql_query, destination_table) {
createOrReplaceTempView(sql(sql_query), destination_table)
}
## Transformation 0: Calculate max date, used for calculations later on
transformation_helper(
"SELECT 1L AS dummy_col, MAX(Date) max_date FROM __current_exp_data",
destination_table = "__max_date"
)
## Transformation 1: Make initial column calculations
transformation_helper(
"
SELECT
cId AS cId
, date_format(Date, 'yyyy-MM-dd') AS Date
, date_format(DateEntered, 'yyyy-MM-dd') AS DateEntered
, eId
, (CASE WHEN isnan(tSec) OR isnull(tSec) THEN 0 ELSE tSec END) AS tSec
, (CASE WHEN isnan(eSec) OR isnull(eSec) THEN 0 ELSE eSec END) AS eSec
, approx_count_distinct(eId) OVER (PARTITION BY cId) AS dc_eId
, COUNT(*) OVER (PARTITION BY cId, Date) AS num_rec
, datediff(Date, DateEntered) AS analysis_day
, datediff(max_date, DateEntered) AS total_avail_days
FROM __current_exp_data
CROSS JOIN __max_date ON __current_exp_data.dummy_col = __max_date.dummy_col
",
destination_table = "current_exp_data_raw"
)
## Transformation 2: Drop row if Date is not valid
transformation_helper(
"
SELECT
cId
, Date
, DateEntered
, eId
, tSec
, eSec
, analysis_day
, total_avail_days
, CASE WHEN analysis_day == 0 THEN 0 ELSE floor((analysis_day - 1) / 7) END AS week
, CASE WHEN total_avail_days < 7 THEN NULL ELSE floor(total_avail_days / 7) - 1 END AS avail_week
FROM current_exp_data_raw
WHERE
isnotnull(Date) AND
NOT isnan(Date) AND
Date >= DateEntered AND
dc_eId == 1 AND
num_rec == 1
",
destination_table = "main_data"
)
cacheTable("main_data_raw")
cacheTable("main_data")
}
spark_sql_as_data_table = function(query) {
data.table(collect(sql(query)))
}
get_distinct_weeks = function() {
spark_sql_as_data_table("SELECT week FROM current_exp_data GROUP BY week")
}
Cell 2: Call helper function that triggers the long running task
library(data.table)
library(SparkR)
spark = sparkR.session(sparkConfig = list())
load_data("/mnt/public-dir/file_0000000.parquet", "parquet")
set.seed(1234)
get_distinct_weeks()
Long running stage DAG
Stats about long running stage
Logs
I trimmed the logs down and show only the entries that appeared multiple times below:
BlockManager: Found block rdd_22_113 locally
CoarseGrainedExecutorBackend: Got assigned task 812
ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
InMemoryTableScanExec: Predicate (dc_eId#61L = 1) generates partition filter: ((dc_eId.lowerBound#622L <= 1) && (1 <= dc_eId.upperBound#621L))
InMemoryTableScanExec: Predicate (num_rec#62L = 1) generates partition filter: ((num_rec.lowerBound#627L <= 1) && (1 <= num_rec.upperBound#626L))
InMemoryTableScanExec: Predicate isnotnull(Date#57) generates partition filter: ((Date.count#599 - Date.nullCount#598) > 0)
InMemoryTableScanExec: Predicate isnotnull(DateEntered#58) generates partition filter: ((DateEntered.count#604 - DateEntered.nullCount#603) > 0)
MemoryStore: Block rdd_17_104 stored as values in memory (estimated size <VERY SMALL NUMBER < 10> MB, free 10.0 GB)
ShuffleBlockFetcherIterator: Getting 200 non-empty blocks including 176 local blocks and 24 remote blocks
ShuffleBlockFetcherIterator: Started 4 remote fetches in 1 ms
UnsafeExternalSorter: Thread 254 spilling sort data of <Between 1 and 3 GB> to disk (3 times so far)
I have data in a text file with the following format:
{"date":"Jan 6"; "time":"07:00:01"; "ip":"178.41.163.99"; "user":"null"; "country":"Slovakia"; "city":"Miloslavov"; "lat":48.1059; "lon":17.3}
{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"postgres"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}
{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"null"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}
I need to read it into a pandas DataFrame, with the keys as column names and the values as the items. This is my code to read the data in:
columns = ['date', 'time', 'ip', 'user', 'country', 'city', 'lat', 'lon']
df = pd.read_csv("log.txt", sep=';', header=None, names=columns)
I'm a bit frustrated, since all I've managed to get is this:
date time ... lat lon
0 {"date":"Jan 6" "time":"07:00:01" ... "lat":48.1059 "lon":17.3}
1 {"date":"Jan 6" "time":"07:05:26" ... "lat":57.7072 "lon":11.9668}
2 {"date":"Jan 6" "time":"07:05:26" ... "lat":57.7072 "lon":11.9668}
I've read the docs from top to bottom but am still unable to achieve the required result, shown below:
date time ... lat lon
0 Jan 6 07:00:01 ... 48.1059 17.3
1 Jan 6 07:05:26 ... 57.7072 11.9668
2 Jan 6 07:05:26 ... 57.7072 11.9668
Is it possible at all? Any advice will be much appreciated. Thanks.
If, as it appears, you don't have any ; characters inside the string values, you could use string replacement to turn it into valid (line-separated) JSON:
In [11]: text
Out[11]: '{"date":"Jan 6"; "time":"07:00:01"; "ip":"178.41.163.99"; "user":"null"; "country":"Slovakia"; "city":"Miloslavov"; "lat":48.1059; "lon":17.3}\n{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"postgres"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}\n{"date":"Jan 6"; "time":"07:05:26"; "ip":"37.123.163.124"; "user":"null"; "country":"Sweden"; "city":"Gothenburg"; "lat":57.7072; "lon":11.9668}'
In [12]: pd.read_json(text.replace(";", ","), lines=True)
Out[12]:
city country date ip lat lon time user
0 Miloslavov Slovakia Jan 6 178.41.163.99 48.1059 17.3000 07:00:01 null
1 Gothenburg Sweden Jan 6 37.123.163.124 57.7072 11.9668 07:05:26 postgres
2 Gothenburg Sweden Jan 6 37.123.163.124 57.7072 11.9668 07:05:26 null
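Applied to the file itself (assuming it is named log.txt, as in the question), the same idea looks like this; StringIO is used because newer pandas versions prefer a file-like object over a raw JSON string:

import io

import pandas as pd

# Read the raw text, swap the non-standard ";" separators for ",",
# then let pandas parse it as line-delimited JSON.
with open("log.txt") as f:
    text = f.read().replace(";", ",")

df = pd.read_json(io.StringIO(text), lines=True)
print(df.head())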
Hi, I just installed Sphinx on my CentOS VPS, but for some reason whenever I search, it gives me no results. I'm using SSH for searching. Here is the command:
search --index sphinx_index_cc_post -a Introducing The New Solar Train Tunnel
This is the output of the command:
Sphinx 2.0.5-release (r3308)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/etc/sphinx.conf'...
index 'sphinx_index_cc_post': query 'Introducing The New Solar Train Tunnel ': returned 0 matches of 0 total in 0.000 sec
words:
1. 'introducing': 0 documents, 0 hits
2. 'the': 0 documents, 0 hits
3. 'new': 0 documents, 0 hits
4. 'solar': 0 documents, 0 hits
5. 'train': 0 documents, 0 hits
6. 'tunnel': 0 documents, 0 hits
This is my index in the config file:
source sphinx_index_cc_post
{
type = mysql
sql_host = localhost
sql_user = user
sql_pass = password
sql_db = database
sql_port = 3306
sql_query_range = SELECT MIN(postid),MAX(postid) FROM cc_post
sql_range_step = 1000
sql_query = SELECT postedby, category, totalvotes, trendvalue, featured, isactive, postingdate \
FROM cc_post \
WHERE postid BETWEEN $start AND $end
}
index sphinx_index_cc_post
{
source = sphinx_index_cc_post
path = /usr/local/sphinx/data/sphinx_index_cc_post
charset_type = utf-8
min_word_len = 2
}
The index seems to work fine: when I rotate the index, I successfully get the documents. Here is the output of my indexer:
[root@server1 data]# indexer --rotate sphinx_index_cc_post
Sphinx 2.0.5-release (r3308)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/etc/sphinx.conf'...
indexing index 'sphinx_index_cc_post'...
WARNING: Attribute count is 0: switching to none docinfo
WARNING: source sphinx_index_cc_post: skipped 1 document(s) with zero/NULL ids
collected 2551 docs, 0.1 MB
sorted 0.0 Mhits, 100.0% done
total 2551 docs, 61900 bytes
total 0.041 sec, 1474933 bytes/sec, 60784.40 docs/sec
total 2 reads, 0.000 sec, 1.3 kb/call avg, 0.0 msec/call avg
total 6 writes, 0.000 sec, 1.0 kb/call avg, 0.0 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=17888).
I also tried removing attributes, but no luck. I'm guessing it's either a config problem or a query issue.
Your query is:
SELECT postedby, category, totalvotes, trendvalue, featured, isactive, postingdate \
FROM cc_post
From the column names, I guess you don't have full text in any of those columns. Are you missing the column that contains the text?
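For illustration only: if the post text lived in, say, title and description columns (hypothetical names, adjust to your schema), the source query could include them as full-text fields. Note also that Sphinx treats the first selected column as the document ID, so postid should come first:

sql_query = SELECT postid, title, description, postedby, category, \
                totalvotes, trendvalue, featured, isactive, postingdate \
            FROM cc_post \
            WHERE postid BETWEEN $start AND $end

Columns that are not declared as attributes (sql_attr_uint and friends) are indexed as full-text fields, so numeric flags like isactive would normally be declared as attributes rather than left as text.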