How to select n rows from a large data set using Spark

I need to select n rows from a very large data set that has millions of rows, say 4 million rows out of 15 million. Currently, I'm adding a row_number to the records within each partition and selecting the required percentage of records from each partition. For instance, 4 million is about 26.67 % of 15 million, but when I try to take 26 % from each partition the total falls short because of the missing ~0.67 %. As shown below, rows are selected when their row_number is less than the percentage of the partition size. Is there a better way to do this?
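For context, a minimal sketch of what that per-partition selection might look like (this is not the original code; the df variable, the ordering column and the exact fraction are assumptions):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# df is assumed to be the 15-million-row DataFrame.
fraction = 4_000_000 / 15_000_000   # ~0.2667

tagged = df.withColumn("pid", F.spark_partition_id()).withColumn("rnd", F.rand(seed=42))

w = Window.partitionBy("pid").orderBy("rnd")
numbered = (tagged
            .withColumn("rn", F.row_number().over(w))
            .withColumn("part_size", F.count("*").over(Window.partitionBy("pid"))))

# Keep rows whose row_number falls within the per-partition quota; rounding this
# quota down in each partition is what loses the missing ~0.67 % mentioned above.
selected = (numbered
            .filter(F.col("rn") <= F.col("part_size") * fraction)
            .drop("pid", "rnd", "rn", "part_size"))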

The DataFrame sample function can be used. A solution is available in the link below:
How to select an exact number of random rows from DataFrame
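For reference, a rough sketch of the sample-based approach from the linked answer (df and the numbers are placeholders): sample() takes a fraction rather than an exact count, so oversample slightly and trim to the exact number with limit().

n = 4_000_000
total = df.count()   # 15 million in the example above

# Oversample by ~10 % so that, despite sampling variance, at least n rows survive,
# then cut back to exactly n.
fraction = min(1.0, 1.1 * n / total)
exact_n = df.sample(withReplacement=False, fraction=fraction, seed=42).limit(n)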

Related

Interpolating a huge set of data

So I have a very large set of data (4 million rows+) with journey times between two location nodes for two separate years (2015 and 2024). These are stored in .dat files in the following format:
Node A    Node B    Journey Time (s)
123       124       51.4
So I have one long file of over 4 million rows for each year. I need to interpolate journey times for a year between the two for which I have data. I've tried Power Query in Excel as well as Power BI Desktop, but have had no reasonable solution beyond cutting the files into pieces of under 1 million rows so that Excel can manage them.
Any ideas?
What type of output are you looking for? Power BI can easily handle this amount of data, but it depends on what you expect your result to be. If you're looking for the average % change in node-to-node travel time between the two years, then Power BI could be utilised, as it is great at aggregating and comparing large datasets.
However, if you want an output of every single node-to-node delta between those two years, i.e. a 4M-row output, then Power BI will calculate this, but then what do you do with it... a 4M-row table?
If you're looking to have an exported result of more than 150K rows (the Power BI limit) or 1M rows (the Excel limit), then I would use Python for that (as mentioned above).
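If Python is an option, here is a hedged sketch of the interpolation with pandas (file names, column names and the target year are assumptions): join the two years on the node pair and interpolate linearly between them.

import pandas as pd

cols = ["node_a", "node_b", "journey_time_s"]
jt_2015 = pd.read_csv("journey_times_2015.dat", sep=r"\s+", names=cols)
jt_2024 = pd.read_csv("journey_times_2024.dat", sep=r"\s+", names=cols)

merged = jt_2015.merge(jt_2024, on=["node_a", "node_b"], suffixes=("_2015", "_2024"))

# Linear interpolation between the two known years for a hypothetical target year.
target_year = 2019
weight = (target_year - 2015) / (2024 - 2015)
merged["journey_time_" + str(target_year)] = (
    merged["journey_time_s_2015"]
    + weight * (merged["journey_time_s_2024"] - merged["journey_time_s_2015"])
)

merged.to_csv("journey_times_interpolated.csv", index=False)

Four million rows is well within what pandas handles comfortably in memory, which avoids the Excel and Power BI row limits mentioned above.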

Pyspark job being stuck at the final task

The flow of my program is something like this:
1. Read 4 billion rows (~700GB) of data from a parquet file into a data frame. Partition size used is 2296
2. Clean it and filter out 2.5 billion rows
3. Transform the remaining 1.5 billion rows using a pipeline model and then a trained model. The trained model is a logistic regression model that predicts 0 or 1, and 30% of the data is filtered out of the transformed data frame.
4. The above data frame is left outer joined with another dataset of ~1 TB (also read from a parquet file). Partition size is 4000.
5. Join it with another dataset of around 100 MB like
joined_data = data1.join(broadcast(small_dataset_100MB), data1.field == small_dataset_100MB.field, "left_outer")
6. The above data frame is then exploded by a factor of ~2000:
exploded_data = joined_data.withColumn('field', explode('field_list'))
7. An aggregate is performed:
aggregate = exploded_data.groupBy(*cols_to_select) \
    .agg(F.countDistinct(exploded_data.field1).alias('distincts'), F.count("*").alias('count_all'))
There are a total of 10 columns in the cols_to_select list.
8. And finally an action, aggregate.count(), is performed.
The problem is that the third-to-last count stage (200 tasks) gets stuck at task 199 forever. In spite of allocating 4 cores and 56 executors, the count uses only one core and one executor to run the job. I tried breaking the input down from 4 billion rows to 700 million rows (one sixth of it), and it took four hours. I would really appreciate some help in how to speed this process up. Thanks
The operation was getting stuck at the final task because of skewed data being joined to a huge dataset. The key joining the two data frames was heavily skewed. The problem was solved for now by removing the skewed data from the data frame. If you must include the skewed data, you can use iterative broadcast joins (https://github.com/godatadriven/iterative-broadcast-join). See this informative video for more details: https://www.youtube.com/watch?v=6zg7NTw-kTQ
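As an aside, here is a quick way to confirm the skew, plus a sketch of one common mitigation, key salting. The names data1, big_dataset (standing in for the ~1 TB dataset), the join column field and the spark session variable are assumptions, not the original code.

from pyspark.sql import functions as F

# Count rows per join key on one side and inspect the heaviest keys.
(data1.groupBy("field")
      .count()
      .orderBy(F.desc("count"))
      .show(20, truncate=False))

# Salting: spread the left side's hot keys over num_salts buckets and
# replicate the right side across all buckets, then join on (key, salt).
num_salts = 32
left_salted = data1.withColumn("salt", (F.rand() * num_salts).cast("int"))
right_salted = big_dataset.crossJoin(
    spark.range(num_salts).withColumnRenamed("id", "salt"))

joined = left_salted.join(right_salted, ["field", "salt"], "left_outer").drop("salt")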

Matlab data auto import and arrange program

I have a large smart meter data set with more than a million rows. The data looks like this:
customer number   time    load
1000              19501   1.5
....              .....   ...
1000              19548   1.5
1000              19600   1.5
1000              ...     ..
1000              19648   1.5
.                 .       .
1001              19501   1.5
.                 .       .
The first column is the customer number, the second column shows the date and time, and the third column shows the load. The date/time starts at 19501, goes up to 19548, then moves on to 19600, and so on for 7 days. Now I want to analyse this data in MATLAB using clustering. Firstly, the data is in .txt format and, because of its large number of rows, it does not open in MATLAB.
I opened it in Excel (although Excel does not read the file fully, a million rows of data is good enough for me). I have reduced the number of rows so that MATLAB can read them, and arranged the data using filters so that each customer's readings run from time 19501 to that customer's last reading, then the second customer, and so on. For my MATLAB clustering, I need the readings from 19501-19548 in one row, then the next 48 readings for the same customer in the next row, and so on up to the last customer.
Is it possible to have MATLAB code that does this automatically, or should I look for something in Excel?
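If the reshaping itself is the blocker, here is a rough sketch of the logic in Python/pandas (the file name and the assumption that every customer has complete blocks of 48 readings are mine); the same long-to-wide pivot can be expressed with MATLAB's table functions.

import pandas as pd

# smart_meter.txt is a placeholder name; columns assumed whitespace-delimited
# as in the sample above.
df = pd.read_csv("smart_meter.txt", sep=r"\s+", names=["customer", "time", "load"])
df = df.sort_values(["customer", "time"])

# Number each customer's readings 0, 1, 2, ... and cut them into blocks of 48,
# so every (customer, block) becomes one row of 48 load values.
df["reading_no"] = df.groupby("customer").cumcount()
df["block"] = df["reading_no"] // 48
df["slot"] = df["reading_no"] % 48

wide = df.pivot_table(index=["customer", "block"], columns="slot", values="load")
wide.to_csv("smart_meter_wide.csv")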

Change range of AVERAGE and STDEV every ten rows

I have a big dataset in Excel and I want to perform the same operation over and over until the end of all rows.
I have groups of ten values (ten rows), and for each of these groups I want to display a row's value only if it is within the average ± one standard deviation of its group. As an example, let's say:
[Image: two groups of ten values, the first showing the desired result in red]
I already managed to perform the operation that I want for a group of ten values with this formula:
=IF(AND(F2<=(AVERAGE($F$2:$F$11))+(STDEV($F$2:$F$11)),F2>=(AVERAGE($F$2:$F$11))-(STDEV($F$2:$F$11))),F2, "")
Since my data is kind of big (around 5,000 rows) I would like to have a function that I can drag to the other groups of ten values without the need to update the average and standard deviation range.
=IF(F2=MEDIAN(F2,AVERAGE(INDEX(F:F,2+10*INT((ROWS($1:1)-1)/10)):INDEX(F:F,11+10*INT((ROWS($1:1)-1)/10)))+{-1,1}*STDEV(INDEX(F:F,2+10*INT((ROWS($1:1)-1)/10)):INDEX(F:F,11+10*INT((ROWS($1:1)-1)/10)))),F2, "")
Copy down as required.
Regards
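For what it's worth, the same rule is easy to sanity-check outside Excel. A toy pandas version (the column name and the use of the sample standard deviation, matching Excel's STDEV, are assumptions):

import pandas as pd

df = pd.DataFrame({"value": range(1, 51)})   # stand-in for column F

# Within each block of ten rows, keep a value only if it lies inside
# mean ± one (sample) standard deviation of its block.
block = df.index // 10
mean = df.groupby(block)["value"].transform("mean")
std = df.groupby(block)["value"].transform(lambda s: s.std(ddof=1))

df["filtered"] = df["value"].where(df["value"].between(mean - std, mean + std))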

Control which rows a data table will calculate

I have a 2-way data table in Excel (as in the option under "What-If Analysis"). There are 50 rows when analysing a 50 year deal, but only 4 rows when analysing a 4 year deal.
I only want to use one data table (of 50 rows), but I don't want it to calculate all of the values if it doesn't have to. For example, if I have a five-year deal I want the values in the first 5 rows to be calculated, but for the rest I would like it to display 0 or a blank.
Is there a way to do this without VBA?
(I was thinking with VBA I could create a whole new data table every time I run it, but would prefer not to as I am still developing structures.)
I'm guessing that your row labels step consistently. Just blank out the years that are not required, when they are not required (rows 6 and upwards in your five-year example), and repopulate them with a series fill to suit.
