Does anyone know how to do pagination in a Spark SQL query?
I need to use Spark SQL but don't know how to do pagination.
Tried:
select * from person limit 10, 10
It has been 6 years since this was asked; I don't know if it was possible back then.
I would add a sequential id to the result and select the records between offset and offset + limit.
In pure Spark SQL it would be something like this, for offset 10 and limit 10:
WITH count_person AS (
    SELECT *, monotonically_increasing_id() AS count FROM person)
SELECT * FROM count_person WHERE count >= 10 AND count < 20
In PySpark it would be very similar:
import pyspark.sql.functions as F

offset = 10
limit = 10

df = df.withColumn('_id', F.monotonically_increasing_id())
df = df.where((F.col('_id') >= offset) & (F.col('_id') < offset + limit))
It's flexible and fast enough even for a large volume of data.
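Note that monotonically_increasing_id() leaves gaps across partitions, so the filter above selects roughly, not exactly, limit rows. If exact page boundaries matter, a row_number() window gives consecutive ids; here is a sketch (the global window forces a single partition, so use it with care on very large data):

import pyspark.sql.functions as F
from pyspark.sql.window import Window

offset = 10
limit = 10

# row_number() yields consecutive 1-based ids over the whole dataframe
w = Window.orderBy(F.monotonically_increasing_id())
page = (df.withColumn('_rn', F.row_number().over(w))
          .where((F.col('_rn') > offset) & (F.col('_rn') <= offset + limit))
          .drop('_rn'))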
karthik's answer will fail if there are duplicate rows in the dataframe, because except removes every row in df1 that also appears in df2, duplicates included.
val filteredRdd = df.rdd.zipWithIndex().collect { case (r, i) if i >= 10 && i <= 20 => r }
val newDf = sqlContext.createDataFrame(filteredRdd, df.schema)
There is no support for OFFSET in Spark SQL as of now. One of the alternatives you can use for paging is the DataFrame except method.
Example: if you want to iterate with a page size of 10, you can do the following:
DataFrame df1;
long count = df.count();
int limit = 10;
while (count > 0) {
    df1 = df.limit(limit);
    df1.show();  // will print 10, next 10, etc. rows
    df = df.except(df1);
    count = count - limit;
}
If you want to do, say, LIMIT 50, 100 in the first go, you can do the following:
df1 = df.limit(50);
df2 = df.except(df1);
df2.limit(100); //required result
Hope this helps!
Please find below a useful PySpark (Python 3 and Spark 3) class named SparkPaging which abstracts the pagination mechanism:
https://gitlab.com/enahwe/public/lib/spark/sparkpaging
Here's the usage :
SparkPaging
Class for paging dataframes and datasets
Example
- Init example 1:
Approach by specifying a limit.
sp = SparkPaging(initData=df, limit=753)
- Init example 2:
Approach by specifying a number of pages (if there is a remainder, the number of pages will be incremented).
sp = SparkPaging(initData=df, pages=6)
- Init example 3:
Approach by specifying a limit.
sp = SparkPaging()
sp.init(initData=df, limit=753)
- Init example 4:
Approach by specifying a number of pages (if there is a remainder, the number of pages will be incremented).
sp = SparkPaging()
sp.init(initData=df, pages=6)
- Reset:
sp.reset()
- Iterate example:
print("- Total number of rows = " + str(sp.initDataCount))
print("- Limit = " + str(sp.limit))
print("- Number of pages = " + str(sp.pages))
print("- Number of rows in the last page = " + str(sp.numberOfRowsInLastPage))
while (sp.page < sp.pages-1):
    df_page = sp.next()
    nbrRows = df_page.count()
    print(" Page " + str(sp.page) + '/' + str(sp.pages) + ": Number of rows = " + str(nbrRows))
- Output:
- Total number of rows = 4521
- Limit = 753
- Number of pages = 7
- Number of rows in the last page = 3
Page 0/7: Number of rows = 753
Page 1/7: Number of rows = 753
Page 2/7: Number of rows = 753
Page 3/7: Number of rows = 753
Page 4/7: Number of rows = 753
Page 5/7: Number of rows = 753
Page 6/7: Number of rows = 3
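If you just want the idea without pulling in the library, here is a minimal, hypothetical sketch of a similar paging helper; it is not the actual SparkPaging implementation, just one way to get the same page/pages/next() behaviour:

import math
import pyspark.sql.functions as F
from pyspark.sql.window import Window

class SimplePaging:
    """Hypothetical stand-in for the SparkPaging idea: number every row once,
    then serve fixed-size pages by index range."""

    def __init__(self, initData, limit):
        self.limit = limit
        # a global row_number() window is simple but single-partition,
        # so this sketch suits moderate data sizes, not huge datasets
        w = Window.orderBy(F.monotonically_increasing_id())
        self._df = initData.withColumn("_row", F.row_number().over(w)).cache()
        self.initDataCount = self._df.count()
        self.pages = math.ceil(self.initDataCount / limit)
        self.numberOfRowsInLastPage = (self.initDataCount - (self.pages - 1) * limit
                                       if self.pages else 0)
        self.page = -1

    def next(self):
        # advance to the next page and return only its rows
        self.page += 1
        start = self.page * self.limit
        return (self._df
                .where((F.col("_row") > start) & (F.col("_row") <= start + self.limit))
                .drop("_row"))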
Related
There is a dataframe named df which contains repeating rows identified by DICE_SUMMARY_ID.
After I perform some calculations on different columns, I need to write the results back to the original dataframe.
The issue is that df contains over 100k rows, so a for loop is very time consuming; it currently takes about 3 hours.
How can I reduce the time?
import numpy as np
import pandas as pd

# extract unique ids from the dataframe
uniqueIDs = df['DICE_SUMMARY_ID'].unique()

# iterate over the unique ids and calculate
for i in range(len(uniqueIDs)):
    # get a slice of the dataframe at the i'th unique id
    uniqueID_df = df.loc[df['DICE_SUMMARY_ID'] == uniqueIDs[i]]

    # calculate sum of all family types
    SINGLE_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10001).sum())
    EXTRA_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10003).sum())
    NO_OF_ADULTS = int(SINGLE_ADULTS + EXTRA_ADULTS)
    NO_OF_DEPENDENTS_U_16 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20001).sum())
    NO_OF_DEPENDENTS_16_TO_18 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20002).sum())

    # get the array of indexes of each unique uid;
    # each unique uid has 10 - 20 rows in the original df,
    # and given that there are over 100k records, this becomes very time consuming
    indices = np.where(df["DICE_SUMMARY_ID"] == uniqueIDs[i])[0]

    for j in indices:
        # insert value in column at index for each repeating index
        df['NO_OF_ADULTS'].iloc[j] = NO_OF_ADULTS
        df['NO_OF_DEPENDENTS_U_16'].iloc[j] = NO_OF_DEPENDENTS_U_16
        df['NO_OF_DEPENDENTS_16_TO_18'].iloc[j] = NO_OF_DEPENDENTS_16_TO_18
A faster version, but I am still not satisfied:
df['NO_OF_ADULTS'].iloc[indices.min():indices.max()] = NO_OF_ADULTS
df['NO_OF_DEPENDENTS_U_16'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_U_16
df['NO_OF_DEPENDENTS_16_TO_18'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_16_TO_18
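For what it's worth, this kind of per-id counting can usually be vectorised with groupby/transform, avoiding the Python-level loops entirely; a sketch using the same column names (untested against the real data):

ids = df['DICE_SUMMARY_ID']

# one pass per target column: build a boolean mask, sum it per id,
# and broadcast the per-id total back onto every row of that id
df['NO_OF_ADULTS'] = (df['FAMILY_TYPE_ID'].isin([10001, 10003])
                      .groupby(ids).transform('sum').astype(int))
df['NO_OF_DEPENDENTS_U_16'] = ((df['FAMILY_TYPE_ID'] == 20001)
                               .groupby(ids).transform('sum').astype(int))
df['NO_OF_DEPENDENTS_16_TO_18'] = ((df['FAMILY_TYPE_ID'] == 20002)
                                   .groupby(ids).transform('sum').astype(int))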
I would like to access all the elementary flows generated by an activity in Brightway, in a table gathering the flows and their amounts.
Let's assume a random activity:
lca = bw.LCA({random_act: 2761}, method)
lca.lci()
lca.lcia()
lca.inventory
I have tried several ways but none of them works:
I have tried to export my LCI with brightway2-io, but some errors appear that I cannot solve:
bw2io.export.excel.lci_matrices_to_excel(db_name) returns an error when computing the biosphere matrix data for a specific row:
--> 120 bm_sheet.write_number(bio_lookup[row] + 1, act_lookup[col] + 1, value)
122 COLUMNS = (
123 u"Index",
124 u"Name",
(...)
128 u"Location",
129 )
131 tech_sheet = workbook.add_worksheet("technosphere-labels")
KeyError: 1757
I tried to get the amount of a specific elementary flow manually. For example, let's say I want to compute the total amount of Aluminium needed for the activity. To do so, I tried this:
flow_Al=Database("biosphere3").search("Aluminium, in ground")
(I only want the resource Aluminium that is extracted as an ore, from the ground.)
amount_Al = 0
row = lca.biosphere_dict[flow_Al]
col_indices = lca.biosphere_matrix[row, :].tocoo()
amount_consumers_lca = [lca.inventory[row, index] for index in col_indices.col]
for j in amount_consumers_lca:
    amount_Al = amount_Al + j
amount_Al
This runs, but the final amount is too low and probably isn't what I'm looking for...
How can I solve this?
Thank you
This will work on Brightway 2 and 2.5:
import pandas as pd
import bw2data as bd
import warnings

def create_inventory_dataframe(lca, cutoff=None):
    array = lca.inventory.sum(axis=1)
    if cutoff is not None and not (0 < cutoff < 1):
        warnings.warn(f"Ignoring invalid cutoff value {cutoff}")
        cutoff = None

    total = array.sum()
    include = lambda x: abs(x / total) >= cutoff if cutoff is not None else True

    if hasattr(lca, 'dicts'):
        # Brightway 2.5
        mapping = lca.dicts.biosphere
    else:
        # Brightway 2
        mapping = lca.biosphere_dict

    data = []
    for key, row in mapping.items():
        amount = array[row, 0]
        if include(amount):
            data.append((bd.get_activity(key), row, amount))

    data.sort(key=lambda x: abs(x[2]))

    return pd.DataFrame([{
        'row_index': row,
        'amount': amount,
        'name': flow.get('name'),
        'unit': flow.get('unit'),
        'categories': str(flow.get('categories'))
    } for flow, row, amount in data])
The cutoff doesn't make sense for the inventory database, but it can be adapted for the LCIA result (characterized_inventory) as well.
Once you have a pandas DataFrame you can filter or export easily.
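For example, a usage sketch tying this back to the aluminium flow from the question (the name and category filters are illustrative and depend on your biosphere database):

df_flows = create_inventory_dataframe(lca)

# total aluminium resource flow for this inventory
al = df_flows[df_flows['name'].str.contains('Aluminium', na=False) &
              df_flows['categories'].str.contains('natural resource', na=False)]
print(al[['name', 'categories', 'unit', 'amount']])
print("Total:", al['amount'].sum())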
Imagine I have a huge dataset which I partitionBy('id'). Assume that id is unique to a person, so there could be n number of rows per id and the goal is to reduce it to one.
Basically, aggregating to make id distinct.
import sys
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('id').rowsBetween(-sys.maxsize, sys.maxsize)

test1 = {
    key: F.first(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test1'
}

test2 = {
    key: F.last(key, True).over(w).alias(key)
    for key in some_dict.keys()
    if some_dict[key] == 'test2'
}
Assume that some_dict has values that are either 'test1' or 'test2'; based on the value, I take either the first or the last, as shown above.
How do I actually call the aggregation and reduce this?
cols = {**test1, **test2}
cols = list(cols.values())

df.select(*cols).groupBy('id').agg(*cols)  # doesn't work
The above clearly doesn't work. Any ideas?
The goal here is: I have 5 unique IDs and 25 rows, with each ID having 5 rows. I want to reduce the 25 rows to 5.
Let's assume your dataframe is named df and contains duplicates; use the method below:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
window = Window.partitionBy(df['id']).orderBy(df['id'])
final = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1")
final.show(10,False)
Change the orderBy condition if there is a specific criterion, so that the desired record ends up at the top of each partition; a sketch follows.
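For instance, if there were a timestamp column (event_time is a hypothetical name, not from the question) and you wanted the most recent record per id:

from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# the newest row per id gets row_id = 1 and is kept
window = Window.partitionBy(df['id']).orderBy(col('event_time').desc())
final = df.withColumn("row_id", row_number().over(window)).filter("row_id = 1").drop("row_id")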
I have a problem merging CSV files using PySpark SQL with a Delta table. I managed to create an upsert function that updates if matched and inserts if not matched.
I want to add an ID column to the final Delta table and increment it each time we insert data. This column identifies each row in our Delta table. Is there any way to put that in place?
def Merge(dict1, dict2):
    res = {**dict1, **dict2}
    return res

def create_default_values_dict(correspondance_df, marketplace):
    dict_output = {}
    for field in get_nan_keys_values(get_mapping_dict(correspondance_df, marketplace)):
        dict_output[field] = 'null'
        # We want to increment the id row each time we perform an insertion (TODO TODO TODO)
        # if field == 'id':
        #     dict_output['id'] = col('id') + 1
        # else:
    return dict_output

def create_matched_update_dict(mapping, products_table, updates_table):
    output = {}
    for k, v in mapping.items():
        if k == 'source_name':
            output['products.source_name'] = lit(v)
        else:
            output[products_table + '.' + k] = F.when(
                col(updates_table + '.' + v).isNull(), col(products_table + '.' + k)
            ).when(
                col(updates_table + '.' + v).isNotNull(), col(updates_table + '.' + v)
            )
    return output

insert_dict = create_not_matched_insert_dict(mapping, 'products', 'updates')
default_dict = create_default_values_dict(correspondance_df_products, 'Cdiscount')
insert_values = Merge(insert_dict, default_dict)
update_values = create_matched_update_dict(mapping, 'products', 'updates')

delta_table_products.alias('products').merge(
        updates_df_table.limit(20).alias('updates'),
        "products.barcode_ean == updates.ean") \
    .whenMatchedUpdate(set=update_values) \
    .whenNotMatchedInsert(values=insert_values) \
    .execute()
I tried to increment the id column in the function create_default_values_dict, but it doesn't seem to work well; it does not auto-increment by 1. Is there another way to solve this problem? Thanks in advance :)
Databricks has IDENTITY columns for hosted Spark
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY
[ ( [ START WITH start ] [ INCREMENT BY step ] ) ]
This works on Delta tables.
Example:
create table gen1 (
id long GENERATED ALWAYS AS IDENTITY
, t string
)
Requires Runtime version 10.4 or above.
Delta does not support auto-increment column types.
In general, Spark doesn't use auto-increment IDs, instead favoring monotonically increasing IDs. See functions.monotonically_increasing_id().
If you want to achieve auto-increment behavior you will have to use multiple Delta operations, e.g., query the max value, add it to a row_number() column computed via a window function, and then write (a sketch follows the two caveats below). This is problematic for two reasons:
Unless you introduce an external locking mechanism or some other way to ensure that no updates to the table happen in-between finding the max value and writing, you can end up with invalid data.
Using row_number() will reduce parallelism to 1, forcing all the data through a single core, which will be very slow with large data.
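Purely for illustration, a minimal sketch of that max-value + row_number() pattern, loosely reusing the question's names (a products Delta table and updates_df_table as the incoming rows, with an active spark session) and ignoring the concurrency caveat above:

from pyspark.sql import functions as F, Window

# current maximum id in the target Delta table (0 if the table is empty)
max_id = spark.table("products").agg(
    F.coalesce(F.max("id"), F.lit(0)).alias("max_id")
).first()["max_id"]

# assign consecutive ids on top of the current max; note that the global
# row_number() window funnels all incoming rows through a single partition
w = Window.orderBy(F.monotonically_increasing_id())
new_rows = updates_df_table.withColumn("id", F.row_number().over(w) + F.lit(max_id))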
Bottom line, you really do not want to use auto-increment columns with Spark.
Hope this helps.
I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, the decade, and its victim count. What I have right now prints out all the columns with the same frequencies. How do I change it so that I just have three columns: State, Decade, and Victim Count?
I'm currently using the groupby function to group by the state and decade, and setting the result equal to a variable called counts.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome prints out all the columns in the file with the same frequencies, whereas I just want three columns: State, Decade, Victim Count.
You should call reset_index on the groupby result, and then select the columns you want from the new dataframe.
Something like
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
Select the columns that you want:
counts = df.loc[:, ['State', 'Decade', 'Victim Count']].groupby(['State', 'Decade']).count()
or
print(counts.reset_index().loc[:, ['State', 'Decade', 'Victim Count']])
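Alternatively, a sketch assuming each row represents one victim, so the group size is the victim count; size() avoids counting every column and names the result directly:

counts = (df.groupby(['State', 'Decade'])
            .size()
            .reset_index(name='Victim Count'))
print(counts)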