How to query attribute table in QGIS to return the sum of a subset - python-3.x

I am using Jupyter notebook to query a dataset too big to be opened with excel and my pc is too slow to perform calculations directly on QGIS.
My logic is as follows, after imported pandas:
x = df[(df.OBJECTID == 4440) & (df.Landuse == 'Grass - Urban')]
want_area = x['Area'].sum() #returning the whole dataframe sum for that field!!
summed = x['Area'].sum()
ratio = round(want_area / summed, 2)
How can I tweak the code in order to obtain the sum of the 'Area' of the above subset only, and not for the whole dataframe (800+ thousand features)?
Hope my simple question makes sense and thank you very much!

Related

Python data source - first two columns disappear

I have started using PowerBI and am using Python as a data source with the code below. The source data can be downloaded from here (it's about 700 megabytes). The data is originally from here (contained in IOT_2019_pxp.zip).
import pandas as pd
import numpy as np
import os
path = /path/to/file
to_chunk = pd.read_csv(os.path.join(path,'A.txt'), delimiter = '\t', header = [0,1], index_col = [0,1],
iterator=True, chunksize=1000)
def chunker(to_chunk):
to_concat = []
for chunk in to_chunk:
try:
to_concat.append(chunk['BG'].loc['BG'])
except:
pass
return to_concat
A = pd.concat(chunker(to_chunk))
I = np.identity(A.shape[0])
L = pd.DataFrame(np.linalg.inv(I-A), index=A.index, columns=A.columns)
The code simply:
Loads the file A.txt, which is a symmetrical matrix. This matrix has every sector in every region for both rows and columns. In pandas, these form a MultiIndex.
Filters just the region that I need which is BG. Since it's a symmetrical matrix, both row and column are filtered.
The inverse of the matrix is calculated giving us L, which I want to load into PowerBI. This matrix now just has a single regular Index for sector.
This is all well and good however when I load into PowerBI, the first column (sector names for each row i.e. the DataFrame Index) disappears. When the query gets processed, it is as if it were never there. This is true for both dataframes A and L, so it's not an issue of data processing. The column of row names (the DataFrame index) is still there in Python, PowerBI just drops it for some reason.
I need this column so that I can link these tables to other tables in my data model. Any ideas on how to keep it from disappearing at load time?
For what it's worth, calling reset_index() removed the index from the dataframes and they got loaded like regular columns. For whatever reason, PBI does not properly load pandas indices.
For a regular 1D index, I had to do S.reset_index().
For a MultiIndex, I had to do L.reset_index(inplace=True).

fast date based replacement of rows in Pandas

I am on a quest of finding the fastest replacement method based on index in Pandas.
I want to fill np.nans to all rows based on index (DateTimeIndex).
I tested various types of selection, but obviously, the bottleneck is setting the rows equal to a value (np.nan in my case).
Naively, I want to do this:
df['2017-01-01':'2018-01-01'] = np.nan
I tried and tested a performance of various other methods, such as
df.loc['2017-01-01':'2018-01-01'] = np.nan
And also creating a mask with NumPy to speed it up
df['DateTime'] = df.index
st = pd.to_datetime('2017-01-01', format='%Y-%m-%d').to_datetime64()
en = pd.to_datetime('2018-01-01', format='%Y-%m-%d').to_datetime64()
ge_start = df['DateTime'] >= st
le_end = df['DateTime'] <= en
mask = (ge_start & le_end )
and then
df[mask] = np.nan
#or
df.where(~mask)
But with no big success. I have DataFrame (that I cannot share unfortunately) of size cca (200,1500000), so kind of big, and the operation takes order of seconds of CPU time, which is way too much imo.
Would appreciate any ideas!
edit: after going through
Modifying a subset of rows in a pandas dataframe and Why dataframe.values is very slow and unifying datatypes for the operation, the problem is solved with cca 20x speedup.

AWS Glue dynamic frames not getting updated

We are currrently facing an issue where we cannot insert more than 600K records in oracle db using AWS glue. We are getting connection reset error and DBA's are currently looking into it. As a temporary solution we thought of adding data in chunks by splitting a dataframe into multiple dataframe and looping this list of dataframe to add data. We are sure that splitting algorithm works fine and here is the code we use
def split_by_row_index(df, num_partitions=10):
# Let's assume you don't have a row_id column that has the row order
t = df.withColumn('_row_id', monotonically_increasing_id())
# Using ntile() because monotonically_increasing_id is discontinuous across partitions
t = t.withColumn('_partition', ntile(num_partitions).over(Window.orderBy(t._row_id)))
return [t.filter(t._partition == i + 1) for i in range(num_partitions)]
Here each DF have unique data but somehow when we convert this df in dynamic frame in loop it is we are getting common data in each dynamic frame. here is small snippet for this example
df_trns_details_list = split_by_row_index(df_trns_details, int(df_trns_details.count() / 100000))
trnsDetails1 = DynamicFrame.fromDF(df_trns_details_list[0], glueContext, "trnsDetails1")
trnsDetails2 = DynamicFrame.fromDF(df_trns_details_list[1], glueContext, "trnsDetails2")
print(df_trns_details_list[0].count())# counts are same
print(trnsDetails1.count())
print('-------------------------------')
print(df_trns_details_list[1].count()) # counts are same
print(trnsDetails2.count())
print('-------------------------------')
subDf1 = trnsDetails1.toDF().select(col("id"), col("details_id"))
subDf2 = trnsDetails2.toDF().select(col("id"), col("details_id"))
common = subDf1.intersect(subDf2)
# ------------------ common data exists----------------
print(common.count())
subDf3 = df_trns_details_list[0].select(col("id"), col("details_id"))
subDf4 = df_trns_details_list[1].select(col("id"), col("details_id"))
#------------------0 common data----------------
common1 = subDf3.intersect(subDf4)
print(common1.count())
here Id and details_id combination will be unique
We used this logic in multiple areas where it worked not sure why this is happening.
We are also quite new to Python and AWS Glue so any suggestion to improve it also welcomed. Thanks

which is the efficient way to iterate in python?

I have to iterate one by one over 1 million records, which are stored in a list. And its value is present in a Pandas dataframe. I have to first find its value in the dataframe then perform some arthritic operation on it. And again store it in another Pandas dataframe. But it takes too much time to complete. So I have stored the value in a tuple and the performance has improved a bit but not as expected. Is there any way to optimize this?
Below is sample code I have done.
c2=['Fruits','animals',...]
list1=[]
for j in c2:
data2=dataframe.loc[(dataframe['value'] == j)]
data3=data2.describe()
range1=data3.loc['max']-data3.loc['min']
The most efficient way is to use vectorized functions. Typing this in the blind:
c2 = ['Fruits', 'animals', ...]
tmp = dataframe[dataframe['value'].isin(c2)] \
.groupby('value') \
.agg(['min', 'max'])
df_range = tmp['max'] - tmp['min']

Filtering Spark DataFrame on new column

Context: I have a dataset too large to fit in memory I am training a Keras RNN on. I am using PySpark on an AWS EMR Cluster to train the model in batches that are small enough to be stored in memory. I was not able to implement the model as distributed using elephas and I suspect this is related to my model being stateful. I'm not entirely sure though.
The dataframe has a row for every user and days elapsed from the day of install from 0 to 29. After querying the database I do a number of operations on the dataframe:
query = """WITH max_days_elapsed AS (
SELECT user_id,
max(days_elapsed) as max_de
FROM table
GROUP BY user_id
)
SELECT table.*
FROM table
LEFT OUTER JOIN max_days_elapsed USING (user_id)
WHERE max_de = 1
AND days_elapsed < 1"""
df = read_from_db(query) #this is just a custom function to query our database
#Create features vector column
assembler = VectorAssembler(inputCols=features_list, outputCol="features")
df_vectorized = assembler.transform(df)
#Split users into train and test and assign batch number
udf_randint = udf(lambda x: np.random.randint(0, x), IntegerType())
training_users, testing_users = df_vectorized.select("user_id").distinct().randomSplit([0.8,0.2],123)
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))
#Create and sort train and test dataframes
train = df_vectorized.join(training_users, ["user_id"], "inner").select(["user_id", "days_elapsed","batch_number","features", "kpi1", "kpi2", "kpi3"])
train = train.sort(["user_id", "days_elapsed"])
test = df_vectorized.join(testing_users, ["user_id"], "inner").select(["user_id","days_elapsed","features", "kpi1", "kpi2", "kpi3"])
test = test.sort(["user_id", "days_elapsed"])
The problem I am having is that I cannot seem to be able to filter on batch_number without caching train. I can filter on any of the columns that are in the original dataset in our database, but not on any column I have generated in pyspark after querying the database:
This: train.filter(train["days_elapsed"] == 0).select("days_elapsed").distinct.show() returns only 0.
But, all of these return all of the batch numbers between 0 and 9 without any filtering:
train.filter(train["batch_number"] == 0).select("batch_number").distinct().show()
train.filter(train.batch_number == 0).select("batch_number").distinct().show()
train.filter("batch_number = 0").select("batch_number").distinct().show()
train.filter(col("batch_number") == 0).select("batch_number").distinct().show()
This also does not work:
train.createOrReplaceTempView("train_table")
batch_df = spark.sql("SELECT * FROM train_table WHERE batch_number = 1")
batch_df.select("batch_number").distinct().show()
All of these work if I do train.cache() first. Is that absolutely necessary or is there a way to do this without caching?
Spark >= 2.3 (? - depending on a progress of SPARK-22629)
It should be possible to disable certain optimization using asNondeterministic method.
Spark < 2.3
Don't use UDF to generate random numbers. First of all, to quote the docs:
The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
Even if it wasn't for UDF, there are Spark subtleties, which make it almost impossible to implement this right, when processing single records.
Spark already provides rand:
Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
and randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
which can be used to build more complex generator functions.
Note:
There can be some other issues with your code but this makes it unacceptable from the beginning (Random numbers generation in PySpark, pyspark. Transformer that generates a random number generates always the same number).

Resources