What is the best practice when importing 2 SimaPro datasets into Brightway2 to merge them together?

I have been importing one SimaPro CSV dataset containing a recipe:
sp = SimaProCSVImporter("recipe.CSV","recipe")
sp.migrate("simapro-ecoinvent-3")
sp.apply_strategies()
and another SimaPro CSV dataset with 4 specific unit processes for some of the ingredients in the first dataset:
sp2 = SimaProCSVImporter("ingredients.CSV","ingredients")
sp2.migrate("simapro-ecoinvent-3")
sp2.apply_strategies()
By matching all exchanges of the ingredients dataset with ecoinvent, I am able to do impact assessments:
sp2.match_database("ecoinvent 3.2 cutoff",ignore_categories=True)
db = sp2.write_database()
lca = LCA(
    demand={db.random(): 1},
    method=('IPCC 2013', 'GWP', '100 years'),
)
lca.lci()
lca.lcia()
lca.score
As a next step, I have matched the recipe dataset first against ecoinvent and then against the ingredients dataset:
sp.match_database("ecoinvent 3.2 cutoff",ignore_categories=True)
sp.match_database("ingredients",ignore_categories=True)
db2 = sp.write_database()
When I then try to do the LCA calculation:
lca = LCA(
    demand={db2.random(): 1},
    method=('IPCC 2013', 'GWP', '100 years'),
)
lca.lci()
lca.lcia()
lca.score
I get the following error:
Technosphere matrix is not square: 12917 rows and 12921 products.
What did I do wrong, what is the best practice?

Hard to say without seeing the actual data. Are you checking .statistics() each time to make sure there aren't any unlinked exchanges before writing the database? The warning message is a bit confusing (now fixed in 1.3.5), but you have too many products (rows) and not enough activities (columns). The most probable way this could happen is if you have an activity with multiple products, but again, impossible to say more or suggest fixes without seeing the actual data.
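For example, a minimal sketch of that kind of pre-write check (assuming the sp importer from the question; statistics() and the unlinked iterator are part of bw2io's importer API, but treat the exact return values as an assumption to verify for your version):

datasets, exchanges, unlinked = sp.statistics()  # also prints a summary
if unlinked:
    # Inspect what failed to link before deciding how to fix or drop it
    for exc in sp.unlinked:
        print(exc.get("name"), exc.get("unit"), exc.get("categories"))
else:
    db2 = sp.write_database()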

Related

Delta Live Tables data quality checks - retain failed records

There are 3 types of quality checks in Delta Live Tables:
expect (retain invalid records)
expect_or_drop (drop invalid records)
expect_or_fail (fail on invalid records)
I want to retain invalid records, but I also want to keep track of them. So, by using expect, can I query the invalid records, or is it just for keeping stats like "n records were invalid"?
expect just records that you had some problems, so you have some statistics about your data quality in the pipeline. But it's not very useful in practice.
Native quarantine functionality is still not available; that's why there is the recipe in the cookbook. Although it's not exactly what you need, you can still build on top of it, especially the second part of the recipe that explicitly adds a Quarantine column - we can combine it with expect to get statistics into the UI:
import dlt
from pyspark.sql.functions import expr

rules = {}
quarantine_rules = {}
...
# quarantine rule is the negation of all the expectation rules combined
quarantine_rules = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(
  name="partitioned_farmers_market",
  partition_cols = [ 'Quarantine' ]
)
@dlt.expect_all(rules)
def get_partitioned_farmers_market():
  return (
    dlt.read("raw_farmers_market")
      .withColumn("Quarantine", expr(quarantine_rules))
      .select("MarketName", "Website", "Location", "State",
              "Facebook", "Twitter", "Youtube", "Organic", "updateTime",
              "Quarantine")
  )
Another approach would be to use the first part of the recipe (the one that uses expect_all_or_drop) and just union both tables (it's better to mark the valid/invalid tables with the temporary = True marker), as in the sketch below.
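A rough sketch of that union-based variant, assuming the same raw_farmers_market source and a rules dict like the one above; the table names and the example rule here are placeholders, not part of the original recipe:

import dlt
from pyspark.sql.functions import expr

rules = {"valid_website": "Website IS NOT NULL"}  # placeholder rule
quarantine_rule = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(temporary=True)
@dlt.expect_all_or_drop(rules)
def valid_farmers_market():
    # first part of the recipe: keep only rows passing every rule
    return dlt.read("raw_farmers_market")

@dlt.table(temporary=True)
def invalid_farmers_market():
    # mirror table: keep only rows that violate at least one rule
    return dlt.read("raw_farmers_market").where(expr(quarantine_rule))

@dlt.table(name="farmers_market_all")
def farmers_market_all():
    # the union keeps the invalid records available for querying downstream
    return dlt.read("valid_farmers_market").unionByName(dlt.read("invalid_farmers_market"))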

Spark: problem with crossJoin (takes a tremendous amount of time)

First of all, I have to say that I've already tried everything I know or found on Google (including this: Spark: How to use crossJoin, which is exactly my problem).
I have to calculate the Cartesian product of two DataFrames, countries and units, like this:
A.cache().count()

val units = A.groupBy("country")
  .agg(sum("grade").as("grade"),
       sum("point").as("point"))
  .withColumn("AVR", $"grade" / $"point" * 1000)
  .drop("point", "grade")

val countries = D.select("country").distinct()
val C = countries.crossJoin(units)
countries contains the country names and its size is bounded by 150. units is a DataFrame with 3 rows, the aggregated result of another DataFrame. I have checked the result 100 times and those are indeed the sizes, yet it takes 5 hours to complete.
I know I missed something. I've tried caching, repartitioning, etc.
I would love to get some other ideas.
I have two suggestions for you:
Look at the explain plan and the Spark properties; for the amount of data you have mentioned, 5 hours is a really long time. My expectation is that you have way too many shuffles; you can look at different properties such as spark.sql.shuffle.partitions.
Instead of doing a cross join, you can maybe do a collect and explore broadcasts
https://sparkbyexamples.com/spark/spark-broadcast-variables/ but do this only on small amounts of data, as this data is brought back to the driver.
What is the action you are doing afterwards with C?
Also, if these datasets are so small, consider collecting them to the driver and doing these manipulations there; you can always spark.createDataFrame later again.
Update #1:
final case class Unit(country: String, AVR: Double)

val collectedUnits: Seq[Unit] = units.as[Unit].collect
val collectedCountries: Seq[String] = countries.as[String].collect

val pairs: Seq[(String, Unit)] = for {
  unit <- collectedUnits
  country <- collectedCountries
} yield (country, unit)
I've finally understood the problem - Spark was using an excessive number of shuffle partitions, and thus the shuffle took a lot of time.
The way to solve it is to change the default number -
sparkSession.conf.set("spark.sql.shuffle.partitions", 10)
And it works like magic.

PySpark: proper data skew salting technique example code

I have the following DataFrames:
deptDf.columns
['deptid', 'name', 'dept', 'deptid']
empDf.columns
['eid', 'ename', 'deptid','esal']
If I do the join based on deptid:
deptDf.join(empDf, deptDf.deptid == empDf.deptid, 'inner')
A few departments have only a few employees while other departments have a huge number of employees, so the data is skewed.
To overcome the data skew issue I want to use the salting technique. Could someone please provide example code for it?
Try this one (you can break it down into more than 10 parts; it depends on your table really):
bigger_tabs = bigger.randomSplit([1.0] * 10)
for i in range(len(bigger_tabs)):
    bigger_tabs[i] = bigger_tabs[i].join(smaller, COND, JOIN_TYPE)
final_tab = bigger_tabs[0]
for i in range(1, len(bigger_tabs)):
    final_tab = final_tab.union(bigger_tabs[i])
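For reference, a sketch of a more conventional salting approach for this join (assuming each DataFrame has a single deptid column as in the question, with empDf as the big, skewed side; the number of salt buckets is an arbitrary choice to tune):

from pyspark.sql import functions as F

SALT_BUCKETS = 10  # tune to the degree of skew

# Add a random salt to the big, skewed side
emp_salted = empDf.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the small side once per salt value so every salted key can match
dept_salted = deptDf.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = (
    emp_salted.join(
        dept_salted,
        (emp_salted.deptid == dept_salted.deptid) & (emp_salted.salt == dept_salted.salt),
        "inner",
    )
    .drop(dept_salted.salt)
    .drop(emp_salted.salt)
)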

Filtering Spark DataFrame on new column

Context: I have a dataset that is too large to fit in memory, on which I am training a Keras RNN. I am using PySpark on an AWS EMR cluster to train the model in batches that are small enough to be stored in memory. I was not able to implement the model as distributed using elephas, and I suspect this is related to my model being stateful. I'm not entirely sure though.
The DataFrame has a row for every user and every value of days elapsed since the day of install, from 0 to 29. After querying the database I do a number of operations on the DataFrame:
query = """WITH max_days_elapsed AS (
SELECT user_id,
max(days_elapsed) as max_de
FROM table
GROUP BY user_id
)
SELECT table.*
FROM table
LEFT OUTER JOIN max_days_elapsed USING (user_id)
WHERE max_de = 1
AND days_elapsed < 1"""
df = read_from_db(query) #this is just a custom function to query our database

import numpy as np
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf, lit, col
from pyspark.sql.types import IntegerType

#Create features vector column
assembler = VectorAssembler(inputCols=features_list, outputCol="features")
df_vectorized = assembler.transform(df)

#Split users into train and test and assign batch number
udf_randint = udf(lambda x: np.random.randint(0, x), IntegerType())
training_users, testing_users = df_vectorized.select("user_id").distinct().randomSplit([0.8, 0.2], 123)
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))
#Create and sort train and test dataframes
train = df_vectorized.join(training_users, ["user_id"], "inner").select(["user_id", "days_elapsed","batch_number","features", "kpi1", "kpi2", "kpi3"])
train = train.sort(["user_id", "days_elapsed"])
test = df_vectorized.join(testing_users, ["user_id"], "inner").select(["user_id","days_elapsed","features", "kpi1", "kpi2", "kpi3"])
test = test.sort(["user_id", "days_elapsed"])
The problem I am having is that I cannot seem to filter on batch_number without caching train. I can filter on any of the columns that are in the original dataset in our database, but not on any column I have generated in PySpark after querying the database:
This: train.filter(train["days_elapsed"] == 0).select("days_elapsed").distinct().show() returns only 0.
But, all of these return all of the batch numbers between 0 and 9 without any filtering:
train.filter(train["batch_number"] == 0).select("batch_number").distinct().show()
train.filter(train.batch_number == 0).select("batch_number").distinct().show()
train.filter("batch_number = 0").select("batch_number").distinct().show()
train.filter(col("batch_number") == 0).select("batch_number").distinct().show()
This also does not work:
train.createOrReplaceTempView("train_table")
batch_df = spark.sql("SELECT * FROM train_table WHERE batch_number = 1")
batch_df.select("batch_number").distinct().show()
All of these work if I do train.cache() first. Is that absolutely necessary or is there a way to do this without caching?
Spark >= 2.3 (?, depending on the progress of SPARK-22629)
It should be possible to disable certain optimizations using the asNondeterministic method.
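For example, a minimal sketch reusing the UDF and names from the question (asNondeterministic exists from Spark 2.3 on; verify the exact behaviour for your version):

from pyspark.sql.functions import udf, lit
from pyspark.sql.types import IntegerType
import numpy as np

# Mark the UDF as nondeterministic so the optimizer will not deduplicate
# or re-evaluate it in ways that change the generated values
udf_randint = udf(lambda x: np.random.randint(0, x), IntegerType()).asNondeterministic()
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))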
Spark < 2.3
Don't use UDF to generate random numbers. First of all, to quote the docs:
The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
Even if it weren't for the UDF issue, there are Spark subtleties which make it almost impossible to implement this correctly when processing single records.
Spark already provides rand:
Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
and randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
which can be used to build more complex generator functions.
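Applied to the batch assignment from the question, a minimal rand-based sketch (N_BATCHES and training_users as in the question; the seed is arbitrary) could look like this:

from pyspark.sql.functions import rand, floor

# Assign each user a pseudo-random batch in [0, N_BATCHES) without a UDF
training_users = training_users.withColumn(
    "batch_number", floor(rand(seed=123) * N_BATCHES).cast("int")
)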
Note:
There can be some other issues with your code, but this alone makes it unacceptable from the beginning (see: Random numbers generation in PySpark; pyspark - Transformer that generates a random number generates always the same number).

Spark MLLib ALS: Efficient mapping of misc user and product IDs to integer

I am attempting to build an online recommender system using the Spark recommendation ALS algorithm. My data resides in MongoDB, where I keep collections of users, items and ratings. The identifiers for these documents are of the default type ObjectID. I am looking for an efficient way to map these ObjectID types to the required int for ALS. Concretely, my ratings collection consists of entries of the structure {user: ObjectID, item: ObjectID, rating: float}.
My recommender system will be getting new ratings fed to it regularly, which requires new ALS models to be computed with batches of new ratings coming in. Therefore I do not plan to save the models, and consider it the easiest implementation to get the new ratings from MongoDB based on their timestamp and that of the last trained model. New ratings are then processed in Spark to get int IDs assigned, so I'm looking for the most efficient implementation. Below I elaborate on my attempt, any feedback on how to improve my approach will be greatly appreciated.
My attempt
As per the answer to this question, I have tried to implement strategies using RDD.zipWithUniqueId() and RDD.zipWithIndex(). My procedure is as follows, replacing zipWithIndex() with zipWithUniqueId() for the second variation:
# Retrieve from MongoDB (using pymongo_spark)
ratings_mongo = sc.mongoRDD(mongo_path)
ratings = ratings_mongo.map(lambda r: (r['user'], r['item'], r['rating']))
# Get distinct users and items
users = ratings.map(lambda r: r[0]).distinct()
items = ratings.map(lambda r: r[1]).distinct()
# Zips the RDDs and creates 'mirrored' RDDs to facilitate reverse mapping
user_int = users.zipWithIndex()
int_user = user_int.map(lambda u: (u[1], u[0]))
item_int = items.zipWithIndex()
int_item = item_int.map(lambda i: (i[1], i[0]))
# Substitutes the ObjectIDs in the ratings RDD with the corresponding int values
ratings = ratings.map(lambda r: (r[0], (r[1], r[2]))).join(user_int).map(lambda r: (r[1][1], r[1][0][0], r[1][0][1]))
ratings = ratings.map(lambda r: (r[1], (r[0], r[2]))).join(item_int).map(lambda r: (r[1][0][0], r[1][1], r[1][0][1]))
I am relatively new to the game, and I feel that there may be a more efficient way to go about this. Also, zipWithIndex() does not work on my full dataset: its progress stalls without giving an immediate error, though it does seem to work on smaller samples.
Otherwise, zipWithUniqueId() does work, but seems horrendously slow to complete.
Alternative 1
Use Python's hash():
def hash_ids(rating):
    return hash(rating[0]) & 0xffffffff, hash(rating[1]) & 0xffffffff, rating[2]

ratings = ratings.map(hash_ids)
Very fast to execute, but, with my limited understanding of hashing, this will ultimately result in collisions. Also, the collisions will always be the same users and items, therefore some user's recommendations will always be someone else's. Am I right?
Alternative 2
Do the conversion outside of Spark, and possibly maintain a unique ID field of type int in the MongoDB documents.
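A hypothetical sketch of what that could look like with pymongo's atomic counter pattern (the collection and field names here are placeholders, not from the actual setup):

from pymongo import MongoClient, ReturnDocument

client = MongoClient()              # placeholder connection
db = client["recommender"]          # placeholder database name

def next_int_id(name):
    # One counter document per sequence; $inc is atomic, so IDs stay unique
    doc = db.counters.find_one_and_update(
        {"_id": name},
        {"$inc": {"seq": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc["seq"]

# Assign the int ID once, when a user/item document is first created
# (some_object_id is a placeholder for the document's ObjectID)
db.users.update_one(
    {"_id": some_object_id, "int_id": {"$exists": False}},
    {"$set": {"int_id": next_int_id("user")}},
)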
