This is my code:
x = list(coll.find(
    {"activities.flowCenterInfo": {'$exists': True}},
    {'activities.activityId': 1, 'activities.flowCenterInfo': 1, '_id': 0}
).limit(5))
for row in x:
    print(row)
This is the result of x for one sample:
{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]}
I want to convert this to a DataFrame so I can write it to an Oracle table. How can I convert it properly? I can't find a way.
[Image: the MongoDB structure of one sample document]
Assuming the activities key holds a list with a single dict, the code below builds one row per document and prefixes each field inside flowCenterInfo with fcinfo_:
# sample list
l = [{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]},
{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]},
{'activities': [{'activityId': 'B83F36898FE444309757FBEB6DF0685D', 'flowCenterInfo': {'processId': '178888', 'demandComplaintSubject': 'İkna Görüşmesi', 'demandComplaintDetailSubject': 'Hayat Sigortadan Ayrılma', 'demandComplaintId': '178888'}}]}]
import pandas as pd

# One row per document: activityId plus every flowCenterInfo field, prefixed with fcinfo_
df = pd.DataFrame.from_records([
    {'activityId': r['activities'][0]['activityId'],
     **{f'fcinfo_{k}': v for k, v in r['activities'][0]['flowCenterInfo'].items()}}
    for r in l
])
print(df)
activityId fcinfo_processId ... fcinfo_demandComplaintDetailSubject fcinfo_demandComplaintId
0 B83F36898FE444309757FBEB6DF0685D 178888 ... Hayat Sigortadan Ayrılma 178888
1 B83F36898FE444309757FBEB6DF0685D 178888 ... Hayat Sigortadan Ayrılma 178888
2 B83F36898FE444309757FBEB6DF0685D 178888 ... Hayat Sigortadan Ayrılma 178888
[3 rows x 5 columns]
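If some documents hold more than one dict in activities, the single-dict assumption above no longer holds. A possible alternative, sketched here with made-up documents (the IDs and values are hypothetical), is pd.json_normalize with record_path, which emits one row per activity and flattens flowCenterInfo into dotted column names that can then be renamed:

```python
import pandas as pd

# Hypothetical documents shaped like the sample above; the second one has two activities.
docs = [
    {'activities': [{'activityId': 'A1',
                     'flowCenterInfo': {'processId': '1', 'demandComplaintId': '1'}}]},
    {'activities': [{'activityId': 'B1',
                     'flowCenterInfo': {'processId': '2', 'demandComplaintId': '2'}},
                    {'activityId': 'B2',
                     'flowCenterInfo': {'processId': '3', 'demandComplaintId': '3'}}]},
]

# record_path yields one row per element of 'activities';
# nested dicts become dotted columns such as 'flowCenterInfo.processId'.
flat = pd.json_normalize(docs, record_path='activities')

# Rename 'flowCenterInfo.processId' -> 'fcinfo_processId', and so on.
flat.columns = [c.replace('flowCenterInfo.', 'fcinfo_') for c in flat.columns]
print(flat)
```

The resulting frame has plain string columns, so it can be passed to whatever you use to write to Oracle.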
I would like to get all the elementary flows generated by an activity in Brightway, in a table listing the flows and their amounts.
Let's take a random activity:
lca = bw.LCA({random_act: 2761}, method)
lca.lci()
lca.lcia()
lca.inventory
I have tried several approaches, but none of them works:
First, I tried to export my LCI with brightway2-io, but I get errors that I cannot solve:
bw2io.export.excel.lci_matrices_to_excel(db_name) raises an error while writing the biosphere matrix data for a specific row:
--> 120 bm_sheet.write_number(bio_lookup[row] + 1, act_lookup[col] + 1, value)
122 COLUMNS = (
123 u"Index",
124 u"Name",
(...)
128 u"Location",
129 )
131 tech_sheet = workbook.add_worksheet("technosphere-labels")
KeyError: 1757
Second, I tried to get the amount of a specific elementary flow manually. For example, say I want to compute the total amount of aluminium needed for the activity. To do so, I tried this:
flow_Al=Database("biosphere3").search("Aluminium, in ground")
(I only want the aluminium resource that is extracted as an ore, from the ground.)
amount_Al = 0
row = lca.biosphere_dict[flow_Al]
col_indices = lca.biosphere_matrix[row, :].tocoo()
amount_consumers_lca = [lca.inventory[row, index] for index in col_indices.col]
for j in amount_consumers_lca:
    amount_Al = amount_Al + j
amount_Al
This runs, but the final amount is too low and is probably not what I'm looking for.
How can I solve this?
Thank you
This will work on Brightway 2 and 2.5:
import pandas as pd
import bw2data as bd
import warnings

def create_inventory_dataframe(lca, cutoff=None):
    array = lca.inventory.sum(axis=1)
    if cutoff is not None and not (0 < cutoff < 1):
        warnings.warn(f"Ignoring invalid cutoff value {cutoff}")
        cutoff = None
    total = array.sum()
    include = lambda x: abs(x / total) >= cutoff if cutoff is not None else True
    if hasattr(lca, 'dicts'):
        mapping = lca.dicts.biosphere
    else:
        mapping = lca.biosphere_dict
    data = []
    for key, row in mapping.items():
        amount = array[row, 0]
        if include(amount):
            data.append((bd.get_activity(key), row, amount))
    data.sort(key=lambda x: abs(x[2]))
    return pd.DataFrame([{
        'row_index': row,
        'amount': amount,
        'name': flow.get('name'),
        'unit': flow.get('unit'),
        'categories': str(flow.get('categories'))
    } for flow, row, amount in data
    ])
A cutoff doesn't really make sense for the raw inventory, since it compares each flow against a total that sums flows with different units, but the function can be adapted to the LCIA result (characterized_inventory), where it does.
Once you have a pandas DataFrame you can filter or export easily.
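As a rough sketch of that filtering and exporting step, the snippet below picks out the aluminium resource flow and totals it. It assumes a DataFrame with the columns returned by create_inventory_dataframe. The rows are made up for illustration and do not come from a real Brightway calculation.

```python
import pandas as pd

# Made-up rows with the same columns that create_inventory_dataframe returns.
df = pd.DataFrame([
    {'row_index': 0, 'amount': 1.2e-3, 'name': 'Aluminium, in ground',
     'unit': 'kilogram', 'categories': "('natural resource', 'in ground')"},
    {'row_index': 1, 'amount': 4.5e-1, 'name': 'Carbon dioxide, fossil',
     'unit': 'kilogram', 'categories': "('air',)"},
    {'row_index': 2, 'amount': 7.0e-6, 'name': 'Aluminium',
     'unit': 'kilogram', 'categories': "('water',)"},
])

# Keep only the aluminium resource flow: match on the name and on the category string.
is_al = df['name'].str.contains('Aluminium', case=False)
is_resource = df['categories'].str.contains('natural resource', regex=False)
al_resource = df[is_al & is_resource]

total_al = al_resource['amount'].sum()
print(al_resource)
print(total_al)

# to_csv returns a string when no path is given; pass a filename to write a file instead.
csv_text = df.to_csv(index=False)
```

With a real result you would replace the made-up frame with df = create_inventory_dataframe(lca).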
I'm trying to apply a function to a specific column in this dataframe
datetime PM2.5 PM10 SO2 NO2
0 2013-03-01 7.125000 10.750000 11.708333 22.583333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667
2 2013-03-03 76.916667 120.541667 61.291667 81.000000
3 2013-03-04 22.708333 44.583333 22.854167 46.187500
4 2013-03-06 223.250000 265.166667 116.236700 142.059383
5 2013-03-07 263.375000 316.083333 97.541667 147.750000
6 2013-03-08 221.458333 297.958333 69.060400 120.092788
I'm trying to apply the function below to a specific column (PM10) of the dataframe above:
range1 = [list(range(0,50)),list(range(51,100)),list(range(101,200)),list(range(201,300)),list(range(301,400)),list(range(401,2000))]
def c1_c2(x,y):
    for a in y:
        if x in a:
            min_val = min(a)
            max_val = max(a)+1
            return max_val - min_val
Here "x" can be any column and "y" = range1.
Available Options
df.PM10.apply(c1_c2,args(df.PM10,range1),axis=1)
df.PM10.apply(c1_c2)
I've tried these couple of available options and none of them seems to be working. Any suggestions?
I'm not sure what output you expect from the function, but to get it called with the extra argument you can try the following:
from functools import partial
df.PM10.apply(partial(c1_c2, y=range1))
Update:
OK, I think I understand a little better. This should work, but 'range1' is a list of lists of integers. Your data contains non-integer floats, so nothing matches and the new column comes up empty. I added another list, based on your initial data, that does match. See below:
import pandas as pd

df = pd.read_csv('pm_data.txt', header=0)
range1 = [[7.125000, 10.750000, 11.708333, 22.583333], list(range(0, 50)), list(range(51, 100)), list(range(101, 200)),
          list(range(201, 300)), list(range(301, 400)), list(range(401, 2000))]

def c1_c2(x, y):
    for a in y:
        if x in a:
            min_val = min(a)
            max_val = max(a) + 1
            return max_val - min_val

df['function'] = df.PM10.apply(lambda x: c1_c2(x, range1))
print(df.head(10))
datetime PM2.5 PM10 SO2 NO2 new_column function
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000 16.458333
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167 NaN
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083 NaN
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167 NaN
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333 NaN
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167 NaN
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917 NaN
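Only the first row of 'function' has a value: 10.75 is in the list I took from your initial data, so 'if x in a' matches it. No list contains any of the other values, so the function returns None for them (shown as NaN).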
Old Code:
I'm not sure exactly what you are trying to do, but you can use a lambda to modify columns or create new ones.
Like this:
import pandas as pd
I created a data file to import from the data you posted above:
datetime,PM2.5,PM10,SO2,NO2
2013-03-01,7.125000,10.750000,11.708333,22.583333
2013-03-02,30.750000,42.083333,36.625000,66.666667
2013-03-03,76.916667,120.541667,61.291667,81.000000
2013-03-04,22.708333,44.583333,22.854167,46.187500
2013-03-06,223.250000,265.166667,116.236700,142.059383
2013-03-07,263.375000,316.083333,97.541667,147.750000
2013-03-08,221.458333,297.958333,69.060400,120.092788
Here is how I import it,
df = pd.read_csv('pm_data.txt', header=0)
and create a new column and apply a function to the data in 'PM10'
df['new_column'] = df['PM10'].apply(lambda x: x+15 if x < 30 else x/20)
which yields,
datetime PM2.5 PM10 SO2 NO2 new_column
0 2013-03-01 7.125000 10.750000 11.708333 22.583333 25.750000
1 2013-03-02 30.750000 42.083333 36.625000 66.666667 2.104167
2 2013-03-03 76.916667 120.541667 61.291667 81.000000 6.027083
3 2013-03-04 22.708333 44.583333 22.854167 46.187500 2.229167
4 2013-03-06 223.250000 265.166667 116.236700 142.059383 13.258333
5 2013-03-07 263.375000 316.083333 97.541667 147.750000 15.804167
6 2013-03-08 221.458333 297.958333 69.060400 120.092788 14.897917
Let me know if this helps.
"I've tried these couple of available options and none of them seems to be working..."
What do you mean by this? What is your output? Are you getting errors?
I see a couple of problems:
The lists in range1 contain only whole numbers while your column values are non-integer floats (e.g. 10.75), so 'x in a' never matches and c1_c2() returns None.
Even if the data types matched, c1_c2() would still return None whenever a value is not in range1.
Below is how I would do it, assuming the data-types match:
def c1_c2(x):
    range1 = [list of lists]
    for a in range1:
        if x in a:
            min_val = min(a)
            max_val = max(a)+1
            return max_val - min_val
    return x # returns the original value if not in range1
df.PM10.apply(c1_c2)
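If the data types cannot be made to match, one option is to compare each value against the bounds of every range instead of testing membership. This is only a sketch: the name c1_c2_float and the small PM10 frame are made up for illustration, and it assumes each inner list of range1 stands for the interval from min(a) up to, but not including, max(a)+1.

```python
import pandas as pd

# Same integer ranges as in the question.
range1 = [list(range(0, 50)), list(range(51, 100)), list(range(101, 200)),
          list(range(201, 300)), list(range(301, 400)), list(range(401, 2000))]

def c1_c2_float(x, ranges):
    """Return the width of the first range whose bounds enclose x, else None."""
    for a in ranges:
        lo, hi = min(a), max(a) + 1
        if lo <= x < hi:
            return hi - lo
    return None

# A few PM10 values taken from the question's table.
df = pd.DataFrame({"PM10": [10.75, 42.083333, 120.541667, 265.166667, 316.083333]})

# Extra positional arguments are passed to the function through args=(...).
df["width"] = df["PM10"].apply(c1_c2_float, args=(range1,))
print(df)
```

Note that the ranges in the question leave gaps (for example between 50 and 51), so a value such as 50.5 still returns None.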
This is by far the most difficult problem I have faced. I am trying to create plots indexed on ratetype. For example, I want to efficiently build a table of each unique ratetype against the average customer number for that ratetype. Writing a lambda expression that selects the rows for each individual ratetype, computes the average customer number for that type, and then builds a Series from the two resulting lists (which must be equal in length and correctly aligned) is way over my head in pandas.
There can be hundreds of different ratetypes. Building the list programmatically would be a better choice than hard-coding each possibility, since the list will only grow and vary over time.
""" a section of the data for example use. Working with column "Ratetype"
column "NumberofCustomers" to work towards getting something like
list1 = unique occurs of ratetypes
list2 = avg number of customers for each ratetype
rt =['fixed','variable',..]
avg_cust_numbers = [45.3,23.1,...]
**basically for each ratetype: get mean of all row data for custno column**
ratetype,numberofcustomers
fixed,1232
variable, 1100
vec, 199
ind, 1211
alg, 123
bfd, 788
csv, 129
ggg, 1100
aaa, 566
acc, 439
"""
df[['ratetype', 'number_of_customers']]
fixed = df.loc[df['ratetype'] == 'fixed']
avg_fixed_custno = fixed['number_of_customers'].mean()
rt_counts = df.ratetype.value_counts()
rt_uniques = df.ratetype.unique()
# rt_uniques would be same size vector as avg_cust_nos, has to be anyway
avg_cust_nos = [avg_fixed_custno, avg_variable_custno]
My goal is to create and plot these subplots using matplotlib.pyplot.
data = {'ratetypes': pd.Series(rt_counts, index=rt_uniques),
        'Avg_cust_numbers': pd.Series(avg_cust_nos, index=rt_uniques),
        }
df = pd.DataFrame(data)
df = df.sort_values(by=['ratetypes'], ascending=False)
fig, axes = plt.subplots(nrows=2, ncols=1)
for i, c in enumerate(df.columns):
    df[c].plot(kind='bar', ax=axes[i], figsize=(12, 10), title=c)
plt.savefig('custno_byrate.png', bbox_inches='tight')
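The per-ratetype count and average described above can probably be obtained without any per-type lambda by grouping on the ratetype column. The sketch below assumes the column names from the sample header (ratetype, numberofcustomers). The rows are made up so that some ratetypes repeat.

```python
import io
import pandas as pd

# Made-up sample in the same layout as the data above, with repeated ratetypes.
csv_text = """ratetype,numberofcustomers
fixed,1232
variable,1100
fixed,1000
variable,900
vec,199
"""

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# One row per unique ratetype: how often it occurs and its mean customer number.
summary = (
    df.groupby("ratetype")["numberofcustomers"]
      .agg(n_rows="size", avg_customers="mean")
      .sort_values("n_rows", ascending=False)
)
print(summary)
```

Because both columns share the ratetype index, they stay aligned however many ratetypes there are, and each column can be plotted as a bar chart in the same way as in the loop above.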
I have a rating data set like this: (userId,itemId,rating)
1 100 4
1 101 5
1 102 3
1 10 3
1 103 5
4 353 2
4 354 4
4 355 5
7 420 5
7 421 4
7 422 4
I'm trying to use the ALS method to build a matrix factorization model and obtain the user and product latent features with this code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsTest {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\spark-1.5.1-bin-hadoop2.6\\winutil")
    val conf = new SparkConf().setAppName("test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // Load and parse the data
    val data = sc.textFile("ratings.txt")
    val ratings = data.map(_.split(" ") match { case Array(user, item, rate) =>
      Rating(user.toInt, item.toInt, rate.toDouble)
    })
    // Build the recommendation model using ALS
    val rank = 10
    val numIterations = 30
    val model = ALS.train(ratings, rank, numIterations, 0.01)
    // Print the product feature vectors
    model.productFeatures.cache().collect.foreach(println)
  }
}
I set the rank to 10, so the output of model.productFeatures should be an RDD[(Int, Array[Double])]. But the output looks wrong. It contains strange characters (what are they?), and the entries have different lengths. These are supposed to be latent feature values, so every record should contain the same number of them, and that number should be exactly 10, the rank. The output looks like this:
(48791,7fea9bb7)
(48795,284b451d)
(48799,3d64767d)
(48803,2f812fc3)
(48807,49d3ea7)
(48811,768cf084)
(48815,6845b7b6)
(48819,4e9c724a)
(48823,23191538)
(48827,3200d90f)
(48831,77bd30fe)
(48839,5a1e0261)
(48843,31c56ccf)
(48855,5b90359)
(48863,1b9de9d0)
(48867,313afdc8)
(48871,2b834c34)
(48875,666d21d6)
(48891,12ca97a2)
(48907,74f8fc8e)
(48911,452becc9)
(48915,4a47062b)
(48919,c76ef46)
(48923,3f596eca)
(48927,258e904c)
(48939,570abc88)
(48947,6c3d75f0)
(48951,18667983)
(48955,493b9633)
(48959,4b579d60)
In matrix factorization we construct two lower-dimensional matrices whose product approximates the rating matrix:
rating matrix ≈ p * q(transpose),
p = user latent feature matrix,
q = product latent feature matrix.
Can anyone explain the output format of the ALS methods in Spark?
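Calling println on an Array[Double] prints only the JVM's default object string (a type tag and a hash code), not the elements, which is why the output looks like random characters. To see the latent factors for each product, format the array explicitly:
model.productFeatures.collect().foreach { case (productID, latentFactors) =>
  println("proID:" + productID + " factors:" + latentFactors.mkString(","))
}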
The result for the given dataset is as follows:
proID:1 factors:-1.262960433959961,-0.5678719282150269,1.5220979452133179,2.2127938270568848,-2.096022129058838,3.2418994903564453,0.9077783823013306,1.1294238567352295,-0.0628235936164856,-0.6788621544837952
proID:2 factors:-0.6275356411933899,-2.0269076824188232,1.735855221748352,3.7356512546539307,0.8256714344024658,1.5638374090194702,1.6725327968597412,-1.9434666633605957,0.868758499622345,0.18945524096488953
proID:3 factors:-1.262960433959961,-0.5678719282150269,1.5220979452133179,2.2127938270568848,-2.096022129058838,3.2418994903564453,0.9077783823013306,1.1294238567352295,-0.0628235936164856,-0.6788621544837952
proID:4 factors:-0.6275356411933899,-2.0269076824188232,1.735855221748352,3.7356512546539307,0.8256714344024658,1.5638374090194702,1.6725327968597412,-1.9434666633605957,0.868758499622345,0.18945524096488953
As you can see, each product has exactly 10 factors, which matches the given parameter val rank = 10.
To answer your second question: after training the model you have access to two fields, userFeatures: RDD[(Int, Array[Double])] and productFeatures: RDD[(Int, Array[Double])]. Each entry of the user-item matrix is the dot product of the corresponding user vector and product vector. For example, the source code of the predict method shows how these fields are used to predict the rating of a specific user for one product:
def predict(user: Int, product: Int): Double = {
  val userVector = userFeatures.lookup(user).head
  val productVector = productFeatures.lookup(product).head
  blas.ddot(rank, userVector, 1, productVector, 1)
}
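To make the link with rating matrix = p * q(transpose) concrete, here is a small NumPy sketch of the same dot product. It is an illustration, not Spark code: the rank is 2 and the factor values are made up, with only the user and item IDs borrowed from the sample data.

```python
import numpy as np

# Made-up latent factors with rank 2, keyed by ID like the (Int, Array[Double]) pairs above.
user_features = {1: np.array([1.0, 2.0]), 4: np.array([0.0, 1.0])}
product_features = {100: np.array([0.5, 1.5]), 353: np.array([2.0, 0.0])}

def predict(user, product):
    """Predicted rating = dot product of the user and product factor vectors."""
    return float(np.dot(user_features[user], product_features[product]))

# Stacking the vectors gives P (users x rank) and Q (products x rank);
# the full predicted rating matrix is P @ Q.T.
P = np.vstack([user_features[u] for u in (1, 4)])
Q = np.vstack([product_features[p] for p in (100, 353)])
R_hat = P @ Q.T

print(predict(1, 100))
print(R_hat)
```

Each entry R_hat[i, j] equals predict for the i-th user and j-th product, which is the per-pair calculation that predict performs with blas.ddot.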