Query with multiple filters on Pandas - python-3.x

I want to run the following query: filter the data to the rows where the transaction is 'Gas Oil/ Diesel Oil - Production' and the year is greater than 2000. First, I tried combining the conditions with the & operator and vectorized column selection, without an if statement, but it did not work. Then I found the query below, but this time I did not get any output. What do you think is wrong with my query? Thanks.
if all(b['Commodity - Transaction'] == 'Gas Oil/ Diesel Oil - Production') and all(b[ b['Year'] >2000 ]):
    print (b)
else:
    print('did not find any values')

What's wrong with:
b.loc[(b['Commodity - Transaction'] == 'Gas Oil/ Diesel Oil - Production') & (b['Year'] >2000)]
?

You can first create a mask with str.contains and then take the subset using boolean indexing:
print(b[(b['Commodity - Transaction'].str.contains('Gas Oil/ Diesel Oil - Production')) &
        (b['Year'] > 2000)])
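As a minimal sketch of the boolean-indexing approach, the snippet below builds one mask per condition and combines them with &. The sample rows are made up for illustration and are not taken from the asker's data; the names is_diesel_production, after_2000 and filtered are likewise hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
b = pd.DataFrame({
    'Commodity - Transaction': [
        'Gas Oil/ Diesel Oil - Production',
        'Gas Oil/ Diesel Oil - Production',
        'Fuel Oil - Production',
        'Gas Oil/ Diesel Oil - Production',
    ],
    'Year': [1999, 2005, 2010, 2014],
})

# Each comparison yields a boolean Series with one entry per row.
is_diesel_production = b['Commodity - Transaction'] == 'Gas Oil/ Diesel Oil - Production'
after_2000 = b['Year'] > 2000

# & combines the two masks element-wise; the result selects matching rows.
filtered = b[is_diesel_production & after_2000]

if filtered.empty:
    print('did not find any values')
else:
    print(filtered)
```

Note that all(...) on a mask is only true when every row matches, so it tests the whole column rather than selecting rows. That is probably why the if/else version does not behave like a filter.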


kdb/q: How to apply a string manipulation function to a vector of strings to output a vector of strings?

Thanks in advance for the help. I am new to kdb/q, coming from a Python and C++ background.
Just a simple syntax question: I have a string with fields and their corresponding values
pp_str: "field_1:abc field_2:xyz field_3:kdb"
I wrote an atomic (scalar) function to extract the value of a given field.
get_field_value: {[field; pp_str] pp_fields: " " vs pp_str; pid_field: pp_fields[where like[pp_fields; field,":*"]]; start_i: (pid_field[0] ss ":")[0] + 1; end_i: count pid_field[0]; indices: start_i + til (end_i - start_i); pid_field[0][indices]}
show get_field_value["field_1"; pp_str]
"abc"
show get_field_value["field_3"; pp_str]
"kdb"
Now how do I generalize this so that if I input a vector of fields, I get a vector of values? I want to input ("field_1"; "field_2"; "field_3") and output ("abc"; "xyz"; "kdb"). I tried multiple approaches (below) but I just don't understand kdb/q's syntax well enough to vectorize my function:
/ Attempt 1 - Fail
get_field_value[enlist ("field_1"; "field_2"); pp_str]
/ Attempt 2 - Fail
get_field_value[; pp_str] /. enlist ("field_1"; "field_3")
/ Attempt 3 - Fail
fields: ("field_1"; "field_2")
get_field_value[fields; pp_str]
To run your function on each field, you can project it over pp_str and apply the projection to the list of fields with each:
q)get_field_value[;pp_str]each("field_1";"field_3")
"abc"
"kdb"
Kdb actually has built-in functionality to handle this: https://code.kx.com/q/ref/file-text/#key-value-pairs
q){#[;x](!/)"S: "0:y}[`field_1;pp_str]
"abc"
q)
q){#[;x](!/)"S: "0:y}[`field_1`field_3;pp_str]
"abc"
"kdb"
I think this might be the syntax you're looking for.
q)get_field_value[; pp_str]each("field_1";"field_2")
"abc"
"xyz"

Normalising units/Replace substrings based on lists using Python

I am trying to normalize weight units in a string.
E.g.:
1. SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre - SUCO MARACUJA COM GENGIBRE PCS 300 ML
2. OVOS CAIPIRAS ANA MARIA BRAGA 10UN - OVOS CAIPIRAS ANA MARIA BRAGA 10U
3. SUCO MARACUJA MAMAO PCS 300 Gram - SUCO MARACUJA MAMAO PCS 300 G
4. SUCO ABACAXI COM MACA PCS 300Milli litre - SUCO ABACAXI COM MACA PCS 300ML
The keyword table is :
unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre','Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']
I tried to take up these lists as a table but am having difficulty in comparing two dataframes or tables in python.
I tried the below code.
import re

unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre','Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']
z='SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre'
#for row in mongo_docs:
#    z = row['clean_hntproductname']
for x in unit:
    for y in norm_unit:
        if (re.search(r'\s'+x+r'$',z,re.I)):
            pass  # placeholder: the replacement and update steps below are still commented out
            # clean_hntproductname = t.lower().replace(x.lower(),y.lower())
            # myquery3 = { "_id" : row['_id']}
            # newvalues3 = { "$set": {"clean_hntproductname" : 'clean_hntproductname'} }
            # ds_hnt_prod_data.update_one(myquery3, newvalues3)
I'm using Python (Jupyter) with MongoDB (Compass), fetching data from Mongo and writing back to it.
From my understanding you want to:
Update all the rows in a table which contain the words in the unit array, to the ones in norm_unit.
(Disclaimer: I'm not familiar with MongoDB or Python.)
What you want is to create a mapping (a hash, i.e. a dict in Python) from the words you want to change to their replacements.
Here's a trivial solution (not the best one, but it should point you in the right direction):
unit_conversions = {
    'Kilo': 'KG',
    'Kilogram': 'KG',
    'Gram': 'G'
}
# pseudo-code
for each row that you want to update
    item_description = get the value of the string in the column
    for each key in unit_conversions (e.g. 'Kilo')
        see if the item_description contains the key
        if it does, replace it with unit_conversions[key] (e.g. 'KG')
    update the row
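The mapping idea above can be sketched in runnable Python roughly as follows. This is only one possible approach, under some assumptions: the unit appears at the end of the string and directly after a number (optionally separated by whitespace), as in the four examples from the question. The names unit_map, pattern and normalise_units are hypothetical, and the MongoDB read/write step is left out.

```python
import re

unit = ['Kilo', 'Kilogram', 'Gram', 'Milligram', 'Millilitre', 'Milli litre',
        'Dozen', 'Litre', 'Un', 'Und', 'Unid', 'Unidad', 'Unidade', 'Unidades']
norm_unit = ['KG', 'KG', 'G', 'MG', 'ML', 'ML', 'DZ', 'L', 'U', 'U', 'U', 'U', 'U', 'U']

# Pair each long form with its abbreviation; keys are lower-cased for lookup.
unit_map = {u.lower(): n for u, n in zip(unit, norm_unit)}

# Try longer units first so 'Kilogram' is not cut short by 'Kilo'.
alternatives = '|'.join(re.escape(u) for u in sorted(unit, key=len, reverse=True))

# Assumption: a unit follows a digit (with optional whitespace) and ends the string.
pattern = re.compile(r'(?<=\d)(\s*)(' + alternatives + r')\s*$', re.IGNORECASE)


def normalise_units(name):
    """Replace a trailing long-form unit with its abbreviation, keeping the spacing."""
    return pattern.sub(lambda m: m.group(1) + unit_map[m.group(2).lower()], name)


print(normalise_units('SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre'))
print(normalise_units('OVOS CAIPIRAS ANA MARIA BRAGA 10UN'))
```

A single compiled pattern avoids the nested loop over unit and norm_unit, which would pair every unit with every abbreviation rather than with its own.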

Replace values in observations (i.e., multiple columns within multiple rows) based on multiple conditionals

I am trying to replace the values of 3 columns within multiple observations based on two conditions (e.g., a specific ID after a particular date).
I have seen similar questions.
Pandas Multiple Conditions Function based on Column
Pandas replace, multi column criteria
Pandas: How do I assign values based on multiple conditions for existing columns?
Replacing values in a pandas dataframe based on multiple conditions
However, they did not quite address my problem or I can't quite manipulate them to solve my problem.
This code will generate a dataframe similar to mine:
df = pd.DataFrame({'SUR_ID': {0:'SUR1', 1:'SUR1', 2:'SUR1', 3:'SUR1', 4:'SUR2', 5:'SUR2'}, 'DATE': {0:'05-01-2019', 1:'05-11-2019', 2:'06-15-2019', 3:'06-20-2019', 4: '05-15-2019', 5:'06-20-2019'}, 'ACTIVE_DATE': {0:'05-01-2019', 1:'05-01-2019', 2:'05-01-2019', 3:'05-01-2019', 4: '05-01-2019', 5:'05-01-2019'}, 'UTM_X': {0:'444895', 1:'444895', 2:'444895', 3:'444895', 4: '445050', 5:'445050'}, 'UTM_Y': {0:'4077528', 1:'4077528', 2:'4077528', 3:'4077528', 4: '4077762', 5:'4077762'}})
What I am trying to do:
I am trying to replace UTM_X,UTM_Y, AND ACTIVE_DATE with
[444917, 4077830, '06-04-2019']
when
SUR_ID is "SUR1" and DATE >= "2019-06-04 12:00:00"
This is a poorly adapted version of the solution to the first question listed above, which I tried to use to fix my problem; it throws an error:
df.loc[[df['SUR_ID'] == 'SUR1' and df['DATE'] >='2019-06-04 12:00:00'], ['UTM_X', 'UTM_Y', 'Active_Date']] = [444917, 4077830, '06-04-2019']
First ensure that the DATE column is of type datetime. Then, when combining two conditions, wrap each one in its own parentheses and join them with & rather than and. So you can do:
df.DATE = pd.to_datetime(df.DATE)
df.loc[ (df['SUR_ID'] == 'SUR1') & (df['DATE'] >= pd.to_datetime('2019-06-04 12:00:00')),
['UTM_X', 'UTM_Y', 'ACTIVE_DATE']] = [444917, 4077830, '06-04-2019']
See the difference between what you wrote for the boolean mask:
[df['SUR_ID'] == 'SUR1' and df['DATE'] >='2019-06-04 12:00:00']
and the version here, with each condition in parentheses and & in place of and:
(df['SUR_ID'] == 'SUR1') & (df['DATE'] >= pd.to_datetime('2019-06-04 12:00:00'))
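To make the difference concrete, here is a small self-contained sketch on the question's data. It first shows that Python's and raises a ValueError on two Series, then builds the mask once with & and reuses it for the assignment. Two choices here are assumptions rather than part of the answer: the dates are parsed into a separate variable (dates) instead of overwriting the DATE column, and the new coordinates are written as strings because the UTM columns in the sample frame hold strings.

```python
import pandas as pd

df = pd.DataFrame({
    'SUR_ID': ['SUR1', 'SUR1', 'SUR1', 'SUR1', 'SUR2', 'SUR2'],
    'DATE': ['05-01-2019', '05-11-2019', '06-15-2019',
             '06-20-2019', '05-15-2019', '06-20-2019'],
    'ACTIVE_DATE': ['05-01-2019'] * 6,
    'UTM_X': ['444895'] * 4 + ['445050'] * 2,
    'UTM_Y': ['4077528'] * 4 + ['4077762'] * 2,
})

# `and` asks each Series for a single True/False value, which pandas refuses.
try:
    (df['SUR_ID'] == 'SUR1') and (df['UTM_X'] == '444895')
    and_failed = False
except ValueError:
    and_failed = True

# Parse the dates explicitly so the comparison is chronological, not textual.
dates = pd.to_datetime(df['DATE'], format='%m-%d-%Y')

# Element-wise AND of two boolean Series; each condition has its own parentheses.
mask = (df['SUR_ID'] == 'SUR1') & (dates >= pd.Timestamp('2019-06-04 12:00:00'))

# Assumption: values are kept as strings to match the existing column contents.
df.loc[mask, ['UTM_X', 'UTM_Y', 'ACTIVE_DATE']] = ['444917', '4077830', '06-04-2019']
print(df)
```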
Alternatively, use Series.mask on each column:
df['UTM_X']=df['UTM_X'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),444917)
df['UTM_Y']=df['UTM_Y'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),4077830)
df['ACTIVE_DATE']=df['ACTIVE_DATE'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),'06-04-2019')
Output:
  SUR_ID        DATE ACTIVE_DATE   UTM_X    UTM_Y
0   SUR1  05-01-2019  05-01-2019  444895  4077528
1   SUR1  05-11-2019  05-01-2019  444895  4077528
2   SUR1  06-15-2019  06-04-2019  444917  4077830
3   SUR1  06-20-2019  06-04-2019  444917  4077830
4   SUR2  05-15-2019  05-01-2019  445050  4077762
5   SUR2  06-20-2019  05-01-2019  445050  4077762

Smart Beta Portfolio: How to generate the weights based on dollar volume traded for each date

def generate_dollar_volume_weights(close, volume):
    """
    Generate dollar volume weights.

    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    volume : DataFrame
        Volume for each ticker and date

    Returns
    -------
    dollar_volume_weights : DataFrame
        The dollar volume weights for each ticker and date
    """
    assert close.index.equals(volume.index)
    assert close.columns.equals(volume.columns)
    #TODO: Implement function
    dollar_volume_weights=np.cumsum(close * volume) / np.cumsum(volume)
    return dollar_volume_weights
project_tests.test_generate_dollar_volume_weights(generate_dollar_volume_weights)#testing the function
So my results are as below
INPUT close:
                  XHNP        FCUB        ZYRP
2005-09-09 35.44110000 34.17990000 34.02230000
2005-09-10 92.11310000 91.05430000 90.95720000
2005-09-11 57.97080000 57.78140000 58.19820000
2005-09-12 34.17050000 92.45300000 58.51070000
INPUT volume:
                        XHNP              FCUB              ZYRP
2005-09-09  9836830.00000000 17807200.00000000  8829820.00000000
2005-09-10 82242700.00000000 68531500.00000000 48160100.00000000
2005-09-11 16234800.00000000 13052700.00000000  9512010.00000000
2005-09-12 10674200.00000000 56831300.00000000  9316010.00000000
OUTPUT dollar_volume_weights:
                  XHNP        FCUB        ZYRP
2005-09-09 35.44110000 34.17990000 34.02230000
2005-09-10 86.05884636 79.32405834 82.13590461
2005-09-11 81.84884187 76.49494177 78.71200871
2005-09-12 77.57172242 82.30022612 76.22980471
EXPECTED OUTPUT FOR dollar_volume_weights:
                 XHNP       FCUB       ZYRP
2005-09-09 0.27719777 0.48394253 0.23885970
2005-09-10 0.41632975 0.34293308 0.24073717
2005-09-11 0.41848548 0.33536102 0.24615350
2005-09-12 0.05917255 0.85239760 0.08842984
I'm a beginner here and I really can't understand what I'm missing. I think the expression for generating the dollar volume weights should be something like dollar_volume_weights=np.cumsum(close * volume) / np.cumsum(volume).
Can someone tell me why my results are different?
I think you should compute close * volume and divide it by its row-wise sum, i.e. np.sum(close * volume, axis=1). You can use the div() method for this division and specify axis=0, so that each row is divided by its own total.
I agree with @joseph here. I assume you are talking about the Smart Beta project from Udacity. Their example in the prior cell is misleading, and it confused the heck out of me. Anyway, this is how I solved it:
def generate_dollar_volume_weights(close, volume):
    """
    Generate dollar volume weights.

    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    volume : DataFrame
        Volume for each ticker and date

    Returns
    -------
    dollar_volume_weights : DataFrame
        The dollar volume weights for each ticker and date
    """
    assert close.index.equals(volume.index)
    assert close.columns.equals(volume.columns)
    #TODO: Implement function
    print(close)
    print(volume)
    # Dollar volume per ticker and date
    dollar_volume = close * volume
    print(dollar_volume)
    # Total dollar volume across all tickers for each date
    total_dollar_volume = dollar_volume.sum(axis=1)
    print(total_dollar_volume)
    # Divide each row by that date's total
    dollar_volume_weights = dollar_volume.div(total_dollar_volume,axis=0)
    print(dollar_volume_weights)
    return dollar_volume_weights
project_tests.test_generate_dollar_volume_weights(generate_dollar_volume_weights)
Can you explain more about how you want to calculate dollar_volume_weights from the input to get the expected result?
To be clear about what your code is doing, this is the logic it uses to compute the result for each ticker:
2005-09-09 (close1*volume1) / volume1 ...
2005-09-10 ((close1*volume1) + (close2*volume2)) / (volume1 + volume2) ...
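The two calculations discussed in this thread can be compared side by side on the inputs from the question. This is a sketch rather than the project's reference solution: running_vwap reproduces the question's cumulative formula (a running volume-weighted average price per ticker), while weights divides each ticker's dollar volume by the total for that date. The variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2005-09-09', '2005-09-10', '2005-09-11', '2005-09-12'])
tickers = ['XHNP', 'FCUB', 'ZYRP']

close = pd.DataFrame(
    [[35.4411, 34.1799, 34.0223],
     [92.1131, 91.0543, 90.9572],
     [57.9708, 57.7814, 58.1982],
     [34.1705, 92.4530, 58.5107]],
    index=dates, columns=tickers)

volume = pd.DataFrame(
    [[9836830.0, 17807200.0, 8829820.0],
     [82242700.0, 68531500.0, 48160100.0],
     [16234800.0, 13052700.0, 9512010.0],
     [10674200.0, 56831300.0, 9316010.0]],
    index=dates, columns=tickers)

# The question's formula: cumulative sums run down each column (over dates).
running_vwap = (close * volume).cumsum() / volume.cumsum()

# Row-wise normalisation: each ticker's share of that date's total dollar volume.
dollar_volume = close * volume
weights = dollar_volume.div(dollar_volume.sum(axis=1), axis=0)

print(weights.round(8))
```

Because every row of weights is divided by its own total, each date's weights sum to 1, which is a quick sanity check for this kind of function.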

pd.to_datetime to solve '2010/1/1' rather than '2010/01/01'

I have a dataframe which contains a column 'trade_dt' like this:
2009/12/1
2009/12/2
2009/12/3
2009/12/4
I got this problem
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y-&m-%d')
ValueError: time data '2009/12/1' does not match format '%Y-&m-%d' (match)
How can I solve it? Thanks.
You need to change the format so that it matches the data: replace & with % and - with /:
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y/%m/%d')
With the sample data it also works if you remove format (but I am not sure about the real data):
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'])
print (benchmark)
trade_dt
0 2009-12-01
1 2009-12-02
2 2009-12-03
3 2009-12-04
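As a quick check of the format fix, the sketch below parses dates whose month and day are not zero-padded, which %m and %d accept. The sample values (including '2010/1/1' from the title) and the flag name bad_format_failed are chosen here for illustration.

```python
import pandas as pd

# Sample values in the same style as the question's column.
benchmark = pd.DataFrame({'trade_dt': ['2009/12/1', '2009/12/2', '2010/1/1']})

# The original format string ('-' separators and '&m') does not match the data.
try:
    pd.to_datetime(benchmark['trade_dt'], format='%Y-&m-%d')
    bad_format_failed = False
except ValueError:
    bad_format_failed = True

# '/' separators and %m: single-digit months and days parse without padding.
benchmark['trade_dt'] = pd.to_datetime(benchmark['trade_dt'], format='%Y/%m/%d')
print(benchmark)
```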
