Spark Dataframe timestamp column manipulation failing without any error message - apache-spark

aggregate = aggregate.withColumn('DaysSinceFirstUsage', when(months_between(current_date(), col('FirstUsage')) > 120, - (sys.maxsize - 1)).otherwise(days_between(current_date(), col('FirstUsage')))
aggregate = aggregate.withColumn('DaysSinceLastUsage', when(months_between(current_date(), col('LastUsage')) > 120, - (sys.maxsize - 1)).otherwise(days_between(current_date(), col('LastUsage')))

Silly mistake :)
The closing bracket at the end was missing, and datediff was mistakenly written as days_between. The query runs fine after the correction.
aggregate = aggregate.withColumn('DaysSinceFirstUsage', when(months_between(current_date(), col('FirstUsage')) > 120, - (sys.maxsize - 1)).otherwise(datediff(current_date(), col('FirstUsage'))))
aggregate = aggregate.withColumn('DaysSinceLastUsage', when(months_between(current_date(), col('LastUsage')) > 120, - (sys.maxsize - 1)).otherwise(datediff(current_date(), col('LastUsage'))))
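Outside Spark, the same guard logic can be sketched in plain Python (the helper name days_since and the whole-month approximation of months_between are mine, for illustration only):

```python
from datetime import date
import sys

SENTINEL = -(sys.maxsize - 1)  # same "impossible" marker value as in the query

def days_since(first_usage: date, today: date, cap_months: int = 120) -> int:
    """Return SENTINEL when the usage date is more than cap_months months old,
    otherwise the day difference (mirrors when(...).otherwise(datediff(...)))."""
    # Whole-month difference; Spark's months_between also counts day fractions.
    months = (today.year - first_usage.year) * 12 + (today.month - first_usage.month)
    if months > cap_months:
        return SENTINEL
    return (today - first_usage).days
```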

Related

Failing to use sumproduct on date ranges with multiple conditions [Python]

From the replacement data table (below in the image), I am trying to incorporate the solbox product replacements into the time-series data format (above in the image). I need to extract the number of consumers per day from this information.
What I need to find out:
On a specific date, how many solbox products were active
On a specific date, how many solbox products belonging to consumers were active
I have used this line of code in Excel but cannot implement it properly in Python.
=SUMPRODUCT((Record_Solbox_Replacement!$O$2:$O$1367 = "consumer") * (A475>=Record_Solbox_Replacement!$L$2:$L$1367)*(A475<Record_Solbox_Replacement!$M$2:$M$1367))
I tried in python -
timebase_df['date'] = pd.date_range(start = replace_table_df['solbox_started'].min(), end = replace_table_df['solbox_started'].max(), freq = frequency)
timebase_df['date_unix'] = timebase_df['date'].astype(np.int64) // 10**9
timebase_df['no_of_solboxes'] = ((timebase_df['date_unix'] >= replace_table_df['started'].to_numpy()) & (timebase_df['date_unix'] < replace_table_df['ended'].to_numpy()) & (replace_table_df['customer_type'] == 'customer'))
ERROR:
~\Anaconda3\Anaconda4\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
232 # The ambiguous case is object-dtype. See GH#27803
233 if len(lvalues) != len(rvalues):
--> 234 raise ValueError("Lengths must match to compare")
235
236 if should_extension_dispatch(lvalues, rvalues):
ValueError: Lengths must match to compare
Can someone help me, please? I can explain in the comment section if I have missed something.
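A common NumPy pattern for this kind of SUMPRODUCT is to broadcast the query dates against the interval columns instead of comparing arrays of unequal length; the sketch below uses made-up values, assuming started/ended hold unix timestamps:

```python
import numpy as np
import pandas as pd

# Hypothetical replacement table: start/end of each solbox's service interval.
replace_table_df = pd.DataFrame({
    "started": [100, 200, 300],
    "ended":   [400, 250, 500],
    "customer_type": ["consumer", "consumer", "other"],
})

dates = np.array([150, 260, 450])  # query timestamps (unix seconds)

started = replace_table_df["started"].to_numpy()
ended = replace_table_df["ended"].to_numpy()
is_consumer = (replace_table_df["customer_type"] == "consumer").to_numpy()

# Broadcast: rows = query dates, columns = solboxes. Comparing a Series
# directly against an array of a different length raises the
# "Lengths must match to compare" error from the question.
active = (dates[:, None] >= started) & (dates[:, None] < ended)
n_active = active.sum(axis=1)               # active solboxes per date
n_consumers = (active & is_consumer).sum(axis=1)  # active consumer solboxes
```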

Python3 Renaming Files By tkinter Listbox

I want to rename all files in a directory by tkinter listbox.
Got stuck at this point:
files_list = os.listdir(root.foldername)
print(files_list)
gives me
['1.mp4', '10.mp4', '2.mp4', '3.mp4', '4.mp4', '5.mp4', '6.mp4', '7.mp4', '8.mp4', '9.mp4']
values = [listbox.get(idx) for idx in listbox.curselection()]
And
inlist = (', '.join(values))
print(inlist)
gives me
Lost - 1x01 - Pilot(1), Lost - 1x02 - Pilot(2), Lost - 1x03 - Tabula Rasa, Lost - 1x04 - Walkabout, Lost - 1x05 - White Rabbit, Lost - 1x06 - House Of The Rising Sun, Lost - 1x07 - The Moth, Lost - 1x08 - Confidence Man, Lost - 1x09 - Solitary, Lost - 1x10 - Raised By Another
Now I'm looking for a solution using os.rename in order to rename the files 1.mp4 through 10.mp4.
Additionally, Python does not come with a built-in natural sort, so os.listdir sorts 1.mp4 directly before 10.mp4.
Thank you very much in advance.
For natural sorting take a look at Sorting alphanumeric strings in Python.
Then loop through all files and rename them, eg.
for i in range(len(files_list)):
    # Join with the folder so the rename also works outside the current directory.
    old_file_name = os.path.join(root.foldername, files_list[i])
    new_file_name = os.path.join(root.foldername, values[i] + '.mp4')
    os.rename(old_file_name, new_file_name)
For assistance in dealing with pathnames see os.path.
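For the natural-sort part, a minimal key function might look like this (the helper name natural_key is mine):

```python
import re

def natural_key(name: str):
    # Split "10.mp4" into ["", 10, ".mp4"] so the numeric parts compare as ints.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]

files_list = ['1.mp4', '10.mp4', '2.mp4']
files_list.sort(key=natural_key)
# files_list is now ['1.mp4', '2.mp4', '10.mp4']
```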

sqlQuery of DocumentDB input binding with modulo operator causes function failure

I'm using the DocumentDB input bindings on Azure Functions. Today, I specified the following query as the sqlQuery.
SELECT c.id, c.created_at FROM c
WHERE {epoch} - c.created_at_epoch >= 86400*31
AND (CEILING({epoch}/86400) - CEILING(c.created_at_epoch / 86400)) % 31 = 0
Afterwards, I saw the following error when the function was triggered.
2017-07-04T10:31:44.873 Function started (Id=95a2ab7a-8eb8-4568-b314-2c3b04a0eadf)
2017-07-04T10:31:49.544 Function completed (Failure, Id=95a2ab7a-8eb8-4568-b314-2c3b04a0eadf, Duration=4681ms)
2017-07-04T10:31:50.106 Exception while executing function: Functions.Bonus. Microsoft.Azure.WebJobs.Host: The '%' at position 148 does not have a closing '%'.
I want to use the modulo operator (%) within sqlQuery. What can I do?
Best regards.
Appended 2017-07-15 (JST):
Today, I tried the following alternative query to avoid this issue.
SELECT c.id, c.created_at FROM c
WHERE {epoch} - c.created_at_epoch >= 86400*31 AND
(CEILING({epoch}/86400) - CEILING(c.created_at_epoch / 86400)) -
(31 *
CEILING(
(CEILING({epoch}/86400) - CEILING(c.created_at_epoch / 86400))
/ 31
)
) = 0
Just in case, I tried this query, specifying epoch = 1499218423, directly on Cosmos DB.
SELECT c.id, c.created_at FROM c
WHERE 1499218423 - c.created_at_epoch >= 86400*31 AND
(CEILING(1499218423/86400) - CEILING(c.created_at_epoch / 86400)) -
(31 *
CEILING(
(CEILING(1499218423/86400) - CEILING(c.created_at_epoch / 86400))
/ 31
)
) = 0
The result is as follows.
[
    {
        "id": "70251cbf-44b3-4cd9-991f-81127ad78bca",
        "created_at": "2017-05-11 18:46:16"
    },
    {
        "id": "0fa31de2-4832-49ea-a0c6-b517d64ede85",
        "created_at": "2017-05-11 18:48:22"
    },
    {
        "id": "b9959d15-92e7-41c3-8eff-718c4ab2be6e",
        "created_at": "2017-05-11 19:01:43"
    }
]
It looks fine. Then I specified it as the sqlQuery and tested with the following queue data.
{"epoch":1499218423}
And the code of the function is as follows.
module.exports = function (context, myQueueItem) {
    context.log(context.bindings.members, myQueueItem);
    context.done();
};
Afterwards, I saw the following results.
2017-07-05T03:00:47.158 Function started (Id=e4d060b5-3ddc-4271-bf91-9f314e7e1148)
2017-07-05T03:00:47.408 [] { epoch: 1499871600 }
2017-07-05T03:00:47.408 Function completed (Success, Id=e4d060b5-3ddc-4271-bf91-9f314e7e1148, Duration=245ms)
The binding results (context.bindings.members) look different. Why does this difference appear?
Related question: Differences among the Azure CosmosDB Query Explorer's results and the Azure Functions results
I want to use the modulo operator within sqlQuery. What can I do?
The percent sign (%) in the Azure Functions configuration is used to retrieve values from app settings. For your issue, I suggest you add an item to the app settings, e.g. a setting named modulationsymbol whose value is %.
After that, you can use %modulationsymbol% instead of % in your query, as follows.
SELECT c.id, c.created_at FROM c
WHERE {epoch} - c.created_at_epoch >= 86400*31
AND (CEILING({epoch}/86400) - CEILING(c.created_at_epoch / 86400)) %modulationsymbol% 31 = 0
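Based on that workaround, the input binding might look like this in function.json (the database, collection, and connection values here are placeholders, not from the post; the app setting modulationsymbol must exist with the value %):

```json
{
  "bindings": [
    {
      "type": "documentDB",
      "name": "members",
      "databaseName": "YourDatabase",
      "collectionName": "YourCollection",
      "connection": "YourCosmosDBConnection",
      "direction": "in",
      "sqlQuery": "SELECT c.id, c.created_at FROM c WHERE {epoch} - c.created_at_epoch >= 86400*31 AND (CEILING({epoch}/86400) - CEILING(c.created_at_epoch / 86400)) %modulationsymbol% 31 = 0"
    }
  ]
}
```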

Query with multiple filters on Pandas

I want to execute this query: filtering data with the 'Gas Oil/ Diesel Oil - Production' transaction where the year is greater than 2000. First, I tried to execute my query with the & operand and vectorized column selection, without using an if statement, but it did not work. Then I found the query below, but this time I could not get any output. What do you think the problem with my query is? Thanks...
if all(b['Commodity - Transaction'] == 'Gas Oil/ Diesel Oil - Production') and all(b[b['Year'] > 2000]):
    print(b)
else:
    print('did not find any values')
what's wrong with:
b.loc[(b['Commodity - Transaction'] == 'Gas Oil/ Diesel Oil - Production') & (b['Year'] >2000)]
?
You can first create a mask with str.contains and then create the subset using boolean indexing:
print(b[(b['Commodity - Transaction'].str.contains('Gas Oil/ Diesel Oil - Production')) &
        (b['Year'] > 2000)])
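To see why the if-based attempt prints nothing, here is a self-contained sketch with made-up data; all() collapses a whole column to a single boolean, while a boolean mask keeps the row-wise filter:

```python
import pandas as pd

# Tiny stand-in for the question's DataFrame b (values are made up).
b = pd.DataFrame({
    "Commodity - Transaction": [
        "Gas Oil/ Diesel Oil - Production",
        "Gas Oil/ Diesel Oil - Production",
        "Kerosene - Production",
    ],
    "Year": [1999, 2005, 2010],
})

# all(...) asks "do ALL rows match?", which is False here, so the
# if-branch never runs. A boolean mask filters row by row instead:
mask = (b["Commodity - Transaction"] == "Gas Oil/ Diesel Oil - Production") & (b["Year"] > 2000)
result = b.loc[mask]
```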

splitting a file.txt into two file with a condition

How can I split the given file into two different files, result codes and warning codes? Given below is a single text file, and I want to split it into two files; I have many more files in this condition to split.
Result Codes:
0 - SYS_OK - "Ok"
1 - SYS_ERROR_E - "System Error"
1001 - MVE_SYS_E - "MTE System Error"
1002 - MVE_COMMAND_SYNTAX_ERROR_E - "Command Syntax is wrong"
Warning Codes:
0 - SYS_WARN_W - "System Warning"
100001 - MVE_SYS_W - "MVE System Warning"
200001 - SLEA_SYS_W - "SLEA System Warning"
200002 - SLEA_INCOMPLETE_SCRIPTED_OVERRIDE_COMMAND_W - "One or more of the entered scripted override commands has missing mandatory parameters"
300001 - L1_SYS_W - "L1 System Warning"
Well, on first glance, the distinction seems to be that "warnings" all contain the character sequence _W - and anything that doesn't is "results". Did you notice that?
awk '/_W -/{print >"warnings";next}{print >"results"}'
Here is a Python solution.
I am assuming you have the list of warning codes in a separate file.
import re

# Build the set of known warning codes from the warning-codes list.
warnings = open(r'warning-codes.txt')
warn_codes = []
for line in warnings:
    m = re.search(r'(\d+) .*', line)
    if m:
        warn_codes.append(m.group(1))
warnings.close()

ow = open('output-warnings.txt', 'w')
ors = open('output-results.txt', 'w')
log_file = open(r'log.txt')
for line in log_file:
    m = re.search(r'(\d+) .*', line)
    if m and (m.group(1) in warn_codes):
        ow.write(line)   # line already ends with a newline
    elif m:
        ors.write(line)
    else:
        print("none")
log_file.close()
ow.close()
ors.close()
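If the input really is one combined file with the two section headers shown above, a split keyed on those headers may be simpler (file handling omitted; the helper name split_sections is mine):

```python
def split_sections(lines):
    """Split the combined code listing into (results, warnings) lists,
    switching buckets whenever a section header line is seen."""
    results, warnings = [], []
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped == "Result Codes:":
            current = results
        elif stripped == "Warning Codes:":
            current = warnings
        elif stripped and current is not None:
            current.append(stripped)
    return results, warnings
```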
