My co-worker and I have been setting up, configuring, and testing Dask for a week or so now, and everything is working great (can't speak highly enough about how easy, straightforward, and powerful it is), but now we are trying to leverage it for more than just testing and are running into an issue. We believe it's a fairly simple one related to syntax and an understanding gap. Any help to get it running is greatly appreciated. Any support in evolving our understanding of more optimal paths is also greatly appreciated.
High level flow:
Open data in pandas & clean it (we plan on moving this to a pipeline)
From there, convert the cleaned data set for regression into a dask data frame
Set the x & y variables and create all unique x combination sets
Create all unique formulas (y ~ x1 + x2 +0)
Run each individual formula set with the data through a linear lasso lars model to get the AIC for each formula for ranking
Current Issue:
Run each individual formula set (~1700 formulas) with the data (1 single data set which doesn’t vary with each run) on the dask cluster and get the results back
Optimize the calculation & return the final data
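The combination and formula steps in the flow above can be sketched with itertools alone. This is a minimal illustration under assumptions, not the code from the post: `build_formulas` and `max_terms` are hypothetical names, and the x variable names are placeholders.

```python
from itertools import combinations


def build_formulas(y_var, x_vars, max_terms=2):
    """Return patsy-style formulas for every combination of 1..max_terms x variables.

    Hypothetical helper: the trailing "+ 0" drops the intercept, as in the post.
    """
    formulas = []
    for n_terms in range(1, max_terms + 1):
        for combo in combinations(x_vars, n_terms):
            formulas.append(f"{y_var} ~ {' + '.join(combo)} + 0")
    return formulas


# Placeholder variable names, for illustration only.
formulas = build_formulas("y", ["x1", "x2", "x3"])
for f in formulas:
    print(f)
```

Building the formulas as plain strings up front keeps the list small and cheap to send to workers; only the model fit itself needs the data.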
# In[]
# Imports:
import logging as log
import datetime as dat
from itertools import combinations
import numpy as np
import pandas as pd
from patsy import dmatrices
import sklearn as sk
from sklearn.linear_model import LogisticRegression, SGDClassifier, LinearRegression
import dask as dask
import dask.dataframe as dk
from dask.distributed import Client
# In[]
# logging, set the dask client, open & clean the data, pass into a dask dataframe
log.basicConfig(format='%(asctime)s %(message)s',
                datefmt="%m-%d %H:%M:%S")
c = Client('ip:port')
ST =
data_pd = pd.read_csv('some.txt', sep="\t")
#fill some na/clean up the data a bit
data_pd['V9'] = data_pd.V9.fillna("Declined")
data_pd['y'] = data_pd.y.fillna(0)
data_pd['x1'] = data_pd.x1.fillna(0)
#output the clean data and re-import into dask, we could also use from_pandas to get to dask dataframes
data = dk.read_csv(r'C:\path\*.csv', sep=",")
# set x & y variables - the below is truncated
y_var = "y"
x_var = ['x1',
#list of all variables
all_var = [y_var] + x_var
#all unique combinations
x_var_combos = [combos for combos in combinations(x_var,2)]
#add single variables for testing as well
for i in x_var:
    x_var_combos.append((i,))
# create formulas from our y, x variables
def formula(y_var, combo):
    combo_len = len(combo)
    if combo_len == 2:
        formula = y_var +"~"+combo[0] +"+"+ combo[1]+"+0"
    else:
        formula = y_var +"~"+combo[0]+"+0"
    return formula
def model_aic(dt, formula):
    k = 2
    y_df, x_df = dmatrices(formula, dt, return_type = 'dataframe')
    y_df = np.ravel(y_df)
    log.info('dmatrices successful')
    LL_model = sk.linear_model.LassoLarsIC(max_iter = 100)
    AIC_Value = min(LL_model.fit(x_df, y_df).criterion_) + ( (2*(k**2)+2*(k)) / (len(x_df)-k-1) )
    log.info('AIC_Value: %s', AIC_Value)
    oup = [formula ,AIC_Value, len(dt)-AIC_Value]
    return oup
# ----------------- here's where we're stuck ---------------------
# ----------------- we think this is correct ----------------------
# ----------------- create a list of all formula to execute -------
# In[]
out = []
for i in x_var_combos:
var = model_aic(data, formula(y_var, i))
# ----------------- but we're stuck figuring out how to -----------
# ------------------make it compute & return the result -----------
ans = c.compute(*out)
ans2 = c.compute(out[1])
print (ans2)


Multiprocessing pool map for a BIG array computation runs much slower than expected

I've run into some difficulties using multiprocessing Pool in Python 3. I want to do a BIG array calculation with it. Basically, I have a 3D array on which I need to run a computation 10 times, and it generates 10 output files sequentially. This task is repeated 3 times, i.e., in the output we get 3*10=30 output files (*.txt). To do this, I've prepared the following script for a small array calculation (a sample problem). However, when I use this script for a BIG array calculation, or for arrays that come from a series of files, this piece of code (maybe the pool) takes up the memory and does not save any .txt file to the destination directory. There is no error message when I run the file with the command mpirun python3
Can anybody suggest what the problem is in the sample script and how to write the code so that it no longer gets stuck? I've not received any error message, and I don't know where the problem occurs. Any help is appreciated. Thanks!
import numpy as np
import multiprocessing as mp
from scipy import signal
import matplotlib.pyplot as plt
import contextlib
import os, glob, re
import random
import cmath, math
import time
import pdb
#File Storing path
save_results_to = 'File saving path'
arr_x = [0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0, 8.49, 12.0]
arr_y = [0, 8.49, 12.0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0]
total_rows = 5000
arr = np.reshape(np.random.rand(total_rows*N),(total_rows, N))
arr1 = np.reshape(np.random.rand(total_rows*N),(total_rows, N))
arr2 = np.reshape(np.random.rand(total_rows*N),(total_rows, N))
# Finding cross spectral density (CSD)
def my_func1(data):
# Do something here
return array1
t0 = time.time()
my_data1 = my_func1(arr)
my_data2 = my_func1(arr1)
my_data3 = my_func1(arr2)
print('Time required {} seconds to execute CSD--For loop'.format(time.time()-t0))
mydata_list = [my_data1,my_data2,my_data3]
def my_func2(data2):
# Do something here
return from_data2
start_freq = 100
stop_freq = 110
freq_range= np.around(np.linspace(start_freq,stop_freq,11)/10, decimals=2)
no_of_freq = len(freq_range)
list_arr =[]
def my_func3(csd):
for fr_count in range(start_freq, stop_freq):
csd_single = csd[:,:, fr_count]
print('Shape of list is :', np.array(list_csd).shape)
return list_csd
def parallel_function(BIG_list_data):
    with contextlib.closing(mp.Pool(processes=10)) as pool:
        dft = pool.map(my_func2, BIG_list_data)
    data_arr = np.array(dft)
    print('shape of data :', data_arr.shape)
    return data_arr
count_day = 1
count_hour =0
for count in range(3):
    count_hour +=1
    list_arr = my_func3(mydata_list[count]) # Load Numpy files
    print('Array shape is :', np.array(arr).shape)
    t0 = time.time()
    data_dft = parallel_function(list_arr)
    print('The hour number={} data is processing... '.format(count_hour))
    print('Time in parallel:', time.time() - t0)
    for i in range(no_of_freq-1): # (11-1=10)
        jj = freq_range[i]
        #print('The hour_number {} and frequency number {} data is processing... '.format(count_hour, jj))
        dft_1hr_complx = data_dft[i,:,:]
        np.savetxt(save_results_to + f'csd_Day_{count_day}_Hour_{count_hour}_f_{jj}_hz.txt', dft_1hr_complx.view(float))
As @JérômeRichard suggested, to make your script aware of the job scheduler you need to define the number of processors that will be engaged for this task. The following line could help: ncpus = int(os.getenv('SLURM_CPUS_PER_TASK', 1))
You need to use this line inside your Python script. Also, inside parallel_function, use with contextlib.closing(mp.Pool(processes=ncpus)) as pool: instead of with contextlib.closing(mp.Pool(processes=10)) as pool:. Note that the keyword argument of mp.Pool is processes; there is no ncpus keyword. Thanks
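The suggestion above can be sketched as follows. This is a minimal illustration, assuming the job runs under SLURM and that `SLURM_CPUS_PER_TASK` is set by the scheduler; `pool_size`, `row_sum`, and `parallel_row_sums` are hypothetical names standing in for the real functions.

```python
import multiprocessing as mp
import os


def pool_size(default=1):
    """Worker count from the scheduler's allocation, or `default` if the variable is unset."""
    return int(os.getenv("SLURM_CPUS_PER_TASK", default))


def row_sum(row):
    """Stand-in for the real per-item computation (hypothetical)."""
    return sum(row)


def parallel_row_sums(rows, processes):
    """Map row_sum over rows with a pool of the given size."""
    with mp.Pool(processes=processes) as pool:
        return pool.map(row_sum, rows)


if __name__ == "__main__":
    rows = [[1, 2], [3, 4], [5, 6]]
    print(parallel_row_sums(rows, pool_size()))
```

Sizing the pool from the allocation avoids starting more workers than the scheduler granted, which is one plausible cause of a job that appears to hang.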

How can I interpolate values from two lists (in Python)?

I am relatively new to coding in Python. I have mainly used MatLab in the past and am used to having vectors that can be referenced explicitly rather than appended lists. I have a script where I generate a list of x- and y- (z-, v-, etc) values. Later, I want to interpolate and then print a table of the values at specified points. Here is a MWE. The problem is at line 48:
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
I'm not sure I have the correct syntax for the last two lines either:
table[nn] = ('%.2f' %xq, '%.2f' %yq)
Here is the full script for the MWE:
#This script was written to test how to interpolate after data was created in a loop and stored as a list. Can a list be accessed explicitly like a vector in matlab?
from scipy.interpolate import interp1d
from math import * #for ceil
from astropy.table import Table #for Table
import numpy as np
# define the initial conditions
x = 0 # initial x position
y = 0 # initial y position
Rmax = 10 # maxium range
""" initializing variables for plots"""
x_list = [x]
y_list = [y]
""" define functions"""
# not necessary for this MWE
"""create sample data for MWE"""
# x and y data are calculated using functions and appended to their respective lists
h = 1
t = 0
tf = 10
# Example of interpolation without a loop:
#x = np.linspace(0, 10, num=11, endpoint=True)
#y = np.cos(-x**2/9.0)
#f = interp1d(x, y)
for i in range(N):
x = h*i
y = cos(-x**2/9.0)
""" appends selected data for ability to plot"""
## Interpolation after x- and y-lists are already created
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
table[nn] = ('%.2f' %xq, '%.2f' %yq)
Your help and patience will be greatly appreciated!
Best regards,
Your code has some glaring issues that made it really difficult to understand. Let's first take a look at some things I needed to fix:
for i in range(N):
x = h*1
y = cos(-x**2/9.0)
""" appends selected data for ability to plot"""
You are appending a single value without modifying it. What I presume you wanted is down below.
intervals = 0.5
nfinal = ceil(Rmax/intervals)
NN = nfinal+1 # length of table
dtype = [('Range (units?)', 'f8'), ('Drop? (units)', 'f8')]
table = Table(data=np.zeros(N, dtype=dtype))
for nn in range(NN):#for nn = 1:NN
xq = 0.0 + (nn-1)*intervals #0.0 + (nn-1)*intervals
yq = interp1d(x_list, y_list, xq(nn))#interp1(output1(:,1),output1(:,2),xq(nn))
table[nn] = ('%.2f' %xq, '%.2f' %yq)
This is where things get strange. First: use pandas tables, which are the more popular choice. Second: I have no idea what you are trying to loop over. I presume you wanted to vary the number of points for the interpolation, which is what I have done below. Third: you are trying to interpolate at a single point, when you probably want to interpolate over a range of points. Lastly, you are using the interp1d function incorrectly: it takes the x and y data and returns a function, which you then call with the points you want to evaluate. Please take a look at the code below and let me know what exactly you wanted (specifically: what should xq / xq(nn) be?), because the MRE you provided is quite confusing.
from scipy.interpolate import interp1d
from math import *
import numpy as np
Rmax = 10
h = 1
t = 0
tf = 10
N = ceil(tf/h)
x = np.arange(0,N+1)
y = np.cos(-x**2/9.0)
interval = 0.5
NN = ceil(Rmax/interval) + 1
ip_list = np.arange(1,interval*NN,interval)
xtable = []
ytable = []
for i,nn in enumerate(ip_list):
    f = interp1d(x,y)
    x_i = np.arange(0,nn+interval,interval)
    xtable += [x_i]
    ytable += [f(x_i)]
[print(i) for i in xtable]
[print(i) for i in ytable]
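If the goal was simply a table of interpolated values at fixed spacing, a shorter sketch may be closer to what the question intended. It assumes the lists are filled in a loop as in the question and that the query points run from 0 to Rmax in steps of `intervals`; that reading of xq is an assumption, not something the question confirms.

```python
import numpy as np
from scipy.interpolate import interp1d

Rmax = 10       # maximum range, as in the question
h = 1           # step used to generate the sample data
intervals = 0.5 # spacing of the query points (assumed meaning of "intervals")

# Build the data in a loop and append to plain lists, as the question does.
x_list, y_list = [], []
for i in range(int(Rmax / h) + 1):
    x = h * i
    x_list.append(x)
    y_list.append(np.cos(-x**2 / 9.0))

# interp1d returns a callable; build it once, then evaluate it at all query points.
f = interp1d(x_list, y_list)
xq = np.arange(0.0, Rmax + intervals / 2, intervals)
yq = f(xq)

for xv, yv in zip(xq, yq):
    print(f"{xv:6.2f} {yv:8.2f}")
```

Python lists can be indexed directly (`x_list[3]`), and `interp1d` accepts them as they are, so no conversion to a MATLAB-style vector is needed first.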

faster method for comparing two lists element-wise

I am building a relational DB using python. So far I have two tables, as follows:
>>> df_Patient.columns
[1] Index(['NgrNr', 'FamilieNr', 'DosNr', 'Geslacht', 'FamilieNaam', 'VoorNaam',
'GeboorteDatum', 'PreBirth'],
>>> df_LaboRequest.columns
[2] Index(['RequestId', 'IsComplete', 'NgrNr', 'Type', 'RequestDate', 'IntakeDate',
The two tables are quite big:
>>> df_Patient.shape
[3] (386249, 8)
>>> df_LaboRequest.shape
[4] (342225, 7)
Column NgrNr on df_LaboRequest is a foreign key (FK) that references the column of the same name on df_Patient. To avoid any integrity error, I need to make sure that all the values under df_LaboRequest['NgrNr'] are in df_Patient['NgrNr'].
With list comprehension I tried the following (to pick up the values that would throw an error):
[x for x in list(set(df_LaboRequest['NgrNr'])) if x not in list(set(df_Patient['NgrNr']))]
Though this is taking ages to complete. Would anyone recommend a faster method (method in the general sense of procedure, nothing to do with the Python meaning of method) for such a comparison?
One-liners aren't always better.
Don't check for membership in lists. Why on earth would you create a set (which is the recommended data structure for O(1) membership checks) and then cast it to a list which has O(N) membership checks?
Make the set of df_Patient once outside the list comprehension and use that instead of making the set in every iteration
patients = set(df_Patient['NgrNr'])
lab_requests = set(df_LaboRequest['NgrNr'])
result = [x for x in lab_requests if x not in patients]
Or, if you like to use set operations, simply find the difference of both sets:
result = lab_requests - patients
Alternatively, use the pandas isin() method on the columns themselves.
patients = df_Patient['NgrNr'].drop_duplicates()
lab_requests = df_LaboRequest['NgrNr'].drop_duplicates()
result = lab_requests[~lab_requests.isin(patients)]
Let's test how much faster these changes make the code:
import pandas as pd
import random
import timeit
# Make dummy dataframes of patients and lab_requests
randoms = [random.randint(1, 1000) for _ in range(10000)]
patients = pd.DataFrame("patient{0}".format(x) for x in randoms[:5000])[0]
lab_requests = pd.DataFrame("patient{0}".format(x) for x in randoms[2000:8000])[0]
# Do it your way
def fun1(pat, lr):
    return [x for x in list(set(lr)) if x not in list(set(pat))]
# Do it my way: Set operations
def fun2(pat, lr):
    pat_s = set(pat)
    lr_s = set(lr)
    return lr_s - pat_s
# Or explicitly iterate over the set
def fun3(pat, lr):
    pat_s = set(pat)
    lr_s = set(lr)
    return [x for x in lr_s if x not in pat_s]
# Or using pandas
def fun4(pat, lr):
    pat = pat.drop_duplicates()
    lr = lr.drop_duplicates()
    return lr[~lr.isin(pat)]
# Make sure all 4 functions return the same thing
assert set(fun1(patients, lab_requests)) == set(fun2(patients, lab_requests)) == set(fun3(patients, lab_requests)) == set(fun4(patients, lab_requests))
# Time it
timeit.timeit('fun1(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun1', number=100)
# Output: 48.36615000000165
timeit.timeit('fun2(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun2', number=100)
# Output: 0.10799920000044949
timeit.timeit('fun3(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun3', number=100)
# Output: 0.11038020000069082
timeit.timeit('fun4(patients, lab_requests)', 'from __main__ import patients, lab_requests, fun4', number=100)
# Output: 0.32021789999998873
Looks like we have a ~150x speedup with pandas and a ~450x speedup with set operations!
I don't have pandas installed right now to try this. But you could try removing the list(..) cast. I don't think it provides anything meaningful to the program, and sets are much faster for lookup, e.g. x in set(...), than lists.
You could also try doing this with the pandas API rather than lists and sets; sometimes this is faster. Try searching for unique. Then you could compare the sizes of the two columns and, if they are the same, sort them and do an equality check.
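Another pandas route, when the offending rows themselves are wanted rather than just the key values, is a left merge with `indicator=True` (an anti-join). This is a minimal sketch on made-up data; the small frames below are placeholders, not the real tables.

```python
import pandas as pd

# Tiny stand-ins for df_Patient and df_LaboRequest (made-up values).
patients = pd.DataFrame({"NgrNr": [1, 2, 3, 4]})
requests = pd.DataFrame({"RequestId": [10, 11, 12, 13], "NgrNr": [2, 5, 3, 5]})

# Left-merge requests against the unique patient keys; the _merge column
# says whether each request row found a matching patient.
merged = requests.merge(
    patients[["NgrNr"]].drop_duplicates(),
    on="NgrNr",
    how="left",
    indicator=True,
)

# Rows marked "left_only" reference a patient that does not exist.
orphans = merged.loc[merged["_merge"] == "left_only", ["RequestId", "NgrNr"]]
print(orphans)
```

This keeps the full request rows, which helps when the broken references have to be reported or repaired before the foreign key is enforced.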

Goodness of fit always being zero despite taking random data?

I'm trying to write code that generates random data and computes goodness of fit, but I don't understand why the chi-squared test result is always zero. Is there a fix for this? As an attempted fix, I tried playing around with different types to see whether the initial output changed, and I also tried changing the parameters of the loop in question.
from scipy import stats
import math
import random
import numpy
import scipy
import numpy as np
def Linear_Chi2_Generate(observed_values = [], expected_values = []):
    # !!!!!!! Generation of Data !!!!!!!!!! #
    for i in range(0,12):
        a = random.randint(-10,10)
        b = random.randint(-10,10)
        y = a * (b + i)
    # !!! Array Setup !!!! #
    # ***Had the Array types converted to floats before computing Chi2*** #
    # #
    t_s = 0
    o_v = np.array(observed_values)
    e_v = np.array(expected_values)
    o_v_f = o_v.astype(float)
    e_v_f = o_v.astype(float)
    z_o_e_v_f = zip(o_v.astype(float), e_v.astype(float))
    for i in z_o_e_v_f:
        t_s += [((o_v_f)-(e_v_f))]**2/(e_v_f) # Computs the Chi2 Stat !
    print("Observed Values ", o_v_f)
    print("Expected Values" , e_v_f)
    print("Our goodness of fit for our linear function", stats.chi2.cdf(t_s,df))
    return t_s
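In your original code, e_v_f = o_v.astype(float) made o_v_f and e_v_f end up identical, so every difference was zero. There was also an issue in the for loop: it never used the loop variable and operated on the whole arrays instead. I have edited your code a bit. See whether it does what you are looking for: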
from scipy import stats
import math
import random
import numpy
import scipy
import numpy as np
def Linear_Chi2_Generate(observed_values = [], expected_values = []):
    # !!!!!!! Generation of Data !!!!!!!!!! #
    for i in range(0,12):
        a_o = random.randint(-10,10)
        b_o = random.randint(-10,10)
        y_o = a_o * (b_o + i)
        observed_values.append(y_o)
        # a_e = random.randint(-10,10)
        # b_e = random.randint(-10,10)
        # y_e = a_e * (b_e + i)
        expected_values.append(y_o + 5)
    # !!! Array Setup !!!! #
    # ***Had the Array types converted to floats before computing Chi2*** #
    # #
    t_s = 0
    o_v = np.array(observed_values)
    e_v = np.array(expected_values)
    o_v_f = o_v.astype(float)
    e_v_f = e_v.astype(float)
    z_o_e_v_f = zip(o_v.astype(float), e_v.astype(float))
    for o, e in z_o_e_v_f:
        t_s += (o - e) **2 / e # Computes the Chi2 Stat !
    df = len(o_v_f) - 1 # degrees of freedom; assumed here, since df is not defined in the original
    print("Observed Values ", o_v_f)
    print("Expected Values" , e_v_f)
    print("Our goodness of fit for our linear function", stats.chi2.cdf(t_s,df))
    return t_s
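For comparison, the same statistic can be written without an explicit loop. This is a minimal sketch under assumptions: `chi2_statistic` is a hypothetical helper, the degrees of freedom are taken as the number of points minus one, and the input values are made up. Pearson's statistic needs positive expected values, which the random data above does not guarantee (the expected values there can be zero or negative), so the helper rejects such input.

```python
import numpy as np
from scipy import stats


def chi2_statistic(observed, expected):
    """Pearson chi-squared statistic: sum((O - E)^2 / E) over all points."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    if np.any(expected <= 0):
        raise ValueError("expected values must be positive")
    return float(np.sum((observed - expected) ** 2 / expected))


# Made-up counts, for illustration only.
observed = [10, 20, 30]
expected = [12, 18, 30]

stat = chi2_statistic(observed, expected)
dof = len(observed) - 1              # assumed degrees of freedom
p_value = stats.chi2.sf(stat, dof)   # upper-tail probability
print(stat, p_value)
```

Note that `stats.chi2.cdf`, used in the code above, gives the lower-tail probability; `stats.chi2.sf` gives the upper tail that is usually reported as the p-value.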

How to import files using a for loop with path names in dictionary in Python?

I want to create a dictionary which has all the information needed to import files, parse dates etc. Then I want to use a for loop to import all these files. But after the for loop is finished I'm only left with the last dataset in the dictionary. As if it overwrites them.
I execute the file in the path folder so that's not a problem.
I tried creating a new dictionary where I add each import but that makes it much harder for later when I need to reference them. I want them as separate dataframes in the variable explorer.
Here's the code:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import AutoMinorLocator # for time series visualisation
# Import data
#PATH = r"C:\Users\sherv\OneDrive\Documents\GitHub\Python-Projects\Research Project\Data"
data = {"google":["multiTimeline.csv", "Month"],
"RDPI": ["RealDisposableIncome-2004-1_Present-Mon-US(Grab-30-11-18).csv", "DATE"],
"CPI": ["CPI.csv", "DATE"],
"GDP": ["GDP.csv", "DATE"],
"UE": ["Unemployment_2004_Present_US(Grab-5-12-18).csv", "DATE"],
"SP500": ["S&P500.csv", "Date"],
"IR": ["InterestRate_2004-1-1_Present_US(Grab-5-12-18).csv", "DATE"],
"PPI": ["PPIACO.csv", "DATE"],
"PMI": ["ISM-MAN_PMI.csv", "Date"]}
for dataset in data.keys():
    dataset = pd.read_csv("%s" %(data[dataset][0]), index_col="%s" %(data[dataset][1]), parse_dates=["%s" %(data[dataset][1])])
    dataset = dataset.loc["2004-01-01":"2018-09-01"]
# Visualise
minor_locator = AutoMinorLocator(12)
# Investigating overall trendSS
def google_v_X(Data_col, yName, title):
    fig, ax1 = plt.subplots()
    ax1.set_ylabel('google (%)', color='b')
    ax1.tick_params('y', colors='b')
    ax2 = ax1.twinx()
    ax2.set_ylabel('%s' %(yName), color='r')
    ax2.tick_params('%s' %(yName), colors='r')
    plt.title("Google vs %s trends" %(title))
# Google-CPI
google_v_X(CPI["CPI"], "CPI 1982-1985=100 (%)", "CPI")
# Google-RDPI
google_v_X(RDPI["DSPIC96"], "RDPI ($)", "RDPI")
# Google-GDP
google_v_X(GDP["GDP"], "GDP (B$)", "GDP")
# Google-UE
google_v_X(UE["Value"], "Unemployed persons", "Unemployment")
# Google-SP500
google_v_X(SP500["Close"], "SP500", "SP500")
# Google-PPI
google_v_X(PPI["PPI"], "PPI")
# Google-PMI
google_v_X(PMI["PMI"], "PMI", "PMI")
# Google-IR
google_v_X(IR["FEDFUNDS"], "Fed Funds Rate (%)", "Interest Rate")
I also tried creating a function to read and parse and then use that in a loop like:
def importdata(key, path ,parseCol):
    key = pd.read_csv("%s" %(path), index_col="%s" %(parseCol), parse_dates=["%s" %(parseCol)])
    key = key.loc["2004-01-01":"2018-09-01"]
for dataset in data.keys():
    importdata(dataset, data[dataset][0], data[dataset][0])
But I get an error because it doesn't recognise the path as a string, and it says it's not defined.
How can I get them to not overwrite each other or how can I get python to recognise the input to the function as a string? Any help is appreciated, Thanks
The for loop is referencing the same dataset variable, so each time the loop body runs, the variable is replaced with the newly imported dataset. You need to store the result somewhere, whether that's as a new variable each time or in a dictionary. Try something like this:
googleObj = None
RDPIObj = None
CPIObj = None
data = {"google":[googleObj, "multiTimeline.csv", "Month"],
"RDPI": [RDPIObj,"RealDisposableIncome-2004-1_Present-Mon-US(Grab-30-11-18).csv", "DATE"],
"CPI": [CPIObj, "CPI.csv", "DATE"]}
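for dataset in data.keys():
    obj = pd.read_csv("%s" %(data[dataset][1]), index_col="%s" %(data[dataset][2]), parse_dates=["%s" %(data[dataset][2])])
    data[dataset][0] = obj.loc["2004-01-01":"2018-09-01"]
This way each dataframe is stored in the first slot of its entry, e.g. data["google"][0]. Note that the names googleObj, RDPIObj and CPIObj themselves stay None, because assigning inside the loop replaces the list element rather than rebinding those variables. The downside is that you have to define a slot for each dataset.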
Another option is making a second dictionary like you mentioned, something like this:
data = {"google":["multiTimeline.csv", "Month"],
"RDPI": ["RealDisposableIncome-2004-1_Present-Mon-US(Grab-30-11-18).csv", "DATE"],
"CPI": ["CPI.csv", "DATE"]}
output_data = {}
for dataset_key in data.keys():
    dataset = pd.read_csv("%s" %(data[dataset_key][0]), index_col="%s" %(data[dataset_key][1]), parse_dates=["%s" %(data[dataset_key][1])])
    dataset = dataset.loc["2004-01-01":"2018-09-01"]
    output_data[dataset_key] = dataset
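The second-dictionary approach can be wrapped in a small function so that later code refers to frames["CPI"] instead of a bare CPI variable. This is a minimal sketch, not the poster's code: `load_all` is a hypothetical helper, and in-memory CSV text stands in for the real files so that the example is self-contained.

```python
import io

import pandas as pd


def load_all(spec, start=None, end=None):
    """Read every (source, date_column) pair in spec into a dict of DataFrames."""
    frames = {}
    for name, (source, date_col) in spec.items():
        df = pd.read_csv(source, index_col=date_col, parse_dates=[date_col])
        frames[name] = df.loc[start:end]  # keep only the requested date range
    return frames


# In-memory stand-ins for the CSV files; real code would pass file paths instead.
spec = {
    "CPI": (io.StringIO("DATE,CPI\n2003-12-01,1.0\n2004-01-01,2.0\n2004-02-01,3.0\n"), "DATE"),
    "GDP": (io.StringIO("DATE,GDP\n2004-01-01,10.0\n2019-01-01,20.0\n"), "DATE"),
}

frames = load_all(spec, start="2004-01-01", end="2018-09-01")
print(frames["CPI"])
print(frames["GDP"])
```

A call such as google_v_X(frames["CPI"]["CPI"], ...) then works without creating one module-level variable per dataset.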
Reproducible example (however you should be very careful with using "exec"):
# Generating data
import os
import pandas as pd
df1 = pd.DataFrame([['a',1],['b',2]], index=[0,1], columns=['col1','col2'])
df2 = pd.DataFrame([['c',3],['d',4]], index=[2,3], columns=['col1','col2'])
# Exporting data
df1.to_csv('df1.csv', index_label='Month')
df2.to_csv('df2.csv', index_label='DATE')
# Definition of Loading metadata
loading_metadata = {
# Importing in accordance with loading_metadata (careful with the indentation inside the exec string)
for dataset in loading_metadata.keys():
    print(dataset, loading_metadata[dataset][0], loading_metadata[dataset][1])
    exec("""
{0} = pd.read_csv('{1}', index_col='{2}').rename_axis('')
""".format(dataset, loading_metadata[dataset][0], loading_metadata[dataset][1]))
Exported data (df1.csv):
Exported data (df2.csv):
Loaded data:
col1 col2
0 a 1
1 b 2
col1 col2
2 c 3
3 d 4
