I am doing a data analysis task. With this Python script I can get my desired results, but it is very slow, probably because of the for loop, and I have to handle millions of rows. Is there a way to make this script faster?
import numpy as np
import pandas as pd

result = []

df = df.sort_values(by='ts')
df = df.set_index(pd.DatetimeIndex(df['ts']))
df = df.rename(columns={'ts': 'Time'})

# daily bins (note: 'base' is deprecated in newer pandas; 'offset' is the replacement)
x2 = df.groupby(pd.Grouper(freq='1D', base=30, label='right'))

for name, df1 in x2:
    # split each day's rows into two halves
    df1_split = np.array_split(df1, 2)
    df_first = df1_split[0]
    df_second = df1_split[1]
    length_1 = len(df_first)
    length_2 = len(df_second)
    if len(df_first) >= 5000:
        df_first_diff_max = abs(df_first['A'].diff(periods=1)).max()
        if df_first_diff_max <= 10:
            time_first = df_first['Time'].values[0]
            time_first = pd.DataFrame([time_first], columns=['start_time'])
            time_first['End_Time'] = df_first['Time'].values[-1]
            time_first['flag'] = 1
            time_first['mean_B'] = np.mean(df_first['B'])
            time_first['mean_C'] = np.mean(df_first['C'])
            time_first['mean_D'] = np.mean(df_first['D'])
            time_first['E'] = df_first['E'].values[0]
            time_first['F'] = df_first['F'].values[0]
            result.append(time_first)
    if len(df_second) >= 5000:
        df_second_diff_max = abs(df_second['A'].diff(periods=1)).max()
        if df_second_diff_max <= 10:
            time_first = df_second['Time'].values[0]
            time_first = pd.DataFrame([time_first], columns=['start_time'])
            time_first['End_Time'] = df_second['Time'].values[-1]
            time_first['flag'] = 2
            time_first['mean_B'] = np.mean(df_second['B'])
            time_first['mean_C'] = np.mean(df_second['C'])
            time_first['mean_D'] = np.mean(df_second['D'])
            time_first['E'] = df_second['E'].values[0]
            time_first['F'] = df_second['F'].values[0]
            result.append(time_first)

final = pd.concat(result)
If you want to handle millions of rows, maybe you should try Hadoop or Spark, if you have enough resources.
I think analyzing that amount of data on a single node is a bit crazy.
If you are willing to try something different with Pandas, you could try vectorization. Here is a link to a quick overview of the time it takes to iterate over a set of data in different ways; a sketch of a vectorized version of your loop follows the timings below. It looks like NumPy has the most efficient vectorization method, but the internal Pandas one might work for you as well.
https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
The Pandas Built-In Function: iterrows() — 321 times faster
The apply() Method — 811 times faster
Pandas Vectorization — 9280 times faster
Numpy Vectorization — 71,803 times faster
(All according to timing the operations on a dataframe with 65 columns and 1140 rows)
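Whether this pays off depends on your data, but as an illustration, here is a rough sketch of what an aggregated version of your loop could look like, assuming pandas 0.25+ (for named aggregation) and keeping your grouping, column names and thresholds. The max_jump lambda is the one part that is not fully vectorized, but the per-group DataFrame construction is gone.

import numpy as np
import pandas as pd

tmp = df.copy()
grp = tmp.groupby(pd.Grouper(freq='1D', base=30, label='right'))  # same daily bins as your loop
tmp['day'] = grp.ngroup()

# label each row as first or second half of its day, mimicking np.array_split(df1, 2)
size = tmp.groupby('day')['day'].transform('size')
pos = tmp.groupby('day').cumcount()
tmp['flag'] = np.where(pos < np.ceil(size / 2), 1, 2)

agg = (tmp.groupby(['day', 'flag'])
          .agg(start_time=('Time', 'first'),
               End_Time=('Time', 'last'),
               mean_B=('B', 'mean'),
               mean_C=('C', 'mean'),
               mean_D=('D', 'mean'),
               E=('E', 'first'),
               F=('F', 'first'),
               n=('A', 'size'),
               max_jump=('A', lambda s: s.diff().abs().max())))

# keep only the halves that satisfy your size and jump conditions
final = (agg[(agg['n'] >= 5000) & (agg['max_jump'] <= 10)]
            .reset_index()
            .drop(columns=['day', 'n', 'max_jump']))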
Related
I'm trying to work on a dataset with 510,000 rows and 636 columns. I loaded it into a dataframe using the dask dataframe method, but the entries can't be displayed. When I try to get the shape, it results in delays. Is there a way for me to analyze the whole dataset without using big data technologies like PySpark?
from dask import dataframe
import requests
import zipfile
import os
import pandas as pd
if os.path.exists('pisa2012.zip') == False:
    r = requests.get('https://s3.amazonaws.com/udacity-hosted-downloads/ud507/pisa2012.csv.zip', allow_redirects=True)
    open('pisa2012.zip', 'wb').write(r.content)

if os.path.exists('pisa2012.csv') == False:
    with zipfile.ZipFile('pisa2012.zip', 'r') as zip_ref:
        zip_ref.extractall('./')

df_pisa = dataframe.read_csv('pisa2012.csv')
df_pisa.shape  # Output: (Delayed('int-e9d8366d-1b9e-4f8e-a83a-1d4cac510621'), 636)
Firstly, Spark, Dask and Vaex are all "big data" technologies.
it results in delays
If you read the documentation, you will see that Dask is lazy and only performs operations on demand; you have to ask for the result explicitly. The reason is that just getting the shape requires reading all the data, but the data will not be held in memory - that is the whole point, and the feature that lets you work with bigger-than-memory data (otherwise, just use pandas).
This works:
df_pisa.shape.compute()
But better, figure out what you actually want to do with the data; I assume you are not just after the shape. You can put multiple operations/delayed objects into a single dask.compute() call to do them at once and avoid repeating expensive tasks like reading/parsing the file.
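For example, a minimal sketch (the column name is just a placeholder for a real numeric column in your data):

import dask

# one pass over the CSV for both results, instead of two separate .compute() calls
nrows, col_mean = dask.compute(df_pisa.shape[0], df_pisa['SOME_NUMERIC_COLUMN'].mean())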
I have a function that reads large csv files using dask dataframe and then converts them to a pandas dataframe, which takes quite a lot of time. The code is:
import os
from glob import glob

import dask.dataframe as dd

def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest files
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])
latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]

Tea2Array_latest = t_createdd(latest_Tea2Array)

# keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]

P1MI3 = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168566]
P1MI3 = P1MI3.compute()

P1MJC_main = Tea2Array.loc[Tea2Array['Parameter_Id'] == 168577]
P1MJC_old = P1MJC_main.compute()
P1MI3=P1MI3.compute() and P1MJC_old=P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may be from several threads, but you only have one disk interface bottleneck, and it likely performs much better reading sequentially than trying to read several files in parallel
reading CSVs is CPU-heavy and needs the Python GIL. The multiple threads will not actually be running in parallel
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
you call compute twice, but could have combined them: Dask works hard to evict data from memory that is not currently needed by any computation, so calling compute separately means the expensive work is done twice. Passing both outputs to a single compute call would roughly halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL using processes, e.g., Dask's distributed scheduler
if you only need certain columns, do not bother to load everything and then subselect; include those columns right in the read_csv call (the usecols argument), saving a lot of time and memory (true for pandas or Dask) - see the sketch after the snippet below.
To compute both lazy things at once:
dask.compute(P1MI3, P1MJC_main)
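Putting the remarks together, a rough sketch (keeping your variable names and assuming the column really is called Parameter_Id) might look like this:

import dask
import dask.dataframe as dd

# read only the columns that are needed, then materialise both selections in one pass
cols = ['Parameter_Id', 'Reading_Id', 'X', 'Value']
Tea2Array = dd.read_csv(latest_Tea2Array, sep=chr(1), encoding="utf-16", usecols=cols)

P1MI3, P1MJC_old = dask.compute(
    Tea2Array[Tea2Array['Parameter_Id'] == 168566],
    Tea2Array[Tea2Array['Parameter_Id'] == 168577],
)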
I am trying to parallelize a pandas operation that splits a dataframe column containing comma-separated values into 2 columns. The normal pandas operation takes around 5 seconds on my Python instance, using df.str.split directly on that column. My dataframe contains 2 million rows, so I am trying to bring down the running time.
As a first approach to parallelizing, I am using Python's multiprocessing library, creating a pool with as many workers as there are CPU cores on my instance. For a second approach to the same problem, I am using the concurrent.futures library with a chunksize of 4.
However, I see that the multiprocessing library takes around the same time as the normal pandas operation (5 seconds), whereas concurrent.futures takes more than a minute to run the same line.
1) Does Google Compute Engine support these Python multiprocessing libraries?
2) Why is the parallel processing not working on GCP?
Thanks in advance. Below is the sample code:
import pandas as pd
from multiprocessing import Pool

def split(e):
    return e.split(",")

df = pd.DataFrame({'XYZ': ['CAT,DOG',
                           'CAT,DOG', 'CAT,DOG']})

pool = Pool(4)
df_new = pd.DataFrame(pool.map(split, df['XYZ']), columns=['a', 'b'])
df_new = pd.concat([df, df_new], axis=1)
The above code takes about the same time as the code below, which is a normal pandas operation using only one core:
df['a'], df['b'] = df['XYZ'].str.split(',',1).str
Using concurrent.futures:
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as pool:
    a = pd.DataFrame(pool.map(split, df['XYZ'], chunksize=4),
                     columns=['a', 'b'])

print(a)
The above code using concurrent.futures takes more than a minute to run on GCP. Please note that the code I have posted is just sample code; the dataframe I am using in the project has 2 million such rows. Any help would be really appreciated!
Why did you choose chunksize=4? This is pretty small; for 2 million rows it breaks the work into 500,000 separate tasks. The splitting itself might ideally take a quarter of the time, but the additional overhead would likely make this take longer than a single-threaded approach.
I'd recommend using a much larger chunksize. Anywhere from 10,000 to 200,000 might be appropriate, but you should tweak this based on some experimentation with the results you get.
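For example, reusing the code from the question (the chunksize value here is just a starting point to experiment with):

import concurrent.futures
import pandas as pd

def split(e):
    return e.split(",")

# same approach as above, but each worker now receives a large batch of rows
# per task, so the per-task overhead is amortised
with concurrent.futures.ProcessPoolExecutor() as pool:
    a = pd.DataFrame(pool.map(split, df['XYZ'], chunksize=100_000),
                     columns=['a', 'b'])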
I have a Spark DataFrame where all fields are integer type. I need to count how many individual cells are greater than 0.
I am running locally and have a DataFrame with 17,000 rows and 450 columns.
I have tried two methods, both yielding slow results:
Version 1:
(for (c <- df.columns) yield df.where(s"$c > 0").count).sum
Version 2:
df.columns.map(c => df.filter(df(c) > 0).count)
This calculation takes 80 seconds of wall clock time. With Python Pandas, it takes a fraction of a second. I am aware that for small data sets and local operation, Python may perform better, but this seems extreme.
Trying to make a Spark-to-Spark comparison, I find that running MLlib's PCA algorithm on the same data (converted to a RowMatrix) takes less than 2 seconds!
Is there a more efficient implementation I should be using?
If not, how is the seemingly much more complex PCA calculation so much faster?
What to do
import org.apache.spark.sql.functions.{col, count, when}
df.select(df.columns map (c => count(when(col(c) > 0, 1)) as c): _*)
Why
Both of your attempts create a number of jobs proportional to the number of columns. Computing the execution plan and scheduling the jobs alone are expensive and add significant overhead depending on the amount of data.
Furthermore, the data might be loaded from disk and/or parsed each time a job is executed, unless it is fully cached with a significant memory safety margin that ensures the cached data will not be evicted.
This means that, in the worst-case scenario, the nested-loop-like structure you use can be roughly quadratic in the number of columns.
The code shown above handles all columns at the same time, requiring only a single data scan.
The problem with your approach is that the file is scanned for every column (unless you have cached it in memory). The fastest way with a single FileScan should be:
import org.apache.spark.sql.functions.{explode, array}

val cnt: Long = df
  .select(
    explode(
      array(df.columns.head, df.columns.tail: _*)
    ).as("cell")
  )
  .where($"cell" > 0).count
Still, I think it will be slower than Pandas, as Spark has a certain overhead due to its parallelization engine.
I'm trying to use Dask to read a folder of very large csv files (they're extremely large, but they all fit in memory since I have a lot of RAM); my current solution looks like:
val = 'abc'
df = dd.read_csv('/home/ubuntu/files-*', parse_dates=['date'])
# 1 - df_pd = df.compute(get=dask.multiprocessing.get)
ddf_selected = df.map_partitions(lambda x: x[x['val_col'] == val])
# 2 - ddf_selected.compute(get=dask.multiprocessing.get)
Is 1 (and then using pandas) or 2 better? Just trying to get a sense of what to do?
You can also just do the following:
ddf_selected = ddf[ddf['val_col'] == val]
In terms of which is better it depends strongly on the operation. For large datasets that don't require in-memory shuffles dask.dataframe will likely perform better. For random access or full sorts pandas will likely perform better.
You may not want to use the multiprocessing scheduler. Generally for Pandas we recommend either the threaded or distributed schedulers.
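For example, a minimal sketch assuming a reasonably recent Dask, where the scheduler is chosen with the scheduler= keyword rather than the older get= argument:

import dask.dataframe as dd

val = 'abc'
df = dd.read_csv('/home/ubuntu/files-*', parse_dates=['date'])
ddf_selected = df[df['val_col'] == val]

# threaded scheduler (the default for dask.dataframe)
result = ddf_selected.compute(scheduler='threads')

# or, with the distributed scheduler running locally:
# from dask.distributed import Client
# client = Client()            # starts a local cluster; workers are separate processes
# result = ddf_selected.compute()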