I have a helper function that uses pandas DataFrame multiple times, but I don't need any other pandas functions.
Is it better to import the whole library and call pd.DataFrame for consistency, or just import DataFrame?
My function will be called over 100,000 times and returns a dictionary.
import pandas as pd
temp_df = pd.DataFrame()
VS.
from pandas import DataFrame
temp_df = DataFrame()
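For reference, a quick sketch (not from the original post) showing that both forms bind the exact same class object, so the choice is mostly one of namespace style; pandas itself is imported and cached only once either way:

import sys

import pandas as pd
from pandas import DataFrame

# Both names point at the same class object; no copying is involved.
print(pd.DataFrame is DataFrame)   # True

# After the first import, the module lives in sys.modules, so repeated
# imports or attribute lookups are cheap compared to building the frame.
print('pandas' in sys.modules)     # True

temp_df = DataFrame({'a': [1, 2, 3]})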
I'm using Databricks. For my data I created a Delta Lake. Then I tried to modify a column using the pandas API, but for some reason the following error message pops up:
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
I use the following code to rewrite data in the table:
df_new = spark.read.format('delta').load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/{delta_name}")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import *
from pyspark.pandas.config import set_option
import pyspark.pandas as ps
%matplotlib inline
win_len = 5000
# For this be sure you have runtime 1.11 or earlier version
df_new = df_new.pandas_api()
print('Creating Average active power for U1 and V1...')
df_new['p_avg1'] = df_new.Current1.mul(df_new['Voltage1']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U2 and V2...')
df_new['p_avg2'] = df_new.Current2.mul(df_new['Voltage2']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U3 and V3...')
df_new['p_avg3'] = df_new.Current3.mul(df_new['Voltage3']).rolling(min_periods=1, window=win_len).mean()
print('Creating Average active power for U4 and V4...')
df_new['p_avg4'] = df_new.Current4.mul(df_new['Voltage4']).rolling(min_periods=1, window=win_len).mean()
print('Converting to Spark dataframe')
df_new = df_new.to_spark()
print('Complete')
Previously there was no problem with the pandas API; I'm using the latest Runtime 11.2. Only one dataframe was loaded while I was using the cluster.
Thank you in advance.
The error message is suggesting this: In order to allow this operation, enable 'compute.ops_on_diff_frames' option
Here's how to enable this option per the docs:
import pyspark.pandas as ps
ps.set_option('compute.ops_on_diff_frames', True)
The docs have this important warning:
Pandas API on Spark disallows the operations on different DataFrames (or Series) by default to prevent expensive operations. It internally performs a join operation which can be expensive in general.
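If you'd rather not flip the option globally, a minimal sketch (assuming pyspark.pandas' option_context helper; not part of the original answer) scopes it to just the block that needs it:

import pyspark.pandas as ps

# df_new and win_len are the objects defined in the question above.
# Enable operations across different frames only inside this block,
# then automatically restore the default afterwards.
with ps.option_context('compute.ops_on_diff_frames', True):
    df_new['p_avg1'] = (df_new.Current1.mul(df_new['Voltage1'])
                        .rolling(min_periods=1, window=win_len)
                        .mean())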
I am trying to run the script below to add two columns to the left of a file; however, it keeps giving me
ValueError: header must be integer or list of integers
Below is my code:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv",header='true')
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")

add_col(df)
Also, is there any way to make the Creation_DT column a string?
According to the pandas docs, header is the row number(s) to use as the column names and the start of the data, and it must be an int or a list of ints. So you have to pass header=0 to the read_csv method (or omit the argument entirely, since the header row is inferred by default).
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Also, read_csv already returns a DataFrame, so you don't need to wrap the result in pd.DataFrame. Just use
df = pd.read_csv("/home/ex.csv", header=0)
You can try:
import pandas as pd
import numpy as np
read_file = pd.read_csv("/home/ex.csv")
df=pd.DataFrame(read_file)
def add_col(x):
    df.insert(loc=0, column='Creation_DT', value=str(pd.to_datetime('today')))
    df.insert(loc=1, column='Creation_By', value="Sean")
    df.to_parquet("/home/sample.parquet")

add_col(df)
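If you want control over the exact string format rather than str()'s full default representation, a small variant (a sketch, not from the original answer) uses strftime:

import pandas as pd

df = pd.read_csv("/home/ex.csv")  # same input file as above

# Format today's timestamp explicitly instead of relying on str(pd.to_datetime('today')).
df.insert(loc=0, column='Creation_DT',
          value=pd.Timestamp.today().strftime('%Y-%m-%d %H:%M:%S'))
df.insert(loc=1, column='Creation_By', value="Sean")
df.to_parquet("/home/sample.parquet")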
I am looking for the best way to compute many dask delayed objects stored in a dataframe. I am unsure whether the pandas dataframe should be converted to a dask dataframe with the delayed objects inside, or whether compute should be called on all values of the pandas dataframe.
I would appreciate any suggestions in general, as I am having trouble with the logic of passing delayed objects across nested for loops.
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from dask import delayed, compute
steps = 5
sample = [int(x) for x in np.linspace(5, 100, num=steps)]
enr_df = pd.DataFrame()
for N in sample:
    enr = []
    for i in range(20):
        k = np.random.randint(1, 200)
        enr.append(delayed(hypergeom.sf)(k=k, M=10000, n=20, N=N, loc=0))
    enr_df[N] = enr
I cannot call compute on this dataframe without applying the function across all cells, like so: enr_df.applymap(compute) (which I believe calls compute on each value individually).
However, if I convert it to a dask dataframe, the delayed objects I want to compute end up buried inside the dask dataframe structure:
import dask.dataframe as dd

enr_dd = dd.from_pandas(enr_df, npartitions=1)
enr_dd.compute()
and the computation does not produce the output I expect.
You can pass a list of delayed objects into dask.compute
results = dask.compute(*list_of_delayed_objects)
So you need to get a list from your Pandas dataframe. This is something you can do with normal Python code.
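For example, a minimal sketch (building on the enr_df of delayed objects from the question; not part of the original answer) that flattens the frame into a list, computes everything in one call, and reshapes the results back:

import dask
import numpy as np
import pandas as pd

# enr_df is the pandas DataFrame of delayed objects built in the question.
# Flatten it into a plain Python list of delayed objects...
delayed_objects = enr_df.to_numpy().ravel().tolist()

# ...compute them all in a single call (one task graph, one scheduler pass)...
results = dask.compute(*delayed_objects)

# ...and reshape the results back into a DataFrame with the original layout.
result_df = pd.DataFrame(np.array(results).reshape(enr_df.shape),
                         columns=enr_df.columns)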
I create an RDD of pandas DataFrames as an intermediate result. I want to convert it to a Spark DataFrame and eventually save it to a parquet file.
I want to know what the most efficient way is.
Thanks
def create_df(x):
    return pd.DataFrame(np.random.rand(5, 3)).assign(col=x)

sc.parallelize(range(5)).map(create_df) \
    .TO_DATAFRAME().write.format("parquet").save("parquet_file")
I have tried using pd.concat to reduce the RDD to one big dataframe, but that doesn't seem right.
Talking of efficiency: since Spark 2.3, Apache Arrow has been integrated with Spark, and it is supposed to efficiently transfer data between the JVM and Python processes, enhancing the performance of the conversion from a pandas dataframe to a Spark dataframe. You can enable it with
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
If your Spark distribution doesn't have Arrow integrated, this should not throw an error; it will just be ignored.
A sample code to be run at pyspark shell can be like below:
import numpy as np
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = pd.DataFrame(np.random.rand(100, 3))
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')
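Note that on Spark 3.0 and later this configuration key was renamed; if the flag above seems to have no effect on a 3.x cluster, the equivalent setting is:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")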
Your create_df method returns a pandas dataframe, and from that you can create a Spark dataframe directly - I'm not sure why you need sc.parallelize(range(5)).map(create_df).
So your full code can be like:
import pandas as pd
import numpy as np
def create_df(x):
    return pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
pdf = create_df(10)
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')
import pandas as pd
import numpy as np

def create_df(x):
    df = pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
    return df.values.tolist()

sc.parallelize(range(5)).flatMap(create_df).toDF() \
    .write.format("parquet").save("parquet_file")
How can I read a standard LabVIEW-generated TDMS file using Python?
For the benefit of the community, posting the sample code I have used to efficiently read a *.tdms file into a pandas dataframe. After multiple trials, I simplified the code for ease of use and documentation.
#import required libraries
from nptdms import TdmsFile
import numpy as np
import pandas as pd
#bokeh plots
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook
#load the tdms file
tdms_file = TdmsFile("/Volumes/Data/dummy/sample.tdms")
#split all the tdms grouped channels to a separate dataframe
#tdms_file.as_dataframe()
for group in tdms_file.groups():
    grp1_data = tdms_file.object('grp1').as_dataframe()
    grp2_data = tdms_file.object('grp2').as_dataframe()
#plot the data on bokeh plots
# Use Bokeh chart to make plot
p = bokeh.charts.Line(grp1_data, x='time', y='values', color='parameter', xlabel='time (h)', ylabel='values')
# Display it
bokeh.io.show(p)
Suggestions and improvements are welcome.
For clarity, I would further simplify the answer by Sundar to:
from nptdms import TdmsFile
tdms_file = TdmsFile(r"path_to_.tdms")
for group in tdms_file.groups():
    df = tdms_file.object(group).as_dataframe()
    print(df.head())
    print(df.keys())
    print(df.shape)
That will read the different groups of the tdms into pandas dataframes.
This worked for me:
import pandas as pd
from nptdms import TdmsFile
tdms_file = TdmsFile("path/to/tdms_file.tdms")
df = tdms_file['group'].as_dataframe()
print(df.head())
print(df.keys())
print(df.shape)
npTDMS version 1.1.0, at least, didn't have the object method on TdmsFile that was used in the previous examples here.
Combination of the answers given by Joris and ax7ster -- for npTDMS v1.3.1.
import nptdms
from nptdms import TdmsFile
print(nptdms.__version__)
fn = 'foo.tdms'
tdms_file = TdmsFile(fn)
for group in tdms_file.groups():
    df = group.as_dataframe()
    print(group.name)
    print(df.head())
    print(df.keys())
    print(df.shape)
This reads all the groups in the TDMS file and doesn't require the group names to be known beforehand.
It is also possible to convert the whole TDMS file into one DataFrame; see the example below.
from nptdms import TdmsFile
fn = 'foo.tdms'
tdms_file = TdmsFile(fn)
df = tdms_file.as_dataframe()
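As a small extension (not from the original answers, and assuming the npTDMS 1.x as_dataframe signature with a time_index flag), you can also ask for the waveform time track to be used as the index:

from nptdms import TdmsFile

tdms_file = TdmsFile('foo.tdms')
# time_index=True builds the DataFrame index from each channel's time track.
df = tdms_file.as_dataframe(time_index=True)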