I have started learning PySpark.
As an experiment, I was testing whether I can use Google Drive as a source for streaming data.
I will add CSV files one by one, and the code will monitor the folder and produce an aggregation based on them.
Here is my code:
from google.colab import drive
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
df = spark.readStream.format("csv").schema(schema).option("header", True).option("sep", ",").load("/content/drive/My Drive/Pyspark/")
# df.show()
I want to display the output of the aggregation in Colab.
But it's not displaying any output.
Can anyone suggest a solution?
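One approach that may work here: a streaming DataFrame cannot be shown with df.show(), so the aggregation has to be started with writeStream. The sketch below writes the running aggregation to Spark's in-memory sink and prints snapshots of it with spark.sql, which does show up in a notebook cell. The helper name, the group_col parameter, the count aggregation, and the table name agg_counts are all assumptions for illustration, not part of the original code.

```python
import time


def show_streaming_aggregation(spark, source_dir, schema, group_col,
                               table_name="agg_counts", seconds=30, interval=5):
    """Start a streaming count per group and print snapshots of the result.

    Hypothetical helper: `group_col`, `table_name`, `seconds` and `interval`
    are illustrative parameters, not taken from the original post.
    """
    stream_df = (
        spark.readStream.format("csv")
        .schema(schema)            # streaming file sources need an explicit schema
        .option("header", True)
        .load(source_dir)
    )
    counts = stream_df.groupBy(group_col).count()

    # The memory sink keeps the latest result in an in-memory table
    # that can be queried by the name given to queryName().
    query = (
        counts.writeStream
        .outputMode("complete")    # emit the full aggregate on every trigger
        .format("memory")
        .queryName(table_name)
        .start()
    )
    try:
        deadline = time.time() + seconds
        while time.time() < deadline:
            spark.sql("SELECT * FROM {}".format(table_name)).show()
            time.sleep(interval)
    finally:
        query.stop()               # always stop the background query
```

A call would look like show_streaming_aggregation(spark, "/content/drive/My Drive/Pyspark/", schema, "some_column"), where some_column is a placeholder for a column in your schema. The memory sink holds the whole result on the driver, so it is only suitable for small test outputs like this one.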
I want to check if a delta table in an s3 bucket is actually a delta table. I am trying to do this with:
from delta import *
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder\
if DeltaTable.isDeltaTable(spark, "s3a://landing-zone/table_name/year=2022/month=2/part-0000-xyz.snappy.parquet"):
This code runs forever without returning any result. I tested it with a local delta table and there it works. When I trim the path URL so it stops after the actual table name, the code shows the same behavior. I also created a boto3 client, and I can see the bucket list when calling s3.list_buckets(). Do I need to pass the client somehow into the if statement?
Thanks a lot in advance!
I am an idiot, I forgot that it is not enough to just create a boto3 client, but I also have to make the actual connection to S3 via
I am trying to load a few parquet files from a directory into Python for tensorflow/pytorch.
The files are too large to be loaded through the pyarrow.parquet functions:
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('dir')
table = dataset.read()
This gives an out-of-memory error.
I have also tried using petastorm, but make_reader() doesn't work because the dataset isn't of the petastorm type.
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset
with make_batch_reader('dir') as reader:
    dataset = make_petastorm_dataset(reader)
When I used make_batch_reader() and then make_petastorm_dataset(reader), it again gave a "zip not iterable" error or something along those lines.
I am not sure how to load the files into Python for ML training.
Some quick help would be greatly appreciated.
For pyarrow, you can list the directory with Python, iterate over *.parquet files, open each one as pq.ParquetFile, and read it one row group at a time. This will alleviate the memory pressure, but won't be super fast without parallelization.
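A minimal sketch of the row-group approach described above, assuming the files sit directly in one directory and end in .parquet. The function name iter_row_groups and its columns parameter are hypothetical; pq.ParquetFile, num_row_groups and read_row_group are the pyarrow calls it relies on.

```python
import glob
import os


def iter_row_groups(directory, columns=None):
    """Yield one pyarrow.Table per row group for each *.parquet file in `directory`."""
    # Imported lazily so the helper can be defined where pyarrow is not installed.
    import pyarrow.parquet as pq

    for path in sorted(glob.glob(os.path.join(directory, "*.parquet"))):
        parquet_file = pq.ParquetFile(path)
        for index in range(parquet_file.num_row_groups):
            # Only this single row group is materialized in memory.
            yield parquet_file.read_row_group(index, columns=columns)
```

Each yielded table can be converted with table.to_pandas() and fed to the training loop, so peak memory is bounded by the largest row group rather than by the whole dataset. Passing a list of column names in columns reduces memory further.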
For petastorm, you are right to use make_batch_reader(). Indeed, the error messages are not always helpful, but you can inspect the stack trace and investigate where in the petastorm code the error originates.
You can load the entire dataset with dask using the code below.
You can also load only chunks of the data when needed, by computing only those rows using the index (assuming the files have distinct index values).
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob
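def load_chunk(pth):
    # Read one parquet file into pandas and drop columns that are not needed
    x = ParquetFile(pth).to_pandas()
    x = x.drop('[unwanted_columns_to_save_space]', axis=1)
    return x
files = glob.glob('./your_path/*.parquet')
# Wrap each call in delayed() so the files are loaded lazily, one partition per file
ddf = dd.from_delayed([delayed(load_chunk)(f) for f in files])
# compute() materializes the whole dataset in memory as a pandas DataFrame
df = ddf.compute()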
I have got a file in HDFS (/user/username/Project/data/file.xlsx) that I want to read into a DataFrame. (I do not care if it is a PySpark DataFrame or Pandas, but Pandas is preferred.)
I am using a Zeppelin notebook to write my code.
Is it possible to get data from this file?
I have already tried the following commands, but none of them worked:
df = pd.read_excel("/user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs:///user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs://user/username/Project/data/file.xlsx")
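I don't think you can read files stored in HDFS directly with pandas.
You probably have to either:
load the file into Spark, then use toPandas() (this needs an Excel data source such as the spark-excel package, since Spark has no built-in Excel reader)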
df = spark.read.format("excel").load("hdfs:xxx").toPandas()
use some alternative to enable pandas to read directly, as described here
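As one possible form of the second option, the sketch below fetches the file's bytes with the hdfs command-line client and hands them to pandas through an in-memory buffer. It assumes the hdfs CLI is on the PATH of the machine running the Python interpreter and that an Excel engine such as openpyxl is installed. The function names and the cat_command parameter are hypothetical.

```python
import io
import subprocess


def read_hdfs_bytes(hdfs_path, cat_command=("hdfs", "dfs", "-cat")):
    """Return the raw bytes printed by running `<cat_command> <hdfs_path>`."""
    result = subprocess.run(
        [*cat_command, hdfs_path],
        check=True,                # raise CalledProcessError if the command fails
        stdout=subprocess.PIPE,
    )
    return result.stdout


def read_hdfs_excel(hdfs_path, **read_excel_kwargs):
    """Load an Excel file from HDFS into a pandas DataFrame (held fully in memory)."""
    import pandas as pd  # imported lazily; .xlsx files need an engine such as openpyxl

    return pd.read_excel(io.BytesIO(read_hdfs_bytes(hdfs_path)), **read_excel_kwargs)
```

Usage would be df = read_hdfs_excel("/user/username/Project/data/file.xlsx"). The whole file passes through memory, so this only suits files that fit comfortably on the interpreter host.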
It seems that, in the Python interpreter in Apache Zeppelin, importing and exporting data can only be done through pd.read_csv and to_csv.
I am trying to run the following code in databricks in order to call a spark session and use it to open a csv file:
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError: name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow the following example (you will understand better if you watch it from 17:30 on):
I got it working by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking into the pyspark code, as I found that reading CSV was working in the interactive shell.
Please note that the example code you are using is for Spark version 2.x.
"spark" and "SparkSession" are not available on Spark 1.x. The error messages you are getting point to a possible version issue (Spark 1.x).
Check the Spark version you are using.
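A small sketch of that check: SparkSession was introduced in Spark 2.0, so the major version decides whether the import can work. The helper name supports_spark_session is hypothetical. The version string itself is available as sc.version on the SparkContext, which exists in Spark 1.x as well.

```python
def supports_spark_session(version):
    """Return True if a Spark version string such as '2.4.5' is 2.0 or newer."""
    major = int(version.split(".")[0])
    return major >= 2


# In a notebook where `sc` (the SparkContext) is predefined, one could run:
#     print(sc.version, supports_spark_session(sc.version))
```

If this returns False, `spark` and SparkSession are not available and the example has to be run on a newer Spark version or adapted to the 1.x API.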
I have a large dataset stored in a BigQuery table and I would like to load it into a pyspark RDD for ETL data processing.
I realized that BigQuery supports the Hadoop Input / Output format,
and pyspark should be able to use this interface to create an RDD with the method "newAPIHadoopRDD".
Unfortunately, the documentation on both ends seems scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Has anybody figured out how to do this?
Google now has an example on how to use the BigQuery connector with Spark.
There does seem to be a problem using the GsonBigQueryInputFormat, but I got a simple Shakespeare word counting example working:
import json
import pyspark
sc = pyspark.SparkContext()
conf = {
    "mapred.bq.project.id": "<project_id>",
    "mapred.bq.gcs.bucket": "<bucket>",
    "mapred.bq.input.project.id": "publicdata",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare"
}
tableData = (
    sc.newAPIHadoopRDD(
        "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "com.google.gson.JsonObject",
        conf=conf
    )
    .map(lambda k: json.loads(k[1]))
    .map(lambda x: (x["word"], int(x["word_count"])))
    .reduceByKey(lambda x, y: x + y)
)
print(tableData.take(10))
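To make clear what the map and reduceByKey chain computes, here is a plain-Python equivalent that runs without Spark or BigQuery. The function name and the sample records are made up for illustration. They only mimic the shape the code above expects: JSON text with a "word" field and a "word_count" field that may arrive as a string.

```python
import json
from collections import defaultdict


def total_word_counts(json_lines):
    """Sum word_count per word, mirroring map -> map -> reduceByKey."""
    totals = defaultdict(int)
    for line in json_lines:
        record = json.loads(line)               # first map: parse the JSON text
        word = record["word"]                   # second map: build (word, count)
        count = int(record["word_count"])
        totals[word] += count                   # reduceByKey(lambda x, y: x + y)
    return dict(totals)


# Hypothetical records, not real rows from the shakespeare table.
sample = [
    '{"word": "king", "word_count": "3"}',
    '{"word": "queen", "word_count": "2"}',
    '{"word": "king", "word_count": "4"}',
]
print(total_word_counts(sample))
```

The same word can appear in several rows, which is why the counts are summed per key instead of being taken as they are.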