Load data in pyspark where csv file contains bullets - apache-spark

Hi, I have to load data into Hive, and I am using pyspark for data processing.
I am getting a csv file from upstream which has bullet characters in it. When I do dataframe.show(), I don't see the bullets; I see ??? instead.
Can someone suggest how I can load this data? I am OK with replacing the bullets with a space or blank.
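In case it helps, here is a minimal sketch of one way to do it: read the CSV with an explicit encoding and then strip the bullet character from the string columns. The file path, the encoding name, and the assumption that the bullets are U+2022 are all placeholders to adjust for the actual feed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read with an explicit encoding; if ??? still shows up, the file is
# probably not UTF-8, so try e.g. "windows-1252" instead.
df = (spark.read
      .option("header", True)
      .option("encoding", "UTF-8")
      .csv("/path/to/upstream_file.csv"))

# Replace the bullet character (U+2022) with a space in every string column.
for name, dtype in df.dtypes:
    if dtype == "string":
        df = df.withColumn(name, F.regexp_replace(F.col(name), "\u2022", " "))

df.show(truncate=False)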

Related

Unable to read a 2GB nested json with dataframe in pyspark

I am trying to read a JSON file in pyspark in Databricks. It is a nested JSON with multiple levels, and it has one array which holds the entire data in it. I am able to process it if the size of the JSON is small, like 5 MB, but the same code is not working for a 2 GB or bigger file. The structure of the JSON is as below; the whole data is coming into the in_network column. It would be appreciated if someone could tell me how to load this into a data frame and then into a table.
I tried exploding the column. It is able to print the schema of the object, but when I try to save it to a table or a file, or if I use display or show on the data frame, it hangs and the job fails.
Below is the code I am using to read the JSON and explode it:
from pyspark.sql.functions import explode,col
df = spark.read.option('multiline',True).json("/FileStore/shared_uploads/konda/SampleDatafilefrombigfile.json")
df=df.withColumn('in_network',explode('in_network'))\
.withColumn('negotiation_arrangement',col('in_network.negotiation_arrangement'))\
.withColumn('name',col('in_network.name'))\
.withColumn('billing_code_type',col('in_network.billing_code_type'))\
.withColumn('billing_code_type_version',col('in_network.billing_code_type_version'))\
.withColumn('billing_code',col('in_network.billing_code'))\
.withColumn('description',col('in_network.description'))\
.withColumn('explode_negotiated_rates',explode('in_network.negotiated_rates'))\
.withColumn('exp_neg_prices',explode(col('explode_negotiated_rates.negotiated_prices')))\
.withColumn('exp_neg_pg',explode(col('explode_negotiated_rates.provider_groups')))\
.withColumn('exp_npi',explode(col('exp_neg_pg.npi')))\
.withColumn('billing_class',col('exp_neg_prices.billing_class'))\
.withColumn('expiration_date',col('exp_neg_prices.expiration_date'))\
.withColumn('negotiated_rate',col('exp_neg_prices.negotiated_rate'))\
.withColumn('negotiated_type',col('exp_neg_prices.negotiated_type'))\
.withColumn('service_code',explode(col('exp_neg_prices.service_code')))\
.withColumn('tin_type',col('exp_neg_pg.tin.type'))\
.withColumn('tin_val',col('exp_neg_pg.tin.value'))\
.drop('in_network')\
.drop('explode_negotiated_rates')\
.drop('exp_neg_prices')\
.drop("exp_neg_pg")
df.printSchema()
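A sketch of one thing that may be worth trying, under the assumption that only a handful of the nested fields are actually needed: explode level by level with select() instead of a long withColumn chain, and repartition right after the first explode so the work is spread across tasks instead of concentrated in one. The repartition count and the output table name are placeholders.
from pyspark.sql.functions import explode, col

df = (spark.read
      .option('multiline', True)
      .json("/FileStore/shared_uploads/konda/SampleDatafilefrombigfile.json"))

flat = (df.select(explode('in_network').alias('n'))
          .repartition(200)                        # size this to the cluster
          .select(col('n.billing_code'),
                  col('n.name'),
                  explode('n.negotiated_rates').alias('r'))
          .select('billing_code', 'name',
                  explode('r.negotiated_prices').alias('p'))
          .select('billing_code', 'name',
                  col('p.negotiated_rate'),
                  col('p.billing_class'),
                  col('p.negotiated_type')))

flat.write.mode('overwrite').saveAsTable('in_network_flat')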

CSV upload to PySpark data frame counting linebreaks as new rows

I am trying to load a CSV into a Spark data frame using the usual instructions; however, the CSV is loading incorrectly. Below are the header and a problematic record. This is the Vim view of the file, showing the ^M carriage returns.
symbol,tweet_id,text,created_at,retweets,likes,geo,place,coordinates,location^M
AIG,1423790670557351945, "Next Monday, US stock market may edge up. The market is more likely to be mixed.
$MNST $AIG $WFC $GS $MET $BAC $JPM▒<80>▒",2021-08-06 23:38:43,1,0,,,,"Toronto, Ontario, Canada"^M
Here is the command I'm using to load the CSV into a Spark data frame:
df = spark.read.load("symbols_tweets.csv",
                     format="csv", sep=",", header="true")
The issue is that spark.read.load ends up treating $MNST as the start of a new row, since it appears on a new line. Is there any way I can have Spark treat the carriage return ^M as the record delimiter so that the rows load as intended? As a workaround I tried converting the CSV to a Pandas data frame and then to a Spark data frame, but that leads to more complex datatype issues; I would rather solve this more directly.
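For what it's worth, a minimal sketch of the usual fix, assuming the embedded line breaks only ever occur inside the quoted tweet text: with multiLine enabled the CSV parser keeps a quoted field together even when it spans several physical lines, so the $MNST continuation stays inside the text column instead of starting a new record.
df = (spark.read
      .option("header", True)
      .option("multiLine", True)   # quoted fields may span physical lines
      .option("quote", '"')
      .option("escape", '"')
      .csv("symbols_tweets.csv"))

df.count()   # should now match the intended number of tweets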

BigQuery howto load data from local file as content

I have a requirement wherein I will receive file content which I need to load into BigQuery tables. The standard API shows how to load data from a local file, but I don't see any variant of the load method which accepts the file content as a string rather than a file path. Any idea how I can achieve this?
As we can see in the source code and the official documentation, the load function loads data only from a local file or a Cloud Storage file. The allowed formats are:
AVRO,
CSV,
JSON,
ORC,
PARQUET
The load job is created and it runs your data load asynchronously. If you would like instantaneous access to your data, insert it using the Table insert function, where you provide the rows to insert into the table:
// Insert a single row
table.insert({
  INSTNM: 'Motion Picture Institute of Michigan',
  CITY: 'Troy',
  STABBR: 'MI'
}, insertHandler);
If you want to load, for example, a CSV file, you first need to save the data to a CSV file in Node.js manually and then load it with the load() method; loading it as a single-column CSV will load the whole string as a single column.
Additionally, I can recommend using Dataflow templates, e.g. Cloud Storage Text to BigQuery, which reads text files stored in Cloud Storage, transforms them using a JavaScript user-defined function (UDF), and writes the result to BigQuery. Note, however, that the data to load needs to be stored in Cloud Storage.
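As a side note, if switching clients is an option, the Python client does accept in-memory content: load_table_from_file takes any file-like object, so the string can be wrapped in a buffer instead of being written to disk first. This is only a sketch; the dataset and table names are placeholders.
import io
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# csv_content is the string received from upstream
buf = io.BytesIO(csv_content.encode("utf-8"))
job = client.load_table_from_file(buf, "my_dataset.my_table", job_config=job_config)
job.result()   # wait for the load job to finish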

Converting 2TB of gzipped multiline JSONs to NDJSONs

For my research I have a dataset of about 20,000 gzipped multiline JSON files (~2TB, all with the same schema). I need to process and clean this data (I should say I'm very new to data analytics tools).
After spending a few days reading about Spark and Apache Beam, I'm convinced that the first step would be to convert this dataset to NDJSON. Most books and tutorials assume you are working with some newline-delimited file format.
What is the best way to go about converting this data?
I've tried just launching a large instance on gcloud and using gunzip and jq to do this. Not surprisingly, it seems that this will take a long time.
Thanks in advance for any help!
Apache Beam supports unzipping files if you use TextIO, but the delimiter remains the newline.
For multiline JSON you can read the complete files in parallel, convert each JSON string to a POJO, and then reshuffle the data to utilize parallelism.
So the steps would be:
Get the file list > Read individual files > Parse file content to JSON objects > Reshuffle > ...
You can get the file list with FileSystems.match("gcs://my_bucker").metadata().
Read individual files with Compression.detect(fileResourceId.getFilename()).readDecompressed(FileSystems.open(fileResourceId)); a Python sketch of these steps follows below.
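A rough Python equivalent of those steps, assuming the Beam Python SDK and placeholder bucket paths: match the compressed files, read each one whole, parse it, reshuffle, and write the parsed documents back out one per line.
import json
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | fileio.MatchFiles("gs://my_bucket/input/*.json.gz")
     | fileio.ReadMatches()                          # yields one ReadableFile per match
     | beam.Map(lambda f: json.loads(f.read_utf8())) # .gz is decompressed on read
     | beam.Reshuffle()                              # spread the parsed docs across workers
     | beam.Map(json.dumps)                          # one JSON document per output line
     | beam.io.WriteToText("gs://my_bucket/ndjson/part"))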
Converting to NDJSON is not necessary if you use sc.wholeTextFiles. Point this method at a directory, and you'll get back an RDD[(String, String)] where ._1 is the filename and ._2 is the content of the file.
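A short sketch of that route in PySpark (paths are placeholders); the .gz inputs are normally decompressed on read through the Hadoop codecs, so each record is the full decompressed text of one file.
import json

# RDD of (filename, file_content) pairs, one per input file
rdd = sc.wholeTextFiles("gs://my_bucket/json_dir/")

# parse each multiline JSON document
docs = rdd.map(lambda kv: json.loads(kv[1]))

# either analyse the documents directly, or emit NDJSON (one document per line)
docs.map(json.dumps).saveAsTextFile("gs://my_bucket/ndjson/")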

How to import data in csv format in J?

I want to know how I can import data in CSV format and then how I can work with it.
I have loaded the file but do not know how to read it.
'',' fixdsv dat ] load '/Users/apple/Downloads/data'
Assuming that the file /Users/apple/Downloads/data is a CSV file, you should be able to load it into a J session as a boxed table like this:
load 'csv'
data=: readcsv '/Users/apple/Downloads/data'
If the file uses delimiters other than commas (e.g. Tabs) then you could use the tables/dsv addon.
data=: TAB readdsv '/Users/apple/Downloads/data'
See the J wiki for more information on the tables/csv and tables/dsv addons.
I think that I would start by reading the file into a variable and then working with that.
data=: 1!:1 <'filepath/filename' NB. the path and filename need to be a boxed string
http://www.jsoftware.com/help/dictionary/dx001.htm
Also, you could look at Jd, which is specifically a relational database system, if you are more focused on file management than data processing.
http://code.jsoftware.com/wiki/Jd/Index
