Spark Python: Converting multiple lines from inside a loop into a dataframe - apache-spark

I have a loop that is going to create multiple rows of data which I want to convert into a dataframe.
Currently I build a CSV-formatted string, appending a new line to it on each iteration of the loop. I am creating CSV so that I can also save it as a text file for other processing.
File Header:
output_str="Col1,Col2,Col3,Col4\n"
Inside for loop:
output_str += "Val1,Val2,Val3,Val4\n"
I then create an RDD by splitting the string on newlines and convert it into a dataframe as follows.
output_rdd = sc.parallelize(output_str.split("\n"))
output_df = output_rdd.map(lambda x: (x, )).toDF()
It creates a dataframe, but it has only one column. I know that is because of the map function, where I wrap each line in a tuple with a single item. What I need is a tuple with multiple items, so perhaps I should call split() on every line to get a list. But I have a feeling there should be a much more straightforward way. Appreciate any help. Thanks.
Edit: To give more information: using Spark SQL I have filtered my dataset down to the rows that contain the problem. However, the rows contain information in the following format (separated by '|'), and I need to extract those values from column 3 whose corresponding flag in column 4 is set to 1 (here it is 0xcd):
Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00
So I collect the output at the driver and then parse the last two columns, after which I am left with plain strings that I want to put back into a dataframe. I am not sure whether I can achieve the same parsing with Spark SQL.
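For reference, the per-row extraction described above can be done in plain Python (whether at the driver or inside a map); a minimal sketch, assuming the '|'-separated layout shown:

```python
def extract_flagged(line):
    """Return the values from column 3 whose flag in column 4 is 0x01."""
    fields = line.split("|")
    values = fields[2].split(",")   # e.g. ['0xab', '0xcd', '0xef']
    flags = fields[3].split(",")    # e.g. ['0x00', '0x01', '0x00']
    return [v for v, f in zip(values, flags) if int(f, 16) == 1]

print(extract_flagged("Field1|Field2|0xab,0xcd,0xef|0x00,0x01,0x00"))
# → ['0xcd']
```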

Yes, your current approach seems a little too complicated. Building a large string in the Spark driver and then parallelizing it is not really performant.
First of all, where are you getting your input data from? In my opinion you should use one of the existing Spark readers to read it. For example you can use:
CSV -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
jdbc -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.jdbc
json -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json
parquet -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet
unstructured text file -> http://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.SparkContext.textFile
In the next step you can preprocess it with the Spark DataFrame or RDD API, depending on your use case.
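For instance, instead of concatenating a string in the loop, you could write the rows out with Python's csv module and then load the file with one of the readers above. A minimal sketch (the file name and column names are placeholders; the Spark call is shown as a comment):

```python
import csv

# In practice these rows would be produced by your loop.
rows = [("Val1", "Val2", "Val3", "Val4"),
        ("Val5", "Val6", "Val7", "Val8")]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Col1", "Col2", "Col3", "Col4"])  # header
    writer.writerows(rows)

# Then on the Spark side the file can be read directly:
# output_df = spark.read.csv("output.csv", header=True)
```

This also gives you the text file you wanted for other processing, without any manual newline bookkeeping.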

A bit late, but currently you're applying a map that creates, for each row, a tuple containing the whole string as its only element. Instead, you want to split the string, which can easily be done inside the map step. Assuming all of your rows have the same number of elements, you can replace:
output_df = output_rdd.map(lambda x: (x, )).toDF()
with
output_df = output_rdd.map(lambda x: x.split(",")).toDF()
(Note that split() with no argument splits on whitespace; your rows are comma-separated, so pass ",". You may also want to filter out the empty string that the trailing newline produces.)
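In plain Python terms, splitting each non-empty line on commas produces the multi-field rows that toDF needs; a minimal sketch of that logic:

```python
output_str = "Val1,Val2,Val3,Val4\nVal5,Val6,Val7,Val8\n"

lines = [l for l in output_str.split("\n") if l]   # drop the trailing empty string
rows = [l.split(",") for l in lines]               # one list of fields per line

print(rows)
# → [['Val1', 'Val2', 'Val3', 'Val4'], ['Val5', 'Val6', 'Val7', 'Val8']]
```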

Related

GroupByKey to fill values and then ungroup apache beam

I have csv files that have missing values per groups formed by primary keys (for every group, there's only 1 value populated for 1 field, and I need that field to be populated for all records of the group). I'm processing the entire file with apache beam and therefore, I want to use GroupByKey to fill up the field for each group, and then ungroup it to restore the original data, now with filled data. The equivalent in pandas would be:
dataframe[column_to_be_filled] = dataframe.groupby(primary_key)[column_to_be_filled].ffill().bfill()
I don't know how to achieve this with Apache Beam. I first tried the Apache Beam dataframe API, but that would take a lot of memory.
It's better to process your elements as a PCollection instead of a dataframe to avoid memory issues.
First read your CSV into a PCollection; then you can use GroupByKey, process the grouped elements, and yield the results from a separate transform.
It could be something like this
(pcollection | 'Group by key' >> beam.GroupByKey()
| 'Process grouped elements' >> beam.ParDo(UngroupElements()))
The input PCollection should consist of tuples, each containing the key you want to group by and the element.
The transform could look like this:
class UngroupElements(beam.DoFn):
    def process(self, element):
        k, v = element
        for elem in v:
            # process your element
            yield elem
(Note: the class passed to beam.ParDo must subclass beam.DoFn, and process takes self as its first argument.)
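The per-group fill itself (the equivalent of the ffill().bfill() from the question, applied within one group) can be written in plain Python inside the process method; a sketch, modeling the column as a list of optional values:

```python
def fill_group(values):
    """Forward-fill, then back-fill, None values within one group."""
    filled = list(values)
    last = None
    for i, v in enumerate(filled):            # forward fill
        if v is None:
            filled[i] = last
        else:
            last = v
    nxt = None
    for i in range(len(filled) - 1, -1, -1):  # backward fill
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled

print(fill_group([None, "a", None, None]))
# → ['a', 'a', 'a', 'a']
```

Since the question says each group has exactly one populated value, every record in the group ends up with that value.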
Alternatively, you can try to use almost exactly the same code as in Pandas, via the Beam DataFrame API: https://beam.apache.org/documentation/dsls/dataframes/overview/
You can use read_csv to read your data into a dataframe and then apply the same code you would use in Pandas. Not all Pandas operations are supported (https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/), but this specific group-by case should work.

Levenshtein distance algorithm on Spark

I'm starting with Hadoop ecosystem and I'm facing some questions and need your help.
I have two HDFS files and need to execute Levenshtein distance between a group of columns of the first one versus another group of the second one.
This process will be executed each day with a quite considerable amount of data (150M rows in the first file Vs 11M rows in the second one).
I would appreciate some guidance (code examples, references, etc.) on how to read my two files from HDFS, execute the Levenshtein distance between them (using Spark?) as described, and save the results to a third HDFS file.
Thank you very much in advance.
I guess you have CSV files, so you can read them directly into a dataframe:
val df1 = spark.read.option("header","true").csv("hdfs:///pathtoyourfile_1")
The org.apache.spark.sql.functions module contains a levenshtein(l: Column, r: Column): Column function, so you pass it dataframe columns of String type. If you want to use a group of columns, you can use concat('col1, 'col2, ...) to concatenate them and pass the result to the function. If you have two or more dataframes, you have to join them into one dataframe first and then perform the distance calculation. Finally, you can save your results to CSV using df.write.csv("path").
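For reference, this is what that built-in function computes; a plain-Python sketch of the classic dynamic-programming algorithm (one row of the DP table kept at a time):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))
# → 3
```

At 150M x 11M rows a full cross-comparison is enormous, so in practice you would want a join key or blocking strategy to limit which pairs are compared.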

Create XML request from each record of a dataframe

I have tried many options including withColumn, udf, lambda, foreach, and map, but I am not getting the expected output. At most, I am able to transform only the first record. The inputfile.json will keep growing, and the expected output should give the XML in the desired structure. I will later produce the expected output on Kafka.
Spark 2.3, Python 2.7. Need is to do in PySpark.
Edit 1:
I am able to add a column to the main dataframe which holds the required XML. I used withColumn and functions.format_string and was able to add the strings (the XML structures) as a column of the dataframe.
Now my next target is to produce just the value of that new column to Kafka. I am using df.foreachPartition(send_to_kafka) and have created a function as below:
def send_to_kafka(rows):
    kafka = SimpleClient('localhost:9092')
    producer = SimpleProducer(kafka)
    for row in rows:
        producer.send_messages('test', str(row.asDict()))
But unfortunately it does two things:
a. It produces the record on Kafka as {'newColumn': u'myXMLPayload'}. I do not want that; I want only myXMLPayload to be produced on Kafka.
b. It adds u' to the value, because the value is a unicode string.
I want to get rid of these two parts and I would be good to go.
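One way to address both points is to send the encoded column value itself rather than the stringified dict. A sketch, with a stub class standing in for the Kafka producer so the logic is self-contained (the column name newColumn is taken from the question):

```python
class StubProducer:
    """Stands in for kafka SimpleProducer in this sketch."""
    def __init__(self):
        self.sent = []

    def send_messages(self, topic, payload):
        self.sent.append((topic, payload))

def send_to_kafka(rows, producer):
    for row in rows:
        # Send only the XML payload, encoded to bytes:
        # no dict wrapper, no u'' prefix.
        producer.send_messages('test', row['newColumn'].encode('utf-8'))

producer = StubProducer()
send_to_kafka([{'newColumn': u'<tag>myXMLPayload</tag>'}], producer)
print(producer.sent)
# → [('test', b'<tag>myXMLPayload</tag>')]
```

With the real foreachPartition function you would keep the SimpleClient/SimpleProducer setup from the question and change only the send_messages argument.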
Any help would be appreciated.

How to extract a Dataset content n rows by n rows?

I have to output the results of a Dataset into a Postgis (spatial) database. Spark doesn't handle this, and I had to write specific code that cannot be serialized. This means I can't use the dataset.foreach(...) method, and I have to execute my database insertions outside of Spark tasks.
But a whole
List<Row> rows = ds.collectAsList();
will produce an out-of-memory error.
And a
List<Row> rows = ds.takeAsList(n);
only returns the first n rows of the dataset.
Is there a way to read the dataset sequentially, so that I can read its whole content from beginning to end, extracting only a fixed number of rows at a time?
You can try the randomSplit method to split your dataset into multiple datasets.
For example, to split it into 3 parts:
Dataset<Row>[] parts = ds.randomSplit(new double[] {1.0, 1.0, 1.0});
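In plain-Python terms, randomSplit with equal weights assigns each row to one of the parts at random, so every row ends up in exactly one chunk; a sketch of that behaviour (the three-way split matches the example above):

```python
import random

def random_split(rows, n_parts, seed=42):
    """Assign each row to one of n_parts chunks at random (equal weights)."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[rng.randrange(n_parts)].append(row)
    return parts

parts = random_split(list(range(100)), 3)
# Every row lands in exactly one part, and part sizes are only roughly equal.
assert sum(len(p) for p in parts) == 100
```

Note that the parts are only approximately equal in size, and the split is random rather than sequential, so this bounds memory per batch but does not preserve row order.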

Compile a dataframe from multiple CSVs using list of dfs

I am trying to create a single dataframe from 50 CSV files. I need to use only two columns of the CSV files, namely 'Date' and 'Close'. I tried using the df.join function inside the for loop, but it eats up a lot of memory and I get the error "Killed: 9" after processing roughly 22-23 CSV files.
So, now I am trying to create a list of Dataframes with only 2 columns using the for loop and then I am trying to concat the dfs outside the loop function.
I have following issues to be resolved:-
(i) Though most of the CSV files start at 2000-01-01, a few have later start dates. I want the main dataframe to contain all the dates, with NaN or empty fields for the CSVs with later start dates.
(ii) I want to concat them across the Date as Index.
My code is:
def compileData(symbol):
    with open("nifty50.pickle", "rb") as f:
        symbols = pickle.load(f)
    dfList = []
    main_df = pd.DataFrame()
    for symbol in symbols:
        df = pd.read_csv('/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),
                         infer_datetime_format=True, usecols=['Date', 'Close'],
                         index_col=None, header=0)
        df.rename(columns={'Close': symbol}, inplace=True)
        dfList.append(df)
    main_df = pd.concat(dfList, axis=1, ignore_index=True, join='outer')
    print(main_df.head())
You can use index_col=0 in read_csv, or dfList.append(df.set_index('Date')), to put your Date column in the index of each dataframe. Then, using pd.concat with axis=1, Pandas will use intrinsic data alignment to align all dataframes on the index.
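A minimal sketch of that alignment with two made-up symbols, one of which starts a day later:

```python
import pandas as pd

# Two per-symbol frames indexed by Date; 'B' starts one day later.
a = pd.DataFrame({'A': [1.0, 2.0, 3.0]},
                 index=pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03']))
b = pd.DataFrame({'B': [20.0, 30.0]},
                 index=pd.to_datetime(['2000-01-02', '2000-01-03']))

# axis=1 with an outer join aligns on the Date index and fills gaps with NaN.
main_df = pd.concat([a, b], axis=1, join='outer')
print(main_df)
```

Note that ignore_index=True in the original code would discard the column names after concatenation, so you likely want to drop that argument as well.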
