Load a master data file into the Spark ecosystem - apache-spark

While building a log processing system, I came across a scenario where I need to look up a value in a tree file (like a DB) for each and every log line. What is the best approach to load a very large external file into the Spark ecosystem? The tree file is about 2 GB.
Here is my scenario:
I have a file containing a huge number of log lines.
Each log line needs to be split by a delimiter into 70 fields.
I need to look up data from the tree file for one of the 70 fields of each log line.
I am using the Apache Spark Python API and running on a 3-node cluster.
Below is the code I have written, but it is really slow.
from pyspark import SparkFiles
from pyspark.sql import Row

# tree_lookup_value and the SparkSession `spark` are defined elsewhere in the application

def process_logline(line, tree):
    # split the log line into its fields and look up the value for the first one
    row_dict = {}
    line_list = line.split(" ")
    row_dict["host"] = tree_lookup_value(tree, line_list[0])
    new_row = Row(**row_dict)
    return new_row

def run_job(vals):
    spark.sparkContext.addFile('somefile')
    # open the distributed copy of the tree file on the driver
    tree_val = open(SparkFiles.get('somefile'))
    lines = spark.sparkContext.textFile("log_file")
    # look up the tree value for every log line
    converted_lines_rdd = lines.map(lambda l: process_logline(l, tree_val))
    log_line_rdd = spark.createDataFrame(converted_lines_rdd)
    log_line_rdd.show()
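One pattern that often comes up for this kind of per-line lookup (not part of the post above) is to load the lookup file as a second DataFrame and join on the key field, instead of re-opening the tree file inside map(). A rough sketch, assuming a tab-separated lookup file and made-up column names ("key", "host"):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-lookup").getOrCreate()

# Lookup table built from the 2 GB "tree" file; "key" and "host" are assumed column names
tree_df = spark.read.csv("somefile", sep="\t").toDF("key", "host")

# Log lines: split each line on the delimiter and keep the field used for the lookup
logs_df = (spark.read.text("log_file")
           .withColumn("fields", F.split(F.col("value"), " "))
           .withColumn("lookup_key", F.col("fields")[0]))

# The join replaces the per-line tree_lookup_value() call; Spark may broadcast
# the lookup table on its own if it is small enough
result = logs_df.join(tree_df, logs_df.lookup_key == tree_df.key, "left")
result.select("lookup_key", "host").show()

Whether this helps depends on how the tree file is structured; the point is simply to turn the lookup into a join that Spark can parallelize.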

Related

Use pyspark to partition 100 rows from csv file

I'm trying to group a large csv file (100M+ rows) into batches of 100 rows to send to a Lambda function.
I can use SparkContext for a workaround like this:
csv_file_rdd = sc.textFile(csv_file).collect()
count = 0
buffer = []
while count < len(csv_file_rdd):
    buffer.append(csv_file_rdd[count])
    count += 1
    if count % 100 == 0 or count == len(csv_file_rdd):
        # Send buffer to process
        print("Send:", buffer)
        # Clear buffer
        buffer = []
but there must be a more elegant solution. I've tried using SparkSession and mapPartitions but I haven't been able to make it work.
I suppose that your current data is not partitioned in any way (I mean it's only one file), so iterating over it sequentially is a must. I suggest loading it as a DataFrame with spark.read.csv(csv_file), then repartitioning as in this question and saving to disk. Once it's saved you'll have a large number of files, each containing roughly the specified number of records (100 in your case), that can be used by another program to send to a Lambda (probably with a Pool of workers). See this post to get an idea. It's probably a naive idea, but it gets the job done.
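A minimal sketch of that repartition-and-save idea (the file paths are placeholders, and repartition only gives roughly equal-sized parts, not exactly 100 rows each):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-chunks").getOrCreate()

df = spark.read.csv("csv_file", header=True)

# Enough partitions so that each output file holds about 100 rows
num_chunks = (df.count() + 99) // 100

(df.repartition(num_chunks)
   .write.mode("overwrite")
   .csv("output_chunks"))

# Each part file under output_chunks/ can then be picked up by a separate
# worker (e.g. a multiprocessing Pool) and sent to the Lambda function.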

Live-updating graph from an increasing number of csv files

I need to analyse some spectral data in real-time and plot it as a self-updating graph.
The program I use outputs a text file every two seconds.
Usually I do the analysis after gathering the data and the code works just fine. I create a dataframe where each csv file represents a column. The problem is, with several thousand csv files, the import becomes very slow and creating a dataframe out of all the csv files usually takes more than half an hour.
Below is the code for creating the dataframe from multiple csv files.
''' import, append and concat files into one dataframe '''
all_files = glob.glob(os.path.join(path, filter + "*.txt"))  # path to the files by joining path and file name
all_files.sort(key=os.path.getmtime)
data_frame = []
name = []
for file in all_files:
    creation_time = os.path.getmtime(file)
    readible_date = datetime.fromtimestamp(creation_time)
    df = pd.read_csv(file, index_col=0, header=None, sep='\t', engine='python', decimal=",", skiprows=15)
    df.rename(columns={1: readible_date}, inplace=True)
    data_frame.append(df)
full_spectra = pd.concat(data_frame, axis=1)
for column in full_spectra.columns:
    time_step = column - full_spectra.columns[0]
    minutes = time_step.total_seconds() / 60
    name.append(minutes)
full_spectra.columns = name
return full_spectra
The solution I thought of was to use the watchdog module: every time a new text file is created, it gets appended as a new column to the existing dataframe and the updated dataframe is plotted. That way, I do not need to loop over all csv files every time.
I found a very nice example of how to use watchdog here.
My problem is that I could not find a solution for how, after detecting the new file with watchdog, to read it and append it to the existing dataframe.
A minimal example should look something like this:
def latest_filename():
    """a function that checks within a directory for new text files"""
    return filename

df = pd.DataFrame()                         # create a dataframe
newdata = pd.read_csv(latest_filename())    # the new file is found by watchdog
df["newcolumn"] = newdata["desiredcolumn"]  # append the new data as a column
df.plot()                                   # plot the data
The plotting part should be easy and my thoughts were to adapt the code presented here. I am more concerned with the self-updating dataframe.
I appreciate any help or other solutions that would solve my issue!
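A rough sketch of how the watchdog piece could look (the file pattern, read settings and plotting call mirror the code above, but this is an assumption rather than a tested solution):

import pandas as pd
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

full_spectra = pd.DataFrame()  # the growing dataframe

class NewSpectrumHandler(FileSystemEventHandler):
    def on_created(self, event):
        global full_spectra
        if event.is_directory or not event.src_path.endswith(".txt"):
            return
        # same read settings as in the batch code above
        df = pd.read_csv(event.src_path, index_col=0, header=None, sep='\t',
                         engine='python', decimal=",", skiprows=15)
        full_spectra[event.src_path] = df[1]  # append the new file as a column
        full_spectra.plot()                   # or update an existing figure instead

observer = Observer()
observer.schedule(NewSpectrumHandler(), path="data_directory", recursive=False)
observer.start()
# keep the script alive, e.g. with observer.join()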

Reading large volume data from Teradata using Dask cluster/Teradatasql and sqlalchemy

I need to read a large volume of data (approx. 800M records) from Teradata. My code works fine for a million records, but for larger sets it takes a long time to build the metadata. Could someone please suggest how to make it faster? Below is the code snippet I am using in my application.
from dask import delayed
from dask.dataframe import from_delayed
import pandas as pd

def get_partitions(num_partitions):
    # build (start, end) ranges so the query can be split across partitions
    list_range = []
    initial_start = 0
    for i in range(num_partitions):
        amp_range = 3240 // num_partitions
        start = (i * amp_range + 1) * initial_start
        end = (i + 1) * amp_range
        list_range.append((start, end))
        initial_start = 1
    return list_range

@delayed
def load(query, start, end, connString):
    # run the range-bounded query and return the result as a pandas DataFrame
    df = pd.read_sql(query.format(start, end), connString)
    engine.dispose()  # note: `engine` is not defined in this snippet
    return df

connString = "teradatasql://{user}:{password}@{hostname}/?logmech={logmech}&encryptdata=true"
results = from_delayed([load(query, start, end, connString) for start, end in get_partitions(num_partitions)])
The build time is probably spent finding out the metadata of your table. This is done by fetching the whole of the first partition and analysing it.
You would be better off either specifying it explicitly, if you know the dtypes upfront, e.g., {col: dtype, ...} for all the columns, or generating it from a separate query that you limit to just as many rows as it takes to be sure you have the right types:
import dask

# infer the metadata once from a small sample of rows
meta = dask.compute(load(query, 0, 10, connString))[0]

results = from_delayed(
    [
        load(query, start, end, connString)
        for start, end in get_partitions(num_partitions)
    ],
    meta=meta.iloc[:0]  # zero-length version of the table
)
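For the first option mentioned above (dtypes known upfront), the sketch below passes the metadata as a plain dict instead of computing a sample partition; the column names and dtypes here are placeholders, not the real Teradata schema:

import numpy as np
from dask.dataframe import from_delayed

# placeholder schema: replace with the real column names and dtypes
meta = {
    "customer_id": np.int64,
    "amount": np.float64,
    "created_at": "datetime64[ns]",
}

results = from_delayed(
    [load(query, start, end, connString)
     for start, end in get_partitions(num_partitions)],
    meta=meta,
)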

Massive Azure WASB JSON folder (450 GB) read in Spark in an optimized way

I am processing files in Azure Spark. It takes a long time to process them. Can anyone please suggest an optimized way to reduce the processing time? My sample code is attached below.
import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}

// Azure container filesystem; it contains the source, destination, archive and result files
val azureContainerFs = FileSystem.get(sc.hadoopConfiguration)

// Read the source file list
val sourceFiles = azureContainerFs.listStatus(new Path("/" + sourcePath + "/"), new PathFilter {
  override def accept(path: Path): Boolean = {
    val name = path.getName
    name.endsWith(".json")
  }
}).toList.par

// Ingestion processing for each file
for (sourceFile <- sourceFiles) {
  // Tokenize the file name from the path
  val sourceFileName = sourceFile.getPath.toString.substring(sourceFile.getPath.toString.lastIndexOf('/') + 1)

  // Create a customer invoice DF from the source json
  val customerInvoiceDf = sqlContext.read.format("json").schema(schemaDf.schema).json("/" + sourcePath + "/" + sourceFileName).cache()
Thanks in Advance!
Please tell us a bit more about your stack and processing power (number of masters and workers, how you deploy code, things like that).

Spark Streaming Design Question

I am new to Spark. I want to set up Spark Streaming to retrieve key-value pairs from files in the format below:
file: info1
Note: Each info file will have around 1000 of these records, and our system is continuously generating these info files. Through Spark Streaming I want to map line numbers to info files and get an aggregated result.
Can we feed these kinds of files into a Spark cluster as input? I am interested only in the "SF" and "DA" delimiters: "SF" corresponds to the source file and "DA" corresponds to (line number, count).
Since this input data is not in a line-based format, is it a good idea to use these files as Spark input directly, or do I need an intermediate stage where I clean these files and generate new files that have each record's information on one line instead of in blocks?
Or can we achieve this in Spark itself? What would be the right approach?
What do I want to achieve?
I want to get line-level information, i.e., the line (as a key) and the info files (as values).
The final output I want looks like this:
line178 -> (info1, info2, info7.................)
line 2908 -> (info3, info90, ..., ... ,)
Do let me know if my explanation is not clear or if I am missing something.
Thanks & Regards,
Vinti
You could do something like this, given your DStream stream:
// this gives you the DA & FP lines, keyed by the line number
val validLines = stream.map(_.split(":"))
  .filter(parts => Seq("DA", "FP").contains(parts(0)))
  .map(parts => parts(1).split(","))
  .map(parts => (parts(0), parts(1)))

// now accumulate the values per key
val state = validLines.updateStateByKey[Seq[String]](updateFunction _)

def updateFunction(newValues: Seq[String], runningValues: Option[Seq[String]]): Option[Seq[String]] = {
  // append the new values to whatever has been accumulated so far
  val newVals = runningValues match {
    case Some(list) => list ++ newValues
    case _          => newValues
  }
  Some(newVals)
}
This should accumulate, for each key, a sequence of the associated values, storing it in state.
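For reference, a rough PySpark sketch of the same updateStateByKey pattern (the socket source, checkpoint directory and the "DA"/"SF" tokens are assumptions, not part of the original answer):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="info-files")
ssc = StreamingContext(sc, 2)
ssc.checkpoint("checkpoint_dir")  # stateful operations need a checkpoint directory

stream = ssc.socketTextStream("localhost", 9999)

valid_lines = (stream.map(lambda line: line.split(":"))
               .filter(lambda parts: parts[0] in ("DA", "SF"))
               .map(lambda parts: parts[1].split(","))
               .map(lambda parts: (parts[0], [parts[1]])))

def update_function(new_values, running_values):
    # new_values is a list of single-element lists for this batch; flatten and append
    merged = list(running_values or [])
    for vals in new_values:
        merged.extend(vals)
    return merged

state = valid_lines.updateStateByKey(update_function)
state.pprint()

ssc.start()
ssc.awaitTermination()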
