File Name and Variable in Flume - apache-spark

Right now I am working on a project where we are trying to read the tomcat access log using Flume, process that data in Spark, and dump it into a DB in the proper format. But the problem is that the tomcat access log is a daily rolling file, so the file name changes every day, something like:
localhost_access_log.2017-09-19.txt
localhost_access_log.2017-09-18.txt
localhost_access_log.2017-09-17.txt
and the source section of my Flume conf file is something like:
# Describe/configure the source
flumePullAgent.sources.nc1.type = exec
flumePullAgent.sources.nc1.command = tail -F /tomcatLog/localhost_access_log.2017-09-17.txt
#flumePullAgent.sources.nc1.selector.type = replicating
This runs the tail command on a fixed file name (I used a fixed name for testing only). How can I pass the file name as a parameter in the Flume conf file?
In fact, even if I could somehow pass the file name as a parameter, it would not be an actual solution. Say I start Flume today with some file name (for example "localhost_access_log.2017-09-19.txt"); tomorrow, when the file name changes (localhost_access_log.2017-09-19.txt to localhost_access_log.2017-09-20.txt), someone has to stop Flume and restart it with the new file name. That is not a continuous process; I would have to stop/start Flume with a cron job or something similar. Another problem is that I would lose some data every day during the switchover (the time it takes to generate the new file name + the time to stop Flume + the time to start Flume), since the server we are working with is a high-throughput one, almost 700-800 TPS.
Does anyone have an idea how to run Flume against rolling file names in a production environment? Any help will be highly appreciated...

The exec source is not suitable for your task; you can use the Spooling Directory Source instead. From the Flume user guide:
This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear.
Then, in the config file, you'd point it at your logs directory like this:
agent.sources.spooling_src.spoolDir = /tomcatLog
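For reference, a minimal sketch of how that source section might look with the spooling directory source, reusing the agent and source names from the question; the channel name c1 is just a placeholder:
# Describe/configure the source (spooling directory instead of exec)
flumePullAgent.sources.nc1.type = spooldir
flumePullAgent.sources.nc1.spoolDir = /tomcatLog
# keep the original file name in an event header, useful downstream in Spark
flumePullAgent.sources.nc1.fileHeader = true
# ingested files are renamed with this suffix instead of being deleted (default behaviour)
flumePullAgent.sources.nc1.fileSuffix = .COMPLETED
flumePullAgent.sources.nc1.channels = c1
One caveat from the same user guide: files dropped into the spooling directory must be complete and immutable, so this source is meant for the already rolled-over log files rather than the file tomcat is still appending to.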

Related

Reading GeoJSON in databricks, no mount point set

We have recently made changes to how we connect to ADLS from Databricks, which have removed mount points that were previously established within the environment. We are using Databricks to find points in polygons, as laid out in the Databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html
Previously, a chunk of code read in a GeoJSON file from ADLS into the notebook and then projected it to the cluster(s):
nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights)
However, the new changes that have been made have removed the mount point and we are now reading files in using the string:
"wasbs://Z#Y.blob.core.windows.net/X/Personnel/*.csv"
This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
We then tried to copy the file temporarily to "dbfs" which was the only way we had managed to read files previously, as follows:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights)
This works fine on the first use within the code, but then a second GeoJSON run immediately after (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.
So some questions for you kind folks:
Why were we previously having to use "/dbfs/" when reading in GeoJSON but not csv files, pre-changes to our environment?
What is the correct way to read in GeoJSON files into databricks without a mount point set?
Why does our process fail upon trying to read the second created temp GeoJSON file?
Thanks in advance for any assistance - very new to Databricks...!
Pandas uses the local file API for accessing files, and you previously accessed files on DBFS via /dbfs, which exposes that local file API. In your specific case, the problem is that even though you used dbutils.fs.cp, you didn't specify that you wanted to copy the file locally, so by default it was copied onto DBFS under the path /dbfs/tmp/temp_nights (really dbfs:/dbfs/tmp/temp_nights), and as a result the local file API doesn't see it - you would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights.
But the better way is to copy the file locally - you just need to specify that the destination is local, which is done with the file:// prefix, like this:
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/...Nights_new.geojson",
"file:///tmp/temp_nights")
and then read file from /tmp/temp_nights:
nights = gpd.read_file(filename="/tmp/temp_nights")
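Putting the two approaches together, a short sketch (the wasbs URL is the placeholder path from the question, and /tmp/temp_nights is just an example location):
# Option 1: copy to the driver's local filesystem (note the file:// prefix) and read it from /tmp
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson",
              "file:///tmp/temp_nights")
nights = gpd.read_file("/tmp/temp_nights")

# Option 2: copy onto DBFS and read it back through the /dbfs local-file view of DBFS
dbutils.fs.cp("wasbs://Z#Y.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson",
              "dbfs:/tmp/temp_nights")
nights = gpd.read_file("/dbfs/tmp/temp_nights")

a_nights = sc.broadcast(nights)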

Spark Streaming finds file but claims it can't find the file

I have the below - which monitors a directory & pulls in the logs every X seconds.
The issue I have is this:
I set the script running
I then create a file in the directory (let's say testfile.txt)
The script then errors, saying testfile.txt does not exist
It clearly found the file and its name, so the file does exist and is being picked up.
What I can see is that I define the path with file:///, but the error says it can't find file:/. So it seems to be missing two slashes for some reason:
Thanks for any help!!!!
Code
#only files after stream starts
df = spark_session\
.readStream\
.option('newFilesOnly', 'true')\
.option('header', 'true')\
.schema(myschema)\
.text('file:///home/keenek1/analytics/logs/')\
.withColumn("FileName", input_file_name())
Error
FileNotFoundException: File file:/home/keenek1/analytics/logs/loggywoggywoo.txt does not exist\
Please change file:/// to hdfs://.
# changed file:/// to hdfs://
df = spark_session\
.readStream\
.option('newFilesOnly', 'true')\
.option('header', 'true')\
.schema(myschema)\
.text('hdfs://home/keenek1/analytics/logs/')\
.withColumn("FileName", input_file_name())
For the below follow-up question:
If the same log file is overwritten lets say hourly, the checkpoint doesn't reprocess the file. I need it to say 'if modified time changes, reprocess' - is that possible?
A workaround would be to point your Spark streaming job at a different directory and use Spark listeners to check the file timestamps in the actual directory; if a file's timestamp changes, move that file into your streaming directory under a new name.
Let me know if you want code; I can give it to you in Scala, and you may need to convert it into Python.
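In the meantime, a rough Python sketch of that workaround (the directory names and the one-minute poll interval are illustrative assumptions, and this uses plain modification-time polling rather than Spark listeners):
import os
import shutil
import time

source_dir = "/home/keenek1/analytics/logs"          # directory the logs are actually written to
streaming_dir = "/home/keenek1/analytics/stream_in"  # directory the readStream query watches
seen = {}  # file name -> last modification time already copied

while True:
    for name in os.listdir(source_dir):
        path = os.path.join(source_dir, name)
        mtime = os.path.getmtime(path)
        if seen.get(name) != mtime:
            # copy under a new, unique name so the streaming query treats it as a new file
            shutil.copy(path, os.path.join(streaming_dir, f"{name}.{int(mtime)}"))
            seen[name] = mtime
    time.sleep(60)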

HDFS and Spark: Best way to write a file and reuse it from another program

I have some results from a Spark application saved in HDFS as files called part-r-0000X (X = 0, 1, etc.). Because I want to join the whole content into a single file, I'm using the following command:
hdfs dfs -getmerge srcDir destLocalFile
The previous command is used in a bash script which empties the output directory (where the part-r-... files are saved) and, inside a loop, executes the above getmerge command.
The thing is, I need to use the resulting file in another Spark program which needs that merged file as input in HDFS. So I'm saving it locally and then uploading it to HDFS.
I've thought of another option, which is writing the file from the Spark program this way:
outputData.coalesce(1, false).saveAsTextFile(outPathHDFS)
But I've read that coalesce() doesn't help with performance.
Any other ideas? suggestions? Thanks!
My guess is that you wish to merge all the files into a single one so that you can load them all at once into a Spark RDD.
Let the files be parts (part-00000, part-00001, ...) in HDFS.
Why not load them with wholeTextFiles, which actually does what you need?
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
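A minimal PySpark sketch of that, assuming sc is the SparkContext and the part files live under hdfs://a-hdfs-path as in the excerpt above:
# each element is a (file path, whole file content) pair; no getmerge step is needed
pairs = sc.wholeTextFiles("hdfs://a-hdfs-path")

# if the second job only needs the merged lines, drop the paths and split the contents
lines = pairs.values().flatMap(lambda content: content.splitlines())
print(lines.count())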
Try Spark's bucketBy.
This is a nice feature via df.write.saveAsTable(), but tables written this way can only be read by Spark. The data shows up in the Hive metastore but cannot be read by Hive or Impala.
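A minimal PySpark sketch of bucketBy, assuming outputData is (or has been converted to) a DataFrame; the bucket count, column name, and table name are illustrative:
# write the results as a bucketed, Spark-managed table instead of merging part files
outputData.write \
    .bucketBy(16, "id") \
    .sortBy("id") \
    .format("parquet") \
    .saveAsTable("merged_output")

# the second Spark application reads it back by table name
inputData = spark.table("merged_output")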
The best solution that I've found so far was:
outputData.saveAsTextFile(outPath, classOf[org.apache.hadoop.io.compress.GzipCodec])
This saves outputData in compressed part-0000X.gz files under the outPath directory.
And, from the other Spark app, those files are read using this:
val inputData = sc.textFile(inDir + "part-00*", numPartition)
Where inDir corresponds to the outPath.

Spark - Folder with same name as text file automatically created after RDD?

I placed a text file named Linecount2.txt in HDFS and built a simple RDD to count the number of lines using Spark.
val lines = sc.textFile("user/root/hdpcd/Linecount2.txt")
lines.count()
This works.
But when I tried using the same text file with the aforementioned path, I receive the error:
"org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:"
When I looked into that path, I could see a folder 'Linecount2.txt' had been created. Hence the path to the file is now
("user/root/hdpcd/Linecount2.txt/Linecount2.txt")
Then, after using that path, I was able to run it successfully.
The third time I tried this, I got the same error because the input path doesn't exist.
When I went through the path,
Why does this happen?
There is a difference between putting an HDFS file at user/root/hdpcd/Linecount2.txt and at /user/root/hdpcd/Linecount2.txt (or, more simply, hdpcd/Linecount2.txt, when you already are the root user).
The leading slash is very important if you want to place a file at an absolute path rather than under your current user's home directory, which is the default otherwise.
You've not given your hdfs put command, but the issue here is simply the difference between absolute and relative paths, and it's not Spark specifically that's the issue.
Also, hdfs put will say that a file already exists if you try to place it in the same location, so the fact that you were able to upload twice should be an indication that your path was incorrect.
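To make the relative-versus-absolute difference concrete, a small PySpark sketch (assuming the root user, whose HDFS home directory is /user/root):
# relative path: resolved against the current user's HDFS home directory,
# so for root this actually points at /user/root/user/root/hdpcd/Linecount2.txt
lines = sc.textFile("user/root/hdpcd/Linecount2.txt")

# absolute path: the leading slash makes the location unambiguous
lines = sc.textFile("/user/root/hdpcd/Linecount2.txt")
print(lines.count())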

CORB Batch process output Report Extraction issue

While running the CORB job, I am extracting 100,000 URIs and loading the data into one file on a Linux server. The expectation is that all the output records should be stored in one file with a 100k count. However, the data was stored in multiple files with different counts. Can anyone help me out with the root cause of why the CORB process is creating multiple files in the output directory?
Please find below the details of the CORB properties file that I configured in my local directory.
Properties file:
THREAD-COUNT=4
PROCESS-TASK=com.marklogic.developer.corb.extension.ResilientTransform
SSL-CONFIG-CLASS=com.marklogic.developer.corb.TwoWaySSLConfig
SSL-PROPERTIES-FILE=/eiestore/ssl-configs/common-corb-sslconfig.properties
DECRYPTER=com.marklogic.developer.corb.HostKeyDecrypter
MODULE-ROOT=/a/abcmodules/corb-process/
MODULES-DATABASE="abcmodules"
URIS-MODULE=corb-select-uris.xqy
XQUERY-MODULE=corb-get-process.xqy
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
PRE-BATCH-TASK=com.marklogic.developer.corb.PreBatchUpdateFileTask
EXPORT-FILE-TOP-CONTENT=Id,value,type
EXPORT-FILE-DIR=/a/b/c/d/
