How can we change the filename of Spark output? [sparkJava] - apache-spark

I want to change the Spark output filename. Is there any way to do that? Spark is writing to AWS S3.

The code below works for both HDFS and S3.
It defines a rename function that takes a path and the new base name. If the directory contains multiple files, it simply appends a sequence number to each file name, e.g. json_data_1.json.
import org.apache.hadoop.fs._
import java.net.URI

// Convert Hadoop's RemoteIterator to a Scala Iterator
implicit def convertToScalaIterator[T](remoteIterator: RemoteIterator[T]): Iterator[T] = {
  case class wrapper(remoteIterator: RemoteIterator[T]) extends Iterator[T] {
    override def hasNext: Boolean = remoteIterator.hasNext
    override def next(): T = remoteIterator.next()
  }
  wrapper(remoteIterator)
}

def fs(path: String) = FileSystem.get(URI.create(path), spark.sparkContext.hadoopConfiguration)

// Rename every data file under `path` to <name>_<index>.<extension>
def rename(path: String, name: String) = {
  fs(path)
    .listFiles(new Path(path), true)
    .toList
    .filter(_.isFile)
    .map(_.getPath)
    .filterNot(_.toString.contains("_SUCCESS"))
    .zipWithIndex
    .map(p => fs(p._1.toString).rename(p._1, new Path(s"${p._1.getParent}/${name}_${p._2}.${p._1.toString.split("\\.")(1)}")))
}
scala> val path = "/tmp/samplea"
path: String = /tmp/samplea
scala> df.repartition(5).write.format("json").mode("overwrite").save(path)
scala> s"ls -ltr ${path}".!
total 8
-rw-r--r-- 1 sriniva wheel 0 Jun 6 13:57 part-00000-607ffd5e-7d28-4331-9a69-de36254c80b1-c000.json
-rw-r--r-- 1 sriniva wheel 282 Jun 6 13:57 part-00001-607ffd5e-7d28-4331-9a69-de36254c80b1-c000.json
-rw-r--r-- 1 sriniva wheel 0 Jun 6 13:57 _SUCCESS
scala> rename(path,"json_data")
res193: List[Boolean] = List(true, true)
scala> s"ls -ltr ${path}".!
total 8
-rw-r--r-- 1 sriniva wheel 0 Jun 6 13:57 json_data_0.json
-rw-r--r-- 1 sriniva wheel 282 Jun 6 13:57 json_data_1.json
-rw-r--r-- 1 sriniva wheel 0 Jun 6 13:57 _SUCCESS
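For PySpark users, roughly the same idea can be sketched through the JVM gateway, since PySpark can reach the same Hadoop FileSystem API. This is an illustration only, not part of the original answer: it assumes an active SparkSession named spark, relies on the internal _jvm/_jsc handles, and handles the extension as simplistically as the Scala version.

def rename_output(path, name):
    jvm = spark.sparkContext._jvm                         # py4j gateway to the JVM (internal API)
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_path = jvm.org.apache.hadoop.fs.Path(path)
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(jvm.java.net.URI.create(path), conf)
    # Keep only data files, skipping the _SUCCESS marker
    files = [s.getPath() for s in fs.listStatus(hadoop_path)
             if s.isFile() and "_SUCCESS" not in s.getPath().getName()]
    for i, src in enumerate(files):
        ext = src.getName().rsplit(".", 1)[-1]
        dst = jvm.org.apache.hadoop.fs.Path("{}/{}_{}.{}".format(src.getParent().toString(), name, i, ext))
        fs.rename(src, dst)

Note that on S3 a "rename" is implemented as copy-and-delete, so this can be slow for large outputs.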

Related

Download a list of photos from CSV

I am trying to download a list of photos from a CSV file. I want to save all the photos from each link column into a folder with the same name as that column.
Expected:
CSV file:
1 a-Qc01o78E a-rrBuci0w a-qj8s5nlM a-Cwciy2zx
0 2 https://photo.yupoo.com/ven-new/9f13389c/big.jpg https://photo.yupoo.com/ven-new/8852c424/big.jpg https://photo.yupoo.com/ven-new/650d84fd/big.jpg https://photo.yupoo.com/ven-new/a99f9e52/big.jpg
1 3 https://photo.yupoo.com/ven-new/f0adc019/big.jpg https://photo.yupoo.com/ven-new/c434624c/big.jpg https://photo.yupoo.com/ven-new/bed9125c/big.jpg https://photo.yupoo.com/ven-new/2d0b7a67/big.jpg
2 4 https://photo.yupoo.com/ven-new/8844627a/big.jpg https://photo.yupoo.com/ven-new/edda4ec4/big.jpg https://photo.yupoo.com/ven-new/3283fe57/big.jpg https://photo.yupoo.com/ven-new/1f6425e5/big.jpg
3 5 https://photo.yupoo.com/ven-new/eeb8b78a/big.jpg https://photo.yupoo.com/ven-new/6cdcbbf7/big.jpg https://photo.yupoo.com/ven-new/f64ca040/big.jpg https://uvd.yupoo.com/ven-new/22259049_oneTrue.jpg
4 6 https://photo.yupoo.com/ven-new/9c3e9a92/big.jpg https://photo.yupoo.com/ven-new/ea257725/big.jpg https://photo.yupoo.com/ven-new/64b5a57f/big.jpg https://uvd.yupoo.com/ven-new/22257899_oneTrue.jpg
5 7 https://photo.yupoo.com/ven-new/baaf8945/big.jpg https://photo.yupoo.com/ven-new/e9cc7392/big.jpg https://photo.yupoo.com/ven-new/753418b5/big.jpg https://uvd.yupoo.com/ven-new/22257619_oneTrue.jpg
6 8 https://photo.yupoo.com/ven-new/a325f44a/big.jpg https://photo.yupoo.com/ven-new/671be145/big.jpg https://photo.yupoo.com/ven-new/1742a09d/big.jpg https://photo.yupoo.com/ven-new/c9d0aa0f/big.jpg
7 9 https://photo.yupoo.com/ven-new/367cf72a/big.jpg https://photo.yupoo.com/ven-new/ae7f1b1b/big.jpg https://photo.yupoo.com/ven-new/d0ef54ed/big.jpg https://photo.yupoo.com/ven-new/2d0905df/big.jpg
8 10 https://photo.yupoo.com/ven-new/3fcacff3/big.jpg https://photo.yupoo.com/ven-new/c4ea9b1e/big.jpg https://photo.yupoo.com/ven-new/683db958/big.jpg https://photo.yupoo.com/ven-new/3b065995/big.jpg
9 11 https://photo.yupoo.com/ven-new/c7de704a/big.jpg https://photo.yupoo.com/ven-new/92abc9ea/big.jpg https://photo.yupoo.com/ven-new/bd1083db/big.jpg https://photo.yupoo.com/ven-new/a9086d26/big.jpg
10 12 https://photo.yupoo.com/ven-new/fc481727/big.jpg https://photo.yupoo.com/ven-new/a49c94df/big.jpg
11 13 https://photo.yupoo.com/ven-new/1e0e0e10/big.jpg https://photo.yupoo.com/ven-new/62580909/big.jpg
12 14 https://photo.yupoo.com/ven-new/934b423e/big.jpg https://photo.yupoo.com/ven-new/74b81853/big.jpg
13 15 https://photo.yupoo.com/ven-new/adf878b2/big.jpg https://photo.yupoo.com/ven-new/5ad881c3/big.jpg
14 16 https://photo.yupoo.com/ven-new/59dc1203/big.jpg https://photo.yupoo.com/ven-new/3cd676ac/big.jpg
15 17 https://photo.yupoo.com/ven-new/6d8eb080/big.jpg
16 18 https://photo.yupoo.com/ven-new/9a027ada/big.jpg
17 19 https://photo.yupoo.com/ven-new/bdeaf1b5/big.jpg
18 20 https://photo.yupoo.com/ven-new/1f293683/big.jpg
def get_photos(x):
    with requests.Session() as c:
        df = pd.read_csv("C:\\Users\\Lukasz\\Desktop\\PROJEKTY PYTHON\\W TRAKCIE\\YUOPOO SCRAPER\\FF_data_frame.csv")
        NazwyKolum = pd.read_csv("C:\\Users\\Lukasz\\Desktop\\PROJEKTY PYTHON\\W TRAKCIE\\bf3_strona.csv")
        c.get('https://photo.yupoo.com/')
        c.headers.update({'referer': 'https://photo.yupoo.com/'})
        URLS = (df[NazwyKolum['LINKS'][x]])  # .to_string(index=False))
        print(URLS)  # prints a list of links, e.g. https://photo.yupoo.com/ven-new/650d84fd/big.jpg, https://photo.yupoo.com/ven-new/bed9125c/big.jpg
        # proxies = {'https': 'http://45.77.76.254:8080'}
        res = c.get(URLS, timeout=None)
        if res.status_code == 200:
            return res.content

try:
    for x in range(2, 50):
        with open("C:\\Users\\Lukasz\\Desktop\\PROJEKTY PYTHON\\W TRAKCIE\\YUOPOO SCRAPER\\" + str(x) + 'NAZWAZDECIA.jpg', 'wb') as f:
            f.write(get_photos(x))
except:
    print("wr")
To be honest, I do not know how to handle it anymore; I have wasted a lot of time with no progress. Thank you, guys.
import pandas as pd
import requests, os

df = pd.read_csv("a.csv")

def create_directory(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)

def download_save(url, folder):
    create_directory(folder)
    res = requests.get(url)
    with open(f'{folder}/{url.split("/")[-2]}.jpg', 'wb') as f:
        f.write(res.content)

for col in df.columns:
    print(col)
    for url in df[col].tolist():
        print(url)
        if str(url).startswith("http"):
            download_save(url, col)
The above code creates a directory for each column name, downloads the images, and saves each one under the unique name taken from its URL.
Here a.csv is as follows:
a-Qc01o78E,a-rrBuci0w,a-qj8s5nlM,a-Cwciy2zx
https://photo.yupoo.com/ven-new/9f13389c/big.jpg,https://photo.yupoo.com/ven-new/8852c424/big.jpg,https://photo.yupoo.com/ven-new/650d84fd/big.jpg,https://photo.yupoo.com/ven-new/a99f9e52/big.jpg
https://photo.yupoo.com/ven-new/f0adc019/big.jpg,https://photo.yupoo.com/ven-new/c434624c/big.jpg,https://photo.yupoo.com/ven-new/bed9125c/big.jpg,https://photo.yupoo.com/ven-new/2d0b7a67/big.jpg
https://photo.yupoo.com/ven-new/8844627a/big.jpg,https://photo.yupoo.com/ven-new/edda4ec4/big.jpg,https://photo.yupoo.com/ven-new/3283fe57/big.jpg,https://photo.yupoo.com/ven-new/1f6425e5/big.jpg
https://photo.yupoo.com/ven-new/eeb8b78a/big.jpg,https://photo.yupoo.com/ven-new/6cdcbbf7/big.jpg,https://photo.yupoo.com/ven-new/f64ca040/big.jpg,https://uvd.yupoo.com/ven-new/22259049_oneTrue.jpg
https://photo.yupoo.com/ven-new/9c3e9a92/big.jpg,https://photo.yupoo.com/ven-new/ea257725/big.jpg,https://photo.yupoo.com/ven-new/64b5a57f/big.jpg,https://uvd.yupoo.com/ven-new/22257899_oneTrue.jpg
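If the host rejects plain downloads, the referer trick from the question's own code can be folded into this answer. A minimal sketch, reusing a single session (the header value comes from the question and may not be required for every host):

import os
import pandas as pd
import requests

session = requests.Session()
session.headers.update({'referer': 'https://photo.yupoo.com/'})

def download_save(url, folder):
    os.makedirs(folder, exist_ok=True)
    res = session.get(url, timeout=30)
    res.raise_for_status()
    # Name the file after the second-to-last URL segment, as in the answer above
    with open(os.path.join(folder, url.split('/')[-2] + '.jpg'), 'wb') as f:
        f.write(res.content)

df = pd.read_csv("a.csv")
for col in df.columns:
    for url in df[col].dropna().astype(str):
        if url.startswith("http"):
            download_save(url, col)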

Reading a text file with multiple headers in Spark

I have a text file with multiple headers, where the "TEMP" column has the average temperature for the day, followed by the number of recordings. How can I read this text file properly to create a DataFrame?
STN--- WBAN YEARMODA TEMP
010010 99999 20060101 33.5 23
010010 99999 20060102 35.3 23
010010 99999 20060103 34.4 24
STN--- WBAN YEARMODA TEMP
010010 99999 20060120 35.2 22
010010 99999 20060121 32.2 21
010010 99999 20060122 33.0 22
You can:
1. Read the text file as a normal text file into an RDD.
2. Split each line on the separator (let's assume it's a space).
3. Take the first row as the header and remove all lines equal to the header.
4. Convert the RDD to a DataFrame using .toDF(col_names).
Like this:
rdd = sc.textFile("path/to/file.txt").map(lambda x: x.split(" ")) # step 1 & 2
headers = rdd.first() # Step 3
rdd2 = rdd.filter(lambda x: x != headers)
df = rdd2.toDF(headers) # Step 4
You can also try this; I have tested it in the spark-shell console.
val x = sc.textFile("hdfs path of text file")
val header = x.first()
val y = x.filter(line => !line.contains("STN")) // this removes all the repeated header lines
val df = y.toDF(header)
Hope this works for you.
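For comparison, here is a sketch using the DataFrame API directly. The column names come from the sample header; the extra fifth field is given the made-up name COUNT, since the question says TEMP is "followed by the number of recordings" (an active SparkSession named spark is assumed):

from pyspark.sql import functions as F

lines = spark.read.text("path/to/file.txt")
data = lines.filter(~F.col("value").startswith("STN"))   # drop every repeated header row
parts = F.split(F.trim(F.col("value")), r"\s+")           # the sample is whitespace-separated
df = data.select(
    parts.getItem(0).alias("STN"),
    parts.getItem(1).alias("WBAN"),
    parts.getItem(2).alias("YEARMODA"),
    parts.getItem(3).cast("double").alias("TEMP"),
    parts.getItem(4).cast("int").alias("COUNT"),          # the recordings count shown after TEMP
)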

Pull the last 24 hours of logs and clean them (Python 3.x)

I have three files: two .gz files and one .log file. These files are pretty big; below is a sample copy of my original data. I want to extract the entries that correspond to the last 24 hours.
a.log.1.gz
2018/03/25-00:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/25-24:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/26-00:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/26-10:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/26-15:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log.2.gz
2018/03/26-20:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/26-24:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/27-00:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/27-10:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
I am getting the result below, but it is not cleaned.
result.txt
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
The code below pulls the last 24 hours of lines.
from datetime import datetime, timedelta
import glob
import gzip
from pathlib import Path
import shutil

def open_file(path):
    if Path(path).suffix == '.gz':
        return gzip.open(path, mode='rt', encoding='utf-8')
    else:
        return open(path, encoding='utf-8')

def parsed_entries(lines):
    for line in lines:
        yield line.split(' ', maxsplit=1)

def earlier():
    return (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S')

def get_files():
    return ['a.log'] + list(reversed(sorted(glob.glob('a.log.*'))))

output = open('output.log', 'w', encoding='utf-8')

files = get_files()
cutoff = earlier()

for i, path in enumerate(files):
    with open_file(path) as f:
        lines = parsed_entries(f)
        # Assumes that your files are not empty
        date, line = next(lines)
        if cutoff <= date:
            # Skip files that can just be appended to the output later
            continue
        for date, line in lines:
            if cutoff <= date:
                # We've reached the first entry of our file that should be
                # included
                output.write(line)
                break
        # Copies from the current position to the end of the file
        shutil.copyfileobj(f, output)
        break
else:
    # In case ALL the files are within the last 24 hours
    i = len(files)

for path in reversed(files[:i]):
    with open_file(path) as f:
        # Assumes that your files have trailing newlines.
        shutil.copyfileobj(f, output)

# Cleanup, it would get closed anyway when garbage collected or process exits.
output.close()
I want to use the function below to clean the lines:
import nltk
from nltk.stem import WordNetLemmatizer

def _clean_logs(line):
    # noinspection SpellCheckingInspection
    lemmatizer = WordNetLemmatizer()
    clean_line = line.strip()
    clean_line = clean_line.lstrip('0123456789.- ')
    cleaned_log = " ".join(
        [lemmatizer.lemmatize(word) for word in nltk.word_tokenize(clean_line)])
    cleaned_log = cleaned_log.replace('"', ' ')
    return cleaned_log
Now, I want to use the above clean function to clean the dirty data. I am not sure how to use it while pulling the last 24 hours. I want to make it memory efficient as well as fast.
To answer the "make it memory efficient" part of your question, you can use the re module instead of replace:
import re

for line in lines:
    line = re.sub('[T:-]', '', line)
This keeps your code simpler and performs better than chaining several replace calls.
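To actually hook the cleaning function into the 24-hour extraction, one hedged option (assuming _clean_logs is defined as in the question) is to swap the raw shutil.copyfileobj(f, output) copies for a small helper that cleans each remaining line as it streams through, which keeps memory use constant:

def copy_cleaned(fileobj, output):
    # Stream line by line; never load a whole file into memory.
    for raw_line in fileobj:
        cleaned = _clean_logs(raw_line)
        if cleaned:
            output.write(cleaned + '\n')

Each shutil.copyfileobj(f, output) call in the question's loop would then become copy_cleaned(f, output), and the single output.write(line) would become output.write(_clean_logs(line) + '\n'). Cleaning every line is of course slower than the raw byte copy, so this trades some speed for cleaned output.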

Extracting data from web page to CSV file, only last row saved

I'm facing the following challenge: I want to get all the financial data about companies, and I wrote code that does it. Let's say the result looks like this:
Unnamed: 0 I Q 2017 II Q 2017 \
0 Przychody netto ze sprzedaży (tys. zł) 137 134
1 Zysk (strata) z działal. oper. (tys. zł) -423 -358
2 Zysk (strata) brutto (tys. zł) -501 -280
3 Zysk (strata) netto (tys. zł)* -399 -263
4 Amortyzacja (tys. zł) 134 110
5 EBITDA (tys. zł) -289 -248
6 Aktywa (tys. zł) 27 845 26 530
7 Kapitał własny (tys. zł)* 22 852 22 589
8 Liczba akcji (tys. szt.) 13 921,975 13 921,975
9 Zysk na akcję (zł) -0029 -0019
10 Wartość księgowa na akcję (zł) 1641 1623
11 Raport zbadany przez audytora N N
but repeated 464 times, once per company.
Unfortunately, when I try to save all 464 results to one CSV file, only the last result gets saved; not all 464 results, just one... Could you help me save them all? Below is my code.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.bankier.pl/gielda/notowania/akcje'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')

# Find the first table on the page
t = soup.find_all('table')[0]

# Read the table into a Pandas DataFrame
df = pd.read_html(str(t))[0]

# Get the company names
names_of_company = df["Walor AD"].values

links_to_financial_date = []

# Build all links from the names of the companies
links = []
for i in range(len(names_of_company)):
    new_string = 'https://www.bankier.pl/gielda/notowania/akcje/' + names_of_company[i] + '/wyniki-finansowe'
    links.append(new_string)

############################################################################

for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
    page2 = requests.get(url2)
    soup = BeautifulSoup(page2.content, 'lxml')
    # Find the first table on the page
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    df2.to_csv('output.csv', index=False, header=None)
You've almost got it. You're just overwriting your CSV each time. Replace
df2.to_csv('output.csv', index=False, header=None)
with
with open('output.csv', 'a') as f:
    df2.to_csv(f, header=False)
in order to append to the CSV instead of overwriting it.
Also, your example doesn't work because this:
for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
should be:
for i in links:
    url2 = i
When the website has no data, skip and move on to the next one:
try:
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    with open('output.csv', 'a') as f:
        df2.to_csv(f, header=False)
except:
    pass
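Putting the pieces together, a minimal sketch of the fixed loop could look like this. Collecting the tables and writing once at the end with pd.concat is an alternative to appending inside the loop, not part of the original answer:

import pandas as pd
import requests
from bs4 import BeautifulSoup

frames = []
for link in links:                       # `links` is built exactly as in the question
    page2 = requests.get(link)
    soup2 = BeautifulSoup(page2.content, 'lxml')
    tables = soup2.find_all('table')
    if not tables:                       # company page with no financial data
        continue
    frames.append(pd.read_html(str(tables[0]))[0])

pd.concat(frames).to_csv('output.csv', index=False)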

Issues posting data with locust file

import random
import string

def post_driver(l):
    def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
        return ''.join(random.choice(chars) for _ in range(size))

    data = {"startLocation":{"address":{"fullAddress":"\n, ","fullAddressInLine":", , ","line1":"10407 Water Hyacinth Dr","city":"Orlando","state":"FL","zip":"32825"},"latitude":28.52144,"longitude":-81.23301},"endLocation":{"address":{"fullAddress":"\n, ","fullAddressInLine":", , ","line1":"1434 N Alafaya Trail","city":"Orlando","state":"FL","zip":"32819"},"latitude":28.52144,"longitude":-81.23301},"user":{"firstName":"Test" + id_generator(),"lastName":"doe","Username":"","Password":"","userId":"","role":"","email":"zbaker#productivityapex.com","username":"Test" + id_generator() + "","newPassword":"test"},"vehicle":{"vehicleNumber":"" + id_generator(),"licensePlate":"3216549877","maxCapacity":200000,"maxCapacityUnits":"units","costPerMile":3500,"fixedCost":35000,"costPerHour":35,"odometer":0,"tags":[{"text":"vehicle"}],"constraintTags":[],"id":"59fcca46520a6e2bb4a397ed"},"earliestStartTimeDate":"Mon Nov 13 2017 07:00:00 GMT-0500 (Eastern Standard Time)","restBreakDurationDate":"Mon Nov 13 2017 01:00:00 GMT-0500 (Eastern Standard Time)","maximumWorkingHours":9,"maximumDrivingHours":8,"fixedCost":0,"costPerMile":0,"costPerHour":0,"restBreakWindows":[{"startTime":"2017-11-03T19:30:35.275Z","endTime":"2017-11-04T05:00:35.275Z"}],"color":"#1c7c11","tags":[{"text":"test"}],"driverNumber":"" + id_generator(),"phoneNumber":"7894651231","earliestStartTime":"07:00:00","restBreakDuration":"01:00:00","customFields":{}}

    # json_str = json.dumps(data)
    r = l.client.post('/api/v1/drivers', data=str(data))
    # content = r.json()
    # return content["id"]
I am currently having issues with posting the data variable and having it succeed in my locust file. I've done this with other locust files with similar formats, but it's failing every time on this particular request... can someone please help me?
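One thing worth checking (an observation, not a confirmed fix for this particular API): data=str(data) sends the Python repr of the dict, with single quotes and no application/json content type, which most JSON endpoints reject. Since Locust's client wraps requests, the body can be sent as real JSON instead, for example:

import json

# Serialize explicitly and declare the content type...
r = l.client.post('/api/v1/drivers', data=json.dumps(data),
                  headers={'Content-Type': 'application/json'})

# ...or let the requests-based client handle it via the json= keyword.
r = l.client.post('/api/v1/drivers', json=data)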
