I have a text file with multiple headers, where the "TEMP" column holds the average temperature for the day, followed by the number of recordings. How can I read this text file properly to create a DataFrame?
STN--- WBAN YEARMODA TEMP
010010 99999 20060101 33.5 23
010010 99999 20060102 35.3 23
010010 99999 20060103 34.4 24
STN--- WBAN YEARMODA TEMP
010010 99999 20060120 35.2 22
010010 99999 20060121 32.2 21
010010 99999 20060122 33.0 22
You can read the text file in as a normal text file into an RDD (step 1).
You have a separator in the text file; let's assume it's whitespace (step 2: split on it).
Then you can remove the headers from it (step 3): filter out every line that equals the header.
Then convert the RDD to a DataFrame using .toDF(col_names) (step 4).
Like this:
rdd = sc.textFile("path/to/file.txt").map(lambda x: x.split())  # steps 1 & 2: split on whitespace, since the sample uses runs of spaces
headers = rdd.first()                                           # step 3: grab the header row
rdd2 = rdd.filter(lambda x: x != headers)                       # keep only the data lines
df = rdd2.toDF(headers)                                         # step 4: convert to a DataFrame
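One caveat: in the sample above each data row has five fields but the header names only four, so toDF(headers) will fail with a length mismatch. A minimal sketch of one way around that, assuming the fifth column is the recording count (the "COUNT" name is made up here):

header = rdd.first()                      # ['STN---', 'WBAN', 'YEARMODA', 'TEMP']
col_names = header + ["COUNT"]            # hypothetical name for the unnamed count column
rows = rdd.filter(lambda x: x != header)  # drop every repeated header line
df = rows.toDF(col_names)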
You can try this out; I have tried it in the console.
val x = sc.textFile("hdfs path of the text file")
val header = x.first()
val y = x.filter(line => !line.contains("STN")) // this will remove all the header lines
val df = y.toDF(header) // single-column DataFrame of raw lines; split the lines first if you need separate columns
Hope this works for you.
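One note: toDF on an RDD is only available after importing the implicits (import spark.implicits._ on Spark 2.x, or import sqlContext.implicits._ on older versions).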
I want to extract patterns from a text file and create a pandas DataFrame.
Each line inside the text file looks like this:
2022-07-01,08:00:57.853, +12-34 = 1.11 (0. AA), a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89
I want to extract the following patterns:
+12-34, 1.11, a=0, b=1 cct= p=0 f=0 r=0 pb=0 pbb=0 prr=2569 du=89, where cols={id,res,a,b,p,f,r,pb,pbb,prr,du}.
I have written the following code to extract the patterns and create the DataFrame. The file is around 500 MB and contains a huge number of rows.
import glob
import re
import pandas as pd

files = glob.glob(path_torawfolder + "*.txt")
lines = []
for fle in files:
    with open(fle) as f:
        items = {}
        lines += f.readlines()

df = pd.DataFrame()
for l in lines:
    feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
    feature_dict["res"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
    dct = {k: [v] for k, v in feature_dict.items()}
    series = pd.DataFrame(dct)
    #print(series)
    df = pd.concat([df, series], ignore_index=True)
Any suggestions to optimize the code and reduce the processing time, please?
Thanks!
A bit of improvement: in the previous code there were a few unnecessary conversions from dict to DataFrame.
dicts = []

def create_dataframe():
    df = pd.DataFrame()
    for l in lines:
        feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
        feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
        feature_dict["id"] = (l.split("+")[-1]).split(" =")[0]
        feature_dict["res"] = re.findall(r'(\d\.\d{2})', feature_interest)[0]
        dicts.append(feature_dict)
    df = pd.DataFrame(dicts)
    return df
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           def create_dataframe():
     9         1        551.0      551.0     0.0      df = pd.DataFrame()
    10   1697339     727220.0        0.4     1.7      for l in lines:
    11   1697338    1706328.0        1.0     4.0          feature_interest = (l.split("+")[-1]).split("= ", 1)[-1]
    12   1697338   20857891.0       12.3    49.1          feature_dict = dict(re.findall(r'(\S+)=(\w+)', feature_interest))
    13
    14   1697338    1987874.0        1.2     4.7          feature_dict["ctry_provider"] = (l.split("+")[-1]).split(" =")[0]
    15   1697338    9142820.0        5.4    21.5          feature_dict["acpa_codes"] = re.findall(r'(\d\.\d{2})',feature_interest)[0]
    16   1697338    1039880.0        0.6     2.4          dicts.append(feature_dict)
    17
    18         1    7025303.0  7025303.0    16.5      df = pd.DataFrame(dicts)
    19         1          2.0        2.0     0.0      return df
This improvement reduced the computation to a few minutes. Any more suggestions for optimizing it with dask or parallel computing?
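Since you ask about dask: here is a minimal sketch of a dask.bag version of the same parsing, assuming a hypothetical logs/*.txt glob for the input files. dask reads and parses the lines in parallel across cores, which attacks the re.findall line that takes ~49% of the profile; the regexes are precompiled as well:

import re
import dask.bag as db

# precompiled patterns (re caches compiles internally, but this makes the cost explicit)
KV_RE = re.compile(r'(\S+)=(\w+)')
RES_RE = re.compile(r'(\d\.\d{2})')

def parse(line):
    # same logic as create_dataframe, applied to one line
    tail = line.split("+")[-1]
    feature_interest = tail.split("= ", 1)[-1]
    d = dict(KV_RE.findall(feature_interest))
    d["id"] = tail.split(" =")[0]
    m = RES_RE.search(feature_interest)
    d["res"] = m.group(1) if m else None
    return d

bag = db.read_text("logs/*.txt").map(parse)  # lazily parses lines in parallel
df = bag.to_dataframe().compute()            # collect into one pandas DataFrame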
I have this dataframe:
utc arc_time_s tec_tecu elevation_deg lat_e_deg lon_e_deg
01.01.2018 01:19 54 3.856 17.35 57.44 25.02
01.01.2018 01:19 53 4.021 17.29 57.47 25.03
01.01.2018 01:19 52 4.029 17.22 57.51 25.05
01.01.2018 01:19 51 4.015 17.15 57.54 25.07
01.01.2018 01:19 50 3.997 17.08 57.57 25.09
What I want is to expand the DataFrame based on the lat_e_deg column so that it contains every value at a decimal scale of 2, i.e. in steps of 0.01.
I found the resample method, but it seems it can only be used on a datetime column.
So as an output I want to have something like this:
How can I do this?
import pandas as pd
import numpy as np

# reconstruct part of your DataFrame for testing purposes:
df = pd.DataFrame([[17.35, 57.44], [17.29, 57.47], [17.22, 57.51]],
                  columns=['elevation_deg', 'lat_e_deg'])

# create a Series of the desired stepwise values:
lat_e_deg_expanded = pd.Series(np.arange(start=min(df['lat_e_deg']),
                                         stop=max(df['lat_e_deg']),
                                         step=0.01),
                               name='lat_e_deg')

# merge the expanded series with the original DataFrame and sort:
df_expanded = pd.merge(df, lat_e_deg_expanded,
                       on='lat_e_deg',
                       how='outer')
df_expanded.sort_values(by='lat_e_deg', inplace=True)
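One caveat with this (not in the original answer): np.arange excludes the stop value and accumulates floating-point error, so a generated 57.45 can come out as 57.449999... and fail to match in the merge. Rounding the expanded series, and nudging the stop past the maximum, guards against both:

# round so the generated grid matches the original two-decimal values exactly,
# and extend the stop so the maximum is included (np.arange excludes the stop)
lat_e_deg_expanded = pd.Series(np.arange(start=min(df['lat_e_deg']),
                                         stop=max(df['lat_e_deg']) + 0.01,
                                         step=0.01),
                               name='lat_e_deg').round(2)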
You can create a pd.Series with step = 0.01 and then join it to the original DataFrame.
Example code, assuming df is a DataFrame with missing decimal values:
ts = pd.Series(np.arange(start = 57.44, stop = 57.57, step=0.01), name = "t")
df = pd.DataFrame({'t': [57.44, 57.47, 57.57]})
df2 = pd.merge(ts, df, how = "left").sort_values("t")
Result:
t
0 57.44
1 57.45
2 57.46
3 57.47
4 57.48
5 57.49
6 57.50
7 57.51
8 57.52
9 57.53
10 57.54
11 57.55
12 57.56
13 57.57
I'm faced with the following challenge: I want to get all the financial data about companies, and I wrote code that does it. Let's say the result looks like the one below:
Unnamed: 0 I Q 2017 II Q 2017 \
0 Przychody netto ze sprzedaży (tys. zł) 137 134
1 Zysk (strata) z działal. oper. (tys. zł) -423 -358
2 Zysk (strata) brutto (tys. zł) -501 -280
3 Zysk (strata) netto (tys. zł)* -399 -263
4 Amortyzacja (tys. zł) 134 110
5 EBITDA (tys. zł) -289 -248
6 Aktywa (tys. zł) 27 845 26 530
7 Kapitał własny (tys. zł)* 22 852 22 589
8 Liczba akcji (tys. szt.) 13 921,975 13 921,975
9 Zysk na akcję (zł) -0029 -0019
10 Wartość księgowa na akcję (zł) 1641 1623
11 Raport zbadany przez audytora N N
but repeated 464 times.
Unfortunately, when I want to save all 464 results in one CSV file, I can save only the last result; not all 464, just one... Could you help me save them all? Below is my code.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.bankier.pl/gielda/notowania/akcje'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')

# Find the first table on the page
t = soup.find_all('table')[0]

# Read the table into a Pandas DataFrame
df = pd.read_html(str(t))[0]

# get the company names
names_of_company = df["Walor AD"].values

links_to_financial_date = []

# all links with the names of the companies
links = []
for i in range(len(names_of_company)):
    new_string = 'https://www.bankier.pl/gielda/notowania/akcje/' + names_of_company[i] + '/wyniki-finansowe'
    links.append(new_string)

############################################################################
for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
    page2 = requests.get(url2)
    soup = BeautifulSoup(page2.content, 'lxml')
    # Find the first table on the page
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    df2.to_csv('output.csv', index=False, header=None)
You've almost got it. You're just overwriting your CSV each time. Replace
df2.to_csv('output.csv', index=False, header=None)
with
with open('output.csv', 'a') as f:
    df2.to_csv(f, header=False)
in order to append to the CSV instead of overwriting it.
Also, your example doesn't work because this:
for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
should be:
for i in links:
    url2 = i
When the website has no data, skip and move on to the next one:
try:
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    with open('output.csv', 'a') as f:
        df2.to_csv(f, header=False)
except IndexError:  # no table on the page
    pass
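If you would rather not append inside the loop at all, here is a sketch of an alternative (reusing the links list and imports from the question): collect the tables in a list and write the CSV once at the end, which keeps a single header row and skips the no-data pages:

frames = []
for link in links:
    page2 = requests.get(link)
    soup2 = BeautifulSoup(page2.content, 'lxml')
    tables = soup2.find_all('table')
    if tables:  # skip companies with no financial data
        frames.append(pd.read_html(str(tables[0]))[0])

pd.concat(frames, ignore_index=True).to_csv('output.csv', index=False)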
I have a CSV file and I want to convert it to text files based on the first column, which holds the ids; each file should then contain the remaining columns. For example:
file.csv
id val1 val2 val3
1 50 52 60
2 45 84 96
and so on.
Here is my code:
dir_name = '/Users/user/My Documents/test/'

with io.open('file1.csv', 'rt', encoding='utf8') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)
    xx = []
    for row in reader:
        with open(os.path.join(dir_name, row[0] + ".txt"), 'a') as f2:
            xx = row[1:2]
            f2.write(xx + "\n")
So it should be:
1.txt
50 52 60
2.txt
45 84 96
but it only creates the files, without content.
Can anyone help me? Thanks in advance.
There were a couple of issues:
It's actually a whitespace-separated values file, not a comma-separated values file, so you have to change the delimiter from ','. Also, the whitespace is repeated, so you can pass an additional flag to the csv module.
Some funkiness with the array indexing and conversion to string.
This program meets your requirements:
#!/usr/bin/python
import io
import csv
import os

dir_name = './'

with io.open('input.csv', 'rt', encoding='utf8') as f:
    reader = csv.reader(f, skipinitialspace=True, delimiter=' ')
    next(reader)
    xx = []
    for row in reader:
        filename = os.path.join(dir_name, row[0])
        with open(filename + ".txt", 'a') as f2:
            xx = row[1:]
            f2.write(" ".join(xx) + "\n")
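For comparison, a pandas sketch of the same split (assuming the same whitespace-separated input file); groupby on the id column does the routing to one file per id:

import pandas as pd

df = pd.read_csv('input.csv', sep=r'\s+')
for id_, group in df.groupby('id'):
    # write the remaining columns, space-separated, to <id>.txt
    group.drop(columns='id').to_csv(f'{id_}.txt', sep=' ', index=False, header=False)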
The Code
def rainfallInInches():
    file_object = open('rainfalls.txt')
    list_of_cities = []
    list_of_rainfall_inches = []
    for line in file_object:
        cut_up_line = line.split()
        city = cut_up_line[0]
        rainfall_mm = int(line[len(line) - 3:])
        rainfall_inches = rainfall_mm / 25.4
        list_of_cities.append(city)
        list_of_rainfall_inches.append(rainfall_inches)
    inch_index = 0
    desired_file = open("rainfallInInches.txt", "w")
    for city in list_of_cities:
        desired_file.writelines(str((city, "{0:0.2f}".format(list_of_rainfall_inches[inch_index]))))
        inch_index += 1
    desired_file.close()
rainfalls.txt
Manchester 37
Portsmouth 9
London 5
Southampton 12
Leeds 20
Cardiff 42
Birmingham 34
Edinburgh 26
Newcastle 11
rainfallInInches.txt
This is the unwanted output
('Manchester', '1.46')('Portsmouth', '0.35')('London',
'0.20')('Southampton', '0.47')('Leeds', '0.79')('Cardiff',
'1.65')('Birmingham', '1.34')('Edinburgh', '1.02')('Newcastle',
'0.43')
My program takes the data from 'rainfalls.txt', which has rainfall information in mm, converts the mm to inches, and then writes the new information into a new file, 'rainfallInInches.txt'.
I've gotten this far, except I can't figure out how to format 'rainfallInInches.txt' to make it look like 'rainfalls.txt'.
Bear in mind that I am a student, which you probably gathered from my hacky code.
My program takes the data from 'rainfalls.txt', which has rainfall information in mm, converts the mm to inches, and then writes the new information into a new file, 'rainfallInInches.txt'.
You could separate the parsing of the input file, the conversion from mm to inches, and the final formatting for writing:
#!/usr/bin/env python

# read input
rainfall_data = []  # city, rainfall pairs
with open('rainfalls.txt') as file:
    for line in file:
        if line.strip():  # non-blank
            city, rainfall = line.split()  # no comments in the input
            rainfall_data.append((city, float(rainfall)))

def mm_to_inches(mm):
    """Convert *mm* to inches."""
    return mm * 0.039370

# write output
with open('rainfallInInches.txt', 'w') as file:
    for city, rainfall_mm in rainfall_data:
        file.write("{city} {rainfall:.2f}\n".format(city=city,
                                                    rainfall=mm_to_inches(rainfall_mm)))
rainfallInInches.txt:
Manchester 1.46
Portsmouth 0.35
London 0.20
Southampton 0.47
Leeds 0.79
Cardiff 1.65
Birmingham 1.34
Edinburgh 1.02
Newcastle 0.43
If you feel confident that each step is correct in isolation then you could combine the steps:
#!/usr/bin/env python

def mm_to_inches(mm):
    """Convert *mm* to inches."""
    return mm * 0.039370

with open('rainfalls.txt') as input_file, \
     open('rainfallInInches.txt', 'w') as output_file:
    for line in input_file:
        if line.strip():  # non-blank
            city, rainfall_mm = line.split()  # no comments
            output_file.write("{city} {rainfall:.2f}\n".format(city=city,
                                                               rainfall=mm_to_inches(float(rainfall_mm))))
It produces the same output.
First, it's better to change your parser to split each line on whitespace. With that you don't need complex logic to pull the numbers out.
After this, to print correctly, change your output to
file.write("{} {:.2f}\n".format(city, list_of_rainfall_inches[inch_index]))
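For example, Manchester's 37 mm gives 37 / 25.4 ≈ 1.46, so the line written is Manchester 1.46, matching the format of rainfalls.txt.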