I have CSV data in which I've escaped the existing backslash characters with another backslash:
year,content
2021,"\\"foo\\",bar"
I'd like to read it with Spark and display the data. The expected DataFrame is:
+----------+--------------+
| year     | content      |
+----------+--------------+
| 2021     | \"foo\",bar  |
+----------+--------------+
But when I ran this:
schema = StructType(
    [
        StructField("year", IntegerType(), False),
        StructField("content", StringType(), False),
    ]
)
df = (
    spark.read.csv(
        f"s3://path/to/csv",
        schema=schema,
        header=True
    )
)
df.show(20,False)
I'm getting this:
+----------+--------------+
| year     | content      |
+----------+--------------+
| 2021     | "\"foo\\"    |
+----------+--------------+
Any idea how to handle this properly?
If you can save your data like this (single escape instead of double escape)
year,content
2021,"\"foo\", bar"
then you can read it as you're doing and the output would be
# +----+----------+
# |year|content   |
# +----+----------+
# |2021|"foo", bar|
# +----+----------+
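If the file has already been written with doubled escapes, one way to get to the single-escape form is to rewrite it before handing it to Spark. The following is a minimal sketch in plain Python, assuming the only doubled backslashes are the ones directly in front of a quote. The helper name to_single_escape is made up for illustration. A blanket replacement like this would also change a field that legitimately ends in an escaped backslash, so check it against your own data first:

```python
def to_single_escape(line):
    """Replace each doubled backslash in front of a quote with a single one."""
    # r'\\"' is backslash, backslash, quote; r'\"' is backslash, quote.
    return line.replace(r'\\"', r'\"')


# The data row from the question, written as a raw string.
sample = r'2021,"\\"foo\\",bar"'
fixed = to_single_escape(sample)
print(fixed)
```

The same function could be applied line by line while copying the file, leaving the header row untouched because it contains no backslashes.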
The error I am getting:
invalid string interpolation: `$$', `$'ident or `$'BlockExpr expected
Spark SQL:
val sql =
s"""
|SELECT
|  CAC.engine
| ,CAC.user_email
| ,CAC.submit_time
| ,CAC.end_time
| ,CAC.duration
| ,CAC.counter_name
| ,CAC.counter_value
| ,CAC.usage_hour
| ,CAC.event_date
|FROM
| xyz.command AS CAC
| INNER JOIN
| (
| SELECT DISTINCT replace(split(get_json_object(metadata_payload, '$.configuration.name'), '_')[1], 'acc', '') AS account_id
| FROM xyz.metadata
| ) AS QCM
| ON QCM.account_id = CAC.account_id
|WHERE
| CAC.event_date BETWEEN '2019-10-01' AND '2019-10-05'
|""".stripMargin
val df = spark.sql(sql)
df.show(10, false)
You added the s prefix, which means you want the string to be interpolated: every token prefixed with $ is replaced with the local variable of the same name. Here the $ in '$.configuration.name' is not followed by an identifier, hence the error. From your code it looks like you do not use this feature, so you could just remove the s prefix from the string:
val sql =
"""
|SELECT
|  CAC.engine
| ,CAC.user_email
| ,CAC.submit_time
| ,CAC.end_time
| ,CAC.duration
| ,CAC.counter_name
| ,CAC.counter_value
| ,CAC.usage_hour
| ,CAC.event_date
|FROM
| xyz.command AS CAC
| INNER JOIN
| (
| SELECT DISTINCT replace(split(get_json_object(metadata_payload, '$.configuration.name'), '_')[1], 'acc', '') AS account_id
| FROM xyz.metadata
| ) AS QCM
| ON QCM.account_id = CAC.account_id
|WHERE
| CAC.event_date BETWEEN '2019-10-01' AND '2019-10-05'
|""".stripMargin
Otherwise, if you really need interpolation, you have to escape the $ sign by doubling it ($$), like this:
val sql =
s"""
|SELECT
|  CAC.engine
| ,CAC.user_email
| ,CAC.submit_time
| ,CAC.end_time
| ,CAC.duration
| ,CAC.counter_name
| ,CAC.counter_value
| ,CAC.usage_hour
| ,CAC.event_date
|FROM
| xyz.command AS CAC
| INNER JOIN
| (
| SELECT DISTINCT replace(split(get_json_object(metadata_payload, '$$.configuration.name'), '_')[1], 'acc', '') AS account_id
| FROM xyz.metadata
| ) AS QCM
| ON QCM.account_id = CAC.account_id
|WHERE
| CAC.event_date BETWEEN '2019-10-01' AND '2019-10-05'
|""".stripMargin
I have a .txt file as in the example reported below. I would like to convert it into a .csv table, but I'm not having much success.
Mack3 Line Item Journal Time 14:22:33 Date 03.10.2015
Panteni Ledger 1L TGEPIO00/CANTINAOAS Page 20.001
--------------------------------------------------------------------------------------------------------------------------------------------
| Pstng Date|Entry Date|DocumentNo|Itm|Doc..Date |BusA|PK|SG|Sl|Account |User Name |LCurr| Amount in LC|Tx|Assignment |S|
|------------------------------------------------------------------------------------------------------------------------------------------|
| 07.01.2014|07.02.2014|4919005298| 36|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 0,85 | |20140107 | |
| 07.01.2014|07.02.2014|4919065298| 29|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 2,53 | |20140107 | |
| 07.01.2014|07.02.2014|4919235298| 30|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 30,00 | |20140107 | |
| 07.01.2014|07.02.2014|4119005298| 32|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 1,00 | |20140107 | |
| 07.01.2014|07.02.2014|9019005298| 34|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 11,10 | |20140107 | |
|------------------------------------------------------------------------------------------------------------------------------------------|
The file in question is structured as a report from SAP. While practicing with Python and looking through other posts, I found this code:
with open('file.txt', 'rb') as f_input:
    for line in filter(lambda x: len(x) > 2 and x[0] == '|' and x[1].isalpha(), f_input):
        header = [cols.strip() for cols in next(csv.reader(StringIO(line), delimiter='|', skipinitialspace=True))][1:-1]
        break
with open('file.txt', 'rb') as f_input, open(str(ii + 1) + 'output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    for line in filter(lambda x: len(x) > 2 and x[0] == '|' and x[1] != '-' and not x[1].isalpha(), f_input):
        csv_input = csv.reader(StringIO(line), delimiter='|', skipinitialspace=True)
        csv_output.writerow(csv_input)
Unfortunately it does not work in my case: it creates empty .csv files and does not seem to read csv_input properly.
Is there a possible solution?
Your input file can be treated as CSV once we filter out a few lines, namely the ones that do not start with a pipe symbol '|' followed by a space ' ', which would leave us with this:
| Pstng Date|Entry Date|DocumentNo|Itm|Doc..Date |BusA|PK|SG|Sl|Account |User Name |LCurr| Amount in LC|Tx|Assignment |S|
| 07.01.2014|07.02.2014|4919005298| 36|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 0,85 | |20140107 | |
| 07.01.2014|07.02.2014|4919065298| 29|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 2,53 | |20140107 | |
| 07.01.2014|07.02.2014|4919235298| 30|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 30,00 | |20140107 | |
| 07.01.2014|07.02.2014|4119005298| 32|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 1,00 | |20140107 | |
| 07.01.2014|07.02.2014|9019005298| 34|07.01.2019| |81| | |60532640 |tARFooWMOND |EUR | 11,10 | |20140107 | |
Your output is mainly empty because that x[1].isalpha() check is never true on this data. The character in position 1 on each line is always a space, never alphabetic.
It's not necessary to open the input file multiple times; we can read, filter and write to the output in one go:
import csv
ii = 0
with open('file.txt', 'r', encoding='utf8', newline='') as f_input, \
        open(str(ii + 1) + 'output.csv', 'w', encoding='utf8', newline='') as f_output:
    input_lines = filter(lambda x: len(x) > 2 and x[0] == '|' and x[1] == ' ', f_input)
    csv_input = csv.reader(input_lines, delimiter='|')
    csv_output = csv.writer(f_output)
    for row in csv_input:
        csv_output.writerow(col.strip() for col in row[1:-1])
Notes:
You should not use binary mode when reading text files. Use the r and w modes, respectively, and explicitly declare the file encoding. Choose the encoding that is right for your files.
When working with the csv module, open files with newline='' (which lets the csv module handle line endings itself).
You can wrap multiple files in one with statement by putting a \ at the end of the line.
StringIO is completely unnecessary.
I'm not using skipinitialspace=True because some of the columns also have spaces at the end. Therefore I'm calling .strip() manually on each value when writing the row.
The [1:-1] is necessary to get rid of the superfluous empty columns (before the first and after the last | in the input).
Output is as follows
Pstng Date,Entry Date,DocumentNo,Itm,Doc..Date,BusA,PK,SG,Sl,Account,User Name,LCurr,Amount in LC,Tx,Assignment,S
07.01.2014,07.02.2014,4919005298,36,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"0,85",,20140107,
07.01.2014,07.02.2014,4919065298,29,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"2,53",,20140107,
07.01.2014,07.02.2014,4919235298,30,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"30,00",,20140107,
07.01.2014,07.02.2014,4119005298,32,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"1,00",,20140107,
07.01.2014,07.02.2014,9019005298,34,07.01.2019,,81,,,60532640,tARFooWMOND,EUR,"11,10",,20140107,
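The filtering and stripping steps can also be tried out without touching the disk. The sketch below runs the same idea on a shortened, made-up sample in the same layout; the name report_rows and the sample text are assumptions for illustration only. io.StringIO is used here purely as a stand-in for the opened file, not around individual lines as in the question:

```python
import csv
import io

# Shortened, made-up sample in the same layout as the SAP report.
SAMPLE = """\
Mack3 Line Item Journal
------------------------------------------
| Pstng Date|DocumentNo|Itm| Amount in LC|S|
|----------------------------------------|
| 07.01.2014|4919005298| 36|        0,85 | |
| 07.01.2014|4919065298| 29|        2,53 | |
|----------------------------------------|
"""


def report_rows(lines):
    """Yield stripped cell lists for every line that starts with '| '."""
    table_lines = (line for line in lines if line.startswith('| '))
    for row in csv.reader(table_lines, delimiter='|'):
        # Drop the empty cells before the first and after the last pipe.
        yield [cell.strip() for cell in row[1:-1]]


rows = list(report_rows(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

Passing an open file object instead of the StringIO works the same way, since both yield one line per iteration.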
I have a data set like the one below:
File: test.txt
149|898|20180405
135|379|20180428
135|381|20180406
31|898|20180429
31|245|20180430
135|398|20180422
31|448|20180420
31|338|20180421
I created a DataFrame by executing the code below.
from pyspark.sql import Row, SparkSession, SQLContext

spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
# Row takes plain keyword arguments; quoted names are a syntax error.
df_transac = spark.createDataFrame(sc.textFile("test.txt")\
    .map(lambda x: x.split("|")[:3])\
    .map(lambda r: Row(cCode=r[0], pCode=r[1], mDate=r[2])))
df_transac.show()
+-----+-----+----------+
|cCode|pCode| mDate|
+-----+-----+----------+
| 149| 898| 20180405 |
| 135| 379| 20180428 |
| 135| 381| 20180406 |
| 31| 898| 20180429 |
| 31| 245| 20180430 |
| 135| 398| 20180422 |
| 31| 448| 20180420 |
| 31| 338| 20180421 |
+-----+-----+----------+
The output of df_transac.printSchema() looks like this:
df_transac.printSchema()
root
|-- customerCode: string (nullable = true)
|-- productCode: string (nullable = true)
|-- quantity: string (nullable = true)
|-- date: string (nullable = true)
But I want to create a DataFrame restricted to my input dates, i.e. date1="20180425" and date2="20180501".
My expected output is:
+-----+-----+----------+
|cCode|pCode| mDate|
+-----+-----+----------+
| 135| 379| 20180428 |
| 31| 898| 20180429 |
| 31| 245| 20180430 |
+-----+-----+----------+
Please help: how can I achieve this?
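Here is a simple filter applied to your df. The dates are stored as yyyyMMdd strings, so comparing them as strings gives the same order as comparing them as dates: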
df_transac.where("mdate between '{}' and '{}'".format(date1,date2)).show()
+-----+-----+--------+
|cCode|pCode| mDate|
+-----+-----+--------+
| 135| 379|20180428|
| 31| 898|20180429|
| 31| 245|20180430|
+-----+-----+--------+
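Why a plain string BETWEEN is enough here can be checked without Spark. The sketch below is an illustration in plain Python, not the Spark implementation: it filters the rows from the question once by string comparison and once after parsing the values as dates. The variable names are made up, and it assumes the values carry no surrounding whitespace:

```python
from datetime import datetime

rows = [
    ("149", "898", "20180405"),
    ("135", "379", "20180428"),
    ("135", "381", "20180406"),
    ("31", "898", "20180429"),
    ("31", "245", "20180430"),
    ("135", "398", "20180422"),
    ("31", "448", "20180420"),
    ("31", "338", "20180421"),
]
date1, date2 = "20180425", "20180501"


def parse(value):
    """Parse a yyyyMMdd string into a datetime."""
    return datetime.strptime(value, "%Y%m%d")


# Lexicographic comparison on the raw strings.
by_string = [r for r in rows if date1 <= r[2] <= date2]
# The same range check on parsed dates.
by_date = [r for r in rows if parse(date1) <= parse(r[2]) <= parse(date2)]

print(by_string)
print(by_string == by_date)
```

Both filters keep the same three rows because the year, month and day fields are fixed-width and ordered from most to least significant.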
Just getting into python, and so I decided to make a hangman game. Works good, but I was wondering if there was any kind of optimizations I could make or ways to clean up the code. Also, if anyone could recommend a project that I could do next that'd be cool.
import sys
import codecs
import random
def printInterface(lst, attempts):
    """ Prints user interface which includes:
        - hangman drawing
        - word updater """
    for update in lst:
        print (update, end = '')
    if attempts == 1:
        print ("\n\n\n\n\n\n\n\n\n\n\n\t\t _____________")
    elif attempts == 2:
        print ("""
             |
             |
             |
             |
             |
             |
             |
             |
             |
       ______|______""")
    elif attempts == 3:
        print ("""
        ______
             |
             |
             |
             |
             |
             |
             |
             |
             |
       ______|______""")
    elif attempts == 4:
        print ("""
        ______
        |    |
        |    |
      (x_X)  |
             |
             |
             |
             |
             |
             |
       ______|______""")
    elif attempts == 5:
        print ("""
        ______
        |    |
        |    |
      (x_X)  |
        |    |
        |    |
        |    |
             |
             |
             |
       ______|______""")
    elif attempts == 6:
        print ("""
        ______
        |    |
        |    |
      (x_X)  |
        |    |
       /|    |
        |    |
             |
             |
             |
       ______|______""")
    elif attempts == 7:
        print ("""
        ______
        |    |
        |    |
      (x_X)  |
        |    |
       /|\   |
        |    |
             |
             |
             |
       ______|______""")
    elif attempts == 8:
        print ("""
        ______
        |    |
        |    |
      (x_X)  |
        |    |
       /|\   |
        |    |
       /     |
             |
             |
       ______|______""")
    elif attempts == 9:
        print ("""
        ______
        |    |
        |    |
      (x_X)  |
        |    |
       /|\   |
        |    |
       / \   |
             |
             |
       ______|______""")
def main():
    try:
        wordlist = codecs.open("words.txt", "r")
    except Exception as ex:
        print (ex)
        print ("\n**Could not open file!**\n")
        sys.exit(0)
    rand = random.randint(1,5)
    i = 0
    for word in wordlist:
        i+=1
        if i == rand:
            break
    word = word.strip()
    wordlist.close()
    lst = []
    for h in word:
        lst.append('_ ')
    attempts = 0
    printInterface(lst,attempts)
    while True:
        guess = input("Guess a letter: ").strip()
        i = 0
        for letters in lst:
            if guess not in word:
                print ("No '{0}' in the word, try again!".format(guess))
                attempts += 1
                break
            if guess in word[i] and lst[i] == "_ ":
                lst[i] = (guess + ' ')
            i+=1
        printInterface(lst,attempts)
        x = lst.count('_ ')
        if x == 0:
            print ("You win!")
            break
        elif attempts == 9:
            print ("You suck! You iz ded!")
            break
if __name__ == '__main__':
    while True:
        main()
        again = input("Would you like to play again? (y/n): ").strip()
        if again.lower() == "n":
            sys.exit(1)
        print ('\n')
I didn't try the code, but here's some random tips:
Try to format your code according to PEP 8 (use i += 1 instead of i+=1). PEP 8 is the standard style guide for Python.
Use
lst = ['_ '] * len(word)
instead of the for-loop.
Use enumerate as in:
for i, word in enumerate(wordlist, 1):
instead of manually keeping track of i in the loop (the second argument makes the count start at 1, as in your loop).
The default mode for opening files is 'r', so there's no need to specify it. Are you using codecs.open instead of the built-in open in order to get Unicode strings back? Also, try to catch a more specific exception than Exception -- probably IOError.
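Put together, the tips above might look something like the sketch below. It is not a drop-in replacement for main(): the function names pick_word, blank_slots and load_words are made up for illustration, and load_words assumes one word per line in the file. In Python 3, IOError is an alias of OSError, which is what the sketch catches:

```python
import io


def pick_word(lines, position):
    """Return the stripped word on the 1-based line `position`, or None."""
    for i, word in enumerate(lines, 1):
        if i == position:
            return word.strip()
    return None


def blank_slots(word):
    """One '_ ' placeholder per letter, without an explicit loop."""
    return ['_ '] * len(word)


def load_words(path):
    """Read all words from `path`; return an empty list if it cannot be opened."""
    try:
        with open(path) as f:  # 'r' is the default mode
            return f.read().split()
    except OSError as ex:
        print(ex)
        return []


# io.StringIO stands in for an opened words.txt here.
word = pick_word(io.StringIO("apple\nbanana\ncherry\n"), 2)
print(word, blank_slots(word))
```

Using a with block also closes the file automatically, so the explicit wordlist.close() is no longer needed.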
First idea: ASCII art
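The Python-specific pieces here are the regular-expression syntax, the range() function and the [xxx for yyy in zzz] list comprehension. Every character of the drawing is prefixed with the number of the attempt at which it first appears; pairs whose digit is higher than the current attempt are replaced by a space, and the remaining digits are then stripped.
import re
def ascii_art(attempt):
    # Digits of the stages that must stay hidden for this attempt.
    hidden = ''.join([str(e) for e in range(attempt + 1, 10)])
    # Raw string, so the backslashes of the drawing are taken literally.
    return re.sub(r'\d', '', re.sub('[0{0}].'.format(hidden), ' ', r"""
        3_3_3_3_3_3_
        4|    2|
        4|    2|
      4(4x4_4X4)  2|
        5|    2|
       6/5|7\   2|
        5|    2|
       8/ 9\   2|
             2|
             2|
       1_1_1_1_1_1_1|1_1_1_1_1_1_
"""))
for i in range(1, 10):
    print(ascii_art(i))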
Second idea: loops
Use enumerate for the word-reading loop. Use
for attempt in range(1, 10):
    # inside main loop
    ...
print ('you suck!')
as the main loop. The break statement should be used with care, not as a replacement for a for loop!
Unless I miss something, the structure of
for letters in lst:
    if guess not in word:
        ...
        break
    if guess in word[i]:
        ...
will be more transparent as
if guess not in word:
    ...
else:
    index = word.find(guess)
    ...
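One thing to keep in mind with the restructured version is that word.find returns only the first matching index, while a word can contain the guessed letter several times. A possible way to keep the if/else shape and still reveal every occurrence is sketched below; the function name apply_guess and its return convention are assumptions for illustration, not part of the original game:

```python
def apply_guess(word, slots, guess):
    """Reveal every occurrence of `guess` in `slots`.

    Returns True if the letter is in the word, False otherwise.
    """
    if guess not in word:
        return False
    for i, letter in enumerate(word):
        if letter == guess:
            slots[i] = guess + ' '
    return True


slots = ['_ '] * len("hello")
hit = apply_guess("hello", slots, "l")
miss = apply_guess("hello", slots, "z")
print(''.join(slots), hit, miss)
```

The caller would then increase the attempt counter only when the function returns False.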
I would use a list instead of the if .. elif chain in printInterface.
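A list lookup of that kind could be sketched as follows. The drawings are replaced by short placeholder strings to keep the example small, and the names STAGES and render_interface are hypothetical; the real game would store its ASCII-art strings in the list instead:

```python
# Placeholder drawings: index n holds the picture for n wrong guesses.
STAGES = ["", "base", "base+pole", "base+pole+beam"]


def render_interface(slots, attempts, stages=STAGES):
    """Return the word line followed by the drawing for `attempts`."""
    return ''.join(slots) + '\n' + stages[attempts]


screen = render_interface(['a ', '_ '], 2)
print(screen)
```

Adding or changing a stage then only means editing the list, and the function body stays the same.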