Write a pyspark dataframe to text without changing its structure - apache-spark

I have a pyspark dataframe as shown below
+--------------------+
| speed|
+--------------------+
|[5.59239, 2.51329...|
|[0.0191166, 0.169...|
|[0.561913, 0.4098...|
|[0.393343, 0.3580...|
|[0.118315, 0.1183...|
|[0.831407, 0.4470...|
|[1.49012e-08, 0.1...|
|[0.0411047, 0.152...|
|[0.620069, 0.8262...|
|[0.20373, 0.20373...|
+--------------------+
How can I write this dataframe to CSV so that it is saved exactly as shown above? Currently I tried coalesce, but it was saved as below:
"[5.59239, 2.51329, 0.141536, 1.27485, 2.35138, 12.9668, 12.9668, 2.52421, 0.330804, 0.459188, 0.459188, 0.651573, 3.15373, 6.11923, 8.8445, 8.0871, 0.855173, 1.43534, 1.43534, 1.05988, 1.05988, 0.778344, 1.20522, 1.70414, 1.70414, 0.0795492, 1.10385, 1.4759, 1.64844, 0.82941, 1.11321, 1.37977, 0.849902, 1.24436, 1.24436, 0.698651, 0.791467, 0.636781, 0.666729, 0.666729, 0.45688, 0.45688, 0.158829, 2.12693, 29.8682, 29.8682, 9.62536, 3.40384, 2.51002, 1.55077, 1.01774, 0.922753, 0.922753, 0.0438924, 0.530669, 0.879573, 0.627267, 0.0532846, 0.0890066, 0.0884833, 0.140008, 0.147534, 0.0180038, 0.0132851, 0.112785, 0.112785, 0.22997, 0.22997, 0.0524423, 0.141886, 0.328422,............]"
But I want to save it in a format that opens as a proper Excel file, with speed as the column name and its values as a list of lists.
I don't want to use toPandas() as it is memory intensive.
If I have over-emphasised or under-emphasised something, please let me know in the comments.

df.coalesce(1).write.option("header","true").csv("file:///s/tesing")

I resolved this!
df_Welding_amp.rdd.coalesce(1).saveAsTextFile('home/ram/file.csv')
Though I didn't get it exactly as a list of lists, I was able to get it in row format, as below:
Row(speed='[5.59239, 2.51329, 0.141536, 1.27485, 2.35138, 12.9668, 12.9668, 2.52421, 0.330804, 0.459188, 0.459188, 0.651573, 3.15373, 6.11923, 8.8445, 8.0871, 0.855173, 1.43534, 1.43534, 1.05988, 1.05988, 0.778344, 1.20522, 1.70414, 1.70414, 0.0795492, 1.10385, 1.4759, 1.64844, 0.82941........
.....]
Row(speed='[0.0191166, 0.169978, 0.226254, 0.149923, 0.149923, 0.505102, 0.505102, 0.369975, 0.305384, 0.154693, 0.224818, 0.875909, 0.875909, 2.5506, 6.06761, 5.0829, 4.46667, 2.16333, 3.74257, 3.74257, 2.33873, 1.39336, 1.56772, 0.889895, 0.249284, 0.249284, 0.132409, 0.177825, 0.270215, 0.398466, 2.3726, 4.87186, 4.05198, 2.23753, 0.266356, 0.513157, 0.78962, 0.523164, 0.138469, 0.315834, 0.315834]
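For reference, a minimal sketch (the column name is taken from the question, the output path is hypothetical) of another way to keep the CSV route: cast the array column to a plain string before writing, since the CSV writer cannot serialise array columns directly.
from pyspark.sql import functions as F

# Casting the array to a string yields text like "[5.59239, 2.51329, ...]",
# which the CSV writer can handle; the output path below is just a placeholder.
df_out = df.withColumn("speed", F.col("speed").cast("string"))
df_out.coalesce(1).write.option("header", "true").csv("file:///tmp/speed_csv")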

Related

Using a Sklearn tutorial and unable to understand the output of vectorizer.get_feature_names_out()

Is the output part of my 20newsgroups_train data? Or is it from a default library? Because words like 'zz_g9q3' don't make sense.
I am currently using the 20newsgroups_train and 20newsgroups_test datasets.
Input:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors_train = vectorizer.fit_transform(newsgroups_train.data)  # the vectorizer must be fit (on the training data) before transform
vectors_test = vectorizer.transform(newsgroups_test.data)
print(vectorizer.get_feature_names_out()[-50:])
Output:
['zyra' 'zysec' 'zysgm3r' 'zysv' 'zyt' 'zyu' 'zyv' 'zyxel' 'zyxel1496b'
'zz' 'zz20d' 'zz93sigmc120' 'zz_g9q3' 'zzcrm' 'zzd' 'zzg6c' 'zzi776'
'zzneu' 'zznki' 'zznkj' 'zznkjz' 'zznkzz' 'zznp' 'zzo' 'zzr11' 'zzr1100'
'zzrk' 'zzt' 'zztop' 'zzy_3w' 'zzz' 'zzzoh' 'zzzz' 'zzzzzz' 'zzzzzzt'
'ªl' '³ation'
'º_________________________________________________º_____________________º'
'ºnd' 'çait' 'çon' 'ère' 'ée' 'égligent' 'élangea' 'érale' 'ête'
'íålittin' 'ñaustin' 'ýé']
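As a side note, a minimal sketch (with a made-up two-document corpus) showing that get_feature_names_out() only returns tokens that occur in the documents the vectorizer was fit on; sklearn does not add tokens of its own:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the quick brown fox", "a line containing zz_g9q3 style noise"]  # invented documents
vec = TfidfVectorizer()
vec.fit_transform(corpus)
# Every feature name comes from the fitted corpus
print(vec.get_feature_names_out())
# ['brown' 'containing' 'fox' 'line' 'noise' 'quick' 'style' 'the' 'zz_g9q3']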

Pandas Dataframe sum result is wrong

I wrote a program using pandas and openpyxl to manipulate Excel files.
The series of data is:
l=[466629703, NA, 527821349, NA,734823364, NA,1667241489, NA,502673377, NA,491316417, NA,505520276, NA,2840580259, NA,1399526794, NA,468709318, NA,425220764, NA,409771252, NA,643692418, NA,1193809483, NA,353829950, NA,424820400, NA,406999623, NA,389293014, NA,1168972722, NA,420654309, NA,390431735, NA,356588382, NA]
deposit_sum = sep_df[sep_kward][deposit].dropna().astype(int).sum()
The result should be 16188926398,
but the code above returns 11200862491. The error occurs with only one of the files. What do you think is the problem?
Don't cast the values to int after dropping the NaNs; convert them to int64 instead, because a value like 2840580259.0 is out of range for a 32-bit integer:
deposit_sum =df[0].dropna().astype('int64').sum()
#deposit_sum =sep_df[sep_kward][deposit].dropna().astype('int64').sum()
output of deposit_sum:
16188926398
Sample dataframe used:
import pandas as pd

NA = float('NaN')
l=[466629703, NA, 527821349, NA,734823364, NA,1667241489, NA,502673377, NA,491316417, NA,505520276, NA,2840580259, NA,1399526794, NA,468709318, NA,425220764, NA,409771252, NA,643692418, NA,1193809483, NA,353829950, NA,424820400, NA,406999623, NA,389293014, NA,1168972722, NA,420654309, NA,390431735, NA,356588382, NA]
df=pd.DataFrame(l)
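For context, a small sketch of why a plain int cast can go wrong here: on platforms where astype(int) maps to a 32-bit integer, values such as 2840580259 exceed the 32-bit maximum, while int64 holds them comfortably:
import numpy as np

print(np.iinfo(np.int32).max)   # 2147483647, the largest value a 32-bit int can hold
print(np.iinfo(np.int64).max)   # 9223372036854775807, so 2840580259 fits easily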

Filter non-equal values in pyspark, using condition.\ where(array_contains())

I have this PySpark code:
from pyspark.sql.functions import array_contains, to_date

condition_no_hypertension = condition.\
where(array_contains('clinicalStatus.coding.code', 'active')).\
where(array_contains('verificationStatus.coding.code', 'confirmed')).\
where(array_contains('code.coding.code', '38341003')).\
where(condition.onsetDateTime > '1900-01-01').\
withColumn('condition_status', condition['clinicalStatus.coding.code'].getItem(0)).\
withColumn('verification_status', condition['verificationStatus.coding.code'].getItem(0)).\
withColumn('snomed_code', condition['code.coding.code'].getItem(0)).\
withColumn('snomed_name', condition['code.coding.display'].getItem(0)).\
select(\
(condition['subject.reference'].substr(10, 40).alias('patient_id')),
'condition_status',\
'verification_status',\
'snomed_code', \
'snomed_name',\
to_date(condition['onsetDateTime']).alias('first_observation_date'))
How can I change this code to pick up everything except that code ('38341003')?
I tried
where(array_contains('code.coding.code', !='38341003')).\
but it does not work.
You can use ~ (not):
where(~array_contains('code.coding.code', '38341003'))
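A minimal self-contained sketch of that negation, reusing the condition DataFrame and column path from the question:
from pyspark.sql.functions import array_contains

# Keep only the rows whose code array does NOT contain '38341003'
condition_not_hypertension = condition.where(~array_contains('code.coding.code', '38341003'))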

Why does my PySpark regular expression not give more than the first row?

Taking inspiration from this answer: https://stackoverflow.com/a/61444594/4367851 I have been able to split my .txt file into columns in a Spark DataFrame. However, it only gives me the first game - even though the sample .txt file contains many more.
My code:
basefile = spark.sparkContext.wholeTextFiles("example copy 2.txt").toDF().\
selectExpr("""split(replace(regexp_replace(_2, '\\\\n', ','), ""),",") as new""").\
withColumn("Event", col("new")[0]).\
withColumn("White", col("new")[2]).\
withColumn("Black", col("new")[3]).\
withColumn("Result", col("new")[4]).\
withColumn("UTCDate", col("new")[5]).\
withColumn("UTCTime", col("new")[6]).\
withColumn("WhiteElo", col("new")[7]).\
withColumn("BlackElo", col("new")[8]).\
withColumn("WhiteRatingDiff", col("new")[9]).\
withColumn("BlackRatingDiff", col("new")[10]).\
withColumn("ECO", col("new")[11]).\
withColumn("Opening", col("new")[12]).\
withColumn("TimeControl", col("new")[13]).\
withColumn("Termination", col("new")[14]).\
drop("new")
basefile.show()
Output:
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
| Event| White| Black| Result| UTCDate| UTCTime| WhiteElo| BlackElo| WhiteRatingDiff| BlackRatingDiff| ECO| Opening| TimeControl| Termination|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|[Event "Rated Cla...|[White "BFG9k"]|[Black "mamalak"]|[Result "1-0"]|[UTCDate "2012.12...|[UTCTime "23:01:03"]|[WhiteElo "1639"]|[BlackElo "1403"]|[WhiteRatingDiff ...|[BlackRatingDiff ...|[ECO "C00"]|[Opening "French ...|[TimeControl "600...|[Termination "Nor...|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
Input file:
[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]
1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0
[Event "Rated Classical game"]
.
.
.
Each game starts with [Event, so I feel like it should be doable since the file has a repeating structure, but alas I can't get it to work.
Extra points:
I don't actually need the move list so if it's easier they can be deleted.
I only want the content of what is inside the " " for each new line once it has been converted to a Spark DataFrame.
Many thanks.
wholeTextFiles reads each file into a single record. If you read only one file, the result will be an RDD with only one row, containing the whole text file. The regexp logic in the question returns only one result per row, and this will be the first entry in the file.
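To illustrate (the file name is just a placeholder): wholeTextFiles returns one (path, content) pair per file, so a single input file becomes a single row no matter how many games it contains:
rdd = spark.sparkContext.wholeTextFiles("games.txt")
print(rdd.count())        # 1 -- one record per file, not one per game
path, content = rdd.first()
print(content[:30])       # the beginning of the whole file, starting at the first [Event tag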
Probably the best solution would be to split the file at the OS level into one file per game (for example here) so that Spark can read the multiple games in parallel. But if a single file is not too big, the games can also be split within PySpark:
Read the file(s):
basefile = spark.sparkContext.wholeTextFiles(<....>).toDF()
Create a list of columns and convert this list into a list of column expressions using regexp_extract:
from pyspark.sql import functions as F
cols = ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime', 'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff', 'ECO', 'Opening', 'TimeControl', 'Termination']
cols = [F.regexp_extract('game', rf'{col} \"(.*)\"',1).alias(col) for col in cols]
Extract the data:
split the whole file into an array of games
explode this array into single records
delete the line breaks within each record so that the regular expression works
use the column expressions defined above to extract the data
basefile.selectExpr("split(_2,'\\\\[Event ') as game") \
.selectExpr("explode(game) as game") \
.withColumn("game", F.expr("concat('Event ', replace(game, '\\\\n', ''))")) \
.select(cols) \
.show(truncate=False)
Output (for an input file containing three copies of the game):
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Event |White|Black |Result|UTCDate |UTCTime |WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|Opening |TimeControl|Termination|
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Rated Classical game |BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game2|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game3|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+

Azure Stream Analytics: list within a list

How do I get the data from the list in the file when a JSON array is nested inside the list? E.g.:
{"city":{"id":1851632,"name":"Shuzenji",
"coord":{"lon":138.933334,"lat":34.966671},
"country":"JP",
"cod":"200",
"message":0.0045,
"cnt":38,
"list":[{
"dt":1406106000,
"main":{
"temp":298.77,
"temp_min":298.77,
"temp_max":298.774,
"pressure":1005.93,
"sea_level":1018.18,
"grnd_level":1005.93,
"humidity":87
"temp_kf":0.26},
"weather":[{"id":804,"main":"Clouds","description":"overcast clouds","icon":"04d"}],
"clouds":{"all":88},
"wind":{"speed":5.71,"deg":229.501},
"sys":{"pod":"d"},
"dt_txt":"2014-07-23 09:00:00"}
]}
How do I get to weather.main?
First, your JSON format seems to have errors; let's fix those:
{"city":{"id":1851632,"name":"Shuzenji"},
"coord":{"lon":138.933334,"lat":34.966671},
"country":"JP",
"cod":"200",
"message":0.0045,
"cnt":38,
"list":[{
"dt":1406106000,
"main":{
"temp":298.77,
"temp_min":298.77,
"temp_max":298.774,
"pressure":1005.93,
"sea_level":1018.18,
"grnd_level":1005.93,
"humidity":87,
"temp_kf":0.26},
"weather":[{"id":804,"main":"Clouds","description":"overcast clouds","icon":"04d"}],
"clouds":{"all":88},
"wind":{"speed":5.71,"deg":229.501},
"sys":{"pod":"d"},
"dt_txt":"2014-07-23 09:00:00"}
]}
Note the comma added at the end of the humidity line, and the record closed after Shuzenji in the first line.
Now, in the JSON example, weather.main is actually list[0].weather[0].main. Here is the way to get precisely this value out of the JSON structure using the Array and Record built-in functions:
SELECT
-- input.list[0].weather[0].main
WeatherMain = GetRecordPropertyValue(GetArrayElement(GetRecordPropertyValue(GetArrayElement(input.list, 0), 'weather'), 0), 'main')
FROM input
