Reading large text file into a dataframe for data analysis in Python - python-3.x

I know similar questions have been asked before, but I still cannot figure out the best way to process data for my program.
I have a large text file (50,000 to 5,000,000 lines of text). I need to process each line of this file and write it into a DataFrame so that I can do some data analysis on it.
The DataFrame has 9 columns, mostly floats and some strings, and the number of rows is roughly equal to the number of lines in the input file.
Currently, I am reading this file line by line using "with open..", then using regex to extract the required data and writing it as a row into the DataFrame. As this goes through a for loop, it takes forever to complete.
What is the best way to do this? Any pointers or sample programs? Should I even be using a DataFrame?
Here is my code.
def gcodetodf(self):
    with open(self.inputfilepath, 'r') as ifile:
        lflag = False
        for item in ifile:
            layermatch = self.layerpattern.match(item)
            self.tlist = item.split(' ')
            self.clist = re.split(r"(\w+)", item)
            if layermatch and (str(self.tlist[2][:-1]) == 'end' or int(self.tlist[2][:-1]) == (self.endlayer + 1)):
                break
            if (layermatch and int(self.tlist[2][:-1]) == self.startlayer) or lflag is True:
                lflag = True
                # clist = re.split(r"(\w+)", item)
                map_gcpat = {bool(self.gonepattern.match(item)): self.gc_g1xyef,
                             bool(self.gepattern.match(item)): self.gc_g1xye,
                             bool(self.gtrpattern.match(item)): self.gc_g1xyf,
                             bool(self.resetextpattern.match(item)): self.gc_g92e0,
                             bool(self.ftpattern.match(item)): self.gc_ftype,
                             bool(self.toolcompattern.match(item)): self.gc_toolcmt,
                             bool(self.layerpattern.match(item)): self.gc_laycmt,
                             bool(self.zpattern.match(item)): self.gc_g1z}
                map_gcpat.get(True, self.contd)()
                # print(self.newdataframe)
An example function that writes to the DataFrame looks like this:
def gc_g1xye(self):
    self.newdataframe = self.newdataframe.append(
        {'Xc': float(self.tlist[1][1:]), 'Yc': float(self.tlist[2][1:]), 'Zc': self.gc_z,
         'E': float(self.tlist[3][1:]),
         'F': None, 'FT': self.ft_var, 'EW': self.tc_ew, 'LH': self.tc_lh, 'Layer': self.cmt_layer},
        ignore_index=True)
Sample input file:
........
G1 X159.8 Y140.2 E16.84505
G1 X159.8 Y159.8 E17.56214
M204 S5000
M205 X30 Y30
G0 F2400 X159.6 Y159.8
G0 X159.33 Y159.33
G0 X159.01 Y159.01
M204 S500
M205 X20 Y20
;TYPE:SKIN
G1 F1200 X140.99 Y159.01 E18.22142
G1 X140.99 Y140.99 E18.8807
G1 X159.01 Y140.99 E19.53999
G1 X159.01 Y159.01 E20.19927
M204 S5000
M205 X30 Y30
G0 F2400 X150.21 Y150.21
M204 S500
M205 X20 Y20
G1 F1200 X149.79 Y150.21 E20.21464
G1 X149.79 Y149.79 E20.23
G1 X150.21 Y149.79 E20.24537
G1 X150.21 Y150.21 E20.26073
M204 S5000
M205 X30 Y30
G0 F2400 X150.61 Y150.61
M204 S500
M205 X20 Y20
G1 F1200 X149.39 Y150.61 E20.30537
G1 X149.39 Y149.39 E20.35
G1 X150.61 Y149.39 E20.39464
..........

Beware that DataFrame.append returns a copy of your old DataFrame with the new rows added: it does not work in place. Constructing a DataFrame row by row with append therefore works in O(n^2) instead of O(n), which is rather bad if you have 5 million rows...
What you want to do instead is append each row to a list first (a list of dicts is fine), and then create the DataFrame object from that once all the parsing is done. This will be much faster, because appending to a list is done in constant time, so your total complexity should be O(n) instead.
def gc_g1xye(self):
    self.data.append(
        {'Xc': float(self.tlist[1][1:]), 'Yc': float(self.tlist[2][1:]), 'Zc': self.gc_z,
         'E': float(self.tlist[3][1:]),
         'F': None, 'FT': self.ft_var, 'EW': self.tc_ew, 'LH': self.tc_lh, 'Layer': self.cmt_layer})
...
# Once the parsing is done:
self.newdataframe = pd.DataFrame(self.data)
Is this the best way of doing it? It looks like a good start to me. Should you be using a DataFrame? From what you say you want to do with the data once you've parsed it, a DataFrame sounds like a good option.
As a random unrelated tip, I recommend the tqdm package for showing a progress bar of your for-loop. It's super easy to use, and it helps you in judging whether it's worth waiting for that loop to finish!
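For illustration, here is a minimal, self-contained sketch of the whole pattern with tqdm wrapped around the file iterator; parse_line and the file name are hypothetical stand-ins for the regex dispatch above, not the asker's actual logic:
import pandas as pd
from tqdm import tqdm

def parse_line(line):
    # Hypothetical parser: return a dict of column values, or None to skip.
    parts = line.split()
    if parts and parts[0] == 'G1':
        return {'cmd': parts[0], 'args': ' '.join(parts[1:])}
    return None

rows = []
with open('input.gcode') as f:            # assumed file name
    for line in tqdm(f, desc='parsing'):  # tqdm accepts any iterable
        row = parse_line(line)
        if row is not None:
            rows.append(row)              # O(1) per line

df = pd.DataFrame(rows)                   # built once at the end, O(n) overall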

Related

Pine Script - security function does not show the correct value on different timeframes

I'm a newbie trying to get Ichimoku data on the 4-hour timeframe, but it is not showing the correct value when I shift.
//#version=4
study(title="test1", overlay=true)
conversionPeriods = input(9, minval=1, title="Conversion Line Length")
basePeriods = input(26, minval=1, title="Base Line Length")
laggingSpan2Periods = input(52, minval=1, title="Leading Span B Length")
displacement = input(26, minval=1, title="Displacement")
donchian_M240(len) => avg(security(syminfo.tickerid, 'D' , lowest(len)), security(syminfo.tickerid, 'D', highest(len)))
tenkanSen_M240 = donchian_M240(conversionPeriods)
kijunSen_M240 = donchian_M240(basePeriods)
senkoSpanA_M240 = avg(tenkanSen_M240, kijunSen_M240)
plot(senkoSpanA_M240[25], title="senkoSpanA_M240[25]")
The value of senkoSpanA_M240[25] keeps changing when I'm on M5, M15, M30, H1, H4 or D1.
Can you help please?
The reason it keeps changing when you change timeframes is that you are using a historical bar reference, [25], on senkoSpanA_M240.
This means it will look for the senkoSpanA_M240 value that occurred 25 bars ago.
Depending on which timeframe you select, it will look back 25 bars of that timeframe and perform the calculation.
What exactly are you trying to achieve by using the [25]?

Why does my PySpark regular expression not give more than the first row?

Taking inspiration from this answer: https://stackoverflow.com/a/61444594/4367851 I have been able to split my .txt file into columns in a Spark DataFrame. However, it only gives me the first game - even though the sample .txt file contains many more.
My code:
basefile = spark.sparkContext.wholeTextFiles("example copy 2.txt").toDF().\
    selectExpr("""split(replace(regexp_replace(_2, '\\\\n', ','), ""),",") as new""").\
    withColumn("Event", col("new")[0]).\
    withColumn("White", col("new")[2]).\
    withColumn("Black", col("new")[3]).\
    withColumn("Result", col("new")[4]).\
    withColumn("UTCDate", col("new")[5]).\
    withColumn("UTCTime", col("new")[6]).\
    withColumn("WhiteElo", col("new")[7]).\
    withColumn("BlackElo", col("new")[8]).\
    withColumn("WhiteRatingDiff", col("new")[9]).\
    withColumn("BlackRatingDiff", col("new")[10]).\
    withColumn("ECO", col("new")[11]).\
    withColumn("Opening", col("new")[12]).\
    withColumn("TimeControl", col("new")[13]).\
    withColumn("Termination", col("new")[14]).\
    drop("new")
basefile.show()
Output:
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
| Event| White| Black| Result| UTCDate| UTCTime| WhiteElo| BlackElo| WhiteRatingDiff| BlackRatingDiff| ECO| Opening| TimeControl| Termination|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
|[Event "Rated Cla...|[White "BFG9k"]|[Black "mamalak"]|[Result "1-0"]|[UTCDate "2012.12...|[UTCTime "23:01:03"]|[WhiteElo "1639"]|[BlackElo "1403"]|[WhiteRatingDiff ...|[BlackRatingDiff ...|[ECO "C00"]|[Opening "French ...|[TimeControl "600...|[Termination "Nor...|
+--------------------+---------------+-----------------+--------------+--------------------+--------------------+-----------------+-----------------+--------------------+--------------------+-----------+--------------------+--------------------+--------------------+
Input file:
[Event "Rated Classical game"]
[Site "https://lichess.org/j1dkb5dw"]
[White "BFG9k"]
[Black "mamalak"]
[Result "1-0"]
[UTCDate "2012.12.31"]
[UTCTime "23:01:03"]
[WhiteElo "1639"]
[BlackElo "1403"]
[WhiteRatingDiff "+5"]
[BlackRatingDiff "-8"]
[ECO "C00"]
[Opening "French Defense: Normal Variation"]
[TimeControl "600+8"]
[Termination "Normal"]
1. e4 e6 2. d4 b6 3. a3 Bb7 4. Nc3 Nh6 5. Bxh6 gxh6 6. Be2 Qg5 7. Bg4 h5 8. Nf3 Qg6 9. Nh4 Qg5 10. Bxh5 Qxh4 11. Qf3 Kd8 12. Qxf7 Nc6 13. Qe8# 1-0
[Event "Rated Classical game"]
.
.
.
Each game starts with [Event, so I feel like it should be doable, as the file has a repeating structure; alas, I can't get it to work.
Extra points:
I don't actually need the move list, so if it's easier it can be deleted.
I only want the content inside the " " for each line once it has been converted to a Spark DataFrame.
Many thanks.
wholeTextFiles reads each file into a single record. If you read only one file, the result will be an RDD with only one row, containing the whole text file. The regexp logic in the question returns only one result per row, and this will be the first entry in the file.
Probably the best solution would be to split the file at the OS level into one file per game (for example here) so that Spark can read the multiple games in parallel. But if a single file is not too big, splitting the games can also be done within PySpark:
Read the file(s):
basefile = spark.sparkContext.wholeTextFiles(<....>).toDF()
Create a list of columns and convert this list into a list of column expressions using regexp_extract:
from pyspark.sql import functions as F
cols = ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime', 'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff', 'ECO', 'Opening', 'TimeControl', 'Termination']
cols = [F.regexp_extract('game', rf'{col} \"(.*)\"',1).alias(col) for col in cols]
Extract the data:
split the whole file into an array of games
explode this array into single records
delete the line breaks within each record so that the regular expression works
use the column expressions defined above to extract the data
basefile.selectExpr("split(_2,'\\\\[Event ') as game") \
    .selectExpr("explode(game) as game") \
    .withColumn("game", F.expr("concat('Event ', replace(game, '\\\\n', ''))")) \
    .select(cols) \
    .show(truncate=False)
Output (for an input file containing three copies of the game):
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Event |White|Black |Result|UTCDate |UTCTime |WhiteElo|BlackElo|WhiteRatingDiff|BlackRatingDiff|ECO|Opening |TimeControl|Termination|
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
|Rated Classical game |BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game2|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
|Rated Classical game3|BFG9k|mamalak|1-0 |2012.12.31|23:01:03|1639 |1403 |+5 |-8 |C00|French Defense: Normal Variation|600+8 |Normal |
+---------------------+-----+-------+------+----------+--------+--------+--------+---------------+---------------+---+--------------------------------+-----------+-----------+
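Putting those steps together, a self-contained sketch might look like the following; the SparkSession setup and the file name games.pgn are assumptions for illustration, not part of the original answer:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
basefile = spark.sparkContext.wholeTextFiles("games.pgn").toDF()  # assumed path

names = ['Event', 'White', 'Black', 'Result', 'UTCDate', 'UTCTime',
         'WhiteElo', 'BlackElo', 'WhiteRatingDiff', 'BlackRatingDiff',
         'ECO', 'Opening', 'TimeControl', 'Termination']
cols = [F.regexp_extract('game', rf'{name} \"(.*)\"', 1).alias(name)
        for name in names]

(basefile
    .selectExpr("split(_2, '\\\\[Event ') as game")  # one array entry per game
    .selectExpr("explode(game) as game")             # one row per game
    .withColumn("game", F.expr("concat('Event ', replace(game, '\\\\n', ''))"))
    .select(cols)
    .show(truncate=False))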

How to use custom mean, median, mode functions with an array of 2500 in Python?

So I am trying to solve the mean, median and mode challenge on HackerRank. I defined 3 functions to calculate the mean, median and mode for a given array with a length between 10 and 2500, inclusive.
I get a wrong answer with an array of 2500 integers, and I am not sure why. I looked into the Python documentation and found no mention of a maximum length for lists. I know I can use the statistics module, but I am trying the hard way and being stubborn, I guess. Any help and criticism of my code is appreciated. Please be honest and brutal if need be. Thanks.
N = int(input())
var_list = [int(x) for x in input().split()]

def mean(sample_list):
    mean = sum(sample_list)/N
    print(mean)
    return

def median(sample_list):
    sorted_list = sorted(sample_list)
    if N%2 != 0:
        median = sorted_list[(N//2)]
    else:
        median = (sorted_list[N//2] + sorted_list[(N//2)-1])/2
    print(median)
    return

def mode(sample_list):
    sorted_list = sorted(sample_list)
    mode = min(sorted_list)
    max_count = sorted_list.count(mode)
    for i in sorted_list:
        if (i <= mode) and (sorted_list.count(i) >= max_count):
            mode = i
    print(mode)
    return

mean(var_list)
median(var_list)
mode(var_list)
Compiler Message
Wrong Answer
Input (stdin)
2500
19325 74348 68955 98497 26622 32516 97390 64601 64410 10205 5173 25044 23966 60492 71098 13852 27371 40577 74997 42548 95799 26783 51505 25284 49987 99134 33865 25198 24497 19837 53534 44961 93979 76075 57999 93564 71865 90141 5736 54600 58914 72031 78758 30015 21729 57992 35083 33079 6932 96145 73623 55226 18447 15526 41033 46267 52486 64081 3705 51675 97470 64777 31060 90341 55108 77695 16588 64492 21642 56200 48312 5279 15252 20428 57224 38086 19494 57178 49084 37239 32317 68884 98127 79085 77820 2664 37698 84039 63449 63987 20771 3946 862 1311 77463 19216 57974 73012 78016 9412 90919 40744 24322 68755 59072 57407 4026 15452 82125 91125 99024 49150 90465 62477 30556 39943 44421 68568 31056 66870 63203 43521 78523 58464 38319 30682 77207 86684 44876 81896 58623 24624 14808 73395 92533 4398 8767 72743 1999 6507 49353 81676 71188 78019 88429 68320 59395 95307 95770 32034 57015 26439 2878 40394 33748 41552 64939 49762 71841 40393 38293 48853 81628 52111 49934 74061 98537 83075 83920 42792 96943 3357 83393{-truncated-}
Download to view the full testcase
Expected Output
49921.5
49253.5
2184
Your issue seems to be that you are using standard list operations rather than calculating things on the fly while looping through the data once (for the average). Note that Python's integers are arbitrary precision, so sum(sample_list) itself will not overflow, but in languages with fixed-width numbers a sum over the whole list can easily exceed the double limit, i.e. it becomes really big; a running mean, as in the sketch below, avoids keeping such a large intermediate sum.
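A one-pass running-mean sketch of that idea (my illustration, not part of the original answer):
import statistics

def running_mean(values):
    mean = 0.0
    for n, x in enumerate(values, start=1):
        mean += (x - mean) / n  # incremental update, no large intermediate sum
    return mean

data = [19325, 74348, 68955, 98497, 26622]
print(running_mean(data), statistics.mean(data))  # both approximately 57549.4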
Further reading
Calculating the mean, variance, skewness, and kurtosis on the fly
How do I determine the standard deviation (stddev) of a set of values?
Rolling variance algorithm
What is a good solution for calculating an average where the sum of all values exceeds a double's limits?
How to efficiently compute average on the fly (moving average)?
I figured out that you forgot to update the max_count variable inside the if block; that is probably what causes the wrong result. I tested the debugged version on my computer, and its results match scipy's built-in functions. The corrected mode function should be:
def mode(sample_list):
    N = len(sample_list)
    sorted_list = sorted(sample_list)
    mode = min(sorted_list)
    max_count = sorted_list.count(mode)
    for i in sorted_list:
        if sorted_list.count(i) >= max_count:
            mode = i
            max_count = sorted_list.count(i)
    print(mode)
I was busy with some other things and have now come back to complete this. I am happy to say that I have matured enough as a coder and solved this issue.
Here is the solution:
# Enter your code here. Read input from STDIN. Print output to STDOUT
# Input an array of numbers, convert it to an integer array
n = int(input())
my_array = list(map(int, input().split()))
my_array.sort()

# Find mean
array_mean = sum(my_array) / n
print(array_mean)

# Find median
if (n % 2) != 0:
    array_median = my_array[n//2]
else:
    array_median = (my_array[n//2 - 1] + my_array[n//2]) / 2
print(array_median)

# Find mode (I could do this using the multimode method of the statistics module in Python 3.8)
def sort_second(array):
    return array[1]

modes = [[i, my_array.count(i)] for i in my_array]
modes.sort(key=sort_second, reverse=True)
array_mode = modes[0][0]
print(array_mode)
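As an aside, a shorter alternative for the mode using only the standard library (my sketch, not part of the original solution): collections.Counter counts everything in one pass, and min() breaks ties toward the smallest value, matching HackerRank's convention.
from collections import Counter

def counter_mode(sample_list):
    counts = Counter(sample_list)  # value -> frequency, one pass
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

print(counter_mode([1, 2, 2, 3, 3]))  # 2, the smallest of the tied modes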

Plot x and y if z == [value]

Only just started using python this week, so I'm a total beginner. Imagine I have a massive dataset with data like so:
close high low open time symbol
0.04951 0.04951 0.04951 0.04951 7/16/2010 BTC
0.08584 0.08585 0.05941 0.04951 7/17/2010 BTC
0.0808 0.09307 0.07723 0.08584 7/18/2010 ETH
How, using matplotlib, can I plot close against time, only if symbol is BTC? I was thinking something like
bitgroup = df.groupby('symbol')
if bitgroup == 'BTC':
    df(['close','time']).plot()
    plt.show()
Building on this, I'd then like to use these new groups to create new columns, such as returns (calculated using (p1-p0)/p0), doing something like this:
def createnewcolumn():
    for i in bitgroup:
        df[returns] = (bitgroup['close'].ix[i] - bitgroup['close'].ix[i-1]) / bitgroup['close'].ix[i-1]
createnewcolumn()
Any help would be greatly appreciated in turning this pseudocode into real code!
df.symbol == 'BTC'
returns a boolean Series, True or False for each row, and you can then use that as a mask on the original data:
df[df.symbol == 'BTC']
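To connect this to the plotting and returns parts of the question, a minimal sketch, assuming df is the DataFrame shown above:
import matplotlib.pyplot as plt

btc = df[df.symbol == 'BTC']   # the mask keeps only the BTC rows
btc.plot(x='time', y='close')  # 'time' may need pd.to_datetime first
plt.show()

# The returns column, (p1 - p0) / p0, computed per symbol so BTC and ETH don't mix:
df['returns'] = df.groupby('symbol')['close'].pct_change()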

How to split a text file like this in Python?

N-Heptane 100.20
Hexane 86.17
Hydrochloric Acid 36.47
Hydrogen, H2 2.016
Hydrogen Chloride 36.461
Hydrogen Sulfide 34.076
Hydroxyl, OH 17.01
Krypton 83.80
Methane, CH4 16.044
Methyl Alcohol 32.04
Methyl Butane 72.15
Methyl Chloride 50.488
Natural Gas 19.00
Neon, Ne 20.179
Nitric Oxide, NO 30.006
Nitrogen, N2 28.0134
Nitrous Oxide, NO2 44.012
N-Octane 114.22
Oxygen, O2 31.9988
Ozone 47.998
N-Pentane 72.15
Iso-Pentane 72.15
Propane, C3H8 44.097
Propylene 42.08
The text content looks like this; I'd like to split each line into the molecular formula and the molecular weight,
e.g.
{"Hydrogen, H2": 2.016, "Hydrogen Chloride": 36.461, etc........}
You can simply iterate over each line and use rsplit to take the last whitespace-separated value as your dictionary value; the rest of the line becomes the key.
d = {}
with open(filename) as f:
    for line in f:
        key, value = line.rsplit(None, 1)  # split once, from the right
        d[key] = float(value)
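For example, with the sample data above saved to a file (and filename pointing at it, which is an assumption), the dict can then be queried directly:
print(d["Hydrogen, H2"])   # 2.016
print(d["Methane, CH4"])   # 16.044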
