How to identify data gaps based on filenames in Python? - python-3.x

It happens that I have a folder located at
C:\Users\StoreX\Downloads\Binance futures data\AliceUSDT-Mark_Prices_Klines_1h_Timeframe
which only contains 253 csv files with the following filenames:
1. ALICEUSDT-1h-2021-06-01.csv
2. ALICEUSDT-1h-2021-06-02.csv
3. ALICEUSDT-1h-2021-06-03.csv
4. ALICEUSDT-1h-2021-06-06.csv
5. ALICEUSDT-1h-2021-06-09.csv
6. ALICEUSDT-1h-2021-06-11.csv
7. ALICEUSDT-1h-2021-06-12.csv
.
.
.
253. ALICEUSDT-1h-2022-02-13.csv
Each of those files contains the hourly price action of a particular asset in 24 rows (no column names), so each filename can be assumed to correspond to the price action data for that asset on one particular date.
However, if you look closely at the example above, there are some files missing at the very beginning, which are:
ALICEUSDT-1h-2021-06-04.csv
ALICEUSDT-1h-2021-06-05.csv
ALICEUSDT-1h-2021-06-07.csv
ALICEUSDT-1h-2021-06-08.csv
ALICEUSDT-1h-2021-06-10.csv
This means I cannot use the files that precede the missing ones when developing a trading strategy.
So I first have to detect which files are missing based on their names, and then decide where to start plotting the price action so that all of the possible gaps are avoided.
Update: Here's what I have done so far:
import os
import datetime
def check_path(infile):
    return os.path.exists(infile)
first_entry = input('Tell me the path where your csv files are located at:')
while True:
    if check_path(first_entry) == False:
        print('\n')
        print('This PATH is invalid!')
        first_entry = input('Tell me the RIGHT PATH in which your csv files are located: ')
    elif check_path(first_entry) == True:
        print('\n')
        final_output = first_entry
        break
for name in os.listdir(first_entry):
    if name.endswith(".csv"):
        print((name.partition('-')[-1]).partition('-')[-1].removesuffix(".csv"))
Output:
2021-06-01
2021-06-02
2021-06-03
2021-06-06
2021-06-09
.
.
.
2022-02-13
Any ideas?

IIUC, you have a list of dates and want to find out which dates are missing when the list is compared against a full date range running from its minimum to its maximum date. Sets can help, for example:
import re
from datetime import datetime, timedelta
l = ["ALICEUSDT-1h-2021-06-01.csv",
     "ALICEUSDT-1h-2021-06-02.csv",
     "ALICEUSDT-1h-2021-06-03.csv",
     "ALICEUSDT-1h-2021-06-06.csv",
     "ALICEUSDT-1h-2021-06-09.csv",
     "ALICEUSDT-1h-2021-06-11.csv",
     "ALICEUSDT-1h-2021-06-12.csv"]
# extract the dates, you don't have to use a regex here, it's more for convenience
d = [re.search(r"[0-9]{4}\-[0-9]{2}\-[0-9]{2}", s).group() for s in l]
# to datetime
d = [datetime.fromisoformat(s) for s in d]
# now make a date range based on min and max dates in d
r = [min(d)+timedelta(n) for n in range((max(d)-min(d)).days+1)]
# ...so we can do a membership test with sets to find out what is missing...
missing = set(r) - set(d)
sorted(missing)
[datetime.datetime(2021, 6, 4, 0, 0),
 datetime.datetime(2021, 6, 5, 0, 0),
 datetime.datetime(2021, 6, 7, 0, 0),
 datetime.datetime(2021, 6, 8, 0, 0),
 datetime.datetime(2021, 6, 10, 0, 0)]
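The same set difference can be run against the folder contents instead of a hard-coded list. Here is a minimal sketch, assuming every filename ends in YYYY-MM-DD.csv. The helper names (dates_from_names, missing_dates, first_gap_free_date, csv_names) are hypothetical and not part of the answer above. The third helper returns the day after the last gap, which is one possible place to start plotting without gaps:

```python
import re
from datetime import date, timedelta
from pathlib import Path

# Assumption: the date is the last part of the name, e.g. ALICEUSDT-1h-2021-06-01.csv
DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2})\.csv$")


def dates_from_names(names):
    """Return the set of dates embedded in the given filenames."""
    found = set()
    for name in names:
        match = DATE_RE.search(name)
        if match:
            found.add(date.fromisoformat(match.group(1)))
    return found


def missing_dates(names):
    """Return the sorted dates between the first and last file that have no file."""
    have = dates_from_names(names)
    if not have:
        return []
    start, end = min(have), max(have)
    full = {start + timedelta(days=n) for n in range((end - start).days + 1)}
    return sorted(full - have)


def first_gap_free_date(names):
    """Return the earliest date from which every following day has a file."""
    have = dates_from_names(names)
    if not have:
        return None
    gaps = missing_dates(names)
    # The day after the last gap always exists, because gaps lie before the max date
    return gaps[-1] + timedelta(days=1) if gaps else min(have)


def csv_names(folder):
    """List the csv filenames in a folder (the folder path is supplied by the caller)."""
    return [p.name for p in Path(folder).glob("*.csv")]
```

Calling missing_dates(csv_names(first_entry)) with the validated path from the question would then list the gaps for the real folder.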

Related

Finding number is present in sequence

I need to find out whether a given number is present in a given sequence. The sequence is an arithmetic progression with a common difference of 2.
Example input 1: 5, 7, 9, 11, 13, 15, . . .
For this sequence the key is 19, so it is present in the sequence and the output should be True.
Example input 2: 4, 6, 8, 10, ...
The given key is 15, so it is not present and the output should be False.
I have written the code below, but it only works for the first input, which is odd numbers. It fails for input 2, which is even.
arr = 4,6,8
n=10
b=max(arr)
l=[]
if b>n:
    c=b
else:
    c=n
d=arr[1]-arr[0]
for i in range(1,c+d+1,d):
    l.append(i)
if n in l:
    print("True")
else:
    print("False")
Output 1 - True
output2 - False
You can take advantage of range's smart implementation of __contains__ to get a one-liner O(1) solution:
print(6 in range(2, 10, 2))
# True
print(5 in range(2, 10, 2))
# False
And an extreme example to show how fast and scalable this is (the size of the range does not matter):
from timeit import Timer
print(min(Timer(lambda: 1_000_000_000 in range(2, 1_000_000_003, 2)).repeat(1000, 1000)))
# 0.00032309999999990957
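For the question's case, where the progression has no upper end, the same membership test reduces to one comparison and one modulo. Here is a small sketch, assuming a positive common difference taken from the first two terms. in_progression is a hypothetical helper name:

```python
def in_progression(seq, key):
    """Return True if key belongs to the endless progression that starts like seq."""
    first = seq[0]
    step = seq[1] - seq[0]
    if step <= 0:
        raise ValueError("this sketch assumes an increasing progression")
    # key must not lie before the start, and must be a whole number of steps away
    return key >= first and (key - first) % step == 0


print(in_progression((5, 7, 9, 11, 13, 15), 19))  # True
print(in_progression((4, 6, 8, 10), 15))          # False
```

Starting from seq[0] rather than from 1 is what makes the even-numbered input work as well.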

Transform JSON to excel table

I have data in a CSV file with 2 columns: the first column contains the member id and the second contains characteristics as key-value pairs (nested one under another).
I have seen code online that converts simple key-value pairs, but I am not able to transform data like what I have shown above.
I want to transform this data into an Excel table as below.
I did it with the XlsxWriter package, so first you have to install it by running the pip install XlsxWriter command.
import csv # to read csv file
import xlsxwriter # to write xlxs file
import ast
# you can change this names according to your local ones
csv_file = 'data.csv'
xlsx_file = 'data.xlsx'
# read the csv file and get all the JSON values into data list
data = []
with open(csv_file, 'r') as csvFile:
    # read line by line in csv file
    reader = csv.reader(csvFile)
    # convert every line into list and select the JSON values
    for row in list(reader)[1:]:
        # csv are comma separated, so combine all the necessary
        # part of the json with comma
        json_to_str = ','.join(row[1:])
        # convert it to python dictionary
        str_to_dict = ast.literal_eval(json_to_str)
        # append those completed JSON into the data list
        data.append(str_to_dict)
# define the excel file
workbook = xlsxwriter.Workbook(xlsx_file)
# create a sheet for our work
worksheet = workbook.add_worksheet()
# cell format for merge fields with bold and align center
# letters and design border
merge_format = workbook.add_format({
    'bold': 1,
    'border': 1,
    'align': 'center',
    'valign': 'vcenter'})
# other cell format to design the border
cell_format = workbook.add_format({
    'border': 1,
})
# create the header section dynamically
first_col = 0
last_col = 0
for index, value in enumerate(data[0].items()):
    if isinstance(value[1], dict):
        # this if mean the JSON key has something else
        # other than the single value like dict or list
        last_col += len(value[1].keys())
        worksheet.merge_range(first_row=0,
                              first_col=first_col,
                              last_row=0,
                              last_col=last_col,
                              data=value[0],
                              cell_format=merge_format)
        for k, v in value[1].items():
            # this is for go in deep the value if exist
            worksheet.write(1, first_col, k, merge_format)
            first_col += 1
        first_col = last_col + 1
    else:
        # 'age' has only one value, so this else section
        # is for create normal headers like 'age'
        worksheet.write(1, first_col, value[0], merge_format)
        first_col += 1
# now we know how many columns exist in the
# excel, and set the width to 20
worksheet.set_column(first_col=0, last_col=last_col, width=20)
# filling values to excel file
for index, value in enumerate(data):
    last_col = 0
    for k, v in value.items():
        if isinstance(v, dict):
            # this is for handle values with dictionary
            for k1, v1 in v.items():
                if isinstance(v1, list):
                    # this will capture last 'type' list (['Grass', 'Hardball'])
                    # in the 'conditions'
                    worksheet.write(index + 2, last_col, ', '.join(v1), cell_format)
                else:
                    # just filling other values other than list
                    worksheet.write(index + 2, last_col, v1, cell_format)
                last_col += 1
        else:
            # this is handle single value other than dict or list
            worksheet.write(index + 2, last_col, v, cell_format)
            last_col += 1
# finally close to create the excel file
workbook.close()
I commented most of the lines to make the code easier to understand and to reduce the complexity, because you are very new to Python. If you didn't get any point let me know, I'll explain as much as I can. Additionally, I used the enumerate() Python built-in function. Check this small example, which I took directly from the original documentation. enumerate() is useful when numbering items in a list.
Return an enumerate object. iterable must be a sequence, an iterator, or some other object which supports iteration. The __next__() method of the iterator returned by enumerate() returns a tuple containing a count (from start which defaults to 0) and the values obtained from iterating over iterable.
>>> seasons = ['Spring', 'Summer', 'Fall', 'Winter']
>>> list(enumerate(seasons))
[(0, 'Spring'), (1, 'Summer'), (2, 'Fall'), (3, 'Winter')]
>>> list(enumerate(seasons, start=1))
[(1, 'Spring'), (2, 'Summer'), (3, 'Fall'), (4, 'Winter')]
Here is my csv file,
and here is the final output of the excel file. I just merged the duplicate header values (matchruns and conditions).
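The flattening step at the heart of the script can be tried on its own, without writing a workbook. Below is a rough sketch of that idea. The record is a made-up example (the real CSV was only shown as an image); only the key names 'age', 'matchruns', 'conditions' and the 'type' list come from the code comments above. flatten is a hypothetical helper name:

```python
import ast


def flatten(record):
    """Return (top_header, sub_header, value) triples, one per spreadsheet column."""
    columns = []
    for key, value in record.items():
        if isinstance(value, dict):
            # one column per nested key, grouped under the parent key
            for sub_key, sub_value in value.items():
                if isinstance(sub_value, list):
                    sub_value = ', '.join(str(item) for item in sub_value)
                columns.append((key, sub_key, sub_value))
        else:
            # a plain value gets a single column with no sub-header
            columns.append((key, '', value))
    return columns


# Hypothetical cell text, invented for illustration only
cell = ("{'age': 35, 'matchruns': {'odi': 100, 'test': 250}, "
        "'conditions': {'pitch': 'Dry', 'type': ['Grass', 'Hardball']}}")
record = ast.literal_eval(cell)
for top, sub, value in flatten(record):
    print(top, sub, value)
```

Each triple corresponds to one column: the first element is the merged header, the second the sub-header, and the third the cell value.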

How to recategorize numeric values into new grouping using Pandas as a function, with no limit of conditions [duplicate]

I've just started coding in python, and my general coding skills are fairly rusty :( so please be a bit patient
I have a pandas dataframe:
It has around 3m rows. There are 3 kinds of age_units: Y, D, W for years, Days & Weeks. Any individual over 1 year old has an age unit of Y and my first grouping I want is <2y old so all I have to test for in Age Units is Y...
I want to create a new column AgeRange and populate with the following ranges:
<2
2 - 18
18 - 35
35 - 65
65+
so I wrote a function
def agerange(values):
    for i in values:
        if complete.Age_units == 'Y':
            if complete.Age > 1 AND < 18 return '2-18'
            elif complete.Age > 17 AND < 35 return '18-35'
            elif complete.Age > 34 AND < 65 return '35-65'
            elif complete.Age > 64 return '65+'
        else return '< 2'
I thought if I passed in the dataframe as a whole I would get back what I needed and then could create the column I wanted something like this:
agedetails['age_range'] = ageRange(agedetails)
BUT when I try to run the first code to create the function I get:
File "<ipython-input-124-cf39c7ce66d9>", line 4
if complete.Age > 1 AND complete.Age < 18 return '2-18'
^
SyntaxError: invalid syntax
Clearly it is not accepting the AND - but I thought I heard in class I could use AND like this? I must be mistaken but then what would be the right way to do this?
So after getting that error, I'm not even sure whether passing in a dataframe like this will throw an error as well. I am guessing probably yes. In which case, how would I make that work too?
I am looking to learn the best method, but part of the best method for me is keeping it simple even if that means doing things in a couple of steps...
With Pandas, you should avoid row-wise operations, as these usually involve an inefficient Python-level loop. Here are a couple of alternatives.
Pandas: pd.cut
As @JonClements suggests, you can use pd.cut for this, the benefit here being that your new column becomes a Categorical.
You only need to define your boundaries (including np.inf) and category names, then apply pd.cut to the desired numeric column.
bins = [0, 2, 18, 35, 65, np.inf]
names = ['<2', '2-18', '18-35', '35-65', '65+']
# right=False makes each bin include its lower bound, e.g. [2, 18), so age 2 maps to '2-18'
df['AgeRange'] = pd.cut(df['Age'], bins, labels=names, right=False)
print(df.dtypes)
# Age             int64
# Age_units      object
# AgeRange     category
# dtype: object
NumPy: np.digitize
np.digitize provides another clean solution. The idea is to define your boundaries and names, create a dictionary, then apply np.digitize to your Age column. Finally, use your dictionary to map your category names.
Note that for boundary cases the lower bound is used for mapping to a bin.
import pandas as pd, numpy as np
df = pd.DataFrame({'Age': [99, 53, 71, 84, 84],
                   'Age_units': ['Y', 'Y', 'Y', 'Y', 'Y']})
bins = [0, 2, 18, 35, 65]
names = ['<2', '2-18', '18-35', '35-65', '65+']
d = dict(enumerate(names, 1))
df['AgeRange'] = np.vectorize(d.get)(np.digitize(df['Age'], bins))
Result
   Age Age_units AgeRange
0   99         Y      65+
1   53         Y    35-65
2   71         Y      65+
3   84         Y      65+
4   84         Y      65+
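On the SyntaxError itself: Python's boolean operator is the lowercase and, both sides need a full comparison, and a range test can also be chained as 1 < age < 18. The binning idea used above also works on a single value with the standard library's bisect module. This is only a sketch of the same lower-bound-inclusive rule, not a replacement for the vectorised versions. age_range is a hypothetical function name, and treating every 'D' or 'W' row as '<2' is an assumption based on the question's description:

```python
from bisect import bisect_right

# Lower bounds of every bin after the first; an age equal to a bound goes to the upper bin
BOUNDS = [2, 18, 35, 65]
NAMES = ['<2', '2-18', '18-35', '35-65', '65+']


def age_range(age, age_units):
    """Map one (age, unit) pair to its range label."""
    # Assumption from the question: days and weeks are only used for under-ones
    if age_units != 'Y':
        return '<2'
    return NAMES[bisect_right(BOUNDS, age)]


print(age_range(1, 'Y'))    # <2
print(age_range(17, 'Y'))   # 2-18
print(age_range(18, 'Y'))   # 18-35
print(age_range(30, 'W'))   # <2
```

For a 3m-row dataframe the pd.cut or np.digitize route stays the better choice; a per-value function like this is mainly useful for checking the boundaries.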

How to combine the whole line in an excel into one list element?

I am a student and currently studying Python. I want to make a program that combines each whole line of an Excel file into one list element. How am I supposed to do that? Currently, I can get the data from the Excel file into a string. How do I combine the values and turn them into a list with each line as an element? How do I write this Excel file into a dictionary?
Thanks in advance.
Below is my code:
import csv
def getMedalStats():
    fLocation="C://TEMP//"
    print("Assumed file location is at: ", fLocation)
    fName = input("\nPlease enter a file name with its extension (ex. XXX.txt): ")
    fin = open(fLocation + fName, 'r')
    aStr = fin.read()
    return aStr
#*********************************************************************************
def main():
    eventList = getMedalStats()
    print(eventList)
I strongly recommend using pandas.read_excel() to do this. Pandas is a very helpful module that makes data manipulation super easy and convenient.
As a starting point, you can do the following:
import pandas as pd
fin = pd.read_excel(fLocation + fName)
This would return fin as a pandas dataframe, which you can then play with.
As an example of how to manipulate dataframes:
d = {'col1': [1, 2, 3], 'col2': [3, 4, 5], 'col3': ['3a', '4b', '5d']}
df = pd.DataFrame(data=d)
If you want to turn the first row into a list, you can do:
list(df.loc[0]) where 0 is the index of the first row.
Or if you want to turn the first column into a list, you can do:
list(df.loc[:,'col1']) where col1 is the name of the first column.
Or if you want to turn the whole excel into a list with each row as an element (by concatenating all values in each row):
df['all']=df.astype(str).apply(''.join, axis=1)
list(df.loc[:,'all'])
Let me know if it doesn't work or you need any other help!
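If the file is actually plain text or CSV (the question opens it with open() and imports csv), the standard library alone can produce both the list of rows and the dictionaries. Here is a minimal sketch under that assumption. The sample contents and column names are invented for illustration, and io.StringIO stands in for the real file:

```python
import csv
import io

# Hypothetical file contents; with a real file use open(fLocation + fName, newline='')
sample = io.StringIO("Event,Gold,Silver\nSwimming,USA,AUS\nRowing,GBR,NZL\n")

# Each line becomes one list element (itself a list of the cell values)
rows = list(csv.reader(sample))

# Rewind and read again, this time as one dictionary per data line keyed by the header
sample.seek(0)
records = list(csv.DictReader(sample))

print(rows[1])       # ['Swimming', 'USA', 'AUS']
print(records[0])    # {'Event': 'Swimming', 'Gold': 'USA', 'Silver': 'AUS'}
```

A genuine .xlsx workbook is a binary format, so for that case the pandas.read_excel() route above still applies.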

How to read a leaderboard from a text file and sort from lowest score to highest?

I have a reflex test game and have set up a text file that stores the user's name and score with a space in between. How do I sort the text file numerically so that the lowest numbers are at the top and the highest at the bottom?
E.g
Ben 1.43
Eric 3.53
Steve 7.45
I want to include the 2 decimal places.
Code:
import time
import random
global start
global end
global score
def gameBegin():
    print('***************')
    print('* Reflex Game *')
    print('***************')
    print("\nPress Enter as soon as you see the word 'stop'.")
    print('Press Enter to Begin')
    input()
def gameStart():
    global start
    global end
    time.sleep(random.randint(1,1))
    start = time.time()
    input('STOP')
    end = time.time()
def gameScore():
    global start
    global end
    global score
    score=round(end-start,2)
    print (score)
def scorePrint():
    global score
    with open("Leaderboards.txt", "r+") as board:
        print (board.read())
def boardEdit():
    global score
    name = input('Enter Your Name For The Leader Board : ')
    board = open ("Leaderboards.txt","a+")
    board.write(name )
    board.write(" ")
    board.write(str(score) )
def boardSort():
    #HELP
gameBegin()
gameStart()
gameScore()
scorePrint()
boardEdit()
boardSort()
Look at this link: https://wiki.python.org/moin/HowTo/Sorting
It will help you with any kind of sort you need.
To do what you are asking, you would need to perform a sort before printing the leaderboard.
A simple ascending sort is very easy -- just call the sorted() function. It returns a new sorted list:
sorted([5, 2, 3, 1, 4])
then becomes
[1, 2, 3, 4, 5]
You can also use the list.sort() method of a list. It modifies the list in-place (and returns None to avoid confusion). Usually it's less convenient than sorted() - but if you don't need the original list, it's slightly more efficient.
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> a
[1, 2, 3, 4, 5]
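Applied to the leaderboard, the sort needs a key that turns the score text into a number. Here is a minimal sketch, assuming the file holds one "name score" entry per line (the boardEdit() in the question never writes a newline, so that would need adding first). sort_board and format_board are hypothetical helper names:

```python
def sort_board(lines):
    """Parse 'name score' lines and return (name, score) pairs, lowest score first."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last run of whitespace so names may contain spaces
        name, score = line.rsplit(None, 1)
        entries.append((name, float(score)))
    return sorted(entries, key=lambda entry: entry[1])


def format_board(entries):
    """Render the entries back to text, always with two decimal places."""
    return [f"{name} {score:.2f}" for name, score in entries]


lines = ["Steve 7.45\n", "Ben 1.43\n", "Eric 3.53\n"]
for row in format_board(sort_board(lines)):
    print(row)
# Ben 1.43
# Eric 3.53
# Steve 7.45
```

With a real file, the lines list could come from open("Leaderboards.txt").readlines(), and the formatted rows could be written back or printed in scorePrint().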
