importing txt file into tuple within dictionary - python-3.x

I have a data structure in a .txt file set up as follows:
andrew
3 3 1 0 3 0 3 0 0 -3 0 5 3 0 1 0 0 5 3 0 0 0 0 1 0 3 0 1 0 0 3 5 3 3 0
0 0 5 0 5 0 3 3 0 -3 0 0 5 1 5 3 0 3 0 0
I'm trying to import it as in this format:
{'user':[0,1,3,4,5]}
I've tried various implementations, but can't find anything to suit my needs and account for so many variables.
My current code is as follows:
ratings = {}
with open('ratings.txt', 'r') as f:
    for line in f:
        x = line.split()
        print(x)
        ratings[x[0]] = [y for y in x[1:]]
Any idea on how to improve the code, please?

You can read the whole file with read() and then split it, so you can build the dict without reading line by line:
ratings = {}
with open('ratings.txt', 'r') as f:
    X = f.read().split()
    # keep the unique ratings that are >= 0
    x = set(int(i) for i in X[1:] if int(i) >= 0)
    ratings[X[0]] = sorted(x)
print(ratings)  # output --> {'andrew': [0, 1, 3, 5]}
This assumes your text file always has the format shown above, with only one user in it.
Judging by your text file, you could also read it two lines at a time, until you reach an empty line or the end of the file:
ratings = {}
with open('ratings.txt', 'r') as f:
    while True:
        name = f.readline().strip()
        values = f.readline().strip().split()
        if not name:
            break
        ratings[name] = sorted(set(int(i) for i in values if int(i) >= 0))
print(ratings)  # output --> {'andrew': [0, 1, 3, 5]}
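If the file holds several users, or a user's ratings wrap over more than one line as in the sample above, a small state machine may be easier to extend. The following is only a sketch under those assumptions: parse_ratings is a hypothetical helper, and it treats any line that is not made up entirely of integers as a user name.

```python
import io


def parse_ratings(lines):
    """Map each user name to the sorted unique ratings that are >= 0."""
    ratings = {}
    current = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        try:
            values = [int(t) for t in tokens]
        except ValueError:
            # Assumption: a line that is not all integers is a user name.
            current = line.strip()
            ratings.setdefault(current, set())
            continue
        if current is not None:
            ratings[current].update(v for v in values if v >= 0)
    return {name: sorted(vals) for name, vals in ratings.items()}


# An in-memory file stands in for ratings.txt here.
sample = io.StringIO("andrew\n3 3 1 0 -3 5\n0 0 5 1\nbeth\n1 -3 1\n")
print(parse_ratings(sample))
```

With a real file, the same function can be called as parse_ratings(f) inside a with open(...) block, since a file object also yields lines.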

Related

How to split string with the values in their specific columns indexed on their label?

I have the following data
Index Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
. .
. .
. .
I have to split the data into separate columns, each with its own header and values. The result should look like this:
Index Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
. .
. .
. .
The data is not limited to CO/PET/CV/EL; there should be as many columns as needed, each displaying its corresponding value.
The .str.split('-', expand=True) function only splits the data on the delimiter: it keeps all first values in the same column and does not name the columns.
Is there a way to implement this in Python?
You could do:
df.Data.str.split('-').explode().str.split(r'(?<=\d)(?=\D)', expand=True) \
    .reset_index().pivot(index='index', columns=1, values=0).fillna(0).reset_index()
1 Index CO CV EL PET
0 0 100 0 0 0
1 1 50 0 0 50
2 2 0 98 2 0
3 3 50 50 0 0
The idea is to first split the values by -, then extract the numeric and non-numeric parts as tuples, append them to a list and convert it to a dictionary. The dictionaries are passed in a list comprehension to the DataFrame constructor, missing values are replaced with 0, and the result is converted to integers:
import re
import pandas as pd

def f(x):
    L = []
    for val in x.split('-'):
        # k is the number and v is the label, e.g. ('50', 'CO')
        k, v = re.findall(r'(\d+)(\D+)', val)[0]
        L.append((v, k))
    return dict(L)

df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print(df)
Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
If the data contains values without a number, or values that are only a number, the solution should be made more general, like this:
print (df)
Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
4 AAA
5 20
def f(x):
    L = []
    for val in x.split('-'):
        extracted = re.findall(r'(\d+)(\D+)', val)
        if len(extracted) > 0:
            k, v = extracted[0]
            L.append((v, k))
        elif val.isdigit():
            # a bare number has no label to use as a column name
            L.append(('No match digit', val))
        else:
            # a bare label has no number, so store 0
            L.append((val, 0))
    return dict(L)

df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print(df)
Data CO PET CV EL AAA No match digit
0 100CO 100 0 0 0 0 0
1 50CO-50PET 50 50 0 0 0 0
2 98CV-2EL 0 0 98 2 0 0
3 50CV-50CO 50 0 50 0 0 0
4 AAA 0 0 0 0 0 0
5 20 0 0 0 0 0 20
Try this:
import pandas as pd
import re
df = pd.DataFrame({'Data':['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})
# one dict per row, mapping each label to its number
split_df = pd.DataFrame(
    df.Data.apply(
        lambda x: {re.findall('[A-Z]+', el)[0]: re.findall('[0-9]+', el)[0]
                   for el in x.split('-')}
    ).tolist()
)
split_df = split_df.fillna(0)
df = pd.concat([df, split_df], axis = 1)
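Another possible route, shown here only as a sketch, is Series.str.extractall with named groups, followed by a groupby and unstack. The names row, value, label, parts and wide are illustrative choices, and the pattern assumes every component is digits followed by letters.

```python
import pandas as pd

df = pd.DataFrame({"Data": ["100CO", "50CO-50PET", "98CV-2EL", "50CV-50CO"]})

# One output row per "<digits><letters>" match; "row" names the original index.
parts = (
    df["Data"]
    .rename_axis("row")
    .str.extractall(r"(?P<value>\d+)(?P<label>[A-Za-z]+)")
    .reset_index()
)
parts["value"] = parts["value"].astype(int)

# Sum per (row, label), then spread the labels into columns.
wide = (
    parts.groupby(["row", "label"])["value"]
    .sum()
    .unstack(fill_value=0)
    .reindex(df.index, fill_value=0)  # keep rows that had no match at all
)

result = df.join(wide)
print(result)
```

The label columns come out in alphabetical order (CO, CV, EL, PET) rather than in order of first appearance, so reorder them afterwards if that matters.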

Getting a dataframe of combinations from a list of dictionaries

I have the following list of dictionaries:
options = [{'A-1': ['x', 'y']},
           {'A-3': ['x', 'y', 'z']}]
The values of each dictionary (e.g. x and y) are the options that its key (e.g. A-1) can have. How can I get the following dataframe of combinations? Only one value of a key (e.g. either x or y for A-1) can be 1 at a time, and the values of a dictionary cannot all be 0 at the same time.
I have tried to use itertools.combinations(), but couldn't find a way to get the desired result.
This way I can find the number of combinations n_comb and the number of connections n_conn, which will be the number of rows and columns of the dataframe:
n_conn = 0
n_comb = 1
for dic in options:
    for key in dic:
        n_comb = n_comb * len(dic[key])
        n_conn = n_conn + len(dic[key])
One way using pandas.get_dummies and merge:
import pandas as pd

# dtype=int keeps the 0/1 output shown below (newer pandas versions default to booleans)
dfs = [pd.get_dummies(pd.DataFrame(o), dtype=int).assign(merge=1) for o in options]
new_df = dfs[0].merge(dfs[1], on="merge").drop(columns="merge")
print(new_df)
Or make it more flexible using functools.reduce:
from functools import reduce
new_df = reduce(lambda x, y: x.merge(y, on="merge"), dfs).drop(columns="merge")
Output:
A-1_x A-1_y A-3_x A-3_y A-3_z
0 1 0 1 0 0
1 1 0 0 1 0
2 1 0 0 0 1
3 0 1 1 0 0
4 0 1 0 1 0
5 0 1 0 0 1
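Since the question mentions itertools, here is a sketch of the same table built with itertools.product instead of merges. It assumes options is the two-dictionary list from the question; pairs, columns, rows and combos are illustrative names.

```python
from itertools import product

import pandas as pd

options = [{"A-1": ["x", "y"]}, {"A-3": ["x", "y", "z"]}]

# Flatten to (key, values) pairs, keeping the original order.
pairs = [(key, vals) for d in options for key, vals in d.items()]
columns = [f"{key}_{v}" for key, vals in pairs for v in vals]

rows = []
# product() picks exactly one value per key, so every key has a single 1.
for choice in product(*(vals for _, vals in pairs)):
    chosen = {f"{key}_{v}" for (key, _), v in zip(pairs, choice)}
    rows.append([int(col in chosen) for col in columns])

combos = pd.DataFrame(rows, columns=columns)
print(combos)
```

The frame has n_comb rows and n_conn columns as computed in the question, here 6 and 5.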

How to append strings from a loop to a single output line?

So I am trying to make a text-based game of connect-4 for the purposes of better understanding Python and how it actually works.
Short version
How can I append printed text from every run-through of a while loop to a print output that exists just before the while loop?
Out of the two methods seen below (the work in progress and the currently working one), which is the better practice for producing the desired output?
Long version
I am trying to use a looping system to print out an array in an evenly spaced and aesthetically pleasing format after every turn is taken, so users have clear feedback of what the current board looks like before the next turn is taken.
To do this I want to be able to have lines of code that are as small as possible for making it easier to read the code itself. Although this might not be the best practice for executing this scenario I want to understand this way of coding better so I could apply it to future projects if need be.
In terms of the actual execution, I am trying to use a while loop to append the 7 positions of an array row one after another on the same output line. After this, I want to print the next row on the line below the previous one, as seen under "Desired output" below.
Thank you in advance for your answers, suggestions and comments.
Work in progress
import numpy as np

ARRAY = np.zeros(shape=(6, 7), dtype='int8')
# In reality I will be using an empty array that gradually gets populated
# Zeros are used for ease of asking the question

def Display_board():
    i = 0
    while i < 7:
        j = 0
        print(" ", end = " ")
        while j < 8:
            print(str(ARRAY[i][j]))
            j += 1
        i += 1
work in progress output
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
# It goes on, but I didn't include the rest as it would take up unnecessary space in the question
If I change the line that prints the array as follows, I get another undesired output:
print(str(ARRAY[i][j]), end = " ")
#output
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Current working method - Gives desired output
def Display_board():
    for i in range(6):
        print(" " + str(ARRAY[i][0]) + " " + str(ARRAY[i][1]) + " " + str(ARRAY[i][2]) \
              + " " + str(ARRAY[i][3]) + " " + str(ARRAY[i][4]) + " " + str(ARRAY[i][5]) \
              + " " + str(ARRAY[i][6]))
Desired output
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
The simple fix is to use end=' ' on the print inside the while loop on j and then add a print() after it:
def Display_board():
    i = 0
    while i < 6:
        j = 0
        print(" ", end = " ")
        while j < 7:
            print(str(ARRAY[i][j]), end=" ")
            j += 1
        print()  # end the current row
        i += 1
Output:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
You can also use a nested list comprehension with join to achieve the output in one line:
def Display_board():
    print('\n'.join(' '.join([' '] + [str(ARRAY[i][j]) for j in range(7)]) for i in range(6)))
I came up with two functions.
First one:
def display_board1(board):
    m, n = board.shape
    for i in range(m):
        for j in range(n):
            print(board[i][j], end=' ')
        print()
    return 1
Second one:
def display_board2(board):
    s = str(board)
    # strip the brackets that NumPy puts around the array and its rows
    s = s.replace('[', '').replace(']', '')
    s = ' ' + s
    print(s)
    return 1
The return 1 statements are only there for plotting; delete them if you don't want them.
Here's their performance with respect to input size:
display_board2() is faster and more stable.
import numpy as np
import perfplot

bench = perfplot.bench(
    setup=np.zeros,
    kernels=[
        display_board1,
        display_board2,
    ],
    n_range=[(i, i) for i in range(10)],
)
bench.show()
FINAL FINAL EDIT:
Fixed the code to ACTUALLY use the width setting!
FINAL EDIT :)
If the numbers can be greater than 9 you can use the wonderful python f-string formatting option:
ARRAY = [
[1, 2, 3, 4, 5],
[10, 20, 3, 4, 5],
[1, 2, 30, 4, 5],
[1, 2, 3, 4, 500],
]
width = 3
for row in ARRAY:
    print(" ".join(f'{x:>{width}}' for x in row))
which produces:
  1   2   3   4   5
 10  20   3   4   5
  1   2  30   4   5
  1   2   3   4 500
EDIT:
This, while less intuitive, is shorter and arguably more pythonic:
for row in ARRAY:
    print(" ".join(map(str, row)))
This will work for any ARRAY:
ARRAY = [
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
[1, 2, 3, 4, 5],
]
for row in ARRAY:
    for n in row:
        print(n, end=" ")
    print()
produces:
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
Edited to remove "" in print("")
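For the connect-4 board itself, it can help to build the text first and print it once, which also makes the layout easy to check. This is a minimal sketch, assuming a 6x7 NumPy array like the one in the question; format_board is a hypothetical helper that picks the column width from the widest value.

```python
import numpy as np


def format_board(board, sep=" "):
    """Return the board as text, one row per line, columns right-aligned."""
    cells = [[str(value) for value in row] for row in board]
    # Width of the widest cell; default=1 covers an empty board.
    width = max((len(c) for row in cells for c in row), default=1)
    return "\n".join(sep.join(c.rjust(width) for c in row) for row in cells)


board = np.zeros((6, 7), dtype="int8")
board[5][3] = 1  # pretend player 1 dropped a piece in the middle column
print(format_board(board))
```

Because the function returns a string instead of printing, the same text can be reused after every turn or compared against an expected layout.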

Faster way to count number of timestamps before another timestamp

I have two dataframes, "train" and "log". "log" has a datetime column "time1", while "train" has a datetime column "time2". For every row in "train", I want to count the rows in "log" where "time1" is before "time2".
I already tried the apply method on the dataframe:
def log_count(row):
    return sum((log['user_id'] == row['user_id']) & (log['time1'] < row['time2']))

train.apply(log_count, axis=1)
This approach takes a very long time.
Since you want to do this once for each (paired) user_id group, you could do the following:
Create a column called is_log which is 1 in log and 0 in train:
log['is_log'] = 1
train['is_log'] = 0
The is_log column will be used to keep track of whether or not a row comes from log or train.
Concatenate the log and train DataFrames:
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
Sort the combined DataFrame by user_id and time:
combined = combined.sort_values(by=["user_id", "time"])
So now combined looks something like this:
time user_id is_log
6 2000-01-17 0 0
0 2000-03-13 0 1
1 2000-06-08 0 1
7 2000-06-25 0 0
4 2000-07-09 0 1
8 2000-07-18 0 0
10 2000-03-13 1 0
5 2000-04-16 1 0
3 2000-08-04 1 1
9 2000-08-17 1 0
2 2000-10-20 1 1
Now the count that you are looking for can be expressed as a cumulative sum of the is_log column, grouped by user_id:
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
This is the main idea: Counting the number of 1s in the is_log column is equivalent to counting the number of times in log which come before each time in train.
For example,
import numpy as np
import pandas as pd
np.random.seed(2019)
def random_dates(N):
    return np.datetime64("2000-01-01") + np.random.randint(
        365, size=N
    ) * np.timedelta64(1, "D")

N = 5
log = pd.DataFrame({"time1": random_dates(N), "user_id": np.random.randint(2, size=N)})
train = pd.DataFrame(
    {
        "time2": np.r_[random_dates(N), log.loc[0, "time1"]],
        "user_id": np.random.randint(2, size=N + 1),
    }
)
log["is_log"] = 1
train["is_log"] = 0
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
combined = combined.sort_values(by=["user_id", "time"])
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
print(log)
# time1 user_id is_log
# 0 2000-03-13 0 1
# 1 2000-06-08 0 1
# 2 2000-10-20 1 1
# 3 2000-08-04 1 1
# 4 2000-07-09 0 1
print(train)
yields
time user_id is_log count
6 2000-01-17 0 0 0
7 2000-06-25 0 0 2
8 2000-07-18 0 0 3
10 2000-03-13 1 0 0
5 2000-04-16 1 0 0
9 2000-08-17 1 0 1
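A different technique that may be worth comparing is a per-user binary search with np.searchsorted, which keeps train in its original row order. The sketch below is not the method above; count_earlier_logs is a hypothetical helper, and the dates are copied from the example output, plus one extra row whose time ties with a log entry of the same user to show the strict "before" comparison.

```python
import numpy as np
import pandas as pd


def count_earlier_logs(train, log):
    """For each train row, count log rows of the same user with time1 < time2."""
    counts = pd.Series(0, index=train.index, dtype="int64")
    # Sort each user's log times once so they can be binary-searched.
    log_times = {
        uid: np.sort(group["time1"].to_numpy())
        for uid, group in log.groupby("user_id")
    }
    for uid, group in train.groupby("user_id"):
        times = log_times.get(uid)
        if times is None:
            continue  # this user has no log rows at all
        # side="left" counts only the strictly earlier timestamps.
        counts.loc[group.index] = np.searchsorted(
            times, group["time2"].to_numpy(), side="left"
        )
    return counts


log = pd.DataFrame({
    "user_id": [0, 0, 1, 1, 0],
    "time1": pd.to_datetime(
        ["2000-03-13", "2000-06-08", "2000-10-20", "2000-08-04", "2000-07-09"]
    ),
})
train = pd.DataFrame({
    "user_id": [0, 0, 0, 0, 1, 1, 1],
    "time2": pd.to_datetime(
        ["2000-01-17", "2000-06-25", "2000-07-18", "2000-03-13",
         "2000-03-13", "2000-04-16", "2000-08-17"]
    ),
})
train["count"] = count_earlier_logs(train, log)
print(train)
```

The fourth train row ties with a log entry of user 0 and still gets a count of 0, which matches the < in the original log_count function.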

delete specific rows from csv using pandas

I have a csv file in the format shown below:
I have written the following code that reads the file and randomly deletes the rows that have steering value as 0. I want to keep just 10% of the rows that have steering value as 0.
df = pd.read_csv(filename, header=None, names = ["center", "left", "right", "steering", "throttle", 'break', 'speed'])
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
However, I get the following error:
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 1104, in mtrand.RandomState.choice
(numpy/random/mtrand/mtrand.c:17062)
ValueError: a must be greater than 0
Can you guys help me?
Sample DataFrame built with @andrew_reece's code:
In [9]: df
Out[9]:
center left right steering throttle brake
0 center_54.jpg left_75.jpg right_39.jpg 1 0 0
1 center_20.jpg left_81.jpg right_49.jpg 3 1 1
2 center_34.jpg left_96.jpg right_11.jpg 0 4 2
3 center_98.jpg left_87.jpg right_34.jpg 0 0 0
4 center_67.jpg left_12.jpg right_28.jpg 1 1 0
5 center_11.jpg left_25.jpg right_94.jpg 2 1 0
6 center_66.jpg left_27.jpg right_52.jpg 1 3 3
7 center_18.jpg left_50.jpg right_17.jpg 0 0 4
8 center_60.jpg left_25.jpg right_28.jpg 2 4 1
9 center_98.jpg left_97.jpg right_55.jpg 3 3 0
.. ... ... ... ... ... ...
90 center_31.jpg left_90.jpg right_43.jpg 0 1 0
91 center_29.jpg left_7.jpg right_30.jpg 3 0 0
92 center_37.jpg left_10.jpg right_15.jpg 1 0 0
93 center_18.jpg left_1.jpg right_83.jpg 3 1 1
94 center_96.jpg left_20.jpg right_56.jpg 3 0 0
95 center_37.jpg left_40.jpg right_38.jpg 0 3 1
96 center_73.jpg left_86.jpg right_71.jpg 0 1 0
97 center_85.jpg left_31.jpg right_0.jpg 3 0 4
98 center_34.jpg left_52.jpg right_40.jpg 0 0 2
99 center_91.jpg left_46.jpg right_17.jpg 0 0 0
[100 rows x 6 columns]
In [10]: df.steering.value_counts()
Out[10]:
0 43 # NOTE: 43 zeros
1 18
2 15
4 12
3 12
Name: steering, dtype: int64
In [11]: df.shape
Out[11]: (100, 6)
your solution (unchanged):
In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)
In [13]: df.steering.value_counts()
Out[13]:
1 18
2 15
4 12
3 12
0 4 # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64
In [14]: df.shape
Out[14]: (61, 6)
NOTE: make sure that steering column has numeric dtype! If it's a string (object) then you would need to change your code as follows:
df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
# NOTE:                          ^ ^
after that you can save the modified (reduced) DataFrame to CSV:
df.to_csv('/path/to/filename.csv', index=False)
Here's a one-line approach, using concat() and sample():
import numpy as np
import pandas as pd
# first, some sample data
# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)
# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)
print(df.shape)
(100, 6)
The first few rows of sample output:
df.head()
center left right steering throttle brake
0 center_72.jpg left_26.jpg right_59.jpg 3 3 0
1 center_75.jpg left_68.jpg right_26.jpg 0 0 2
2 center_29.jpg left_8.jpg right_88.jpg 0 1 0
3 center_22.jpg left_26.jpg right_23.jpg 1 0 0
4 center_88.jpg left_0.jpg right_56.jpg 4 1 0
5 center_93.jpg left_18.jpg right_15.jpg 0 0 0
Now drop all but 10% of rows with steering==0:
newdf = pd.concat([df.loc[df.steering!=0],
                   df.loc[df.steering==0].sample(frac=0.1)])
With the probability weightings I used in this example, you'll see somewhere between 50-60 remaining entries in newdf, with about 5 steering==0 cases remaining.
Using a mask on steering combined with a random number should work:
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]
This does generate some extra random values, but it's nice and compact.
Edit: That said, I tried your example code and it worked as well. My guess is that the error comes from your df.query() statement returning an empty dataframe, which probably means that the "steering" column does not contain any zeros, or alternatively that the column is read as strings rather than as numbers. Try converting the column to integer before running the above snippet.
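The two suspected causes, a string-typed column and an empty selection, can both be guarded against. The following is a sketch under those assumptions: keep_fraction_of_zeros, keep and seed are hypothetical names, and the frame is assumed to have a unique index.

```python
import pandas as pd


def keep_fraction_of_zeros(df, column="steering", keep=0.10, seed=0):
    """Drop rows where `column` is 0, keeping roughly `keep` of them."""
    # Coerce so that "0" read as text still compares equal to 0.
    values = pd.to_numeric(df[column], errors="coerce")
    zeros = df[values == 0]
    if zeros.empty:
        return df.copy()  # nothing to sample from, so nothing to drop
    n_drop = len(zeros) - int(round(len(zeros) * keep))
    to_drop = zeros.sample(n=n_drop, random_state=seed).index
    return df.drop(to_drop)


# Toy data where steering was read as strings.
demo = pd.DataFrame({"steering": ["0"] * 20 + ["1"] * 5})
reduced = keep_fraction_of_zeros(demo)
print(reduced["steering"].value_counts())
```

Passing a fixed seed only makes the example repeatable; leave random_state unset in sample() if a different subset is wanted on each run.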
