Score on testing set - Text mining - python-3.x

I'm trying to print the accuracy score after validation. The validation score is 82%, but the accuracy score on the test set is 0...
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

grid = GridSearchCV(pipe, param_grid, cv=5)
grille = grid.fit(text_train, Y_train)
y_pred = grille.predict(text_test)
# confusion matrix
print("confusion matrix")
print(metrics.confusion_matrix(Y_test, y_pred))
# accuracy on the test set
print("accuracy score")
print(metrics.accuracy_score(Y_test, y_pred))
When I print the confusion matrix:
[[ 0 0 0 0 0 0 0 0 0 0 27 2]
[ 0 0 0 0 0 0 0 0 0 0 88 23]
[ 0 0 0 0 0 0 0 0 0 0 12 0]
[ 0 0 0 0 0 0 0 0 0 0 22 4]
[ 0 0 0 0 0 0 0 0 0 0 108 90]
[ 0 0 0 0 0 0 0 0 0 0 28 60]
[ 0 0 0 0 0 0 0 0 0 0 16 94]
[ 0 0 0 0 0 0 0 0 0 0 49 424]
[ 0 0 0 0 0 0 0 0 0 0 5 76]
[ 0 0 0 0 0 0 0 0 0 0 78 1487]
[ 0 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 0]]
I don't understand why, and I don't understand the format of my matrix (what does a row like 0 0 0 0 0 0 0 0 0 0 27 2 mean?).

Here is the documentation for confusion_matrix and accuracy_score.
Very simply, the confusion matrix says that all of the validation inputs were in classes 0-9, but were incorrectly identified as class 10 or 11, so the diagonal is empty and the accuracy is 0.
Since you haven't provided the canonical minimal, complete, verifiable example, I have no way to work on the problem past this point. Please update the question with code that reproduces the problem.
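As a side note on reading the matrix: in scikit-learn, the rows of confusion_matrix are the true labels and the columns are the predictions, so an empty diagonal means zero accuracy. A toy illustration with made-up labels:
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 0, 1, 1]
y_pred = [2, 2, 2, 2]  # every prediction lands in a class that never occurs in y_true
print(confusion_matrix(y_true, y_pred))
# [[0 0 2]
#  [0 0 2]
#  [0 0 0]]
print(accuracy_score(y_true, y_pred))  # 0.0 -- the diagonal is empty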

Related

Parse .log file to collect data after keyword then create a nested dictionary using predefined column names

I am trying to parse a .log file using Python for status information about the processes of a Linux system. The .log file has many different sections of information; the sections of interest start with "##START-ALLPROCESSES-xxxxxxxx", where x is the epoch date, and end with "##END-ALLPROCESSES-xxxxxxxx". After this line each process is listed with 52 columns each; the number of processes may change depending on the info recorded at the time, and there may be multiple sections with this information at different times.
The idea is to open the .log file, find the sections, and then use the XXXXXXX as the key for a nested dictionary whose keys are the predefined column names filled in with the values from the section, and do this for all the different sections found in the .log file. The nested dictionary would look something like below:
[date1-XXXXXX:
[ columnName1: process1,
.
.
.
columnName52: info1
],
.
.
.
[ columnName1: process52,
.
.
.
columnName52: info52
]
],
[date2-XXXXXX:
[ columnName1: process1,
.
.
.
columnName52: info1
],
.
.
.
[ columnName1: process52,
.
.
.
columnName52: info52
]
]
The data in the .log file looks as follows, and there would be multiple sections like this, each with a different date; each line starts with the process id and (process name):
##START-ALLPROCESSES-1676652419
1 (systemd) S 0 1 1 0 -1 4210944 2070278 9743969773 2070 2703 8811 11984 7638026 9190549 20 0 1 0 0 160043008 745 18446744073709551615 187650352414720 187650353516788 281474853505456 0 0 0 671173123 4096 1260 1 0 0 17 0 0 0 2706 0 0 187650353585800 187650353845340 187651263758336 281474853506734 281474853506745 281474853506745 281474853507053 0
10 (rcu_bh) I 2 0 0 0 -1 2129984 0 0 0 0 0 0 0 0 20 0 1 0 2 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10251 (kworker/1:2) I 2 0 0 0 -1 69238880 0 0 0 0 0 914 0 0 20 0 1 0 617684776 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 1 0 0 0 0 0 0 0 0 0 0 0 0 0
10299 (loop2) S 2 0 0 0 -1 3178560 0 0 0 0 0 24 0 0 0 -20 1 0 10871 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 169 0 0 0 0 0 0 0 0 0 0
10648 (kworker/2:0) I 2 0 0 0 -1 69238880 0 0 0 0 0 567 0 0 20 0 1 0 663634994 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 2 0 0 0 0 0 0 0 0 0 0 0 0 0
1082 (nvme-wq) I 2 0 0 0 -1 69238880 0 0 0 0 0 0 0 0 0 -20 1 0 109 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 3 0 0 0 0 0 0 0 0 0 0 0 0 0
1095 (scsi_eh_0) S 2 0 0 0 -1 2129984 0 0 0 0 0 0 0 0 20 0 1 0 110 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 3 0 0 0 0 0 0 0 0 0 0 0 0 0
1096 (scsi_tmf_0) I 2 0 0 0 -1 69238880 0 0 0 0 0 0 0 0 0 -20 1 0 110 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1099 (scsi_eh_1) S 2 0 0 0 -1 2129984 0 0 0 0 0 0 0 0 20 0 1 0 110 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 (migration/0) S 2 0 0 0 -1 69238848 0 0 0 0 0 4961 0 0 -100 0 1 0 2 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 99 1 0 0 0 0 0 0 0 0 0 0 0
1100 (scsi_tmf_1) I 2 0 0 0 -1 69238880 0 0 0 0 0 0 0 0 0 -20 1 0 110 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##END-ALLPROCESSES-1676652419
I have tried it multiple ways but I cannot seem to get it to work correctly; my last attempt:
import os
import re

columns = ['pid', 'comm', 'state', 'ppid', 'pgrp', 'session', 'tty_nr', 'tpgid', 'flags', 'minflt', 'cminflt', 'majflt', 'cmajflt', 'utime', 'stime',
           'cutime', 'cstime', 'priority', 'nice', 'num_threads', 'itrealvalue', 'starttime', 'vsize', 'rss', 'rsslim', 'startcode', 'endcode', 'startstack', 'kstkesp',
           'kstkeip', 'signal', 'blocked', 'sigignore', 'sigcatch', 'wchan', 'nswap', 'cnswap', 'exit_signal', 'processor', 'rt_priority', 'policy', 'delayacct_blkio_ticks',
           'guest_time', 'cguest_time', 'start_data', 'end_data', 'start_brk', 'arg_start', 'arg_end', 'env_start', 'env_end', 'exit_code']

for file in os.listdir(dir):  # dir: the directory containing the .log files
    if file.endswith('.log'):
        with open(file, 'r') as f:
            data = f.read()
        data = data.split('##START-ALLPROCESSES-')
        data = data[1:]
        for i in range(len(data)):
            data[i] = data[i].split('##END-ALLPROCESSES-')
            data[i] = data[i][0]
            data[i] = re.split(r'\r', data[i])
            data[i] = data[i][0]
            data[i] = re.split(r'\n', data[i])
            for j in range(len(data[i])):
                data[i][j] = re.split(r'\s+', data[i][j])
            #print(data[i])
            data[i][0] = str(data[i][0])
        data_dict = {}
        for i in range(len(data)):
            data_dict[data[i][0]] = {}
            for j in range(len(columns)):
                data_dict[data[i][0]][columns[j]] = data[i][j+1]
        print(data_dict)
I converted the epoch date into a str, as I was getting unhashable-list errors. However, that made the epoch date show up as a key while each column now holds the entire 52-column list of information as a single value, so I am definitely missing something.
To solve this problem, you can follow these steps:
Open the .log file and read the contents
Search for all the sections of interest by finding lines that start with "##START-ALLPROCESSES-" and end with "##END-ALLPROCESSES-"
For each section found, extract the epoch date and create a dictionary with an empty list for each of the 52 columns
Iterate over the lines within the section and split the line into the 52 columns using space as a separator. Add the values to the corresponding list in the dictionary created in step 3
Repeat steps 3 and 4 for all the sections found in the .log file
Return the final nested dictionary
Here is some sample code that implements these steps:
import re

def parse_log_file(log_file_path):
    with open(log_file_path, 'r') as log_file:
        log_contents = log_file.read()
    sections = re.findall(r'##START-ALLPROCESSES-(.*?)##END-ALLPROCESSES-', log_contents, re.DOTALL)
    nested_dict = {}
    for section in sections:
        lines = section.strip().split('\n')
        epoch_date = lines[0].split('-')[-1]
        column_names = ['column{}'.format(i) for i in range(1, 53)]
        section_dict = {column_name: [] for column_name in column_names}
        for line in lines[1:]:
            values = line.strip().split()
            for i, value in enumerate(values):
                section_dict[column_names[i]].append(value)
        nested_dict['date{}-{}'.format(epoch_date, len(section_dict['column1']))] = section_dict
    return nested_dict
You can call this function by passing the path to the .log file as an argument. The function returns the nested dictionary described in the problem statement.
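For example, a hypothetical call ('system.log' is a placeholder path):
result = parse_log_file('system.log')
for date_key, section in result.items():
    # each column key maps to one value per process, so column1 gives the process count
    print(date_key, len(section['column1']), 'processes')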

What is the correct observation shape for a 15x15 np array in an OpenAI Gym environment?

I am creating a Gym environment whose observation is just a 15x15 grid. The grid is initially filled with 0s, and as the game progresses the contents change to values between 0 and 255. There are 225 possible actions, each corresponding to a location. My current code for __init__ is:
self.action_space = Discrete(225)
self.observation_shape = Box(low=-1000,high=10000,shape=(15,15,),dtype=np.uint8)
However, when running the following Stable Baselines 3 code:
import stable_baselines3
from stable_baselines3 import DQN
model = DQN("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=10000, log_interval=4)
I get the error
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-129-d01d55d2bc81> in <module>()
2 from stable_baselines3 import DQN
3
----> 4 model = DQN("MultiInputPolicy", env, verbose=1)
5 model.learn(total_timesteps=10000, log_interval=4)
6 model.save("battleship_dqn")
5 frames
/usr/local/lib/python3.7/dist-packages/stable_baselines3/common/preprocessing.py in get_obs_shape(observation_space)
156
157 else:
--> 158 raise NotImplementedError(f"{observation_space} observation space is not supported")
159
160
NotImplementedError: [[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]] observation space is not supported
This leads me to think that I have used the wrong OpenAI Gym space, or have defined the shape within Box incorrectly. What would be the correct way to define a 15x15 numpy array?
Also, if it is useful, the context of the error message is
if isinstance(observation_space, spaces.Box):
    return observation_space.shape
elif isinstance(observation_space, spaces.Discrete):
    # Observation is an int
    return (1,)
elif isinstance(observation_space, spaces.MultiDiscrete):
    # Number of discrete features
    return (int(len(observation_space.nvec)),)
elif isinstance(observation_space, spaces.MultiBinary):
    # Number of binary features
    return (int(observation_space.n),)
elif isinstance(observation_space, spaces.Dict):
    return {key: get_obs_shape(subspace) for (key, subspace) in observation_space.spaces.items()}
else:
    raise NotImplementedError(f"{observation_space} observation space is not supported")
Edit: I couldn't find any solution to this, so I tried to use Keras-rl2 instead. With Keras-rl2 I ended up using (1,15,15) instead of (15,15,), but I don't know if that will work in Stable Baselines 2.
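One hedged observation: the traceback prints the observation array itself where a gym Space is expected, which suggests env.observation_space ended up being the grid rather than the Box (the __init__ above assigns the Box to observation_shape, not observation_space). A minimal sketch of the relevant lines, under that assumption:
import numpy as np
from gym import spaces

class GridEnv:  # hypothetical name; only the space definitions are shown
    def __init__(self):
        # SB3 reads env.observation_space, so the attribute name matters
        self.observation_space = spaces.Box(low=0, high=255, shape=(15, 15), dtype=np.uint8)
        self.action_space = spaces.Discrete(225)
With a plain Box observation, "MlpPolicy" would be the usual choice; "MultiInputPolicy" is meant for Dict observation spaces.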

How can I visualize data using t-SNE?

I'm working with K-means in Python 3 for educational purposes.
I have a data set of clusters and centroids with multiple dimensions.
I need to visualize this data in 2D using t-SNE.
Can anyone help me do that? (A little explanation of the code would be very helpful for me.)
Data set is given below:
Centroids:
[0.0, 0.0, 1.125, 0.5, 0.25, 0.375, 0.125, 0.0, 0.75, 0.0, 0.0, 0.0, 0.0, 1.5, 0.5, 0.125, 0.0, 0.75, 0.25, 0.0, 1.75, 0.0, 0.0, 1.125, 0.125, 0.625, 0.25, 0.0, 0.25, 0.0, 0.625, 0.75, 0.0, 0.0, 0.625, 0.0, 0.75, 0.0, 0.625, 0.0, 0.0, 1.375, 0.625, 0.0, 1.0, 1.25, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.25, 0.625, 0.875, 0.0, 0.75, 1.25, 1.5, 0.0, 0.0]
[1.6666666666666667, 1.5833333333333333, 0.4166666666666667, 0.16666666666666666, 0.16666666666666666, 0.08333333333333333, 0.08333333333333333, 0.0, 0.08333333333333333, 0.16666666666666666, 0.0, 0.0, 0.0, 1.3333333333333333, 1.0, 0.0, 0.0, 0.0, 0.3333333333333333, 0.0, 0.3333333333333333, 0.0, 0.08333333333333333, 0.08333333333333333, 0.16666666666666666, 0.0, 0.0, 0.25, 0.0, 0.0, 0.16666666666666666, 0.0, 0.0, 0.0, 1.1666666666666667, 1.1666666666666667, 0.0, 0.0, 0.0, 0.16666666666666666, 0.0, 0.0, 0.9166666666666666, 0.0, 0.0, 0.16666666666666666, 0.0, 0.16666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16666666666666666, 0.25, 0.0, 0.4166666666666667, 0.75, 0.4166666666666667, 0.0, 0.0]
[0.0, 0.0, 0.16666666666666666, 0.6666666666666666, 0.16666666666666666, 1.6666666666666667, 1.6111111111111112, 0.2222222222222222, 0.5, 0.0, 0.05555555555555555, 1.2222222222222223, 0.5555555555555556, 0.0, 0.0, 0.1111111111111111, 0.05555555555555555, 0.0, 0.6666666666666666, 0.0, 0.6111111111111112, 0.0, 0.4444444444444444, 0.3333333333333333, 0.5, 0.0, 0.0, 0.05555555555555555, 0.0, 0.0, 0.05555555555555555, 0.16666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05555555555555555, 0.7222222222222222, 0.0, 0.0, 1.2777777777777777, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9444444444444444, 0.05555555555555555, 0.3888888888888889, 0.0, 0.3888888888888889, 0.5, 0.4444444444444444, 0.3333333333333333, 0.0]
[0.06976744186046512, 0.13953488372093023, 0.13953488372093023, 0.627906976744186, 0.11627906976744186, 0.37209302325581395, 0.18604651162790697, 0.023255813953488372, 0.0, 0.0, 0.0, 0.27906976744186046, 0.13953488372093023, 0.20930232558139536, 0.11627906976744186, 0.13953488372093023, 0.06976744186046512, 0.09302325581395349, 0.18604651162790697, 0.023255813953488372, 0.8372093023255814, 0.13953488372093023, 0.11627906976744186, 0.27906976744186046, 0.06976744186046512, 0.023255813953488372, 0.18604651162790697, 0.046511627906976744, 0.0, 0.0, 0.046511627906976744, 0.11627906976744186, 0.06976744186046512, 0.18604651162790697, 0.046511627906976744, 0.023255813953488372, 0.023255813953488372, 0.06976744186046512, 0.09302325581395349, 0.06976744186046512, 0.13953488372093023, 0.023255813953488372, 0.2558139534883721, 0.023255813953488372, 0.18604651162790697, 0.2558139534883721, 0.0, 0.23255813953488372, 0.18604651162790697, 0.023255813953488372, 0.06976744186046512, 0.11627906976744186, 0.0, 0.06976744186046512, 0.046511627906976744, 0.3023255813953488, 0.046511627906976744, 0.23255813953488372, 0.2558139534883721, 0.3023255813953488, 0.0, 0.09302325581395349]
[0.0, 0.0, 0.5, 0.0, 1.5, 0.25, 0.0, 2.5, 1.75, 0.75, 1.0, 0.0, 0.0, 1.0, 0.0, 0.25, 1.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 1.25, 0.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.75, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.25, 0.0, 0.5, 0.25, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 1.0, 0.0, 0.0, 0.0]
Clusters:
cluster_01 :>
[0 0 1 0 1 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 2 1 0 0 0 0 0 2 0 0 0 0 0 1 2 0 0 0]
[0 0 1 0 1 0 0 0 2 0 0 0 0 2 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 2 1 0 0 0 0 0 2 0 0 0 0 0 1 2 0 0 0]
[0 0 0 2 0 1 0 0 1 0 0 0 0 1 2 0 0 0 0 0 0 0 0 1 0 3 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 2 3 0 0 0 2 0 0]
[0 0 0 2 0 1 0 0 1 0 0 0 0 1 2 0 0 0 0 0 0 0 0 1 1 2 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 2 2 0 0 0 2 0 0]
[0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0 0 2 1 0 2 0 0 0 0 0 0 1 0 1 0 0 0 1 2 3 0 0]
[0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0 0 2 1 0 2 0 0 0 0 0 0 2 0 1 0 0 0 1 2 2 0 0]
[0 0 1 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 2 0 3 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 2 0 0 0 0 0 1 0 0 1 0 0 1 2 1 0 0]
[0 0 2 0 0 1 1 0 0 0 0 0 0 0 0 1 0 2 0 0 3 0 0 1 0 0 0 0 0 0 1 2 0 0 0 0 2 0 1 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 2 0 1 0 2 0 0]
cluster_02 :>
[1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[2 2 0 0 1 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 2 0 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[1 2 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 1 0 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0]
[2 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 2 0 0 0]
[2 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 2 0 0 0]
[0 3 2 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 2 0 0]
[3 2 2 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 2 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 2 2 0 0]
[3 3 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 2 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[2 1 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[2 1 0 1 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[2 1 0 1 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
cluster_03 :>
[0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 2 0 2 1 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0]
[0 0 0 0 0 2 0 2 1 0 0 2 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0]
[0 0 0 2 0 1 2 0 1 0 1 0 0 0 0 1 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 1 0 2 0 1 0 0 0 0]
[0 0 0 2 0 1 2 0 1 0 0 0 1 0 0 1 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 1 0 2 0 1 0 0 0 0]
[0 0 0 0 0 2 1 1 0 0 0 2 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 1 0]
[0 0 0 0 0 2 1 1 0 0 0 2 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 1 0]
[0 0 1 0 0 2 1 0 1 0 0 2 1 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 1 0 1 0 0 1 2 0 0]
[0 0 1 0 0 1 1 0 1 0 0 2 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 2 0 0]
[0 0 0 2 1 2 3 0 1 0 0 1 0 0 0 0 0 0 3 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 2 1 2 2 0 1 0 0 1 0 0 0 0 0 0 2 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 2 1 0 0 0 0 2 0 0 0 0 0 0 0 0 3 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 2 0 0]
[0 0 0 0 1 1 3 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 1 0 0 0 2 1 0 0 0]
[0 0 1 0 0 2 1 0 1 0 0 3 2 0 0 0 0 0 1 0 1 0 2 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 1 1 0 0 2 2 2 0 0]
[0 0 0 0 0 2 2 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 2 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0]
[0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0]
cluster_04 :>
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 0 3 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 2 0 0 0 0 0 0 1 0 0 1 2 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 0 0 0 0 0 0 1 0 0 1 1 0 0]
[0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 1 2 0 1 1 0 0 0]
[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 1 2 0 1 1 0 0 0]
[0 0 0 2 0 0 3 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0]
[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 0 0 1 0 0 0 2 0 0 0 0 2 2 0 0 0 0 0 1 0 0 1 0 0 0 0 2 0 0 1 0 0 0 0 2 0 1 1 1 0 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 2 0 0 0 0 0 0 0 0 2 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0]
[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 2 2 0 0 0 0 0 0 0 0 1 0 2 0 0]
[0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 2 0 0]
[0 1 0 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0]
[0 0 0 0 1 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 1 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 1 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
cluster_05 :>
[0 0 0 0 1 0 0 2 2 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 2 1 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 0]
[0 0 0 0 0 0 0 2 2 0 2 0 0 1 0 0 2 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0]
[0 0 2 0 3 0 0 3 2 0 2 0 0 1 0 0 2 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 1 0 0 0]
[0 0 0 0 2 1 0 3 1 2 0 0 0 2 0 1 2 0 0 0 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 0 0 0]
You can use sklearn's TSNE.fit_transform() on all the data points and receive them in new reduced dimensions.
from sklearn.manifold import TSNE
all_nodes = clus1 + clus2 + clus3 + clus4 + clus5 + centroids
result = TSNE(n_components=2, learning_rate=100, early_exaggeration=50).fit_transform(all_nodes)
To get the same plot:
import seaborn as sns
import pandas as pd
X = {
    'x': result[:, 0],
    'y': result[:, 1],
    'col': ['clus1'] * len(clus1) + ['clus2'] * len(clus2) + ['clus3'] * len(clus3)
           + ['clus4'] * len(clus4) + ['clus5'] * len(clus5) + ['centroids'] * len(centroids),
}
data = pd.DataFrame(data=X)
sns.set(style="white", color_codes=True)
sns.lmplot(x="x", y="y", data=data, fit_reg=False, hue='col',
           markers=['o', 'o', 'o', 'o', 'o', 'x'])
You can adjust the parameters of the TSNE yourself; all parameters are documented in the scikit-learn API reference.
UPDATE
If you want to plot the graph in matplotlib, I've quickly put together this code:
import matplotlib.pyplot as plt

group_sizes = [len(arr) for arr in [clus1, clus2, clus3, clus4, clus5]]
colors = ['red', 'blue', 'green', 'purple', 'orange']
start_pos = 0
for idx, (pos, col) in enumerate(zip(group_sizes, colors)):
    plt.scatter(X['x'][start_pos:start_pos + pos], X['y'][start_pos:start_pos + pos],
                c=col, marker='o', label='Cluster {}'.format(idx + 1))
    start_pos += pos
plt.scatter(X['x'][-5:], X['y'][-5:], c='black', marker='x', label='Centroids')
_ = plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
which outputs the corresponding scatter plot.
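The snippets above assume clus1 through clus5 and centroids already exist as lists of equal-length vectors. A minimal sketch of parsing the pasted rows into such lists (the variable names here are made up):
import numpy as np

def parse_row(line):
    # turn a pasted row like '[0 0 1 0 1 0]' into a list of floats
    return [float(tok) for tok in line.strip(' []').split()]

cluster_01_raw = [
    '[0 0 1 0 1 0]',  # truncated examples; the real rows have 62 entries each
    '[0 0 1 0 1 0]',
]
clus1 = [parse_row(line) for line in cluster_01_raw]
print(np.array(clus1).shape)  # (2, 6)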

Create dummy variables for interdependent categories in pandas

I'm trying to set up a linear regression model to predict traffic counts based on the day and the time of day. Since both are categorical variables, I have to create dummy variables. The get_dummies function makes this very easy when doing it for each variable individually. However, in the case of predicting traffic volumes, the interdependence between the day and the time of day is important. Therefore, I'll need dummies for all days * all time intervals.
I made a small example, avoiding to trouble you with big datasets:
import pandas as pd
df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11, 15, 9, 15, 17, 10, 20],
                   'Count': [100, 150, 150, 150, 180, 60, 50]})
df_dummies = pd.get_dummies(df.Day)
print(df_dummies)
Results in a nice dataframe with dummies:
Fri Mon Sat Sun Thu Tue Wed
0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 1 0 0 0
So what I'm after is something like this:
import pandas as pd
df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11, 15, 9, 15, 17, 10, 20],
                   'Count': [100, 150, 150, 150, 180, 60, 50]})
df_dummies = pd.get_dummies(df.Day * df.Time)
print(df_dummies)
With a result like this:
Fri_9 Fri_15 Mon_9 Mon_15 Sat_9 Sat_15 Sun_9 ...
0 0 1 0 0 0 0 0 ...
1 0 0 0 0 0 1 0 ...
2 0 0 0 0 0 0 1 ...
3 0 0 0 0 1 0 0 ...
4 1 0 0 0 0 0 0 ...
5 0 0 1 0 0 0 0 ...
6 0 0 0 1 0 0 0 ...
7 0 0 0 0 0 0 0 ...
[...]
Is there any way in which this can be done elegantly?
I believe you need to join the columns together with a cast to string:
df_dummies = pd.get_dummies(df.Day + '_' + df.Time.astype(str))
#df_dummies = pd.get_dummies(df.Day.str.cat(df.Time.astype(str), sep='_'))
print(df_dummies)
Fri_17 Mon_11 Sat_10 Sun_20 Thu_15 Tue_15 Wed_9
0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 1 0 0 0
"I'm trying to set up a linear regression model in order to predict"
Technically, you can make a dummy of the tuples:
>>> pd.get_dummies(df[['Day', 'Time']].apply(tuple, axis=1))
(Fri, 17) (Mon, 11) (Sat, 10) (Sun, 20) (Thu, 15) (Tue, 15) (Wed, 9)
0 0 1 0 0 0 0 0
1 0 0 0 0 0 1 0
2 0 0 0 0 0 0 1
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 1 0 0 0
...
however, I think this approach is not the best at the ML level. This will probably fragment the data very much, making things hard for your regressor. You might consider using a gradient-boosted decision tree, if you're after the interactions.
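As a hedged sketch of that tree-based suggestion, using the toy frame from the question (GradientBoostingRegressor stands in for any gradient-boosted tree implementation):
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.DataFrame({'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
                   'Time': [11, 15, 9, 15, 17, 10, 20],
                   'Count': [100, 150, 150, 150, 180, 60, 50]})

# Trees can pick up Day x Time interactions on their own, so ordinal codes
# for Day are enough here; no interaction dummies are needed.
X = pd.DataFrame({'Day': df.Day.astype('category').cat.codes, 'Time': df.Time})
model = GradientBoostingRegressor().fit(X, df.Count)
print(model.predict(X))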

Go binary data being truncated on string conversion

I am using GORM. My data field is varbinary(2000), so why is the newly inserted data being truncated to around 20 bytes?
var charData CharacterData
db.Where(CharacterData{CharId: charId, Segment: 0}).FirstOrInit(&charData)
charData.Data = data[16:]
db.Save(&charData)
len(data[16:]) is 2000.
type CharacterData struct {
    CharId    uint32
    Data      []byte
    Segment   uint8
    SegmentId uint16
    CreatedAt time.Time
    UpdatedAt time.Time
    DeletedAt time.Time
}
I also tried:
fmt.Println(string(charData0))
fmt.Println(len(charData0))
fmt.Println(len(string(charData0)))
0uyuyuyyuyuyuyy���������������������������� // only this gets inserted to the db which is short of the actual 1587
1587
1587
And it is also truncating the data. So I guess the problem lies with the string conversion.
The insert log shows:
INSERT INTO character_data (char_id, data, segment) VALUES ('2', '0tututututt����������������������������', '0')
data[16:] contains:
[0 0 48 6 1 0 1 0 104 101 104 101 104 101 104 101 104 101 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 255 255 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
update
Actual data inserted to database:
0tututututtÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
Instead of what normally should be something similar to:
pdukelÐÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ2ÿÿÿÿÿÿÿÿÿÿ1ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿHHllbÍfÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿúúúúúúúÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿþ ú
úúþ
þþþþÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ'ÿÿ
using raw queries
var charData0 []byte = data[16:]
hexData := fmt.Sprintf("%x", charData0)
fmt.Println(hexData)
db.Exec("INSERT INTO character_data (char_id, data, segment) VALUES (?, UNHEX(?), ?)", charId, hexData, 0)
=> 000030060100010079697969796979697979697900000000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
But the database only got:
0yiyiyiyiyyiyÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
update
Actually tried running the query on sqlpro:
INSERT INTO character_data (char_id, data, segment) VALUES ('4', UNHEX('000030060100010074757475747574757475740000000000050000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff0000000000000000000000000000ffff000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'), '0')
And I still got truncated data. So does the problem lie with MySQL?
EDIT:
You mentioned in the comment that you use MariaDB (not MySQL).
See MariaDB Tutorial: DATA TYPES:
VARBINARY(size) - Maximum size of 255 characters. - Where size is the number of characters to store. Variable-length string.
So varbinary length is limited to 255 characters... Also, it is listed in the STRING DATATYPES section, which means MariaDB handles varbinary as a string...
This may very well explain the value you see: after about 255 bytes it contains just 0 values and nothing else, most likely because only 255 bytes come from the database and the rest is just filled with the zero-value of byte (which is 0).
A string in Go is stored as a byte array and interpreted using UTF-8 encoding.
A string may contain any byte sequence, but not all byte sequences represent valid UTF-8 encoded text.
So if you use Data to store an arbitrary sequence of bytes, you can't always display or treat it as text. Just because printing the result of the string(Data) conversion displays only 16 characters doesn't mean Data contains only 16 bytes (or the bytes of 16 characters).
You can check it by printing its length:
fmt.Println(len(charData.Data))
Read this blog post to learn more about the topic:
Strings, bytes, runes and characters in Go
Edit:
You mentioned you use MySQL. What version?
MySQL version 5.0.2 and below only allows 255 bytes in varbinary, version 5.0.3 and above allows 64 KB. If the data inserted does not fit into the max (or defined) length, it is truncated.
Also please see MySQL 5.6 Reference Manual :: Section 11 Data Types :: 11.7 Data Type Storage Requirements:
The internal representation of a table has a maximum row size of 65,535 bytes, even if the storage engine is capable of supporting larger rows.
So if you (obviously) have other values in the row, they also have to fit into the row limit (which is 64 KB). Note that this is MySQL 5.6.
