Encoding Dataframe features numerically - python-3.x

I have a dataframe with a number of features. One particular feature is fully dynamic, and I want to encode it. I cannot use one-hot encoding, because the number of unique values can change. LabelEncoder could be of use, but can its number of classes/target labels change?
Consider an example (the Name feature):
index | A | B | Name
------+---+---+-----
1     | 5 | 6 | abc
2     | 4 | 7 | abc
2     | 3 | 0 | def
2     | 3 | 0 | ghi
3     | 3 | 0 | abc
3     | 3 | 0 | def
I wish to encode it as
index | A | B | Name
------+---+---+-----
1     | 5 | 6 | 1
2     | 4 | 7 | 1
2     | 3 | 0 | 2
2     | 3 | 0 | 3
3     | 3 | 0 | 1
3     | 3 | 0 | 2
I also want any value not seen before to be stored in the encoder automatically and assigned the next successive code. For example, if the next input row is

index | A | B | Name
------+---+---+-----
1     | 5 | 6 | xyz

it should be encoded and used as

index | A | B | Name
------+---+---+-----
1     | 5 | 6 | 4

And how do I get the original value back?

You can try factorize:
df.Name = df.Name.factorize()[0] + 1
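pd.factorize also returns the array of unique values, which is what you need to map the codes back to the original strings. A minimal sketch of the round trip (the data is taken from the question):

import pandas as pd

df = pd.DataFrame({'Name': ['abc', 'abc', 'def', 'ghi', 'abc', 'def']})

codes, uniques = pd.factorize(df['Name'])
df['Name'] = codes + 1  # 1-based codes, as in the question

# To get the original values back, index uniques with the 0-based codes:
original = uniques[(df['Name'] - 1).to_numpy()]
print(list(original))  # ['abc', 'abc', 'def', 'ghi', 'abc', 'def']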

You can convert the column with astype('category') and then use the .cat accessor to get the assigned codes:
df['Name'] = df['Name'].astype('category').cat.codes + 1
Output:
index A B Name
0 1 5 6 1
1 2 4 7 1
2 2 3 0 2
3 2 3 0 3
4 3 3 0 1
5 3 3 0 2
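Note that neither factorize nor cat.codes remembers anything between calls, so a value that arrives later (like xyz in the question) will not automatically get the next code. If you need that, one option is a hand-rolled persistent mapping; a minimal sketch (the encode helper is illustrative, not a library API):

from collections import defaultdict

# Each unseen name is assigned the next successive code on first sight.
encoder = defaultdict(lambda: len(encoder) + 1)

def encode(name):
    return encoder[name]

print(encode('abc'))  # 1
print(encode('def'))  # 2
print(encode('ghi'))  # 3
print(encode('abc'))  # 1  (already known)
print(encode('xyz'))  # 4  (new value gets the next code)

# Reverse lookup to get the original value back:
decoder = {code: name for name, code in encoder.items()}
print(decoder[4])     # 'xyz'

You can apply it to the column with df['Name'] = df['Name'].map(encode).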

Related

Create a new pandas column by repeating a value according to another column

I have a table like this
times v2
0 4 10
1 2 20
2 0 30\n30
3 1 40
4 0 9
What I want is to change the values of v2 where times != 0; the change consists of appending "\n0" as many times as the times column says.
times v2
0 4 10\n0\n0\n0\n0
1 2 20\n0\n0
2 0 30\n30
3 1 40\n0
4 0 9
You can do
df.v2 += df.times.map(lambda x: x * "\n0")
df
Out[325]:
times v2
0 4 10\n0\n0\n0\n0
1 2 20\n0\n0
2 0 30\n30
3 1 40\n0
4 0 9
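The trick here is Python's string repetition: x * "\n0" builds x copies of the suffix, and rows with times == 0 get 0 * "\n0" == "", so they are left unchanged. A small sketch reproducing the example (the frame is built from the question's data):

import pandas as pd

df = pd.DataFrame({'times': [4, 2, 0, 1, 0],
                   'v2': ['10', '20', '30\n30', '40', '9']})

# Rows with times == 0 get the empty string appended, i.e. stay as they are.
df.v2 += df.times.map(lambda x: x * "\n0")
print(df)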

Taking different records from groups using group by in pandas

Suppose I have a dataframe like this
>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
Now I want to take all records from each group by group id except the last 3. That means I want to drop the last 3 records from every group. How can I do it using pandas groupby? This is dummy data.
Use GroupBy.cumcount with ascending=False to count from the back of each group, then compare with Series.gt(2) to keep only rows whose counter is greater than 2 (2 rather than 3, because Python counts from 0):
df = df[df.groupby('id').cumcount(ascending=False).gt(2)]
print (df)
id value
3 2 1
Details:
print (df.groupby('id').cumcount(ascending=False))
0 2
1 1
2 0
3 3
4 2
5 1
6 0
7 0
8 0
dtype: int64
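If you find positional slicing easier to read than the counter, roughly the same result can be had with GroupBy.apply; a sketch (likely slower than the cumcount approach on large frames):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Drop the last 3 rows of every group by slicing each sub-frame.
out = df.groupby('id', group_keys=False).apply(lambda g: g.iloc[:-3])
print(out)
#    id  value
# 3   2      1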

Pandas how to turn each group into a dataframe using groupby

I have a dataframe that looks like
A B
1 2
1 3
1 4
2 5
2 6
3 7
3 8
If I call df.groupby('A'), how do I turn each group into a sub-dataframe, so that it will look like the following? For A=1,
A B
1 2
1 3
1 4
for A=2,
A B
2 5
2 6
for A=3,
A B
3 7
3 8
By using get_group
g=df.groupby('A')
g.get_group(1)
Out[367]:
A B
0 1 2
1 1 3
2 1 4
You are close; you need to convert the groupby object to a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('A')))
print (dfs[1])
A B
0 1 2
1 1 3
2 1 4
print (dfs[2])
A B
3 2 5
4 2 6
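If you only need to visit each sub-dataframe once, you can also iterate over the groupby object directly instead of building the dictionary; iteration yields (group key, sub-DataFrame) pairs:

for name, group in df.groupby('A'):
    print('A =', name)
    print(group)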

Split one column into "n" columns with one character each

I have a file with one single column and 10 rows. Every row has the same number of characters (5). From this file I would like to get a file with 10 rows and 5 columns, where each column has 1 character only. I have no idea how to do that in Linux. Any help? Would AWK do this?
The real data has many more rows (>4K) and characters (>500K) though. Here is a short version of the real data:
31313
30442
11020
12324
00140
34223
34221
43124
12211
04312
Desired output:
3 1 3 1 3
3 0 4 4 2
1 1 0 2 0
1 2 3 2 4
0 0 1 4 0
3 4 2 2 3
3 4 2 2 1
4 3 1 2 4
1 2 2 1 1
0 4 3 1 2
Thanks!
I think that this does what you want:
$ awk -F '' '{ $1 = $1 }1' file
3 1 3 1 3
3 0 4 4 2
1 1 0 2 0
1 2 3 2 4
0 0 1 4 0
3 4 2 2 3
3 4 2 2 1
4 3 1 2 4
1 2 2 1 1
0 4 3 1 2
The input field separator is set to the empty string, so every character is treated as a field. $1 = $1 means that awk "touches" every record, causing it to be reformatted, inserting the output field separator (a space) between every character. 1 is the shortest "true" condition, causing awk to print every record.
Note that the behaviour of setting the field separator to an empty string isn't well-defined, so may not work on your version of awk. You may find that setting the field separator differently e.g. by using -v FS= works for you.
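If an empty FS turns out not to work on your awk, a portable sketch that walks each record with substr gives the same output on any POSIX awk:

awk '{
    out = substr($0, 1, 1)
    for (i = 2; i <= length($0); i++)
        out = out " " substr($0, i, 1)
    print out
}' file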
Alternatively, you can do more or less the same thing in Perl:
perl -F -lanE 'say "@F"' file
-a splits each input record into the special array @F. -F followed by nothing sets the input field separator to the empty string. The quotes around @F mean that the list separator (a space by default) is inserted between each element.
You can use this sed as well:
sed 's/./& /g; s/ $//' file
3 1 3 1 3
3 0 4 4 2
1 1 0 2 0
1 2 3 2 4
0 0 1 4 0
3 4 2 2 3
3 4 2 2 1
4 3 1 2 4
1 2 2 1 1
0 4 3 1 2
Oddly enough, this isn't trivial to do with most standard Unix tools (update: except, apparently, with awk). I would use Python:
python -c 'import sys; [sys.stdout.write(" ".join(line.rstrip()) + "\n") for line in sys.stdin]' < in.txt > new.txt
(This isn't the greatest idiomatic Python, but it suffices for a simple one-liner.)
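Spelled out as a small Python 3 script rather than a one-liner (same idea; in.txt and new.txt are the file names from above):

import sys

# Insert a space between every character of each line.
with open('in.txt') as src, open('new.txt', 'w') as dst:
    for line in src:
        dst.write(' '.join(line.rstrip('\n')) + '\n')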
Another Unix toolchain for this task:
$ while read line; do echo $line | fold -w1 | xargs; done < file
3 1 3 1 3
3 0 4 4 2
1 1 0 2 0
1 2 3 2 4
0 0 1 4 0
3 4 2 2 3
3 4 2 2 1
4 3 1 2 4
1 2 2 1 1
0 4 3 1 2

Create a list of duplicate records that are in several columns

I have a data set that is spread across five columns. Sample of data:
Raw Data:
A B C D E
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6
1 2 2 1 6
0 3 3 0 6

End Results:
A B C D E
1 2 2 1 6
0 3 3 0 6
The length of the data set varies from 10 to 40 rows. The data is to help me keep a record of inventory, and I wish to know which orders are popular.
Unfortunately I am still using Excel 2003.
Because I am not really sure what you have, this is deliberately simple:
In ColumnG Row1 put:
=A1&B1&C1&D1&E1
and copy down to suit. Select ColumnG and Paste Special, Values. Select ColumnG and sort. Insert in H1 and copy down to suit:
=COUNTIF(G$1:G1,G1)
1 should indicate the first ("unique") instance of each of the rows of Raw Data, and the other numbers give the number of repetitions (up to 7 in your example, so one 'original' and six 'copies').
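For what it's worth, if the data ever moves out of Excel 2003, the same concatenate-and-count idea is a couple of lines of pandas; a sketch assuming the five columns are named A through E:

import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, 0], 'B': [2, 3, 2, 3],
                   'C': [2, 3, 2, 3], 'D': [1, 0, 1, 0],
                   'E': [6, 6, 6, 6]})

# Running count of each duplicate row, like =COUNTIF(G$1:G1,G1).
df['count'] = df.groupby(['A', 'B', 'C', 'D', 'E']).cumcount() + 1
print(df)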
