pandas replace bytes object b'\x00' with string

pandas replace bytes object b'\x00' with string - python-3.x

I have a Dataframe with a column contains bytes objects b'\x00' to b'\x08'. I would like to replace them with the corresponding string '00' to '08'.
The bytes objects from b'\x01' to b'\x08' can be easily replaced by using a dict in dataframe.replace.
However the job on b'\x00' doesn't work. Here is my test.
My python version is
'3.6.3 |Anaconda custom (64-bit)| (default, Nov 8 2017, 15:10:56) [MSC v.1900 64 bit (AMD64)]'. Pandas is pre-installed.
bytes_list=[b'\x00',b'\x01',b'\x02',b'\x03',b'\x04',b'\x05',b'\x06',b'\x07',b'\x08']
bytes_df=pd.DataFrame({"bytes":bytes_list})
replacements = {b'\x00': '00',b'\x01': '01',b'\x02': '02',b'\x03': '03',b'\x04': '04'
,b'\x05': '05',b'\x06': '06',b'\x07': '07',b'\x08': '08'}
bytes_df.replace(replacements,inplace=True)
print(bytes_df)
Output:
bytes
0 b'\x00'
1 01
2 02
3 03
4 04
5 05
6 06
7 07
8 08
So you can see the bytes object b'\x00' isn't replaced.
Later I use applymap with a lambda func get what I need.
bytes_df.applymap(lambda x: x[0] if type(x) is bytes else x)
But I was still wondering if there is any other easy and pythonic way can do the same. And I was also wondering if this is a bug.
Could anyone explain?

You could circumvent the problem:
import pandas as pd
bytes_list=[b'\x00',b'\x01',b'\x02',b'\x03',b'\x04',b'\x05',b'\x06',b'\x07',b'\x08']
bytes_df=pd.DataFrame({ "bytes": ['0'+str(ord(x)) for x in bytes_list] })
print(bytes_df)
Output:
bytes
0 00
1 01
2 02
3 03
4 04
5 05
6 06
7 07
8 08

Related

How to remove space between numbers but leave space between names on the same column in a DataFrame Pandas

I would like to clean a Dataframe in such a way that only cells that contain numbers will not have empty spaces but cells with names remain the same.
Author
07 07 34
08 26 20
08 26 20
Tata Smith
Jhon Doe
08 26 22
3409243
here is my approach which is failing
df.loc[df["Author"].str.isdigit(), "Author"] = df["Author"].strip()
How can I handle this?

You might want to use regex.
import pandas as pd
import re
# Create a sample dataframe
import io
df = pd.read_csv(io.StringIO('Author\n 07 07 34 \n 08 26 20 \n 08 26 20 \n Tata Smith\n Jhon Doe\n 08 26 22\n 3409243'))
# Use regex
mask = df['Author'].str.fullmatch(r'[\d ]*')
df.loc[mask, 'Author'] = df.loc[mask, 'Author'].str.replace(' ', '')
# You can also do the same treatment by the following line
# df['Author'] = df['Author'].apply(lambda s: s.replace(' ', '') if re.match(r'[\d ]*$', s) else s)
Author
070734
082620
082620
Tata Smith
Jhon Doe
082622
3409243

How about this?
import pandas as pd
df = pd.read_csv('two.csv')
# remove spaces on copy
df['Author_clean'] = df['Author'].str.replace(" ","")
# try conversion to numeric if possible
df['Author_clean'] = df['Author_clean'].apply(pd.to_numeric, errors='coerce')
# fill missing vals with original strings
df['Author_clean'].fillna(df['Author'], inplace=True)
print(df.head(10))
Output:
Author Author_clean
0 07 07 34 70734.0
1 08 26 20 82620.0
2 08 26 20 82620.0
3 Tata Smith Tata Smith
4 Jhon Doe Jhon Doe
5 08 26 22 82622.0
6 3409243 3409243.0

Sockets programming: DIG DNS Query Messages: Incorrect header length?

RFC Reference
I am working on a project which involves sockets programming and interpreting the output from DIG DNS queries.
I'm using RFC 1035 as my reference. Although this is quite old now (1987) as far as I can tell from later RFCs (for example 8490) the DNS headers are still the same.
https://www.rfc-editor.org/rfc/rfc1035
Code Overview: IPv6 TCP query
I have written a short program in C which reads from a IPv6 TCP socket. I send data to this socket using DIG. (My program simply reads all data it sees on a socket, and prints it to stdout.)
Note that there are two unusual things here:
Firstly the use of IPv6
Secondly the use of TCP (DNS messages are often UDP)
Here is the command used:
dig #::1 -p 8053 duckduckgo.com +tcp
I am running dig version DiG 9.16.13-Debian, on Debian Testing. (cera 2021-May)
Output, Discussion and Question
Here is the hexadecimal and printable character output which is read from the socket:
Hex:
00 37 61 78 01 20 00 01 00 00 00 00 00 01 0A 64 75 63 6B 64 75 63 6B 67 6F 03 63 6F 6D 00 00 01 00 01 00 00 29 10 00 00 00 00 00 00 0C 00 0A 00 08 00 7A 4* 48 2C 16 0* 33
Char:
00 7 61 x 01 20 00 01 00 00 00 00 00 01 0A d u c k d u c k g o 03 c o m 00 00 01 00 01 00 00 ) 10 00 00 00 00 00 00 0C 00 0A 00 08 00 z 4* H , 16 0* 33
If non-printable characters are encountered, the hex value is printed instead.
Although this is a fairly long stream of data, the question relates to the length of the header.
According to RFC 1035, the length of the header should be 12 bytes.
4.1.1. Header section format
The header contains the following fields:
1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ID |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR| Opcode |AA|TC|RD|RA| Z | RCODE |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| QDCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ANCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| NSCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| ARCOUNT |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The header is followed by a QUESTION SECTION. The question section begins with a single byte which specifies the length.
Inspecting the data stream above, we see that the byte at offset 12 has a value of 0. I repeat it below with offset numbers to make it clear. The data is in the middle row, the row above and below are byte offsets.
0 1 2 3 4 5 6 7 8 9 10 11 <- byte 12
00 37 61 78 01 20 00 01 00 00 00 00 00 01 0A 64 75 63 6B ...
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 <- byte 15
This clearly doesn't make any sense.
Looking again at the stream, we can see that "duckduckgo" is preceeded by the byte 0A. This is 10 in decimal and corresponds to the 10 characters of "duckduckgo". This string is followed by a byte 03 which corresponds to the 3 bytes of "com".
The offset of the byte 0A is 15. Not 12.
I must have misunderstood the RFC specification. But what have I misunderstood? Does the header itself start at a different offset to what I think it is? (Byte zero.) Or is there perhaps some padding between the end of the header and the beginning of the first question section?
Existing Question on this site:
Comments: The below link states that there is no padding. This is the only answer on this question. The question is about DNS responses rather than queries, and does not ask about the header section of a query. (Although information from one should presumably apply to the other, but possibly does not.)
Do DNS messages pad names to an even number of bytes?
Comments: The below link asks about the best way to build a data structure to handle DNS data. Additionally, the answer notes that one has to be careful about network byte order and machine byte order. I am already aware of this and I use ntohs() to convert from network byte order to x86_64 byte order before printing information to stdout. This is not the problem and does not explain why I see information about the dns query starting at byte 15 instead of 12, when the header should be a fixed size of 12 bytes.
Implementing a DNS Query in c++ according to RFC 1035

Thanks to #SteffenUllrich who prompted the solution for this in the comments.
RFC 1035 4.2.2 states
4.2.2. TCP usage
Messages sent over TCP connections use server port 53 (decimal). The
message is prefixed with a two byte length field which gives the message
Mockapetris [Page 32]
RFC 1035 Domain Implementation and Specification November 1987
length, excluding the two byte length field. This length field allows
the low-level processing to assemble a complete message before beginning
to parse it.
I had removed the 2-byte field at the start of my struct at some point.
This is what the structure looks like with the 2 byte length field re-enabled.
struct __attribute__((__packed__)) dns_header
{
unsigned short ID;
union
{
unsigned short FLAGS;
struct
{
unsigned short QR : 1;
unsigned short OPCODE : 4;
unsigned short AA : 1;
unsigned short TC : 1;
unsigned short RD : 1;
unsigned short RA : 1;
unsigned short Z : 3;
unsigned short RCODE : 4;
};
};
unsigned short QDCOUNT;
unsigned short ANCOUNT;
unsigned short NSCOUNT;
unsigned short ARCOUNT;
};
struct __attribute__((__packed__)) dns_struct_tcp
{
unsigned short length; // length excluding 2 bytes for length field
struct dns_header header;
};
For example: I recieved a TCP packet of length 53 bytes. The value of length is set to 51.
To read data into this struct:
memcpy(&dnsdata, buf, sizeof(struct dns_struct_tcp));
To interpret this data (since it is stored in network byte order):
void dns_header_print(FILE *file, const struct dns_header *header)
{
fprintf(file, "ID: %u\n", ntohs(header->ID));
char str_FLAGS[8 * sizeof(unsigned short) + 1];
str_FLAGS[8 * sizeof(unsigned short)] = '\0';
print_binary_16_fixed_width(str_FLAGS, header->FLAGS);
fprintf(file, "FLAGS: %s\n", str_FLAGS);
fprintf(file, "FLAGS: QOP ATRRZZZR \n");
fprintf(file, " RCODEACDA CODE\n");
fprintf(file, "QDCOUNT: %u\n", ntohs(header->QDCOUNT));
fprintf(file, "ANCOUNT: %u\n", ntohs(header->ANCOUNT));
fprintf(file, "NSCOUNT: %u\n", ntohs(header->NSCOUNT));
fprintf(file, "ARCOUNT: %u\n", ntohs(header->ARCOUNT));
}
Note that the flags are unchanged, since each field of flags is less than 8 bits in length. However on x86_64 systems, unsigned short is stored in little-endian format, hence ntohs() is use to convert data which is in big-endian (network) byte order to little-endian (host) byte order.

How does jwt implement RSA256 signature verification in nodejs

I was trying to understand the internal working of how JWT performs RS256 signature verification. The signature algorithm works on following basic steps:
Hash the original data
Encrypt the hash with RSA private key
And for verification it follows the following steps:
Decrypt using RSA public key
Match the hash generated from decryption with the SHA256 hash of original message.
While trying to test this on one of the jwt
eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiYWRtaW4iOnRydWUsImlhdCI6MTUxNjIzOTAyMn0.POstGetfAytaZS82wHcjoTyoqhMyxXiWdR7Nn7A29DNSl0EiXLdwJ6xC6AfgZWF1bOsS_TuYI3OG85AmiExREkrS6tDfTQ2B3WXlrr-wp5AokiRbz3_oB4OxG-W9KcEEbDRcZc0nH3L7LzYptiy1PtAylQGxHTWZXtGz4ht0bAecBgmpdgXMguEIcoqPJ1n3pIWk_dUZegpqx0Lka21H6XxUTxiy8OcaarA8zdnPUnV6AmNP3ecFawIFYdvJB_cm-GvpCSbr8G8y_Mllj8f4x9nBH8pQux89_6gUY618iYv7tuPWBFfEbLxtF2pZS6YC1aSfLQxeNe8djT9YjpvRZA
I found that the hash obtained from signature contains some extra characters.
E.g. the SHA256 of original message in case of above jwt in hex encoding is
8041fb8cba9e4f8cc1483790b05262841f27fdcb211bc039ddf8864374db5f53
but the hash obtained from signature of above jwt after decryption is
3031300d0609608648016503040201050004208041fb8cba9e4f8cc1483790b05262841f27fdcb211bc039ddf8864374db5f53
Which has 3031300d060960864801650304020105000420 extra characters infront of the hash.
What do these characters represent and shouldn't the hash obtained from message and signature be identical?

rfc7518 3.3 defines JWS algorithms RS256,384,512:
This section defines the use of the RSASSA-PKCS1-v1_5 digital
signature algorithm as defined in Section 8.2 of RFC 3447 [RFC3447]
(commonly known as PKCS #1), using SHA-2 [SHS] hash functions.
and rfc3447 8.2 defines RSASS-PKCS1-v1_5
RSASSA-PKCS1-v1_5 combines the RSASP1 and RSAVP1 primitives with the
EMSA-PKCS1-v1_5 encoding method. ....
where EMSA-PKCS1-v1_5 is defined in rfc3447 9.2 as:
1. Apply the hash function to the message M to produce a hash value
H:
H = Hash(M).
If the hash function outputs "message too long," output "message
too long" and stop.
2. Encode the algorithm ID for the hash function and the hash value
into an ASN.1 value of type DigestInfo (see Appendix A.2.4) with
the Distinguished Encoding Rules (DER), where the type DigestInfo
has the syntax
DigestInfo ::= SEQUENCE {
digestAlgorithm AlgorithmIdentifier,
digest OCTET STRING
}
The first field identifies the hash function and the second
contains the hash value. Let T be the DER encoding of the
DigestInfo value (see the notes below) and let tLen be the length
in octets of T.
3. If emLen < tLen + 11, output "intended encoded message length too
short" and stop.
4. Generate an octet string PS consisting of emLen - tLen - 3 octets
with hexadecimal value 0xff. The length of PS will be at least 8
octets.
5. Concatenate PS, the DER encoding T, and other padding to form the
encoded message EM as
EM = 0x00 || 0x01 || PS || 0x00 || T.
6. Output EM. [added: which is then modexp'ed with d by RSASP1 to
sign, or matched to the value modexp'ed with e by RSAVP1 to verify]
Notes.
1. For the six hash functions mentioned in Appendix B.1, the DER
encoding T of the DigestInfo value is equal to the following:
MD2: (0x)30 20 30 0c 06 08 2a 86 48 86 f7 0d 02 02 05 00 04
10 || H.
MD5: (0x)30 20 30 0c 06 08 2a 86 48 86 f7 0d 02 05 05 00 04
10 || H.
SHA-1: (0x)30 21 30 09 06 05 2b 0e 03 02 1a 05 00 04 14 || H.
SHA-256: (0x)30 31 30 0d 06 09 60 86 48 01 65 03 04 02 01 05 00
04 20 || H.
SHA-384: (0x)30 41 30 0d 06 09 60 86 48 01 65 03 04 02 02 05 00
04 30 || H.
SHA-512: (0x)30 51 30 0d 06 09 60 86 48 01 65 03 04 02 03 05 00
04 40 || H.
The prefix you discovered corresponds exactly to that specified in Note 1 for the encoding in step 2 of a DigestInfo structure for a SHA-256 hash, as expected.
Note rfc3447=PKCS1v2.1 has been superseded by rfc8017=PKCS1v2.2, but the only relevant change in this area is the addition of the SHA512/224 and SHA512/256 hashes, which JWS doesn't use.
Describing signing and verifying as 'encrypting' and 'decrypting' the hash (really, the encoding aka padding of the hash) is considered obsolete. It was used originally, decades ago, and only for RSA not other signature algorithms, because of the mathematical similarity between the modexp operations used for encrypting and decrypting vs signing and verifying, but it was found that thinking of these as the same or interchangeable resulted in system implementations that were vulnerable and broken. In particular see rfc3447 5.2:
The main mathematical operation in each primitive is
exponentiation, as in the encryption and decryption primitives of
Section 5.1. RSASP1 and RSAVP1 are the same as RSADP and RSAEP
except for the names of their input and output arguments; they are
distinguished as they are intended for different purposes.
nodejs uses this obsolete terminology because it uses OpenSSL which via its predecessor SSLeay dates back to the early 1990s when this mistake was still common.
However, that isn't really a programming/development issue and is more on topic for crypto.SX and security.SX; see some of the links I collected at https://security.stackexchange.com/questions/159282/can-openssl-decrypt-the-encrypted-signature-in-an-amazon-alexa-request#159289 .

Chip EMV - Getting AFL for every smart card

Continue from: EMV Reading PAN Code
I'm working in C, so I havn't Java tools and all the functions that parse automatically the response of APDU command.
I want to read all types of smart cards.
I have to parse the response of an GET PROCESSING OPTIONS and get the AFL (Access File Locator) of every card.
I have three cards with three different situation:
A) HelloBank: 77 12 82 2 38 0 94 c 10 2 4 1 18 1 1 0 20 1 1 0 90
B) PayPal: 77 12 82 2 39 0 94 c 18 1 1 0 20 1 1 0 28 1 3 1 90
C) PostePay: 80 a 1c 0 8 1 1 0 18 1 2 0 90
Case A)
I've got three different AFL: 10 2 4 1, 18 1 1 0, 20 1 1 0
So I send 00 B2 SFI P2 00 where SFI was 10>>3 (10 was first byte of first AFL) and P2 was SFI<<3|4 and this way I got the correct PAN Code of my card.
Case B)
I've got three different AFL: 18 1 1 0, 20 1 1 0, 28 1 3 1.
So I send 00 B2 SFI P2 00 builded in the same way as Case A, but I got the response 6A 83 for every AFL.
Case C)
I've got two different AFL: 8 1 1 0, 18 1 2 0 but I cannot parse those automatically because there isn't the same TAG of previous response.
If I use those AFL it worked and I can get the PAN Code of the card.
How can I make an universal way to read the correct AFL and how can I make the correct command with those AFL?

Here is the decoding of AFL:
You will get the AFL in multiple of 4 Bytes normally. Divide your complete AFL in a chunk of 4 Bytes. Lets take an example of 1 Chunk:
AABBCCDD
AA -> SFI (Decoding is described below)
BB -> First Record under this SFI
CC -> Last Record under this SFI
DD -> Record involved for Offline Data Authentication (Not for your use for the moment)
Taking your example 10 02 04 01 18 01 01 00 20 01 10 00
Chunks are 10 02 04 01, 18 01 01 00, 20 01 10 00
10 02 04 01 -->
Taking 1st Byte 10 : 00010000 Take initial 5 bits from MSB --> 00010 means 2 : Means SFI 2
Taking 2nd Byte 02 : First Record under SFI 2 is 02
Taking 3rd Byte 04 : Last Record under SFI 2 is 04
Excluding 4 Byte explanation since no use
Summary : SFI 2 contains record 2 to 4
How Read Record command will form :
APDU structure : CLA INS P1 P2 LE
CLA 00
INS B2
P1 (Rec No)02 (SInce in this SFI 2 inital record is 02)
P2 (SFI) SFI 02 : Represent the SFI in 5 binay digit 00010 and then append 100 in the end : 00010100 : In Hex 14
So P2 is 14
LE 00
APDU to Read SFI 2 Rec 2 : 00 B2 02 14 00
APDU to Read SFI 2 Rec 3 : 00 B2 03 14 00
APDU to Read SFI 2 Rec 4 : 00 B2 04 14 00
Now if you will try to Read Rec 5, Since this Rec is not present you will get SW 6A83 in this case.
Use the same procedure for all chunk to identify the available Records and SFIs
BY this mechanisam you can write the function to parse the AFL

Excel: ignoring 0 at start of numbers

I have a list of times which i want to add to a string
0900 1730
0900 1730
1000 1700
0930 1700
i need to break these up to hours and minutes like so
09 00 17 30
09 00 17 30
10 00 17 00
09 30 17 00
to do this i am using the MID() function to get the first two characters from the cell and then the last two. But when i do this for numbers that start with 0 of have 00 it drops the first 0 like so
0930 = ",MID(B2,1,2),",",MID(B2,3,2)," output - 93 0 what i want = 09 30
0900 = ",MID(B2,1,2),",",MID(B2,3,2)," output - 90 0 what i want = 09 00
1000 = ",MID(B2,1,2),",",MID(B2,3,2)," output - 10 0 what i want = 10 00
is there a way to solve this?

You can use a mid of a pre-formatted block:
=MID(RIGHT("0000"&B2,4),1,2) =MID(RIGHT("0000"&B2,4),3,2)
This should give you two strings like 09 & 30.
If you want two numeric values you can add a value function:
=VALUE(MID(RIGHT("0000"&B2,4),1,2))

One way is place Single Quote(') before the 0 then it will store the 0930 as text in cell
and your formula will also work, No need to change in the formula.
So the value 0930 will be '0930

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

pandas replace bytes object b'\x00' with string - python-3.x

You could circumvent the problem: import pandas as pd bytes_list=[b'\x00',b'\x01',b'\x02',b'\x03',b'\x04',b'\x05',b'\x06',b'\x07',b'\x08'] bytes_df=pd.DataFrame({ "bytes": ['0'+str(ord(x)) for x in bytes_list] }) print(bytes_df) Output: bytes 0 00 1 01 2 02 3 03 4 04 5 05 6 06 7 07 8 08

Related

How to remove space between numbers but leave space between names on the same column in a DataFrame Pandas

Sockets programming: DIG DNS Query Messages: Incorrect header length?

How does jwt implement RSA256 signature verification in nodejs

Chip EMV - Getting AFL for every smart card

Excel: ignoring 0 at start of numbers

Categories

Resources