How to decode Linux RAW data - printers

I am having trouble extracting the original contents of a text file that I print through localhost:9100 on Linux, using a simple Python server found here. On Windows, however, the text is plain as day in the file under the spool directory (the only change there was setting the generic printer to RAW and pointing it at localhost:9100).
I need to be able to do the same on Linux. I believe the WinAnsiEncoding encoding might be the problem, but I am not sure. The sections below show the output.
test_text.txt
Contains
TEST101
From Windows spool
Contains
D 8 t e s t _ t e x t - N o t e p a d T e s t
Ñ ÿ O Ω R †s EMF l
` X “ ( 54 çÖ P r i n t t e s t %
Ä% ÄR p •ˇˇˇ
Ëär Ö9È$ñ[ HÏô] H g $8uˇ òÏô]
H lxˇ ¿:„år # <çr <ç œ=uˇ
ì …≥
πj=uˇ ÄÒår qÇ=uˇ S : ¡±ˇdv % b d Äm
T Ñ ¢ O Ω ° fl{á#uPá#¢ O L
ˇˇˇˇˇˇˇˇ` t e s t _ t e x t < < < < < < < < <
ÿ ≥ à ¥ T x ÿ ≥ { fl{á#uPá#ÿ ≥ L
ˇˇˇˇˇˇˇˇ\ T E S T 1 0 1 < < < < < < < K
T p ¸ ¥ c fl{á#uPá#¸ ¥ L
ˇˇˇˇˇˇˇˇX P a g e 1 < < < < < <
From Linux
Contains
b'%PDF-1.5\n%\xb5\xed\xae\xfb\n3 0 obj\n<< /Length 4 0 R\n /Filter
/FlateDecode\n>>\nstream\nx\x9c+\xe42P\x00\xc1\xa2t\x05\xfdD\x03\x85\xf4b.\xa7\x10.K\xb0\x90\xa5\x82\xa1\x85\x82\x85\xa1\x99\x9e\x89\x91\x81\x81\xb1\xb9BH.\x97~\x9a\xae\x81\xae\x81\x82\xa1BH\x1a\x97F\x88kp\x88\xa1\x81\xa1fH\x16\x97k\x08W
\x17\x00[\xe6\x0f\xf9\nendstream\nendobj\n4 0 obj\n 77\nendobj\n2 0
obj\n<<\n /ExtGState <<\n /a0 << /CA 1 /ca 1 >>\n >>\n
/Font <<\n /f-0-0 5 0 R\n >>\n>>\nendobj\n6 0 obj\n<< /Type
/Page\n /Parent 1 0 R\n /MediaBox [ 0 0 595.275591 841.889764 ]\n
/Contents 3 0 R\n /Group <<\n /Type /Group\n /S
/Transparency\n /I true\n /CS /DeviceRGB\n >>\n
/Resources 2 0 R\n>>\nendobj\n7 0 obj\n<< /Length 8 0 R\n /Filter
/FlateDecode\n /Length1
4532\n>>\nstream\nx\x9c\xedW}l[\xd5\x15?\xf7=\xfb\xf9+q^\xd2\xc48q\x13?\xe7\xd5\x8eS\xdbqj\xa7\t\xcdB\xf2\xea/\\x9c#\xbe\\xec\x98\x0c;\x8dC\xa0t\xcb\x96.JP\x19e\x08\xc8\xdc\x96fh\xd3\x04\x83Q\x89ih\x08\xad\xd7\x94\xa10\tVM\x08\xb1A\xa0\x7f\xa0i\x1a\x0c\xf5\x8f}t\x1bA\x15P*!\xb5\xde\xb9\xcfN\xd6\xf2\xa5\xfd1\xed\x8fi\xcf\xef\xdes\xce\xef\x9c{\xee\xb9\xf7\x1c\xdf\xf7\x1e\x10\x000\xc2a\xe0A\xdaw
7\xf7\x87\xbf=>\x07\xa0\xf9\x08\x80\x9b\xd8\xb7pP2\xddTw#\x80n\x18e\xcd\xcc\xdcm\x07\x16^\x8c\xbc\x0c\x18\x07\x10\xe4\xdb\xee\\\x9a\x89\xff\xec\xe7<zx\x06\xc7\xc4g\xf3\xb9\xe9\x0bB\xcb\x1f\x01j\x16\x11\xeb\x9eE#\xbb\x93\xe9j(\xca\xdbf\x0f\x1c\\4\xad\x92i\x94\xdfB\xd9z\xe7\xd7\xf7\xe5\x00L[Q~\x0f\xe5\xad\x07r\x8bs\xc2\x82\xe1I\x00Q\x8f\xb24\xf7\xcd\xfc\xdc\xc3\xfd\xfe\x9f\xa0\xec\x06
\x05\xe0
\x80\xf4\xd7\xdc\xfb\x18\xad\x0el\x8aI\xc3\x1bt\xa0\xd7\x12\rD``\xcd\xbfF\xc4\xdf\xad\x89\xef\xae\xed\xe8\x0c\xd6:j\xdb\xb0\x05\xc8J\xe0\xf2[\xdc\xfb\x97\xea\x02\\\xe1\xd2\x02z%\xd0\x86\xdd\x1a\xb7\x1f=HJ\r\xe1\xe7\x05\x8d\xc6\xa0\xe7\xb4K\xba\xdd\x1a\xf42\xb0\x1e
\xb5u\xbb\xfco\\zcG\'i\x90k\xf1\xb7\x93\xac==\xfa4\xb7?v\xe9m\xce\x19c>2\xd8\xbd\x8e>L\xd0\xad4\xebt<o\xac\xd6\x1a\xe61P\x8c\xb1J\xbf\xa4\xd5\x1a4Kd\xb7a\x89\xc5\xb5>\xb0\x8e\xfe\xd61\xb6\xc9\xf5\x80\xf8F\xc0\x1f\xf0\xab\x0168\x1a\x82\r2\xf6\x19\x92\xb9|\xf1\xdc9b\xc7\x19^\x8a\xbd\x1dC\xff=\xa5\x0f\xc9\x8b\xdc\x12\xec\x80\xdf+&\xaf\xd5\x0cC^\xb1\x1a\x86\xdaVK\xe7N\x19Uz\xfe\x94\x01\x86\xb61Z\xad\xd23\x0c\x97\x99\\\x03C6fW\x0bCM\xab\xa5\xb3\xa7\xea\xa8\x9e\xc9ez\x9e\xe1H\xcf>W#\xc2\xa0\xaa#\xef[\x10P\xfa\x911\x89\xe6\x1a\x184I\x06#\x0cr"\x9a\x127\x8ekh\xe0\x17\\xc1\x8e\x8e\xad\x0bP-Vs\xd5\x81\xb4\x9e\xe8\xdb\x1b\x1d\xa2\xdf\x0f\x87\xdac\x16K\xe3!1\x8a+\r\xf8Y\xab\xdd\x85\x1b8\xa9\xae7\xe8\x17\xdf\xc5nG\xa7\xe73WO\xa0{gW\x07\xd7\xd6\xc1\xef\xec\xea\x0e\x06Z\xb8kt.\x97\xdc*4\xd4#\xdb\xc27\xd4\x0br+\xaa{,\x96\xbf\x0f\x84\xac\xee#SK\x8f\xa7q&\xd3\xb8c\xbb\xc3\xd8T\x1f\xf3\xf8B[\x1c\xde&\xd7\x80\xaf\xf1\xc0\x8c\xa3\xd7\xd7jh\xac\xbf\xb1\xb7\xb3\xb3\xb9\xdd~\x8d\xb1\xaa\xc5\xdd\xed\x1c\xc9\xe8\x8cf]\xb4\xca\xea\xf35m\xb3YL\xa2\xe4\xed\xf3\x8c\xdcl0\x9b\x85\x88\xa9\x19\xd4ZH\xf7<\xe6Q\x84\x90\xe2\xe2\xacU\x95\xf5\x12\t7\x15\xcb\xdc(h\x8f\xf0Fc]\xad\xd6\xf8\x90>\t\xa2h\xe2\xf7\xc2#0\xe8_\xdb(\x93\xe0\xa55q-P^\xe0\x96\xee\xee\x9em\xb0\xc1\xb13Xk\xe6u\x89\xb6\xbd\xf7[\x97O=\xb2;\xdd[\xc8?\xea\xaf\xf3_z;\x16#\xcf\x1a\x1b\xea\xeb\x046?\x0f\x1d\x98\xe7W0\xcf&\xb8\x06<\xf0\xf1\xf3\xedb\xb5\x19\x06\x9d\x95\xac8+\xe9\xb5Wd\xa4\xe7\x95z\x04l\xaa\x99\xc5*\xc2\x90E\xaa)\xe7W\xcd\xa7\xc8F
\x8a\xf4\x8c\x82\x99\x1c23\x1b\xb3\x88\x9c\x9e\xa9Z+\xb4E\xa5\xe7\x9e\xb35\xc3
\xac\x96N+m\xb6\xad0\xb8\x02d\x18\x8f\x82\x158\rg\xe1<\x08\xa0\x18\xaa\xe2p\xBxS\xe0\x05\xc5\xd6\x12\x17\x98\xad\x84\xa8
\xb4Z\x8f\xd6\xf9Z\x8fVy\x07x\xc2\x9f0\x10\x83\xa1\xa6\xedx\xc3\xcd\xcd\xc7kRj\xcd\xd7\xee\xf2O\xae\xb1*X\xc7\x12#\xc1\xe3\x99\\xf3L~\xc3\xf3\xd9Z
,\xd5.\xb5\n,Z\xb5W\x93\xefrby\xc8\xadf\x8eUD0\xd0\xcf\x91.{c\x93$55\xda/\x7f\xb2?7u\xc7\x1dS\xb9\xfd$R\xb5\xb5\xa3\xb5\xb5ck\xd5\x06%\xbb\x9b\\xae\xa6&\xb7\xdb\xf9\xd8SO=\xc6Z0\x13q\xbb#\x99\x85\x963\xcf\x03;\xf9\xaa#\xc3\xe1\xf9\x06-X\x01<\x98\xe1\x1e(\x911\x92#\x8b\xe4\xdb\xe4a\xee\x15\xee\x1d\xc9%uJ\xbd\xd23\x8e\xd6R\x89\x9dIp\x82\x8c\x92,\xea\xef\xae\xe8\xb7\xa0~\xd7\xa6\xfe\x8b/\x82s\xbcC\x1e%\x8f\x91\x1f\xe3\xefD\xe5\xf7\n\xfe^%\xaf~\xe9\xc8\xf2\xa5\x05\xcd\xbfa\xc5.\xeeKt\xe5U\xff\xff\xfa\x1f\xbe\xf04+_\xd5\xa5\x0f\x19!g\x00JX\xf4\xdc/T\xd4X\xfa\xb8\xf41>\xf1U\xcb\xd2\x07F{3Z|#\xba\xb05rsd\x11QC\xe9"|\x02\x06\xc4/~\xee$\xa3j?\xaer\xf8#\x84^\x88c\x1fR\xf1QH\xc2<\\x87x\x18\xa58\xf6\x8b\x88\x8d#\x8d\xa9V\xa3\x90\xc5\xfeAD\xfa\x10\xe9V\xf98\xf2\xbe+\xfco\xff\xcf\xee\xc9\x7f\xef"\x8d\xa5\x8b\xca#9>6:2|\xd3\x8dC\x83\x89\x1b\xf6\xc4\xaf\x8fE#\xe1\xd0ne\xa0\xff\xba\xbe\xaf\xf4\xee\xba\xb6\xa7{\xa7\xbf\xc3\xe7u\xbb\x9c\xdb\xe4V\xbb\xb5\xbeV\xac\xa96\x19\rz\x9d\xa0\xd5\xf0\x1c\x01oT\x8ee%\xea\xcaR\x8dK\x8e\xc7}L\x96s\x08\xe4\xae\x00\xb2TB(v\xb5\r\x95\xb2\xaa\x99t\xb5\xa5\x82\x963\x9f\xb2T\xca\x96\xca\xa6%\x11\xa5>\xe8\xf3y\xa5\xa8,\xd1\xb5\x88,\xad\x92\x89\x91\x14\xf2\xc7"rZ\xa2\xeb?\xa4\xf2\x1a\x97*T\xa3\xe0p\xe0\x08)j\x9d\x8dH\x94d\xa5(\x8d-\xcc\x16\xa2\xd9\x08\xfa+\x9a\x8ca9\x9c7\xfa\xbcP4\x9a\x905!G\xdd\xf2\\x91\xb8\xfb\x89\xcap\xeeho\x91\x03}5\x9b\x96\xf2\xcehn\x9a\x0e\x8f\xa4\xa2\x11\x9b\xc3\x91V1\x08\xab\xbe\xa8\x10\xa6:\xd5\x97t;\x8b\x19\x8eHE\xef\xe9\xc2\xd1U\x11\xa6\xb2\x9e\xaaiy:wK\x8a\xf29\x1cT\xe0\xa3\x85\xc2\x83\xb4\xd6C\xdb\xe5\x08m\xbf\xebOV\r\x9ez\xe5H\x94zdt\x96\x18\xdd\x9c\x80P\xadS\x94\xa5\xc2\x05\xc0\xe0\xe5\xf5\xf7\xaeFr\x15Dp\x8a\x17\x80\xb1l\x89\x9b\xdb\x84\xfa\r\x1e06\x8c\x10\xd7\xe7p\xb0X\x8e\xac*0\x85\x02=<\x92*\xcb\x12L\xd9\x9e\x05\xc5\xefIS.\xcb4\xa774\rI\xa69\xbc\xa1\xd9\x1c\x9e\x95\x1d,U\xd1l\xe5^\x98\xb5\xd2\xc3S\x92\xcf\x8b\xbb\xaf\xdeN\xbcQ/Q\xde\x95\x9d\xda7\xcbh.\x90#\x91\xf2\xbe\x8d\xa7\xa8\x12AF\xc9U\xd6\x1a-v\xfa\xd1>\x97\xc5E\xd
c\xce\xb6a$E\xfd\xf2\x1c\xad\x97Ce\x03\x04$\x96\x83\xdb\xc7R\xea\x90\xca0Z\x1f\xa6\x90\xddW\x19E\xfd\xd1\x08\x8bK\x8a\x16\xb2\x91r\x80\xcc\x97<\x92z\x01\x82\xa5\xb3\xc5.\xc9v*\x08]\x90fqPK\x18\x93\xe2\x8a\x16R\xd33\xd4\x9e\xb5Mc}\xceH)\x9b\x83*i\xdc\xbe\xb4\x9c\xca\xa7Y\x96d\x91\xb6\x9f\xc5\xe9\x1c\xea\x8c\xea(\\xdb\xa7\xac7\x8c\xd9\xcauN\xbd\x94\xe2l|\x9ae\x0b\x01)\x86\x9d\x1c\xeaC\x85\x88\xe9RE\x96\xd1P\x9f\x94"6\xd80\xc3Y*\x16\x8c\xbb\xca\x0f\n\xbc3\x1cg*\x9e\r\r\xc7m\x8e\xb4\xa3|}IH\xb6JLZ\'\xd5\xe1KD3\xa6\xf2<_\x18Z\xd9\x9a\x05\xd4.E\xf3\x91+\x02\xbc\xca\xa9\xb6\x12\xc5\xdb\xe7\xc7\xc9\xb1\xbd\xa8L\x8c#\xf4,\x9d\xf1\r\x15\xef\xc4\x7f.b\x1c\xbaQ!\x96E\xabDaXJ\xc9y9-c\r)\xc3)\xb66\xb6\xd7j~\x13crbd"\xa5f\xbbR%\xe3WIe\xfd\xb5\x9b\xba\nG\xb90\x16\xccc\xdb\xc8\xa9*_\xaf\xca\x9bb\xfcS\xea=\x1bj\xa9\xa0\x97\x13c\x05\xe6Y\xae8\x04\xa9\xb0\x87\x02\x96\xac\x82\x7f\xcek\xeb\xba*\xff\xdf\x18\x1eor,\'K\xa2\x14+\xe4VK\x87\xa7\nEE)\xccE\xb3\xb3\xbd\xcc\x8f\xbcg\xba
\x8f\xa5\xfaljx\xa3\xa9\xbbmw\xb1\xe9\xea
A\x12\xe3!\x9f\x17\x0f\x9fPQ&\xcb#E\x85,\x8fM\xa4^\x10\xf1%oy<U\xe4H(\xcd\xaa\xdf:\x8b\x0b\xc4\xc3.*M\xb3\xcd9\x94\x9e-d\xd3\xac\xb4\xc1\x82\x1b\x897\xa1D\xee\x07\xca\xc9\xfdE\xc2\tU\xd4(\xe7C\xd4$\x87\x18>\xc0\xf0\x812.0\\\x87i!\x16\xe2\x83\xf2{\xd8\xeb\xaf\x8bo\xfd%pkM\xdf\x050\xf0\xe7\x18\xf2\x9b?\xbf\xf4\x1a\xa3\xbf\xfd\xe9M\xaf]~\xb9\xb4\x97{\x84\xefG\xd1\x80ou\xe577\xec\xf9\xfe\xd2^d>*eJ\x19\xee\x91\xcf\xbc\xd1\t\xe4#\x08k\xc3\x96\xc1\xd6\x83-\x81\xad\xa3\xa2\x8f[\xc6\x86\xf3q\xec\xbb\xfe\x0c\xfbvW\xbd\x08d\x1c\x9f\xdd_\xc3/c\x0e\xdf\x87\xfd\x90\xc6\xd7\xce\xa7t\'\xd9[\xa3b~\xe2\xc9\x93Or\xca\xe3M\x8e\xd8\x8f\x96\xb6\xdb\xd9w#\xd5\xa3Uu\xb1\x1f.\xc5\xed?\x98/\x03\xe3\xdfG\xe0\xe1\xf9-\xf6\xef\xcd\xc7\xed\xc7\xd1\xea!T\x1eE\xe5\x11\x94\xefEZ\xb8o\xbb\xfd\xbb\x0f\xc4\xed\xcb\xa8\xbb\x1fu\xf7\xa1\xedw\x10\xbf\x07\xf5\x8bH\x0f!~r\xe9WKo.\xf1\xca\x92]\x8e-
\xf6KB
L#iM\xce\x86oK\xce\x84\xf3\xc9\xe9p.\xb9/\x9cMN\x85oM~5<\x99\xbc%\x9cIN\x84\xd3I\xba\x8af7\x90\x9a<\xde\xf7\xe4\x8f\xe7\xf9N<\xa0s\x9d\xb9ln%Gsgs\xc2\xc9[\tL\x92\xce\xc9\xec\xe4\xca$\x9f\n\xefM\xde\x1cN&\xc7V\x86\x92\xa3+\x89\xe4\xc8\xca\r\xc9\xe1\x95=\xc9X&\x94\x8cfv\xe3\x13\x1c\xdd\x9d\xb2\x10-Y%+\x89U\xfe\xaf\xa3\t\xaa\x1f\xceP\xb2L\x9dc\xacWF&\xa8\xb0L!9\x91I\x15\ty(}\xff\xb1c\x10jN\xd0\xe6\xb1\x14=\xd1\x9cN\xd082\nc\x0e#\x03\xcdE\x0b\x84\xd2\x1e\x0fl|\xf7\xcc\x1f\x9c\xc7\xfb\xe0\xfc\xbf>\x84P7\xcf\xc8\x86\x11Q\x19&\xab\x00\xea\x0fZ\x01\xfe\t\x84\x8e\xc1\x17\nendstream\nendobj\n8
0 obj\n 2568\nendobj\n9 0 obj\n<< /Length 10 0 R\n /Filter
/FlateDecode\n>>\nstream\nx\x9c]P\xcbn\xc3
\x10\xbc\xf3\x15{L\x0e\x118\x8eo\xc8R\x95^|H[\xd5\xed\x07X\x1c\xa4\x1a\x10\xc6\x07\xff}\xd78J\xa5\x1eg\x1f3;Z~\xed^;\xef2\xf0\x8f\x14t\x8f\x19\xac\xf3&\xe1\x1c\x96\xa4\x11\x06\x1c\x9dg\xd5\x19\x8c\xd3\xf9\x91\x95_O*2N\xe4~\x9d3N\x9d\xb7\x81I\t\xfc\x93\x9asN+\x1c^L\x18\xf0\xc8\x00\x80\xbf\'\x83\xc9\xf9\x11\x0e\xdf\xd7~/\xf5K\x8c?8\xa1\xcf
X\xdb\x82AKr7\x15\xdf\xd4\x84\xc0\x0b\xf9\xd4\x19\xea\xbb\xbc\x9e\x88\xf67\xf1\xb5F\x84s\xc9\xab\xdd\x92\x0e\x06\xe7\xa84&\xe5GdR\x88\x16\xa4\xb5-Co\xfe\xf5\x9a\x9d1X}W\x89\xc9\xe6B\x93BP\xf2\xd2\x14L\x81\xea\xf5^\xaf\t\xd7U\xc1\x14\x08\x8b\x1d\x8b\xa2\xfdP\xd9\xb6l\xe7x\xda\xd7KJ\xe4\xbc\xdc\xacX\xde\xcc:\x8f\xcf\xb3\xc6\x107Vy\xbf9px\x10\nendstream\nendobj\n10
0 obj\n 245\nendobj\n11 0 obj\n<< /Type /FontDescriptor\n
/FontName /UKESOF+UbuntuMono-Regular\n /FontFamily (Ubuntu Mono)\n
/Flags 32\n /FontBBox [ -316 -170 665 830 ]\n /ItalicAngle 0\n
/Ascent 830\n /Descent -170\n /CapHeight 830\n /StemV 80\n
/StemH 80\n /FontFile2 7 0 R\n>>\nendobj\n5 0 obj\n<< /Type /Font\n
/Subtype /TrueType\n /BaseFont /UKESOF+UbuntuMono-Regular\n
/FirstChar 32\n /LastChar 84\n /FontDescriptor 11 0 R\n
/Encoding /WinAnsiEncoding\n /Widths [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 500 500 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 500 0 0 0 0 0 0 0 0 0
0 0 0 0 500 500 ]\n /ToUnicode 9 0 R\n>>\nendobj\n1 0 obj\n<< /Type
/Pages\n /Kids [ 6 0 R ]\n /Count 1\n>>\nendobj\n12 0 obj\n<<
/Creator (cairo 1.14.6 (http://cairographics.org))\n /Producer
(cairo 1.14.6 (http://cairographics.org))\n>>\nendobj\n13 0 obj\n<<
/Type /Catalog\n /Pages 1 0 R\n>>\nendobj\nxref\n0 14\n0000000000
65535 f \n0000004169 00000 n \n0000000190 00000 n \n0000000015 00000 n
\n0000000169 00000 n \n0000003833 00000 n \n0000000299 00000 n
\n0000000527 00000 n \n0000003189 00000 n \n0000003212 00000 n
\n0000003535 00000 n \n0000003558 00000 n \n0000004234 00000 n
\n0000004362 00000 n \ntrailer\n<< /Size 14\n /Root 13 0 R\n /Info
12 0 R\n>>\nstartxref\n4415\n%%EOF\n'
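The Linux capture starts with %PDF-1.5 and names cairo as the producer, which suggests the print system converted the text file to PDF before sending it to port 9100. If so, the text sits inside the FlateDecode-compressed streams rather than being garbled by WinAnsiEncoding. Below is a minimal Python sketch for inflating those streams, assuming the whole job has been read into one bytes object. The function name inflate_pdf_streams is hypothetical, and a real PDF library would be more robust than this regex-based scan.

```python
import re
import zlib

# Matches the body between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_pdf_streams(raw: bytes) -> list:
    """Return the body of every stream in `raw`, inflated where possible.

    Hypothetical helper for illustration: it assumes each stream is either
    FlateDecode (zlib) data or should be returned unchanged.
    """
    bodies = []
    for match in STREAM_RE.finditer(raw):
        body = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that trail the
            # compressed data just before "endstream".
            bodies.append(zlib.decompressobj().decompress(body))
        except zlib.error:
            bodies.append(body)  # not zlib data, keep the raw bytes
    return bodies


if __name__ == "__main__":
    # Tiny synthetic job built here for the demo, not the capture above.
    page = b"BT (TEST101) Tj ET"
    job = (b"%PDF-1.5\n3 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
           + zlib.compress(page) + b"\nendstream\nendobj\n")
    for body in inflate_pdf_streams(job):
        print(body)
```

In an inflated page stream the text would typically appear in parentheses before a Tj or TJ operator, although that depends on how the producer encodes the glyphs. The embedded font program is also a stream, so expect some binary output alongside the page content.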
Screenshots of linux printer setup

Related

Pattern identification and sequence detection

I have a dataset 'df' that looks something like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
A 1 0 0 1 0 1
B 1 1 0 0 1 0
C 1 1 1 0 0 1
D 0 0 1 0 0 1
As you can see, each row is a sequence of ones and zeros. Can anyone suggest Python code that counts how many consecutive 1s occur immediately before the first occurrence of the pattern 1, 0, 0? For example, for member A the first double-zero event occurs at seen_2 and seen_3, so the event will be 1. Similarly, for member B the first double-zero event occurs at seen_3 and seen_4, and two 1s occur before it, so the event is 2. The resulting table should have a new column 'event', something like this:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
A 1 0 0 1 0 1 1
B 1 1 0 0 1 0 2
C 1 1 1 0 0 1 3
D 0 0 1 0 0 1 1
My approach:
df = df.set_index('MEMBER')
# count the consecutive 1s in each row since the last 0
s = (df.stack()
       .groupby(['MEMBER', df.eq(0).cumsum(1).stack()])
       .cumsum().unstack()
    )
# mask of the zeros:
u = s.eq(0)
# look for the first 1 0 0
idx = (~u &
       u.shift(-1, axis=1, fill_value=False) &
       u.shift(-2, axis=1, fill_value=False)).idxmax(1)
# look up the count at that position in each row
# (DataFrame.lookup was removed in pandas 2.0, so index the array directly)
df['event'] = s.to_numpy()[list(range(len(s))), s.columns.get_indexer(idx)]
Test data:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6
0 A 1 0 1 0 0 1
1 B 1 1 0 0 1 0
2 C 1 1 1 0 0 1
3 D 0 0 1 0 0 1
4 E 1 0 1 1 0 0
Output:
MEMBER seen_1 seen_2 seen_3 seen_4 seen_5 seen_6 event
0 A 1 0 1 0 0 1 1
1 B 1 1 0 0 1 0 2
2 C 1 1 1 0 0 1 3
3 D 0 0 1 0 0 1 1
4 E 1 0 1 1 0 0 2
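For comparison, the same rule can be written as a plain Python loop over one row at a time. This is only a sketch: the function name ones_before_first_100 is hypothetical, and returning 0 when a row never contains the pattern 1, 0, 0 is an assumption that the question does not specify.

```python
def ones_before_first_100(row):
    """Length of the run of 1s that ends at the first 1 followed by 0, 0.

    Returns 0 if the pattern never occurs (an assumed convention).
    """
    run = 0
    for i, value in enumerate(row):
        if value == 1:
            run += 1
            # The pattern is complete when the next two entries are both 0.
            if list(row[i + 1:i + 3]) == [0, 0]:
                return run
        else:
            run = 0  # a 0 breaks the current run of 1s
    return 0


if __name__ == "__main__":
    rows = {
        "A": [1, 0, 1, 0, 0, 1],
        "B": [1, 1, 0, 0, 1, 0],
        "C": [1, 1, 1, 0, 0, 1],
        "D": [0, 0, 1, 0, 0, 1],
        "E": [1, 0, 1, 1, 0, 0],
    }
    for member, row in rows.items():
        print(member, ones_before_first_100(row))
```

With pandas, a function like this could be applied row by row through DataFrame.apply with axis=1. That is slower than the vectorised version but easier to step through.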

How to filter a matrix based on another column

I want to filter a matrix file using a column from another file.
I have 2 tab-separated files. One contains a matrix (FileA). I want to filter this matrix based on the first column of FileB: if a header (column name) of FileA is present in the first column of FileB, I want to keep that column and write it to a new file. All the solutions I tried filtered rows, not fields. Any help is appreciated. Thanks!
FileA
A B C D E F G H I J K L M N
R1 0 0 0 0 0 0 0 0 0 1 0 0 1 1
R2 1 1 0 1 0 0 0 0 1 0 1 0 0 0
R3 0 0 0 0 0 0 0 0 0 0 0 0 0 1
R4 1 1 0 1 0 0 0 1 0 1 0 1 0 0
R5 0 0 0 0 1 0 1 0 1 0 1 0 1 0
FileB
A Green
B Purple
K Blue
L Blue
Z Green
M Purple
N Red
O Red
U Red
My expected output is:
ExpectedOutput
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
Oh, what the heck, I'm not sure having you post an R script is really going to make any difference other than satisfying my need to be pedantic so here y'go:
$ cat tst.awk
NR == FNR {
    outFldNames2Nrs[$1] = ++numOutFlds
    next
}
FNR == 1 {
    $0 = "__" FS $0
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        outFldNr = outFldNames2Nrs[$inFldNr]
        out2inFldNrs[outFldNr] = inFldNr
    }
}
{
    printf "%s", $1
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2inFldNrs[outFldNr]
        if (inFldNr) {
            printf "%s%s", OFS, $inFldNr
        }
    }
    print ""
}
$ awk -f tst.awk fileB fileA
__ A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
I'm using the term "field name" to apply to the letter at the top of each column ("field" in awk). Try to figure the rest out for yourself from looking at the man pages and adding "prints" if/when useful and then feel free to ask questions if you have any.
I added __ at the front of your header line so you'd have the same number of columns in every line of output - that makes it easier to pass along to other tools to manipulate further but it's easy to tweak the code to not do that if you don't like it.
As @EdMorton mentions, bash may not be a suitable tool for manipulating a complex data structure such as a table, from a maintainability and robustness point of view.
Here is a bash script example, just for information:
#!/bin/bash
declare -A seen
declare -a ary include
while read -r alpha color; do
    seen["$alpha"]=1
done < FileB
while read -r -a ary; do
    if (( $((nr++)) == 0 )); then # handle header line
        echo -n " "
        for (( i=0; i<${#ary[@]}; i++ )); do
            alpha="${ary[$i]}"
            if [[ ${seen["$alpha"]} = 1 ]]; then
                echo -n " $alpha"
                include[$((i+1))]=1
            fi
        done
    else
        echo -n "${ary[0]}"
        for (( i=1; i<${#ary[@]}; i++ )); do
            if [[ ${include[$i]} = 1 ]]; then
                echo -n " ${ary[$i]}"
            fi
        done
    fi
    echo
done < FileA
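If Python is an option, you can instead write something like:
import pandas as pd
# raw strings avoid the invalid-escape warning for "\s" in recent Python versions
dfb = pd.read_csv("./FileB", sep=r"\s+", header=None)
vb = [x[0] for x in dfb.values.tolist()]  # keys in the first column of FileB
dfa = pd.read_csv("./FileA", sep=r"\s+")
va = dfa.columns.tolist()  # column names of the matrix
# keep the columns named in FileB (sorted alphabetically)
print(dfa[sorted(set(va) & set(vb))])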
Output:
A B K L M N
R1 0 0 0 0 1 1
R2 1 1 1 0 0 0
R3 0 0 0 0 0 1
R4 1 1 0 1 0 0
R5 0 0 1 0 1 0
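If pandas is not available, the same column filter can be sketched with the standard library alone. The function name filter_columns is hypothetical. The sketch assumes whitespace-separated text in the layout shown in the question: a header with one name per data column, and data rows that start with a row label. Unlike the sorted() call above, it keeps the columns in FileA's original order.

```python
def filter_columns(matrix_text, keys_text):
    """Keep only the matrix columns whose header appears in the key list."""
    # First whitespace-separated token of every non-empty FileB line.
    wanted = {line.split()[0] for line in keys_text.splitlines() if line.strip()}
    lines = [line.split() for line in matrix_text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    keep = [i for i, name in enumerate(header) if name in wanted]
    result = [[header[i] for i in keep]]
    for row in rows:
        # Data rows carry a leading row label, so header column i is at i + 1.
        result.append([row[0]] + [row[i + 1] for i in keep])
    return result


if __name__ == "__main__":
    file_a = "A B C K\nR1 0 0 0 1\nR2 1 1 0 0\n"
    file_b = "A Green\nK Blue\nZ Green\n"
    for fields in filter_columns(file_a, file_b):
        print("\t".join(fields))
```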

Create a new large matrix by stacking in its diagonal K matrices

I have K (let K be 7 here) distinct matrices of dimension (50,50).
I would like to create a new matrix L by placing the K matrices along its diagonal. Hence L is of dimension (50*K,50*K).
What I have tried:
K1=np.random.random((50,50))
N,N=K1.shape
K=7
out=np.zeros((K,N,K,N),K1.dtype)
np.einsum('ijik->ijk', out)[...] = K1
L=out.reshape(K*N, K*N) # L is of dimension (50*7,50*7)=(350,350)
It does indeed create a new matrix L by stacking K1 seven times along its diagonal. However, I would like to stack K1,K2,K3,K4,K5,K6,K7 respectively rather than K1 seven times.
Inputs :
K1=np.random.random((50,50))
K2=np.random.random((50,50))
K3=np.random.random((50,50))
K4=np.random.random((50,50))
K5=np.random.random((50,50))
K6=np.random.random((50,50))
K7=np.random.random((50,50))
L=np.zeros((50*7,50*7))
Expected outputs :
L[:50,:50]=K1
L[50:100,50:100]=K2
L[100:150,100:150]=K3
L[150:200,150:200]=K4
L[200:250,200:250]=K5
L[250:300,250:300]=K6
L[300:350,300:350]=K7
You could try scipy.linalg.block_diag. If you look at the source, this function basically just loops over the given blocks the way you have written as your output. It can be used like:
import numpy as np
import scipy as sp
import scipy.linalg  # makes sp.linalg available
K1=np.random.random((50,50))
K2=np.random.random((50,50))
K3=np.random.random((50,50))
K4=np.random.random((50,50))
K5=np.random.random((50,50))
K6=np.random.random((50,50))
K7=np.random.random((50,50))
L=sp.linalg.block_diag(K1,K2,K3,K4,K5,K6,K7)
If you have your K as an ndarray of shape (7,50,50) you can unpack it directly like:
K=np.random.random((7,50,50))
L=sp.linalg.block_diag(*K)
If you don't want to import scipy, you can always just write a simple loop to do what you have written for the expected output.
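The simple loop mentioned above could look like the following sketch. The function name block_diag_loop is hypothetical, and the code assumes every block is a 2-D NumPy array. The blocks do not need to share a shape.

```python
import numpy as np


def block_diag_loop(blocks):
    """Place 2-D arrays one after another along the diagonal of a zero matrix."""
    total_rows = sum(b.shape[0] for b in blocks)
    total_cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((total_rows, total_cols), dtype=np.result_type(*blocks))
    r = c = 0
    for b in blocks:
        h, w = b.shape
        out[r:r + h, c:c + w] = b  # same slice assignment as in the question
        r += h
        c += w
    return out


if __name__ == "__main__":
    blocks = [np.random.random((50, 50)) for _ in range(7)]
    L = block_diag_loop(blocks)
    print(L.shape)
```

For the seven 50x50 matrices in the question this would give a (350, 350) array with K1 to K7 on the diagonal and zeros elsewhere.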
Here is a way to do that with NumPy:
import numpy as np
def put_in_diagonals(a):
    n, rows, cols = a.shape
    b = np.zeros((n * rows, n * cols), dtype=a.dtype)
    a2 = a.reshape(-1, cols)
    ii, jj = np.indices(a2.shape)
    jj += (ii // rows) * cols
    b[ii, jj] = a2
    return b
# Test
a = np.arange(24).reshape(4, 2, 3)
print(put_in_diagonals(a))
Output:
[[ 0 1 2 0 0 0 0 0 0 0 0 0]
[ 3 4 5 0 0 0 0 0 0 0 0 0]
[ 0 0 0 6 7 8 0 0 0 0 0 0]
[ 0 0 0 9 10 11 0 0 0 0 0 0]
[ 0 0 0 0 0 0 12 13 14 0 0 0]
[ 0 0 0 0 0 0 15 16 17 0 0 0]
[ 0 0 0 0 0 0 0 0 0 18 19 20]
[ 0 0 0 0 0 0 0 0 0 21 22 23]]

Count the number of overlapping substrings within a string

example:
s <- "aaabaabaa"
p <- "aa"
I want to return 4, not 3 (i.e. counting the number of "aa" instances in the initial "aaa" as 2, not 1).
Is there any package to solve it? Or is there any way to count in R?
I believe that
find_overlaps <- function(p,s) {
gg <- gregexpr(paste0("(?=",p,")"),s,perl=TRUE)[[1]]
if (length(gg)==1 && gg==-1) 0 else length(gg)
}
find_overlaps("aa","aaabaabaa") ## 4
find_overlaps("not_there","aaabaabaa") ## 0
find_overlaps("aa","aaaaaaaa") ## 7
will do what you want, which would be more clearly expressed as "finding the number of overlapping substrings within a string".
This is a minor variation on Finding the indexes of multiple/overlapping matching substrings
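The zero-width lookahead trick is not specific to R. As a rough illustration, here is the same idea in Python. The function name count_overlaps is hypothetical, and escaping the pattern with re.escape is an added assumption so that it is matched literally, which the R version above does not do.

```python
import re


def count_overlaps(pattern, text):
    """Count occurrences of `pattern` in `text`, including overlapping ones."""
    # A lookahead consumes no characters, so a match can start at every
    # position where the pattern begins, even inside a previous match.
    return len(re.findall(f"(?={re.escape(pattern)})", text))


if __name__ == "__main__":
    print(count_overlaps("aa", "aaabaabaa"))
    print(count_overlaps("not_there", "aaabaabaa"))
    print(count_overlaps("aa", "aaaaaaaa"))
```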
substring might be useful here, by taking every successive pair of characters.
( ss <- sapply(2:nchar(s), function(i) substring(s, i-1, i)) )
## [1] "aa" "aa" "ab" "ba" "aa" "ab" "ba" "aa"
sum(ss %in% p)
## [1] 4
I needed the answer to a related more-general question. Here is what I came up with generalizing Ben Bolker's solution:
my.data <- read.table(text = '
my.string my.cov
1.2... 1
.21111 2
..2122 3
...211 2
112111 4
212222 1
', header = TRUE, stringsAsFactors = FALSE)
desired.result.2ch <- read.table(text = '
my.string my.cov n.11 n.12 n.21 n.22
1.2... 1 0 0 0 0
.21111 2 3 0 1 0
..2122 3 0 1 1 1
...211 2 1 0 1 0
112111 4 3 1 1 0
212222 1 0 1 1 3
', header = TRUE, stringsAsFactors = FALSE)
desired.result.3ch <- read.table(text = '
my.string my.cov n.111 n.112 n.121 n.122 n.222 n.221 n.212 n.211
1.2... 1 0 0 0 0 0 0 0 0
.21111 2 2 0 0 0 0 0 0 1
..2122 3 0 0 0 1 0 0 1 0
...211 2 0 0 0 0 0 0 0 1
112111 4 1 1 1 0 0 0 0 1
212222 1 0 0 0 1 2 0 1 0
', header = TRUE, stringsAsFactors = FALSE)
find_overlaps <- function(s, my.cov, p) {
gg <- gregexpr(paste0("(?=",p,")"),s,perl=TRUE)[[1]]
if (length(gg)==1 && gg==-1) 0 else length(gg)
}
p <- c('11', '12', '21', '22', '111', '112', '121', '122', '222', '221', '212', '211')
my.output <- matrix(0, ncol = (nrow(my.data)+1), nrow = length(p))
for(i in seq(1,length(p))) {
my.data$p <- p[i]
my.output[i,1] <- p[i]
my.output[i,(2:(nrow(my.data)+1))] <-apply(my.data, 1, function(x) find_overlaps(x[1], x[2], x[3]))
apply(my.data, 1, function(x) find_overlaps(x[1], x[2], x[3]))
}
my.output
desired.result.2ch
desired.result.3ch
pre.final.output <- matrix(t(my.output[,2:7]), ncol=length(p), nrow=nrow(my.data))
final.output <- data.frame(my.data[,1:2], t(apply(pre.final.output, 1, as.numeric)))
colnames(final.output) <- c(colnames(my.data[,1:2]), paste0('x', p))
final.output
# my.string my.cov x11 x12 x21 x22 x111 x112 x121 x122 x222 x221 x212 x211
#1 1.2... 1 0 0 0 0 0 0 0 0 0 0 0 0
#2 .21111 2 3 0 1 0 2 0 0 0 0 0 0 1
#3 ..2122 3 0 1 1 1 0 0 0 1 0 0 1 0
#4 ...211 2 1 0 1 0 0 0 0 0 0 0 0 1
#5 112111 4 3 1 1 0 1 1 1 0 0 0 0 1
#6 212222 1 0 1 1 3 0 0 0 1 2 0 1 0
A tidy, and I think more readable solution is
library(tidyverse)
PatternCount <- function(text, pattern) {
#Generate all sliding substrings
map(seq_len(nchar(text) - nchar(pattern) + 1),
function(x) str_sub(text, x, x + nchar(pattern) - 1)) %>%
#Test them against the pattern
map_lgl(function(x) x == pattern) %>%
#Count the number of matches
sum
}
PatternCount("aaabaabaa", "aa")
# 4

Matlab string operation

I have converted a string to binary as follows
message='hello my name is kamran';
messagebin=dec2bin(message);
Is there any method for storing it in an array?
I am not really sure of what you want to do here, but if you need to concatenate the rows of the binary representation (which is a matrix of numchars times bits_per_char), this is the code:
message = 'hello my name is kamran';
messagebin = dec2bin(double(message));
linearmessagebin = reshape(messagebin',1,numel(messagebin));
Please note that the double conversion returns your ASCII code. I do not have access to a Matlab installation here, but for example octave complains about the code you provided in the original question.
NOTE
As it was kindly pointed out to me, you have to transpose the messagebin before "serializing" it, in order to have the correct result.
If you want the result as numeric matrix, try:
>> str = 'hello world';
>> b = dec2bin(double(str),8) - '0'
b =
0 1 1 0 1 0 0 0
0 1 1 0 0 1 0 1
0 1 1 0 1 1 0 0
0 1 1 0 1 1 0 0
0 1 1 0 1 1 1 1
0 0 1 0 0 0 0 0
0 1 1 1 0 1 1 1
0 1 1 0 1 1 1 1
0 1 1 1 0 0 1 0
0 1 1 0 1 1 0 0
0 1 1 0 0 1 0 0
Each row corresponds to a character. You can easily reshape it into a sequence of 0s and 1s.
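For readers without Matlab, the same two steps (one row of bits per character, then one flat sequence) can be sketched in Python. The function name string_to_bits is hypothetical. The default width of 8 mirrors the dec2bin(...,8) call above, and the sketch assumes every character code fits in that width.

```python
def string_to_bits(message, width=8):
    """Return (rows, flat): one list of bits per character, and all bits in order."""
    # format(..., "08b") zero-pads the binary form, like dec2bin(x, 8).
    rows = [[int(bit) for bit in format(ord(ch), f"0{width}b")] for ch in message]
    # Flatten row by row, the counterpart of reshaping the transposed matrix.
    flat = [bit for row in rows for bit in row]
    return rows, flat


if __name__ == "__main__":
    rows, flat = string_to_bits("hello world")
    for row in rows:
        print(row)
    print(len(flat))
```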
