Sort a list in Python 3
I would like to order this list.
From:
01104D-BB'42
01104D-BB42
01104D-BB43
01104D-CC'42
01104D-CC'72
01104D-CC32
01104D-CC42
01104D-CC62
01104D-CC72
01104D-DD'74
01104D-DD'75
01104D-DD'76
01104D-DD'77
01104D-DD'78
01104D-DD75
01104D-DD76
01104D-DD77
01104D-DD78
01104D-EE'102
01104D-EE'12
01104D-EE'2
01104D-EE'32
01104D-EE'42
01104D-EE'52
01104D-EE'53
01104D-EE'72
01104D-EE'82
01104D-EE'92
01104D-EE102
01104D-EE12
01104D-EE2
01104D-EE3
01104D-EE32
01104D-EE42
01104D-EE52
01104D-EE62
01104D-EE72
01104D-EE82
01104D-EE83
01104D-EE92
01104D-EE93
To:
01104D-BB42
01104D-BB43
01104D-BB'42
01104D-CC32
01104D-CC42
01104D-CC62
01104D-CC72
01104D-CC'42
01104D-CC'72
01104D-DD75
01104D-DD76
01104D-DD77
01104D-DD78
01104D-DD'74
01104D-DD'75
01104D-DD'76
01104D-DD'77
01104D-DD'78
01104D-EE102
01104D-EE12
01104D-EE2
01104D-EE3
01104D-EE32
01104D-EE42
01104D-EE52
01104D-EE62
01104D-EE72
01104D-EE82
01104D-EE83
01104D-EE92
01104D-EE93
01104D-EE'102
01104D-EE'12
01104D-EE'2
01104D-EE'32
01104D-EE'42
01104D-EE'52
01104D-EE'53
01104D-EE'72
01104D-EE'82
01104D-EE'92
Can you help me?
Thanks.
I'm guessing here, because you haven't explained how you want the sort to be done. But it looks like you want the character ' to sort after the digits 0-9, while the ASCII sort order puts it before the digits. If that is correct, then you need to substitute a different character for ' in the sort key. A good choice might be ~, because it is the last printable ASCII character.
If your data is in mylist, then
mylist.sort(key=lambda a: a.replace("'","~"))
will sort it in the order I'm guessing you want.
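For example, a quick check with a few of the values from the question (my own illustration, not part of the original answer):

mylist = ["01104D-BB'42", "01104D-BB42", "01104D-BB43", "01104D-EE'2", "01104D-EE2"]
mylist.sort(key=lambda a: a.replace("'", "~"))
print(mylist)
# ['01104D-BB42', '01104D-BB43', "01104D-BB'42", '01104D-EE2', "01104D-EE'2"]

The primed entries now come after the unprimed ones within each group, as in the desired output.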
Related
How to match an optional number along with alphanumerics in a Ruta script
I am working on entity extraction in Pega. I have a requirement to match a policy number that has 3 parts:
1) An optional leading 1 as the first character of the policy number.
2) An alphanumeric block of length 2, optionally followed by a hyphen or a space.
3) An alphanumeric block of length 3.
So some examples of formats are: AB-CDE, AB CDE, ABCDE, 1AB-CDE, 23-456, 23 456, 23456, 123456, AB-2B4, AB-B2C, A1-2B4, 2A-34B, 12A-34B, 123-45C, etc.
I am facing a problem whenever the policy number starts with 2 or 3 digits or doesn't have any space or hyphen, for example 12A-34B, 123-45C, 23456, 123456. I have written the script below:
PACKAGE uima.ruta.example;
Document{-> RETAINTYPE(SPACE)};
("1")+? ((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,4)};
((NUM* W*)|(W* NUM*)){REGEXP(".{2}")} ("-"|SPACE)? ((NUM* W* NUM*)|(W* NUM* W*)){REGEXP(".{3}")->MARK(EntityType,1,3)};
This code works fine for patterns with a space or hyphen, like AB-CDE, AB CDE, 1AB-CDE, but not when there is no space or hyphen, or when the pattern starts with 2 or 3 digits. Please help me write a correct pattern. Thanks in advance.
The UIMA Ruta seed annotation NUM covers the whole number, so examples like 23456 and 123456 cannot be split into subannotations by Ruta. A solution would be to use a pure regexp to annotate all the mentioned examples:
"\\w{2,3}[\\-|\\s]?\\w{2,3}" -> EntityType;
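If it helps, the regular expression itself can be sanity-checked against the examples outside of Ruta; here is a quick Python sketch (my addition, not part of the answer, and it only exercises the regexp, not the Ruta rule):

import re

# Same character classes as in the Ruta rule above ('|' inside the class is literal).
pattern = re.compile(r"\w{2,3}[-|\s]?\w{2,3}")

examples = ["AB-CDE", "AB CDE", "ABCDE", "1AB-CDE", "23-456", "23456", "123456", "12A-34B", "123-45C"]
for s in examples:
    print(s, bool(pattern.fullmatch(s)))  # every example should print True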
Force Linux sort to use lexicographic order
I generated a text file with pseudo-random numbers like this:
-853340442 1130519212 -2070936922
-707168664 -2076185735 -2135012102
166464098 1928545126 5768715
1060168276 -684694617 395859713
-680897578 -2095893176 1457930442
299309402 192205833 1878010157
-678911642 2062673581 -1801057195
795693402 -631504846 2117889796
448959250 547707556 -1115929024
168558507 7468411 1600190097
-746131117 1557335455 73377787
-1144524558 2143073647 -2044347857
1862106004 -193937480 1596949168
-1193502513 -920620244 -365340967
-677065994 500654963 1031304603
Now I try to put it in order using the Linux sort command:
sort prng >prngsorted
The result is not what I expected:
1060168276 -684694617 395859713
-1144524558 2143073647 -2044347857
-1193502513 -920620244 -365340967
166464098 1928545126 5768715
168558507 7468411 1600190097
1862106004 -193937480 1596949168
299309402 192205833 1878010157
448959250 547707556 -1115929024
-677065994 500654963 1031304603
-678911642 2062673581 -1801057195
-680897578 -2095893176 1457930442
-707168664 -2076185735 -2135012102
-746131117 1557335455 73377787
795693402 -631504846 2117889796
-853340442 1130519212 -2070936922
Obviously, sort tries to parse the strings and extract numbers for sorting, and it seems to ignore minus signs. Is it possible to force sort to be a bit dumber and just compare the lines lexicographically? The result should be like this:
-1144524558 2143073647 -2044347857
-1193502513 -920620244 -365340967
-677065994 500654963 1031304603
-678911642 2062673581 -1801057195
-680897578 -2095893176 1457930442
-707168664 -2076185735 -2135012102
-746131117 1557335455 73377787
-853340442 1130519212 -2070936922
1060168276 -684694617 395859713
166464098 1928545126 5768715
168558507 7468411 1600190097
1862106004 -193937480 1596949168
299309402 192205833 1878010157
448959250 547707556 -1115929024
795693402 -631504846 2117889796
Note: I tried the -d option but it did not help.
Note 2: Probably I should use another utility instead of sort?
The sort command takes account of your locale settings, and many locales ignore dashes for collation. You can get the sorting you want with:
LC_COLLATE=C sort filename
A custom sort with the help of awk:
$ awk '{print ($1<0?"-":"+") "\t" $0}' file | sort -k1,1 -k2 | cut -f2-
-1144524558 2143073647 -2044347857
-1193502513 -920620244 -365340967
-677065994 500654963 1031304603
-678911642 2062673581 -1801057195
-680897578 -2095893176 1457930442
-707168664 -2076185735 -2135012102
-746131117 1557335455 73377787
-853340442 1130519212 -2070936922
1060168276 -684694617 395859713
166464098 1928545126 5768715
168558507 7468411 1600190097
1862106004 -193937480 1596949168
299309402 192205833 1878010157
448959250 547707556 -1115929024
795693402 -631504846 2117889796
Sort by sign only first, then do a regular sort, and remove the sign afterwards...
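For comparison, the same code-point ordering is what Python's default string comparison gives you; a minimal sketch of what LC_COLLATE=C sort does with these lines (my own illustration, reading the prng file from the question):

# str comparison in Python is by code point, so '-' (U+002D) sorts before the digits,
# which is exactly the lexicographic order asked for above.
with open("prng") as f:
    lines = [line.rstrip("\n") for line in f]
for line in sorted(lines):
    print(line)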
Japanese Unicode: Convert radical to regular character code
How can I convert Japanese radical characters into their "regular" kanji character counterparts? For instance, the character for the radical "fire" is ⽕ (with a Unicode value of 12117), and the regular character is 火 (with a Unicode value of 28779).
EDIT: To clarify, the reason I think I need this is that I would like to obtain the stroke information for each radical by using the kanjivg data set. However (I need to look into this further), I'm not sure whether kanjivg has stroke data for the radical characters, but it definitely has stroke data for the regular kanji characters. The language I'm working with is Java, but I assume the conversion would be similar in any language.
Using RADKFILE for this was a neat idea (@Paul), but I don't think it uses Kangxi radicals, because it's encoded in EUC-JP, and unless my browser (or GitHub) automatically converts between Kangxi and kanji, the list only has non-Kangxi characters as far as Unicode is concerned. The Unicode range for Kangxi radicals is on this Wikipedia page: Unicode/Character reference/2000-2FFF (bottom). Somebody has created a mapping between them: Kanji to Kangxi Radical remapping tables. I did not check its correctness, but when you convert the code points to characters you can see whether they look the same. Here's how you do it in Java: Creating Unicode character from its number.
Here is the list in CSV for convenience (kanji,radical):
0x4E00,0x2F00 0x4E28,0x2F01 0x4E36,0x2F02 0x4E3F,0x2F03 0x4E59,0x2F04 0x4E85,0x2F05 0x4E8C,0x2F06 0x4EA0,0x2F07 0x4EBA,0x2F08 0x513F,0x2F09 0x5165,0x2F0A 0x516B,0x2F0B 0x5182,0x2F0C 0x5196,0x2F0D 0x51AB,0x2F0E 0x51E0,0x2F0F 0x51F5,0x2F10 0x5200,0x2F11 0x529B,0x2F12 0x52F9,0x2F13 0x5315,0x2F14 0x531A,0x2F15 0x5338,0x2F16 0x5341,0x2F17 0x535C,0x2F18 0x5369,0x2F19 0x5382,0x2F1A 0x53B6,0x2F1B 0x53C8,0x2F1C 0x53E3,0x2F1D 0x56D7,0x2F1E 0x571F,0x2F1F 0x58EB,0x2F20 0x5902,0x2F21 0x590A,0x2F22 0x5915,0x2F23 0x5927,0x2F24 0x5973,0x2F25 0x5B50,0x2F26 0x5B80,0x2F27 0x5BF8,0x2F28 0x5C0F,0x2F29 0x5C22,0x2F2A 0x5C38,0x2F2B 0x5C6E,0x2F2C 0x5C71,0x2F2D 0x5DDB,0x2F2E 0x5DE5,0x2F2F 0x5DF1,0x2F30 0x5DFE,0x2F31 0x5E72,0x2F32 0x5E7A,0x2F33 0x5E7F,0x2F34 0x5EF4,0x2F35 0x5EFE,0x2F36 0x5F0B,0x2F37 0x5F13,0x2F38 0x5F50,0x2F39 0x5F61,0x2F3A 0x5F73,0x2F3B 0x5FC3,0x2F3C 0x6208,0x2F3D 0x6236,0x2F3E 0x624B,0x2F3F 0x652F,0x2F40 0x6534,0x2F41 0x6587,0x2F42 0x6597,0x2F43 0x65A4,0x2F44 0x65B9,0x2F45 0x65E0,0x2F46 0x65E5,0x2F47 0x66F0,0x2F48 0x6708,0x2F49 0x6728,0x2F4A 0x6B20,0x2F4B 0x6B62,0x2F4C 0x6B79,0x2F4D 0x6BB3,0x2F4E 0x6BCB,0x2F4F 0x6BD4,0x2F50 0x6BDB,0x2F51 0x6C0F,0x2F52 0x6C14,0x2F53 0x6C34,0x2F54 0x706B,0x2F55 0x722A,0x2F56 0x7236,0x2F57 0x723B,0x2F58 0x723F,0x2F59 0x7247,0x2F5A 0x7259,0x2F5B 0x725B,0x2F5C 0x72AC,0x2F5D 0x7384,0x2F5E 0x7389,0x2F5F 0x74DC,0x2F60 0x74E6,0x2F61 0x7518,0x2F62 0x751F,0x2F63 0x7528,0x2F64 0x7530,0x2F65 0x758B,0x2F66 0x7592,0x2F67 0x7676,0x2F68 0x767D,0x2F69 0x76AE,0x2F6A 0x76BF,0x2F6B 0x76EE,0x2F6C 0x77DB,0x2F6D 0x77E2,0x2F6E 0x77F3,0x2F6F 0x793A,0x2F70 0x79B8,0x2F71 0x79BE,0x2F72 0x7A74,0x2F73 0x7ACB,0x2F74 0x7AF9,0x2F75 0x7C73,0x2F76 0x7CF8,0x2F77 0x7F36,0x2F78 0x7F51,0x2F79 0x7F8A,0x2F7A 0x7FBD,0x2F7B 0x8001,0x2F7C 0x800C,0x2F7D 0x8012,0x2F7E 0x8033,0x2F7F 0x807F,0x2F80 0x8089,0x2F81 0x81E3,0x2F82 0x81EA,0x2F83 0x81F3,0x2F84 0x81FC,0x2F85 0x820C,0x2F86 0x821B,0x2F87 0x821F,0x2F88 0x826E,0x2F89 0x8272,0x2F8A 0x8278,0x2F8B 0x864D,0x2F8C 0x866B,0x2F8D 0x8840,0x2F8E 0x884C,0x2F8F 0x8863,0x2F90 0x897E,0x2F91 0x898B,0x2F92 0x89D2,0x2F93 0x8A00,0x2F94 0x8C37,0x2F95 0x8C46,0x2F96 0x8C55,0x2F97 0x8C78,0x2F98 0x8C9D,0x2F99 0x8D64,0x2F9A 0x8D70,0x2F9B 0x8DB3,0x2F9C 0x8EAB,0x2F9D 0x8ECA,0x2F9E 0x8F9B,0x2F9F 0x8FB0,0x2FA0 0x8FB5,0x2FA1 0x9091,0x2FA2 0x9149,0x2FA3 0x91C6,0x2FA4 0x91CC,0x2FA5 0x91D1,0x2FA6 0x9577,0x2FA7 0x9580,0x2FA8 0x961C,0x2FA9 0x96B6,0x2FAA 0x96B9,0x2FAB 0x96E8,0x2FAC 0x9751,0x2FAD 0x975E,0x2FAE 0x9762,0x2FAF 0x9769,0x2FB0 0x97CB,0x2FB1 0x97ED,0x2FB2 0x97F3,0x2FB3 0x9801,0x2FB4 0x98A8,0x2FB5 0x98DB,0x2FB6 0x98DF,0x2FB7 0x9996,0x2FB8 0x9999,0x2FB9 0x99AC,0x2FBA 0x9AA8,0x2FBB 0x9AD8,0x2FBC 0x9ADF,0x2FBD 0x9B25,0x2FBE 0x9B2F,0x2FBF 0x9B32,0x2FC0 0x9B3C,0x2FC1 0x9B5A,0x2FC2 0x9CE5,0x2FC3 0x9E75,0x2FC4 0x9E7F,0x2FC5 0x9EA5,0x2FC6 0x9EBB,0x2FC7 0x9EC3,0x2FC8
0x9ECD,0x2FC9 0x9ED1,0x2FCA 0x9EF9,0x2FCB 0x9EFD,0x2FCC 0x9F0E,0x2FCD 0x9F13,0x2FCE 0x9F20,0x2FCF 0x9F3B,0x2FD0 0x9F4A,0x2FD1 0x9F52,0x2FD2 0x9F8D,0x2FD3 0x9F9C,0x2FD4 0x9FA0,0x2FD5
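To turn those code-point pairs into an actual lookup table, a minimal Python sketch (my addition; only two pairs from the CSV above are filled in, the rest go in the same way):

# Build a radical -> kanji lookup from the (kanji, radical) code points listed above.
PAIRS = [
    ("0x4E00", "0x2F00"),  # 一 / Kangxi radical one
    ("0x706B", "0x2F55"),  # 火 / Kangxi radical fire, the example from the question
]
RAD_TO_KANJI = {chr(int(rad, 16)): chr(int(kanji, 16)) for kanji, rad in PAIRS}
print(RAD_TO_KANJI["\u2F55"])  # prints 火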
It's not entirely clear why you want this, but one possible way to do it is with Jim Breen's radkfile, which maps radicals to associated kanji and the reverse. Combine that with some heuristics and Breen's kanjidic file (to the extent that these resources are reliable), and you can pretty easily generate a mapping. Here's an example in Python, using the cjktools library, which has Python wrappers for these things.
from cjktools.resources.radkdict import RadkDict
from cjktools.resources.kanjidic import Kanjidic

def make_rad_to_kanji_dict():
    rdict = RadkDict()
    kdict = Kanjidic()

    # Get all the radicals where there are kanji made up entirely of the one
    # radical - the ones we want are a subset of those
    tmp = ((rads[0], kanji) for kanji, rads in rdict.items() if len(rads) == 1)

    # All the ones with the same number of strokes - should be all the ones that
    # are homographs
    out = {rad: kanji for rad, kanji in tmp
           if (kanji in kdict and
               kdict[kanji].stroke_count == rdict.radical_to_stroke_count[rad])}

    return out

RAD_TO_KANJI_DICT = make_rad_to_kanji_dict()

if __name__ == "__main__":
    print(RAD_TO_KANJI_DICT['⽕'])
You can iterate through the dictionary it generates and output a static mapping pretty easily. There may be existing homograph lists for that sort of thing, but I don't know of any. radkdict only has 128 kanji consisting of exactly one radical, so it is also a simple matter to just enumerate all of those and manually check which ones match your criteria.
Note: I looked through the list of things that are caught by the "consisting of exactly one radical" heuristic but skipped over by the "has the same stroke count" check; it seems that '老' (radical) -> '老' (kanji) and '刈' (radical) -> '刈' (kanji) are the only ones that, for whatever reason, don't get caught by this.
Here is a CSV generated with this method.
Entering text in a file at specific locations by identifying whether a number is integer or real in Linux
I have an input like the one below:
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02
Each segment where the 2nd entry is an integer like 1 runs for thousands of lines, and then the segment where the 2nd entry is a real number like 3.58077402e+01 starts. Before anything begins I have to insert text, so the result looks like this:
*Revolved
*Gripped
*Crippled
46742 1 48276 48343 48199 48198
46744 1 48343 48344 48200 48199
46746 1 48344 48332 48201 48200
*Cracked
*Crippled
48283 3.58077402e+01 -2.97697746e+00 1.50878647e+02
48282 3.67231688e+01 -2.97771595e+00 1.50419488e+02
48285 3.58558188e+01 -1.98122787e+00 1.50894850e+02
So I need to enter specific texts at those locations. It is worth mentioning that the file is space delimited, not tab delimited, and that the text starting with * has to be at the very left of the line without spacing. The format of the rest of the file should be kept too. Any suggestions with sed or awk would be highly appreciated! The text at the beginning could be entered directly, so that is not the prime problem since it is the start of the file; the problematic part is the second bunch of lines, i.e. identifying that the second entry has turned real.
An awk with fixed strings:
awk 'BEGIN{print "*Revolved\n*Gripped\n*Crippled"} match($2,"\+")&&!pr{print "*Cracked\n*Crippled";pr=1}1' yourfile
match($2,"\+") && !pr : fires when a + character is found in the $2 field (a real number in scientific notation) and the pr flag is still unset, so the *Cracked/*Crippled block is printed only once.
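If it helps to see the same logic spelled out, here is a rough Python equivalent (my addition; the question asks for sed or awk, so treat this purely as an illustration, with "yourfile" taken from the awk command above):

print("*Revolved\n*Gripped\n*Crippled")
printed_cracked = False
with open("yourfile") as f:
    for line in f:
        fields = line.split()
        # the block of real numbers is recognised by the '+' in the exponent of field 2
        if not printed_cracked and len(fields) > 1 and "+" in fields[1]:
            print("*Cracked\n*Crippled")
            printed_cracked = True
        print(line, end="")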
Add a number to each line of a file in bash
I have some files in Linux with lines like:
2013/08/16,name1,,5000,8761,09:00,09:30
2013/08/16,name1,,5000,9763,10:00,10:30
2013/08/16,name1,,5000,8866,11:00,11:30
2013/08/16,name1,,5000,5768,12:00,12:30
2013/08/16,name1,,5000,11764,13:00,13:30
2013/08/16,name2,,5000,2765,14:00,14:30
2013/08/16,name2,,5000,4765,15:00,15:30
2013/08/16,name2,,5000,6765,16:00,16:30
2013/08/16,name2,,5000,12765,17:00,17:30
2013/08/16,name2,,5000,25665,18:00,18:30
2013/08/16,name2,,5000,45765,09:00,10:30
2013/08/17,name1,,5000,33765,10:00,11:30
2013/08/17,name1,,5000,1765,11:00,12:30
2013/08/17,name1,,5000,34765,12:00,13:30
2013/08/17,name1,,5000,12765,13:00,14:30
2013/08/17,name2,,5000,1765,14:00,15:30
2013/08/17,name2,,5000,3765,15:00,16:30
2013/08/17,name2,,5000,7765,16:00,17:30
My column separator is "," and in the third column (currently empty, hence the ,,) I need the entry number within the same day. For example, date 2013/08/16 has 11 lines and date 2013/08/17 has 7 lines, so I need to add the numbers, for example:
2013/08/16,name1,1,5000,8761,09:00,09:30
2013/08/16,name1,2,5000,9763,10:00,10:30
2013/08/16,name1,3,5000,8866,11:00,11:30
2013/08/16,name1,4,5000,5768,12:00,12:30
2013/08/16,name1,5,5000,11764,13:00,13:30
2013/08/16,name2,6,5000,2765,14:00,14:30
2013/08/16,name2,7,5000,4765,15:00,15:30
2013/08/16,name2,8,5000,6765,16:00,16:30
2013/08/16,name2,9,5000,12765,17:00,17:30
2013/08/16,name2,10,5000,25665,18:00,18:30
2013/08/16,name2,11,5000,45765,09:00,10:30
2013/08/17,name1,1,5000,33765,10:00,11:30
2013/08/17,name1,2,5000,1765,11:00,12:30
2013/08/17,name1,3,5000,34765,12:00,13:30
2013/08/17,name1,4,5000,12765,13:00,14:30
2013/08/17,name2,5,5000,1765,14:00,15:30
2013/08/17,name2,6,5000,3765,15:00,16:30
2013/08/17,name2,7,5000,7765,16:00,17:30
I need to do it in bash. How can I do it?
This one's good too:
awk -F, 'sub(/,,/, ","++a[$1]",")1' file
Output:
2013/08/16,name1,1,5000,8761,09:00,09:30
2013/08/16,name1,2,5000,9763,10:00,10:30
2013/08/16,name1,3,5000,8866,11:00,11:30
2013/08/16,name1,4,5000,5768,12:00,12:30
2013/08/16,name1,5,5000,11764,13:00,13:30
2013/08/16,name2,6,5000,2765,14:00,14:30
2013/08/16,name2,7,5000,4765,15:00,15:30
2013/08/16,name2,8,5000,6765,16:00,16:30
2013/08/16,name2,9,5000,12765,17:00,17:30
2013/08/16,name2,10,5000,25665,18:00,18:30
2013/08/16,name2,11,5000,45765,09:00,10:30
2013/08/17,name1,1,5000,33765,10:00,11:30
2013/08/17,name1,2,5000,1765,11:00,12:30
2013/08/17,name1,3,5000,34765,12:00,13:30
2013/08/17,name1,4,5000,12765,13:00,14:30
2013/08/17,name2,5,5000,1765,14:00,15:30
2013/08/17,name2,6,5000,3765,15:00,16:30
2013/08/17,name2,7,5000,7765,16:00,17:30
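For readers who prefer to see the counting logic written out, a rough Python equivalent (my addition; the question asks for bash/awk, so this is illustration only, with "file" taken from the awk command above):

from collections import defaultdict

counts = defaultdict(int)
with open("file") as f:
    for line in f:
        fields = line.rstrip("\n").split(",")
        counts[fields[0]] += 1              # running count per date (first column)
        fields[2] = str(counts[fields[0]])  # fill the empty third column
        print(",".join(fields))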