Special characters not getting loaded in Oracle 11g DB using sqlldr - Linux
I am trying to load a file that contains special characters into an Oracle DB using SQL*Loader. The data looks like this:
5200504,07ED,,Ãresutjämning,,,Occurrences,,,0.25,,,,,TRUE,100,Ãresutjämning,,,Ãresutjämning,,SEK,,0.00,,,
5200959,07ED,,Ãresutjämning,,,Occurrences,,,0.15,,,,,TRUE,100,Ãresutjämning,,,Ãresutjämning,,SEK,,0.00,,,
5201122,07ED,,Ãresutjämning,,,Occurrences,,,-0.42,,,,,TRUE,100,Ãresutjämning,,,Ãresutjämning,,SEK,,0.00,,,
5203421,07ED,,Ãresutjämning,,,Occurrences,,,-0.05,,,,,TRUE,100,Ãresutjämning,,,Ãresutjämning,,SEK,,0.00,,,
As per my analysis, I have already checked the source file encoding, which is UTF-8. NLS_LANG on the Linux machine is American_america.UTF8.
The Oracle DB NLS settings are as follows:
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CURRENCY $
NLS_ISO_CURRENCY AMERICA
NLS_NUMERIC_CHARACTERS .,
NLS_CALENDAR GREGORIAN
NLS_DATE_FORMAT DD-MON-RR
NLS_DATE_LANGUAGE AMERICAN
NLS_CHARACTERSET AL32UTF8
NLS_SORT BINARY
NLS_TIME_FORMAT HH.MI.SSXFF AM
NLS_TIMESTAMP_FORMAT DD-MON-RR HH.MI.SSXFF AM
NLS_TIME_TZ_FORMAT HH.MI.SSXFF AM TZR
NLS_TIMESTAMP_TZ_FORMAT DD-MON-RR HH.MI.SSXFF AM TZR
NLS_DUAL_CURRENCY $
NLS_NCHAR_CHARACTERSET AL16UTF16
NLS_COMP BINARY
NLS_LENGTH_SEMANTICS BYTE
NLS_NCHAR_CONV_EXCP FALSE
After the load, the special-character string Ãresutjämning is changed to Öresutjämning.
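To narrow down whether the source file really is UTF-8 (rather than Latin-1 already containing the garbled text), the raw bytes of one affected field can be inspected before loading. This is only a diagnostic sketch; the data file name bill_items.dat is a placeholder:

import os

# Read one line as raw bytes so no decoding is applied yet
with open('bill_items.dat', 'rb') as f:
    raw = f.readline()

print(raw)                       # e.g. b'...\xc3\x96resutj\xc3\xa4mning...' would be UTF-8 "Ö"/"ä"
print(raw.decode('utf-8'))       # what SQL*Loader sees with CHARACTERSET UTF8
                                 # (a UnicodeDecodeError here would mean the file is not valid UTF-8)
print(raw.decode('iso-8859-1'))  # what the same bytes look like when interpreted as Latin-1

If the UTF-8 decode shows the expected Swedish text while the Latin-1 decode shows the "Ã..."-style garbage, the file and the CHARACTERSET UTF8 clause agree, and the mismatch is more likely in whatever client or terminal is used to display the data before or after the load.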
The load.ctl file is as follows:
load data
CHARACTERSET UTF8
infile *
into table SE_TDC_1_BILL_ITEM
TRUNCATE
fields terminated by ","
(BILL_NUMBER,SUBSCRIBER_ID,USAGE_DATE_TIME,USAGE_TYPE,CALLING_NUMBER,CALLED_NUMBER,METRIC,VOLUME_,DURATION_,CHARGE_OPCO_C,SOURCE_COUNTRY,TARGET_COUNTRY,SOURCE_NETWORK,TARGET_NETWORK,CHARGEALE,PCTG_CHARGED,PRODUCT,ROAM_TYPE,MT_MO_INDICATOR,TARIFF_TYPE,CONTRACT_ID,CHARGE_CURRENCY,ITEM_TAX_CODE,ITEM_TAX_RATE,SUBSCRIBER_EMAIL_ADDRESS)
BEGINDATA
5200504,07ED,,Ãresutjämning,,,Occurrences,,,0.25,,,,,TRUE,100,Ãresutjämning,,,Ãresutjämning,,SEK,,0.00,,,
5200959,07ED,,Ãresutjämning,,,Occurrences,,,0.15,,,,,TRUE,100,Ãresutjämning,,,Ãresutjämning,,SEK,,0.00,,,
5201122,07ED,,Ãresutjämning,,,Occurrences,,,-0.42,,,,,TRUE,100,Ãresutjämning,,,Ãresutjämning,,SEK,,0.00,,,
5203421,07ED,,Ãresutjämning,,,Occurrences,,,-0.05,,,,,TRUE,100,Ãresutjämning,,,Ãresutjämning,,SEK,,0.00,,,
Please assist.
Related
Escaping a space in a properties file for WildFly with Oracle DB
I am having a hard time using an environment variable with a space in a properties file read by WildFly (24) on Linux, using Oracle 19 in RDS. One like:

SELECT 1 FROM DUAL

The issue is that WildFly won't even parse the file if the spaces are in there with the normal quoting methods. I have it set up so that the variable is in a file called datasource.properties that gets read from standalone.conf, where this variable sits:

JAVA_OPTS="$JAVA_OPTS -DDATABASE_CONNECTION_CHECK=${DATABASE_CONNECTION_CHECK}"

It's read in with the following in standalone.conf:

set -a
. /opt/wildfly_config/datasource.properties
set +a

That in turn gets populated in standalone.xml with:

<connection-url>${env.DATABASE_JDBC_URL}</connection-url>

I tried putting it in quotes and, oddly enough, it doesn't start at all. standalone.sh is no longer able to parse it:

Error: Could not find or load main class 1
Caused by: java.lang.ClassNotFoundException: 1

I have tried many things such as:

DATABASE_CONNECTION_CHECK="SELECT{ }1{ }FROM{ }DUAL"
DATABASE_CONNECTION_CHECK="'SELECT 1 FROM DUAL'"
DATABASE_CONNECTION_CHECK='SELECT 1 FROM DUAL'
DATABASE_CONNECTION_CHECK="SELECT+1+FROM+DUAL"
DATABASE_CONNECTION_CHECK="SELECT\ 1\ FROM\ DUAL"
DATABASE_CONNECTION_CHECK="\"SELECT 1 FROM DUAL\""
DATABASE_CONNECTION_CHECK="\"'SELECT 1 FROM DUAL'\""
DATABASE_CONNECTION_CHECK="SELECT%201%20FROM%20DUAL"
DATABASE_CONNECTION_CHECK="SELECT\{ }1\{ }FROM\{ }DUAL"
DATABASE_CONNECTION_CHECK='SELECT{ }1{ }FROM{ }DUAL'
DATABASE_CONNECTION_CHECK="'SELECT{ }1{ }FROM{ }DUAL'"
DATABASE_CONNECTION_CHECK="''SELECT{ }1{ }FROM{ }DUAL''"
DATABASE_CONNECTION_CHECK="SELECT%1%FROM%DUAL"

(I realize some of these don't make sense, but I was looking for anything different.) Startup looks good in the log output with some of these, but then Java doesn't like it; for some reason it sees the escape usage:

Caused by: Error : 936, Position : 9, Sql = SELECT+1+FROM+DUAL, OriginalSql = SELECT+1+FROM+DUAL, Error Msg = ORA-00936: missing expression

or

Caused by: Error : 911, Position : 6, Sql = SELECT%1%FROM%DUAL, OriginalSql = SELECT%1%FROM%DUAL, Error Msg = ORA-00911: invalid character

or

WARN [org.jboss.jca.adapters.jdbc.local.LocalManagedConnectionFactory] (ServerService Thread Pool -- 46) IJ030027: Destroying connection that is not valid, due to the following exception: oracle.jdbc.driver.T4CConnection@2c1456f8: java.sql.SQLException: Non supported SQL92 token at position: 7

This last one is the only one that really netted anything different. I got that with:

DATABASE_CONNECTION_CHECK="SELECT{}1{}FROM{}DUAL"

I can use sed to change the value in standalone.xml, but all of the other properties I handle this way work fine, with the exception of this one. I had a hard time with a semicolon in the JDBC string with MSSQL, and putting the semicolon in braces like "{;}" fixed that. This DB apparently does not follow the same syntax. Is there an encoding type that will help this with Oracle and keeps WildFly happy?

EDIT: More tests:

DATABASE_CONNECTION_CHECK=\"SELECT' '1' 'FROM' 'DUAL\"

gets

Caused by: Error : 900, Position : 0, Sql = "SELECT 1 FROM DUAL", OriginalSql = "SELECT 1 FROM DUAL", Error Msg = ORA-00900: invalid SQL statement

(it doesn't seem to like the quotes)

But without the escaping of the quotes I get:

Caused by: Error : 923, Position : 9, Sql = SELECT' '1' 'FROM' 'DUAL, OriginalSql = SELECT' '1' 'FROM' 'DUAL, Error Msg = ORA-00923: FROM keyword not found where expected
A better solution was to change the sourcing of the file from:

set +a
. /opt/PrimeKey/wildfly_config/datasource.properties
set -a

to

. /opt/PrimeKey/wildfly_config/datasource.properties

and make it so all the values brought in were variables and not properties:

export DATABASE_CONNECTION_CHECK="SELECT 1 FROM DUAL"
How to delete last n characters of .txt file without having to re-write all the other characters [duplicate]
After looking all over the Internet, I've come to this. Let's say I have already made a text file that reads:

Hello World

Well, I want to remove the very last character (in this case d) from this text file. So now the text file should look like this:

Hello Worl

But I have no idea how to do this. All I want, more or less, is a single backspace function for text files on my HDD. This needs to work on Linux, as that's what I'm using.
Use fileobject.seek() to seek 1 position from the end, then use file.truncate() to remove the remainder of the file:

import os

with open(filename, 'rb+') as filehandle:
    filehandle.seek(-1, os.SEEK_END)
    filehandle.truncate()

This works fine for single-byte encodings. If you have a multi-byte encoding (such as UTF-16 or UTF-32) you need to seek back enough bytes from the end to account for a single codepoint.

For variable-byte encodings, it depends on the codec whether you can use this technique at all. For UTF-8, you need to find the first byte (scanning backwards from the end) for which bytevalue & 0xC0 != 0x80 is true, and truncate from that point on. That ensures you don't truncate in the middle of a multi-byte UTF-8 codepoint:

with open(filename, 'rb+') as filehandle:
    # move to the last byte, then scan backwards until a non-continuation byte is found
    filehandle.seek(-1, os.SEEK_END)
    while ord(filehandle.read(1)) & 0xC0 == 0x80:
        # we just read 1 byte, which moved the file position forward,
        # so skip back 2 bytes to move to the byte before the current one.
        filehandle.seek(-2, os.SEEK_CUR)
    # the last byte read is our truncation point; move back onto it.
    filehandle.seek(-1, os.SEEK_CUR)
    filehandle.truncate()

Note that UTF-8 is a superset of ASCII, so the above works for ASCII-encoded files too.
The accepted answer from Martijn is simple and kind of works, but does not account for text files with UTF-8 encoding containing non-English characters (which is the default encoding for text files in Python 3), or with one newline character at the end of the file (which is the default in Linux editors like vim or gedit).

If the text file contains non-English characters, neither of the answers provided so far would work. What follows is an example that solves both problems, and also allows removing more than one character from the end of the file:

import os

def truncate_utf8_chars(filename, count, ignore_newlines=True):
    """
    Truncates last `count` characters of a text file encoded in UTF-8.
    :param filename: The path to the text file to read
    :param count: Number of UTF-8 characters to remove from the end of the file
    :param ignore_newlines: Set to true, if the newline character at the end of the file should be ignored
    """
    with open(filename, 'rb+') as f:
        last_char = None
        size = os.fstat(f.fileno()).st_size
        offset = 1
        chars = 0
        while offset <= size:
            f.seek(-offset, os.SEEK_END)
            b = ord(f.read(1))
            if ignore_newlines:
                if b == 0x0D or b == 0x0A:
                    offset += 1
                    continue
            if b & 0b10000000 == 0 or b & 0b11000000 == 0b11000000:
                # This is the first byte of a UTF8 character
                chars += 1
                if chars == count:
                    # When `count` characters have been found, move the current position back
                    # by one byte (to include the byte just checked) and truncate the file
                    f.seek(-1, os.SEEK_CUR)
                    f.truncate()
                    return
            offset += 1

How it works:
Reads only the last few bytes of a UTF-8 encoded text file in binary mode.
Iterates the bytes backwards, looking for the start of a UTF-8 character.
Once a character (different from a newline) is found, it truncates at that point, treating it as the last character in the text file.

Sample text file - bg.txt:

Здравей свят

How to use:

filename = 'bg.txt'
print('Before truncate:', open(filename).read())
truncate_utf8_chars(filename, 1)
print('After truncate:', open(filename).read())

Outputs:

Before truncate: Здравей свят
After truncate: Здравей свя

This works with both UTF-8 and ASCII encoded files.
In case you are not reading the file in binary mode and only have 'w' access, I can suggest the following:

f.seek(f.tell() - 1, os.SEEK_SET)
f.write('')

In the code above, f.seek() will only accept f.tell() because you do not have 'b' access. You can then set the cursor to the start of the last element, and delete the last element by writing an empty string.
with open(urfile, 'rb+') as f:
    f.seek(0, 2)           # end of file
    size = f.tell()        # the size...
    f.truncate(size - 1)   # truncate at that size - however many characters

Be sure to use binary mode on Windows, since Unix-style line endings may otherwise return an illegal or incorrect character count.
with open('file.txt', 'r+') as f:      # 'r+' rather than 'w', so the existing contents are preserved
    f.seek(0, 2)                       # seek to end of file; f.seek(0, os.SEEK_END) is legal
    f.seek(f.tell() - 2, 0)            # seek to the second last char of file; f.seek(f.tell()-2, os.SEEK_SET) is legal
    f.truncate()

This is subject to what the last character of the file is; it could be a newline (\n) or anything else.
This may not be optimal, but if the above approaches don't work out, you could do:

with open('myfile.txt', 'r') as file:
    data = file.read()[:-1]
with open('myfile.txt', 'w') as file:
    file.write(data)

The code first opens the file and copies its content (except the last character) into the string data. Afterwards, the file is truncated to zero length (i.e. emptied), and the content of data is saved to the file, under the same name. This is basically the same as vins ms's answer, except that it doesn't use the os package and uses the safer 'with open' syntax. This may not be recommended if the text file is huge. (I wrote this since none of the above approaches worked out too well for me in Python 3.8.)
Here is a dirty way (erase & recreate)... I don't advise using this, but it's possible to do it like this:

import os

x = open("file").read()
os.remove("file")
open("file", "w").write(x[:-1])
On a Linux system (or in Cygwin under Windows) you can use the standard truncate command. You can reduce or increase the size of your file with this command.

In order to reduce a file by 1G the command would be truncate -s -1G filename. In the following example I reduce a file called update.iso by 1G. Note that this operation took less than five seconds.

chris@SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
  File: update.iso
  Size: 30802968576  Blocks: 30081024  IO Block: 65536  regular file
Device: ee6ddbceh/4000177102d  Inode: 19421773395035112  Links: 1
Access: (0664/-rw-rw-r--)  Uid: (1052727/ chris)  Gid: (1049089/Domain Users)
Access: 2020-06-12 07:39:00.572940600 -0400
Modify: 2020-06-12 07:39:00.572940600 -0400
Change: 2020-06-12 07:39:00.572940600 -0400
 Birth: 2020-06-11 13:31:21.170568000 -0400

chris@SR-ENG-P18 /cygdrive/c/Projects
$ truncate -s -1G update.iso

chris@SR-ENG-P18 /cygdrive/c/Projects
$ stat update.iso
  File: update.iso
  Size: 29729226752  Blocks: 29032448  IO Block: 65536  regular file
Device: ee6ddbceh/4000177102d  Inode: 19421773395035112  Links: 1
Access: (0664/-rw-rw-r--)  Uid: (1052727/ chris)  Gid: (1049089/Domain Users)
Access: 2020-06-12 07:42:38.335782800 -0400
Modify: 2020-06-12 07:42:38.335782800 -0400
Change: 2020-06-12 07:42:38.335782800 -0400
 Birth: 2020-06-11 13:31:21.170568000 -0400

The stat command tells you lots of info about a file, including its size.
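For the original question (dropping just the last byte), the same idea is available from Python without shelling out, since os.truncate() sets a file to an exact size. A minimal sketch, assuming a hypothetical single-byte-encoded file named notes.txt:

import os

path = "notes.txt"                            # hypothetical file name
os.truncate(path, os.path.getsize(path) - 1)  # drop exactly one byte from the end

Because this works purely in bytes, a multi-byte encoding such as UTF-8 would first require finding the codepoint boundary, as in the earlier answers.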
Loading .xls file with Pig Latin
I have a .xls file with a thousand rows with the following structure:

id   | number | date   | description
1232 | 41515  | 3/9/16 | amazing

I'm trying to load it, skipping the first header row and without the date column (so just id, number, description; I haven't yet found how to do that), using Pig with the following script:

REGISTER /usr/hdp/current/pig-client/lib/piggybank.jar

data = LOAD '/user/maria_dev/file.xls' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (Id:chararray,case_number:chararray,date:chararray,block:chararray,iucr:chararray);
data_sample = LIMIT data 10;
DUMP data_sample;

but I'm getting a weird result from the dump, with lines such as:

( � � � � � �,,,,)

Thanks for your help
There is no direct way of loading .xls files into HDFS using CSVExcelStorage. You will have to save the .xls file as a .csv file and then use CSVExcelStorage to load it. Also note that you have 4 fields but your schema has 5 fields.

data = LOAD '/user/maria_dev/file.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (Id:chararray,case_number:chararray,date:chararray,desc:chararray);
data_sample = LIMIT data 10;
DUMP data_sample;
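If converting the spreadsheet by hand is inconvenient, the .xls-to-.csv step can also be scripted. This is only a sketch; it assumes pandas is available together with an Excel reader engine such as xlrd or openpyxl, and reuses the file names from the question:

import pandas as pd

# Read the first sheet of the workbook and write it out as comma-separated text.
df = pd.read_excel("file.xls")
df.to_csv("file.csv", index=False)

The resulting file.csv can then be loaded with the CSVExcelStorage statement above.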
Why does ncl not find netcdf files?
I use ncl-ncarg 6.1.2-7 from Trusty under Ubuntu 14.04. I created a soft link from usr/share/ncarg to usr/lib and set the environment and path with:

export NCARG_ROOT="/usr"
export PATH=$NCARG_ROOT/bin:$PATH

I have a simple_plot_pr.ncl which creates a panel plot from 3 NetCDF files:

load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/gsn_code.ncl"
load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/gsn_csm.ncl"
load "$NCARG_ROOT/lib/ncarg/nclscripts/csm/contributed.ncl"

begin
;-- read data and set variable references
  f1 = addfile("home/robert/Dokumenty/climatological monthly mean pr_1971-2000.nc","r")
  f2 = addfile("home/robert/Dokumenty/climatological monthly mean pr_2021-2050.nc","r")
  f3 = addfile("home/robert/Dokumenty/climatological monthly mean pr_2071-2100.nc","r")

  pr1 = f1->pr
  pr2 = f2->pr
  pr3 = f3->pr

;-- open a PNG file
  wks = gsn_open_wks("png","panel_plot")

;-- create plot array
  plot = new(3,graphic)

;-- set resources for contour plots
  res               = True
  res@gsnMaximize   = True
  res@cnFillOn      = True
  res@tiMainString  = "Climatological mean monthly precipitation amount"
  gsn_define_colormap(wks,"rainbow")

  plot(0) = gsn_csm_colormap(wks,pr1(:,:),res)
  res@tiMainString = ""
  plot(1) = gsn_csm_colormap(wks,pr2(:,:),res)
  res@tiMainString = ""
  plot(3) = gsn_csm_colormap(wks,pr3(:,:),res)

;-- create panel plot
  gsn_panel(wks,plot,(/3,1/),False)
end

When I run this .ncl file I get the following error messages:

Copyright (C) 1995-2013 - All Rights Reserved
University Corporation for Atmospheric Research
NCAR Command Language Version 6.1.2
The use of this software is governed by a License Agreement.
See http://www.ncl.ucar.edu/ for more details.

fatal:["FileSupport.c":2761]:_NclFindFileExt: Requested file <home/Dokumenty/climatological monthly mean pr_1971-2000.nc> or <home/Dokumenty/climatological monthly mean pr_1971-2000> does not exist
fatal:["FileSupport.c":3106]:(home/Dokumenty/climatological monthly mean pr_1971-2000.nc) has no file extension, can't determine type of file to open
fatal:["FileSupport.c":2761]:_NclFindFileExt: Requested file <home/robert/Dokumenty/climatological monthly mean pr_2021-2050.nc> or <home/robert/Dokumenty/climatological monthly mean pr_2021-2050> does not exist
fatal:["FileSupport.c":3106]:(home/robert/Dokumenty/climatological monthly mean pr_2021-2050.nc) has no file extension, can't determine type of file to open
fatal:["FileSupport.c":2761]:_NclFindFileExt: Requested file <home/robert/Dokumenty/climatological monthly mean pr_2071-2100.nc> or <home/robert/Dokumenty/climatological monthly mean pr_2071-2100> does not exist
fatal:["FileSupport.c":3106]:(home/robert/Dokumenty/climatological monthly mean pr_2071-2100.nc) has no file extension, can't determine type of file to open
fatal:file (f1) isn't defined
fatal:["Execute.c":8128]:Execute: Error occurred at or near line 11 in file simple_plot_pr.ncl

I checked these files and they exist. I do not understand why NCL does not find them. Can someone give me a suggestion to solve this issue?
The error is definitely arising due to the spaces in the file names. Please rename the files so the names contain no spaces.
You can also put a backslash before each space in the file name. For example, instead of writing "climatological monthly mean pr_1971-2000.nc", you can write the file name in the following format: "climatological\ monthly\ mean\ pr_1971-2000.nc"
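If there are many such files, the renaming can be scripted rather than done by hand. A minimal sketch, assuming the NetCDF files live in /home/robert/Dokumenty (the directory used in the question) and that replacing spaces with underscores is acceptable:

import os

directory = "/home/robert/Dokumenty"
for name in os.listdir(directory):
    if name.endswith(".nc") and " " in name:
        new_name = name.replace(" ", "_")
        os.rename(os.path.join(directory, name),
                  os.path.join(directory, new_name))
        print(name, "->", new_name)

The addfile() paths in the NCL script would then need to be updated to match the new names.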
Can anyone tell me what the encoding of this string is? It's meant to be base64
cpxSR2bnPUihaNxIFFA8Sc+8gUnWuJxJi8ywSW5ju0npWrFJHW2MSZAeMklcZ71IjrBySF2ci0gdecRI0vD/SM4ZF0m1ZSJJBY8bSZJl/0intaxIlQJBSPdY3EdBLM9Hp4wLSOK8Nki8L1pIoglxSAvNbkjHg0VIDlv7R6B2Y0elCqVGFWuVRgagAkdxHTdHELxRR9i2VkdyEUlHU84kRzTS2kalKFxG

This is a string from an XML file from my mass spectrometer. I am trying to write a program to load two such files, subtract one set of values from another, and write the results to a new file.

According to the specification for the .mzML format, the encoding of the numerical data is alleged to be base64. I can't convert this data string to anything legible using any of the many online base64 converters, or using Notepad++ and the MIME toolkit's base64 converter.

The string, in the context of the results file, looks like this:

<binaryDataArray encodedLength="224">
  <cvParam cvRef="MS" accession="MS:1000515" name="intensity array" unitAccession="MS:1000131" unitName="number of counts" unitCvRef="MS"/>
  <cvParam cvRef="MS" accession="MS:1000521" name="32-bit float" />
  <cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
  <binary>cpxSR2bnPUihaNxIFFA8Sc+8gUnWuJxJi8ywSW5ju0npWrFJHW2MSZAeMklcZ71IjrBySF2ci0gdecRI0vD/SM4ZF0m1ZSJJBY8bSZJl/0intaxIlQJBSPdY3EdBLM9Hp4wLSOK8Nki8L1pIoglxSAvNbkjHg0VIDlv7R6B2Y0elCqVGFWuVRgagAkdxHTdHELxRR9i2VkdyEUlHU84kRzTS2kalKFxG</binary>

I can't proceed until I can work out what format this encoding is meant to be! Thanks in advance for any replies.
You can use this trivial program to convert it to plaintext:

#include <stdio.h>

int main(void)
{
    float f;
    while (fread(&f, 1, 4, stdin) == 4)
        printf("%f\n", f);
}

I compiled this to "floatdecode" and used this command:

echo "cpxSR2bnPUihaNxIFFA8Sc+8gUnWuJxJi8ywSW5ju0npWrFJHW2MSZAeMklcZ71IjrBySF2ci0gdecRI0vD/SM4ZF0m1ZSJJBY8bSZJl/0intaxIlQJBSPdY3EdBLM9Hp4wLSOK8Nki8L1pIoglxSAvNbkjHg0VIDlv7R6B2Y0elCqVGFWuVRgagAkdxHTdHELxRR9i2VkdyEUlHU84kRzTS2kalKFxG" | base64 -d | ./floatdecode

Output is:

53916.445312
194461.593750
451397.031250
771329.250000
1062809.875000
1283866.750000
1448337.375000
1535085.750000
1452893.125000
1150371.625000
729577.000000
387898.875000
248514.218750
285922.906250
402376.906250
524166.562500
618908.875000
665179.312500
637168.312500
523052.562500
353709.218750
197642.328125
112817.929688
106072.507812
142898.609375
187123.531250
223422.937500
246822.531250
244532.171875
202255.109375
128694.109375
58230.625000
21125.322266
19125.541016
33440.023438
46877.441406
53692.062500
54966.843750
51473.445312
42190.324219
28009.101562
14090.161133
Yet another option is a Java Base64 decode, with options to uncompress should you need it. The vendor spec indicated "32-bit float" = IEEE-754 and specified little-endian. Schmidt's converter shows the bit pattern for IEEE-754.

One more Notepad++ step, using the TextFX plugin (after the Base64 decode you already did), lets you look at the hex codes:

select the text
TextFX > TextFX Convert > Convert text to Hex-32

"000000000 72 9C 52 47 66 E7 3D 48- ... 6E 63 BB 49 |rœRGfç=H¡hÜHP

Little-endian: 47529C72 converts (via Schmidt) as shown above by David.
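For completeness, the same decode can be done in a few lines of Python. This is a sketch assuming, as the cvParam elements state, uncompressed little-endian 32-bit floats:

import base64
import struct

encoded = "cpxSR2bnPUihaNxIFFA8Sc+8gUnWuJxJi8ywSW5ju0npWrFJHW2MSZAeMklcZ71IjrBySF2ci0gdecRI0vD/SM4ZF0m1ZSJJBY8bSZJl/0intaxIlQJBSPdY3EdBLM9Hp4wLSOK8Nki8L1pIoglxSAvNbkjHg0VIDlv7R6B2Y0elCqVGFWuVRgagAkdxHTdHELxRR9i2VkdyEUlHU84kRzTS2kalKFxG"

raw = base64.b64decode(encoded)
# '<' = little-endian, 'f' = 32-bit IEEE-754 float, one value per 4 bytes
values = struct.unpack('<' + 'f' * (len(raw) // 4), raw)
for v in values:
    print(v)

The output should match the intensity values listed in the answer above.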
You can access such data from mzML files in Python through pymzML, a Python interface to mzML files: http://pymzml.github.com/