SQL Server table export issue using the BCP utility when the table contains "HTML ELEMENTS" - Excel

I have been using the bcp utility to export a SQL Server database table (whose cells contain HTML elements) using the command below:
C:\>bcp "select * from dbName.dbo.TableName" queryout c:\bcpexport.xls -c -k -SServerName -U sa -P 111
The export itself succeeds, but the rows are messed up whenever a column contains HTML tags/elements. This is a serious problem, since it produces errors when I import the resulting Excel file into my MySQL database.
Below is a screenshot of the Excel file with the misaligned rows/columns.
Any help/support is highly appreciated.

You may want to try adding the Unicode character format flag, -w, to your bcp command. It should help with character-encoding issues, even if your data types are not Unicode:
If the source and destination data are not Unicode data types, use of Unicode character format minimizes the loss of extended characters in the source data that cannot be represented at the destination.
http://technet.microsoft.com/en-us/library/ms188289.aspx
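For example, a minimal variant of the question's own command with -c swapped for -w (the two flags are alternatives; everything else unchanged) would be:

C:\>bcp "select * from dbName.dbo.TableName" queryout c:\bcpexport.xls -w -k -SServerName -U sa -P 111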

Related

Can't receive text from PostgreSQL into Excel via ODBC due to character coding problem (UTF-8 vs Win1250)

My basic scenario is that I want to make an Excel report with data from a PostgreSQL DB.
I get the data via ODBC, creating a simple linked table with Power Query.
For the DSN I choose (None), then I write the connection string and the SQL statement. Generally it works fine, but with one column it doesn't. I receive the following error message:
ODBC: ERROR [22P05] ERROR: character with byte sequence 0xc2 0xb2 in encoding "UTF8" has no equivalent in encoding "WIN1250";Error while executing the query
So that much is clear: the source is UTF-8 and contains characters that are not representable in Win1250.
What I am looking for is a general solution, either on the DB side or the Excel side.
The SQL statement used is a simple SELECT * FROM [view], so I can use any replacement or conversion, anything that lets me handle it with transformations on the column. I can replace the view with a function if that is better.
But it would be better if you can suggest an Excel-side solution.
There is one criterion: a scenario like "first get the data as text, then convert it to Win1250, then import it into Excel" won't fit. I need something that connects to the Excel file itself, so that if I move it to another PC it still works without any further modification.
Thanks for all the help!
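One note on the error itself: 0xc2 0xb2 is the UTF-8 encoding of the superscript-two character '²', which indeed has no Win1250 equivalent. In the spirit of the asker's "any replacement" remark, a DB-side sketch (view and column names hypothetical) could wrap the view and substitute a Win1250-safe character before the ODBC driver performs the client-encoding conversion:

-- Hypothetical wrapper view: replace U+00B2 ('²'), which Win1250 cannot
-- represent, with a plain '2' before the client-encoding conversion.
CREATE VIEW report_view_win1250 AS
SELECT replace(problem_column, '²', '2') AS problem_column
FROM report_view;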

Output json file on some fields without filtering data with Shodan?

I've downloaded some JSON data from Shodan and only want to retain some fields from it. To explore what I want, I'm running the following, which works:
shodan parse --fields ip,port --separator , "data.json.gz"
However, I now want to output/export the data; I'm trying to run the following:
shodan parse --fields ip,port -O "data_processed.json.gz" "data.json.gz"
It requires me to specify a filter parameter, which I don't need. And if I add an empty filter like so, it tells me data_processed.json.gz doesn't exist:
shodan parse --fields ip,port -f -O "data_processed.json.gz" "data.json.gz"
I'm a bit stumped on how to export only certain fields of my data; how do I go about doing so?
If you only want to output those 2 properties then you can simply pipe them to a file:
shodan parse --fields ip,port --separator , data.json.gz > data_processed.csv
A few things to keep in mind:
You probably want to export the ip_str property as it's a more user-friendly version of the IP address. The ip property is a numeric version of the IP address and aimed at users storing the information in a database.
You can convert your data file into Excel or CSV format using the shodan convert command, for example: shodan convert data.json.gz csv. See here for a quick guide: https://help.shodan.io/guides/how-to-convert-to-excel
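Combining the two notes above, a sketch of the same pipe that exports the friendlier ip_str property instead would be:

shodan parse --fields ip_str,port --separator , data.json.gz > data_processed.csv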

PDF Data and Table Scraping to Excel

I'm trying to figure out a good way to increase the productivity of my data entry job.
What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel.
More specifically the data I am working with is from grocery store flyers. As it stands now we have to manually enter every deal in the flyer into a database. A sample of a flyer is http://weeklyspecials.safeway.com/customer_Frame.jsp?drpStoreID=1551
What I am hoping to do is have columns for products, price, and predefined options (Loyalty Cards, Coupons, Select Variety... that sort of thing).
Any help would be appreciated, and if I need to be more specific let me know.
After looking at the specific PDF linked to by the OP, I have to say that this is not quite displaying a typical table format.
It contains many images inside the "cells", but the cells are not all strictly vertically or horizontally aligned.
So this isn't even a 'nice' table, but an extremely ugly and awkward one to work with...
Having said that, I'll have to add:
Extracting even 'nice' tables from PDFs in general is extremely difficult...
Standard PDFs do not provide any hints about the semantics of what they draw on a page:
the only distinction the syntax provides is between vector elements (lines, fills, ...), images, and text.
Whether a given character is part of a table, part of a line, or just a lonely single character in an otherwise empty area is not easy to recognize programmatically by parsing the PDF source code.
For a background about why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
Why Updating Dollars for Docs Was So Difficult (ProPublica-Website)
...but doing so with TabulaPDF works very well!
Having said the above now let me add this:
For an amazing open-source family of tools that gets better and better from week to week at extracting tabular data from PDFs (unless they are scanned pages) -- contradicting what I said in my introductory paragraphs! -- check out TabulaPDF. See these links:
Introducing Tabula: Upload a PDF, get back tabular CSV data. Poof!
Tabula-Extractor: A Command Line Interface to Tabula
Tabula source code repository
Tabula API (upcoming, not ready yet)
Tabula-Extractor is written in Ruby.
In the background it makes use of PDFBox (which is written in Java) and a few other third-party libs.
To run, Tabula-Extractor requires JRuby-1.7 installed.
Installing Tabula-Extractor
I'm using the 'bleeding-edge' version of Tabula-Extractor directly from its GitHub source code repository.
Getting it to work was extremely easy, since on my system JRuby-1.7.4_0 is already present:
mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
Included in this Git clone will already be the required libraries, so no need to install PDFBox.
The command line tool is in the /bin/ subdirectory.
Exploring the command line options:
~/svn-stuff/git.tabula-extractor/bin/tabula -h
Tabula helps you extract tables from PDFs
Usage:
tabula [options] <pdf_file>
where [options] are:
--pages, -p <s>: Comma separated list of ranges, or all. Examples:
--pages 1-3,5-7, --pages 3 or --pages all. Default
is --pages 1 (default: 1)
--area, -a <s>: Portion of the page to analyze
(top,left,bottom,right). Example: --area
269.875,12.75,790.5,561. Default is entire page
--columns, -c <s>: X coordinates of column boundaries. Example
--columns 10.1,20.2,30.3
--password, -s <s>: Password to decrypt document. Default is empty
(default: )
--guess, -g: Guess the portion of the page to analyze per page.
--debug, -d: Print detected table areas instead of processing.
--format, -f <s>: Output format (CSV,TSV,HTML,JSON) (default: CSV)
--outfile, -o <s>: Write output to <file> instead of STDOUT (default:
-)
--spreadsheet, -r: Force PDF to be extracted using spreadsheet-style
extraction (if there are ruling lines separating
each cell, as in a PDF of an Excel spreadsheet)
--no-spreadsheet, -n: Force PDF not to be extracted using
spreadsheet-style extraction (if there are ruling
lines separating each cell, as in a PDF of an Excel
spreadsheet)
--silent, -i: Suppress all stderr output.
--use-line-returns, -u: Use embedded line returns in cells. (Only in
spreadsheet mode.)
--version, -v: Print version and exit
--help, -h: Show this message
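As a quick illustration of the options above, a sketch that extracts all pages and writes the CSV to a file instead of STDOUT (output path hypothetical) could look like:

~/svn-stuff/git.tabula-extractor/bin/tabula \
   -p all -g -f CSV -o /tmp/extracted-tables.csv \
   ~/Downloads/pdfs/PDF32000_2008.pdf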
Extracting the table which the OP wants
I'm not even going to try to extract this ugly table from the OP's monster PDF. I'll leave it as an exercise for those readers who are feeling adventurous enough...
Instead, I'll demo how to extract a 'nice' table: pages 651-653 from the official PDF-1.7 specification.
I used this command:
~/svn-stuff/git.tabula-extractor/bin/tabula \
-p 651,652,653 -g -n -u -f CSV \
~/Downloads/pdfs/PDF32000_2008.pdf
After importing the generated CSV into LibreOffice Calc, the result looks to me like a perfect extraction of a table that spread over 3 different PDF pages. (Even the newlines used within table cells made it into the spreadsheet.)
Update
Here is an asciinema screencast (which you can also download and re-play locally in your Linux/Mac OS X/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor.

GitLab issues do not work with Chinese

When I create an issue, I enter the Title and Details in Chinese.
But it does not work.
[Screenshots: form input, and the result]
The documentation "Setup Database" does mention
# Create the GitLab production database
mysql> CREATE DATABASE IF NOT EXISTS `gitlabhq_production` DEFAULT CHARACTER SET `utf8` COLLATE `utf8_unicode_ci`;
So it could be possible this charset is missing in your database.
Johannes Schleifenbaum mentions in your issue 4620:
Are your database and tables (in this case issues) created with utf8 character-set/collation? I had the exact same issue.
$: mysql -ugitlab -p gitlabhq_production
mysql> SHOW FULL COLUMNS FROM issues;
mysql> SHOW VARIABLES LIKE "character_set_database";
mysql> SHOW VARIABLES LIKE "collation_database";
The blog post "Converting Character sets in MySQL to UTF8" proposes different options, including:
mysql> ALTER DATABASE gitlabhq_production DEFAULT CHARACTER SET utf8 COLLATE=utf8_unicode_ci;
mysql> ALTER TABLE issues CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
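After the conversion, re-running the checks quoted above should confirm the new defaults (expected values shown as comments; this assumes the same gitlabhq_production database):

mysql> SHOW VARIABLES LIKE "character_set_database";  -- expect: utf8
mysql> SHOW VARIABLES LIKE "collation_database";      -- expect: utf8_unicode_ci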

How to know which CSV delimiter is being used on a PC, using PHP?

I am generating a CSV file.
I would like to set the delimiter dynamically, that is, read the list-separator configured on the PC and then use it in my CSV.
Is that possible?
No, it's not. You want your CSV to open in Excel, correct? Then use [TAB] as the delimiter and save the file as ".xls" instead of ".csv". The worst case is a warning message about the file format when the user opens it, but the data will be visible in the cells for sure.
Another approach for Excel files is to use Excel (X)ML; it's very simple for raw data (http://office.microsoft.com/en-us/excel-help/overview-of-xml-in-excel-HA010206396.aspx).
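As a hint of how simple that raw-data format is, a minimal Excel 2003 XML ("SpreadsheetML") document with one header row and one data row looks roughly like this (values hypothetical):

<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row><Cell><Data ss:Type="String">id</Data></Cell><Cell><Data ss:Type="String">name</Data></Cell></Row>
      <Row><Cell><Data ss:Type="Number">1</Data></Cell><Cell><Data ss:Type="String">Alice</Data></Cell></Row>
    </Table>
  </Worksheet>
</Workbook>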
I did generate a CSV using ',' as the delimiter, but on my client's side it was not working. I replaced the delimiter with ';' and now it works on the client's side.
That means the client expected ';' instead of ','. You cannot know what the client expects unless they tell you. There are countless programs that consume the CSV format, and there is no universal way to figure out which delimiter they expect. Usually it works the other way around: you create the CSV and tell the client which delimiter you used. There is no technical protocol for this, though; CSV is too informal to have any such universal specification.
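Since the delimiter has to be agreed on rather than detected, a minimal PHP sketch (function name and sample data hypothetical) that makes the separator an explicit parameter, using the built-in fputcsv(), could look like this; passing "\t" would implement the [TAB] suggestion above:

<?php
// Minimal sketch: the delimiter is an explicit parameter, because the
// client's list separator cannot be detected from the server side.
function write_csv(array $rows, string $path, string $delimiter = ';'): void
{
    $fh = fopen($path, 'w');
    foreach ($rows as $row) {
        fputcsv($fh, $row, $delimiter);
    }
    fclose($fh);
}

// Example: semicolon-delimited, as the client in this thread expected.
write_csv([['id', 'name'], [1, 'Alice']], 'export.csv', ';');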
