Remove header row from CSV files - Excel

I have a directory with circa 3k CSV files containing various data. I need to collate these into a single file at some point, but first I need to remove the header row from each file.
Usually I would collate the files, open the result in Excel, filter to the header rows and delete them all. Unfortunately, these files sum to something around 9M rows, and Excel doesn't like that...
Can anybody think of a way around this? Preferably some sort of batch script that will run through all files in a directory.
Thanks in advance,
A.

The following assumes the first line of each file is the header line to be eliminated.
It will only work properly if none of the files contain the <TAB> character, and none of the files is too large. I can't remember the specifics, but at some point, MORE with redirected output will hang waiting for a keypress if the input file gets too large.
(for %F in (*.csv) do @more +1 "%F") >concat_csv.txt
I made sure to give the output file a different extension so that the command does not try to process the output! An alternative is to redirect the output to a CSV file but in a different folder.
If you want to use this in a batch file, then double up the percents (%F becomes %%F)
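If the TAB and file-size limits of MORE are a concern, the same idea can be sketched in Python, which streams each file line by line. This is only a sketch: the function name concat_without_headers, the keep_one_header option and the UTF-8 encoding are my assumptions, not something from the question.

```python
import glob
import os


def concat_without_headers(src_dir, out_path, keep_one_header=True):
    """Append every *.csv in src_dir to out_path, dropping each file's first line."""
    out_abs = os.path.abspath(out_path)
    header_written = False
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in sorted(glob.glob(os.path.join(src_dir, "*.csv"))):
            if os.path.abspath(path) == out_abs:
                continue  # never read the output file back in
            with open(path, "r", encoding="utf-8", newline="") as src:
                header = src.readline()  # assumed to be the header row
                if keep_one_header and header and not header_written:
                    out.write(header if header.endswith("\n") else header + "\n")
                    header_written = True
                last = "\n"
                for line in src:  # streamed, so file size is not an issue
                    out.write(line)
                    last = line
                if not last.endswith("\n"):
                    out.write("\n")  # keep files from running into each other
```

Call it as concat_without_headers("path to your folder", "concat_csv.txt"). Pass keep_one_header=False if you want no header at all in the result.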

I am not sure this is what you are looking for, but here is one way to get rid of the duplicate headers in C#. The code stores a single header in the string header, and skips the first row of every file by calling rdr.ReadLine() once before the while (rdr.Peek() != -1) loop.
I have also used a dictionary, keyed on the first column, to store the rows of each CSV file. This prevents rows that share a key from being written twice when they appear in different CSV files (I am not sure whether this will be helpful in your case). Note that a Dictionary does not guarantee that rows come back out in their original order.
Assume fnames is a string array that contains the names of the files you wish to merge.
// requires: using System.Collections.Generic; using System.IO; using System.Linq;
Dictionary<string, string> dict = new Dictionary<string, string>();
string destinationFile = @"<write path of your destination file>";
string dir = @"<write path of your original directory>";
string header = "";
if (dir.Length != 0)
{
    foreach (string f in fnames)
    {
        using (StreamReader rdr = new StreamReader(Path.Combine(dir, f)))
        {
            header = rdr.ReadLine(); // the first line of each file is its header
            while (rdr.Peek() != -1)
            {
                string ln = rdr.ReadLine();
                string[] split_ln = ln.Split(',');
                // everything after the first column, re-joined with commas
                string value = string.Join(",", split_ln.Skip(1));
                // the indexer overwrites an existing key; dict.Add would throw on a duplicate
                dict[split_ln[0]] = value;
            }
        }
    }
    using (StreamWriter wr = new StreamWriter(destinationFile))
    {
        wr.WriteLine(header);
        foreach (var pair in dict)
        {
            wr.WriteLine("{0},{1}", pair.Key, pair.Value);
        }
    }
}

Related

How to clear a text file without deleting it using groovy

I have a text file that I'll be writing content to. Every time, before I write something to the file, I want to clear its content without deleting the file.
How would I achieve the above? Any suggestions?
With text files you can simply set it to an empty string.
file.text = ''
I would use the setBytes method on java.io.File and provide it with an empty byte array:
file.bytes = new byte[0]
Passing an empty list also works, impressively.
file.bytes = []
Presumably, you want simply to overwrite the file with new content. To do that:
def content = ...
new File("test.txt").withWriter { writer ->
writer.write(content)
}
Note that File.withWriter will do all the usual housekeeping re: open/close file.

Reading text file and omitting line

Is there any method of reading from a text file and omitting certain lines from the output into a text box?
The text file will look like this:
Name=Test Name
Date=19/02/14
Message blurb spanning over several lines
The format will always be the same: Name and Date will always be the first 2 rows. These are the rows that I want to omit, returning the rest of the message blurb to a text box.
I know how to use the ReadAllLines function and StreamReader, but I'm not sure how to start coding it.
Any pointers or directions to some relevant online documentation?
Thanks in advance
You can read file line by line and just skip lines with given beginnings:
string[] startsToOmit = new string[] { "Name=", "Date=" };
var result = File.ReadLines(path)
.Where(line => !startsToOmit.Any(start => line.StartsWith(start)));
and then you have an IEnumerable<string> as a result, you can use it for example by result.ToList().
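Just read the stream line by line:
using (StreamReader sr = new StreamReader(path))
{
    sr.ReadLine();                   // skip the "Name=" line
    sr.ReadLine();                   // skip the "Date=" line
    string message = sr.ReadToEnd(); // everything after the first two lines
    Console.WriteLine(message);
}
The first two ReadLine calls discard the Name and Date lines; process the rest however you need.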
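Since the Name and Date lines are always the first two rows, skipping by position is enough. For comparison, here is a minimal Python sketch of that idea; the function name read_message and the UTF-8 encoding are assumptions of mine.

```python
from itertools import islice


def read_message(path, skip=2):
    """Return the file's text with the first `skip` lines left out."""
    with open(path, encoding="utf-8") as f:
        # islice drops the leading lines without loading them into a list
        return "".join(islice(f, skip, None))
```

The returned string is what you would assign to the text box.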

Separated text line in Apache POI XWPFRun object

I'm trying to fill in a template DOCX document with Apache POI by using the XWPFDocument class. I have tags in the doc and a JSON file to read the replacement data from. My problem is that a line of text turns out to be split up inside the DOCX, which I can see when I change the file extension to ZIP and open document.xml. For example, the text [MEMBER_CONTACT_INFO] becomes [MEMBER_CONTACT_INFO and ] separately. POI reads it the same way, since that is how the original DOCX stores it. This creates 2 XWPFRun objects in the paragraph, which hold the text as [MEMBER_CONTACT_INFO and ] separately.
My question is: is there a way to force POI to behave like Word, by merging related runs or something like that? Or how else can I solve this problem? I'm matching run texts while replacing, and I can't find my tag because it is split across 2 different run objects.
Best
This wasted so much of my time once...
Basically, an XWPFParagraph is composed of multiple XWPFRuns, and an XWPFRun is a contiguous piece of text with one fixed style.
So when you type something like "[PLACEHOLDER_NAME]" in MS Word, it will create a single XWPFRun. But if you add a few more things and then go back and change "[PLACEHOLDER_NAME]" to something else, it is not guaranteed to remain a single XWPFRun; it may well be split into two runs. AFAIK this is how MS Word works.
How to avoid splitting of runs in such cases?
Solution: there are two solutions that I know of:
1. Copy the text "[PLACEHOLDER_NAME]" to Notepad or something similar. Make the necessary modification, then copy it back and paste it over "[PLACEHOLDER_NAME]" in your Word file. This way the whole "[PLACEHOLDER_NAME]" is replaced with new text in one go, which avoids splitting the XWPFRun.
2. Select "[PLACEHOLDER_NAME]", then use Word's "Replace" option and replace it with "[Your-new-edited-placeholder]". This guarantees that your new placeholder occupies a single XWPFRun.
If you have to change your new placeholder again, follow step 1 or 2.
Here is the Java code to fix that split-text issue. It also handles placeholders that are spread across runs with different formatting.
// requires: java.util.ArrayList, java.util.List, org.apache.poi.xwpf.usermodel.*
public static void replaceString(XWPFDocument doc, String search, String replace) throws Exception {
    for (XWPFParagraph p : doc.getParagraphs()) {
        List<XWPFRun> runs = p.getRuns();
        List<Integer> group = new ArrayList<Integer>(); // indexes of runs that together spell the tag
        if (runs != null) {
            String groupText = search; // the part of the tag still to be matched
            for (int i = 0; i < runs.size(); i++) {
                XWPFRun r = runs.get(i);
                String text = r.getText(0);
                if (text == null) {
                    continue;
                }
                if (text.contains(search)) {
                    // String.replace is literal, so no regex quoting is needed
                    r.setText(text.replace(search, replace), 0);
                } else if (groupText.startsWith(text)) {
                    group.add(i);
                    groupText = groupText.substring(text.length());
                    if (groupText.isEmpty()) {
                        runs.get(group.get(0)).setText(replace, 0);
                        // remove from the end so the remaining indexes stay valid
                        for (int j = group.size() - 1; j >= 1; j--) {
                            p.removeRun(group.get(j));
                        }
                        i -= group.size() - 1; // account for the runs just removed
                        group.clear();
                        groupText = search;
                    }
                } else {
                    group.clear();
                    groupText = search;
                }
            }
        }
    }
    for (XWPFTable tbl : doc.getTables()) {
        for (XWPFTableRow row : tbl.getRows()) {
            for (XWPFTableCell cell : row.getTableCells()) {
                for (XWPFParagraph p : cell.getParagraphs()) {
                    for (XWPFRun r : p.getRuns()) {
                        String text = r.getText(0);
                        // getText can return null for runs without text
                        if (text != null && text.contains(search)) {
                            // position 0 overwrites the text instead of appending to it
                            r.setText(text.replace(search, replace), 0);
                        }
                    }
                }
            }
        }
    }
}
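The underlying idea (find the tag in the joined text of all runs, then map the match back onto the individual runs) does not depend on POI. Here is a small Python sketch that models the runs as a plain list of strings; replace_across_runs is a made-up name and nothing here is POI API. Instead of removing runs it blanks the ones in the middle, so no index ever shifts.

```python
def replace_across_runs(runs, search, replace):
    """Return new run texts with `search` replaced, even when it spans runs."""
    if not search:
        raise ValueError("search must not be empty")
    runs = list(runs)
    pos = 0
    while True:
        start = "".join(runs).find(search, pos)
        if start < 0:
            return runs
        end = start + len(search)
        # locate the runs holding the first and the last character of the match
        offset = 0
        first = last = None
        for i, text in enumerate(runs):
            run_end = offset + len(text)
            if first is None and start < run_end:
                first, first_off = i, start - offset
            if end <= run_end:
                last, last_off = i, end - offset
                break
            offset = run_end
        head = runs[first][:first_off]
        tail = runs[last][last_off:]
        if first == last:
            runs[first] = head + replace + tail
        else:
            runs[first] = head + replace  # replacement takes the first run's style
            runs[last] = tail             # text after the tag keeps its own run
            for i in range(first + 1, last):
                runs[i] = ""              # blank the runs in between
        pos = start + len(replace)        # continue after the inserted text


print(replace_across_runs(["Dear ", "[MEMBER_CONTACT_INFO", "]", "!"],
                          "[MEMBER_CONTACT_INFO]", "Jane"))
# → ['Dear ', 'Jane', '', '!']
```

In POI terms, each list entry would be what getText(0) returns for a run, and writing the result back would be a setText(text, 0) per run.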
For me it didn't work as I expected (not every time). In my case I used ${PLACEHOLDER} in the text. First we need to look at how Apache POI splits each paragraph into the runs we want to iterate over. If you dig into the structure of a docx file, you will see that one run is a sequence of characters with the same font style, size, colour, bold, italic and so on. As a result, the placeholder was sometimes split into parts, and sometimes the whole paragraph was recognised as one run, so it was impossible to iterate over individual words. What I did was make the placeholder name bold in the template document. Then, when iterating over the runs, the whole placeholder name ${PLACEHOLDER} arrived as a run of its own. I replaced the value with:
for (XWPFRun r : p.getRuns()) {
    String text = r.getText(0);
    if (text != null && text.contains("originalText")) {
        text = text.replace("originalText", "newText");
        r.setText(text, 0);
    }
}
and then just added r.setBold(false); after setText.
That way the placeholder is recognised as a separate run, so I'm able to replace that specific placeholder, and the processed document has no bold text, just plain text. An additional advantage for me is that I can find the placeholders in the template faster visually.
So finally the loop above looks like this:
for (XWPFRun r : p.getRuns()) {
    String text = r.getText(0);
    if (text != null && text.contains("originalText")) {
        text = text.replace("originalText", "newText");
        r.setText(text, 0);
        r.setBold(false); // undo the bold marker used in the template
    }
}
I hope it will help someone, as I spent too much time on that :)
I also had this issue a few days ago and couldn't find any solution. I chose to use PLACEHOLDER_NAME instead of [PLACEHOLDER_NAME]. This works fine for me, and it is seen as a single XWPFRun object.
To be sure that a word will be treated as a single XWPFRun, you can use a MergeField as the variable in Word, like this:
Place cursor on the word you want to be a single run.
Press CTRL and F9 together and { } in gray will appear.
Right-click on the { } field and select Edit Field.
In pop-up box, select Mail Merge from Categories and then MergeField from Field Names.
Click OK.

CSV field with newline character in a cell to import to excel

I've got a problem importing data from a CSV file into Excel.
I've parsed data from a website, and it contains <br> line breaks. I convert this tag into "\n" and write the result to a CSV file. However, when I import this CSV file into Excel, the line break is displayed incorrectly: it starts a new row instead of a new line within a single cell.
Has anyone faced this problem before? I'd really appreciate your suggestions.
Thanks!
Edit: Here the sample to demonstrate my situation
static void TestLine()
{
string sampleData = "日찬양 까페에 올린 충격적인 <br>글코리아타임스";
string formattedData = sampleData.Replace("<br>", "\n");
using (StreamWriter writer = new StreamWriter(@"C:\SampleData.csv", false, Encoding.Unicode))
{
writer.WriteLine(formattedData);
}
}
I want the sampleData displayed in one cell; however, the result ends up in 2 cells.
It looks to me like a malformed CSV file. To handle this situation correctly, the field containing the line break needs to be enclosed in quotation marks.
UPDATE after the sample:
This works fine:
static void Main(string[] args)
{
string sampleData = "\"日찬양 까페에 올린 충격적인 <br>글코리아타임스\"";
string formattedData = sampleData.Replace("<br>", "\n");
using (StreamWriter writer = new StreamWriter(@"C:\SampleData.csv", false, Encoding.Unicode))
{
writer.WriteLine(formattedData);
}
}
You need to wrap the field in the quotation marks (")
Please take a look at CSV-1203 File Format Specification, in particular the sections on "End-of-Record Marker" and "Field Payload Protection". Hopefully this should give clear guidance on the inner workings of the CSV file format.
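Rather than adding the quotation marks by hand, a CSV library can do the quoting for you. As an illustration only (Python's standard csv module, not the C# from the question; to_csv_line is a made-up helper name), this shows a field with an embedded line break being wrapped in quotes automatically:

```python
import csv
import io


def to_csv_line(fields):
    """Serialise one record; fields containing line breaks or commas get quoted."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)  # default quoting is QUOTE_MINIMAL
    return buf.getvalue()


cell = "first line<br>second line".replace("<br>", "\n")
print(repr(to_csv_line([cell, "next cell"])))
# → '"first line\nsecond line",next cell\r\n'
```

Quotation marks inside a field are doubled by the writer, which is the other case that is easy to get wrong when quoting manually.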

How can I import data from text files into Excel?

I have multiple folders, each containing several txt files. I need to extract a single value (the number after Value, 555 in the example below) from one particular txt file, individual_values.txt, in each folder.
No 100 Value 555 level match 0.443 top level 0.443 bottom 4343
There will be many folders containing a txt file with the same name but a different value. Can all these values be copied into Excel, one below the other?
In other words, the same text file name appears in different folders, and all I want is to extract this value from every such file and paste it into Excel (or a txt file), one value per row.
E.g. from the text file above I need the value 555, and similarly the values from the other files:
555
666
666
776
Yes.
(you might want to clarify your question )
Your question isn't very clear, I imagine you want to know how this can be done.
You probably need to write a script that traverses the folders, reads the individual files, parses them for the value you want, and generates a Comma Separated Values (CSV) file. CSV files can easily be imported to Excel.
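A rough sketch of such a script in Python, assuming each file holds a line like the sample in the question. The function names, the regular expression and the UTF-8 encoding are my own choices, so adjust them to your data.

```python
import csv
import os
import re

# the number that follows the word "Value", as in the sample line
VALUE_RE = re.compile(r"\bValue\s+(\d+)", re.IGNORECASE)


def collect_values(root, filename="individual_values.txt"):
    """Walk root and return the Value number found in each matching file."""
    values = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a predictable order
        if filename in filenames:
            with open(os.path.join(dirpath, filename), encoding="utf-8") as f:
                match = VALUE_RE.search(f.read())
            if match:
                values.append(match.group(1))
    return values


def write_column(values, out_path):
    """Write the values one per row, so Excel shows them one below the other."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        csv.writer(out).writerows([v] for v in values)
```

Usage would be something like write_column(collect_values("path to the top folder"), "values.csv"), after which values.csv can be opened directly in Excel.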
There are three basic methods you can use to get data into an Excel spreadsheet:
You can use OLE wrappers to manipulate Excel.
You can write the file in a binary form.
You can use Excel's import methods to take delimited text in as a spreadsheet.
I chose the last one because 1) it is the simplest, and 2) nothing in your problem as stated requires a more complex approach. The solution below outputs a tab-delimited text file that Excel can easily open.
In Perl:
use strict;
use warnings;
use IO::File;
my @field_names = split m|/|, 'No/Value/level match/top level/bottom';
#' # <-- catch runaway quote
my $input = IO::File->new( '<data.txt' );
die 'Could not open data.txt for input!' unless $input;
my @data_rows;
while ( my $line = <$input> ) {
    # capture each "name number" pair on the line into a hash
    my %fields = $line =~ /(level match|top level|bottom|Value|No)\s+(\d+\S*)/g;
    push @data_rows, \%fields if exists $fields{Value};
}
$input->close();
my $tab_file = IO::File->new( '>data.tab' );
die 'Could not open data.tab for output!' unless $tab_file;
$tab_file->print( join( "\t", @field_names ), "\n" );
foreach my $data_ref ( @data_rows ) {
    # hash slice: the values in the same order as the header
    $tab_file->print( join( "\t", @$data_ref{@field_names} ), "\n" );
}
$tab_file->close();
NOTE: Excel's text processing is really quite neat. Try opening the text below (replacing the \t with actual tabs) -- or even copying and pasting it:
1\t2\t3\t=SUM(A1:C1)
I chose C#, because I thought it would be fun to use a recursive lambda. This will create a CSV file containing the matches for the regex pattern.
// requires: using System; using System.IO; using System.Linq;
//           using System.Text; using System.Text.RegularExpressions;
string root_path = @"c:\Temp\test";
string match_filename = "test.txt";
Func<string, string, StringBuilder, StringBuilder> getdata = null;
getdata = (path, filename, content) => {
    // collect the value from every matching file in this folder
    Directory.GetFiles(path)
        .Where(f =>
            Path.GetFileName(f)
                .Equals(filename, StringComparison.OrdinalIgnoreCase))
        .Select(f => File.ReadAllText(f))
        .Select(c => Regex.Match(c, @"value[\s\t]*(\d+)",
            RegexOptions.IgnoreCase))
        .Where(m => m.Success)
        .Select(m => m.Groups[1].Value)
        .ToList()
        .ForEach(m => content.AppendLine(m));
    // then recurse into each subfolder
    Directory.GetDirectories(path)
        .ToList()
        .ForEach(d => getdata(d, filename, content));
    return content;
};
File.WriteAllText(
    Path.Combine(root_path, "data.csv"),
    getdata(root_path, match_filename, new StringBuilder()).ToString());
No.
just making sure you have a 50/50 chance of getting the right answer
(assuming it was a question answerable by Yes and No) hehehe
File_not_found
Gotta have all three binary states for the response.
