I'm using the CSVReader object, but it is skipping the first line. Is there any way to prevent this?
My code:
if (UploadFile.AskExt() == WebDialogResult.OK)
{
StringBuilder filetext = new StringBuilder();
PX.SM.FileInfo info = PXContext.SessionTyped<PXSessionStatePXData>().FileInfo["MyFileImportSessionKey"] as PX.SM.FileInfo;
Byte[] bytes = info.BinData;
using (CSVReader reader = new CSVReader(bytes, Encoding.ASCII.CodePage))
{
reader.Reset();
filetext.Append("FIRST:" + reader.GetValue(0) + " DONE" + Environment.NewLine);
while (reader.MoveNext())
{
for (int i = 0; i < reader.IndexKeyPairs.Count; i++)
{
filetext.Append(reader.GetValue(i));
}
filetext.Append(Environment.NewLine);
}
System.Web.HttpContext.Current.Session.Remove("MyFileImportSessionKey");
}
CurrentDocument.Ask(filetext.ToString(), MessageButtons.OK);
}
This always shows all lines after the first line. I understand that the first line is typically a header and would usually be ignored, but in my case I need every line in the file.
The header is definitely handled as a special case.
I believe it's exposed in the IndexKeyPairs public collection.
foreach (KeyValuePair<int, string> headerColumn in reader.IndexKeyPairs)
{
int headerColumIndex = headerColumn.Key;
string headerColumnCellValue = headerColumn.Value;
}
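If that holds, a minimal sketch of how the header row could be written out before the data rows might look like this (untested; it assumes IndexKeyPairs yields the header cells keyed by column index, as in the snippet above):
// Hypothetical sketch: emit the header cells first, then the data rows as before.
foreach (KeyValuePair<int, string> headerColumn in reader.IndexKeyPairs)
{
    filetext.Append(headerColumn.Value);
}
filetext.Append(Environment.NewLine);
while (reader.MoveNext())
{
    for (int i = 0; i < reader.IndexKeyPairs.Count; i++)
    {
        filetext.Append(reader.GetValue(i));
    }
    filetext.Append(Environment.NewLine);
}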
Related
I see that you can't use StringTokenizer on an array, because you can't convert a String to a String[]. After a while I realized that if the inputFromFile method reads the file line by line, I can tokenize it line by line. I just don't know how to do it so that it returns the tokenized version.
I'm assuming that after the line = in.readLine(); line I should put StringTokenizer token = new StringTokenizer(line, ","), but it doesn't seem to be working.
Any help? (I have to tokenize on the commas.)
public class Project1 {
private static int inputFromFile(String filename, String[] wordArray) {
TextFileInput in = new TextFileInput(filename);
int lengthFilled = 0;
String line = in.readLine();
while (lengthFilled < wordArray.length && line != null) {
wordArray[lengthFilled++] = line;
line = in.readLine();
}// while
if (line != null) {
System.out.println("File contains too many Strings.");
System.out.println("This program can process only "
+ wordArray.length + " Strings.");
System.exit(1);
} // if
in.close();
return lengthFilled;
} // method inputFromFile
public static void main(String[] args) {
String[] numArray = new String[100];
inputFromFile("input1.txt", numArray);
for (int i = 0; i < numArray.length; i++) {
if (numArray[i] == null) {
break;
}
System.out.println(numArray[i]);
}// for
for (int i=0;i<numArray.length;i++)
{
Integer.parseInt(numArray[i]);
}
}// main
}// project1
This is what I meant:
while (lengthFilled < wordArray.length && line != null) {
String[] tokens = line.split(",");
if(tokens == null || tokens.length == 0) {
//line without required token, add whole line as it is
wordArray[lengthFilled++] = line;
} else {
//add each token into wordArray
for(int i=0; i<tokens.length;i++) {
wordArray[lengthFilled++] = tokens[i];
}
}
line = in.readLine();
}// while
There can be other approaches as well. For instance, you could use a StringBuilder to read everything as one big string and then split it on your required tokens. The above logic is just to point you in the right direction.
Could you please point out where the bug is in my code?
I have a simple text file with the following data structure:
something1
something2
something3
...
It results in a String[] where every element is the last line of the file. I can't find the mistake, but it goes wrong somewhere around the line.setLength(0); call.
Any ideas?
public String[] readText() throws IOException {
InputStream file = getClass().getResourceAsStream("/questions.txt");
DataInputStream in = new DataInputStream(file);
StringBuffer line = new StringBuffer();
Vector lines = new Vector();
int c;
try {
while( ( c = in.read()) != -1 ) {
if ((char)c == '\n') {
if (line.length() > 0) {
// debug
//System.out.println(line.toString());
lines.addElement(line);
line.setLength(0);
}
}
else{
line.append((char)c);
}
}
if(line.length() > 0){
lines.addElement(line);
line.setLength(0);
}
String[] splitArray = new String[lines.size()];
for (int i = 0; i < splitArray.length; i++) {
splitArray[i] = lines.elementAt(i).toString();
}
return splitArray;
} catch(Exception e) {
System.out.println(e.getMessage());
return null;
} finally {
in.close();
}
}
I see one obvious error - you're storing the same StringBuffer instance multiple times in the Vector, and you clear that same StringBuffer instance with setLength(0). I'm guessing you want to do something like this:
StringBuffer s = new StringBuffer();
Vector v = new Vector();
...
String bufferContents = s.toString();
v.addElement(bufferContents);
s.setLength(0);
// now it's ok to reuse s
...
If your problem is to read the contents of the file into a String[], then you could actually use Apache Commons' FileUtils class to read the lines into a list and then convert it to an array.
List<String> fileContentsInList = FileUtils.readLines(new File("filename"));
String[] fileContentsInArray = new String[fileContentsInList.size()];
fileContentsInArray = (String[]) fileContentsInList.toArray(fileContentsInArray);
In the code that you have specified, rather than setting the length to 0, you can also just reinitialize the StringBuffer (line = new StringBuffer();) after adding its contents to the Vector.
using System;
namespace jagged_array
{
class Program
{
static void Main(string[] args)
{
string[][] Members = new string[10][]{
new string[]{"amit","amit#gmail.com", "9999999999"},
new string[]{"chandu","chandu#gmail.com","8888888888"},
new string[]{"naveen","naveen#gmail.com", "7777777777"},
new string[]{"ramu","ramu#gmail.com", "6666666666"},
new string[]{"durga","durga#gmail.com", "5555555555"},
new string[]{"sagar","sagar#gmail.com", "4444444444"},
new string[]{"yadav","yadav#gmail.com", "3333333333"},
new string[]{"suraj","suraj#gmail.com", "2222222222"},
new string[]{"niharika","niharika#gmail.com","11111111111"},
new string[]{"anusha","anusha#gmail.com", "0000000000"},
};
for (int i =0; i < Members.Length; i++)
{
System.Console.Write("Name List ({0}):", i + 1);
for (int j = 0; j < Members[i].Length; j++)
{
System.Console.Write(Members[i][j] + "\t");
}
System.Console.WriteLine();
}
Console.ReadKey();
}
}
}
The above is the code for my C# console program, in which I used a jagged array and assigned the values manually. Now my requirement is to import the same details into my program from a CSV file (which is at some location on my disk) instead of assigning them manually in the array. How do I do this, and which functions should I use? Please help me with an example. Thank you.
// Note: this uses TextFieldParser, which lives in Microsoft.VisualBasic.FileIO
// (add a reference to Microsoft.VisualBasic), and DataTable from System.Data.
static void Main()
{
string csv_file_path = @"C:\Users\Administrator\Desktop\test.csv";
DataTable csvData = GetDataTabletFromCSVFile(csv_file_path);
Console.WriteLine("Rows count:" + csvData.Rows.Count);
Console.ReadLine();
}
private static DataTable GetDataTabletFromCSVFile(string csv_file_path)
{
DataTable csvData = new DataTable();
try
{
using(TextFieldParser csvReader = new TextFieldParser(csv_file_path))
{
csvReader.SetDelimiters(new string[] { "," });
csvReader.HasFieldsEnclosedInQuotes = true;
string[] colFields = csvReader.ReadFields();
foreach (string column in colFields)
{
DataColumn datecolumn = new DataColumn(column);
datecolumn.AllowDBNull = true;
csvData.Columns.Add(datecolumn);
}
while (!csvReader.EndOfData)
{
string[] fieldData = csvReader.ReadFields();
//Making empty value as null
for (int i = 0; i < fieldData.Length; i++)
{
if (fieldData[i] == "")
{
fieldData[i] = null;
}
}
csvData.Rows.Add(fieldData);
}
}
}
catch (Exception ex)
{
// Swallowing the exception hides parse errors; at minimum log ex here.
}
return csvData;
}
Treat the CSV file like an Excel workbook and you will find a lot of examples on the web for what you need to do. (The snippet below appears to use a third-party spreadsheet library's ExcelFile class, such as GemBox.Spreadsheet.)
ExcelFile ef = new ExcelFile();
// Loads file.
ef.LoadCsv("filename.csv");
// Selects first worksheet.
ExcelWorksheet ws = ef.Worksheets[0];
I won't go into details, but you can read the lines of text from a file with File.ReadAllLines.
Once you have those lines, you can split each of them into parts using String.Split (at least this will work if the CSV file contains very simple data, as in your example).
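As a rough sketch of that approach (the file path is a placeholder, and this assumes simple comma-separated values with no quoted fields; it needs using System; and using System.IO;):
// Minimal sketch: read every line, then split each line on commas
// into a jagged array shaped like the hard-coded Members array above.
string[] lines = File.ReadAllLines(@"C:\data\members.csv"); // hypothetical path
string[][] members = new string[lines.Length][];
for (int i = 0; i < lines.Length; i++)
{
    members[i] = lines[i].Split(',');
}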
I am writing a C# program to analyze the number of browsers in the UserAgent column of a web server log. I want to output the browser type, browser major version, and the number of hits.
How can I optimize this?
I am using a regex to compare the UserAgent string with predefined strings to test for Firefox, Opera, etc. I then use a regex to rule out a false match, and another regex to obtain the major version. I use a struct to hold this information for each browser:
private struct Browser
{
public int ID;
public string name;
public string regex_match;
public string regex_not;
public string regex_version;
public int regex_group;
}
I then load the browser information and loop over all of the records for the UserAgent:
Browser[] browsers = new Browser[5];
for (int i = 0; i < 5; i++)
{
browsers[i].ID = i;
}
browsers[0].name = "Firefox";
browsers[1].name = "Opera";
browsers[2].name = "Chrome";
browsers[3].name = "Safari";
browsers[4].name = "Internet Explorer";
browsers[0].regex_match = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_match = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_match = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_match = "(?i)safari/([\\d\\.]*)";
browsers[4].regex_match = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_not = "(?i)flock";
browsers[1].regex_not = "";
browsers[2].regex_not = "";
browsers[3].regex_not = "(?i)android|arora|chrome|shiira";
browsers[4].regex_not = "(?i)webtv|omniweb|opera";
browsers[0].regex_version = "(?i)firefox/([\\d\\.]*)";
browsers[1].regex_version = "(?i)opera/([\\d\\.]*)";
browsers[2].regex_version = "(?i)chrome/([\\d\\.]*)";
browsers[3].regex_version = "(?i)version/([\\d\\.]*)";
browsers[4].regex_version = "(?i)msie([+_ ]|)([\\d\\.]*)";
browsers[0].regex_group = 1;
browsers[1].regex_group = 1;
browsers[2].regex_group = 1;
browsers[3].regex_group = 1;
browsers[4].regex_group = 2;
Dictionary<string, int> browser_counts = new Dictionary<string, int>();
for (int i = 0; i < 65000; i++)
{
foreach (Browser b in browsers)
{
if (Regex.IsMatch(csUserAgent[i], b.regex_match))
{
if (b.regex_not != "")
{
if (Regex.IsMatch(csUserAgent[i], b.regex_not))
{
continue;
}
}
string strBrowser = b.name;
if (b.regex_version != "")
{
string strVersion = Regex.Match(csUserAgent[i], b.regex_version).Groups[b.regex_group].Value;
int intPeriod = strVersion.IndexOf('.');
if (intPeriod > 0)
{
strBrowser += " " + strVersion.Substring(0, intPeriod);
}
}
if (!browser_counts.ContainsKey(strBrowser))
{
browser_counts.Add(strBrowser, 1);
}
else
{
browser_counts[strBrowser]++;
}
break;
}
}
}
You could:
- construct a hashtable of the most frequently matched user-agent strings and skip the regexes for those (see the sketch after this list)
- store compiled Regex objects (new Regex(pattern, RegexOptions.Compiled)) instead of plain pattern strings
- combine the regexes into a single regex and take advantage of RegexOptions.Compiled, RegexOptions.CultureInvariant and RegexOptions.IgnoreCase
- instead of matching twice (once with IsMatch and once with Match), match once with Match and check Match.Success
This is only a starting point - I might come up with more ideas on reading the code :)
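For the caching idea in the first point, a rough sketch (names are illustrative; it assumes csUserAgent holds the raw user-agent strings and that your existing regex matching is factored into a helper):
// Hypothetical sketch: cache the resolved browser name per distinct user-agent
// string, so repeated agents skip the regex work entirely.
Dictionary<string, string> agentCache = new Dictionary<string, string>();
Dictionary<string, int> browserCounts = new Dictionary<string, int>();
foreach (string agent in csUserAgent)
{
    string browserName;
    if (!agentCache.TryGetValue(agent, out browserName))
    {
        browserName = ClassifyWithRegexes(agent); // your existing regex logic, factored out; returns null on no match
        agentCache[agent] = browserName;
    }
    if (browserName == null)
        continue;
    int count;
    browserCounts.TryGetValue(browserName, out count);
    browserCounts[browserName] = count + 1;
}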
Edit - one more:
- avoid parsing the version with another regex - only Safari requires special treatment according to your config. Try to 'catch' the version with the same regex as the browser id. (I'd simply make an exception for Safari for now.)
E.g. you could have a single static regex instance like this:
private static readonly Regex _regex = new Regex(
"(?i)"
+ "(?<browserid>(?:firefox/|opera/|chrome/|chrome/|safari/|msie[+_ ]?))"
+ "(?<version>[\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
You can conveniently access the proper subgroups by using match.Groups["browserid"] and match.Groups["version"]. This nearly eliminates all the use for your list of Browser structs.
The only thing it still caters for is the exclusion regex (regex_not). I suggest re-profiling with the single positive regex first, though and see whether there is still a performance problem left before frying smaller fish.
Benchmark
I wrote a benchmark (see below). I'll be updating this incrementally until I lose interest :) (I know my dataset isn't representative. If you upload a file, I'll test it with that.)
- replacing the separate regexes with the single statically compiled regex speeds it up from 14s to 2.1s (roughly a 6x speedup); this is with only the outermost match replaced
- replacing regex_not/regex_version with precompiled regexes did not make much of a difference with my test set (but I don't have actual matching user agents, so that makes sense)
using System;
using System.Linq;
using System.Collections.Generic;
using System.Text.RegularExpressions;
public class Program
{
private struct Browser
{
public int ID;
public string name;
public Regex regex_match, regex_not, regex_version;
public int regex_group;
}
private static readonly Regex _regex = new Regex("(?i)"
+ "(?<browserid>(?:firefox/|opera/|chrome/|chrome/|safari/|msie[+_ ]?))"
+ "(?<version>[\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
public static void Main(string[] args)
{
Browser[] browsers = new Browser[5];
for (int i = 0; i < 5; i++)
{
browsers[i].ID = i;
}
browsers[0].name = "Firefox";
browsers[1].name = "Opera";
browsers[2].name = "Chrome";
browsers[3].name = "Safari";
browsers[4].name = "Internet Explorer";
browsers[0].regex_match = new Regex("(?i)firefox/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[1].regex_match = new Regex("(?i)opera/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[2].regex_match = new Regex("(?i)chrome/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[3].regex_match = new Regex("(?i)safari/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[4].regex_match = new Regex("(?i)msie([+_ ]|)([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
// OPTIMIZATION #2
browsers[0].regex_not = new Regex("(?i)flock", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[1].regex_not = null;
browsers[2].regex_not = null;
browsers[3].regex_not = new Regex("(?i)android|arora|chrome|shiira", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[4].regex_not = new Regex("(?i)webtv|omniweb|opera", RegexOptions.Compiled | RegexOptions.CultureInvariant);
// OPTIMIZATION #2
browsers[0].regex_version = new Regex("(?i)firefox/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[1].regex_version = new Regex("(?i)opera/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[2].regex_version = new Regex("(?i)chrome/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[3].regex_version = new Regex("(?i)version/([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[4].regex_version = new Regex("(?i)msie([+_ ]|)([\\d\\.]*)", RegexOptions.Compiled | RegexOptions.CultureInvariant);
browsers[0].regex_group = 1;
browsers[1].regex_group = 1;
browsers[2].regex_group = 1;
browsers[3].regex_group = 1;
browsers[4].regex_group = 2;
Dictionary<string, int> browser_counts = new Dictionary<string, int>();
var lookupBrowserId = new Dictionary<string, int> {
{ "firefox/", 0 },
{ "opera/", 1 },
{ "chrome/", 2 },
{ "safari/", 3 },
{ "msie+", 4 },
{ "msie_", 4 },
{ "msie ", 4 },
{ "msie", 4 },
};
for (int i=1; i<20; i++)
foreach (var line in System.IO.File.ReadAllLines("/etc/dictionaries-common/words"))
{
// OPTIMIZATION #1 START
Match match = _regex.Match(line);
{
if (match.Success)
{
Browser b = browsers[lookupBrowserId[match.Groups["browserid"].Value.ToLowerInvariant()]]; // lower-case the captured prefix so it matches the lowercase dictionary keys
// OPTIMIZATION #1 END
// OPTIMIZATION #2
if (b.regex_not != null && b.regex_not.IsMatch(line))
continue;
string strBrowser = b.name;
if (b.regex_version != null)
{
// OPTIMIZATION #2
string strVersion = b.regex_version.Match(line).Groups[b.regex_group].Value;
int intPeriod = strVersion.IndexOf('.');
if (intPeriod > 0)
{
strBrowser += " " + strVersion.Substring(0, intPeriod);
}
}
if (!browser_counts.ContainsKey(strBrowser))
{
browser_counts.Add(strBrowser, 1);
}
else
{
browser_counts[strBrowser]++;
}
break;
}
}
}
}
}
I have a list of files which need to be read, in chunks, into a byte[] that is then passed to a hashing function. The tricky part is this: if I reach the end of a file, I need to continue reading from the next file until I fill the buffer, like so:
read 16 bits as an example:
File 1: 00101010
File 2: 01101010111111111
would need to be read as 0010101001101010
The point is: these files can be as large as several gigabytes, and I don't want to completely load them into memory. Loading pieces into a buffer of, like, 30 MB would be perfectly fine.
I want to use threading, but would it be efficient to thread the file reading? I don't know whether disc I/O is such a large bottleneck that this would be worth it. Would the hashing be sped up sufficiently if I only threaded that part and locked on the read of each chunk? It is important that the hashes get saved in the correct order.
The second thing I need to do is generate the MD5 sum of each file as well. Is there any way to do this more efficiently than as a separate step?
(This question has some overlap with Is there a built-in way to handle multiple files as one stream?, but I thought this differed enough)
I am really stumped about what approach to take, as I am fairly new to C# as well as to threading. I already tried the approaches listed below, but they do not suffice for me.
As I am new to C# I value every kind of input on any aspect of my code.
This piece of code was threaded, but does not 'append' the streams, and as such generates invalid hashes:
public void DoHashing()
{
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = numThreads;
options.CancellationToken = cancelToken.Token;
Parallel.ForEach(files, options, (string f, ParallelLoopState loopState) =>
{
options.CancellationToken.ThrowIfCancellationRequested();
using (BufferedStream fileStream = new BufferedStream(File.OpenRead(f), bufferSize))
{
// Get the MD5sum first:
using (MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider())
{
md5.Initialize();
md5Sums[f] = BitConverter.ToString(md5.ComputeHash(fileStream)).Replace("-", "");
}
//setup for reading:
byte[] buffer = new byte[(int)pieceLength];
//I don't know if the buffer will f*ck up the filelenghth
long remaining = (new FileInfo(f)).Length;
int done = 0;
while (remaining > 0)
{
while (done < pieceLength)
{
options.CancellationToken.ThrowIfCancellationRequested();
//either try to read the piecelength, or the remaining length of the file.
int toRead = (int)Math.Min(pieceLength - done, remaining);
int read = fileStream.Read(buffer, done, toRead);
//if read == 0, EOF reached
if (read == 0)
{
remaining = 0;
break;
}
//offsets
done += read;
remaining -= read;
}
// Hash the piece
using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
{
sha1.Initialize();
byte[] hash = sha1.ComputeHash(buffer);
hashes[f].AddRange(hash);
}
done = 0;
buffer = new byte[(int)pieceLength];
}
}
}
);
}
This other piece of code isn't threaded (and doesn't calculate MD5):
void Hash()
{
//examples, these got handled by other methods
List<string> files = new List<string>();
files.Add("a.txt");
files.Add("b.blob");
//....
long totalFileLength;
int pieceLength = Math.Pow(2,20);
foreach (string file in files)
{
totalFileLength += (new FileInfo(file)).Length;
}
//Reading the file:
long remaining = totalFileLength;
byte[] buffer = new byte[Math.min(remaining, pieceSize)];
int index = 0;
FileStream fin = File.OpenRead(files[index]);
int done = 0;
int offset = 0;
while (remaining > 0)
{
while (done < pieceLength)
{
int toRead = (int)Math.Min(pieceLength - offset, remaining);
int read = fin.Read(buffer, done, toRead);
//if read == 0, EOF reached
if (read == 0)
{
index++;
//if last file:
if (index > files.Count)
{
remaining = 0;
break;
}
//get ready for next round:
offset = 0;
fin.OpenRead(files[index]);
}
done += read;
offset += read;
remaining -= read;
}
//Doing the piece hash:
HashPiece(buffer);
//reset for next piece:
done = 0;
byte[] buffer = new byte[Math.min(remaining, pieceSize)];
}
}
void HashPiece(byte[] piece)
{
using (SHA1CryptoServiceProvider sha1 = new SHA1CryptoServiceProvider())
{
sha1.Initialize();
//hashes is a List
hashes.Add(sha1.ComputeHash(piece));
}
}
Thank you very much for your time and effort.
I'm not looking for completely coded solutions, any pointer and idea where to go with this would be excellent.
Questions & remarks to yodaj007's answer:
Why if (currentChunk.Length >= Constants.CHUNK_SIZE_IN_BYTES)? Why not ==? If the chunk is larger than the chunk size, my SHA1 hash gets a different value.
currentChunk.Sources.Add(new ChunkSource()
{
Filename = fi.FullName,
StartPosition = 0,
Length = (int)Math.Min(fi.Length, (long)needed)
});
This is a really interesting idea. Postpone reading until you need it. Nice!
chunks.Add(currentChunk = new Chunk());
Why do this in the if (currentChunk != null) block and in the for (int i = 0; i < (fi.Length - offset) / Constants.CHUNK_SIZE_IN_BYTES; i++) block? Isn't the first a bit redundant?
Here is my complete answer. I tested it on one of my anime folders. It processes 14 files totaling 3.64GiB in roughly 16 seconds. In my opinion, using any sort of parallelism is more trouble than it is worth here. You're being limited by disc I/O, so multithreading will only get you so far. My solution can be easily parallelized though.
It starts by reading "chunk" source information: source file, offset, and length. All of this is gathered very quickly. From here, you can process the "chunks" using threading however you wish. Code follows:
public static class Constants
{
public const int CHUNK_SIZE_IN_BYTES = 32 * 1024 * 1024; // 32MiB
}
public class ChunkSource
{
public string Filename { get; set; }
public int StartPosition { get; set; }
public int Length { get; set; }
}
public class Chunk
{
private List<ChunkSource> _sources = new List<ChunkSource>();
public IList<ChunkSource> Sources { get { return _sources; } }
public byte[] Hash { get; set; }
public int Length
{
get { return Sources.Select(s => s.Length).Sum(); }
}
}
static class Program
{
static void Main()
{
DirectoryInfo di = new DirectoryInfo(#"C:\Stuff\Anime\Shikabane Hime Aka");
string[] filenames = di.GetFiles().Select(fi=> fi.FullName).OrderBy(n => n).ToArray();
var chunks = ChunkFiles(filenames);
ComputeHashes(chunks);
}
private static List<Chunk> ChunkFiles(string[] filenames)
{
List<Chunk> chunks = new List<Chunk>();
Chunk currentChunk = null;
int offset = 0;
foreach (string filename in filenames)
{
FileInfo fi = new FileInfo(filename);
if (!fi.Exists)
throw new FileNotFoundException(filename);
Debug.WriteLine(String.Format("File: {0}", filename));
//
// First, start off by either starting a new chunk or
// by finishing a leftover chunk from a previous file.
//
if (currentChunk != null)
{
//
// We get here if the previous file had leftover bytes that left us with an incomplete chunk
//
int needed = Constants.CHUNK_SIZE_IN_BYTES - currentChunk.Length;
if (needed == 0)
throw new InvalidOperationException("Something went wonky, shouldn't be here");
offset = needed;
currentChunk.Sources.Add(new ChunkSource()
{
Filename = fi.FullName,
StartPosition = 0,
Length = (int)Math.Min(fi.Length, (long)needed)
});
if (currentChunk.Length >= Constants.CHUNK_SIZE_IN_BYTES)
{
chunks.Add(currentChunk = new Chunk());
}
}
else
{
offset = 0;
}
//
// Note: Using integer division here
//
for (int i = 0; i < (fi.Length - offset) / Constants.CHUNK_SIZE_IN_BYTES; i++)
{
chunks.Add(currentChunk = new Chunk());
currentChunk.Sources.Add(new ChunkSource()
{
Filename = fi.FullName,
StartPosition = i * Constants.CHUNK_SIZE_IN_BYTES + offset,
Length = Constants.CHUNK_SIZE_IN_BYTES
});
Debug.WriteLine(String.Format("Chunk source created: Offset = {0,10}, Length = {1,10}", currentChunk.Sources[0].StartPosition, currentChunk.Sources[0].Length));
}
int leftover = (int)(fi.Length - offset) % Constants.CHUNK_SIZE_IN_BYTES;
if (leftover > 0)
{
chunks.Add(currentChunk = new Chunk());
currentChunk.Sources.Add(new ChunkSource()
{
Filename = fi.FullName,
StartPosition = (int)(fi.Length - leftover),
Length = leftover
});
}
else
{
currentChunk = null;
}
}
return chunks;
}
private static void ComputeHashes(IList<Chunk> chunks)
{
if (chunks == null || chunks.Count == 0)
return;
Dictionary<string, MemoryMappedFile> files = new Dictionary<string, MemoryMappedFile>();
foreach (var chunk in chunks)
{
MemoryMappedFile mms = null;
byte[] buffer = new byte[Constants.CHUNK_SIZE_IN_BYTES];
Stopwatch sw = Stopwatch.StartNew();
foreach (var source in chunk.Sources)
{
lock (files)
{
if (!files.TryGetValue(source.Filename, out mms))
{
Debug.WriteLine(String.Format("Opening {0}", source.Filename));
files.Add(source.Filename, mms = MemoryMappedFile.CreateFromFile(source.Filename, FileMode.Open));
}
}
var view = mms.CreateViewStream(source.StartPosition, source.Length);
view.Read(buffer, 0, source.Length);
}
Debug.WriteLine("Done reading sources in {0}ms", sw.Elapsed.TotalMilliseconds);
sw.Restart();
MD5 md5 = MD5.Create();
chunk.Hash = md5.ComputeHash(buffer);
sw.Stop();
Debug.WriteLine(String.Format("Computed hash: {0} in {1}ms", String.Join("-", chunk.Hash.Select(h=> h.ToString("X2")).ToArray()), sw.Elapsed.TotalMilliseconds));
}
foreach (var x in files.Values)
{
x.Dispose();
}
}
}
I don't guarantee everything is spotlessly free of bugs. But I did have fun working on it. Look at the output window in Visual Studio for the debug information. It looks like this:
File: C:\Stuff\Anime\Shikabane Hime Aka\Episode 02.mkv
Chunk source created: Offset = 26966010, Length = 33554432
Chunk source created: Offset = 60520442, Length = 33554432
Chunk source created: Offset = 94074874, Length = 33554432
Chunk source created: Offset = 127629306, Length = 33554432
Chunk source created: Offset = 161183738, Length = 33554432
Chunk source created: Offset = 194738170, Length = 33554432
Chunk source created: Offset = 228292602, Length = 33554432
...
Opening C:\Stuff\Anime\Shikabane Hime Aka\Episode 02.mkv
Done reading sources in 42.9362ms
The thread '' (0xc10) has exited with code 0 (0x0).
Computed hash: 3C-81-A5-2C-90-02-24-23-42-5B-19-A2-15-56-AB-3F in 94.2481ms
Done reading sources in 0.0053ms
Computed hash: 58-F0-6D-D5-88-D8-FF-B3-BE-B4-6A-DA-63-09-43-6B in 98.9263ms
Done reading sources in 29.4805ms
Computed hash: F7-19-8D-A8-FE-9C-07-6E-DB-D5-74-A6-E1-E7-A6-26 in 85.0061ms
Done reading sources in 28.4971ms
Computed hash: 49-F2-CB-75-89-9A-BC-FA-94-A7-DF-E0-DB-02-8A-99 in 84.2799ms
Done reading sources in 31.106ms
Computed hash: 29-7B-18-BD-ED-E9-0C-68-4B-47-C6-5F-D0-16-8A-44 in 84.1444ms
Done reading sources in 31.2204ms
Computed hash: F8-91-F1-90-CF-9C-37-4E-82-68-C2-44-0D-A7-6E-F8 in 84.2592ms
Done reading sources in 31.0031ms
Computed hash: 65-97-ED-95-07-31-BF-C8-3A-BA-2B-DA-03-37-FD-00 in 95.331ms
Done reading sources in 33.0072ms
Computed hash: 9B-F2-83-E6-A8-DF-FD-8D-6C-5C-9E-F4-20-0A-38-4B in 85.9561ms
Done reading sources in 31.6232ms
Computed hash: B6-7C-6B-95-69-BC-9C-B2-1A-07-B3-13-28-A8-10-BC in 84.1866ms
Here is the parallel version. It's basically the same really. Using parallelism = 3 cut the processing time down to 9 seconds.
private static void ComputeHashes(IList<Chunk> chunks)
{
if (chunks == null || chunks.Count == 0)
return;
Dictionary<string, MemoryMappedFile> files = new Dictionary<string, MemoryMappedFile>();
Parallel.ForEach(chunks, new ParallelOptions() { MaxDegreeOfParallelism = 2 }, (chunk, state, index) =>
{
MemoryMappedFile mms = null;
byte[] buffer = new byte[Constants.CHUNK_SIZE_IN_BYTES];
Stopwatch sw = Stopwatch.StartNew();
foreach (var source in chunk.Sources)
{
lock (files)
{
if (!files.TryGetValue(source.Filename, out mms))
{
Debug.WriteLine(String.Format("Opening {0}", source.Filename));
files.Add(source.Filename, mms = MemoryMappedFile.CreateFromFile(source.Filename, FileMode.Open));
}
}
var view = mms.CreateViewStream(source.StartPosition, source.Length);
view.Read(buffer, 0, source.Length);
}
Debug.WriteLine("Done reading sources in {0}ms", sw.Elapsed.TotalMilliseconds);
sw.Restart();
MD5 md5 = MD5.Create();
chunk.Hash = md5.ComputeHash(buffer);
sw.Stop();
Debug.WriteLine(String.Format("Computed hash: {0} in {1}ms", String.Join("-", chunk.Hash.Select(h => h.ToString("X2")).ToArray()), sw.Elapsed.TotalMilliseconds));
});
foreach (var x in files.Values)
{
x.Dispose();
}
}
EDIT
I found a bug, or what I think is a bug. Need to set the read offset to 0 if we're starting a new file.
EDIT 2 based on feedback
This processes the hashes in a separate thread. It's necessary to throttle the I/O; I was running into an OutOfMemoryException without doing so. It doesn't really perform that much better, though. Beyond this I'm not sure how it can be improved any further - perhaps by reusing the buffers (a rough sketch follows after the code).
public class QueueItem
{
public Chunk Chunk { get; set; }
public byte[] buffer { get; set; }
}
private static void ComputeHashes(IList<Chunk> chunks)
{
if (chunks == null || chunks.Count == 0)
return;
Dictionary<string, MemoryMappedFile> files = new Dictionary<string, MemoryMappedFile>();
foreach (var filename in chunks.SelectMany(c => c.Sources).Select(c => c.Filename).Distinct())
{
files.Add(filename, MemoryMappedFile.CreateFromFile(filename, FileMode.Open));
}
AutoResetEvent monitor = new AutoResetEvent(false);
ConcurrentQueue<QueueItem> hashQueue = new ConcurrentQueue<QueueItem>();
CancellationToken token = new CancellationToken();
Task.Factory.StartNew(() =>
{
int processCount = 0;
QueueItem item = null;
while (!token.IsCancellationRequested)
{
if (hashQueue.TryDequeue(out item))
{
MD5 md5 = MD5.Create();
item.Chunk.Hash = md5.ComputeHash(item.buffer);
if (processCount++ > 1000)
{
processCount = 0;
monitor.Set();
}
}
}
}, token);
foreach (var chunk in chunks)
{
if (hashQueue.Count > 10000)
{
monitor.WaitOne();
}
QueueItem item = new QueueItem()
{
buffer = new byte[Constants.CHUNK_SIZE_IN_BYTES],
Chunk = chunk
};
Stopwatch sw = Stopwatch.StartNew();
foreach (var source in chunk.Sources)
{
MemoryMappedFile mms = files[source.Filename];
var view = mms.CreateViewStream(source.StartPosition, source.Length);
view.Read(item.buffer, 0, source.Length);
}
sw.Restart();
sw.Stop();
hashQueue.Enqueue(item);
}
foreach (var x in files.Values)
{
x.Dispose();
}
}
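For the buffer-reuse idea, a minimal sketch (illustrative only; it trades the per-chunk allocation for a small pool of reusable arrays and assumes using System.Collections.Concurrent;):
// Hypothetical sketch: the reading loop would call BufferPool.Rent() when building
// each QueueItem, and the hashing task would call BufferPool.Return(item.buffer)
// once ComputeHash is done with that buffer.
public static class BufferPool
{
    private static readonly ConcurrentBag<byte[]> _pool = new ConcurrentBag<byte[]>();
    public static byte[] Rent()
    {
        byte[] buffer;
        return _pool.TryTake(out buffer) ? buffer : new byte[Constants.CHUNK_SIZE_IN_BYTES];
    }
    public static void Return(byte[] buffer)
    {
        _pool.Add(buffer);
    }
}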
I'm new to C# too, but I think what you are looking for is the System.IO.MemoryMappedFiles namespace, available since .NET 4.0.
Using these API functions, the operating system itself takes care of how the current file region is managed in memory.
Instead of copy-and-pasting code here, continue reading this article: http://www.developer.com/net/article.php/3828586/Using-Memory-Mapped-Files-in-NET-40.htm
Regarding the MD5, use the System.Security.Cryptography.MD5CryptoServiceProvider class. Maybe it's faster.
In your case, where you have to go over the "boundaries" of one file, just do it: let the operating system handle how the memory-mapped files are represented in memory, and work as you would with "small" buffers.
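A minimal sketch of hashing one fixed-size window of a memory-mapped file with that class (the path and window size are placeholders; it assumes using System.IO;, using System.IO.MemoryMappedFiles; and using System.Security.Cryptography;):
string path = @"C:\data\big.bin"; // hypothetical file
int window = (int)Math.Min(new FileInfo(path).Length, 32L * 1024 * 1024);
byte[] buffer = new byte[window];
using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
using (MemoryMappedViewStream view = mmf.CreateViewStream(0, window))
using (MD5CryptoServiceProvider md5 = new MD5CryptoServiceProvider())
{
    // Read exactly the window; the view itself may be padded to a page boundary.
    view.Read(buffer, 0, window);
    byte[] hash = md5.ComputeHash(buffer);
    Console.WriteLine(BitConverter.ToString(hash).Replace("-", ""));
}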
In .NET 4 you now have System.IO.MemoryMappedFiles.
You can create a ViewAccessor of a particular chunk size to match your hash function, and then just keep filling your hash buffer from the current ViewAccessor; when you run out of file, start chunking the next file, using the current hash chunk offset as your ViewAccessor offset.
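A rough sketch of that idea (the helper name is illustrative, it uses view streams rather than a raw ViewAccessor for brevity, and it assumes using System.IO.MemoryMappedFiles; and using System.Security.Cryptography; plus the usual System.IO and System.Collections.Generic usings):
// Hypothetical sketch: walk the files in order, pull bytes from memory-mapped
// view streams, and emit one SHA1 per fixed-size chunk, allowing a chunk to
// span the boundary between two files.
static List<byte[]> HashChunks(IEnumerable<string> paths, int chunkSize)
{
    List<byte[]> hashes = new List<byte[]>();
    byte[] buffer = new byte[chunkSize];
    int filled = 0;
    foreach (string path in paths)
    {
        long remaining = new FileInfo(path).Length;
        using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (MemoryMappedViewStream view = mmf.CreateViewStream(0, remaining))
        {
            // Keep topping up the chunk buffer from wherever the previous file left off.
            while (remaining > 0)
            {
                int toRead = (int)Math.Min(chunkSize - filled, remaining);
                int read = view.Read(buffer, filled, toRead);
                if (read == 0) break;
                filled += read;
                remaining -= read;
                if (filled == chunkSize)
                {
                    using (SHA1 sha1 = new SHA1CryptoServiceProvider())
                        hashes.Add(sha1.ComputeHash(buffer));
                    filled = 0;
                }
            }
        }
    }
    if (filled > 0) // hash the final, partial chunk
    {
        using (SHA1 sha1 = new SHA1CryptoServiceProvider())
            hashes.Add(sha1.ComputeHash(buffer, 0, filled));
    }
    return hashes;
}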