I am trying to read an Excel file with the Open XML SDK.
This is what I did:
private WorkbookPart wbPart = null;
private SpreadsheetDocument document = null;
public byte[] GetExcelReport()
{
byte[] original = File.ReadAllBytes(this.originalFilename);
using (MemoryStream stream = new MemoryStream())
{
stream.Write(original, 0, original.Length);
using (SpreadsheetDocument excel = SpreadsheetDocument.Open(stream, true))
{
this.document = excel;
this.wbPart = document.WorkbookPart;
UpdateValue();
}
stream.Seek(0, SeekOrigin.Begin);
byte[] data = stream.ToArray();
return data;
}
}
I initialize this.originalFilename in the constructor. It is the name of an '.xlsx' file that I created with Excel 2010.
But this line of code
using (SpreadsheetDocument excel = SpreadsheetDocument.Open(stream, true))
throws this exception: System.IO.FileFormatException: File contains corrupted data.
Does anyone know how to solve this problem? At first I didn't use a stream; I just called SpreadsheetDocument.Open(filename, true), but that throws exactly the same exception.
I've tried to create a new .xlsx file, but it's still the same.
There is an MSDN page that describes reading and writing Excel files using a stream and the Open XML SDK:
http://msdn.microsoft.com/en-us/library/office/ff478410.aspx
Try extracting the document's contents with a zip application and check whether the standard folders such as xl, docProps and _rels are inside.
This is a way to find out whether the package is a properly formed archive.
Hope this helps.
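The same check can also be done in code rather than by hand. Below is a minimal Java sketch using only java.util.zip; the class name XlsxPackageCheck and the method missingFolders are hypothetical names made up for this illustration, and the list of expected folders is just the three named above (a valid package is not strictly required to contain all of them, so treat a non-empty result as a hint, not proof of corruption).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: reports which of the usual top-level folders
// are absent from a zip-based Office package.
final class XlsxPackageCheck {
    // Folders a typical .xlsx package contains (assumption for this sketch).
    private static final String[] EXPECTED = {"xl/", "docProps/", "_rels/"};

    /** Returns the expected folder prefixes that no entry in the archive starts with. */
    static Set<String> missingFolders(InputStream in) throws IOException {
        Set<String> missing = new HashSet<>(Arrays.asList(EXPECTED));
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // For input that is not a zip archive the loop body never runs,
            // so every expected folder is reported as missing.
            while ((entry = zip.getNextEntry()) != null) {
                final String name = entry.getName();
                missing.removeIf(name::startsWith);
            }
        }
        return missing;
    }
}
```

Calling XlsxPackageCheck.missingFolders(new FileInputStream(path)) on the workbook should return an empty set for a healthy file; if it returns all three folders, the file is probably not a zip archive at all.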
I've been working on a text extractor for .docx files using Tika. It works fine for basic text and for text in tables and text boxes, but it fails for images.
How do I get text from an image? Tesseract together with Tika can extract text from a standalone image, but for that I first need to extract the image from the document. How do I do this?
Kindly help if anybody has worked on something like this.
This is the code that works fine for text, text boxes and tables, but not for images:
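import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class BasicDocumentExtractor {
    public static void main(final String[] args) throws IOException, SAXException, TikaException {
        // Collects the body text of the parsed document
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();
        // OOXML parser
        OOXMLParser msofficeparser = new OOXMLParser();
        // try-with-resources closes the input stream once parsing is done
        try (InputStream inputstream = new FileInputStream(new File("D:\\Nidhi\\sw\\ws\\Hello.docx"))) {
            msofficeparser.parse(inputstream, handler, metadata, pcontext);
        }
        System.out.println("Contents of the document:" + handler.toString());
        /*System.out.println("Metadata of the document:");
        String[] metadataNames = metadata.names();
        for(String name : metadataNames){
            System.out.println(name + ": " + metadata.get(name));
        }*/
    }
}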
You need to enable recursion in Tika in order to get the embedded images. The simplest way is normally just to use the RecursiveParserWrapper to do it for you.
If you use it, your code would instead be roughly
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
TikaInputStream input = TikaInputStream.get(new File("D:\\Nidhi\\sw\\ws\\Hello.docx"));
Parser wrapped = new AutoDetectParser();
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped,
        new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, 60));
// Pass the input stream and parse context declared above
wrapper.parse(input, handler, metadata, context);
// Get metadata from children
List<Metadata> list = wrapper.getMetadata();
// Get metadata from main document
System.out.println("Main doc name is " + metadata.get(TikaCoreProperties.TITLE));
System.out.println("Contents of the document:" + handler.toString());
After trying hard to do this for the last 24 hours, I figured out a pretty easy way. Since Tika is built on top of POI, this task can be done efficiently with POI directly. It is not an obvious solution and almost no tutorials cover it, so I hope this saves somebody else the trouble in future. This is the running code that extracts all images from a .docx document:
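public static void getImages() throws Exception {
    // try-with-resources closes the input file when we are done
    try (FileInputStream in = new FileInputStream("D:\\Nidhi\\CDAC\\Images\\test1.docx")) {
        XWPFDocument doc = new XWPFDocument(in);
        List<XWPFPictureData> images = doc.getAllPictures();
        for (int i = 0; i < images.size(); i++) {
            XWPFPictureData pic = images.get(i);
            System.out.println(pic.getFileName() + " " + pic.getPictureType() + " " + pic.getData().length + " bytes");
            // Use the picture's own extension instead of assuming every image is a JPEG
            String target = "D:\\Nidhi\\CDAC\\Images\\b" + i + "." + pic.suggestFileExtension();
            try (FileOutputStream fos = new FileOutputStream(target)) {
                fos.write(pic.getData());
            }
        }
    }
}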
This works for Word 2007+ (.docx) files; for .doc and similar older files, use HWPF in much the same manner.
I'm writing a program (in C#) that replaces a local workbook with the copy from a server if the server version is higher, and then opens it. To this end I'm trying to read the custom property "Revision Number" of both the local and server copies. The issue is that the workbook contains macros that launch on open, and I don't want to run any macros just to check the revision number. So is there a way to read the Revision Number of an Excel 2007 .xlsm file without actually opening it? If not, is there a way to open a workbook in C# without executing its macros?
Actually, I tried tkacprow's suggestion to use OpenXML and it worked. It took me a while to produce working code, and I only got it working yesterday. Fratyx, your tip also looks interesting; I'll keep that in mind. Here's the working code:
public string GetVersion(string fileName)
{
string propertyValue = string.Empty;
try
{
using (var wb = SpreadsheetDocument.Open(fileName, false))
{
const string corePropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
const string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
const string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";
// Get the core properties part (core.xml).
CoreFilePropertiesPart xCoreFilePropertiesPart;
xCoreFilePropertiesPart = wb.CoreFilePropertiesPart;
// Manage namespaces to perform XML XPath queries.
NameTable nt = new NameTable();
XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
nsManager.AddNamespace("cp", corePropertiesSchema);
nsManager.AddNamespace("dc", dcPropertiesSchema);
nsManager.AddNamespace("dcterms", dcTermsPropertiesSchema);
// Get the properties from the package.
XmlDocument xdoc = new XmlDocument(nt);
// Load the XML in the part into an XmlDocument instance.
xdoc.Load(xCoreFilePropertiesPart.GetStream());
string searchString = string.Format("//cp:coreProperties/{0}", "cp:version");
XmlNode xNode = xdoc.SelectSingleNode(searchString, nsManager);
if (xNode != null)
{
//Console.WriteLine(" version is " + xNode.InnerText);
propertyValue = xNode.InnerText;
}
}
}
catch (OpenXmlPackageException e)
{
throw new ApplicationException(String.Format("Incorrect Format detected in a file: {0}" , fileName),e.GetBaseException());
}
return propertyValue;
}
I'd like to open a new window with PDF content that is placed within a String variable.
I already have a button with an event connected. In this event I want to call the new window.
The method looks like this:
private void show_archivobjekt(String data) {
String pdf = anfrage.get_archivobjectdata(data);
System.out.println(pdf); // This shows my PDF content in console and works!
// How to convert this String into a StreamSource
StreamResource streamResource = new StreamResource(pdfss, "test.pdf", myView);
streamResource.setCacheTime(5000);
streamResource.setMIMEType("application/pdf");
myView.getMainWindow().open(streamResource, "_blank");
}
myView is the Application.
How can I convert the String pdf to a StreamSource (pdfss)? Do I have to save it as a file first, or is it possible to convert it to a StreamSource directly in memory?
The console output shows the typical PDF content, starting with %PDF-1.3 and so on.
Any help would be appreciated. Thanks in advance!
Rainer
The answer to this question is available through the official support forum here: https://vaadin.com/forum#!/thread/148544
Simply create your StreamResource like this, by using the byte representation of your string to create a ByteArrayInputStream as the source:
StreamResource streamResource = new StreamResource(
new StreamResource.StreamSource() {
public InputStream getStream() {
return new ByteArrayInputStream(pdf.getBytes());
}
}, "test.pdf");
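One caveat about the pdf.getBytes() call in the snippet above: the no-argument form encodes with the platform default charset, so a String holding binary PDF content may not turn back into the original bytes. Here is a minimal sketch of an explicit-charset variant. It assumes the String was originally produced by decoding the raw PDF bytes as ISO-8859-1 (the question does not say how the String was built, so this is only an assumption), and PdfStringBytes, toBytes and toStream are hypothetical names for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for turning a PDF held in a String back into bytes.
final class PdfStringBytes {
    /**
     * Encodes each char as exactly one byte. This only restores the original
     * bytes if the String was decoded from them as ISO-8859-1 (assumption).
     */
    static byte[] toBytes(String pdf) {
        return pdf.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** In-memory stream over the encoded bytes, suitable for a stream source. */
    static InputStream toStream(String pdf) {
        return new ByteArrayInputStream(toBytes(pdf));
    }
}
```

Inside getStream() you would then return PdfStringBytes.toStream(pdf) instead of building the ByteArrayInputStream from pdf.getBytes(). If the String came from some other decoding step, the matching charset has to be used instead, or better, the data should be kept as a byte[] from the start.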
I need a solution that creates an InfoPath form instance from an XSN template that exists on a SharePoint server. I am using this approach, but it extracts the template files to the server's temp directory, which we may not have write permission to. Is there a better solution for this?
Just change the CAB library to one that can extract the template file to memory, such as this one:
Minimum C# code to extract from .CAB archives or InfoPath XSN files, in memory
Then call myCab.ExtractFile("template.xml", out buffer, out bufferLen);
The complete code would look something like this:
private byte[] GetXmlForm(SPDocumentLibrary list) {
byte[] data = null;
SPFile file = list.ParentWeb.GetFile(list.DocumentTemplateUrl);
Stream fs = file.OpenBinaryStream();
try {
data = new byte[fs.Length];
fs.Read(data, 0, data.Length);
} finally {
fs.Close();
}
byte[] buffer;
int bufferLen;
CabExtract cab = new CabExtract(data);
cab.ExtractFile("template.xml", out buffer, out bufferLen);
return buffer;
}
I have a form library in my SharePoint site, and I need to fill in some fields programmatically. Can I do that? If anyone knows how, please provide some sample code. First I need to retrieve the InfoPath document, and then I need to fill in the fields.
What axel_c posted is pretty dang close. Here's some cleaned up and verified working code...
public static void ChangeFields()
{
//Open SharePoint site
using (SPSite site = new SPSite("http://<SharePoint_Site_URL>"))
{
using (SPWeb web = site.OpenWeb())
{
//Get handle for forms library
SPList formsLib = web.Lists["FormsLib"];
if (formsLib != null)
{
foreach (SPListItem item in formsLib.Items)
{
XmlDocument xml = new XmlDocument();
//Open XML file and load it into XML document
using (Stream s = item.File.OpenBinaryStream())
{
xml.Load(s);
}
//Do your stuff with xml here. This is just an example of setting a boolean field to false.
XmlNodeList nodes = xml.GetElementsByTagName("my:SomeBooleanField");
foreach (XmlNode node in nodes)
{
node.InnerText = "0";
}
//Get binary data for new XML
byte[] xmlData = System.Text.Encoding.UTF8.GetBytes(xml.OuterXml);
using (MemoryStream ms = new MemoryStream(xmlData))
{
//Write data to SharePoint XML file
item.File.SaveBinary(ms);
}
}
}
}
}
}
The InfoPath document is just a regular XML file, the structure of which matches the data sources you defined in the InfoPath form.
You just need to access the file via the SharePoint object model, modify it using standard methods (XmlDocument API) and then write it back to the SharePoint list. You must be careful to preserve the structure and insert valid data or you won't be able to open the form using InfoPath.
You should really check out a book on SharePoint if you plan to do any serious development. InfoPath is also a minefield.
Object model usage examples: here, here and here. The ridiculously incomplete MSDN reference documentation is here.
EDIT: here is some example code. I haven't done SharePoint for a while so I'm not sure this is 100% correct, but it should give you enough to get started:
// Open SharePoint site
using (SPSite site = new SPSite("http://<SharePoint_Site_URL>"))
{
    using (SPWeb web = site.OpenWeb())
    {
        // Get handle for forms library
        SPList formsLib = web.Lists["FormsLib"];
        if (formsLib != null)
        {
            SPListItem itm = formsLib.Items["myform.xml"];
            XmlDocument xml = new XmlDocument();
            // Open the file as a stream and load it into the XML document
            using (Stream s = itm.File.OpenBinaryStream())
            {
                xml.Load(s);
            }
            // Do your stuff with xml here ...
            // Get binary data for new XML. OuterXml of the whole document keeps
            // the XML declaration and processing instructions.
            byte[] xmlData = System.Text.Encoding.UTF8.GetBytes(xml.OuterXml);
            using (MemoryStream ms = new MemoryStream(xmlData))
            {
                // Write data to SharePoint item
                itm.File.SaveBinary(ms);
            }
            itm.Update();
        }
    }
    // No explicit Close() calls needed: the using blocks dispose web and site
}
It depends a bit on your available tool set, skills and exact requirements.
There are 2 main ways of pre-populating data inside an InfoPath form:
1. Export the relevant fields as part of the form's publishing process. The fields will then become columns on the Document / Forms library from where you can manipulate them either manually, via a Workflow or wherever your custom code is located.
2. Directly manipulate the form using code similar to what was provided by Axel_c previously. The big question here is: what will trigger this code? An event receiver on the Document Library, a SharePoint Designer Workflow, a Visual Studio workflow etc?
If you are trying to do this from a SharePoint Designer workflow then have a look at the Workflow Power Pack for SharePoint. It allows C# and VB code to be embedded directly into the workflow without the need for complex Visual Studio development. An example of how to query InfoPath data from a workflow can be found here. If you have some development skills you should be able to amend it to suit your needs.
I also recommend the site www.infopathdev.com, they have excellent and active forums. You will almost certainly find an answer to your question there.
Thanks for the sample code, #axel_c and #Jeff Burt
Below is the same code from Jeff Burt, modified to work on a file in a Document Set, which is what I needed. If you don't already have the Document Set reference, this site shows how to get one:
http://howtosharepoint.blogspot.com/2010/12/programmatically-create-document-set.html
Also, the code opens the .xml version of the InfoPath form, not the .xsn template version, which you might run into.
Thanks again everyone...
private void ChangeFields(DocumentSet docSet)
{
string extension = "";
SPFolder documentsetFolder = docSet.Folder;
foreach (SPFile file in documentsetFolder.Files)
{
extension = Path.GetExtension(file.Name);
if (extension != ".xml") //skip files that are not .xml and move on to the next one
continue;
XmlDocument xml = new XmlDocument();
//Open XML file and load it into XML document, needs to be .xml file not .xsn
using (Stream s = file.OpenBinaryStream())
{
xml.Load(s);
}
//Do your stuff with xml here. This is just an example of setting a field's text to a new value.
XmlNodeList nodes = xml.GetElementsByTagName("my:fieldtagname");
foreach (XmlNode node in nodes)
{
node.InnerText = "xyz";
}
//Get binary data for new XML
byte[] xmlData = System.Text.Encoding.UTF8.GetBytes(xml.OuterXml);
using (MemoryStream ms = new MemoryStream(xmlData))
{
//Write data to SharePoint XML file
file.SaveBinary(ms);
}
}
}
I had this issue and resolved it with help from Jeff Burt / Axel_c's posts.
I was trying to use the XmlDocument.Save([stream]) and SPItem.File.SaveBinary([stream]) methods to write an updated InfoPath XML file back to a SharePoint library. It appears that XmlDocument.Save([stream]) writes the file back to SharePoint with the wrong encoding, regardless of what it says in the XML declaration.
When trying to open the updated InfoPath form I kept getting the error "a calculation in the form has not been completed..."
I've written these two functions to get and update an InfoPath form. Just manipulate the XML returned from ReadSPFiletoXMLdocument() in the usual way and send it back to your server using WriteXMLtoSPFile().
private System.Xml.XmlDocument ReadSPFiletoXMLdocument(SPListItem item)
{
//get SharePoint file XML
System.Xml.XmlDocument xDoc = new System.Xml.XmlDocument();
try
{
using (System.IO.Stream xmlStream = item.File.OpenBinaryStream())
{
xDoc.Load(xmlStream);
}
}
catch (Exception ex)
{
//put your own error handling here
}
return xDoc;
}
private void WriteXMLtoSPFile(SPListItem item, XmlDocument xDoc)
{
byte[] xmlData = System.Text.Encoding.UTF8.GetBytes(xDoc.OuterXml);
try
{
using (System.IO.MemoryStream outStream = new System.IO.MemoryStream(xmlData))
{
item.File.SaveBinary(outStream);
}
}
catch (Exception ex)
{
//put your own error handling here
}
}