When you ask someone to send you a contract or a report, there is a high probability that you’ll get a DOCX file. Whether you like it or not, it makes sense considering that 1.2 billion people use Microsoft Office, although the definition of “use” is quite vague in this case. DOCX is a zipped package of XML files which, unlike XLSX, is not famous for being easy to integrate into your application. PDF is much easier when you care more about how a document is displayed than about its ability to be modified further. Let’s focus on that.
Python has a few great libraries to work with DOCX (python-docx) and PDF files (PyPDF2, pdfrw). Those are good choices and a lot of fun for reading or writing files. That said, I know I'd fail miserably trying to achieve 1:1 conversion with them.
A quick note on python-docx (release v0.8.10): it is a Python library for creating and updating Microsoft Word (.docx) files. Word documents contain formatted text wrapped within three object levels: Run objects at the lowest level, Paragraph objects in the middle, and the Document object at the top. python-docx does not automatically set any of the document core properties, other than adding a core properties part to a document that doesn’t have one (very uncommon). If python-docx adds a core properties part, it contains default values for the title, last_modified_by, revision, and modified properties. To populate an MS Word template from Python (a mail merge), docx-mailmerge needs a standard Word document with the appropriate merge fields defined.
Looking further, I came across unoconv. Universal Office Converter is a tool that converts any document format supported by LibreOffice/OpenOffice. That sounds like a solid solution for my use case, where I care more about quality than anything else. As execution time isn't my problem, I was only concerned about whether it’s possible to run LibreOffice without an X display. Apparently, LibreOffice can be run in headless mode and supports conversion between various formats, sweet!
I’m grateful to unoconv for the idea and for a great README explaining multiple problems I could come across. At the same time, I’m put off by the number of open issues and abandoned pull requests. If I get the versions right, how hard can it be? Not hard at all, with a few caveats though.
LibreOffice is available on all major platforms and has an active community. It's not new-hot-JS-framework active, but there is still plenty of good reading and support. You can get your copy from the download page. Be a good user and go with the up-to-date version. You can always downgrade in case of any problems, and feedback on the latest release is always appreciated.
On macOS and Windows the executable is called soffice, and libreoffice on Linux. I'm on macOS; the soffice executable isn't available in my PATH after the installation, but you can find it inside LibreOffice.app. To test how LibreOffice deals with your files you can run:
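As a minimal sketch, a one-off headless test conversion can be wrapped in Python like this. The macOS path below is the usual install location and is an assumption about your setup, and `build_test_command` and `try_conversion` are hypothetical helper names, not part of LibreOffice.

```python
import subprocess

# Assumed default location of the binary on macOS; adjust for your install.
MACOS_SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def build_test_command(source, executable=MACOS_SOFFICE):
    """Return the argument list for a headless DOCX -> PDF test conversion."""
    return [executable, "--headless", "--convert-to", "pdf", source]


def try_conversion(source, executable=MACOS_SOFFICE):
    """Run the conversion and return the process exit code.

    Without --outdir, LibreOffice writes the PDF to the current directory.
    """
    completed = subprocess.run(build_test_command(source, executable))
    return completed.returncode
```

Calling `try_conversion("test.docx")` from a directory that contains `test.docx` should leave a `test.pdf` next to it if LibreOffice can read the file.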
In my case the results were more than satisfying. The only problem I saw was a misalignment in a file where the alignment was done with spaces, sad but true. This problem was caused by missing fonts and the different widths of the replacement fonts. No worries, we'll address this problem later.
While reading the unoconv issues I noticed that many problems are caused by mismatched versions. I'm going with Docker so I can have a pretty stable setup and be sure that everything works.
Let's start by defining a simple Dockerfile, with just the dependencies, and ADD one DOCX file just for testing:
Let's build an image:
After the image is created, we can run the container and convert the file inside the container:
Running LibreOffice as a subprocess
We want to run the LibreOffice converter as a subprocess and provide the same API for all platforms. Let's define a module which can be run as a standalone script or imported later on our server.
The required arguments which convert_to accepts are the folder to which we save the PDF and a path to the source file. Optionally, we specify a timeout in seconds. I’m saying optional, but consider it mandatory. We don’t want a process to hang too long in case of any problems, or we may just want to limit the computation time we are able to give to each conversion. The LibreOffice executable location and name depend on the platform, so edit libreoffice_exec to support the platform you’re using.
subprocess.run doesn’t capture stdout and stderr by default. We can easily change the default behavior by passing subprocess.PIPE. Unfortunately, in the case of a failure, LibreOffice will exit with return code 0 and nothing will be written to stderr. I decided to look for the success message, assuming that it won’t be there in case of an error, and raise LibreOfficeError otherwise. This approach hasn’t failed me so far.
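A module along these lines could look like the sketch below. The names convert_to, libreoffice_exec and LibreOfficeError come from the text above; `parse_converted_path` is a hypothetical helper, and both the executable paths and the shape of the success line (`convert <source> -> <target> using filter : ...`) are assumptions you should check against your own LibreOffice version.

```python
import re
import subprocess
import sys


class LibreOfficeError(Exception):
    """Raised when the expected success message is missing from the output."""

    def __init__(self, output):
        super().__init__(output)
        self.output = output


def libreoffice_exec():
    # Assumed default locations; extend this for the platforms you support.
    if sys.platform == "darwin":
        return "/Applications/LibreOffice.app/Contents/MacOS/soffice"
    return "libreoffice"


def parse_converted_path(stdout_text):
    # Assumption: a successful run prints "... -> <target> using filter ...".
    match = re.search(r"-> (.*?) using filter", stdout_text)
    if match is None:
        # Return code is 0 even on failure, so a missing message is the signal.
        raise LibreOfficeError(stdout_text)
    return match.group(1)


def convert_to(folder, source, timeout=None):
    """Convert `source` to PDF inside `folder` and return the PDF path."""
    args = [libreoffice_exec(), "--headless", "--convert-to", "pdf",
            "--outdir", folder, source]
    # Capture both streams explicitly; subprocess.run does not by default.
    process = subprocess.run(args, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, timeout=timeout)
    return parse_converted_path(process.stdout.decode())
```

Keeping the parsing in its own function makes the success-message check easy to test without LibreOffice installed. If the timeout is exceeded, subprocess.run raises subprocess.TimeoutExpired, which the caller has to handle.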
Uploading files with Flask
Converting using the command line is ok for testing and development but won't take us far. Let's build a simple server in Flask.
We'll need a few helper functions to work with files and a few custom errors for handling error messages. The upload directory path is defined in config.py. You can also consider using flask-restplus or flask-restful, which make handling errors a little easier.
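To give an idea of what such helpers and errors might look like, here is a standard-library-only sketch. Every name in it (UPLOADS_DIR, RestAPIError, BadRequestError, safe_basename, save_to, uploads_url) is hypothetical, and it assumes the uploaded file object exposes `filename` and `save(path)` the way Flask's upload objects do. In a real Flask app you would probably reach for werkzeug's secure_filename instead of the simple sanitizer used here.

```python
import os

# Hypothetical value; the article keeps the real one in config.py.
UPLOADS_DIR = "/tmp/uploads"


class RestAPIError(Exception):
    """Base error carrying an HTTP status code and a JSON-friendly payload."""

    def __init__(self, status_code=500, payload=None):
        super().__init__(payload)
        self.status_code = status_code
        self.payload = payload


class BadRequestError(RestAPIError):
    def __init__(self, payload=None):
        super().__init__(400, payload)


def safe_basename(filename):
    """Strip directory components so an upload cannot escape its folder."""
    name = os.path.basename(filename.replace("\\", "/"))
    if name in ("", ".", ".."):
        raise BadRequestError({"message": "Invalid file name."})
    return name


def save_to(folder, file):
    """Save an uploaded file into `folder` and return the path it was saved to."""
    os.makedirs(folder, exist_ok=True)
    save_path = os.path.join(folder, safe_basename(file.filename))
    file.save(save_path)
    return save_path


def uploads_url(path, uploads_dir=UPLOADS_DIR):
    """Map a path inside the uploads directory to a public /uploads URL."""
    return path.replace(uploads_dir, "/uploads", 1)
```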
The server is pretty straightforward. In production, you would probably want to use some kind of authentication to limit access to the uploads directory. If not, give up on serving static files with Flask and go for Nginx.
An important takeaway from this example is that you want to tell your app to be threaded so one request won't prevent other routes from being served. However, the WSGI server included with Flask is not production ready and focuses on development. In production, you want to use a proper server with automatic worker process management, like gunicorn. Check the docs for an example of how to integrate gunicorn into your app. We are going to run the application inside a container, so the host has to be set to a publicly visible address.
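A stripped-down version of such a server might look like the sketch below. The /upload route, the "file" form field, the port and the `main` function are assumptions made for illustration, and the actual conversion call is left out. Binding to 0.0.0.0 is one common way to make the development server reachable from outside a container.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class BadRequestError(Exception):
    """Hypothetical custom error mapped to an HTTP 400 response."""

    status_code = 400


@app.errorhandler(BadRequestError)
def handle_bad_request(error):
    # Turn the custom error into a JSON response with the right status code.
    return jsonify({"message": str(error)}), error.status_code


@app.route("/upload", methods=["POST"])
def upload_file():
    upload = request.files.get("file")
    if upload is None or upload.filename == "":
        raise BadRequestError("No file was sent in the 'file' field.")
    # The real app would save the upload and run the conversion here.
    return jsonify({"received": upload.filename})


def main():
    # threaded=True lets the development server handle requests concurrently;
    # host 0.0.0.0 listens on all interfaces, which a container needs.
    app.run(host="0.0.0.0", port=5000, threaded=True)
```

Calling `main()` starts the development server. For production, the same `app` object can be handed to gunicorn instead.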
Now that we have a server, we can update the Dockerfile. We need to copy our application source code to the image filesystem and install the required dependencies.
In docker-compose.yml we want to specify the port mapping and mount a volume. If you followed the code and tried running the examples, you have probably noticed that we were missing a way to tell Flask to run in debugging mode. Defining an environment variable without a value causes that variable to be passed to the container from the host system. Alternatively, you can provide different config files for different environments.
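On the Python side, picking up such a variable could be sketched as follows. FLASK_DEBUG is a variable Flask itself understands, but using it here is my assumption about the setup; APP_ENV, the CONFIGS dictionary and both function names are hypothetical.

```python
import os

# Hypothetical per-environment settings; the real config.py is not shown here.
CONFIGS = {
    "development": {"debug": True, "uploads_dir": "/tmp/uploads"},
    "production": {"debug": False, "uploads_dir": "/var/app/uploads"},
}


def env_flag(name, environ=os.environ):
    """Interpret an environment variable as an on/off switch."""
    return environ.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def load_config(environ=os.environ):
    # APP_ENV is an assumed variable name that selects a settings block.
    settings = dict(CONFIGS[environ.get("APP_ENV", "production")])
    # A FLASK_DEBUG value passed through from the host overrides the default.
    if "FLASK_DEBUG" in environ:
        settings["debug"] = env_flag("FLASK_DEBUG", environ)
    return settings
```

With this in place, the container only turns debugging on when the host exports the variable, and production stays the default.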
Supporting custom fonts
I've mentioned a problem with missing fonts earlier. LibreOffice can, of course, make use of custom fonts. If you can predict which fonts your users might be using, there's a simple remedy. Add the following line to your Dockerfile:
Now, when you put a custom font file in the font directory in your project, rebuild the image. From now on you support custom fonts!
This should give you an idea of how you can provide quality conversion of different documents to PDF. Although the main goal was to convert a DOCX file, you should be fine with presentations, spreadsheets or images.
A further improvement could be support for multiple files; the converter can be configured to accept more than one file as well.
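As a rough sketch of that idea, several source files can be passed to a single LibreOffice invocation and the output scanned for one success line per file. The function names are hypothetical, and the format of the success line is the same assumption as before, so verify it against your LibreOffice version.

```python
import re
import subprocess


def build_batch_args(executable, folder, sources):
    """Argument list for converting several files in one headless run."""
    return [executable, "--headless", "--convert-to", "pdf",
            "--outdir", folder, *sources]


def parse_converted_paths(stdout_text):
    # Assumption: each converted file prints "... -> <target> using filter ...".
    return re.findall(r"-> (.*?) using filter", stdout_text)


def convert_many(executable, folder, sources, timeout=None):
    """Convert all `sources` to PDF and return the list of produced paths."""
    process = subprocess.run(build_batch_args(executable, folder, sources),
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                             timeout=timeout)
    converted = parse_converted_paths(process.stdout.decode())
    if len(converted) != len(sources):
        # At least one file produced no success message.
        raise RuntimeError("Only %d of %d files were converted"
                           % (len(converted), len(sources)))
    return converted
```

Note that a single timeout now covers the whole batch, so it should grow with the number of files.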
Photo by Samuel Zeller on Unsplash.
Python Docx Documentation
The docx module creates, reads and writes Microsoft Office Word 2007 docx files
These are referred to as ‘WordML’, ‘Office Open XML’ and ‘Open XML’ by Microsoft.
These documents can be opened in Microsoft Office 2007 / 2010, Microsoft Mac Office 2008, Google Docs, OpenOffice.org 3, and Apple iWork 08.
They also validate as well formed XML.
The module was created when I was looking for Python support for MS Word .docx files, but could only find various hacks involving COM automation, calling .Net or Java, or automating OpenOffice or MS Office.
The docx module has the following features:
Features for making documents include:
- Numbered lists
- Document properties (author, company, etc)
- Multiple levels of headings
- Section and page breaks
Thanks to the awesomeness of the lxml module, we can:
- Search and replace
- Extract plain text of document
- Add and delete items anywhere within the document
- Change document properties
- Run xpath queries against particular locations in the document - useful for retrieving data from user-completed templates.
Making and Modifying Documents
Just download python docx.
Use pip or easy_install to fetch the lxml and PIL modules.
Congratulations, you just made and then modified a Word document!
Extracting Text from a Document
If you just want to extract the text from a Word file, run:
Ideas & To Do List
- Further improvements to image handling
- Document health checks
- Markdown conversion support
We love forks, changes and pull requests!
- Check out the [HACKING](HACKING.markdown) to add your own changes!
- Fork this project on github
- Send a pull request via github and we’ll add your changes!
Want to talk? Need help?
Licensed under the MIT license
Short version: this code is copyrighted to me (Mike MacCana), I give you permission to do what you want with it except remove my name from the credits. See the LICENSE file for specific terms.