One way to gain a quick familiarity with NeXus is to start working with some data. For at least the first few examples in this section, we have a simple two-column set of 1-D data, collected as part of a series of alignment scans by the APS USAXS instrument during the time it was stationed at beam line 32ID. We will show how to write this data using the Python language and the h5py package [1] (using h5py calls directly rather than using the NeXus NAPI). The actual data to be written was extracted (elsewhere) from a SPEC [2] data file and read as a text block from a file by the Python source code. Our examples will start with the simplest case and add only mild complexity with each new case, since these examples are meant for those who are unfamiliar with NeXus.
The data shown plotted in the next figure will be written to the NeXus HDF5 file using only two NeXus base classes, NXentry and NXdata, in the first example, and then minor variations on this structure in the next two examples. The data model is identical to the one in the Introduction chapter except that the names will be different, as shown below:
The h5py package is a Pythonic interface to the HDF5 binary data format. It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk as if they were real NumPy arrays.
our h5py example
two-column data for our mr_scan
Writing the simplest data using NXdata¶
These two examples show how to write the simplest data (above). One example writes the data directly to the NXdata group, while the other example writes the data to NXinstrument/NXdetector/data and then creates a soft link to that data in NXdata.
h5py example writing and reading a NeXus data file¶
Writing the HDF5 file using h5py¶
In the main code section of BasicWriter.py, a current time stamp is written in the format of ISO 8601 (yyyy-mm-ddTHH:MM:SS). For simplicity of this code example, we use a text string for the time, rather than computing it directly from Python support library calls. It is easier this way to see the exact type of string formatting for the time. When using the Python datetime package, one way to write the time stamp is:
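A minimal sketch with the standard datetime package (assuming local time is acceptable):

```python
# Produce an ISO 8601 time stamp of the yyyy-mm-ddTHH:MM:SS form
# using only the Python standard library.
import datetime

timestamp = datetime.datetime.now().isoformat(timespec="seconds")
# e.g. '2024-05-01T13:45:30'
```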
The data (mr is similar to "two_theta" and I00 is similar to "counts") is collated into two Python lists. We use the numpy package to read the file and parse the two-column format.
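A hedged sketch of the parsing step; the numeric values below are made-up stand-ins for the real scan data:

```python
# Parse a two-column text block into mr and I00 sequences with numpy.
import io
import numpy

text_block = """17.926 1037
17.925 1318
17.924 1704
"""
mr_arr, i00_arr = numpy.loadtxt(io.StringIO(text_block), unpack=True)
mr = mr_arr.tolist()    # collate into two Python lists, as the text describes
i00 = i00_arr.tolist()
```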
The new HDF5 file is opened (and created if not already existing) for writing, setting common NeXus attributes in the same command from our support library. Proper HDF5+NeXus groups are created for /entry:NXentry/mr_scan:NXdata. Since we are not using the NAPI, our support library must create and set the NX_class attribute on each group.
We want to create the desired structure of /entry:NXentry/mr_scan:NXdata.
First, our support library calls f = h5py.File() to create the file and root level NeXus structure.
Then, it calls nxentry = f.create_group('entry') to create the NXentry group called entry at the root level.
Then, it calls nxdata = nxentry.create_group('mr_scan') to create the NXdata group called mr_scan as a child of the NXentry group.
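A minimal sketch of this group structure using h5py directly (the file name is an assumption for illustration):

```python
# Build /entry:NXentry/mr_scan:NXdata by hand, since the NAPI is not used:
# each group gets an NX_class attribute naming its NeXus base class.
import h5py

f = h5py.File("mr_scan_groups.hdf5", "w")
nxentry = f.create_group("entry")
nxentry.attrs["NX_class"] = "NXentry"
nxdata = nxentry.create_group("mr_scan")
nxdata.attrs["NX_class"] = "NXdata"
f.close()
```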
Next, we create a dataset called title to hold a title string that can appear on the default plot.
Next, we create datasets for mr and I00 using our support library. The data type of each, as represented in numpy, will be recognized by h5py and automatically converted to the proper HDF5 type in the file. A Python dictionary of attributes is given, specifying the engineering units and other values needed by NeXus to provide a default plot of this data. By setting signal='I00' as an attribute on the group, NeXus recognizes I00 as the default y axis for the plot. The axes='mr' attribute on the NXdata group connects the dataset to be used as the x axis.
Finally, we must remember to call f.close() or we might corrupt the file when the program quits.
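The steps above can be sketched end to end as follows. This is not the BasicWriter.py listing itself; the file name, units strings, and data values are assumptions for illustration:

```python
# Groups, datasets, the signal/axes attributes for the default plot,
# and the final close.
import numpy
import h5py

mr_arr = numpy.array([17.926, 17.925, 17.924])   # stand-in data
i00_arr = numpy.array([1037, 1318, 1704])

f = h5py.File("mr_scan_writer_sketch.hdf5", "w")
nxentry = f.create_group("entry")
nxentry.attrs["NX_class"] = "NXentry"
nxdata = nxentry.create_group("mr_scan")
nxdata.attrs["NX_class"] = "NXdata"
nxdata.attrs["signal"] = "I00"          # name of the default y axis
nxdata.attrs["axes"] = "mr"             # name of the x axis
nxdata.create_dataset("title", data="1-D scan of I00 vs. mr")
ds = nxdata.create_dataset("mr", data=mr_arr)
ds.attrs["units"] = "degrees"
ds = nxdata.create_dataset("I00", data=i00_arr)
ds.attrs["units"] = "counts"
f.close()                               # forgetting this risks a corrupt file
```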
BasicWriter.py: Write a NeXus HDF5 file using Python with h5py
Reading the HDF5 file using h5py¶
The file reader, BasicReader.py, is very simple since the bulk of the work is done by h5py. Our code opens the HDF5 file we wrote above, prints the HDF5 attributes from the file, reads the two datasets, and then prints them out as columns. As simple as that. Of course, real code might add some error handling and extract other useful stuff from the file.
See that we identified each of the two datasets using HDF5 absolute path references (just using the group and dataset names). Also, while coding this example, we were reminded that HDF5 is sensitive to upper or lower case. That is, I00 is not the same as i00.
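A hedged sketch of that reading pattern, made self-contained by first writing a small file (names and values are assumptions, not the tutorial's actual data):

```python
import h5py
import numpy

# Build a minimal file so the reader below can run on its own.
# Assigning to a path auto-creates the intermediate groups.
with h5py.File("reader_demo.hdf5", "w") as f:
    f["/entry/mr_scan/mr"] = numpy.array([17.926, 17.925])
    f["/entry/mr_scan/I00"] = numpy.array([1037, 1318])

# The reading itself: absolute path references, group and dataset names only.
with h5py.File("reader_demo.hdf5", "r") as f:
    mr = f["/entry/mr_scan/mr"][()]
    i00 = f["/entry/mr_scan/I00"][()]

for x, y in zip(mr, i00):
    print(x, y)     # the two datasets printed out as columns
```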
BasicReader.py: Read a NeXus HDF5 file using Python with h5py
BasicReader.py is shown next.
Finding the default plottable data¶
Let’s make a new reader that follows the chain of attributes (@signal, @axes) to find the default plottable data. We’ll use the same data file as the previous example. Our demo here assumes one-dimensional data. (For higher dimensionality data, we’ll need more complexity when handling the @axes attribute and we’ll need to check the field sizes. See section Find the plottable data, subsection Version 3, for the details.)
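A hedged, one-dimensional sketch of following that attribute trail; a small file is written first so the snippet stands alone (group layout and values are assumptions):

```python
import h5py
import numpy

# Build a small NeXus-style file to demonstrate against.
with h5py.File("trail_demo.hdf5", "w") as f:
    nxdata = f.create_group("/entry/mr_scan")
    nxdata.attrs["NX_class"] = "NXdata"
    nxdata.attrs["signal"] = "I00"
    nxdata.attrs["axes"] = "mr"
    nxdata["mr"] = numpy.array([17.926, 17.925])
    nxdata["I00"] = numpy.array([1037, 1318])

# Follow the attributes instead of hard-coding dataset names.
with h5py.File("trail_demo.hdf5", "r") as f:
    nxdata = f["/entry/mr_scan"]
    signal_name = nxdata.attrs["signal"]   # name of the default y-axis field
    axis_name = nxdata.attrs["axes"]       # name of the x-axis field (1-D case)
    y = nxdata[signal_name][()]
    x = nxdata[axis_name][()]
```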
reader_attributes_trail.py: Read a NeXus HDF5 file using Python with h5py
reader_attributes_trail.py is shown next.
Plotting the HDF5 file¶
Now that we are certain our file conforms to the NeXus standard, let’s plot it using the NeXpy [3] client tool. To help label the plot, we added the long_name attributes to each of our datasets. We also added metadata to the root level of our HDF5 file similar to that written by the NAPI. It seemed to be a useful addition. Compare this with the plot of our mr_scan and note that the horizontal axis of this plot is mirrored from that above. This is because the data is stored in the file in descending mr order and NeXpy has plotted it that way (in order of appearance) by default.
Links to Data in External HDF5 Files¶
HDF5 files may contain links to data (or groups) in other files. This can be used to advantage to refer to data in existing HDF5 files and create NeXus-compliant data files. Here, we show such an example, using the same two_theta data from the examples above.
We use HDF5 external file links with NeXus data files. In the h5py call that creates such a link, f is an open h5py.File() object in which we will create the new link, local_addr is an HDF5 path address, external_file_name is the name (relative or absolute) of an existing HDF5 file, and external_addr is the HDF5 path address of the existing data in external_file_name to be linked.
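A hedged sketch of the pattern just described; the file and dataset names here are assumptions for illustration, not the tutorial's actual files:

```python
import h5py
import numpy

# An "existing" external file holding just the angles at its root level
# (a valid HDF5 file, but not by itself a valid NeXus file).
with h5py.File("external_angles_demo.hdf5", "w") as ext:
    ext["angles"] = numpy.array([17.926, 17.925, 17.924])

# The linking file: f[local_addr] points at external_file_name:external_addr.
with h5py.File("external_master_demo.hdf5", "w") as f:
    f["/entry/two_theta"] = h5py.ExternalLink("external_angles_demo.hdf5", "/angles")

# Following the link reads the data from the other file transparently,
# provided both files sit in the same directory.
with h5py.File("external_master_demo.hdf5", "r") as f:
    data = f["/entry/two_theta"][()]
```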
Take, for example, the structure of external_angles.hdf5, a simple HDF5 data file that contains just the two_theta angles in an HDF5 dataset at the root level of the file. Although this is a valid HDF5 data file, it is not a valid NeXus data file:
The data in the file external_angles.hdf5 might be referenced from another HDF5 file (such as external_counts.hdf5) by an HDF5 external link. [4] Here is an example of the structure:
See these URLs for further guidance on HDF5 external links: https://portal.hdfgroup.org/display/HDF5/H5L_CREATE_EXTERNAL, http://docs.h5py.org/en/stable/high/group.html#external-links
A valid NeXus data file could be created that refers to the data in these files without making a copy of the data files themselves.
It is necessary for all these files to be located together in the same directory for the HDF5 external file links to work properly.
To be a valid NeXus file, it must contain an NXentry group. For the files above, it is simple to make a master file that links to the data we desire, from structure that we create. We then add the group attributes that describe the default plottable data:
Here is (the basic structure of)
external_master.hdf5, an example:
source code: externalExample.py¶
Here is the complete code of a Python program, using h5py to write a NeXus-compliant HDF5 file with links to data in other HDF5 files.
externalExample.py: Write using HDF5 external links
The Python code and files related to this section may be downloaded from the following table.
2-column ASCII data used in this section
python code to read example prj_test.nexus.hdf5
python code to write example prj_test.nexus.hdf5
h5dump analysis of external_angles.hdf5
HDF5 file written by externalExample
punx tree analysis of external_angles.hdf5
h5dump analysis of external_counts.hdf5
HDF5 file written by externalExample
punx tree analysis of external_counts.hdf5
python code to write external linking examples
h5dump analysis of external_master.hdf5
NeXus file written by externalExample
punx tree analysis of external_master.hdf5
h5dump analysis of the NeXus file
NeXus file written by BasicWriter
punx tree analysis of the NeXus file
What is PyTables?¶
PyTables is a package for managing hierarchical datasets, designed to efficiently cope with extremely large amounts of data. It is built on top of the HDF5 [1] library, the Python language [2], and the NumPy [3] package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code, makes it a fast yet extremely easy-to-use tool for interactively storing and retrieving very large amounts of data.
What are PyTables’ licensing terms?¶
PyTables is free for both commercial and non-commercial use, under the termsof the BSD license.
I’m having problems. How can I get support?¶
The most common and efficient way is to subscribe (remember you need to subscribe prior to sending messages) to the PyTables users mailing list [4], and send there a brief description of your issue and, if possible, a short script that can reproduce it. Hopefully, someone on the list will be able to help you. It is also a good idea to check out the archives of the users list [5] (you may want to check the Gmane archives [6] instead) to see if the answer to your question has already been dealt with.
HDF5 [1] is the underlying C library and file format that enables PyTables to efficiently deal with the data. It has been chosen for the following reasons:
Designed to efficiently manage very large datasets.
Lets you organize datasets hierarchically.
Very flexible and well tested in scientific environments.
Good maintenance and improvement rate.
Technical excellence (R&D 100 Award [7]).
It’s Open Source software
Python is interactive.
People familiar with data processing understand how powerful command line interfaces are for exploring mathematical relationships and scientific datasets. Python provides an interactive environment with the added benefit of a full-featured programming language behind it.
Python is productive for beginners and experts alike.
PyTables is targeted at engineers, scientists, system analysts, financial analysts, and others who consider programming a necessary evil. Any time spent learning a language or tracking down bugs is time spent not solving their real problem. Python has a short learning curve and most people can do real and useful work with it in a day of learning. Its clean syntax and interactive nature facilitate this.
Python is data-handling friendly.
Python comes with nice idioms that make access to data much easier: general slicing (i.e. data[start:stop:step]), list comprehensions, iterators, generators, and so on are constructs that make interaction with your data very easy.
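A tiny illustration of the idioms just named:

```python
# General slicing, a list comprehension, and a lazy generator expression.
data = list(range(10))

sliced = data[1:9:2]                      # start:stop:step slicing
squares = [x * x for x in data if x < 4]  # list comprehension with a filter
gen = (x + 1 for x in data)               # generator: computed on demand
first = next(gen)
```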
NumPy [3] is a Python package to efficiently deal with large datasets in-memory, providing containers for homogeneous data, heterogeneous data, and string arrays. PyTables uses these NumPy containers as in-memory buffers to push the I/O bandwidth towards the platform limits.
Where can PyTables be applied?¶
In all the scenarios where one needs to deal with large datasets:
Data acquisition in real time
Fast data processing
Medicine (biological sensors, general data gathering & processing)
System log monitoring & consolidation
Tracing of routing data
Alert systems in security
Is PyTables safe?¶
Well, first of all, let me state that PyTables does not support transactional features yet (we don’t even know if we will ever be motivated to implement this!), so there is always the risk that you can lose your data in case of an unexpected event while writing (like a power outage, a system shutdown, …). Having said that, if your typical scenario is write once, read many, then the use of PyTables is perfectly safe, even for dealing with extremely large amounts of data.
Can PyTables be used in concurrent access scenarios?¶
It depends. Concurrent reads are no problem at all. However, whenever a process (or thread) is trying to write, then problems will start to appear. First, PyTables doesn’t support locking at any level, so several processes writing concurrently to the same PyTables file will probably end up corrupting it, so don’t do this! Even having only one process writing and the others reading is a hairy thing, because the reading processes might be reading incomplete data from a concurrent write operation.
The solution would be to lock the file while writing and unlock it after a flush over the file has been performed. Also, in order to avoid cache (HDF5 [1], PyTables) problems with reading apps, you would need to re-open your files whenever you are going to issue a read operation. If a re-opening operation is unacceptable in terms of speed, you may want to do all your I/O operations in one single process (or thread) and communicate the results via sockets, Queue.Queue objects (in case of using threads), or whatever, with the client process/thread.
The examples directory contains two scripts demonstrating methods of accessing a PyTables file from multiple processes. The first, multiprocess_access_queues.py, uses a multiprocessing.Queue object to transfer read and write requests from multiple DataProcessor processes to a single process responsible for all access to the PyTables file. The results of read requests are then transferred back to the originating processes using other queues.
The second example script, multiprocess_access_benchmarks.py, demonstrates and benchmarks four methods of transferring PyTables array data between processes. The four methods are:
Using a multiprocessing.Pipe from the Python standard library.
Using a memory-mapped file that is shared between two processes. The NumPy array associated with the file is passed as the out argument.
Using a Unix domain socket. Note that this example uses the ‘abstract namespace’ and will only work under Linux.
Using an IPv4 socket.
What kind of containers does PyTables implement?¶
PyTables supports a series of data containers that address specific needs of the user. Below is a brief description of them:
Table: Lets you deal with heterogeneous datasets. Allows compression. Enlargeable. Supports nested types. Good performance for reading/writing data.
Array: Provides quick and dirty array handling. No compression allowed. Not enlargeable. Can be used only with relatively small datasets (i.e. those that fit in memory). It provides the fastest I/O speed.
CArray: Provides compressed array support. Not enlargeable. Good speed when reading/writing.
EArray: Most general array support. Compressible and enlargeable. It is pretty fast at extending, and very good at reading.
VLArray: Supports collections of homogeneous data with a variable number of entries. Compressible and enlargeable. I/O is not very fast.
Group: The structural component. A hierarchically-addressable container for HDF5 nodes (each of these containers, including Group, is a node), similar to a directory in a UNIX filesystem.
Please refer to the Library Reference for more specific information.
Cool! I’d like to see some examples of use.¶
Sure. Go to the HowToUse section to find simple examples that will help you get started.
Can you show me some screenshots?¶
Well, PyTables is not a graphical library by itself. However, you may want to check out ViTables [8], a GUI tool to browse and edit PyTables & HDF5 [1] files.
Is PyTables a replacement for a relational database?¶
No, by no means. PyTables lacks many features that are standard in most relational databases. In particular, it does not have support for relationships (beyond the hierarchical one, of course) between datasets, and it does not have transactional features. PyTables is more focused on speed and dealing with really large datasets than on implementing the above features. In that sense, PyTables can be best viewed as a teammate of a relational database.
For example, if you have very large tables in your existing relational database, they will take lots of space on disk, potentially reducing the performance of the relational engine. In such a case, you can move those huge tables out of your existing relational database to PyTables, and let your relational engine do what it does best (i.e. manage relatively small or medium datasets with potentially complex relationships), and use PyTables for what it has been designed for (i.e. manage large amounts of data which are loosely related).
How can PyTables be fast if it is written in an interpreted language like Python?¶
Actually, all of the critical I/O code in PyTables is a thin layer of code on top of HDF5 [1], which is a very efficient C library. Cython [9] is used as the glue language to generate “wrappers” around HDF5 calls so that they can be used in Python. Also, the use of an efficient numerical package such as NumPy [3] makes the most costly operations effectively run at C speed. Finally, time-critical loops are usually implemented in Cython [9] (which, if used properly, allows generating code that runs at almost pure C speed).
If it is designed to deal with very large datasets, then PyTables should consume a lot of memory, shouldn’t it?¶
Well, you already know that PyTables sits on top of HDF5, Python, and NumPy [3], and if we add its own logic (~7500 lines of code in Python, ~3000 in Cython, and ~4000 in C), then we should conclude that PyTables isn’t exactly a paradigm of lightness.
Having said that, PyTables (as HDF5 [1] itself) tries very hard to optimize memory consumption by implementing a series of features like dynamic determination of buffer sizes, a Least Recently Used cache for keeping unused nodes out of memory, and extensive use of compact NumPy [3] data containers. Moreover, PyTables is in a relatively mature state and most memory leaks have already been addressed and fixed.
Just to give you an idea of what you can expect, a PyTables program can deal with a table with around 30 columns and 1 million entries using as little as 13 MB of memory (on a 32-bit platform). All in all, it is not that much, is it?
Why was PyTables born?¶
Because, back in August 2002, one of its authors (Francesc Alted [10]) had a need to save lots of hierarchical data in an efficient way for later post-processing. After trying out several approaches, he found that they presented distinct inconveniences. For example, working with file sizes larger than, say, 100 MB was rather painful with ZODB (it took lots of memory with the version available at that time).
The netCDF3 [11] interface provided by Scientific Python [12] was great, but it did not allow structuring the data hierarchically; besides, netCDF3 [11] only supports homogeneous datasets, not heterogeneous ones (i.e. tables). (As an aside, netCDF4 [11] overcomes many of the limitations of netCDF3, although, curiously enough, it is based on top of HDF5 [1], the library chosen as the base for PyTables from the very beginning.)
So, he decided to give HDF5 [1] a try, started doing his own wrappings to it, and voilà, this is how the first public release of PyTables (0.1) saw the light in October 2002, three months after his itch started to eat him ;-).
Does PyTables have a client-server interface?¶
Not by itself, but you may be interested in using PyTables through pydap [13], a Python implementation of the OPeNDAP [14] protocol. Have a look at the PyTables plugin of pydap [13].
How does PyTables compare with the h5py project?¶
Well, they are similar in that both packages are Python interfaces to the HDF5 [1] library, but there are some important differences to be noted. h5py [16] is an attempt to map the HDF5 [1] feature set to NumPy [3] as closely as possible. In addition, it also provides access to nearly all of the HDF5 [1] C API.
Instead, PyTables builds up an additional abstraction layer on top of HDF5 [1] and NumPy [3] where it implements things like an enhanced type system, an engine for enabling complex queries, an efficient computational kernel [17], advanced indexing capabilities [18], and an undo/redo feature, to name just a few. This additional layer also allows PyTables to be relatively independent of its underlying libraries (and their possible limitations). For example, PyTables can support HDF5 [1] data types like enumerated or time that are available in the HDF5 [1] library but not in the NumPy [3] package, or even perform powerful complex queries that are not implemented directly in either HDF5 [1] or NumPy [3].
Furthermore, PyTables also tries hard to be a high-performance interface to HDF5/NumPy, implementing niceties like internal LRU caches for nodes and other data and metadata, automatic computation of optimal chunk sizes for the datasets, and a variety of compressors, ranging from slow but efficient (bzip2 [19]) to extremely fast ones (Blosc [20]), in addition to the standard zlib [21]. Another difference is that PyTables makes use of numexpr [22] so as to accelerate internal computations (for example, in evaluating complex queries) to a maximum.
For contrasting with other opinions, you may want to check the PyTables/h5py comparison in a similar entry of the FAQ of h5py [23].
I’ve found a bug. What do I do?¶
The PyTables development team works hard to make this eventuality as rare as possible but, as in any software made by human beings, bugs do occur. If you find any bug, please tell us by filing a bug report in the issue tracker [24] on GitHub [25].
Is it possible to get involved in PyTables development?¶
Indeed. We are keen for more people to help out by contributing code, unit tests, and documentation, and by helping maintain this wiki. Drop us a mail on the users mailing list and tell us which area you want to work in.
How can I cite PyTables?¶
The recommended way to cite PyTables in a paper or a presentation is as follows:
Author: Francesc Alted, Ivan Vilata and others
Title: PyTables: Hierarchical Datasets in Python
Year: 2002 -
Here’s an example of a BibTeX entry:
PyTables 2.x issues¶
I’m having problems migrating my apps from PyTables 1.x into PyTables 2.x. Please, help!¶
Sure. However, you should first check out the Migrating from PyTables 1.x to 2.x document. It should provide hints to the most frequently asked questions in this regard.
For combined searches like table.where(‘(x<5) & (x>3)’), why was a & operator chosen instead of an and?¶
Search expressions are in fact Python expressions written as strings, and they are evaluated as such. This has the advantage of not having to learn a new syntax, but it also implies some limitations with the logical and and or operators, namely that they cannot be overloaded in Python. Thus, it is impossible right now to get an element-wise operation out of an expression like ‘array1 and array2’. That’s why one has to choose some other operators, & and | being the most similar to their C counterparts && and ||, which aren’t available in Python either.
You should be careful about expressions like ‘x<5 & x>3’ and others like ‘3 < x < 5’, which ‘’won’t work as expected’’, because of the different operator precedence and the absence of an overloaded logical and operator. More on this in the appendix about condition syntax in the manual [26].
There are quite a few packages affected by those limitations, including NumPy [3] itself and SQLObject [27], and there have been quite longish discussions about adding the possibility of overloading logical operators in Python (see PEP 335 [28] and this thread [30] for more details).
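The same limitation can be seen with plain NumPy arrays, which face the identical trade-off: `and` cannot be overloaded, so element-wise combination needs `&`, with parentheses because `&` binds tighter than the comparisons:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
mask = (x < 5) & (x > 3)    # element-wise logical and; parentheses required
selected = x[mask]

error_seen = False
try:
    (x < 5) and (x > 3)     # `and` asks for a single truth value of the array
except ValueError:
    error_seen = True       # "truth value of an array ... is ambiguous"
```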
I cannot select rows using in-kernel queries with a condition that involves a UInt64Col. Why?¶
This turns out to be a limitation of the numexpr [22] package. Internally, numexpr [22] uses a limited set of types for doing calculations, and unsigned integers are always upcast to the immediately larger signed integer that can fit the information. The problem here is that there is no (standard) signed integer type that can hold the information of a 64-bit unsigned integer.
So, your best bet right now is to avoid uint64 types if you can. If you absolutely need uint64, the only way of doing selections with this type is through regular Python selections. For example, if your table has a colM column which is declared as a UInt64Col, then you can still filter its values with:
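The idiom is an ordinary Python comprehension over the table's rows. Sketched here with plain dicts standing in for an open PyTables table, so the snippet is self-contained; with a real table you would iterate the table object directly:

```python
# Hedged sketch: with a real PyTables table this would read
#     selected = [row['colM'] for row in table if row['colM'] < 10]
# Plain dicts stand in for table rows below; values are made up.
rows = [{"colM": 3}, {"colM": 15}, {"colM": 7}]

selected = [row["colM"] for row in rows if row["colM"] < 10]
```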
However, this approach will generally be slow (especially on Win32 platforms, where the values will be converted to Python long values).
I’m already using PyTables 2.x but I’m still getting numarray objects instead of NumPy ones!¶
This is most probably due to the fact that you are using a file created with the PyTables 1.x series. By default, PyTables 1.x set an HDF5 attribute FLAVOR with the value ‘numarray’ on all leaves. Now, PyTables 2.x sees this attribute and obediently converts the internal object (truly a NumPy object) into a numarray one. For PyTables 2.x files, the FLAVOR attribute will only be saved when explicitly set via the leaf.flavor property (or when passing data to an Array or Table at creation time), so you will be able to distinguish default flavors from user-set ones by checking the existence of the FLAVOR attribute.
Meanwhile, if you don’t want to receive numarray objects when reading old files, you have several possibilities:
Remove the flavor for your datasets by hand:
Use the ptrepack utility with the flag --upgrade-flavors so as to convert all flavors in old files to the default (effectively by removing the FLAVOR attribute).
Remove the numarray (and/or Numeric) package from your system. Then PyTables 2.x will return you pure NumPy objects (it can’t be otherwise!).
Error when importing tables¶
You have installed the binary installer for Windows and, when importing the tables package, you are getting an error like:
This problem can be due to a series of reasons, but the most probable one is that you have a version of a DLL library that is needed by PyTables and it is not at the correct version. Please double-check the versions of the required libraries for PyTables and install newer versions, if needed. In most cases, this solves the issue.
In case you continue having problems, there are situations where other programs install libraries in the PATH that are optional to PyTables (for example BZIP2 or LZO), but that will be used if they are found in your system (i.e. anywhere in your PATH). So, if you find any of these libraries in your PATH, upgrade it to the latest version available (you don’t need to re-install PyTables).
Can’t find LZO binaries for Windows¶
Unfortunately, the LZO binaries for Windows seem to be unavailable from their usual place at http://gnuwin32.sourceforge.net/packages/lzo.htm. So, in order to allow people to install this excellent compressor easily, we have packaged the LZO binaries in a zip file available at: http://www.pytables.org/download/lzo-win. This zip file follows the same structure as a typical GnuWin32 [29] package, so it is just a matter of unpacking it in your GNUWIN32 directory and following the instructions in the PyTables Manual [15].
Hopefully somebody else will take care of maintaining LZO for Windows again.
Tests fail when running from IPython¶
You may be getting errors related to Doctest when running the test suite from IPython. This is a known limitation in IPython (see http://lists.ipython.scipy.org/pipermail/ipython-dev/2007-April/002859.html). Try running the test suite from the vanilla Python interpreter instead.
Tests fail when running from Python 2.5 and Numeric is installed¶
Numeric doesn’t get along well with Python 2.5, even on 32-bit platforms. This is a consequence of Numeric not being maintained anymore, and you should consider migrating to NumPy as soon as possible. To get rid of these errors, just uninstall Numeric.