Working with datasets
We have 4 main goals for this tutorial:
- Create a Dataset
  - From data in memory
  - From a snapshot
- Run packingcubes on the dataset
- Save the sorted dataset
- Load the dataset
For the tutorial, you'll need:
- Access to a Snapshot
You'll need access to a snapshot file that looks GADGETish, so anything from Gadget (v2 and up), Gizmo, Swift, OpenGadget, Arepo, and similar should work.
We'll be using a snapshot from IllustrisTNG, based on Arepo, for this tutorial.
Don't have a snapshot?
If you don't have a snapshot readily available, you can use the same one we do: snapshot_090.hdf5 (about \(10^7\) particles, 2.5 GB) from one of the IllustrisTNG runs used in the CAMELS project.
Install and Import Dependencies¶
As with any tutorial, we first need to install the dependencies:
Then import the modules we need:
Creating a Dataset¶
The first thing we need to do for anything with packingcubes that will be
used long term is create a Dataset object. We'll demonstrate both of the
primary ways to do that here.
From a Snapshot¶
Creating a dataset from a snapshot is pretty straightforward if it's Gadget-based[1].
We need to choose which particles we want to look at:
Then we can check that the particle positions are loaded:
Eager Loading
GadgetishHDF5Datasets eagerly load positions data, so you can specify
which particle type you want to load first by setting the
particle_type parameter in the constructor.
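As a rough sketch of the steps above, the snippet below opens a snapshot with the gas particles loaded first. It assumes the snapshot is saved as snapshot_090.hdf5 in the working directory. The filepath and particle_type arguments are the ones named in this tutorial; the wrapper function load_gas_dataset is a hypothetical helper, introduced only so that packingcubes is imported when the function is called rather than when it is defined.

```python
def load_gas_dataset(snapshot_path="snapshot_090.hdf5"):
    """Open a GADGETish snapshot with the gas particles loaded first.

    Hypothetical helper for illustration; not part of packingcubes.
    """
    # Imported here so that defining the helper does not require the package.
    import packingcubes

    return packingcubes.GadgetishHDF5Dataset(
        filepath=snapshot_path,
        particle_type="PartType0",  # gas particles
    )
```

Calling dataset = load_gas_dataset() should then give a dataset whose eagerly loaded positions belong to PartType0.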
From positions in memory¶
Déjà-vu?
This part will actually be very similar to the
Finding Particles Within a Shape
tutorial, because Cubes does this step internally when you pass it an array.
We'll start by generating some random data. We'll make 1000 particles with coordinates ranging from 0 to 100.
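One way to generate such data is with NumPy. This is a minimal sketch that assumes three-dimensional positions stored as an array of shape (N, 3); the variable name positions and the fixed seed are choices made here for illustration.

```python
import numpy as np

# Seeded generator so the example is reproducible.
rng = np.random.default_rng(seed=42)

# 1000 particles, 3 coordinates each, uniformly distributed in [0, 100).
positions = rng.uniform(0.0, 100.0, size=(1000, 3))

print(positions.shape)  # (1000, 3)
```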
Then to create a dataset, just use
Run packingcubes¶
We now want to cubify our datasets.
It's actually the same command (Cubes), regardless of which kind of dataset
you're using. However, we'll need to specify that we only want the gas particles
at this time for the GadgetishHDF5Dataset[2][3].
cubes = packingcubes.Cubes(dataset, particle_type="PartType0")
inmem_cubes = packingcubes.Cubes(inmem_dataset)
These cubes are ready for searching!
Saving a Dataset¶
Generally, you'll only want to do the initial cubing once and save the results (though it is fast enough that you could regenerate it each time for "small" datasets).
Saving Cubes recap
You likely have already seen how to save the cubes information in the
Getting Started page, but as a recap, it's just
cubes.save("snapshot_cubes.hdf5")
Saving the cubes structure doesn't save the sorted particle position information to disk. You'll need to separately save it:
dataset.save(output_file="sorted_positions.hdf5")
inmem_dataset.save(output_file="inmem_positions.hdf5")
Loading a sorted dataset¶
If you have an already sorted dataset, like we now do, you have two options:
1. Specify the sorted dataset as your dataset filepath: simpler, but will only contain the fields you've already sorted (so just positions).
2. Pass the sorted dataset to the sorted_filepath parameter: an extra step, but will check the sorted dataset for any fields first before loading from the original.
Tip
You can also pass the sorted_filepath parameter to the Cubes call
directly!
We'll pick option 2.
dataset_reloaded = packingcubes.GadgetishHDF5Dataset(
    filepath="snapshot_090.hdf5",
    sorted_filepath="sorted_positions.hdf5",
)
Note the sorted positions!
Sorting additional fields¶
Once the dataset is sorted, we can include additional fields to be linked to the particles. The process takes two steps:
1. Define a mapping between names and fields. Names are what you want the field to be referenceable as (say mass), and fields are the actual data. All datasets support two types: N-length vectors and NxM-sized matrices. For HDF5-based Datasets, a string corresponding to the name of the HDF5 dataset on disk is also supported (e.g., if you provide field_name, packingcubes will look for a dataset in the form current_particle_type/field_name). Example:
2. Provide the mapping to the dataset. This will sort the data using the shuffle list in a one-time pass. Note that this means this step must be done after the cubing process! Otherwise the field will retain the original order.
The extra fields will now be available as dataset attributes:
Keeping the same name for on-disk fields
In the future we expect to allow lists of strings if you prefer to keep the
on-disk names. Until then, you can use a dict comprehension
(extra_fields = {f:f for f in fields}).
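Spelled out, the dict comprehension looks like this. The field names in the list are hypothetical placeholders; substitute whatever your snapshot actually stores.

```python
# Hypothetical on-disk field names, used only for illustration.
fields = ["Masses", "Density", "InternalEnergy"]

# Map every field to itself so the dataset attribute keeps the on-disk name.
extra_fields = {f: f for f in fields}

print(extra_fields)
# {'Masses': 'Masses', 'Density': 'Density', 'InternalEnergy': 'InternalEnergy'}
```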
Already sorted data
If your data is already spatially sorted, you must provide a tuple of
(field, True) instead of the field directly. This will skip the sorting
step when adding that particular field, e.g.
dataset.process_extra_fields(
    {
        "sf": (already_sorted, True),
        "sdf": ("already_sorted_on_disk", True),
        "nsf": "not_sorted",
    }
)
You can specify a tuple for non-sorted fields as well (like
("not_sorted", False)), but it isn't necessary.
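One way to read the tuple convention is as a per-field "already sorted" flag that defaults to False. The sketch below illustrates that reading only; split_sorted_flag is a hypothetical helper and is not how packingcubes is known to implement it.

```python
def split_sorted_flag(value):
    """Return (field, already_sorted) for one mapping entry.

    Hypothetical helper illustrating the convention; not packingcubes code.
    """
    # A 2-tuple whose second element is a bool carries an explicit flag.
    if isinstance(value, tuple) and len(value) == 2 and isinstance(value[1], bool):
        return value
    # A bare field is treated as not yet sorted.
    return value, False


mapping = {
    "sdf": ("already_sorted_on_disk", True),
    "nsf": "not_sorted",
    "nsf_explicit": ("not_sorted", False),
}
flags = {name: split_sorted_flag(value)[1] for name, value in mapping.items()}
print(flags)  # {'sdf': True, 'nsf': False, 'nsf_explicit': False}
```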
All-in-one¶
As mentioned previously, the Cubes command combines a number of the above
steps into one command. So if you don't need any additional flexibility, or
direct access to the dataset positions/shuffle list/etc., you can include the
dataset saving and extra fields in the initial Cubes call via the
sorted_filepath, save_dataset, and extras parameters, like so:
cubes = packingcubes.Cubes(
    "./snapshot_090.hdf5",
    particle_type="PartType0",
    sorted_filepath="sorted_positions.hdf5",
    save_dataset=True,
    extras={"Mass": "Mass", "vx": "Velocity_x"},
)
The Dataset is then available via cubes.dataset.
[2] Otherwise you'll get whatever particles are currently loaded (i.e. whatever is in dataset.particle_type). By default, for GadgetishHDF5Datasets, that's the first top-level group whose name starts with "Part", and the first element of dataset.particle_types.
[3] We don't need to specify the particle type for InMemory datasets because they only have one, "PartTypeIM". This is just a dummy name used when saving the particles out, however; you can change it on initialization or with inmem_dataset.particle_type = "NewName". Note that the new name would need to start with "Part" for it to be picked up by GadgetishHDF5Dataset.