Working with datasets

We have 4 main goals for this tutorial:

Create a Dataset
1. From data in memory
2. From a snapshot
Run packingcubes on the dataset
Save the sorted dataset
Load the dataset

For the tutorial, you'll need

Access to a Snapshot

You'll need access to a snaphsot file that looks GADGETish. So anything from Gadget (v2 and up), Gizmo, Swift, OpenGadget, Arepo, and similar should work.

We'll be using a snapshot from IllustrisTNG, based on Arepo, for this tutorial.

Don't have a snapshot?

If you don't have a snapshot readily available, we will be using snapshot_090.hdf5 (about \(10^7\) particles, 2.5 GB) from one of the IllustrisTNG runs used in the CAMELS project.

Install and Import Dependencies¶

As always for a tutorial, we'll need to install and import the dependencies we'll need:

pippixi

$ pip install packingcubes

$ pixi add packingcubes --pypi

Now import the modules we need

import numpy as np 

import packingcubes

Creating a Dataset¶

The first thing we need to do for anything with packingcubes that will be used long term is create a Dataset object. We'll demonstrate both of the primary ways to do that here.

From a Snapshot¶

Creating a dataset from a snapshot is pretty straightforward if it's Gadget-based¹.

dataset = packingcubes.GadgetishHDF5Dataset(filepath="./snapshot_090.hdf5")

We need to choose which particles we want to look at

dataset.particle_type = "PartType0" # the gas particles

Then we can check the particle positions are loaded:

dataset.positions

Eager Loading

GadgetishHDF5Datasets eagerly load positions data, so you can specify which particle type you want to load first by setting the particle_type parameter in the constructor.

From positions in memory¶

Déjà-vu?

This part will actually be very similar to the Finding Particles Within a Shape tutorial, because Cubes does this step internally when you pass it an array.

We'll start by generating some random data. We'll make 1000 particles with coordinates ranging from 0 to 100.

positions = np.random.uniform(size=(1000,3)) * 100

Then to create a dataset, just use

inmem_dataset = packingcubes.InMemory(positions=positions)

Run `packingcubes`¶

We now want to cubify our datasets.

It's actually the same command (Cubes), regardless of which kind of dataset you're using. However, we'll need to specify that we only want the gas particles at this time for the GadgetishHDF5Dataset²³.

cubes = packingcubes.Cubes(dataset, particle_type="PartType0")
inmem_cubes = packingcubes.Cubes(inmem_dataset)

These cubes are ready for searching!

cubes.get_particle_indices_in_sphere(
    center=[13548,23147,1684], radius=1000,
)

inmem_cubes.get_particle_indices_in_sphere(
    center=[64,73,15], radius=20,
)

Saving a Dataset¶

Generally, you'll only want to do the initial cubing once and save the results. (Though it is fast enough you could regenerate it each time for "small" datasets).

Saving Cubes recap

You likely have already seen how to save the cubes information in the Getting Started page, but as a recap, it's just cubes.save("snapshot_cubes.hdf5")

Saving the cubes structure doesn't save the sorted particle position information to disk. You'll need to separately save it:

dataset.save(output_file="sorted_positions.hdf5")
inmem_dataset.save(output_file="inmem_positions.hdf5")

Loading a sorted dataset¶

If you have an already sorted dataset, like we now do, you have two options:

Specify the sorted dataset as your dataset filepath - Simpler, but will only contain the fields you've already sorted (so just positions)
Pass the sorted dataset to the sorted_filepath parameter - An extra step, but will check the sorted dataset for any fields first before loading from the original.

Tip

You can also pass the sorted_filepath parameter to the Cubes call directly!

We'll pick option 2.

dataset_reloaded = packingcubes.GadgetishHDF5Dataset(
    filepath="snapshot_090.hdf5",
    sorted_filepath="sorted_positions.hdf5",
)

dataset_reloaded.positions

Note the sorted positions!

Sorting additional fields¶

Once the dataset is sorted, we can include additional fields to be linked to the particles. The process takes two steps:

Define a mapping between names and fields. Names are what you want the field to be referenceable as (say mass), and fields are the actual data. All datasets support two types: N length vectors and NxM sized matrices. For HDF5-based Datasets, a string corresponding to the name of the HDF5 dataset on disk is also supported (e.g., if you provide field_name, packingcubes will look for a dataset in the form current_particle_type/field_name). Example:
```
calculated_field = calculation()
extra_fields = {"Mass":"Mass", "vx":"Velocity_x", "cf":calculated_field}
```
Provide the mapping to the dataset. This will sort the data using the shuffle list in a one-time pass. Note that this means this step must be done after the cubing process! Otherwise the field will retain the original order.
```
dataset.process_extra_fields(extra_fields)
```

The extra fields will now be available as dataset attributes:

mass = dataset.Mass
vx = dataset.vx
dcf = dataset.cf

Keeping the same name for on-disk fields

In the future we expect to allow lists of strings if you prefer to keep the on-disk names. Until then, you can use a dict comprehension (extra_fields = {f:f for f in fields}).

Already sorted data

If your data is already spatially sorted, you must provide a tuple of (field, True) instead of the field directly. This will skip the sorting step when adding that particular field, e.g.

dataset.process_extra_fields(
    {
        "sf":(already_sorted, True),
        "sdf":("already_sorted_on_disk",True),
        "nsf":"not_sorted",
    }
)

Otherwise the field will be re-ordered using the shuffle list, un-sorting it.

You can specify a tuple for non-sorted fields as well, (like ("not_sorted", False)), but it's obviously not necessary.

All-in-one¶

As mentioned previously, the Cubes command combines a number of the above steps into one command. So if you don't need any additional flexibility, or direct access to the dataset positions/shuffle list/etc., you can include the dataset saving and extra fields in the initial Cubes call via the sorted_filepath, save_dataset, and extras parameters, like so:

cubes = packingcubes.Cubes(
    "./snapshot_090.hdf5", 
    particle_type="PartType0",
    sorted_filepath="sorted_positions.hdf5",
    save_dataset=True,
    extras={"Mass":"Mass","vx":"Velocity_x"},
)

The Dataset is then available via cubes.dataset.

Follows the Gadget-2 header specification here ↩
Otherwise you'll get whatever particles are currently loaded (i.e. whatever is in dataset.particle_type). By default, for GadgetishHDF5Datasets, that's the first top-level group whose name starts with "Part", and the first element of dataset.particle_types. ↩
We don't need to specify the particle type for InMemory datasets because they only have one, "PartTypeIM". This is just a dummy name used when saving the particles out, however; you can change it on initialization or with inmem_dataset.particle_type = "NewName". Note that the new name would need to start with "Part" for it to be picked up by GadgetishHDF5Dataset. ↩