Fast XML Upload

For projects with a lot of files, the xmlupload command is too slow. For this reason, we developed a workflow for fast mass uploads, intended for internal use. The fast mass upload workflow first processes the files locally, then uploads them to the DSP server, and finally creates the resources of the XML file on the DSP server.

In order for the fast mass upload to work, you need the following dependencies:

  • Your machine must be able to run the DSP software stack. The (internal) document "Installation of your Mac" explains what software needs to be installed.
  • Install ffmpeg, e.g. with brew install ffmpeg
  • Install ImageMagick, e.g. with brew install imagemagick
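
To verify that both tools are available on your PATH, you can run the following commands (assuming ImageMagick 7, which provides the magick command):

ffmpeg -version
magick -version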

The fast mass upload consists of the following steps:

  1. Prepare your data as explained below
  2. Process the files locally with dsp-tools process-files
  3. Upload the files to DSP with dsp-tools upload-files
  4. Create the resources on DSP with dsp-tools fast-xmlupload
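
In the simplest case (a local DSP stack with the default credentials, and a single processing run), the whole workflow boils down to the following three commands, each of which is explained in the sections below. The pickle file name is illustrative; use the one produced by your processing run.

dsp-tools process-files --input-dir=multimedia --output-dir=tmp data.xml
dsp-tools upload-files --processed-dir=tmp
dsp-tools fast-xmlupload --pkl-file=processing_result_20230414_152810.pkl data.xml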

1. Prepare Your Data

The following data structure is expected:

my_project
├── data_model.json
├── data.xml   (<bitstream>multimedia/dog.jpg</bitstream>)
└── multimedia
    ├── dog.jpg
    ├── cat.mp3
    └── subfolder
        ├── snake.pdf
        └── bird.mp4

Note:

  • Your project must contain one XML data file, anywhere.
  • Your project must contain one sub-folder that contains all multimedia files (here: multimedia).
  • The multimedia files in multimedia may be arbitrarily nested.
  • Every path referenced in a <bitstream> in the XML file must point to a file in multimedia.
  • The paths in the <bitstream> are relative to the project root, as in the example below.
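
For example, a resource in data.xml that references two of the files above could look like this (the resource IDs, labels, and restype names are illustrative):

<resource label="Dog" restype=":Image2D" id="dog_1">
    <bitstream>multimedia/dog.jpg</bitstream>
</resource>
<resource label="Bird" restype=":MovingImage2D" id="bird_1">
    <bitstream>multimedia/subfolder/bird.mp4</bitstream>
</resource>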

2. dsp-tools process-files

Process the files locally, using a SIPI container.

dsp-tools process-files --input-dir=multimedia --output-dir=tmp data.xml 

The following options are available:

  • --input-dir (mandatory): path to the input directory where the files should be read from
  • --output-dir (mandatory): path to the output directory where the processed/transformed files should be written to
  • --nthreads (optional, default: computed by Python's concurrent.futures library, depending on the machine): number of threads to use for processing
  • --batchsize (optional, default: 5000): number of files to process in one batch (see the example below)
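
For example, to process with an explicit thread count and smaller batches (the values are illustrative; suitable values depend on your machine):

dsp-tools process-files --input-dir=multimedia --output-dir=tmp --nthreads=8 --batchsize=1000 data.xml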

All files referenced in the <bitstream> tags of the XML file are expected to be in the input directory provided with the --input-dir option.

The processed files (derivative, .orig file, sidecar file, as well as the preview file for movies) will be stored in the given --output-dir directory. If the output directory doesn't exist, it will be created automatically.

Additionally, a pickle file named processing_result_[timestamp].pkl is written to the output directory. It contains a mapping from the original files to the processed files, e.g. multimedia/dog.jpg → tmp/0b/22/0b22570d-515f-4c3d-a6af-e42b458e7b2b.jp2.
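
If you want to inspect such a mapping, you can unpickle the file on the command line, e.g. (a sketch: the file name is the one from the example in step 4, and the exact structure of the pickled object is an implementation detail that may change):

python3 -c 'import pickle; print(pickle.load(open("processing_result_20230414_152810.pkl", "rb")))'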

Important Note: Resource Leak

Due to a resource leak, the Python process must be exited after a certain time. For big datasets, only one batch of files is therefore processed per run, after which Python exits with exit code 2. In this case, you need to restart the command several times, until the exit code is 0. Only then have all files been processed. Unexpected errors result in exit code 1. If this batch splitting happens, every run produces a new pickle file.

You can orchestrate this with a shell script, e.g.:

exit_code=2

# exit code 2 means that only one batch was processed: re-run until no batches are left
while [ $exit_code -eq 2 ]; do
    dsp-tools process-files --input-dir=multimedia --output-dir=tmp data.xml
    exit_code=$?
done

# at this point, anything other than 0 is an unexpected error
if [ $exit_code -ne 0 ]; then
    echo "Error: exit code $exit_code"
    exit $exit_code
fi

3. dsp-tools upload-files

After all files are processed, the upload step can be started.

dsp-tools upload-files --processed-dir=tmp

The following options are available:

  • -d | --processed-dir (mandatory): path to the directory where the processed files are located (same as --output-dir in the processing step)
  • -n | --nthreads (optional, default 4): number of threads to use for uploading (optimum depends on the number of CPUs on the server)
  • -s | --server (optional, default: 0.0.0.0:3333): URL of the DSP server
  • -u | --user (optional, default: root@example.com): username (e-mail) used for authentication with the DSP-API
  • -p | --password (optional, default: test): password used for authentication with the DSP-API
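
For example, an upload to a local DSP stack, spelled out with all options (the values shown are simply the defaults listed above):

dsp-tools upload-files --processed-dir=tmp --nthreads=4 --server=0.0.0.0:3333 --user=root@example.com --password=test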

This command will collect all pickle files in the current working directory that were created by the process-files command.

Important Note

Due to a resource leak, the Python process must be exited after a certain time. If the previous step produced multiple pickle files, it is not recommended to execute upload-files with all of them present in the working directory. Rather, move them somewhere else, and execute upload-files with only a part of the pickle files at a time, as sketched below.
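
A minimal sketch of such a batched upload (assuming the processed files are in tmp and the pickle files follow the processing_result_*.pkl naming from step 2):

mkdir -p pkl_waiting pkl_done
mv processing_result_*.pkl pkl_waiting/
for pkl in pkl_waiting/*.pkl; do
    mv "$pkl" .                        # expose exactly one pickle file to upload-files
    dsp-tools upload-files --processed-dir=tmp
    mv "$(basename "$pkl")" pkl_done/  # archive it before the next iteration
done

Afterwards, move the pickle files back to the project root if fast-xmlupload (step 4) should collect them from the current working directory.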

4. dsp-tools fast-xmlupload

dsp-tools fast-xmlupload --pkl-file=processing_result_20230414_152810.pkl data.xml

The following options are available:

  • -s | --server (optional, default: 0.0.0.0:3333): URL of the DSP server
  • -u | --user (optional, default: root@example.com): username (e-mail) used for authentication with the DSP-API
  • -p | --password (optional, default: test): password used for authentication with the DSP-API
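
For example, spelled out with all options (server, user, and password are the defaults listed above):

dsp-tools fast-xmlupload --pkl-file=processing_result_20230414_152810.pkl --server=0.0.0.0:3333 --user=root@example.com --password=test data.xml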

This command will collect all pickle files in the current working directory that were created by the process-files command.

