Using the Pipeline
==================

``riptide`` comes with a pipeline application, ``rffa``, that can search an observation across a 
range of dispersion measure (DM) trials, using many CPUs in parallel. ``rffa`` automatically
performs the sequence of steps described in :ref:`Quickstart Guide` for all input DM trials, groups
the detected peaks into sensible clusters, and then produces a candidate file and plot for each
cluster thus found. Here we cover how to search an observation for sources with an unknown DM.

``riptide`` does not yet have a dedispersion engine, so you must take care of dedispersing your
multi-channel observation data using another software package. Here we will use PRESTO_, but the
general process is similar regardless of the dedispersion engine used. **If you are not familiar 
with PRESTO, please first have a look at** the `PRESTO tutorial`_ and documentation.

.. _PRESTO: https://github.com/scottransom/presto

.. _`PRESTO tutorial`: https://www.cv.nrao.edu/~sransom/PRESTO_search_tutorial.pdf


RFI Mitigation
--------------

Before actually dedispersing, a sensible starting point is to produce a radio-frequency 
interference (RFI) mask using ``rfifind``. It will scan the input data for bad frequency channels 
and time intervals, and save them to a so-called mask file. This file can then be read by the 
dedispersion utility ``prepsubband``, which can replace the bad data sections with a sensible value.
This removes a significant number of spurious candidates from the search output down the line.

Choosing the optimal DM step
----------------------------

The next stage is to calculate how closely the consecutive DM trials should be spaced, which in 
PRESTO is done using ``DDplan.py``. The ideal DM step size is a function of:

* The observing band parameters: centre frequency, bandwidth, number of channels
* The sampling time of the data
* **The minimum pulse width being searched for**. When running ``DDplan.py`` this is the resolution (``-r``) command-line argument: it should be the minimum period you plan to search (``period_min``), divided by the minimum number of phase bins (``bins_min``) you plan to use.

**Example:** We want to search a Parkes multibeam receiver observation, with a centre frequency of
1382 MHz, a bandwidth of 400 MHz, 1024 frequency channels, and a sampling interval of 64 
microseconds. We will instruct ``prepsubband`` to use 64 sub-bands for dedispersion (``-s 64`` 
below), and want to cover DMs from 0 to 1,000. We will then use ``riptide`` to search for periods 
down to 100 ms using at least 200 phase bins, which amounts to a minimum pulse width of 0.5 ms 
(``-r 0.5``). The call to ``DDplan.py`` is thus:

.. code-block:: console

   $ DDplan.py --loDM 0.0 --hiDM 1000.0 -f 1382.0 -b 400.0 -n 1024 -s 64 -t 0.000064 -r 0.5

   Minimum total smearing     : 0.0907 ms
   --------------------------------------------
   Minimum channel smearing   : 0 ms
   Minimum smearing across BW : 0.00629 ms
   Minimum sample time        : 0.064 ms

   Setting the new 'best' resolution to : 0.5 ms
      Note: ok_smearing > dt (i.e. data is higher resolution than needed)
            New dt is 4 x 0.064 ms = 0.256 ms
   Best guess for optimal initial dDM is 0.407

   Low DM    High DM     dDM  DownSamp  dsubDM   #DMs  DMs/call  calls  WorkFract
     0.000    585.000    0.30       4   15.00    1950      50      39    0.8211
   585.000   1010.000    0.50       8   25.00     850      50      17    0.1789

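The arithmetic behind this example can be checked in a few lines of Python. This is only an
illustrative sketch: the function name ``ddplan_resolution_ms`` is hypothetical (it is not part of
``riptide`` or PRESTO), and the plan numbers are copied from the first row of the table above.

.. code-block:: python

   def ddplan_resolution_ms(period_min, bins_min):
       """Minimum pulse width in ms, for a minimum search period in seconds."""
       return 1000.0 * period_min / bins_min


   # Search periods down to 100 ms using at least 200 phase bins
   resolution = ddplan_resolution_ms(period_min=0.100, bins_min=200)
   print(f"-r {resolution:g}")

   # First row of the plan: DM 0 to 585 in steps of 0.30, 50 DM trials per call
   num_dms = round((585.0 - 0.0) / 0.30)
   num_calls = num_dms // 50
   print(num_dms, num_calls)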

Dedispersing the data
---------------------

The table above returned by ``DDplan.py`` defines the sequence of calls to make to ``prepsubband``. 
In the most recent versions of PRESTO, ``DDplan.py`` writes a Python script that directly makes the
right sequence of calls to ``prepsubband``. Otherwise, the sequence has to be generated by other
means (a custom script) or the calls have to be made manually (**not** recommended).

For example, covering the DM range 0 to 585 requires 39 consecutive calls to ``prepsubband``, each
producing 50 DM trials spaced by a step of 0.30.


.. code-block:: console

   $ prepsubband -lodm 0.0 -dmstep 0.3 -numdms 50 -nsub 64 -downsamp 4 observation.fil
   $ prepsubband -lodm 15.0 -dmstep 0.3 -numdms 50 -nsub 64 -downsamp 4 observation.fil
   [...]
   $ prepsubband -lodm 570.0 -dmstep 0.3 -numdms 50 -nsub 64 -downsamp 4 observation.fil

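If your version of PRESTO does not write the script for you, the full sequence of calls can be
generated from the plan table. The sketch below is one possible custom script, not part of PRESTO
or ``riptide``: the function name ``prepsubband_calls`` and the row layout are hypothetical, and it
only uses the ``prepsubband`` options shown above. It builds the command lines without executing
them.

.. code-block:: python

   def prepsubband_calls(plan, nsub, filename):
       """Return one prepsubband command line per call in a DDplan.py table.

       Each plan row is (lodm, ddm, downsamp, dsubdm, dms_per_call, calls).
       """
       commands = []
       for lodm, ddm, downsamp, dsubdm, dms_per_call, calls in plan:
           for i in range(calls):
               # Each call covers a DM range of width dsubdm
               commands.append(
                   f"prepsubband -lodm {lodm + i * dsubdm:.1f} -dmstep {ddm:g} "
                   f"-numdms {dms_per_call} -nsub {nsub} -downsamp {downsamp} "
                   f"{filename}"
               )
       return commands


   # Rows copied from the DDplan.py output of the previous section
   plan = [
       (0.0, 0.30, 4, 15.0, 50, 39),
       (585.0, 0.50, 8, 25.0, 50, 17),
   ]
   commands = prepsubband_calls(plan, nsub=64, filename="observation.fil")
   print(len(commands))
   print(commands[0])
   print(commands[38])

The printed commands can be reviewed, then written to a shell script or passed to
``subprocess.run``.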

See the `PRESTO tutorial`_ for more details. Once all calls to ``prepsubband`` have been made, we 
can search the resulting set of DM trials, which will consist of pairs of ``.inf`` (header) 
and ``.dat`` (binary data) files.



Configuring the riptide pipeline
--------------------------------

The ``rffa`` application is highly flexible and takes a YAML configuration file as
an input. A `model configuration file`_, with detailed comments, can be found in the repository.
This should be your starting point. Most parameters are mandatory. If the configuration file is 
malformed, the ``rffa`` application will raise an exception with a helpful error message.

.. _`model configuration file`: https://github.com/v-morello/riptide/blob/master/riptide/pipeline/config/example.yaml


Number of parallel processes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The first parameter is the number of parallel processes to use for the search; each process goes
through one DM trial at a time. This should be the number of cores available for the search; if
you are running the code on a SLURM supercomputing facility, this should be equal to 
``cpus-per-task``.

.. code-block:: YAML

   processes: 8

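When the same configuration is reused on several machines, the value can be worked out rather than
hard-coded. The sketch below is not a ``riptide`` feature and ``default_processes`` is a
hypothetical helper. It assumes that SLURM exports the ``SLURM_CPUS_PER_TASK`` environment variable
inside the job, which is normally the case only when ``cpus-per-task`` was requested explicitly,
and falls back to the number of CPUs reported by the operating system otherwise.

.. code-block:: python

   import os


   def default_processes(env=os.environ):
       """Number of worker processes to request, as a positive integer."""
       slurm_cpus = env.get("SLURM_CPUS_PER_TASK")
       if slurm_cpus:
           return int(slurm_cpus)
       # Not inside a SLURM job that requested CPUs: use all local cores
       return os.cpu_count() or 1


   print(f"processes: {default_processes()}")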

Data format and band parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since version ``0.2.0``, ``riptide`` reads the observing band parameters directly from the input ``.inf`` files when using PRESTO for dedispersion. 
However, when using SIGPROC's dedispersion routine, the DM trial files do *not* contain that information, and it must be specified in the config file.
These parameters are important at various stages of the search process.

.. code-block:: YAML

   # Input format, either 'presto' or 'sigproc'
   format: presto

   ### Observing band parameters: leave blank except for SIGPROC input data
   # Minimum observing frequency in MHz
   fmin: 

   # Maximum observing frequency in MHz
   fmax:

   # Number of channels in the data
   nchans:


DM trial selection
^^^^^^^^^^^^^^^^^^

Although the pipeline can be passed a specific list of DM trial files to search, a more practical option is to pass all DM trial files and use the options below to select only
a certain DM range.

.. code-block:: YAML

   dmselect:
      # Minimum DM trial in pc cm^{-3}
      # If left blank, start at the minimum available trial DM
      min: 0.0

      # Maximum DM trial in pc cm^{-3}
      # This is a hard limit, regardless of sky coordinates (see below)
      # If left blank, stop at the maximum available trial DM
      max: 1000.0

      # Maximum value of Trial_DM x |sin b| where b is the Galactic latitude of the observation.
      # This is a simple method to limit the maximum trial DM as a function of Galactic coordinates
      # Almost no Galactic pulsars are known to have DM x |sin b| > 40
      # If left blank, no latitude-dependent cap on the maximum trial DM is applied
      dmsinb_max: 45.0

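To see how the two upper limits combine, here is a small sketch based only on the comments above.
The function ``effective_dm_max`` is hypothetical and the actual selection logic inside ``rffa``
may differ in its details. It returns the largest trial DM that would be kept for an observation at
a given Galactic latitude.

.. code-block:: python

   import math


   def effective_dm_max(glat_deg, dm_max=None, dmsinb_max=None):
       """Largest trial DM kept at Galactic latitude glat_deg (degrees).

       Returns math.inf when neither limit applies.
       """
       limit = math.inf if dm_max is None else dm_max
       sinb = abs(math.sin(math.radians(glat_deg)))
       # The latitude-dependent cap is DM x |sin b| <= dmsinb_max
       if dmsinb_max is not None and sinb > 0.0:
           limit = min(limit, dmsinb_max / sinb)
       return limit


   for glat in (0.0, 5.0, 30.0, 90.0):
       dm = effective_dm_max(glat, dm_max=1000.0, dmsinb_max=45.0)
       print(f"b = {glat:4.1f} deg -> maximum trial DM = {dm:.1f}")

With the values used in this example, the hard limit of 1000 applies in the Galactic plane, while
at a latitude of 30 degrees the search stops near a DM of 90.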

Red noise subtraction
^^^^^^^^^^^^^^^^^^^^^

This section mirrors the parameters passed to the dereddening function. See :meth:`riptide.TimeSeries.deredden`.

.. code-block:: YAML

   dereddening:
      # Width of the running median window in seconds used by the median subtraction
      # routine before searching the input time series
      rmed_width: 5.0

      # 'minpts' parameter passed to the ffa_search() function
      rmed_minpts: 101

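To illustrate what a running median subtraction does, here is a deliberately naive pure-Python
sketch. It is not the ``riptide`` implementation, which is optimised and may treat the window edges
differently, and it ignores ``rmed_minpts``. The function name ``subtract_running_median`` and the
toy data are hypothetical.

.. code-block:: python

   from statistics import median


   def subtract_running_median(data, tsamp, rmed_width):
       """Subtract a centred running median of width rmed_width seconds.

       tsamp is the sampling time in seconds. The window is truncated at
       both ends of the series.
       """
       half = max(1, round(rmed_width / tsamp)) // 2
       out = []
       for i, x in enumerate(data):
           window = data[max(0, i - half): i + half + 1]
           out.append(x - median(window))
       return out


   # A slow linear drift with a single narrow pulse on top
   data = [0.01 * i for i in range(200)]
   data[100] += 5.0
   clean = subtract_running_median(data, tsamp=0.1, rmed_width=5.0)
   print(round(clean[50], 3), round(clean[100], 3))

The drift is removed while the pulse, being much narrower than the window, is left almost intact.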

Defining the search space
^^^^^^^^^^^^^^^^^^^^^^^^^

This section defines a list of search ranges, each with a minimum and maximum trial period, and a duty cycle resolution specified via a minimum and maximum number of phase bins.
Here the idea is to use more phase bins for longer search periods. Each range in the list has three sections:  

* ``ffa_search``: The list of parameters passed to the :func:`riptide.ffa_search` function. Any unspecified parameters will be set to the default values in the function definition.
* ``find_peaks``: The list of parameters passed to the :func:`riptide.find_peaks` function. Unspecified parameters are also set to their default values.
* ``candidates``: The number of phase bins and sub-integrations in the candidate files produced when searching this period range.

The ``name`` attribute is only for logging purposes and can be set to anything.

.. code-block:: YAML

   ranges:
      - name: 'short'
        ffa_search:
            period_min: 0.20
            period_max: 1.00
            bins_min: 240
            bins_max: 260

        find_peaks:
            smin: 6.0
         
        candidates:
            bins: 256
            subints: 32

      - name: 'medium'
        ffa_search:
            period_min: 1.00
            period_max: 5.00
            bins_min: 480
            bins_max: 520

        find_peaks:
            smin: 6.0
         
        candidates:
            bins: 512
            subints: 32

      - name: 'long'
        ffa_search:
            period_min: 5.00
            period_max: 180.00
            bins_min: 960
            bins_max: 1040

        find_peaks:
            smin: 6.0
         
        candidates:
            bins: 1024
            subints: 32

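A quick way to sanity-check a list of ranges is to compute the narrowest pulse width that each one
resolves, which is ``period_min`` divided by ``bins_min``. The sketch below mirrors the relevant
keys of the YAML above as plain Python dictionaries. ``check_ranges`` is a hypothetical helper, and
the requirement that the ranges be contiguous is a choice made for this example, not something the
text above states that ``rffa`` enforces.

.. code-block:: python

   ranges = [
       {"name": "short", "period_min": 0.20, "period_max": 1.00, "bins_min": 240},
       {"name": "medium", "period_min": 1.00, "period_max": 5.00, "bins_min": 480},
       {"name": "long", "period_min": 5.00, "period_max": 180.00, "bins_min": 960},
   ]


   def check_ranges(ranges):
       """Return the smallest pulse width (in seconds) resolved by each range.

       Raises ValueError if consecutive ranges leave a gap or overlap in period.
       """
       for prev, cur in zip(ranges, ranges[1:]):
           if prev["period_max"] != cur["period_min"]:
               raise ValueError(
                   f"{prev['name']!r} and {cur['name']!r} are not contiguous"
               )
       return {r["name"]: r["period_min"] / r["bins_min"] for r in ranges}


   widths = check_ranges(ranges)
   for name, width in widths.items():
       print(f"{name}: {1e3 * width:.2f} ms")

The smallest of these widths is the value to use as the resolution when planning the dedispersion
for this configuration.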

Peak clustering and harmonic flagging
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These parameters control how the many periodogram peaks found during the search across all DM trials are clustered into candidates, and how the candidates deemed to be a harmonic of another are removed. 
They should be left at their default values unless there is a good reason to change them. The default parameters for harmonic flagging are conservative; 
they should very rarely flag a real pulsar as a harmonic of a brighter RFI instance.

.. code-block:: YAML

   # Parameters of the peak clustering that is performed once all DM trials have
   # been searched
   clustering:
      # Clustering radius in units of 1 / Tobs
      # Two peaks whose frequencies are within (radius / Tobs) Hz of each other
      # are considered part of the same cluster
      radius: 0.2


   # Harmonic flagging parameters
   # See the docstring of the htest() function in harmonic_testing.py for details
   # NOTE: this is only a flagging operation, the actual *removal* of candidates 
   # flagged as harmonics is entirely optional, see below
   harmonic_flagging:
      denom_max: 100
      phase_distance_max: 1.0
      dm_distance_max: 3.0
      snr_distance_max: 3.0

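The role of the clustering radius can be illustrated with a one-dimensional friends-of-friends
grouping. This is a sketch of the idea described in the comments above, not necessarily the
algorithm that ``riptide`` uses internally. The function ``cluster_frequencies`` and the peak
frequencies are made up for the example.

.. code-block:: python

   def cluster_frequencies(freqs, tobs, radius=0.2):
       """Group frequencies (Hz) into clusters, for an observation of tobs seconds.

       Sorted neighbours closer than radius / tobs end up in the same cluster.
       """
       max_gap = radius / tobs
       clusters = []
       for f in sorted(freqs):
           if clusters and f - clusters[-1][-1] <= max_gap:
               clusters[-1].append(f)
           else:
               clusters.append([f])
       return clusters


   # Peak frequencies found in several DM trials of a 600 s observation
   peaks = [2.00100, 1.00020, 1.00000, 2.00000, 1.00045]
   for cluster in cluster_frequencies(peaks, tobs=600.0):
       print(cluster)

Here the three peaks near 1 Hz are chained into a single cluster, because each is within
0.2 / 600 Hz of its neighbour, while the two peaks near 2 Hz are too far apart and stay separate.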

Candidate filters and plotting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Right before producing candidate files and/or plots, a list of manual filters can be applied. Candidate plots can also be generated automatically.

.. code-block:: YAML

   # Filters applied to the final list of clusters, *just before* the associated
   # candidate files are produced.
   # The cap on candidate number is applied last, after all unworthy candidates have been removed
   # Any of these fields can be left empty, in which case the corresponding filter is NOT applied
   candidate_filters:
      dm_min: 
      snr_min: 7.0
      remove_harmonics: True
      max_number: 


   # If True, save a PNG plot for every candidate
   # Candidate files can always be loaded and plotted later
   plot_candidates: True

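The order in which the filters act matters, so it is sketched below on a made-up list of clusters.
The function ``apply_candidate_filters`` and the dictionary keys are hypothetical and do not
reflect the internal data structures of ``riptide``. The use of inclusive thresholds, and keeping
the highest S/N candidates when ``max_number`` is set, are assumptions of this example.

.. code-block:: python

   def apply_candidate_filters(clusters, dm_min=None, snr_min=None,
                               remove_harmonics=False, max_number=None):
       """Filter a list of cluster dicts with keys 'dm', 'snr', 'is_harmonic'.

       A filter left as None (or False) is not applied. The result is
       sorted by decreasing S/N.
       """
       kept = list(clusters)
       if dm_min is not None:
           kept = [c for c in kept if c["dm"] >= dm_min]
       if snr_min is not None:
           kept = [c for c in kept if c["snr"] >= snr_min]
       if remove_harmonics:
           kept = [c for c in kept if not c["is_harmonic"]]
       # The cap on candidate number comes last; keeping the brightest
       # candidates is an assumption of this sketch
       kept.sort(key=lambda c: c["snr"], reverse=True)
       if max_number is not None:
           kept = kept[:max_number]
       return kept


   clusters = [
       {"dm": 10.0, "snr": 12.0, "is_harmonic": False},
       {"dm": 10.0, "snr": 9.0, "is_harmonic": True},
       {"dm": 50.0, "snr": 6.5, "is_harmonic": False},
       {"dm": 0.0, "snr": 20.0, "is_harmonic": False},
       {"dm": 30.0, "snr": 8.0, "is_harmonic": False},
   ]
   candidates = apply_candidate_filters(clusters, snr_min=7.0, remove_harmonics=True)
   print([c["snr"] for c in candidates])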

Running the Pipeline
---------------------

Once the pipeline configuration file is ready, the pipeline application ``rffa`` takes two mandatory arguments: the config file via the ``-c`` option and a list of all the DM trial files to search. For example:

.. code-block:: console

   $ rffa -c myConfig.yml dedispersed_data/*.inf

There are additional options, e.g. to set a specific output directory or save a log file. See ``rffa --help``.


.. NOTE::

   ``rffa`` runs its own internal dedispersion plan to "thin out" the list of DM trials and select the minimum number necessary to cover the DM range. The actual DM step it chooses is a function of the minimum pulse width being searched (as specified in the YAML config file).
   This is a design choice; ``rffa`` can be run along with a standard FFT-based search code and ingest the same set of dedispersed time series files. Indeed the DM step required for millisecond pulsar searches is much smaller than for ordinary pulsars.


Data products
-------------

Once the pipeline finishes, the following data products will be written in the specified output directory:  

* A CSV table of all detected periodogram peaks across all DM trials  
* A CSV table of clusters, obtained by grouping together peaks with frequencies close to each other  
* A CSV table of candidates, which will have the same entries as the clusters table, unless you have enabled harmonic filtering in the config file. In this case any cluster that was flagged as a harmonic of another is removed from the final candidate list. 
* One JSON file per :class:`riptide.Candidate` object, which can be loaded using :func:`riptide.load_json` and plotted / manipulated. Each file contains header information, a table of peaks associated with the candidate, and a set of sub-integrations obtained by folding the DM trial at which the candidate was detected with the highest S/N.
* One PNG plot per candidate, if the associated option was enabled in the configuration file.