DPPP

DPPP (the Default Preprocessing Pipeline, previously NDPPP for New Preprocessing Pipeline) is the LOFAR data pipelined processing program. It can be used to do all kinds of operations on the data in a pipelined way, so the data are read and written only once.

DPPP started as a new and faster version of IDPPP. The original differences can be seen here.

DPPP preprocesses the data of a LOFAR observation by executing steps like flagging or averaging. Such steps can be used for the raw data as well as the calibrated data by defining the data column to use. One or more of the following steps can be defined as a pipeline. DPPP has an implicit input and output step. It is also possible to have intermediate output steps.

DPPP comes with many predefined steps, but it is also possible to plug in arbitrary steps, implemented in either C++ or Python.

The following steps are possible:
  • Flagging and Filtering:

    • AOFlagger for automatic flagging in time/freq windows using Andre Offringa’s advanced aoflagger.

    • PreFlagger to flag given baselines, time slots, etc.

    • UVWFlagger to flag based on UVW coordinates, possibly in the direction of another source.

    • MADFlagger for automatic flagging in time/freq windows based on median filtering.

    • Filter to filter on baseline and/or channel (only the given baselines/channels are kept). The reader step has an implicit filter.

  • Averaging

    • Averager to average data in time and/or freq.

  • Phase shifting

    • PhaseShift to shift data to another phase center.

  • Demixing to remove strong sources (A-team) from the data.

  • Station summation

    • StationAdder to add stations (usually the superterp stations) forming new station(s) and baselines.

  • Counter to count the number of flags per baseline, frequency, and correlation. A flagging step also counts how many visibilities it flagged. Counts can be saved to a table and plotted later using the function plotflags in the Python module lofar.dppp.

  • Data calibration and Data scaling

    • ApplyCal to apply an existing calibration to a MeasurementSet.

    • GainCal to calibrate gains using StefCal.

    • DDECal to calibrate direction dependent gains.

    • Predict to predict the visibilities of a given sky model.

    • H5ParmPredict to subtract multiple directions of visibilities corrupted by an instrument model (in H5Parm) generated by DDECal.

    • ApplyBeam to apply the LOFAR beam model, or the inverse of it.

    • SetBeam to set the beam keywords after prediction.

    • ScaleData to scale the data with a polynomial in frequency (based on SEFD of LOFAR stations).

    • Upsample to upsample visibilities in time.

    • Output to add intermediate output steps.

  • Interpolate for improving the accuracy of data averaging.

  • User defined steps provide a plugin mechanism for arbitrary steps implemented in C++.

  • Python defined steps provide a plugin mechanism for arbitrary steps implemented in Python.

The input is one or more (regularly shaped) MeasurementSets (MSs). The data in the given column are piped through the steps defined in the parset file and finally written (if needed). It makes it possible to, say, flag at the full resolution, average, flag on a lower resolution, average further, and finally write the data. Regularly shaped means that all time slots in the MS must contain the same baselines and channels. DPPP can handle only one spectral window. If the MS has multiple spectral windows, one has to be selected.

If multiple MSs are given as input, their data are combined in frequency. It means that the time, phase direction, etc. of the different MSs have to be the same. Note that other steps (like averaging) can still be used. When combining MSs (thus combining subbands), it is possible that one or more of them do not exist. Flagged data will be inserted for them. The missing frequency info is deduced from the other subbands. Note that in order to insert missing subbands in the data, the names of the missing MSs have to be given at the right place in the list of MS names. Otherwise DPPP does not know that subbands are missing.
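
As a sketch (the MS names are hypothetical), a parset combining three consecutive subbands where one is missing on disk could look like:

```
# Combine three subbands in frequency; L123_SB001.MS is missing on disk,
# but listing it at the right place makes DPPP insert flagged data for it.
# (An extra msin key may be needed to allow missing MSs; see the reference
# documentation of the msin step.)
msin = [L123_SB000.MS, L123_SB001.MS, L123_SB002.MS]
msout = L123_combined.MS
steps = []
```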

The output can be a new MeasurementSet, but it is also possible to update the flags if the input is a single MS. If averaging or phase-shifting to another phase center is done, the only option is to create a new MeasurementSet.

At the end the run time is shown. Note that on a multi-core machine the user time can exceed the elapsed time (user time is counted per core). By default the percentage of time each step took is also shown.

The AOFlagger, MADFlagger, and Demixer, by far the most expensive parts of DPPP, can run multi-threaded if DPPP is built with OpenMP. The number of threads to use can be defined with the global key numthreads. If that is not set, the environment variable OMP_NUM_THREADS is used. If that variable is also undefined, a DPPP run uses as many threads as there are CPU cores. Thus if multiple DPPP runs are started on a machine, the default total number of threads will exceed the number of CPU cores.
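
For example, to cap a run at four threads, the global key mentioned above can be added to the parset:

```
numthreads = 4    # use at most 4 threads for this DPPP run
```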

MeasurementSet Access

  • The msin step defines which MS and which DATA column to use. It is possible to specify multiple MSs using a glob-pattern or a vector of MS names.

  • If multiple MSs are given, they will be concatenated in frequency. It means that all MSs must have the same times, baselines, etc. Flagged data can be inserted for MSs that are specified, but do not exist.

  • It is possible to select baselines and/or a band (spectral window) and/or skip leading or trailing channels. This is the same for each input MS.

  • Optionally proper weights can be calculated using the auto-correlation data.

  • It sets flags for invalid data (NaN or infinite).

  • Dummy, fully flagged data with correct UVW coordinates will be inserted for missing time slots in the MS. This can only be done if a single input MS is used.

  • Missing time slots at the beginning or end of the MS can be detected by giving the correct start and end time. This is particularly useful for the imaging pipeline where BBS requires that the MSs of all subbands of an observation have the same time slots. When updating an MS, those inserted slots are temporary and not put back into the MS.

  • The msout step defines the output. If a band is selected, the output MS (including its SPECTRAL_WINDOW subtable) contains only that band (its id is 0). The input MS is updated if no output name is given, or if the output name is equal to the input name or equal to a dot.
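
A sketch of channel and baseline selection in the input step (the MS name and baseline pattern are hypothetical, and the baseline key name is an assumption; check the msin step documentation):

```
msin = L123_SB000.MS
msin.startchan = 8       # skip the leading channels
msin.nchan = 240         # number of channels to keep
msin.baseline = CS*&     # assumed key: keep cross-correlations of CS stations
msout = .                # a dot (or empty name) updates the input MS
```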

The calculation of the weights is done as follows.

Weight[ANT1_POL1, ANT2_POL2] = N / (autocorr[ANT1_POL1] * autocorr[ANT2_POL2])
N = EXPOSURE * CHAN_WIDTH * WGHT

where WGHT is the weight put in by RTCP (number of samples used / total number of samples). This note discusses weighting in some more detail.

Flagging

It is important to realize that a MeasurementSet contains columns FLAG and FLAG_ROW to indicate if data are flagged. If FLAG_ROW is set, all data in that row are flagged. DPPP will set FLAG_ROW if all FLAG are set (and vice-versa). When clearing the flags manually, it is important to realize that both columns have to be cleared. For example:

taql 'update my.ms set FLAG=F, FLAG_ROW=F'

DPPP flagging behaviour is as follows.
  • If one correlation is flagged, all correlations will be flagged (e.g. XX,YX,YY are flagged if XY is flagged).

  • The msin step flags data containing NaNs or infinite numbers or if FLAG_ROW is set.

  • An AOFlagger step can be used to flag using Andre Offringa’s rficonsole code. Because DPPP always reads entire time slots, the flagging can be done on limited time windows only (depending on the available memory). An overlap can be defined to reduce boundary effects. By default QUALITY subtables will be created containing statistical flagging quality information. They can be inspected using tools like aoqplot. The default strategy works well for HBA data, but not for LBA data. The strategy LBAdefault should be used for it.

  • A PreFlagger step can be used to flag (or unflag) on time, baseline, elevation, azimuth, simple uv-distance, channel, frequency, amplitude, phase, real, and imaginary. Multiple values (or ranges) can be given for one or more of those keywords. A keyword matches if the data matches one of the values. The results of all given keywords are AND-ed. For example, only data matching given channels and baselines are flagged. Keywords can be grouped in a set making it a single (super) keyword. Such sets can be OR-ed or AND-ed. It makes it possible to flag, for example, channel 1-4 for baseline A and channel 34-36 for baseline B. Here it is explained in a bit more detail.

  • A UVWFlagger step can be used to flag on UVW coordinates in meters and/or wavelengths. It is possible to base the UVW coordinates on a given phase center. If no phase center is given, the UVW coordinates in the input MS are used.

  • A MADFlagger step can be used to flag on the amplitudes of the data. It flags based on the median of the absolute difference of the amplitudes and the median of the amplitudes. It uses a running median with a box of the given size (number of channels and time slots). It is a rather expensive flagging method with usually good results. The flagging parameters can be given as an expression to make them dependent on baseline length. It is possible to specify which correlations to use in the MADFlagger. Flagging on XX only, can save a factor 4 in performance. Furthermore it is possible to only flag the auto-correlations and apply the results to the cross-correlations with a baseline length within optionally given limits.
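
The flagging steps above can be combined in one pipeline; a sketch (the station pattern is hypothetical, and some key names are assumptions to be checked against the step documentation):

```
steps = [preflag, uvwflag, madflag]

preflag.type = preflagger
preflag.baseline = CS013HBA*        # hypothetical station pattern
preflag.chan = [0..3]               # flag the first four channels

uvwflag.type = uvwflagger
uvwflag.uvlambdamin = 100           # assumed key: flag below 100 wavelengths

madflag.type = madflagger
madflag.threshold = 1
madflag.freqwindow = 31
madflag.timewindow = 5
```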

Averaging

  • Unflagged visibility data are averaged in frequency and/or time taking the weights into account. New weights are calculated as the sum of the old weights. Some older LOFAR MSs have weight 0 for unflagged data points. These weights are set to 1.

  • The UVW coordinates are also averaged (not recalculated).

  • It fills the new column LOFAR_FULL_RES_FLAG with the flags at the original resolution for the channels selected from the input MS. It can be used by BBS to deal with bandwidth and time smearing.

  • Averaging in frequency requires that the averaging factor divides the number of channels evenly. E.g. one cannot average every 5 channels when there are 256 channels.

  • When averaging in time, dummy time slots will be inserted for the ones missing at the end. In that way the output MeasurementSet is still regular in time.

  • An averaged point can be flagged if too few unflagged input points were available.
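
A minimal averaging sketch (the factor must divide the number of channels evenly, as noted above):

```
steps = [avg]
avg.type = average
avg.freqstep = 4     # e.g. 256 channels -> 64 channels
avg.timestep = 2     # average every 2 time slots
```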

Demixing

Demixing (or Smart Demixing, explained below) is a faster and more flexible replacement for the old demixing Python script to demix and subtract strong sources (A-team). Jones matrices can be estimated for the direction of the subtract-sources, model-sources, and the optional target-source.
  • It is possible to have different averaging for the demix and subtract step.

  • Selected (e.g. shorter) baselines can be demixed (others will be averaged only). By default only the cross-correlations are used.

  • Four different direction types can be given:

    • The subtract-sources are subtracted from the data. They must have a source model.

    • The model-sources can be given to take the contribution of other strong sources into account when solving for the gains. They must have a source model as well. The target source should NOT be part of this list.

    • The other-sources directions are taken into account when demixing. They are projected away when solving for the gains.

    • If the target source is given, it must have a source model and no other-sources can be given. If no target source is given, the target direction can be projected away like the other-sources. Weak target sources should not be projected away.

  • A source model mentioned above is the patch name in the SourceDB (e.g. CasA). At the moment only point and Gaussian sources are supported. The direction used for demixing is the centroid of the sources that belong to the patch. The direction for an extra source (for which no model is used) can be given as a parameter if that is needed.

  • It is important to note that the target source model must NOT be given using the subtract-sources or model-sources. If it has to be used, give it using the targetsource parameter.

  • The Jones matrices will be estimated jointly for all directions, so better results are expected if the sources are close to the target. However, joint estimation of the Jones matrices for all directions is slower than estimating the Jones matrices for each direction separately. In the near future an option will be added to estimate the Jones matrices for each direction separately like the old demixing script is doing.
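
A demixing sketch; the source and SourceDB names are hypothetical, and the key names follow common DPPP demixer parsets (verify them against the step documentation):

```
steps = [demix]
demix.type = demixer
demix.subtractsources = [CasA, CygA]   # patches in the SourceDB
demix.skymodel = Ateam.sourcedb        # hypothetical SourceDB name
demix.demixfreqstep = 64               # averaging used while demixing
demix.demixtimestep = 10
demix.freqstep = 16                    # averaging of the output data
demix.timestep = 5
```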

Smart Demixing

Smart Demixing does demixing as above, but in a smarter way using a scheme developed by Reinout van Weeren. For each time chunk (say 2 minutes) it is decided how to demix.

It needs three source models, which are made from a text file using makesourcedb. Note that for performance it is best to run makesourcedb with parameter outtype=blob:

  • A detailed model of the A-team sources used in the solve and subtract steps.

  • A coarse model of the A-team sources used in the estimate step. If not given, the detailed model will be used.

  • A model of the target field. Usually the user can create it from the GSM using gsm.py.

Smart demixing works as follows:
  • If an A-team source is at about the same position as a source in the target model, the source is removed from the A-team list and its detailed model replaces the source in the target model used in the solve step (not for the estimate step).

  • Using the coarse A-team model, the visibilities are estimated per baseline for each A-team source. By default the beam model is applied to get the apparent visibilities. The sources and baselines are selected for which the maximum amplitude exceeds a given threshold. A source/station will be solved for if the station appears in at least N of the selected baselines for that source. A detailed source model is used in that step to get as accurate gains as possible.

  • The visibilities of the target are estimated in a similar way using the target model. The target is included in the solve if its maximum amplitude exceeds a threshold or if the amplitude ratio Target/Ateam exceeds a threshold. The target is also included if it is close to an A-team source and the ratio exceeds another (smaller) threshold. Otherwise, the target is ignored (if close) or deprojected.

A detailed decision tree that the smart demixing algorithm follows is available here.

When solving for the complex gains of the selected A-team sources, the detailed A-team model is used to get the correct gains. Note that by default the sources/stations not solved for are still used in the solve step. Their Jones matrices will have a small gain value on the diagonal and zeroes for the off-diagonal values.

At the end a log is produced showing how the demixing behaved. It shows:
  • percentage of converged solves and the average number of iterations used for them.

  • percentage of times the target was included, deprojected, and ignored.

  • percentage of times a source/station was solved for (thus matched the threshold/ratio criteria).

  • average and standard deviation of the percentage of amplitude subtracted per source per baseline.

Phase shifting

  • Data can be shifted to another phase center.

  • A shift step can shift back to the original phase center (by giving an empty center). If that is done by the last shift step, no new MS needs to be created.
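
A sketch of shifting to a new phase center, averaging, and shifting back (the coordinates are hypothetical):

```
steps = [shift1, avg, shift2]
shift1.type = phaseshift
shift1.phasecenter = [12h34m56.7, 52d30m00]  # hypothetical new center
avg.type = average
avg.timestep = 5
shift2.type = phaseshift
shift2.phasecenter = []                      # empty: back to the original center
```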

Upsample

  • Upsampling data can be useful for at least one use case. Consider data that has been integrated for two seconds, by a correlator (the AARTFAAC correlator) that sometimes misses one second of data. The times of the visibilities will then look like [0, 2, 4, 7, 9, 12], each having integration time 2 seconds. DPPP will automatically fill missing time slots, which will lead to times [0, 2, 4, 6, 7, 9, 11, 12]. This is still a nonuniform time coverage, which is not desirable. Calling the upsample step with timestep=2 on this data will create times [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] (it will remove the inserted dummy time slots that overlap, i.e. at 7 and 12). This data is then useful for further processing, e.g. averaging to 10 seconds.
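
For the AARTFAAC example above, an upsample-then-average run could be sketched as (key names assumed from the step name):

```
steps = [upsample, avg]
upsample.type = upsample
upsample.timestep = 2    # split each 2 s slot into two 1 s slots
avg.type = average
avg.timestep = 10        # then average the uniform 1 s data to 10 s
```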

Station summation

  • One or more new stations can be defined from a list of existing stations. An existing station can occur in only one new station.

  • The data of baselines containing only one of the stations are added to form a new baseline.

  • Optionally the auto-correlations can be added to form a new auto-correlation ‘baseline’.

  • The data can be added with or without weight.

  • Optionally averaging instead of summing can be done.
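
A station-summation sketch; the new station name and selection pattern are hypothetical, and 'stations' and 'autocorr' are assumed key names:

```
steps = [add]
add.type = stationadder
add.stations = {ST001:'CS00[2-7]*'}   # form ST001 from the superterp stations
add.autocorr = true                   # also form the new auto-correlation
```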

Data scaling

  • The data can be scaled with a polynomial in frequency to correct for the SEFD of the LOFAR stations.

  • The default coefficients have been determined empirically. It is possible to specify them per station.

  • It can take the number of used dipoles/tiles into account when scaling (e.g. for remote/international or for failing ones).
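
A data-scaling sketch; the key names and coefficient values here are illustrative only:

```
steps = [scale]
scale.type = scaledata
scale.stations = [CS*, RS*]              # hypothetical station patterns
scale.coeffs = [[1.0, 0.5], [1.2, 0.4]]  # polynomial in frequency per pattern
```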

Filtering

  • Similar to the msin step a filter makes it possible to keep only the given channels and/or baselines.

  • By default, a station is always kept in the ANTENNA table, even if all its baselines are removed. This can be changed with the key remove.
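
A filtering sketch (the baseline pattern is hypothetical; the 'remove' key is mentioned above):

```
steps = [filter]
filter.type = filter
filter.baseline = CS*&      # keep only cross-correlations of CS stations
filter.remove = true        # also drop removed stations from the ANTENNA table
```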

Flag statistics and plotting

Several steps show statistics about flagged data points.
  • A MADFlagger and AOFlagger step show the percentage of visibilities flagged by that flagging step. It shows:

    • The percentages per baseline and per station.

    • The percentages per channel.

    • The number of flagged points per correlation, i.e. which correlation triggered the flagging. This may help in determining which correlations to use in the MADFlagger.

  • A UVWFlagger and PreFlagger step show the percentage of visibilities flagged by that flagging step. It shows percentages per baseline and per channel.

  • The msin step shows the number of visibilities flagged because they contain a NaN or infinite value. It is shown which correlation triggered the flagging, so usually only the first correlation is really counted.

  • A Counter step can be used to count and show the number of flagged visibilities. Such a step can be inserted at any point to show the cumulative number of flagged visibilities. For example, it can be defined as the first and last step to know how many visibilities have been flagged in total by the various steps.

  • Each step giving flagging percentages can save the percentages per frequency and per station to a table. The extension .flagfreq is used for the table containing the flags per frequency; the extension .flagstat for the flags per station. The full basename of the table is the main part of the MS followed by __<stepname> followed by the extension. The path for these tables can be specified in the parset file.

  • The plotflags function in the Python module lofar.dppp can be used to plot those tables. It can plot multiple subbands by giving it a list of table names. The flags per station will be averaged for those subbands.
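
A sketch using counter steps around a flagger to see its cumulative effect, saving the percentages as described above:

```
steps = [count1, flag, count2]
count1.type = counter
flag.type = madflagger
flag.threshold = 1
flag.count.save = true     # save flag percentages to tables
count2.type = counter
```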

Intermediate output step

The step out can write data to disk at an intermediate stage. It takes the same arguments as the msout step. As an example, the following reduction will flag, save flagged data at high resolution, then average and save the result in another measurement set. On the averaged data, it will also apply a calibration table and save that in the CORRECTED_DATA column.

msin=L123.MS

steps=[aoflag,out1,average,out2,applycal]

# Write out flagged data at full resolution
out1.type=out
out1.name=L123-flagged.MS

average.timestep=4

# Write out averaged data
out2.type=out
out2.name=L123-averaged.MS
out2.datacolumn=DATA

applycal.parmdb=instrument.parmdb

# Write the corrected data to CORRECTED_DATA
msout=L123-averaged.MS
msout.datacolumn=CORRECTED_DATA

User defined step

Besides the predefined DPPP steps like AOFlagger, etc., it is possible to use any user-defined DPPP step implemented in C++ or Python.

If implemented in C++, such a step has to reside in a shared library that will be dynamically loaded by DPPP. The name of the shared library has to match the step type name: DPPP will try to load the library libdppp_xxx.so (or .dylib on OS X) for a step type xxx.

To make this a bit more flexible, it is possible to define multiple steps in a single shared library. In that case the step type name has to consist of two parts separated by a dot: the first part is the library name, the second part the step type in that library.

For example:

steps=[averager, mystep1, mystep2]
mystep1.type = mystep.stepa
mystep2.type = mystep.stepb

defines two user steps. Both step implementations reside in library libmystep.so.
A description and example of a dynamically loaded step can be found in the source
code repository in DPPP/TestDynDPPP.

Python defined step

The mechanism described above is used to make it possible to implement a user step in Python. The step type has to be pythondppp, and the name of the Python module and class containing the code have to be given. DPPP will start an embedded Python shell, load the module, and instantiate an object of the class. See the documentation of PythonStep.

ParSet File

Similar to most LOFAR programs, the parameters for the DPPP program are given in a so-called parset file. Note that it is possible to add parameters or to overwrite parameters defined in the parset file using command-line arguments. For example:

DPPP DPPP.pset parm1=value1 parm2=value2 ...

The steps to perform have to be defined in the parset file. They are executed in the given order, where the data are piped from one step to the other until all data are processed. Each step has a name to be used thereafter as a prefix in the keyword names specifying the type and parameters of the step.

The most basic parset is as follows. It copies the DATA column of the MS and flags NaN and infinite data.

msin = ~/SB0.MS
msout = SB0_DPPP.MS
steps=[]

The following example is more elaborate. It flags (using a median flagger), averages all channels, flags the result of the average, and finally averages in time. Note that ‘msin’ and ‘msout’ can be seen as an implicit first and last step.

msin = ~/SB0.MS
msin.startchan = 8
msin.nchan = 240
msin.datacolumn = DATA     # is the default

msout = "SB0_DPPP.MS"      # if empty, the input MS is updated and
                           # no averaging steps can be done
msout.datacolumn = DATA    # is the default

steps = [flag1,count,avg1,flag2,avg2,count]
flag1.type=madflagger
flag1.threshold=1
flag1.freqwindow=31
flag1.timewindow=5
flag1.correlations=[0,3]   # only flag on XX and YY
flag1.count.save = true    # save flag percentages
flag1.count.path = $HOME   #  to a table in $HOME

avg1.type = average
avg1.freqstep = 240
avg1.timestep = 1          # is the default

flag2.type=madflagger
flag2.threshold=2
flag2.timewindow=51

avg2.type = average
avg2.timestep = 5

Plotting the flag percentages, saved by the first MADFlagger step, could be done in Python like:

import lofar.dppp as ld
ld.plotflags ('$HOME/SB0_flag1.flagfreq')  # step name was flag1