Spectrum stores projection data as PJNZ files. These are zip archives containing several files of projection data.
unzip(system.file("testdata", "Botswana2018.PJNZ", package = "specio"),
list = TRUE)
#> Name Length Date
#> 1 Botswana_ 2018 updated ART.DP 2038692 2018-03-23 16:21:00
#> 2 Botswana_ 2018 updated ART.DPUA 656 2018-03-23 15:23:00
#> 3 Botswana_ 2018 updated ART.DPUAD 382382 2018-03-23 15:23:00
#> 4 Botswana_ 2018 updated ART.ep1 1946 2018-03-20 12:44:00
#> 5 Botswana_ 2018 updated ART.ep3 1292 2018-03-20 12:44:00
#> 6 Botswana_ 2018 updated ART.ep4 10866 2018-03-20 12:44:00
#> 7 Botswana_ 2018 updated ART.ep5 410266 2018-03-20 12:44:00
#> 8 Botswana_ 2018 updated ART.PJN 6470 2018-03-23 16:21:00
#> 9 Botswana_ 2018 updated ART.SPT 12156 2018-01-22 10:21:00
#> 10 Botswana_ 2018 updated ART.TYP 11 2018-03-20 12:44:00
#> 11 Botswana_ 2018 updated ART.xml 512995 2018-03-20 12:44:00
#> 12 Botswana_ 2018 updated ART_meta.csv 986 2018-01-22 10:21:00
#> 13 Botswana_ 2018 updated ART_surv.csv 21363 2018-01-22 10:21:00
#> 14 epptmplt/epptmplt_en/Concentrated (C).wst 85305 2018-03-20 12:44:00
#> 15 epptmplt/epptmplt_en/Urban Rural (G).wst 29419 2018-03-20 12:44:00
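Because several members share a prefix (for example .DP, .DPUA and .DPUAD), picking a file out of the archive by extension needs an anchored match. The following is a minimal sketch, not how specio does it: `find_member` is a hypothetical helper and the names are copied from the listing above.

```r
# File names as listed in the archive above (a subset, for illustration)
pjnz_names <- c(
  "Botswana_ 2018 updated ART.DP",
  "Botswana_ 2018 updated ART.DPUA",
  "Botswana_ 2018 updated ART.DPUAD",
  "Botswana_ 2018 updated ART.xml",
  "Botswana_ 2018 updated ART.SPT"
)

# Hypothetical helper: return the single member name with the given extension.
find_member <- function(names, ext) {
  # Anchor the pattern at the end so "DP" does not also match "DPUA"
  hits <- grep(paste0("\\.", ext, "$"), names, value = TRUE)
  if (length(hits) != 1) {
    stop("Expected exactly one .", ext, " file, found ", length(hits))
  }
  hits
}

dp_name <- find_member(pjnz_names, "DP")
# With a real archive the member could then be opened without extracting it:
# con <- unz(path_to_pjnz, dp_name)
```

In practice the name vector would come from the Name column returned by unzip(..., list = TRUE).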
specio
Note that large portions of the general information about the files are taken from a write-up of the Spectrum-EPP communications between Jeff Eaton and Tim Brown in December 2016. Where this relates to specio, this information has been updated to reflect the best of current knowledge.
.DP
The DP file contains demographic projection data. The data is persisted as a csv with a column containing tags used to identify the field.
The csv contains 4 named columns: Tag, Description, Notes and Data. There are also an arbitrary number of extra columns, all of which contain more data.
#> 'data.frame': 7852 obs. of 10 variables:
#> $ Tag : chr "" "<FirstYear MV2>" "" "" ...
#> $ Description: chr "" "" "" "<Value>" ...
#> $ Notes : chr "" "" "" "" ...
#> $ Data : chr "" "" "" "1970" ...
#> $ X : chr "" "" "" "" ...
#> $ X.1 : chr "" "" "" "" ...
#> $ X.2 : chr "" "" "" "" ...
#> $ X.3 : num NA NA NA NA NA NA NA NA NA NA ...
#> $ X.4 : chr "" "" "" "" ...
#> $ X.5 : chr "" "" "" "" ...
The Tag column contains only field tags, e.g. <BigPop MV3>, and end tags <End>. The field tag is used to identify the start of data related to a particular property, which runs up to the next end tag.
The Description column can sometimes contain information about the data in the other columns but is frequently blank and is rarely used in EPP data. For each property it contains a <Value> tag indicating where the data in the Data column begins. The data will start either on the same row as the value tag or in the row one below.
For example, the first few columns of the Botswana 2018 data for a particular field tag are
#> Tag Description
#> 534 <CD4ThreshHoldAdults MV>
#> 535 CD4 count threshold for eligibility - Adults
#> 536 <Value>
#> 537 <End>
#> Notes Data X X.1 X.2
#> 534
#> 535
#> 536 200 200 200 200
#> 537
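The tag, end tag and value tag conventions described above can be sketched as a small extraction routine. This is an illustrative sketch only and not specio's implementation: `get_tag_rows` and `get_values` are hypothetical helpers, and `dp` is a toy data frame rebuilt by hand from the example rows shown above.

```r
# Toy stand-in for the DP csv, rebuilt from the example rows above
dp <- data.frame(
  Tag = c("<CD4ThreshHoldAdults MV>", "", "", "<End>"),
  Description = c("", "CD4 count threshold for eligibility - Adults",
                  "<Value>", ""),
  Notes = "",
  Data = c("", "", "200", ""),
  X = c("", "", "200", ""),
  X.1 = c("", "", "200", ""),
  X.2 = c("", "", "200", ""),
  stringsAsFactors = FALSE
)

# Rows from the field tag up to and including the next <End> tag
get_tag_rows <- function(dp, tag) {
  start <- which(dp$Tag == tag)
  stopifnot(length(start) == 1)
  ends <- which(dp$Tag == "<End>")
  end <- min(ends[ends > start])
  dp[start:end, , drop = FALSE]
}

# Values from the Data column onwards, located via the <Value> tag
get_values <- function(block) {
  value_row <- which(block$Description == "<Value>")
  data_cols <- seq(which(names(block) == "Data"), ncol(block))
  row <- block[value_row, data_cols]
  # Data starts on the <Value> row or, failing that, on the row below
  if (all(row == "" | is.na(row))) {
    row <- block[value_row + 1, data_cols]
  }
  as.numeric(unlist(row))
}

block <- get_tag_rows(dp, "<CD4ThreshHoldAdults MV>")
values <- get_values(block)
```

This only covers a field whose data fits on a single row. Fields holding arrays or collections of arrays span several rows, which is why each field needs its own configuration.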
The Notes column optionally contains information about each row of data. This may be things such as row names for the data or information about a group of data contained in the next few rows. When reading in EPP data, because this column is used in different ways for different fields, we tend not to read it in and instead label data via configuration.
The Data column and onwards contains all of the data related to a field. It can be of many types including scalars, vectors, arrays, multi-dimensional arrays and collections of arrays. Therefore each field requires some configuration for specio to be able to extract the data from the file.
.xml
The xml file contains a serialised Java object representing the workset. It contains all data used by the Java popup when EPP is launched within Spectrum. Each property is contained within the top-level Java epp2011.core.sets.Workset class.
Note that for array properties reading the index is important, as any zeros in the array are sometimes omitted from the serialised data and so we must add them back in at the appropriate indices when we read in the data from the xml file.
Note also that the default representation for an NA value in the serialised Java is -1, so we convert these when they are encountered.
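The two notes above amount to a small reconstruction step once the index and value pairs have been pulled out of the xml. A minimal sketch, assuming the pairs and the declared array length are already available as plain vectors; `rebuild_array` is a hypothetical helper and not part of specio.

```r
# Hypothetical helper: rebuild a full array from the serialised
# (index, value) pairs and the declared array length.
rebuild_array <- function(index, value, length) {
  out <- numeric(length)   # entries omitted from the xml default to zero
  out[index + 1] <- value  # the serialised indices are 0-based, R is 1-based
  out[out == -1] <- NA     # -1 is the serialised representation of NA
  out
}

# Five-element array where only indices 0 and 3 were serialised
rebuilt <- rebuild_array(index = c(0, 3), value = c(-1, 2.5), length = 5)
```

Treating every -1 as NA is an assumption that only holds for properties where -1 is not a legitimate value.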
Some data is persisted as properties of the class e.g.
<void property="aidsNormalizeRange">
<array class="int" length="2">
<void index="0">
<int>1975</int>
</void>
<void index="1">
<int>1993</int>
</void>
</array>
</void>
The type can be read from the second line after the property is declared. This can be another object itself.
Data may also be persisted via a method call. We can see this for data which represents a collection of data frames e.g.
<void method="add">
<object class="epp2011.core.sets.ProjectionSet" id="ProjectionSet1">
<void property="PMTCTData">
<array class="[D" length="14">
<void index="0">
<array class="double" length="41">
<void index="0">
<double>-1.0</double>
</void>
<void index="1">
<double>-1.0</double>
</void>
...
</array>
</void>
<void index="1">
<array class="double" length="41">
<void index="0">
<double>-1.0</double>
</void>
...
</array>
</void>
...
</array>
</void>
<void property="PMTCTSiteSampleSizes">
<array class="[I" length="14">
...
These can be identified by the object class ProjectionSet and array class [D or [I, meaning array of double or int arrays respectively.
.SPT - EPP results for Spectrum
Returns the results of the national epidemic to Spectrum. The SPT file is the primary file for sending the results of EPP's fitting back to Spectrum. The file contains 6 or more sections, each of which is delimited by either a single = or a double ==. It has the following overall structure:
Section 1: lines after the line containing the keyword BASEYEAR through to the delimiter =, the end of projection indicator. Each line contains the projection year, HIV prevalence and HIV incidence for the workset as a whole.
The examples are taken from an existing EPP file for Ukraine.
EPP 5.0 // Indicates the EPP file version
Ukraine // Country name
AGERANGE 15-49 // Age range – specifies whether fitting was done assuming 15-49 or 15+
BASEYEAR 2009 // Base year – this is an artifact of when EPP fixed population size for
// concentrated epidemics in a given year
1970,0.00000,0.00000 // Projection year, national HIV prevalence %, national HIV incidence %
…
2019,0.84569,0.03831
2020,0.86060,0.03875 // Projection year, national HIV prevalence %, national HIV incidence %
= // End of section 1 indicator
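Reading section 1 then comes down to taking the lines between BASEYEAR and the first = delimiter. The sketch below is illustrative only: `read_spt_section1` is a hypothetical helper, and the input lines are copied from the example above with the // annotations removed, on the assumption that those annotations are explanatory and not part of a real file.

```r
# Lines copied from the example above, without the explanatory annotations
spt_lines <- c(
  "EPP 5.0",
  "Ukraine",
  "AGERANGE 15-49",
  "BASEYEAR 2009",
  "1970,0.00000,0.00000",
  "2019,0.84569,0.03831",
  "2020,0.86060,0.03875",
  "="
)

# Hypothetical helper: national prevalence and incidence series (section 1)
read_spt_section1 <- function(lines) {
  lines <- trimws(lines)
  start <- grep("^BASEYEAR", lines)[1] + 1
  delims <- which(lines == "=")
  end <- min(delims[delims >= start]) - 1
  read.csv(text = lines[start:end], header = FALSE,
           col.names = c("year", "prevalence", "incidence"))
}

national <- read_spt_section1(spt_lines)
```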
Section 2: lines after the end of projection indicator = at the end of section 1 through to the == delimiter. Each line contains the projection year, female/male incidence ratio if available (-1 otherwise, e.g. in generalized epidemics where the F/M ratio cannot be determined in EPP), percent of HIV+ individuals who are IDU in that year, number of IDU AIDS deaths and number of IDU non-AIDS deaths in that year.
= // End of section 1 indicator
1970,-1.00000,0.00000,0,0 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
….
2019,0.14165,16.79733,344,681 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
2020,0.12591,15.69599,314,677 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
== // End of section 2 indicator
Section 3: lines starting after the == delimiter at the end of section 2 through to the next =. This repeats the same information as section 1.
== // End of section 2 indicator
Ukraine_May 2015: // Workset name
POP 24051282 INC 100.0 // Total population in base year & percent of base year
// incidence in the workset – always 100%
1970,0.00000,0.00000 // Year, workset HIV prevalence %, workset HIV incidence %
…
2020,0.86060,0.03875 // Year, workset HIV prevalence %, workset HIV incidence %
= // End of section 3 indicator
Section 4: lines after the end of projection indicator = at the end of section 3 through to the == delimiter. This repeats the same information as in section 2.
= // End of section 3 indicator
1970,-1.00000,0.00000,0,0 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
…
2020,0.12591,15.69599,314,677 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths
== // End of section 4 indicator
Subsequent sections then describe each sub-population or sub-epidemic projection in detail with the following information:
Section 5: starting after the == at the end of section 4. After providing some detail on the sub-population (specified below), each subsequent line gives the projection year, HIV prevalence, HIV incidence and total population size in EPP for that particular sub-population.
== // End of section 4 indicator
Ukraine_May 2015\IDUs:BOTH,IDU,75.0 // Sub-pop name, :, special population indicators, % male
POP 292826 INC 32.8823 // Population in base year & percent of base year incidence in sub-population
IDUMORT 1.0700 // Excess IDU mortality among HIV+ IDU
1970,0.00000,0.00000,300000 // Year, sub-pop HIV prevalence %,
// sub-pop HIV incidence %, pop size in year
…
2020,10.97878,0.40086,255083 // Year, sub-pop HIV prevalence %,
// sub-pop HIV incidence %, pop size in year
= // End of section 5 indicator
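The first line of a sub-population section packs the name, the special population indicators and the percent male into one string. A minimal sketch of pulling these apart, assuming the name itself contains no colon and that percent male is always the last comma-separated field; `parse_subpop_header` is a hypothetical helper.

```r
# Hypothetical helper: split "name:indicator,...,percent male"
parse_subpop_header <- function(line) {
  parts <- strsplit(line, ":", fixed = TRUE)[[1]]
  fields <- strsplit(parts[2], ",", fixed = TRUE)[[1]]
  n <- length(fields)
  list(
    name = parts[1],
    indicators = fields[-n],               # e.g. location and risk group
    percent_male = as.numeric(fields[n])
  )
}

# Header line from the example above (backslash escaped for R)
header <- parse_subpop_header("Ukraine_May 2015\\IDUs:BOTH,IDU,75.0")
```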
Section 6: starting after the = at the end of section 5 through to the next ==. Each line gives the projection year, F/M ratio for that group, percent of the group who are IDUs, number of IDU AIDS deaths and the number of IDU non-AIDS deaths.
If the sub-population has the IDU characteristic, then the % of HIV+ who are IDUs will be 100% and the IDU deaths will be the AIDS and non-AIDS deaths for the sub-population. Otherwise, all three numbers will be zero.
= // End of section 5 indicator
1970,0.33333,100.00000,0,0 // Year, F/M ratio, % of HIV+ who are IDUs, IDU AIDS deaths,
// IDU non-AIDS deaths for sub-pop
…
2020,0.33333,100.00000,314,677
== // End of section 6 indicator
Sections 5 and 6 are then repeated for each sub-population and/or sub-epidemic until the end of the file.
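Since every section ends with a = or == line, the whole file can be cut into sections before any section-specific parsing. The following is a rough sketch under that assumption: `split_spt` is a hypothetical helper, and it does not handle an empty section (two delimiters in a row).

```r
# Hypothetical helper: split SPT lines into sections on "=" / "==" lines
split_spt <- function(lines) {
  lines <- trimws(lines)
  is_delim <- lines %in% c("=", "==")
  # A new section starts on the line after each delimiter
  section <- cumsum(c(FALSE, head(is_delim, -1))) + 1
  unname(split(lines[!is_delim], section[!is_delim]))
}

sections <- split_spt(c(
  "BASEYEAR 2009", "1970,0.00000,0.00000", "=",
  "1970,-1.00000,0.00000,0,0", "==",
  "POP 24051282 INC 100.0", "1970,0.00000,0.00000", "="
))
```

Odd-numbered elements then hold the prevalence and incidence series and even-numbered elements hold the F/M ratio and IDU series, matching the section order described above.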
When dealing with generalised epidemics, the F/M ratio will always be returned as -1.0, indicating to Spectrum to use its own patterns since EPP does not track gender in generalised epidemics. In addition, there will generally not be any IDU. Thus, lines in section 2 will appear as:
1981,-1.00000, 0.00000,0,0
1982,-1.00000, 0.00000,0,0
...
If a -1.0 ratio occurs in a concentrated epidemic, it signifies that a value could not be calculated for that year as the incidence was zero. Otherwise the F/M ratio for that particular workset, sub-epidemic or sub-population projection will be shown.
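In both cases a -1 ratio means that no value is available, so it is natural to treat it as missing when the section is read. A minimal sketch using lines taken from the examples above; the column names are illustrative choices, not names used by Spectrum, EPP or specio.

```r
# Two generalised-epidemic lines and one concentrated-epidemic line
section2 <- c(
  "1981,-1.00000, 0.00000,0,0",
  "1982,-1.00000, 0.00000,0,0",
  "2019,0.14165,16.79733,344,681"
)

fm <- read.csv(
  text = section2, header = FALSE, strip.white = TRUE,
  col.names = c("year", "fm_ratio", "pct_hivpos_idu",
                "idu_aids_deaths", "idu_non_aids_deaths")
)

# -1 marks "no F/M ratio available", so store it as NA
fm$fm_ratio[fm$fm_ratio == -1] <- NA
```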
For each sub-population (an actual curve fit to data), the special population indicators are of two types:
- URBAN, RURAL or BOTH
- LORISK, FSW, MSW, IDU, CLIENT, MSM, PRI (prisoner) or TG (transgender)
.SPU - EPP uncertainty results for Spectrum
Note that this is missing from the example above: it is a large file and has been removed manually to save disk space. The SPU file contains uncertainty results for the national epidemic.
The purpose of the SPU file is to pass the resamples done during the IMIS process to Spectrum for use in its own uncertainty calculations. Normally, 3000 resamples are done. The SPU file passes first the overall national Bayesian medians followed by the unique national resamples. Because a particular resample may get selected multiple times, each resample is provided with a COUNT of the number of times it was resampled, followed by a series of lines containing the year, prevalence and incidence.
An annotated description of the format is as follows:
EPP 5.0 3000 // EPP file format version followed by number of resamples
Botswana // Country name
BASEYEAR 2009 // Baseyear for populations (deprecated and not used)
1970, 0.00000, 0.00000 // Bayesian median prevalence and incidence series
1971, 0.05806, 0.04821
…
2020, 32.93643, 2.14343
== // End of Bayesian median series indicator
COUNT 2.0 // Number of times following series was resampled
1970, 0.00000, 0.00000 // Prevalence & incidence series for 1st unique resample
1971, 0.04905, 0.00332
…
2020, 31.89423, 2.43432
== // End of series indicator
COUNT 5.0 // Number of times following series was resampled
1970, 0.00000, 0.00000 // Prevalence & incidence series for 2nd unique resample
1971, 0.05902, 0.00102
…
2020, 31.45333, 2.98174
==
… // this continues until all unique curves in the 3,000
// resamples are specified. COUNTs will sum to 3,000
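The COUNT blocks can be read by pairing each COUNT line with the next == delimiter. This is a sketch under the assumption that the // annotations above are explanatory only and absent from a real file; `read_spu_resamples` is a hypothetical helper and the input is a shortened copy of the example.

```r
# Shortened copy of the example above, without the annotations
spu_lines <- c(
  "EPP 5.0 3000", "Botswana", "BASEYEAR 2009",
  "1970, 0.00000, 0.00000", "2020, 32.93643, 2.14343", "==",
  "COUNT 2.0",
  "1970, 0.00000, 0.00000", "2020, 31.89423, 2.43432", "==",
  "COUNT 5.0",
  "1970, 0.00000, 0.00000", "2020, 31.45333, 2.98174", "=="
)

# Hypothetical helper: one list element per unique resample
read_spu_resamples <- function(lines) {
  lines <- trimws(lines)
  starts <- grep("^COUNT", lines)
  ends <- which(lines == "==")
  lapply(starts, function(s) {
    e <- min(ends[ends > s])  # delimiter closing this resample
    list(
      count = as.numeric(sub("^COUNT[[:space:]]+", "", lines[s])),
      series = read.csv(text = lines[(s + 1):(e - 1)], header = FALSE,
                        strip.white = TRUE,
                        col.names = c("year", "prevalence", "incidence"))
    )
  })
}

resamples <- read_spu_resamples(spu_lines)
total <- sum(vapply(resamples, function(r) r$count, numeric(1)))
```

For a complete file, `total` should equal the number of resamples given on the first line, which makes a cheap consistency check.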
specio
As seen above, the PJNZ file contains several files other than those read in by specio. Where information about these files is known it is included below. This information is taken from a document written by Tim Brown in 2016. Some of it may be out of date, so it should not be considered the source of truth for details about how these files are used.
ep1 - year and demographic inputs that EPP needs from Spectrum
The ep1 file provides the essential information EPP needs to set up the epidemic projection and calculate parameters derived from the demographics to ensure Spectrum-EPP consistency on populations over time. In the absence of HIV, this data is used by EPP to exactly reproduce the Spectrum populations for each year. This information includes:
NOTE: in the following file specifications, words entered all in CAPS are keywords to be used to easily identify the information contained on that line or in that section of the file. All population numbers are entered as integers. All values are separated by commas, so no extraneous commas should occur in country names or in projection names.
Each population line consists of five numbers separated by commas:
Year, 15-49_or 15+_population, 15 year olds, 50 year olds, net_migration_15-49_or_15+
Population values will be provided for the full epidemiological projection period, from the projection start year to the projection end year.
The format for the ep1 file is as follows:
COUNTRY,country name,country code //Same as the invoking argument for country name (may have blanks but no commas), country code is UN code
PROJNAME,"C:\Users\tim\DATA\proj_name.PJN" //Same as invoking argument (fully qualified filename for base proj file)
FIRSTPROJYR,1970 //First year of the epidemiological projection
LASTPROJYR,2016 //Last year of epidemiological projection
IDUMORT,1.07 //Excess IDU mortality in percent per year
AGERANGE 15-49 //The age range surveillance data addresses
POPSTART //Start of the non-AIDS population projection
1970,3459875,148280,98854,34589
1971,3632868,155694,103796,35281
1972,3814512,163479,108986,35986
1973,4005237,171653,114435,36706
1974,4205499,180236,120157,37440
1975,4415774,189247,126165,38189
...
2012,8900434,381447,254298,65234
2013,8989438,385262,256841,65886
2014,9079332,389114,259410,66545
2015,9170126,393005,262004,67211
2016,9261827,396935,264624,67883
POPEND
For example, here are the first few lines from the file for Peru:
COUNTRY,Peru,604
PROJNAME,C:\Users\Tim Brown\AppData\Roaming\Futures Institute\Spectrum\Temp\Peru 2015 FinalPJNZ~BC99.tmp\Peru 2015 Final.PJN
FIRSTPROJYR,1970
LASTPROJYR,2021
IDUMORT,1.07
AGERANGE 15-49
POPSTART
1970,5928014,291589,83621,0
1971,6117343,300730,86139,0
1972,6314019,310432,88774,0
1973,6518406,320827,91641,0
1974,6731110,332084,94737,0
1975,6949523,343962,98008,-3022
1976,7175224,356148,101368,-4632
1977,7408225,368362,104835,-6078
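The population block can be read by taking the lines between the POPSTART and POPEND keywords. A minimal sketch, not specio code: `read_ep1_population` is a hypothetical helper, the column names are illustrative, and the input reuses a few Peru lines from above with a closing POPEND added as the format specification describes.

```r
# A few Peru lines from above, closed with POPEND as in the format spec
ep1_lines <- c(
  "COUNTRY,Peru,604",
  "FIRSTPROJYR,1970",
  "LASTPROJYR,2021",
  "IDUMORT,1.07",
  "AGERANGE 15-49",
  "POPSTART",
  "1970,5928014,291589,83621,0",
  "1975,6949523,343962,98008,-3022",
  "POPEND"
)

# Hypothetical helper: the non-AIDS population projection as a data frame
read_ep1_population <- function(lines) {
  lines <- trimws(lines)
  start <- which(lines == "POPSTART")
  end <- which(lines == "POPEND")
  stopifnot(length(start) == 1, length(end) == 1, end > start + 1)
  read.csv(text = lines[(start + 1):(end - 1)], header = FALSE,
           col.names = c("year", "population", "age_15", "age_50",
                         "net_migration"))
}

pop <- read_ep1_population(ep1_lines)
```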
typ - type of the EPP workset, either GENERALIZED or CONCENTRATED
The file is used to communicate the epidemic type back to Spectrum. It consists of a single line containing either GENERALIZED or CONCENTRATED depending on the type selected by the user.
ep4 - ART data and parameters
The ep4 file provides the information about antiretroviral therapy using the CD4 compartment model adopted in Spectrum. This includes the following parameters (with associated keywords indicated in parentheses):
- (CD4LOWLIMITS): the lower CD4 limit for each of the CD4 compartments in the model
- (LAMBDA): the progression rate through the CD4 compartments
- (NEWINFECTSCD4): the percent of new infections going into each of the CD4 compartments
- (MU): the annual mortality of those not on ART
- (ALPHA1, ALPHA2 and ALPHA3): the annual mortality for those on ART for the 1st 6 months, 2nd 6 months and for more than one year
- (INFECTREDUC): the percent by which the person's risk of infecting another is reduced by being on ART
- (ARTSTART/ARTEND)
In addition, this file contains information on the number of HIV-positive 15 year olds entering the 15-49 or 15+ population and the number of HIV-positive 50 year olds leaving the 15-49 population, disaggregated by on ART (HIVPOS_15YEAROLDS, HIVPOS_50YEAROLDS) and not on ART (HIVPOS_15YEAROLDSART, HIVPOS_50YEAROLDSART). For 15 year olds a CD4 distribution is provided for both on ART and off ART groups, while 50 year olds are assumed to have the same CD4 distribution as the population as a whole. The file also contains indicators for special populations (SPECPOP), the type of ART coverage (ARTCOVERAGE: MALE_FEMALE, CD4_PERCENT, CD4_NUMBER) and data on dropout rates (ARTDROPOUTRATE) and CD4 medians at initiation (CD4MEDIAN).