# Risk Map Project

*Version Alpha 2*

## Support files

## Input file format

## Visualizations

## Models

### Spatial Binomial Model

#### Bibliography

### Logistic Regression

#### Bibliography

### Random Forest Model

#### Bibliography

## Software

The Polio Risk Map Project allows you to upload case data and run it through our workflow to generate a risk map.

The workflow will go through the following steps:

- **Cleaning:** In this step, the system checks that the columns of your input file are correct, that the location names match the country's locations, and that the dates are provided in the correct format.
- **Aggregation:** The data is then aggregated based on the time period selected during upload.
- **Model run:** The statistical model then runs for each time period and generates the risk score.
- **Visualizations:** You can then view the cleaned data and the risk score on a map.
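The aggregation step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the selected time period is six months (the period length used by the models described on this page), and the function names `half_year_period` and `aggregate_cases` are hypothetical.

```python
from collections import Counter
from datetime import date


def half_year_period(case_date):
    """Map a date to (year, half): half is 1 for Jan-Jun, 2 for Jul-Dec."""
    return (case_date.year, 1 if case_date.month <= 6 else 2)


def aggregate_cases(cases):
    """Count cases per (district, period).

    `cases` is an iterable of (district_name, date) pairs. The 6-month
    period length is an assumption made for this sketch.
    """
    counts = Counter()
    for district, case_date in cases:
        counts[(district, half_year_period(case_date))] += 1
    return counts


# Illustrative input only (district names taken from the Nigeria sample).
example_counts = aggregate_cases([
    ("BABURA", date(2011, 11, 29)),
    ("RINGIM", date(2011, 8, 25)),
    ("BIRNIN KEBBI", date(2011, 6, 8)),
])
```

Keying the counts by `(district, period)` keeps the output easy to join against a district list later, so districts with no cases can be filled in with zero.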

Below you will find example files for each of the supported countries. A hierarchy table is also provided for your convenience; it includes all the names expected for the country's locations.

**Afghanistan**
- Example file
- 568 districts (75 infected) - 241 cases - 7 years of data - Endemic
- Hierarchy file

**Chad**
- Example file
- 61 districts (48 infected) - 344 cases - 13 years of data - Endemic
- Hierarchy file

**Democratic Republic of the Congo**
- Example file
- 509 districts (98 infected) - 298 cases - 11 years of data - Outbreaks
- Hierarchy file

**Ethiopia**
- Example file
- 78 districts (76 infected) - 3025 cases - 11 years of data - Endemic, then stop
- Hierarchy file

**Guinea**
- Example file
- 37 districts (30 infected) - 180 cases - 13 years of data - Outbreaks
- Hierarchy file

**Haiti**
- Example file
- 41 districts (30 infected) - 4746 cases - 22 years of data - Endemic to extinction
- Hierarchy file

**India**
- Example file
- 659 districts (240 infected) - 5176 cases - 12 years of data - Endemic to extinction
- Hierarchy file

**Liberia**
- Example file
- 15 districts (15 infected) - 1512 cases - 22 years of data - Endemic to Flare-up
- Hierarchy file

**Nigeria**
- Example file
- 774 districts (57 infected) - 1076 cases - 14 years of data - Endemic, very local
- Hierarchy file

**Pakistan**
- Example file
- 163 districts (144 infected) - 1305 cases - 16 years of data - Endemic with Flare-ups
- Hierarchy file

**Sierra Leone**
- Example file
- 14 districts (14 infected) - 1529 cases - 22 years of data - Endemic to Flare-up
- Hierarchy file

**South Africa**
- Example file
- 53 districts (53 infected) - 9400 cases - 27 years of data - Endemic
- Hierarchy file

**South Sudan**
- Example file
- 78 districts (33 infected) - 7717 cases - 22 years of data - Endemic
- Hierarchy file

**United Republic of Tanzania**
- Example file
- 198 districts (168 infected) - 50595 cases - 27 years of data - Endemic
- Hierarchy file

**Zambia**
- Example file
- 74 districts (60 infected) - 4213 cases - 22 years of data - Endemic
- Hierarchy file

The system expects a specific format for the file that you wish to upload. The requirements are:

- The file must be in CSV format.
- The case locations must be expressed with the following columns: admin0, admin1, admin2
- The case dates must be in the column: Case_Date

For example, a correctly formatted file for Nigeria could contain the following columns and rows:

| PolIS Case ID | Case_Date | admin0 | admin1 | admin2 |
|---|---|---|---|---|
| NGA10-353 | 9/10/2010 | NIGERIA | BORNO | MAIDUGURI |
| NGA10-4312 | 27/09/2010 | NIGERIA | KANO | DAMBATTA |
| NGA11-1372 | 29/11/2011 | NIGERIA | JIGAWA | BABURA |
| NGA11-1387 | 29/10/2011 | NIGERIA | JIGAWA | BIRNIN KUDU |
| NGA11-1564 | 28/07/2011 | NIGERIA | KANO | DAWAKIN KUDU |
| NGA11-1641 | 8/6/2011 | NIGERIA | KEBBI | BIRNIN KEBBI |
| NGA11-1787 | 29/11/2011 | NIGERIA | KATSINA | MANI |
| NGA11-1796 | 2/10/2011 | NIGERIA | KATSINA | MASHI |
| NGA11-1733 | 27/08/2011 | NIGERIA | KANO | NASSARAWA |
| NGA12-6291 | 27/03/2012 | NIGERIA | KATSINA | BATSARI |
| NGA11-3897 | 25/08/2011 | NIGERIA | JIGAWA | RINGIM |
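Before uploading, a file can be checked locally against these requirements. The sketch below is one possible check, not the system's own cleaning code. The function name `read_cases` is hypothetical, and the day/month/year date order is an assumption inferred from sample values such as 27/09/2010.

```python
import csv
import io
from datetime import datetime

# Columns required by the upload format described above.
REQUIRED_COLUMNS = ("Case_Date", "admin0", "admin1", "admin2")


def read_cases(text, date_format="%d/%m/%Y"):
    """Parse CSV text and validate the required columns and dates.

    The default date format (day/month/year) is an assumption based on
    the sample file, not a documented rule.
    """
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    # Normalize header names so stray whitespace does not hide a column.
    reader.fieldnames = [name.strip() for name in (reader.fieldnames or [])]
    missing = [c for c in REQUIRED_COLUMNS if c not in reader.fieldnames]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))

    cases = []
    for line_no, row in enumerate(reader, start=2):
        values = {c: (row.get(c) or "").strip() for c in REQUIRED_COLUMNS}
        empty = [c for c, v in values.items() if not v]
        if empty:
            raise ValueError("line %d: empty value for %s" % (line_no, ", ".join(empty)))
        try:
            values["Case_Date"] = datetime.strptime(values["Case_Date"], date_format).date()
        except ValueError:
            raise ValueError("line %d: unreadable date %r" % (line_no, values["Case_Date"]))
        cases.append(values)
    return cases


sample = (
    "PolIS Case ID, Case_Date, admin0, admin1, admin2\n"
    "NGA10-353, 9/10/2010, NIGERIA, BORNO, MAIDUGURI\n"
    "NGA11-1641, 8/6/2011, NIGERIA, KEBBI, BIRNIN KEBBI\n"
)
parsed = read_cases(sample)
```

Extra columns such as the case ID are ignored by this sketch; only the four required columns are validated and returned.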

The area under the ROC curve (AUC) is a common evaluation metric for binary classification problems.
Consider a plot of the true positive rate against the false positive rate as the threshold value for classifying an item as “True” or “False” is varied from 0 to 1.

If the classifier is very good, the true positive rate rises quickly while the false positive rate stays low, and the area under this curve is close to 1.

If the classifier is no better than random guessing, the true positive rate increases linearly with the false positive rate, and the area under this curve is around 0.5.

For more details, see: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
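To make the definition concrete, the following sketch computes the curve and its area from a list of true labels and predicted risk scores. It is an illustration of the metric only; the function names `roc_points` and `auc` are hypothetical and are not taken from the project's code.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    `labels` are 0/1 outcomes and `scores` are predicted risks. The
    threshold is swept from the highest score down; tied scores are
    handled together so they produce a single point.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A classifier that ranks every positive above every negative gives an area of 1, while one that assigns the same score to everything gives 0.5, matching the two cases described above.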

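The probability of at least one case in a district during a 6-month period is modeled as a function of an overall level of risk as well as a set of independent and spatially structured random effects,
also known as the convolution model.<sup>1</sup> In the first stage of this hierarchical model, we assume the presence or absence of cases in district $i$ and period $t$, $X_{it}$, follows a Bernoulli distribution with probability $q_{it}$, where

$$\operatorname{logit}(q_{it}) = \mu + \beta_{1} X_{i,t-1} + \beta_{2} Z_{i,t-1} + \theta_{i} + \phi_{i}$$

Here $\mu$ is the overall risk level, $\beta_{1}$ is the coefficient for at least one case in district $i$ in the previous period ($X_{i,t-1}$), $\beta_{2}$ is the coefficient for the covariate $Z_{i,t-1}$, $\theta_{i}$ is the spatially structured random effect, and $\phi_{i}$ is the independent random effect.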

At the second stage of the hierarchical model, we assign priors to the random effects. The independent effects are assigned the prior $\phi_{i} \mid \sigma_{\phi}^{2} \sim N(0, \sigma_{\phi}^{2})$ for $i = 1, \dots, I$.
The spatially structured effect is assigned the intrinsic conditional autoregressive (ICAR) prior.
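The text names the ICAR prior without writing it out. For reference, its standard conditional form is given below. The symbols $n_{i}$ (the number of districts neighboring district $i$), $j \sim i$ (district $j$ is a neighbor of district $i$), and $\sigma_{\theta}^{2}$ (the variance parameter of the structured effect) are notation assumed here and do not appear in the original description.

$$\theta_{i} \mid \theta_{j \neq i}, \sigma_{\theta}^{2} \sim N\!\left(\frac{1}{n_{i}} \sum_{j \sim i} \theta_{j},\ \frac{\sigma_{\theta}^{2}}{n_{i}}\right)$$

Under this form, each district's structured effect is centered on the average of its neighbors' effects, with a variance that shrinks as the number of neighbors grows.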

This model was fit in R<sup>4</sup> using the Integrated Nested Laplace Approximation (INLA)<sup>5,6</sup> as implemented in the INLA package.<sup>7</sup>

1. Besag J, York J, Mollié A. Bayesian image restoration with two applications in spatial statistics. Ann Inst Stat Math 1991; 43: 1–59.
2. Besag J. Spatial interaction and the statistical analysis of lattice systems. J R Stat Soc Ser B 1974; 36: 192–236.
3. Rue H, Held L. Gaussian Markov Random Fields: Theory and Applications. Boca Raton: Chapman and Hall/CRC Press, 2005.
4. R Core Development Team. R: a language and environment for statistical computing, 3.2.1. Available at http://www.r-project.org. 2016.
5. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Ser B 2009; 71: 319–92.
6. Lindgren F, Rue H, Lindström J. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic differential equation approach (with discussion). J R Stat Soc Ser B 2011; 73: 423–98.
7. Lindgren F, Rue H. Bayesian spatial modelling with R-INLA. J Stat Softw 2015; 63.

The probability of at least one case in a district during a 6-month period is modeled as a function of an overall level of risk as well as the presence of cases in the previous period.
The presence of a case in district $i$ and period $t$, $X_{it}$, follows a Bernoulli distribution with probability $q_{it}$, where

$$\operatorname{logit}(q_{it}) = \mu + \beta_{1} X_{i,t-1} + \beta_{2} Z_{i,t-1}$$

Here $\mu$ is the overall risk level, $\beta_{1}$ is the coefficient for at least one case in district $i$ in the previous period ($X_{i,t-1}$), and $\beta_{2}$ is the coefficient for the covariate $Z_{i,t-1}$.
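As a worked illustration of the logit equation above, the sketch below turns a linear predictor into a probability by applying the inverse logit. The function names are hypothetical, and any coefficient values passed in are placeholders chosen by the caller, not estimates from the fitted model.

```python
import math


def inverse_logit(x):
    """Numerically stable inverse of the logit function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def risk_probability(mu, beta1, beta2, x_prev, z_prev):
    """Return q_it given logit(q_it) = mu + beta1 * X_{i,t-1} + beta2 * Z_{i,t-1}.

    x_prev is 1 if the district had at least one case in the previous
    period and 0 otherwise; z_prev is the second covariate in the equation.
    """
    return inverse_logit(mu + beta1 * x_prev + beta2 * z_prev)
```

With all coefficients set to zero the probability is 0.5, and a positive `beta1` raises the predicted risk for districts that had a case in the previous period.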

This model was fit in R<sup>1</sup> using the Integrated Nested Laplace Approximation (INLA)<sup>2,3</sup> as implemented in the INLA package.<sup>4</sup>

1. R Core Development Team. R: a language and environment for statistical computing, 3.2.1. Available at http://www.r-project.org. 2016.
2. Rue H, Martino S, Chopin N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Ser B 2009; 71: 319–92.
3. Lindgren F, Rue H, Lindström J. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic differential equation approach (with discussion). J R Stat Soc Ser B 2011; 73: 423–98.
4. Lindgren F, Rue H. Bayesian spatial modelling with R-INLA. J Stat Softw 2015; 63.

The probability of at least one case in a district during the upcoming 6-month period is modeled using a random forest classifier.<sup>1</sup>
Seven covariates are available to the ensemble: the total case count in the previous time period in the district and in its neighbors,
the total and average historical case counts in the district and in its neighbors, and a dummy variable for whether the time period is
the first or second half of the year as a proxy for seasonality. The model was fit in R<sup>2</sup>, using the randomForest package.<sup>3</sup>
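The seven covariates can be sketched as a small feature-building function. This is an illustration under stated assumptions, not the project's implementation: the function name and dictionary keys are hypothetical, neighbor values are taken as sums over all neighboring districts, historical averages are per-period means over all earlier periods, and period 0 is assumed to be the first half of a year.

```python
def build_covariates(counts, neighbors, district, period):
    """Build the seven covariates for one district and one period.

    counts[d][t] is the case count in district d during period t
    (0-based 6-month periods); neighbors[d] lists the districts
    adjacent to d. Only periods before `period` are used.
    """
    if period < 1:
        raise ValueError("at least one earlier period is required")
    nbrs = neighbors.get(district, [])
    own_history = counts[district][:period]
    # Assumption: neighbor counts are summed across all neighbors.
    nbr_history = [sum(counts[n][t] for n in nbrs) for t in range(period)]
    return {
        "prev_cases_district": own_history[-1],
        "prev_cases_neighbors": nbr_history[-1],
        "hist_total_district": sum(own_history),
        "hist_total_neighbors": sum(nbr_history),
        "hist_mean_district": sum(own_history) / period,
        "hist_mean_neighbors": sum(nbr_history) / period,
        # Assumption: even-numbered periods are the first half of a year.
        "second_half_of_year": period % 2,
    }


# Illustrative counts for three made-up districts over three periods.
toy_counts = {"A": [1, 0, 2], "B": [0, 3, 1], "C": [2, 2, 0]}
toy_neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
features = build_covariates(toy_counts, toy_neighbors, "A", 2)
```

One row of such features per district and period, paired with the observed presence or absence of cases, would form the training table for the classifier.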

1. Breiman L. Random forests. Machine Learning 2001; 45(1): 5–32. doi:10.1023/A:1010933404324
2. R Core Development Team. R: a language and environment for statistical computing, 3.2.1. Available at http://www.r-project.org. 2016.
3. Liaw A, Wiener M. Classification and regression by randomForest. R News 2002; 2(3): 18–22. Available at https://cran.r-project.org/web/packages/randomForest/randomForest.pdf.

**This software is distributed as is, completely without warranty or service support.
Institute for Disease Modeling and its employees are not liable for the condition or performance of the software.**

This software uses the following technologies and libraries:

**R**

**Python**
- Python 2.7.13
- Django 1.10.5
- SQLAlchemy 1.1.5
- enum34 1.1.6
- fasteners 0.14.1
- numpy 1.11.3+mkl
- pandas 0.18.1
- pandasql 0.7.3
- pip
- python-dateutil 2.6.0
- pytz 2016.10
- requests 2.13.0
- simpledbf 0.2.6
- Tornado 4.4.2