memilio.epidata.getCaseData

getCaseData.py downloads the case data of the Robert Koch Institute (RKI) and provides it in different ways.

The raw case data we download can be found at https://github.com/robert-koch-institut/SARS-CoV-2-Infektionen_in_Deutschland

Be careful: The date of death or recovery is not reported in the original case data and is extrapolated in this script.

Functions

`check_for_completeness`(df, run_checks[, ...])	Checks if all counties are mentioned in the case data set
`fetch_case_data`(directory, filename, conf_obj)	Downloads the case data
`get_case_data`([read_data, out_folder, ...])	Wrapper function that downloads the case data and writes differently structured data to JSON files.
`main`()	Main program entry.
`preprocess_case_data`(raw_df, directory, ...)	Preprocessing of the case data
`write_case_data`(df, directory, conf_obj[, ...])	Writes the different case data files.

memilio.epidata.getCaseData.check_for_completeness( df: pandas.DataFrame, run_checks: bool, merge_berlin: bool = False, merge_eisenach: bool = True, )

Checks if all counties are mentioned in the case data set

This check was added because data downloads were sometimes incomplete. It verifies that all counties are part of the data. If the data is incomplete, it is downloaded from another source. Note: There is no check whether data is available for every day and every county (such gaps can occur).

Parameters:

df – pd.DataFrame. Dataframe to check.
merge_berlin – bool. Defines if Berlin’s districts are kept separated or get merged. Default value = False.
merge_eisenach – bool. Defines if Eisenach districts are kept separated or get merged. Default value = True.
run_checks – bool

Returns:

Boolean to say if data is complete or not
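The completeness check described above can be illustrated with a minimal plain-Python sketch. The helper name `missing_counties`, the column key `ID_County`, and the toy county IDs are assumptions made for illustration; the library itself works on a pandas dataframe and its internal logic may differ.

```python
from typing import Iterable, Set


def missing_counties(records: Iterable[dict], expected_ids: Set[int],
                     id_key: str = "ID_County") -> Set[int]:
    """Return the expected county IDs that never appear in the records."""
    seen = {row[id_key] for row in records}
    return expected_ids - seen


# Hypothetical toy data: three expected counties, one of them absent.
expected = {1001, 1002, 1003}
rows = [
    {"ID_County": 1001, "Confirmed": 5},
    {"ID_County": 1003, "Confirmed": 2},
    {"ID_County": 1001, "Confirmed": 1},
]

missing = missing_counties(rows, expected)
is_complete = not missing  # False here, because county 1002 is missing
```

As in the note above, such a check only tells whether each county appears at least once; it says nothing about gaps on individual days.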

memilio.epidata.getCaseData.fetch_case_data( directory: str, filename: str, conf_obj, read_data: bool = False, ) → pandas.DataFrame

Downloads the case data

The data is read either from the internet or from a JSON file (CaseDataFull.json) stored in an earlier run. If the data is read from the internet, it is stored in CaseDataFull.json before anything is changed. If the data is downloaded, it is checked whether it contains all counties. If not, a different source is tried. The file is read from or stored in the folder “out_folder”/Germany/pydata. To store and change the data we use pandas.

Parameters:

directory – str Path to the output directory
filename – str Name of the full dataset filename
conf_obj – configuration object
read_data – bool. Defines if data is read from file or downloaded. (Default value = dd.defaultDict[‘read_data’])

Returns:

df pd.Dataframe. Dataframe containing the downloaded case data
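The read-or-download behaviour can be sketched as follows. This is a simplified stdlib illustration, not the library's implementation: `load_or_download` and `fake_download` are hypothetical names, the download is replaced by a stand-in function so that no network access is needed, and plain lists of dicts are used instead of a pandas dataframe.

```python
import json
import os
import tempfile
from typing import Callable, List


def load_or_download(directory: str, filename: str, read_data: bool,
                     download: Callable[[], List[dict]]) -> List[dict]:
    """Read cached raw data from disk, or download it and cache it unchanged."""
    path = os.path.join(directory, filename + ".json")
    if read_data:
        # Reuse the file written by an earlier run.
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = download()
    os.makedirs(directory, exist_ok=True)
    # Store the raw data before any further processing.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data


def fake_download() -> List[dict]:
    # Stand-in for the real download (assumption for this sketch).
    return [{"ID_County": 1001, "Confirmed": 5}]


with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "Germany", "pydata")
    first = load_or_download(out_dir, "CaseDataFull", False, fake_download)
    second = load_or_download(out_dir, "CaseDataFull", True, fake_download)
```

The first call writes the cache file, the second call reads the same content back without downloading.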

memilio.epidata.getCaseData.get_case_data( read_data: bool = False, out_folder: str = '/home/docs/checkouts/readthedocs.org/user_builds/memilio/data/', file_format: str = 'json_timeasstring', start_date: date = datetime.date(2020, 1, 1), end_date: date = datetime.date(2026, 7, 31), impute_dates: bool = False, moving_average: int = 0, split_berlin: bool = False, rep_date: bool = False, files: str = 'All', **kwargs, ) → dict

Wrapper function that downloads the case data and writes differently structured data to JSON files.

The data is read either from the internet or from a JSON file (CaseDataFull.json) stored in an earlier run. If the data is read from the internet, it is stored in CaseDataFull.json before anything is changed. If the data is downloaded, it is checked whether it contains all counties. If not, a different source is tried. The file is read from or stored in the folder “out_folder”/Germany/pydata. To store and change the data we use pandas.

While working with the data
- the column names are changed to English depending on defaultDict.
- a new column “Date” is defined.
- we are only interested in the values where the parameters NeuerFall, NeuerTodesfall, NeuGenesen are larger than 0. The values where these parameters are negative are only useful if one wants to get the difference to the previous day. For details we refer to the above-mentioned webpage.
- for all different parameters and different columns, the values are added up for the whole of Germany for every date and the cumulative sum is calculated, unless something else is mentioned.
- for Berlin, all districts can be merged into one [Default]. Otherwise, Berlin is divided into multiple districts and different file names are used.

The following data is generated and written to the named files:
- All infected (current and past) for the whole of Germany are stored in “cases_infected”
- All deaths for the whole of Germany are stored in “cases_deaths”
- Infected, deaths and recovered for the whole of Germany are stored in “cases_all_germany”
- Infected split by state are stored in “cases_infected_state”
- Infected, deaths and recovered split by state are stored in “cases_all_state”
- Infected split by county are stored in “cases_infected_county(_split_berlin)”
- Infected, deaths and recovered split by county are stored in “cases_all_county(_split_berlin)”
- Infected, deaths and recovered split by gender are stored in “cases_all_gender”
- Infected, deaths and recovered split by state and gender are stored in “cases_all_state_gender”
- Infected, deaths and recovered split by county and gender are stored in “cases_all_county_gender(_split_berlin)”
- Infected, deaths and recovered split by age are stored in “cases_all_age”
- Infected, deaths and recovered split by state and age are stored in “cases_all_state_age”
- Infected, deaths and recovered split by county and age are stored in “cases_all_county_age(_split_berlin)”

Parameters:

read_data – True or False. Defines if data is read from file or downloaded. Default defined in defaultDict. (Default value = dd.defaultDict[‘read_data’])
file_format – File format which is used for writing the data. Default defined in defaultDict. (Default value = dd.defaultDict[‘file_format’])
out_folder – Folder where data is written to. Default defined in defaultDict. (Default value = dd.defaultDict[‘out_folder’])
start_date – First date in dataframe. (Default value = dd.defaultDict[‘start_date’])
end_date – Last date in dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘end_date’])
impute_dates – True or False. Defines if values for dates without new information are imputed. Default defined in defaultDict. (Default value = dd.defaultDict[‘impute_dates’])
moving_average – Integers >=0. Applies a ‘moving_average’-days moving average on all time series to smooth out effects of irregular reporting. Default defined in defaultDict. (Default value = dd.defaultDict[‘moving_average’])
split_berlin – True or False. Defines if Berlin’s districts are kept separated or get merged. Default defined in defaultDict. (Default value = dd.defaultDict[‘split_berlin’])
rep_date – True or False. Defines if reporting date or reference date is taken into dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘rep_date’])
files – List of strings or ‘All’ or ‘Plot’. Defines which files should be provided (and plotted). Default ‘All’.
**kwargs –

Returns:

None
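To illustrate what the moving_average parameter is meant to achieve, here is a minimal sketch of a moving average over a daily series with irregular reporting. It is an assumption-laden illustration: the function below uses a trailing window in plain Python, whereas the library operates on pandas dataframes and may align or pad its window differently.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing moving average; window == 0 returns the series unchanged."""
    if window < 0:
        raise ValueError("window must be >= 0")
    if window == 0:
        return list(values)
    smoothed = []
    for i in range(len(values)):
        # Use a shorter window at the start of the series.
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical daily counts with reports arriving only on some days.
daily = [0, 7, 0, 0, 14, 0, 0]
smoothed = moving_average(daily, 7)
```

With a 7-day window, the spikes on reporting days are spread over the week, which is the smoothing effect the parameter description refers to.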

memilio.epidata.getCaseData.main(): Main program entry.

memilio.epidata.getCaseData.preprocess_case_data( raw_df: pandas.DataFrame, directory: str, filename: str, conf_obj, split_berlin: bool = False, rep_date: bool = False, ) → pandas.DataFrame

Preprocessing of the case data

While working with the data
- the column names are changed to English depending on defaultDict.
- a new column “Date” is defined.
- we are only interested in the values where the parameters NeuerFall, NeuerTodesfall, NeuGenesen are larger than 0. The values where these parameters are negative are only useful if one wants to get the difference to the previous day. For details we refer to the above-mentioned webpage.
- for all different parameters and different columns, the values are added up for the whole of Germany for every date and the cumulative sum is calculated, unless something else is mentioned.
- for Berlin, all districts can be merged into one [Default]. Otherwise, Berlin is divided into multiple districts and different file names are used.

Parameters:

raw_df – pd.Dataframe. Contains the downloaded or read raw case data
directory – str Path to the output directory
filename – str Name of the full dataset filename
conf_obj – configuration object
split_berlin – bool. Defines if Berlin’s districts are kept separated or get merged. Default defined in defaultDict. (Default value = dd.defaultDict[‘split_berlin’])
rep_date – bool Defines if reporting date or reference date is taken into dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘rep_date’])

Returns:

df pd.Dataframe
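The step "values are added up for every date and the cumulative sum is calculated" can be sketched in a few lines. This is a plain-Python illustration on hypothetical, already filtered records; the column names `Date` and `Confirmed` are assumptions, and the library performs the equivalent operations on a pandas dataframe.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical records: several rows can refer to the same date.
rows = [
    {"Date": "2020-03-01", "Confirmed": 2},
    {"Date": "2020-03-02", "Confirmed": 1},
    {"Date": "2020-03-01", "Confirmed": 3},
    {"Date": "2020-03-04", "Confirmed": 4},
]

# Step 1: add up all values reported for the same date.
daily = defaultdict(int)
for row in rows:
    daily[row["Date"]] += row["Confirmed"]

# Step 2: cumulative sum over the dates in chronological order
# (ISO date strings sort chronologically).
dates = sorted(daily)
cumulative = dict(zip(dates, accumulate(daily[d] for d in dates)))
```

Note that 2020-03-03 does not appear in the result, because no record exists for that day.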

memilio.epidata.getCaseData.write_case_data( df: pandas.DataFrame, directory: str, conf_obj, file_format: str = 'json_timeasstring', start_date: date = datetime.date(2020, 1, 1), end_date: date = datetime.date(2026, 7, 31), impute_dates: bool = False, moving_average: int = 0, split_berlin: bool = False, rep_date: bool = False, files: str = 'All', ) → dict

Writes the different case data files. The following data is generated and written to the named files:
- All infected (current and past) for the whole of Germany are stored in “cases_infected”
- All deaths for the whole of Germany are stored in “cases_deaths”
- Infected, deaths and recovered for the whole of Germany are stored in “cases_all_germany”
- Infected split by state are stored in “cases_infected_state”
- Infected, deaths and recovered split by state are stored in “cases_all_state”
- Infected split by county are stored in “cases_infected_county(_split_berlin)”
- Infected, deaths and recovered split by county are stored in “cases_all_county(_split_berlin)”
- Infected, deaths and recovered split by gender are stored in “cases_all_gender”
- Infected, deaths and recovered split by state and gender are stored in “cases_all_state_gender”
- Infected, deaths and recovered split by county and gender are stored in “cases_all_county_gender(_split_berlin)”
- Infected, deaths and recovered split by age are stored in “cases_all_age”
- Infected, deaths and recovered split by state and age are stored in “cases_all_state_age”
- Infected, deaths and recovered split by county and age are stored in “cases_all_county_age(_split_berlin)”

Parameters:

df – pd.DataFrame Processed dataframe
directory – str Path to the output directory
conf_obj – configuration object
file_format – str File format which is used for writing the data. Default defined in defaultDict. (Default value = dd.defaultDict[‘file_format’])
start_date – date. First date in dataframe. Default 2020-01-01. (Default value = dd.defaultDict[‘start_date’])
end_date – date. Last date in dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘end_date’])
impute_dates – bool. Defines if values for dates without new information are imputed. Default defined in defaultDict. (Default value = dd.defaultDict[‘impute_dates’])
moving_average – int. Integers >=0. Applies a ‘moving_average’-days moving average on all time series to smooth out effects of irregular reporting. Default defined in defaultDict. (Default value = dd.defaultDict[‘moving_average’])
split_berlin – bool True or False. Defines if Berlin’s districts are kept separated or get merged. Default defined in defaultDict. (Default value = dd.defaultDict[‘split_berlin’])
rep_date – bool True or False. Defines if reporting date or reference date is taken into dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘rep_date’])
files – list. List of strings or ‘All’ or ‘Plot’. Defines which files should be provided (and plotted). Default ‘All’.

Returns:

None
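The effect of impute_dates on a cumulative time series can be pictured with the following sketch. It assumes that a date without new information keeps the last known cumulative value (forward fill); this is an assumption made for illustration, and the helper `impute_dates` below is a hypothetical stdlib function, not the library's implementation.

```python
from datetime import date, timedelta
from typing import Dict


def impute_dates(cumulative: Dict[date, int], start: date,
                 end: date) -> Dict[date, int]:
    """Fill every day in [start, end], carrying the last known value forward."""
    filled = {}
    last = 0  # before the first report, the cumulative count is zero
    day = start
    while day <= end:
        last = cumulative.get(day, last)
        filled[day] = last
        day += timedelta(days=1)
    return filled


# Hypothetical cumulative counts with no entry for 3 and 5 March.
sparse = {date(2020, 3, 1): 5, date(2020, 3, 2): 6, date(2020, 3, 4): 10}
dense = impute_dates(sparse, date(2020, 3, 1), date(2020, 3, 5))
```

The result contains one entry per day between start_date and end_date, which is convenient when several time series have to be aligned on a common date axis.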