memilio.epidata.getCaseData

getCaseData.py downloads the case data of the Robert Koch Institute (RKI) and provides it in several differently structured files.

The raw case data we download can be found at https://github.com/robert-koch-institut/SARS-CoV-2-Infektionen_in_Deutschland

Be careful: The dates of death and recovery are not reported in the original case data and are extrapolated in this script.

Functions

check_for_completeness(df, run_checks[, ...])

Checks if all counties are mentioned in the case data set

fetch_case_data(directory, filename, conf_obj)

Downloads the case data

get_case_data([read_data, out_folder, ...])

Wrapper function that downloads the case data and writes differently structured data to json files.

main()

Main program entry.

preprocess_case_data(raw_df, directory, ...)

Preprocessing of the case data

write_case_data(df, directory, conf_obj[, ...])

Writes the different case data files. The generated files are listed in the description of write_case_data below.

memilio.epidata.getCaseData.check_for_completeness(
df: pandas.DataFrame,
run_checks: bool,
merge_berlin: bool = False,
merge_eisenach: bool = True,
)

Checks if all counties are mentioned in the case data set

This check was added because data downloads were sometimes incomplete. It verifies that all counties are part of the data. If the data is incomplete, it is downloaded from another source. Note: There is no check whether data is available for every day and every county, so such gaps can still occur.

Parameters:
  • df – pd.DataFrame. Dataframe to check

  • merge_berlin – bool True or False. Defines if Berlin’s districts are kept separated or get merged. Default value = False.

  • merge_eisenach – bool True or False. Defines if Eisenach’s districts are kept separated or get merged. Default value = True.

  • run_checks – bool

Returns:

Boolean to say if data is complete or not
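The idea behind such a completeness check can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `missing_counties`, the column name `ID_County`, and the short list of expected county IDs are all assumptions made for the example.

```python
def missing_counties(records, expected_ids, id_key="ID_County"):
    """Return the sorted expected county IDs that never occur in records.

    records is an iterable of dicts (one per data row); id_key is a
    hypothetical column name holding the county ID.
    """
    seen = {row[id_key] for row in records}
    return sorted(set(expected_ids) - seen)


# Toy data: three expected counties, one of them absent from the rows.
rows = [
    {"ID_County": 1001, "Date": "2021-01-01"},
    {"ID_County": 1002, "Date": "2021-01-01"},
    {"ID_County": 1001, "Date": "2021-01-02"},
]
expected = [1001, 1002, 1003]

missing = missing_counties(rows, expected)
is_complete = not missing
```

As in the note above, a check of this kind only says whether every county appears at least once. It says nothing about whether each county has data for each day.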

memilio.epidata.getCaseData.fetch_case_data(
directory: str,
filename: str,
conf_obj,
read_data: bool = False,
) → pandas.DataFrame

Downloads the case data

The data is read either from the internet or from a json file (CaseDataFull.json) stored in an earlier run. If the data is read from the internet, it is stored in CaseDataFull.json before anything is changed. After a download, it is checked whether the data contains all counties; if not, a different source is tried. The file is read from or stored in the folder “out_folder”/Germany/pydata. pandas is used to store and change the data.

Parameters:
  • directory – str Path to the output directory

  • filename – str Name of the full dataset filename

  • conf_obj – configuration object

  • read_data – bool. Defines if data is read from file or downloaded. (Default value = dd.defaultDict[‘read_data’])

Returns:

df pd.DataFrame. Dataframe containing the downloaded case data
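The read-or-download behaviour described above follows a common caching pattern, sketched here with the standard library only. This is not the library's code: `load_or_download` and `fake_download` are hypothetical names, and the callable passed as `download` merely stands in for the real network access.

```python
import json
import os
import tempfile


def load_or_download(directory, filename, download, read_data=False):
    """Return raw records, either from a cached JSON file or from download().

    download is a hypothetical zero-argument callable returning a list of
    dicts; it is only called when read_data is False.
    """
    path = os.path.join(directory, filename + ".json")
    if read_data:
        # Reuse the file written by an earlier run.
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    records = download()
    # Store the raw data before anything is changed.
    os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    return records


def fake_download():
    """Stand-in for the real download; returns one toy row."""
    return [{"ID_County": 1001, "Confirmed": 3}]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_download(tmp, "CaseDataFull", fake_download)
    second = load_or_download(tmp, "CaseDataFull", None, read_data=True)
```

The second call never touches the downloader, which is the point of passing read_data=True once a full dataset has been stored.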

memilio.epidata.getCaseData.get_case_data(
read_data: bool = False,
out_folder: str = '/home/docs/checkouts/readthedocs.org/user_builds/memilio/data/',
file_format: str = 'json_timeasstring',
start_date: date = datetime.date(2020, 1, 1),
end_date: date = datetime.date(2026, 5, 11),
impute_dates: bool = False,
moving_average: int = 0,
split_berlin: bool = False,
rep_date: bool = False,
files: str = 'All',
**kwargs,
) → dict

Wrapper function that downloads the case data and writes differently structured data to json files.

The data is read either from the internet or from a json file (CaseDataFull.json) stored in an earlier run. If the data is read from the internet, it is stored in CaseDataFull.json before anything is changed. After a download, it is checked whether the data contains all counties; if not, a different source is tried. The file is read from or stored in the folder “out_folder”/Germany/pydata. pandas is used to store and change the data.

While working with the data:

  • The column names are changed to English depending on defaultDict.

  • A new column “Date” is defined.

  • Only the values where the parameters NeuerFall, NeuerTodesfall, and NeuGenesen are larger than 0 are of interest. Negative values of these parameters are only useful for computing the difference to the previous day. For details, see the webpage mentioned above.

  • Unless stated otherwise, the values for all parameters and columns are added up for the whole of Germany for every date, and the cumulative sum is calculated.

  • For Berlin, all districts can be merged into one [Default]. Otherwise, Berlin is divided into multiple districts and different file names are used.

  • The following data is generated and written to the mentioned file names:
    • All infected (current and past) for the whole of Germany are stored in “cases_infected”

    • All deaths for the whole of Germany are stored in “cases_deaths”

    • Infected, deaths and recovered for the whole of Germany are stored in “cases_all_germany”

    • Infected split for states are stored in “cases_infected_state”

    • Infected, deaths and recovered split for states are stored in “cases_all_state”

    • Infected split for counties are stored in “cases_infected_county(_split_berlin)”

    • Infected, deaths and recovered split for county are stored in “cases_all_county(_split_berlin)”

    • Infected, deaths and recovered split for gender are stored in “cases_all_gender”

    • Infected, deaths and recovered split for state and gender are stored in “cases_all_state_gender”

    • Infected, deaths and recovered split for county and gender are stored in “cases_all_county_gender(_split_berlin)”

    • Infected, deaths and recovered split for age are stored in “cases_all_age”

    • Infected, deaths and recovered split for state and age are stored in “cases_all_state_age”

    • Infected, deaths and recovered split for county and age are stored in “cases_all_county_age(_split_berlin)”

Parameters:
  • read_data – True or False. Defines if data is read from file or downloaded. Default defined in defaultDict. (Default value = dd.defaultDict[‘read_data’])

  • file_format – File format which is used for writing the data. Default defined in defaultDict. (Default value = dd.defaultDict[‘file_format’])

  • out_folder – Folder where data is written to. Default defined in defaultDict. (Default value = dd.defaultDict[‘out_folder’])

  • start_date – First date in the dataframe. (Default value = dd.defaultDict[‘start_date’])

  • end_date – Last date in the dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘end_date’])

  • impute_dates – True or False. Defines if values for dates without new information are imputed. Default defined in defaultDict. (Default value = dd.defaultDict[‘impute_dates’])

  • moving_average – Integers >=0. Applies a ‘moving_average’-days moving average on all time series to smooth out effects of irregular reporting. Default defined in defaultDict. (Default value = dd.defaultDict[‘moving_average’])

  • split_berlin – True or False. Defines if Berlin’s districts are kept separated or get merged. Default defined in defaultDict. (Default value = dd.defaultDict[‘split_berlin’])

  • rep_date – True or False. Defines if reporting date or reference date is taken into dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘rep_date’])

  • files – List of strings or ‘All’ or ‘Plot’. Defines which files should be provided (and plotted). Default ‘All’.

  • **kwargs

Returns:

None
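A call to the wrapper might be prepared as in the following sketch. The keyword names are taken from the signature above; the helper `recent_case_data_kwargs`, the folder name "data", and the chosen values are assumptions for illustration only. The actual call is left as a comment because it needs memilio installed and network access.

```python
import datetime


def recent_case_data_kwargs(out_folder, days_back=30, today=None):
    """Build keyword arguments for get_case_data covering the last days_back days."""
    today = today or datetime.date.today()
    return {
        "read_data": False,      # download instead of reading CaseDataFull.json
        "out_folder": out_folder,
        "file_format": "json_timeasstring",
        "start_date": today - datetime.timedelta(days=days_back),
        "end_date": today,
        "impute_dates": True,    # impute values for dates without new information
        "moving_average": 7,     # smooth out effects of irregular reporting
        "split_berlin": False,
        "rep_date": False,
        "files": "All",
    }


kwargs = recent_case_data_kwargs("data", today=datetime.date(2021, 3, 31))

# With memilio installed, the call would then be:
# from memilio.epidata import getCaseData
# getCaseData.get_case_data(**kwargs)
```

Collecting the arguments in a dictionary keeps the date arithmetic testable without triggering a download.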

memilio.epidata.getCaseData.main()

Main program entry.

memilio.epidata.getCaseData.preprocess_case_data(
raw_df: pandas.DataFrame,
directory: str,
filename: str,
conf_obj,
split_berlin: bool = False,
rep_date: bool = False,
) → pandas.DataFrame

Preprocessing of the case data

While working with the data:

  • The column names are changed to English depending on defaultDict.

  • A new column “Date” is defined.

  • Only the values where the parameters NeuerFall, NeuerTodesfall, and NeuGenesen are larger than 0 are of interest. Negative values of these parameters are only useful for computing the difference to the previous day. For details, see the webpage mentioned above.

  • Unless stated otherwise, the values for all parameters and columns are added up for the whole of Germany for every date, and the cumulative sum is calculated.

  • For Berlin, all districts can be merged into one [Default]. Otherwise, Berlin is divided into multiple districts and different file names are used.

Parameters:
  • raw_df – pd.DataFrame. Contains the downloaded or read raw case data

  • directory – str Path to the output directory

  • filename – str Name of the full dataset filename

  • conf_obj – configuration object

  • split_berlin – bool. Defines if Berlin’s districts are kept separated or get merged. Default defined in defaultDict. (Default value = dd.defaultDict[‘split_berlin’])

  • rep_date – bool Defines if reporting date or reference date is taken into dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘rep_date’])

Returns:

df pd.DataFrame
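The "add up per date, then take the cumulative sum" step can be illustrated without pandas. The sketch below is not the library's implementation: `cumulative_by_date` and the column name `Confirmed` are hypothetical, and the filtering on NeuerFall, NeuerTodesfall, and NeuGenesen is left out to keep the example short.

```python
from collections import defaultdict
from itertools import accumulate


def cumulative_by_date(records, value_key, date_key="Date"):
    """Sum value_key per date over all rows, then accumulate over time.

    Returns a list of (date, cumulative_value) tuples sorted by date.
    """
    per_day = defaultdict(int)
    for row in records:
        per_day[row[date_key]] += row[value_key]
    dates = sorted(per_day)
    totals = accumulate(per_day[d] for d in dates)
    return list(zip(dates, totals))


# Toy rows from two counties; "Confirmed" is a hypothetical column name.
rows = [
    {"Date": "2021-01-01", "ID_County": 1001, "Confirmed": 2},
    {"Date": "2021-01-01", "ID_County": 1002, "Confirmed": 1},
    {"Date": "2021-01-02", "ID_County": 1001, "Confirmed": 4},
    {"Date": "2021-01-04", "ID_County": 1002, "Confirmed": 3},
]

germany = cumulative_by_date(rows, "Confirmed")
```

Note that 2021-01-03 does not appear in the result, because no row reports that date. Gaps of this kind are what the impute_dates option of the writing step addresses.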

memilio.epidata.getCaseData.write_case_data(
df: pandas.DataFrame,
directory: str,
conf_obj,
file_format: str = 'json_timeasstring',
start_date: date = datetime.date(2020, 1, 1),
end_date: date = datetime.date(2026, 5, 11),
impute_dates: bool = False,
moving_average: int = 0,
split_berlin: bool = False,
rep_date: bool = False,
files: str = 'All',
) → dict

Writes the different case data files. The following data is generated and written to the mentioned file names:

  • All infected (current and past) for the whole of Germany are stored in “cases_infected”

  • All deaths for the whole of Germany are stored in “cases_deaths”

  • Infected, deaths and recovered for the whole of Germany are stored in “cases_all_germany”

  • Infected split for states are stored in “cases_infected_state”

  • Infected, deaths and recovered split for states are stored in “cases_all_state”

  • Infected split for counties are stored in “cases_infected_county(_split_berlin)”

  • Infected, deaths and recovered split for county are stored in “cases_all_county(_split_berlin)”

  • Infected, deaths and recovered split for gender are stored in “cases_all_gender”

  • Infected, deaths and recovered split for state and gender are stored in “cases_all_state_gender”

  • Infected, deaths and recovered split for county and gender are stored in “cases_all_county_gender(_split_berlin)”

  • Infected, deaths and recovered split for age are stored in “cases_all_age”

  • Infected, deaths and recovered split for state and age are stored in “cases_all_state_age”

  • Infected, deaths and recovered split for county and age are stored in “cases_all_county_age(_split_berlin)”
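The notation “(_split_berlin)” in the county-level names above suggests an optional suffix. A minimal sketch of that naming rule follows, under the assumption that the suffix is appended exactly when split_berlin is True; the helper `county_file_name` is hypothetical and not part of the module.

```python
def county_file_name(base, split_berlin=False):
    """Append the _split_berlin suffix to a county-level base file name."""
    return (base + "_split_berlin") if split_berlin else base


merged_name = county_file_name("cases_all_county")
split_name = county_file_name("cases_all_county", split_berlin=True)
```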

Parameters:
  • df – pd.DataFrame Processed dataframe

  • directory – str Path to the output directory

  • conf_obj – configuration object

  • file_format – str File format which is used for writing the data. Default defined in defaultDict. (Default value = dd.defaultDict[‘file_format’])

  • start_date – date First date in the dataframe. Default 2020-01-01. (Default value = dd.defaultDict[‘start_date’])

  • end_date – date. Last date in the dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘end_date’])

  • impute_dates – bool True or False. Defines if values for dates without new information are imputed. Default defined in defaultDict. (Default value = dd.defaultDict[‘impute_dates’])

  • moving_average – int Integers >=0. Applies a ‘moving_average’-days moving average on all time series to smooth out effects of irregular reporting. Default defined in defaultDict. (Default value = dd.defaultDict[‘moving_average’])

  • split_berlin – bool True or False. Defines if Berlin’s districts are kept separated or get merged. Default defined in defaultDict. (Default value = dd.defaultDict[‘split_berlin’])

  • rep_date – bool True or False. Defines if reporting date or reference date is taken into dataframe. Default defined in defaultDict. (Default value = dd.defaultDict[‘rep_date’])

  • files – list. List of strings or ‘All’ or ‘Plot’. Defines which files should be provided (and plotted). Default ‘All’.

Returns:

None
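The effect of impute_dates and moving_average on a cumulative time series can be sketched as follows. Both functions are illustrative assumptions, not the library's code: the sketch fills missing days by carrying the last value forward and uses a trailing window, whereas the module may impute and align the window differently.

```python
import datetime


def impute_dates(series):
    """Fill missing days in a cumulative series by carrying the last value forward.

    series maps datetime.date to a cumulative count.
    """
    if not series:
        return {}
    days = sorted(series)
    filled = {}
    current = days[0]
    last_value = series[current]
    while current <= days[-1]:
        # Use the reported value if present, else repeat the previous one.
        last_value = series.get(current, last_value)
        filled[current] = last_value
        current += datetime.timedelta(days=1)
    return filled


def trailing_moving_average(values, window):
    """Average each value with up to window - 1 preceding values."""
    if window <= 1:
        # A window of 0 or 1 leaves the series unchanged.
        return list(values)
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


raw = {
    datetime.date(2021, 1, 1): 3,
    datetime.date(2021, 1, 2): 7,
    datetime.date(2021, 1, 4): 10,
}
filled = impute_dates(raw)
smoothed = trailing_moving_average(list(filled.values()), window=3)
```

Imputing before smoothing matters here: without the filled-in value for 2021-01-03, the window would span an uneven number of calendar days.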