memilio.epidata.modifyDataframeSeries

modifyDataframeSeries.py: Tools for modifying data frame series, such as imputing zeros for unknown dates, copying previous values, and/or computing moving averages.

Functions

create_intervals_mapping(from_lower_bounds, ...)

Creates a mapping from given intervals to new desired intervals

extract_subframe_based_on_dates(df, ...)

Removes all data with date lower than start date or higher than end date.

fit_age_group_intervals(df_age_in, age_out)

Creates a mapping from given intervals to new desired intervals. Provide all intervals as "x-y".

impute_and_reduce_df(df_old, group_by_cols, ...)

Imputes missing dates of a dataframe time series and optionally calculates a moving average of the data.

insert_column_by_map(df, col_to_map, ...[, ...])

Adds a column to a given dataframe based on a mapping of values of a given column

split_column_based_on_values(df_to_split, ...)

Splits a column in a dataframe into separate columns.

memilio.epidata.modifyDataframeSeries.create_intervals_mapping(
from_lower_bounds,
to_lower_bounds,
)

Creates a mapping from given intervals to new desired intervals

Parameters:
  • from_lower_bounds – lower bounds of original intervals

  • to_lower_bounds – desired lower bounds of new intervals

Returns:

Mapping from intervals to intervals. The mapping is given as a list of tuples for every original interval. The list contains a tuple for every new interval intersecting the original interval. Each tuple defines the share of the original interval that is mapped to the new interval and the index of the new interval. We assume that the range of the intervals mapped from is contained in the range of the intervals mapped to. For example, for from_lower_bounds = [5,20,30,80,85,90] and to_lower_bounds = [0,15,20,60,100], the mapping would be [[[2/3,0], [1/3,1]], [[1,2]], [[3/5,2], [2/5,3]], [[1,3]], [[1,3]]].
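The example above can be reproduced with a few lines of plain Python. The following is only a sketch of the idea, not the library's implementation; the function name `sketch_intervals_mapping` is made up for illustration, and it assumes that the last entry of each list only closes the final interval.

```python
from fractions import Fraction


def sketch_intervals_mapping(from_lower_bounds, to_lower_bounds):
    """Illustrative re-implementation (hypothetical helper, not MEmilio code)."""
    mapping = []
    # Consecutive lower bounds define half-open intervals [low, high).
    for low, high in zip(from_lower_bounds[:-1], from_lower_bounds[1:]):
        shares = []
        new_intervals = zip(to_lower_bounds[:-1], to_lower_bounds[1:])
        for index, (new_low, new_high) in enumerate(new_intervals):
            overlap = min(high, new_high) - max(low, new_low)
            if overlap > 0:
                # Share of the original interval that falls into the new one.
                shares.append([Fraction(overlap, high - low), index])
        mapping.append(shares)
    return mapping


result = sketch_intervals_mapping([5, 20, 30, 80, 85, 90], [0, 15, 20, 60, 100])
for entry in result:
    print([[str(share), index] for share, index in entry])
```

Exact fractions are used so that the shares of every original interval sum to exactly 1.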

memilio.epidata.modifyDataframeSeries.extract_subframe_based_on_dates(
df,
start_date,
end_date,
)

Removes all data with date lower than start date or higher than end date.

Returns the Dataframe with only dates between start date and end date. Resets the Index of the Dataframe.

Parameters:
  • df – The dataframe which has to be edited

  • start_date – First date to keep in the dataframe

  • end_date – Last date to keep in the dataframe

Returns:

a dataframe with the extracted dates
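The described behaviour (keep only rows between the two dates, then reset the index) can be sketched directly in pandas. This is not the library's code; the column names `Date` and `Confirmed` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical input data; the column names are assumptions.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
    "Confirmed": [1, 3, 6, 10],
})
start_date = pd.Timestamp("2021-01-02")
end_date = pd.Timestamp("2021-01-03")

# Drop rows before start_date or after end_date, then renumber the rows.
mask = (df["Date"] >= start_date) & (df["Date"] <= end_date)
sub = df[mask].reset_index(drop=True)
print(sub)
```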

memilio.epidata.modifyDataframeSeries.fit_age_group_intervals(
df_age_in,
age_out,
df_population=None,
max_age=100,
)
Creates a mapping from given intervals to new desired intervals. Provide all intervals as “x-y”.

Boundary age groups can be provided with “<x” or “>y”. Minimum and maximum are then taken as 0 and 99, respectively.

If df_population is set, we can use this data set to best interpolate @df_age_in to the desired age stratification of @age_out. Where this data is not resolved finely enough, or if this data set is not provided, we assume the population to be equally distributed.

Example: df_age_in = [“1-10”: 4, “11-60”: 10, “61-99”: 8] and age_out = [“1-5”, “6-10”, “11-50”, “51-99”] return [“1-5”: 2, “6-10”: 2, “11-50”: 8, “51-99”: 10] if no population data is provided.

If we also provide the population data population = [“1-5”: 40, “6-7”: 5, “8-10”: 5, “11-60”: 25, “61-99”: 25], the output is: [“1-5”: 3.2, “6-10”: 0.8, “11-50”: 8., “51-99”: 10.]
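The arithmetic behind both results can be checked with a small standalone sketch. It works on plain dictionaries rather than dataframes, handles only "x-y" labels (not "<x" or ">y"), and assumes the population groups cover every age of the input. The helper names are hypothetical and this is not the library's implementation.

```python
from fractions import Fraction


def parse_group(label):
    """Turn "x-y" into the inclusive integer ages (x, y)."""
    low, high = label.split("-")
    return int(low), int(high)


def sketch_fit_age_groups(values_in, age_out, population=None):
    """Hypothetical helper: redistribute values to new age groups."""
    # Weight of each single year of age: uniform, or the population of a
    # group spread evenly over the years inside that group.
    source = population if population is not None else values_in
    weight = {}
    for label, size in source.items():
        low, high = parse_group(label)
        for age in range(low, high + 1):
            if population is None:
                weight[age] = Fraction(1)
            else:
                weight[age] = Fraction(size, high - low + 1)

    # Spread each input value over its years of age, proportional to weight.
    per_age = {}
    for label, value in values_in.items():
        low, high = parse_group(label)
        ages = range(low, high + 1)
        total = sum(weight[age] for age in ages)
        for age in ages:
            per_age[age] = value * weight[age] / total

    # Sum the per-age values inside every requested output group.
    result = {}
    for label in age_out:
        low, high = parse_group(label)
        result[label] = sum(per_age[age] for age in range(low, high + 1))
    return result


values = {"1-10": 4, "11-60": 10, "61-99": 8}
groups = ["1-5", "6-10", "11-50", "51-99"]
pop = {"1-5": 40, "6-7": 5, "8-10": 5, "11-60": 25, "61-99": 25}

uniform = sketch_fit_age_groups(values, groups)
weighted = sketch_fit_age_groups(values, groups, pop)
print({k: float(v) for k, v in uniform.items()})
print({k: float(v) for k, v in weighted.items()})
```

With population data, the 4 cases of "1-10" are split 40:10 between "1-5" and "6-10", which gives 3.2 and 0.8; without it, they are split evenly.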

Parameters:
  • df_age_in – Dataframe with columns of different age intervals and one row for subpopulation sizes for an arbitrary feature.

  • age_out – Desired age group distribution in list of strings.

  • df_population – Total population data of the same structure as df_age_in used to inter- or extrapolate data of @df_age_in. (Default value = None)

  • max_age – (Default value = 100)

Returns:

Subpopulations of @df_age_in inter- or extrapolated to age stratification as required by @age_out.

memilio.epidata.modifyDataframeSeries.impute_and_reduce_df(
df_old,
group_by_cols,
mod_cols,
impute='forward',
moving_average=0,
min_date='',
max_date='',
start_w_firstval=False,
)

Imputes missing dates of a dataframe time series and optionally calculates a moving average of the data. Extracts dates between min and max date.

Parameters:
  • df_old – old pandas dataframe

  • group_by_cols – Column names for grouping by and items of particular group specification (e.g., for region: list of county or federal state IDs)

  • mod_cols – List of columns for which the imputation and/or moving average is conducted (e.g., Confirmed or ICU)

  • impute – [Default: ‘forward’] Imputes either based on older values (‘forward’) or zeros (‘zeros’)

  • moving_average – [Default: 0, no averaging] Number of days over which to compute the moving average

  • min_date – [Default: ‘’, taken from df_old] If set, minimum date to be set in new data frame for all items in group_by

  • max_date – [Default: ‘’, taken from df_old] If set, maximum date to be set in new data frame for all items in group_by

  • start_w_firstval – [Default: False] If True and min_date < first date in dataframe, then between min_date and first date, the value of the first date will be repeated backwards. If False, then zero is set there.

Returns:

dataframe with imputed dates (and moving average if requested)
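To make the difference between the two imputation modes concrete, here is a minimal pandas sketch for a single group. It is not the library's implementation: the column names are assumptions, and the trailing 3-day window is chosen purely for illustration, since this section does not state how the averaging window is aligned.

```python
import pandas as pd

# Hypothetical series with two missing days (2021-01-03 and 2021-01-04).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"]),
    "Confirmed": [1, 3, 7],
})

# Build the complete daily date range and align the data to it.
full_range = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")
series = df.set_index("Date")["Confirmed"].reindex(full_range)

forward = series.ffill()    # like impute='forward': repeat the older value
zeros = series.fillna(0)    # like impute='zeros': fill gaps with zero

# Assumption: trailing window of 3 days, shorter at the start of the series.
smoothed = forward.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"forward": forward, "zeros": zeros, "smoothed": smoothed}))
```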

memilio.epidata.modifyDataframeSeries.insert_column_by_map(
df,
col_to_map,
new_col_name,
map,
new_col_dtype='object',
)

Adds a column to a given dataframe based on a mapping of values of a given column

The mapping is defined by a list containing tuples of the form (new_value, old_value), where old_value is a value in col_to_map and new_value is the value that is added in the new column if col_to_map contains old_value.

Parameters:
  • df – dataframe to modify

  • col_to_map – column containing values to be mapped

  • new_col_name – name of the new column containing the mapped values

  • map – List of tuples of values in the column to be added and values in the given column

  • new_col_dtype – String of dtype [Default: ‘object’] for the new generated column

Returns:

dataframe df with column of state names corresponding to state ids
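The (new_value, old_value) convention is easy to get backwards, so a short pandas sketch of the same idea may help. It does not call the library; the column names `ID_State` and `State` and the example mapping are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data and mapping; tuples are (new_value, old_value).
df = pd.DataFrame({"ID_State": [1, 2, 1]})
state_map = [("Schleswig-Holstein", 1), ("Hamburg", 2)]

# Turn the list of tuples into an old_value -> new_value lookup.
lookup = {old_value: new_value for new_value, old_value in state_map}

# Add the new column and cast it to the requested dtype.
df["State"] = df["ID_State"].map(lookup).astype("object")
print(df)
```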

memilio.epidata.modifyDataframeSeries.split_column_based_on_values(
df_to_split,
column_to_split,
column_vals_name,
groupby_list,
column_identifiers_to_names_dict,
compute_cumsum,
)

Splits a column in a dataframe into separate columns. For each unique value that appears in a selected column, all corresponding values in another column are transferred to a new column. If required, the cumulative sum is calculated in the newly generated columns.

Parameters:
  • df_to_split – global pandas dataframe

  • column_to_split – identifier of the column for which separate values will define separate dataframes

  • column_vals_name – The name of the original column which will be split into separate columns named according to new_column_labels.

  • groupby_list – The name of the original columns with which data of new_column_labels can be joined.

  • column_identifiers_to_names_dict – Dict for new labels of resulting columns.

  • compute_cumsum – Computes cumulative sum in new generated columns

Returns:

a dataframe with the new split columns
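The reshaping described above resembles a long-to-wide pivot. The sketch below shows that idea with plain pandas, assuming hypothetical column names (`Date`, `Vacc_Step`, `Number`) and a hypothetical renaming dictionary. It is not the library's implementation and may differ from it in details such as column order.

```python
import pandas as pd

# Hypothetical long-format data: one row per date and category.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"]),
    "Vacc_Step": [1, 2, 1, 2],      # plays the role of column_to_split
    "Number": [10, 4, 5, 6],        # plays the role of column_vals_name
})
# Plays the role of column_identifiers_to_names_dict.
names = {1: "Vacc_partially", 2: "Vacc_completed"}

# One new column per unique value of "Vacc_Step", grouped by "Date".
wide = (
    df.pivot_table(index="Date", columns="Vacc_Step",
                   values="Number", aggfunc="sum")
    .rename(columns=names)
    .reset_index()
)
wide.columns.name = None

# Optional cumulative sum in the newly generated columns.
new_columns = list(names.values())
cumulative = wide.copy()
cumulative[new_columns] = cumulative[new_columns].cumsum()
print(cumulative)
```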