A family of functions for lumping together levels that meet some criteria.

  • fct_lump_min(): lumps levels that appear fewer than min times.

  • fct_lump_prop(): lumps levels that appear prop * n times or fewer, where n is the number of values in f.

  • fct_lump_n(): lumps all levels except for the n most frequent (or least frequent if n < 0).

  • fct_lump_lowfreq(): lumps together the least frequent levels, ensuring that "other" is still the smallest level.

fct_lump() exists primarily for historical reasons, as it automatically picks between these different methods depending on its arguments. We no longer recommend that you use it.
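
As a rough sketch of that dispatch, assuming fct_lump() simply forwards to one of the functions above depending on which of n and prop is supplied, the comparisons below should all return TRUE. The factor f is made up for illustration.

```r
library(forcats)

# Hypothetical factor with counts a = 4, b = 3, c = 1, d = 1.
f <- factor(rep(c("a", "b", "c", "d"), times = c(4, 3, 1, 1)))

# Only `n` supplied: expected to match fct_lump_n().
same_n <- identical(fct_lump(f, n = 2), fct_lump_n(f, n = 2))

# Only `prop` supplied: expected to match fct_lump_prop().
same_prop <- identical(fct_lump(f, prop = 0.2), fct_lump_prop(f, prop = 0.2))

# Neither supplied: expected to match fct_lump_lowfreq().
same_lowfreq <- identical(fct_lump(f), fct_lump_lowfreq(f))

c(same_n, same_prop, same_lowfreq)
```

Calling the specific function directly makes the lumping rule visible at the call site instead of leaving it to be inferred from the arguments.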

Usage

fct_lump(
  f,
  n,
  prop,
  w = NULL,
  other_level = "Other",
  ties.method = c("min", "average", "first", "last", "random", "max")
)

fct_lump_min(f, min, w = NULL, other_level = "Other")

fct_lump_prop(f, prop, w = NULL, other_level = "Other")

fct_lump_n(
  f,
  n,
  w = NULL,
  other_level = "Other",
  ties.method = c("min", "average", "first", "last", "random", "max")
)

fct_lump_lowfreq(f, w = NULL, other_level = "Other")

Arguments

f

A factor (or character vector).

n

Positive n preserves the most common n values. Negative n preserves the least common -n values. If there are ties, you will get at least abs(n) values.

prop

Positive prop lumps values that appear no more than prop of the time. Negative prop lumps values that appear more than -prop of the time.
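
A minimal sketch of the two signs of prop, using a made-up factor whose levels make up 60%, 20%, 10% and 10% of the values. The threshold 0.15 is chosen to sit clearly between the proportions, so no level falls exactly on the boundary.

```r
library(forcats)

# Hypothetical factor: a = 6, b = 2, c = 1, d = 1 (10 values in total).
f <- factor(rep(c("a", "b", "c", "d"), times = c(6, 2, 1, 1)))

# Positive prop: the rare levels (c and d, 10% each) are lumped.
rare_lumped <- fct_lump_prop(f, prop = 0.15)
levels(rare_lumped)    # a, b, Other

# Negative prop: the common levels (a and b) are lumped instead.
common_lumped <- fct_lump_prop(f, prop = -0.15)
levels(common_lumped)  # c, d, Other
```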

w

An optional numeric vector giving weights for frequency of each value (not level) in f.
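
To illustrate the "value, not level" point, here is a small sketch with made-up data: w has one entry per element of f, and the weights of each level are summed before levels are ranked.

```r
library(forcats)

# Hypothetical data: five values spread over three levels.
f <- factor(c("a", "a", "a", "b", "c"))

# One weight per value of `f` (length 5), not one per level (length 3).
w <- c(1, 1, 1, 10, 1)

# Unweighted, "a" is the most frequent level (3 of 5 values).
unweighted <- fct_lump_n(f, n = 1)
levels(unweighted)  # a, Other

# Weighted, "b" totals 10 against 3 for "a", so "b" is kept instead.
weighted <- fct_lump_n(f, n = 1, w = w)
levels(weighted)    # b, Other
```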

other_level

Value of level used for "other" values. Always placed at end of levels.
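
For instance, with a made-up factor and "Rare" as an arbitrary replacement label, the lumped level should take that name and sort last:

```r
library(forcats)

# Hypothetical factor with counts a = 3, b = 2, c = 1, d = 1.
f <- factor(rep(c("a", "b", "c", "d"), times = c(3, 2, 1, 1)))

# Keep the two most common levels; label everything else "Rare".
lumped <- fct_lump_n(f, n = 2, other_level = "Rare")
levels(lumped)  # a, b, Rare
table(lumped)
```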

ties.method

A character string specifying how ties are treated. See rank() for details.
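
The effect is easiest to see on a factor built to contain ties. The sketch below uses made-up data in which b and c are tied for second place, so the number of levels kept for n = 2 depends on how rank() breaks the tie.

```r
library(forcats)

# Hypothetical factor with counts a = 3, b = 2, c = 2, d = 1, e = 1.
f <- factor(rep(c("a", "b", "c", "d", "e"), times = c(3, 2, 2, 1, 1)))

# "min" (the default): b and c both get rank 2, so both are kept.
levels(fct_lump_n(f, n = 2))                         # a, b, c, Other

# "first": the tie is broken by level order, so only b is kept.
levels(fct_lump_n(f, n = 2, ties.method = "first"))  # a, b, Other

# "max": b and c both get rank 3, so neither is kept.
levels(fct_lump_n(f, n = 2, ties.method = "max"))    # a, Other
```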

min

Preserve levels that appear at least min number of times.
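
A short sketch of the boundary, using a made-up factor: a level that appears exactly min times is preserved, and only levels below that count are lumped.

```r
library(forcats)

# Hypothetical factor with counts a = 3, b = 2, c = 1, d = 1.
f <- factor(rep(c("a", "b", "c", "d"), times = c(3, 2, 1, 1)))

# "b" appears exactly 2 times, so it is kept; "c" and "d" are lumped.
lumped_min <- fct_lump_min(f, min = 2)
levels(lumped_min)  # a, b, Other
```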

See also

fct_other() to convert specified levels to other.
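
The difference is that fct_other() selects levels by name rather than by frequency. A minimal comparison on a made-up factor:

```r
library(forcats)

# Hypothetical factor with counts a = 4, b = 3, c = 1, d = 1.
f <- factor(rep(c("a", "b", "c", "d"), times = c(4, 3, 1, 1)))

# Lump by frequency: the two most common levels are kept.
by_count <- fct_lump_n(f, n = 2)
levels(by_count)  # a, b, Other

# Lump by name: exactly the levels listed in `keep` are kept,
# however rare they are.
by_name <- fct_other(f, keep = c("a", "d"))
levels(by_name)   # a, d, Other
```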

Examples

library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% table()
#> .
#>  A  B  C  D  E  F  G  H  I 
#> 40 10  5 27  1  1  1  1  1 
x %>%
  fct_lump_n(3) %>%
  table()
#> .
#>     A     B     D Other 
#>    40    10    27    10 
x %>%
  fct_lump_prop(0.10) %>%
  table()
#> .
#>     A     B     D Other 
#>    40    10    27    10 
x %>%
  fct_lump_min(5) %>%
  table()
#> .
#>     A     B     C     D Other 
#>    40    10     5    27     5 
x %>%
  fct_lump_lowfreq() %>%
  table()
#> .
#>     A     D Other 
#>    40    27    20 

x <- factor(letters[rpois(100, 5)])
x
#>  [1] e e k a g c a e a g b e b g d b k e f g c d c e h e e g g b d b a e d
#> [36] d b h e f h d d f g c b c f f d b d d b f c c c c d b i c g b d e b g
#> [71] c d e a d e h c b e f d f d b c d a e c e a b c c h c c g
#> Levels: a b c d e f g h i k
table(x)
#> x
#>  a  b  c  d  e  f  g  h  i  k 
#>  7 15 18 17 16  8 10  5  1  2 
table(fct_lump_lowfreq(x))
#> 
#>     a     b     c     d     e     f     g     h Other 
#>     7    15    18    17    16     8    10     5     3 

# Use positive values to collapse the rarest
fct_lump_n(x, n = 3)
#>  [1] e     e     Other Other Other c     Other e     Other Other Other
#> [12] e     Other Other d     Other Other e     Other Other c     d    
#> [23] c     e     Other e     e     Other Other Other d     Other Other
#> [34] e     d     d     Other Other e     Other Other d     d     Other
#> [45] Other c     Other c     Other Other d     Other d     d     Other
#> [56] Other c     c     c     c     d     Other Other c     Other Other
#> [67] d     e     Other Other c     d     e     Other d     e     Other
#> [78] c     Other e     Other d     Other d     Other c     d     Other
#> [89] e     c     e     Other Other c     c     Other c     c     Other
#> Levels: c d e Other
fct_lump_prop(x, prop = 0.1)
#>  [1] e     e     Other Other g     c     Other e     Other g     b    
#> [12] e     b     g     d     b     Other e     Other g     c     d    
#> [23] c     e     Other e     e     g     g     b     d     b     Other
#> [34] e     d     d     b     Other e     Other Other d     d     Other
#> [45] g     c     b     c     Other Other d     b     d     d     b    
#> [56] Other c     c     c     c     d     b     Other c     g     b    
#> [67] d     e     b     g     c     d     e     Other d     e     Other
#> [78] c     b     e     Other d     Other d     b     c     d     Other
#> [89] e     c     e     Other b     c     c     Other c     c     g    
#> Levels: b c d e g Other

# Use negative values to collapse the most common
fct_lump_n(x, n = -3)
#>  [1] Other Other k     Other Other Other Other Other Other Other Other
#> [12] Other Other Other Other Other k     Other Other Other Other Other
#> [23] Other Other h     Other Other Other Other Other Other Other Other
#> [34] Other Other Other Other h     Other Other h     Other Other Other
#> [45] Other Other Other Other Other Other Other Other Other Other Other
#> [56] Other Other Other Other Other Other Other i     Other Other Other
#> [67] Other Other Other Other Other Other Other Other Other Other h    
#> [78] Other Other Other Other Other Other Other Other Other Other Other
#> [89] Other Other Other Other Other Other Other h     Other Other Other
#> Levels: h i k Other
fct_lump_prop(x, prop = -0.1)
#>  [1] Other Other k     a     Other Other a     Other a     Other Other
#> [12] Other Other Other Other Other k     Other f     Other Other Other
#> [23] Other Other h     Other Other Other Other Other Other Other a    
#> [34] Other Other Other Other h     Other f     h     Other Other f    
#> [45] Other Other Other Other f     f     Other Other Other Other Other
#> [56] f     Other Other Other Other Other Other i     Other Other Other
#> [67] Other Other Other Other Other Other Other a     Other Other h    
#> [78] Other Other Other f     Other f     Other Other Other Other a    
#> [89] Other Other Other a     Other Other Other h     Other Other Other
#> Levels: a f h i k Other

# Use weighted frequencies
# `w` needs one weight per value of `x`. `x` can be shorter than 100 here
# (it has 99 values above) because rpois() may return 0, which letters[] drops.
w <- ifelse(seq_along(x) <= 50, 2, 1)
fct_lump_n(x, n = 5, w = w)

# Use ties.method to control how tied factors are collapsed
fct_lump_n(x, n = 6)
#>  [1] e     e     Other Other g     c     Other e     Other g     b    
#> [12] e     b     g     d     b     Other e     f     g     c     d    
#> [23] c     e     Other e     e     g     g     b     d     b     Other
#> [34] e     d     d     b     Other e     f     Other d     d     f    
#> [45] g     c     b     c     f     f     d     b     d     d     b    
#> [56] f     c     c     c     c     d     b     Other c     g     b    
#> [67] d     e     b     g     c     d     e     Other d     e     Other
#> [78] c     b     e     f     d     f     d     b     c     d     Other
#> [89] e     c     e     Other b     c     c     Other c     c     g    
#> Levels: b c d e f g Other
fct_lump_n(x, n = 6, ties.method = "max")
#>  [1] e     e     Other Other g     c     Other e     Other g     b    
#> [12] e     b     g     d     b     Other e     f     g     c     d    
#> [23] c     e     Other e     e     g     g     b     d     b     Other
#> [34] e     d     d     b     Other e     f     Other d     d     f    
#> [45] g     c     b     c     f     f     d     b     d     d     b    
#> [56] f     c     c     c     c     d     b     Other c     g     b    
#> [67] d     e     b     g     c     d     e     Other d     e     Other
#> [78] c     b     e     f     d     f     d     b     c     d     Other
#> [89] e     c     e     Other b     c     c     Other c     c     g    
#> Levels: b c d e f g Other

# Use fct_lump_min() to lump together all levels with fewer than `min` values
table(fct_lump_min(x, min = 10))
#> 
#>     b     c     d     e     g Other 
#>    15    18    17    16    10    23 
table(fct_lump_min(x, min = 15))
#> 
#>     b     c     d     e Other 
#>    15    18    17    16    33