Mushroom Data Set 2020



Abstract

Book-based mushroom data set describing the physical characteristics of mushrooms, intended for binary classification into poisonous or edible.

Primary Data Table

Data Set Characteristics: Multivariate
Attribute Characteristics: qualitative and quantitative
Associated Tasks: Data simulation
Number of Instances: 173
Number of Variables: 20
Missing values: Yes
Area: Life
Date published: 15.10.2020

Secondary Data Table

Data Set Characteristics: Multivariate
Attribute Characteristics: qualitative and quantitative
Associated Tasks: Binary classification
Number of Instances: 61,069
Number of Variables: 20
Missing values: No
Area: Life
Date published: 15.10.2020

Citation

Wagner, D., Heider, D. & Hattab, G. Mushroom data creation, curation, and simulation to support classification tasks. Sci Rep 11, 8134 (2021). https://doi.org/10.1038/s41598-021-87602-3


Source

Created by Dennis Wagner, Dominik Heider, Georges Hattab
Based on Patrick Hardin. Mushrooms & Toadstools. Collins, 2012
Inspired by Jeff Schlimmer. Mushroom Data Set. 1987. URL: https://archive.ics.uci.edu/ml/datasets/Mushroom


License

All source code and associated data available on this site are open source and freely available for modification and remixing under the Creative Commons license CC BY 4.0.


Data Set Information

The primary data set contains descriptions of 173 mushroom species as entries. It can be used to simulate hypothetical mushrooms.

The secondary data set is a product of such simulation and contains 61,069 hypothetical mushrooms. It can be used for binary classification.

Variable (Measurement): Values

cap-diameter (quantitative): float number in cm
cap-shape (qualitative): bell=b, conical=c, convex=x, flat=f, sunken=s, spherical=p, others=o
cap-surface (qualitative): fibrous=i, grooves=g, scaly=y, smooth=s, shiny=h, leathery=l, silky=k, sticky=t, wrinkled=w, fleshy=e
cap-color (qualitative): brown=n, buff=b, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y, blue=l, orange=o, black=k
does-bruise-bleed (qualitative): bruises-or-bleeding=t, no=f
gill-attachment (qualitative): adnate=a, adnexed=x, decurrent=d, free=e, sinuate=s, pores=p, none=f, unknown=?
gill-spacing (qualitative): close=c, distant=d, none=f
gill-color (qualitative): see cap-color, none=f
stem-height (quantitative): float number in cm
stem-width (quantitative): float number in mm
stem-root (qualitative): bulbous=b, swollen=s, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r
stem-surface (qualitative): see cap-surface, none=f
stem-color (qualitative): see cap-color, none=f
veil-type (qualitative): partial=p, universal=u
veil-color (qualitative): see cap-color, none=f
has-ring (qualitative): ring=t
ring-type (qualitative): cobwebby=c, evanescent=e, flaring=r, grooved=g, large=l, pendant=p, sheathing=s, zone=z, scaly=y, movable=m, none=f, unknown=?
spore-print-color (qualitative): see cap-color
habitat (qualitative): grasses=g, leaves=l, meadows=m, paths=p, heaths=h, urban=u, waste=w, woods=d
season (qualitative): spring=s, summer=u, autumn=a, winter=w
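
The single-letter codes in the variable list can be turned back into readable labels with a small lookup. The following is a minimal sketch, not part of the published data set: the names `CODE_TABLES` and `decode_record` are hypothetical, only a few of the variables are transcribed, and the record layout (a dictionary keyed by variable name) is an assumption about how a row might be held in memory.

```python
# Code tables transcribed from the variable list above (subset only).
COLOR_CODES = {
    "n": "brown", "b": "buff", "g": "gray", "r": "green", "p": "pink",
    "u": "purple", "e": "red", "w": "white", "y": "yellow", "l": "blue",
    "o": "orange", "k": "black",
}
CAP_SHAPE_CODES = {
    "b": "bell", "c": "conical", "x": "convex", "f": "flat",
    "s": "sunken", "p": "spherical", "o": "others",
}
SEASON_CODES = {"s": "spring", "u": "summer", "a": "autumn", "w": "winter"}

# Hypothetical lookup structure: variable name -> {code: label}.
CODE_TABLES = {
    "cap-shape": CAP_SHAPE_CODES,
    "cap-color": COLOR_CODES,
    # gill-color and stem-color reuse the cap-color codes and add none=f.
    "gill-color": {**COLOR_CODES, "f": "none"},
    "stem-color": {**COLOR_CODES, "f": "none"},
    "season": SEASON_CODES,
}


def decode_record(record):
    """Return a copy of `record` with known codes replaced by their labels."""
    decoded = {}
    for variable, value in record.items():
        table = CODE_TABLES.get(variable)
        # Quantitative values and variables without a table pass through unchanged.
        decoded[variable] = table.get(value, value) if table else value
    return decoded


# The Russula fragilis characteristics listed on this page, written as codes.
example = {"cap-shape": "s", "cap-color": "u", "gill-color": "w", "stem-color": "w"}
print(decode_record(example))
```

Keeping one table per variable matters because the same letter means different things in different columns, for example `s` is sunken for cap-shape but spring for season.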

Figure: Pairwise correlation heat maps for the UCI 1987 data and the Secondary 2020 data.

Both heat maps list all variables, including the class, on the x-axis and the y-axis. Each cell shows the pairwise correlation of the two variables at its position. The sequential gray-scale palette encodes a correlation value of 1 in black and lower values in shades of gray. In the 1987 data heat map, veil-type correlates with all other variables, which makes it redundant. The correlations of gill-attachment with the stalk-color above and below the ring both reach 0.97. The two stalk-color variables also correlate highly with veil-color, at 0.87 and 0.88, respectively. Another instance is odor, which alone has a 0.91 correlation with the class and makes the classification task largely obsolete. While this extremely high correlation is an outlier, nearly half the variables have a correlation with the class between 0.25 and 0.5. In comparison, no variable in the secondary data has a correlation with the class above 0.2. The only notably high correlations in the secondary data are the expected ones among the continuous variables, which were determined by the assumed covariance matrix.
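
The caption does not name the correlation measure behind the heat maps. Purely as an illustration of how one cell for two qualitative variables could be computed, the sketch below uses Cramér's V, a common association measure for categorical data. The choice of measure is an assumption and not necessarily the one used for the figure, and the function name `cramers_v` and the toy values are hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V association between two equally long categorical sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    # With a constant variable the statistic is undefined; this sketch returns 0.0.
    k = min(len(x_counts), len(y_counts)) - 1
    if k == 0:
        return 0.0
    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Toy values, not taken from either data set.
feature = ["a", "a", "b", "b"]
label = ["p", "p", "e", "e"]
print(cramers_v(feature, label))
```

A value near 1 means that one variable almost determines the other, and a value near 0 means that the two variables are close to independent.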
 
Annotated mushroom observations. From left to right, the annotated mushroom species are Amanita muscaria, Coprinopsis atramentaria, and Pluteus cervinus. The one image without an annotation shows a species from the puffball mushroom family. Because stemless mushroom species were excluded from the data, no identification can be made for it. The largest image shows a mushroom of the species Russula fragilis with the following characteristics: sunken cap-shape, purple cap-color, white gill-color, white stem-color.