Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
Title: Wine Quality
Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
## [1] "Names of variables "
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [1] "Dimensions of wine data"
## [1] 1599 12
## [1] "Structure of wine data"
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 15.9 15.6 15.6 15.5 15.5 15 15 14.3 14 13.8 ...
## $ volatile.acidity : num 0.36 0.685 0.645 0.645 0.645 0.21 0.21 0.31 0.41 0.49 ...
## $ citric.acid : num 0.65 0.76 0.49 0.49 0.49 0.44 0.44 0.74 0.63 0.67 ...
## $ residual.sugar : num 7.5 3.7 4.2 4.2 4.2 2.2 2.2 1.8 3.8 3 ...
## $ chlorides : num 0.096 0.1 0.095 0.095 0.095 0.075 0.075 0.075 0.089 0.093 ...
## $ free.sulfur.dioxide : num 22 6 10 10 10 10 10 6 6 6 ...
## $ total.sulfur.dioxide: num 71 43 23 23 23 24 24 15 47 15 ...
## $ density : num 0.998 1.003 1.003 1.003 1.003 ...
## $ pH : num 2.98 2.95 2.92 2.92 2.92 3.07 3.07 2.86 3.01 3.02 ...
## $ sulphates : num 0.84 0.68 0.74 0.74 0.74 0.84 0.84 0.79 0.81 0.93 ...
## $ alcohol : num 14.9 11.2 11.1 11.1 11.1 9.2 9.2 8.4 10.8 12 ...
## $ quality : int 5 7 5 5 5 7 7 6 6 6 ...
## [1] "Summary of Redwine data"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Quality can range from 0 to 10, but in this data the minimum is 3 and the maximum is 8. The first and third quartiles are 5 and 6, which means that most of the wines we will look at in the analysis are average wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The data set describes the quality of red wine and several chemical components that each wine contains. There are 1599 samples of wine with 11 numeric variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and 1 rating variable, quality, of type int.
Quality is the main feature of interest. It is given by 3 wine experts according to their knowledge and experience. Quality can range from 0 to 10, but in our data the lowest quality is 3 and the highest is 8. Let's find out which factors matter most for high quality wine.
Many more features could be useful, since in the real world many factors affect the quality of red wine.
Yes, I created two new variables, total.acidity and combined.sulphur.dioxide, which may reveal trends that the original variables do not show.
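The report does not say how these two variables were computed. The output above comes from R; the sketch below uses Python with only the standard library, purely as an illustration. It assumes that total.acidity is the sum of fixed and volatile acidity and that combined.sulphur.dioxide is total minus free sulfur dioxide. Both formulas and the helper name `add_derived_features` are assumptions, not the author's code.

```python
# Assumed definitions: the report does not state how the derived
# columns were built, so both formulas below are illustrative guesses.

def add_derived_features(row):
    """Return a copy of one wine record with two extra columns."""
    out = dict(row)
    # Assumption: total acidity = fixed acidity + volatile acidity.
    out["total.acidity"] = row["fixed.acidity"] + row["volatile.acidity"]
    # Assumption: combined (bound) SO2 = total SO2 - free SO2.
    out["combined.sulphur.dioxide"] = (
        row["total.sulfur.dioxide"] - row["free.sulfur.dioxide"]
    )
    return out


# First observation printed in the structure output above.
first_wine = {
    "fixed.acidity": 15.9,
    "volatile.acidity": 0.36,
    "free.sulfur.dioxide": 22.0,
    "total.sulfur.dioxide": 71.0,
}
derived = add_derived_features(first_wine)
print(derived["total.acidity"], derived["combined.sulphur.dioxide"])
```

If the author's definitions differ (for example, if citric acid is included in total acidity), only the two assignment lines need to change.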
Volatile acidity has a bimodal distribution, and citric acid has a long-tailed distribution rather than a normal one.
The data was already tidy, so no adjustment was required.
## [1] "Correlation among the variables"
## volatile.acidity total.sulfur.dioxide density
## -0.39055778 -0.18510029 -0.17491923
## chlorides pH free.sulfur.dioxide
## -0.12890656 -0.05773139 -0.05065606
## residual.sugar fixed.acidity citric.acid
## 0.01373164 0.12405165 0.22637251
## sulphates alcohol quality
## 0.25139708 0.47616632 1.00000000
Of all the variables, alcohol (0.48) and volatile acidity (-0.39) have the strongest correlations with the quality of wine. Sulphates (0.25) and citric acid (0.23) are also correlated with quality, though more weakly.
Residual sugar has almost no correlation with quality (0.01).
Since quality is the feature of interest, the correlation between quality and each other variable in the dataset was examined. Quality is positively correlated with alcohol content and negatively correlated with volatile acidity, total sulfur dioxide, density and chlorides.
pH and volatile acidity are positively correlated, which is counterintuitive: a higher pH means a less acidic wine, whereas a higher volatile acidity would be expected to mean a more acidic one.
The density of wine has a strong negative correlation with its alcohol content.
I was expecting a close relation between sulphates and sulphur dioxide, but there seems to be none: the correlation coefficient is 0.04.
Correlation of quality with the other variables:
## [1] "Correlation among the variables with quality"
## volatile.acidity total.sulfur.dioxide density
## -0.39055778 -0.18510029 -0.17491923
## chlorides pH free.sulfur.dioxide
## -0.12890656 -0.05773139 -0.05065606
## residual.sugar fixed.acidity citric.acid
## 0.01373164 0.12405165 0.22637251
## sulphates alcohol quality
## 0.25139708 0.47616632 1.00000000
The correlations show that alcohol (positively) and volatile.acidity (negatively) have the strongest relationships with quality.
Density and fixed acidity are also strongly correlated with each other.
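The coefficients in the table above are Pearson correlations. As a minimal sketch of what each number measures, the Python function below computes the coefficient from its definition, using only the standard library (the report itself was produced in R). The function name `pearson` is an illustrative choice, and the sample is limited to the ten alcohol and quality values printed in the structure output. Ten rows are far too few to reproduce the table, and the result can differ from the full-data value even in sign.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of each value from its mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Ten-row sample copied from the structure output; not the full data set.
alcohol = [14.9, 11.2, 11.1, 11.1, 11.1, 9.2, 9.2, 8.4, 10.8, 12.0]
quality = [5, 7, 5, 5, 5, 7, 7, 6, 6, 6]
r_sample = pearson(alcohol, quality)
print(f"alcohol vs quality on the ten-row sample: r = {r_sample:.3f}")
```

Running the same function on all 1599 rows of two columns should agree with the corresponding entry of the table, up to rounding.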
The plots below are numerous because I believe that, in exploratory data analysis, hidden trends can only be found by visualizing the data. I have therefore drawn scatter plots for all combinations of variables, with quality treated as a factor.
After reviewing the plots above, I am interested in the few plots below.
The region near the origin is crowded with high quality wine: low volatile.acidity together with low total.sulfur.dioxide indicates high quality wine.
Alcohol is the variable most correlated with quality, and density is the variable most strongly (negatively) correlated with alcohol. Quality tends to be higher when there is more alcohol and lower density.
There is a strong correlation between fixed.acidity and density, but no conclusion can be drawn from it because the data is insufficient.
## [1] " Scatter plot: \n residual.sugar vs citric.acid"
The plot above shows almost no correlation between residual sugar and citric acid: for any value of citric acid, residual sugar mostly lies between 0 and 4.
Alcohol and volatile acidity have the strongest correlations with quality. Less volatile acidity and more alcohol are associated with better wine.
High quality wines have sulphates in the range 0.5 to 1.5 and chlorides in the range 0.1 to 1.5. This suggests that there is an optimal range for the levels of these two features to make the best wine.
Citric acid and volatile acidity did not give any useful results.
There is a strong correlation between fixed.acidity and density. The reason is unknown; it may depend on properties of the wine that would need further study.
There is almost no correlation between residual sugar and the quality of wine.
I was expecting a close relation between sulphates and sulphur dioxide, but there seems to be none: the correlation coefficient is 0.04.
Quality can range from 0 to 10, but in the data the minimum is 3 and the maximum is 8, and most of the wines in the analysis are average wines. Wines rated 5 or 6 make up about 80% of the data, while wines rated 7 or 8 contribute only a little over 10%. Because there is so little data on high quality wine, we cannot determine which composition leads to high quality wine.
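Class shares like the ones quoted above come from a simple frequency count of the quality column. The Python sketch below (standard library only, shown for illustration since the report was produced in R) counts each score and divides by the number of wines. The function name `quality_shares` is hypothetical, and the input is only the ten quality values printed in the structure output, so the printed shares describe that sample and not the full data set.

```python
from collections import Counter


def quality_shares(scores):
    """Map each quality score to its share of all observations."""
    counts = Counter(scores)
    total = len(scores)
    return {score: counts[score] / total for score in sorted(counts)}


# Ten-row sample copied from the structure output; not the full data set.
sample_quality = [5, 7, 5, 5, 5, 7, 7, 6, 6, 6]
shares = quality_shares(sample_quality)
for score, share in shares.items():
    print(f"quality {score}: {share:.0%}")

# Share of "average" wines (scores 5 and 6) in this small sample.
average_share = shares.get(5, 0.0) + shares.get(6, 0.0)
```

Applied to the full quality column, the same count would show how unbalanced the classes are before any model is fitted.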
Alcohol shows the highest correlation with quality (0.48): wines with higher alcohol content tend to be rated higher. Density is the variable most strongly (negatively) correlated with alcohol, so quality tends to be higher when there is more alcohol and lower density.
The scatter_smooth plot shows that the density of wine decreases as its alcohol content increases. Wines of quality 8 are concentrated in a few places in the scatter plot where alcohol content is high and density is low.
The red wine dataset contains 1,599 observations with 11 numeric variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and 1 rating variable, quality, of type int. I am interested in the correlation between the features and wine quality. The data set, created in 2009, contains information on the chemical properties of a selection of wines. It also includes sensory data (wine ranking).
Starting from the histograms and box plots, we could see the distribution of each variable clearly. Some were normal distributions and a few were bimodal distributions.
In the bivariate analysis, the combination of box plots and scatter plots gave a clear idea of how the variables are correlated with each other and, above all, with our feature of interest, quality.
The multivariate analysis went a step further than the bivariate analysis to gain insight into the quality of wine. We found that high quality wine occurs within specific ranges of some chemical components: sulphates from 0.5 to 1.5 and chlorides from 0.1 to 1.5. This suggests that there is an optimal range for the levels of these two features to make the best wine.
As the data was already tidy, there was no need for data wrangling.
For future analysis I would like to collect data on a few more features, since in the real world many factors affect the quality of red wine.
Does the cost of wine depend on its quality? If so, can we find a way to make better wine at lower cost by understanding how cost relates to all of these real-world factors?
http://rstudio-pubs-static.s3.amazonaws.com/3355_d3f08cb2f71f44f2bbec8b52f0e5b5e7.html
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/