EDA of GoodReads Best Books Ever by Andy Trick
========================================================
## [1] 30583 16
## [1] "rating" "votes" "isbn" "author"
## [5] "series" "page_count" "publisher" "reviews"
## [9] "setting" "awards" "year" "genre"
## [13] "title" "rank" "year.bucket" "part_of_series"
## 'data.frame': 30583 obs. of 16 variables:
## $ rating : num 4.39 4.42 3.56 4.23 4.25 4.22 3.8 4.38 4.18 3.79 ...
## $ votes : num 6110 391 4471 540 10531 ...
## $ isbn : Factor w/ 21699 levels "",",","000100039X",..: 7458 7537 4206 11265 8262 1718 8926 374 4954 6757 ...
## $ author : Factor w/ 13884 levels "'Ali Ibn ABI Al-Hazm Ibn Al-Nafis",..: 12550 5260 12332 5600 8436 1734 4350 12005 3311 3702 ...
## $ series : Factor w/ 820 levels "","1-800-Where-R-You",..: 674 1 774 1 1 1 1 1 1 1 ...
## $ page_count : num 374 NA NA 279 NA 767 NA NA NA 464 ...
## $ publisher : Factor w/ 6720 levels "","'Hayastan' hratarakchutyu",..: 5377 5361 3537 3903 2427 2661 4801 2685 1569 6454 ...
## $ reviews : num 1003 1109 3423 2226 945 ...
## $ setting : Factor w/ 1246 levels "","ABA Indie Next Book",..: 341 206 388 1 1 1 1 1 1 9 ...
## $ awards : Factor w/ 359 levels "","Abraham Lincoln Award Nominee (2006)",..: 124 1 123 1 1 1 1 1 1 1 ...
## $ year : num 2008 2003 2005 1813 1936 ...
## $ genre : Factor w/ 147 levels "","Academic",..: 147 51 147 28 28 28 28 24 120 20 ...
## $ title : Factor w/ 28863 levels "","'Are these my basoomas I see before me?'",..: 23887 13231 27398 19034 12958 22436 6268 23523 23803 28713 ...
## $ rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ year.bucket : Factor w/ 8 levels "(0,1900]","(1900,1950]",..: 7 6 6 1 2 2 2 3 3 1 ...
## $ part_of_series: logi TRUE FALSE TRUE FALSE FALSE FALSE ...
## [1] "" "Academic"
## [3] "Action" "Adult"
## [5] "Adult Fiction" "Adventure"
## [7] "Alcohol" "American"
## [9] "Amish" "Animals"
## [11] "Anthologies" "Anthropology"
## [13] "Apocalyptic" "Architecture"
## [15] "Art" "Asian Literature"
## [17] "Autobiography" "Biography"
## [19] "Biology" "Book Club"
## [21] "Buddhism" "Business"
## [23] "Category Romance" "Childrens"
## [25] "Christian" "Christian Fiction"
## [27] "Christianity" "Classics"
## [29] "Comics" "Computer Science"
## [31] "Contemporary" "Crafts"
## [33] "Crime" "Criticism"
## [35] "Cultural" "Culture"
## [37] "Dark" "Dc Comics"
## [39] "Design" "Disability"
## [41] "Drama" "Dungeons And Dragons"
## [43] "Economics" "Education"
## [45] "Environment" "Erotica"
## [47] "Esoterica" "European Literature"
## [49] "Family" "Fan Fiction"
## [51] "Fantasy" "Feminism"
## [53] "Fiction" "Folklore"
## [55] "Food And Drink" "Football"
## [57] "Games" "Gardening"
## [59] "Gender" "Glbt"
## [61] "Gothic" "Graphic Novels Manga"
## [63] "Health" "Historical"
## [65] "Historical Fiction" "History"
## [67] "History And Politics" "Holiday"
## [69] "Horror" "Humanities"
## [71] "Humor" "Inspirational"
## [73] "Kids" "Language"
## [75] "Law" "Lds"
## [77] "Leadership" "Literary Fiction"
## [79] "Literature" "Love"
## [81] "Love Inspired" "Magical Realism"
## [83] "Management" "Manga"
## [85] "Marriage" "Marvel"
## [87] "Media Tie In" "Medical"
## [89] "Mental Health" "Mermaids"
## [91] "Military" "Military History"
## [93] "Modern" "Music"
## [95] "Mystery" "Mythology"
## [97] "New Adult" "Non Fiction"
## [99] "Novels" "Occult"
## [101] "Paranormal" "Parenting"
## [103] "Philosophy" "Plays"
## [105] "Poetry" "Politics"
## [107] "Polyamory" "Productivity"
## [109] "Psychology" "Queer"
## [111] "Race" "Realistic Fiction"
## [113] "Reference" "Regency"
## [115] "Relationships" "Religion"
## [117] "Roman" "Romance"
## [119] "Science" "Science Fiction"
## [121] "Science Fiction Fantasy" "Self Help"
## [123] "Sequential Art" "Sexuality"
## [125] "Shapeshifters" "Short Stories"
## [127] "Social Science" "Sociology"
## [129] "Space" "Speculative Fiction"
## [131] "Spirituality" "Sports"
## [133] "Sports And Games" "Spy Thriller"
## [135] "Superheroes" "Suspense"
## [137] "Teaching" "Textbooks"
## [139] "Thriller" "Travel"
## [141] "Urban" "War"
## [143] "Western" "Womens Fiction"
## [145] "World War II" "Writing"
## [147] "Young Adult"
## [1] "(0,1900]" "(1900,1950]" "(1950,1980]" "(1980,1990]" "(1990,2000]"
## [6] "(2000,2005]" "(2005,2010]" "(2010,2015]"
## rating votes
## Min. :0.00 Min. : 1
## 1st Qu.:3.80 1st Qu.: 2990
## Median :4.00 Median : 6171
## Mean :4.00 Mean : 6179
## 3rd Qu.:4.21 3rd Qu.: 9485
## Max. :5.00 Max. :12559
##
## isbn
## : 8864
## http://watergreen.wix.com/watersgreenhouse: 6
## , : 3
## http://forums.sennadar.com : 3
## 0061353450 : 2
## 0062248162 : 2
## (Other) :21703
## author series
## James Patterson: 69 :29680
## Stephen King : 68 Dark Saga : 8
## Nora Roberts : 66 Kate Shugak : 7
## Francine Pascal: 59 Otherworld/Sisters of the Moon: 7
## Agatha Christie: 56 Breeds : 6
## Meg Cabot : 53 Argeneau : 5
## (Other) :30212 (Other) : 870
## page_count publisher reviews
## Min. : 0.0 : 1897 Min. : 1
## 1st Qu.: 210.0 Vintage : 420 1st Qu.:1060
## Median : 306.0 Penguin Books : 328 Median :1807
## Mean : 356.1 HarperCollins : 322 Mean :1884
## 3rd Qu.: 416.8 Ballantine Books: 257 3rd Qu.:2793
## Max. :4892.0 Createspace : 254 Max. :3642
## NA's :28253 (Other) :27105
## setting
## :28599
## United States : 82
## London, England : 58
## New York City, New York: 48
## United Kingdom : 28
## New York : 23
## (Other) : 1745
## awards
## :30180
## Goodreads Choice Nominee for Romance (2010) : 5
## Golden Duck Award for Hal Clement Award for Young Adult (2010), YALSA Teens' Top Ten (2010), Children's Choice Book Award for Teen Choice Book of the Year (2010), Indies Choice Book Award for Young Adult (2010), Teen Read Award Nominee for Best Read (2010) : 4
## Goodreads Choice Nominee (2013) : 4
## Goodreads Choice Nominee for Paranormal Fantasy (2010) : 4
## Locus Award Nominee for Best Young Adult Novel (2008), Mythopoeic Fantasy Award for Children's Literature (2008), Odyssey Award for Excellence in Audiobook Production Honor (2008), Books I Loved Best Yearly (BILBY) Awards for Older Readers (2008), YALSA Teens' Top Ten (2008): 4
## (Other) : 382
## year genre title
## Min. :-800 Fantasy : 3835 English : 479
## 1st Qu.:1981 Fiction : 3767 Arabic : 72
## Median :2000 : 2866 Urdu : 12
## Mean :1975 Romance : 2419 The Hobbit : 10
## 3rd Qu.:2008 Young Adult: 2200 ヴァンパイア騎士: 9
## Max. :2800 Non Fiction: 1418 Spanish : 9
## NA's :9105 (Other) :14078 (Other) :29992
## rank year.bucket part_of_series
## Min. : 1 (2005,2010]:4198 Mode :logical
## 1st Qu.: 7646 (1990,2000]:3610 FALSE:29680
## Median :15292 (2010,2015]:3296 TRUE :903
## Mean :15292 (2000,2005]:3102 NA's :0
## 3rd Qu.:22938 (1950,1980]:2795
## Max. :30583 (Other) :4424
## NA's :9158
Univariate Plots Section
Taking a first look at the Best Books Ever dataset. It is not too surprising that the majority of the books on the list are fairly recent. Let’s take a closer look:
Book releases per year rise steadily and peak at 1,078 books in 2011. For comparison, the average yearly count for the 20 years before that is 545.5, about half.
## [1] 1078
## [1] 545.5
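The two numbers above can be reproduced along these lines. This is a minimal base R sketch on a made-up `years` vector (a stand-in for the `year` column), not the original code. It also assumes that years with no books count as zero in the 20-year average, which the original calculation may not have done.

```r
# Hypothetical stand-in for the `year` column; the real data has 30,583 rows.
years <- c(2011, 2011, 2011, 2010, 2010, 2009, 1995, 1995, NA)

# Count books per publication year, ignoring missing years.
per_year <- table(years, useNA = "no")

# Peak year and the number of books released in it.
peak_year  <- as.numeric(names(per_year)[which.max(per_year)])
peak_count <- max(per_year)

# Mean yearly count over the 20 years before the peak.
# Using a factor with explicit levels keeps empty years as zero counts
# (an assumption about how the average should be defined).
window        <- (peak_year - 20):(peak_year - 1)
window_counts <- as.vector(table(factor(years, levels = window)))
mean_before   <- mean(window_counts)

peak_count
mean_before
```

On the full data the same steps would be applied to `gr$year`.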
One more histogram of books released, grouped by the year buckets applied earlier.
Most books are between 200 and 500 pages long.
A closer look shows that a book’s length is most likely between 275 and 340 pages. Do the top 250 books fall into this grouping? That may help us determine whether page length is correlated with the popularity of a book.
It looks like the top 250 books from the dataset are pretty similar to the entire population. There may be a slight increase in the overall numbers, though. Let’s look at summaries of the two groups.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 210.0 306.0 356.1 416.8 4892.0 28253
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 32.0 248.8 374.0 485.8 481.2 2700.0 198
##
## Pearson's product-moment correlation
##
## data: gr$page_count and gr$rating
## t = 5.2171, df = 2328, p-value = 1.978e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06718583 0.14746469
## sample estimates:
## cor
## 0.1075005
So the top 250 books have a higher page count than the entire dataset at the quartiles, median (374 vs. 306), and mean (485.8 vs. 356.1); only the maximum is lower. There is a weak positive correlation of about 0.11 between the length and the average rating of a book.
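The comparison above can be sketched as follows. The data frame is a made-up miniature that reuses the column names from the `str()` output, and the cutoff of 4 stands in for the top 250. Note that `cor.test()` drops rows where either value is missing, which is why the output above reports df = 2328: only 2,330 books have a page count.

```r
# Hypothetical miniature of the data frame; values are invented.
gr <- data.frame(
  rank       = 1:8,
  page_count = c(500, NA, 420, 380, 300, NA, 250, 200),
  rating     = c(4.4, 4.1, 4.3, 4.0, 3.9, 3.8, 3.7, 3.6)
)

# Stand-in for the top 250 subset: here, the top 4 ranks.
top <- subset(gr, rank <= 4)

# Side-by-side summaries of page count (NA's are reported separately).
summary(gr$page_count)
summary(top$page_count)

# Pearson correlation between length and rating.
# Incomplete pairs are removed before the test is computed.
ct <- cor.test(gr$page_count, gr$rating)
ct$estimate   # correlation coefficient
ct$parameter  # degrees of freedom = complete pairs - 2
```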
A look at the top genre occurrences in the dataset. Any genre-related plots from here on will only include genres with at least 500 occurrences in the data.
Fiction and Fantasy are by far the most popular genres. Romance and Young Adult also have surprisingly large counts.
Interestingly, when subsetting the top 250 ranked books, Fantasy drops very low and Classics takes over the top count. Fiction stays popular, while Romance is nowhere on the list.
## n
## 1 0.02952621
A look at how many books in the dataset are part of a series: only around 3 percent.
Top authors in the book list. I’m very surprised to see that 8 authors have over 50 books in the data.
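The proportion follows directly from the counts in the summary output (903 `TRUE`, 29,680 `FALSE`). A small sketch, rebuilding the logical vector from those counts rather than from the real data:

```r
# Rebuild a logical vector with the TRUE/FALSE counts from the summary output.
part_of_series <- rep(c(TRUE, FALSE), times = c(903, 29680))

# The mean of a logical vector is the share of TRUE values.
share_in_series <- mean(part_of_series)
share_in_series  # about 0.0295, i.e. roughly 3 percent
```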
Top Publishers. Vintage Publishing has the most books on the list by a large margin at a count of 420. The next highest, Penguin Books, has 92 fewer.
Although technically different, if Penguin Books and Penguin Classics were combined, Penguin would come in at 566 books on the list.
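One way such a combined count could be produced is to recode the two imprints to a single label before tabulating. The vector below is invented for illustration, and `publisher_group` is a hypothetical name; whether the original analysis did it this way is not shown.

```r
# Hypothetical sample of the `publisher` column ("" means no publisher listed).
publisher <- c("Vintage", "Penguin Books", "Penguin Classics",
               "Penguin Books", "HarperCollins", "")

# Map both Penguin imprints to one label; leave everything else unchanged.
penguin_imprints <- c("Penguin Books", "Penguin Classics")
publisher_group  <- ifelse(publisher %in% penguin_imprints, "Penguin", publisher)

# Tabulate, skipping blanks, and sort from most to least frequent.
counts <- sort(table(publisher_group[publisher_group != ""]), decreasing = TRUE)
counts
```

Listing the imprints explicitly, rather than matching every name that starts with "Penguin", keeps the grouping limited to the two publishers named in the text.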
Above are four histograms depicting what could be considered the popularity of a book: average rating (out of 5 stars) and number of votes for best book, each for the entire dataset and for the top 250 subset.
There is not much of a difference between the entire dataset and the top 250 subset. Rating tends to be between 3.5 and 4.5, while the number of votes is spread fairly evenly between 0 and 10,000. The only notable outlier in the votes plots is the spike at 0 for the entire dataset. This is due to a large number of books at the end of the list having under 100 votes.
Univariate Analysis
What is the structure of your dataset?
This data was obtained from GoodReads.com’s Best Books Ever list. There are 30,583 books in the dataset with 14 features (rating, votes, isbn, author, series, page_count, publisher, reviews, setting, awards, year, genre, title, rank). I created two more variables (year.bucket and part_of_series). The variables isbn, author, series, publisher, setting, awards, genre, title, and year.bucket are factor variables. part_of_series is a logical (boolean).
Observations:
- Most books are not part of a series.
- The majority of books on the list are from the past 20 years.
- Median book length is 306 pages.
- Mean rating is 3.99.
- There are 147 genre levels, one of which is blank.
What is/are the main feature(s) of interest in your dataset?
The main feature of interest in this dataset is genre. I am looking to discover any trends or patterns in people’s opinion of, and the popularity of, each genre.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Rank, rating, year, year.bucket, and page_count are all important features for discovering trends in genre. Year will help me discover any differences over time between genres, while page count will help me learn whether book length differs by genre. Rank and rating are both good measures of the popularity of, and opinion on, a book.
Did you create any new variables from existing variables in the dataset?
I created two new variables: year.bucket and part_of_series. year.bucket splits the books into 8 year categories of roughly similar size, with narrower ranges for recent years to counteract the large number of books released in the past 25 years. part_of_series is a boolean indicating whether the book is part of a series rather than a standalone.
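A sketch of how these two variables could be derived in base R. The break points are read off the bucket labels printed earlier, and the blank-string test for `series` matches the summary counts (29,680 blank series and 29,680 `FALSE`). The sample rows and the series name are invented, and the original code may differ.

```r
# Hypothetical sample rows; "" in `series` means the book is a standalone.
gr <- data.frame(
  year   = c(-800, 1813, 1936, 1975, 1985, 1999, 2003, 2008, 2012, 2800, NA),
  series = c("", "", "", "", "", "", "Hypothetical Saga", "",
             "Hypothetical Saga", "", "")
)

# Break points taken from the bucket labels in the output above.
breaks <- c(0, 1900, 1950, 1980, 1990, 2000, 2005, 2010, 2015)

# cut() builds right-closed intervals; dig.lab = 4 keeps labels like "(1900,1950]".
# Years outside (0, 2015], such as -800 or 2800, become NA.
gr$year.bucket <- cut(gr$year, breaks = breaks, dig.lab = 4)

# A book is part of a series when its series field is not blank.
gr$part_of_series <- gr$series != ""

levels(gr$year.bucket)
table(gr$year.bucket, useNA = "ifany")
```

Out-of-range years falling to `NA` would explain why year.bucket has slightly more missing values (9,158) than year (9,105) in the summary.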
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Instead of changing the form of the data, I opted to use several different subsets of the original dataset. This allowed me to keep a copy of the initial data while also cleaning the data for specific purposes.
author.sub, genre.sub, and publisher.sub are subsets specific to the variable each is named for. For example, author.sub contains only authors with 50 or more books in the dataset. genre.sub is similar in that it contains genres with at least 200 occurrences.
I also subset the top 250 ranked books to see if a more immediate pattern emerged from a smaller, but more popular, selection. Along with these, there are several more subsets for specific genres (for example, sff.sub and fantasy.sub).
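The frequency-based and rank-based subsets described above might be built like this. The data is a made-up miniature, the thresholds are scaled down to fit it, and `top_books` is a hypothetical name. Dropping the blank genre is also an assumption.

```r
# Hypothetical miniature of the data frame.
gr <- data.frame(
  rank  = 1:10,
  genre = c("Fantasy", "Fantasy", "Fiction", "Fantasy", "Classics",
            "Fiction", "", "Fiction", "Romance", "Fantasy")
)

# Minimum number of occurrences; the text uses thresholds in the hundreds.
min_count <- 3

# Genres that meet the threshold, excluding the blank level (assumption).
genre_counts <- table(gr$genre)
keep <- names(genre_counts)[genre_counts >= min_count & names(genre_counts) != ""]

# Keep only rows in those genres; droplevels() removes unused factor levels.
genre.sub <- droplevels(gr[gr$genre %in% keep, ])

# Rank-based subset; the text uses rank <= 250.
top_books <- subset(gr, rank <= 5)

table(genre.sub$genre)
nrow(top_books)
```

Because each subset is a new object, the original `gr` stays untouched, which is the point made above about keeping a copy of the initial data.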
Bivariate Plots Section
Plotting length (page count) against rating shows a slight positive correlation between the two.
##
## Pearson's product-moment correlation
##
## data: gr$page_count and gr$rating
## t = 5.2171, df = 2328, p-value = 1.978e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06718583 0.14746469
## sample estimates:
## cor
## 0.1075005
There seems to be a slight correlation between rank and page count as well. This was expected after earlier finding the top 250 ranked books were typically longer. How correlated?
##
## Pearson's product-moment correlation
##
## data: gr$rank and gr$page_count
## t = -5.8784, df = 2328, p-value = 4.737e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.16075838 -0.08072846
## sample estimates:
## cor
## -0.1209399
The correlation is about 0.12 in magnitude. The sign is negative because a “better” rank is a lower number, so the negative coefficient means that better-ranked books tend to be longer.
Votes by length does not show any pattern. The only thing this plot shows is that most books are between 200 and 500 pages. Expect plots that include votes not to show any trends, since votes do not appear to be correlated with the other variables in any particular way.
Here is something I set out to discover: are particular genres typically longer or shorter than others?
In our dataset, the genres History, Historical Fiction, and Fantasy are on average longer than most. Children’s books are, very predictably, the shortest.
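The sign convention can be checked with a few invented values: negating rank, so that larger means better, flips the sign of the Pearson coefficient and leaves its magnitude unchanged.

```r
# Hypothetical values: rank 1 is the best-ranked book.
rank       <- 1:8
page_count <- c(900, 450, 600, 380, 410, 300, 320, 250)

# Correlation with rank as given (lower number = better rank).
r_rank <- cor(rank, page_count)

# Correlation with rank negated (higher number = better rank).
r_flipped <- cor(-rank, page_count)

r_rank     # negative: page count falls as the rank number grows
r_flipped  # same magnitude, positive sign
```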
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 210.0 306.0 356.1 416.8 4892.0 28253
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 104.0 324.0 463.0 512.1 597.0 1264.0 516
Above is a summary of length for the whole dataset, followed by a summary of length for History books only.
Again, the votes values do not show any pattern or trend.
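A per-genre comparison of length like the one above can be sketched with `tapply()`. The rows below are invented, and only the column names come from the dataset.

```r
# Hypothetical sample rows; NA marks a missing page count.
gr <- data.frame(
  genre      = c("History", "History", "Fantasy", "Fantasy",
                 "Childrens", "Childrens", "Fiction"),
  page_count = c(600, 450, 500, NA, 40, 32, 300)
)

# Mean page count per genre, ignoring missing values.
mean_pages <- tapply(gr$page_count, gr$genre, mean, na.rm = TRUE)

# Longest genres first.
sorted <- sort(mean_pages, decreasing = TRUE)
sorted

# Summary for a single genre, as in the History summary above.
summary(gr$page_count[gr$genre == "History"])
```

With so many page counts missing in the real data (28,253 of 30,583), `na.rm = TRUE` matters: without it, any genre containing a single `NA` would get a mean of `NA`.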
A scatterplot and boxplot of genre by average rating. All genres are typically around the 4 star rating. Sequential Art is abnormally high.