Announcing TokuDB v7.1.5
Yesterday we released TokuDB v7.1.5, which includes the following important features and fixes:
Upgraded MySQL and MariaDB to version 5.5.36.
Six months of performance improvements to our underlying Fractal Tree indexing.
Fixes for bugs, stalls, and behavioral issues reported by users.
Fixes for issues identified by the addition of Random Query Generator (RQG) testing.
Fixes for issues identified by the addition of Valgrind testing.
Full details on the changes in TokuDB v7.1.5 can be found in the release notes section of the TokuDB User’s Guide, available from our documentation page.
As always, you can download the Community and Enterprise Editions from the TokuDB download page.
Ghosts of MySQL past, part 8.1: Five Years
With many apologies to David Bowie, come 2009 it was my five-year anniversary with Sun (well, MySQL AB and then Sun). Companies like to talk about how well they retain employees and how many stay with the company for a long time. It is very, very expensive to hire the right people, so this is largely a good plan.
In 2009, it was five years since I joined MySQL AB – something I didn’t quite originally expect (let’s face it, in the modern tech industry you’re always surprised when you’re somewhere for more than a few years).
The whole process of having been with Sun for 5 years seemed rather impersonal… it largely felt like a form letter automatically sent out with a certificate and small badge (below). You also got to go onto a web site and choose from a variety of gifts. So, what did I choose? The one thing that would be useful at Burning Man: a CamelBak.
5 year badge for Sun Microsystems
Insert benchmark for InnoDB, MongoDB and TokuMX and flash storage
This work is introduced with a few disclaimers in an earlier post. For these tests I ran the insert benchmark client in two steps: first to load 100M documents/rows into an empty database, then to load another 400M documents/rows. The test used 1 client thread and 1 collection/table, and the query threads were disabled. The replication log (oplog/binlog) was disabled for all tests; I assume that all of the products would suffer with it enabled.

Note that the database in this test was fully cached by InnoDB and TokuMX and almost fully cached for MongoDB. This means there were no disk reads for InnoDB during index maintenance, no disk reads for TokuMX during compaction, and few disk reads for MongoDB during index maintenance. Future posts have results for databases much larger than RAM.

Performance summary:

Database size - InnoDB matches TokuMX when using 2X compression, but this comes at a cost in the insert rate. Without compression TokuMX is better than InnoDB, and both are much better than MongoDB. I ran tests using the readable attribute names in iibench-mongodb, like "cashregisterid", and then repeated the tests after changing the code to use names that were between 2 and 5 bytes. Using shorter names did not have a big impact on database size because most of the space is used in indexes, not for document storage. After one test with 500M documents and shorter names, 38.5% of the space was used for document storage and 62% for the indexes.

Write efficiency - MongoDB wrote about 10X more bytes to storage than InnoDB and about 20X more than TokuMX, yet the MongoDB database was only 2X to 4X larger than the others. MongoDB rewrites the same page to disk much more frequently than InnoDB. The tests used the default for syncdelay (60), which tries to flush all dirty database files to disk every 60 seconds via calls to msync. InnoDB is able to keep dirty pages in the buffer pool for a much longer time.
Increasing syncdelay to 300 did not have a significant impact on the bytes-written rate -- it reduced the rate by between 10% and 20%. I encountered intermittent stalls on writes with MongoDB 2.4.9 that occur when an msync is in progress and an insert needs to allocate a new file. That is fixed in 2.6, and PMP helped me to debug it.

Insert rate - uncompressed InnoDB did better than TokuMX, and compressed InnoDB matched it. All did better than MongoDB. The problem for MongoDB was write efficiency. The problem for TokuMX is that the random-write penalty for InnoDB is greatly reduced by a fast storage device. This InnoDB advantage shrinks when the database gets larger than RAM and is completely lost when slower storage is used, which will be described in a future post.

I enabled fsync on commit for all tests. That used j:1 for inserts with MongoDB and TokuMX, along with journalCommitInterval=2 for MongoDB and logFlushPeriod=0 for TokuMX. This means there can be a 2/3 * 1/2 millisecond wait for group commit on average with MongoDB. But tests run with j:0 (not reported here) showed that forcing the journal to disk has little impact on performance. For InnoDB I used innodb_flush_log_at_trx_commit=1. I disabled the InnoDB doublewrite buffer for most of the tests. That sacrifices protection from torn page writes. MongoDB doesn't have an option to protect against torn page writes, and TokuMX isn't at risk from them. Otherwise I used the default options for MongoDB and TokuMX. They are much easier to configure than MySQL. I won't describe all of the MySQL settings, but I use direct IO. The test server has 144G of RAM, 24 cores with HT enabled, and fast flash storage that can do more than 50,000 16kb random IOPs.

The powerOf2Sizes option was not enabled for this test. From reading source code, that should only change how space is allocated for documents in the heap, and it won't help or hurt with index fragmentation.
I have reports to share for powerOf2Sizes in future blog posts. The tests used iibench-mongodb in Java for TokuMX and MongoDB, and the Python version for InnoDB. I tested MySQL 5.6.12, TokuMX 1.4.0 and MongoDB 2.4.9 in the following configurations:

inno-16k-dw0 - InnoDB with 16kb pages and the doublewrite buffer disabled
inno-16k-dw1 - InnoDB with 16kb pages and the doublewrite buffer enabled
inno-8k-dw0 - InnoDB with 8kb pages and the doublewrite buffer disabled
inno-8k-dw0-zlib - InnoDB with 8kb pages, the doublewrite buffer disabled, and 2X compression enabled
tokumx - TokuMX with logFlushPeriod=0 and writes done with j:1
mongo24 - MongoDB 2.4.9 with journalCommitInterval=2 and writes done with j:1

The table below summarizes the results after inserting 100M documents/rows.

DB-size  Bytes-per-doc  Write-rate  Bytes-written  Test-seconds  Inserts/sec  Server
17.0 GB     182      20.8 MB/s      80.7 GB     3895    25674   innodb-16k-dw0
17.0 GB     182      35.4 MB/s     140.1 GB     3963    25231   innodb-16k-dw1
16.0 GB     171      34.9 MB/s     118.0 GB     3375    29622   innodb-8k-dw0
 8.5 GB      91      21.3 MB/s     162.8 GB     7661    13048   innodb-8k-dw0-zlib
 9.3 GB      99      15.0 MB/s      99.0 GB     6586    15182   tokumx
42.0 GB     450     171.1 MB/s    1825.4 GB    10679     9364   mongo24

The table below summarizes the results after inserting another 400M documents/rows.

DB-size  Bytes-per-doc  Write-rate  Bytes-written  Test-seconds  Inserts/sec  Server
 75 GB      161      42.2 MB/s      667 GB     15846    25244   innodb-16k-dw0
 75 GB      161      81.7 MB/s     1300 GB     15926    25117   innodb-16k-dw1
 75 GB      161      72.9 MB/s      988 GB     13562    29492   innodb-8k-dw0
 40 GB       85      24.2 MB/s      730 GB     30130    13276   innodb-8k-dw0-zlib
 46 GB       98      15.3 MB/s      417 GB     27241    14684   tokumx
154 GB      330     188.6 MB/s     9838 GB     52233     7658   mongo24
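The write-efficiency comparison can be checked directly from the published numbers. The short sketch below (my own arithmetic, not part of the benchmark client) computes bytes written per inserted document and the relative write amplification for the 400M-row load, using the totals from the second table.

```python
# Quick check of the write-efficiency claims using the published
# totals for the 400M-document load (second table above).
GB = 2**30

bytes_written = {
    "innodb-16k-dw0": 667 * GB,
    "tokumx": 417 * GB,
    "mongo24": 9838 * GB,
}
docs = 400_000_000

for server, total in bytes_written.items():
    # storage bytes written per inserted document
    print(f"{server}: {total / docs:.0f} bytes written per document")

# Relative write amplification of MongoDB vs the others
ratio_innodb = bytes_written["mongo24"] / bytes_written["innodb-16k-dw0"]
ratio_tokumx = bytes_written["mongo24"] / bytes_written["tokumx"]
print(f"MongoDB wrote {ratio_innodb:.1f}X more than InnoDB "
      f"and {ratio_tokumx:.1f}X more than TokuMX")
```

For this larger load the ratios come out near 15X and 24X; the "about 10X / about 20X" summary above rounds across both load phases.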
Database benchmarks are hard but not useless. I use them to validate performance models and to find behavior that can be improved. It is easy to misunderstand results produced by others, and they are often misused for marketing (benchmarketing). It is also easy to report incorrect results, and I have done that a few times for InnoDB. A benchmark report is much more useful when it includes an explanation. Only one of these is an explanation: "A is faster than B" versus "A is faster than B because it uses less random IO". It isn't easy to explain results. That takes time and expertise in the DBMS and in the rest of the hardware and software stack used during the test. The trust I have in a benchmark report is inversely related to the number of different products it tests.

This is an introduction for a sequence of blog posts that compare MongoDB, TokuMX and InnoDB. I am an expert with InnoDB, above average with TokuDB (and the Toku part of TokuMX), and just getting started with MongoDB. Moving from technical expertise to personal bias: I have had a lot of success with InnoDB, and I have a lot of respect for Tokutek, where my brother works. While I am new to MongoDB, I think they are building a useful product. My focus is on DBMS internals and storage efficiency. This is a narrow focus, but a critical one for deployments where the hardware cost is significant. My goal is to describe problems that can be fixed to make MongoDB better.

I used the insert benchmark to measure the insert rate, write efficiency and database size. The benchmark loads a collection/table in PK order. The table has 3 secondary indexes, and the values for these attributes/columns are randomly generated. The random key order for secondary index maintenance can generate fragmentation and random IO. I use the benchmark to measure storage efficiency and am more interested in write efficiency and database size than in the average insert rate.

Can the DBMS use the IOPs capacity of the storage devices?
Many years ago InnoDB wasn't able to do more than 10,000 reads/second from disk. Now I am able to demo something between 150k and 200k per second, assuming low-latency storage and 24 CPU cores with HT enabled. It has come a long way. There are many problems that prevent a DBMS from using the IOPs capacity, including mutex contention, CPU overhead/pathlength and IO serialization. TokuMX and InnoDB have one CPU overhead that is missing from MongoDB: they generate checksums when database pages are written and validate the checksums when pages are read from disk. That is an overhead I hope will be added to MongoDB.

Can the DBMS reduce random IO for secondary indexes? InnoDB has the change buffer for this, and TokuMX has fractal trees.

Can the DBMS reduce index fragmentation? TokuMX has fractal trees. InnoDB is an update-in-place b-tree, and I won't claim it has anything special in this regard. A common problem for database benchmarks is to load and go, so that queries are tested against an unfragmented database while the usual production DBMS has been fragmented. By fragmented I mean subject to inserts, updates and deletes in random key order. Write-optimized databases can also be fragmented, and there is an impact on query performance in that case, but that is for another blog post. The insert benchmark tries very hard to make the database fragmented by requiring secondary index maintenance in random key order.

Can the DBMS reduce the amount of disk space used for the database? Both TokuMX and InnoDB support compression.

For write efficiency I report the total bytes written to the storage device during the test. Flash devices are rated by write endurance, so a DBMS that writes less to storage allows you to either buy less expensive flash devices or get more years of use from a device. I used iostat to record the write rate, which includes writes to the journal file (or redo log) and database files. Note that all writes are not created equal. TokuMX does large random writes.
InnoDB and MongoDB do small random writes, and small random writes also require writes in the background when flash garbage collection (block cleaning) is done. The writes from flash GC are not captured by iostat, and I do not report them.

Database size can be increased by fragmentation for update-in-place indexes and by old versions of rows for a write-optimized database. It can be decreased by compression. A smaller database means you buy less flash or disk to store it and to store backups.

The insert rate is the least interesting of the three metrics. I don't ignore the results, but a lower insert rate can be tolerated if it comes with better write efficiency or database compression.

This was a long introduction. Benchmark results will be in another post.
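To make the workload concrete, here is a minimal sketch of the insert benchmark's shape, using SQLite only so the example is self-contained. The table and column names are my illustrative guesses at an iibench-style schema (the post confirms only "cashregisterid"), not the exact client code: the PK is loaded in ascending order while the three indexed attributes get random values, which is what forces secondary index maintenance in random key order.

```python
import random
import sqlite3

# Illustrative sketch of the insert benchmark workload, not the
# real iibench client: PK in ascending order, three secondary
# indexes maintained in random key order.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE purchases (
        transactionid  INTEGER PRIMARY KEY,  -- loaded in PK order
        cashregisterid INTEGER,
        customerid     INTEGER,
        price          REAL
    )""")
conn.execute("CREATE INDEX ix_register ON purchases (cashregisterid)")
conn.execute("CREATE INDEX ix_customer ON purchases (customerid)")
conn.execute("CREATE INDEX ix_price ON purchases (price)")

def load(num_rows, batch=1000):
    pk = 0
    while pk < num_rows:
        rows = [(pk + i,
                 random.randrange(1000),      # random secondary keys force
                 random.randrange(100000),    # random-order index maintenance
                 random.uniform(0.0, 500.0))
                for i in range(min(batch, num_rows - pk))]
        conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)
        conn.commit()
        pk += len(rows)

load(10_000)  # the real benchmark loads 100M rows, then another 400M
print(conn.execute("SELECT COUNT(*) FROM purchases").fetchone()[0])
```

The interesting behavior in the real benchmark comes from scale: once the secondary indexes no longer fit in cache, each random-order index entry can cost a disk read and a page rewrite, which is exactly what the write-efficiency metric measures.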
MySQL Workbench 6.1: Performance Schema Reports
The Performance Schema Reports feature in MySQL Workbench shows summaries of the many statistics gathered by the MySQL performance_schema. By inspecting these reports, you can get some insight into what’s happening in a MySQL server in aspects such as:
I/O by amount of data
I/O by latency/time spent
Performance-critical operations performed by queries of the same type (table scans, index scans, temporary tables, sorts, etc.)
The MySQL SYS Schema
MySQL 5.5 introduced the performance_schema feature. performance_schema contains tables that log server performance and activity statistics. You can inspect them to get a clearer understanding of what kind of work the server is doing, how much time is spent doing it, and what resources are used globally or by individual queries. MySQL 5.6 expanded on it and added even more information, making it a lot more powerful.
However, the performance_schema tables mostly contain raw data and require some processing and SQL magic to make them palatable so you can get the most out of them. That is what the new SYS schema project provides. The SYS schema has many views that process the data in performance_schema into high-level reports, plus stored procedures that make it easier for advanced users to work with the performance_schema.
The SYS schema is bundled in MySQL Workbench 6.1 and is – after confirmation – automatically installed in supported MySQL servers. Currently, only MySQL 5.6 and 5.7 are supported, because the performance_schema in MySQL 5.5 still lacked maturity and did not yet provide enough information to be very useful.
To view the reports, the SYS schema must be installed and the appropriate P_S instrumentation has to be enabled. If not, Workbench will prompt you to click a button to enable it.
The required instrumentation is (if you choose to change it manually):
current and history_long consumers must be enabled for all events
all wait instruments must be enabled and timed
all stage instruments must be enabled and timed
all statement instruments must be enabled and timed
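If you want to enable that instrumentation by hand rather than through Workbench's button, statements along these lines can be run from any MySQL client. This is a sketch of what the one-click setup effectively does, built from the requirements listed above; the exact statements Workbench issues may differ, so verify the consumer and instrument names against your server version.

```python
# Sketch: generate the performance_schema setup statements implied by
# the instrumentation requirements above. Run the printed SQL against
# the server with a MySQL client; this script only builds the strings.
consumer_sql = (
    "UPDATE performance_schema.setup_consumers "
    "SET ENABLED = 'YES' "
    "WHERE NAME LIKE '%current' OR NAME LIKE '%history_long'"
)
instrument_sql = (
    "UPDATE performance_schema.setup_instruments "
    "SET ENABLED = 'YES', TIMED = 'YES' "
    "WHERE NAME LIKE 'wait/%' "
    "OR NAME LIKE 'stage/%' "
    "OR NAME LIKE 'statement/%'"
)
for stmt in (consumer_sql, instrument_sql):
    print(stmt + ";")
```

Note that changes to setup_consumers and setup_instruments do not persist across a server restart; to make them permanent you would set the corresponding performance_schema options in my.cnf.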
The currently available reports are listed below. As more views are added to the SYS schema, Workbench will also be updated to list them.
Top File I/O Activity Report
Shows the files doing the most I/O in bytes
Top I/O by File by Time
Shows the highest I/O usage by file and latency
Top I/O by Event Category
Shows the highest I/O data usage by event category
Top I/O in Time by Event Categories
Shows the highest I/O time consumers by event category
Top I/O Time by User/Thread
Shows the top I/O time consumers by user/thread
Lists statements with various aggregated statistics
Statements in Highest 5 Percent by Runtime
Lists all statements whose average runtime, in microseconds, is in the highest 5 percent
Using Temp Tables
Lists all statements that use temporary tables – ordered by the highest number of on-disk temporary tables, then in-memory temporary tables
Lists all normalized statements that have performed sorts, ordered by sort_merge_passes, then sort_scans, then sort_rows
Full Table Scans
Lists statements that have performed a full table scan. Assess query performance and the WHERE clause(s); if no index is used, consider adding indexes for large tables
Errors or Warnings
Lists statements that have raised errors or warnings.
Schema Object Overview (High Overhead)
Shows count by object type for each schema.
Note: On instances with a large number of objects, this can take some time to execute.
Schema Index Statistics
Schema Table Statistics
Schema Table Statistics (with InnoDB buffer)
Tables with Full Table Scans
Finds tables that are being accessed by full table scans, ordered by the number of rows scanned, descending
Lists indexes that were never used since the server started or since P_S data collection started.
Waits by Time
Lists the top wait events by their total time, ignoring idle (this may be very large)
Waits by User by Time
Lists the top wait events by their total time, ignoring idle (this may be very large)
Wait Classes by Time
Lists the top wait classes by total time, ignoring idle (this may be very large)
Wait Classes by Average Time
Lists the top wait classes by average time, ignoring idle (this may be very large)
InnoDB Buffer Stats by Schema
Summarizes the output of the INFORMATION_SCHEMA.INNODB_BUFFER_PAGE table, aggregating by schema
InnoDB Buffer Stats by Table
Summarizes the output of the INFORMATION_SCHEMA.INNODB_BUFFER_PAGE table, aggregating by schema and table name