Making the MTR rpl suite GTID_MODE Agnostic
In MySQL 5.6 we introduced GTID_MODE as a new server option. A global transaction identifier (GTID) is a unique identifier created and associated with each transaction when it is committed on the server of origin (master). This identifier is unique not only to the server on which it originated, but is unique across all servers in a given replication setup. There is a 1-to-1 mapping between all transactions and all GTIDs. For additional information, please refer to the MySQL manual.
Prior to 5.6.17 and 5.7.4, we had GTID specific replication (referred to as “rpl” within the MTR suite) regression tests, and we had to separately run the binlog suite with the GTID_MODE=ON option enabled. To improve the test coverage further we wanted to make the MTR rpl suite GTID_MODE agnostic.
From 5.6.17 and 5.7.4 on, the MTR rpl suite is run daily within the continuous integration testing framework of MySQL, with GTID_MODE=ON for better code coverage.
However we cannot just turn on the necessary mysqld switches and expect all of the test cases to pass, as the tests are bound to behave differently when run with GTID_MODE=ON. There were a number of challenges associated with this new change, and this post will explain the challenges and the necessary solutions that were implemented in order to make this change.
We wanted to periodically run the rpl suite with --GTID_MODE=ON. However, one cannot just turn on the necessary mysqld switches and expect all of the test cases to pass due to some differences in behavior when run with GTID_MODE=ON.
Some of the challenges that we expected were:
C1. Result file differences due to additional output from SHOW BINLOG EVENTS. This breaks many tests due to the result file difference, and MTR is results file oriented.
C2. Tests that mix transactional and non transactional engines in the same statement/ transaction. GTIDs have options and requirements around such behavior.
C3. The fact that some tests do not require the logging of slave updates to be ON, and thus with GTID_MODE=ON we fail to start the server. This is because all servers involved in a GTID based replication group must have GTID_MODE=ON.
C4. Currently, even slaves using the “--gtid-mode=on --enforce-gtid-consistency --log-slave-updates” server options will still connect to the master the using MASTER_AUTO_POSITION=0, because by default the replication setup in the mysql-test rpl suite has MASTER_AUTO_POSITION=0. But for GTID_MODE=ON, we should have MASTER_AUTO_POSITION=1 since that will allow for automatic connection and replication setup, using the correct GTID values.
To overcome these challenges, the following solutions were implemented :
Solution to C1: We use “show_binlog_events.inc” which filters out the additional GTID output from the result files.
However, this is not always enough. In some cases we need to create two wrappers for the test case. One works on traditional event positioning and the other with GTIDs. The former keeps the original test name, the latter gets “_gtid_” injected into it just after “rpl”. For example, rpl.rpl_binlog_errors becomes rpl.rpl_gtid_binlog_errors.
These should also wrap the original test case that it is moved into (keeping the original name, with the extension renamed from .test to .inc).
Solution to C2: These tests are not supported, so they were simply added to the skip-test-list when GTID_MODE=ON.
Solution to C3: We simply needed to skip these tests as well.
Solution to C4: We will set MASTER_AUTO_POSITION=1 by setting –let $use_gtids=1 before including master-slave.inc.
This setting must be automatic in rpl_init.inc, before calling rpl_change_topology.inc.
And it should be unset on rpl_end.inc before calling rpl_change_topology.inc again, which reverts CHANGE MASTER to MASTER_AUTO_POSITION=0
Apart from these more recurring challenges, we should also deprecate --sync_slave_with_master (and similar ones) and only use “include/sync_slave_sql_with_master.inc”, which handles both legacy file-positions, as well as GTID-based replication protocols.
Having solved the above challenges, we now run the MTR rpl suite with GTID_MODE=ON on a daily basis. This has greatly improved the test coverage, thus allowing us to identify GTID related problems well before they are pushed in the working release branch.
If you have any questions or feedback regarding this project, please post them here. I would love to hear what the community thinks about all of this.
‘Open Source Appreciation Day’ draws OpenStack, MySQL and CentOS faithful
210 people registered for the inaugural “Open Source Appreciation Day” March 31 in Santa Clara, Calif. The event will be held each year at Percona Live henceforth.To kick off the Percona Live MySQL Conference & Expo 2014, Percona held the first “Open Source Appreciation Day” on Monday, March 31st. Over 210 people registered and the day’s two free events focused on CentOS and OpenStack.The OpenStack Today event brought together members of the OpenStack community and MySQL experts in an afternoon of talks and sharing of best practices for both technologies. After a brief welcome message from Peter Zaitsev, co-founder and CEO of Percona, Florian Haas shared an introduction to OpenStack including its history and the basics of how it works.Jay Pipes delivered lessons from the field based on his years of OpenStack experience at AT&T, at Mirantis, and as a frequent code contributor to the project. Jay Janssen, a Percona managing consultant, complemented Jay Pipes’ talk with a MySQL expert’s perspective of OpenStack. He also shared ways to achieve High Availability using the latest version of Galera (Galera 3) and other new features found in the open source Percona XtraDB Cluster 5.6.Amrith Kumar’s presentation focused on the latest happenings in project Trove, OpenStack’s evolving DBaaS component, and Tesora’s growing involvement. Amrith also won quote of the day for his response to a question about the difference between “elastic” and “scalable.” Amrith: “The waistband on my trousers is elastic. It is not scalable.” Sandro Mazziotta wrapped up the event by sharing the challenges and opportunities of OpenStack from both an integrator as well as operator point of view based on the customer experiences of eNovance.OpenStack Today was made possible with the support of our sponsors, Tesora and hastexo. Here are links to presentations from the OpenStack Today event. Any missing presentations will soon be added to the OpenStack Today event page.What is OpenStack, and what’s in it for DBAs? Florian Haas, CEO, Principal Consultant, hastexoTales from the Field: Backend Data Storage in OpenStack Clouds Jay Pipes, Principal Technical Architect, MirantisPercona XtraDB Cluster in OpenStack Jay Janssen, Principal Consultant, PerconaAn Elastic Parallel Data-as-a-Service (DaaS) platform with OpenStack Trove and Tesora Amrith Kumar, Founder and CTO, TesoraChallenges and lessons learned with OpenStack deployments and MySQL Sandro Mazziotta, Senior Director, Product Management, eNovanceAutoscaling, replication, and failover: cloud native MySQL on OpenStack Sebastian Stadil, Founder, ScalrSpeakers in the CentOS Dojo Santa Clara event shared information about the current status of CentOS, the exciting road ahead, and best practices in key areas such as system administration, running MySQL, and administration tools. Here’s a rundown of topics and presentations from the event. Any missing presentations will soon be added to the CentOS Dojo Santa Clara event page.Welcome and Housekeeping Karsten Wade, CentOS Engineering Manager, Red HatThe New CentOS Project Karsten Wade, CentOS Engineering Manager, Red HatSystems Automation and Metrics at Pinterest Jeremy Carroll, Operations Engineer, PinterestSoftware Collections on CentOS Joe Brockmeier, Open Source & Standards, Red HatTwo Years Living Your Future Joe Miller, Lead Systems Engineer, PantheonRunning MySQL on CentOS Linux Peter Zaitsev, CEO and Co-Founder, PerconaNotes on MariaDB 10 Michael Widenius, Founder and CTO, MariaDB FoundationHappy Tools Jordan Sissel, Systems Engineer, DreamHostThank you to all of the presenters at the Open Source Appreciation Day events and to all of the attendees for joining.I hope to see you all again this November 3-4 at Percona Live London. The Percona Live MySQL Conference and Expo 2015 will also return to the Hyatt Santa Clara and Santa Clara Convention Center from April 13-16, 2015 – watch for more details in the coming months!The post ‘Open Source Appreciation Day’ draws OpenStack, MySQL and CentOS faithful appeared first on MySQL Performance Blog.
MongoDB, TokuMX and InnoDB for disk IO-bound, update-only by PK
I used sysbench to measure TPS for a workload that does 1 update by primary key per transaction. The database was much larger than RAM and the server has a SAS disk array that can do at least 2000 IOPs with a lot of concurrency. The update is to a non-indexed column so there is no secondary index maintenance which also means there is no benefit from a fractal tree in TokuMX or the change buffer in InnoDB. I also modified the benchmark client to avoid creating a secondary index. Despite that TokuMX gets almost 2X more TPS than InnoDB and InnoDB gets 3X to 5X more TPS than MongoDB.TokuMX is faster because it doesn't use (or waste) random IOPs on writes so more IO capacity is available for reads. In this workload an update is a read-modify-write operation where the read is likely to require a disk read.MongoDB is slower for two reasons. The first reason is the per-database RW-lock and the result doesn't get better with more concurrent clients. For this test all collections were in one database. The lock is held while the b-tree index for the PK is searched to find the document to update. Disk reads might be done when the lock is held. The second reason is that it does twice the number of disk reads per update while InnoDB & TokuMX do about 1 per update. Part of the difference is that InnoDB and TokukMX have clustered PK indexes but the results are much worse than I expected for MongoDB. I wonder if caching of index leaf blocks is not as effective as I expect or if I am wrong to expect this. Maybe this is one of the problems of depending on the OS VM to cache the right data.Yield on page faultThe TPS results for MongoDB are limited by disk read latency. Even though there is a disk array that can do a few thousand random reads per second, the array sustains about 150 reads/second when there is a single stream of IO requests. And the per-database RW-lock guarantees that is the case. So MongoDB won't get more than 1 / disk-read-latency updates per second for this test regardless of the number of disks in the array or number of concurrent clients.MongoDB documentation mentions that the per-database RW-lock can be yielded on page faults but the documentation wasn't specific enough for me. I think this is what you need to know and I hope MongoDB experts correct any mistakes.Yield is only done for access to documents. It is not done while accessing primary or secondary indexes. To see in the code where a yield might be done search for calls to Record::_accessing() which throws PageFaultException. The record might also be "prefetched" after releasing the per-database RW-lock via a call to Record::touch().Yield is done on predicted page faults, not on actual page faults. AFAIK, a signal handler for SIGSEGV could be used to do this for actual page faults and MongoDB creates a handler for SIGSEGV but only to print a stack trace before exiting. MongoDB has something like an LRU to track memory references and predict page faults. I haven't spent much time trying to figure out that code but have seen those functions use a lot of CPU time for some benchmarks. I am curious why the btree code uses that tracking code (it calls likelyInPhysicalMemory). To learn more about the page fault prediction code read the functions Record::likelyInPhysicalMemory and Record::_accessing and the classes PageFaultException and Rolling.From reading the above you should assume that you really want all indexes to be cached in RAM. Alas that can be hard to do for big data databases. For this test my server has 72G of RAM and the PK indexes are 83G. So I know that all of the indexes won't be cached.I tried to overcome disk read stalls during index searching by changing the Java sysbench client to manually prefetch the to-be-updated document by calling findOne prior to the update. That improved TPS by about 20%. I hoped for more but the prefetch attempt needs a read-lock and pending write-lock requests on the per-database RW-lock appear to block new read-lock requests. I think this is done to prevent write-lock requests from getting starved. My attempt is not a workaround.ConfigurationThis test used the sysbench clients as described previously. Tests were run for 8, 16, 32 and 64 concurrent clients. There were 8 collections/tables in one database with 400M documents/rows per collection/table. The test server has a SAS disk array that can do more than 2000 IOPs with many concurrent requests, 16 CPU cores with HT enabled and 72G of RAM. The sysbench clients ran on the same host as mysqld/mongod. Tests were first run for 30 minutes at each concurrency level to warmup the DBMS and then for either 60 or 120 minutes when measurements were taken. I tested these configurations:mongo-p2y - 874 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0mongo-p2n - 828 GB database, MongoDB 2.6.0rc2, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0mysql - 698 GB database, MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=2mysql-zlib - 349 GB database, MySQL 5.6.12, InnoDB 2X compression (key_block_size=4) via zlib, no compression, flush_log_at_trx_commit=2, buffer_pool_size=60G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=3000, lru_scan_depth=500, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=1tokumz-quicklz - 513 GB database, TokuMX 1.4.1 with quicklz compression, logFlushPeriod=300, w:1,j:0tokumz-zlib - 385 GB database, TokuMX 1.4.1 with zlib compression, logFlushPeriod=300, w:1,j:0ResultsMongoDB does twice the number of disk reads per update compared to TokuMX and InnoDB. MongoDB TPS does not increase with concurrency. TPS does increase with concurrency for InnoDB and TokuMX which benefit from having many more concurrent pending disk reads. TokuMX does better than InnoDB because it doesn't use random IOPs for database page writes so there is more capacity remaining for reads.TPSconfiguration 8 clients 16 clients 32 clients 64 clientstokumx-zlib 888 1267 1647 2034tokumx-quicklz 870 1224 1567 1915mysql-zlib 562 809 983 1140mysql 543 737 913 1043mongo-p2y 168 168 169 169mongo-p2n 168 169 168 169iostat r/sconfiguration 8 clients 16 clients 32 clients 64 clientstokumx-zlib 924 1279 1650 2032tokumx-quicklz 891 1243 1600 1948mysql-zlib 520 727 862 966mysql 512 695 855 970mongo-p2y 337 340 342 344mongo-p2n 343 347 350 350disk reads per updateconfiguration 8 clients 16 clients 32 clients 64 clientstokumx-zlib 1.04 1.01 1.00 1.00tokumx-quicklz 1.02 1.02 1.02 1.02mysql-zlib 0.93 0.90 0.88 0.85mysql 0.94 0.94 0.94 0.93mongo-p2y 2.01 2.02 2.02 2.04mongo-p2n 2.04 2.05 2.08 2.07
MySQL 5.7.4 Overview and Highlights
MySQL 5.7.4 was recently released (it is the latest MySQL 5.7, and is the “m14″ or “Milestone 14″ release), and is available for download here and here.
The 5.7.4 changelog begins with the following, so I felt it appropriate to include it here as well.
“This release is dedicated to the memory of two young engineers of the MySQL Engineering family, Astha and Akhila, whom we lost while they were in their early twenties. This is a small remembrance and a way to recognize your contribution to the 5.7 release. You will be missed.”
As for the fixes, there are quite a few, which is to be expected in such an early milestone release.
The main highlights for me were:
The Performance Schema now instruments prepared statements (for both the binary and text protocols). Info is available in the prepared_statements_instances table, along with performance_schema_max_prepared_statements_instances system variable, and Performance_schema_prepared_statements_lost status variable.
Incompatible Change: MySQL deployments installed using RPM packages now are secure by default (single root account, ‘root’@'localhost’, no anonymous-user accounts, no test database).
Incompatible Change: MySQL now enables database administrators to establish a policy for automatic password expiration: Any user who connects to the server using an account for which the password is past its permitted lifetime must change the password.
Performance; InnoDB: InnoDB now supports multiple page_cleaner threads for flushing dirty pages from buffer pool instances. A new system variable, innodb_page_cleaners, is used to specify the number of page_cleaner threads.
Incompatible Change: The AES_ENCRYPT() and AES_DECRYPT() functions now permit control of the block encryption mode and take an optional initialization vector argument
InnoDB: InnoDB now supports the Transportable Tablespace feature for partitioned InnoDB tables and individual InnoDB table partitions. This enhancement eases backup procedures for partitioned tables and enables copying of partitioned tables and individual table partitions between MySQL instances.
Of course, there were many, many more fixes/updates (InnoDB being #1, Replication #2, and Partitioning #3 with most fixed bugs), so be sure to read through the full changelog. And if you are running a previous version of *5.7*, then definitely plan on upgrading to this latest 5.7.4.
Hope this helps.
How TokuMX Secondaries Work in Replication
As I’ve mentioned in previous posts, TokuMX replication differs quite a bit from MongoDB’s replication. The differences are large enough such that we’ve completely redone some of MongoDB’s existing algorithms. One such area is how secondaries apply oplog data from a primary. In this post, I’ll explain how.
In designing how secondaries apply oplog data, we did not look closely at how MongoDB does it. In fact, I’ve currently forgotten all I’ve learned about MongoDB’s implementation, so I am not in a position to compare the two. I think I recall that MongoDB’s oplog idempotency was a key to their algorithms. Because we chose not to be idempotent (to avoid complexity elsewhere), we couldn’t use the same design. Instead, we looked to another non-idempotent implementation for inspiration: MySQL.
Stephane Combaudon writes a nice explanation of how MySQL’s replication works here:
“On a slave, replication involves 2 threads: the IO thread which copies the binary log of the master to a local copy called the relay log and the SQL thread which then executes the queries written in the relay log. The current position of each thread is stored in a file: master.info for the IO thread and relay-log.info for the SQL thread.”
What is not mentioned here is that if the binary log is enabled on the slave, the SQL thread will also replicate the queries that are written in the relay log to the binary log.
With TokuMX, we wanted a similar approach. We wanted one thread to be responsible for producing oplog data with a tailable cursor and writing it, and another thread to be responsible for replaying the oplog data and applying it to collections. But we did not want a separate relay log and binary log. This seemed to add unnecessary complexity. Instead, with TokuMX, the oplog is responsible for the work of the relay log and the binary log. To merge these functions, we added the “applied” bit to the oplog.
Here is how TokuMX secondaries apply oplog data. Hopefully, with this explanation, the use of the “applied” bit becomes clear:
The “producer” thread reads oplog data from the primary (or another secondary in the case of chained replication), and writes that data to the oplog. In doing so, it sets the applied bit to false. This work happens within a single transaction. That means, should there be a crash, the system will know upon startup that this entry has not yet been applied to collections
Then the “applier” thread, within a single transaction, applies the oplog data that has been written, and updates the oplog entry’s applied bit from false to true.
A nice property of this design is that upon recovering from a crash, the oplog is guaranteed to be up to date to a certain point in time, and that no gaps exist. That is, we don’t need to be worried about some oplog entry missing whose GTID is less than the GTID of the final entry.
However, because the applier may naturally be behind the producer, upon recovering from a crash, we need to find and apply all transactions whose applied bit is set to false. Here is how we do it. Once a second, another background thread learns what the minimum unapplied GTID is, and writes it to the collection “local.replInfo”. Because this value is updated only once a second, it is not accurate. However, it is a nice conservative estimate of what the minimum unapplied GTID actually is. Upon starting up a secondary that has already been initial synced, we read the oplog forward from this value saved in local.replInfo (which cannot be much more than a second behind the end), and apply any transaction whose applied bit is false.
A downside to this design is that data is written to the oplog twice for each transaction, once by the producer, and once by the applier to update the “applied” bit. In CPU-bound write-heavy workloads, this may present an issue (although we have no evidence). If necessary, we can likely improve upon this in the future, but that discussion is for another day.