
Clustrix's Frequently Asked Questions page is a central hub where its customers can always go to with their most common questions. These are the 81 most popular questions Clustrix receives.
ClustrixDB outputs to several different log files, the main four being CLUSTRIX(default), QUERY, USER, and DEBUG. By default these three logs will rollover and start a new log file when the Log file reaches 1GB in size.
By defaultthe cluster will holdfor each log type 10 uncompressed log files and 20 gzipped log files for a total of 90 files. So in the most very extreme situations the logs could take up to 35-40GB of space.
If you would like to force the log files to rollover before the 1GB size limit is met, you may invoke the memlog rollover function with the the clx cp command.
Example:
/opt/clustrix/bin/clx cp 'memlog rollover default'
/opt/clustrix/bin/clx cp 'memlog rollover query'
This will cause the clustrix.log (aka default log) and the query.log, respectively to rotate and start a new file.
The method for the legacy Clustrix Appliance is a bit different. You will need to call the script/bin/ logrotate.pl with the argument of the log file to rotate.
Example:
clx cmd "/bin/ logrotate.pl /var/log/sprout.log"
clx cmd "/bin/ logrotate.pl /var/log/query.log"
View ArticleTo check the status of an ALTER you can simply query the system.alter_progress table with:
sql> select * from system.alter_progress;
This will show the following information about any alters in progress which will give an estimate on the time the alter will complete based off of the amount of time and work done up until that point. Please note that system load and writes to the table will affect the alter time and can change the estimation.
+-----------------------+--------------------+| Field | Type |+-----------------------+--------------------+| session_id | int(0) unsigned || src_relation_oid | oid || src_relation_name | varchar(65535) || start | datetime(6) || copy_xid | oid || rows_read | bigint(0) unsigned || src_relation_est_rows | bigint(0) unsigned || progress_pct | double || completion_est | datetime(6) |+-----------------------+--------------------+
---------------------------------------------------------------------------------------------
To get an indication as to how far along an alter is in earlier versions of ClustrixDB (before 9.0), you can do the following:
First find the xid for the transaction in the system.transactions table. You'll be looking for SERIALIZABLE (DDL) here. For example:
mysql> select * from system.transactions where isolation='SERIALIZABLE ' order by 2,1;
That SERIALIZABLE transaction was started by the alter table machinery. You can then track the progress of the alter statement by getting an estimate for the number of rows in the table using explain:
mysql> explain select * from databasename.tablename;
And then plug in that number into the following query:
mysql> select sum(rows_read)/122838050 from system.transactions where xid = 5745092185389604867;
View ArticleCustomers with current Subscription Agreements or considering one may be interested in reviewing our Current Support Policies. We aim to please and provide you with fantastic uptime and availability. Our policies have been crafted from industry best-practices and years of experience. If you have suggestions, please send us an email.
View ArticleDistribution in ClustrixDB determines the way that data is stored across the cluster. For more information about data distribution on ClustrixDB, you can read the following article in our documentation: Managing Data Distribution
In v7.5 the default distribution of indexes will be 1. This differs from the current default which is number of columns. The redistribute task was designed to give the best distribution according to disk usage across the cluster but as we've found that this can be less performant in many cases we recommended that this task be disabled and redistribute actions be performed in accordance with the guidelines detailed below. Secondly, as the redistribute task may undo many of the changes recommended below, it should be disabled before proceeding.
You can disable the redistribute task with:
set global task_rebalancer_redistribute_interval_ms=0;
The reason we've found distribution to decrease performance is mainly broadcasts. Broadcasts cause increased traffic across the network and when unnecessary, cause added contention without a performance benefit. The information below describes ways to identify tables and indexes that would benefit from various redistribute actions. A redistribute action is a table re-write, so this may take a long time on many of your larger tables. If your team has any questions on particular changes, we'll be happy to review them.
Broadcasts
Broadcasting is when a node asks another node for additional information. This adds parallelism (a good thing) but can also add noise/congestion. Removing un-needed or unhelpful broadcasts via schema or query changes can be beneficial to the system as a whole.
Detecting:
QPC queries will give information on queries that have recently run. Looking at this table can help to determine recent queries that could be causing issues. Using this to look at broadcasts specifically:
SELECT statement,
broadcasts,
exec_ms,
counts,
rows_read
FROM system.qpc_queries
ORDER BY broadcasts DESC
LIMIT 5\G
ALLNODES:
Potential ALLNODES Candidates:
Table Size < 10MiB
Write Frequency < 1K
Read Frequency > 1M
Currently not ALLNODES
Changing small tables with very few writes will help by preventing the need to broadcast to other nodes to get table data. This especially helps with small tables that are used in JOINs with larger tables. Tables that are frequently written to should not be altered to ALLNODES. Partitioned tables should also not be altered to ALLNODES.
SELECT `database`,
`table`,
reads,
writes,
bytes
FROM (SELECT `database`,
`table`,
Sum(reads) AS reads,
Sum(inserts) + Sum(deletes) + Sum(replaces) AS writes,
Sum(bytes) AS bytes,
Count(DISTINCT( `index` )) AS paths,
Count(DISTINCT( slice )) AS slices,
Count(DISTINCT( replica )) AS replicas
FROM system.container_stats cs
JOIN system.table_replicas tr
ON cs.replica = tr.replica
GROUP BY 1,
2) AS a
WHERE bytes < 10 * 1024 * 1024
AND writes < 1000
AND reads > 1000000
AND slices > paths
ORDER BY 1,
2;
To alter a table to ALLNODES, you can use the following:
sql> ALTER TABLE foo.bar REPLICAS=ALLNODES;
Index Distribution:
Potential Index Distribution Candidates
Single column indexes with DISTRIBUTE=2
Multi-column indexes with DISTRIBUTE >= number of columns with mostly unique initial columns.
Single column indexes with DISTRIBUTE=2
Example:
CREATE TABLE `foo2` (
`i` int(11) not null,
PRIMARY KEY (`i`) /*$ DISTRIBUTE=1 */,
KEY `idx_i` (`i`) /*$ DISTRIBUTE=2 */
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=7 */
ALTER TABLE foo2 INDEX idx_i DISTRIBUTE=1;
In v7.0+ is no longer possible to create an index with a higher distribution that it's keycount. In previous versions, or on clusters upgraded from previous versions, this change will remove unneeded broadcasts on single column indexes.
Multi-column indexes:
With Multi-column indexes there can be a benefit to changing the distribution in some cases. Cases:
Distribution is > the number of columns - Change distribution to at most the number of columns.
Indexes that have mostly unique initial columns - Consider reducing distribution if unnecessary broadcasts are being performed.
Check explain of queries using index and look for stream combine.
View ArticleDefault Distribution
ClustrixDB uses a hashing algorithm to determine where a table (representation) or indexs should reside within a cluster. The columns selected for hashing are referred to as the distribution key for that representation and determines placement of data across nodes. To accomplish this, each index contains metadata about the distribution key. By default, the distribution key uses the first column of an index (DISTRIBUTE = 1), regardless of how many columns comprise the index. This is true for all indices including the primary key.
For more on this, including how to modify the distribution key, see our documentation for: DISTRIBUTE.
Increasing Distribution
It can sometimes be beneficial to modify the distribution key to include additional columns for an index or table. This is an online operation (DDL) but requires a re-write of the representation so can take some time and require free space to complete for large tables and indexes. As with any other alter operation, this should be run during off-peak.
The following sections describe circumstances where a distribution change should be considered.
Column Uniqueness
If the first column of the key does not contain a sufficient number of unique values, it can prevent the database from being able to distribute the representation evenly across the cluster. This can lead to either very large slices that grow beyond the split threshold and/or a wide variance between the smallest and largest slice sizes. This can cause queries that access those slices to vary in performance. A general guideline is that the first key of a key should contain at least an order of magnitude more values than the representations number of slices. For example, for a table with 600 slices the key should contain at least 6000 unique values.
You can see which representations the systems thinks would benefit from a change in distribution with the following query. This query will tell you the current distribution and the total number of columns in a key. You'll want to increase distribution depending on the uniqueness of the columns which may not mean changing it to equal the total number of columns in the key. Note: Single column non-primary keys that aren't unique enough for proper distribution can be redistributed to include the primary key.
SELECT db.name AS DATABASE, rel.name AS table, rep.name AS index, rep.cont_keycount AS total_keycount, hd.range_keycount AS current_keycount, bytes, rhd.slice_size_deviance, node_write_deviance, vdev_write_deviance, node_read_deviance, vdev_read_deviance FROM system.rebalancer_hash_distributions rhd JOIN system.representations rep ON rep.representation = rhd.hash_dist JOIN system.relations rel ON rel.table = rep.relation JOIN system.DATABASES db ON db.db = rel.db JOIN system.range_hash_distributions hd ON hd.representation = rep.representation WHERE rhd.needs_redistribute AND rep.cont_keycount != hd.range_keycount;
To determine how many columns you should distribute across you need to look at the uniqueness of each column in the key. For a sample representation (`col1`,`col2`) belonging to table foo.bar, you could query the number of rows in the table and the uniqueness with:
CREATE TABLE `bar` (
`col1` bigint(20) unsigned not null,
`col2` bigint(20) unsigned not null,
PRIMARY KEY (`col1`,`col2`) /*$ DISTRIBUTE=1
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=50 CONTAINER=layered */
explain select * from foo.bar;
+--------------------------------------------------------------+------------+-------------+
| Operation | Est. Cost | Est. Rows |
+--------------------------------------------------------------+------------+-------------+
| stream_combine | 6690015.46 | 786516.00 |
| index_scan 1 := bar.__idx_bar__PRIMARY | 103914.13 | 199750.30 |
+--------------------------------------------------------------+------------+-------------+
2 rows in set (0.10 sec)
select count(distinct (`col1`)),count(distinct(`col2`)) from foo.bar;
+-----------------------------+---------------------------+
| count(distinct (`col1`)) | count(distinct(`col2`)) |
+-----------------------------+---------------------------+
| 5 | 5200 |
+-----------------------------+---------------------------+
1 row in set (0.00 sec)
In this example col1 is not very unique so the data will be hash distributed into 5 buckets. If the second column is added to the distribution key by running ALTER TABLE foo.bar PRIMARY KEY DISTRIBUTE=2; the slices will be more equal in size and more able to be balanced across the cluster.
Increase Parallelism
Changing the distribution of a key will also change the behavior of some planner operators. This can allow queries run against the table to use more parallel operators. In the example below, changing the DISTRIBUTION of the primary key allows for the usage of dist_stream_aggregate because the data being aggregated will be distributed throughout the cluster. This is more overall work for the cluster, but will result in a lower query latency for this query (and potentially other queries against that representation). Additionally, a longer distribution key can make it more expensive in cases where the distributed column(s) is part of a GROUP BY list because more of this aggregation will need to be done by the GTM (Global Transaction Manager) node.
Original Schema and Explain:
CREATE TABLE `bar` (
`iDate` datetime not null,
`c_code` int(10) unsigned not null,
`counter` bigint(20) unsigned not null,
`unique_count` bigint(20) unsigned not null,
`area` varchar(32) not null,
PRIMARY KEY (`iDate`,`c_code`,`area`) /*$ DISTRIBUTE=1 */
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=400 CONTAINER=layered */;
sql> explain select sum(counter) from test.bar where iDate = '2017-06-22';
+---------------------------------------------------------------------------------+-------------+-------------+
| Operation | Est. Cost | Est. Rows |
+---------------------------------------------------------------------------------+-------------+-------------+
| stream_aggregate expr0 := sum((1 . "counter")) | 54050340.69 | 71114370.00 |
| index_scan 1 := bar.__idx_bar__PRIMARY, iDate = param(0) | 39827461.69 | 71114370.00 |
+---------------------------------------------------------------------------------+-------------+-------------+
2 rows in set (2.80 sec)
Primary Key DISTRIBUTE=3:
CREATE TABLE `bar` (
`iDate` datetime not null,
`c_code` int(10) unsigned not null,
`counter` bigint(20) unsigned not null,
`unique_count` bigint(20) unsigned not null,
`area` varchar(32) not null,
PRIMARY KEY (`iDate`,`c_code`,`area`) /*$ DISTRIBUTE=3 */
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=400 CONTAINER=layered */;
sql> explain select sum(counter) from test.bar where iDate = '2017-06-22';
+-------------------------------------------------------------------------------------+-------------+-----------+
| Operation | Est. Cost | Est. Rows |
+-------------------------------------------------------------------------------------+-------------+-----------+
| stream_aggregate expr0 := sum((0 . "expr0")) | 25188878.89 | 800.00 |
| stream_combine | 25188713.89 | 800.00 |
| dist_stream_aggregate expr0 := sum((1 . "counter")) | 31485.79 | 1.00 |
| index_scan 1 := bar.__idx_bar__PRIMARY, iDate = param(0) | 30928.61 | 55217.97 |
+-------------------------------------------------------------------------------------+-------------+-----------+
4 rows in set (0.00 sec)
Note: This change will add broadcasts to the system, which in some cases can affect global cluster performance. Broadcasts cause increased traffic across the network and when unnecessary, cause added contention without a performance benefit. If you see a decrease in performance after increasing the distribution of a popular table, you can measure broadcasts with the following query
SELECT statement,
broadcasts,
exec_ms,
counts,
rows_read
FROM system.qpc_queries
ORDER BY broadcasts DESC
LIMIT 5\G
If this query suddenly shows the representation you've changed as having a very high number of broadcasts it may be necessary to reduce the distribution of that representation.
View ArticleLeaving sessions open with uncommitted transactions cancause undo space to grow as bigc (ClustrixDB garbage collector) can't move forward until the transaction is committed. In the meantime, other transactions are filling up the undo space.
To identifyexplicit uncommitted transactionsuse the following query:
SELECT *
FROM system.sessions
WHERE xid IN
(
SELECT xid
FROM system.transactions
ORDER BY xid ASC
LIMIT 1) \g
Sample output:
mysql> select * from system.sessions where xid in (select xid from system.transactions order by xid asc limit 1) \G
*************************** 1. row ***************************
nodeid: 1
session_id: 3305627649
source_ip: 192.168.100.201
source_port: 61600
local_ip: 192.168.1.1
local_port: 3306
user: 5990219407120025601
database: test
trx_state: open <===
statement_state: finished <===
xid: 6088543909255731202
isolation: REPEATABLE-READ
last_statement: SELECT col1, col2, col3 FROM test_tbl t1 JOIN test2_tbl t2 ON t1.col1 = t2.col1 WHERE t1.col1= 12345
time_in_state_s: 7487 <===
created: 2014-12-03 09:06:24
heap_id: 720576447552535674
trx_age: 7493 <===
trx_mode: explicit <===
trx_counter_select: 25
trx_counter_insert: 6trx_counter_update: 0
trx_counter_delete: 0
trx_is_writer: 1
1 row in set (0.10 sec)
Notice that the trx_state is "open" and that the trx_mode is "explicit". This indicates that this is an uncommitted transaction. Also note that the statement_state is "finished" with high time_in_state_s and trx_age values.
Once you've identified the session with the uncommitted transaction, you can ask the user to commit the transaction or you can kill the session which will roll back the transaction.
The global variable idle_trx_timeout_s defines how long a session can be idle for before the cluster decides to kill the session and roll back the transaction.
If you have a non-default idle_trx_timeout_s that is high or is set to 0, then these uncommitted transactions may cause the occasional issue.
To see how you can set these variables to control how the cluster manages idle connections, please have a look at: Automated Transaction and Session Killers
View ArticleThe most common cause of locking in a database is concurrent transactions trying to update the same row(s) of a table. As part of MVCC locks are necessary to keep data consistent across different transactions. Since locking will present to the application as latency, its important to identify why locks are being held and lessen or eliminate the impact from locking contention on the user experience.
To look for active locks in the system, you can use the following query:
sql> select * from system.lockman;
Example:
sql> select * from system.lockman;+--------+---------------------+---------------------+---------+------------+--------+---------------------+-----------+| nodeid | holder | waiter | wait_ms | timeout_ms | domain | key | exclusive |+--------+---------------------+---------------------+---------+------------+--------+---------------------+-----------+| 15 | 6268069961036267554 | 6268069963849242654 | 11977 | 0 | 4103 | 5979715258906695681 | 1 || 15 | 6268069961036267554 | 6268069968114098208 | 10980 | 0 | 4103 | 5979715258906695681 | 1 || 15 | 6268069961036267554 | 6268069975814778910 | 9249 | 0 | 4103 | 5979715258906695681 | 1+--------+---------------------+---------------------+---------+------------+---------------------+---------------------+-----------+--------+------------------+---------------+1 row in set (0.01 sec)
The holder is the XID of the transaction holding a lock, and waiter indicates an XID waiting for that lock to be released. This example is a clear illustration of some transaction sitting on a lock for more than 11 seconds (it was an explicit write transaction that didn't commit).
To see which transaction is holding the most locks at any given time you can use following query:
SQL> SELECT * FROM system.sessions WHERE xid = (select holder FROM (select holder,count(1) cnt FROM system.lockman GROUP BY holder ORDER BY cnt desc limit 1) dummy)\G
The example belowis an explicit write transaction waiting to commit:
sql> SELECT * FROM system.sessions WHERE xid = (select holder FROM (select holder,count(1) cnt FROM system.lockman GROUP BY holder ORDER BY cnt desc limit 1) dummy)\G*************************** 1. row ***************************nodeid: 4 session_id: 3774276612 source_ip: 10.65.148.79 source_port: 33845 local_ip: 10.65.60.51 local_port: 3306 user: 5762026690960054275 database: footrx_state: open statement_state: finished xid: 6268069961036267554cpu: 1 isolation: REPEATABLE-READ last_statement: update SOME_TABLE set product_version = 25 where product_id = 'ABC123' and supplier_id ='foobar_id' and sku_id ='SKUABC123' and product_version = 24 time_in_state_s: 18715 created: 2016-07-11 09:27:16 heap_id: 72057594040818613trx_age: 0trx_mode: explicittrx_counter_select: 0trx_counter_insert: 0trx_counter_update: 0trx_counter_delete: 0trx_is_writer: 01 row in set (0.01 sec)
Types of Locking
Serialized Locking Queue:
If the result of the above query keeps changing, this usually indicates a serialized locking queue. The most common cause of this type of locking is overlapping rows for inserts or updates. For example:
Query A inserts rows with unique id 1,2,3
Query B updates rows with unique id 2,4
Query C updates rows with unique id 4,1
Query B affects rows that are locked by Query A so it must wait until Query A finishes to apply it's changes. Query C in turn, must wait for Query B to finish because it is updating the same rows. A long line of these events can cause queries at the end of the chain to take a long time and consequently have a large value for lockman_waittime_ms (see historical locking below).
Ways to resolve / avoid Serialized Locking Queues:
Avoid concurrent updates to the same row(s).
Look for ways to improve the performance of the query causing locking contention (index and query tuning).
Bad Query / Missing Index:
A poorly written query or a query missing an index can often cause issues for the database. Overlapping write queries can cause cascading issues by making other queries against those rows or table wait on locks.
In identifying whether if a query is missing any indexes or otherwise underperformant, it is often helpful to look at the explain plan and DDLs of the tables to understand the cost of the plan, and if add an index or refactor the query to improve performance.
Historical Locking:
To identify queries that have been held up by locking in the past you can look at the query log, specifically `lockman_waits` and `lockman_waittime_ms`. A high lockman_waittime_ms time with a large number of lockman_waits usually indicates a serialized locking queue, while a high lockman_waittime_ms with a single (or very small amount) of lockman_waits usually indicates another query took out a large table or row lock on that the logged query was trying to write to. Youll often be able to find the lock holding query logged around the same time as the waiting query. You can look at the queries that had the most lockman_waittime for a given log segment with the following:
[root@test001 ~]# /opt/clustrix/bin/clx cmd 'grep "^YYYY-MM-DD" /data/clustrix/log/query.log' > /tmp/YYYY-MM-DD_query.log
[root@test001 ~]# cat /tmp/YYYY-MM-DD_query.log | sed 's/^.*lockman_waittime_ms: \([0-9]*\).*$/\1 \0/' |sort -rn|less
Example (query log entries sorted based on "lockman_waitime_ms"):
2016-03-22 21:40:05.449257 UTC ip-10-10-10-1 clxnode: INSTR SQLERR SID:149052431 db=test [email protected] ac=N xid=56fca1e77ece381e sql="INSERT INTO `foo` (`account_id`, `created_at`, `member_id`, `sub_list_id`, `updated_at`) VALUES (7825, '2016-03-22 21:31:25', 78348504, 166562, '2016-03-22 21:31:25')" [Session killed due to closed client socket: ] time 520100.0ms; reads: 6; inserts: 15; deletes: 0; updates: 0; counts: 1; rows_read: 6; forwards: 13; broadcasts: 0; rows_output: 7; semaphore_matches: 0; fragment_executions: 15; runtime_ns: 13; cpu_waits: 1; cpu_waittime_ns: 1; bm_fixes: 39; bm_loads: 0; bm_waittime_ns: 0; lockman_waits: 2; lockman_waittime_ms: 517712; trxstate_waits: 0; trxstate_waittime_ms: 0; wal_perm_waittime_ms: 0; bm_perm_waittime_ms: 0;attempts: 2
2016-03-22 21:41:18.389570 UTC ip-1010-10-2 clxnode: INSTR SLOW SID:231803921 db=test [email protected] ac=N xid=56fca211d2b06022 sql="INSERT INTO `foo` (`account_id`, `created_at`, `member_id`, `sub_list_id`, `updated_at`) VALUES (7825, '2016-03-22 21:33:51', 78348504, 84758, '2016-03-22 21:33:51')" [Ok: 1 rows affected] time 446739.8ms; reads: 3; inserts: 13; deletes: 0; updates: 0; counts: 1; rows_read: 3; forwards: 12; broadcasts: 0; rows_output: 2; semaphore_matches: 0; fragment_executions: 13; runtime_ns: 11; cpu_waits: 0; cpu_waittime_ns: 0; bm_fixes: 44; bm_loads: 0; bm_waittime_ns: 0; lockman_waits: 1; lockman_waittime_ms: 446699; trxstate_waits: 0; trxstate_waittime_ms: 0; wal_perm_waittime_ms: 0; bm_perm_waittime_ms: 0; attempts: 1
2016-03-22 21:33:09.726210 UTC ip-10-10-10-1 clxnode: INSTR SLOW SID:149049359 db=test [email protected] ac=N xid=56fca1791d45d81e sql="INSERT INTO `foo` (`account_id`, `created_at`, `member_id`, `sub_list_id`, `updated_at`) VALUES (7825, '2016-03-22 21:31:20', 78348504, 166562, '2016-03-22 21:31:20')" [Ok: 1 rows affected] time 109261.8ms; reads: 3; inserts: 13; deletes: 0; updates: 0; counts: 1; rows_read: 3; forwards: 12; broadcasts: 0; rows_output: 2; semaphore_matches: 0; fragment_executions: 13; runtime_ns: 11; cpu_waits: 0; cpu_waittime_ns: 0; bm_fixes: 47; bm_loads: 0; bm_waittime_ns: 0; lockman_waits: 1; lockman_waittime_ms: 109232; trxstate_waits: 0; trxstate_waittime_ms: 0; wal_perm_waittime_ms: 0; bm_perm_waittime_ms: 0; attempts: 1
In the instance of the longest waiter from above shown example logs, it was waiting on two different locks for a total of 520100.0ms (520 seconds) and was then killed by an automated transaction killer. Presumably, the client net_write_timeout is set to 500 seconds. The query above has a very small value for runtime_ns (13), which means it spent most of its time waiting on locks.
Assuming the locking is caused by the queries already identified then the insert/updates that have been identified should only lock the rows they're directly affecting. The overlap is where queries will get a stack up on locks. You can imagine a large build-up if you have a few queries all taking 30+ seconds waiting on each other to complete. In the previous example, you'd get the following result:
Query A - Takes the time of Query A
Query B - Takes the time of Query A & B
Query C - Takes the time of Query A & B & C
The more overlap, the longer the wait, and the worse things get. If you're able to reduce the overlap and/or the runtime of the individual queries, you should see less locking overall. Running things in batches, especially when the overlap is expected should help mitigate the issue here.
View ArticleFor most replication errors, the cause of the error can usually be found in the output from show slave status. However, if you're trying to troubleshot a slave lag or if you need to find out more details about what is in the replicated event, you can use mysqlbinlog (a mysql utility) or repclient (a Clustrix utility) to extract binlog events.
Binlogs in Clustrix
Unlike mysql where it writes the binlogs to their own individual files, Clustrix binlogs are stored within the database storage so that it has the same protection as database data.
Causes of Slave Stops
Most common slave stops are due to:
Row not found: Row on the slave doesn't match the row details of the replicated event.
Duplicate key in container: row with the same primary key already exists on the slave.
Although these errors are fairly self-explanatory, you would still want to use mysqlbinlog or repclient to determine whether the replication event is a standalone query or if it's a part of a larger explicit transaction. If it's the latter, you wouldn't want to skip the event as you may also skip other writes that are in that explicit transaction.
Please see Controlling Slave Behavior on Error for what settings you maywant to set for skipping events as well as how much of a match the row should have for a successful write.
Less common errors include:
arity errors (mismatch in the table definition with the replication event)
foreign key contraints
Sample Errors
Here are some show slave status output with the last successful log position read, the binlog file name, the master host, replication username and reason for the slave stopping. You'll need some of this information when extracting data from the binlogs.
*************************** 1. row ***************************
Slave_Name: default
Slave_Status: Errored
Master_Host: 10.0.0.10
Master_Port: 3306
Master_User: repl
Master_Log_File: default
Slave_Enabled: Enabled
Log_File_Seq: 28
Log_File_Pos: 45532636
Last_Error: Duplicate key in container: test_tbl2.__idx_test_tbl2__PRIMARY (314159). Statement: insert into test_tbl2 (id) values (314159), Database: test
Connection_Status: Disconnected
Relay_Log_Bytes_Read: 0
Relay_Log_Current_Bytes: 0
Seconds_Behind_Master: NULL
*************************** 1. row ***************************
Slave_Name: default
Slave_Status: Errored
Master_Host: 10.0.0.10
Master_Port: 3306
Master_User: repl
Master_Log_File: default
Slave_Enabled: Enabled
Log_File_Seq: 2485
Log_File_Pos: 489659967
Last_Error: Row Not Found: for delete on test.TEST_TBL3 at default2485.489660488 where TEST_TBL3.__idx_TEST_TBL3__PRIMARY(id)=(271828). Statement: COMMIT, Database:
Connection_Status: Disconnected
Relay_Log_Bytes_Read: 0
Relay_Log_Current_Bytes: 0
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
A sample SBR slave error:
*************************** 1. row ***************************
Slave_Name: default
Slave_Status: Reconnecting
Master_Host: 10.0.0.10
Master_Port: 3306
Master_User: repl
Master_Log_File: default
Slave_Enabled: Enabled
Log_File_Seq: 8051
Log_File_Pos: 56376112
Last_Error: Can't find any matching row in the test_tbl4 table: year '1666' does not exist. Statement: update test_tbl4 set event="Great Fire" where year=1666, Database:
Connection_Status: Disconnected
Relay_Log_Bytes_Read: 0
Relay_Log_Current_Bytes: 0
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
Depending on your workload, you may need to extract binlog events just to make sure that the event is a standalone query that isn't a part of a larger transaction. Otherwise, if you skip the event and it is a part of a larger transaction, you will be skipping all of the writes in that explicit transaction.
Using the last successful log position and log file name from the show slave output, we can extract event details from the binlogs to help troubleshoot why a slave might have stopped or lag.
Differences between mysqlbinlog and repclient
Both mysqlbinlog and repclient utilities do the same thing, namely extracting events from binlogs. The main difference is the output and how readable they are and mysqlbinlog doesn't understand all clustrix replication events.
The nice thing about repclient is that you can specify how many events (-count=###) to extract from the starting log position. Repclient can also be used to generate raw binlog dumps from Clustrix so that you can replay the events to another slave or use it as a part of a backup plan.
The useful aspect about mysqlbinlog is that it presents the write event in an easier to understand format especially with row-based replication (RBR). With RBR, only the row changes are replicated so you won't be able to get the actual query that was processed to generate the write events.
You could also use these tools for troubleshooting SBR issues.
Parameters
The main parameters for either command are:
log position: the last successful event processed is shown in the show slave status output on the slave cluster.
binlog file name: consists of the binlog name and a six digit suffix that increments when a new binlog is created.
master host: IP address or hostname of one of the nodes on the master cluster.
username and password: to access the binlog data.
If the slave is stopped, it will show the last successful event that it had processed along with the log position of that event and the binlog from that it was reading from.
Comparision of output between repclient and mysqlbinlog:
Here's some output from both utilities querying the same binlog. Before extracting the binlog event, you should cd onto a filesystem that has sufficient space as the binlog extract files can grow very quickly.
cd /clustrix # or wherever you have the most available spacerepclient -logname statd.006870 -logpos 59900827 -count 100 -verbose > binlog.output
Look for the BEGIN that defines that start of the transaction:
00000000 42 45 47 49 4E |BEGIN |
The next section shows the table mapping for statd.statd_history. For other replication events that write to multiple tables, it will have entries like this for each of the tables:
}
2014-11-14 12:47:06.599042 UTC eukanuba002 INFO mysql/replication/repclient.c:445 repdump_event(): MsgType19 [pos: 59900886, log_offset: 59900943]
message mysql_table_map {
header = message mysql_event_header {
timestamp = 1415960473
type = 19
server_id = 2101898466
size = 57
log_offset = 59900943
flags = 0
}
table_id = 1
flags = 0
db = "statd"
table = "statd_history"
columns = [
mysql_type_long,
mysql_type_long,
mysql_type_datetime,
mysql_type_double
]
}
The next event is the actual INSERT. In this case, the mysql_write_rows = INSERT. Note the table_id value. If the event has multiple table mappings, the table_id will identify which table the write is being applied:
2014-11-14 12:47:06.599135 UTC eukanuba002 INFO mysql/replication/repclient.c:445 repdump_event(): MsgType23 [pos: 59900943, log_offset: 59900997]
message mysql_write_rows {
header = message mysql_event_header {
timestamp = 1415960473
type = 23
server_id = 2101898466
size = 54
log_offset = 59900997
flags = 0
}
table_id = 1
flags = 0
used_columns = [1111]
rowdata =
00000000 F0 6D 00 00 00 DF 2F 00 00 F5 90 F3 77 51 12 00 |.m..../.....wQ..|
00000010 00 00 00 A0 69 8F 5C E5 41 |....i.\.A |
}
This next section is the actual data being inserted (values in parenthesis). So the event is inserting the following data (109,12255,'2014-11-14 10:20:05',2867100493):
1 Rows:
[t_vu32(109), t_vu32(12255), t_datetime(2014-11-14 10:20:05), t_vf64(2867100493)]
2014-11-14 12:47:06.599270 UTC eukanuba002 INFO mysql/replication/repclient.c:445 repdump_event(): MsgType23 [pos: 59900997, log_offset: 59901051]
message mysql_write_rows {
header = message mysql_event_header {
timestamp = 1415960473
type = 23
server_id = 2101898466
size = 54
log_offset = 59901051
flags = 0
}
table_id = 1
flags = 0
used_columns = [1111]
rowdata =
00000000 F0 6E 00 00 00 A8 41 00 00 FD 8E F3 77 51 12 00 |.n....A.....wQ..|
00000010 00 00 00 00 00 00 00 00 00 |......... |
There were several addition mysql_write_rows (inserts) until the transaction completed with COMMIT:
00000000 43 4F 4D 4D 49 54 |COMMIT |
Compare the repclient output with the output from mysqlbinlog:
When using mysqlbinlog to extract events from the binlog, include the use of the --hexdump option as it tells you when a transaction starts and commits but also means more data to scroll through.
mysqlbinlog --verbose --base64-output=ALWAYS --hexdump --start-position=59900827 -R statd.006870 > mysqbinlog.output
ctrl-c after a second or two.
There's some hex output along with some log positions, but again we want to look for the BEGIN:
# Start: binlog v 4, server v 5.0.45 created 700101 0:00:00
BINLOG '
AAAAAA/ibEh9ZgAAAAAAAAAAAAQANS4wLjQ1AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAUwAEGggAAAAICAgC
'/*!*/;
# at 59900866
#141114 10:21:13 server id 2101898466 end_log_pos 59900886
# Position Timestamp Type Master ID Size Master Pos Flags
# 39203c2 99 d7 65 54 02 e2 6c 48 7d 3b 00 00 00 d6 03 92 03 00 00
# 39203d5 04 bc b8 0f 00 00 00 00 00 00 00 15 00 00 00 00 |................|
# 39203e5 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# 39203f5 21 00 00 42 45 47 49 4e 00 00 00 04 21 00 21 00 |...BEGIN|
#
BINLOG '
mddlVALibEh9OwAAANYDkgMAAAS8uA8AAAAAAAAAFQAAAAAAAAEAAAgAAAAAAAQhACEAIQAAQkVH
SU4=
'/*!*/;
# at 59900925
# at 59900982
# at 59901036
# at 59901090
# at 59901144
# at 59901198
Here's the event where the table is mapped:
#141114 10:21:13 server id 2101898466 end_log_pos 59900943
# Position Timestamp Type Master ID Size Master Pos Flags
# 39203fd 99 d7 65 54 13 e2 6c 48 7d 39 00 00 00 0f 04 92 03 00 00
# 3920410 01 00 00 00 00 00 00 00 05 73 74 61 74 64 00 0d |.........statd..|
# 3920420 73 74 61 74 64 5f 68 69 73 74 6f 72 79 00 04 03 |statd.history...|
# 3920430 03 0c 05 01 08 08 |......|
# Table_map: `statd`.`statd_history` mapped to number 1
[...some lines deleted...]
#141114 10:21:13 server id 2101898466 end_log_pos 59901219
# Position Timestamp Type Master ID Size Master Pos Flags
# 392050e 99 d7 65 54 02 e2 6c 48 7d 3c 00 00 00 23 05 92 03 00 00
# 3920521 04 bc b8 0f 00 00 00 00 00 00 00 15 00 00 00 00 |................|
# 3920531 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# 3920541 21 00 00 43 4f 4d 4d 49 54 |...COMMIT|
After this bit of the hex dump, mysqlbinlog helpfully parses out the decoded writes into something more readable:
BINLOG '
mddlVBPibEh9OQAAAA8EkgMAAAEAAAAAAAAABXN0YXRkAA1zdGF0ZF9oaXN0b3J5AAQDAwwFAQgI
mddlVBfibEh9NgAAAEUEkgMAAAEAAAAAAAAABA/wbQAAAN8vAAD1kPN3URIAAAAAoGmPXOVB
### INSERT INTO statd.statd_history
### SET
### @1=109
### @2=12255
### @3=2014-11-14 10:20:05
### @4=2867100493
mddlVBfibEh9NgAAAHsEkgMAAAEAAAAAAAAABA/wbgAAAKhBAAD9jvN3URIAAAAAAAAAAAAA
### INSERT INTO statd.statd_history
### SET
### @1=110
### @2=16808
### @3=2014-11-14 10:15:01
### @4=0
mddlVBfibEh9NgAAALEEkgMAAAEAAAAAAAAABA/wfQAAAPIAAADxkPN3URIAAAAAAMCJ9lJB
### INSERT INTO statd.statd_history
### SET
### @1=125
### @2=242
### @3=2014-11-14 10:20:01
### @4=4971047
mddlVBfibEh9NgAAAOcEkgMAAAEAAAAAAAAABA/wbQAAAOEvAAD1kPN3URIAAAAAAAAAAAAA
### INSERT INTO statd.statd_history
### SET
### @1=109
### @2=12257
### @3=2014-11-14 10:20:05
### @4=0
mddlVALibEh9PAAAACMFkgMAAAS8uA8AAAAAAAAAFQAAAAAAAAEAAAgAAAAAAAQhACEAIQAAQ09N
TUlU
'/*!*/;
Start of the next transaction as denoted by the BEGIN:
# at 59901258
#141114 10:21:13 server id 2101898466 end_log_pos 59901278
# Position Timestamp Type Master ID Size Master Pos Flags
# 392054a 99 d7 65 54 02 e2 6c 48 7d 3b 00 00 00 5e 05 92 03 00 00
# 392055d 01 44 a7 14 00 00 00 00 00 00 00 15 00 00 00 00 |.D..............|
# 392056d 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# 392055d 01 44 a7 14 00 00 00 00 00 00 00 15 00 00 00 00 |.D..............|
# 392056d 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# 392057d 21 00 00 42 45 47 49 4e 00 00 00 04 21 00 21 00 |...BEGIN|
The one minor issue that we've found with mysqlbinlog is that it sometimes doesn't start at the correct starting log position, but it isn't too far from starting log position.
Arity Example
Let's say you got this slave stopped error:
*************************** 1. row ***************************
Slave_Name: default
Slave_Status: Errored
Master_Host: 10.0.0.10
Master_Port: 3306
Master_User: repl
Master_Log_File: clx-binlog
Slave_Enabled: Enabled
Log_File_Seq: 5
Log_File_Pos: 1000000
Last_Error: Row Type Mismatch: Column arity mismatch between slave and master for the table test.test_tbl
Connection_Status: Disconnected
Relay_Log_Bytes_Read: 0
Relay_Log_Current_Bytes: 0
Seconds_Behind_Master: NULL
On one of the nodes on the master cluster:
cd /clustrix # or wherever you have the most available space
mysqlbinlog -u repl -h 127.0.0.1 -p'xxxxxxxxx' --verbose --base64-output=ALWAYS --hexdump --start-position=1000000 -R clx-binlog.000005 > mysqlbinlog.output
or alternatively you can use
repclient -logname clx-binlog.000005 -logpos 1000000 -count 100 -verbose > repclient.output?
Here's a snippet of the mysqlbinlog extract file:
#141022 8:55:17 server id 193994141 end_log_pos 1003393
# Position Timestamp Type Master ID Size Master Pos Flags
# f4f6b f5 70 47 54 02 9d 1d 90 0b 3c 00 00 00 81 4f 0f 00 00 00
# f4f7e 04 fc 96 c2 00 00 00 00 00 00 00 15 00 00 00 00 |................|
# f4f8e 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# f4f9e 21 00 00 43 4f 4d 4d 49 54 |...COMMIT|
#
BINLOG '
[...some output that can be ignored...]
### UPDATE test.TEST_TBL
### WHERE
### @1=15871698
### @2=1
### @3=3109804
### @4='Test'
### @5='User'
### @6=0
### @7=0
### @8=3336901
### @9=3128759
### @10=0
### @11=1413951934
### SET
### @1=15871698
### @2=1
### @3=3109804
### @4='Test'
### @5='User2'
### @6=1
### @7=1
### @8=3336901
### @9=3128759
### @10=0
### @11=1413968117
[...some output that can be ignored...]
'/*!*/;
# at 1003431
#141022 8:55:17 server id 193994141 end_log_pos 1003452
# Position Timestamp Type Master ID Size Master Pos Flags
# f4fa7 f5 70 47 54 02 9d 1d 90 0b 3b 00 00 00 bc 4f 0f 00 00 00
# f4fba 09 48 dd b8 00 00 00 00 00 00 00 15 00 00 00 00 |.H..............|
# f4fca 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# f4fda 21 00 00 42 45 47 49 4e 00 00 00 04 21 00 21 00 |...BEGIN|
Looking at the sample output above, we can glean the following details:
Event was processed at 08:55:17 local time
One single update write was processed as the next event starts with a BEGIN.
test.TEST_TBL was being updated with columns 5, 6, 7 and 11 being modified as they're the only differences.
The fields denoted as @1, @2, @3...@11 match the order of the columns from show create table.
If this was the event that caused the slave to stop due to an arity error, it will be because the number of columns on the slave table doesn't match the number of columns in the write event. We can then verify this by looking at the DDL for the TEST_TBL.
To resolve this issue, you would want to contact support to help determine whether there are other writes with the same number of columns and whether these statements are in an explicit transaction. We would then want to stop the slave and manually apply the change(s) on the slave and skip the event(s).
Slave Lag from Large Write Events
If your slave is continuing to lag and the log position isn't growing, you can extract the event from the binlogs.
If your binlog extract has an event that has a long list of log positions, it indicates that this event is a large transaction:
# 3333333# 3333432# 3334023# 3336023...# 12312341...# 23423233
If you get stuck with trying to scroll through these, you can search for " Position " and that should take you to the next section of the output.
Eventually, you'll find the TABLE mapping that identifies the table that the event is writing to and whether it's an UPDATE, DELETE or INSERT. You can then go back to the master to find the actual query used. If the query took a bit of time to execute, then it should be logged in the query.log. Otherwise, you should be able to find it in the master's clustrix_statd.qpc_history.
Specifying a starting log position that is in the middle of an event
If you get an error like this in your binlog extract output:
Could not decode row data: Generic Error: table_id 1 not found
this is because you've specified a starting log position that isn't at the start of the event and that the utility needs the table mapping info. It may take several attempts if your starting log position is in the middle of a large transaction as you move to an earlier starting log position, but if you use the log position from a stopped slave event, you shouldn't run into this.
Reptool
Reptool is a new utlity that adds additional functionality and can decode the event in a more readable output. It will be available in a future release.
View ArticleIn ClustrixDB the sizes of all tables is contained in the system.table_sizes table.
mysql> desc system.table_sizes;
+----------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+--------------+------+-----+---------+-------+
| Database | varchar(256) | NO | | NULL | |
| Table | varchar(256) | NO | | NULL | |
| Bytes | double | NO | | NULL | |
+----------+--------------+------+-----+---------+-------+
3 rows in set (0.01 sec)
To get the size of a single database use this query
SELECT Database, SUM(bytes)/1024/1024/1024 AS 'Size in GB' FROM system.table_sizes WHERE `Database`='<db_name>';
Similarly, you can get the size of a table with the following query:
SELECT `Table`, SUM(bytes) /1024/1024/1024 as 'Size in GB' FROM system.table_sizes WHERE `Database`='<db_name>' AND `Table`='<table_name>';
When calculating database and table sizes for Clustrix Fast Backup and Restore you should divide the above values by the number of replicas (usually 2), as the backup does not include replicas.
If you're looking attable sizes for re-slicing operations it can be helpful to know the size of the base representation and the indexes separately. You can use a query like the following to obtain this information:
select `Database`,`Table`,`Index`,Bytes/1024/1024/1024 as size_gb from system.index_sizes where `database` = '<db_name>' and `Table`='<table_name>';
View ArticleHotness History
Although we have a useful tool called hotness_history that shows you what the current hot tables are, there is a table that you can query to look at hot tables over a longer period of time. This is the CLUSTRIX_STATD.HOTNESS_HISTORY table.
mysql> show create table hotness_history \G
*************************** 1. row ***************************
Table: hotness_history
Create Table: CREATE TABLE `hotness_history` (
`timestamp` timestamp not null default current_timestamp,
`read_rank` int(11) unsigned,
`write_rank` int(11) unsigned,
`node` int(11) unsigned not null,
`database` varchar(256) CHARACTER SET utf8 not null,
`table` varchar(256) CHARACTER SET utf8 not null,
`index` varchar(256) CHARACTER SET utf8 not null,
`reads` int(11) unsigned,
`writes` int(11) unsigned,
`replicas` int(11) unsigned,
`ranked_replicas` int(11) unsigned,
`total_replicas` int(11) unsigned,
PRIMARY KEY (`timestamp`,`node`,`database`,`table`,`index`) /*$ DISTRIBUTE=5 */,
KEY `timestamp_2` (`timestamp`,`write_rank`) /*$ DISTRIBUTE=2 */,
KEY `timestamp` (`timestamp`,`read_rank`) /*$ DISTRIBUTE=2 */
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=4 CONTAINER=btree */
1 row in set (0.00 sec)
The key fields are:
timestamp: statistics for that table or index for the last 5 minutes.
read_rank: how hot the table or index was for reads during the time.
write_rank: how hot the table or index was for writes during the time.
table: table that is hot.
index: this could be the index or could be the base representation. Base reps are suffixed with _PRIMARY.
replicas: number of replicas for that table or index. Generally, the number of slices should be equal or greater than the number of nodes so that data is spread out among all the nodes.
The cluster ranks each table or index with a read_rank and a write_rank for the duration of the stats window. A table or index that has a high rank would have had more reads or writes than the other representations.
To find the hot reads tables based on average read_rank for the last 24 hours, you can use:
SELECT Round(Avg(read_rank)) AS read_rank,
Round(Avg(write_rank)) AS write_rank,
Sum(reads) AS reads,
Sum(writes) AS writes,
DATABASE,
`table`,
`index`,
total_replicas
FROM clustrix_statd.hotness_history
WHERE timestamp >now() - INTERVAL 1 day
AND DATABASE NOT IN ('clustrix_statd', '_replication', 'clustrix_ui', 'system')
AND read_rank IS NOT NULL
GROUP BY `table`, `index`
ORDER BY read_rank ASC, write_rank ASC LIMIT 100;
Similarly, you can do this for hot writes based on average write_rank:
SELECT Round(Avg(read_rank)) AS read_rank,
Round(Avg(write_rank)) AS write_rank,
Sum(reads) AS reads,
Sum(writes) AS writes,
DATABASE,
`table`,
`index`,
total_replicas
FROM clustrix_statd.hotness_history
WHERE timestamp >now() - INTERVAL 1 day
AND DATABASE NOT IN ('clustrix_statd', '_replication', 'clustrix_dbi', 'system')
AND write_rank IS NOT NULL
GROUP BY `table`, `index`
ORDER BY write_rank ASC, read_rank ASC;
A variation to these queries is to look at the total reads and writes over the duration:
Hot reads by sum of reads:
SELECT Round(Avg(read_rank)) AS read_rank,
Round(Avg(write_rank)) AS write_rank,
Sum(reads) AS reads,
Sum(writes) AS writes,
DATABASE,
`table`,
`index`,
total_replicas
FROM clustrix_statd.hotness_history
WHERE timestamp >now() - INTERVAL 1 day
AND DATABASE NOT IN ('clustrix_statd', '_replication', 'clustrix_dbi', 'system')
AND read_rank IS NOT NULL
GROUP BY `table`, `index`
ORDER BY 3 DESC, 4 DESC LIMIT 100;
Hot writes by sum of writes:
SELECT Round(Avg(read_rank)) AS read_rank,
Round(Avg(write_rank)) AS write_rank,
Sum(reads) AS reads,
Sum(writes) AS writes,
DATABASE,
`table`,
`index`,
total_replicas
FROM clustrix_statd.hotness_history
WHERE timestamp >now() - INTERVAL 1 day
AND DATABASE NOT IN ('clustrix_statd', '_replication', 'clustrix_dbi', 'system')
AND write_rank IS NOT NULL
GROUP BY `table`, `index`
ORDER BY 4 DESC, 3 DESC LIMIT 100;
If the table is hot and the number of replicas is less than the two times the number of nodes, it may result in a load imbalance. You may want to consider re-slicing the table.
A hot table could indicate a poorly optimized query that is missing indexes and may warrant further investigation.
View ArticleAn MVCC conflict error will arise when overlapping transactions create a conflict that can only be resolved by killing one of the two transactions. The best example of this is when two transactions try to modify the same row(s). Overlapping transactions affecting different rows of the same table will not cause an MVCC error. It is best practicesto insert retry logic for errors of this type into the application.
For most workloads, the primary reason for seeing the "Container MVCC conflict" error is that a later transaction hascommitted work in the same place. The most common reason for this isan INSERT statement, which does not acquire locks to do work, hascompleted a transaction while the earlier statement prepares tocommit.
The "MVCC serializable scheduler conflict" error, on the other hand, arises when a user transaction conflicts with some system transaction (only system transactions use serializable isolation), such as a rebalancer action. This error can also result from writing to a relation duringan ALTER. In ClustrixDB 9 and later, this error is reported to the user as "Transaction started writing to queued container after queue was created".
To learnmore about MVCC, please see our documentation: Concurrency Control
View ArticleUsing QPC_HISTORY
Clustrix collects statistics on not just cluster health, but also on queries and ranks them based on how much CPU the query takes, rows read, rows written, execution count and execution time.
Sometimes you just want to see what are your heavy queries are or perhaps trying to figure out a slowdown in performance without using the UI (the UI queries the same CLUSTRIX_STATD.QPC_HISTORY table).
You could parse out the query logs as described in How to Read the Query Logs, but sometimes you want something quick and easy. You can query CLUSTRIX_STATD.QPC_HISTORY.
Main things to keep in mind regarding stats in QPC_HISTORY:
Stats are collected every five minutes and inserted into the qpc_history table.
Stats in each entry are collected from the previous five minutes of the timestamp.
Stats older than 24 hours are rolled into one hour entries where is_rollup is then set to 1 and all the stats in that entry are from the previous hour of the timestamp.
Rank is based on how much of the CPU the query takes, rows read, rows written, execution count and execution time. The higher the rank, the more resources that particular query took.
Stats older than 7 days will get purged.
Stats with rank > 100 will get purged after 24 hours.
Sometimes, the same query will have different query keys because the query itself was recompiled due to staleness or due to changes to the probability distribution.
mysql> describe clustrix_statd.qpc_history;
+-------------------+---------------------+------+-----+-------------------+
| Field | Type | Null | Key | Default |
+-------------------+---------------------+------+-----+-------------------+
| query_key | bigint(20) unsigned | NO | PRI | NULL |
| timestamp | timestamp | NO | PRI | current_timestamp |
| database | varchar(256) | YES | | NULL |
| statement | varchar(8196) | YES | | NULL |
| exec_count | int(11) unsigned | YES | | NULL |
| exec_ms | bigint(20) unsigned | YES | | NULL |
| exec_us | bigint(20) unsigned | YES | | NULL |
| avg_exec_ms | double | YES | | NULL |
| avg_exec_us | double | YES | | NULL |
| min_exec_ms | double | YES | | NULL |
| max_exec_ms | double | YES | | NULL |
| compile_count | int(11) unsigned | YES | | NULL |
| compile_ms | bigint(20) unsigned | YES | | NULL |
| rows_read | bigint(20) unsigned | YES | | NULL |
| avg_rows_read | double | YES | | NULL |
| rows_written | bigint(20) unsigned | YES | | NULL |
| avg_rows_written | double | YES | | NULL |
| rows_output | bigint(20) unsigned | YES | | NULL |
| avg_rows_output | double | YES | | NULL |
| rank | int(11) unsigned | YES | | NULL |
| is_rollup | tinyint(4) | NO | PRI | 0 |
| rows_inserted | bigint(20) unsigned | YES | | NULL |
| avg_rows_inserted | double | YES | | NULL |
| rows_replaced | bigint(20) unsigned | YES | | NULL |
| avg_rows_replaced | double | YES | | NULL |
| rows_deleted | bigint(20) unsigned | YES | | NULL |
| avg_rows_deleted | double | YES | | NULL |
+-------------------+---------------------+------+-----+-------------------+
27 rows in set (0.01 sec)
Main columns of interest are:
timestamp: contains stats from the previous 5 minutes, unless it's rolled up after 24 hours where the entry would have stats from the previous hour.
exec_count: total number of executions for this query during the previous stats window
exec_ms: total time in ms the query took during the period
avg_exec_ms: average execution time for the query. Queries with long avg_exec_ms times are potential candidates for optimization.
max_exec_ms: max execution time...may require further investigation if the max_exec_ms exceeds avg_exec_ms by a significant margin. May indicate resource contention or some blocking issue.
rows_read: total number of rows read for the stats window.
avg_rows_read: average rows read per query execution. More rows read, the more the cluster has to work. Likely candidate for query optimization if query needs to read a lot of rows.
rank: useful to find top ranked queries.
is_rollup: stats older than 24 hours will get rolled into an hourly entry.
Find top queries from the last 24 hours by execution counts (may indicate a misconfigured application or some test script that isn't working as expected):
SELECT query_key,
LEFT(statement, 100),
Sum(exec_count) AS exec_count,
Round(Avg(avg_rows_read)) AS avg_RR,
Round(Avg(avg_exec_ms)) AS avg_exec_ms
FROM clustrix_statd.qpc_history
WHERE timestamp BETWEEN ( Now() - INTERVAL 24 hour ) AND Now()
AND DATABASE != 'clustrix_statd'
GROUP BY query_key
ORDER BY exec_count DESC,
avg_rr DESC LIMIT 100;
Find top queries from the last 24 hours by rows read (may indicate a poorly optimized query):
SELECT query_key,
LEFT(statement, 100),
Sum(exec_count) AS exec_count,
Round(Avg(avg_rows_read)) AS avg_RR,
Round(Avg(avg_exec_ms)) AS avg_exec_ms
FROM clustrix_statd.qpc_history
WHERE timestamp BETWEEN ( Now() - INTERVAL 24 hour ) AND Now()
AND DATABASE != 'clustrix_statd'
GROUP BY query_key
ORDER BY avg_rr DESC,
exec_count DESC LIMIT 100;
Find top queries by total exec time (may indicate a complex query that takes a lot of time to execute (in conjunction with high rows read)):
SELECT query_key,
LEFT(statement, 100),
Sum(exec_count) AS exec_count,
Round(Avg(avg_rows_read)) AS avg_RR,
Round(Avg(avg_exec_ms)) AS avg_exec_ms
FROM clustrix_statd.qpc_history
WHERE timestamp BETWEEN ( Now() - INTERVAL 24 hour ) AND Now()
AND DATABASE != 'clustrix_statd'
GROUP BY query_key
ORDER BY avg_exec_ms DESC,
exec_count DESC LIMIT 100;
Look at last 30 top 3 ranked queries:
SELECT timestamp,
DATABASE,
rank,
LEFT(statement,100),
exec_count,
avg_rows_read,
avg_exec_ms
FROM clustrix_statd.qpc_history
WHERE rank <4 AND DATABASE !='clustrix_statd' ORDER BY timestamp DESC, rank ASC LIMIT 30;
Some Tips
Narrow down the timestamp period if you have an idea of when an issue occurred.
Specify the table name in the where clause with the like predicate: where statement like '%table_name%' if you know which table may be the culprit to your problem.
Look at rows_written if you're interested in write queries.
If a query seems to be running slower than normal, compare the avg_exec_ms of that query from a time when it seems to be running fine and see what else might be running around the same time frame. Could indicate a resource contention (other large running queries or blocking queries). Looking at clustrix_statd.QPC_HISTORY in conjunction with the query logs would help.
View ArticleEach query log entry is made up of the following bits of information:
[timestamp] [nodename] [process name] [category] [SID] [db] [user] [ac] [xid] [sql] [status] [execution time] [reads] [inserts] [deletes] [updates] [counts] [rows_read] [forwards] [broadcasts] [rows_output] [semaphore_matches] [fragment_executions] [runtime_ns] [cpu_waits] [cpu_waittime_ns] [bm_fixes] [bm_loads] [bm_waittime_ns] [lockman_waits] [lockman_waittime_ms] [trxstate_waits] [trxstate_waittime_ms] [wal_perm_waittime_ms] [bm_perm_waittime_ms] [sigmas] [sigma_fallbacks] [row_count] [found_rows] [insert_id] [fanout] [attempts]
timestamp: Includes the timezone. It is extremely important to have the clocks in sync on all the nodes.
nodename: Name of the node where the entry was logged.
process name: The Clustrix process name, "clxnode". Will be "real" on the legacy appliance.
category: Type of entry in the log. SLOW, DDL, BAD or SQLERR.
sid: session ID. Useful when trying to identify a particular session with the queries that were logged.
db: database name the query was run from.
user: User executing the query. If you're using statement-based replication, you may want to search for the replication account when troubleshooting statements from the master.
ac: auto-commit. This is useful to determine whether the query was in an explicit transaction. Note that DDLs will all have ac=N even if the DDLs are not in an explicit transaction.
xid: transaction ID. Useful to link a session to an XID when troubleshooting locking issues.
sql: full query, unless the query exceeds the 4KB statement buffer; the logger will then insert ellipses in the middle of the query and remove part of the query to fit the 4KB limit.
status: The result of the query like rows affected or error message.
time: total time from when the query was received, compiled, processed and output returned. This is what we look at for SLOW queries.
reads: Number of calls by the query execution engine to operators CONTAINER_READ and CONTAINER_MULTI_READ. The number of times the database reads from a container (but not the number of rows read).
inserts: Number of calls by the query execution engine to operators CONTAINER_INSERT and CONTAINER_INSREP. The number of times the database inserts to a container. This includes both the number of calls, and the number of rows written.
deletes: Number of calls by the query execution engine to operator CONTAINER_REMOVE. The number of times the database deletes from a container. This includes both the number of calls, and the number of rows deleted.
updates: Number of calls by the query execution engine to operators CONTAINER_UPDATE and CONTAINER_OVERWRITE. The number of times the database updates a container. This includes both the number of calls, and the number of rows updated.
counts: Number of calls by the query execution engine to operators BARRIER_ADD and BARRIER_FETCHADD.
rows_read: Rows read by the query execution engine from operators CONTAINER_READ, CONTAINER_MULTI_READ, and MERGE_SORT_READ. TheTotal number of rows read to get all needed data for the query, includes reads from indexes.
forwards: Number of rows forwarded to specific nodes.
broadcasts: Number of rows broadcast to all nodes.
rows_output: Number of calls by the query execution engine to operator OUTPUT_START. TheTotal number of rows returned.
semaphore_matches: Number of calls by the query execution engine to operator SEM_ACQUIRE.
runtime_ns: Nanoseconds the database spent executing bytecode. This represents the aggregate total CPU time spent by all nodes to run the query.
cpu_wait: Amount of time spent waiting on the CPU
cpu_waittime_ns: Nanoseconds that CPU spent executing query
bm_fixes: Number of attempted page fixes by the Buffer Manager.
bm_loads: Number of pages loaded from disk by the Buffer Manager.
bm_waittime_ns: Nanoseconds spent blocked on Buffer Manager page fixes.
lockman_waits: Number of calls by the query execution engine to operators LOCK_SHARED and LOCK_EXCLUSIVE that had to block. The number of times that the query had to wait for a lock to be released by another query.
lockman_waittime_ms: Milliseconds spent blocked during VM operators LOCK_SHARED and LOCK_EXCLUSIVE. The total time spent waiting for other queries to release locks on needed rows.
trxstate_waits: Number of calls to trxstate_check that had to block.
trxstate_waittime_ms: Milliseconds spent blocked in trxstate_check.
wal_perm_waittime_ms: Milliseconds spent blocked in WALS waiting for thecheckpoint, because WAL was too full.
bm_perm_waittime_ms: Milliseconds spent waiting for the Buffer Manager to grant write permission for pages.
sigmas: Count of sigma containers used by the query. Sigmas are used to store intermediate and/or result rows for some queries.
sigma_fallbacks: Number of sigmas that did not fit in memory and had to fall back to disk.
row_count: Rows affected by the query
found_rows: Rows returned by the query
fanout: Indicates whether fanout was used by the query.
attempts: Indicates how many times this query was attempted.
New in8.x. The following will only show when they are greater than 0ms:
translate:time spent in translate_dml()
prefetch:time spent building the sierra stub (includes fetching pds)
plan:time spent in normalize/plan
compile:time spent in compiling sierra to til
execution:time spent in invocation
View ArticleReplication Slave Lags
With replication, you may experience occasional slave lag, where the slave is processing events are some seconds behind when the master originally executed them. Often, the slave will catch up on its own, but in cases where it doesn't or if you want to know what might have caused the lag, the following may help with troubleshooting slave lags.
Common Causes of Slave Lags
Some common causes of replication slave lags include:
CPU load: cluster is to busy to process replication events in a timely manner.
missing indexes: no primary key on the table that is being updated or if SBR, not having the correct indexes.
lock wait timeouts: lock wait timeouts caused by uncommitted transaction on the slave.
general volume of writes: increased volume of writes perhaps from growth in site traffic or data migrations with bulk delete or update.
data consistency: frequent slave stops and starts due to data consistency issues (duplicate keys or row not found).
long running aggregate queries (SBR): aggregate queries in SBR taking a long time to process resulting in a backlog of queries awaiting to be processed.
data distribution: target table not evenly distributed on all the nodes.
network issues: self explanatory.
Before we jump into the different causes of slave lags, we should set some expectations and review some replication basics.
Expectations
Processing write events will be slower on the slave than on the master because replication is a single threaded process while on the master, write queries can be processed in parallel.
For most customers, a single binlog is more than adequate for their replication needs, but in write-heavy environments, having multiple binlogs may be a viable option.
See our documentation: Configuring Replication
Types of Replication
The two types of replication are:
SBR (Statement-based replication): good for bulk writes, simpler to troubleshoot.
RBR (Row-based replication): good for small writes where the slave processes the row changes rather than processing the query, generally more efficient.
To determine what type of replication your master cluster is running, you can run:
show binlogs\G
The format value will tell you what type of replication you're running for that particular binlog:
mysql> show binlogs\G
*************************** 1. row ***************************
log_name: test
head: 6257570335613048834
log_all_dbs: 1
format: row
version: 2
segments: 1
bytes: 524288
1 row in set (0.08 sec)
Knowing the type of replication is useful as there are different tables and stats that are associated with each replication type (covered later). It also means that if you're running RBR, the queries that generated the writes aren't logged in the query logs of the slave so you would need to look on the master to identify problematic queries.
Query Logs
Reviewing query and clustrix log entries will be important in helping you identify why the slave is lagging. You'll need to follow the steps in the documentation How To Read the Query Logs to extract the log entries.
Note that when troubleshooting slave lags, you can usually ignore long running SELECT statements since those read queries are not written to the binlogs and replicated to the slave. But do keep in mind that the master or slave could be hit hard by poorly performing SELECT queries. Don't automatically eliminate the SELECT queries as a culprit.
Generally for RBR, you would want to look for long running write queries or for write queries that affect a large number of rows on the master as a potential cause of slave lags.
For SBR, you should look at the query logs on the slave cluster to see what SLOW writes are coming across.
Clustrix Logs
Slave issues are logged in the clustrix logs on the node that the slave process is running on. You can find out where the slave is running from with:
SELECT slave_name,
master_host,
nodeid
FROM system.mysql_repconfig;
Note that the slave process isn't always bound to the same node as the slave process can move after a group change.
The log entries will note when the slave stopped and usually why it stopped along with the last successful log position and log file that it had read. The cluster will also log when the slave restarted and what log position it was requesting.
Here are some sample clustrix.log entries:
Aug 6 16:46:54.164643 UTC 2014 clxnode002 real: ALERT SLAVE_STOP WARNING Stopped mysql slave clxslave (10.0.1.10:3306) on non-transient error: Duplicate key in container: test.__idx_email__PRIMARY ([email protected]). Statement: COMMIT, Database: (master_log.015688, position 89420179)
Aug 6 16:47:02.750361 UTC 2014 clxnode002 real: INFO mysql/replication/slave_driver.ct:112 mysql_slave_sig_migrated(): User stopping mysql slave "clxslave"
Aug 6 16:47:04.096524 UTC 2014 clxnode002 real: INFO mysql/replication/slave_driver.ct:105 mysql_slave_sig_migrated(): User starting mysql slave "clxslave"
Aug 6 16:47:04.128394 UTC 2014 clxnode002 real: INFO mysql/replication/slave.ct:804 mysql_slave_initialized(): mysql slave "clxslave" writing to binlogs per system.mysql_slave_log_updates
Aug 6 16:47:04.128844 UTC 2014 clxnode002 real: INFO mysql/replication/slave.ct:723 mysql_slave_session(): mysql slave "clxslave" connecting to 10.0.1.10:3306
Aug 6 16:47:04.128884 UTC 2014 clxnode002 real: INFO mysql/replication/slave.ct:690 mysql_slave_connected(): mysql slave "clxslave" opened relay log for 10.0.1.10:3306 (requesting "master_log.015688, position 89419912")
Aug 6 16:47:04.549432 UTC 2014 chndb002 real: INFO mysql/replication/slave_trx.c:129 check_skip_evt(): slave clxslave (10.0.1.10:3306) (master_log.015688, position 89420179) skipped transaction (remaining 0): t_mysql_query
To familiarize yourself with how to read the clustrixlog entries, we'll pick out some details from the sample log entries above:
The slave stopped when reading the event at log position 89420179because of a duplicate key in table test, namely, a duplicate key on the primary key of the table "test" with the value of "[email protected]".
The binlog file that it was reading from is master_log.015688 and the last successful log position for the replication event is at 89419912.
The slave process then automatically stops and restarts the slave after skipping one event.
Note that the slave skipping behavior is configurable. See our documentation: Controlling Slave Behavior on Errors
You'll need the log position and binlog name as a starting point for extracting binlog details to identify what is coming across the replication stream. See ourFAQ Extracting binlog data for Troubleshooting forhow to extract binlog events.
Slave Status
The first thing to check is to determine whether the slave is processing events from the replication stream. You can get the slave status with:
show slave status\G
The thing to note is whether the log position is "moving" and whether the relay log is growing. If it is, you know that the slave is processing the replication stream and that its getting events from the master.
Also pay attention to the last error. If it says duplicate key or lock wait timeout, then the issue is obvious and go to the appropriate sections below. Otherwise, you will need to do some further troubleshooting.
CPU Load
If a cluster is busy either on the master or slave, replication can be affected.
A heavily loaded master may not be able to feed the slave fast enough while a heavily loaded slave (in a master-master setup) may not be able to process the replicated events in a timely manner. To see the load and what the load trend has been, you can run the following commands on the cluster:
mysql> show load;
+--------------+------+
| facility | LOAD |
+--------------+------+
| bm_miss_rate | 0 |
| cpu | 0.4 |
| disk_util | 2.8 |
+--------------+------+
3 rows in set (0.19 sec)
mysql> SELECT * FROM clustrix_statd.statd_history NATURAL JOIN clustrix_statd.statd_metadata WHERE name='clustrix.cpu.load_avg' ORDER BY TIMESTAMP DESC LIMIT 10;
+-----+---------------------+----------+-----------------------+
| id | timestamp | value | name |
+-----+---------------------+----------+-----------------------+
| 643 | 2016-03-01 21:40:00 | 0.1207 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:35:00 | 0.124875 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:30:00 | 0.117828 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:25:00 | 0.106306 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:20:00 | 0.124611 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:15:01 | 0.104327 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:10:00 | 0.113514 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:05:00 | 0.115097 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:00:00 | 0.107797 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 20:55:00 | 0.113899 | clustrix.cpu.load_avg |
+-----+---------------------+----------+-----------------------+
10 rows in set (0.00 sec)
If the CPU load average is higher than normal, you'll need to identify what's causing the heavy load. This could be due to:
volume of queries
long running queries (perhaps analytics or queries that generate lots of writes)
poorly optimized queries (queries not using the correct indexes or the tables are missing indexes).
You might have fast writes that are being processed in a timely manner on the master, but due to the volume of writes, the slave starts to lag. The queries themselves may not get logged in the query log because the execution times for them don't exceed the slow logging threshold; however, you can query the clustrix_statd.qpc_history table to see how frequent your updates, deletes or inserts may be running with the following query on the master (amend timestamp and statement accordingly):
SELECT TIMESTAMP,
STATEMENT,
sum(exec_count),
avg(avg_exec_ms)
FROM clustrix_statd.qpc_history
WHERE rank < 10
AND TIMESTAMP > '2014-10-10 12:40:00'
AND STATEMENT LIKE 'update%'
GROUP BY query_key
ORDER BY TIMESTAMP ASC LIMIT 3 \G
If it shows lots of updates during the stats period, then it's a likely culprit. For example, 200,000 executions of a simple update query in 15 minutes may not get logged in the query logs, but you should be able to catch this by looking at qpc_history.
You can also use the query log entries on the master to identify bulk writes that may be causing the slave to lag. Although 130000 rows updated doesn't sound like much, when there are 30 updates in a very short amount of time, it can slow down the slave:
Aug 4 21:22:35.449849 UTC 2014 clxmaster003 real: INSTR SLOW SID:3780625411 db=test [email protected] ac=Y xid=53e0fd9420ba4006 sql="UPDATE bigtable SET TIME_UPDATED=now(),DELETED=0 WHERE ID=371" [Ok: 138756 rows updated] time 27487.7ms; reads: 277513; inserts: 416268; deletes: 0; updates: 277512; counts: 138756; rows_read: 416268; forwards: 832543; rows_output: 2; semaphore_matches: 0; runtime: 3735902785; attempts: 1
Note the number of rows updated in the square brackets.
For potential long running queries, you can usually find them with the following query (depending on your version of clustrix) and see how long the query has been in that state:
SELECT *
FROM system.sessions
WHERE statement_state='executing'
AND time_in_state_s>1
ORDER BY time_in_state_s DESC LIMIT 10 \G
If the same type of query is showing up multiple times and are taking a long time to process, it could indicate a missing index or a lock wait timeout (the latter would show up in the query logs).
Indexing
To determine whether a query is missing an index and how to troubleshoot this is a whole other topic, but we'll briefly cover one.
One RBR scenario is that the slave lag continues to grow, relay log is changing, but the log position isn't moving. This may be caused by a bulk delete or update where there isn't a primary key on the target table even though there might be a corresponding index. You should check the create table definition for that table that is being written to on the slave and determine whether it has a primary key or not.
For example, if this query was processed on the master and it had deleted a million rows:
delete from foo where start_date<'2014-01-01'
and the DDL for the table has:
mysql> show create table foo \G
*************************** 1. row ***************************
Table: foo
Create Table: CREATE TABLE `foo` (
`id` bigint(16),
`bar` bigint(16),
`start_date` date,
`end_date` date,
KEY `start_date` (`start_date`) /*$ DISTRIBUTE=1 */
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=1 */
The solution here would be to not only have the primary key in the table, but to set the query to be replicated across as a statement based query if your replication is by default RBR. The other thing to do here may be to include a limit to reduce the write load:
alter table foo add primary key (id);
set session binlog_format='statement'
delete from foo where statt_date<'2014-01-01' limit 100000;
### repeat until all rows are deleted ###
If you're running SBR, the query being processed could be poorly optimized or the table doesn't have the correct indexes. You'll need to identify the offending write(s) from the master by looking at the query logs or qpc_history on the master and see if you can find any writes that are slow to execute.
You can then look at the actual query and see whether the query has the correct search arguments to use the indexes that are in the table definition.
Lock Wait Timeouts
You might get the following "lock wait timeout exceeded" error for the slave process in your sprout or clustrix log:
Jul 17 18:07:38.393608 UTC 2014 clxnode002 real: INFO mysql/replication/slave_state.c:175 mysql_slave_state_reconnect(): Stopped mysql slave clxslave (10.1.0.10:3306) on transient error, scheduling retry: Lock wait timeout exceeded: . Statement: COMMIT, Database: (master_log.014723, position 5397030)
or you might see this "Lock wait timeout exceeded" in the Last_Error field when you run the show slave status command:
mysql> show slave status \G
*************************** 1. row ***************************
Slave_Name: clxslave
Slave_Status: Reconnecting
Master_Host: 10.1.0.10
Master_Port: 3306
Master_User: replication
Master_Log_File: master_log
Slave_Enabled: Enabled
Log_File_Seq: 14723
Log_File_Pos: 5397030
Last_Error: Lock wait timeout exceeded: . Statement: COMMIT, Database:
Connection_Status: Disconnected
Relay_Log_Bytes_Read: 0
Relay_Log_Current_Size: 0
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
This is occurring because the slave event is trying to acquire a lock to update a row or table that is currently be locked by another process that hasn't committed. After five minutes (default), it generates this error if it can't acquire a lock.
You should be able to find the offending transaction if it's still uncommitted with:
SELECT *
FROM system.sessions
WHERE xid IN
(SELECT holder
FROM system.lockman
ORDER BY holder ASC LIMIT 1)\G
Here's a sample of what the output might look like:
mysql> select * from system.sessions where xid in (select holder from system.lockman order by holder asc limit 1) \G*************************** 1. row *************************** nodeid: 5 session_id: 1977584645
source_ip: 10.1.1.100
source_port: 13574
local_ip: 10.1.0.10
local_port: 3306
user: 5711972934625828866
database: prod
trx_state: open
statement_state: executing
xid: 6037483960283521034
isolation: REPEATABLE-READ
last_statement: UPDATE foo set bar=1
time_in_state: 34
created: 2014-07-17 10:27:48
trx_age: 1517
trx_mode: explicit
trx_counter_select: 3
trx_counter_insert: 0
trx_counter_update: 42
trx_counter_delete: 0
trx_is_writer: 1
1 row in set (0.07 sec)
In this example, the trx_mode is explicit which tells us that this transaction is not set to auto-commit. You can then notify the user to commit the transaction or you can kill the session.
Slave Process Stuck
In rare cases, the slave process may get stuck where the slave stopped due to an error and when it restarted, the previous slave still hadn't terminated.
This is then manifested on the slave cluster with the following symptoms:
slave lag increasing
bigc growing
undo space growing
relay log doesn't grow
stats in the system.transactions table for the slave's XID don't increase
slave process is the oldest transaction
select * from system.sessions where xid in (select xid from system.transactions order by xid asc limit 1 \G # check to see if user=4100 which would be the slave process.
If this is confirmed that the slave is stuck, see if you can kill the session. If that doesn't work, you should contact support.
Issues on the Master
If the CPU load on the slave is low and if there doesn't seem to be any long running queries on the slave and there aren't any indexing issues, you might want to have a look at the master cluster.
Any errors in the master logs (clustrix.log)?
Is the CPU load on the master high?
Is there a big write event being replicated?
Are there lots of small writes (hundreds of thousands of small updates, deletes or inserts)?
Volume of Writes
Are there more binlog files being created than normal? Perhaps there is a migration and data is being duplicated as a part of the migration or maybe it's a growth in overall writes and you might need to plan for having multiple binlogs? This is more readily apparent with RBR, but could also show up in SBR environments if there is a significant increase in write statements.
You can query the stats tables to find out how much the binlogs have grown within a time period by looking at the difference in the binlog sizes and charting it:
SELECT (max(value) - min(value))/1024/1024/1024 AS Binlog_Growth_GB
FROM clustrix_statd.statd_history
NATURAL JOIN clustrix_statd.statd_metadata
WHERE TIMESTAMP BETWEEN timestampadd(DAY,-1,now()) AND now()
AND name = 'clustrix.capacity.disks.binlog.size_bytes';
For RBR, you can count the number of binlogs created within a specific time period (like per day) and see whether the lag corresponds to the increase in binlogs:
tee binlogfiles.todaysdate;
show all binlog files;
For SBR, you could see whether there is an increase in write commands (deletes, updates, inserts) and compare that with the previous period:
mysql> select name, timestamp, max(value) - min(value) as Exec_count from clustrix_statd.statd_history natural join clustrix_statd.statd_metadata where timestamp between timestampadd(day,-1,now()) and now() and name in ('clustrix.stats.Com_delete','clustrix.stats.Com_update','clustrix.stats.Com_insert') group by name limit 10;
>+---------------------------+---------------------+------------+
| name | timestamp | Exec_count |
+---------------------------+---------------------+------------+
| clustrix.stats.Com_update | 2014-08-05 15:15:00 | 196959 |
| clustrix.stats.Com_insert | 2014-08-05 15:10:00 | 17493971 |
| clustrix.stats.Com_delete | 2014-08-05 14:20:01 | 5494 |
+---------------------------+---------------------+------------+
3 rows in set (0.00 sec)
Charting the results would make it easier to visualize and identify increased binlog writes (see attached images).
Data Consistency
Frequent slave stops due to duplicate keys or row not found is going to generate additional slave lag as it may take up to 10 seconds for the slave to restart if it's been configured to skip these events. Ideally, you shouldn't be in this situation as this would indicate a data consistency issue but this occurs mainly in a master-master setup where the application may have mistakenly written the same data onto both clusters. It could also occur in a master-slave configuration where your development team may have inadvertently written to the slave.
Long Running Analytic Queries
Since replication is single-threaded, if you have many complex writes that take a long time to process in SBR, the slave will lag until it has finished processing this and all of the other queries that have queued up behind it.
Data Distribution
If a table is poorly distributed, for example, having less slices than the number of nodes in the cluster, you're not taking advantage of all of the available resources. This table may be "hot" as the writes aren't being spread out more evenly across all the nodes. You can find your hot tables by looking at hotness_history (amend the value in the brackets):
select * from clustrix_statd.hotness_history where write_rank <20 and database not in ('system','clustrix_statd','_replication') and total_replicas < [### 2 * number_of_nodes###] order by timestamp desc, write_rank asc limit 50;
Alternatively, you can use the hotness tool (https://support.clustrix.com/entries/23261163-Hotness) to see what is hot at the moment.
Network Issues
Your network monitoring system or your NOC should notify you when there are network issues that would impact replication. In the event that they don't, you might notice that the slave lag jumps up from 0 to a high value:
mysql> select * from clustrix_statd.statd_metadata natural join clustrix_statd.statd_history where name like '%seconds_behind%' order by timestamp desc limit 200;
+------+-----------------------------------------------------------+---------------------+-------+
| id | name | timestamp | value |
+------+-----------------------------------------------------------+---------------------+-------+
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:55:05 | 1486 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:51:06 | 2095 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:45:24 | NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:40:05 | 3676 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:35:06 | NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:32:13 | 5575 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:25:03 | NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:20:07 | 6569 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:15:28 | 6438 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:10:07 | 6671 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:04:49 | 6864 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:00:24 | NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-20 23:56:25 | 7155 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-20 23:50:04 | NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-20 23:44:49 | 0 |
The sprout or clustrix log on the node where the slave is running from would have logged this:
Nov 20 23:49:08.216590 UTC 2014 clxnode001 real: INFO mysql/replication/slave_state.c:175 mysql_slave_state_reconnect(): Stopped mysql slave clx-slave ( 10.0.0.100:3306 ) on transient error, scheduling retry: Could not connect to master (Connection reset by peer)
Multiple Binlogs
In some cases, there may be too many writes for a single binlog to handle. This could be from growth of your site or cyclical demand like during holiday periods. A workaround would be to have multiple binlogs where non-dependent writes can be logged to their own binlogs. For example, you may have writes to a user table and writes a message table. If they're in different schemas and don't have any dependency on each other, you could set up a binlog per schema so replication events for one schema doesn't get queued up behind writes to the other schema.
Please contact support for more information on how to do this.
Extracting Binlog Events
This FAQ covers how to extract events from binlogs: How To Extract Binlog Events.
Stats
For RBR, these tables provide some insight in RBR stats (stats in the following two tables reset every time the slave restarts):
system.slave_row_stats: select * from system.slave_row_stats order by event_bytes desc limit 10; # shows the top ten tables that are being written to since the slave started.
system.mysql_slave_stats: select * from system.mysql_slave_stats; # look for bytes_per_sec and events_per_sec and whether batches value is incrementing.
The above stats might tell you which tables are "hot" and help you in finding write queries for those tables.
For SBR, these stats on the slave may be useful in identifying the volume of statements being replicated:
clustrix.stats.Com_delete
clustrix.stats.Com_update
clustrix.stats.Com_insert
when used in conjunction with the stats in
clustrix_statd.qpc_history
and with query log entries.
Best Practices
If you have multiple slave processes, point them to different nodes on the master cluster rather than the VIP host.
RBR is generally more efficient, but you do need to set your bulk writes to be statement based (set session binlog_format='statement') to not flood the slave with lots of write events.
If your write volumes are too much for one binlog, separate your binlog writes into different binlogs if the writes aren't dependent.
View ArticleIn ClustrixDB the alerter is a background process which monitors and sends alerts messages when specific criteria are met.
When an alert is triggered an email notification will be sent to the configured email addresses with informationabout the triggering issue. A list of ClustrixDB alerts can be found in the knowledge base article: What types of alerts can Clustrix send?
Ocasionally you may encounter situations where you wish to temporarily disable the alerterprocess.To disable the alerter from sending email alerts on a node, create a file named .noalertsin the / directory.
Example:
root@node001:/$ touch/.noalerts
Alerts generated while the .noalerts file exists will be logged to theclustrix.log on the node that initiated the alert. To silence alerts cluster wide the/.noalerts would need to be created on each node. You can do this with the clx tool.
Example:
root@node001:/$ /opt/clustrix/bin/clx cmd 'touch/.noalerts'
Possible reasonsfor disabling alerts:
PlannedMaintenance
Prevent spurious alerts when performing maintenance or other activities where alerts are expected.
A known hardware issue.
Prevent alerts from a node with a known issue, until the issue can be resolved.
If a node is not sendingalerts but the alerts are appearing in the nodes clustrix.log it is worth checking to ensure there is no /.noalerts file present.
View Article
Question: How much space do I need for my log partition?
Short Answer: 50GB
Long Answer:ClustrixDB outputs to several different log files, the main three beingDEFAULT(clustrix.log on software or sprout.log on appliance),QUERY, andDEBUG. By default these three logs will rollover, start a new log, and archive the old log when any of the following condition are met:
Log file reaches 1GB in size.
Log file reaches the age of 24 hours.
Time hits 00:00:01.
By defaultthe cluster will holdfor each log type10 uncompressed log files and 20 gzipped log files for a total of 90 files. So in the most extreme situations the logs could potentially take up a little more than 35GB of space. Because of this and to be on the safe side we recommend 50GBpartition for your log files.
For a complete list of best practices for ClustrixDB software installations: Best Practices for Platform Configuration for ClustrixDB Software Installations
View ArticleA Group Change is the event that occurs when a cluster forms a new group. This occurs when a node joins or leaves the cluster group. This can also manifest in Clustrix Insight as 503 Errors while the Group Change occurs.
Transient Node FailureWhen the cluster loses contact with an individual node for any reason, remainingnodes in the cluster form a new group without this node and continue serving clients. All serviceslikereplication slaves are reassigned across the surviving nodes. Clients connected through a load balancerexperience only a short unavailability during agroup change. Clients that were distributed to the failed node must reconnect. Clients that were directly connected to the failed node will beunable to query the database and must connect to another node.
After the failed node has recovered, it willrejoin the cluster and accept work. While a failed node mayrequire manual intervention to recover (example: restoring power), no further action is neededto reintegrate anode into the cluster after it is restored.
New Node AdditionWhen a new node (or nodes) is added to the Cluster, the cluster forms a new group with that node added as a member. At this point the Cluster will begin balancing data to the new node.
View ArticleHere are the recommended system specs to run ClustrixDB.
Minimum of 3 servers required for a ClustrixDB cluster
Operating System: RHEL or CentOS 6.x
We recommend either 8-core or 16-core servers. If using 16-core servers, the remaining specs below stay the same.
General ClustrixDB Server Recommendations:
CPUs:
2x 8-core CPUs 2.0GHz or higher (the higher the better)
RAM:
64GB or more(typically not more than 256GB)
Disks:
4x SATA SSD 400GB or more
2x SATA HDD 1TB or higher
Hardware RAID:
4x SSDs as RAID-0
2x HDDs as RAID-1
Network:
1x 1 Gigabit interfaces
1x 10GbE interfaces
Private Network:
1x 10-Gigabit Ethernet switch with enough ports to connect each ClustrixDB server
High Availability Server Recommendations:
Same as above, with following upgrades:
Network:
2x 1 Gigabit interfaces
2x 10GbE interfaces
Power Supply Unit:
2x power supply units (redundant)
Private Network:
2x10-Gigabit Ethernet switch with enough ports to connect each ClustrixDB server
Other Recommendations:
Hyperthreading: Most DB workloads do not benefit from Hyperthreading, therefore Clustrix, Inc recommends as general guidance disabling Hyperthreading in the ClustrixDBservers BIOS settings.
SSD Quality: Unfortunately, solid state drives (SSDs) are available in various levels of quality. Clustrix, Inc recommends using SSDs that have the following characteristics:
Enterprise-class SSDs (not consumer-class available from Amazon.com )
Contains a Power Loss Protection feature
Recommend using SSDs manufactured by Intel (every Intel SSD weve seen has Power Loss Protection)
Amazon AWS: When running in AWS we strongly recommend using enhanced networking as it has shown a measurable performance impact.
Example ClustrixDB Server Configuration (Dell)
Dell PowerEdge R630 Server, No TPM
Chassis with up to 8, 2.5" Hard Drives, up to 2 PCIe Slots (With Optional Riser)
CPUs
Intel Xeon E5-2640 v3 2.6GHz,20M Cache,8.00GT/s QPI,Turbo,HT,8C/16T (90W) Max Mem 1866MHz
Upgrade to Two Intel Xeon E5-2640 v3 2.6GHz,20M Cache,8.00GT/s QPI,Turbo,HT,8C/16T (90W)
Memory
4x 16GB RDIMM, 2133MT/s, Dual Rank, x4 Data Width
RAID Controller
PERC H330 RAID Controller
Disks
4x 400GB Solid State Drive SATA Mix Use MLC 6Gpbs 2.5in Hot-plug Drive, S3610
2x 1TB 7.2K RPM SATA 6Gbps 2.5in Hot-plug Hard Drive
Network
Broadcom 5720 QP 1Gb Network Daughter Card
QLogic 57810 Dual Port 10Gb Base-T Network Adapter
Other Notes:
No Operating System
Unconfigured RAID for H330/H730/H730P (1-24 HDDs or SSDs)
You can read more on initial setup of ClustrixDB in our Getting Started Guide.
View Article
In ClustrixDB all portions of a table are considered an index. Thus when the sizes from in system.table_sizes and system.index_sizes when summed are expected to be the same. However, index_sizes gives more granularity into the makeup of the table which can be useful in some cases.
The base representation of a table listed as __base unless there is a primary key, in which case the primary key will take the place of the base representation in the index listing. A primary key (which includes the base representation) is noted as __idx_<table_name>__PRIMARY. Additional indexes are listed by name.
Example:
sql> create table foo01 (id int, name varchar);Query OK, 0 rows affected (0.32 sec)
sql> create table foo02 (id int not null primary key, name varchar);Query OK, 0 rows affected (0.05 sec)
sql> create table foo03 (id int not null primary key, name varchar, key indexkey (name));Query OK, 0 rows affected (0.05 sec)
sql> select * from system.table_sizes where `database` = 'example';+----------+-------+--------+| Database | Table | Bytes |+----------+-------+--------+| example | foo01 | 327680 || example | foo02 | 327680 || example | foo03 | 655360 |+----------+-------+--------+3 rows in set (0.12 sec)
sql> select * from system.index_sizes where `database` = 'example';+----------+-------+----------------------+--------+| Database | Table | Index | Bytes |+----------+-------+----------------------+--------+| example | foo01 | __base_foo01 | 327680 || example | foo02 | __idx_foo02__PRIMARY | 327680 || example | foo03 | __idx_foo03__PRIMARY | 327680 || example | foo03 | indexkey | 327680 |+----------+-------+----------------------+--------+4 rows in set (0.11 sec)
If you want to query the system.index_sizes table and not count the base rep or the primary key (obtain the size of just the secondary indicies) you can add the following to the query above:
AND `Index` NOT LIKE '%__base%' AND `Index` NOT LIKE '%__PRIMARY'
Example:
sql> select * from system.index_sizes where `database` = 'example' and `Index` NOT LIKE '%__base%' AND `Index` NOT LIKE '%__PRIMARY';+----------+-------+----------+--------+| Database | Table | Index | Bytes |+----------+-------+----------+--------+| example | foo03 | indexkey | 327680 |+----------+-------+----------+--------+1 row in set (0.11 sec)
View ArticleThe most common way to import data onto a ClustrixDB cluster is through use of the clustrix_import tool, as described in Importing Data. Clustrix Import is designed to take advantage of ClustrixDB parallelism and is the recommended method for performant data ingest.
If for some reason you are unable to use clustrix_import, here are some recommendations for optimal import speed:
Increase concurrency as much as possible. If using LOAD DATA INFILE, consider splitting the import file into multiple chunks, and running parallel LDI invocations. If you have an application which loads data, utilize multiple threads. A rule of thumb for concurrency is one thread per core in the cluster.
Disable auto_increment for the import phase. There is some performance overhead to maintaining the auto_increment value, even when the value is being provided in the INSERT. If you create your table without the AUTO_INCREMENT clause of the column, this penalty is avoided; obviously this is only workable if the data you are importing includes the value of the auto_increment column. Once import is complete, you can ALTER TABLE MODIFY COLUMN to re-add the AUTO_INCREMENT column, an operation which does not require a table rewrite, and so completes immediately. Note that clustrix_import automatically does this for you.
View ArticleVersions: 7.5
When engaging Clustrix Support, one of the first things our team usuallyneeds to start diagnosing issues are the logs. Therefore, we recommend collecting these logs, and submit them when you initially contact Clustrix Support. This helps us identify the issue, and the associated solution much faster.
Collecting the logs
There are two primary logs that our teamlooks at, which will tell us where to dignext:
The ClustrixDB server log: clustrix.log
The ClustrixDB query log: query.log
In some cases, other logs can be helpful in identifying issues, namely:
The ClustrixDB nannylog: nanny.log
The ClustrixDB debuglog: debug.log
System /var/log/messages: messages.log
Each server in the ClustrixDB cluster writes its logs locally.Therefore we will need the logs from all files.
Collecting the logs can be performed from a single ClustrixDB serverusing the clx tool :
/opt/clustrix/bin/clx logdump clustrix > combined_clustrix.log
/opt/clustrix/bin/clx logdump query> combined_query.log
tar cvzf clustrix_logs.tgzcombined_clustrix.logcombined_query.log
rm -fcombined_clustrix.logcombined_query.log
Now you have a compressed tar file named clustrix_logs.tgz in your current working directory.
Sending the logs
Now that you have the logs collected into a tar file, you can send the file to Clustrix Support a couple different ways.
If it is small (i.e. less than 10MB) you can simply attach it to the email you send to Clustrix Support.
If it is a large file, you can FTP it to our Upload-only Clustrix FTP server
Alternatively, you can share the file in a service like Dropbox or Google Drive, and provide the download link in the email toClustrix Support.
To FTP files to our Upload-only Clustrix FTP server.
FTP server: ftp.clustrix.com
User: anonymous
Password: your email address
Once the file is uploaded, it will not be visible since the FTP server is configured to be write-only. Please give the file an easily identifiable name we can identify as yours. Then please let us know you have uploaded the file by including the file name in your email to [email protected].
View ArticleAlerts Configuration
Before configuring alerts, we recommend setting the cluster_name and customer_name global variables. If you have multiple clusters, thiswill help to identify which cluster is sending alerts. This also helps Clustrix Support identify where alerts are coming from if you are sending alerts to our team at [email protected] (recommended).${alerts_name} will parse out as customer_name - cluster_name and${cluster_name} will show the value of cluster_name. You can set these globals with:
set globalcustomer_name='<company name>';
set globalcluster_name='<cluster identifier (usually similar to node hostnames)>';
There are two system tables in ClustrixDB used to configurealerts:
alerts_parameters and alerts_subscriptions.
alerts_parameters:
SQL> select * from system.alerts_parameters;+-------------------+-------------------------------------------------------+| name | value |+-------------------+-------------------------------------------------------+| body_max_chars | 50000 || email_subject | ${alerts_name} [${severity}] ${summary} || smtp_password | || smtp_security | || smtp_username | || smtp_port | 25 || smtp_sender | ${alerts_name} CLX Log Alert <[email protected]> || smtp_server | smtpserver.mycompany.com || subject_max_chars | 100 || email_body | Severity: ${severity} |Date: ${date} ${tz} Host: ${host} HWID: ${hwid} Cluster: ${cluster_name} Version: ${version} Image Version: ${image_version} Message: ${message} | | email_encoding | quoted-printable |+-------------------+-------------------------------------------------------+11 rows in set (0.00 sec)
The default server configurationis intended for testing purposes only and should be changed for production installations.
To set a newSMTP server, you can run the following:
SQL> update system.alerts_parameters set value='smtp.foo.com' where name='smtp_server';
To modify other alerter settings, simply update the row:
SQL> update system.alerts_parameters set value='foo' where name='bar';
alerts_subscriptions: This tablecontains the emailaddresses that alerts will be sent to.
SQL> select * from system.alerts_subscriptions;+-----------------------------+
| address |
+-----------------------------+
+-----------------------------+
1 rows in set (0.01 sec)
You can insert, update or delete from this table with regular SQL commands.
For example:
SQL> insert into system.alerts_subscriptions values ('[email protected]');
Note: When any settings are changed the alerter must be told to re-read the config. After modifying any parametersyou mustrunthe following for changes to take effect:
SQL> ALTER CLUSTER RESET ALERTER;Testing Alerts
You cansend a test alert viaSQL with the following syntax.
SQL> select alert(<alert level>, '<alert text>');
The alert levels are:
0 - CRITICAL1 - ERROR2 - WARNING3 - INFO
The alert text can be anything you desire.
Example:
SQL> select alert(3,'I am a test alert');
+------------------------------+
| alert(3,'I am a test alert') |
+------------------------------+
| 0 |
+------------------------------+
1 row in set (0.00 sec)
This is the email generated from the above test:
From: foo-cl20dafefv4af047 CLX Log Alert <[email protected]>Subject: foo-cl20dafefb4a7bf047 [INFO] User Invoked From SQLDate: Mon, Apr 18, 2016 at 12:38PMTo: [email protected]: INFODate: 2016-04-18 07:08:13.084133 UTC Host: foo HWID: 00:30:48:be:63:88 Cluster: Clustrix_cluster Version: 5.0.45-clustrix-7.0.1 Image Version: CentOS release 6.5 (Final) Message: [SQL] I am a test alert
View ArticleIt's useful to compare global variable settings against the default in many cases.The best practice is to check these:
Before an upgrade
After an Upgrade
When encountering issues
We use the following query to check globals against the defaults. This ignores a few globals that are always different than the default, like license and cluster_id:
SQL> SELECT name,
value,
default_value
FROM system.global_variables
JOIN system.global_variable_definitions
USING (name)
WHERE value != default_value
AND name NOT IN ('cluster_ssh_key_private',
'cluster_ssh_key_public',
'cluster_host_key_public',
'cluster_host_key_private',
'cluster_vip',
'cluster_vip_enabled',
'server_id',
'cluster_name',
'version',
'format_version',
'cluster_vip_director',
'cluster_vip_external',
'customer_name',
'session_log_bad_queries',
'session_log_slow_queries',
'session_log_slow_threshold_ms',
'view_strmaps_upgraded',
'global_variables_ignored_version',
'cluster_id',
'license');
View ArticleIf you are having a problem where UPDATE queries that are normally fast are running unexpectedly slow or timing out there may be an uncommitted explicit transaction, or a long running query that is preventing your update because it has takenout a lock on the row you are attempting to UPDATE.
To identify long running locking queries, while you're running your update statement in one session, open up a second session and run the following:
SELECT* FROM system.sessions WHERE xidIN(SELECTDISTINCT(holder) FROM system.lockman)\G
If a session shows up, this may be the one that is blocking your update and you can ask the user to commit or you can try and kill the session.
View ArticleSometimes when creating a table by copying and pasting the schema from the output of a SHOW CREATE TABLE, the table that is created has 3 slices when you defined a different number.
To see this problem in actions:
Schema definition with SHOW CREATE:
mysql> show create table bar\G
*************************** 1. row ***************************
Table: bar
Create Table: CREATE TABLE `bar` (
`id` int(0)
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=3 */
And we paste it back in, modifying the slice count like so:
mysql> CREATE TABLE `bar` (
-> `id` int(0)
-> ) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=30 *
-> ;
Whenrunning SHOW CREATE TABLEit is still the default 3 slices:
mysql> show create table bar\G
*************************** 1. row ***************************
Table: bar
Create Table: CREATE TABLE `bar` (
`id` int(0)
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=3 */
The reason for this behavioris that mysql client is stripping the comments delimited by /*$ */ before sending the query to the server. It does this regardless of whether you are using mysql client interactively or input queries to stdin.
The solution is to remove the comment delimiters before submitting these queries to Clustrix.
Note that clustrix_import will not strip these comments, however unless you disable auto-reslice with -A option, it may apply reslicing logic to the table.
View ArticleA "No Such Token" error when trying to fill out a satisfaction survey occurs from either of the following:
Time, the link that is shared with an end-user will only work for approximately 72 hours after being sent.
Usage, the link that is shared with an end-user can only be clicked on once. If they click on it and rate the ticket, it will expire. If they click on it anddon'trate the ticket, it will expire. If someone else (perhaps an admin or an agent) received the same link and clicked on it, the token will expire.
You can login to the Clustrix support system directly to rate tickets after either of these conditions has occurred.
You can read more about this on Zendesk's support forums No Such Token page.
View ArticleIn ClustrixDB versions 7.5+ there is a globalvariable device_auto_resize_to_largest that is default to TRUE, but only for new installs. For clusters that have beenupgraded from previous versions, you'll want to first confirm that the /data/clustrix/ partition with largest device1 file has enough space to fit device1-temp file (default device_temporary_space_limit_bytes). If that's the case, you can set the variableto true, which will cause all nodes to grow their device1 file to the same size as the largest device1 file in the system.
To set the global:
sql> set global device_auto_resize_to_largest=true;
If in the future you wanted to allocatemore database space for each node (grow the device1 file) you can grow the device fileswith the following command:
sql> ALTER CLUSTER RESIZE DEVICES 400G
A bit more information on this can be found in our docs here: Managing File Space and Database Capacity
View ArticleClustrix Clone Users will help to migrate all your MySQL users and permissions to/from MySQL and Clustrix.
./clustrix_clone_users --help
Usage: clustrix_clone_users [options]Options:-h, --help show this help message and exit-H HOST, --host=HOST-P PORT, --port=PORT sql port, default is mysql port: 3306-u USER, --user=USER username, default is root-p PASSWD, --passwd=PASSWD-v, --verbose-V, --version
Example:
./clustrix_clone_users -H myhost -u root >> user_dump.sql
This can then be imported into Clustrix like:
mysql -uroot -h <clustrix IP> < user_dump.sqlYou can download the script here: Clustrix Clone Users
View ArticleThe Clustrix Nagios Plugin is used to monitor a Clustrix Cluster with Nagios.
Usage: check_clustrix.py [options]
Options:-h, --help show this help message and exit-H HOSTNAME, --hostname=HOSTNAME any hostname for a clustrix system-P PORT, --port=PORT port to connect to-u USER, --user=USER database user to connect as-p PASSWORD, --password=PASSWORD account password-L, --list-checks list available checks-w WARN, --warn=WARN warning threshold-c CRIT, --crit=CRIT critical threshold Available checks:check_cpu Check that the cluster cpu utilization is within bounds. check_executing_sessions Check that the number of executing statements which have been running for more than 5 seconds is within a reasonable bound check_long_running_queries Check for queries which have been executing over 4 hours. check_offline_slices Check for number of offline slices. Any value over 0 indicates that some data is unavailable in the current membership. User quries will get errors. check_underprotected_slices Check for any slices which have less than 2 replicas.
Additional statistics are managed by our new statistics aggregation and tracking mechanism, statd.py. This script runs on each node (but only one instance is active at a time),and collects statistics into tables in the clustrix_statd database (clustrix_statd.statd_metadata, clustrix_statd.statd_current, clustrix_statd.statd_history).
There is a method in the nagios plugin, fetch_stats, which retrieves the specified stats (as many as desired per invocation) from these statd tables, e.g.:
cmd$ nagios/check_clustrix.py -H alpha013 -u root fetch_stats clustrix.stats.bm_miss_rate clustrix.tps clustrix.qpsOK - clustrix.stats.bm_miss_rate=0 OK - clustrix.qps=22 OK - clustrix.tps=21 | clustrix.stats.bm_miss_rate=0 clustrix.qps=22 clustrix.tps=21
This allows us to expand the number of stats available over time, without having to revisit the nagios plugin.
Below is where to find some of the stats that were included in older versions of the plugin, in the new framework:
bm_miss_rate - Rate of Buffer Misses vs. Cache Hits fetch_stats clustrix.stats.bm_miss_rate
disk_read - Percentage of Maximum Disk Read Capability
disk_write - Percentage of Maximum Disk Write Capability
You can calculate rate/delta from:
fetch_stats clustrix.io.disks.bytes_read
fetch_stats clustrix.io.disks.bytes_written
qps - Queries per Second
fetch_stats clustrix.qps
tps - Transactions per Second
fetch_stats clustrix.tps
conns - Current Connections / Max Connections
queries - Number of Currently Running Queries
avg_q_age - Average Age of Running Queries
max_q_age - Age of Longest Running Query these are reported by fetch_session_stats (does not use statd):
Example:
cmd$ nagios/check_clustrix.py -H alpha013 -u root fetch_session_stats
OK - CONNECTIONS=5 OK - EXECUTING_SESSIONS=2 OK - AVG_QUERY_AGE=2052.000000 OK - MAX_QUERY_AGE=4104 | CONNECTIONS=5 EXECUTING_SESSIONS=2 AVG_QUERY_AGE=2052.000000 MAX_QUERY_AGE=4104
quorum - Number of Nodes in Quorum / Number of Nodes in Cluster
fetch_stats clustrix.cluster.nodes_in_quorum
fetch_stats clustrix.cluster.total_nodes
rebalancer_jobs_queued - Number of Rebalancer Jobs Currently Queued
fetch_stats clustrix.rebalancer.jobs_queued
http://files.clustrix.com/support/files/clustrix_nagios.tar.gz
View ArticleClustrix Import is a parallel import tool to import MySQL dumps into a ClustrixDB Cluster.Importing mysqldumps with the mysql client results in a single-threaded insert, which will be much slower than clustrix_import. Clustrix Import is installed along with ClustrixDB and located in the /opt/clustrix/bin/ directory.
Example Usage:
$ ./clustrix_import -H <cluster\_ip> -u root --passwd=<password> -i mydumpfile.dump
You can read the Importing Data section of the System Administrators Guide forfurtherinformation about Clustrix Import and importing data other than MySQL dumps, like CSV files.
View ArticleBinglog Trim is a script to trim binlogs according to the retention policy
Usage: binlog-trim [options]
Options:
-h, --help show this help message and exit
-H HOST, --host=HOST
-P PORT, --port=PORT sql port, default is mysql port: 3306
-u USER, --user=USER username, default is root
-p PASSWD, --passwd=PASSWD
-n NUM_FILES, --num-files=NUM_FILES
number of files to trim a time
-d, --dry do not perform any actual trims
-k KEEP_DAYS, --days=KEEP_DAYS
keep this many days of binlogs
-i INTERVAL, --interval=INTERVAL
seconds between trims
-b BINLOG_NAME, --binlog_name=BINLOG_NAME
name of binlog to trim; must specify if multiple
binlogs exist
-M MAX_RUN_TIME_MINS, --max-run-time-mins=MAX_RUN_TIME_MINS
maximum time (minutes) script may run
-V, --version
-N, --new-method use alternate method of SHOW BINLOG FILES for faster
performance
binlog-trim is generally deployed as a cron job on one of the nodes. An example entry to run once a day at 5:35UTC, with a retention policy of 7 days, trimming no more than 15files at a time, with a minimum 60 second pause between trims, and operating on the binlog called clustrix-bin:
35 5 * * * root /bin/binlog-trim -H localhost -i 60 -k 7 -n 15 -b clustrix-bin 2>&1 >> /var/log/binlog-trim.log
Note that the INTERVAL is a minimum wait between trims; there is additional logic in the script to prevent the trims from building up too much cleanup work (the logs will indicate this with 'waiting for bigc to pass trim').
View ArticleHotness is a tool to find the 'hot' containers involved in writes and reads for a given cluster. To run, simply copy it to /clustrix/ on one of the nodes and execute the script with:
clx_node$ ./hotness
This script will output the most used containers and what tables they hold data for. This can be used to find imbalances that could be impacting cluster performance.
Example Output:
Overall hot writes (1950131):
3:sbtest.k469605 (24%)
2:sbtest.\_\_idx\_sbtest\_\_PRIMARY440979 (22%)
2:sbtest.k440606 (22%)
3:sbtest.\_\_idx\_sbtest\_\_PRIMARY300478 (15%)
1:sbtest.k148129 (7%)
1:sbtest.\_\_idx\_sbtest\_\_PRIMARY140502 (7%)
3:statd\_current.\_\_idx\_statd\_current\_\_PRIMARY4916 (0%)
1:statd\_current.\_\_idx\_statd\_current\_\_PRIMARY3336 (0%)
2:statd\_current.\_\_idx\_statd\_current\_\_PRIMARY1580 (0%)
3:alerts\_parameters.\_alerts\_parameters\_ind\_00 (0%)
3:USER\_SPHINX\_UPDATE.idx\_processed\_location\_chunkid\_userid0 (0%)
1:statd\_metadata.\_\_idx\_statd\_metadata\_\_PRIMARY0 (0%)
2:USER\_SPHINX\_UPDATE.idx\_processed\_location\_chunkid\_userid0 (0%)
1:rebalancer\_activity\_log.\_rebalancer\_activity\_log\_ind\_10 (0%)
2:alerts\_parameters.\_alerts\_parameters\_ind\_00 (0%)
3:mysql\_slave\_log\_updates.\_mysql\_slave\_log\_updates\_ind\_00 (0%)
3:network\_interfaces.\_\_base\_network\_interfaces0 (0%)
1:protection\_log.\_protection\_log\_ind\_10 (0%)
3:bar.\_\_base\_bar0 (0%)
2:mysql\_slave\_log\_updates.\_mysql\_slave\_log\_updates\_ind\_00 (0%)
Overall hot reads (38898587):
1: ALL 31%
2: ALL 0%
3: ALL 68%
3:sbtest.\_\_idx\_sbtest\_\_PRIMARY26542546 (68%)
1:sbtest.\_\_idx\_sbtest\_\_PRIMARY12350927 (31%)
1:statd\_current.\_\_idx\_statd\_current\_\_PRIMARY1672 (0%)
3:statd\_metadata.\_\_idx\_statd\_metadata\_\_PRIMARY874 (0%)
1:statd\_metadata.\_\_idx\_statd\_metadata\_\_PRIMARY798 (0%)
2:statd\_metadata.\_\_idx\_statd\_metadata\_\_PRIMARY792 (0%)
2:statd\_current.\_\_idx\_statd\_current\_\_PRIMARY792 (0%)
1:protection\_log.\_protection\_log\_ind\_1180 (0%)
3:statd\_history.timestamp2 (0%)
2:statd\_history.timestamp2 (0%)
1:statd\_history.timestamp2 (0%)
3:alerts\_parameters.\_alerts\_parameters\_ind\_00 (0%)
3:USER\_SPHINX\_UPDATE.idx\_processed\_location\_chunkid\_userid0 (0%)
2:USER\_SPHINX\_UPDATE.idx\_processed\_location\_chunkid\_userid0 (0%)
1:rebalancer\_activity\_log.\_rebalancer\_activity\_log\_ind\_10 (0%)
2:alerts\_parameters.\_alerts\_parameters\_ind\_00 (0%)
3:mysql\_slave\_log\_updates.\_mysql\_slave\_log\_updates\_ind\_00 (0%)
3:network\_interfaces.\_\_base\_network\_interfaces0 (0%)
3:bar.\_\_base\_bar0 (0%)
2:mysql\_slave\_log\_updates.\_mysql\_slave\_log\_updates\_ind\_00 (0%)
View ArticleSometimes you mayencountered the error "Query compilation used too much memory" in response to a large, complex query, or run across a warning with the same text in the clustrix.log. The solution for this type of erroris toincrease the global max_sierra_opt_mem_MB and max_sierra_tot_mem_MB parameters, and this willgenerally allowed the query to compile, though warnings in the logs may persist (and initial compilation may take several seconds).
This article will explain what the "compilation used too much memory" message means, what the max_sierra_* global variables actually do, and other ways to handle issues compiling large complex queries.
Compilation phases and memory limits
Glossing over a great deal of complexity, for the purposes of memory utilization we are concerned with three phases of compilation:
Parsing the query
Optimizing the query plan
Generating executable instructions (query fragments)
Parsing
The amount of memory which can be used for parsing the query is determined bymax_sierra_parse_mem_MiB (default 256). This guard is in place to prevent a database crash in the face of a pathologically large query (the query which prompted the addition of this guard selected 40K columns). If a query encounters this limit, it will error with:
ERROR 1 (HY000) at line 1: [80897] Sierra used too much working memory: Query parsing
This limit can be raised to a max of 512MB, but please contact support if you encounter this issue, so we can help determine if the parser may be doing something improper in consuming so much memory.
Optimization
Typically the most memory-intensive phase is the second, query optimization. This is where the planner tries to find the best approach to answering the query, from potentially very many possible approaches. This includes selecting join order, which indexes to use, and applying a variety of optimizations, such as converting LEFT JOIN to INNER JOIN when possible or converting a bunch of OR'd col='value' to an equivalent IN SET. The optimization process involves searching the space of all possible query plans, comparing estimated execution costs to determine which is most likely to execute fastest (this is covered in much greater detail in the Query Optimizer section of the Clustrix Documentation on docs.clutrix.com). This space of all possible solutions grows larger with more complex queries, in particular as the number of JOINs increases -- number of possible JOIN orderings is n! (n factorial, i.e. 1*2*3*...*n) . The number of available indexes also contributes to the size of the search space. As we work through all these possible plans, we have to keep track of what we've discovered so far, which can add up to a significant amount of memory.
The amount of memory allowed for the optimization phase is controlled by max_sierra_opt_mem_MiB (default 80). When the compiler exceeds this limit, it will stop exploring the search space for a more optimal plan, and proceed to the next phase of compilation with the best plan it's found so far. This plan could be perfectly good, or pretty bad, in which case execution will be more expensive than it could be, with higher latency and greater load placed on the cluster. Importantly, though, in this case the query will execute and return results to the user, with no indication of problems. The indication that we did not finish optimization because of memory constraints will be a WARNING like the following in clustrix.log:
WARNING rigr/sierra/plan_parse.c:282 sierra_plan_impl(): Compilation used too much memory; took 4.230127sec: SELECT `t`.`id` AS `t0_c0`, `t`.`user_id` AS `t0_c1`, `t`.`item_id` AS `t0_c2`, ...INFO rigr/sierra/plan_parse.c:287 sierra_plan_impl(): Cost=239.49296827979, rows=2.653641INFO rigr/sierra/plan_parse.c:289 sierra_plan_impl(): LEXPRS PEXPRS DUPEXP GROUP ADGRP MEMORYINFO rigr/sierra/plan_parse.c:290 sierra_plan_impl(): 11732 37457 10331 2996 2996 117714kB TOTALINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 5 10986 1076 5 5 3828kB nljoin_ruleINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 0 10986 1081 0 0 1459kB nljoin_rule_parINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 10962 0 6060 2451 2451 81795kB join_commute_ruleINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 0 8567 792 0 0 6526kB join_to_msjoinINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 0 2782 0 0 0 568kB enforce_combineINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 0 1826 0 0 0 2036kB enforce_mergeINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 0 1826 0 0 0 4171kB enforce_sortINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 424 0 1037 268 268 6999kB table_scan_sargeINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 0 178 0 0 0 70kB phys_filterINFO rigr/sierra/plan_parse.c:295 sierra_plan_impl(): 55 111 278 55 55 2275kB pklookup_to_nljoin
The WARNING message shows the query in question, and the following lines show how much memory is being used by the various optimization rules. In this example we can see that join_commute_rule is the main culprit of our memory consumption, at over 80MB. This is typical when we have a very large number of JOINs.
It is possible to increase the sierra_max_opt_mem_MiB global (up to 1024MB), but bear in mind that the memory limit is also effectively a time limit, and so increasing the memory available may result in initial compilation taking significantly longer (though the query plan cache (QPC) will amortize this cost). Also make sure that sierra_max_opt_mem_MiB is well below max_sierra_tot_mem_MiB, as described below.
Compiling
Once we've parsed, and searched available plans for an optimal plan, we must compile the plan into query fragments to be distributed to the nodes for parallel execution. This phase requires memory as well, and memory consumption is bounded by max_sierra_tot_mem_MiB, default 128. If we exceedmax_sierra_tot_mem_MiB, the query will error as follows:
ERROR 1 (HY000): [80896] Sierra used too much memory: Compilation
Note that max_sierra_tot_mem_MiBincludes the memory used during optimization (they use the same heap); that is, if we've consumed 80MB during optimization, and supposing max_sierra_tot_mem_MiB is default 128, then we may have less than 48MB available for the remainder of compilation. Thus it is important to retain some gap betweenmax_sierra_tot_mem_MiB andsierra_max_opt_mem_MiB. If you are receiving the ERROR "Sierra used too much memory: Compilation", ensure thatsierra_max_opt_mem_MiB is well belowmax_sierra_tot_mem_MiB.
Working set memory
There is an additional portion of memory used during compilation, called the working set memory. It is used by the compiler as sort of a scratch area, so it grows and shrinks throughout the compilation process (as opposed to the memory used above, which typically grows through the entire compilation). In v5.0, this is gated by the global variable max_sierra_working_mem_MiB, and defaults to 512. A query exceeding this limit would see an error like:
ERROR 1 (HY000): [80897] Sierra used too much working memory: Compilation
While this global can be increased to 1024, usage of such a large amount of memory in the working set may indicate a problem, so we encourage you to contact support if you encounter this error. Also note that this working memory limit is separate from the max_sierra_tot_mem_MiB limit, and working set memory not counted in the max_sierra_tot_mem_MiB check.
Reducing compilation memory requirements with query hints
To avoid long compilation times, you can reduce the amount of work the compiler must do during optimization, by eliminating some of the plan permutations it must explore, i.e. join ordering and index selection.
Reducing optimization space with STRAIGHT_JOIN
The most effective strategy is to provide join ordering by using STRAIGHT_JOIN rather than JOIN; this tells the compiler to use the ordering provided in the query for this particular join. Note that this is provided for each JOIN between two tables, i.e. `foo` STRAIGHT_JOIN `bar` ON (foo.id = bar.pid); Clustrix does not currently support MySQL's SELECT STRAIGHT_JOIN syntax. Clustrix does support LEFT STRAIGHT_JOIN (unlike MySQL).
You can reduce planner memory consumption significantly by providing join ordering for some but not all joined tables -- remember that the cost of exploring joins increases factorially, so just eliminating one has a big impact. Take care when adding STRAIGHT_JOINs, considering whether your data may change over time such that the forced join ordering will be suboptimal.
Here's an example of a query our QA folks came up with in one of their test cases, which happens to trigger the "compilation used too much memory warning":
SELECT id, action AS op, reason, DATABASES.name AS `database`, relations.name AS relation, representations.name AS representation, NULL AS bytes, NULL AS started, NULL AS finished, NULL AS error, queued, NULL AS running, IF(queued < From_unixtime(1350755986), 'leftover', '') AS leftoverFROM system.rebalancer_queued_activity JOIN system.rebalancer_copies USING (id) JOIN system.slices USING (slice) JOIN system.representations USING (representation) JOIN system.relations ON ( relations.table = representations.relation ) JOIN system.DATABASES USING (db)WHERE id NOT IN (SELECT id FROM system.rebalancer_activity_log);
While not a ton of joins on the face of it, the issue arises from the fact that rebalancer_copies is a view, built from many joins and sub-selects (some of these on further views) such that it involves 14 joins all told. This results in 3-4 second compilation time and the compilation used too much memory warnings. Can we shorten compilation and eliminate these warnings by adding some judicious STRAIGHT_JOINs?
In this case, I just forced the join ordering between rebalancer_queued_activity and rebalancer_copies, by changing the second line of the FROM clause to:
STRAIGHT_JOIN system.rebalancer_copies ON (rebalancer_copies.id = rebalancer_queued_activity.id)
(STRAIGHT_JOIN does not currently support USING syntax, so I changed it to a standard ON.)
This query now compiles in 0.2 seconds. In this case that view was blowing out the complexity of the query, so forcing the join ordering of the view had very significant impact. But in fact, forcing a STRAIGHT_JOIN in nearly any of these cases will greatly reduce compilation time and eliminate the log warnings; this is because join permutations grow factorially as noted earlier, so just removing (or forcing the order of) one join has a big impact. In this case, as long as we allow rebalancer_queued_activity to be first in the join ordering (since it is usually quite small), we are unlikely to prevent a reasonably good plan.
Reducing optimization space with USE INDEX
Where there are many indexes available for a table, it may also be helpful to hint the planner with index to use, if this is obvious to the DBA. This can be done using the USE INDEX (indexname) syntax, similar to MySQL.
Factors which can significantly increase compiler memory usage
Number of JOINs
This includes views which include more JOINs
Sub-selects are also treated as JOINs
15 tables can be problematic
Indicated by high memory costs in rules: nljoin_rule, nljoin_rule_par, join_commute_rule, filter_inner_join_rule
Mitigate by specifying join ordering using STRAIGHT_JOIN
Use of multiple IN sets in WHERE clause
IN sets are internally converted to mtables (memory tables) and thus create additional JOIN complexity
Indicated by high costs in rules above, plus inner_join_mtable_find
Mitigate by converting IN to multiple OR clauses, e.g. a.thing IN (1,2,3) => (a.thing = 1 OR a.thing = 2 OR a.thing = 3); particularly helpful when several IN SETs in a query
Multi-row inserts, with functions having different parameters repeatedly used in the VALUES tuples, e.g.:
e.g. INSERT INTO foo VALUES (1, concat('abc','def')), (2, concat('ghi','jkl')), ... (10000, concat('zyx','wvu'));
Note that this is not an issue when there is no parameter, or the same parameter is used repeatedly, e.g. now() or from_unixtime(N) where N is the same for all rows.
This will fail in parsing: Sierra used too much working memory: Parsing plan
Get in touch with Clustrix Support!
Finally, if you are having difficulty applying these workarounds, or are not achieving desired results, please contact Clustrix Support, either through the forums, or customers with support contracts may open a ticket.
View ArticleThis error "No space available for allocation type"occurswhen the temporary device space allocated for sorting is exhausted. This error is almost always it is seen due to some large BI type of analytical queries having ORDER BY or GROUP BY operation involved. Theglobal variable, device_temporary_space_limit_bytesis used to set how much sort space is allocated per node for a cluster. Default value for this variable is 5GB.
Short term recommendation for such an error is to set the global variable to a higher value. Long term recommendation is to identify the query that's filling up the temporary work space and tune it to reduce the data mass that it has to deal with in sorting and/or grouping before sending the resultset to the client.
When the /data directory is 100% full and thedevice_temporary_space_limit_bytesis set to greater than the default value, you can sometimes see aNo space available on deviceerror instead ofNo space available for allocation type.
View ArticleIf you are attempting to upgrade your software cluster and the upgrade is failing on "Extracting and Verifying Checksums" then you will need to rerun the upgrade package as root user.
clustrix001$ ./clustrix-clxnode-v5.2-10186-upgrade.sh install
Checking for node clock sync... Done
Verifying Payload MD5 Checksum... Done
Pushing package to cluster... Done
Extracting and Verifying Checksums on Upgrade... FAILURE
Tar Extraction Failed on clustrix001 Fix and try again.
View Article
What does the error "WAL filled beyond 75% limit" mean when seen in the clustrix.log file?
Example:
2016-02-20 23:15:38.888734 UTC clustrix001.local.com clxnode: INFO storage/wal/wal_fs.c:339 wal_fs_get_write_permission(): WAL filled beyond 75% limit. Blocking appends until checkpoint completes, if behavior persists, try starting checkpoint early with -wal-fs-cp-start-pct
The info message essentially means that the WAL is filling faster than it can be flushed to disk.
The WAL (write-ahead-log) is an in-memory write buffer in ClustrixDB. By default, the WAL begins flushing to disk once it reaches 50% full, and will begin blocking writes at 75% to ensure that system tasks responsible for flushing the WAL have enough headroom to complete. The resulting effect is some increased latency in these writes.
View ArticleA customer sent the following screenshot asking why workload distribution looked so skewed, with nearly all reads and writes going to nodes 1 and 2.
An important thing to note here is that the CPU load shown is very low: this is basically an idle system, so we're probably measuring a very small number of total writes, probably from internal tasks such as the rebalancer or statistics collection mechanism.
So the take-home here is that the workload distribution tool is really meaningful when the system is under load, and may show spurious imbalances when the system is idle.
View ArticleTo create an admin user 'clustrix', with password 'clustrix' issue the following query as root user:
insert into clustrix_ui.auth_user values (NULL,'clustrix','','','','pbkdf2_sha256$10000$wn7mcOjxzxnS$tHneHomqT+ybfHOnJzIZuGl6dwII1Ru/hw/WU34i+No=',1,1,0,now(),now());
Now you can login as clustrix/clustrix.
View Article
Versions: 7.0
The steps below can be used to uninstall the ClustrixDB software on a server.
WARNING: When you perform this procedure, any data stored on this server will be lost and unrecoverable.
1) Remove thenode from the cluster (if necessary)
If the server on which you wish to uninstall ClustrixDB is currently part of an active cluster, to maintain data protection for the rest of the cluster, first use FLEX DOWN(softfail)to removethe node.
Wait until the entire Flex Down procedure is complete, and the node has been completely removed from the cluster.
For details on using Flex Down to remove a node from your cluster see the documentation: Reducing Your Cluster's Capacity - Flex Down
2) Stop ClustrixDB Services
On each server you want to uninstall ClustrixDB, stop the ClustrixDB services:
#service clustrix stop
3) Run the uninstall_clustrix script
On each server you are uninstalling ClustrixDB on, run the uninstall_clustrix script:
# /opt/clustrix/bin/uninstall_clustrix
You will be prompted to confirm the action. Type "y" for Yes, or CTRL-C to cancel.
After the script exits, the uninstall is complete.
View ArticleThere are already some documentation regarding query logging:
Query Logging
Bad Plan Logging
How to Enable Slow Query Logging
but we'll focus on understanding the query log entries and how you can use them to identify long running queries.
Why Bother?
Although the UI is useful in identifying potential query issues and performing historical comparisons, reviewing the query logs may uncover issues that don't show up in the UI.
Query Log
By default the ClustrixDB logs are located in the directory:
/data/clustrix/log
Each query log entry is timestamped and categorized as:
SLOW: query execution time exceeded the global variable session_log_slow_threshold_ms value (default is 20000 ms); enabled by default with global variable session_log_slow_queries.
DDL: query is a DDL (CREATE, DROP, ALTER) or a SET GLOBAL; enabled by default with global variable session_log_ddl.
BAD: query reads more rows than were returned which may indicate a bad plan or missing index; disabled by default with global variable session_log_bad_queries.
SQLERR: a syntax error, timeout, permission denied, etc.; enabled by default with global variable session_log_bad_queries.
Global Variables
Query logging is governed by these global variables (defaults shown):
mysql> show global variables like 'session_log%';
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| session_log_bad_queries | false |
| session_log_bad_read_ratio | 100 |
| session_log_bad_read_threshold | 4000 |
| session_log_ddl | true |
| session_log_error_queries | true |
| session_log_slow_queries | true |
| session_log_slow_threshold_ms | 20000 |
+--------------------------------+-------+
7 rows in set (0.00 sec)
With v5.2 onwards, measurement units have been added to variable and field names to help identify the value better. For example, the global variable session_log_slow_threshold_ms was called session_log_slow_threshold prior to v5.2.
Query Log Entry
Each query log entry is made up of the following bits of information:
[timestamp] [nodename] [process name] [category] [SID] [db] [user] [ac] [xid] [sql] [status] [execution time] [reads] [inserts] [deletes] [updates] [counts] [rows_read] [forwards] [rows_output] [semaphore_matches] [runtime] [attempts]
Timestamp: Includes the timezone. It is extremelyimportant to have the clocks in sync on all the nodes.
Hostname: Name of the node where the entry was logged.
Process name: The Clustrix process name, "clxnode". Will be "real" on the legacy appliance.
Category: Type of entry in the log. SLOW, DDL, BAD or SQLERR.
SID: session ID. Useful when trying to identify a particular session with the queries that were logged.
DB: database name the query was run from.
User: User executing the query. If you're using statement-based replication, you may want to search for the replication account when troubleshooting statements from the master.
AC: auto-commit. This is useful to determine whether the query was in an explicit transaction. Note that DDLs will all have ac=N even if the DDLs are not in an explicit transaction.
XID: transaction ID. Useful to link a session to an XID when troubleshooting locking issues.
SQL: full query, unless the query exceeds the 4KB statement buffer; the logger will then insert ellipses in the middle of the query and remove part of the query to fit the 4KB limit.
Status: result of the query like rows affected or error message.
Time: total time from when the query was received, compiled, processed and output returned. This is what we look at for SLOW queries.
The rest are internal counters which may be of interest in some troubleshooting exercises.
Sample Log Entries
Here are some sample log entries:
2014-04-28 00:00:00.630935 UTC 2014 clxnode001 clxnode: INSTR SLOW SID:3756073989 db=test [email protected] ac=Y xid=535eff740597800a sql="delete from test_tbl" [Ok: 1103697 rows deleted] time 220365.5ms; reads: 1103700; inserts: 3311091; deletes: 6622182; updates: 0; counts: 1103697; rows_read: 2207394; forwards: 9933282; rows_output: 2; semaphore_matches: 0; runtime: 80800528947; attempts: 1
2014-04-28 00:00:05.918485 UTC 2014 clxnode001 clxnode: INSTR DDL SID:3751805957 db=#undef user=root@localhost ac=N xid=535eff796dffe80a sql="trim binlog `binlogtest` before file 'binlogtest.010086'" [no error] time 1236.3ms; reads: 0; inserts: 0; deletes: 0; updates: 0; counts: 0; rows_read: 0; forwards: 0; rows_output: 0; semaphore_matches: 0; runtime: 0; attempts: 1
2014-04-28 00:00:00.988627 UTC 2014 clxnode001 clxnode: INSTR SQLERR SID:3737319429 db=test [email protected] ac=Y xid=535eff74fac4780a sql="SELECT ID INTO ID_Rule FROM test2_tbl WHERE ID = ID2" [NOT FOUND] time 0.7ms; reads: 3; inserts: 1; deletes: 0; updates: 0; counts: 0; rows_read: 4; forwards: 3; rows_output: 2; semaphore_matches: 0; runtime: 84060; attempts: 1
2014-04-28 00:01:06.780817 UTC 2014 clxnode001 clxnode: INSTR BAD SID:3737319429 db=test [email protected] ac=Y xid=535effb51a2a580a sql="SELECT IFNULL(ID, 0) AS ID, COL2, IFNULL(DATE1,'2014-01-01'), IFNULL(COL4, 0), IFNULL(Code,''), IFNULL(Date2, '2050-01-01'), IFNULL(Date3,'2014-01-01'), IFNULL(Col8, '') FROM test3_tbl WHERE ID = OtherID" [Ok: 1 rows selected] time 1672.7ms; reads: 3; inserts: 1; deletes: 0; updates: 0; counts: 1; rows_read: 1485493; forwards: 4; rows_output: 3; semaphore_matches: 0; runtime: 1119399218; attempts: 1
Observations
Some observations gleaned from the sample entries above:
The first query (delete from test_tbl) has been logged as SLOW because it took 220 secs (recorded as 220365.5ms) to delete 1103697 rows. When troubleshooting replication slave lags, you may want to look for large "rows deleted" or "rows updated" in the MASTER query logs if you suspect large writes to be the cause of the slave lag.
Note that ac=N on the second log entry as trim binlog is a DDL, but other statements could also be in an explicit transaction when started with START TRANSACTION or BEGIN.
Entries 3 and 4 are from the same session as they have the same session ID.
Collecting the Log Entries
With a distributed database, each node will only log what it is processing so you will need to look at the query logs from all of the nodes to identify long running queries.
It's generally easier to work with the log entries in one location so use the clx utility to make your life easier.
For the legacy appliance atool that is also useful is the "uncontinued" script (http://support.clustrix.com/entries/22477162-How-to-group-log-message-that-are-split-via-continued) as the longer queries will be split up with "---continued---". The uncontinued utility strips out this string and joins the lines to make the log entries more readable.
The following is an example of what you may want to run to collect the entries. Replace the logfile names and date/timestamp accordingly:
cd /data/clustrix/ # or some other location like /tmp that has sufficient space
/opt/clustrix/bin/clx cmd 'grep "^[today's date]" /data/clustrix/log/query.log' > q.log.todaysdate
# For the legacy appliance it is slightly different<
clx cmd 'grep "^[today's date]" /var/log/query.log' > q.log.todaysdatecat q.log.todaysdate | ./uncontinued > q.log.todaysdate.uc
Identifying Slow Queries
With the extracted query log from above, you can parse out the longest running queries with:
export TMPDIR=/clustrix #For the appliances, you may want to set TMPDIR to /clustrix.
cat q.log.todaysdate.uc | sed 's/^.*time \([0-9]*\).*$/\1 \0/' | sort -rn > longrunningq.todaysdate
The longest running queries would be at the top of the output file. Here's a sample output after processing the query logs for long running queries:
103166 Jul 11 17:47:58.330347 UTC 2014 clxnode1 real: INSTR SLOW SID:2464620545 db=test [email protected] ac=Y xid=53c1881e3c271802 sql="select col1, col2, col3, col4, col5, col6, col7, col8, col9 from tbl12 where col1>25 and col2=0 and col3<>200" [Ok: 0 rows selected] time 103166.3ms; reads: 5; inserts: 1; deletes: 0; updates: 0; counts: 0; rows_read: 1154141; forwards: 5; rows_output: 2; semaphore_matches: 0; runtime: 4921071027; attempts: 158055 Jul 11 17:25:39.453805 UTC 2014 clxnode1 real: INSTR SLOW SID:2464048129 db=test [email protected] ac=Y xid=53c1831077c1f802 sql="select col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13 from tbl2 where col1 like '123456' and date(col2) = '2014-07-11'" [Ok: 0 rows selected] time 58055.0ms; reads: 5; inserts: 1; deletes: 0; updates: 0; counts: 0; rows_read: 11797794; forwards: 5; rows_output: 2; semaphore_matches: 0; runtime: 40765655404; attempts: 155666 Jul 11 17:46:10.868463 UTC 2014 clxnode1 real: INSTR SLOW SID:2445069313 db=test [email protected] ac=Y xid=53c187e245f42002 sql="select col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col4, col15, col16, col17, col18, col19, col20, col21, col22, col23, col24, col25, col26, col27, col28, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57 from tbl3 where col1='1234567'" [Ok: 1 rows selected] time 55666.1ms; reads: 5; inserts: 1; deletes: 0; updates: 0; counts: 1; rows_read: 11798489; forwards: 6; rows_output: 3; semaphore_matches: 0; runtime: 125292702595; attempts: 1
After having identified the long running queries, you will then need to determine why the queries are slow. This could be various reasons for this:
heavy system load
poor data distribution
missing indexes
network latencies
We may cover these topics in future postings.
Best Practice
The best practice would be to use both the UI and the query log processing to identify possible sources of performance issues.
You may want to setup a nightly cron job that collects and parses out the previous day's query logs entries for SLOW, BAD and SQLERR entries and review them daily just to make sure that your application or users aren't loading the cluster with poor queries. Fixing poor performing queries would free up system resources to process more queries and make the cluster more responsive.
View ArticleOn ClustrixDB there is no individual slow query log. Rather slow queries are logged into the query log based off of the variablessession_log_slow_queriesandsession_log_slow_threshold_ms.
session_log_slow_queries:Enables logging of slow queries. Default true.
session_log_slow_threshold_ms:slow query threshold in miliseconds. Default 20000.
These queries will enable slow query logging for queries taking over 4 seconds to execute.
SET GLOBAL session_log_slow_threshold_ms = 4000;
The node that is processing the query will log slow queries into it's own individual query.log. Thus, you must get all the query.logs from all nodes in order to get everything.
This command will get the query logs for each node.
clx cmd 'grep SLOW /var/log/query.log' > /clustrix/slow_query.log
And this one will sort all the logs by time.
clx cmd 'grep SLOW /var/log/query.log' | sort > /clustrix/slow_query_sorted.log
You can also set slow query logging for the current session (useful in debugging) with:
SET session_log_slow_threshold_ms = 4000;
Check out Query Logging for more information
View ArticleThe following bash command will search for any queries listed as BAD or SLOW submitted on February 18 and 19 in 2016 and write them to a file called slow_bad.log in the /clustrix directory of the node on which the command was run..
/opt/clustrix/bin/clx cmd 'grep "^2016-02-1[8,9]" /data/clustrix/log/query.log'| egrep "BAD"\|"SLOW" >/data/clustrix/slow_bad.log
View ArticleYou can change themax_allowed_packetglobal variable to increase (or decrease) max allowed packet size. For example, to change this to 32MB you would run the following:
sql> set global max_allowed_packet=33554432;
The max value for this is 64MB (67108864). This will affect client connections (like JDBC) as they use this value to determine items like the size of theirreceivingbuffer.
Max_allowed_packet also has an effect on the replication slave. If the master sends an event larger than max_allowed_packet, the slave will stop with error:
Mysql client protocol error: Packet too big.
In such a case, the max_allowed_packet value should be increased (as large as 64MB) to avoid the issue.
View ArticleYou can stop and restart the slave with the following commands:
mysql> stop slave;
mysql> start slave;
To skip an event you would stop the slave and then restart it with the skip clause.
mysql> start slave skip 1;
For more details, check the "Slaving and Replication"please see the Clustrix documentation: Using ClustrixDB as a Slave
View ArticleTo export users you can use the following script:
clustrix_clone_users
Make sure the script is executable once you have downloaded it.
chmod 755 clustrix_clone_users
The script will get the grants for all users in a way that's ideal for backup and import purposes. For example:
./clustrix_clone_users -H myhost -u root >> user_dump.sql
This can then be imported into Clustrix like:
mysql -uroot -h <clustrix IP> < user_dump.sql
For more information on clustrix_clone_users, please see the README
View ArticleWhen importing data to a Clustrix system it is best practices to use the clustrix_import tool. This is because clustrx_import has been designed to take advantage of the parallel nature of clustered databasesto import MySQL dumps into a Clustrix Cluster.Importing mysqldumps with the mysql client results in a single-threaded insert, which will be much slower than clustrix_import.
Detailed information on importing data to Clustrix can be found on our documentation site: Importing Data
If your import is running slowly or failing there are a few things to look at:
You may get much faster import speed if you are running clustrix_import locally from a node, rather than from outside the cluster.
Even though a password for the root user is not required when accessing sql from localhost, it is required for clustrix_import.
We highly recommend running clustrix_import in a screen session so that if you are disconnected from the node, your import will continue to run.
If you have been using triggers in MySQL you will want to run the mysqldump with the "--skip-triggers" argument.
If you are still having problems, please let us know, and include the clustrix_import output to help us understand where/how things are slow.
Related links:
Download Page: clustrix_import tool
Clustrix Documentation: Importing Data
View Article
Clustrix has the ability to ignore replication events for any particular table. It works the same way as specifying databases to ignore (or include) which is documented here.
For a per table basis, you need to use the table mysql_table_replication_policy and the global mysql_default_table_replication_policy, instead of using the table mysql_db_replication_policy and the global mysql_default_db_replication_policy.
For example to ignore db1.table1 and db2.table2 on slave you would:
sql> SET GLOBAL mysql_default_table_replication_policy = true; sql> INSERT INTO mysql_table_replication_policy (slave_name, `database`, `table`, allow) VALUES ('slave', 'db1', 'table1', false); sql> INSERT INTO mysql_table_replication_policy (slave_name, `database`, `table`, allow) VALUES ('slave', 'db2', 'table2', false);
Notes:
You need to specify the slave name as well as db and table names.
This configuration is read when the Slave starts, which happens periodically when its buffer fills. Stop the Slave manually before changing policies.
View Article
(gmp pending remove does not currently work for v7.0 only)
If you ever encounter and issue with a node stuck inpendingstate in the ui you will need to access the clusters control port to remove the pending node.
To access the control port and remove the pending node perform the following commands on one node:
telnet localhost 2048
$ gmp pending remove '1.2.3.4'
$ gmp pending list
Cntrl-D
Where 1.2.3.4 is the IP address of the node stuck in the pending state.
View ArticleThis FAQ applies to any setup with an mdadm Raid device hosting the data partition. This includes AWS with instance store and other cloud instances.
With ClustrixDB rebooting your VM is possible,however,if you have setup a mdadm raid volume for your clustrixDB data directory, you'd want to ensure that after reboot,the mdadm Raid device is "re-assembled" and the data partition is mounted at the correct location.
By default mdadm will re-assemble any array it finds on the available disks, but it may not assign that array the same device name as before. That is fine as long as you don't count on the device name to mount the filesystem. To work around this you can label the filesystem and mount it by label. Editing fstab is the best way to auto mount. Using an init script is trickier because you'd want it to run before clxnode (the ClustrixDB process) starts.
Below i describe how i would setup the storage from scratch and have it be persistent after reboot (using an AWS instance with instance store) :
#Use super user:
sudo su
#install mdadmyum -y install mdadm
#create raid device on md0 or whatever name
yes | mdadm --create /dev/md0 --level=0 -c64 --raid-devices=2/dev/xvdf /dev/xvdg
# make FSmkfs -t ext4 /dev/md0
mkdir -p /data/clustrix
# Give the partition a label
e2label /dev/md0 CLUSTRIX-DATA
# Edit fstab to mount it by label instead of device name. This is necessary as the raid array maybe
# assigned a different device name (in this case md127)
echo 'LABEL=CLUSTRIX-DATA /data/clustrix ext4 defaults,noatime,nodiratime 0 2' >> /etc/fstab
# test fstab by running mount:
mount -a
# The filesystem should be mounted now. Verify with :
df -H
# Backup mdadm config to its config file. (optional)mdadm --verbose --detail --scan >> /etc/mdadm.conf
View ArticleTo terminate all processes belonging to a specific user. In MySQL youwould gather ids with the the following query:
sql> select Id from information_schema.processlist where user="user_name" > ids.txt
Then for loop that file to kill each one.
-----------------------------------------------
With ClustrixDB:
The collection query is very similar:
sql> select session_id as id from system.sessions where user=(select user from system.users where username='<username>');
Example:
sql> select session_id as id from system.sessions where user=(select user from system.users where username='foo'); +----------+ | id | +----------+ | 12289 | | 13398017 | +----------+
The kill processes with:
sql>kill 13398017;
You'll see that the killed process is now gone:
sql> select session_id as id from system.sessions where user=(select user from system.users where username='foo');+----------+ | id | +----------+ | 12289 | | 13460481 | +----------+
If you wanted to kill all processes, just put the recorded ids into a loop and kill them.
View Article