{"id":1938505,"date":"2021-07-04T13:31:00","date_gmt":"2021-07-04T11:31:00","guid":{"rendered":"http:\/\/nhub.news\/?p=1938505"},"modified":"2021-07-04T17:02:09","modified_gmt":"2021-07-04T15:02:09","slug":"mysql-to-redshift-4-ways-to-replicate-your-data","status":"publish","type":"post","link":"http:\/\/nhub.news\/de\/2021\/07\/mysql-to-redshift-4-ways-to-replicate-your-data\/","title":{"rendered":"MySQL to Redshift: 4 Ways to Replicate Your Data"},"content":{"rendered":"

When you pair MySQL and Redshift, you eliminate the risk of crashing your production database while running queries.<\/b>
\nJoin the DZone community and get the full member experience. MySQL is the most popular open source cloud database in the world, and for good reason. It\u2019s powerful, flexible, and extremely reliable. Tens of thousands of companies use MySQL to power their web-based applications and services every day. But when it comes to data analytics, it\u2019s a different story. MySQL is quickly bogged down by even the smallest analytical queries, putting your entire application at risk of crashing. As one FlyData customer said to us, \u201cI have nightmares about our MySQL production database going down.\u201d That\u2019s why so many companies turn to Amazon Redshift to complement MySQL. Redshift is built to handle petabytes of data and deliver analytics in a fraction of the time. When you pair MySQL and Redshift, you eliminate the risk of crashing your production database while running queries. There are several ways to replicate your MySQL data to Redshift. But first, let\u2019s dig a little deeper into why you should replicate your MySQL database to Redshift. Many companies that use MySQL to power their web applications choose Redshift for data analytics. There are several reasons why you should do this, too: Because of MySQL\u2019s inherent weaknesses, many companies replicate data to Redshift for their analytics needs. There are 4 ways to accomplish this: The simplest way to replicate to Redshift is to export your entire MySQL data. However, this is also the least efficient method. There are three steps: To start, export data using MySQL\u2019s mysqldump command. A typical mysqldump command looks like this: The output of this command is your MySQL SQL statement. You cannot run the SQL on Redshift as is \u2014 you\u2019ll have to transform the statement into a format suitable for Redshift import. For the best upload performance, convert your SQL statement into TSV (tab-separated values) format. You can do this by using the Redshift COPY command. The COPY command converts your SQL statement into TSV format. Then it batch uploads the files into a Redshift table in Amazon S3. For example, a row of data in your MySQL dump would look like this: Using COPY, it will be transformed into this: Note that values are separated by a tab character (\\t). You may also have to convert data values to be Redshift compatible. This is because MySQL and Redshift support different column and data types. For example, the DATE value \u20180000-00-00\u2019 is valid in MySQL, but will throw an error in Redshift. You have to convert the value into an acceptable Redshift format, like \u20180001-01-01.\u2019 After you\u2019ve transformed your MySQL statement, the last step is to import it from S3 to Redshift. To do this, simply run the COPY command: Although import and export is the simplest way to replicate to Redshift, it is not ideal for frequent updates. For example, it takes roughly 30 minutes to export 18 GB of data from MySQL over a 100 Mbps network. It takes another 30 minutes to import that data into Redshift. This assumes you experience zero connection issues during import or export, which would force you to start the process over. A more efficient method of replicating MySQL to Redshift is incremental SELECT and COPY. If the import and export is too slow for your needs, incremental SELECT and COPY might be your answer. The SELECT and COPY method only updates the records that have changed since the last update. This takes considerably less time and bandwidth compared to importing and exporting your entire dataset. SELECT and COPY enables you to sync MySQL and Redshift much more frequently. To use incremental SELECT and COPY, your MySQL table has to meet a couple of conditions: Like import and export, there are three steps to this method: Incremental SELECT exports only the rows that have changed since the last update. The SELECT query you run on MySQL looks like this: Save the result to a file for transformation. This transformation step is the same as the import and export method. Transform the MySQL data into TSV format for Redshift. At this point, your MySQL TSV file includes both updated rows and newly inserted rows. You cannot simply run a COPY command straight to your destination Redshift table. This would cause the updated rows to be duplicated. To avoid duplicate rows, use the DELSERT (delete + insert) technique: Where id is the unique key of the table. Incremental SELECT and COPY is more efficient than import and export, but it has limitations of its own. The main problem is that rows deleted from your MySQL table stay in Redshift indefinitely. This isn\u2019t an issue if you want to keep historical data on Redshift while purging old data from MySQL. Otherwise, deleted rows in Redshift can cause major headaches during data analysis. Another drawback of this method is that it doesn\u2019t replicate table schema changes. When a column is added or dropped from the MySQL table, you need to make the corresponding change on the Redshift table manually. Finally, the query used to pull updated rows from a MySQL table can affect the performance of your MySQL database. If any of these drawbacks are a dealbreaker, this next method is for you. Change data capture (CDC) is a technique that captures changes made to data in MySQL and applies it to the destination Redshift table. It\u2019s similar to incremental SELECT and COPY in that it only imports changed data, not the entire database. Unlike incremental SELECT and COPY, however, CDC allows you to achieve true replication of MySQL to Redshift. To use the CDC method with a MySQL database, you must utilize the binary change log (binlog). Binlog allows you to capture change data as a stream, enabling near real-time replication. Binlog not only captures data changes (insert, update, delete) but also table schema changes such as add\/drop column. It also ensures that rows deleted from MySQL are also deleted in Redshift. When you use CDC with binlog, you are actually writing an application that reads, transforms, and imports streaming data from MySQL to Redshift. There is an open source library you can use to do this called mysql-replication-listener. This C++ library offers a streaming API to read data from MySQL binlog in real-time. A high-level API is also available for a couple of languages: kodama (Ruby) and python-mysql-replication (Python). First, set the MySQL config parameters to enable binlog. The following is a list of parameters related to binlog: Parameter binlog_format sets the format for how binlog events are stored in the binlog file. There are 3 supported formats: statement, mixed, and row. Statement format saves queries in the binlog files as-is (e.g. UPDATE SET firstname=\u2019Tom\u2019 WHERE id=293;). Although it saves binlog file size, it has issues when used for replication. For replication to Redshift, use row format. Row format saves changed values in the binlog files. It increases the binlog file size but ensures data consistency between MySQL and Amazon Redshift. log_bin sets the path where binlog files are stored. expire_logs_days determines how many days binlog files are kept. Specify the tables you\u2019d like to replicate in the replicate-wild-do-table parameter. Only those tables specified should go into the binlog files. We recommend keeping binlog files for a couple days. This ensures you have time to address any issues that arise during replication. If you use a MySQL replication slave server as the source, it\u2019s important to specify the log-slave-updates to TRUE. Otherwise, data changes made on the replication master will not be logged in the binlog. Also, your MySQL account needs to have the following privileges in order to perform replication related tasks: When you use the binlog, \u201cexport\u201d is really a real-time data stream of your MySQL binlog files. How the binlog data is delivered depends on the API you use. For example, with Kodama, binlog data is delivered as a stream of binlog events. Kodama lets you register event handlers for different event types (insert, update, delete, alter table, create table, etc). Your application will receive binlog events. It will then generate an output ready for Redshift import (for data changes) or schema change (for table schema changes). The data change import is similar to the transformation steps of our other replication methods. Unlike the others, however, binlog allows you to handle deleted events. You need to handle deleted events specifically to maintain Redshift Upload Performance. Finally, it\u2019s time to import your binlog data stream. The problem is Redshift doesn\u2019t have steaming upload functionality. Use the DELSERT import technique we outlined in the incremental SELECT and COPY method. Binlog is the ideal method of replication from MySQL to Redshift, but it still has downsides. Building your CDC application requires serious development effort. In addition to the data streaming flow we described above, you will have to build: Transaction management. Track data streaming performance in case an error forces your application to stop while reading binlog data. Transaction management ensures you can pick up where you left off. Data buffering and retry. Similarly, Redshift can become unavailable while your application is sending data. Your application needs to buffer unsent data until the Redshift cluster comes back online. If this step is done incorrectly, it can cause data loss or duplicate data. Table schema change support. A table schema change binlog event (alter\/add\/drop table) comes as a native MySQL SQL statement which does not run on Redshift as-is. To support table schema changes, you\u2019ll have to convert MySQL statements to the corresponding Amazon Redshift statements. With the help of ETL tools, you can replicate data to Redshift in near real-time. Unlike the CDC method, such tools can manage the entire replication process and automatically map MySQL data types into formats used by Redshift, so you don\u2019t have to. You can even synchronize multiple MySQL databases (and other types of databases as well) to Redshift at the same time. Plus, the setup process is simple and short. You rely on MySQL to power your business, but its limitations with data analytics are well-known. Redshift provides a simple, powerful solution to your BI needs. Together, MySQL and Redshift can push your business to the next level. As you\u2019ve seen, there are numerous ways to replicate data from MySQL to Redshift. Methods range from simple to complex, and painfully slow to near real-time. The method you choose depends on several factors: Remember: The fastest, truest replication method is change data capture (CDC), which utilizes MySQL\u2019s binlog. The downside is that it requires developer hours to build and maintain the application. Make your decision by considering the aforementioned factors, your business goals, and data analytics needs. Published at DZone with permission of Ben Putano, DZone MVB. See the original article here. Opinions expressed by DZone contributors are their own.<\/p>\n