Inconsistent reads immediately after a write with MongoDB replication, encountered during real service development

I ran into a problem with MongoDB while developing a production service: when data was read immediately after being written, the newly written data was not returned. I didn't know much about the topic at the time, so it cost me a lot of time and effort. This post describes the circumstances and the cause of the problem.

Problem situation

  • At the time, the company used MongoDB as the main database for many of its services.
  • The problem was discovered while building a new service and preparing it for a global launch. Everything passed testing on stage, but abnormal behavior appeared once the service was deployed to production.
    • Requests to the server failed intermittently. (The server API called by the client writes to MongoDB internally and then reads the data back.)
    • In normal operation, the written data is read back and the next step is performed. When the read failed, the client received an error.
    • A request is processed in the following flow:
      • The client calls the server.
      • The server sends a request to an AWS service and waits for the result.
      • When the AWS service completes, the server writes the result to MongoDB.
      • After writing, the server reads the data back from MongoDB and performs the next step.
    • Looking at the AWS call on the server, the AWS service was operating normally.
    • When I connected to MongoDB directly to check the server's write, the data was there. In other words, the write operation itself was fine.
    • The read step of the request failed even though the written data was in the DB. It was hard to fix because the exact cause was unknown.
  • No abnormalities were found in the existing test environment.
    • The test environment was stage. A problem that never appeared during testing appeared in production.

Resolution

  • At first I didn't suspect a difference between environments. I assumed it was a bug in the code and reviewed the logic, but even after several passes I couldn't find any significant problem.
  • Since I still couldn't find a solution, I asked more experienced colleagues and we investigated together.
    • The problem was unfamiliar to my colleagues as well, so it took some time to solve.
  • Around that time I had been learning a little about distributed databases, which let me work out the cause.
    • To find the cause, I compared stage and production. The differences were as follows:
    • In stage, MongoDB ran as a single node without replication.
    • In production, MongoDB ran as a replica set with two secondary nodes.
  • In production, the replica set had been configured for availability.
    • With replication, MongoDB in production applies a write and then propagates the data to the secondary nodes.
    • Since nothing had been configured to account for this, a read issued immediately after a write ran before the data had propagated to the secondary, and the secondary responded with the old data.
      • The propagation time was measured at roughly 2 to 4 seconds.
    • When I checked the data later in MongoDB Compass, the propagation time had already passed, so the written data showed up correctly.
  • The options that came up at the time for solving this were:
    • Poll from the client until the DB state changes. This seemed the simplest.
    • Poll from the server until the DB state changes. From the client's perspective, the existing API can be used without changes.
    • Configure MongoDB, at the query or collection level, so that data becomes readable only once propagation has completed.
  • I went with the second option. Rather than reading immediately after the write, the server adds a small delay between the write and the read.
    • After adding the delay, I confirmed that it worked normally.
  • Through this, I learned a little more about the CAP theorem.
    • A distributed system is described by three properties: consistency, availability, and partition tolerance.
    • Since partitions have to be tolerated in a distributed system, the choice comes down to consistency versus availability: to guarantee one, you have to give up some of the other.
      • Consistency: the data must be the same no matter which of the distributed nodes is accessed. If consistency is broken, the same query can return different data depending on the node.
      • Availability: requests are still processed normally even if one or more of the distributed nodes fail or fall out of sync.
    • This was a case where consistency was broken because the DB setup at the time favored availability.
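
The stale read and the server-side polling option described above can be reproduced with a toy model. This is a minimal sketch using only the Python standard library, not MongoDB itself: `LaggingReplicaSet`, `read_with_polling`, the key `job-1`, and the 0.2-second lag are hypothetical names and values chosen for illustration (the propagation time observed in production was about 2 to 4 seconds).

```python
import time


class LaggingReplicaSet:
    """Toy model: writes land on the primary; the secondary sees them after `lag` seconds."""

    def __init__(self, lag):
        self.lag = lag
        self.primary = {}
        self.secondary = {}
        self.pending = []  # (visible_at, key, value) still on the way to the secondary

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((time.monotonic() + self.lag, key, value))

    def read_primary(self, key):
        return self.primary.get(key)

    def read_secondary(self, key):
        # Apply every replicated write whose propagation time has passed.
        now = time.monotonic()
        still_pending = []
        for visible_at, k, v in self.pending:
            if visible_at <= now:
                self.secondary[k] = v
            else:
                still_pending.append((visible_at, k, v))
        self.pending = still_pending
        return self.secondary.get(key)


def read_with_polling(read, key, expected, timeout, interval):
    """Server-side polling: retry the read until it returns `expected` or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = read(key)
        if value == expected:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{key!r} not visible after {timeout}s")
        time.sleep(interval)


db = LaggingReplicaSet(lag=0.2)
db.write("job-1", "done")

stale = db.read_secondary("job-1")  # read immediately after the write
fresh = read_with_polling(db.read_secondary, "job-1", "done", timeout=2.0, interval=0.05)
print(stale, fresh)  # prints: None done
```

A fixed delay, as used in the actual fix, works as long as the delay is longer than the propagation time. Polling with a timeout is one way to avoid guessing that number, at the cost of repeated reads.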

Summary

  • If the DB uses replication, be aware that written data may only become readable some time after the write.
  • It was a good experience to see the CAP theorem play out with my own eyes.
  • Study the technologies behind the service you are developing whenever you have time. You can't know everything, but the more you know, the easier troubleshooting gets.
  • Be aware that stage and production may differ. Differences between the test environment and the real operating environment can cause problems, so it is worth thinking about them in advance.
    • To reduce debugging time, keep the production and stage environments identical or as close as possible.
    • Time is also a cost. Debugging delays bring extra costs such as labor and a postponed launch. As long as the extra server cost isn't too large in comparison, I think it is worth giving stage the same configuration as production.
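
For the first point, the option of having MongoDB itself guarantee that written data is readable was not tried in this post, but it could look roughly like the connection settings below. This is a sketch under assumptions: the host names, the database `appdb`, and the replica set name `rs0` are hypothetical, and whether these settings fit depends on the deployment. `replicaSet`, `readPreference`, `w`, and `readConcernLevel` are standard MongoDB connection string options.

```python
from urllib.parse import urlencode

# Hypothetical hosts: one primary and two secondaries, as in the production setup.
hosts = ["mongo-a:27017", "mongo-b:27017", "mongo-c:27017"]

options = {
    "replicaSet": "rs0",             # hypothetical replica set name
    "readPreference": "primary",     # route reads to the node that accepts writes
    "w": "majority",                 # acknowledge a write once a majority of nodes have it
    "readConcernLevel": "majority",  # return only majority-acknowledged data
}

# Connection string that can be passed to a MongoDB driver.
uri = "mongodb://" + ",".join(hosts) + "/appdb?" + urlencode(options)
print(uri)
```

Reading from the primary gives up some read scaling in exchange for seeing your own writes, which is the same consistency versus availability trade-off described above.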