Why does the unit test fail to pass?


Please ignore the previous content!

  • Now, I have tried the latest release v0.9.8, there are only 3 tests failed.
99% tests passed, 3 tests failed out of 274

Total Test time (real) = 821.94 sec

The following tests FAILED:
       174 - rpc_rpc_stub-test (Failed)
       217 - integration-tests_master_config-itest (Failed)
       226 - integration-tests_linked_list-test (Failed)
Errors while running CTest

174: TEST FAILURE
174: Test command: /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/tests-rpc/rpc_stub-test --gtest_output=xml:/home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-rpc__rpc_stub-test/RpcStubTest_IPv6.xml --gtest_filter=RpcStubTest.IPv6
174: Test exit status: 1
174: Log path: /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-rpc__rpc_stub-test/RpcStubTest_IPv6.log
174: Relevant log lines:
174: I0322 13:26:31.500322 22395 mem_tracker.cc:158] MemTracker: hard memory limit is 8.000000 GB
174: I0322 13:26:31.500339 22395 mem_tracker.cc:160] MemTracker: soft memory limit is 6.800000 GB
174: /home/donghq/jinlespace/yugabyte-db/src/yb/rpc/rpc_stub-test.cc:709: Failure
174: Value of: server_address.is_unspecified()
174: Actual: true
174: Expected: false
174: Test failure stack trace:
174: /home/donghq/jinlespace/yugabyte-db/src/yb/rpc/rpc_stub-test.cc:709: @ 0x41b4c2 yb::rpc::RpcStubTest_IPv6_Test::TestBody()
174: Updating /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-rpc__rpc_stub-test/RpcStubTest_IPv6.xml with a link to test log


217: TEST FAILURE
217: Test command: /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/tests-integration-tests/master_config-itest --gtest_output=xml:/home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-integration-tests__master_config-itest/MasterChangeConfigTest_TestNewLeaderWithPendingConfigLoadsSysCatalog.xml --gtest_filter=MasterChangeConfigTest.TestNewLeaderWithPendingConfigLoadsSysCatalog
217: Test exit status: 1
217: Log path: /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-integration-tests__master_config-itest/MasterChangeConfigTest_TestNewLeaderWithPendingConfigLoadsSysCatalog.log
217: Relevant log lines:
217: W0322 13:36:58.572082 13480 external_mini_cluster.cc:1007] Failed to get leader master after 100 attempts, returning the first master.
217: I0322 13:36:58.572091 13480 master_config-itest.cc:347] New leader 127.0.0.1:27222
217: /home/donghq/jinlespace/yugabyte-db/src/yb/integration-tests/master_config-itest.cc:348: Failure
217: Expected: new_master->bound_rpc_addr().port()
217: Which is: 18567
217: To be equal to: new_leader->bound_rpc_addr().port()
217: Updating /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-integration-tests__master_config-itest/MasterChangeConfigTest_TestNewLeaderWithPendingConfigLoadsSysCatalog.xml with a link to test log


226: TEST FAILURE
226: Test command: /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/tests-integration-tests/linked_list-test --gtest_output=xml:/home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-integration-tests__linked_list-test/LinkedListTest_TestLoadWhileOneServerDownAndVerify.xml --gtest_filter=LinkedListTest.TestLoadWhileOneServerDownAndVerify
226: Test exit status: 1
226: Log path: /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-integration-tests__linked_list-test/LinkedListTest_TestLoadWhileOneServerDownAndVerify.log
226: Relevant log lines:
226: @ 0xffffffffffffffff
226:
226: *** Check failure stack trace: ***
226: @ 0x7f1e7f3f4e8b DumpStackTraceAndExit (yb/util/logging.cc:156)
226: @ 0x7f1e7e5bb10c
226: tests-integration-tests/linked_list-test failed to produce an XML output file at /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-integration-tests__linked_list-test/LinkedListTest_TestLoadWhileOneServerDownAndVerify.xml
226: Generating an XML output file using parse_test_failure.py: /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-integration-tests__linked_list-test/LinkedListTest_TestLoadWhileOneServerDownAndVerify.xml
226: Updating /home/donghq/jinlespace/yugabyte-db/build/release-gcc-dynamic-community/yb-test-logs/tests-integration-tests__linked_list-test/LinkedListTest_TestLoadWhileOneServerDownAndVerify.xml with a link to test log


Hi,

Could you check other files in this directory, and upload relevant files/error snippets from those files if any.

Adding Mikhail (@mbautin) in case he has any ideas off-hand.

Hi @lumedar, thank you for reporting this issue! A lot of our tests, including the ones you mentioned above, are really integration tests that are running an entire distributed YugaByte DB mini-cluster using multiple processes locally, and therefore they can be quite sensitive to the test environment. Can you share the specs (available memory / # of CPUs) of the host that you are running the tests on? Also could you please tar/gz and share the yb-test-logs directory so we can take a closer look at the test failures that you are observing?

Also could you please post the output of ulimit -a in the shell that you’re running the tests from? Specifically, setting both the limit on the number of open files and the number of processes/threads to something really high, such as 100000, might help.

I would also recommend that you try out the master branch or the more recent v0.9.8.0 tag.

We will be sharing more details about our testing methodology soon.

2 Likes

I have uploaded the relevant logs, and here is the `ulimit -a`:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 512453
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1000000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 512453
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Thanks. I have uploaded the relevant files.

The last two test cases can sometimes pass.

Nice :slight_smile: . Where there any other changes in your environment that you think might have helped? Even with older release, would have expected most of the tests to pass.

regards,
Kannan

@bharat:

Posting relevant snippet from the reported log:

https://raw.githubusercontent.com/lumedar/YugaByteLog/master/MasterChangeConfigTest_TestNewLeaderWithPendingConfigLoadsSysCatalog.log

Have we seen this in the test runs? Could this be related to increasing the default number of missed heartbeats when a leader election is triggered?

[m-2] I0322 13:36:58.571132 15390 raft_consensus.cc:436] Election start skipped as do_not_start_election_test_only flag is set to true.
I0322 13:36:58.572062 13480 external_mini_cluster.cc:1005] GetLeaderMasterIndex@100 hit error: Timed out (yb/rpc/rpc.cc:166): GetLeaderMasterRpc(addrs: 127.0.0.1:27222,127.0.0.1:26801,127.0.0.1:23935,127.0.0.1:18567, num_attempts: 99) passed its deadline 15599955.996s (now: 15599956.081s)
W0322 13:36:58.572082 13480 external_mini_cluster.cc:1007] Failed to get leader master after 100 attempts, returning the first master.
I0322 13:36:58.572091 13480 master_config-itest.cc:347] New leader 127.0.0.1:27222
/home/donghq/jinlespace/yugabyte-db/src/yb/integration-tests/master_config-itest.cc:348: Failure
Expected: new_master->bound_rpc_addr().port()
Which is: 18567
To be equal to: new_leader->bound_rpc_addr().port()
Which is: 27222

We also see this particular error in our internal CI environment. Opened a GitHub issue: LinkedListTest.TestLoadWhileOneServerDownAndVerify fails sometimes with "Failure: IO error (yb/client/batcher.cc:189): Some errors occurred" · Issue #120 · yugabyte/yugabyte-db · GitHub

  1. No removing the “-Werror” flags
  2. Using YugaByte/linuxbrew instead of devtoolset-6
  3. Using the latest release branch
    The above three changes may help to pass more tests.

Now I try to compile the release v0.9.8 and run its unit test in a docker container(base image:centos7), here is the result:

99% tests passed, 1 tests failed out of 274

Total Test time (real) = 608.35 sec

The following tests FAILED:
        135 - util_inetaddress-test (Failed)
Errors while running CTest

Most of the time, util_inetaddress-test failed with:

tcmalloc: large alloc 1811955712 bytes == 0x188c000 @  0x7f0958f46092 0x7f095679b699
*** Aborted at 1521790543 (unix time) try "date -d @1521790543" if you are using GNU date ***
PC: @     0x7f0955d22278 __memcpy_avx_unaligned
*** SIGSEGV (@0x0) received by PID 12368 (TID 0x7f095a1f4840) from PID 0; stack trace: ***
......

the log file:https://github.com/lumedar/YugaByteLog/blob/master/InetAddressTest_TestErrors.log
In a few cases, it can pass.
So, can you give me more help? Thanks!

PS:
in function “int SubstitutedSize(StringPiece format,const SubstituteArg* const* args_array);” of substitute.cc
I added this test code:

LOG(INFO) << "MYTEST_BEGIN";
LOG(INFO) << "args: " << args_array[index]->data() <<" len: "<<args_array[index]->size();
LOG(INFO) << "MYTEST_END";

output:

I0323 10:21:08.141396  2653 substitute.cc:63] MYTEST_BEGIN
I0323 10:21:08.141398  2653 substitute.cc:64] args: Connection timed out len: 7017560849296405558
I0323 10:21:08.141402  2653 substitute.cc:65] MYTEST_END

Why?

Haven’t seen this failure. Filed an issue: util_inetaddress-test failure when running in a docker container · Issue #121 · yugabyte/yugabyte-db · GitHub – will take a look soon. Thank you!

Thanks for reporting this @lumedar. We do see this in our CI runs at the rate of 1 in 20 times or so. This should be independent of the increase of missed heartbeat gflag change @kannan - we do stop elections using do_not_start_election_test_only in this unit test. Will triage this with higher priority…