Searched hist:57146 (Results 1 - 4 of 4) sorted by relevance

/linux-master/include/linux/
H A Doom.hdiff 9066e5cf Tue Aug 11 19:31:22 MDT 2020 Yafang Shao <laoar.shao@gmail.com> mm, oom: make the calculation of oom badness more accurate

Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 9066e5cf Tue Aug 11 19:31:22 MDT 2020 Yafang Shao <laoar.shao@gmail.com> mm, oom: make the calculation of oom badness more accurate

Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/mm/
H A Doom_kill.cdiff 9066e5cf Tue Aug 11 19:31:22 MDT 2020 Yafang Shao <laoar.shao@gmail.com> mm, oom: make the calculation of oom badness more accurate

Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 9066e5cf Tue Aug 11 19:31:22 MDT 2020 Yafang Shao <laoar.shao@gmail.com> mm, oom: make the calculation of oom badness more accurate

Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/fs/proc/
H A Dbase.cdiff 9066e5cf Tue Aug 11 19:31:22 MDT 2020 Yafang Shao <laoar.shao@gmail.com> mm, oom: make the calculation of oom badness more accurate

Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff 9066e5cf Tue Aug 11 19:31:22 MDT 2020 Yafang Shao <laoar.shao@gmail.com> mm, oom: make the calculation of oom badness more accurate

Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.

Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.

The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.

To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.

[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Qian Cai <cai@lca.pw>
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/linux-master/kernel/sched/
H A Dfair.cdiff 05cbdf4f Fri Sep 21 11:48:59 MDT 2018 Mel Gorman <mgorman@techsingularity.net> sched/numa: Limit the conditions where scan period is reset

migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
cross-node migration. In the event of excessive load balancing due to
saturation, this may result in the scan rate being pegged at maximum and
further overloading the machine.

This patch only resets the scan if NUMA balancing is active, a preferred
node has been selected and the task is being migrated from the preferred
node as these are the most harmful. For example, a migration to the preferred
node does not justify a faster scan rate. Similarly, a migration between two
nodes that are not preferred is probably bouncing due to over-saturation of
the machine. In that case, scanning faster and trapping more NUMA faults
will further overload the machine.

Specjbb2005 results (8 warehouses)
Higher bops are better

2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 203370 205332 0.964744
1 328431 319785 -2.63252

2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 206070 206585 0.249915

2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188386 189162 0.41192
1 201566 213760 6.04963

4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 59157.4 58736.8 -0.710985
1 105495 105419 -0.0720413

Some events stats before and after applying the patch.

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,825,492 14,285,708
migrations 1,152,509 1,180,621
faults 371,948 339,114
cache-misses 55,654,206,041 55,205,631,894
sched:sched_move_numa 1,856 843
sched:sched_stick_numa 4 6
sched:sched_swap_numa 428 219
migrate:mm_migrate_pages 898 365

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 57146 26907
numa_hint_faults_local 51612 24279
numa_hit 238164 239771
numa_huge_pte_updates 16 0
numa_interleave 63 68
numa_local 238085 239688
numa_other 79 83
numa_pages_migrated 883 363
numa_pte_updates 67540 27415

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,288,525 3,202,779
migrations 38,652 37,186
faults 111,678 106,076
cache-misses 12,111,197,376 12,024,873,744
sched:sched_move_numa 900 931
sched:sched_stick_numa 0 0
sched:sched_swap_numa 5 1
migrate:mm_migrate_pages 714 637

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 18572 17409
numa_hint_faults_local 14850 14367
numa_hit 73197 73953
numa_huge_pte_updates 11 20
numa_interleave 25 25
numa_local 73138 73892
numa_other 59 61
numa_pages_migrated 712 668
numa_pte_updates 24021 27276

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,451,543 8,474,013
migrations 202,804 254,934
faults 310,024 320,506
cache-misses 253,522,507 110,580,458
sched:sched_move_numa 213 725
sched:sched_stick_numa 0 0
sched:sched_swap_numa 2 7
migrate:mm_migrate_pages 88 145

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 11830 22797
numa_hint_faults_local 11301 21539
numa_hit 90038 89308
numa_huge_pte_updates 0 0
numa_interleave 855 865
numa_local 89796 88955
numa_other 242 353
numa_pages_migrated 88 149
numa_pte_updates 12039 22930

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,049,153 2,195,628
migrations 11,405 11,179
faults 162,309 149,656
cache-misses 7,203,343 8,117,515
sched:sched_move_numa 22 49
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 1 5

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 1693 3577
numa_hint_faults_local 1669 3476
numa_hit 25177 26142
numa_huge_pte_updates 0 0
numa_interleave 194 358
numa_local 24993 26042
numa_other 184 100
numa_pages_migrated 1 5
numa_pte_updates 1577 3587

perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 94,515,937 100,602,296
migrations 4,203,554 4,135,630
faults 832,697 789,256
cache-misses 226,248,698,331 226,160,621,058
sched:sched_move_numa 1,730 1,366
sched:sched_stick_numa 14 16
sched:sched_swap_numa 432 374
migrate:mm_migrate_pages 1,398 1,350

vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 80079 47857
numa_hint_faults_local 68620 39768
numa_hit 241187 240165
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 241186 240165
numa_other 1 0
numa_pages_migrated 1347 1224
numa_pte_updates 80729 48354

perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 63,704,961 58,515,496
migrations 573,404 564,845
faults 230,878 245,807
cache-misses 76,568,222,781 73,603,757,976
sched:sched_move_numa 509 996
sched:sched_stick_numa 31 10
sched:sched_swap_numa 182 193
migrate:mm_migrate_pages 541 646

vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 8501 13422
numa_hint_faults_local 2960 5619
numa_hit 35526 36118
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 35526 36116
numa_other 0 2
numa_pages_migrated 539 616
numa_pte_updates 8433 13374

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
diff 3f9672ba Fri Sep 21 11:48:58 MDT 2018 Srikar Dronamraju <srikar@linux.vnet.ibm.com> sched/numa: Reset scan rate whenever task moves across nodes

Currently task scan rate is reset when NUMA balancer migrates the task
to a different node. If NUMA balancer initiates a swap, reset is only
applicable to the task that initiates the swap. Similarly no scan rate
reset is done if the task is migrated across nodes by traditional load
balancer.

Instead move the scan reset to the migrate_task_rq. This ensures the
task moved out of its preferred node, either gets back to its preferred
node quickly or finds a new preferred node. Doing so, would be fair to
all tasks migrating across nodes.

Specjbb2005 results (8 warehouses)
Higher bops are better

2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 200668 203370 1.3465
1 321791 328431 2.06345

2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 204848 206070 0.59654

2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188098 188386 0.153112
1 200351 201566 0.606436

4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 58145.9 59157.4 1.73959
1 103798 105495 1.63491

Some events stats before and after applying the patch.

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,912,183 13,825,492
migrations 1,155,931 1,152,509
faults 367,139 371,948
cache-misses 54,240,196,814 55,654,206,041
sched:sched_move_numa 1,571 1,856
sched:sched_stick_numa 9 4
sched:sched_swap_numa 463 428
migrate:mm_migrate_pages 703 898

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 50155 57146
numa_hint_faults_local 45264 51612
numa_hit 239652 238164
numa_huge_pte_updates 36 16
numa_interleave 68 63
numa_local 239576 238085
numa_other 76 79
numa_pages_migrated 680 883
numa_pte_updates 71146 67540

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,156,720 3,288,525
migrations 30,354 38,652
faults 97,261 111,678
cache-misses 12,400,026,826 12,111,197,376
sched:sched_move_numa 4 900
sched:sched_stick_numa 0 0
sched:sched_swap_numa 1 5
migrate:mm_migrate_pages 20 714

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 272 18572
numa_hint_faults_local 186 14850
numa_hit 71362 73197
numa_huge_pte_updates 0 11
numa_interleave 23 25
numa_local 71299 73138
numa_other 63 59
numa_pages_migrated 2 712
numa_pte_updates 0 24021

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,606,824 8,451,543
migrations 155,352 202,804
faults 301,409 310,024
cache-misses 157,759,224 253,522,507
sched:sched_move_numa 168 213
sched:sched_stick_numa 0 0
sched:sched_swap_numa 3 2
migrate:mm_migrate_pages 125 88

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 4650 11830
numa_hint_faults_local 3946 11301
numa_hit 90489 90038
numa_huge_pte_updates 0 0
numa_interleave 892 855
numa_local 90034 89796
numa_other 455 242
numa_pages_migrated 124 88
numa_pte_updates 4818 12039

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,113,167 2,049,153
migrations 10,533 11,405
faults 142,727 162,309
cache-misses 5,594,192 7,203,343
sched:sched_move_numa 10 22
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 6 1

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 744 1693
numa_hint_faults_local 584 1669
numa_hit 25551 25177
numa_huge_pte_updates 0 0
numa_interleave 263 194
numa_local 25302 24993
numa_other 249 184
numa_pages_migrated 6 1
numa_pte_updates 744 1577

perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 101,227,352 94,515,937
migrations 4,151,829 4,203,554
faults 745,233 832,697
cache-misses 224,669,561,766 226,248,698,331
sched:sched_move_numa 617 1,730
sched:sched_stick_numa 2 14
sched:sched_swap_numa 187 432
migrate:mm_migrate_pages 316 1,398

vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 24195 80079
numa_hint_faults_local 21639 68620
numa_hit 238331 241187
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 238331 241186
numa_other 0 1
numa_pages_migrated 204 1347
numa_pte_updates 24561 80729

perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 62,738,978 63,704,961
migrations 562,702 573,404
faults 228,465 230,878
cache-misses 75,778,067,952 76,568,222,781
sched:sched_move_numa 648 509
sched:sched_stick_numa 13 31
sched:sched_swap_numa 137 182
migrate:mm_migrate_pages 733 541

vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 10281 8501
numa_hint_faults_local 3242 2960
numa_hit 36338 35526
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 36338 35526
numa_other 0 0
numa_pages_migrated 706 539
numa_pte_updates 10176 8433

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
diff 05cbdf4f Fri Sep 21 11:48:59 MDT 2018 Mel Gorman <mgorman@techsingularity.net> sched/numa: Limit the conditions where scan period is reset

migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
cross-node migration. In the event of excessive load balancing due to
saturation, this may result in the scan rate being pegged at maximum and
further overloading the machine.

This patch only resets the scan if NUMA balancing is active, a preferred
node has been selected and the task is being migrated from the preferred
node as these are the most harmful. For example, a migration to the preferred
node does not justify a faster scan rate. Similarly, a migration between two
nodes that are not preferred is probably bouncing due to over-saturation of
the machine. In that case, scanning faster and trapping more NUMA faults
will further overload the machine.

Specjbb2005 results (8 warehouses)
Higher bops are better

2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 203370 205332 0.964744
1 328431 319785 -2.63252

2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 206070 206585 0.249915

2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188386 189162 0.41192
1 201566 213760 6.04963

4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 59157.4 58736.8 -0.710985
1 105495 105419 -0.0720413

Some events stats before and after applying the patch.

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,825,492 14,285,708
migrations 1,152,509 1,180,621
faults 371,948 339,114
cache-misses 55,654,206,041 55,205,631,894
sched:sched_move_numa 1,856 843
sched:sched_stick_numa 4 6
sched:sched_swap_numa 428 219
migrate:mm_migrate_pages 898 365

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 57146 26907
numa_hint_faults_local 51612 24279
numa_hit 238164 239771
numa_huge_pte_updates 16 0
numa_interleave 63 68
numa_local 238085 239688
numa_other 79 83
numa_pages_migrated 883 363
numa_pte_updates 67540 27415

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,288,525 3,202,779
migrations 38,652 37,186
faults 111,678 106,076
cache-misses 12,111,197,376 12,024,873,744
sched:sched_move_numa 900 931
sched:sched_stick_numa 0 0
sched:sched_swap_numa 5 1
migrate:mm_migrate_pages 714 637

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 18572 17409
numa_hint_faults_local 14850 14367
numa_hit 73197 73953
numa_huge_pte_updates 11 20
numa_interleave 25 25
numa_local 73138 73892
numa_other 59 61
numa_pages_migrated 712 668
numa_pte_updates 24021 27276

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,451,543 8,474,013
migrations 202,804 254,934
faults 310,024 320,506
cache-misses 253,522,507 110,580,458
sched:sched_move_numa 213 725
sched:sched_stick_numa 0 0
sched:sched_swap_numa 2 7
migrate:mm_migrate_pages 88 145

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 11830 22797
numa_hint_faults_local 11301 21539
numa_hit 90038 89308
numa_huge_pte_updates 0 0
numa_interleave 855 865
numa_local 89796 88955
numa_other 242 353
numa_pages_migrated 88 149
numa_pte_updates 12039 22930

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,049,153 2,195,628
migrations 11,405 11,179
faults 162,309 149,656
cache-misses 7,203,343 8,117,515
sched:sched_move_numa 22 49
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 1 5

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 1693 3577
numa_hint_faults_local 1669 3476
numa_hit 25177 26142
numa_huge_pte_updates 0 0
numa_interleave 194 358
numa_local 24993 26042
numa_other 184 100
numa_pages_migrated 1 5
numa_pte_updates 1577 3587

perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 94,515,937 100,602,296
migrations 4,203,554 4,135,630
faults 832,697 789,256
cache-misses 226,248,698,331 226,160,621,058
sched:sched_move_numa 1,730 1,366
sched:sched_stick_numa 14 16
sched:sched_swap_numa 432 374
migrate:mm_migrate_pages 1,398 1,350

vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 80079 47857
numa_hint_faults_local 68620 39768
numa_hit 241187 240165
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 241186 240165
numa_other 1 0
numa_pages_migrated 1347 1224
numa_pte_updates 80729 48354

perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 63,704,961 58,515,496
migrations 573,404 564,845
faults 230,878 245,807
cache-misses 76,568,222,781 73,603,757,976
sched:sched_move_numa 509 996
sched:sched_stick_numa 31 10
sched:sched_swap_numa 182 193
migrate:mm_migrate_pages 541 646

vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 8501 13422
numa_hint_faults_local 2960 5619
numa_hit 35526 36118
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 35526 36116
numa_other 0 2
numa_pages_migrated 539 616
numa_pte_updates 8433 13374

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
diff 3f9672ba Fri Sep 21 11:48:58 MDT 2018 Srikar Dronamraju <srikar@linux.vnet.ibm.com> sched/numa: Reset scan rate whenever task moves across nodes

Currently task scan rate is reset when NUMA balancer migrates the task
to a different node. If NUMA balancer initiates a swap, reset is only
applicable to the task that initiates the swap. Similarly no scan rate
reset is done if the task is migrated across nodes by traditional load
balancer.

Instead move the scan reset to the migrate_task_rq. This ensures the
task moved out of its preferred node, either gets back to its preferred
node quickly or finds a new preferred node. Doing so, would be fair to
all tasks migrating across nodes.

Specjbb2005 results (8 warehouses)
Higher bops are better

2 Socket - 2 Node Haswell - X86
JVMS Prev Current %Change
4 200668 203370 1.3465
1 321791 328431 2.06345

2 Socket - 4 Node Power8 - PowerNV
JVMS Prev Current %Change
1 204848 206070 0.59654

2 Socket - 2 Node Power9 - PowerNV
JVMS Prev Current %Change
4 188098 188386 0.153112
1 200351 201566 0.606436

4 Socket - 4 Node Power7 - PowerVM
JVMS Prev Current %Change
8 58145.9 59157.4 1.73959
1 103798 105495 1.63491

Some events stats before and after applying the patch.

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 13,912,183 13,825,492
migrations 1,155,931 1,152,509
faults 367,139 371,948
cache-misses 54,240,196,814 55,654,206,041
sched:sched_move_numa 1,571 1,856
sched:sched_stick_numa 9 4
sched:sched_swap_numa 463 428
migrate:mm_migrate_pages 703 898

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 50155 57146
numa_hint_faults_local 45264 51612
numa_hit 239652 238164
numa_huge_pte_updates 36 16
numa_interleave 68 63
numa_local 239576 238085
numa_other 76 79
numa_pages_migrated 680 883
numa_pte_updates 71146 67540

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
cs 3,156,720 3,288,525
migrations 30,354 38,652
faults 97,261 111,678
cache-misses 12,400,026,826 12,111,197,376
sched:sched_move_numa 4 900
sched:sched_stick_numa 0 0
sched:sched_swap_numa 1 5
migrate:mm_migrate_pages 20 714

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
Event Before After
numa_hint_faults 272 18572
numa_hint_faults_local 186 14850
numa_hit 71362 73197
numa_huge_pte_updates 0 11
numa_interleave 23 25
numa_local 71299 73138
numa_other 63 59
numa_pages_migrated 2 712
numa_pte_updates 0 24021

perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 8,606,824 8,451,543
migrations 155,352 202,804
faults 301,409 310,024
cache-misses 157,759,224 253,522,507
sched:sched_move_numa 168 213
sched:sched_stick_numa 0 0
sched:sched_swap_numa 3 2
migrate:mm_migrate_pages 125 88

vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 4650 11830
numa_hint_faults_local 3946 11301
numa_hit 90489 90038
numa_huge_pte_updates 0 0
numa_interleave 892 855
numa_local 90034 89796
numa_other 455 242
numa_pages_migrated 124 88
numa_pte_updates 4818 12039

perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
cs 2,113,167 2,049,153
migrations 10,533 11,405
faults 142,727 162,309
cache-misses 5,594,192 7,203,343
sched:sched_move_numa 10 22
sched:sched_stick_numa 0 0
sched:sched_swap_numa 0 0
migrate:mm_migrate_pages 6 1

vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
Event Before After
numa_hint_faults 744 1693
numa_hint_faults_local 584 1669
numa_hit 25551 25177
numa_huge_pte_updates 0 0
numa_interleave 263 194
numa_local 25302 24993
numa_other 249 184
numa_pages_migrated 6 1
numa_pte_updates 744 1577

perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 101,227,352 94,515,937
migrations 4,151,829 4,203,554
faults 745,233 832,697
cache-misses 224,669,561,766 226,248,698,331
sched:sched_move_numa 617 1,730
sched:sched_stick_numa 2 14
sched:sched_swap_numa 187 432
migrate:mm_migrate_pages 316 1,398

vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 24195 80079
numa_hint_faults_local 21639 68620
numa_hit 238331 241187
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 238331 241186
numa_other 0 1
numa_pages_migrated 204 1347
numa_pte_updates 24561 80729

perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
cs 62,738,978 63,704,961
migrations 562,702 573,404
faults 228,465 230,878
cache-misses 75,778,067,952 76,568,222,781
sched:sched_move_numa 648 509
sched:sched_stick_numa 13 31
sched:sched_swap_numa 137 182
migrate:mm_migrate_pages 733 541

vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
Event Before After
numa_hint_faults 10281 8501
numa_hint_faults_local 3242 2960
numa_hit 36338 35526
numa_huge_pte_updates 0 0
numa_interleave 0 0
numa_local 36338 35526
numa_other 0 0
numa_pages_migrated 706 539
numa_pte_updates 10176 8433

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Completed in 1443 milliseconds