#1434 closed defect (invalid)
NMVe SMART/Health Critical Warning Bit 3 (0x4) not necessary an error
Reported by: | ThomasH | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | |
Component: | all | Version: | |
Keywords: | nvme | Cc: |
Description
Hello,
We are using an NVMe drive, a "SAMSUNG MZVLB512HAJQ-00000".
smartmontools reports the following error:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
The "Critical Warning" attribute consists of several bits, each signaling a different status.
According to the documentation, the flag for bit 4 is:
"If set to ‘1’, then the volatile memory backup device has failed. This field is only valid if the controller has a volatile memory backup solution."
The referred documentation is at https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf (page 122).
Currently bit 4 is reported as an error, but whether it really is one depends on the model (that is, whether it has a capacitor for power-loss protection).
Either this should be detected automatically, or a parameter could be used to indicate whether this is an error.
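For reference, here is a minimal sketch of how the Critical Warning byte splits into individual flags. It assumes the zero-based bit numbering and the (paraphrased) descriptions from NVMe 1.4. The table and function names are illustrative only and are not taken from smartmontools.

```python
# Bit meanings of the SMART/Health "Critical Warning" byte, paraphrased
# from NVMe 1.4 (zero-based bit numbers). Names here are illustrative only.
CRITICAL_WARNING_BITS = {
    0: "available spare capacity has fallen below the threshold",
    1: "temperature is above or below a threshold",
    2: "NVM subsystem reliability has been degraded",
    3: "media has been placed in read-only mode",
    4: "volatile memory backup device has failed",
    5: "Persistent Memory Region has become read-only or unreliable",
}


def decode_critical_warning(value: int) -> list[tuple[int, str]]:
    """Return (bit number, description) for every bit set in value."""
    return [(bit, text) for bit, text in CRITICAL_WARNING_BITS.items()
            if value & (1 << bit)]


if __name__ == "__main__":
    # 0x04 is the value reported in this ticket.
    for bit, text in decode_critical_warning(0x04):
        print(f"bit {bit} (mask 0x{1 << bit:02x}): {text}")
```

With this numbering, the value 0x04 maps to bit 2, which matches the "NVM subsystem reliability has been degraded" message printed by smartctl. The volatile memory backup flag would be mask 0x10.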
Attachments (2)
Change History (13)
comment:1 by , 4 years ago
Description: | modified (diff) |
---|
comment:2 by , 4 years ago
Keywords: | nvme added |
---|---|
Milestone: | → undecided |
Summary: | Smart-Attribute Critical Warning Bit 4 not necessary an error → NMVe SMART/Health Critical Warning Bit 4 not necessary an error |
comment:3 by , 4 years ago
Hello,
thanks for the quick reply.
The specification only states when this field is considered valid and how to interpret it in that case. If there is no "volatile memory backup solution", the meaning of that bit is undefined. So I would not consider it a bug if this bit is set in the absence of a volatile memory backup solution.
I checked the Identify Controller structure, as you recommended. I found two elements in the structure which might be related to it:
vwc : 0x1
awupf : 0
For completeness I have attached the full structure.
awupf is described in the specification as:
"This field indicates the size of the write operation guaranteed to be written atomically to the NVM across all namespaces with any supported namespace format during a power fail or error condition."
Maybe bit 3 of "Critical Warning" should be cleared or ignored when awupf is zero?
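To make the idea concrete, here is a rough sketch of such a filter. It only illustrates the suggestion in this comment and is not what smartctl does. The function name, the parameter names and the default mask (0x04, the value from this ticket) are assumptions.

```python
def effective_critical_warning(critical_warning: int, awupf: int,
                               ignore_mask: int = 0x04) -> int:
    """Hypothetical filter: drop the bits in ignore_mask when awupf is zero.

    This mirrors the suggestion in this comment only. The default mask is
    simply the Critical Warning value reported in this ticket.
    """
    if awupf == 0:
        # Clear the masked bits and keep the result within one byte.
        return critical_warning & ~ignore_mask & 0xFF
    return critical_warning


if __name__ == "__main__":
    # Values taken from this report: Critical Warning 0x04, awupf 0.
    print(hex(effective_critical_warning(0x04, awupf=0)))
```

One caveat: as far as I can tell, NVMe 1.4 defines AWUPF as a 0's based value, so awupf = 0 would mean one logical block rather than "no guarantee". That makes it a doubtful indicator for this purpose.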
by , 4 years ago
Attachment: | nvme-structure.txt added |
---|
comment:4 by , 4 years ago
Summary: | NMVe SMART/Health Critical Warning Bit 4 not necessary an error → NMVe SMART/Health Critical Warning Bit 3 (0x4) not necessary an error |
---|
comment:5 by , 4 years ago
It does not seem to be a healthy drive to me.
- None of the other reports we saw had this bit set to 1. Also, fields that are not in use are typically zeroed.
- There is no clear definition of what a "Volatile Memory Backup System" is. However, an Intel datasheet has: "Bit 4: Volatile Memory Backup System has failed (e.g., enhanced power loss capacitor test failure)". In this case it is just about the capacitor that gives the SSD a chance to shut down correctly in case of power loss.
- I found at least one report on the net for a similar drive where this bit is set to 0.
My suggestion is to try to RMA this drive or at least to contact the vendor about it. I am not sure that adding any logic to hide this warning makes sense, at least until we are 100% confident that it is correct.
comment:6 by , 4 years ago
I agree. I don't remember any similar report since we added NVMe support in 6.5 (2016, #657).
- Setting an invalid error bit to 1 or (rand() & 1) is at least bad software engineering.
- There might also be an unrelated error but the firmware reports it with the wrong bit.
- NVMe 1.4 does not specify something like: "Bit 4 shall be ignored if the controller does not indicate a volatile memory backup solution in the FOOBAR field of the Identify Controller data structure."
Leaving ticket open as undecided for now.
comment:7 by , 4 years ago
Hello,
thanks for the input and thoughts about that topic.
We will open a ticket with Samsung and try to get further information.
We have 2 servers, each with a software RAID 1, and the second NVMe shows this 0x04 entry.
The first NVMe in each RAID shows 0x00. Maybe this is random, or maybe it is related.
comment:9 by , 4 years ago
Hello,
Samsung didn't offer any support because it is an OEM product.
Thus we contacted the hosting provider.
Our provider seems to be aware of this problem and recommended cleaning the slot of the NVMe.
Today, the provider carried out the cleaning and the error went away from the first NVMe.
So it seems to be an electrical issue, not a software issue.
Thanks for your input and your suggestions!
comment:10 by , 4 years ago
Milestone: | undecided |
---|---|
Resolution: | → invalid |
Status: | new → closed |
This is likely a hardware or firmware issue and not a smartmontools bug.
If a device has no volatile memory backup solution, then it should never set this bit. Everything else is IMO a firmware bug.
Does any bit in the Identify Controller structure report whether bit 4 of Critical Warning is valid?
PS: Please don't send smartctl outputs as screen shots. Use plain-text attachments or wiki markup instead.