Could use some bug tracking advice.

So I have a problem with some code I've been working on for the last few months. The code, which is compiled into two separate processes, suddenly stops working. No error, nothing in dmesg, nothing in any file in /var/log, period. It did occur to me, however, that rsyslog is likely or possibly disabled, which could explain the empty logs.

What my code does is read from the CAN peripheral, form extended packets out of the CAN frames (NMEA 2000 fastpackets), and then write the data into a POSIX shared memory file (/dev/shm/file). The second process simply reads from the file and shuffles the data out over a websocket in JSON / human-readable form. The data on the web side of things has tested accurate, although I do occasionally get a malformed JSON object warning from Firefox Firebug.

The kernel I’m currently using is 4.2.0-rc4-bone2, which seems to have no noticeable problems.

Anyway, I'm relatively new to Linux development, and was wondering if anyone might be able to offer some advice as to how I can track this down. It did occur to me that I could attempt to trap process signals and see if anything interesting comes of that. Short of this, however, do I have any other options? Since my code runs for days before the processes stop, I'm pretty sure the traditional gdb and strace/ltrace options would be ineffective. But maybe I'm wrong?
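Something along these lines is what I had in mind for the signal trapping (a rough sketch only, not code I've actually deployed; the list of signals is just a guess):

```c
/* Sketch only: install handlers so a dying process at least leaves a
 * breadcrumb saying which signal took it down. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void on_signal(int sig)
{
    /* fprintf()/exit() are not strictly async-signal-safe, but this is
     * a debugging aid, not production code. */
    fprintf(stderr, "caught signal %d (%s)\n", sig, strsignal(sig));
    exit(128 + sig);
}

void install_handlers(void)
{
    int sigs[] = { SIGSEGV, SIGBUS, SIGTERM, SIGINT, SIGPIPE };
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    for (size_t i = 0; i < sizeof sigs / sizeof sigs[0]; i++)
        sigaction(sigs[i], &sa, NULL);
}
```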

Oh, and right, my bad: I've already run valgrind, and it reports no problems. Both processes stop, as far as I can tell, simultaneously. I've never actually seen them stop, only the aftermath. Also, memory usage never seems to go above ~60 MB total, system-wide. This has me stumped . . .

William,

I'm certainly not a developer... Why do you question rsyslog? From your explanation above I don't see the connection.

Perhaps it's something with the shared memory, given that both programs are failing. I really don't have a clue how to debug that one, as I've only used shared memory as a user, with ntpd and gpsd.

Using strace did come to mind, but the resulting file would likely be huge if you saved it. Trapping the exit signals might yield some clue as to where to look.

Good luck!

Mike

So I have a problem with some code I've been working on for the last few
months. The code, which is compiled into two separate processes, suddenly
stops working. No error, nothing in dmesg, nothing in any file in /var/log,
period. It did occur to me, however, that rsyslog is likely or possibly
disabled, which could explain the empty logs.

What my code does is read from the CAN peripheral, form extended packets out
of the CAN frames (NMEA 2000 fastpackets), and then write the data into a
POSIX shared memory file (/dev/shm/file).

Since this involves two processes that as you say stop simultaneously,
I'd suspect a latent synchronization bug. You don't say how you
interlock your shared memory, but one possibility is that your reader
code gets stuck because you overwrite the data while it's reading it.
Debugging this type of thing is tricky, but maybe write a state
machine that lights some LEDs that show the phases of your
synchronization process, and wait to see where it's stuck.

The second process simply reads
from the file and shuffles the data out over a websocket in JSON /
human-readable form. The data on the web side of things has tested accurate,
although I do occasionally get a malformed JSON object warning from Firefox
Firebug.

I'd definitely look at this malformation---it could be the smoke from
the real fire. Or not. In any case, this one should be easier to
find---just wait for the message, inspect the data in firebug, and
write a checker routine, inspecting your outgoing data, that watches
for this type of distortion.
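Just to make the idea concrete, the checker could be as dumb as this (a made-up sketch, not a real JSON parser; it only catches gross damage like truncation or unbalanced braces):

```c
/* Sketch of a cheap sanity check run on the outgoing buffer before it
 * is handed to the websocket code.  Not a JSON parser. */
#include <stdio.h>
#include <string.h>

int looks_like_json(const char *buf, size_t len)
{
    size_t n = strnlen(buf, len);
    int depth = 0;

    if (n == 0 || n == len || buf[0] != '{')  /* empty, unterminated, or no '{' */
        return 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '{') depth++;
        if (buf[i] == '}') depth--;
    }
    return depth == 0 && buf[n - 1] == '}';
}

/* e.g.:  if (!looks_like_json(outbuf, sizeof outbuf))
 *            fprintf(stderr, "malformed frame about to go out\n");  */
```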

Hi Przemek,

Since this involves two processes that as you say stop simultaneously,
I’d suspect a latent synchronization bug. You don’t say how you
interlock your shared memory, but one possibility is that your reader
code gets stuck because you overwrite the data while it’s reading it.
Debugging this type of thing is tricky, but maybe write a state
machine that lights some LEDs that show the phases of your
synchronization process, and wait to see where it’s stuck.

Currently, I have no synchronization. At one point I was using a byte in shared memory as a binary stopgap, but after a while it was not working predictably. Now I'm re-reading the documentation on POSIX semaphores, and creating a semaphore in shared memory instead of using a system-wide resource.
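Something like this is what I'm aiming for (just a sketch of the idea; the struct layout and names are invented, not my actual code):

```c
/* Sketch: an unnamed POSIX semaphore placed inside the shared memory
 * region itself, initialised with pshared = 1 so both processes use it. */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct shm_region {
    sem_t lock;          /* process-shared semaphore */
    char  data[4096];    /* the actual payload       */
};

int main(void)     /* writer side */
{
    int fd = shm_open("/example_region", O_CREAT | O_RDWR, 0666);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(struct shm_region)) < 0) { perror("ftruncate"); return 1; }

    struct shm_region *r = mmap(NULL, sizeof *r, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (r == MAP_FAILED) { perror("mmap"); return 1; }

    /* Only the creating process initialises the semaphore, exactly once. */
    if (sem_init(&r->lock, 1 /* shared between processes */, 1) < 0) {
        perror("sem_init");
        return 1;
    }

    sem_wait(&r->lock);                      /* writer critical section */
    strcpy(r->data, "hello from the writer");
    sem_post(&r->lock);

    return 0;
}
```

The reader would mmap the same object and wrap its copy-out in the same sem_wait()/sem_post() pair.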

I’d definitely look at this malformation—it could be the smoke from
the real fire. Or not. In any case, this one should be easier to
find—just wait for the message, inspect the data in firebug, and
write a checker routine, inspecting your outgoing data, that watches
for this type of distortion.

The first thing that comes to mind here, which I forgot to add to my post last night, is that I am not zeroing out the shared memory file before use. I know this is bad . . . but I'm not convinced this is the problem. However, since it is (or can be) a one-line fix, I will do so. The odd thing here is that I get maybe 1-2 notifications an hour, if that. And the malformation is inside the actual JSON object (the string buffer, e.g. char *buffer), not outside it.

What does all this mean to me? My first impression is that it is a synchronization issue. I'm still not convinced, though . . .

Also, for what it's worth, I'm using mmap() and not open()/read()/write() on the file, so the code is very fast.

Hi Przemek,

Since this involves two processes that as you say stop simultaneously,
I'd suspect a latent synchronization bug. You don't say how you
interlock your shared memory, but one possibility is that your reader
code gets stuck because you overwrite the data while it's reading it.
Debugging this type of thing is tricky, but maybe write a state
machine that lights some LEDs that show the phases of your
synchronization process, and wait to see where it's stuck.

Currently, I have no synchronization. At one point I was using a byte in
shared memory as a binary stopgap, but after a while it was not working
predictably. Now, I'm re-reading documentation on POSIX semaphores, and
creating a semaphore in shared memory, instead of using a system wide
resource.

Then you have two things that happen with no predictable time
relationship to each other at all.

You could be in the middle of writing part of a multibyte message while
the other process is trying to read that message.

A binary semaphore controls access to the shared (message) resource.
Checking the binary semaphore generally involves turning off
interrupts so that the other process can't grab control during the
check code. If you have two separate processors, you still need to
deal with the same thing, not so much interrupts, but permission to
access. The semaphore read/write must be atomic, and the access must
be negotiated between the two processors (generally happens in
hardware for two processors, happens in software for two processes
running on the same processor).
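For illustration only (not your code, and in real use the flag would have to live inside the shared mapping), C11 atomics are one way to get that kind of indivisible check-and-take on a single machine:

```c
/* Illustration: the "check and claim" has to be a single indivisible
 * operation, otherwise both processes can see "free" and proceed. */
#include <stdatomic.h>

atomic_flag busy = ATOMIC_FLAG_INIT;   /* would really live in the shared region */

void enter(void)
{
    /* test_and_set reads the old value and sets the flag in one
     * uninterruptible step; spin until we were the one that set it. */
    while (atomic_flag_test_and_set(&busy))
        ;   /* the other process holds it */
}

void leave(void)
{
    atomic_flag_clear(&busy);
}
```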

I'd definitely look at this malformation---it could be the smoke from
the real fire. Or not. In any case, this one should be easier to
find---just wait for the message, inspect the data in firebug, and
write a checker routine, inspecting your outgoing data, that watches
for this type of distortion.

The first thing that comes to mind here, which I forgot to add to my post
last night, is that I am not zeroing out the shared memory file before use.
I know this is bad . . . but I'm not convinced this is the problem. However,
since it is (or can be) a one-line fix, I will do so. The odd thing here is
that I get maybe 1-2 notifications an hour, if that. And the malformation is
inside the actual JSON object (the string buffer, e.g. char *buffer), not
outside it.

What does all this mean to me? My first impression is that it is a
synchronization issue. I'm still not convinced, though . . .

Analyze the code to see what happens if one process is writing while
the other is reading.

The error rate may be just a measure of how frequently this happens.

Harvey

Hi Harvey,

Thanks for the response. I think the biggest question in my mind is: OK, so perhaps I have a synchronization problem that rears its head once in a while. But is it really the kind of problem that could cause both processes to stop?

An occasional sample that does not display because it is malformed does not bother me. The processes stopping does. I cannot see how it could be causing the processes to stop. However . . . I honestly do not know one way or the other.

Hi Harvey,

Thanks for the response. I think the biggest question in my mind is: OK,
so perhaps I have a synchronization problem that rears its head once in a
while. But is it really the kind of problem that could cause both
processes to stop?

An occasional sample that does not display because it is malformed does
not bother me. The processes stopping does. I cannot see how it could be
causing the processes to stop. However . . . I honestly do not know one
way or the other.

Process A: while process B is busy, wait, then read from process B

Process B: while process A is busy, wait, then read from process A

Classic deadlock.

Process A: wait for permission to read special area, read, then wait
outside that permission area. No restrictions on process B except
when accessing special area (which happens infrequently).

Process B: wait for permission to read special area, read, then wait
outside that permission area. No restrictions on process A except
when accessing special area (which happens infrequently).

Since the waiting is outside that special area, and the processes are
not allowed to hog the special area (and block the other process),
then neither process can block the other except for a very brief time.

The implication is that checking and accessing the special area takes a
very small amount of time, and the wait / do-something-else part takes
longer.
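In rough C, the pattern I mean looks something like this (a sketch with invented names, not your actual code):

```c
/* Sketch: hold the lock only long enough to copy the buffer; do all of
 * the slow work (CAN parsing, websocket I/O) outside the locked region. */
#include <semaphore.h>
#include <string.h>

struct shared { sem_t lock; char data[512]; };

void reader_loop(struct shared *shm, void (*send)(const char *))
{
    char local[512];

    for (;;) {
        sem_wait(&shm->lock);                   /* brief: copy only        */
        memcpy(local, shm->data, sizeof local);
        sem_post(&shm->lock);                   /* release immediately     */

        send(local);                            /* slow part, no lock held */
    }
}
```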

Harvey

Ok. In my case however -

Process A writes to shared memory only.

Process B Reads from shared memory only.

As it stands, Process B starts off with a variable set to 0x00, then compares this to a byte position in the file. When Process B first starts, this comparison will always fail. Process B then copies the contents of the file, sets the variable to the value at that byte position, and then sends the data out over a websocket.

On the next iteration of the loop, Process B reads this value again and makes the comparison, which will likely succeed. The loop then continues until the comparison fails again, at which point the logic repeats. It's pretty simple - or so I thought.
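In outline, process B currently does something like this (a simplified sketch; the offset, size, and websocket helper are invented stand-ins for my real code):

```c
/* Simplified sketch of process B's loop.  'shm' points at the mmap'd
 * /dev/shm file; the byte at SRC_OFF is the source ID last written by A. */
#include <string.h>

#define SRC_OFF 0       /* byte position of the source ID (invented) */
#define MSG_LEN 256     /* size of one data set (invented)           */

void send_over_websocket(const unsigned char *buf, int len);  /* hypothetical */

void reader_loop(volatile unsigned char *shm)
{
    unsigned char last = 0x00;          /* never matches on the first pass */
    unsigned char copy[MSG_LEN];

    for (;;) {
        if (shm[SRC_OFF] != last) {     /* "new data" test, no locking      */
            memcpy(copy, (const void *)shm, MSG_LEN);  /* A may still be writing */
            last = shm[SRC_OFF];
            send_over_websocket(copy, MSG_LEN);
        }
    }
}
```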

The reasoning for this development model is simple: code segregation. Code in process B does not play well with the code in process A. They're both accessing network devices, and when that happens simultaneously, data gets lost - which happens more often than not.

William:

Shared memory is notoriously hard, and as has been pointed out, it really sounds like a synchronization issue that you are facing, which leads to a couple of design questions:

#1 Could what you are doing be accomplished with a POSIX threads system instead of processes? This would mean you would not need to copy the data over; you could simply pass pointers from one thread to another in a producer-consumer model.
#2 If you cannot accomplish your solution via threads, how big is the data you are sending, and how often is it being sent? A pipe might be a better solution, as pipes intrinsically have their own synchronization built in (a rough sketch of the writer side follows).
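For what it's worth, the writer side of option #2 could look roughly like this (a minimal sketch; the FIFO path and record layout are invented):

```c
/* Sketch: a named pipe (FIFO) between the two executables.  Writes of up
 * to PIPE_BUF bytes are atomic, so the kernel does the synchronization. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define FIFO_PATH "/tmp/nmea_fifo"   /* invented name */

int main(void)
{
    struct { float volts, amps, freq; } sample = { 12.6f, 4.2f, 60.0f };

    mkfifo(FIFO_PATH, 0666);              /* fine if it already exists      */

    int fd = open(FIFO_PATH, O_WRONLY);   /* blocks until a reader opens it */
    if (fd < 0) { perror("open"); return 1; }

    /* One sample per write(); the reader gets whole records, in order. */
    if (write(fd, &sample, sizeof sample) != sizeof sample)
        perror("write");

    close(fd);
    return 0;
}
```

The reader side just opens the FIFO O_RDONLY and read()s one record at a time.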

Walt

Ok. In my case however -

Process A writes to shared memory only.
Process B Reads from shared memory only.

Ok, so that eliminates one form of data corruption.

As it stands, Process B starts off with a variable set to 0x00, then
compares this to a byte position in the file. When Process B first starts,
this comparison will always fail. Process B then copies the contents of the
file, sets the variable to the value at that byte position, and then sends
the data out over a websocket.

Ok:
1) what stops process A from writing to the shared buffer if process B
is reading it?

2) what keeps B from getting an incomplete or inaccurate value from
process A for the byte position? is it a byte variable or is it an
integer? Does the processor write this as an integer in one
uninterruptible process?

3) if both A and B access Internet devices (over the same interface
I'd guess), what stops the data collision between process A and
process B? What protects that Internet resource? What is the result
if both A and B read a status register at the same time (in the
hardware)?

Harvey

1) what stops process A from writing to the shared buffer if process B
is reading it?

Nothing. I assume that writes are slower than, or at most as fast as, reads. Both reads and writes are done using an mmap'd pointer.

2) what keeps B from getting an incomplete or inaccurate value from
process A for the byte position? is it a byte variable or is it an
integer? Does the processor write this as an integer in one
uninterruptible process?

Aside from the fact that the byte position I'm testing here is a source ID of two different devices, nothing. They do come in, in order, one after the other; this is not permanent, however. When I start tracking more data, this will still work for one set of data, but not for other sets of data. The write/read type is char. There's no real way to get this wrong, as gcc -Wall will warn; I have no errors or warnings when compiling.

3) if both A and B access Internet devices (over the same interface
I'd guess), what stops the data collision between process A and
process B? What protects that Internet resource? What is the result
if both A and B read a status register at the same time (in the
hardware)?

No - I guess, more correctly, they are socket devices, both using Linux network sockets: socketcan for the CAN bus, and standard Linux sockets for Ethernet. The web library I did not write; it's libmongoose.

Walter,

Thank you for your reply.

I’ve examined pretty much all of SYSV and POSIX IPC mechanisms. I’m no expert here, as this is really my first go with anything IPC, and pretty much my first “major” application running on Linux.

Pipes may not be fast enough for what I'm trying to accomplish. To keep the explanation short: right now I'm only tracking one PGN. A PGN for fastpackets is, in this case, a set of data items. For this one PGN I'm dealing with 3 data items (voltage, current and frequency), but program-wide I have to keep track of much more. This PGN is also only one of roughly 20, with most PGNs issuing data sets of varying length 2 times a second . . .

It may be I'll have to somehow rate-limit the data I'm dealing with. I did consider POSIX message queues, but according to what I've read, POSIX shared memory is the fastest of all IPC mechanisms, and while I do agree that it is not very easy, personally I think shared memory is easy now that I understand a lot of it. At minimum, it's not very hard to understand the idea and implement it in code. Semaphores, mutexes, and threads, however, I do find a bit intimidating. At minimum, I personally think they're overly complex.

I have thought about a lot of different approaches, and I'm not saying my approach won't change. This is just where I am right now: stumbling about, learning the various Linux APIs and libraries. Using and understanding fork() is on my TODO list; I just have not made it there yet. These two processes are actually two separate executables. I am a bit worried about process context switching, though. I mean, I'm sure I am incurring some penalty right now running two separate executables, but I'm not sure it would be the same using threads.

1) what stops process A from writing to the shared buffer if process B
is reading it?

Nothing. I assume that writes are slower, or at most as fast as reads. Both
reads, and writes are done using a mmap'd pointer.

Murphy says that you cannot guarantee this.

2) what keeps B from getting an incomplete or inaccurate value from
process A for the byte position? is it a byte variable or is it an
integer? Does the processor write this as an integer in one
uninterruptible process?

Aside from the fact that the byte position I'm testing here is a source ID
of two different devices, nothing. They do come in, in order, one after the
other; this is not permanent, however. When I start tracking more data,
this will still work for one set of data, but not for other sets of data.
The write/read type is char. There's no real way to get this wrong, as gcc
-Wall will warn; I have no errors or warnings when compiling.

OK, assumption is that they are sequential. Depends on what the
process switching time is and phase of the moon. My paranoid
assumption is that they are not necessarily sequential and can occur
at any time in relationship to each other.

char is ok, you don't get corrupted values, but you may get the "last"
value rather than the current one unless you interlock the two tasks.

3) if both A and B access Internet devices (over the same interface
I'd guess), what stops the data collision between process A and
process B? What protects that Internet resource? What is the result
if both A and B read a status register at the same time (in the
hardware)?

No - I guess, more correctly, they are socket devices, both using Linux
network sockets: socketcan for the CAN bus, and standard Linux sockets for
Ethernet. The web library I did not write; it's libmongoose.

Ok, do you know if these functions are thread safe?

I think that's what's giving you problems: the programming is not
thread-aware or thread-safe.

Harvey

Walter,

Thank you for your reply.

I've examined pretty much all of SYSV and POSIX IPC mechanisms. I'm no
expert here, as this is really my first go with anything IPC, and pretty
much my first "major" application running on Linux.

Which means, perhaps, the first application where the OS is a real
factor.

Pipes may not be fast enough for what I'm trying to accomplish. To keep the
explanation short: right now I'm only tracking one PGN. A PGN for fastpackets
is, in this case, a set of data items. For this one PGN I'm dealing with 3
data items (voltage, current and frequency), but program-wide I have to keep
track of much more. This PGN is also only one of roughly 20, with most PGNs
issuing data sets of varying length 2 times a second . . .

The problem may be more of "how much data and how long to process it"
rather than the frequency of the data itself.

You are correct to consider context switching time.

It may be I'll have to somehow rate-limit the data I'm dealing with. I did
consider POSIX message queues, but according to what I've read, POSIX shared
memory is the fastest of all IPC mechanisms, and while I do agree that it is
not very easy, personally I think shared memory is easy now that I understand
a lot of it. At minimum, it's not very hard to understand the idea and
implement it in code. Semaphores, mutexes, and threads, however, I do find a
bit intimidating. At minimum, I personally think they're overly complex.

Hmmm, perhaps not quite that intimidating.

A thread is a path of execution. A single program consisting of a
loop and a single interrupt has two threads.

Threads share common resources, data, address space. It's up to you
to make them well behaved about what changes what and why.... That's
why microprocessors save the registers on the stack for an interrupt.

Processes are threads with isolated resources. Each process ideally
thinks that it is the only thing running in a processor, and data just
"magically" appears. The OS's job is to keep the processes separate.

Mutexes and semaphores are similar, and are synchronization mechanisms
between either threads or processes. Please look up the definition
and explanation of "critical section" in programming.

The idea is to have a flag that can be changed without interference
from another process, or for that matter, can be read without
interfering with another process. This could be a complete message.

The mutexes and semaphores serve to synchronize two processes which,
by the very nature of an operating system, *cannot* be guaranteed to
be synchronous.
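For illustration (again, not your code), a pthread mutex can even be shared between two separate processes by placing it in the shared mapping and marking it process-shared:

```c
/* Illustration: a process-shared pthread mutex living in the shared area. */
#include <pthread.h>
#include <stdio.h>

struct shared_area {
    pthread_mutex_t lock;
    char            message[256];
};

/* Called once, by whichever process creates the shared mapping. */
int init_shared_lock(struct shared_area *s)
{
    pthread_mutexattr_t attr;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    rc = pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* The critical section: lock, touch the shared message, unlock. */
void put_message(struct shared_area *s, const char *text)
{
    pthread_mutex_lock(&s->lock);
    snprintf(s->message, sizeof s->message, "%s", text);
    pthread_mutex_unlock(&s->lock);
}
```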

I have thought about a lot of different approaches, and I'm not saying my
approach won't change. This is just where I am right now: stumbling about,
learning the various Linux APIs and libraries. Using and understanding
fork() is on my TODO list; I just have not made it there yet. These two
processes are actually two separate executables. I am a bit worried about
process context switching, though. I mean, I'm sure I am incurring some
penalty right now running two separate executables, but I'm not sure it
would be the same using threads.

It actually would be much the same with threads vs. processes. The only
real difference is that threads share the same address space as each
other, so they have access to variables without a special mechanism
(which would take time).

Processes, as I mentioned, run in their own worlds, with the operating
system controlling what they see (resources, shared memory, etc). That
mechanism has overhead.

So yes, threads are faster than processes, but more dangerous.

Harvey

OK, so with all that in mind, I'm back to square one. These processes cannot share the same memory space. libmongoose seems to love stomping all over the stack, and I'm fairly sure it is not thread-safe, which is why I'm using two separate executables.

OR maybe I could go crazy and malloc() everything ? heheh no way :wink:

Thanks for the info, guys. I will definitely look into using semaphores, and actually found a decent read on them the night I made this post.

I will also zero out the shared memory file before its initial use, via /dev/zero - well, only when the IPC server first starts.
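Whether it's via /dev/zero or just a memset over the mapping, the startup zeroing I have in mind is roughly this (sketch only; names invented):

```c
/* Sketch: zero the whole mmap'd region once, when the IPC server starts. */
#include <string.h>
#include <sys/mman.h>

void clear_region(void *shm_base, size_t shm_size)
{
    memset(shm_base, 0, shm_size);
    msync(shm_base, shm_size, MS_SYNC);   /* probably unnecessary on tmpfs */
}
```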

OK, so with all that in mind, I'm back to square one. These processes cannot
share the same memory space. libmongoose seems to love stomping all over
the stack, and I'm fairly sure it is not thread-safe, which is why I'm
using two separate executables.

Linux (someone please correct me if I am wrong) *has* to be thread
safe.

The problem I see is that you are not using the parts of the OS that
are designed to keep you from messing things up.

*OR* maybe I could go crazy and malloc() everything ? heheh no way :wink:

malloc, or perhaps a version that the OS uses that turns OFF
interrupts and deals with the memory manager in the chips, should only
allocate memory within the program's space, or once that memory is
allocated, automatically assign it to that program.

An operating system is about managing resources, giving them to a
program, and making the whole thing graceful.

Harvey

Linux (someone please correct me if I am wrong) has to be thread
safe.

The problem I see is that you are not using the parts of the OS that
are designed to keep you from messing things up.

malloc, or perhaps a version that the OS uses that turns OFF
interrupts and deals with the memory manager in the chips, should only
allocate memory within the program’s space, or once that memory is
allocated, automatically assign it to that program.

An operating system is about managing resources, giving them to a
program, and making the whole thing graceful.

Harvey,

You do understand what happens when you have a function on the stack that uses local variables, and suddenly a callback interrupts that function call? Not only does that function call cease to exist, but all of the local variable data is gone. What's more, the function does not even resume, as it's been popped off the stack. My comment about malloc() was tongue in cheek, as it would not even help. Anyway, there is probably a workaround for this situation that I'm currently not aware of.

This is why libmongoose is not thread-safe and cannot coexist in the same executable as my CAN bus manipulation routines. This is standard behavior for libmongoose, according to what I've read. Even the maintainer says it's not thread-safe, if memory serves me correctly.

Linux (someone please correct me if I am wrong) *has* to be thread
safe.

The problem I see is that you are not using the parts of the OS that
are designed to keep you from messing things up.

malloc, or perhaps a version that the OS uses that turns OFF
interrupts and deals with the memory manager in the chips, should only
allocate memory within the program's space, or once that memory is
allocated, automatically assign it to that program.

An operating system is about managing resources, giving them to a
program, and making the whole thing graceful.

Harvey,

You do understand what happens when you have a function on the stack that
uses local variables, and suddenly a callback interrupts that function
call? Not only does that function call cease to exist, but all of the
local variable data is gone. What's more, the function does not even
resume, as it's been popped off the stack.

Hmmm. Seriously, I've not seen that behavior, and it doesn't make all
that much sense to me how it would work.

I'd have thought that if you call another function, even the same one in
a recursive call, *and* the function is reentrant, then everything on the
stack would be where I left it.

I have code that essentially calls a function based on a table entry
(of the desired function) that is referred to by an index.

It works well.

So this particular behavior is a bit odd to me, since it says that the
stack is not behaving the way I'd think it was.

I'm not sure that this is behaving like a sensible call (even an
interrupt or context change).

What's the rationale for this kind of behavior?

My comment about malloc() was
tongue in cheek. As it would not even help. Anyway, there probably is a
workaround for this situation that I'm currently not aware of.

Depends on whether or not this is an expected behavior and why it
happens.

Most of my experience is with embedded microprocessors and FreeRTOS as
an operating system, although I have written a time-slicing cooperative
operating system, which did work; it was just more effort than I wanted
to put in to make it pre-emptive.

I'm puzzled right now....

This is why libmongoose is not thread-safe and cannot coexist in the same
executable as my CAN bus manipulation routines. This is standard behavior
for libmongoose, according to what I've read. Even the maintainer says it's
not thread-safe, if memory serves me correctly.

Ah, now if it's not thread-safe, then it uses static variables or volatiles,
which I can see. That could be the answer, but I'm used to writing
either re-entrant code (down to the device drivers) or knowing whether
or not the routine is re-entrant and controlling access with semaphores
or mutexes as needed.

Mostly, the non-reentrant code had to be initialized (i.e., setting up
heap structures when first called), and the easiest way to do that is
to have a static variable for the heap structures. It's up to me to see
that it is not called in a re-entrant manner, since it will clobber
the heap structure's data on the second call.
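A made-up example of the shape I mean:

```c
/* Made-up example of non-reentrant code: hidden static state that is set
 * up on the first call and then shared by every later caller. */
#include <stdlib.h>

static char *buf = NULL;          /* the hidden shared state */

char *get_work_buffer(size_t n)
{
    if (buf == NULL)              /* first call: initialise               */
        buf = malloc(n);          /* two concurrent callers can both pass */
                                  /* the NULL test and race right here    */
    return buf;                   /* every caller gets the SAME buffer    */
}
```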

Harvey