Application for GSoC contribution and advice requested

Hello everyone,
I am Prashant Dandriyal, a final-year undergraduate pursuing a Bachelor of Technology in Electronics and Communication Engineering (E.C.E). My experience with embedded systems includes simple electro-mechanical circuits, 8-bit microcontrollers, and 32-bit ones like the EK-TM4C123GXL TIVA LaunchPad and the CC26X2R1 development board. These Texas Instruments components were provided to me as part of the India Innovation Challenge and Design Contest (IICDC-2018).

I have also been drawn towards the (previously niche) field of Embedded Artificial Intelligence (Embedded AI), better known as Edge AI, with TinyML as one of its subfields. I have been following the TinyML community for some time now. In the process of bringing Machine Learning to the edge, I have completed coursework and worked with TensorFlow, TensorFlow Lite, and the Intel OpenVINO toolkit, all aimed at shifting inference to the edge. For my final-year project, I am implementing on-device learning on low-compute devices.
For GSoC 2020, I would like to contribute to the project "YOLO models on the X15/AI". As Mr Hunyue Yau is one of the mentors, I would request you all to redirect me to the related communities where I can discuss the idea with the mentors.

I am also going to introduce myself in the #beagle-gsoc channel at riot.im and ask for help.
I would be grateful for this help, as it will enable me to finalize my project and begin the preparations.

Thanking You,
Prashant Dandriyal

Hi,

There have been a few emails about the YOLO project idea; I am trying to address
all of them in one email. The assumption is that the students have subscribed.

Please try to catch the mentors on freenode IRC in the #beagle-gsoc channel. It
looks like you have seen the elinux page. Please note the timezones for the
potential mentors at the bottom. I am often on that channel on and off between
10:30-19:00 US/Pacific time. Do hang around the channel. Most of us will look
at what has been said on the channel and respond even if it is later, so asking
and leaving immediately will not work well.

General comments - Ideally, we would like to see the YOLO model working on the
BBAI (or x15, or even a BBB!) at a full 30fps video frame rate. Having said
that, right now I am seeing around 10 seconds per frame. This is largely due
to YOLO, as implemented with the Darknet framework, not taking advantage of the
hardware. There are many possible ways of working on this. Do keep in mind
GSoC is a relatively short period of time. As part of the application, there
should be a convincing explanation of why you think you can accomplish what
you propose in the time frame.
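
To put a number on that gap (just a back-of-the-envelope Python sketch using
the rough figures above):

    current = 10.0              # observed: ~10 seconds per frame
    target = 1.0 / 30.0         # goal: 30 fps, i.e. ~33 ms per frame
    print("speedup needed: %.0fx" % (current / target))   # ~300x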

Just to throw out potential work in this general area:
- Attempt to leverage the TIDL stuff to accelerate it. Right now, TIDL doesn't
support all the layers, so some of the work will have to be done on the
accelerators and some of it on the ARM.

- Attempt to use the (any day now) updated TIDL stuff with the TensorFlow Lite
support to run the model. Jason might prefer this path.

- Attempt to use the model conversion tools in TIDL.

- Attempt to use OpenCL to accelerate things. Please note, a brute-force
recompile of the OpenCL port of Darknet does not work. Most likely, this is
due to the port focusing on OpenCL with a GPU instead of OpenCL with the DSP
as it is on the BBAI/x15. A preliminary debug suggests it is a memory problem
somewhere; the device-query sketch after this list is a quick first check.

- Attempt to use the SGX GPU. Currently, only OpenGLES is supported on there.
This would basically be a port to use OpenGLES. The nice thing about this path
is that it could be reused on the BBB too.

- It could be a combination of any of the above.
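
For the OpenCL route, a useful first check is what the TI OpenCL runtime
actually exposes for the DSP - device type and memory limits - since the
suspected failure is memory-related. A minimal sketch, assuming pyopencl is
installed on top of the TI OpenCL runtime:

    import pyopencl as cl

    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            print(dev.name,
                  cl.device_type.to_string(dev.type),   # the DSP appears as ACCELERATOR, not GPU
                  dev.global_mem_size // (1024 * 1024), "MB global,",
                  dev.max_mem_alloc_size // (1024 * 1024), "MB max alloc")

A Darknet port that asks only for CL_DEVICE_TYPE_GPU would find no device here
at all, and kernels tuned for GPU-sized allocations can fail in exactly the
memory-related way described above.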

Caveats -
YOLOv3 may not fit on the device. YOLOv2 (or even YOLOv2-tiny) may be a more
feasible approach.

Part of this is performance; it would be good to identify what frame size is
being targeted. I have found 320x240 to be convenient, as that's common to many
webcams.

Thank you so much, Sir. I have some queries, but I'll ask them in the GSoC channel at your prescribed time.
I hope that's ok.

@H Sir, I am working on your suggestion and installing TIDL to validate my assumptions about the configuration file… In the meantime, please go through my updated application: https://elinux.org/BeagleBoard/GSoC/2020Proposal/PrashantDandriyal_2, which mentions the demo I was talking about. I have detailed it in my repo: https://github.com/PrashantDandriyal/GSoC2020_YOLOModelsOnTheBB_AI/blob/master/README.md

Hi,

It may be useful if you expand on/clarify the benefits. You speak of Edge
devices, but with AI there are 2 halves: training and inference. Up till this
point, most other things on the Beagle (bone/boards) are pretty much the same
as on a desktop, i.e. you can natively compile things. However, this symmetry
isn't there with AI.

Also, a big thing with the YOLO models is the ability to locate and identify
at the same time without doing iterative things like a sliding-window search.
This couples nicely with the limited compute power of the x15/AI platforms.
Getting this model to work would make locate/identify more viable.

Hello Sir,

There seems to be a problem. If I use NNPACK or other common acceleration libraries, they can only help our objective by using the on-board GPUs, which are not much on the AI or x15. Alternatively, using the DSP requires writing OpenCL… It's not clear how the EVEs can be configured similarly…
Do you suggest any way around that?

Hi Prashant,

NNPACK has 2 main acceleration strategies that should apply to all ARM
boards. I have this working on the AI, and there are references to people using
it on the RPi. It uses NEON (ARM SIMD) when possible, and it uses a math
identity to change things around: convolution in time is multiplication in
frequency, so NNPACK does an FFT to turn the convolution into multiplication.
In some cases this is faster.
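
That identity is easy to demonstrate with NumPy: linear convolution matches a
pointwise multiply in the frequency domain once both signals are zero-padded
to the full output length.

    import numpy as np

    x = np.random.rand(100)    # e.g. one row of an input feature map
    k = np.random.rand(9)      # e.g. a 1-D slice of a convolution kernel

    direct = np.convolve(x, k)                 # time-domain linear convolution

    n = len(x) + len(k) - 1                    # full output length
    via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

    assert np.allclose(direct, via_fft)        # same result, both ways

The FFT route wins when the sizes are big enough for the O(n log n) transforms
to beat the O(n*m) direct sum.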

I think on other platforms, it can use the GPU. If we want to use the GPU, we
have a few ways of doing this:
- Expand on the GPU acceleration in NNPACK.
- Expand on the CUDA/OpenCL ports of Darknet.

The risk with the GPU is that we are entering unknown territory, as OpenGLES
2.0 for GPGPU is untested; in addition, the SGX is not the fastest GPU around.
Nevertheless, it may still be useful. I can share code to set things up - the
main thing is to figure out how to express things as shader code.
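
To give a flavour of what "express things as shader code" means, here is a
hypothetical GLSL ES 2.0 fragment shader for a single 3x3 convolution, held as
a Python string for whatever GL setup code we end up with (the texture packing
and names are assumptions, not code from an existing port):

    # One fragment = one output pixel; the feature map lives in a texture.
    CONV3X3_FRAG = """
    precision mediump float;
    uniform sampler2D u_input;   // input feature map packed into a texture
    uniform float u_kernel[9];   // 3x3 convolution weights
    uniform vec2 u_texel;        // 1.0 / texture dimensions
    varying vec2 v_texcoord;

    void main() {
        float acc = 0.0;
        for (int dy = -1; dy <= 1; dy++) {       // constant bounds: legal in ES 2.0
            for (int dx = -1; dx <= 1; dx++) {
                vec2 off = vec2(float(dx), float(dy)) * u_texel;
                acc += texture2D(u_input, v_texcoord + off).r
                       * u_kernel[(dy + 1) * 3 + (dx + 1)];
            }
        }
        gl_FragColor = vec4(acc, 0.0, 0.0, 1.0); // read back with glReadPixels
    }
    """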

Coding in OpenCL for the DSP is the best way to go, but there may be a few
limitations. Most code/examples for OpenCL assume a GPU backend, whereas we
have a DSP. From my own experiments, it seems there is a limitation on how
much/how fast we can move data between the DSP and the main ARM core.
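
One way to pin that down is to time round trips of a fixed-size buffer; a
sketch, again assuming pyopencl works on top of the TI runtime:

    import time
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()       # should pick up the DSP device
    queue = cl.CommandQueue(ctx)

    host = np.zeros(8 * 1024 * 1024, dtype=np.uint8)    # 8 MB test buffer
    dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=host.nbytes)

    t0 = time.perf_counter()
    cl.enqueue_copy(queue, dev, host); queue.finish()   # ARM -> DSP
    t1 = time.perf_counter()
    cl.enqueue_copy(queue, host, dev); queue.finish()   # DSP -> ARM
    t2 = time.perf_counter()

    mb = host.nbytes / (1024.0 * 1024.0)
    print("to DSP:   %.1f MB/s" % (mb / (t1 - t0)))
    print("from DSP: %.1f MB/s" % (mb / (t2 - t1)))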

For us mere mortals, the only way to use the EVE for this purpose is via TIDL.

A crazy idea (as in, I haven't thought it through) - can we divide up the
tasks so some of it goes through TIDL to leverage the DSP/EVE and some of it
goes via the GPU? The starting point would probably be Darknet+NNPACK. The
biggest risk I see here is transfer overhead between all those components and
possibly having to waste a lot of time figuring out how to pipeline all these
together to get a reasonable throughput. Just an idea for thought.

Hello Sir,
I am working on both the paths:

  1. Trying to get the best model compression using automatic layer grouping in the “model import” feature. However, the YOLO v2-tiny model converted to TensorFlow seems to have many unsupported layers (probably due to the un-optimised conversion from Darknet); a small script for enumerating them is at the end of this mail.
  2. Understanding the methods to use the OpenCL ports of Darknet. There are some good ports on the web, but they still fail to meet our objectives (only TIDL seems to be our saviour).
    Meanwhile, I stumbled upon this mind-blowing work https://www.jevoisinc.com/pages/Examples… where they manage to get above 70 FPS. Running the YOLO models, they show an FPS of 15 here: http://jevois.org/moddoc/DarknetYOLO/modinfo.html. I am trying to understand their backend now. This is the code I could find so far: http://jevois.org/basedoc/classDarknetYOLO.html. Will update you in tomorrow’s meeting.
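
For point 1, this is roughly how I am listing the op types in the converted
graph to compare by hand against TIDL's supported-layer table (the file name
is just a placeholder):

    import tensorflow as tf

    with tf.io.gfile.GFile("yolov2_tiny.pb", "rb") as f:
        graph_def = tf.compat.v1.GraphDef()
        graph_def.ParseFromString(f.read())

    # Distinct op types present in the Darknet-converted graph
    for op in sorted({node.op for node in graph_def.node}):
        print(op)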

Hi,

Maybe I missed it but...
Where did they say they can do inference at 70fps? From what I saw, that's how
fast they can pull frames out of USB. Even 15 doesn't seem to be for
inference.

Yeah you are right. I shared it in a hurry. Sorry.

Hello Sir,
Just another update: TIDL supports only Caffe, TensorFlow, TensorFlow Lite and ONNX. I have converted the YOLO v2-tiny and v3-tiny to all these formats (except Caffe) and tried importing them with the TIDL model import tool; but in the case of TF and ONNX, some layers aren’t supported, while with TFLite there’s a strange segmentation fault. I have attached some of the console output in this e2e query.
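
For reference, the TFLite conversion I am using is roughly the following (the
tensor names and shapes are assumptions for a 416x416 YOLO v2-tiny graph):

    import tensorflow as tf

    converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
        "yolov2_tiny.pb",
        input_arrays=["input"],
        output_arrays=["output"],
        input_shapes={"input": [1, 416, 416, 3]},
    )
    with open("yolov2_tiny.tflite", "wb") as f:
        f.write(converter.convert())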

Now it’s somewhat clear how far TIDL can be useful in our case, although I will still try other things with it. But as you may have seen in the above-mentioned e2e query, one of the TI members mentioned that there’s another option: deploying these models using the TVM compiler. Although I am just beginning to look into it, I found some instructions to set it up, with the caveat:

Currently Neo compiler with Sitara support can compile any models supported by Neo, but only TensorFlow models can be compiled to run on TIDL for acceleration if the model can be supported by TIDL.

So, if I manage to compile the Darknet-converted-to-TF models using this Neo compiler, we can use TIDL… maybe our last option to leverage the co-processors.
I would like to know your thoughts about it…
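
The compile flow I am imagining looks roughly like this with a recent TVM (the
model file, tensor names and target string are assumptions on my side):

    import tvm
    from tvm import relay
    import tensorflow as tf

    with tf.io.gfile.GFile("yolov2_tiny.pb", "rb") as f:
        graph_def = tf.compat.v1.GraphDef()
        graph_def.ParseFromString(f.read())

    # Input name/shape assumed; layout follows TensorFlow's NHWC
    mod, params = relay.frontend.from_tensorflow(
        graph_def, shape={"input": (1, 416, 416, 3)})

    # Cross-compile for the Cortex-A15 on the x15/AI
    target = "llvm -mtriple=armv7l-linux-gnueabihf -mattr=+neon"
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
    lib.export_library("yolov2_tiny_arm.so", cc="arm-linux-gnueabihf-gcc")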

Prashant Dandriyal

Hi,

It comes down to there being a non-zero cost in utilizing off-processor (core)
resources. This applies to the EVE, DSP, and SGX. Part of the challenge here
is to figure out an optimal pipeline, which involves balancing the various
overheads.

Any idea where the slowdown is for the single-EVE case/smaller batches? Getting
some insight on this can be useful. Off the top of my head, I can see a few
possibilities:
- ARM -> EVE and/or EVE -> ARM communication has a high overhead compared to
the work that the EVE has to do.
- The ARM is idle while waiting for the EVE to do the work.
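
A cheap way to tell those two apart is to compare wall-clock time against ARM
CPU time around a single inference call (a sketch; run_tidl_inference is a
hypothetical wrapper around whatever entry point we end up with):

    import time

    def profile_call(fn, *args):
        wall0, cpu0 = time.perf_counter(), time.process_time()
        result = fn(*args)
        wall = time.perf_counter() - wall0    # elapsed time
        cpu = time.process_time() - cpu0      # time the ARM actually spent working
        # wall >> cpu: the ARM sits idle waiting on the EVE (or on transfers)
        # wall ~= cpu: the overhead is on the ARM side (marshalling, copies)
        print("wall: %.1f ms, ARM cpu: %.1f ms" % (wall * 1e3, cpu * 1e3))
        return result

    # profile_call(run_tidl_inference, frame)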

Hi Prashant,

I haven't looked into the compilation before. I think it is fine as long as it
doesn't require regenerating the weights file and it doesn't take a long time.
How open is the compiler?

The other potential gotcha is - will the Neo compiler result in the same seg
faults?