About six months ago I started working on a new project: a weeding robot targeted at organic farms. I have been very interested in the revolution in visual object recognition, and I wanted a project that uses vision so I could learn more about it and try to produce a robot that does something more useful than just puttering around my house. I debated several different projects, but in the end it was my attendance at the AUVSI XPONENTIAL conference that pushed the idea of a weeding robot over the top for me. I attended several talks on the state of the art in agricultural drones and robots, and they convinced me that this is an area with a lot of potential for intelligent, autonomous robots. A number of companies have started doing research on agricultural robots, mostly in Europe but several in the US as well, so it still looks like the field is early in its development with no clearly dominant player. I also received some really good feedback from the organic farmers I talked with. Several of them said that dealing with weeds without using herbicides was one of their biggest headaches, so it looks like there is a market for this type of robot. I am still gathering feedback from various organic and non-organic farmers, and I hope to keep doing so as my plans progress.
Another reason I decided to go after the weeding robot is that it looked like something that could realistically be done now, whereas even a few years ago that was not the case. Over the last couple of years the tools and algorithms available for this kind of vision project have really matured. However, before going whole hog and building an expensive robot, I wanted some verification that this project could realistically succeed. So as a first step I decided to build a classifier network that could look at images of individual plants and classify them as weeds or crops. This would provide a gut check on whether I could even build a network that would be able to tell the crops from the weeds in a best-case scenario.
Since I am dealing with deep neural networks (DNNs), I needed data to train and test with. A lot of data! The images I found online did not really fit what I wanted. I needed images of a variety of individual weeds and crops taken from the point of view of a small unmanned ground vehicle (UGV). I was lucky in that there is a local community volunteer organic garden close to my house that produces food for the elderly and the infirm (the CASA Garden; special thanks to Howard and Sean). So I packed up my trusty Panasonic camera and went to volunteer. Over the summer I spent a good bit of time taking pictures of plants and doing weeding and other chores. Luckily, my first visit was just as they were putting in a new batch of corn and okra, so I decided those would be the two crop plants I would use. Each time I visited the garden I would first go through and take pictures of the weeds, plus a few with both weeds and crops. Then I would pull the weeds and do another round of pictures with just the crops. This let me make sure that what was in a picture was really a weed, because that was sometimes difficult to tell when looking at the picture three weeks later. It also let me get good clean shots of the crop plants without any obstructing weeds. Over the course of the summer I ended up with over 7,000 hi-res shots of various weeds, crops, and combinations.
Once I had a little raw data to start playing with, I needed to get it into a format I could train with. I found a really cool open source app called labelImg that made it easy to go through batches of photos, draw labeled bounding boxes around objects, and save them in a standard XML format. I ended up making some modifications to that app to make it easier to use for my specific project. You can find that on my GitHub account here. The labeled images looked like what is shown in figure 1.
I then wrote some Python pre-processing scripts to go through the XML files, pull out the bounded images, resize them to a standard size, and save them in the directory structure needed by DIGITS and Caffe. Once I had the data for ~40% of the images finished I ran a test and was really pleased with the results. For this test I lumped all the weeds together into one class, so overall there were three classes: weed, corn, and okra. After training, the network ended up with over a 98% accuracy rating when shown an image of a plant. The coolest part is that it was really pretty good at classifying types of weeds it had never seen before. I held back some of the weed images where I only had a few examples of a given type and did not train with them. They looked distinct from any of the weed types it had been trained on, but it was still able to correctly identify them as weeds.
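For reference, the core of that pre-processing pass looked roughly like the sketch below. This is a simplified stand-in rather than my exact script; the paths, the 256x256 target size, and the output naming are just placeholders, and it assumes labelImg's Pascal VOC-style XML output.

```python
import os
import xml.etree.ElementTree as ET
from PIL import Image

def extract_crops(xml_dir, image_dir, out_dir, size=(256, 256)):
    """Crop each labeled bounding box out of its photo, resize it, and file it
    into one sub-directory per class (the layout a DIGITS/Caffe classification
    dataset expects)."""
    for xml_file in os.listdir(xml_dir):
        if not xml_file.endswith('.xml'):
            continue
        root = ET.parse(os.path.join(xml_dir, xml_file)).getroot()
        img = Image.open(os.path.join(image_dir, root.find('filename').text))
        for i, obj in enumerate(root.findall('object')):
            label = obj.find('name').text  # e.g. 'weed', 'corn', or 'okra'
            box = obj.find('bndbox')
            coords = tuple(int(box.find(k).text)
                           for k in ('xmin', 'ymin', 'xmax', 'ymax'))
            crop = img.crop(coords).resize(size)
            class_dir = os.path.join(out_dir, label)
            if not os.path.isdir(class_dir):
                os.makedirs(class_dir)
            crop.save(os.path.join(class_dir,
                                   '%s_%d.png' % (os.path.splitext(xml_file)[0], i)))
```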
Being able to correctly classify individual images of plants as weed or crop is a good first step, but it is still a big leap from that to a robot that can reliably use vision to decide which plants to kill as weeds and which to leave alone as crop. A robot will be taking images with lots of clutter. There may be numerous crop plants in view with weeds interspersed between and around them. There may be farm tools and vehicles, trees, bushes, and people in view. So it is not enough to be able to look at a nicely centered image focused on a single plant and classify it. The DNN must be able to locate all plants within a cluttered image, classify each one as weed or crop, and outline the rough area each plant covers. This is a much more difficult task.
Luckily, over the last year a new tool has been developed that can do this in a real-time setting: the fully convolutional network, or FCN (Shelhamer 2016). A traditional DNN takes image data as input and feeds it through numerous neuron layers until it reaches the end, where the output is aggregated using a softmax function to give probabilities for N distinct classes. In the case above there were three distinct classes as the output, and the network provided the probability that a given image belonged to each class. An FCN works similarly up front, but it chops off that final aggregation step and instead takes the network output for large sections of the image and runs it backwards (deconvolution) to produce a mask image that shows where it thinks objects are located within the image, and what the most likely classification is for each object. So what you end up with is a color coded label image that shows the areas where it thinks objects are located. In this case the FCN will look at a picture of a corn field and color in the pixels where it thinks there is corn, okra, or weeds. To train this type of network you need to feed it not only the original image, but also a ground truth label image that color codes where each object is in the picture. Figure 2 shows a raw image with corn and weeds in it, figure 3 shows the color coded label image, and figure 4 shows the output generated by the FCN with the labels it drew.
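To make that last step a little more concrete, the sketch below shows how a per-pixel score map gets turned into a color coded label image. The `scores` array stands in for the FCN's output (Caffe handles the deconvolution itself), and the class colors here are placeholders rather than the actual palette I used.

```python
import numpy as np
from PIL import Image

# Placeholder colors for background plus the three plant classes.
CLASS_COLORS = np.array([
    [0,   0,   0],    # 0: background - black
    [255, 0,   0],    # 1: weed       - red
    [0,   255, 0],    # 2: corn       - green
    [0,   0,   255],  # 3: okra       - blue
], dtype=np.uint8)

def scores_to_label_image(scores):
    """scores: (num_classes, height, width) array of per-pixel class scores
    from the FCN. Picks the most likely class at each pixel and colors it."""
    class_map = np.argmax(scores, axis=0)            # winning class per pixel
    return Image.fromarray(CLASS_COLORS[class_map])  # map class indices to RGB
```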
Thanks to the hard work of people like Evan Shelhamer, Greg Heinrich, and many others, an open source version of the FCN was added to Caffe and DIGITS, making it much easier for people like me to build and test one of these networks. However, I once again had to get the data ready for training, and this time it would be more difficult since I had to paint in color coded areas on each image instead of just drawing a rough bounding box. I looked at several tools for doing this, but in the end I found GIMP's foreground extraction tool to be the easiest and cheapest to use. I wrote a couple of little Python plug-in scripts to help automate things a bit, but it was still a lot more work to get this first set of image labels done. I also hired a little bit of help on some of the images (thanks to Sadaf Sahar).
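As a rough illustration of the kind of conversion this pipeline needs, something like the sketch below turns a hand-painted, color coded label image into a single-channel image whose pixel values are class indices, the style of label image the Caffe/DIGITS segmentation examples work with. The specific colors and class numbering here are placeholders, not the ones I actually used, and the exact format is worth double-checking against the DIGITS documentation.

```python
import numpy as np
from PIL import Image

# Placeholder mapping from painted label colors to class indices.
COLOR_TO_CLASS = {
    (0,   0,   0):   0,  # background
    (255, 0,   0):   1,  # weed
    (0,   255, 0):   2,  # corn
    (0,   0,   255): 3,  # okra
}

def color_label_to_index_image(in_path, out_path):
    """Convert a color coded label image into a grayscale image where each
    pixel value is the class index for that pixel."""
    rgb = np.array(Image.open(in_path).convert('RGB'))
    index = np.zeros(rgb.shape[:2], dtype=np.uint8)
    for color, cls in COLOR_TO_CLASS.items():
        mask = np.all(rgb == np.array(color, dtype=np.uint8), axis=-1)
        index[mask] = cls
    Image.fromarray(index, mode='L').save(out_path)
```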
Once I had ~10% (698) of the images labeled in this way, I decided that was enough for an initial test to see if the approach would work before spending a lot of effort on the rest of the images. After some trial and error I was finally able to train a network that attained an accuracy rate above 94%. Figure 5 shows a number of test images with the FCN labeling on them, along with a brief description of what was labeled correctly and incorrectly in each one. These are images the network had never been trained on. The network is still far from perfect, and is still not something that can be used directly to control a real-life weeding robot, but to me it is impressive and acts as a proof of concept. It demonstrates that it should be possible to get to that final stage without too much more work.
One thing to note is that I will not need the entire plant to be labeled. In the final system it should be good enough to have some portion of it labeled. If the robot can look at a cluttered image and see where the plants are located, it can zoom in on those plants to get a more accurate read on what each one is before making any decision on whether or not to get rid of it. Notice in the images that when the network is focused on just one plant, or a small cluster of plants, it usually does a really good job of labeling them. Also, the robot will need to integrate the visual information with point cloud data at some point, so if a portion of a plant is labeled it should be possible to use that to determine the 3-D outline of the plant and label the whole thing. Additionally, the robot will not be using a single labeled image to make the decision on what to eliminate. Instead it will need to "look around" a bit and see a plant from a few angles before making a decision, integrating the output from all of those images before making the final call.
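As a sketch of what that final integration step might look like, assuming each view of a plant yields a vector of class probabilities (the function, class ordering, and threshold below are hypothetical, not part of the current system):

```python
import numpy as np

def decide_weed(per_view_probs, weed_class=1, threshold=0.9):
    """per_view_probs: one probability vector over the classes per camera view
    of the same plant. Average the views and only call the plant a weed if the
    combined weed probability clears a conservative threshold."""
    combined = np.mean(np.stack(per_view_probs), axis=0)
    return np.argmax(combined) == weed_class and combined[weed_class] >= threshold

# Example: three views of the same plant, classes [background, weed, corn, okra].
views = [np.array([0.05, 0.90, 0.03, 0.02]),
         np.array([0.10, 0.85, 0.03, 0.02]),
         np.array([0.02, 0.95, 0.02, 0.01])]
print(decide_weed(views))  # True - every view agrees it is most likely a weed
```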
Overall I am really impressed with the initial results from this first test network. It is only using about 10% of the total training data, and it is using the coarsest model, FCN32. There are three different models you can use for an FCN: FCN32 produces the lowest-resolution output, while FCN16 and FCN8 give you 2x and 4x the resolution on the resulting mask image, which produces a smoother and more accurate final prediction of the areas where items were found. So using the full data set and switching to the FCN8 model are two things that should further improve the accuracy. However, there are a few problems that I think will still persist with this solution and must be addressed some other way in the final system. I will discuss ideas for using multi-spectral and depth data to address these problems in my next blog post.
NeuroRobotic Technologies is dedicated to creating the next generation of intelligent, adaptive robotic systems by building autonomous control systems that mimic the brains of real animals.