Cassandra – frequent mistakes

Since we started working with Cassandra I’ve been noting down all the mistakes we made due to our inexperience with it, so we don’t repeat them. I didn’t talk about them much because I was really ashamed of some of them 😀 But recently I saw a video about frequent mistakes with Cassandra, and almost all of ours were there! If only this video had existed when I started… *sigh* But hey, now that I’ve seen we are not dumb, and that being wrong is part of learning Cassandra, I am not ashamed anymore, and I’ll explain all the mistakes, just to help out anybody starting with Cassandra right now. Here we go!

Mistake #1- Using SAN or RAID 1/10/5

These systems solve a nonexistent problem, because Cassandra is designed to be fault tolerant across its nodes. Of course we can add another high availability layer with a SAN or RAID, but it will be more expensive and perform worse than adding more Cassandra nodes. Besides, with a SAN, if all our nodes write to the same array, we’re adding a Single Point of Failure that we didn’t have before. Plus, Cassandra data is written to several servers at the same time, and this triggers internal processes (like compactions) at the same time on all of them. Cassandra tries to extract maximum I/O assuming disks are local (the normal case), and the result is every server hammering the SAN simultaneously, bringing I/O performance down (and, with it, our cluster performance). Our performance with a SAN will not improve, will probably get worse, and will come at a high cost. Obviously you can buy an array powerful enough to withstand this load, but with half the money you can improve Cassandra “the Cassandra way”: using local disks and, if we want RAID, making it RAID 0. Cassandra is already fault tolerant; let it do its job.

Mistake #2- Using a load balancer in front of Cassandra

At first it seems like a good idea: we guarantee an even load and, if a node goes down, the load balancer will mark it and won’t send any more connections until it comes back. This already works for our web servers and does a good job, right? ERROR!! Same issue as before: this problem doesn’t exist in Cassandra. Data is evenly distributed across the nodes, and the high-level clients (such as Hector, Astyanax or pycassa) already spread queries evenly and mark dead nodes. Plus, by adding a load balancer we are adding a Single Point of Failure to a system that didn’t have one, using more resources and making the architecture more expensive. The load balancer can even become the bottleneck of the system, bringing us problems where there were none. Cassandra already balances the load; let it do its job.
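
To make it concrete, here’s a minimal sketch (not our production code) of what the client already does for us, using the DataStax Python driver as a newer alternative to the pycassa mentioned above. I’m assuming a 3.x driver where `Cluster` still takes `load_balancing_policy` directly; the contact points and data-center name are placeholders.

```python
# A minimal sketch: client-side load balancing with the DataStax Python
# driver (assuming cassandra-driver 3.x). Contact points are placeholders.
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# The driver discovers every node in the ring from the contact points,
# spreads requests over them, routes each query to a replica that owns
# the data, and marks dead nodes down on its own -- no balancer needed.
cluster = Cluster(
    contact_points=["10.0.0.1", "10.0.0.2"],
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="dc1")
    ),
)
session = cluster.connect()

row = session.execute("SELECT release_version FROM system.local").one()
print(row.release_version)
cluster.shutdown()
```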

Mistake #3- Putting the CommitLog and SSTables on the same disk (doesn’t apply to SSDs)

The CommitLog is written sequentially, all the time. If SSTables are being read from the same disk (those are random reads), each read interferes with the CommitLog writes: we lose sequentiality, the disk has to seek, and the writing process slows down (keep in mind that write speed in Cassandra is limited by CommitLog writes, because everything else is written to RAM). The CommitLog doesn’t need much disk space, but it suffers badly from seek time. In our case we disregarded this, thinking it would make no difference to keep them separate… but when testing our cluster, nodes with a load average of 7-7.5 dropped to 4.5-5 just by changing that. Our face? Open mouth at first, facepalm afterwards. Obviously this does not apply to SSDs, where there is no seek time.
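
If you want to double-check a node, here’s a small sketch (not official tooling) that compares the block devices behind the two directories. The paths below are the usual package-install defaults; adjust them to whatever your cassandra.yaml sets for `commitlog_directory` and `data_file_directories`.

```python
# Warn if the CommitLog and the SSTable data directories share a block
# device. Paths are common defaults and may differ on your install.
import os

COMMITLOG_DIR = "/var/lib/cassandra/commitlog"
DATA_DIRS = ["/var/lib/cassandra/data"]

commitlog_dev = os.stat(COMMITLOG_DIR).st_dev
for data_dir in DATA_DIRS:
    if os.stat(data_dir).st_dev == commitlog_dev:
        print(f"WARNING: {data_dir} shares a device with the CommitLog "
              "-- random SSTable reads will break its sequential writes")
    else:
        print(f"OK: {data_dir} is on a separate device from the CommitLog")
```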

Mistake #4- Forgetting to install the JNA library (Java Native Access)

What’s happening on this server where snapshots take more time than on the others? Why is the load average higher on this one? Check your puppet manifests, because there’s a 99% probability you forgot to install the JNA library. JNA (Java Native Access) lets the Java virtual machine interact with the filesystem natively through the operating system, instead of going through Java libraries that are slower and less efficient. Interacting with files becomes much faster, and you notice it especially when taking snapshots. Installation is very easy, and the consequences of not installing it are serious.
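
As a quick sanity check, this hedged sketch just looks for a JNA jar in Cassandra’s lib directory. The path is the usual package-install location and may differ on your layout (newer versions bundle the jar there, as a commenter points out below).

```python
# Look for a JNA jar in Cassandra's lib directory. The path is an
# assumption based on a typical package install; adjust as needed.
import glob

CASSANDRA_LIB = "/usr/share/cassandra/lib"

jna_jars = glob.glob(f"{CASSANDRA_LIB}/jna-*.jar")
if jna_jars:
    print("JNA found:", ", ".join(jna_jars))
else:
    print("No JNA jar in", CASSANDRA_LIB,
          "-- install the JNA library or add it to the classpath")
```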

Mistake #5 – Not raising the 1024 file descriptors limit for Cassandra

Cassandra works with lots of sockets and open files (there are a lot of SSTables!), and 1024 (the default maximum in Linux) runs out fast. While you are in a test or integration environment you probably won’t notice it, and you may even start out in production with no problems. But as SSTables are written, the number of open files grows, and once you reach the limit your node will go down in a nasty way, probably dragging the rest of the nodes with it. Don’t forget to add the limits.conf modification to your puppet manifest!!
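
Here’s a trivial sketch to check what limit the Cassandra process would actually inherit; run it as the same user that runs Cassandra.

```python
# Print the file-descriptor limits of the current user/session.
# 1024 is the usual Linux default and far too low for a node full of SSTables.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}")
if soft <= 1024:
    print("Raise the limit in /etc/security/limits.conf "
          "(a 'nofile' entry for the cassandra user) and log in again")
```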

Mistake #6- Giving the Java heap more than 8GB

You have a 64GB RAM server (you rock!!) and you think about giving Cassandra some of that memory, or you get “OutOfMemory” errors every now and then, or you think garbage collection is running too often, or for whatever reason you think it’s a good idea to increase the RAM assigned to the heap, so you edit cassandra-env.sh to raise the Java heap. But don’t forget that garbage collection is a resource-consuming process, and its impact on the cluster grows steeply with the amount of memory it has to free. The recommendation is to assign 4-8GB to the Java heap, and in some cases a bit more, up to 12GB, but only if we know what we are doing and are targeting something specific. You should never assign more than 16GB.

And if you’re worried about wasting the rest of that memory, don’t be: Cassandra will end up using it anyway as OS page cache.
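
For reference, here is a sketch of the kind of heuristic cassandra-env.sh applies when you leave MAX_HEAP_SIZE unset; the exact numbers can differ between versions, so treat them as illustrative rather than gospel.

```python
# Illustrative default-heap heuristic (numbers in MB, not authoritative):
# take the larger of "half the RAM capped at 1GB" and "a quarter of the
# RAM capped at 8GB" -- so even a 64GB box stays at an 8GB heap.
def default_max_heap_mb(system_memory_mb: int) -> int:
    half_capped = min(system_memory_mb // 2, 1024)
    quarter_capped = min(system_memory_mb // 4, 8 * 1024)
    return max(half_capped, quarter_capped)

for ram_mb in (4 * 1024, 16 * 1024, 64 * 1024):
    print(f"{ram_mb // 1024:>3} GB RAM -> {default_max_heap_mb(ram_mb)} MB heap")
```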

Mistake #7- Using Amazon EBS disks

EBS disks on Amazon have unreliable performance. Amazon sells them as if they were solid, but in practice the I/O is very variable, and we can get long stretches of high disk latency that kill our cluster. It’s better to use the EC2 ephemeral disks, because their performance is much more predictable. Yes, if a node dies or we need to stop or replace the instance, we lose the data on that node and have to rebuild it… but we have nodetool rebuild for that. And yes, if the whole availability zone hosting our cluster goes down we will need to start from scratch, which is risky, but if you are on AWS you are very likely storing snapshots of the servers and the Cassandra data in Amazon S3, and restoring will be fast (Netflix’s Priam will help you get the job done). More information about EBS.
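
And for the “snapshots in S3” part, here’s a hedged sketch of the idea using boto3; the bucket name and paths are placeholders, and a tool like Netflix’s Priam does this (plus incremental backups and restores) far more thoroughly.

```python
# Upload a node's snapshot files to S3. Bucket and paths are hypothetical;
# adjust SNAPSHOT_ROOT to your data_file_directories setting.
import os
import boto3

BUCKET = "my-cassandra-backups"             # hypothetical bucket
SNAPSHOT_ROOT = "/var/lib/cassandra/data"   # typical data directory

s3 = boto3.client("s3")
for dirpath, _dirnames, filenames in os.walk(SNAPSHOT_ROOT):
    if "/snapshots/" not in dirpath + "/":
        continue  # only ship files that live under a snapshots/ directory
    for name in filenames:
        local_path = os.path.join(dirpath, name)
        key = os.path.relpath(local_path, SNAPSHOT_ROOT)
        s3.upload_file(local_path, BUCKET, key)
        print("uploaded", key)
```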

And these are the mistakes we made. If you have read this BEFORE starting with Cassandra, you just saved yourself some headaches! 😀

  • Tim Heider

    Thank you for sharing this valuable experience.

  • R. Toma

    Re: “Mistake #4- Forget to install JNA library (Java Native Access)”

    Docs for Cassandra 2.1 do not mention installing JNA anymore as previous versions did.
    I believe Java Native Access is still used, but has been integrated via /usr/share/cassandra/lib/jna-4.0.0.jar

  • Paul Villacorta

    These are really good pain points for beginners to keep in mind. On item #7, Cassandra also works well on EBS-optimized instances. We’ve tried and tested Provisioned IOPS and were pretty amazed by the performance and latency it achieved! I wonder if someone has benchmarked Cassandra on Amazon r3 instance types with EBS optimization.

    Great!