hypervisor-based single cpu instances and atomic CAS

hypervisor-based single cpu instances and atomic CAS

JSR166 Concurrency mailing list
It seems that AWS EC2 instances give huge price breaks per CPU for workloads that can be split across a set of "virtual" single-CPU instances.

I am wondering about my heavy use of lock-free data structures in j.u.c, and atomic variables in particular, for these workloads, and the underlying CAS call that these are mostly based on (relative to such single-CPU machines in the hypervisor sense).  Are these CAS calls overly costly, making lock-free algorithms an anti-pattern on such machines?  Should traditional synchronized be preferred?  Is there any difference between hypervisor machines that have one CPU and real machines that do (if any such still exist), relative to the CAS?

The answers to this question might also imply that Node.js is more performant than Java on such single-CPU machine instances in the cloud.  Which would come as a big surprise.


_______________________________________________
Concurrency-interest mailing list
[hidden email]
http://cs.oswego.edu/mailman/listinfo/concurrency-interest
Re: hypervisor-based single cpu instances and atomic CAS



On Feb 15, 2018, at 11:34 AM, andy--- via Concurrency-interest <[hidden email]> wrote:

It seems that AWS EC2 instances give huge price breaks per CPU for workloads that can be split across a set of "virtual" single-CPU instances.

I would start by examining this premise (that single-CPU instances are heavily discounted). I think it is wrong.

E.g. here is a snapshot of the current on-demand pricing of low end instances on AWS (the ones that have 1 vcore options):

General Purpose - Current Generation

Instance    vCPU  ECU       Memory (GiB)  Instance Storage (GB)  Linux/UNIX Usage
t2.nano     1     Variable  0.5           EBS Only               $0.0058 per Hour
t2.micro    1     Variable  1             EBS Only               $0.0116 per Hour
t2.small    1     Variable  2             EBS Only               $0.023 per Hour
t2.medium   2     Variable  4             EBS Only               $0.0464 per Hour
t2.large    2     Variable  8             EBS Only               $0.0928 per Hour
t2.xlarge   4     Variable  16            EBS Only               $0.1856 per Hour

As you can see, the price is actually pretty linear per GB. And the smallest amount of memory per vcore you can get (when going above 1 vcore) is 2GB. The only option for renting <2GB instances is a single-vcore instance, and the price remains linear per GB down to 0.5GB.

So a way to look at the above is "single-CPU instances give you the option of renting instances with less than 2GB of memory". Or "<4GB AWS instances have only 1 vcore".

You may look at the above and think "If I can keep my cpu-intensive workloads to <0.5GB of memory, I can get a lot of cheap CPU using single-vcore instances". But that would be misleading. The physical machines AWS uses have at least 2GB of physical memory per hyperthread (and likely a lot more, ranging from 4GB to 8GB per hyperthread on modern HW). E.g. a single 2-socket, 48-vcore, 128GB physical machine has enough memory to hold 256 t2.nano instances. If you actually rented 256 of those instances you would not be getting 256 vcores, and may only get 48 (or even fewer).

I am wondering about my heavy use of lock-free data structures in j.u.c, and atomic variables in particular, for these workloads, and the underlying CAS call that these are mostly based on (relative to such single-CPU machines in the hypervisor sense).  Are these CAS calls overly costly, making lock-free algorithms an anti-pattern on such machines?  Should traditional synchronized be preferred?  Is there any difference between hypervisor machines that have one CPU and real machines that do (if any such still exist), relative to the CAS?

It may be interesting to have a mode where the JIT is told the JVM is limited to a single core (either because the OS has only 1 vcore, or because the JVM is limited to a single core via isolcpus or cpusets). The JIT would then be able to emit non-atomic CAS, and remove cpu-ordering instructions for e.g. volatile writes, neither of which would affect program order, and both of which could result in faster execution of the same logic.
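As a rough illustration of what such a mode could save, here is a minimal microbenchmark sketch (not JMH, so only indicative; the class name and iteration count are arbitrary) comparing a plain increment, which a single-core JIT could in principle emit for an atomic counter, against the CAS-based AtomicLong increment:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasCost {
    static final int ITERS = 1_000_000;
    static long plain;                              // no atomicity, no ordering
    static final AtomicLong atomic = new AtomicLong();  // CAS-based increment

    public static void main(String[] args) {
        // Plain increment loop: no lock-prefixed instruction, no fences.
        long t0 = System.nanoTime();
        for (int i = 0; i < ITERS; i++) plain++;
        long plainNs = System.nanoTime() - t0;

        // Same logical work via an atomic read-modify-write.
        t0 = System.nanoTime();
        for (int i = 0; i < ITERS; i++) atomic.incrementAndGet();
        long atomicNs = System.nanoTime() - t0;

        System.out.println("plain  : ~" + plainNs / ITERS + " ns/op");
        System.out.println("atomic : ~" + atomicNs / ITERS + " ns/op");
    }
}
```

On typical x86 hardware the atomic loop pays for the lock-prefixed instruction even with zero contention; a serious comparison should use JMH to control for warm-up and loop optimizations (the JIT may, for instance, strength-reduce the plain loop).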


The answers to this question might also imply that Node.js is more performant than Java on such single-CPU machine instances in the cloud.  Which would come as a big surprise.

Re: hypervisor-based single cpu instances and atomic CAS

On 2/15/2018 1:33 PM, Gil Tene via Concurrency-interest wrote:


On Feb 15, 2018, at 11:34 AM, andy--- via Concurrency-interest <[hidden email]> wrote:


I am wondering about my heavy use of lock-free data structures in j.u.c, and atomic variables in particular, for these workloads, and the underlying CAS call that these are mostly based on (relative to such single-CPU machines in the hypervisor sense).  Are these CAS calls overly costly, making lock-free algorithms an anti-pattern on such machines?  Should traditional synchronized be preferred?  Is there any difference between hypervisor machines that have one CPU and real machines that do (if any such still exist), relative to the CAS?

It may be interesting to have a mode where the JIT is told the JVM is limited to a single core (either because the OS has only 1 vcore, or because the JVM is limited to a single core via isolcpus or cpusets). The JIT would then be able to emit non-atomic CAS, and remove cpu-ordering instructions for e.g. volatile writes, neither of which would affect program order, and both of which could result in faster execution of the same logic.
The JIT can already do single-hardware-thread optimizations.  This used to default to on.  Many years ago, I asked for it to default to off since affinity masking (I think) confused the JVM into making the wrong decision and caused crashes.  If the feature defaults to off, there is a command-line switch to turn it on.

I have not heard whether the JIT can do single-core (2 hardware threads) optimizations.  There might be some optimizations to do here, but they will be limited compared to the single-hardware-thread case.


The answers to this question might also imply that Node.js is more performant than Java on such single-CPU machine instances in the cloud.  Which would come as a big surprise.


-- 
-Nathan

Re: hypervisor-based single cpu instances and atomic CAS


It’s ironic that we are (were?) on the verge of ripping out os::is_MP() and always assuming MP at build and runtime.

 

BTW with single-core we can get rid of memory ordering instructions, but CAS must still be atomic wrt. any possible context switching.

 

David

 

From: Concurrency-interest [mailto:[hidden email]] On Behalf Of Nathan and Ila Reynolds via Concurrency-interest
Sent: Friday, February 16, 2018 6:41 AM
To: [hidden email]
Subject: Re: [concurrency-interest] hypervisor-based single cpu instances and atomic CAS
Re: hypervisor-based single cpu instances and atomic CAS

In reply to the original post
On 15/02/18 19:34, andy--- via Concurrency-interest wrote:
> I am wondering about my heavy use of lockfree datastructures in j.u.c and atomic
> variables in particular for these workloads, and the underlying CAS call that
> these are mostly based on (relative to such single cpu machines in the
> hypervisor sense).  Are these CAS calls overly costly, making lockfree
> algorithms an anti-pattern on such machines?  Should traditional synchronized be
> preferred?
Just one note: synchronized uses CAS under the hood.  If locks are (mostly)
uncontended, the cost of synchronized is the cost of a CAS or two.
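To make that concrete, an uncontended lock acquisition is roughly one successful CAS. Below is a toy spinlock, a deliberately simplified model and not HotSpot's actual fast path (which CASes on the object header's mark word and parks waiting threads rather than spinning):

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model of a CAS-based lock: the uncontended path is a single
// successful compareAndSet, which is the cost Andrew is describing.
public class CasLock {
    private final AtomicReference<Thread> owner = new AtomicReference<>();

    public void lock() {
        Thread me = Thread.currentThread();
        // Uncontended case: this single CAS succeeds on the first try.
        while (!owner.compareAndSet(null, me)) {
            Thread.onSpinWait();  // contended case: spin (real monitors park)
        }
    }

    public void unlock() {
        owner.set(null);          // a volatile write; no CAS needed by the owner
    }
}
```

So in the mostly-uncontended case, lock() plus unlock() is one CAS and one ordered store, comparable to the one or two CASes a lock-free data structure would issue for the same operation.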

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: hypervisor-based single cpu instances and atomic CAS

On 16/02/18 00:50, David Holmes via Concurrency-interest wrote:

> It's ironic that we are (were?) on the verge of ripping out
> os::is_MP() and always assuming MP at build and runtime.

Not really, IMO.  With virtualization and containers you get
migration, and you might migrate a running VM to another host with
more virtual CPUs.  Even if there's some performance advantage, it's
not safe to do this in general.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Re: hypervisor-based single cpu instances and atomic CAS

IMO the cost of synchronized is not the cost of implementing the feature itself, since it is eclipsed by the cost of running it in production. 😊

--
Cheers,

On Feb 19, 2018 10:26, "Andrew Haley via Concurrency-interest" <[hidden email]> wrote:
Just one note: synchronized uses CAS under the hood.  If locks are (mostly)
uncontended, the cost of synchronized is the cost of a CAS or two.

Re: hypervisor-based single cpu instances and atomic CAS

The JVM may even bias the locks, so they remain owned by the thread until a contender claims them.

A single-CPU process won't have contenders, so the synchronized blocks will be just a few reads.


Alex


> On 19 Feb 2018, at 09:23, Andrew Haley via Concurrency-interest <[hidden email]> wrote:
> Just one note: synchronized uses CAS under the hood.  If locks are (mostly)
> uncontended, the cost of synchronized is the cost of a CAS or two.
