Re: [PATCH v3 00/18] APIC ID fixes for AMD EPYC CPU models


From: Igor Mammedov
Subject: Re: [PATCH v3 00/18] APIC ID fixes for AMD EPYC CPU models
Date: Thu, 6 Feb 2020 14:08:39 +0100

On Wed, 5 Feb 2020 13:07:31 -0600
Babu Moger <address@hidden> wrote:

> On 2/5/20 10:56 AM, Igor Mammedov wrote:
> > On Wed, 5 Feb 2020 10:10:06 -0600
> > Babu Moger <address@hidden> wrote:
> >   
> >> On 2/5/20 3:38 AM, Igor Mammedov wrote:  
> >>> On Tue, 4 Feb 2020 13:08:58 -0600
> >>> Babu Moger <address@hidden> wrote:
> >>>     
> >>>> On 2/4/20 2:02 AM, Igor Mammedov wrote:    
> >>>>> On Mon, 3 Feb 2020 13:31:29 -0600
> >>>>> Babu Moger <address@hidden> wrote:
> >>>>>       
> >>>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:      
> >>>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
> >>>>>>> Babu Moger <address@hidden> wrote:
> >>>>>>>         
> >>>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
> >>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
> >>>>>>>>
> >>>>>>>> Currently, the APIC ID is decoded based on the sequence
> >>>>>>>> sockets->dies->cores->threads. This works for most standard AMD and other
> >>>>>>>> vendors' configurations, but this decoding sequence does not follow that of
> >>>>>>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
> >>>>>>>> inconsistency.  When booting a guest VM, the kernel tries to validate the
> >>>>>>>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
> >>>>>>>>
> >>>>>>>> To fix the problem we need to build the topology as per the Processor
> >>>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
> >>>>>>>> Processors. It is available at 
> >>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C6b6d6af79fee45cc904808d7aa5c5f37%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165186049856500&amp;sdata=rVMRN%2BbUeGWEksKO5uQ3Wxc71eeHCXMrkLVRbo4JHHI%3D&amp;reserved=0
> >>>>>>>>
> >>>>>>>> Here is the text from the PPR.
> >>>>>>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
> >>>>>>>> number of least significant bits in the Initial APIC ID that indicate core ID
> >>>>>>>> within a processor, in constructing per-core CPUID masks.
> >>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
> >>>>>>>> (MNC) that the processor could theoretically support, not the actual number of
> >>>>>>>> cores that are actually implemented or enabled on the processor, as indicated
> >>>>>>>> by Core::X86::Cpuid::SizeId[NC].
> >>>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> >>>>>>>> • ApicId[6] = Socket ID.
> >>>>>>>> • ApicId[5:4] = Node ID.
> >>>>>>>> • ApicId[3] = Logical CCX L3 complex ID
> >>>>>>>> • ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}
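The bit layout quoted from the PPR above can be sketched as a small packing function. This is only an illustration of the quoted field positions; `epyc_apic_id()` and its parameter names are hypothetical and are not functions from QEMU or from the patch series.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a QEMU function): pack the fields listed in
 * the PPR excerpt into a 7-bit APIC ID.
 *   ApicId[6]   = Socket ID
 *   ApicId[5:4] = Node ID
 *   ApicId[3]   = Logical CCX L3 complex ID
 *   ApicId[2:0] = SMT ? {LogicalCoreID[1:0], ThreadId}
 *                     : {1'b0, LogicalCoreID[1:0]}
 */
static uint32_t epyc_apic_id(unsigned socket_id, unsigned node_id,
                             unsigned ccx_id, unsigned core_id,
                             unsigned thread_id, bool smt)
{
    uint32_t low;

    if (smt) {
        /* Two core bits followed by one thread bit. */
        low = ((core_id & 0x3u) << 1) | (thread_id & 0x1u);
    } else {
        /* Top bit of the field is zero, then two core bits. */
        low = core_id & 0x3u;
    }

    return ((socket_id & 0x1u) << 6) |
           ((node_id & 0x3u) << 4) |
           ((ccx_id & 0x1u) << 3) |
           low;
}
```

For example, with SMT enabled, socket 1, node 2, CCX 1, core 3, thread 1 packs to 0x6f. The node field sits above the CCX and core bits, which is why the encoding depends on how CPUs are assigned to nodes.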
> >>>>>>>
> >>>>>>>
> >>>>>>> After checking out all patches and some pondering, the approach used
> >>>>>>> here looks too intrusive to me for the task at hand, especially where
> >>>>>>> it comes to generic code.
> >>>>>>>
> >>>>>>> (Ignore till ==== to see suggestion how to simplify without reading
> >>>>>>> reasoning behind it first)
> >>>>>>>
> >>>>>>> Let's look for a way to simplify it a little bit.
> >>>>>>>
> >>>>>>> So the problem we are trying to solve:
> >>>>>>>  1: calculate APIC IDs based on cpu type (to be more specific: for EPYC based CPUs)
> >>>>>>>  2: it depends on knowing the total number of numa nodes.
> >>>>>>>
> >>>>>>> Externally the workflow looks like the following:
> >>>>>>>   1. user provides -smp x,sockets,cores,...,maxcpus
> >>>>>>>       that's used by the possible_cpu_arch_ids() singleton to build the list of
> >>>>>>>       possible CPUs (which is available to the user via the command 'hotpluggable-cpus')
> >>>>>>>
> >>>>>>>       The hook could be called very early, when possible_cpus data might
> >>>>>>>       not be complete. It builds a list of possible CPUs which the user
> >>>>>>>       could modify later.
> >>>>>>>
> >>>>>>>   2.1 user uses "-numa cpu,node-id=x,..." or the legacy "-numa node,node_id=x,cpus="
> >>>>>>>       options to assign cpus to nodes, which one way or another calls
> >>>>>>>       machine_set_cpu_numa_node(). The latter updates the 'possible_cpus' list
> >>>>>>>       with node information. It happens early, when the total number of nodes
> >>>>>>>       is not available.
> >>>>>>>
> >>>>>>>   2.2 user does not provide explicit node mappings for CPUs.
> >>>>>>>       QEMU steps in and assigns possible cpus to nodes in
> >>>>>>>       machine_numa_finish_cpu_init() (using the same
> >>>>>>>       machine_set_cpu_numa_node()) right before calling the board-specific
> >>>>>>>       machine init(). At that time the total number of nodes is known.
> >>>>>>>
> >>>>>>> In the 1 -- 2.1 cases, 'arch_id' in the 'possible_cpus' list doesn't have to
> >>>>>>> be defined before the board's init() is run.
> >>>>
> >>>> In case of 2.1, we need to have the arch_id already generated. This is
> >>>> done inside possible_cpu_arch_ids. The arch_id is used by
> >>>> machine_set_cpu_numa_node to assign the cpus to correct numa node.    
> >>>
> >>> I might have missed something but I don't see arch_id itself being used in
> >>> machine_set_cpu_numa_node(). It only uses props part of possible_cpus    
> >>
> >> Before calling machine_set_cpu_numa_node, we call
> >> cpu_index_to_instance_props -> x86_cpu_index_to_props->
> >> possible_cpu_arch_ids->x86_possible_cpu_arch_ids.
> >>
> >> This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all
> >> the available cpus. Based on the arch_id, it also sets up the props.  
> > 
> > 
> > x86_possible_cpu_arch_ids()
> >    arch_id = x86_cpu_apic_id_from_index(x86ms, i)
> >    x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores, ms->smp.threads, &topo);
> >    // assign socket/die/core/thread from topo
> > 
> > So currently it uses an indirect way to convert an index in possible_cpus->cpus[]
> > to socket/die/core/thread ids.
> > Essentially it takes the '-smp' options and a [0..max_cpus) number as original data,
> > converts them into an intermediate apic_id, and then reverse-engineers that back
> > into topo info.
> > 
> > Why not use x86_topo_ids_from_idx() directly to get rid of the 'props'
> > dependency on apic_id?
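A minimal sketch of the direct conversion being suggested, going from a CPU index and the -smp counts straight to topology ids without an intermediate apic_id. The struct and function names here are hypothetical and only illustrate the idea; this is not QEMU's x86_topo_ids_from_idx() implementation, and it assumes CPUs are numbered threads first, then cores, then dies, then sockets.

```c
/* Hypothetical container for the decoded ids (illustration only). */
struct topo_ids {
    unsigned pkg_id;
    unsigned die_id;
    unsigned core_id;
    unsigned smt_id;
};

/*
 * Hypothetical helper: split a linear cpu_index in [0..max_cpus) into
 * socket/die/core/thread ids using only the -smp counts.
 * Assumes nr_dies, nr_cores and nr_threads are all non-zero.
 */
static void topo_ids_from_index(unsigned nr_dies, unsigned nr_cores,
                                unsigned nr_threads, unsigned cpu_index,
                                struct topo_ids *topo)
{
    topo->smt_id  = cpu_index % nr_threads;
    topo->core_id = cpu_index / nr_threads % nr_cores;
    topo->die_id  = cpu_index / (nr_threads * nr_cores) % nr_dies;
    topo->pkg_id  = cpu_index / (nr_threads * nr_cores * nr_dies);
}
```

With 2 dies, 4 cores per die and 2 threads per core, index 13 decodes to socket 0, die 1, core 2, thread 1. Nothing here needs the APIC ID encoding, so the 'props' could be filled in before the encoding for a given CPU model is decided.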
> 
> It might work. But this feels like a work-around that just delays the problem
> until later. Just by re-arranging the numa code a little bit we can address this.

The idea behind possible_cpus is to let users query the topology information
the board generates (based on -smp) at configuration time (or later), so they
know which -numa cpu,topo_options [and -device foo-cpu,topo_options] to use.
Initializing apic_id on the first access is secondary; I did it only because
I could do it without additional data.

But the main purpose of possible_cpus is to keep topology information.
That includes numa node mapping, which should be stored in possible_cpus
along with the rest of cpu topology.

Looking at the [12/18] numa patch, it makes the legacy -numa node,cpus option
reintroduce data duplication by storing the mapping elsewhere and then
putting that mapping into possible_cpus at numa complete time
(that's what I dislike, and I don't see a valid reason to do so).

That also won't work if the user queries hotpluggable-cpus before that time,
and it doesn't work if the user uses the preferred -numa cpu,topo_options
either, as both would initialize possible_cpus on the first access.

So if you need to do some board-specific post-processing on the topo
information once it's complete and recalculate apic_id, do it at board
init time as was suggested before (x86_cpu_new() looks like a good
place to do it).

[...]



