qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 00/18] APIC ID fixes for AMD EPYC CPU models


From: Babu Moger
Subject: Re: [PATCH v3 00/18] APIC ID fixes for AMD EPYC CPU models
Date: Wed, 5 Feb 2020 13:07:31 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2


On 2/5/20 10:56 AM, Igor Mammedov wrote:
> On Wed, 5 Feb 2020 10:10:06 -0600
> Babu Moger <address@hidden> wrote:
> 
>> On 2/5/20 3:38 AM, Igor Mammedov wrote:
>>> On Tue, 4 Feb 2020 13:08:58 -0600
>>> Babu Moger <address@hidden> wrote:
>>>   
>>>> On 2/4/20 2:02 AM, Igor Mammedov wrote:  
>>>>> On Mon, 3 Feb 2020 13:31:29 -0600
>>>>> Babu Moger <address@hidden> wrote:
>>>>>     
>>>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:    
>>>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
>>>>>>> Babu Moger <address@hidden> wrote:
>>>>>>>       
>>>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C6b6d6af79fee45cc904808d7aa5c5f37%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165186049856500&amp;sdata=vDAkIxR3U6LX%2FmnYjZPRC55smMqLend%2FHQjbfYWydBk%3D&amp;reserved=0
>>>>>>>>
>>>>>>>> Currently, the APIC ID is decoded based on the sequence
>>>>>>>> sockets->dies->cores->threads. This works for most standard AMD and 
>>>>>>>> other
>>>>>>>> vendors' configurations, but this decoding sequence does not follow 
>>>>>>>> that of
>>>>>>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU 
>>>>>>>> topology
>>>>>>>> inconsistency.  When booting a guest VM, the kernel tries to validate 
>>>>>>>> the
>>>>>>>> topology, and finds it inconsistent with the enumeration of EPYC cpu 
>>>>>>>> models.
>>>>>>>>
>>>>>>>> To fix the problem we need to build the topology as per the Processor
>>>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
>>>>>>>> Processors. It is available at 
>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C6b6d6af79fee45cc904808d7aa5c5f37%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165186049856500&amp;sdata=rVMRN%2BbUeGWEksKO5uQ3Wxc71eeHCXMrkLVRbo4JHHI%3D&amp;reserved=0
>>>>>>>>
>>>>>>>> Here is the text from the PPR.
>>>>>>>> Operating systems are expected to use 
>>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize], the
>>>>>>>> number of least significant bits in the Initial APIC ID that indicate 
>>>>>>>> core ID
>>>>>>>> within a processor, in constructing per-core CPUID masks.
>>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of 
>>>>>>>> cores
>>>>>>>> (MNC) that the processor could theoretically support, not the actual 
>>>>>>>> number of
>>>>>>>> cores that are actually implemented or enabled on the processor, as 
>>>>>>>> indicated
>>>>>>>> by Core::X86::Cpuid::SizeId[NC].
>>>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
>>>>>>>> • ApicId[6] = Socket ID.
>>>>>>>> • ApicId[5:4] = Node ID.
>>>>>>>> • ApicId[3] = Logical CCX L3 complex ID
>>>>>>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : 
>>>>>>>> {1'b0,LogicalCoreID[1:0]}      
>>>>>>>
>>>>>>>
>>>>>>> After checking out all patches and some pondering, used here approach
>>>>>>> looks to me too intrusive for the task at hand especially where it
>>>>>>> comes to generic code.
>>>>>>>
>>>>>>> (Ignore till ==== to see suggestion how to simplify without reading
>>>>>>> reasoning behind it first)
>>>>>>>
>>>>>>> Lets look for a way to simplify it a little bit.
>>>>>>>
>>>>>>> So problem we are trying to solve,
>>>>>>>  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC 
>>>>>>> based CPUs)
>>>>>>>  2: it depends on knowing total number of numa nodes.
>>>>>>>
>>>>>>> Externally workflow looks like following:
>>>>>>>   1. user provides -smp x,sockets,cores,...,maxcpus
>>>>>>>       that's used by possible_cpu_arch_ids() singleton to build list of
>>>>>>>       possible CPUs (which is available to user via command 
>>>>>>> 'hotpluggable-cpus')
>>>>>>>
>>>>>>>       Hook could be called very early and possible_cpus data might be
>>>>>>>       not complete. It builds a list of possible CPUs which user could
>>>>>>>       modify later.
>>>>>>>
>>>>>>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa 
>>>>>>> node,node_id=x,cpus="
>>>>>>>       options to assign cpus to nodes, which is one way or another 
>>>>>>> calling
>>>>>>>       machine_set_cpu_numa_node(). The later updates 'possible_cpus' 
>>>>>>> list
>>>>>>>       with node information. It happens early when total number of nodes
>>>>>>>       is not available.
>>>>>>>
>>>>>>>   2.2 user does not provide explicit node mappings for CPUs.
>>>>>>>       QEMU steps in and assigns possible cpus to nodes in 
>>>>>>> machine_numa_finish_cpu_init()
>>>>>>>       (using the same machine_set_cpu_numa_node()) right before calling 
>>>>>>> boards
>>>>>>>       specific machine init(). At that time total number of nodes is 
>>>>>>> known.
>>>>>>>
>>>>>>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be 
>>>>>>> defined before
>>>>>>> boards init() is run.    
>>>>
>>>> In case of 2.1, we need to have the arch_id already generated. This is
>>>> done inside possible_cpu_arch_ids. The arch_id is used by
>>>> machine_set_cpu_numa_node to assign the cpus to correct numa node.  
>>>
>>> I might have missed something but I don't see arch_id itself being used in
>>> machine_set_cpu_numa_node(). It only uses props part of possible_cpus  
>>
>> Before calling machine_set_cpu_numa_node, we call
>> cpu_index_to_instance_props -> x86_cpu_index_to_props->
>> possible_cpu_arch_ids->x86_possible_cpu_arch_ids.
>>
>> This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all
>> the available cpus. Based on the arch_id, it also sets up the props.
> 
> 
> x86_possible_cpu_arch_ids()
>    arch_id = x86_cpu_apic_id_from_index(x86ms, i)
>    x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,  
> ms->smp.threads, &topo);
>    // assign socket/die/core/thread from topo
> 
> so currently it uses indirect way to convert index in possible_cpus->cpus[]
> to socket/die/core/thread ids.
> But essentially it take '-smp' options and [0..max_cpus) number as original 
> data
> converts it into intermediate apic_id and then reverse engineer it back to
> topo info.
> 
> Why not use x86_topo_ids_from_idx() directly to get rid of 'props' dependency 
> on apic_id?

It might work. But this feels like a work-around and delaying the problem
for later. Just re-arranging the numa code little bit we can address this.

> 
> 
> 
>> And these props values are used to assign the nodes in
>> machine_set_cpu_numa_node.
>>
>> At this point we are still parsing the numa nodes and so we don't know the
>> total number of numa nodes. Without that information, the arch_id
>> generated here will not be correct for EPYC models.
>>
>> This is the reason for changing the generic numa code(patch #12-Split the
>> numa initialization).
>>
>>>
>>>    
>>>> If we want to move the arch_id generation into board init(), then we need
>>>> to save the cpu indexes belonging to each node somewhere.  
>>>
>>> when cpus are assigned explicitly, decision what cpus go to what nodes is
>>> up to user and user configured mapping is stored in 
>>> MachineState::possible_cpus
>>> which is accessed by via possible_cpu_arch_ids() callback.
>>> Hence I don see any reason to touch cpu indexes.  
>>
>> Please see my reasoning above.
>>
>>>   
>>>>  
>>>>>>>
>>>>>>> In 2.2 case it calls get_default_cpu_node_id() -> 
>>>>>>> x86_get_default_cpu_node_id()
>>>>>>> which uses arch_id calculate numa node.
>>>>>>> But then question is: does it have to use APIC id or could it infer 
>>>>>>> 'pkg_id',
>>>>>>> it's after, from ms->possible_cpus->cpus[i].props data?      
>>>>>>
>>>>>> Not sure if I got the question right. In this case because the numa
>>>>>> information is not provided all the cpus are assigned to only one node.
>>>>>> The apic id is used here to get the correct pkg_id.    
>>>>>
>>>>> apicid was composed from socket/core/thread[/die] tuple which 
>>>>> cpus[i].props is.
>>>>>
>>>>> Question is if we can compose only pkg_id based on the same data without
>>>>> converting it to apicid and then "reverse engineering" it back
>>>>> original data?    
>>>>
>>>> Yes. It is possible.
>>>>  
>>>>>
>>>>> Or more direct question: is socket-id the same as pkg_id?    
>>>>
>>>> Yes. Socket_id and pkg_id is same.
>>>>  
>>>>>
>>>>>     
>>>>>>    
>>>>>>>   
>>>>>>> With that out of the way APIC ID will be used only during board's 
>>>>>>> init(),
>>>>>>> so board could update possible_cpus with valid APIC IDs at the start of
>>>>>>> x86_cpus_init().
>>>>>>>
>>>>>>> ====
>>>>>>> in nutshell it would be much easier to do following:
>>>>>>>
>>>>>>>  1. make x86_get_default_cpu_node_id() APIC ID in-depended or
>>>>>>>     if impossible as alternative recompute APIC IDs there if cpu
>>>>>>>     type is EPYC based (since number of nodes is already known)
>>>>>>>  2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based
>>>>>>>
>>>>>>> this way one doesn't need to touch generic numa code, introduce
>>>>>>> x86 specific init_apicid_fn() hook into generic code and keep
>>>>>>> x86/EPYC nuances contained within x86 code only.      
>>>>>>
>>>>>> I was kind of already working in the similar direction in v4.
>>>>>> 1. We already have split the numa initialization in patch #12(Split the
>>>>>> numa initialization). This way we know exactly how many numa nodes are
>>>>>> there before hand.    
>>>>>
>>>>> I suggest to drop that patch, It's the one that touches generic numa
>>>>> code and adding more legacy based extensions like cpu_indexes.
>>>>> Which I'd like to get rid of to begin with, so only -numa cpu is left.
>>>>>
>>>>> I think it's not necessary to touch numa code at all for apicid generation
>>>>> purpose, as I tried to explain above. We should be able to keep
>>>>> this x86 only business.    
>>>>
>>>> This is going to be difficult without touching the generic numa code.patch 
>>>> #12(Split the  
>>>>>> numa initialization)  
>>>
>>> Looking at current code I don't see why one would touch numa code.
>>> Care to explain in more details why you'd have to touch it?  
>>
>> Please see the reasoning above.
>>>   
>>>>>> 2. Planning to remove init_apicid_fn
>>>>>> 3. Insert the handlers inside X86CPUDefinition.    
>>>>> what handlers do you mean?    
>>>>
>>>> Apicid generation logic can be separated into 3 types of handlers.
>>>> x86_apicid_from_cpu_idx: Generate apicid from cpu index.
>>>> x86_topo_ids_from_apicid: Generate topo ids from apic id.
>>>> x86_apicid_from_topo_ids: Generate apicid from topo ids.
>>>>
>>>> We should be able to generate one id from other(you can see topology.h).
>>>>
>>>> X86CPUDefinition will have the handlers specific to each model like the
>>>> way we have features now. The above 3 handlers will be used as default
>>>> handler.  
>>>
>>> it probably shouldn't be a part of X86CPUDefinition,
>>> as it's machines responsibility to generate and set APIC ID.
>>>
>>> What you are doing with this topo functions in this version
>>> looks more that enough to me.  
>>
>> It is all the exact same topo functions. Only making these functions as
>> the handlers inside the X86CPUDefinition.
>>
>>>   
>>>> The EPYC model will have its corresponding handlers.
>>>>
>>>> x86_apicid_from_cpu_idx_epyc
>>>> x86_topo_ids_from_apicid_epyc
>>>> x86_apicid_from_topo_ids_epyc.  
>>>
>>> CPU might use call backs, but does it have to?
>>> I see cpu_x86_cpuid() uses these functions to decode apic_id back to topo
>>> info and then compose various leaves based on it.
>>> Within CPU code I'd just use
>>>  if (i_am_epyc)
>>>     x86_topo_ids_from_apicid_epyc()
>>>  else
>>>     x86_topo_ids_from_apicid()
>>> it's easier to read and one doesn't have to go figure
>>> indirection chain to figure out what code is called.  
>>
>> Eduardo already commented on this idea. Anything specific to cpu models
>> should be part of the X86CPUDefinition. We should not compare the specific
>> model here. Comparing the specific model does not scale. We are achieving
>> this by loading the model definition(similar to what we do in
>> x86_cpu_load_model).
> 
> ok
> 
>>
>>>      
>>>>>> 4. EPYC model will have its own apid id handlers. Everything else will be
>>>>>> initialized with a default handlers(current default handler).
>>>>>> 5. The function pc_possible_cpu_arch_ids will load the model definition
>>>>>> and initialize the PCMachineState data structure with the model specific
>>>>>> handlers.    
>>>>> I'm not sure what do you mean here.    
>>>>
>>>> PCMachineState will have the function pointers to the above handlers.
>>>> I was going to load the correct handler based on the mode type.  
>>>
>>> Could be done like this, but considering that within machine we need
>>> to calculate apic_id only once, the same 'if' trick would be simpler
>>>
>>> x86_cpus_init() {
>>>
>>>   if (cpu == epic) {
>>>      make_epyc_apic_ids(mc->possible_cpu_arch_ids(ms))
>>>   }  
>>
>> Once again, this does not scale. Please see my response above.
>>
>>>
>>>   // go on with creating cpus ...
>>> }
>>>   
>>>>>> Does that sound similar to what you are thinking. Thoughts?    
>>>>> If you have something to share and can push it on github,
>>>>> I can look at, whether it has design issues to spare you a round trip on 
>>>>> a list.
>>>>> (it won't be proper review but at least I can help to pinpoint most 
>>>>> problematic parts)
>>>>>     
>>>> My code for the current approach is kind of ready(yet to be tested). I can
>>>> send it as v3.1 if you want to look. Or we can wait for our discussion to
>>>> settle. I will post it after our discussion.  
>>> ok, lets wait till we finish this discussion  
>>
>> I can post my draft patch to give you more idea about what i am talking
>> about now. Let me know.
>>
>>>   
>>>> There is one more problem we need to address. I was going to address later
>>>> in v4 or v5.
>>>>
>>>> This works
>>>> -numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7
>>>>
>>>> This does not work
>>>> -numa node,nodeid=0,cpus=0-5 -numa node,nodeid=1,cpus=6-7  
>>> Is it supposed to work (i.e. can real hardware do such topology)?  
>>
>> Hardware does not support this configuration. That is why I did not think
>> it is serious enough to fix this problem right now.
>>
>>>   
>>>> This requires the generic code to pass the node information to the x86
>>>> code which requires some handler changes. I was thinking my code will
>>>> simplify the changes to address this issue.  
>>>
>>> without more information, it's hard to comment on issue and whether
>>> extra complexity of callbacks is justificated. 
>>>
>>> There could be 2 ways here, add fixes to this series so we could see the 
>>> reason
>>> or make this series simple to solve apic_id problem only and then on top of
>>> it send the second series that solves another issue.
>>>
>>> Considering that this series is already big/complicated enough,
>>> personally I'd go for 2nd option. As it's easier to describe what patches 
>>> are
>>> doing and easier to review => should result in faster reaching consensus 
>>> and merging.
>>> [...]
>>>   
>>
> 



reply via email to

[Prev in Thread] Current Thread [Next in Thread]